Documentation

Command line interface

happler

happler: A haplotype-based fine-mapping method

Test for associations between a trait and haplotypes (i.e., sets of correlated SNPs) rather than individual SNPs

happler [OPTIONS] COMMAND [ARGS]...

Options

--version: Show the version and exit.

run

Use the tool to find trait-associated haplotypes

GENOTYPES must be formatted as a VCF file

PHENOTYPES must be a tab-separated file containing two columns: sample ID and phenotype value

Ex: happler run tests/data/simple.vcf tests/data/simple.tsv > simple.hap

happler run [OPTIONS] GENOTYPES PHENOTYPES
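As a rough illustration of the PHENOTYPES layout described above, the following sketch renders two tab-separated columns (sample ID, then phenotype value) using only the Python standard library. The function name, sample IDs, and values are made up for illustration, and any header line the tool may expect is not covered here.

```python
import csv
import io


def phenotypes_to_tsv(phenotypes: dict[str, float]) -> str:
    """Render sample IDs and phenotype values as two tab-separated columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for sample_id, value in phenotypes.items():
        writer.writerow([sample_id, value])
    return buffer.getvalue()


# Hypothetical sample IDs and phenotype values
example = phenotypes_to_tsv({"sample1": 0.25, "sample2": -1.5})
print(example, end="")
```

The resulting text can be written to a file and passed as the PHENOTYPES argument.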

Options

--pheno <pheno>

Which phenotype from the .pheno file should we use?

Default:: '0'

--pheno-name

Treat the --pheno argument as a column name instead of an index

Default:: False

--region <region>

The region from which to extract genotypes; ex: ‘chr1:1234-34566’ or ‘chr7’

For this to work, the VCF must be indexed and the seqname must match!

Default:: 'all genotypes'
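The two region forms shown above (a sequence name with a start-end span, or a bare sequence name) can be split apart as in the sketch below. This is only an illustration of the string format, not happler's own parser; the names Region and parse_region are hypothetical.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    seqname: str
    start: Optional[int]
    end: Optional[int]


def parse_region(region: str) -> Region:
    """Split 'seqname:start-end' or a bare 'seqname' into its parts."""
    seqname, sep, span = region.partition(":")
    if not sep:
        # No coordinates given, so the whole sequence is requested
        return Region(seqname, None, None)
    start, end = span.split("-")
    return Region(seqname, int(start), int(end))


with_span = parse_region("chr1:1234-34566")
whole_chrom = parse_region("chr7")
```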

-s, --sample <samples>

A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)

Default:: 'all samples'

-S, --samples-file <samples_file>

A single column txt file containing a list of the samples (one per line) to subset from the genotypes file

Default:: 'all samples'

--discard-multiallelic

Whether to discard multi-allelic variants or just complain about them.

Default:: 'do not discard multi-allelic variants'

--discard-missing

Ignore any samples that are missing genotypes for the required variants

Default:: False

-c, --chunk-size <chunk_size>

Perform memory intensive operations in chunks of X variants. This reduces memory but at the cost of time.

Default:: 'all variants'

--maf <maf>

Only use variants with an MAF above this threshold

Default:: 'all variants'

--hap-maf <hap_maf>

Only build haplotypes with an MAF above this threshold

Default:: 'the MAF equivalent to an MAC of 20'
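To make the two MAF thresholds above concrete, here is a minimal sketch in plain Python. It assumes diploid samples, biallelic variants coded 0 (reference) and 1 (alternate), and that a minor allele count (MAC) is a count of chromosomes, so that MAF = MAC / (2 * number of samples). The function names are illustrative and are not part of happler.

```python
def minor_allele_frequency(genotypes: list[tuple[int, int]]) -> float:
    """Return the MAF of one variant from per-sample (allele, allele) pairs."""
    num_chromosomes = 2 * len(genotypes)
    alt_count = sum(first + second for first, second in genotypes)
    alt_freq = alt_count / num_chromosomes
    # The minor allele is whichever of the two alleles is rarer
    return min(alt_freq, 1 - alt_freq)


def mac_to_maf(mac: int, num_samples: int, ploidy: int = 2) -> float:
    """Convert a minor allele count into the equivalent frequency."""
    return mac / (ploidy * num_samples)


# Four hypothetical samples: 3 alternate alleles out of 8 chromosomes
variant = [(0, 0), (0, 1), (1, 1), (0, 0)]
maf = minor_allele_frequency(variant)

# Under these assumptions, an MAC of 20 in a cohort of 1000 samples
hap_maf_default = mac_to_maf(20, num_samples=1000)
```

Under these assumptions the --hap-maf default scales with cohort size: the same MAC of 20 corresponds to a smaller frequency as the number of samples grows.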

--max-signals <max_signals>

The maximum number of expected causal signals

Default:: 1

--max-iterations <max_iterations>

The max number of times to repeat the tree building

Default:: 1

--ld-prune-thresh <ld_prune_thresh>

The LD threshold used to prune leaf nodes based on LD with their siblings

Default:: 0.95

--out-thresh <out_thresh>

Threshold used to determine whether to output haplotypes (single-SNP)

Default:: 5e-08

--show-tree

Output a tree in addition to the regular output.

Default:: False

--remove-SNPs

Remove haplotypes with only a single variant

Default:: False

-o, --output <output>

A .hap file describing the extracted haplotypes

Default:: 'stdout'

-v, --verbosity <verbosity>

The level of verbosity desired

Default:: 'INFO'
Options:: CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET

Arguments

GENOTYPES: Required argument

PHENOTYPES: Required argument

transform

Transform a genotype matrix via a haplotype

GENOTYPES must be formatted as a VCF or PGEN file

HAPLOTYPES must be formatted as a .hap file

Ex: happler transform tests/data/simple.vcf tests/data/simple.hap > simple.vcf

happler transform [OPTIONS] GENOTYPES HAPLOTYPES

Options

--region <region>

The region from which to extract genotypes; ex: ‘chr1:1234-34566’ or ‘chr7’

For this to work, the VCF must be indexed and the seqname must match!

Default:: 'all genotypes'

-s, --sample <samples>

A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)

Default:: 'all samples'

-S, --samples-file <samples_file>

A single column txt file containing a list of the samples (one per line) to subset from the genotypes file

Default:: 'all samples'

--var-id <variants>

A list of the variants to subset from the genotypes file (ex: ‘--var-id variant1 --var-id variant2’)

Default:: 'all variants'

--discard-multiallelic

Whether to discard multi-allelic variants or just complain about them.

Default:: 'do not discard multi-allelic variants'

--discard-missing

Ignore any samples that are missing genotypes for the required variants

Default:: False

-c, --chunk-size <chunk_size>

If using a PGEN file, read genotypes in chunks of X variants; reduces memory

Default:: 'all variants'

--maf <maf>

Ignore variants with a MAF below this threshold

Default:: 'no filtering'

-i, --hap-id <hap_id>

Which haplotype to use from the .hap file

Default:: 'the first haplotype'

-a, --allele <allele>

The allele of the next causal SNP

Default:: 0

-o, --output <output>

A transformed genotypes file

Default:: 'stdout'

-v, --verbosity <verbosity>

The level of verbosity desired

Default:: 'INFO'
Options:: CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET

Arguments

GENOTYPES: Required argument

HAPLOTYPES: Required argument

Module contents

happler.tree.tree module

class happler.tree.tree.Tree(log=None)

Bases: object

A tree where

nodes are variants (and also haplotypes)
each node can have at most two branches for each of its alleles

Attributes:

graph: nx.DiGraph: The underlying directed acyclic graph representing the tree
variant_locs: dict: The indices of each variant within the tree’s list of nodes

add_node(node, parent_idx, allele, results=None)

Add node to the tree

Return type:

int

Parameters:

node: Variant: The node to add to the tree
parent_idx: int: The index of the node under which to place the new node
allele: int: The allele for the edge from a parent node to this node
results: np.void, optional: The results (beta, pval) of the association test for this haplotype once this variant is included

Returns:

int: The index of the new node within the tree

dot()

Convert the tree to its representation in the dot language. This is useful for quickly viewing the tree on the command line.

Return type:

str

Returns:

str

A string representing the tree

Nodes are labeled by their variant ID and edges are labeled by their allele
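For readers unfamiliar with the dot language, the sketch below builds such a string from a toy tree, labeling nodes by variant ID and edges by allele as described above. The input structures and the function name are assumptions for illustration; the exact text produced by Tree.dot() may differ.

```python
def tree_to_dot(nodes: dict[int, str], edges: list[tuple[int, int, int]]) -> str:
    """Build a DOT digraph from node labels and (parent, child, allele) edges."""
    lines = ["digraph {"]
    for idx, label in nodes.items():
        lines.append(f'    {idx} [label="{label}"];')
    for parent, child, allele in edges:
        lines.append(f'    {parent} -> {child} [label="{allele}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical tree: a root with one child variant reached via allele 1
dot_text = tree_to_dot({0: "root", 1: "rs1"}, [(0, 1, 1)])
print(dot_text)
```

The output can be piped to a Graphviz renderer to view the tree.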

haplotypes(root=0)

Return the haplotypes at the leaves of the tree rooted at the index “root”

Return type:

list[deque[dict]]

Returns:

list[deque[dict]]

A list of haplotypes, where each haplotype consists of a list of dictionaries.

Each dictionary contains the node and all of its attributes.
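The idea of reading haplotypes off the leaves can be sketched with a toy tree stored as a plain dictionary. Each root-to-leaf path becomes one haplotype, represented as a deque of dictionaries. The data layout, the dictionary keys, and the choice to leave the root out of each path are assumptions made for this illustration, not a description of the library's internals.

```python
from collections import deque


def leaf_haplotypes(children: dict[int, list[tuple[int, int]]],
                    root: int = 0, path: tuple = ()) -> list[deque]:
    """Collect one deque of node dictionaries per root-to-leaf path.

    children maps a node index to a list of (child index, allele) pairs.
    """
    kids = children.get(root, [])
    if not kids:
        # A leaf: emit the path that led here (an isolated root yields nothing)
        return [deque(path)] if path else []
    result = []
    for child, allele in kids:
        step = {"node": child, "allele": allele}
        result.extend(leaf_haplotypes(children, child, (*path, step)))
    return result


# Hypothetical tree: root 0 has children 1 and 2; node 2 has child 3
toy_tree = {0: [(1, 0), (2, 1)], 2: [(3, 1)]}
haps = leaf_haplotypes(toy_tree)
```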

leaves(from_root=False)

Return all leaves of this tree

Parameters:

from_root: bool, optional: Whether to only return leaves attached to the root of the tree

Returns:

tuple[int, Variant, int]: The variant at each leaf node of the tree. Returns the index in the tree, the Variant object, and the allele of the variant.

property num_nodes

property num_variants

remove_leaf_node(node_idx)

Remove a leaf node from the tree

Parameters:

node_idx: int: The index of the node to remove

siblings(node_idx)

Locate sibling(s) of this node in the tree

Return type:

list[Variant]

Parameters:

node_idx: int: The index of a node in the tree

Returns:

list[Variant]: The variants at the sibling nodes

happler.tree.tree_builder module

class happler.tree.tree_builder.TreeBuilder(genotypes, phenotypes, maf=None, method=<happler.tree.assoc_test.AssocTestSimpleFastBIC object>, terminator=<happler.tree.terminator.BICTerminator object>, indep_thresh=15, ld_prune_thresh=None, covariance_correction=True, log=None)

Bases: object

Creates a Tree object from provided Genotypes and Phenotypes

Attributes:

tree: Tree: A tree representing haplotypes composed from trait-associated genotypes
gens: Genotypes: The genotypes from which the tree should be built
phens: Phenotypes: The phenotypes from which the tree should be built
method: AssocTest, optional: The type of association test to perform at each node when constructing the tree
terminator: Terminator, optional: The type of test to use for deciding whether to terminate a branch
ld_prune_thresh: float, optional: Any leaf nodes with a greater LD with their sibling than this value will be pruned
log: Logger: A logging instance for recording debug statements.

Examples

>>> gens = Genotypes.load('tests/data/simple.vcf')
>>> phens = Phenotypes.load('tests/data/simple.tsv')
>>> tree = TreeBuilder(gens, phens).run()

prune_tree(from_root=True)

Remove any leaf nodes that are in strong LD with their sibling branches

Parameters:

from_root: bool, optional: Whether to only prune leaves attached to the root of the tree

run(): Run the tree builder and create a tree rooted at the provided variant

happler.tree.variant module

class happler.tree.variant.Variant(idx, id, pos)

Bases: object

A variant within the genotypes matrix

Attributes:

id: str: The variant’s unique ID
idx: int: The index of the variant within the genotype data
pos: int: The chromosomal start position of the variant

property ID

property POS

classmethod from_np(np_mixed_arr_var, idx)

Convert a numpy mixed array variant record into a Variant

Return type:

Variant

Parameters:

np_mixed_arr_var: np.void: A numpy mixed array variant record with entries ‘id’ and ‘pos’
idx: int: See idx

Returns:

Variant: The converted Variant

id: int

idx: int

pos: int

class happler.tree.variant.VariantType(variant_type='snp')

Bases: object

A class denoting the type of variant

Attributes:

type: str, optional

The type of variant (ex: SNP, STR, etc)

Defaults to a single nucleotide polymorphism

happler.tree.haplotypes module

class happler.tree.haplotypes.Haplotype(nodes=(), data=None, num_samples=None)

Bases: object

A haplotype within the tree

Attributes:

nodes: tuple[tuple[Variant, int]]: An ordered collection of pairs, where each pair is a node and its allele
data: npt.NDArray[bool]: A np array (with shape n x 2, num_samples x num_chromosomes) denoting the presence of this haplotype in each chromosome of each sample

append(node, allele, variant_genotypes)

Append a new node (variant) to this haplotype

Return type:

Haplotype

Parameters:

node: Variant: The node to add to this haplotype
allele: int: The allele associated with this node
variant_genotypes: npt.NDArray[bool]: A np array (with shape n x 2, num_samples x num_chromosomes) denoting the presence of the new allele in each chromosome of each sample

Returns:

Haplotype: A new haplotype object extended by the node and its allele
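One way to picture what appending a node does to the presence array: a chromosome carries the extended haplotype only if it carried the original haplotype and also carries the new allele. The sketch below expresses that as an element-wise logical AND over plain Python lists. This is an interpretation of the parameter descriptions above rather than the library's code, and the function name is hypothetical.

```python
Presence = list[tuple[bool, bool]]  # one (chromosome 0, chromosome 1) pair per sample


def extend_presence(hap_presence: Presence, allele_presence: Presence) -> Presence:
    """Combine haplotype presence with presence of a new allele, per chromosome."""
    return [
        (hap0 and new0, hap1 and new1)
        for (hap0, hap1), (new0, new1) in zip(hap_presence, allele_presence)
    ]


# Three hypothetical samples
current = [(True, False), (True, True), (False, False)]
new_allele = [(True, True), (False, True), (True, False)]
extended = extend_presence(current, new_allele)
```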

data: ndarray[Any, dtype[bool]]

classmethod from_haptools_haplotype(haplotype, variant_genotypes)

Create a new haplotype from a haptools Haplotype and a GenotypesVCF object

Return type:: Haplotype

classmethod from_node(node, allele, variant_genotypes)

Create a new haplotype with a single node entry

Return type:

Haplotype

Parameters:

node: Variant: The initializing node for this haplotype
allele: int: The allele associated with node
variant_genotypes: npt.NDArray[bool]: A np array (with shape n x 2, num_samples x num_chromosomes) denoting the presence of this haplotype in each chromosome of each sample

Returns:

Haplotype: The newly created haplotype object containing node and allele

property node_indices: tuple[int]

Get the indices of the nodes in this haplotype

Returns:

tuple: The indices of the nodes

nodes: tuple[tuple[Variant, int]]

transform(genotypes, allele, idxs=None, remove_self=True)

Transform a genotypes matrix via the current haplotype:

Each entry in the returned matrix denotes the presence of the current haplotype extended by each of the variants in the genotype matrix

Return type:

ndarray[Any, dtype[bool]]

Parameters:

genotypes: Genotypes: The genotypes which to transform using the current haplotype
allele: int: The allele (either 0 or 1) of the SNPs we’re adding
idxs: tuple[int], optional: If specified, we will only output haplotypes for the variants at these indices. Otherwise, we’ll output all of them.
remove_self: bool, optional: Whether to first remove any variants that are already in this haplotype using np.delete. This is expensive because it creates a new copy.

Returns:

npt.NDArray[bool]: A 3D haplotype matrix similar to the genotype matrix but with haplotypes instead of variants in the columns. It will have the same shape except that the number of columns (second dimension) will have decreased by the number of variants in this haplotype if remove_self is True

transform_and_sum(genotypes, allele, idxs=None, remove_self=True)

Transform a genotypes matrix and sum along the ploidy axis using JAX JIT.

Combines transform() and .sum(axis=2) into a single JAX JIT-compiled operation, avoiding materialization of the intermediate 3D boolean array.

Return type:

ndarray[Any, dtype[uint8]]

Parameters:

genotypes: Genotypes: The genotypes which to transform using the current haplotype
allele: int: The allele (either 0 or 1) of the SNPs we’re adding
idxs: tuple[int], optional: If specified, we will only output haplotypes for the variants at these indices. Otherwise, we’ll output all of them.
remove_self: bool, optional: Whether to first remove any variants that are already in this haplotype using np.delete. This is expensive because it creates a new copy.

Returns:

npt.NDArray[np.uint8]: A 2D matrix with shape (num_samples, num_variants) where each entry is the count of chromosomes carrying the haplotype (0, 1, or 2)
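The shape of this result can be illustrated without JAX or numpy. In the sketch below, each candidate variant's allele presence is combined with the haplotype's presence per chromosome, and the two chromosomes are then summed, giving 0, 1, or 2 per sample and variant. It is a plain-Python interpretation of the description above, with hypothetical names and input layout, and it makes no attempt at the fused, JIT-compiled computation the method performs.

```python
def haplotype_dosages(
    hap_presence: list[tuple[bool, bool]],
    allele_presence: list[list[tuple[bool, bool]]],
) -> list[list[int]]:
    """Count chromosomes carrying the haplotype extended by each variant.

    hap_presence[sample] holds one boolean per chromosome.
    allele_presence[sample][variant] holds one boolean per chromosome.
    Returns counts[sample][variant], each 0, 1, or 2.
    """
    counts = []
    for (hap0, hap1), variants in zip(hap_presence, allele_presence):
        counts.append(
            [int(hap0 and new0) + int(hap1 and new1) for new0, new1 in variants]
        )
    return counts


# Two hypothetical samples and two candidate variants
hap = [(True, True), (True, False)]
alleles = [
    [(True, True), (False, True)],
    [(True, True), (False, True)],
]
dosages = haplotype_dosages(hap, alleles)
```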

class happler.tree.haplotypes.Haplotypes(fname, haplotype=<class 'haptools.data.haplotypes.Haplotype'>, variant=<class 'haptools.data.haplotypes.Variant'>, repeat=<class 'haptools.data.haplotypes.Repeat'>, log=None)

Bases: Haplotypes

A class for processing haplotypes from a file

Attributes:

fname: Path | str: The path to the file containing the data
data: dict[str, Haplotype]: A dict of Haplotype objects keyed by their IDs
types: dict: A dict of class names keyed by the symbol denoting their line type. Ex: {‘H’: Haplotype, ‘V’: Variant}
version: str: A string denoting the current file format version
log: Logger: A logging instance for recording debug statements.

classmethod from_tree(fname, tree, gts, pts=None, log=None)

Create a Haplotypes object from a Tree object and a Genotypes object

Return type:

Haplotypes

Parameters:

fname: Path | str: The fname parameter for the Haplotypes object
tree: Tree: The Tree object containing the haplotypes to encode within a Haplotypes obj
gts: GenotypesVCF: The genotypes from which the tree was constructed
pts: Phenotypes: The phenotypes with which these haplotypes were built. If not provided, the results won’t be recomputed
log: Logger, optional: The log parameter for the Haplotypes object

Returns:

Haplotypes: The completed Haplotypes object

class happler.tree.haplotypes.HapplerHaplotype(chrom, start, end, id, beta, pval)

Bases: Haplotype

A haplotype with sufficient fields for happler. Properties and functions are shared with the base Haplotype object, “HaplotypeBase”

beta: float

pval: float

class happler.tree.haplotypes.HapplerVariant(start, end, id, allele, score)

Bases: Variant

A variant allele with sufficient fields for happler. Properties and functions are shared with the base Variant object, “VariantBase”

score: float

happler.tree.assoc_test module

class happler.tree.assoc_test.AssocResults(data)

Bases: object

The results of an association test

Attributes:

data: npt.NDArray[np.float64]

A numpy mixed array with fields: beta, pval, stderr

It has shape num_variants x num_fields

data: ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]

class happler.tree.assoc_test.AssocTest(with_bic=False)

Bases: ABC

Abstract class for performing phenotype-haplotype association tests

Attributes:

with_bic: bool: Whether to also output the BIC (Bayesian Information Criterion)

static pval_as_decimal(t_stat, df, precision=1000)

Given a t statistic, return the associated p-value as a high precision Decimal

This can be helpful when a p-value is too small to be represented by a Python float64

Return type:

Decimal

Parameters:

t_stat: float: The t statistic from a simple linear model
df: int: The degrees of freedom of the associated t distribution
precision: int: The precision of the returned value

Returns:

Decimal: An approximate, higher precision p-value for the provided t statistic

Raises:

ValueError: This function is only valid when the t distribution is approximately equivalent to the normal distribution. Thus, if df < 1000, we raise a ValueError to indicate that the function will not return an accurate value
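To show why a Decimal is useful here, the sketch below computes a two-sided p-value from the standard normal tail, which is the approximation the note above relies on for large df. It uses a truncated asymptotic expansion of the normal tail evaluated with the standard decimal module. This is an illustrative stand-in and not the library's implementation; the function name, the default precision, the number of series terms, and the cutoff on |t| are all assumptions.

```python
import math
from decimal import Decimal, localcontext


def normal_pval_as_decimal(t_stat: float, precision: int = 50, terms: int = 6) -> Decimal:
    """Approximate the two-sided normal p-value for a large |t_stat|."""
    if abs(t_stat) < 5:
        raise ValueError("the asymptotic series is only accurate for large |t_stat|")
    with localcontext() as ctx:
        ctx.prec = precision
        x = Decimal(abs(t_stat))
        # Standard normal density at x
        density = (-(x * x) / 2).exp() / (2 * Decimal(math.pi)).sqrt()
        # Tail series: 1 - 1/x^2 + 3/x^4 - 15/x^6 + ...
        series = Decimal(1)
        term = Decimal(1)
        for k in range(1, terms):
            term *= -(2 * k - 1) / (x * x)
            series += term
        # Two-sided p-value is twice the upper tail probability
        return 2 * density / x * series


moderate = normal_pval_as_decimal(10.0)
tiny = normal_pval_as_decimal(40.0)
print(moderate, tiny)
```

For a statistic of 40, the float64 computation math.erfc(40 / math.sqrt(2)) underflows to 0.0, while the Decimal result remains a positive number.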

abstract run(X, y)

Run a series of phenotype-haplotype association tests for each haplotype (column) in X and return their p-values

Return type:

AssocResults

Parameters:

X: npt.NDArray[np.float64]: The genotypes, with shape n x p. There are only two dimensions. Each column is a haplotype and each row is a sample.
y: npt.NDArray[np.float64]: The phenotypes, with shape n x 1

Returns:

npt.NDArray[np.float64]: The p-values from testing each haplotype, with shape p x 1

standardize(X)

Standardize the genotypes so they have mean 0 and variance 1

Return type:

ndarray[Any, dtype[float64]]

Parameters:

X: npt.NDArray[np.float64]: The genotypes, with shape n x p. There are only two dimensions. Each column is a haplotype and each row is a sample.

Returns:

npt.NDArray[np.float64]: An array with the same shape as X but standardized properly
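Column-wise standardization can be sketched with the standard library alone. The version below centers each haplotype column and divides by its population standard deviation; whether the library uses the population or sample form is not stated here, so that choice is an assumption, as is the function name. A constant column would divide by zero and is not handled.

```python
from statistics import fmean, pstdev


def standardize_columns(X: list[list[float]]) -> list[list[float]]:
    """Rescale each column of X (rows are samples) to mean 0 and variance 1."""
    columns = list(zip(*X))
    means = [fmean(col) for col in columns]
    stdevs = [pstdev(col) for col in columns]
    return [
        [(value - means[j]) / stdevs[j] for j, value in enumerate(row)]
        for row in X
    ]


# Four hypothetical samples by two haplotypes
X = [[0.0, 2.0], [1.0, 0.0], [2.0, 1.0], [1.0, 1.0]]
Z = standardize_columns(X)
```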

class happler.tree.assoc_test.AssocTestSimple(with_bic=False)

Bases: AssocTest

bic(n, residuals)

Return the BIC (Bayesian Information Criterion) for an OLS test

This function follows the implementation in https://stackoverflow.com/a/58984868

Return type:

float

Parameters:

n: int: The number of observations
residuals: npt.NDArray[np.float64]: The residuals

Returns:

float: The Bayesian Information Criterion
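For reference, one common way to obtain a BIC from OLS residuals is through the Gaussian log-likelihood evaluated at the maximum-likelihood variance estimate, BIC = k ln(n) - 2 ln(L). The sketch below follows that textbook form. Whether it matches this method term for term is not confirmed here; the function name and the default of two parameters (intercept and slope) are assumptions.

```python
import math


def ols_bic(n: int, residuals: list[float], num_params: int = 2) -> float:
    """Textbook BIC for an OLS fit, computed from its residuals."""
    ssr = sum(r * r for r in residuals)
    # Gaussian log-likelihood with the variance set to its MLE, ssr / n
    log_likelihood = -n / 2 * (math.log(2 * math.pi) + math.log(ssr / n) + 1)
    return num_params * math.log(n) - 2 * log_likelihood


# Hypothetical residuals from a fit to four observations
bic_value = ols_bic(4, [1.0, -1.0, 1.0, -1.0])
```

Lower values indicate a better trade-off between fit and model size, so models are compared by their BIC differences rather than by the absolute value.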

perform_test(x, y)

Perform the test for a single haplotype.

Return type:

tuple

Parameters:

x: npt.NDArray[np.float64]: The genotypes with shape n x 1 (for a single haplotype)
y: npt.NDArray[np.float64]: The phenotypes, with shape n x 1

Returns:

tuple: The slope, p-value, and stderr obtained from the test. The BIC is appended to the end if self.with_bic is True.
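The slope and standard error of a single-haplotype regression can be worked out by hand, as in the sketch below. It fits y = intercept + slope * x by ordinary least squares using only the standard library. The p-value is left out because the standard library has no t distribution; the function name and the returned tuple layout are illustrative and differ from the method above.

```python
import math
from statistics import fmean


def simple_ols(x: list[float], y: list[float]) -> tuple[float, float, float]:
    """Return (slope, intercept, stderr of slope) for y ~ 1 + x."""
    n = len(x)
    x_mean, y_mean = fmean(x), fmean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters)
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    stderr = math.sqrt(ssr / (n - 2) / sxx)
    return slope, intercept, stderr


# Hypothetical haplotype dosages (0, 1, or 2) and phenotype values
dosages = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]
phenotypes = [1.1, 2.9, 5.0, 0.9, 3.1, 5.0]
slope, intercept, stderr = simple_ols(dosages, phenotypes)
```

Dividing the slope by its standard error gives the t statistic from which a p-value would be derived.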

run(X, y)

Implement AssocTest for a simple linear regression.

Return type:

AssocResults

Parameters:

X: npt.NDArray[np.float64]: The genotypes, with shape n x p. There are only two dimensions. Each row is a sample and each column is a haplotype.
y: npt.NDArray[np.float64]: The phenotypes, with shape n x 1

Returns:

npt.NDArray[np.float64]: The results from testing each haplotype, with shape p x 3

class happler.tree.assoc_test.AssocTestSimpleCovariates(covars, with_bic=False): Bases: AssocTestSimpleSM

class happler.tree.assoc_test.AssocTestSimpleFastBIC(chunk_size=None)

Bases: AssocTestSimpleSM

Calculate only BIC in a quick, vectorized fashion without statsmodels

perform_test(X, yc)

Perform the test for a chunk of haplotypes using JAX JIT compilation

This method uses the JIT-compiled _compute_bic_jit helper function for improved computational efficiency with GPU/TPU acceleration.

Return type:

ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]

Parameters:

X: npt.NDArray[np.float64]: The genotypes with shape (n, p)
yc: npt.NDArray[np.float64]: The phenotypes, with shape (n, 1). They are assumed to be centered already.

Returns:

npt.NDArray[np.float64]: The BIC values from testing this chunk of haplotypes, with shape (p,)

run(X, y)

Implement AssocTest for a simple, univariate OLS model: y ~ 1 + X[:, j]

Does not use statsmodels at all but replicates its behavior

Return type:

AssocResults

Parameters:

X: npt.NDArray[np.float64]: The genotypes, with shape n x p. There are only two dimensions. Each row is a sample and each column is a haplotype.
y: npt.NDArray[np.float64]: The phenotypes, with shape n x 1

Returns:

npt.NDArray[np.float64]: The results from testing each haplotype, with shape p x 1

class happler.tree.assoc_test.AssocTestSimpleSM(with_bic=False)

Bases: AssocTestSimple

perform_test(x, y)

Perform the test for a single haplotype.

Return type:

tuple

Parameters:

x: npt.NDArray[np.float64]: The genotypes with shape n x 1 (for a single haplotype)
y: npt.NDArray[np.float64]: The phenotypes, with shape n x 1

Returns:

tuple: The slope, p-value, and stderr obtained from the test. The BIC is appended to the end if self.with_bic is True.

class happler.tree.assoc_test.AssocTestSimpleSMTScore(with_bic=False)

Bases: AssocTestSimpleSM

perform_test(x, y, parent_res=None, parent_corr=0)

Perform the test for a single haplotype.

Return type:

tuple

Parameters:

x: npt.NDArray[np.float64]: The genotypes with shape n x 1 (for a single haplotype)
y: npt.NDArray[np.float64]: The phenotypes, with shape n x 1

Returns:

tuple: The slope, p-value, and stderr obtained from the test. The BIC is appended to the end if self.with_bic is True.

run(X, y, parent_res=None, parent_corr=None)

Implement AssocTest for a simple linear regression.

Return type:

AssocResults

Parameters:

X: npt.NDArray[np.float64]: The genotypes, with shape n x p. There are only two dimensions. Each row is a sample and each column is a haplotype.
y: npt.NDArray[np.float64]: The phenotypes, with shape n x 1

Returns:

npt.NDArray[np.float64]: The results from testing each haplotype, with shape p x 3

class happler.tree.assoc_test.NodeResults(beta, pval, stderr)

Bases: object

The results of testing SNPs at a node in the tree

Attributes:

beta: float: The best effect size among all of the SNPs tried
pval: float: The best p-value among all of the SNPs tried
stderr: float: The standard error of beta

beta: float

classmethod from_np(np_mixed_arr_var)

Return type:: NodeResults

pval: float

stderr: float

class happler.tree.assoc_test.NodeResultsBIC(bic)

Bases: object

The results of testing SNPs at a node in the tree

Attributes:

bic: float: The best BIC among all of the SNPs

bic: float

classmethod from_np(np_mixed_arr_var)

Return type:: NodeResults

class happler.tree.assoc_test.NodeResultsExtra(beta, pval, stderr, bic)

Bases: NodeResults

bic: float

class happler.tree.assoc_test.NodeResultsExtraTScore(beta, pval, stderr, bic, tscore)

Bases: NodeResultsExtra

tscore: float

class happler.tree.assoc_test.NodeResultsTScore(beta, pval, stderr, tscore)

Bases: NodeResults

tscore: float