Documentation
Command line interface
happler
happler: A haplotype-based fine-mapping method
Test for associations between a trait and haplotypes (i.e., sets of correlated SNPs) rather than individual SNPs
happler [OPTIONS] COMMAND [ARGS]...
Options
- --version
Show the version and exit.
run
Use the tool to find trait-associated haplotypes
GENOTYPES must be formatted as a VCF, and
PHENOTYPES must be a tab-separated file containing two columns: sample ID and phenotype value
Ex: happler run tests/data/simple.vcf tests/data/simple.tsv > simple.hap
happler run [OPTIONS] GENOTYPES PHENOTYPES
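As a rough illustration of the PHENOTYPES layout described above, the following sketch writes a two-column, tab-separated file. The sample IDs, trait values, file name, and the helper write_phenotypes are all hypothetical. The --pheno-name option suggests that column names may be present in real files, but the exact header format is not stated here, so this sketch writes no header line.

```python
import csv
import os
import tempfile


def write_phenotypes(path, phenotypes):
    """Write (sample ID, phenotype value) pairs as tab-separated rows."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for sample_id, value in phenotypes.items():
            writer.writerow([sample_id, value])


# Hypothetical samples and trait values, for illustration only
example = {"sample1": 0.42, "sample2": -1.3, "sample3": 2.0}
pheno_path = os.path.join(tempfile.mkdtemp(), "example.tsv")
write_phenotypes(pheno_path, example)

# Read the file back so the layout can be inspected
with open(pheno_path) as handle:
    pheno_lines = handle.read().splitlines()
```

Each line of the resulting file holds one sample ID and one phenotype value separated by a single tab.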
Options
- --pheno <pheno>
Which phenotype from the .pheno file should we use?
- Default:
'0'
- --pheno-name
Treat the --pheno argument as a column name instead of an index
- Default:
False
- --region <region>
The region from which to extract genotypes; ex: ‘chr1:1234-34566’ or ‘chr7’
For this to work, the VCF must be indexed and the seqname must match!
- Default:
'all genotypes'
- -s, --sample <samples>
A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)
- Default:
'all samples'
- -S, --samples-file <samples_file>
A single column txt file containing a list of the samples (one per line) to subset from the genotypes file
- Default:
'all samples'
- --discard-multiallelic
Whether to discard multi-allelic variants or just complain about them.
- Default:
'do not discard multi-allelic variants'
- --discard-missing
Ignore any samples that are missing genotypes for the required variants
- Default:
False
- -c, --chunk-size <chunk_size>
Perform memory-intensive operations in chunks of X variants. This reduces memory usage at the cost of running time.
- Default:
'all variants'
- --maf <maf>
Only use variants with an MAF above this threshold
- Default:
'all variants'
- --hap-maf <hap_maf>
Only build haplotypes with an MAF above this threshold
- Default:
'the MAF equivalent to an MAC of 20'
- --max-signals <max_signals>
The maximum number of expected causal signals
- Default:
1
- --max-iterations <max_iterations>
The max number of times to repeat the tree building
- Default:
1
- --ld-prune-thresh <ld_prune_thresh>
The LD threshold used to prune leaf nodes based on LD with their siblings
- Default:
0.95
- --out-thresh <out_thresh>
Threshold used to determine whether to output haplotypes (single-SNP)
- Default:
5e-08
- --show-tree
Output a tree in addition to the regular output.
- Default:
False
- --remove-SNPs
Remove haplotypes with only a single variant
- Default:
False
- -o, --output <output>
A .hap file describing the extracted haplotypes
- Default:
'stdout'
- -v, --verbosity <verbosity>
The level of verbosity desired
- Default:
'INFO'
- Options:
CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
Arguments
- GENOTYPES
Required argument
- PHENOTYPES
Required argument
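The --hap-maf default above is described as "the MAF equivalent to an MAC of 20". One plausible reading, assuming diploid samples (two chromosomes per sample), is MAF = MAC / (2 x num_samples). The helper mac_to_maf below is hypothetical and only illustrates that arithmetic; it is not taken from happler's source.

```python
def mac_to_maf(mac: int, num_samples: int, ploidy: int = 2) -> float:
    """Convert a minor allele count into a minor allele frequency.

    Assumes every sample contributes `ploidy` chromosomes (an assumption
    of this sketch, not a documented happler behavior).
    """
    if num_samples <= 0 or ploidy <= 0:
        raise ValueError("num_samples and ploidy must be positive")
    return mac / (ploidy * num_samples)


# With 1000 diploid samples there are 2000 chromosomes,
# so an MAC of 20 corresponds to an MAF of 20 / 2000
threshold = mac_to_maf(20, num_samples=1000)
```

Under this reading, the default threshold shrinks as the sample size grows.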
transform
Transform a genotype matrix via a haplotype
GENOTYPES must be formatted as a VCF or PGEN file
HAPLOTYPES must be formatted as a .hap file
Ex: happler transform tests/data/simple.vcf tests/data/simple.hap > simple.vcf
happler transform [OPTIONS] GENOTYPES HAPLOTYPES
Options
- --region <region>
The region from which to extract genotypes; ex: ‘chr1:1234-34566’ or ‘chr7’
For this to work, the VCF must be indexed and the seqname must match!
- Default:
'all genotypes'
- -s, --sample <samples>
A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)
- Default:
'all samples'
- -S, --samples-file <samples_file>
A single column txt file containing a list of the samples (one per line) to subset from the genotypes file
- Default:
'all samples'
- --var-id <variants>
A list of the variants to subset from the genotypes file (ex: ‘--var-id variant1 --var-id variant2’)
- Default:
'all variants'
- --discard-multiallelic
Whether to discard multi-allelic variants or just complain about them.
- Default:
'do not discard multi-allelic variants'
- --discard-missing
Ignore any samples that are missing genotypes for the required variants
- Default:
False
- -c, --chunk-size <chunk_size>
If using a PGEN file, read genotypes in chunks of X variants; reduces memory
- Default:
'all variants'
- --maf <maf>
Ignore variants with a MAF below this threshold
- Default:
'no filtering'
- -i, --hap-id <hap_id>
Which haplotype to use from the .hap file
- Default:
'the first haplotype'
- -a, --allele <allele>
The allele of the next causal SNP
- Default:
0
- -o, --output <output>
A transformed genotypes file
- Default:
'stdout'
- -v, --verbosity <verbosity>
The level of verbosity desired
- Default:
'INFO'
- Options:
CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET
Arguments
- GENOTYPES
Required argument
- HAPLOTYPES
Required argument
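Both commands accept --region values such as ‘chr1:1234-34566’ or ‘chr7’. The following sketch shows one way such strings could be split into a sequence name and optional coordinates. parse_region is a hypothetical helper, not part of happler, and the real parser may behave differently (for example, with sequence names that contain colons).

```python
from typing import Optional, Tuple


def parse_region(region: str) -> Tuple[str, Optional[int], Optional[int]]:
    """Split 'seqname:start-end' or 'seqname' into its components."""
    seqname, colon, coords = region.partition(":")
    if not seqname:
        raise ValueError(f"missing sequence name in region {region!r}")
    if not colon:
        # A bare sequence name selects the whole chromosome/contig
        return seqname, None, None
    start_text, dash, end_text = coords.partition("-")
    if not dash:
        raise ValueError(f"expected 'start-end' after ':' in {region!r}")
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"start exceeds end in region {region!r}")
    return seqname, start, end


with_coords = parse_region("chr1:1234-34566")
whole_chrom = parse_region("chr7")
```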
Module contents
happler.tree.tree module
- class happler.tree.tree.Tree(log=None)
Bases:
object
A tree where:
- nodes are variants (and also haplotypes)
- each node can have at most two branches for each of its alleles
- Attributes:
- graph: nx.DiGraph
The underlying directed acyclic graph representing the tree
- variant_locs: dict
The indices of each variant within the tree’s list of nodes
- add_node(node, parent_idx, allele, results=None)
Add node to the tree
- Return type:
int
- Parameters:
- node: Variant
The node to add to the tree
- parent_idx: int
The index of the node under which to place the new node
- allele: int
The allele for the edge from a parent node to this node
- results: np.void, optional
The results (beta, pval) of the association test for this haplotype once this variant is included
- Returns:
- int
The index of the new node within the tree
- dot()
Convert the tree to its representation in the dot language. This is useful for quickly viewing the tree on the command line.
- Return type:
str
- Returns:
- str
A string representing the tree
Nodes are labeled by their variant ID and edges are labeled by their allele
- haplotypes(root=0)
Return the haplotypes at the leaves of the tree rooted at the index “root”
- Return type:
list[deque[dict]]
- Returns:
- list[deque[dict]]
A list of haplotypes, where each haplotype consists of a list of dictionaries.
Each dictionary contains the node and all of its attributes.
- leaves(from_root=False)
Return all leaves of this tree
- Parameters:
- from_root: bool, optional
Whether to only return leaves attached to the root of the tree
- Returns:
- tuple[int, Variant, int]
The variant at each leaf node of the tree. Returns the index in the tree, the Variant object, and the allele of the variant.
- property num_nodes
- property num_variants
- remove_leaf_node(node_idx)
Remove a leaf node from the tree
- Parameters:
- node_idx: int
The index of the node to remove
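To make the structure described above concrete, here is a dependency-free sketch of a tree whose nodes are variants and whose edges are labeled by alleles. ToyTree is a hypothetical stand-in: it uses plain lists and dicts rather than the nx.DiGraph that Tree holds, variant IDs are plain strings, and its dot output is only loosely modeled on the dot language. None of it is happler's implementation.

```python
class ToyTree:
    """A minimal tree with allele-labeled edges (illustrative only)."""

    def __init__(self):
        self.nodes = ["root"]    # index 0 is a placeholder root
        self.children = {0: []}  # parent index -> [(child index, allele)]

    def add_node(self, variant_id, parent_idx, allele):
        """Attach a variant under parent_idx and return its new index."""
        if parent_idx not in self.children:
            raise KeyError(f"no node with index {parent_idx}")
        new_idx = len(self.nodes)
        self.nodes.append(variant_id)
        self.children[new_idx] = []
        self.children[parent_idx].append((new_idx, allele))
        return new_idx

    def leaves(self):
        """Return the indices of all childless nodes, excluding the root."""
        return [i for i, kids in self.children.items() if not kids and i != 0]

    def haplotypes(self, root=0):
        """Return each root-to-leaf path as a list of (variant ID, allele)."""
        paths, stack = [], [(root, [])]
        while stack:
            idx, path = stack.pop()
            kids = self.children[idx]
            if not kids and path:
                paths.append(path)
            for child_idx, allele in kids:
                step = (self.nodes[child_idx], allele)
                stack.append((child_idx, path + [step]))
        return sorted(paths)

    def dot(self):
        """Render the edges in a dot-like text form."""
        lines = ["digraph {"]
        for parent, kids in self.children.items():
            for child, allele in kids:
                src, dst = self.nodes[parent], self.nodes[child]
                lines.append(f'  "{src}" -> "{dst}" [label={allele}];')
        lines.append("}")
        return "\n".join(lines)


toy = ToyTree()
snp1 = toy.add_node("snp1", parent_idx=0, allele=0)
snp2 = toy.add_node("snp2", parent_idx=snp1, allele=1)
snp3 = toy.add_node("snp3", parent_idx=snp1, allele=0)
toy_haplotypes = toy.haplotypes()
```

In this toy version, each haplotype is simply the chain of (variant, allele) pairs collected while walking from the root to a leaf.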
happler.tree.tree_builder module
- class happler.tree.tree_builder.TreeBuilder(genotypes, phenotypes, maf=None, method=<happler.tree.assoc_test.AssocTestSimpleFastBIC object>, terminator=<happler.tree.terminator.BICTerminator object>, indep_thresh=15, ld_prune_thresh=None, covariance_correction=True, log=None)
Bases:
object
Creates a Tree object from provided Genotypes and Phenotypes
- Attributes:
- tree: Tree
A tree representing haplotypes composed from trait-associated genotypes
- gens: Genotypes
The genotypes from which the tree should be built
- phens: Phenotypes
The phenotypes from which the tree should be built
- method: AssocTest, optional
The type of association test to perform at each node when constructing the tree
- terminator: Terminator, optional
The type of test to use for deciding whether to terminate a branch
- ld_prune_thresh: float, optional
Any leaf nodes with a greater LD with their sibling than this value will be pruned
- log: Logger
A logging instance for recording debug statements.
Examples
>>> gens = Genotypes.load('tests/data/simple.vcf')
>>> phens = Phenotypes.load('tests/data/simple.tsv')
>>> tree = TreeBuilder(gens, phens).run()
- prune_tree(from_root=True)
Remove any leaf nodes that are in strong LD with their sibling branches
- Parameters:
- from_root: bool, optional
Whether to only prune leaves attached to the root of the tree
- run()
Run the tree builder and create a tree rooted at the provided variant
happler.tree.variant module
- class happler.tree.variant.Variant(idx, id, pos)
Bases:
object
A variant within the genotypes matrix
- Attributes:
- id: str
The variant’s unique ID
- idx: int
The index of the variant within the genotype data
- pos: int
The chromosomal start position of the variant
- property ID
- property POS
- classmethod from_np(np_mixed_arr_var, idx)
Convert a numpy mixed array variant record into a Variant
- id: int
- idx: int
- pos: int
- class happler.tree.variant.VariantType(variant_type='snp')
Bases:
object
A class denoting the type of variant
- Attributes:
- type: str, optional
The type of variant (ex: SNP, STR, etc)
Defaults to a single nucleotide polymorphism
happler.tree.haplotypes module
- class happler.tree.haplotypes.Haplotype(nodes=(), data=None, num_samples=None)
Bases:
object
A haplotype within the tree
- Attributes:
- nodes: tuple[tuple[Variant, int]]
An ordered collection of pairs, where each pair is a node and its allele
- data: npt.NDArray[bool]
A np array (with shape n x 2, num_samples x num_chromosomes) denoting the presence of this haplotype in each chromosome of each sample
- append(node, allele, variant_genotypes)
Append a new node (variant) to this haplotype
- Return type:
Haplotype
- Parameters:
- node: Variant
The node to add to this haplotype
- allele: int
The allele associated with this node
- variant_genotypes: npt.NDArray[bool]
A np array (with length n x 2, num_samples x num_chromosomes) denoting the presence of the new allele in each chromosome of each sample
- Returns:
- Haplotype
A new haplotype object extended by the node and its allele
- data: ndarray[Any,dtype[bool]]
- classmethod from_haptools_haplotype(haplotype, variant_genotypes)
Create a new haplotype from a haptools Haplotype and a GenotypesVCF object
- Return type:
- classmethod from_node(node, allele, variant_genotypes)
Create a new haplotype with a single node entry
- Return type:
Haplotype
- Parameters:
- node: Variant
The initializing node for this haplotype
- allele: int
The allele associated with node
- variant_genotypes: npt.NDArray[bool]
A np array (with length n x 2, num_samples x num_chromosomes) denoting the presence of this haplotype in each chromosome of each sample
- Returns:
- Haplotype
The newly created haplotype object containing node and allele
- property node_indices: tuple[int]
Get the indices of the nodes in this haplotype
- Returns:
- tuple
The indices of the nodes
- transform(genotypes, allele, idxs=None, remove_self=True)
Transform a genotypes matrix via the current haplotype:
Each entry in the returned matrix denotes the presence of the current haplotype extended by each of the variants in the genotype matrix
- Return type:
ndarray[Any,dtype[bool]]
- Parameters:
- genotypes: Genotypes
The genotypes to transform using the current haplotype
- allele: int
The allele (either 0 or 1) of the SNPs we’re adding
- idxs: tuple[int], optional
If specified, we will only output haplotypes for the variants at these indices. Otherwise, we’ll output all of them.
- remove_self: bool, optional
Whether to first remove any variants that are already in this haplotype using np.delete. This is expensive because it creates a new copy.
- Returns:
- npt.NDArray[bool]
A 3D haplotype matrix similar to the genotype matrix but with haplotypes instead of variants in the columns. It will have the same shape except that the number of columns (second dimension) will have decreased by the number of variants in this haplotype if remove_self is True
- transform_and_sum(genotypes, allele, idxs=None, remove_self=True)
Transform a genotypes matrix and sum along the ploidy axis using JAX JIT.
Combines transform() and .sum(axis=2) into a single JAX JIT-compiled operation, avoiding materialization of the intermediate 3D boolean array.
- Return type:
ndarray[Any,dtype[uint8]]
- Parameters:
- genotypes: Genotypes
The genotypes to transform using the current haplotype
- allele: int
The allele (either 0 or 1) of the SNPs we’re adding
- idxs: tuple[int], optional
If specified, we will only output haplotypes for the variants at these indices. Otherwise, we’ll output all of them.
- remove_self: bool, optional
Whether to first remove any variants that are already in this haplotype using np.delete. This is expensive because it creates a new copy.
- Returns:
- npt.NDArray[np.uint8]
A 2D matrix with shape (num_samples, num_variants) where each entry is the count of chromosomes carrying the haplotype (0, 1, or 2)
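The relationship between transform() and transform_and_sum() described above can be sketched with plain NumPy. This is a simplified illustration under stated assumptions: genotypes are taken to be a raw array of shape (num_samples, num_variants, 2) rather than a Genotypes object, the idxs and remove_self options are ignored, and the functions toy_transform and toy_transform_and_sum are hypothetical names. Unlike the JAX version described above, this sketch does materialize the intermediate 3D array.

```python
import numpy as np


def toy_transform(hap_data, genotypes, allele):
    """Mark chromosomes carrying the haplotype AND the given allele.

    hap_data:  bool array, shape (num_samples, 2)
    genotypes: int array, shape (num_samples, num_variants, 2)
    Returns a bool array with the same shape as genotypes.
    """
    # Broadcast (n, 1, 2) against (n, p, 2) to extend the haplotype
    # by every variant at once
    return hap_data[:, np.newaxis, :] & (genotypes == allele)


def toy_transform_and_sum(hap_data, genotypes, allele):
    """Count, per sample and variant, the chromosomes with the extension."""
    extended = toy_transform(hap_data, genotypes, allele)
    return extended.sum(axis=2, dtype=np.uint8)


# Sample 0 carries the haplotype on both chromosomes, sample 1 on one
hap_data = np.array([[True, True], [True, False]])
# 2 samples x 3 variants x 2 chromosomes (made-up values)
genotypes = np.array(
    [
        [[1, 1], [0, 1], [0, 0]],
        [[1, 1], [1, 0], [0, 1]],
    ],
    dtype=np.uint8,
)
counts = toy_transform_and_sum(hap_data, genotypes, allele=1)
```

Each entry of counts is 0, 1, or 2, matching the description of the summed output.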
- class happler.tree.haplotypes.Haplotypes(fname, haplotype=<class 'haptools.data.haplotypes.Haplotype'>, variant=<class 'haptools.data.haplotypes.Variant'>, repeat=<class 'haptools.data.haplotypes.Repeat'>, log=None)
Bases:
Haplotypes
A class for processing haplotypes from a file
- Attributes:
- fname: Path | str
The path to the file containing the data
- data: dict[str, Haplotype]
A dict of Haplotype objects keyed by their IDs
- types: dict
A dict of class names keyed by the symbol denoting their line type. Ex: {‘H’: Haplotype, ‘V’: Variant}
- version: str
A string denoting the current file format version
- log: Logger
A logging instance for recording debug statements.
- classmethod from_tree(fname, tree, gts, pts=None, log=None)
Create a Haplotypes object from a Tree object and a Genotypes object
- Return type:
Haplotypes
- Parameters:
- fname: Path | str
The fname parameter for the Haplotypes object
- tree: Tree
The Tree object containing the haplotypes to encode within a Haplotypes object
- gts: GenotypesVCF
The genotypes from which the tree was constructed
- pts: Phenotypes
The phenotypes with which these haplotypes were built. If not provided, the results won’t be recomputed
- log: Logger, optional
The log parameter for the Haplotypes object
- Returns:
- Haplotypes
The completed Haplotypes object
happler.tree.assoc_test module
- class happler.tree.assoc_test.AssocResults(data)
Bases:
object
The results of an association test
- Attributes:
- data: npt.NDArray[np.float64]
A numpy mixed array with fields: beta, pval, stderr
It has shape num_variants x num_fields
- data: ndarray[Any,dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]
- class happler.tree.assoc_test.AssocTest(with_bic=False)
Bases:
ABC
Abstract class for performing phenotype-haplotype association tests
- Attributes:
- with_bic: bool
Whether to also output the BIC (Bayesian Information Criterion)
- static pval_as_decimal(t_stat, df, precision=1000)
Given a t statistic, return the associated p-value as a high precision Decimal
This can be helpful when a p-value is too small to be represented by the precision of Python float64s
- Return type:
Decimal
- Parameters:
- t_stat: float
The t statistic from a simple linear model
- df: int
The degrees of freedom of the associated t distribution
- precision: int
The precision of the returned value
- Returns:
- Decimal
An approximate, higher precision p-value for the provided t statistic
- Raises:
- ValueError
This function is only valid when the t distribution is approximately equivalent to the normal distribution. Thus, if df < 1000, we raise a ValueError to indicate that the function will not return an accurate value
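To illustrate why a Decimal p-value can be useful, the sketch below approximates the two-sided tail probability of a standard normal distribution with a two-term asymptotic expansion, p ≈ 2 φ(t) / |t| × (1 − 1/t²), evaluated in Decimal arithmetic. This is an assumption-laden illustration, not happler's algorithm: the function name approx_two_sided_pval, the cutoff of |t| ≥ 5, and the default precision of 50 digits are all hypothetical, and the degrees of freedom are ignored entirely.

```python
import math
from decimal import Decimal, localcontext


def approx_two_sided_pval(t_stat: float, precision: int = 50) -> Decimal:
    """Approximate the two-sided normal tail probability as a Decimal."""
    with localcontext() as ctx:
        ctx.prec = precision
        t = abs(Decimal(t_stat))
        if t < 5:
            # The expansion is only accurate far out in the tail
            raise ValueError("approximation requires |t| >= 5")
        # Standard normal density at t, computed without float underflow
        density = (-(t * t) / 2).exp() / (2 * Decimal(math.pi)).sqrt()
        # Two-term asymptotic expansion of the upper tail, doubled
        return 2 * density / t * (1 - 1 / (t * t))


# float64 arithmetic underflows to zero for a very large statistic ...
underflowed = math.erfc(40 / math.sqrt(2))
# ... while the Decimal approximation stays strictly positive
tiny_pval = approx_two_sided_pval(40.0)
```

For moderate statistics such as t = 5, the approximation agrees with the float64 value to within about one percent, and the agreement improves as |t| grows.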
- abstract run(X, y)
Run a series of phenotype-haplotype association tests for each haplotype (column) in X and return their p-values
- Return type:
- Parameters:
- X: npt.NDArray[np.float64]
The genotypes, with shape n x p. There are only two dimensions. Each column is a haplotype and each row is a sample.
- y: npt.NDArray[np.float64]
The phenotypes, with shape n x 1
- Returns:
- npt.NDArray[np.float64]
The p-values from testing each haplotype, with shape p x 1
- standardize(X)
Standardize the genotypes so they have mean 0 and variance 1
- Return type:
ndarray[Any,dtype[float64]]
- Parameters:
- X: npt.NDArray[np.float64]
The genotypes, with shape n x p. There are only two dimensions. Each column is a haplotype and each row is a sample.
- Returns:
- npt.NDArray[np.float64]
An array with the same shape as X but standardized properly
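Column-wise standardization of this kind can be sketched in a few lines of NumPy. The function standardize_columns is a hypothetical illustration; in particular, it assumes the population standard deviation (ddof=0) and rejects constant columns, neither of which is confirmed by the description above.

```python
import numpy as np


def standardize_columns(X):
    """Scale each column of X to mean 0 and variance 1."""
    X = np.asarray(X, dtype=np.float64)
    std = X.std(axis=0)  # population standard deviation (an assumption)
    if np.any(std == 0):
        raise ValueError("cannot standardize a constant column")
    return (X - X.mean(axis=0)) / std


# 3 samples x 2 haplotypes (made-up dosages)
X_example = np.array([[0.0, 2.0], [1.0, 0.0], [2.0, 1.0]])
Z_example = standardize_columns(X_example)
```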
- class happler.tree.assoc_test.AssocTestSimple(with_bic=False)
Bases:
AssocTest
- bic(n, residuals)
Return the BIC (Bayesian Information Criterion) for an OLS test
This function follows the implementation in https://stackoverflow.com/a/58984868
- Return type:
float
- Parameters:
- n: int
The number of observations
- residuals: npt.NDArray[np.float64]
The residuals
- Returns:
- float
The Bayesian Information Criterion
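For reference, the BIC of an OLS fit with Gaussian errors is commonly written as BIC = −2 ln L + k ln n, where the maximized log-likelihood is expressed through the residual sum of squares. The sketch below uses that textbook form. The function ols_bic and its num_params argument (k, assumed here to be 2 for an intercept plus one slope) are hypothetical, and happler's bic(n, residuals) may differ in detail.

```python
import math


def ols_bic(n, residuals, num_params=2):
    """Return the BIC of a Gaussian OLS fit from its residuals."""
    rss = sum(r * r for r in residuals)
    if n <= 0 or rss <= 0:
        raise ValueError("need n > 0 and a positive residual sum of squares")
    # Maximized Gaussian log-likelihood with variance estimate rss / n
    log_likelihood = -(n / 2) * (1 + math.log(2 * math.pi)) - (n / 2) * math.log(rss / n)
    return -2 * log_likelihood + num_params * math.log(n)


# Four made-up residuals with rss / n == 1, so the log(rss / n) term vanishes
example_bic = ols_bic(4, [1.0, -1.0, 1.0, -1.0])
```

A lower BIC indicates a better trade-off between fit and model size, which is why the smallest value is typically treated as the "best".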
- perform_test(x, y)
Perform the test for a single haplotype.
- Return type:
tuple
- Parameters:
- x: npt.NDArray[np.float64]
The genotypes with shape n x 1 (for a single haplotype)
- y: npt.NDArray[np.float64]
The phenotypes, with shape n x 1
- Returns:
- tuple
The slope, p-value, and stderr obtained from the test. The BIC is appended to the end if self.with_bic is True.
- run(X, y)
Implement AssocTest for a simple linear regression.
- Return type:
- Parameters:
- X: npt.NDArray[np.float64]
The genotypes, with shape n x p. There are only two dimensions. Each row is a sample and each column is a haplotype.
- y: npt.NDArray[np.float64]
The phenotypes, with shape n x 1
- Returns:
- npt.NDArray[np.float64]
The results from testing each haplotype, with shape p x 3
- class happler.tree.assoc_test.AssocTestSimpleCovariates(covars, with_bic=False)
Bases:
AssocTestSimpleSM
- class happler.tree.assoc_test.AssocTestSimpleFastBIC(chunk_size=None)
Bases:
AssocTestSimpleSM
Calculate only BIC in a quick, vectorized fashion without statsmodels
- perform_test(X, yc)
Perform the test for a chunk of haplotypes using JAX JIT compilation
This method uses the JIT-compiled _compute_bic_jit helper function for improved computational efficiency with GPU/TPU acceleration.
- Return type:
ndarray[Any,dtype[TypeVar(_ScalarType_co, bound=generic, covariant=True)]]
- Parameters:
- X: npt.NDArray[np.float64]
The genotypes with shape (n, p)
- yc: npt.NDArray[np.float64]
The phenotypes, with shape (n, 1). They are assumed to be centered already.
- Returns:
- npt.NDArray[np.float64]
The BIC values from testing this chunk of haplotypes, with shape (p,)
- run(X, y)
Implement AssocTest for a simple, univariate OLS model: y ~ 1 + X[:, j]
Does not use statsmodels at all but replicates its behavior
- Return type:
- Parameters:
- X: npt.NDArray[np.float64]
The genotypes, with shape n x p. There are only two dimensions. Each row is a sample and each column is a haplotype.
- y: npt.NDArray[np.float64]
The phenotypes, with shape n x 1
- Returns:
- npt.NDArray[np.float64]
The results from testing each haplotype, with shape p x 1
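The idea of fitting y ~ 1 + X[:, j] for every column at once can be sketched with closed-form simple regression in NumPy. This is not the JAX implementation referred to above: fast_univariate_bic is a hypothetical function, chunking is omitted, the BIC uses the textbook Gaussian form with two parameters per model, and agreement with statsmodels is not claimed.

```python
import numpy as np


def fast_univariate_bic(X, y):
    """Return one BIC per column of X for the model y ~ 1 + X[:, j]."""
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64).reshape(-1)
    n = X.shape[0]
    # Centering absorbs the intercept of each univariate model
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    sxx = (Xc ** 2).sum(axis=0)  # per-column sum of squares
    sxy = Xc.T @ yc              # per-column cross products
    beta = sxy / sxx             # closed-form OLS slopes
    rss = (yc ** 2).sum() - beta * sxy
    log_likelihood = -(n / 2) * (1 + np.log(2 * np.pi)) - (n / 2) * np.log(rss / n)
    return -2 * log_likelihood + 2 * np.log(n)


# Simulated data: only the first haplotype affects the phenotype
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 3, size=(50, 4)).astype(np.float64)
y_demo = 2.0 * X_demo[:, 0] + rng.normal(size=50)
bics = fast_univariate_bic(X_demo, y_demo)
best_column = int(np.argmin(bics))
```

Because every quantity is computed per column with array operations, no Python-level loop over haplotypes is needed.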
- class happler.tree.assoc_test.AssocTestSimpleSM(with_bic=False)
Bases:
AssocTestSimple
- perform_test(x, y)
Perform the test for a single haplotype.
- Return type:
tuple
- Parameters:
- x: npt.NDArray[np.float64]
The genotypes with shape n x 1 (for a single haplotype)
- y: npt.NDArray[np.float64]
The phenotypes, with shape n x 1
- Returns:
- tuple
The slope, p-value, and stderr obtained from the test. The BIC is appended to the end if self.with_bic is True.
- class happler.tree.assoc_test.AssocTestSimpleSMTScore(with_bic=False)
Bases:
AssocTestSimpleSM
- perform_test(x, y, parent_res=None, parent_corr=0)
Perform the test for a single haplotype.
- Return type:
tuple
- Parameters:
- x: npt.NDArray[np.float64]
The genotypes with shape n x 1 (for a single haplotype)
- y: npt.NDArray[np.float64]
The phenotypes, with shape n x 1
- Returns:
- tuple
The slope, p-value, and stderr obtained from the test. The BIC is appended to the end if self.with_bic is True.
- run(X, y, parent_res=None, parent_corr=None)
Implement AssocTest for a simple linear regression.
- Return type:
- Parameters:
- X: npt.NDArray[np.float64]
The genotypes, with shape n x p. There are only two dimensions. Each row is a sample and each column is a haplotype.
- y: npt.NDArray[np.float64]
The phenotypes, with shape n x 1
- Returns:
- npt.NDArray[np.float64]
The results from testing each haplotype, with shape p x 3
- class happler.tree.assoc_test.NodeResults(beta, pval, stderr)
Bases:
object
The results of testing SNPs at a node in the tree
- Attributes:
- beta: float
The best effect size among all of the SNPs tried
- pval: float
The best p-value among all of the SNPs tried
- stderr: float
The standard error of beta
- beta: float
- classmethod from_np(np_mixed_arr_var)
- Return type:
- pval: float
- stderr: float
- class happler.tree.assoc_test.NodeResultsBIC(bic)
Bases:
object
The results of testing SNPs at a node in the tree
- Attributes:
- bic: float
The best BIC among all of the SNPs
- bic: float
- classmethod from_np(np_mixed_arr_var)
- Return type:
- class happler.tree.assoc_test.NodeResultsExtra(beta, pval, stderr, bic)
Bases:
NodeResults
- bic: float
- class happler.tree.assoc_test.NodeResultsExtraTScore(beta, pval, stderr, bic, tscore)
Bases:
NodeResultsExtra
- tscore: float
- class happler.tree.assoc_test.NodeResultsTScore(beta, pval, stderr, tscore)
Bases:
NodeResults
- tscore: float