Documentation

Command line interface

happler

happler: A haplotype-based fine-mapping method

Test for associations between a trait and haplotypes (i.e., sets of correlated SNPs) rather than individual SNPs.

happler [OPTIONS] COMMAND [ARGS]...

Options

--version

Show the version and exit.

run

Use the tool to find trait-associated haplotypes

GENOTYPES must be formatted as a VCF file

PHENOTYPES must be a tab-separated file containing two columns: sample ID and phenotype value

Ex: happler run tests/data/simple.vcf tests/data/simple.tsv > simple.hap

happler run [OPTIONS] GENOTYPES PHENOTYPES
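The PHENOTYPES format described above (tab-separated, sample ID then phenotype value) can be illustrated with a minimal parsing sketch. The function name `read_phenotypes`, the sample IDs, and the assumption that header or comment lines start with `#` are illustrative only and are not taken from happler itself.

```python
import csv
import io


def read_phenotypes(handle):
    """Parse a two-column, tab-separated phenotype file into {sample_id: value}."""
    phenotypes = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Assumption: blank lines and lines starting with '#' carry no data
        if not row or row[0].startswith("#"):
            continue
        sample_id, value = row[0], float(row[1])
        phenotypes[sample_id] = value
    return phenotypes


# Hypothetical file contents: a header line followed by two samples
example = io.StringIO("#IID\theight\nHG00096\t1.5\nHG00097\t-0.25\n")
phens = read_phenotypes(example)
print(phens)
```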

Options

--pheno <pheno>

Which phenotype from the .pheno file should we use?

Default:

'0'

--pheno-name

Treat the --pheno argument as a column name instead of an index

Default:

False

--region <region>

The region from which to extract genotypes; ex: ‘chr1:1234-34566’ or ‘chr7’

For this to work, the VCF must be indexed and the seqname must match!

Default:

'all genotypes'
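As a rough illustration of the two region forms shown above, the sketch below splits a region string into a sequence name and optional coordinates. `parse_region` is a hypothetical helper written for this example; it is not part of happler, and the tool may parse regions differently.

```python
def parse_region(region):
    """Split 'chrom:start-end' (or just 'chrom') into (chrom, start, end)."""
    chrom, sep, span = region.partition(":")
    if not sep:
        # Only a sequence name was given, e.g. 'chr7'
        return chrom, None, None
    start, end = span.split("-")
    return chrom, int(start), int(end)


print(parse_region("chr1:1234-34566"))
print(parse_region("chr7"))
```

Note that the sequence name must match the VCF exactly, so `chr1` and `1` are treated as different names.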

-s, --sample <samples>

A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)

Default:

'all samples'

-S, --samples-file <samples_file>

A single column txt file containing a list of the samples (one per line) to subset from the genotypes file

Default:

'all samples'

--discard-multiallelic

Whether to discard multi-allelic variants or just complain about them.

Default:

'do not discard multi-allelic variants'

--discard-missing

Ignore any samples that are missing genotypes for the required variants

Default:

False

-c, --chunk-size <chunk_size>

Perform memory-intensive operations in chunks of X variants. This reduces memory usage at the cost of speed.

Default:

'all variants'

--maf <maf>

Only use variants with an MAF above this threshold

Default:

'all variants'

--hap-maf <hap_maf>

Only build haplotypes with an MAF above this threshold

Default:

'the MAF equivalent to an MAC of 20'
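The default above is stated in terms of a minor allele count (MAC). A minimal sketch of the conversion, assuming diploid samples so that each sample contributes two chromosomes, is shown below. The helper name `mac_to_maf` is hypothetical, and the exact formula happler uses is an assumption here.

```python
def mac_to_maf(mac, num_samples, ploidy=2):
    """Convert a minor allele count into a minor allele frequency.

    Assumes every sample contributes `ploidy` chromosomes.
    """
    return mac / (ploidy * num_samples)


# With 1000 diploid samples, an MAC of 20 corresponds to 20 / 2000
maf = mac_to_maf(20, 1000)
print(maf)
```

Under this assumption the default threshold becomes stricter (a larger MAF) as the number of samples shrinks.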

--max-signals <max_signals>

The maximum number of expected causal signals

Default:

1

--max-iterations <max_iterations>

The max number of times to repeat the tree building

Default:

1

--ld-prune-thresh <ld_prune_thresh>

The LD threshold used to prune leaf nodes based on LD with their siblings

Default:

0.95

--out-thresh <out_thresh>

Threshold used to determine whether to output haplotypes (single-SNP)

Default:

5e-08

--show-tree

Output a tree in addition to the regular output.

Default:

False

--remove-SNPs

Remove haplotypes with only a single variant

Default:

False

-o, --output <output>

A .hap file describing the extracted haplotypes

Default:

'stdout'

-v, --verbosity <verbosity>

The level of verbosity desired

Default:

'INFO'

Options:

CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET

Arguments

GENOTYPES

Required argument

PHENOTYPES

Required argument

transform

Transform a genotype matrix via a haplotype

GENOTYPES must be formatted as a VCF or PGEN file

HAPLOTYPES must be formatted as a .hap file

Ex: happler transform tests/data/simple.vcf tests/data/simple.hap > simple.vcf

happler transform [OPTIONS] GENOTYPES HAPLOTYPES

Options

--region <region>

The region from which to extract genotypes; ex: ‘chr1:1234-34566’ or ‘chr7’

For this to work, the VCF must be indexed and the seqname must match!

Default:

'all genotypes'

-s, --sample <samples>

A list of the samples to subset from the genotypes file (ex: ‘-s sample1 -s sample2’)

Default:

'all samples'

-S, --samples-file <samples_file>

A single column txt file containing a list of the samples (one per line) to subset from the genotypes file

Default:

'all samples'

--var-id <variants>

A list of the variants to subset from the genotypes file (ex: ‘--var-id variant1 --var-id variant2’)

Default:

'all variants'

--discard-multiallelic

Whether to discard multi-allelic variants or just complain about them.

Default:

'do not discard multi-allelic variants'

--discard-missing

Ignore any samples that are missing genotypes for the required variants

Default:

False

-c, --chunk-size <chunk_size>

If using a PGEN file, read genotypes in chunks of X variants; reduces memory

Default:

'all variants'

--maf <maf>

Ignore variants with a MAF below this threshold

Default:

'no filtering'

-i, --hap-id <hap_id>

Which haplotype to use from the .hap file

Default:

'the first haplotype'

-a, --allele <allele>

The allele of the next causal SNP

Default:

0

-o, --output <output>

A transformed genotypes file

Default:

'stdout'

-v, --verbosity <verbosity>

The level of verbosity desired

Default:

'INFO'

Options:

CRITICAL | ERROR | WARNING | INFO | DEBUG | NOTSET

Arguments

GENOTYPES

Required argument

HAPLOTYPES

Required argument

Module contents

happler.tree.tree module

class happler.tree.tree.Tree(log=None)

Bases: object

A tree where

  1. nodes are variants (and also haplotypes)

  2. each node can have at most two branches for each of its alleles

Attributes:
graph: nx.DiGraph

The underlying directed acyclic graph representing the tree

variant_locs: dict

The indices of each variant within the tree’s list of nodes

add_node(node, parent_idx, allele, results=None)

Add node to the tree

Return type:

int

Parameters:
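node: Variant

The node to add to the tree

parent_idx: int

The index of the node under which to place the new node

allele: int

The allele for the edge from a parent node to this node

results: np.void, optional

The results (beta, pval) of the association test for this haplotype once this variant is included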

The results (beta, pval) of the association test for this haplotype once this variant is included

Returns:
int

The index of the new node within the tree

dot()

Convert the tree to its representation in the dot language. This is useful for quickly viewing the tree on the command line.

Return type:

str

Returns:
str

A string representing the tree

Nodes are labeled by their variant ID and edges are labeled by their allele
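To make the node and edge labeling concrete, here is a small self-contained sketch of a tree that records variant IDs on nodes and alleles on edges, then renders itself in the dot language. `MiniTree` is a hypothetical stand-in written for this example; it does not use networkx and is not the happler `Tree` class.

```python
class MiniTree:
    """Hypothetical stand-in for a tree whose nodes are variants."""

    def __init__(self):
        self.labels = ["root"]  # node index -> variant ID
        self.edges = []  # (parent_idx, child_idx, allele)

    def add_node(self, variant_id, parent_idx, allele):
        """Attach a new node under parent_idx and return its index."""
        self.labels.append(variant_id)
        child_idx = len(self.labels) - 1
        self.edges.append((parent_idx, child_idx, allele))
        return child_idx

    def dot(self):
        """Render nodes (labeled by ID) and edges (labeled by allele)."""
        lines = ["digraph tree {"]
        for idx, label in enumerate(self.labels):
            lines.append(f'  {idx} [label="{label}"];')
        for parent, child, allele in self.edges:
            lines.append(f'  {parent} -> {child} [label="{allele}"];')
        lines.append("}")
        return "\n".join(lines)


tree = MiniTree()
snp1 = tree.add_node("rs1", parent_idx=0, allele=1)
snp2 = tree.add_node("rs2", parent_idx=snp1, allele=0)
print(tree.dot())
```

The printed text can be piped to a Graphviz renderer to view the tree.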

haplotypes(root=0)

Return the haplotypes at the leaves of the tree rooted at the index “root”

Return type:

list[deque[dict]]

Returns:
list[deque[dict]]

A list of haplotypes, where each haplotype consists of a list of dictionaries.

Each dictionary contains the node and all of its attributes.
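One way to picture this return value is as the set of root-to-leaf paths through the tree. The sketch below enumerates those paths from a plain adjacency dict. The function `leaf_haplotypes`, the `children` layout, and the attribute keys are all assumptions made for illustration.

```python
from collections import deque


def leaf_haplotypes(children, root=0):
    """Collect one deque of node-attribute dicts per root-to-leaf path.

    children maps a node index to a list of (child_idx, attrs) pairs.
    """
    haplotypes = []

    def walk(node, path):
        kids = children.get(node, [])
        if not kids:
            # Reached a leaf: the accumulated path is one haplotype
            if path:
                haplotypes.append(deque(path))
            return
        for child_idx, attrs in kids:
            walk(child_idx, path + [attrs])

    walk(root, [])
    return haplotypes


# Hypothetical tree: root -> rs1 (allele 1) -> rs2 (alleles 0 and 1)
children = {
    0: [(1, {"variant": "rs1", "allele": 1})],
    1: [
        (2, {"variant": "rs2", "allele": 0}),
        (3, {"variant": "rs2", "allele": 1}),
    ],
}
haps = leaf_haplotypes(children)
print(haps)
```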

leaves(from_root=False)

Return all leaves of this tree

Parameters:
from_root: bool, optional

Whether to only return leaves attached to the root of the tree

Returns:
tuple[int, Variant, int]

The variant at each leaf node of the tree. Returns the index in the tree, the Variant object, and the allele of the variant.

property num_nodes
property num_variants
remove_leaf_node(node_idx)

Remove a leaf node from the tree

Parameters:
node_idx: int

The index of the node to remove

siblings(node_idx)

Locate sibling(s) of this node in the tree

Return type:

list[Variant]

Parameters:
node_idx: int

The index of a node in the tree

Returns:
list[Variant]

The variants at the sibling nodes

happler.tree.tree_builder module

class happler.tree.tree_builder.TreeBuilder(genotypes, phenotypes, maf=None, method=<happler.tree.assoc_test.AssocTestSimpleFastBIC object>, terminator=<happler.tree.terminator.BICTerminator object>, indep_thresh=15, ld_prune_thresh=None, covariance_correction=True, log=None)

Bases: object

Creates a Tree object from provided Genotypes and Phenotypes

Attributes:
tree: Tree

A tree representing haplotypes composed from trait-associated genotypes

gens: Genotypes

The genotypes from which the tree should be built

phens: Phenotypes

The phenotypes from which the tree should be built

method: AssocTest, optional

The type of association test to perform at each node when constructing the tree

terminator: Terminator, optional

The type of test to use for deciding whether to terminate a branch

ld_prune_thresh: float, optional

Any leaf nodes with a greater LD with their sibling than this value will be pruned

log: Logger

A logging instance for recording debug statements.

Examples

>>> gens = Genotypes.load('tests/data/simple.vcf')
>>> phens = Phenotypes.load('tests/data/simple.tsv')
>>> tree = TreeBuilder(gens, phens).run()
prune_tree(from_root=True)

Remove any leaf nodes that are in strong LD with their sibling branches

Parameters:
from_root: bool, optional

Whether to only prune leaves attached to the root of the tree
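The pruning rule can be sketched as a comparison of each leaf against its sibling. The example below measures LD as the squared Pearson correlation between dosage vectors, which is an assumption: the source does not say which LD statistic is used. The helper names and the data are hypothetical.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two equal-length vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant vector carries no LD information
    return cov * cov / (var_a * var_b)


def leaves_to_prune(leaf_dosages, sibling_dosages, ld_prune_thresh=0.95):
    """Return the names of leaves whose LD with the sibling exceeds the threshold."""
    return [
        name
        for name, dosages in leaf_dosages.items()
        if r_squared(dosages, sibling_dosages) > ld_prune_thresh
    ]


sibling = [0, 1, 2, 1, 0, 2]
leaves = {
    "hapA": [0, 1, 2, 1, 0, 2],  # identical to the sibling
    "hapB": [2, 0, 1, 1, 0, 1],  # uncorrelated with the sibling
}
pruned = leaves_to_prune(leaves, sibling)
print(pruned)
```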

run()

Run the tree builder and create a tree rooted at the provided variant

happler.tree.variant module

class happler.tree.variant.Variant(idx, id, pos)

Bases: object

A variant within the genotypes matrix

Attributes:
id: str

The variant’s unique ID

idx: int

The index of the variant within the genotype data

pos: int

The chromosomal start position of the variant

property ID
property POS
classmethod from_np(np_mixed_arr_var, idx)

Convert a numpy mixed array variant record into a Variant

Return type:

Variant

Parameters:
np_mixed_arr_var: np.void

A numpy mixed array variant record with entries ‘id’ and ‘pos’

idx: int

See idx

Returns:
Variant

The converted Variant

id: int
idx: int
pos: int
class happler.tree.variant.VariantType(variant_type='snp')

Bases: object

A class denoting the type of variant

Attributes:
type: str, optional

The type of variant (ex: SNP, STR, etc)

Defaults to a single nucleotide polymorphism

happler.tree.haplotypes module

class happler.tree.haplotypes.Haplotype(nodes=(), data=None, num_samples=None)

Bases: object

A haplotype within the tree

Attributes:
nodes: tuple[tuple[Variant, int]]

An ordered collection of pairs, where each pair is a node and its allele

data: npt.NDArray[bool]

A np array (with shape n x 2, num_samples x num_chromosomes) denoting the presence of this haplotype in each chromosome of each sample

append(node, allele, variant_genotypes)

Append a new node (variant) to this haplotype

Return type:

Haplotype

Parameters:
node: Variant

The node to add to this haplotype

allele: int

The allele associated with this node

variant_genotypes: npt.NDArray[bool]

A np array (with shape n x 2, num_samples x num_chromosomes) denoting the presence of the new allele in each chromosome of each sample

Returns:
Haplotype

A new haplotype object extended by the node and its allele
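Since both arrays mark presence per chromosome, extending a haplotype by one allele can be thought of as an element-wise logical AND. The sketch below shows that idea with nested lists instead of numpy arrays. `append_allele` is a hypothetical helper, and treating the update as a plain AND is an assumption about the implementation.

```python
def append_allele(hap_presence, allele_presence):
    """Element-wise AND of two (num_samples x 2) boolean matrices.

    A chromosome carries the extended haplotype only if it carried the
    original haplotype and also carries the new allele.
    """
    return [
        [h and a for h, a in zip(hap_row, allele_row)]
        for hap_row, allele_row in zip(hap_presence, allele_presence)
    ]


# Three samples, two chromosomes each (hypothetical data)
hap = [[True, False], [True, True], [False, False]]
new_allele = [[True, True], [False, True], [True, False]]
extended = append_allele(hap, new_allele)
print(extended)
```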

data: ndarray[Any, dtype[bool]]
classmethod from_haptools_haplotype(haplotype, variant_genotypes)

Create a new haplotype from a haptools Haplotype and a GenotypesVCF object

Return type:

Haplotype

classmethod from_node(node, allele, variant_genotypes)

Create a new haplotype with a single node entry

Return type:

Haplotype

Parameters:
node: Variant

The initializing node for this haplotype

allele: int

The allele associated with node

variant_genotypes: npt.NDArray[bool]

A np array (with shape n x 2, num_samples x num_chromosomes) denoting the presence of this haplotype in each chromosome of each sample

Returns:
Haplotype

The newly created haplotype object containing node and allele

property node_indices: tuple[int]

Get the indices of the nodes in this haplotype

Returns:
tuple

The indices of the nodes

nodes: tuple[tuple[Variant, int]]
transform(genotypes, allele, idxs=None, remove_self=True)

Transform a genotypes matrix via the current haplotype:

Each entry in the returned matrix denotes the presence of the current haplotype extended by each of the variants in the genotype matrix

Return type:

ndarray[Any, dtype[bool]]

Parameters:
genotypes: Genotypes

The genotypes to transform using the current haplotype

allele: int

The allele (either 0 or 1) of the SNPs we’re adding

idxs: tuple[int], optional

If specified, we will only output haplotypes for the variants at these indices. Otherwise, we’ll output all of them.

remove_self: bool, optional

Whether to first remove any variants that are already in this haplotype using np.delete. This is expensive because it creates a new copy.

Returns:
npt.NDArray[bool]

A 3D haplotype matrix similar to the genotype matrix but with haplotypes instead of variants in the columns. It will have the same shape except that the number of columns (second dimension) will have decreased by the number of variants in this haplotype if remove_self is True

transform_and_sum(genotypes, allele, idxs=None, remove_self=True)

Transform a genotypes matrix and sum along the ploidy axis using JAX JIT.

Combines transform() and .sum(axis=2) into a single JAX JIT-compiled operation, avoiding materialization of the intermediate 3D boolean array.

Return type:

ndarray[Any, dtype[uint8]]

Parameters:
genotypes: Genotypes

The genotypes to transform using the current haplotype

allele: int

The allele (either 0 or 1) of the SNPs we’re adding

idxs: tuple[int], optional

If specified, we will only output haplotypes for the variants at these indices. Otherwise, we’ll output all of them.

remove_self: bool, optional

Whether to first remove any variants that are already in this haplotype using np.delete. This is expensive because it creates a new copy.

Returns:
npt.NDArray[np.uint8]

A 2D matrix with shape (num_samples, num_variants) where each entry is the count of chromosomes carrying the haplotype (0, 1, or 2)
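The combined operation can be sketched in plain Python: for every sample and variant, count the chromosomes that carry the current haplotype and also have the requested allele at that variant. This is a slow, loop-based illustration under assumed data layouts (genotypes indexed as sample, variant, chromosome); it does not use JAX and is not the happler implementation.

```python
def transform_and_sum_sketch(hap_presence, genotypes, allele):
    """Count chromosomes carrying the haplotype extended by each variant.

    hap_presence: num_samples x 2 booleans
    genotypes: num_samples x num_variants x 2 alleles (0 or 1)
    Returns a num_samples x num_variants matrix with entries 0, 1, or 2.
    """
    counts = []
    for hap_row, sample_gts in zip(hap_presence, genotypes):
        counts.append(
            [
                sum(
                    1
                    for carries, gt in zip(hap_row, variant_gts)
                    if carries and gt == allele
                )
                for variant_gts in sample_gts
            ]
        )
    return counts


# Two samples, two variants, two chromosomes (hypothetical data)
hap_presence = [[True, True], [True, False]]
genotypes = [
    [[1, 1], [0, 1]],
    [[1, 0], [1, 1]],
]
alt_counts = transform_and_sum_sketch(hap_presence, genotypes, allele=1)
ref_counts = transform_and_sum_sketch(hap_presence, genotypes, allele=0)
print(alt_counts, ref_counts)
```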

class happler.tree.haplotypes.Haplotypes(fname, haplotype=<class 'haptools.data.haplotypes.Haplotype'>, variant=<class 'haptools.data.haplotypes.Variant'>, repeat=<class 'haptools.data.haplotypes.Repeat'>, log=None)

Bases: Haplotypes

A class for processing haplotypes from a file

Attributes:
fname: Path | str

The path to the file containing the data

data: dict[str, Haplotype]

A dict of Haplotype objects keyed by their IDs

types: dict

A dict of class names keyed by the symbol denoting their line type. Ex: {‘H’: Haplotype, ‘V’: Variant}

version: str

A string denoting the current file format version

log: Logger

A logging instance for recording debug statements.

classmethod from_tree(fname, tree, gts, pts=None, log=None)

Create a Haplotypes object from a Tree object and a Genotypes object

Return type:

Haplotypes

Parameters:
fname: Path | str

The fname parameter for the Haplotypes object

tree: Tree

The Tree object containing the haplotypes to encode within a Haplotypes obj

gts: GenotypesVCF

The genotypes from which the tree was constructed

pts: Phenotypes, optional

The phenotypes with which these haplotypes were built. If not provided, the results won’t be recomputed

log: Logger, optional

The log parameter for the Haplotypes object

Returns:
Haplotypes

The completed Haplotypes object

class happler.tree.haplotypes.HapplerHaplotype(chrom, start, end, id, beta, pval)

Bases: Haplotype

A haplotype with sufficient fields for happler. Properties and functions are shared with the base Haplotype object, “HaplotypeBase”

beta: float
pval: float
class happler.tree.haplotypes.HapplerVariant(start, end, id, allele, score)

Bases: Variant

A variant allele with sufficient fields for happler. Properties and functions are shared with the base Variant object, “VariantBase”

score: float

happler.tree.assoc_test module

class happler.tree.assoc_test.AssocResults(data)

Bases: object

The results of an association test

Attributes:
data: npt.NDArray[np.float64]

A numpy mixed array with fields: beta, pval, stderr

It has shape num_variants x num_fields

data: ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]
class happler.tree.assoc_test.AssocTest(with_bic=False)

Bases: ABC

Abstract class for performing phenotype-haplotype association tests

Attributes:
with_bic: bool

Whether to also output the BIC (Bayesian Information Criterion)

static pval_as_decimal(t_stat, df, precision=1000)

Given a t statistic, return the associated p-value as a high precision Decimal

This can be helpful when a p-value is too small to be represented by a Python float64

Return type:

Decimal

Parameters:
t_stat: float

The t statistic from a simple linear model

df: int

The degrees of freedom of the associated t distribution

precision: int

The precision of the returned value

Returns:
Decimal

An approximate, higher precision p-value for the provided t statistic

Raises:
ValueError

This function is only valid when the t distribution is approximately equivalent to the normal distribution. Thus, if df < 1000, we raise a ValueError to indicate that the function will not return an accurate value
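To show why a Decimal is useful here, the sketch below approximates a two-sided normal tail probability with Python's `decimal` module. It uses a truncated asymptotic (Mills ratio) expansion of the normal tail, which is a technique swapped in for illustration and only reasonable for large statistics. It is not happler's implementation, and the function name is hypothetical.

```python
import math
from decimal import Decimal, localcontext


def two_sided_pval_decimal(z, precision=50):
    """Approximate 2 * P(Z > |z|) for a standard normal Z as a Decimal.

    Uses phi(z)/z * (1 - 1/z^2 + 3/z^4), which is accurate only for large |z|.
    """
    with localcontext() as ctx:
        ctx.prec = precision
        z = abs(Decimal(str(z)))
        # Standard normal density evaluated at z, computed in Decimal
        density = (-(z * z) / 2).exp() / Decimal(2 * math.pi).sqrt()
        series = 1 - 1 / z**2 + 3 / z**4
        return 2 * density / z * series


p_moderate = two_sided_pval_decimal(5)
p_extreme = two_sided_pval_decimal(40)
print(p_moderate, p_extreme)
# Converting the extreme value to a float underflows to zero
print(float(p_extreme))
```

For a statistic of 40 the Decimal result is still a positive number, whereas the same quantity cannot be represented as a float64.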

abstract run(X, y)

Run a series of phenotype-haplotype association tests for each haplotype (column) in X and return their p-values

Return type:

AssocResults

Parameters:
X: npt.NDArray[np.float64]

The genotypes, with shape n x p. There are only two dimensions. Each column is a haplotype and each row is a sample.

y: npt.NDArray[np.float64]

The phenotypes, with shape n x 1

Returns:
npt.NDArray[np.float64]

The p-values from testing each haplotype, with shape p x 1

standardize(X)

Standardize the genotypes so they have mean 0 and variance 1

Return type:

ndarray[Any, dtype[float64]]

Parameters:
X: npt.NDArray[np.float64]

The genotypes, with shape n x p. There are only two dimensions. Each column is a haplotype and each row is a sample.

Returns:
npt.NDArray[np.float64]

An array with the same shape as X but standardized properly
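Standardization of an n x p matrix can be sketched column by column as below. The example uses the population variance (dividing by n) and assumes no column is constant; both choices are assumptions, since the source does not state them. The function is a plain-Python stand-in rather than the method itself.

```python
import math


def standardize(X):
    """Return a copy of X whose columns have mean 0 and variance 1.

    Assumes every column has non-zero variance.
    """
    n = len(X)
    columns = list(zip(*X))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / n)
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


# Three samples (rows) and two haplotypes (columns), hypothetical dosages
X = [[0, 2], [1, 0], [2, 1]]
Z = standardize(X)
print(Z)
```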

class happler.tree.assoc_test.AssocTestSimple(with_bic=False)

Bases: AssocTest

bic(n, residuals)

Return the BIC (Bayesian Information Criterion) for an OLS test

This function follows the implementation in https://stackoverflow.com/a/58984868

Return type:

float

Parameters:
n: int

The number of observations

residuals: npt.NDArray[np.float64]

The residuals

Returns:
float

The Bayesian Information Criterion
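For reference, the BIC of an OLS fit with Gaussian errors can be written in terms of the sum of squared residuals. The sketch below uses that standard formula. The default of two estimated parameters (intercept and slope) is an assumption about how the method counts parameters, and `ols_bic` is a hypothetical name.

```python
import math


def ols_bic(n, residuals, num_params=2):
    """BIC of an OLS fit from its residuals, using the Gaussian log-likelihood.

    num_params counts the fitted coefficients (assumed: intercept + slope).
    """
    ssr = sum(r * r for r in residuals)
    log_likelihood = -n / 2 * (math.log(2 * math.pi) + math.log(ssr / n) + 1)
    return -2 * log_likelihood + num_params * math.log(n)


# Hypothetical residuals from a fit on four observations
bic_value = ols_bic(4, [1.0, -1.0, 1.0, -1.0])
print(bic_value)
```

Lower values indicate a better trade-off between fit and model size.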

perform_test(x, y)

Perform the test for a single haplotype.

Return type:

tuple

Parameters:
x: npt.NDArray[np.float64]

The genotypes with shape n x 1 (for a single haplotype)

y: npt.NDArray[np.float64]

The phenotypes, with shape n x 1

Returns:
tuple

The slope, p-value, and stderr obtained from the test. The BIC is appended to the end if self.with_bic is True.
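The slope and its standard error for a single haplotype follow from the usual closed-form simple linear regression. The sketch below computes them along with the t statistic using only the standard library. The p-value is omitted because it requires the t distribution with n - 2 degrees of freedom, which the standard library does not provide. `simple_ols` and the data are hypothetical.

```python
import math


def simple_ols(x, y):
    """Fit y ~ intercept + slope * x and return (slope, stderr, t_stat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    # Unbiased residual variance uses n - 2 degrees of freedom
    sigma2 = sum(r * r for r in residuals) / (n - 2)
    stderr = math.sqrt(sigma2 / sxx)
    return slope, stderr, slope / stderr


slope, stderr, t_stat = simple_ols([0, 1, 2, 3], [1, 3, 5, 8])
print(slope, stderr, t_stat)
```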

run(X, y)

Implement AssocTest for a simple linear regression.

Return type:

AssocResults

Parameters:
X: npt.NDArray[np.float64]

The genotypes, with shape n x p. There are only two dimensions. Each row is a sample and each column is a haplotype.

y: npt.NDArray[np.float64]

The phenotypes, with shape n x 1

Returns:
npt.NDArray[np.float64]

The results from testing each haplotype, with shape p x 3

class happler.tree.assoc_test.AssocTestSimpleCovariates(covars, with_bic=False)

Bases: AssocTestSimpleSM

class happler.tree.assoc_test.AssocTestSimpleFastBIC(chunk_size=None)

Bases: AssocTestSimpleSM

Calculate only BIC in a quick, vectorized fashion without statsmodels

perform_test(X, yc)

Perform the test for a chunk of haplotypes using JAX JIT compilation

This method uses the JIT-compiled _compute_bic_jit helper function for improved computational efficiency with GPU/TPU acceleration.

Return type:

ndarray[Any, dtype[TypeVar(_ScalarType_co, bound= generic, covariant=True)]]

Parameters:
X: npt.NDArray[np.float64]

The genotypes with shape (n, p)

yc: npt.NDArray[np.float64]

The phenotypes, with shape (n, 1). They are assumed to be centered already.

Returns:
npt.NDArray[np.float64]

The BIC values from testing this chunk of haplotypes, with shape (p,)

run(X, y)

Implement AssocTest for a simple, univariate OLS model: y ~ 1 + X[:, j]

Does not use statsmodels at all but replicates its behavior

Return type:

AssocResults

Parameters:
X: npt.NDArray[np.float64]

The genotypes, with shape n x p. There are only two dimensions. Each row is a sample and each column is a haplotype.

y: npt.NDArray[np.float64]

The phenotypes, with shape n x 1

Returns:
npt.NDArray[np.float64]

The results from testing each haplotype, with shape p x 1
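For the univariate model y ~ 1 + X[:, j], the residual sum of squares has a closed form in terms of centered sums, so the BIC of every column can be obtained without fitting each regression separately. The sketch below shows that shortcut with plain Python loops. It is an assumption-laden illustration (two parameters per model, Gaussian likelihood, no constant columns), not the JAX code used by the class.

```python
import math


def bic_per_haplotype(X, y):
    """Return one BIC per column of X for the model y ~ 1 + X[:, j]."""
    n = len(y)
    mean_y = sum(y) / n
    yc = [v - mean_y for v in y]  # center the phenotypes once
    syy = sum(v * v for v in yc)
    # Terms of the BIC that do not depend on the column being tested
    constant = n * (math.log(2 * math.pi) + 1) + 2 * math.log(n)
    bics = []
    for column in zip(*X):
        mean_x = sum(column) / n
        xc = [v - mean_x for v in column]
        sxx = sum(v * v for v in xc)
        sxy = sum(a * b for a, b in zip(xc, yc))
        # Residual sum of squares of the fitted line, in closed form
        ssr = syy - sxy * sxy / sxx
        bics.append(n * math.log(ssr / n) + constant)
    return bics


# Four samples (rows), two haplotypes (columns), hypothetical data
X = [[0, 1], [1, 0], [2, 1], [3, 0]]
y = [1, 3, 5, 8]
bics = bic_per_haplotype(X, y)
print(bics)
```

The first column tracks the phenotype closely, so it receives the lower BIC of the two.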

class happler.tree.assoc_test.AssocTestSimpleSM(with_bic=False)

Bases: AssocTestSimple

perform_test(x, y)

Perform the test for a single haplotype.

Return type:

tuple

Parameters:
x: npt.NDArray[np.float64]

The genotypes with shape n x 1 (for a single haplotype)

y: npt.NDArray[np.float64]

The phenotypes, with shape n x 1

Returns:
tuple

The slope, p-value, and stderr obtained from the test. The BIC is appended to the end if self.with_bic is True.

class happler.tree.assoc_test.AssocTestSimpleSMTScore(with_bic=False)

Bases: AssocTestSimpleSM

perform_test(x, y, parent_res=None, parent_corr=0)

Perform the test for a single haplotype.

Return type:

tuple

Parameters:
x: npt.NDArray[np.float64]

The genotypes with shape n x 1 (for a single haplotype)

y: npt.NDArray[np.float64]

The phenotypes, with shape n x 1

Returns:
tuple

The slope, p-value, and stderr obtained from the test. The BIC is appended to the end if self.with_bic is True.

run(X, y, parent_res=None, parent_corr=None)

Implement AssocTest for a simple linear regression.

Return type:

AssocResults

Parameters:
X: npt.NDArray[np.float64]

The genotypes, with shape n x p. There are only two dimensions. Each row is a sample and each column is a haplotype.

y: npt.NDArray[np.float64]

The phenotypes, with shape n x 1

Returns:
npt.NDArray[np.float64]

The results from testing each haplotype, with shape p x 3

class happler.tree.assoc_test.NodeResults(beta, pval, stderr)

Bases: object

The results of testing SNPs at a node in the tree

Attributes:
beta: float

The best effect size among all of the SNPs tried

pval: float

The best p-value among all of the SNPs tried

stderr: float

The standard error of beta

beta: float
classmethod from_np(np_mixed_arr_var)
Return type:

NodeResults

pval: float
stderr: float
class happler.tree.assoc_test.NodeResultsBIC(bic)

Bases: object

The results of testing SNPs at a node in the tree

Attributes:
bic: float

The best BIC among all of the SNPs

bic: float
classmethod from_np(np_mixed_arr_var)
Return type:

NodeResults

class happler.tree.assoc_test.NodeResultsExtra(beta, pval, stderr, bic)

Bases: NodeResults

bic: float
class happler.tree.assoc_test.NodeResultsExtraTScore(beta, pval, stderr, bic, tscore)

Bases: NodeResultsExtra

tscore: float
class happler.tree.assoc_test.NodeResultsTScore(beta, pval, stderr, tscore)

Bases: NodeResults

tscore: float