kalis: A Modern Implementation of the Li & Stephens Model for Local Ancestry Inference in R

by Louis J. M. Aslett, et al.

Approximating the recent phylogeny of N phased haplotypes at a set of variants along the genome is a core problem in modern population genomics and is central to genome-wide screens for association, selection, introgression, and other signals. The Li & Stephens (LS) model provides a simple yet powerful hidden Markov model for inferring the recent ancestry at a given variant, represented as an N × N distance matrix based on posterior decodings. However, existing posterior decoding implementations for the LS model cannot scale to modern datasets with tens or hundreds of thousands of genomes. This work provides a high-performance engine for computing the LS model, exposed via an easy-to-use package, kalis, in the statistical programming language R, on top of which users can rapidly develop a range of variant-specific ancestral inference pipelines. kalis exploits both multi-core parallelism and modern CPU vector instruction sets to scale to problem sizes that would previously have been prohibitively slow to work with. The resulting distance matrices enable local ancestry, selection, and association studies in modern large-scale genomic datasets.
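To make the idea of an N × N distance matrix built from posterior decodings concrete, the following is a minimal pure-Python sketch of a toy LS-style hidden Markov model. It is not the kalis API and makes no attempt at the performance described above. The function names (`ls_posteriors`, `distance_matrix`), the uniform copying prior, the single recombination parameter `rho`, the single mutation parameter `mu`, and the choice of a symmetrized negative log posterior as the distance are all assumptions made for illustration.

```python
import math


def ls_posteriors(haps, target, rho=0.1, mu=0.01):
    """Toy LS-style forward-backward for one target haplotype.

    Returns a list over variants; entry l maps each donor index j to the
    posterior probability that haps[target] copies from haps[j] at variant l.
    rho (switch probability) and mu (mismatch probability) are held constant
    across variants, which is a simplifying assumption.
    """
    n, n_var = len(haps), len(haps[0])
    donors = [j for j in range(n) if j != target]
    pi = 1.0 / len(donors)  # uniform prior over donors (assumption)

    def emit(j, l):
        # Probability of the target allele given that donor j is copied.
        return 1.0 - mu if haps[j][l] == haps[target][l] else mu

    def normalise(d):
        s = sum(d.values())
        return {j: v / s for j, v in d.items()}

    # Forward pass, rescaled at every variant to avoid underflow.
    fwd = [normalise({j: pi * emit(j, 0) for j in donors})]
    for l in range(1, n_var):
        prev = fwd[-1]  # sums to 1, so the switch term is simply rho * pi
        fwd.append(normalise(
            {j: emit(j, l) * ((1.0 - rho) * prev[j] + rho * pi) for j in donors}
        ))

    # Backward pass, also rescaled at every variant.
    bwd = [None] * n_var
    bwd[n_var - 1] = {j: 1.0 for j in donors}
    for l in range(n_var - 2, -1, -1):
        w = {j: emit(j, l + 1) * bwd[l + 1][j] for j in donors}
        total = sum(w.values())
        bwd[l] = normalise(
            {j: (1.0 - rho) * w[j] + rho * pi * total for j in donors}
        )

    # Posterior decoding: elementwise product, renormalised per variant.
    return [normalise({j: fwd[l][j] * bwd[l][j] for j in donors})
            for l in range(n_var)]


def distance_matrix(haps, l, rho=0.1, mu=0.01):
    """N x N matrix at variant l: symmetrized -log posterior copying probability."""
    n = len(haps)
    raw = [[0.0] * n for _ in range(n)]
    for i in range(n):
        post = ls_posteriors(haps, i, rho, mu)[l]
        for j, p in post.items():
            raw[j][i] = -math.log(p)
    return [[0.5 * (raw[a][b] + raw[b][a]) for b in range(n)] for a in range(n)]


if __name__ == "__main__":
    # Three toy haplotypes over five variants: 0 and 1 are identical.
    toy = [[0, 1, 0, 1, 1], [0, 1, 0, 1, 1], [1, 0, 1, 0, 0]]
    for row in distance_matrix(toy, l=2):
        print([round(x, 3) for x in row])
```

In this toy input, haplotypes 0 and 1 are identical, so their distance is smaller than either one's distance to haplotype 2. Each call to `ls_posteriors` costs on the order of N × L operations for N haplotypes and L variants, and filling the matrix repeats it for every target haplotype. At the dataset sizes the abstract mentions, a naive loop like this is far too slow, which is why a parallel, vectorized engine is needed.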




