Geometry of the sample frequency spectrum and the perils of demographic inference

12/13/2017
by Zvi Rosen, et al.

The sample frequency spectrum (SFS), which describes the distribution of mutant alleles in a sample of DNA sequences, is a widely used summary statistic in population genetics. The expected SFS has a strong dependence on the historical population demography and this property is exploited by popular statistical methods to infer complex demographic histories from DNA sequence data. Most, if not all, of these inference methods exhibit pathological behavior, however. Specifically, they often display runaway behavior in optimization, where the inferred population sizes and epoch durations can degenerate to 0 or diverge to infinity, and show undesirable sensitivity of the inferred demography to perturbations in the data. The goal of this paper is to provide theoretical insights into why such problems arise. To this end, we characterize the geometry of the expected SFS for piecewise-constant demographic histories and use our results to show that the aforementioned pathological behavior of popular inference methods is intrinsic to the geometry of the expected SFS. We provide explicit descriptions and visualizations for a toy model with sample size 4, and generalize our intuition to arbitrary sample sizes n using tools from convex and algebraic geometry. We also develop a universal characterization result which shows that the expected SFS of a sample of size n under an arbitrary population history can be recapitulated by a piecewise-constant demography with only k(n) epochs, where k(n) is between n/2 and 2n-1. The set of expected SFS for piecewise-constant demographies with fewer than k(n) epochs is open and non-convex, which causes the above phenomena for inference from data.
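To make the object in the abstract concrete, the following is a minimal sketch of how the expected SFS under a piecewise-constant demography might be computed for small sample sizes. It is not the authors' code. It assumes the standard coalescent with time measured so that a pair of lineages coalesces at rate 1/eta(t), and it combines two textbook facts: the number of ancestral lineages is a pure death process with rates C(k,2) in rescaled time, and the expected SFS is a fixed linear combination of the expected times E[T_k] spent with k lineages. The function names, the `sizes`/`durations` parameterization (most recent epoch first, last epoch extending to infinity), and the `theta` scaling are illustrative choices. The alternating sums used here lose precision as n grows, so the sketch is only suitable for small n such as the toy case n = 4.

```python
from math import comb, exp, inf


def survival_integral(lam, sizes, durations):
    """Integral over t >= 0 of exp(-lam * Lambda(t)), where Lambda(t) is the
    integral of 1/eta(s) from 0 to t and eta is piecewise constant.
    `sizes` has one entry per epoch; `durations` has one fewer entry,
    because the oldest epoch is assumed to last forever."""
    total, rescaled_start = 0.0, 0.0
    for eta, d in zip(sizes, list(durations) + [inf]):
        decay = 0.0 if d == inf else exp(-lam * d / eta)
        total += exp(-lam * rescaled_start) * (eta / lam) * (1.0 - decay)
        if d != inf:
            rescaled_start += d / eta
    return total


def expected_times(n, sizes, durations):
    """E[T_k] for k = 2..n: expected time during which k lineages exist.
    Uses the closed form for state probabilities of a pure death chain
    with distinct rates lam_k = C(k, 2)."""
    lam = {j: comb(j, 2) for j in range(2, n + 1)}
    integ = {m: survival_integral(lam[m], sizes, durations) for m in lam}
    times = {}
    for k in range(2, n + 1):
        prod = 1.0
        for j in range(k + 1, n + 1):
            prod *= lam[j]
        s = 0.0
        for m in range(k, n + 1):
            denom = 1.0
            for j in range(k, n + 1):
                if j != m:
                    denom *= lam[j] - lam[m]
            s += integ[m] / denom
        times[k] = prod * s
    return times


def expected_sfs(n, sizes, durations, theta=1.0):
    """Expected SFS entries for i = 1..n-1 (unnormalized, scaled by theta)."""
    times = expected_times(n, sizes, durations)
    sfs = []
    for i in range(1, n):
        total = sum(
            k * comb(n - i - 1, k - 2) / comb(n - 1, k - 1) * times[k]
            for k in range(2, n - i + 2)
        )
        sfs.append(0.5 * theta * total)
    return sfs


if __name__ == "__main__":
    # Constant size, then a hypothetical three-epoch history with a bottleneck.
    print(expected_sfs(4, [1.0], []))
    print(expected_sfs(4, [1.0, 0.1, 2.0], [0.2, 0.05]))
```

With a single epoch of size 1, the sketch recovers the classical constant-size result, in which entry i equals theta/i. Varying `sizes` and `durations` for n = 4 traces out points in the set of expected SFS whose geometry the abstract describes.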


research
07/21/2023

The Population Resemblance Statistic: A Chi-Square Measure of Fit for Banking

The Population Stability Index (PSI) is a widely used measure in credit ...
research
04/05/2023

The Expected Sample Allele Frequencies from Populations of Changing Size via Orthogonal Polynomials

In this article, discrete and stochastic changes in (effective) populati...
research
04/14/2020

The Tajima heterochronous n-coalescent: inference from heterochronously sampled molecular data

The observed sequence variation at a locus informs about the evolutionar...
research
11/15/2017

Exact Limits of Inference in Coalescent Models

Recovery of population size history from sequence data and testing of hy...
research
05/15/2019

Demographic Inference and Representative Population Estimates from Multilingual Social Media Data

Social media provide access to behavioural data at an unprecedented scal...
research
04/29/2022

Finite sequences representing expected order statistics

Characterizations of finite sequences β_1<⋯<β_n representing expected va...
