Estimating the unseen from multiple populations

07/12/2017
by Aditi Raghunathan, et al.

Given samples from a distribution, how many new elements should we expect to find if we continue sampling this distribution? This is an important and actively studied problem, with many applications ranging from unseen species estimation to genomics. We generalize this extrapolation and related unseen estimation problems to the multiple population setting, where population j has an unknown distribution D_j from which we observe n_j samples. We derive an optimal estimator for the total number of elements we expect to find among new samples across the populations. Surprisingly, we prove that our estimator's accuracy is independent of the number of populations. We also develop an efficient optimization algorithm to solve the more general problem of estimating multi-population frequency distributions. We validate our methods and theory through extensive experiments. Finally, on a real dataset of human genomes across multiple ancestries, we demonstrate how our approach for unseen estimation can enable cohort designs that can discover interesting mutations with greater efficiency.
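For context on the single-population question the abstract opens with, here is a minimal sketch of the classical Good-Toulmin estimator. It is not the multi-population estimator this paper derives, and the function name `good_toulmin` and the toy sample are illustrative assumptions. Given n observed samples, it estimates how many new distinct elements would appear in a further t·n samples as U(t) = -Σ_i (-t)^i Φ_i, where Φ_i is the number of elements seen exactly i times. The plain series is only well behaved for t ≤ 1, so the sketch rejects larger t.

```python
from collections import Counter


def good_toulmin(samples, t):
    """Classical Good-Toulmin estimate of the number of new distinct
    elements expected in t * len(samples) further draws (0 <= t <= 1).

    Illustrative single-population baseline only; not the paper's estimator.
    """
    if not 0 <= t <= 1:
        raise ValueError("the plain Good-Toulmin series is only stable for 0 <= t <= 1")
    # How many times each distinct element was observed.
    counts = Counter(samples)
    # Prevalences: phi[i] = number of elements observed exactly i times.
    phi = Counter(counts.values())
    # Alternating series: U(t) = -sum_i (-t)^i * phi_i
    return -sum((-t) ** i * phi_i for i, phi_i in phi.items())


# Toy sample (assumed data): three singletons, one doubleton, one tripleton.
samples = ["a", "b", "b", "c", "d", "d", "d", "e"]
estimate_same_size = good_toulmin(samples, 1.0)   # extrapolate to 8 more draws
estimate_half_size = good_toulmin(samples, 0.5)   # extrapolate to 4 more draws
```

The estimate is driven mostly by the singleton count Φ_1, which matches the intuition that elements seen only once signal how much of the distribution remains unobserved.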


research
04/07/2021

Near-optimal estimation of the unseen under regularly varying tail populations

Given n samples from a population of individuals belonging to different ...
research
11/23/2015

Estimating the number of unseen species: A bird in the hand is worth n in the bush

Estimating the number of unseen species is an important problem in many ...
research
09/06/2012

The Sample Complexity of Search over Multiple Populations

This paper studies the sample complexity of searching over multiple popu...
research
01/13/2020

Convergence of Chao Unseen Species Estimator

Support size estimation and the related problem of unseen species estima...
research
03/11/2022

Bayesian Nonparametric Inference for "Species-sampling" Problems

"Species-sampling" problems (SSPs) refer to a broad class of statistical...
research
09/24/2021

Analysis of Ordinal Populations from Judgment Post-Stratification

In surveys requiring cost efficiency, such as medical research, measurin...
research
02/14/2019

Dualizing Le Cam's method, with applications to estimating the unseens

One of the most commonly used techniques for proving statistical lower b...
