Sample complexity of population recovery

02/18/2017
by Yury Polyanskiy, et al.

The problem of population recovery refers to estimating a distribution based on incomplete or corrupted samples. Consider a random poll of sample size n conducted on a population of individuals, where each pollee is asked to answer d binary questions. We consider one of two polling impediments: (a) in lossy population recovery, a pollee may skip each question with probability ϵ; (b) in noisy population recovery, a pollee may lie on each question with probability ϵ. Given n lossy or noisy samples, the goal is to estimate the probabilities of all 2^d binary vectors simultaneously within accuracy δ with high probability. This paper settles the sample complexity of population recovery. For the lossy model, the optimal sample complexity is Θ̃(δ^{-2 max{ϵ/(1-ϵ), 1}}), improving the state of the art by Moitra and Saks in several ways: a lower bound is established, the upper bound is improved, and the result depends at most on the logarithm of the dimension. Surprisingly, the sample complexity undergoes a phase transition from a parametric to a nonparametric rate when ϵ exceeds 1/2. For noisy population recovery, the sharp sample complexity turns out to be more sensitive to the dimension and scales as exp(Θ(d^{1/3} log^{2/3}(1/δ))), except for the trivial cases of ϵ = 0, 1/2, or 1. For both models, our estimators simply compute the empirical mean of a certain function, which is found by pre-solving a linear program (LP). Curiously, the dual LP can be understood as Le Cam's method for lower-bounding the minimax risk, thus establishing the statistical optimality of the proposed estimators. The value of the LP is determined by complex-analytic methods.
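To make the two sampling models concrete, the following minimal Python sketch simulates one lossy and one noisy response to a d-question poll, and evaluates the exponent of 1/δ in the lossy rate, 2·max{ϵ/(1-ϵ), 1}, ignoring the logarithmic factors hidden by Θ̃. The function names and the use of `None` to mark a skipped question are illustrative assumptions, not notation from the paper, and the sketch does not implement the paper's LP-based estimator.

```python
import random


def lossy_sample(x, eps, rng):
    """Lossy model: each answer is skipped independently with probability eps.

    A skipped answer is represented by None (an illustrative convention).
    """
    return [None if rng.random() < eps else bit for bit in x]


def noisy_sample(x, eps, rng):
    """Noisy model: each binary answer is flipped independently with probability eps."""
    return [bit ^ 1 if rng.random() < eps else bit for bit in x]


def lossy_exponent(eps):
    """Exponent of 1/delta in the lossy sample complexity, up to log factors.

    Equals 2 for eps <= 1/2 (parametric rate) and 2 * eps / (1 - eps) beyond.
    """
    if not 0 <= eps < 1:
        raise ValueError("eps must lie in [0, 1)")
    return 2 * max(eps / (1 - eps), 1.0)


if __name__ == "__main__":
    rng = random.Random(0)
    answers = [1, 0, 1, 1, 0]  # one pollee's true answers, d = 5
    print(lossy_sample(answers, 0.3, rng))
    print(noisy_sample(answers, 0.3, rng))
    for eps in (0.25, 0.5, 0.75):
        print(eps, lossy_exponent(eps))
```

The exponent stays at 2 up to ϵ = 1/2 and then grows as 2ϵ/(1-ϵ), which is the phase transition described in the abstract.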


