Learning from Untrusted Data

11/07/2016
by Moses Charikar, et al.

The vast majority of theoretical results in machine learning and statistics assume that the available training data is a reasonably reliable reflection of the phenomena to be learned or estimated. Similarly, the majority of machine learning and statistical techniques used in practice are brittle to the presence of large amounts of biased or malicious data. In this work we consider two frameworks in which to study estimation, learning, and optimization in the presence of significant fractions of arbitrary data. The first framework, list-decodable learning, asks whether it is possible to return a list of answers, with the guarantee that at least one of them is accurate. For example, given a dataset of n points for which an unknown subset of αn points are drawn from a distribution of interest, and no assumptions are made about the remaining (1-α)n points, is it possible to return a list of poly(1/α) answers, one of which is correct? The second framework, which we term the semi-verified learning model, considers the extent to which a small dataset of trusted data (drawn from the distribution in question) can be leveraged to enable the accurate extraction of information from a much larger but untrusted dataset (of which only an α-fraction is drawn from the distribution). We show strong positive results in both settings, and provide an algorithm for robust learning in a very general stochastic optimization setting. This general result has immediate implications for robust estimation in a number of settings, including for robustly estimating the mean of distributions with bounded second moments, robustly learning mixtures of such distributions, and robustly finding planted partitions in random graphs in which significant portions of the graph have been perturbed by an adversary.
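
The two frameworks can be made concrete with a toy one-dimensional sketch. This is not the algorithm from the paper; it is a naive baseline under stated assumptions. It relies on one observation: a uniformly drawn data point is an inlier with probability α, so roughly log(1/δ)/α random draws contain at least one inlier with probability at least 1-δ. That candidate is only as accurate as a single sample, far weaker than the guarantees the abstract describes. The function names (`naive_list_decode`, `select_with_trusted`, `make_dataset`), the Gaussian inliers, and the decoy clusters at -50 and 60 are all hypothetical choices made for illustration.

```python
import math
import random


def naive_list_decode(points, alpha, failure_prob=1e-6, rng=None):
    """Return a list of candidate estimates by sampling points uniformly.

    Each draw is an inlier with probability alpha, so all k draws miss the
    inliers with probability (1 - alpha)**k <= exp(-alpha * k) <= failure_prob.
    """
    rng = rng or random.Random(0)
    k = math.ceil(math.log(1.0 / failure_prob) / alpha)
    return [rng.choice(points) for _ in range(k)]


def select_with_trusted(candidates, trusted):
    """Semi-verified step: keep the candidate closest to the trusted mean."""
    trusted_mean = sum(trusted) / len(trusted)
    return min(candidates, key=lambda c: abs(c - trusted_mean))


def make_dataset(n, alpha, true_mean, rng):
    """Hypothetical data: about alpha*n inliers, the rest in two decoy clusters."""
    n_good = round(alpha * n)
    good = [rng.gauss(true_mean, 1.0) for _ in range(n_good)]
    decoys = [rng.gauss(rng.choice([-50.0, 60.0]), 1.0) for _ in range(n - n_good)]
    data = good + decoys
    rng.shuffle(data)
    return data


if __name__ == "__main__":
    rng = random.Random(1)
    data = make_dataset(n=1000, alpha=0.1, true_mean=5.0, rng=rng)
    # List-decodable setting: no single answer can be trusted, so return a list.
    candidates = naive_list_decode(data, alpha=0.1, rng=rng)
    # Semi-verified setting: a few trusted points pick one candidate from the list.
    trusted = [rng.gauss(5.0, 1.0) for _ in range(3)]
    print(len(candidates), select_with_trusted(candidates, trusted))
```

With α = 0.1 the inliers are a minority, so the overall sample mean is dominated by the decoy clusters. The list sidesteps the problem by not committing to one answer, and the three trusted points are enough to choose among the candidates even though they would give only a coarse estimate on their own.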

research
06/10/2022

List-Decodable Sparse Mean Estimation via Difference-of-Pairs Filtering

We study the problem of list-decodable sparse mean estimation. Specifica...
research
06/18/2020

List-Decodable Mean Estimation via Iterative Multi-Filtering

We study the problem of list-decodable mean estimation for bounded cova...
research
03/15/2017

Resilience: A Criterion for Learning in the Presence of Arbitrary Outliers

We introduce a criterion, resilience, which allows properties of a datas...
research
05/01/2023

A Spectral Algorithm for List-Decodable Covariance Estimation in Relative Frobenius Norm

We study the problem of list-decodable Gaussian covariance estimation. G...
research
06/16/2021

Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean Estimation

We study the problem of list-decodable mean estimation, where an adversa...
research
11/14/2013

Fundamental Limits of Online and Distributed Algorithms for Statistical Learning and Estimation

Many machine learning approaches are characterized by information constr...
