
Correcting the bias in least squares regression with volume-rescaled sampling
Consider linear regression where the examples are generated by an unknown distribution on R^d × R. Without any assumptions on the noise, the linear least squares solution for any i.i.d. sample will typically be biased w.r.t. the least squares optimum over the entire distribution. However, we show that if an i.i.d. sample of any size k is augmented by a certain small additional sample, then the solution of the combined sample becomes unbiased. We show this when the additional sample consists of d points drawn jointly according to the input distribution that is rescaled by the squared volume spanned by the points. Furthermore, we propose algorithms to sample from this volume-rescaled distribution when the data distribution is only known through an i.i.d. sample.
10/04/2018 ∙ by Michal Derezinski, et al.
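As a minimal sketch of this distribution, assuming a small finite pool of candidate points in place of the continuous input distribution (an illustration-only simplification), one can draw d points jointly with probability proportional to the squared volume they span:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 2
X = rng.standard_normal((n, d))  # finite pool standing in for the input distribution

# Draw d points jointly with probability proportional to the squared
# volume they span: det(X_S)^2 for the d x d submatrix X_S.
subsets = list(itertools.combinations(range(n), d))
weights = np.array([np.linalg.det(X[list(S)]) ** 2 for S in subsets])
S = subsets[rng.choice(len(subsets), p=weights / weights.sum())]
augmenting_points = X[list(S)]
```

Exhaustive enumeration is only feasible for tiny pools; the paper's contribution includes efficient samplers that avoid it.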

Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate
Many modern machine learning models are trained to achieve zero or near-zero training error in order to obtain near-optimal (but nonzero) test error. This phenomenon of strong generalization performance for "overfitted" / interpolated classifiers appears to be ubiquitous in high-dimensional data, having been observed in deep networks, kernel machines, boosting and random forests. Their performance is robust even when the data contain large amounts of label noise. Very little theory is available to explain these observations. The vast majority of theoretical analyses of generalization allows for interpolation only when there is little or no label noise. This paper takes a step toward a theoretical foundation for interpolated classifiers by analyzing local interpolating schemes, including the geometric simplicial interpolation algorithm and weighted k-nearest-neighbor schemes. Consistency or near-consistency is proved for these schemes in classification and regression problems. These schemes have an inductive bias that benefits from higher dimension, a kind of "blessing of dimensionality". Finally, connections to kernel machines, random forests, and adversarial examples in the interpolated regime are discussed.
06/13/2018 ∙ by Mikhail Belkin, et al.
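A toy sketch of one such local scheme: weighted k-nearest-neighbor regression with singular weights (the function name and the weight exponent are illustrative choices, not the paper's exact construction). The weight diverging at distance zero is what makes the predictor interpolate the training data:

```python
import numpy as np

def interpolating_knn_predict(X_train, y_train, x, k=3, alpha=2.0):
    """Weighted k-NN with singular weights w(r) ~ r^{-alpha}.

    Because the weight diverges as a query approaches a training
    point, the predictor fits every training point exactly.
    """
    dists = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(dists)[:k]
    d = dists[idx]
    if d[0] == 0.0:            # query coincides with a training point
        return float(y_train[idx[0]])
    w = d ** (-alpha)
    return float(np.dot(w, y_train[idx]) / w.sum())
```

Queries at a training point return that point's label exactly, while nearby queries return a distance-weighted average of the k neighbors.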

Reconciling modern machine learning and the bias-variance trade-off
The question of generalization in machine learning (how algorithms are able to learn predictors from a training sample to make accurate predictions out-of-sample) is revisited in light of the recent breakthroughs in modern machine learning technology. The classical approach to understanding generalization is based on bias-variance trade-offs, where model complexity is carefully calibrated so that the fit on the training sample reflects performance out-of-sample. However, it is now common practice to fit highly complex models like deep neural networks to data with (nearly) zero training error, and yet these interpolating predictors are observed to have good out-of-sample accuracy even for noisy data. How can the classical understanding of generalization be reconciled with these observations from modern machine learning practice? In this paper, we bridge the two regimes by exhibiting a new "double descent" risk curve that extends the traditional U-shaped bias-variance curve beyond the point of interpolation. Specifically, the curve shows that as soon as the model complexity is high enough to achieve interpolation on the training sample (a point that we call the "interpolation threshold"), the risk of suitably chosen interpolating predictors from these models can, in fact, be decreasing as the model complexity increases, often below the risk achieved using non-interpolating models. The double descent risk curve is demonstrated for a broad range of models, including neural networks and random forests, and a mechanism for producing this behavior is posited.
12/28/2018 ∙ by Mikhail Belkin, et al.

Two models of double descent for weak features
The "double descent" risk curve was recently proposed to qualitatively describe the out-of-sample prediction accuracy of variably parameterized machine learning models. This article provides a precise mathematical analysis for the shape of this curve in two simple data models with the least squares / least norm predictor. Specifically, it is shown that the risk peaks when the number of features p is close to the sample size n, but also that the risk decreases towards its minimum as p increases beyond n. This behavior is contrasted with that of "prescient" models that select features in an a priori optimal order.
03/18/2019 ∙ by Mikhail Belkin, et al.
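A rough numerical sketch of this setup (the dimensions, noise level, and feature ordering here are arbitrary choices, not the paper's data models): the least squares predictor on the first p features, where the minimum-ℓ_2-norm solution is obtained from the pseudoinverse when p > n.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 20, 100                          # sample size, total feature count
beta = rng.standard_normal(N) / np.sqrt(N)

def risk(p, n_test=2000):
    """Test MSE of least squares on the first p of N features
    (least-l2-norm solution via pinv when p > n)."""
    Xtr = rng.standard_normal((n, N))
    ytr = Xtr @ beta + 0.1 * rng.standard_normal(n)
    w = np.linalg.pinv(Xtr[:, :p]) @ ytr
    Xte = rng.standard_normal((n_test, N))
    return float(np.mean((Xte[:, :p] @ w - Xte @ beta) ** 2))

curve = {p: risk(p) for p in (5, 15, 20, 40, 100)}
```

Sweeping p typically shows the risk spiking near the interpolation threshold p ≈ n and descending again as p grows toward N.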

How many variables should be entered in a principal component regression equation?
We study least squares linear regression over N uncorrelated Gaussian features that are selected in order of decreasing variance. When the number of selected features p is at most the sample size n, the estimator under consideration coincides with the principal component regression estimator; when p > n, the estimator is the least ℓ_2 norm solution over the selected features. We give an average-case analysis of the out-of-sample prediction error as p, n, N → ∞ with p/N → α and n/N → β, for some constants α ∈ [0,1] and β ∈ (0,1). In this average-case setting, the prediction error exhibits a `double descent' shape as a function of p.
06/04/2019 ∙ by Ji Xu, et al.
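A sketch of the estimator under consideration (the helper name and test data are illustrative): keep the p features with largest sample variance, then solve least squares, which becomes the least-ℓ_2-norm solution via the pseudoinverse once p > n.

```python
import numpy as np

def pcr_predict(Xtr, ytr, Xte, p):
    """Select the p features with largest sample variance, then fit
    least squares (least-l2-norm solution via pinv when p > n)."""
    order = np.argsort(Xtr.var(axis=0))[::-1][:p]
    w = np.linalg.pinv(Xtr[:, order]) @ ytr
    return Xte[:, order] @ w
```

With uncorrelated features of decreasing variance, this variance-ordered selection coincides with principal component regression in the p ≤ n regime described above.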

Diameter-based Interactive Structure Search
In this work, we introduce interactive structure search, a generic framework that encompasses many interactive learning settings, both explored and unexplored. We show that a recently developed active learning algorithm of TD17 can be adapted for interactive structure search, that it can be made noise-tolerant, and that it enjoys favorable convergence rates.
06/05/2019 ∙ by Christopher Tosh, et al.

Unbiased estimators for random design regression
In linear regression we wish to estimate the optimum linear least squares predictor for a distribution over d-dimensional input points and real-valued responses, based on a small sample. Under standard random design analysis, where the sample is drawn i.i.d. from the input distribution, the least squares solution for that sample can be viewed as the natural estimator of the optimum. Unfortunately, this estimator almost always incurs an undesirable bias coming from the randomness of the input points. In this paper we show that it is possible to draw a non-i.i.d. sample of input points such that, regardless of the response model, the least squares solution is an unbiased estimator of the optimum. Moreover, this sample can be produced efficiently by augmenting a previously drawn i.i.d. sample with an additional set of d points drawn jointly from the input distribution rescaled by the squared volume spanned by the points. Motivated by this, we develop a theoretical framework for studying volume-rescaled sampling, and in the process prove a number of new matrix expectation identities. We use them to show that for any input distribution and ϵ > 0 there is a random design consisting of O(d log d + d/ϵ) points from which an unbiased estimator can be constructed whose square loss over the entire distribution is with high probability bounded by 1+ϵ times the loss of the optimum. We provide efficient algorithms for generating such unbiased estimators in a number of practical settings and support our claims experimentally.
07/08/2019 ∙ by Michał Dereziński, et al.
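A sketch of the augmentation step, again using a small finite candidate pool as a stand-in for the rescaled continuous distribution and an arbitrary linear response model: a size-k i.i.d. sample is joined with d jointly volume-sampled points, and least squares is solved on the combined sample.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
d, k = 2, 8

def volume_sample(P, d):
    """Draw d rows of P jointly with probability proportional to the
    squared volume they span, det(P_S)^2 (exhaustive, pool-sized only)."""
    subsets = list(itertools.combinations(range(len(P)), d))
    w = np.array([np.linalg.det(P[list(S)]) ** 2 for S in subsets])
    S = subsets[rng.choice(len(subsets), p=w / w.sum())]
    return P[list(S)]

X_iid = rng.standard_normal((k, d))                    # plain i.i.d. sample
X_vol = volume_sample(rng.standard_normal((6, d)), d)  # d extra joint points
X = np.vstack([X_iid, X_vol])
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.standard_normal(k + d)
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # estimator on the combined sample
```

Per the abstract, it is this combined design, not the i.i.d. sample alone, whose least squares solution is unbiased regardless of the response model.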

Benefits of overparameterization with EM
Expectation Maximization (EM) is among the most popular algorithms for maximum likelihood estimation, but it is generally only guaranteed to find stationary points of the log-likelihood objective. The goal of this article is to present theoretical and empirical evidence that overparameterization can help EM avoid spurious local optima in the log-likelihood. We consider the problem of estimating the mean vectors of a Gaussian mixture model in a scenario where the mixing weights are known. Our study shows that the global behavior of EM, when one uses an overparameterized model in which the mixing weights are treated as unknown, is better than that when one uses the (correct) model with the mixing weights fixed to the known values. For symmetric Gaussian mixtures with two components, we prove that introducing the (statistically redundant) weight parameters enables EM to find the global maximizer of the log-likelihood starting from almost any initial mean parameters, whereas EM without this overparameterization may very often fail. For other Gaussian mixtures, we provide empirical evidence that shows similar behavior. Our results corroborate the value of overparameterization in solving nonconvex optimization problems, previously observed in other domains.
10/26/2018 ∙ by Ji Xu, et al.
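A sketch of the overparameterized EM iteration for the symmetric two-component case (the initialization, sample size, and step count are arbitrary choices): the mean parameter is tied across the two unit-variance components as ±μ, while the mixing weight π, though statistically redundant, is also estimated.

```python
import numpy as np

rng = np.random.default_rng(3)
# Symmetric two-component mixture: N(+mu, 1) and N(-mu, 1), equal weights.
mu_true = 2.0
x = np.where(rng.random(4000) < 0.5, 1.0, -1.0) * mu_true \
    + rng.standard_normal(4000)

def em_overparam(x, steps=200):
    """EM that also estimates the (redundant) mixing weight pi."""
    mu, pi = 0.5, 0.3          # deliberately poor initialization
    for _ in range(steps):
        # E-step: posterior responsibility of the +mu component
        a = pi * np.exp(-0.5 * (x - mu) ** 2)
        b = (1 - pi) * np.exp(-0.5 * (x + mu) ** 2)
        r = a / (a + b)
        # M-step: update the weight and the tied mean parameter
        pi = r.mean()
        mu = np.mean((2 * r - 1) * x)
    return mu, pi
```

From this poor start, the overparameterized iteration still recovers |μ| ≈ 2 and π ≈ 1/2, illustrating the global behavior the paper proves.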

A cryptographic approach to black box adversarial machine learning
We propose an ensemble technique for converting any classifier into a computationally secure classifier. We define a simpler security problem for random binary classifiers and prove a reduction from this model to the security of the overall ensemble classifier. We provide experimental evidence of the security of our random binary classifiers, as well as empirical results of the adversarial accuracy of the overall ensemble to black-box attacks. Our construction crucially leverages hidden randomness in the multiclass-to-binary reduction.
06/07/2019 ∙ by Kevin Shi, et al.
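One way to picture a multiclass-to-binary reduction with hidden randomness (a skeleton only; the paper's actual classifiers and security reduction are not reproduced here): a secret random code matrix assigns each class a binary codeword, one binary classifier is trained per bit, and per-bit predictions are Hamming-decoded back to a class.

```python
import numpy as np

rng = np.random.default_rng(4)
n_classes, n_bits = 10, 16

# Secret random code matrix: each class maps to a random binary codeword.
# Keeping these codes hidden is the randomness the ensemble relies on.
codes = rng.integers(0, 2, size=(n_classes, n_bits))

def decode(bit_predictions):
    """Hamming-decode the per-bit binary outputs back to a class label."""
    dists = np.abs(codes - np.asarray(bit_predictions)).sum(axis=1)
    return int(np.argmin(dists))
```

Decoding a codeword itself recovers its class (or a class with an identical codeword), and small per-bit errors are absorbed by the nearest-codeword decoding.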

A gradual, semi-discrete approach to generative network training via explicit Wasserstein minimization
This paper provides a simple procedure to fit generative networks to target distributions, with the goal of a small Wasserstein distance (or other optimal transport cost). The approach is based on two principles: (a) if the source randomness of the network is a continuous distribution (the "semi-discrete" setting), then the Wasserstein distance is realized by a deterministic optimal transport mapping; (b) given an optimal transport mapping between a generator network and a target distribution, the Wasserstein distance may be decreased via a regression between the generated data and the mapped target points. The procedure here therefore alternates these two steps: forming an optimal transport mapping and regressing against it, gradually adjusting the generator network towards the target distribution. Mathematically, this approach is shown to minimize the Wasserstein distance to both the empirical target distribution and also its underlying population counterpart. Empirically, good performance is demonstrated on the training and testing sets of the MNIST and Thin8 data. The paper closes with a discussion of the unsuitability of the Wasserstein distance for certain tasks, as has been identified in prior work [Arora et al., 2017; Huang et al., 2017].
06/08/2019 ∙ by Yucheng Chen, et al.
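A one-dimensional toy sketch of the alternation (the affine "generator" and the half-step update are illustrative stand-ins for a network and its training schedule, not the paper's procedure): in 1D, the optimal transport mapping between samples is simply the monotone sorted-to-sorted matching.

```python
import numpy as np

rng = np.random.default_rng(5)
target = np.sort(rng.standard_normal(256))   # empirical target sample

# "Generator": an increasing affine map of uniform noise, g(z) = a*z + b.
a, b = 1.0, 0.0

for _ in range(50):
    z = np.sort(rng.random(256))
    # (a) In 1D the optimal transport plan between the generated and
    #     target samples is the monotone (sorted-to-sorted) matching.
    # (b) Regress the generator onto the matched target points, moving
    #     the parameters part of the way (the "gradual" adjustment).
    A = np.stack([z, np.ones_like(z)], axis=1)
    a_new, b_new = np.linalg.lstsq(A, target, rcond=None)[0]
    a, b = a + 0.5 * (a_new - a), b + 0.5 * (b_new - b)
```

The final parameters give a small empirical 1-Wasserstein distance between the generated and target samples, up to the approximation error of the affine family.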

Time Series Compression Based on Adaptive Piecewise Recurrent Autoencoder
Time series account for a large proportion of the data stored in financial, medical and scientific databases. The efficient storage of time series is important in practical applications. In this paper, we propose a novel compression scheme for time series. The encoder and decoder are both composed of recurrent neural networks (RNNs) such as long short-term memory (LSTM) networks. There is an autoencoder between the encoder and decoder, which encodes the hidden state and input together and decodes them at the decoder side. Moreover, we preprocess the original time series by partitioning it into segments of various lengths that have similar total variation. The experimental study shows that the proposed algorithm achieves competitive compression ratios on real-world time series.
07/23/2017 ∙ by Daniel Hsu, et al.
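The variation-based preprocessing can be sketched as follows (the function name and segment count are illustrative; the RNN encoder/decoder is not reproduced here): cut the series where the cumulative total variation crosses evenly spaced targets, so each segment carries roughly the same total variation.

```python
import numpy as np

def split_by_total_variation(x, n_segments):
    """Partition a series into segments of roughly equal total
    variation (sum of absolute first differences)."""
    tv = np.abs(np.diff(x))
    cum = np.concatenate([[0.0], np.cumsum(tv)])  # cum[i] = TV of x[:i+1]
    targets = np.linspace(0.0, cum[-1], n_segments + 1)
    cuts = np.searchsorted(cum, targets[1:-1])
    return np.split(x, cuts)
```

Segments in flat regions come out long and segments in volatile regions come out short, which is the length adaptivity the abstract describes.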