Daniel Hsu

Assistant professor at Columbia University

  • Correcting the bias in least squares regression with volume-rescaled sampling

Consider linear regression where the examples are generated by an unknown distribution on R^d× R. Without any assumptions on the noise, the linear least squares solution for any i.i.d. sample will typically be biased w.r.t. the least squares optimum over the entire distribution. However, we show that if an i.i.d. sample of any size k is augmented by a certain small additional sample, then the solution of the combined sample becomes unbiased. We show this when the additional sample consists of d points drawn jointly according to the input distribution that is rescaled by the squared volume spanned by the points. Furthermore, we propose algorithms to sample from this volume-rescaled distribution when the data distribution is only known through an i.i.d. sample.

10/04/2018 ∙ by Michal Derezinski, et al.
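A minimal, brute-force sketch of volume-rescaled sampling from a finite pool of i.i.d. draws (a simplification of the paper's setting; `volume_rescaled_subset` is a hypothetical helper name, not the authors' code): it enumerates all size-d subsets and samples one with probability proportional to the squared volume they span, i.e. the determinant of their Gram matrix.

```python
import itertools
import numpy as np

def volume_rescaled_subset(pool, d, rng):
    """Draw a size-d subset of `pool` with probability proportional to the
    squared volume spanned by the chosen points (det of their Gram matrix).
    Restricting to a finite pool and enumerating subsets is an illustrative
    simplification, not an efficient algorithm."""
    subsets = list(itertools.combinations(range(len(pool)), d))
    # Squared volume of the parallelepiped spanned by the d chosen rows.
    weights = np.array([np.linalg.det(pool[list(s)] @ pool[list(s)].T)
                        for s in subsets])
    probs = weights / weights.sum()   # Gram determinants are nonnegative
    chosen = subsets[rng.choice(len(subsets), p=probs)]
    return pool[list(chosen)]

rng = np.random.default_rng(0)
d, M = 2, 12
pool = rng.standard_normal((M, d))
extra = volume_rescaled_subset(pool, d, rng)
print(extra.shape)  # (2, 2)
```

The d sampled points are the "certain small additional sample" that, appended to an i.i.d. sample, makes the least squares solution unbiased.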

  • Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate

Many modern machine learning models are trained to achieve zero or near-zero training error in order to obtain near-optimal (but non-zero) test error. This phenomenon of strong generalization performance for "overfitted" / interpolated classifiers appears to be ubiquitous in high-dimensional data, having been observed in deep networks, kernel machines, boosting and random forests. Their performance is robust even when the data contain large amounts of label noise. Very little theory is available to explain these observations. The vast majority of theoretical analyses of generalization allows for interpolation only when there is little or no label noise. This paper takes a step toward a theoretical foundation for interpolated classifiers by analyzing local interpolating schemes, including a geometric simplicial interpolation algorithm and weighted k-nearest neighbor schemes. Consistency or near-consistency is proved for these schemes in classification and regression problems. These schemes have an inductive bias that benefits from higher dimension, a kind of "blessing of dimensionality". Finally, connections to kernel machines, random forests, and adversarial examples in the interpolated regime are discussed.

06/13/2018 ∙ by Mikhail Belkin, et al.
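A toy illustration of a local interpolating scheme in the spirit of the weighted nearest-neighbor rules above (a sketch under simplifying assumptions, not the paper's exact estimator): inverse-squared-distance weights diverge at training points, so the rule fits the training data exactly while still averaging labels locally elsewhere.

```python
import numpy as np

def interpolating_nn_predict(X, y, x, eps=1e-12):
    """Weighted nearest-neighbor regression with singular weights 1/dist^2:
    the weight blows up at a training point, so the predictor interpolates
    the training labels exactly, yet behaves like a local average off-sample."""
    d2 = np.sum((X - x) ** 2, axis=1)
    hit = d2 < eps
    if hit.any():                      # query coincides with a training point
        return float(y[hit][0])
    w = 1.0 / d2
    return float(w @ y / w.sum())

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 4.0])
print(interpolating_nn_predict(X, y, np.array([1.0])))   # 1.0: interpolates
print(interpolating_nn_predict(X, y, np.array([0.5])))   # a local weighted average
```

Despite fitting every (possibly noisy) label exactly, such singular-weight schemes can still be consistent, which is the phenomenon the paper analyzes.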

  • Reconciling modern machine learning and the bias-variance trade-off

    The question of generalization in machine learning---how algorithms are able to learn predictors from a training sample to make accurate predictions out-of-sample---is revisited in light of the recent breakthroughs in modern machine learning technology. The classical approach to understanding generalization is based on bias-variance trade-offs, where model complexity is carefully calibrated so that the fit on the training sample reflects performance out-of-sample. However, it is now common practice to fit highly complex models like deep neural networks to data with (nearly) zero training error, and yet these interpolating predictors are observed to have good out-of-sample accuracy even for noisy data. How can the classical understanding of generalization be reconciled with these observations from modern machine learning practice? In this paper, we bridge the two regimes by exhibiting a new "double descent" risk curve that extends the traditional U-shaped bias-variance curve beyond the point of interpolation. Specifically, the curve shows that as soon as the model complexity is high enough to achieve interpolation on the training sample---a point that we call the "interpolation threshold"---the risk of suitably chosen interpolating predictors from these models can, in fact, be decreasing as the model complexity increases, often below the risk achieved using non-interpolating models. The double descent risk curve is demonstrated for a broad range of models, including neural networks and random forests, and a mechanism for producing this behavior is posited.

12/28/2018 ∙ by Mikhail Belkin, et al.

  • Two models of double descent for weak features

    The "double descent" risk curve was recently proposed to qualitatively describe the out-of-sample prediction accuracy of variably-parameterized machine learning models. This article provides a precise mathematical analysis for the shape of this curve in two simple data models with the least squares/least norm predictor. Specifically, it is shown that the risk peaks when the number of features p is close to the sample size n, but also that the risk decreases towards its minimum as p increases beyond n. This behavior is contrasted with that of "prescient" models that select features in an a priori optimal order.

03/18/2019 ∙ by Mikhail Belkin, et al.
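The least squares / least norm setup above can be reproduced in a few lines (a hedged sketch with made-up dimensions and noise level, not the paper's exact data models): fit on the first p of N Gaussian features, using ordinary least squares for p ≤ n and the minimum ℓ_2-norm interpolator (via the pseudoinverse) for p > n, and watch the out-of-sample error as p crosses n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 20, 100                         # sample size, total number of features
beta = rng.standard_normal(N) / np.sqrt(N)

Xtr = rng.standard_normal((n, N))
ytr = Xtr @ beta + 0.5 * rng.standard_normal(n)
Xte = rng.standard_normal((1000, N))
yte = Xte @ beta

errs = {}
for p in [5, 10, 19, 20, 21, 40, 100]:
    # pinv gives least squares for p <= n, the least-norm interpolator for p > n.
    w = np.linalg.pinv(Xtr[:, :p]) @ ytr
    errs[p] = np.mean((Xte[:, :p] @ w - yte) ** 2)
print(errs)  # the risk typically spikes near p = n, then falls again for p > n
```

The spike at p ≈ n and the second descent beyond it are exactly the "double descent" shape whose precise form the paper derives in two simple data models.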

  • How many variables should be entered in a principal component regression equation?

We study least squares linear regression over N uncorrelated Gaussian features that are selected in order of decreasing variance. When the number of selected features p is at most the sample size n, the estimator under consideration coincides with the principal component regression estimator; when p>n, the estimator is the least ℓ_2 norm solution over the selected features. We give an average-case analysis of the out-of-sample prediction error as p,n,N →∞ with p/N →α and n/N →β, for some constants α∈ [0,1] and β∈ (0,1). In this average-case setting, the prediction error exhibits a "double descent" shape as a function of p.

06/04/2019 ∙ by Ji Xu, et al.

  • Diameter-based Interactive Structure Search

    In this work, we introduce interactive structure search, a generic framework that encompasses many interactive learning settings, both explored and unexplored. We show that a recently developed active learning algorithm of TD17 can be adapted for interactive structure search, that it can be made noise-tolerant, and that it enjoys favorable convergence rates.

06/05/2019 ∙ by Christopher Tosh, et al.

  • Unbiased estimators for random design regression

In linear regression we wish to estimate the optimum linear least squares predictor for a distribution over d-dimensional input points and real-valued responses, based on a small sample. Under standard random design analysis, where the sample is drawn i.i.d. from the input distribution, the least squares solution for that sample can be viewed as the natural estimator of the optimum. Unfortunately, this estimator almost always incurs an undesirable bias coming from the randomness of the input points. In this paper we show that it is possible to draw a non-i.i.d. sample of input points such that, regardless of the response model, the least squares solution is an unbiased estimator of the optimum. Moreover, this sample can be produced efficiently by augmenting a previously drawn i.i.d. sample with an additional set of d points drawn jointly from the input distribution rescaled by the squared volume spanned by the points. Motivated by this, we develop a theoretical framework for studying volume-rescaled sampling, and in the process prove a number of new matrix expectation identities. We use them to show that for any input distribution and ϵ>0 there is a random design consisting of O(d log d + d/ϵ) points from which an unbiased estimator can be constructed whose square loss over the entire distribution is with high probability bounded by 1+ϵ times the loss of the optimum. We provide efficient algorithms for generating such unbiased estimators in a number of practical settings and support our claims experimentally.

07/08/2019 ∙ by Michał Dereziński, et al.
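The bias from the randomness of the input points is easy to see in a one-dimensional Monte Carlo experiment (an illustrative sketch with a made-up response model, not from the paper): take a misspecified response y = x² with x uniform on (0,1); the optimum through-origin slope is w* = E[xy]/E[x²] = 3/4, but the sample least squares slope is a ratio of random sums and its expectation differs from 3/4 at small sample sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Misspecified response: y = x^2, x ~ Uniform(0, 1).
# Best through-origin linear predictor: w* = E[x y] / E[x^2] = (1/4)/(1/3) = 3/4.
w_star = 0.75
n, trials = 2, 200_000
x = rng.uniform(0.0, 1.0, size=(trials, n))
y = x ** 2
# Per-trial least squares slope: a ratio of sums, hence biased under random design.
w_hat = (x * y).sum(axis=1) / (x * x).sum(axis=1)
print(w_hat.mean(), "vs optimum", w_star)  # Monte Carlo mean differs from 3/4
```

The paper's volume-rescaled augmentation removes exactly this kind of bias, for any response model.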

  • Benefits of over-parameterization with EM

Expectation Maximization (EM) is among the most popular algorithms for maximum likelihood estimation, but it is generally only guaranteed to find a stationary point of the log-likelihood objective. The goal of this article is to present theoretical and empirical evidence that over-parameterization can help EM avoid spurious local optima in the log-likelihood. We consider the problem of estimating the mean vectors of a Gaussian mixture model in a scenario where the mixing weights are known. Our study shows that the global behavior of EM, when one uses an over-parameterized model in which the mixing weights are treated as unknown, is better than that when one uses the (correct) model with the mixing weights fixed to the known values. For symmetric Gaussian mixtures with two components, we prove that introducing the (statistically redundant) weight parameters enables EM to find the global maximizer of the log-likelihood starting from almost any initial mean parameters, whereas EM without this over-parameterization may very often fail. For other Gaussian mixtures, we provide empirical evidence that shows similar behavior. Our results corroborate the value of over-parameterization in solving non-convex optimization problems, previously observed in other domains.

10/26/2018 ∙ by Ji Xu, et al.
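The over-parameterization in question is concrete: even when the mixing weights are known to be 1/2 each, treat them as a free parameter in EM. A minimal one-dimensional sketch (hypothetical function name, unit variances assumed, not the authors' code):

```python
import numpy as np

def em_gmm2(x, mu1, mu2, w=0.5, fit_weights=True, iters=200):
    """EM for a two-component 1-d Gaussian mixture with unit variances.
    With fit_weights=True the mixing weight is a free (over-parameterized)
    variable even if the true weights are known to be 1/2 each."""
    phi = lambda z: np.exp(-0.5 * z * z) / np.sqrt(2 * np.pi)
    for _ in range(iters):
        a, b = w * phi(x - mu1), (1 - w) * phi(x - mu2)
        r = a / (a + b)                              # E-step: responsibilities
        mu1 = (r @ x) / r.sum()                      # M-step: mean updates
        mu2 = ((1 - r) @ x) / (1 - r).sum()
        if fit_weights:
            w = r.mean()                             # the redundant weight update
    return mu1, mu2, w

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(3.0, 1.0, 500), rng.normal(-3.0, 1.0, 500)])
m1, m2, w = em_gmm2(x, mu1=0.5, mu2=-0.5)
print(m1, m2, w)
```

Running with `fit_weights=False` keeps the weights pinned at 1/2; the paper's point is that the run with the redundant weight parameter escapes bad basins from far more initializations.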

  • A cryptographic approach to black box adversarial machine learning

    We propose an ensemble technique for converting any classifier into a computationally secure classifier. We define a simpler security problem for random binary classifiers and prove a reduction from this model to the security of the overall ensemble classifier. We provide experimental evidence of the security of our random binary classifiers, as well as empirical results of the adversarial accuracy of the overall ensemble to black-box attacks. Our construction crucially leverages hidden randomness in the multiclass-to-binary reduction.

06/07/2019 ∙ by Kevin Shi, et al.
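The "hidden randomness in the multiclass-to-binary reduction" can be sketched as a random error-correcting output code whose codebook is kept secret from the attacker (an illustrative toy, with made-up sizes, not the paper's construction): each class gets a secret random codeword, each binary classifier predicts one bit, and decoding goes to the nearest codeword in Hamming distance.

```python
import numpy as np

def make_random_code(n_classes, n_bits, rng):
    """One secret random binary codeword per class; hidden from the attacker."""
    return rng.integers(0, 2, size=(n_classes, n_bits))

def ensemble_predict(bit_predictions, code):
    """Decode the ensemble's bit outputs to the nearest codeword in
    Hamming distance, mimicking a randomized multiclass-to-binary reduction."""
    dists = np.abs(code - bit_predictions).sum(axis=1)
    return int(np.argmin(dists))

rng = np.random.default_rng(0)
code = make_random_code(n_classes=10, n_bits=64, rng=rng)
# Suppose the per-bit classifiers output class 7's codeword with a few bit flips.
noisy = code[7].copy()
noisy[:3] ^= 1
print(ensemble_predict(noisy, code))  # recovers class 7 despite the flipped bits
```

Because random 64-bit codewords are far apart in Hamming distance, a few corrupted bits do not change the decoded class; the paper's security argument reduces attacks on the ensemble to attacks on the individual hidden binary problems.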

A gradual, semi-discrete approach to generative network training via explicit Wasserstein minimization

    This paper provides a simple procedure to fit generative networks to target distributions, with the goal of a small Wasserstein distance (or other optimal transport cost). The approach is based on two principles: (a) if the source randomness of the network is a continuous distribution (the "semi-discrete" setting), then the Wasserstein distance is realized by a deterministic optimal transport mapping; (b) given an optimal transport mapping between a generator network and a target distribution, the Wasserstein distance may be decreased via a regression between the generated data and the mapped target points. The procedure here therefore alternates these two steps, forming an optimal transport and regressing against it, gradually adjusting the generator network towards the target distribution. Mathematically, this approach is shown to minimize the Wasserstein distance to both the empirical target distribution, and also its underlying population counterpart. Empirically, good performance is demonstrated on the training and testing sets of the MNIST and Thin-8 data. The paper closes with a discussion of the unsuitability of the Wasserstein distance for certain tasks, as has been identified in prior work [Arora et al., 2017, Huang et al., 2017].

06/08/2019 ∙ by Yucheng Chen, et al.
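The alternation of steps (a) and (b) can be sketched on a toy problem (a hedged simplification: a linear generator, equal-size discrete point sets matched exactly by the Hungarian method instead of the paper's semi-discrete transport):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
# Toy affine "generator": latent noise (plus a bias column) times a weight matrix.
# An illustrative stand-in for a neural generator, not the paper's model.
Z = np.hstack([rng.standard_normal((128, 2)), np.ones((128, 1))])
target = 0.5 * rng.standard_normal((128, 2)) + np.array([2.0, -1.0])
W = np.zeros((3, 2))

for _ in range(5):
    G = Z @ W
    # (a) optimal transport between generated and target points; for equal-size
    # discrete sets this is an assignment problem, solved exactly here.
    cost = ((G[:, None, :] - target[None, :, :]) ** 2).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)
    # (b) regress the generated points toward their matched target points.
    W, *_ = np.linalg.lstsq(Z[rows], target[cols], rcond=None)

final_cost = cost[rows, cols].mean()
print(final_cost)  # the transport cost shrinks over the alternation
```

Each regression step can only decrease the cost of the current matching, which is the mechanism behind principle (b) in the abstract.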

  • Time Series Compression Based on Adaptive Piecewise Recurrent Autoencoder

Time series account for a large proportion of the data stored in financial, medical and scientific databases. The efficient storage of time series is important in practical applications. In this paper, we propose a novel compression scheme for time series. The encoder and decoder are both composed of recurrent neural networks (RNNs) such as long short-term memory (LSTM). There is an autoencoder between the encoder and decoder, which encodes the hidden state and input together and decodes them at the decoder side. Moreover, we pre-process the original time series by partitioning it into segments of varying length that have similar total variation. The experimental study shows that the proposed algorithm can achieve a competitive compression ratio on real-world time series.

07/23/2017 ∙ by Daniel Hsu, et al.
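The pre-processing step, partitioning the series into segments of similar total variation, can be sketched as a simple quantile split of the cumulative variation (a simplified illustration with a hypothetical helper name, not the paper's exact procedure):

```python
import numpy as np

def tv_segments(x, k):
    """Partition a series into k contiguous segments with roughly equal
    total variation (sum of |x[t+1] - x[t]|): flat stretches yield long
    segments, volatile stretches short ones."""
    tv = np.concatenate([[0.0], np.cumsum(np.abs(np.diff(x)))])
    # Cut where the cumulative variation crosses each k-quantile of the total.
    bounds = [int(np.searchsorted(tv, tv[-1] * j / k)) for j in range(k + 1)]
    bounds[0], bounds[-1] = 0, len(x)
    return [x[bounds[j]:bounds[j + 1]] for j in range(k)]

# A flat stretch followed by an oscillating one: the oscillating part is
# split into shorter segments because its total variation is larger.
x = np.concatenate([np.zeros(50), np.sin(np.linspace(0, 20, 50))])
segs = tv_segments(x, 4)
print([len(s) for s in segs])
```

Each segment then carries a comparable amount of "signal activity" into the RNN encoder, regardless of its raw length.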