
A Theoretical Analysis of Contrastive Unsupervised Representation Learning
Recent empirical works have successfully used unlabeled data to learn feature representations that are broadly useful in downstream classification tasks. Several of these methods are reminiscent of the wellknown word2vec embedding algorithm: leveraging availability of pairs of semantically "similar" data points and "negative samples," the learner forces the inner product of representations of similar pairs with each other to be higher on average than with negative samples. The current paper uses the term contrastive learning for such algorithms and presents a theoretical framework for analyzing them by introducing latent classes and hypothesizing that semantically similar points are sampled from the same latent class. This framework allows us to show provable guarantees on the performance of the learned representations on the average classification task that is comprised of a subset of the same set of latent classes. Our generalization bound also shows that learned representations can reduce (labeled) sample complexity on downstream tasks. We conduct controlled experiments in both the text and image domains to support the theory.
02/25/2019 ∙ by Sanjeev Arora, et al. ∙ 34 ∙ shareread it

Theoretical Analysis of Auto RateTuning by Batch Normalization
Batch Normalization (BN) has become a cornerstone of deep learning across diverse architectures, appearing to help optimization as well as generalization. While the idea makes intuitive sense, theoretical analysis of its effectiveness has been lacking. Here theoretical support is provided for one of its conjectured properties, namely, the ability to allow gradient descent to succeed with less tuning of learning rates. It is shown that even if we fix the learning rate of scaleinvariant parameters (e.g., weights of each layer with BN) to a constant (say, 0.3), gradient descent still approaches a stationary point (i.e., a solution where gradient is zero) in the rate of T^1/2 in T iterations, asymptotically matching the best bound for gradient descent with welltuned learning rates. A similar result with convergence rate T^1/4 is also shown for stochastic gradient descent.
12/10/2018 ∙ by Sanjeev Arora, et al. ∙ 20 ∙ shareread it

FineGrained Analysis of Optimization and Generalization for Overparameterized TwoLayer Neural Networks
Recent works have cast some light on the mystery of why deep nets fit any data and generalize despite being very overparametrized. This paper analyzes training and generalization for a simple 2layer ReLU net with random initialization, and provides the following improvements over recent works: (i) Using a tighter characterization of training speed than recent papers, an explanation for why training a neural net with random labels leads to slower training, as originally observed in [Zhang et al. ICLR'17]. (ii) Generalization bound independent of network size, using a datadependent complexity measure. Our measure distinguishes clearly between random labels and true labels on MNIST and CIFAR, as shown by experiments. Moreover, recent papers require sample complexity to increase (slowly) with the size, while our sample complexity is completely independent of the network size. (iii) Learnability of a broad class of smooth functions by 2layer ReLU nets trained via gradient descent. The key idea is to track dynamics of training and generalization via properties of a related kernel.
01/24/2019 ∙ by Sanjeev Arora, et al. ∙ 16 ∙ shareread it

A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks
We analyze speed of convergence to global optimum for gradient descent training a deep linear neural network (parameterized as x W_N ... W_1x) by minimizing the ℓ_2 loss over whitened data. Convergence at a linear rate is guaranteed when the following hold: (i) dimensions of hidden layers are at least the minimum of the input and output dimensions; (ii) weight matrices at initialization are approximately balanced; and (iii) the initial loss is smaller than the loss of any rankdeficient solution. The assumptions on initialization (conditions (ii) and (iii)) are necessary, in the sense that violating any one of them may lead to convergence failure. Moreover, in the important case of output dimension 1, i.e. scalar regression, they are met, and thus convergence to global optimum holds, with constant probability under a random initialization scheme. Our results significantly extend previous analyses, e.g., of deep linear residual networks (Bartlett et al., 2018).
10/04/2018 ∙ by Sanjeev Arora, et al. ∙ 8 ∙ shareread it

Implicit Regularization in Deep Matrix Factorization
Efforts to understand the generalization mystery in deep learning have led to the belief that gradientbased optimization induces a form of implicit regularization, a bias towards models of low "complexity." We study the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing  a model referred to as deep matrix factorization. Our first finding, supported by theory and experiments, is that adding depth to a matrix factorization enhances an implicit tendency towards lowrank solutions, oftentimes leading to more accurate recovery. Secondly, we present theoretical and empirical arguments questioning a nascent view by which implicit regularization in matrix factorization can be captured using simple mathematical norms. Our results point to the possibility that the language of standard regularizers may not be rich enough to fully encompass the implicit regularization brought forth by gradientbased optimization.
05/31/2019 ∙ by Sanjeev Arora, et al. ∙ 4 ∙ shareread it

A Simple Saliency Method That Passes the Sanity Checks
There is great interest in *saliency methods* (also called *attribution methods*), which give "explanations" for a deep net's decision, by assigning a *score* to each feature/pixel in the input. Their design usually involves creditassignment via the gradient of the output with respect to input. Recently Adebayo et al. [arXiv:1810.03292] questioned the validity of many of these methods since they do not pass simple *sanity checks* which test whether the scores shift/vanish when layers of the trained net are randomized, or when the net is retrained using random labels for inputs. We propose a simple fix to existing saliency methods that helps them pass sanity checks, which we call *competition for pixels*. This involves computing saliency maps for all possible labels in the classification task, and using a simple competition among them to identify and remove less relevant pixels from the map. The simplest variant of this is *Competitive Gradient Input (CGI)*: it is efficient, requires no additional training, and uses only the input and gradient. Some theoretical justification is provided for it (especially for ReLU networks) and its performance is empirically demonstrated.
05/27/2019 ∙ by Arushi Gupta, et al. ∙ 4 ∙ shareread it

Explaining Landscape Connectivity of Lowcost Solutions for Multilayer Nets
Mode connectivity is a surprising phenomenon in the loss landscape of deep nets. Optimaat least those discovered by gradientbased optimizationturn out to be connected by simple paths on which the loss function is almost constant. Often, these paths can be chosen to be piecewise linear, with as few as two segments. We give mathematical explanations for this phenomenon, assuming generic properties (such as dropout stability and noise stability) of welltrained deep nets, which have previously been identified as part of understanding the generalization properties of deep nets. Our explanation holds for realistic multilayer nets, and experiments are presented to verify the theory.
06/14/2019 ∙ by Rohith Kuditipudi, et al. ∙ 3 ∙ shareread it

Theoretical limitations of EncoderDecoder GAN architectures
Encoderdecoder GANs architectures (e.g., BiGAN and ALI) seek to add an inference mechanism to the GANs setup, consisting of a small encoder deep net that maps datapoints to their succinct encodings. The intuition is that being forced to train an encoder alongside the usual generator forces the system to learn meaningful mappings from the code to the datapoint and viceversa, which should improve the learning of the target distribution and ameliorate modecollapse. It should also yield meaningful codes that are useful as features for downstream tasks. The current paper shows rigorously that even on reallife distributions of images, the encodedecoder GAN training objectives (a) cannot prevent mode collapse; i.e. the objective can be nearoptimal even when the generated distribution has low and finite support (b) cannot prevent learning meaningless codes for data  essentially white noise. Thus if encoderdecoder GANs do indeed work then it must be due to reasons as yet not understood, since the training objective can be low even for meaningless solutions.
11/07/2017 ∙ by Sanjeev Arora, et al. ∙ 0 ∙ shareread it

Provable benefits of representation learning
There is general consensus that learning representations is useful for a variety of reasons, e.g. efficient use of labeled data (semisupervised learning), transfer learning and understanding hidden structure of data. Popular techniques for representation learning include clustering, manifold learning, kernellearning, autoencoders, Boltzmann machines, etc. To study the relative merits of these techniques, it's essential to formalize the definition and goals of representation learning, so that they are all become instances of the same definition. This paper introduces such a formal framework that also formalizes the utility of learning the representation. It is related to previous Bayesian notions, but with some new twists. We show the usefulness of our framework by exhibiting simple and natural settings  linear mixture models and loglinear models, where the power of representation learning can be formally shown. In these examples, representation learning can be performed provably and efficiently under plausible assumptions (despite being NPhard), and furthermore: (i) it greatly reduces the need for labeled data (semisupervised learning) and (ii) it allows solving classification tasks when simpler approaches like nearest neighbors require too much data (iii) it is more powerful than manifold learning methods.
06/14/2017 ∙ by Sanjeev Arora, et al. ∙ 0 ∙ shareread it

Generalization and Equilibrium in Generative Adversarial Nets (GANs)
We show that training of generative adversarial network (GAN) may not have good generalization properties; e.g., training may appear successful but the trained distribution may be far from target distribution in standard metrics. However, generalization does occur for a weaker metric called neural net distance. It is also shown that an approximate pure equilibrium exists in the discriminator/generator game for a special class of generators with natural training objectives when generator capacity and training set sizes are moderate. This existence of equilibrium inspires MIX+GAN protocol, which can be combined with any existing GAN training, and empirically shown to improve some of them.
03/02/2017 ∙ by Sanjeev Arora, et al. ∙ 0 ∙ shareread it

Provable learning of Noisyor Networks
Many machine learning applications use latent variable models to explain structure in data, whereby visible variables (= coordinates of the given datapoint) are explained as a probabilistic function of some hidden variables. Finding parameters with the maximum likelihood is NPhard even in very simple settings. In recent years, provably efficient algorithms were nevertheless developed for models with linear structures: topic models, mixture models, hidden markov models, etc. These algorithms use matrix or tensor decomposition, and make some reasonable assumptions about the parameters of the underlying model. But matrix or tensor decomposition seems of little use when the latent variable model has nonlinearities. The current paper shows how to make progress: tensor decomposition is applied for learning the singlelayer noisy or network, which is a textbook example of a Bayes net, and used for example in the classic QMRDT software for diagnosing which disease(s) a patient may have by observing the symptoms he/she exhibits. The technical novelty here, which should be useful in other settings in future, is analysis of tensor decomposition in presence of systematic error (i.e., where the noise/error is correlated with the signal, and doesn't decrease as number of samples goes to infinity). This requires rethinking all steps of tensor decomposition methods from the ground up. For simplicity our analysis is stated assuming that the network parameters were chosen from a probability distribution but the method seems more generally applicable.
12/28/2016 ∙ by Sanjeev Arora, et al. ∙ 0 ∙ shareread it