
Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond
We study the eigenvalues of the Hessian of a loss function before and after training. The eigenvalue distribution is seen to be composed of two parts: the bulk, which is concentrated around zero, and the edges, which are scattered away from zero. We present empirical evidence that the bulk indicates how overparametrized the system is, while the edges depend on the input data.
11/22/2016 ∙ by Levent Sagun, et al.
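The bulk-at-zero phenomenon has a simple analogue for overparametrized linear least squares, where the Hessian is available in closed form. A minimal sketch (an illustration, not the paper's experiments):

```python
import numpy as np

# For the squared loss (1/2m)||Xw - y||^2 the Hessian is X^T X / m.
# With more parameters d than samples m, the Hessian has rank at most m,
# so at least d - m eigenvalues sit exactly at zero: a toy version of
# the "bulk" concentrated around zero, with the rest forming the "edges".
rng = np.random.default_rng(0)
m, d = 20, 100                       # 20 samples, 100 parameters
X = rng.standard_normal((m, d))
H = X.T @ X / m                      # Hessian of the quadratic loss
eigs = np.linalg.eigvalsh(H)
bulk = int(np.sum(np.abs(eigs) < 1e-10))   # eigenvalues in the zero bulk
print(bulk)                          # at least d - m = 80
```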

Diagonal Rescaling For Neural Networks
We define a second-order neural network stochastic gradient training algorithm whose block-diagonal structure effectively amounts to normalizing the unit activations. Investigating why this algorithm lacks robustness then reveals two interesting insights. The first insight suggests a new way to scale the step sizes, clarifying popular algorithms such as RMSProp as well as old neural network tricks such as fan-in step-size scaling. The second insight stresses the practical importance of dealing with fast changes of the curvature of the cost.
05/25/2017 ∙ by Jean Lafond, et al.
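As a concrete instance of the per-coordinate step-size scaling the first insight refers to, here is a minimal RMSProp-style update on a badly scaled diagonal quadratic (a standard sketch under assumed settings, not the paper's block-diagonal algorithm):

```python
import numpy as np

# RMSProp-style diagonal rescaling: each coordinate's step is divided by
# a running root-mean-square of its own gradients, so well-conditioned
# and ill-conditioned directions make comparable progress.
def rmsprop_step(w, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    cache = decay * cache + (1 - decay) * grad**2   # running second moment
    w = w - lr * grad / (np.sqrt(cache) + eps)      # per-coordinate scaling
    return w, cache

# Minimize f(w) = 0.5 * w^T D w with curvatures spanning four decades.
D = np.array([100.0, 1.0, 0.01])
w = np.ones(3)
cache = np.zeros(3)
for _ in range(500):
    w, cache = rmsprop_step(w, D * w, cache)        # gradient of f is D w
```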

Wasserstein GAN
We introduce a new algorithm named WGAN, an alternative to traditional GAN training. In this new model, we show that we can improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches. Furthermore, we show that the corresponding optimization problem is sound, and provide extensive theoretical work highlighting the deep connections to other distances between distributions.
01/26/2017 ∙ by Martin Arjovsky, et al.
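The critic objective at the heart of WGAN can be illustrated in one dimension with a linear critic and weight clipping to enforce the Lipschitz constraint (a toy sketch with assumed data, not the paper's experiments):

```python
import numpy as np

# WGAN trains a critic f to maximize E[f(real)] - E[f(fake)] subject to
# f being 1-Lipschitz, here enforced by clipping the critic weight.
# With a linear critic f(x) = w * x, the gradient of the objective in w
# is just the difference of sample means.
rng = np.random.default_rng(1)
real = rng.normal(3.0, 1.0, 1000)   # "data" distribution
fake = rng.normal(0.0, 1.0, 1000)   # "generator" distribution
w, c, lr = 0.0, 1.0, 0.1            # critic weight, clip bound, step size
for _ in range(100):
    grad = real.mean() - fake.mean()      # d/dw of E[w*real] - E[w*fake]
    w = np.clip(w + lr * grad, -c, c)     # gradient ascent + weight clipping
distance = w * (real.mean() - fake.mean())  # critic's distance estimate
```

Once the weight saturates at the clip bound, the estimate is proportional to the gap between the two means, which is what makes the resulting learning curve meaningful as the fake distribution approaches the real one.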

Towards Principled Methods for Training Generative Adversarial Networks
The goal of this paper is not to introduce a single algorithm or method, but to make theoretical steps towards fully understanding the training dynamics of generative adversarial networks. In order to substantiate our theoretical analysis, we perform targeted experiments to verify our assumptions, illustrate our claims, and quantify the phenomena. This paper is divided into three sections. The first section introduces the problem at hand. The second section is dedicated to rigorously studying and proving the problems, including instability and saturation, that arise when training generative adversarial networks. The third section examines a practical and theoretically grounded direction towards solving these problems, while introducing new tools to study them.
01/17/2017 ∙ by Martin Arjovsky, et al.

Optimization Methods for Large-Scale Machine Learning
This paper provides a review and commentary on the past, present, and future of numerical optimization algorithms in the context of machine learning applications. Through case studies on text classification and the training of deep neural networks, we discuss how optimization problems arise in machine learning and what makes them challenging. A major theme of our study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient (SG) method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter. Based on this viewpoint, we present a comprehensive theory of a straightforward, yet versatile SG algorithm, discuss its practical behavior, and highlight opportunities for designing algorithms with improved performance. This leads to a discussion about the next generation of optimization methods for large-scale machine learning, including an investigation of two main streams of research on techniques that diminish noise in the stochastic directions and methods that make use of second-order derivative approximations.
06/15/2016 ∙ by Leon Bottou, et al.
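A minimal sketch of the basic SG iteration the survey analyzes, on a noiseless least-squares problem (an illustrative setup, not from the paper):

```python
import numpy as np

# Stochastic gradient: at each step, update with the gradient of a single
# randomly drawn example's loss rather than the gradient of the full sum.
rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true                        # noiseless targets

w = np.zeros(d)
lr = 0.01
for _ in range(5000):
    i = rng.integers(n)               # sample one example uniformly
    grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5 * (x_i . w - y_i)^2
    w -= lr * grad
```

Each iteration costs O(d) instead of O(nd) for the full gradient, which is the computational trade-off that makes SG dominant at scale.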

ICE: Enabling Non-Experts to Build Models Interactively for Large-Scale Lopsided Problems
Quick interaction between a human teacher and a learning machine presents numerous benefits and challenges when working with web-scale data. The human teacher guides the machine towards accomplishing the task of interest. The learning machine leverages big data to find examples that maximize the training value of its interaction with the teacher. When the teacher is restricted to labeling examples selected by the machine, this problem is an instance of active learning. When the teacher can provide additional information to the machine (e.g., suggestions on what examples or predictive features should be used) as the learning task progresses, then the problem becomes one of interactive learning. To accommodate the two-way communication channel needed for efficient interactive learning, the teacher and the machine need an environment that supports an interaction language. The machine can access, process, and summarize more examples than the teacher can see in a lifetime. Based on the machine's output, the teacher can revise the definition of the task or make it more precise. Both the teacher and the machine continuously learn and benefit from the interaction. We have built a platform to (1) produce valuable and deployable models and (2) support research on both the machine learning and user interface challenges of the interactive learning problem. The platform relies on a dedicated, low-latency, distributed, in-memory architecture that allows us to construct web-scale learning machines with quick interaction speed. The purpose of this paper is to describe this architecture and demonstrate how it supports our research efforts. Preliminary results are presented as illustrations of the architecture but are not the primary focus of the paper.
09/16/2014 ∙ by Patrice Simard, et al.

Unifying distillation and privileged information
Distillation (Hinton et al., 2015) and privileged information (Vapnik & Izmailov, 2015) are two techniques that enable machines to learn from other machines. This paper unifies these two techniques into generalized distillation, a framework to learn from multiple machines and data representations. We provide theoretical and causal insight about the inner workings of generalized distillation, extend it to unsupervised, semi-supervised and multi-task learning scenarios, and illustrate its efficacy on a variety of numerical simulations on both synthetic and real-world data.
11/11/2015 ∙ by David Lopez-Paz, et al.
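The distillation half of the framework can be sketched in a few lines: the student matches the teacher's temperature-softened output distribution rather than hard labels (the logit values and temperature below are made up for illustration):

```python
import numpy as np

# Distillation (Hinton et al., 2015): divide the teacher's logits by a
# temperature T > 1 before the softmax, so the "dark knowledge" in the
# small probabilities becomes visible to the student.
def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

teacher_logits = np.array([5.0, 2.0, 0.1])   # hypothetical teacher output
T = 4.0                                      # assumed temperature
soft_targets = softmax(teacher_logits, T)    # smoother than the T=1 output

student_logits = np.array([1.0, 0.5, 0.2])   # hypothetical student output
# Cross-entropy of the student against the teacher's soft targets:
loss = -np.sum(soft_targets * np.log(softmax(student_logits, T)))
```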

No Regret Bound for Extreme Bandits
Algorithms for hyperparameter optimization abound, all of which work well under different and often unverifiable assumptions. Motivated by the general challenge of sequentially choosing which algorithm to use, we study the more specific task of choosing among distributions to use for random hyperparameter optimization. This work is naturally framed in the extreme bandit setting, which deals with sequentially choosing which distribution from a collection to sample in order to minimize (maximize) the single best cost (reward). Whereas the distributions in the standard bandit setting are primarily characterized by their means, a number of subtleties arise when we care about the minimal cost as opposed to the average cost. For example, there may not be a well-defined "best" distribution as there is in the standard bandit setting. The best distribution depends on the rewards that have been obtained and on the remaining time horizon. Whereas in the standard bandit setting, it is sensible to compare policies with an oracle which plays the single best arm, in the extreme bandit setting, there are multiple sensible oracle models. We define a sensible notion of "extreme regret" in the extreme bandit setting, which parallels the concept of regret in the standard bandit setting. We then prove that no policy can asymptotically achieve no extreme regret.
08/12/2015 ∙ by Robert Nishihara, et al.
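The extreme-bandit criterion is easy to state in code: a policy is scored by the single smallest cost it ever observes, not by its average cost (a toy round-robin policy over two assumed cost distributions, purely for illustration):

```python
import numpy as np

# Extreme bandits: at each round, pick one distribution to sample, and
# judge the whole run by the minimum cost seen so far. Note the arm with
# the better *mean* need not be the arm with the better *minimum*.
rng = np.random.default_rng(2)
arms = [
    lambda: rng.exponential(1.0),    # mean 1, but left tail reaches 0
    lambda: rng.uniform(0.5, 1.5),   # mean 1, bounded away from 0
]
best_cost = np.inf
for t in range(1000):
    k = t % len(arms)                # naive round-robin policy
    cost = arms[k]()
    best_cost = min(best_cost, cost) # extreme (min-cost) criterion
```

Here both arms have the same mean, yet only the exponential arm can produce costs near zero, so the minimum-cost objective eventually favors it.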

Discovering Causal Signals in Images
This paper establishes the existence of observable footprints that reveal the "causal dispositions" of the object categories appearing in collections of images. We achieve this goal in two steps. First, we take a learning approach to observational causal discovery, and build a classifier that achieves state-of-the-art performance on finding the causal direction between pairs of random variables, given samples from their joint distribution. Second, we use our causal direction classifier to effectively distinguish between features of objects and features of their contexts in collections of static images. Our experiments demonstrate the existence of a relation between the direction of causality and the difference between objects and their contexts, and by the same token, the existence of observable signals that reveal the causal dispositions of objects.
05/26/2016 ∙ by David Lopez-Paz, et al.

Counterfactual Reasoning and Learning Systems
This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select changes that improve both the short-term and long-term performance of such systems. This work is illustrated by experiments carried out on the ad placement system associated with the Bing search engine.
09/11/2012 ∙ by Leon Bottou, et al.

A Lower Bound for the Optimization of Finite Sums
This paper presents a lower bound for optimizing a finite sum of n functions, where each function is L-smooth and the sum is μ-strongly convex. We show that no algorithm can reach an error ϵ in minimizing all functions from this class in fewer than Ω(n + √(n(κ−1)) log(1/ϵ)) iterations, where κ=L/μ is a surrogate condition number. We then compare this lower bound to upper bounds for recently developed methods specializing to this setting. When the functions involved in this sum are not arbitrary, but based on i.i.d. random data, then we further contrast these complexity results with those for optimal first-order methods to directly optimize the sum. The conclusion we draw is that a lot of caution is necessary for an accurate comparison, and we identify machine learning scenarios where the new methods help computationally.
10/02/2014 ∙ by Alekh Agarwal, et al.
Leon Bottou
After receiving the Diplôme d'Ingénieur from the École Polytechnique (X84) in 1987, the Master of Mathematics, Applied Mathematics and Computer Science from the École Normale Supérieure in 1988, and a PhD in computer science from the University of Paris-Sud in 1991, I went to AT&T Bell Laboratories, AT&T Labs, NEC Labs America, and Microsoft Research. I joined Facebook AI Research in March 2015.