
The Difficulty of Training Sparse Neural Networks
We investigate the difficulties of training sparse neural networks and make new observations about optimization dynamics and the energy landscape within the sparse regime. Recent work (Gale et al., 2019; Liu et al., 2018) has shown that sparse ResNet-50 architectures trained on the ImageNet-2012 dataset converge to solutions that are significantly worse than those found by pruning. We show that, despite the failure of optimizers, there is a linear path with a monotonically decreasing objective from the initialization to the "good" solution. Additionally, our attempts to find a decreasing objective path from "bad" solutions to the "good" ones in the sparse subspace fail. However, if we allow the path to traverse the dense subspace, then we consistently find a path between the two solutions. These findings suggest that traversing extra dimensions may be needed to escape stationary points found in the sparse subspace.
06/25/2019 ∙ by Utku Evci, et al.
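The linear-path observation is easy to probe on a toy problem. The sketch below is an illustration, not the paper's setup: the convex quadratic loss, the fixed mask, and all variable names are assumptions. It interpolates linearly between a sparse initialization and a sparse minimizer and checks that the objective decreases monotonically along the path:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: a convex quadratic restricted to a fixed sparsity mask
# (the "sparse subspace").
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)
mask = np.array([True] * 5 + [False] * 5)       # fixed sparsity pattern

def loss(w):
    return 0.5 * np.sum((A @ (w * mask) - b) ** 2)

w_init = rng.standard_normal(10) * mask          # sparse initialization
w_star = np.zeros(10)
w_star[mask] = np.linalg.lstsq(A[:, mask], b, rcond=None)[0]  # "good" solution

# Evaluate the objective along the straight line from init to solution.
ts = np.linspace(0.0, 1.0, 50)
path = [loss((1 - t) * w_init + t * w_star) for t in ts]
monotone = all(nxt <= cur + 1e-9 for cur, nxt in zip(path, path[1:]))
```

For the convex toy problem monotonicity is guaranteed; the paper's finding is that the same interpolation, evaluated on the training loss of a non-convex sparse network, is still monotonically decreasing from initialization to the "good" solution.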

Information matrices and generalization
This work revisits the use of information criteria to characterize the generalization of deep learning models. In particular, we empirically demonstrate the effectiveness of the Takeuchi information criterion (TIC), an extension of the Akaike information criterion (AIC) for misspecified models, in estimating the generalization gap, shedding light on why quantities such as the number of parameters cannot quantify generalization. The TIC depends on both the Hessian of the loss H and the covariance of the gradients C. By exploring the similarities and differences between these two matrices as well as the Fisher information matrix F, we study the interplay between noise and curvature in deep models. We also address the question of whether C is a reasonable approximation to F, as is commonly assumed.
06/18/2019 ∙ by Valentin Thomas, et al.
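The TIC complexity term can be computed directly for a simple model. The sketch below is illustrative only: the heteroscedastic data-generating process and all names are assumptions. It estimates tr(C H^{-1})/n for a misspecified least-squares fit, with H the Hessian of the average loss and C the (uncentered) covariance of per-example gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3

# Misspecified model: linear fit under squared loss, but the noise is
# heteroscedastic, so a plain parameter count misjudges complexity.
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n) * (1 + np.abs(X[:, 0]))

w = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = X @ w - y

grads = X * residuals[:, None]        # per-example gradients of 0.5*(x.w - y)^2
H = X.T @ X / n                       # Hessian of the average loss
C = grads.T @ grads / n               # gradient covariance

# TIC-style gap estimate: tr(C H^{-1}) / n.  For a well-specified model
# C is proportional to H, and the trace reduces to a quantity
# proportional to the number of parameters, recovering the AIC picture.
tic_penalty = np.trace(C @ np.linalg.inv(H)) / n
```

This makes the abstract's point concrete: the penalty depends on the interplay between curvature (H) and gradient noise (C), not on the parameter count alone.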

Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization
Due to their simplicity and excellent performance, parallel asynchronous variants of stochastic gradient descent have become popular methods to solve a wide range of large-scale optimization problems on multi-core architectures. Yet, despite their practical success, support for nonsmooth objectives is still lacking, making them unsuitable for many problems of interest in machine learning, such as the Lasso, group Lasso or empirical risk minimization with convex constraints. In this work, we propose and analyze ProxASAGA, a fully asynchronous sparse method inspired by SAGA, a variance reduced incremental gradient algorithm. The proposed method is easy to implement and significantly outperforms the state of the art on several nonsmooth, large-scale problems. We prove that our method achieves a theoretical linear speedup with respect to the sequential version under assumptions on the sparsity of gradients and block-separability of the proximal term. Empirical benchmarks on a multi-core architecture illustrate practical speedups of up to 12x on a 20-core machine.
07/20/2017 ∙ by Fabian Pedregosa, et al.
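A minimal sequential sketch of the underlying update may help: ProxASAGA runs this loop asynchronously across cores, while the function names, the Lasso instance, and the step-size choice below are assumptions for illustration:

```python
import numpy as np

def prox_l1(x, t):
    """Soft thresholding: proximal operator of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_saga(X, y, lam, step, n_epochs=100, seed=0):
    """Sequential proximal SAGA for min_w 1/(2n)||Xw - y||^2 + lam*||w||_1."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    grad_mem = np.zeros((n, d))    # memory of past per-sample gradients
    grad_avg = np.zeros(d)         # running average of the memory
    for _ in range(n_epochs * n):
        i = rng.integers(n)
        g_new = X[i] * (X[i] @ w - y[i])
        v = g_new - grad_mem[i] + grad_avg     # variance-reduced direction
        grad_avg += (g_new - grad_mem[i]) / n
        grad_mem[i] = g_new
        w = prox_l1(w - step * v, step * lam)  # proximal (soft-threshold) step
    return w
```

SAGA theory suggests a step size on the order of 1/(3L) with L = max_i ||x_i||^2; the asynchronous version additionally exploits the sparsity of the x_i and the block-separability of the proximal term, as stated in the abstract.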

On the convergence rate of the three operator splitting scheme
The three operator splitting scheme was recently proposed by [Davis and Yin, 2015] as a method to optimize composite objective functions with one convex smooth term and two convex (possibly nonsmooth) terms whose proximity operators we can evaluate. In this short note we provide an alternative proof of the sublinear convergence rate of this method.
10/25/2016 ∙ by Fabian Pedregosa, et al.
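The scheme itself is short. Below is a sketch of the Davis-Yin iteration for one concrete instance, a nonnegative Lasso split into a smooth term, an ℓ1 term, and a constraint; the instance and all names are assumptions for illustration:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def davis_yin(A, b, lam, gamma, n_iter=2000):
    """Three operator splitting for
       min_x 0.5*||Ax - b||^2 + lam*||x||_1 + indicator(x >= 0):
    h = smooth term, f = l1 norm, g = nonnegativity constraint.  Only the
    proximity operators of f and g and the gradient of h are needed."""
    d = A.shape[1]
    z = np.zeros(d)
    for _ in range(n_iter):
        x_g = np.maximum(z, 0.0)                     # prox of the constraint
        grad_h = A.T @ (A @ x_g - b)                 # gradient of smooth term
        x_f = soft_threshold(2 * x_g - z - gamma * grad_h, gamma * lam)
        z = z + x_f - x_g
    return np.maximum(z, 0.0)
```

The step size gamma must stay below 2/L with L the Lipschitz constant of the smooth gradient, e.g. gamma = 1/||A||_2^2 here.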

ASAGA: Asynchronous Parallel SAGA
We describe ASAGA, an asynchronous parallel version of the incremental gradient algorithm SAGA that enjoys fast linear convergence rates. Through a novel perspective, we revisit and clarify a subtle but important technical issue present in a large fraction of the recent convergence rate proofs for asynchronous parallel optimization algorithms, and propose a simplification of the recently introduced "perturbed iterate" framework that resolves it. We thereby prove that ASAGA can obtain a theoretical linear speedup on multi-core systems even without sparsity assumptions. We present results of an implementation on a 40-core architecture illustrating the practical speedup as well as the hardware overhead.
06/15/2016 ∙ by Rémi Leblond, et al.

Hyperparameter optimization with approximate gradient
Most models in machine learning contain at least one hyperparameter to control model complexity. Choosing an appropriate set of hyperparameters is both crucial in terms of model accuracy and computationally challenging. In this work we propose an algorithm for the optimization of continuous hyperparameters using inexact gradient information. An advantage of this method is that hyperparameters can be updated before model parameters have fully converged. We also give sufficient conditions for the global convergence of this method, based on regularity conditions of the involved functions and summability of errors. Finally, we validate the empirical performance of this method on the estimation of regularization constants of L2-regularized logistic regression and kernel ridge regression. Empirical benchmarks indicate that our approach is highly competitive with respect to state of the art methods.
02/07/2016 ∙ by Fabian Pedregosa, et al.
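For ridge regression the hypergradient used by such a method has a closed form via implicit differentiation. The sketch below uses exact inner solves for clarity, whereas the paper's point is precisely that inexact solves with summable errors suffice; the function and variable names are assumptions:

```python
import numpy as np

def ridge_val_loss_and_hypergrad(X_tr, y_tr, X_val, y_val, lam):
    """Validation loss of ridge regression and its gradient with respect
    to the regularization constant lam, via implicit differentiation."""
    H = X_tr.T @ X_tr
    d = H.shape[0]
    # Inner problem: w(lam) = argmin 0.5*||X_tr w - y_tr||^2 + 0.5*lam*||w||^2
    w = np.linalg.solve(H + lam * np.eye(d), X_tr.T @ y_tr)
    r = X_val @ w - y_val
    val_loss = 0.5 * r @ r
    g_w = X_val.T @ r                                # outer gradient wrt w
    # Differentiating the optimality condition gives (H + lam*I) dw = -w.
    dw = np.linalg.solve(H + lam * np.eye(d), -w)
    return val_loss, g_w @ dw
```

Gradient steps on lam (e.g. in log-space to keep it positive) can then be taken before the inner model has fully converged, which is the efficiency gain the abstract highlights.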

Improved brain pattern recovery through ranking approaches
Inferring the functional specificity of brain regions from functional Magnetic Resonance Imaging (fMRI) data is a challenging statistical problem. While the General Linear Model (GLM) remains the standard approach for brain mapping, supervised learning techniques (a.k.a. decoding) have proven to be useful to capture multivariate statistical effects distributed across voxels and brain regions. Up to now, much effort has been made to improve decoding by incorporating prior knowledge in the form of a particular regularization term. In this paper we demonstrate that further improvement can be made by accounting for nonlinearities using a ranking approach rather than the commonly used least-squares regression. Through simulation, we compare the recovery properties of our approach to those of linear models commonly used in fMRI-based decoding. We demonstrate the superiority of ranking on a real fMRI dataset.
07/15/2012 ∙ by Fabian Pedregosa, et al.

Machine Learning for Neuroimaging with scikit-learn
Statistical machine learning methods are increasingly used for neuroimaging data analysis. Their main virtue is their ability to model high-dimensional datasets, e.g. multivariate analysis of activation images or resting-state time series. Supervised learning is typically used in decoding or encoding settings to relate brain images to behavioral or clinical observations, while unsupervised learning can uncover hidden structures in sets of images (e.g. resting-state functional MRI) or find subpopulations in large cohorts. By considering different functional neuroimaging applications, we illustrate how scikit-learn, a Python machine learning library, can be used to perform some key analysis steps. Scikit-learn contains a very large set of statistical learning algorithms, both supervised and unsupervised, and its application to neuroimaging data provides a versatile tool to study the brain.
12/12/2014 ∙ by Alexandre Abraham, et al.
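A minimal decoding example in the spirit of the paper, using synthetic data in place of real fMRI scans; the data-generating process and all names are assumptions, and a real analysis would first mask actual brain images:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Stand-in for masked fMRI data: 100 "scans" x 500 "voxels", with two
# experimental conditions differing in the mean activity of 20 voxels.
n, p = 100, 500
y = rng.integers(0, 2, n)
X = rng.standard_normal((n, p))
X[:, :20] += 1.0 * y[:, None]

# A typical decoding pipeline: standardize each voxel, fit a linear
# classifier, and cross-validate to estimate decoding accuracy.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, X, y, cv=5)
```

The same pipeline object works unchanged on real voxel matrices, which is the versatility the abstract refers to.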

Second order scattering descriptors predict fMRI activity due to visual textures
Second-layer scattering descriptors are known to provide good classification performance on natural quasi-stationary processes such as visual textures, due to their sensitivity to higher-order moments and continuity with respect to small deformations. In a functional Magnetic Resonance Imaging (fMRI) experiment we present visual textures to subjects and evaluate the predictive power of these descriptors with respect to the predictive power of simple contour energy, the first scattering layer. We are able to conclude not only that invariant second-layer scattering coefficients better encode voxel activity, but also that well-predicted voxels need not necessarily lie in known retinotopic regions.
08/10/2013 ∙ by Michael Eickenberg, et al.

Learning to rank from medical imaging data
Medical images can be used to predict a clinical score coding for the severity of a disease, a pain level or the complexity of a cognitive task. In all these cases, the predicted variable has a natural order. While a standard classifier discards this information, we would like to take it into account in order to improve prediction performance. A standard linear regression does model such information; however, the linearity assumption is unlikely to be satisfied when predicting from pixel intensities in an image. In this paper we address these modeling challenges with a supervised learning procedure where the model aims to order or rank images. We use a linear model for its robustness in high dimensions and its interpretability. We show on simulations and two fMRI datasets that this approach is able to predict the correct ordering on pairs of images, yielding higher prediction accuracy than standard regression and multi-class classification techniques.
07/16/2012 ∙ by Fabian Pedregosa, et al.
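The pairwise reduction behind such a ranking model is compact: each pair of images with different scores yields one binary example whose feature is the difference vector. Below is a RankSVM-style sketch on synthetic data; the data-generating process, names, and classifier choice are assumptions:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic "images" whose ordinal score is a noisy, binned linear
# function of the features.
n, p = 200, 30
X = rng.standard_normal((n, p))
true_w = rng.standard_normal(p)
y = np.digitize(X @ true_w + 0.5 * rng.standard_normal(n), bins=[-2.0, 0.0, 2.0])

def pairwise_transform(X, y):
    """One binary example per pair with different scores: the feature is
    the difference vector, the label the sign of the score difference."""
    X_p, y_p = [], []
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            if y[i] == y[j]:
                continue
            s = 1.0 if (i + j) % 2 else -1.0   # balance the two classes
            X_p.append(s * (X[i] - X[j]))
            y_p.append(s * np.sign(y[i] - y[j]))
    return np.array(X_p), np.array(y_p)

X_pairs, y_pairs = pairwise_transform(X, y)
ranker = LinearSVC(dual=False).fit(X_pairs, y_pairs)
scores = X @ ranker.coef_.ravel()    # higher score = higher predicted rank
```

The learned linear direction both orders new images and remains interpretable as a weight per feature, the two properties the abstract emphasizes.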

Frank-Wolfe with Subsampling Oracle
We analyze two novel randomized variants of the Frank-Wolfe (FW), or conditional gradient, algorithm. While classical FW algorithms require solving a linear minimization problem over the domain at each iteration, the proposed methods only require solving a linear minimization problem over a small subset of the original domain. The first algorithm we propose is a randomized variant of the original FW algorithm and achieves an O(1/t) sublinear convergence rate, as does its deterministic counterpart. The second algorithm is a randomized variant of the Away-step FW algorithm and, like its deterministic counterpart, reaches a linear (i.e., exponential) convergence rate, making it the first provably convergent randomized variant of Away-step FW. In both cases, while subsampling reduces the convergence rate by a constant factor, the linear minimization step can cost a fraction of that of the deterministic versions, especially when the data is streamed. We illustrate computational gains of the algorithms on regression problems involving both ℓ_1 and latent group lasso penalties.
03/20/2018 ∙ by Thomas Kerdreux, et al.
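On the ℓ_1 ball, the linear minimization oracle is a search over signed coordinate vertices, so "subsampling" amounts to searching only a random subset of them. Below is a sketch of the first (plain FW) variant under these assumptions; the instance and names are illustrative, and the Away-step variant would need extra bookkeeping of an active set:

```python
import numpy as np

def subsampled_fw(grad, x0, radius, n_coords, n_iter=4000, seed=0):
    """Frank-Wolfe over the l1 ball {x : ||x||_1 <= radius} where each
    linear minimization only examines a random subset of coordinates."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    d = x.shape[0]
    for t in range(n_iter):
        g = grad(x)
        idx = rng.choice(d, size=n_coords, replace=False)  # subsampled oracle
        i = idx[np.argmax(np.abs(g[idx]))]   # best vertex among the sample
        s = np.zeros(d)
        s[i] = -radius * np.sign(g[i])
        gamma = 2.0 / (t + 2.0)              # standard open-loop step size
        x = (1 - gamma) * x + gamma * s
    return x
```

Each iteration inspects only n_coords of the d coordinate vertices, which is where the per-iteration savings come from; per the abstract, the price is a constant-factor slowdown of the convergence rate.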
Fabian Pedregosa
Currently a Marie Skłodowska-Curie postdoctoral fellow between ETH Zurich and UC Berkeley.