 # Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite-Sum Structure

Stochastic optimization algorithms with variance reduction have proven successful for minimizing large finite sums of functions. Unfortunately, these techniques are unable to deal with stochastic perturbations of input data, induced for example by data augmentation. In such cases, the objective is no longer a finite sum, and the main candidate for optimization is the stochastic gradient descent method (SGD). In this paper, we introduce a variance reduction approach for these settings when the objective is composite and strongly convex. The convergence rate outperforms SGD with a typically much smaller constant factor, which depends on the variance of gradient estimates only due to perturbations on a single example.

## Code Repositories

### stochs

stochs: fast stochastic solvers for machine learning in C++ and Cython

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Many supervised machine learning problems can be cast as the minimization of an expected loss over a data distribution with respect to a vector

in  of model parameters. When an infinite amount of data is available, stochastic optimization methods such as SGD or stochastic mirror descent algorithms, or their variants, are typically used (see bottou_optimization_2016 ; duchi2009efficient ; nemirovski_robust_2009 ; xiao2010dual ). Nevertheless, when the dataset is finite, incremental methods based on variance reduction techniques (e.g.allen2016katyusha ; defazio_saga:_2014 ; johnson_accelerating_2013 ; lan2015optimal ; lin_universal_2015 ; schmidt_minimizing_2013 ; shalev-shwartz_stochastic_2013 ) have proven to be significantly faster at solving the finite-sum problem

 minx∈Rp{F(x):=f(x)+h(x)=1nn∑i=1fi(x)+h(x)}, (1)

where the functions  are smooth and convex, and  is a simple convex penalty that need not be differentiable such as the  norm. A classical setting is , where is an example-label pair,

is a convex loss function, and

is a regularization parameter.

In this paper, we are interested in a variant of (1) where random perturbations of data are introduced, which is a common scenario in machine learning. Then, the functions involve an expectation over a random perturbation , leading to the problem

 minx∈Rp{F(x):=1nn∑i=1fi(x)+h(x)}.    with    fi(x)=Eρ[~fi(x,ρ)]. (2)

Unfortunately, variance reduction methods are not compatible with the setting (2), since evaluating a single gradient

requires computing a full expectation. Yet, dealing with random perturbations is of utmost interest; for instance, this is a key to achieve stable feature selection

meinshausen2010stability , improving the generalization error both in theory wager_altitude_2014 and in practice loosli2007training ; maaten2013learning , obtaining stable and robust predictors zheng2016improving , or using complex a priori knowledge about data to generate virtually larger datasets loosli2007training ; paulin2014transformation ; simard_transformation_1998 . Injecting noise in data is also useful to hide gradient information for privacy-aware learning duchi2012privacy .

Despite its importance, the optimization problem (2) has been littled studied and to the best of our knowledge, no dedicated optimization method that is able to exploit the problem structure has been developed so far. A natural way to optimize this objective when is indeed SGD, but ignoring the finite-sum structure leads to gradient estimates with high variance and slow convergence. The goal of this paper is to introduce an algorithm for strongly convex objectives, called stochastic MISO, which exploits the underlying finite sum using variance reduction. Our method achieves a faster convergence rate than SGD, by removing the dependence on the gradient variance due to sampling the data points  in ; the dependence remains only for the variance due to random perturbations .

To the best of our knowledge, our method is the first algorithm that interpolates naturally between incremental methods for finite sums (when there are no perturbations) and the stochastic approximation setting (when

), while being able to efficiently tackle the hybrid case.

##### Related work.

Many optimization methods dedicated to the finite-sum problem (e.g., johnson_accelerating_2013 ; shalev-shwartz_stochastic_2013

) have been motivated by the fact that their updates can be interpreted as SGD steps with unbiased estimates of the full gradient, but with a variance that decreases as the algorithm approaches the optimum

johnson_accelerating_2013 ; on the other hand, vanilla SGD requires decreasing step-sizes to achieve this reduction of variance, thereby slowing down convergence. Our work aims at extending these techniques to the case where each function in the finite sum can only be accessed via a first-order stochastic oracle.

Most related to our work, recent methods that use data clustering to accelerate variance reduction techniques allen-zhu_exploiting_2016 ; hofmann_variance_2015 can be seen as tackling a special case of (2), where the expectations in are replaced by empirical averages over points in a cluster. While N-SAGA hofmann_variance_2015 was originally not designed for the stochastic context we consider, we remark that their method can be applied to (2). Their algorithm is however asymptotically biased and does not converge to the optimum. On the other hand, ClusterSVRG allen-zhu_exploiting_2016 is not biased, but does not support infinite datasets. The method proposed in achab_sgd_2015 uses variance reduction in a setting where gradients are computed approximately, but the algorithm computes a full gradient at every pass, which is not available in our stochastic setting.

##### Paper organization.

In Section 2, we present our algorithm for smooth objectives, and we analyze its convergence in Section 3. For space limitation reasons, we present an extension to composite objectives and non-uniform sampling in Appendix A. Section 4 is devoted to empirical results.

## 2 The Stochastic MISO Algorithm for Smooth Objectives

In this section, we introduce the stochastic MISO approach for smooth objectives (), which relies on the following assumptions:

• [noitemsep,topsep=0pt,leftmargin=5mm]

• (A1) global strong convexity: is -strongly convex;

• (A2) smoothness: is -smooth for all and (i.e., with -Lipschitz gradients).

Note that these assumptions are relaxed in Appendix A by supporting composite objectives and by exploiting different smoothness parameters  on each example, a setting where non-uniform sampling of the training points is typically helpful to accelerate convergence (e.g.xiao2014proximal ).

##### Complexity results.

We now introduce the following quantity, which is essential in our analysis:

 σ2p:=1nn∑i=1σ2i,~{}~{}with~{}~{}% σ2i:=Eρ[∥∇~fi(x∗,ρ)−∇fi(x∗)∥2],

where is the (unique) minimizer of . The quantity  represents the part of the variance of the gradients at the optimum that is due to the perturbations . In contrast, another quantity of interest is the total variance , which also includes the randomness in the choice of the index , defined as

 σ2tot=Ei,ρ[∥∇~fi(x∗,ρ)∥2]=σ2p+Ei[∥∇fi(x∗)∥2]       (note that~{}∇f(x∗)=0).

The relation between and is obtained by simple algebraic manipulations.

The goal of our paper is to exploit the potential imbalance , occurring when perturbations on input data are small compared to the sampling noise. The assumption is reasonable: given a data point, selecting a different one should lead to larger variation than a simple perturbation. From a theoretical point of view, the approach we propose achieves the iteration complexity presented in Table 1, see also Appendix D and bach2011non ; bottou_optimization_2016 ; nemirovski_robust_2009 for the complexity analysis of SGD. The gain over SGD is of order , which is also observed in our experiments in Section 4. We also compare against the method N-SAGA; its convergence rate is similar to ours but suffers from a non-zero asymptotic error.

##### Motivation from application cases.

One clear framework of application is the data clustering scenario already investigated in allen-zhu_exploiting_2016 ; hofmann_variance_2015 . Nevertheless, we will focus on less-studied data augmentation settings that lead instead to true stochastic formulations such as (2). First, we consider learning a linear model when adding simple direct manipulations of feature vectors, via rescaling (multiplying each entry vector by a random scalar), Dropout, or additive Gaussian noise, in order to improve the generalization error wager_altitude_2014 or to get more stable estimators meinshausen2010stability . In Table 2, we present the potential gain over SGD in these scenarios. To do that, we study the variance of perturbations applied to a feature vector . Indeed, the gradient of the loss is proportional to , which allows us to obtain good estimates of the ratio , as we observed in our empirical study of Dropout presented in Section 4. Whereas some perturbations are friendly for our method such as feature rescaling (a rescaling window of yields for instance a huge gain factor of ), a large Dropout rate would lead to less impressive acceleration (e.g., a Dropout with simply yields a factor ).

Second, we also consider more interesting domain-driven data perturbations such as classical image transformations considered in computer vision

paulin2014transformation ; zheng2016improving

including image cropping, rescaling, brightness, contrast, hue, and saturation changes. These transformations may be used to train a linear classifier on top of an unsupervised multilayer image model such as unsupervised CKNs

mairal_end–end_2016 or the scattering transform bruna2013invariant

. It may also be used for retraining the last layer of a pre-trained deep neural network: given a new task unseen during the full network training and given limited amount of training data, data augmentation may be indeed crucial to obtain good prediction and S-MISO can help accelerate learning in this setting. These scenarios are also studied in Table

2, where the experiment with ResNet-50 involving random cropping and rescaling produces images from ones. For these scenarios with realistic perturbations, the potential gain varies from to .

##### Description of stochastic MISO.

We are now in shape to present our method, described in Algorithm 1. Without perturbations and with a constant step-size, the algorithm resembles the MISO/Finito algorithms defazio_finito:_2014 ; lin_universal_2015 ; mairal_incremental_2015 , which may be seen as primal variants of SDCA shalev-shwartz_sdca_2016 ; shalev-shwartz_stochastic_2013 . Specifically, MISO is not able to deal with our stochastic objective (2), but it may address the deterministic finite-sum problem (1). It is part of a larger body of optimization methods that iteratively build a model of the objective function, typically a lower or upper bound on the objective that is easier to optimize; for instance, this strategy is commonly adopted in bundle methods hiriart2013convex ; nesterov_lectures_2004 .

More precisely, MISO assumes that each  is strongly convex and builds a model using lower bounds , where each  is a quadratic lower bound on  of the form

 dti(x)=cti,1+μ2∥x−zti∥2=cti,2−μ⟨x,zti⟩+μ2∥x∥2. (5)

These lower bounds are updated during the algorithm using strong convexity lower bounds at  of the form :

 dti(x)={(1−αt)dt−1i(x)+αtlti(x), if i=itdt−1i(x), otherwise, (6)

which corresponds to an update of the quantity :

 zti={(1−αt)zt−1i+αt(xt−1−1μ∇fit(xt−1)), if i=itzt−1i, otherwise.

The next iterate is then computed as , which is equivalent to (4). The original MISO/Finito algorithms use under a “big data” condition on the sample size   defazio_finito:_2014 ; mairal_incremental_2015 , while the theory was later extended in lin_universal_2015 to relax this condition by supporting smaller constant steps , leading to an algorithm that may be interpreted as a primal variant of SDCA (see shalev-shwartz_sdca_2016 ).

Note that when is an expectation, it is hard to obtain such lower bounds since the gradient is not available in general. For this reason, we have introduced S-MISO, which can exploit approximate lower bounds to each  using gradient estimates, by letting the step-sizes  decrease appropriately as commonly done in stochastic approximation. This leads to update (3).

Separately, SDCA shalev-shwartz_stochastic_2013 considers the Fenchel conjugates of , defined by . When  is an expectation, is not available in closed form in general, nor are its gradients, and in fact exploiting stochastic gradient estimates is difficult in the duality framework. In contrast, shalev-shwartz_sdca_2016 gives an analysis of SDCA in the primal, aka. “without duality”, for smooth finite sums, and our work extends this line of reasoning to the stochastic approximation and composite settings.

##### Relationship with SGD in the smooth case.

The link between S-MISO in the non-composite setting and SGD can be seen by rewriting the update (4) as

 xt=xt−1+1n(ztit−zt−1it)=xt−1+αtnvt,

where

 vt:=xt−1−1μ∇~fit(xt−1,ρt)−zt−1it. (7)

Note that , where contains all information up to iteration ; hence, the algorithm can be seen as an instance of the stochastic gradient method with unbiased gradients, which was a key motivation in SVRG johnson_accelerating_2013 and later in other variance reduction algorithms defazio_saga:_2014 ; shalev-shwartz_sdca_2016 . It is also worth noting that in the absence of a finite-sum structure (), we have ; hence our method becomes identical to SGD, up to a redefinition of step-sizes. In the composite case (see Appendix A), our approach yields a new algorithm that resembles regularized dual averaging xiao2010dual .

##### Memory requirements and handling of sparse datasets.

The algorithm requires storing the vectors , which takes the same amount of memory as the original dataset and which is therefore a reasonable requirement in many practical cases. In the case of sparse datasets, it is fair to assume that random perturbations applied to input data preserve the sparsity patterns of the original vectors, as is the case, e.g., when applying Dropout to text documents described with bag-of-words representations wager_altitude_2014 . If we further assume the typical setting where the -strong convexity comes from an regularizer: , where is the (sparse) perturbed example and encodes the loss, then the update (3) can be written as

 zti={(1−αt)zt−1i−αtμϕ′i(x⊤t−1ξρti)ξρti,if i=itzt−1i,otherwise,

which shows that for every index , the vector  preserves the same sparsity pattern as the examples  throughout the algorithm (assuming the initialization ), making the update (3) efficient. The update (4) has the same cost since is also sparse.

##### Limitations and alternative approaches.

Since our algorithm is uniformly better than SGD in terms of iteration complexity, its main limitation is in terms of memory storage when the dataset cannot fit into memory (remember that the memory cost of S-MISO is the same as the input dataset). In these huge-scale settings, SGD should be preferred; this holds true in fact for all incremental methods when one cannot afford to perform more than one (or very few) passes over the data. Our paper focuses instead on non-huge datasets, which are those benefiting most from data augmentation.

We note that a different approach to variance reduction like SVRG johnson_accelerating_2013 is able to trade off storage requirements for additional full gradient computations, which would be desirable in some situations. However, we were not able to obtain any decreasing step-size strategy that works for these methods, both in theory and practice, leaving us with constant step-size approaches as in achab_sgd_2015 ; hofmann_variance_2015 that either maintain a non-zero asymptotic error, or require dynamically reducing the variance of gradient estimates. One possible way to explain this difficulty is that SVRG and SAGA defazio_saga:_2014 “forget” past gradients for a given example , while S-MISO averages them in (3), which seems to be a technical key to make it suitable to stochastic approximation. Nevertheless, the question of whether it is possible to trade-off storage with computation in a setting like ours is open and of utmost interest.

## 3 Convergence Analysis of S-MISO

We now study the convergence properties of the S-MISO algorithm. For space limitation reasons, all proofs are provided in Appendix B. We start by defining the problem-dependent quantities , and then introduce the Lyapunov function

 Ct=12∥xt−x∗∥2+αtn2n∑i=1∥zti−z∗i∥2. (8)

Proposition 1 gives a recursion on , obtained by upper-bounding separately its two terms, and finding coefficients to cancel out other appearing quantities when relating  to . To this end, we borrow elements of the convergence proof of SDCA without duality shalev-shwartz_sdca_2016 ; our technical contribution is to extend their result to the stochastic approximation and composite (see Appendix A) cases.

###### Proposition 1 (Recursion on Ct).

If is a positive and non-increasing sequence satisfying

 α1≤min{12,n2(2κ−1)}, (9)

with , then obeys the recursion

 E[Ct]≤(1−αtn)E[Ct−1]+2(αtn)2σ2pμ2. (10)

We now state the main convergence result, which provides the expected rate on based on decreasing step-sizes, similar to bottou_optimization_2016 for SGD. Note that convergence of objective function values is directly related to that of the Lyapunov function  via smoothness:

 E[f(xt)−f(x∗)]≤L2E[∥xt−x∗∥2]≤LE[Ct]. (11)
###### Theorem 2 (Convergence of Lyapunov function).

Let the sequence of step-sizes be defined by with such that  satisfies (9). For all , it holds that

 E[Ct]≤νγ+t+1    % where    ν:=max{8σ2pμ2,(γ+1)C0}. (12)
##### Choice of step-sizes in practice.

Naturally, we would like to be small, in particular independent of the initial condition and equal to the first term in the definition (12). We would like the dependence on to vanish at a faster rate than , as it is the case in variance reduction algorithms on finite sums. As advised in bottou_optimization_2016 in the context of SGD, we can initially run the algorithm with a constant step-size and exploit this linear convergence regime until we reach the level of noise given by , and then start decaying the step-size. It is easy to see that by using a constant step-size , converges near a value . Indeed, Eq. (10) with yields

 E[Ct−¯C]≤(1−¯αn)E[Ct−1−¯C].

Thus, we can reach a precision  with  in iterations. Then, if we start decaying step-sizes as in Theorem 2 with large enough so that , we have

 (γ+1)E[C′0]≤(γ+1)¯ϵ=8σ2p/μ2,

making both terms in (12) smaller than or equal to . Considering these two phases, with an initial step-size  given by (9), the final work complexity for reaching is

 O((n+Lμ)logC0¯ϵ)+O(σ2pμ2ϵ). (13)

We can then use (11) in order to obtain the complexity for reaching . Note that following this step-size strategy was found to be very effective in practice (see Section 4).

##### Acceleration by iterate averaging.

When one is interested in the convergence in function values, the complexity (13) combined with (11) yields , which can be problematic for ill-conditioned problems (large condition number ). The following theorem presents an iterate averaging scheme which brings the complexity term down to , which appeared in Table 1.

###### Theorem 3 (Convergence under iterate averaging).

Let the step-size sequence be defined by

 αt=2nγ+t for γ≥1% s.t. α1≤min{12,n4(2κ−1)}.

We have

 E[f(¯xT)−f(x∗)]≤2μγ(γ−1)C0T(2γ+T−1)+16σ2pμ(2γ+T−1),

where

 ¯xT:=2T(2γ+T−1)T−1∑t=0(γ+t)xt.

The proof uses a similar telescoping sum technique to lacoste2012simpler . Note that if , the first term, which depends on the initial condition , decays as and is thus dominated by the second term. Moreover, if we start averaging after an initial phase with constant step-size , we can consider . In the ill-conditioned regime, taking as large as allowed by (9), we have  of the order of . The full convergence rate then becomes

 E[f(¯xT)−f(x∗)]≤O(σ2pμ(γ+T)(1+γT)).

When is large enough compared to , this becomes , leading to a complexity .

## 4 Experiments Figure 1: Impact of conditioning for data augmentation on STL-10 (controlled by μ, where μ=10−4 gives the best accuracy). Values of the loss are shown on a logarithmic scale (1 unit = factor 10).  η=0.1 satisfies the theory for all methods, and we include curves for larger step-sizes η=1. We omit N-SAGA for η=1 because it remains far from the optimum. For the scattering representation, the problem we study is ℓ1-regularized, and we use the composite algorithm of Appendix A. Figure 2: Re-training of the last layer of a pre-trained ResNet 50 model, on a small dataset with random color perturbations (for different values of μ). Figure 3: Impact of perturbations controlled by the Dropout rate δ. The gene data is ℓ2-normalized; hence, we consider similar step-sizes as Figure 1. The IMDB dataset is highly heterogeneous; thus, we also include non-uniform (NU) sampling variants of Appendix A. For uniform sampling, theoretical step-sizes perform poorly for all methods; thus, we show a larger tuned step-size η=10.

We present experiments comparing S-MISO with SGD and N-SAGA hofmann_variance_2015 on four different scenarios, in order to demonstrate the wide applicability of our method: we consider an image classification dataset with two different image representations and random transformations, and two classification tasks with Dropout regularization, one on genetic data, and one on (sparse) text data. Figures 1 and 3

show the curves for an estimate of the training objective using 5 sampled perturbations per example. The plots are shown on a logarithmic scale, and the values are compared to the best value obtained among the different methods in 500 epochs. The strong convexity constant

is the regularization parameter. For all methods, we consider step-sizes supported by the theory as well as larger step-sizes that may work better in practice. Our C++/Cython implementation of all methods considered in this section is available at https://github.com/albietz/stochs.

##### Choices of step-sizes.

For both S-MISO and SGD, we use the step-size strategy mentioned in Section 3 and advised by bottou_optimization_2016

, which we have found to be most effective among many heuristics we have tried: we initially keep the step-size constant (controlled by a factor

in the figures) for 2 epochs, and then start decaying as , where for S-MISO, for SGD, and is chosen large enough to match the previous constant step-size. For N-SAGA, we maintain a constant step-size throughout the optimization, as suggested in the original paper hofmann_variance_2015 . The factor  shown in the figures is such that  corresponds to an initial step-size for S-MISO (from (19) in the uniform case) and for SGD and N-SAGA (with  instead of  in the non-uniform case when using the variant of Appendix A).

##### Image classification with “data augmentation”.

The success of deep neural networks is often limited by the availability of large amounts of labeled images. When there are many unlabeled images but few labeled ones, a common approach is to train a linear classifier on top of a deep network learned in an unsupervised manner, or pre-trained on a different task (e.g., on the ImageNet dataset). We follow this approach on the STL-10 dataset coates_analysis_2011 , which contains 5K training images from classes and 100K unlabeled images, using a 2-layer unsupervised convolutional kernel network mairal_end–end_2016 , giving representations of dimension . The perturbation consists of randomly cropping and scaling the input images. We use the squared hinge loss in a one-versus-all setting. The vector representations are -normalized such that we may use the upper bound for the smoothness constant. We also present results on the same dataset using a scattering representation bruna2013invariant of dimension , with random gamma corrections (raising all pixels to the power , where  is chosen randomly around ). For this representation, we add an  regularization term and use the composite variant of S-MISO presented in Appendix A.

Figure 1 shows convergence results on one training fold (500 images), for different values of , allowing us to study the behavior of the algorithms for different condition numbers. The low variance induced by data transformations allows S-MISO to reach suboptimality that is orders of magnitude smaller than SGD after the same number of epochs. Note that one unit on these plots corresponds to one order of magnitude in the logarithmic scale. N-SAGA initially reaches a smaller suboptimality than SGD, but quickly gets stuck due to the bias in the algorithm, as predicted by the theory hofmann_variance_2015 , while S-MISO and SGD continue to converge to the optimum thanks to the decreasing step-sizes. The best validation accuracy for both representations is obtained for (middle column), and we observed relative gains of up to 1% from using data augmentation. We computed empirical variances of the image representations for these two strategies, which are closely related to the variance in gradient estimates, and observed these transformations to account for about 10% of the total variance.

Figure 2 shows convergence results when training the last layer of a 50-layer Residual network he2016deep that has been pre-trained on ImageNet. Here, we consider the common scenario of leveraging a deep model trained on a large dataset as a feature extractor in order to learn a new classifier on a different small dataset, where it would be difficult to train such a model from scratch. To simulate this setting, we consider a binary classification task on a small dataset of 100 images of size 256x256 taken from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012, which we crop to 224x224 before performing random adjustments to brightness, saturation, hue and contrast. As in the STL-10 experiments, the gains of S-MISO over other methods are of about one order of magnitude in suboptimality, as predicted by Table 2.

##### Dropout on gene expression data.

We trained a binary logistic regression model on the breast cancer dataset of

van_de_vijver_gene-expression_2002 , with different Dropout rates , i.e., where at every iteration, each coordinate  of a feature vector  is set to zero independently with probability and to otherwise. The dataset consists of 295 vectors of dimension 8 141 of gene expression data, which we normalize in norm. Figure 3 (top) compares S-MISO with SGD and N-SAGA for three values of , as a way to control the variance of the perturbations. We include a Dropout rate of to illustrate the impact of on the algorithms and study the influence of the perturbation variance , even though this value of  is less relevant for the task. The plots show very clearly how the variance induced by the perturbations affects the convergence of S-MISO, giving suboptimality values that may be orders of magnitude smaller than SGD. This behavior is consistent with the theoretical convergence rate established in Section 3 and shows that the practice matches the theory.

##### Dropout on movie review sentiment analysis data.

We trained a binary classifier with a squared hinge loss on the IMDB dataset maas2011learning with different Dropout rates . We use the labeled part of the IMDB dataset, which consists of 25K training and 250K testing movie reviews, represented as 89 527-dimensional sparse bag-of-words vectors. In contrast to the previous experiments, we do not normalize the representations, which have great variability in their norms, in particular, the maximum Lipschitz constant across training points is roughly 100 times larger than the average one. Figure 3 (bottom) compares non-uniform sampling versions of S-MISO (see Appendix A) and SGD (see Appendix D) with their uniform sampling counterparts as well as N-SAGA. Note that we use a large step-size  for the uniform sampling algorithms, since

was significantly slower for all methods, likely due to outliers in the dataset. In contrast, the non-uniform sampling algorithms required no tuning and just use

. The curves clearly show that S-MISO-NU has a much faster convergence in the initial phase, thanks to the larger step-size allowed by non-uniform sampling, and later converges similarly to S-MISO, i.e., at a much faster rate than SGD when the perturbations are small. The value of  used in the experiments was chosen by cross-validation, and the use of Dropout gave improvements in test accuracy from 88.51% with no dropout to with  and with  (based on 10 different runs of S-MISO-NU after 400 epochs).

Finally, we also study the effect of the iterate averaging scheme of Theorem 3 in Appendix E.

#### Acknowledgements

This work was supported by a grant from ANR (MACARON project under grant number ANR-14-CE23-0003-01), by the ERC grant number 714381 (SOLARIS project), and by the MSR-Inria joint center.

## References

• (1) M. Achab, A. Guilloux, S. Gaïffas, and E. Bacry. SGD with Variance Reduction beyond Empirical Risk Minimization. arXiv:1510.04822, 2015.
• (2) Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In

Symposium on the Theory of Computing (STOC)

, 2017.
• (3) Z. Allen-Zhu, Y. Yuan, and K. Sridharan. Exploiting the Structure: Stochastic Gradient Methods Using Raw Clusters. In Advances in Neural Information Processing Systems (NIPS), 2016.
• (4) F. Bach and E. Moulines.

Non-asymptotic analysis of stochastic approximation algorithms for machine learning.

In Advances in Neural Information Processing Systems (NIPS), 2011.
• (5) L. Bottou, F. E. Curtis, and J. Nocedal. Optimization Methods for Large-Scale Machine Learning. arXiv:1606.04838, 2016.
• (6) J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE transactions on pattern analysis and machine intelligence (PAMI), 35(8):1872–1886, 2013.
• (7) A. Coates, H. Lee, and A. Y. Ng. An Analysis of Single-Layer Networks in Unsupervised Feature Learning. In

International Conference on Artificial Intelligence and Statistics (AISTATS)

, 2011.
• (8) A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), 2014.
• (9) A. Defazio, J. Domke, and T. S. Caetano. Finito: A faster, permutable incremental gradient method for big data problems. In International Conference on Machine Learning (ICML), 2014.
• (10) J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Privacy aware learning. In Advances in Neural Information Processing Systems (NIPS), 2012.
• (11) J. C. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research (JMLR), 10:2899–2934, 2009.
• (12) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

, 2016.
• (13) J.-B. Hiriart-Urruty and C. Lemaréchal. Convex analysis and minimization algorithms I: Fundamentals. Springer science & business media, 1993.
• (14) T. Hofmann, A. Lucchi, S. Lacoste-Julien, and B. McWilliams. Variance Reduced Stochastic Gradient Descent with Neighbors. In Advances in Neural Information Processing Systems (NIPS), 2015.
• (15) R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (NIPS), 2013.
• (16) S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv:1212.2002, 2012.
• (17) G. Lan and Y. Zhou. An optimal randomized incremental gradient method. Mathematical Programming, 2017.
• (18) H. Lin, J. Mairal, and Z. Harchaoui. A Universal Catalyst for First-Order Optimization. In Advances in Neural Information Processing Systems (NIPS), 2015.
• (19) G. Loosli, S. Canu, and L. Bottou.

Training invariant support vector machines using selective sampling.

In Large Scale Kernel Machines, pages 301–320. MIT Press, Cambridge, MA., 2007.
• (20) A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts.

Learning word vectors for sentiment analysis.

In The 49th Annual Meeting of the Association for Computational Linguistics (ACL), pages 142–150. Association for Computational Linguistics, 2011.
• (21) J. Mairal. Incremental Majorization-Minimization Optimization with Application to Large-Scale Machine Learning. SIAM Journal on Optimization, 25(2):829–855, 2015.
• (22) J. Mairal. End-to-End Kernel Learning with Supervised Convolutional Kernel Networks. In Advances in Neural Information Processing Systems (NIPS), 2016.
• (23) N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.
• (24) A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust Stochastic Approximation Approach to Stochastic Programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
• (25) Y. Nesterov. Introductory Lectures on Convex Optimization. Springer, 2004.
• (26) M. Paulin, J. Revaud, Z. Harchaoui, F. Perronnin, and C. Schmid. Transformation pursuit for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
• (27) M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1):83–112, 2017.
• (28) S. Shalev-Shwartz. SDCA without Duality, Regularization, and Individual Convexity. In International Conference on Machine Learning (ICML), 2016.
• (29) S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research (JMLR), 14:567–599, 2013.
• (30) P. Y. Simard, Y. A. LeCun, J. S. Denker, and B. Victorri. Transformation Invariance in Pattern Recognition — Tangent Distance and Tangent Propagation. In G. B. Orr and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, number 1524 in Lecture Notes in Computer Science, pages 239–274. Springer Berlin Heidelberg, 1998.
• (31) M. J. van de Vijver et al. A Gene-Expression Signature as a Predictor of Survival in Breast Cancer. New England Journal of Medicine, 347(25):1999–2009, Dec. 2002.
• (32) L. van der Maaten, M. Chen, S. Tyree, and K. Q. Weinberger. Learning with marginalized corrupted features. In International Conference on Machine Learning (ICML), 2013.
• (33) S. Wager, W. Fithian, S. Wang, and P. Liang. Altitude Training: Strong Bounds for Single-layer Dropout. In Advances in Neural Information Processing Systems (NIPS), 2014.
• (34) L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research (JMLR), 11:2543–2596, 2010.
• (35) L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
• (36) S. Zheng, Y. Song, T. Leung, and I. Goodfellow. Improving the robustness of deep neural networks via stability training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

## Appendix A Extension to Composite Objectives and Non-Uniform Sampling

In this section, we study extensions of S-MISO to different situations where our previous smoothness assumption (A2) is not suitable, either because of a non-smooth term  in the objective or because it ignores additional useful knowledge about each  such as the norm of each example.

In the presence of non-smooth regularizers such as the -norm, the objective is no longer smooth, but we can leverage its composite structure by using proximal operators. To this end, we assume that one can easily compute the proximal operator of , defined by

When the smoothness constants  vary significantly across different examples (typically through the norm of the feature vectors), the uniform upper bound  can be restrictive. It has been noticed (see, e.g.schmidt_minimizing_2013 ; xiao2014proximal ) that when the  are known, one can achieve better convergence rates—typically depending on the average smoothness constant rather than —by sampling examples in a non-uniform way. For that purpose, we now make the following assumptions:

• [noitemsep,topsep=0pt,leftmargin=5mm]

• (A3) strong convexity: is -strongly convex for all ;

• (A4) smoothness: is -smooth for all ;

Note that our proof relies on a slightly stronger assumption (A3) than the global strong convexity assumption (A1) made above, which holds in the situation where strong convexity comes from an  regularization term. In order to exploit the different smoothness constants, we allow the algorithm to sample indices  non-uniformly, from any distribution  such that for all  and .

The extension of S-MISO to this setting is given in Algorithm 2. Note that the step-sizes vary depending on the example, with larger steps for examples that are sampled less frequently (typically “easier” examples with smaller ). Note that when , the update directions are unbiased estimates of the gradient: we have as in the uniform case. However, in the composite case, the algorithm cannot be written in a proximal stochastic gradient form like Prox-SVRG xiao2014proximal or SAGA defazio_saga:_2014 .

##### Relationship with RDA.

When , our algorithm performs similar updates to Regularized Dual Averaging (RDA) xiao2010dual with strongly convex regularizers. In particular, if , the updates are the same when taking , since

 proxh/μ(¯zt) =argminx{⟨−μ¯zt,x⟩+μ2∥x∥2+h(x)},

and is equal to the average of the gradients of the loss term up to , which appears in the same way in the RDA updates (xiao2010dual, , Section 2.2). However, unlike RDA, our method supports arbitrary decreasing step-sizes, in particular keeping the step-size constant, which can lead to faster convergence in the initial iterations (see Section 3).

##### Lower-bound model and convergence analysis.

Again, we can view the algorithm as iteratively updating approximate lower bounds on the objective  of the form analogously to (6), and minimizing the new  in (15). Similar to MISO-Prox, we require that is initialized with a -strongly convex quadratic such that with the random perturbation . Given the form of in (5), it suffices to choose that satisfies

 ~fi(x,~ρi)≥μ2∥x−z0i∥+c, (16)

for some constant . In the common case of an regularizer with a non-negative loss, one can simply choose for all , otherwise, can be obtained by considering a strong convexity lower bound on . Our new analysis relies on the minimum of the lower bounds  through the following Lyapunov function:

 Cqt=F(x∗)−Dt(xt)+μαtn2n∑i=11qin∥zti−z∗i∥2. (17)

The convergence of the iterates  is controlled by the convergence in  thanks to the next lemma:

###### Lemma 4 (Bound on the iterates).

For all , we have

 μ2E[∥xt−x∗∥2]≤E[F(x∗)−Dt(xt)]. (18)

The following proposition gives a recursion on  similar to Proposition 1.

###### Proposition 5 (Recursion on Cqt).

If is a positive and non-increasing sequence of step-sizes satisfying

 α1≤min{nqmin2,nμ4Lq}, (19)

with and , then obeys the recursion

 E[Cqt]≤(1−αtn)E[Cqt−1]+2(αtn)2σ2qμ, (20)

with .

Note that if we consider the quantity , which is an upper bound on by Lemma 4, we have the same recursion as (10), and thus can apply Theorem 2 with the new condition (19). If we choose

 qi=12n+Li−μ2∑i(Li−μ), (21)

we have and , where . Then, taking satisfies (19), and using similar arguments to Section 3, the complexity for reaching is

 O((n+¯Lμ)logCq0¯ϵ)+O(σ2qμ2ϵ),

where , and  is the initial constant step-size. For the complexity in function suboptimality, the second term becomes by using the same averaging scheme presented in Theorem 3 and adapting the proof. Note that with our choice of , we have , for general perturbations, where  is the variance in the uniform case. Additionally, it is often reasonable to assume that the variance from perturbations increases with the norm of examples, for instance Dropout perturbations get larger when coordinates have larger magnitudes. Based on this observation, if we make the assumption that , that is , then for both (uniform case) and , we have , and thus we have for the choice of  given in (21), since  is convex in . Thus, we can expect that the  convergence phase behaves similarly or better than for uniform sampling, which is confirmed by our experiments (see Section 4).

## Appendix B Proofs for the Smooth Case (Section 3)

### b.1 Proof of Proposition 1 (Recursion on Lyapunov function Ct)

We begin by stating the following lemma, which extends a key result of variance reduction methods (see, e.g.johnson_accelerating_2013 ) to the situation considered in this paper, where one only has access to noisy estimates of the gradients of each .

###### Lemma B.1.

Let

and according to a perturbation distribution . Under assumption (A2) on the functions and their expectations , we have, for all ,

 Ei,ρ[∥∇~fi(x,ρ)−∇fi(x∗)∥2]≤4L(f(x)−f(x∗))+2σ2p.
###### Proof.

We have

 ∥∇~fi(x,ρ)−∇fi(x∗)∥2 ≤2∥∇~fi(x,ρ)−∇~fi(x∗,ρ)∥2+2∥∇~fi(x∗,ρ)−∇fi(x∗)∥2 ≤4L(~fi(x,ρ)−~fi(x∗,ρ)−⟨∇~fi(x∗,ρ),x−x∗⟩)+2∥∇~fi(x∗,ρ)−∇fi(x∗)∥2.

The first inequality comes from the simple relation . The second inequality follows from the smoothness of , in particular we used the classical relation

 g(y)≥g(x)+⟨∇g(x),y−x⟩+12L∥∇g(y)−∇g(x)∥2,

which is known to hold for any convex and -smooth function (see, e.g.(nesterov_lectures_2004, , Theorem 2.1.5)). The result follows by taking expectations on and and noting that , as well as the definition of . ∎

We now proceed with the proof of Proposition 1.

###### Proof.

Define the quantities

 At =1nn∑i=1∥zti−z∗i∥2 andBt =12∥xt−x∗∥2.

The proof successively describes recursions on , , and eventually .

###### Recursion on At.

We have

 At−At−1 =1n(∥ztit−z∗it∥2−∥zt−1it−z∗it∥2) =1n(∥∥∥(1−αt)(zt−1it−z