# Lightweight Stochastic Optimization for Minimizing Finite Sums with Infinite Data

Variance reduction has been commonly used in stochastic optimization. It relies crucially on the assumption that the data set is finite. However, when the data are imputed with random noise as in data augmentation, the perturbed data set becomes essentially infinite. Recently, the stochastic MISO (S-MISO) algorithm was introduced to address this expected risk minimization problem. Though it converges faster than SGD, a significant amount of memory is required. In this paper, we propose two SGD-like algorithms for expected risk minimization with random perturbation, namely, stochastic sample average gradient (SSAG) and stochastic SAGA (S-SAGA). The memory cost of SSAG does not depend on the sample size, while that of S-SAGA is the same as those of variance reduction methods on unperturbed data. Theoretical analysis and experimental results on logistic regression and AUC maximization show that SSAG has a faster convergence rate than SGD with comparable space requirement, while S-SAGA outperforms S-MISO in terms of both iteration complexity and storage.


## 1 Introduction

Machine learning tasks are often cast as optimization problems over some data distribution. In regularized risk minimization with n training samples, one minimizes:

 min_θ (1/n) ∑_{i=1}^n ℓ_i(θ) + g(θ), (1)

where θ ∈ ℝ^d is the model parameter, ℓ_i is the loss due to sample i, and g is a regularizer. In this paper, we assume that the ℓ_i's and g are smooth and convex. Stochastic gradient descent (SGD) (Robbins & Monro, 1951) and its variants (Nemirovski et al., 2009; Xiao, 2010; Duchi et al., 2011; Bottou et al., 2016) are flexible, scalable, and widely used for this problem. However, SGD suffers from large variance due to sampling noise. To alleviate this problem, the stepsize has to be decreasing, which slows convergence.

By exploiting the finite-sum structure in (1), a class of variance-reduced stochastic optimization methods have been proposed recently (Le Roux et al., 2012; Johnson & Zhang, 2013; Shalev-Shwartz & Zhang, 2013; Mairal, 2013; Defazio et al., 2014a, b). Based on the use of control variates (Fishman, 1996), they construct different approximations to the true gradient so that its variance decreases as the optimal solution is approached.

In order to capture more variations in the data distribution, it is effective to obtain more training data by injecting random noise to the data samples (Decoste & Schölkopf, 2002; van der Maaten et al., 2013; Paulin et al., 2014). Theoretically, it has been shown that random noise improves generalization (Wager et al., 2014). In addition, artificially corrupting the training data has a wide range of applications in machine learning. For example, additive Gaussian noise can be used in image denoising (Vincent et al., 2010) and provides a form of -type regularization (Bishop, 1995); dropout noise serves as adaptive regularization that is useful in stabilizing predictions (van der Maaten et al., 2013) and selecting discriminative but rare features (Wager et al., 2013); and Poisson noise is of interest to count features as in document classification (van der Maaten et al., 2013).

With the addition of noise perturbations, (1) becomes the following expected risk minimization problem:

 min_θ (1/n) ∑_{i=1}^n E_{ξ_i}[ℓ_i(θ; ξ_i)] + g(θ), (2)

where ξ_i is the random noise injected into ℓ_i, and E_{ξ_i} denotes expectation w.r.t. ξ_i. Because of the expectation, the perturbed data can be considered as infinite, and the finite data set assumption in variance reduction methods is violated. In this case, each function in problem (2) can only be accessed via a stochastic first-order oracle, and the main optimization tool is SGD.

Despite its importance, expected risk minimization has received very little attention. One very recent work in this direction is stochastic MISO (S-MISO) (Bietti & Mairal, 2017). While it converges faster than SGD, S-MISO requires O(nd) additional space, where n is the number of samples and d is the feature dimensionality. This significantly limits its applicability to big data problems. The N-SAGA algorithm (Hofmann et al., 2015) can also be used on problems with infinite data. However, its asymptotic error is nonzero.

In this paper, we focus on the linear model. By exploiting the linear structure, we propose two SGD-like variants with low memory costs: stochastic sample average gradient (SSAG) and stochastic SAGA (S-SAGA). In particular, the memory cost of SSAG does not depend on the sample size n, while S-SAGA has a memory requirement of O(n), which matches that of stochastic variance reduction methods on unperturbed data (Le Roux et al., 2012; Shalev-Shwartz & Zhang, 2013; Defazio et al., 2014a, b). Similar to S-MISO, the proposed algorithms converge faster than SGD. Moreover, the convergence rate of S-SAGA depends on a constant that is typically smaller than that of S-MISO. Experimental results on logistic regression and AUC maximization with dropout noise demonstrate the efficiency of the proposed algorithms.

Notations. For a vector x, ∥x∥ is its ℓ2-norm. For two vectors x and y, x^T y denotes their dot product.

## 2 Related Work

In this paper, we consider the linear model. Given n samples {x_1, …, x_n}, with each x_i ∈ ℝ^d, the regularized risk minimization problem in (1) can be written as:

 min_θ (1/n) ∑_{i=1}^n φ_i(x_i^T θ) + g(θ), (3)

where x_i^T θ is the prediction on sample i, and φ_i is a loss. For example, logistic regression corresponds to φ_i(u) = log(1 + exp(−y_i u)), where y_i ∈ {±1} are the training labels; and linear regression corresponds to φ_i(u) = (u − y_i)²/2.

### 2.1 Learning with Injected Noise

To make the predictor robust, one can inject i.i.d. random noise ξ_i into each sample (van der Maaten et al., 2013). Let the perturbed sample be x̂_i. The following types of noise have been popularly used: (i) additive noise (Bishop, 1995; Wager et al., 2013): x̂_i = x_i + ξ_i, where ξ_i comes from a zero-mean distribution such as the normal or Poisson distribution; and (ii) dropout noise (Srivastava et al., 2014): x̂_i = x_i ⊙ ξ_i/(1 − p), where ⊙ denotes the element-wise product, ξ_i ∈ {0, 1}^d, p is the dropout probability, and each component of ξ_i is an independent draw from a scaled Bernoulli random variable. With random perturbations, (3) becomes the following expected risk minimization problem:

 min_θ F(θ) ≡ (1/n) ∑_{i=1}^n E_{ξ_i}[φ_i(x̂_i^T θ)] + g(θ). (4)

As the objective contains an expectation, computing the gradient is infeasible as infinite samples are needed. As an approximation, SGD uses the gradient from a single sample. However, this has large variance.
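
To make the two noise models concrete, here is a minimal numpy sketch (the function names and parameter values are illustrative, not from the paper). Both perturbations are unbiased, i.e., E[x̂] = x, which is the property exploited later in Section 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_additive(x, sigma=0.1):
    """Additive zero-mean Gaussian noise: x_hat = x + xi, so E[x_hat] = x."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def perturb_dropout(x, p=0.5):
    """Dropout noise: x_hat = x * delta / (1 - p), delta ~ Bernoulli(1 - p).
    The 1/(1 - p) rescaling keeps the perturbation unbiased: E[x_hat] = x."""
    delta = rng.random(x.shape) > p
    return x * delta / (1.0 - p)

x = np.ones(5)
x_hats = np.stack([perturb_dropout(x) for _ in range(20000)])
print(np.round(x_hats.mean(axis=0), 1))   # close to x, i.e. [1. 1. 1. 1. 1.]
```

Averaging many perturbed copies recovers the clean sample, which is exactly why each sample's expectation E_{ξ_i}[x̂_i] is cheap to compute for these noise types.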

In this paper, we make the following assumption on the φ_i's in (4). Note that this implies that the composed losses E_{ξ_i}[φ_i(x̂_i^T θ)] and F are also smooth.

###### Assumption 1.

Each φ_i is L-smooth w.r.t. its argument, i.e., there exists a constant L such that |φ_i′(u) − φ_i′(v)| ≤ L|u − v| for all u, v.

### 2.2 Variance Reduction

In stochastic optimization, control variates have been commonly used to reduce the variance of stochastic gradients (Fishman, 1996). In general, given a random variable X and another highly correlated random variable Y whose expectation E[Y] is known, a variance-reduced estimate of E[X] can be obtained as

 X − Y + E[Y]. (5)

In stochastic optimization on problem (3), the gradient of the loss evaluated on a random sample is taken as X. When the training set is finite, various algorithms have been recently proposed so that Y is strongly correlated with X and E[Y] can be easily evaluated. Examples include stochastic average gradient (SAG) (Le Roux et al., 2012), MISO (Mairal, 2013), stochastic variance reduced gradient (SVRG) (Johnson & Zhang, 2013), Finito (Defazio et al., 2014b), SAGA (Defazio et al., 2014a), and stochastic dual coordinate ascent (SDCA) (Shalev-Shwartz & Zhang, 2013).
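
A small simulation illustrates why (5) helps (the particular X and Y below are toy choices for illustration). The estimator X − Y + E[Y] is unbiased for E[X] and has variance Var[X] + Var[Y] − 2 Cov[X, Y], which is small when X and Y are highly correlated:

```python
import numpy as np

rng = np.random.default_rng(1)

# X: quantity whose mean we want; Y: correlated variable with known mean E[Y] = 0.
u = rng.normal(size=100_000)
X = u + 0.1 * rng.normal(size=100_000)   # E[X] = 0, Var[X] ~ 1.01
Y = u                                    # highly correlated with X, E[Y] = 0
plain = X                                # naive estimator samples
reduced = X - Y + 0.0                    # control-variate samples (5), Var ~ 0.01
print(plain.var(), reduced.var())        # variance drops by ~100x
```

The variance-reduction methods above all amount to constructing such a Y from stored per-sample gradient information.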

However, with the expectation in (4), the full gradient (i.e., E[Y] in (5)) cannot be evaluated, and variance reduction can no longer be used. Very recently, the stochastic MISO (S-MISO) algorithm (Bietti & Mairal, 2017) was proposed for solving (4). Its convergence rate outperforms that of SGD by having a smaller multiplicative constant. However, S-MISO requires an additional O(nd) space, which prevents its use on large data sets.

Let the iterate at iteration t be θ_{t−1}. To approximate the gradient of (4), SGD uses the gradient g_t evaluated on a single perturbed sample x̂_{i_t}, where i_t is sampled uniformly from {1, …, n}. The variance of g_t is usually assumed to be bounded by a constant σ_s², as

 E∥g_t − ∇F(θ_{t−1})∥² ≤ σ_s², (6)

where the expectation is taken w.r.t. both the random index i_t and the perturbation at iteration t. Note that the gradient of the regularizer g does not contribute to the variance.

### 3.1 Exploiting the Model Structure

#### 3.1.1 Stochastic Sample-Average Gradient (SSAG)

At iteration t, the stochastic gradient of the loss for sample i_t is φ′_{i_t}(x̂_{i_t}^T θ_{t−1}) x̂_{i_t}. Thus, the gradient direction is determined by x̂_{i_t}, while the parameter θ_{t−1} only affects its scale. With this observation, we consider using a_t x̂_{i_t} as a control variate, where the scalar a_t may depend on past information but not on i_t. Note that the gradient component ∇g(θ_{t−1}) is deterministic, and does not contribute to the construction of the control variate. Using (5), the resultant gradient estimator is:

 z_t = (φ′_{i_t}(x̂_{i_t}^T θ_{t−1}) − a_t) x̂_{i_t} + a_t x̃_t + ∇g(θ_{t−1}), (7)

where x̃_t is an estimate of E[x̂]. For example, x̃_t can be defined as

 x̃_t = (1 − 1/t) x̃_{t−1} + (1/t) x̂_{i_t}, (8)

so that x̃_t can be incrementally updated as the x̂_{i_t}'s are sampled. As the x̂_{i_t}'s are i.i.d., by the law of large numbers, the sample average x̃_t converges to the expected value E[x̂].

The following shows that z_t in (7) is a biased estimator of the gradient ∇F(θ_{t−1}). As x̃_t converges to E[x̂], z_t is still asymptotically unbiased.

###### Proposition 1.

E[z_t] − ∇F(θ_{t−1}) = a_t(x̃_t − E[x̂]), where the expectation is over i_t and the noise at iteration t, conditioned on the past.

Note that E[x̂] = (1/n) ∑_{i=1}^n E_{ξ_i}[x̂_i], where E_{ξ_i} denotes the expectation w.r.t. ξ_i. We assume that each E_{ξ_i}[x̂_i] can be easily computed. This is the case, for example, when the noise is dropout noise or additive zero-mean noise, where E_{ξ_i}[x̂_i] = x_i (van der Maaten et al., 2013). This suggests replacing x̃_t in (7) by x̃ ≡ (1/n) ∑_{i=1}^n E_{ξ_i}[x̂_i] (which is equal to E[x̂]), leading to the estimator:

 v_t = (φ′_{i_t}(x̂_{i_t}^T θ_{t−1}) − a_t) x̂_{i_t} + a_t x̃ + ∇g(θ_{t−1}). (9)

The following shows that v_t is unbiased, and also provides an upper bound on its variance.

###### Proposition 2.

E[v_t] = ∇F(θ_{t−1}), and E∥v_t − ∇F(θ_{t−1})∥² ≤ E[(φ′_{i_t}(x̂_{i_t}^T θ_{t−1}) − a_t)² ∥x̂_{i_t}∥²]. The bound is minimized when

 a_t = a_t^* ≡ E[φ′(x̂^T θ_{t−1}) ∥x̂∥²] / E[∥x̂∥²]. (10)

For dropout noise and other additive noise with known variance, one can compute E_{ξ_i}[∥x̂_i∥²] for each i, and then average to obtain E[∥x̂∥²]. However, evaluating the expectation in the numerator of (10) is infeasible. Instead, we set

 a_t = ã_t / s_t (11)

for t ≥ 1, and approximate the expectations in the numerator and denominator by moving averages:

 ã_{t+1} = (1 − β_t) ã_t + β_t φ′_{i_t}(x̂_{i_t}^T θ_{t−1}) ∥x̂_{i_t}∥², s_{t+1} = (1 − β_t) s_t + β_t ∥x̂_{i_t}∥².

The initialization of ã_1, s_1 and the choice of β_t are discussed below.

The resulting algorithm, called stochastic sample-average gradient (SSAG), is shown in Algorithm 1. Compared to S-MISO (Bietti & Mairal, 2017), SSAG is more computationally efficient. It does not require extra memory, and only requires a single gradient evaluation (step 6) in each iteration.
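
To make the update concrete, the following is a toy numpy sketch of the SSAG estimator (9) with the moving-average scaling (11) on a synthetic dropout logistic-regression problem. The data, the stepsize schedule, and the β_t choice are illustrative assumptions, not the paper's exact Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam, p = 100, 10, 0.1, 0.3
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
x_bar = X.mean(axis=0)              # = (1/n) sum_i E[x_hat_i] under unbiased noise

def dphi(u, yi):                    # phi_i'(u) for the logistic loss
    return -yi / (1.0 + np.exp(yi * u))

theta = np.zeros(d)
a_tilde, s = 0.0, 1e-8              # moving averages for a_t = a_tilde / s in (11)
for t in range(1, 5001):
    i = rng.integers(n)
    x_hat = X[i] * (rng.random(d) > p) / (1.0 - p)       # dropout-perturbed sample
    g = dphi(x_hat @ theta, y[i])
    a = a_tilde / s
    v = (g - a) * x_hat + a * x_bar + lam * theta        # estimator (9)
    beta = t ** (-2.0 / 3.0)                             # beta_t ~ t^{-2/3}
    a_tilde = (1 - beta) * a_tilde + beta * g * (x_hat @ x_hat)
    s = (1 - beta) * s + beta * (x_hat @ x_hat)
    theta -= v / (lam * (10.0 + t))                      # decreasing stepsize
print(np.all(np.isfinite(theta)))
```

Note that only x_bar and the two scalars (ã_t, s_t) are carried between iterations, so the extra memory is independent of n.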

The following Proposition shows that a_t in (11) is asymptotically optimal for appropriate choices of the stepsize and β_t.

###### Proposition 3.

If (i) ; (ii) ; (iii) ; (iv) ; and (v) , then

 a_t → a_t^* w.p.1.

A simple choice is β_t ∝ 1/t^{c₂}, where c₂ ∈ (0, 1]. The following Proposition quantifies the convergence of a_t to a_t^*. In particular, when c₁ = 1, the asymptotic bound in (12) is minimized when c₂ = 2/3.

###### Proposition 4.

With assumptions (i)-(v) in Proposition 3, β_t ∝ 1/t^{c₂}, and stepsize proportional to 1/t^{c₁}, we have

 E[s_t²(a_t − a_t^*)²] ≤ O(max{1/t^{c₂}, 1/t^{2(c₁−c₂)}}). (12)

#### 3.1.2 Stochastic SAGA (S-SAGA)

Recall that in (9), φ′_{i_t}(x̂_{i_t}^T θ_{t−1}) x̂_{i_t} plays the role of X in (5), and a_t x̂_{i_t} plays the role of Y. However, the corresponding X and Y in (5) can be negatively correlated in some iterations. This is partly because a_t in (9) does not depend on i_t, even though a_t x̂_{i_t} serves as a control variate for sample i_t. Thus, it is better for each sample to have its own scaling factor a_i, leading to the estimator:

 v_t = (φ′_{i_t}(x̂_{i_t}^T θ_{t−1}) − a_{i_t}) x̂_{i_t} + m_{t−1} + ∇g(θ_{t−1}), (13)

where m_{t−1} = (1/n) ∑_{i=1}^n a_i E_{ξ_i}[x̂_i]. Note that m_t can be updated sequentially as:

 m_t = m_{t−1} + (1/n)(φ′_{i_t}(x̂_{i_t}^T θ_{t−1}) − a_{i_t}) E_{ξ_{i_t}}[x̂_{i_t}].

Besides, (13) reduces to the SAGA estimator (Defazio et al., 2014a) when the random noise is removed. The whole procedure, which will be called stochastic SAGA (S-SAGA), is shown in Algorithm 2.

S-SAGA needs additional space for the a_i's and the E_{ξ_i}[x̂_i]'s. However, as discussed in Section 3.1.1, E_{ξ_i}[x̂_i] = x_i for many types of noise. Hence, the E_{ξ_i}[x̂_i]'s do not need to be explicitly stored, and the additional space is reduced to O(n). This is significantly smaller than that of S-MISO, which always requires O(nd) additional space.
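
A toy numpy sketch of the S-SAGA loop, under the same illustrative synthetic dropout logistic-regression setup as before (data and stepsize are assumptions, not the paper's Algorithm 2). The only per-sample state is the scalar vector a, so the extra memory is O(n):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam, p = 100, 10, 0.1, 0.3
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

def dphi(u, yi):                    # phi_i'(u) for the logistic loss
    return -yi / (1.0 + np.exp(yi * u))

theta = np.zeros(d)
a = np.zeros(n)                     # one scaling factor per sample: O(n) extra memory
m = np.zeros(d)                     # m = (1/n) sum_i a_i E[x_hat_i]; here E[x_hat_i] = x_i
for t in range(1, 5001):
    i = rng.integers(n)
    x_hat = X[i] * (rng.random(d) > p) / (1.0 - p)       # dropout-perturbed sample
    g = dphi(x_hat @ theta, y[i])
    v = (g - a[i]) * x_hat + m + lam * theta             # estimator (13), using m_{t-1}
    m += (g - a[i]) * X[i] / n                           # sequential update of m_t
    a[i] = g                                             # refresh the stored factor
    theta -= v / (lam * (10.0 + t))
print(np.all(np.isfinite(theta)))
```

With the noise removed (x_hat replaced by X[i]), this loop reduces to plain SAGA, mirroring the remark above.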

### 3.2 Convergence Analysis

In this section, we provide convergence results for SSAG and S-SAGA on problem (4).

#### 3.2.1 SSAG

For SSAG, we make the following additional assumptions.

###### Assumption 2.

F is μ-strongly convex, i.e., F(θ₁) ≥ F(θ₂) + ∇F(θ₂)^T(θ₁ − θ₂) + (μ/2)∥θ₁ − θ₂∥² for all θ₁, θ₂.

###### Assumption 3.

for all .

Let the minimizer of (4) be θ^*. The following Theorem shows that SSAG has an O(1/t) convergence rate, which is similar to SGD (Bottou et al., 2016).

###### Theorem 1.

Assume a decreasing stepsize of the form a/(γ + t + 1), with a and γ chosen such that the initial stepsize is sufficiently small. For the sequence {θ_t} generated by SSAG, we have

 E[F(θ_t)] − F(θ^*) ≤ ν₁/(γ + t + 1), (14)

where ν₁ is a constant depending on a, γ, ∥θ₀ − θ^*∥², and σ_a².

Note that this stepsize also satisfies the conditions in Proposition 3. The condition on a is crucial to obtaining the O(1/t) rate. It has been observed that underestimating μ can make convergence extremely slow (Nemirovski et al., 2009). When the model is ℓ2-regularized, μ can be estimated by the corresponding regularization parameter.

###### Corollary 1.

To ensure that E[F(θ_t)] − F(θ^*) ≤ ε, SSAG has a time complexity of O(dn + dκ²σ_a²/(με)), where κ = L/μ is the condition number.

The O(dn) term is due to the initialization of x̃ and is amortized over multiple data passes. In contrast, the time complexity for SGD is O(dκ²σ_s²/(με)), where σ_s² is defined in (6) (Bottou et al., 2016). To compare σ_a² with σ_s², we assume that the perturbed samples have finite variance σ_x²:

 E[∥x̂ − E[x̂]∥²] = σ_x².

The variance of the SGD estimator can be bounded as

 E∥g_t − ∇F(θ_{t−1})∥² = E[∥φ′_{i_t}(x̂_{i_t}^T θ_{t−1}) x̂_{i_t} − E[φ′_{i_t}(x̂_{i_t}^T θ_{t−1}) x̂_{i_t}]∥²]
 ≤ 3E[∥φ′_{i_t}(x̂_{i_t}^T θ_{t−1}) x̂_{i_t} − a_t x̂_{i_t}∥²] + 3E[∥a_t x̂_{i_t} − a_t E[x̂_{i_t}]∥²] + 3E[∥a_t E[x̂_{i_t}] − E[φ′_{i_t}(x̂_{i_t}^T θ_{t−1}) x̂_{i_t}]∥²]
 ⪅ 3σ_a² + 3a_t² σ_x².

Thus, the gradient variance of SGD has two terms, one involving σ_a² and the other involving σ_x². In particular, if the derivative φ′ is constant, then σ_a² = 0, and only the perturbed sample variance σ_x² contributes to the gradient variance of SGD. For a large class of functions, including the logistic loss and smoothed hinge loss, φ′ is nearly constant in some regions. In this case, we have σ_a² ≪ σ_s².

#### 3.2.2 S-SAGA

Besides Assumption 1, we assume the following:

###### Assumption 4.

Each f_i(θ) ≡ E_{ξ_i}[φ_i(x̂_i^T θ)] + g(θ) is μ-strongly convex, i.e., f_i(θ₁) ≥ f_i(θ₂) + ∇f_i(θ₂)^T(θ₁ − θ₂) + (μ/2)∥θ₁ − θ₂∥² for all θ₁, θ₂.

###### Assumption 5.

For all i and θ, the variation of the stochastic gradient due to noise alone is bounded by a constant σ_c²; here ξ_i′ denotes another independently sampled noise for sample i.

###### Theorem 2.

Assume a decreasing stepsize of the form a/(γ + t + 1), with a and γ chosen such that the initial stepsize is sufficiently small. For the sequence {θ_t} generated by S-SAGA, we have

 E[∥θ_t − θ^*∥²] ≤ ν₂/(γ + t + 1), (15)

where ν₂ is a constant depending on a, γ, ∥θ₀ − θ^*∥², and σ_c².

Thus, S-SAGA has a convergence rate of O(1/t), with a constant governed by σ_c². In comparison, the corresponding constant for SGD is governed by σ_s². Note that σ_s² in (6) includes variance due to data sampling, while σ_c² above only considers that due to noise. Since data sampling induces a much larger variation than perturbing the same sample, typically we have σ_c² ≪ σ_s², and thus S-SAGA has faster convergence.

S-MISO considers the variance of the difference in gradients due to noise:

 (1/n) ∑_{i=1}^n E_{ξ,ξ′}[∥φ′_i(x̂_i′^T θ_{t−1}) x̂_i′ − φ′_i(x̂_i^T θ_{t−1}) x̂_i∥²] ≤ σ_m²,

and its convergence rate has a constant governed by σ_m². The bounds for σ_c² and σ_m² are similar in form. However, σ_c² can be small when the gradient variation due to noise alone is small, while this is not the case for σ_m². In particular, when φ′_i is a constant regardless of the random perturbations, σ_c² = 0 while σ_m² remains nonzero.

The following Corollary considers the time complexity of S-SAGA.

###### Corollary 2.

To ensure that E[∥θ_t − θ^*∥²] ≤ ε, S-SAGA has a time complexity of O(dn + dκ²σ_c²/(με)).

###### Remark 1.

In (Bietti & Mairal, 2017), additional speedup is achieved by first running the algorithm with a constant stepsize for a few epochs, and then applying the decreasing stepsize. This trick is not used here. If incorporated, it can be shown that the initialization-dependent term will be improved.

A summary of the convergence results is shown in Table 1. As can be seen, by exploiting the linear model structure, SSAG has a smaller variance constant than SGD (σ_a² vs. σ_s²) while having a comparable space requirement. S-SAGA improves over S-MISO, achieving gains in terms of both iteration complexity and storage.

### 3.3 Acceleration By Iterate Averaging

The complexity bounds in Corollaries 1 and 2 depend quadratically on the condition number κ. This may be problematic on ill-conditioned problems. To alleviate this problem, one can use iterate averaging (Bietti & Mairal, 2017), which outputs

 θ̄_T ≡ (2/(T(2γ + T − 1))) ∑_{t=0}^{T−1} (γ + t) θ_t, (16)

where T is the total number of iterations. It can be easily seen that (16) can be efficiently implemented in an online fashion without the need for storing the θ_t's:

 θ̄_t = (1 − ρ_t) θ̄_{t−1} + ρ_t θ_{t−1},

where ρ_t = 2(γ + t − 1)/(t(2γ + t − 1)) and θ̄_1 = θ₀. As in (Bietti & Mairal, 2017), the following shows that the κ² dependence in both SSAG and S-SAGA (Corollaries 1 and 2) can be reduced to κ.
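
The online recursion can be checked numerically: since ∑_{s=0}^{t−1} (γ + s) = t(2γ + t − 1)/2, the weight ρ_t = 2(γ + t − 1)/(t(2γ + t − 1)) reproduces (16) exactly. A short sketch (the iterates below are random stand-ins):

```python
import numpy as np

gamma, T = 5, 200
rng = np.random.default_rng(4)
thetas = [rng.normal(size=3) for _ in range(T)]   # stand-ins for theta_0 .. theta_{T-1}

# Direct weighted average (16).
direct = sum((gamma + t) * th for t, th in enumerate(thetas)) * 2.0 / (T * (2 * gamma + T - 1))

# Online recursion with rho_t = 2 (gamma + t - 1) / (t (2 gamma + t - 1)).
bar = np.zeros(3)
for t in range(1, T + 1):
    rho = 2.0 * (gamma + t - 1) / (t * (2 * gamma + t - 1))
    bar = (1 - rho) * bar + rho * thetas[t - 1]
print(np.allclose(bar, direct))   # True
```

At t = 1, ρ_1 = 1, so the recursion correctly initializes to θ₀ without any extra bookkeeping.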

###### Theorem 3.

Assume a stepsize of the form 2/(μ(γ + t)), with γ chosen such that the initial stepsize is sufficiently small. For the sequence {θ_t} generated by SSAG, we have

 E[F(θ̄_T)] − F(θ^*) ≤ (μγ(γ − 1)/(T(2γ + T − 1))) ∥θ₀ − θ^*∥² + 4σ_a²/(μ(2γ + T − 1)).

The stepsize and its condition together imply that γ = Ω(κ). Thus, when κ = Θ(T), the first term, which depends on ∥θ₀ − θ^*∥², decays as O(1/T), which is no better than (14). On the other hand, if κ = o(T), the first term decays at a faster rate.

###### Corollary 3.

When κ = o(T), to ensure that E[F(θ̄_T)] − F(θ^*) ≤ ε, SSAG with iterate averaging has a time complexity of O(dn + dκσ_a²/(με)), where κ = L/μ.

Similarly, we have the following for S-SAGA.

###### Theorem 4.

Assume a stepsize of the form 2/(μ(γ + t)), with γ chosen such that the initial stepsize is sufficiently small. For the sequence {θ_t} generated by S-SAGA, we have

 E[F(θ̄_T) − F(θ^*)] ≤ (μγ(γ − 1)/(2T(2γ + T − 1))) C₄ + 32σ_c²/(μ(2γ + T − 1)), (17)

where C₄ is a constant depending on the initialization.

The condition is satisfied when γ is sufficiently large. This implies that the first term in (17) decays as O(1/T) when γ = Θ(T); on the other hand, when γ = o(T), the first term decays at a faster rate. Thus, iterate averaging does not provide S-SAGA with as much acceleration as it provides SSAG.

The following Corollary considers the case where n ≥ κ (Johnson & Zhang, 2013).

###### Corollary 4.

Assume that n ≥ κ. To ensure that E[F(θ̄_T)] − F(θ^*) ≤ ε, S-SAGA with iterate averaging has a time complexity of O(dn + dκσ_c²/(με)).

## 4 Experiments

In this section, we perform experiments on logistic regression (Section 4.1) and AUC maximization (Section 4.2).

### 4.1 Logistic Regression with Dropout

Consider the ℓ2-regularized logistic regression model with dropout noise and dropout probability p. This can be formulated as the following optimization problem:

 min_θ (1/n) ∑_{i=1}^n E_{ξ_i}[log(1 + exp(−y_i ẑ_i^T θ))] + (λ/2)∥θ∥², (18)

where ẑ_i = x_i ⊙ ξ_i/(1 − p) is the dropout-perturbed sample, x_i is the feature vector of sample i, and y_i the corresponding class label. We vary the regularization parameter λ; the smaller the λ, the higher the condition number. Experiments are performed on two high-dimensional data sets from the LIBSVM archive (Table 2).
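
Since the objective (18) contains an expectation, it can only be approximated in practice; the sketch below estimates it by Monte Carlo on small synthetic data (sizes, λ, and p are illustrative assumptions). By Jensen's inequality, the dropout-smoothed loss upper-bounds the unperturbed logistic loss at the same θ, which gives a cheap sanity check.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, lam, p = 50, 20, 0.01, 0.5
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
theta = 0.1 * rng.normal(size=d)

def objective_mc(theta, n_mc=2000):
    """Monte Carlo estimate of the dropout-smoothed objective (18)."""
    total = 0.0
    for _ in range(n_mc):
        Z = X * (rng.random((n, d)) > p) / (1.0 - p)   # z_hat_i for all samples
        total += np.mean(np.log1p(np.exp(-y * (Z @ theta))))
    return total / n_mc + 0.5 * lam * theta @ theta

# Unperturbed counterpart at the same theta, for comparison.
plain = np.mean(np.log1p(np.exp(-y * (X @ theta)))) + 0.5 * lam * theta @ theta
print(objective_mc(theta), plain)
```

The gap between the two values is exactly the implicit regularization that dropout adds on top of the explicit ℓ2 penalty.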

#### 4.1.1 Comparison with SGD and S-MISO

The proposed SSAG and S-SAGA are compared with SGD and S-MISO. From Proposition 4, we use a slightly larger