Machine learning tasks are often cast as optimization problems over some data distribution. In regularized risk minimization with $n$ training samples, one minimizes:
$$\min_{w} \; \frac{1}{n}\sum_{i=1}^{n} \ell_i(w) + g(w), \tag{1}$$
where $w$ is the model parameter, $\ell_i$ is the loss due to sample $i$, and $g$ is a regularizer. In this paper, we assume that $\ell_i$ and $g$ are smooth and convex. Stochastic gradient descent (SGD) (Robbins & Monro, 1951) and its variants (Nemirovski et al., 2009; Xiao, 2010; Duchi et al., 2011; Bottou et al., 2016) are flexible, scalable, and widely used for this problem. However, SGD suffers from large variance due to sampling noise. To alleviate this problem, the stepsize has to be decreasing, which slows convergence.
By exploiting the finite-sum structure in (1), a class of variance-reduced stochastic optimization methods has been proposed recently (Le Roux et al., 2012; Johnson & Zhang, 2013; Shalev-Shwartz & Zhang, 2013; Mairal, 2013; Defazio et al., 2014a, b). Based on control variates (Fishman, 1996), they construct different approximations to the true gradient whose variance decreases as the optimal solution is approached.
To capture more variations in the data distribution, it is effective to obtain more training data by injecting random noise into the data samples (Decoste & Schölkopf, 2002; van der Maaten et al., 2013; Paulin et al., 2014). Theoretically, it has been shown that random noise improves generalization (Wager et al., 2014). In addition, artificially corrupting the training data has a wide range of applications in machine learning. For example, additive Gaussian noise can be used in image denoising (Vincent et al., 2010) and provides a form of $\ell_2$-type regularization (Bishop, 1995); dropout noise serves as adaptive regularization that is useful in stabilizing predictions (van der Maaten et al., 2013) and selecting discriminative but rare features (Wager et al., 2013); and Poisson noise is of interest for count features, as in document classification (van der Maaten et al., 2013).
With the addition of noise perturbations, (1) becomes the following expected risk minimization problem:
where is the random noise injected into function , and denotes expectation w.r.t. . Because of the expectation, the perturbed data set is effectively infinite, and the finite data set assumption in variance reduction methods is violated. In this case, each function in problem (2) can only be accessed via a stochastic first-order oracle, and the main optimization tool is SGD.
Despite its importance, expected risk minimization has received very little attention. One very recent method for this setting is stochastic MISO (S-MISO) (Bietti & Mairal, 2017). While it converges faster than SGD, S-MISO requires $O(nd)$ space, where $d$ is the feature dimensionality. This significantly limits its applicability to big data problems. The N-SAGA algorithm (Hofmann et al., 2015) can also be used on problems with infinite data. However, its asymptotic error is nonzero.
In this paper, we focus on the linear model. By exploiting the linear structure, we propose two SGD-like variants with low memory costs: stochastic sample average gradient (SSAG) and stochastic SAGA (S-SAGA). In particular, the memory cost of SSAG does not depend on the sample size $n$, while S-SAGA has a memory requirement of $O(n)$, which matches that of the stochastic variance reduction methods on unperturbed data (Le Roux et al., 2012; Shalev-Shwartz & Zhang, 2013; Defazio et al., 2014a, b). Similar to S-MISO, the proposed algorithms converge faster than SGD. Moreover, the convergence rate of S-SAGA depends on a constant that is typically smaller than that of S-MISO. Experimental results on logistic regression and AUC maximization with dropout noise demonstrate the efficiency of the proposed algorithms.
Notation. For a vector $x$, $\|x\|$ is its $\ell_2$-norm. For two vectors $x$ and $y$, $x^\top y$ denotes their dot product.
2 Related Work
In this paper, we consider the linear model. Given samples , with each , the regularized risk minimization problem in (1) can be written as:
where is the prediction on sample , and is a loss. For example, logistic regression corresponds to , where
are the training labels; and linear regression corresponds to.
2.1 Learning with Injected Noise
To make the predictor robust, one can inject i.i.d. random noise into each sample (van der Maaten et al., 2013). Let the perturbed sample be . The following types of noise are commonly used: (i) additive noise (Bishop, 1995; Wager et al., 2013): , where
comes from a zero-mean distribution such as the normal or Poisson distribution; and (ii) dropout noise (Srivastava et al., 2014): , where denotes the element-wise product, ,
is the dropout probability, and each component of is an independent draw from a scaled Bernoulli random variable. With random perturbations, (3) becomes the following expected risk minimization problem:
As the objective contains an expectation, computing the exact gradient is infeasible because infinitely many samples would be needed. As an approximation, SGD uses the gradient from a single perturbed sample. However, this estimate has large variance.
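As an illustration of the two noise types and of a single-sample stochastic gradient, consider the following minimal sketch. It assumes the logistic loss with labels in {-1, +1}; the function names (`additive_gaussian`, `dropout`, `logistic_grad`), the toy vectors, and the noise levels are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def additive_gaussian(x, sigma, rng):
    """Return x plus zero-mean Gaussian noise with standard deviation sigma."""
    return x + sigma * rng.standard_normal(x.shape)

def dropout(x, p, rng):
    """Zero each feature with probability p and rescale the rest by 1/(1-p),
    so that the perturbed sample equals x in expectation."""
    mask = rng.random(x.shape) >= p          # keep a feature with probability 1 - p
    return x * mask / (1.0 - p)

def logistic_grad(w, z, y):
    """Gradient w.r.t. w of log(1 + exp(-y * z.w)) for a label y in {-1, +1}."""
    return -y * z / (1.0 + np.exp(y * (z @ w)))

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 0.5, 3.0])          # toy clean sample (hypothetical)
w = np.array([0.1, 0.2, -0.3, 0.05])         # toy parameter vector (hypothetical)

# SGD sees one perturbed copy of the sample per iteration.
g_single = logistic_grad(w, dropout(x, 0.3, rng), 1.0)

# Averaging many perturbed copies approximately recovers the clean sample,
# which is the mean-preserving property used later in the text.
z_mean = np.mean([dropout(x, 0.3, rng) for _ in range(20000)], axis=0)
```

Each call to `dropout` or `additive_gaussian` yields a fresh perturbed sample, which is why the perturbed data set behaves as if it were infinite.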
In this paper, we make the following assumption on in (4). Note that this implies and are also -smooth.
Each is -smooth w.r.t. , i.e., there exists constant such that .
2.2 Variance Reduction
In stochastic optimization, control variates have been commonly used to reduce the variance of stochastic gradients (Fishman, 1996). In general, given a random variable and another highly correlated random variable
, a variance-reduced estimate of can be obtained as
In stochastic optimization on problem (3), the gradient of the loss evaluated on sample is taken as . When the training set is finite, various algorithms have been recently proposed so that is strongly correlated with and can be easily evaluated. Examples include stochastic average gradient (SAG) (Le Roux et al., 2012), MISO (Mairal, 2013), stochastic variance reduced gradient (SVRG) (Johnson & Zhang, 2013), Finito (Defazio et al., 2014b), SAGA (Defazio et al., 2014a), and stochastic dual coordinate ascent (SDCA) (Shalev-Shwartz & Zhang, 2013).
However, with the expectation in (4), the full gradient (i.e., in (5)) cannot be evaluated, and these variance reduction methods can no longer be applied directly. Very recently, the stochastic MISO (S-MISO) algorithm (Bietti & Mairal, 2017) was proposed for solving (4). Its convergence rate improves on that of SGD by having a smaller multiplicative constant. However, S-MISO requires additional space, which prevents its use on large data sets.
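The effect of a control variate can be seen in a small numerical sketch. It assumes the standard control-variate form, in which the correlated variable is subtracted and its known mean is added back; the distributions, the sample size, and the variable names are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# y is the control variate: its mean (zero) is known exactly.
y = rng.standard_normal(n)
# x is the quantity of interest: E[x] = 2, and x is highly correlated with y.
x = 2.0 + y + 0.1 * rng.standard_normal(n)

y_mean = 0.0                      # E[y], assumed available in closed form
x_cv = x - y + y_mean             # control-variate estimate, same mean as x

var_plain = x.var()               # close to 1.01
var_cv = x_cv.var()               # close to 0.01
```

Both `x` and `x_cv` have the same expectation, but the corrected variable has a much smaller variance because most of the fluctuation in `x` is shared with `y`. The correction is only usable when the mean of the control variate is available, which is the difficulty raised above for problem (4).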
3 Sample Average Gradient
Let the iterate at iteration be . To approximate the gradient in (4), SGD uses the gradient evaluated on a single sample , where is sampled uniformly from . The variance of is usually assumed to be bounded by a constant, as
where the expectation is taken w.r.t. both the random index and perturbation at iteration . Note that the gradient of regularizer does not contribute to the variance.
3.1 Exploiting the Model Structure
3.1.1 Stochastic Sample-Average Gradient (SSAG)
At iteration , the stochastic gradient of the loss for sample is . Thus, the gradient direction is determined by , while parameter only affects its scale. With this observation, we consider using as a control variate for , where may depend on past information but not on . Note that the gradient component is deterministic, and does not contribute to the construction of control variate. Using (5), the resultant gradient estimator is:
where is an estimate of . For example, can be defined as
so that can be incrementally updated as ’s are sampled. As
’s are i.i.d., by the law of large numbers, the sample average converges to the expected value .
The following shows that in (7) is a biased estimator of the gradient . As converges to , is still asymptotically unbiased.
Note that , where denotes the expectation w.r.t. . We assume that each can be easily computed. This is the case, for example, when the noise is dropout noise or additive zero-mean noise, and (van der Maaten et al., 2013). This suggests replacing in (7) by (which is equal to ), leading to the estimator:
The following shows that is unbiased, and also provides an upper bound of its variance.
, and . The bound is minimized when
For dropout noise and other additive noise with known variance, one can compute for each , and then average to obtain . However, evaluating the expectation in the numerator of (10) is infeasible.
Instead, we define as
for , and approximate the expectations in the numerator and denominator by moving averages:
We initialize , and set .
The resulting algorithm, called stochastic sample-average gradient (SSAG), is shown in Algorithm 1. Compared to S-MISO (Bietti & Mairal, 2017), SSAG is more computationally efficient: it does not require extra memory, and it needs only a single gradient evaluation (step 6) in each iteration.
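The following is a minimal sketch of an SSAG-style loop, written from the prose description above rather than from Algorithm 1. It assumes logistic loss, dropout noise (so that the expected perturbed sample equals the clean sample), and an $\ell_2$ regularizer. The least-squares choice of the scalar `a`, the fixed moving-average factor `beta`, the stepsize schedule, and all function names are assumptions made for illustration and need not match (10) and (11).

```python
import numpy as np

def dropout(x, p, rng):
    """Zero each feature with probability p and rescale by 1/(1-p) (mean-preserving)."""
    return x * (rng.random(x.shape) >= p) / (1.0 - p)

def cv_gradient(c, z, a, x_bar):
    """Loss-gradient estimate c*z corrected by the control variate a*(z - x_bar)."""
    return (c - a) * z + a * x_bar

def ssag_like(X, y, lam=1e-2, p=0.3, steps=5000, beta=0.99, seed=0):
    """SGD-like loop with one scalar control variate (hypothetical sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    x_bar = X.mean(axis=0)            # expected perturbed sample over index and noise
    w = np.zeros(d)
    num, den, a = 0.0, 1e-12, 0.0     # moving averages that define the scalar a
    for t in range(1, steps + 1):
        i = rng.integers(n)
        z = dropout(X[i], p, rng)
        c = -y[i] / (1.0 + np.exp(y[i] * (z @ w)))   # derivative of the logistic loss
        # a was fixed before z was drawn, so the estimate stays unbiased.
        v = cv_gradient(c, z, a, x_bar) + lam * w
        w -= v / (lam * t + 10.0)                    # decreasing stepsize (assumed)
        diff = z - x_bar
        num = beta * num + (1.0 - beta) * c * (z @ diff)
        den = beta * den + (1.0 - beta) * (diff @ diff)
        a = num / den                                # used from the next iteration on
    return w, a

def clean_loss(w, X, y):
    """Average logistic loss on the unperturbed samples."""
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

data_rng = np.random.default_rng(1)
X = data_rng.standard_normal((200, 5))               # synthetic data (hypothetical)
y = np.sign(X @ np.ones(5))
w_hat, a_hat = ssag_like(X, y)
```

Beyond the iterate, the loop keeps only the vector `x_bar` and three scalars, so the extra memory does not grow with the number of samples. Setting `a = 0` in `cv_gradient` recovers the plain SGD estimate.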
The following Proposition shows that in (11) is asymptotically optimal for appropriate choices of and .
If (i) ; (ii) ; (iii) ; (iv) ; and (v) , then
A simple choice is: , where . The following Proposition quantifies the convergence of to . In particular, when , the asymptotic bound in (12) is minimized when .
With assumptions (i)-(v) in Proposition 3, , and , we have
3.1.2 Stochastic SAGA (S-SAGA)
Recall that in (9), plays the role of in (5), and plays the role of . However, the corresponding and in (5) can be negatively correlated in some iterations. This is partly because in (9) does not depend on , though serves as a control variate for . Thus, it is better for each sample to have its own scaling factor, leading to the estimator:
where . Note that can be updated sequentially as:
S-SAGA needs additional space for and . However, as discussed in Section 3.1.1, for many types of noise. Hence, ’s do not need to be explicitly stored, and the additional space is reduced to . This is significantly smaller than that of S-MISO, which always requires additional space.
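The storage argument can be made concrete with a minimal sketch of an S-SAGA-style loop. It assumes logistic loss, dropout noise whose expectation equals the clean sample, and that each stored scalar is the most recent loss derivative seen for that sample, as in SAGA for linear models. These choices, the stepsize schedule, and all names are illustrative assumptions and are not a transcription of the estimator in the text.

```python
import numpy as np

def s_saga_like(X, y, lam=1e-2, p=0.3, steps=5000, seed=0):
    """SGD-like loop with one stored scalar per sample (hypothetical sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    a = np.zeros(n)          # per-sample scalars: O(n) additional memory
    m = np.zeros(d)          # running value of (1/n) * sum_j a[j] * X[j]
    for t in range(1, steps + 1):
        i = rng.integers(n)
        z = X[i] * (rng.random(d) >= p) / (1.0 - p)   # dropout-perturbed sample
        c = -y[i] / (1.0 + np.exp(y[i] * (z @ w)))    # derivative of the logistic loss
        # Per-sample control variate a[i] * z, whose mean over index and noise is m.
        v = (c - a[i]) * z + m + lam * w
        w -= v / (lam * t + 10.0)                     # decreasing stepsize (assumed)
        # Sequential update of m using the clean sample, so no perturbed
        # samples or per-sample vectors have to be stored.
        m += (c - a[i]) * X[i] / n
        a[i] = c
    return w, a, m

data_rng = np.random.default_rng(1)
X = data_rng.standard_normal((200, 5))                # synthetic data (hypothetical)
y = np.sign(X @ np.ones(5))
w_hat, a_hat, m_hat = s_saga_like(X, y)
```

Because the clean samples are already part of the training set, the only state added on top of SGD is the length-`n` array `a` and the single vector `m`.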
3.2 Convergence Analysis
In this section, we provide convergence results for SSAG and S-SAGA on problem (4).
For SSAG, we make the following additional assumptions.
is -strongly convex, i.e., .
for all .
Assume that for some and such that . For the sequence generated from SSAG, we have
where , and .
Note that this also satisfies the conditions in Proposition 3. The condition is crucial to obtaining the rate. It has been observed that underestimating can make convergence extremely slow (Nemirovski et al., 2009). When the model is -regularized, can be estimated by the corresponding regularization parameter.
To ensure that , SSAG has a time complexity of , where is the condition number.
The term is due to the initialization of and is amortized over multiple data passes. In contrast, the time complexity for SGD is , where is defined in (6) (Bottou et al., 2016). To compare with , we assume that the perturbed samples have finite variance :
The variance of the SGD estimator can be bounded as
Thus, the gradient variance of SGD has two terms, one involving and the other involving . In particular, if the derivative is constant, then , and only the perturbed sample variance contributes to the gradient variance of SGD. For a large class of functions including the logistic loss and smoothed hinge loss, is nearly constant in some regions. In this case, we have .
Besides Assumption 1, we assume the following:
Each is -strongly convex, i.e., .
For all , , where , , and is another randomly sampled noise for .
Assume that for some and such that . For the sequence generated from S-SAGA, we have
where , and .
Thus, S-SAGA has a convergence rate of . In comparison, the convergence rate of SGD is . Note that in (6) includes variance due to data sampling, while above only considers that due to noise. Since data sampling induces a much larger variation than that from perturbing the same sample, typically we have , and thus S-SAGA has faster convergence.
S-MISO considers the variance of the difference in gradients due to noise:
and its convergence rate is . The bounds for and are similar in form. However, can be small when the difference is small, while it is not the case for . In particular, when is a constant regardless of random perturbations, .
The following Corollary considers the time complexity of S-SAGA.
To ensure that , S-SAGA has a time complexity of .
A summary of the convergence results is shown in Table 1. As can be seen, by exploiting the linear model structure, SSAG has a smaller variance constant than SGD ( vs ) with a comparable space requirement. S-SAGA improves over S-MISO in terms of both iteration complexity and storage.
3.3 Acceleration By Iterate Averaging
The complexity bounds in Corollaries 1 and 2 depend quadratically on the condition number . This may be problematic on ill-conditioned problems. To alleviate this problem, one can use iterate averaging (Bietti & Mairal, 2017), which outputs
where is the total number of iterations. It can be easily seen that (16) can be efficiently implemented in an online fashion without the need for storing ’s:
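A weighted average of the iterates can be maintained online as in the following minimal sketch. The specific weights used in (16) are not reproduced here; the function takes arbitrary positive weights, and the function name and the example weights are hypothetical.

```python
import numpy as np

def online_weighted_average(iterates, weights):
    """Running weighted mean of the iterates, keeping only one extra vector."""
    avg = None
    total = 0.0
    for w_t, c_t in zip(iterates, weights):
        total += c_t
        if avg is None:
            avg = np.array(w_t, dtype=float)       # first iterate starts the average
        else:
            avg += (c_t / total) * (w_t - avg)     # shift the average toward w_t
    return avg

rng = np.random.default_rng(0)
iterates = rng.standard_normal((50, 3))            # stand-in for the iterates
weights = np.arange(1, 51, dtype=float)            # example: weight grows with t
w_bar = online_weighted_average(iterates, weights)
```

Each update touches only the current iterate and the running average, so no history of iterates has to be stored. Uniform averaging corresponds to the special case where all weights are equal.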
Assume that and such that . For the sequence generated from SSAG, we have
The stepsize and condition together imply that . Thus, when , the first term, which depends on , decays as , which is no better than (14). On the other hand, if , the first term decays at a faster rate.
When , to ensure that , SSAG with iterate averaging has a time complexity of , where .
Similarly, we have the following for S-SAGA.
Assume that for some and such that . For the sequence generated from S-SAGA, we have
The condition is satisfied when . Thus, the second term in is scaled by . This implies that the first term in (17) decays as when . On the other hand, when , the first term decays as . Thus, iterate averaging does not provide S-SAGA with much acceleration as compared to SSAG.
The following Corollary considers the case where (Johnson & Zhang, 2013).
Assume that . When , to ensure that , S-SAGA with iterate averaging has a time complexity of .
4.1 Logistic Regression with Dropout
Consider the -regularized logistic regression model with dropout noise, with dropout probability . This can be formulated as the following optimization problem:
where , is the feature vector of sample , and the corresponding class label. We vary . The smaller the
, the higher the condition number. Experiments are performed on two high-dimensional data sets from the LIBSVM archive (Table 2).
4.1.1 Comparison with SGD and S-MISO
The proposed SSAG and S-SAGA are compared with SGD and S-MISO. From Proposition 4, we use a slightly larger