Experimental code of stochastic optimization methods (Averaged SGD, SVRG, SAGA, and AMSVRG ...) for linear models.
We propose an optimization method for minimizing the finite sums of smooth convex functions. Our method incorporates an accelerated gradient descent (AGD) and a stochastic variance reduction gradient (SVRG) in a mini-batch setting. Unlike SVRG, our method can be directly applied to non-strongly and strongly convex problems. We show that our method achieves a lower overall complexity than the recently proposed methods that support non-strongly convex problems. Moreover, this method has a fast rate of convergence for strongly convex problems. Our experiments show the effectiveness of our method.
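We consider the minimization problem:
$$\min_{x \in \mathbb{R}^d} f(x) \overset{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (1)$$
where $f_1, \dots, f_n$ are smooth convex functions from $\mathbb{R}^d$ to $\mathbb{R}$. In machine learning, we often encounter optimization problems of this type, i.e., empirical risk minimization. For example, suppose we are given a sequence of training examples $(a_1, b_1), \dots, (a_n, b_n)$, where $a_i \in \mathbb{R}^d$ and $b_i \in \mathbb{R}$. If we set $f_i(x) = \frac{1}{2}(a_i^T x - b_i)^2$, then we obtain linear regression. If we set $f_i(x) = \log(1 + \exp(-b_i x^T a_i))$ (with $b_i \in \{-1, 1\}$), then we obtain logistic regression. Each $f_i$ may include smooth regularization terms. In this paper we make the following assumption.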
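As a minimal illustration of the finite-sum objective, the following sketch evaluates $f(x) = \frac{1}{n}\sum_i f_i(x)$ for the squared loss and the logistic loss. The function names and the two-example data set are hypothetical and chosen only for illustration.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def squared_loss(x, a, b):
    # f_i(x) = 0.5 * (a^T x - b)^2  (linear regression)
    return 0.5 * (dot(a, x) - b) ** 2

def logistic_loss(x, a, b):
    # f_i(x) = log(1 + exp(-b * a^T x)), with label b in {-1, +1}
    return math.log1p(math.exp(-b * dot(a, x)))

def finite_sum(loss, x, examples):
    # f(x) = (1/n) * sum_i f_i(x)
    return sum(loss(x, a, b) for a, b in examples) / len(examples)

# Hypothetical toy data: pairs (a_i, b_i).
examples = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
x = [0.0, 0.0]
print(finite_sum(squared_loss, x, examples))   # 0.5
print(finite_sum(logistic_loss, x, examples))  # log(2), about 0.6931
```

Both objectives share the same averaging structure; only the component function $f_i$ changes.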
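Assumption 1. Each convex function $f_i(x)$ is $L$-smooth, i.e., there exists $L > 0$ such that for all $x, y \in \mathbb{R}^d$,
$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|.$$
In part of this paper (the latter half of Section 4), we also assume that $f(x)$ is $\mu$-strongly convex.
Assumption 2. $f(x)$ is $\mu$-strongly convex, i.e., there exists $\mu > 0$ such that for all $x, y \in \mathbb{R}^d$,
$$f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle + \frac{\mu}{2}\|x - y\|^2.$$
Note that it is obvious that $L \ge \mu$.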
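The relation between the smoothness and strong convexity constants can be seen from a short derivation. The sketch below assumes the standard quadratic upper bound implied by $L$-smoothness of $f$ (which follows from the smoothness of each component by averaging) together with the quadratic lower bound given by strong convexity.

```latex
% For all x, y in R^d:
f(y) + \langle \nabla f(y), x - y \rangle + \frac{\mu}{2}\|x - y\|^2
  \;\le\; f(x) \;\le\;
f(y) + \langle \nabla f(y), x - y \rangle + \frac{L}{2}\|x - y\|^2
\quad\Longrightarrow\quad
\frac{\mu}{2}\|x - y\|^2 \;\le\; \frac{L}{2}\|x - y\|^2 .
```

Taking any $x \neq y$ and dividing by $\frac{1}{2}\|x - y\|^2$ gives $\mu \le L$.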
Several papers recently proposed effective methods (SAG [1, 2], SDCA [3, 4], SVRG, S2GD, Acc-Prox-SDCA, Prox-SVRG, MISO, SAGA, Acc-Prox-SVRG, mS2GD) for solving problem (1). These methods attempt to reduce the variance of the stochastic gradient and achieve linear convergence rates, like deterministic gradient descent, when $f$ is strongly convex. Moreover, because of the computational efficiency of each iteration, the overall complexities (total number of component gradient evaluations to find an $\epsilon$-accurate solution in expectation) of these methods are less than those of the deterministic and stochastic gradient descent methods.
An advantage of SAG and SAGA is that they support non-strongly convex problems. Although we can apply any of these methods to non-strongly convex functions by adding a slight $\ell_2$-regularization, this modification increases the difficulty of model selection. In the non-strongly convex case, the overall complexities of SAG and SAGA are $O((n + L)/\epsilon)$. This complexity is less than that of deterministic gradient descent, which has a complexity of $O(nL/\epsilon)$, and is a trade-off with $O(n\sqrt{L/\epsilon})$, which is the complexity of the AGD.
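To make the trade-off concrete, the sketch below plugs numbers into the usual order-of-magnitude complexities for the non-strongly convex case: $(n + L)/\epsilon$ for SAG and SAGA, $nL/\epsilon$ for gradient descent, and $n\sqrt{L/\epsilon}$ for AGD. Constants are dropped, and the chosen values of $n$, $L$, and $\epsilon$ are hypothetical, so the output should be read as an illustration of the orders only.

```python
import math

def sag_saga(n, L, eps):
    # Order of the SAG/SAGA complexity (constants dropped).
    return (n + L) / eps

def gd(n, L, eps):
    # Order of the deterministic gradient descent complexity.
    return n * L / eps

def agd(n, L, eps):
    # Order of the accelerated gradient descent complexity.
    return n * math.sqrt(L / eps)

L, eps = 1e4, 1e-2  # hypothetical problem constants
for n in (10, 10**4):
    print(n, sag_saga(n, L, eps), gd(n, L, eps), agd(n, L, eps))
```

With these values, AGD has the smaller count when $n = 10$, whereas SAG/SAGA has the smaller count when $n = 10^4$; both stay below gradient descent.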
In this paper we propose a new method that incorporates the AGD and SVRG in a mini-batch setting like Acc-Prox-SVRG . The difference between our method and Acc-Prox-SVRG is that our method incorporates , which is similar to Nesterov’s acceleration , whereas Acc-Prox-SVRG incorporates . Unlike SVRG and Acc-Prox-SVRG, our method is directly applicable to non-strongly convex problems and achieves an overall complexity of
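$$\tilde{O}\left(n + \min\left\{\frac{L}{\epsilon},\; n\sqrt{\frac{L}{\epsilon}}\right\}\right),$$
where the notation $\tilde{O}$ hides constant and logarithmic terms. This complexity is less than that of SAG, SAGA, and AGD. Moreover, in the strongly convex case, our method achieves a complexity
$$\tilde{O}\left(n + \min\left\{\kappa,\; n\sqrt{\kappa}\right\}\right),$$
where $\kappa$ is the condition number $L/\mu$. This complexity is the same as that of Acc-Prox-SVRG. Thus, our method converges quickly for non-strongly and strongly convex problems.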
In Section 4, we describe the general scheme of our method and prove an important lemma that gives us a novel insight for constructing specific algorithms. Moreover, we derive an algorithm that is applicable to non-strongly and strongly convex problems and derive its complexity, which shows fast convergence. Our method is a multi-stage scheme like SVRG, but it can be difficult to decide when we should restart a stage. Thus, in Section 5, we introduce some heuristics for determining the restarting time. In Section 6, we present experiments that show the effectiveness of our method.
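We first introduce some notation. In this section, $\|\cdot\|$ denotes a general norm on $\mathbb{R}^d$. Let $d(x): \mathbb{R}^d \to \mathbb{R}$ be a distance generating function (i.e., a 1-strongly convex smooth function with respect to $\|\cdot\|$). Accordingly, we define the Bregman divergence by
$$V_x(y) = d(y) - \left( d(x) + \langle \nabla d(x), y - x \rangle \right), \quad \forall x, y \in \mathbb{R}^d,$$
where $\langle \cdot, \cdot \rangle$ is the Euclidean inner product. The accelerated method reviewed here uses a gradient step and mirror descent steps and takes a linear combination of these points. That is,
$$\begin{aligned}
\text{(Convex combination)}\quad & x_{k+1} \leftarrow \tau_k z_k + (1 - \tau_k) y_k, \\
\text{(Gradient descent)}\quad & y_{k+1} \leftarrow \arg\min_{y \in \mathbb{R}^d} \left\{ \langle \nabla f(x_{k+1}), y - x_{k+1} \rangle + \frac{L}{2}\|y - x_{k+1}\|^2 \right\}, \\
\text{(Mirror descent)}\quad & z_{k+1} \leftarrow \arg\min_{z \in \mathbb{R}^d} \left\{ \alpha_{k+1} \langle \nabla f(x_{k+1}), z - z_k \rangle + V_{z_k}(z) \right\}.
\end{aligned}$$
Then, with appropriate parameters, $f(y_k)$ converges to the optimal value as fast as Nesterov's accelerated methods [14, 15] for non-strongly convex problems. Moreover, in the strongly convex case, we obtain the same fast convergence as Nesterov's methods by restarting this entire procedure.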
In the rest of the paper, we only consider the Euclidean norm, i.e., $\|\cdot\| = \|\cdot\|_2$.
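The combination of a gradient step, a mirror descent step, and a convex combination can be sketched in the Euclidean setting, where the mirror step with $d(x) = \frac{1}{2}\|x\|^2$ reduces to a plain gradient step with step size $\alpha_{k+1}$. The parameter choice $\alpha_{k+1} = (k+2)/(2L)$, $\tau_k = 1/(\alpha_{k+1} L)$ is an assumption borrowed from the usual analysis of this kind of scheme, not a value stated in this text, and the quadratic test function is hypothetical.

```python
def linear_coupling(grad, x0, L, T):
    """Deterministic accelerated scheme: convex combination, gradient step, mirror step."""
    y = list(x0)
    z = list(x0)
    for k in range(T):
        alpha = (k + 2) / (2.0 * L)   # assumed step size alpha_{k+1}
        tau = 1.0 / (alpha * L)       # assumed weight tau_k = 2 / (k + 2)
        # Convex combination of the mirror point z and the gradient point y.
        x = [tau * zi + (1.0 - tau) * yi for zi, yi in zip(z, y)]
        g = grad(x)
        # Gradient descent step with step size 1/L.
        y = [xi - gi / L for xi, gi in zip(x, g)]
        # Mirror descent step; Euclidean case, so it is a step from z along -g.
        z = [zi - alpha * gi for zi, gi in zip(z, g)]
    return y

# Hypothetical test problem: f(x) = 0.5 * (10 * x_1^2 + 0.1 * x_2^2), so L = 10, min f = 0.
coeffs = [10.0, 0.1]

def f(x):
    return 0.5 * sum(c * xi * xi for c, xi in zip(coeffs, x))

def grad(x):
    return [c * xi for c, xi in zip(coeffs, x)]

y_T = linear_coupling(grad, [1.0, 1.0], L=10.0, T=200)
print(f([1.0, 1.0]), f(y_T))
```

The gradient step makes steady progress on the objective, while the mirror step takes increasingly long steps; the convex combination balances the two.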
To ensure the convergence of stochastic gradient descent (SGD), the learning rate must decay to zero so that we can reduce the variance effect of the stochastic gradient. This slows down the convergence. Variance reduction techniques [5, 8, 6, 12] such as SVRG have been proposed to solve this problem. We review SVRG in a mini-batch setting [11, 12]. SVRG is a multi-stage scheme. During each stage, this method performs SGD iterations using the following direction,
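$$v_k = \nabla f_{I_k}(x_k) - \nabla f_{I_k}(x_0) + \nabla f(x_0),$$
where $x_0$ is the starting point of the stage, $k$ is an iteration index, $I_k = \{i_1, \dots, i_{b_k}\}$ is a uniformly randomly chosen size-$b_k$ subset of $\{1, \dots, n\}$, and $f_{I_k} = \frac{1}{b_k} \sum_{j=1}^{b_k} f_{i_j}$. Note that $v_k$ is an unbiased estimator of the gradient: $\mathbb{E}_{I_k}[v_k] = \nabla f(x_k)$, where $\mathbb{E}_{I_k}$ denotes the expectation with respect to $I_k$. A bound on the variance of $v_k$ is given in the following lemma, which is proved in the Supplementary Material.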
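Lemma 1. Suppose Assumption 1 holds, and let $x_* = \arg\min_{x \in \mathbb{R}^d} f(x)$. Conditioned on $x_k$, we have
$$\mathbb{E}_{I_k}\|v_k - \nabla f(x_k)\|^2 \le 4L \frac{n - b_k}{b_k (n - 1)} \left( f(x_k) - f(x_*) + f(x_0) - f(x_*) \right).$$
Due to this lemma, SVRG with $b_k = 1$ achieves a complexity of $O((n + \kappa)\log(1/\epsilon))$.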
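The unbiasedness of the mini-batch SVRG direction can be checked exactly on a small example by averaging over every size-$b$ subset. The sketch below assumes the direction has the form $\nabla f_I(x) - \nabla f_I(x_0) + \nabla f(x_0)$ with a snapshot point $x_0$; the one-dimensional quadratic components and the helper name `svrg_direction` are hypothetical.

```python
from itertools import combinations

def svrg_direction(grads, x, x0, batch):
    """Mini-batch SVRG direction: grad f_I(x) - grad f_I(x0) + grad f(x0)."""
    n = len(grads)
    full_at_x0 = sum(g(x0) for g in grads) / n
    b = len(batch)
    return sum(grads[i](x) - grads[i](x0) for i in batch) / b + full_at_x0

# Hypothetical components f_i(x) = 0.5 * c_i * (x - t_i)^2, with gradient c_i * (x - t_i).
params = [(1.0, 0.0), (2.0, 1.0), (3.0, -1.0), (4.0, 2.0)]
grads = [lambda x, c=c, t=t: c * (x - t) for c, t in params]

x, x0, b = 0.7, 0.2, 2
subsets = list(combinations(range(len(grads)), b))
# Exact expectation over a uniformly random size-b subset.
mean_v = sum(svrg_direction(grads, x, x0, I) for I in subsets) / len(subsets)
full_grad = sum(g(x) for g in grads) / len(grads)
print(mean_v, full_grad)
```

The two printed values agree up to floating-point rounding, because each index appears in the same number of subsets.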
We now introduce our Accelerated efficient Mini-batch SVRG (AMSVRG), which incorporates AGD and SVRG in a mini-batch setting. Our method is a multi-stage scheme similar to SVRG. During each stage, this method performs several APG-like iterations and uses the SVRG direction in a mini-batch setting. Each stage of AMSVRG is described in Figure 1.
Before we introduce the multi-stage scheme, we show the convergence of Algorithm 1. The following lemma is the key to the analysis of our method and gives us insight into how to construct algorithms.
To prove Lemma 2, additional lemmas are required, which are proved in the Supplementary Material.
(Stochastic Gradient Descent). Suppose Assumption 1 holds, and let . Conditioned on , it follows that for ,
(Stochastic Mirror Descent). Conditioned on , we have that for arbitrary ,
From now on we consider Algorithm 1 with option 1 and set
Consider Algorithm 1 with option 1 under Assumption 1. For , we choose such that . Then, we have
Moreover, if for , then it follows
Using Lemma 2 and
This proves the theorem because . ∎
Let be the minimum values satisfying the assumption of Theorem 1 for , i.e., and . Then, from Theorem 1, we have an upper bound on the overall complexity (total number of component gradient evaluations to obtain -accurate solution in expectation):
where we used the monotonicity of with respect to for the first inequality. Note that the notation also hides and .
In this subsection, we introduce AMSVRG, as described in Figure 2.
Assumption 3 (Boundedness). There is a compact subset of $\mathbb{R}^d$ that contains the sequence generated by AMSVRG.
Note that, if we change the initialization of to , the above method with this modification will achieve the same convergence for general convex problems without the boundedness assumption (c.f. supplementary materials). However, for the strongly convex case, this modified version is slower than the above scheme. Therefore, we consider the version described in Figure 2.
From Theorem 1, we can see that for small and (e.g. ), the expected value of the objective function is halved at every stage under the assumptions of Theorem 1. Hence, running AMSVRG for outer iterations achieves an -accurate solution in expectation. Here, we consider the complexity at stage to halve the expected objective value. Let be the minimum values satisfying the assumption of Theorem 1, i.e., and . If the initial objective gap in stage is larger than , then the complexity at stage is
where we used the monotonicity of with respect to for the first inequality. Note that by Assumption 3, are uniformly bounded and notation also hides . The above analysis implies the following theorem.
Next, we consider the strongly convex case. We assume that $f$ is a $\mu$-strongly convex function. In this case, we choose the distance generating function $d(x) = \frac{1}{2}\|x\|^2$, so that the Bregman divergence becomes $V_x(y) = \frac{1}{2}\|x - y\|^2$. Let the parameters be the same as in Theorem 2. Then, the expected value of the objective function is halved at every stage. With $\kappa$ denoting the condition number $L/\mu$, the complexity at each stage is
Thus, we have the following theorem.
This complexity is the same as that of Acc-Prox-SVRG. Note that for the strongly convex case, we do not need the boundedness assumption.
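Table 1 lists the overall complexities of the AGD, SAG, SVRG, SAGA, Acc-Prox-SVRG, and AMSVRG. The notation $\tilde{O}$ hides constant and logarithmic terms. By simple calculations, we see that
$$\frac{n\kappa}{n + \sqrt{\kappa}} = \frac{1}{2} H\left(\kappa,\; n\sqrt{\kappa}\right), \qquad \frac{nL}{\epsilon n + \sqrt{\epsilon L}} = \frac{1}{2} H\left(\frac{L}{\epsilon},\; n\sqrt{\frac{L}{\epsilon}}\right),$$
where $H(\cdot, \cdot)$ is the harmonic mean, whose order is the same as $\min\{\cdot, \cdot\}$. Thus, as shown in Table 1, the complexity of AMSVRG is less than or equal to that of the other methods in any situation. In particular, for non-strongly convex problems, our method potentially outperforms the others.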
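The claim that a harmonic mean has the same order as a minimum rests on the elementary bound $\min\{a, b\} \le H(a, b) \le 2\min\{a, b\}$ for positive $a, b$, where $H(a, b) = 2ab/(a + b)$. The sketch below checks this numerically, taking $a = \kappa$ and $b = n\sqrt{\kappa}$ as an assumed instance; the specific values of $n$ and $\kappa$ are hypothetical.

```python
import math

def harmonic_mean(a, b):
    return 2.0 * a * b / (a + b)

for n, kappa in [(10, 1e6), (10**6, 1e2), (10**3, 1e4)]:
    a, b = kappa, n * math.sqrt(kappa)
    h = harmonic_mean(a, b)
    # The harmonic mean is sandwiched between min and 2 * min.
    assert min(a, b) <= h <= 2 * min(a, b)
    # n * kappa / (n + sqrt(kappa)) is exactly half the harmonic mean of a and b.
    assert math.isclose(n * kappa / (n + math.sqrt(kappa)), h / 2)
    print(n, kappa, min(a, b), h)
```

Because the two quantities differ by at most a factor of two, replacing the harmonic mean by the minimum does not change the order of the complexity.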
The parameters of AMSVRG are essentially and because the appropriate values of both and can be expressed by as in (7). It may be difficult to choose an appropriate which is the restart time for Algorithm 1. So, we propose heuristics for determining the restart time.
First, we suppose that the number of components is sufficiently large such that the complexity of our method becomes . That is, for appropriate , is an upper bound on (which is the complexity term). Therefore, we estimate the restart time as the minimum index that satisfies . This estimated value is upper bound on (in terms of the order). In this paper, we call this restart method R1.
Second, we propose an adaptive restart method using SVRG. In a strongly convex case, we can easily see that if we restart the AGD for general convex problems every , then the method achieves a linear convergence similar to that for strongly convex problems. The drawback of this restart method is that the restarting time depends on an unknown parameter , so several papers [18, 19, 20] have proposed effective adaptive restart methods. Moreover,  showed that this technique also performs well for general convex problems. Inspired by their study, we propose an SVRG-based adaptive restart method called R2. That is, if
then we return and start the next stage.
Third, we propose the restart method R3, which is a combination of the above two ideas. When exceeds , we restart Algorithm 1, and when
we return and restart Algorithm 1.
In this section, we compare AMSVRG with SVRG and SAGA. We ran a regularized multi-class logistic regression on mnist and covtype and a regularized binary-class logistic regression on rcv1. The datasets and their descriptions can be found at the LIBSVM website (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). In these experiments, we varied the regularization parameter over a set of values. We ran AMSVRG with several values of its two parameters and then chose the best ones.
The results are shown in Figure 3. The horizontal axis is the number of single-component gradient evaluations. Our methods performed well and outperformed the other methods in some cases. For mnist and covtype, AMSVRG R1 and R3 converged quickly, and for rcv1, AMSVRG R2 worked very well. This tendency was more remarkable when the regularization parameter was small.
We proposed a method that incorporates the accelerated gradient method and SVRG in an increasing mini-batch setting. We showed that our method achieves a fast convergence complexity for non-strongly and strongly convex problems.
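Lemma A. Let $\{\xi_i\}_{i=1}^{n}$ be a set of vectors in $\mathbb{R}^d$ and let $\bar{\xi}$ denote the average of $\{\xi_i\}_{i=1}^{n}$. Let $I$ denote a uniform random variable representing a size-$b$ subset of $\{1, \dots, n\}$. Then, it follows that
$$\mathbb{E}_I \left\| \frac{1}{b} \sum_{i \in I} \xi_i - \bar{\xi} \right\|^2 = \frac{n - b}{b(n - 1)} \cdot \frac{1}{n} \sum_{i=1}^{n} \|\xi_i - \bar{\xi}\|^2.$$
Proof. We denote a size-$b$ subset of $\{1, \dots, n\}$ by $S = \{i_1, \dots, i_b\}$ and denote $\xi_i - \bar{\xi}$ by $\tilde{\xi}_i$. Then,
$$\mathbb{E}_I \left\| \frac{1}{b} \sum_{i \in I} \tilde{\xi}_i \right\|^2 = \frac{1}{C(n, b)} \sum_{S} \left\| \frac{1}{b} \sum_{j=1}^{b} \tilde{\xi}_{i_j} \right\|^2 = \frac{1}{b^2 C(n, b)} \sum_{S} \left( \sum_{j=1}^{b} \|\tilde{\xi}_{i_j}\|^2 + 2 \sum_{j < k} \langle \tilde{\xi}_{i_j}, \tilde{\xi}_{i_k} \rangle \right),$$
where $C(\cdot, \cdot)$ is a combination. By symmetry, each $\tilde{\xi}_i$ appears $C(n-1, b-1)$ times and each pair $\tilde{\xi}_i, \tilde{\xi}_j$ for $i < j$ appears $C(n-2, b-2)$ times in $\sum_S$. Therefore, we have
$$\mathbb{E}_I \left\| \frac{1}{b} \sum_{i \in I} \tilde{\xi}_i \right\|^2 = \frac{C(n-1, b-1)}{b^2 C(n, b)} \sum_{i=1}^{n} \|\tilde{\xi}_i\|^2 + \frac{2\, C(n-2, b-2)}{b^2 C(n, b)} \sum_{i < j} \langle \tilde{\xi}_i, \tilde{\xi}_j \rangle = \frac{1}{bn} \sum_{i=1}^{n} \|\tilde{\xi}_i\|^2 + \frac{2(b-1)}{bn(n-1)} \sum_{i < j} \langle \tilde{\xi}_i, \tilde{\xi}_j \rangle.$$
Since $0 = \left\| \sum_{i=1}^{n} \tilde{\xi}_i \right\|^2 = \sum_{i=1}^{n} \|\tilde{\xi}_i\|^2 + 2 \sum_{i < j} \langle \tilde{\xi}_i, \tilde{\xi}_j \rangle$, we have
$$\mathbb{E}_I \left\| \frac{1}{b} \sum_{i \in I} \tilde{\xi}_i \right\|^2 = \left( \frac{1}{bn} - \frac{b-1}{bn(n-1)} \right) \sum_{i=1}^{n} \|\tilde{\xi}_i\|^2 = \frac{n - b}{b(n - 1)} \cdot \frac{1}{n} \sum_{i=1}^{n} \|\tilde{\xi}_i\|^2.$$
This finishes the proof of Lemma A. ∎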
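The identity for sampling without replacement, namely that the variance of a size-$b$ subset average equals $\frac{n-b}{b(n-1)}$ times the population variance, can be verified exactly by enumerating all subsets. The sketch below uses scalars in place of vectors for brevity; the data values and function names are hypothetical.

```python
from itertools import combinations

def subset_mean_variance(xs, b):
    """Exact E_I |mean_{i in I} xs[i] - mean(xs)|^2 over all size-b subsets I."""
    n = len(xs)
    mu = sum(xs) / n
    subsets = list(combinations(range(n), b))
    return sum((sum(xs[i] for i in I) / b - mu) ** 2 for I in subsets) / len(subsets)

def finite_population_formula(xs, b):
    """(n - b) / (b * (n - 1)) times the population variance of xs."""
    n = len(xs)
    mu = sum(xs) / n
    return (n - b) / (b * (n - 1)) * sum((x - mu) ** 2 for x in xs) / n

xs = [0.5, -1.0, 2.0, 3.5, -0.25]  # hypothetical scalar data
for b in range(1, len(xs) + 1):
    print(b, subset_mean_variance(xs, b), finite_population_formula(xs, b))
```

For $b = 1$ the factor is $1$ (plain single-sample variance), and for $b = n$ it is $0$, since the full batch reproduces the average exactly.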
We now prove Lemma 1.
We set . Using Lemma A and
conditional variance of is as follows