Stochastic optimization (SO) is frequently encountered in a vast number of areas, including telecommunication, medicine, and finance, to name but a few (Shapiro et al., 2014). SO aims to minimize an objective function given in the form of an expectation. Formally, the problem can be formulated as
$$\min_{\mathbf{w} \in \mathcal{W}} \ F(\mathbf{w}) = \mathrm{E}_{\xi}\big[f(\mathbf{w}; \xi)\big]$$
where $f(\cdot; \xi)$ is a random function sampled from a distribution $\mathbb{P}$. A well-known special case is risk minimization in machine learning, whose objective function is
$$\min_{\mathbf{w} \in \mathcal{W}} \ F(\mathbf{w}) = \mathrm{E}_{(\mathbf{x}, y)}\big[\ell(\mathbf{w}; \mathbf{x}, y)\big]$$
where $(\mathbf{x}, y)$ denotes a random instance-label pair sampled from a certain distribution $\mathbb{P}$, $\mathbf{w}$ is the model for prediction, and $\ell(\cdot; \cdot, \cdot)$ is a loss that measures the prediction error (Vapnik, 1998).
In this paper, we focus on stochastic convex optimization (SCO), in which both the domain $\mathcal{W}$ and the expected function $F(\cdot)$ are convex. A basic difficulty of solving stochastic optimization problems is that the distribution $\mathbb{P}$ is generally unknown, and even if it is known, it is hard to evaluate the expectation exactly (Nemirovski et al., 2009). To address this challenge, two different approaches have been proposed: sample average approximation (SAA) (Kim et al., 2015) and stochastic approximation (SA) (Kushner and Yin, 2003). SAA collects a set of random functions sampled from $\mathbb{P}$, and constructs their empirical average to approximate the expected function $F(\cdot)$. In contrast, SA tackles the stochastic optimization problem directly, at each iteration using a noisy observation of $F(\cdot)$ to improve the current iterate.
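To make the SAA/SA contrast concrete, here is a minimal one-dimensional sketch (our own toy example, not from the paper): SAA minimizes the empirical average built from all samples at once, while SA consumes one noisy gradient per iteration.

```python
import random

random.seed(0)

# Hypothetical 1-d problem: minimize F(w) = E[(w - xi)^2] with xi ~ N(2, 1).
# The minimizer of F is E[xi] = 2; both approaches should approach it.
samples = [random.gauss(2.0, 1.0) for _ in range(2000)]

# --- SAA: build the empirical average once, then minimize it exactly. ---
# For f(w; xi) = (w - xi)^2 the empirical minimizer is the sample mean.
w_saa = sum(samples) / len(samples)

# --- SA: stochastic gradient descent, one noisy gradient per iteration. ---
w_sa = 0.0
for t, xi in enumerate(samples, start=1):
    grad = 2.0 * (w_sa - xi)      # gradient of f(w; xi) at the current iterate
    w_sa -= grad / (2.0 * t)      # 1/t-type step size suited to strong convexity
```

For this quadratic loss both estimates coincide with the sample mean; in general, SA trades some accuracy for a much lower per-iteration cost.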
Compared with SAA, SA is more efficient due to the low computational cost per iteration, and has received significant research interest from the optimization and machine learning communities (Zhang, 2004; Duchi et al., 2011; Ge et al., 2015; Wang et al., 2017). The performance of SA algorithms is typically measured by the excess risk:
$$F(\mathbf{w}_T) - \min_{\mathbf{w} \in \mathcal{W}} F(\mathbf{w})$$
where $\mathbf{w}_T$ is the solution returned after $T$ iterations. For Lipschitz continuous convex functions, stochastic gradient descent (SGD) achieves the unimprovable $O(1/\sqrt{T})$ rate of convergence. Alternatively, if the optimization problem has certain curvature properties, faster rates are sometimes possible. Specifically, for smooth functions, SGD is equipped with an $O(1/T + \sqrt{F_*/T})$ risk bound, where $F_* = \min_{\mathbf{w} \in \mathcal{W}} F(\mathbf{w})$ is the minimal risk (Srebro et al., 2010). Thus, the convergence rate for smooth functions could be faster than $O(1/\sqrt{T})$ when the minimal risk is small. For strongly convex functions, the convergence rate can also be improved to $O(1/(\lambda T))$, where $\lambda$ is the modulus of strong convexity (Hazan and Kale, 2011).
From the above discussion, we observe that either smoothness or strong convexity can be exploited to improve the convergence rate of SA. This observation motivates subsequent studies that boost the convergence rate by considering smoothness and strong convexity simultaneously. However, existing results are unsatisfactory because they either rely on strong assumptions (Mahdavi and Jin, 2013; Schmidt and Roux, 2013), are only applicable to unconstrained domains (Moulines and Bach, 2011; Needell et al., 2014), or are limited to finite-sum problems (Roux et al., 2012; Shalev-Shwartz and Zhang, 2013; Johnson and Zhang, 2013). This paper demonstrates that for the general SO problem, the convergence rate of SA can be faster than $O(1/(\lambda T))$ when both smoothness and strong convexity are present and the minimal risk $F_*$ is small. Our work is similar in spirit to a recent study of SAA (Zhang et al., 2017a), which also establishes faster rates under similar conditions. The main contributions of our paper are summarized below.
First, we propose a fast algorithm for stochastic approximation (FASA), which applies epoch gradient descent (Epoch-GD) (Hazan and Kale, 2011) with a carefully designed initial solution and step size. Let $\kappa$ be the condition number and $\alpha$ be some small constant. Our theoretical analysis shows that, in expectation, FASA achieves a risk bound that could be faster than $O(1/(\lambda T))$ when $F_*$ is small, provided the number of iterations $T$ is sufficiently large, and that is fastest when $F_*$ vanishes.
Second, to further benefit from a small risk, we propose to use a fixed step size in Epoch-GD, and establish a risk bound, holding in expectation, under which the excess risk reduces exponentially until reaching a level determined by $F_*$; if $F_* = 0$, we obtain a global linear convergence.
2 Related Work
In this section, we review related work on SA and SAA.
2.1 Stochastic Approximation (SA)
For Lipschitz continuous convex functions, stochastic gradient descent (SGD) exhibits the optimal $O(1/\sqrt{T})$ risk bound (Nemirovski and Yudin, 1983; Zinkevich, 2003). When the random function is nonnegative and smooth, SGD (with a suitable step size) has a risk bound of $O(1/T + \sqrt{F_*/T})$, which becomes $O(1/T)$ if the minimal risk $F_* = 0$ (Srebro et al., 2010, Corollary 4). If the expected function is $\lambda$-strongly convex, some variants of SGD (Hazan and Kale, 2011, 2014; Rakhlin et al., 2012; Shamir and Zhang, 2013) achieve an $O(1/(\lambda T))$ rate, which is known to be minimax optimal (Agarwal et al., 2012). For the square loss and the logistic loss, an $O(1/T)$ rate is attainable without strong convexity (Bach and Moulines, 2013). When the random function is exponentially concave, the online Newton step (ONS) is equipped with an $O(d/T)$ risk bound, where $d$ is the dimensionality (Hazan et al., 2007; Mahdavi et al., 2015). When the expected function is both smooth and strongly convex, we still have the $O(1/(\lambda T))$ convergence rate, but with a smaller constant (Ghadimi and Lan, 2012). Specifically, the constant in the big-$O$ notation depends on the variance of the stochastic gradient instead of the maximum norm of the gradient.
Some studies have established convergence rates faster than $O(1/(\lambda T))$ when both smoothness and strong convexity are present. Moulines and Bach (2011) and Needell et al. (2014) demonstrate that the distance between the SGD iterate and the optimal solution decreases at a linear rate in the beginning, but their results are limited to unconstrained problems. When an upper bound of the minimal risk $F_*$ is available, Mahdavi and Jin (2013) show that it is possible to reduce the excess risk at a linear rate until a certain level. Under a strong growth condition, Schmidt and Roux (2013) prove that SGD can achieve a global linear rate. Recently, a variety of variance reduction techniques have been proposed that yield faster rates for SA (Roux et al., 2012; Shalev-Shwartz and Zhang, 2013; Johnson and Zhang, 2013). However, these methods are restricted to the special case in which the expected function is a finite sum, and thus cannot be applied when the distribution is unknown. As can be seen, existing fast rates for SA are restricted to special problems or rely on strong assumptions. We provide detailed comparisons in Section 3 to illustrate the advantages of this study: our setting is more general and our convergence rates are faster.
2.2 Sample Average Approximation (SAA)
SAA is also referred to as empirical risk minimization (ERM) in machine learning. In the literature, there is a rich body of theory for SAA (Kim et al., 2015) and ERM (Vapnik, 1998). In the following, we only discuss related work on SAA from the past decade.
To present the results in SAA, we use $n$ to denote the total number of training samples. When the random function is Lipschitz continuous, Shalev-Shwartz et al. (2009) establish an $O(1/\sqrt{n})$ risk bound. When the random function is $\lambda$-strongly convex and Lipschitz continuous, Shalev-Shwartz et al. (2009) further prove an $O(1/(\lambda n))$ risk bound which holds in expectation. When the random function is exponentially concave, an $O(d/n)$ risk bound is attainable (Koren and Levy, 2015; Mehta, 2016). Lower bounds of ERM for stochastic optimization are investigated by Feldman (2016). In a recent work, Zhang et al. (2017a) establish an $O(1/n)$-type risk bound when the random function is smooth and the expected function is Lipschitz continuous. The most surprising result is that when the random function is smooth and the expected function is Lipschitz continuous and $\lambda$-strongly convex, Zhang et al. (2017a) prove an $O(1/n^2)$-type risk bound when $n$ is sufficiently large. Thus, the convergence rate of ERM could be faster than $O(1/(\lambda n))$ when both smoothness and strong convexity are present and the number of training samples is large enough.
3 Our Results
We first introduce assumptions used in our analysis, then present our algorithms and theoretical guarantees.
The random function $f(\cdot; \xi)$ is nonnegative.
The random function $f(\cdot; \xi)$ is (almost surely) $L$-smooth over $\mathcal{W}$, that is,
$$\|\nabla f(\mathbf{w}; \xi) - \nabla f(\mathbf{w}'; \xi)\| \le L \|\mathbf{w} - \mathbf{w}'\|, \quad \forall \mathbf{w}, \mathbf{w}' \in \mathcal{W}.$$
The expected function $F(\cdot)$ is $\lambda$-strongly convex over $\mathcal{W}$, that is,
$$F(\mathbf{w}) \ge F(\mathbf{w}') + \langle \nabla F(\mathbf{w}'), \mathbf{w} - \mathbf{w}' \rangle + \frac{\lambda}{2} \|\mathbf{w} - \mathbf{w}'\|^2, \quad \forall \mathbf{w}, \mathbf{w}' \in \mathcal{W}.$$
The gradient of the random function is (almost surely) upper bounded by $G$, that is,
$$\|\nabla f(\mathbf{w}; \xi)\| \le G, \quad \forall \mathbf{w} \in \mathcal{W}.$$
We have the following comments regarding our assumptions.
Input: parameters $\eta_1$, $T_1$, $T$, and $\mathbf{w}_1$
3.2 A General Algorithm
We first introduce a general algorithm for SA, which always achieves an $O(1/(\lambda T))$ rate, and becomes faster when $F_*$ is small.
3.2.1 Fast Algorithm for Stochastic Approximation (FASA)
Our fast algorithm for stochastic approximation (FASA) takes epoch gradient descent (Epoch-GD) as a subroutine. Although Hazan and Kale (2011) have established the $O(1/(\lambda T))$ convergence rate of Epoch-GD under the strong convexity condition, they did not utilize smoothness in their analysis. The procedures of Epoch-GD and FASA are described in Algorithm 1 and Algorithm 2, respectively.
Epoch-GD is an extension of stochastic gradient descent (SGD). It divides the optimization process into a sequence of epochs. In each epoch, Epoch-GD applies SGD multiple times, and the averaged iterate is passed to the next epoch. In the algorithm, we use $\Pi_{\mathcal{W}}(\cdot)$ to denote the projection onto the nearest point in $\mathcal{W}$. There are four input parameters of Epoch-GD: (1) $\eta_1$, the step size used in the first epoch; (2) $T_1$, the size of the first epoch; (3) $T$, the total number of stochastic gradients that can be consumed; and (4) $\mathbf{w}_1$, the initial solution. In each consecutive epoch, the step size decreases exponentially and the size of the epoch increases exponentially.
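As a rough illustration of the schedule just described, the following sketch (our own, with a simple interval domain standing in for $\mathcal{W}$; Algorithm 1 should be consulted for the exact procedure) halves the step size and doubles the epoch size, passing the averaged iterate between epochs.

```python
import random

def project(w, radius=10.0):
    """Projection onto the domain W, here a centered interval (an assumption of the sketch)."""
    return max(-radius, min(radius, w))

def epoch_gd(grad_oracle, eta1, t1, total, w0):
    """Epoch-GD sketch: run projected SGD in epochs, halving the step size and
    doubling the epoch size; the averaged iterate seeds the next epoch."""
    w, eta, t_k, used = w0, eta1, t1, 0
    while used + t_k <= total:
        iterates = []
        for _ in range(t_k):
            w = project(w - eta * grad_oracle(w))
            iterates.append(w)
        used += t_k
        w = sum(iterates) / len(iterates)  # averaged iterate passed forward
        eta, t_k = eta / 2.0, 2 * t_k      # exponential step/epoch schedule
    return w

# Toy strongly convex objective F(w) = E[(w - xi)^2] with xi ~ N(1, 0.1).
rng = random.Random(0)
w_hat = epoch_gd(lambda w: 2.0 * (w - rng.gauss(1.0, 0.1)),
                 eta1=0.25, t1=8, total=4096, w0=5.0)
```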
In FASA, we first invoke Epoch-GD with an arbitrary initial solution and a portion of the total budget of stochastic gradients. The purpose of this step is to obtain a good solution at the expense of part of the budget. (In this step, Epoch-GD can be replaced with any algorithm that achieves the optimal rate for strongly convex stochastic optimization, e.g., SGD with $\alpha$-suffix averaging (Rakhlin et al., 2012).) Then, Epoch-GD is invoked again with the returned solution as its initial solution and the remaining budget of stochastic gradients. This time, we set a large epoch size to exploit the fact that the initial solution is of high quality. The convergence rate of FASA is given below.
The above theorem implies that when the number of iterations $T$ is large enough, FASA achieves a rate of convergence that is faster than $O(1/(\lambda T))$ when the minimal risk $F_*$ is small. In particular, when $F_*$ vanishes, the convergence rate is improved further. Note that the upper bound has an exponential dependence on $1/\alpha$, so it is meaningful only when $\alpha$ is chosen as a small constant.
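The two-phase structure of FASA can be sketched as follows (our own simplification: plain SGD stands in for the Epoch-GD subroutine, and the even split of the budget between the two calls is an assumption of the sketch, not the paper's setting).

```python
import random

def sgd(grad_oracle, w0, steps, step_size):
    """Run SGD with a step-size schedule and return the averaged iterate."""
    w, avg = w0, 0.0
    for t in range(1, steps + 1):
        w -= step_size(t) * grad_oracle(w)
        avg += (w - avg) / t          # running average of the iterates
    return avg

rng = random.Random(1)
# Toy objective f(w; xi) = (w - xi)^2 with xi ~ N(1, 0.1); modulus of strong convexity 2.
grad = lambda w: 2.0 * (w - rng.gauss(1.0, 0.1))

budget = 4000
# Phase 1: burn half of the budget to obtain a good warm-start solution.
w_bar = sgd(grad, w0=5.0, steps=budget // 2, step_size=lambda t: 1.0 / (2.0 * t))
# Phase 2: restart from the warm start; a high-quality initial point is what
# allows the second run to use a more aggressive epoch schedule.
w_hat = sgd(grad, w0=w_bar, steps=budget // 2, step_size=lambda t: 1.0 / (2.0 * t))
```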
Note that our algorithm is translation-invariant, i.e., it does not change if we translate the function by a constant. Since the upper bound in Theorem 1 depends on the minimal risk $F_*$, one may attempt to subtract a constant from the function to make the bound tighter. However, because of the nonnegativity requirement in Assumption 1, the best we can do is to redefine
$$\hat{f}(\mathbf{w}; \xi) = f(\mathbf{w}; \xi) - \min_{\mathbf{w}' \in \mathcal{W}} f(\mathbf{w}'; \xi)$$
and replace $F_*$ in Theorem 1 with $F_* - \mathrm{E}_{\xi}\big[\min_{\mathbf{w} \in \mathcal{W}} f(\mathbf{w}; \xi)\big]$.
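The translation-invariance claim is easy to verify numerically: adding a constant to the objective leaves every gradient, and hence every iterate, unchanged (a toy deterministic example of ours).

```python
def run_sgd(grad, w0, steps, eta):
    """Plain gradient descent driven only by the gradient oracle."""
    w = w0
    for _ in range(steps):
        w -= eta * grad(w)
    return w

# f(w) = (w - 3)^2 and its translate f(w) + 7 share the same gradient,
# so gradient-based methods produce identical iterates on both.
grad_f = lambda w: 2.0 * (w - 3.0)          # gradient of f
grad_f_shifted = lambda w: 2.0 * (w - 3.0)  # gradient of f + 7: the constant vanishes

a = run_sgd(grad_f, 0.0, 100, 0.1)
b = run_sgd(grad_f_shifted, 0.0, 100, 0.1)
```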
To simplify Theorem 1, we provide the following corollary by setting .
Suppose . Under the same conditions as Theorem 1, we have
3.2.2 Comparisons with Previous Results
For smooth and strongly convex functions, Ghadimi and Lan (2012, Proposition 9) have established an $O(\sigma^2/(\lambda T))$ rate for the expected risk, where $\sigma^2$ is the variance of the stochastic gradient. Note that this rate is worse than that in Corollary 2 because $\sigma^2$ is a constant in general, even when $F_*$ is small. For example, consider the problem of linear regression
$$\min_{\mathbf{w} \in \mathcal{W}} \ F(\mathbf{w}) = \frac{1}{2} \mathrm{E}_{(\mathbf{x}, y)}\big[(\mathbf{w}^\top \mathbf{x} - y)^2\big]$$
and assume $y = \mathbf{w}_*^\top \mathbf{x} + \varepsilon$, where $\varepsilon$ is Gaussian random noise with mean $0$ and variance $\sigma_\varepsilon^2$, independent of $\mathbf{x}$, and $\mathbf{w}_* \in \mathcal{W}$. Then, $F_* = \sigma_\varepsilon^2 / 2$, which approaches zero as $\sigma_\varepsilon \to 0$. On the other hand, the variance of the stochastic gradient at a solution $\mathbf{w}$ can be decomposed into the sum of $\sigma_\varepsilon^2 \, \mathrm{E}[\|\mathbf{x}\|^2]$ and a term caused by the randomness of $\mathbf{x}$. Even if there is no noise, i.e., $\sigma_\varepsilon = 0$, the variance is nonzero due to the randomness of $\mathbf{x}$.
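A quick Monte Carlo check of this point (our own toy instance): in noiseless one-dimensional linear regression the minimal risk is zero, yet the stochastic gradient at any suboptimal $w$ still has nonzero variance because $x$ is random.

```python
import random

rng = random.Random(2)
w_star, w = 1.0, 2.0   # 1-d linear regression, evaluated at a suboptimal w

# f(w; x, y) = (w * x - y)^2 / 2 with noiseless labels y = w_star * x,
# so the minimal risk F_* is exactly 0 (and the gradient at w_star is always 0).
grads = []
for _ in range(10000):
    x = rng.gauss(0.0, 1.0)
    y = w_star * x                  # no label noise: sigma = 0
    grads.append((w * x - y) * x)   # stochastic gradient at w, equals (w - w_star) * x^2

mean_g = sum(grads) / len(grads)
var_g = sum((g - mean_g) ** 2 for g in grads) / len(grads)
```

Here `mean_g` estimates the full gradient $\mathrm{E}[x^2](w - w_*) = 1$, while `var_g` stays bounded away from zero even though $F_* = 0$.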
For unconstrained problems, Moulines and Bach (2011) and Needell et al. (2014) have analyzed the distance between the SGD iterate and the optimal solution under the smoothness and strong convexity condition. In particular, Theorem 1 of Moulines and Bach (2011) (with and ) implies the following convergence rate for the expected risk
which is also worse than our Corollary 2 because becomes a constant when . We note that it is possible to extend the analysis of Needell et al. (2014) to constrained problems, but the convergence rate becomes slower, and thus is worse than our rate. Detailed discussions about how to simplify and extend the result of Needell et al. (2014) are provided in Appendix A.
The convergence rate in Corollary 2 matches the state-of-the-art convergence rate of SAA (Zhang et al., 2017a). Specifically, under similar conditions, Zhang et al. (2017a, Theorem 3) have proved an $O(1/n^2)$-type risk bound for SAA when $n$ is sufficiently large. Compared with the results of Zhang et al. (2017a), our theoretical guarantees have the following advantages:
The lower bound of $T$ in our results is independent of the dimensionality, and thus our results can be applied to infinite-dimensional problems, e.g., learning with kernels. In contrast, the lower bound of $n$ given by Zhang et al. (2017a, Theorem 3) depends on the dimensionality.
3.3 A Special Algorithm for Small Risk
The convergence rate of FASA remains polynomial in $T$, even when $F_*$ is $0$. In the following, we develop a special algorithm for the case that $F_*$ is small. The new algorithm achieves a linear convergence when $F_*$ is small, although it may not perform well otherwise.
3.3.1 Epoch Gradient Descent with Fixed Step Size (Epoch-GD-F)
The new algorithm is a variant of Epoch-GD in which the step size, as well as the size of each epoch, is fixed. We name the new algorithm epoch gradient descent with fixed step size (Epoch-GD-F), and summarize it in Algorithm 3. Epoch-GD-F has four parameters: (1) $\eta$, the fixed step size; (2) $T'$, the size of each epoch; (3) $T$, the total number of stochastic gradients that can be consumed; and (4) $\mathbf{w}_1$, the initial solution. We bound the excess risk of Epoch-GD-F in the following theorem.
Input: parameters $\eta$, $T'$, $T$, and $\mathbf{w}_1$
From the above theorem, we observe that the excess risk is upper bounded by two terms: the first one decreases exponentially w.r.t. the number of epochs, and the second one depends on the minimal risk $F_*$. With a suitable choice of the parameters, the excess risk reduces exponentially until reaching a level determined by $F_*$. Note that if $F_* = 0$, we obtain a global linear convergence.
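The fixed-schedule structure of Epoch-GD-F can be sketched as follows (our own illustration with a toy zero-risk objective and an interval domain; see Algorithm 3 for the exact procedure). With $F_* = 0$, the iterate contracts by a constant factor per epoch, matching the linear-convergence regime discussed above.

```python
import random

def project(w, radius=10.0):
    """Domain W taken as a centered interval (an assumption of this sketch)."""
    return max(-radius, min(radius, w))

def epoch_gd_f(grad_oracle, eta, t1, total, w0):
    """Epoch-GD-F sketch: both the step size and the epoch size stay fixed;
    only the averaged iterate is passed between epochs."""
    w = w0
    for _ in range(total // t1):
        iterates = []
        for _ in range(t1):
            w = project(w - eta * grad_oracle(w))
            iterates.append(w)
        w = sum(iterates) / len(iterates)  # averaged iterate seeds the next epoch
    return w

rng = random.Random(3)
# Toy problem with zero minimal risk: f(w; xi) = xi * w^2 / 2 with xi > 0,
# so every stochastic gradient vanishes at the optimum w = 0.
w_hat = epoch_gd_f(lambda w: rng.uniform(0.5, 1.5) * w,
                   eta=0.2, t1=50, total=2000, w0=5.0)
```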
To better illustrate the convergence rate in Theorem 3, we present the iteration complexity of Epoch-GD-F.
3.3.2 Comparisons with Previous Results
When an upper bound of $F_*$ is given beforehand, Mahdavi and Jin (2013) show that when
their stochastic algorithm is able to find a solution
such that with high probability
Their algorithm needs prior knowledge of an upper bound of $F_*$, while our algorithm does not.
The final risk of their solution is upper bounded in terms of the upper bound of $F_*$, while in our case, the risk is upper bounded in terms of $F_*$ itself, which is smaller than its upper bound.
Their sample complexity depends linearly on the dimensionality $d$, while ours is dimensionality-independent. Thus, our results can be applied to the non-parametric setting, where hypotheses lie in a functional space of infinite dimension.
The dependence of their sample complexity on and is much higher than ours.
This strong growth condition requires that all stochastic gradients vanish at the optimal solution, which is itself a necessary condition for $F_* = 0$, because all the random functions are nonnegative. In this case, our Theorem 3 also achieves a linear rate of the same order. However, our results have the following advantages:
Our Theorem 3 is more general because it covers the case that $F_*$ is nonzero.
Our results are applicable even when there is a domain constraint.
For unconstrained problems, Theorem 2.1 of Needell et al. (2014) with a suitable step size also implies the following rate
which is slower than our rate in Theorem 3, because of the additional dependence on in the second term. Besides, Needell et al. (2014, (2.4) and (2.2)) provided the iteration complexity of their algorithm, as well as that of Moulines and Bach (2011) when the minimal risk is known. Specifically, the iteration complexities of Moulines and Bach (2011) and Needell et al. (2014) for finding an -optimal solution are
respectively. In this case, our Theorem 3 with implies the following iteration complexity
Compared with the lower bounds in (10), our iteration complexity is better because (i) it has a smaller dependence on , and (ii) it holds for constrained problems.
Our analysis follows from well-known and standard techniques, including the analysis of stochastic gradient descent (Zinkevich, 2003), self-bounding property of smooth functions (Srebro et al., 2010), and the implication of strong convexity (Hazan and Kale, 2011).
4.1 Proof of Theorem 1
We first state the excess risk of , the solution returned by the first call of Epoch-GD. From Theorem 5 of Hazan and Kale (2014), we have
We proceed to analyze the solution returned by the second call of Epoch-GD. In each epoch, the standard stochastic gradient descent (SGD) (Zinkevich, 2003) is applied. The following lemma shows how the excess risk decreases in each epoch. Apply iterations of the update
Based on the above lemma, we establish the following result for bounding the excess risk of the intermediate iterate. Consider the second call of Epoch-GD with parameters (,,, ). For any , we have
The number of epochs made is given by the largest value of satisfying , i.e.,
This value is
and the final solution is . From Lemma 4.1, we have
where the last step is due to
4.2 Proof of Lemma 4.1
We first introduce the self-bounding property of smooth functions (Srebro et al., 2010, Lemma 4.1). For an $L$-smooth and nonnegative function $g$,
$$\|\nabla g(\mathbf{w})\| \le \sqrt{4 L \, g(\mathbf{w})}.$$
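The self-bounding property can be sanity-checked numerically; for the quadratic instance below (our own choice), $\|\nabla g(w)\|^2 = 2 L g(w)$, which indeed stays within the $4 L g(w)$ bound.

```python
import random

# Numeric check (a hypothetical quadratic instance of ours) of the self-bounding
# property: for an L-smooth nonnegative g, ||grad g(w)||^2 <= 4 * L * g(w).
L, c = 3.0, 1.5
g = lambda w: L * (w - c) ** 2 / 2.0   # L-smooth and nonnegative
grad = lambda w: L * (w - c)           # its gradient

rng = random.Random(4)
ok = all(grad(w) ** 2 <= 4.0 * L * g(w) + 1e-12
         for w in (rng.uniform(-10, 10) for _ in range(1000)))
```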
Let . Following the analysis of online gradient descent (Zinkevich, 2003), for any , we have
where the second inequality is due to the nonexpanding property of the projection operator (Nemirovski et al., 2009, (1.5)).
Summing up over all , we get
Recall that and is independent from . Taking expectation over both sides, we have
Rearranging the above inequality, we obtain
Dividing both sides by , we have
where the last step is due to Jensen’s inequality.
4.3 Proof of Lemma 4.1
Recall that the following parameters are used in the second call of Epoch-GD
Then, we have
We prove this lemma by induction on . When , from Lemma 4.1, we have
4.4 Proof of Theorem 3
We first establish the following lemma for bounding the excess risk of the intermediate iterate. For any , we have
The number of epochs made is given by and the final solution is . From Lemma 4.4, we have
4.5 Proof of Lemma 4.4
From (8), we know that
We prove this lemma by induction on . When , from Lemma 4.1, we have