In this paper, we consider nonsmooth nonconvex finite-sum optimization problems of the form
$\min_{x \in \mathbb{R}^d} \Phi(x) := f(x) + h(x)$, where $f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x)$,    (1)
where each $f_i$ is possibly nonconvex with a Lipschitz continuous gradient, while $h$ is nonsmooth but convex (e.g., the $\ell_1$ norm or the indicator function of some convex set $C$). We assume that the proximal operator of $h$ can be computed efficiently.
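For concreteness, here is one simple convex instance of (1), of the Lasso type mentioned in the next paragraph; the names (`A_i`, `y_i`, `lam`) are illustrative and not taken from the paper:

```python
import numpy as np

# Illustrative instance of (1): least-squares components f_i plus an l1 term h.
# (A_i, y_i) is one data point; lam is an assumed regularization weight.
def f_i(x, A_i, y_i):
    return 0.5 * (A_i @ x - y_i) ** 2      # smooth component (convex here)

def h(x, lam):
    return lam * np.sum(np.abs(x))          # nonsmooth but convex
```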
The above optimization problem is fundamental to many machine learning problems, ranging from convex optimization such as Lasso and SVM to highly nonconvex problems such as training deep neural networks. There has been extensive research on the case where $f$ is convex (see, e.g., (Xiao and Zhang, 2014; Defazio et al., 2014; Lan and Zhou, 2015; Allen-Zhu, 2017a)). In particular, if the $f_i$'s are strongly convex, (Xiao and Zhang, 2014) proposed the Prox-SVRG algorithm, which achieves a linear convergence rate based on the well-known variance reduction technique developed in (Johnson and Zhang, 2013).
In recent years, due to the increasing popularity of deep learning, the nonconvex case has attracted significant attention. See, e.g., (Ghadimi and Lan, 2013; Allen-Zhu and Hazan, 2016; Reddi et al., 2016a; Lei et al., 2017) for results on the smooth nonconvex case (i.e., $h \equiv 0$). For the more general nonsmooth nonconvex case, the research is still somewhat limited.
Recently, for the nonsmooth nonconvex case, (Reddi et al., 2016b) provided two algorithms called ProxSVRG (similar to (Xiao and Zhang, 2014)) and ProxSAGA, which are based on the well-known variance reduction techniques SVRG and SAGA (Johnson and Zhang, 2013; Defazio et al., 2014). Before that, (Ghadimi et al., 2016) analyzed the deterministic proximal gradient method (i.e., computing the full gradient in every iteration) for nonsmooth nonconvex problems; here we denote it as ProxGD. (Ghadimi et al., 2016) also considered the stochastic case (here we denote it as ProxSGD). However, ProxSGD requires the minibatch sizes to be very large or to increase with the iteration number, so ProxSGD may reduce to the deterministic ProxGD after some iterations. Note that from the perspectives of both computational efficiency and statistical generalization, always computing the full gradient (GD or ProxGD) may not be desirable for large-scale machine learning problems. A reasonable minibatch size is also desirable in practice, since the computation of minibatch stochastic gradients can be implemented in parallel. In fact, practitioners typically use moderate minibatch sizes, often ranging from something like 16 or 32 to a few hundred (sometimes a few thousand; see, e.g., (Priya et al., 2017)). Moreover, some studies have argued that smaller minibatch sizes in SGD are very useful for generalization (e.g., (Keskar et al., 2016)); although generalization is not the focus of the present paper, this provides further motivation for studying the moderate minibatch size regime. Hence, it is important to study convergence in the moderate and constant minibatch size regimes.
(Reddi et al., 2016b) provided the first non-asymptotic convergence rates for ProxSVRG for nonsmooth nonconvex problems, with minibatch sizes up to $n^{2/3}$. However, their convergence bounds with constant or moderate minibatch sizes are worse than that of the deterministic ProxGD in terms of the number of proximal oracle calls. In fact, their algorithms (ProxSVRG/SAGA) outperform ProxGD only when they use the quite large minibatch size $b = n^{2/3}$; note that in typical applications the number of training data points $n$ is very large, so $n^{2/3}$ is quite a large minibatch size. Finally, they posed the important open problem of developing stochastic methods with provably better performance than ProxGD using a constant minibatch size.
Our Contribution: In this paper, we propose ProxSVRG+ to solve (1). Our algorithm is almost the same as ProxSVRG (Reddi et al., 2016b), except for some detailed parameter settings (under the PL condition, (Reddi et al., 2016b) needed to restart ProxSVRG $O(\log(1/\epsilon))$ times, while ProxSVRG+ does not need any restart). Our main technical contribution lies in the new convergence analysis of ProxSVRG+, which differs notably from that of ProxSVRG (Reddi et al., 2016b). Our convergence results are stated in terms of the number of stochastic first-order oracle (SFO) calls and proximal oracle (PO) calls (see Definition 2 for the formal definitions).
We are mainly interested in modern large-scale machine learning problems, in which the number of data points $n$ is typically very large and the target accuracy $\epsilon$ is moderate. Hence, we generally regard $n$ as a much larger number than $1/\epsilon$, a moderate minibatch size as one that grows moderately with $n$ (e.g., a small polynomial in $n$), and a small minibatch size as a constant. We list our results in Tables 1–3 and Figure 2. We would like to highlight the following results yielded by our new analysis.
ProxSVRG+ is provably faster than ProxGD (full gradient) in terms of #SFO over a wide range of minibatch sizes $b$, with the speedup growing with $b$ in the moderate regime, while #PO is the same for ProxSVRG+ and ProxGD. Hence, for any super-constant minibatch size $b$ (e.g., a moderate minibatch size of the kind often used in practice), ProxSVRG+ is strictly better than ProxGD, which partially answers the open question posed in (Reddi et al., 2016b). We also note that ProxSVRG+ matches the best result achieved by ProxSVRG at $b = n^{2/3}$, and ProxSVRG+ is strictly better for smaller $b$ (using fewer PO calls). See Figure 2 for an overview.
Assuming that the variance of the stochastic gradient is bounded (by $\sigma^2$; see Assumption 1), ProxSVRG+ matches the best result achieved by SCSG, proposed recently by Lei et al. (Lei et al., 2017), in terms of #SFO for the smooth nonconvex case, i.e., $h \equiv 0$ in (1) (see Table 1). Arguably, ProxSVRG+ is more straightforward than SCSG and yields a simpler proof. Our results also match, in terms of #SFO, the results of Natasha1.5 proposed very recently by Allen-Zhu (Allen-Zhu, 2017b), if there is no additional assumption (see Footnote 2 for details). In terms of #PO, our algorithm outperforms Natasha1.5.
For nonconvex functions satisfying the Polyak-Łojasiewicz (PL) condition (Polyak, 1963), we show that ProxSVRG+ achieves a global linear convergence rate without restart, while Reddi et al. (Reddi et al., 2016b) restarted ProxSVRG/SAGA $O(\log(1/\epsilon))$ times to obtain a global linear convergence rate. In this case, ProxSVRG+ is always no worse than ProxGD and ProxSVRG/SAGA and sometimes outperforms them (and it also generalizes the results of SCSG). We list these results in Table 3.
Table 1: Comparison of the #SFO and #PO bounds and the additional conditions required by ProxGD (Ghadimi et al., 2016), ProxSGD (Ghadimi et al., 2016), ProxSVRG/SAGA (Reddi et al., 2016b), SCSG (Lei et al., 2017) (smooth case only, i.e., $h \equiv 0$ in (1); no PO), Natasha1.5 (Allen-Zhu, 2017b), and ProxSVRG+ (this paper) under several parameter regimes. Depending on the choice of batch and minibatch sizes, the bounds of ProxSVRG+ are the same as ProxGD, the same as ProxSGD, better than ProxGD (without needing Assumption 1), or better than ProxGD and the same as SCSG (in #SFO); its #PO is the same as that of ProxGD.
Footnote 2: Natasha1.5 uses an additional parameter, called the strongly nonconvex parameter, and its #SFO bound in (Allen-Zhu, 2017b) improves when this parameter is much smaller than the smoothness constant. Without any additional assumption, its default value equals the smoothness constant, which is the case listed in the table. Besides, one can verify that the #PO of Natasha1.5 is the same as its #SFO.
We assume that each $f_i$ in (1) has an $L$-Lipschitz continuous gradient for all $i \in [n]$, i.e., there is a constant $L > 0$ such that
$\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|$ for all $x, y$,
where $\|\cdot\|$ denotes the Euclidean norm. Note that $f_i$ does not need to be convex. We also assume that the nonsmooth convex function $h$ in (1) is well structured, i.e., the following proximal operator on $h$ can be computed efficiently:
$\mathrm{prox}_{\eta h}(x) := \arg\min_{y \in \mathbb{R}^d} \big( h(y) + \frac{1}{2\eta}\|y - x\|^2 \big)$.    (3)
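For example, when $h(x) = \lambda\|x\|_1$ (one of the cases mentioned above), the proximal operator (3) has the closed-form soft-thresholding solution; a minimal sketch, with `lam` as an assumed regularization weight:

```python
import numpy as np

def prox_l1(v, eta, lam):
    """prox_{eta * lam * ||.||_1}(v): entrywise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)
```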
For convex problems, one typically uses the optimality gap $\Phi(x) - \Phi^*$ as the convergence criterion (see, e.g., (Nesterov, 2004)). But for general nonconvex problems, one typically uses the gradient norm as the convergence criterion. E.g., for smooth nonconvex problems (i.e., $h \equiv 0$), (Ghadimi and Lan, 2013; Reddi et al., 2016a; Lei et al., 2017) used the gradient norm $\|\nabla f(\hat{x})\|$ to measure convergence. In order to analyze the convergence for nonsmooth nonconvex problems, we define the gradient mapping as follows (as in (Ghadimi et al., 2016; Reddi et al., 2016b)):
$\mathcal{G}_\eta(x) := \frac{1}{\eta}\big(x - \mathrm{prox}_{\eta h}(x - \eta \nabla f(x))\big).$
We often use an equivalent but useful form of $\mathcal{G}_\eta(x)$:
$\mathcal{G}_\eta(x) = \frac{1}{\eta}(x - x^+)$, where $x^+ := \arg\min_{y} \big( \langle \nabla f(x), y \rangle + \frac{1}{2\eta}\|y - x\|^2 + h(y) \big)$.    (5)
Note that if $h$ is a constant function (in particular, zero), this gradient mapping reduces to the ordinary gradient: $\mathcal{G}_\eta(x) = \nabla f(x)$. In this paper, we use the gradient mapping $\mathcal{G}_\eta(x)$ as the convergence criterion (the same as (Ghadimi et al., 2016; Reddi et al., 2016b)).
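As a concrete illustration, the following sketch evaluates the gradient mapping numerically given black-box gradient and proximal routines; the names `grad_f` and `prox_h` are placeholders for user-supplied implementations:

```python
def gradient_mapping(x, grad_f, prox_h, eta):
    """Gradient mapping G_eta(x) = (x - prox_{eta*h}(x - eta*grad f(x))) / eta.

    grad_f(x) returns the gradient of the smooth part f at x;
    prox_h(v, eta) returns argmin_y { h(y) + ||y - v||^2 / (2*eta) }.
    """
    x_plus = prox_h(x - eta * grad_f(x), eta)
    return (x - x_plus) / eta
```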
A point $\hat{x}$ is called an $\epsilon$-accurate solution for problem (1) if the expected norm of its gradient mapping $\mathcal{G}_\eta(\hat{x})$ is at most $\epsilon$, where $\hat{x}$ denotes the point returned by a stochastic algorithm.
To measure the efficiency of a stochastic algorithm, we use the following oracle complexity.
Stochastic first-order oracle (SFO): given a point $x$, the SFO returns a stochastic gradient $\nabla f_i(x)$ for an index $i$ sampled uniformly at random from $\{1, \dots, n\}$, so that $\mathbb{E}_i[\nabla f_i(x)] = \nabla f(x)$.
Proximal oracle (PO): given a point $x$, the PO returns the result of the proximal projection $\mathrm{prox}_{\eta h}(x)$ (see (3)).
Sometimes, we need the following assumption on the variance of the stochastic gradients for some algorithms or some particular cases (see the last column, "additional condition", in Table 1). Such an assumption is necessary if one wants the convergence result to be independent of $n$.
For all $x$, $\mathbb{E}_i\big[\|\nabla f_i(x) - \nabla f(x)\|^2\big] \le \sigma^2$, where $\sigma > 0$ is a constant and $\nabla f_i(x)$ denotes a stochastic gradient.
3 Nonconvex ProxSVRG+ Algorithm
In this section, we propose a proximal stochastic gradient algorithm, called ProxSVRG+, which is a variant of the ProxSVRG algorithm (Reddi et al., 2016b) (and is also similar to (Xiao and Zhang, 2014)). ProxSVRG+ is very straightforward, and the details are described in Algorithm 1. We call $B$ the batch size and $b$ the minibatch size; a sketch of the overall structure is given below.
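Since Algorithm 1 itself is not reproduced here, the following is only a minimal sketch of the SVRG-style structure described above (an outer loop that takes a size-$B$ batch gradient at a snapshot point, and inner iterations that combine a size-$b$ minibatch correction with a proximal step). The function names, sampling scheme, and parameter handling are illustrative assumptions, not the authors' exact pseudocode.

```python
import numpy as np

def prox_svrg_plus(x0, grad_fi, prox_h, n, eta, B, b, num_epochs, epoch_len):
    """Sketch of a ProxSVRG+-style loop.

    grad_fi(x, idx) returns the average gradient of the components indexed by idx;
    prox_h(v, eta) returns argmin_y { h(y) + ||y - v||^2 / (2*eta) }.
    """
    rng = np.random.default_rng(0)
    x = x0.copy()
    for _ in range(num_epochs):
        # Snapshot point and (large-)batch gradient estimate of size B.
        snapshot = x.copy()
        batch_idx = rng.choice(n, size=B, replace=False)
        g_snapshot = grad_fi(snapshot, batch_idx)
        for _ in range(epoch_len):
            # Variance-reduced gradient estimate from a size-b minibatch.
            mini_idx = rng.choice(n, size=b, replace=True)
            v = grad_fi(x, mini_idx) - grad_fi(snapshot, mini_idx) + g_snapshot
            # Proximal gradient step.
            x = prox_h(x - eta * v, eta)
    return x
```

The key difference from plain proximal SGD is the control variate built from the snapshot gradient, which reduces the variance of the inner-loop gradient estimate without computing a full gradient at every iteration.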
4 Convergence Results
Now, we present the main theorem for ProxSVRG+ which corresponds to the last two rows in Table 1.
Case 1: we let the batch size $B = n$. The number of SFO calls is at most
Case 2: under Assumption 1, we let the batch size $B$ be chosen according to the variance bound $\sigma^2$ and the accuracy $\epsilon$ (so that $B$ can be much smaller than $n$). The number of SFO calls is at most
where $\wedge$ denotes the minimum. In both cases, the number of PO calls equals the total number of iterations $T$, which is at most
Remark: Algorithm 1 in Case 1 (i.e., $B = n$) is almost the same as ProxSVRG (Reddi et al., 2016b), but the proof of Theorem 1 is notably different. Reddi et al. (Reddi et al., 2016b) used a Lyapunov function and showed that it decreases by the accumulated gradient mapping in each epoch. In our proof, we directly show that the (expected) function value decreases, using a different analysis. This is made possible by tightening the inequalities in several places, e.g., by using Young's inequality on different terms and applying Lemma 2 in a nontrivial way. Besides, we also use an idea similar to SCSG (Lei et al., 2017) to bound the variance term, and our convergence result holds for any minibatch size $1 \le b \le n$.
Table 3: Convergence results under the PL condition (#SFO and #PO bounds and additional conditions) for ProxGD (Karimi et al., 2016), ProxSVRG/SAGA (Reddi et al., 2016b), SCSG (Lei et al., 2017) (smooth case only, i.e., $h \equiv 0$ in (1); no PO), and ProxSVRG+ (this paper).
Similar to Table 2, one can choose the batch size and the minibatch size accordingly. Similar to Figure 2, SCSG achieves its best result with the smallest minibatch size, while ProxSVRG/SAGA achieves its best result with a large minibatch size (ProxSVRG+ obtains the same result in this case); ProxSVRG+ is better than ProxGD and ProxSVRG/SAGA, and generalizes the results of SCSG. Natasha1.5 (Allen-Zhu, 2017b) did not consider the PL condition.
5 Convergence Under PL Condition
In this section, we provide the global linear convergence rate for nonconvex functions under the Polyak-Łojasiewicz (PL) condition (Polyak, 1963). The original form of the PL condition is that there exists $\mu > 0$ such that
$\|\nabla f(x)\|^2 \ge 2\mu\big(f(x) - f^*\big)$ for all $x$,
where $f^*$ denotes the (global) optimal function value. For example, $f$ satisfies the PL condition if it is $\mu$-strongly convex. Note that the PL condition implies that every stationary point is a global minimum, but unlike strong convexity it does not imply that the minimizer is unique. In particular, (Karimi et al., 2016) showed that the PL condition is weaker than many other conditions (e.g., strong convexity (SC), restricted strong convexity (RSC), and weak strong convexity (WSC) (Necoara et al., 2015)). Also, if $f$ is convex, the PL condition is equivalent to the error bound (EB) and quadratic growth (QG) conditions (Luo and Tseng, 1993; Anitescu, 2000).
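To see the claim about strong convexity: if $f$ is $\mu$-strongly convex, then $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2$ for all $x, y$. Minimizing both sides over $y$ (the right-hand side is minimized at $y = x - \nabla f(x)/\mu$) gives
$f^* \ge f(x) - \frac{1}{2\mu}\|\nabla f(x)\|^2$, i.e., $\|\nabla f(x)\|^2 \ge 2\mu\big(f(x) - f^*\big)$,
which is exactly the PL inequality.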
Note that if $h$ is a constant function, the gradient mapping reduces to $\mathcal{G}_\eta(x) = \nabla f(x)$, and our new definition of the PL condition (stated in terms of the gradient mapping) reduces to the original one in (Polyak, 1963). Our PL condition is different from the one used in (Karimi et al., 2016; Reddi et al., 2016b); see Remark (2) at the end of this section. We list the convergence results under the PL condition in Table 3.
Similar to Theorem 1, we provide the convergence result of ProxSVRG+ (Algorithm 1) under the PL condition in the following Theorem 2. Note that under the PL condition (i.e., when (7) holds), ProxSVRG+ can directly use the final iterate as the output point instead of a randomly chosen one. Similar to (Reddi et al., 2016b), we make an assumption on the condition number for simplicity; otherwise, one can choose a different step size, similar to the way we deal with other choices of the epoch length (see Appendix A.2).
Let $\eta$ denote the step size and $b$ the minibatch size. Then, under the PL condition, the final iterate $\hat{x}$ of Algorithm 1 satisfies $\mathbb{E}[\Phi(\hat{x})] - \Phi^* \le \epsilon$. We distinguish the following two cases:
Case 1: we let the batch size $B = n$. The number of SFO calls is bounded by
Case 2: under Assumption 1, we let the batch size $B$ be chosen according to $\sigma^2$ and the target accuracy. The number of SFO calls is bounded by
where $\wedge$ denotes the minimum. In both cases, the number of PO calls equals the total number of iterations.
We show that ProxSVRG+ directly obtains the global linear convergence rate without restart via a nontrivial proof. Note that Reddi et al. (Reddi et al., 2016b) restarted ProxSVRG/SAGA $O(\log(1/\epsilon))$ times to obtain a global linear convergence rate under the PL condition. Moreover, ProxSVRG+ recovers several existing convergence results, sometimes outperforms them, and also generalizes the results of SCSG (Lei et al., 2017) in this case.
(2) The previous works (Karimi et al., 2016; Reddi et al., 2016b) used a different (proximal) form of the PL condition, stated as (8). Our PL condition is arguably more natural. In fact, one can show that, under a mild additional condition, our new PL condition (7) implies (8). For comparison, we also provide in the appendix a proof of the same result (ProxSVRG+ directly obtains the linear convergence rate of Theorem 2 without restart) using the previous PL condition (8).
In this section, we present the experimental results. We compare the nonconvex ProxSVRG+ with nonconvex ProxGD, ProxSGD (Ghadimi et al., 2016), ProxSVRG (Reddi et al., 2016b), and ProxSCSG. Here, ProxSCSG is a straightforward extension of SCSG (Lei et al., 2017), obtained by replacing the original gradient update in SCSG with the proximal operator. Note that there is no known theoretical convergence result for ProxSCSG.
We conduct the experiments on the non-negative principal component analysis (NN-PCA) problem (the same as in (Reddi et al., 2016b)). In general, NN-PCA is NP-hard. Specifically, the optimization problem for a given set of samples $\{z_i\}_{i=1}^{n}$ is:
$\min_{\|x\| \le 1,\; x \ge 0} \; -\frac{1}{2}\, x^\top \big(\sum_{i=1}^{n} z_i z_i^\top\big)\, x.$    (9)
Note that (9) can be written in the form (1), with each $f_i(x)$ proportional to $-(z_i^\top x)^2$ and $h(x)$ the indicator function of the set $C = \{x : \|x\| \le 1,\; x \ge 0\}$. We conduct the experiments on the standard MNIST and 'a9a' datasets (available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). The experimental results on both datasets are very similar (see Figures 6–10).
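For this choice of $h$, the proximal oracle is simply the Euclidean projection onto the constraint set. A minimal sketch, assuming the set $\{x : x \ge 0,\ \|x\|_2 \le 1\}$ as above (clip negative entries, then rescale if needed):

```python
import numpy as np

def project_nnpca(x):
    """Euclidean projection onto {x : x >= 0, ||x||_2 <= 1}.

    Clip negative coordinates to zero, then rescale if the result
    lies outside the unit ball.
    """
    y = np.maximum(x, 0.0)
    norm = np.linalg.norm(y)
    return y if norm <= 1.0 else y / norm
```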
The samples from each dataset are normalized, i.e., $\|z_i\| = 1$ for all $i$. The parameters of the algorithms are chosen as follows. The Lipschitz constant $L$ can be precomputed from the data samples in the same way as in (Li et al., 2017). The step sizes of the different algorithms are set to the values used in their convergence results: for ProxGD, see Corollary 1 in (Ghadimi et al., 2016); for ProxSGD, see Corollary 3 in (Ghadimi et al., 2016); for ProxSVRG, see Theorem 6 in (Reddi et al., 2016b); for ProxSCSG, see Corollary 3.3 in (Lei et al., 2017); and for our ProxSVRG+, see our Theorem 1. We did not further tune the step sizes.
Regarding the comparison among these algorithms, we use the number of SFO calls (see Definition 2) to evaluate them. For example, each iteration of ProxGD uses $n$ SFO calls (a full gradient). We point out that we amortize the batch size (i.e., $B$ or $n$ in Line 5 of Algorithm 1) over the inner loops, so that the curves in the figures are smoother (otherwise, the curves would look like step functions). For example, in ProxSVRG, ProxSCSG, and ProxSVRG+, each inner-loop iteration is charged its minibatch gradients plus an equal share of the per-epoch batch gradient.
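A minimal illustration of this amortized accounting (the function and parameter names are illustrative, assuming an epoch length of `epoch_len` inner iterations):

```python
def amortized_sfo_counts(num_epochs, epoch_len, batch_size, minibatch_size):
    """Cumulative SFO counts, charging each inner iteration its minibatch
    plus an equal share of the per-epoch batch gradient (for smooth plots)."""
    per_iter = minibatch_size + batch_size / epoch_len
    total_iters = num_epochs * epoch_len
    return [per_iter * (t + 1) for t in range(total_iters)]
```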
We demonstrate the performance of these algorithms for various minibatch sizes $b$. The experimental results are quite consistent with the theoretical results (Figure 2 at the end of Section 1). ProxSCSG achieves its fastest convergence with the smallest minibatch size and becomes worse as $b$ grows. ProxSVRG generally gets better as $b$ grows larger. ProxSVRG+, in contrast, first improves as $b$ increases and then deteriorates beyond a certain point. In Figure 6 (on both datasets), we can see that the proposed ProxSVRG+ (Algorithm 1) achieves the best performance at a moderate minibatch size (the red curve with dots).
Similarly, we also conduct the experiment for ProxSVRG (Reddi et al., 2016b) with various minibatch sizes $b$ in Figure 6 (on both datasets). We note that ProxSVRG achieves its best performance with a much larger minibatch size, on both a9a and MNIST. Recall that, in theory, ProxSVRG+ achieves its best result at a moderate minibatch size, while ProxSVRG achieves its best performance at $b = n^{2/3}$. Note that a9a is smaller than the MNIST dataset; hence, for ProxSVRG, the best-performing minibatch size increases as $n$ becomes larger. ProxSVRG+, however, is less sensitive to $n$ in our experiments (the same moderate minibatch size achieves the best result on both datasets; see Figure 6).
In Figure 8 (on both datasets), we compare the performance of all five algorithms as we vary the minibatch size $b$. We can see that for the smallest minibatch size (top-left subfigure), ProxSCSG achieves the best result, and ProxSVRG+ is almost the same as ProxGD in this case. In the subsequent subfigures, ProxSCSG keeps getting worse as $b$ grows, while ProxSVRG+ improves in the first four subfigures and then deteriorates in the last two. The results are consistent with Figure 2.
Finally, we compare these algorithms, each with its best minibatch size $b$, in Figure 10 (on both datasets). Here, ProxSVRG, ProxSCSG, and ProxSVRG+ achieve their best performance with a large, a small, and a moderate minibatch size, respectively. This is quite consistent with the minimum points of the curves of ProxSVRG, SCSG, and ProxSVRG+ in Figure 2. One can see that ProxSVRG, ProxSCSG, and ProxSVRG+ are quite close to each other and all outperform ProxGD and ProxSGD. However, we argue that our algorithm might be more attractive in certain applications due to its moderate minibatch size (again, not too small for parallelism and possibly not too large for better generalization).
In this paper, we propose a simple proximal stochastic gradient method called ProxSVRG+, which is a variant of ProxSVRG (Reddi et al., 2016b) for nonsmooth nonconvex optimization. We show that ProxSVRG+ recovers several well-known convergence results and improves on some of them with proper parameter choices. Furthermore, for nonconvex functions satisfying the Polyak-Łojasiewicz condition, we show that ProxSVRG+ achieves a global linear convergence rate without restart, while (Reddi et al., 2016b) restarted ProxSVRG $O(\log(1/\epsilon))$ times. Finally, we conducted several experiments, and the experimental results are consistent with our theoretical results.
We would like to thank Rong Ge for helpful discussions.
Appendix A Proofs for Nonconvex ProxSVRG+ Algorithm
A.1 Proof of Theorem 1
Before proving Theorem 1, we need a useful lemma for the proximal operator.
Let , then the following inequality holds:
Proof: First, we recall the proximal operator (see (5)):
For the nonsmooth convex function $h$, we have
where the subgradient is chosen according to the optimality condition of the proximal step, and (12) holds due to the convexity of $h$.
For the (possibly nonconvex) smooth function $f$, we have
Besides, by choosing the two points in (10) appropriately, we have
where (18) uses the following Young's inequality (with an appropriate choice of the parameter $\beta > 0$): $\langle a, b \rangle \le \frac{\|a\|^2}{2\beta} + \frac{\beta \|b\|^2}{2}$ for any vectors $a, b$.
Let and . Then, the following inequality holds:
where (23) uses the Cauchy-Schwarz inequality.
Now, the lemma is proved by using the Cauchy-Schwarz inequality and (24), i.e.,
Then, we bound the variance term in (25) as follows: