A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex Optimization

02/13/2018 · Zhize Li, et al. · Tsinghua University

We analyze stochastic gradient algorithms for optimizing nonconvex, nonsmooth finite-sum problems. In particular, the objective function is given by the summation of a differentiable (possibly nonconvex) component and a possibly non-differentiable but convex component. We propose a proximal stochastic gradient algorithm based on variance reduction, called ProxSVRG+. The algorithm is a slight variant of the ProxSVRG algorithm [Reddi et al., 2016b]. Our main contribution lies in the analysis of ProxSVRG+: it recovers several existing convergence results (in terms of the number of stochastic gradient oracle calls and proximal operations) and improves/generalizes others. In particular, ProxSVRG+ generalizes the best results given by the SCSG algorithm, recently proposed by [Lei et al., 2017] for the smooth nonconvex case. ProxSVRG+ is more straightforward than SCSG and yields a simpler analysis. Moreover, ProxSVRG+ outperforms deterministic proximal gradient descent (ProxGD) for a wide range of minibatch sizes, which partially solves an open problem posed in [Reddi et al., 2016b]. Finally, for nonconvex functions satisfying the Polyak-Łojasiewicz condition, we show that ProxSVRG+ achieves a global linear convergence rate without restart. In this case, ProxSVRG+ is always no worse than ProxGD and ProxSVRG/SAGA, sometimes outperforms them, and again generalizes the results of SCSG.


1 Introduction

In this paper, we consider nonsmooth nonconvex finite-sum optimization problems of the form

\min_{x\in\mathbb{R}^d} \Phi(x) := f(x) + h(x), \qquad f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x),     (1)

where each $f_i$ is possibly nonconvex with a Lipschitz continuous gradient, while $h$ is nonsmooth but convex (e.g., the $\ell_1$ norm $\|x\|_1$ or the indicator function $I_C(x)$ of some convex set $C$). We assume that the proximal operator of $h$ can be computed efficiently.
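To make the setting concrete, here is a small illustrative instance of (1) in Python (our own toy example, not from the paper): a smooth but nonconvex sigmoid-type loss averaged over $n$ data points, plus an $\ell_1$ regularizer as the convex nonsmooth part. All names and data are hypothetical.

```python
import numpy as np

# Toy instance of problem (1): f(x) = (1/n) * sum_i f_i(x) with a smooth
# nonconvex loss (squared error of a sigmoid), and h(x) = lam * ||x||_1.
rng = np.random.default_rng(0)
n, d, lam = 1000, 20, 0.01
A = rng.standard_normal((n, d))
y = rng.integers(0, 2, size=n).astype(float)

def f_i(x, i):
    # Smooth but nonconvex component f_i.
    p = 1.0 / (1.0 + np.exp(-A[i] @ x))
    return (p - y[i]) ** 2

def grad_f_i(x, i):
    # Gradient of f_i (Lipschitz continuous, as assumed in Section 2).
    p = 1.0 / (1.0 + np.exp(-A[i] @ x))
    return 2.0 * (p - y[i]) * p * (1.0 - p) * A[i]

def h(x):
    # Convex nonsmooth component.
    return lam * np.abs(x).sum()
```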

The above optimization problem is fundamental to many machine learning problems, ranging from convex optimization such as Lasso and SVM to highly nonconvex problems such as optimizing deep neural networks. There has been extensive research on the case where each $f_i$ is convex (see e.g., (Xiao and Zhang, 2014; Defazio et al., 2014; Lan and Zhou, 2015; Allen-Zhu, 2017a)). In particular, if the $f_i$'s are strongly convex, (Xiao and Zhang, 2014) proposed the Prox-SVRG algorithm, which achieves a linear convergence rate based on the well-known variance reduction technique developed in (Johnson and Zhang, 2013). In recent years, due to the increasing popularity of deep learning, the nonconvex case has attracted significant attention; see e.g., (Ghadimi and Lan, 2013; Allen-Zhu and Hazan, 2016; Reddi et al., 2016a; Lei et al., 2017) for results on the smooth nonconvex case (i.e., $h \equiv 0$). For the more general nonsmooth nonconvex case, the research is still somewhat limited.

Recently, for the nonsmooth nonconvex case, (Reddi et al., 2016b) provided two algorithms called ProxSVRG (similar to (Xiao and Zhang, 2014)) and ProxSAGA, which are based on the well-known variance reduction techniques SVRG and SAGA (Johnson and Zhang, 2013; Defazio et al., 2014). Before that, (Ghadimi et al., 2016) analyzed the deterministic proximal gradient method (i.e., computing the full gradient in every iteration) for nonconvex nonsmooth problems; here we denote it as ProxGD. (Ghadimi et al., 2016) also considered the stochastic case (here denoted as ProxSGD). However, ProxSGD requires the minibatch sizes to be quite large or to increase with the iteration number, so ProxSGD may reduce to deterministic ProxGD after some iterations. Note that, from the perspectives of both computational efficiency and statistical generalization, always computing the full gradient (GD or ProxGD) may not be desirable for large-scale machine learning problems. A reasonable minibatch size is also desirable in practice, since the computation of minibatch stochastic gradients can be implemented in parallel. In fact, practitioners typically use moderate minibatch sizes, often ranging from something like 16 or 32 to a few hundred (sometimes a few thousand, see e.g., (Priya et al., 2017)). In fact, some studies have argued that smaller minibatch sizes in SGD are very useful for generalization (e.g., (Keskar et al., 2016)); although generalization is not the focus of the present paper, this provides further motivation for studying the moderate minibatch size regime. Hence, it is important to study convergence in the moderate and constant minibatch size regimes.

(Reddi et al., 2016b) provided the first non-asymptotic convergence rates for ProxSVRG with minibatch size at most $n^{2/3}$ for nonsmooth nonconvex problems. However, their convergence bounds (using constant or moderate-size minibatches) are worse than those of the deterministic ProxGD in terms of the number of proximal oracle calls. Their algorithms (i.e., ProxSVRG/SAGA) outperform ProxGD only if they use the quite large minibatch size $b = n^{2/3}$. Note that in a typical application, the number of training data points $n$ is on the order of $10^6$, so $n^{2/3} = 10^4$ is quite a large minibatch size. Finally, they presented the important open problem of developing stochastic methods with provably better performance than ProxGD using constant minibatch sizes.

Our Contribution: In this paper, we propose ProxSVRG+ to solve (1). Our algorithm is almost the same as ProxSVRG (Reddi et al., 2016b), except for some detailed parameter settings (under the PL condition, (Reddi et al., 2016b) needed to restart ProxSVRG $O(\log(1/\epsilon))$ times, while ProxSVRG+ does not need any restart). Our main technical contribution lies in the new convergence analysis of ProxSVRG+, which differs notably from that of ProxSVRG (Reddi et al., 2016b). Our convergence results are stated in terms of the number of stochastic first-order oracle (SFO) calls and proximal oracle (PO) calls (see Definition 2 for the formal definitions).

We are mainly interested in modern large-scale machine learning problems, in which the number of data points $n$ is typically very large and the required accuracy $\epsilon$ is moderate. Hence, we generally regard $n$ as a much larger number than $1/\epsilon$, regard a moderate minibatch size as something like $b = n^{\delta}$ (for some constant $0 < \delta < 1$), and regard a small minibatch size as $b = O(1)$. We list our results in Tables 1-3 and Figures 1-2. We would like to highlight the following results yielded by our new analysis.

  1. ProxSVRG+ is $\sqrt{b}$ (resp. $n/b$) times faster than ProxGD (full gradient) in terms of #SFO when $b \le n^{2/3}$ (resp. $b \ge n^{2/3}$), and $n^{1/3}$ times faster than ProxGD when $b = n^{2/3}$. Note that #PO is $O(1/\epsilon)$ for both ProxSVRG+ and ProxGD. Hence, for any superconstant $b$ (such as $b = n^{\delta}$ for some $0 < \delta < 1$, a moderate minibatch size often used in practice), ProxSVRG+ is strictly better than ProxGD; thus we partially answer the open question posed in (Reddi et al., 2016b). We also note that ProxSVRG+ matches the best result achieved by ProxSVRG at $b = n^{2/3}$, and ProxSVRG+ is strictly better for smaller $b$ (using fewer PO calls). See Figure 2 for an overview.

  2. Assuming that the variance of the stochastic gradients is bounded by $\sigma^2$ (see Assumption 1), ProxSVRG+ matches, in terms of #SFO, the best result achieved by SCSG, proposed recently by Lei et al. (Lei et al., 2017) for the smooth nonconvex case, i.e., $h \equiv 0$ in (1) (see Table 1, the 5th row). Arguably, ProxSVRG+ is more straightforward than SCSG and yields a simpler proof. Our result also matches that of Natasha1.5, proposed very recently by Allen-Zhu (Allen-Zhu, 2017b), in terms of #SFO, if there is no additional assumption (see Footnote 2 for details). In terms of #PO, our algorithm outperforms Natasha1.5.

    We also note that SCSG (Lei et al., 2017) achieved its best result with minibatch size $b = 1$, while our best convergence result is achieved with the minibatch size $b = \big(\frac{\sigma^2}{\epsilon} \wedge n\big)^{2/3}$ (see Table 2, the 5th row), which is a moderate minibatch size used in practice (and hence can take advantage of parallelism).

  3. For nonconvex functions satisfying the Polyak-Łojasiewicz (PL) condition (Polyak, 1963), we show that ProxSVRG+ achieves a global linear convergence rate without restart, while Reddi et al. (Reddi et al., 2016b) restarted ProxSVRG/SAGA $O(\log(1/\epsilon))$ times to obtain the global linear convergence rate. In this case, ProxSVRG+ is always no worse than ProxGD and ProxSVRG/SAGA, and sometimes outperforms them (and also generalizes the results of SCSG). We list our results in Table 3.

Table 1: Comparison of the SFO and PO complexity.
Columns: Algorithm; Stochastic first-order oracle (SFO); Proximal oracle (PO); Additional condition.
Rows: ProxGD (full gradient) (Ghadimi et al., 2016); ProxSGD (Ghadimi et al., 2016); ProxSVRG/SAGA (Reddi et al., 2016b); SCSG (Lei et al., 2017) (smooth nonconvex case, i.e., $h \equiv 0$ in (1); PO: NA); Natasha1.5 (Allen-Zhu, 2017b) [see Footnote 2]; ProxSVRG+ (this paper).
The notation $\wedge$ denotes the minimum and $b$ denotes the minibatch size. SFO and PO are defined in Definition 2, and $\sigma$ (in the last column) is defined in Assumption 1.
Footnote 2: Natasha1.5 uses an additional parameter, called the strong nonconvexity parameter, and its #SFO bound in (Allen-Zhu, 2017b) depends on this parameter; if it is much smaller than $L$, the bound is better. Without any additional assumption, its default value is $L$, and the result listed in the table corresponds to that case. Besides, one can verify that the #PO of Natasha1.5 is the same as its #SFO.
Table 2: Some recommended minibatch sizes for ProxSVRG+.
Columns: Algorithm; Minibatch size; SFO; PO; Additional condition; Notes.
Each row corresponds to ProxSVRG+ with a different choice of minibatch size, with the following notes: same as ProxGD; same as ProxSGD; better than ProxGD, and does not need the bounded-variance condition $\sigma$; better than ProxGD and ProxSVRG/SAGA, and same as SCSG (in SFO); same as ProxSVRG/SAGA; same as ProxGD.
Figure 1: SFO complexity in terms of the minibatch size $b$
Figure 2: PO complexity in terms of the minibatch size $b$
Footnote 3: Note that in Figure 2, the curve of ProxSGD overlaps with that of ProxSVRG+ over part of the range of $b$, and the curve of ProxSVRG/SAGA overlaps with that of ProxSVRG+ for $b \ge n^{2/3}$.

2 Preliminaries

We assume that each $f_i$ in (1) has an $L$-Lipschitz continuous gradient, i.e., there is a constant $L > 0$ such that

\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|, \quad \forall x, y \in \mathbb{R}^d,     (2)

where $\|\cdot\|$ denotes the Euclidean norm $\|\cdot\|_2$. Note that $f_i$ does not need to be convex. We also assume that the nonsmooth convex function $h$ in (1) is well structured, i.e., the following proximal operator on $h$ can be computed efficiently:

\mathrm{prox}_{\eta h}(x) := \operatorname*{argmin}_{y \in \mathbb{R}^d} \Big( h(y) + \frac{1}{2\eta}\|y - x\|^2 \Big).     (3)
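For concreteness, here is a minimal sketch (ours, not from the paper) of two proximal operators that satisfy this assumption: soft-thresholding for $h(x) = \lambda\|x\|_1$, and projection for $h = I_C$ with $C$ a Euclidean ball. The parameter values are illustrative.

```python
import numpy as np

def prox_l1(x, eta, lam=0.01):
    # prox_{eta*h}(x) for h(x) = lam * ||x||_1: componentwise soft-thresholding.
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

def prox_ball_indicator(x, eta, radius=1.0):
    # prox_{eta*h}(x) for h = indicator of the Euclidean ball of given radius:
    # the prox of an indicator function is the projection (independent of eta).
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)
```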

For convex problems, one typically uses the optimality gap as the convergence criterion (see e.g., (Nesterov, 2004)). But for general nonconvex problems, one typically uses the gradient norm as the convergence criterion; e.g., for smooth nonconvex problems (i.e., $h \equiv 0$), (Ghadimi and Lan, 2013; Reddi et al., 2016a; Lei et al., 2017) used the expected squared gradient norm, $\mathbb{E}[\|\nabla f(\hat{x})\|^2] \le \epsilon$, to measure the convergence results. In order to analyze convergence for nonsmooth nonconvex problems, we need to define the gradient mapping as follows (as in (Ghadimi et al., 2016; Reddi et al., 2016b)):

G_\eta(x) := \frac{1}{\eta}\big( x - \mathrm{prox}_{\eta h}(x - \eta \nabla f(x)) \big).     (4)

We often use an equivalent but useful form of the proximal step in (4):

\mathrm{prox}_{\eta h}(x - \eta \nabla f(x)) = \operatorname*{argmin}_{y} \Big( h(y) + \langle \nabla f(x), y - x \rangle + \frac{1}{2\eta}\|y - x\|^2 \Big).     (5)

Note that if $h$ is a constant function (in particular, zero), this gradient mapping reduces to the ordinary gradient: $G_\eta(x) = \nabla f(x)$. In this paper, we use the gradient mapping $G_\eta(x)$ as the convergence criterion (same as (Ghadimi et al., 2016; Reddi et al., 2016b)).
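A small sketch (ours, assuming for instance $h(x) = \lambda\|x\|_1$ with the `prox_l1` routine from above) of how the gradient mapping (4) can be evaluated and used as a stopping test:

```python
import numpy as np

def gradient_mapping(x, grad_f, prox, eta):
    # G_eta(x) = (1/eta) * (x - prox_{eta*h}(x - eta * grad f(x))), as in (4).
    return (x - prox(x - eta * grad_f(x), eta)) / eta

# Usage: stop once ||G_eta(x)||^2 <= eps (cf. Definition 1).
# if np.linalg.norm(gradient_mapping(x, grad_f, prox_l1, eta)) ** 2 <= eps:
#     ...
```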

Definition 1

A point $\hat{x}$ is called an $\epsilon$-accurate solution for problem (1) if $\mathbb{E}[\|G_\eta(\hat{x})\|^2] \le \epsilon$, where $\hat{x}$ denotes the point returned by a stochastic algorithm.

To measure the efficiency of a stochastic algorithm, we use the following oracle complexity.

Definition 2
  1. Stochastic first-order oracle (SFO): given a point $x$, the SFO samples an index $i$ uniformly at random from $[n]$ and outputs the stochastic gradient $\nabla f_i(x)$, so that $\mathbb{E}_i[\nabla f_i(x)] = \nabla f(x)$.

  2. Proximal oracle (PO): given a point $x$, the PO outputs the result of the proximal projection $\mathrm{prox}_{\eta h}(x)$ (see (3)).

Sometimes, we need the following assumption on the variance of the stochastic gradients for some algorithms or some particular cases (see the last column, "additional condition", in Table 1). Such an assumption is necessary if one wants the convergence result to be independent of $n$.

Assumption 1

For all $i \in [n]$, $\mathbb{E}_i[\|\nabla f_i(x) - \nabla f(x)\|^2] \le \sigma^2$, where $\sigma > 0$ is a constant and $\nabla f_i(x)$ is a stochastic gradient of $f$ at $x$.

3 Nonconvex ProxSVRG+ Algorithm

In this section, we propose a proximal stochastic gradient algorithm, called ProxSVRG+, which is a variant of the ProxSVRG algorithm (Reddi et al., 2016b) (also similar to (Xiao and Zhang, 2014)). ProxSVRG+ is very straightforward, and the details are described in Algorithm 1. We call $B$ the batch size and $b$ the minibatch size.

4 Convergence Results

Now, we present the main theorem for ProxSVRG+ which corresponds to the last two rows in Table 1.

0:  Input: initial point $x_0$, number of epochs $S$, batch size $B$, minibatch size $b$, step size $\eta$
1:  $\tilde{x}^{0} = x_0$
2:  $m = \lfloor \sqrt{b} \rfloor$  (epoch length)
3:  for $s = 1, 2, \ldots, S$ do
4:     $x_0^{s} = \tilde{x}^{s-1}$
5:     Sample a batch $I_B$ of size $B$ and call the SFO at $\tilde{x}^{s-1}$ to obtain $g^{s} = \frac{1}{B}\sum_{j \in I_B} \nabla f_j(\tilde{x}^{s-1})$.  (Footnote 4: If $B = n$, ProxSVRG+ is almost the same as ProxSVRG, i.e., $g^{s} = \nabla f(\tilde{x}^{s-1})$.)
6:     for $t = 1, \ldots, m$ do
7:        Sample a minibatch $I_b$ of size $b$ and call the SFO to obtain $v_{t-1}^{s} = \frac{1}{b}\sum_{i \in I_b}\big(\nabla f_i(x_{t-1}^{s}) - \nabla f_i(\tilde{x}^{s-1})\big) + g^{s}$
8:        $x_t^{s} = \mathrm{prox}_{\eta h}\big(x_{t-1}^{s} - \eta v_{t-1}^{s}\big)$  (call the PO once)
9:     end for
10:    $\tilde{x}^{s} = x_m^{s}$
11: end for
Output: $\hat{x}$ chosen uniformly at random from $\{x_t^{s}\}_{t \in [m],\, s \in [S]}$
Algorithm 1 ProxSVRG+
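The following is a minimal, self-contained Python sketch of Algorithm 1 (our own illustration, not the authors' code). The epoch length, sampling scheme, and default parameters are placeholders chosen to mirror the pseudocode above; with $B = n$ the snapshot gradient $g^s$ becomes the full gradient, as noted in Footnote 4.

```python
import numpy as np

def prox_svrg_plus(avg_grad, prox, x0, n, B, b, eta, n_epochs, seed=0):
    """Sketch of ProxSVRG+: SVRG-style variance-reduced estimator + proximal step.

    avg_grad(x, idx): average of grad f_i(x) over the indices in idx.
    prox(x, eta):     returns prox_{eta*h}(x).
    """
    rng = np.random.default_rng(seed)
    m = max(1, int(np.sqrt(b)))            # epoch length (Line 2; one possible choice)
    x_tilde = x0.copy()
    iterates = []
    for _ in range(n_epochs):
        # Line 5: snapshot gradient g^s estimated from a batch of size B.
        I_B = rng.choice(n, size=min(B, n), replace=False)
        g = avg_grad(x_tilde, I_B)
        x = x_tilde.copy()                 # Line 4
        for _ in range(m):
            # Line 7: minibatch variance-reduced gradient estimator v.
            I_b = rng.choice(n, size=min(b, n), replace=False)
            v = avg_grad(x, I_b) - avg_grad(x_tilde, I_b) + g
            # Line 8: proximal step (one PO call).
            x = prox(x - eta * v, eta)
            iterates.append(x.copy())
        x_tilde = x.copy()                 # Line 10
    # Output: one iterate chosen uniformly at random.
    return iterates[rng.integers(len(iterates))]
```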
Theorem 1

Let the step size be $\eta = \Theta(1/L)$ and let $b$ denote the minibatch size. Then $\hat{x}$ returned by Algorithm 1 is an $\epsilon$-accurate solution for problem (1) (i.e., $\mathbb{E}[\|G_\eta(\hat{x})\|^2] \le \epsilon$). We distinguish the following two cases:

  1. We let the batch size $B = n$. The number of SFO calls is at most
     $O\Big(n + \frac{L(\Phi(x_0) - \Phi^*)}{\epsilon}\Big(\frac{n}{\sqrt{b}} + b\Big)\Big)$.

  2. Under Assumption 1, we let the batch size $B = O\big(\frac{\sigma^2}{\epsilon} \wedge n\big)$. The number of SFO calls is at most
     $O\Big(B + \frac{L(\Phi(x_0) - \Phi^*)}{\epsilon}\Big(\frac{B}{\sqrt{b}} + b\Big)\Big)$,

where $\wedge$ denotes the minimum. In both cases, the number of PO calls equals the total number of iterations $T$, which is at most $O\big(\frac{L(\Phi(x_0) - \Phi^*)}{\epsilon}\big)$.

Remark: Algorithm 1 for Case 1 (i.e., $B = n$) is almost the same as ProxSVRG (Reddi et al., 2016b), but the proof of Theorem 1 is notably different. Reddi et al. (Reddi et al., 2016b) used a Lyapunov function and showed that it decreases by the accumulated gradient mapping in each epoch. In our proof, we directly show that $\Phi$ decreases, using a different analysis. This is made possible by tightening the inequalities in several places, e.g., by using Young's inequality on different terms and applying Lemma 2 in a nontrivial way. Besides, we also use an idea similar to SCSG (Lei et al., 2017) to bound the variance term, and our convergence result holds for any minibatch size $1 \le b \le n$.
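As a back-of-the-envelope sanity check on these oracle counts (our own accounting, not the formal proof, and assuming an epoch length of order $\sqrt{b}$ as in Line 2 of Algorithm 1):

```latex
% T = total number of inner iterations, S = T/m = number of epochs, m = epoch length:
\#\mathrm{SFO}
  \;=\; \underbrace{S \cdot B}_{\text{snapshot gradients}}
        \;+\; \underbrace{T \cdot O(b)}_{\text{minibatch estimators}}
  \;=\; O\!\Big( T \Big( \tfrac{B}{m} + b \Big) \Big)
  \;=\; O\!\Big( T \Big( \tfrac{B}{\sqrt{b}} + b \Big) \Big),
\qquad
\#\mathrm{PO} \;=\; T .
```

With $B = n$, the factor $n/\sqrt{b} + b$ is smallest around $b = n^{2/3}$, which is where ProxSVRG+ matches ProxSVRG, while #PO stays at $T$ for every $b$.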

We defer the proof of Theorem 1 to Appendix A.1. Also, similar convergence results for other choices of epoch length (Line 2 of Algorithm 1) are provided in Appendix A.2.

Table 3: Comparison of the SFO and PO complexity under the PL condition with parameter $\mu$.
Columns: Algorithm; Stochastic first-order oracle (SFO); Proximal oracle (PO); Additional condition.
Rows: ProxGD (full gradient) (Karimi et al., 2016); ProxSVRG/SAGA (Reddi et al., 2016b); SCSG (Lei et al., 2017) (smooth nonconvex case, i.e., $h \equiv 0$ in (1); PO: NA); ProxSVRG+ (this paper).
Similar to Table 2, one can choose different batch sizes $B$ and minibatch sizes $b$. Similar to Figure 2, SCSG achieves its best result with $b = 1$, and ProxSVRG/SAGA achieve their best result with $b = n^{2/3}$ (ProxSVRG+ obtains the same result in this case); ProxSVRG+ is better than ProxGD and ProxSVRG/SAGA, and generalizes the results of SCSG, for moderate minibatch sizes. Natasha1.5 (Allen-Zhu, 2017b) did not consider the PL condition.

5 Convergence Under PL Condition

In this section, we establish a global linear convergence rate for nonconvex functions under the Polyak-Łojasiewicz (PL) condition (Polyak, 1963). The original form of the PL condition requires that there exists $\mu > 0$ such that

\|\nabla f(x)\|^2 \ge 2\mu\big(f(x) - f^*\big), \quad \forall x,     (6)

where $f^*$ denotes the (global) optimal function value. For example, $f$ satisfies the PL condition if $f$ is $\mu$-strongly convex. Note that the PL condition implies that every stationary point is a global minimum, but, unlike strong convexity, it does not imply that the minimum is unique. In particular, (Karimi et al., 2016) showed that the PL condition is weaker than many other conditions (e.g., strong convexity (SC), restricted strong convexity (RSC) and weak strong convexity (WSC) (Necoara et al., 2015)). Also, if $f$ is convex, the PL condition is equivalent to the error bound (EB) and quadratic growth (QG) conditions (Luo and Tseng, 1993; Anitescu, 2000).
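The strong-convexity example can be verified in one line (a standard argument, included here for completeness): minimize both sides of the strong-convexity inequality over $y$.

```latex
% If f is mu-strongly convex, then for all x, y:
f(y) \;\ge\; f(x) + \langle \nabla f(x),\, y - x \rangle + \tfrac{\mu}{2}\|y - x\|^2 .
% Minimizing both sides over y (the right-hand side is minimized at y = x - \nabla f(x)/\mu):
f^* \;\ge\; f(x) - \tfrac{1}{2\mu}\|\nabla f(x)\|^2
\quad\Longrightarrow\quad
\|\nabla f(x)\|^2 \;\ge\; 2\mu\big(f(x) - f^*\big),
% which is exactly the PL inequality (6).
```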

Due to the nonsmooth term $h$ in problem (1), we use the gradient mapping (see (4)) to define a more general form of the PL condition: there exists $\mu > 0$ such that

\|G_\eta(x)\|^2 \ge 2\mu\big(\Phi(x) - \Phi^*\big), \quad \forall x,     (7)

where $\Phi^*$ denotes the (global) optimal value of $\Phi$. Note that if $h$ is a constant function, the gradient mapping reduces to $G_\eta(x) = \nabla f(x)$, and our new definition of the PL condition reduces to the original one in (Polyak, 1963). Our PL condition is different from the one used in (Karimi et al., 2016; Reddi et al., 2016b); see Remark (2) at the end of this section. We list the convergence results under the PL condition in Table 3.

Similar to Theorem 1, we provide the convergence result of ProxSVRG+ (Algorithm 1) under the PL condition in the following Theorem 2. Note that under the PL condition (i.e., when (7) holds), ProxSVRG+ can directly use the final iterate $x_T$ as the output point instead of the randomly chosen $\hat{x}$. Similar to (Reddi et al., 2016b), we make a mild assumption on the condition number $\kappa := L/\mu$ for simplicity; otherwise, one can choose a different step size, similarly to how we handle other choices of the epoch length (see Appendix A.2).

Theorem 2

Let the step size be $\eta = \Theta(1/L)$ and let $b$ denote the minibatch size. Then, under the PL condition (7), the final iterate $x_T$ of Algorithm 1 satisfies $\mathbb{E}[\Phi(x_T) - \Phi^*] \le \epsilon$, i.e., ProxSVRG+ converges linearly. We distinguish the following two cases:

  1. We let the batch size $B = n$. The number of SFO calls is bounded by
     $O\Big(\big(n + \big(\tfrac{n}{\sqrt{b}} + b\big)\kappa\big)\log\tfrac{1}{\epsilon}\Big)$.

  2. Under Assumption 1, we let the batch size $B = O\big(\tfrac{\sigma^2}{\mu\epsilon} \wedge n\big)$. The number of SFO calls is bounded by
     $O\Big(\big(B + \big(\tfrac{B}{\sqrt{b}} + b\big)\kappa\big)\log\tfrac{1}{\epsilon}\Big)$,

where $\wedge$ denotes the minimum and $\kappa := L/\mu$ is the condition number. In both cases, the number of PO calls equals the total number of iterations $T$, which is bounded by $O\big(\kappa \log\tfrac{1}{\epsilon}\big)$.
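To see why a linear rate translates into the $\log(1/\epsilon)$ factors above, here is a schematic recursion (our own sketch; $c \in (0,1)$ is an unspecified constant depending on $\mu\eta$ and the epoch structure):

```latex
% Schematic per-iteration contraction under the gradient-mapping PL condition (7):
\mathbb{E}\big[\Phi(x_t) - \Phi^*\big]
  \;\le\; (1 - c)\,\mathbb{E}\big[\Phi(x_{t-1}) - \Phi^*\big]
  \;\le\; (1 - c)^{t}\,\big(\Phi(x_0) - \Phi^*\big),
% so E[Phi(x_T) - Phi^*] <= epsilon after
T \;=\; O\!\Big( \tfrac{1}{c}\,\log\tfrac{\Phi(x_0) - \Phi^*}{\epsilon} \Big)
% iterations; multiplying T by the amortized SFO cost per iteration gives the #SFO bounds.
```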

Remark:

  1. We show, via a nontrivial proof, that ProxSVRG+ directly obtains the global linear convergence rate without restart. Note that Reddi et al. (Reddi et al., 2016b) restarted ProxSVRG/SAGA $O(\log(1/\epsilon))$ times to obtain a global linear convergence rate under the PL condition. As in the general nonconvex case, ProxSVRG+ recovers several existing convergence results and sometimes outperforms them, and it also generalizes the results of SCSG (Lei et al., 2017) in this setting.

  2. We want to point out that (Karimi et al., 2016; Reddi et al., 2016b) used the following (proximal) PL condition:

    \tfrac{1}{2}\,\mathcal{D}_h(x, \alpha) \ge \mu\big(\Phi(x) - \Phi^*\big),     (8)

    where $\mathcal{D}_h(x, \alpha) := -2\alpha \min_{y}\big(\langle \nabla f(x), y - x\rangle + \tfrac{\alpha}{2}\|y - x\|^2 + h(y) - h(x)\big)$. Our PL condition (7) is arguably more natural. In fact, one can show that, under a mild condition on the step size $\eta$, our new PL condition (7) implies (8). For comparison, we also provide in the appendix a proof of the same result (i.e., that ProxSVRG+ directly obtains the linear convergence rate of Theorem 2 without restart) using the previous PL condition (8).

The proofs of Theorem 2 under PL form (7) and (8) are provided in Appendix B.1 and B.2, respectively.

6 Experiments

In this section, we present the experimental results. We compare the nonconvex ProxSVRG+ with nonconvex ProxGD, ProxSGD (Ghadimi et al., 2016), ProxSVRG (Reddi et al., 2016b) and ProxSCSG. Here ProxSCSG is a straightforward extension of SCSG (Lei et al., 2017), obtained by replacing the gradient update in SCSG with the proximal operator. Note that there is no known theoretical convergence result for ProxSCSG.

We conduct the experiments on the non-negative principal component analysis (NN-PCA) problem (same as (Reddi et al., 2016b)). In general, NN-PCA is NP-hard. Specifically, the optimization problem for a given set of $n$ samples $\{z_i\}_{i=1}^{n}$ is:

\min_{\|x\| \le 1,\; x \ge 0} \; -\frac{1}{2n}\sum_{i=1}^{n} (z_i^{\top} x)^2.     (9)

Note that (9) can be written in the form (1), where $f_i(x) = -\frac{1}{2}(z_i^{\top}x)^2$ and $h(x) = I_C(x)$ with the set $C = \{x : \|x\| \le 1,\; x \ge 0\}$. We conduct the experiments on the standard MNIST and 'a9a' datasets (the datasets can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). The experimental results on both datasets are almost the same (see Figures 3-10).
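Below is a small sketch (ours, for illustration) of the NN-PCA instance in the form (1): a minibatch gradient of $f$ and the proximal step for $h = I_C$, i.e., the projection onto $C = \{x : \|x\| \le 1, x \ge 0\}$. We use the standard fact that this projection can be computed by clipping negative entries and then rescaling into the unit ball; the function names plug into the ProxSVRG+ sketch given after Algorithm 1.

```python
import numpy as np

def nn_pca_avg_grad(Z, x, idx):
    # Average over the minibatch idx of grad f_i(x) = -(z_i^T x) z_i,
    # where f_i(x) = -(1/2) (z_i^T x)^2 and the rows of Z are the samples z_i.
    Zi = Z[idx]
    return -(Zi.T @ (Zi @ x)) / len(idx)

def prox_nn_ball(x, eta=None):
    # prox of the indicator of C = {x >= 0, ||x|| <= 1}: the projection onto C
    # (clip negative entries, then scale into the unit ball if necessary).
    y = np.maximum(x, 0.0)
    nrm = np.linalg.norm(y)
    return y if nrm <= 1.0 else y / nrm

# Hypothetical usage with the earlier sketch (Z is an (n, d) data matrix):
# x_hat = prox_svrg_plus(lambda x, idx: nn_pca_avg_grad(Z, x, idx), prox_nn_ball,
#                        x0=np.full(Z.shape[1], 1.0 / np.sqrt(Z.shape[1])),
#                        n=Z.shape[0], B=Z.shape[0], b=256, eta=0.05, n_epochs=20)
```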

The samples from each dataset are normalized, i.e., $\|z_i\| = 1$ for all $i$. The parameters of the algorithms are chosen as follows. The Lipschitz constant $L$ can be precomputed from the data samples in the same way as in (Li et al., 2017). The step sizes for the different algorithms are set to the ones used in their convergence results: for ProxGD, the step size from Corollary 1 in (Ghadimi et al., 2016); for ProxSGD, the one from Corollary 3 in (Ghadimi et al., 2016); for ProxSVRG, the one from Theorem 6 in (Reddi et al., 2016b); for ProxSCSG, the one from Corollary 3.3 in (Lei et al., 2017). The step size for our ProxSVRG+ is the one given in our Theorem 1. We did not further tune the step sizes.

Regarding the comparison among these algorithms, we use the number of SFO calls (see Definition 2) to evaluate them. For example, each iteration of ProxGD uses $n$ SFO calls (a full gradient). We need to point out that we amortize the batch size (i.e., the $B$ SFO calls in Line 5 of Algorithm 1) over the inner iterations, so that the curves in the figures are smoother (otherwise, the curves would look like step functions). For example, in ProxSVRG, ProxSCSG and ProxSVRG+, the number of SFO calls charged to each inner iteration is the minibatch cost plus the batch size divided by the corresponding epoch length.
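As an illustration of this bookkeeping, the amortized per-iteration SFO count used for plotting might be computed as follows (our own helper, not the authors' code; whether the minibatch step is charged $b$ or $2b$ SFO calls is a convention we leave as a parameter):

```python
def amortized_sfo_per_inner_iter(batch_size, minibatch_size, epoch_length,
                                 grads_per_minibatch_step=2):
    # Spread the snapshot batch evenly over the epoch's inner iterations and add
    # the per-step minibatch cost.
    return batch_size / epoch_length + grads_per_minibatch_step * minibatch_size

# Example: B = 60000, b = 256, m = 16  ->  60000/16 + 2*256 = 4262 SFO per iteration.
```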

We demonstrate the performance of these algorithms with respect to various minibatch sizes $b$. The experimental results are quite consistent with the theoretical results (Figure 2 at the end of Section 1). ProxSCSG achieves its fastest convergence with $b = 1$ and becomes worse for larger $b$. ProxSVRG generally gets better as $b$ gets larger. ProxSVRG+, however, first gets better as $b$ increases and then gets worse for larger $b$. In Figures 3 and 4, we can see that the proposed ProxSVRG+ (Algorithm 1) achieves its best performance at a moderate minibatch size (the red curve with dots).

Similarly to Figures 3 and 4, we also conduct the experiment for ProxSVRG (Reddi et al., 2016b) with various minibatch sizes $b$ in Figures 5 and 6. We note that ProxSVRG achieves its best performance with a much larger minibatch size, on both a9a and MNIST (Figures 5 and 6). Recall that, in theory, ProxSVRG+ achieves its best result with $b = \big(\frac{\sigma^2}{\epsilon} \wedge n\big)^{2/3}$ while ProxSVRG achieves its best performance with $b = n^{2/3}$. Note that a9a is smaller than the MNIST dataset. Hence, for ProxSVRG, the best-performing minibatch size increases as $n$ becomes larger, whereas ProxSVRG+ is less sensitive to $n$ in our experiments (the same moderate minibatch size achieves the best result on both datasets; see Figures 3 and 4).

Figure 3: Different minibatch sizes $b$ for ProxSVRG+
Figure 4: Different minibatch sizes $b$ for ProxSVRG+
Figure 5: Different minibatch sizes $b$ for ProxSVRG
Figure 6: Different minibatch sizes $b$ for ProxSVRG
Figure 7: Comparison among the algorithms for different minibatch sizes $b$
Figure 8: Comparison among the algorithms for different minibatch sizes $b$

In Figures 7 and 8, we compare the performance of these five algorithms as we vary the minibatch size $b$. We can see that when $b = 1$ (the top-left plot), ProxSCSG achieves the best result, and ProxSVRG+ is almost the same as ProxGD in this case. As $b$ increases, ProxSCSG keeps getting worse, while ProxSVRG+ gets better in the first four plots and then gets worse in the last two plots. The results are consistent with Figure 2.

Finally, we compare these algorithms, each with its corresponding best minibatch size $b$, in Figures 9 and 10. Here, ProxSVRG, ProxSCSG and ProxSVRG+ are run with the minibatch sizes that achieved their best performance in the previous experiments, which is quite consistent with the locations of the minima of the ProxSVRG, SCSG and ProxSVRG+ curves in Figure 2. One can see that ProxSVRG, ProxSCSG and ProxSVRG+ are quite close to each other and outperform ProxGD and ProxSGD. However, we argue that our algorithm might be more attractive in certain applications due to its moderate minibatch size (again, not too small, to allow parallelism, and not too large, possibly for better generalization).

Figure 9: Performance under best minibatch size
Figure 10: Performance under best minibatch size

7 Conclusion

In this paper, we propose a simple proximal stochastic gradient method called ProxSVRG+, a variant of ProxSVRG (Reddi et al., 2016b), for nonsmooth nonconvex optimization. We show that ProxSVRG+ recovers several well-known convergence results and improves upon some of them by choosing proper parameters. Furthermore, for nonconvex functions satisfying the Polyak-Łojasiewicz condition, we show that ProxSVRG+ achieves a global linear convergence rate without restart, while (Reddi et al., 2016b) restarted ProxSVRG $O(\log(1/\epsilon))$ times. Finally, we conducted several experiments, and the experimental results are consistent with our theoretical results.

Acknowledgments

We would like to thank Rong Ge for helpful discussions.

Appendix A Proofs for Nonconvex ProxSVRG+ Algorithm

In this appendix, we first provide the proof of Theorem 1 (Section A.1). Then we provide the proof for other choices of epoch length (Section A.2).

A.1 Proof of Theorem 1

Before proving Theorem 1, we need a useful lemma for the proximal operator.

Lemma 1

Let , then the following inequality holds:

(10)

Proof: First, we recall the proximal operator (see (5)):

(11)

For the nonsmooth function $h$, we have

(12)
(13)

where the subgradient of $h$ at the proximal point is the one given by the optimality condition of the proximal step (11), and (12) holds due to the convexity of $h$.

For the nonconvex function $f$, we have

(14)
(15)

where (14) holds due to the $L$-Lipschitz continuity of the gradients (see (2)), and (15) holds since $f$ has the same $L$-Lipschitz continuous gradient as the individual $f_i$ (being their average).

This lemma is proved by adding (13), (14) and (15), and recalling $\Phi(x) := f(x) + h(x)$.

Proof of Theorem 1. Now, we are ready to use Lemma 1 to prove Theorem 1. Let and . By letting and in (10), we have

(16)

Besides, by letting and in (10), we have

(17)

We add (16) and (17) to obtain the key inequality

(18)
(19)

where (18) uses the following Young’s inequality (choose )

(20)

and (19) holds due to the following Lemma 2.

Lemma 2

Let and . Then, the following inequality holds:

Proof of Lemma 2. First, we obtain the relation between and as follows (similar to (Ghadimi et al., 2016)):

(21)
(22)

where (21) and (22) hold due to (13). Adding (21) and (22), we have

(23)
(24)

where (23) uses the Cauchy-Schwarz inequality.

Now, this lemma is proved by using the Cauchy-Schwarz inequality together with (24), i.e.,

Note that $x_t = \mathrm{prox}_{\eta h}(x_{t-1} - \eta v_{t-1})$ is exactly the update used in our algorithm (see Line 8 of Algorithm 1). Now, we take expectations with respect to all the history in (19).

(25)

Then, we bound the variance term in (25) as follows:

(26)