# A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex Optimization

We analyze stochastic gradient algorithms for optimizing nonconvex, nonsmooth finite-sum problems, in which the objective function is the sum of a differentiable (possibly nonconvex) component and a possibly non-differentiable but convex component. We propose a proximal stochastic gradient algorithm based on variance reduction, called ProxSVRG+. The algorithm is a slight variant of the ProxSVRG algorithm [Reddi et al., 2016b]. Our main contribution lies in the analysis of ProxSVRG+. It recovers several existing convergence results (in terms of the number of stochastic gradient oracle calls and proximal operations), and improves/generalizes some others. In particular, ProxSVRG+ generalizes the best results given by the SCSG algorithm, recently proposed by [Lei et al., 2017] for the smooth nonconvex case. ProxSVRG+ is more straightforward than SCSG and yields a simpler analysis. Moreover, ProxSVRG+ outperforms deterministic proximal gradient descent (ProxGD) for a wide range of minibatch sizes, which partially solves an open problem proposed in [Reddi et al., 2016b]. Finally, for nonconvex functions satisfying the Polyak-Łojasiewicz condition, we show that ProxSVRG+ achieves a global linear convergence rate without restart. In this case, ProxSVRG+ is never worse than ProxGD and ProxSVRG/SAGA, sometimes outperforms them, and again generalizes the results of SCSG.


## 1 Introduction

In this paper, we consider nonsmooth nonconvex finite-sum optimization problems of the form

$$\min_{x\in\mathbb{R}^d}\ \Phi(x) := f(x) + h(x), \tag{1}$$

where $f(x):=\frac{1}{n}\sum_{i=1}^{n}f_i(x)$ and each $f_i$ is possibly nonconvex with a Lipschitz continuous gradient, while $h$ is non-differentiable but convex (e.g., the $\ell_1$ norm or the indicator function of some convex set $C$). We assume that the proximal operator of $h$ can be computed efficiently.
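For instance, when $h(x)=\lambda\|x\|_1$, the proximal operator reduces to elementwise soft-thresholding. A minimal sketch (the function name and parameter names are ours, not the paper's):

```python
import numpy as np

def prox_l1(x, eta, lam):
    """Proximal operator of h(x) = lam * ||x||_1 with step size eta:
    argmin_y  lam * ||y||_1 + (1 / (2 * eta)) * ||y - x||^2,
    which is elementwise soft-thresholding with threshold eta * lam."""
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

# Entries with magnitude below eta * lam are zeroed; larger ones shrink toward 0.
print(prox_l1(np.array([3.0, -0.5, 0.2]), eta=1.0, lam=1.0))
```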

The above optimization problem is fundamental to many machine learning problems, ranging from convex problems such as Lasso and SVM to highly nonconvex problems such as training deep neural networks. There has been extensive research on the case where $f$ is convex (see, e.g., (Xiao and Zhang, 2014; Defazio et al., 2014; Lan and Zhou, 2015; Allen-Zhu, 2017a)). In particular, if the $f_i$'s are strongly convex, (Xiao and Zhang, 2014) proposed the Prox-SVRG algorithm, which achieves a linear convergence rate based on the well-known variance reduction technique developed in (Johnson and Zhang, 2013). In recent years, due to the increasing popularity of deep learning, the nonconvex case has attracted significant attention; see, e.g., (Ghadimi and Lan, 2013; Allen-Zhu and Hazan, 2016; Reddi et al., 2016a; Lei et al., 2017) for results on the smooth nonconvex case (i.e., $h\equiv 0$). For the more general nonsmooth nonconvex case, the research is still somewhat limited.

Recently, for the nonsmooth nonconvex case, (Reddi et al., 2016b) provided two algorithms, called ProxSVRG (similar to (Xiao and Zhang, 2014)) and ProxSAGA, which are based on the well-known variance reduction techniques SVRG and SAGA (Johnson and Zhang, 2013; Defazio et al., 2014). Before that, (Ghadimi et al., 2016) analyzed the deterministic proximal gradient method (i.e., computing the full gradient in every iteration) for nonconvex nonsmooth problems; we denote it here as ProxGD. (Ghadimi et al., 2016) also considered the stochastic case (denoted here as ProxSGD). However, ProxSGD requires the minibatch size to be very large or to increase with the iteration number, so ProxSGD may reduce to the deterministic ProxGD after some iterations due to the increasing minibatch sizes. Note that from the perspectives of both computational efficiency and statistical generalization, always computing the full gradient (GD or ProxGD) may not be desirable for large-scale machine learning problems. A reasonable minibatch size is also desirable in practice, since the computation of minibatch stochastic gradients can be implemented in parallel. In fact, practitioners typically use moderate minibatch sizes, often ranging from 16 or 32 to a few hundred (sometimes a few thousand; see, e.g., (Priya et al., 2017)).¹ Hence, it is important to study convergence in the moderate and constant minibatch size regime.

¹In fact, some studies argued that smaller minibatch sizes in SGD are very useful for generalization (e.g., (Keskar et al., 2016)). Although generalization is not the focus of the present paper, it provides further motivation for studying the moderate minibatch size regime.

(Reddi et al., 2016b) provided the first non-asymptotic convergence rates for ProxSVRG with minibatch size at most $n^{2/3}$ for nonsmooth nonconvex problems. However, their convergence bounds (using constant or moderate-size minibatches) are worse than that of the deterministic ProxGD in terms of the number of proximal oracle calls, and their algorithms (i.e., ProxSVRG/SAGA) outperform ProxGD only if they use the quite large minibatch size $b=n^{2/3}$. Note that in a typical application the number of training data points $n$ is on the order of $10^6$ or more, so $n^{2/3}\ge 10^4$ is quite a large minibatch size. Finally, they posed the important open problem of developing stochastic methods with provably better performance than ProxGD with constant minibatch size.

Our Contribution: In this paper, we propose ProxSVRG+ to solve (1). Our algorithm is almost the same as ProxSVRG (Reddi et al., 2016b), except for some detailed parameter settings (under the PL condition, (Reddi et al., 2016b) needed to restart ProxSVRG $O(\log(1/\epsilon))$ times, while ProxSVRG+ does not need any restart). Our main technical contribution lies in the new convergence analysis of ProxSVRG+, which differs notably from that of ProxSVRG (Reddi et al., 2016b). Our convergence results are stated in terms of the number of stochastic first-order oracle (SFO) calls and proximal oracle (PO) calls (see Definition 2 for the formal definitions).

We are mainly interested in modern large-scale machine learning problems, in which the number of data points $n$ is typically very large (e.g., $n\ge 10^6$) and the required accuracy $\epsilon$ is moderate. Hence, we generally regard $n$ as a much larger number than $1/\epsilon$, a moderate minibatch size as something like $n^{c}$ (for some constant $0<c<1$), and a small minibatch size as $O(1)$. We list our results in Table 1 and Figure 2. We would like to highlight the following results yielded by our new analysis.

1. ProxSVRG+ is $\sqrt{b}$ (resp. $n/b$) times faster than ProxGD (full gradient) in terms of #SFO when $b\le n^{2/3}$ (resp. $b\ge n^{2/3}$), and $n^{1/3}$ times faster than ProxGD when $b=n^{2/3}$. Note that #PO is $O(1/\epsilon)$ for both ProxSVRG+ and ProxGD. Hence, for any super-constant $b$ (such as $b=\log n$, a moderate minibatch size often used in practice), ProxSVRG+ is strictly better than ProxGD, which partially answers the open question posed in (Reddi et al., 2016b). We also note that ProxSVRG+ matches the best result achieved by ProxSVRG at $b=n^{2/3}$, and ProxSVRG+ is strictly better for smaller $b$ (using fewer PO calls). See Figure 2 for an overview.

2. Assuming that the variance of the stochastic gradients is bounded ($\sigma^2<\infty$, see Assumption 1), ProxSVRG+ matches the best result in terms of #SFO achieved by SCSG, proposed recently by Lei et al. (Lei et al., 2017), for the smooth nonconvex case, i.e., $h\equiv 0$ in form (1) (see Table 1, the 5th row). Arguably, ProxSVRG+ is more straightforward than SCSG and yields a simpler proof. Our result also matches that of Natasha1.5, proposed very recently by Allen-Zhu (Allen-Zhu, 2017b), in terms of #SFO if there is no additional assumption (see Footnote 2 for details). In terms of #PO, our algorithm outperforms Natasha1.5.

We also note that SCSG (Lei et al., 2017) achieved its best result with minibatch size $b=1$, while our best convergence result is achieved with minibatch size $b=(n\wedge 1/\epsilon)^{2/3}$ (see Table 2, the 5th row), which is a moderate minibatch size used in practice (and hence can take advantage of parallelism).

3. For nonconvex functions satisfying the Polyak-Łojasiewicz condition (Polyak, 1963), we show that ProxSVRG+ achieves a global linear convergence rate without restart, while Reddi et al. (Reddi et al., 2016b) restarted ProxSVRG/SAGA $O(\log(1/\epsilon))$ times to obtain the global linear convergence rate. In this case, ProxSVRG+ is never worse than ProxGD and ProxSVRG/SAGA, sometimes outperforms them, and also generalizes the results of SCSG. We list our results in Table 3.

³Note that in Figure 2 the curve of ProxSGD overlaps with that of ProxSVRG+ for some values of $b$, and the curve of ProxSVRG/SAGA overlaps with that of ProxSVRG+ for others.

## 2 Preliminaries

We assume that each $f_i$ in (1) has an $L$-Lipschitz continuous gradient, i.e., there is a constant $L>0$ such that

$$\|\nabla f_i(x)-\nabla f_i(y)\|\le L\|x-y\|, \tag{2}$$

where $\|\cdot\|$ denotes the Euclidean norm. Note that $f_i$ does not need to be convex. We also assume that the nonsmooth convex function $h$ in (1) is well structured, i.e., the following proximal operator on $h$ can be computed efficiently:

$$\mathrm{prox}_{\eta h}(x):=\arg\min_{y\in\mathbb{R}^d}\Big(h(y)+\frac{1}{2\eta}\|y-x\|^2\Big). \tag{3}$$

For convex problems, one typically uses the optimality gap $\Phi(x)-\Phi^*$ as the convergence criterion (see, e.g., (Nesterov, 2004)). But for general nonconvex problems, one typically uses the gradient norm. E.g., for smooth nonconvex problems (i.e., $h\equiv 0$), (Ghadimi and Lan, 2013; Reddi et al., 2016a; Lei et al., 2017) used $\|\nabla f(\hat{x})\|$ to measure convergence. In order to analyze convergence for nonsmooth nonconvex problems, we need to define the gradient mapping as follows (as in (Ghadimi et al., 2016; Reddi et al., 2016b)):

$$\mathcal{G}_\eta(x):=\frac{1}{\eta}\big(x-\mathrm{prox}_{\eta h}(x-\eta\nabla f(x))\big). \tag{4}$$

We often use an equivalent but useful form of the proximal operator, with the linear term made explicit:

$$\mathrm{prox}_{\eta h}(x-\eta v)=\arg\min_{y\in\mathbb{R}^d}\Big(h(y)+\frac{1}{2\eta}\|y-x\|^2+\langle v,y\rangle\Big). \tag{5}$$

Note that if $h$ is a constant function (in particular, zero), the gradient mapping reduces to the ordinary gradient: $\mathcal{G}_\eta(x)=\nabla f(x)$. In this paper, we use the gradient mapping as the convergence criterion (same as (Ghadimi et al., 2016; Reddi et al., 2016b)).
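The gradient mapping (4) is cheap to compute whenever the prox is available. A small sanity check (with our own helper names), confirming that it collapses to $\nabla f$ when $h\equiv 0$:

```python
import numpy as np

def grad_mapping(x, grad_f, prox_h, eta):
    """Gradient mapping G_eta(x) = (x - prox_{eta h}(x - eta * grad_f(x))) / eta."""
    return (x - prox_h(x - eta * grad_f(x), eta)) / eta

grad_f = lambda x: 2.0 * x      # f(x) = ||x||^2, so grad f(x) = 2x
prox_id = lambda z, eta: z      # h == 0, so the prox is the identity
x = np.array([1.0, -2.0])
print(grad_mapping(x, grad_f, prox_id, eta=0.1))  # equals grad_f(x) = [2. -4.]
```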

###### Definition 1

$\hat{x}$ is called an $\epsilon$-accurate solution for problem (1) if $\mathbb{E}[\|\mathcal{G}_\eta(\hat{x})\|^2]\le\epsilon$, where $\hat{x}$ denotes the point returned by a stochastic algorithm.

To measure the efficiency of a stochastic algorithm, we use the following oracle complexity.

###### Definition 2
1. Stochastic first-order oracle (SFO): given a point $x$, the SFO outputs a stochastic gradient $\nabla f_i(x)$ such that $\mathbb{E}_i[\nabla f_i(x)]=\nabla f(x)$, where $i$ is chosen uniformly at random from $\{1,\dots,n\}$.

2. Proximal oracle (PO): given a point $x$, the PO outputs the result of the proximal projection $\mathrm{prox}_{\eta h}(x)$ (see (3)).

Sometimes, we need the following assumption on the variance of the stochastic gradients for some algorithms or some particular cases (see the last column "additional condition" in Table 1). Such an assumption is necessary if one wants the convergence result to be independent of $n$.

###### Assumption 1

For all $i$ and $x$, $\mathbb{E}_i[\|\nabla f_i(x)-\nabla f(x)\|^2]\le\sigma^2$, where $\sigma>0$ is a constant and $\nabla f_i(x)$ is a stochastic gradient.

## 3 Nonconvex ProxSVRG+ Algorithm

In this section, we propose a proximal stochastic gradient algorithm, called ProxSVRG+, which is a variant of the ProxSVRG algorithm (Reddi et al., 2016b) (and is also similar to (Xiao and Zhang, 2014)). ProxSVRG+ is very straightforward, and the details are described in Algorithm 1. We call $B$ the batch size and $b$ the minibatch size.
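Since Algorithm 1 itself is not reproduced here, the following is a hedged sketch of the ProxSVRG+-style double loop under our own naming conventions (a batch gradient of size $B$ at each snapshot, variance-reduced minibatch steps of size $b$, and one proximal call per inner iteration); it is an illustration, not the paper's pseudocode verbatim:

```python
import numpy as np

def prox_svrg_plus(grads, prox_h, x0, n, B, b, m, eta, epochs, rng):
    """Sketch of a ProxSVRG+-style loop. grads(x, idx) returns the averaged
    stochastic gradient (1/|idx|) * sum_{i in idx} grad f_i(x); prox_h(z, eta)
    is the proximal oracle."""
    x_tilde = x0.copy()                        # snapshot point
    x = x0.copy()
    for _ in range(epochs):
        batch = rng.choice(n, size=B, replace=False)
        g = grads(x_tilde, batch)              # batch gradient at the snapshot
        for _ in range(m):                     # inner loop of length m
            mini = rng.choice(n, size=b, replace=True)
            # SVRG-style variance-reduced gradient estimator
            v = grads(x, mini) - grads(x_tilde, mini) + g
            x = prox_h(x - eta * v, eta)       # proximal step (one PO call)
        x_tilde = x.copy()                     # refresh the snapshot
    return x
```

For Case 1 of the analysis below one would take $B=n$ (a full batch at each snapshot), recovering a ProxSVRG-like method; the analysis returns a uniformly random iterate, while this sketch simply returns the last one.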

## 4 Convergence Results

Now, we present the main theorem for ProxSVRG+ which corresponds to the last two rows in Table 1.

###### Theorem 1

Let $\eta$ denote the step size (a constant fraction of $1/L$, as specified in the proof) and $b$ the minibatch size. Then $\hat{x}$ returned by Algorithm 1 is an $\epsilon$-accurate solution for problem (1) (i.e., $\mathbb{E}[\|\mathcal{G}_\eta(\hat{x})\|^2]\le\epsilon$). We distinguish the following two cases:

1. We let batch size $B=n$. The number of SFO calls is at most

$$36L\big(\Phi(x^0)-\Phi(x^*)\big)\Big(\frac{B}{\epsilon\sqrt{b}}+\frac{b}{\epsilon}\Big)=O\Big(\frac{n}{\epsilon\sqrt{b}}+\frac{b}{\epsilon}\Big).$$

2. Under Assumption 1, we let batch size $B=n\wedge O(\sigma^2/\epsilon)$. The number of SFO calls is at most

$$36L\big(\Phi(x^0)-\Phi(x^*)\big)\Big(\frac{B}{\epsilon\sqrt{b}}+\frac{b}{\epsilon}\Big)=O\Big(\Big(n\wedge\frac{1}{\epsilon}\Big)\frac{1}{\epsilon\sqrt{b}}+\frac{b}{\epsilon}\Big),$$

where $\wedge$ denotes the minimum. In both cases, the number of PO calls equals the total number of iterations $T$, which is at most

$$\frac{36L}{\epsilon}\big(\Phi(x^0)-\Phi(x^*)\big)=O\Big(\frac{1}{\epsilon}\Big).$$

Remark: Algorithm 1 for Case 1 (i.e., $B=n$) is almost the same as ProxSVRG (Reddi et al., 2016b), but the proof of Theorem 1 is notably different. Reddi et al. (Reddi et al., 2016b) used a Lyapunov function of the form $\Phi(x_t^s)+c_t\|x_t^s-\tilde{x}^s\|^2$ and showed that it decreases by the accumulated gradient mapping in each epoch. In our proof, we directly show that $\Phi(x_t^s)$ decreases, using a different analysis. This is made possible by tightening the inequalities in several places, e.g., by using Young's inequality on different terms and applying Lemma 2 in a nontrivial way. Besides, we also use an idea similar to SCSG (Lei et al., 2017) to bound the variance term, and our convergence result holds for any minibatch size $b\in[1,n]$.

We defer the proof of Theorem 1 to Appendix A.1. Also, similar convergence results for other choices of epoch length (Line 2 of Algorithm 1) are provided in Appendix A.2.

## 5 Convergence Under PL Condition

In this section, we provide the global linear convergence rate for nonconvex functions under the Polyak-Łojasiewicz (PL) condition (Polyak, 1963). The original form of the PL condition is

$$\exists\,\mu>0,\ \text{such that}\ \|\nabla f(x)\|^2\ge 2\mu\big(f(x)-f^*\big),\ \ \forall x, \tag{6}$$

where $f^*$ denotes the (global) optimal function value. For example, $f$ satisfies the PL condition if $f$ is $\mu$-strongly convex. Note that the PL condition implies that every stationary point is a global minimum, but unlike strong convexity it does not imply that the minimum is unique. In particular, (Karimi et al., 2016) showed that the PL condition is weaker than many other conditions (e.g., strong convexity (SC), restricted strong convexity (RSC) and weak strong convexity (WSC) (Necoara et al., 2015)). Also, if $f$ is convex, the PL condition is equivalent to the error bounds (EB) and quadratic growth (QG) conditions (Luo and Tseng, 1993; Anitescu, 2000).
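As a concrete illustration (an example from (Karimi et al., 2016), not from the present paper), $f(x)=x^2+3\sin^2(x)$ is nonconvex, since $f''(x)=2+6\cos(2x)$ changes sign, yet it satisfies the PL condition with $\mu=1/32$. This can be checked numerically on a grid:

```python
import numpy as np

# f(x) = x^2 + 3 sin^2(x): nonconvex but PL with mu = 1/32 (Karimi et al., 2016),
# i.e., f'(x)^2 >= 2 * mu * (f(x) - f*), with f* = 0 attained at x = 0.
x = np.linspace(-10.0, 10.0, 4001)
f = x**2 + 3.0 * np.sin(x)**2
fp = 2.0 * x + 3.0 * np.sin(2.0 * x)

mask = f > 1e-9                  # avoid 0/0 at the global minimum x = 0
ratio = fp[mask]**2 / f[mask]    # should stay >= 2 * mu = 1/16 everywhere
print(ratio.min())               # comfortably above 1/16
```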

Due to the nonsmooth term $h$ in problem (1), we use the gradient mapping (see (4)) to define a more general form of the PL condition as follows:

$$\exists\,\mu>0,\ \text{such that}\ \|\mathcal{G}_\eta(x)\|^2\ge 2\mu\big(\Phi(x)-\Phi^*\big),\ \ \forall x. \tag{7}$$

Note that if $h$ is a constant function, the gradient mapping reduces to $\nabla f(x)$, and our new definition of the PL condition reduces to the original one in (Polyak, 1963). Our PL condition is different from the one used in (Karimi et al., 2016; Reddi et al., 2016b); see Remark (2) at the end of this section. We list the convergence results under the PL condition in Table 3.

Similar to Theorem 1, we provide the convergence result of ProxSVRG+ (Algorithm 1) under the PL condition in the following Theorem 2. Note that under the PL condition (i.e., when (7) holds), ProxSVRG+ can directly use the final iterate $x_T$ as the output point instead of the randomly chosen $\hat{x}$. Similar to (Reddi et al., 2016b), we place a mild assumption on the condition number $L/\mu$ for simplicity; otherwise, one can choose a different step size, similarly to how we deal with other choices of the epoch length (see Appendix A.2).

###### Theorem 2

Let $\eta$ denote the step size and $b$ the minibatch size. Then the final iterate $x_T$ of Algorithm 1 satisfies $\mathbb{E}[\Phi(x_T)-\Phi^*]\le\epsilon$ under the PL condition. We distinguish the following two cases:

1. We let batch size $B=n$. The number of SFO calls is bounded by

$$O\Big(\frac{n}{\mu\sqrt{b}}\log\frac{1}{\epsilon}+\frac{b}{\mu}\log\frac{1}{\epsilon}\Big).$$

2. Under Assumption 1, we let batch size $B=n\wedge O\big(\sigma^2/(\mu\epsilon)\big)$. The number of SFO calls is bounded by

$$O\Big(\Big(n\wedge\frac{1}{\mu\epsilon}\Big)\frac{1}{\mu\sqrt{b}}\log\frac{1}{\epsilon}+\frac{b}{\mu}\log\frac{1}{\epsilon}\Big),$$

where $\wedge$ denotes the minimum. In both cases, the number of PO calls equals the total number of iterations $T$, which is bounded by $O\big(\frac{1}{\mu}\log\frac{1}{\epsilon}\big)$.

Remark:

1. We show, via a nontrivial proof, that ProxSVRG+ directly obtains the global linear convergence rate without restart, while Reddi et al. (Reddi et al., 2016b) restarted ProxSVRG/SAGA $O(\log(1/\epsilon))$ times to obtain a global linear convergence rate under the PL condition. As before, ProxSVRG+ recovers several existing convergence results and sometimes outperforms them, and it also generalizes the results of SCSG (Lei et al., 2017) in this case.

2. We want to point out that (Karimi et al., 2016; Reddi et al., 2016b) used the following form of the PL condition:

$$\exists\,\mu>0,\ \text{such that}\ \mathcal{D}_h(x,\alpha)\ge 2\mu\big(\Phi(x)-\Phi^*\big),\ \ \forall x, \tag{8}$$

where $\mathcal{D}_h(x,\alpha):=-2\alpha\min_y\big(\langle\nabla f(x),\,y-x\rangle+\frac{\alpha}{2}\|y-x\|^2+h(y)-h(x)\big)$. Our PL condition (7) is arguably more natural, and one can show that, for a suitable choice of $\alpha$, it implies (8). For comparison, we also provide the proof of the same result (ProxSVRG+ directly obtains the linear convergence rate of Theorem 2 without restart) using the previous PL condition (8) in the appendix.

The proofs of Theorem 2 under PL form (7) and (8) are provided in Appendix B.1 and B.2, respectively.

## 6 Experiments

In this section, we present the experimental results. We compare the nonconvex ProxSVRG+ with nonconvex ProxGD, ProxSGD (Ghadimi et al., 2016), ProxSVRG (Reddi et al., 2016b) and ProxSCSG. Here ProxSCSG is a straightforward extension of SCSG (Lei et al., 2017), obtained by replacing the gradient update in SCSG with the proximal operator. Note that there is no known theoretical convergence result for ProxSCSG.

We conduct the experiments on the non-negative principal component analysis (NN-PCA) problem (same as (Reddi et al., 2016b)). In general, NN-PCA is NP-hard. Specifically, the optimization problem for a given set of $n$ samples $\{z_i\}_{i=1}^{n}$ (with $z_i\in\mathbb{R}^d$) is:

$$\min_{\|x\|\le 1,\ x\ge 0}\ -\frac{1}{2}x^T\Big(\sum_{i=1}^{n}z_iz_i^T\Big)x. \tag{9}$$

Note that (9) can be written in form (1), where (up to scaling) $f_i(x)=-\frac{1}{2}(z_i^Tx)^2$ and $h(x)$ is the indicator function of the set $C=\{x:\|x\|\le 1,\ x\ge 0\}$. We conduct the experiments on the standard MNIST and 'a9a' datasets.⁵ The experimental results on both datasets are almost the same (see Figures 6–10).

⁵The datasets can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
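For reference, the two ingredients a proximal stochastic method needs for (9) — a minibatch stochastic gradient and the projection onto $C$ — can be sketched as follows (helper names are ours; the $1/|idx|$ averaging matches the averaged form of (1), which only rescales (9) and does not change the minimizer):

```python
import numpy as np

def nnpca_grad(x, Z, idx):
    """Minibatch stochastic gradient for f_i(x) = -(1/2) * (z_i^T x)^2:
    returns (1/|idx|) * sum_{i in idx} -(z_i^T x) * z_i."""
    Zi = Z[idx]                       # rows of Z are the samples z_i
    return -(Zi.T @ (Zi @ x)) / len(idx)

def project_nnball(x, eta=None):
    """Proximal operator of the indicator of C = {x : x >= 0, ||x|| <= 1}:
    clip negative entries, then rescale into the unit ball (eta is ignored,
    since the prox of an indicator is just the projection)."""
    y = np.maximum(x, 0.0)
    return y / max(np.linalg.norm(y), 1.0)
```

These two callables can then be plugged into any of the proximal methods compared in this section as the SFO and PO, respectively.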

The samples from each dataset are normalized, i.e., $\|z_i\|=1$ for all $i$. The parameters of the algorithms are chosen as follows: $L$ can be precomputed from the data samples in the same way as in (Li et al., 2017). The step sizes for the different algorithms are set to the values used in their convergence results: for ProxGD, see Corollary 1 in (Ghadimi et al., 2016); for ProxSGD, see Corollary 3 in (Ghadimi et al., 2016); for ProxSVRG, see Theorem 6 in (Reddi et al., 2016b); for ProxSCSG, see Corollary 3.3 in (Lei et al., 2017); and for our ProxSVRG+, see our Theorem 1. We did not further tune the step sizes.

To compare these algorithms, we use the number of SFO calls (see Definition 2) to evaluate them. For example, each iteration of ProxGD uses $n$ SFO calls (a full gradient). We point out that we amortize the batch size (i.e., $n$ or $B$ in Line 5 of Algorithm 1) over the inner loops, so that the curves in the figures are smoother;⁶ e.g., in ProxSVRG, ProxSCSG and ProxSVRG+, the number of SFO calls charged to each inner iteration is the minibatch size plus the amortized share of the batch size.

⁶Otherwise, the curves would look like step functions.

We demonstrate the performance of these algorithms with respect to various minibatch sizes $b$. The experimental results are quite consistent with the theoretical results (Figure 2 at the end of Section 1). ProxSCSG achieves its fastest convergence with the smallest minibatch size and becomes worse for larger $b$. ProxSVRG is generally better when $b$ gets larger. However, as $b$ increases, ProxSVRG+ first gets better and then gets worse. In Figure 6, we can see that the proposed ProxSVRG+ (Algorithm 1) achieves the best performance at a moderate minibatch size (the red curve with dots).

Similar to Figure 6, we also conduct the experiment for ProxSVRG (Reddi et al., 2016b) with various minibatch sizes $b$. We note that ProxSVRG achieves its best performance with a much larger minibatch size on both a9a and MNIST. Recall that in theory, ProxSVRG+ achieves its best result with $b=(n\wedge 1/\epsilon)^{2/3}$ and ProxSVRG achieves its best performance with $b=n^{2/3}$. Note that a9a is smaller than the MNIST dataset; hence, for ProxSVRG, the best-performing minibatch size increases as $n$ becomes larger. ProxSVRG+, on the other hand, is less sensitive to $n$ in our experiments (the same moderate minibatch size achieves the best result on both datasets; see Figure 6).

In Figure 8, we compare the performance of the five algorithms as we vary the minibatch size $b$. We can see that for the smallest minibatch size (top-left panel), ProxSCSG achieves the best result, and ProxSVRG+ is almost the same as ProxGD in this case. In the subsequent panels, ProxSCSG keeps getting worse as $b$ grows, while ProxSVRG+ gets better in the first four panels and then gets worse in the last two. The results are consistent with Figure 2.

Finally, we compare these algorithms at their corresponding best minibatch sizes $b$ in Figure 10. This result is quite consistent with the minimum points of the curves of ProxSVRG, SCSG and ProxSVRG+ in Figure 2. One can see that ProxSVRG, ProxSCSG and ProxSVRG+ are quite close to each other and outperform ProxGD and ProxSGD. However, we argue that our algorithm may be more attractive in certain applications due to its moderate minibatch size (again, not too small for parallelism and not too large for, possibly, better generalization).

## 7 Conclusion

In this paper, we propose a simple proximal stochastic gradient method called ProxSVRG+, a variant of ProxSVRG (Reddi et al., 2016b), for nonsmooth nonconvex optimization. We show that ProxSVRG+ recovers several well-known convergence results and improves some of them by choosing proper parameters. Furthermore, for nonconvex functions satisfying the Polyak-Łojasiewicz condition, we show that ProxSVRG+ achieves a global linear convergence rate without restart, while (Reddi et al., 2016b) restarted ProxSVRG $O(\log(1/\epsilon))$ times. Finally, we conducted several experiments, and the experimental results are consistent with our theoretical results.

## Acknowledgments

We would like to thank Rong Ge for helpful discussions.

## Appendix A Proofs for Nonconvex ProxSVRG+ Algorithm

In this appendix, we first provide the proof of Theorem 1 (Section A.1). Then we provide the proof for other choices of epoch length (Section A.2).

### A.1 Proof of Theorem 1

Before proving Theorem 1, we need a useful lemma for the proximal operator.

###### Lemma 1

Let $x^+:=\mathrm{prox}_{\eta h}(x-\eta v)$. Then the following inequality holds:

$$\Phi(x^+)\le\Phi(z)+\langle\nabla f(x)-v,\,x^+-z\rangle-\frac{1}{\eta}\langle x^+-x,\,x^+-z\rangle+\frac{L}{2}\|x^+-x\|^2+\frac{L}{2}\|z-x\|^2,\quad\forall z\in\mathbb{R}^d. \tag{10}$$

Proof: First, we recall the proximal operator (see (5)):

$$x^+=\mathrm{prox}_{\eta h}(x-\eta v)=\arg\min_{y\in\mathbb{R}^d}\Big(h(y)+\frac{1}{2\eta}\|y-x\|^2+\langle v,y\rangle\Big). \tag{11}$$

For the nonsmooth function $h$, we have

$$h(x^+)\le h(z)+\langle p,\,x^+-z\rangle \tag{12}$$
$$\phantom{h(x^+)}= h(z)-\Big\langle v+\frac{1}{\eta}(x^+-x),\,x^+-z\Big\rangle, \tag{13}$$

where $p\in\partial h(x^+)$ with $p=-v-\frac{1}{\eta}(x^+-x)$ according to the optimality condition of (11), and (12) holds due to the convexity of $h$.

For the (possibly nonconvex) function $f$, we have

$$f(x^+)\le f(x)+\langle\nabla f(x),\,x^+-x\rangle+\frac{L}{2}\|x^+-x\|^2 \tag{14}$$
$$-f(z)\le -f(x)-\langle\nabla f(x),\,z-x\rangle+\frac{L}{2}\|z-x\|^2, \tag{15}$$

where (14) is the smoothness upper bound and (15) is the smoothness lower bound, both following from the $L$-Lipschitz continuity of $\nabla f$ (see (2)).

The lemma is proved by adding (13), (14) and (15), and recalling $\Phi(x):=f(x)+h(x)$. □

Proof of Theorem 1. Now, we are ready to use Lemma 1 to prove Theorem 1. Let $x_t^s:=\mathrm{prox}_{\eta h}(x_{t-1}^s-\eta v_{t-1}^s)$ and $\bar{x}_t^s:=\mathrm{prox}_{\eta h}(x_{t-1}^s-\eta\nabla f(x_{t-1}^s))$. By letting $x=x_{t-1}^s$, $v=v_{t-1}^s$, $x^+=x_t^s$ and $z=\bar{x}_t^s$ in (10), we have

$$\Phi(x_t^s)\le\Phi(\bar{x}_t^s)+\langle\nabla f(x_{t-1}^s)-v_{t-1}^s,\,x_t^s-\bar{x}_t^s\rangle-\frac{1}{\eta}\langle x_t^s-x_{t-1}^s,\,x_t^s-\bar{x}_t^s\rangle+\frac{L}{2}\|x_t^s-x_{t-1}^s\|^2+\frac{L}{2}\|\bar{x}_t^s-x_{t-1}^s\|^2. \tag{16}$$

Besides, by letting $x=x_{t-1}^s$, $v=\nabla f(x_{t-1}^s)$, $x^+=\bar{x}_t^s$ and $z=x_{t-1}^s$ in (10), we have

$$\Phi(\bar{x}_t^s)\le\Phi(x_{t-1}^s)-\frac{1}{\eta}\langle\bar{x}_t^s-x_{t-1}^s,\,\bar{x}_t^s-x_{t-1}^s\rangle+\frac{L}{2}\|\bar{x}_t^s-x_{t-1}^s\|^2=\Phi(x_{t-1}^s)-\Big(\frac{1}{\eta}-\frac{L}{2}\Big)\|\bar{x}_t^s-x_{t-1}^s\|^2. \tag{17}$$

We add (16) and (17), and expand the cross term $-\frac{1}{\eta}\langle x_t^s-x_{t-1}^s,\,x_t^s-\bar{x}_t^s\rangle$ using the identity $\langle a,b\rangle=\frac{1}{2}\big(\|a\|^2+\|b\|^2-\|a-b\|^2\big)$, to obtain the key inequality

$$\begin{aligned}
\Phi(x_t^s)&\le\Phi(x_{t-1}^s)+\frac{L}{2}\|x_t^s-x_{t-1}^s\|^2-\Big(\frac{1}{\eta}-L\Big)\|\bar{x}_t^s-x_{t-1}^s\|^2+\langle\nabla f(x_{t-1}^s)-v_{t-1}^s,\,x_t^s-\bar{x}_t^s\rangle\\
&\quad-\frac{1}{2\eta}\Big(\|x_t^s-x_{t-1}^s\|^2+\|x_t^s-\bar{x}_t^s\|^2-\|\bar{x}_t^s-x_{t-1}^s\|^2\Big)\\
&=\Phi(x_{t-1}^s)-\Big(\frac{1}{2\eta}-\frac{L}{2}\Big)\|x_t^s-x_{t-1}^s\|^2-\Big(\frac{1}{2\eta}-L\Big)\|\bar{x}_t^s-x_{t-1}^s\|^2\\
&\quad+\langle\nabla f(x_{t-1}^s)-v_{t-1}^s,\,x_t^s-\bar{x}_t^s\rangle-\frac{1}{2\eta}\|x_t^s-\bar{x}_t^s\|^2\\
&\le\Phi(x_{t-1}^s)-\Big(\frac{1}{2\eta}-\frac{L}{2}\Big)\|x_t^s-x_{t-1}^s\|^2-\Big(\frac{1}{2\eta}-L\Big)\|\bar{x}_t^s-x_{t-1}^s\|^2+\langle\nabla f(x_{t-1}^s)-v_{t-1}^s,\,x_t^s-\bar{x}_t^s\rangle\\
&\quad-\frac{1}{8\eta}\|x_t^s-x_{t-1}^s\|^2+\frac{1}{6\eta}\|\bar{x}_t^s-x_{t-1}^s\|^2 \quad(18)\\
&=\Phi(x_{t-1}^s)-\Big(\frac{5}{8\eta}-\frac{L}{2}\Big)\|x_t^s-x_{t-1}^s\|^2-\Big(\frac{1}{3\eta}-L\Big)\|\bar{x}_t^s-x_{t-1}^s\|^2+\langle\nabla f(x_{t-1}^s)-v_{t-1}^s,\,x_t^s-\bar{x}_t^s\rangle\\
&\le\Phi(x_{t-1}^s)-\Big(\frac{5}{8\eta}-\frac{L}{2}\Big)\|x_t^s-x_{t-1}^s\|^2-\Big(\frac{1}{3\eta}-L\Big)\|\bar{x}_t^s-x_{t-1}^s\|^2+\eta\|\nabla f(x_{t-1}^s)-v_{t-1}^s\|^2, \quad(19)
\end{aligned}$$

where (18) uses the following Young's inequality (choosing $\alpha=3$)

$$\|x_t^s-x_{t-1}^s\|^2\le\Big(1+\frac{1}{\alpha}\Big)\|\bar{x}_t^s-x_{t-1}^s\|^2+(1+\alpha)\|x_t^s-\bar{x}_t^s\|^2,\quad\forall\alpha>0, \tag{20}$$

and (19) holds due to the following Lemma 2.

###### Lemma 2

Let $x_t^s:=\mathrm{prox}_{\eta h}(x_{t-1}^s-\eta v_{t-1}^s)$ and $\bar{x}_t^s:=\mathrm{prox}_{\eta h}(x_{t-1}^s-\eta\nabla f(x_{t-1}^s))$. Then, the following inequality holds:

$$\langle\nabla f(x_{t-1}^s)-v_{t-1}^s,\,x_t^s-\bar{x}_t^s\rangle\le\eta\|\nabla f(x_{t-1}^s)-v_{t-1}^s\|^2$$

Proof of Lemma 2. First, we obtain the relation between $x_t^s$ and $\bar{x}_t^s$ as follows (similar to (Ghadimi et al., 2016)):

$$h(x_t^s)\le h(\bar{x}_t^s)-\Big\langle v_{t-1}^s+\frac{1}{\eta}(x_t^s-x_{t-1}^s),\,x_t^s-\bar{x}_t^s\Big\rangle \tag{21}$$
$$h(\bar{x}_t^s)\le h(x_t^s)-\Big\langle\nabla f(x_{t-1}^s)+\frac{1}{\eta}(\bar{x}_t^s-x_{t-1}^s),\,\bar{x}_t^s-x_t^s\Big\rangle, \tag{22}$$

where (21) and (22) hold due to (13). Adding (21) and (22), we have

$$\frac{1}{\eta}\langle x_t^s-\bar{x}_t^s,\,x_t^s-\bar{x}_t^s\rangle\le\langle\nabla f(x_{t-1}^s)-v_{t-1}^s,\,x_t^s-\bar{x}_t^s\rangle$$
$$\frac{1}{\eta}\|x_t^s-\bar{x}_t^s\|^2\le\|\nabla f(x_{t-1}^s)-v_{t-1}^s\|\,\|x_t^s-\bar{x}_t^s\| \tag{23}$$
$$\|x_t^s-\bar{x}_t^s\|\le\eta\|\nabla f(x_{t-1}^s)-v_{t-1}^s\|, \tag{24}$$

where (23) uses the Cauchy-Schwarz inequality.

Now, the lemma is proved by using the Cauchy-Schwarz inequality and (24):

$$\langle\nabla f(x_{t-1}^s)-v_{t-1}^s,\,x_t^s-\bar{x}_t^s\rangle\le\|\nabla f(x_{t-1}^s)-v_{t-1}^s\|\,\|x_t^s-\bar{x}_t^s\|\le\eta\|\nabla f(x_{t-1}^s)-v_{t-1}^s\|^2. \qquad\square$$
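Both steps of this short proof — the nonexpansiveness bound (24) and the final Cauchy-Schwarz step — can be verified numerically for a concrete prox (here $h=\lambda\|\cdot\|_1$; helper names are ours):

```python
import numpy as np

def prox_l1(z, eta, lam=0.5):
    """Soft-thresholding: prox of h(x) = lam * ||x||_1 with step eta."""
    return np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)

rng = np.random.default_rng(2)
eta = 0.1
for _ in range(1000):
    x = rng.normal(size=8)
    g = rng.normal(size=8)           # plays the role of grad f(x)
    v = rng.normal(size=8)           # a stochastic gradient estimate
    x_plus = prox_l1(x - eta * v, eta)
    x_bar = prox_l1(x - eta * g, eta)
    # (24): nonexpansiveness of the prox gives ||x+ - xbar|| <= eta * ||g - v||
    assert np.linalg.norm(x_plus - x_bar) <= eta * np.linalg.norm(g - v) + 1e-9
    # Lemma 2: <g - v, x+ - xbar> <= eta * ||g - v||^2
    assert np.dot(g - v, x_plus - x_bar) <= eta * np.linalg.norm(g - v)**2 + 1e-9
print("Lemma 2 verified on random instances")
```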

Note that $x_t^s=\mathrm{prox}_{\eta h}(x_{t-1}^s-\eta v_{t-1}^s)$ is exactly the iterate computed in our algorithm (see Line 8 in Algorithm 1). Now, we take expectations over all history in (19):

$$\mathbb{E}[\Phi(x_t^s)]\le\mathbb{E}\Big[\Phi(x_{t-1}^s)-\Big(\frac{5}{8\eta}-\frac{L}{2}\Big)\|x_t^s-x_{t-1}^s\|^2-\Big(\frac{1}{3\eta}-L\Big)\|\bar{x}_t^s-x_{t-1}^s\|^2+\eta\|\nabla f(x_{t-1}^s)-v_{t-1}^s\|^2\Big] \tag{25}$$

Then, we bound the variance term in (25). Since the minibatch $I_b$ and the batch $I_B$ are sampled independently, the variance decomposes into a minibatch part, controlled via the Lipschitz continuity of the gradients, and a batch part:

$$\mathbb{E}\big[\eta\|\nabla f(x_{t-1}^s)-v_{t-1}^s\|^2\big]\le\frac{\eta L^2}{b}\,\mathbb{E}\big[\|x_{t-1}^s-\tilde{x}^{s-1}\|^2\big]+\eta\,\mathbb{E}\bigg[\Big\|\frac{1}{B}\sum_{j\in I_B}\big(\nabla f_j(\tilde{x}^{s-1})-\nabla f(\tilde{x}^{s-1})\big)\Big\|^2\bigg] \tag{26}$$