 # A Stochastic Composite Gradient Method with Incremental Variance Reduction

We consider the problem of minimizing the composition of a smooth (nonconvex) function and a smooth vector mapping, where the inner mapping is in the form of an expectation over some random variable or a finite sum. We propose a stochastic composite gradient method that employs an incremental variance-reduced estimator for both the inner vector mapping and its Jacobian. We show that this method achieves the same orders of complexity as the best known first-order methods for minimizing expected-value and finite-sum nonconvex functions, despite the additional outer composition which renders the composite gradient estimator biased. This finding enables a much broader range of applications in machine learning to benefit from the low complexity of incremental variance-reduction methods.


## 1 Introduction

We consider stochastic composite optimization problems of the form

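$$\underset{x\in\mathbb{R}^d}{\text{minimize}}\quad f\big(\mathbb{E}_\xi[g_\xi(x)]\big) + r(x), \tag{1}$$

where $f:\mathbb{R}^p\to\mathbb{R}$ is a smooth and possibly nonconvex function, $\xi$ is a random variable, each $g_\xi:\mathbb{R}^d\to\mathbb{R}^p$ is a smooth vector mapping, and $r$ is convex and lower-semicontinuous. A special case we will consider separately is when $\xi$ is a discrete random variable with uniform distribution over $\{1,2,\ldots,n\}$. In this case the problem is equivalent to a deterministic optimization problem

$$\underset{x\in\mathbb{R}^d}{\text{minimize}}\quad f\Big(\frac{1}{n}\sum_{i=1}^{n} g_i(x)\Big) + r(x). \tag{2}$$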

The formulations in (1) and (2) cover a broader range of applications than classical stochastic optimization and empirical risk minimization (ERM) problems, where each $g_\xi$ is a scalar function ($p=1$) and $f$ is the scalar identity map. A well-known example is policy evaluation in reinforcement learning (RL) (e.g., Sutton and Barto, 1998). With linear value function approximation, it can be formulated as

$$\underset{x\in\mathbb{R}^d}{\text{minimize}}\quad \big\|\mathbb{E}[A]\,x-\mathbb{E}[b]\big\|^2,$$

where $A$ and $b$ are a random matrix and a random vector generated by a Markov decision process (MDP) (e.g., Dann et al., 2014). Here we have $\xi=(A,b)$, $g_\xi(x)=Ax-b$, $f(y)=\|y\|^2$, and $r\equiv 0$.

Another interesting application is risk-averse optimization (e.g., Rockafellar, 2007; Ruszczyński, 2013), which has many applications in RL and financial mathematics. We consider a general formulation of mean-variance trade-off:

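$$\underset{x\in\mathbb{R}^d}{\text{maximize}}\quad \mathbb{E}_\xi[h_\xi(x)]-\lambda\,\mathrm{Var}_\xi\big(h_\xi(x)\big) \;=\; \mathbb{E}_\xi[h_\xi(x)]-\lambda\Big(\mathbb{E}_\xi[h_\xi^2(x)]-\big(\mathbb{E}_\xi[h_\xi(x)]\big)^2\Big), \tag{3}$$

where each $h_\xi:\mathbb{R}^d\to\mathbb{R}$ is a reward function (such as total portfolio return) and $\lambda>0$ weighs the variance penalty. The goal of problem (3) is to maximize the average reward with a penalty on the variance, which captures the potential risk. It can be cast in the form of (1), as a minimization problem, by using the mappings

$$g_\xi(x)=\big[h_\xi(x)\ \ h_\xi^2(x)\big]^T:\ \mathbb{R}^d\to\mathbb{R}^2,\qquad f(y,z)=-y-\lambda y^2+\lambda z:\ \mathbb{R}^2\to\mathbb{R}, \tag{4}$$

so that $f(\mathbb{E}_\xi[g_\xi(x)])$ is the negative of the objective in (3).

Here, the intermediate dimension is very low, i.e., $p=2$. This leads to very little overhead in computation compared with stochastic optimization without composition.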

Besides these applications, the composition structure in (1) and (2) is of independent interest for research on stochastic and randomized algorithms. For ease of notation, we define

$$g(x):=\mathbb{E}_\xi[g_\xi(x)],\qquad F(x):=f(g(x)),\qquad \Phi(x):=F(x)+r(x). \tag{5}$$

In addition, let $f'$ and $F'$ denote the gradients of $f$ and $F$ respectively, and $g'(x)\in\mathbb{R}^{p\times d}$ denote the Jacobian matrix of $g$ at $x$. Then we have

$$F'(x)=\big(g'(x)\big)^T f'\big(g(x)\big).$$

In practice, computing $F'(x)$ exactly can be very costly if not impossible. A common strategy is to use stochastic approximation: we randomly sample a subset $\mathcal{S}$ of $\xi$ from its distribution and let

$$\tilde g(x)=\frac{1}{|\mathcal{S}|}\sum_{\xi\in\mathcal{S}} g_\xi(x),\qquad \tilde g'(x)=\frac{1}{|\mathcal{S}|}\sum_{\xi\in\mathcal{S}} g'_\xi(x). \tag{6}$$

However, $\big(\tilde g'(x)\big)^T f'\big(\tilde g(x)\big)$ is in general a biased estimate of $F'(x)$ unless one can replace $\tilde g(x)$ with the full expectation $g(x)$. This is in great contrast to the classical stochastic optimization problem

$$\underset{x\in\mathbb{R}^d}{\text{minimize}}\quad \mathbb{E}_\xi[g_\xi(x)]+r(x), \tag{7}$$

where $\tilde g'(x)$ in (6) is always an unbiased gradient estimator for the smooth part $g(x)$. Using biased gradient estimators can cause various difficulties for constructing and analyzing randomized algorithms, but it is often inevitable when dealing with objective functions more complex than the empirical risk (see, e.g., Chaudhari et al., 2016; Hazan et al., 2016; Gulcehre et al., 2016; Mobahi and Fisher, 2015). As the simplest such model, the analysis of randomized algorithms for (1) may provide insights for solving more challenging problems.
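The bias of the plug-in composite gradient can be seen on a tiny example. The sketch below uses a hypothetical toy instance that is not from the paper: $g_\xi(x)=\xi x$ with $\xi$ uniform on $\{0,2\}$ and $f(y)=y^2$, so that $g(x)=x$ and $F(x)=x^2$. It computes, by exact enumeration with sample size $|\mathcal{S}|=1$, the expectation of the plug-in estimator and compares it with $F'(x)$.

```python
from fractions import Fraction

# Hypothetical toy instance (an assumption for illustration only):
# g_xi(x) = xi * x with xi uniform on {0, 2}, and f(y) = y**2.
XI_VALUES = [Fraction(0), Fraction(2)]


def f_grad(y):
    """Gradient of the outer function f(y) = y**2."""
    return 2 * y


def true_gradient(x):
    """F'(x) = g'(x) * f'(g(x)) with g(x) = E[xi] * x and g'(x) = E[xi]."""
    mean_xi = sum(XI_VALUES) / len(XI_VALUES)
    return mean_xi * f_grad(mean_xi * x)


def expected_plugin_gradient(x):
    """Exact expectation of g~'(x) * f'(g~(x)) when |S| = 1."""
    total = Fraction(0)
    for xi in XI_VALUES:
        g_tilde = xi * x      # one-sample estimate of g(x)
        jac_tilde = xi        # one-sample estimate of g'(x)
        total += jac_tilde * f_grad(g_tilde)
    return total / len(XI_VALUES)


x = Fraction(3)
print(true_gradient(x), expected_plugin_gradient(x))
```

In this instance the Jacobian estimate alone is unbiased, yet the expected plug-in gradient is twice the true gradient, because $f'$ is applied to a noisy inner estimate.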

In this paper, we develop an efficient stochastic composite gradient method called CIVR (Composite Incremental Variance Reduction) for solving problems of the forms (1) and (2). We measure efficiency by the sample complexity of the individual functions $g_\xi$ and their Jacobians $g'_\xi$, i.e., the total number of times they need to be evaluated at some point, in order to find an $\epsilon$-approximate solution. For nonconvex functions, an $\epsilon$-approximate solution is some random output $\bar x$ of the algorithm that satisfies $\mathbb{E}[\|\mathcal{G}_\eta(\bar x)\|^2]\le\epsilon$, where $\mathcal{G}_\eta(\bar x)$ is the proximal gradient mapping of the objective function $\Phi$ at $\bar x$ (see details in Section 2). If $r\equiv 0$, then $\mathcal{G}_\eta(\bar x)=F'(\bar x)$ and the criterion for $\epsilon$-approximation becomes $\mathbb{E}[\|F'(\bar x)\|^2]\le\epsilon$. If the objective $\Phi$ is convex, we require $\mathbb{E}[\Phi(\bar x)-\Phi^*]\le\epsilon$, where $\Phi^*=\inf_x\Phi(x)$. For smooth and convex functions, these two notions are compatible, meaning that the dependence of the sample complexity on $\epsilon$ in terms of both notions is of the same order.

Table 1 summarizes the sample complexities of the CIVR method under different assumptions obtained in this paper. We can define a condition number for -gradient dominant functions and for -optimally strongly convex functions, then the complexities become and for (1) and (2) respectively. In order to better position our contributions, we next discuss related work and then putting these results into context.

### 1.1 Related Work

We first discuss the nonconvex stochastic optimization problem (7), which is a special cases of (1). When and is smooth, Ghadimi and Lan (2013) developed a randomized stochastic gradient method with iteration complexity . Allen-Zhu (2018) obtained with additional second-order guarantee. There are also many recent works on solving its finite-sum version

$$\underset{x\in\mathbb{R}^d}{\text{minimize}}\quad \frac{1}{n}\sum_{i=1}^{n} g_i(x)+r(x), \tag{8}$$

which is a special case of (2). By extending the variance reduction techniques SVRG (Johnson and Zhang, 2013; Xiao and Zhang, 2014) and SAGA (Defazio et al., 2014) to nonconvex optimization, Allen-Zhu and Hazan (2016) and Reddi et al. (2016a, b, c) developed randomized algorithms with sample complexity . Under additional assumptions of gradient dominance or strong convexity, they obtained sample complexity , where is a suitable condition number. Allen-Zhu (2017) and Lei et al. (2017) obtained .

Based on a new variance reduction technique called SARAH (Nguyen et al., 2017), Nguyen et al. (2019) and Pham et al. (2019) developed nonconvex extensions to obtain sample complexities and for solving the expectation and finite-sum cases respectively. Fang et al. (2018) introduced another variance reduction technique called Spider, which can be viewed as a more general variant of SARAH. They obtained sample complexities and for the two cases respectively, but require small step sizes that are proportional to . Wang et al. (2018) extended Spider to obtain the same complexities with constant step sizes and under the gradient-dominant condition. In addition, Zhou et al. (2018) obtained similar results using a nested SVRG approach.

In addition to the above works on solving special cases of (1) and (2), there are also considerable recent works on a more general, two-layer stochastic composite optimization problem

$$\underset{x\in\mathbb{R}^d}{\text{minimize}}\quad \mathbb{E}_\nu\big[f_\nu\big(\mathbb{E}_\xi[g_\xi(x)]\big)\big]+r(x), \tag{9}$$

where is parametrized by another random variables , which is independent of . When , Wang et al. (2017a) derived algorithms to find an -approximate solution with sample complexities , and for the smooth nonconvex case, smooth convex case and smooth strongly convex case respectively. For nontrivial convex , Wang et al. (2017b) obtained improved sample complexity of , and for the three cases mentioned above respectively.

As a special case of (9), the following finite-sum problem also received significant attention:

$$\underset{x\in\mathbb{R}^d}{\text{minimize}}\quad \frac{1}{m}\sum_{j=1}^{m} f_j\Big(\frac{1}{n}\sum_{i=1}^{n} g_i(x)\Big)+r(x). \tag{10}$$

When and the overall objective function is strongly convex, Lian et al. (2017) derived two algorithms based on the SVRG scheme to attain sample complexities and respectively, where  is some suitably defined condition number. Huo et al. (2018) also used the SVRG scheme to obtain an complexity for the smooth nonconvex case and for strongly convex problems with nonsmooth . More recently, Zhang and Xiao (2019) proposed a composite randomized incremental gradient method based on the SAGA estimator Defazio et al. (2014), which matches the best known complexity when  is smooth and nonconvex, and obtained an improved complexity under either gradient dominant or strongly convex assumptions. When applied to the special cases (1) and (2) we focus on in this paper (), these results are strictly worse than ours in Table 1.

### 1.2 Contributions and Outline

We develop the CIVR method by extending the variance reduction technique of SARAH (Nguyen et al., 2017, 2019; Pham et al., 2019) and Spider (Fang et al., 2018; Wang et al., 2018) to solve the composite optimization problems (1) and (2). The complexities of CIVR in Table 1 match the best results for solving the non-composite problems (7) and (8), despite the additional outer composition and the composite-gradient estimator always being biased. In addition:

• It is shown in (Fang et al., 2018) that the complexity is nearly optimal for the non-composite finite-sum optimization problem (8). Therefore, we do not expect algorithms with better complexity for solving the more general composite finite-sum problem (2).

• Under the assumptions of gradient dominance or strong convexity, the complexity only appeared for the special case (8) in the recent work (Li and Li, 2018).

Our results indicate that the additional smooth composition in (1) and (2) does not incur higher complexity compared with (7) and (8), despite the difficulty of dealing with biased estimators. We believe these results can also be extended to the two-layer problems (9) and (10), by replacing  with in Table 1. But the extensions require quite different techniques and we will address them in a separate paper.

The rest of this paper is organized as follows. In Section 2, we introduce the CIVR method. In Section 3, we present convergence results of CIVR for solving the composite optimization problems (1) and (2) and the required parameter settings. Better complexities of CIVR under the gradient-dominant and optimally strongly convex conditions are given in Section 4. In Section 5, we present numerical experiments for solving a risk-averse portfolio optimization problem (3) on real-world datasets.

## 2 The composite incremental variance reduction (CIVR) method


With the notations in (5), we can write the composite stochastic optimization problem (1) as

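$$\underset{x\in\mathbb{R}^d}{\text{minimize}}\quad \big\{\Phi(x)=F(x)+r(x)\big\}, \tag{14}$$

where $F$ is smooth and $r$ is convex. The proximal operator of $r$ with parameter $\eta$ is defined as

$$\mathrm{prox}_r^\eta(x):=\arg\min_{y}\Big\{r(y)+\frac{1}{2\eta}\|y-x\|^2\Big\}. \tag{15}$$

We assume that $r$ is relatively simple, meaning that its proximal operator has a closed-form solution or can be computed efficiently. The proximal gradient method (e.g., Nesterov, 2013; Beck, 2017) for solving problem (14) is

$$x^{t+1}=\mathrm{prox}_r^\eta\big(x^t-\eta F'(x^t)\big), \tag{16}$$

where $\eta$ is the step size. The proximal gradient mapping of $\Phi$ is defined as

$$\mathcal{G}_\eta(x)\triangleq\frac{1}{\eta}\Big(x-\mathrm{prox}_r^\eta\big(x-\eta F'(x)\big)\Big). \tag{17}$$

As a result, the proximal gradient method (16) can be written as $x^{t+1}=x^t-\eta\,\mathcal{G}_\eta(x^t)$. Notice that when $r\equiv 0$, $\mathrm{prox}_r^\eta(\cdot)$ becomes the identity mapping and we have $\mathcal{G}_\eta(x)\equiv F'(x)$ for any $\eta>0$.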
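As a concrete illustration of (15) and (17), the following minimal sketch implements the proximal operator for the choice $r(x)=\lambda\|x\|_1$ (soft-thresholding) and the resulting proximal gradient mapping. The function names and the plain-list vector representation are assumptions made for this example only.

```python
def prox_l1(v, eta, lam):
    """prox of r(x) = lam * ||x||_1 with parameter eta, as in (15).

    The minimizer is the soft-thresholding of v at level eta * lam.
    """
    t = eta * lam
    return [max(abs(vj) - t, 0.0) * (1.0 if vj > 0 else -1.0) for vj in v]


def gradient_mapping(x, grad, eta, prox):
    """G_eta(x) = (x - prox(x - eta * F'(x))) / eta, as in (17).

    `grad` is F'(x) and `prox` maps a vector to its proximal point.
    """
    step = [xj - eta * gj for xj, gj in zip(x, grad)]
    p = prox(step)
    return [(xj - pj) / eta for xj, pj in zip(x, p)]


# With r = 0 the prox is the identity and the mapping returns F'(x).
print(gradient_mapping([1.0, 2.0], [0.5, -0.25], 0.5, lambda v: v))
# With an l1 term the mapping differs from the plain gradient.
print(gradient_mapping([1.0], [1.0], 0.5, lambda v: prox_l1(v, 0.5, 1.0)))
```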

Suppose $\bar x$ is generated by a randomized algorithm. We call $\bar x$ an $\epsilon$-stationary point in expectation if

$$\mathbb{E}\big[\|\mathcal{G}_\eta(\bar x)\|^2\big]\le\epsilon. \tag{18}$$

(We assume that $\eta$ is a constant that does not depend on $\epsilon$.) As we mentioned in the introduction, we measure the efficiency of an algorithm by its sample complexity of $g_\xi$ and their Jacobians $g'_\xi$, i.e., the total number of times they need to be evaluated, in order to find a point $\bar x$ that satisfies (18). Our goal is to develop a randomized algorithm that has low sample complexity.

We present in Algorithm 1 the Composite Incremental Variance Reduction (CIVR) method. This method employs a two time-scale variance-reduced estimator for both the inner function value $g(\cdot)=\mathbb{E}_\xi[g_\xi(\cdot)]$ and its Jacobian $g'(\cdot)$. At the beginning of each outer iteration $t$ (each called an epoch), we construct relatively accurate estimates $y_0^t$ for $g(x_0^t)$ and $z_0^t$ for $g'(x_0^t)$, using a relatively large sample size $B_t$. During each inner iteration $i$ of the $t$th epoch, we construct estimates $y_i^t$ for $g(x_i^t)$ and $z_i^t$ for $g'(x_i^t)$, using a smaller sample size $S_t$ and incremental corrections from the previous iterations. Note that the epoch length $\tau_t$ and the sample sizes $B_t$ and $S_t$ are all adjustable for each epoch $t$. Therefore, besides setting a constant set of parameters, we can also adjust them gradually in order to obtain better theoretical properties and practical performance.

This variance-reduction technique was first proposed as part of SARAH (Nguyen et al., 2017), where it is called recursive variance reduction. It was also proposed in (Fang et al., 2018) in the form of a Stochastic Path-Integrated Differential EstimatoR (Spider). Here we simply call it incremental variance reduction. A distinct feature of this incremental estimator is that the inner-loop estimates $y_i^t$ and $z_i^t$ are biased, i.e.,

$$\begin{cases}\mathbb{E}[y_i^t\mid x_i^t]=g(x_i^t)-g(x_{i-1}^t)+y_{i-1}^t\neq g(x_i^t),\\[2pt] \mathbb{E}[z_i^t\mid x_i^t]=g'(x_i^t)-g'(x_{i-1}^t)+z_{i-1}^t\neq g'(x_i^t).\end{cases} \tag{19}$$

This is in contrast to two other popular variance-reduction techniques, SVRG (Johnson and Zhang, 2013) and SAGA (Defazio et al., 2014), whose gradient estimators are always unbiased. Note that unbiased estimators for $g(x_i^t)$ and $g'(x_i^t)$ are not essential here, because the composite estimator $(z_i^t)^T f'(y_i^t)$ is always biased.

## 3 Convergence Analysis

In this section, we present theoretical results on the convergence properties of CIVR (Algorithm 1) when the composite function $F(x)=f(g(x))$ is smooth. More specifically, we make the following assumptions.

###### Assumption 1.

The following conditions hold concerning problems (1) and (2):

• $f:\mathbb{R}^p\to\mathbb{R}$ is a smooth and $\ell_f$-Lipschitz function and its gradient $f'$ is $L_f$-Lipschitz.

• Each $g_\xi:\mathbb{R}^d\to\mathbb{R}^p$ is a smooth and $\ell_g$-Lipschitz vector mapping and its Jacobian $g'_\xi$ is $L_g$-Lipschitz. Consequently, $g$ in (5) is $\ell_g$-Lipschitz and its Jacobian $g'$ is $L_g$-Lipschitz.

• $r$ is a convex and lower-semicontinuous function.

• The overall objective function $\Phi$ is bounded below, i.e., $\Phi^*=\inf_x\Phi(x)>-\infty$.

###### Assumption 2.

For problem (1), we further assume that there exist constants $\sigma_g$ and $\sigma_{g'}$ such that

$$\mathbb{E}_\xi\big[\|g_\xi(x)-g(x)\|^2\big]\le\sigma_g^2,\qquad \mathbb{E}_\xi\big[\|g'_\xi(x)-g'(x)\|^2\big]\le\sigma_{g'}^2. \tag{20}$$

As a result of Assumption 1, $F(x)=f(g(x))$ is smooth and $F'$ is $L_F$-Lipschitz continuous with

$$L_F=\ell_g^2L_f+\ell_fL_g$$

(see proof in the supplementary materials). For convenience, we also define two constants

$$G_0:=2\big(\ell_g^4L_f^2+\ell_f^2L_g^2\big),\qquad \sigma_0^2:=2\big(\ell_g^2L_f^2\sigma_g^2+\ell_f^2\sigma_{g'}^2\big). \tag{21}$$

It is important to notice that , since we will use step size .
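The constants $L_F$, $G_0$ and $\sigma_0^2$ are simple functions of the Lipschitz and variance constants. The sketch below (the function name is an assumption of this example) evaluates them and checks numerically the relation $L_F^2\le G_0$, which follows from $(a+b)^2\le 2a^2+2b^2$ with $a=\ell_g^2L_f$ and $b=\ell_fL_g$.

```python
def composite_constants(ell_f, L_f, ell_g, L_g, sigma_g, sigma_gp):
    """Return (L_F, G_0, sigma_0^2) as defined around (21)."""
    L_F = ell_g ** 2 * L_f + ell_f * L_g
    G0 = 2 * (ell_g ** 4 * L_f ** 2 + ell_f ** 2 * L_g ** 2)
    sigma0_sq = 2 * (ell_g ** 2 * L_f ** 2 * sigma_g ** 2
                     + ell_f ** 2 * sigma_gp ** 2)
    return L_F, G0, sigma0_sq


# Illustrative values only; they do not come from the paper.
L_F, G0, sigma0_sq = composite_constants(1, 2, 3, 4, 5, 6)
print(L_F, G0, sigma0_sq, L_F ** 2 <= G0)
```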

In the next two subsections, we present complexity analysis of CIVR for solving problems (1) and (2) respectively. Due to space limitations, all proofs are provided in the supplementary materials.

### 3.1 The composite expectation case

The following results for solving problem (1) are presented with notations defined in (5), (17) and (21).

###### Theorem 1.

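Suppose Assumptions 1 and 2 hold. Given any $\epsilon>0$, we set  and

$$\tau_t=\tau=\lceil 1/\sqrt{\epsilon}\,\rceil,\qquad B_t=B=\lceil\sigma_0^2/\epsilon\rceil,\qquad S_t=S=\lceil 1/\sqrt{\epsilon}\,\rceil,\qquad\text{for } t=1,\ldots,T.$$

Then as long as , the output $\bar x$ of Algorithm 1 satisfies

$$\mathbb{E}\big[\|\mathcal{G}_\eta(\bar x)\|^2\big]\le\Big(8\big(\Phi(x_0^1)-\Phi^*\big)\eta^{-1}+6\Big)\cdot\epsilon=O(\epsilon). \tag{22}$$

As a result, the sample complexity of obtaining an $\epsilon$-approximate solution is .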
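The parameter choices in Theorem 1 are easy to evaluate for a given precision. The sketch below computes $\tau$, $B$ and $S$ from $\epsilon$ and $\sigma_0^2$. The per-epoch sample count it reports rests on an assumption of this example, namely that an epoch uses $B$ samples at its start and that each of the remaining $\tau-1$ inner iterations evaluates $S$ sampled mappings at two points (the current and the previous iterate).

```python
import math


def theorem1_parameters(eps, sigma0_sq):
    """Epoch length and batch sizes as set in Theorem 1."""
    tau = math.ceil(1 / math.sqrt(eps))
    B = math.ceil(sigma0_sq / eps)
    S = math.ceil(1 / math.sqrt(eps))
    return tau, B, S


def samples_per_epoch(tau, B, S):
    """Assumed accounting: B at the epoch start, 2*S per later inner step."""
    return B + 2 * S * (tau - 1)


tau, B, S = theorem1_parameters(eps=1 / 64, sigma0_sq=2)
print(tau, B, S, samples_per_epoch(tau, B, S))
```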

Note that in the above scheme, the epoch lengths $\tau_t$ and all the batch sizes $B_t$ and $S_t$ are set to be constant (depending on a pre-fixed $\epsilon$) regardless of $t$. Intuitively, we do not need as many samples in the early stage of the algorithm as in the later stage. In addition, it will be useful in practice to have a variant of the algorithm that can adaptively choose $\tau_t$, $B_t$ and $S_t$ throughout the epochs without dependence on a pre-fixed precision. This is done in the following theorem.

###### Theorem 2.

Suppose Assumptions 1 and 2 hold. We set and where and . Then as long as , we have for any ,

 (23)

As a result, obtaining an -approximate solution requires epochs and a total sample complexity of , where the notation hides logarithmic factors.

### 3.2 The composite finite-sum case

In this section, we consider the composite finite-sum optimization problem (2). In this case, the random variable $\xi$ has a uniform distribution over the finite index set $\{1,\ldots,n\}$. At the beginning of each epoch in Algorithm 1, we use the full sample of size $n$ to compute $y_0^t$ and $z_0^t$. Therefore $B_t=n$ for all $t$ and Equation (11) in Algorithm 1 becomes

$$y_0^t=g(x_0^t)=\frac{1}{n}\sum_{j=1}^{n}g_j(x_0^t),\qquad z_0^t=g'(x_0^t)=\frac{1}{n}\sum_{j=1}^{n}g'_j(x_0^t). \tag{24}$$

Also in this case, we no longer need Assumption 2.

###### Theorem 3.

Suppose Assumptions 1 holds. Let the parameters in Algorithm LABEL:alg:CIVR be set as and for all . Then as long as , we have for any ,

 (25)

As a result, obtaining an -approximate solution requires epochs and a total sample complexity of .

Similar to the previous section, we can also choose the epoch lengths and sample sizes adaptively to save the sampling cost in the early stage of the algorithm. However, due to the finite-sum structure of the problem, when the batch size $B_t$ reaches $n$, we will start to take the full batch at the beginning of each epoch to get the exact $g(x_0^t)$ and $g'(x_0^t)$. This leads to the following theorem.

###### Theorem 4.

Suppose Assumptions 1 holds. For some positive constants and , denote . When we set the parameters to be ; when , we set and . Then as long as ,

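$$\mathbb{E}\big[\|\mathcal{G}_\eta(\bar x)\|^2\big]\le\begin{cases}O\Big(\dfrac{\ln T}{T^2}\Big) & \text{if } T\le T_0,\\[8pt] O\Big(\dfrac{\ln n}{\sqrt{n}\,(T-T_0+1)}\Big) & \text{if } T>T_0.\end{cases} \tag{26}$$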

As a result, the total sample complexity of Algorithm LABEL:alg:CIVR for obtaining an -approximate solution is , where hides logarithmic factors.

## 4 Fast convergence rates under stronger conditions

In this section we consider two cases where fast linear convergence can be guaranteed for CIVR.

### 4.1 Gradient-dominant function

The first case is when $r\equiv 0$ and $F$ is $\nu$-gradient dominant, i.e., there is some $\nu>0$ such that

$$F(x)-\inf_yF(y)\le\frac{\nu}{2}\|F'(x)\|^2,\qquad\forall x\in\mathbb{R}^d. \tag{27}$$

Note that a $\mu$-strongly convex function is $(1/\mu)$-gradient dominant by this definition. Hence strong convexity is a special case of the gradient dominant condition, which in turn is a special case of the Polyak-Łojasiewicz condition with the Łojasiewicz exponent equal to 2 (see, e.g., Karimi et al., 2016).
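The claim that strong convexity implies gradient dominance follows from a one-line argument, sketched here for a differentiable $\mu$-strongly convex $F$. Strong convexity gives, for all $x,y\in\mathbb{R}^d$,

$$F(y)\ \ge\ F(x)+\langle F'(x),\,y-x\rangle+\frac{\mu}{2}\|y-x\|^2 .$$

Minimizing both sides over $y$ (the right-hand side is minimized at $y=x-F'(x)/\mu$) yields

$$\inf_yF(y)\ \ge\ F(x)-\frac{1}{2\mu}\|F'(x)\|^2\quad\Longleftrightarrow\quad F(x)-\inf_yF(y)\ \le\ \frac{1}{2\mu}\|F'(x)\|^2,$$

which is (27) with $\nu=1/\mu$.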

In order to solve (1) with a pre-fixed precision $\epsilon$, we use a periodic restart strategy depicted below.

###### Theorem 5.

Consider (1) with . Suppose Assumptions 1 and 2 hold and  is -gradient dominant. Given any , let , and . Then as long as ,

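$$\mathbb{E}\big[F(\bar x)-F^*\big]\le\frac{1}{2}\big(F(x_0^1)-F^*\big)+\frac{1}{2}\epsilon. \tag{28}$$

Therefore, if we periodically restart Algorithm 1 after every $T$ epochs (using the output of the previous period as input to the new period), then by (28) the expected gap $\mathbb{E}[F(\bar x)-F^*]$ shrinks by a factor of $1/2$ per period, up to the additive term $\epsilon/2$. As a result, the sample complexity for finding an $\epsilon$-solution is .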
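The effect of restarting can be read off from the recursion implied by (28): if $\Delta_k$ bounds the expected gap after $k$ periods, then $\Delta_{k+1}\le\Delta_k/2+\epsilon/2$, so $\Delta_k\le\epsilon+(\Delta_0-\epsilon)/2^k$. The sketch below iterates this bound and counts the periods needed to bring it down to $2\epsilon$. The target $2\epsilon$ and the function names are choices made for this illustration.

```python
import math


def restart_gap_bounds(gap0, eps, periods):
    """Iterate the bound gap_{k+1} <= gap_k / 2 + eps / 2 from (28)."""
    gaps = [gap0]
    for _ in range(periods):
        gaps.append(gaps[-1] / 2 + eps / 2)
    return gaps


def periods_needed(gap0, eps):
    """Smallest k with eps + (gap0 - eps) / 2**k <= 2 * eps."""
    if gap0 <= 2 * eps:
        return 0
    return math.ceil(math.log2((gap0 - eps) / eps))


print(restart_gap_bounds(9.0, 1.0, 3))
print(periods_needed(9.0, 1.0))
```

The number of periods grows only logarithmically in $1/\epsilon$, which is the source of the logarithmic factor under restarts.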

The restart strategy also applies to the finite-sum case.

###### Theorem 6.

Consider problem (2) with . Suppose Assumption 1 hold and is -gradient dominant. If we set and , then as long as ,

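$$\mathbb{E}\big[F(\bar x)-F^*\big]\le\frac{1}{2}\big(F(x_0^1)-F^*\big). \tag{29}$$

By periodically restarting Algorithm 1 after every $T$ epochs, $\mathbb{E}[F(\bar x)]$ converges linearly to $F^*$, with the gap halved in each period. As a result, the sample complexity for finding an $\epsilon$-solution is .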

### 4.2 Optimally strongly convex function

In this part, we assume a $\mu$-optimally strongly convex condition on the function $\Phi$, i.e., there exists a $\mu>0$ such that

$$\Phi(x)-\Phi(x^*)\ge\frac{\mu}{2}\|x-x^*\|^2,\qquad\forall x\in\mathbb{R}^d. \tag{30}$$

We have the following two results for solving problems (1) and (2) respectively.

###### Theorem 7.

Consider problem (1). Suppose Assumptions 1 and 2 hold and is -optimally strongly convex. We set , and . Then if we choose ,

$$\mathbb{E}\big[\Phi(\bar x)-\Phi^*\big]\le\frac{1}{2}\big(\Phi(x_0^1)-\Phi^*\big)+\frac{1}{2}\epsilon. \tag{31}$$

By periodically restart Algorithm LABEL:alg:CIVR after every  epochs, converges linearly to As a result, the sample complexity for finding an -solution is .

###### Theorem 8.

Consider the finite-sum problem (2). Suppose Assumption 1 hold and is -optimally strongly convex. We set and . If , then

$$\mathbb{E}\big[\Phi(\bar x)-\Phi^*\big]\le\frac{1}{2}\big(\Phi(x_0^1)-\Phi^*\big). \tag{32}$$

By periodically restart Algorithm LABEL:alg:CIVR after every epochs, converges linearly to with rate . Therefore, the sample complexity of finding an -solution is .

If we define a condition number , then since , we have and the above complexities become and .

## 5 Numerical Experiments

In this section, we present numerical experiments for a risk-averse portfolio optimization problem. Suppose there are $d$ assets that one can invest in during $n$ time periods labeled as $\{1,\ldots,n\}$. Let $R_{i,j}$ be the return or payoff per unit of asset $j$ at time $i$, and $R_i$ be the vector consisting of $R_{i,1},\ldots,R_{i,d}$. Let $x\in\mathbb{R}^d$ be the decision variable, where each component $x_j$ represents the amount of investment or percentage of the total investment allocated to asset $j$, for $j=1,\ldots,d$. The same allocations or percentages of allocations are repeated over the $n$ time periods. We would like to maximize the average return over the $n$ periods, but with a penalty on the variance of the returns across the $n$ periods (in other words, we would like different periods to have similar returns).

This problem can be formulated as a finite-sum version of problem (3), with a discrete random variable $\xi\in\{1,\ldots,n\}$ and $h_i(x)=\langle R_i,x\rangle$ for $i=1,\ldots,n$. The function $r$ can be chosen as the indicator function of an $\ell_1$ ball, or a soft $\ell_1$ regularization term. We choose the latter in our experiments to obtain a sparse asset allocation. Using the mappings defined in (4), it can be further transformed into the composite finite-sum problem (2), hence readily solved by the CIVR method. For comparison, we implement the C-SAGA algorithm (Zhang and Xiao, 2019) as a benchmark. As another benchmark, this problem can also be formulated as a two-layer composite finite-sum problem (10), which was done in (Huo et al., 2018) and (Lian et al., 2017). We solve the two-layer formulation by ASC-PG (Wang et al., 2017b) and VRSC-PG (Huo et al., 2018). Finally, we also implemented CIVR-adp, which is the adaptive sampling variant described in Theorem 4.
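To make the composite structure of the portfolio problem concrete, the sketch below evaluates the smooth part of the objective through an inner mapping $g(x)=\frac1n\sum_i[h_i(x),\,h_i^2(x)]$ and an outer function chosen here as $f(y,z)=-y-\lambda y^2+\lambda z$, so that $f(g(x))$ equals the negative mean return plus $\lambda$ times the variance of the per-period returns. The function name and the toy return matrix are assumptions of this example, and the regularizer $r$ is omitted.

```python
def mean_variance_objective(returns, x, lam):
    """Smooth part f(g(x)): -mean + lam * variance of h_i(x) = <R_i, x>."""
    n = len(returns)
    # Per-period returns h_i(x) = <R_i, x>.
    h = [sum(r * xj for r, xj in zip(row, x)) for row in returns]
    # Inner mapping: averages of h_i and h_i^2 (intermediate dimension 2).
    y = sum(h) / n
    z = sum(hi * hi for hi in h) / n
    # Outer function: -y + lam * (z - y^2), i.e. -mean + lam * variance.
    return -y - lam * y * y + lam * z


# Toy data for illustration only: two periods, two assets.
R = [[1.0, 0.0], [3.0, 0.0]]
print(mean_variance_objective(R, [1.0, 5.0], lam=0.5))
```

Only the two running averages $y$ and $z$ need to be estimated, which is why the composition adds little overhead compared with a non-composite objective.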

We test these algorithms on three real-world portfolio datasets, which contain 30, 38 and 49 industrial portfolios respectively, from the Kenneth R. French Data Library. For the three datasets, the daily data of the most recent 24452, 10000 and 24400 days are extracted respectively to conduct the experiments. We set the parameter $\lambda$ in (3) and use an $\ell_1$ regularization for $r$. The experiment results are shown in Figure 1. The curves are averaged over 20 runs and are plotted against the number of samples of the component functions (the horizontal axis).

Throughout the experiments, VRSC-PG and C-SAGA algorithms use the batch size while CIVR uses the batch size , all dictated by their complexity theory. CIVR-adp employs the adaptive batch size for . For Industrial-30 dataset, all of VRSC-PG, C-SAGA, CIVR and CIVR-adp use the same step size . They are chosen from the set by experiments. And works best for all four tested methods simultaneously. Similarly, is chosen for the Industrial-38 dataset and is chosen for the Industrial-49 dataset. For ASC-PG, we set its step size parameters and (see details in Wang et al., 2017b). They are hand-tuned to ensure ASC-PG converges fast among a range of tested parameters. Overall, CIVR and CIVR-adp outperform other methods.

## References

• Allen-Zhu  Zeyuan Allen-Zhu. Natasha: Faster non-convex stochastic optimization via strongly non-convex parameter. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70 of Proceedings of Machine Learning Research, pages 89–97, Sydney, Australia, 2017.
• Allen-Zhu  Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. In Advances in Neural Information Processing Systems 31, pages 2675–2686. Curran Associates, Inc., 2018.
• Allen-Zhu and Hazan  Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 699–707, 2016.
• Beck  Amir Beck. First-Order Methods in Optimization. MOS-SIAM Series on Optimization. SIAM, 2017.
• Chaudhari et al.  Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv preprint, arXiv:1611.01838, 2016.
• Dann et al.  Christoph Dann, Gerhard Neumann, and Jan Peters. Policy evaluation with temporal differences: a survey and comparison. Journal of Machine Learning Research, 15(1):809–883, 2014.
• Defazio et al.  Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27, pages 1646–1654, 2014.
• Fang et al.  Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems 31, pages 689–699. Curran Associates, Inc., 2018.
• Ghadimi and Lan  Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
• Gulcehre et al.  Caglar Gulcehre, Marcin Moczulski, Francesco Visin, and Yoshua Bengio. Mollifying networks. arXiv preprint, arXiv:1608.04980, 2016.
• Hazan et al.  Elad Hazan, Kfir Yehuda Levy, and Shai Shalev-Shwartz. On graduated optimization for stochastic non-convex problems. In International conference on machine learning, pages 1833–1841, 2016.
• Huo et al.  Zhouyuan Huo, Bin Gu, Ji Liu, and Heng Huang. Accelerated method for stochastic composition optimization with nonsmooth regularization. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 3287–3294, 2018.
• Johnson and Zhang  Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323, 2013.
• Karimi et al.  Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient method and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Machine Learning and Knowledge Discovery in Database - European Conference, Proceedings, pages 795–811, 2016.
• Lei et al.  Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems 30, pages 2348–2358. Curran Associates, Inc., 2017.
• Li and Li  Zhize Li and Jian Li. A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In Advances in Neural Information Processing Systems 31, pages 5564–5574. Curran Associates, Inc., 2018.
• Lian et al.  Xiangru Lian, Mengdi Wang, and Ji Liu. Finite-sum composition optimization via variance reduced gradient descent. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1159–1167, 2017.
• Mobahi and Fisher  Hossein Mobahi and John W Fisher. On the link between gaussian homotopy continuation and convex envelopes. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 43–56. Springer, 2015.
• Nesterov  Yurii Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.
• Nguyen et al.  Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70 of Proceedings of Machine Learning Research (PMLR), pages 2613–2621, Sydney, Australia, 2017.
• Nguyen et al.  Lam M. Nguyen, Marten van Dijk, Dzung T. Phan, Phuong Ha Nguyen, Tsui-Wei Weng, and Jayant R. Kalagnanam. Finite-sum smooth optimization with sarah. arXiv preprint, arXiv:1901.07648, 2019.
• Pham et al.  Nhan H. Pham, Lam M. Nguyen, Dzung T. Phan, and Quoc Tran-Dinh. ProxSARAH: An efficient algorithmic framework for stochastic composite nonconvex optimization. arXiv preprint, arXiv:1902.05679, 2019.
• Reddi et al. [2016a] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 314–323, New York, New York, USA, 2016a.
• Reddi et al. [2016b] Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Fast incremental method for smooth nonconvex optimization. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 1971–1977. IEEE, 2016b.
• Reddi et al. [2016c] Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alexander J Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems 29, pages 1145–1153, 2016c.
• Rockafellar  R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.
• Rockafellar  R. Tyrrell Rockafellar. Coherent approaches to risk in optimization under uncertainty. INFORMS TutORials in Operations Research, 2007.
• Ruszczyński  Andrzej Ruszczyński. Advances in risk-averse optimization. INFORMS TutORials in Operation Research, 2013.
• Sutton and Barto  Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
• Wang et al. [2017a] Mengdi Wang, Ethan X Fang, and Han Liu. Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1-2):419–449, 2017a.
• Wang et al. [2017b] Mengdi Wang, Ji Liu, and Ethan Fang. Accelerating stochastic composition optimization. Journal of Machine Learning Research, 18(105):1–23, 2017b.
• Wang et al.  Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization. arXiv preprint, arXiv:1810.10690, 2018.
• Xiao and Zhang  Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
• Zhang and Xiao  Junyu Zhang and Lin Xiao. A composite randomized incremental gradient method. In Proceedings of the 36th International Conference on Machine Learning (ICML), number 97 in Proceedings of Machine Learning Research (PMLR), Long Beach, California, 2019.
• Zhou et al.  Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems 31, pages 3921–3932. Curran Associates, Inc., 2018.

## Appendix A Convergence analysis for composite expectation case

In this section, we focus on convergence analysis of CIVR for solving the stochastic composite optimization problem (1), and prove Theorems 1 and 2.

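First, we show that under Assumption 1, the composite function $F(x)=f(g(x))$ is smooth and $F'$ has Lipschitz constant $L_F=\ell_g^2L_f+\ell_fL_g$.

$$\begin{aligned}
\|F'(x)-F'(y)\| &= \big\|g'(x)^Tf'(g(x))-g'(y)^Tf'(g(y))\big\| \\
&= \big\|g'(x)^Tf'(g(x))-g'(x)^Tf'(g(y))+g'(x)^Tf'(g(y))-g'(y)^Tf'(g(y))\big\| \\
&\le \big\|g'(x)^Tf'(g(x))-g'(x)^Tf'(g(y))\big\|+\big\|g'(x)^Tf'(g(y))-g'(y)^Tf'(g(y))\big\| \\
&\le \|g'(x)\|\,\big\|f'(g(x))-f'(g(y))\big\|+\big\|f'(g(y))\big\|\,\|g'(x)-g'(y)\| \\
&\le \|g'(x)\|\cdot L_f\|g(x)-g(y)\|+\big\|f'(g(y))\big\|\cdot L_g\|x-y\| \\
&\le \ell_gL_f\ell_g\|x-y\|+\ell_fL_g\|x-y\| \\
&= \big(\ell_g^2L_f+\ell_fL_g\big)\|x-y\|,
\end{aligned}$$

where we used $\|g'(x)\|\le\ell_g$ and $\|f'(g(y))\|\le\ell_f$, which are implied by the Lipschitz conditions on $g$ and $f$ respectively.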

Although the incremental estimators used in CIVR are biased, as shown in (19), we can still bound their squared distances from the targets. This is given in the following lemma.

###### Lemma 1.

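Suppose Assumption 1 holds. Let $y_i^t$ and $z_i^t$ be constructed according to (11) and (12) in Algorithm 1. For any $t\ge 1$ and $1\le i\le\tau_t-1$, we have the following mean squared error (MSE) bounds

$$\begin{aligned}
\mathbb{E}\big[\|y_i^t-g(x_i^t)\|^2\big] &\le \mathbb{E}\big[\|y_0^t-g(x_0^t)\|^2\big]+\sum_{r=1}^{i}\frac{\ell_g^2}{S_t}\,\mathbb{E}\big[\|x_r^t-x_{r-1}^t\|^2\big],\\
\mathbb{E}\big[\|z_i^t-g'(x_i^t)\|^2\big] &\le \mathbb{E}\big[\|z_0^t-g'(x_0^t)\|^2\big]+\sum_{r=1}^{i}\frac{L_g^2}{S_t}\,\mathbb{E}\big[\|x_r^t-x_{r-1}^t\|^2\big].
\end{aligned} \tag{33}$$

###### Proof.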

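We first state a fact that allows us to decompose the MSE into a squared bias term and a variance term, that is, for an arbitrary random vector $\zeta$ and a constant vector $u$, we have

$$\mathbb{E}\big[\|\zeta-u\|^2\big]=\big\|\mathbb{E}[\zeta]-u\big\|^2+\mathrm{Var}(\zeta), \tag{34}$$

where $\mathrm{Var}(\zeta):=\mathbb{E}\big[\|\zeta-\mathbb{E}[\zeta]\|^2\big]$. As a result,

$$\mathbb{E}\big[\|y_i^t-g(x_i^t)\|^2\,\big|\,x_i^t\big]=\big\|\mathbb{E}[y_i^t\mid x_i^t]-g(x_i^t)\big\|^2+\mathrm{Var}(y_i^t\mid x_i^t).$$

For the bias term, we have $\mathbb{E}[y_i^t\mid x_i^t]-g(x_i^t)=y_{i-1}^t-g(x_{i-1}^t)$ by (19). For the variance term, we have

$$\begin{aligned}
\mathrm{Var}(y_i^t\mid x_i^t) &= \mathrm{Var}\Big(y_{i-1}^t+\frac{1}{S_t}\sum_{\xi\in\mathcal{S}_i^t}\big(g_\xi(x_i^t)-g_\xi(x_{i-1}^t)\big)\,\Big|\,x_i^t\Big) \\
&= \frac{1}{S_t}\mathrm{Var}\big(g_\xi(x_i^t)-g_\xi(x_{i-1}^t)\,\big|\,x_i^t\big) \\
&\le \frac{1}{S_t}\mathbb{E}\big[\|g_\xi(x_i^t)-g_\xi(x_{i-1}^t)\|^2\,\big|\,x_i^t\big] \\
&\le \frac{\ell_g^2}{S_t}\|x_i^t-x_{i-1}^t\|^2,
\end{aligned}$$

where the second equality is due to the fact that $y_{i-1}^t$ is a constant conditioning on $x_i^t$, and in the last inequality we used the $\ell_g$-Lipschitz continuity of $g_\xi$. Consequently,

$$\mathbb{E}\big[\|y_i^t-g(x_i^t)\|^2\,\big|\,x_i^t\big]\le\big\|y_{i-1}^t-g(x_{i-1}^t)\big\|^2+\frac{\ell_g^2}{S_t}\|x_i^t-x_{i-1}^t\|^2.$$

Recursively applying the above procedure yields

$$\mathbb{E}\big[\|y_i^t-g(x_i^t)\|^2\big]\le\mathbb{E}\big[\|y_0^t-g(x_0^t)\|^2\big]+\sum_{r=1}^{i}\frac{\ell_g^2}{S_t}\,\mathbb{E}\big[\|x_r^t-x_{r-1}^t\|^2\big]. \tag{35}$$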
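The one-step version of this bound can be checked by exact enumeration on a small finite-sum instance. The sketch below uses a hypothetical scalar example that is not from the paper: $g_i(x)=a_ix$ with $\xi$ uniform over the indices, an exact starting estimate $y_0=g(x_0)$, and mini-batch size $S=1$. It compares the exact MSE of $y_1=y_0+g_\xi(x_1)-g_\xi(x_0)$ with the bound $\ell_g^2\|x_1-x_0\|^2/S$, where $\ell_g=\max_i|a_i|$.

```python
from fractions import Fraction


def one_step_mse(a, x_prev, x_new):
    """Exact E[(y_1 - g(x_new))^2] for g_i(x) = a_i * x, S = 1, y_0 exact."""
    n = len(a)
    mean_a = sum(a) / n
    y0 = mean_a * x_prev          # exact estimate at the previous point
    g_new = mean_a * x_new        # target value g(x_new)
    return sum((y0 + ai * (x_new - x_prev) - g_new) ** 2 for ai in a) / n


def one_step_bound(a, x_prev, x_new, S=1):
    """Bound ell_g^2 * |x_new - x_prev|^2 / S with ell_g = max |a_i|."""
    ell_g = max(abs(ai) for ai in a)
    return ell_g ** 2 * (x_new - x_prev) ** 2 / S


a = [Fraction(1), Fraction(3)]
print(one_step_mse(a, Fraction(0), Fraction(2)),
      one_step_bound(a, Fraction(0), Fraction(2)))
```

Here the exact MSE equals the variance of $a_\xi$ times $(x_1-x_0)^2$, which is well below the Lipschitz-based bound.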

Similarly, the bound on $\mathbb{E}\big[\|z_i^t-g'(x_i^t)\|^2\big]$ can be shown by using the $L_g$-Lipschitz continuity of