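We consider stochastic composite optimization problems of the form
$$\min_{x\in\mathbb{R}^d}\; f\big(\mathbb{E}_\xi[g_\xi(x)]\big) + r(x), \qquad (1)$$
where $f:\mathbb{R}^p\to\mathbb{R}$ is a smooth and possibly nonconvex function, $\xi$ is a random variable, each $g_\xi:\mathbb{R}^d\to\mathbb{R}^p$ is a smooth vector mapping, and $r$ is convex and lower-semicontinuous. A special case we will consider separately is when $\xi$ is uniformly distributed over a finite index set $\{1,\dots,n\}$. In this case the problem is equivalent to a deterministic optimization problem
$$\min_{x\in\mathbb{R}^d}\; f\Big(\frac{1}{n}\sum_{i=1}^n g_i(x)\Big) + r(x). \qquad (2)$$
Classical stochastic optimization and empirical risk minimization are recovered when each $g_\xi$ is scalar-valued and $f$ is the scalar identity map. A well-known example of the more general composite structure is policy evaluation in reinforcement learning (RL) (e.g., Sutton and Barto, 1998). With linear value function approximation, it can be formulated as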
where and (Dann et al., 2014). Here we have , and .
Another interesting application is risk-averse optimization (e.g., Rockafellar, 2007; Ruszczyński, 2013), which has many applications in RL and financial mathematics. We consider a general formulation of mean-variance trade-off:
where each is a reward function (such as total portfolio return). The goal of problem (3) is to maximize the average reward with a penalty on the variance which captures the potential risk. It can be cast in the form of (1) by using the mappings
Here, the intermediate dimension is very low, i.e., . This leads to very little overhead in computation compared with stochastic optimization without composition.
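As a rough numerical illustration of this reformulation, the sketch below assumes a common choice of mappings: each sample is mapped to the pair (reward, squared reward), and an outer function `f(y, z) = -y + lam * (z - y**2)` is applied to the averaged pair, so the intermediate dimension is 2. The minimization form, the names `g`, `f`, `lam`, and the linear reward used as test data are illustrative assumptions, not the paper's exact notation.

```python
import numpy as np

def g(rewards):
    """Inner mapping applied per sample: (h, h^2). Output dimension is 2."""
    return np.stack([rewards, rewards ** 2], axis=1)

def f(u, lam):
    """Outer function on the averaged inner values u = (mean h, mean h^2)."""
    y, z = u
    return -y + lam * (z - y ** 2)

def mean_variance_objective(rewards, lam):
    """Direct evaluation: negative mean reward plus lam times the variance."""
    return -rewards.mean() + lam * rewards.var()

rng = np.random.default_rng(0)
R = rng.normal(size=(500, 5))     # hypothetical per-sample return vectors
x = np.full(5, 0.2)               # hypothetical allocation
h = R @ x                         # assumed linear reward h_i(x) = <R_i, x>
lam = 0.3

composite_value = f(g(h).mean(axis=0), lam)
direct_value = mean_variance_objective(h, lam)
```

The two values agree because the variance equals the mean of squares minus the squared mean, which is exactly what the outer function computes from the two averaged components.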
In addition, let and denote the gradients of and respectively, and denote the Jacobian matrix of at . Then we have
In practice, computing exactly can be very costly if not impossible. A common strategy is to use stochastic approximation: we randomly sample a subset of from its distribution and let
However, is always a biased estimate of unless one can replace with the full expectation . This is in great contrast to the classical stochastic optimization problem
where in (6) is always an unbiased gradient estimator for the smooth part . Using biased gradient estimators can cause various difficulties in constructing and analyzing randomized algorithms, but is often inevitable when dealing with objective functions more complex than the empirical risk (see, e.g., Chaudhari et al., 2016; Hazan et al., 2016; Gulcehre et al., 2016; Mobahi and Fisher, 2015). As the simplest such model, the analysis of randomized algorithms for (1) may provide insights for solving more challenging problems.
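The bias can be seen on a tiny example. The sketch below assumes scalar inner functions `g_i(x) = a_i * x` with hypothetical coefficients `a`, and compares the exact expectation of a single-sample plug-in gradient estimate with the true gradient, once for a genuinely nonlinear outer function `f(y) = y**2` and once for the identity. All names and data here are illustrative.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 6.0])   # hypothetical data: g_i(x) = a_i * x
x = 1.5

def plug_in_gradient(batch, f_prime):
    """Estimate the composite gradient g'(x) * f'(g(x)) from batch averages."""
    g_hat = np.mean(a[batch] * x)     # estimate of g(x)
    jac_hat = np.mean(a[batch])       # estimate of g'(x)
    return jac_hat * f_prime(g_hat)

def expected_estimate(f_prime):
    """Exact expectation over a uniformly drawn batch of size one."""
    return np.mean([plug_in_gradient([i], f_prime) for i in range(len(a))])

def exact_gradient(f_prime):
    """Gradient obtained when the batch is the full index set."""
    return plug_in_gradient(np.arange(len(a)), f_prime)

square_prime = lambda y: 2.0 * y      # f(y) = y^2, a genuine composition
identity_prime = lambda y: 1.0        # f(y) = y, no composition

bias_composite = expected_estimate(square_prime) - exact_gradient(square_prime)
bias_identity = expected_estimate(identity_prime) - exact_gradient(identity_prime)
```

In this example the composite bias equals `2 * x * Var(a)`, which vanishes only if all `a_i` coincide, while the identity case has zero bias.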
In this paper, we develop an efficient stochastic composite gradient method called CIVR (Composite Incremental Variance Reduction) for solving problems of the forms (1) and (2). We measure efficiency by the sample complexity of the individual functions and their Jacobian , i.e., the total number of times they need to be evaluated at some point, in order to find an -approximate solution. For nonconvex functions, an -approximate solution is some random output of the algorithm that satisfies , where is the proximal gradient mapping of the objective function at (see details in Section 2). If , then and the criterion for -approximation becomes . If the objective is convex, we require where . For smooth and convex functions, these two notions are compatible, meaning that the dependence of the sample complexity on is of the same order under both notions.
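[Table 1: Sample complexities of CIVR for problems (1) and (2) under three sets of assumptions: nonconvex; gradient dominant; convex and optimally strongly convex. Common assumptions: the outer function and the inner mappings are Lipschitz and smooth, so the composite function is smooth.]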
Table 1 summarizes the sample complexities of the CIVR method under different assumptions obtained in this paper. If we define a condition number for -gradient dominant functions and for -optimally strongly convex functions, then the complexities become and for (1) and (2) respectively. In order to better position our contributions, we next discuss related work and then put these results into context.
1.1 Related Work
We first discuss the nonconvex stochastic optimization problem (7), which is a special case of (1). When and is smooth, Ghadimi and Lan (2013) developed a randomized stochastic gradient method with iteration complexity . Allen-Zhu (2018) obtained with an additional second-order guarantee. There are also many recent works on solving its finite-sum version
which is a special case of (2). By extending the variance reduction techniques SVRG (Johnson and Zhang, 2013; Xiao and Zhang, 2014) and SAGA (Defazio et al., 2014) to nonconvex optimization, Allen-Zhu and Hazan (2016) and Reddi et al. (2016a, b, c) developed randomized algorithms with sample complexity . Under additional assumptions of gradient dominance or strong convexity, they obtained sample complexity , where is a suitable condition number. Allen-Zhu (2017) and Lei et al. (2017) obtained .
Based on a new variance reduction technique called SARAH (Nguyen et al., 2017), Nguyen et al. (2019) and Pham et al. (2019) developed nonconvex extensions to obtain sample complexities and for solving the expectation and finite-sum cases respectively. Fang et al. (2018) introduced another variance reduction technique called Spider, which can be viewed as a more general variant of SARAH. They obtained sample complexities and for the two cases respectively, but require small step sizes that are proportional to . Wang et al. (2018) extended Spider to obtain the same complexities with constant step sizes and under the gradient-dominant condition. In addition, Zhou et al. (2018) obtained similar results using a nested SVRG approach.
where is parametrized by another random variable , which is independent of . When , Wang et al. (2017a) derived algorithms to find an -approximate solution with sample complexities , and for the smooth nonconvex case, smooth convex case and smooth strongly convex case respectively. For nontrivial convex , Wang et al. (2017b) obtained improved sample complexity of , and for the three cases mentioned above respectively.
As a special case of (9), the following finite-sum problem also received significant attention:
When and the overall objective function is strongly convex, Lian et al. (2017) derived two algorithms based on the SVRG scheme to attain sample complexities and respectively, where is some suitably defined condition number. Huo et al. (2018) also used the SVRG scheme to obtain an complexity for the smooth nonconvex case and for strongly convex problems with nonsmooth . More recently, Zhang and Xiao (2019) proposed a composite randomized incremental gradient method based on the SAGA estimator (Defazio et al., 2014), which matches the best known complexity when is smooth and nonconvex, and obtained an improved complexity under either the gradient-dominant or the strongly convex assumption. When applied to the special cases (1) and (2) we focus on in this paper (), these results are strictly worse than ours in Table 1.
1.2 Contributions and Outline
We develop the CIVR method by extending the variance reduction technique of SARAH (Nguyen et al., 2017, 2019; Pham et al., 2019) and Spider (Fang et al., 2018; Wang et al., 2018) to solve the composite optimization problems (1) and (2). The complexities of CIVR in Table 1 match the best results for solving the non-composite problems (7) and (8), despite the additional outer composition and the composite-gradient estimator always being biased. In addition:
Our results indicate that the additional smooth composition in (1) and (2) does not incur higher complexity compared with (7) and (8), despite the difficulty of dealing with biased estimators. We believe these results can also be extended to the two-layer problems (9) and (10), by replacing with in Table 1. But the extensions require quite different techniques and we will address them in a separate paper.
The rest of this paper is organized as follows. In Section 2, we introduce the CIVR method. In Section 3, we present convergence results of CIVR for solving the composite optimization problems (1) and (2) and the required parameter settings. Better complexities of CIVR under the gradient-dominant and optimally strongly convex conditions are given in Section 4. In Section 5, we present numerical experiments for solving a risk-averse portfolio optimization problem (3) on real-world datasets.
2 The composite incremental variance reduction (CIVR) method
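With the composite function written as $F(x) = f(g(x))$, where $g(x) = \mathbb{E}_\xi[g_\xi(x)]$, problems (1) and (2) take the form
$$\min_{x\in\mathbb{R}^d}\; \Phi(x) := F(x) + r(x), \qquad (14)$$
where $F$ is smooth and $r$ is convex. The proximal operator of $r$ with parameter $\eta > 0$ is defined as
$$\mathrm{prox}_r^\eta(x) := \arg\min_{y\in\mathbb{R}^d}\Big\{ r(y) + \frac{1}{2\eta}\|y - x\|^2 \Big\}. \qquad (15)$$
We assume that $r$ is relatively simple, meaning that its proximal operator has a closed-form solution or can be computed efficiently. The proximal gradient method (e.g., Nesterov, 2013; Beck, 2017) for solving problem (14) is
$$x^{t+1} = \mathrm{prox}_r^\eta\big(x^t - \eta \nabla F(x^t)\big), \qquad (16)$$
where $\eta$ is the step size. The proximal gradient mapping of $\Phi$ is defined as
$$\mathcal{G}_\eta(x) := \frac{1}{\eta}\Big(x - \mathrm{prox}_r^\eta\big(x - \eta\nabla F(x)\big)\Big). \qquad (17)$$
As a result, the proximal gradient method (16) can be written as $x^{t+1} = x^t - \eta\,\mathcal{G}_\eta(x^t)$. Notice that when $r \equiv 0$, $\mathrm{prox}_r^\eta$ becomes the identity mapping and we have $\mathcal{G}_\eta(x) = \nabla F(x)$ for any $\eta > 0$.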
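To make these definitions concrete, here is a minimal sketch for one specific simple regularizer, a weighted l1 norm, whose proximal operator is soft-thresholding. The function names, the `weight` parameter, and the sample vectors are assumptions made for illustration; `grad` stands in for the gradient of the smooth part at `x`.

```python
import numpy as np

def prox_l1(v, eta, weight):
    """Proximal operator of weight * ||x||_1 with parameter eta (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - eta * weight, 0.0)

def gradient_mapping(x, grad, eta, weight):
    """Proximal gradient mapping: (x - prox(x - eta * grad)) / eta."""
    return (x - prox_l1(x - eta * grad, eta, weight)) / eta

def prox_gradient_step(x, grad, eta, weight):
    """One proximal gradient step; equals x - eta * gradient_mapping(...)."""
    return prox_l1(x - eta * grad, eta, weight)

x = np.array([1.0, -0.05, 0.3])
grad = np.array([0.5, 0.2, -1.0])     # stand-in for the gradient of the smooth part
eta = 0.1

step = prox_gradient_step(x, grad, eta, weight=0.5)
mapping = gradient_mapping(x, grad, eta, weight=0.5)
mapping_no_reg = gradient_mapping(x, grad, eta, weight=0.0)
```

With `weight=0.0` the proximal operator is the identity, so `mapping_no_reg` coincides with `grad`, matching the remark above.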
Suppose is generated by a randomized algorithm. We call an -stationary point in expectation if
(We assume that is a constant that does not depend on .) As we mentioned in the introduction, we measure the efficiency of an algorithm by its sample complexity of and their Jacobian , i.e., the total number of times they need to be evaluated, in order to find a point that satisfies (18). Our goal is to develop a randomized algorithm that has low sample complexity.
We present in Algorithm 1 the Composite Incremental Variance Reduction (CIVR) method. This method employs a two time-scale variance-reduced estimator for both the inner function value of and its Jacobian . At the beginning of each outer iteration (each called an epoch), we construct a relatively accurate estimate for and for respectively, using a relatively large sample size . During each inner iteration of the th epoch, we construct an estimate for and for respectively, using a smaller sample size and incremental corrections from the previous iterations. Note that the epoch length and the sample sizes and are all adjustable for each epoch . Therefore, besides using a constant set of parameters, we can also adjust them gradually in order to obtain better theoretical properties and practical performance.
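The two time-scale scheme described above can be sketched as follows for a finite-sum problem. This is a rough illustration under stated assumptions, not the paper's Algorithm: batches are drawn uniformly with replacement, all parameters are held constant across epochs, and the callables `g`, `jac`, `f_grad`, `prox` as well as the test problem at the bottom are hypothetical.

```python
import numpy as np

def civr(x0, g, jac, f_grad, prox, n, eta, epochs, tau, big_batch, small_batch, seed=0):
    """CIVR-style loop for f(mean_i g_i(x)) + r(x).

    g(i, x) -> R^p, jac(i, x) -> R^{p x d}, f_grad(y) -> R^p, prox(v, eta) -> R^d.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    x_prev = x
    iterates = []
    for _ in range(epochs):
        # Epoch start: fresh estimates of g(x) and its Jacobian from a large batch.
        batch = rng.integers(n, size=big_batch)
        y = np.mean([g(j, x) for j in batch], axis=0)
        z = np.mean([jac(j, x) for j in batch], axis=0)
        for i in range(tau):
            if i > 0:
                # Inner step: small batch, incremental correction of y and z.
                batch = rng.integers(n, size=small_batch)
                y = y + np.mean([g(j, x) - g(j, x_prev) for j in batch], axis=0)
                z = z + np.mean([jac(j, x) - jac(j, x_prev) for j in batch], axis=0)
            iterates.append(x)
            # Proximal step with the (biased) composite gradient estimate z^T f'(y).
            x_prev, x = x, prox(x - eta * (z.T @ f_grad(y)), eta)
    # Return a uniformly random iterate together with the last one.
    return iterates[rng.integers(len(iterates))], x

# Hypothetical test problem: g_i(x) = A_i x - b_i, f(y) = 0.5 * ||y||^2, r = 0.
rng = np.random.default_rng(1)
n, d = 50, 3
A = np.eye(d) + 0.1 * rng.normal(size=(n, d, d))
b = 1.0 + 0.1 * rng.normal(size=(n, d))

def F(x):
    return 0.5 * np.sum((A.mean(axis=0) @ x - b.mean(axis=0)) ** 2)

x_rand, x_last = civr(
    x0=np.zeros(d),
    g=lambda i, x: A[i] @ x - b[i],
    jac=lambda i, x: A[i],
    f_grad=lambda y: y,
    prox=lambda v, eta: v,            # r = 0, so the prox is the identity
    n=n, eta=0.5, epochs=20, tau=5, big_batch=25, small_batch=5,
)
```

The inner loop reuses the previous estimates and only pays for `small_batch` evaluations per step, which is where the savings over re-estimating from scratch come from.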
This variance-reduction technique was first proposed as part of SARAH (Nguyen et al., 2017), where it is called recursive variance reduction. It was also proposed in (Fang et al., 2018) in the form of a Stochastic Path-Integrated Differential EstimatoR (Spider). Here we simply call it incremental variance reduction. A distinct feature of this incremental estimator is that the inner-loop estimates and are biased, i.e.,
, whose gradient estimators are always unbiased. Note that unbiased estimators for and are not essential here, because the composite estimator is always biased.
3 Convergence Analysis
In this section, we present theoretical results on the convergence properties of CIVR (Algorithm 1) when the composite function is smooth. More specifically, we make the following assumptions.
is a smooth and -Lipschitz function and its gradient is -Lipschitz.
Each is a smooth and -Lipschitz vector mapping and its Jacobian is -Lipschitz. Consequently, in (5) is -Lipschitz and its Jacobian is -Lipschitz.
is a convex and lower-semicontinuous function.
The overall objective function is bounded below, i.e., .
For problem (1), we further assume that there exist constants and such that
As a result of Assumption 1, is smooth and is -Lipschitz continuous with
(see proof in the supplementary materials). For convenience, we also define two constants
It is important to notice that , since we will use step size .
3.1 The composite expectation case
Note that in the above scheme, the epoch lengths and all the batch sizes and are set to be constant (depending on a pre-fixed ) without regard to . Intuitively, we do not need as many samples in the early stage of the algorithm as in the later stage. In addition, it will be useful in practice to have a variant of the algorithm that can adaptively choose , and throughout the epochs without dependence on a pre-fixed precision. This is done in the following theorem.
3.2 The composite finite-sum case
In this section, we consider the composite finite-sum optimization problem (2). In this case, the random variable has a uniform distribution over the finite index set . At the beginning of each epoch in Algorithm 1, we use the full sample size to compute and . Therefore for all and Equation (LABEL:defn:sarah-1) in Algorithm 1 becomes
Also in this case, we no longer need Assumption 2.
Suppose Assumption 1 holds. Let the parameters in Algorithm 1 be set as and for all . Then as long as , we have for any ,
As a result, obtaining an -approximate solution requires epochs and a total sample complexity of .
Similar to the previous section, we can also choose the epoch lengths and sample sizes adaptively to save the sampling cost in the early stage of the algorithm. However, due to the finite-sum structure of the problem, when the batch size reaches , we will start to take the full batch at the beginning of each epoch to get the exact and . This leads to the following theorem.
Suppose Assumption 1 holds. For some positive constants and , denote . When we set the parameters to be ; when , we set and . Then as long as ,
As a result, the total sample complexity of Algorithm 1 for obtaining an -approximate solution is , where hides logarithmic factors.
4 Fast convergence rates under stronger conditions
In this section we consider two cases where fast linear convergence can be guaranteed for CIVR.
4.1 Gradient-dominant function
The first case is when and is -gradient dominant, i.e., there is some such that
Note that a -strongly convex function is -gradient dominant by this definition. Hence strong convexity is a special case of the gradient dominant condition, which in turn is a special case of the Polyak-Łojasiewicz condition with the Łojasiewicz exponent equal to 2 (see, e.g., Karimi et al., 2016).
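As a quick sanity check of the claim that strong convexity implies gradient dominance, the sketch below tests a strongly convex quadratic numerically. It assumes the convention that gradient dominance with constant `nu` means the optimality gap is at most `(nu / 2)` times the squared gradient norm, so that a `mu`-strongly convex function has `nu = 1 / mu`; the matrix `Q` and the sample points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
Q = M @ M.T + 0.5 * np.eye(4)          # symmetric positive definite
mu = np.linalg.eigvalsh(Q).min()       # strong convexity modulus of F
nu = 1.0 / mu                          # assumed gradient-dominance constant

def F(x):
    return 0.5 * x @ Q @ x             # minimized at x = 0 with value 0

def grad_F(x):
    return Q @ x

points = rng.normal(size=(1000, 4))
gaps = np.array([F(x) for x in points])
bounds = np.array([0.5 * nu * np.sum(grad_F(x) ** 2) for x in points])
holds_everywhere = bool(np.all(gaps <= bounds + 1e-12))
```

The inequality holds at every sampled point because the squared gradient norm of this quadratic is at least `mu` times twice its function value.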
In order to solve (1) with a pre-fixed precision , we use a periodic restart strategy depicted below.
Therefore, if we periodically restart Algorithm 1 after every epochs (using the output of the previous period as input to the new period), then converges linearly to with a factor of per period. As a result, the sample complexity for finding an -solution is .
The restart strategy also applies to the finite-sum case.
4.2 Optimally strongly convex function
In this part, we assume a -optimally strongly convex condition on the function , i.e., there exists a such that
If we define a condition number , then since , we have and the above complexities become and .
5 Numerical Experiments
In this section, we present numerical experiments for a risk-averse portfolio optimization problem. Suppose there are assets that one can invest in during time periods labeled as . Let be the return or payoff per unit of asset at time , and be the vector consisting of . Let be the decision variable, where each component represents the amount of investment or percentage of the total investment allocated to asset , for . The same allocations or percentages of allocations are repeated over the time periods. We would like to maximize the average return over the periods, but with a penalty on the variance of the returns across the periods (in other words, we would like different periods to have similar returns).
This problem can be formulated as a finite-sum version of problem (3), with a discrete random variable and for . The function can be chosen as the indicator function of an $\ell_1$ ball, or a soft $\ell_1$ regularization term. We choose the latter in our experiments to obtain a sparse asset allocation. Using the mappings defined in (4), it can be further transformed into the composite finite-sum problem (2), hence readily solved by the CIVR method. For comparison, we implement the C-SAGA algorithm (Zhang and Xiao, 2019) as a benchmark. As another benchmark, this problem can also be formulated as a two-layer composite finite-sum problem (10), which was done in (Huo et al., 2018) and (Lian et al., 2017). We solve the two-layer formulation by ASC-PG (Wang et al., 2017b) and VRSC-PG (Huo et al., 2018). Finally, we also implement CIVR-adp, which is the adaptive sampling variant described in Theorem 4.
We test these algorithms on three real-world portfolio datasets, which contain 30, 38 and 49 industrial portfolios respectively, from the Kenneth R. French Data Library (http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html). For the three datasets, the daily data of the most recent 24452, 10000 and 24400 days are extracted respectively to conduct the experiments. We set the parameter in (3) and use an $\ell_1$ regularization . The experiment results are shown in Figure 1. The curves are averaged over 20 runs and are plotted against the number of samples of the component functions (the horizontal axis).
Throughout the experiments, the VRSC-PG and C-SAGA algorithms use the batch size while CIVR uses the batch size , all dictated by their complexity theory. CIVR-adp employs the adaptive batch size for . For the Industrial-30 dataset, VRSC-PG, C-SAGA, CIVR and CIVR-adp all use the same step size , which is selected from the set by experiments and works best for all four methods simultaneously. Similarly, is chosen for the Industrial-38 dataset and is chosen for the Industrial-49 dataset. For ASC-PG, we set its step size parameters and (see details in Wang et al., 2017b). They are hand-tuned so that ASC-PG converges fastest among a range of tested parameters. Overall, CIVR and CIVR-adp outperform the other methods.
- Allen-Zhu [2017] Zeyuan Allen-Zhu. Natasha: Faster non-convex stochastic optimization via strongly non-convex parameter. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70 of Proceedings of Machine Learning Research, pages 89–97, Sydney, Australia, 2017.
- Allen-Zhu [2018] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. In Advances in Neural Information Processing Systems 31, pages 2675–2686. Curran Associates, Inc., 2018.
- Allen-Zhu and Hazan [2016] Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 699–707, 2016.
- Beck [2017] Amir Beck. First-Order Methods in Optimization. MOS-SIAM Series on Optimization. SIAM, 2017.
- Chaudhari et al. [2016] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv preprint, arXiv:1611.01838, 2016.
- Dann et al. [2014] Christoph Dann, Gerhard Neumann, and Jan Peters. Policy evaluation with temporal differences: a survey and comparison. Journal of Machine Learning Research, 15(1):809–883, 2014.
- Defazio et al. [2014] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27, pages 1646–1654, 2014.
- Fang et al. [2018] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems 31, pages 689–699. Curran Associates, Inc., 2018.
- Ghadimi and Lan [2013] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- Gulcehre et al. [2016] Caglar Gulcehre, Marcin Moczulski, Francesco Visin, and Yoshua Bengio. Mollifying networks. arXiv preprint, arXiv:1608.04980, 2016.
- Hazan et al. [2016] Elad Hazan, Kfir Yehuda Levy, and Shai Shalev-Shwartz. On graduated optimization for stochastic non-convex problems. In International Conference on Machine Learning, pages 1833–1841, 2016.
- Huo et al. [2018] Zhouyuan Huo, Bin Gu, Ji Jiu, and Heng Huang. Accelerated method for stochastic composition optimization with. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 3287–3294, 2018.
- Johnson and Zhang [2013] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323, 2013.
- Karimi et al. [2016] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient method and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Machine Learning and Knowledge Discovery in Database - European Conference, Proceedings, pages 795–811, 2016.
- Lei et al. [2017] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems 30, pages 2348–2358. Curran Associates, Inc., 2017.
- Li and Li [2018] Zhize Li and Jian Li. A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In Advances in Neural Information Processing Systems 31, pages 5564–5574. Curran Associates, Inc., 2018.
- Lian et al. [2017] Xiangru Lian, Mengdi Wang, and Ji Liu. Finite-sum composition optimization via variance reduced gradient descent. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1159–1167, 2017.
- Mobahi and Fisher [2015] Hossein Mobahi and John W Fisher. On the link between gaussian homotopy continuation and convex envelopes. In
- Nesterov [2013] Yurii Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.
- Nguyen et al. [2017] Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70 of Proceedings of Machine Learning Research (PMLR), pages 2613–2621, Sydney, Australia, 2017.
- Nguyen et al. [2019] Lam M. Nguyen, Marten van Dijk, Dzung T. Phan, Phuong Ha Nguyen, Tsui-Wei Weng, and Jayant R. Kalagnanam. Finite-sum smooth optimization with sarah. arXiv preprint, arXiv:1901.07648, 2019.
- Pham et al. [2019] Nhan H. Pham, Lam M. Nguyen, Dzung T. Phan, and Quoc Tran-Dinh. ProxSARAH: An efficient algorithmic framework for stochastic composite nonconvex optimization. arXiv preprint, arXiv:1902.05679, 2019.
- Reddi et al. [2016a] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 314–323, New York, New York, USA, 2016a.
- Reddi et al. [2016b] Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Fast incremental method for smooth nonconvex optimization. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 1971–1977. IEEE, 2016b.
- Reddi et al. [2016c] Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alexander J Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems 29, pages 1145–1153, 2016c.
- Rockafellar [1970] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.
- Rockafellar [2007] R. Tyrrell Rockafellar. Coherent approaches to risk in optimization under uncertainty. INFORMS TutORials in Operations Research, 2007.
- Ruszczyński [2013] Andrzej Ruszczyński. Advances in risk-averse optimization. INFORMS TutORials in Operations Research, 2013.
- Sutton and Barto [1998] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
- Wang et al. [2017a] Mengdi Wang, Ethan X Fang, and Han Liu. Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1-2):419–449, 2017a.
- Wang et al. [2017b] Mengdi Wang, Ji Liu, and Ethan Fang. Accelerating stochastic composition optimization. Journal of Machine Learning Research, 18(105):1–23, 2017b.
- Wang et al. [2018] Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization. arXiv preprint, arXiv:1810.10690, 2018.
- Xiao and Zhang [2014] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
- Zhang and Xiao [2019] Junyu Zhang and Lin Xiao. A composite randomized incremental gradient method. In Proceedings of the 36th International Conference on Machine Learning (ICML), number 97 in Proceedings of Machine Learning Research (PMLR), Long Beach, California, 2019.
- Zhou et al. [2018] Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems 31, pages 3921–3932. Curran Associates, Inc., 2018.
Appendix A Convergence analysis for composite expectation case
First, we show that under Assumption 1, the composite function is smooth and has Lipschitz constant .
where we used and , which are implied by the Lipschitz conditions on and respectively.
Although the incremental estimators used in CIVR are biased, as shown in (19), we can still bound their squared distances from the targets. This is given in the following lemma.
Suppose Assumption 1 holds. Let and be constructed according to (LABEL:defn:sarah-1) and (LABEL:defn:sarah-2) in Algorithm 1. For any and , we have the following mean squared error (MSE) bounds
We first state a fact that allows us to decompose the MSE into a squared bias term and a variance term, that is, for an arbitrary random vector and a constant vector , we have
where . As a result,
For the bias term, we have . For the variance term, we have
where the second equality is due to the fact that is a constant conditioning on and in the last inequality we used the -Lipschitz continuity of . Consequently,
Recursively applying the above procedure yields
Similarly, the bound on can be shown by using the -Lipschitz continuity of