Recently, Wang et al. (2016a) proposed the following stochastic composition optimization problem:
Here $f(g(x))$ denotes the composite function, where the outer function $f$ and the inner function $g$ are parameterized by random variables $v$ and $w$, respectively. Problem (1) has been shown in Wang et al. (2016a) to include several important applications in estimation and machine learning.
In this paper, we focus on extending the formulation to include linear constraints, and consider the following variant of Problem (1):
Here $f_1,\dots,f_n$ and $g_1,\dots,g_m$ are continuous functions, and $r$ is a closed convex function. The reasons for considering the specific form of Problem (2) are as follows. (i) In practice, random variables such as $v$ and $w$ are obtained from problem-dependent data sets. Thus, they often take values only in a finite set with certain frequencies (captured by the first term in the objective (2)). (ii) Such problems often require the solutions to satisfy certain regularizing conditions (imposed by the term $r(y)$ and constraint (3)). Note that the uniform distribution of $f_j$ and $g_i$ (over the indices $j$ and $i$) in (2) is not critical. In Section 4, we show that our algorithm is also applicable under other distributions.
1.1 Motivating Examples
We first present a few motivating examples of formulation (2). The first example is a risk-averse learning problem discussed in Wang et al. (2016b), which can be formulated as the following mean-variance minimization problem, i.e.,
Here $\ell(x; a)$ is the loss function w.r.t. variable $x$ and is parameterized by random variable $a$, and $\mathrm{Var}[\ell(x; a)]$ denotes its variance. We see that this example is of the form (2), where the penalty term plays the role of the regularizer and the variance term is the composition function. Many other problems can be formulated as mean-variance optimization as in (4), e.g., portfolio management Alexander and Baptista (2004).
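To make the compositional structure of (4) concrete, the following numpy sketch evaluates an empirical mean-variance objective; the squared-error loss $\ell_i(x) = (a_i^{\top}x - b_i)^2$, the data matrices, and the trade-off weight `lam` are illustrative assumptions rather than the paper's exact instantiation.

```python
import numpy as np

def mean_variance_objective(x, A, b, lam):
    """Empirical mean-variance objective as in (4): average loss plus
    lam times the variance of the per-sample losses.  The inner
    expectation (the mean loss) is what makes the variance term a
    composition f(g(x))."""
    losses = (A @ x - b) ** 2                      # illustrative loss l_i(x)
    mean_loss = losses.mean()                      # inner expectation E[l]
    variance = ((losses - mean_loss) ** 2).mean()  # outer function of E[l]
    return mean_loss + lam * variance
```

Note that evaluating the variance requires the mean loss first, which is exactly why plain SGD does not yield unbiased gradients for this objective.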
The second motivating example is dynamic programming Sutton and Barto (1998); Dai et al. (2016). In this case, one can often approximate the value of each state by the inner product of a state feature $\phi_s$ and a target variable $w$. Then, the policy learning problem can be formulated as minimizing the Bellman residual as follows:
Here $P^{\pi}$ denotes the transition probabilities under a policy $\pi$, and $\gamma$ denotes the discount factor. Note that this problem also has the form of Problem (2).
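A minimal sketch of the Bellman-residual objective (6) under the linear value approximation above; the array shapes and names (`Phi`, `P`, `R`) are our own illustrative assumptions.

```python
import numpy as np

def bellman_residual(w, Phi, P, R, gamma):
    """Mean squared Bellman residual under V(s) ~ phi_s^T w.
    Phi: (S, d) state features; P: (S, S) transition matrix of the
    fixed policy; R: (S, S) rewards; gamma: discount factor."""
    v = Phi @ w                                    # approximate values
    # expected backup: sum_{s'} P[s, s'] * (R[s, s'] + gamma * v[s'])
    backup = (P * (R + gamma * v[None, :])).sum(axis=1)
    return ((v - backup) ** 2).mean()              # residual per state
```

The inner expectation over next states sits inside a square, again giving the composition form of Problem (2).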
The third example is multi-stage stochastic programming Shapiro et al. (2014). For example, a two-stage optimization scenario often requires solving the following problem:
Here $x$ and $y$ are the first-stage and second-stage decision variables, $\xi$ is the corresponding random variable, and the function $U$ is the utility function. In this case, the expectation is the inner function and $U$ is the outer function in Problem (2).
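The compositional structure of such a two-stage problem can be sketched as follows, with symbols ($c$, $Q$, $U$, $\xi$) chosen here purely for illustration:

```latex
\min_{x}\; U\Big( c(x) + \mathbb{E}_{\xi}\Big[\min_{y \in Y(x,\xi)} Q(y,\xi)\Big] \Big)
```

The expectation of the second-stage value plays the role of the inner function $g$, and the utility $U$ plays the role of the outer function $f$ in Problem (2).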
1.2 Related Works
The stochastic composition optimization problem was first proposed in Wang et al. (2016a), where two solution algorithms, basic SCGD and accelerated SCGD, were proposed. The algorithms were shown to achieve sublinear convergence rates in the convex and strongly convex cases, and were also shown to converge to a stationary point in the nonconvex case. Later, Wang et al. (2016b) proposed a proximal gradient algorithm called ASC-PG to improve the convergence rate when both the inner and outer functions are smooth. However, the convergence rate is still sublinear, and their results do not cover the regularizer when the objective functions are not strongly convex. In Lian et al. (2016), the authors solved the finite-sample case of stochastic composition optimization and obtained two linearly convergent algorithms based on the stochastic variance reduced gradient (SVRG) technique proposed in Johnson and Zhang (2013). However, those algorithms do not handle the regularizer either.
The ADMM algorithm, on the other hand, was first proposed in Glowinski and Marroco (1975); Gabay and Mercier (1976), and later reviewed in Boyd et al. (2011). Since then, several ADMM-based stochastic algorithms have been proposed, e.g., Ouyang et al. (2013); Suzuki (2013); Wang and Banerjee (2013). However, these algorithms all possess sublinear convergence rates. Therefore, several recent works tried to combine the variance reduction scheme with ADMM to accelerate convergence. For instance, SVRG-ADMM was proposed in Zheng and Kwok (2016a). It was shown that SVRG-ADMM converges linearly when the objective is strongly convex, and has a sublinear convergence rate in the general convex case. Another recent work, Zheng and Kwok (2016b), further proved that SVRG-ADMM converges to a stationary point at a sublinear rate when the objective is nonconvex. In Qian and Zhou (2016), the authors used the acceleration techniques in Allen-Zhu (2016); Hien et al. (2016) to further improve the convergence rate of SVRG-ADMM. However, none of the aforementioned variance-reduced ADMM algorithms can be directly applied to solving the stochastic composition optimization problem.
In this paper, we propose an efficient algorithm called com-SVR-ADMM, which combines ideas from SVRG and ADMM, to solve stochastic composition optimization. Our algorithm builds on the SVRG-ADMM algorithm proposed in Zheng and Kwok (2016a), which does not apply to composition optimization problems. We consider three different objective settings in Problem (2), and show that our algorithm achieves the following performance.
When $f(g(x))$ is strongly convex and Lipschitz smooth, and $r$ is convex, our algorithm converges linearly. This convergence rate is comparable with those of com-SVRG-1 and com-SVRG-2 in Lian et al. (2016). However, com-SVRG-1 and com-SVRG-2 do not take the commonly used regularization penalty into consideration. Experimental results also show that com-SVR-ADMM converges faster than com-SVRG-1 and com-SVRG-2.
When $f(g(x))$ is convex and Lipschitz smooth, and $r$ is convex, com-SVR-ADMM has a sublinear rate of $O(1/S)$, where $S$ is the number of outer iterations (the number of inner iterations is constant). This result outperforms the convergence rate of ASC-PG in Wang et al. (2016b) (note that ASC-PG is not based on SVRG and does not have inner loops).
When $f(g(x))$ and $r$ are general convex functions (not necessarily Lipschitz smooth), com-SVR-ADMM achieves a rate of $O(1/\sqrt{S})$. To the best of our knowledge, this is the first convergence result for stochastic composition optimization with general convex objectives without Lipschitz smoothness.
For a vector $x$ and a positive definite matrix $G$, the $G$-norm of $x$ is defined as $\|x\|_G = \sqrt{x^{\top} G x}$. For a matrix $A$, $\|A\|$ denotes its spectral norm, and $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ denote its largest and smallest eigenvalues, respectively. $A^{\dagger}$ denotes the pseudoinverse of $A$. $\tilde{\partial} r$ denotes a noisy subgradient of the nonsmooth $r$. For a function $g$, $\partial g$ denotes its Jacobian matrix. $\nabla f(x)$ denotes the gradient of $f$ at point $x$.
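As a quick sanity check on the notation, the $G$-norm can be computed as follows (a small sketch; `g_norm` is our own name, not a function from the paper):

```python
import numpy as np

def g_norm(x, G):
    """G-norm ||x||_G = sqrt(x^T G x) for a positive definite matrix G."""
    return float(np.sqrt(x @ G @ x))
```

With $G = I$ this reduces to the ordinary Euclidean norm.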
2 Algorithm
Recall that the stochastic composition problem we want to solve has the following form:
where $g(x) = \frac{1}{m}\sum_{i=1}^{m} g_i(x)$. For clarity, we denote $f(\cdot) = \frac{1}{n}\sum_{j=1}^{n} f_j(\cdot)$ and $F(x) = f(g(x))$. Therefore, the objective of Problem (2) can be written as $F(x) + r(y)$.
Our proposed procedure adopts the ADMM scheme. At every iteration, the primal variables $x$ and $y$ are obtained by minimizing the following augmented Lagrangian, parameterized by the penalty parameter $\rho$, i.e.,
The update of the dual variable $\lambda$ is similar to a gradient step with stepsize equal to $\rho$. We also base our algorithm on a sampling oracle, as in Wang et al. (2016b). Specifically, we assume a stochastic first-order oracle which, when queried, returns a noisy gradient/subgradient or function value of $f_j$ and $g_i$ at a given point.
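The two ADMM ingredients just described can be sketched as follows (generic ADMM quantities for a constraint $Ax + By = c$, not the paper's exact update rules):

```python
import numpy as np

def augmented_lagrangian(fx, ry, A, B, c, x, y, lam, rho):
    """Augmented Lagrangian of Problem (2): objective value fx + ry plus
    multiplier and quadratic-penalty terms for the constraint Ax + By = c."""
    residual = A @ x + B @ y - c
    return fx + ry + lam @ residual + 0.5 * rho * residual @ residual

def dual_update(lam, A, B, c, x, y, rho):
    """Multiplier step: a gradient-type update with stepsize rho."""
    return lam + rho * (A @ x + B @ y - c)
```

When the constraint is satisfied, the residual vanishes, the augmented Lagrangian reduces to the objective value, and the multiplier stays fixed.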
In the following sections, we introduce the stochastic variance reduced ADMM algorithm for stochastic composition optimization (com-SVR-ADMM). We present com-SVR-ADMM for three different scenarios, i.e., strongly convex and Lipschitz smooth, general convex and Lipschitz smooth, and general convex. Algorithm 1 gives com-SVR-ADMM for the strongly convex case, while Algorithm 2 is for the second and third cases.
2.1 Compositional Stochastic Variance Reduced ADMM for Strongly Convex Functions
As in SVRG, com-SVR-ADMM has inner loops inside each outer iteration. At every outer iteration, we keep track of a reference point $\tilde{x}$ (see Algorithm 1) for computing $g(\tilde{x})$, defined as
and evaluate $\partial g(\tilde{x})$, which is the Jacobian matrix of $g$ at point $\tilde{x}$. The updates of $y$ and $\lambda$ are the same as those in batch ADMM Boyd et al. (2011). The main difference lies in the update of $x$: we use a stochastic sample, replace $f$ with its first-order approximation, and approximate the gradient $\nabla f(g(x))$ by
Here $i$ and $j$ are uniformly sampled from $\{1,\dots,m\}$ and $\{1,\dots,n\}$, respectively, and $\hat{g}(x)$ is an estimate of $g(x)$ defined as follows:
where $\mathcal{A}$ is a mini-batch obtained by uniformly and randomly sampling from $\{1,\dots,m\}$ for $|\mathcal{A}|$ times (with replacement), and $\mathcal{A}[k]$ is the $k$th element of $\mathcal{A}$. In the $x$-update step of Algorithm 1, we add a proximal term to control the distance between the next point and the current point. The parameter $\eta$ is a constant and plays the role of a stepsize, as in Ouyang et al. (2013).
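The inner-value estimate and the variance-reduced gradient just described can be sketched as follows; function and variable names are our own illustration, and the component functions are passed in as callables:

```python
import numpy as np

def svrg_estimate_g(g_list, x, x_tilde, g_bar_tilde, batch_idx):
    """SVRG-style estimate of the inner value g(x):
    g_hat = g(x_tilde) + mean_k [ g_{A[k]}(x) - g_{A[k]}(x_tilde) ],
    where g_bar_tilde is the full inner value at the reference point
    and batch_idx is the mini-batch of sampled indices."""
    corr = np.mean([g_list[k](x) - g_list[k](x_tilde) for k in batch_idx],
                   axis=0)
    return g_bar_tilde + corr

def vr_composition_gradient(jac_i_x, jac_i_tilde, jac_bar_tilde,
                            grad_f_j, g_hat):
    """Variance-reduced gradient of f_j(g(x)) built from a sampled
    Jacobian corrected by reference-point terms:
    (J_i(x) - J_i(x_tilde) + J(x_tilde))^T grad f_j(g_hat)."""
    J = jac_i_x - jac_i_tilde + jac_bar_tilde
    return J.T @ grad_f_j(g_hat)
```

At $x = \tilde{x}$, the estimate collapses to the exact reference value and the Jacobian correction collapses to the full Jacobian, which is the usual SVRG control-variate behavior.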
Note that our estimated gradient is biased due to the composition structure of the objective, and its form is the same as that of com-SVRG-1 in Lian et al. (2016). However, we remark that our algorithm is not a trivial extension of com-SVRG-1, due to the presence of the linear constraint and the Lagrangian dual variable. Moreover, com-SVR-ADMM can handle the regularization penalty while com-SVRG-1 cannot. Also, the update of $x$ uses the pseudoinverse of $A$. In the common case when $A$ is sparse, one can use the efficient Lanczos algorithm Golub and Van Loan (2012) for this computation. Note that the $x$-update step in Algorithm 1 often involves computing $A^{\top}A$; the memory complexity of this step can be alleviated by algorithms proposed in recent works, e.g., Zheng and Kwok (2016a); Zhang et al. (2011).
2.2 Compositional Stochastic Variance Reduced ADMM for General Convex Functions
In this section, we describe how com-SVR-ADMM handles general convex composition problems with Lipschitz smoothness. Without strong convexity, we need to make several subtle changes. As shown in Algorithm 2, besides changes in the variable initialization and output, another key difference is the approximation of the gradient, where we use the exact inner value $g(x)$ instead of the estimate $\hat{g}(x)$, i.e.,
Note that in the cases of interest (see Assumption 1 below), this approximated gradient is unbiased. The next change is the stepsize for updating $x$. In the $x$-update step of Algorithm 2, we use a positive definite matrix in the proximal term (the corresponding proximal term of Algorithm 1 can be viewed as using $\eta I$ for this matrix). Therefore, the stepsize depends on two parameters, as shown in (11), where $s$ and $t$ are the iteration counters for the outer and inner iterations, respectively. Here $L$ is a Lipschitz smoothness parameter and will be specified in our assumptions in the next section.
That is, the stepsize is nonincreasing in $t$. Then, according to the definition of the $G$-norm and (11), we have:
where the scalar $\theta_{s,t}$ serves as the stepsize Ouyang et al. (2013), and it can be verified that $\theta_{s,t}$ satisfies the following properties:
That is, $\theta_{s,t}$ decreases over the course of stage $s$. Note that even though the stepsize is not a constant, it remains of reasonable size and does not need to vanish. This feature helps accelerate convergence.
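The qualitative behavior described above can be mimicked by a hypothetical schedule of our own choosing (not the paper's formula (11), whose exact form depends on $\rho$, $\|A\|$, and $L$): nonincreasing within a stage, yet bounded away from zero.

```python
def theta(t, c=1.0, L=1.0, gamma=1.0, m=10):
    """Hypothetical stepsize for inner iterations t = 0, ..., m-1.
    Illustrative stand-in for (11)-(12): nonincreasing in t but bounded
    below by c / (gamma + L * (m - 1)), so it need not vanish."""
    assert 0 <= t < m
    return c / (gamma + L * t)
```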
2.3 General Convex Functions without Lipschitz Smoothness
In the previous two sections, we presented the procedures of com-SVR-ADMM for the strongly convex and general convex settings, both under the assumption of Lipschitz smoothness. In this section, we further investigate the case where the smoothness condition is relaxed. We still use Algorithm 2, except that the stepsize parameters are changed to
Therefore, using the same technique as in (12), it can be verified that in this setting the stepsize $\theta_{s,t}$ decreases within each stage $s$ and vanishes asymptotically. Note that in this case, the number of oracle calls at each step is the same as that in section 2.2.
Although the proposed algorithm appears similar to the SVRG-ADMM algorithm in Zheng and Kwok (2016a), it is very different due to the composition nature of the objective function (which is not considered in SVRG-ADMM) and the stochastic variance reduced gradients in (8) and (10). These differences make it impossible to apply SVRG-ADMM directly and require a very different analysis for the new algorithm. Readers interested in the full proofs can refer to the appendix.
3 Theoretical Results
In this section, we analyze the convergence performance of com-SVR-ADMM under the three cases described in section 2. Below, we first state our assumptions. Note that the assumptions are not restrictive and are commonly made in the literature, e.g., Wang et al. (2016b); Ouyang et al. (2013); Wang et al. (2016a); Zheng and Kwok (2016b).
(i) For each $j$, $f_j$ is convex and continuously differentiable, and $r$ is convex (possibly nonsmooth). Moreover, there exists an optimal primal-dual solution to Problem (2).
(ii) The feasible set for $x$ is bounded.
(iii) For randomly sampled $i$, $j$, and mini-batch $\mathcal{A}$, we assume the following unbiasedness properties:
$f(g(x))$ is strongly convex with parameter $\mu > 0$, i.e., for all $x_1$, $x_2$, $f(g(x_1)) \ge f(g(x_2)) + \langle \nabla (f\circ g)(x_2),\, x_1 - x_2\rangle + \frac{\mu}{2}\|x_1-x_2\|^2$.
Matrix $A$ has full row rank.
There exists a positive constant $L_g$ such that for all $i \in \{1,\dots,m\}$ and all $x_1$, $x_2$, we have
For each $j$, $f_j$ is Lipschitz smooth with positive parameter $L_f$; that is, for all $x_1$, $x_2$, we have
For every $i$, $\partial g_i$ is bounded, and the following holds for all $x_1$, $x_2$ that satisfy
For clarity, we also define the following notation used in the theorems:
It can be verified that this quantity is always non-negative due to the convexity of $f\circ g$ and $r$. The following theorem and corollary show that Algorithm 1 has a linear convergence rate.
Under Assumption 4, for all $x_1$, $x_2$ we have:
i.e., each $f_j \circ g$ is Lipschitz smooth. Moreover, this implies that the average $f \circ g$ is Lipschitz smooth as well.
From Corollary 1, if we want to achieve an $\epsilon$-accurate solution, the number of steps we need to take is roughly logarithmic in $1/\epsilon$; multiplying by the number of oracle calls per iteration gives the overall query complexity. For comparison, the query complexities of com-SVRG-1 and com-SVRG-2 Lian et al. (2016) additionally depend on a parameter $\kappa$ related to the condition number. We will see from the simulations in section 4 that the overall query complexity of com-SVR-ADMM is lower than those of com-SVRG-1 and com-SVRG-2.
From Theorem 2, we see that com-SVR-ADMM has an $O(1/S)$ convergence rate under the general convex and Lipschitz smooth conditions. This improves upon the convergence rate obtained in the recent work Wang et al. (2016b). In Theorem 2, we consider both the convergence of the function value and the feasibility violation. Since both terms are non-negative, each of them has an $O(1/S)$ convergence rate.
In the following theorem, we show that our algorithm exhibits an $O(1/\sqrt{S})$ convergence rate for both the objective value and the feasibility violation when the objective is a general convex function.
The gradients/subgradients of all $f_j$, $g_i$, and $r$ are bounded. Moreover, $B$ is invertible, and the dual iterates are bounded.
The reason for introducing this quantity is similar to the step taken in Ouyang et al. (2013), and is due to the lack of the Lipschitz smoothness property. This result implies an $O(1/\sqrt{S})$ convergence rate for both the objective value and the feasibility violation.
4 Experiments
In this section, we conduct experiments and compare com-SVR-ADMM to existing algorithms. We consider two experimental scenarios, i.e., the portfolio management scenario from Lian et al. (2016) and the reinforcement learning scenario from Wang et al. (2016b). Since the objective functions in both scenarios are strongly convex and Lipschitz smooth, we only provide results for Algorithm 1.
4.1 Portfolio Management
Portfolio management is usually formulated as mean-variance minimization of the following form:
where $x \in \mathbb{R}^d$ with $d$ the number of assets, $N$ is the number of observed time slots, and $r_t$ is the observed reward in time slot $t$. We compare our proposed com-SVR-ADMM with three benchmarks: com-SVRG-1 and com-SVRG-2 from Lian et al. (2016), and SGD. To compute the unbiased stochastic gradient for SGD, we first enumerate all samples in the data set to calculate the inner expectation, and then evaluate the gradient at a random sample. We use the same definitions and the same parameter-generation method as Lian et al. (2016), including the regularization term.
The experimental results are shown in Figure 1 and Figure 2. Here the $y$-axis represents the objective value minus the optimal value, and the $x$-axis is the number of oracle calls or CPU time. The parameter cov is used for generating the reward covariance matrix Lian et al. (2016), and it takes different values in Figure 1 and Figure 2. All shared parameters of the four algorithms, e.g., the stepsize, have the same values. We can see that all SVRG-based algorithms perform much better than SGD, and that com-SVR-ADMM outperforms the other two linearly convergent algorithms.
4.2 Reinforcement Learning
Here we consider problem (6), which can be used for on-policy learning Wang et al. (2016b). In our experiment, we assume there are finitely many states, with $S$ denoting the number of states. $\pi$ is the policy under consideration, $P^{\pi}_{s,s'}$ is the transition probability from state $s$ to $s'$ under policy $\pi$, $\gamma$ is a discount factor, and $\phi_s$ is the feature of state $s$. Here we use the linear product $\phi_s^{\top} w$ to approximate the value of state $s$. Our goal is to find the optimal $w$.
We use the following specifications for the sampling oracles:
All shared parameters of the four algorithms have the same values. Note that the calculation of the inner expectation is no longer under a uniform distribution; we instead use the given transition probabilities. In this experiment, the transition probabilities are randomly generated and then normalized, and the rewards are also randomly generated. In addition, we include a regularization term. The results are shown in Figure 3 and Figure 4. It can be seen that our proposed com-SVR-ADMM achieves faster convergence compared to the benchmark algorithms.
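The non-uniform sampling just described can be sketched as follows; the oracle signature and names are our own illustration, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_inner_oracle(s, w, Phi, P, R, gamma):
    """One noisy inner-oracle query for state s: draw a successor s'
    from the transition distribution P[s] (not uniformly) and return
    the sampled Bellman backup r(s, s') + gamma * phi_{s'}^T w."""
    s_next = rng.choice(len(P), p=P[s])
    return R[s, s_next] + gamma * Phi[s_next] @ w
```

Averaging many such queries approximates the expected backup under the given transition probabilities.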
5 Conclusion
In this paper, we propose an ADMM-based algorithm, called com-SVR-ADMM, for stochastic composition optimization. We show that when the objective function is strongly convex and Lipschitz smooth, com-SVR-ADMM converges linearly. When the objective function is convex (not necessarily strongly convex) and Lipschitz smooth, com-SVR-ADMM improves the theoretical convergence rate of Wang et al. (2016b) to $O(1/S)$. When the objective is only assumed to be convex, com-SVR-ADMM has a convergence rate of $O(1/\sqrt{S})$. Experimental results show that com-SVR-ADMM outperforms existing algorithms.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China Grants 61672316 and 61303195, the Tsinghua Initiative Research Grant, and the China Youth 1000-Talent Grant.
References
- Alexander and Baptista [2004] Gordon J Alexander and Alexandre M Baptista. A comparison of VaR and CVaR constraints on portfolio selection with the mean-variance model. Management Science, 50(9):1261–1273, 2004.
- Allen-Zhu [2016] Zeyuan Allen-Zhu. Katyusha: Accelerated variance reduction for faster SGD. ArXiv e-prints, abs/1603.05953, 2016.
- Boyd et al. [2011] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
- Dai et al. [2016] Bo Dai, Niao He, Yunpeng Pan, Byron Boots, and Le Song. Learning from conditional distributions via dual kernel embeddings. arXiv preprint arXiv:1607.04579, 2016.
- Gabay and Mercier [1976] Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.
- Glowinski and Marroco [1975] Roland Glowinski and A Marroco. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires. Revue française d'automatique, informatique, recherche opérationnelle. Analyse numérique, 9(2):41–76, 1975.
- Golub and Van Loan [2012] Gene H Golub and Charles F Van Loan. Matrix Computations, volume 3. JHU Press, 2012.
- Hien et al. [2016] Le Thi Khanh Hien, Canyi Lu, Huan Xu, and Jiashi Feng. Accelerated stochastic mirror descent algorithms for composite non-strongly convex optimization. arXiv preprint arXiv:1605.06892, 2016.
- Johnson and Zhang [2013] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
- Lian et al. [2016] Xiangru Lian, Mengdi Wang, and Ji Liu. Finite-sum composition optimization via variance reduced gradient descent. arXiv preprint arXiv:1610.04674, 2016.
- Ouyang et al. [2013] Hua Ouyang, Niao He, Long Tran, and Alexander G Gray. Stochastic alternating direction method of multipliers. In ICML, pages 80–88, 2013.
- Qian and Zhou [2016] Chao Zhang, Hui Qian, Zebang Shen, and Tengfei Zhou. Accelerated stochastic ADMM for empirical risk minimization. arXiv preprint arXiv:1611.04074, 2016.
- Shapiro et al. [2014] Alexander Shapiro, Darinka Dentcheva, et al. Lectures on Stochastic Programming: Modeling and Theory, volume 16. SIAM, 2014.
- Sutton and Barto [1998] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
- Suzuki [2013] Taiji Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In ICML, pages 392–400, 2013.
- Wang and Banerjee [2013] Huahua Wang and Arindam Banerjee. Online alternating direction method (longer version). arXiv preprint arXiv:1306.3721, 2013.
- Wang et al. [2016a] Mengdi Wang, Ethan X Fang, and Han Liu. Stochastic compositional gradient descent: Algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1-2):419–449, 2016.
- Wang et al. [2016b] Mengdi Wang, Ji Liu, and Ethan X. Fang. Accelerating stochastic composition optimization. In Advances In Neural Information Processing Systems, pages 1714–1722, 2016.
- Zhang et al. [2011] Xiaoqun Zhang, Martin Burger, and Stanley Osher. A unified primal-dual algorithm framework based on Bregman iteration. Journal of Scientific Computing, 46(1):20–46, 2011.
- Zheng and Kwok [2016a] Shuai Zheng and James T Kwok. Fast-and-light stochastic ADMM. arXiv preprint arXiv:1604.07070, 2016.
- Zheng and Kwok [2016b] Shuai Zheng and James T Kwok. Stochastic variance-reduced ADMM. arXiv preprint arXiv:1604.07070, 2016.