1 Introduction
Recently, Wang et al. (2016a) proposed the stochastic composition optimization problem of the following form:
(1) 
Here , , denotes the composite function, and are random variables. Problem (1) has been shown in Wang et al. (2016a) to include several important applications in estimation and machine learning. In this paper, we focus on extending this formulation to include linear constraints, and consider the following variant of Problem (1):
(2)  
s.t.  (3) 
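In generic notation, the shape of (1) and (2)-(3) is as follows (a sketch only: the displays above are not recoverable from this copy, so the symbols below are illustrative rather than the paper's own):

```latex
% (1): the unconstrained composition problem of Wang et al. (2016a):
\min_{x}\; \mathbb{E}_{v}\!\left[ f_{v}\!\big( \mathbb{E}_{w}[\, g_{w}(x) \,] \big) \right]
% (2)-(3): the linearly constrained finite-support variant with regularizer h:
\min_{x,\,y}\; \frac{1}{n}\sum_{i=1}^{n} f_{i}\!\Big( \frac{1}{m}\sum_{j=1}^{m} g_{j}(x) \Big) + h(y)
\quad \text{s.t.}\quad Ax + By = c
```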
Here , , , , , and are continuous functions, and is a closed convex function. The reason to consider the specific form of Problem (2) is as follows. (i) In practice, random variables such as and are obtained from problem-dependent data sets. Thus, they often take values only in a finite set with certain frequencies (captured by the first term in the objective (2)). (ii) Such problems often require the solutions to satisfy certain regularizing conditions (imposed by the term and constraint (3)). Note that the uniform distribution of and (the and ) in (2) is not critical; in Section 4, we show that our algorithm is also applicable under other distributions.
1.1 Motivating Examples
We first present a few motivating examples of formulation (2). The first example is a risk-averse learning problem discussed in Wang et al. (2016b), which can be formulated as the following mean-variance minimization problem:
(4)  
s.t.  (5) 
Here is the loss function w.r.t. variable and is parameterized by the random variable , and denotes its variance. We see that this example is of the form (2), where plays the role of the regularizer and the variance term is the composition function. Many other problems can be formulated as mean-variance optimization as in (4), e.g., portfolio management Alexander and Baptista (2004). The second motivating example is dynamic programming Sutton and Barto (1998); Dai et al. (2016). In this case, one can often approximate the value of each state by an inner product of a state feature and a target variable . Then, the policy learning problem can be formulated as minimizing the Bellman residual as follows:
(6) 
where denotes the transition probabilities under a policy , and denotes the discounting factor. Note that this problem also has the form of Problem (2). The third example is multi-stage stochastic programming Shapiro et al. (2014). For example, a two-stage optimization scenario often requires solving the following problem:
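A standard instance of such a two-stage program (a sketch in conventional notation, cf. Shapiro et al. (2014); the specific symbols are ours) reads:

```latex
% Two-stage stochastic program: the expectation of the second-stage
% value Q plays the role of the inner function in Problem (2):
\min_{x}\; c^{\top} x + \mathbb{E}_{\xi}\big[ Q(x, \xi) \big],
\qquad
Q(x, \xi) = \min_{y \ge 0}\ \big\{\, q(\xi)^{\top} y \;:\; W y = h(\xi) - T(\xi)\, x \,\big\}
```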
Here are decision variables, are the corresponding random variables, and is the utility function. In this case, the expectation is the inner function and is the outer function in Problem (2).
1.2 Related Works
The stochastic composition optimization problem was first proposed in Wang et al. (2016a), where two solution algorithms, Basic SCGD and Accelerated SCGD, were proposed. These algorithms were shown to achieve sublinear convergence rates in the convex and strongly convex cases, and to converge to a stationary point in the nonconvex case. Later, Wang et al. (2016b) proposed a proximal gradient algorithm called ASC-PG to improve the convergence rate when both the inner and outer functions are smooth. However, the convergence rate is still sublinear, and their results do not cover the regularizer when the objective is not strongly convex. In Lian et al. (2016), the authors studied the finite-sample case of stochastic composition optimization and obtained two linearly convergent algorithms based on the stochastic variance reduced gradient (SVRG) technique proposed in Johnson and Zhang (2013). However, these algorithms do not handle the regularizer either.
The ADMM algorithm, on the other hand, was first proposed in Glowinski and Marroco (1975); Gabay and Mercier (1976) and later reviewed in Boyd et al. (2011). Since then, several ADMM-based stochastic algorithms have been proposed, e.g., Ouyang et al. (2013); Suzuki and others (2013); Wang and Banerjee (2013). However, these algorithms all have sublinear convergence rates. Therefore, several recent works combine variance reduction with ADMM to accelerate convergence. For instance, SVRG-ADMM was proposed in Zheng and Kwok (2016a). It was shown that SVRG-ADMM converges linearly when the objective is strongly convex, and has a sublinear convergence rate in the general convex case. Another recent work, Zheng and Kwok (2016b), further proved that SVRG-ADMM converges to a stationary point with a rate when the objective is nonconvex. In Qian and Zhou (2016), the authors used the acceleration techniques in Allen-Zhu (2016); Hien et al. (2016) to further improve the convergence rate of SVRG-ADMM. However, none of the aforementioned variance-reduced ADMM algorithms can be directly applied to the stochastic composition optimization problem.
1.3 Contribution
In this paper, we propose an efficient algorithm called com-SVR-ADMM, which combines ideas from SVRG and ADMM, to solve stochastic composition optimization. Our algorithm builds on the SVRG-ADMM algorithm proposed in Zheng and Kwok (2016a), which does not apply to composition optimization problems. We consider three different objective settings for Problem (2), and show that our algorithm achieves the following performance.

When is strongly convex and Lipschitz smooth, and is convex, our algorithm converges linearly. This convergence rate is comparable to those of com-SVRG-1 and com-SVRG-2 in Lian et al. (2016). However, com-SVRG-1 and com-SVRG-2 do not take the commonly used regularization penalty into consideration. Experimental results also show that com-SVR-ADMM converges faster than com-SVRG-1 and com-SVRG-2.

When is convex and Lipschitz smooth, and is convex, com-SVR-ADMM has a sublinear rate of , where is the number of outer iterations (the number of inner iterations is constant). This result outperforms the convergence rate of ASC-PG in Wang et al. (2016b); note that ASC-PG is not based on SVRG and does not have inner loops.

When and are general convex functions (not necessarily Lipschitz smooth), com-SVR-ADMM achieves a rate of . To the best of our knowledge, this is the first convergence result for stochastic composition optimization with general convex objectives without Lipschitz smoothness.
1.4 Notation
For a vector and a positive definite matrix , the norm of is defined as . For a matrix , denotes its spectral norm, and denote its largest and smallest eigenvalues, respectively. denotes the pseudo-inverse of . denotes a noisy subgradient of the nonsmooth . For a function , denotes its Jacobian matrix. denotes the gradient of at point .
2 Algorithm
Recall that the stochastic composition problem we want to solve has the following form:
s.t. 
where . For clarity, we denote , , . Therefore, .
Our proposed procedure adopts the ADMM scheme. At every iteration, the primal variables are obtained by minimizing the following augmented Lagrangian, parameterized by , i.e.,
The update of the dual variable is similar to a gradient descent step with stepsize equal to . We also base our algorithm on a sampling oracle as in Wang et al. (2016b). Specifically, we assume a stochastic first-order oracle which, when queried, returns a noisy gradient/subgradient or function value of and at a given point.
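For concreteness, the standard augmented Lagrangian and dual update for a linearly constrained problem of the form min F(x) + h(y) s.t. Ax + By = c are sketched below in generic notation (the paper's own display is not recoverable from this copy; symbols are illustrative):

```latex
% Augmented Lagrangian with penalty parameter \rho:
\mathcal{L}_{\rho}(x, y, \lambda)
  = F(x) + h(y) + \lambda^{\top}(Ax + By - c)
    + \frac{\rho}{2}\,\lVert Ax + By - c \rVert^{2}
% Dual ascent step with stepsize \rho:
\lambda_{k+1} = \lambda_{k} + \rho\,(A x_{k+1} + B y_{k+1} - c)
```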
In the following sections, we introduce the stochastic variance reduced ADMM algorithm for stochastic composition optimization (com-SVR-ADMM). We present com-SVR-ADMM in three different scenarios: strongly convex and Lipschitz smooth, general convex and Lipschitz smooth, and general convex. Algorithm 1 shows com-SVR-ADMM for the strongly convex case, while Algorithm 2 covers the second and third cases.
2.1 Compositional Stochastic Variance Reduced ADMM for Strongly Convex Functions
As in SVRG, com-SVR-ADMM has inner loops inside each outer iteration. At every outer iteration, we keep track of a reference point (Step in Algorithm 1) for computing , defined as
(7) 
and evaluate , the Jacobian matrix of at point . The updates of and are the same as those in batch ADMM Boyd et al. (2011). The main difference lies in the update of : we use a stochastic sample, replace with its first-order approximation, and then approximate by
(8) 
Here , are uniformly sampled from and , respectively. is an estimate of , defined as follows:
(9) 
where is a mini-batch obtained by uniformly and randomly sampling from for times (with replacement), and is the th element of . In step of Algorithm 1, we add a proximal term to control the distance between the next point and the current point . The parameter is a constant and plays the role of a stepsize, as in Ouyang et al. (2013).
Note that our gradient estimate is biased due to the composition in the objective, and its form is the same as that of com-SVRG-1 in Lian et al. (2016). However, we remark that our algorithm is not a trivial extension of com-SVRG-1, due to the presence of the linear constraint and the Lagrangian dual variable. Moreover, com-SVR-ADMM can handle the regularization penalty while com-SVRG-1 cannot. Also, the update of uses the pseudo-inverse of . In the common case when is sparse, one can use the efficient Lanczos algorithm Golub and Van Loan (2012) to compute . Note that step of Algorithm 1 often involves computing . The memory cost of this step can be alleviated by algorithms proposed in recent works, e.g., Zheng and Kwok (2016a); Zhang et al. (2011).
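The estimator in (8)-(9) can be sketched numerically on a toy instance. The code below is an illustration only, not the paper's implementation: it takes F(x) = f((1/n) Σ g_i(x)) with scalar inner maps g_i(x) = a_i x and outer f(u) = u², and all names (n, a, batch, ...) are our own placeholders.

```python
import random

# Toy sketch of the variance-reduced composition gradient in (8)-(9).
n = 5
a = [0.5, 1.0, 1.5, 2.0, 2.5]

def g(i, x):
    """Inner map g_i(x) = a_i * x."""
    return a[i] * x

def dg(i, x):
    """Derivative (1x1 Jacobian) of g_i."""
    return a[i]

def df(u):
    """Gradient of the outer function f(u) = u^2."""
    return 2.0 * u

def full_value_and_grad(x):
    """Exact inner value and full gradient, computed once at the reference point."""
    u = sum(g(i, x) for i in range(n)) / n
    grad = (sum(dg(i, x) for i in range(n)) / n) * df(u)
    return u, grad

def vr_gradient(x, x_ref, u_ref, grad_ref, batch=3):
    """SVRG-style estimator in the spirit of (8)-(9): a mini-batch corrects
    the inner value (9), and a control variate corrects the gradient (8)."""
    idx = [random.randrange(n) for _ in range(batch)]
    u_hat = u_ref + sum(g(i, x) - g(i, x_ref) for i in idx) / batch
    i = random.randrange(n)
    return dg(i, x) * df(u_hat) - dg(i, x_ref) * df(u_ref) + grad_ref

random.seed(0)
u_ref, grad_ref = full_value_and_grad(1.0)
# At the reference point the corrections cancel and the estimator
# coincides with the full gradient, as in plain SVRG.
print(abs(vr_gradient(1.0, 1.0, u_ref, grad_ref) - grad_ref) < 1e-12)  # True
```

The control-variate structure is what drives the variance reduction: as the iterate approaches the reference point, the estimator's variance shrinks to zero.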
2.2 Compositional Stochastic Variance Reduced ADMM for General Convex Functions
In this section, we describe how com-SVR-ADMM handles general convex composition problems with Lipschitz smoothness. Without strong convexity, we need to make some subtle changes. As shown in Algorithm 2, besides changes in variable initialization and output, another key difference is the approximation of , where we use instead of , i.e.,
(10) 
Note that in the cases of interest (see Assumption 1 below), the approximated gradient is unbiased. The next change is the stepsize used for updating . In step of Algorithm 2, we use a positive definite matrix in the proximal term (the corresponding proximal term of Algorithm 1 can be viewed as having ). Therefore, the stepsize depends on two parameters, and , as shown in (11), where and are the iteration counters of the outer and inner iterations, respectively. Here is a Lipschitz-smoothness parameter that will be specified in our assumptions in the next section.
(11)  
That is, is nonincreasing in . Then, according to the definition of the norm and (11), we have:
(12) 
where and if , and is a scalar. Therefore, serves as the stepsize Ouyang et al. (2013), and it can be verified that satisfies the following properties:
(13)  
That is, changes from to in stage . Note that even though is not constant, it remains of a reasonable magnitude and does not need to vanish. This feature helps accelerate convergence.
2.3 General Convex Functions without Lipschitz Smoothness
In the previous two sections, we presented the procedures of com-SVR-ADMM for the strongly convex and general convex settings, both under the Lipschitz smoothness assumption on . In this section, we further investigate the case where the smoothness condition is relaxed. We still use Algorithm 2, except that the values and are changed to
(14)  
Therefore, using the same technique as in (12), it can be verified that in this setting changes from to in stage and decreases to zero. Note that in this case, the number of oracle calls at each step is the same as in Section 2.2.
Although the algorithm we propose appears similar to the SVRG-ADMM algorithm in Zheng and Kwok (2016a), it is very different due to the composition nature of the objective function (which is not considered in SVRG-ADMM) and the stochastic variance reduced gradients in (8) and (10). These differences make it impossible to apply SVRG-ADMM directly and require a very different analysis for the new algorithm. Readers interested in the full proofs may refer to the appendix.
3 Theoretical Results
In this section, we analyze the convergence of com-SVR-ADMM in the three cases described in Section 2. Below, we first state our assumptions. Note that these assumptions are not restrictive and are commonly made in the literature, e.g., Wang et al. (2016b); Ouyang et al. (2013); Wang et al. (2016a); Zheng and Kwok (2016b).
Assumption 1.
(i) For each , is convex and continuously differentiable, is convex (can be nonsmooth). Moreover, there exists an optimal primaldual solution for Problem (2).
(ii) The feasible set of is bounded, and we denote .
(iii) For randomly sampled , and , we assume the following unbiased properties:
(15)  
Assumption 2.
is strongly convex with parameter , i.e., ,
(16) 
Assumption 3.
Matrix has full row rank.
Assumption 4.
There exists a positive constant , such that , and , , we have
Assumption 5.
For each , is Lipschitz smooth with positive parameter , that is, , we have
(17) 
Assumption 6.
For every , is bounded, and for all , that satisfy
(18)  
For clarity, we also use the following notations used in the theorems:
(19)  
It can be verified that is always nonnegative due to the convexity of and . The following theorem and corollary show that Algorithm 1 has a linear convergence rate.
Proposition.
Corollary 1.
From Corollary 1, to achieve , , the number of steps we need is roughly . Each iteration requires oracle calls, so the overall query complexity is . For comparison, the query complexity is for com-SVRG-1 and for com-SVRG-2 Lian et al. (2016), where is a parameter related to the condition number. We will see in the simulations in Section 4 that the overall query complexity of com-SVR-ADMM is lower than that of com-SVRG-1 and com-SVRG-2.
Theorem 2.
From Theorem 2, we see that com-SVR-ADMM has an convergence rate under the general convex and Lipschitz smooth conditions. This improves upon the convergence rate in the recent work Wang et al. (2016b). In Theorem 2, we consider both the convergence of the function value and the feasibility violation. Since and are both nonnegative, each term has an convergence rate.
In the following theorem, we show that our algorithm achieves a convergence rate for both the objective value and the feasibility violation when the objective is a general convex function.
Assumption 7.
The gradients/subgradients of all , , and are bounded and , . Moreover, is invertible and are bounded.
Theorem 3.
The introduction of is motivated by a step similar to that in Ouyang et al. (2013), and is due to the lack of Lipschitz smoothness. This result implies an convergence rate for both the objective value and the feasibility violation.
4 Experiments
In this section, we conduct experiments comparing com-SVR-ADMM to existing algorithms. We consider two experimental scenarios: the portfolio management scenario from Lian et al. (2016) and the reinforcement learning scenario from Wang et al. (2016b). Since the objective functions in both scenarios are strongly convex and Lipschitz smooth, we only provide results for Algorithm 1.
4.1 Portfolio Management
Portfolio management is usually formulated as a mean-variance minimization of the following form:
(24) 
where for , is the number of assets, and is the number of observed time slots. Thus, is the observed reward in time slot . We compare our proposed com-SVR-ADMM with three benchmarks: com-SVRG-1 and com-SVRG-2 from Lian et al. (2016), and SGD. To compute the unbiased stochastic gradient for SGD, we first enumerate all samples in the data set of to calculate and , and then evaluate for a random sample . Using the same definitions of and and the same parameter generation method as Lian et al. (2016), we set the regularizer to , where .
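The mean-variance objective of the form in (24) can be sketched on synthetic data as follows. This is an illustration under our own assumptions, not the paper's experimental code; the names (N, T, rewards) and the reward distribution are placeholders.

```python
import random

# Synthetic sketch of a mean-variance portfolio objective:
#   minimize  -(1/T) sum_t <r_t, x>  +  (1/T) sum_t (<r_t, x> - mean)^2
random.seed(1)
N, T = 4, 100                        # assets / observed time slots (illustrative)
rewards = [[random.gauss(1.0, 0.2) for _ in range(N)] for _ in range(T)]

def mean_variance_objective(x):
    # Per-slot portfolio returns <r_t, x>
    returns = [sum(r_i * x_i for r_i, x_i in zip(r, x)) for r in rewards]
    mean = sum(returns) / T          # inner expectation (the composed quantity)
    var = sum((u - mean) ** 2 for u in returns) / T  # outer function of the mean
    return -mean + var               # trade off reward against risk

print(mean_variance_objective([1.0 / N] * N))
```

The composition structure is visible here: the variance is a function of the mean return, which is itself an expectation, so an unbiased single-sample gradient is not directly available — the motivation for the composition oracle model used throughout the paper.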
The experimental results are shown in Figure 1 and Figure 2. Here the vertical axis represents the objective value minus the optimal value, and the horizontal axis is the number of oracle calls or the CPU time. We set , . cov is the parameter used for generating the reward covariance matrix Lian et al. (2016). In Figure 1, , and in Figure 2. All shared parameters of the four algorithms, e.g., the stepsize, have the same values. We can see that all SVRG-based algorithms perform much better than SGD, and com-SVR-ADMM outperforms the two other linearly convergent algorithms.
4.2 Reinforcement Learning
Here we consider problem (6), which can be used for on-policy learning Wang et al. (2016b). In our experiment, we assume there are finitely many states, and the number of states is . is the policy under consideration, is the transition probability from state to given policy , is a discount factor, and is the feature of state . We use a linear product to approximate the value of state . Our goal is to find the optimal .
We use the following specifications for oracles and :
Note here that , and denotes the th element of vector . All shared parameters of the four algorithms have the same values. Note that the calculation of is no longer under a uniform distribution; we use the given transition probabilities instead. In this experiment, the transition probabilities are randomly generated and then normalized. The reward is also randomly generated. In addition, we include a regularization term with . The results are shown in Figure 3 and Figure 4. It can be seen that our proposed com-SVR-ADMM converges faster than the benchmark algorithms.
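The Bellman-residual objective of the form in (6) can be sketched as follows on a randomly generated instance, mirroring the setup above (random, row-normalized transitions and random rewards). All names here (S, d, gamma, phi) are illustrative placeholders, not the paper's code.

```python
import random

# Sketch of a Bellman-residual objective with linear value
# approximation V(s) ~ <phi_s, w>.
random.seed(2)
S, d, gamma = 5, 3, 0.9              # states, feature dimension, discount

phi = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(S)]
reward = [random.random() for _ in range(S)]
# Transition matrix under the policy: random entries, then row-normalized.
P = [[random.random() for _ in range(S)] for _ in range(S)]
P = [[p / sum(row) for p in row] for row in P]

def value(s, w):
    """Linear value approximation <phi_s, w>."""
    return sum(phi[s][k] * w[k] for k in range(d))

def bellman_residual(w):
    """sum_s ( V(s) - sum_{s'} P[s][s'] * (r_s + gamma * V(s')) )^2"""
    total = 0.0
    for s in range(S):
        target = sum(P[s][t] * (reward[s] + gamma * value(t, w)) for t in range(S))
        total += (value(s, w) - target) ** 2
    return total

print(bellman_residual([0.0] * d) >= 0.0)  # a sum of squares, prints True
```

The inner sum over next states is the expectation being composed with the squared outer function, which is why this experiment fits the composition framework of Problem (2).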
5 Conclusion
In this paper, we propose an ADMM-based algorithm, called com-SVR-ADMM, for stochastic composition optimization. We show that when the objective function is strongly convex and Lipschitz smooth, com-SVR-ADMM converges linearly. When the objective function is convex (not necessarily strongly convex) and Lipschitz smooth, com-SVR-ADMM improves the theoretical convergence rate from in Wang et al. (2016b) to . When the objective is only assumed to be convex, com-SVR-ADMM has a convergence rate of . Experimental results show that com-SVR-ADMM outperforms existing algorithms.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China Grants 61672316 and 61303195, the Tsinghua Initiative Research Grant, and the China Youth 1000-Talent Grant.
References
 Alexander and Baptista [2004] Gordon J Alexander and Alexandre M Baptista. A comparison of VaR and CVaR constraints on portfolio selection with the mean-variance model. Management Science, 50(9):1261–1273, 2004.
 Allen-Zhu [2016] Zeyuan Allen-Zhu. Katyusha: Accelerated variance reduction for faster SGD. arXiv e-prints, abs/1603.05953, 2016.
 Boyd et al. [2011] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
 Dai et al. [2016] Bo Dai, Niao He, Yunpeng Pan, Byron Boots, and Le Song. Learning from conditional distributions via dual kernel embeddings. arXiv preprint arXiv:1607.04579, 2016.
 Gabay and Mercier [1976] Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.
 Glowinski and Marroco [1975] Roland Glowinski and A Marroco. Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisationdualité d’une classe de problèmes de dirichlet non linéaires. Revue française d’automatique, informatique, recherche opérationnelle. Analyse numérique, 9(2):41–76, 1975.
 Golub and Van Loan [2012] Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU Press, 2012.
 Hien et al. [2016] Le Thi Khanh Hien, Canyi Lu, Huan Xu, and Jiashi Feng. Accelerated stochastic mirror descent algorithms for composite nonstrongly convex optimization. arXiv preprint arXiv:1605.06892, 2016.

 Johnson and Zhang [2013] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
 Lian et al. [2016] Xiangru Lian, Mengdi Wang, and Ji Liu. Finite-sum composition optimization via variance reduced gradient descent. arXiv preprint arXiv:1610.04674, 2016.
 Ouyang et al. [2013] Hua Ouyang, Niao He, Long Tran, and Alexander G Gray. Stochastic alternating direction method of multipliers. ICML (1), 28:80–88, 2013.
 Qian and Zhou [2016] Chao Zhang, Hui Qian, Zebang Shen, and Tengfei Zhou. Accelerated stochastic ADMM for empirical risk minimization. arXiv preprint arXiv:1611.04074, 2016.
 Shapiro et al. [2014] Alexander Shapiro, Darinka Dentcheva, et al. Lectures on stochastic programming: modeling and theory, volume 16. SIAM, 2014.
 Sutton and Barto [1998] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
 Suzuki and others [2013] Taiji Suzuki et al. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In ICML (1), pages 392–400, 2013.
 Wang and Banerjee [2013] Huahua Wang and Arindam Banerjee. Online alternating direction method (longer version). arXiv preprint arXiv:1306.3721, 2013.
 Wang et al. [2016a] Mengdi Wang, Ethan X Fang, and Han Liu. Stochastic compositional gradient descent: Algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1-2):419–449, 2016.
 Wang et al. [2016b] Mengdi Wang, Ji Liu, and Ethan X. Fang. Accelerating stochastic composition optimization. In Advances In Neural Information Processing Systems, pages 1714–1722, 2016.
 Zhang et al. [2011] Xiaoqun Zhang, Martin Burger, and Stanley Osher. A unified primaldual algorithm framework based on bregman iteration. Journal of Scientific Computing, 46(1):20–46, 2011.
 Zheng and Kwok [2016a] Shuai Zheng and James T Kwok. Fast-and-light stochastic ADMM. arXiv preprint arXiv:1604.07070, 2016.
 Zheng and Kwok [2016b] Shuai Zheng and James T Kwok. Stochastic variance-reduced ADMM. arXiv preprint arXiv:1604.07070, 2016.
6 Appendix
Recall that the stochastic composition problem we want to solve has the following form:
(25)  
s.t.  (26) 
where . For clarity, we denote , , . Therefore, , and the augmented Lagrangian for (25)-(26) is
(27) 
Denote by the optimal solution of (25)-(26); then it can be verified that the KKT conditions are
(28) 
Moreover, if matrix has full row rank,