# Fast Stochastic Variance Reduced ADMM for Stochastic Composition Optimization

We consider the stochastic composition optimization problem proposed in Wang et al. (2016a), which has applications in estimation, statistics, and machine learning. We propose the first ADMM-based algorithm for this problem, named com-SVR-ADMM, and show that com-SVR-ADMM converges linearly for strongly convex and Lipschitz smooth objectives, and has a convergence rate of $O(\log S / S)$, which improves upon the $O(S^{-4/9})$ rate in Wang et al. (2016b) when the objective is convex and Lipschitz smooth. Moreover, com-SVR-ADMM possesses a rate of $O(1/\sqrt{S})$ when the objective is convex but without Lipschitz smoothness. We also conduct experiments and show that it outperforms existing algorithms.


## 1 Introduction

Recently, Wang et al. (2016a) proposed the stochastic composition optimization of the following form:

$$\min_x \; \big(\mathbb{E}_i f_i \circ \mathbb{E}_j g_j\big)(x). \tag{1}$$

Here $(\mathbb{E}_i f_i \circ \mathbb{E}_j g_j)(x) = \mathbb{E}_i f_i\big(\mathbb{E}_j g_j(x)\big)$ denotes the composite function, and $i$ and $j$ are random variables. Problem (1) has been shown in Wang et al. (2016a) to include several important applications in estimation and machine learning.

In this paper, we focus on extending the formulation to include linear constraints, and consider the following variant of Problem (1):

$$\min_{x,\omega} \; F(x) + R(\omega) \tag{2}$$
$$\text{s.t.} \quad Ax + B\omega = 0. \tag{3}$$

Here $F(x) = \big(\frac{1}{n}\sum_{i=1}^{n} f_i \circ \frac{1}{m}\sum_{j=1}^{m} g_j\big)(x)$, $x$ and $\omega$ are the optimization variables, $A$ and $B$ are given matrices, the $f_i$ and $g_j$ are continuous functions, and $R$ is a closed convex function. The reason to consider the specific form of Problem (2) is as follows. (i) In practice, random variables such as $i$ and $j$ are obtained from problem-dependent data sets. Thus, they often only take values in a finite set with certain frequencies (captured by the first term in the objective (2)). (ii) Such problems often require the solutions to satisfy certain regularizing conditions (imposed by the term $R(\omega)$ and constraint (3)). Note here that the uniform distribution of $i$ and $j$ (the $\frac{1}{n}$ and $\frac{1}{m}$) in (2) is not critical. In Section 4, we show that our algorithm is also applicable under other distributions.

### 1.1 Motivating Examples

We first present a few motivating examples of formulation (2). The first example is a risk-averse learning problem discussed in Wang et al. (2016b), which can be formulated as the following mean-variance minimization problem:

$$\min_x \; \mathbb{E}_\epsilon h_\epsilon(x) + \lambda \,\mathrm{Var}_\epsilon h_\epsilon(x) \tag{4}$$
$$\text{s.t.} \quad Ax = 0. \tag{5}$$

Here $h_\epsilon(x)$ is the loss function w.r.t. variable $x$ and is parameterized by the random variable $\epsilon$, and $\mathrm{Var}_\epsilon h_\epsilon(x)$ denotes its variance. We see that this example is of the form (2), where the variance term is the composition function. There are many other problems that can be formulated into the mean-variance optimization as in (4), e.g., portfolio management (Alexander and Baptista, 2004).
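The decomposition of the mean-variance objective into the form (2) can be checked numerically. Below is a minimal sketch in which the per-sample losses $h_j$ and all data are illustrative assumptions: with inner map $g(x) = \big(\mathbb{E}_j h_j(x),\; \mathbb{E}_j h_j(x)^2\big)$ and outer map $f(y_1, y_2) = y_1 + \lambda (y_2 - y_1^2)$, the composition $f(g(x))$ reproduces $\mathbb{E}\, h + \lambda\, \mathrm{Var}\, h$.

```python
import numpy as np

# Synthetic per-sample losses h_j(x) = (a_j @ x - b_j)^2 -- an illustrative
# assumption standing in for any loss parameterized by random data.
rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
A_data = rng.normal(size=(n, d))
b = rng.normal(size=n)

def h(x):
    """Vector of per-sample losses h_j(x), j = 1..n."""
    return (A_data @ x - b) ** 2

def g(x):
    """Inner map: g(x) = (E_j h_j(x), E_j h_j(x)^2)."""
    hx = h(x)
    return np.array([hx.mean(), (hx ** 2).mean()])

def f(y):
    """Outer map: f(y1, y2) = y1 + lam * (y2 - y1^2)."""
    return y[0] + lam * (y[1] - y[0] ** 2)

x = rng.normal(size=d)
composed = f(g(x))                        # objective written as f(g(x))
direct = h(x).mean() + lam * h(x).var()   # E h + lam * Var h, computed directly
```

Since $\mathrm{Var}\,h = \mathbb{E} h^2 - (\mathbb{E} h)^2$, the two evaluations agree, which is exactly why (4) is an instance of the composition problem.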

The second motivating example is dynamic programming (Sutton and Barto, 1998; Dai et al., 2016). In this case, one can often approximate the value of each state $s$ by an inner product of a state feature $\phi_s$ and a target variable $w$. Then, the policy learning problem can be formulated into minimizing the Bellman residual as follows:

$$\min_w \; \sum_{s=1}^{S}\Big(\langle \phi_s, w\rangle - \sum_{s'} P^{\pi}_{s,s'}\big(r_{s,s'} + \gamma \langle \phi_{s'}, w\rangle\big)\Big)^2 + R(w), \tag{6}$$

where $P^{\pi}_{s,s'}$ denotes the transition probability from state $s$ to state $s'$ under a policy $\pi$, $r_{s,s'}$ is the corresponding reward, and $\gamma$ denotes the discounting factor. Note that this problem also has the form of Problem (2).

The third example is multi-stage stochastic programming (Shapiro et al., 2014). For example, a two-stage optimization scenario often requires solving the following problem:

$$\min_x \; \mathbb{E}_v\Big(\min_y \mathbb{E}_{u|v}\big(U(x,v,y,u)\big)\Big).$$

Here $x$ and $y$ are decision variables, $v$ and $u$ are the corresponding random variables, and the function $U$ is the utility function. In this case, the conditional expectation $\mathbb{E}_{u|v}$ appears in the inner function and $\mathbb{E}_v$ in the outer function of Problem (2).

From these examples, we see that formulation (2) is general and includes important applications. Thus, it is important to develop fast and robust algorithms for solving (2).

### 1.2 Related Works

The stochastic composition optimization problem was first proposed in Wang et al. (2016a), where two solution algorithms, Basic SCGD and accelerated SCGD, were proposed. The algorithms were shown to achieve a sublinear convergence rate for convex and strongly convex cases, and were also shown to converge to a stationary point in the nonconvex case. Later, Wang et al. (2016b) proposed a proximal gradient algorithm called ASC-PG to improve the convergence rate when both inner and outer functions are smooth. However, the convergence rate is sublinear and their results do not include the regularizer when the objective functions are not strongly convex. In Lian et al. (2016), the authors solved the finite sample case of stochastic composition optimization and obtained two linear-convergent algorithms based on the stochastic variance reduction gradient technique (SVRG) proposed in Johnson and Zhang (2013). However, the algorithms do not handle the regularizer either.

The ADMM algorithm, on the other hand, was first proposed in Glowinski and Marroco (1975); Gabay and Mercier (1976) and later reviewed in Boyd et al. (2011). Since then, several ADMM-based stochastic algorithms have been proposed, e.g., Ouyang et al. (2013); Suzuki and others (2013); Wang and Banerjee (2013). However, these algorithms all possess sublinear convergence rates. Therefore, several recent works tried to combine the variance reduction scheme and ADMM to accelerate convergence. For instance, SVRG-ADMM was proposed in Zheng and Kwok (2016a). It was shown that SVRG-ADMM converges linearly when the objective is strongly convex, and has a sublinear convergence rate in the general convex case. Another recent work, Zheng and Kwok (2016b), further proved that SVRG-ADMM converges to a stationary point at a sublinear rate when the objective is nonconvex. In Qian and Zhou (2016), the authors used the acceleration techniques in Allen-Zhu (2016); Hien et al. (2016) to further improve the convergence rate of SVRG-ADMM. However, all aforementioned variance-reduced ADMM algorithms cannot be directly applied to solving the stochastic composition optimization problem.

### 1.3 Contribution

In this paper, we propose an efficient algorithm called com-SVR-ADMM, which combines ideas of SVRG and ADMM, to solve stochastic composition optimization. Our algorithm is based on the SVRG-ADMM algorithm proposed in Zheng and Kwok (2016a), which does not apply to composite optimization problems. We consider three different objective functions in Problem (2), and show that our algorithm achieves the following performance.

• When $F$ is strongly convex and Lipschitz smooth, and $R$ is convex, our algorithm converges linearly. This convergence rate is comparable with those of com-SVRG-1 and com-SVRG-2 in Lian et al. (2016). However, com-SVRG-1 and com-SVRG-2 do not take the commonly used regularization penalty into consideration. Experimental results also show that com-SVR-ADMM converges faster than com-SVRG-1 and com-SVRG-2.

• When $F$ is convex and Lipschitz smooth, and $R$ is convex, com-SVR-ADMM has a sublinear rate of $O(\log S / S)$, where $S$ is the number of outer iterations (the number of inner iterations is constant). This result outperforms the $O(S^{-4/9})$ convergence rate of ASC-PG in Wang et al. (2016b) (note that ASC-PG is not based on SVRG and does not have inner loops).

• When $F$ and $R$ are general convex functions (not necessarily Lipschitz smooth), com-SVR-ADMM achieves a rate of $O(1/\sqrt{S})$. To the best of our knowledge, this is the first convergence result for stochastic composition optimization with general convex problems without Lipschitz smoothness.

### 1.4 Notation

For a vector $x$ and a positive definite matrix $G$, the $G$-norm of $x$ is defined as $\|x\|_G = \sqrt{x^T G x}$. For a matrix $A$, $\|A\|$ denotes its spectral norm, and $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ denote its largest and smallest eigenvalues, respectively. $A^{\dagger}$ denotes the pseudoinverse of $A$. $\tilde\nabla R$ denotes a noisy subgradient of the nonsmooth function $R$. For a function $g$, $\partial g$ denotes its Jacobian matrix. $\nabla F(x)$ denotes the gradient of $F$ at point $x$.

## 2 Algorithm

Recall that the stochastic composition problem we want to solve has the following form:

$$\min_{x,\omega} \; F(x) + R(\omega) \quad \text{s.t.} \quad Ax + B\omega = 0,$$

where $F(x) = \frac{1}{n}\sum_{i=1}^{n} f_i\big(\frac{1}{m}\sum_{j=1}^{m} g_j(x)\big)$. For clarity, we denote $g(x) = \frac{1}{m}\sum_{j=1}^{m} g_j(x)$ and $F_i(x) = f_i(g(x))$. Therefore, $F(x) = \frac{1}{n}\sum_{i=1}^{n} F_i(x)$.

Our proposed procedure adopts the ADMM scheme. At every iteration the primal variables $(x, \omega)$ are obtained by minimizing the following augmented Lagrangian parameterized with $\rho > 0$, i.e.,

$$L_\rho(x,\omega,\lambda) = F(x) + R(\omega) + \langle \lambda, Ax + B\omega\rangle + \frac{\rho}{2}\|Ax + B\omega\|_2^2.$$

The update of the dual variable $\lambda$ is similar to that under gradient descent, with the stepsize equaling $\rho$. We also base our algorithm on a sampling oracle as in Wang et al. (2016b). Specifically, we assume a stochastic first-order oracle which, if queried, returns a noisy gradient/subgradient or function value of $f_i$ and $g_j$ at a given point.
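To make the three-block update scheme concrete, below is a minimal deterministic ADMM sketch on a lasso-type instance of Problem (2): $F(x) = \frac12\|Cx - c\|^2$, $R(\omega) = \mu\|\omega\|_1$, with constraint $x - \omega = 0$ (i.e., $A = I$, $B = -I$). The data matrix `C`, vector `c`, and all parameter values are illustrative assumptions, not part of the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
m_, d = 30, 8
C = rng.normal(size=(m_, d))
c = rng.normal(size=m_)
mu, rho = 0.1, 1.0

x = np.zeros(d); w = np.zeros(d); lam = np.zeros(d)
for _ in range(300):
    # x-update: minimize L_rho over x (closed form for quadratic F):
    # (C^T C + rho I) x = C^T c + rho w - lam
    x = np.linalg.solve(C.T @ C + rho * np.eye(d), C.T @ c + rho * w - lam)
    # w-update: proximal step for the l1 regularizer (soft-thresholding)
    v = x + lam / rho
    w = np.sign(v) * np.maximum(np.abs(v) - mu / rho, 0.0)
    # dual update with stepsize rho, as in the ADMM scheme above
    lam = lam + rho * (x - w)
```

After enough iterations the constraint residual $\|x - \omega\|$ is driven to (numerical) zero, which is the role of the dual ascent step.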

In the following sections, we introduce the stochastic variance reduced ADMM algorithm for solving stochastic compositional optimization (com-SVR-ADMM). We present com-SVR-ADMM in three different scenarios, i.e., strongly convex and Lipschitz smooth, general convex and Lipschitz smooth, and general convex. Algorithm 1 shows how com-SVR-ADMM is used in the strongly convex case, while Algorithm 2 is for the second and third cases.

### 2.1 Compositional Stochastic Variance Reduced ADMM for Strongly Convex Functions

As in SVRG, com-SVR-ADMM has inner loops inside each outer iteration. At every outer iteration, we need to keep track of a reference point $\tilde x$ (see Algorithm 1) for computing $g(\tilde x)$ defined as

$$g(\tilde x) = \frac{1}{m}\sum_{j=1}^{m} g_j(\tilde x), \tag{7}$$

and evaluate $\partial g(\tilde x)$, which is the Jacobian matrix of $g$ at point $\tilde x$. The updates of $\omega$ and $\lambda$ are the same as those in batch ADMM (Boyd et al., 2011). The main difference lies in the update for $x$, in that we use a stochastic sample and replace $F(x)$ with its first-order approximation, and then approximate $\nabla F(x_k)$ by

$$\hat\nabla F_{i_k}(x_k) = (\partial g_{j_k}(x_k))^T \nabla f_{i_k}(\hat g(x_k)) - (\partial g_{j_k}(\tilde x))^T \nabla f_{i_k}(g(\tilde x)) + \nabla F(\tilde x). \tag{8}$$

Here $i_k$ and $j_k$ are uniformly sampled from $\{1,\dots,n\}$ and $\{1,\dots,m\}$, respectively. $\hat g(x_k)$ is an estimate of $g(x_k)$ defined as follows:

$$\hat g(x_k) = g(\tilde x) - \frac{1}{N}\sum_{1 \le j \le N}\big(g_{N_k[j]}(\tilde x) - g_{N_k[j]}(x_k)\big), \tag{9}$$

where $N_k$ is a mini-batch obtained by uniformly and randomly sampling from $\{1,\dots,m\}$ for $N$ times (with replacement), and $N_k[j]$ is the $j$th element of $N_k$. In the $x$-update step of Algorithm 1, we add a proximal term to control the distance between the next point $x_{k+1}$ and the current point $x_k$. The parameter $\eta$ is a constant and plays the role of a stepsize as in Ouyang et al. (2013).

Note that our estimated gradient is biased due to the composition objective function, and its form is the same as com-SVRG-1 in Lian et al. (2016). However, we remark that our algorithm is not a trivial extension of com-SVRG-1, due to the existence of the linear constraint and the Lagrangian dual variable. Moreover, com-SVR-ADMM can handle a regularization penalty while com-SVRG-1 cannot. Also, the update of the dual variable uses the pseudoinverse of $A$. In the common case when $A$ is sparse, one can use the efficient Lanczos algorithm (Golub and Van Loan, 2012) to compute it. Note that the $x$-update in Algorithm 1 often involves matrix products with $A$, whose memory complexity can be alleviated by algorithms proposed in recent works, e.g., Zheng and Kwok (2016a); Zhang et al. (2011).
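The estimator (8)-(9) can be sketched on a toy linear-quadratic instance. Everything below (the maps $g_j(x) = M_j x$, $f_i(y) = \frac12\|y - t_i\|^2$, and all problem sizes) is an illustrative assumption; the point is only the mechanics of the variance-reduced estimate.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, d, p, N = 10, 20, 6, 4, 5
M = rng.normal(size=(m, p, d))   # g_j(x) = M[j] @ x, with Jacobian M[j]
t = rng.normal(size=(n, p))

def g_full(x):
    """Exact inner function g(x) = (1/m) sum_j g_j(x)."""
    return M.mean(axis=0) @ x

def grad_f(i, y):
    """Gradient of f_i(y) = 0.5*||y - t_i||^2."""
    return y - t[i]

def grad_F(x):
    """Full gradient of F(x) = (1/n) sum_i f_i(g(x))."""
    J = M.mean(axis=0)  # Jacobian of g
    return J.T @ np.mean([grad_f(i, g_full(x)) for i in range(n)], axis=0)

def vr_grad(x_k, x_tilde):
    """One sample of the estimator (8), with g(x_k) estimated via (9)."""
    i_k, j_k = rng.integers(n), rng.integers(m)
    batch = rng.integers(m, size=N)  # mini-batch N_k, sampled with replacement
    # (9): g_hat(x_k) = g(x_tilde) - (1/N) sum_j (g_{N_k[j]}(x_tilde) - g_{N_k[j]}(x_k))
    g_hat = g_full(x_tilde) - np.mean(
        [M[j] @ x_tilde - M[j] @ x_k for j in batch], axis=0)
    # (8): stochastic term at x_k minus the same term at x_tilde, plus full gradient
    return (M[j_k].T @ grad_f(i_k, g_hat)
            - M[j_k].T @ grad_f(i_k, g_full(x_tilde))
            + grad_F(x_tilde))

x_tilde = rng.normal(size=d)
# Sanity check: at x_k = x_tilde, (9) returns g(x_tilde) exactly and the two
# correction terms in (8) cancel, leaving exactly the full gradient.
est = vr_grad(x_tilde, x_tilde)
```

This cancellation at the reference point is the hallmark of SVRG-style estimators: the closer $x_k$ stays to $\tilde x$, the smaller the variance of the estimate.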

### 2.2 Compositional Stochastic Variance Reduced ADMM for General Convex Functions

In this section, we describe how com-SVR-ADMM handles general convex composition problems with Lipschitz smoothness. Without strong convexity, we need to make subtle changes. As shown in Algorithm 2, besides changes in variable initialization and output, another key difference is the approximation of $\nabla F(x_k)$, where we use $g(x_k)$ instead of $\hat g(x_k)$, i.e.,

$$\hat\nabla F_{i_k}(x_k) = (\partial g_{j_k}(x_k))^T \nabla f_{i_k}(g(x_k)) - (\partial g_{j_k}(\tilde x))^T \nabla f_{i_k}(g(\tilde x)) + \nabla F(\tilde x). \tag{10}$$

Note that in the cases of interest (see Assumption 1 below), this approximated gradient is unbiased. The next change is the stepsize for updating $x$. In the $x$-update step of Algorithm 2, we use a positive definite matrix $G_k$ in the proximal term (the corresponding proximal term of Algorithm 1 can be viewed as having $G_k = I$). Therefore, the stepsize depends on two parameters, $\eta_s$ and $G_k$, as shown in (11), where $s$ and $k$ are the iteration counters for the outer and inner iterations, respectively. Here $L_F$ is a parameter of Lipschitz smoothness and will be specified in our assumptions in the next section.

$$\eta_s = \frac{1}{(s+1)L_F}, \quad G_0 \succeq G_1 \succeq G_2 \succeq \dots \succeq G_{K-1}, \quad G_0 = \frac{1}{s}I, \; G_{K-1} = \frac{1}{s+1}I, \; G_K = \frac{1}{s+1}I. \tag{11}$$

That is, $G_k$ is nonincreasing in $k$. Then, according to the definition of the $G$-norm and (11), we have

$$\frac{1}{2\eta_s}\|x - x_k\|_{G_k}^2 = \frac{1}{2\eta_{s,k}}\|x - x_k\|_2^2, \tag{12}$$

where $\eta_{s,k} = \eta_s / \alpha_k$ and $G_k = \alpha_k I$ for a scalar $\alpha_k > 0$. Therefore, $\eta_{s,k}$ serves as the stepsize (Ouyang et al., 2013), and it can be verified that $\eta_{s,k}$ satisfies the following properties:

$$\eta_{s,0} = \frac{s}{(s+1)L_F}, \quad \eta_{s,K-1} = \eta_{s,K} = \frac{1}{L_F}, \quad \eta_{s,0} \le \eta_{s,1} \le \dots \le \eta_{s,K-1}. \tag{13}$$

That is, $\eta_{s,k}$ changes from $\frac{s}{(s+1)L_F}$ to $\frac{1}{L_F}$ in stage $s$. Note that even though $\eta_{s,k}$ is not a constant, it still has a reasonable value and does not need to vanish. This feature is helpful for convergence acceleration.
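As a quick numeric check of the schedule (11)-(13), the snippet below interpolates the scalars $\alpha_k$ of $G_k = \alpha_k I$ linearly between the stated endpoints; the linear interpolation itself is an assumption for illustration, since the analysis only requires $G_0 \succeq G_1 \succeq \dots \succeq G_{K-1}$.

```python
# Stage s of the smooth general-convex schedule: eta_s = 1/((s+1)*L_F),
# alpha_k decreasing from 1/s to 1/(s+1), effective stepsize
# eta_{s,k} = eta_s / alpha_k.
L_F, K, s = 2.0, 10, 5
eta_s = 1.0 / ((s + 1) * L_F)
# alpha_k interpolated linearly (an illustrative choice) between endpoints
alphas = [1.0 / s + (1.0 / (s + 1) - 1.0 / s) * k / (K - 1) for k in range(K)]
etas = [eta_s / a for a in alphas]
```

The endpoints match (13): the stepsize grows from $\frac{s}{(s+1)L_F}$ to $\frac{1}{L_F}$ within the stage and never vanishes.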

### 2.3 General Convex Functions without Lipschitz Smoothness

In the previous two sections, we presented the procedures of com-SVR-ADMM for the strongly convex and general convex settings, both under the Lipschitz smoothness assumption on $F$. In this section, we further investigate the case when the smoothness condition is relaxed. We still use Algorithm 2, except that the values $\eta_s$ and $G_k$ are changed to

$$\eta_s = \frac{1}{s+1}, \quad G_0 \succeq G_1 \succeq G_2 \succeq \dots \succeq G_{K-1}, \quad G_0 = \frac{1}{\sqrt{s}}I, \; G_{K-1} = \frac{1}{\sqrt{s+1}}I, \; G_K = \frac{1}{\sqrt{s+1}}I. \tag{14}$$

Therefore, using the same technique as in (12), it can be verified that in this setting $\eta_{s,k}$ changes from $\frac{\sqrt{s}}{s+1}$ to $\frac{\sqrt{s+1}}{s+1}$ in stage $s$ and decreases to zero. Note that in this case, the number of oracle calls at each step is the same as that in Section 2.2.

Although the algorithm we proposed appears similar to the SVRG-ADMM algorithm in Zheng and Kwok (2016a), it is very different due to the composition nature of the objective function (which is not considered in SVRG-ADMM) and the stochastic variance reduced gradients in (8) and (10). These differences make it impossible to directly apply SVRG-ADMM and require a very different analysis for the new algorithm. Readers interested in the full proofs can refer to the appendix.

## 3 Theoretical Results

In this section, we analyze the convergence performance of com-SVR-ADMM under the three cases described in section 2. Below, we first state our assumptions. Note that the assumptions are not restrictive and are commonly made in the literature, e.g., Wang et al. (2016b); Ouyang et al. (2013); Wang et al. (2016a); Zheng and Kwok (2016b).

###### Assumption 1.

(i) For each $i$, $f_i$ is convex and continuously differentiable, and $R$ is convex (possibly nonsmooth). Moreover, there exists an optimal primal-dual solution $(x^*, \omega^*, \lambda^*)$ for Problem (2).

(ii) The feasible set for $x$ is bounded, with its diameter denoted by $D$.

(iii) For randomly sampled $i_k$ and $j_k$, we assume the following unbiasedness properties:

$$\mathbb{E}\big((\partial g_{j_k}(x))^T \nabla f_{i_k}(g(x))\big) = \nabla F(x), \quad \mathbb{E}(\partial g_{j_k}(x)) = \partial g(x), \quad \mathbb{E}(\nabla F_{i_k}(x)) = \nabla F(x). \tag{15}$$
###### Assumption 2.

$F$ is strongly convex with parameter $\mu_F$, i.e., $\forall x$,

$$F(x) - F(x^*) \ge \langle \nabla F(x^*), x - x^*\rangle + \frac{\mu_F}{2}\|x - x^*\|_2^2. \tag{16}$$
###### Assumption 3.

Matrix $A$ has full row rank.

###### Assumption 4.

There exists a positive constant $L_F$ such that $\forall i, j$ and $\forall x, y$, we have

$$\big\|(\partial g_j(x))^T \nabla f_i(g(x)) - (\partial g_j(y))^T \nabla f_i(g(y))\big\| \le L_F \|x - y\|.$$
###### Assumption 5.

For each $i$, $f_i$ is Lipschitz smooth with positive parameter $L_f$, that is, $\forall x, y$, we have

$$\|\nabla f_i(y) - \nabla f_i(x)\| \le L_f \|y - x\|. \tag{17}$$
###### Assumption 6.

For every $j$, $g_j$ is bounded, and for all $x, y$ we have

$$\|g_j(x) - g_j(y)\| \le C_G \|x - y\|, \quad \|\partial g_j(x)\| \le C_G, \quad \|\partial g_j(x) - \partial g_j(y)\| \le L_G \|x - y\|. \tag{18}$$

For clarity, we also use the following notation in the theorems:

$$u = \begin{bmatrix} x \\ \omega \end{bmatrix}, \quad u_k = \begin{bmatrix} x_k \\ \omega_k \end{bmatrix}, \quad \tilde u_s = \begin{bmatrix} \tilde x_s \\ \tilde \omega_s \end{bmatrix}, \quad \bar u = \begin{bmatrix} \bar x \\ \bar \omega \end{bmatrix}, \tag{19}$$
$$G(u) = F(x) - F(x^*) - \langle \nabla F(x^*), x - x^*\rangle + R(\omega) - R(\omega^*) - \langle \tilde\nabla R(\omega^*), \omega - \omega^*\rangle.$$

It can be verified that $G(u)$ is always non-negative due to the convexity of $F$ and $R$. The following theorem and corollary show that Algorithm 1 has a linear convergence rate.

###### Proposition.

Under Assumption 4, $\forall x, y$ and $\forall i$ we have

$$\|\nabla F_i(x) - \nabla F_i(y)\| \le L_F \|x - y\|, \tag{20}$$

i.e., each $F_i$ is Lipschitz smooth. Moreover, it implies that $F$ is $L_F$-Lipschitz smooth.

###### Theorem 1.

Under Assumptions 1, 2, 3, 4, 5, and 6, if the constant stepsize $\eta$ is chosen sufficiently small so that $\gamma_1 > 0$, then under Algorithm 1,

$$\gamma_1 \,\mathbb{E}[G(\tilde u_s)] \le \gamma_2\, G(\tilde u_{s-1}), \tag{21}$$

where, with $\sigma(N)$ denoting a constant determined by the mini-batch size $N$,

$$\gamma_1 = \Big(2\eta - \frac{32\eta^2 C_G^4 L_f^2}{\mu_F N} - \frac{48\eta^2 L_F^2 + 8\eta D C_G L_f L_G \sigma(N)}{\mu_F}\Big)K,$$
$$\gamma_2 = (K+1)\Big(\frac{32\eta^2 C_G^4 L_f^2}{\mu_F N} + \frac{48\eta^2 L_F^2 + 8\eta D C_G L_f L_G \sigma(N)}{\mu_F}\Big) + \frac{2}{\mu_F} + \frac{2\eta\rho\|A^T A\|}{\mu_F} + \frac{2 L_F \eta}{\rho\,\sigma_{\min}(AA^T)}.$$
###### Corollary 1.

Suppose the conditions in Theorem 1 hold. Then, there exist positive constants $K$ (number of inner iterations) and $N$ (mini-batch size) such that $\gamma_2 / \gamma_1 < 1$. Thus, Algorithm 1 converges linearly.

From Corollary 1, if we want to achieve $\mathbb{E}[G(\tilde u_s)] \le \epsilon$, the number of outer iterations we need to take is roughly $O(\log(1/\epsilon))$, and each outer iteration requires a number of oracle calls determined by $n$, $m$, $K$, and $N$. For comparison, the query complexities of com-SVRG-1 and com-SVRG-2 reported in Lian et al. (2016) also scale logarithmically in $1/\epsilon$, with constants depending on a parameter related to the condition number. We will see in the simulations in Section 4 that the overall query complexity of com-SVR-ADMM is lower than those of com-SVRG-1 and com-SVRG-2.

Now we prove the convergence property of com-SVR-ADMM under Assumptions 1 and 4.

###### Theorem 2.

Under Assumptions 1 and 4, if $\eta_s$ and $G_k$ are chosen as in (11), then under Algorithm 2,

$$\mathbb{E}\big(G(\bar u) + \Lambda \|A\bar x + B\bar\omega\|\big) \le \frac{4 L_F D^2 \log(S+1)}{S} + \frac{L_F D^2 \log S}{2KS} + \frac{L_F D^2 + \rho D^2 \|A^T A\| + \frac{2}{\rho}\|\hat\lambda_0 - \lambda^*\|_2^2 + \frac{2}{\rho}\Lambda^2}{2KS}, \tag{22}$$

where $\Lambda$ is a positive constant.

From Theorem 2, we see that com-SVR-ADMM has an $O(\log S / S)$ convergence rate under the general convex and Lipschitz smooth condition. It improves upon the $O(S^{-4/9})$ convergence rate in the recent work Wang et al. (2016b). In Theorem 2, we consider both the convergence of the function value and the feasibility violation. Since $G(\bar u)$ and $\|A\bar x + B\bar\omega\|$ are both non-negative, each term has an $O(\log S / S)$ convergence rate.

In the following theorem, we show that our algorithm exhibits an $O(1/\sqrt{S})$ convergence rate for both the objective value and the feasibility violation, when the objective is a general convex function.

###### Assumption 7.

The gradients/subgradients of all $f_i$, $g_j$, and $R$ are bounded. Moreover, $B$ is invertible and the iterates generated by the algorithm are bounded.

###### Theorem 3.

Under Assumptions 1 and 7, denote by $\bar z$ the averaged output of Algorithm 2, and let $C_1$, $C_3$, $C_4$, and $C_F$ be constants specified below. If $\eta_s$ and $G_k$ are chosen as in (14), there exists a positive constant $\Lambda$ such that, under Algorithm 2,

$$\mathbb{E}\big(G(\bar z) + \Lambda \|A\bar x + B\bar\omega\|\big) \le \frac{C_1(C_4 + C_F)}{\sqrt{S}} + \frac{D^2}{K\sqrt{S}} + \frac{C_3 \log(S+1)}{S} + \frac{D^2 + \rho\|A^T A\| D^2 + \frac{2}{\rho}\|\hat\lambda_0 - \lambda^*\|_2^2 + \frac{2}{\rho}\Lambda^2}{2KS}, \tag{23}$$

where $C_1$, $C_3$, $C_4$, and $C_F$ are positive constants.

The reason for the introduction of these constants is similar to the step taken in Ouyang et al. (2013), and is due to the lack of the Lipschitz smoothness property. This result implies an $O(1/\sqrt{S})$ convergence rate for both the objective value and the feasibility violation.

## 4 Experiments

In this section, we conduct experiments and compare com-SVR-ADMM to existing algorithms. We consider two experiment scenarios, i.e., the portfolio management scenario from Lian et al. (2016) and the reinforcement learning scenario from Wang et al. (2016b). Since the objective functions in both scenarios are strongly convex and Lipschitz smooth, we only provide results for Algorithm 1.

### 4.1 Portfolio Management

Portfolio management is usually formulated as mean-variance minimization of the following form:

$$\min_x \; -\frac{1}{n}\sum_{i=1}^{n} \langle r_i, x\rangle + \frac{1}{n}\sum_{i=1}^{n}\Big(\langle r_i, x\rangle - \frac{1}{n}\sum_{j=1}^{n} \langle r_j, x\rangle\Big)^2 + R(x), \tag{24}$$

where $r_i \in \mathbb{R}^d$ for $i = 1,\dots,n$, $d$ is the number of assets, and $n$ is the number of observed time slots. Thus, $r_i$ is the observed reward in time slot $i$. We compare our proposed com-SVR-ADMM with three benchmarks: com-SVRG-1 and com-SVRG-2 from Lian et al. (2016), and SGD. In order to compute the unbiased stochastic gradient for SGD, we first enumerate all samples in the data set to calculate $g(x)$ and $\partial g(x)$, and then evaluate $\nabla f_i$ for a random sample $i$. Using the same definitions of $f_i$ and $g_j$ and the same parameter generation method as Lian et al. (2016), we set the regularization to a quadratic penalty $R(x) = \mu\|x\|_2^2$ with a small constant $\mu > 0$.
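A direct evaluation of objective (24) can be sketched as follows; the synthetic reward matrix and the regularization weight are illustrative assumptions, not the parameters used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, mu = 100, 10, 1e-3
r = rng.normal(loc=0.05, scale=0.1, size=(n, d))  # r[i] = reward vector of slot i

def objective(x):
    """Negative mean return + empirical variance of returns + l2 penalty, as in (24)."""
    returns = r @ x  # <r_i, x> for every time slot i
    return (-returns.mean()
            + ((returns - returns.mean()) ** 2).mean()
            + mu * np.dot(x, x))

x0 = np.ones(d) / d  # equal-weight portfolio as a sanity check
val = objective(x0)
```

Note that the middle term is exactly the (population) variance of the per-slot returns, which is what makes (24) a composition of an outer quadratic with the inner averages $\frac1n\sum_i \langle r_i, x\rangle$ and $\frac1n\sum_i \langle r_i, x\rangle^2$.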

The experimental results are shown in Figure 1 and Figure 2. Here the $y$-axis represents the objective value minus the optimal value, and the $x$-axis is the number of oracle calls or the CPU time. The parameter cov is used for reward covariance matrix generation (Lian et al., 2016), and Figure 1 and Figure 2 use two different cov values. All shared parameters in the four algorithms, e.g., the stepsize, have the same values. We can see that all SVRG-based algorithms perform much better than SGD, and com-SVR-ADMM outperforms the two other linearly convergent algorithms.

### 4.2 Reinforcement Learning

Here we consider problem (6), which can be used for on-policy learning (Wang et al., 2016b). In our experiment, we assume there are finitely many states, with the number of states denoted by $S$. $\pi$ is the policy in consideration, $P^{\pi}_{s,s'}$ is the transition probability from state $s$ to $s'$ given policy $\pi$, $\gamma$ is a discount factor, and $\phi_s$ is the feature of state $s$. Here we use the linear product $\langle \phi_s, w\rangle$ to approximate the value of state $s$. Our goal is to find the optimal $w$.

We use the following specifications for the oracles $g_{s'}$ and $f_s$:

$$g_{s'}(w) = \big(\phi_1^T w,\; r_{1,s'} + \gamma\phi_{s'}^T w,\; \dots,\; \phi_S^T w,\; r_{S,s'} + \gamma\phi_{s'}^T w\big)^T, \qquad f_s(y) = (y[2s-1] - y[2s])^2.$$

Note here that $y \in \mathbb{R}^{2S}$, and $y[i]$ denotes the $i$-th element of vector $y$. All shared parameters in the four algorithms have the same values. Note also that the calculation of $g$ is no longer under the uniform distribution; we use the given transition probabilities. In this experiment, the transition probabilities are randomly generated and then normalized, and the rewards are also randomly generated. In addition, we include an $\ell_2$ regularization term. The results are shown in Figure 3 and Figure 4. It can be seen that our proposed com-SVR-ADMM achieves faster convergence compared to the benchmark algorithms.
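The oracle specification above can be sketched as follows. The transition matrix, rewards, and features below are random illustrative data; the final lines check that taking the expectation of $g_{s'}$ under a row of $P$ and applying $f_s$ recovers one summand of the Bellman residual (6).

```python
import numpy as np

rng = np.random.default_rng(4)
S, d, gamma = 5, 3, 0.9
Phi = rng.normal(size=(S, d))     # Phi[s] = feature of state s
rwd = rng.normal(size=(S, S))     # rwd[s, s2] = reward for transition s -> s2
P = rng.random(size=(S, S))
P /= P.sum(axis=1, keepdims=True)  # normalized rows: transition probabilities

def g_inner(s2, w):
    """g_{s'}(w): interleaves (phi_s^T w, r_{s,s'} + gamma * phi_{s'}^T w) over s."""
    out = np.empty(2 * S)
    out[0::2] = Phi @ w                              # components y[2s-1]
    out[1::2] = rwd[:, s2] + gamma * (Phi[s2] @ w)   # components y[2s]
    return out

def f_outer(s, y):
    """f_s(y) = (y[2s-1] - y[2s])^2, with 1-based s as in the paper."""
    return (y[2 * s - 2] - y[2 * s - 1]) ** 2

w = rng.normal(size=d)
# Expected inner value for each state s, taken under its own transition row
y_bar_rows = [sum(P[s, s2] * g_inner(s2, w) for s2 in range(S)) for s in range(S)]
# Each f_outer(s+1, y_bar_rows[s]) is one squared Bellman residual in (6)
bellman = [f_outer(s + 1, y_bar_rows[s]) for s in range(S)]
```

The per-state expectation here makes explicit why $g$ is sampled under the transition distribution rather than uniformly.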

## 5 Conclusion

In this paper, we propose an ADMM-based algorithm, called com-SVR-ADMM, for stochastic composition optimization. We show that when the objective function is strongly convex and Lipschitz smooth, com-SVR-ADMM converges linearly. In the case when the objective function is convex (not necessarily strongly convex) and Lipschitz smooth, com-SVR-ADMM improves the theoretical convergence rate from $O(S^{-4/9})$ in Wang et al. (2016b) to $O(\log S / S)$. When the objective is only assumed to be convex, com-SVR-ADMM has a convergence rate of $O(1/\sqrt{S})$. Experimental results show that com-SVR-ADMM outperforms existing algorithms.

## Acknowledgements

This work was supported in part by the National Natural Science Foundation of China Grants 61672316, 61303195, the Tsinghua Initiative Research Grant, and the China youth 1000-talent grant.

## 6 Appendix

Recall that the stochastic composition problem we want to solve has the following form:

$$\min_{x,\omega} \; F(x) + R(\omega) \tag{25}$$
$$\text{s.t.} \quad Ax + B\omega = 0, \tag{26}$$

where $F(x) = \frac{1}{n}\sum_{i=1}^{n} f_i\big(\frac{1}{m}\sum_{j=1}^{m} g_j(x)\big)$. For clarity, we denote $g(x) = \frac{1}{m}\sum_{j=1}^{m} g_j(x)$ and $F_i(x) = f_i(g(x))$. Therefore, $F(x) = \frac{1}{n}\sum_{i=1}^{n} F_i(x)$, and the augmented Lagrangian for (25)-(26) is

$$L_\rho(x,\omega,\lambda) = F(x) + R(\omega) + \langle \lambda, Ax + B\omega\rangle + \frac{\rho}{2}\|Ax + B\omega\|_2^2, \quad \rho > 0. \tag{27}$$

Denote by $(x^*, \omega^*, \lambda^*)$ the optimal solution of (25)-(26); then it can be verified that the KKT conditions are

$$\nabla F(x^*) = -A^T\lambda^*, \quad \tilde\nabla R(\omega^*) = -B^T\lambda^*, \quad Ax^* + B\omega^* = 0. \tag{28}$$

Moreover, if matrix $A$ has full row rank,