# Adaptive Stochastic Alternating Direction Method of Multipliers

The Alternating Direction Method of Multipliers (ADMM) has been extensively studied. The traditional ADMM algorithm needs to compute, at each iteration, an (empirical) expected loss function on all training examples, resulting in a computational complexity proportional to the number of training examples. To reduce the time complexity, stochastic ADMM algorithms were proposed to replace the expected function with a random loss function associated with one uniformly drawn example plus a Bregman divergence. The Bregman divergence, however, is derived from a simple second order proximal function, the half squared norm, which could be a suboptimal choice. In this paper, we present a new family of stochastic ADMM algorithms with optimal second order proximal functions, yielding a new family of adaptive subgradient methods. We theoretically prove that their regret bounds are as good as the bounds which could be achieved by the best proximal function that can be chosen in hindsight. Encouraging empirical results on a variety of real-world datasets confirm the effectiveness and efficiency of the proposed algorithms.


## 1 Introduction

Originally introduced in [8, 7], the offline/batch Alternating Direction Method of Multipliers (ADMM) stemmed from the augmented Lagrangian method, with its global convergence property established in [6, 9, 4]. Recent studies have shown that ADMM achieves a convergence rate of $O(1/T)$ [14, 12] (where $T$ is the number of iterations of ADMM) when the objective function is generally convex. Furthermore, ADMM enjoys a linear convergence rate of $O(\alpha^T)$, for some $\alpha \in (0,1)$, when the objective function is strongly convex and smooth [13, 2]. ADMM has shown attractive performance in a wide range of real-world problems, such as compressed sensing [18], image restoration [11], video processing, and matrix completion [10].

From the computational perspective, one drawback of ADMM is that, at every iteration, the method needs to compute an (empirical) expected loss function on all the training examples. The computational complexity is proportional to the number of training examples, which makes the original ADMM unsuitable for solving large-scale learning and big data mining problems. The online ADMM (OADMM) algorithm [17] was proposed to tackle this computational challenge. For OADMM, the objective function is replaced with an online function at every step, which only depends on a single training example. OADMM can achieve an average regret bound of $O(1/\sqrt{T})$ for convex objective functions and $O(\log T / T)$ for strongly convex objective functions. Interestingly, although the optimization of the loss function is assumed to be easy in the analysis of [17], it is actually not necessarily easy in practice. To address this issue, the stochastic ADMM algorithm was proposed by linearizing the online loss function [15, 16]. In stochastic ADMM algorithms, the online loss function is first uniformly drawn from the loss functions associated with all the training examples. Then the loss function is replaced with its first order expansion at the current solution plus a Bregman divergence from the current solution. The Bregman divergence is based on a simple proximal function, the half squared Euclidean norm, so that the Bregman divergence reduces to the half squared Euclidean distance. In this way, the optimization of the loss function enjoys a closed-form solution. Stochastic ADMM achieves convergence rates similar to OADMM. Using the half squared norm as the proximal function, however, may be a suboptimal choice. Our paper addresses this issue.

Our contribution.  In the previous work [15, 16], the Bregman divergence is derived from a simple second order function, i.e., the half squared norm, which could be a suboptimal choice [3]. In this paper, we present a new family of stochastic ADMM algorithms with adaptive proximal functions, which can accelerate stochastic ADMM by using adaptive subgradient information. We theoretically prove that the regret bounds of our methods are as good as those achieved by stochastic ADMM with the best proximal function that can be chosen in hindsight. The effectiveness and efficiency of the proposed algorithms are confirmed by encouraging empirical evaluations on several real-world datasets.

Organization. Section 2 presents the proposed algorithms. Section 3 gives our experimental results. Section 4 concludes our paper. Additional proofs can be found in the supplementary material.

## 2 Adaptive Stochastic Alternating Direction Method of Multipliers

### 2.1 Problem Formulation

In this paper, we will study a family of convex optimization problems, where our objective functions are composite. Specifically, we are interested in the following equality-constrained optimization task:

$$\min_{w \in \mathcal{W},\, v \in \mathcal{V}} f\big((w^\top, v^\top)^\top\big) := \mathbb{E}_\xi\, \ell(w, \xi) + \varphi(v), \quad \text{s.t.} \quad Aw + Bv = b, \tag{1}$$

where $w \in \mathbb{R}^{d_1}$, $v \in \mathbb{R}^{d_2}$, $A \in \mathbb{R}^{m \times d_1}$, $B \in \mathbb{R}^{m \times d_2}$, $b \in \mathbb{R}^m$, and $\mathcal{W} \subseteq \mathbb{R}^{d_1}$, $\mathcal{V} \subseteq \mathbb{R}^{d_2}$ are convex sets. For simplicity, the notation $\ell$ is used for both the instance function value $\ell(\cdot, \xi)$ and its expectation $\ell(\cdot) = \mathbb{E}_\xi\, \ell(\cdot, \xi)$. It is assumed that a sequence of independent and identically distributed (i.i.d.) observations can be drawn from the random vector $\xi$, which follows a fixed but unknown distribution. When $\xi$ is deterministic, the above optimization becomes the traditional problem formulation of ADMM [1]. In this paper, we will assume the functions $\ell$ and $\varphi$ are convex but not necessarily continuously differentiable. In addition, we denote the optimal solution of (1) as $(w_*, v_*)$.

Before presenting the proposed algorithm, we first introduce some notation. For a positive definite matrix $H$, we define the $H$-norm of a vector $w$ as $\|w\|_H := \sqrt{w^\top H w}$. When there is no ambiguity, we often use $\|\cdot\|$ to denote the Euclidean norm $\|\cdot\|_2$. We use $\langle \cdot, \cdot \rangle$ to denote the inner product in a finite dimensional Euclidean space. Let $H_t$ be a positive definite matrix for $t \ge 1$. Set the proximal function $\phi_t(\cdot)$ as $\phi_t(w) = \frac{1}{2}\|w\|_{H_t}^2$. Then the corresponding Bregman divergence for $\phi_t$ is defined as

$$B_{\phi_t}(w, u) = \phi_t(w) - \phi_t(u) - \langle \nabla\phi_t(u), w - u \rangle = \frac{1}{2}\|w - u\|_{H_t}^2.$$

### 2.2 Algorithm

To solve problem (1), a popular method is the Alternating Direction Method of Multipliers (ADMM). ADMM splits the optimization with respect to $w$ and $v$ by minimizing the augmented Lagrangian:

$$L_\beta(w, v, \theta) = \ell(w) + \varphi(v) - \langle \theta, Aw + Bv - b \rangle + \frac{\beta}{2}\|Aw + Bv - b\|^2,$$

where $\beta > 0$ is a pre-defined penalty parameter. Specifically, the ADMM algorithm minimizes $L_\beta$ as follows:

$$w_{t+1} = \arg\min_{w \in \mathcal{W}} L_\beta(w, v_t, \theta_t), \quad v_{t+1} = \arg\min_{v \in \mathcal{V}} L_\beta(w_{t+1}, v, \theta_t), \quad \theta_{t+1} = \theta_t - \beta(Aw_{t+1} + Bv_{t+1} - b).$$

At each step, however, ADMM requires computing the expectation $\mathbb{E}_\xi\, \ell(w, \xi)$, which may be unrealistic or computationally too expensive, since we may only have an unbiased estimate of $\ell(w)$, or the expectation is an empirical one over all training examples for a big data problem. To address this issue, we propose to minimize the following stochastic approximation:

$$\mathcal{L}_{\beta,t}(w, v, \theta) = \langle g_t, w \rangle + \varphi(v) - \langle \theta, Aw + Bv - b \rangle + \frac{\beta}{2}\|Aw + Bv - b\|^2 + \frac{1}{\eta}B_{\phi_t}(w, w_t),$$

where $g_t = \ell'(w_t, \xi_t)$ and $H_t$ for $t \ge 1$ will be specified later. This objective linearizes the loss $\ell$ and adopts a dynamic Bregman divergence function to keep the new model close to the previous one. It is easy to see that this proposed approximation includes the one proposed by [15] as a special case when $H_t = I$. To minimize the above function, we follow the ADMM algorithm and optimize over $w$, $v$, $\theta$ sequentially, by fixing the others. In addition, we also need to update $H_t$ at every step, which will be specified later. Finally, the proposed Adaptive Stochastic Alternating Direction Method of Multipliers (Ada-SADMM) is summarized in Algorithm 1.
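To make the iteration concrete, the following Python sketch runs an Ada-SADMM-style loop with a diagonal adaptive $H_t$ on a toy equality-constrained least-squares problem. The problem data, the squared loss, its subgradient, and the single preconditioned gradient step used for the $w$-update are all illustrative assumptions, not the paper's exact Algorithm 1 (which solves each subproblem as stated in the text).

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, m, n = 5, 4, 3, 50
A = rng.standard_normal((m, d1))
B = rng.standard_normal((m, d2))
b = rng.standard_normal(m)
X = rng.standard_normal((n, d1))          # toy training examples
y = X @ rng.standard_normal(d1)           # toy targets

w, v, theta = np.zeros(d1), np.zeros(d2), np.zeros(m)
s = np.zeros(d1)                          # running column norms of subgradients
a, eta, beta = 1e-6, 0.1, 1.0

for t in range(200):
    i = rng.integers(n)
    g = X[i] * (X[i] @ w - y[i])          # stochastic subgradient of a toy squared loss
    s = np.sqrt(s**2 + g**2)              # s_{t,i} = ||g_{1:t,i}||_2
    H = a + s                             # diagonal of H_t = a*I + diag(s_t)
    # w-step: one H-preconditioned gradient step on the linearized objective
    # (a simplification of the exact w-subproblem solve)
    grad_w = g - A.T @ theta + beta * A.T @ (A @ w + B @ v - b)
    w = w - eta * grad_w / H
    # v-step: exact minimization of the quadratic in v (phi = 0 in this toy)
    v, *_ = np.linalg.lstsq(B, b - A @ w + theta / beta, rcond=None)
    # dual step
    theta = theta - beta * (A @ w + B @ v - b)

residual = np.linalg.norm(A @ w + B @ v - b)
```

Because the toy $v$-step solves its subsystem exactly, the equality constraint is satisfied (up to numerical error) after every iteration, so the final residual is essentially zero.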

### 2.3 Analysis

In this subsection, we analyze the performance of the proposed algorithm for general positive definite $H_t$, $t \ge 1$. Specifically, we provide an expected convergence rate for the iterative solutions. To achieve this goal, we first present a technical lemma, which will facilitate the later analysis.

###### Lemma 1.

Let $\ell$ and $\varphi$ be convex functions and $H_t$ be positive definite, for $t \ge 1$. Then for Algorithm 1, we have the following inequality:

$$\ell(w_t) + \varphi(v_{t+1}) - \ell(w) - \varphi(v) + (z_{t+1} - z)^\top F(z_{t+1}) \le \frac{\eta\|g_t\|_{H_t^*}^2}{2} + \frac{1}{\eta}\big[B_{\phi_t}(w_t, w) - B_{\phi_t}(w_{t+1}, w)\big] + \frac{\beta}{2}\big(\|Aw + Bv_t - b\|^2 - \|Aw + Bv_{t+1} - b\|^2\big) + \langle \delta_t, w - w_t \rangle + \frac{1}{2\beta}\big(\|\theta - \theta_t\|^2 - \|\theta - \theta_{t+1}\|^2\big),$$

where $z = (w^\top, v^\top, \theta^\top)^\top$, $F(z) = \big((-A^\top\theta)^\top, (-B^\top\theta)^\top, (Aw + Bv - b)^\top\big)^\top$, $\delta_t = g_t - \mathbb{E}[g_t]$, and $\|\cdot\|_{H_t^*}$ denotes the dual norm $\|\cdot\|_{H_t^{-1}}$.

###### Proof.

First, using the convexity of $\ell$ and the definition of $\delta_t = g_t - \mathbb{E}[g_t]$, we can obtain

$$\ell(w_t) - \ell(w) \le \langle g_t, w_t - w \rangle + \langle \delta_t, w - w_t \rangle.$$

Combining the above inequality with the relation $\theta_{t+1} = \theta_t - \beta(Aw_{t+1} + Bv_{t+1} - b)$ yields

$$\ell(w_t) - \ell(w) + \langle w_{t+1} - w, -A^\top\theta_{t+1} \rangle \le \langle g_t, w_{t+1} - w \rangle + \langle \delta_t, w - w_t \rangle + \langle g_t, w_t - w_{t+1} \rangle + \langle w_{t+1} - w, A^\top[\beta(Aw_{t+1} + Bv_{t+1} - b) - \theta_t] \rangle$$
$$= \underbrace{\langle g_t + A^\top[\beta(Aw_{t+1} + Bv_t - b) - \theta_t],\, w_{t+1} - w \rangle}_{L_t} + \underbrace{\langle w - w_{t+1},\, \beta A^\top B(v_t - v_{t+1}) \rangle}_{M_t} + \langle \delta_t, w - w_t \rangle + \underbrace{\langle g_t, w_t - w_{t+1} \rangle}_{N_t}.$$

To provide an upper bound for the first term $L_t$, taking $\phi = \phi_t$ and applying Lemma 1 in [15] to the step of computing $w_{t+1}$ in Algorithm 1, we have

$$\langle g_t + A^\top[\beta(Aw_{t+1} + Bv_t - b) - \theta_t],\, w_{t+1} - w \rangle \le \frac{1}{\eta}\big[B_{\phi_t}(w_t, w) - B_{\phi_t}(w_{t+1}, w) - B_{\phi_t}(w_{t+1}, w_t)\big].$$

To provide an upper bound for the second term $M_t$, we can derive as follows:

$$\langle w - w_{t+1},\, \beta A^\top B(v_t - v_{t+1}) \rangle = \beta\langle Aw - Aw_{t+1},\, Bv_t - Bv_{t+1} \rangle$$
$$= \frac{\beta}{2}\big[\big(\|Aw + Bv_t - b\|^2 - \|Aw + Bv_{t+1} - b\|^2\big) + \big(\|Aw_{t+1} + Bv_{t+1} - b\|^2 - \|Aw_{t+1} + Bv_t - b\|^2\big)\big]$$
$$\le \frac{\beta}{2}\big(\|Aw + Bv_t - b\|^2 - \|Aw + Bv_{t+1} - b\|^2\big) + \frac{1}{2\beta}\|\theta_{t+1} - \theta_t\|^2.$$

To derive an upper bound for the final term $N_t$, we can use Young's inequality to get

$$\langle g_t, w_t - w_{t+1} \rangle \le \frac{\eta\|g_t\|_{H_t^*}^2}{2} + \frac{\|w_t - w_{t+1}\|_{H_t}^2}{2\eta} = \frac{\eta\|g_t\|_{H_t^*}^2}{2} + \frac{B_{\phi_t}(w_t, w_{t+1})}{\eta}.$$

Replacing the terms $L_t$, $M_t$, and $N_t$ with their upper bounds, we get

$$\ell(w_t) - \ell(w) + \langle w_{t+1} - w, -A^\top\theta_{t+1} \rangle \le \frac{1}{\eta}\big[B_{\phi_t}(w_t, w) - B_{\phi_t}(w_{t+1}, w)\big] + \frac{\eta\|g_t\|_{H_t^*}^2}{2} + \langle \delta_t, w - w_t \rangle$$
$$+ \frac{\beta}{2}\big(\|Aw + Bv_t - b\|^2 - \|Aw + Bv_{t+1} - b\|^2\big) + \frac{1}{2\beta}\|\theta_{t+1} - \theta_t\|^2.$$

Due to the optimality condition of the step of updating $v$ in Algorithm 1, i.e., $0 \in \partial\varphi(v_{t+1}) - B^\top\theta_{t+1}$, and the convexity of $\varphi$, we have

$$\varphi(v_{t+1}) - \varphi(v) + \langle v_{t+1} - v, -B^\top\theta_{t+1} \rangle \le 0.$$

Using the fact $\theta_{t+1} = \theta_t - \beta(Aw_{t+1} + Bv_{t+1} - b)$, we have

$$\langle \theta_{t+1} - \theta,\, Aw_{t+1} + Bv_{t+1} - b \rangle = \frac{1}{2\beta}\big(\|\theta - \theta_t\|^2 - \|\theta - \theta_{t+1}\|^2\big) - \frac{1}{2\beta}\|\theta_{t+1} - \theta_t\|^2.$$

Combining the above three inequalities and re-arranging the terms concludes the proof. ∎

Given the above lemma, we can now analyze the convergence behavior of Algorithm 1. Specifically, we provide an upper bound on the objective value and the feasibility violation.

###### Theorem 1.

Let $\ell$ and $\varphi$ be convex functions and $H_t$ be positive definite, for $t \ge 1$. Then for Algorithm 1, we have the following inequality for any $T \ge 1$ and $\rho > 0$:

$$\mathbb{E}\big[f(\bar{u}_T) - f(u_*) + \rho\|A\bar{w}_T + B\bar{v}_T - b\|\big] \le \frac{1}{2T}\Big(\mathbb{E}\sum_{t=1}^{T}\big[\tfrac{2}{\eta}\big(B_{\phi_t}(w_t, w_*) - B_{\phi_t}(w_{t+1}, w_*)\big) + \eta\|g_t\|_{H_t^*}^2\big] + \beta D_{v_*,B}^2 + \frac{\rho^2}{\beta}\Big), \tag{2}$$

where $\bar{w}_T = \frac{1}{T}\sum_{t=1}^T w_t$, $\bar{v}_T = \frac{1}{T}\sum_{t=1}^T v_{t+1}$, $\bar{u}_T = (\bar{w}_T^\top, \bar{v}_T^\top)^\top$, $u_* = (w_*^\top, v_*^\top)^\top$, and $D_{v_*,B} = \|B(v_1 - v_*)\|$.

###### Proof.

For convenience, we denote $\bar{\theta}_T = \frac{1}{T}\sum_{t=1}^T \theta_{t+1}$, $\bar{z}_T = (\bar{w}_T^\top, \bar{v}_T^\top, \bar{\theta}_T^\top)^\top$, and $z_* = (w_*^\top, v_*^\top, \theta^\top)^\top$. With these notations, using the convexity of $\ell$ and $\varphi$ and the monotonicity of the operator $F$, we have for any $z$:

$$f(\bar{u}_T) - f(u) + (\bar{z}_T - z)^\top F(\bar{z}_T) \le \frac{1}{T}\sum_{t=1}^{T}\big[\ell(w_t) + \varphi(v_{t+1}) - \ell(w) - \varphi(v) + (z_{t+1} - z)^\top F(z_{t+1})\big].$$

Combining this inequality with Lemma 1 at the optimal solution $(w, v) = (w_*, v_*)$, we can derive

$$f(\bar{u}_T) - f(u_*) + (\bar{z}_T - z_*)^\top F(\bar{z}_T) \le \frac{1}{T}\sum_{t=1}^{T}\Big\{\frac{1}{\eta}\big[B_{\phi_t}(w_t, w_*) - B_{\phi_t}(w_{t+1}, w_*)\big] + \frac{\eta\|g_t\|_{H_t^*}^2}{2} + \langle \delta_t, w_* - w_t \rangle$$
$$+ \frac{\beta}{2}\big(\|Aw_* + Bv_t - b\|^2 - \|Aw_* + Bv_{t+1} - b\|^2\big) + \frac{1}{2\beta}\big(\|\theta - \theta_t\|^2 - \|\theta - \theta_{t+1}\|^2\big)\Big\}$$
$$\le \frac{1}{T}\Big\{\sum_{t=1}^{T}\Big[\frac{1}{\eta}\big[B_{\phi_t}(w_t, w_*) - B_{\phi_t}(w_{t+1}, w_*)\big] + \frac{\eta\|g_t\|_{H_t^*}^2}{2} + \langle \delta_t, w_* - w_t \rangle\Big] + \frac{\beta}{2}\|Aw_* + Bv_1 - b\|^2 + \frac{1}{2\beta}\|\theta - \theta_1\|^2\Big\}$$
$$\le \frac{1}{T}\Big\{\sum_{t=1}^{T}\Big[\frac{1}{\eta}\big[B_{\phi_t}(w_t, w_*) - B_{\phi_t}(w_{t+1}, w_*)\big] + \frac{\eta\|g_t\|_{H_t^*}^2}{2} + \langle \delta_t, w_* - w_t \rangle\Big] + \frac{\beta}{2}D_{v_*,B}^2 + \frac{1}{2\beta}\|\theta - \theta_1\|^2\Big\},$$

where the last step uses $Aw_* + Bv_1 - b = B(v_1 - v_*)$, since $Aw_* + Bv_* = b$.

Because the above inequality is valid for any $\theta$, it also holds in the ball $\mathcal{B}_\rho = \{\theta : \|\theta\| \le \rho\}$. Combining this with the fact that the optimal solution must also be feasible, i.e., $Aw_* + Bv_* = b$, it follows that

$$\max_{\theta \in \mathcal{B}_\rho}\big\{f(\bar{u}_T) - f(u_*) + (\bar{z}_T - z_*)^\top F(\bar{z}_T)\big\} = \max_{\theta \in \mathcal{B}_\rho}\big\{f(\bar{u}_T) - f(u_*) + \bar{\theta}_T^\top(Aw_* + Bv_* - b) - \theta^\top(A\bar{w}_T + B\bar{v}_T - b)\big\}$$
$$= \max_{\theta \in \mathcal{B}_\rho}\big\{f(\bar{u}_T) - f(u_*) - \theta^\top(A\bar{w}_T + B\bar{v}_T - b)\big\} = f(\bar{u}_T) - f(u_*) + \rho\|A\bar{w}_T + B\bar{v}_T - b\|.$$

Combining the above two inequalities and taking expectation, we have

$$\mathbb{E}\big[f(\bar{u}_T) - f(u_*) + \rho\|A\bar{w}_T + B\bar{v}_T - b\|\big] \le \frac{1}{T}\mathbb{E}\Big\{\sum_{t=1}^{T}\Big(\frac{1}{\eta}\big[B_{\phi_t}(w_t, w_*) - B_{\phi_t}(w_{t+1}, w_*)\big] + \frac{\eta\|g_t\|_{H_t^*}^2}{2} + \langle \delta_t, w_* - w_t \rangle\Big) + \frac{\beta}{2}D_{v_*,B}^2 + \frac{1}{2\beta}\|\theta - \theta_1\|^2\Big\}$$
$$\le \frac{1}{2T}\Big\{\mathbb{E}\sum_{t=1}^{T}\big[\frac{2}{\eta}\big[B_{\phi_t}(w_t, w_*) - B_{\phi_t}(w_{t+1}, w_*)\big] + \eta\|g_t\|_{H_t^*}^2\big] + \beta D_{v_*,B}^2 + \frac{\rho^2}{\beta}\Big\},$$

where we used the fact $\mathbb{E}\langle \delta_t, w_* - w_t \rangle = 0$ (since $w_t$ is independent of $\xi_t$) and $\|\theta - \theta_1\| \le \rho$ for $\theta \in \mathcal{B}_\rho$ with $\theta_1 = 0$ in the last step. This completes the proof. ∎

The above theorem allows us to derive regret bounds for a family of algorithms that iteratively modify the proximal functions in an attempt to lower the regret bounds. Since the rate of convergence still depends on $H_t$ and $\eta$, we next choose an appropriate positive definite matrix $H_t$ and constant $\eta$ to optimize the rate of convergence.

### 2.4 Diagonal Matrix Proximal Functions

In this subsection, we restrict $H_t$ to be a diagonal matrix, for two reasons: (i) the diagonal case yields results that are easier to understand than those for a general matrix; (ii) for high-dimensional problems, a general matrix may incur prohibitively expensive computational cost, which is not desirable.

First, we notice that the upper bound in Theorem 1 relies on $\sum_{t=1}^T \|g_t\|_{H_t^*}^2$. If we assumed all the $g_t$'s were known in advance, we could minimize this term by setting $H_t = \mathrm{diag}(s)$ for all $t$. We shall use the following proposition.

###### Proposition 1.

For any $g_1, \dots, g_T$, we have

$$\min_{\mathrm{diag}(s) \succeq 0,\ \mathbf{1}^\top s \le c}\ \sum_{t=1}^{T}\|g_t\|_{\mathrm{diag}(s)^{-1}}^2 = \frac{1}{c}\Big(\sum_{i=1}^{d_1}\|g_{1:T,i}\|\Big)^2,$$

where $g_{1:T,i} = (g_{1,i}, \dots, g_{T,i})^\top$ and the minimum is attained at $s_i = c\,\|g_{1:T,i}\| / \sum_{j=1}^{d_1}\|g_{1:T,j}\|$.
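As a quick sanity check (not part of the paper's analysis), the closed form above can be verified numerically; the random subgradient matrix and the budget $c$ below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
T, d1, c = 50, 6, 1.0
G = rng.standard_normal((T, d1))             # rows are hypothetical subgradients g_t

col = np.sqrt((G**2).sum(axis=0))            # ||g_{1:T,i}|| for each coordinate i
s_star = c * col / col.sum()                 # claimed minimizer
obj = lambda s: np.sum(G**2 / s)             # sum_t ||g_t||^2 w.r.t. diag(s)^{-1}
opt = col.sum()**2 / c                       # claimed optimal value (1/c)(sum_i ||g_{1:T,i}||)^2

assert np.isclose(obj(s_star), opt)
# any other feasible s (s >= 0, 1^T s <= c) should do no better
s_rand = rng.random(d1)
s_rand = c * s_rand / s_rand.sum()
assert obj(s_rand) >= opt - 1e-9
```

The second assertion reflects the Cauchy-Schwarz argument behind the proposition: for any feasible $s$, $\sum_i \|g_{1:T,i}\|^2 / s_i \ge (\sum_i \|g_{1:T,i}\|)^2 / c$.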

We omit the proof of this proposition, since it is easy to derive. Since we do not have all the $g_t$'s in advance, we receive the stochastic (sub)gradients sequentially instead. As a result, we propose to update $H_t$ incrementally as:

$$H_t = aI + \mathrm{diag}(s_t),$$

where $a \ge 0$ and $s_{t,i} = \|g_{1:t,i}\|$. For these $H_t$'s, we have the following inequality

$$\sum_{t=1}^{T}\|g_t\|_{H_t^*}^2 = \sum_{t=1}^{T}\langle g_t, (aI + \mathrm{diag}(s_t))^{-1}g_t \rangle \le \sum_{t=1}^{T}\langle g_t, \mathrm{diag}(s_t)^{-1}g_t \rangle \le 2\sum_{i=1}^{d_1}\|g_{1:T,i}\|, \tag{3}$$

where the last inequality uses Lemma 4 in [3], which implies this update is a nearly optimal update method for the diagonal matrix case. Finally, the adaptive stochastic ADMM with diagonal matrix update is summarized in Algorithm 2.
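The near-optimality claim in inequality (3) can also be checked numerically. The sketch below draws hypothetical random subgradients, maintains $s_{t,i} = \|g_{1:t,i}\|$ incrementally (taking $a = 0$), and verifies the bound $\sum_t \|g_t\|_{\mathrm{diag}(s_t)^{-1}}^2 \le 2\sum_i \|g_{1:T,i}\|$; the data and dimensions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d1 = 200, 10
G = rng.standard_normal((T, d1))   # hypothetical subgradients g_1, ..., g_T

s = np.zeros(d1)
lhs = 0.0
for g in G:
    s = np.sqrt(s**2 + g**2)       # incremental s_{t,i} = ||g_{1:t,i}||_2
    lhs += np.sum(g**2 / s)        # ||g_t||^2 w.r.t. diag(s_t)^{-1}

rhs = 2 * np.sum(np.sqrt((G**2).sum(axis=0)))   # 2 * sum_i ||g_{1:T,i}||
assert lhs <= rhs
```

Comparing `lhs` with the hindsight optimum from Proposition 1 shows the incremental update pays at most a factor of two over the best fixed diagonal matrix.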

For the convergence rate of the proposed Algorithm 2, we have the following specific theorem.

###### Theorem 2.

Let $\ell$ and $\varphi$ be convex functions. Then for Algorithm 2, we have the following inequality for any $T \ge 1$ and $\rho > 0$:

$$\mathbb{E}\big[f(\bar{u}_T) - f(u_*) + \rho\|A\bar{w}_T + B\bar{v}_T - b\|\big] \le \frac{1}{2T}\Big(\mathbb{E}\Big[2\eta\sum_{i=1}^{d_1}\|g_{1:T,i}\| + \frac{1}{\eta}\max_{t \le T}\|w_t - w_*\|_\infty^2\sum_{i=1}^{d_1}\|g_{1:T,i}\|\Big] + \beta D_{v_*,B}^2 + \frac{\rho^2}{\beta}\Big).$$

If we further set $\eta = D_{w,\infty}/\sqrt{2}$, where $D_{w,\infty} \ge \max_{t \le T}\|w_t - w_*\|_\infty$, then we have

$$\mathbb{E}\big[f(\bar{u}_T) - f(u_*) + \rho\|A\bar{w}_T + B\bar{v}_T - b\|\big] \le \frac{1}{T}\Big(\sqrt{2}\,\mathbb{E}\Big[D_{w,\infty}\sum_{i=1}^{d_1}\|g_{1:T,i}\|\Big] + \frac{\beta}{2}D_{v_*,B}^2 + \frac{\rho^2}{2\beta}\Big).$$
###### Proof.

We have the following inequality

$$2\sum_{t=1}^{T}\big[B_{\phi_t}(w_t, w_*) - B_{\phi_t}(w_{t+1}, w_*)\big] = \sum_{t=1}^{T}\big(\|w_t - w_*\|_{H_t}^2 - \|w_{t+1} - w_*\|_{H_t}^2\big)$$
$$\le \|w_1 - w_*\|_{H_1}^2 + \sum_{t=1}^{T-1}\langle w_{t+1} - w_*,\ \mathrm{diag}(s_{t+1} - s_t)(w_{t+1} - w_*) \rangle$$
$$\le \|w_1 - w_*\|_{H_1}^2 + \sum_{t=1}^{T-1}\max_i\,(w_{t+1,i} - w_{*,i})^2\,\|s_{t+1} - s_t\|_1 = \|w_1 - w_*\|_{H_1}^2 + \sum_{t=1}^{T-1}\|w_{t+1} - w_*\|_\infty^2\,(s_{t+1} - s_t)^\top\mathbf{1}$$
$$\le \|w_1 - w_*\|_{H_1}^2 + \max_{t \le T}\|w_t - w_*\|_\infty^2\, s_T^\top\mathbf{1} - \|w_1 - w_*\|_\infty^2\, s_1^\top\mathbf{1} \le \max_{t \le T}\|w_t - w_*\|_\infty^2\sum_{i=1}^{d_1}\|g_{1:T,i}\|,$$

where the last inequality uses $\|w_1 - w_*\|_{H_1}^2 \le \|w_1 - w_*\|_\infty^2\, s_1^\top\mathbf{1}$ (taking $a = 0$) and $s_T^\top\mathbf{1} = \sum_{i=1}^{d_1}\|g_{1:T,i}\|$.

Plugging the above inequality and inequality (3) into inequality (2) concludes the first part of the theorem. The second part is then straightforward to derive. ∎

###### Remark 3.

For the example of sparse random data, assume that at each round $t$, feature $i$ appears with probability $p_i = \min\{1, c\,i^{-\alpha}\}$ for some $\alpha \ge 2$ and a constant $c$. Then

$$\mathbb{E}\Big[\sum_{i=1}^{d_1}\|g_{1:T,i}\|\Big] = \sum_{i=1}^{d_1}\mathbb{E}\Big[\sqrt{|\{t : |g_{t,i}| = 1\}|}\Big] \le \sum_{i=1}^{d_1}\sqrt{\mathbb{E}\,|\{t : |g_{t,i}| = 1\}|} = \sum_{i=1}^{d_1}\sqrt{T p_i}.$$

In this case, the bound of Theorem 2 scales as $\sum_{i=1}^{d_1}\sqrt{p_i}/\sqrt{T}$.

### 2.5 Full Matrix Proximal Functions

In this subsection, we derive and analyze new updates that estimate a full matrix for the proximal function instead of a diagonal one. Although full matrix computation may not be attractive for high-dimensional problems, it may be helpful for tasks with low dimension. Furthermore, it provides us with a more complete insight. Similar to the analysis for the diagonal case, we first introduce the following proposition (Lemma 15 in [3]).

###### Proposition 2.

For any $g_1, \dots, g_T$, we have the following inequality

$$\min_{S \succeq 0,\ \mathrm{tr}(S) \le c}\ \sum_{t=1}^{T}\|g_t\|_{S^{-1}}^2 = \frac{1}{c}\big(\mathrm{tr}(G_T^{1/2})\big)^2,$$

where $G_T = \sum_{t=1}^T g_t g_t^\top$, and the minimizer is attained at $S = c\,G_T^{1/2}/\mathrm{tr}(G_T^{1/2})$. If $G_T$ is not of full rank, then we use its pseudo-inverse to replace its inverse in the minimization problem.

Because the (sub)gradients are received sequentially, we propose to update $H_t$ incrementally as

$$H_t = aI + G_t^{1/2},$$

where $a \ge 0$ and $G_t = \sum_{\tau=1}^{t} g_\tau g_\tau^\top$, with $S_t = G_t^{1/2}$. For these $H_t$'s, we have the following inequalities

$$\sum_{t=1}^{T}\|g_t\|_{H_t^*}^2 \le \sum_{t=1}^{T}\|g_t\|_{S_t^{-1}}^2 \le 2\sum_{t=1}^{T}\|g_t\|_{S_T^{-1}}^2 = 2\,\mathrm{tr}(G_T^{1/2}), \tag{4}$$

where the last inequality uses Lemma 10 in [3], which implies this update is a nearly optimal update method for the full matrix case. Finally, the adaptive stochastic ADMM with full matrix update is summarized in Algorithm 3.
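Inequality (4) can likewise be verified numerically for the full matrix update. The sketch below accumulates $G_t$, forms $S_t = G_t^{1/2}$ by eigendecomposition, applies the pseudo-inverse as the proposition suggests for rank-deficient $G_t$, and checks $\sum_t \|g_t\|_{S_t^{-1}}^2 \le 2\,\mathrm{tr}(G_T^{1/2})$; the random data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d1 = 100, 5
lhs, Gt = 0.0, np.zeros((d1, d1))

for _ in range(T):
    g = rng.standard_normal(d1)
    Gt += np.outer(g, g)                     # G_t = sum of gradient outer products
    vals, vecs = np.linalg.eigh(Gt)          # symmetric eigendecomposition
    S = (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T   # S_t = G_t^{1/2}
    lhs += g @ np.linalg.pinv(S) @ g         # ||g_t||^2 w.r.t. S_t^{-1} (pseudo-inverse)

rhs = 2.0 * np.trace(S)                      # after the loop, S equals G_T^{1/2}
assert lhs <= rhs
```

Compared with the diagonal sketch, each step here costs an eigendecomposition, which is why the text reserves the full matrix update for low-dimensional tasks.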

For the convergence rate of the proposed Algorithm 3, we have the following theorem.

###### Theorem 4.

Let $\ell$ and $\varphi$ be convex functions. Then for Algorithm 3, we have the following inequality for any $T \ge 1$ and $\rho > 0$:

$$\mathbb{E}\big[f(\bar{u}_T) - f(u_*) + \rho\|A\bar{w}_T + B\bar{v}_T - b\|\big] \le \frac{1}{2T}\Big(\mathbb{E}\Big[2\eta\,\mathrm{tr}(G_T^{1/2}) + \frac{1}{\eta}\max_{t \le T}\|w_t - w_*\|^2\,\mathrm{tr}(G_T^{1/2})\Big] + \beta D_{v_*,B}^2 + \frac{\rho^2}{\beta}\Big).$$

Furthermore, if we set $\eta = D_{w,2}/\sqrt{2}$, where $D_{w,2} \ge \max_{t \le T}\|w_t - w_*\|$, then we have

$$\mathbb{E}\big[f(\bar{u}_T) - f(u_*) + \rho\|A\bar{w}_T + B\bar{v}_T - b\|\big] \le \frac{1}{T}\Big(\sqrt{2}\,\mathbb{E}\big[D_{w,2}\,\mathrm{tr}(G_T^{1/2})\big] + \frac{\beta}{2}D_{v_*,B}^2 + \frac{\rho^2}{2\beta}\Big).$$
###### Proof.

We consider the sum of the differences

$$2\sum_{t=1}^{T}\big[B_{\phi_t}(w_t, w_*) - B_{\phi_t}(w_{t+1}, w_*)\big] = \sum_{t=1}^{T}\big(\|w_t - w_*\|_{H_t}^2 - \|w_{t+1} - w_*\|_{H_t}^2\big)$$