# Accelerated Primal-Dual Proximal Block Coordinate Updating Methods for Constrained Convex Optimization

Block coordinate update (BCU) methods enjoy low per-update computational complexity because at each iteration only one or a few block variables need to be updated among a possibly large number of blocks. They are also easily parallelized, and thus they have been particularly popular for solving problems involving large-scale datasets and/or variables. In this paper, we propose a primal-dual BCU method for solving linearly constrained convex programs in multi-block variables. The method is an accelerated version of a primal-dual algorithm proposed by the authors, which applies randomization in selecting block variables to update and establishes an O(1/t) convergence rate under mere convexity assumption. We show that the rate can be accelerated to O(1/t^2) if the objective is strongly convex. In addition, if one block variable is independent of the others in the objective, we show that the algorithm can be modified to achieve a linear rate of convergence. The numerical experiments show that the accelerated method performs stably with a single set of parameters, while the original method needs to tune the parameters for different datasets in order to achieve a comparable level of performance.


## 1 Introduction

Motivated by the need to solve large-scale optimization problems and increasing capabilities in parallel computing, block coordinate update (BCU) methods have become particularly popular in recent years due to their low per-update computational complexity, low memory requirements, and their potential in distributed computing environments. In the context of optimization, BCU first appeared in the form of block coordinate descent (BCD) type algorithms, which can be applied to solve unconstrained smooth problems or those with separable nonsmooth terms in the objective (possibly with separable constraints). More recently, it has been developed for solving problems with nonseparable nonsmooth terms and/or constraints in a primal-dual framework.

In this paper, we consider the following linearly constrained multi-block structured optimization model:

$$\min_x\; f(x)+\sum_{i=1}^M g_i(x_i),\quad \text{s.t.}\;\sum_{i=1}^M A_ix_i=b,\tag{1}$$

where $x$ is partitioned into $M$ disjoint blocks $(x_1,\dots,x_M)$, $f$ is a smooth convex function with Lipschitz continuous gradient, and each $g_i$ is proper closed convex and possibly non-differentiable. Note that $g_i$ can include an indicator function of a convex set $X_i$, and thus (1) can implicitly include certain separable block constraints in addition to the nonseparable linear constraint.
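To make the setup concrete, the following sketch builds a tiny synthetic instance of the model above; all data, block sizes, and the particular choices $f(x)=\frac12\|Cx-d\|^2$ and $g_i=\|\cdot\|_1$ are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# A tiny synthetic instance of problem (1), for illustration only:
#   minimize f(x) + sum_i g_i(x_i)  subject to  sum_i A_i x_i = b,
# with f(x) = 0.5*||C x - d||^2 (smooth) and g_i(x_i) = ||x_i||_1 (nonsmooth).
rng = np.random.default_rng(0)
M, ni, p = 4, 3, 6               # number of blocks, block size, constraint rows
n = M * ni
C = rng.standard_normal((8, n))
d = rng.standard_normal(8)
A_blocks = [rng.standard_normal((p, ni)) for _ in range(M)]
x_feas = rng.standard_normal(n)  # a point used to define a consistent b
b = sum(A_blocks[i] @ x_feas[i*ni:(i+1)*ni] for i in range(M))

def f(x):            # smooth part; its gradient is Lipschitz continuous
    return 0.5 * np.linalg.norm(C @ x - d)**2

def g(x):            # separable nonsmooth part, one l1-term per block
    return sum(np.linalg.norm(x[i*ni:(i+1)*ni], 1) for i in range(M))

def residual(x):     # r = A x - b with A = [A_1, ..., A_M]
    return sum(A_blocks[i] @ x[i*ni:(i+1)*ni] for i in range(M)) - b

print(np.linalg.norm(residual(x_feas)))  # x_feas is feasible by construction
```

Note how the linear constraint couples the blocks even though the nonsmooth part is fully separable; this coupling is what plain BCD cannot handle.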

Many applications arising in statistical and machine learning, image processing, and finance can be formulated in the form of (1), including basis pursuit [7], constrained regression [23], support vector machine in its dual form [10], and portfolio optimization [28], just to name a few.
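As a concrete instantiation, basis pursuit fits the template (1) with no smooth part ($f\equiv 0$) and $g_i=\|\cdot\|_1$, once the variable $x$ and the sensing matrix $A$ are partitioned into $M$ blocks:

```latex
\min_{x}\ \sum_{i=1}^{M}\|x_i\|_1
\qquad \text{s.t.}\qquad \sum_{i=1}^{M} A_i x_i = b .
```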

Towards finding a solution of (1), we will first present an accelerated proximal Jacobian alternating direction method of multipliers (Algorithm 1), and then we generalize it to an accelerated randomized primal-dual block coordinate update method (Algorithm 2). Assuming strong convexity of the objective function, we establish $O(1/t^2)$ convergence rate results for the proposed algorithms by adaptively setting the parameters, where $t$ is the total number of iterations. In addition, under further smoothness and full-rankness assumptions, we obtain linear convergence of a modified method (Algorithm 3).

### 1.1 Related methods

Our algorithms are closely related to randomized coordinate descent methods, primal-dual coordinate update methods, and accelerated primal-dual methods. In this subsection, let us briefly review the three classes of methods and discuss their relations to our algorithms.

#### Randomized coordinate descent methods

In the absence of the linear constraint, Algorithm 2 specializes to randomized coordinate descent (RCD), which was first proposed in [31] for smooth problems and later generalized in [38, 27] to nonsmooth problems. It was shown that RCD converges sublinearly with rate $O(1/t)$, which can be accelerated to $O(1/t^2)$ for convex problems, and that it achieves a linear rate for strongly convex problems. By choosing multiple block variables at each iteration, [37] proposed to parallelize the RCD method and showed the same convergence results for parallelized RCD. This is similar to taking $|S_k|=m>1$ in Algorithm 2, allowing parallel updates on the selected blocks.
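In the unconstrained smooth case, the basic (non-accelerated) RCD iteration can be sketched as follows; the quadratic objective and the coordinate step sizes $1/L_i$ with $L_i=Q_{ii}$ are illustrative choices, not the paper's setting:

```python
import numpy as np

# Minimal randomized coordinate descent (RCD) sketch for the unconstrained
# smooth problem  min_x 0.5*x^T Q x - c^T x  (Q symmetric positive definite).
# For a quadratic, coordinate i has partial-gradient Lipschitz constant Q[i, i],
# so the step below is an exact minimization along coordinate i.
rng = np.random.default_rng(1)
n = 20
B = rng.standard_normal((n, n))
Q = B @ B.T + n * np.eye(n)      # strongly convex quadratic
c = rng.standard_normal(n)
x_star = np.linalg.solve(Q, c)   # reference solution

x = np.zeros(n)
L = np.diag(Q)
for _ in range(20000):
    i = rng.integers(n)          # pick one coordinate uniformly at random
    grad_i = Q[i] @ x - c[i]     # partial gradient w.r.t. coordinate i
    x[i] -= grad_i / L[i]        # coordinate gradient step
print(np.linalg.norm(x - x_star))
```

Because the problem is strongly convex, the iterates contract linearly in expectation, consistent with the linear-rate result quoted above.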

#### Primal-dual coordinate update methods

In the presence of linear constraints, coordinate descent methods may fail to converge to a solution of the problem because, fixing all but one block, the selected block variable may be uniquely determined by the linear constraint. To perform coordinate updates on the linearly constrained problem (1), one effective approach is to update both primal and dual variables. Under this framework, the alternating direction method of multipliers (ADMM) is one popular choice. Originally, ADMM [17, 14] was proposed for solving two-block structured problems with separable objective (obtained by setting $f=0$ and $M=2$ in (1)), for which its convergence and convergence rate have been well established (see e.g. [2, 13, 22, 29]). However, directly extending ADMM to the multi-block setting such as (1) may fail to converge; see [6] for a divergence example of the ADMM even for solving a linear system of equations. Much effort has been devoted to establishing the convergence of multi-block ADMM under stronger assumptions (see e.g. [4, 6, 25, 26, 16]), such as strong convexity or orthogonality conditions on the linear constraint. Without additional assumptions, modification is necessary for the ADMM applied to multi-block problems to be convergent; see [12, 19, 20, 39] for example. Very recently, [15] proposed a randomized primal-dual coordinate (RPDC) update method, whose asynchronous parallel version was then studied in [41]. Applied to (1), RPDC is a special case of Algorithm 2 with fixed parameters. It was shown that RPDC converges with rate $O(1/t)$ under convexity assumption. More generally than solving an optimization problem, primal-dual coordinate (PDC) update methods have also appeared for solving fixed-point or monotone inclusion problems [36, 35, 9, 34]. However, for these problems, the PDC methods are only shown to converge, and no convergence rate estimates are known unless additional assumptions are made, such as a strong monotonicity condition.

#### Accelerated primal-dual methods

It is possible to accelerate the rate of convergence from $O(1/t)$ to $O(1/t^2)$ for gradient-type methods. The first acceleration result was shown by Nesterov [30] for solving smooth unconstrained problems. The technique has been generalized to accelerate gradient-type methods on possibly nonsmooth convex programs [1, 32]. Primal-dual methods for solving linearly constrained problems can also be accelerated by similar techniques. Under convexity assumption, the augmented Lagrangian method (ALM) is accelerated in [21] from an $O(1/t)$ convergence rate to $O(1/t^2)$ by applying a technique similar to that in [1] to the multiplier update, and [40] accelerates the linearized ALM using a technique similar to that in [32]. Assuming strong convexity of the objective, [18] accelerates the ADMM, and the assumption is weakened in [40] to strong convexity of one component of the objective function. For bilinear saddle-point problems, various primal-dual methods can be accelerated if either the primal or the dual problem is strongly convex [5, 11, 3]. Without strong convexity, partial acceleration is still possible, with the rate depending on some other quantities; see e.g. [33, 8].

### 1.2 Contributions of this paper

We accelerate the proximal Jacobian ADMM [12] and also generalize it to an accelerated primal-dual coordinate updating method for linearly constrained multi-block structured convex programs whose objective contains a nonseparable smooth function. With parameters fixed during all iterations, the generalized method reduces to that in [15] and enjoys an $O(1/t)$ convergence rate under mere convexity assumption. By adaptively setting the parameters at different iterations, we show that the accelerated method has an $O(1/t^2)$ convergence rate if the objective is strongly convex. In addition, if there is one block variable that is independent of all others in the objective (but coupled in the linear constraint) and the corresponding component function is smooth, we modify the algorithm by treating that independent variable differently and establish a linear convergence result. Numerically, we test the accelerated method on quadratic programming and compare it to the (nonaccelerated) RPDC method in [15]. The results demonstrate that the accelerated method performs efficiently and stably with the parameters automatically set in accordance with the analysis, while the RPDC method needs to tune its parameters for different data in order to achieve comparable performance.

### 1.3 Nomenclature and basic facts

Notations. For a positive integer $M$, we denote $[M]=\{1,\dots,M\}$. We let $x_S$ denote the subvector of $x$ with blocks indexed by $S$. Namely, if $S=\{i_1,\dots,i_k\}$, then $x_S=(x_{i_1},\dots,x_{i_k})$. Similarly, $A_S$ denotes the submatrix of $A$ with column blocks indexed by $S$, and $g_S$ denotes the sum of the component functions indicated by $S$. We use $\nabla_i f(x)$ for the partial gradient of $f$ with respect to $x_i$ at $x$, and $\nabla_S f(x)$ for that with respect to $x_S$. For a nondifferentiable function $g$, $\tilde\nabla g(x)$ denotes a subgradient of $g$ at $x$. We reserve $I$ for the identity matrix and use $\|\cdot\|$ for the Euclidean norm. Given a symmetric positive semidefinite (PSD) matrix $W$, for any vector $v$ of appropriate size, we define $\|v\|_W^2=v^\top Wv$, and

$$\Delta_W(v^+,v^o,v)=\frac12\big[\|v^+-v\|_W^2-\|v^o-v\|_W^2+\|v^+-v^o\|_W^2\big].\tag{2}$$

If $W=I$, we simply write $\Delta(v^+,v^o,v)$. Also, we denote

$$g(x)=\sum_{i=1}^M g_i(x_i),\quad F(x)=f(x)+g(x),\quad \Phi(\hat x,x,\lambda)=F(\hat x)-F(x)-\langle\lambda,A\hat x-b\rangle.\tag{3}$$

Preparations. A point $(x^*,\lambda^*)$ is called a Karush-Kuhn-Tucker (KKT) point of (1) if

$$0\in\partial F(x^*)-A^\top\lambda^*,\qquad Ax^*-b=0.\tag{4}$$

For convex programs, the conditions in (4) are sufficient for $x^*$ to be an optimal solution of (1), and they are also necessary if a certain qualification condition holds (e.g., the Slater condition: there is a feasible point in the interior of the domain of $F$). Together with the convexity of $F$, (4) implies

$$\Phi(x,x^*,\lambda^*)\ge 0,\quad\forall x.\tag{5}$$

We will use the following lemmas as basic facts. The first lemma is straightforward to verify from the definition of $\|\cdot\|_W$; the second one is similar to Lemma 3.3 in [15]; the third one is from Lemma 3.5 in [15].

###### Lemma 1.1

For any vectors $u$ and $v$ and any symmetric PSD matrix $W$ of appropriate sizes, it holds that

$$u^\top Wv=\frac12\big[\|u\|_W^2-\|u-v\|_W^2+\|v\|_W^2\big].\tag{6}$$
###### Lemma 1.2

Given a function $\phi$, a point $x$, and a random vector $\hat x$, if for any $\lambda$ (that may depend on $\hat x$) it holds that $\mathbb{E}[\Phi(\hat x,x,\lambda)]\le\phi(\lambda)$, then for any $\gamma>0$, we have

$$\mathbb{E}\big[F(\hat x)-F(x)+\gamma\|A\hat x-b\|\big]\le\sup_{\|\lambda\|\le\gamma}\phi(\lambda).$$

Proof. Let $\hat\lambda=-\gamma\frac{A\hat x-b}{\|A\hat x-b\|}$ if $A\hat x\neq b$, and $\hat\lambda=0$ otherwise. Then

$$\Phi(\hat x,x,\hat\lambda)=F(\hat x)-F(x)+\gamma\|A\hat x-b\|.$$

In addition, since $\|\hat\lambda\|\le\gamma$, we have $\mathbb{E}[\Phi(\hat x,x,\hat\lambda)]\le\sup_{\|\lambda\|\le\gamma}\phi(\lambda)$, and thus the desired result follows.

###### Lemma 1.3

Suppose that $\mathbb{E}\big[F(\hat x)-F(x^*)+\gamma\|A\hat x-b\|\big]\le\epsilon$. Then

$$\mathbb{E}\|A\hat x-b\|\le\frac{\epsilon}{\gamma-\|\lambda^*\|},\quad\text{and}\quad -\frac{\epsilon\|\lambda^*\|}{\gamma-\|\lambda^*\|}\le\mathbb{E}\big[F(\hat x)-F(x^*)\big]\le\epsilon,$$

where $(x^*,\lambda^*)$ satisfies the optimality conditions in (4), and we assume $\gamma>\|\lambda^*\|$.

Outline. The rest of the paper is organized as follows. Section 2 presents the accelerated proximal Jacobian ADMM and its convergence results. In Section 3, we propose an accelerated primal-dual block coordinate update method with convergence analysis. Section 4 assumes more structure on the problem (1) and modifies the algorithm in Section 3 to have linear convergence. Numerical results are provided in Section 5. Finally, Section 6 concludes the paper.

## 2 Accelerated proximal Jacobian ADMM

In this section, we propose an accelerated proximal Jacobian ADMM for solving (1). At each iteration, the algorithm updates all block variables in parallel by minimizing a linearized proximal approximation of the augmented Lagrangian function, and then it renews the multiplier. Specifically, it iteratively performs the following updates:

$$x_i^{k+1}=\operatorname*{argmin}_{x_i}\;\langle\nabla_if(x^k)-A_i^\top(\lambda^k-\beta_kr^k),\,x_i\rangle+g_i(x_i)+\frac12\|x_i-x_i^k\|^2_{P_i^k},\quad i=1,\dots,M,\tag{7a}$$
$$\lambda^{k+1}=\lambda^k-\rho_kr^{k+1},\tag{7b}$$

where $\beta_k$ and $\rho_k$ are scalar parameters, $P^k$ is a block diagonal matrix with $P_i^k$ as its $i$-th diagonal block for $i=1,\dots,M$, and $r^k=Ax^k-b$ denotes the residual. Note that (7a) consists of $M$ independent subproblems, and they can be solved in parallel.

Algorithm 1 summarizes the proposed method. It reduces to the proximal Jacobian ADMM in [12] if the parameters $\beta_k$, $\rho_k$, and $P^k$ are fixed for all $k$ and there is no nonseparable function $f$. We will show that adapting the parameters as the iterations progress can accelerate the convergence of the algorithm.
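As a sketch of the updates (7a)-(7b) under stated assumptions: we take $f(x)=\frac12\|x-d\|^2$ (so $L_f=1$ and the objective is strongly convex with $\mu=1$), $g_i=\|\cdot\|_1$ so that (7a) becomes a soft-thresholding step with $P_i^k=\tau_kI$, and a linearly growing parameter schedule in the spirit of (14); the synthetic data and the constants $\beta,\tau$ are heuristic choices, not the paper's tuned values:

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding: the proximal operator of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# min 0.5*||x - d||^2 + ||x||_1  s.t.  A x = b, treated by updates (7a)-(7b)
# with P_i^k = tau_k*I, beta_k = rho_k = k*beta, tau_k = k*tau + L_f.
rng = np.random.default_rng(2)
M, ni, p = 4, 3, 6
n = M * ni
A = rng.standard_normal((p, n))
b = A @ rng.standard_normal(n)
d = rng.standard_normal(n)
Lf = 1.0                               # gradient of f is 1-Lipschitz
mu = 1.0                               # f is 1-strongly convex
nrmA2 = np.linalg.norm(A, 2) ** 2
beta = mu / (2 * nrmA2)
tau = beta * nrmA2                     # so tau*I - beta*A^T A is PSD

x, lam = np.zeros(n), np.zeros(p)
for k in range(1, 5001):
    bk, rhok, tauk = k * beta, k * beta, k * tau + Lf
    r = A @ x - b                      # residual r^k
    s = (x - d) - A.T @ (lam - bk * r) # linearized term in (7a)
    x = soft(x - s / tauk, 1.0 / tauk) # all blocks, in parallel (Jacobian) form
    lam = lam - rhok * (A @ x - b)     # multiplier update (7b)
print(np.linalg.norm(A @ x - b))
```

Since all $P_i^k$ are equal here, the $M$ block subproblems collapse into one vectorized soft-thresholding step; with distinct $P_i^k$, each block would be updated independently.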

### 2.1 Technical assumptions

Throughout the analysis in this section, we make the following assumptions.

###### Assumption 1

There exists $(x^*,\lambda^*)$ satisfying the KKT conditions in (4).

###### Assumption 2

The gradient $\nabla f$ is Lipschitz continuous with modulus $L_f$.

###### Assumption 3

The function $g$ is strongly convex with modulus $\mu>0$.

The first two assumptions are standard, and the third one is needed for showing the convergence rate of $O(1/t^2)$, where $t$ is the number of iterations. Note that if $f$ is strongly convex with modulus $\mu>0$, we can let $\tilde f(x)=f(x)-\frac{\mu}{2}\|x\|^2$ and $\tilde g_i(x_i)=g_i(x_i)+\frac{\mu}{2}\|x_i\|^2$ for each $i$. This way, we have a convex function $\tilde f$ and strongly convex functions $\tilde g_i$. Hence, Assumption 3 is without loss of generality. With only convexity, Algorithm 1 can be shown to converge at the rate $O(1/t)$ with parameters fixed for all iterations, and this order is optimal as shown in the very recent work [24].

### 2.2 Convergence results

In this subsection, we show the convergence rate result of Algorithm 1. First, we establish a result of running one iteration of Algorithm 1.

###### Lemma 2.1 (One-iteration analysis)

Under Assumptions 2 and 3, let $\{(x^k,\lambda^k)\}$ be the sequence generated from Algorithm 1. Then for any $x$ and $\lambda$ such that $Ax=b$, it holds that

$$\begin{aligned}
\Phi(x^{k+1},x,\lambda)\le\;&\frac{1}{2\rho_k}\big[\|\lambda-\lambda^k\|^2-\|\lambda-\lambda^{k+1}\|^2+\|\lambda^k-\lambda^{k+1}\|^2\big]-\beta_k\|r^{k+1}\|^2\\
&-\frac12\big[\|x^{k+1}-x\|^2_{P^k-\beta_kA^\top A+\mu I}-\|x^k-x\|^2_{P^k-\beta_kA^\top A}+\|x^{k+1}-x^k\|^2_{P^k-\beta_kA^\top A-L_fI}\big].
\end{aligned}\tag{8}$$

Using the above lemma, we are able to prove the following theorem.

###### Theorem 2.2

Under Assumptions 2 and 3, let $\{(x^k,\lambda^k)\}$ be the sequence generated by Algorithm 1. Suppose that the parameters are set to satisfy

$$0<\rho_k\le2\beta_k,\qquad P^k\succeq\beta_kA^\top A+L_fI,\quad\forall k\ge1,\tag{9}$$

and there exists a number $k_0\ge0$ such that for all $k\ge2$,

$$\frac{k+k_0+1}{\rho_k}\le\frac{k+k_0}{\rho_{k-1}},\tag{10}$$
$$(k+k_0+1)\big(P^k-\beta_kA^\top A\big)\preceq(k+k_0)\big(P^{k-1}-\beta_{k-1}A^\top A+\mu I\big).\tag{11}$$

Then, for any $(x,\lambda)$ satisfying $Ax=b$, we have

$$\begin{aligned}
&\sum_{k=1}^t(k+k_0+1)\Phi(x^{k+1},x,\lambda)+\sum_{k=1}^t\frac{k+k_0+1}{2}(2\beta_k-\rho_k)\|r^{k+1}\|^2\\
&\qquad+\frac{t+k_0+1}{2}\|x^{t+1}-x\|^2_{P^t-\beta_tA^\top A+\mu I}\;\le\;\phi_1(x,\lambda),
\end{aligned}\tag{12}$$

where

$$\phi_1(x,\lambda)=\frac{k_0+2}{2\rho_1}\|\lambda-\lambda^1\|^2+\frac{k_0+2}{2}\|x^1-x\|^2_{P^1-\beta_1A^\top A}.\tag{13}$$

In the next theorem, we provide a set of parameters that satisfy the conditions in Theorem 2.2 and establish the convergence rate result.

###### Theorem 2.3 (Convergence rate of order 1/t2)

Under Assumptions 1 through 3, let $\{(x^k,\lambda^k)\}$ be the sequence generated by Algorithm 1 with parameters set to:

$$\beta_k=\rho_k=k\beta,\qquad P^k=kP+L_fI,\quad\forall k\ge1,\tag{14}$$

where $\beta>0$ is a scalar and $P$ is a block diagonal matrix chosen such that the conditions in Theorem 2.2 hold. Then,

$$\max\big\{\beta\|r^{t+1}\|^2,\;\|x^{t+1}-x^*\|^2_{P-\beta A^\top A}\big\}\le\frac{2}{t(t+k_0+1)}\,\phi_1(x^*,\lambda^*),\tag{15}$$

where $(x^*,\lambda^*)$ is a KKT point of (1), and $\phi_1$ is defined in (13). In addition, letting $\gamma>\|\lambda^*\|$ and

$$T=\frac{t(t+2k_0+3)}{2},\qquad \bar x^{t+1}=\frac{\sum_{k=1}^t(k+k_0+1)x^{k+1}}{T},$$

we have

$$|F(\bar x^{t+1})-F(x^*)|\le\frac1T\max_{\|\lambda\|\le\gamma}\phi_1(x^*,\lambda),\tag{16a}$$
$$\|A\bar x^{t+1}-b\|\le\frac{1}{T\max\{1,\|\lambda^*\|\}}\max_{\|\lambda\|\le\gamma}\phi_1(x^*,\lambda).\tag{16b}$$

## 3 Accelerating randomized primal-dual block coordinate updates

In this section, we generalize Algorithm 1 to a randomized setting where the user may choose to update a subset of blocks at each iteration. Instead of updating all block variables, we randomly choose a subset of them to renew at each iteration. Depending on the number of processors (nodes, or cores), we can choose a single or multiple block variables for each update.

### 3.1 The algorithm

Our algorithm is an accelerated version of the randomized primal-dual coordinate update method recently proposed in [15], for which we shall use RPDC as its acronym. (In fact, [15] presents a more general algorithmic framework: it assumes two groups of variables, each with a multi-block structure. Our method in Algorithm 2 is an accelerated version of one special case of Algorithm 1 in [15].) At each iteration, it performs a block proximal gradient update to a subset of randomly selected primal variables while keeping the remaining ones fixed, followed by an update to the multipliers. Specifically, at iteration $k$, it selects an index set $S_k\subseteq[M]$ of cardinality $m$ and performs the following updates:

$$x_i^{k+1}=\begin{cases}\operatorname*{argmin}_{x_i}\;\langle\nabla_if(x^k)-A_i^\top(\lambda^k-\beta_kr^k),\,x_i\rangle+g_i(x_i)+\frac{\eta_k}{2}\|x_i-x_i^k\|^2,&\text{if }i\in S_k,\\[1mm] x_i^k,&\text{if }i\notin S_k,\end{cases}\tag{17a}$$
$$r^{k+1}=r^k+\sum_{i\in S_k}A_i(x_i^{k+1}-x_i^k),\tag{17b}$$
$$\lambda^{k+1}=\lambda^k-\rho_kr^{k+1},\tag{17c}$$

where $\beta_k$, $\rho_k$, and $\eta_k$ are algorithm parameters whose values will be determined later. Note that we use the scalar proximal term $\frac{\eta_k}{2}\|x_i-x_i^k\|^2$ in (17a) for simplicity. It can be replaced by a PSD-matrix weighted norm square term as in (7a), and our convergence results still hold.

Algorithm 2 summarizes the above method. If the parameters $\beta_k$, $\rho_k$, and $\eta_k$ are fixed during all the iterations, i.e., constant parameters, the algorithm reduces to a special case of the RPDC method in [15]. By adapting these parameters to the iterations, we will show that Algorithm 2 enjoys a faster convergence rate than RPDC when the problem is strongly convex.
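With constant parameters, one pass over (17a)-(17c) can be sketched as follows; we reuse the toy problem $\min\,\frac12\|x-d\|^2+\|x\|_1$ s.t. $Ax=b$ (so the block subproblem in (17a) is soft-thresholding), and the data and parameter values are illustrative assumptions:

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Randomized updates (17a)-(17c): m of the M blocks are renewed per iteration.
# Constant (beta, rho, eta) give the RPDC special case; rho <= theta*beta and
# eta >= L_m + beta*||A||_2^2 mirror the conditions of Theorem 3.2 below
# (here L_m = 1 since grad f is the identity shift x - d).
rng = np.random.default_rng(3)
M, ni, p = 6, 2, 4
n = M * ni
A = rng.standard_normal((p, n))
b = A @ rng.standard_normal(n)
d = rng.standard_normal(n)
m = 2                                        # blocks updated per iteration
theta = m / M
beta = 1.0 / np.linalg.norm(A, 2) ** 2
rho = theta * beta
eta = 1.0 + beta * np.linalg.norm(A, 2) ** 2

x, lam = np.zeros(n), np.zeros(p)
r = A @ x - b
for _ in range(5000):
    S = rng.choice(M, size=m, replace=False)       # random block subset S_k
    x_new = x.copy()
    for i in S:
        sl = slice(i * ni, (i + 1) * ni)
        s_i = (x[sl] - d[sl]) - A[:, sl].T @ (lam - beta * r)
        x_new[sl] = soft(x[sl] - s_i / eta, 1.0 / eta)   # (17a)
    r = r + A @ (x_new - x)                        # (17b): x_new - x is 0 off S_k
    x = x_new
    lam = lam - rho * r                            # (17c)
print(np.linalg.norm(A @ x - b))
```

Maintaining $r^k$ recursively via (17b) avoids recomputing $Ax^k-b$ from scratch, which is the point of the residual update in the coordinate setting.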

### 3.2 Convergence results

In this subsection, we establish convergence results of Algorithm 2 under Assumptions 1 and 3, and also the following partial gradient Lipschitz continuity assumption.

###### Assumption 4

For any index set $S\subseteq[M]$ with $|S|\le m$, the partial gradient $\nabla_Sf$ is Lipschitz continuous with a uniform constant $L_m>0$.

Note that if $\nabla f$ is Lipschitz continuous with constant $L_f$, then Assumption 4 holds with $L_m\le L_f$. In addition, if $x^+$ and $x$ only differ on a set of blocks with cardinality at most $m$, then

$$f(x^+)\le f(x)+\langle\nabla f(x),x^+-x\rangle+\frac{L_m}{2}\|x^+-x\|^2.\tag{18}$$
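The inequality above can be checked numerically for a quadratic, where the subset Lipschitz constant has the closed form $L_m=\max_{|S|\le m}\|Q_{S,S}\|_2$; the toy instance (single-coordinate "blocks", Gaussian PSD Hessian) is an illustrative assumption:

```python
import numpy as np
from itertools import combinations

# Check (18) for f(x) = 0.5*x^T Q x: if x+ and x differ only on a subset S
# with |S| <= m, then f(x+) <= f(x) + <grad f(x), x+-x> + (L_m/2)||x+-x||^2,
# where L_m = max over |S| <= m of the spectral norm of Q[S, S].
rng = np.random.default_rng(4)
n, m = 6, 2
B = rng.standard_normal((n, n))
Q = B @ B.T                       # symmetric PSD Hessian

f = lambda x: 0.5 * x @ Q @ x
L_m = max(np.linalg.norm(Q[np.ix_(S, S)], 2)
          for S in combinations(range(n), m))

ok = True
for _ in range(100):
    x = rng.standard_normal(n)
    S = rng.choice(n, size=m, replace=False)
    dlt = np.zeros(n)
    dlt[S] = rng.standard_normal(m)        # update supported on S only
    lhs = f(x + dlt)
    rhs = f(x) + (Q @ x) @ dlt + 0.5 * L_m * dlt @ dlt
    ok = ok and bool(lhs <= rhs + 1e-9)
print(ok)
```

For a quadratic, the gap $f(x^+)-f(x)-\langle\nabla f(x),x^+-x\rangle$ equals $\frac12 d_S^\top Q_{S,S}d_S$, so the inequality holds with exactly this $L_m$.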

Similar to the analysis in Section 2, we first establish a result of running one iteration of Algorithm 2. Throughout this section, we denote $\theta=\frac{m}{M}$.

###### Lemma 3.1 (One iteration analysis)

Under Assumptions 3 and 4, let $\{(x^k,\lambda^k)\}$ be the sequence generated from Algorithm 2. Then for any $(x,\lambda)$ such that $Ax=b$, it holds

$$\begin{aligned}
&\mathbb{E}\Big[\Phi(x^{k+1},x,\lambda^{k+1})+(\beta_k-\rho_k)\|r^{k+1}\|^2+\frac{\mu}{2}\|x^{k+1}-x\|^2\Big]\\
\le\;&(1-\theta)\,\mathbb{E}\Big[\Phi(x^k,x,\lambda^k)+\beta_k\|r^k\|^2+\frac{\mu}{2}\|x^k-x\|^2\Big]-\mathbb{E}\Big[\Delta_{\eta_kI-\beta_kA^\top A}(x^{k+1},x^k,x)-\frac{L_m}{2}\|x^{k+1}-x^k\|^2\Big].
\end{aligned}\tag{19}$$

When $\mu=0$ (i.e., (1) is merely convex), Algorithm 2 has an $O(1/t)$ convergence rate with fixed parameters $(\beta_k,\rho_k,\eta_k)\equiv(\beta,\rho,\eta)$. This can be shown from (19), and a similar result in a slightly different form has been established in [15, Theorem 3.6]. For completeness, we provide its proof in the appendix.

###### Theorem 3.2 (Un-accelerated convergence)

Under Assumptions 1 and 4, let $\{(x^k,\lambda^k)\}$ be the sequence generated from Algorithm 2 with $(\beta_k,\rho_k,\eta_k)\equiv(\beta,\rho,\eta)$ for all $k$, satisfying

$$0<\rho\le\theta\beta,\qquad \eta\ge L_m+\beta\|A\|_2^2,$$

where $\|A\|_2$ denotes the spectral norm of $A$. Then

$$\big|\mathbb{E}[F(\bar x^t)-F(x^*)]\big|\le\frac{1}{1+\theta(t-1)}\max_{\|\lambda\|\le\gamma}\phi_2(x^*,\lambda),\tag{20a}$$
$$\mathbb{E}\|A\bar x^t-b\|\le\frac{1}{\big(1+\theta(t-1)\big)\max\{1,\|\lambda^*\|\}}\max_{\|\lambda\|\le\gamma}\phi_2(x^*,\lambda),\tag{20b}$$

where $(x^*,\lambda^*)$ satisfies the KKT conditions in (4), $\gamma>\|\lambda^*\|$, and

$$\bar x^t=\frac{x^{t+1}+\theta\sum_{k=2}^tx^k}{1+\theta(t-1)},\qquad \phi_2(x,\lambda)=(1-\theta)\big(F(x^1)-F(x)\big)+\frac{\eta}{2}\|x^1-x\|^2+\frac{\theta\|\lambda\|^2}{2\rho}.$$

When the objective is strongly convex, the above convergence rate can be accelerated to $O(1/t^2)$ by adaptively changing the parameters at each iteration. The following theorem is our main result. It shows an $O(1/t^2)$ convergence result under certain conditions on the parameters. Based on this theorem, we will give a set of parameters that satisfy these conditions, thus providing a specific scheme to choose the parameters.

###### Theorem 3.3

Under Assumptions 3 and 4, let $\{(x^k,\lambda^k)\}$ be the sequence generated from Algorithm 2 with parameters satisfying the following conditions for a certain number $k_0\ge0$:

$$\begin{aligned}
&\theta(k+k_0+1)\ge1, &&\forall k\ge2, &&\text{(21a)}\\
&(\beta_{k-1}-\rho_{k-1})(k+k_0)\ge(1-\theta)(k+k_0+1)\beta_k, &&\forall\,2\le k\le t, &&\text{(21b)}\\
&\frac{\theta(k+k_0+1)-1}{\rho_{k-1}}\ge\frac{\theta(k+k_0+2)-1}{\rho_k}, &&\forall\,2\le k\le t-1, &&\text{(21c)}\\
&\frac{\theta(t+k_0+1)-1}{\rho_{t-1}}\ge\frac{t+k_0+1}{\rho_t}, && &&\text{(21d)}\\
&\beta_k(k+k_0+1)\ge\beta_{k-1}(k+k_0), &&\forall k\ge2, &&\text{(21e)}\\
&(k+k_0+1)(\eta_k-L_m)I\succeq\beta_k(k+k_0+1)A^\top A, &&\forall k\ge1, &&\text{(21f)}\\
&(k+k_0)\eta_{k-1}+\mu\big(\theta(k+k_0+1)-1\big)\ge(k+k_0+1)\eta_k, &&\forall k\ge2. &&\text{(21g)}
\end{aligned}$$

Then for any $(x,\lambda)$ such that $Ax=b$, we have

$$\begin{aligned}
&(t+k_0+1)\,\mathbb{E}\big[\Phi(x^{t+1},x,\lambda)\big]+\sum_{k=2}^t\big(\theta(k+k_0+1)-1\big)\mathbb{E}\big[\Phi(x^k,x,\lambda)\big]\\
\le\;&(1-\theta)(k_0+2)\,\mathbb{E}\Big[\Phi(x^1,x,\lambda^1)+\beta_1\|r^1\|^2+\frac{\mu}{2}\|x^1-x\|^2\Big]+\frac{\eta_1(k_0+2)}{2}\mathbb{E}\|x^1-x\|^2\\
&+\frac{\theta(k_0+3)-1}{2\rho_1}\mathbb{E}\|\lambda^1-\lambda\|^2-\frac{t+k_0+1}{2}\mathbb{E}\|x^{t+1}-x\|^2_{(\mu+\eta_t)I-\beta_tA^\top A}.
\end{aligned}\tag{22}$$

Specifying the parameters that satisfy (21), we show convergence rate of Algorithm 2.

###### Proposition 3.4

The following parameters satisfy all conditions in (21):

$$\beta_k=\frac{\mu(\theta k+2+\theta)}{2\rho\|A\|_2^2},\quad\forall k\ge1,\tag{23a}$$
$$\rho_k=\begin{cases}\dfrac{\theta\beta_k}{6-5\theta},&\text{for }1\le k\le t-1,\\[2mm]\dfrac{(t+k_0+1)\rho_{t-1}}{\theta(t+k_0+1)-1},&\text{for }k=t,\end{cases}\tag{23b}$$
$$\eta_k=\rho\beta_k\|A\|_2^2+L_m,\quad\forall k\ge1,\tag{23c}$$

where $\rho>1$ is a fixed scalar and

$$k_0=\frac{4}{\theta}+\frac{2L_m}{\theta\mu}.\tag{24}$$
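The schedule (23)-(24) can be wired into a small helper as below. Since the fraction layout of these formulas is partly ambiguous in the scraped rendering, the helper encodes the reading given above ($\theta=m/M$, $\rho>1$ a free constant, written `rho_bar`); it is a sketch, not a tuned implementation:

```python
def schedule(k, t, theta, mu, L_m, A_norm, rho_bar=2.0):
    """Adaptive parameters (beta_k, rho_k, eta_k) per one reading of (23)-(24).

    theta = m/M is the block-sampling ratio; rho_bar > 1 is a free constant;
    A_norm is the spectral norm ||A||_2.
    """
    k0 = 4.0 / theta + 2.0 * L_m / (theta * mu)                        # (24)
    beta = lambda j: mu * (theta * j + 2 + theta) / (2 * rho_bar * A_norm**2)
    beta_k = beta(k)                                                   # (23a)
    if k < t:
        rho_k = theta * beta_k / (6 - 5 * theta)                       # (23b)
    else:  # the final iteration uses a larger dual step size
        rho_prev = theta * beta(t - 1) / (6 - 5 * theta)
        rho_k = (t + k0 + 1) * rho_prev / (theta * (t + k0 + 1) - 1)
    eta_k = rho_bar * beta_k * A_norm**2 + L_m                         # (23c)
    return beta_k, rho_k, eta_k

bk, rk, ek = schedule(k=10, t=100, theta=0.25, mu=1.0, L_m=1.0, A_norm=2.0)
print(bk, rk, ek)
```

Note that $\beta_k$ grows linearly in $k$, $\rho_k$ stays a fixed fraction of $\beta_k$ until the last iteration, and $\eta_k$ grows accordingly, which is the mechanism behind the $O(1/t^2)$ rate.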
###### Theorem 3.5 (Accelerated convergence)

Under Assumptions 1, 3 and 4, let $\{(x^k,\lambda^k)\}$ be the sequence generated from Algorithm 2 with parameters taken as in (23). Then

$$\big|\mathbb{E}[F(\bar x^{t+1})-F(x^*)]\big|\le\frac1T\max_{\|\lambda\|\le\gamma}\phi_3(x^*,\lambda),\qquad \mathbb{E}\|A\bar x^{t+1}-b\|\le\frac{1}{T\max\{1,\|\lambda^*\|\}}\max_{\|\lambda\|\le\gamma}\phi_3(x^*,\lambda),\tag{25}$$

where $\gamma>\|\lambda^*\|$,

$$\bar x^{t+1}=\frac{(t+k_0+1)x^{t+1}+\sum_{k=2}^t\big(\theta(k+k_0+1)-1\big)x^k}{T},$$
$$\begin{aligned}\phi_3(x,\lambda)=\;&(1-\theta)(k_0+2)\Big[F(x^1)-F(x)+\beta_1\|r^1\|^2+\frac{\mu}{2}\|x^1-x\|^2\Big]\\&+\frac{\eta_1(k_0+2)}{2}\|x^1-x\|^2+\frac{\theta(k_0+3)-1}{2\rho_1}\|\lambda\|^2,\end{aligned}$$

and

$$T=(t+k_0+1)+\sum_{k=2}^t\big(\theta(k+k_0+1)-1\big).$$

In addition,

$$\mathbb{E}\|x^{t+1}-x^*\|^2\le\frac{2\phi_3(x^*,\lambda^*)}{(t+k_0+1)\Big(\frac{(\rho-1)\mu(\theta t+\theta+2)}{2\rho}+2\mu+L_m\Big)}.$$

## 4 Linearly convergent primal-dual method

In this section, we assume more structure on (1) and show that a linear rate of convergence is possible. If there is no linear constraint, Algorithm 2 reduces to the RCD method proposed in [31]. It is well known that RCD converges linearly if the objective is strongly convex. However, in the presence of linear constraints, mere strong convexity of the primal objective only ensures the smoothness of the Lagrangian dual function, not its strong concavity. Hence, in general, we do not expect linear convergence by assuming only strong convexity of the primal objective. To ensure linear convergence in both the primal and dual variables, we need additional assumptions.

Throughout this section, we suppose that at least one block variable is absent from the nonseparable part $f$ of the objective. For convenience, we rename this block variable as $y$, and the corresponding component function and constraint coefficient matrix as $h$ and $B$, respectively. Specifically, we consider the following problem:

$$\min_{x,y}\;f(x_1,\dots,x_M)+\sum_{i=1}^Mg_i(x_i)+h(y),\quad \text{s.t.}\;\sum_{i=1}^MA_ix_i+By=b.\tag{26}$$

One example of (26) is the problem that appears while computing a point on the central path of a convex program. Suppose we are interested in solving

$$\min_x\;f(x_1,\dots,x_M),\quad \text{s.t.}\;\sum_{i=1}^MA_ix_i\le b,\;x_i\ge0,\;i=1,\dots,M.\tag{27}$$

Introducing the nonnegative slack variable $y=b-\sum_{i=1}^MA_ix_i$ and using the log-barrier function to handle the nonnegativity constraints, we obtain the log-barrier approximation of (27) as follows:

$$\min_{x,y}\;f(x_1,\dots,x_M)-\mu\sum_{i=1}^Me^\top\log x_i-\mu e^\top\log y,\quad \text{s.t.}\;\sum_{i=1}^MA_ix_i+y=b,\tag{28}$$

where $e$ is the all-one vector and the logarithm is applied component-wise. As $\mu$ decreases, the approximation becomes more accurate.

Towards a solution to (26), we modify Algorithm 2 by updating the $y$-variable after the $x$-update. Since there is only a single $y$-block, to balance the $x$- and $y$-updates, we do not renew $y$ in every iteration but instead update it with probability $\theta=\frac{m}{M}$. Hence, roughly speaking, the $x$- and $y$-variables are updated at the same frequency. The method is summarized in Algorithm 3.
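The control flow just described ($x$-blocks every iteration, the single $y$-block with probability $\theta$) can be skeletonized as follows; the update callables are placeholders standing in for the actual prox/argmin steps, not the paper's exact formulas:

```python
import numpy as np

def primal_dual_with_y(x_update, y_update, x0, y0, lam0, theta, rho,
                       residual, iters, rng):
    """Skeleton of the Algorithm 3 update pattern (placeholder prox steps).

    x is renewed every iteration (a randomized block update lives inside
    x_update); the single y-block is renewed only with probability theta,
    so x- and y-variables are refreshed at roughly the same frequency.
    """
    x, y, lam = x0.copy(), y0.copy(), lam0.copy()
    for _ in range(iters):
        x = x_update(x, y, lam)            # randomized x-block update
        if rng.random() < theta:           # y updated with probability theta
            y = y_update(x, y, lam)
        lam = lam - rho * residual(x, y)   # multiplier update
    return x, y, lam

# Toy usage with placeholder contractions (rho = 0 keeps lam fixed):
rng = np.random.default_rng(0)
x, y, lam = primal_dual_with_y(
    x_update=lambda x, y, lam: 0.5 * x,
    y_update=lambda x, y, lam: 0.5 * y,
    x0=np.ones(3), y0=np.ones(2), lam0=np.zeros(2),
    theta=0.5, rho=0.0, residual=lambda x, y: np.zeros(2),
    iters=50, rng=rng)
```

Flipping a $\theta$-coin for the $y$-step, rather than updating $y$ deterministically every $1/\theta$ iterations, is what keeps the expected one-iteration analysis symmetric between $x$ and $y$.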

### 4.1 Technical assumptions

In this section, we denote $z=(x;y)$. Assume $h$ is differentiable. Similar to (4), a point $(x^*,y^*,\lambda^*)$ is called a KKT point of (26) if

$$0\in\partial F(x^*)-A^\top\lambda^*,\tag{32a}$$
$$\nabla h(y^*)-B^\top\lambda^*=0,\tag{32b}$$
$$Ax^*+By^*-b=0.\tag{32c}$$

Besides Assumptions 3 and 4, we make two additional assumptions as follows.

###### Assumption 5

There exists $(x^*,y^*,\lambda^*)$ satisfying the KKT conditions in (32).

###### Assumption 6

The function $h$ is strongly convex with modulus $\nu>0$, and its gradient is Lipschitz continuous with constant $L_h$.

The strong convexity of $F$ and $h$ implies

$$F(x^{k+1})-F(x^*)-\langle\tilde\nabla F(x^*),x^{k+1}-x^*\rangle\ge\frac{\mu}{2}\|x^{k+1}-x^*\|^2,\tag{33a}$$
$$\langle y^{k+1}-y^*,\nabla h(y^{k+1})-\nabla h(y^*)\rangle\ge\nu\|y^{k+1}-y^*\|^2.\tag{33b}$$

### 4.2 Convergence analysis

Similar to Lemma 3.1, we first establish a result of running one iteration of Algorithm 3. It can be proven by similar arguments to those showing Lemma 3.1.

###### Lemma 4.1 (One iteration analysis)

Under Assumptions 3, 4, and 6, let $\{(x^k,y^k,\lambda^k)\}$ be the sequence generated from Algorithm 3. Then for any $(x,y)$ and $\lambda$ such that $Ax+By=b$, it holds

 Eφ(zk+1,z)+(β−ρ)E∥rk+1∥2+1ρEΔ(λk+1,λk,λ) (34) +E[ΔP(xk+1,xk,x)−Lm2∥xk+1−xk∥2]+EΔQ(yk+1,yk,y) ≤ (1−θ)Eφ(zk,z)+β(1−θ)E∥rk∥2+1−θρEΔ(λk,λk−1,λ) +βE⟨A(xk+1−x),B(yk+1−yk)⟩+β(1−θ)E⟨B(yk−y),A(x