1 Introduction
Modern applications in various areas of data science and engineering can involve huge amounts of data and/or variables [42]. Driven by these very large-scale problems and by the advancement of multi-core computers, parallel computing has gained tremendous attention in recent years. In this paper, we consider the affinely constrained multi-block structured problem

(1)  $\min_{\mathbf{x}} F(\mathbf{x}) := f(\mathbf{x}_1, \ldots, \mathbf{x}_m) + \sum_{i=1}^m g_i(\mathbf{x}_i), \quad \text{s.t.} \ \mathbf{A}\mathbf{x} = \sum_{i=1}^m \mathbf{A}_i \mathbf{x}_i = \mathbf{b},$

where the variable $\mathbf{x}$ is partitioned into multiple disjoint blocks $\mathbf{x}_1, \ldots, \mathbf{x}_m$, $f$ is a continuously differentiable and convex function, and each $g_i$ is a lower semicontinuous extended-valued convex, but possibly nondifferentiable, function. Besides the nonseparable affine constraint, (1) can also include certain block-separable constraints by letting part of $g_i$ be an indicator function of a convex set, e.g., a nonnegativity constraint.
We will present a novel asynchronous (async) parallel primal-dual method (see Algorithm 2) for finding a solution to (1). Suppose there are multiple nodes (or cores, CPUs). We let one node (called the master node) update both primal and dual variables, and all the remaining ones (called worker nodes) compute and provide block gradients of $f$ to the master node. We assume each $g_i$ is proximable (see the definition in (5) below). In addition, we make the following assumption:
Assumption 0
The computation of $\nabla f(\mathbf{x})$ is roughly at least $p$ times more expensive than that of the proximal mapping of $g_i$ for all $i$, where $p$ is the number of nodes.
This assumption is made only to achieve good practical speed-up of the async-parallel method. When it holds, the master node can quickly digest the block gradient information fed to it by all worker nodes, and thus the workers keep working rather than queueing. Note that our theoretical analysis does not require this assumption. If there is only one node (i.e., $p = 1$), the assumption always holds, and our method becomes a novel serial primal-dual BCU with adaptive stepsize for solving (1).
1.1 Motivating examples
Problems in the form of (1) arise in many areas, including signal processing, machine learning, finance, and statistics. For example, the basis pursuit problem [7] seeks a sparse solution on an affine subspace through solving the linearly constrained program

(2)  $\min_{\mathbf{x}} \|\mathbf{x}\|_1, \quad \text{s.t.} \ \mathbf{A}\mathbf{x} = \mathbf{b}.$
Partitioning $\mathbf{x}$ into multiple disjoint blocks $\mathbf{x}_1, \ldots, \mathbf{x}_m$ in an arbitrary way, one can formulate (2) into the form of (1) with $f \equiv 0$ and each $g_i(\mathbf{x}_i) = \|\mathbf{x}_i\|_1$.
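Written out under such a partition, with the columns of $\mathbf{A}$ split as $\mathbf{A} = [\mathbf{A}_1, \ldots, \mathbf{A}_m]$ to match the blocks, (2) reads

$\min_{\mathbf{x}_1, \ldots, \mathbf{x}_m} \sum_{i=1}^m \|\mathbf{x}_i\|_1, \quad \text{s.t.} \ \sum_{i=1}^m \mathbf{A}_i \mathbf{x}_i = \mathbf{b},$

which is exactly the form (1) with a vanishing smooth part $f$.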
Another example is portfolio optimization [28]. Suppose we have a unit of capital to invest in $m$ assets. Let $x_i$ be the fraction of capital invested in the $i$-th asset and $\xi_i$ be the expected return rate of the $i$-th asset. The goal is to minimize the risk, measured by $\mathbf{x}^\top \mathbf{Q} \mathbf{x}$, subject to the total unit capital and a minimum expected return $c$, where $\mathbf{x} = (x_1, \ldots, x_m)^\top$ and $\mathbf{Q}$ is the covariance matrix. To find the optimal $\mathbf{x}$, one can solve the problem

(3)  $\min_{\mathbf{x}} \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x}, \quad \text{s.t.} \ \sum_{i=1}^m x_i \le 1, \ \sum_{i=1}^m \xi_i x_i \ge c, \ x_i \ge 0, \ \forall i.$

Introducing slack variables for the first two inequalities, one can easily write (3) into the form of (1) with a quadratic $f$ and each $g_i$ being an indicator function of the nonnegativity constraint set.
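For concreteness, one such slack reformulation (our rendering, consistent with (3); the paper may place the slacks differently) introduces $s_1, s_2 \ge 0$ and solves

$\min_{\mathbf{x}, s_1, s_2} \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x}, \quad \text{s.t.} \ \sum_{i=1}^m x_i + s_1 = 1, \ \sum_{i=1}^m \xi_i x_i - s_2 = c, \ \mathbf{x} \ge \mathbf{0}, \ s_1, s_2 \ge 0,$

which has two equality constraints in the form $\mathbf{A}\mathbf{x} = \mathbf{b}$, with the nonnegativity constraints absorbed into the $g_i$'s as indicator functions.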
In addition, (1) includes as a special case the dual support vector machine (SVM) [9]. Given a training data set $\{(\mathbf{a}_i, y_i)\}_{i=1}^N$ with labels $y_i \in \{-1, +1\}$, let $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_N]$ and $\mathbf{y} = (y_1, \ldots, y_N)^\top$. The dual form of the linear SVM can be written as

(4)  $\min_{\boldsymbol{\theta}} \frac{1}{2} \boldsymbol{\theta}^\top \mathbf{D}(\mathbf{y}) \mathbf{A}^\top \mathbf{A} \mathbf{D}(\mathbf{y}) \boldsymbol{\theta} - \mathbf{e}^\top \boldsymbol{\theta}, \quad \text{s.t.} \ \mathbf{y}^\top \boldsymbol{\theta} = 0, \ 0 \le \theta_i \le C, \ \forall i,$

where $\mathbf{D}(\mathbf{y})$ denotes the diagonal matrix with $\mathbf{y}$ on its diagonal, $\mathbf{e}$ is the all-ones vector, and $C$ is a given number relating to the soft-margin size. It is easy to formulate (4) into the form of (1) with $f$ being the quadratic objective function and each $g_i$ the indicator function of the box $[0, C]$.
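As a quick illustration of the structure of (4), the following NumPy snippet (our own sketch with hypothetical toy data, not code from the paper) assembles the quadratic from a data set:

```python
import numpy as np

# Toy data: feature vectors a_1, ..., a_N as columns, labels in {-1, +1}.
rng = np.random.default_rng(0)
d, N, C = 5, 8, 1.0
A = rng.standard_normal((d, N))
y = rng.choice([-1.0, 1.0], size=N)

# Q = D(y) A^T A D(y): scale each column a_i by y_i, then form the Gram matrix.
Q = (A * y).T @ (A * y)

# (4) then reads: minimize 0.5 * th @ Q @ th - th.sum()
# subject to      y @ th == 0  and  0 <= th_i <= C for all i,
# i.e., form (1) with quadratic f, one affine constraint, and box indicator g_i's.
```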
Finally, the penalized and constrained (PAC) regression problem [21] is also an example of (1), with $f$ an empirical loss and a linear constraint of finitely many equations. When the training set is massive (as often holds in modern applications), evaluating $\nabla f$ is far more expensive than the proximal steps, so the PAC regression satisfies Assumption 0. The same holds for (3) and (4) when the number of assets or of training points is large relative to the number of nodes, and thus the proposed async-parallel method will be efficient when applied to these problems. Although Assumption 0 does not hold for (2) since $f \equiv 0$, our method running on a single node can still outperform state-of-the-art non-parallel solvers; see the numerical results in section 4.1.
1.2 Block coordinate update
The block coordinate update (BCU) method breaks the possibly very high-dimensional variable into small pieces and renews one piece at a time while all the remaining blocks are fixed. Although problem (1) can be extremely large-scale and complicated, BCU solves a sequence of small-sized and easier subproblems. Since (1) possesses nice structures, e.g., coordinate friendliness [30], BCU not only has low per-update complexity but can also enjoy faster overall convergence than methods that update the whole variable every time. BCU has been applied to many unconstrained or block-separably constrained optimization problems (e.g., [39, 40, 29, 33, 44, 35, 45, 20]), and it has also been used to solve affinely constrained separable problems, i.e., problems in the form of (1) without the $f$ term (e.g., [12, 11, 16, 17, 18]). However, only a few existing works (e.g., [19, 14, 13]) have studied BCU for affinely constrained problems with a nonseparable objective function.
1.3 Asynchronization
Parallel computing methods distribute computation over, and collect results from, multiple nodes. Synchronous (sync) parallel methods require all nodes to keep the same pace: only after every node finishes its own computation do they altogether proceed to the next step. This way, faster nodes must wait for the slowest one, wasting considerable time idling. On the contrary, async-parallel methods keep all nodes continuously working and eliminate the idle waiting. Numerous works (e.g., [34, 26, 27, 31]) have demonstrated that async-parallel methods can achieve significantly better speed-up than their sync-parallel counterparts.
Due to the lack of synchronization, the information used by a certain node may be outdated. Hence the convergence of an async-parallel method cannot simply be inherited from its non-parallel counterpart but often requires new analysis tools. Most existing works only analyze such methods for unconstrained or block-separably constrained problems. Exceptions include [41, 46, 3, 4], which consider separable problems with special affine constraints.
1.4 Related works
The past several years have witnessed a surge of async-parallel methods, partly due to the increasingly large scale of the data and variables involved in modern applications. However, only a few existing works discuss such methods for affinely constrained problems. Below we review the literature on async-parallel BCU methods in optimization and on primal-dual BCU methods for affinely constrained problems.
It appears that the first async-parallel method was proposed by Chazan and Miranker [5] for solving linear systems. Later, such methods were applied in many other fields. In optimization, the first async-parallel BCU method was due to Bertsekas and Tsitsiklis [1] for problems with a smooth objective; it was shown that the objective gradient sequence converges to zero. Tseng [38] further analyzed its convergence rate and established local linear convergence by assuming isocost surface separation and a local Lipschitz error bound on the objective. Recently, [27, 26] developed async-parallel methods based on randomized BCU for convex problems with possibly block-separable constraints. They established convergence and rate results by assuming a bounded delay on the outdated block gradient information. These results were extended to the case of unbounded probabilistic delay in [32], which also shows convergence of async-parallel BCU methods for nonconvex problems. For problems with a convex separable objective and linear constraints, [41] proposed to apply the alternating direction method of multipliers (ADMM) in an asynchronous, distributed way. Assuming a special structure on the linear constraint, it established an ergodic convergence result in terms of the total number of iterations $k$. In [46, 3, 4], async-ADMM is applied to distributed multi-agent optimization, which can be equivalently formulated into (1) with a consensus constraint. Among them, [46] showed sublinear convergence of async-ADMM for convex problems, and [4] established its linear convergence for strongly convex problems, while [3] also considered nonconvex cases. The works [31, 8] developed async-parallel BCU methods for fixed-point or monotone inclusion problems. Although these settings are more general (including convex optimization as a special case), strong monotonicity (similar to strong convexity in optimization) is needed to establish convergence rate results.
Running on a single node, the proposed async-parallel method reduces to a serial randomized primal-dual BCU. In the literature, various Gauss-Seidel (GS) cyclic BCU methods have been developed for solving separable convex programs with linear constraints. Although a cyclic primal-dual BCU can work well empirically, in general it may diverge [12, 6]. To guarantee convergence, additional assumptions besides convexity must be made, such as strong convexity of part of the objective [15, 2, 22, 36, 25, 24, 10] or orthogonality properties of the block matrices in the linear constraint [6]. Without such assumptions, modifications to the algorithm are necessary for convergence, such as a further correction step after each cycle of updates [17, 18], random permutation of all blocks before each cycle of updates [37], Jacobi-type updates [11, 16] that are essentially a linearized augmented Lagrangian method (ALM), and hybrid Jacobi-GS updates [36, 23, 43]. Different from these modifications, our algorithm simply employs randomization in selecting the block variable and can perform significantly better than Jacobi-type methods. In addition, its convergence is guaranteed under a mere convexity assumption, and in this sense our results are stronger than those for GS-type methods.
1.5 Contributions
The contributions are summarized as follows.

We propose an async-parallel BCU method for solving multi-block structured convex programs with linear constraints. The algorithm is the first async-parallel primal-dual method for affinely constrained problems with a nonseparable objective. When there is only one node, it reduces to a novel serial primal-dual BCU method with stepsizes adaptive to the blocks.

Merely with convexity, convergence of the proposed method is guaranteed. We first establish convergence of the serial BCU method: we show that the objective value converges in probability to the optimal value and the constraint residual to zero. In addition, we establish an ergodic convergence rate result. Then, by bounding a cross term involving the delayed block gradient, we prove that similar convergence results hold for the async-parallel BCU method if a delay-dependent stepsize is chosen.

We implement the proposed algorithm and apply it to basis pursuit, quadratic programming, and support vector machine problems. Numerical results demonstrate that the serial BCU is comparable to or better than state-of-the-art methods. In addition, the async-parallel BCU method achieves significantly better speed-up performance than its sync-parallel counterpart.
1.6 Notation and Outline
We use bold small letters $\mathbf{x}, \mathbf{y}, \ldots$ for vectors and bold capital letters $\mathbf{A}, \mathbf{Q}, \ldots$ for matrices. $[m]$ denotes the integer set $\{1, 2, \ldots, m\}$. $\mathbf{U}_i \mathbf{x}_i$ represents a vector with $\mathbf{x}_i$ for its $i$-th block and zero for all other blocks. We denote by $\|\mathbf{x}\|$ the Euclidean norm of $\mathbf{x}$ and write $\|\mathbf{x}\|_{\mathbf{P}}^2 = \mathbf{x}^\top \mathbf{P} \mathbf{x}$ for a symmetric positive semidefinite matrix $\mathbf{P}$. We reserve $\mathbf{I}$ for the identity matrix, whose size is clear from the context. $\mathbb{E}_k$ stands for the expectation conditional on the history through iteration $k$. We write $\overset{p}{\rightarrow}$ for convergence in probability of a random vector sequence. The proximal operator of a function $g$ is defined as

(5)  $\mathrm{prox}_g(\mathbf{y}) = \arg\min_{\mathbf{x}} g(\mathbf{x}) + \frac{1}{2} \|\mathbf{x} - \mathbf{y}\|^2.$

If $\mathrm{prox}_g$ has a closed-form solution or is easy to compute, we call $g$ proximable.
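For example, the $\ell_1$ norm and indicator functions of simple sets are proximable. A minimal NumPy sketch of two such maps (the function names are ours, for illustration only):

```python
import numpy as np

def prox_l1(y, mu):
    """Prox of g(x) = mu*||x||_1, i.e., argmin_x mu*||x||_1 + 0.5*||x - y||^2.
    Solved componentwise by soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - mu, 0.0)

def prox_box(y, lo, hi):
    """Prox of the indicator of the box [lo, hi]: the Euclidean projection."""
    return np.clip(y, lo, hi)
```

Both maps evaluate in closed form in linear time, which is what makes the per-block subproblems of the algorithms below cheap.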
2 Algorithm
In this section, we propose an async-parallel primal-dual method for solving (1). Our algorithm is a BCU-type method based on the augmented Lagrangian function of (1):

$\mathcal{L}_\beta(\mathbf{x}, \mathbf{y}) = f(\mathbf{x}) + \sum_{i=1}^m g_i(\mathbf{x}_i) + \langle \mathbf{y}, \mathbf{A}\mathbf{x} - \mathbf{b} \rangle + \frac{\beta}{2} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2,$

where $\mathbf{y}$ is the multiplier (or augmented Lagrangian dual variable) and $\beta$ is a penalty parameter.
2.1 Non-parallel method
For ease of understanding, we first present a non-parallel method in Algorithm 1, whose key updates are displayed in (6)-(8) below. At every iteration, the algorithm chooses one block $i_k$ out of the $m$ blocks uniformly at random and renews it by (6) while fixing all the remaining blocks. Upon finishing the update to $\mathbf{x}_{i_k}$, it immediately updates the multiplier $\mathbf{y}$. The linearization of the possibly complicated smooth term $f$ greatly eases the subproblem. Depending on the form of $g_{i_k}$, we can choose an appropriate weight matrix $\mathbf{P}_{i_k}$ to make (6) simple to solve. Since each $g_i$ is proximable, one can always easily find a solution to (6) if $\mathbf{P}_{i_k}$ is a positive multiple of the identity. For even simpler $g_i$'s, such as the $\ell_1$ norm or the indicator function of a box constraint set, we can set $\mathbf{P}_{i_k}$ to a diagonal matrix and still have a closed-form solution to (6).
Randomly choosing a block to update has advantages over the cyclic order from both theoretical and empirical perspectives. We will show that this randomized BCU has guaranteed convergence under mere convexity, rather than the strong convexity assumed by cyclic primal-dual BCU. In addition, randomization enables us to parallelize the algorithm in an efficient way, as shown in Algorithm 2.
(6)  $\mathbf{x}_{i_k}^{k+1} = \arg\min_{\mathbf{x}_{i_k}} \big\langle \nabla_{i_k} f(\mathbf{x}^k) + \mathbf{A}_{i_k}^\top \big(\mathbf{y}^k + \beta (\mathbf{A}\mathbf{x}^k - \mathbf{b})\big), \mathbf{x}_{i_k} \big\rangle + g_{i_k}(\mathbf{x}_{i_k}) + \frac{1}{2} \|\mathbf{x}_{i_k} - \mathbf{x}_{i_k}^k\|_{\mathbf{P}_{i_k}}^2$

(7)  $\mathbf{x}_j^{k+1} = \mathbf{x}_j^k, \ \forall j \neq i_k$

(8)  $\mathbf{y}^{k+1} = \mathbf{y}^k + \rho (\mathbf{A}\mathbf{x}^{k+1} - \mathbf{b})$
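To fix ideas, the following is a minimal serial sketch of the scheme (6)-(8) in NumPy, assuming the simple choice $\mathbf{P}_{i_k} = \frac{1}{\eta} \mathbf{I}$ so that (6) reduces to one proximal step; the function names and parameter choices here are illustrative, not the paper's implementation:

```python
import numpy as np

def serial_pd_bcu(x, A, b, grad_f_block, prox_g, eta, rho, beta, iters, seed=0):
    """Serial randomized primal-dual BCU sketch for (1).
    x: list of block variables x_i; A: list of matrices A_i;
    grad_f_block(x, i): gradient of f w.r.t. block i;
    prox_g(i, v, eta): prox of eta*g_i at v."""
    rng = np.random.default_rng(seed)
    m = len(x)
    y = np.zeros_like(b)                                  # multiplier
    r = sum(Ai @ xi for Ai, xi in zip(A, x)) - b          # residual A x - b
    for _ in range(iters):
        i = rng.integers(m)                               # uniform random block
        g = grad_f_block(x, i) + A[i].T @ (y + beta * r)  # linearized terms in (6)
        x_new = prox_g(i, x[i] - eta * g, eta)            # update (6); (7) keeps j != i_k
        r += A[i] @ (x_new - x[i])                        # cheap residual maintenance
        x[i] = x_new
        y = y + rho * r                                   # multiplier update (8)
    return x, y
```

Note that the residual $\mathbf{A}\mathbf{x} - \mathbf{b}$ is maintained incrementally, so each iteration touches only one block.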
2.2 Async-parallel method
Assume there are $p$ nodes. Let the data and variables be stored in global memory accessible to every node. We let one node (called the master node) update both the primal variable $\mathbf{x}$ and the dual variable $\mathbf{y}$, and the remaining ones (called worker nodes) compute block gradients of $f$ and provide them to the master node. The method is summarized in Algorithm 2. We make a few remarks on the algorithm below.

Special case: If there is only one node (i.e., $p = 1$), the algorithm simply reduces to the non-parallel Algorithm 1.

Iteration number: Only the master node increases the iteration number $k$, which counts the number of times $\mathbf{x}$ is updated and also the number of block gradients consumed. Hence, even with multiple nodes, Algorithm 2 does not reduce to its sync-parallel counterpart.

Delayed information: Since all worker nodes provide block gradients to the master node, we cannot guarantee that every computed block gradient is immediately used to update $\mathbf{x}$. Hence, the block gradient used in (9) may not be evaluated at the current iterate but may be a delayed (i.e., outdated) one. The delay is usually of the same order as $p$ and can affect the stepsize, but the effect is negligible when the number of blocks exceeds the delay by an order of magnitude (see Theorem 3.8).
Because the blocks are updated only by the master node, the values of $\mathbf{x}$ and $\mathbf{y}$ used in the update are always up-to-date. Alternatively, one could let the worker nodes compute new blocks themselves and then feed them (or also the changes in the residual) to the master node; in that case, $\mathbf{x}$ and $\mathbf{y}$ would also be outdated when computing new blocks.

Load balance: Under Assumption 0, if (9) is easy to solve (e.g., with a diagonal $\mathbf{P}_i$) and all nodes have similar computing power, the master node will have used all received block gradients before a new one arrives. We let the master node itself also compute a block gradient if no new one has been sent by any worker node. This way, all nodes work continuously without idle waiting. Compared to its sync-parallel counterpart, which typically suffers serious load imbalance, the async-parallel method can achieve better speed-up; see the numerical results in section 4.3.
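The resulting division of labor can be sketched with shared memory and a gradient queue; the snippet below is purely schematic (a shared-memory thread model with hypothetical callbacks), meant to show the protocol rather than the paper's implementation:

```python
import queue, threading

grad_queue = queue.Queue()   # (block index, block gradient) pairs from workers
stop = threading.Event()

def worker(pick_block, block_grad):
    """Worker node: keep evaluating block gradients at the shared iterate.
    A gradient may already be outdated by the time the master consumes it."""
    while not stop.is_set():
        i = pick_block()
        grad_queue.put((i, block_grad(i)))

def master(pick_block, block_grad, apply_update, iters):
    """Master node: consume one block gradient per iteration and update (x, y).
    If none has arrived, compute one itself so no node ever idles."""
    for _ in range(iters):
        try:
            i, g = grad_queue.get_nowait()
        except queue.Empty:
            i = pick_block()
            g = block_grad(i)
        apply_update(i, g)   # primal step (9) on block i, then the dual step
    stop.set()
```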
3 Convergence analysis
In this section, we present convergence results for the proposed algorithms. First, we analyze the non-parallel Algorithm 1. We show that the objective value and the constraint residual converge, in probability, to the optimal value and to zero, respectively. In addition, we establish a sublinear convergence rate result based on an averaged point. Then, by bounding a cross term involving the delayed block gradient, we establish similar results for the async-parallel Algorithm 2.
Throughout our analysis, we make the following assumptions.
Assumption 1 (Existence of a solution)
There exists a pair of primal-dual solutions $(\mathbf{x}^*, \mathbf{y}^*)$ such that $\mathbf{A}\mathbf{x}^* = \mathbf{b}$ and $\mathbf{0} \in \nabla f(\mathbf{x}^*) + \partial g(\mathbf{x}^*) + \mathbf{A}^\top \mathbf{y}^*$, where $g(\mathbf{x}) = \sum_{i=1}^m g_i(\mathbf{x}_i)$.
Assumption 2 (Gradient Lipschitz continuity)
There exist constants $L_i$'s and $L_f$ such that for any $\mathbf{x}$, $\mathbf{y}$, and $i$,
$\|\nabla_i f(\mathbf{x}) - \nabla_i f(\mathbf{y})\| \le L_i \|\mathbf{x} - \mathbf{y}\|$
and
$\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \le L_f \|\mathbf{x} - \mathbf{y}\|.$
Denote $L_{\max} = \max_i L_i$. Then under the above assumption, it holds that
(10)  $f(\mathbf{y}) \le f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle + \frac{L_f}{2} \|\mathbf{y} - \mathbf{x}\|^2, \quad \forall \mathbf{x}, \mathbf{y}.$
3.1 Convergence results of Algorithm 1
We first establish several lemmas, which will be used to show our main convergence results.
Lemma 3.1
Let $\{(\mathbf{x}^k, \mathbf{y}^k)\}$ be the sequence generated by Algorithm 1. Then for any $\mathbf{x}$ independent of $i_k$, it holds that
Proof. We write the term on the left-hand side as a sum of two parts. For the first part, we use the uniform distribution of $i_k$ over $[m]$ and the convexity of $f$, and for the second part, we use (10), to obtain
(11)  
(12) 
Combining the above two inequalities gives the desired result.
Lemma 3.2
For any $(\mathbf{x}, \mathbf{y})$ independent of $i_k$ such that $\mathbf{A}\mathbf{x} = \mathbf{b}$, it holds
Proof. Let $\mathbf{r}^k = \mathbf{A}\mathbf{x}^k - \mathbf{b}$ denote the residual. Then
(13)  
(14)  
(15) 
Note and . In addition, from , we have . Hence,
(16) 
Noting (16), we obtain the desired result.
Lemma 3.3
For any $\mathbf{x}$ independent of $i_k$, it holds
where $\tilde{\nabla} g_{i_k}(\mathbf{x}_{i_k}^{k+1})$ denotes a subgradient of $g_{i_k}$ at $\mathbf{x}_{i_k}^{k+1}$.
Proof. From the convexity of , it follows that
(17) 
Writing and taking the conditional expectation give
We obtain the desired result by plugging the above equation into (17).
Using the above three lemmas, we establish a fundamental inequality satisfied after each iteration of the algorithm.
Theorem 3.4 (Fundamental result)
Let $\{(\mathbf{x}^k, \mathbf{y}^k)\}$ be the sequence generated by Algorithm 1. Then for any $(\mathbf{x}, \mathbf{y})$ such that $\mathbf{A}\mathbf{x} = \mathbf{b}$, it holds
(18)  
(19)  
(20) 
where .
Proof. Since $\mathbf{x}_{i_k}^{k+1}$ is a solution to (6), there is a subgradient of $g_{i_k}$ at $\mathbf{x}_{i_k}^{k+1}$ such that
Hence,
(21) 
In the above equation, using Lemmas 3.1 through 3.3 and noting
(22) 
we have the desired result.
Now we are ready to show the convergence results of Algorithm 1.
Theorem 3.5 (Global convergence in probability)
Let $\{(\mathbf{x}^k, \mathbf{y}^k)\}$ be the sequence generated by Algorithm 1. If and , then
Proof. Note that
Hence, taking expectations on both sides of (18) and summing over $k$ yield
(23) 
Since , it follows from Young’s inequality that
In addition,
Plugging the above two equations into (3.1) and using , we have
(24) 
Letting $K \to \infty$ in the above inequality, we have from and that
which together with implies that
(25a)  
(25b) 
For any $\epsilon > 0$, it follows from Markov's inequality that
and
(26)  
(27)  
(28)  
(29)  
(30) 
where in the first inequality we have used the fact , and the last equation follows from (25) and Markov's inequality. This completes the proof.
Given any $\varepsilon > 0$ and $\sigma \in (0, 1)$, we can also estimate the number of iterations for the algorithm to produce a solution satisfying an error bound $\varepsilon$ with probability no less than $1 - \sigma$.

Definition 3.1 ($(\varepsilon, \sigma)$-solution)
Given $\varepsilon > 0$ and $\sigma \in (0, 1)$, a random vector $\mathbf{x}$ is called an $(\varepsilon, \sigma)$-solution to (1) if $\mathrm{Prob}\big(|F(\mathbf{x}) - F(\mathbf{x}^*)| \le \varepsilon\big) \ge 1 - \sigma$ and $\mathrm{Prob}\big(\|\mathbf{A}\mathbf{x} - \mathbf{b}\| \le \varepsilon\big) \ge 1 - \sigma$.
Theorem 3.6 (Ergodic convergence rate)
Proof. Since $F$ is convex, it follows from (3.1) that