 # Markov Chain Block Coordinate Descent

The method of block coordinate gradient descent (BCD) has been a powerful method for large-scale optimization. This paper considers the BCD method that successively updates a series of blocks selected according to a Markov chain. This kind of block selection is neither i.i.d. random nor cyclic. On the other hand, it is a natural choice for some applications in distributed optimization and Markov decision process, where i.i.d. random and cyclic selections are either infeasible or very expensive. By applying mixing-time properties of a Markov chain, we prove convergence of Markov chain BCD for minimizing Lipschitz differentiable functions, which can be nonconvex. When the functions are convex and strongly convex, we establish both sublinear and linear convergence rates, respectively. We also present a method of Markov chain inertial BCD. Finally, we discuss potential applications.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We consider the following minimization problem

 minimize  f(x)≡f(x1,x2,…,xN) (1.1)

where is a differentiable function (possibly nonconvex) and every () is Lipschitz with constant .

The block coordinate gradient descent (BCD) method is a popular approach that can take the advantage of the coordinate structure in (1.1). The method updates one coordinate, or a block of coordinates, at each iteration, as follows. For , choose and compute

 xk+1ik=xkik−γ∇ikf(xk), (1.2)

where is a step size; for remaining , we keep .

The coordinate gradient descent method was introduced in . The random selection rule (i.i.d. over the iterations) appeared in [24, 15]. In the same paper , the method of accelerated coordinate gradient descent was proposed, and it was later analyzed in  for both convex and strongly convex functions. Both [15, 10] select a coordinate

with probability proportional to the Lipschitz constant

of over free ; the rate is optimal when ’s are equal. An improved random sampling method with acceleration was introduced in , which further decreases the complexity when some ’s are significantly smaller than the rest. This method was further generalized in  to an asynchronous parallel method, which obtains parallel speedup to the accelerated rate. In another line of work,  combines stochastic coordinate gradient descent with mirror descent stochastic approximation, where a random data mini-batch is taken to update a randomly chosen coordinate. This is improved in , where the presented method uses each random mini-batch to update all the coordinates in a sequential fashion. Besides stochastic selection rules, there has been work of the cyclic sampling rule. The work  studies its convergence under the convex and nonconvex settings, and  proves sublinear and linear rates in the convex setting. The constants in these rates are worse than standard gradient descent though. For a family of problems,  obtains improved rates to match standard gradient descent (and their results also apply to the random shuffling rule). The greedy sampling rule has also been studied in the literature but unrelated to this paper. Let us just mention some references [11, 20, 12, 16]. Finally,  explores the family of problems with the structure that enables us to update a block coordinate at a much lower cost than updating all blocks in batch.

This paper introduces the Markov-chain select rule. We call our method Markov-chain block gradient coordinate descent (MC-BCD). In this method, is selected according to a Markov chain; hence, unlike the above methods, our choice is neither stochastic i.i.d. (with respect to ) nor deterministic. Specifically, there is an underlying strongly-connected graph with the set of vertices and set of edges . Each node can compute and update . We call a walk of if every . If the walk is deterministic and visits every node at least once in every iterations, then is essentially cyclic; if every is chosen randomly from , then we obtain MC-BCD, which is the focus of this paper. To the best of our knowledge, MC-BCD is new.

### 1.1 Motivations

Generally speaking, one does not use MC-BCD to accelerate i.i.d. random or cyclic BCD but for other reasons: When we are forced to take Markov chain samples because cyclic and stochastic samples are not available; Or, although cyclic and stochastic samples are available, it is easier or cheaper to take Markov chain samples. We briefly present some examples below to illustrate those motivations. Some examples are tested numerically in Section 6 below.

Markov Chain Dual Coordinate Ascent (MC-DCA). The paper  proposes the Stochastic Dual Coordinate Gradient Ascent (SDCA) to solve

 minimizew∈Rd{λ2∥w∥2+1NN∑i=1ℓi(w⊤ai)}, (1.3)

where is the regularization parameter,

is the data vector associated with

th sample, and

is a convex loss function. Its dual problem can be formulated as

 minimizeα∈RN{D(α):=λ2∥Aα∥2+1NN∑i=1ℓ∗i(−αi)}, (1.4)

where with column , and is the conjugate of . By applying stochastic BCD to (1.4

), SDCA can reach comparable or better convergence rate than stochastic gradient descent. We employ this idea and propose MC-DCA : in the

th iteration, while if ,

 αk+1ik=αkik−γ(λA⊤ik(Aαk)−∇ℓ∗ik(−αkik)N) (1.5)

where is a Markov chain.

The Markov chain must come from somewhere. Consider that the data are stored in a distributed fashion over a graph. Only when the graph is complete can we efficiently sample i.i.d. randomly and access ; only when the graph has a Hamiltonian cycle can we visit the data in a cyclic fashion without visiting any node twice in each cycle. MC-DCA works under a much weaker assumption: as long as the graph is connected. Specifically, let a token hold and vector , and let the token randomly walk through the nodes in the network; each node holds data and can compute ; as the token arrives at node , the node accesses and and computes and , which are used to update and update .

Future rewards in a Markov decision process. This example is a finite-state ( states) Discounted Markov Decision Process (DMDP) for which we can compute the transition probability from any current state to the next state, or quickly approximate it. We can use MC-BCD to compute the expected future reward vector.

Let us describe the DMDP. Entering any state , we receive an award and then take an action according to a given policy (a state-to-action mapping). After the action is taken, the system enters a state , , with probability . The transition matrix depends on the action taken and thus depends on . The reward discount factor is . Our goal is to evaluate the expected future rewards of all states for fixed . This step dominates the per-step computation of the policy-update iteration , which iteratively updates .

For each state , the expected future reward is given as

 vi:=E{it}[+∞∑t=1γtrit∣i0=i],

where the state sequence is a Markov chain induced by the transition matrix and is the reward received at time . The corresponding Bellman equation is , the matrix form of which is

 v=r+γPv, (1.6)

where and .

When is huge, solving (1.6) is difficult. Often we have memory to store a few -vectors (also, can be reduced by dimension reduction) but not an -matrix. Therefore, we can store the vector only temporarily in each iteration. In the case where the physical principles or the rule of game are given, such as in the Tetris game, we can compute the transition probabilities explicitly. Consider another scenario where can not be computed explicitly but can be approximated by Monte-Carlo simulations. The simulation of transition at just one state is much cheaper than that of all states. In both scenarios, we have access to . This allows us to apply MC-BCD to solve a dual optimization problem below to compute the future reward vector ,

 minimizev{12N∥(IN−γP)v−r∥2+λ2∥v∥2}, (1.7)

where is a fixed regularization parameter. This corresponds to setting in (1.4). Note that in DMDP, one cannot transit from the current state to an arbitrary . Therefore, standard cyclic and stochastic BCD is not applicable.

Running the MC-DCA iteration (1.5) requires the vectors and . We update by maintaining a sequence as follows: initialize (vector zero) and thus ; in th iteration, we compute , where the equality follows since and only differ over their th component. This update is done without accessing the full matrix .

As we showed above, running our algorithm to compute the expected future award only requires memory. Also the algorithm iterates simultaneously while the system samples its state trajectory. Suppose each policy can be stored in memory (e.g., deterministic policy) and updating using a computed also needs memory; then, we can a policy-update iteration with memory.

Risk minimization by dual coordinate ascent over a tricky distribution. Let be a statistical sample space with distribution , and is a proper, closed, strongly convex function. Consider the following regularized expectation minimization problem

 (1.8)

Since the objective is strongly convex, its dual problem is smooth. If it is easy to sample data from , (1.8) can be solved by SDCA, which uses i.i.d. samples. When the distribution is difficult to sample directly but has a faster Markov Chain Monte Carlo (MCMC) sampler, we can apply MC-DCA to this problem.

Multi-agent resource-constrained optimization. Consider the multi-agent optimization problem of agents :

 minimizef(x1,x2,…,xN)+β2∥max{Ax−b,0}∥2, (1.9)

where is the cost function, is the resource vector, and penalizes any over usage of resources. Define a graph, in which every node is an agent and every edge connects a pair of agents that either depend on one another in or share at least one resource. In other words, the objective function (1.9) has a graph structure in that computing the gradient of requires only the information of the adjacent agents of .

MC-BCD becomes a decentralized algorithm: after an agent updates its decision variable , it broadcasts to one of its neighbors, and activates it to run next step. In this process, form a random walk over the graph and, therefore, is a Markov chain. As long as the network is connected, a central coordinator is no more necessary. However, sampling i.i.d. randomly requires a central coordinator and will consume more communication since it may communicate beyond neibors. Also selecting essentially cyclicly requires a tour of the graph, which relys on the knowledge of the graph topology.

When is differentiable with Lipschitz continuous gradient, so is the objective function. We apply MC-BCD to (1.9) to obtain

 xk+1ik=xkik−γ∇fik(xk)−γβA⊤ikmax{Axk−b,0}, (1.10)

where is a Markov chain. We assume that agent can access and and compute . Similar to the example for computing expected future reward above, can be updated along with the iterations so no node needs the access to the full matrix . Alternatively, we can use a central governor which receives updated and from agent and sends the data to for the next iteration.

Decentralized optimization. This example is taken from . Again consider the empirical risk minimization problem (1.3). We consider solving its dual problem (1.4) in a network by assigning each sample to a node. A parallel distributed algorithm will update for all the components, , concurrently.

If the network has a central server, then each node sends its to the central server, which forms and then broadcasts it back to the nodes.

If the network does not have a central server, then we can form either running a decentralized gossip algorithm or calling an all-reduce communication. The former does not require the knowledge of the network topology and is an iterative method. The latter requires the topology and takes at least rounds and at least total communication, even slower when the network is sparse. An alternative approach is to create a token that holds and follows a random walk in the network. The token acts like a traveling center. When the token arrives at a node , the node updates its using the token’s , and this local update leads to a sparse change to ; updating requires no access to for . The method in  applies this idea to an ADMM formulation of the decentralized consensus problem (rather than BCD in this paper) and shows that total communication is significantly reduced.

### 1.2 Difficulty of the convergence proofs: biased expectation

Sampling according to a Markov chain is neither (essential) cyclic nor i.i.d. stochastic. No matter how large is, it is still possible that a node is never visited during some iterations. Unless the graph is a complete graph (every node is directly connected with every other node), there are nodes without an edge connecting them, i.e., . Hence, given , it is impossible to have . So, no matter how one selects the sampling probability and step size , we generally do not have for any constant , where . This, unfortunately, breaks down all the existing analyses of stochastic BCD since they all need a non-vanishing probability for every block to be selected.

### 1.3 Proposed method and contributions

Given a graph , MC-BCD is written mathematically as

 sample ik∈{j:(ik−1,j)∈E}∼Pik−1,j(k), (1.11a) compute xk+1ik=xkik−γ∇ikf(xk), (1.11b)

where is a constant stepsize, and is the transition matrix in the th step (details given in Sec. 2), and we maintain for all . The initial point can be chosen arbitrarily. The block can be chosen either deterministically or randomly. The following diagram illustrates the influential relations of

and the random variable sequences

and :

 i0−−−−→i1−−−−→i2−−−−→i3−−−−→…⏐⏐↓⏐⏐↓⏐⏐↓⏐⏐↓x0−−−−→x1−−−−→x2−−−−→x3−−−−→x4−−−−→… (1.12)

To our best knowledge, (1.11) did not appear before and, as explained above, is not a special case of existing BCD analyses. When the Markov chain has a finite mixing time and problem (1.1) has a lower bounded objective, we show that using ensures . The concept of mixing time is reviewed in the next section. In addition, when is convex and coercive, we show that at the rate of with a hidden constant related to the mixing time. Note that running the algorithm itself requires no knowledge about the mixing time of the chain. Furthermore, when is (restricted) strongly convex, then the rate is improved to be linear, unsurprisingly. Although we do not develop any Nesterov-kind acceleration in this paper, a heavy-ball-kind inertial MC-BCD is presented and analyzed because the additional work is quite small. When the computation is noisy, as long as the noise is square summable (which is weaker than being summable), MC-BCD still converges.

### 1.4 Possible future work

We mention some future improvements of MC-BCD, which will require significantly more work to achieve. First, it is possible to accelerate MC-BCD using both Nesterov-kind momentum and optimizing the transition probability. Second, it is important to parallelize MC-BCD, for example, to allow multiple random walks to simultaneously update different blocks [22, 7], even in an asynchronous fashion like [13, 19, 27]. Third, it is interesting to develop a primal-dual type MC-BCD, which would apply to a model-free DMDP along a single trajectory. Yet another line of work applies block coordinate update to linear and nonlinear fixed-point problems [18, 17, 5] because it can solve optimization problems in imaging and conic programming, which are equipped with nonsmooth, nonseparable objectives, and constraints.

## 2 Preliminaries

### 2.1 Markov chain

We recall some definitions and propertiesof the Markov chain that we use in this paper.

###### Definition 1 (finite-state (time-homogeneous) Markov chain).

A stochastic process in a finite state space is called Markov chain with transition matrices if, for , , and , we have

 P(Xk+1=j∣X0=i0,X1=i1,…,Xk=i)=P(Xk+1=j∣Xk=i)=Pi,j(k). (2.1)

The chain is time-homogeneous if for some constant matrix .

Let the probability distribution of

be denoted as the row vector , that is, . Each satisfies Obviously, it holds . When the Markov chain is time-homogeneous, we have and , for , where is the th power of .

###### Definition 2.

A time-homogeneous Markov chain is irreducible if, for any , there exists such that . State is said to have a period if whenever is not a multiple of and is the greatest such integer. If , then we say state is aperiodic. If every state is aperiodic, the Markov chain is said to be aperiodic.

Any time-homogeneous, irreducible, and aperiodic Markov chain has a stationary distribution with and , and . This is a sufficient but not necessary condition to have such . If the Markov fails to be time-homogeneous111The time-homogeneous, irreducible, and aperiodic Markov chain is widely used; however, in practical problems, the Markov chain may not satisfy the time-homogeneous assumption. For example, in a mobile, if the network connectivity structure is changing all the time, then the set of the neighbors of an agent is time-varying ., it may still have a stationary distribution under additional assumptions.

In this paper, we make the following assumption, which always holds for time-homogeneous, irreducible, and aperiodic Markov chain and may hold for more general Markov chains.

###### Assumption 1.

The Markov chain has the transition matrices and the stationary distribution . Define

 Φ(m,n):=P(m)P(m+1)⋯P(m+n),m,n≥0,Π∗:=⎡⎢ ⎢ ⎢ ⎢⎣π∗π∗⋮π∗⎤⎥ ⎥ ⎥ ⎥⎦∈RN×N,

that is, every row of is . For each , there exists such that

 ∥Φ(m,n)−Π∗∥2<ϵ,for all m≥0,n≥\uptauϵ−1. (2.2)

Here, is called a mixing time, which specifies how long a Markov chain evolves close to its stationary distribution. The literature has a thorough investigation of various kinds of mixing times . Previous mixing time focuses on bounding the difference between and the stationary distribution . Our version is just easier to use in the analysis.

For a time-homogeneous, irreducible, and aperiodic Markov chain with the transition matrix , . It is easy to have as , where

denotes the second largest eigenvalue of

(positive and smaller than 1) . Besides the time-homogeneous, irreducible, and aperiodic Markov chain, some other non-time-homogeneous chains can also have a geometrically-convergent . An example is presented in .

### 2.2 Notation and constants

The following notation is used throughout this paper:

 Δk:=xk+1−xk (2.3)

In MC-BCD iteration, only the block of is nonzero; other blocks are zero. Let be the minimal stationary distribution, i.e.,

 π∗min:=min1≤i≤N{π∗i}. (2.4)

For any closed proper function , denotes the set , and denotes the norm. Through the proofs, we use the following sigma algebra

 χk:=σ(x1,x2,…,xk,i0,i1,…,ik−1).

Let Assumption 1 hold. In our proofs, we let be the -mixing time, i.e.,

 ∥Φ(m,n)−Π∗∥2≤π∗min2,  %whenever  n≥\uptau−1. (2.5)

With direct calculations,

 π∗min2≤[Φ(m,n)]i,j,~{}for~{}any~{}i,j∈{1,2,…,N},n≥\uptau−1. (2.6)

If the Markov chain promises a geometric rate, then we have

 \uptau=O(ln2π∗min). (2.7)

It is worth mentioning that, for a complete graph where all nodes are connected to each other, we have a Markov chain with , and our MC-BCD will reduce to random BCD .

## 3 Markov chain block coordinate gradient descent

In this section, we study the convergence properties of the MC-BCD for problem (1.1). The discussion covers both convex and nonconvex cases. We show that the MC-BCD can converge if the stepsize is taken as the same as that in traditional BCD. For convex problems, sublinear convergence rate is established, and for strongly convex cases, linear convergence is shown.

Our analysis is conducted to an inexact version of the MC-BCD, which allows error in computing partial gradients:

 xk+1j={xkj−γ(∇jf(xk)+ϵk), if j=ikxkj, if j≠ik, (3.1)

where is sampled in the same way as in (1.11a), and denotes the error in the th iteration. If vanishes, the above updates reduce to the MC-BCD in (1.11).

### 3.1 Convergence analysis

The results in this section applies to both convex and nonconvex cases, and they rely on the following assumption.

###### Assumption 2.

The set of minimizers of function is nonempty, and is Lipschitz continuous about with constant for each , namely,

 ∥∇if(x)−∇if(x+αei)∥≤L∥α∥,∀x∈RN,∀α∈R, (3.2)

where denotes the th standard basis vector in . In addition, is also Lipschitz continuous about with constant , namely,

 ∥∇f(x)−∇f(x+s)∥≤Lr∥s∥,∀x∈RN,∀s∈RN. (3.3)

We call the condition number.

When (3.2) holds for each , we have

 f(x+dei)≤f(x)+⟨∇if(x),dei⟩+L2∥d∥2. (3.4)

Lemma 1 below is very standard. It bounds the square summation of by initial objective error and iteration errors. Lemmas 2 and 3 are new; they study the bounds on because the sampling bias prevents us from directly bounding . The bounds in these three lemmas are combined in Theorem 1 to get the convergence rates of .

###### Lemma 1.

Under Assumption 2, let be generated by the inexact MC-BCD (3.1) with any constant stepsize . Then for any ,

 k∑t=0∥Δt∥2≤4γ2−Lγ⋅(f(x0)−minf)+4γ2(2−Lγ)2k∑t=0∥ϵt∥2. (3.5)
###### Proof.

Recalling the definition of in (2.3) and noting for all , we have:

 ⟨Δk,∇f(xk)⟩=⟨xk+1ik−xkik,∇ikf(xk)⟩=−1γ∥Δk∥2+⟨ϵk,xkik−xk+1ik⟩, (3.6)

where we have used the update rule in (3.1) to obtain the second equality. By (3.4) and (3.6), it holds that

 f(xk+1) ≤f(xk)+⟨Δk,∇f(xk)⟩+L2∥Δk∥2 =f(xk)+(L2−1γ)∥Δk∥2+⟨ϵk,xkik−xk+1ik⟩ (3.7) a)≤f(xk)+(L4−12γ)∥Δk∥2+γ∥ϵk∥22−Lγ, (3.8)

where is from the Young’s inequality . Summing (3.8), rearranging terms, and noting , we obtain the desired result and complete the proof. ∎

Also, we can bound partial gradient by the iterate change and error term as follows.

###### Lemma 2.

Assume (3.3). Let be generated by the inexact MC-BCD (3.1). Then for , it holds

 ∥∇ikf(xk−\uptau+1)∥2≤2L2r⋅(\uptau−1)⋅k−1∑d=k−\uptau+1∥Δd∥2+4γ2∥Δk∥2+4∥ϵk∥2. (3.9)
###### Proof.

By the update rule in (3.1) and the definition of , we have . Applying the triangle inequality to the above inequality yields

 ∥∇ikf(xk−\uptau+1)∥2≤ 2∥∇ikf(xk−\uptau+1)−∇ikf(xk)∥2+2∥∥ ∥∥Δkikγ+ϵk∥∥ ∥∥2 (3.10) ≤ 2∥∇ikf(xk−\uptau+1)−∇ikf(xk)∥2+4γ2∥Δk∥2+4∥ϵk∥2.

Note . Hence, it follows from the triangle inequality and the Lipschitz continuity of in (3.3) that

 ∥∇ikf(xk−\uptau+1)∥2 ≤2∥∇f(xk−\uptau+1)−∇f(xk)∥2+4γ2∥Δk∥2+4∥ϵk∥2 ≤2⋅(\uptau−1)⋅k−1∑d=k−\uptau+1∥∇f(xd+1)−∇f(xd)∥2+4γ2∥Δk∥2+4∥ϵk∥2 ≤2L2r⋅(\uptau−1)⋅k−1∑d=k−\uptau+1∥Δd∥2+4γ2∥Δk∥2+4∥ϵk∥2,

which gives the desired result. ∎

###### Remark 1.

If , then starting from (3.10) and by the same arguments, we can have

 ∥∇ikf(xk−\uptau+1)∥2≤2L2r⋅(\uptau−1)⋅k−1∑d=k−\uptau+1∥Δd∥2+2γ2∥Δk∥2. (3.11)

###### Lemma 3.

Let (2.5) hold. Then it holds

 E(∥∇ikf(xk−\uptau+1)∥2∣χk−\uptau+1)≥π∗min2∥∇f(xk−\uptau+1)∥2. (3.12)
###### Proof.

Taking conditional expectation, we have

 E(∥∇ikf(xk−\uptau+1)∥2∣χk−\uptau+1)=N∑j=1∥∇jf(xk−\uptau+1)∥2⋅P(ik=j∣χk−\uptau+1).

By the Markov property, it holds . Then the desired result is obtained from (2.6) and the fact . ∎

###### Theorem 1.

Let Assumptions 1 and 2 hold and be generated by the inexact MC-BCD (3.1) with any constant stepsize . We have the following results:

1. Square summable noise: If the noise sequence satisfy . Then,

 limk→∞E∥∇f(xk)∥=0, (3.13)

and

 E[min1≤t≤k∥∇f(xt)∥2]≤2(k+1)π∗min[C1(\uptau)⋅(f(x0)−minf)+(C2(\uptau)+4)E]. (3.14)
2. Non-square-summable noise: If for some positive number , then

 (3.15)

The constants used above are

 C1(\uptau):=4γ2−Lγ(2L2r(\uptau−1)2+4γ2),C2(\uptau):=4γ2(2−Lγ)2(2L2r(\uptau−1)2+4γ2). (3.16)
###### Proof.

In the case of square summable noise, we have as . In addition, it follows from (3.5) that and thus as . Hence, (3.9) implies

 limk→∞∥∇ikf(xk−\uptau+1)∥2=0. (3.17)

Taking expectation on (3.17) and using the Lebesgue dominated convergence theorem, we have

 limk→∞E∥∇ikf(xk−\uptau+1)∥2=0.

Hence from (3.12), it follows that

 limk→∞E∥∇f(xk)∥2=limk→∞E∥∇f(xk−\uptau+1)∥2≤2π∗minlimk→∞E∥∇ikf(xk−\uptau+1)∥2=0,

and thus (3.13) holds by the Jensen’s inequality .

Note for any . Therefore, summing both sides of (3.9) yields

 k∑t=\uptau−1∥∇itf(xt−\uptau+1)∥2≤ 2L2r(\uptau−1)2k−1∑d=0∥Δd∥2+4γ2k∑t=\uptau−1∥Δt∥2+4k∑t=\uptau−1∥ϵt∥2 ≤ (2L2r(\uptau−1)2+4γ2)k∑t=0∥Δt∥2+4k∑t=\uptau−1∥ϵt∥2. (3.18)

The inequality in (3.18) together with (3.5) and the assumption on gives

 ∞∑t=\uptau−1∥∇itf(xt−\uptau+1)∥2≤C1(\uptau)⋅(f(x0)−minf)+(C2(\uptau)+4)E, (3.19)

where and are defined in (3.16). In addition, we have

 (k+1)⋅E[min0≤t≤k∥∇f(xt)∥2]≤k∑t=0E∥∇f(xt)∥2=k+\uptau−1∑t=\uptau−1E∥∇f(xt−\uptau+1)∥2≤2π∗mink+\uptau−1∑t=\uptau−1E∥∇itf(xt−\uptau+1)∥2, (3.20)

where the last inequality follows from (3.12). Now the result in (3.14) is obtained from the above inequality together with that in (3.19).

In the case of , we have from (3.5) and (3.18) that

 k∑t=\uptau−1∥∇itf(xt−\uptau+1)∥2≤(2L2r(\uptau−1)2+4γ2)(4γ2−Lγ⋅(f(x0)−minf)+4γ2(k+1)S(2−Lγ)2)+4(k−