1 Introduction
We consider the following minimization problem
(1.1) \min_{x \in \mathbb{R}^n} f(x), \quad x = (x_1, \dots, x_N),
where f is a differentiable function (possibly nonconvex), x is partitioned into N coordinate blocks, and every partial gradient \nabla_i f (i = 1, \dots, N) is Lipschitz with constant L_i.
The block coordinate gradient descent (BCD) method is a popular approach that can take advantage of the coordinate structure in (1.1). The method updates one coordinate, or a block of coordinates, at each iteration, as follows. For k = 0, 1, \dots, choose i_k \in \{1, \dots, N\} and compute
(1.2) x_{i_k}^{k+1} = x_{i_k}^{k} - \gamma_k \nabla_{i_k} f(x^k),
where \gamma_k > 0 is a step size; for the remaining blocks i \neq i_k, we keep x_i^{k+1} = x_i^{k}.
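As a concrete illustration, the update (1.2) can be sketched in a few lines. The quadratic objective, block partition, cyclic selection rule, and step size below are illustrative choices, not prescribed by the paper.

```python
import numpy as np

# A minimal sketch of block coordinate gradient descent (1.2) on a toy
# quadratic f(x) = 0.5 * x^T A x - b^T x, whose gradient is A x - b.
rng = np.random.default_rng(0)
n, block_size = 8, 2
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)              # positive definite -> unique minimizer
b = rng.standard_normal(n)
blocks = [np.arange(i, i + block_size) for i in range(0, n, block_size)]

x = np.zeros(n)
gamma = 1.0 / np.linalg.norm(A, 2)   # conservative constant step size
for k in range(2000):
    i = blocks[k % len(blocks)]      # cyclic block selection (illustrative)
    grad_i = A[i] @ x - b[i]         # partial gradient over block i
    x[i] -= gamma * grad_i           # update only block i; others unchanged

x_star = np.linalg.solve(A, b)
print(np.linalg.norm(x - x_star))
```

After enough cycles, the iterate approaches the minimizer even though each step touches only one block.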
The coordinate gradient descent method was introduced in [29]. The random selection rule (i.i.d. over the iterations) appeared in [24, 15]. In the same paper [15], an accelerated coordinate gradient descent method was proposed, and it was later analyzed in [10] for both convex and strongly convex functions. Both [15, 10] select a coordinate i with probability proportional to the Lipschitz constant L_i of \nabla_i f; the rate is optimal when the L_i's are equal. An improved random sampling method with acceleration was introduced in [1], which further decreases the complexity when some L_i's are significantly smaller than the rest. This method was further generalized in [8] to an asynchronous parallel method, which obtains parallel speedup on top of the accelerated rate. In another line of work, [6] combines stochastic coordinate gradient descent with mirror descent stochastic approximation, where a random data minibatch is taken to update a randomly chosen coordinate. This is improved in [31], where the presented method uses each random minibatch to update all the coordinates in a sequential fashion. Besides stochastic selection rules, there has been work on the cyclic selection rule. The work [30] studies its convergence in the convex and nonconvex settings, and [2] proves sublinear and linear rates in the convex setting, although the constants in these rates are worse than those of standard gradient descent. For a family of problems, [26] obtains improved rates that match standard gradient descent (and their results also apply to the random shuffling rule). The greedy selection rule has also been studied in the literature but is less related to this paper; we mention the references [11, 20, 12, 16]. Finally, [17] explores the family of problems whose structure enables us to update a block coordinate at a much lower cost than updating all blocks in batch.

This paper introduces the Markov chain selection rule. We call our method Markov chain block coordinate gradient descent (MCBCD). In this method, i_k is selected according to a Markov chain; hence, unlike the above methods, our choice is neither i.i.d. stochastic (with respect to k) nor deterministic. Specifically, there is an underlying strongly connected graph G = (V, E) with vertex set V = \{1, \dots, N\} and edge set E. Each node i can compute \nabla_i f and update x_i. We call (i_0, i_1, \dots) a walk of G if every (i_k, i_{k+1}) \in E.
If the walk is deterministic and visits every node at least once in every T iterations, then (i_k) is essentially cyclic; if every i_{k+1} is chosen randomly from the neighbors \{j : (i_k, j) \in E\}, then we obtain MCBCD, which is the focus of this paper. To the best of our knowledge, MCBCD is new.
1.1 Motivations
Generally speaking, one does not use MCBCD to accelerate i.i.d. random or cyclic BCD but for other reasons: either we are forced to take Markov chain samples because cyclic and stochastic samples are not available, or, although cyclic and stochastic samples are available, it is easier or cheaper to take Markov chain samples. We briefly present some examples below to illustrate these motivations. Some examples are tested numerically in Section 6 below.
Markov Chain Dual Coordinate Ascent (MCDCA). The paper [25] proposes Stochastic Dual Coordinate Ascent (SDCA) to solve
(1.3) \min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \phi_i(a_i^\top w) + \frac{\lambda}{2} \|w\|^2,
where \lambda > 0 is the regularization parameter, a_i \in \mathbb{R}^d is the data vector associated with the ith sample, and \phi_i is a convex loss function. Its dual problem can be formulated as
(1.4) \max_{y \in \mathbb{R}^n} \; D(y) := -\frac{1}{n} \sum_{i=1}^{n} \phi_i^*(-y_i) - \frac{\lambda}{2} \Big\| \frac{1}{\lambda n} A y \Big\|^2,
where A = [a_1, \dots, a_n] is the data matrix with columns a_i, and \phi_i^* is the conjugate of \phi_i. By applying stochastic BCD to (1.4), SDCA can reach a comparable or better convergence rate than stochastic gradient descent. We employ this idea and propose MCDCA: in the kth iteration,
(1.5) y_{i_k}^{k+1} = y_{i_k}^{k} + \gamma \nabla_{i_k} D(y^k), \quad \text{while } y_j^{k+1} = y_j^{k} \text{ if } j \neq i_k,
where (i_k)_{k \ge 0} is a Markov chain.
The Markov chain must come from somewhere. Consider that the data are stored in a distributed fashion over a graph. Only when the graph is complete can we efficiently sample i.i.d. randomly and access the sampled data; only when the graph has a Hamiltonian cycle can we visit the data in a cyclic fashion without visiting any node twice in each cycle. MCDCA works under a much weaker assumption: the graph only needs to be connected. Specifically, let a token hold the dual variable y and the vector w = \frac{1}{\lambda n} A y, and let the token randomly walk through the nodes in the network; each node i holds the data a_i and can evaluate its local loss conjugate \phi_i^*; as the token arrives at node i_k, the node accesses y_{i_k} and w and computes the quantities needed for (1.5), which are used to update y_{i_k} and then w.
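The token scheme can be sketched for ridge regression (squared loss) on a path graph. The closed-form coordinate step below is the standard SDCA step for squared loss; the path-graph topology, lazy random walk, and parameter values are illustrative assumptions.

```python
import numpy as np

# MCDCA sketch: each node i of a path graph holds one sample (a_i, b_i);
# a token carrying (y, w) performs a lazy random walk, and the visited
# node updates its dual coordinate y_i, then keeps w = A^T y / (lam*n)
# consistent via a rank-one correction.
rng = np.random.default_rng(1)
n, d, lam = 20, 5, 0.1
A = rng.standard_normal((n, d))      # row i = a_i, held by node i
b = rng.standard_normal(n)

y = np.zeros(n)                      # dual variables, one per node
w = np.zeros(d)                      # primal iterate carried by the token
i = 0                                # token starts at node 0
for k in range(20000):
    # exact maximization of the dual over y_i (squared-loss SDCA step)
    delta = (b[i] - A[i] @ w - y[i]) / (1.0 + A[i] @ A[i] / (lam * n))
    y[i] += delta
    w += delta * A[i] / (lam * n)    # sparse update: only a_i is needed
    # token walks to itself or a neighbor on the path (lazy walk)
    i = int(np.clip(i + rng.choice([-1, 0, 1]), 0, n - 1))

# closed-form ridge solution for comparison
w_star = np.linalg.solve(A.T @ A / n + lam * np.eye(d), A.T @ b / n)
print(np.linalg.norm(w - w_star))
```

No node ever needs the full data matrix; each visit touches only the local sample and the token's state.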
Future rewards in a Markov decision process. This example is a finite-state (S states) Discounted Markov Decision Process (DMDP) for which we can compute the transition probability from any current state to the next state, or quickly approximate it. We can use MCBCD to compute the expected future reward vector.
Let us describe the DMDP. Entering any state s, we receive a reward r_s and then take an action according to a given policy \pi (a state-to-action mapping). After the action is taken, the system enters a state s' \in \{1, \dots, S\} with probability P_{s,s'}. The transition matrix P depends on the actions taken and thus depends on \pi. The reward discount factor is \gamma \in (0, 1). Our goal is to evaluate the expected future rewards of all states for fixed \pi. This step dominates the per-step computation of the policy-update iteration [28], which iteratively updates \pi.
For each state s, the expected future reward is given as
v_s = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t r_{s_t} \,\Big|\, s_0 = s \Big],
where the state sequence (s_t) is a Markov chain induced by the transition matrix P and r_{s_t} is the reward received at time t. The corresponding Bellman equation is v_s = r_s + \gamma \sum_{s'} P_{s,s'} v_{s'}, the matrix form of which is
(1.6) v = r + \gamma P v,
where v = (v_1, \dots, v_S)^\top and r = (r_1, \dots, r_S)^\top.
When S is huge, solving (1.6) directly is difficult. Often we have memory to store a few S-vectors (also, S can be reduced by dimension reduction) but not an S \times S matrix. Therefore, we can store a row of P only temporarily in each iteration. In the case where the physical principles or the rules of the game are given, such as in the Tetris game, we can compute the transition probabilities explicitly. Consider another scenario where a row of P cannot be computed explicitly but can be approximated by Monte Carlo simulations. Simulating the transition at just one state is much cheaper than simulating at all states. In both scenarios, we have access to the row of P at the current state. This allows us to apply MCBCD to solve a dual optimization problem below to compute the future reward vector v,
(1.7) 
where \lambda > 0 is a fixed regularization parameter. This corresponds to a particular setting of (1.4). Note that in a DMDP, one cannot transition from the current state to an arbitrary state. Therefore, standard cyclic and stochastic BCD are not applicable.
Running the MCDCA iteration (1.5) requires the dual vector y^k and an auxiliary product vector. We update the latter by maintaining a sequence as follows: initialize y^0 = 0 (the zero vector), so the product vector is also zero; in the kth iteration, we update it incrementally, where the incremental identity holds since y^{k+1} and y^k differ only over their i_kth component. This update is done without accessing the full matrix P.
As we showed above, running our algorithm to compute the expected future reward only requires O(S) memory. Also, the algorithm iterates simultaneously while the system samples its state trajectory. Suppose each policy \pi can be stored in O(S) memory (e.g., a deterministic policy) and updating \pi using a computed v also needs O(S) memory; then, we can run a policy-update iteration with O(S) memory.
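The one-row-of-P-per-step idea can be sketched as follows. The paper solves the regularized dual (1.7); the sketch below instead uses a closely related asynchronous Bellman update at the visited state, which likewise needs only O(S) memory and the current row of P per step. The random transition matrix and rewards are illustrative.

```python
import numpy as np

# Evaluate v solving (I - gamma*P) v = r by updating one coordinate per
# step, where the coordinate is the state visited by the simulated
# chain (asynchronous fixed-point variant in the spirit of MCBCD).
rng = np.random.default_rng(2)
S, gamma = 30, 0.9
P = rng.random((S, S))
P /= P.sum(axis=1, keepdims=True)      # row-stochastic transition matrix
r = rng.random(S)                      # per-state rewards

v = np.zeros(S)
s = 0
for k in range(50000):
    v[s] = r[s] + gamma * P[s] @ v     # Bellman update at the visited state
    s = rng.choice(S, p=P[s])          # chain moves; only row P[s] is used

v_star = np.linalg.solve(np.eye(S) - gamma * P, r)
print(np.abs(v - v_star).max())
```

Because the Bellman operator is a gamma-contraction and every state is visited infinitely often, the asynchronous updates converge to the exact reward vector.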
Risk minimization by dual coordinate ascent over a tricky distribution. Let \Xi be a statistical sample space equipped with a distribution, and let the regularizer be a proper, closed, strongly convex function. Consider the following regularized expectation minimization problem
(1.8) 
Since the objective is strongly convex, its dual problem is smooth. If it is easy to draw i.i.d. samples from the distribution, then (1.8) can be solved by SDCA. When the distribution is difficult to sample from directly but admits a fast Markov chain Monte Carlo (MCMC) sampler, we can apply MCDCA to this problem.
Multi-agent resource-constrained optimization. Consider the multi-agent optimization problem of multiple agents [4]:
(1.9) 
where the objective consists of each agent's cost function, a resource vector, and a term that penalizes any overuse of resources. Define a graph in which every node is an agent and every edge connects a pair of agents that either depend on one another in the cost or share at least one resource. In other words, the objective function (1.9) has a graph structure in that computing the partial gradient associated with an agent requires only the information of that agent's adjacent agents.
MCBCD becomes a decentralized algorithm: after an agent i_k updates its decision variable, it broadcasts the update to one of its neighbors and activates that neighbor to run the next step. In this process, i_0, i_1, \dots form a random walk over the graph and therefore a Markov chain. As long as the network is connected, a central coordinator is no longer necessary. In contrast, sampling i.i.d. randomly requires a central coordinator and consumes more communication, since messages may need to travel beyond immediate neighbors. Also, selecting agents in an essentially cyclic fashion requires a tour of the graph, which relies on knowledge of the graph topology.
When each cost function is differentiable with a Lipschitz continuous gradient, so is the objective function. We apply MCBCD to (1.9) to obtain
(1.10) 
where (i_k) is a Markov chain. We assume that agent i_k can access its local data and compute its partial gradient. Similar to the example of computing the expected future reward above, the shared quantities can be updated along with the iterations, so no node needs access to the full problem data. Alternatively, we can use a central governor that receives the updated variables from agent i_k and sends the data to agent i_{k+1} for the next iteration.
Decentralized optimization. This example is taken from [32]. Again consider the empirical risk minimization problem (1.3). We consider solving its dual problem (1.4) in a network by assigning each sample to a node. A parallel distributed algorithm updates all the dual coordinates, one per node, concurrently.
If the network has a central server, then each node sends its local update to the central server, which aggregates the shared vector and then broadcasts it back to the nodes.
If the network does not have a central server, then we can form the shared vector either by running a decentralized gossip algorithm or by calling an all-reduce communication. The former does not require knowledge of the network topology but is an iterative method. The latter requires the topology and takes multiple rounds and substantial total communication, even more when the network is sparse. An alternative approach is to create a token that holds the shared vector and follows a random walk in the network. The token acts like a traveling center. When the token arrives at a node, the node updates its own dual coordinate using the token's vector, and this local update leads to a sparse change to the shared vector; the update requires no access to the other nodes' variables. The method in [32] applies this idea to an ADMM formulation of the decentralized consensus problem (rather than BCD as in this paper) and shows that the total communication is significantly reduced.
1.2 Difficulty of the convergence proofs: biased expectation
Sampling according to a Markov chain is neither (essentially) cyclic nor i.i.d. stochastic. No matter how large K is, it is still possible that some node is never visited during K consecutive iterations. Unless the graph is complete (every node is directly connected with every other node), there are nodes i, j without an edge connecting them, i.e., (i, j) \notin E. Hence, given i_k = i, it is impossible to have i_{k+1} = j. So, no matter how one selects the sampling probability and step size, we generally do not have \mathbb{E}[x^{k+1} \mid x^k] = x^k - c\,\gamma \nabla f(x^k) for any constant c > 0. This, unfortunately, breaks down all the existing analyses of stochastic BCD, since they all need a nonvanishing probability for every block to be selected.
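The bias can be seen numerically. For a random walk on a ring of N = 6 nodes (an illustrative graph), the one-step distribution of i_{k+1} given i_k = 0 puts zero mass on every non-neighbor, so no choice of step size makes the conditional expected update proportional to the full gradient.

```python
import numpy as np

# One-step transition probabilities of a random walk on a 6-node ring:
# from node i, move to a uniformly random neighbor.
N = 6
P = np.zeros((N, N))
for i in range(N):
    P[i, (i - 1) % N] = P[i, (i + 1) % N] = 0.5

print(P[0])   # given i_k = 0: mass only on neighbors 1 and 5, zero elsewhere
```

Only after many steps does the walk's distribution spread over all nodes, which is exactly why the analysis must wait a mixing time before taking expectations.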
1.3 Proposed method and contributions
Given a graph G = (V, E), MCBCD is written mathematically as
sample i_k via the Markov chain on G,  (1.11a)
compute x_{i_k}^{k+1} = x_{i_k}^{k} - \gamma \nabla_{i_k} f(x^k),  (1.11b)
where \gamma > 0 is a constant step size, P^k is the transition matrix in the kth step (details given in Sec. 2), and we maintain x_j^{k+1} = x_j^{k} for all j \neq i_k. The initial point x^0 can be chosen arbitrarily. The first block i_0 can be chosen either deterministically or randomly. The following diagram illustrates the influential relations of the iterates and the random variable sequences involved: (1.12)
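Putting (1.11a) and (1.11b) together, a minimal sketch on a smooth quadratic follows; the ring graph, lazy random walk, problem data, and step size are illustrative choices.

```python
import numpy as np

# MCBCD sketch: the coordinate index i_k follows a lazy random walk on a
# ring graph, so consecutive indices are graph neighbors rather than
# i.i.d. samples; each step applies the coordinate gradient step (1.11b).
rng = np.random.default_rng(3)
n = 12
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)                      # f(x) = 0.5 x^T A x - b^T x
b = rng.standard_normal(n)

x = np.zeros(n)
gamma = 1.0 / np.linalg.norm(A, 2)           # constant step size
i = 0
for k in range(30000):
    x[i] -= gamma * (A[i] @ x - b[i])        # (1.11b): coordinate step
    i = (i + rng.choice([-1, 0, 1])) % n     # (1.11a): lazy walk on the ring

print(np.linalg.norm(x - np.linalg.solve(A, b)))
```

Despite the biased, neighbor-constrained sampling, the iterate still converges to the minimizer, which is what the analysis below makes rigorous.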
To the best of our knowledge, (1.11) did not appear before and, as explained above, is not covered by existing BCD analyses. When the Markov chain has a finite mixing time and problem (1.1) has a lower-bounded objective, we show that using a sufficiently small constant step size ensures that the gradient vanishes along the iterates. The concept of mixing time is reviewed in the next section. In addition, when f is convex and coercive, we show convergence of the objective at a sublinear rate with a hidden constant related to the mixing time. Note that running the algorithm itself requires no knowledge about the mixing time of the chain. Furthermore, when f is (restricted) strongly convex, the rate improves, unsurprisingly, to linear. Although we do not develop any Nesterov-kind acceleration in this paper, a heavy-ball-kind inertial MCBCD is presented and analyzed because the additional work is quite small. When the computation is noisy, as long as the noise is square summable (which is weaker than being summable), MCBCD still converges.
1.4 Possible future work
We mention some future improvements of MCBCD, which will require significantly more work to achieve. First, it is possible to accelerate MCBCD using both Nesterov-kind momentum and optimizing the transition probability. Second, it is important to parallelize MCBCD, for example, to allow multiple random walks to simultaneously update different blocks [22, 7], even in an asynchronous fashion like [13, 19, 27]. Third, it is interesting to develop a primal-dual type MCBCD, which would apply to a model-free DMDP along a single trajectory. Yet another line of work applies block coordinate updates to linear and nonlinear fixed-point problems [18, 17, 5], because this approach can solve optimization problems in imaging and conic programming that are equipped with nonsmooth, nonseparable objectives and constraints.
2 Preliminaries
2.1 Markov chain
We recall some definitions and properties of Markov chains that we use in this paper.
Definition 1 (finite-state (time-homogeneous) Markov chain).
A stochastic process X_0, X_1, \dots in a finite state space \mathcal{X} = \{1, \dots, N\} is called a Markov chain with transition matrices (P^k)_{k \ge 0} if, for any k \ge 0 and i, j \in \mathcal{X}, we have
(2.1) \mathrm{Prob}(X_{k+1} = j \mid X_0, \dots, X_{k-1}, X_k = i) = [P^k]_{i,j}.
The chain is time-homogeneous if P^k \equiv P for some constant matrix P.
Let the probability distribution of X_k be denoted as the row vector \pi^k, that is, \pi^k_j = \mathrm{Prob}(X_k = j). Each \pi^k satisfies \pi^k_j \ge 0 and \sum_j \pi^k_j = 1. Obviously, it holds that \pi^{k+1} = \pi^k P^k. When the Markov chain is time-homogeneous, we have \pi^{k+1} = \pi^k P and \pi^{k} = \pi^{0} P^{k} for k \ge 0, where P^k is the kth power of P.

Definition 2.
A time-homogeneous Markov chain is irreducible if, for any i, j \in \mathcal{X}, there exists k such that [P^k]_{i,j} > 0. State i \in \mathcal{X} is said to have a period d if [P^k]_{i,i} = 0 whenever k is not a multiple of d, and d is the greatest such integer. If d = 1, then we say state i is aperiodic. If every state is aperiodic, the Markov chain is said to be aperiodic.
Any time-homogeneous, irreducible, and aperiodic Markov chain has a stationary distribution \pi^* with \pi^* = \pi^* P, \pi^*_i > 0 for all i, and \sum_i \pi^*_i = 1. This is a sufficient but not necessary condition for such a \pi^* to exist. If the Markov chain fails to be time-homogeneous^1, it may still have a stationary distribution under additional assumptions.

^1 The time-homogeneous, irreducible, and aperiodic Markov chain is widely used; however, in practical problems, the Markov chain may not satisfy the time-homogeneity assumption. For example, in a mobile network whose connectivity structure is changing all the time, the set of the neighbors of an agent is time-varying [9].
In this paper, we make the following assumption, which always holds for time-homogeneous, irreducible, and aperiodic Markov chains and may hold for more general Markov chains.
Assumption 1.
The Markov chain (X_k) has transition matrices (P^k) and the stationary distribution \pi^*. Define \Pi^* := \mathbf{1} \pi^*,
that is, every row of \Pi^* is \pi^*. For each \delta > 0, there exists \tau(\delta) \in \mathbb{N} such that
(2.2) \| P^{k} P^{k+1} \cdots P^{k+\tau(\delta)-1} - \Pi^* \| \le \delta \quad \text{for all } k \ge 0.
Here, \tau(\delta) is called a mixing time, which specifies how long a Markov chain takes to evolve close to its stationary distribution. The literature contains a thorough investigation of various kinds of mixing times [3]. Previous notions of mixing time focus on bounding the difference between the distribution \pi^k and the stationary distribution \pi^*. Our version is just easier to use in the analysis.
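For a time-homogeneous chain, a condition of the form (2.2) can be checked numerically by powering the transition matrix. The lazy walk on a ring and the tolerance below are illustrative choices.

```python
import numpy as np

# Find the smallest k such that every row of P^k is within delta of the
# stationary distribution pi (uniform for this lazy walk on a ring).
n, delta = 10, 0.1
P = np.zeros((n, n))
for i in range(n):                       # lazy walk: stay/left/right w.p. 1/3
    P[i, i] = P[i, (i - 1) % n] = P[i, (i + 1) % n] = 1.0 / 3.0
pi = np.full(n, 1.0 / n)                 # stationary distribution of this walk

Pk, k = np.eye(n), 0
while np.abs(Pk - pi).max() > delta:     # max entrywise deviation from pi
    Pk, k = Pk @ P, k + 1

print(k)                                 # a mixing time tau(delta) for this chain
```

The laziness (positive self-loop probability) makes the walk aperiodic, so the powers of P actually converge to the matrix with identical rows pi.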
For a time-homogeneous, irreducible, and aperiodic Markov chain with transition matrix P, it is easy to obtain \tau(\delta) = O(\log(1/\delta)) as \delta \to 0, with a constant depending on the second largest eigenvalue modulus of P (positive and smaller than 1) [14]. Besides the time-homogeneous, irreducible, and aperiodic Markov chain, some other non-time-homogeneous chains can also have a geometrically convergent mixing bound. An example is presented in [21].

2.2 Notation and constants
The following notation is used throughout this paper:
(2.3) 
In the MCBCD iteration, only the i_kth block of x^{k+1} - x^k is nonzero; the other blocks are zero. Let \pi_{\min} denote the minimal stationary probability, i.e.,
(2.4) \pi_{\min} := \min_{i} \pi^*_i > 0.
For any closed proper function f, \mathrm{argmin} f denotes its set of minimizers, and \|\cdot\| denotes the \ell_2 norm. Throughout the proofs, we use the following sigma algebra
Let Assumption 1 hold. In our proofs, we let \tau denote the mixing time, i.e.,
(2.5) 
With direct calculations,
(2.6) 
If the Markov chain enjoys a geometric mixing rate, then we have
(2.7) 
It is worth mentioning that, for a complete graph where all nodes are connected to each other, we can take the Markov chain whose transitions are uniform over all nodes, which mixes immediately, and our MCBCD reduces to random BCD [15].
3 Markov chain block coordinate gradient descent
In this section, we study the convergence properties of MCBCD for problem (1.1). The discussion covers both convex and nonconvex cases. We show that MCBCD converges if the step size is taken the same as in traditional BCD. For convex problems, a sublinear convergence rate is established, and for strongly convex cases, linear convergence is shown.
Our analysis is conducted on an inexact version of MCBCD, which allows errors in computing the partial gradients:
(3.1) x_{i_k}^{k+1} = x_{i_k}^{k} - \gamma \big( \nabla_{i_k} f(x^k) + \epsilon^k \big),
where i_k is sampled in the same way as in (1.11a), and \epsilon^k denotes the error in the kth iteration. If \epsilon^k vanishes, the above update reduces to the MCBCD in (1.11).
3.1 Convergence analysis
The results in this section apply to both convex and nonconvex cases, and they rely on the following assumption.
Assumption 2.
The set of minimizers of the function f is nonempty, and \nabla f is Lipschitz continuous coordinate-wise with constant L_i for each i, namely,
(3.2) \|\nabla f(x + t e_i) - \nabla f(x)\| \le L_i |t| \quad \text{for all } x \in \mathbb{R}^n, \; t \in \mathbb{R},
where e_i denotes the ith standard basis vector in \mathbb{R}^n. In addition, \nabla f is also Lipschitz continuous with constant L, namely,
(3.3) \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^n.
We call the ratio of L to the coordinate-wise constants the condition number.
When (3.2) holds for each i, we have
(3.4) f(x + t e_i) \le f(x) + t \nabla_i f(x) + \frac{L_i}{2} t^2 \quad \text{for all } x \in \mathbb{R}^n, \; t \in \mathbb{R}.
Lemma 1 below is very standard. It bounds the sum of squared iterate changes by the initial objective error and the iteration errors. Lemmas 2 and 3 are new; they study bounds involving conditional expectations because the sampling bias prevents us from directly bounding the expected gradient. The bounds in these three lemmas are combined in Theorem 1 to obtain the convergence rates.
Lemma 1.
Proof.
Recalling the definition in (2.3) and noting that only the i_kth block changes in iteration k, we have:
(3.6) 
where we have used the update rule in (3.1) to obtain the second equality. By (3.4) and (3.6), it holds that
(3.7)  
(3.8) 
where the last inequality follows from Young's inequality \langle a, b \rangle \le \frac{1}{2}\|a\|^2 + \frac{1}{2}\|b\|^2. Summing (3.8), rearranging terms, and noting that f is bounded below, we obtain the desired result and complete the proof. ∎
Also, we can bound the partial gradient by the iterate change and the error term as follows.
Proof.
Remark 1.
If the error vanishes, then starting from (3.10) and by the same arguments, we obtain
(3.11) 
Furthermore, we can lower bound the full gradient by the conditional expectation of the partial gradient.
Lemma 3.
Let (2.5) hold. Then it holds that
(3.12) 
Proof.
Taking the conditional expectation, we have
By the Markov property, the conditional probability of the future coordinate selection given the past depends only on the current state. Then the desired result is obtained from (2.6) and the lower bound on the stationary probabilities. ∎
Theorem 1.
Let Assumptions 1 and 2 hold, and let the iterates be generated by the inexact MCBCD (3.1) with a sufficiently small constant step size. We have the following results:

Square-summable noise: If the noise sequence is square summable, then
(3.13) and
(3.14) 
Non-square-summable noise: If the noise grows at a controlled rate for some positive number, then
(3.15)
The constants used above are
(3.16) 
Proof.
In the case of square-summable noise, the noise terms vanish as k grows. In addition, it follows from (3.5) that the iterate changes vanish as well. Hence, (3.9) implies
(3.17) 
Taking the expectation of (3.17) and using the Lebesgue dominated convergence theorem, we have
Hence from (3.12), it follows that
and thus (3.13) holds by Jensen's inequality.
Note that the bound holds for any k. Therefore, summing both sides of (3.9) yields
(3.18) 
The inequality in (3.18), together with (3.5) and the assumption on the noise, gives
(3.19) 
where the constants are defined in (3.16). In addition, we have
(3.20) 
where the last inequality follows from (3.12). Now the result in (3.14) is obtained by combining the above inequality with (3.19).