## 1 Introduction

Distributed machine learning is an attractive solution for tackling large-scale learning tasks [6, 15]. In this paper, we consider $m$ agents that represent the nodes of a connected network and collaboratively solve the following stochastic optimization problem

$$\min_{x\in\mathbb{R}^d}\; f(x) := \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{\xi^i} F(x;\xi^i), \tag{1.1}$$

where $\mathbb{E}_{\xi^i} F(x;\xi^i) := \int_{\Xi_i} F(x;\xi^i)\, d\Pi_i(\xi^i)$ and $\Xi_i$ is a statistical sample space with probability distribution $\Pi_i$ at node $i$ (we omit the underlying $\sigma$-algebra), and $F(\cdot;\xi^i)$ is a closed (possibly nonconvex) function associated with $\xi^i \in \Xi_i$. This formulation contains various multi-agent machine learning, reinforcement learning and statistical problems. We are particularly interested in cases where obtaining an independent and identically distributed (i.i.d.) sample $\xi^i$ from $\Pi_i$ is very hard or even impossible at every node $i$; see an example in [34] where i.i.d. sampling can be very expensive. In statistics, a common method to overcome the issue of difficult sampling is employing a Markov chain whose stationary distribution is $\Pi_i$. Therefore, to solve (1.1), one can still use the parallel implementation of the widely used stochastic gradient descent (SGD) method

[26]:

$$x^{k+1} = x^k - \frac{\gamma_k}{m}\sum_{i=1}^{m} \nabla F(x^k;\xi^{i,k}_{T}), \tag{1.2}$$

where $\gamma_k$ and $x^k$ are the stepsize and parameter at iteration $k$, $m$ is the number of nodes, and $\nabla F(x^k;\xi^{i,k}_{T})$ is a *nearly unbiased* stochastic gradient obtained at node $i$.
By using the Markov chain, in order to perform one iteration of (1.2), each node $i$ has to generate a sequence of samples $\xi^{i,k}_1, \xi^{i,k}_2, \ldots, \xi^{i,k}_T$ and only uses the last one, $\xi^{i,k}_T$. According to [20, Theorem 4.9], to get a sample that is nearly i.i.d., one needs to simulate the Markov chain for a sufficiently long time, i.e., with a large $T$. For this reason, we call iteration (1.2) SGD-$T$.
In addition, to update the next parameter via (1.2), all the local gradients need to be collected.
Therefore, applying iteration (1.2) to problem (1.1) over the distributed nodes has the following two limitations.

Sample inefficiency: Different from standard stochastic optimization settings, when it is difficult to obtain i.i.d. samples from the local distributions, implementing SGD-$T$ for (1.1) requires regenerating Markov chains at each node and at every iteration. This wastes a sizeable number of samples, especially when the Markov chain has a large mixing time.

Communication inefficiency: The presumption of implementing (1.2) is that there is a fusion center (which can be a designated agent) aggregating the local gradients and carrying out the parameter update. However, this incurs a significant amount of synchronization and communication overhead, especially when the network is large and sparse.

In this context, our goal is to find the near-optimal solution of (1.1) in a sample- and communication-efficient manner.

### 1.1 Prior Art

In this part, we briefly review three lines of related works: decentralized optimization, decentralized stochastic optimization, and Markov chain gradient descent.

Decentralized optimization. Decentralized algorithms were originally studied in the control and signal processing communities, e.g., for calculating the mean of data distributed over multiple sensors [3, 22, 27, 2]. The decentralized (sub)gradient descent (DGD) algorithms for finite-sum optimization have been studied in [21, 5, 11, 18, 35]. However, DGD converges to an inexact solution because it actually minimizes an unconstrained penalty function rather than the original objective. To fix this, dual information is leveraged in recent works such as decentralized ADMMs and primal-dual algorithms [4, 28, 31, 30]. Although DGD is slower than decentralized ADMMs and primal-dual algorithms in the convex setting, it is much simpler and therefore easier to extend to the nonconvex, online and delay-tolerant settings [36, 10, 19].

Decentralized stochastic optimization. Decentralized stochastic optimization, which generalizes the methods for decentralized deterministic optimization, has been studied recently. By assuming a local Poisson clock for each agent, asynchronous gossip algorithms were proposed in [24], in which each worker randomly selects part of its neighbors to communicate with. In fact, these algorithms use random communication graphs. A decentralized algorithm with a random communication graph for the constrained problem was introduced in [33]; the subgradient version was given in [25]. In recent works [32, 14, 16], decentralized SGD (DSGD) is proposed and analyzed. In [32], the authors presented the complexity analysis for a stochastic decentralized algorithm. In [14], the authors design a stochastic decentralized algorithm that exploits dual information and provide its computational complexity. In [16], the authors show the speedup obtained when the number of nodes is increased. In [17], the authors proposed asynchronous decentralized stochastic gradient descent and presented theoretical and numerical results.

Markov chain gradient descent. Since i.i.d. samples are not always available in stochastic optimization, recent work has focused on the analysis of stochastic algorithms that follow a single trajectory of a Markov chain or another general ergodic process. The key challenge of analyzing Markov chain gradient descent (MGD) is to deal with the biased expectation of gradients. Ergodic convergence results have been reported in [13, 12]. Specifically, [13, 12] study the conditional expectation with a sufficiently large delay, which is sufficiently close to the gradient (but still different). The authors of [23] proved almost sure convergence under the diminishing stepsizes $1/k^q$, $2/3 < q \le 1$. In [7], the authors improved the convergence results with larger stepsizes in the sense of ergodic convergence. In all these works, the Markov chain is required to be reversible, and the functions have to be convex. In a very recent paper [34], the non-ergodic convergence of MGD has been shown in the nonconvex case with a non-reversible Markov chain, but the algorithm needs to be implemented in a centralized fashion.

### 1.2 Our contributions

The main theme of this paper is the development of the first- and zeroth-order versions of decentralized Markov chain gradient descent (DMGD) algorithms, and their performance analysis. In contrast to the well-known DSGD, DMGD leverages Markov chain sampling rather than uniform random sampling, which improves sample efficiency. For first-order DMGD, each node uses a Markov chain trajectory to sample a gradient and then communicates with its neighbors to update the local variables. To further account for the case where stochastic gradients are not readily available, we develop the zeroth-order variant of DMGD, where only point-wise function values are needed.

We establish the non-ergodic convergence of first- and zeroth-order DMGD and their ergodic convergence rates. The results show that the first-order DMGD converges at the same order as the centralized MGD, and the zeroth-order DMGD converges at the same order as the first-order DMGD. Some novel results are obtained with new techniques and approaches developed in this paper. To get stronger results in general cases, we use a varying mixing time rather than a fixed one. It is worth mentioning that our theoretical results directly yield novel results for DSGD if the Markov chain trajectory sampling reduces to uniform random sampling. Numerical results are presented to demonstrate that DMGD performs better than DSGD in terms of sample efficiency.

Notation: Let $\lambda_i(\cdot)$ denote the $i$-th eigenvalue of a matrix. Let denote the local copy of $x$ at node $i$. For a matrix , and . For a positive semidefinite matrix , .

## 2 Decentralized Markov Chain Gradient Descent

### 2.1 Preliminaries

We first consider the discrete case of our problem (1.1), that is, all the distributions () are supported on a set of points, (for ). (For notational brevity, we assume the same cardinality for the support sets of the different distributions.) We define the functions as
, and thus problem (1.1) becomes the following finite-sum formulation

(2.1) |

where .

Denote as the trajectory of the Markov chain in the $i$th node. We use a connected graph with vertex set and edge set . Any edge represents a communication link between nodes $i$ and $j$. Moreover, let

With the notation above, we can reformulate problem (2.1) into an equivalent problem over the network, which can be described as

(2.2) |

Mixing matrix: The mixing matrix is frequently used in decentralized optimization. In many cases, it can be designed by the users according to the given graph. Formally, it is defined as follows.

###### Definition 1.

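The mixing matrix $W = [w_{i,j}] \in \mathbb{R}^{m\times m}$, where $m$ is the number of nodes and $\mathcal{E}$ is the edge set of the graph, is assumed to have the following properties:

(1) (Graph) If $i \neq j$ and $(i,j) \notin \mathcal{E}$, then $w_{i,j} = 0$; otherwise, $w_{i,j} > 0$;

(2) (Symmetry) $W = W^{\top}$;

(3) (Null space property) $\mathrm{null}(I - W) = \mathrm{span}\{\mathbf{1}\}$;

(4) (Spectral property) $I \succeq W \succ -I$.

Since $W$ is symmetric, its eigenvalues are real and can be sorted in nonincreasing order. Thus, let $\lambda_i(W)$ denote the $i$th largest eigenvalue of $W$; then, it holds that $1 = \lambda_1(W) > \lambda_2(W) \geq \cdots \geq \lambda_m(W) > -1$. With Definition 1, we can easily see that $(I - W)\mathbf{x} = 0$ for any $\mathbf{x}$ being the solution to problem (2.2). Therefore, (2.2) can be further described as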

(2.3) |
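The properties in Definition 1 can be checked numerically on a small graph. The sketch below is only an illustration: it assumes a ring topology and the Metropolis weighting rule, neither of which is prescribed by this paper, and the helper names `ring_adjacency` and `metropolis_mixing_matrix` are hypothetical.

```python
import numpy as np

def ring_adjacency(m):
    """Adjacency matrix of an undirected ring with m >= 3 nodes (assumed topology)."""
    adj = np.zeros((m, m), dtype=bool)
    for i in range(m):
        adj[i, (i + 1) % m] = adj[(i + 1) % m, i] = True
    return adj

def metropolis_mixing_matrix(adj):
    """Metropolis rule: w_ij = 1 / (1 + max(deg_i, deg_j)) on edges, zero off the graph."""
    adj = np.asarray(adj, dtype=bool)
    m = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()  # the diagonal absorbs the remaining mass
    return W

W = metropolis_mixing_matrix(ring_adjacency(6))
eigvals = np.sort(np.linalg.eigvalsh(W))[::-1]  # nonincreasing order

print("symmetric:", np.allclose(W, W.T))
print("W @ 1 == 1:", np.allclose(W @ np.ones(6), np.ones(6)))
print("eigenvalues:", np.round(eigvals, 4))
```

On a connected graph this construction gives a symmetric matrix whose largest eigenvalue is 1 and whose remaining eigenvalues lie strictly inside $(-1, 1)$, which is what the spectral and null space properties require.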

By introducing penalty parameter , the decentralized gradient descent is actually the gradient descent with learning rate with to solve the following unconstrained problem

(2.4) |

The DGD algorithm is actually gradient descent with stepsize applied to problem (2.4). With Assumption 3, function is differentiable and is Lipschitz with constant , where .
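The equivalence between a decentralized gradient step and a gradient step on a penalized objective can be verified numerically. The sketch below covers only the classical DGD update $\mathbf{x}^{k+1} = W\mathbf{x}^k - \alpha \nabla f(\mathbf{x}^k)$ studied in [35, 36] with the penalty $\frac{1}{2\alpha}\langle \mathbf{x}, (I - W)\mathbf{x}\rangle$; it is an assumption that this matches the structure of (2.4), whose exact form may contain additional parameters. The toy objective and all names are hypothetical.

```python
import numpy as np

def dgd_step(X, W, grad_f, alpha):
    """Classical DGD: average with neighbours, then step along the local gradients."""
    return W @ X - alpha * grad_f(X)

def penalty_gradient_step(X, W, grad_f, alpha):
    """Plain gradient step on f(X) + (1 / (2 * alpha)) * <X, (I - W) X>."""
    m = W.shape[0]
    grad_penalized = grad_f(X) + (np.eye(m) - W) @ X / alpha
    return X - alpha * grad_penalized

# Toy setup: three nodes, local objectives 0.5 * ||x_i - t_i||^2 (rows of X are local copies).
W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
targets = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]])

def grad_f(X):
    return X - targets

X = np.arange(6, dtype=float).reshape(3, 2)
print(np.allclose(dgd_step(X, W, grad_f, 0.1),
                  penalty_gradient_step(X, W, grad_f, 0.1)))
```

The two updates coincide because $\mathbf{x} - \alpha\big(\nabla f(\mathbf{x}) + \alpha^{-1}(I - W)\mathbf{x}\big) = W\mathbf{x} - \alpha\nabla f(\mathbf{x})$ for symmetric $W$.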

Markov chain: We recall several definitions, properties, and existing results of the finite-state time-homogeneous Markov chain, which will be used in the proposed algorithms.

###### Definition 2.

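Let $P$ be an $M \times M$ matrix with real-valued elements. A stochastic process $X_0, X_1, X_2, \ldots$ in a finite state space $[M] := \{1, 2, \ldots, M\}$ is called a time-homogeneous Markov chain with transition matrix $P$ if, for $k \in \mathbb{N}$, $i, j \in [M]$, and $i_0, i_1, \ldots, i_{k-1} \in [M]$, we have

$$\mathbb{P}(X_{k+1} = j \mid X_0 = i_0, X_1 = i_1, \ldots, X_k = i) = \mathbb{P}(X_{k+1} = j \mid X_k = i) = P_{i,j}.$$

Denote the probability distribution of $X_k$ as the non-negative row vector $\pi^k = (\pi^k_1, \pi^k_2, \ldots, \pi^k_M)$, i.e., $\mathbb{P}(X_k = j) = \pi^k_j$, and $\pi^k$ satisfies $\sum_{i=1}^{M} \pi^k_i = 1$. For the time-homogeneous Markov chain, it holds that $\pi^k = \pi^{k-1} P$ and $\pi^k = \pi^0 P^k$ for $k \in \mathbb{N}$, where $P^k$ denotes the $k$th power of $P$.

A Markov chain is irreducible if, for any $i, j \in [M]$, there exists $k$ such that $(P^k)_{i,j} > 0$.
State $i \in [M]$ is said to have a period $r$ if $(P^k)_{i,i} = 0$ whenever $k$ is *not* a multiple of $r$ and $r$ is the greatest integer with this property. If $r = 1$, then we say state $i$ is aperiodic. If every state is aperiodic, the Markov chain is said to be aperiodic.
Any time-homogeneous, irreducible, and aperiodic Markov chain has a stationary distribution $\pi^* = \lim_{k} \pi^k = [\pi^*_1, \pi^*_2, \ldots, \pi^*_M]$ with $\sum_{i=1}^{M} \pi^*_i = 1$ and $\min_i \pi^*_i > 0$, and $\pi^* = \pi^* P$. It also holds that

$$\lim_{k \to \infty} P^k = \big[(\pi^*)^{\top}, (\pi^*)^{\top}, \ldots, (\pi^*)^{\top}\big]^{\top} =: \Pi^* \in \mathbb{R}^{M \times M}. \tag{2.5}$$

The largest eigenvalue of $P$ is 1, and the corresponding left eigenvector is $\pi^*$.

Mixing time is an important concept that describes how long a Markov chain evolves until its current state has a distribution very close to its stationary distribution. The literature studies various kinds of mixing times, most of which, however, concern reversible Markov chains (i.e., $\pi^*_i P_{i,j} = \pi^*_j P_{j,i}$). With basic matrix analysis, the mixing time introduced in [34] provides a direct relationship between $k$ and the deviation of the distribution of the current state from the stationary distribution (Lemma 1 in the Appendix).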
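The stationary distribution and the limit of the matrix powers can be computed directly for a small chain. The following sketch uses a hypothetical 3-state transition matrix chosen for illustration; the function name `stationary_distribution` is not from the paper.

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for the eigenvalue 1, normalised to sum to one."""
    eigvals, eigvecs = np.linalg.eig(P.T)   # right eigenvectors of P^T = left eigenvectors of P
    idx = np.argmin(np.abs(eigvals - 1.0))  # pick the eigenvalue closest to 1
    pi = np.real(eigvecs[:, idx])
    return pi / pi.sum()

# Hypothetical irreducible, aperiodic chain on three states.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])

pi_star = stationary_distribution(P)
P_50 = np.linalg.matrix_power(P, 50)

print("pi*        :", np.round(pi_star, 4))
print("pi* P = pi*:", np.allclose(pi_star @ P, pi_star))
print("rows of P^50 equal pi*:", np.allclose(P_50, np.tile(pi_star, (3, 1))))

# Reversibility check: diag(pi*) P is symmetric iff pi*_i P_ij = pi*_j P_ji.
flow = pi_star[:, None] * P
print("reversible:", np.allclose(flow, flow.T))
```

For this particular matrix the stationary distribution is $(0.25, 0.5, 0.25)$, so the chain is reversible but its stationary distribution is not uniform.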

### 2.2 Algorithmic Development of DMGD

The local scheme of DMGD in the th node can be presented as

(2.6) |

In each iteration, each node calculates the local gradient on its Markov chain trajectory, and then communicates with its neighbors through a weighted average to update its iterate. Here, $w_{i,j}$ is the $(i,j)$-th element of the mixing matrix. It is easy to see that if and the Markov chain trajectory reduces to uniform sampling, (2.6) reduces to DSGD. We can see a major difference from the classical decentralized gradient descent algorithm: in each iteration, each node keeps part of its previous data. The parameter will go to zero. These settings guarantee the convergence of the algorithm; the diminishing stepsizes are used to reduce the variance caused by gradient sampling. However, as , will dominate in the iteration. That means the algorithm will be inefficient when is large if we choose the diminishing stepsize strategy. In Table 1, we present the comparison of DSGD-$T$ and DMGD.

The global scheme can be described as the following iteration

(2.7) |

where has been given before. If we introduce a new variable as , the global scheme can be further presented as This iterative formulation helps us to understand the convergence of the algorithm. Suppose that the Markov chains all reduce to uniform sampling. By defining the $\sigma$-algebra as we can see that in this condition; that means DMGD is in fact SGD applied to minimizing function and will converge to a critical point of in expectation. For general Markov chains, the analysis is much more complicated because the conditional expectation is biased. Although DMGD is not identical to MGD applied to function , it is natural to expect that the convergence of DMGD is also related to function . In fact, our results show that for DMGD under mild assumptions.
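To make the structure of a decentralized update with Markov chain sampling concrete, here is a minimal sketch on a toy least-squares problem. It is not the exact scheme (2.6): it assumes the plain "mix, then step" update $\mathbf{x}^{k+1} = W\mathbf{x}^k - \gamma_k G^k$ and does not reproduce the additional parameter that lets each node keep part of its previous data. The data, the lazy-cycle matrices, the stepsize schedule and all function names are hypothetical.

```python
import numpy as np

def lazy_cycle(n):
    """Symmetric stochastic matrix: weight 1/3 on the diagonal and on both cycle neighbours (n >= 3)."""
    Q = np.zeros((n, n))
    for s in range(n):
        Q[s, s] = Q[s, (s - 1) % n] = Q[s, (s + 1) % n] = 1.0 / 3.0
    return Q

def dmgd_sketch(A, b, W, P, num_iters=2000, q=0.6, seed=0):
    """Each node samples its data along ONE Markov chain trajectory, then mixes with neighbours."""
    rng = np.random.default_rng(seed)
    m, M, d = A.shape
    X = np.zeros((m, d))                    # row i is the local copy at node i
    states = rng.integers(M, size=m)        # initial state of each local chain
    for k in range(num_iters):
        gamma = 0.5 / (k + 1) ** q          # diminishing stepsize (assumed schedule)
        G = np.empty_like(X)
        for i in range(m):
            a, target = A[i, states[i]], b[i, states[i]]
            G[i] = (a @ X[i] - target) * a  # gradient of 0.5 * (a.x - target)^2
            states[i] = rng.choice(M, p=P[states[i]])  # one Markov transition per iteration
        X = W @ X - gamma * G
    return X

rng = np.random.default_rng(1)
m, M, d = 4, 5, 2                           # nodes, states per chain, dimension
x_true = np.array([1.0, -2.0])
A = rng.normal(size=(m, M, d))
A /= np.linalg.norm(A, axis=2, keepdims=True)  # unit-norm features
b = A @ x_true                                 # noiseless labels, so x_true minimises every local loss

X = dmgd_sketch(A, b, W=lazy_cycle(m), P=lazy_cycle(M))
print("per-node distance to x_true:", np.linalg.norm(X - x_true, axis=1))
```

The same lazy-cycle construction is reused for the mixing matrix (over nodes) and the transition matrix (over local samples) only to keep the sketch short; in general these are unrelated objects.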

### 2.3 Key Challenge of Analyzing DMGD

Markov chain sampling is neither cyclic nor i.i.d. stochastic. For any large , it is still possible that a sample is never visited during some iterations. For a fixed node , unless the state graph of its Markov chain is a complete graph (i.e., all states are directly connected), there are states *without* an edge connecting them, i.e., . Hence, given , it is *impossible* to have . So, no matter how one selects the sampling probability and stepsize , we generally *cannot* have for any constant , where . This fact, unfortunately, breaks down all the existing analyses of stochastic decentralized optimization, since they all require every sample at each node to be selected with non-vanishing probability.
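The obstacle described above can be seen on a small chain: some one-step transition probabilities are exactly zero, although every state is reachable after enough steps. The sketch below uses a hypothetical lazy random walk on a 6-cycle; the name `lazy_cycle_chain` is illustrative.

```python
import numpy as np

def lazy_cycle_chain(n):
    """Lazy random walk on an n-cycle: stay, step left, or step right, each with probability 1/3."""
    P = np.zeros((n, n))
    for s in range(n):
        P[s, s] = P[s, (s - 1) % n] = P[s, (s + 1) % n] = 1.0 / 3.0
    return P

P = lazy_cycle_chain(6)
P2 = np.linalg.matrix_power(P, 2)
P3 = np.linalg.matrix_power(P, 3)

print("zero one-step entries :", int(np.sum(P == 0)))   # states that cannot be sampled next
print("zero two-step entries :", int(np.sum(P2 == 0)))  # opposite states are still unreachable
print("all three-step entries positive:", bool(np.all(P3 > 0)))
```

Given the current state, a non-neighboring sample therefore has probability zero of being selected at the next iteration, which is why analyses that assume a uniformly positive selection probability do not apply directly.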

## 3 Convergence analysis of DMGD

In this part, we present the theoretical results of DMGD with finite-state Markov chains. Our analysis builds on the following assumptions.

###### Assumption 1.

Function is lower bounded, that is .

###### Assumption 2.

The gradient of is uniformly bounded, that is, there exists a constant such that .

###### Assumption 3.

The gradient of is Lipschitz continuous with , i.e., .

###### Assumption 4.

The Markov chains in all nodes are time-homogeneous, irreducible, and aperiodic. They share the same transition matrix and have the same uniform stationary distribution. (We require all nodes to employ Markov chains with the same transition matrix for convenience of presentation in the proofs; this setting can be modified to allow different Markov chains.)

###### Theorem 1.

In Theorem 1, the functions need not be convex. In fact, it is more difficult to prove (1) than to prove (3.2). The descent on a Lyapunov function and the Schwarz inequality directly yield (3.2), while (1) requires a technical lemma, which was first given in [36] and generalized by [34]. An extreme case is that and ; DMGD then reduces to the classical MGD. But Theorem 1 cannot cover the existing convergence results of MGD. In [34], the authors established the convergence of MGD under the stepsize requirements

(3.4) |

Although the convergence results in Theorem 1 require the Markov chains to be time-homogeneous, they can be extended to general Markov chains with extra assumptions given in existing works. In [23], time non-homogeneous Markov chains are considered under extra assumptions (Assumptions 4 and 5 in Section 4 of [23]). These two assumptions involve many details; the main requirements are that the transition matrices are doubly stochastic, their nonzero entries are uniformly bounded away from zero, their diagonals are positive, and some edges are strongly connected. In [7], the authors use a more general Markov chain, also under an assumption (Assumption C in Section 2 of [7]) that is satisfied by finite-state time-homogeneous Markov chains.

The results in Theorem 1 concern the convergence with respect to function rather than the original finite-sum optimization problem. In fact, the minimizer of (2.1) cannot be attained because of the parameter . However, the differences between the rows of are uniformly bounded.

###### Proposition 1.

Under the conditions of Theorem 1, for any and , we have

(3.5a) | |||

(3.5b) |

where the average iterate is defined as .

## 4 Zeroth-order DMGD

This section presents the zeroth-order version of DMGD with the two-point feedback strategy [8, 9, 1, 29]. This paper employs the method given in [8, 9]. Specifically, it estimates the gradient of by querying at and and returning , where is a random unit vector and is a small parameter. In zeroth-order DMGD, we use the two-point feedback to replace the local gradients and obtain the following iteration at the th node

(4.1) |

where still denotes the Markov chain trajectory in the th node, is uniformly sampled from the unit sphere in , and is the parameter in the th iteration. In this algorithm, we use only function values rather than gradients; thus, it is called a zeroth-order scheme. We can present the following convergence result for zeroth-order DMGD.
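A two-point gradient estimator of the kind used in zeroth-order methods can be sketched as follows. The exact estimator of (4.1) is not reproduced here; the sketch assumes the common symmetric form $\frac{d}{2\delta}\big(f(x + \delta u) - f(x - \delta u)\big)u$ with $u$ uniform on the unit sphere in $\mathbb{R}^d$, and the test function, the symbol $\delta$ and all names are hypothetical.

```python
import numpy as np

def two_point_gradient(f, x, delta, rng):
    """Estimate grad f(x) from two function values along a random unit direction u."""
    d = x.size
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)  # normalised Gaussian = uniform on the unit sphere
    return d / (2.0 * delta) * (f(x + delta * u) - f(x - delta * u)) * u

# Hypothetical quadratic test function with a known gradient Q x + c.
Q = np.diag([1.0, 2.0, 3.0])
c = np.array([1.0, -2.0, 0.5])

def f(x):
    return 0.5 * x @ Q @ x + c @ x

x = np.array([0.3, -0.1, 0.2])
true_grad = Q @ x + c

rng = np.random.default_rng(0)
estimates = np.array([two_point_gradient(f, x, 1e-3, rng) for _ in range(20000)])
avg_estimate = estimates.mean(axis=0)

print("true gradient    :", true_grad)
print("averaged estimate:", np.round(avg_estimate, 3))
```

A single estimate is noisy, and its second moment grows with the dimension $d$, which is consistent with a dimension-dependent factor appearing in zeroth-order rates; only the average over many directions is close to the gradient.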

###### Theorem 2.

Compared with the first-order DMGD in Theorem 1, a constant factor degrades the convergence rates. This difference comes from the estimation error of the two-point gradient estimator, which is dimension-dependent. The result indicates that, as could be expected, the zeroth-order version can work as well as DMGD in the low-dimensional case, whereas in the high-dimensional case it might be slowed down.

## 5 Numerical tests

In this section, we compare the performance of our algorithm with the decentralized parallel SGD (DSGD) on an autoregressive model, which closely resembles the first experiment in [7]. Assume that there are autoregressive processes distributed on a graph of nodes. We attempt to recover a consensus vector from the multiple processes. On each node , set matrix as a subdiagonal matrix with random entries . Randomly sample a vector , with the unit 2-norm. In our experiments, we tested with and . The data are generated by the following autoregressive process:

Clearly, for any , forms a Markov chain. Let denote the stationary distribution of the Markov chain on the -th node. By defining the loss function as

with , we reconstruct as the solution to the following problem:

(5.2)

We choose as our stepsize, where . This choice is consistent with our theory. Specifically, we compare:

DMGD, where is from one trajectory of the Markov chain on the -th node;

DSGD-, for , where each is the -th sample of an independent trajectory on the -th node. All trajectories are generated by starting from the same initial state.

To compute gradients, DSGD-$T$ uses $T$ times as many samples as DMGD. We did not try to adapt $T$ as $k$ increases because theoretical guidance is lacking. The numerical comparison results are reported in Figure 1, which show that DMGD outperforms DSGD-$T$ with . The numerical results in Figure 1 are quite favorable to DMGD. As expected, DMGD used significantly fewer total samples than DSGD-$T$ for each $T$. Surprisingly, DMGD did not cost more gradient computations. It is important to note that DSGD-1 and DSGD-2, as well as DSGD-4, stagnate at noticeably lower accuracies because their $T$ values are too small.
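The sample accounting behind this comparison can be sketched for a single node. The code below is an illustration on a hypothetical 3-state chain, not the experimental setup of this section: `dsgd_T_samples` restarts an independent trajectory from the same initial state at every iteration and keeps only its $T$-th state, while `dmgd_samples` follows one trajectory and uses every state it visits.

```python
import numpy as np

def simulate(P, start, steps, rng):
    """Run the chain for `steps` transitions from `start` and return the visited states."""
    states, s = [], start
    for _ in range(steps):
        s = int(rng.choice(P.shape[0], p=P[s]))
        states.append(s)
    return states

def dsgd_T_samples(P, start, T, num_iters, rng):
    """Per iteration: generate T states on a fresh trajectory, use only the last one."""
    used = [simulate(P, start, T, rng)[-1] for _ in range(num_iters)]
    return used, T * num_iters   # (states used, states generated)

def dmgd_samples(P, start, num_iters, rng):
    """One trajectory overall: every generated state is used for one gradient."""
    used = simulate(P, start, num_iters, rng)
    return used, num_iters

P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
rng = np.random.default_rng(0)

used_sgd, generated_sgd = dsgd_T_samples(P, start=0, T=8, num_iters=100, rng=rng)
used_mgd, generated_mgd = dmgd_samples(P, start=0, num_iters=100, rng=rng)

print("DSGD-8: gradients", len(used_sgd), "samples generated", generated_sgd)
print("DMGD  : gradients", len(used_mgd), "samples generated", generated_mgd)
```

Both variants compute the same number of gradients, but the restart-based variant generates $T$ times as many samples.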

## 6 Analysis on continuous state space

In this part, we consider the case where are continuous, and return to problem (1.1). Time-homogeneous and reversible infinite-state Markov chains are considered in this case. By [20, Theorem 4.9], the deviation from the stationary distribution in this case still decreases geometrically, as in (A.1). Mathematically, this fact can be presented as

(6.1) |

where still denotes the deviation matrix . Here, and are constants determined by the Markov chain; we use the notation and to distinguish them from and in Lemma 1.

Let be the Markov chain trajectory in th node. We first present the local scheme:

(6.2) |

By defining

and , the global scheme is then of the following form

(6.3) |

The convergence is proved for a possibly nonconvex objective function and time-homogeneous and reversible chains, which obey the following assumption.

###### Assumption 5.

For any , it holds that (1) , ; (2) ; (3) , ; (4) , ; (5) The stationary distribution of the Markov chain in the th node is right .

Denote a function as

(6.4) |

The convergence results are described by this function.

###### Proposition 2.

Unlike in Theorem 1, the Markov chain assumptions cannot be weakened, i.e., the Markov chains must be time-homogeneous and reversible in Proposition 2. Another difference is that the stationary distributions need not be uniform. In fact, (6) can be extended to the case where node stores different functions ; and all satisfy Assumption 5.

## 7 Conclusions

In this paper, we proposed the decentralized Markov chain gradient descent (DMGD) algorithm, where the samples are taken along a trajectory of a Markov chain at each node of the network. Our algorithms can be used when it is impossible or very expensive to sample directly from a distribution, or the distribution is even unknown, but sampling via a Markov chain is possible and relatively cheap. Convergence is established in possibly nonconvex cases.

Building upon the current work, several promising future directions can be pursued. The first is to extend DMGD to the asynchronous setting, which can further reduce the synchronization overhead. The second is to reduce the communication cost of DMGD by using quantization or sparsification techniques. Designing Markov chain primal-dual algorithms is also worth investigating.

## References

- [1] Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, pages 28–40. Citeseer, 2010.
- [2] Tuncer Can Aysal, Mehmet Ercan Yildiz, Anand D Sarwate, and Anna Scaglione. Broadcast gossip algorithms for consensus. IEEE Transactions on Signal processing, 57(7):2748–2761, 2009.
- [3] Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Gossip algorithms: Design, analysis and applications. In INFOCOM 2005. 24th Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings IEEE, volume 3, pages 1653–1664. IEEE, 2005.
- [4] Tsung-Hui Chang, Mingyi Hong, and Xiangfeng Wang. Multi-agent distributed optimization via inexact consensus ADMM. IEEE Trans. Signal Processing, 63(2):482–497, 2015.
- [5] Annie I Chen and Asuman Ozdaglar. A fast distributed proximal-gradient method. In Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, pages 601–608. IEEE, 2012.
- [6] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
- [7] John C Duchi, Alekh Agarwal, Mikael Johansson, and Michael I Jordan. Ergodic mirror descent. SIAM Journal on Optimization, 22(4):1549–1578, 2012.
- [8] John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015.
- [9] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- [10] Saghar Hosseini, Airlie Chapman, and Mehran Mesbahi. Online distributed convex optimization on dynamic networks. IEEE Trans. Automat. Contr., 61(11):3545–3550, 2016.
- [11] Dušan Jakovetić, Joao Xavier, and José MF Moura. Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59(5):1131–1146, 2014.
- [12] Björn Johansson, Maben Rabi, and Mikael Johansson. A simple peer-to-peer algorithm for distributed optimization in sensor networks. In Decision and Control, 2007 46th IEEE Conference on, pages 4705–4710. IEEE, 2007.
- [13] Björn Johansson, Maben Rabi, and Mikael Johansson. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3):1157–1170, 2009.
- [14] Guanghui Lan, Soomin Lee, and Yi Zhou. Communication-efficient algorithms for decentralized and stochastic optimization. arXiv preprint arXiv:1701.03961, 2017.
- [15] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pages 583–598, 2014.
- [16] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017.
- [17] Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. Asynchronous decentralized parallel stochastic gradient descent. In Proceedings of the 35th International Conference on Machine Learning, pages 3043–3052, 2018.
- [18] Ion Matei and John S Baras. Performance evaluation of the consensus-based distributed subgradient method under random communication topologies. IEEE Journal of Selected Topics in Signal Processing, 5(4):754–771, 2011.
- [19] Brendan McMahan and Matthew Streeter. Delay-tolerant algorithms for asynchronous distributed online learning. In Advances in Neural Information Processing Systems, pages 2915–2923, 2014.
- [20] Ravi Montenegro, Prasad Tetali, et al. Mathematical aspects of mixing times in Markov chains. Foundations and Trends® in Theoretical Computer Science, 1(3):237–354, 2006.
- [21] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
- [22] Reza Olfati-Saber, J Alex Fax, and Richard M Murray. Consensus and cooperation in networked multi-agent systems. Proceedings of the IEEE, 95(1):215–233, 2007.
- [23] S Sundhar Ram, A Nedić, and Venugopal V Veeravalli. Incremental stochastic subgradient algorithms for convex optimization. SIAM Journal on Optimization, 20(2):691–717, 2009.
- [24] S Sundhar Ram, Angelia Nedić, and Venu V Veeravalli. Asynchronous gossip algorithm for stochastic optimization: Constant stepsize analysis. In Recent Advances in Optimization and its Applications in Engineering, pages 51–60. Springer, 2010.
- [25] S Sundhar Ram, Angelia Nedić, and Venugopal V Veeravalli. Distributed stochastic subgradient projection algorithms for convex optimization. Journal of optimization theory and applications, 147(3):516–545, 2010.
- [26] Herbert Robbins and S Monro. A stochastic approximation method. Annals Math Statistics, 22:400–407, 1951.
- [27] Luca Schenato and Giovanni Gamba. A distributed consensus protocol for clock synchronization in wireless sensor network. 2007.
- [28] Ioannis D Schizas, Alejandro Ribeiro, and Georgios B Giannakis. Consensus in ad hoc WSNs with noisy links, part I: Distributed estimation of deterministic signals. IEEE Transactions on Signal Processing, 56(1):350–364, 2008.
- [29] Ohad Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research, 18(52):1–11, 2017.
- [30] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
- [31] Wei Shi, Qing Ling, Kun Yuan, Gang Wu, and Wotao Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Trans. Signal Processing, 62(7):1750–1761, 2014.
- [32] Benjamin Sirb and Xiaojing Ye. Consensus optimization with delayed and stochastic gradients on decentralized networks. In Big Data (Big Data), 2016 IEEE International Conference on, pages 76–85. IEEE, 2016.
- [33] Kunal Srivastava and Angelia Nedic. Distributed asynchronous constrained stochastic optimization. IEEE Journal of Selected Topics in Signal Processing, 5(4):772–790, 2011.
- [34] Tao Sun, Yuejiao Sun, and Wotao Yin. On Markov chain gradient descent. In Advances in Neural Information Processing Systems, pages 9917–9926, 2018.
- [35] Kun Yuan, Qing Ling, and Wotao Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016.
- [36] Jinshan Zeng and Wotao Yin. On nonconvex decentralized gradient descent. IEEE Transactions on Signal Processing, 66(11):2834–2848, 2018.

## Appendix A Technical lemmas

###### Lemma 1.

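Let Assumption 4 hold and let $\lambda_i(P) \in \mathbb{C}$ be the $i$th largest eigenvalue of the transition matrix $P \in \mathbb{R}^{M \times M}$, and

$$\lambda(P) := \frac{\max\{|\lambda_2(P)|, |\lambda_M(P)|\} + 1}{2}.$$

Then, we can bound the largest entry-wise absolute value of the deviation matrix $\delta^k := \Pi^* - P^k \in \mathbb{R}^{M \times M}$, where $\Pi^* = \lim_{k} P^k$ is the limit in (2.5), as

$$\|\delta^k\|_\infty \le C_P \cdot \lambda^k(P) \tag{A.1}$$

for $k \ge K_P$, where $C_P$ is a constant that also depends on the Jordan canonical form of $P$ and $K_P$ is a constant that depends on $\lambda(P)$ and $\lambda_2(P)$.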
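The geometric decay in (A.1) can be observed numerically. The sketch below uses a hypothetical 3-state transition matrix with eigenvalues $1$, $0.5$ and $0$ and stationary distribution $(0.25, 0.5, 0.25)$; it measures the largest entry-wise deviation between $P^k$ and its limit and the ratio of successive deviations. The name `deviation` is illustrative.

```python
import numpy as np

P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi_star = np.array([0.25, 0.50, 0.25])  # satisfies pi* P = pi* for this P
Pi_star = np.tile(pi_star, (3, 1))      # limit of P^k: every row equals pi*

def deviation(P, Pi_star, k):
    """Largest entry-wise absolute value of Pi* - P^k."""
    return np.max(np.abs(Pi_star - np.linalg.matrix_power(P, k)))

moduli = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
ratios = [deviation(P, Pi_star, k + 1) / deviation(P, Pi_star, k) for k in range(1, 10)]

print("eigenvalue moduli:", np.round(moduli, 6))
print("deviation at k = 1, 5, 10:", [float(deviation(P, Pi_star, k)) for k in (1, 5, 10)])
print("successive ratios:", np.round(ratios, 6))
```

For this diagonalizable example the deviation shrinks by exactly the second-largest eigenvalue modulus at every step; for general chains the bound holds only up to a constant and after finitely many iterations, as stated in the lemma.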

First, we define a family of matrices as

(A.2) |

and . And the projection matrix is given as

(A.3) |

It is easy to see

(A.4) |

###### Lemma 2.

Let and with , then, we have

(A.5) |

where is the second eigenvalue of .

### Proof of Lemma 2

Assume the eigenvalue decomposition of is

with and is a diagonal matrix. It is worth mentioning , where denotes the first row of . Similarly, we use the notation

By direct computation,

Therefore, we can derive

(A.6) |

It then remains to bound , which can be estimated as

###### Lemma 3.

Let be generated by Algorithm 1 and Assumption 2 hold, then we have

(A.7) |

for any and some . Moreover, the gradients of along the sequence are bounded, i.e.,

(A.8) |

and

(A.9) |

for any .