# Characterizing the Exact Behaviors of Temporal Difference Learning Algorithms Using Markov Jump Linear System Theory

In this paper, we provide a unified analysis of temporal difference learning algorithms with linear function approximators by exploiting their connections to Markov jump linear systems (MJLS). We tailor the MJLS theory developed in the control community to characterize the exact behaviors of the first and second order moments of a large family of temporal difference learning algorithms. For both the IID and Markov noise cases, we show that the evolution of some augmented versions of the mean and covariance matrix of TD learning exactly follows the trajectory of a deterministic linear time-invariant (LTI) dynamical system. Applying the well-known LTI system theory, we obtain closed-form expressions for the mean and covariance matrix of TD learning at any time step. We provide a tight matrix spectral radius condition to guarantee the convergence of the covariance matrix of TD learning, and perform a perturbation analysis to characterize the dependence of the TD behaviors on learning rate. For the IID case, we provide an exact formula characterizing how the mean and covariance matrix of TD learning converge to the steady state values at a linear rate. For the Markov case, we use our formulas to explain how the behaviors of TD learning algorithms are affected by learning rate and various properties of the underlying Markov chain.

## 1 Introduction

Reinforcement learning (RL) has shown great promise in solving sequential decision making tasks bertsekas1996neuro ; sutton2018reinforcement . One important topic for RL is policy evaluation whose objective is to evaluate the value function of a given policy. A large family of temporal difference (TD) learning methods including standard TD, GTD, TDC, GTD2, DTD, and ATD sutton1988learning ; sutton2008convergent ; sutton2009fast ; Niao2019 have been developed to solve the policy evaluation problem. These TD learning algorithms have become important building blocks for RL algorithms. See dann2014policy for a comprehensive survey. Despite the popularity of TD learning, the behaviors of these algorithms have not been fully understood from a theoretical viewpoint. The standard ODE technique TsiRoy1997 ; borkar2000ode ; bhatnagar2012stochastic ; kushner2003stochastic ; borkar2009stochastic can only be used to prove asymptotic convergence. Finite sample bounds are challenging to obtain and typically developed in a case-by-case manner. Recently, there have been intensive research activities focusing on establishing finite sample bounds for TD learning methods with linear function approximations under various assumptions. The IID noise case is covered in dalal2018finite ; lakshminarayanan2018linear ; liu2015finite . In bhandari2018finite , the analysis is extended for a Markov noise model but an extra projection step in the algorithm is required. Very recently, finite sample bounds for the TD method (without the projection step) under the Markov assumption have been obtained in srikant2019finite . The bounds in srikant2019finite actually work for any TD learning algorithm that can be modeled by a linear stochastic approximation scheme. It remains unclear how tight these bounds are (especially for the large learning rate region). 
To complement the existing analysis results and techniques, we propose a general unified analysis framework for TD learning algorithms by borrowing the Markov jump linear system (MJLS) theory costa2006 from the controls literature. Our approach is inspired by a recent research trend in applying control theory for analysis of optimization algorithms Lessard2014 ; hu2017SGD ; Bin2017COLT ; BinHu2017 ; fazlyab2017analysis ; van2017fastest ; cyrus2018robust ; sundararajan2017robust ; hu2017control ; hu2018dissipativity ; fazlyab2017dynamical ; aybat2018robust ; lessard2019direct ; han2019systematic ; aybat2019universally ; dhingra2018proximal ; nelson2018integral , and extends the jump system perspective for finite sum optimization methods in Bin2017COLT to TD learning.

Our key insight is that TD learning algorithms with linear function approximations are essentially just Markov jump linear systems. Notice that a MJLS is described by a linear state space model whose state/input matrices are functions of a jump parameter sampled from a finite state Markov chain. Since the behaviors of MJLS have been well established in the controls field costa2006 ; feng1992stochastic ; abou1995solution ; chizeck1986discrete ; costa1993stability ; ji1988controllability ; ji1990jump ; el1996robust ; Fang2002 ; seiler2003bounded , we can borrow the analysis tools there to analyze TD learning algorithms in a more unified manner. Our main contributions are summarized as follows.

1. We present a unified Markov jump linear system perspective on a large family of TD learning algorithms including TD, TDC, GTD, GTD2, ATD, and DTD. Specifically, we make the key observation that these methods are just MJLS subject to some prescribed input.

2. By tailoring the existing MJLS theory, we show that the evolution of some augmented versions of the mean and covariance matrix of all above TD learning methods exactly follows the trajectory of a deterministic linear time-invariant (LTI) dynamical system for both the IID and Markov noise cases. As a result, we obtain unified closed-form formulas for the mean and covariance matrix of TD learning at any time step.

3. We provide a tight matrix spectral radius condition to guarantee the convergence of the covariance matrix of TD learning under the general Markov assumption. By using the matrix perturbation theory moro1997lidskii ; kato2013perturbation ; avrachenkov2013analytic ; gonzalez2015laurent , we perform a perturbation analysis to show the dependence of the behaviors of TD learning on learning rate in a more explicit manner. For the IID case, we provide an exact formula characterizing how the mean and covariance matrix of TD learning converge to the steady state values at a linear rate. For the Markov case, we use our formulas to explain how the behaviors of TD learning algorithms are affected by learning rate and various properties of the underlying Markov chain.

We view our proposed analysis as a complement to, rather than a replacement for, existing analysis techniques. Our exact formulas provide new insights, especially in the large learning rate regime.

## 2 Background

### 2.1 Notation

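The set of $n$-dimensional real vectors is denoted as $\mathbb{R}^n$. The Kronecker product of two matrices $A$ and $B$ is denoted by $A \otimes B$. Notice $(A \otimes B)^T = A^T \otimes B^T$ and $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$ when the matrices have compatible dimensions. Let $\mathrm{vec}$ denote the standard vectorization operation that stacks the columns of a matrix into a vector. We have $\mathrm{vec}(AXB) = (B^T \otimes A)\,\mathrm{vec}(X)$. Let $\mathrm{sym}$ denote the symmetrization operation, i.e. $\mathrm{sym}(A) = (A + A^T)/2$. Let $\mathrm{diag}(H_i)$ denote a matrix whose $(i,i)$-th block is $H_i$ and all other blocks are zero. Specifically, given $H_i$ for $i = 1, \ldots, n$, we have

$$\mathrm{diag}(H_i) = \begin{bmatrix} H_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & H_n \end{bmatrix}.$$

A square matrix is Schur stable if all its eigenvalues have magnitude strictly less than $1$. A square matrix is Hurwitz if all its eigenvalues have strictly negative real parts. The spectral radius of a matrix $H$ is denoted as $\sigma(H)$. The eigenvalue with the largest magnitude of $H$ is denoted as $\lambda_{\max}(H)$ and the eigenvalue with the largest real part of $H$ is denoted as $\lambda_{\max\,\mathrm{real}}(H)$.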
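The Kronecker and vectorization identities used throughout the paper can be checked numerically. The following is a minimal sketch, assuming NumPy and column-major vectorization; the helper names `vec` and `sym` and the random test matrices are illustrative choices, not part of the paper.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one vector (column-major order)."""
    return np.asarray(M).reshape(-1, order="F")

def sym(M):
    """Symmetrization (M + M^T) / 2 of a square matrix."""
    return 0.5 * (M + M.T)

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((3, 2))
D = rng.standard_normal((5, 4))

# vec(A X B) = (B^T kron A) vec(X)
vec_identity_ok = np.allclose(vec(A @ X @ B), np.kron(B.T, A) @ vec(X))
# (A kron B)(C kron D) = (AC) kron (BD)
mixed_product_ok = np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))
# (A kron B)^T = A^T kron B^T
transpose_ok = np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))
# sym(M) is symmetric for any square M
S = sym(rng.standard_normal((3, 3)))
sym_ok = np.allclose(S, S.T)
```

The `order="F"` argument matters here: NumPy flattens row by row by default, whereas the identity $\mathrm{vec}(AXB) = (B^T \otimes A)\,\mathrm{vec}(X)$ assumes columns are stacked.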

### 2.2 Useful facts for linear time-invariant systems

The behaviors of linear time-invariant (LTI) systems have been well understood and documented in standard control textbooks hespanha2009 ; chen1998linear . We review a few useful facts here. Consider an LTI system governed by the following state-space model.

$$x^{k+1} = H x^k + G u^k, \tag{1}$$

where $x^k \in \mathbb{R}^{n_x}$, $u^k \in \mathbb{R}^{n_u}$, $H \in \mathbb{R}^{n_x \times n_x}$, and $G \in \mathbb{R}^{n_x \times n_u}$. Given an initial condition $x^0$ and an input sequence $\{u^k\}$, the sequence $\{x^k\}$ is uniquely determined as

$$x^k = (H)^k x^0 + \sum_{t=0}^{k-1} (H)^{k-1-t} G u^t, \tag{2}$$

where $(H)^k$ stands for the $k$-th power of the matrix $H$. The above formula gives a complete characterization of the behaviors of the LTI model (1). The first term $(H)^k x^0$ is the so-called homogeneous state response of (1). When $H$ is Schur stable, $(H)^k$ converges to a zero matrix and $(H)^k x^0 \to 0$ for any arbitrary $x^0$. When $\sigma(H) \ge 1$, there always exists $x^0$ such that $(H)^k x^0$ does not converge to $0$. When $\sigma(H) > 1$, there even exists $x^0$ such that $\|(H)^k x^0\| \to \infty$. See Section 7.2 in hespanha2009 for a detailed discussion. When $\sigma(H) < 1$, we know that $(H)^k x^0$ converges to $0$ at a linear rate specified by $\sigma(H)$. See Section 2.2 in Lessard2014 for a detailed discussion. The second term $\sum_{t=0}^{k-1} (H)^{k-1-t} G u^t$ is called the forced response of (1). When $H$ is Schur stable, this term has a bounded norm for any $k$, provided that the norm of the input $u^k$ is uniformly bounded above by a constant hespanha2009 . Now we summarize a few useful facts in the following proposition.

###### Proposition 1.

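Suppose $\sigma(H) < 1$, and $x^k$ is determined by (2). The following statements are true:

1. If $u^k = u$ for all $k$, then $\lim_{k \to \infty} x^k$ exists. We have $x^\infty = \lim_{k \to \infty} x^k = (I - H)^{-1} G u$. In addition, $x^k$ can be expressed as

$$x^k = x^\infty + (H)^k (x^0 - x^\infty). \tag{3}$$

In addition, $\|x^k - x^\infty\| \le C_0 (\sigma(H) + \varepsilon)^k$ for some $C_0$ and any arbitrarily small $\varepsilon > 0$.

2. Suppose $\lim_{k \to \infty} u^k$ exists and is equal to $u^\infty$. Then $\lim_{k \to \infty} x^k$ exists and we have $x^\infty = (I - H)^{-1} G u^\infty$.

3. Suppose $u^k$ converges to $u^\infty$ as $\|u^k - u^\infty\| \le C \tilde{\rho}^k$. Choose $\rho$ as $\max\{\sigma(H) + \varepsilon, \tilde{\rho}\}$ where $\varepsilon$ is arbitrarily small. Then we still have $x^\infty = (I - H)^{-1} G u^\infty$. In addition, we have $\|x^k - x^\infty\| \le C_0 \rho^k$ where $C_0$ is some constant.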

###### Proof.

The above facts are well known in the control community. For completeness, we will include a proof in the supplementary material. ∎
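As a quick illustration of the constant-input case, the recursion $x^{k+1} = Hx^k + Gu$ can be iterated and compared against the closed form $x^k = x^\infty + (H)^k(x^0 - x^\infty)$ with $x^\infty = (I-H)^{-1}Gu$. This is a minimal sketch with NumPy; the matrices `H`, `G`, the input `u`, and the function names are arbitrary illustrative choices.

```python
import numpy as np

def simulate_lti(H, G, u, x0, steps):
    """Iterate x^{k+1} = H x^k + G u with a constant input u."""
    x = x0.copy()
    for _ in range(steps):
        x = H @ x + G @ u
    return x

def closed_form_lti(H, G, u, x0, k):
    """Evaluate x^k = x_inf + H^k (x0 - x_inf); requires I - H to be invertible."""
    x_inf = np.linalg.solve(np.eye(H.shape[0]) - H, G @ u)
    return x_inf + np.linalg.matrix_power(H, k) @ (x0 - x_inf), x_inf

H = np.array([[0.5, 0.2], [0.0, 0.3]])   # triangular, so the spectral radius is 0.5
G = np.array([[1.0], [0.5]])
u = np.array([2.0])
x0 = np.array([1.0, -1.0])

x_sim = simulate_lti(H, G, u, x0, 15)
x_formula, x_inf = closed_form_lti(H, G, u, x0, 15)
```

Because $\sigma(H) = 0.5$ in this example, the distance to `x_inf` shrinks by roughly a factor of two per step.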

When $u^k$ is a constant, (3) gives a precise characterization of the behaviors of $x^k$. Specifically, $x^k$ is a sum of a constant steady state term $x^\infty$ and a matrix power term that decays at a linear rate specified by $\sigma(H)$. In general, the convergence rate of $x^k$ depends on the convergence rate of $u^k$. When $u^k$ is a constant, the convergence rate of $x^k$ is completely specified by $\sigma(H)$. When $u^k$ itself converges at a linear rate $\tilde{\rho}$, the convergence rate of $x^k$ will be dominated by $\max\{\sigma(H) + \varepsilon, \tilde{\rho}\}$.

The stability condition $\sigma(H) < 1$ is quite tight. See more discussions in the supplementary material. We will show that the first and second order moments of TD learning algorithms are exactly governed by the formula (2) and can be analyzed using Proposition 1 if the matrices and the input in (1) are chosen properly.

### 2.3 Useful facts for Markov jump linear systems

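Next we briefly review the MJLS theory. We follow the treatment in the standard textbook costa2006 . Let $z^k$ be a finite state Markov chain. An MJLS is governed by the following state-space model:

$$\xi^{k+1} = H(z^k)\xi^k + G(z^k)y^k, \tag{4}$$

where $H(z^k)$ and $G(z^k)$ are matrix functions of $z^k$. Clearly, $\xi^k$ is the state, and $y^k$ is the input. Let $z^k$ be sampled from a finite state space $\mathcal{S}$. Then there is a one-to-one mapping from $\mathcal{S}$ to the finite set $\mathcal{N} = \{1, 2, \ldots, n\}$ where $n = |\mathcal{S}|$. Hence we can assume $H(z^k)$ is sampled from a finite set of matrices $\{H_i : i \in \mathcal{N}\}$ and $G(z^k)$ is sampled from $\{G_i : i \in \mathcal{N}\}$. Without loss of generality, we can assume $z^k$ is sampled from $\mathcal{N}$ and then align our notation as $H(i) = H_i$ and $G(i) = G_i$ for $i \in \mathcal{N}$. The setup is general enough to cover any finite state space case due to the one-to-one correspondence between $\mathcal{S}$ and $\mathcal{N}$. Next, we assume that the Markov chain $\{z^k\}$ has transition probabilities $\mathbb{P}(z^{k+1} = j \mid z^k = i) = p_{ij}$ where $p_{ij} \ge 0$ and $\sum_{j=1}^n p_{ij} = 1$ for all $i$. We specify the transition matrix $P$ by setting its $(i,j)$-th entry to be $p_{ij}$.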

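An amazing fact is that some augmented versions of the mean value and the covariance matrix of $\xi^k$ for the MJLS model (4) actually follow the dynamics of a deterministic LTI model in the form of (1). This fact is well documented in the MJLS literature (Chapter 3 in costa2006 ). We briefly review these results here and will apply them to analyze TD learning. Let us define $q_i^k = \mathbb{E}(\xi^k \mathbf{1}_{\{z^k = i\}})$ and $Q_i^k = \mathbb{E}(\xi^k (\xi^k)^T \mathbf{1}_{\{z^k = i\}})$. The indicator function $\mathbf{1}_{\{z^k = i\}}$ is defined as $1$ if $z^k = i$ and $0$ otherwise. We further set $\mu^k = \mathbb{E}\xi^k$ and $\mathcal{Q}^k = \mathbb{E}(\xi^k (\xi^k)^T)$. Obviously, we have $\mu^k = \sum_{i=1}^n q_i^k$ and $\mathcal{Q}^k = \sum_{i=1}^n Q_i^k$. We also augment $q_i^k$ and $Q_i^k$ as

$$q^k = \begin{bmatrix} q_1^k \\ \vdots \\ q_n^k \end{bmatrix}, \qquad Q^k = \begin{bmatrix} Q_1^k & Q_2^k & \cdots & Q_n^k \end{bmatrix}.$$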

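For simplicity, first consider the case where $y^k = 0$ for all $k$. Proposition 3.1 in costa2006 states that, in this case, $q_i^k$ and $Q_i^k$ can be calculated iteratively as $q_j^{k+1} = \sum_{i=1}^n p_{ij} H_i q_i^k$ and $Q_j^{k+1} = \sum_{i=1}^n p_{ij} H_i Q_i^k H_i^T$. The update rule for $q^k$ is equivalent to $q^{k+1} = (P^T \otimes I_{n_\xi})\,\mathrm{diag}(H_i)\, q^k$, where $n_\xi$ is the dimension of $\xi^k$. We can also obtain $\mathrm{vec}(Q^{k+1}) = (P^T \otimes I_{n_\xi^2})\,\mathrm{diag}(H_i \otimes H_i)\,\mathrm{vec}(Q^k)$, which is a compact form for

$$\mathrm{vec}(Q^{k+1}) = \begin{bmatrix} p_{11} H_1 \otimes H_1 & \cdots & p_{n1} H_n \otimes H_n \\ \vdots & \ddots & \vdots \\ p_{1n} H_1 \otimes H_1 & \cdots & p_{nn} H_n \otimes H_n \end{bmatrix} \mathrm{vec}(Q^k). \tag{5}$$

Therefore, if $y^k = 0$ for all $k$, we can compute $q^k$ and $Q^k$ as $q^k = \left((P^T \otimes I_{n_\xi})\,\mathrm{diag}(H_i)\right)^k q^0$ and $\mathrm{vec}(Q^k) = \left((P^T \otimes I_{n_\xi^2})\,\mathrm{diag}(H_i \otimes H_i)\right)^k \mathrm{vec}(Q^0)$. See Chapter 3.2 in costa2006 for detailed proofs.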

For the purpose of analyzing TD learning, we need to look at the case where $y^k \neq 0$. In this case, $q^k$ and $\mathrm{vec}(Q^k)$ just track the trajectories of (1) with non-zero $u^k$. Denote $p_i^k = \mathbb{P}(z^k = i)$. A direct consequence of Proposition 3.35 in costa2006 is that given $y^k = 1$ for all $k$, $q_j^k$ and $Q_j^k$ can be calculated as

$$q_j^{k+1} = \sum_{i=1}^n p_{ij}\left(H_i q_i^k + G_i p_i^k\right), \tag{6}$$

$$Q_j^{k+1} = \sum_{i=1}^n p_{ij}\left(H_i Q_i^k H_i^T + 2\,\mathrm{sym}(H_i q_i^k G_i^T) + p_i^k G_i G_i^T\right), \tag{7}$$

which can be rewritten as an LTI model (1) subject to non-zero input which involves $p_i^k$. To save some space, we will present the explicit formula for this LTI model in the supplementary material. The key message is that the behaviors of $q^k$ and $Q^k$ can be fully understood via the LTI theory. More discussions on this point are presented in the supplementary material.

In general, the covariance matrix $\mathcal{Q}^k = \mathbb{E}(\xi^k(\xi^k)^T)$ and the mean value $\mu^k = \mathbb{E}\xi^k$ do not directly follow an LTI system. However, when working with the augmented covariance matrix $Q^k$ and the augmented mean value vector $q^k$, we do obtain an LTI model in the form of (1). Moreover, any rate bound on $Q^k$ also directly works for the mean square error since one has $\mathbb{E}\|\xi^k\|^2 = \mathrm{trace}(\mathcal{Q}^k) = \sum_{i=1}^n \mathrm{trace}(Q_i^k)$.

#### IID case.

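Suppose $z^k$ is sampled in an IID manner, i.e. $\mathbb{P}(z^k = i) = p_i$ for all $k$. Then both $\mu^k = \mathbb{E}\xi^k$ and $\mathcal{Q}^k = \mathbb{E}(\xi^k(\xi^k)^T)$ directly form LTI systems with much smaller dimensions. Specifically, with $y^k = 1$ for all $k$, we have

$$\mu^{k+1} = \sum_{i=1}^n p_i\left(H_i \mu^k + G_i\right) = \bar{H}\mu^k + \bar{G},$$

$$\mathrm{vec}(\mathcal{Q}^{k+1}) = \left(\sum_{i=1}^n p_i H_i \otimes H_i\right)\mathrm{vec}(\mathcal{Q}^k) + \left(\sum_{i=1}^n p_i\left(H_i \otimes G_i + G_i \otimes H_i\right)\right)\mu^k + \sum_{i=1}^n p_i\, G_i \otimes G_i,$$

where $\bar{H} = \sum_{i=1}^n p_i H_i$ and $\bar{G} = \sum_{i=1}^n p_i G_i$.

There are many ways to derive the above formulas. One way is to first show $q_i^k = p_i \mu^k$ and $Q_i^k = p_i \mathcal{Q}^k$ in this case and then apply (6) and (7). Under the IID assumption, one just checks the spectral radius of $\bar{H}$ and $\sum_{i=1}^n p_i H_i \otimes H_i$ to guarantee the linear convergence in the form of (3).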
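The exactness of the IID moment recursions can be checked on a toy example by comparing them with a brute-force expectation over every possible mode sequence. The sketch below assumes the unit input $y^k = 1$, so each $G_i$ is a vector; the two-mode system, the probabilities, and the function names are hypothetical and chosen only for illustration.

```python
import itertools
import numpy as np

def iid_moments_lti(Hs, Gs, p, xi0, steps):
    """Propagate the mean and the vectorized second moment with the IID recursions."""
    modes = range(len(p))
    H_bar = sum(p[i] * Hs[i] for i in modes)
    G_bar = sum(p[i] * Gs[i] for i in modes)
    M_qq = sum(p[i] * np.kron(Hs[i], Hs[i]) for i in modes)
    # H_i kron G_i + G_i kron H_i, with G_i treated as a column vector
    M_qmu = sum(p[i] * (np.kron(Hs[i], Gs[i][:, None]) + np.kron(Gs[i][:, None], Hs[i]))
                for i in modes)
    c = sum(p[i] * np.kron(Gs[i], Gs[i]) for i in modes)
    mu, q = xi0.copy(), np.kron(xi0, xi0)       # q = vec(xi0 xi0^T)
    for _ in range(steps):
        # The right-hand side is evaluated with the old mu before assignment.
        mu, q = H_bar @ mu + G_bar, M_qq @ q + M_qmu @ mu + c
    return mu, q

def iid_moments_bruteforce(Hs, Gs, p, xi0, steps):
    """Exact moments obtained by enumerating every mode sequence of the given length."""
    dim = xi0.size
    mu, Q = np.zeros(dim), np.zeros((dim, dim))
    for path in itertools.product(range(len(p)), repeat=steps):
        prob, xi = 1.0, xi0.copy()
        for i in path:
            prob *= p[i]
            xi = Hs[i] @ xi + Gs[i]
        mu += prob * xi
        Q += prob * np.outer(xi, xi)
    return mu, Q.reshape(-1, order="F")

Hs = [np.array([[0.9, 0.1], [0.0, 0.8]]), np.array([[0.7, 0.0], [0.2, 0.95]])]
Gs = [np.array([0.1, -0.2]), np.array([-0.1, 0.3])]
p = [0.4, 0.6]
xi0 = np.array([1.0, -1.0])

mu_lti, q_lti = iid_moments_lti(Hs, Gs, p, xi0, steps=6)
mu_bf, q_bf = iid_moments_bruteforce(Hs, Gs, p, xi0, steps=6)
```

The brute-force version costs $n^k$ evaluations, while the recursion only multiplies fixed matrices of size $n_\xi$ and $n_\xi^2$ at each step.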

## 3 A general Markov jump system perspective for TD learning

In this section, we propose a general jump system perspective for TD learning algorithms. We will apply the proposed framework to obtain more detailed analysis results for TD learning under the IID and Markov assumptions in the next two sections.

First, notice that many TD learning algorithms including TD, TDC, GTD, GTD2, ATD, and DTD are just special cases of the following linear stochastic recursion:

$$\xi^{k+1} = \xi^k + \alpha\left(A(z^k)\xi^k + b(z^k)\right), \tag{8}$$

which can be immediately rewritten as the following MJLS

$$\xi^{k+1} = \left(I + \alpha A(z^k)\right)\xi^k + \alpha b(z^k). \tag{9}$$

The above model is a special case of (4) if we set $H(z^k) = I + \alpha A(z^k)$, $G(z^k) = \alpha b(z^k)$, and $y^k = 1$ for all $k$. Consequently, many TD learning algorithms can be analyzed using the MJLS theory reviewed in Section 2.3. For illustrative purposes, we explain the jump system formulation for the standard TD method.

#### Example 1: TD method.

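The standard TD method (or TD(0)) uses the following update rule:

$$\theta^{k+1} = \theta^k - \alpha\,\phi(s^k)\left(\left(\phi(s^k) - \gamma\phi(s^{k+1})\right)^T\theta^k - r(s^k)\right),$$

where $\{s^k\}$ is the underlying Markov chain, $\phi$ is the feature vector, $r$ is the reward, $\gamma$ is the discounting factor, and $\theta^k$ is the weight vector to be estimated. Suppose $\theta^*$ is the vector that solves the projected Bellman equation. We can set $z^k = \begin{bmatrix} (s^{k+1})^T & (s^k)^T \end{bmatrix}^T$ and then rewrite the TD update as

$$\theta^{k+1} - \theta^* = \left(I + \alpha A(z^k)\right)(\theta^k - \theta^*) + \alpha b(z^k), \tag{10}$$

where $A(z^k) = \phi(s^k)\left(\gamma\phi(s^{k+1}) - \phi(s^k)\right)^T$ and $b(z^k) = \phi(s^k)\left(r(s^k) + \left(\gamma\phi(s^{k+1}) - \phi(s^k)\right)^T\theta^*\right)$. See Section 3.1 in srikant2019finite for more explanations. Now we can extend the MJLS theory reviewed in Section 2.3 to analyze $\{\theta^k - \theta^*\}$.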
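To make the jump system reformulation of TD(0) concrete, the sketch below builds $A(z)$ and $b(z)$ for a small Markov reward process and checks that the shifted recursion reproduces the TD(0) update for every transition. The three-state chain, the features, and the rewards are made up for illustration, and $\theta^*$ is computed here as the TD fixed point under the stationary distribution (an assumption about how the projected Bellman solution is obtained numerically).

```python
import numpy as np

gamma, alpha = 0.9, 0.1
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.4, 0.0, 0.6]])                        # toy transition matrix (assumed)
Phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # row s is the feature phi(s)
r = np.array([1.0, 0.0, -0.5])                         # toy rewards (assumed)

# Stationary distribution: solve pi^T P = pi^T together with sum(pi) = 1.
lhs = np.vstack([P.T - np.eye(3), np.ones((1, 3))])
pi = np.linalg.lstsq(lhs, np.array([0.0, 0.0, 0.0, 1.0]), rcond=None)[0]

def A_of(s, s_next):
    """A(z) = phi(s) (gamma phi(s') - phi(s))^T for the jump parameter z = (s, s')."""
    return np.outer(Phi[s], gamma * Phi[s_next] - Phi[s])

pairs = [(s, t) for s in range(3) for t in range(3)]
A_bar = sum(pi[s] * P[s, t] * A_of(s, t) for s, t in pairs)
r_bar = sum(pi[s] * r[s] * Phi[s] for s in range(3))
theta_star = np.linalg.solve(A_bar, -r_bar)            # TD fixed point

def b_of(s, s_next):
    """b(z) = phi(s) (r(s) + (gamma phi(s') - phi(s))^T theta_star)."""
    return Phi[s] * (r[s] + (gamma * Phi[s_next] - Phi[s]) @ theta_star)

def td_step(theta, s, s_next):
    """One TD(0) update in its usual form."""
    td_error = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    return theta + alpha * Phi[s] * td_error

def jump_step(theta, s, s_next):
    """The same update written as the jump system recursion on theta - theta_star."""
    xi = theta - theta_star
    xi_next = (np.eye(2) + alpha * A_of(s, s_next)) @ xi + alpha * b_of(s, s_next)
    return theta_star + xi_next

# Average of b(z) under the stationary distribution of z = (s, s').
b_bar = sum(pi[s] * P[s, t] * b_of(s, t) for s, t in pairs)
```

In this construction the stationary average `b_bar` vanishes by the definition of `theta_star`, which is the zero-mean condition on $b$ assumed in the later analysis.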

Here we omit the detailed formulations for other TD learning methods since it is a well-known fact that all these methods can be rewritten in the form of (8) if $A(z^k)$ and $b(z^k)$ are properly chosen. The key message is that $z^k$ can be viewed as a jump parameter and TD learning methods are essentially just MJLS. We want to emphasize that all the TD learning algorithms that can be analyzed using the ODE method are in the form of (9). More discussions on detailed jump system formulations of other TD learning algorithms are presented in the supplementary material. Now we extend the MJLS theory reviewed in Section 2.3 to analyze (9) under both the IID and Markov assumptions.

## 4 Analysis under the IID assumption

For illustrative purposes, we first present the analysis for (9) under the IID assumption. In this case, the analysis is significantly simpler. Consider the jump system model (9). Now we can set $H_i = I + \alpha A_i$, $G_i = \alpha b_i$, and $y^k = 1$ for all $k$. Denote $\bar{A} = \sum_{i=1}^n p_i A_i$. It is also natural to assume $\sum_{i=1}^n p_i b_i = 0$. We can directly obtain the following result.

###### Theorem 1.

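Consider the jump system model (9) with $H_i = I + \alpha A_i$, $G_i = \alpha b_i$, and $y^k = 1$ for all $k$. Suppose $z^k$ is sampled from $\mathcal{N}$ using an IID distribution $\mathbb{P}(z^k = i) = p_i$. In addition, assume $\sum_{i=1}^n p_i b_i = 0$. Then the mean $\mu^k = \mathbb{E}\xi^k$ and the second moment matrix $\mathcal{Q}^k = \mathbb{E}(\xi^k(\xi^k)^T)$ are governed by the following LTI system:

$$\begin{bmatrix} \mu^{k+1} \\ \mathrm{vec}(\mathcal{Q}^{k+1}) \end{bmatrix} = \begin{bmatrix} H_{11} & 0 \\ H_{21} & H_{22} \end{bmatrix} \begin{bmatrix} \mu^k \\ \mathrm{vec}(\mathcal{Q}^k) \end{bmatrix} + \begin{bmatrix} 0 \\ \alpha^2 \sum_{i=1}^n p_i (b_i \otimes b_i) \end{bmatrix}, \tag{11}$$

where $H_{11}$, $H_{21}$, and $H_{22}$ are determined as

$$H_{11} = I + \alpha\bar{A}, \quad H_{21} = \alpha^2 \sum_{i=1}^n p_i (A_i \otimes b_i + b_i \otimes A_i), \quad H_{22} = I_{n_\xi^2} + \alpha(I \otimes \bar{A} + \bar{A} \otimes I) + \alpha^2 \sum_{i=1}^n p_i (A_i \otimes A_i). \tag{12}$$

In addition, the following closed-form solution holds for any $k$,

$$\begin{bmatrix} \mu^k \\ \mathrm{vec}(\mathcal{Q}^k) \end{bmatrix} = \left(\begin{bmatrix} H_{11} & 0 \\ H_{21} & H_{22} \end{bmatrix}\right)^k \begin{bmatrix} \mu^0 \\ \mathrm{vec}(\mathcal{Q}^0) \end{bmatrix} + \alpha^2 \sum_{t=0}^{k-1} \begin{bmatrix} 0 \\ (H_{22})^{k-1-t} \sum_{i=1}^n p_i (b_i \otimes b_i) \end{bmatrix}. \tag{13}$$

Finally, if $\sigma(H_{22}) < 1$, we have

$$\begin{bmatrix} \mu^k \\ \mathrm{vec}(\mathcal{Q}^k) \end{bmatrix} = \left(\begin{bmatrix} H_{11} & 0 \\ H_{21} & H_{22} \end{bmatrix}\right)^k \left(\begin{bmatrix} \mu^0 \\ \mathrm{vec}(\mathcal{Q}^0) \end{bmatrix} - \begin{bmatrix} \mu^\infty \\ \mathrm{vec}(\mathcal{Q}^\infty) \end{bmatrix}\right) + \begin{bmatrix} \mu^\infty \\ \mathrm{vec}(\mathcal{Q}^\infty) \end{bmatrix}, \tag{14}$$

where $\mu^\infty = \lim_{k \to \infty} \mu^k = 0$, and $\mathcal{Q}^\infty$ is given as

$$\mathrm{vec}(\mathcal{Q}^\infty) = \lim_{k \to \infty} \mathrm{vec}(\mathcal{Q}^k) = -\alpha\left(I \otimes \bar{A} + \bar{A} \otimes I + \alpha \sum_{i=1}^n p_i (A_i \otimes A_i)\right)^{-1}\left(\sum_{i=1}^n p_i (b_i \otimes b_i)\right). \tag{15}$$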
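
###### Proof.

This theorem follows from the remark at the end of Section 2.3. Notice $\mathbb{E}(\xi^{k+1} \mid \xi^k) = (I + \alpha\bar{A})\xi^k$. Hence taking the full expectation leads to $\mu^{k+1} = (I + \alpha\bar{A})\mu^k$. Similarly, one can show

$$\mathcal{Q}^{k+1} = \mathcal{Q}^k + \alpha\left(\bar{A}\mathcal{Q}^k + \mathcal{Q}^k\bar{A}^T\right) + \alpha^2 \sum_{i=1}^n p_i\left(A_i \mathcal{Q}^k A_i^T + 2\,\mathrm{sym}(A_i \mu^k b_i^T) + b_i b_i^T\right).$$

Then we can perform the vectorization operation to obtain (11). Next, we can apply (2) to show (13). Finally, (14) is a direct consequence of Proposition 3.6 in costa2006 and Fact 1 in Proposition 1. For completeness, a detailed proof is presented in the supplementary material. ∎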
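The IID result can be exercised numerically: assemble the blocks of the LTI system from the mode matrices, iterate it, and compare the limit with the closed-form steady-state expression. This is a minimal sketch under the zero-mean assumption $\sum_i p_i b_i = 0$; the two modes, the step size, and the function names are illustrative and not taken from the paper.

```python
import numpy as np

def theorem1_blocks(As, bs, p, alpha):
    """Assemble H11, H21, H22 and the constant forcing term of the IID system."""
    d = As[0].shape[0]
    I = np.eye(d)
    A_bar = sum(w * A for w, A in zip(p, As))
    H11 = I + alpha * A_bar
    H21 = alpha**2 * sum(w * (np.kron(A, b[:, None]) + np.kron(b[:, None], A))
                         for w, A, b in zip(p, As, bs))
    H22 = (np.eye(d * d) + alpha * (np.kron(I, A_bar) + np.kron(A_bar, I))
           + alpha**2 * sum(w * np.kron(A, A) for w, A in zip(p, As)))
    forcing = alpha**2 * sum(w * np.kron(b, b) for w, b in zip(p, bs))
    return H11, H21, H22, forcing

def steady_state_cov(As, bs, p, alpha):
    """Closed-form steady-state second moment, returned in vectorized form."""
    d = As[0].shape[0]
    I = np.eye(d)
    A_bar = sum(w * A for w, A in zip(p, As))
    M = (np.kron(I, A_bar) + np.kron(A_bar, I)
         + alpha * sum(w * np.kron(A, A) for w, A in zip(p, As)))
    rhs = sum(w * np.kron(b, b) for w, b in zip(p, bs))
    return -alpha * np.linalg.solve(M, rhs)

As = [np.array([[-1.0, 0.2], [0.0, -0.5]]), np.array([[-0.6, 0.0], [0.1, -1.2]])]
bs = [np.array([1.0, -2.0]), np.array([-1.0, 2.0])]    # p-weighted mean is zero
p = [0.5, 0.5]
alpha = 0.1

H11, H21, H22, forcing = theorem1_blocks(As, bs, p, alpha)
rho_H22 = np.max(np.abs(np.linalg.eigvals(H22)))

# Iterate the LTI system from xi^0 = [1, 1]; old mu is used on the right-hand side.
mu, q = np.array([1.0, 1.0]), np.kron([1.0, 1.0], [1.0, 1.0])
for _ in range(1000):
    mu, q = H11 @ mu, H21 @ mu + H22 @ q + forcing
q_inf = steady_state_cov(As, bs, p, alpha)
```

Re-running the sketch with a smaller `alpha` shrinks `q_inf` roughly in proportion, while `rho_H22` moves closer to one, which is the speed versus accuracy trade-off discussed in the text.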

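Now we discuss various implications of Theorem 1. For simplicity we denote $\mathcal{H} = \begin{bmatrix} H_{11} & 0 \\ H_{21} & H_{22} \end{bmatrix}$.

#### Stability condition and eigenvalue perturbation analysis.

As discussed in Section 2.2, if $\sigma(\mathcal{H}) \ge 1$, one cannot even guarantee the boundedness of $\mathrm{vec}(\mathcal{Q}^k)$ for all $k$. Actually the matrix sum $\sum_{t=0}^{k-1} (H_{22})^{k-1-t}$ blows up in this case. On the other hand, if we have $\sigma(\mathcal{H}) < 1$, then the first term on the right side of (14) converges to $0$ at a linear rate specified by $\sigma(\mathcal{H})$, and the second term on the right side of (14) is a constant matrix quantifying the steady state covariance. We can apply Proposition 3.6 in costa2006 to show that $\mathcal{H}$ is Schur stable if and only if $H_{22}$ is Schur stable. Then the needed stability condition becomes $\sigma(H_{22}) < 1$. An important question is how to choose $\alpha$ such that $\sigma(H_{22}) < 1$ for some given $\{A_i\}$, $\{b_i\}$, and $\{p_i\}$. We provide some clue to this question by applying an eigenvalue perturbation analysis to the matrix $H_{22}$. We assume $\alpha$ is small. Then under a mild technical condition (one such condition is that $\lambda_{\max\,\mathrm{real}}(I \otimes \bar{A} + \bar{A} \otimes I)$ is a semisimple eigenvalue), we can ignore the quadratic term in the expression of $H_{22}$ and use $\lambda_{\max}\left(I_{n_\xi^2} + \alpha(I \otimes \bar{A} + \bar{A} \otimes I)\right)$ to estimate $\lambda_{\max}(H_{22})$. Hence we have

$$\lambda_{\max}(H_{22}) \approx 1 + 2\lambda_{\max\,\mathrm{real}}(\bar{A})\alpha + O(\alpha^2). \tag{16}$$

Then we immediately obtain $\sigma(H_{22}) \approx 1 + 2\,\mathrm{real}\left(\lambda_{\max\,\mathrm{real}}(\bar{A})\right)\alpha + O(\alpha^2)$. Therefore, as long as $\bar{A}$ is Hurwitz, there exists sufficiently small $\alpha$ such that $\sigma(H_{22}) < 1$. This is consistent with the discussion in srikant2019finite where a similar assumption on $\bar{A}$ is made. More details of the perturbation analysis are provided in the supplementary material.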
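The first-order estimate in (16) can be compared with the exact spectral radius of $H_{22}$ for a few step sizes. The sketch below uses the factored form $H_{22} = \sum_i p_i (I + \alpha A_i) \otimes (I + \alpha A_i)$, which expands to the expression in (12); the two mode matrices are the same hypothetical ones as in a toy IID setup, and for them the averaged matrix has real eigenvalues.

```python
import numpy as np

def spectral_radius(M):
    """Largest eigenvalue magnitude of a square matrix."""
    return np.max(np.abs(np.linalg.eigvals(M)))

def H22_iid(As, p, alpha):
    """H22 for the IID case, written as sum_i p_i (I + alpha A_i) kron (I + alpha A_i)."""
    d = As[0].shape[0]
    return sum(w * np.kron(np.eye(d) + alpha * A, np.eye(d) + alpha * A)
               for w, A in zip(p, As))

As = [np.array([[-1.0, 0.2], [0.0, -0.5]]), np.array([[-0.6, 0.0], [0.1, -1.2]])]
p = [0.5, 0.5]
A_bar = sum(w * A for w, A in zip(p, As))
lam = np.max(np.linalg.eigvals(A_bar).real)    # largest real part among eigenvalues of A_bar

# Gap between the exact spectral radius and the first-order estimate 1 + 2*lam*alpha.
gaps = {}
for alpha in (1e-3, 1e-2, 1e-1):
    exact = spectral_radius(H22_iid(As, p, alpha))
    gaps[alpha] = abs(exact - (1.0 + 2.0 * lam * alpha))
```

For this example the gap shrinks much faster than `alpha` itself as the step size decreases, in line with the $O(\alpha^2)$ remainder; for larger `alpha` the quadratic term is no longer negligible.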

#### Limiting behavior.

Obviously, $\mu^k$ converges to $0$ at the rate specified by $\sigma(H_{11})$ due to the relation $\mu^k = (H_{11})^k \mu^0$. Applying Proposition 1 and making use of the block structure of the state matrix in (11), one can show $\mathrm{vec}(\mathcal{Q}^\infty) = \alpha^2 (I - H_{22})^{-1} \sum_{i=1}^n p_i (b_i \otimes b_i)$, which leads to the result in (14). We can clearly see $\mathcal{Q}^\infty = O(\alpha)$ and can be controlled by decreasing $\alpha$. Notice the convergence rate of $\mathcal{Q}^k$ to its limit is specified by $\sigma(H_{22})$. Hence one can increase the convergence rate at the price of increasing the steady state error. This is consistent with the finite sample bound in the literature bhandari2018finite ; srikant2019finite . When $\alpha$ is large, we need to keep the quadratic term $\alpha^2 \sum_{i=1}^n p_i (A_i \otimes A_i)$. Therefore, our theory does capture the behaviors of TD learning for both small and large learning rates, and complements the existing finite sample bounds. Specifically, (14) gives an exact formula describing the convergence behavior of TD learning even for large $\alpha$.

## 5 Analysis under the Markov assumption

Now we can analyze the behaviors of TD learning under the general assumption that $\{z^k\}$ is a Markov chain. Recall that the augmented mean vector $q^k$ and the augmented covariance matrix $Q^k$ have been defined in Section 2.3. We can directly obtain the following result.

###### Theorem 2.

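Consider the jump system model (9) with $H_i = I + \alpha A_i$, $G_i = \alpha b_i$, and $y^k = 1$ for all $k$. Suppose $\{z^k\}$ is a Markov chain sampled from $\mathcal{N}$ using the transition matrix $P$. In addition, define $p_i^k = \mathbb{P}(z^k = i)$ and set the augmented vector $p^k = \begin{bmatrix} p_1^k & \cdots & p_n^k \end{bmatrix}^T$. Clearly $p^{k+1} = P^T p^k$. Further denote the augmented vectors as $b = \begin{bmatrix} b_1^T & \cdots & b_n^T \end{bmatrix}^T$, $\hat{B} = \begin{bmatrix} (b_1 \otimes b_1)^T & \cdots & (b_n \otimes b_n)^T \end{bmatrix}^T$, and set $S(b_i, A_i) = b_i \otimes (I_{n_\xi} + \alpha A_i) + (I_{n_\xi} + \alpha A_i) \otimes b_i$.

1. Then $q^k$ and $Q^k$ are governed by the following state-space model:

$$\begin{bmatrix} q^{k+1} \\ \mathrm{vec}(Q^{k+1}) \end{bmatrix} = \begin{bmatrix} H_{11} & 0 \\ H_{21} & H_{22} \end{bmatrix} \begin{bmatrix} q^k \\ \mathrm{vec}(Q^k) \end{bmatrix} + \begin{bmatrix} \alpha\left((P^T \mathrm{diag}(p_i^k)) \otimes I_{n_\xi}\right) b \\ \alpha^2\left((P^T \mathrm{diag}(p_i^k)) \otimes I_{n_\xi^2}\right) \hat{B} \end{bmatrix}, \tag{17}$$

where $H_{11}$, $H_{21}$, and $H_{22}$ are given by

$$H_{11} = (P^T \otimes I_{n_\xi})\,\mathrm{diag}(I_{n_\xi} + \alpha A_i), \quad H_{21} = \alpha \begin{bmatrix} p_{11} S(b_1, A_1) & \cdots & p_{n1} S(b_n, A_n) \\ \vdots & \ddots & \vdots \\ p_{1n} S(b_1, A_1) & \cdots & p_{nn} S(b_n, A_n) \end{bmatrix}, \quad H_{22} = (P^T \otimes I_{n_\xi^2})\,\mathrm{diag}\left((I_{n_\xi} + \alpha A_i) \otimes (I_{n_\xi} + \alpha A_i)\right). \tag{18}$$

In addition, the following closed-form solution holds for any $k$

$$q^k = (H_{11})^k q^0 + \alpha \sum_{t=0}^{k-1} (H_{11})^{k-1-t}\left((P^T \mathrm{diag}(p_i^t)) \otimes I_{n_\xi}\right) b, \qquad \mathrm{vec}(Q^k) = (H_{22})^k \mathrm{vec}(Q^0) + \sum_{t=0}^{k-1} (H_{22})^{k-1-t}\left(H_{21} q^t + \alpha^2\left((P^T \mathrm{diag}(p_i^t)) \otimes I_{n_\xi^2}\right) \hat{B}\right), \tag{19}$$

where $H_{11}$, $H_{21}$, and $H_{22}$ are determined by (18).

2. Suppose $\sigma(H_{22}) < 1$. We set $N = n n_\xi^2$. If we also assume $p^k \to p^\infty$ where $p^\infty$ is a stationary distribution for $\{z^k\}$, then we have

$$q^\infty = \lim_{k \to \infty} q^k = \alpha (I - H_{11})^{-1}\left((P^T \mathrm{diag}(p_i^\infty)) \otimes I_{n_\xi}\right) b, \qquad \mathrm{vec}(Q^\infty) = \lim_{k \to \infty} \mathrm{vec}(Q^k) = \alpha^2 (I_N - H_{22})^{-1}\left(\alpha^{-2} H_{21} q^\infty + \left((P^T \mathrm{diag}(p_i^\infty)) \otimes I_{n_\xi^2}\right) \hat{B}\right). \tag{20}$$

3. If we further assume the geometric ergodicity, i.e. $\|p^k - p^\infty\| \le C \tilde{\rho}^k$, then we have

$$\left\| \begin{bmatrix} q^k \\ \mathrm{vec}(Q^k) \end{bmatrix} - \begin{bmatrix} q^\infty \\ \mathrm{vec}(Q^\infty) \end{bmatrix} \right\| \le C_0 \left(\max\{\sigma(H_{11}) + \varepsilon,\ \sigma(H_{22}) + \varepsilon,\ \tilde{\rho}\}\right)^k, \tag{21}$$

where $C_0$ is some constant and $\varepsilon$ is an arbitrarily small positive number.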

###### Proof.

A detailed proof is presented in the supplementary material. We present a proof sketch here. Notice (17) is a direct consequence of (6) and (7) (which are special cases of Proposition 3.35 in costa2006 ). Specifically, it is straightforward to verify the following equations using the Markov assumption

$$q_j^{k+1} = \sum_{i=1}^n p_{ij}\left((I + \alpha A_i) q_i^k + \alpha p_i^k b_i\right), \tag{22}$$

$$Q_j^{k+1} = \sum_{i=1}^n p_{ij}\left((I + \alpha A_i) Q_i^k (I + \alpha A_i)^T + 2\alpha\,\mathrm{sym}\left((I + \alpha A_i) q_i^k b_i^T\right) + \alpha^2 p_i^k b_i b_i^T\right). \tag{23}$$

Then we can apply the basic property of the vectorization operation to obtain (17). Applying (2) to iterate (17) directly leads to (19). Or we can also use the block structure of the state matrix in (17) to rewrite the update rule for $\mathrm{vec}(Q^k)$ as

$$\mathrm{vec}(Q^{k+1}) = H_{22}\,\mathrm{vec}(Q^k) + H_{21} q^k + \alpha^2\left((P^T \mathrm{diag}(p_i^k)) \otimes I_{n_\xi^2}\right) \hat{B}.$$

Treating $H_{21} q^k + \alpha^2\left((P^T \mathrm{diag}(p_i^k)) \otimes I_{n_\xi^2}\right) \hat{B}$ as the input to the system, we will also be able to prove (19). Finally, we can apply Facts 2 and 3 in Proposition 1 to prove Statements 2 and 3 in this theorem. ∎
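The Markov-case recursion can be checked against a brute-force expectation over all mode paths of a small chain. The sketch below builds block matrices in the pattern $H_{11} = (P^T \otimes I)\,\mathrm{diag}(I + \alpha A_i)$, $H_{22} = (P^T \otimes I)\,\mathrm{diag}((I+\alpha A_i) \otimes (I+\alpha A_i))$, with $H_{21}$ assembled from $b_i \otimes (I + \alpha A_i) + (I + \alpha A_i) \otimes b_i$; the two-mode chain, the initial distribution, and all function names are hypothetical.

```python
import itertools
import numpy as np

def block_diag(blocks):
    """Place the given 2-D blocks on the diagonal of a zero matrix."""
    out = np.zeros((sum(b.shape[0] for b in blocks), sum(b.shape[1] for b in blocks)))
    r = c = 0
    for blk in blocks:
        out[r:r + blk.shape[0], c:c + blk.shape[1]] = blk
        r, c = r + blk.shape[0], c + blk.shape[1]
    return out

def markov_moments_lti(As, bs, P, p0, xi0, alpha, steps):
    """Propagate the augmented mean q^k and vec(Q^k) with the LTI recursion."""
    n, d = len(As), xi0.size
    Hs = [np.eye(d) + alpha * A for A in As]
    PT_d2 = np.kron(P.T, np.eye(d * d))
    H11 = np.kron(P.T, np.eye(d)) @ block_diag(Hs)
    H22 = PT_d2 @ block_diag([np.kron(H, H) for H in Hs])
    S = [np.kron(b[:, None], H) + np.kron(H, b[:, None]) for H, b in zip(Hs, bs)]
    H21 = alpha * PT_d2 @ block_diag(S)
    b_aug = np.concatenate(bs)
    B_hat = np.concatenate([np.kron(b, b) for b in bs])
    p = np.asarray(p0, dtype=float)
    q = np.concatenate([p[i] * xi0 for i in range(n)])
    Q = np.concatenate([p[i] * np.kron(xi0, xi0) for i in range(n)])
    for _ in range(steps):
        Pd = P.T @ np.diag(p)                       # P^T diag(p_i^k)
        q_next = H11 @ q + alpha * np.kron(Pd, np.eye(d)) @ b_aug
        Q = H21 @ q + H22 @ Q + alpha**2 * np.kron(Pd, np.eye(d * d)) @ B_hat
        q, p = q_next, P.T @ p
    return q, Q

def markov_moments_bruteforce(As, bs, P, p0, xi0, alpha, steps):
    """Exact q_i^k and Q_i^k by enumerating every path (z^0, ..., z^k)."""
    n, d = len(As), xi0.size
    q, Q = np.zeros((n, d)), np.zeros((n, d, d))
    for path in itertools.product(range(n), repeat=steps + 1):
        prob, xi = p0[path[0]], xi0.copy()
        for z, z_next in zip(path[:-1], path[1:]):
            prob *= P[z, z_next]
            xi = xi + alpha * (As[z] @ xi + bs[z])
        q[path[-1]] += prob * xi
        Q[path[-1]] += prob * np.outer(xi, xi)
    return q.reshape(-1), np.concatenate([Qi.reshape(-1, order="F") for Qi in Q])

As = [np.array([[-1.0, 0.2], [0.0, -0.5]]), np.array([[-0.6, 0.0], [0.1, -1.2]])]
bs = [np.array([1.0, -2.0]), np.array([-1.0, 2.0])]
P = np.array([[0.9, 0.1], [0.2, 0.8]])
p0, xi0 = [0.3, 0.7], np.array([1.0, -1.0])

q_lti, Q_lti = markov_moments_lti(As, bs, P, p0, xi0, alpha=0.1, steps=5)
q_bf, Q_bf = markov_moments_bruteforce(As, bs, P, p0, xi0, alpha=0.1, steps=5)
```

The brute-force routine grows exponentially in the number of steps, so it only serves as a correctness check on short horizons; the recursion itself has a fixed per-step cost.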

Now we discuss various implications of Theorem 2.

#### Stability condition and eigenvalue perturbation analysis.

Similar to the IID case, the needed stability condition is $\sigma(H_{22}) < 1$. Now $H_{22}$ becomes a much larger matrix depending on the transition matrix $P$. An important question is how to choose $\alpha$ such that $\sigma(H_{22}) < 1$ for some given $\{A_i\}$, $\{b_i\}$, $P$, and $p^0$. Again, we perform an eigenvalue perturbation analysis for the matrix $H_{22}$. This case is quite subtle due to the fact that we are no longer perturbing an identity matrix. We are perturbing the matrix $P^T \otimes I_{n_\xi^2}$ and the eigenvalues here are not simple. Under the ergodicity assumption, the largest eigenvalue for $P^T \otimes I_{n_\xi^2}$ (which is $1$) is semisimple. Hence we can directly apply the results in Section II of kato2013perturbation or Theorem 2.1 in moro1997lidskii to show

$$\lambda_{\max}(H_{22}) = 1 + 2\lambda_{\max\,\mathrm{real}}(\bar{A})\alpha + O(\alpha^2), \tag{24}$$

where $\bar{A} = \sum_{i=1}^n p_i^\infty A_i$ and $p^\infty$ is the unique stationary distribution of the Markov chain under the ergodicity assumption. Then we still have $\sigma(H_{22}) \approx 1 + 2\,\mathrm{real}\left(\lambda_{\max\,\mathrm{real}}(\bar{A})\right)\alpha + O(\alpha^2)$. Therefore, as long as $\bar{A}$ is Hurwitz, there exists sufficiently small $\alpha$ such that $\sigma(H_{22}) < 1$. This is consistent with Assumption 3 in srikant2019finite . To understand the details of our perturbation argument, we refer the readers to the remark placed after Theorem 2.1 in moro1997lidskii . Notice we have

$$H_{22} = P^T \otimes I_{n_\xi^2} + \alpha (P^T \otimes I_{n_\xi^2})\,\mathrm{diag}(A_i \otimes I + I \otimes A_i) + O(\alpha^2).$$

The largest eigenvalue of $P^T \otimes I_{n_\xi^2}$ is semisimple due to the ergodicity assumption. Then the perturbation result directly follows as a consequence of Theorem 2.1 in moro1997lidskii . More explanations are also provided in the supplementary material.

#### Limiting behavior.

Assume the Markov chain is ergodic, and then . Notice it is natural to have the assumption and hence we also have the assumption . It is interesting to notice that in general but . When is small, we can apply the Laurent series trick in avrachenkov2013analytic ; gonzalez2015laurent to show that under the ergodicity assumption. The difficulty here is that is a singular matrix and hence does not have a Taylor series around . Therefore, we need to apply some recent matrix inverse perturbation result to perform a Laurent expansion of . From the ergodicity assumption, we know the singularity order of is just . Applying Theorem 1 in avrachenkov2013analytic , we can obtain the Laurent expansion of and show . Consequently, we have and can be controlled by decreasing . This is consistent with the finite sample bound in srikant2019finite .

#### Effects of mixing rate of $z^k$ on the overall convergence rate.

Clearly, the convergence rates of $q^k$ and $Q^k$ also depend on the initial distribution $p^0$ and the mixing rate of the underlying Markov jump parameter $z^k$ (which is denoted as $\tilde{\rho}$). Statement 3 in Theorem 2 just states that the overall convergence rate now depends on the slower one between the mixing rate and the spectral radius of the state matrix in (17). If the initial distribution is the stationary distribution, i.e. $p^0 = p^\infty$, the input to the LTI dynamical system (17) is just a constant for all $k$ and then we will be able to obtain an exact formula similar to (14). However, for a general initial distribution $p^0$, the mixing rate matters more and may affect the overall convergence rate. It is also worth mentioning that $\tilde{\rho}$ is a property of the Markov chain while the spectral radius can be controlled by the learning rate $\alpha$. When $\alpha$ becomes smaller and smaller, the spectral radius eventually becomes the dominating term and the mixing rate does not affect the system dynamics any more. Overall, our results are consistent with the finite sample bounds in srikant2019finite for small $\alpha$, and provide some complementary perspectives for large $\alpha$ via the exact formulations. More discussions will be presented in the supplementary material.

## References

• [1] H. Abou-Kandil, G. Freiling, and G. Jank. On the solution of discrete-time Markovian jump linear quadratic control problems. Automatica, 31(5):765–768, 1995.
• [2] K. Avrachenkov and J. Lasserre. Analytic perturbation of generalized inverses. Linear Algebra and its Applications, 438(4):1793–1813, 2013.
• [3] N. Aybat, A. Fallah, M. Gurbuzbalaban, and A. Ozdaglar. Robust accelerated gradient method. arXiv preprint arXiv:1805.10579, 2018.
• [4] N. Aybat, A. Fallah, M. Gurbuzbalaban, and A. Ozdaglar. A universally optimal multistage accelerated stochastic gradient method. arXiv preprint arXiv:1901.08022, 2019.
• [5] D. Bertsekas and J. Tsitsiklis. Neuro-dynamic programming, volume 5. Athena Scientific Belmont, 1996.
• [6] J. Bhandari, D. Russo, and R. Singal. A finite time analysis of temporal difference learning with linear function approximation. arXiv preprint arXiv:1806.02450, 2018.
• [7] S. Bhatnagar, H. Prasad, and L. Prashanth. Stochastic recursive algorithms for optimization: simultaneous perturbation methods, volume 434. Springer, 2012.
• [8] V. Borkar. Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009.
• [9] V. Borkar and S. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 2000.
• [10] C. Chen. Linear system theory and design. Oxford University Press, Inc., 1998.
• [11] H. Chizeck, A. Willsky, and D. Castanon. Discrete-time Markovian-jump linear quadratic optimal control. International Journal of Control, 43(1):213–231, 1986.
• [12] O. Costa and M. Fragoso. Stability results for discrete-time linear systems with Markovian jumping parameters. Journal of Mathematical Analysis and Applications, 179(1):154–178, 1993.
• [13] O. Costa, M. Fragoso, and R. Marques. Discrete-time Markov jump linear systems. Springer Science & Business Media, 2006.
• [14] S. Cyrus, B. Hu, B. Van Scoy, and L. Lessard. A robust accelerated optimization algorithm for strongly convex functions. In 2018 Annual American Control Conference (ACC), pages 1376–1381, 2018.
• [15] G. Dalal, B. Szörényi, G. Thoppe, and S. Mannor. Finite sample analyses for TD(0) with function approximation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
• [16] C. Dann, G. Neumann, and J. Peters. Policy evaluation with temporal differences: A survey and comparison. The Journal of Machine Learning Research, 15(1):809–883, 2014.
• [17] N. Dhingra, S. Khong, and M. Jovanovic. The proximal augmented lagrangian method for nonsmooth composite optimization. IEEE Transactions on Automatic Control, 2018.
• [18] L. El Ghaoui and M. Rami. Robust state-feedback stabilization of jump linear systems via LMIs. International Journal of Robust and Nonlinear Control, 6(9-10):1015–1022, 1996.
• [19] Y. Fang and K. Loparo. Stochastic stability of jump linear systems. IEEE Transactions on Automatic Control, 47(7):1204–1208, 2002.