# Fast rates for online learning in Linearly Solvable Markov Decision Processes

We study the problem of online learning in a class of Markov decision processes known as linearly solvable MDPs. In the stationary version of this problem, a learner interacts with its environment by directly controlling the state transitions, attempting to balance a fixed state-dependent cost and a certain smooth cost penalizing extreme control inputs. In the current paper, we consider an online setting where the state costs may change arbitrarily between consecutive rounds, and the learner only observes the costs at the end of each respective round. We are interested in constructing algorithms for the learner that guarantee small regret against the best stationary control policy chosen in full knowledge of the cost sequence. Our main result is showing that the smoothness of the control cost enables the simple algorithm of following the leader to achieve a regret of order $\log^2 T$ after $T$ rounds, vastly improving on the best known regret bound of order $T^{3/4}$ for this setting.

## 1 Introduction

We consider the problem of online learning in Markov decision processes (MDPs) where a learner sequentially interacts with an environment by repeatedly taking actions that influence the future states of the environment while incurring some immediate costs. The goal of the learner is to choose its actions in a way that the accumulated costs are as small as possible. Several variants of this problem have been well-studied in the literature, primarily in the case where the costs are assumed to be independent and identically distributed (Sutton and Barto, 1998; Puterman, 1994; Bertsekas and Tsitsiklis, 1996; Szepesvári, 2010). In the current paper, we consider the case where the costs are generated by an arbitrary external process and the learner aims to minimize its total loss during the learning procedure, conforming to the learning paradigm known as online learning (Cesa-Bianchi and Lugosi, 2006; Shalev-Shwartz, 2012). In the online-learning framework, the performance of the learner is measured in terms of the regret, defined as the gap between the total costs incurred by the learner and the total costs of the best comparator chosen from a pre-specified class of strategies. In the case of online learning in MDPs, a natural class of strategies is the set of all state-feedback policies: several works studied minimizing regret against this class both in the stationary-cost (Bartlett and Tewari, 2009; Jaksch et al., 2010; Abbasi-Yadkori and Szepesvári, 2011) and the non-stochastic setting (Even-Dar et al., 2009; Yu et al., 2009; Neu et al., 2010, 2012; Zimin and Neu, 2013; Dick et al., 2014; Neu et al., 2014; Abbasi-Yadkori et al., 2014). In the non-stochastic setting, most works consider MDPs with unstructured, finite state spaces and guarantee that the regret increases no faster than $O(\sqrt{T})$ as the number of interaction rounds $T$ grows large. A notable exception is the work of Abbasi-Yadkori et al. (2014), who consider the special case of (continuous-state) linear-quadratic control with arbitrarily changing target states, and propose an algorithm that guarantees a regret bound of $O(\log^2 T)$.

In the present paper, we study another special class of MDPs that turns out to allow fast rates. Specifically, we consider the class of so-called linearly solvable MDPs (in short, LMDPs), first proposed and named so by Todorov (2006). This class takes its name from the special property that the Bellman optimality equations characterizing the optimal behavior policy take the form of a system of linear equations, which makes optimization remarkably straightforward in such problems. The continuous formulation (in both space and time) was discovered independently by Kappen (2005) and is known as path integral control. LMDPs have many interesting properties. For example, optimal control laws for LMDPs can be linearly combined to derive composite optimal control laws efficiently (Todorov, 2009). Also, the inverse optimal control problem in LMDPs can be expressed as a convex optimization problem (Dvijotham and Todorov, 2010). LMDPs generalize an existing duality between optimal control computation and Bayesian inference (Todorov, 2008). Indeed, the popular belief propagation algorithm used in dynamic probabilistic graphical models is equivalent to the power iteration method used to solve LMDPs (Kappen et al., 2012).

The LMDP framework has found applications in robotics (Matsubara et al., 2014; Ariki et al., 2016), crowdsourcing (Abbasi-Yadkori et al., 2015), and controlling the growth dynamics of complex networks (Thalmeier et al., 2017). The related path integral control framework of Kappen (2005) has been applied in several real-world tasks, including robot navigation (Kinjo et al., 2013), motor skill reinforcement learning (Theodorou et al., 2010; Rombokas et al., 2013; Gómez et al., 2014), aggressive car maneuvering (Williams et al., 2016) and autonomous flight of teams of quadrotors (Gómez et al., 2016).

In the present paper, we show that besides the aforementioned properties, the structure of LMDPs also enables constructing efficient online learning procedures with very low regret. In particular, we show that, under some mild assumptions on the structure of the LMDP, the (conceptually) simplest online learning strategy of following the leader guarantees a regret of order $\log^2 T$, vastly improving over the best known previous result by Guan, Raginsky, and Willett (2014), who prove a regret bound of order $T^{3/4+\epsilon}$ for arbitrarily small $\epsilon > 0$ under the same assumptions. Our approach is based on the observation that the optimal control law arising from the LMDP structure is a smooth function of the underlying cost function, enabling rapid learning without any regularization whatsoever.

The rest of the paper is organized as follows. Section 2 introduces the formalism of LMDPs and summarizes some basic facts that our technical content is going to rely on. Section 3 describes our online learning model. Our learning algorithm is described in Section 4 and analyzed in Section 5. Finally, we draw conclusions in Section 6.

#### Notation.

We will consider several real-valued functions over a finite state space $\mathcal{X}$, and we will often treat these functions as finite-dimensional (column) vectors endowed with the usual definitions of the $\ell_p$ norms. The set of probability distributions over $\mathcal{X}$ will be denoted as $\Delta(\mathcal{X})$. Indefinite sums with running variables $x$, $x'$ or $y$ are understood to run through all of $\mathcal{X}$.

## 2 Background on linearly solvable MDPs

This section serves as a quick introduction to the formalism of linearly solvable MDPs (LMDPs, Todorov (2006)). These decision processes are defined by the tuple $\{\mathcal{X}, P, c\}$, where $\mathcal{X}$ is a finite set of states, $P$ is a transition kernel called the passive dynamics (with $P(x'|x)$ being the probability of the process moving to state $x'$ given the previous state $x$) and $c: \mathcal{X} \to [0,1]$ is the state-cost function. Our Markov decision process is a sequential decision-making problem where the initial state $X_1$ is drawn from some distribution $\mu_1$, and the following steps are repeated for an indefinite number of rounds $t = 1, 2, \dots$:

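1. The learner chooses a transition kernel $Q_t$ satisfying $\operatorname{supp} Q_t(\cdot|x) \subseteq \operatorname{supp} P(\cdot|x)$ for all $x \in \mathcal{X}$.

2. The learner observes $X_t$ and draws the next state $X_{t+1} \sim Q_t(\cdot|X_t)$.

3. The learner incurs the cost

$$\ell(X_t, Q_t) = c(X_t) + D\big(Q_t(\cdot|X_t) \,\big\|\, P(\cdot|X_t)\big),$$

where $D(p\|q)$ is the relative entropy (or Kullback-Leibler divergence) between the probability distributions $p$ and $q$, defined as $D(p\|q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$.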
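The per-round cost in step 3 can be illustrated with a minimal Python sketch. The two-state kernels `P` and `Q`, the cost vector `c`, and the function names are hypothetical and chosen only for illustration; kernels are stored as lists of rows, so `Q[x]` stands for $Q(\cdot|x)$.

```python
import math

def kl_divergence(q, p):
    """Relative entropy D(q || p) between two distributions given as lists.

    Terms with q[i] == 0 contribute zero; q must put no mass where p[i] == 0.
    """
    total = 0.0
    for qi, pi in zip(q, p):
        if qi > 0.0:
            if pi == 0.0:
                raise ValueError("q must be supported on the support of p")
            total += qi * math.log(qi / pi)
    return total

def round_cost(x, Q, P, c):
    """Cost c(x) + D(Q(.|x) || P(.|x)) incurred in state x."""
    return c[x] + kl_divergence(Q[x], P[x])

# Hypothetical two-state example: P is the passive dynamics, Q a controlled kernel.
P = [[0.5, 0.5], [0.5, 0.5]]
Q = [[0.9, 0.1], [0.5, 0.5]]
c = [0.2, 1.0]

cost_passive = round_cost(1, Q, P, c)     # Q(.|1) equals P(.|1): only the state cost is paid
cost_controlled = round_cost(0, Q, P, c)  # state cost plus a positive control cost
```

In state 1 the learner follows the passive dynamics and pays only the state cost, while in state 0 the deviation from `P` adds a strictly positive relative-entropy term.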

The state-cost function should be thought of as specifying the objective for the learner in the MDP, while the relative-entropy term governs the costs associated with significant deviations from the passive dynamics. Accordingly, we refer to this component as the control cost. A central question in the theory of Markov decision problems is finding a behavior policy that minimizes (some notion of) the long-term total costs. In this paper, we consider the problem of minimizing the long-term average cost-per-stage $\lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[\ell(X_t, Q_t)\right]$. Assuming that the passive dynamics is aperiodic and irreducible, this limit is minimized by a stationary policy (see, e.g., Puterman (1994, Sec. 8.4.4)). Below, we provide two distinct derivations for the optimal stationary policy that minimizes the average costs under this assumption.

### 2.1 The Bellman equations

We first take an approach rooted in dynamic programming (Bertsekas, 2007), following Todorov (2006). Under our assumptions, the optimal stationary policy minimizing the average cost is given by finding the solution to the Bellman optimality equation

$$v(x) = c(x) - \lambda + \min_{q \in \Delta(\mathcal{X})} \Big\{ D\big(q \,\big\|\, P(\cdot|x)\big) + \sum_{x'} q(x') v(x') \Big\} \tag{1}$$

for all $x \in \mathcal{X}$, where $v$ is called the optimal value function and $\lambda$ is the average cost associated with the optimal policy. (This solution is guaranteed to be unique up to a constant shift of the values: if $v$ is a solution, then so is $v + a$ for any $a \in \mathbb{R}$. Unless stated otherwise, we will assume that $v$ is such that $v(x_0) = 0$ holds for a fixed state $x_0$.) Linearly solvable MDPs get their name from the fact that the Bellman optimality equation can be rewritten in a simple linear form. To see this, observe that by elementary calculations involving Lagrange multipliers, we have

$$\min_{q \in \Delta(\mathcal{X})} \Big\{ D\big(q \,\big\|\, P(\cdot|x)\big) + \sum_{x'} q(x') v(x') \Big\} = -\log \sum_{x'} P(x'|x) e^{-v(x')},$$

so, after defining the exponentiated value function $z(x) = e^{-v(x)}$ for all $x$, plugging into Equation (1) and exponentiating both sides gives

$$z(x) = e^{\lambda - c(x)} \sum_{x'} P(x'|x) z(x'). \tag{2}$$
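The variational identity used in this derivation can be checked numerically. The sketch below is only an illustration: the passive transition row `p_row` and the value function `v` are hypothetical, and the closed-form minimizer $q(x') \propto P(x'|x) e^{-v(x')}$ is the standard outcome of the Lagrange-multiplier calculation mentioned in the text.

```python
import math

def kl(q, p):
    """Relative entropy D(q || p); zero-mass entries of q contribute nothing."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

def objective(q, p_row, v):
    """D(q || P(.|x)) + sum_x' q(x') v(x'), the quantity being minimized."""
    return kl(q, p_row) + sum(qi * vi for qi, vi in zip(q, v))

def minimizer(p_row, v):
    """Closed-form minimizer q(x') proportional to P(x'|x) * exp(-v(x')).

    Also returns the minimum value -log sum_x' P(x'|x) exp(-v(x')).
    """
    weights = [pi * math.exp(-vi) for pi, vi in zip(p_row, v)]
    norm = sum(weights)
    return [w / norm for w in weights], -math.log(norm)

p_row = [0.2, 0.3, 0.5]   # hypothetical passive transition row P(.|x)
v = [0.0, 1.0, -0.5]      # hypothetical value function
q_star, closed_form = minimizer(p_row, v)
```

Evaluating `objective` at `q_star` reproduces `closed_form`, and any other distribution on the same support, such as `p_row` itself, gives a value that is at least as large.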

Rewriting the above set of equations in matrix form, we obtain the linear equations

$$e^{-\lambda} z = G P z,$$

where $G$ is a diagonal matrix with $G_{xx} = e^{-c(x)}$. By the Perron-Frobenius theorem (see, e.g., Chapter 8 of Meyer (2000)) concerning positive matrices, the above system of linear equations has a unique solution $z$ satisfying $z(x) \ge 0$ for all $x$ (as in the case of the Bellman equations, this solution is unique up to a scaling of $z$), and this eigenvector corresponds to the largest eigenvalue $e^{-\lambda}$ of $GP$. Since the solution of the Bellman optimality equation (1) is unique (up to a constant shift corresponding to a constant scaling of $z$), we obtain that $\lambda$ is the average cost of the optimal policy. In summary, the Bellman optimality equation takes the form of a Perron–Frobenius eigenvalue problem, which can be efficiently solved by iterative methods such as the well-known power method for finding top eigenvectors. Finally, getting back to the basic form (1) of the Bellman equations, we can conclude after simple calculations that the optimal policy can be computed for all $x, x'$ as

$$Q(x'|x) = \frac{P(x'|x) z(x')}{\sum_y P(y|x) z(y)}.$$
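As an illustration of the procedure above, here is a minimal power-iteration sketch in plain Python. The function name `solve_lmdp`, the fixed iteration count, the sum-to-one normalization of $z$, and the three-state example are assumptions made for this sketch, not details taken from the paper.

```python
import math

def solve_lmdp(P, c, iters=2000):
    """Power iteration for e^{-lambda} z = G P z with G = diag(exp(-c)).

    Returns (lam, z, Q): the optimal average cost, the exponentiated value
    function normalized to sum to one, and the optimal transition kernel.
    """
    n = len(P)
    z = [1.0 / n] * n
    gamma = 1.0
    for _ in range(iters):
        new_z = [math.exp(-c[x]) * sum(P[x][y] * z[y] for y in range(n))
                 for x in range(n)]
        gamma = sum(new_z)            # eigenvalue estimate, valid because sum(z) == 1
        z = [w / gamma for w in new_z]
    lam = -math.log(gamma)
    Q = []
    for x in range(n):
        norm = sum(P[x][y] * z[y] for y in range(n))
        Q.append([P[x][y] * z[y] / norm for y in range(n)])
    return lam, z, Q

# Hypothetical three-state passive dynamics with state costs in [0, 1].
P = [[0.7, 0.2, 0.1],
     [0.3, 0.4, 0.3],
     [0.2, 0.3, 0.5]]
c = [0.0, 0.5, 1.0]
lam, z, Q = solve_lmdp(P, c)
```

With a constant cost vector the control has nothing to gain from deviating, so the sketch returns the passive dynamics itself and an average cost equal to that constant, which makes a convenient sanity check.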

### 2.2 The convex optimization view

We also provide an alternative (and, to our knowledge, as yet unpublished) view of the optimal control problem in LMDPs, based on convex optimization. For the purposes of this paper, we find this form to be more insightful, as it enables us to study our learning problem in the framework of online convex optimization (Hazan, 2011, 2016; Shalev-Shwartz, 2012). To derive this form, observe that under our assumptions, every feasible policy $Q$ induces a stationary distribution $\mu_Q$ over the state space satisfying $\mu_Q^\top Q = \mu_Q^\top$. This stationary distribution and the policy together induce a distribution $\pi_Q$ over $\mathcal{X} \times \mathcal{X}$ defined for all $x, x'$ as $\pi_Q(x, x') = \mu_Q(x) Q(x'|x)$. We will call $\pi_Q$ the stationary transition measure induced by $Q$, which is motivated by the observation that $\pi_Q(x, x')$ corresponds to the probability of observing the transition $x \to x'$ in the equilibrium state: $\pi_Q(x, x') = \lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{P}\left[X_t = x, X_{t+1} = x'\right]$. Notice that, with this notation, the average cost-per-stage of policy $Q$ can be rewritten in the form

$$\begin{aligned}
\lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[\ell(X_t, Q)\right]
&= \sum_x \mu_Q(x) \Big( c(x) + D\big(Q(\cdot|x) \,\big\|\, P(\cdot|x)\big) \Big) \\
&= \sum_{x,x'} \pi_Q(x,x') \left( c(x) + \log \frac{\pi_Q(x,x')}{P(x'|x) \sum_y \pi_Q(x,y)} \right) \\
&= \sum_{x,x'} \pi_Q(x,x') \log \frac{\pi_Q(x,x')}{\sum_y \pi_Q(x,y)} + \sum_{x,x'} \pi_Q(x,x') \big( c(x) - \log P(x'|x) \big).
\end{aligned}$$

The first term in the final expression above is the negative conditional entropy of $X'$ relative to $X$, where $(X, X')$ is a pair of random states drawn from $\pi_Q$. Since the negative conditional entropy is convex in $\pi_Q$ (for a proof, see Appendix A.1) and the second term in the expression is linear in $\pi_Q$, we can see that the average cost-per-stage is a convex function of $\pi_Q$. This suggests that we can view the optimal control problem as having to find a feasible stationary transition measure that minimizes the expected costs. In short, defining

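$$f(\pi; c) = \sum_{x,x'} \pi(x,x') \left( c(x) + \log \frac{\pi(x,x')}{P(x'|x) \sum_y \pi(x,y)} \right) \tag{3}$$

and $\Delta(\mathcal{M})$ as the (convex) set of feasible stationary transition measures satisfying

$$\begin{aligned}
\sum_{x'} \pi(x,x') &= \sum_{x''} \pi(x'',x) && (\forall x), \\
\sum_{x,x'} \pi(x,x') &= 1, \\
\pi(x,x') &\ge 0 && (\forall x, x'), \\
\pi(x,x') &= 0 && (\forall x, x' : P(x'|x) = 0),
\end{aligned} \tag{4}$$

the optimization problem can be succinctly written as $\min_{\pi \in \Delta(\mathcal{M})} f(\pi; c)$. In Appendix A.2, we provide a derivation of the optimal control given by Equation (2) starting from the formulation given above. We also remark that our analysis will heavily rely on the fact that $f(\pi; c)$ is affine in $c$.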
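The objective of Equation (3) and its affine dependence on the state cost can be illustrated with a short Python sketch. The function name `avg_cost`, the two-state passive dynamics, and the measures `pi` and `pi_passive` are hypothetical; both measures were chosen by hand so that their row and column sums agree, as the first constraint in (4) requires.

```python
import math

def avg_cost(pi, P, c):
    """f(pi; c) from Equation (3) for a stationary transition measure pi.

    pi[x][y] is the equilibrium probability of the transition x -> y.
    """
    total = 0.0
    for x, row in enumerate(pi):
        mu_x = sum(row)                      # stationary mass of state x
        for y, mass in enumerate(row):
            if mass > 0.0:
                total += mass * (c[x] + math.log(mass / (P[x][y] * mu_x)))
    return total

P = [[0.5, 0.5], [0.5, 0.5]]                 # hypothetical passive dynamics
pi_passive = [[0.25, 0.25], [0.25, 0.25]]    # measure induced by following P
pi = [[0.4, 0.1], [0.1, 0.4]]                # a "stickier" feasible measure

c1 = [0.0, 1.0]
c2 = [1.0, 0.5]
c_mix = [0.5 * (a + b) for a, b in zip(c1, c2)]

f_mix = avg_cost(pi, P, c_mix)
f_avg = 0.5 * (avg_cost(pi, P, c1) + avg_cost(pi, P, c2))
```

Here `f_mix` and `f_avg` coincide up to rounding, which is the affine property that later lets a sum of per-round losses be replaced by a single loss evaluated at the average cost.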

## 3 Online learning in linearly solvable MDPs

We now present the precise learning setting that we consider in the present paper. We will study an online learning scheme where for each round $t = 1, 2, \dots, T$, the following steps are repeated:

1. The learner chooses a transition kernel $Q_t$ satisfying $\operatorname{supp} Q_t(\cdot|x) \subseteq \operatorname{supp} P(\cdot|x)$ for all $x \in \mathcal{X}$.

2. The learner observes $X_t$ and draws the next state $X_{t+1} \sim Q_t(\cdot|X_t)$.

3. Obliviously to the learner’s choice, the environment chooses the state-cost function $c_t: \mathcal{X} \to [0,1]$.

4. The learner incurs the cost

$$\ell_t(X_t, Q_t) = c_t(X_t) + D\big(Q_t(\cdot|X_t) \,\big\|\, P(\cdot|X_t)\big).$$

5. The environment reveals the state-cost function $c_t$.

The key change from the stationary setting described in the previous section is that the state-cost function now may change arbitrarily between rounds, and the learner is only allowed to observe the costs after it has made its decision. We stress that we assume that the learner fully knows the passive dynamics, so the only difficulty comes from having to deal with the changing costs. As usual in the online-learning literature, our goal is to do nearly as well as the best stationary policy chosen in hindsight after observing the entire sequence of cost functions. To define our precise performance measure, we first define the total expected cost of a policy $Q$ as

$$L_T(Q) = \mathbb{E}\left[ \sum_{t=1}^{T} \ell_t(X'_t, Q) \right],$$

where the state trajectory $X'_t$ is generated sequentially as $X'_{t+1} \sim Q(\cdot|X'_t)$ and the expectation integrates over the randomness of the transitions. Having this definition in place, we can specify the best stationary policy $Q^*_T = \arg\min_Q L_T(Q)$ (the existence of the minimum is warranted by the fact that $L_T$ is a continuous function bounded from below on its compact domain) and define our performance measure as the (total expected) regret against $Q^*_T$:

$$R_T = \mathbb{E}\left[ \sum_{t=1}^{T} \ell_t(X_t, Q_t) \right] - L_T(Q^*_T),$$

where the expectation integrates over both the randomness of the state transitions and the potential randomization used by the learning algorithm. Having access to this definition, we can now formally define the goal of the learner as having to come up with a sequence of policies that guarantee that the total regret grows sublinearly, that is, that the average per-round regret asymptotically converges to zero.

For our analysis, it will be useful to define an idealized version of the above online optimization problem, where the learner is allowed to immediately switch between the stationary distributions of the chosen policies. By making use of the convex-optimization view given in Section 2.2, we define an auxiliary online convex optimization (or, in short, OCO, see, e.g., Hazan, 2011; Shalev-Shwartz, 2012) problem called the idealized OCO problem where in each round $t = 1, 2, \dots, T$, the following steps are repeated:

1. The learner chooses the stationary transition measure $\pi_t \in \Delta(\mathcal{M})$.

2. Obliviously to the learner’s choice, the environment chooses the loss function $\tilde{\ell}_t = f(\cdot\,; c_t)$.

3. The learner incurs a loss of $\tilde{\ell}_t(\pi_t)$.

4. The environment reveals the loss function $\tilde{\ell}_t$.

The performance of the learner in this setting is measured by the idealized regret

$$\overline{R}_T = \sum_{t=1}^{T} \tilde{\ell}_t(\pi_t) - \min_{\pi \in \Delta(\mathcal{M})} \sum_{t=1}^{T} \tilde{\ell}_t(\pi).$$

Throughout the paper, we will consider oblivious environments that choose the sequence of state-cost functions without taking into account the states visited by the learner. This assumption will enable us to simultaneously reason about the expected costs under any sequence of state distributions, and thus to make a connection between the idealized regret $\overline{R}_T$ and the true regret $R_T$. This technique was first used by Even-Dar et al. (2009) and was shown to be essentially inevitable by Yu et al. (2009): as discussed in their Section 3.1, no learning algorithm can avoid linear regret if the environment is not oblivious.

## 4 Algorithm and main result

In this section, we propose a simple algorithm for online learning in LMDPs based on the “follow-the-leader” (FTL) strategy. On a high level, the idea of this algorithm is greedily betting on the policy that seems to have been optimal for the total costs observed so far. While this strategy is known to fail catastrophically in several simple learning problems (see, e.g., Cesa-Bianchi and Lugosi 2006), it is known to perform well in several important scenarios such as sequential prediction under the logarithmic loss (Merhav and Feder, 1992) or prediction with expert advice under bounded losses, given that losses are stationary (Kotłowski, 2016) and often serves as a strong benchmark strategy (de Rooij et al., 2014; Sani et al., 2014). In our learning problem, following the leader is a very natural choice of algorithm, as the convex formulation of Section 2.2 suggests that we can effectively build on the analysis of Follow-the-Regularized-Leader-type algorithms without having to explicitly regularize the objective.

In precise terms, our algorithm computes the sequence of policies by running FTL in the idealized setting: in round $t$, the algorithm chooses the stationary transition measure

$$\begin{aligned}
\pi_t &= \arg\min_{\pi \in \Delta(\mathcal{M})} \sum_{s=1}^{t-1} \tilde{\ell}_s(\pi)
= \arg\min_{\pi \in \Delta(\mathcal{M})} \sum_{s=1}^{t-1} f(\pi; c_s) \\
&= \arg\min_{\pi \in \Delta(\mathcal{M})} (t-1) \cdot f\Big(\pi; \frac{1}{t-1} \sum_{s=1}^{t-1} c_s\Big)
= \arg\min_{\pi \in \Delta(\mathcal{M})} f(\pi; \bar{c}_t),
\end{aligned}$$

where the third equality uses the fact that $f$ is affine in its second argument and the last step introduces the average state-cost function $\bar{c}_t = \frac{1}{t-1} \sum_{s=1}^{t-1} c_s$. This form implies that $\pi_t$ can be computed as the optimal control for the state-cost function $\bar{c}_t$, which can be done by following the procedure described in Section 2.1. Precisely, we define the diagonal matrix $G_t$ with its $x$-th diagonal element $e^{-\bar{c}_t(x)}$, let $\gamma_t$ be the largest eigenvalue of $G_t P$ and $z_t$ be the corresponding (unit-norm) right eigenvector. Also, let $v_t = -\log z_t$ and $\lambda_t = -\log \gamma_t$, and note that $\lambda_t = f(\pi_t; \bar{c}_t)$ is the optimal average cost-per-stage given the cost function $\bar{c}_t$. Finally, we define the policy used in round $t$ as

$$Q_t(x'|x) = \frac{P(x'|x) z_t(x')}{\sum_y P(y|x) z_t(y)} \tag{5}$$

for all $x$ and $x'$. We denote the induced stationary distribution by $\mu_t$. The algorithm is presented as Algorithm 1.
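The computation of the FTL policies described above can be sketched in plain Python as follows. The function names, the fixed number of power iterations, and the two-state example are assumptions of this sketch. In particular, the average cost is undefined in the first round, and the sketch simply falls back to the passive dynamics there, which is a choice made here and not something specified in the text.

```python
import math

def optimal_policy(P, c, iters=1000):
    """Optimal kernel for state costs c, via power iteration on diag(exp(-c)) P."""
    n = len(P)
    z = [1.0 / n] * n
    for _ in range(iters):
        z = [math.exp(-c[x]) * sum(P[x][y] * z[y] for y in range(n))
             for x in range(n)]
        s = sum(z)
        z = [w / s for w in z]
    Q = []
    for x in range(n):
        norm = sum(P[x][y] * z[y] for y in range(n))
        Q.append([P[x][y] * z[y] / norm for y in range(n)])
    return Q

def follow_the_leader(P, cost_sequence):
    """Return the FTL policies Q_1, ..., Q_T for a sequence of state costs.

    Q_t depends only on c_1, ..., c_{t-1}. In round 1 no costs have been seen,
    so this sketch uses the passive dynamics (an assumption).
    """
    n = len(P)
    cost_sum = [0.0] * n
    policies = []
    for t, c_t in enumerate(cost_sequence, start=1):
        if t == 1:
            policies.append([row[:] for row in P])
        else:
            c_bar = [s / (t - 1) for s in cost_sum]    # average of c_1, ..., c_{t-1}
            policies.append(optimal_policy(P, c_bar))
        cost_sum = [s + ci for s, ci in zip(cost_sum, c_t)]  # c_t is revealed after acting
    return policies

P = [[0.5, 0.5], [0.5, 0.5]]                   # hypothetical passive dynamics
costs = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]   # hypothetical cost sequence
policies = follow_the_leader(P, costs)
```

In this example the second and third policies coincide, since both are computed from the same average cost, and both move probability mass toward the cheaper state 0.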

Now we present our main result. First, we state two key assumptions about the underlying passive dynamics; both of these assumptions are also made by Guan et al. (2014).

###### Assumption 1

The passive dynamics $P$ is irreducible and aperiodic. In particular, there exists a natural number $H > 0$ such that $(P^n)(y|x) > 0$ for all $n \ge H$ and all $x, y \in \mathcal{X}$. We will refer to $H$ as the (worst-case) hitting time.

###### Assumption 2

The passive dynamics $P$ is ergodic in the sense that its Markov–Dobrushin ergodicity coefficient is strictly less than $1$:

$$\alpha(P) = \max_{x,y \in \mathcal{X}} \left\| P(\cdot|x) - P(\cdot|y) \right\|_1 < 1.$$

A standard consequence (see, e.g., Seneta 2006) of Assumption 2 is that the passive dynamics mixes quickly: for any distributions $\mu, \mu' \in \Delta(\mathcal{X})$, we have

$$\left\| (\mu - \mu')^\top P \right\|_1 \le \alpha(P) \left\| \mu - \mu' \right\|_1.$$
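The contraction property above can be checked numerically on a small example. The sketch below follows the definition of $\alpha(P)$ exactly as displayed in Assumption 2 (the maximal $\ell_1$ distance between rows, with no factor of one half); the kernel `P` and the two starting distributions are hypothetical.

```python
def dobrushin(P):
    """alpha(P): maximum over state pairs of the L1 distance between rows of P."""
    return max(sum(abs(a - b) for a, b in zip(rx, ry)) for rx in P for ry in P)

def push_forward(mu, P):
    """Row-vector product mu^T P: the state distribution after one step."""
    n = len(P)
    return [sum(mu[x] * P[x][y] for x in range(n)) for y in range(n)]

def l1(u, w):
    """L1 distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, w))

P = [[0.6, 0.4], [0.3, 0.7]]     # hypothetical passive dynamics
mu, mu_prime = [1.0, 0.0], [0.0, 1.0]

alpha = dobrushin(P)
before = l1(mu, mu_prime)
after = l1(push_forward(mu, P), push_forward(mu_prime, P))
```

For this kernel `alpha` is 0.6, and one step of the chain shrinks the distance between the two distributions from `before` to `after`, which stays below `alpha * before`.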

We will sometimes refer to $\tau(P)$ as the mixing time associated with $P$. Now we are ready to state our main result, Theorem 4: Suppose that the passive dynamics satisfies Assumptions 1 and 2. Then, the regret of Algorithm 1 satisfies $R_T = O(\log^2 T)$.

The asymptotic notation used in the theorem hides a number of factors that depend only on the passive dynamics $P$. In particular, the bound scales polynomially with the worst-case mixing time of any optimal policy, and shows no explicit dependence on the number of states. (Of course, the mixing time does depend on the size of the state space in general.) We explicitly state the bound at the end of the proof presented in the next section as Equation (8), when all terms are formally defined.

## 5 Analysis

In this section, we provide a series of lemmas paving the way towards proving Theorem 4. The attentive reader may find some of these lemmas familiar from related work: indeed, we build on several technical results from Even-Dar et al. (2009); Neu et al. (2014) and Guan et al. (2014). Our main technical contribution is an efficient combination of these tools that enables us to go way beyond the best known bounds for our problem, proved by Guan et al. (2014). Throughout the section, we will assume that the conditions of Theorem 4 are satisfied.

Before diving into the analysis, we state some technical results that we will use several times. We defer all proofs to Appendix B. First, we present some important facts regarding LMDPs with bounded state-costs. In particular, we define $Q^*(c)$ as the optimal policy with respect to an arbitrary state-cost function $c$ and let $\mathcal{C}$ be the set of all state-costs bounded in $[0,1]$. We define $\mathcal{Q}^*$ as the set of optimal policies induced by state-cost functions in $\mathcal{C}$: $\mathcal{Q}^* = \{Q^*(c) : c \in \mathcal{C}\}$. Observe that $Q_t \in \mathcal{Q}^*$ for all $t$, as $\bar{c}_t \in \mathcal{C}$ and $Q_t = Q^*(\bar{c}_t)$ for all $t$. Below, we give several useful results concerning policies in $\mathcal{Q}^*$. For stating these results, let $c \in \mathcal{C}$ and $Q = Q^*(c)$, with $G$, $\lambda$ and $v$ defined from $c$ as in Section 2.1.

We first note that the average cost of $Q$ is bounded in $[0,1]$: By the Perron-Frobenius theorem (see, e.g., Meyer, 2000, Chapter 8), we have that the largest eigenvalue of $GP$ is bounded by the maximal and minimal row sums of $GP$: $\min_x e^{-c(x)} \le e^{-\lambda} \le \max_x e^{-c(x)}$, which translates to having $\lambda \in [0,1]$ under our assumptions.

The next key result bounds the value functions and the control costs in terms of the hitting time $H$. For the control costs, all policies $Q \in \mathcal{Q}^*$ satisfy

$$\max_x D\big(Q(\cdot|x) \,\big\|\, P(\cdot|x)\big) \le H + 1.$$

The proof is loosely based on ideas from Bartlett and Tewari (2009). The second statement guarantees that the mixing time is finite for all policies in $\mathcal{Q}^*$: The Markov–Dobrushin coefficient of any policy $Q \in \mathcal{Q}^*$ is bounded as

$$\alpha(Q) \le \alpha(P) + (1 - \alpha(P))\left(1 - e^{-H-2}\right) < 1.$$

The proof builds on the previous lemma and uses standard ideas from Markov-chain theory. In what follows, we will use $\tau = \max_{Q \in \mathcal{Q}^*} \tau(Q)$ and $\alpha = \max_{Q \in \mathcal{Q}^*} \alpha(Q)$ to denote the worst-case mixing time and ergodicity coefficient, respectively. With this notation, we can state the following lemma that establishes that the value functions are $2\tau$-Lipschitz with respect to the state-cost function. For stating and proving the result, it is useful to define the span seminorm $\|v\|_s = \max_x v(x) - \min_x v(x)$. Note that it is easy to show that $\|\cdot\|_s$ is indeed a seminorm, as it satisfies all the requirements to be a norm except that it maps all constant vectors (and not just zero) to zero. Let $f$ and $g$ be two state-cost functions taking values in the interval $[0,1]$ and let $v_f$ and $v_g$ be the corresponding optimal value functions. Then,

$$\left\| v_f - v_g \right\|_s \le 2\tau \left\| f - g \right\|_\infty.$$
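The span seminorm appearing in this bound can be illustrated with a few lines of Python. The vectors below are hypothetical; the sketch only demonstrates the seminorm properties mentioned in the text and assumes the standard definition of the span as the difference between the largest and smallest entry.

```python
def span(v):
    """Span seminorm: largest entry minus smallest entry."""
    return max(v) - min(v)

v = [0.3, -1.2, 2.0]                 # hypothetical value function
w = [1.0, 0.5, -0.5]                 # another hypothetical vector
shifted = [vi + 5.0 for vi in v]     # constant shift of v
constant = [4.0, 4.0, 4.0]

span_v = span(v)
span_sum = span([a + b for a, b in zip(v, w)])
```

A constant shift leaves the span unchanged and constant vectors have span zero, which is why this seminorm is a natural way to compare value functions that are only defined up to an additive constant.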

The proof roughly follows the proof of Proposition 3 of Guan et al. (2014), with the slight difference that we make the constant factor in the bound explicit. A consequence of this result is our final key lemma in this section that actually makes our fast rates possible: a bound on the change-rate of the policies chosen by the algorithm. The proof is based on ideas by Guan et al. (2014). As for the proof of Theorem 4, we follow the path of Even-Dar et al. (2009); Neu et al. (2014); Guan et al. (2014), and first analyze the idealized setting where the learner is allowed to directly pick stationary distributions instead of policies. Then, we show how to relate the idealized regret of FTL to its true regret in the original problem.

### 5.1 Regret in the idealized OCO problem

Let us now consider the idealized online convex optimization problem described at the end of Section 3. In this setting, our algorithm can be formally stated as choosing the stationary transition measure $\pi_t = \arg\min_{\pi \in \Delta(\mathcal{M})} \sum_{s=1}^{t-1} \tilde{\ell}_s(\pi)$. This view enables us to follow a standard proof technique for analyzing online convex optimization algorithms, going back to at least Merhav and Feder (1992). The first ingredient of our proof is the so-called “follow-the-leader/be-the-leader” lemma (Cesa-Bianchi and Lugosi, 2006, Lemma 3.1): $\sum_{t=1}^{T} \tilde{\ell}_t(\pi_{t+1}) \le \min_{\pi \in \Delta(\mathcal{M})} \sum_{t=1}^{T} \tilde{\ell}_t(\pi)$. The second step exploits the bound on the change rate of the policies to show that looking one step into the future does not buy much advantage. Note, however, that controlling the change rate is not sufficient by itself, as our loss functions are effectively unbounded. In the interest of space, we only provide a proof sketch here and defer the full proof to Appendix B.5.

Proof sketch. Let us define $\Delta_t = \bar{c}_{t+1} - \bar{c}_t$. By exploiting the affinity of $f$ in its second argument, and by using the form of the optimal policy given in Eq. (5) and the form of $f$ given in Eq. (3), we can obtain

$$\begin{aligned}
\tilde{\ell}_t(\pi_t) - \tilde{\ell}_t(\pi_{t+1})
&= (\mu_t - \mu_{t+1})^\top (c_t + \bar{c}_t) + \mu_{t+1}^\top (\bar{c}_t - \bar{c}_{t+1}) + \lambda_t - \lambda_{t+1} \\
&\le 2 \left\| \mu_{t+1} - \mu_t \right\|_1 + 2 \left\| \Delta_t \right\|_\infty.
\end{aligned}$$

The first term can be bounded by a simple argument (see, e.g., Lemma 4 of Neu et al. 2014) that leads to

$$\left\| \mu_{t+1} - \mu_t \right\|_1 \le \max\{\tau(Q_t), \tau(Q_{t+1})\} \max_x \left\| Q_{t+1}(\cdot|x) - Q_t(\cdot|x) \right\|_1.$$

Now, the first factor can be bounded by $\tau$ and the second by appealing to Lemma 5. The proof is concluded by plugging the above bounds into Equation (12) and summing up both sides over $t$. Putting the two lemmas of this section together, we obtain a bound on the idealized regret of FTL.

### 5.2 Regret in the reactive setting

We first show that the advantage of the true best policy over our final policy is bounded. Let be the smallest non-zero transition probability under the passive dynamics and . Then, . The proof follows from applying Lemma 1 from Neu et al. (2014) and observing that holds for all . It remains to relate the total cost of FTL to the total idealized cost of the algorithm. This is done in the following lemma: