1 Introduction
We consider the problem of online learning in Markov decision processes (MDPs) where a learner sequentially interacts with an environment by repeatedly taking actions that influence the future states of the environment while incurring some immediate costs. The goal of the learner is to choose its actions so that the accumulated costs are as small as possible. Several variants of this problem have been well studied in the literature, primarily in the case where the costs are assumed to be independent and identically distributed (Sutton and Barto, 1998; Puterman, 1994; Bertsekas and Tsitsiklis, 1996; Szepesvári, 2010). In the current paper, we consider the case where the costs are generated by an arbitrary external process and the learner aims to minimize its total loss during the learning procedure, conforming to the learning paradigm known as online learning (Cesa-Bianchi and Lugosi, 2006; Shalev-Shwartz, 2012). In the online-learning framework, the performance of the learner is measured in terms of the regret, defined as the gap between the total costs incurred by the learner and the total costs of the best comparator chosen from a prespecified class of strategies. In the case of online learning in MDPs, a natural class of strategies is the set of all state-feedback policies: several works studied minimizing regret against this class both in the stationary-cost setting (Bartlett and Tewari, 2009; Jaksch et al., 2010; Abbasi-Yadkori and Szepesvári, 2011) and the nonstochastic setting (Even-Dar et al., 2009; Yu et al., 2009; Neu et al., 2010, 2012; Zimin and Neu, 2013; Dick et al., 2014; Neu et al., 2014; Abbasi-Yadkori et al., 2014). In the nonstochastic setting, most works consider MDPs with unstructured, finite state spaces and guarantee that the regret increases no faster than $O(\sqrt{T})$ as the number of interaction rounds $T$ grows large. A notable exception is the work of Abbasi-Yadkori et al. (2014), who consider the special case of (continuous-state) linear-quadratic control with arbitrarily changing target states, and propose an algorithm that guarantees a regret bound of $O(\log^2 T)$.
In the present paper, we study another special class of MDPs that turns out to allow fast rates. Specifically, we consider the class of so-called linearly solvable MDPs (in short, LMDPs), first proposed and named so by Todorov (2006). This class takes its name from the special property that the Bellman optimality equations characterizing the optimal behavior policy take the form of a system of linear equations, which makes optimization remarkably straightforward in such problems. The continuous formulation (in both space and time) was discovered independently by Kappen (2005) and is known as path integral control. LMDPs have many interesting properties. For example, optimal control laws for LMDPs can be linearly combined to derive composite optimal control laws efficiently (Todorov, 2009). Also, the inverse optimal control problem in LMDPs can be expressed as a convex optimization problem (Dvijotham and Todorov, 2010). LMDPs generalize an existing duality between optimal control computation and Bayesian inference (Todorov, 2008). Indeed, the popular belief propagation algorithm used in dynamic probabilistic graphical models is equivalent to the power iteration method used to solve LMDPs (Kappen et al., 2012).
The LMDP framework has found applications in robotics (Matsubara et al., 2014; Ariki et al., 2016), crowdsourcing (Abbasi-Yadkori et al., 2015), and controlling the growth dynamics of complex networks (Thalmeier et al., 2017). The related path integral control framework of Kappen (2005) has been applied in several real-world tasks, including robot navigation (Kinjo et al., 2013), motor skill reinforcement learning (Theodorou et al., 2010; Rombokas et al., 2013; Gómez et al., 2014), aggressive car maneuvering (Williams et al., 2016) and autonomous flight of teams of quadrotors (Gómez et al., 2016).
In the present paper, we show that besides the aforementioned properties, the structure of LMDPs also enables constructing efficient online learning procedures with very low regret. In particular, we show that, under some mild assumptions on the structure of the LMDP, the (conceptually) simplest online learning strategy of following the leader guarantees a regret of order $\log^2 T$, vastly improving over the best known previous result by Guan, Raginsky, and Willett (2014), who prove a regret bound of order $T^{3/4+\epsilon}$ for arbitrarily small $\epsilon > 0$ under the same assumptions. Our approach is based on the observation that the optimal control law arising from the LMDP structure is a smooth function of the underlying cost function, enabling rapid learning without any regularization whatsoever.
The rest of the paper is organized as follows. Section 2 introduces the formalism of LMDPs and summarizes some basic facts that our technical content is going to rely on. Section 3 describes our online learning model. Our learning algorithm is described in Section 4 and analyzed in Section 5. Finally, we draw conclusions in Section 6.
Notation.
We will consider several real-valued functions over a finite state space $\mathcal{X}$, and we will often treat these functions as finite-dimensional (column) vectors endowed with the usual definitions of the $\ell_p$ norms $\|\cdot\|_p$. The set of probability distributions over $\mathcal{X}$ will be denoted as $\Delta(\mathcal{X})$. Indefinite sums with running variables $x$ or $x'$ are understood to run through all elements of $\mathcal{X}$.
2 Background on linearly solvable MDPs
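This section serves as a quick introduction to the formalism of linearly solvable MDPs (LMDPs; Todorov, 2006). These decision processes are defined by the tuple $(\mathcal{X}, P, c)$, where $\mathcal{X}$ is a finite set of states, $P: \mathcal{X} \times \mathcal{X} \to [0,1]$ is a transition kernel called the passive dynamics (with $P(x'|x)$ being the probability of the process moving to state $x'$ given the previous state $x$) and $c: \mathcal{X} \to [0,1]$ is the state-cost function. Our Markov decision process is a sequential decision-making problem where the initial state $X_1$ is drawn from some distribution $\mu_1$, and the following steps are repeated for an indefinite number of rounds $t = 1, 2, \dots$:

The learner chooses a transition kernel $Q_t: \mathcal{X} \times \mathcal{X} \to [0,1]$ satisfying $\mathrm{supp}\, Q_t(\cdot|x) \subseteq \mathrm{supp}\, P(\cdot|x)$ for all $x \in \mathcal{X}$.

The learner observes $X_t$ and draws the next state $X_{t+1} \sim Q_t(\cdot|X_t)$.

The learner incurs the cost
$$ \ell(X_t, Q_t) = c(X_t) + D\big(Q_t(\cdot|X_t) \,\|\, P(\cdot|X_t)\big), $$
where $D(p \,\|\, q)$ is the relative entropy (or Kullback-Leibler divergence) between the probability distributions $p$ and $q$, defined as $D(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$.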
The state-cost function should be thought of as specifying the objective for the learner in the MDP, while the relative-entropy term governs the costs associated with significant deviations from the passive dynamics. Accordingly, we refer to this component as the control cost. A central question in the theory of Markov decision problems is finding a behavior policy that minimizes (some notion of) the long-term total costs. In this paper, we consider the problem of minimizing the long-term average cost-per-stage $\lim_{T\to\infty} \frac{1}{T} \mathbb{E}\big[\sum_{t=1}^{T} \ell(X_t, Q_t)\big]$. Assuming that the passive dynamics is aperiodic and irreducible, this limit is minimized by a stationary policy (see, e.g., Puterman (1994, Sec. 8.4.4)). Below, we provide two distinct derivations for the optimal stationary policy that minimizes the average costs under this assumption.
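As an illustration of the per-round cost described above (state cost plus relative-entropy control cost), here is a minimal Python sketch. The three-state passive dynamics `P`, the state costs `c`, the controlled kernel `Q`, and the helper names are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import numpy as np

def kl_divergence(q, p):
    """Relative entropy D(q || p) = sum_x q(x) log(q(x) / p(x)), with 0 log 0 = 0."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    support = q > 0
    if np.any(p[support] == 0):
        return np.inf  # q puts mass where the passive dynamics cannot go
    return float(np.sum(q[support] * np.log(q[support] / p[support])))

def round_cost(x, Q, P, c):
    """State cost c(x) plus the control cost of using Q(.|x) instead of P(.|x)."""
    return c[x] + kl_divergence(Q[x], P[x])

# Hypothetical 3-state example (all numbers are made up for illustration).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
c = np.array([0.0, 0.5, 1.0])
Q = np.array([[0.9, 0.1, 0.0],
              [0.6, 0.3, 0.1],
              [0.0, 0.8, 0.2]])

passive_cost = round_cost(1, P, P, c)  # no deviation from P: control cost is zero
steered_cost = round_cost(1, Q, P, c)  # steering towards state 0 adds a KL penalty
```

Following the passive dynamics costs only `c[x]`, while any deviation pays a strictly positive control cost; a kernel that leaves the support of the passive dynamics incurs infinite cost, which is why the learner's kernels are restricted to that support.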
2.1 The Bellman equations
We first take an approach rooted in dynamic programming (Bertsekas, 2007), following Todorov (2006). Under our assumptions, the optimal stationary policy minimizing the average cost is given by finding the solution to the Bellman optimality equation
$$ v(x) = c(x) - \lambda + \min_{q \in \Delta(\mathcal{X})} \Big\{ D\big(q \,\|\, P(\cdot|x)\big) + \sum_{x'} q(x') v(x') \Big\} \tag{1} $$
for all $x \in \mathcal{X}$, where $v$ is called the optimal value function and $\lambda$ is the average cost associated with the optimal policy. (This solution is guaranteed to be unique up to a constant shift of the values: if $v$ is a solution, then so is $v + a$ for any $a \in \mathbb{R}$. Unless stated otherwise, we will assume that $v$ is such that $v(x_0) = 0$ holds for a fixed state $x_0$.) Linearly solvable MDPs get their name from the fact that the Bellman optimality equation can be rewritten in a simple linear form. To see this, observe that by elementary calculations involving Lagrange multipliers, we have
$$ \min_{q \in \Delta(\mathcal{X})} \Big\{ D\big(q \,\|\, P(\cdot|x)\big) + \sum_{x'} q(x') v(x') \Big\} = -\log \sum_{x'} P(x'|x) e^{-v(x')}, $$
so, after defining the exponentiated value function $z(x) = e^{-v(x)}$ for all $x$, plugging into Equation (1) and exponentiating both sides gives
$$ z(x) = e^{\lambda - c(x)} \sum_{x'} P(x'|x) z(x'). \tag{2} $$
Rewriting the above set of equations in matrix form, we obtain the linear equations
$$ e^{-\lambda} z = G P z, $$
where $G$ is a diagonal matrix with $G_{xx} = e^{-c(x)}$. By the Perron–Frobenius theorem (see, e.g., Chapter 8 of Meyer (2000)) concerning positive matrices, the above system of linear equations has a unique solution $z$ (as in the case of the Bellman equations, unique up to a scaling of $z$) satisfying $z(x) > 0$ for all $x$, and this eigenvector corresponds to the largest eigenvalue $e^{-\lambda}$ of $GP$. Since the solution of the Bellman optimality equation (1) is unique (up to a constant shift corresponding to a constant scaling of $z$), we obtain that $\lambda$ is the average cost of the optimal policy. In summary, the Bellman optimality equation takes the form of a Perron–Frobenius eigenvalue problem, which can be efficiently solved by iterative methods such as the well-known power method for finding top eigenvectors. Finally, getting back to the basic form (1) of the Bellman equations, we can conclude after simple calculations that the optimal policy can be computed for all $x, x'$ as
$$ Q(x'|x) = \frac{P(x'|x)\, z(x')}{\sum_{y} P(y|x)\, z(y)}. $$
2.2 The convex optimization view
We also provide an alternative (and, to our knowledge, as yet unpublished) view of the optimal control problem in LMDPs, based on convex optimization. For the purposes of this paper, we find this form to be more insightful, as it enables us to study our learning problem in the framework of online convex optimization (Hazan, 2011, 2016; Shalev-Shwartz, 2012). To derive this form, observe that under our assumptions, every feasible policy $Q$ induces a stationary distribution $\mu_Q$ over the state space satisfying $\mu_Q(x') = \sum_x \mu_Q(x) Q(x'|x)$ for all $x'$. This stationary distribution and the policy together induce a distribution $\pi_Q$ over $\mathcal{X}^2$ defined for all $x, x'$ as $\pi_Q(x,x') = \mu_Q(x) Q(x'|x)$. We will call $\pi_Q$ the stationary transition measure induced by $Q$, which is motivated by the observation that $\pi_Q(x,x')$ corresponds to the probability of observing the transition $x \to x'$ in the equilibrium state: $\pi_Q(x,x') = \lim_{t\to\infty} \mathbb{P}[X_t = x, X_{t+1} = x']$. Notice that, with this notation, the average cost-per-stage of policy $Q$ can be rewritten in the form
$$ \sum_x \mu_Q(x) \Big( c(x) + \sum_{x'} Q(x'|x) \log \frac{Q(x'|x)}{P(x'|x)} \Big) = \sum_{x,x'} \pi_Q(x,x') \log \frac{\pi_Q(x,x')}{\sum_{y} \pi_Q(x,y)} + \sum_{x,x'} \pi_Q(x,x') \big( c(x) - \log P(x'|x) \big). $$
The first term in the final expression above is the negative conditional entropy of $X'$ relative to $X$, where $(X, X')$ is a pair of random states drawn from $\pi_Q$. Since the negative conditional entropy is convex in $\pi_Q$ (for a proof, see Appendix A.1) and the second term in the expression is linear in $\pi_Q$, we can see that the average cost is a convex function of $\pi_Q$. This suggests that we can view the optimal control problem as having to find a feasible stationary transition measure that minimizes the expected costs. In short, defining
$$ f(\pi; c) = \sum_{x,x'} \pi(x,x') \Big( \log \frac{\pi(x,x')}{\sum_{y} \pi(x,y)} + c(x) - \log P(x'|x) \Big) \tag{3} $$
and $\mathcal{S}$ as the (convex) set of feasible stationary transition measures $\pi \in \Delta(\mathcal{X}^2)$ satisfying
$$ \sum_{x'} \pi(x,x') = \sum_{x''} \pi(x'',x) \;\; (\forall x), \qquad \pi(x,x') = 0 \;\; (\forall x,x' : P(x'|x) = 0), \tag{4} $$
the optimization problem can be succinctly written as $\min_{\pi \in \mathcal{S}} f(\pi; c)$. In Appendix A.2, we provide a derivation of the optimal control given by Equation (2) starting from the formulation given above. We also remark that our analysis will heavily rely on the fact that $f(\pi; c)$ is affine in $c$.
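To connect the two views numerically, here is a hedged sketch. It solves the eigenvalue problem of Section 2.1 with NumPy (using `numpy.linalg.eig` in place of the power method, purely for brevity) and then evaluates the convex objective of Section 2.2 at the stationary transition measure of the resulting policy. The function names, the three-state example and the exact written form of the objective are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def solve_lmdp(P, c):
    """Optimal average cost and policy from the top eigenpair of diag(exp(-c)) @ P."""
    eigvals, eigvecs = np.linalg.eig(np.diag(np.exp(-c)) @ P)
    top = np.argmax(eigvals.real)              # Perron root has the largest real part
    z = np.abs(eigvecs[:, top].real)           # Perron vector; abs() fixes the sign
    Q = P * z[None, :]                         # Q(x'|x) proportional to P(x'|x) z(x')
    Q /= Q.sum(axis=1, keepdims=True)
    return -np.log(eigvals[top].real), Q

def stationary_distribution(Q):
    """Left eigenvector of Q for eigenvalue 1, normalized to sum to one."""
    eigvals, eigvecs = np.linalg.eig(Q.T)
    mu = np.abs(eigvecs[:, np.argmax(eigvals.real)].real)
    return mu / mu.sum()

def objective(pi, P, c):
    """Negative conditional entropy of pi plus the linear term in pi."""
    mu = pi.sum(axis=1)                        # marginal of the first coordinate
    mu_rows = np.broadcast_to(mu[:, None], pi.shape)
    mask = pi > 0                              # convention: 0 log 0 = 0
    neg_cond_entropy = np.sum(pi[mask] * (np.log(pi[mask]) - np.log(mu_rows[mask])))
    linear = mu @ c - np.sum(pi[mask] * np.log(P[mask]))
    return float(neg_cond_entropy + linear)

# Hypothetical irreducible, aperiodic passive dynamics and state costs.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
c = np.array([0.0, 0.5, 1.0])

lam, Q = solve_lmdp(P, c)
pi_opt = stationary_distribution(Q)[:, None] * Q      # pi(x, x') = mu(x) Q(x'|x)
pi_passive = stationary_distribution(P)[:, None] * P  # measure of the passive dynamics

f_opt = objective(pi_opt, P, c)
f_passive = objective(pi_passive, P, c)
```

Under these assumptions, `f_opt` should agree with `lam` up to numerical error, and it should not exceed `f_passive`, which reduces to the average state cost under the passive dynamics because the control cost vanishes there.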
3 Online learning in linearly solvable MDPs
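We now present the precise learning setting that we consider in the present paper. We will study an online learning scheme where for each round $t = 1, 2, \dots, T$, the following steps are repeated:

The learner chooses a transition kernel $Q_t$ satisfying $\mathrm{supp}\, Q_t(\cdot|x) \subseteq \mathrm{supp}\, P(\cdot|x)$ for all $x \in \mathcal{X}$.

The learner observes $X_t$ and draws the next state $X_{t+1} \sim Q_t(\cdot|X_t)$.

Obliviously to the learner’s choice, the environment chooses the state-cost function $c_t: \mathcal{X} \to [0,1]$.

The learner incurs the cost
$$ \ell_t(X_t, Q_t) = c_t(X_t) + D\big(Q_t(\cdot|X_t) \,\|\, P(\cdot|X_t)\big). $$

The environment reveals the state-cost function $c_t$.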
The key change from the stationary setting described in the previous section is that the state-cost function may now change arbitrarily between rounds, and the learner is only allowed to observe the costs after it has made its decision. We stress that we assume that the learner fully knows the passive dynamics, so the only difficulty comes from having to deal with the changing costs. As usual in the online-learning literature, our goal is to do nearly as well as the best stationary policy chosen in hindsight after observing the entire sequence of cost functions. To define our precise performance measure, we first define the total expected cost of a policy $Q$ as
$$ L_T(Q) = \mathbb{E}\Big[ \sum_{t=1}^{T} \ell_t(X_t, Q) \Big], $$
where the state trajectory is generated sequentially as $X_{t+1} \sim Q(\cdot|X_t)$ and the expectation integrates over the randomness of the transitions. Having this definition in place, we can specify the best stationary policy $Q^*_T = \arg\min_Q L_T(Q)$ (the existence of the minimum is warranted by the fact that $L_T$ is a continuous function bounded from below on its compact domain) and define our performance measure as the (total expected) regret against $Q^*_T$:
$$ R_T = \mathbb{E}\Big[ \sum_{t=1}^{T} \ell_t(X_t, Q_t) \Big] - L_T(Q^*_T), $$
where the expectation integrates over both the randomness of the state transitions and the potential randomization used by the learning algorithm. Having access to this definition, we can now formally define the goal of the learner as having to come up with a sequence of policies that guarantee that the total regret grows sublinearly, that is, that the average per-round regret asymptotically converges to zero.
For our analysis, it will be useful to define an idealized version of the above online optimization problem, where the learner is allowed to immediately switch between the stationary distributions of the chosen policies. By making use of the convex-optimization view given in Section 2.2, we define an auxiliary online convex optimization (or, in short, OCO; see, e.g., Hazan, 2011; Shalev-Shwartz, 2012) problem called the idealized OCO problem where in each round $t = 1, 2, \dots, T$, the following steps are repeated:

The learner chooses the stationary transition measure $\pi_t \in \mathcal{S}$.

Obliviously to the learner’s choice, the environment chooses the loss function $f_t = f(\cdot\,; c_t)$.

The learner incurs a loss of $f_t(\pi_t)$.

The environment reveals the loss function $f_t$.

The performance of the learner in this setting is measured by the idealized regret
$$ \widehat{R}_T = \sum_{t=1}^{T} f_t(\pi_t) - \min_{\pi \in \mathcal{S}} \sum_{t=1}^{T} f_t(\pi). $$
Throughout the paper, we will consider oblivious environments that choose the sequence of state-cost functions without taking into account the states visited by the learner. This assumption will enable us to simultaneously reason about the expected costs under any sequence of state distributions, and thus to make a connection between the idealized regret $\widehat{R}_T$ and the true regret $R_T$. This technique was first used by Even-Dar et al. (2009) and was shown to be essentially inevitable by Yu et al. (2009): as discussed in their Section 3.1, no learning algorithm can avoid linear regret if the environment is not oblivious.
4 Algorithm and main result
In this section, we propose a simple algorithm for online learning in LMDPs based on the “follow-the-leader” (FTL) strategy. On a high level, the idea of this algorithm is to greedily bet on the policy that seems to have been optimal for the total costs observed so far. While this strategy is known to fail catastrophically in several simple learning problems (see, e.g., Cesa-Bianchi and Lugosi 2006), it is known to perform well in several important scenarios such as sequential prediction under the logarithmic loss (Merhav and Feder, 1992) or prediction with expert advice under bounded losses, given that losses are stationary (Kotłowski, 2016), and often serves as a strong benchmark strategy (de Rooij et al., 2014; Sani et al., 2014). In our learning problem, following the leader is a very natural choice of algorithm, as the convex formulation of Section 2.2 suggests that we can effectively build on the analysis of Follow-the-Regularized-Leader-type algorithms without having to explicitly regularize the objective.
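In precise terms, our algorithm computes the sequence of policies by running FTL in the idealized setting: in round $t$, the algorithm chooses the stationary transition measure
$$ \pi_t = \arg\min_{\pi \in \mathcal{S}} \sum_{s=1}^{t-1} f_s(\pi) = \arg\min_{\pi \in \mathcal{S}} \frac{1}{t-1} \sum_{s=1}^{t-1} f(\pi; c_s) = \arg\min_{\pi \in \mathcal{S}} f\Big(\pi; \frac{1}{t-1} \sum_{s=1}^{t-1} c_s\Big) = \arg\min_{\pi \in \mathcal{S}} f(\pi; \bar{c}_t), $$
where the third equality uses the fact that $f$ is affine in its second argument and the last step introduces the average state-cost function $\bar{c}_t = \frac{1}{t-1} \sum_{s=1}^{t-1} c_s$. This form implies that $\pi_t$ can be computed as the optimal control for the state-cost function $\bar{c}_t$, which can be done by following the procedure described in Section 2.1. Precisely, we define the diagonal matrix $G_t$ with its $x$-th diagonal element $e^{-\bar{c}_t(x)}$, let $\gamma_t$ be the largest eigenvalue of $G_t P$ and $z_t$ be the corresponding (unit-norm) right eigenvector. Also, let $v_t = -\log z_t$ and $\lambda_t = -\log \gamma_t$, and note that $\lambda_t$ is the optimal average cost-per-stage given the cost function $\bar{c}_t$. Finally, we define the policy used in round $t$ as
$$ Q_t(x'|x) = \frac{P(x'|x)\, z_t(x')}{\sum_{y} P(y|x)\, z_t(y)} \tag{5} $$
for all $x$ and $x'$. We denote the induced stationary distribution by $\mu_t$. The algorithm is presented as Algorithm 1.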
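Input: Passive dynamics $P$.
Initialization: $\bar{c}_1(x) = 0$ for all $x \in \mathcal{X}$.
For $t = 1, 2, \dots, T$, repeat

Construct $G_t = \mathrm{diag}\big(e^{-\bar{c}_t}\big)$.

Find the right eigenvector $z_t$ of $G_t P$ corresponding to the largest eigenvalue.

Compute the policy $Q_t(x'|x) = \frac{P(x'|x)\, z_t(x')}{\sum_{y} P(y|x)\, z_t(y)}$.

Observe state $X_t$ and draw $X_{t+1} \sim Q_t(\cdot|X_t)$.

Observe state-cost function $c_t$ and update $\bar{c}_{t+1} = \bar{c}_t + \frac{1}{t}\big(c_t - \bar{c}_t\big)$.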
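The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the random cost sequence and the three-state passive dynamics are hypothetical, the eigenvector is obtained with `numpy.linalg.eig` instead of the power method, and the running average of the observed costs is initialized at zero.

```python
import numpy as np

def optimal_policy(P, c_bar):
    """Policy Q(x'|x) proportional to P(x'|x) z(x'), with z the top eigenvector of G P."""
    eigvals, eigvecs = np.linalg.eig(np.diag(np.exp(-c_bar)) @ P)
    z = np.abs(eigvecs[:, np.argmax(eigvals.real)].real)
    Q = P * z[None, :]
    return Q / Q.sum(axis=1, keepdims=True)

def follow_the_leader(P, costs, x0=0, seed=0):
    """Run FTL on a sequence of state-cost vectors; return total cost and policies."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    c_bar = np.zeros(n)          # average of the costs observed so far (assumed 0 at t = 1)
    x = x0
    total_cost, policies = 0.0, []
    for t, c_t in enumerate(costs, start=1):
        Q_t = optimal_policy(P, c_bar)           # leader for the average cost so far
        policies.append(Q_t)
        support = Q_t[x] > 0
        control = np.sum(Q_t[x, support] * np.log(Q_t[x, support] / P[x, support]))
        total_cost += c_t[x] + control           # state cost plus KL control cost
        x = rng.choice(n, p=Q_t[x])              # draw the next state from Q_t(.|x)
        c_bar += (c_t - c_bar) / t               # running average of c_1, ..., c_t
    return total_cost, policies

# Hypothetical passive dynamics and an arbitrary (here: random) cost sequence in [0, 1].
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
T = 50
costs = np.random.default_rng(1).uniform(0.0, 1.0, size=(T, 3))
total_cost, policies = follow_the_leader(P, costs)
```

In the first round the average cost is zero, so the leader coincides with the passive dynamics; afterwards each policy only reweights transitions that the passive dynamics already allows, so the support constraint is satisfied automatically.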
Now we present our main result. First, we state two key assumptions about the underlying passive dynamics; both of these assumptions are also made by Guan et al. (2014).
Assumption 1
The passive dynamics $P$ is irreducible and aperiodic. In particular, there exists a natural number $H$ such that $(P^H)(x'|x) > 0$ for all $x$ and all $x'$. We will refer to $H$ as the (worst-case) hitting time.
Assumption 2
The passive dynamics $P$ is ergodic in the sense that its Markov–Dobrushin ergodicity coefficient is strictly less than $1$:
$$ \alpha(P) = \max_{x,y} \frac{1}{2} \sum_{x'} \big| P(x'|x) - P(x'|y) \big| < 1. $$
A standard consequence (see, e.g., Seneta 2006) of Assumption 2 is that the passive dynamics mixes quickly: for any distributions $\mu, \mu' \in \Delta(\mathcal{X})$, we have
$$ \big\| (\mu - \mu')^\top P \big\|_1 \le \alpha(P)\, \| \mu - \mu' \|_1 . $$
We will sometimes refer to $\tau(P) = 1 / \log(1/\alpha(P))$ as the mixing time associated with $P$. Now we are ready to state our main result:
Theorem 4
Suppose that the passive dynamics satisfies Assumptions 1 and 2. Then, the regret of Algorithm 1 satisfies $R_T = O(\log^2 T)$.
The asymptotic notation used in the theorem hides a number of factors that depend only on the passive dynamics $P$. In particular, the bound scales polynomially with the worst-case mixing time of any optimal policy, and shows no explicit dependence on the number of states. (Of course, the mixing time does depend on the size of the state space in general.) We explicitly state the bound at the end of the proof presented in the next section as Equation (8), when all terms are formally defined.
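The ergodicity condition of Assumption 2 can be checked numerically. The sketch below assumes the standard total-variation form of the Markov–Dobrushin coefficient (the maximum over pairs of states of half the $\ell_1$ distance between the corresponding rows of the kernel) and, as a further assumption, takes the mixing time to be $1/\log(1/\alpha)$; the function name and the three-state kernel are hypothetical.

```python
import numpy as np

def dobrushin_coefficient(P):
    """max over state pairs (x, y) of (1/2) * sum_z |P(z|x) - P(z|y)|."""
    row_gaps = np.abs(P[:, None, :] - P[None, :, :]).sum(axis=2)
    return 0.5 * row_gaps.max()

# Hypothetical passive dynamics (rows are next-state distributions).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])

alpha = dobrushin_coefficient(P)          # rows 0 and 2 are the farthest apart
tau = 1.0 / np.log(1.0 / alpha)           # assumed definition of the mixing time

# One-step contraction of the l1 distance between two state distributions.
mu = np.array([1.0, 0.0, 0.0])
mu_prime = np.array([0.0, 0.0, 1.0])
before = np.abs(mu - mu_prime).sum()
after = np.abs((mu - mu_prime) @ P).sum()
```

For this kernel the coefficient equals 0.5, and one step of the passive dynamics shrinks the $\ell_1$ distance between the two point masses from 2 to 1, matching the contraction bound with equality.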
5 Analysis
In this section, we provide a series of lemmas paving the way towards proving Theorem 4. The attentive reader may find some of these lemmas familiar from related work: indeed, we build on several technical results from Even-Dar et al. (2009), Neu et al. (2014) and Guan et al. (2014). Our main technical contribution is an efficient combination of these tools that enables us to go well beyond the best known bounds for our problem, proved by Guan et al. (2014). Throughout the section, we will assume that the conditions of Theorem 4 are satisfied.
Before diving into the analysis, we state some technical results that we will use several times. We defer all proofs to Appendix B. First, we present some important facts regarding LMDPs with bounded state-costs. In particular, we define $Q^*_c$ as the optimal policy with respect to an arbitrary state-cost function $c$ and let $\mathcal{C}$ be the set of all state-costs bounded in $[0,1]$. We define $\mathcal{Q}^*$ as the set of optimal policies induced by state-cost functions in $\mathcal{C}$: $\mathcal{Q}^* = \{ Q^*_c : c \in \mathcal{C} \}$. Observe that $Q_t \in \mathcal{Q}^*$ for all $t$, as $Q_t = Q^*_{\bar{c}_t}$ and $\bar{c}_t \in \mathcal{C}$ for all $t$. Below, we give several useful results concerning policies in $\mathcal{Q}^*$. For stating these results, let and . We first note that the average cost $\lambda$ of any policy in $\mathcal{Q}^*$ is bounded in $[0,1]$: by the Perron-Frobenius theorem (see, e.g., Meyer, 2000, Chapter 8), we have that the largest eigenvalue $e^{-\lambda}$ of $GP$ is bounded by the maximal and minimal row sums of $GP$: $\min_x e^{-c(x)} \le e^{-\lambda} \le \max_x e^{-c(x)}$, which translates to having $\lambda \in [0,1]$ under our assumptions. The next key result bounds the value functions and the control costs in terms of the hitting time: For all and , the value functions satisfy . Furthermore, all policies satisfy
The proof is loosely based on ideas from Bartlett and Tewari (2009). The second statement guarantees that the mixing time is finite for all policies in $\mathcal{Q}^*$: The Markov–Dobrushin coefficient of any policy is bounded as
The proof builds on the previous lemma and uses standard ideas from Markov-chain theory. In what follows, we will use $\tau = \max_{Q \in \mathcal{Q}^*} \tau(Q)$ and $\alpha = \max_{Q \in \mathcal{Q}^*} \alpha(Q)$ to denote the worst-case mixing time and ergodicity coefficient, respectively. With this notation, we can state the following lemma that establishes that the value functions are Lipschitz with respect to the state-cost function. For stating and proving the statement, it is useful to define the span seminorm $\|v\|_s = \max_x v(x) - \min_x v(x)$. Note that it is easy to show that $\|\cdot\|_s$ is indeed a seminorm as it satisfies all the requirements to be a norm except that it maps all constant vectors (and not just zero) to zero. Let $c$ and $c'$ be two state-cost functions taking values in the interval $[0,1]$ and let $v$ and $v'$ be the corresponding optimal value functions. Then, . The proof roughly follows the proof of Proposition 3 of Guan et al. (2014), with the slight difference that we make the constant factor in the bound explicit. A consequence of this result is our final key lemma in this section that actually makes our fast rates possible: a bound on the change rate of the policies chosen by the algorithm. . The proof is based on ideas by Guan et al. (2014). As for the proof of Theorem 4, we follow the path of Even-Dar et al. (2009), Neu et al. (2014) and Guan et al. (2014), and first analyze the idealized setting where the learner is allowed to directly pick stationary distributions instead of policies. Then, we show how to relate the idealized regret of FTL to its true regret in the original problem.
5.1 Regret in the idealized OCO problem
Let us now consider the idealized online convex optimization problem described at the end of Section 3. In this setting, our algorithm can be formally stated as choosing the stationary transition measure $\pi_t = \arg\min_{\pi \in \mathcal{S}} \sum_{s=1}^{t-1} f_s(\pi)$. This view enables us to follow a standard proof technique for analyzing online convex optimization algorithms, going back to at least Merhav and Feder (1992). The first ingredient of our proof is the so-called “follow-the-leader/be-the-leader” lemma (Cesa-Bianchi and Lugosi, 2006, Lemma 3.1): $\sum_{t=1}^{T} f_t(\pi_{t+1}) \le \min_{\pi \in \mathcal{S}} \sum_{t=1}^{T} f_t(\pi)$. The second step exploits the bound on the change rate of the policies to show that looking one step into the future does not buy much advantage. Note however that controlling the change rate is not sufficient by itself, as our loss functions are effectively unbounded. . In the interest of space, we only provide a proof sketch here and defer the full proof to Appendix B.5.
Proof sketch Let us define . By exploiting the affinity of $f$ in its second argument, we can start by proving . Furthermore, by using the form of the optimal policy given in Eq. (5) and the form of $f$ given in Eq. (3), we can obtain
The first term can be bounded by a simple argument (see, e.g., Lemma 4 of Neu et al. 2014) that leads to
Now, the first factor can be bounded by and the second by appealing to Lemma 5. The proof is concluded by plugging the above bounds into Equation (12), using , summing up both sides, and noting that . Putting Lemmas 5.1 and 5.1 together, we obtain the following bound on the idealized regret of FTL: .
5.2 Regret in the reactive setting
We first show that the advantage of the true best policy over our final policy is bounded. Let be the smallest nonzero transition probability under the passive dynamics and . Then, . The proof follows from applying Lemma 1 from Neu et al. (2014) and observing that holds for all . It remains to relate the total cost of FTL to the total idealized cost of the algorithm. This is done in the following lemma: