Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

We study the problem of learning Markov decision processes with finite state and action spaces when the transition probability distributions and loss functions are chosen adversarially and are allowed to change with time. We introduce an algorithm whose regret with respect to any policy in a comparison class grows as the square root of the number of rounds of the game, provided the transition probabilities satisfy a uniform mixing condition. Our approach is efficient as long as the comparison class has polynomial size and we can compute expectations over sample paths for each policy. Designing an efficient algorithm with small regret for the general case remains an open problem.





1 Notation

Let 𝒳 be a finite state space and 𝒜 be a finite action space. Let Δ_S be the space of probability distributions over a set S. Define a policy π as a mapping from the state space to Δ_𝒜, π : 𝒳 → Δ_𝒜. We use π(a|x) to denote the probability of choosing action a in state x under policy π. A random action under policy π in state x is denoted by A^π(x) ∼ π(·|x). A transition probability kernel (or transition model) M is a mapping from the direct product of the state and action spaces to Δ_𝒳: M : 𝒳 × 𝒜 → Δ_𝒳. Let P^{(π,M)} be the transition probability matrix of policy π under transition model M, [P^{(π,M)}]_{x,x'} = Σ_{a∈𝒜} π(a|x) M(x'|x,a). A loss function ℓ is a bounded real-valued function over state and action spaces, ℓ : 𝒳 × 𝒜 → [0,1]. For a vector v, define ∥v∥₁ = Σ_i |v_i|. For a real-valued function f defined over 𝒳 × 𝒜, define ∥f∥_∞ = max_{(x,a)∈𝒳×𝒜} |f(x,a)|. The inner product between two vectors v and w is denoted by ⟨v, w⟩.
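To make the notation concrete, the sketch below builds the transition probability matrix P^{(π,M)} of a policy under a model and evaluates an expected one-step loss. All names and the small dimensions are hypothetical, randomly generated quantities for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3            # hypothetical small dimensions

# Random policy pi(a|x): row x is a distribution over actions.
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)

# Random transition model M(x'|x, a), shape (states, actions, next states).
M = rng.random((n_states, n_actions, n_states))
M /= M.sum(axis=2, keepdims=True)

# Transition matrix of policy pi under model M:
# P[x, x'] = sum_a pi(a|x) * M(x'|x, a).
P = np.einsum('xa,xay->xy', pi, M)

# Expected one-step loss <nu, l^pi> under a state distribution nu,
# where l^pi(x) = sum_a pi(a|x) * l(x, a) and l is bounded in [0, 1].
loss = rng.random((n_states, n_actions))
nu = np.full(n_states, 1.0 / n_states)
expected_loss = nu @ (pi * loss).sum(axis=1)
```

Each row of P is again a probability distribution, so P is row-stochastic.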

2 Introduction

Consider the following game between a learner and an adversary: at round t, the learner chooses a policy π_t from a policy class Π. In response, the adversary chooses a transition model M_t from a set of models ℳ and a loss function ℓ_t. The learner takes action a_t ∼ π_t(·|x_t), moves to state x_{t+1} ∼ M_t(·|x_t, a_t), and suffers loss ℓ_t(x_t, a_t). To simplify the discussion, we assume that the adversary is oblivious, i.e., its choices do not depend on the previous choices of the learner. We assume that ℓ_t takes values in [0,1]. In this paper, we study the full-information version of the game, where the learner observes the transition model M_t and the loss function ℓ_t at the end of round t. The game is shown in Figure 1. The objective of the learner is to suffer low loss over a period of T rounds, and its performance is measured by its regret: the difference between its total loss and the total loss it would have suffered had it followed the stationary policy in the comparison class Π that minimizes the total loss.

Even-Dar et al. (2004) prove a hardness result for MDP problems with adversarially chosen transition models. Their proof, however, seems to have gaps, as it assumes that the learner chooses a deterministic policy before observing the state at each round; note that an online learning algorithm only needs to choose an action at the current state and does not need to construct a complete deterministic policy at each round. Moreover, their hardness result applies to deterministic transition models, while we make a mixing assumption in our analysis. Thus, it is still an open problem whether it is possible to obtain a computationally efficient algorithm with sublinear regret in the general case.

Yu and Mannor (2009a, b) study the same setting, but obtain only a regret bound that scales with the amount of variation in the transition models. This regret bound can grow linearly with time.

Even-Dar et al. (2009) prove regret bounds for MDP problems with a fixed and known transition model and adversarially chosen loss functions. In this paper, we prove regret bounds for MDP problems with adversarially chosen transition models and loss functions. We are not aware of any earlier regret bound for this setting. Our approach is efficient as long as the comparison class has polynomial size and we can compute expectations over sample paths for each policy.

MDPs with changing transition kernels are good models for a wide range of problems, including dialogue systems, clinical trials, portfolio optimization, and two-player games such as poker.

3 Online MDP Problems

  Initial state: x_1 ∈ 𝒳
  for t = 1, 2, …, T do
      Learner chooses policy π_t ∈ Π
      Adversary chooses model M_t and loss function ℓ_t
      Learner takes action a_t ∼ π_t(·|x_t)
      Learner suffers loss ℓ_t(x_t, a_t)
      Update state x_{t+1} ∼ M_t(·|x_t, a_t)
      Learner observes M_t and ℓ_t
  end for

Figure 1: Online Markov Decision Processes

Let ALG be an online learning algorithm that generates a policy π_t at round t. Let x_t be the state at round t if we have followed the policies generated by algorithm ALG, and let a_t ∼ π_t(·|x_t). Similarly, x_t^π denotes the state at round t if we have chosen the same policy π up to time t. Let a_t^π ∼ π(·|x_t^π). The regret of algorithm ALG up to round T with respect to any policy π is defined by

R_T(π) = Σ_{t=1}^T E[ℓ_t(x_t, a_t)] − Σ_{t=1}^T E[ℓ_t(x_t^π, a_t^π)],

where the expectations are with respect to the random state transitions and the randomization of the algorithm. Note that the regret with respect to π is defined in terms of the sequence of states (x_t^π) that would have been visited under policy π. Our objective is to design an algorithm that achieves low regret with respect to any policy π ∈ Π.

In the absence of state variables, the problem reduces to a full-information online learning problem (Cesa-Bianchi and Lugosi, 2006). The difficulty with MDP problems is that, unlike full-information online learning problems, the choice of policy at each round changes the future states and losses. The main idea behind the design and the analysis of our algorithm is the following regret decomposition:

R_T(π) = (Σ_{t=1}^T E[ℓ_t(x_t^{π_t}, a_t^{π_t})] − Σ_{t=1}^T E[ℓ_t(x_t^π, a_t^π)]) + (Σ_{t=1}^T E[ℓ_t(x_t, a_t)] − Σ_{t=1}^T E[ℓ_t(x_t^{π_t}, a_t^{π_t})]) =: R_T^1 + R_T^2.   (1)

Notice that the choice of policies at other rounds has no influence over the losses appearing in R_T^1: the term E[ℓ_t(x_t^{π_t}, a_t^{π_t})] is the expected loss the learner would suffer at round t had the policy π_t been followed from the first round. Thus, R_T^1 can be bounded by a specific reduction to full-information online learning algorithms (to be specified later). Also, notice that the competitor policy π does not appear in R_T^2. In fact, R_T^2 depends only on the algorithm ALG. We will show that if algorithm ALG and the class of models ℳ satisfy the following two "smoothness" assumptions, then R_T^2 can be bounded by a sublinear term.

  • Assumption A1 (Rarely Changing Policies) Let e_t be the probability that algorithm ALG changes its policy at round t, e_t = P(π_t ≠ π_{t−1}). There exists a constant C > 0 such that for any t, any sequence of models M_1, …, M_T, and any sequence of loss functions ℓ_1, …, ℓ_T, e_t ≤ C√(log|Π| / T).

  • Assumption A2 (Uniform Mixing) There exists a constant τ > 0 such that for all distributions ν and ν' over the state space, any deterministic policy π, and any model M ∈ ℳ,

∥ν P^{(π,M)} − ν' P^{(π,M)}∥₁ ≤ e^{−1/τ} ∥ν − ν'∥₁.

As discussed by Neu et al. (2010), if Assumption A2 holds for deterministic policies, then it holds for all policies.

3.1 Full Information Algorithms

  N: number of experts, T: number of rounds.
  Initialize w_{1,i} = 1 for each expert i.
  η = √(log N / T).
  for t = 1, 2, …, T do
      For any i, p_{t,i} = w_{t,i} / Σ_{j=1}^N w_{t,j}.
      Draw I_t such that for any i, P(I_t = i) = p_{t,i}.
      Choose the action suggested by expert I_t.
      The adversary chooses loss function ℓ_t.
      The learner suffers loss ℓ_t(I_t).
      For each expert i, w_{t+1,i} = w_{t,i} e^{−η ℓ_t(i)}.
  end for

Figure 2: The EWA Algorithm
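The weight recursion of Figure 2 can be sketched in a few lines; this is an illustrative implementation (function and variable names are our own), assuming losses in [0, 1] and the learning rate η = √(log N / T) used above.

```python
import numpy as np

def ewa_distributions(losses, eta):
    """Exponentially Weighted Average over N experts.

    losses: array of shape (T, N) with entries in [0, 1]; returns the
    sequence of sampling distributions p_1, ..., p_T."""
    T, N = losses.shape
    w = np.ones(N)                        # w_{1,i} = 1
    dists = np.empty((T, N))
    for t in range(T):
        dists[t] = w / w.sum()            # p_{t,i} = w_{t,i} / sum_j w_{t,j}
        w = w * np.exp(-eta * losses[t])  # multiplicative weight update
    return dists

# Toy run: expert 0 is consistently better, so its probability grows.
losses = np.tile([0.1, 0.9], (50, 1))
p = ewa_distributions(losses, eta=np.sqrt(np.log(2) / 50))
```

The distribution starts uniform and concentrates on the lower-loss expert as the rounds progress.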

We would like to have a full-information online learning algorithm that rarely changes its policy. The first candidate that we consider is the well-known Exponentially Weighted Average (EWA) algorithm (Vovk, 1990; Littlestone and Warmuth, 1994), shown in Figure 2. In our MDP problem, the EWA algorithm chooses a policy at round t according to the distribution

p_t(π) ∝ exp(−η Σ_{s=1}^{t−1} ℓ_s^π),  where ℓ_s^π = E[ℓ_s(x_s^π, a_s^π)].

The policies that this EWA algorithm generates are most likely different in consecutive rounds, and thus the EWA algorithm might change its policy frequently. However, a variant of EWA, called Shrinking Dartboard (SD) (Geulen et al., 2010) and shown in Figure 3, satisfies Assumption A1. Our algorithm, called SD-MDP, is based on the SD algorithm and is shown in Figure 4. Notice that the algorithm needs to know the number of rounds, T, in advance.

  N: number of experts, T: number of rounds.
  η = √(log N / T).
  Initialize w_{1,i} = 1 for each expert i.
  Draw I_1 uniformly at random.
  for t = 1, 2, …, T do
      For any i, p_{t,i} = w_{t,i} / Σ_{j=1}^N w_{t,j}.
      With probability w_{t,I_{t−1}} / w_{t−1,I_{t−1}} choose the previously selected expert, I_t = I_{t−1}, and with the remaining probability choose I_t based on the distribution p_t.
      Learner takes the action suggested by expert I_t.
      The adversary chooses loss function ℓ_t.
      The learner suffers loss ℓ_t(I_t).
      For all experts i, w_{t+1,i} = w_{t,i} e^{−η ℓ_t(i)}.
  end for

Figure 3: The Shrinking Dartboard Algorithm
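The lazy-switching mechanism of SD can be sketched as follows. The keep probability w_t(i)/w_{t−1}(i) = e^{−η ℓ_{t−1}(i)} is the only difference from EWA; this is an illustrative sketch with hypothetical names, not the authors' code.

```python
import numpy as np

def shrinking_dartboard(losses, eta, rng):
    """Shrinking Dartboard over N experts: EWA weights with lazy switching.

    losses: array of shape (T, N) with entries in [0, 1]. At each round the
    previous expert i is kept with probability w_t[i] / w_{t-1}[i]
    (= exp(-eta * previous loss of i)); otherwise a fresh expert is drawn
    from the current weight distribution."""
    T, N = losses.shape
    w_prev = np.ones(N)
    i = rng.integers(N)                        # initial expert (w_1 is uniform)
    picks, switches = [], 0
    for t in range(T):
        w = w_prev * np.exp(-eta * losses[t - 1]) if t > 0 else w_prev
        if t > 0 and rng.random() >= w[i] / w_prev[i]:
            i = rng.choice(N, p=w / w.sum())   # redraw from current distribution
            switches += 1
        picks.append(i)
        w_prev = w
    return picks, switches
```

With η = 0 the keep probability is always 1 and the expert never changes; larger η trades a sharper weight distribution against more frequent switches.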

Consider a basic full-information problem with N experts. Let R_{T,i} be the regret of the SD algorithm with respect to expert i up to time T. We have the following results for the SD algorithm.

Theorem 1.

For any expert i,

R_{T,i} = Σ_{t=1}^T E[ℓ_t(I_t)] − Σ_{t=1}^T ℓ_t(i) ≤ 2√(T log N),

and also for any t,

P(I_t ≠ I_{t−1}) ≤ √(log N / T).
The proof of the regret bound can be found in (Geulen et al., 2010, Theorem 3). The proof of the bound on the probability of a switch is similar to the proof of Lemma 2 in (Geulen et al., 2010) and is as follows. As shown in (Geulen et al., 2010, Lemma 2), the probability of a switch at time t is

P(I_t ≠ I_{t−1}) = E[1 − w_{t,I_{t−1}} / w_{t−1,I_{t−1}}] = E[1 − e^{−η ℓ_{t−1}(I_{t−1})}].

Because the loss function is bounded in [0, 1], we have that

1 − e^{−η ℓ_{t−1}(I_{t−1})} ≤ η ℓ_{t−1}(I_{t−1}) ≤ η.

Thus, P(I_t ≠ I_{t−1}) ≤ η, and thus, with the choice η = √(log N / T), we get P(I_t ≠ I_{t−1}) ≤ √(log N / T). ∎
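The elementary inequality used in the last step, 1 − e^{−x} ≤ x, can be checked numerically for the switch probability (the values of N and T below are hypothetical):

```python
import numpy as np

# eta = sqrt(log N / T) for a hypothetical N = 10 experts and T = 1000 rounds
eta = np.sqrt(np.log(10) / 1000)
losses = np.linspace(0.0, 1.0, 101)

# exact probability of abandoning the previous expert after suffering `loss`
switch_prob = 1.0 - np.exp(-eta * losses)

assert np.all(switch_prob <= eta * losses + 1e-12)  # 1 - e^{-x} <= x
assert np.all(switch_prob <= eta)                   # hence bounded by eta
```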

  T: number of rounds.
  η = √(log |Π| / T).
  For all policies π ∈ Π, w_1(π) = 1.
  for t = 1, 2, …, T do
      For any π ∈ Π, p_t(π) = w_t(π) / Σ_{π'∈Π} w_t(π').
      With probability w_t(π_{t−1}) / w_{t−1}(π_{t−1}) choose the previous policy, π_t = π_{t−1}, while with the remaining probability choose π_t based on the distribution p_t.
      Learner takes the action a_t ∼ π_t(·|x_t).
      Adversary chooses transition model M_t and loss function ℓ_t.
      Learner suffers loss ℓ_t(x_t, a_t).
      Learner observes M_t and ℓ_t.
      Update state: x_{t+1} ∼ M_t(·|x_t, a_t).
      For all policies π ∈ Π, w_{t+1}(π) = w_t(π) e^{−η ℓ_t^π}, where ℓ_t^π = E[ℓ_t(x_t^π, a_t^π)].
  end for

Figure 4: SD-MDP: The Shrinking Dartboard Algorithm for Markov Decision Processes
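The weight update in Figure 4 needs the expected loss ℓ_t^π = E[ℓ_t(x_t^π, a_t^π)] for every policy in the class. Since the past models are observed, this expectation over sample paths can be computed by propagating state distributions forward rather than by sampling trajectories; the sketch below illustrates this (function names and the setup are hypothetical):

```python
import numpy as np

def policy_matrix(pi, M):
    # P[x, x'] = sum_a pi(a|x) * M(x'|x, a)
    return np.einsum('xa,xay->xy', pi, M)

def expected_path_losses(pi, models, loss_fns, x0):
    """Expected loss at each round t had policy pi been followed from the
    start, i.e. l_t^pi = E[l_t(x_t^pi, a_t^pi)], computed exactly by pushing
    the state distribution forward through the observed models."""
    n_states = models[0].shape[0]
    mu = np.zeros(n_states)
    mu[x0] = 1.0                                 # x_1 is the known initial state
    out = []
    for M_t, l_t in zip(models, loss_fns):
        out.append(mu @ (pi * l_t).sum(axis=1))  # <mu_t, l_t^pi>
        mu = mu @ policy_matrix(pi, M_t)         # push mu_t forward to mu_{t+1}
    return np.array(out)
```

Running this once per policy in Π at each round is what makes the approach efficient only for polynomially sized comparison classes.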

3.2 Analysis of the SD-MDP Algorithm

The main result of this section is the following regret bound for the SD-MDP algorithm.

Theorem 2.

Let the loss functions selected by the adversary be bounded in [0,1], and let the transition models selected by the adversary satisfy Assumption A2. Then, for any policy π ∈ Π,

R_T(π) ≤ 2(τ + 2)√(T log |Π|).

In the rest of this section, we write ALG to denote the SD-MDP algorithm. For the proof, we use the regret decomposition (1): R_T(π) = R_T^1 + R_T^2.

3.2.1 Bounding R_T^1

Lemma 3.

For any policy π ∈ Π,

R_T^1 = Σ_{t=1}^T E[ℓ_t(x_t^{π_t}, a_t^{π_t})] − Σ_{t=1}^T E[ℓ_t(x_t^π, a_t^π)] ≤ 2√(T log |Π|).

Proof. Consider the following imaginary game between a learner and an adversary: we have a set of experts (policies) Π. At round t, the adversary chooses a loss vector c_t ∈ [0,1]^{|Π|}, whose πth element c_t(π) determines the loss of expert π at this round. The learner chooses a distribution p_t over experts (defined by the SD algorithm), from which it draws an expert π_t. Next, the learner observes the loss vector c_t. From the regret bound for the SD algorithm (Theorem 1), it is guaranteed that for any expert π,

Σ_{t=1}^T E[c_t(π_t)] − Σ_{t=1}^T c_t(π) ≤ 2√(T log |Π|).

Next, we determine how the adversary chooses the loss vector. At time t, the adversary chooses a loss function ℓ_t and sets c_t(π) = E[ℓ_t(x_t^π, a_t^π)]. Noting that E[c_t(π_t)] = E[ℓ_t(x_t^{π_t}, a_t^{π_t})] and c_t(π) = E[ℓ_t(x_t^π, a_t^π)] finishes the proof. ∎

3.2.2 Bounding R_T^2

First, we prove the following two lemmas.

Lemma 4.

For any state distributions ν and μ, any transition model M, and any policies π and π',

∥ν P^{(π,M)} − μ P^{(π',M)}∥₁ ≤ ∥ν − μ∥₁ + max_x ∥π(·|x) − π'(·|x)∥₁.

Proof. The proof is easy and can be found in (Even-Dar et al., 2009, Lemma 5.1). ∎
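The key contraction property behind results of this kind, namely that a row-stochastic matrix cannot expand the L1 distance between two distributions, can be verified numerically (random instances, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6

# A random row-stochastic matrix: each row P[x, :] is a distribution.
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)

for _ in range(100):
    nu = rng.random(n); nu /= nu.sum()
    mu = rng.random(n); mu /= mu.sum()
    # One step of the chain can only shrink the L1 distance.
    assert np.abs((nu - mu) @ P).sum() <= np.abs(nu - mu).sum() + 1e-12
```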

Lemma 5.

Let e_t be the probability of a policy switch at time t, e_t = P(π_t ≠ π_{t−1}). Then, e_t ≤ √(log |Π| / T).

Proof. The proof is identical to the proof of Theorem 1. ∎

Lemma 6.

We have that

R_T^2 = Σ_{t=1}^T E[ℓ_t(x_t, a_t)] − Σ_{t=1}^T E[ℓ_t(x_t^{π_t}, a_t^{π_t})] ≤ 2(τ + 1)√(T log |Π|).

Proof. For a policy π, let ℓ_t^π denote the function ℓ_t^π(x) = Σ_a π(a|x) ℓ_t(x, a). Notice that the choices of policies are independent of the state variables. We can write

E[ℓ_t(x_t, a_t)] − E[ℓ_t(x_t^{π_t}, a_t^{π_t})] = E[⟨ν_t − μ_t, ℓ_t^{π_t}⟩] ≤ E[∥ν_t − μ_t∥₁],

where ν_t is the distribution of x_t for the SD-MDP algorithm and μ_t is the distribution of x_t^{π_t}, the state at round t had the policy π_t been followed from the first round.¹ Let S_t be the event of a policy switch at time t. From the inequality

∥ν_t − μ_t∥₁ ≤ e^{−1/τ} ∥ν_{t−1} − μ_{t−1}∥₁ + 2 P(S_t),

which follows from Lemma 4, Assumption A2, and the fact that the L1 distance between any two distributions is at most 2, and from Lemma 5, we get that

∥ν_t − μ_t∥₁ ≤ 2 Σ_{s=2}^t e_s e^{−(t−s)/τ} ≤ 2 √(log |Π| / T) Σ_{k=0}^∞ e^{−k/τ} ≤ 2(τ + 1)√(log |Π| / T),

since Σ_{k=0}^∞ e^{−k/τ} = 1/(1 − e^{−1/τ}) ≤ τ + 1. Let u_t = ∥ν_t − μ_t∥₁. We have that

Σ_{t=1}^T u_t ≤ 2(τ + 1)√(log |Π| / T) · T = 2(τ + 1)√(T log |Π|),

where we have used the fact that u_1 = 0, because the initial distributions are identical. Combining the last display with the first display of the proof, we get that R_T^2 ≤ 2(τ + 1)√(T log |Π|). ∎

¹ Notice that the event S_t involves only the policies, which are independent of the state variables.

What makes the analysis possible is the fact that all policies mix no matter what transition model is played by the adversary.
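The geometric-series bound used above, Σ_{k≥0} e^{−k/τ} = 1/(1 − e^{−1/τ}) ≤ τ + 1, can be checked numerically across a range of (hypothetical) mixing constants:

```python
import numpy as np

for tau in [0.5, 1.0, 2.0, 10.0, 100.0]:
    # sum_{k>=0} e^{-k/tau} is a geometric series with ratio e^{-1/tau}
    geometric_sum = 1.0 / (1.0 - np.exp(-1.0 / tau))
    assert geometric_sum <= tau + 1.0
```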

Proof of Theorem 2.

The result follows immediately by combining Lemmas 3 and 6: R_T(π) = R_T^1 + R_T^2 ≤ 2√(T log |Π|) + 2(τ + 1)√(T log |Π|) = 2(τ + 2)√(T log |Π|). ∎

The next corollary extends the result of Theorem 2 to continuous policy spaces.

Corollary 7.

Let Π be an arbitrary policy space, let N(Π, ε) be the ε-covering number of the space Π with respect to the distance d(π, π') = max_x ∥π(·|x) − π'(·|x)∥₁, and let Π_ε be an ε-cover of Π. Assume that we run the SD-MDP algorithm on Π_ε. Then, under the same assumptions as in Theorem 2, for any policy π ∈ Π,

R_T(π) ≤ 2(τ + 2)√(T log N(Π, ε)) + (τ + 2)εT.

Proof. Let V_T(π) = Σ_{t=1}^T E[ℓ_t(x_t^π, a_t^π)] be the value of policy π. First, we prove that the value function is Lipschitz with respect to d with Lipschitz constant (τ + 2)T. The argument is similar to the argument in the proof of Lemma 6. For any π and π',

|V_T(π) − V_T(π')| ≤ Σ_{t=1}^T (∥μ_t^π − μ_t^{π'}∥₁ + d(π, π')),

where μ_t^π denotes the distribution of x_t^π. With an argument similar to the one in the proof of Lemma 6, we can show that

∥μ_t^π − μ_t^{π'}∥₁ ≤ (τ + 1) d(π, π'),

so that |V_T(π) − V_T(π')| ≤ (τ + 2) T d(π, π'). Given this and the fact that for any policy π ∈ Π there is a policy π' ∈ Π_ε such that d(π, π') ≤ ε, we get that

R_T(π) ≤ R_T(π') + (τ + 2)εT ≤ 2(τ + 2)√(T log N(Π, ε)) + (τ + 2)εT.

In particular, if Π is the space of all policies, then log N(Π, ε) = O(|𝒳||𝒜| log(1/ε)), so the regret is no more than

O((τ + 2)√(T |𝒳||𝒜| log(1/ε)) + (τ + 2)εT).

By the choice of ε = 1/√T, we get that R_T(π) = O((τ + 2)√(T |𝒳||𝒜| log T)). ∎


  • Cesa-Bianchi and Lugosi (2006) Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.
  • Even-Dar et al. (2004) Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. Experts in a Markov decision process. In NIPS, 2004.
  • Even-Dar et al. (2009) Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
  • Geulen et al. (2010) Sascha Geulen, Berthold Vöcking, and Melanie Winkler. Regret minimization for online buffering problems using the weighted majority algorithm. In COLT, 2010.
  • Littlestone and Warmuth (1994) Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Inf. Comput., 108(2):212–261, 1994.
  • Neu et al. (2010) Gergely Neu, András György, Csaba Szepesvári, and András Antos. Online Markov decision processes under bandit feedback. In NIPS, 2010.
  • Vovk (1990) Vladimir Vovk. Aggregating strategies. In COLT, pages 372–383, 1990.
  • Yu and Mannor (2009a) Jia Yuan Yu and Shie Mannor. Arbitrarily modulated Markov decision processes. In IEEE Conference on Decision and Control, 2009a.
  • Yu and Mannor (2009b) Jia Yuan Yu and Shie Mannor. Online learning in Markov decision processes with arbitrarily changing rewards and transitions. In GameNets, 2009b.