1 Notation
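Let $\mathcal{X}$ be a finite state space and $\mathcal{A}$ be a finite action space. Let $\Delta_S$ be the space of probability distributions over a set $S$. Define a policy $\pi$ as a mapping from the state space to $\Delta_{\mathcal{A}}$, $\pi : \mathcal{X} \to \Delta_{\mathcal{A}}$. We use $\pi(a|x)$ to denote the probability of choosing action $a$ in state $x$ under policy $\pi$. A random action under policy $\pi$ is denoted by $\pi(x)$. A transition probability kernel (or transition model) $m$ is a mapping from the direct product of the state and action spaces to $\Delta_{\mathcal{X}}$: $m : \mathcal{X} \times \mathcal{A} \to \Delta_{\mathcal{X}}$. Let $P(\pi, m)$ be the transition probability matrix of policy $\pi$ under transition model $m$. A loss function is a bounded real-valued function over state and action spaces, $\ell : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$. For a vector $v$, define $\|v\|_1 = \sum_i |v_i|$. For a real-valued function $f$ defined over $\mathcal{X} \times \mathcal{A}$, define $\|f\|_{\infty,1} = \max_{x \in \mathcal{X}} \sum_{a \in \mathcal{A}} |f(x,a)|$. The inner product between two vectors $v$ and $w$ is denoted by $\langle v, w \rangle$.
2 Introduction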
Consider the following game between a learner and an adversary: at round $t$, the learner chooses a policy $\pi_t$ from a policy class $\Pi$. In response, the adversary chooses a transition model $m_t$ from a set of models $M$ and a loss function $\ell_t$. The learner takes action $a_t \sim \pi_t(\cdot|x_t)$, moves to state $x_{t+1} \sim m_t(\cdot|x_t, a_t)$, and suffers loss $\ell_t(x_t, a_t)$. To simplify the discussion, we assume that the adversary is oblivious, i.e., its choices do not depend on the previous choices of the learner. We assume that $\ell_t \in [0,1]$. In this paper, we study the full-information version of the game, where the learner observes the transition model $m_t$ and the loss function $\ell_t$ at the end of round $t$. The game is shown in Figure 1. The objective of the learner is to suffer low loss over a period of $T$ rounds. The performance of the learner is measured by its regret with respect to the total loss it would have incurred had it followed the stationary policy in the comparison class that minimizes the total loss.
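The protocol of a single round can be sketched in a few lines of Python. This is only an illustration: the nested-list encoding (`policy[x][a]`, `model[x][a][y]`, `loss[x][a]`), the helper names, and the two-state toy instance are assumptions made here and do not come from the paper.

```python
import random

def sample(dist, rng):
    """Draw an index from a probability vector."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

def play_round(x, policy, model, loss, rng):
    """Play one round from state x and return (action, next state, loss).

    policy[x][a]   : probability of action a in state x
    model[x][a][y] : probability of moving to state y from (x, a)
    loss[x][a]     : loss of taking action a in state x, in [0, 1]
    """
    a = sample(policy[x], rng)          # action drawn from the current policy
    x_next = sample(model[x][a], rng)   # next state drawn from the adversary's model
    return a, x_next, loss[x][a]

# Hypothetical two-state, two-action instance with a deterministic policy.
rng = random.Random(0)
policy = [[1.0, 0.0], [0.0, 1.0]]
model = [[[0.0, 1.0], [1.0, 0.0]],
         [[1.0, 0.0], [0.0, 1.0]]]
loss = [[0.25, 1.0], [1.0, 0.5]]
a, x_next, incurred = play_round(0, policy, model, loss, rng)
```

In the full-information game the learner would receive `model` and `loss` only after the round has been played.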
Even-Dar et al. (2004) prove a hardness result for MDP problems with adversarially chosen transition models. Their proof, however, seems to have gaps, as it assumes that the learner chooses a deterministic policy before observing the state at each round. Note that an online learning algorithm only needs to choose an action at the current state and does not need to construct a complete deterministic policy at each round. Their hardness result applies to deterministic transition models, while we make a mixing assumption in our analysis. Thus, it is still an open problem whether it is possible to obtain a computationally efficient algorithm with sublinear regret.
Yu and Mannor (2009a, b) study the same setting, but obtain only a regret bound that scales with the amount of variation in the transition models. This regret bound can grow linearly with time.
Even-Dar et al. (2009) prove regret bounds for MDP problems with a fixed and known transition model and adversarially chosen loss functions. In this paper, we prove regret bounds for MDP problems with adversarially chosen transition models and loss functions. We are not aware of any earlier regret bound for this setting. Our approach is efficient as long as the comparison class is polynomial and we can compute expectations over sample paths for each policy.
MDPs with changing transition kernels are good models for a wide range of problems, including dialogue systems, clinical trials, portfolio optimization, and two-player games such as poker.
3 Online MDP Problems
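Let $B$ be an online learning algorithm that generates a policy $\pi_t$ at round $t$. Let $x_t^B$ be the state at round $t$ if we have followed the policies generated by algorithm $B$. Similarly, $x_t^\pi$ denotes the state if we have chosen the same policy $\pi$ up to time $t$. Let $\ell(x, \pi) = \ell(x, \pi(x))$. The regret of algorithm $B$ up to round $T$ with respect to any policy $\pi \in \Pi$ is defined by
$$R_T(B, \pi) = \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(x_t^B, a_t)\right] - \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(x_t^\pi, \pi)\right],$$
where $a_t \sim \pi_t(\cdot|x_t^B)$. Note that the regret with respect to $\pi$ is defined in terms of the sequence of states $x_t^\pi$ that would have been visited under policy $\pi$. Our objective is to design an algorithm that achieves low regret with respect to any policy $\pi$.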
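For a fixed sequence of played policies that does not depend on the visited states, both expectations in the regret can be evaluated exactly by propagating the state distribution. The following minimal sketch does this; the nested-list encoding and the names `step`, `expected_total_loss`, and `regret` are illustrative assumptions, and the code conditions on one realized policy sequence rather than averaging over the algorithm's internal randomness.

```python
def step(d, policy, model):
    """Push the state distribution d one round forward under (policy, model)."""
    n = len(d)
    return [sum(d[x] * policy[x][a] * model[x][a][y]
                for x in range(n) for a in range(len(policy[x])))
            for y in range(n)]

def expected_total_loss(policies, models, losses, d0):
    """Expected total loss when policies[t] is used in round t, starting from d0."""
    d, total = list(d0), 0.0
    for policy, model, loss in zip(policies, models, losses):
        total += sum(d[x] * policy[x][a] * loss[x][a]
                     for x in range(len(d)) for a in range(len(policy[x])))
        d = step(d, policy, model)
    return total

def regret(played, comparator, models, losses, d0):
    """Loss of the played sequence minus the loss of a fixed comparator policy."""
    fixed = [comparator] * len(models)
    return (expected_total_loss(played, models, losses, d0)
            - expected_total_loss(fixed, models, losses, d0))

# Hypothetical instance: action 0 keeps the state, action 1 flips it;
# state 1 costs 1 per round and state 0 costs nothing.
stay = [[1.0, 0.0], [1.0, 0.0]]
flip = [[0.0, 1.0], [0.0, 1.0]]
model = [[[1.0, 0.0], [0.0, 1.0]],
         [[0.0, 1.0], [1.0, 0.0]]]
loss = [[0.0, 0.0], [1.0, 1.0]]
models, losses, d0 = [model] * 3, [loss] * 3, [1.0, 0.0]
r = regret([flip, stay, stay], stay, models, losses, d0)
```

In this toy instance a single early `flip` moves the learner to the costly state, and the later `stay` rounds keep paying for it, which illustrates why past policy choices matter in the MDP setting.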
In the absence of state variables, the problem reduces to a full information online learning problem (Cesa-Bianchi and Lugosi, 2006). The difficulty with MDP problems is that, unlike the full information online learning problems, the choice of policy at each round changes the future states and losses. The main idea behind the design and the analysis of our algorithm is the following regret decomposition:
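$$R_T(B, \pi) = \left(\mathbb{E}\left[\sum_{t=1}^{T} \ell_t(x_t^B, a_t)\right] - \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(x_t^{\pi_t}, \pi_t)\right]\right) + \left(\mathbb{E}\left[\sum_{t=1}^{T} \ell_t(x_t^{\pi_t}, \pi_t)\right] - \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(x_t^\pi, \pi)\right]\right), \tag{1}$$
where $x_t^{\pi_t}$ is the state at round $t$ had policy $\pi_t$ been followed in all previous rounds. Let
$$B_T(B) = \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(x_t^B, a_t)\right] - \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(x_t^{\pi_t}, \pi_t)\right], \qquad C_T(B, \pi) = \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(x_t^{\pi_t}, \pi_t)\right] - \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(x_t^\pi, \pi)\right].$$
Notice that the choice of policies has no influence over future losses in $C_T(B, \pi)$. Thus, $C_T(B, \pi)$ can be bounded by a specific reduction to full information online learning algorithms (to be specified later). Also, notice that the competitor policy $\pi$ does not appear in $B_T(B)$. In fact, $B_T(B)$ depends only on the algorithm $B$. We will show that if algorithm $B$ and the class of models satisfy the following two “smoothness” assumptions, then $B_T(B)$ can be bounded by a sublinear term.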

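Assumption A1 Rarely Changing Policies Let $\alpha_t = P(\pi_t \neq \pi_{t-1})$ be the probability that algorithm $B$ changes its policy at round $t$. There exists a constant $D$ such that for any $1 \le t \le T$, any sequence of models $m_1, \dots, m_t$ and loss functions $\ell_1, \dots, \ell_t$, $\alpha_t \le D/\sqrt{t}$.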

Assumption A2 Uniform Mixing There exists a constant $\tau > 0$ such that for all distributions $d$ and $d'$ over the state space, any deterministic policy $\pi$, and any model $m \in M$,
$$\|d P(\pi, m) - d' P(\pi, m)\|_1 \le e^{-1/\tau} \|d - d'\|_1.$$
As discussed by Neu et al. (2010), if Assumption A2 holds for deterministic policies, then it holds for all policies.
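A mixing condition of this kind can be checked numerically for a given transition matrix. The sketch below assumes the usual contraction form $\|dP - d'P\|_1 \le e^{-1/\tau}\|d - d'\|_1$ and uses the standard fact that the smallest valid contraction factor of a row-stochastic matrix is half the largest $\ell_1$ distance between two of its rows (the Dobrushin coefficient). The function names and the example matrix are illustrative assumptions.

```python
import math

def contraction_coefficient(P):
    """Smallest c with ||dP - d'P||_1 <= c * ||d - d'||_1 for all distributions d, d'."""
    n = len(P)
    return max(0.5 * sum(abs(P[x][y] - P[z][y]) for y in range(n))
               for x in range(n) for z in range(n))

def mixing_time(P):
    """Smallest tau with exp(-1/tau) >= contraction coefficient (inf if none exists)."""
    c = contraction_coefficient(P)
    if c >= 1.0:
        return math.inf   # P does not contract, so no finite tau works
    if c == 0.0:
        return 0.0        # every tau > 0 works; report the infimum
    return -1.0 / math.log(c)

# Hypothetical transition matrix P(pi, m) of one policy under one model.
P = [[0.9, 0.1],
     [0.2, 0.8]]
c = contraction_coefficient(P)
tau = mixing_time(P)
```

To check the assumption for a whole model class, one would take the largest such value over all deterministic policies and all models in the class.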
3.1 Full Information Algorithms
We would like to have a full information online learning algorithm that rarely changes its policy. The first candidate that we consider is the well-known Exponentially Weighted Average (EWA) algorithm (Vovk, 1990; Littlestone and Warmuth, 1994) shown in Figure 2. In our MDP problem, the EWA algorithm chooses a policy $\pi_t \in \Pi$ according to the distribution
$$q_t(\pi) \propto \exp\left(-\lambda \sum_{s=1}^{t-1} \mathbb{E}\left[\ell_s(x_s^\pi, \pi)\right]\right), \qquad \lambda > 0. \tag{2}$$
The policies that this EWA algorithm generates are most likely different in consecutive rounds, and thus the EWA algorithm might change its policy frequently. However, a variant of EWA, called Shrinking Dartboard (SD) (Geulen et al., 2010) and shown in Figure 3, satisfies Assumption A1. Our algorithm, called SD-MDP, is based on the SD algorithm and is shown in Figure 4. Notice that the algorithm needs to know the number of rounds, $T$, in advance.
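The lazy resampling idea behind Shrinking Dartboard can be sketched as follows. This is a hedged illustration of the scheme described by Geulen et al. (2010), not a transcription of Figure 3: the multiplicative update $(1-\eta)^{\text{loss}}$, the learning rate $\eta = \min\{\sqrt{\log N / T}, 1/2\}$, and all function and variable names are assumptions of this sketch.

```python
import math
import random

def shrinking_dartboard(loss_rounds, n_experts, horizon, rng):
    """Return the expert played in each round under a lazy weighted-majority rule.

    loss_rounds : iterable of loss vectors, one entry in [0, 1] per expert
    horizon     : number of rounds T, assumed known in advance
    """
    eta = min(math.sqrt(math.log(n_experts) / horizon), 0.5)
    weights = [1.0] * n_experts
    current = rng.randrange(n_experts)   # first expert drawn uniformly
    played = []
    for losses in loss_rounds:
        played.append(current)
        # Multiplicative update; rescale so the largest weight stays at 1.
        weights = [w * (1.0 - eta) ** l for w, l in zip(weights, losses)]
        top = max(weights)
        weights = [w / top for w in weights]
        # Keep the current expert with probability w_new / w_old = (1 - eta)^loss.
        keep_prob = (1.0 - eta) ** losses[current]
        if rng.random() >= keep_prob:
            # Otherwise redraw from the distribution proportional to the weights.
            current = rng.choices(range(n_experts), weights=weights, k=1)[0]
    return played

rng = random.Random(0)
no_loss_run = shrinking_dartboard([[0.0, 0.0, 0.0]] * 50, 3, 50, rng)
```

An expert that suffers no loss is never abandoned, which is the mechanism that keeps the number of policy changes small.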
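Consider a basic full information problem with $N$ experts. Let $R_{T,i}$ be the regret of the SD algorithm with respect to expert $i$ up to time $T$. We have the following results for the SD algorithm.
Theorem 1.
For any expert $i \in \{1, \dots, N\}$,
$$R_{T,i} \le 4\sqrt{T \log N} + \log N,$$
and also for any $1 \le t \le T$,
$$P(\text{switch at time } t) \le \sqrt{\frac{\log N}{T}}.$$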
Proof.
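The proof of the regret bound can be found in (Geulen et al., 2010, Theorem 3). The proof of the bound on the probability of a switch is similar to the proof of Lemma 2 in (Geulen et al., 2010) and is as follows. Let $w_{i,t} = w_{i,t-1}(1-\eta)^{\ell_t(i)}$ be the weight of expert $i$ after round $t$, with $\eta = \min\{\sqrt{\log N / T}, 1/2\}$, and let $W_t = \sum_{i=1}^{N} w_{i,t}$. As shown in (Geulen et al., 2010, Lemma 2), the probability of a switch at time $t$ is
$$\alpha_t = \frac{W_{t-1} - W_t}{W_{t-1}}.$$
Thus, $\alpha_t W_{t-1} = W_{t-1} - W_t$. Because the loss function is bounded in $[0,1]$, we have that
$$W_{t-1} - W_t = \sum_{i=1}^{N} w_{i,t-1}\left(1 - (1-\eta)^{\ell_t(i)}\right) \le \eta \sum_{i=1}^{N} w_{i,t-1}\,\ell_t(i) \le \eta\, W_{t-1}.$$
Thus, $\alpha_t \le \eta$, and thus,
$$\alpha_t \le \sqrt{\frac{\log N}{T}}.$$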
∎
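The switching bound can also be checked numerically. The sketch below assumes the multiplicative update $w \leftarrow w\,(1-\eta)^{\text{loss}}$ and takes the resampling probability to be the relative drop in total weight, as in Geulen et al. (2010, Lemma 2). The helper name, the random losses, and the problem sizes are illustrative choices.

```python
import math
import random

def resample_probability(weights, losses, eta):
    """Relative drop in total weight after one multiplicative update."""
    new_weights = [w * (1.0 - eta) ** l for w, l in zip(weights, losses)]
    total, new_total = sum(weights), sum(new_weights)
    return (total - new_total) / total, new_weights

rng = random.Random(1)
n_experts, horizon = 8, 200
eta = min(math.sqrt(math.log(n_experts) / horizon), 0.5)

weights = [1.0] * n_experts
worst = 0.0
for _ in range(horizon):
    losses = [rng.random() for _ in range(n_experts)]   # arbitrary losses in [0, 1]
    prob, weights = resample_probability(weights, losses, eta)
    worst = max(worst, prob)

# The drop equals eta exactly when every expert suffers the maximal loss 1.
extreme, _ = resample_probability([1.0] * n_experts, [1.0] * n_experts, eta)
```

The inequality behind the check is Bernoulli's: $(1-\eta)^{\ell} \ge 1 - \eta\ell$ for $\ell \in [0,1]$ and $\eta \in [0,1]$, so each weight loses at most a fraction $\eta$ of its value per round.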
3.2 Analysis of the SD-MDP Algorithm
The main result of this section is the following regret bound for the SD-MDP algorithm.
Theorem 2.
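Let the loss functions selected by the adversary be bounded in $[0,1]$, and the transition models selected by the adversary satisfy Assumption A2. Then, for any policy $\pi \in \Pi$,
$$R_T(\text{SD-MDP}, \pi) \le (4 + 2\tau^2)\sqrt{T \log|\Pi|} + \log|\Pi|.$$
In the rest of this section, we write $B$ to denote the SD-MDP algorithm. For the proof we use the regret decomposition (1):
$$R_T(B, \pi) = B_T(B) + C_T(B, \pi).$$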
3.2.1 Bounding $C_T(B, \pi)$
Lemma 3.
For any policy $\pi \in \Pi$,
$$C_T(B, \pi) \le 4\sqrt{T \log|\Pi|} + \log|\Pi|.$$
Proof.
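Consider the following imaginary game between a learner and an adversary: we have a set of experts (policies) $\Pi = \{\pi^1, \dots, \pi^{|\Pi|}\}$. At round $t$, the adversary chooses a loss vector $c_t \in [0,1]^{|\Pi|}$, whose $i$th element determines the loss of expert $\pi^i$ at this round. The learner chooses a distribution $q_t$ over experts (defined by the SD algorithm), from which it draws an expert $\pi_t$. Next, the learner observes the loss vector $c_t$. From the regret bound for the SD algorithm (Theorem 1), it is guaranteed that for any expert $\pi$,
$$\sum_{t=1}^{T} \langle c_t, q_t \rangle - \sum_{t=1}^{T} c_t(\pi) \le 4\sqrt{T \log|\Pi|} + \log|\Pi|.$$
Next, we determine how the adversary chooses the loss vector. At time $t$, the adversary chooses a loss function $\ell_t$ and sets $c_t(\pi^i) = \mathbb{E}[\ell_t(x_t^{\pi^i}, \pi^i)]$. Noting that $\langle c_t, q_t \rangle = \mathbb{E}[\ell_t(x_t^{\pi_t}, \pi_t)]$ and $c_t(\pi) = \mathbb{E}[\ell_t(x_t^\pi, \pi)]$ finishes the proof. ∎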
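The reduction in this proof treats every policy as an expert whose loss at round $t$ is its expected loss along its own state distribution. A minimal sketch of how such loss vectors could be computed is given below; the nested-list encoding, the function names, and the toy instance are assumptions of this sketch, and the cost grows linearly with the number of policies in the comparison class.

```python
def push_forward(d, policy, model):
    """State distribution after one round under (policy, model)."""
    n = len(d)
    nxt = [0.0] * n
    for x in range(n):
        for a, pa in enumerate(policy[x]):
            for y in range(n):
                nxt[y] += d[x] * pa * model[x][a][y]
    return nxt

def expected_loss(d, policy, loss):
    """Expected one-round loss when the state is distributed as d."""
    return sum(d[x] * pa * loss[x][a]
               for x in range(len(d)) for a, pa in enumerate(policy[x]))

def expert_loss_vectors(policies, rounds, d0):
    """Return one loss vector per round, with one entry per candidate policy.

    rounds is a sequence of (model, loss) pairs revealed by the adversary.
    """
    dists = [list(d0) for _ in policies]   # each policy tracks its own state distribution
    vectors = []
    for model, loss in rounds:
        vectors.append([expected_loss(d, p, loss) for d, p in zip(dists, policies)])
        dists = [push_forward(d, p, model) for d, p in zip(dists, policies)]
    return vectors

# Hypothetical instance: action 0 keeps the state, action 1 flips it; state 1 costs 1.
stay = [[1.0, 0.0], [1.0, 0.0]]
flip = [[0.0, 1.0], [0.0, 1.0]]
model = [[[1.0, 0.0], [0.0, 1.0]],
         [[0.0, 1.0], [1.0, 0.0]]]
loss = [[0.0, 0.0], [1.0, 1.0]]
vectors = expert_loss_vectors([stay, flip], [(model, loss)] * 3, [1.0, 0.0])
```

Feeding these vectors to a full information experts algorithm such as SD gives the policy sequence whose regret the lemma controls.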
3.2.2 Bounding $B_T(B)$
First, we prove the following two lemmas.
Lemma 4.
For any state distribution $d$, any transition model $m$, and any policies $\pi$ and $\pi'$,
$$\|d P(\pi, m) - d P(\pi', m)\|_1 \le \|\pi - \pi'\|_{\infty,1}.$$
Proof.
The proof is straightforward and can be found in Even-Dar et al. (2009, Lemma 5.1). ∎
Lemma 5.
Let $\alpha_t$ be the probability of a policy switch at time $t$. Then, $\alpha_t \le \sqrt{\log|\Pi| / T}$.
Proof.
The proof is identical to the proof of Theorem 1. ∎
Lemma 6.
We have that
$$B_T(B) \le 2\tau^2 \sqrt{T \log|\Pi|}.$$
Proof.
Let . Notice that the choice of policies is independent of the state variables. We can write
(3) 
where is the distribution of for and is the distribution of for . (Footnote 1: Notice that contains only policies, which are independent of the state variables.) Let be the event of a policy switch at time . From inequality
and Lemma 5, we get that
(4) 
Let . We have that
(5) 
where we have used the fact that , because the initial distributions are identical. By (4) and (5), we get that
∎
What makes the analysis possible is the fact that all policies mix no matter what transition model is played by the adversary.
The next corollary extends the result of Theorem 2 to continuous policy spaces.
Corollary 7.
Let $\Pi$ be an arbitrary policy space, $N(\epsilon)$ be the $\epsilon$-covering number of the space $\Pi$, and $C(\epsilon)$ be an $\epsilon$-cover. Assume that we run the SD-MDP algorithm on $C(\epsilon)$. Then, under the same assumptions as in Theorem 2, for any policy $\pi \in \Pi$,
Proof.
Let be the value of policy . Let . First, we prove that the value function is Lipschitz with Lipschitz constant . The argument is similar to the argument in the proof of Lemma 6. For any and ,
With an argument similar to the one in the proof of Lemma 6, we can show that
Thus,
Given this and the fact that for any policy , there is a policy such that , we get that
∎
In particular if is the space of all policies, , so regret is no more than
By the choice of , we get that .
References
 Cesa-Bianchi and Lugosi (2006) Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.
 Even-Dar et al. (2004) Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. Experts in a Markov decision process. In NIPS, 2004.
 Even-Dar et al. (2009) Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
 Geulen et al. (2010) Sascha Geulen, Berthold Vöcking, and Melanie Winkler. Regret minimization for online buffering problems using the weighted majority algorithm. In COLT, 2010.
 Littlestone and Warmuth (1994) Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Inf. Comput., 108(2):212–261, 1994.
 Neu et al. (2010) Gergely Neu, András György, Csaba Szepesvári, and András Antos. Online Markov decision processes under bandit feedback. In NIPS, 2010.
 Vovk (1990) Vladimir Vovk. Aggregating strategies. In COLT, pages 372–383, 1990.
 Yu and Mannor (2009a) Jia Yuan Yu and Shie Mannor. Arbitrarily modulated Markov decision processes. In IEEE Conference on Decision and Control, 2009a.
 Yu and Mannor (2009b) Jia Yuan Yu and Shie Mannor. Online learning in Markov decision processes with arbitrarily changing rewards and transitions. In GameNets, 2009b.