
Online learning in MDPs with side information

by   Yasin Abbasi-Yadkori, et al.

We study online learning of finite Markov decision process (MDP) problems when a side information vector is available. The problem is motivated by applications such as clinical trials and recommendation systems. Such applications have an episodic structure, where each episode corresponds to a patient or customer. Our objective is to compete with the optimal dynamic policy that can take side information into account. We propose a computationally efficient algorithm and show that its regret is at most O(√T), where T is the number of rounds. To the best of our knowledge, this is the first regret bound for this setting.



1 Introduction

We study online learning of finite Markov decision process (MDP) problems when a side information vector is available. The problem is motivated by applications such as clinical trials (Lavori and Dawson, 2000, Murphy et al., 2001), recommendation systems (Li et al., 2010), etc.

For example, consider a multi-stage treatment strategy that specifies which treatments should be applied to a patient, given the patient’s responses to past treatments. Each patient is characterized by the outcomes of several tests that are performed before the treatment begins. We collect these test results in a side information vector. A simple universal strategy uses the same policy to make recommendations for all patients, independent of the side information vector. Ideally, we would like to have treatment strategies that are adapted to each patient’s characteristics.

The problem can be modeled as an MDP problem with an infinite state space. The state variable contains the side information and the patient responses up to the current stage in the treatment. Although there are regret bounds for MDP problems with infinite state spaces (Abbasi-Yadkori and Szepesvári, 2011, Abbasi-Yadkori, 2012, Ortner and Ryabko, 2012), the proposed algorithms can be computationally expensive.

Alternatively, we can model the problem as an MDP problem with changing rewards and transition probability kernels. There is, however, no computationally efficient algorithm with a performance guarantee for this setting.

In this paper, we model such decision problems with Markov decision processes where the transition and reward functions are allowed to depend on the side information given for each new problem instance. Using our previous example, the side information corresponds to the results of the tests preceding the treatment, actions correspond to different treatment options, and the states are given by the outcome of the applied treatments. Every new patient corresponds to a new episode in the decision problem, where the transitions and rewards characterizing the treatment procedure are influenced by the history of the patient in question. In what follows, we precisely formulate the outlined decision problem and provide a principled way of utilizing side information to maximize rewards.

2 Background

To set up our goals, we need to fix some notation. Let $\|v\|$ denote the norm of a vector $v$. A finite episodic Markov decision process (MDP) is characterized by its finite state space $\mathcal{S}$, finite action space $\mathcal{A}$, transition function $P$, and reward function $r$. An episodic MDP also has a few special states, the starting state and some terminal states: each episode starts from the designated starting state and ends when it reaches a terminal state. When, in addition, the state space has a layered structure with respect to the transitions, we get the so-called loop-free variant of episodic MDPs. The layered structure of the state space means that $\mathcal{S} = \bigcup_{l=0}^{L} \mathcal{S}_l$, where $\mathcal{S}_l$ is called the $l$th layer of the state space, $\mathcal{S}_l \cap \mathcal{S}_k = \emptyset$ for all $l \neq k$, and the agent can only move between consecutive layers. That is, for any $s \in \mathcal{S}_l$ and $a \in \mathcal{A}$, $P(s' \mid s, a) = 0$ if $s' \notin \mathcal{S}_{l+1}$. In particular, each episode starts at layer $0$, from state $s_0$. In every state $s_l \in \mathcal{S}_l$, the learner chooses an action $a_l \in \mathcal{A}$, earns some reward $r(s_l, a_l)$, and is eventually transferred to state $s_{l+1} \sim P(\cdot \mid s_l, a_l)$. The episode ends when the learner reaches any state $s_L$ belonging to the last layer $\mathcal{S}_L$. This assumption is equivalent to assuming that each trajectory consists of exactly $L$ transitions.[1] This framework is a natural fit for episodic problems where time is part of the state variable. Figure 1 shows an example of a loop-free episodic MDP. For any state $s \in \mathcal{S}$ we will use $l_s$ to denote the index of the layer $s$ belongs to, that is, $l_s = l$ if $s \in \mathcal{S}_l$.

[1] Note that all loop-free state spaces can be transformed to one that satisfies our assumptions with no significant increase in the size of the problem. A simple transformation algorithm is given in Appendix A of György et al. (2007).

Figure 1: An example of a loop-free episodic Markov decision process when two actions, $a_1$ (“up”) and $a_2$ (“down”), are available in all states. Nonzero transition probabilities under each action are indicated with arrows between circles representing states. When there are two successor states, the more probable successor, which lies in the intended direction, is connected with a solid arrow; dashed arrows indicate less probable transitions.

A deterministic policy (or, in short, a policy) is a mapping $\pi : \mathcal{S} \to \mathcal{A}$. We say that a policy $\pi$ is followed in an episodic MDP problem if the action in state $s \in \mathcal{S}$ is set to be $\pi(s)$, independently of previous states and actions. The set of all deterministic policies will be denoted by $\Pi$. A random path $\mathbf{u} = (s_0, a_0, \ldots, s_{L-1}, a_{L-1}, s_L)$ is said to be generated by policy $\pi$ under the transition model $P$ if the initial state is $s_0$ and $s_{l+1} \in \mathcal{S}_{l+1}$ is drawn from $P(\cdot \mid s_l, \pi(s_l))$ for all $l = 0, 1, \ldots, L-1$. We denote this relation by $\mathbf{u} \sim (\pi, P)$. Define the value of a policy $\pi$, given a fixed reward function $r$ and a transition model $P$, as

$$W(r, \pi, P) = \mathbb{E}\left[\, \sum_{l=0}^{L-1} r\big(s_l, \pi(s_l)\big) \;\middle|\; \mathbf{u} \sim (\pi, P) \right],$$

that is, the expected sum of rewards gained when following $\pi$ in the MDP defined by $r$ and $P$.
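The definitions above can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the three-layer example, the state and action names, and the helper names `check_loop_free` and `policy_value` are all hypothetical and do not come from the paper. It checks the layered (loop-free) structure and computes the value of a deterministic policy by backward induction over the layers.

```python
from typing import Dict, List, Tuple

State = str
Action = str
# P[(s, a)] maps each successor state to its transition probability.
Transitions = Dict[Tuple[State, Action], Dict[State, float]]
Rewards = Dict[Tuple[State, Action], float]


def check_loop_free(layers: List[List[State]], P: Transitions) -> bool:
    """Return True if every transition moves from layer l to layer l + 1."""
    layer_of = {s: l for l, layer in enumerate(layers) for s in layer}
    for (s, _a), successors in P.items():
        if abs(sum(successors.values()) - 1.0) > 1e-9:
            return False  # not a probability distribution
        for s_next, prob in successors.items():
            if prob > 0 and layer_of[s_next] != layer_of[s] + 1:
                return False  # transition skips a layer or loops back
    return True


def policy_value(layers: List[List[State]], P: Transitions, r: Rewards,
                 policy: Dict[State, Action]) -> float:
    """Expected sum of rewards of a deterministic policy (backward induction)."""
    value = {s: 0.0 for s in layers[-1]}  # no reward is earned in the last layer
    for layer in reversed(layers[:-1]):
        for s in layer:
            a = policy[s]
            value[s] = r[(s, a)] + sum(p * value[s2] for s2, p in P[(s, a)].items())
    (s0,) = layers[0]  # layer 0 contains only the starting state
    return value[s0]


# Hypothetical example with L = 2 transitions per episode.
layers = [["s0"], ["u", "d"], ["end"]]
P = {
    ("s0", "up"): {"u": 0.8, "d": 0.2},
    ("s0", "down"): {"u": 0.2, "d": 0.8},
    ("u", "up"): {"end": 1.0},
    ("u", "down"): {"end": 1.0},
    ("d", "up"): {"end": 1.0},
    ("d", "down"): {"end": 1.0},
}
r = {
    ("s0", "up"): 0.0, ("s0", "down"): 0.0,
    ("u", "up"): 1.0, ("u", "down"): 0.5,
    ("d", "up"): 0.0, ("d", "down"): 0.2,
}
always_up = {"s0": "up", "u": "up", "d": "up"}

print(check_loop_free(layers, P))
print(policy_value(layers, P, r, always_up))
```

Because the state space is layered, a single backward pass visits every state-action pair once, which is what makes exact policy evaluation cheap in this setting.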

3 The learning problem

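We consider episodic loop-free environments where the transitions and rewards are influenced by some vector of side information. In particular, the probability of a transition to state $s'$ given that action $a$ was chosen in state $s$, under side information $x$, is given by the generalized linear model

$$P_x(s' \mid s, a) = \sigma\big(\varphi(x)^\top \theta(s', s, a)\big),$$

where $\sigma$ is a link function (such as the sigmoid function), $\varphi$ is a feature mapping, and $\theta(s', s, a)$ is some unknown parameter vector for each individual triple $(s', s, a)$. Furthermore, the rewards are parametrized as

$$r_x(s, a) = \sigma\big(\psi(x)^\top \lambda(s, a)\big),$$

where $\psi$ is another feature mapping and $\lambda(s, a)$ is an unknown parameter vector for each pair $(s, a)$.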
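As a rough illustration of such a generalized linear parametrization, the sketch below evaluates a sigmoid link applied to an inner product between a feature vector of the side information and a parameter vector. It assumes the common form "link(features · parameters)"; the function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic link function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def glm_value(features: Sequence[float], params: Sequence[float]) -> float:
    """Apply the sigmoid link to the inner product of features and parameters."""
    if len(features) != len(params):
        raise ValueError("features and params must have the same length")
    return sigmoid(sum(f * p for f, p in zip(features, params)))


# Hypothetical feature vector of one patient's side information.
features = [1.0, 0.5]
# Hypothetical parameter vectors for one transition and one state-action reward.
theta_transition = [0.4, -1.2]
lambda_reward = [2.0, 0.0]

print(glm_value(features, theta_transition))  # modeled transition probability
print(glm_value(features, lambda_reward))     # modeled expected reward
```

The same parameter vector yields different probabilities and rewards for different side information vectors, which is how one model can adapt to each episode.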

In every episode of our learning problem, we are given a side information vector , which gives rise to the reward function and transition functions . A reasonable goal in this setting is to accumulate nearly as much reward as the best dynamic policy that can take side information into account. Defining such a dynamic policy as a mapping , we denote the best achievable performance by


The expected value of the learner’s policy in episode will be denoted by . We are interested in online algorithms that have no information about the parameter vectors and at the beginning of the learning process, but minimize the following notion of regret:

4 Algorithm

Input: State space , action space , confidence parameter .

For each episode :

  1. Observe side information .

  2. Construct confidence sets according to Equations (5) and (6)

  3. Compute , and according to Equation (2).

  4. Traverse path .

  5. Receive rewards .

Algorithm 1 Algorithm for online learning in episodic MDPs with side information.
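The control flow of Algorithm 1 can be sketched as a plain loop. This is only a structural outline under stated assumptions: the three callables `build_confidence_sets`, `plan_optimistically`, and `traverse` are hypothetical placeholders for steps 2, 3, and 4-5, and their signatures are not taken from the paper.

```python
from typing import Any, Callable, Iterable, List, Sequence, Tuple

Episode = Tuple[Any, Any, Sequence[float]]  # (side information, path, rewards)


def run_optimistic_learner(
    side_information: Iterable[Any],
    build_confidence_sets: Callable[[List[Episode]], Any],
    plan_optimistically: Callable[[Any, Any], Any],
    traverse: Callable[[Any, Any], Tuple[Any, Sequence[float]]],
) -> List[float]:
    """Run the episodic loop and return the total reward of each episode."""
    history: List[Episode] = []
    episode_rewards: List[float] = []
    for x_t in side_information:                      # 1. observe side information
        conf_sets = build_confidence_sets(history)    # 2. confidence sets from past data
        policy = plan_optimistically(conf_sets, x_t)  # 3. optimistic model and policy
        path, rewards = traverse(policy, x_t)         # 4-5. follow the policy, get rewards
        history.append((x_t, path, rewards))
        episode_rewards.append(sum(rewards))
    return episode_rewards


# Trivial stand-ins, only to show how the pieces plug together.
totals = run_optimistic_learner(
    side_information=[0.1, 0.2],
    build_confidence_sets=lambda history: len(history),
    plan_optimistically=lambda conf_sets, x: "up",
    traverse=lambda policy, x: (["s0", "u", "end"], [1.0, 0.5]),
)
print(totals)
```

Keeping the three steps behind separate callables mirrors the structure of the algorithm: only the confidence sets depend on past episodes, while planning depends on the current side information.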

Our algorithm combines ideas from the UCRL2 algorithm of Jaksch et al. (2010) and the results of Filippi et al. (2010). The algorithm that we propose is based on the Optimism in the Face of Uncertainty (OFU) principle. First proposed by Lai and Robbins (1985), OFU is a general principle that can be employed to design efficient algorithms in many stochastic online learning problems (Auer et al., 2002, Auer, 2002, Dani et al., 2008, Abbasi-Yadkori and Szepesvári, 2011). The basic idea is to maintain a confidence set for the unknown parameter vector and then in every round choose an estimate from the confidence set together with a policy so that the predicted expected reward is maximized, i.e., the estimate-policy pair is chosen optimistically.
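To illustrate the OFU principle in its simplest form, the sketch below implements the UCB1 rule of Auer et al. (2002) for a multi-armed bandit, not the MDP algorithm of this paper. Each arm's confidence interval is summarized by its upper end, and the arm with the largest upper bound is played. The function name `ucb1`, the arm means, and the seed are assumptions made for the example.

```python
import math
import random
from typing import Callable, List


def ucb1(pull: Callable[[int], float], n_arms: int, horizon: int) -> List[int]:
    """Play `horizon` rounds with UCB1 and return how often each arm was pulled."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once to initialize the estimates
        else:
            # Optimistic index: empirical mean plus confidence width.
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts


# Hypothetical Bernoulli bandit with two arms.
rng = random.Random(0)
means = [0.2, 0.8]
counts = ucb1(lambda arm: 1.0 if rng.random() < means[arm] else 0.0, 2, 2000)
print(counts)
```

The confidence width shrinks as an arm is pulled more often, so a suboptimal arm is chosen only while its optimistic estimate still exceeds that of the best arm.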

To implement the OFU principle, we construct confidence sets and that contain the true models and with probability at least each. The confidence parameter is specified by the user. Our parametric estimates take the form


for each and . Using these notations, the confidence sets and translate to confidence sets for the transition and reward functions as

Using these confidence sets, we select our model and policy simultaneously as


The above optimization task can be efficiently performed by the extended dynamic programming algorithm presented in Section 6 (see also Neu et al., 2012).

We employ techniques similar to Filippi et al. (2010) to construct our confidence sets. Let


Let be the set of time steps up to time that is observed. At time , we solve the equations


to obtain and . Let

be an increasing function (to be specified later). Then, the confidence interval corresponding with

at time is , where


Similarly, the confidence interval corresponding with at time is given by , where


Summarizing, our confidence sets for the reward and transition functions are respectively defined as




5 Analysis

First, we make a number of assumptions.

Assumption 1.

Function is continuously differentiable, Lipschitz with constant . Further, we have that , and , where denotes the derivative of .

Assumption 2.

There exists such that for all , .

Assumption 3.

Function is bounded in .

The main result of this section is the following theorem.

Theorem 1.

Let Assumptions 1, 2, and 3 hold. Then, with probability at least , for any sequence of side information vectors,[2]

[2] We use $\tilde{O}(\cdot)$ to hide logarithmic factors in the big-O notation.

We will need a number of lemmas to prove the theorem.

Lemma 1 (Filippi et al. (2010), Proposition 1).

Take any such that and . Let . Let

Let and be the solutions of (3) and (4), respectively. Then, for any , any , and any , with probability at least , it holds that

Also, with probability at least ,

Lemma 2 (Abbasi-Yadkori (2012), Lemma E.3).

Let be a sequence in . Define . If for all , then

Let be the dynamic policy achieving the maximum in (1). Also, let be the estimate of the learner’s value . Since we select our model and policy optimistically, we have . It follows that the regret can be bounded as


To treat this term, we use some results by Neu et al. (2012). Consider , that is, the probability that a trajectory generated by and includes . Note that given a layer , the restriction is a distribution. Define an estimate of as . First, we repeat Lemma 4 of Neu et al. (2012).

Lemma 3.

Assume that there exists some function such that holds for all . Then

for all .

Using this result, we can prove the following statement.

Lemma 4.

Assume that there exist some functions and such that and hold for all with probability at least each. Then with probability at least ,


Fix an arbitrary . We have and , thus

Using our upper bound on for all along with Lemma 3, we get


with probability at least . Similarly, using our upper bound on for all , we get


with probability at least . For the second term on the right hand side of (8), notice that form a martingale difference sequence with respect to and thus by the Hoeffding–Azuma inequality and almost surely, we have

with probability at least . The union bound implies that we have, with probability at least simultaneously for all ,


By a similar argument,


also holds with probability at least . We obtain the statement of the lemma by using the union bound. ∎
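The martingale step in the proof above relies on the Azuma-Hoeffding inequality. As a small numerical aid, the sketch below computes the standard one-sided deviation bound for a sum of martingale differences, assuming the $i$th difference is bounded in absolute value by a known constant. The function name and the example constants are hypothetical.

```python
import math
from typing import Sequence


def azuma_hoeffding_radius(bounds: Sequence[float], delta: float) -> float:
    """Deviation bound holding with probability at least 1 - delta.

    If D_1, ..., D_n is a martingale difference sequence with |D_i| <= bounds[i],
    then sum_i D_i <= sqrt(2 * sum_i bounds[i]**2 * log(1 / delta))
    with probability at least 1 - delta.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return math.sqrt(2.0 * sum(c * c for c in bounds) * math.log(1.0 / delta))


# Hypothetical setting: differences bounded by 5 in each of T episodes.
for T in (100, 10_000):
    print(T, azuma_hoeffding_radius([5.0] * T, delta=0.05))
```

With a constant per-episode bound, the radius grows like the square root of the number of episodes, which is the source of the square-root dependence on the number of rounds in this part of the regret.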

Now we are ready to prove our main result.

Proof of Theorem 1.

Fix some . Let and . Fix . Let be the number of time steps that we have observed up to time . Let

By Lemma 1, is an upper bound on the error of our transition estimates, thus satisfying the condition of Lemmas 3 and 4.

Let be the timestep that we observe for the th time. Notice that . Let . As , we can write


where the last inequality follows from Lemma 2. Similarly, we can prove that


Summing up these bounds for all , using Inequality (7) and Lemma 4 gives the upper bound on the regret as

6 Extended dynamic programming

The extended dynamic programming algorithm is given by Algorithm 2.

Input: confidence sets of the form (5) and (6) .

Initialization: Set .


  1. Let and be a sorting of the states in such that .

  2. For all

    1. .

    2. .

    3. .

    4. for all .

    5. Set .

    6. While do

      1. Set

      2. Set .

  3. For all

    1. Let .

    2. Let .

Return: optimistic transition function , optimistic reward function , optimistic policy .

Algorithm 2 Extended dynamic programming for finding an optimistic policy and transition model for a given confidence set of transition functions and given rewards.
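The inner step of such an extended dynamic programming procedure can be sketched as follows. This is a minimal sketch under the assumption that the confidence set for each next-state probability is an interval; it uses a standard greedy construction and is not necessarily the exact order of operations in Algorithm 2. The function name `optimistic_transition` and the numbers are hypothetical. Given values for the states of the next layer, it picks the distribution inside the intervals that maximizes the expected next value.

```python
from typing import List, Sequence


def optimistic_transition(values: Sequence[float],
                          lower: Sequence[float],
                          upper: Sequence[float]) -> List[float]:
    """Maximize sum_i p[i] * values[i] subject to
    lower[i] <= p[i] <= upper[i] and sum_i p[i] == 1."""
    p = [max(0.0, lo) for lo in lower]    # start every entry at its lower bound
    caps = [min(1.0, hi) for hi in upper]
    slack = 1.0 - sum(p)                  # probability mass still to be assigned
    if slack < -1e-12 or sum(caps) < 1.0 - 1e-12:
        raise ValueError("no probability vector fits the given intervals")
    # Give the remaining mass to the most valuable successor states first.
    for i in sorted(range(len(values)), key=lambda i: values[i], reverse=True):
        add = max(0.0, min(caps[i] - p[i], slack))
        p[i] += add
        slack -= add
    return p


# Hypothetical values of three successor states and their confidence intervals.
values = [1.0, 0.5, 0.0]
lower = [0.2, 0.2, 0.1]
upper = [0.6, 0.7, 0.5]
p_opt = optimistic_transition(values, lower, upper)
print(p_opt)
print(sum(p * v for p, v in zip(p_opt, values)))
```

Running this step for every state-action pair, layer by layer from the last layer backward, and taking the best action in each state gives an optimistic policy together with an optimistic transition model.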

The next lemma, which can be obtained by a straightforward modification of the proof of Theorem 7 of Jaksch et al. (2010), shows that Algorithm 2 efficiently solves the desired maximization problem.

Lemma 5.

Algorithm 2 solves the maximization problem (2) for the confidence sets and . Let denote the maximum number of possible transitions in the given model. The time and space complexity of Algorithm 2 is the number of possible non-zero elements of allowed by the given structure, and so it is , which, in turn, is .

7 Conclusions

In this paper, we introduced a model for online learning in episodic MDPs where the transition and reward functions can depend on some side information provided to the learner. We proposed and analyzed a novel algorithm for minimizing regret in this setting and have shown that the regret of this algorithm is . While we are not aware of any theoretical results for this precise setting, it is beneficial to compare our results to previously known guarantees for other settings.

First, the UCRL2 algorithm of Jaksch et al. (2010) enjoys a regret bound of for -step episodic problems with fixed transition and reward functions. This setting can be regarded as a special case of ours when and constant (or i.i.d.) side information, thus our bounds for this case are worse by a multiplicative factor of .[3] However, our algorithm achieves low regret against a much richer class of policies, so our guarantees are far superior when side information has a large impact on the decision process.

[3] This difference stems from the fact that we have to directly bound the error of instead of the norm . While such a bound is readily available for the single-parameter linear setting, it is highly non-trivial whether a similar result is provable for generalized linear models.

There is also a large literature on temporal-difference methods that are model-free and learn a value function. The asymptotic behavior of temporal-difference methods (Sutton, 1988) in large state and action spaces has been studied in both on-policy (Tsitsiklis and Van Roy, 1997) and off-policy (Sutton et al., 2009b, a, Maei et al., 2009) settings. All these results concern the policy estimation problem, i.e., estimating the value of a fixed policy. The available results for the control problem, i.e., estimating the value of the optimal policy, are more limited (Maei et al., 2010) and prove only convergence to a local optimum of some objective function. It is not clear if and under what conditions these TD control methods converge to the optimal policy.

Yu and Mannor (2009a, b) consider the problem of online learning in MDPs where the reward and transition functions may change arbitrarily after each time step. Their setting can be seen as a significantly more difficult version of ours, when the side information is only revealed after the learner selects its policy . One cannot expect to be able to compete with the set of dynamic policies using side information, so they consider regret minimization against the pool of stationary state-feedback policies. Still, the algorithms proposed in these papers fail to achieve sublinear regret. The mere existence of consistent learning algorithms for this problem is a very interesting open problem.

An interesting direction of future work is to consider learning in unichain Markov decision processes where a new side information vector is provided after each transition made in the MDP. The main challenge in this setting is that long-term planning in such a quickly changing environment is very difficult without making strong assumptions on the generation of the sequence of side information vectors. Learning in the situation when the sequence is generated by an oblivious adversary is not much simpler than in the setting of Yu and Mannor (2009a, b): seeing one step into the future does not help much when having to plan multiple steps ahead in a Markovian environment.

We expect that a non-trivial combination of the ideas presented in the current paper with principles of online prediction of arbitrary sequences can help in constructing algorithms that achieve consistency in the above settings.


References

  • Abbasi-Yadkori (2012) Y. Abbasi-Yadkori. Online Learning for Linearly Parametrized Control Problems. PhD thesis, University of Alberta, 2012.
  • Abbasi-Yadkori and Szepesvári (2011) Y. Abbasi-Yadkori and Cs. Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In COLT, 2011.
  • Auer (2002) Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.
  • Auer et al. (2002) Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
  • Dani et al. (2008) Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, pages 355–366, 2008.
  • Filippi et al. (2010) Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In NIPS, pages 586–594, 2010.
  • György et al. (2007) András György, Tamás Linder, Gábor Lugosi, and György Ottucsák. The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research, 8:2369–2403, 2007. ISSN 1532-4435.
  • Jaksch et al. (2010) Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010. ISSN 1532-4435.
  • Lai and Robbins (1985) Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
  • Lavori and Dawson (2000) P. W. Lavori and R. Dawson. A design for testing clinical strategies: biased individually tailored within-subject randomization. Journal of the Royal Statistical Society A, 163:29–38, 2000.
  • Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, 2010.
  • Maei et al. (2009) H. R. Maei, Cs. Szepesvári, S. Bhatnagar, D. Precup, D. Silver, and R. S. Sutton. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, 2009.
  • Maei et al. (2010) H. R. Maei, Cs. Szepesvári, S. Bhatnagar, and R. S. Sutton. Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning, 2010.
  • Murphy et al. (2001) S. A. Murphy, M. J. van der Laan, and J. M. Robins. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96:1410–1423, 2001.
  • Neu et al. (2012) Gergely Neu, András György, and Csaba Szepesvári. The adversarial stochastic shortest path problem with unknown transition probabilities. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of JMLR Workshop and Conference Proceedings, pages 805–813, La Palma, Canary Islands, April 21-23 2012.
  • Ortner and Ryabko (2012) R. Ortner and D. Ryabko. Online regret bounds for undiscounted continuous reinforcement learning. In NIPS, 2012.
  • Sutton et al. (2009a) R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, Cs. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning, 2009a.
  • Sutton et al. (2009b) R. S. Sutton, Cs. Szepesvári, and H. R. Maei. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Advances in Neural Information Processing Systems, 2009b.
  • Sutton (1988) Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
  • Tsitsiklis and Van Roy (1997) John N. Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.
  • Yu and Mannor (2009a) Jia Yuan Yu and Shie Mannor. Online learning in Markov decision processes with arbitrarily changing rewards and transitions. In GameNets’09: Proceedings of the First ICST International Conference on Game Theory for Networks, pages 314–322, Piscataway, NJ, USA, 2009a. IEEE Press. ISBN 978-1-4244-4176-1.
  • Yu and Mannor (2009b) Jia Yuan Yu and Shie Mannor. Arbitrarily modulated Markov decision processes. In Joint 48th IEEE Conference on Decision and Control and 28th Chinese Control Conference, pages 2946–2953. IEEE Press, 2009b.