1 Introduction
We study online learning of finite Markov decision process (MDP) problems when a side information vector is available. The problem is motivated by applications such as clinical trials (Lavori and Dawson, 2000, Murphy et al., 2001) and recommendation systems (Li et al., 2010).
For example, consider a multistage treatment strategy that specifies which treatments should be applied to a patient, given the patient’s responses to past treatments. Each patient is characterized by the outcomes of several tests that are performed before the treatment begins. We collect these test results in a side information vector. A simple universal strategy uses the same policy to make recommendations for all patients, independent of the side information vector. Ideally, we would like to have treatment strategies that are adapted to each patient’s characteristics.
The problem can be modeled as an MDP problem with an infinite state space. The state variable contains the side information and the patient responses up to the current stage in the treatment. Although there are regret bounds for MDP problems with infinite state spaces (Abbasi-Yadkori and Szepesvári, 2011, Abbasi-Yadkori, 2012, Ortner and Ryabko, 2012), the proposed algorithms can be computationally expensive.
Alternatively, we can model the problem as an MDP problem with changing rewards and transition probability kernels. There is, however, no computationally efficient algorithm with a performance guarantee for this setting.
In this paper, we model such decision problems with Markov decision processes where the transition and reward functions are allowed to depend on the side information given for each new problem instance. Using our previous example, the side information corresponds to the results of the tests preceding the treatment, actions correspond to different treatment options, and the states are given by the outcome of the applied treatments. Every new patient corresponds to a new episode in the decision problem, where the transitions and rewards characterizing the treatment procedure are influenced by the history of the patient in question. In what follows, we precisely formulate the outlined decision problem and provide a principled way of utilizing side information to maximize rewards.
2 Background
To set up our goals, we need to fix some notation. Let $\|v\|$ denote the norm of a vector $v$. A finite episodic Markov decision process (MDP) is characterized by its finite state space $\mathcal{S}$, finite action space $\mathcal{A}$, transition function $P$ and reward function $r$. An episodic MDP also has a few special states, the starting state and some terminal states: each episode starts from the designated starting state and ends when it reaches a terminal state. When, in addition, the state space has a layered structure with respect to the transitions, we get the so-called loop-free variant of episodic MDPs. The layered structure of the state space means that $\mathcal{S} = \cup_{l=0}^{L} \mathcal{S}_l$, where $\mathcal{S}_l$ is called the $l$th layer of the state space, $\mathcal{S}_l \cap \mathcal{S}_k = \emptyset$ for all $l \neq k$, and the agent can only move between consecutive layers. That is, for any $s \in \mathcal{S}_l$ and $a \in \mathcal{A}$, $P(s'|s,a) = 0$ if $s' \notin \mathcal{S}_{l+1}$. In particular, each episode starts at layer $0$, from state $s_0$. In every state $s_l \in \mathcal{S}_l$, the learner chooses an action $a_l \in \mathcal{A}$, earns some reward $r(s_l, a_l)$, and is eventually transferred to state $s_{l+1} \sim P(\cdot|s_l, a_l)$. The episode ends when the learner reaches any state $s_L$ belonging to the last layer $\mathcal{S}_L$. This assumption is equivalent to assuming that each trajectory consists of exactly $L$ transitions. (Footnote 1: All loop-free state spaces can be transformed to one that satisfies our assumptions with no significant increase in the size of the problem. A simple transformation algorithm is given in Appendix A of György et al. (2007).) This framework is a natural fit for episodic problems where time is part of the state variable. Figure 1 shows an example of a loop-free episodic MDP. For any state $s \in \mathcal{S}$ we will use $l_s$ to denote the index of the layer $s$ belongs to, that is, $l_s = l$ if $s \in \mathcal{S}_l$.
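The layered structure described above can be made concrete with a small sketch. The toy states, actions, probabilities and rewards below are invented for illustration and do not come from the paper; the helper names (`layer_index`, `is_loop_free`, `run_episode`) are likewise hypothetical.

```python
import random

# Hypothetical toy loop-free MDP with two transitions per episode.
# layers[l] lists the states of layer l; layer 0 holds only the start state.
layers = [["s0"], ["u", "v"], ["end"]]
actions = ["a", "b"]

# P[(s, a)] maps next states to probabilities; mass sits only on the next layer.
P = {
    ("s0", "a"): {"u": 0.8, "v": 0.2},
    ("s0", "b"): {"u": 0.3, "v": 0.7},
    ("u", "a"): {"end": 1.0}, ("u", "b"): {"end": 1.0},
    ("v", "a"): {"end": 1.0}, ("v", "b"): {"end": 1.0},
}
# r[(s, a)] is the reward earned for taking action a in state s.
r = {("s0", "a"): 0.0, ("s0", "b"): 0.1,
     ("u", "a"): 1.0, ("u", "b"): 0.2,
     ("v", "a"): 0.0, ("v", "b"): 0.5}

def layer_index(s):
    """Return the index of the layer that contains state s."""
    for l, states in enumerate(layers):
        if s in states:
            return l
    raise KeyError(s)

def is_loop_free(P):
    """Check that every possible transition moves exactly one layer forward."""
    return all(layer_index(s2) == layer_index(s) + 1
               for (s, _), nxt in P.items()
               for s2, p in nxt.items() if p > 0)

def run_episode(policy, rng):
    """Follow `policy` from the start state; return the path and total reward."""
    s = layers[0][0]
    path, total = [s], 0.0
    for _ in range(len(layers) - 1):   # one transition per non-terminal layer
        a = policy[s]
        total += r[(s, a)]
        nxt = P[(s, a)]
        s = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        path.append(s)
    return path, total
```

Every episode generated this way visits one state per layer, which is why the number of transitions is the same for all trajectories.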
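A deterministic policy (or, in short: a policy) is a mapping $\pi: \mathcal{S} \to \mathcal{A}$. We say that a policy $\pi$ is followed in an episodic MDP problem if the action in state $s$ is set to be $\pi(s)$, independently of previous states and actions. The set of all deterministic policies will be denoted by $\Pi$. A random path $\mathbf{s} = (s_0, \dots, s_L)$ is said to be generated by policy $\pi$ under the transition model $P$ if the initial state is $s_0$ and $s_{l+1}$ is drawn from $P(\cdot|s_l, \pi(s_l))$ for all $l = 0, \dots, L-1$. We denote this relation by $\mathbf{s} \sim (\pi, P)$. Define the value of a policy $\pi$, given a fixed reward function $r$ and a transition model $P$ as
$$V(r, P, \pi) = \mathbb{E}\left[ \sum_{l=0}^{L-1} r(s_l, \pi(s_l)) \,\middle|\, \mathbf{s} \sim (\pi, P) \right],$$
that is, the expected sum of rewards gained when following $\pi$ in the MDP defined by $r$ and $P$.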
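Because states are visited layer by layer, the value of a fixed policy can be computed by a single backward pass over the layers. The following is a minimal sketch on an invented toy problem; `policy_value` and all numbers are assumptions made for illustration.

```python
def policy_value(layers, P, r, policy):
    """Expected total reward of `policy`, computed by backward induction."""
    V = {s: 0.0 for s in layers[-1]}            # terminal layer: nothing left to earn
    for states in reversed(layers[:-1]):        # sweep from the last layer to the first
        for s in states:
            a = policy[s]
            V[s] = r[(s, a)] + sum(p * V[s2] for s2, p in P[(s, a)].items())
    return V[layers[0][0]]

# Hypothetical toy model (not taken from the paper).
layers = [["s0"], ["u", "v"], ["end"]]
P = {("s0", "a"): {"u": 0.8, "v": 0.2},
     ("u", "a"): {"end": 1.0},
     ("v", "b"): {"end": 1.0}}
r = {("s0", "a"): 0.0, ("u", "a"): 1.0, ("v", "b"): 0.5}
policy = {"s0": "a", "u": "a", "v": "b"}

value = policy_value(layers, P, r, policy)      # 0.8 * 1.0 + 0.2 * 0.5, about 0.9
```

The pass touches each state once, so its cost is proportional to the number of possible transitions under the policy.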
3 The learning problem
We consider episodic loopfree environments where the transitions and rewards are influenced by some vector of side information. In particular, the probability of a transition to state given that action was chosen in state is given by the generalized linear model
where is a link function (such as the sigmoid function), is a feature mapping, and is some unknown parameter vector for each individual . Furthermore, the rewards are parametrized as
where is another feature mapping and .
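To illustrate the generalized linear form, the sketch below applies a sigmoid link to the inner product of a feature vector and a parameter vector, with one parameter vector per (next state, state, action) triple. The feature map `phi`, the parameter values and the side information are all made up for this example; they are assumptions, not the paper's definitions.

```python
import math

def sigmoid(z):
    """Logistic link function."""
    return 1.0 / (1.0 + math.exp(-z))

def phi(x):
    """Hypothetical feature map: the raw side information plus a bias term."""
    return list(x) + [1.0]

def glm_value(theta, x, link=sigmoid):
    """Link function applied to the inner product of features and parameters."""
    return link(sum(f * w for f, w in zip(phi(x), theta)))

# One made-up parameter vector per (next state, state, action) triple.
theta = {("u", "s0", "a"): [1.0, -0.5, 0.0],
         ("v", "s0", "a"): [-1.0, 0.5, 0.0]}

x = [0.4, 0.2]                                   # side information of one episode
p = {key: glm_value(w, x) for key, w in theta.items()}
```

In this toy case the two parameter vectors are negatives of each other, so the two probabilities sum to one. The link-of-inner-product form does not enforce normalization by itself, so this property has to come from the parameters.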
In every episode of our learning problem, we are given a side information vector , which gives rise to the reward function and transition functions . A reasonable goal in this setting is to accumulate nearly as much reward as the best dynamic policy that can take side information into account. Defining such a dynamic policy as a mapping , we denote the best achievable performance by
(1) 
The expected value of the learner’s policy in episode will be denoted by . We are interested in online algorithms that have no information about the parameter vectors and at the beginning of the learning process, but minimize the following notion of regret:
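As a rough illustration, assume that the regret is the cumulative gap between the value of the best policy for each episode's side information and the value of the learner's policy in that episode. Under that assumption it can be computed as sketched below; the one-step toy problem and the helper names are hypothetical.

```python
def optimal_value(layers, actions, P, r):
    """Value of the best policy for one episode, by backward induction."""
    V = {s: 0.0 for s in layers[-1]}
    for states in reversed(layers[:-1]):
        for s in states:
            V[s] = max(r[(s, a)] + sum(p * V[s2] for s2, p in P[(s, a)].items())
                       for a in actions)
    return V[layers[0][0]]

def policy_value(layers, P, r, policy):
    """Value of a fixed policy for one episode, by backward induction."""
    V = {s: 0.0 for s in layers[-1]}
    for states in reversed(layers[:-1]):
        for s in states:
            a = policy[s]
            V[s] = r[(s, a)] + sum(p * V[s2] for s2, p in P[(s, a)].items())
    return V[layers[0][0]]

def regret(episodes, learner_policies, layers, actions):
    """Sum over episodes of (best achievable value - learner's value)."""
    return sum(optimal_value(layers, actions, P, r) - policy_value(layers, P, r, pi)
               for (P, r), pi in zip(episodes, learner_policies))

# Hypothetical one-step problem whose rewards depend on a scalar side information x.
layers = [["s0"], ["end"]]
actions = ["a", "b"]

def model_for(x):
    P = {("s0", a): {"end": 1.0} for a in actions}
    r = {("s0", "a"): x, ("s0", "b"): 1.0 - x}
    return P, r

episodes = [model_for(x) for x in (0.9, 0.2)]
always_a = [{"s0": "a"}, {"s0": "a"}]            # a policy that ignores x
total_regret = regret(episodes, always_a, layers, actions)
```

Here the policy that ignores the side information is optimal in the first episode and loses 0.6 in the second, which is exactly the kind of loss a dynamic policy avoids.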
4 Algorithm
Our algorithm combines ideas from the UCRL2 algorithm of Jaksch et al. (2010) and the results of Filippi et al. (2010). It is based on the Optimism in the Face of Uncertainty (OFU) principle. First proposed by Lai and Robbins (1985), OFU is a general principle that can be employed to design efficient algorithms in many stochastic online learning problems (Auer et al., 2002, Auer, 2002, Dani et al., 2008, Abbasi-Yadkori and Szepesvári, 2011). The basic idea is to maintain a confidence set for the unknown parameter vector and then, in every round, choose an estimate from the confidence set together with a policy so that the predicted expected reward is maximized, i.e., the estimate-policy pair is chosen optimistically.
To implement the OFU principle, we construct confidence sets and that contain the true models and with probability at least each. The confidence parameter is specified by the user. Our parametric estimates take the form
and
for each and . Using these notations, the confidence sets and translate to confidence sets for the transition and reward functions as
Using these confidence sets, we select our model and policy simultaneously as
(2) 
The above optimization task can be efficiently performed by the extended dynamic programming algorithm presented in Section 6 (see also Neu et al., 2012).
We employ techniques similar to Filippi et al. (2010) to construct our confidence sets. Let
and
Let be the set of time steps up to time that is observed. At time , we solve the equations
(3)  
(4) 
to obtain and . Let be an increasing function (to be specified later). Then, the confidence interval corresponding to at time is , where
and
Similarly, the confidence interval corresponding to at time is given by , where
and
Summarizing, our confidence sets for the reward and transition functions are respectively defined as
(5) 
and
(6) 
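The construction above follows the general recipe of Filippi et al. (2010): solve an estimating equation for the parameter, then widen the prediction by a multiple of a feature norm weighted by the inverse design matrix. The sketch below illustrates that recipe for a sigmoid link in pure Python. It is not a transcription of equations (3) to (6); the small ridge term, the regularizer `reg`, the constant `beta` and all function names are assumptions made for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[piv] = M[piv], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for j in range(c, n + 1):
                M[i][j] -= f * M[c][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def fit_glm(feats, ys, ridge=1e-3, iters=50):
    """Newton iterations for sum_k (y_k - sigmoid(f_k . theta)) f_k = ridge * theta.
    The ridge term is an assumption added here for numerical stability."""
    d = len(feats[0])
    theta = [0.0] * d
    for _ in range(iters):
        g = [-ridge * t for t in theta]
        H = [[ridge if i == j else 0.0 for j in range(d)] for i in range(d)]
        for f, y in zip(feats, ys):
            m = sigmoid(sum(a * b for a, b in zip(f, theta)))
            for i in range(d):
                g[i] += (y - m) * f[i]
                for j in range(d):
                    H[i][j] += m * (1.0 - m) * f[i] * f[j]
        theta = [t + s for t, s in zip(theta, solve(H, g))]
    return theta

def width(f, feats, beta, reg=1.0):
    """beta * sqrt(f^T M^{-1} f), where M = reg * I + sum_k f_k f_k^T."""
    d = len(f)
    M = [[reg if i == j else 0.0 for j in range(d)] for i in range(d)]
    for g in feats:
        for i in range(d):
            for j in range(d):
                M[i][j] += g[i] * g[j]
    return beta * math.sqrt(sum(a * b for a, b in zip(f, solve(M, f))))

def interval(f, theta, feats, beta):
    """Confidence interval around the predicted mean, clipped to [0, 1]."""
    center = sigmoid(sum(a * b for a, b in zip(f, theta)))
    w = width(f, feats, beta)
    return max(0.0, center - w), min(1.0, center + w)

# Made-up observations: features with a bias term and binary outcomes.
feats = [[1.0, 0.0], [1.0, 1.0]] * 3
ys = [0, 1, 1, 1, 0, 0]
theta_hat = fit_glm(feats, ys)
```

The width shrinks along directions of feature space that have been observed often, so the intervals tighten as more episodes with similar side information are seen.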
5 Analysis
First, we make a number of assumptions.
Assumption 1.
Function is continuously differentiable, Lipschitz with constant . Further, we have that , and , where denotes the derivative of .
Assumption 2.
There exists such that for all , .
Assumption 3.
Function is bounded in .
The main result of this section is the following theorem.
Theorem 1.
We will need a number of lemmas to prove the theorem.
Lemma 1 (Filippi et al. (2010), Proposition 1).
Lemma 2 (Abbasi-Yadkori (2012), Lemma E.3).
Let be a sequence in . Define . If for all , then
Let be the dynamic policy achieving the maximum in (1). Also, let be the estimate of the learner’s value . Since we select our model and policy optimistically, we have . It follows that the regret can be bounded as
(7) 
To treat this term, we use some results by Neu et al. (2012). Consider , that is, the probability that a trajectory generated by and includes . Note that given a layer , the restriction is a distribution. Define an estimate of as . First, we repeat Lemma 4 of Neu et al. (2012).
Lemma 3.
Assume that there exists some function such that holds for all . Then
for all .
Using this result, we can prove the following statement.
Lemma 4.
Assume that there exist some functions and such that and hold for all with probability at least each. Then with probability at least ,
Proof.
Fix an arbitrary . We have and , thus
Using our upper bound on for all along with Lemma 3, we get
(8) 
with probability at least . Similarly, using our upper bound on for all , we get
(9) 
with probability at least . For the second term on the right hand side of (8), notice that form a martingale difference sequence with respect to and thus by the Hoeffding–Azuma inequality and almost surely, we have
with probability at least . The union bound implies that we have, with probability at least simultaneously for all ,
(10) 
By a similar argument,
(11) 
also holds with probability at least . We obtain the statement of the lemma by using the union bound. ∎
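The martingale step in the proof above relies on the Hoeffding–Azuma inequality. For reference, one standard form of it is stated below. The symbols $X_t$, $c$, $T$ and $\delta$ are generic placeholders and are not necessarily the quantities used in the proof.

```latex
Let $X_1, \dots, X_T$ be a martingale difference sequence with respect to a
filtration $(\mathcal{F}_t)_{t \ge 0}$ such that $|X_t| \le c$ almost surely
for all $t$. Then, for any $\delta \in (0, 1)$,
\[
  \Pr\left( \sum_{t=1}^{T} X_t \ge c \sqrt{2 T \ln(1/\delta)} \right) \le \delta .
\]
```

Equivalently, with probability at least $1 - \delta$, the sum of the differences is at most $c \sqrt{2 T \ln(1/\delta)}$.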
Now we are ready to prove our main result.
6 Extended dynamic programming
The extended dynamic programming algorithm is given by Algorithm 2.
The next lemma, which can be obtained by a straightforward modification of the proof of Theorem 7 of Jaksch et al. (2010), shows that Algorithm 2 efficiently solves the desired maximization problem.
Lemma 5.
Algorithm 2 solves the maximization problem (2) for the confidence sets and . Let denote the maximum number of possible transitions in the given model. The time and space complexity of Algorithm 2 is the number of possible nonzero elements of allowed by the given structure, and so it is , which, in turn, is .
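The idea behind extended dynamic programming can be sketched as follows, assuming that the confidence sets are given as per-entry intervals for the transition probabilities and upper bounds for the rewards. This is an illustrative sketch in the spirit of Jaksch et al. (2010) and Neu et al. (2012), not a transcription of Algorithm 2; the function names and the toy numbers are hypothetical.

```python
def optimistic_distribution(lo, hi, w):
    """Maximize sum_s p[s] * w[s] subject to lo[s] <= p[s] <= hi[s] and sum_s p[s] = 1.
    Assumes feasibility, i.e. sum(lo) <= 1 <= sum(hi)."""
    p = dict(lo)                                 # start from the lower bounds
    slack = 1.0 - sum(p.values())                # probability mass still to assign
    for s in sorted(w, key=w.get, reverse=True): # most valuable next states first
        add = min(hi[s] - p[s], slack)
        p[s] += add
        slack -= add
    return p

def extended_dp(layers, actions, p_lo, p_hi, r_hi):
    """Backward induction that is optimistic in both rewards and transitions."""
    V = {s: 0.0 for s in layers[-1]}
    policy = {}
    for states in reversed(layers[:-1]):
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in actions:
                lo, hi = p_lo[(s, a)], p_hi[(s, a)]
                p = optimistic_distribution(lo, hi, {s2: V[s2] for s2 in lo})
                q = r_hi[(s, a)] + sum(p[s2] * V[s2] for s2 in lo)
                if q > best_q:
                    best_a, best_q = a, q
            policy[s], V[s] = best_a, best_q
    return policy, V[layers[0][0]]

# Hypothetical confidence intervals for a toy two-step problem.
layers = [["s0"], ["u", "v"], ["end"]]
actions = ["a", "b"]
sure = {"end": 1.0}
p_lo = {("s0", "a"): {"u": 0.5, "v": 0.1}, ("s0", "b"): {"u": 0.2, "v": 0.6},
        ("u", "a"): sure, ("u", "b"): sure, ("v", "a"): sure, ("v", "b"): sure}
p_hi = {("s0", "a"): {"u": 0.9, "v": 0.5}, ("s0", "b"): {"u": 0.4, "v": 0.8},
        ("u", "a"): sure, ("u", "b"): sure, ("v", "a"): sure, ("v", "b"): sure}
r_hi = {("s0", "a"): 0.0, ("s0", "b"): 0.0,
        ("u", "a"): 1.0, ("u", "b"): 0.2, ("v", "a"): 0.0, ("v", "b"): 0.5}

policy, optimistic_value = extended_dp(layers, actions, p_lo, p_hi, r_hi)
```

The greedy inner step moves as much probability as the intervals allow toward the next states with the highest optimistic value, so each state-action pair costs one sort over its possible successors.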
7 Conclusions
In this paper, we introduced a model for online learning in episodic MDPs where the transition and reward functions can depend on some side information provided to the learner. We proposed and analyzed a novel algorithm for minimizing regret in this setting and have shown that the regret of this algorithm is . While we are not aware of any theoretical results for this precise setting, it is beneficial to compare our results to previously known guarantees for other settings.
First, the UCRL2 algorithm of Jaksch et al. (2010) enjoys a regret bound of for step episodic problems with fixed transition and reward functions. This setting can be regarded as a special case of ours when and constant (or i.i.d.) side information, thus our bounds for this case are worse by a multiplicative factor of . (Footnote 3: This difference stems from the fact that we have to directly bound the error of instead of the norm . While such a bound is readily available for the single-parameter linear setting, it is highly nontrivial whether a similar result is provable for generalized linear models.) However, our algorithm achieves low regret against a much richer class of policies, so our guarantees are far superior when side information has a large impact on the decision process.
There is also a large literature on temporal-difference methods that are model-free and learn a value function. Asymptotic behavior of temporal-difference methods (Sutton, 1988) in large state and action spaces is studied both in on-policy (Tsitsiklis and Van Roy, 1997) and off-policy (Sutton et al., 2009b, a, Maei et al., 2009) settings. All these results concern the policy estimation problem, i.e., estimating the value of a fixed policy. The available results for the control problem, i.e., estimating the value of the optimal policy, are more limited (Maei et al., 2010) and prove only convergence to a local optimum of some objective function. It is not clear if and under what conditions these TD control methods converge to the optimal policy.
Yu and Mannor (2009a, b) consider the problem of online learning in MDPs where the reward and transition functions may change arbitrarily after each time step. Their setting can be seen as a significantly more difficult version of ours, when the side information is only revealed after the learner selects its policy . One cannot expect to be able to compete with the set of dynamic policies using side information, so they consider regret minimization against the pool of stationary state-feedback policies. Still, the algorithms proposed in these papers fail to achieve sublinear regret. The mere existence of consistent learning algorithms for this problem is a very interesting open problem.
An interesting direction of future work is to consider learning in unichain Markov decision processes where a new side information vector is provided after each transition made in the MDP. The main challenge in this setting is that long-term planning in such a quickly changing environment is very difficult without making strong assumptions on the generation of the sequence of side information vectors. Learning in the situation when the sequence is generated by an oblivious adversary is not much simpler than in the setting of Yu and Mannor (2009a, b): seeing one step into the future does not help much when having to plan multiple steps ahead in a Markovian environment.
We expect that a nontrivial combination of the ideas presented in the current paper with principles of online prediction of arbitrary sequences can help construct algorithms that achieve consistency in the above settings.
References
Y. Abbasi-Yadkori. Online Learning for Linearly Parametrized Control Problems. PhD thesis, University of Alberta, 2012.
Y. Abbasi-Yadkori and Cs. Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In COLT, 2011.
Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, pages 355–366, 2008.
Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In NIPS, pages 586–594, 2010.
András György, Tamás Linder, Gábor Lugosi, and György Ottucsák. The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research, 8:2369–2403, 2007. ISSN 1532-4435.
Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, August 2010. ISSN 1532-4435. URL http://portal.acm.org/citation.cfm?id=1859890.1859902.
Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
P. W. Lavori and R. Dawson. A design for testing clinical strategies: biased individually tailored within-subject randomization. Journal of the Royal Statistical Society A, 163:29–38, 2000.
Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, 2010.
H. R. Maei, Cs. Szepesvári, S. Bhatnagar, D. Precup, D. Silver, and R. S. Sutton. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, 2009.
H. R. Maei, Cs. Szepesvári, S. Bhatnagar, and R. S. Sutton. Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning, 2010.
S. A. Murphy, M. J. van der Laan, and J. M. Robins. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96:1410–1423, 2001.
Gergely Neu, András György, and Csaba Szepesvári. The adversarial stochastic shortest path problem with unknown transition probabilities. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of JMLR Workshop and Conference Proceedings, pages 805–813, La Palma, Canary Islands, April 21-23, 2012.
R. Ortner and D. Ryabko. Online regret bounds for undiscounted continuous reinforcement learning. In NIPS, 2012.
R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, Cs. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning, 2009a.
R. S. Sutton, Cs. Szepesvári, and H. R. Maei. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Advances in Neural Information Processing Systems, 2009b.
Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
John N. Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.
Jia Yuan Yu and Shie Mannor. Online learning in Markov decision processes with arbitrarily changing rewards and transitions. In GameNets’09: Proceedings of the First ICST international conference on Game Theory for Networks, pages 314–322, Piscataway, NJ, USA, 2009a. IEEE Press. ISBN 9781424441761.
Jia Yuan Yu and Shie Mannor. Arbitrarily modulated Markov decision processes. In Joint 48th IEEE Conference on Decision and Control and 28th Chinese Control Conference, pages 2946–2953. IEEE Press, 2009b.