1 Introduction
Consider a basic sequential decision-making problem in healthcare, namely that of learning a treatment policy for patients to optimize some health outcome of interest. One could model the interaction with every patient as a Markov Decision Process (MDP). In precision or personalized medicine, we want the treatment to be personalized to every patient. At the same time, the amount of data available on any given patient may not be enough to personalize well. This means that modeling each patient via a different MDP will result in severely suboptimal treatment policies. The other extreme of pooling all patients' data results in more data, but most of it may not be relevant to the patient we currently want to treat. We therefore face a tradeoff between having a large amount of shared data to learn a single policy and finding the most relevant policy for each patient. A similar tradeoff occurs in other applications involving humans as the agent's environment, such as online tutoring and web advertising.
A key observation is that in many personalized decision making scenarios, we have some side information available about individuals which might help us in designing personalized policies and also help us pool the interaction data across the right subsets of individuals. Examples of such data include laboratory data or medical history of patients in healthcare, user profiles or history logs in web advertising, and student profiles or historical scores in online tutoring. Access to such side information should let us learn better policies even with a limited amount of interaction with individual users. We refer to this side information as contexts and adopt an augmented model called Contextual Markov Decision Process (CMDP) proposed by Hallak et al. (2015). We assume that contexts are fully observed and available before the interaction starts for each new MDP. (Footnote 1: Hallak et al. (2015) assume latent contexts, which results in significant differences from our work in application scenarios, required assumptions, and results. See the detailed discussion in Section 5.)
In this paper we study the sample complexity of learning in CMDPs in the worst case. We consider two concrete settings of learning in a CMDP with continuous contexts. In the first setting, the individual MDPs vary in an arbitrary but smooth manner with the contexts, and we propose our CoverRmax algorithm in Section 3 with PAC bounds. The innate hardness of learning in this general case is captured by our lower bound construction in Section 3.1. To show that it is possible to achieve significantly better sample complexity in more structured CMDPs, we consider another setting where contexts are used to create linear combinations of a finite set of fixed but unknown MDPs. We use the KWIK framework to devise the KWIK_LRRmax algorithm in Section 4.1 and also provide a PAC upper bound for the algorithm.
2 Contextual Markov Decision Process
2.1 Problem setup and notation
We start with basic definitions and notations for MDPs, and then introduce the contextual case.
Definition 2.1 (Markov Decision Processes).
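A Markov Decision Process (MDP) is defined as a tuple $(\mathcal{S}, \mathcal{A}, p(\cdot|\cdot,\cdot), r(\cdot,\cdot), \mu(\cdot))$ where $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space; $p(s'|s,a)$ defines the transition probability function for a tuple $(s, a, s')$ where $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$; and $\mu(\cdot)$ defines the initial state distribution for the MDP.

We consider the case of fixed horizon (denoted $H$) episodic MDPs. We use $\pi(s, h)$ to denote a policy's action for state $s$ at timestep $h$. For each episode, an initial state $s_0$ is observed according to the distribution $\mu$ and afterwards, for $0 \le h < H$, the agent chooses an action $a_h = \pi(s_h, h)$ according to a (nonstationary) policy. There is a reward $r_h$ and then a next state $s_{h+1}$ according to the reward and the transition functions. For policy $\pi$ we define its value as follows:

$$V^{\pi}_{M} = \mathbb{E}_{s_0 \sim \mu, M, \pi}\left[\frac{1}{H}\sum_{h=0}^{H-1} r_h\right]. \quad (1)$$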
An optimal policy is one that achieves the largest possible value (called the optimal value and denoted $V^{*}_{M}$). Next we define the contextual model, which is similar to the definition given by Hallak et al. (2015):
Definition 2.2 (Contextual MDP).
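A contextual Markov Decision Process (CMDP) is defined as a tuple $(\mathcal{C}, \mathcal{S}, \mathcal{A}, \mathcal{M})$ where $\mathcal{C}$ is the context space, $\mathcal{S}$ is the state space, and $\mathcal{A}$ is the action space. $\mathcal{M}$ is a function which maps a context $c \in \mathcal{C}$ to MDP parameters $\mathcal{M}(c) = \{p^c(\cdot|\cdot,\cdot), r^c(\cdot,\cdot), \mu^c(\cdot)\}$.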
We denote the MDP for a context $c$ by $M^c$. We make the simplifying assumption that the initial state distribution is independent of the context and is the same for all contexts. We assume throughout the paper that the rewards are bounded between 0 and 1. We denote $|\mathcal{S}|, |\mathcal{A}|$ by $S, A$ respectively. We also assume that the context space is bounded, and for any $c \in \mathcal{C}$ the $\ell_2$ norm of $c$ is upper bounded by some constant. We consider the online learning scenario with the following protocol. For $t = 1, 2, \ldots$:

1. Observe context $c_t \in \mathcal{C}$.
2. Choose a policy $\pi_t$ (based on $c_t$ and previous episodes).
3. Experience an episode in $M^{c_t}$ using $\pi_t$.
We do not make any distributional assumptions over the context sequence. Instead, we allow the sequence to be chosen in an arbitrary and potentially adversarial manner. A natural criterion for judging the efficiency of the algorithm is the number of episodes in which it performs suboptimally. The main aim of the PAC analysis is to bound the number of episodes where we have $V^{\pi_t}_{M^{c_t}} < V^{*}_{M^{c_t}} - \epsilon$, i.e., the value of the algorithm's policy is not $\epsilon$-optimal (Dann and Brunskill, 2015). Although we give PAC bounds for the CoverRmax algorithm below, we have not made explicit attempts to achieve the tightest possible result. We use the Rmax (Brafman and Tennenholtz, 2002) algorithm as the base of our construction to handle the exploration-exploitation tradeoff because of its simplicity. Our approach can also be combined with other PAC algorithms (Strehl and Littman, 2008; Dann and Brunskill, 2015) for improved dependence on $S$, $A$ and $H$.
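To make the value in Eq. (1) and the notion of a suboptimal episode concrete, the following minimal sketch computes an optimal nonstationary policy by backward induction and evaluates policies in a small tabular MDP. It assumes the normalized value (average per-step reward over the horizon); the function names `plan` and `evaluate`, the nested-list model layout, and the toy two-state MDP are illustrative choices and do not come from the paper.

```python
def plan(p, r, H):
    """Backward induction for a fixed-horizon tabular MDP.

    p[s][a] is a list of next-state probabilities, r[s][a] a reward in [0, 1].
    Returns policy[h][s], the greedy action for state s at timestep h.
    """
    S, A = len(p), len(p[0])
    v = [0.0] * S  # value-to-go after the final step
    policy = [[0] * S for _ in range(H)]
    for h in reversed(range(H)):
        new_v = [0.0] * S
        for s in range(S):
            q = [r[s][a] + sum(p[s][a][s2] * v[s2] for s2 in range(S))
                 for a in range(A)]
            best = max(range(A), key=lambda a: q[a])
            policy[h][s], new_v[s] = best, q[best]
        v = new_v
    return policy


def evaluate(p, r, mu, policy, H):
    """Normalized value: expected total reward divided by the horizon H."""
    S = len(p)
    v = [0.0] * S
    for h in reversed(range(H)):
        v = [r[s][policy[h][s]]
             + sum(p[s][policy[h][s]][s2] * v[s2] for s2 in range(S))
             for s in range(S)]
    return sum(mu[s] * v[s] for s in range(S)) / H


# Toy MDP (hypothetical): state 1 pays reward 1 forever; from state 0,
# action 1 moves to state 1 and action 0 stays put with no reward.
p = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
r = [[0.0, 0.0], [1.0, 1.0]]
mu = [1.0, 0.0]
H = 3

optimal_policy = plan(p, r, H)
lazy_policy = [[0, 0] for _ in range(H)]  # always plays action 0
v_star = evaluate(p, r, mu, optimal_policy, H)
v_lazy = evaluate(p, r, mu, lazy_policy, H)
epsilon = 0.1
is_suboptimal = v_lazy < v_star - epsilon  # this episode would be counted
```

A PAC bound of the kind discussed above limits how many episodes can have `is_suboptimal` equal to true over the whole context sequence.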
3 CoverRmax
In this section, we present the CoverRmax algorithm and provide a PAC bound for it under a smoothness assumption. The key motivation for our contextual setting is that sharing information among different contexts is helpful. Therefore, it is natural to assume that the MDPs corresponding to similar contexts will themselves be similar. This can be formalized by the following smoothness assumption:
Definition 3.1 (Smoothness).
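Given a CMDP $(\mathcal{C}, \mathcal{S}, \mathcal{A}, \mathcal{M})$, and a distance metric $\phi(\cdot, \cdot)$ over the context space, if for any two contexts $c_1, c_2 \in \mathcal{C}$, we have the following constraints for all $s \in \mathcal{S}$, $a \in \mathcal{A}$:

$$\|p^{c_1}(\cdot|s,a) - p^{c_2}(\cdot|s,a)\|_1 \le L_p\, \phi(c_1, c_2), \qquad |r^{c_1}(s,a) - r^{c_2}(s,a)| \le L_r\, \phi(c_1, c_2),$$

then we call this a smooth CMDP with smoothness parameters $L_p$ and $L_r$.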
We assume that the distance metric and the constants $L_p$ and $L_r$ are known. This smoothness assumption allows us to use a minimally tweaked version of Rmax (Brafman and Tennenholtz, 2002) and provide an analysis for smooth CMDPs similar to existing literature on MDPs (Kearns and Singh, 2002; Strehl et al., 2009; Strehl and Littman, 2008). If we know the transition dynamics and the expected reward functions for each state-action pair in a finite MDP, we can easily compute the optimal policy. The idea of Rmax is to distinguish the state-action pairs as known or unknown: a state-action pair is known if we have visited it enough times that the empirical estimates of reward and transition probabilities are nearly accurate. A state $s$ becomes known when the pairs $(s,a)$ become known for all actions $a$. Rmax then constructs an auxiliary MDP which encourages optimistic behaviour by assigning maximum reward (hence the name Rmax) to the remaining unknown states. When we act according to the optimal policy in the auxiliary MDP, one of the following must happen: 1) we exploit the information available and achieve near-optimal value, or 2) we visit unknown states and accumulate more information efficiently.

Formally, for a set of known states $K$, we define an (approximate) induced MDP $\hat{M}_K$ in the following manner. Let $n(s,a)$ and $n(s,a,s')$ denote the number of observations of state-action pair $(s,a)$ and transitions $(s,a,s')$ respectively. Also, let $R(s,a)$ denote the total reward obtained from state-action pair $(s,a)$. For each $s \in K$, define the values

$$\hat{p}_K(s'|s,a) = \frac{n(s,a,s')}{n(s,a)}, \qquad \hat{r}_K(s,a) = \frac{R(s,a)}{n(s,a)}. \quad (2)$$

For each $s \notin K$, define the values as $\hat{p}_K(s'|s,a) = \mathbb{1}\{s' = s\}$ and $\hat{r}_K(s,a) = 1$.
We use the certainty equivalent policy computed for this induced MDP and perform balanced wandering (Kearns and Singh, 2002) for unknown states. Balanced wandering ensures that all actions are tried equally and fairly for unknown states. Assigning maximum reward to the unknown states pushes the agent to visit these states and provides the necessary exploration impetus. The generic template of Rmax is given in Algorithm 1.
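The induced MDP described above can be sketched as follows. This is a minimal illustration under stated assumptions: known states use empirical frequencies, unknown states are modeled as self-loops with the maximum reward of 1, and a state counts as known once every action has been tried at least `m` times. The function name `induced_mdp`, the threshold name `m`, and the nested-list layout are hypothetical.

```python
def induced_mdp(n_sa, n_sas, reward_sum, m):
    """Build an optimistic Rmax-style model from visit statistics.

    n_sa[s][a]        : number of visits to (s, a)
    n_sas[s][a][s2]   : number of observed transitions (s, a) -> s2
    reward_sum[s][a]  : total reward collected at (s, a)
    m                 : visits required (m >= 1) before (s, a) is known
    """
    S, A = len(n_sa), len(n_sa[0])
    # A state is known only when all of its actions are known.
    known = [all(n_sa[s][a] >= m for a in range(A)) for s in range(S)]
    p_hat = [[[0.0] * S for _ in range(A)] for _ in range(S)]
    r_hat = [[0.0] * A for _ in range(S)]
    for s in range(S):
        for a in range(A):
            if known[s]:
                # Certainty-equivalent estimates from pooled counts.
                for s2 in range(S):
                    p_hat[s][a][s2] = n_sas[s][a][s2] / n_sa[s][a]
                r_hat[s][a] = reward_sum[s][a] / n_sa[s][a]
            else:
                # Optimism: unknown states loop forever with reward 1.
                p_hat[s][a][s] = 1.0
                r_hat[s][a] = 1.0
    return p_hat, r_hat, known


# Two states, one action; state 0 has enough data, state 1 does not.
n_sa = [[2], [1]]
n_sas = [[[1, 1]], [[0, 1]]]
reward_sum = [[1.0], [0.0]]
p_hat, r_hat, known = induced_mdp(n_sa, n_sas, reward_sum, m=2)
```

Planning in this model makes unknown states look maximally attractive, which is what drives the agent toward them.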
For the contextual case, we have an infinite number of such MDPs. The idea behind our algorithm is that one can group sufficiently close contexts and treat them as a single MDP. Utilizing the boundedness of the context space $\mathcal{C}$, we create a cover of $\mathcal{C}$ with finitely many balls of radius $r$ centered at points $o_i$. By tuning the radius $r$, we can control the bias introduced by ignoring the differences among the MDPs in the same ball. Doing so allows us to pool together the data from all MDPs in a ball, so that we avoid the difficulty of infinitely many MDPs and instead only deal with finitely many of them. The size of the cover, i.e., the number of balls, can be measured by the notion of covering numbers (see e.g., Wu (2016)), defined as

$$\mathcal{N}(\mathcal{C}, r) = \min\{|Y| : \mathcal{C} \subseteq \cup_{y \in Y} B_r(y)\}.$$
The resulting algorithm, CoverRmax, is obtained by using the subroutines in Algorithm 2, and we state its sample complexity guarantee in Theorem 3.2.
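As a rough illustration of the pooling idea, the sketch below assigns each incoming context to a ball and keeps one set of visit counts per ball. It assumes a Euclidean metric and builds the cover greedily as contexts arrive; the paper only requires some cover of the given radius, so the class name `OnlineCover` and this greedy construction are assumptions made for the example.

```python
import math


class OnlineCover:
    """Greedy cover: reuse the first ball within `radius`, else open a new one."""

    def __init__(self, radius):
        self.radius = radius
        self.centers = []

    def ball_index(self, context):
        for i, center in enumerate(self.centers):
            if math.dist(context, center) <= self.radius:
                return i
        # No existing ball contains this context: it becomes a new center.
        self.centers.append(tuple(context))
        return len(self.centers) - 1


cover = OnlineCover(radius=0.25)
visits = {}  # (ball index, state, action) -> pooled visit count
observations = [
    ((0.0, 0.0), 0, 1),
    ((0.1, 0.1), 0, 1),  # close to the first context: same ball
    ((0.9, 0.9), 0, 1),  # far away: opens a second ball
    ((1.0, 0.8), 0, 1),  # close to the third context: second ball
]
for context, state, action in observations:
    key = (cover.ball_index(context), state, action)
    visits[key] = visits.get(key, 0) + 1
```

Each ball then runs its own Rmax-style bookkeeping on the pooled counts, so the number of separate models is bounded by the number of balls rather than by the number of distinct contexts.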
Theorem 3.2 (PAC bound for CoverRmax).
For any input values and a CMDP with smoothness parameters and , with probability at least , the CoverRmax algorithm produces a sequence of policies which yield at most
nonoptimal episodes, where and .
Proof sketch.
We first of all carefully adapt the analysis of Rmax by Kakade (2003) to get the PAC bound for an episodic MDP. Let be the number of visits to a stateaction pair after which the model’s estimate for has an error of at most and reward estimate has an absolute error of at most . We can show that:
Lemma 3.3.
Let be an MDP with the fixed horizon . If is the optimal policy for as computed by Rmax, then for any starting state , with probability at least , we have for all but episodes.
Now instead of learning the model for each contextual MDP separately, we combine the data within each ball. Therefore, we have to take care of two things: choose the radius for a fine enough cover and a value of which is the number of visits after which a state becomes known for a ball. For satisfying the conditions of Lemma 3.3 for all MDPs within a ball, we need the radius of the cover to be and the value of . Using Lemma 3.3, we obtain an upper bound on number of non episodes in a single ball as with probability at least .
Setting the individual failure probability to $\delta$ divided by the number of balls in the cover and using the union bound, we get the stated PAC bound. (Footnote 2: For detailed proofs, we refer the reader to the appendix.) ∎
We observe that the PAC bound has linear dependence on the covering number of the context space. For a $d$-dimensional Euclidean metric space, the covering number is of the order $O(r^{-d})$, i.e., exponential in the dimension. However, we show in Section 3.1 that this linear dependence on the covering number is unavoidable, which indicates the difficulty of optimally learning in such cases.
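The exponential growth with dimension is easy to see numerically. The sketch below gives a simple upper bound on the number of Euclidean balls of radius `r` needed to cover the unit cube, obtained by tiling the cube with grid cells whose half-diagonal is `r`. The unit cube and the helper name `grid_cover_size` are assumptions for illustration; the true covering number can be smaller.

```python
import math


def grid_cover_size(d, r):
    """Upper bound on the covering number of [0, 1]^d with Euclidean r-balls.

    A cube cell of side 2r / sqrt(d) fits inside a ball of radius r placed
    at its center, so tiling each axis with such cells yields a valid cover.
    """
    cells_per_axis = math.ceil(math.sqrt(d) / (2.0 * r))
    return cells_per_axis ** d


sizes = {d: grid_cover_size(d, 0.25) for d in (1, 2, 3)}
```

Holding the radius fixed, each extra context dimension multiplies the number of balls, and hence the number of separately learned models.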
3.1 Lower Bound
We prove a lower bound on the number of suboptimal episodes for any learning algorithm in a smooth CMDP which shows that a linear dependence on the covering number of the context space is unavoidable. As far as we know, there is no existing way of constructing PAC lower bounds for continuous state spaces with smoothness, so we cannot simply augment the state representation to include context information. Instead, we prove our own lower bound in Theorem 3.4 which builds upon the work of Dann and Brunskill (2015) on lower bounds for episodic finite MDPs and of Slivkins (2014) on lower bounds for contextual bandits.
Theorem 3.4 (Lower bound for smooth CMDP).
There exists constants , such that for every and , any algorithm that satisfies a PAC guarantee for and computes a sequence of deterministic policies for each context, there is a hard CMDP with smoothness constant , such that
(3) 
where is the number of suboptimal episodes and .
Proof.
The overall idea is to embed multiple MDP learning problems in a CMDP, such that the agent has to learn the optimal policy in each MDP separately and cannot generalize across them. We show that the maximum number of problems that can be embedded scales with the covering number, and the result follows by incorporating known PAC lower bound for episodic MDPs.
We start with the lower bound for learning in episodic MDPs. See Figure 1 and its caption for details. The construction is due to Dann and Brunskill (2015) and we adapt their lower bound statement to our setting in Theorem 3.5.
Theorem 3.5 (Lower bound for episodic MDPs (Dann and Brunskill, 2015)).
There exist constants , such that for every and , any algorithm that satisfies a PAC guarantee for and computes a sequence of deterministic policies, there is a hard instance so that , where is the number of suboptimal episodes. The constants can be chosen as , . (Footnote 3: The lower bound here differs from that in the original paper by , because our value is normalized (see Eq. (1)), whereas they allow the magnitude of the value to grow with $H$.)
Now we discuss how to populate the context space with these hard MDPs. Note in Figure 1 that, the agent does not know which action is the most rewarding (), and the adversary can choose to be any element of (which is essentially choosing an instance from a family of MDPs). In our scenario, we would like to allow the adversary to choose the MDP independently for each individual packing point to yield a lower bound linear in the packing number. However, this is not always possible due to the smoothness assumption, as committing to an MDP at one point may restrict the adversary’s choices in another point.
To deal with this difficulty, we note that any pair of hard MDPs differ from each other by in transition distributions. Therefore, we construct a packing of with radius , defined as a set of points such that any two points in are at least away from each other. The maximum size of such is known as the packing number:
which is related to the covering number as . The radius is chosen to be so that arbitrary choices of hard MDP instances at different packing points always satisfy the smoothness assumption (recall that ). Once we fix the MDPs for all , we specify the MDP for as follows: for state and action ,
Essentially, as we move away from a packing point, the transition to / become more uniform. We can show that:
Claim 3.6.
The CMDP defined above satisfies Definition 3.1 with constant $L_p$. (Footnote 4: The reward function does not vary with context, hence reward smoothness is satisfied for all $L_r \ge 0$. The proof of the claim is deferred to the appendix.)
We choose the context sequence given as input to be repetitions of an arbitrary permutation of . By construction, the learning at different points in are independent, so the lower bound is simply the lower bound for learning a single MDP (Theorem 3.5) multiplied by the cardinality of (the packing number). Using the well known relation that , we have the desired lower bound. We refer the reader to the appendix for proof of Claim 3.6 and a more detailed analysis. ∎
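The relation between packing and covering used in the proof can be checked on a finite point set: a maximal set of points that are pairwise at least `r` apart is automatically a cover of radius `r`, so the packing number is at least the covering number. The sketch below builds such a set greedily; the grid of candidate contexts, the Euclidean metric, and the function name `greedy_packing` are assumptions made for the example.

```python
import math


def greedy_packing(points, r):
    """Greedily pick points that are pairwise at least r apart."""
    centers = []
    for x in points:
        if all(math.dist(x, z) >= r for z in centers):
            centers.append(x)
    return centers


# Candidate contexts on an 11 x 11 grid in the unit square.
points = [(i / 10, j / 10) for i in range(11) for j in range(11)]
r = 0.25
centers = greedy_packing(points, r)

# Packing property: chosen points are pairwise at least r apart.
is_packing = all(math.dist(a, b) >= r
                 for a in centers for b in centers if a != b)
# Covering property: every point is within r of some chosen point,
# because any rejected point was closer than r to an earlier center.
is_cover = all(any(math.dist(x, z) < r for z in centers) for x in points)
```

In the lower bound, each packing point hosts an independently chosen hard MDP, which is why the number of suboptimal episodes scales with the size of the packing.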
4 Contextual Linear Combination of MDPs
From the previous section, it is clear that for a contextual MDP with just smoothness assumptions, exponential dependence on context dimension is unavoidable. Further, the computational requirements of our CoverRmax algorithm scale with the covering number of the context space. As such, in this section, we focus on a more structured assumption about the mapping from context space to MDPs and show that we can achieve substantially improved sample efficiency.
The specific assumption we make in this section is that the model parameters of an individual MDP are a linear combination of the parameters of $d$ base MDPs, i.e.,

$$p^c(s'|s,a) = c^\top P(s,a,s') = \sum_{i=1}^{d} c_i\, p_i(s'|s,a), \qquad r^c(s,a) = c^\top R(s,a) = \sum_{i=1}^{d} c_i\, r_i(s,a). \quad (4)$$

We use $P(s,a,s')$ and $R(s,a)$ as shorthand for the $d$-dimensional vectors that concatenate the parameters from different base MDPs for the same $s, a$ (and $s'$). The parameters of the base MDPs ($p_i$ and $r_i$) are unknown and need to be recovered from data by the learning agent, while the combination coefficients are directly available as the context vector $c$ itself. This assumption can be motivated in an application scenario where the user/patient responds according to her characteristic distribution over $d$ possible behavioural patterns.
A mathematical difficulty here is that for an arbitrary context vector $c$, $p^c(\cdot|s,a)$ is not always a valid transition function and may violate non-negativity and normalization constraints. Therefore, we require that $c \in \Delta_{d-1}$, that is, $c$ stays in the probability simplex so that $p^c(\cdot|s,a)$ is always valid. (Footnote 5: $\Delta_{n}$ is the $n$-simplex $\{x \in \mathbb{R}^{n+1} : \sum_i x_i = 1,\ x_i \ge 0\}$.)
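A small sketch of the linear combination model may help: it mixes the transition distributions of the base MDPs with the context as weights and checks whether the result is a valid distribution. The names `mix_transitions` and `is_distribution`, the nested-list layout, and the two deterministic base MDPs are hypothetical choices for illustration.

```python
def mix_transitions(context, base_p):
    """Return p^c where base_p[i][s][a] is base MDP i's next-state distribution."""
    d, S, A = len(base_p), len(base_p[0]), len(base_p[0][0])
    return [[[sum(context[i] * base_p[i][s][a][s2] for i in range(d))
              for s2 in range(S)]
             for a in range(A)]
            for s in range(S)]


def is_distribution(probs, tol=1e-9):
    """Check non-negativity and normalization up to a small tolerance."""
    return all(x >= -tol for x in probs) and abs(sum(probs) - 1.0) <= tol


# Two base MDPs over two states and one action:
# base MDP 0 always moves to state 0, base MDP 1 always moves to state 1.
base_p = [
    [[[1.0, 0.0]], [[1.0, 0.0]]],
    [[[0.0, 1.0]], [[0.0, 1.0]]],
]

inside = mix_transitions([0.25, 0.75], base_p)   # context in the simplex
outside = mix_transitions([1.5, -0.5], base_p)   # context outside the simplex
```

With the context on the simplex the mixture is a proper distribution; with a context off the simplex the same formula can produce negative "probabilities", which is the difficulty the simplex restriction removes.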
4.1 KWIK_LRRmax
We first explain how to estimate the model parameters in this linear setting, and then discuss how to perform exploration properly.
Model estimation
Recall that in Section 3, we treat the MDPs whose contexts fall in a small ball as a single MDP, and estimate its parameters using data from the local context ball. In this section, however, we have a global structure due to our parametric assumption ($d$ base MDPs that are shared across all contexts). This implies that data obtained at a context may be useful for learning the MDP parameters at another context that is far away, and to avoid the exponential dependence on $d$ we need to leverage this structure and generalize globally across the entire context space.
Due to the linear combination setup, we use linear regression to replace the estimation procedure in Equation 2: in an episode with context $c$, when we observe the state-action pair $(s,a)$, a next-state $s_{\text{next}}$ will be drawn from $p^c(\cdot|s,a)$. (Footnote 6: Here we use $s_{\text{next}}$ to denote the random variable, and $s'$ to denote a possible realization.) Therefore, the indicator of whether $s_{\text{next}}$ is equal to $s'$ forms an unbiased estimate of $p^c(s'|s,a)$, i.e., $\mathbb{E}\left[\mathbb{1}[s_{\text{next}} = s']\right] = p^c(s'|s,a) = c^\top P(s,a,s')$. Based on this observation, we can construct a feature-label pair

$$\left(c,\ \mathbb{1}[s_{\text{next}} = s']\right) \quad (5)$$

whenever we observe a transition tuple $(s, a, s_{\text{next}})$ under context $c$, and their relationship is governed by a linear prediction rule with $P(s,a,s')$ being the coefficients. Hence, to estimate $P(s,a,s')$ from data, we can simply collect the feature-label pairs that correspond to this particular $(s,a,s')$ tuple, and run linear regression to recover the coefficients. The case for the reward function is similar and hence not discussed.
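The estimation idea above can be sketched as ordinary least squares over feature-label pairs collected for one fixed state-action pair. This is a batch illustration only, assuming a well-conditioned design matrix; it is not the online KWIK procedure used by the algorithm. The helper names (`one_hot`, `solve`, `estimate_base_transitions`) and the toy data are hypothetical.

```python
def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for i in range(n):
            if i != col:
                f = aug[i][col]
                aug[i] = [vi - f * vc for vi, vc in zip(aug[i], aug[col])]
    return [row[n:] for row in aug]


def estimate_base_transitions(pairs, d, S):
    """Least-squares estimate of the d x S matrix of base next-state probabilities.

    pairs: list of (context, next_state) observed at one fixed (s, a).
    """
    gram = [[0.0] * d for _ in range(d)]    # sum of c c^T
    cross = [[0.0] * S for _ in range(d)]   # sum of c y^T with one-hot y
    for c, s_next in pairs:
        y = one_hot(s_next, S)
        for i in range(d):
            for j in range(d):
                gram[i][j] += c[i] * c[j]
            for k in range(S):
                cross[i][k] += c[i] * y[k]
    return solve(gram, cross)


# Base MDP 0 always moves to state 0 and base MDP 1 to state 1, so a context
# (0.75, 0.25) yields next state 0 three times out of four on average.
pairs = ([((0.75, 0.25), 0)] * 3 + [((0.75, 0.25), 1)]
         + [((0.25, 0.75), 0)] + [((0.25, 0.75), 1)] * 3)
estimate = estimate_base_transitions(pairs, d=2, S=2)
```

Here the label frequencies match their expectations exactly, so the estimate recovers the base distributions; with a singular Gram matrix `solve` would fail, which is the ill-conditioning issue an online KWIK-style procedure is meant to handle.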
If the data is abundant (i.e., $(s,a)$ is observed many times) and exploratory (i.e., the design matrix that consists of the $c$ vectors for $(s,a)$ is well-conditioned), we can expect to recover $P(s,a,s')$ accurately. But how can we guarantee these conditions? Since the context is chosen adversarially, the design matrix can indeed be ill-conditioned.
Observe, however, that when the matrix is ill-conditioned and new contexts lie in the subspace spanned by previously observed contexts, we can still make accurate predictions despite the inability to recover the model parameters. An online linear regression (LR) procedure will take care of this issue, and we choose KWIK_LR (Walsh et al., 2009) as such a procedure.
The original KWIK_LR deals with scalar labels, which can be used to decide whether the estimate of $p^c(s'|s,a)$ is sufficiently accurate (known). A pair $(s,a)$ then becomes known if $(s,a,s')$ is known for all $s'$. This approach, however, generally leads to a loose analysis, because there is no need to predict $p^c(s'|s,a)$ for each individual $s'$ accurately: if the estimate of $p^c(\cdot|s,a)$ is close to the true distribution under $\ell_1$ error, the $(s,a)$ pair can already be considered as known. We extend the KWIK_LR analysis to handle vector-valued outputs, and provide tighter error bounds by treating $p^c(\cdot|s,a)$ as a whole. Below we introduce our extended version of KWIK_LR, and explain how to incorporate the knownness information in the Rmax skeleton to perform efficient exploration.
Identifying known with KWIK_LR
The KWIK_LRRmax algorithm we propose for the linear setting still uses the Rmax template (Algorithm 1) for exploration: in every episode we build the induced MDP, and act greedily according to its optimal policy with balanced wandering. The major difference from CoverRmax lies in how the set of known states is identified and how the induced MDP is constructed, which we explain below (see pseudocode in Algorithm 3).
At a high level, the algorithm works in the following way: when constructing the induced MDP, we query the KWIK procedure for estimates of the transition and reward parameters for every pair $(s,a)$. The KWIK procedure either returns $\bot$ (don't know), or returns estimates that are guaranteed to be accurate. If $\bot$ is returned, then we consider $(s,a)$ as unknown and associate it with the maximum reward for exploration. Such optimistic exploration ensures significant probability of observing $(s,a)$ pairs on which we have predicted $\bot$. If we do observe such pairs in an episode, we call the KWIK update subroutine with feature-label pairs formed via Equation 5 to make progress on estimating parameters for unknown state-action pairs.
Next we walk through the pseudocode and explain how the prediction and update subroutines work in detail. Then we prove an upper bound on the number of updates that can happen (i.e., the if condition holds on Line 3), which forms the basis of our analysis of KWIK_LRRmax.
In Algorithm 3, we initialize matrices $Q$ and $W$ for each $(s,a)$ and update them over time. Let $C_t$ be the design matrix at episode $t$, where each row is a context $c_\tau$ such that $(s,a)$ was observed in episode $\tau < t$. By matrix inverse rules, we can verify that the update rule on Line 3 essentially yields $Q_t = (I + C_t^\top C_t)^{-1}$, where $Q_t$ is the value of $Q$ in episode $t$. This is the inverse of the (unnormalized and regularized) empirical covariance matrix, which plays a central role in linear regression analysis. The matrix $W$ accumulates the outer product between the feature vector (context) $c_\tau$ and the one-hot vector label. It is then obvious that $Q_t W_t$ is the linear regression estimate of $P(s,a)$ using the data up to episode $t$.

When a new input vector $c_t$ comes, we check whether $\|Q c_t\|$ is below a predetermined threshold (Line 3). Recall that $Q$ is the inverse covariance matrix, so a small $\|Q c_t\|$ implies that the estimate is close to $P(s,a)$ along the direction of $c_t$, so we predict $\hat{p}^{c_t}(\cdot|s,a) = c_t^\top Q W$; otherwise we return $\bot$. The KWIK subroutine for rewards is similar and hence omitted. To ensure that the estimated transition probability is valid, we project the estimated vector onto the probability simplex, which can be done efficiently using existing techniques (Duchi et al., 2008).
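The walk-through above can be condensed into a short sketch for a single state-action pair. Several details are assumptions made for illustration rather than a transcription of Algorithm 3: the matrices start at $Q = I$ and $W = 0$, the knownness test compares the $\ell_1$ norm of $Qc$ with a threshold called `alpha`, `None` stands in for the "don't know" symbol, and the class and function names are hypothetical. The rank-one update is the Sherman-Morrison formula, and the projection is the sort-based simplex projection in the style of Duchi et al. (2008).

```python
class KwikLR:
    """Vector-valued KWIK linear regression sketch for one (s, a) pair."""

    def __init__(self, d, n_out, alpha):
        self.q = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
        self.w = [[0.0] * n_out for _ in range(d)]
        self.alpha = alpha  # assumed knownness threshold

    def _qc(self, c):
        return [sum(row[j] * c[j] for j in range(len(c))) for row in self.q]

    def known(self, c):
        return sum(abs(v) for v in self._qc(c)) <= self.alpha

    def predict(self, c):
        """Return the estimate c^T Q W, or None when the direction is unknown."""
        if not self.known(c):
            return None
        qc = self._qc(c)  # Q is symmetric, so c^T Q equals (Q c)^T
        return [sum(qc[i] * self.w[i][k] for i in range(len(c)))
                for k in range(len(self.w[0]))]

    def update(self, c, y):
        """Rank-one update while c is unknown; returns True if applied."""
        if self.known(c):
            return False
        qc = self._qc(c)
        denom = 1.0 + sum(ci * v for ci, v in zip(c, qc))
        for i in range(len(c)):
            for j in range(len(c)):
                self.q[i][j] -= qc[i] * qc[j] / denom  # Sherman-Morrison
            for k in range(len(y)):
                self.w[i][k] += c[i] * y[k]            # accumulate c y^T
        return True


def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        t = (cumulative - 1.0) / j
        if uj - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]


learner = KwikLR(d=2, n_out=2, alpha=0.15)
context, label = [1.0, 0.0], [1.0, 0.0]   # next state 0 observed every time
num_updates = sum(learner.update(context, label) for _ in range(20))
estimate = learner.predict(context)
valid = project_to_simplex(estimate)
unseen = learner.predict([0.0, 1.0])      # direction never observed
```

Only a bounded number of the 20 calls change the model: once the direction of `context` is known, further observations along it are ignored, while the never-observed direction still yields `None`.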
Below we state the KWIK bound for learning the transition function; the KWIK bound for learning rewards is much smaller hence omitted here. We use the KWIK bound for scalar linear regression from Walsh et al. (2009) and the property of multinomial samples to get our KWIK bound.
Theorem 4.1 (KWIK_LR bound for learning multinomial vectors).
For any and , if the KWIK_LR algorithm is executed for probability vectors , with with suitable constants and , then the number of ’s where updates take place (see Line 3) will be bounded by , and, with probability at least , where a non“” prediction is returned, .
Proof sketch.
(See full proof in the appendix.) We provide a direct reduction to KWIK bound for learning scalar values. The key idea is to notice that for any vector :
So conceptually we can view Algorithm 3 as running multiple scalar linear regressions simultaneously, each of which projects the vector label to a scalar by a fixed linear transformation. We require every scalar regressor to have a KWIK guarantee, and the error guarantee for the vector label follows from a union bound. ∎

With this result, we are ready to prove the formal PAC guarantee for KWIK_LRRmax.
Theorem 4.2 (PAC bound for KWIK_LRRmax).
For any input values and a linear CMDP model with number of base MDPs, with probability , the KWIK_LRRmax algorithm, produces a sequence of policies which yield at most
nonoptimal episodes.
Proof.
When the KWIK subroutine (Algorithm 3) makes non“” predictions , we require that
After projection onto , we have:
Further, the update to the matrices and happen only when an unknown state action pair is visited and the KWIK subroutine still predicts (Line 3). The KWIK bound states that after a fixed number of updates to an unknown pair, the parameters will always be known with desired accuracy. The number of updates can be obtained by setting the desired accuracy in transitions to and failure probability as in Theorem 4.1:
We now use Lemma 3.3 where instead of updating counts for number of visits, we look at the number of updates for unknown pairs. On applying a union bound over all state action pairs and using Lemma 3.3, it is easy to see that the suboptimal episodes are bounded by with probability at least . The bound in Theorem 4.2 is obtained by substituting the value of . ∎
We see that for this contextual MDP, the linear structure helps us in avoiding the exponential dependence in context dimension . The combined dependence on and is now .
5 Related work
Transfer in RL with latent contexts
The general definition of CMDPs captures the problem of transfer in RL and multitask RL. See Taylor and Stone (2009) and Lazaric (2011) for surveys of empirical results. Recent papers have also advanced the theoretical understanding of transfer in RL. For instance, Brunskill and Li (2013) and Hallak et al. (2015) analyzed the sample complexity of CMDPs where each MDP is an element of a finite and small set of MDPs, and the MDP label is treated as the latent (i.e., unseen) context. Mahmud et al. (2013) consider the problem of transferring the optimal policies of a large set of known MDPs to a new MDP. The commonality of the above papers is that the MDP label (i.e., the context) is not observed. Hence, their methods have to initially explore in every new MDP to identify its label, which requires the episode length to be substantially longer than the planning horizon. This can be a problematic assumption in our motivating scenarios, where we interact with a patient / user / student for a limited period of time and the data in a single episode (whose length is the planning horizon) is not enough for identifying the underlying MDP. In contrast to prior work, we propose to leverage observable context information to perform more direct transfer from previous MDPs, and our algorithm works with arbitrary episode length .
RL with side information
Our work leverages the available side information for each MDP, which is inspired by the use of contexts in contextual bandits (Langford and Zhang, 2008; Li et al., 2010). The use of such side information can also be found in the RL literature: Ammar et al. (2014) developed a multi-task policy gradient method where the context is used for transferring knowledge between tasks; Killian et al. (2016) used parametric forms of MDPs to develop models for personalized medicine policies for HIV treatment.
RL in metric space
For smooth CMDPs (Section 3), we pool observations across similar contexts and reduce the problem to learning policies for finitely many MDPs. An alternative approach is to consider an infinite MDP whose state representation is augmented by the context, and apply PAC-MDP methods for metric state spaces (e.g., CPACE proposed by Pazis and Parr (2013)). However, doing so might increase the sample and computational complexity unnecessarily, because we no longer leverage the structure that a particular component of the (augmented) state, namely the context, remains the same in an episode. Concretely, the augmenting approach needs to perform planning in the augmented MDP over states and contexts, which makes its computational/storage requirement worse than our solution: we only perform planning in MDPs defined on $\mathcal{S}$, whose computational characteristics have no dependence on the context space. In addition, we allow the context sequence to be chosen in an adversarial manner. This corresponds to adversarially chosen initial states in MDPs, which is usually not handled by PAC-MDP methods.
KWIK learning of linear hypothesis classes
Our linear combination setting (Section 4) provides an instance where parametric assumptions can lead to substantially improved PAC bounds. We build upon the KWIK-Rmax learning framework developed in previous work (Li et al., 2008; Szita and Szepesvári, 2011) and use KWIK linear regression as a subroutine. For the resulting KWIK_LRRmax algorithm, its sample complexity bound inherently depends on the KWIK bound for linear regression. It is well known that even for linear hypothesis classes, the KWIK bound is exponential in input dimension in the agnostic case (Szita and Szepesvári, 2011). Therefore, the success of the algorithm relies on the validity of the modelling assumption.
Abbasi-Yadkori and Neu (2014) studied a problem similar to our linear combination setting, and proposed a no-regret algorithm by combining UCRL2 (Jaksch et al., 2010) with confidence set techniques from the stochastic linear optimization literature (Dani et al., 2008; Filippi et al., 2010). Our work takes an independent and very different approach, and we provide a PAC guarantee which is not directly comparable to a regret bound. Still, we observe that our dependence on is optimal for PAC whereas theirs is not ( is optimal for bandit regret analysis and they have ); on the other hand, their dependence on (the number of rounds) is optimal, and our dependence on , its counterpart in PAC analysis, is suboptimal. It is an interesting future direction to combine the algorithmic ideas from both papers to improve the guarantees.
6 Conclusion
In this paper, we present a general setting of using side information for learning near-optimal policies in a large and potentially infinite number of MDPs. The proposed CoverRmax algorithm is a model-based PAC-exploration algorithm for the case where MDPs vary smoothly with respect to the observed side information. Our lower bound construction indicates the necessary exponential dependence of any PAC algorithm on the context dimension in a smooth CMDP. We also consider another instance with a parametric assumption, and using a KWIK linear regression procedure, present the KWIK_LRRmax algorithm for efficient exploration in linear combination of MDPs. Our PAC analysis shows a significant improvement with this structural assumption.
The use of context-based modelling of multiple tasks has rich application possibilities in personalized recommendations, healthcare treatment policies, and tutoring systems. We believe that our setting can be extended to cover much of the large space of multi-task RL, with finite/infinite number of MDPs, observed/latent contexts, and deterministic/noisy mapping between context and environment. We hope our work spurs further research along these directions.
This work was supported in part by a grant from the Open Philanthropy Project to the Center for Human-Compatible AI, and in part by NSF Grant IIS-1319365. Ambuj Tewari acknowledges the support from NSF grant CAREER IIS-1452099 and a Sloan Research Fellowship. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the views of the sponsors.
References
 AbbasiYadkori and Neu [2014] Yasin AbbasiYadkori and Gergely Neu. Online learning in mdps with side information. arXiv preprint arXiv:1406.6812, 2014.

Ammar et al. [2014]
Haitham B Ammar, Eric Eaton, Paul Ruvolo, and Matthew Taylor.
Online multitask learning for policy gradient methods.
In
Proceedings of the 31st International Conference on Machine Learning (ICML14)
, pages 1206–1214, 2014.  Brafman and Tennenholtz [2002] Ronen I Brafman and Moshe Tennenholtz. Rmaxa general polynomial time algorithm for nearoptimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
 Brunskill and Li [2013] Emma Brunskill and Lihong Li. Sample complexity of multitask reinforcement learning. arXiv preprint arXiv:1309.6821, 2013.
 Dani et al. [2008] Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. In COLT, pages 355–366, 2008.
 Dann and Brunskill [2015] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixedhorizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.
 Duchi et al. [2008] John Duchi, Shai ShalevShwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l 1ball for learning in high dimensions. In Proceedings of the 25th international conference on Machine learning, pages 272–279. ACM, 2008.
 Filippi et al. [2010] Sarah Filippi, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586–594, 2010.
 Hallak et al. [2015] Assaf Hallak, Dotan Di Castro, and Shie Mannor. Contextual markov decision processes. arXiv preprint arXiv:1502.02259, 2015.
 Jaksch et al. [2010] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
 Kakade [2003] Sham Machandranath Kakade. On the sample complexity of reinforcement learning. PhD thesis, 2003.
 Kearns and Singh [2002] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
 Killian et al. [2016] Taylor Killian, George Konidaris, and Finale Doshi-Velez. Transfer learning across patient variations with hidden parameter markov decision processes. arXiv preprint arXiv:1612.00475, 2016.

 Langford and Zhang [2008] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in neural information processing systems, pages 817–824, 2008.
 Lazaric [2011] A. Lazaric. Transfer in reinforcement learning: a framework and a survey. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning: State of the Art. Springer, 2011.
 Li et al. [2008] Lihong Li, Michael L Littman, and Thomas J Walsh. Knows what it knows: a framework for self-aware learning. In Proceedings of the 25th international conference on Machine learning, pages 568–575. ACM, 2008.
 Li et al. [2010] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010.
 Mahmud et al. [2013] MM Mahmud, Majd Hawasly, Benjamin Rosman, and Subramanian Ramamoorthy. Clustering markov decision processes for continual transfer. arXiv preprint arXiv:1311.3959, 2013.
 Pazis and Parr [2013] Jason Pazis and Ronald Parr. Pac optimal exploration in continuous space markov decision processes. In AAAI, 2013.
 Slivkins [2014] Aleksandrs Slivkins. Contextual bandits with similarity information. Journal of Machine Learning Research, 15(1):2533–2568, 2014.
 Strehl and Littman [2008] Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
 Strehl et al. [2009] Alexander L Strehl, Lihong Li, and Michael L Littman. Reinforcement learning in finite mdps: Pac analysis. Journal of Machine Learning Research, 10(Nov):2413–2444, 2009.
 Szita and Szepesvári [2011] István Szita and Csaba Szepesvári. Agnostic kwik learning and efficient approximate reinforcement learning. In Proceedings of the 24th Annual Conference on Learning Theory, pages 739–772, 2011.
 Taylor and Stone [2009] Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(Jul):1633–1685, 2009.

 Walsh et al. [2009] Thomas J Walsh, István Szita, Carlos Diuk, and Michael L Littman. Exploring compact reinforcement-learning representations with linear regression. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 591–598. AUAI Press, 2009. A corrected version is available as Technical Report DCS-tr-660, Department of Computer Science, Rutgers University, December, 2009.
 Wu [2016] Yihong Wu. Packing, covering, and consequences on minimax risk. EECS598: Information-theoretic methods in high-dimensional statistics, Statistics, Yale University, 2016.
Appendix A Proofs from Section 3
A.1 Proof of Lemma 3.3
We adapt the analysis in [11] for the episodic case which results in the removal of a factor of , since complete episodes are counted as mistakes and we do not count every suboptimal action in an episode. We reproduce the detailed analysis here for completeness. To complete the proof of Lemma 3.3, we first state a version of the simulation lemma from [12]. Throughout the analysis, we assume that the rewards lie between 0 and 1.
Definition A.1 (Induced MDP).
Let be an MDP with being a subset of states. Given such a set , we define an induced MDP in the following manner. For each , define the values
For all , define and .
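The construction in Definition A.1 can be illustrated with a short sketch. The exact values in the definition are not reproduced here, so the code assumes the common R-max-style convention: known states keep their original transitions and rewards, while every state outside the known set becomes a self-loop with the maximal reward of 1. The function name `induced_mdp`, the nested-list layout of `P` and `R`, and the toy numbers are all hypothetical.

```python
def induced_mdp(P, R, known, r_max=1.0):
    """Build an induced MDP from (P, R) and a set of known states.

    P[s][a] is a list of next-state probabilities and R[s][a] is a
    reward in [0, 1]. Assumed convention (not taken from the text):
    unknown states become absorbing and pay r_max for every action.
    """
    n_states = len(P)
    P_ind, R_ind = [], []
    for s in range(n_states):
        if s in known:
            # Known states keep their original dynamics and rewards.
            P_ind.append([list(row) for row in P[s]])
            R_ind.append(list(R[s]))
        else:
            # Unknown states self-loop with the maximal reward.
            self_loop = [1.0 if t == s else 0.0 for t in range(n_states)]
            P_ind.append([list(self_loop) for _ in P[s]])
            R_ind.append([r_max for _ in R[s]])
    return P_ind, R_ind


# Toy MDP with 3 states and 2 actions; state 2 is treated as unknown.
P = [
    [[0.5, 0.5, 0.0], [0.0, 1.0, 0.0]],
    [[0.2, 0.3, 0.5], [1.0, 0.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.3, 0.3, 0.4]],
]
R = [[0.1, 0.0], [0.4, 0.2], [0.0, 0.3]]
P_ind, R_ind = induced_mdp(P, R, known={0, 1})
```

Under this convention the induced MDP is optimistic about unknown states, which is what drives the agent either to exploit or to escape the known set.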
Lemma A.2 (Simulation lemma for episodic MDPs).
Let and be two MDPs with the same state-action space. If the transition dynamics and the reward functions of the two MDPs are such that
then, for every (nonstationary) policy the two MDPs satisfy this property:
Proof.
Consider to be the set of all trajectories of length and let denote the probability of observing trajectory in with the behaviour policy . Further, let denote the expected average reward obtained for trajectory in MDP .
The bound for the second term follows from the proof of Lemma 8.5.4 in [11]. Combining the two expressions, we get the desired result. ∎
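A simulation-lemma-type statement can be checked numerically on a small example. The sketch below is illustrative only: it builds two random episodic MDPs with rewards in [0, 1], measures the largest L1 gap between their transition rows (`eps_p`) and the largest reward gap (`eps_r`), and compares the difference in expected average per-step reward of a random non-stationary policy against `eps_r + H * eps_p`. That particular form of the bound is an assumption used for the check (it follows from a standard backward-induction argument) and may differ in constants from the lemma above. All names and parameter values are hypothetical.

```python
import random


def random_dist(n, rng):
    """Random probability vector of length n."""
    w = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]


def avg_value(P, R, policy, H, start):
    """Expected average per-step reward of a non-stationary policy over H steps."""
    n = len(P)
    V = [0.0] * n  # value-to-go after the final step
    for h in reversed(range(H)):
        V = [R[s][policy[h][s]]
             + sum(p * v for p, v in zip(P[s][policy[h][s]], V))
             for s in range(n)]
    return V[start] / H


rng = random.Random(0)
S, A, H, alpha, noise = 4, 2, 5, 0.05, 0.02

P1 = [[random_dist(S, rng) for _ in range(A)] for _ in range(S)]
R1 = [[rng.random() for _ in range(A)] for _ in range(S)]

# Perturbed MDP: mix each transition row with a fresh random row and
# jitter the rewards, clipping them back into [0, 1].
P2 = [[[(1 - alpha) * p + alpha * q
        for p, q in zip(P1[s][a], random_dist(S, rng))]
       for a in range(A)] for s in range(S)]
R2 = [[min(1.0, max(0.0, R1[s][a] + rng.uniform(-noise, noise)))
       for a in range(A)] for s in range(S)]

eps_p = max(sum(abs(p - q) for p, q in zip(P1[s][a], P2[s][a]))
            for s in range(S) for a in range(A))
eps_r = max(abs(R1[s][a] - R2[s][a]) for s in range(S) for a in range(A))

# Random deterministic non-stationary policy: policy[h][s] is an action.
policy = [[rng.randrange(A) for _ in range(S)] for _ in range(H)]
gap = max(abs(avg_value(P1, R1, policy, H, s) - avg_value(P2, R2, policy, H, s))
          for s in range(S))
bound = eps_r + H * eps_p
print(gap <= bound)
```

The check evaluates both MDPs exactly by backward induction rather than by sampling trajectories, so the comparison does not depend on Monte Carlo noise.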
Lemma A.3 (Induced inequalities).
Let be an MDP with being the set of known states. Let be the induced MDP as defined in Definition A.1 with respect to and . Then, for any (nonstationary) policy and all states ,
and
where denotes the value of policy in MDP when starting from state .
Proof.
See Lemma 8.4.4 in [11]. ∎
Corollary A.4 (Implicit Explore and Exploit).
Let be an MDP with as the set of known states and be the induced MDP. If and are the optimal policies for and respectively, we have for all states :
Proof.
Follows from Lemma 8.4.5 in [11]. ∎
Proof of Lemma 3.3.
Let be the optimal policy for . Also, using the assumption about , we have an approximation of as the MDP . Rmax computes the optimal policy for which is denoted by . Then, by Lemma A.2,
Combining this with Lemma A.3, we get
If this escape probability is less than , then the desired relation is true. Therefore, we need to bound the number of episodes where this expected number is greater than . Note that, due to balanced wandering, we can have at most visits to unknown states for the Rmax algorithm. In the execution, we may encounter an extra visits as the estimates are updated only after the termination of an episode.
Whenever this quantity is more than , the expected number of exploration steps in such episodes is at least . By Hoeffding's inequality, for episodes, with probability at least , the number of successful exploration steps is greater than
Therefore, if , with probability at least , the total number of visits to an unknown state is more than . Using the upper bound on such visits, we conclude that these many episodes suffice. ∎
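The counting argument at the end of this proof can be illustrated with a small calculation. The sketch below is not the paper's exact bound: it assumes each exploration episode is an independent Bernoulli trial that succeeds with probability at least `p`, applies the one-sided Hoeffding inequality to lower-bound the number of successes among `m` episodes, and then scans for the smallest `m` whose lower bound reaches a target number of visits. The function names and the values of `target`, `p`, and `delta` are hypothetical.

```python
import math


def exploration_lower_bound(m, p, delta):
    """Hoeffding lower bound on the number of successes in m trials.

    With probability at least 1 - delta, a sum of m independent
    [0, 1]-valued variables with mean at least p exceeds
    m * p - sqrt(m * ln(1 / delta) / 2).
    """
    return m * p - math.sqrt(0.5 * m * math.log(1.0 / delta))


def episodes_needed(target, p, delta):
    """Smallest m whose Hoeffding lower bound reaches `target` (linear scan)."""
    m = 1
    while exploration_lower_bound(m, p, delta) < target:
        m += 1
    return m


# Hypothetical numbers: 50 required visits, escape probability 0.1,
# failure probability 0.05.
m = episodes_needed(target=50, p=0.1, delta=0.05)
print(m, exploration_lower_bound(m, p=0.1, delta=0.05))
```

Once the number of guaranteed successes exceeds the cap on visits to unknown states, no further exploration episodes are possible, which is how such a calculation turns a visit cap into an episode cap.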
A.2 Proof of Theorem 3.2
We now need to compute the required resolution of the cover and the number of transitions which will guarantee the approximation for the value functions as required in the previous lemma. The following lemma is the key result:
Lemma A.5 (Cover approximation).
For a given CMDP and a finite cover, i.e., such that :
and
if we visit every state-action pair times in a ball summing observations over all , then, for any policy and with probability at least , the approximate MDP corresponding to computed using empirical averages will satisfy
for all .
Proof.
For each visit to a state-action pair , we observe a transition to some for context in visit with probability . Let us encode this by an dimensional vector with at all indices except . After observing such transitions, we create the estimate for any as . Now for all ,
For bounding the first term, we use Hoeffding's bound:
Therefore, with probability at least , for all , we have: