Contextual Markov Decision Processes

02/08/2015 ∙ by Assaf Hallak, et al. ∙ 0

We consider a planning problem where the dynamics and rewards of the environment depend on a hidden static parameter referred to as the context. The objective is to learn a strategy that maximizes the accumulated reward across all contexts. The new model, called Contextual Markov Decision Process (CMDP), can model a customer's behavior when interacting with a website (the learner). The customer's behavior depends on gender, age, location, device, etc. Based on that behavior, the website's objective is to determine customer characteristics and to optimize the interaction with the customer. Our work focuses on one basic scenario: a finite horizon with a small known number of possible contexts. We suggest a family of algorithms with provable guarantees that learn the underlying models and the latent contexts, and optimize the CMDPs. Bounds are obtained for specific naive implementations, and extensions of the framework are discussed, laying the ground for future research.




1 Introduction

Markov Decision Processes (MDPs) are commonly used to describe dynamic behavior in multiple fields such as signal processing, robotics, games, advertising, health and queue management (Puterman, 2005; White, 1993). When multiple trajectories are observed from a single source, a natural question is the following: “Does each observed trajectory follow the same transition probabilities?” When the answer is affirmative, these transitions can be estimated through standard maximum likelihood estimation (Boas, 2006), and many techniques exist for different setups, most notably Hidden Markov Models (HMMs; Elliott et al., 1995) in modeling and Partially Observed Markov Decision Processes (POMDPs; Aberdeen, 2003) in control.

However, in many applications there are additional exogenous variables that affect the model. We refer to these variables collectively as the context. For example, the temporal behavior of sugar levels for diabetes patients is largely influenced by their age and gender. Similarly, humidity measurements are greatly affected by the geographical location of the measurement device. Since these context variables do not change within each measurement, the standard solution of incorporating them into the state creating a much larger MDP or POMDP seems faulty as it reduces the generalizing power of the model. Specifically, incorporating static features into the state forms distinct unconnected dynamic chains. As transition probability between states with different contexts is always zero, a more compact model would be separate transition matrices for each context instead of one double sized matrix.
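To make the compactness argument concrete, the following Python sketch folds a static context into the state and inspects the resulting matrix. The two 2-state chains and their numbers are hypothetical, chosen only for illustration; the helper name `augment_with_context` is likewise an assumption of this sketch.

```python
def augment_with_context(per_context):
    """Stack per-context transition matrices into one block-diagonal matrix.

    per_context: list of K square matrices (lists of rows), one per context.
    Augmented state index = context_index * n + state_index.
    """
    k, n = len(per_context), len(per_context[0])
    big = [[0.0] * (k * n) for _ in range(k * n)]
    for c, mat in enumerate(per_context):
        for x in range(n):
            for y in range(n):
                big[c * n + x][c * n + y] = mat[x][y]
    return big


# Hypothetical 2-state dynamics for two contexts (illustrative numbers only).
P_a = [[0.9, 0.1], [0.2, 0.8]]
P_b = [[0.3, 0.7], [0.6, 0.4]]
P_aug = augment_with_context([P_a, P_b])

# Entries stored when keeping one matrix per context vs. one augmented matrix.
stored_separately = sum(len(m) * len(m[0]) for m in (P_a, P_b))
stored_augmented = len(P_aug) * len(P_aug[0])
# Probability mass leading from context-a states to context-b states.
cross_context_mass = sum(P_aug[x][y] for x in range(2) for y in range(2, 4))
```

The augmented matrix stores twice as many entries as the two separate matrices, and every entry connecting states of different contexts is zero, which is the redundancy the paragraph above points to.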

1.1 Motivation for Contextual Dynamics

A real world example for latent context learning is the problem of identifying the user. Consider a large content website. Such a website has two main activities: (a) suggesting relevant content to its users and (b) presenting alluring ads for profit. Current methodologies that determine the relevance of the content and the ads require the user profile: age, gender, income level, device, location, etc. Usually, in order to determine whether a certain user is revisiting the website, mechanisms such as (HTTP) cookies are used. But in many cases these mechanisms are insufficient. What if the website does not have any prior information about the user (also known as the cold start problem; Kohrs & Merialdo 2001)? Can we learn the user’s age or gender by observing his interaction with the website? In other words, given a trajectory of the pages visited by the user can we predict (cluster or classify) the user’s profile? And more importantly, can we take advantage of such clustering and tailor the policy to the user?

This type of problem also arises in scenarios where we have information about the owner of a device, but several users use it and we want to identify them (such as children using their parents’ tablets). In this work we suggest modeling the user interaction in a Markovian fashion in order to identify the user (Meyn & Tweedie, 2009).

A more elaborate scenario is when the user has been identified, and we want to optimize the content and ads presented to him, where the optimization criterion is, for instance, maximizing the user’s time spent on the website. In such cases, we model the interaction of a user as a Markov Decision Process where different user groups may be modeled and optimized according to their context. In on-line advertising, solutions to such optimization problems are highly valuable, since the correct identification of users leads to higher click-through rates (CTRs; Richardson et al. 2007). Hence, the ultimate goal is on-line learning of an optimal control when both the context and the model’s parameters are unknown. Notice that a sub-goal in this case is the one described above: learning the underlying Markov dynamics.

Our work’s main contribution is presenting a general algorithm with provable guarantees for the finite horizon episodic contextual MDP setup. Considering a specific implementation, we provide regret analysis and empirical parametric sensitivity analysis. Additionally, we discuss two applicative extensions of the model: the case of infinitely many contexts, and the concurrent Reinforcement Learning (RL) setting (Silver et al., 2013). The reader should bear in mind that the solutions suggested are preliminary and that our focus is on presenting the problems along with their derived trade-offs, as well as setting the bar for future research.

1.2 Related Literature

Many previous works are related to the setup presented in this paper. In Hidden Markov Models (HMMs; e.g., Elliott et al. 1995) the Markovian state dynamics are latent, and the observed samples are transformations of the source’s output. Works by Wilson & Bobick 1999 and Radenen & Artières 2014 considered adding context to HMMs; however, the context in their model only affects the observation distribution and not the state dynamics.

A natural extension of HMMs to a control setting is POMDPs (Aberdeen, 2003). CMDPs can be modeled using POMDPs by setting the context variable to be the origin and each possible MDP as a distinct outgoing chain. However, POMDPs are too general and complex to capture the essence of the CMDP setup. In addition, POMDPs usually assume an underlying distribution on the contexts which we refrain from doing.

The notion of context was borrowed from closely related works in the Multi-Armed Bandits (MAB) literature (Sutton & Barto, 1998; Bubeck, 2012), called Contextual-MAB (Langford & Zhang, 2007; Lai & Robbins, 1985). The extension to the regular setting of MAB is that before the learner plays his turn, a context is presented to the learner. Another similar paper is by Maillard & Mannor (2014), describing a setup in which only the rewards depend on an unobserved latent variable. They consider three cases: the reward function and context are both known, the reward function is known but the context is not, and both are unknown.

Other related literature considers model selection in MDPs. Doya et al. (2002) propose an architecture for multiple model-based reinforcement learning (MMRL). Their approach decomposes a complex task into multiple domains in time and space and uses a responsibility signal to weigh the outputs of multiple models and to gate the learning of the prediction models and controllers. Hence, the responsibility signal measures how to mix different models such that various areas in the state space can be more easily modeled. A similar approach was used in learning meta-parameters of motor skills (Kober et al., 2012). In our work, we try to identify a single source that fits the entire space.

Another relevant problem is that of on-line representation learning (Nguyen et al., 2013; Maillard et al., 2011), dealing with finding the best state space while interacting with the environment. Differently, in the CMDP setup all models share the same state space. Finally, Kiseleva et al. (2013) consider contextual MDPs, but from a purely applicative perspective–they relate directly to web advertising by modeling user types.

We conclude with a short comparison of CMDPs with other known extensions of the MDP model:

  1. In Contextual HMMs the context affects only the observation distribution, and there is no control.

  2. POMDPs are a more complex structure generalizing CMDPs. Since in our case the hidden parameters are constant over time a simpler solution might exist. In addition, some distribution over contexts is assumed.

  3. In Multi-model RL the dynamics and rewards are composed of a convex combination of several models, meaning that in each trajectory there can be more than one valid model.

  4. The problem of state representation is that of finding a suitable state space for given observations. In our case the state space is the same for all models, allowing more efficient solutions.

  5. Models described by robust MDPs (Nilim & El Ghaoui, 2005; Wiesemann et al., 2013) consider uncertainty in the transitions and rewards. CMDPs can be viewed as such, where the uncertainty is not rectangular (around each state-action pair) but singular–determining one transition sets all of them.

1.3 Paper Structure

The paper is organized as follows: In Section 2 we formally define the CMDP setting, compare it to other models from the literature and introduce the general setup. Section 3 describes the problem in more detail and presents a general form algorithm to solve it. One specific instance of the algorithm is analyzed, and some possible extensions are then presented. In Section 4, we provide experiments and discuss trade-offs in our setup. Finally, in Section 5 we conclude and lay down future directions.

2 Contextual Markov Decision Processes

We begin with defining a standard Markov Decision Process (MDP; Puterman 2005).

Definition 1.

(MDP Setup) A Markov Decision Process is a tuple $(\mathcal{S}, \mathcal{A}, p(y \mid x, a), r(x), \xi_0)$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p(y \mid x, a)$ is the transition probability ($x, y \in \mathcal{S}$, $a \in \mathcal{A}$), $r(x)$ is a reward function, and $\xi_0$ is the initial state distribution.

Given a deterministic horizon $T$, the learner-interaction is as follows. At the beginning of each episode, an initial state $x_1$ is chosen according to the initial state distribution $\xi_0$. Afterwards, for each $t = 1, \ldots, T$, the learner chooses an action $a_t$ according to a policy $\pi$, where $a_t = \pi(x_t)$. We note that the policy may be a random function. The environment provides a reward $r(x_t)$ and the next state $x_{t+1}$ is (randomly) chosen according to $p(\cdot \mid x_t, a_t)$. In general, the learner’s goal is to maximize the following value function:

$$V^{\pi} = \mathbb{E}\left[\sum_{t=1}^{T} r(x_t)\right],$$

where the expectation is taken over trajectories with respect to the policy $\pi$ and the initial distribution $\xi_0$.

When the MDP parameters are given, the problem of finding the policy which maximizes cumulative reward is known in the literature as planning (Puterman, 2005; Bertsekas & Tsitsiklis, 1995). When the MDP parameters are unknown in advance, finding the best policy is known as Adaptive Control or Reinforcement Learning (RL; Puterman 2005; Bertsekas & Tsitsiklis 1995).
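As a minimal sketch of the planning case, the Python code below runs backward induction over a finite horizon. It assumes state-only rewards and a tabular transition array `P[a][x][y]`; the function name and the toy 2-state MDP are hypothetical and serve only as an illustration.

```python
def finite_horizon_plan(P, r, T):
    """Backward induction for a finite-horizon MDP.

    P[a][x][y]: probability of moving from state x to y under action a.
    r[x]: reward collected in state x (state-only rewards are an assumption).
    Returns (V, policy): V[t][x] is the optimal expected reward collected
    from step t to the end, policy[t][x] the maximizing action.
    """
    n_actions, n_states = len(P), len(r)
    V = [[0.0] * n_states for _ in range(T + 1)]  # V[T] = 0: nothing left
    policy = [[0] * n_states for _ in range(T)]
    for t in range(T - 1, -1, -1):
        for x in range(n_states):
            best_a, best_q = 0, float("-inf")
            for a in range(n_actions):
                q = r[x] + sum(P[a][x][y] * V[t + 1][y] for y in range(n_states))
                if q > best_q:
                    best_a, best_q = a, q
            V[t][x], policy[t][x] = best_q, best_a
    return V, policy


# Hypothetical 2-state, 2-action MDP: action 0 stays put, action 1 switches.
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
r = [0.0, 1.0]
V, policy = finite_horizon_plan(P, r, T=3)
```

In this toy example the planner switches out of the zero-reward state and stays in the rewarding one. When `P` is unknown, the same routine can only be applied to an estimate of it, which is where the learning problem begins.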

The following definition establishes the extended model considered in the paper, denoted by Contextual MDPs.

Definition 2.

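A Contextual Markov Decision Process (CMDP) is a tuple $(\mathcal{C}, \mathcal{S}, \mathcal{A}, \mathcal{M})$ where $\mathcal{C}$ is called the context space, $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces respectively, and $\mathcal{M}$ is a function mapping any context $c \in \mathcal{C}$ to an MDP $\mathcal{M}(c) = (\mathcal{S}, \mathcal{A}, p^{c}(y \mid x, a), r^{c}(x), \xi_0^{c})$.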

Essentially, a CMDP is simply a set of models sharing the same state and action spaces.

The simplest scenario in a CMDP setting is when the context is observable. In this setting, the problem reduces to correctly generalizing the model from the context. If the observable context space is finite with $|\mathcal{C}| = K$, then with no further assumption, one can simply learn $K$ different models.

An interesting problem arises when $K$ scales with the number of sampled trajectories. For instance, consider the problem of targeted advertising: Given behavioral patterns and side information of many customers, companies usually seek to group the consumers so they can target their needs and habits. Since side information usually resides in a very large set (for example, the cross-product of gender, age, etc.), in practice it is aggregated, with the number of clusters depending on the amount of available data.

The model aggregation problem is not considered in this work, and instead we focus on latent contexts for the rest of the paper. Additionally, we assume the initial state distribution and rewards are context independent, maintaining the hardness of the problem while greatly simplifying the writing. Finally, we adopt the common bounded-reward assumption.

2.1 General Setup

We define the general setup as follows: The context space $\mathcal{C}$ consists of $K$ possible contexts. The time axis is divided into episodes, denoted by $h = 1, 2, \ldots$. In the beginning of each episode, the environment chooses a context $c_h \in \mathcal{C}$ (in a random, adversarial or any other fashion). Afterwards, an initial state is randomly chosen according to the initial state distribution $\xi_0$. A trajectory of length $T_h$ is generated, where $T_h$ is a stopping time (Meyn & Tweedie, 2009). Then, for the chosen MDP $\mathcal{M}(c_h)$ an interaction as described in Definition 1 is applied until the end of the trajectory.
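The episodic interaction can be sketched in Python as follows. This is an illustrative simulation under stated assumptions: contexts are drawn uniformly at random (the setup also allows adversarial choices), the trajectory length is fixed, rewards depend on the state only, and all names and numbers are hypothetical.

```python
import random


def sample_index(probs, rng):
    """Draw an index according to a discrete distribution."""
    u, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off


def run_episode(models, context, init_dist, r, policy, T, rng):
    """Generate one trajectory from the MDP of the (hidden) context.

    models[c][a][x][y] holds the transition probabilities of context c.
    The learner only sees the returned (state, action, reward) triples.
    """
    P = models[context]
    x = sample_index(init_dist, rng)
    trajectory = []
    for _ in range(T):
        a = policy(x, trajectory)
        trajectory.append((x, a, r[x]))
        x = sample_index(P[a][x], rng)
    return trajectory


rng = random.Random(0)
# Two hypothetical contexts over 2 states and 1 action (illustrative numbers).
models = [
    [[[0.9, 0.1], [0.1, 0.9]]],   # context 0: sticky chain
    [[[0.1, 0.9], [0.9, 0.1]]],   # context 1: alternating chain
]
r = [0.0, 1.0]
episodes = []
for h in range(5):
    c_h = rng.randrange(len(models))          # environment picks the context
    traj = run_episode(models, c_h, [0.5, 0.5], r, lambda x, hist: 0, 10, rng)
    episodes.append(traj)                      # c_h is NOT stored: it is latent
```

The learner's data consist only of `episodes`; the context index used to generate each trajectory never appears in them.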

3 Problem Definition and Solution

We assume a small finite $K$, and that the trajectory length $T_h$ is bounded almost surely, denoting this setup as finite sources episodic CMDP. The goal is maximizing the cumulative reward from all trajectories up to the $H$’th trajectory, for increasing $H$. Therefore, we measure performance with respect to $H$. Ideally, a good policy should optimize the trade-off between exploration and exploitation of the current chain. However, unlike the standard RL setup, the exploration in this case should consider not only the model’s parameters, but also the hidden context.

We measure our performance with the notion of regret: the difference between the cumulative reward and the cumulative reward obtained by an agent satisfying some optimality property. For example, in infinite horizon RL the cumulative discounted reward is compared against an agent with knowledge of the true model who can therefore start from the optimal policy (Auer et al., 2009); the faster the regret bound converges to zero as the number of steps grows, the better.

Similarly, we compare ourselves to the all-knowing agent applying the optimal policy for the correct context at each trajectory. In our setup, since $T_h$ is bounded, the regret is evaluated mainly with respect to the number of trajectories $H$. Notice, though, that in each new trajectory some loss is guaranteed until the correct context is identified. Therefore, the regret will always be linear in $H$. When there is some prior distribution over contexts, a different optimal agent can be chosen, namely one that follows the solution of the resulting POMDP, but there may be other appropriate choices. The problem of redefining the regret to obtain more meaningful bounds is left for future research.

Definition 3.

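For the problem of finite sources episodic CMDP we define the regret over $H$ trajectories to be:

$$R(H) = \sum_{h=1}^{H} \left( V^{*}_{c_h}(T_h) - \sum_{t=1}^{T_h} r_{h,t} \right),$$

where $V^{*}_{c_h}(T_h)$ is the optimal value function in $T_h$ steps for the context $c_h$ chosen in the $h$’th trajectory, and $r_{h,t}$ is the reward obtained by the agent in the $h$’th trajectory at the $t$’th step.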
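A small Python sketch of this bookkeeping, under the assumption that the optimal value of each trajectory's context is available for evaluation (as it would be in a simulation); the function name and numbers are hypothetical.

```python
def regret(optimal_values, rewards):
    """Regret over H trajectories.

    optimal_values[h]: optimal value for the context of trajectory h over
        its length.
    rewards[h]: list of per-step rewards the agent actually collected.
    """
    return sum(v - sum(rs) for v, rs in zip(optimal_values, rewards))


# Two hypothetical trajectories, each losing one unit of reward.
example = regret([3.0, 2.0], [[1.0, 1.0, 0.0], [1.0, 0.0]])

# If every trajectory loses a fixed amount while the context is being
# identified, the regret grows linearly with the number of trajectories.
linear = [regret([3.0] * H, [[1.0, 1.0, 0.0]] * H) for H in (1, 2, 4)]
```

The second computation illustrates why a per-trajectory identification cost translates into regret that is linear in the number of trajectories.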

In order to solve the problem of regret minimization we introduce the CECE general framework (Cluster-Explore-Classify-Exploit), which partitions the trajectories into mini-batches. In the beginning of each mini-batch, all previously seen trajectories are used to form distinct models through Algorithm 1 (Cluster). Then, for each new trajectory in the current mini-batch the agent generates a partial trajectory using Algorithm 2 (Explore). The partial trajectory is then classified to a context by Algorithm 3 (Classify). Finally, Algorithm 4 sets the policy for the remainder of the trajectory (Exploit). In summary:

  Alg. 1: Cluster observed trajectories into models.

  Alg. 2: Explore the context.

  Alg. 3: Classify the partial trajectory to a model.

  Alg. 4: Exploit the identified model.
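The control flow of the four steps can be sketched as a Python skeleton with pluggable components. The function signatures are assumptions made for this illustration, and the usage example plugs in trivial stubs rather than real algorithms.

```python
def cece(batches, cluster, explore, classify, exploit):
    """Skeleton of the Cluster-Explore-Classify-Exploit loop.

    batches: iterable of mini-batches; each mini-batch is a list of
        environment handles (one per new trajectory).
    cluster(history) -> models                (Alg. 1)
    explore(env) -> partial trajectory        (Alg. 2)
    classify(partial, models) -> model index  (Alg. 3)
    exploit(env, model) -> rest of trajectory (Alg. 4)
    """
    history = []
    for batch in batches:
        models = cluster(history)          # re-cluster once per mini-batch
        for env in batch:
            partial = explore(env)
            k = classify(partial, models)
            rest = exploit(env, models[k])
            history.append(partial + rest)
    return history


# Usage with stub components that only record what happened.
calls = []
hist = cece(
    batches=[["e1", "e2"], ["e3"]],
    cluster=lambda h: calls.append(len(h)) or ["m0"],
    explore=lambda env: [env + ":explore"],
    classify=lambda partial, models: 0,
    exploit=lambda env, model: [env + ":" + model],
)
```

Clustering runs once per mini-batch on everything seen so far, so the stub records history sizes 0 and 2 for the two batches. In a real instance the first mini-batch would need enough trajectories for the clustering step to be meaningful.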

The following assumptions and theorem guarantee CECE’s performance:

Definition 4.

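1. Let $M_1 = (\mathcal{S}, \mathcal{A}, p_1(y \mid x, a), r(x), \xi_0)$ and $M_2 = (\mathcal{S}, \mathcal{A}, p_2(y \mid x, a), r(x), \xi_0)$ be two MDPs with the same state space, action space, rewards and initial state distribution. We define $M_1$ to be an $\epsilon$-approximated model of $M_2$ if for every state-action pair $(x, a)$:

$$\left\| p_1(\cdot \mid x, a) - p_2(\cdot \mid x, a) \right\| \le \epsilon.$$

2. Let $(\mathcal{C}_1, \mathcal{S}, \mathcal{A}, \mathcal{M}_1)$ and $(\mathcal{C}_2, \mathcal{S}, \mathcal{A}, \mathcal{M}_2)$ be two CMDPs with the same state and action space satisfying $|\mathcal{C}_1| = |\mathcal{C}_2|$. We define the first to be an $\epsilon$-approximated CMDP of the second if there exists a matching between the contexts such that for every $c \in \mathcal{C}_1$, matched to $c' \in \mathcal{C}_2$, we have that $\mathcal{M}_1(c)$ is an $\epsilon$-approximated model of $\mathcal{M}_2(c')$.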
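The two parts of the definition can be checked numerically with a short Python sketch. It assumes the distance between next-state distributions is measured in the L1 norm (an assumption of this example) and finds the matching by brute force over permutations, which is only reasonable for a small number of contexts. All names and numbers are hypothetical.

```python
from itertools import permutations


def model_gap(P1, P2):
    """Largest L1 distance between next-state distributions over all (a, x).

    The L1 norm is an assumption of this sketch.
    """
    return max(
        sum(abs(p - q) for p, q in zip(P1[a][x], P2[a][x]))
        for a in range(len(P1))
        for x in range(len(P1[a]))
    )


def is_eps_approx_cmdp(est, true, eps):
    """True if some matching of contexts makes every matched pair eps-close."""
    if len(est) != len(true):
        return False
    return any(
        all(model_gap(est[i], true[j]) <= eps for i, j in enumerate(perm))
        for perm in permutations(range(len(true)))
    )


# True models (one action, two states) and estimates listed in swapped order.
true_models = [
    [[[0.9, 0.1], [0.1, 0.9]]],
    [[[0.1, 0.9], [0.9, 0.1]]],
]
estimates = [
    [[[0.15, 0.85], [0.9, 0.1]]],   # close to true_models[1]
    [[[0.9, 0.1], [0.1, 0.9]]],     # equal to true_models[0]
]
loose = is_eps_approx_cmdp(estimates, true_models, eps=0.2)
tight = is_eps_approx_cmdp(estimates, true_models, eps=0.05)
```

The estimates are accepted at the looser tolerance despite being listed in a different order than the true models, since only the existence of a matching matters.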

Assumption 1.

Let be some constant number of trajectories. For every there exists , such that after applying Algorithm 1 on trajectories, with probability at least the estimated -models form an -approximated CMDP of the true CMDP.

Assumption 1 guarantees that having enough trajectories will drive Algorithm 1 to output an approximated model for each context. It carries a hidden assumption that all contexts were observed enough times. Since there is some probability of error, the clustering procedure must be repeated when more trajectories are presented to ensure diminishing regret; that is the reason a mini-batch scheme is applied.

Assumption 2.

For every , there exists such that given an -approximated CMDP, after applying Algorithms 2 and 3 the correct context is identified with probability at least . In addition, the number of steps taken is a stopping time denoted by .

This assumption assures us that each trajectory will be classified correctly with high probability, which will guarantee good performance for exploitation in the next step. Moreover, the stopping time represents the number of samples needed to differentiate between the models.

Assumption 3.

Given an -approximated model, Algorithms 4 obtains .

Assumption 3 establishes the regret provided by Algorithm 4 when the models are well-approximated.

Theorem 1.

Let be the number of trajectories in the ’th mini-batch. Then if Assumptions 1, 2, 3 hold, CECE achieves in the ’th mini-batch:


where and .

The proof is a straightforward combination of the given assumptions.

3.1 Discussion

Notice that in order for Assumption 1 to hold with a meaningful approximation error, each model must be observed sufficiently by the time the size of the first mini-batch is set. This fact should be added as an additional assumption depending on the specific realization of Algorithm 1. Supposedly the subsequent mini-batch sizes can be chosen arbitrarily small, utilizing information from new trajectories as soon as it is available. Yet, Algorithm 1 may be computationally expensive, making larger mini-batches preferable in practice. Another possible approach to this trade-off is to apply on-line clustering (Ailon et al., 2009).

In essence, Algorithm 1 is a form of Multiple Model Learning (MML) algorithm (Vainsencher et al., 2013): each trajectory is a sample from an unknown model (context) and the goal is learning all models simultaneously. It could also be reduced to a clustering problem, where each trajectory is represented as the vector of entries of its empirical transition matrix. Some information is lost in this process: the number of samples from each state-action pair in the trajectory is ignored despite its effect on the variance around the sampled distribution. So, ideally each trajectory should be reduced to a point with varying variance across dimensions, which gets smaller for longer trajectories.
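A minimal Python sketch of this reduction is given below. It keeps the visit counts alongside the estimate, since those counts are exactly the information a plain vectorization discards. Filling unvisited state-action pairs with a uniform row is an arbitrary choice of this sketch, and all names are hypothetical.

```python
def empirical_transitions(trajectory, n_states, n_actions):
    """Estimate P[a][x][y] and visit counts from (state, action) pairs.

    trajectory: list of (state, action) pairs in time order.
    Unvisited (x, a) pairs get a uniform row (an arbitrary choice here).
    """
    counts = [[[0] * n_states for _ in range(n_states)] for _ in range(n_actions)]
    for (x, a), (y, _) in zip(trajectory, trajectory[1:]):
        counts[a][x][y] += 1
    P, visits = [], []
    for a in range(n_actions):
        P.append([])
        visits.append([])
        for x in range(n_states):
            n = sum(counts[a][x])
            visits[a].append(n)
            row = [c / n for c in counts[a][x]] if n else [1.0 / n_states] * n_states
            P[a].append(row)
    return P, visits


def vectorize(P):
    """Flatten P[a][x][y] into one point for a standard clustering routine."""
    return [p for per_action in P for row in per_action for p in row]


# A short hypothetical trajectory over 2 states and a single action.
traj = [(0, 0), (1, 0), (0, 0), (0, 0), (1, 0)]
P_hat, visits = empirical_transitions(traj, n_states=2, n_actions=1)
point = vectorize(P_hat)
```

Here state 0 was left three times and state 1 only once, so the two rows of `P_hat` have very different reliability even though `point` treats all coordinates alike.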

Subsequently, one may question whether the clustering error probability can converge to zero for infinitely many trajectories. In our setup, as the trajectory length grows the trajectories become more distinct, but the length is bounded almost surely. So even for a large number of trajectories, there would be at least some constant portion of the trajectories acting as outliers of the model they originated from, possibly tainting the clusters. One way to solve this issue is through outlier-robust clustering (for example, K-median; Har-Peled & Mazumdar 2004).

Next, consider the effect of the trajectory length on the hardness of the problem. When the trajectory length is very large, it is much more important to recognize the correct model. Since Algorithm 4 (exploitation) is applied for a longer duration, it could include an exploratory part to obtain a better model while running the trajectory, in addition to shielding against wrongful classification.

The other extreme case is when the trajectory length is too small to determine the correct model with high probability. Assuming the models can still be approximated, one reasonable solution would be to try and optimize the worst case performance over all models. This approach is closely related to the problem of Robust MDPs (Nilim & El Ghaoui, 2005), a formulation of MDPs with uncertainty in the transitions and rewards. When the uncertainty set is rectangular an efficient solution exists. However, in our case it is singular: setting one transition probability is the same as setting the context along with its related transition matrix; thus the problem is intractable (Wiesemann et al., 2013).

When all trajectories are short, it might be impossible to provide an approximation of the true models. Consider for example the extreme case where only one transition is given: unless there is a stationary distribution over contexts, the models can be neither learned nor optimized. Subsequently, varied lengths pose another question: how confident are we in the clustering of each trajectory? Embedding short trajectories might inject more noise into the clustering process than it removes, so some selection is needed to ensure proper modeling. This question may relate to the notion of cluster separability (Ostrovsky et al., 2006): short trajectories can lead to non-separable models that cannot be learned through clustering.

A rather simple realization of Algorithm 2 (exploration) is to apply a fixed policy until some condition is fulfilled. One may consider what is the policy which will achieve this condition with as few steps as possible (since the regret is linear in the number of exploration steps).

For instance, if there are only two models a logical approach would be to choose actions maximizing the distinction between the models. However, this is non-optimal as actions have future consequences - a distinctive action for one state could lead the agent to an area of the state space which is very similar between the models.

A follow-up idea is using the original state and action space, and reshaping the rewards to award actions for distinguishing between the models. However, this solution is still problematic since the underlying transition probabilities are unknown and could be those of any of the possible models. Hence, finding a good exploration policy is an open question which we hypothesize to be as difficult as solving a singular Robust MDP.

Finally, consider the effect of an increasing number of possible contexts. More contexts increase both the size of the initial mini-batch required for clustering and the number of samples needed for model identification. The case of infinitely many models requires some changes in the algorithm, as discussed at the end of this section.

3.2 A Specific Instance

We present an example instance of CECE and substitute it into Assumptions 1, 2, 3. For simplicity, we assume the trajectory length is a constant $T$ for the remainder of the analysis. The proposed realization was chosen to be trivial to allow simple analysis; it is only a demonstration of the trade-offs in CMDPs and of CECE’s modularity.

Algorithm 1 is the following scheme:

  1. For each trajectory , and state action pair , estimate the transition probability by its empirical distribution.

  2. Go over all possible partitions of trajectories to sets , and minimize over the following score:


    where is the estimated transition probability for all trajectories in the cluster.

This scheme is highly inefficient as it performs an exhaustive search for the best partition. However, as a preliminary result all we require is for it to satisfy Assumption 1. There are polynomial time clustering algorithms with guarantees (for instance, Ostrovsky et al. 2006; Arthur & Vassilvitskii 2007), but their bounds and assumptions would have to be adjusted to our case.
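An exhaustive search of this kind can be sketched in Python as follows. The score used here, the within-cluster sum of squared Euclidean distances to the cluster mean, is an assumption of the sketch and is not claimed to be the score of the scheme above; the function names and sample points are hypothetical.

```python
def partitions(items, k):
    """Yield every way of splitting items into at most k non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest, k):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        if len(part) < k:
            yield [[first]] + part


def within_cluster_score(groups, points):
    """Sum of squared distances to each group's mean (assumed score)."""
    total = 0.0
    for g in groups:
        dim = len(points[g[0]])
        mean = [sum(points[i][d] for i in g) / len(g) for d in range(dim)]
        total += sum((points[i][d] - mean[d]) ** 2 for i in g for d in range(dim))
    return total


def exhaustive_cluster(points, k):
    """Return the best partition of point indices into exactly k groups."""
    idx = list(range(len(points)))
    candidates = (p for p in partitions(idx, k) if len(p) == k)
    return min(candidates, key=lambda p: within_cluster_score(p, points))


# Four vectorized empirical transition rows from two hypothetical contexts.
points = [[0.9, 0.1], [0.88, 0.12], [0.1, 0.9], [0.12, 0.88]]
best = exhaustive_cluster(points, k=2)
n_candidates = sum(1 for _ in partitions(list(range(4)), 2))
```

Even for four trajectories and two clusters the search enumerates eight partitions, and the count grows exponentially with the number of trajectories, which is the inefficiency noted above.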

In Algorithm 2, the uniform policy over actions is applied for a constant number of steps. As mentioned above, this procedure could be improved. For one, the total number of steps could be decided on-line according to the confidence. Moreover, there might be other exploration policies that could produce faster identification of the true model, or even combine exploitation in the strategy to generate overall smaller regret.

The proposed Algorithm 3 chooses the model obtaining the smallest distance between the set of models and the empirical transition matrix from the partial trajectory. Other possible methods include maximum likelihood, weighted or distance, and methods taking into account the cost of choosing a wrong model.
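A nearest-model classifier of this kind can be sketched in Python as below. The distance is taken to be the L1 distance between next-state distributions, summed over the state-action pairs actually visited in the partial trajectory; both choices are assumptions of this sketch, as are the names and numbers.

```python
def classify(partial_P, visits, models):
    """Return the index of the model closest to the partial-trajectory estimate.

    Distance: sum over visited (a, x) pairs of the L1 distance between
    next-state distributions (the choice of L1 is an assumption).
    """
    def dist(model):
        return sum(
            sum(abs(p - q) for p, q in zip(partial_P[a][x], model[a][x]))
            for a in range(len(model))
            for x in range(len(model[a]))
            if visits[a][x] > 0
        )
    return min(range(len(models)), key=lambda k: dist(models[k]))


# Two hypothetical models over 2 states and 1 action.
models = [
    [[[0.9, 0.1], [0.1, 0.9]]],   # sticky chain
    [[[0.1, 0.9], [0.9, 0.1]]],   # alternating chain
]
# Estimate from a partial trajectory that never left state 1.
first = classify([[[0.8, 0.2], [0.5, 0.5]]], [[5, 0]], models)
# Estimate from a partial trajectory that visited both states.
second = classify([[[0.2, 0.8], [0.7, 0.3]]], [[3, 4]], models)
```

Skipping unvisited pairs keeps placeholder rows from influencing the decision; a likelihood-based rule would weight each pair by its visit count instead.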

Lastly, Algorithm 4 was chosen naively to apply the exploitation policy with regard to the estimated model. A more sophisticated approach would be to consider an RL algorithm whose regret with respect to the trajectory length goes to zero. Since in our scenario the trajectory length is constant, the suggested solution is satisfactory.

We can now quote the necessary assumptions and resulting Corollary:

Assumption 4.

Let .

  1. By the ’th trajectory, each model was sampled at least times.

  2. For some , for every two contexts and : .

  3. In every trajectory, each state-action pair is visited at least times, and is large enough: .

The first part guarantees each model is sampled enough times for the classification to converge. The second part provides a constant difference between the models, such that with enough data the estimated models will be separable. The last part of the assumption is needed to make sure there are enough samples in each trajectory to learn the model. It can be guaranteed by requiring the trajectory to be long enough, assuming that the induced MDP is ergodic under the uniform policy.

Lemma 1.

If Assumption 4 holds, the described realization of Algorithms 1-4 satisfy Assumptions 1-3 with:


The full proof is available in Section C of the supplementary material.

Corollary 1.

If Assumption 4 holds, the described realization of Algorithms 1-4 achieves in the ’th mini-batch:


where .

Notice that each summand relates to a different error:

  1. The first summand corresponds to trajectory misclassification. It can point us to proper choice of : scaled with and the distance between models.

  2. The second summand corresponds to the context and model uncertainty. Large and are required to estimate each model well enough.

  3. The third summand corresponds to trajectory misclustering. It is the only error which diminishes with the number of trajectories, as the exponential multiplicative term converges to zero.

3.3 Extensions

There are other interesting extensions to the previous setup exhibiting different trade-offs. For one, consider the more complicated scenario where there is an infinite or unknown number of models. CECE’s mini-batch solution can be adjusted to this case by adding a probability to reject all models in Algorithm 3, but the clustering step will be much harder to evaluate in this case. Consequently, regret analysis requires a more precise setup, for example a bounded ratio between the number of contexts and trajectories, or some distribution over contexts.

A more natural setup in web advertising applications is the concurrent RL setup (Silver et al., 2013). Assume the agent interacts with multiple infinite horizon trajectories, where at each time step one trajectory (which may be new) requires an action. In the CMDP setup, each trajectory originates from a different latent context. The performance in this case should take into account both the length and number of trajectories.

A rather naive solution would be to employ some RL algorithm (for example, Q-learning; Watkins & Dayan 1992) in every trajectory, regardless of the other trajectories. This approach ignores information on the model obtained from other trajectories sharing the same context. Thus, if there are many short trajectories it could produce high regret.

A different solution is applying some variation of CECE’s scheme - in each time step in a trajectory: first cluster (Algorithm 1), and then either (a) choose an option which explores the context (Algorithm 2), or (b) classify the partial trajectory (Algorithm 3) and choose an action exploiting the context (Algorithm 4). Even though the trajectory length is unbounded, as long as more model samples are obtained from other trajectories the error in the exploitation phase decreases. Actual regret bounds for both approaches depend on the parameters and assumptions of the specified problem. When there are few long trajectories the first independent RL approach would prevail (with regret of ; Auer et al. 2009) , while many shorter trajectories are better dealt with a CECE variant (with regret of for equal probability contexts).

4 Experiments

In this section we discuss the trade-offs that exist in the CMDP setting. In the first experiment we test only the clustering part of CECE. We consider a CMDP with a fixed number of states and actions and with equal-probability contexts, where the transition matrix for each context was drawn from a uniform distribution. We generate trajectories of a constant length, sampling actions uniformly. For the purpose of scoring the clusters we calculate the entropy of each distribution over clusters for each correct context, and average the results according to the number of samples from that context. Thus, when the trajectories are perfectly clustered, for each context the entropy will be $0$ and so will be the average. The worst possible score results from independent clusters and contexts. The clustering algorithm we used in this case was $K$-means (Duda et al., 2012) on the vectorized empirical transition matrices; the results were averaged over repeated trials, with error bars of one standard deviation.
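The scoring rule can be sketched in Python as follows. The use of the natural logarithm is an assumption of this sketch (a different base only rescales the score), and the function name is hypothetical.

```python
import math
from collections import Counter


def clustering_entropy(contexts, clusters):
    """Sample-weighted average, over true contexts, of the entropy of the
    cluster labels assigned to that context's trajectories (0 = perfect)."""
    total = len(contexts)
    score = 0.0
    for c in set(contexts):
        labels = [k for ctx, k in zip(contexts, clusters) if ctx == c]
        counts = Counter(labels)
        h = -sum((n / len(labels)) * math.log(n / len(labels)) for n in counts.values())
        score += (len(labels) / total) * h
    return score


true_contexts = [0, 0, 1, 1]
perfect = clustering_entropy(true_contexts, [1, 1, 0, 0])      # labels permuted
independent = clustering_entropy(true_contexts, [0, 1, 0, 1])  # uninformative
```

A perfect clustering scores zero regardless of how the cluster labels are named, while cluster labels that are independent of the context give the largest score, here the entropy of a fair coin.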

We examined the following: (1) How long should trajectories be to obtain favorable clustering? (2) How does the quality of the clustering depend on the number of episodes, for various trajectory lengths? In the first part of the experiment (top plot in Figure 1) we generate trajectories and present the score as a function of the trajectory length. In the second part of the experiment (bottom plot in Figure 1), we generate trajectories of varying lengths and measure the score as a function of the number of episodes.

Figure 1: Experiment 1

We draw the following conclusions: (1) There is a phase transition in the clustering performance with respect to the trajectory length: below a certain threshold the clustering utterly fails, followed by a short adjustment period, after which the clustering succeeds almost certainly. (2) If the trajectories are too short, the clustering will fail even when increasing the number of episodes. (3) If the trajectories are sufficiently long, additional episodes improve the clustering quality (as implied by Lemma 1).

Next, we experimented with the full CECE algorithm. We simulated a CMDP with states, actions and contexts of equal probability. Each trial consists of episodes of length . The results were averaged over experiments. The parameter sets the portion of the trajectory time steps dedicated to identify the model, and was taken to be , . The learning policy employed by Algorithm 2 was taken to be uniform over all actions. The exploitation algorithm used is Q-learning (Bertsekas & Tsitsiklis, 1995).

We performed four experiments where in each of the experiments all the parameters excluding one were fixed. The average reward throughout the experiment is measured. The results are presented in Figure 2. On the top-left and bottom-right plots we can see how CECE behaves as the number of episodes and trajectory length increase. As more data are available, the average reward increases since the clustering phase performs better and the models are better learned. Similarly, the average reward decreases as more models are introduced (top-right plot) since it is harder to cluster and learn each model. Notice that for constant proportion there will always be a difference between the optimal and the achieved value due to the identification phase.

Figure 2: Experiment 2

An interesting result is presented in the bottom-left plot. The parameter describes the portion of samples taken to identify the correct model. The resulting plot represents the exploration-exploitation trade-off for our suggested model: how many samples are used to identify the correct model versus how many are used to optimize the CMDP.

5 Conclusions and Future Work

In this work we presented a new framework for modeling multiple Markovian sources with sequential decision making. While our models can be encompassed by existing models (e.g., POMDPs; Aberdeen 2003), the proposed setup offers much flexibility in modeling both observable and latent static contexts while maintaining computational tractability. We demonstrated that under certain conditions one can overcome two fundamental problems: (1) learning the model parameters, and (2) optimizing the actions online within an RL framework. We suggested and analyzed basic algorithms for the case where the number of contexts is finite.

This paper is but a first step in developing the contextual MDPs framework. Since CECE is a modular solution its performance can be improved by independent upgrades to its building blocks, such as:

  1. The clustering techniques we used are somewhat inefficient and do not consider the confidence of each trajectory.

  2. Data- and model-dependent learning policies could possibly classify the trajectory in fewer steps.

  3. Reward-oriented context classification can lead to improved overall regret.

  4. Incorporating context exploration in the exploitation phase hedges against misclassification.

There are other schemes to solve CMDPs. A rather similar approach is to combine the Exploration-Classification-Exploitation steps to form a belief over models and solve accordingly (like MMRL; Doya et al. 2002). Another reasonable approach, when there is some distribution over contexts, is to model the problem as a POMDP (Aberdeen, 2003) and then learn and optimize it. Finally, it is possible to view the optimization problem as a robust MDP (Nilim & El Ghaoui 2005), where the uncertainty is over which model the data come from. While solving the resulting robust MDP directly is computationally hard, a rectangular relaxation could possibly be used to provide an approximate result; one future direction is to investigate this approximation.
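The belief-over-models idea can be illustrated with a single Bayes step over a finite set of candidate transition models. This is a hedged sketch and not MMRL itself: the function `update_belief`, the array layout `P[s, a, s_next]`, and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def update_belief(belief, models, s, a, s_next):
    """Reweight each candidate model by the likelihood it assigns to
    the observed transition (s, a) -> s_next, then renormalize."""
    likelihood = np.array([P[s, a, s_next] for P in models])
    posterior = belief * likelihood
    total = posterior.sum()
    if total == 0.0:
        # No candidate model explains the transition; keep the prior.
        return belief
    return posterior / total

if __name__ == "__main__":
    # Two hypothetical contexts, one state-action pair, two next states.
    model_0 = np.array([[[0.9, 0.1]]])
    model_1 = np.array([[[0.1, 0.9]]])
    belief = np.array([0.5, 0.5])  # uniform prior over contexts
    for s_next in (0, 0, 1):
        belief = update_belief(belief, [model_0, model_1], 0, 0, s_next)
        print(belief)
```

A policy could then act on the belief at every step instead of committing to one context after a fixed identification phase.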

The concurrent RL setup (Silver et al., 2013), as well as the case of many or even infinitely many contexts are of practical importance. We have presented rough ideas on how to pursue these, but the exact theoretical setup requires a more precise definition (what guarantees could be made, what assumptions must hold and so on).

The issues of computational efficiency and sample complexity are important and were not tackled in this paper. Despite the availability of big data in many appealing venues, the state, action and context spaces may scale accordingly. Hence, an interesting theoretical and practical concern is the error and regret rates for finite sample size; finding these requires a more subtle analysis and is left for future work.

Subsequently, for very large state or action spaces, a straightforward implementation of the model-based approach will fail, as the number of samples required to learn the model grows accordingly. Solving this problem within the CMDP framework may introduce some intriguing connections. For example, if the linear function approximation technique is used (Sutton & Barto, 1998), the problem of clustering same-policy trajectories corresponds to the subspace clustering problem (Vidal, 2010).

In conclusion, from both the algorithmic and the analytic points of view, the theoretical trade-off between learning, exploration, optimization, and control of CMDPs is still very much an open question.

References


  • Aberdeen (2003) Aberdeen, Douglas. A (revised) survey of approximate methods for solving partially observable markov decision processes. National ICT Australia, Canberra, Australia, Tech. Rep, 2003.
  • Ailon et al. (2009) Ailon, Nir, Jaiswal, Ragesh, and Monteleoni, Claire. Streaming k-means approximation. In Advances in Neural Information Processing Systems, pp. 10–18, 2009.
  • Arthur & Vassilvitskii (2007) Arthur, David and Vassilvitskii, Sergei. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 1027–1035. Society for Industrial and Applied Mathematics, 2007.
  • Auer et al. (2009) Auer, Peter, Jaksch, Thomas, and Ortner, Ronald. Near-optimal regret bounds for reinforcement learning. In Advances in neural information processing systems, pp. 89–96, 2009.
  • Bertsekas & Tsitsiklis (1995) Bertsekas, Dimitri P and Tsitsiklis, John N. Neuro-dynamic programming. Athena Scientific, 1995.
  • Boas (2006) Boas, Mary L. Mathematical Methods in the Physical Sciences. John Wiley & Sons, Inc., 2006.
  • Bubeck (2012) Bubeck, Sébastien. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
  • Doya et al. (2002) Doya, K., Samejima, K., Katagiri, K., and Kawato, M. Multiple model-based reinforcement learning. Neural computation, 14(6):1347–1369, 2002.
  • Duda et al. (2012) Duda, Richard O, Hart, Peter E, and Stork, David G. Pattern classification. John Wiley & Sons, 2012.
  • Elliott et al. (1995) Elliott, Robert J, Aggoun, Lakhdar, and Moore, John B. Hidden Markov Models. Springer, 1995.
  • Har-Peled & Mazumdar (2004) Har-Peled, Sariel and Mazumdar, Soham. On coresets for k-means and k-median clustering. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pp. 291–300. ACM, 2004.
  • Kearns & Singh (2002) Kearns, Michael and Singh, Satinder. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
  • Kiseleva et al. (2013) Kiseleva, Julia, Lam, Hoang Thanh, Pechenizkiy, Mykola, and Calders, Toon. Predicting current user intent with contextual markov models. In Data Mining Workshops (ICDMW), 2013 IEEE 13th International Conference on, pp. 391–398. IEEE, 2013.
  • Kober et al. (2012) Kober, Jens, Wilhelm, Andreas, Oztop, Erhan, and Peters, Jan. Reinforcement learning to adjust parametrized motor primitives to new situations. Autonomous Robots, 33(4):361–379, 2012.
  • Kohrs & Merialdo (2001) Kohrs, Arnd and Merialdo, Bernard. Improving collaborative filtering for new users by smart object selection. In Proceedings of International Conference on Media Features (ICMF), 2001.
  • Lai & Robbins (1985) Lai, Tze Leung and Robbins, Herbert. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
  • Langford & Zhang (2007) Langford, John and Zhang, Tong. The epoch-greedy algorithm for contextual multi-armed bandits. Advances in neural information processing systems, 20:1096–1103, 2007.
  • Maillard & Mannor (2014) Maillard, Odalric-Ambrym and Mannor, Shie. Latent bandits. In Proceedings of The 31st International Conference on Machine Learning, pp. 136–144, 2014.
  • Maillard et al. (2011) Maillard, Odalric-Ambrym, Ryabko, Daniil, and Munos, Rémi. Selecting the state-representation in reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2627–2635, 2011.
  • Meyn & Tweedie (2009) Meyn, Sean P and Tweedie, Richard L. Markov chains and stochastic stability. Cambridge University Press, 2009.
  • Nguyen et al. (2013) Nguyen, Phuong, Maillard, Odalric-Ambrym, Ryabko, Daniil, and Ortner, Ronald. Competing with an infinite set of models in reinforcement learning. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, pp. 463–471, 2013.
  • Nilim & El Ghaoui (2005) Nilim, Arnab and El Ghaoui, Laurent. Robust control of markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.
  • Ostrovsky et al. (2006) Ostrovsky, Rafail, Rabani, Yuval, Schulman, Leonard J, and Swamy, Chaitanya. The effectiveness of lloyd-type methods for the k-means problem. In Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on, pp. 165–176. IEEE, 2006.
  • Puterman (2005) Puterman, Martin L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2005.
  • Radenen & Artières (2014) Radenen, Mathieu and Artières, Thierry. Handling signal variability with contextual markovian models. Pattern Recognition Letters, 35:236–245, 2014.
  • Richardson et al. (2007) Richardson, Matthew, Dominowska, Ewa, and Ragno, Robert. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th international conference on World Wide Web, pp. 521–530. ACM, 2007.
  • Silver et al. (2013) Silver, David, Newnham, Leonard, Barker, David, Weller, Suzanne, and McFall, Jason. Concurrent reinforcement learning from customer interactions. In Proceedings of The 30th International Conference on Machine Learning, pp. 924–932, 2013.
  • Sutton & Barto (1998) Sutton, Richard S and Barto, Andrew G. Introduction to reinforcement learning. MIT Press, 1998.
  • Vainsencher et al. (2013) Vainsencher, Daniel, Mannor, Shie, and Xu, Huan. Learning multiple models via regularized weighting. In Advances in Neural Information Processing Systems, pp. 1977–1985, 2013.
  • Vidal (2010) Vidal, René. A tutorial on subspace clustering. IEEE Signal Processing Magazine, 28(2):52–68, 2010.
  • Watkins & Dayan (1992) Watkins, Christopher JCH and Dayan, Peter. Q-learning. Machine learning, 8(3-4):279–292, 1992.
  • Weissman et al. (2003) Weissman, Tsachy, Ordentlich, Erik, Seroussi, Gadiel, Verdu, Sergio, and Weinberger, Marcelo J. Inequalities for the l1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003.
  • White (1993) White, Douglas J. A survey of applications of markov decision processes. Journal of the Operational Research Society, pp. 1073–1096, 1993.
  • Wiesemann et al. (2013) Wiesemann, Wolfram, Kuhn, Daniel, and Rustem, Berç. Robust markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013.
  • Wilson & Bobick (1999) Wilson, Andrew D and Bobick, Aaron F. Parametric hidden markov models for gesture recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 21(9):884–900, 1999.

Appendix A List of Notations

Notation Meaning
State space or number of states
Action space or number of actions
Time horizon
Time index
Number of trajectories in batch data
Number of trajectories in the ’th mini-batch
Number of possible contexts
Value of policy in model
Minimal inf-distance between two distinct models.

Appendix B Useful Lemmas

The following Lemmas are used in the proofs:

Lemma 2.

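(Weissman et al., 2003) Let $p$ be a probability distribution on the set $\{1, \ldots, a\}$. Let $X_1, \ldots, X_m$ be independent identically distributed random variables distributed according to $p$, and let $\hat{p}$ denote their empirical distribution. Then for all $\epsilon > 0$,

$$\Pr\left( \| p - \hat{p} \|_1 \ge \epsilon \right) \le (2^a - 2)\, e^{-m \epsilon^2 / 2}.$$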
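As a numerical illustration, assume Lemma 2 takes the standard form of the L1 deviation inequality of Weissman et al.: for a distribution on a outcomes estimated from m i.i.d. samples, the probability that the empirical distribution deviates by at least eps in L1 norm is at most (2^a - 2) exp(-m eps^2 / 2). The sketch below evaluates that bound and inverts it for the sample size; the function names are hypothetical, and a is assumed to be at least 2.

```python
import math

def l1_deviation_bound(a, m, eps):
    """Upper bound (2^a - 2) * exp(-m * eps^2 / 2), capped at 1."""
    return min(1.0, (2.0 ** a - 2.0) * math.exp(-m * eps ** 2 / 2.0))

def samples_needed(a, eps, delta):
    """Smallest integer m for which the uncapped bound is at most delta."""
    return math.ceil(2.0 * math.log((2.0 ** a - 2.0) / delta) / eps ** 2)

if __name__ == "__main__":
    a, eps, delta = 5, 0.1, 0.05   # assumed example values
    m = samples_needed(a, eps, delta)
    print(m, l1_deviation_bound(a, m, eps))
```

The required sample size grows linearly in a and as 1/eps^2, which is why the number of samples per state-action pair drives the estimation error in the proofs.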

Lemma 3.

(Kearns & Singh, 2002) Let be an MDP over states, and be an -approximation of . Then for any policy :


and consequently for the optimal policy in each MDP correspondingly:


Appendix C Proof of Lemma 1

Lemma 1.

If Assumption 4 holds, the described realization of Algorithms 1-4 satisfies Assumptions 1-3 with:


We show each Assumption holds, starting with Assumption 1.

For two transition functions of size denote:


We denote by the estimated transition matrices from trajectory and cluster correspondingly. In addition, is the true clustering of each trajectory, and is the clustering found by the algorithm.

Since there are at least samples from each state-action pair, according to Lemma 2 and the union bound, we obtain that:
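A sketch of how such a step typically goes, under assumed notation: write $n$ for the minimal number of samples per state-action pair, $S$ and $A$ for the numbers of states and actions, $P$ for a true transition function and $\hat{P}$ for its empirical estimate. Applying the standard form of the Weissman et al. inequality to each next-state distribution (supported on $S$ states) and taking a union bound over the $SA$ pairs gives the following; the paper's exact statement and constants may differ.

```latex
\Pr\Big( \exists (s,a):\ \big\| \hat{P}(\cdot \mid s,a) - P(\cdot \mid s,a) \big\|_1 \ge \epsilon \Big)
\;\le\; \sum_{(s,a)} (2^{S} - 2)\, e^{-n \epsilon^2 / 2}
\;=\; S A\, (2^{S} - 2)\, e^{-n \epsilon^2 / 2}.
```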


Since there are at least trajectories from each model, we also obtain that:


and therefore:


Now we obtain the following:


When is large, we can approximate for since each summand is bounded by with that probability, and when it is unbounded the maximal value of distance between two distributions is a constant . Therefore:


with probability at least , for .

Since the average is of that order, there must exist a matching between the true clusters and optimal clusters satisfying:


If the distance between every two true clusters is , the matching clusters agree on all trajectories within a reasonable radius, i.e. of the trajectories, so the error in each model is of the order :


Now in order for to hold, we can choose to be of order , and then:


To summarize, for we obtain that with probability at least , , where:


Next, we show Assumption 2 holds. We bound the probability of misclassification by the following probability:


as if this event occurs then the true model will be chosen. To bound this quantity, we use the union bound over the complement event, so we need to bound:


For the left term:


For the right term:


Now, using the union bound we obtain that the classification is correct with probability at least , where