1 Introduction
We study a sequential decision-making problem where the environment is a Markov Decision Process (MDP), but the dynamics and the reward depend on a static parameter referred to as the context. For example, consider a lifelong learning task in which an autonomous car must navigate the road while avoiding other vehicles. The car is likely to encounter similar instances of this problem under different conditions, such as visibility and weather. While we expect some tasks to be similar, the agent is also required to adapt. For instance, when encountering fog, the agent is expected to drive more slowly and safely than in optimal weather conditions.
Another example is the dynamic treatment regime (Chakraborty and Murphy, 2014). Here, there is a sick patient and a clinician who acts to improve the patient’s health. The state space of the MDP is composed of the patient’s clinical measurements, and the actions are the clinician’s decisions. Traditionally, treatments are targeted to the “average” patient. Instead, in personalized medicine, people are separated into different groups and the medical decisions are tailored to the individual patient based on the predicted response or risk of disease (Fig. 0(b)). Recently, with the cost of genetic sequencing dropping dramatically, and with a growing number of patients willing to track and share their healthcare records, personalized medicine is being developed to match the specific needs of patients. One success story of personalized medicine was the development of a drug called Herceptin for a group of cancers termed HER2+ that are highly aggressive and often have a poor prognosis. Herceptin was prescribed to treat women (the context) with HER2+ breast cancer, and it improved their survival time from 20 months to 5 years (source: bit.ly/2P6iCRr). For acute respiratory distress syndrome (ARDS), clinicians argue that treatment goals should rely on individual patients’ physiology (Berngard et al., 2016). In (Wesselink et al., 2018), the authors study organ injury that may occur when mean arterial pressure decreases below a certain threshold, and report that this threshold varies across different patient groups.
[Figure: The COIRL framework (left): a context vector parametrizes the environment. For each context, the expert uses the true mapping from contexts to rewards and provides demonstrations. The agent learns an estimate of this mapping and acts optimally with respect to it.]

These examples highlight the importance of patient information in the online treatment regime and motivate us to consider contextual information within the RL framework. One possibility is to expand the state such that it includes the patient information (the context). However, this approach can increase the complexity of the problem significantly, as the set of possible MDPs grows exponentially with the dimension of the context. Therefore, in the Contextual MDP framework (Hallak et al., 2015), the goal is to learn the mapping from contexts to the environment (dynamics and rewards). A learning algorithm for this problem should learn a mapping that generalizes to unseen contexts, improves at the task as it observes more contexts, and achieves a desired sample complexity (Modi et al., 2018) or regret (Modi and Tewari, 2019).
Another issue that is prevalent in real-world problems is that the reward function may be sparse and misspecified. For example, in online treatment problems like sepsis (Komorowski et al., 2018), the only available signal is the mortality of the patient at the end of the treatment. Manually designing a reward function for this problem is complicated and can lead to poor performance (Raghu et al., 2017; Lee et al., 2019). In many cases, it is easier for humans to define the reward implicitly, by providing demonstrations of what constitutes a proper treatment. Inverse Reinforcement Learning (Ng et al., 2000, IRL) is concerned with inferring the reward function by observing an expert, in order to find a policy whose value is close to that of the expert.
Finally, deploying RL algorithms to treat patients or to drive cars cannot be regarded in the same way as solving a video game due to safety considerations and lack of a simulator. To address these issues, we propose the Contextual Inverse Reinforcement Learning (COIRL) model. We study a safe, online learning framework where an expert supervises the RL algorithm as follows. The agent observes a context, estimates the reward and proposes a policy. The expert evaluates the agent’s actions and decides if they are optimal. If not, the expert provides a demonstration to the agent. The goal of the agent is to learn a mapping from contexts to rewards by observing expert demonstrations.
We design and analyze two algorithms for COIRL: (1) for linear reward mappings, we study the ellipsoid method, for which we provide theoretical guarantees and analyze the sample complexity for finding an optimal solution; (2) for nonlinear mappings, we study a black-box optimization solution that minimizes a surrogate loss using evolutionary strategies (Salimans et al., 2017, ES). We consider two loss functions that enable feature expectation matching: an intuitive but non-differentiable loss that minimizes the distance between the value of the agent and the value of the expert, and a differentiable loss that is based on a min-max objective. We evaluate our algorithms in autonomous driving and online treatment simulators and demonstrate their ability to generalize to unseen contexts.
2 Problem formulation and notation
Contextual Markov Decision Processes (CMDPs): A Markov Decision Process (MDP) is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, \xi, r)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P$ the transition kernel, $\xi$ the initial state distribution and $r$ the reward function. A Contextual MDP (CMDP) is an extension of an MDP, and is defined by $(\mathcal{C}, \mathcal{S}, \mathcal{A}, \mathcal{M})$, where $\mathcal{C}$ is the context space, and $\mathcal{M}$ is a mapping on $\mathcal{C}$ such that $\mathcal{M}(c) = (\mathcal{S}, \mathcal{A}, P, \xi, r_c)$ is an MDP with shared state and action spaces for each $c \in \mathcal{C}$. We consider a CMDP with finite state and action spaces, and associate each state $s$ with a feature vector $\phi(s) \in \mathbb{R}^k$. Additionally, we assume that the transitions and the initial state distribution are not context dependent.
In our work we will further assume a linear setting, in which the reward function for context $c$ is linear in the state features: $r_c(s) = w_c^\top \phi(s)$, where $w_c \in \mathbb{R}^k$ is the reward coefficients vector, given by the linear mapping $w_c = (c^\top W^*)^\top$ for some $W^* \in \mathbb{R}^{d \times k}$. We assume the entries of $W^*$ are bounded, and that $c \in \Delta^{d-1}$, i.e., the contexts lie in the standard $d$-dimensional simplex. This allows a straightforward expansion to a model in which the transitions are also given by a linear mapping of the context, as seen in (Modi et al., 2018). One way of viewing this model is that each row of the mapping $W^*$ is a base reward coefficients vector, and the reward for a specific context is a convex combination of these base rewards.
We consider deterministic policies $\pi : \mathcal{S} \to \mathcal{A}$, which dictate the agent’s behaviour at each state. The (normalized) value of a policy $\pi$ for a reward coefficients vector $w$ is $V^\pi_w = w^\top \mu(\pi)$, where $\mu(\pi) := (1-\gamma)\,\mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \mid \pi, \xi\big]$ is called the feature expectations of $\pi$, and $\gamma \in [0,1)$ is the discount factor. For the optimal policy with respect to (w.r.t.) the reward coefficients vector $w$, we denote the value by $V^*_w$. The normalization of the value function by the constant $1-\gamma$ is for convenience, i.e., to keep the value bounded independently of $\gamma$, and does not affect the resulting policies. Given $W^*$, for each context we may calculate the reward coefficients vector and find the optimal policy, i.e. the policy with the highest value, using standard methods such as policy/value iteration.
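In the tabular setting, the feature expectations of a fixed policy can be computed exactly by a linear solve over the Markov chain the policy induces. A minimal sketch (NumPy; the array shapes are our illustrative convention):

```python
import numpy as np

def feature_expectations(P, phi, policy, xi, gamma):
    """Normalized feature expectations mu(pi) of a deterministic policy.

    P:      (S, A, S) transition kernel
    phi:    (S, k) state feature matrix
    policy: (S,) action index chosen in each state
    xi:     (S,) initial state distribution
    gamma:  discount factor in [0, 1)
    """
    S = P.shape[0]
    P_pi = P[np.arange(S), policy]                # (S, S) chain induced by pi
    # discounted state occupancy: d = (I - gamma * P_pi^T)^{-1} xi
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, xi)
    return (1.0 - gamma) * phi.T @ d              # (k,) feature expectations
```

The value of the policy for a coefficients vector `w` is then simply `w @ feature_expectations(...)`.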
Inverse Reinforcement Learning in CMDPs:
In standard IRL the goal is to learn a reward which best explains the behavior of an observed expert. The model describing this scenario is the MDP\R, which is an MDP without a reward function (also commonly called a controlled Markov chain). Similarly, we denote a CMDP without a mapping from context to reward by CMDP\M. The goal in Contextual IRL (COIRL) is to infer a mapping, from observations of an expert, which induces near-optimal policies for all contexts. As shown in (Ng et al., 2000), the IRL problem is ill-defined and we are not guaranteed to recover the true reward or, in this case, the mapping $W^*$; however, it is still possible to learn a mapping which induces optimal policies and enables generalization to new contexts.

While learning a transition kernel and an initial distribution which are parametrized by the context is related to the IRL problem, it can be seen as a separate, precursory problem, allowing us to make the simplifying assumptions presented previously. By using existing methods to learn the mappings for the transition kernel and initial distribution in a contextual model, such as in (Modi et al., 2018), and by using the simulation lemma (Kearns and Singh, 2002), our results can be extended to a more general CMDP setting.
3 Methods
In this section, we study learning algorithms for COIRL that are motivated by the online treatment regime. We begin with an online learning framework, where we design algorithms that do not have access to a simulator of the environment, and the agent is only allowed to explore near-optimal actions. We then consider an offline learning framework, where observational (off-policy) data of expert demonstrations was collected a priori. For example, in the medical domain, these demonstrations may represent collected data of clinicians treating patients; such data is publicly available, for example, in the MIMIC-III data set (Johnson et al., 2016). Finally, we consider a warm-start framework, where the agent is initialized in the offline framework and continues to learn in the online framework.
More explicitly, in the online framework, the agent learns under the supervision of an expert. We propose a setting in which, at each time-step $t$, a new context $c_t$ is revealed, possibly adversarially, and the agent acts based on the optimal policy w.r.t. its estimated mapping, denoted by $\pi_t$. The expert provides two forms of supervision for the agent. First, the expert evaluates the agent’s behavior and produces a binary signal which determines if the agent’s policy is optimal, i.e., whether its value is close to the expert’s. Second, when the agent is suboptimal, the expert provides a demonstration in the form of its policy (or feature expectations) for $c_t$. The goal is to learn a mapping which induces optimal policies for all contexts based on a minimal number of examples from the expert.
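The online protocol can be summarized in a few lines. The interfaces below (the `env`, `expert`, and `agent` objects and their methods) are hypothetical placeholders used for illustration, not the paper’s code:

```python
def coirl_online(env, expert, agent, num_rounds):
    """Online COIRL supervision loop (sketch with hypothetical interfaces)."""
    for t in range(num_rounds):
        c = env.sample_context()              # context, possibly adversarial
        policy = agent.optimal_policy(c)      # optimal w.r.t. estimated mapping
        if not expert.is_optimal(c, policy):  # binary supervision signal
            mu_star = expert.demonstrate(c)   # expert feature expectations
            agent.update(c, mu_star)          # e.g. an ellipsoid cut or ES step
    return agent
```

Once the agent is optimal for every context the expert can produce, no further demonstrations are requested.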
Next, we present two approaches to solving the COIRL problem. We begin with the linear model, for which we propose an ellipsoid-based approach with proven polynomial upper bounds. We then consider nonlinear models and propose descent-based algorithms.
3.1 Ellipsoid algorithms for COIRL
The goal of the algorithms in this section is to find a linear mapping from contexts to rewards by observing expert demonstrations. We study ellipsoid-based algorithms that maintain an ellipsoid-shaped feasibility set that contains $W^*$. At any step, the current estimate of $W^*$ is defined as the center of the ellipsoid, and the agent acts optimally w.r.t. this estimate. If the agent performs suboptimally, the expert provides a demonstration in the form of the optimal feature expectations for the current context. The feature expectations are used to generate a linear constraint (hyperplane) on the ellipsoid that crosses its center. Under this constraint, we construct a new feasibility set that is half of the previous ellipsoid, and still contains $W^*$. For the algorithm to proceed, we compute a new ellipsoid that is the minimum volume enclosing ellipsoid (MVEE) around this "half-ellipsoid" (this procedure follows a sequence of linear algebra operations which we explain in the appendix). These updates are guaranteed to gradually reduce the volume of the ellipsoid (a well-known result (Boyd and Barratt, 1991)) until its center is a mapping which induces optimal policies. Theorem 1 shows that this algorithm achieves a polynomial upper bound on the number of suboptimal time-steps. Finally, notice that in Algorithm 1 we use an underline notation to denote a "flattening" operator for matrices, and a corresponding shorthand for the composition of an outer product with this flattening operator. The proof of Theorem 1 is provided in the supplementary material, and is adapted from (Amin et al., 2017) to the COIRL problem.

Theorem 1.
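For concreteness, the MVEE of a half-ellipsoid after a central cut has a closed form. The sketch below uses the standard ellipsoid-method update formulas, not necessarily the paper’s exact implementation:

```python
import numpy as np

def central_cut_update(center, A, g):
    """MVEE of {x : (x-c)^T A^{-1} (x-c) <= 1, g.(x-c) <= 0}.

    center: (n,) ellipsoid center
    A:      (n, n) symmetric positive-definite shape matrix
    g:      (n,) normal of the central cut derived from a demonstration
    Requires n >= 2.
    """
    n = center.shape[0]
    Ag = A @ g
    b = Ag / np.sqrt(g @ Ag)                                  # scaled cut direction
    new_center = center - b / (n + 1)
    new_A = (n**2 / (n**2 - 1.0)) * (A - (2.0 / (n + 1)) * np.outer(b, b))
    return new_center, new_A
```

Each update shrinks the ellipsoid volume by a factor of roughly $e^{-1/(2n)}$, which is what drives the polynomial bound on the number of suboptimal rounds.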
In the linear setting where $w_c = (c^\top W^*)^\top$, for an agent acting according to Algorithm 1, the number of rounds in which the agent is not $\epsilon$-optimal is polynomial in $d$, $k$, and $\log\frac{1}{\epsilon(1-\gamma)}$.
Practical ellipsoid algorithm: In many real-world scenarios, the expert cannot evaluate the value of the agent’s policy and cannot provide its full policy or feature expectations. To address these issues, we consider a relaxed approach, in which the expert evaluates a single trajectory of the agent and, if it is suboptimal, the expert demonstrates a single $H$-step trajectory. Due to the stochasticity of the underlying MDP, evaluating the value of the agent based on a single trajectory is impractical. Hence we consider an alternative approach, in which the expert evaluates each of the individual actions performed by the agent. We define the expert’s criterion for providing a demonstration to be $Q^*_c(s, a) < V^*_c(s) - \epsilon$ for each state-action pair $(s, a)$ in the agent’s trajectory. This implies that for an initial distribution which assigns probability $1$ to a state in which the agent is suboptimal, the value of the agent is not $\epsilon$-optimal, which enables us to make similar arguments as before.

Suboptimal experts: In addition, we relax the requirement that the expert must be optimal and instead assume that, for each context $c$, the expert acts optimally w.r.t. a mapping $\bar{W}$ which is close to $W^*$; the expert also evaluates the agent w.r.t. this mapping. This allows the agent to learn from different experts, and from non-stationary experts whose judgment and performance vary over time. If a suboptimal action w.r.t. $\bar{W}$ is played at state $s$, the expert provides a rollout of $H$ steps from $s$ to the agent. As this rollout is a sample of the optimal policy w.r.t. $\bar{W}$, we aggregate examples to assure that, with high probability, the linear constraint that we use in the ellipsoid algorithm does not exclude $W^*$ from the feasibility set. Note that these batches may be constructed across different contexts, different experts, and different states from which the demonstrations start. In the supplementary material, we provide pseudo-code for this process (Algorithm 3). Theorem 2 below upper bounds the number of suboptimal actions that Algorithm 3 chooses.
Theorem 2.
For an agent acting according to Algorithm 3, with probability at least $1-\delta$, given sufficiently long rollouts and large enough demonstration batches, the number of rounds in which a suboptimal action is played is polynomial in $d$, $k$, $\frac{1}{\epsilon(1-\gamma)}$ and $\log\frac{1}{\delta}$.
The proof for Theorem 2 is provided in the supplementary material, and is adapted from (Amin et al., 2017) to the setup of COIRL with near-optimal experts.
Warm-start for the ellipsoid algorithm: In this setup, the goal is to use offline data to initialize the ellipsoid algorithm with a smaller feasibility set. Although this approach leads to lower regret, an expert’s supervision is still required during training, similarly to the online setting, in order to ensure that the optimal solution remains within the feasibility set. We simulate the online setting by iterating over the trajectories in the data. The expert evaluates the agent’s suggested action for each state and provides the binary optimality signal. As each trajectory is an expert demonstration, we use it as an alternative to the online expert demonstration. By adhering to the conditions of Theorem 2, its theoretical guarantees remain valid.
3.2 Optimization methods for COIRL with nonlinear mappings
In the previous section, we analyzed a scenario in which the mapping from contexts to rewards was linear, i.e. $w_c = (c^\top W^*)^\top$. This reward structure enabled the analysis of the sample complexity of the ellipsoid algorithm and guaranteed its convergence. In this section, we extend the COIRL framework to nonlinear mappings, i.e. $w_c = f_\theta(c)$, where $f_\theta$ is a nonlinear function with parameters $\theta$. We formulate COIRL as an optimization problem and provide descent algorithms to solve it.
The goal is to find a mapping which induces policies whose feature expectations match the expert’s feature expectations $\mu(\pi^E_c)$ for any context, i.e., minimize $\left\|\mu(\pi^E_c) - \mu(\pi_c)\right\|$, an approach known as feature expectation matching (Abbeel and Ng, 2004; Ziebart et al., 2008). However, minimizing such a loss directly is difficult, as it is piecewise constant in $\theta$ (or in $W$ in the linear case). For this reason, we explore two surrogate loss functions (alternative loss functions whose minimization leads to feature expectation matching).
The first surrogate loss function is the MSE between the estimated value of the expert and the agent:
$$\mathcal{L}_1(\theta) = \mathbb{E}_c\Big[\big(f_\theta(c)^\top \mu(\pi^E_c) - f_\theta(c)^\top \mu(\pi^\theta_c)\big)^2\Big] \qquad (1)$$
where $\pi^E_c$ denotes the optimal policy w.r.t. the true reward for context $c$, and $\pi^\theta_c$ denotes the optimal policy w.r.t. $f_\theta(c)$. Note that in order to evaluate the loss, we have to compute the optimal policies w.r.t. $f_\theta(c)$, which involves solving tabular MDPs (e.g. with policy iteration). This fact makes Eq. 1 non-differentiable w.r.t. $\theta$, as solving an MDP is non-differentiable. On the other hand, the loss function is Lipschitz continuous w.r.t. $\theta$, as the following lemma states. The proof can be found in the supplementary material and is based on the simulation lemma (Kearns and Singh, 2002).
Lemma 1.
The objective function (1) is Lipschitz continuous in $\theta$.
To minimize Eq. 1, we design a black-box algorithm (Algorithm 2) that is based on Evolution Strategies (Salimans et al., 2017, ES); a gradient-free descent method for solving black-box optimization problems based on computing finite differences (Nesterov and Spokoiny, 2017). The algorithm receives a set of context-demonstration tuples and returns parameters $\theta$. At each step, a context-demonstration tuple is sampled at random. Next, a set of random Gaussian noise vectors is sampled and used to perturb $\theta$, yielding a set of perturbed reward mappings. Each perturbed mapping is used to evaluate the loss (Eq. 1). Finally, the descent direction is computed as the sum of the noise vectors, weighted by their corresponding losses.
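A minimal sketch of one such ES update follows. The perturbation scale, step size, and the mean-subtraction baseline are our illustrative choices, not the paper’s hyperparameters:

```python
import numpy as np

def es_step(theta, loss_fn, sigma=0.1, lr=0.05, n_pert=32, rng=None):
    """One ES descent step on a black-box loss over flattened parameters.

    loss_fn maps a parameter vector to a scalar loss (it may solve an MDP
    internally, which is why no gradients are required).
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((n_pert, theta.size))          # Gaussian perturbations
    losses = np.array([loss_fn(theta + sigma * e) for e in eps])
    # weight each perturbation by its (centered) loss: a finite-difference
    # estimate of the gradient (Nesterov and Spokoiny, 2017)
    grad = ((losses - losses.mean())[:, None] * eps).mean(axis=0) / sigma
    return theta - lr * grad
```

Centering the losses does not change the gradient estimate in expectation but reduces its variance, which matters when each loss evaluation is an expensive MDP solve.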
While the loss in Eq. 1 is intuitive, it has a few drawbacks. First, its evaluation requires solving an MDP for every perturbation (which is computationally prohibitive), and second, we found it hard to minimize in some settings (see the experiments section for more details). For these reasons, we consider a second surrogate optimization problem, defined by:
$$\mathcal{L}_2(\theta) = \mathbb{E}_c\Big[\max_{\pi}\; f_\theta(c)^\top \big(\mu(\pi) - \mu(\pi^E_c)\big)\Big] \qquad (2)$$
a problem similar to the IRL formulations in (Syed and Schapire, 2008; Ho and Ermon, 2016). This approach requires a two-step optimization process at each iteration: (1) given the current estimate $f_\theta$, we compute the optimal policies for the sampled contexts and their corresponding feature expectations; then (2) given these feature expectations, we perform ES on the loss and take a single step to update $\theta$. On the positive side, once the feature expectations are fixed, this loss is differentiable and can be optimized with standard backpropagation.
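Once the agent’s feature expectations are fixed in step (1), loss (2) is linear in the parameters of a linear mapping, so a plain (sub)gradient step can stand in for the ES step. A sketch for the linear case, with our illustrative notation:

```python
import numpy as np

def minmax_grad_step(W, c, mu_expert, mu_agent, lr=0.1):
    """One subgradient step on loss (2) for a single context.

    W:         (d, k) current linear mapping estimate
    c:         (d,) context vector
    mu_expert: (k,) expert feature expectations for c
    mu_agent:  (k,) feature expectations of the policy optimal w.r.t. c^T W
               (step (1): solve the MDP; here they are given)
    """
    # with the agent's policy held fixed, the per-context loss
    # (c^T W) . (mu_agent - mu_expert) is linear in W, so its
    # subgradient is the outer product below
    grad = np.outer(c, mu_agent - mu_expert)
    return W - lr * grad
```

The step increases the reward weight on features the expert visits more than the agent, and decreases it on features the agent over-visits, which is exactly the feature expectation matching intuition.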
4 Experiments
This section is organized as follows. We begin by analyzing our approach on a common IRL task, an autonomous driving simulation (Abbeel and Ng, 2004; Syed and Schapire, 2008), adapted to the contextual setup. We then test our method in a medical domain, using a data set of expert (clinician) trajectories for treating patients with sepsis (the data and code that we used to construct these simulators, as well as the implementation of our algorithms, can be found at github.com/CIRLMDP/CIRL). More details follow in the relevant subsections.
We experimented with the methods presented in the previous section, namely, the ellipsoid algorithm and the ES method with the losses in Eq. 1 and Eq. 2. We evaluate and compare their cumulative regret, the number of demonstrations they require, and their ability to generalize to a hold-out test set. In each experiment, we create a random sequence of contexts, average the results across several seeds and report the mean and the standard deviation. Note that once an algorithm achieves a near-optimal value in the online framework, it stops requesting demonstrations from the expert. For that reason, algorithms that perform better request fewer demonstrations, and their generalization graphs appear truncated: if a plot ends abruptly, it is because at that point the algorithm achieved a near-optimal value and stopped requesting demonstrations. For the nonlinear reward model, we take $f_\theta$ to be a multi-layer perceptron; for the ES methods, we use value iteration to compute optimal policies; all details and hyperparameters can be found in the supplementary material.
4.1 Driving simulation
The driving task simulates a three-lane highway with two visible cars: car A and car B. The agent, controlling car A, can drive both on the highway and off-road. Car B drives in a fixed lane, at a slower speed than car A. Upon leaving the frame, car B is replaced by a new car, appearing in a random lane at the top of the screen. The reward is defined to be linear in the state features $\phi$, which are composed of 3 features: (1) a speed feature, (2) a collision feature, which is active in case of a collision, and (3) an off-road feature, which is 0.5 if the car is on the road and 0 otherwise. The environment is modeled as a tabular MDP. The speed is selected once, at the initial state, and is kept constant afterward. The remaining states are generated by 17 X-axis positions for the agent’s car, 3 available speed values, 3 lanes and 10 Y-axis positions in which car B may reside. During the simulation, the agent controls the steering direction of the car, moving left or right, i.e., two actions.
In this task, the context vector implies different priorities for the agent: should it prefer speed or safety? Is going off-road to avoid collisions a valid option? For example, an ambulance will prioritize speed and may allow going off-road as long as it drives fast and avoids collisions, while a bus will prioritize avoiding both collisions and off-road driving, as structural integrity is its main concern. The optimal behavior is defined using a linear mapping $W^*$ or a nonlinear mapping $f^*$. To demonstrate the effectiveness of our solutions, the mappings are constructed in a way that induces different behaviors for different contexts, making generalization a challenging task. For the nonlinear task, we consider two reward coefficient vectors and define $f^*$ as a nonlinear interpolation between them.
Results: For the online linear setting (Fig. 3), we use the same optimality threshold for all algorithms. We report the cumulative regret (Fig. 2(a)), the number of demonstrations each algorithm requested (Fig. 2(b)), and their ability to generalize to a hold-out test set (Fig. 2(c)), calculated using 20 seeds. Examining the results, we see that despite the theoretical guarantees, the descent methods achieve better sample efficiency and regret than the ellipsoid method. Also, the loss in Eq. 2 leads to better regret overall and requires significantly fewer demonstrations to reach optimal performance.



For the nonlinear online setting (Fig. 4) we compare the ES method for minimizing loss (2) with the ellipsoid algorithm, across 5 seeds. These results demonstrate that the ellipsoid algorithm does not perform well in nonlinear settings, as highlighted by its inability to generalize and its linear regret growth, while the ES method with a nonlinear model is able to converge to a near-optimal solution.



Notably, loss (1) is excluded from the nonlinear results as it was unable to generalize and thus required a demonstration at nearly every timestep, making it about on par with the ellipsoid. A possible explanation for this is that this loss discourages advancing in the correct direction under certain circumstances. For example, consider the case where the agent’s coefficient for the speed feature is 0.1, and the agent’s and expert’s feature expectations are 0.5, 1 respectively. A speculative step increasing the coefficient to 0.2 may not be sufficient to change the agent’s feature expectations, and thus will increase the loss. On the other hand, an increase in the coefficient is necessary to match the feature expectations, therefore the step the ES algorithm takes would go in the opposite direction. Loss (2) avoids such issues, which may explain its superior results.
4.2 Dynamic treatment regime
In this setup, there is a sick patient and a clinician who acts to improve the patient’s medical condition. The context (static information) represents patient features which do not change during treatment, such as age and gender. The state of the agent summarizes the dynamic measurements of the patient throughout the treatment, such as blood pressure and EEG readouts. The action space, i.e., the clinician’s decisions, consists of a sequence of decision rules, one per stage of intervention, each representing a combination of intervention categories. Dynamic treatment regimes are particularly useful for managing chronic disorders and fit well into the larger paradigm of personalized medicine (Komorowski et al., 2018; Prasad et al., 2017).
We focus on an intensive care task, where the agent needs to choose the right treatment for a patient that is diagnosed with sepsis. We use the MIMICIII data set (Johnson et al., 2016) and follow the data processing steps that were taken in Jeter et al. (2019). However, performing offpolicy evaluation is not possible using this dataset, as it does not satisfy basic requirements (Gottesman et al., 2018, 2019). Therefore, we designed a simulator of a CMDP, based on this data.
The dataset consists of patient trajectories. Each trajectory represents a sequential treatment provided by a clinician to a patient. The available information for each patient consists of static features (the context, e.g. gender, age) and dynamic measurements of the patient at each time step (e.g. heart rate, body temperature). In addition, each trajectory contains the reported clinician actions (the amounts of fluids and vasopressors given to the patient at each time-step, binned into 25 different values), and a mortality signal which indicates whether the patient was alive 90 days after hospital admission. In order to create a tabular MDP, we cluster the dynamic features using K-means (MacQueen et al., 1967). Each cluster is considered a state, and the coordinates of the cluster centroid are taken as its features $\phi(s)$. We construct the transition kernel between the clusters using the empirical transitions in the data. As in the previous sections, we consider a reward which is linear in the features, i.e., $r_c(s) = (c^\top W^*)\,\phi(s)$, where $W^*$ is a matrix we construct from the data for the simulator. In the simulator, the expert acts optimally w.r.t. this $W^*$.

When treating a sepsis patient, the clinician has several decisions to make. One such decision is whether or not to provide the patient with vasopressors, drugs which are commonly applied to restore and maintain blood pressure in patients with sepsis. However, what is regarded as normal blood pressure differs based on the age and weight of the patient (Wesselink et al., 2018). In our setting, $W^*$ captures this information, as it maps contextual information (age) and dynamic information (blood pressure) to reward.
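The empirical transition kernel over the clustered states can be estimated by simple counting. A sketch (the self-loop fallback for state-action pairs never observed in the data is our assumption, not a detail from the paper):

```python
import numpy as np

def empirical_mdp(transitions, n_states, n_actions):
    """Empirical transition kernel from logged (s, a, s') index triples.

    transitions: iterable of (state, action, next_state) triples, e.g.
    clinician trajectories after clustering measurements into states.
    Unseen (s, a) pairs default to a self-loop.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1.0
    totals = counts.sum(axis=2, keepdims=True)        # visits per (s, a)
    P = np.where(totals > 0,
                 counts / np.maximum(totals, 1.0),    # normalized counts
                 np.eye(n_states)[:, None, :])        # self-loop fallback
    return P
```

Each row of the resulting kernel is a valid probability distribution, so the simulator built on top of it is a well-defined tabular MDP.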
Results: Here we compare our algorithms within the online framework, over 1000 time-steps, across 5 seeds for all algorithms. Similarly to the autonomous vehicle experiments, we measure the regret (Figure 4(a)) and the number of demonstrations each algorithm requested (Figure 4(b)). In addition to generalization to a hold-out set in Figure 4(c), we provide results for the inaccuracy (miss rate) of the agents, i.e., in how many states the policy of the agent differs from that of the expert. These results suggest that in this more complicated environment, the ES approaches perform even better compared to the ellipsoid method. While all algorithms are able to learn and generalize, both ES approaches require significantly fewer demonstrations and accumulate less regret. We also note that although the miss rate decreases over time, it does not drop to zero for any of the methods. This shows that while the accuracy metric is indicative of good performance, it may not be a good metric for evaluating policies learned through IRL, as it only measures the ability to imitate the expert rather than the ability to learn the latent contextual reward structure.



5 Discussion
We studied the COIRL problem with linear and nonlinear reward mappings. While nonlinear mappings are more appropriate for modeling real-world problems, for a linear mapping we were able to provide theoretical guarantees and sample complexity analysis. Moreover, when applying AI agents to real-world problems, interpretability of the learned model is of major importance, in particular when considering deployment in medical domains (Komorowski et al., 2018). Interpretability of linear models can be achieved by analyzing the mapping and providing insights on the importance of specific features for specific contexts.
We experimented with two approaches for COIRL in the linear setup: the ellipsoid and the ES methods. While the ellipsoid method has theoretical guarantees, we observed that ES performed better in all of our experiments. This raises an important question: what is the lower bound on the number of samples required? In (Amin et al., 2017), the ellipsoid method was proposed in a non-contextual IRL setup, and its sample complexity was shown to leave a gap from the known lower bound. This may explain the fact that ES achieved better performance than the ellipsoid method, even in the linear setup, although we cannot analyze its performance.
Finally, the literature on contextual MDPs is concerned with providing theoretical guarantees and sample complexity analysis for the scenario in which we can model each patient as a tabular MDP. However, when the measurements of the patient are continuous, deep learning methods are likely to perform better than state aggregation. In the deep setup, the critical question is how to design an architecture that will leverage the structure of the static and dynamic information. While there has been some preliminary work in robotics domains
(Xu et al., 2018), these works often focus on meta-learning, i.e., few-shot adaptation, whereas COIRL considers the zero-shot scenario.

References

Abbeel and Ng [2004] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
Amin et al. [2017] Kareem Amin, Nan Jiang, and Satinder Singh. Repeated inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 1815–1824, 2017.
Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
 Berngard et al. [2016] S Clark Berngard, Jeremy R Beitler, and Atul Malhotra. Personalizing mechanical ventilation for acute respiratory distress syndrome. Journal of thoracic disease, 8(3):E172, 2016.
 Boyd and Barratt [1991] Stephen P Boyd and Craig H Barratt. Linear controller design: limits of performance. Prentice Hall Englewood Cliffs, NJ, 1991.
 Chakraborty and Murphy [2014] Bibhas Chakraborty and Susan A Murphy. Dynamic treatment regimes. Annual review of statistics and its application, 1:447–464, 2014.
 Cisse et al. [2017] Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 854–863. JMLR. org, 2017.
 Gottesman et al. [2018] Omer Gottesman, Fredrik Johansson, Joshua Meier, Jack Dent, Donghun Lee, Srivatsan Srinivasan, Linying Zhang, Yi Ding, David Wihl, Xuefeng Peng, et al. Evaluating reinforcement learning algorithms in observational health settings. arXiv preprint arXiv:1805.12298, 2018.
 Gottesman et al. [2019] Omer Gottesman, Fredrik Johansson, Matthieu Komorowski, Aldo Faisal, David Sontag, Finale DoshiVelez, and Leo Anthony Celi. Guidelines for reinforcement learning in healthcare. Nature medicine, 25(1):16–18, 2019.
Hallak et al. [2015] Assaf Hallak, Dotan Di Castro, and Shie Mannor. Contextual Markov decision processes. arXiv preprint arXiv:1502.02259, 2015.

Ho and Ermon [2016]
Jonathan Ho and Stefano Ermon.
Generative adversarial imitation learning.
In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.  Itenov et al. [2018] Theis Itenov, Daniel Murray, and Jens Jensen. Sepsis: Personalized medicine utilizing ‘omic’technologies—a paradigm shift? In Healthcare, page 111. Multidisciplinary Digital Publishing Institute, 2018.

Jeter et al. [2019]
Russell Jeter, Christopher Josef, Supreeth Shashikumar, and Shamim Nemati.
Does the "artificial intelligence clinician" learn optimal treatment strategies for sepsis in intensive care?, 2019.
URL https://github.com/point85AI/PolicyIterationAIClinician.git.  Johnson et al. [2016] Alistair E.W. Johnson, Tom J. Pollard, Lu Shen, Liwei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. Mimiciii, a freely accessible critical care database. Scientific Data, 3:160035, May 2016. ISSN 20524463. doi: 10.1038/sdata.2016.35. URL http://dx.doi.org/10.1038/sdata.2016.35.
 Kearns and Singh [2002] Michael Kearns and Satinder Singh. Nearoptimal reinforcement learning in polynomial time. Machine learning, 49(23):209–232, 2002.
 Komorowski et al. [2018] Matthieu Komorowski, Leo A Celi, Omar Badawi, Anthony C Gordon, and A Aldo Faisal. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24(11):1716, 2018.
 Lee et al. [2019] Donghun Lee, Srivatsan Srinivasan, and Finale DoshiVelez. Truly batch apprenticeship learning with deep successor features. arXiv preprint arXiv:1903.10077, 2019.
 MacQueen et al. [1967] James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, pages 281–297. Oakland, CA, USA, 1967.
 Modi and Tewari [2019] Aditya Modi and Ambuj Tewari. Contextual markov decision processes using generalized linear models. arXiv preprint arXiv:1903.06187, 2019.
 Modi et al. [2018] Aditya Modi, Nan Jiang, Satinder Singh, and Ambuj Tewari. Markov decision processes with continuous side information. In Algorithmic Learning Theory, pages 597–618, 2018.
 Nesterov and Spokoiny [2017] Yurii Nesterov and Vladimir Spokoiny. Random gradientfree minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
 Ng et al. [2000] Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In Icml, volume 1, page 2, 2000.
 Prasad et al. [2017] Niranjani Prasad, LiFang Cheng, Corey Chivers, Michael Draugelis, and Barbara E Engelhardt. A reinforcement learning approach to weaning of mechanical ventilation in intensive care units. UAI, 2017.
 Raghu et al. [2017] Aniruddh Raghu, Matthieu Komorowski, Imran Ahmed, Leo Celi, Peter Szolovits, and Marzyeh Ghassemi. Deep reinforcement learning for sepsis treatment. arXiv preprint arXiv:1711.09602, 2017.
 Salimans et al. [2017] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
 Syed and Schapire [2008] Umar Syed and Robert E Schapire. A gametheoretic approach to apprenticeship learning. In Advances in neural information processing systems, pages 1449–1456, 2008.
 Wesselink et al. [2018] EM Wesselink, TH Kappen, HM Torn, AJC Slooter, and WA van Klei. Intraoperative hypotension and the risk of postoperative adverse outcomes: a systematic review. British journal of anaesthesia, 2018.
 Xu et al. [2018] Kelvin Xu, Ellis Ratner, Anca Dragan, Sergey Levine, and Chelsea Finn. Learning a prior over intent via metainverse reinforcement learning. arXiv preprint arXiv:1805.12573, 2018.
 Ziebart et al. [2008] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In Aaai, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
Appendix A Ellipsoid Algorithm for trajectories
Appendix B MVEE computation
This computation is commonly found in optimization lecture notes and textbooks. First, we define an ellipsoid by $E(c, A) = \{x : (x - c)^\top A^{-1} (x - c) \le 1\}$ for a vector $c$, the center of the ellipsoid, and an invertible (symmetric positive definite) matrix $A$. Our first task is computing the MVEE for the initial feasibility set; the result is of course a sphere around 0. For the update, given the normal $a$ of a violated constraint, we define $\tilde{a} = a / \sqrt{a^\top A a}$ and calculate the new ellipsoid (in dimension $n$) by $c' = c - \frac{1}{n+1} A \tilde{a}$, $A' = \frac{n^2}{n^2-1}\left(A - \frac{2}{n+1} A \tilde{a} \tilde{a}^\top A\right)$.

Appendix C Proof of Theorem 1
For simpler analysis, we define a "flattening" operator, converting a matrix to a vector by stacking its entries. We also define the operator given by the composition of the flattening operator and the outer product. Therefore, the value of a policy for a given context is given by an inner product between the flattened mapping matrix and the flattened outer product of the context and the policy's feature expectations.
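The identity behind this flattened representation, $\langle \mathrm{vec}(W), \mathrm{vec}(xy^\top) \rangle = x^\top W y$, can be checked numerically; the matrix $W$, the dimensions, and the variable names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # mapping matrix (illustrative)
x = rng.standard_normal(3)        # context vector
y = rng.standard_normal(4)        # feature expectations

# column-stacking flattening operator
vec = lambda M: M.flatten(order="F")

# inner product of flattened matrices equals the bilinear form x^T W y
lhs = vec(W) @ vec(np.outer(x, y))
rhs = x @ W @ y
assert np.isclose(lhs, rhs)
```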
Lemma 2 (Boyd and Barratt [1991]).
If $E$ is an ellipsoid with center $c$, and $E'$ is the minimum-volume ellipsoid containing the intersection of $E$ with a half-space through its center, then the volume of $E'$ is smaller than that of $E$ by at least a constant factor: $\mathrm{vol}(E') \le e^{-1/(2(n+1))}\,\mathrm{vol}(E)$.
Proof of Theorem 1.
We prove the theorem by showing that the volume of the ellipsoids for is bounded from below.
In conjunction with Lemma 2, which claims there is a minimal rate of decay in the ellipsoid volume, this shows that the number of times the ellipsoid is updated is polynomially bounded.
We begin by showing that always remains in the ellipsoid. We note that in rounds where , we have
. In addition, as the agent acts optimally w.r.t. the reward , we have that
. Combining these observations yields:
(3)
This shows that is never disqualified when updating . Since , this implies that . Next, we show that not only remains in the ellipsoid, but so does a small ball surrounding it. If is disqualified by the algorithm, then . Multiplying this inequality by −1 and adding it to (3) yields: . Applying Hölder's inequality to the LHS: . Therefore, for any disqualified : , and thus is never disqualified. This implies that . Finally, let be the number of rounds in which . Using Lemma 2, we get: . Therefore, . ∎
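For concreteness, the central-cut MVEE update underlying these ellipsoid arguments (the standard textbook formulas from Appendix B; the variable names and the example cut are illustrative) can be sketched in NumPy:

```python
import numpy as np

def ellipsoid_cut(c, A, a):
    """Central-cut ellipsoid update for the half-space a @ x <= a @ c.
    Returns the MVEE of the remaining half-ellipsoid (dimension n >= 2)."""
    n = len(c)
    a_tilde = a / np.sqrt(a @ A @ a)        # normalize the cut direction
    Aa = A @ a_tilde
    c_new = c - Aa / (n + 1)
    A_new = (n**2 / (n**2 - 1)) * (A - (2 / (n + 1)) * np.outer(Aa, Aa))
    return c_new, A_new

# cutting the unit disk with the half-space x[0] <= 0
c, A = np.zeros(2), np.eye(2)
c2, A2 = ellipsoid_cut(c, A, np.array([1.0, 0.0]))
# the volume (proportional to sqrt(det A)) shrinks by a constant factor
assert np.linalg.det(A2) < np.linalg.det(A)
```

Repeated cuts shrink the volume geometrically, which is exactly the decay rate invoked via Lemma 2 above.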
Appendix D Proof of Theorem 2
Lemma 3 (Azuma's inequality).
For a martingale $X_0, X_1, \ldots, X_n$, if $|X_i - X_{i-1}| \le c$ a.s. for $1 \le i \le n$, then for every $t > 0$: $\Pr\left(|X_n - X_0| \ge t\right) \le 2\exp\left(-t^2/(2nc^2)\right)$.
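As a sanity check, the bound in Azuma's inequality can be verified empirically on a simple bounded-increment martingale, a symmetric $\pm c$ random walk (the parameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, c, t = 100, 1.0, 25.0
trials = 20000

# symmetric +-c increments form a martingale with |X_i - X_{i-1}| <= c
increments = rng.choice([-c, c], size=(trials, n))
X_n = increments.sum(axis=1)

empirical = np.mean(np.abs(X_n) >= t)
azuma = 2 * np.exp(-t**2 / (2 * n * c**2))  # 2 * exp(-25/8) ~ 0.088
assert empirical <= azuma  # the tail bound holds
```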
Proof of Theorem 2.
We first note that we may assume that for any : . If , we update the ellipsoid by , where is the indicator vector of the coordinate in which exceeds 1, and the direction of the inequality depends on the sign of . If the constraint is still violated, this process can be repeated for a finite number of steps until , as the volume of the ellipsoid is bounded from below and each update reduces the volume (Lemma 2). Now we have , implying . As no points of are removed this way, this does not affect the correctness of the proof. Similarly, we may assume , as .
We denote which remains constant for each update in the batch by . We define the timesteps corresponding to the demonstrations in the batch for . We define to be the expected value of , and to be the outer product of and the feature expectations of the expert policy for . We also denote by . We bound the following term from below, as in Theorem 1:
(1): is bounded from below by , identically to the previous proof.
(2): by assumption , and thus, since , by Hölder's inequality the term is bounded by .
(3): we have from the definitions, and thus , since . As mentioned previously, we may assume ; therefore, by Hölder's inequality, the term is bounded by due to our choice of : .
(4): we note that the partial sums for are a martingale. Applying Azuma's inequality (Lemma 3) with and with our chosen n yields: with probability at least .
Thus, as in Theorem 1, this shows that is never disqualified, and the number of updates is bounded by ; multiplied by n, this yields the upper bound on the number of rounds in which a suboptimal action is chosen. By a union bound, the required bound for term (4) holds in all updates with probability at least .
∎
Appendix E Proof of Lemma 1
Proof of Lemma 1.
Our proof leverages the simulation lemma: a small change in corresponds to a small change in , and in turn, the resulting policies are ‘close’ in value. We recall the results from Kearns and Singh [2002], both the definition of an approximate MDP and the similarity result over the resulting value functions (Lemma 4).
Definition 1.
Let and be Markov decision processes over the same state space. Then we say that is an approximation of if:
Lemma 4 (Simulation Lemma [Kearns and Singh, 2002]).
Let be any Markov decision process over states. Let , and let be an approximation of . Then for any policy ,
Note that the reward is linear in and the features ; hence , for some constant and for all contexts , implies that . Plugging this result into the simulation lemma, we conclude that and , where and are constants. This implies that the MSE is also Lipschitz, which concludes our proof. ∎
Remark 1.
Appendix F Experimental Details
In this section, we describe the technical details of our experiments, including the hyperparameters used. To solve MDPs, we use value iteration. Our implementation is based on a stopping condition with a tolerance threshold , such that the algorithm stops if . In the driving simulation we used , and in the sepsis treatment we used .
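A minimal value-iteration sketch with such a max-norm stopping rule follows; the toy MDP and the tolerance value are illustrative, not the paper's environments:

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """P: (A, S, S) transition tensor, R: (S, A) rewards.
    Stops when the max-norm change in V falls below tol."""
    A, S, _ = P.shape
    V = np.zeros(S)
    while True:
        # Q[s, a] = R[s, a] + gamma * E[V(next state)]
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# tiny 2-state, 2-action example: action 0 keeps the state, action 1 swaps it
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0], [1.0, 0.0]])
V, pi = value_iteration(P, R)  # V approaches 1/(1 - gamma) = 20 in both states
```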
F.1 Autonomous driving simulation
In these experiments, we define our mappings in a way that induces different behaviours for different contexts, making generalization a more challenging task. Specifically, for the linear setting we use , before normalization. For our nonlinear mapping, contexts with are mapped to reward coefficients vector , otherwise they are mapped to , which induce the feature expectations respectively. The decision regions for the nonlinear mapping are visualized in Fig. 6. The contexts are sampled uniformly in the 2dimensional simplex. We evaluate all algorithms on the same sequences of contexts, and average the results over such sequences.
Hyperparameter selection: By definition, the ellipsoid algorithm is hyperparameter-free and does not require tuning.
ES algorithms: for the linear model, the algorithms maintained a matrix to estimate . For loss (1), the algorithm was executed with the parameters with a decay rate of , for epochs, where the algorithm takes one step to minimize the loss for each context; the order of the contexts is randomized when a new context is added, but not during the algorithm's run. For loss (2), the algorithm was executed with the parameters with a decay rate of , for iterations; it did not iterate randomly over the contexts, but rather used the entire training set for each step. Note that for this loss, additional points sampled for the descent-direction estimation do not require solving MDPs, and thus more can be used for a more accurate calculation. For both losses, the matrix was normalized according to , as was the step calculated by the ES algorithm, before it was multiplied by and applied.
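For reference, a basic antithetic evolution-strategies step in the style of Salimans et al. [2017] can be sketched as follows; the quadratic loss and all parameter values here are illustrative, not the paper's implementation:

```python
import numpy as np

def es_step(theta, loss, rng, sigma=0.05, n_pairs=10, lr=0.1):
    """One antithetic evolution-strategies step on `loss`."""
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = rng.standard_normal(theta.shape)
        # antithetic pair: evaluate the loss at theta +- sigma * eps
        grad += (loss(theta + sigma * eps) - loss(theta - sigma * eps)) * eps
    grad /= 2 * sigma * n_pairs
    return theta - lr * grad  # descend along the estimated gradient

# quadratic sanity check: repeated steps drive the loss toward 0
loss = lambda w: float(np.sum(w ** 2))
rng = np.random.default_rng(0)
w = np.ones(4)
for _ in range(50):
    w = es_step(w, loss, rng)
```

A smaller `sigma` reduces the bias of the finite-difference estimate, while more sampled pairs reduce its variance, matching the trade-offs described above.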
For the nonlinear setting, the model used for the nonlinear mapping was a fully connected neural net, with layers of sizes
. The activation function used was the leaky ReLU, with a parameter of
. Note that we cannot normalize the parameters here as in the linear case; therefore, an L2-normalization layer is added to the output. The same parameters were used as in the linear case, except with 120 iterations over the entire training set. They were originally optimized for this model and setting, and worked as-is for the linear environment. As we aim to estimate the gradient, a small was used and performed best. The number of points, , was selected as fewer points produced noisy results. The step size, decay rate, and number of iterations were selected to produce fast yet accurate convergence of the loss. Note that here the steps were also normalized before application, and the normalization was applied per layer.

We also provide results for the offline framework, demonstrating that the ellipsoid method is not suited for this framework and must be initialized in the manner we describe in the warm-start section. Here, we used a training and test set of contexts to evaluate the algorithms. The ellipsoid method uses all contexts in the training set to update its estimation of