Imitation Learning is an interdisciplinary territory [Attia and Dayan2018]. What originally emerged as a supervised learning problem [Pomerleau1991] has since been advanced by members of the reinforcement learning community [Daumé, Langford, and Marcu2009], [Ross, Gordon, and Bagnell2011b]. Imitation learning has been successfully applied as an end-to-end solution [Ho and Ermon2016], or as a building block in more elaborate engineering architectures [Silver et al.2016]. Accommodating imitation learning concepts when training artificial agents is beneficial for several reasons. Techniques of this nature usually converge faster [Ross and Bagnell2010] and with fewer unanticipated artifacts in the converged policy (a.k.a. reward hacking) [Schaal1999]. The merits of imitation learning, together with the current limitations of alternative reinforcement learning approaches [Irpan2018], make it a key component in the design of intelligent artificial agents.
Current Imitation Techniques are too Restrictive
Most commonly, imitation learning refers to the imitation of humans by robots or other artificial agents [Schaal1999]. Existing methods of imitation learning attempt to follow the expert’s policy directly. In other words, agents are trained to recover the state-action mapping induced by the expert. A fundamental underlying assumption of this approach is that the agent can act like the expert at all. Namely, that the expert and agent share the same action space. In the general case where the expert’s and agent’s action spaces are different, i.e., $\mathcal{A}_{\text{expert}} \neq \mathcal{A}_{\text{agent}}$, most, if not all, existing approaches will fail because pure imitation is no longer feasible. Furthermore, this restriction has far-reaching consequences when imitation is applied to real-world applications. In the context of robotics, it requires robots to have a humanoid structure, and in the context of self-driving cars, it favors the use of continuous-action agents since humans operate in this domain. As a result, the adoption of imitation learning in real-world applications has been severely hindered [Schaal and Atkeson2010].
Scalable Imitating Should be Action-Space Agnostic
We focus on imitation problems where the internal state of the agent/expert is irrelevant for success. We refer to this type of problem as agent-agnostic tasks. (Not all imitation tasks can be formulated in an agent-agnostic manner; there also exist tasks with an inherent dependence on the internal state of the agent, for example teaching a robot to dance. We refer to these as agent-dependent tasks.) This setup covers most of the imitation tasks that come to mind. (It is worth noting that agent-dependent tasks can often be reformulated as agent-agnostic tasks. For instance, a walking task can be rephrased in an agent-agnostic manner by including the expert’s location, i.e., its center of mass, in the state.) Included here are object manipulation [Asfour et al.2008] and maneuvering tasks [Abbeel and Ng2004]. By definition, proper imitation in agent-agnostic tasks would aim to imitate the influence experts have on the environment rather than their explicit sequence of actions. In other words, scalable imitation should try to imitate the transitions of environmental states induced by the expert, rather than its policy. We denote such a loose form of imitation as Inspiration Learning, since the agent is free to craft new strategies as long as their effect on the environment remains the same. Figuratively speaking, if a camera were used to record expert demonstrations, then in the standard imitation approach it would be set to record the expert, while in the inspiration approach it would be set to record what the robot sees, i.e., the environment.
Knowledge Transfer via State Transitions
Transferring knowledge between an expert and an agent that do not share the same action space requires creative ways to evaluate whether the agent has learned how to carry out the task at hand. In this work, we try to address this challenge. We argue that in the context of sequential decision-making problems, addressing this question is possible by monitoring state transitions. We turn to the celebrated actor-critic architecture and design a dyadic critic specifically for this task. The critic we use consists of two parts: 1) a state-value estimation part, as in common actor-critic algorithms, and 2) a single-step reward function derived from an expert/agent classifier. This critic, which is oblivious to the action space of both players, is able to guide any agent toward behaviors that generate effects on the environment analogous to those of the expert, even if the eventual policy is completely different from the one demonstrated by the expert.
In this section, we revisit key milestones in the field of imitation learning, starting from basic supervised approaches, and up to generative adversarial based imitation. Lastly, we briefly review the field of Preference-based Reinforcement Learning (PbRL), a concept that we believe can improve current imitation learning approaches.
The Early Days of Imitation Learning
Not surprisingly, the first attempts to solve imitation tasks were based on ideas of supervised learning [Pomerleau1991]. Not long after, problems such as data scarcity [Natarajan et al.2013] and covariate shifts [Sugiyama and Kawanabe2012] forced the adoption of fresh ideas from the field of Reinforcement Learning (RL).
Generally speaking, RL-based methods held the promise of addressing the fundamental limitation of supervised approaches. That is, accounting for potentially disastrous approximation errors by observing their effect throughout complete trajectories.
For example, work such as Forward Training [Ross and Bagnell2010] and SMILe [Ross and Bagnell2010] offered gradual schemes that learn different policies for each time-step. At time $t$, Forward Training will try to compensate for drifting errors by training a policy on the actual state distribution induced by the policies at steps $1$ to $t-1$. SMILe, on the other hand, will repeatedly query the expert on the actual trajectories visited by the agent until convergence. However, both approaches, and counterparts of their kind, were not suitable for challenging imitation setups. Tasks with a long time horizon were ruled out because of the requirement to train a policy for each time-step. Real-world problems were excluded because they could not provide an expert-on-demand as the methods require.
Imitation Learning and No-Regret Algorithms
Soon after, Ross, Gordon, and Bagnell [2011a] introduced the DAgger algorithm. While similar in spirit to SMILe, DAgger operates in a slightly different manner. It proceeds by gathering experience using an agent-expert mixed behavioral policy:
$$\pi_i = \beta_i \pi^* + (1 - \beta_i)\hat{\pi}_i,$$
where $\pi^*$, $\pi_i$ and $\hat{\pi}_i$ are the expert, behavior and agent policies respectively. The data gathered using $\pi_i$ is aggregated together with all datasets collected up to iteration $i$: $\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{D}_i$. Eventually, a policy is trained on the cumulative set, labeled with the help of the expert:
$$\hat{\pi}_{i+1} = \arg\min_{\pi} \, \mathbb{E}_{s \sim \mathcal{D}}\left[\ell(s, \pi)\right].$$
Explained differently, at each iteration DAgger trains a policy to succeed on all trajectories seen so far. This is in contrast to most RL approaches, which fit fresh online data only. With this trait in mind, DAgger can be thought of as an online no-regret imitation algorithm. Moreover, the authors show that no other imitation algorithm can achieve better regret.
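The aggregation loop above can be sketched as follows. The `expert_policy`, `train_policy` and `env` callables are hypothetical stand-ins, and the geometric schedule for $\beta_i$ is one arbitrary choice among the decaying schedules the algorithm admits:

```python
import random

def dagger(expert_policy, train_policy, env, n_iters=10, episode_len=100):
    """Minimal DAgger sketch (hypothetical helper signatures assumed).

    expert_policy(state) -> expert action (used only for labeling)
    train_policy(dataset) -> a policy fit on all (state, expert_action) pairs
    env.reset() -> state; env.step(action) -> (next_state, done)
    """
    dataset = []             # aggregated across iterations: D <- D U D_i
    policy = expert_policy   # pi_1 follows the expert (beta_1 = 1)
    for i in range(n_iters):
        beta = 0.5 ** i      # any schedule with beta_i -> 0 works
        state = env.reset()
        for _ in range(episode_len):
            # mixed behavior policy: pi_i = beta*pi* + (1 - beta)*pi_hat
            act = expert_policy(state) if random.random() < beta else policy(state)
            # label every visited state with the expert's action
            dataset.append((state, expert_policy(state)))
            state, done = env.step(act)
            if done:
                break
        policy = train_policy(dataset)  # fit on the cumulative set
    return policy
```

Note how, unlike SMILe, the fit at every iteration is over the whole aggregated dataset, which is what gives the no-regret interpretation.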
Adversarial Networks and Imitation Learning
A significant breakthrough in the field occurred in 2016 with the introduction of Generative Adversarial Imitation Learning (GAIL) [Ho and Ermon2016], an imitation algorithm closely related to the celebrated GAN architecture. GAN was originally presented as a method to learn generative models by defining a two-player zero-sum game:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right],$$
where $p_z$ is some noise distribution, player $G$ represents the generative model and $D$ is the judge. GAIL showed that GAN fits imitation problems like a glove. By modifying $G$ to represent a policy $\pi$, GAIL showed how to harness GAN for imitation purposes:
$$\min_\pi \max_D \; \mathbb{E}_{\pi}\left[\log D(s, a)\right] + \mathbb{E}_{\pi_E}\left[\log\left(1 - D(s, a)\right)\right].$$
The motivation behind GAIL, i.e., using GAN for imitation, is to rely on a neural network to build a dynamic decision rule that classifies between the expert’s and the agent’s state-action pairs. GAIL uses the continuous classification score as a proxy reward signal to train the policy player $\pi$. Beyond that, GAIL also proved to be efficient with respect to the number of expert examples it requires. This is partially explained by the fact that even though expert examples are scarce, the algorithm enjoys unlimited access to agent examples through simulation. Loosely speaking, having infinitely many agent examples allows the discriminative net to gain a precise understanding of the agent's behavior, and as a result to also understand its differences from the expert.
However, while GAIL is efficient in terms of expert examples, this is clearly not the case regarding the required number of environment interactions. The high sample complexity is explained by the fact that GAIL’s update rule is based on the famous REINFORCE algorithm [Williams1992]. REINFORCE offers an approximation to the true gradient and is primarily used in situations where it is hard, or even impossible, to calculate the original gradient. However, REINFORCE suffers from high variance and is known to require numerous iterations before converging.
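To illustrate why REINFORCE is cheap but noisy, here is a minimal score-function update for a toy one-parameter Bernoulli policy. The parameterization and episode format are invented for the example; only the estimator itself (scaling $\nabla \log \pi$ by the return) reflects the algorithm:

```python
import math

def reinforce_update(theta, episodes, lr=0.01):
    """One REINFORCE step for a toy 2-action sigmoid policy with a single
    parameter theta. The gradient of log pi(a) is scaled by the episode
    return -- an unbiased but high-variance estimate of the true gradient.

    episodes: list of (action_list, return) pairs sampled under the policy.
    """
    grad = 0.0
    for actions, ret in episodes:
        for a in actions:
            p1 = 1.0 / (1.0 + math.exp(-theta))  # pi(a = 1)
            # d/dtheta log pi(a) = (a - p1) for this sigmoid policy
            grad += (a - p1) * ret
    return theta + lr * grad / len(episodes)
```

With few episodes the estimate swings widely from step to step, which is exactly the variance problem referenced above.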
Model-Based Adversarial Imitation Learning
To compensate for this inefficiency, model-based Generative Adversarial Imitation Learning (mGAIL) [Baram, Anschel, and Mannor2016] offered to take the opposite approach. While GAIL applies gradient approximations on samples obtained from the original environment, mGAIL applies the original gradient on samples obtained from an approximated environment.
Put differently, mGAIL offers a model-based alternative that attempts to learn a differentiable parametrization of the forward model (transition function). Using this approach, a multi-step interaction with the environment creates an end-to-end differentiable graph that allows gradients to be backpropagated through time, thus enabling calculation of the original gradient of the objective function. mGAIL’s advantage of using the original gradient comes at the cost of learning an accurate forward model, a task that often proves to be extremely challenging. Errors in the forward model can bias the gradient up to a level where convergence is again at risk.
The Limitations of Current Approaches
While GAIL and mGAIL complement each other, both methods amount to a standard imitation setup that requires a shared action space between the expert and the agent. To understand why, it is enough to revisit GAIL’s decision rule (Eq. 2). In both methods, the rule is based on the joint distribution of states and actions. Not only does such a decision rule limit the agent to operating in the expert’s action domain, it also adds further optimization complications: states and actions are completely different quantities, and embedding them in a joint space is not trivial.
In this work, we argue that the right decision rule should not be based on the joint state-action distribution $p(s, a)$, but rather on the state-transition distribution $p(s_t, s_{t+1})$. Using the state-transition distribution, we can neutralize the presence of the agent performing the task and instead focus on imitating the effects it induces on the environment.
PbRL and Adversarial Imitation Learning
Even though adversarial-based methods have proved successful for imitation, algorithms of this type suffer from an acute problem: they induce a non-stationary MDP. The reward, which is derived from a continually-adapting classification rule, is constantly changing. As a result, estimation of long-term returns, an underlying ingredient in most RL algorithms, becomes almost infeasible [Nareyek2003], [Koulouriotis and Xanthopoulos2008]. We believe that alleviating this problem is possible using the concept of PbRL.
The motivation behind PbRL is to alleviate the difficulty of designing reward functions. Originally, PbRL was designed as a paradigm for learning from non-numerical feedback [Fürnkranz and Hüllermeier2011]. Instead of numeric rewards, PbRL tries to train agents using preferences between states, actions or trajectories. The goal of the agent in PbRL is to find a policy that maximally complies with a set of preferences $\zeta$. Assume two trajectories $\tau_1, \tau_2$. A preference $\tau_1 \succ \tau_2$ is satisfied if the policy generates $\tau_1$ with higher probability than $\tau_2$:
$$\Pr\nolimits_{\pi}(\tau_1) > \Pr\nolimits_{\pi}(\tau_2).$$
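Compliance with a preference set of this kind can be checked mechanically. The sketch below assumes a hypothetical `logp` callable returning the log-probability a policy assigns to a trajectory; grounding preferences in trajectory probabilities is only one of the formulations surveyed in the PbRL literature:

```python
def preference_satisfied(logp, tau1, tau2):
    """A preference tau1 > tau2 is satisfied when the policy generates
    tau1 with higher probability than tau2 (log-space comparison)."""
    return logp(tau1) > logp(tau2)

def compliance(logp, preferences):
    """Fraction of a preference set zeta = [(tau1, tau2), ...] that a
    policy satisfies; the PbRL agent seeks to maximize this quantity."""
    ok = sum(preference_satisfied(logp, t1, t2) for t1, t2 in preferences)
    return ok / len(preferences)
```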
Generally speaking, PbRL methods are divided into three categories. The first includes algorithms that directly search for policies that comply with expert preferences [Wilson, Fern, and Tadepalli2012]. The second consists of “model-based” approaches that rely on a preference model [Fürnkranz et al.2012]. And the third encompasses methods that try to estimate a surrogate utility for a given trajectory [Akrour, Schoenauer, and Sebag2011]. We refer the reader to Wirth et al. [2017] for a more comprehensive survey on PbRL.
As of today, the prevalent approach to imitation is to take a GAN-like approach. To understand this connection better, we recall that GANs in the context of imitation, or RL in general, can be best understood as a form of an actor-critic architecture [Pfau and Vinyals2016]. Therefore, in the following section, we present a family of advantage actor-critic algorithms for Inspiration learning tasks.
Actor Critic Methods
In an actor-critic algorithm, one of the prevailing approaches in reinforcement learning [Konda and Tsitsiklis2000], one maintains a separate parameterization for the policy (actor) and the state-value function (critic). The role of the actor is straightforward: to represent the policy. The role of the critic, on the other hand, is to assess the expected performance of the policy based on experience. The critic’s estimation is used as a baseline to determine whether the current behavior should be strengthened (if better than the baseline) or weakened (if worse). Numerous actor-critic variations exist [Vamvoudakis and Lewis2010], [Bhasin et al.2013]. Among the common ones is the advantage actor-critic architecture [Peters and Schaal2008].
Advantage Actor-Critic for Inspiration Learning
The advantage function of a state-action pair is defined as $A(s, a) = Q(s, a) - V(s)$. The function comprises two parts: an action-dependent term $Q(s, a)$ and an action-independent term $V(s)$. Because of their structure, advantage functions are commonly used to score gradients in policy gradient algorithms. $Q(s_t, a_t)$ is often approximated by the return from online rollouts, $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$, while $V(s)$ is trained to predict the expected discounted return of states through regression. The approximated advantage function is given by:
$$\hat{A}(s_t, a_t) = R_t - V(s_t).$$
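The backward computation of $\hat{A}(s_t, a_t) = R_t - V(s_t)$ from a single rollout can be sketched directly from the definitions above:

```python
def advantages(rewards, values, gamma=0.99):
    """Approximate A(s_t, a_t) = R_t - V(s_t) for one rollout.

    rewards[t] is r_t; values[t] is the critic's estimate V(s_t).
    The discounted return R_t is accumulated backward from the end:
    R_t = r_t + gamma * R_{t+1}.
    """
    advs, ret = [], 0.0
    for r, v in zip(reversed(rewards), reversed(values)):
        ret = r + gamma * ret        # R_t
        advs.append(ret - v)         # A_t = R_t - V(s_t)
    return advs[::-1]
```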
However, as in any imitation challenge, the reward signal is absent. In the following, we describe how, using PbRL, we are able to synthesize robust classification-based rewards that can be integrated into any advantage actor-critic algorithm. We start by describing the proposed rewards.
Basic scoring: The first variant we consider scores actions by their raw classification score. It is simply given by $r_t = D(s_t, s_{t+1})$. We note that $D(s_t, s_{t+1})$ is the classification score for the transition, where $s_{t+1}$ is the state following $s_t$ when choosing action $a_t$.
Figure 1: Shared Action Imitation: results for using our proposed method in the standard imitation framework where the expert and the agent share the same action space. We run experiments on three Atari2600 games: Breakout (a), Enduro (b) and Seaquest (c)
Preferential scoring: Working with a discrimination-based reward poses challenges that arise from its non-stationary nature. To alleviate this problem, we suggest applying a ranking transformation on the raw classification scores.
Put in words, the ranking transformation calculates action-based discrimination preferences. Doing so, we are able to discard nuances in temporary classification scores and encourage stationarity through the course of training. The preferential reward is given by the rank of each transition’s classification score.
Soft-preferential scoring: In cases where classification scores are approximately equal across actions, a hard ranking transformation is likely to deform the decision rule to a high degree. Relaxing the deformation can be carried out in several ways. In this paper, we choose to apply a softmax transformation to the raw classification scores. The soft-preferential reward is therefore given by the softmax-transformed classification score.
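The three scoring variants can be sketched as follows, operating on a list of raw classification scores for the candidate transitions. This is a plain rank/softmax reading of the text; the exact normalization used in the paper may differ:

```python
import math

def basic_rewards(scores):
    """Basic scoring: the reward is the raw classification score
    D(s_t, s_{t+1}) of each observed transition."""
    return list(scores)

def preferential_rewards(scores):
    """Preferential scoring: replace each raw score with its rank among
    the candidate transitions, discarding nuances in absolute classifier
    values to encourage a more stationary reward."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, i in enumerate(order):
        ranks[i] = rank
    return ranks

def soft_preferential_rewards(scores):
    """Soft-preferential scoring: a softmax over the raw scores, relaxing
    the hard ranking when scores are approximately equal."""
    m = max(scores)                            # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

When two scores are nearly tied, the hard rank forces an arbitrary ordering, while the softmax assigns them nearly equal reward, which is the relaxation motivated above.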
The algorithm we propose requires a set of expert trajectories that include states only. At each iteration, the agent gathers experience according to its current policy. At the same time, classification scores and state-value estimations are recorded at each time step $t$. We then derive a reward function as explained above. For efficiency, the policy improvement step proceeds from end to start: an advantage function is approximated using the returns $R_t$ and the value estimations $V(s_t)$. Finally, a policy gradient step can take place:
The full algorithm is presented in Algorithm 1.
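A single iteration of this scheme might look like the following sketch, where `policy`, `value_fn` and `classifier` are hypothetical callables and the raw classification score is used as the reward (either preferential variant could be substituted):

```python
def inspiration_learning_step(env, policy, value_fn, classifier, gamma=0.99):
    """One iteration of the proposed actor-critic scheme (a sketch).

    The reward is derived from the expert/agent transition classifier
    D(s_t, s_{t+1}) -- ground-truth actions of the expert are never used.
    env.step(a) -> (next_state, done); policy.sample(s) -> action;
    policy.gradient_step(states, actions, advs) applies the update.
    """
    states, actions, scores, values = [env.reset()], [], [], []
    done = False
    while not done:
        a = policy.sample(states[-1])
        s_next, done = env.step(a)
        scores.append(classifier(states[-1], s_next))  # D(s_t, s_{t+1})
        values.append(value_fn(states[-1]))            # V(s_t)
        actions.append(a)
        states.append(s_next)
    rewards = scores           # basic scoring; a (soft-)rank transform fits here
    # backward pass: R_t = r_t + gamma * R_{t+1},  A_t = R_t - V(s_t)
    advs, ret = [], 0.0
    for r, v in zip(reversed(rewards), reversed(values)):
        ret = r + gamma * ret
        advs.append(ret - v)
    advs.reverse()
    policy.gradient_step(states[:-1], actions, advs)   # policy-gradient update
    return advs
```

The backward pass is what the text calls proceeding "from end to start": each return is built from its successor in one sweep rather than re-summed per time step.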
In this section, we assess the performance of our actor-critic algorithms in action. We used a standard shared parameterization for the three components of our algorithm: the actor, the state-value function and the expert-agent classifier. Our implementation uses a convolutional neural network with two layers of 64 convolution channels each (with rectified linear units in between), followed by a 512-wide fully connected layer (with rectified linear unit activation), and three output layers for $\pi$, $V$ and $D$. Unless otherwise mentioned, states include a stack of the last frames, and experts were trained using a vanilla advantage actor-critic algorithm [Brockman et al.2016]. We use the same set of expert trajectories across all experiments. (All experiments were conducted on a single GPU machine with a single actor.)
Shared Actions Imitation:
The purpose of the first set of experiments is to test our algorithm in the familiar imitation learning setup: an expert and an agent that share the same action space. We recall that only expert states are recorded, and the agent has no access to any ground truth actions whatsoever. We tested three Atari2600 games: Breakout, Enduro and Seaquest. Results for this section are shown in Figure 1.
Continuous to Discrete Imitation:
In the second set of experiments, we wish to test our method in a setup where the expert is using a continuous action space and the agent uses a discrete one. To test this setup we use the following two environments:
Roundabout Merging Problem:
A car is merging into a roundabout that contains two types of drivers: aggressive ones that do not give right of way when a new car wishes to enter the roundabout, and courteous drivers that will let the new car merge in. The car is positively rewarded after a successful merge and negatively rewarded otherwise (we note that success is defined over a set of considerations including efficiency, safety, comfort, etc.). We use rule-based logic for the expert in this case. The expert is responsible for outputting two continuous signals: 1) a one-dimensional steering command and 2) a one-dimensional acceleration command. The agent, on the other hand, is allowed to choose from a discrete set of six commands indicating longitudinal and transverse shifts from its current position. Results are shown in Figure 2.
A continuous control task modeled by the MuJoCo physics simulator [Todorov, Erez, and Tassa2012]. The expert uses a continuous action space to balance a monoped with multiple degrees of freedom and keep it from falling. We use the Trust Region Policy Optimization algorithm [Schulman et al.2015] to train the expert policy. We present the agent with the expert demonstrations and restrict it to a set of 7 actions only (obtained by applying K-means clustering [Agarwal and Mustafa2004] to the expert actions). Results for this experiment are presented in Figure 2.
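The discretization step can be reproduced with a plain k-means pass over the recorded one-dimensional expert actions. This is a generic sketch, not the exact projective clustering of [Agarwal and Mustafa2004]:

```python
import random

def kmeans_actions(expert_actions, k=7, iters=20, seed=0):
    """Cluster recorded 1-D expert actions into k discrete prototypes,
    which the agent can then use as its finite action set."""
    rng = random.Random(seed)
    centers = rng.sample(expert_actions, k)   # initialize from the data
    for _ in range(iters):
        # assign each action to its nearest center
        clusters = [[] for _ in range(k)]
        for a in expert_actions:
            i = min(range(k), key=lambda j: (a - centers[j]) ** 2)
            clusters[i].append(a)
        # move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)
```

During imitation the agent simply picks one of the returned prototypes per step, so expert demonstrations recorded in a continuous action space drive a purely discrete learner.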
Skills to Primitives Imitation:
In the third set of experiments, an expert is first trained to solve a task using a set of predefined skills [Sutton, Precup, and Singh1999]. We then trained the agent, equipped with primitive actions only, to imitate the expert. We note that in this experiment, expert skills did not include primitive actions that are not available to the agent. We tested the same Atari games (Breakout, Enduro and Seaquest). The agent was able to achieve expert performance in all games (see Figure 3).
Primitives to Skills Imitation:
In the last set of experiments, an expert is trained using primitive actions only, while the agent is equipped with skills. As before, we used predefined skills, and the skills did not include primitive actions unavailable to the expert. We note that using our algorithm to learn the options themselves could prove a useful method for extracting expert skills from demonstrations. However, this is out of the scope of this work and is a subject of further research. Results for this section are provided in Figure 4.
In this work, we show how imitation between differently acting agents is possible.
Our novelty lies in the observation that imitation is attained when two agents induce the same effect on the environment and not necessarily when they share the same policy.
We accompany our observation with a family of actor-critic algorithms. An actor – to represent the training agent. A critic – to a) classify state transitions into two classes (agent/expert), from which a reward signal is derived to guide the actor, and b) learn the state-value function for bias reduction purposes.
We provide results for various types of imitation setups, including shared action space imitation, continuous to discrete imitation and primitive to macro imitation. Some of the results are surprising. For example, the ability to distill a continuous-action policy using discrete sets (we show examples where the agent uses as few as 7 discrete actions). However, some of the results are less intriguing. For instance, the ability to decompose a macro-level policy into a primitive-level one is almost trivial in our case: the critic is oblivious to the performing agent and is only concerned with the induced effect on the environment. Thus, knowledge transfer between agents that operate in different action domains is possible.
The Importance of a Shared Parametrization
We have also experimented with separate parametrization architectures, where the agent-expert classifier is modeled by a completely separate neural network (as opposed to a shared parametrization architecture, where a single neural network outputs all three signals). We found that shared parametrization produces significantly better results. We hypothesize that a shared parametrization is important because the same features that are used to embed the expert (for the expert-agent classification task) are also used to sample actions from.
Where Do We Go From Here?
Our method allows agents to find new strategies (besides the expert’s one) to solve a task. If this is the case, then it’s fair to consider agents (or strategies) that do not entirely cover the expert’s state distribution, but perhaps just a vital subset of states that are needed to reach the end goal. We believe that this is an interesting direction for further research.
The flexible nature of our framework, which allows imitating an expert through various strategies, can be used to obtain super-expert performance. By super we mean policies that are safer than the expert’s, more efficient, more robust, and so on. Our method can be integrated as a building block in an evolutionary algorithm that helps evolve robots that are optimal for specific imitation tasks.
A typical setup of our method requires two ingredients: a) expert examples (e.g., video recordings) and b) a simulator to train the agent. Although not tested, we require the expert states and the simulator states to be the same (i.e., to be generated by the same source). We speculate that this is crucial in order to prevent the critic from performing vain classification based on state appearance rather than on state dynamics, as we would like. We hold that this limitation can be removed to allow imitation between different state distributions. Such an improvement could be carried out, for example, by redesigning the critic to solve two separate tasks: 1) appearance classification and 2) dynamics classification, and deriving the reward from the latter. This improvement is out of the scope of this paper and will be explored in subsequent work.
- [Abbeel and Ng2004] Abbeel, P., and Ng, A. Y. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, 1. ACM.
- [Agarwal and Mustafa2004] Agarwal, P. K., and Mustafa, N. H. 2004. K-means projective clustering. In Proceedings of the Twenty-third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’04, 155–165. New York, NY, USA: ACM.
- [Akrour, Schoenauer, and Sebag2011] Akrour, R.; Schoenauer, M.; and Sebag, M. 2011. Preference-based policy learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 12–27. Springer.
- [Asfour et al.2008] Asfour, T.; Azad, P.; Gyarfas, F.; and Dillmann, R. 2008. Imitation learning of dual-arm manipulation tasks in humanoid robots. International Journal of Humanoid Robotics 5(02):183–202.
- [Attia and Dayan2018] Attia, A., and Dayan, S. 2018. Global overview of imitation learning. arXiv preprint arXiv:1801.06503.
- [Baram, Anschel, and Mannor2016] Baram, N.; Anschel, O.; and Mannor, S. 2016. Model-based adversarial imitation learning. arXiv preprint arXiv:1612.02179.
- [Bhasin et al.2013] Bhasin, S.; Kamalapurkar, R.; Johnson, M.; Vamvoudakis, K. G.; Lewis, F. L.; and Dixon, W. E. 2013. A novel actor–critic–identifier architecture for approximate optimal control of uncertain nonlinear systems. Automatica 49(1):82–92.
- [Brockman et al.2016] Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. Openai gym.
- [Daumé, Langford, and Marcu2009] Daumé, H.; Langford, J.; and Marcu, D. 2009. Search-based structured prediction. Machine learning 75(3):297–325.
- [Fürnkranz and Hüllermeier2011] Fürnkranz, J., and Hüllermeier, E. 2011. Preference learning. In Encyclopedia of Machine Learning. Springer. 789–795.
- [Fürnkranz et al.2012] Fürnkranz, J.; Hüllermeier, E.; Cheng, W.; and Park, S.-H. 2012. Preference-based reinforcement learning: a formal framework and a policy iteration algorithm. Machine learning 89(1-2):123–156.
- [Ho and Ermon2016] Ho, J., and Ermon, S. 2016. Generative adversarial imitation learning. arXiv preprint arXiv:1606.03476.
- [Irpan2018] Irpan, A. 2018. Deep reinforcement learning doesn’t work yet. https://www.alexirpan.com/2018/02/14/rl-hard.html.
- [Konda and Tsitsiklis2000] Konda, V. R., and Tsitsiklis, J. N. 2000. Actor-critic algorithms. In Advances in neural information processing systems, 1008–1014.
- [Koulouriotis and Xanthopoulos2008] Koulouriotis, D. E., and Xanthopoulos, A. 2008. Reinforcement learning and evolutionary algorithms for non-stationary multi-armed bandit problems. Applied Mathematics and Computation 196(2):913–922.
- [Nareyek2003] Nareyek, A. 2003. Choosing search heuristics by non-stationary reinforcement learning. In Metaheuristics: Computer Decision-Making. Springer. 523–544.
- [Natarajan et al.2013] Natarajan, S.; Odom, P.; Joshi, S.; Khot, T.; Kersting, K.; and Tadepalli, P. 2013. Accelerating imitation learning in relational domains via transfer by initialization. In International Conference on Inductive Logic Programming, 64–75. Springer.
- [Peters and Schaal2008] Peters, J., and Schaal, S. 2008. Natural actor-critic. Neurocomputing 71(7-9):1180–1190.
- [Pfau and Vinyals2016] Pfau, D., and Vinyals, O. 2016. Connecting generative adversarial networks and actor-critic methods. arXiv preprint arXiv:1610.01945.
- [Pomerleau1991] Pomerleau, D. A. 1991. Efficient training of artificial neural networks for autonomous navigation. Neural Computation 3(1):88–97.
- [Ross and Bagnell2010] Ross, S., and Bagnell, D. 2010. Efficient reductions for imitation learning. In AISTATS, 661–668.
- [Ross, Gordon, and Bagnell2011a] Ross, S.; Gordon, G. J.; and Bagnell, D. 2011a. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, volume 1, 6.
- [Ross, Gordon, and Bagnell2011b] Ross, S.; Gordon, G. J.; and Bagnell, J. A. 2011b. No-regret reductions for imitation learning and structured prediction. In In AISTATS. Citeseer.
- [Schaal and Atkeson2010] Schaal, S., and Atkeson, C. G. 2010. Learning control in robotics. IEEE Robotics & Automation Magazine 17(2):20–29.
- [Schaal1999] Schaal, S. 1999. Is imitation learning the route to humanoid robots? Trends in cognitive sciences 3(6):233–242.
- [Schulman et al.2015] Schulman, J.; Levine, S.; Moritz, P.; Jordan, M. I.; and Abbeel, P. 2015. Trust region policy optimization. CoRR, abs/1502.05477.
- [Silver et al.2016] Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of go with deep neural networks and tree search. nature 529(7587):484.
- [Sugiyama and Kawanabe2012] Sugiyama, M., and Kawanabe, M. 2012. Machine learning in non-stationary environments: Introduction to covariate shift adaptation. MIT press.
- [Sutton, Precup, and Singh1999] Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence 112(1-2):181–211.
- [Todorov, Erez, and Tassa2012] Todorov, E.; Erez, T.; and Tassa, Y. 2012. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, 5026–5033. IEEE.
- [Vamvoudakis and Lewis2010] Vamvoudakis, K. G., and Lewis, F. L. 2010. Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica 46(5):878–888.
- [Williams1992] Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8(3-4):229–256.
- [Wilson, Fern, and Tadepalli2012] Wilson, A.; Fern, A.; and Tadepalli, P. 2012. A bayesian approach for policy learning from trajectory preference queries. In Advances in neural information processing systems, 1133–1141.
- [Wirth et al.2017] Wirth, C.; Akrour, R.; Neumann, G.; and Fürnkranz, J. 2017. A survey of preference-based reinforcement learning methods. The Journal of Machine Learning Research 18(1):4945–4990.