1 Introduction
Preference elicitation is a well-known problem in statistical decision theory [5]. The goal is to determine whether a given decision maker prefers some events to other events and, if so, by how much. The first main assumption is that there exists a partial ordering among events, indicating relative preferences. Then the corresponding problem is to determine which events are preferred to which others. The second main assumption is the expected utility hypothesis. This posits that if we can assign a numerical utility to each event, such that events with larger utilities are preferred, then the decision maker’s preferred choice from a set of possible gambles will be the gamble with the highest expected utility. The corresponding problem is to determine the numerical utilities for a given decision maker.
Preference elicitation is also of relevance to cognitive science and behavioural psychology, where a proper elicitation procedure may allow one to reach more robust experimental conclusions. There are also direct practical applications, such as determining customer preferences. Finally, by analysing the apparent preferences of an expert while performing a particular task, we may be able to discover behaviours that match or even surpass the performance of the expert in the very same task.
This paper uses the formal setting of preference elicitation to determine the preferences of an agent acting within a discrete-time stochastic environment. We assume that the agent obtains a sequence of rewards (hidden from us) from the environment and that its preferences have a functional form related to the rewards. We also suppose that the agent is acting nearly optimally (in a manner to be made more rigorous later) with respect to its preferences. Armed with this information, and observations from the agent’s interaction with the environment, we can determine the agent’s preferences and policy in a Bayesian framework. This allows us to generalise previous Bayesian approaches to inverse reinforcement learning.
In order to do so, we define a structured prior on reward functions and policies. We then derive two different Markov chain procedures for preference elicitation. The result of the inference is used to obtain policies that are significantly improved with respect to the true preferences of the observed agent.
Numerous other inverse reinforcement learning approaches exist [1, 12, 13]. Our main contribution is a clear Bayesian formulation of inverse reinforcement learning as preference elicitation, with a structured prior on the agent’s utilities and policies. This generalises the approach of Ramachandran and Amir [12] and paves the way to principled procedures for determining distributions on reward functions, policies and reward sequences. Performance-wise, we show that the policies obtained through our methodology easily surpass the agent’s actual policy with respect to its own utility. Furthermore, we obtain policies that are significantly better than those obtained by the other inverse reinforcement learning methods that we compare against.
Finally, the relation to experimental design for preference elicitation (see [2] for example) must be pointed out. Although this is a very interesting planning problem, in this paper we do not deal with active preference elicitation. We focus on the sub-problem of estimating preferences given a particular observed behaviour in a given environment and use decision-theoretic formalisms to derive efficient procedures for inverse reinforcement learning.
This paper is organised as follows. The next section formalises the preference elicitation setting and relates it to inverse reinforcement learning. Section 3 presents the abstract statistical model used for estimating the agent’s preferences. Section 4 describes a model and inference procedure for joint estimation of the agent’s preferences and its policy. Section 5 discusses related work in more detail. Section 6 presents comparative experiments, which quantitatively examine the quality of the solutions in terms of both preference elicitation and the estimation of improved policies. Section 7 concludes with a view to further extensions.
2 Formalisation of the problem
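We separate the agent’s preferences (which are unknown to us) from the environment’s dynamics (which we consider known). More specifically, the environment is a controlled Markov process $\nu$, with state space $\mathcal{S}$, action space $\mathcal{A}$, and transition kernel $\mathcal{T} = \{\tau(\cdot \mid s, a) : s \in \mathcal{S}, a \in \mathcal{A}\}$, indexed in $\mathcal{S} \times \mathcal{A}$ such that $\tau(\cdot \mid s, a)$ is a probability measure on $\mathcal{S}$. (We assume the measurability of all sets with respect to some appropriate $\sigma$-algebra.) The dynamics of the environment are Markovian: if at time $t$ the environment is in state $s_t \in \mathcal{S}$ and the agent performs action $a_t \in \mathcal{A}$, then the next state $s_{t+1}$ is drawn with a probability independent of previous states and actions:
$$P_\nu(s_{t+1} \in S \mid s^t, a^t) = \tau(S \mid s_t, a_t), \qquad S \subset \mathcal{S}, \tag{2.1}$$
where we use the convention $s^t \triangleq s_1, \ldots, s_t$ and $a^t \triangleq a_1, \ldots, a_t$ to represent sequences of variables.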
In our setting, we have observed the agent acting in the environment and obtained a sequence of actions and a sequence of states:
$$a^T \triangleq a_1, \ldots, a_T, \qquad s^T \triangleq s_1, \ldots, s_T.$$
The agent has an unknown utility function, $U$, according to which it selects actions, and which we wish to discover. Here, we assume that $U$ has a structure corresponding to that of discounted-reward, infinite-horizon reinforcement learning problems and that the agent tries to maximise the expected utility.
Assumption 1.
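The agent’s utility at time $t$ is defined in terms of future rewards from time $t$:
$$U_t \triangleq \sum_{k=t}^{\infty} \gamma^{k-t} r_k, \tag{2.2}$$
where $\gamma \in [0, 1]$ is a discount factor, and the reward $r_t$ is given by the reward function $\rho : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, so that $r_t \triangleq \rho(s_t, a_t)$.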
In our framework, this is only one of the many possible assumptions regarding the form of the utility function. This choice establishes correspondence with the standard reinforcement learning setting. However, unlike other inverse reinforcement learning approaches, ours is applicable to arbitrary functional forms of the utility.
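As a concrete illustration of Assumption 1, the following minimal Python sketch computes a truncated version of the discounted utility from a finite list of rewards. It assumes that discounting is measured from time $t$ (weights $\gamma^{k-t}$) and that time indices start at zero; the function name `discounted_utility` is our own illustrative choice and is not part of the original formulation.

```python
def discounted_utility(rewards, gamma, t=0):
    """Truncated discounted utility: sum over k >= t of gamma^(k - t) * r_k.

    `rewards` is a finite list r_0, r_1, ..., so the infinite sum of
    Assumption 1 is cut off at the end of the list (an approximation).
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return sum(gamma ** (k - t) * r for k, r in enumerate(rewards) if k >= t)


# Example: three unit rewards with discount factor 0.5.
print(discounted_utility([1.0, 1.0, 1.0], 0.5))       # 1 + 0.5 + 0.25
print(discounted_utility([1.0, 1.0, 1.0], 0.5, t=1))  # 1 + 0.5
```

Other functional forms of the utility could be substituted by replacing the body of this function, which is the flexibility the framework aims to retain.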
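The controlled Markov process and the utility define a Markov decision process [10] (MDP), denoted by $\mu$. The agent uses some policy $\pi$ to select actions with distribution $\pi(a_t \mid s_t)$, which together with the Markov decision process $\mu$ defines a Markov chain on the sequence of states, such that:
$$P_{\mu,\pi}(s_{t+1} \in S \mid s_t) = \int_{\mathcal{A}} \tau(S \mid s_t, a)\, \mathrm{d}\pi(a \mid s_t), \tag{2.3}$$
where we use a subscript to denote that the probability is taken with respect to the process defined jointly by $\mu, \pi$. We shall use this notational convention throughout this paper. Similarly, the expected utility of a policy $\pi$ is denoted by $\mathbb{E}_{\mu,\pi} U_t$. We also introduce the family of value functions $\{Q^\pi_\mu : \mu \in \mathcal{M}, \pi \in \mathcal{P}\}$, where $\mathcal{M}$ is a set of MDPs, with $Q^\pi_\mu : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ such that:
$$Q^\pi_\mu(s, a) \triangleq \mathbb{E}_{\mu,\pi}(U_t \mid s_t = s, a_t = a). \tag{2.4}$$
Finally, we use $Q^*_\mu$ to denote the optimal value function for an MDP $\mu$, such that:
$$Q^*_\mu(s, a) = \sup_{\pi \in \mathcal{P}} Q^\pi_\mu(s, a) \qquad \forall s \in \mathcal{S},\ a \in \mathcal{A}. \tag{2.5}$$
With a slight abuse of notation, we shall use $Q_\rho$ when we only need to distinguish between different reward functions $\rho$, as long as the remaining components of $\mu$ remain fixed.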
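For a finite MDP with a known transition kernel, the optimal state-action value function can be approximated by standard value iteration. The sketch below is only one possible way to do this and is not taken from the original work: it assumes finite state and action sets, a discount factor strictly below one, and nested-list inputs `P[s][a][s2]` (transition probabilities) and `R[s][a]` (expected rewards), all of which are our own conventions.

```python
def q_value_iteration(P, R, gamma, tol=1e-10, max_iter=100_000):
    """Approximate the optimal state-action values of a finite MDP.

    P[s][a][s2] is the probability of moving from s to s2 under action a,
    R[s][a] is the expected reward, and gamma < 1 is the discount factor.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("this sketch requires 0 <= gamma < 1")
    n_states, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(max_iter):
        # Greedy state values under the current estimate.
        V = [max(row) for row in Q]
        new_Q = [[R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
                  for a in range(n_actions)]
                 for s in range(n_states)]
        delta = max(abs(new_Q[s][a] - Q[s][a])
                    for s in range(n_states) for a in range(n_actions))
        Q = new_Q
        if delta < tol:
            break
    return Q


# Two-state example: state 1 is absorbing and pays reward 1 per step;
# in state 0, action 0 stays put and action 1 moves to state 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
R = [[0.0, 0.0],
     [1.0, 1.0]]
Q_star = q_value_iteration(P, R, gamma=0.5)
```

In this example the optimal values can be checked by hand: staying in state 1 is worth $1/(1-0.5) = 2$, moving there from state 0 is worth $0.5 \cdot 2 = 1$, and staying in state 0 for one step is worth $0.5 \cdot 1 = 0.5$.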
Loosely speaking, our problem is to estimate the reward function and discount factor that the agent uses, given the observations and some prior beliefs. As shall be seen in the sequel, this task is easier with additional assumptions on the structural form of the policy. We derive two sampling algorithms. The first estimates a joint posterior distribution on the policy and reward function, while the second also estimates a distribution on the sequence of rewards that the agent obtains. We then show how to use those estimates in order to obtain a policy that can perform significantly better than the agent’s original policy with respect to the agent’s true preferences.
3 The statistical model
In the simplest version of the problem, we assume that the discount factor $\gamma$ is known and we only estimate the reward function, given some prior over reward functions and policies. This assumption can be easily relaxed, however, via an additional prior on the discount factor.
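Let $\mathcal{R}$ be a space of reward functions $\rho$ and $\mathcal{P}$ be a space of policies $\pi$. We define a (prior) probability measure $\xi$ on $\mathcal{R}$ such that for any $B \subset \mathcal{R}$, $\xi(B)$ corresponds to our prior belief that the reward function is in $B$. Finally, for any reward function $\rho \in \mathcal{R}$, we define a conditional probability measure $\psi(\cdot \mid \rho)$ on the space of policies $\mathcal{P}$. Let $\rho_a, \pi_a$ denote the agent’s true reward function and policy respectively. Then our model is:
$$\rho_a \sim \xi(\cdot), \qquad \pi_a \mid \rho_a = \rho \sim \psi(\cdot \mid \rho), \tag{3.1}$$
while the joint prior on reward functions and policies is denoted by:
$$\phi(P, R) \triangleq \int_{R} \psi(P \mid \rho)\, \mathrm{d}\xi(\rho), \qquad P \subset \mathcal{P},\ R \subset \mathcal{R}, \tag{3.2}$$
such that $\phi$ is a probability measure on $\mathcal{R} \times \mathcal{P}$.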
For the moment we shall leave the exact functional form of the prior on the reward functions and the conditional prior on the policy unspecified. Nevertheless, the structure allows us to state the following:
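Lemma 1.
For the joint prior specified above and a sequence of observed states and actions $s^T, a^T$, the posterior measure on reward functions is:
$$\xi(B \mid s^T, a^T) = \frac{\int_B \int_{\mathcal{P}} \pi(a^T \mid s^T)\, \mathrm{d}\psi(\pi \mid \rho)\, \mathrm{d}\xi(\rho)}{\int_{\mathcal{R}} \int_{\mathcal{P}} \pi(a^T \mid s^T)\, \mathrm{d}\psi(\pi \mid \rho)\, \mathrm{d}\xi(\rho)}, \tag{3.3}$$
where $\pi(a^T \mid s^T) \triangleq \prod_{t=1}^{T} \pi(a_t \mid s_t)$.
Proof.
Conditioning on the observations $s^T, a^T$ via Bayes’ theorem, we obtain the conditional measure:
$$\xi(B \mid s^T, a^T) = \frac{\int_B \psi(s^T, a^T \mid \rho)\, \mathrm{d}\xi(\rho)}{\phi(s^T, a^T)}, \tag{3.4}$$
where $\psi(s^T, a^T \mid \rho) \triangleq \int_{\mathcal{P}} P_{\nu,\pi}(s^T, a^T)\, \mathrm{d}\psi(\pi \mid \rho)$ and $\phi(s^T, a^T)$ is a marginal likelihood term:
$$\phi(s^T, a^T) \triangleq \int_{\mathcal{R}} \int_{\mathcal{P}} P_{\nu,\pi}(s^T, a^T)\, \mathrm{d}\psi(\pi \mid \rho)\, \mathrm{d}\xi(\rho).$$
It is easy to see via induction that:
$$P_{\nu,\pi}(s^T, a^T) = \mu_0(s_1) \prod_{t=1}^{T} \pi(a_t \mid s_t) \prod_{t=1}^{T-1} \tau(s_{t+1} \mid s_t, a_t), \tag{3.5}$$
where $\mu_0$ is the initial state distribution. Thus, the reward function posterior is proportional to:
$$\int_B \int_{\mathcal{P}} \mu_0(s_1) \prod_{t=1}^{T} \pi(a_t \mid s_t) \prod_{t=1}^{T-1} \tau(s_{t+1} \mid s_t, a_t)\, \mathrm{d}\psi(\pi \mid \rho)\, \mathrm{d}\xi(\rho).$$
Note that the $\mu_0(s_1)$ and $\tau(s_{t+1} \mid s_t, a_t)$ terms can be taken out of the integral. Since they also appear in the denominator, the state transition terms cancel out. ∎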
4 Estimation
While it is entirely possible to assume that the agent’s policy is optimal with respect to its utility (as is done for example in [1]), our analysis can be made more interesting by assuming otherwise. One simple idea is to restrict the policy space to stationary softmax policies:
$$\pi_\eta(a_t \mid s_t) = \frac{\exp\big(\eta\, Q^*_\mu(s_t, a_t)\big)}{\sum_{a \in \mathcal{A}} \exp\big(\eta\, Q^*_\mu(s_t, a)\big)}, \tag{4.1}$$
where we assumed a finite action set for simplicity. Then we can define a prior on policies, given a reward function, by specifying a prior on the inverse temperature $\eta$, such that given the reward function and $\eta$, the policy is uniquely determined.
However, our framework’s generality allows any functional form relating the agent’s preferences and policies. As an example, we could define a prior distribution over the optimality of the chosen policy, without limiting ourselves to softmax forms. This would of course change the details of the estimation procedure.
For the chosen prior (4.1), inference can be performed using standard Markov chain Monte Carlo (MCMC) methods [3]. If we can estimate the reward function well enough, we may be able to obtain policies that surpass the performance of the original policy with respect to the agent’s reward function.
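To make the softmax policy class concrete, the sketch below computes the action probabilities in a single state from a row of state-action values and an inverse temperature. The function name `softmax_policy` and the list-based representation are illustrative assumptions; subtracting the maximum value before exponentiating is a standard numerical safeguard that leaves the probabilities unchanged.

```python
import math


def softmax_policy(q_row, eta):
    """Action probabilities proportional to exp(eta * Q(s, a)) in one state.

    q_row holds the values Q(s, a) for every action a in a fixed state s,
    and eta >= 0 is the inverse temperature.
    """
    m = max(q_row)
    weights = [math.exp(eta * (q - m)) for q in q_row]  # stable exponentials
    z = sum(weights)
    return [w / z for w in weights]


print(softmax_policy([1.0, 2.0], 0.0))   # eta = 0 gives the uniform policy
print(softmax_policy([0.0, 1.0], 50.0))  # large eta approaches the greedy policy
```

As the inverse temperature grows, the policy concentrates on the actions with the highest value, so a prior on this single parameter expresses a belief about how close to optimal the demonstrator is.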
4.1 A Metropolis-Hastings procedure
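Recall that a Metropolis-Hastings (MH) algorithm for sampling from some distribution with density $f(x)$ using a proposal distribution with conditional density $g(\tilde{x} \mid x)$, has the form:
$$x^{(k+1)} = \begin{cases} \tilde{x}, & \text{w.p. } \min\left\{1, \dfrac{f(\tilde{x})\, g(x^{(k)} \mid \tilde{x})}{f(x^{(k)})\, g(\tilde{x} \mid x^{(k)})}\right\}, \\ x^{(k)}, & \text{otherwise.} \end{cases}$$
In our case, $x = (\rho, \eta)$ and $f(x) = \phi(\rho, \eta \mid s^T, a^T)$. (Here we abuse notation, using $\phi(\rho, \eta \mid \cdot)$ to denote the density or probability function with respect to a Lebesgue or counting measure associated with the probability measure $\phi(B \mid \cdot)$ on subsets of $\mathcal{R} \times \mathcal{P}$.) We use independent proposals $g(\rho, \eta) = \phi(\rho, \eta)$. Since $\phi(\rho, \eta \mid s^T, a^T) = \phi(s^T, a^T \mid \rho, \eta)\, \phi(\rho, \eta) / \phi(s^T, a^T)$, it follows that:
$$\frac{f(\tilde{x})\, g(x^{(k)})}{f(x^{(k)})\, g(\tilde{x})} = \frac{\tilde{\pi}(a^T \mid s^T)}{\pi^{(k)}(a^T \mid s^T)},$$
where $\tilde{\pi}$ and $\pi^{(k)}$ are the softmax policies determined by the proposed and the current sample respectively.
This gives rise to the sampling procedure described in Alg. 1, which uses a gamma prior for the inverse temperature.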
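The following Python sketch illustrates an independence Metropolis-Hastings sampler of this kind on a small finite MDP. It is a simplified illustration under stated assumptions rather than a reproduction of Alg. 1: proposals are drawn from the prior, so the acceptance ratio reduces to a ratio of action likelihoods; the reward prior is an independent Beta distribution for every state-action pair; the inverse temperature has a Gamma prior; and all hyperparameter defaults, helper names and the fixed number of value-iteration sweeps are our own choices.

```python
import math
import random


def solve_q(P, rho, gamma, iters=200):
    """Approximate optimal Q-values by a fixed number of value-iteration sweeps."""
    n_s, n_a = len(P), len(P[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        V = [max(row) for row in Q]
        Q = [[rho[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
              for a in range(n_a)] for s in range(n_s)]
    return Q


def log_likelihood(Q, eta, states, actions):
    """Log-probability of the observed actions under the softmax policy."""
    total = 0.0
    for s, a in zip(states, actions):
        m = max(Q[s])
        log_z = eta * m + math.log(sum(math.exp(eta * (q - m)) for q in Q[s]))
        total += eta * Q[s][a] - log_z
    return total


def mh_sampler(P, gamma, states, actions, n_samples, rng,
               beta_a=1.0, beta_b=1.0, gamma_shape=1.0, gamma_scale=1.0):
    """Independence MH over (reward function, inverse temperature)."""
    n_s, n_a = len(P), len(P[0])

    def draw_from_prior():
        rho = [[rng.betavariate(beta_a, beta_b) for _ in range(n_a)]
               for _ in range(n_s)]
        eta = rng.gammavariate(gamma_shape, gamma_scale)
        return rho, eta

    rho, eta = draw_from_prior()
    log_lik = log_likelihood(solve_q(P, rho, gamma), eta, states, actions)
    samples = []
    for _ in range(n_samples):
        rho_new, eta_new = draw_from_prior()
        log_lik_new = log_likelihood(solve_q(P, rho_new, gamma), eta_new,
                                     states, actions)
        # Proposals come from the prior, so prior and proposal terms cancel
        # and only the likelihood ratio remains in the acceptance test.
        if math.log(1.0 - rng.random()) < log_lik_new - log_lik:
            rho, eta, log_lik = rho_new, eta_new, log_lik_new
        samples.append((rho, eta))
    return samples


# Toy check: one state, two actions, and a demonstrator that always picks action 1.
P_toy = [[[1.0], [1.0]]]
samples = mh_sampler(P_toy, 0.5, [0] * 20, [1] * 20, 1000, random.Random(0))
mean_gap = sum(r[0][1] - r[0][0] for r, _ in samples) / len(samples)
```

In the toy check, the posterior should place most of its mass on reward functions that favour the demonstrated action, so `mean_gap` is expected to be positive. Sampling proposals from the prior keeps the acceptance rule simple, at the cost of low acceptance rates when the posterior is much more concentrated than the prior.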
4.2 A hybrid Gibbs procedure
The second alternative is a two-stage hybrid Gibbs sampler, described in Alg. 2. The main interest of this procedure is that it conditions alternately on a reward sequence sample and on a reward function sample at the $k$-th iteration of the chain. Thus, we also obtain a posterior distribution on reward sequences.
This sampler is of particular utility when the reward function prior is conjugate to the reward distribution, in which case: (i) the reward sequence sample can be easily obtained and (ii) the reward function prior can be conditioned on the reward sequence with a simple sufficient statistic. While sampling from the reward function posterior continues to require MH, the resulting hybrid Gibbs sampler remains a valid procedure [3], which may give better results than specifying arbitrary proposals for pure MH sampling.
As previously mentioned, the Gibbs procedure also results in a distribution over the reward sequences observed by the agent. On the one hand, this could be valuable in applications where the reward sequence is the main quantity of interest. On the other hand, this has the disadvantage of making a strong assumption about the distribution from which rewards are drawn.
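The conjugate case can be illustrated with Bernoulli rewards and a Beta prior for every state-action pair, which is the reward model used in the experiments of Section 6.1. The sketch below shows only the conjugate conditioning step, in which the sufficient statistic is the number of rewarded and unrewarded visits to each pair; it is not the full hybrid sampler of Alg. 2, and the function name and default hyperparameters are illustrative assumptions.

```python
def beta_posterior(states, actions, rewards, n_states, n_actions,
                   alpha=1.0, beta=1.0):
    """Condition a product-Beta prior on a sampled Bernoulli reward sequence.

    Returns post[s][a] = [alpha', beta'], the Beta parameters after counting
    rewarded (r = 1) and unrewarded (r = 0) visits to each state-action pair.
    """
    post = [[[alpha, beta] for _ in range(n_actions)] for _ in range(n_states)]
    for s, a, r in zip(states, actions, rewards):
        post[s][a][0] += r        # successes
        post[s][a][1] += 1 - r    # failures
    return post


# Pair (0, 0) is visited twice (one reward), pair (1, 1) once (rewarded).
post = beta_posterior([0, 0, 1], [0, 0, 1], [1, 0, 1], n_states=2, n_actions=2)
```

A reward function sample for the next stage of the chain could then be proposed by drawing each Bernoulli parameter from its updated Beta distribution.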
5 Related work
5.1 Linear programming
One interesting solution proposed by [8] is to use a linear program in order to find a reward function that maximises the gap between the best and second best action. Although elegant, this approach suffers from some drawbacks.
(a) A good estimate of the optimal policy must be given. This may be hard in cases where the demonstrating agent does not visit all of the states frequently. (b) In some pathological MDPs, there is no such gap. For example, it could be that for any action $a$, there exists some other action $a'$ with equal value in every state.
5.2 Policy walk
Our framework can be seen as a generalisation of the Bayesian approach considered in [12], which does not employ a structured prior on the rewards and policies. In fact, they implicitly define the joint posterior over rewards and policies as:
which implies that the exponential term corresponds to . This ad hoc choice is probably the weakest point in this approach. Although, as mentioned in [12], such a choice could be justifiable through a maximum entropy argument, we note that the maximum-entropy based approach reported in [14] does not employ the value function in that way.
Rearranging, we write the denominator as:
(5.1) 
which is still not computable, but we can employ a Metropolis-Hastings step using as a proposal distribution, and an acceptance probability of:
We note that in [12], the authors employ a different sampling procedure than a straightforward MH, called a policy grid walk. In exploratory experiments, where we examined the performance of the authors’ original method [11], we have determined that MH is sufficient and that the most crucial factor for this particular method was its initialisation.
5.3 The maximum entropy approach
A maximum entropy approach is reported in [14]. Given a feature function , and a set of trajectories , they obtain features . They show that given empirical constraints , where is the empirical feature expectation, one can obtain a maximum entropy distribution for actions of the form . If is the identity, then can be seen as a scaled state-action value function.
In general, maximum entropy approaches have good minimax guarantees [7]. Consequently, the estimated policy is guaranteed to be close to the agent’s. However, at best, by bounding the error in the policy, one obtains a two-sided high-probability bound on the relative loss. Thus, one is almost certain to perform neither much better nor much worse than the demonstrator.
5.4 Game theoretic approach
An interesting game theoretic approach was suggested by [13] for apprenticeship learning. This also only requires statistics of observed features, similarly to the maximum entropy approach. The main idea is to find the solution to a game matrix with a number of rows equal to the number of possible policies, which, although large, can be solved efficiently by an exponential weighting algorithm. The method is particularly notable for being (as far as we are aware) the only one with a high-probability upper bound on the loss relative to the demonstrating agent and no corresponding lower bound. Thus, this method may in principle lead to a significant improvement over the demonstrator. Unfortunately, as far as we are aware, sufficient conditions for this to occur are not known at the moment.
6 Experiments
6.1 Domains
We compare the proposed algorithms on two different domains, namely on random MDPs and random maze tasks. The Random MDP task is a discrete-state MDP, with four actions, such that each leads to a different, but possibly overlapping, quarter of the state set. (The transition matrix of the MDPs was chosen so that the MDP was communicating (cf. [10]) and so that each individual action from any state results in a transition to approximately a quarter of all available states, with the arrival probabilities of the destination states being uniformly selected and those of the non-destination states being set to zero.)
The reward function is drawn from a Beta-product hyperprior with parameters $\alpha_i$ and $\beta_i$, where the index $i$ is over all state-action pairs. This defines a distribution over the parameters of the Bernoulli distribution determining the probability of the agent obtaining a reward when carrying out an action $a$ in a particular state $s$.
For the Random Maze tasks we constructed planar grid mazes of different sizes, with four actions at each state, in which the agent has a probability of to succeed with the current action and is otherwise moved to one of the adjacent states randomly. These mazes are also randomly generated, with the reward function being drawn from the same prior. The maze structure is sampled by randomly filling a grid with walls through a product-Bernoulli distribution with parameter , and then rejecting any mazes with a number of obstacles higher than .
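A transition kernel of the kind described for the Random MDP task could be generated along the following lines. This is a rough sketch under our own assumptions and not the generator used in the experiments: each state-action pair receives a uniformly drawn set of destination states containing a quarter of the state set (for four actions), the arrival probabilities are uniform random weights normalised to one, and the communicating property is not enforced or checked.

```python
import random


def random_mdp_transitions(n_states, n_actions=4, rng=None):
    """Random transition kernel P[s][a][s2] with sparse destination sets.

    Every (s, a) pair can reach about n_states / n_actions states; all other
    arrival probabilities are exactly zero.
    """
    rng = rng or random.Random()
    k = max(1, n_states // n_actions)  # size of each destination set
    P = []
    for _ in range(n_states):
        per_action = []
        for _ in range(n_actions):
            destinations = rng.sample(range(n_states), k)
            # 1 - random() lies in (0, 1], so every weight is strictly positive.
            weights = [1.0 - rng.random() for _ in destinations]
            total = sum(weights)
            row = [0.0] * n_states
            for d, w in zip(destinations, weights):
                row[d] = w / total
            per_action.append(row)
        P.append(per_action)
    return P


P_random = random_mdp_transitions(8, rng=random.Random(0))
```

A generator intended to match the experimental setup would additionally need to reject or repair kernels that are not communicating.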
6.2 Algorithms
We compared our methodology, using the pure MH and the hybrid Gibbs sampler, to three previous approaches: the linear programming based approach [8], the game-theoretic approach [13] and, finally, the Bayesian inverse reinforcement learning method suggested in [12]. In all cases, each demonstration was a $T$-long trajectory $s^T, a^T$, provided by a demonstrator employing a softmax policy with respect to the optimal value function.
All algorithms have some parameters that must be selected. Since our methodology employs MCMC, the sampling parameters must be chosen so that convergence is ensured. We found that samples from the chain were sufficient, for both the MH and hybrid Gibbs samplers, with steps used as burn-in, for both tasks.
We also compared our procedure with the Bayesian inverse reinforcement learning algorithm of Ramachandran and Amir [12]. For the latter, we used an MH sampler seeded with the solution found by [8], as suggested by [11] and by our own preliminary experiments. We also verified that the same number of samples used in our case was sufficient for this method.
The linear-programming based inverse reinforcement learning algorithm by Ng and Russell [8] requires the actual agent policy as input. For the Random MDP domain, we used the maximum likelihood estimate from our observations. For the maze domain, we used a Laplace-smoothed estimate (i.e. a product-Dirichlet prior with parameters equal to 1) instead, since this was more stable.
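Both policy estimates can be obtained from simple visit counts. The sketch below is an illustration with hypothetical names: a `smoothing` value of 1 gives the Laplace-smoothed estimate (the posterior mean under a product-Dirichlet prior with all parameters equal to 1), while a value of 0 gives the maximum likelihood estimate. Falling back to a uniform distribution for unvisited states in the maximum likelihood case is our own choice.

```python
def estimate_policy(states, actions, n_states, n_actions, smoothing=1.0):
    """Estimate pi(a | s) from observed state-action pairs.

    smoothing = 1.0 is the Laplace-smoothed estimate; smoothing = 0.0 is the
    maximum likelihood estimate (uniform in states that were never visited).
    """
    counts = [[float(smoothing)] * n_actions for _ in range(n_states)]
    for s, a in zip(states, actions):
        counts[s][a] += 1.0
    policy = []
    for row in counts:
        total = sum(row)
        if total > 0.0:
            policy.append([c / total for c in row])
        else:
            policy.append([1.0 / n_actions] * n_actions)
    return policy


# State 0 is visited three times (action 1 twice); state 1 is never visited.
laplace = estimate_policy([0, 0, 0], [1, 1, 0], n_states=2, n_actions=2)
ml = estimate_policy([0, 0, 0], [1, 1, 0], n_states=2, n_actions=2, smoothing=0.0)
```

The smoothed estimate never assigns zero probability to an action, which is one plausible reason for it being more stable when many states are rarely visited.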
Finally, we examined the MWAL algorithm of Syed and Schapire [13]. This requires the cumulative discounted feature expectation as input, for appropriately defined features. Since we dealt with discrete environments, we used the state occupancy as a feature. Although the feature expectations can be calculated empirically, we obtained better performance in practice with the following procedure: we first computed the transition probabilities of the Markov chain induced by the maximum likelihood (or Laplace-smoothed) policy given the observed state-action sequences of the agent and the MDP transitions. Then we calculated the expectation of these features given this chain. We set all accuracy parameters of this algorithm to , which was sufficient for robust behaviour.
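The two-step procedure above can be sketched as follows for state-occupancy features. The code first builds the Markov chain induced by an estimated policy and the known transitions, and then accumulates the discounted state occupancy by propagating the state distribution forward in time. The function name, the explicit initial state distribution `mu0` and the truncation rule are our own assumptions, and a discount factor strictly below one is required.

```python
def discounted_occupancy(P, policy, mu0, gamma, tol=1e-12, max_iter=100_000):
    """Discounted state occupancy: sum over t of gamma^t * Pr(s_t = s).

    P[s][a][s2] are the MDP transitions, policy[s][a] the estimated action
    probabilities, and mu0[s] the initial state distribution.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("this sketch requires 0 <= gamma < 1")
    n_s, n_a = len(P), len(P[0])
    # Transition matrix of the Markov chain induced by the policy.
    chain = [[sum(policy[s][a] * P[s][a][s2] for a in range(n_a))
              for s2 in range(n_s)] for s in range(n_s)]
    occupancy = [0.0] * n_s
    dist, weight = list(mu0), 1.0
    for _ in range(max_iter):
        for s in range(n_s):
            occupancy[s] += weight * dist[s]
        dist = [sum(dist[s] * chain[s][s2] for s in range(n_s))
                for s2 in range(n_s)]
        weight *= gamma
        if weight < tol:  # remaining mass is below the tolerance
            break
    return occupancy


# Deterministic chain 0 -> 1 -> 1 -> ... started in state 0.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
policy = [[0.0, 1.0], [1.0, 0.0]]
occ = discounted_occupancy(P, policy, mu0=[1.0, 0.0], gamma=0.5)
```

In this example state 0 is occupied only at the first step, so its discounted occupancy is 1, while state 1 accumulates $0.5 + 0.25 + \ldots = 1$.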
6.3 Performance measure
In order to measure performance, we plot the loss of the value function of each policy relative to the optimal policy with respect to the agent’s utility:
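$$\ell(\pi) \triangleq \sum_{s \in \mathcal{S}} \big[ V^*_\mu(s) - V^\pi_\mu(s) \big], \tag{6.1}$$
where $V^*_\mu(s) \triangleq \max_a Q^*_\mu(s, a)$ and $V^\pi_\mu(s) \triangleq \mathbb{E}_\pi Q^\pi_\mu(s, a)$.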
In all cases, we average over experiments on an equal number of randomly generated environments. For the $k$-th experiment, we generate a $T$-step-long demonstration via an agent employing a softmax policy. The same demonstration is used across all methods to reduce variance.
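A loss of this kind can be computed for a finite MDP as sketched below, assuming that it is defined as the sum over states of the gap between the optimal state value and the value of the evaluated policy under the agent’s true reward function. The helper names, the nested-list inputs and the fixed number of dynamic-programming sweeps are illustrative assumptions.

```python
def optimal_values(P, R, gamma, iters=1000):
    """State values of the optimal policy, by value iteration."""
    n_s, n_a = len(P), len(P[0])
    V = [0.0] * n_s
    for _ in range(iters):
        V = [max(R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
                 for a in range(n_a)) for s in range(n_s)]
    return V


def policy_values(P, R, policy, gamma, iters=1000):
    """State values of a fixed stochastic policy, by iterative evaluation."""
    n_s, n_a = len(P), len(P[0])
    V = [0.0] * n_s
    for _ in range(iters):
        V = [sum(policy[s][a]
                 * (R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V)))
                 for a in range(n_a)) for s in range(n_s)]
    return V


def relative_loss(P, R, policy, gamma, iters=1000):
    """Sum over states of the optimal value minus the policy's value."""
    v_star = optimal_values(P, R, gamma, iters)
    v_pi = policy_values(P, R, policy, gamma, iters)
    return sum(vs - vp for vs, vp in zip(v_star, v_pi))


# Two-state example: state 1 is absorbing and pays 1 per step.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
R = [[0.0, 0.0], [1.0, 1.0]]
greedy = [[0.0, 1.0], [1.0, 0.0]]   # moves to the rewarding state
lazy = [[1.0, 0.0], [1.0, 0.0]]     # never leaves state 0
loss_greedy = relative_loss(P, R, greedy, gamma=0.5)
loss_lazy = relative_loss(P, R, lazy, gamma=0.5)
```

Here the greedy policy is optimal and incurs zero loss, whereas the lazy policy forgoes a value of 1 in state 0.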
6.4 Results
We consider the loss of six different policies, averaged over runs. The first, soft, is the policy of the demonstrating agent itself. The second, MH, is the Metropolis-Hastings procedure defined in Alg. 1, while GMH is the hybrid Gibbs procedure from Alg. 2. Finally, Ng & Russell, Ramachandran & Amir, and Syed & Schapire are our implementations of the methods described in the papers by the respective authors, summarised in Sec. 5.
We first examined the loss of greedy policies, derived from the estimated reward function, as the demonstrating agent becomes greedier. (Experiments with non-greedy policies, not shown, produced generally worse results.) Figure 1 shows results for the two different domains. It is easy to see that the MH sampler significantly outperforms the demonstrator, even when the latter is nearly greedy. While the hybrid Gibbs sampler’s performance lies between that of the demonstrator and the MH sampler, it also estimates a distribution over reward sequences as a side-effect. Thus, it could be of further value where estimation of reward sequences is important. We observed that the performance of the baseline methods is generally inferior, though the Syed & Schapire algorithm nevertheless tracks the demonstrator’s performance closely.
This suboptimal performance of the baseline methods in the Random MDP setting cannot be attributed to poor estimation of the demonstrated policy, as can clearly be seen in Figure 2(a), which shows the loss of the greedy policy derived from each method as the amount of data increases. While the proposed samplers improve significantly as observations accumulate, this effect is smaller in the baseline methods we compared against. As a final test, we plot the relative loss in the Random MDP as the number of states increases in Figure 2(b). We can see that the relative performance of methods is invariant to the size of the state space for this problem.
Overall, we observed that our suggested model consistently outperforms the agent in all settings when the MH sampler is used, while the Gibbs sampler only manages to match the behaviour approximately. Presumably, this is due to the joint estimation of the reward sequence. Finally, the other methods under consideration on average do not improve upon the initial policy and can be, in a large number of cases, significantly worse. For the linear programming inverse RL method, perhaps this can be attributed to implicit assumptions about the MDP and the optimality of the given policy. For the policy walk inverse RL method, our belief is that its suboptimal performance is due to the very restrictive and somewhat ad hoc prior it uses. Finally, the performance of the game theoretic approach is slightly disappointing. Although it is much more robust than the other two baseline approaches, it never outperforms the demonstrator, even though technically this is possible. One possible explanation is that since this approach is worst-case by construction, it results in overly conservative policies.
7 Discussion
We introduced a unified framework of preference elicitation and inverse reinforcement learning, presented a statistical model for inference and derived two different sampling procedures for estimation. Our framework is flexible enough to allow plugging in alternative priors on the form of the policy and of the agent’s preferences, although that would require adjusting the sampling procedures. In experiments, we showed that for a particular choice of policy prior, closely corresponding to previous approaches, our samplers can outperform not only other well-known inverse reinforcement learning algorithms, but the demonstrating agent as well.
The simplest extension, which we have already alluded to, is the estimation of the discount factor, for which we have obtained promising results in preliminary experiments. A slightly harder generalisation occurs when the environment is not known to us. This is not due to difficulties in inference, since in many cases a posterior distribution over the environment dynamics is not hard to maintain (see for example [4, 9]). However, computing the optimal policy given a belief over MDPs is harder [4], even if we limit ourselves to stationary policies [6]. We would also like to consider more types of preference and policy priors: first, spatial priors for the reward function, which would be necessary for large or continuous environments; second, alternative priors on the demonstrator’s policy.
The generality of the framework allows us to formulate different preference elicitation problems than those directly tied to reinforcement learning. For example, it is possible to estimate utilities that are not additive functions of some latent rewards. This does not appear to be easily achievable through the extension of other inverse reinforcement learning algorithms. It would be interesting to investigate this in future work.
Finally, although in this paper we have not considered the problem of experimental design for preference elicitation (i.e. active preference elicitation), we believe it is a very interesting direction. First of all, it has many applications, such as the automated optimal design of behavioural experiments, to give but one example. Moreover, an effective preference elicitation procedure such as the one presented in this paper is essential for the complex planning task that experimental design is. Consequently, we hope that researchers in that area will find our methods useful.
References

 Abbeel and Ng [2004] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning (ICML 2004), 2004.
 Boutilier [2002] C. Boutilier. A POMDP formulation of preference elicitation problems. In Proceedings of the National Conference on Artificial Intelligence, pages 239–246. AAAI Press, 2002.
 Casella et al. [1999] George Casella, Stephen Fienberg, and Ingram Olkin, editors. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer, 1999.
 Duff [2002] Michael O’Gordon Duff. Optimal Learning Computational Procedures for Bayesadaptive Markov Decision Processes. PhD thesis, University of Massachusetts at Amherst, 2002.
 Friedman and Savage [1952] Milton Friedman and Leonard J. Savage. The expectedutility hypothesis and the measurability of utility. The Journal of Political Economy, 60(6):463, 1952.
 Furmston and Barber [2010] Thomas Furmston and David Barber. Variational methods for reinforcement learning. In AISTATS, pages 241–248, 2010.
 Grünwald and Dawid [2004] Peter D. Grünwald and A. Philip Dawid. Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Annals of Statistics, 32(4):1367–1433, 2004.
 Ng and Russell [2000] Andrew Y. Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In Proc. 17th International Conf. on Machine Learning, pages 663–670. Morgan Kaufmann, 2000.
 Poupart et al. [2006] P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian reinforcement learning. In ICML 2006, pages 697–704. ACM Press New York, NY, USA, 2006.
 Puterman [2005] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New Jersey, US, 2005.
 Ramachandran [2010] D. Ramachandran, 2010. Personal communication.
 Ramachandran and Amir [2007] D. Ramachandran and E. Amir. Bayesian inverse reinforcement learning. In 20th Int. Joint Conf. Artificial Intelligence, volume 51, page 61801, 2007.
 Syed and Schapire [2008] Umar Syed and Robert E. Schapire. A gametheoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems, volume 10, 2008.
 Ziebart et al. [2010] Brian D. Ziebart, J. Andrew Bagnell, and Anind K. Dey. Modelling interaction via the principle of maximum causal entropy. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Haifa, Israel, 2010.