Active Preference Learning using Maximum Regret

05/08/2020 ∙ by Nils Wilde, et al. ∙ University of Waterloo, Monash University

We study active preference learning as a framework for intuitively specifying the behaviour of autonomous robots. In active preference learning, a user chooses the preferred behaviour from a set of alternatives, from which the robot learns the user's preferences, modeled as a parameterized cost function. Previous approaches present users with alternatives that minimize the uncertainty over the parameters of the cost function. However, different parameters might lead to the same optimal behaviour; as a consequence the solution space is more structured than the parameter space. We exploit this by proposing a query selection that greedily reduces the maximum error ratio over the solution space. In simulations we demonstrate that the proposed approach outperforms other state of the art techniques in both learning efficiency and ease of queries for the user. Finally, we show that evaluating the learning based on the similarities of solutions instead of the similarities of weights allows for better predictions for different scenarios.




I Introduction

Recently, research in human robot interaction (HRI) has focused on the design of frameworks that enable inexperienced users to efficiently deploy robots [1, 2, 3, 4, 5, 6]. Autonomous mobile robots for instance are capable of navigating with little to no human guidance; however, user input is required to ensure their behaviour meets the user’s expectations. For example, in industrial facilities, a robot might need to be instructed about the context and established workflows or safety regulations [7]. An autonomous car should learn which driving style a passenger would find comfortable [8, 9]. Users who are not experts in robotics find it challenging to specify robot behaviour that meets their preferences [3].

Active preference learning offers a methodology for a robot to learn user preferences through interaction [1, 2, 10, 11, 3, 12]. Users are presented with a sequence of alternative behaviours to a specific robotic task and choose their preferred alternative. Figure 1 shows an example of learning user preferences for an autonomous vehicle where alternative behaviours are presented on an interface. Usually, the user is assumed to make their choice based on an internal, hidden cost function. The objective is to learn this cost function such that a robot can optimize its behaviour accordingly. Often, the user cost function is modelled as a weighted sum of predefined features [1]. Hence, learning the cost function is reduced to learning the weights. The key questions in this methodology are (1) how to select a set of possible solutions that are presented to the user such that the cost function can be learned from few queries to the user, and (2) can the user choose reliably between these solutions.

(a) Optimal behaviour.
(b) Learned behaviour.
Fig. 1: Behaviour of an autonomous car (red) in the presence of another vehicle (white). In (a) we show the optimal behaviour for some user. In (b) we show alternative paths presented during active preference learning. Darker shades of red indicate behaviour that was presented later.

In this work we propose a new approach for selecting solutions in active preference learning. In contrast to the work of [1, 13, 2], our approach does not focus on reducing the uncertainty of the belief over the weights; instead, we consider the set of all possible solutions to the task. Different weights in the user cost function might correspond to similar or even equal optimal solutions; in optimization problems this is known as sensitivity [14]. Thus, even if the estimated weights do not equal the true user weights, the corresponding solution might be the same. Therefore, we propose a new measure for active preference learning: the regret of the learned path. The concept of regret is known from robust shortest path problems [15, 16]. Consider two sets of weights for a user cost function, one that is optimal for a user and one that was estimated through active preference learning. The regret of the estimate captures the suboptimality of the solution found using the estimated weights, i.e., the ratio of the cost of the estimated solution, evaluated by the optimal weights, to the cost of the optimal solution, evaluated by the optimal weights.

We use the notion of regret to select alternatives to show to the user. From a set of solutions that are considered equally good for the user given the feedback obtained so far, we choose the pair of solutions for which, if one of them is the optimum, the ratio of their costs is maximised. As the user rejects one of the two alternatives, we remove the most sub-optimal alternative from our solution space. In each iteration, our proposed approach optimizes over the set of all solutions that are consistent with the user feedback obtained so far. It then presents the user with the pair of solutions where the regret is maximized.

Following the motivation for regret, we evaluate the results of active preference learning based on the learned solution instead of the learned weights, using the relative error in the cost of paths as a metric. This mirrors how an actual user would evaluate a robot’s behaviour: users are not interested in what weights are used by a robot’s motion planner. Indeed, one of the main motivations for active preference learning is that users find it challenging to express weights for cost or reward functions. Instead, users judge a robot’s behaviour by how similar it is to what they imagine as an optimal behaviour.

I-A Related Work

The concept of learning a hidden reward function from a user is widely used in various human-robot interaction frameworks, such as learning from demonstrations (LfD) [4, 17], learning from corrections [18, 19] and learning from preferences [1, 17, 13, 12, 3].

Learning from demonstrations or corrections is often based on inverse optimal control or reinforcement learning. The user is modelled to optimize an internal cost function when providing demonstrations or corrections. The objective is then to find a cost function for which the demonstrated behaviour is optimal.

Closely related to our work, the authors of [1, 13] and [2] investigate how active preference learning can be used to shape a robot’s behaviour. Thereby a general robot control problem with continuous states and actions is considered. The user cost function is modelled as weighted sum of features. They show that the robot is able to learn user preferred behaviours from few iterations using active preference learning. In [1] and [13], Dragan and colleagues investigate a measure for selecting a new pair of possible solutions to be shown to the user based on the posterior belief over the set of all weights. In detail, new solutions are selected such that the integral over the unnormalized posterior, called volume, is minimized in expectation. This approach is revised in [2], where a failure case for the volume removal is demonstrated. As an alternative measure, the authors propose the information entropy of the posterior belief over the weights.

In our work we show that both of the above approaches disregard the sensitivity of the underlying motion planning problem: Learning about the weights of the cost function can be inefficient, as different weights can lead to the same optimal behaviour. In our previous work [12] we discretized the weight space into equivalence regions, i.e., sets of weights where the optimal solution of the motion planning problem is the same.

Another concern during active preference learning is to present alternatives to the user that are easy for them to differentiate and lead to a lower error rate. The authors of [20] investigate strategies for active learning that consider the flow of the queries to reduce the mental effort of the user and thus decrease the user’s error rate. Similarly, [2] optimizes for queries that are easy to answer. In our work, we present an active query strategy that features these properties intrinsically: By maximizing the regret of the presented paths, we automatically choose paths that are different with respect to the user cost function and thus are expected to be easily distinguishable for the user.

I-B Contributions

We contribute to the ongoing research in active preference learning as a framework for specifying complex robot behaviours. We propose a measure for evaluating the solution found by preference learning based on the robot’s learned behaviour instead of the learned weights in the cost function. Further, we propose a new active query selection guided by the maximum error ratio between solutions. Thereby, users are presented with the pair of solutions that has the maximum error ratio among all paths in the feasible solution space. We demonstrate the performance of our approach by comparing it to a competing state of the art technique and show that our proposed method learns the desired behaviour more efficiently. Moreover, the queries the user is presented with are easier to answer and thus lead to more reliable user feedback. Finally, we demonstrate how our measure based on solutions gives better predictions about the behaviour of the robot in different scenarios that were not part of the learning.

II Problem Statement

II-A Preliminaries

Let $\mathcal{X}$ be the state space of a robot and the environment it is acting in, and let $x_0 \in \mathcal{X}$ be some start state. Further, we have an action space $\mathcal{U}$ where each action $u \in \mathcal{U}$ potentially affects only parts of the state, i.e., there might be static or dynamic obstacles unaffected by the robot’s actions.

Further, let $P$ be a path of finite length starting at $x_0$. A path is evaluated by a column vector of predefined features $\phi(P)$. Together with a row vector of weights $w$ we define the cost of a path as

$$c(P, w) = w\,\phi(P). \tag{1}$$

Given some $w$, let $P^{w}$ be the optimal path, i.e., $P^{w} = \arg\min_{P} c(P, w)$. We denote this optimal cost for a weight $w$ as

$$c^{*}(w) = c(P^{w}, w). \tag{2}$$

For any other weight $w'$, we call $c(P^{w'}, w)$ the cost of $P^{w'}$ evaluated by $w$.
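The cost model above can be illustrated with a minimal sketch. It assumes a small, finite set of candidate paths whose feature vectors are already known; the names `path_cost`, `optimal_path` and `CANDIDATES`, as well as all feature values, are hypothetical and chosen only for illustration. The continuous planning problem considered in the paper would require a motion planner instead of the enumeration used here.

```python
from typing import Dict, Sequence


def path_cost(weights: Sequence[float], features: Sequence[float]) -> float:
    """Cost of a path: weight vector times feature vector."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * f for w, f in zip(weights, features))


def optimal_path(weights: Sequence[float],
                 candidates: Dict[str, Sequence[float]]) -> str:
    """Name of the candidate path with the lowest cost under `weights`."""
    return min(candidates, key=lambda name: path_cost(weights, candidates[name]))


# Hypothetical feature vectors of three candidate paths (illustrative values).
CANDIDATES = {
    "keep_lane": [0.1, 0.9],
    "overtake": [0.8, 0.2],
    "slow_down": [0.4, 0.5],
}

for w in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]):
    print(w, "->", optimal_path(w, CANDIDATES))
```

In this toy instance the weights `[1.0, 0.0]` and `[0.9, 0.1]` select the same path, which is the kind of sensitivity the paper builds on.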

II-B Problem Formulation

We consider a robot’s state space $\mathcal{X}$ and action space $\mathcal{U}$ and some start state $x_0$. We consider a vector of weights $w^*$, describing a user’s preference for the robot’s behaviour, and the corresponding optimal path $P^{w^*}$. Each element of the weight vector has a lower and an upper bound. However, $w^*$ itself is hidden. We can learn about $w^*$ by presenting the user with pairs of paths over $K$ iterations. The objective is to find an estimated path $P'$ that reflects the user preferences $w^*$, i.e., that is as similar to $P^{w^*}$ as possible.

To evaluate the result of learning, the authors of [1] propose the alignment metric, i.e., the cosine of the angle between the learned weight vector $w'$ and the user’s true weight vector $w^*$. We adapt this metric and transform it to a normalized error between $w'$ and $w^*$, which we call the weight error:


The alignment metric was also used in [13, 2]. However, this metric has two potential shortcomings: 1) It does not consider the sensitivity of the optimization problem that finds an optimal path for a given weight vector. Thus, an error in the learned weight might not actually result in a different optimal path. Conversely, even if the learned weight has a relatively small error, the corresponding path might be suboptimal for the user. 2) The weight error is not suitable as a test error (i.e., to test whether the learned user preferences generalize well to new task instances not encountered during learning) since it does not consider the robot’s resulting behaviour: the weight error is equal for all training and test instances. Hence, the weight error gives no insight into how well the estimated preferences translate into different scenarios, unless the weight error is zero, i.e., the optimal weights are found.

Therefore, we choose a different metric for evaluating the learned behaviour: Instead of the learned weight $w'$ we consider the learned path $P'$. We compare the cost of $P'$, evaluated by the user’s true weights $w^*$, to the optimal cost for $w^*$:


We refer to this quantity as the path error. A similar relative error was already used in [21]. Based on this metric we can now formally pose the learning problem.
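The difference between the two metrics can be made concrete with a small sketch. The function `alignment` computes the cosine of the angle between two weight vectors, as described above. The function `path_error` assumes one plausible form of a relative error, namely the excess cost of the learned path divided by the optimal cost; this form, all names and the feature values are assumptions made for illustration, not the paper's exact definitions.

```python
import math
from typing import Sequence

# Hypothetical feature vectors of three candidate paths (illustrative values).
PATHS = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.5]]


def cost(w: Sequence[float], phi: Sequence[float]) -> float:
    return sum(wi * fi for wi, fi in zip(w, phi))


def best(w: Sequence[float]) -> Sequence[float]:
    """Features of the candidate path that is optimal for `w`."""
    return min(PATHS, key=lambda phi: cost(w, phi))


def alignment(w_est: Sequence[float], w_true: Sequence[float]) -> float:
    """Cosine of the angle between the two weight vectors."""
    dot = sum(a * b for a, b in zip(w_est, w_true))
    norms = math.sqrt(sum(a * a for a in w_est)) * math.sqrt(sum(b * b for b in w_true))
    return dot / norms


def path_error(w_est: Sequence[float], w_true: Sequence[float]) -> float:
    """Assumed relative error: excess cost of the learned path over the optimum."""
    optimal_cost = cost(w_true, best(w_true))
    return (cost(w_true, best(w_est)) - optimal_cost) / optimal_cost


w_true = [1.0, 0.0]
for w_est in ([0.9, 0.1], [0.5, 0.5]):
    print(w_est, round(alignment(w_est, w_true), 3), round(path_error(w_est, w_true), 3))
```

In this toy instance the estimate `[0.9, 0.1]` is not perfectly aligned with the true weight, yet its path error is zero because both weights share the same optimal path.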

Problem 1.

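Given the state space $\mathcal{X}$, the action space $\mathcal{U}$ and a start state $x_0$, and a user with hidden weights $w^*$ who can be queried over $K$ iterations about their preference between two paths $P_k$ and $Q_k$, find a weight $w'$ with the corresponding optimal path $P^{w'}$ starting at $x_0$ that minimizes the path error.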

III Active Preference Learning

We introduce the user model and learning framework of our active preference learning approach and then discuss several approaches for selecting new solutions in each iteration.

III-A User Model

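To learn about the user’s hidden weights $w^*$ and thus find the corresponding optimal path $P^{w^*}$, we can iteratively present the user with a pair of paths $(P_k, Q_k)$ and they return the one they prefer, i.e., the one with the lower cost $c(\cdot, w^*)$:

$$c(P_k, w^*) \le c(Q_k, w^*) \;\Longleftrightarrow\; \text{the user prefers } P_k \text{ over } Q_k. \tag{5}$$

However, a user might not always follow this model exactly. For instance, they might consider features that are not in the model, or they might be uncertain in their decision when $P_k$ and $Q_k$ are relatively similar. Thus, we extend equation (5) to a probabilistic model, similar to our previous work in [12]. Let $I_k$ be a binary random variable that indicates whether the user prefers path $P_k$ over $Q_k$ or vice versa. Then we have

$$\mathbb{P}(I_k \text{ agrees with equation (5)}) = p, \qquad \mathbb{P}(I_k \text{ disagrees with equation (5)}) = 1 - p, \tag{6}$$

where $p$ is the probability that the user answers in accordance with equation (5). If $p = 1$ we recover the deterministic case from equation (5). In this very simple model the user’s choice does not depend on how similar $P_k$ and $Q_k$ are. In the simulations we will simulate the user according to the more sophisticated model in [2], which models the user’s error rate as a function of the similarity between alternatives, and show that equation (6) nonetheless allows us to achieve strong performance.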
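A simple probabilistic user of this kind can be simulated in a few lines. The sketch below assumes that the user picks the lower-cost path with a fixed probability `p` and the other path otherwise, independent of how similar the two paths are; the function name `simulated_choice` and the numeric values are hypothetical.

```python
import random


def simulated_choice(cost_p: float, cost_q: float, p: float,
                     rng: random.Random) -> int:
    """Return 0 if the simulated user picks the first path, 1 for the second.

    With probability `p` the user picks the path with the lower cost,
    otherwise the other one; `p = 1.0` gives a deterministic user.
    """
    better = 0 if cost_p <= cost_q else 1
    return better if rng.random() < p else 1 - better


rng = random.Random(0)
answers = [simulated_choice(0.1, 0.8, 0.8, rng) for _ in range(10_000)]
share_correct = answers.count(0) / len(answers)
print(f"share of answers favouring the lower-cost path: {share_correct:.3f}")
```

The share of correct answers approaches `p` regardless of the cost gap, which is exactly the simplification the text points out.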

III-B Learning Framework

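Over multiple iterations, equation (5) yields a collection of inequalities of the form $w^*\,(\phi(P) - \phi(Q)) \le 0$, where $P$ is the path the user preferred over $Q$ and $\phi$ denotes the feature vector of a path. We write the feedback obtained after $k$ iterations as a sequence. We then summarize the left-hand sides for all iterations using a matrix whose rows are the feature differences. Based on the sequence we can compute an estimate of $w^*$ using a maximum likelihood estimator.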

Deterministic case

In the deterministic case, i.e., $p = 1$, the estimate must satisfy all inequalities collected so far to be consistent with the user feedback obtained thus far. The set of all such weights constitutes the feasible set $\mathcal{F}_k$.
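As a sketch of the deterministic case, the feasible set can be approximated by filtering sampled weights against the collected inequalities. The code assumes that each piece of feedback is stored as a pair of feature vectors (preferred path first) and that the cost is linear in the features; `is_consistent`, `feasible_samples` and all values are hypothetical names and numbers.

```python
from typing import List, Sequence, Tuple

Feedback = Tuple[Sequence[float], Sequence[float]]  # (preferred, rejected) features


def is_consistent(weights: Sequence[float], feedback: List[Feedback]) -> bool:
    """True if the preferred path is never more costly than the rejected one."""
    for preferred, rejected in feedback:
        diff = [a - b for a, b in zip(preferred, rejected)]
        if sum(w * d for w, d in zip(weights, diff)) > 0:
            return False
    return True


def feasible_samples(samples: List[Sequence[float]],
                     feedback: List[Feedback]) -> List[Sequence[float]]:
    """Sampled weights that satisfy every inequality collected so far."""
    return [w for w in samples if is_consistent(w, feedback)]


# One query: the user preferred the path with features [0.1, 0.9].
feedback = [([0.1, 0.9], [0.8, 0.2])]
samples = [[1.0, 0.0], [0.6, 0.4], [0.3, 0.7], [0.0, 1.0]]
print(feasible_samples(samples, feedback))
```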

III-C Active Query Selection

In active preference learning we can choose a pair of paths $(P_k, Q_k)$ to present to the user in each iteration $k$. As the path the user desires is an optimal path for the hidden weights $w^*$, we only consider paths that are optimal for some weights. Given the user feedback obtained until iteration $k$, a new pair is then found by maximizing some measure describing the expected learning effect from showing the pair to the user.

In the literature, several approaches have been discussed: removing the volume, i.e., minimizing the integral of the unnormalized posterior over the weights [1, 13]; maximizing the information entropy over the weights [2]; and removing equivalence regions, i.e., sets of weights that all have the same optimal path [12].

Parameter space and solution space

The first two approaches ([1, 13] and [2]) maximize information about the parameter space, i.e., the weights, instead of the solution space, i.e., the set of all possible paths. Despite its intuitive motivation based on inverse reinforcement learning, this has a major drawback: The difference in the parameters does not map linearly to the difference in the features of corresponding optimal solutions. Given some weights $w$ and $w'$, we can compute optimal paths $P^{w}$ and $P^{w'}$ with features $\phi(P^{w})$ and $\phi(P^{w'})$, respectively. Then the distance between $w$ and $w'$ is not necessarily proportional to the distance between $\phi(P^{w})$ and $\phi(P^{w'})$. This implies that learning efficiently about the user’s weights does not guarantee efficient or effective learning about the resulting paths. Moreover, learning about the weights might allow for disregarding a large number of weights. However, the corresponding optimal paths can be very similar and thus the learning step is potentially less informative in the solution space.

Example 1.

We consider the autonomous driving example from [2], which is posed in a continuous state and action space, illustrated in Figure 1. The left plot of Figure 2 shows the weight error of random weights with respect to one single random optimal weight $w^*$. The samples and $w^*$ lie on the unit circle. The distribution is not entirely uniform due to symmetries of the function. In the right plot, we show the path error for the corresponding optimal paths of the sampled weights. We observe that the path error distribution takes nearly a discrete form, despite the continuous action space. This illustrates how different weights do not necessarily lead to different solutions. Consequently, the solution space is more structured than the parameter space.

Fig. 2: Example of the sensitivity of a continuous motion planning problem. The left plot shows the weight error for uniformly random weights. The right plot shows the path error of the corresponding optimal paths.

In our previous work we proposed a framework that updates probabilities over the solution space [12] based on a discretization: Sets of weights that have the same optimal path are labeled as an equivalence region. The objective for active query selection is to maximally reduce the posterior belief over equivalence regions, i.e., to reject as many equivalence regions as possible. A drawback of this approach is that there exist cases where any query only allows for updating the belief of a few equivalence regions, resulting in slow convergence.

Because of these limitations of the existing approaches we study a new measure based on the solution space.

IV Min-Max Regret Learning

We now propose a new measure called the maximum regret, which we seek to minimize.

Definition 1 (Regret of a path relative to a weight).

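Given a path $P^{w}$ that is optimal for $w$ and some weight $w'$, the regret of $P^{w}$ under $w'$ is

$$r(w \mid w') = \frac{c(P^{w}, w')}{c(P^{w'}, w')}, \tag{7}$$

where $c(P, w')$ is the cost of a path $P$ evaluated by $w'$ and $P^{w'}$ is the optimal path for $w'$.

Regret expresses how sub-optimal a path $P^{w}$ is when evaluated by some weights $w'$. In active learning, this can be interpreted as follows: If $w$ is the final estimate, but $w'$ is the optimal solution, how large is the ratio between the cost of $P^{w}$, evaluated by $w'$, and the optimal cost? We now formulate an approach for selecting which alternatives to show to the user by using regret.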
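The ratio interpretation of regret can be sketched for a finite set of candidate paths as follows. The sketch assumes strictly positive costs, a linear cost in the features and an enumerable candidate set; the names `cost`, `optimal_features`, `regret` and the feature values are hypothetical.

```python
from typing import List, Sequence

# Hypothetical feature vectors of three candidate paths (illustrative values).
PATHS: List[List[float]] = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.5]]


def cost(w: Sequence[float], phi: Sequence[float]) -> float:
    return sum(wi * fi for wi, fi in zip(w, phi))


def optimal_features(w: Sequence[float], paths: List[List[float]]) -> List[float]:
    """Features of the path that minimizes the cost under `w`."""
    return min(paths, key=lambda phi: cost(w, phi))


def regret(w_est: Sequence[float], w_true: Sequence[float],
           paths: List[List[float]]) -> float:
    """Cost ratio of the path chosen by `w_est`, both costs evaluated by `w_true`.

    Assumes strictly positive costs, so the ratio is at least 1.
    """
    achieved = cost(w_true, optimal_features(w_est, paths))
    optimum = cost(w_true, optimal_features(w_true, paths))
    return achieved / optimum


print(regret([0.9, 0.1], [1.0, 0.0], PATHS))  # same optimal path as the true weight
print(regret([0.0, 1.0], [1.0, 0.0], PATHS))  # a very different path
```

A regret of 1 means the estimate yields an optimal path even though the weights differ; larger values quantify how much more costly the estimated path is for the user.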

IV-A Deterministic Regret

When assuming a deterministic user, we need to ensure that both weights under consideration lie in the feasible set, such that the presented paths reflect the user feedback obtained so far. Given a path that is optimal for one such weight, we pose the Maximum Regret under Constraints Problem (MRuC) as


The objective is bi-linear. Bi-linear programs are a generalization of quadratic programs. Unfortunately, in our case the objective function is non-convex; generally, such problems are hard to solve.

Symmetric Regret

In equation (8) we have defined the maximum regret problem when one path is given. When presenting users with a new pair of paths, we want to find paths where the regret of the first path under the weight of the second is maximized and vice versa. Thus, we rewrite the objective in equation (8) to account for the regret in both directions, which we call the symmetric regret. The maximum symmetric regret of a feasible set can be found with the following bi-linear program:


Similar to equation (8) this is a non-convex optimization problem. In the evaluation we solve this problem by sampling a set of weights and pre-computing the corresponding optimal paths, following the approach in [1].
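The sampling-based solution can be sketched as an exhaustive search over pairs of sampled weights. The sketch assumes that the symmetric regret is the sum of the regrets in both directions, which is one reading of the description above and not necessarily the paper's exact objective. It further assumes positive costs and a finite candidate set; all names and values are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

# Hypothetical feature vectors of three candidate paths (illustrative values).
PATHS: List[List[float]] = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.5]]


def cost(w: Sequence[float], phi: Sequence[float]) -> float:
    return sum(wi * fi for wi, fi in zip(w, phi))


def max_symmetric_regret(samples: List[Sequence[float]],
                         paths: List[List[float]]) -> Tuple[Tuple[int, int], float]:
    """Pair of sample indices with the largest summed regret in both directions."""
    # Pre-compute the optimal path of every sampled weight once.
    optimal = [min(paths, key=lambda phi, w=w: cost(w, phi)) for w in samples]
    best_pair, best_value = (-1, -1), float("-inf")
    for i, j in combinations(range(len(samples)), 2):
        regret_i_under_j = cost(samples[j], optimal[i]) / cost(samples[j], optimal[j])
        regret_j_under_i = cost(samples[i], optimal[j]) / cost(samples[i], optimal[i])
        value = regret_i_under_j + regret_j_under_i
        if value > best_value:
            best_pair, best_value = (i, j), value
    return best_pair, best_value


SAMPLES = [[1.0, 0.0], [0.0, 1.0], [0.55, 0.45]]
pair, value = max_symmetric_regret(SAMPLES, PATHS)
print(pair, round(value, 3))
```

In the deterministic setting, `SAMPLES` would first be restricted to weights that are consistent with the feedback obtained so far.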

IV-B Probabilistic Regret

We now formulate regret with consideration of the user’s uncertainty when choosing among paths. Taking a Bayesian perspective we treat as a random vector. This allows us to express a posterior belief over given an observation . Let and , respectively. Further, we assume a uniform prior over . For any estimate where , i.e., that is consistent with the feedback , we have


Let denote . We calculate the posterior given a sequence of user feedback as


This allows us to formulate the symmetric regret in the probabilistic case, by weighting the regret by the posterior of the two weights $w$ and $w'$ that generate the pair of paths:


That is, we discount the symmetric regret such that we only consider pairs of weights where both weights are likely given the user feedback.

Finally, we adapt the problem of finding the maximum symmetric regret from equation (9) to the probabilistic case. As we cannot formulate a feasible set for a probabilistic user, we consider a finite set of weights, each sampled uniformly at random from the set of admissible weights. We then take the maximum over all pairs of sampled weights to compute the probabilistic maximum regret


In min-max regret learning, we choose the pair of paths that is the maximizer of equation (13).
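The following sketch illustrates how a posterior over sampled weights can discount the symmetric regret. It assumes a likelihood that multiplies `p` for every answer a weight agrees with and `1 - p` otherwise, a uniform prior over the samples, a symmetric regret given by the sum of both directions, and a discount given by the product of the two posteriors. These modelling details, the names and the values are assumptions for illustration.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

PATHS = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.5]]  # hypothetical path features
Feedback = Tuple[Sequence[float], Sequence[float]]  # (preferred, rejected) features


def cost(w: Sequence[float], phi: Sequence[float]) -> float:
    return sum(wi * fi for wi, fi in zip(w, phi))


def posterior(samples: List[Sequence[float]], feedback: List[Feedback],
              p: float) -> List[float]:
    """Normalized belief over the samples under a uniform prior."""
    raw = []
    for w in samples:
        likelihood = 1.0
        for preferred, rejected in feedback:
            agrees = cost(w, preferred) <= cost(w, rejected)
            likelihood *= p if agrees else 1.0 - p
        raw.append(likelihood)
    total = sum(raw)
    return [value / total for value in raw]


def probabilistic_max_regret(samples: List[Sequence[float]],
                             feedback: List[Feedback],
                             p: float) -> Tuple[Tuple[int, int], float]:
    """Pair of samples maximizing the posterior-weighted symmetric regret."""
    belief = posterior(samples, feedback, p)
    optimal = [min(PATHS, key=lambda phi, w=w: cost(w, phi)) for w in samples]
    best_pair, best_value = (-1, -1), float("-inf")
    for i, j in combinations(range(len(samples)), 2):
        symmetric = (cost(samples[j], optimal[i]) / cost(samples[j], optimal[j])
                     + cost(samples[i], optimal[j]) / cost(samples[i], optimal[i]))
        value = belief[i] * belief[j] * symmetric
        if value > best_value:
            best_pair, best_value = (i, j), value
    return best_pair, best_value


SAMPLES = [[1.0, 0.0], [0.0, 1.0], [0.55, 0.45]]
FEEDBACK = [([0.1, 0.9], [0.8, 0.2])]  # the user preferred the first path
print(probabilistic_max_regret(SAMPLES, [], p=0.9)[0])
print(probabilistic_max_regret(SAMPLES, FEEDBACK, p=0.9)[0])
```

Without feedback the search picks the pair with the largest raw regret; after one answer the selected pair changes, because one of the two weights has become unlikely.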

IV-C Preference Learning with Probabilistic Maximum Regret

Algorithm 1: Maximum Regret Learning
1 Initialize
2 Sample a set of weights
3 for each iteration do
...
8       if the user prefers the first path of the pair then
...
10      else
...

Our proposed solution for active preference learning using probabilistic maximum regret is summarized in Algorithm 1. In each iteration we find the pair of paths that maximizes the probabilistic symmetric regret as in equation (13) over a set of samples (line 4). We then obtain user feedback indicating which of the two paths the user prefers (line 7) and add the feedback to a sequence (lines 8-11). When learning is completed, we choose the estimate as the sample with the highest probability and return the corresponding shortest path (lines 11, 12).

Using the maximum regret in the query selection is a greedy approach to minimize the maximum error. Given the current belief over the weights, we choose the pair with the maximum error ratio, discounted by the likelihoods of the two weights that generate the pair.
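The overall loop can be sketched end to end on a toy problem. The sketch assumes a finite set of named candidate paths, a deterministic simulated user, a symmetric regret given by the sum of both directions, and a posterior built from a constant answer probability `p`; it skips pairs of samples that share the same optimal path. It is a simplified illustration under these assumptions rather than a reproduction of Algorithm 1, and all names and values are hypothetical.

```python
from itertools import combinations

# Hypothetical candidate paths with illustrative feature vectors.
PATHS = {"keep_lane": [0.1, 0.9], "overtake": [0.8, 0.2], "slow_down": [0.4, 0.5]}


def cost(w, name):
    return sum(wi * fi for wi, fi in zip(w, PATHS[name]))


def best_path(w):
    return min(PATHS, key=lambda name: cost(w, name))


def symmetric_regret(w_a, path_a, w_b, path_b):
    return (cost(w_b, path_a) / cost(w_b, path_b)
            + cost(w_a, path_b) / cost(w_a, path_a))


def posterior(samples, feedback, p):
    raw = []
    for w in samples:
        likelihood = 1.0
        for preferred, rejected in feedback:
            likelihood *= p if cost(w, preferred) <= cost(w, rejected) else 1.0 - p
        raw.append(likelihood)
    total = sum(raw)
    return [value / total for value in raw]


def learn(w_true, samples, iterations, p=0.9):
    """Query loop: select a pair, ask the simulated user, update the belief."""
    paths = [best_path(w) for w in samples]  # pre-computed optimal paths
    feedback = []
    for _ in range(iterations):
        belief = posterior(samples, feedback, p)
        query, query_value = None, 0.0
        for i, j in combinations(range(len(samples)), 2):
            if paths[i] == paths[j]:
                continue  # identical paths cannot be compared by the user
            value = belief[i] * belief[j] * symmetric_regret(
                samples[i], paths[i], samples[j], paths[j])
            if value > query_value:
                query, query_value = (i, j), value
        if query is None:
            break
        a, b = paths[query[0]], paths[query[1]]
        # Simulated deterministic user: prefers the path with lower true cost.
        answer = (a, b) if cost(w_true, a) <= cost(w_true, b) else (b, a)
        feedback.append(answer)
    belief = posterior(samples, feedback, p)
    most_likely = max(range(len(samples)), key=belief.__getitem__)
    return paths[most_likely], feedback


SAMPLES = [[(i + 0.5) / 10, 1 - (i + 0.5) / 10] for i in range(10)]
learned, history = learn([0.9, 0.1], SAMPLES, iterations=3)
print(learned, history)
```

Note that the true weight `[0.9, 0.1]` is not among the samples; the loop still returns its optimal path, since several sampled weights share that path.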

V Evaluation

(a) Driver
(b) LDS
Fig. 3: Comparison of active preference learning with maximizing entropy and minimizing regret.

We evaluate the proposed approach using the simulation environment from [2], allowing us to compare our approach to theirs in the same experimental setup. In the following, we refer to the maximum entropy learning from [2] as the entropy approach and to our maximum regret learning as the regret approach.

We consider two of the test cases in [2]: The autonomous driving scenario (Driver) from Figure 1 and an abstract linear dynamic system (LDS). In the driver scenario, an autonomous car moves on a three lane road in the presence of a human-driven vehicle as shown in Figure 1. Paths are described by four features: Heading relative to the road, staying in the lane, keeping the speed, and the distance to the other car. Every feature is averaged over the entire path. In LDS the problem has a six dimensional state and three dimensional action space and six features. However, the features do not have a practical interpretation. We choose these two examples because the entropy approach from [2] showed strong results on driver and this scenario was already previously investigated in [1]. For the LDS example on the other hand the entropy approach showed the weakest results.

We simulate user feedback using the probabilistic user model from [2]. Given two paths, the user’s uncertainty depends on how similar the paths are with respect to the cost function evaluated for :


Given a weight we find the corresponding optimal path with the generic non-linear optimizer L-BFGS [22]. The probabilistic regret is computed using pre-sampled weights as described in Algorithm 1, with an uncertainty of in equation (6).

Similar to [2] for each experiment we sample a user preference uniformly randomly from the unit circle, i.e., . We notice that this can include irrational user behaviour: A negative weight on heading for instance would encourage the autonomous car to not follow the road.

Finally, in these simulations the behaviour is actually captured by a reward and not a cost. Thus, an optimal path is found by minimizing the negative cost and we change the definition of regret to .

V-A Learning error

In Figure 3 we compare to on both metrics over iterations for the two experiments, each repeated times. In the boxplots the center line shows the median and the green triangle shows the mean.

At iteration we include the error over all sampled paths, i.e., the distribution from Figure 2, averaged over all trials. The weight errors of reproduce the results presented in [2], with the boxplots providing additional insight into the deviation of the error.

In the driver example, overall achieves a smaller weight error. In the path space we observe that actually performs better for the first iterations. However, from iteration onward, achieves a near optimal median value and shrinks the deviations nearly monotonically. At iteration over of the data lie below , i.e., have less than error. While achieves a median value less than already in iteration , we observe larger deviations, which actually can increase again as in iterations and . Thus, exhibits a strong performance after ten iterations in half the cases, but a quarter of the data still has error or more.

For the LDS example in Figure 3(b) both approaches do worse on the weight metric, hinting that it might be a harder problem to solve. For the weight alignment we see that the entropy approach outperforms the regret approach; nonetheless, both approaches exhibit larger deviations. In the path alignment on the other hand, we observe a drastic difference between the two approaches: When using the regret approach the median passes the mark within just a single iteration and all data points pass the mark in iteration . The median values of the entropy approach are a bit higher than for the driver example. However, the entropy approach shows large deviations over all iterations nearly of the data above and over over .

In both experiments we observe the benefit of the entropy approach when using the weight error metric, while path error performance is much better for the regret approach. These observations fit well with the theoretical nature of both approaches: The entropy approach reduces the uncertainty of the weight space, but does not consider the corresponding paths, i.e., the behaviour of the robot. In contrast, the regret approach greedily minimizes the maximum expected error of all paths. Most importantly, the regret approach achieves a lower path error despite having a larger weight error. That is, while the weights found by the entropy approach are more similar to the true user weights based on the alignment metric, the resulting behaviour of the regret approach is better. This strongly supports our claim that the solution space is more significant for judging the result of preference learning. In Section V-C we compare the two metrics in terms of robustness under different scenarios.

V-B Easiness of queries

A major contribution of [2] is the design of queries that are easy for the user to answer. In maximum regret learning we do not directly consider the user’s uncertainty when choosing a new pair of paths. However, as the paths maximizing the probabilistic symmetric regret have a large difference in cost, our approach implicitly selects queries that are easy to answer for a user whose true weight is one of the two weights that generated the presented pair.

To compare the easiness of the queries presented to the user, we consider the probability that the user would choose the path with lower cost, evaluated by the user’s true weights. This probability is actually used in the user model from [2], written out in equation (14). In Figure 4 we compare the probability of correct user answers for the entropy approach and the regret approach for each iteration in the driver experiment.

Fig. 4: The likelihood that the simulated user gives the ’correct’ answer, i.e., the probability in equation (14).

We observe that both approaches achieve very high probabilities for correct user answers in the first iteration, i.e., ask an easy question. Afterwards, the probabilities get smaller: In the driver example the median of decreases to and the deviations increase significantly. The approach maintains median values close to until iteration . However, after iteration the probabilities increase again until iteration and clearly outperform . In the actual experiment, we recorded correct answers for which is slightly worse than reported for the strict queries in [2], where correct answers occurred in of cases ( wrong answers over iterations on average). Nonetheless, with the simulated answers were correct in of cases, outperforming . In the LDS the approaches perform very similar to one another. We recorded for and for . The decrease in the probabilities for could be explained by the higher dimensional weight space in the LDS experiment, making it generally harder to ask easy questions.

Overall, these results strongly support our claim that maximizing regret implicitly creates queries that are easy for the user to answer.

V-C Generalization of the error

Finally, we investigate how the two error metrics generalize to different scenarios. That is, we investigate whether each error metric is useful for predicting the learning performance when the robot needs to generalise to a new instance of the problem not encountered during active preference learning. Therefore, we consider the driver scenario from Figure 1 as a training case and construct test cases by changing the initial state of the human driven vehicle (white). The weight error is scenario independent: it directly describes how similar the estimated weight is to the true user weight. Thus, the weight error is the same in training and test cases and cannot be used as a test error, as this would contain no additional information about performance on the test case. Hence, we use the path error as the test error. Further, we notice that if the weight error is zero, i.e., the weights have been learned perfectly, then the path error is zero in all scenarios. However, as shown in Figure 3 and in [1, 2], the weights typically do not converge to the true user weight within a few iterations. Given some weight, the path errors are fixed values in every test scenario. We are now interested in how well the weight error and the path error of the training scenario predict the path error of the test scenario.

We generate different random user weights and then generate estimates of each of these weights. For every estimate we find the optimal path and compute the path and weight error, which are used as training errors for the estimate. In Figure 5 we show how these training errors relate to the test error. We compare the path and weight error as a measure of generalisation performance (i.e., how well the weight and path errors predict the test case performance).

Fig. 5: Relationship between training errors measured by the path and weight metric to test errors in the path metric.

We observe that the path error translates linearly between training and test scenarios: Given a weight with a certain path error in the training scenario, the weight yields paths in the test scenarios that have a similar path error, on average. The relationship between weight error and test error is more complex. For a weight error of during training we observe a test error of , i.e., if the weights are very close to the optimum, the optimal solution is found in every scenario. However, the test error shows large deviations, implying that a low weight error in training is not a robust measure of how good the resulting behaviour is in test cases. The observation is supported by a strong Pearson correlation of between training and test error for the path error, but a much weaker correlation of for the weight error. This lends support to the claim that the path error is better suited for making predictions of the performance in scenarios that were not part of the training.
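For reference, the Pearson correlation between training and test errors can be computed as in the sketch below. The error values are made-up placeholders that only demonstrate the call; they are not results from the experiments, and the function name `pearson` is hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant samples")
    return covariance / (spread_x * spread_y)


# Placeholder values for illustration only (not data from the paper).
training_error = [0.0, 0.1, 0.2, 0.4, 0.8]
test_error = [0.05, 0.1, 0.25, 0.35, 0.9]
print(round(pearson(training_error, test_error), 3))
```

A coefficient close to 1 indicates that the training error is a reliable linear predictor of the test error.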

VI Discussion

In this paper we investigated a new technique for generating queries in active preference learning for robot tasks. We have shown that competing state of the art techniques have shortcomings as they focus on the weight space only. As an alternative, we introduced the regret of the cost of paths as a heuristic for the query selection, which allows us to greedily minimize the maximum error. Further, we studied an error function that captures the similarity of the behaviour of estimated preferences and the optimal behaviour, instead of the similarity of weights. In simulations we demonstrated that using regret in the query selection leads to faster convergence than entropy while the queries are even easier for the user to answer. Moreover, we have shown that the path error allows for better predictions for other scenarios.

For future work, special cases such as discrete action spaces in the form of lattice planners should be investigated. This would give further insight into the computational hardness of finding the maximum regret and potentially allow for solution strategies that do not require pre-sampling weights and paths. Richer user feedback such as an equal preference option could also be of interest; promising results for this approach were presented in [13, 2]. Finally, regret based preference learning should be investigated in a user study to show the practicality of this approach.


  • [1] D. Sadigh, A. D. Dragan, S. Sastry, and S. A. Seshia, “Active preference-based learning of reward functions,” in Robotics: Science and Systems (RSS), 2017.
  • [2] E. Bıyık, M. Palan, N. C. Landolfi, D. P. Losey, and D. Sadigh, “Asking easy questions: A user-friendly approach to active reward learning,” 2019.
  • [3] N. Wilde, A. Blidaru, S. L. Smith, and D. Kulic, “Improving user specifications for robot behavior through active preference learning: Framework and evaluation,” IJRR, to appear. [Online]. Available:
  • [4] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in Proceedings of the twenty-first international conference on Machine learning.   ACM, 2004, p. 1.
  • [5] A. Jain, S. Sharma, T. Joachims, and A. Saxena, “Learning preferences for manipulation tasks from online coactive feedback,” The International Journal of Robotics Research, vol. 34, no. 10, pp. 1296–1313, 2015.
  • [6] B. Akgun, M. Cakmak, J. W. Yoo, and A. L. Thomaz, “Trajectories and keyframes for kinesthetic teaching: A human-robot interaction perspective,” in Proceedings of the seventh annual ACM/IEEE international conference on Human-Robot Interaction.   ACM, 2012, pp. 391–398.
  • [7] M. C. Gombolay, R. J. Wilcox, and J. A. Shah, “Fast scheduling of robot teams performing tasks with temporospatial constraints,” IEEE Transactions on Robotics, vol. 34, no. 1, pp. 220–239, 2018.
  • [8] D. S. González, O. Erkent, V. Romero-Cano, J. Dibangoye, and C. Laugier, “Modeling driver behavior from demonstrations in dynamic environments using spatiotemporal lattices,” in 2018 IEEE ICRA.   IEEE, 2018, pp. 1–7.
  • [9] T. Gu, J. Atwood, C. Dong, J. M. Dolan, and J.-W. Lee, “Tunable and stable real-time trajectory planning for urban autonomous driving,” in 2015 IEEE/RSJ IROS.   IEEE, 2015, pp. 250–256.
  • [10] C. Daniel, M. Viering, J. Metz, O. Kroemer, and J. Peters, “Active Reward Learning,” Robotics: Science and Systems (RSS), vol. 10, no. July, 2014.
  • [11] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” in Advances in Neural Information Processing Systems, 2017, pp. 4299–4307.
  • [12] N. Wilde, D. Kulić, and S. L. Smith, “Bayesian active learning for collaborative task specification using equivalence regions,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1691–1698, April 2019.
  • [13] C. Basu, M. Singhal, and A. D. Dragan, “Learning from richer human guidance: Augmenting comparison-based learning with feature queries,” in Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, ser. HRI ’18.   New York, NY, USA: ACM, 2018, pp. 132–140.
  • [14] D. Bertsimas and J. N. Tsitsiklis, Introduction to linear optimization.   Athena Scientific Belmont, MA, 1997, vol. 6.
  • [15] R. Montemanni and L. M. Gambardella, “An exact algorithm for the robust shortest path problem with interval data,” Computers & Operations Research, vol. 31, no. 10, pp. 1667–1680, 2004.
  • [16] A. Kasperski and P. Zielinski, “An approximation algorithm for interval data minmax regret combinatorial optimization problems,” Inf. Process. Lett., vol. 97, no. 5, pp. 177–180, 2006.
  • [17] M. Palan, N. C. Landolfi, G. Shevchuk, and D. Sadigh, “Learning reward functions by integrating human demonstrations and preferences,” arXiv preprint arXiv:1906.08928, 2019.
  • [18] D. P. Losey and M. K. O’Malley, “Including uncertainty when learning from human corrections,” in Conference on Robot Learning, 2018, pp. 123–132.
  • [19] J. Y. Zhang and A. D. Dragan, “Learning from extrapolated corrections,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 7034–7040.
  • [20] M. Racca, A. Oulasvirta, and V. Kyrki, “Teacher-aware active robot learning,” in 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI).   IEEE, 2019, pp. 335–343.
  • [21] N. Wilde, D. Kulić, and S. L. Smith, “Learning user preferences in robot motion planning through interaction,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), May 2018, pp. 619–626.
  • [22] G. Andrew and J. Gao, “Scalable training of L1-regularized log-linear models,” in Proceedings of the 24th international conference on Machine learning, 2007, pp. 33–40.