Bayesian Active Learning for Collaborative Task Specification Using Equivalence Regions

01/28/2019 ∙ by Nils Wilde, et al. ∙ University of Waterloo

Specifying complex task behaviours while ensuring good robot performance may be difficult for untrained users. We study a framework for users to specify rules for acceptable behaviour in a shared environment such as industrial facilities. As non-expert users might have little intuition about how their specification impacts the robot's performance, we design a learning system that interacts with the user to find an optimal solution. Using active preference learning, we iteratively show alternative paths that the robot could take on an interface. From the user feedback ranking the alternatives, we learn about the weights that users place on each part of their specification. We extend the user model from our previous work to a discrete Bayesian learning model and introduce a greedy algorithm for proposing alternatives that operates on the notion of equivalence regions of user weights. We prove that with this algorithm the active learning process converges on the user-optimal path. In simulations on realistic industrial environments, we demonstrate the convergence and robustness of our approach.




I Introduction

We address two active research topics in human-robot interaction (HRI): learning from non-expert users and human-robot collaboration. We develop a methodology for non-expert users to create specifications for complex robot tasks [1]. These specifications then enable humans and robots to operate in a shared workspace [2].

For instance, in an industrial environment shared between humans, autonomous vehicles and human-operated vehicles, a facility operator might define road rules to be followed by both robot and human-operated vehicles. Such road rules increase the predictability of robot behaviour for humans in the environment. These can include constraints such as areas of avoidance, one-way roads or speed limits. The environment map, the operator specifications and a defined set of start and goal locations yield a complete specification of a robot task. In practice, designing such rules can be challenging, as operators might have little intuition about how their specification will affect the robot's behaviour and therefore its performance. They might be willing to accept the violation of less important constraints if sufficiently beneficial for task performance. For instance, Figure 1 shows an industrial environment with several user-defined constraints. When the robot uses the dark blue path, it respects all constraints. In the alternative solutions, the robot traverses (i.e., violates) constraints as this enables a significant reduction in the time to travel from start to goal.

Fig. 1: Example environment (white) with obstacles (black) and user-defined constraints. Roads are drawn in green with an arrow indicating the direction. Speed limit zones are drawn in yellow, while areas of avoidance are illustrated in red. Further, four different paths between a given start and goal are shown. Dark blue indicates the initial path following the specification. Purple and orange paths are alternatives that the simulated user accepted during the interaction. In cyan we show the user-optimal path, to which the learning eventually converged.

A user likely has a preference for which path is a better solution for the given task, based on the completion time and on the importance of the constraints. This can be captured via weights, describing the importance of constraints. However, asking the user to define these weights is unintuitive and possibly challenging. We propose a framework where a user provides only a spatial definition of constraints, while the importance of each constraint is latent.
To fill the gap between this incomplete user input and a complete robot task description we apply active preference learning. We present the user with alternative solutions, i.e., paths, and ask for a ranking. In our framework, the user is "on-the-loop" and provides feedback at their convenience. The latest path preferred by the user becomes the current path and can already be executed. Consequently, each set of alternative paths contains the current path. Through this interaction with the user, the relative importance of each constraint can be learned. In our previous work [3], we developed an algorithm that iteratively builds a set of linear inequalities on the hidden weights of the user constraints. Our user model evaluated paths based on a cost function trading-off constraint violation and time. The learning system assumed that the user would always provide feedback consistent with that cost function and thus iteratively rejected paths that became inconsistent with the user feedback. We extend our previous work [3] in two ways: First, the assumptions on the user are relaxed in order to capture more realistic behaviour. By considering noisy user feedback and introducing a probabilistic learning approach, we allow users to not always behave consistently with our user model. Second, based on our formulation of the problem as a shortest path search on a graph, we define equivalence regions for possible solutions. We propose a new learning system that exploits the notion of equivalence regions, which are sets of constraint weights that are indistinguishable to the user. From this we obtain a greedy algorithm that allows highly efficient learning, outperforming other state-of-the-art techniques.

Related work

Recently, active preference learning has been extensively studied for robot task specification. Thereby, a user is assumed to have an internal, hidden cost function which is learned from the user feedback to a presented set of alternatives. For instance, in [4] experts rank the demonstrated task performance for grasping applications while [5, 6] focus on continuous trajectories of dynamical systems like autonomous mobile robots. We propose a framework relying on a deterministic black-box planner that outputs a path for a given set of weights for the user constraints [3]. Different weights can have the same optimal solution, allowing for a discretization of the weight space. Therefore, our problem can be cast as an entity identification problem [7]: our hypotheses are the sets of constraint weights that have different optimal solutions, tests correspond to asking the user about their preference between paths and observations equal their feedback. Golovin et al. [8] introduce a strong algorithm for near-optimal Bayesian active learning with noisy observations. However, their approach focuses on running each test at most once while we allow for repeated queries. While [5] greedily reduces the integral of the continuous probability density function over the weight space, our greedy algorithm is formulated over the discretized weight space corresponding to unique paths. A major drawback of the user model proposed in [5] and [6] is that the user's behaviour depends on the scale of the selected features, as it considers an absolute instead of a relative error and does not provide a normalizing mechanism. We propose a more general, scale-invariant user model and show its robustness in simulations. Finally, our work differs from other applications of active preference learning for robotics in the way we choose the features for the cost function. Usually, features are picked manually and are therefore a design choice for the learning system [4, 9]. In our case, features are the violations of constraints that follow from the user specification and are user specific.

A common technique for learning from demonstration [10] is inverse reinforcement learning (IRL). The optimal behaviour of a dynamical system is described by a hidden reward function. IRL then learns this reward function by observing optimal demonstrations. Similarly, we assume a hidden user cost/reward function for the quality of a path. The cost function is modelled as a weighted sum of predefined features and we are interested in learning the weights. However, providing demonstrations might be difficult [4], the amount of necessary demonstrations may be prohibitively large [11] or demonstrations may require a high level of expertise [12]. Providing rich and precise specifications prior to a robot executing a task can be challenging and more prone to inaccuracies [13]. In contrast, active preference learning learns hidden reward functions by proposing alternative solutions and asking for the user's preference. Closely related to this work, [14] presents a GUI for specifying the user constraints on a given environment, which is used as a front-end to the work presented in this paper.


In our previous work [3], we proposed a deterministic user model for learning about weights from ranking feedback and proposed a complete algorithm. Using the same framework for combining path planning with user constraints, we extend the user model. To capture user feedback inconsistent with our assumed cost function, we propose a Bayesian learning approach (Section III-A). Thereby, we exploit the discrete properties of our problem, introducing a partitioning of the solution space based on equivalence regions. We prove almost sure convergence of the algorithm (Section III-B) and derive a greedy approach (Section III-C). Finally, we show the performance and robustness of our approach in comparison with another state-of-the-art technique in extensive simulations (Section IV).

II Problem Formulation

II-A Preliminaries

Using definitions from [15], a multi-graph is a triple G = (V, E, ψ), where the function ψ associates each edge with an ordered pair of vertices. Multiple edges are allowed to connect the same ordered pair of vertices and are then called parallel. In our problem we consider doubly weighted multi-graphs of the form G = (V, E, ψ, w1, w2). Thereby, w1 and w2 are independent weight functions, each associating a real number to each edge of the graph: wi: E → R for i ∈ {1, 2}.

A walk between two vertices u and v on a graph is a finite sequence of vertices and edges u = v0, e1, v1, …, ek, vk = v where the edges e1, …, ek are distinct. A path between two vertices u and v is defined as a graph whose vertices and edges form a walk in which all vertices are distinct. On a weighted graph, the cost of a path P is defined as c(P) = Σ_{e ∈ P} w(e). In doubly weighted graphs we define two costs c1 and c2 where ci(P) = Σ_{e ∈ P} wi(e) for i ∈ {1, 2}.
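As a concrete sketch of these definitions (our own illustration, not the authors' implementation; class and function names are hypothetical), a doubly weighted multigraph and its two path costs can be represented as follows:

```python
from collections import defaultdict

class DoublyWeightedMultiGraph:
    """Adjacency-list multigraph where each directed edge carries two
    independent weights, e.g. traversal time and constraint penalty.
    Parallel edges between the same vertex pair are allowed."""
    def __init__(self):
        self.adj = defaultdict(list)  # u -> list of (v, w1, w2)

    def add_edge(self, u, v, w1, w2):
        self.adj[u].append((v, w1, w2))

def path_costs(path_edges):
    """Sum each weight independently over a path given as a list of
    (u, v, w1, w2) tuples, yielding the two costs c1 and c2."""
    c1 = sum(e[2] for e in path_edges)
    c2 = sum(e[3] for e in path_edges)
    return c1, c2
```

Parallel edges with different traversal times, as used later to model speed, are simply two entries in the same adjacency list.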

II-B Problem statement


Fig. 2: Flowchart of the problem. An initial specification is revised using interactive learning to obtain a revision that better fits the user preferences.

We summarize the problem setup in Figure 2. From the environment and a set of user constraints we construct an initial specification for the robot. As users might allow the violation of some of their constraints for sufficient time benefit, we present them with alternative paths during the interaction and ask for feedback. From this feedback we learn the weights of the constraints and obtain a revised specification that corresponds to the user preferences. Figure 1 illustrates how the path that we believe to be optimal evolves after observing user feedback until it converges to the optimal solution.

As in our previous work [3], we consider a fully known, static environment, represented as a weighted strongly connected multigraph . The weight on the graph encodes the time a robot requires to traverse an edge. We use parallel edges with different times to model speed. A robot task consists of navigating from a start vertex to a goal vertex on . On the environment, a user specifies a set containing constraints. Each constraint is a pair , where is a subset of the edges of and is a hidden user cost for the constraint. To incorporate the user specification, we create a doubly weighted graph . For each edge in the second weight is defined as the sum of all that belong to a constraint containing . The problem is to find a path from to that minimizes the following objective:


c(P) = φ(P)ᵀ w + t(P)    (1)

The true user weights are latent. Moreover, they are defined in units of time, allowing us to pose the multi-objective optimization as an unweighted sum. To learn about the weights, we can query the user by presenting them with a set of paths. The feedback is a vector representing a ranking, i.e., a partial ordering, of the presented paths. Without loss of generality we focus only on pair-wise comparisons, as the ranking of additional elements can be expressed with a set of pair-wise relations. This is also well motivated with respect to the user; ranking more than two alternatives might be unnecessarily challenging [16].
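The reduction from a full ranking to pairwise relations is straightforward; as a small illustration (assuming Python, function name ours):

```python
def ranking_to_pairs(ranked_paths):
    """Expand a total ranking [best, ..., worst] into the equivalent
    set of pairwise preferences (preferred, rejected)."""
    return [(ranked_paths[i], ranked_paths[j])
            for i in range(len(ranked_paths))
            for j in range(i + 1, len(ranked_paths))]
```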

We formally define the Learning of User Preferences (LUP) problem as follows:

Problem 1 (LUP).

Given a graph, a user specification, a start and goal vertex, a user evaluating a presented set of paths and a budget of iterations for querying the user, maximize the belief about the true user weights and find the corresponding shortest path with respect to equation (1).

III Probabilistic Learning

In this section we propose a probabilistic model of user behaviour. In our previous work [3] we required the user to always provide feedback consistent with a linear user model. In contrast, we now consider that the user feedback may be noisy and thus not deterministic.

III-A Bayesian Learning

Using definitions for Bayesian inference from [17] we set up a learning model for gaining information about the hidden (latent) parameter w. We model the user weights to be positive and finite. The cost of a path is c(P) = φ(P)ᵀw + t(P), where the violation vector φ(P) describes how many edges of each constraint are traversed by P, w is a column vector containing all latent user weights and t(P) is the time to traverse P. From each user feedback for a pair of paths (P1, P2) we can derive a hyperplane of the form (φ(P1) − φ(P2))ᵀw = t(P2) − t(P1). This hyperplane defines two subsets of the weight space, W1 and W2. Thus W1 is the set of all weights for which P1 has lower cost than P2.
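Under this cost model, the halfspace induced by a single comparison can be computed directly. The sketch below is our own illustration (feature vectors, times and names are hypothetical): the preferred path having cost no greater than the rejected one yields a linear constraint on the weights.

```python
import numpy as np

def halfspace_from_feedback(phi_pref, t_pref, phi_rej, t_rej):
    """Given the violation vector phi and traversal time t of the
    preferred and the rejected path, return (a, b) such that the
    feedback is consistent with all weights w satisfying a . w <= b."""
    a = np.asarray(phi_pref, dtype=float) - np.asarray(phi_rej, dtype=float)
    b = float(t_rej - t_pref)
    return a, b

def consistent(w, a, b):
    """Check whether a weight vector lies in the learned halfspace."""
    return float(np.dot(a, w)) <= b
```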

Probabilities of halfspaces

For any pair of paths (P1, P2), w ∈ W1 holds iff c(P1) < c(P2). Adopting a Bayesian perspective, we treat w as a random variable and assign an uninformed prior. Notice that the volumes of W1 and W2 do not correspond to the probability of a path being preferred over another path. From the user feedback about two paths P1 and P2 we obtain binary observations. We denote observations with a random variable I, indicating whether the user prefers path P1 or P2. A deterministic user always provides feedback consistent with the cost function, i.e., prefers the path of lower cost. A probabilistic user is consistent with this model with some probability p. Hence, the probability of I given w is

P(I = P1 | w) = p if w ∈ W1, and 1 − p if w ∈ W2.    (2)

We refer to p as the accuracy of the user and assume p > 1/2, i.e., that our user model fits the user's decision making better than a random guess. If the parameter p is hidden, we can evaluate equation (2) with an estimate p̂; to simplify notation we write p̂ as p. In general, p is a function of P1 and P2. This allows us to model different levels of the user's accuracy depending on how similar the paths are.

Probabilities of equivalence regions

Equation (2) describes an observation model for a pair of paths, assigning probabilities to halfspaces. We now assign probabilities to paths instead. We observe that not every value in the weight space leads to a unique shortest path, which leads to our definition of equivalence regions.

Definition 1 (Equivalence region).

If the same path is optimal for two weights w and w′, we call w and w′ equivalent. The equivalence region Θ(w) of a weight w is then the set of all weights that are equivalent to w.

We can use equivalence regions to discretize the weight space. Given a comparison of two paths (P1, P2), we introduce a second observation model that describes the probability of user feedback given that the true user weight lies in the equivalence region Θ of some path as

P(I = P1 | Θ) = p if Θ ⊆ W1, 1 − p if Θ ⊆ W2, and 1/2 otherwise.    (3)

If an equivalence region lies in both halfspaces W1 and W2, we obtain no information from the feedback, since not all weights in Θ are either feasible or infeasible with the user feedback; this is expressed in the third case. Let E be the set of all equivalence regions for a given problem instance. The observation model allows us to express a probability for Θ given an observation I as a Bayesian posterior

P(Θ | I) = P(I | Θ) P(Θ) / Σ_{Θ′ ∈ E} P(I | Θ′) P(Θ′).    (4)

Following [17], we write the Bayesian posterior for a series of k observations I1, …, Ik for arbitrary pairs of paths as

P(Θ | I1, …, Ik) = P(Θ) Π_{l=1}^{k} P(Il | Θ) / Σ_{Θ′ ∈ E} P(Θ′) Π_{l=1}^{k} P(Il | Θ′).    (5)
Notice that this general model does not depend on the exact form of the likelihoods; we only require the accuracy to exceed 1/2. Therefore, our model could use the likelihood function from [6]. Alternatively, one could fix the accuracy to a constant. Then, in contrast to [5, 6], the accuracy of the user does not depend on the scaling of the features in the cost function. Moreover, our model increases the robustness towards user feedback that appears inaccurate because the user is considering context that is not described by our features. For instance, in a warehouse an operator might have different preferences for different weekdays or might want a robot to temporarily avoid certain regions. This cannot be covered with the current cost function and thus this user would appear erratic to the learner. Finally, when the accuracy is set to one, the deterministic learning model [3] is recovered. The key advantage of using equivalence regions in equation (5) is that it reduces the complexity of the probability distribution, since we now have a discrete distribution over regions rather than a continuous one. This allows for significantly faster solving of the problem, as we will show in Section IV.


III-B Probabilistic Algorithm

In Algorithm 1 we propose a general procedure to iteratively learn about user preferences from pairwise user feedback with inaccurate users. Initially, we compute the set of all equivalence regions (line 2). After updating our current belief about all equivalence regions (line 4), we iteratively generate new paths (line 5) similar to our deterministic algorithm from [3]. Then, we request user feedback for the pair and add the user feedback to a set (lines 6-7). After adding the new observation to our set, we update the weight space and, if necessary, the current weight (lines 8-9). The procedure is repeated until we reach the iteration budget of the loop in line 3; at the end we return the weight where the posterior belief is maximized. We discuss an implementation of the path generation function in Section III-C.

Input: graph, user specification, iteration budget
1  Initialize the prior belief and the current path
2  Calculate the set of equivalence regions
3  for i = 1 to budget do
4        Update the belief for all equivalence regions
5        Generate a new candidate path
6        Get user feedback for the current and candidate paths
7        Add the feedback to the observation set
8        Update the weight space
9        if the candidate path is preferred then update the current weight and path
10 return the weight maximizing the posterior belief
Algorithm 1: Learning user weights by sampling
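The loop of Algorithm 1 can be paraphrased in code. The following is a schematic sketch with placeholder callables; query_fn, user_fn and update_fn are our names, standing in for the query selection, the (possibly noisy) user feedback and the Bayesian belief update.

```python
import numpy as np

def learn_user_weights(paths, query_fn, user_fn, update_fn, budget):
    """Iterative preference-learning loop in the spirit of Algorithm 1.
    paths holds one representative path per equivalence region."""
    belief = np.full(len(paths), 1.0 / len(paths))  # uniform prior
    current = paths[0]  # initial path from the specification
    for _ in range(budget):
        candidate = query_fn(belief, current)          # propose alternative
        prefers_candidate = user_fn(current, candidate)
        belief = update_fn(belief, current, candidate, prefers_candidate)
        if prefers_candidate:
            current = candidate  # preferred path becomes the current one
    return paths[int(np.argmax(belief))], belief
```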


We now establish almost sure convergence of Algorithm 1. Let be the true user weight and for all pairs of paths. Without loss of generality, we only consider pairs that are ordered such that the first path has the lower true cost; hence this always holds (but is not known to the algorithm). Moreover, denote the number of equivalence regions and the number of all pair-wise comparisons. For the following definition we change our notation and denote the optimal path explicitly.

Definition 2 (Asymptotically completely informative sequence).

Let be a sequence of pairs of paths presented to the user in iterations, and for each , let be the longest subsequence of for which . Then the sequence is asymptotically completely informative if as goes to , we have for all .

In other words, if a sequence of pairs of paths contains observations about every pair, and the number of observations for each pair goes to infinity as the length of the sequence goes to infinity, it is called asymptotically completely informative. Notice that such a sequence of paths is not required to contain subsequences for all pairs of paths; it is sufficient if the feedback to the pairs contains information about all other equivalence regions according to (3). Finally, we call user feedback asymptotically completely informative if the corresponding sequence of paths is asymptotically completely informative, and we treat the probability that the true user weight or an estimate lies in an equivalence region as a random variable.

Proposition 1 (Convergence).

Let the equivalence region containing the true user weight be given. Given asymptotically completely informative user feedback, the probability that the best estimate of Algorithm 1 lies in this region converges almost surely to 1 as the numbers of observations go to infinity.


First, consider the comparison of an arbitrary, fixed pair of paths. Let a sequence of user feedback of a given length be observed, and count the number of times the user chooses accurately. For simplicity we drop the superscript in the following. Using Hoeffding's inequality [18], the empirical accuracy converges to the true accuracy as the sequence length goes to infinity. We notice that the probability of a sequence of user feedback depends on the true accuracy, while our belief given some user feedback is based on our estimate of it. Given user feedback with known true accuracy and estimate, the posterior probability is


We take the limit as the sequence length goes to infinity:


Using leads to . The term is strictly negative if . Hence, approaches zero as goes to infinity. We conclude that . As we only have two paths, we only have two equivalence regions. Following our ordering of and , . Hence, as for a single, fixed pair of paths and .
Finally, we extend the result to comparisons for multiple pairs. Equation (5) expresses the probability of lying in a given equivalence region. Notice that , as well as . As for all , we have . Hence, and the statement holds. ∎

From Proposition 1 we conclude that Algorithm 1 always elicits the true user weight if an asymptotically completely informative sequence of paths is presented to the user for feedback, and if the user's accuracy with respect to our model is greater than 1/2. However, this does not include any guarantees on the speed of convergence. In the next section we derive a greedy approach to maximize convergence speed.
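Proposition 1 can be illustrated numerically for a single pair of paths: repeated noisy comparisons with accuracy above 0.5 drive a two-region posterior toward the region the user truly prefers. The simulation below is our own toy check, not one of the paper's experiments.

```python
import random

def simulate_convergence(p_true=0.8, p_model=0.8, n=500, seed=0):
    """The user prefers path 1 with probability p_true; the learner
    updates a two-region posterior with assumed accuracy p_model.
    With p_true > 0.5 the posterior of region 1 tends to 1."""
    rng = random.Random(seed)
    post = [0.5, 0.5]  # uniform prior over the two equivalence regions
    for _ in range(n):
        prefers_1 = rng.random() < p_true
        lik = (p_model, 1 - p_model) if prefers_1 else (1 - p_model, p_model)
        post = [post[0] * lik[0], post[1] * lik[1]]
        z = post[0] + post[1]
        post = [post[0] / z, post[1] / z]  # normalize every step
    return post
```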

III-C Greedy Policy

We now show how to find new paths in each iteration of Algorithm 1, i.e., the path generation function. We notice that computing the set of all equivalence regions for a given problem instance is computationally intractable; a proof is provided in Appendix A. Thus, the set can be of exponential size in relation to the number of constraints. Because of this, we propose a greedy algorithm for finding new paths. We define the unnormalized posterior, i.e., the numerator of equation (5). As this is not a probability, we refer to it as the posterior measure. The decrease in the posterior measure is captured as


Our primary motivating application is one in which the user is "on-the-loop". We do not require them to constantly provide feedback and already execute the current solution [3]. Therefore, we keep the current best path and fix it to be one of the two alternative paths comprising the next query (Algorithm 1). Thus, our greedy algorithm returns the path maximizing the posterior measure


In this optimization we only need to consider one representative path for each equivalence region. In Appendix B we discuss the case where two new paths are presented, i.e., we do not fix one path in each query to be the current path. In this case it can be shown that equation (10) is an adaptive submodular function. Finally, we ensure convergence for our greedy approach.
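A possible reading of the greedy step in equation (10) is sketched below. The scoring rule (expected decrease of the posterior measure, marginalized over both feedback outcomes) and all names are our illustrative choices, under the simplified observation model with a fixed accuracy.

```python
def greedy_query(belief, sides_per_candidate, p=0.8):
    """Pick the candidate path whose comparison against the current
    path maximizes the expected decrease of the posterior measure.
    sides_per_candidate maps a candidate to one entry per equivalence
    region: +1 if the region favours the candidate, -1 if it favours
    the current path, 0 if it straddles the induced hyperplane."""
    def score(sides):
        lik_yes = [p if s == +1 else (1 - p) if s == -1 else 0.5
                   for s in sides]
        # probability the user prefers the candidate, marginalized
        # over the current belief
        pr_yes = sum(b * l for b, l in zip(belief, lik_yes))
        # expected posterior mass removed across both outcomes
        return 2.0 * pr_yes * (1.0 - pr_yes)
    return max(sides_per_candidate,
               key=lambda c: score(sides_per_candidate[c]))
```

Note that this score is maximized by queries whose outcome is maximally uncertain under the current belief, which matches the intuition of removing as much posterior mass as possible.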

Lemma 1 (Convergence of the greedy algorithm).

The greedy algorithm defined by equation (10) returns an asymptotically completely informative sequence of paths as the number of iterations goes to infinity; thus the probability of the equivalence region containing the true user weight converges to one almost surely.


To prove the statement we show two properties: 1) the greedy algorithm eventually returns the user-optimal path and 2) it eventually returns all paths necessary to constitute an asymptotically completely informative sequence. To show the first statement, consider any returned path: either the comparison does not contain information about the user-optimal path's equivalence region and the posterior of the rejected region decreases relative to the posterior of the optimal region (see equation (3)), or the comparison contains such information and is thus expected to increase the optimal region's posterior. Hence the expected marginal reward of presenting the user-optimal path increases monotonically relative to the reward of any other path. Thus, it will eventually be the maximizer of equation (10). Then, the greedy algorithm returns the user-optimal path and the user will prefer it over the current path in expectation.

For the second statement, assume the current path is already the user-optimal path. Due to inaccurate user feedback another path may become the current path. However, as shown above, the algorithm eventually presents the user-optimal path again. Consider the path whose number of informative observations is minimal among all paths (as defined in Definition 2). Case 1: this path is the maximizer of equation (10), thus it is presented and its count increments. Case 2: some other path is the maximizer and is presented. If the corresponding feedback contains information about the minimal path's region, its count increments as well. According to equation (3), if no information is obtained, its posterior measure increases, at a ratio determined by the user accuracy, relative to all regions that get rejected by the feedback. As this holds for all other paths and the number of paths is finite, its posterior measure increases relative to all other posterior measures until, after a finite number of iterations, either information about it is obtained and its count increments, or case 1 applies. Hence, we are guaranteed to increment the minimal count and thus all counts must go to infinity. ∎

Performing an exact greedy step is hard, as finding the set of all equivalence regions is intractable. In practice, a polynomial-sized estimate can be found via sampling, which allows for an approximate greedy step.
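The sampling-based estimate can be sketched as follows: draw random weights, compute the induced optimal path for each, and group weights by that path. The graph encoding and names are ours; the edge cost combines time and weighted constraint violations as in the paper's objective.

```python
import heapq
import random
from collections import defaultdict

def shortest_path(edges, start, goal, w):
    """Dijkstra on edges (u, v, t, viol), where each edge costs
    t + sum_j w[j] * viol[j] for the sampled weight vector w."""
    adj = defaultdict(list)
    for u, v, t, viol in edges:
        cost = t + sum(wi * vi for wi, vi in zip(w, viol))
        adj[u].append((v, cost))
    dist, prev = {start: 0.0}, {}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float('inf')):
            continue  # stale queue entry
        for v, c in adj[u]:
            nd = d + c
            if nd < dist.get(v, float('inf')):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return tuple(reversed(path))

def sample_equivalence_regions(edges, start, goal, dim, n=300, seed=1):
    """Estimate equivalence regions by sampling weight vectors and
    grouping them by the optimal path they induce."""
    rng = random.Random(seed)
    regions = defaultdict(list)
    for _ in range(n):
        w = [rng.uniform(0.0, 10.0) for _ in range(dim)]
        regions[shortest_path(edges, start, goal, w)].append(w)
    return regions
```

In this toy instance, the direct edge takes 3 time units but violates one constraint, while the detour takes 4 and violates none, so the weight space splits into exactly two equivalence regions at a constraint weight of 1.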

IV Evaluation

To generate realistic simulations, we recruited users to create specifications. Given the layout of a real industrial facility, users defined constraints as described in [14]. To systematically evaluate our approach, we simulate user feedback during the active learning. This allows us to pick different ground truths and generate the user feedback with varying accuracy levels. Figure 1 illustrates an example specification with different possible solutions. Further, for an outdoor scenario we conducted experiments using graphs generated by a probabilistic roadmap method [19] and random specifications.

Our primary interest is how the posterior belief about the true user weight evolves. In the evaluation we have two objectives: showing the robustness of our user model and comparing our work with [5]. We refer to their approach as Maximum Volume Removal (MVR) and name our approach Maximum Equivalence Region Removal (MERR). (Note: in both approaches neither volume nor equivalence regions are actually removed; rather, a lower posterior probability is assigned to the rejected items.)

In our implementation of MVR, we modify the query selection: as we consider a discrete space, queries are found by iterating over the discrete set of candidate paths rather than solving a continuous optimization problem. To ensure comparability with our framework, we fix one path in the pair that comprises a query as the current path. Moreover, MVR requires a scaling of the features. The user's accuracy is modelled as an exponential function of the difference in the cost of two paths [5]. Thus the user's accuracy depends on the scaling of the cost function. The model is extended by a linear parameter in [6] to describe different levels of accuracy. However, as no restriction on the scale of the features is made, these values do not yield similar results for different scenarios. In our experiments we manually determined the scaling for each scenario such that the user's accuracy is comparable across scenarios.

To investigate the performance we used three different specifications that vary in complexity and in the fraction of the free space they cover. The first specification consists of 26 constraints, the second (shown in Figure 1) has 41 constraints, while the third consists of 52 constraints. (Note: when a user specifies a road on the interface it counts as 2 constraints for the planner: a reward for following the road, i.e., a constraint with a negative weight, and a penalty for going against the direction of travel.) For each specification we varied between two start and goal pairs and three randomly selected true user weights. As finding the set of all equivalence regions is intractable, we generate estimates via sampling. The graph for these experiments is based on a grid layout. MERR and MVR differ in three components: the model for the simulated user feedback, the user model assumed by the learning system and the strategy for presenting new paths, i.e., the query selection.

IV-A Performance of MERR and MVR query selections

Experiment 1

In the first experiment, we compare the performance for the different active query selections - either MVR or MERR - for a user following the MVR model. Figure 3 shows the result for a total of 180 trials (10 repetitions for each configuration of user, start-goal pair and true user weight) with a budget of 30 iterations.

Fig. 3: Results for experiment 1. For different values of the posterior of the optimal weight the left plot shows the percentage of trials that achieved that value within the iteration budget. In the right plot we show median values of the posterior over all iterations together with violin plots of the distribution of the posterior at iterations 10, 20 and 30.

The left plot of Figure 3 illustrates the percentage of trials that achieved a given final posterior value for the true user weight within the iteration budget. A critical threshold for the posterior is 0.5, as beyond it the true user weight's equivalence region is the unique maximizer of the posterior distribution. The MERR query selection has a higher success rate, with a larger share of trials converging within the iteration budget than for MVR queries. Interestingly, both approaches always converge once the posterior surpasses this threshold. In the right subplot we illustrate the evolution of the median posterior over the iterations. Further, at iterations 10, 20 and 30 we show the distribution of the data. We observe that between iterations 10 and 20 the MERR posterior median starts to increase quickly, passing the threshold at iteration 17. MVR shows a slower increase and does not pass the threshold within the 30 iterations. The violin plots at three stages of the process illustrate a further detail: the distributions are nearly bimodal. Both approaches succeed for some instances very quickly while the posterior stays low for harder instances. However, we observe that the low end of the distribution shrinks more quickly for MERR and eventually becomes completely bimodal. In some hard instances, the algorithm takes longer to initially show the user-optimal path and does not learn about its region until then. However, once it is shown, the belief is maximized quickly, leading to the bimodal distribution.

To explain the better performance of MERR, we recall that equivalence regions vary drastically in volume. MVR often proposes queries that reduce the integral of the posterior but do not significantly change the posterior of equivalence regions and thus makes little progress.

Fig. 4: Results for experiment 2. The same analysis as in Figure 3 is shown, but for the MERR user two different values of the accuracy are depicted.

Experiment 2

The second experiment focuses on the performance of both query selection methods, assuming the user generates responses according to the MERR model, i.e., equation (3). In Experiment 1 we simulated the user according to [5, 6]. In that model, the accuracy depends on how different the presented paths are and therefore is influenced by the query selection. In order to ensure comparability with our user model, we fixed the accuracy to the average observed in Experiment 1. Additionally, a second dataset shows the results for a lower accuracy. Notice that meaningful values lie above 0.5; users with an accuracy of 0.5 act completely independently of our model. Further, accuracies below 0.5 correspond to misleading feedback, and the lower accuracy setting doubles the error rate. In Figure 4 we summarize the data as done for Experiment 1.

In contrast to Experiment 1, both query selection methods achieve a lower convergence rate at the higher accuracy. MERR reaches the posterior threshold in substantially more trials than MVR. For the lower accuracy, both query selection methods perform worse, with MERR still converging in more cases than MVR. The graphs illustrating the median posteriors over the iterations show a similar result for the higher accuracy compared to Experiment 1. For the lower accuracy, the median of the posterior for MERR makes substantially better progress than MVR. The distributions confirm this observation: MERR shrinks the bottom lobe and gains on the upper end more quickly.

In summary, the first two experiments highlight the performance benefit when maximizing the decrease in the posterior summed over equivalence regions compared to the decrease in the integral of the posterior (i.e., the removed volume), irrespective of the user model.

Iv-B Robustness of MERR

We further investigate how sensitive the proposed approach is to knowledge of the user’s accuracy. We simulate the user according to the MERR model with a constant accuracy , but the learning system only has access to an estimate . We fixed and and picked either , or . The experiment is based on the smallest and largest specifications with 26 and 52 constraints, respectively. For each configuration we average over 20 trials.

Fig. 5: Experiment 3. Robustness of the MERR user model for two specifications (26 and 52 constraints), different accuracies of the simulated user, and different estimates used by the learner.

Figure 5 shows that an estimation error of for the user accuracy has very little influence on the performance. In all four plots, the over- and underestimates behave similarly to the case where the learner knows the user's accuracy exactly. As expected, the accuracy itself has an impact on the performance: for the learning system performs worse on both specifications. Especially for the large specification, there is only little progress over the 20 iterations. In a more complicated setting, additional feedback is needed to elicit the true user weight; the larger amount of inaccurate feedback then has a greater impact. On the other hand, for the final result is relatively similar for both specifications. We conclude that a richer specification has a smaller impact on the performance when the user feedback is more accurate.
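The robustness experiment can be sketched as follows: the simulated user answers with true accuracy `p_true` while the learner updates its discrete posterior with an estimate `p_hat`. The two-region setup, fixed query, and region names are illustrative assumptions, not the experiment's actual specifications.

```python
import random

def run_trial(p_true, p_hat, iterations=20, seed=0):
    """Return the final posterior mass on the user's true equivalence
    region after repeated noisy feedback on a single fixed query."""
    rng = random.Random(seed)
    posterior = {"r_true": 0.5, "r_other": 0.5}
    # Path A is optimal in r_true, path B in r_other.
    costs_a = {"r_true": 1.0, "r_other": 3.0}
    costs_b = {"r_true": 3.0, "r_other": 1.0}
    for _ in range(iterations):
        prefers_a = rng.random() < p_true  # A is truly cheaper for the user
        pref, oth = (costs_a, costs_b) if prefers_a else (costs_b, costs_a)
        for r in posterior:
            lik = p_hat if pref[r] < oth[r] else 1.0 - p_hat
            posterior[r] *= lik
        z = sum(posterior.values())
        posterior = {r: m / z for r, m in posterior.items()}
    return posterior["r_true"]
```

In this toy setting, a moderately wrong estimate (e.g. updating with 0.8 while the user answers with 0.9) still concentrates the posterior on the true region, mirroring the insensitivity reported above, while a chance-level user leaves the posterior unchanged.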

Iv-C Extension to other scenarios

Finally, we applied the approach to a different setting: an environment described by a graph based on a k-nearest probabilistic roadmap (PRM) [19]. User specifications are generated randomly by sampling polygons in the environment; for each region, a constraint is formulated over all edges incident to a vertex in the region. Using the layout of the campus of the University of Waterloo, we generated a PRM graph with vertices and ; the number of sampled constraints is . In Figure 6 we show the map together with the generated graph. Further, we compare the initial path of the learning process, which does not violate any constraints, with the user-optimal path that is learned through interaction. We conducted an analysis similar to Experiments 1 and 2, averaged over 20 different specifications. Overall, of the trials achieve a posterior of at least within 50 iterations. Moreover, after iterations the median passes the threshold, while is reached after interactions.
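The roadmap construction can be sketched in a few lines: sample vertices, connect each to its k nearest neighbours, then sample regions and collect all edges incident to a vertex inside each region as one constraint. Here axis-aligned rectangles stand in for the sampled polygons, and all parameters (`n`, `k`, bounds) are illustrative.

```python
import math
import random

def build_prm(n=100, k=5, seed=0):
    """Sample n vertices in the unit square and connect each vertex to
    its k nearest neighbours (undirected, deduplicated edges)."""
    rng = random.Random(seed)
    pts = [(rng.random(), rng.random()) for _ in range(n)]
    edges = set()
    for i, p in enumerate(pts):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(pts) if j != i)
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))
    return pts, edges

def sample_constraint(pts, edges, rng):
    """Random axis-aligned rectangle as a stand-in for a sampled polygon;
    the constraint covers every edge incident to a vertex inside it."""
    x0, y0 = rng.random() * 0.8, rng.random() * 0.8
    x1, y1 = x0 + 0.2, y0 + 0.2
    inside = {i for i, (x, y) in enumerate(pts)
              if x0 <= x <= x1 and y0 <= y <= y1}
    return {e for e in edges if e[0] in inside or e[1] in inside}
```

In the actual experiment the free space is restricted by the building layout, and constraints are formulated over general polygons rather than rectangles.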

Fig. 6: Results for Experiment 4. (a) shows the outdoor environment (buildings in black and free space in white) with the generated PRM graph. Red indicates edges that belong to a randomly generated constraint; orange shows the optimal path between a start and goal location of a task when not violating any constraints. (b) shows the optimal path and the updated weights on the constraints – red indicates high and blue low weights.

V Discussion

User models

Generally, both models, MVR and MERR, assume that the user evaluates paths based on a weighted sum of features. In MVR, the user's accuracy depends on how similar the presented paths are. This approach is well motivated and promises good performance when the features are adequate. On the other hand, it has disadvantages with the scaling of features and lacks robustness when users do not follow the model. In contrast, our approach models inaccuracies as random noise and is agnostic to their exact form. In the simulations we fixed to a constant and demonstrated the robustness. Our learning system is therefore less dependent on the user model exactly capturing real user behaviour. A limitation of the MERR user model with constant accuracy values is that accurate user feedback is potentially not exploited efficiently, as the noise is then query-independent. This approach is investigated in [9]. Moreover, both learning models depend on sampling to perform the greedy step. Even though the accuracy can be increased arbitrarily with more samples, finding the optimal solution is computationally intractable.
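The shared assumption of both user models — a path scored by a weighted sum of features — is easy to make concrete. The feature names and weight values below are hypothetical; the point is that distinct weight vectors can induce the same ranking of paths, which is exactly what an equivalence region captures.

```python
def path_cost(features, weights):
    """Weighted-sum evaluation of a path, e.g.
    features = {"travel_time": 12.0, "violations": 2.0}."""
    return sum(weights[f] * v for f, v in features.items())

path_a = {"travel_time": 10.0, "violations": 0.0}
path_b = {"travel_time": 6.0, "violations": 2.0}

w1 = {"travel_time": 1.0, "violations": 3.0}  # prefers path_a (10 < 12)
w2 = {"travel_time": 1.0, "violations": 2.5}  # also prefers path_a (10 < 11)
w3 = {"travel_time": 1.0, "violations": 1.0}  # prefers path_b (8 < 10)
```

Here `w1` and `w2` lie in the same equivalence region with respect to this pair of paths, while `w3` does not.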

Performance in experiments

In the first two experiments we showed data collected for different user weights and different user specifications (and thus features). Both have a direct influence on the algorithm's performance. Usually, more complex specifications have a larger set of equivalence regions, which affects the convergence. Perhaps surprisingly, the true user preference can also influence the performance, especially in the MERR model: the learning system makes little progress if most of the hyperplanes learned from a sequence of user feedback intersect . Moreover, in single trials both models can perform relatively poorly by random chance. Inaccurate user feedback for a query containing leads to a decrease in the posterior of . The learning system might then need multiple iterations to present again and thus elicit . This effect is amplified when the accuracy of the user is low. Nonetheless, the MERR learning model is still guaranteed to converge.

Vi Conclusions and Future Work

We presented an interactive framework for robot task specification. Based on Bayesian active learning, we derived a greedy algorithm for generating queries. Our approach exploits the fact that different weights for constraints do not necessarily lead to different optimal paths. Using equivalence regions allows for a discrete Bayesian learning model that does not require the user to always provide feedback consistent with the assumed cost function. The probability of an inconsistent user feedback is of a general form and scale-invariant. In simulations, we demonstrated that our approach outperforms a related state-of-the-art technique [5] and showed the robustness of our user model. One direction for future work is to extend the concept of equivalence regions to continuous spaces by introducing a notion of path similarity. Further, the proposed framework can be applied to more complex and realistic scenarios including multiple start and goal locations and additional features, potentially including dynamic data such as traffic. Finally, user studies are required to show the practical performance of the framework.

-a Hardness of finding all non-equivalent paths

Proposition 2 (Hardness of finding all paths).

Finding the number of non-equivalent paths between and on the graph is #P-hard.


To prove the statement we reduce the S-T-paths problem to our problem. The S-T-paths problem asks for the number of all paths from a start to a goal vertex on an unweighted graph and is known to be #P-complete [20]. Let be an instance of S-T-paths where is an unweighted graph without parallel edges and maps a pair of vertices to each edge. We construct an instance of our problem where with , , and and . We choose for all such that is metric. We then pick a user specification consisting of user constraints, each defining a weight for exactly one edge, such that each edge in is associated with a single hidden weight . We obtain a doubly weighted graph . Every path on then has a corresponding path on and vice versa. Moreover, any path between and on is a shortest path for some realization of all , as we can choose if and for all and otherwise. Hence, the number of equivalence regions of our problem equals the number of all paths on , which corresponds to the number of S-T paths on . We conclude that a solution to finding all non-equivalent paths solves S-T-paths. ∎

The complexity class #P includes problems such as counting the number of solutions of NP-hard problems; however, even for many problems solvable in polynomial time, counting the solutions is nonetheless #P-hard [20]. If a polynomial time algorithm for solving a #P-hard problem existed, it would imply P = NP.
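The counting problem underlying the reduction can be illustrated with a naive enumerator: deciding whether an s-t path exists is easy, but the number of simple s-t paths can grow exponentially with the graph size, which gives some intuition for the #P-hardness. The graph encoding below is an illustrative choice.

```python
def count_st_paths(adj, s, t, visited=None):
    """Count simple paths from s to t by exhaustive depth-first search.
    adj: vertex -> list of neighbouring vertices."""
    if visited is None:
        visited = set()
    if s == t:
        return 1
    visited.add(s)
    total = sum(count_st_paths(adj, v, t, visited)
                for v in adj[s] if v not in visited)
    visited.remove(s)
    return total
```

On a 4-cycle there are two paths between opposite corners; on the complete graph on four vertices there are already five, and the count explodes as vertices are added.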

-B Discussion of adaptive submodularity

In Section III-C we proposed a greedy algorithm that maximizes the reduction of an unnormalized posterior. The objective of the algorithm is related to adaptive submodular functions, introduced in [21]. A similar approach is presented in the active learning framework of [5], where the objective is the reduction of the unnormalized integrated posterior of the weight space, referred to as the removed volume. The authors show that this volume removal function is adaptive submodular. In contrast, equation (9) sums over the posterior measure of all equivalence regions. This indicates how the belief over paths changes instead of the belief over all weights. When is not fixed to be , our greedy objective function can also be shown to be normalized, adaptive monotone, and adaptive submodular. Adaptive monotonicity follows from being multiplied with , or when user feedback is observed. Further, adaptive submodularity follows since the marginal reward of an element , as defined in [21], is smaller for a set than for a set , whereas the decrease in the posterior measure has an upper bound of . Adaptive submodularity provides strong performance guarantees for a greedy approach: at any given iteration, the greedy solution achieves times the optimal solution and is the best polynomial time approximation [21].


  • [1] V. Villani, F. Pini, F. Leali, and C. Secchi, “Survey on human-robot collaboration in industrial settings: Safety, intuitive interfaces and applications,” Mechatronics, no. June 2017, pp. 1–19, 2018.
  • [2] M. C. Gombolay, R. J. Wilcox, and J. A. Shah, “Fast scheduling of robot teams performing tasks with temporospatial constraints,” IEEE Transactions on Robotics, vol. 34, no. 1, pp. 220–239, 2018.
  • [3] N. Wilde, D. Kulic, and S. L. Smith, “Learning user preferences in robot motion planning through interaction,” in ICRA, 2018.
  • [4] C. Daniel, M. Viering, J. Metz, O. Kroemer, and J. Peters, “Active Reward Learning,” RSS, vol. 10, no. July, 2014.
  • [5] D. Sadigh, A. D. Dragan, S. Sastry, and S. A. Seshia, “Active preference-based learning of reward functions,” in RSS, 2017.
  • [6] C. Basu, M. Singhal, and A. D. Dragan, “Learning from richer human guidance: Augmenting comparison-based learning with feature queries,” in HRI 2018.   ACM, 2018, pp. 132–140.
  • [7] V. T. Chakaravarthy, V. Pandit, S. Roy, P. Awasthi, and M. Mohania, “Decision trees for entity identification: Approximation algorithms and hardness results,” in Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems.   ACM, 2007, pp. 53–62.
  • [8] D. Golovin, A. Krause, and D. Ray, “Near-optimal bayesian active learning with noisy observations,” in NIPS, 2010, pp. 766–774.
  • [9] R. Holladay, S. Javdani, A. Dragan, and S. Srinivasa, “Active comparison based learning incorporating user uncertainty and noise,” in RSS Workshop on Model Learning for Human-Robot Communication, 2016.
  • [10] B. Akgun, M. Cakmak, J. W. Yoo, and A. L. Thomaz, “Trajectories and keyframes for kinesthetic teaching: A human-robot interaction perspective,” in HRI 2012.   ACM, 2012, pp. 391–398.
  • [11] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” in NIPS, 2017, pp. 4299–4307.
  • [12] A. Wilson, A. Fern, and P. Tadepalli, “A bayesian approach for policy learning from trajectory preference queries,” in NIPS.
  • [13] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in Proceedings of the twenty-first international conference on Machine learning.   ACM, 2004, p. 1.
  • [14] A. Blidaru, S. L. Smith, and D. Kulic, “Assessing user specifications for robot task planning,” in RO-MAN, 2018.
  • [15] B. Korte and J. Vygen, Combinatorial Optimization: Theory and Algorithms, 4th ed.   Springer Publishing Company, Inc., 2007.
  • [16] K. G. Jamieson and R. Nowak, “Active ranking using pairwise comparisons,” in NIPS, 2011, pp. 2240–2248.
  • [17] L. Wasserman, All of Statistics: A Concise Course in Statistical Inference.   Springer Publishing Company, Incorporated, 2010.
  • [18] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” Journal of the American statistical association, vol. 58, no. 301, pp. 13–30, 1963.
  • [19] S. M. LaValle, Planning Algorithms.   Cambridge, U.K.: Cambridge University Press, 2006.
  • [20] L. G. Valiant, “The complexity of enumeration and reliability problems,” SIAM Journal on Computing, vol. 8, no. 3, pp. 410–421, 1979.
  • [21] D. Golovin and A. Krause, “Adaptive submodularity: Theory and applications in active learning and stochastic optimization,” Journal of Artificial Intelligence Research, vol. 42, pp. 427–486, 2011.