Feature Expansive Reward Learning: Rethinking Human Input

06/23/2020 · by Andreea Bobu, et al. (UC Berkeley)

In collaborative human-robot scenarios, when a person is not satisfied with how a robot performs a task, they can intervene to correct it. Reward learning methods enable the robot to adapt its reward function online based on such human input. However, this online adaptation requires low sample complexity algorithms which rely on simple functions of handcrafted features. In practice, pre-specifying an exhaustive set of features the person might care about is impossible; what should the robot do when the human correction cannot be explained by the features it already has access to? Recent progress in deep Inverse Reinforcement Learning (IRL) suggests that the robot could fall back on demonstrations: ask the human for demonstrations of the task, and recover a reward defined over not just the known features, but also the raw state space. Our insight is that rather than implicitly learning about the missing feature(s) from task demonstrations, the robot should instead ask for data that explicitly teaches it about what it is missing. We introduce a new type of human input, in which the person guides the robot from areas of the state space where the feature she is teaching is highly expressed to states where it is not. We propose an algorithm for learning the feature from the raw state space and integrating it into the reward function. By focusing the human input on the missing feature, our method decreases sample complexity and improves generalization of the learned reward over the above deep IRL baseline. We show this in experiments with a 7DOF robot manipulator. Finally, we discuss our method's potential implications for deep reward learning more broadly: taking a divide-and-conquer approach that focuses on important features separately before learning from demonstrations can improve generalization in tasks where such features are easy for the human to teach.


1 Introduction

When we deploy robots in human environments, they have to be able to adapt their reward functions to human preferences. For instance, in the scenario in Fig. 1, the robot was carrying the cup over the laptop, risking a spill, and the person intervened to correct it. Recent methods interpret such corrections as evidence about the human’s desired preference for how to complete the task, enabling the robot to update its reward function online Bajcsy et al. (2017); Jain et al. (2015); Bajcsy et al. (2018).

Because they have to perform updates online from very little input, these methods resort to representing the reward as a linear function of a small set of hand-engineered features. Unfortunately, this puts too much burden on system designers: specifying a priori an exhaustive set of all the features that end-users might care about is impossible for real-world tasks. While prior work has enabled robots to at least detect that the features they have access to cannot explain the human’s input Bobu et al. (2020), it is still unclear how the robot might then construct a feature that can explain it.

A natural answer is in deep IRL methods Wulfmeier et al. (2016); Finn et al. (2016); Levine et al. (2011), which learn rewards defined directly on the high-dimensional raw state (or observation) space, thereby constructing features automatically. When the robot can’t make sense of the human input, it can ask for demonstrations of the task, and learn a reward over not just the known features, but also the raw state space. On the bright side, the learned reward would now be able to explain the demonstrations. On the downside, this may come at the cost of losing generalization when venturing sufficiently far from the demonstrations Fu et al. (2018); Reddy et al. (2020).

In this work, we propose an alternative to relying on demonstrations: we can co-design the learner together with the type of human feedback we ask for. Our insight is that instead of learning about the missing feature(s) implicitly through the optimal actions, the robot should ask for data that explicitly teaches it what is missing. We introduce a new type of human input, which we call feature traces – partial trajectories that describe the monotonic evolution of the value of the feature to be learned. To provide a feature trace, the person guides the robot from an area of the state space where the missing feature is highly expressed to states where it is not, in a monotonic fashion.

Looking back at Fig. 1, the person teaches the robot to avoid the laptop by giving a few feature traces: she starts with the arm above the laptop and moves it away until comfortable with the distance from the object. We introduce a reward learning algorithm that harvests the structure inherent to feature traces and uses it to efficiently train a generalizable aspect of the reward: in our example, the distance from the laptop. In experiments on a 7DoF robot arm, we find that by taking control of not only the algorithm but also the kind of data it can receive, our method is able to recover more generalizable rewards with much less human input compared to a learning from demonstrations baseline.

Finally, we discuss our method’s potential implications for the general deep reward learning community. Feature traces enable humans to teach robots about salient aspects of the reward in an intuitive way, making it easier to learn overall rewards. This suggests that taking a divide-and-conquer approach focusing on learning important features separately before learning the reward could benefit IRL generalization and sample complexity. Although more work is needed to teach difficult features with even less supervision, we are excited to have taken a step towards better disambiguating complex reward functions that explain human inputs such as demonstrations in the context of reward learning.

2 Method

We consider a robot operating in the presence of a human. The robot has access to a set of features $\phi(s) = [\phi_1(s), \dots, \phi_N(s)]$ defined over states $s$, and is optimizing its current estimate of the reward function, $r_\theta(s) = \theta^\top \phi(s)$. Here, $\theta$ is a vector of parameters specifying how the features combine. If the robot is not executing the task according to the person’s preferences, the human can intervene with input $a_H$. For instance, $a_H$ might be an external torque that the person applies to change the robot’s current configuration. Or, they might stop the robot and kinesthetically demonstrate the task, resulting in a trajectory $\xi_H$. Building on prior work, we assume the robot can evaluate whether its existing feature space can explain the human input (Sec. 2.4). If it can, the robot directly updates its reward function parameters $\theta$, also in line with prior work Bajcsy et al. (2017); Ratliff et al. (2006) (Sec. 2.3). But what if it cannot? Below, we introduce a method for augmenting the robot’s feature space by soliciting further feature-specific human input (Sec. 2.1) and using it to learn a new feature mapping directly from the raw state space (Sec. 2.2).¹

¹Prior work proposed tackling this issue by iterative boosting Ratliff et al. (2007) or constructing binary features Levine et al. (2010); Choi and Kim (2013). We instead focus on features that can be non-binary, complex non-linear functions of the raw state space.

2.1 Collecting Feature Traces from Human Input

A state feature is an arbitrarily complex mapping $\phi : S \to [0, 1]$ over the raw state space. To learn a non-linear representation of $\phi$, we introduce feature traces $\xi = s_{0:n}$, a novel type of human input defined as a sequence of states monotonically decreasing in feature value, i.e. $\phi(s_i) \ge \phi(s_j)$ for all $i < j$. When learning a feature, the robot first queries the human for a set of traces $\Xi = \{\xi_1, \dots, \xi_K\}$. The person gives a trace by simply moving the system from a start state of their choosing to an end state, noisily ensuring monotonicity.

The power of feature traces lies in their inherent structure. Our algorithm, thus, makes certain assumptions in order to harvest this structure during the learning procedure. First, we assume that the feature values of states along the collected traces are monotonically decreasing. Since humans are imperfect, we allow users to violate the monotonicity assumption by modeling them as noisily rational, following the classic Bradley-Terry and Luce-Shepard models of preference Bradley and Terry (1952); Luce (1959):

$P\big(\phi(s_i) > \phi(s_j)\big) = \frac{\exp(\phi(s_i))}{\exp(\phi(s_i)) + \exp(\phi(s_j))}, \quad \forall\, i < j\,.$   (1)

Our method also assumes by default that $\phi(s_0) = 1$ and $\phi(s_n) = 0$, meaning the human starts in a state where the missing feature is highly expressed, then leads the system along decreasing feature values to a state where the feature is not expressed. Since in some situations providing such a 1-to-0 trace is difficult, our algorithm optionally allows the human to give labels $\tilde\phi(s_0), \tilde\phi(s_n) \in [0, 1]$ for the respective feature values.²

²Since providing decimal labels is challenging, the person instead gives a rating between 0 and 10.

To illustrate how a human might offer feature traces, let’s turn again to Fig. 1. Here, the person is teaching the robot to keep the mug away from the laptop. The person starts a trace at $s_0$ by placing the end-effector close to the object center, then leads the robot away from the laptop to $s_n$. Our method works best when the person tries to be informative, i.e. covers diverse areas of the space: the traces illustrated move radially in all directions and start at different heights. While for some features, like distance from a known object, it is easy to be informative, for others, like slowing down near objects, it might be more difficult. This limitation could be alleviated using active learning, shifting the burden of selecting informative traces away from the human and onto the robot, which would query for traces by proposing informative starting states. For instance, the robot could fit an ensemble of feature functions from traces online, and query for new traces from states where the ensemble disagrees Reddy et al. (2019).
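To make the input format concrete, below is a minimal Python sketch of how a feature trace could be represented and recorded while the person physically guides the arm. The `FeatureTrace` container, the `get_robot_state`/`stop_requested` callbacks, and the sampling rate are illustrative assumptions, not the paper's implementation.

```python
import time
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class FeatureTrace:
    """A feature trace: states recorded while the person guides the robot from a
    state where the missing feature is highly expressed (s_0) to one where it is
    not (s_n), noisily monotonically decreasing in feature value (Sec. 2.1)."""
    states: List[np.ndarray]            # raw states s_0, ..., s_n (e.g. 36D each)
    start_label: Optional[float] = 1.0  # assumed phi(s_0); optionally a 0-10 rating rescaled
    end_label: Optional[float] = 0.0    # assumed phi(s_n)

def record_trace(get_robot_state, stop_requested, rate_hz=10.0):
    """Hypothetical recording loop: sample the robot's raw state while the person
    guides the arm, until they signal the end of the trace."""
    states = []
    while not stop_requested():
        states.append(np.asarray(get_robot_state(), dtype=np.float64))
        time.sleep(1.0 / rate_hz)
    return FeatureTrace(states=states)
```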

2.2 Learning a Feature Function

We represent the missing feature by a neural network $\phi_\psi : S \to [0, 1]$ with parameters $\psi$. We use the feature traces $\Xi$, their inherent monotonicity, and the approximate knowledge about $\phi(s_0)$ and $\phi(s_n)$ to train $\phi_\psi$.

Using Eq. 1, we frame feature learning as a classification problem with a Maximum Likelihood objective over a dataset $\mathcal{D}$ of tuples $(s_i, s_j, \ell_{ij})$, where $\ell_{ij}$ is a label indicating which state has the higher feature value. First, due to monotonicity along a feature trace $\xi = s_{0:n}$, we have $\phi(s_i) \ge \phi(s_j)$ for $i < j$, so $\ell_{ij} = 1$ if $i < j$ and 0 otherwise. This results in $\binom{n+1}{2}$ tuples per trace. Second, we encode our knowledge that $\phi(s_0) \approx 1$ and $\phi(s_n) \approx 0$ for all traces by encouraging indistinguishable feature values at the starts and ends of traces: for every pair of traces $\xi_a, \xi_b \in \Xi$ we add the tuples $(s_0^a, s_0^b, 0.5)$ and $(s_n^a, s_n^b, 0.5)$. This results in a total dataset $\mathcal{D}$ that is already significantly large for a small set of feature traces. Denoting the predicted probability

$\hat{P}\big(\phi(s_i) > \phi(s_j)\big) = \frac{\exp(\phi_\psi(s_i))}{\exp(\phi_\psi(s_i)) + \exp(\phi_\psi(s_j))}\,,$   (2)

the final cross-entropy loss is then:

$\mathcal{L}(\psi) = -\sum_{(s_i, s_j, \ell_{ij}) \in \mathcal{D}} \ell_{ij} \log \hat{P}\big(\phi(s_i) > \phi(s_j)\big) + (1 - \ell_{ij}) \log\Big(1 - \hat{P}\big(\phi(s_i) > \phi(s_j)\big)\Big)\,.$   (3)

The optionally provided labels $\tilde\phi(s_0)$ and $\tilde\phi(s_n)$ are incorporated as additional penalty terms on $\phi_\psi(s_0)$ and $\phi_\psi(s_n)$ (e.g. their distance to the labels) to encourage the labeled feature values to approach the labels.
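To make the objective concrete, the following PyTorch sketch builds the tuple dataset from a set of traces and optimizes the Bradley-Terry cross-entropy of Eqs. (1)-(3). The network architecture follows App. B.1, but the helper names and the omission of output normalization and optional label bonuses are simplifying assumptions, not the released implementation.

```python
import itertools
import torch
import torch.nn as nn

class FeatureNet(nn.Module):
    """Small MLP mapping a raw state to a scalar feature value (Sec. 2.2, App. B.1)."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),
        )
    def forward(self, s):
        return self.net(s).squeeze(-1)

def make_tuples(traces):
    """Build (s_i, s_j, label) comparisons: label 1.0 means the first state has the
    higher feature value. Monotonicity gives all ordered pairs within a trace;
    start/end equivalence across traces is encoded with label 0.5."""
    s_a, s_b, labels = [], [], []
    for tr in traces:                                   # monotonicity tuples
        for i, j in itertools.combinations(range(len(tr)), 2):
            s_a.append(tr[i]); s_b.append(tr[j]); labels.append(1.0)
    for t1, t2 in itertools.combinations(traces, 2):    # start/end equivalence tuples
        for x, y in [(t1[0], t2[0]), (t1[-1], t2[-1])]:
            s_a.append(x); s_b.append(y); labels.append(0.5)
    stack = lambda xs: torch.stack([torch.as_tensor(x, dtype=torch.float32) for x in xs])
    return stack(s_a), stack(s_b), torch.tensor(labels)

def bt_loss(phi, s_a, s_b, labels):
    """Bradley-Terry cross-entropy (Eqs. 1-3): the logit is the feature difference."""
    logits = phi(s_a) - phi(s_b)
    return nn.functional.binary_cross_entropy_with_logits(logits, labels)

# Usage sketch: `traces` is a list of lists of raw state vectors (Sec. 2.1).
# phi = FeatureNet(state_dim=36)
# opt = torch.optim.Adam(phi.parameters(), lr=1e-3, weight_decay=1e-3)
# s_a, s_b, y = make_tuples(traces)
# for epoch in range(100):
#     opt.zero_grad(); loss = bt_loss(phi, s_a, s_b, y); loss.backward(); opt.step()
```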

2.3 Online Reward Update

Once we have a new feature, the robot appends it to its feature vector, $\phi(s) \leftarrow [\phi_1(s), \dots, \phi_N(s), \phi_\psi(s)]$. At this point, the robot goes back to the original human input that previously could not be explained by the old features and uses it to update its estimate of the reward parameters $\theta$. Here, any prior work on online reward learning from user input is applicable, but we highlight one example to make the exposition complete.

Input: features $\phi$, weights $\theta$, confidence threshold $\epsilon$, robot trajectory $\xi_R$, fixed number of traces $K$
while executing $\xi_R$ do
       if human input $a_H$ received then
              $\hat\beta \leftarrow$ estimate_confidence($a_H$) as in Sec. 2.4
              if $\hat\beta < \epsilon$ then
                     for $i \leftarrow 1$ to $K$ do
                            $\xi_i \leftarrow$ query_human_feature_trace() as in Sec. 2.1
                     $\phi \leftarrow [\phi,$ learn_feature($\xi_1, \dots, \xi_K$)$]$ as in Sec. 2.2
              $\theta \leftarrow$ update_reward($\phi, \theta, a_H$) as in Sec. 2.3
              $\xi_R \leftarrow$ replan_trajectory($\theta, \phi$)
Algorithm 1 FERL Algorithm Overview

For instance, take the setting where the human’s original input was an external torque, applied as the robot was tracking a trajectory $\xi_R$ that was optimal with respect to its current reward. Prior work Bajcsy et al. (2017) has modeled this as inducing a deformed trajectory $\xi_H$, by propagating the change in configuration to the rest of the trajectory. Further, let $\theta$ define linear weights on the reward features. Then, the robot updates its estimate $\hat\theta$ in the direction of the feature change from $\xi_R$ to $\xi_H$:

$\hat\theta' = \hat\theta + \alpha\,\big(\Phi(\xi_H) - \Phi(\xi_R)\big)\,,$   (4)

where $\Phi(\xi) = \sum_{s \in \xi} \phi(s)$ is the cumulative feature sum along a trajectory. If instead the human intervened with a full demonstration, work on online learning from demonstrations (Sec. 3.2 in Ratliff et al. (2006)) derives the same update, with $\xi_H$ now the human demonstration. In our implementation, we use corrections and follow Bajcsy et al. (2018), which shows that people more easily correct one feature at a time, and only update the index of $\theta$ corresponding to the feature that changes the most (if our feature learning worked correctly, this is the new feature we just learned). After the update, the robot replans its trajectory using the new reward.
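Below is a minimal sketch of the update in Eq. (4), restricted to one feature at a time as in Bajcsy et al. (2018). The function names, step size, and trajectory representation (a list of states) are assumptions for illustration, not the paper's code.

```python
import numpy as np

def feature_counts(phi_fns, trajectory):
    """Cumulative feature sum Phi(xi) = sum_{s in xi} phi(s) along a trajectory."""
    return np.array([sum(phi(s) for s in trajectory) for phi in phi_fns])

def update_reward_weights(theta, phi_fns, xi_robot, xi_human, alpha=0.1,
                          one_feature_at_a_time=True):
    """Update theta in the direction of the feature change from the robot's
    trajectory xi_robot to the human-deformed trajectory xi_human (Eq. 4)."""
    delta = feature_counts(phi_fns, xi_human) - feature_counts(phi_fns, xi_robot)
    if one_feature_at_a_time:
        # Only update the feature whose count changed the most (Bajcsy et al., 2018);
        # after feature learning this is typically the newly added feature.
        mask = np.zeros_like(delta)
        mask[np.argmax(np.abs(delta))] = 1.0
        delta = delta * mask
    return theta + alpha * delta
```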

2.4 Confidence Estimation

We lastly have to detect that a feature is missing in the first place. Prior work does so by looking at how people’s choices are modeled via the Boltzmann noisily-rational decision model:

$P(\xi_H \mid \theta, \beta) = \frac{e^{\beta R_\theta(\xi_H)}}{\int e^{\beta R_\theta(\bar\xi)}\, d\bar\xi}\,,$   (5)

where the human chooses trajectories proportional to their exponentiated reward Baker et al. (2007); Von Neumann and Morgenstern (1945). The coefficient $\beta$ controls the degree to which the robot expects to observe human interventions consistent with its feature space. Typically, $\beta$ is fixed, recovering the Maximum Entropy IRL Ziebart et al. (2008) observation model. Inspired by work in Fridovich-Keil et al. (2019); Fisac et al. (2018); Bobu et al. (2020), we instead reinterpret $\beta$ as a confidence in the ability of the robot’s features to explain human data. To detect missing features, we estimate $\hat\beta$ via a Bayesian belief update $b'(\beta) \propto P(\xi_H \mid \theta, \beta)\, b(\beta)$. If $\hat\beta$ is above a threshold $\epsilon$, our method updates the reward as usual with its current features; if $\hat\beta < \epsilon$, the current features are insufficient and the robot stops and enters the feature learning mode. Algorithm 1 summarizes the full procedure.
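One way to implement this estimate is to maintain a discretized belief over β and approximate the normalizer of Eq. (5) with trajectory samples. The grid of β values, the sample-based partition function, and the use of the posterior mean as the estimate are assumptions of this sketch rather than the exact procedure of the cited work.

```python
import numpy as np

def update_beta_belief(belief, betas, reward_fn, xi_human, xi_samples):
    """One Bayesian belief update b'(beta) ∝ P(xi_human | theta, beta) b(beta),
    with the integral in Eq. (5) approximated by a set of sampled trajectories."""
    r_h = reward_fn(xi_human)
    r_samples = np.array([reward_fn(xi) for xi in xi_samples])
    likelihoods = []
    for beta in betas:
        # P(xi_H | theta, beta) ≈ exp(beta r_H) / sum over samples of exp(beta r)
        log_z = np.logaddexp.reduce(beta * r_samples)
        likelihoods.append(np.exp(beta * r_h - log_z))
    posterior = belief * np.array(likelihoods)
    return posterior / posterior.sum()

# Usage sketch:
# betas = np.linspace(0.0, 10.0, 50); belief = np.ones_like(betas) / len(betas)
# belief = update_beta_belief(belief, betas, reward_fn, xi_human, xi_samples)
# beta_hat = (belief * betas).sum()     # posterior mean as the confidence estimate
# if beta_hat < epsilon:                # features cannot explain the input:
#     pass                              #   enter feature learning mode (Alg. 1)
```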

3 FERL for Feature Learning

Before investigating the benefits of our method (FERL) for reward learning, we analyze the quality of the features it can learn. We present results for six different features of varying complexity.

Figure 2: Visualization of the experimental setup, learned feature values $\phi_\psi(s)$, and training feature traces for table (left) and laptop (right). We display the feature values for states sampled from the reachable set of the 7DoF arm, as well as their projections onto two of the coordinate planes.

Experimental Design. We conduct our experiments on a 7-DoF JACO robotic arm. We investigate six features that arise in the context of personal robotics: a) table: distance of the End-Effector (EE) to the table; b) coffee: keeping the coffee cup upright; c) laptop: 0.3m xy-plane distance of the EE to a laptop position, to avoid passing over the laptop at any height; d) test laptop location: same as laptop but the test position differs from the training positions; e) proxemics: keeping the EE away from the human, more so when moving in front of them, and less so when moving on their side; f) between objects: 0.2m xy-plane distance of the EE to two objects – the feature penalizes collision with the objects and, to a lesser extent, passing in between the two objects. This feature requires some traces with explicit labels (Sec. 2.1). We approximate all features by small neural networks (2 layers, 64 units each), and train them on a set of feature traces (see App. B.1 for details).

For each feature, we collected a set of 20 feature traces (40 for the complex test laptop location and between objects) from which we sample subsets for training. As described in Sec. 2.1, the human teaching the feature starts at a state where the feature is highly expressed e.g. for laptop that is the EE positioned vertically above the laptop. She then moves the EE away until the distance is equal to the desired bump radius. She does this for a few different directions and heights to provide a diverse dataset. For each feature, we determine what an informative and intuitive set of traces would be, i.e. how to choose the starting states to cover enough of the space (details in App. A.1).

Our raw state space consists of the 27D positions of all robot joints and objects in the scene, as well as the rotation matrix of the EE with respect to the robot base. We assume known object positions but they could be obtained from a vision system. It was surprisingly difficult to train on both positions and orientations due to spurious correlations in the raw state space, hence we show results for training only on positions or only on orientations. This speaks to the need for methods that can handle correlated input spaces, which we expand on in App. A.3.

We test feature learning by manipulating the number of traces the learner gets access to, and measuring error against the ground truth (GT) feature $\phi_{\text{GT}}$ on a test set of states $\mathcal{S}_{\text{test}}$. To form $\mathcal{S}_{\text{test}}$, we uniformly sample 10,000 states from the robot’s reachable set. Importantly, many of these test points are far from the training traces, probing the generalization of the learned features $\phi_\psi$. We measure error via the Mean-Squared-Error, $\text{MSE} = \frac{1}{|\mathcal{S}_{\text{test}}|} \sum_{s \in \mathcal{S}_{\text{test}}} \big(\phi_\psi(s) - \phi_{\text{GT}}(s)\big)^2$. To ground the MSE values, we normalize them by the mean MSE of a randomly initialized, untrained feature function, such that a value of 1.0 equals random performance. For each number of feature traces, we run 10 experiments sampling different traces from the collected set, and report the mean and standard error of the normalized MSE.
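A small sketch of this metric is given below, assuming the learned and GT feature functions accept a batch of test states and that a set of randomly initialized, untrained networks is available for the normalization; the interfaces are illustrative.

```python
import numpy as np

def normalized_mse(phi_learned, phi_gt, s_test, random_phis):
    """MSE of the learned feature against ground truth on the test set, normalized
    by the mean MSE of randomly initialized, untrained feature functions, so that
    a value of 1.0 corresponds to random performance."""
    mse = lambda phi: np.mean((phi(s_test) - phi_gt(s_test)) ** 2)
    return mse(phi_learned) / np.mean([mse(phi) for phi in random_phis])
```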

We test these hypotheses: H1) With enough data, FERL learns good features; H2) FERL learns increasingly better features with more data; H3) FERL becomes less input-sensitive with more data.

Qualitative results. We first inspect the qualitative results of FERL. In Fig. 2 we show the learned table and laptop features by visualizing the position of the EE for all 10,000 points in our test set. The color of the points encodes the learned feature values from low (blue) to high (yellow): table is highest when the EE is farthest, while laptop peaks when the EE is above the laptop. In Fig. 3, we visualize the remaining features by projecting the test points onto 2D sub-spaces and plotting the average feature value per 2D grid point. For Euclidean features we use the EE’s xy-plane, and for coffee we project the z-axis basis vector of the EE after forward kinematic rotations onto the xz-plane (an upward arrow represents the cup upright). White pixels are an artifact of sampling.

The top row of Fig. 3 illustrates the Ground Truth (GT) feature values and the bottom row the trained feature $\phi_\psi$. We observe that $\phi_\psi$ resembles the GT very well for most features. Our most complex feature, between objects, does not recreate the GT as well, although it does learn the general shape. However, we note in App. C.1 that with a smaller raw input space it is able to learn the fine-grained GT structure. This implies that spurious correlations in the input space are a problem, hence for complex features more data or active learning methods for collecting informative traces are required.

Figure 3: The plots display the ground truth (top) and learned (bottom) feature values over $\mathcal{S}_{\text{test}}$, averaged and projected onto a representative 2D subspace: the xy-plane for all Euclidean features, and the orientation plane for coffee (the arrow represents the cup upright).
Figure 4: For each feature, we show the normalized MSE mean and standard error across 10 random seeds with an increasing number of traces (orange) compared to random performance (gray).

Quantitative analysis. We now discuss the quantitative results of our experiments for all 6 features. Fig. 4 displays the means and standard errors of the normalized MSE across 10 seeds for each feature as the amount of data increases. The figure drives four core observations: 1) the learned features perform significantly better than random, supporting H1; 2) the mean error decreases with more data, supporting H2 and demonstrating that FERL can learn high quality features with little human input; 3) the standard error decreases with more data, meaning that with more traces FERL is insensitive to the exact input received, supporting H3; 4) the more complex a feature, the more traces are needed for good performance: while table and laptop perform well with very few traces, some other features require more. In summary, the qualitative and quantitative results support our hypotheses and suggest that our training methodology requires few traces to reliably learn feature functions that generalize well to states not seen during training.

4 FERL for Reward Learning

Experimental Setup. We compare FERL reward learning to an adapted Maximum Entropy Inverse Reinforcement Learning (ME-IRL) baseline Ziebart et al. (2008); Finn et al. (2016); Wulfmeier et al. (2016), which learns a deep reward function from demonstrations. We model the ME-IRL reward function $r_\omega$ as a neural network with 2 layers, 128 units each. For a fair comparison, we gave $r_\omega$ access to the known features: once the 27D Euclidean input is mapped to a final neuron, a last layer combines it with the known feature vector.

Also for a fair comparison, we took great care to collect a set of demonstrations for ME-IRL designed to be as informative as possible (protocol detailed in App. A.2). Moreover, FERL and ME-IRL rely on different types of human input: FERL on feature traces and pushes, and ME-IRL on a set of near-optimal demonstrations $\mathcal{D}^*$. To level the amount of data each method has access to, we collected the sets of traces and demonstrations such that ME-IRL has more data points: the average number of states per demonstration and per trace in our experiments was 61 and 39, respectively.

We run experiments in three settings in which two features are known and one feature is unknown. In case 1, the laptop feature is unknown; in case 2, the table feature is unknown; and in case 3, the proxemics feature is unknown. In each case, the true reward combines the unknown feature with a known one, and we additionally include a known feature with zero weight to evaluate whether the methods can learn to ignore irrelevant features.

The gradient of the Maximum Entropy objective with respect to the reward parameters $\omega$ can be estimated as $\nabla_\omega \mathcal{L} \approx \frac{1}{|\mathcal{D}^*|} \sum_{\xi \in \mathcal{D}^*} \nabla_\omega r_\omega(\xi) - \frac{1}{|\mathcal{D}^\omega|} \sum_{\xi \in \mathcal{D}^\omega} \nabla_\omega r_\omega(\xi)$ Wulfmeier et al. (2016); Finn et al. (2016). Here, $r_\omega$ is the parametrized reward, $\mathcal{D}^*$ the set of expert demonstrations, and $\mathcal{D}^\omega$ a set of trajectory samples from the induced near-optimal policy. We use TrajOpt Schulman et al. (2013) to obtain the current set of samples $\mathcal{D}^\omega$ (see App. B.2 for details). We validated our ME-IRL implementation and observed that it quickly learns a reward that induces a state expectation very close to that of the expert demonstrations.
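A PyTorch sketch of one step of this sample-based gradient estimate is shown below; the trajectory representation (lists of state tensors) and the reward-network interface are assumptions, and the known-feature pass-through described above is omitted for brevity.

```python
import torch

def traj_reward(reward_net, trajectory):
    """Cumulative reward of a trajectory (list of state tensors) under the current network."""
    return reward_net(torch.stack(trajectory)).sum()

def meirl_step(reward_net, optimizer, expert_trajs, sampled_trajs):
    """One gradient step on the (negated) MaxEnt objective: the gradient is roughly
    the expert feature/reward term minus the sampled term, so we minimize the mean
    sampled reward minus the mean expert reward."""
    expert_term = torch.stack([traj_reward(reward_net, xi) for xi in expert_trajs]).mean()
    sample_term = torch.stack([traj_reward(reward_net, xi) for xi in sampled_trajs]).mean()
    loss = sample_term - expert_term   # push up expert reward, push down sampled reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```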

Figure 5: Visual comparison of the ground truth, FERL, and ME-IRL rewards.

We compare the two reward learning methods across two metrics commonly used in the IRL literature Choi and Kim (2011): 1) Reward Accuracy: how close the learned reward is to GT by some distance metric, and 2) Behavior Accuracy: how well the behaviors induced by the learned rewards compare to the GT optimal behavior, measured by evaluating the induced trajectories on the GT reward.

For Reward Accuracy, we manipulate the number of traces/demonstrations each learner gets access to, and measure the MSE of the learned reward compared to the GT reward on $\mathcal{S}_{\text{test}}$, similar to Sec. 3. For Behavior Accuracy, we train FERL and ME-IRL with a set of 10 traces/demonstrations that we deemed to be most informative. We then use TrajOpt Schulman et al. (2013) to produce optimal trajectories for 100 randomly selected start-goal pairs under the learned rewards. We evaluate the trajectories with the GT reward and divide by the reward of the GT-induced trajectory for easy relative comparison.

We use these metrics to test the hypotheses: H1) FERL learns rewards that better generalize to the state space than ME-IRL; H2) FERL performance is less sensitive to the specific input than ME-IRL.

Qualitative Comparison. In Fig. 5, we show the learned FERL and ME-IRL rewards as well as the GT for case 1 evaluated at the test points (see App. C.2 for cases 2 and 3). As we can see, by first learning the laptop feature and then the reward on the extended feature vector, FERL is able to learn a fine-grained reward structure closely resembling GT. Meanwhile, ME-IRL learns some structure capturing where the laptop is, but not enough to result in a good trade-off between the active features.

Quantitative Analysis. To compare Reward Accuracy, we show in Fig. 6 the MSE mean and standard error across 10 seeds, with increasing training data. We visualize results from all 3 cases, with FERL in orange and ME-IRL in gray. FERL is closer to GT than ME-IRL no matter the amount of data, supporting H1. Additionally, the consistently decreasing mean MSE for FERL suggests that our method gets better with more data; in contrast, no such trend is apparent for ME-IRL. Supporting H2, the high standard error that ME-IRL displays implies that it is highly sensitive to the demonstrations provided, and that the learned reward likely overfits to the expert demonstrations. With more data, this shortcoming might disappear; however, that would pose an additional burden on the human, which our method successfully alleviates.

Lastly, we looked at Behavior Accuracy for the two learned rewards. Fig. 7 illustrates the reward ratios to GT for all three cases. The GT ratio is 1 by default, and the closer to 1 the ratios are, the better the performance because all rewards are negative. The figure confirms H1, showing that FERL rewards produce trajectories that are preferred under the GT reward over ME-IRL reward trajectories.

Figure 6: MSE of FERL and ME-IRL to GT reward.
Figure 7: Induced trajectories’ reward ratio.

5 Discussion of Results, Implications, and Future Work

In this work, we proposed FERL, a framework for learning rewards from corrections when the initial feature set cannot capture human preferences. Based on our insight that the robot should ask for data that explicitly teaches it what is missing, we introduced feature traces as a novel type of human input that allows for intuitive teaching and learning of non-linear features from high-dimensional state spaces. In experiments, we analyzed the quality of the learned features and showed that FERL outperforms a deep reward learning from demonstrations baseline (ME-IRL) in terms of data-efficiency, generalization, and sensitivity to input data.

Potential Implications for Learning Complex Rewards from Demonstrations. Reward learning from the raw state space with expressive function approximators is considered difficult because there exists a large set of functions that could explain the human input; for demonstrations, for example, many functions induce policies that match the state expectation of the demonstrations. The higher dimensional the state space, the more information from human input is needed to disambiguate between those functions sufficiently to find a reward that accurately captures human preferences, and thereby generalizes to states not seen during training rather than just replicating the demonstrations’ state expectations as in standard IRL. We are hopeful that our method of collecting feature traces rather than just demonstrations has implications for non-linear (deep) reward learning broadly, as a way to better disambiguate the reward and improve generalization.

While in this paper we focused on adapting a reward online, we also envision our method being used as part of a "divide-and-conquer" alternative to IRL: first, collect feature traces for the important non-linear criteria of the reward, and then use demonstrations to figure out how to (shallowly) combine them. The reason this might help relative to relying on demonstrations for everything is that demonstrations aggregate a lot of information. First, by learning features, we can isolate learning what matters from learning how to trade off or combine what matters into a single value (the features vs. their combination) – in contrast, demonstrations have to teach the robot about both at once. Second, feature traces give information about states that are not going to be on optimal trajectories, be it states with high feature values that are undesirable, or states with low feature values where other, more important features have high values. Third, feature traces are also structured by the monotonicity assumption: they tell us relative feature values of the states along a trace, whereas demonstrations only tell us about the reward value in aggregate across a trajectory. These might be the reasons behind the results in Figs. 6 and 7, where the FERL reward reliably generalized better to new states than the demonstration-only IRL baseline.

Limitations and Future Work. There are four main limitations of FERL which we seek to address in future work. First, due to the current pandemic, we could not run a user study, so we do not know how well non-expert users would provide feature traces. Second, with the current feature learning protocol, it is cumbersome to teach discontinuous features, so we would like to extend our approach with other feature learning protocols. Third, while we show that FERL works reliably in 27D, we want to extend it to higher dimensional state spaces. Our initial results in the appendix show that this is difficult if the raw states are highly correlated. We believe techniques from causal learning, such as Invariant Risk Minimization Arjovsky et al. (2019), can be helpful. Lastly, we want to further ease the human supervision burden by developing an active learning approach where the robot autonomously picks starting states most likely to result in informative feature traces. To prevent learning incorrect features, we also want to enable the human to validate a learned feature, e.g. through visualization, before it is added to the feature set.

Broader Impact

In this section, we consider our work’s broader ethical implications and potential future societal consequences.

On the positive side, we are excited about our work’s potential benefits for the field of human-robot interaction, and human-AI interaction in general. As we envision it, FERL allows people to more easily communicate to the agent what they want and care about. If used properly, our technology would benefit the end users, as autonomous agents will be able to produce the desired behaviors more easily and accurately. We hope that this work helps bring the field closer to solving the value alignment problem Hadfield-Menell et al. (2016), which would have long-term implications for the future of AI and its relationship to humanity Bostrom (2014). Lastly, as discussed in Sec. 5, we hope that our work will contribute to the IRL community, relieving the burden of communicating rewards for system designers and end-users alike.

Despite all these benefits, our contribution could have unintended consequences. Perhaps the biggest concern comes from what happens when the input to our method is imperfect and results in learning incorrect features that impair performance. This could happen for a variety of reasons: the person’s input violating our algorithm’s assumptions, the system failing for some unexpected reason, or even people acting in bad faith and teaching the robot the wrong aspect to optimize for on purpose. For all these cases, future work must investigate ways in which the robot can determine whether to accept or reject the newly learned feature. Although this is still being studied, we are hopeful that advances in AI safety like Hadfield-Menell et al. (2017) could help ensure that these adverse scenarios don’t happen.

This work was supported by the Air Force Office of Scientific Research (AFOSR), the DARPA Assured Autonomy Grant, and the German Academic Exchange Service (DAAD).

References

  • M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz (2019) Invariant risk minimization. ArXiv abs/1907.02893. Cited by: §A.3, §5.
  • A. Bajcsy, D. P. Losey, M. K. O’Malley, and A. D. Dragan (2018) Learning from physical human corrections, one feature at a time. In Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, HRI ’18, New York, NY, USA, pp. 141–149. External Links: ISBN 978-1-4503-4953-6, Link, Document Cited by: §1, §2.3.
  • A. Bajcsy, D. P. Losey, M. K. O’Malley, and A. D. Dragan (2017) Learning robot objectives from physical human interaction. In CoRL, Cited by: §1, §2.3, §2.
  • C. Baker, J. B. Tenenbaum, and R. Saxe (2007) Goal inference as inverse planning. Cited by: §2.4.
  • A. Bobu, A. Bajcsy, J. F. Fisac, S. Deglurkar, and A. D. Dragan (2020) Quantifying hypothesis space misspecification in learning from human–robot demonstrations and physical corrections. IEEE Transactions on Robotics, pp. 1–20. Cited by: §1, §2.4.
  • N. Bostrom (2014) Superintelligence: paths, dangers, strategies. 1st edition, Oxford University Press, Inc., USA. External Links: ISBN 0199678111 Cited by: Broader Impact.
  • R. A. Bradley and M. E. Terry (1952) Rank analysis of incomplete block designs: i. the method of paired comparisons. Biometrika 39 (3/4), pp. 324–345. Cited by: §2.1.
  • J. Choi and K. Kim (2011) Inverse reinforcement learning in partially observable environments. Journal of Machine Learning Research 12 (Mar), pp. 691–730. Cited by: §4.
  • J. Choi and K. Kim (2013) Bayesian nonparametric feature construction for inverse reinforcement learning. In Twenty-Third International Joint Conference on Artificial Intelligence. Cited by: footnote 1.
  • C. Finn, S. Levine, and P. Abbeel (2016) Guided cost learning: deep inverse optimal control via policy optimization. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pp. 49–58. Cited by: §1, §4, §4.
  • J. F. Fisac, A. Bajcsy, S. L. Herbert, D. Fridovich-Keil, S. Wang, C. J. Tomlin, and A. D. Dragan (2018) Probabilistically safe robot planning with confidence-based human predictions. Robotics: Science and Systems (RSS). Cited by: §2.4.
  • D. Fridovich-Keil, A. Bajcsy, J. F. Fisac, S. L. Herbert, S. Wang, A. D. Dragan, and C. J. Tomlin (2019) Confidence-aware motion prediction for real-time collision avoidance. International Journal of Robotics Research. Cited by: §2.4.
  • J. Fu, K. Luo, and S. Levine (2018) Learning robust rewards with adverserial inverse reinforcement learning. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • D. Hadfield-Menell, A. Dragan, P. Abbeel, and S. Russell (2016) Cooperative inverse reinforcement learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, Red Hook, NY, USA, pp. 3916–3924. External Links: ISBN 9781510838819 Cited by: Broader Impact.
  • D. Hadfield-Menell, A. Dragan, P. Abbeel, and S. Russell (2017) The off-switch game. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI’17, pp. 220–227. External Links: ISBN 9780999241103 Cited by: Broader Impact.
  • A. Jain, S. Sharma, T. Joachims, and A. Saxena (2015) Learning preferences for manipulation tasks from online coactive feedback. The International Journal of Robotics Research 34 (10), pp. 1296–1313. Cited by: §1.
  • S. Levine, Z. Popovic, and V. Koltun (2010) Feature construction for inverse reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1342–1350. Cited by: footnote 1.
  • S. Levine, Z. Popovic, and V. Koltun (2011) Nonlinear inverse reinforcement learning with gaussian processes. In Advances in Neural Information Processing Systems, pp. 19–27. Cited by: §1.
  • R. D. Luce (1959) Individual choice behavior. John Wiley, Oxford, England. Cited by: §2.1.
  • N. Ratliff, D. M. Bradley, J. Chestnutt, and J. A. Bagnell (2007) Boosting structured prediction for imitation learning. In Advances in Neural Information Processing Systems, pp. 1153–1160. Cited by: footnote 1.
  • N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich (2006) Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, New York, NY, USA, pp. 729–736. External Links: ISBN 1595933832, Link, Document Cited by: §2.3, §2.
  • S. Reddy, A. D. Dragan, S. Levine, S. Legg, and J. Leike (2019) Learning human objectives by evaluating hypothetical behavior. External Links: 1912.05652 Cited by: §2.1.
  • S. Reddy, A. D. Dragan, and S. Levine (2020) SQIL: imitation learning via reinforcement learning with sparse rewards. arXiv preprint. Cited by: §1.
  • J. Schulman, J. Ho, A. X. Lee, I. Awwal, H. Bradlow, and P. Abbeel (2013) Finding locally optimal, collision-free trajectories with sequential convex optimization.. In Robotics: science and systems, Vol. 9, pp. 1–10. Cited by: §B.2, §4, §4.
  • J. Von Neumann and O. Morgenstern (1945) Theory of games and economic behavior. Princeton University Press Princeton, NJ. Cited by: §2.4.
  • M. Wulfmeier, D. Z. Wang, and I. Posner (2016) Watch this: scalable cost-function learning for path planning in urban environments. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2089–2095. Cited by: §1, §4, §4.
  • B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey (2008) Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3, AAAI’08, pp. 1433–1438. External Links: ISBN 978-1-57735-368-3, Link Cited by: §2.4, §4.

Appendix A Experimental Details

A.1 Extended FERL Protocols for Feature Trace Collection

In this section, we present our protocol for collecting feature traces for the six features discussed in Sec. 3. As we will see, the traces collected from the human only noisily satisfy the assumptions in Sec. 2.1. Nevertheless, as we showed in Sec. 3, FERL is able to learn high quality feature functions.

For table, the person’s goal is to teach that being close to the table, anywhere on its plane, is desirable, whereas being far away in height is undesirable. As such, we saw in Fig. 2 on the left that traces should traverse the space from up high until reaching the table. A few different starting configurations are helpful, not necessarily to cover the whole state space, but rather to have signal in the data: having the same trace 10 times would be no different from having it once.

For laptop, as described in the text and shown in Fig. 2 on the right, the person starts in the middle of the laptop, and moves away a distance equal to the bump radius desired. Having traces from a few different directions and heights helps learn a more distinct feature. For test laptop location, the laptop’s location at test time is not seen during training. Thus, the training traces should happen with various laptop positions, also starting in the middle and moving away as much distance as desired.

Figure 8: Feature traces for coffee. We show the values of the z-axis basis vector of the End-Effector (EE) orientation. The traces start with the EE pointing downwards and move it upwards.
Figure 9: Feature traces for proxemics, with the human position marked. The GT feature values are projected onto the xy-plane.
Figure 10: Feature traces for between objects. The GT values are projected onto the xy-plane.

When teaching the robot to keep the cup upright (coffee), the person starts their traces by placing the robot in a position where the cup is upside-down, then moving the arm or rotating the End-Effector (EE) such that it points upright. Doing this for a few different start configurations helps. Fig. 8 shows example traces colored with the true feature values.

When learning proxemics, the goal is to keep the EE away from the human, more so when moving in front of their face, and less so when moving on their side. As such, when teaching this feature, the person places the robot right in front of the human, then moves it away until hitting the contour of some desired imaginary ellipse: moving further away in front of the human, and not as far to the sides, in a few directions. Fig. 9 shows example traces colored with the Ground Truth (GT) feature values.

Lastly, for between objects there are a few types of traces, all illustrated in Fig. 10. First, to teach a high feature value on top of the objects, some traces need to start on top of them and move away radially. Next, the person has a few options: 1) recording a few traces spanning the line between the objects, at different heights, and labeling the start and the end the same; 2) starting anywhere on the imaginary line between the objects, moving perpendicularly away by the desired distance, and labeling the start; 3) starting on top of one of the objects, moving towards the other, then turning away in the direction orthogonal to the line between the objects.

A.2 Extended ME-IRL Protocols for Demonstration Collection

In an effort to make the ME-IRL comparison fair, we paid careful attention to collecting informative demonstrations. As such, for each unknown feature, we recorded a good mix of 20 demonstrations about the unknown feature only (with a focus on learning about it), the known feature only (to learn a reward weight on it), and both of them (to learn a reward weight combination on them). For all features, we chose diverse start and goal configurations to trace the demonstrations.

For case 1, we had a mix of demonstrations that already start close to the table and focus on going around the laptop, ones that are far away enough from the laptop such that only staying close to the table matters, and ones where both features are considered at the same time. Fig. 11 (left) shows examples of such demonstrations: the two in the back start far away enough from the laptop but at a high height, and the two in the front start above the laptop at different heights.

For case 2, we collected a similar set of trajectories, although we had more demonstrations attempting to stay close to the table when the laptop was already far away. Fig. 11 (middle) shows a few examples: the two in the back start far away from the laptop and only focus on staying close to the table, a few more start at a high height but need to avoid the laptop to reach the goal, and another two start above the laptop and move away from it.

For case 3, the most difficult one, some demonstrations had to avoid the person slightly to their side, while others needed to avoid the person more aggressively in the front. We also varied the height and start-goal locations, to ensure that we learned about each feature separately, as well as together. Fig. 11 (right) shows a few of the collected demonstrations.

Figure 11: A few representative demonstrations collected for case 1(left), case 2 (middle), and case 3 (right). The colors signify the true reward values in each case, where yellow is low and blue is high.

A.3 Discussion of Raw State Space Dimensionality

Figure 12: Quantitative feature learning results for the 36D raw state space without (top) and with (bottom) the subspace selection heuristic. For each feature, we show the normalized MSE mean and standard error across 10 random seeds with an increasing number of traces (orange) compared to random performance (gray).

In our experiments in Sec. 3 and 4, we chose a 36D input space made up of 27 Euclidean coordinates (xyz positions of all robot joints and environment objects) and the 9 entries of the EE’s rotation matrix. We now explain how we chose this raw state space, how spurious correlations across different dimensions can reduce feature learning quality, and how this adverse effect can be alleviated.

First, note that the robot’s 7 joint angles and the positions of the objects are the most succinct representation of the state, because the positions and rotation matrices of the joints can be determined from the angles via forward kinematics. With enough data, the neural network should be able to implicitly learn forward kinematics and the feature function on top of it. However, we found that applying forward kinematics a-priori and giving the network access to the positions and rotation matrices for each joint improve both data efficiency and feature quality significantly. In its most comprehensive setting, thus, the raw state space can be 97D (7 angles, 21 coordinates of the joints, 6 coordinates of the objects, and 63 entries in rotation matrices of all joints).

Unfortunately, getting neural networks to generalize on such high dimensional input spaces, especially with the little data that we have access to, is very difficult. Due to the redundancy of the information in the 97D state space, the feature network frequently picks up on spurious correlations in the input space, which decreases the generalization performance of the learned feature. In principle, this issue could be resolved with more diverse and numerous data. However, since our goal was to make feature learning as little effort as possible for the human, we instead opted for the reduced 36D state space, focusing directly on the positions and the EE orientation.

Now, as noted in Sec. 3, the spurious correlations in the 36D space still made it difficult to train on both the position and orientation subspaces at once. To better separate the redundant information, we devised a heuristic to automatically select the appropriate subspace for a feature. For each subspace, the algorithm first trains a separate network for 10 epochs on half of the input traces and evaluates its generalization ability on the other half using the FERL loss. The subspace model with the lower loss (better generalization) is then used for the feature and trained on all traces. We found this heuristic to work fairly well, selecting the right subspace on average in about 85% of experiments.
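A sketch of this heuristic is shown below; `train_feature` and `eval_loss` stand in for the FERL training procedure and comparison loss of Sec. 2.2 and are assumed helpers, not taken from the released code.

```python
def select_subspace(traces_by_subspace, train_feature, eval_loss, n_epochs=10):
    """For each candidate raw-state subspace (e.g. positions vs. orientations),
    train a separate feature network for a few epochs on half of the traces,
    evaluate the FERL loss on the held-out half, and keep the subspace whose
    model generalizes best (lowest held-out loss)."""
    best_subspace, best_loss = None, float("inf")
    for name, traces in traces_by_subspace.items():
        half = len(traces) // 2
        model = train_feature(traces[:half], n_epochs=n_epochs)
        loss = eval_loss(model, traces[half:])
        if loss < best_loss:
            best_subspace, best_loss = name, loss
    return best_subspace
```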

To test how well it works in feature learning, we replicated the experiment in Fig. 4 on the 36D state space, both with and without the subspace selection heuristic. A first obvious observation from this experiment is that performing feature learning on separate subspaces (Fig. 4) results in lower MSEs, for all features and numbers of traces, than learning from all 36 raw states (Fig. 12). Without the heuristic (Fig. 12, top), we notice that, while spurious correlations in the raw state space are not problematic for some features (table, coffee, laptop, between objects), they can significantly reduce the quality of the learned feature for proxemics and test laptop location. Adding our imperfect heuristic (Fig. 12, bottom) solves this issue, but increases the variance of each error bar: while our heuristic can improve learning when it successfully chooses the correct raw state subspace, feature learning worsens when it chooses the wrong one.

In practice, when the subspace is not known, the robot could either use this heuristic or it could ask the human which subspace is relevant for teaching the desired feature. While this is a first step towards dealing with correlated input spaces, more work is needed to find more reliable solutions. For example, a better alternative to our heuristic could be found in methods for causal learning, such as Invariant Risk Minimization Arjovsky et al. [2019]. We defer such explorations to future work.

Appendix B Implementation Details

We report details of our training procedures, as well as any hyperparameters used. We tried a few different settings, but no extensive hyperparameter tuning was performed. Here we present the settings that worked best for each method. The code can be found at https://github.com/andreea7b/FERL.

Figure 13: The between objects feature. (Left) We first train our method using a 27D, highly correlated raw state space (xyz positions of all robot joints and objects). The learned feature (bottom) does not capture the fine-grained structure of the ground truth (top). (Right) When we train the network using only a 9D input (xyz positions of the EE and the two objects), the quality of the learned feature improves.

B.1 FERL Training Details

The feature function $\phi_\psi$ is approximated by a 2 layer, 64 hidden units neural network. We used a leaky ReLU non-linearity for all but the output layer, for which we used the softplus non-linearity. We normalized the output of $\phi_\psi$ every epoch by keeping track of the maximum and minimum output logit over the entire training data. Following the description in Sec. 2.1, the full dataset $\mathcal{D}$ consists of two parts: all tuples encoding monotonicity, and all tuples encouraging indistinguishable feature values at the starts and ends of traces. The number of the latter grows much more slowly with the number of traces than the number of the former, hence the dataset contains significantly fewer tuples of the latter type. This imbalance can lead to the training converging to local optima where the start and end values of traces are significantly different across traces. We addressed this by using data augmentation (adding each start/end equivalence tuple 5 times to $\mathcal{D}$) and weighing the loss from these tuples by a factor of 10. We optimized our final loss function using Adam for 100 epochs with a learning rate and weight decay of 0.001, and a batch-size of 32 tuples.
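A condensed sketch of this training loop follows, reusing the Bradley-Terry logits from the Sec. 2.2 sketch. The batched tensor format of the tuples is an assumption, while the replication factor, loss weight, optimizer settings, and epoch count follow the text above.

```python
import torch

def train_feature_net(phi, mono, equiv, epochs=100, batch_size=32,
                      equiv_repeat=5, equiv_weight=10.0):
    """mono/equiv are (s_a, s_b, label) tensor triples holding the monotonicity and
    start/end-equivalence tuples. Each equivalence tuple is added 5 times and its
    loss weighted by 10 to counter the imbalance described in the text."""
    s_a = torch.cat([mono[0]] + [equiv[0]] * equiv_repeat)
    s_b = torch.cat([mono[1]] + [equiv[1]] * equiv_repeat)
    y   = torch.cat([mono[2]] + [equiv[2]] * equiv_repeat)
    w   = torch.cat([torch.ones(len(mono[2])),
                     torch.full((len(equiv[2]) * equiv_repeat,), equiv_weight)])
    opt = torch.optim.Adam(phi.parameters(), lr=1e-3, weight_decay=1e-3)
    for _ in range(epochs):
        perm = torch.randperm(len(y))
        for i in range(0, len(y), batch_size):
            idx = perm[i:i + batch_size]
            logits = phi(s_a[idx]) - phi(s_b[idx])          # Bradley-Terry logits
            loss = torch.nn.functional.binary_cross_entropy_with_logits(
                logits, y[idx], weight=w[idx])               # per-tuple loss weights
            opt.zero_grad(); loss.backward(); opt.step()
    return phi
```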

B.2 ME-IRL Training Details

The reward is approximated by a 2 layer, 128 hidden units neural network with ReLU non-linearities. As described in Sec. 4, we add the known features to the output of this network before linearly mapping them to the final reward $r_\omega$ with a softplus non-linearity. While $\mathcal{D}^*$ is given, at each iteration we have to generate a set of near-optimal trajectories $\mathcal{D}^\omega$ for the current reward $r_\omega$. For that, we take the start and goal pairs of the demonstrations in $\mathcal{D}^*$ and use TrajOpt Schulman et al. [2013] to generate an optimal trajectory for each start-goal pair, hence $|\mathcal{D}^\omega| = |\mathcal{D}^*|$. At each of the 50 iterations, we go through all start-goal pairs, with one batch consisting of the demonstrated and sampled trajectories of one randomly selected start-goal pair, from which we estimate the gradient as detailed in Sec. 4. We optimize the loss with Adam using a learning rate and weight decay of 0.001.

Appendix C Additional Results

C.1 Between Objects with 9D Raw State Space Input

In Fig. 3 we saw that for the between objects feature, while FERL learned the approximate location of the objects to be avoided, it did not manage to learn the more fine-grained structure of the ground truth feature. This could either be because our method is flawed, or it could be an artefact of the spurious correlations in the high dimensional state space. To analyze this result further, we trained a network with only the dimensions necessary for learning this feature: the xyz position of the EE and the xyz positions of the two objects. The result in Fig. 13 illustrates that, in fact, our method is capable of capturing the fine-grained structure of the ground truth; however, more dimensions in the raw state space induce more spurious correlations that decrease the quality of the learned features.

Figure 14: Visual comparison of GT, FERL, and ME-IRL learned rewards for case 2.
Figure 15: Visual comparison of GT, FERL, and ME-IRL learned rewards for case 3.

C.2 Qualitative Reward Visualization for Cases 2 & 3

Following the procedure detailed in Sec. 4, we qualitatively compare the ground truth reward and the learned rewards of FERL and ME-IRL. Figure 14 visualizes the rewards for case 2, where the table feature is unknown, and Figure 15 for case 3, where the proxemics feature is unknown. Similar to case 1 (Fig. 5), we observe that in case 2 FERL is able to learn a fine-grained reward structure closely resembling GT. In case 3, for the more difficult proxemics feature, FERL with just 10 feature traces is not perfect, but it still recovers most of the reward structure. In both cases, ME-IRL only learns a coarse structure with a broad region of low reward, which does not capture the intricate trade-off of the true reward function.