To achieve fluent human-robot teaming, there must be a precise balance between manual control by the human and autonomy for the robot. Both extremes – requiring the human to manually task all robot activity and fully autonomous robots – have been shown to degrade overall performance [Chen:2012, 844354]. Finding this balance is difficult, as a robot working in a collaborative setting must anticipate and adapt to its human partner. Because of the significant degree of difference (i.e., heterogeneity) between individual humans, learning to tailor the robot's behavior to each human is intractable with conventional techniques.
The goal of this work is for a robot to contribute as a high performing teammate, learning to anticipate the needs/actions of its human partners. To accomplish this goal, the robot must develop and maintain a joint mental model of each demonstrator’s policy while taking into account that there may be significant differences amongst the robot’s various human teammates. For example, consider a robot acting as a scrub nurse that has the job of handing a surgeon a tool during a procedure; given a large dataset of surgeons’ preferences for a tool during a certain procedure, the robot can attempt to hand the surgeon the correct tool rather than burdening the surgeon with having to explicitly task the robot to fetch the desired instrument.
Table 1: Training and testing schemes for each model. Train: the baseline and LSTM networks train on all data; the clustering-based networks train on each data cluster, created through k-means clustering [Nikolaidis:2015:EML:2696454.2696455]; the Bayesian networks train on all data, after which a fully connected layer is appended and only the weights of the F.C. layer and the encoding are tuned. Test: weights are fixed for all networks, with the Bayesian variants following the same test scheme as their non-Bayesian counterparts.
While this task seems straightforward, the disparities between surgeons' preferences make it difficult to learn from the data. Early work in Learning from Demonstration (LfD) found that pilots executing the same flight plan created such variance in the data that it was more practical to learn from a single pilot and disregard the remaining data [DBLP:conf/icml/MoralesS04]. Sammut et al. [DBLP:conf/icml/SammutHKM92] showed that, when attempting LfD from pilots' demonstrations of a single flight plan, averaging trajectories led to worse performance than using a single trajectory. Nikolaidis et al. [Nikolaidis:2015:EML:2696454.2696455]
approached this issue by categorizing demonstrators according to their task execution preference via clustering and learning a separate policy for each cluster. While this allows for utilization of the entire dataset, each policy learns from only a fraction of the data. Returning to the robotic scrub nurse scenario, this clustering method would split the data into categories based on the type of surgeon, such as intern, resident, or attending surgeon, and learn a separate policy for each type using only the data representing the respective type. We hypothesize that a robot scrub nurse would have a better estimate of a surgeon's preferred surgical workflow if it were to reason about the data within that surgeon's cluster (e.g., all resident surgeons' data, given that the robot's partner is a resident) as opposed to treating all surgeons the same. While learning from clusters may provide preferences closer to optimality, it makes the learning problem harder by providing only 1/k of the data, where k is the number of clusters. Further, this cluster-based approach does not account for variability among surgeons within the same cluster.
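The cluster-then-learn scheme can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the `kmeans` and `fit_per_cluster_policies` helpers are hypothetical, and a linear least-squares fit stands in for the per-cluster policy networks used in practice.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means: returns (centroids, cluster labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centroids (keep the old one if a cluster empties)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

def fit_per_cluster_policies(X, Y, k=3):
    """Cluster the demonstrations, then fit one policy per cluster,
    each seeing only its 1/k share of the data."""
    _, labels = kmeans(X, k)
    policies = {}
    for j in range(k):
        Xj, Yj = X[labels == j], Y[labels == j]
        # stand-in linear policy: Y ~ Xj @ W, fit by least squares
        W, *_ = np.linalg.lstsq(Xj, Yj, rcond=None)
        policies[j] = W
    return policies, labels

# toy demonstration data: two state features, one action dimension
rng = np.random.default_rng(1)
X = rng.normal(size=(90, 2))
Y = X @ np.array([[1.0], [-0.5]]) + 0.1 * rng.normal(size=(90, 1))
policies, labels = fit_per_cluster_policies(X, Y, k=3)
```

Each policy in the dictionary is trained on a disjoint partition of the demonstrations, which is exactly the data-fragmentation cost discussed above.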
Instead, we believe that accounting for the commonalities and differences amongst demonstrators within a cluster (or across all clusters) allows for a more complete use of the data. Our method estimates the demonstrator's "style" or "unique descriptor" (i.e., latent embedding) in real-time by employing a Bayesian Neural Network (BNN), which reasons about the discrepancy between the average demonstrator and the specific one currently being observed. In our case, this descriptor is represented by a low-dimensional latent encoding vector, whose length is a hyperparameter that can be indicative of the complexity of the style. This encoding explicitly synthesizes features to explain the variance of a demonstrator that is not accounted for in the one-size-fits-all part of the network. In our network, the encoding can be learned in real-time via an auto-encoder [HinSal06] or via backpropagation [DBLP:conf/aaai/KillianKD17], as we do in our proposed method.
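The core idea of learning a per-demonstrator encoding via backpropagation can be illustrated with a toy NumPy sketch (not our actual network): shared weights `W` model the "average" demonstrator, while a small embedding row `Z[d]` per demonstrator absorbs individual style, here a hidden per-person bias. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_demos, n_feat, enc_len, lr = 4, 3, 2, 0.05

# shared "one-size-fits-all" weights over [features; embedding], plus one
# latent embedding row per demonstrator, learned jointly by gradient descent
W = rng.normal(scale=0.1, size=(n_feat + enc_len, 1))
Z = np.zeros((n_demos, enc_len))

# toy data: shared linear dynamics plus a hidden per-demonstrator "style" bias
w_true = np.array([[1.0], [-1.0], [0.5]])
style = rng.normal(size=(n_demos, 1))
X = rng.normal(size=(n_demos, 100, n_feat))
Y = X @ w_true + style[:, None, :]

def mse():
    losses = []
    for d in range(n_demos):
        inp = np.hstack([X[d], np.tile(Z[d], (100, 1))])
        losses.append(np.mean((inp @ W - Y[d]) ** 2))
    return float(np.mean(losses))

loss_before = mse()
for _ in range(200):
    for d in range(n_demos):
        inp = np.hstack([X[d], np.tile(Z[d], (100, 1))])
        err = inp @ W - Y[d]                              # (100, 1) prediction error
        W -= lr * inp.T @ err / 100                       # update shared weights
        Z[d] -= lr * float(err.mean()) * W[n_feat:, 0]    # update this demo's embedding
loss_after = mse()
```

At test time, the same embedding update (with `W` frozen) adapts the small vector to a newly observed demonstrator without retuning the whole network.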
We believe that exploration into heterogeneous learning from demonstration will allow for a large increase in human-robot task utility. In turn, this ability to learn a human teammate’s behavior can be leveraged to give specific benefits in surgery, manufacturing, and search & rescue.
II. Research Approach
As a testbed, we utilize PySC2 [vinyals2017starcraft], the StarCraft II API, which allows for analyzing real-player game replays. We use this testbed rather than proceeding directly to human LfD because a large dataset of human demonstrations (i.e., 1-vs.-1 game replays) is readily available. StarCraft II poses a difficult challenge, as it is a real-time, continuous-state-space, partially observable strategy game.
II-A. Effects of Utilizing Heterogeneity
We consider the problem of learning to mimic the decision-making of players (i.e., direct policy learning) within StarCraft II. We compare the following algorithmic formulations, as shown in Table 1. First, we consider a standard neural network, which serves as our baseline. Second, we include the clustering-based method of Nikolaidis et al. [Nikolaidis:2015:EML:2696454.2696455], in which we cluster the gameplay data into three partitions using k-means and learn a separate policy network on each. Third, we consider a BNN, which is able to holistically reason about the homo- and heterogeneity amongst the demonstrators.
Finally, we believe there may be important, latent, time-varying information (e.g., phases of gameplay dynamics) that needs to be explicitly captured. The standard model for capturing such dynamics is a Long Short-Term Memory (LSTM) neural network. We augment these networks to reason about static heterogeneity amongst players by including a Bayesian encoding structure, yielding a Bayesian-LSTM. We also include a generic LSTM as a baseline.
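The Bayesian-LSTM combination can be sketched, at a high level, as an LSTM whose input at every timestep is the game observation concatenated with the player's static latent embedding. The following is a minimal, untrained NumPy illustration (random weights; `lstm_step` and `run_player_lstm` are hypothetical names, not our implementation):

```python
import numpy as np

def lstm_step(x, h, c, Wx, Wh, b):
    """One LSTM step; gates stacked as [i, f, o, g] along the rows."""
    z = Wx @ x + Wh @ h + b
    H = h.size
    i = 1.0 / (1.0 + np.exp(-z[:H]))        # input gate
    f = 1.0 / (1.0 + np.exp(-z[H:2*H]))     # forget gate
    o = 1.0 / (1.0 + np.exp(-z[2*H:3*H]))   # output gate
    g = np.tanh(z[3*H:])                    # candidate cell state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def run_player_lstm(seq, z_player, hidden=8, seed=0):
    """Run an LSTM over a game sequence, concatenating the player's
    static latent embedding to the observation at every timestep."""
    rng = np.random.default_rng(seed)
    in_dim = seq.shape[1] + z_player.size
    Wx = rng.normal(scale=0.1, size=(4 * hidden, in_dim))
    Wh = rng.normal(scale=0.1, size=(4 * hidden, hidden))
    b = np.zeros(4 * hidden)
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x_t in seq:
        h, c = lstm_step(np.concatenate([x_t, z_player]), h, c, Wx, Wh, b)
    return h

obs = np.random.default_rng(1).normal(size=(20, 5))  # 20 timesteps, 5 features
z = np.array([0.3, -0.7, 0.1])                       # static player embedding
h_final = run_player_lstm(obs, z)
```

The LSTM carries the time-varying gameplay dynamics in `h` and `c`, while the fixed embedding `z` injects the player's static style at every step.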
III. Results and Discussion
Our results provide promising evidence that our Bayesian-LSTM formulation, which captures both static heterogeneity and time-varying gameplay phenomena, improves the performance of LfD mechanisms. In future work, we plan to conduct a sensitivity analysis to isolate robust hyperparameters, find better loss functions, and discover regularizers that isolate player-dependent features.
[Table: model comparison results — 161.6% | 101.6% | 98.9% | 87.2%]
Figure 1 depicts how the performance of the BNN changes with various encoding lengths, as well as the effects of training on clustered data. As the game proceeds, the BNNs with encoding lengths of three and six clearly outperform the baseline. This result supports the hypothesis that holistically reasoning about heterogeneity is helpful.
IV. Future Work
Successful human-robot teaming requires robot algorithms that take into account the heterogeneity of humans and allow the robot to tailor its behavior to the needs of its unique team. The ultimate goal of this work is to achieve a feat similar to that of a robotic scrub nurse: given demonstrations of a certain task, the robot can infer the demonstrator's style and assist them as effectively as possible. The work presented in this paper provides insight into the development of a continuous mapping between the encoding and the demonstrator's policy. Moving forward, a complementary robot policy must be identified to yield the highest-performing human-robot teaming setting.
Theoretical extensions of this work include adding an estimate of the uncertainty of the current demonstrator's style to inform active-learning mechanisms, and seeking alternate loss functions for tuning the encoding that maximize the information captured while resisting the tendency to overfit.
We also note that recent techniques in meta-learning, e.g., Finn et al. [finn2017model], have sought to learn a network that can be quickly tailored to perform well on a single task drawn from a known distribution of tasks. Once tailored to a specific task, such a policy loses its ability to be tailored to different tasks. In contrast, our approach can switch between tasks (e.g., predicting different humans' actions) by adapting a relatively small vector encoding rather than tuning the entire network.
Further, human-subject experimentation can be performed to analyze the performance of heterogeneous LfD. A sample starting experiment would be to give a robot demonstrations from left-handed and right-handed people performing some task and ask it to identify the dominant hand of a new demonstrator. We can then test the performance of a robot learning a complementary policy to assist demonstrators based on their distinctive style. Overall, these results would raise interesting questions in heterogeneous LfD and allow us to push the utility of human-robot teaming.
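As a toy illustration of the handedness experiment (hypothetical, not an actual study): with synthetic reach trajectories, even a trivial rule on the mean lateral position separates the two styles, suggesting the latent structure a learned encoding would need to capture. The `dominant_hand` helper and the data-generation scheme are invented for this sketch.

```python
import numpy as np

def dominant_hand(trajectory):
    """Toy rule: infer handedness from the mean lateral position of the
    demonstrator's end-effector (negative x = reaches to the left)."""
    return "left" if trajectory[:, 0].mean() < 0 else "right"

rng = np.random.default_rng(0)
# synthetic reach trajectories: (timesteps, [x, y]); left-handers reach left
left_demo = np.column_stack([rng.normal(-0.4, 0.1, 50), rng.normal(0.5, 0.1, 50)])
right_demo = np.column_stack([rng.normal(0.4, 0.1, 50), rng.normal(0.5, 0.1, 50)])
```

A real experiment would replace this hand-crafted rule with an inferred latent encoding, then condition the robot's assistive policy on it.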