Resource optimization problems arise in critical environments, such as healthcare, air traffic control, and manufacturing operations. Many of these problems are NP-hard (Chapman, 1987) and computationally intractable to solve due to the large number of jobs to be performed and the complex interplay of temporal and resource-based constraints. Human domain experts readily achieve high-quality solutions to these problems by drawing on years of experience with similar problems, during which they develop assistive strategies, heuristics, and guidelines. However, manually soliciting and encoding this knowledge in a computational framework is error-prone, not scalable, and leaves much to be desired (Cheng et al., 2006; Raghavan et al., 2006). In this paper, we aim to develop an apprenticeship learning framework that infers these strategies implicitly from observation, scaling beyond the power of a standard homogeneous apprenticeship learning model. The challenge for apprenticeship learning in this context lies in learning from the different strategies domain experts develop from their unique experiences (i.e., learning from heterogeneous demonstration).
There has been significant progress in the ability to capture domain-expert knowledge from demonstration (Abbeel and Ng, 2004; Gombolay et al., 2016b; Konidaris et al., 2011; Zheng et al., 2014; Odom and Natarajan, 2016; Ziebart et al., 2008; Reddy et al., 2018), predominantly using inverse reinforcement learning (IRL) to learn the reward function followed by domain experts (demonstrators), as well as a policy that optimizes that reward function. However, applying IRL to planning domains is difficult due to the need for a model and to computational intractability as the state space grows. Heterogeneous decision-makers appear to have different reward functions due to their innate preferences, so an IRL algorithm would fail to learn the true representation of the reward function. Furthermore, IRL requires the form of the reward function (and, in the case of maximum-entropy IRL, a hypothesis class) to be specified beforehand; Markov decision processes (MDPs) are ill-suited to many scheduling problems with non-Markovian constraints; and state-space enumeration and exploration is intractable for large-scale planning domains.
Another, complementary approach to capturing domain-expert knowledge is to learn a function that directly maps states to actions (Chernova and Veloso, 2007; Terrell and Mutlu, 2012; Huang and Mutlu, 2014). While this method scales better, such policy-learning approaches do not yet handle heterogeneity well. The typical approach is to assume homogeneity over demonstrators, reasoning about the average human, as shown in the left-most diagram in Figure 1. However, seminal work attempting to learn auto-pilots from commercial aviators found that pilots executing the same flight plan produced such heterogeneous data that it was more practical to learn from a single trajectory and disregard the remaining data (Sammut et al., 2002).
A more recent approach by Nikolaidis et al. (2015) sought to divide demonstrators into relatively homogeneous clusters and learn a separate model of human decision-making from each cluster. As depicted in the center diagram in Figure 1, this approach means that each model only has a fraction of the data to learn from, missing out on the homogeneity that exists across clusters. With high-dimensional data, expensive data collection, and residual, within-cluster heterogeneity, such an approach is ultimately unsatisfying.
In this paper, we seek to overcome these key gaps in prior work by providing an integrated learning framework that allows for learning planning policies from heterogeneous decision-makers. We propose using personalized embeddings, learned through backpropagation, which enable the apprenticeship learner to automatically adapt to a person's unique characteristics while simultaneously leveraging any homogeneity that exists within the data (i.e., uniform adherence to hard constraints). We then present a human-interpretable version of our apprenticeship learning model that allows for direct analysis of a given demonstrator's behavior.
We evaluate our approach on three problems: a synthetic low-dimensional environment, a synthetic job-scheduling environment consisting of mock experts' scheduling heuristics, and a real-world dataset of human gameplay in StarCraft II. To our knowledge, this is the first paper to apply a personalized apprenticeship learning framework to learn from heterogeneous decision-makers via one integrated model that captures the similarities and differences of demonstrators through personalized embeddings. We also utilize counterfactual reasoning via pairwise comparisons to improve the model's performance and demonstrate inference of the required action-specific features in domains in which they are not readily available. Finally, we introduce a methodology to learn and transfer our apprenticeship learning framework into a human-interpretable model for analysis or examination by a given expert for education and training.
2 Learning from Heterogeneous Decision Makers
Personalized Neural Networks (PNNs) are an extension of a standard neural network, or any differentiable model, that captures the homo- and heterogeneity among human domain experts who provide varied trajectories. Here, we present the framework for automatically inferring personalized embeddings to learn from heterogeneous decision-makers.
Figure 2 depicts a PNN, which learns a model, $\hat{\pi}_\theta$, of the human demonstrator's decision-making policy, where $\omega_p$ is the demonstrator-specific personalized embedding of length $d$, a tunable hyperparameter. As this personalized embedding is a continuous parameter, choosing a length $d$ that is somewhat too high or too low relative to the optimum is not nearly as detrimental as choosing a non-optimal number of clusters in the approach of Nikolaidis et al. (2015). These latent features, $\omega_p$, capture the pattern of the current decision-maker, accounting for a component that is not represented within the state features yet is needed for accurate prediction. The training procedure of a PNN consists of taking as input an example of a state, $s_t$, at time $t$ for person $p$, as well as the person's embedding, $\omega_p^{(i)}$, at training iteration $i$, and predicting the person's action in that state, $\hat{a}_t^p$. The loss is computed as the Rényi divergence (Życzkowski, 2003) between the predicted action, $\hat{a}_t^p$, and the true action, $a_t^p$. This loss is then backpropagated through the network to update the model parameters, $\theta$, and the personalized embedding, $\omega_p$.
When applying the algorithm during runtime (i.e., testing) for a new human demonstrator, $p'$, one updates the embedding, $\omega_{p'}$; however, the network's parameters, $\theta$, remain static. The personalized embedding, $\omega_{p'}$, for a new human demonstrator is initialized to the mean personalized embedding across training demonstrators. This means that, during runtime, we start by assuming a new expert is performing the planning task in the predicted manner; over time, we infer how she is acting differently and update the personalized embedding accordingly. This hybrid approach balances the bias-variance tradeoff, grounding the model in parameters common to all demonstrators via $\theta$ while tailoring a subset of the parameters, $\omega_{p'}$, to tune the model for an individual.
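A minimal numpy sketch may make this train/adapt split concrete. The logistic model, learning rate, and dimensions below are illustrative stand-ins, not the paper's architecture; the point is that training backpropagates into both the shared parameters $\theta$ and the per-person embedding, while runtime adaptation updates only the new demonstrator's embedding:

```python
import numpy as np

# Illustrative sketch (not the paper's implementation): a logistic model with
# shared parameters theta and a per-demonstrator embedding omega. Training
# updates both via gradient descent; at test time theta is frozen and only
# the new demonstrator's embedding adapts.

rng = np.random.default_rng(0)
STATE_DIM, EMBED_DIM = 4, 2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -50, 50)))

def predict(theta, state, omega):
    return sigmoid(theta @ np.concatenate([state, omega]))

def grad_step(theta, omega, state, action, lr=0.5, update_theta=True):
    x = np.concatenate([state, omega])
    err = sigmoid(theta @ x) - action                   # dLoss/dlogit for logistic loss
    new_theta = theta - lr * err * x if update_theta else theta
    new_omega = omega - lr * err * theta[STATE_DIM:]    # backprop into the embedding
    return new_theta, new_omega

# Two mock demonstrators whose actions are fully determined by a hidden style.
theta = rng.normal(scale=0.1, size=STATE_DIM + EMBED_DIM)
embeddings = {p: np.zeros(EMBED_DIM) for p in (0, 1)}
for _ in range(300):
    for p in (0, 1):
        state = rng.normal(size=STATE_DIM)
        theta, embeddings[p] = grad_step(theta, embeddings[p], state, float(p))

# Runtime: a new demonstrator starts from the mean embedding; theta stays fixed.
omega_new = np.mean([embeddings[0], embeddings[1]], axis=0)
for _ in range(100):
    state = rng.normal(size=STATE_DIM)
    _, omega_new = grad_step(theta, omega_new, state, 1.0, update_theta=False)
```

After the adaptation loop, the new demonstrator's embedding has drifted toward the embedding of the training demonstrator whose style it matches, which is exactly the behavior the text describes.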
We note that researchers have explored the use of latent embeddings in other contexts. For example, Killian et al. (2017) applied a Bayesian neural network to model transition dynamics. Tani et al. (2004) utilized a similar personalized embedding in a recurrent neural network but learned through imitative interaction in a domain with fewer degrees of freedom, primarily concerned with mimicking motion rather than decision-making for planning and scheduling. Recent work by Angelov et al. (2019) uses causal analysis to learn specifications from task demonstrations, but it makes assumptions regarding the number of demonstrator types and requires prior labeling of a demonstrator set. Our approach is novel in that it utilizes a latent embedding as a personalized embedding for apprenticeship learning in domains with a high degree of freedom while automatically inferring behavior, eliminating the need for tedious and biased annotation of person types.
To increase the utility of our learning framework, we draw inspiration from the domain of web-page ranking (Page et al., 1998), where the goal is to predict the most relevant web page given a search query. Web-page ranking must learn how pages relate to one another and capture these complex dependencies. Such dependencies are also apparent in many complex planning problems, for example in scheduling, where tasks are related through precedence, wait, and deadline constraints. The pairwise approach to web-page ranking determines a ranking based on pairwise comparisons between individual pages (Jin et al., 2008; Pahikkala et al., 2007). Utilizing this methodology, we can apply counterfactual reasoning between the factual (the action taken) and the counterfactual (an action not taken) to learn a ranking formulation that predicts which action the expert would ultimately take at each moment. Gombolay et al. (2016a) presented evidence that learning a pairwise preference model by comparing pairs of actions can outperform a multi-class classification model. However, this paper is, to our knowledge, the first to apply counterfactual reasoning in personalized apprenticeship learning.
From each observation, we then extract 1) the feature vector describing the action taken, $x_t^{a}$, in state $s_t$; 2) the corresponding feature vector, $x_t^{a'}$, for an alternative action, $a'$; 3) a contextual feature vector capturing features common to all actions (e.g., how many workers are available to be assigned jobs), $z_t$; and 4) the person's embedding, $\omega_p$. We note that each demonstrator has their own embedding, which is updated through backpropagation during training. The comparison of the action the decision-maker took, $a$, versus an action not taken, $a'$, is considered a positive example for a classifier. Likewise, the reverse comparison is a negative example. This process generates a positive and a negative example for each alternative action at each time step and is repeated for all users and all trajectory demonstrations.
Given this data set, the apprentice is trained to output a pseudo-probability of action $a$ being taken over action $a'$ at time $t$ by the human decision-maker described by embedding $\omega_p$, using the features above. To predict the probability of taking action $a$ at time $t$, we marginalize over all other actions, as shown in Equation 3. Finally, the action prediction is the argmax of this probability.
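The pairwise pipeline above can be sketched in a few lines. The scoring function and feature vectors below are toy stand-ins (the paper's learned network is not reproduced); the sketch shows the example-generation scheme and the marginalize-then-argmax prediction rule:

```python
import numpy as np

# Illustrative sketch of the pairwise (counterfactual) formulation: build
# (taken, not-taken) training pairs, then rank actions at test time by
# marginalizing pairwise pseudo-probabilities. `pairwise_prob` stands in for
# the learned classifier.

def make_pairwise_examples(action_feats, taken):
    """One positive and one negative example per alternative action."""
    examples = []
    for alt, feat in enumerate(action_feats):
        if alt == taken:
            continue
        examples.append((np.concatenate([action_feats[taken], feat]), 1))  # factual vs. counterfactual
        examples.append((np.concatenate([feat, action_feats[taken]]), 0))  # reversed comparison
    return examples

def predict_action(action_feats, pairwise_prob):
    """Marginalize pairwise preferences over all alternatives, then argmax (cf. Equation 3)."""
    n = len(action_feats)
    scores = [
        np.mean([pairwise_prob(action_feats[a], action_feats[b])
                 for b in range(n) if b != a])
        for a in range(n)
    ]
    return int(np.argmax(scores))

# Toy check: a scorer that prefers the action with the largest first feature.
feats = [np.array([0.1, 1.0]), np.array([0.9, 0.2]), np.array([0.4, 0.5])]
prob = lambda xa, xb: 1.0 / (1.0 + np.exp(-(xa[0] - xb[0])))
```

With three candidate actions, each observation yields four training examples (two per alternative), and the predicted action is the one the scorer prefers in aggregate.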
Generating Action-Specific Features
In many cases, the action-specific features may not be readily available. In this case, we can learn these features using a variation of this framework in which the embedding is personalized to a particular action rather than to the current demonstrator. In this way, we learn a mapping from the current state and an action embedding to the next state; this represents the transition model for the environment and yields a learned representation for each action. These action embeddings can then be used in the pairwise approach discussed above. We apply this method in the StarCraft II environment with great success.
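A small numpy sketch of this action-embedding variant, under toy assumptions (a linear transition model and hand-picked action effects; both are illustrative, not the paper's setup):

```python
import numpy as np

# When per-action features are unavailable, the embedding machinery can be
# pointed at actions instead of people: learn an embedding E[a] per action by
# fitting a transition model next_state ~ g(state, E[a]). A linear g stands
# in for the paper's network.

rng = np.random.default_rng(1)
STATE_DIM, EMBED_DIM, N_ACTIONS = 3, 3, 2

W = rng.normal(scale=0.1, size=(STATE_DIM, STATE_DIM + EMBED_DIM))
E = np.zeros((N_ACTIONS, EMBED_DIM))          # one learned embedding per action

# Hidden (toy) effect of each action on the state.
true_shift = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])

for _ in range(2000):
    a = int(rng.integers(N_ACTIONS))
    s = rng.normal(size=STATE_DIM)
    s_next = s + true_shift[a]
    x = np.concatenate([s, E[a]])
    err = W @ x - s_next                      # squared-error gradient
    W -= 0.05 * np.outer(err, x)
    E[a] -= 0.05 * W[:, STATE_DIM:].T @ err   # backprop into the action embedding
```

After training, the two action embeddings are distinguishable and, pushed through the model, approximately reproduce each action's effect; they can then serve as the action-specific features in the pairwise approach.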
2.1 Personalized Differentiable Decision Tree
While a PNN provides us with high-performance models of demonstrator preferences, standard deep network models lack straightforward interpretability. Interpretability is an important area of exploration in machine learning, and an interpretable model of resource allocation or planning tasks would be useful for a variety of reasons, from decision explanations to training purposes. Therefore, we present a personalized differentiable decision tree (PDDT) model that approximates PNN performance while also providing interpretability. Differentiable decision trees (DDTs) (Suárez and Lutsko, 1999) have provided researchers in various fields with simple and powerful models for constructing fuzzy decision trees (Yuan and Shaw, 1995) through differentiation.
Our approach begins with a balanced DDT and a demonstrator embedding, as in the PNN. The demonstrator embedding is concatenated with input data and routed directly to each decision node, rather than being transformed between nodes as in a traditional deep network architecture. While this hinders the representation learning capacity of our model, it is important to learn a model directly over input features to preserve interpretability.
Each decision node $n$ in the PDDT is conditioned on three parameters: a vector of weights $\vec{w}_n$, a vector of comparison values $\vec{c}_n$, and a vector of selective importances $\vec{s}_n$. When input data $\vec{x}$ is passed to decision node $n$, the data is weighted by $\vec{w}_n$ and compared against $\vec{c}_n$. By using a vector of comparison values, we can perform an element-wise comparison for each input feature, allowing the model to learn easily translatable rules for how to consider each individual element of $\vec{x}$. After comparison against $\vec{c}_n$, the model uses its selective-importance vector $\vec{s}_n$ to decide which feature matters most for $n$: the maximum value in $\vec{s}_n$ is set to 1, all other elements are set to 0, and the transformed input is element-wise multiplied by $\vec{s}_n$. This procedure ensures that each decision node considers only a single feature during each forward pass but still allows the model to learn over all features during backpropagation.
The single transformed feature is then scaled by a learned steepness parameter $\alpha$ and passed through a sigmoid, which outputs a value between 0 and 1, where 1 means the decision evaluates to "true" and 0 means it evaluates to "false." The learned parameter $\alpha$ enables the PDDT to control the steepness of the sigmoid; each decision node thus evaluates a sigmoidal approximation of a decision boundary. Leaf nodes in the PDDT maintain a set of weights over the output classes, and there is exactly one path from the root to each leaf $l$. The decision nodes along this path output probabilities, which are multiplied to produce the probability of reaching $l$ given the input and the current demonstrator embedding. The leaf's class weights are then scaled by this probability, and the weighted outputs of all leaves are summed to produce the final network output.
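The decision-node computation described above can be sketched directly; parameter names mirror the text, but the exact parameterization in the paper may differ:

```python
import numpy as np

# Sketch of a single PDDT decision node: weight, compare, select one feature
# via the importance vector, then squash through a steepness-controlled
# sigmoid. Names (w, c, s, alpha) follow the text; values are illustrative.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decision_node(x, w, c, s, alpha):
    compared = w * x - c                  # element-wise weight and compare
    one_hot = np.zeros_like(s)
    one_hot[np.argmax(s)] = 1.0           # selective importance: keep exactly one feature
    feature = np.sum(compared * one_hot)  # the single selected, transformed feature
    return sigmoid(alpha * feature)       # alpha controls boundary steepness

x = np.array([0.2, 0.9, 0.1])
w = np.array([1.0, 1.0, 1.0])
c = np.array([0.5, 0.5, 0.5])
s = np.array([0.1, 2.0, 0.3])             # this node attends to feature index 1

p_true = decision_node(x, w, c, s, alpha=4.0)
```

Since the selected feature (0.9) exceeds its comparison value (0.5), the node leans "true"; increasing `alpha` sharpens the soft boundary toward a crisp one.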
To achieve interpretability, we translate a PDDT into a decision tree using the selective-importance parameters $\vec{s}_n$. Every decision node in the PDDT outputs a decision based on a single feature chosen by $\vec{s}_n$; therefore, we can use $\vec{s}_n$ to choose which feature, weight, and comparison value should instantiate a new decision node in the interpretable decision tree. We can also remove the probabilistic weighting of leaves by pushing $\alpha$ towards infinity, restricting every decision's output to 1 or 0. Finally, the class output within each leaf is chosen as the maximum value of the leaf's weights. The new model is then a decision tree in which every node considers one feature and outputs 1 or 0, only one leaf is selected on each forward pass, and each leaf outputs a single class.
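The translation step can be illustrated on a one-node tree. The dictionary layout and parameter values below are toy assumptions; the sketch shows only the discretization logic (argmax feature per node, argmax class per leaf, crisp threshold test):

```python
import numpy as np

# Toy illustration of crispifying a soft (continuous) PDDT node into an
# interpretable decision-tree rule: keep only the argmax-importance feature
# at the node and the argmax class at each leaf.

soft_node = {
    "w": np.array([2.0, 0.5]),
    "c": np.array([1.0, 0.3]),
    "s": np.array([3.0, 0.2]),                 # importance says: use feature 0
}
soft_leaves = {"true": np.array([0.1, 0.9]),   # per-leaf class weights
               "false": np.array([0.8, 0.2])}

def discretize(node, leaves):
    i = int(np.argmax(node["s"]))
    rule = {"feature": i, "weight": node["w"][i], "threshold": node["c"][i]}
    classes = {k: int(np.argmax(v)) for k, v in leaves.items()}
    return rule, classes

def crisp_predict(x, rule, classes):
    branch = "true" if rule["weight"] * x[rule["feature"]] > rule["threshold"] else "false"
    return classes[branch]

rule, classes = discretize(soft_node, soft_leaves)
```

The resulting rule reads directly as "if 2.0 * x[0] > 1.0, predict class 1, else class 0," which is the kind of translatable statement the interpretable tree provides.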
3 Evaluation Environments
We use three environments to evaluate the utility of our personalized apprenticeship learning framework. In the synthetic low-dimensional environment, we perform tests to show the advantage of our framework and present a case study displaying the need for PDDTs. In the synthetic scheduling environment, we test the capability of our apprenticeship learning framework to learn heterogeneous scheduling policies. Finally, in StarCraft II, we test the apprentice models on a real-world gameplay dataset and present a result showing the utility of interpretability. Our approaches and implementations are available at https://github.com/ghost12331/Personalized-Apprenticeship-Learning-from-Heterogeneous-Decision-Makers. Hyper-parameter settings are provided in the supplementary material.
3.1 Synthetic Low-Dimensional Environment
The synthetic low-dimensional environment represents a simple domain in which an expert chooses an action based on the state and one of two hidden heuristics. This domain captures the idea that demonstrators exhibit homogeneity in conforming to constraints ($z$) alongside heterogeneity in strategies or preferences, captured by the embedding $\omega$. The idea is that decision-making exists on a manifold for each "mode" or "strategy" an operator exhibits, and we need to infer the identity of these manifolds through the embedding.
Demonstration trajectories are given in sets of 20 (which we denote a schedule), where each observation consists of the features $x$ and $z$ and the output is the label $y$. The label is computed from the observation through a rule built on an indicator function over the hidden heuristic. Assuming a near-even class distribution, randomly guessing or overfitting to one class results in about 50% accuracy. Only by inferring the type of demonstrator, captured by $\omega$, can the apprentice achieve an accurate model of decision-making.
3.2 Synthetic Scheduling Environment
The next environment we use to explore our personalized learning framework is a synthetic environment that we can control, manipulate, and interpret to empirically validate the efficacy of our proposed method. For our investigation, we leverage a jobshop scheduling environment built upon the XD[ST-SR-TA] scheduling domain defined by Korsah (2011), representing one of the hardest classes of scheduling problems. In this environment, two agents must complete a set of twenty tasks subject to upper- and lower-bound temporal constraints (i.e., deadlines and wait constraints), proximity constraints (i.e., no two agents can be in the same place at the same time), and travel-time constraints. For the purposes of apprenticeship learning, an action is defined as the immediate assignment of an agent to a task. The decision-maker must decide the optimal sequence of actions according to the decision-maker's own criteria. For this environment, we construct a set of heterogeneous, mock decision-makers that select scheduling actions according to Equation 4.
In this equation, the decision-maker selects a task $\tau_j$ from the set of tasks $\boldsymbol{\tau}$. The task-prioritization scheme is based upon three criteria: the first prioritizes tasks according to their deadline (i.e., "earliest-deadline first"), the second prioritizes the closest task, and the third prioritizes tasks according to a user-specified highest/lowest index or value. The heterogeneity in decision-making comes from a latent weighting vector $\beta$: its first two components weight the importance of the deadline and distance criteria, respectively, and its third component is a mode selector determining whether the highest or lowest task index is prioritized. By drawing $\beta$ from a multivariate random distribution, we can create an infinite number of unique demonstrator types.
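A hedged sketch of such a mock decision-maker may clarify how a single weighting vector induces heterogeneous policies. Equation 4 is not reproduced verbatim here; the scoring function below is an illustrative stand-in with the same three criteria:

```python
import numpy as np

# Illustrative mock expert (a stand-in for Equation 4): tasks are scored by a
# beta-weighted blend of earliest-deadline, nearest-task, and index-order
# criteria, and the argmax task is scheduled next.

def mock_expert_action(deadlines, distances, beta):
    n = len(deadlines)
    idx = np.arange(n, dtype=float)
    # beta[2] acts as a mode selector: prefer highest or lowest task index.
    index_pref = idx if beta[2] > 0.5 else -idx
    # Lower deadline and lower distance are better, so negate them.
    score = -beta[0] * deadlines - beta[1] * distances + index_pref
    return int(np.argmax(score))

deadlines = np.array([5.0, 1.0, 9.0])
distances = np.array([2.0, 8.0, 1.0])
```

A deadline-driven expert (large first weight) picks the most urgent task, while a distance-driven expert (large second weight) picks the closest one, so demonstrators drawn with different beta vectors produce genuinely different trajectories from the same state.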
3.3 StarCraft II
In our third environment, we leverage a real-world dataset of gameplay from StarCraft II, provided alongside the StarCraft II API PySC2 (Vinyals et al., 2017). The dataset contains a large number of 1-vs.-1 replays affording access to the game's state-action information at every frame, information regarding the outcome of the game, and the ranking of the players. The state of the game at any timestep within a gameplay trajectory (i.e., demonstration) consists of several images indicating where units and buildings are located, alongside information about visibility regions and vectorized state information regarding the amounts of resources, buildings, and units in the game. The action taken in each frame can be one of hundreds; as a simplification, we group these into 40 actions that form representative super-classes of the refined actions.
4 Results and Discussion
We evaluate the performance of our apprenticeship learning framework against related approaches and assess the power of counterfactual reasoning against two baselines: 1) Pointwise, in which each action is considered independently, with a positive example assigned to an action taken and a negative example to an action not taken, another model type drawn from web-page ranking (Page et al., 1999); and 2) Standard, in which the model must directly select which action is taken (i.e., a multi-class classification problem), the ubiquitous approach in policy learning (Silver et al., 2016). Finally, we explore queries related to interpretability that support the significance of our approach.
4.1 Synthetic Low-Dimensional Environment
A set of 50 schedules is given as input to the apprenticeship learning framework specified above. We conduct a thorough comparison between our personalized approach and prior ones, testing a standard decision tree (DT), a standard neural network (NN), a differentiable decision tree (DDT), a neural network trained on homogeneous clusters generated through k-means clustering (Nikolaidis et al., 2015) (k-means to NN), a neural network trained on soft clusters generated through a Gaussian mixture model (GMM to NN), a personalized differentiable decision tree (PDDT), and a personalized neural network (PNN).
We also consider training a decision tree for which we infer embeddings via Monte Carlo sampling: the tree is given a univariate embedding sampled from a bimodal distribution, which is iteratively updated based on the tree's performance as the distribution changes. This expectation-maximization procedure lets us compare against another personalized embedding-based approach. However, it requires the number of modes (types) of domain experts to be known in advance (in this case, two), which limits its utility. We first initialize the probability that a domain expert is of type 1 or type 2 at random. Then, for a training schedule, the procedure appends the embedding (a one-hot encoding of the type, i.e., [1,0] or [0,1]) to the input and trains a decision-tree classifier on the result. This classifier is then used to predict the outcomes of the remaining samples in that schedule and is penalized for its incorrectness. Repeating this process allows the embeddings to converge to their proper values.
Figure 3(a) shows that the personalized apprenticeship learning frameworks (specifically, the PDDT and PNN) outperform conventional approaches to apprenticeship learning. It should be noted that we expected the DT with a bimodal embedding to perform as well as the personalized apprenticeship framework, as the MCMC sampling-based update allows the embedding of the expert to be inferred. We hypothesize that its performance is not comparable to that of the PDDT or PNN due to the invariability in the outputs of a decision tree.
4.2 Synthetic Scheduling Environment
A set of 150 schedules generated by heterogeneous domain experts is given as input to the apprenticeship learning framework specified above, comprising 3,000 individual timesteps from which the framework must learn. While we could generate a larger number of schedules, it is more meaningful to show that our apprenticeship learning framework works well on smaller training datasets, as in many cases demonstrations are expensive or sparse.
The efficacy of each approach is shown in Figure 3(b). The personalized apprenticeship learning frameworks that utilize counterfactual reasoning outperform all other approaches, achieving near-perfect accuracy in predicting which task a domain expert will schedule. These results show that the personalized apprenticeship learning framework can learn from observations in a high-dimensional and complex environment with heterogeneous demonstrators.
4.3 StarCraft II
A set of real-world gameplay data from StarCraft II is used to further verify our claim. This environment has higher dimensionality and poses a multi-label classification task (Tsoumakas and Katakis, 2007): multiple actions can be taken in a single time step, increasing the challenge of inferring a high-quality policy. Based on the relative performance of the various methods tested on the synthetic domains, with the PNN and PDDT outperforming the baselines, we evaluate a neural network trained for multi-class classification of actions as well as a neural network, a PNN, and a PDDT using counterfactual reasoning via pairwise comparisons. We find that the PDDT and PNN again outperform our baselines, as shown in Figure 5, on a dataset of human decision-making in a complex planning problem.
The results of our empirical evaluations support the hypothesis that personalized embeddings allow for learning a powerful representation of heterogeneous domain experts. However, we want not just an accurate model, but also one that lends insight into the demonstrator's decision-making process. In Section 2.1, we proposed a differentiable decision tree architecture that provides the novel ability to perform apprenticeship learning with personalized embeddings for heterogeneous demonstrators in a tree-like architecture. With the ability to readily translate a PDDT into a classical decision tree with interpretable decision boundaries, this model offers much promise, so long as we do not empirically lose a substantial amount of accuracy in the conversion process.
To assess the efficacy of learning interpretable decision trees with personalized embeddings through the PDDT, we conduct the following experiment for the synthetic, low-dimensional environment. First, we train a PNN over a set of training data; next, we extract the learned embeddings and train a traditional DT with the training data and paired embeddings. Third, we run the PNN on the test set to infer the “correct” embeddings and provide these inferred embeddings to the DT along with the test demonstration examples for the sake of classification. In addition to this PNN-DT hybrid approach, we report the accuracy of the PNN provided with these PNN-extracted embeddings.
Table 1 shows the advantage of using a PDDT. While an uninterpretable PNN can achieve higher accuracy, the DT constructed from PNN embeddings performs much worse than any other approach. Positively, the conversion from a continuous PDDT to a discrete PDDT incurs a low penalty, and the resulting model is much better than the DT given PNN embeddings. This result is intuitive in that jointly learning the hard constraints of the problem with the demonstrator embedding leads to performance gains versus learning the embeddings separately and then attempting to build an independent interpretable model. On the real-world StarCraft II dataset, generating a discrete PDDT from a continuous PDDT likewise incurs a relatively small penalty, confirming that we are able to generate interpretable models in complex domains without sacrificing performance. We note that it is possible to achieve perfect accuracy for the synthetic environments, as we do not introduce artificial noise.
Figure 6 shows an interpretable model that can be generated from the PDDT, for the low-dimensional environment. Given a set of observations from decision-makers, our personalized framework can generate an interpretable model that will display the behavior of each decision-maker, allowing us to inspect each individual demonstrator’s style. If a decision node is conditioned on the demonstrator embedding and leads to a set of actions of type A when the decision evaluates to true and of type B otherwise, we can say that demonstrators with embeddings that satisfy the decision node stylistically prefer actions of type A (causal reasoning).
| Model | Low-Dimensional | Scheduling | StarCraft II (Average Loss) |
| --- | --- | --- | --- |
| *Discrete PDDT | 89.15% | 100.00% | 0.0985 |
| *DT (PNN Embeddings) | 77.70% | 17.65% | n/a |
| Continuous PDDT | 94.79% | 100.00% | 0.0960 |
| PNN | 95.65% | 99.80% | 0.0866 |
5 Conclusion
We present a new apprenticeship learning framework for learning from heterogeneous demonstrators, leveraging PNNs and PDDTs to learn task models from large datasets while still being able to predict individual demonstrator preferences. We demonstrate that our approach is superior to standard NN and DDT models, which fail to capture individual demonstrator styles, and that our counterfactual reasoning approach to actions is superior to a standard action-prediction approach, allowing us to achieve near-perfect accuracy in predicting demonstration trajectories of domain experts in a scheduling problem. Finally, we introduce two methods to extract an interpretable model of demonstrator preferences and task constraints, show that conversion from a differentiable PDDT into a discrete, interpretable PDDT offers performance gains over attempting to construct an interpretable model with an independent set of demonstrator embeddings and task examples, and underline ways that this interpretability can highlight differences in demonstrator styles.
References
- Abbeel and Ng  Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning. ACM Press, 2004.
- Angelov et al.  Daniel Angelov, Yordan Hristov, and Subramanian Ramamoorthy. Using causal analysis to learn specifications from task demonstrations. CoRR, abs/1903.01267, 2019.
- Chapman  David Chapman. Planning for conjunctive goals. Artificial Intelligence, 32(3):333 – 377, 1987.
- Cheng et al.  Tsang-Hsiang Cheng, Chih-Ping Wei, and Vincent S. Tseng. Feature selection for medical data mining: Comparisons of expert judgment and automatic approaches. In Proceedings of the 19th IEEE Symposium on Computer-Based Medical Systems, CBMS ’06, pages 165–170, Washington, DC, USA, 2006. IEEE Computer Society.
- Chernova and Veloso  Sonia Chernova and Manuela Veloso. Confidence-based policy learning from demonstration using gaussian mixture models. In Proceedings of the 6th international joint conference on Autonomous agents and multiagent systems, page 233. ACM, 2007.
- Gombolay et al. [2016a] Matthew Gombolay, Reed Jensen, Jessica Stigile, Sung-Hyun Son, and Julie Shah. Decision-making authority, team efficiency and human worker satisfaction in mixed human-robot teams. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), New York City, NY, U.S.A., July 9-15 2016.
- Gombolay et al. [2016b] Matthew C. Gombolay, Reed Jensen, Jessica Stigile, Sung-Hyun Son, and Julie A. Shah. Apprenticeship scheduling: Learning to schedule from human experts. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pages 826–833, 2016.
- Huang and Mutlu  Chien-Ming Huang and Bilge Mutlu. Learning-based modeling of multimodal behaviors for humanlike robots. In Proceedings of the 2014 ACM/IEEE International Conference on Human-robot Interaction, HRI ’14, pages 57–64, New York, NY, USA, 2014. ACM.
- Jin et al.  Rong Jin, Hamed Valizadegan, and Hang Li. Ranking refinement and its application to information retrieval. In Proceedings of the 17th International Conference on World Wide Web, WWW ’08, pages 397–406, New York, NY, USA, 2008. ACM.
- Killian et al.  Taylor W. Killian, Samuel Daulton, George Konidaris, and Finale Doshi-Velez. Robust and efficient transfer learning with hidden parameter markov decision processes. In Advances in Neural Information Processing Systems 30, pages 6250–6261. Curran Associates, Inc., 2017.
- Konidaris et al.  G.D. Konidaris, S.R. Kuindersma, R.A. Grupen, and A.G. Barto. CST: Constructing skill trees by demonstration. In Proceedings of the ICML Workshop on New Developments in Imitation Learning, July 2011.
- Korsah  G. Ayorkor Korsah. Exploring bounded optimal coordination for heterogeneous teams with cross-schedule dependencies. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, January 2011.
- Nikolaidis et al.  Stefanos Nikolaidis, Ramya Ramakrishnan, Keren Gu, and Julie Shah. Efficient model learning from joint-action demonstrations for human-robot collaborative tasks. In Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction, HRI ’15, pages 189–196, New York, NY, USA, 2015. ACM.
- Odom and Natarajan [2016] Phillip Odom and Sriraam Natarajan. Active advice seeking for inverse reinforcement learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, AAMAS ’16, pages 512–520, Richland, SC, 2016. International Foundation for Autonomous Agents and Multiagent Systems.
- Page et al. [1998] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. In Proceedings of the 7th International World Wide Web Conference, pages 161–172, Brisbane, Australia, 1998.
- Page et al. [1999] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
- Pahikkala et al. [2007] Tapio Pahikkala, Evgeni Tsivtsivadze, Antti Airola, Jorma Boberg, and Tapio Salakoski. Learning to rank with pairwise regularized least-squares. In SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, January 2007.
- Raghavan et al. [2006] Hema Raghavan, Omid Madani, and Rosie Jones. Active learning with feedback on features and instances. Journal of Machine Learning Research, 7:1655–1686, December 2006.
- Reddy et al. [2018] Siddharth Reddy, Anca D. Dragan, and Sergey Levine. Where do you think you’re going?: Inferring beliefs about dynamics from behavior. CoRR, abs/1805.08010, 2018.
- Sammut et al. [2002] Claude Sammut, Scott Hurst, Dana Kedzier, and Donald Michie. In Imitation in Animals and Artifacts, pages 171–189, 2002.
- Silver et al. [2016] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
- Suárez and Lutsko [1999] Alberto Suárez and James F. Lutsko. Globally optimal fuzzy decision trees for classification and regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(12):1297–1311, 1999.
- Tani et al. [2004] Jun Tani, Masato Ito, and Yuuya Sugita. Self-organization of distributedly represented multiple behavior schemata in a mirror system: Reviews of robot experiments using RNNPB. Neural Networks, 17:1273–1289, October 2004.
- Terrell and Mutlu [2012] Allison Terrell and Bilge Mutlu. A regression-based approach to modeling addressee backchannels. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL ’12, pages 280–289, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
- Tsoumakas and Katakis [2007] Grigorios Tsoumakas and Ioannis Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 2007:1–13, 2007.
- Vinyals et al. [2017] Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, John Quan, Stephen Gaffney, Stig Petersen, Karen Simonyan, Tom Schaul, Hado van Hasselt, David Silver, Timothy Lillicrap, Kevin Calderone, Paul Keet, Anthony Brunasso, David Lawrence, Anders Ekermo, Jacob Repp, and Rodney Tsing. StarCraft II: A new challenge for reinforcement learning, 2017.
- Yuan and Shaw [1995] Yufei Yuan and Michael J. Shaw. Induction of fuzzy decision trees. Fuzzy Sets and Systems, 69(2):125–139, 1995.
- Zheng et al. [2014] Jiangchuan Zheng, Siyuan Liu, and Lionel M. Ni. Robust Bayesian inverse reinforcement learning with sparse behavior noise. In Proceedings of the National Conference on Artificial Intelligence, pages 2198–2205, United States, January 2014. AI Access Foundation.
- Ziebart et al. [2008] Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3, AAAI’08, pages 1433–1438. AAAI Press, 2008.
- Życzkowski [2003] Karol Życzkowski. Rényi extrapolation of Shannon entropy. Open Systems & Information Dynamics, 10(03):297–310, 2003.