1 Introduction
Inverse reinforcement learning (IRL) enables a learning agent (the learner) to acquire skills from observations of a teacher’s demonstrations. The learner infers a reward function explaining the demonstrated behavior and optimizes her own behavior accordingly. IRL has been studied extensively (Abbeel and Ng, 2004; Ratliff et al., 2006; Ziebart, 2010; Boularias et al., 2011; Osa et al., 2018) under the premise that the learner can and is willing to imitate the teacher’s behavior.
In real-world settings, however, a learner typically does not blindly follow the teacher’s demonstrations, but also has her own preferences and constraints. For instance, consider demonstrating to the autopilot of a self-driving car how to quickly navigate from A to B by going through a pedestrian zone. These demonstrations might conflict with the autopilot’s constraint to drive only on lanes in order to ensure maximum safety of human beings. Similarly, in robot-human interaction with the goal of teaching people how to cook, a teaching robot might demonstrate to a human user how to cook “roast chicken”, which could conflict with the preferences of a learner who is vegetarian. To give yet another example, consider a surgical training simulator which provides virtual demonstrations of expert behavior; a novice learner might not be confident enough to imitate a difficult procedure because of safety concerns. In all these examples, the learner might not be able to acquire useful skills from the teacher’s demonstrations.
In this paper, we formalize the problem of teaching a learner with preferences and constraints. First, we are interested in understanding the suboptimality of learner-agnostic teaching, i.e., ignoring the learner’s preferences. Second, we are interested in designing learner-aware teachers who account for the learner’s preferences and thus enable more efficient learning. To this end, we study a learner model with preferences and constraints in the context of the Maximum Causal Entropy (MCE) IRL framework (Ziebart, 2010; Ziebart et al., 2013; Zhou et al., 2018). This enables us to formulate the teaching problem as an optimization problem, and to derive and analyze algorithms for learner-aware teaching.
Our main contributions are:

We analyze the problem of optimizing demonstrations for the learner when preferences are known to the teacher, and we propose a bilevel optimization approach to the problem (Section 4).

We propose strategies for adaptively teaching a learner with preferences unknown to the teacher, and we provide theoretical guarantees under natural assumptions (Section 5).

We empirically show that significant performance improvements can be achieved by learner-aware teachers as compared to learner-agnostic teachers (Section 6).
2 Problem Setting
Environment. Our environment is described by a Markov decision process (MDP) $M := (S, A, T, \gamma, P_0, R)$. Here, $S$ and $A$ denote finite sets of states and actions. $T : S \times A \times S \to [0, 1]$ describes the state transition dynamics, i.e., $T(s' \mid s, a)$ is the probability of landing in state $s'$ by taking action $a$ from state $s$. $\gamma \in (0, 1)$ is the discounting factor. $P_0$ is an initial distribution over states. $R : S \to \mathbb{R}$ is the reward function. We assume that there exists a feature map $\phi_A : S \to \mathbb{R}^{d_A}$ such that the reward function is linear, i.e., $R(s) = \langle w^*, \phi_A(s) \rangle$ for some $w^* \in \mathbb{R}^{d_A}$. Note that a bound of $\|w^*\| \le 1$ ensures that $|R(s)| \le \|\phi_A(s)\|$ for all $s$.

Basic definitions. A policy is a map $\pi : S \to \Delta(A)$ such that $\pi(\cdot \mid s)$ is a probability distribution over actions for every state $s$. We denote by $\Pi$ the set of all such policies. The performance measure for policies we are interested in is the expected discounted reward $R(\pi) := \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi, M\big]$, where the expectation is taken with respect to the distribution over trajectories induced by $\pi$ together with the transition probabilities $T$ and the initial state distribution $P_0$. A policy is optimal for the reward function $R$ if it maximizes $R(\pi)$, and we denote optimal policies by $\pi^*$. Note that $R(\pi) = \langle w^*, \mu_A(\pi) \rangle$, where $\mu_A : \Pi \to \mathbb{R}^{d_A}$, $\mu_A(\pi) := \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t \phi_A(s_t) \mid \pi, M\big]$, is the map taking a policy to its vector of (discounted) feature expectations. We denote by $\mathcal{M}_A := \mu_A(\Pi)$ the image of this map. Note that the set $\mathcal{M}_A$ is convex (see (Ziebart, 2010, Theorem 2.8) and Abbeel and Ng (2004)), and also bounded due to the discounting factor $\gamma < 1$. For a finite collection $\Xi$ of trajectories obtained by executing a policy $\pi$ in the MDP $M$, we denote the empirical counterpart of $\mu_A(\pi)$ by $\hat{\mu}_A(\Xi)$.

An IRL learner and a teacher. We consider a learner $\mathcal{L}$ implementing an inverse reinforcement learning (IRL) algorithm and a teacher $\mathcal{T}$. The teacher has access to the full MDP $M$; the learner knows the MDP without the reward function, i.e., $(S, A, T, \gamma, P_0)$, as well as the parametric form of the reward function, but does not know the true reward parameter $w^*$. The learner, upon receiving demonstrations $\Xi$ from the teacher, outputs a policy $\pi^L$ using her algorithm. The teacher’s objective is to provide a set of demonstrations to the learner that ensures that the learner’s output policy achieves high reward $R(\pi^L)$.
The standard IRL algorithms are based on the idea of feature matching (Abbeel and Ng, 2004; Ziebart, 2010; Osa et al., 2018): the learner’s algorithm finds a policy $\pi^L$ that matches the feature expectations of the received demonstrations, ensuring that $\|\mu_A(\pi^L) - \hat{\mu}_A(\Xi)\| \le \epsilon$, where $\epsilon$ specifies a desired level of accuracy. In this standard setting, the learner’s primary goal is to imitate the teacher (via feature matching), and this makes the teaching process easy. In fact, the teacher just needs to provide a sufficiently rich pool of demonstrations $\Xi$ obtained by executing an optimal policy $\pi^*$, ensuring $\|\hat{\mu}_A(\Xi) - \mu_A(\pi^*)\| \le \epsilon$. This guarantees that $\|\mu_A(\pi^L) - \mu_A(\pi^*)\| \le 2\epsilon$. Furthermore, the linearity of rewards and the bound on $\|w^*\|$ ensure that the learner’s output policy satisfies $R(\pi^L) \ge R(\pi^*) - O(\epsilon)$.
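The empirical feature expectations used above can be computed directly from sampled trajectories. A minimal sketch (the one-hot feature map and integer-valued states are illustrative assumptions, not part of the setup above):

```python
import numpy as np

def empirical_feature_expectations(trajectories, phi, gamma=0.9):
    """Estimate the discounted feature expectations E[sum_t gamma^t phi(s_t)]
    from a finite set of state trajectories."""
    d = phi(trajectories[0][0]).shape[0]
    mu_hat = np.zeros(d)
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu_hat += gamma**t * phi(s)
    return mu_hat / len(trajectories)

# Toy example: states are integers, phi is a one-hot feature map over 3 states.
phi = lambda s: np.eye(3)[s]
trajs = [[0, 1, 2], [0, 2, 2]]
mu_hat = empirical_feature_expectations(trajs, phi, gamma=0.5)
```

As the number of trajectories grows, this estimate converges to the true feature expectations of the demonstrating policy.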
Key challenges in teaching a learner with preference constraints. In this paper, we study a novel setting where the learner has her own preferences, which she additionally takes into consideration when learning a policy from the teacher’s demonstrations. We formally specify our learner model in the next section; here we highlight the key challenges that arise in teaching such a learner. Given that the learner’s primary goal is no longer just imitating the teacher via feature matching, the learner’s output policy can be suboptimal with respect to the true reward even if she had access to $\mu_A(\pi^*)$, i.e., the feature expectation vector of an optimal policy $\pi^*$. Figure 1 provides an illustrative example to showcase the suboptimality of teaching when the learner has preferences and constraints. The key challenge that we address in this paper is that of designing a teaching algorithm that selects demonstrations while accounting for the learner’s preferences.
3 Learner Model
In this section we describe the learner models we consider, including different ways of defining preferences and constraints. First, we introduce some notation and definitions that will be helpful. We capture the learner’s preferences via a feature map $\phi_B : S \to \mathbb{R}^{d_B}$. We define $\phi_{AB}$ as the concatenation of the two feature maps $\phi_A$ and $\phi_B$, given by $\phi_{AB}(s) := (\phi_A(s), \phi_B(s))$, and let $d := d_A + d_B$. Similar to the map $\mu_A$, we define $\mu_B(\pi) := \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t \phi_B(s_t) \mid \pi, M\big]$ and $\mu_{AB}(\pi) := (\mu_A(\pi), \mu_B(\pi))$. Similar to $\mathcal{M}_A$, we define $\mathcal{M}_B$ and $\mathcal{M}_{AB}$ as the images of the maps $\mu_B$ and $\mu_{AB}$. Note that for any policy $\pi$, we have $R(\pi) = \langle w^*, \mu_A(\pi) \rangle$.
Standard (discounted) MCE-IRL. Our learner models build on the (discounted) Maximum Causal Entropy (MCE) IRL framework (Ziebart et al., 2008; Ziebart, 2010; Ziebart et al., 2013; Zhou et al., 2018). In the standard (discounted) MCE-IRL framework, a learning agent aims to identify a policy that matches the feature expectations of the teacher’s demonstrations while simultaneously maximizing the (discounted) causal entropy, given by $H(\pi) := \mathbb{E}\big[-\sum_{t=0}^{\infty} \gamma^t \log \pi(a_t \mid s_t) \mid \pi, M\big]$. More background is provided in Appendix C.
Including preference constraints. The standard framework can be readily extended to include the learner’s preferences in the form of constraints on the preference features $\phi_B$. Clearly, the learner’s preferences can render exact matching of the teacher’s demonstrations infeasible, and hence we relax this condition. To this end, we consider the following generic learner model:

(1)  $\max_{\pi} \quad H(\pi) \;-\; \eta_A \, \|\mu_A(\pi) - \hat{\mu}_A\|_2^2 \;-\; \eta_B \sum_i \big[ g_i(\mu_B(\pi)) \big]_+$

s.t.  $\pi \in \Pi$

Here, $g_1, \ldots, g_k$ are convex functions representing preference constraints, and $[\cdot]_+ := \max\{\cdot, 0\}$ penalizes their violation. We denote the parameters and variables in vector notation as $g := (g_1, \ldots, g_k)$, $\eta := (\eta_A, \eta_B)$, and $\mu := \mu_{AB}(\pi)$. The coefficients $\eta_A$ and $\eta_B$ are parameters that quantify the importance of matching the teacher’s demonstrations and of satisfying the learner’s preferences, respectively. Next, we discuss two special instances of this generic learner model.
3.1 Learner Model with Hard Preferences Constraints
It is instructive to study a special case of the above generic learner model with the $\ell_2$ distance $\|\mu_A(\pi) - \hat{\mu}_A\|_2$, and a limiting case with $\eta_A \to \infty$ and $\eta_B \to \infty$. Intuitively, the preferences take the form of hard constraints, i.e., the learner’s output policy $\pi^L$ must satisfy $g_i(\mu_B(\pi^L)) \le 0$ for all $i$. Additionally, while satisfying these hard constraints, the learner minimizes the norm distance to the teacher’s demonstrations. We formally describe the learner’s behavior below.
First, we define the learner’s constraint set as $C := \{u \in \mathbb{R}^{d_B} : g_i(u) \le 0 \text{ for all } i\}$. Similar to $\mathcal{M}_{AB}$, we define $\mathcal{M}_{AB}^C := \{\mu_{AB}(\pi) : \pi \in \Pi \text{ with } \mu_B(\pi) \in C\}$, as well as $\mathcal{M}_A^C$ and $\mathcal{M}_B^C$. Also, note that $\mathcal{M}_A^C$ and $\mathcal{M}_B^C$ are the projections of the set $\mathcal{M}_{AB}^C$ to the subspaces $\mathbb{R}^{d_A}$ and $\mathbb{R}^{d_B}$, respectively. Then, the learner’s behavior can be approximated as:

Learner can match: When $\hat{\mu}_A \in \mathcal{M}_A^C$, the learner outputs a policy $\pi^L$ s.t. $\mu_A(\pi^L) = \hat{\mu}_A$.

Learner cannot match: Otherwise, the learner outputs a policy $\pi^L$ such that $\mu_A(\pi^L)$ is given by the norm projection of the vector $\hat{\mu}_A$ onto the set $\mathcal{M}_A^C$.
Figure 1 provides an illustration of the behavior of this learner model. We will design learner-aware teaching algorithms for this learner model in Sections 4.1 and 5.
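For intuition, when the constrained feature-expectation set happens to be a box, the projection step in the “cannot match” case reduces to coordinatewise clipping. A small sketch (the box-shaped set is an illustrative assumption; in general the learner projects onto the set of constrained feature expectations, which need not be a box):

```python
import numpy as np

def project_onto_box(mu_teacher, lo, hi):
    """L2 projection of the teacher's feature-expectation vector onto a
    box-shaped constraint set {u : lo <= u <= hi} (coordinatewise clip)."""
    return np.clip(mu_teacher, lo, hi)

mu_T = np.array([0.9, -0.3, 0.5])   # teacher's (unreachable) feature expectations
lo = np.array([0.0, 0.0, 0.0])
hi = np.array([0.6, 1.0, 1.0])
mu_L = project_onto_box(mu_T, lo, hi)
```

Only the coordinates that violate the box are moved; the learner then outputs a policy realizing the projected feature expectations.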
3.2 Learner Model with Soft Preference Constraints
Another interesting learner model that we study in this paper arises from the generic learner when we consider $k = d_B$ box-type linear constraints $g_i(u) = u_i - b_i$ with thresholds $b_i \in \mathbb{R}$. We consider an $\ell_1$-norm penalty on violation, and for simplicity we consider exact matching of the reward feature expectations. In this case, the learner’s model is given by

(2)  $\max_{\pi \in \Pi} \quad H(\pi) \;-\; \eta_B \sum_{i=1}^{d_B} \big[ \mu_B(\pi)_i - b_i \big]_+$

s.t.  $\mu_A(\pi) = \hat{\mu}_A$
The solution to the above problem corresponds to a softmax policy with a reward function $\langle w, \phi_{AB}(s) \rangle$, where $w$ is parametrized by the Lagrange multipliers of problem (2). The optimal parameters can be computed efficiently, and the corresponding softmax policy is then obtained by the SoftValueIteration procedure (see (Ziebart, 2010, Algorithm 9.1) and Zhou et al. (2018)). Details are provided in Appendix D. We will design learner-aware teaching algorithms for this learner in Section 4.2.
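The soft value iteration step can be sketched as follows; the fixed iteration count, the state-only reward vector, and the toy MDP are illustrative assumptions, and a careful implementation would use a numerically stable log-sum-exp and a convergence tolerance:

```python
import numpy as np

def soft_value_iteration(T, r, gamma=0.9, n_iters=200):
    """Soft value iteration (in the spirit of Ziebart 2010, Alg. 9.1): returns
    the softmax policy pi(a|s) = exp(Q(s,a) - V(s)) for reward r over a
    transition tensor T[s, a, s']."""
    S, A, _ = T.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = r[:, None] + gamma * (T @ V)       # Q[s, a]
        V = np.log(np.exp(Q).sum(axis=1))      # soft (log-sum-exp) backup
    pi = np.exp(Q - V[:, None])                # softmax policy, rows sum to 1
    return pi, V

# Toy 2-state, 2-action MDP: action 0 stays, action 1 moves to the other state.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = T[1, 0, 1] = 1.0
T[0, 1, 1] = T[1, 1, 0] = 1.0
r = np.array([0.0, 1.0])                       # state 1 is rewarding
pi, V = soft_value_iteration(T, r)
```

In the toy MDP, the resulting policy prefers actions that lead toward the rewarding state while keeping stochasticity, as the entropy term dictates.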
4 Learner-aware Teaching under Known Constraints
In this section, we analyze the setting when the teacher has full knowledge of the learner’s constraints.
4.1 A Learner-aware Teacher for Hard Preferences: AwareCMDP
Here, we design a learner-aware teaching algorithm for the learner from Section 3.1. Given that the teacher has full knowledge of the learner’s preferences, it can compute an optimal teaching policy by maximizing the reward over policies that satisfy the learner’s preference constraints, i.e., the teacher solves a constrained-MDP problem (see De (1960); Altman (1999)) given by

$\max_{\pi \in \Pi} \; R(\pi) \quad \text{s.t.} \quad \mu_B(\pi) \in C.$

We refer to an optimal solution of this problem as $\pi^{T,C}$ and to the corresponding teacher as AwareCMDP. We can make the following observation formalizing the value of learner-aware teaching:
Theorem 1.
For simplicity, assume that the teacher can provide the exact feature expectations of a policy instead of providing demonstrations to the learner. Then, the value of learner-aware teaching is:
When the constraint set $C$ is defined via a set of linear constraints, the above problem can be formulated as a linear program and solved exactly; details are provided in Appendix E.
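The linear program over discounted state-action occupancy measures can be sketched as follows, assuming SciPy’s `linprog` is available; the single linear box-type preference constraint and the toy MDP are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import linprog

def solve_cmdp(T, r, phi_B, b, gamma, P0):
    """Solve max_pi R(pi) s.t. mu_B(pi) <= b as an LP over discounted
    occupancy measures x[s, a].
    Flow: sum_a x[s',a] - gamma * sum_{s,a} T[s,a,s'] x[s,a] = P0[s'].
    Preference features: mu_B = sum_{s,a} x[s,a] * phi_B[s]."""
    S, A, _ = T.shape
    n = S * A
    c = -np.repeat(r, A)                 # linprog minimizes; we maximize sum x*r
    A_eq = np.zeros((S, n))
    for s2 in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[s2, s * A + a] = float(s == s2) - gamma * T[s, a, s2]
    A_ub = np.zeros((phi_B.shape[1], n))
    for s in range(S):
        for a in range(A):
            A_ub[:, s * A + a] = phi_B[s]
    res = linprog(c, A_ub=A_ub, b_ub=b, A_eq=A_eq, b_eq=P0, bounds=(0, None))
    x = res.x.reshape(S, A)
    pi = x / x.sum(axis=1, keepdims=True)  # normalize occupancies to a policy
    return pi, -res.fun

# Toy 2-state MDP: action 0 stays, action 1 switches; state 1 is rewarding but
# its discounted occupancy is capped at 1.2 by the preference constraint.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = T[1, 0, 1] = 1.0
T[0, 1, 1] = T[1, 1, 0] = 1.0
r = np.array([0.0, 1.0])
phi_B = np.array([[0.0], [1.0]])
pi_cmdp, val = solve_cmdp(T, r, phi_B, np.array([1.2]), 0.5, np.array([0.5, 0.5]))
```

In the toy instance the unconstrained optimum would spend more discounted time in the rewarding state than the preference cap allows, so the constraint binds and the optimal value equals the cap.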
4.2 A Learner-aware Teacher for Soft Preferences via Bilevel Optimization: AwareBiL
For the learner models in Section 3, the optimal learner-aware teaching problem can be naturally formalized as the following bilevel optimization problem:

(3)  $\max_{\pi^T \in \Pi} \; R(\pi^L) \quad \text{s.t.} \quad \pi^L \in \textsc{IRL}(\pi^T),$

where $\textsc{IRL}(\pi^T)$ stands for the IRL problem solved by the learner given demonstrations from $\pi^T$ and can include preferences of the learner (see Eq. 1 in Section 3).
There are many possibilities for solving this bilevel optimization problem; see, for example, Sinha et al. (2018) for an overview. In this paper we adopted a single-level reduction approach to simplify the above bilevel optimization problem, as this results in particularly intuitive optimization problems for the teacher. The basic idea of single-level reduction is to replace the lower-level problem, i.e., the learner’s IRL problem, by the optimality conditions for that problem given by the Karush-Kuhn-Tucker conditions Boyd and Vandenberghe (2004); Sinha et al. (2018). For the learner model outlined in Section 3.2, these reductions take the following form (see Appendix F in the supplementary material for details):
(4)  $\max_{w} \; R(\pi_w)$

s.t.  $\pi_w$ satisfies the KKT conditions of the learner’s problem,

where $\pi_w$ corresponds to a softmax policy with a reward function $\langle w, \phi_{AB}(s) \rangle$ for $w \in \mathbb{R}^d$. Thus, finding optimal demonstrations means optimizing over softmax teaching policies while respecting the learner’s preferences. To actually solve the above optimization problem and find good teaching policies, we use an approach inspired by the Frank-Wolfe algorithm Jaggi (2013), detailed in Appendix F. We refer to a teacher implementing this approach as AwareBiL.
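A generic conditional-gradient (Frank-Wolfe) loop of the kind referenced above can be sketched as follows; the concave toy objective and the simplex oracle are illustrative stand-ins for the teacher’s actual objective and feasible set:

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, n_iters=2000):
    """Frank-Wolfe / conditional gradient (Jaggi, 2013): maximize a concave
    objective over a convex set, given only a linear maximization oracle
    `lmo(g)` returning argmax_{s in feasible set} <g, s>."""
    x = x0
    for t in range(n_iters):
        s = lmo(grad(x))                   # best feasible point for the
        x = x + 2.0 / (t + 2.0) * (s - x)  # linearized objective; step 2/(t+2)
    return x

# Toy use: maximize -||x - c||^2 over the probability simplex.
c = np.array([0.2, 0.7, 0.1])
grad = lambda x: -2.0 * (x - c)
lmo = lambda g: np.eye(3)[int(np.argmax(g))]   # simplex vertex maximizing <g, s>
x_star = frank_wolfe(grad, lmo, x0=np.ones(3) / 3)
```

The appeal in this setting is that each iteration only requires solving a linear problem over the feasible set, which for MDP feature expectations is a standard planning problem.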
5 Learner-Aware Teaching under Unknown Constraints
In this section, we consider the more realistic and challenging setting in which the teacher $\mathcal{T}$ does not know the learner $\mathcal{L}$’s constraint set $C$. Without feedback from $\mathcal{L}$, $\mathcal{T}$ can generally not do better than the agnostic teacher, who simply ignores any constraints. We therefore assume that $\mathcal{T}$ and $\mathcal{L}$ interact in rounds as described by Algorithm 1. The two versions of the algorithm we describe in Sections 5.1 and 5.2 are obtained by specifying how $\mathcal{T}$ adapts the teaching policy in each round.
In this section, we assume that $\mathcal{L}$ is as described in Section 3.1: given demonstrations with feature expectations $\hat{\mu}_A$, $\mathcal{L}$ finds a policy $\pi^L$ such that $\mu_A(\pi^L)$ matches the projection of $\hat{\mu}_A$ onto $\mathcal{M}_A^C$. For the sake of simplifying the presentation and the analysis, we also assume that $\mathcal{T}$ and $\mathcal{L}$ can observe the exact feature expectations of their respective policies, e.g., $\mu_A(\pi^T)$ instead of $\hat{\mu}_A(\Xi)$ if $\Xi$ is sampled from $\pi^T$.
5.1 An Adaptive Learner-aware Teacher Using Volume Search: AdAwareVol
In our first adaptive teaching algorithm, AdAwareVol, $\mathcal{T}$ maintains an estimate $\hat{C}_t$ of the learner’s constraint set, which in each round gets updated by intersecting the current estimate with a certain affine halfspace, thus reducing its volume. The new teaching policy is then any policy which is optimal under the constraint that its feature expectations lie in $\hat{C}_t$. The interaction ends as soon as $\|\mu(\pi^T_t) - \mu(\pi^L_t)\| \le \epsilon$ for a threshold $\epsilon > 0$. Details on the algorithm are provided in Appendix B.1.

Theorem 2.
Upon termination of AdAwareVol, $\mathcal{L}$’s output policy $\pi^L$ satisfies $R(\pi^L) \ge R(\pi^C) - O(\epsilon)$ for any policy $\pi^C$ which is optimal under $\mathcal{L}$’s constraints. For the special case that $C$ is a polytope defined by $K$ linear inequalities, the algorithm terminates in $O(K)$ iterations.
5.2 An Adaptive Learner-aware Teacher Using Line Search: AdAwareLin
In our second adaptive teaching algorithm, AdAwareLin, $\mathcal{T}$ adapts the teaching policy by performing a binary search on a line segment of the form $\{\mu(\pi^L_t) + \alpha\,(\mu(\pi^*) - \mu(\pi^L_t)) : \alpha \in [\alpha_{\min}, \alpha_{\max}]\}$ to find a vector that is the vector of feature expectations of a policy; here $\alpha_{\min}$ and $\alpha_{\max}$ are fixed constants. If that is not successful, the teacher finds a teaching policy whose feature expectations are as close as possible to this line segment. The following theorem analyzes the convergence of $\mathcal{L}$’s performance to the optimum achievable under her constraints, under the assumption that $\mathcal{T}$’s search succeeds in every round. Further details and the proof of the theorem are provided in Appendix B.2.
Theorem 3.
Fix some and assume that there exists a constant such that, as long as , the teacher can find a teaching policy satisfying for some . Then the learner’s performance increases monotonically in each round of AdAwareLin, i.e., . Moreover, after at most teaching steps, the learner’s performance satisfies . Here we abbreviate .
6 Experimental Evaluation
In this section we evaluate our teaching algorithms for different types of learners on the environment introduced in Figure 1. The environment we consider here has three types of reward objects, i.e., “star” objects, “plus” objects, and “dot” objects, with distinct rewards. Two objects of each type are placed randomly on the grid. Furthermore, there are two types of distractors: (i) two “green” distractors are randomly placed at a distance of 0 cells and 1 cell from the “star” objects; (ii) two “yellow” distractors are randomly placed at a distance of 1 cell and 2 cells from the “plus” objects, see Figure 1(a). We have a total of six preference features ($d_B = 6$) as follows: the first three features in $\phi_B$ are binary indicators of whether there is a “green” distractor at a distance of 0 cells, 1 cell, or 2 cells; similarly, the next three features are binary indicators for the “yellow” distractors. We use a discount factor of $\gamma$. Upon collecting an object, there is a probability of transitioning to a terminal state.
We consider the learner with soft constraints from Section 3.2. We have a total of five different learners depending on the preference features used by them out of the six preference features discussed above, see Figure 1(a): the L1 learner has no preference features, the L2 learner has the first two preference features (i.e., $d_B = 2$), the L3 learner has the first four preference features, the L4 learner has the first five preference features, and the L5 learner has all six preference features. The first row in Figure 2 shows the considered object-worlds and indicates the preference of the learners to avoid certain regions by the gray area.
Table 1: Reward achieved by the learners L1–L5 after teaching by the learner-agnostic teacher (Agnostic) and the learner-aware teacher (AwareBiL).
6.1 Teaching under known constraints
Our first set of results is presented in Figure 2. The second and third rows show the reward function inferred by the learner from demonstrations provided by a learner-agnostic teacher (Agnostic) and by the bilevel learner-aware teacher (AwareBiL), respectively. We observe that Agnostic fails to teach the learner about the objects’ positive rewards in cases where the learners’ preferences conflict with the positions of the most rewarding objects (second row). In contrast, AwareBiL always successfully teaches the learners about rewarding objects that are compatible with the learners’ preferences (third row).
We also compare Agnostic and AwareBiL in terms of the reward achieved by the learner after teaching in Table 1. The numbers show the average reward over 10 randomly generated object-worlds. We observe that a learner can learn better policies from a teacher that knows about the learner’s preferences and takes them into account.
6.2 Teaching under unknown constraints
In this section we evaluate the teaching algorithms from Section 5. We consider the learner model from Section 3.1, which uses projection to match reward feature expectations, as studied in Section 5. Here, we study the learner who considers the first two preference features (i.e., $d_B = 2$): this corresponds to an object-world similar to that in Figure 2 (learner L2 with the second gridworld). For modeling the hard constraints, we consider box-type linear constraints for these two preference features (see also Eq. 2).
In this context it is instructive to investigate how quickly these adaptive teaching strategies converge to the performance of a teacher who would have full knowledge about the learner. Results comparing the adaptive teaching strategies (AdAwareVol and AdAwareLin) are shown in Figure 3. We can observe that both teaching strategies converge to the best possible performance under full knowledge about the learner (AwareCMDP).
We also provide results showing the performance achieved by the adaptive teaching strategies on object-worlds of varying sizes, see Figure 3. Note that the performance of AdAwareVol decreases slightly when teaching for more rounds, i.e., when comparing the results after 3 teaching rounds with those at the end of the teaching process. This is because of approximations made when the learner computes her policy via projection, which in turn leads to errors on the teacher’s side when approximating the learner’s constraint set. In contrast, the performance of AdAwareLin always increases when teaching for more rounds.

7 Related Work
Our work is closely related to algorithmic machine teaching Goldman and Kearns (1995); Zhu et al. (2018), whose general goal is to design teaching algorithms that optimize the data that is provided to a learning algorithm. Algorithmic teaching provides a rigorous formalism for a number of real-world applications such as personalized education and intelligent tutoring systems Patil et al. (2014); Zhu (2015); Rafferty et al. (2016); Hunziker et al. (2018), social robotics Cakmak and Thomaz (2014), and human-in-the-loop systems Singla et al. (2013, 2014).
Most works in machine teaching so far focus on supervised learning tasks and assume that the learning algorithm is fully known to the teacher, see e.g., Zhu (2013); Singla et al. (2014); Liu and Zhu (2016); Mac Aodha et al. (2018). In the IRL setting, a few works study how to provide maximally informative demonstrations to the learner, e.g., Cakmak and Lopes (2012); Brown and Niekum (2019). In contrast to our work, their teacher fully knows the learner model and provides the demonstrations without any adaptation to the learner. The question of how a teacher should adaptively react to a learner has been addressed by Liu et al. (2018); Chen et al. (2018); Melo et al. (2018); Yeo et al. (2019), but only in the supervised setting. In recent work, Kamalaruban et al. (2019) study interactive teaching algorithms for an IRL learner; however, they consider a sequential learner, and there is no notion of the learner’s preferences and constraints in their setting.

Within the area of IRL, there is a line of work on active learning approaches Cohn et al. (2011); Brown et al. (2018); Amin et al. (2017); Cui and Niekum (2018), which is related to our work in the sense that they consider the question of how to optimize demonstrations for a given learner. In contrast to us, they take the perspective of the learner, who actively influences the demonstrations she receives. A few papers have addressed aspects of the problem that arises in IRL when the learner does not have full access to the reward features, e.g., Levine et al. (2010) and Haug et al. (2018).

Our work is also loosely related to multi-agent reinforcement learning. Dimitrakakis et al. (2017) study the interaction between agents with misaligned models, with a focus on the question of how to jointly optimize a policy. Also, Hadfield-Menell et al. (2016) study the cooperation between agents which do not perfectly understand each other.
8 Conclusions
In the context of inverse reinforcement learning, we investigated the important problem of interacting with learners that have preferences and constraints which prevent them from closely approximating the teacher’s demonstrations. We demonstrated the suboptimality of learner-agnostic teaching and proposed learner-aware teaching strategies for known and unknown preferences of the learner. In future work, we will evaluate our approach in machine-human and human-machine tasks and extend it to other learner models.
References
 Abbeel and Ng (2004) Abbeel, P. and Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In ICML.
 Altman (1999) Altman, E. (1999). Constrained Markov decision processes, volume 7. CRC Press.
 Amin et al. (2017) Amin, K., Jiang, N., and Singh, S. P. (2017). Repeated inverse reinforcement learning. In NIPS, pages 1813–1822.
 Boularias et al. (2011) Boularias, A., Kober, J., and Peters, J. (2011). Relative entropy inverse reinforcement learning. In AISTATS, pages 182–189.
 Boyd and Vandenberghe (2004) Boyd, S. and Vandenberghe, L. (2004). Convex optimization. Cambridge university press.
 Brown et al. (2018) Brown, D. S., Cui, Y., and Niekum, S. (2018). RiskAware Active Inverse Reinforcement Learning. In Conference on Robot Learning, pages 362–372.
 Brown and Niekum (2019) Brown, D. S. and Niekum, S. (2019). Machine teaching for inverse reinforcement learning: Algorithms and applications. In AAAI.
 Cakmak and Lopes (2012) Cakmak, M. and Lopes, M. (2012). Algorithmic and human teaching of sequential decision tasks. In AAAI.
 Cakmak and Thomaz (2014) Cakmak, M. and Thomaz, A. L. (2014). Eliciting good teaching from humans for machine learners. Artificial Intelligence, 217:198–215.
 Chen et al. (2018) Chen, Y., Singla, A., Mac Aodha, O., Perona, P., and Yue, Y. (2018). Understanding the role of adaptivity in machine teaching: The case of version space learners. In Advances in Neural Information Processing Systems, NeurIPS’18, pages 1476–1486.
 Cohn et al. (2011) Cohn, R., Durfee, E., and Singh, S. (2011). Comparing Actionquery Strategies in Semiautonomous Agents. In AAMAS, pages 1287–1288, Richland, SC.
 Cui and Niekum (2018) Cui, Y. and Niekum, S. (2018). Active reward learning from critiques. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6907–6914. IEEE.
 De (1960) De, G. G. (1960). Les problèmes de décisions séquentielles. Cahiers du Centre d’Études de Recherche Opérationnelle, 2:161–179.
 Dimitrakakis et al. (2017) Dimitrakakis, C., Parkes, D. C., Radanovic, G., and Tylkin, P. (2017). Multiview decision processes: the helperai problem. In Advances in Neural Information Processing Systems, pages 5443–5452.

 Dudík et al. (2007) Dudík, M., Phillips, S. J., and Schapire, R. E. (2007). Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. Journal of Machine Learning Research, 8:1217–1260.
 Goldman and Kearns (1995) Goldman, S. A. and Kearns, M. J. (1995). On the complexity of teaching. Journal of Computer and System Sciences, 50(1):20–31.
 Hadfield-Menell et al. (2016) Hadfield-Menell, D., Russell, S. J., Abbeel, P., and Dragan, A. (2016). Cooperative inverse reinforcement learning. In NIPS.
 Haug et al. (2018) Haug, L., Tschiatschek, S., and Singla, A. (2018). Teaching inverse reinforcement learners via features and demonstrations. In Advances in Neural Information Processing Systems, NeurIPS’18, pages 8464–8473.
 Hunziker et al. (2018) Hunziker, A., Chen, Y., Mac Aodha, O., GomezRodriguez, M., Krause, A., Perona, P., Yue, Y., and Singla, A. (2018). Teaching multiple concepts to a forgetful learner. CoRR, abs/1805.08322.
 Jaggi (2013) Jaggi, M. (2013). Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, pages 427–435.
 Kamalaruban et al. (2019) Kamalaruban, P., Devidze, R., Cevher, V., and Singla, A. (2019). Interactive teaching algorithms for inverse reinforcement learning. In IJCAI.
 Kazama and Tsujii (2005) Kazama, J. and Tsujii, J. (2005). Maximum entropy models with inequality constraints: A case study on text categorization. Machine Learning, 60(1–3):159–194.
 Leibo et al. (2017) Leibo, J. Z., Zambaldi, V., Lanctot, M., Marecki, J., and Graepel, T. (2017). Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 464–473.
 Levine et al. (2010) Levine, S., Popovic, Z., and Koltun, V. (2010). Feature construction for inverse reinforcement learning. In NIPS, pages 1342–1350.
 Liu and Zhu (2016) Liu, J. and Zhu, X. (2016). The teaching dimension of linear learners. Journal of Machine Learning Research, 17(162):1–25.
 Liu et al. (2018) Liu, W., Dai, B., Li, X., Rehg, J. M., and Song, L. (2018). Towards black-box iterative machine teaching. In ICML.

 Mac Aodha et al. (2018) Mac Aodha, O., Su, S., Chen, Y., Perona, P., and Yue, Y. (2018). Teaching categories to human learners with visual explanations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3820–3828.
 Melo et al. (2018) Melo, F. S., Guerra, C., and Lopes, M. (2018). Interactive optimal teaching with unknown learners. In IJCAI, pages 2567–2573.
 Mendez et al. (2018) Mendez, J. A. M., Shivkumar, S., and Eaton, E. (2018). Lifelong inverse reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS) 2018, pages 4507–4518.

 Osa et al. (2018) Osa, T., Pajarinen, J., Neumann, G., Bagnell, J. A., Abbeel, P., Peters, J., et al. (2018). An algorithmic perspective on imitation learning. Foundations and Trends® in Robotics, 7(1–2):1–179.
 Patil et al. (2014) Patil, K. R., Zhu, X., Kopeć, Ł., and Love, B. C. (2014). Optimal teaching for limited-capacity human learners. In NIPS, pages 2465–2473.
 Rafferty et al. (2016) Rafferty, A. N., Brunskill, E., Griffiths, T. L., and Shafto, P. (2016). Faster teaching via POMDP planning. Cognitive Science, 40(6):1290–1332.
 Ratliff et al. (2006) Ratliff, N. D., Bagnell, J. A., and Zinkevich, M. A. (2006). Maximum margin planning. In ICML, pages 729–736.

 Singla et al. (2013) Singla, A., Bogunovic, I., Bartók, G., Karbasi, A., and Krause, A. (2013). On actively teaching the crowd to classify. In NIPS Workshop on Data Driven Education.
 Singla et al. (2014) Singla, A., Bogunovic, I., Bartók, G., Karbasi, A., and Krause, A. (2014). Near-optimally teaching the crowd to classify. In ICML.

 Sinha et al. (2018) Sinha, A., Malo, P., and Deb, K. (2018). A review on bilevel optimization: from classical to evolutionary approaches and applications. IEEE Transactions on Evolutionary Computation, 22(2):276–295.
 Yeo et al. (2019) Yeo, T., Kamalaruban, P., Singla, A., Merchant, A., Asselborn, T., Faucon, L., Dillenbourg, P., and Cevher, V. (2019). Iterative classroom teaching. In AAAI.
 Zhou et al. (2018) Zhou, Z., Bloem, M., and Bambos, N. (2018). Infinite time horizon maximum causal entropy inverse reinforcement learning. IEEE Transactions on Automatic Control, 63(9):2787–2802.
 Zhu (2013) Zhu, X. (2013). Machine teaching for bayesian learners in the exponential family. In NIPS, pages 1905–1913.
 Zhu (2015) Zhu, X. (2015). Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In AAAI, pages 4083–4087.
 Zhu et al. (2018) Zhu, X., Singla, A., Zilles, S., and Rafferty, A. N. (2018). An overview of machine teaching. CoRR, abs/1801.05927.
 Ziebart (2010) Ziebart, B. D. (2010). Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, Carnegie Mellon University.
 Ziebart et al. (2013) Ziebart, B. D., Bagnell, J. A., and Dey, A. K. (2013). The principle of maximum causal entropy for estimating interacting processes. IEEE Transactions on Information Theory, 59(4):1966–1980.
 Ziebart et al. (2008) Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In AAAI.
Appendix A List of Appendices
In this section we provide a brief description of the content provided in the appendices of the paper.
Appendix B Details for Learner-Aware Teaching under Unknown Constraints (Section 5)
In this appendix, we provide more details on the adaptive teaching algorithms AdAwareVol and AdAwareLin described in Sections 5.1 and 5.2. Recall that both teaching algorithms are obtained from Algorithm 1 by specifying the way in which the teacher adapts the teaching policy based on the learner $\mathcal{L}$’s feature expectations in past rounds.
B.1 Details for AdAwareVol (Section 5.1)
Estimation of the learner’s constraint set.
In AdAwareVol, $\mathcal{T}$ maintains an estimate $\hat{C}_t$ of $\mathcal{L}$’s constraint set, starting with an initial estimate $\hat{C}_0 \supseteq C$. After observing the feature expectations $\mu(\pi^L_t)$ of the policy found by $\mathcal{L}$ in round $t$, $\mathcal{T}$ updates this estimate as follows:
(5)  $\hat{C}_{t+1} := \hat{C}_t \cap \big\{ u : \langle \mu(\pi^T_t) - \mu(\pi^L_t), \; u - \mu(\pi^L_t) \rangle \le 0 \big\}$
The set on the right-hand side of (5) with which $\hat{C}_t$ gets intersected is a halfspace containing $C$. This is due to the fact that $C$ is convex by assumption, and to our assumption that $\mathcal{L}$’s learning algorithm is such that it outputs a policy whose feature expectations match the projection of $\mu(\pi^T_t)$ onto the constraint set. Inductively, it follows that $C \subseteq \hat{C}_t$ for all $t$.
In practice, we implement a slightly modified version of the update step, in which we intersect with a halfspace that is shifted in the direction of $\mu(\pi^T_t)$ by a small amount, i.e., we use $\big\{ u : \langle \mu(\pi^T_t) - \mu(\pi^L_t), \; u - \mu(\pi^L_t) - \delta\,(\mu(\pi^T_t) - \mu(\pi^L_t)) \rangle \le 0 \big\}$ with a step size parameter $\delta > 0$. This helps make the algorithm more robust to noise in the learner’s feature expectations. In our experiments, we used a small value of $\delta$.
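One such update can be sketched as follows: because the learner’s reply is the projection of the teacher’s feature expectations onto a convex constraint set, the constraint set lies entirely in the halfspace on the learner’s side of the cut. (The two-dimensional example set is an illustrative assumption.)

```python
import numpy as np

def halfspace_cut(mu_L, mu_T, delta=0.0):
    """One AdAwareVol-style cut: since mu_L is the projection of mu_T onto the
    learner's convex constraint set C, C is contained in the halfspace
    {u : <mu_T - mu_L, u - mu_L> <= 0}. Returns (normal, offset), with an
    optional robustness shift delta toward mu_T."""
    normal = mu_T - mu_L
    anchor = mu_L + delta * normal
    return normal, float(normal @ anchor)

def in_estimate(u, cuts):
    """Check membership of u in the current estimate (intersection of cuts)."""
    return all(n @ u <= b + 1e-9 for n, b in cuts)

mu_T = np.array([1.0, 1.0])   # teacher's feature expectations
mu_L = np.array([1.0, 0.5])   # learner's reply: projection onto C = {u : u_2 <= 0.5}
cuts = [halfspace_cut(mu_L, mu_T)]
```

Each round adds one cut, so the estimate is always an intersection of halfspaces containing the true constraint set.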
Update of the teaching policy.
After updating the estimate of the learner’s constraint set to $\hat{C}_{t+1}$, $\mathcal{T}$ solves a constrained MDP in order to find

$\pi^T_{t+1} \in \arg\max_{\pi \in \Pi} \; R(\pi) \quad \text{s.t.} \quad \mu(\pi) \in \hat{C}_{t+1}.$

Given that $\hat{C}_{t+1}$ is cut out by linear inequalities, solving the constrained MDP reduces to solving an LP, as described in Appendix E.
Termination of the interaction.
The algorithm terminates as soon as the stopping criterion $\|\mu(\pi^T_t) - \mu(\pi^L_t)\| \le \epsilon$ is satisfied. Note that $\|\mu(\pi^T_t) - \mu(\pi^L_t)\| \le \epsilon$ implies that $R(\pi^L_t) \ge R(\pi) - O(\epsilon)$ for any $\pi$ with $\mu(\pi) \in \hat{C}_t$, since $\pi^T_t$ is optimal among all such policies. Therefore, after termination we have $R(\pi^L_t) \ge R(\pi^C) - O(\epsilon)$ for any policy $\pi^C$ which is optimal under $\mathcal{L}$’s constraints, which is the first statement of Theorem 2.

The second statement of Theorem 2 follows from the fact that if $C$ is a convex polytope cut out by $K$ linear inequalities, the number of faces, which is in $O(K)$, is an upper bound on the number of iterations of the algorithm, because one face is “eliminated” in each round.
B.2 Details for AdAwareLin (Section 5.2)
In AdAwareLin, $\mathcal{T}$ updates the teaching policy based on $\mathcal{L}$’s feature expectations from the previous round. To do so, $\mathcal{T}$ uses LineSearch (Algorithm 2) to perform a binary search on the line segment

(6)  $\big\{ \mu(\pi^L_t) + \alpha\,(\mu(\pi^*) - \mu(\pi^L_t)) : \alpha \in [\alpha_{\min}, \alpha_{\max}] \big\}$

in order to find a vector that is realizable as the vector of feature expectations of a policy. If the intersection of the line segment (6) with the set of realizable feature expectations is nonempty, it is of the form $\big\{ \mu(\pi^L_t) + \alpha\,(\mu(\pi^*) - \mu(\pi^L_t)) : \alpha \in [\alpha_{\min}, \alpha^{+}] \big\}$ for some $\alpha^{+} \le \alpha_{\max}$ due to the convexity of that set. In that case, LineSearch returns a policy with feature expectations $\mu(\pi^L_t) + \alpha^{+}\,(\mu(\pi^*) - \mu(\pi^L_t))$, where $\alpha^{+}$ is the maximal such $\alpha$. If the intersection is empty, LineSearch returns a policy whose feature expectations are as close as possible to the segment.
Figure 4 illustrates the two cases that may occur.
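The binary search itself only needs a membership oracle for a convex set. A sketch (the unit-disk oracle is an illustrative assumption; LineSearch additionally handles the empty-intersection case):

```python
def line_search_alpha(p, q, member, tol=1e-6):
    """Binary search for the largest alpha in [0, 1] such that
    p + alpha * (q - p) lies in a convex set, given a membership oracle
    `member`; assumes p itself is in the set."""
    lo, hi = 0.0, 1.0
    if member(q):
        return 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        pt = [pi + mid * (qi - pi) for pi, qi in zip(p, q)]
        if member(pt):
            lo = mid    # still inside: move toward q
        else:
            hi = mid    # outside: move toward p
    return lo

# Toy example: the set is the unit disk; p is the origin, q = (2, 0).
member = lambda u: u[0] ** 2 + u[1] ** 2 <= 1.0
alpha = line_search_alpha([0.0, 0.0], [2.0, 0.0], member)
```

Convexity of the set guarantees that the feasible alphas form an interval starting at 0, so the binary search converges to the boundary crossing.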
B.2.1 Proof of Theorem 3
In this section, we provide the proof of Theorem 3, which gives a guarantee on the improvement of $\mathcal{L}$’s performance in each round of the AdAwareLin algorithm. The assumption we make here is that, in every teaching round $t$, LineSearch returns a teaching policy such that $\mu(\pi^T_{t+1}) = \mu(\pi^L_t) + \alpha\,(\mu(\pi^*) - \mu(\pi^L_t))$ for some $\alpha \ge c$, where $c > 0$ is a fixed constant. It is easy to see that this assumption, together with our assumption on $\mathcal{L}$’s algorithm and the convexity of $C$, implies that the change in learner performance $R(\pi^L_{t+1}) - R(\pi^L_t)$ is nonnegative in every teaching round. The following proposition, which will be needed in the proof of Theorem 3, strengthens this statement:
Proposition 1.
Let $R^{\max}$ denote the maximally achievable learner performance. Assume that, in teaching round $t$, $\mathcal{T}$ can find a teaching policy whose feature expectations satisfy $\mu(\pi^T_{t+1}) = \mu(\pi^L_t) + \alpha\,(\mu(\pi^*) - \mu(\pi^L_t))$ for some $\alpha \ge c$. Then
(7) 
where .
Proof of Proposition 1.
Consider the plane spanned by and and denote by the unique point in with the properties that

,

lies on the same side of the line through and as , and

and span a right triangle with at the rightangled corner.
Note that must lie inside this triangle, i.e., on the red line segment in Figure 5: Otherwise there would be a point on the line segment connecting and , and hence in by convexity, which is closer to than , contradicting the fact that is closest to among all points in . Denote by the line passing through and .
The facts that is convex and that imply that
must lie on one side of the hyperplane
Therefore, we can upper bound in terms of the slope of the line which arises by intersecting that hyperplane with :
(8) 
Note that the slope is upper bounded by the slope of . We have , where is the length of the red line segment in Figure 5, and by Pythagoras’s theorem. Using that, we obtain
(9) 
The claimed estimate (7) follows by plugging this upper bound for into (8) and rearranging. ∎