1 Introduction
In the reinforcement learning (RL) framework [10], data-efficient approaches are especially important for real-world and commercial applications, such as robotics. In such domains, extensive interaction with the environment takes time and can be costly.
One data-efficient approach for RL is transfer learning (TL) [11]. Typically, when an RL agent leverages TL, it uses knowledge acquired in one or more (source) tasks to speed up its learning in a more complex (target) task. Most realistic TL settings require transfer of knowledge between different tasks or heterogeneous agents that can be vastly different from each other (e.g., humans and software agents).
Transferring between heterogeneous agents is often challenging since most methodologies involve exploiting the agents’ structural similarity to transfer knowledge between tasks. As an example, TL can be applied between two similar RL agents, which both use the same function approximation method, by transferring their learned parameters. In such a case, a Q-value transfer solution could be used, combined with an algorithm constructing mappings between the state variables of the two tasks.
Whereas solutions for extracting similarity between tasks have been extensively studied in the past [11, 3], the main problem of transferring between very dissimilar agents (e.g., humans and software agents) remains.
Consider, for example, a game hint system for human players. The game hint system cannot directly transfer its internal knowledge to the human player. Moreover, it should transfer knowledge in a limited and prioritized way, since the attention span of humans is limited.
The only prominent knowledge transfer unit between all agents (software, physical or biological) is action. Action suggestion (advice) can be understood by very different agents. However, even when transferring using advice, four problems arise:

Decide what to advise (production of advice)

Decide when to advise (distribution of advice), especially when using a limited advice budget

Determine a common action language in order to appropriately express the advice between heterogeneous agents

Communicate the advice effectively, ensuring its timely and noiseless reception
This article focuses on the first two problems—those of deciding when and what to advise under a budget. Moreover, we use the game of PacMan to test our methods’ effectiveness in a complex domain.
Whereas works such as [16] provide a formal understanding of RL students receiving advice and the implications on the student’s learning process (e.g., convergence properties) and papers like [17] and [13] provide practical methods for a teacher to advise agents, this work attempts a new learning formulation of the problem and proposes a novel learning algorithm based on it. We identify and exploit the similarities of the advising under a budget (AuB) problem to the classic exploration-exploitation problem in RL and identify a subclass of reinforcement learning problems: Constrained Exploitation Reinforcement Learning.
Most successful methodologies for AuB require students to inform their teacher of their intended action. This is not a realistic requirement in many real-world TL problems, since it assumes one more communication channel between the student and the teacher and thus requires some form of structural compliance from the student. The game hint system illustrates how restrictive this requirement is for real-world applications: the system advises the human player on the next action in real time, but the human player could never be expected to announce an intended action beforehand. Part of this work’s goal is to remove this prerequisite and propose methods that can also work without such knowledge.
Specifically, the contributions of this article are:

An empirical study on determining an appropriate advising policy in the game of PacMan

A novel application of average reward reinforcement learning to produce advice

A novel formulation of the learning to advise under budget (AuB) problem as a problem of constrained exploitation RL

A novel RL algorithm for learning a teaching policy to distribute advice, able to train faster (lower data complexity) than previous learning approaches and advise even when not having knowledge of the student’s intended action
2 Background
This section presents the necessary background to understand the methods proposed in this article. Brief introductions are provided to reinforcement learning and transfer learning, which are then followed by a more detailed discussion of the current advising methodologies.
2.1 Reinforcement Learning
Reinforcement Learning addresses the problem of how an agent can learn a behaviour through trial-and-error interactions with a dynamic environment [10]. In an RL task the agent, at each time step $t$, senses the environment’s state, $s_t \in S$, where $S$ is the finite set of possible states, and selects an action $a_t \in A(s_t)$ to execute, where $A(s_t)$ is the finite set of possible actions in state $s_t$. The agent receives a reward, $r_{t+1} \in \mathbb{R}$, and moves to a new state $s_{t+1}$ according to a transition function, $T$, of the task with $T: S \times A \rightarrow S$. The general goal of the agent is to maximize the expected return, where the return, $R_t$, is defined as some specific function of the reward sequence given also a discounting parameter, $\gamma$. The parameter $\gamma$, where $0 \leq \gamma \leq 1$, controls the importance of short-term rewards over long-term ones, discounting the latter by powers of $\gamma$.
The outcome will be an action-value function $Q^{\pi}(s, a)$ which expresses the expected return starting from $s$, taking action $a$, and thereafter following policy $\pi$, which dictates how the agent acts in a certain situation in order to maximize the reward received over time.
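To make the role of the discounting parameter concrete, the following minimal sketch computes a discounted return for a finite reward sequence, assuming the common form in which the reward received $k$ steps ahead is weighted by $\gamma^k$. The function name and the reward sequence are illustrative assumptions, not part of the original experiments.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over a finite reward sequence."""
    total = 0.0
    # Accumulate backwards so every earlier step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical reward sequence: two pellets, a power pellet, then an edible ghost.
rewards = [10, 10, 50, 200]
myopic = discounted_return(rewards, 0.05)      # later rewards barely count
farsighted = discounted_return(rewards, 0.95)  # later rewards dominate
```

With a small $\gamma$ the large reward at the end of the sequence contributes almost nothing, which is the sense in which such an agent is myopic.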
2.2 Transfer Learning and Advising under a Budget
Transfer Learning [11] refers to the process of using knowledge that has been acquired in a previously learned task, the source task, in order to enhance the learning procedure in a new and more complex task, the target task. The more similar these two tasks are, the easier it is to transfer knowledge between them. By similarity, we mean the similarity of their underlying Markov Decision Processes (MDPs), that is, the transition and reward functions of the two tasks and also their state and action spaces.
The type of knowledge that can be transferred between tasks varies among different TL methods, including value functions, entire policies, actions (policy advice) or a set of samples from a source task which can be used by a model-based RL algorithm in a target task.
Focusing specifically on policy advice under an advice budget constraint, we identify two aspects of the problem: a) learning a policy to produce advice and b) distributing the advice in the most appropriate way, while respecting the advice budget constraint. Most methods in the literature produce advice by greedily using a learned policy for the task at hand [13, 17, 16]. For advice distribution, most methods rely on some form of heuristic function (and not on learning), based on which the teacher decides when to advise. Examples of such methods are Importance Advice and Mistake Correcting [13].
The Importance Advice method produces advice by repeatedly querying a learned policy’s value function, on each state the student faces, to obtain the best action for that state. Distribution of advice, that is, deciding when to advise or not, is determined by a heuristic logical expression of the form $\max_a Q(s, a) - \min_a Q(s, a) > t$, where $t$ is a threshold parameter on the state-action value gap between the best and the worst action for that state. If this value gap exceeds the threshold value, $t$, the state is considered critical and advice is given. The algorithm continues until the advice budget is exhausted.
Mistake correcting (MC) [13] differs from Importance Advising only in presuming knowledge of the student’s intended action. Consequently, it validates the Importance Advising criterion only if the student action is wrong, not wasting advice when the student does not need it.
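The two heuristics can be sketched as follows. This is a minimal illustration, assuming the teacher's Q-values for the current state are available as a dictionary from actions to values and that state importance is the gap between the best and the worst action value. All function and parameter names are hypothetical.

```python
def importance(q_values):
    """State importance: gap between the best and the worst action value."""
    return max(q_values.values()) - min(q_values.values())


def importance_advice(q_values, budget, threshold):
    """Return (advice, remaining_budget); advice is None when the teacher stays silent."""
    if budget > 0 and importance(q_values) > threshold:
        best_action = max(q_values, key=q_values.get)
        return best_action, budget - 1
    return None, budget


def mistake_correcting(q_values, intended_action, budget, threshold):
    """Like importance_advice, but spend advice only when the student is about to err."""
    best_action = max(q_values, key=q_values.get)
    if intended_action == best_action:
        return None, budget  # the student already chose well; save the budget
    return importance_advice(q_values, budget, threshold)


# Hypothetical teacher Q-values for one state.
q = {"up": 5.0, "down": -3.0, "left": 1.0}
advice, remaining = mistake_correcting(q, "down", budget=2, threshold=4.0)
```

Note that `mistake_correcting` needs the student's intended action as an input, which is exactly the extra communication channel discussed in the introduction.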
The method presented in [17] (Zimmer’s method) formulates the teaching problem as an RL one in order to learn an advice distribution policy. The teacher agent has an action set with two actions, {advice, no advice}. The teacher’s state space is an augmented version of the student’s and consists of the current state vector of the student, the intended action of the student (it is assumed that the student announces the intended action on every step), the remaining advice budget and the student’s training episode number. Moreover, the reward signal is a transformed version of the student’s reward with an extra positive reward for the teacher when the student reaches its goal in a small number of steps. We note that this method is tested only on the Mountain Car domain and the reward signal proposed for the teacher is domain-dependent.
2.3 PacMan
The experimental domain for the teaching methods presented in this article is the game of PacMan. PacMan is a famous 1980s arcade game in which the player navigates a maze like the one in Figure 1, trying to earn points by touching edible items and trying to avoid being caught by the four ghosts. In our experiments, we use a Java implementation of the game provided by the Ms. PacMan vs. Ghosts League [5], which conducts annual competitions. Ghosts in our setting chase the player 80% of the time and choose actions randomly the remaining 20% of the time.
The player and all ghosts have four actions (move up, down, left, and right), but some actions are occasionally unavailable due to the restrictions in the maze. Four moves are required to travel between the small dots on the grid, which represent food pellets and are worth 10 points each. The larger dots are power pellets, which are worth 50 points each. When the player gets the larger ones, the ghosts become edible for a short time, during which they slow down and flee the player. Eating a ghost is worth 200 points (which doubles every time for the duration of a single power pill). Then the ghost respawns in the lair at the center of the maze. The episode ends if any ghost catches PacMan, or after 2000 steps.
This domain is discrete but has a very large state space. There are 1293 distinct locations in the maze, and a complete state consists of the locations of PacMan, the ghosts, the food pellets, and the power pills, along with each ghost’s previous move and whether or not it is edible. The combinatorial explosion of possible states makes it essential to approach this domain through high-level feature construction and Q-function approximation.
In this article, we follow previous work [13] that adopted a high-level feature set (high-asymptote feature set) comprised of action-specific features. When using action-specific features, a feature set is really a set of functions $\{f_1(s, a), \ldots, f_n(s, a)\}$. All actions share one Q-function, which associates a weight $w_i$ with each feature. A Q-value is $Q(s, a) = w_0 + \sum_i w_i f_i(s, a)$. To achieve gradient-descent convergence, it is important to have the extra bias weight $w_0$ and also to normalize the features to the range $[0, 1]$.
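A shared linear Q-function over action-specific features can be sketched as below, assuming the usual linear form in which a Q-value is a bias weight plus a weighted sum of the features computed for that state-action pair. The weights and feature values are made up for illustration and do not correspond to the feature set of [13].

```python
def q_value(weights, features):
    """Linear Q-value: weights[0] is the bias, weights[1:] pair up with the features."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))


def greedy_action(weights, features_by_action):
    """Pick the action whose action-specific feature vector scores highest."""
    return max(features_by_action,
               key=lambda action: q_value(weights, features_by_action[action]))


# Hypothetical weights (bias first) and normalized features in [0, 1].
weights = [0.5, 2.0, -1.0]
features_by_action = {
    "up": [1.0, 0.0],    # e.g. many pellets ahead, no ghost nearby
    "down": [0.0, 1.0],  # e.g. no pellets ahead, a ghost nearby
}
best = greedy_action(weights, features_by_action)
```

Because the weights are shared, all the action-specific information enters through the feature functions, which are evaluated once per candidate action.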
For the state representation, we define a feature set which consists of 7 features that count objects at a range of distances from PacMan in the maze, as we used (and defined) in previous work [13].
A perfect score in an episode is 5600 points, but this is quite difficult to achieve (for both human and agent players). An agent executing random actions earns an average of 250 points. The 7-feature set allows an agent to learn to catch some edible ghosts and achieve a per-episode average of 3800 points.
3 The Teaching Task
In this section we attempt a more formal understanding of a teaching task that is based on action advice. The necessary notation is presented in Table I.
3.1 Definitions
Definition 3.1.
Student. A student agent is an agent acting in an environment and capable of accepting advice from another agent. (Footnote 1: In this work we assume that a student agent always follows the given advice.)
Definition 3.2.
Teacher. A teacher agent is an agent capable of executing and informing a teaching policy (see Definition 3.7) to provide action advice to a student agent acting in a specific task.
Definition 3.3.
Acting Task. The acting task is the task for which the teacher gives advice and can be defined as an MDP of the form $\langle S, A, T, R \rangle$ on an environment $E$.
Definition 3.4.
Teaching Task. The teaching task is the task of providing action advice to a student agent to assist it in learning the acting task faster or better. Any teaching task is accompanied by a finite advice budget, $B$.
Definition 3.5.
Teaching Action Space. Given the action space $A$ of the acting task, the action space of the teacher in timestep $t$ is:
$$A^{\text{teach}}_t = \begin{cases} A \cup \{\bot\} & \text{if } b_t > 0 \\ \{\bot\} & \text{otherwise} \end{cases}$$
where $a \in A$ is an action of the acting task given as advice and $\bot$ is the no advice action, meaning that the teacher will not give advice in this step, allowing the student to act on its own. $b_t$ is the advice budget left in timestep $t$.
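One reading of Definition 3.5 is that the teacher may advise any acting-task action while budget remains and is restricted to the no advice action once the budget is spent. The sketch below encodes that reading. The sentinel `NO_ADVICE` and the function name are assumptions introduced for illustration.

```python
NO_ADVICE = "no_advice"  # hypothetical sentinel for the no advice action


def teaching_actions(acting_actions, budget_left):
    """Teacher's available actions at a timestep, given the remaining advice budget."""
    if budget_left > 0:
        # Any acting-task action may be given as advice, or the teacher may stay silent.
        return list(acting_actions) + [NO_ADVICE]
    # With an exhausted budget the teacher can only let the student act on its own.
    return [NO_ADVICE]


pacman_actions = ["up", "down", "left", "right"]
with_budget = teaching_actions(pacman_actions, budget_left=3)
without_budget = teaching_actions(pacman_actions, budget_left=0)
```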
Definition 3.6.
Teaching State Space. The teacher agent state space in timestep $t$ has the following form:
(1) 
where is a tuple containing any knowledge we can have for the student and its MDP in timestep . If the student’s MDP is and the teacher observes the current state of the student, , reward, and action then .
Definition 3.7.
The teaching policy actually transforms the acting policy, $\pi$, of an actor agent (expressed through its respective state-action value function, $Q^{\pi}$) into a policy producing advice under budget. We should also note that a teaching policy will usually ([13, 17, 16]) advise the action $\arg\max_{a} Q^{\pi}(s, a)$, which means that the teaching policy is greedy with respect to the acting value function, $Q^{\pi}$.
As a minimal example of the proposed formulation, the Importance Advising method [13] which uses the state importance criterion (see Section 2.2) can be said to use a teaching state space, as it requires knowledge only about the current state, , of the student, the advice budget, and an acting policy, from which it produces advice.
3.2 Learning to Teach
The definitions presented in subsection 3.1 apply to any teacher agent even if it advises based on a heuristic function. In the following, we focus on teachers that use RL to learn a teaching policy (i.e., advice distribution policy).
In its most simplified version, the learning to teach task employs two agents: the teacher and the student. In the first learning phase a teacher agent has the role of the actor: it learns the acting task alone. It observes a state space $S$ and has an action set $A$. Based on a reward signal received from the environment, it learns a policy $\pi$ to achieve the acting task goal. In our context, this first learning phase can be seen as the advice production phase since the teacher learns the policy that will be used to advise a student later on.
At any timestep the teacher agent may have to stop acting, and a new agent, the student, enters the acting task and the corresponding environment.
Consequently, the teacher agent now has to learn and use a teaching policy for the specific task to achieve the teaching goal. In addition to the definitions given in Section 3, this second learning phase (learning a teaching policy) requires realizing and formulating the following:

Return Horizon. Even if the teaching task is formulated as an episodic one, the teaching episode, also referred to as a session, does not necessarily match the student’s learning episode. The teacher’s episode scope is greater and could track several learning episodes of the student.

Reward signal. A different return horizon implies a different task goal and consequently the teacher’s reward signal can be different from the student’s (e.g., encouraging the learning progress of the student more than its absolute learning performance).
Moreover, defining the teacher’s state space as a superset of the student’s state space (see Definition 3.6) indicates one more difficulty of the learning to teach task. From the teacher’s point of view the student can be considered a time-inhomogeneous Markov chain (MC) [9]. This is because the transition matrix of the student’s MC depends on time, since the student is learning and constantly changing its policy. The time inhomogeneity of this MC poses significant difficulties in handling the problem theoretically. Homogenizing this MC by defining it as a space-time MC can make practical solutions feasible, but theoretical treatment remains difficult (e.g., no stationary distributions exist in this case).
In general, every learning task can have its corresponding teaching task, which could be thought of as its dual. As learning to act in a specific task and teaching that task can be considered different tasks, they have their own goals and, consequently, are “described” by different reward signals.
As an example, in [17] a teacher agent for the Mountain Car domain has a different reward signal from that of the student, encouraging teaching policies that help the student reach its goal sooner.
Learning a teaching policy, as described above, could be modelled by many different types of Markov processes. However, none of the classic MDP formulations completely models the learning problem as a whole: they fail to handle either the non-stationarity of the problem or the budget constraint imposed on the advising action. This fact is the main motivation of Section 5, where we present our proposed method for learning teaching policies.
4 Learning to Produce Advice
In this section, we focus on the advice itself and its production (not its distribution). The main challenge in producing advice based on the Q-values of an RL value function is that these values are valid only if the policy they represent is fully followed, not when this policy is sparingly sampled to produce advice.
Based on previous methods in the literature (see Section 6), the most common teacher’s criterion for selecting which action to advise is $\arg\max_a Q(s, a)$, that is, greedy selection of the best action based on the teacher’s acting value function. However, the value of $Q(s, a)$ is not correct under the advising scenario since it is accurate only if the student continues to follow the teacher’s acting policy thereafter. Unfortunately, this is usually not the case in our context since the student, after receiving advice, will often continue for a long period using its own policy exclusively. Even worse, in the early training phases, when advice is needed the most, the student’s policy will be vastly different from the teacher’s.
This realization is even more important if we consider how different the teacher and the student agents are allowed to be in our context. Consider a human student receiving advice in the game of PacMan. Human players often play fast-paced action games in a myopic and reactive manner, seeking short-term survival and not a long-term strategic advantage.
In that case, a human student infrequently advised by a policy learned using a high $\gamma$ value close to 1 will often be misled into locally suboptimal actions, because these actions may be highly valued under the teacher’s far-sighted policy. The human player will probably not follow such a policy thereafter and has therefore been misled into an action that would be useful only if the rest of the teacher’s acting policy were followed too.
Ideally, we would like to use a teacher’s acting policy that is mostly invariant to the student’s particularities. Such a policy would advise actions that are good on average, whatever policy the student follows thereafter and whatever its internals and parameters are (e.g., its $\gamma$ value).
In this article, we propose that the above considerations should affect the way we learn policies intended for teachers. Selecting a specific policy for advising, the RL algorithm producing it and its parameters, form a model selection problem for RL teachers.
4.1 Model Selection for Teachers
In this section, we want to investigate how factors such as the teacher’s $\gamma$ value (see Section 2.1) influence advice quality for students that can possibly have very different characteristics (e.g., a myopic student and a far-sighted teacher). This is important in order to understand which teacher-agent differences affect the teaching performance the most.
To assess the influence of the $\gamma$ value in the teaching process, we experiment using an RL algorithm like R-Learning [7, 10], which does not use a $\gamma$ value for the calculation of state-action values and relies on estimating the average reward received by the student, using its policy from any state and thereafter.
Specifically, R-Learning is an infinite-horizon RL algorithm where a different optimality criterion is used such that the value $\tilde{Q}^{\pi}(s, a)$ given action $a$ and state $s$ under policy $\pi$ is defined as the expectation:
$$\tilde{Q}^{\pi}(s, a) = \sum_{k=1}^{\infty} E_{\pi}\left\{ r_{t+k} - \rho^{\pi} \mid s_t = s, a_t = a \right\} \qquad (3)$$
where $\rho^{\pi}$ is the average expected reward per time step under policy $\pi$. The intuition behind R-Learning is that in the long run the average reward obtained by a specific policy is the same, but some state-action pairs receive better-than-average rewards for a while, while others may receive worse-than-average rewards. This transient, the difference from the average reward received, $r - \rho^{\pi}$, is what defines the state-action value. To keep a running estimate of the average reward, R-Learning uses a second update rule, and one more parameter, $\beta$, for the learning rate of that update.
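The two update rules can be sketched in tabular form as follows. This is a minimal illustration in the style of the textbook presentation of R-Learning, not the function-approximation implementation used in the experiments. The class name, the default learning rates, and the choice to update the average-reward estimate only after greedy actions are assumptions of this sketch.

```python
from collections import defaultdict


class RLearner:
    """Tabular average-reward learner with relative action values."""

    def __init__(self, actions, alpha=0.1, beta=0.01):
        self.actions = list(actions)
        self.alpha = alpha            # learning rate for the action values
        self.beta = beta              # learning rate for the average-reward estimate
        self.q = defaultdict(float)   # relative value of each (state, action) pair
        self.rho = 0.0                # running estimate of the average reward per step

    def best_value(self, state):
        return max(self.q[(state, a)] for a in self.actions)

    def update(self, state, action, reward, next_state):
        # Value update: the reward is measured relative to the average reward rho.
        target = reward - self.rho + self.best_value(next_state)
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
        # Second update rule: adjust rho only when the action taken is greedy.
        if self.q[(state, action)] == self.best_value(state):
            self.rho += self.beta * (
                reward - self.rho + self.best_value(next_state) - self.best_value(state)
            )


learner = RLearner(actions=[0, 1], alpha=0.5, beta=0.1)
learner.update("s", 0, 1.0, "s2")   # greedy action: both estimates move
learner.update("s", 1, -1.0, "s2")  # non-greedy action: rho stays fixed
```

No discount factor appears anywhere in the update, which is the property that makes this family of algorithms interesting for advising students whose own $\gamma$ is unknown.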
Using R-Learning to learn a teacher’s acting policy along with the rest of the experiments presented in Section 4.2, we can assess the importance of the $\gamma$ value and of a $\gamma$ value mismatch between student and teacher. Moreover, we assess other factors that possibly influence the quality of advice, such as the performance of the teacher in the acting task, its performance variance and a possible relation of its average TD-error [10] with the quality of advising.
As defined in [10], the TD-error, $\delta_t$, represents the value estimation error of a value function for a specific state and action at time $t$. For the Q-Learning [14] algorithm that is:
$$\delta_t = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \qquad (4)$$
This is also part of the Q-Learning update rule. Furthermore, by dividing (4) by the previous value estimation, $Q(s_t, a_t)$, we get the percentage of error in relation to it, which we can call the TD-error percentage:
$$\delta^{\%}_t = \frac{\delta_t}{Q(s_t, a_t)} \qquad (5)$$
where $Q(s_t, a_t) \neq 0$. In our context, when the teacher uses an acting policy to produce advice it can still compute, for each of the student’s experiences, its own TD-error just as it would do if it were actually making a learning update. In the same context, we can intuitively say that $\delta^{\%}_t$ represents the teacher’s surprise at its new estimation of a state-action value. (Footnote 2: Note that this definition of surprise, although similar, is different from that presented in [15], which normalizes for different learners and not for different state-action pairs.)
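As a small worked example, the sketch below computes a Q-Learning TD-error and the corresponding TD-error percentage for a single transition, assuming the standard Q-Learning target and a non-zero previous estimate. The function names and numbers are illustrative only.

```python
def td_error(q_sa, reward, next_q_values, gamma):
    """Q-Learning TD-error: bootstrapped target minus the previous estimate."""
    return reward + gamma * max(next_q_values) - q_sa


def td_error_percentage(q_sa, reward, next_q_values, gamma):
    """TD-error expressed relative to the previous estimate of Q(s, a)."""
    if q_sa == 0:
        raise ValueError("the previous estimate must be non-zero")
    return td_error(q_sa, reward, next_q_values, gamma) / q_sa


# Hypothetical transition: previous estimate 2.0, reward 1.0, next-state values 0.5 and 3.0.
delta = td_error(2.0, 1.0, [0.5, 3.0], gamma=0.9)                 # 1.0 + 2.7 - 2.0
surprise = td_error_percentage(2.0, 1.0, [0.5, 3.0], gamma=0.9)   # delta / 2.0
```

Averaging `surprise` over the experiences observed while advising gives one possible estimate of how far the teacher's value function is from convergence.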
Consequently, an agent with a high average TD-error percentage is an agent with a more unreliable value estimation and can therefore be less suitable as a teacher, since its action suggestions are based on a non-converged value function.
4.2 Experiments and Results
Based on the discussion in the previous section (Section 4.1), the main goal of the following experiments is to find the teacher’s policy parameters (such as $\gamma$) that most affect the quality of advice for different student parameters. The experimental design is as follows. In the first phase, we created specific teachers by training five Q-Learning agents and one R-Learning agent for 1000 episodes. The Q-Learning agents had identical parameters except for $\gamma$, which took a different value for each agent. The remaining parameters were fixed to the same values as in previous work [13]. The parameter accounting for eligibility traces was set to zero so that the effect of experimentally controlling the $\gamma$ parameter is isolated. Finally, the average-reward learning rate of R-Learning was set to 0.0001 (preliminary experiments found that this value produced good results in PacMan).
After training for 1000 episodes, the specific Q-Learning teachers and the R-Learning teacher were evaluated on 500 episodes of acting alone in the environment. We calculated their average episode score and the coefficient of variation of these scores, both being possible determining factors of advice quality. The coefficient of variation was used as a measure of score dispersion as it shows the extent of variability in relation to the mean of the score, allowing a clearer comparison of variance between methods with different average performance. It is a unitless measure calculated as $CV = \sigma / \mu$, the ratio of the standard deviation of the scores to their mean.
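The coefficient of variation is straightforward to compute from a list of episode scores. The sketch below uses the population standard deviation, which is an assumption, since the text does not say whether the sample or the population estimator was used. The scores are made up for illustration.

```python
import statistics


def coefficient_of_variation(scores):
    """Standard deviation of the scores divided by their mean (unitless)."""
    mean = statistics.mean(scores)
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return statistics.pstdev(scores) / mean


# Two hypothetical agents with the same average score but different consistency.
steady_scores = [2900, 3000, 3100, 3000]
erratic_scores = [1000, 5000, 2000, 4000]
steady_cv = coefficient_of_variation(steady_scores)
erratic_cv = coefficient_of_variation(erratic_scores)
```

Both agents average 3000 points, so a comparison based on mean score alone would not separate them, whereas the coefficient of variation does.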
In Table II we can see their average episode score on 500 episodes along with the coefficient of variation of that score. R-Learning had significantly worse average acting performance than all versions of Q-Learning. Interestingly, episodic Q-Learning (with $\gamma$ close to 1) did not perform as well as expected. Moreover, a very low $\gamma$ value (0.05) came second, showing that a myopic RL agent can perform well in PacMan. This result indicates the highly stochastic nature of the game, where reactive short-sighted strategies, based more on survival, can perform better than far-sighted strategies.
After the initial training and the evaluation of the acting policies they learned, these agents could be used as teachers for tabula rasa student agents. In these experiments we used a simple fixed teaching-advising policy called Every4Steps for all teachers since we focus only on the quality of the advice itself and not on the quality of its distribution to the student (teaching policy).
In the Every4Steps teaching policy, the teacher gives one piece of advice to the student every four steps. Using this fixed advising policy we can test and compare the efficacy of the advice when this is not given consecutively, thus testing how useful the advice is when the student does not take a complete policy trajectory from the teacher, but has to use its own policy in between.
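A fixed-interval advising policy of this kind can be sketched in a few lines. Whether the first piece of advice is given at step 0 or at step 4 is not stated in the text, so the sketch assumes step 0. The function name and the simulated loop are illustrative.

```python
def every_n_steps(step, budget, best_action, n=4):
    """Give one piece of advice every n steps while budget remains.

    Returns (advice, remaining_budget); advice is None when the student acts alone.
    """
    if budget > 0 and step % n == 0:
        return best_action, budget - 1
    return None, budget


# Simulate 12 steps with a budget of 2 and a teacher that always prefers "up".
advised_steps = []
budget = 2
for step in range(12):
    advice, budget = every_n_steps(step, budget, "up")
    if advice is not None:
        advised_steps.append(step)
```

Between two consecutive pieces of advice the student takes three actions from its own policy, which is what makes this schedule a test of advice that is not followed up by the rest of the teacher's trajectory.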
Using the Every4Steps teaching policy and a fixed advice budget, we ran 30 trials of advising learning students for each specific teacher-student pair. Specifically, the parameters of these teacher-student pairs come from the Cartesian product of the six teachers and the student $\gamma$ values (30 pairs), where the R-Learning teacher is listed without a $\gamma$ value since it does not have one.
In Figure 2 we can see the average performance of each teacher-student pair compared to the same student receiving no advice at all. Combining these results with Table II, which reports the teachers’ performance when acting alone, we can see that the best performer is not the best teacher, with best performer defined as the agent with the best average score when acting alone in the task. The clearest example of this is R-Learning, whose average score was worse than that of every Q-Learning agent but which, as Figure 2 shows, is almost as good a teacher as the Q-Learning teacher. R-Learning advising improved all students’ scores whatever their $\gamma$ value, without resulting in negative transfer for any of them.
Moreover, we can see a pattern where the lower the coefficient of variation (CV) of the acting performance is, the better the teacher, indicating that CV can be an important criterion in model selection for teachers. This is non-trivial since average agent performance (and not its variance) is the dominant model selection criterion adopted in most of the relevant RL literature. Performance variance expressed by CV seems especially important in our context, that of sparse advising, where the advice should be good whatever the next actions of the student will be.
Based on the results presented here, we cannot observe any particular pattern relating teaching performance to the $\gamma$ values of a teacher-student pair. Interestingly though a teacher is not the most helpful for a student. Even more, a for a student results to significant negative transfer. The teacher with the episodic $\gamma$ value and the non-discounting R-Learning teacher were the most helpful to all students, showing that R-Learning can perform well in settings where the student’s $\gamma$ is unknown or varying, such as in the case of human students.
Having identified the possible use of R-Learning for producing acting policies suitable for advising and the importance of performance CV to model selection, we conducted one more experiment between identical teachers.
Specifically, we independently trained 30 Q-Learning teachers with the same parameters, feature sets and characteristics for 1000 episodes. Due to their different experiences and the stochasticity of the game they naturally learned different policies (i.e., different final feature weights in their function approximators). Then, the trained teachers played alone for 500 episodes and we recorded their average performance, their performance variance and their average TD-error percentage, as defined in Section 4.1. We then used the Every4Steps teaching policy with each one of them advising a standard Sarsa [6] student who would learn the task for 1000 episodes. Finally, we recorded the student’s average score.
In Figure 3 we can see a correlation plot of the factors mentioned above using a one-tailed non-parametric Spearman correlation test. Confirming the previous results, we can see the negative and statistically significant relation of CV to teaching performance. Acting performance also has a medium positive correlation with teaching performance (student’s score), but it falls marginally short of statistical significance. By weighing average performance in its calculation, CV has a stronger relation to teaching performance than the standard statistical variance. Moreover, we see that the teacher’s surprise relates strongly and negatively to the acting performance of the teacher but not to its teaching performance.
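For readers who want to reproduce this kind of analysis, the Spearman coefficient is the Pearson correlation of the ranks of the two variables. The sketch below implements it with the standard library only and does not compute a significance level. The data are invented for illustration and are not the results reported above.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return covariance / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical teachers: acting-score CV versus the score of the advised student.
teacher_cv = [0.20, 0.35, 0.50, 0.65]
student_score = [3400, 3300, 3000, 2800]
rho = spearman(teacher_cv, student_score)
```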
5 Learning to Distribute Advice
In this section we change focus from advice production to advice distribution, learning a teaching policy in order to most effectively distribute the advice budget.
5.1 Constrained Exploitation Reinforcement Learning
We attempt a more natural formulation of the AuB learning problem described in Section 3.2 by identifying it as an instance of a more generic reinforcement learning problem. This RL problem can be simply described as learning control with constraints imposed on the exploitation ability of the learning agent. These constraints can either be a finite number of times the agent can exploit using its policy, possibly states where it is only allowed to explore, or even perhaps a task where it is costly to have access to an optimal policy and we are allowed to use it only for a limited number of times.
How does this RL problem relate to the learning to teach problem? The first insight is that the advise/no-advise decision problem has a striking resemblance to the core exploration-exploitation problem of RL agents. Consider the learning to teach problem. We can view the problem as follows: when the teacher agent is advising, it is actually acting on the environment, because an obedient student agent will always apply its advice, thus becoming a deterministic actuator for the teacher. In the case of a non-obedient student, the teacher could be said to be using a stochastic actuator.
Consequently, we can view the teacher agent as an acting agent using a student agent as its actuator for the environment. Moreover, the teacher is acting greedily by advising its best action; thus, it exploits. Under this perspective, with advice seen as action, how could we view the no advice action of a teacher? The no advice action can be seen as “trusting” the student to control the environment autonomously. Thus, choosing not to advise in a specific state can be seen as denoting that state to be non-critical with respect to the remaining advice budget and the student’s learning progress, or as denoting a lack of teacher knowledge about that state. From the teacher’s point of view, not advising can be seen as an exploration action. So controlling when not to advise can be seen as a directed exploration problem in MDPs. Imposing a budget constraint, that is, a constraint on the number of times a teacher agent can advise (i.e., exploit), yields a problem of constrained and directed exploitation.
We will consider a simple and motivating example of such a domain. In a grid world a robot learns an optimal path towards a rewarding goal state while it should keep away from a specific damaging state. The robot is semi-autonomous: it can either control itself using its own policy or it can be teleoperated for a limited number of times. For the robot’s operator, what is an optimal use of this finite number of control interventions? In which states would it be best to control the robot directly, leaving control of the rest to the robot?
Similarly to the previous example, learning and executing advising policies in a game is another instance of the constrained exploitation problem, and it is also the main focus of this article. For example, in a video game like PacMan, a game hint system plays the role of the external optimal controller with a limited intervention budget. Such a hint system could suggest actions to human players when these are most necessary, depending also on the player’s policy.
In the rest of this section, we use the term exploitation where one can think of advising and the term exploration where one can think of not advising, focusing on the broader learning problem.
5.2 Learning Constrained Exploitation policies
Formulating the constrained exploitation task as a reinforcement learning problem first requires defining a horizon for the returns. This horizon should differ from that of the actual underlying task (e.g., PacMan) because (a) if the underlying task is episodic, the scope of an exploration-exploitation policy is naturally greater and spans many episodes of the learning agent, and (b) if the underlying task is continuing or requires several training episodes for the student, the exploration-exploitation policy may have to be evaluated over a shorter (finite) horizon (e.g., the first training episodes). The importance of exploration is usually limited in the late episode(s), where the student may have already converged to a policy. A teaching policy should be primarily evaluated over a training period where advice still matters.
Concerning the return horizon of a constrained exploitation task (and similarly to [16] but from a different perspective), we propose algorithmic convergence [16] as a suitable stopping criterion for an exploration-exploitation policy. This defines a meaningful horizon for exploration-exploitation tasks since their goal is completed exactly then, not at the end of an episode and not during the continued execution of an RL algorithm after convergence, where exploration may no longer affect the underlying policy. We proceed by defining the Convergence Horizon Return.
Definition 5.1.
Convergence Horizon Return Let be the return of the rewards received by an exploration-exploitation policy, the value function of the underlying MDP and a small constant then:
(6) 
where for the time step applies:
(7) 
Given a small constant and the algorithmic convergence of the RL algorithm learning in the underlying MDP, the quantity . The algorithmic convergence will be realized either if the learning rate is discounted or if some temporal difference of the underlying algorithm tends to .
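The stopping rule can be illustrated with a minimal Python sketch. It assumes, as one possible reading of the definition, that convergence is declared at the first step where the largest absolute temporal-difference error drops below the small constant, and that the return is a (possibly discounted) sum of the rewards received up to and including that step. The function names, the `gamma` parameter and the per-step list of maximum TD errors are illustrative assumptions and not part of the original formulation.

```python
from typing import Optional, Sequence


def convergence_horizon(max_td_errors: Sequence[float], epsilon: float) -> Optional[int]:
    """Return the first step whose largest absolute TD error is below epsilon.

    Assumption: this is how algorithmic convergence is detected in this sketch.
    """
    for step, td_error in enumerate(max_td_errors):
        if abs(td_error) < epsilon:
            return step
    return None  # no convergence within the observed steps


def convergence_horizon_return(rewards: Sequence[float],
                               max_td_errors: Sequence[float],
                               epsilon: float,
                               gamma: float = 1.0) -> float:
    """Sum the (discounted) rewards up to and including the convergence step."""
    horizon = convergence_horizon(max_td_errors, epsilon)
    if horizon is None:
        horizon = len(rewards) - 1  # fall back to everything observed so far
    return sum(gamma ** t * r for t, r in enumerate(rewards[: horizon + 1]))
```

Rewards received after the detected step are ignored, which mirrors the idea that exploration no longer affects the underlying policy after convergence.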
With the convergence horizon also used for the return of a teaching task, the next question is which rewards constitute that return.
One possible goal for any teacher advising with a finite amount of advice would be to help minimize the student’s regret with respect to the reward obtained by an optimal policy. However, since we do not assume such knowledge, and because there is a finite amount of advice, a better goal could be to advise based on the state-action value of the advised action and not its immediate reward. If the student were able to follow the rest of the teacher’s policy after receiving advice, then the action for the current state would be the best possible. Consequently, we define the notion of value regret.
Definition 5.2.
Value Regret In a convergence horizon , the value regret, of an exploration-exploitation policy (i.e., teaching policy) with respect to both an acting policy obtained after the period and an acting policy (i.e., student’s policy), , in time step is:
(8) 
where denotes the corresponding value function of .
The intuition behind this definition of regret in our context (where the acting agent is the student) is that the best teacher for any specific student would ideally be the student itself, once it has reached convergence or its near-optimal policy.
The important thing to note here is that, because a student agent receives a finite amount of advice, it cannot improve its asymptotic performance [16]. Consequently, the evaluation of a teaching policy should ideally be based on the student’s optimal policy and not on that of some probably very different teacher, because the former is the student’s sustainable optimality.
For example, consider two states in a teacher’s acting MDP, and . A student agent learning with a very simplistic state representation may observe these states as just one, , and not differentiate between them. Then, the student’s optimal action in state will have a different expected return than that obtained by the teacher from either or . Its sustainable optimality is defined by what is optimal given its simplistic internal representation. Any advice based on a finer representation may not be consistently supported by the student in the long run. A teaching policy should ideally be evaluated on how much it speeds up the student’s convergence to its own optimal policy.
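The aliasing effect can be made concrete with a small numeric sketch. It assumes, as a simplification that is not taken from the article, that the value the student assigns to an action in the merged state is the visitation-weighted average of the teacher's values in the two underlying states. The state names, action names, Q-values and weights below are hypothetical.

```python
def aliased_action_values(q_by_state, visit_weights):
    """Average each action's value over the states the student cannot tell apart.

    Assumption: the student's value for the merged state is the
    visitation-weighted mean of the teacher's per-state values.
    """
    actions = next(iter(q_by_state.values())).keys()
    return {
        action: sum(visit_weights[s] * q_by_state[s][action] for s in q_by_state)
        for action in actions
    }


# Hypothetical teacher values for two states it can distinguish.
teacher_q = {
    "s1": {"left": 1.0, "right": 0.0},
    "s2": {"left": 0.0, "right": 0.6},
}
weights = {"s1": 0.5, "s2": 0.5}  # the student visits both equally often

student_q = aliased_action_values(teacher_q, weights)
student_best = max(student_q, key=student_q.get)
```

In this toy case the teacher would advise `left` in `s1` and `right` in `s2`, whereas the best single action the student can sustain in the merged state is `left`, with an expected value that matches neither of the teacher's per-state optima.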
In the next section we propose a reward signal for teachers based on Value Regret.
5.3 The QTeaching algorithm
The QTeaching algorithm described and proposed in this section is an RL advising (teaching) algorithm learning a teaching policy. For this, we propose a novel reward scheme for the teacher based on the value regret (see Definition 5.2).
The key insight of the method is that of rewarding a teaching policy with quantities of the form where is an estimation of the student’s action in and is the teacher’s greedy action in (i.e., the action used for advice). This reward has a high value when the value of the greedy action is significantly higher than the value of the action that the student would take. This means that the teacher is encouraged to advise when the advised action is significantly better than the action the student would take.
For reasons of efficiency and to emphasize the value impact of the advising action, QTeaching rewards all no-advice actions with zero. The advantages of such a scheme are that the teacher’s cumulative reward is based only on the value gain produced when advising, and that a teaching episode can finish when the budget finishes, without having to observe all the student’s episodes after that point. In preliminary experiments, rewarding no-advice actions too (which occur significantly more often than the maximum number of advice actions) overpowered the advice actions, resulting in an imbalanced expression of the two actions in the teaching value function.
Still, when advising, the teacher should estimate in order to compute its reward. The simplest solution is that since we do not have access to the value function of the student or its internals, we use the acting value function of the teacher as an approximation for the optimal value function of the student, . To estimate the teacher has several options. If the teacher is notified of the intended action of the student beforehand, it can use that to compute the reward. If we assume no knowledge of the student’s intended action then some other estimation method for the student’s intended action should be used. An example of such an estimation method is used in the Predictive Advice method [13].
While predicting the actual student’s action () is possible, there are other, simpler choices for this estimation too. For example, Importance Advising (see Section 2.2) uses a very similar quantity for the advising threshold, of the form . For Importance Advising, we can say that ; that is, it pessimistically assumes the student will take the worst action, representing the risk of the state. The advantage of such an assignment is that it is based on a well-tested criterion [13] and that it does not need knowledge of the student’s intended action (desirable for most realistic settings). The disadvantage is that the reward is less detailed and adapts mostly to the domain’s characteristics rather than to the student’s specific needs.
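For reference, the Importance Advising rule mentioned here can be sketched as follows. The sketch assumes the usual form of the criterion from [13], in which the importance of a state is the gap between the teacher's best and worst Q-values and advice is given while budget remains and the gap reaches a threshold. The function names and the list-based Q-value representation are illustrative choices.

```python
def state_importance(q_values):
    """Gap between the best and the worst action value in a state."""
    return max(q_values) - min(q_values)


def importance_advise(q_values, threshold, budget):
    """Return (advised action index or None, remaining budget).

    Advice is given only while budget remains and the state's importance
    reaches the manually tuned threshold.
    """
    if budget > 0 and state_importance(q_values) >= threshold:
        best_action = max(range(len(q_values)), key=q_values.__getitem__)
        return best_action, budget - 1
    return None, budget
```

Note that nothing in this rule depends on the student: the threshold must be tuned by hand, which is the limitation a learned teaching policy is meant to address.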
Based on this dichotomy, we propose two versions of QTeaching (see Algorithm 1), the off-student’s policy QTeaching and the on-student’s policy QTeaching. The on-student’s policy QTeaching uses the value of the actual student’s action to compute the reward (thus it is directly influenced by the student’s policy). We can intuitively say that on-student’s policy QTeaching will advise when the student is most expected to act suboptimally with respect to the acting value function of the teacher, . On the other hand, the off-student’s policy QTeaching uses the criterion discussed above, and the teaching policy is not directly influenced by the policy of the student. Specifically, it rewards its teaching policy, , at time step with the Q-value difference between the best action and the worst action, as these were found at time .
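The two reward variants can be summarised in a short sketch. It follows the verbal description: the reward is zero when no advice is given, and otherwise it is the teacher's Q-value of its greedy action minus the Q-value of a baseline action, which is the student's action in the on-student's policy variant and the worst action in the off-student's policy variant. The function name, argument names and the dictionary representation of the acting value function are assumptions made for illustration.

```python
def teaching_reward(q_state, advised, student_action=None, on_student_policy=False):
    """Teaching reward for one step, computed from the teacher's acting values.

    q_state maps each action to the teacher's Q-value in the current state.
    """
    if not advised:
        return 0.0  # no-advice actions are rewarded with zero
    best_value = max(q_state.values())
    if on_student_policy:
        if student_action is None:
            raise ValueError("the on-student's policy variant needs the student's action")
        baseline_value = q_state[student_action]
    else:
        baseline_value = min(q_state.values())  # pessimistic baseline: worst action
    return best_value - baseline_value
```

With hypothetical values `{"up": 4.0, "down": 1.0, "left": 2.5}`, advising yields 3.0 in the off-student's policy variant and 1.5 in the on-student's policy variant when the student intended `left`.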
The QTeaching algorithm proceeds as follows (see Algorithm 1). A teacher agent enters an RL acting task to learn an acting policy. It initializes two action-value functions, and , the acting value function and the teaching value function respectively (lines 1-2). Of course, it can also use an existing acting value function.
Being in time step and state the teacher queries its acting value function for the greedy action in that state (line 6). Depending on whether we use the off-student’s policy or the on-student’s policy QTeaching, the teacher sets a baseline action, , to either the worst possible action for that state or to the action just executed by the student (lines 8-12).
Then, the teacher chooses an action from based on and its exploration strategy. If the teacher chooses to advise (line 13) it gives the action as an advice to the student agent. If the teacher chooses not to advise, the student will proceed with its own policy.
In line 19 the teacher observes the student’s actual action and its new state and reward, . Once again, the student may be the teacher itself; in this case, it observes its own action, which was taken based on and its exploration strategy.
In line 20, the first QLearning update takes place for the acting value function based on the environment’s reward. For the teaching value function update, the teacher’s reward, is calculated first, based on the freshly updated values of the best and baseline actions, and respectively (lines 21-25).
Finally, a QLearning update for the teaching value function takes place based on the reward, and the algorithm continues in the same way until whichever of the following two events comes first: either the advice budget finishes or the student reaches a learning episode which we have predetermined as its convergence horizon. Either event completes one learning episode or session for the teacher.
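The walkthrough above can be condensed into a tabular sketch of a single teacher step. This is an illustration under stated assumptions and not a transcription of Algorithm 1: the class and method names are hypothetical, the teaching value function is indexed by the plain task state rather than the augmented state used in the experiments, exploration is epsilon-greedy, and the budget is decremented at the moment advice is chosen.

```python
import random
from collections import defaultdict

ADVISE, NO_ADVISE = "advise", "no_advise"


class QTeachingSketch:
    """One teacher holding an acting and a teaching value function (tabular)."""

    def __init__(self, actions, budget, alpha=0.1, gamma=0.9, epsilon=0.1,
                 on_student_policy=False, seed=0):
        self.actions = list(actions)
        self.budget = budget
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.on_student_policy = on_student_policy
        self.q_act = defaultdict(float)    # acting values, keyed by (state, action)
        self.q_teach = defaultdict(float)  # teaching values, keyed by (state, decision)
        self.rng = random.Random(seed)

    def propose(self, state, student_intended=None):
        """Pick the greedy action, the baseline action and the teaching decision."""
        greedy = max(self.actions, key=lambda a: self.q_act[(state, a)])
        if self.on_student_policy and student_intended is not None:
            baseline = student_intended  # on-student's policy variant
        else:
            baseline = min(self.actions, key=lambda a: self.q_act[(state, a)])
        if self.budget <= 0:
            decision = NO_ADVISE
        elif self.rng.random() < self.epsilon:
            decision = self.rng.choice([ADVISE, NO_ADVISE])
        else:
            decision = max([NO_ADVISE, ADVISE], key=lambda d: self.q_teach[(state, d)])
        if decision == ADVISE:
            self.budget -= 1
        return decision, greedy, baseline

    def update(self, state, decision, greedy, baseline, student_action, reward, next_state):
        """Apply both Q-learning updates and return the teaching reward."""
        # Acting value function: standard Q-learning on the environment reward.
        best_next = max(self.q_act[(next_state, a)] for a in self.actions)
        key = (state, student_action)
        self.q_act[key] += self.alpha * (reward + self.gamma * best_next - self.q_act[key])
        # Teaching reward from the freshly updated values; zero when not advising.
        teach_reward = 0.0
        if decision == ADVISE:
            teach_reward = self.q_act[(state, greedy)] - self.q_act[(state, baseline)]
        # Teaching value function: Q-learning on the teaching reward.
        best_next_teach = max(self.q_teach[(next_state, d)] for d in (ADVISE, NO_ADVISE))
        tkey = (state, decision)
        self.q_teach[tkey] += self.alpha * (
            teach_reward + self.gamma * best_next_teach - self.q_teach[tkey])
        return teach_reward
```

A caller would invoke `propose` before the student acts, pass the advised action to the student when the decision is `ADVISE`, and call `update` once the student's action, the reward and the next state have been observed. A teaching session ends when `budget` reaches zero or the predetermined convergence horizon is reached.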
In this version, the QTeaching algorithm is based on the QLearning algorithm, although in principle any RL algorithm could be used for the underlying learning updates of QTeaching. However, if an off-policy RL algorithm such as QLearning is chosen for the updates of both the acting and the teaching value function, then the point of transition from acting to teaching is irrelevant to the learning progress of the two policies. Reducing the impact of the exploration policy on the learning updates allows for smoother interaction between the two policies and ensures that we continue to learn the same policies. In principle, a QTeaching agent is able to update both its acting and teaching value functions continually and refine not only when it should advise but also what it should advise.
Since our goal is to introduce QTeaching as a flexible and generic enough method to be applied to multiple domains, we propose a series of state features for the teaching policy that we think are necessary. From our experiments, QTeaching works best with an augmented version of the acting task state space (see Table III) similar to that of [17] (Zimmer’s method). Also in Table III, note the role of the student’s progress feature (): it homogenises the student’s Markov chain by inducing a state feature for time (see Section 3.2).
5.4 Experiments and Results
In this section, we present results from using QTeaching in the PacMan domain. We evaluate both on-student’s policy QTeaching and off-student’s policy QTeaching, in two variations each: known or unknown student’s intended action. Note that methods like Zimmer’s and Mistake Correcting require knowledge of the student’s intended action.
We use two versions of students for the experiments: a low-asymptote and a high-asymptote Sarsa student. Referring to [13] and Section 2.3, the low-asymptote students receive a state vector of 16 primitive features related to the current game state, while the high-asymptote students receive a state vector of 7 highly engineered features providing more information. The low-asymptote students have significantly worse performance than the high-asymptote ones.
Additionally, we choose to bootstrap all compared teaching methods with the same acting policy in order to compare only their advice distribution performance and not the quality of their advice. The acting policy used for producing advice comes from a high-asymptote QTeaching agent after 1000 episodes of learning. Moreover, we use Sarsa students in order to emphasize the ability to advise students that are different from the teacher. All learning methods (Zimmer’s and QTeaching) were trained for 500 teaching episodes (sessions) so that their learning efficiency could be compared on equal terms too.
The QTeaching learning parameters for the teaching policy were , decaying and whereas all Sarsa students had , and .
The evaluation was based on the student performance (game score), using the Total Reward TL metric [11] divided by the fixed number of training episodes. The student performance is evaluated every 10 advising episodes (learning) for 30 episodes of acting alone (and not learning). For the comparisons between average score performances we used pairwise t-tests with Bonferroni correction. Statistically significant results are denoted with their significance level and they always refer to paired comparisons.
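As a small illustration of the two evaluation ingredients, the sketch below computes a total reward normalised by the number of episodes and the per-comparison significance level after a Bonferroni correction. It assumes that every pair of methods is compared, so that the number of comparisons is the number of unordered pairs; the function names and the example numbers are hypothetical and do not correspond to reported results.

```python
import math


def average_total_reward(episode_scores):
    """Total reward accumulated over training divided by the number of episodes."""
    return sum(episode_scores) / len(episode_scores)


def bonferroni_alpha(alpha, num_methods):
    """Per-comparison significance level when all pairs of methods are tested.

    Assumption: one test per unordered pair of methods.
    """
    num_comparisons = math.comb(num_methods, 2)
    return alpha / num_comparisons
```

For example, comparing five methods pairwise gives ten comparisons, so a family-wise level of 0.05 corresponds to a per-comparison level of 0.005.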
In Figure 4(c), teacher agents advise a low-asymptote Sarsa student who always announces its intended action. We can see that Zimmer’s method performs best and off-student’s policy QTeaching comes second with a statistically significant difference (). The heuristic-based method Mistake Correcting, with a tuned threshold value of , comes third. On-student’s policy QTeaching performed worse than the previous three methods by a small margin, not having found as good an advice distribution policy (non-significant difference to Mistake Correcting). Finally, all methods performed statistically significantly better () than not advising, effectively speeding up the learning progress of the student.
In Figure 4(d), the teachers advise a high-asymptote Sarsa student. Here, the tuned version of Mistake Correcting () performed statistically significantly better () than all other methods, with the QTeaching methods coming second and third (respectively) and Zimmer’s method coming next (with non-significant differences between them).
For the case when the teacher agent is not aware of the student’s intended action, in Figure 4(a) the off-student’s policy QTeaching performs best while Importance Advising () follows with a small performance difference (n.s.). Early Advising (giving all advice in the first steps) performs statistically significantly worse (at ) than both QTeaching and Importance Advising. In these experiments, we did not use on-student’s policy QTeaching since that requires knowing the student’s intended action to compute the reward.
In Figure 4(b), advising a high-asymptote Sarsa student, QTeaching had the second best performance, with the heuristic-based method Importance Advising () performing better (non-significant). For high-performing students a poorly distributed advice budget can be much less effective. For example, if the teacher knows the student’s intended action, it does not spend advice in states where the student would choose the correct action anyway. This fact is emphasized in this specific case, since not advising did not perform significantly worse than the rest of the methods.
Finally, in Table V we can see the average total reward in 1000 training episodes for all the teaching methods. All methods knowing the student’s intention performed better than those not, taking advantage of that knowledge.
It is important to note that QTeaching, the only learning AuB method allowing students to not announce their intended action, performed relatively well compared to methods that know the student’s intended action, which is an advantage of the proposed method.
Another advantage is that off-student’s policy QTeaching can use the same teaching policy for very different students since it is not directly influenced by the student’s policy or by the rewards received by the student when not advising (as in Zimmer’s method). This is a significant advantage in terms of learning speed and versatility since heuristic methods have to be manually tuned for each student separately to find the optimum threshold, .
Moreover, while Zimmer’s method and QTeaching were both trained for 500 episodes (sessions), QTeaching training completed significantly faster since Zimmer’s method has to observe all 1000 episodes of each student session to complete just one of its own, whereas QTeaching has an upper bound for its episode completion. This upper bound is the algorithmic convergence of the student (e.g., the low-asymptote student requires only 500 episodes to converge), and in most cases a teaching episode will complete much earlier, when the budget finishes (around the 30th episode for the low-asymptote student). More specifically, in Table V we can see the average training time needed for each teacher in terms of the average observed student episodes in each of the 500 teacher episodes. In general, our proposed methods need at least less training time than Zimmer’s method. We should also note here that although non-learning methods do not need training time, they require a significant and variable amount of manual parameter tuning to achieve the reported performance.
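The figures quoted in this paragraph allow a back-of-the-envelope comparison of how many student episodes each learning teacher has to observe for the low-asymptote student. The sketch below only multiplies the numbers stated in the text (500 teacher sessions, 1000 student episodes per session for Zimmer’s method, a convergence bound of 500 episodes, and budget exhaustion around the 30th episode). It is an illustration of the argument and not a reproduction of the measured training times in Table V; the function and variable names are hypothetical.

```python
def observed_student_episodes(teacher_sessions, student_episodes_per_session):
    """Student episodes a teacher must observe over all of its training sessions."""
    return teacher_sessions * student_episodes_per_session


TEACHER_SESSIONS = 500

# Zimmer's method observes every episode of each student session.
zimmer_total = observed_student_episodes(TEACHER_SESSIONS, 1000)
# QTeaching, worst case: each session runs until the student's convergence horizon.
qteaching_upper_bound = observed_student_episodes(TEACHER_SESSIONS, 500)
# QTeaching, typical case described in the text: the budget ends near episode 30.
qteaching_budget_case = observed_student_episodes(TEACHER_SESSIONS, 30)

print(zimmer_total, qteaching_upper_bound, qteaching_budget_case)
# prints: 500000 250000 15000
```

Under these assumptions the worst case for QTeaching already halves the number of observed student episodes, and the budget-limited case reduces it by more than an order of magnitude.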
On-student’s policy QTeaching did not perform as well as expected, the main problem being the non-stationary reward, which depends on the student’s changing policy. We believe that this method needs significantly more training time than the off-student’s policy QTeaching because of its non-stationary reward, and that it probably needs more informative features for the student’s current status. In our case, the only such feature was the student’s training episode, which is the most basic information available about the student. Moreover, the training episode feature is student-dependent since its meaning varies among students: some students learn faster than others.
6 Related Work
There are several types of related work in the area of helping to learn. Some of this work focuses on teaching in non-RL settings [1, 8].
In the field of transfer learning in RL [12], an agent uses knowledge from a source task to aid its learning in a target task. However, in that setting agents transfer knowledge from one task to another and do so in an offline manner. Other differences between this typical TL setting and Agent Advising are described in Section 2.2 of this article.
More closely related work has one RL agent teach another without a direct knowledge transfer. Examples of such work include imitation learning [4] and apprentice learning [2]. In these approaches an expert provides demonstrations of the task to a student, and the student then has to extract a policy by either learning directly from them or building a model to generate mental experience. In our setting, the teacher does not provide a full-policy trajectory and has a limitation on the number of interventions (advice budget). Moreover, we do not require a student with special processing abilities except that of being able to receive advice.
In [13] a non-learning teaching framework for RL tasks is proposed based on action advice. The methods presented there are described in more detail in Section 2.2. One drawback of these methods is that, since they are based solely on the teacher’s Q-values, they are not able to handle non-stationarity in the student’s learning task. They also have to be given a threshold of Q-value differences, above which a state is considered important. This parameter needs to be manually tuned for each student, in contrast to off-student’s policy QTeaching, which can learn a more generic teaching policy focusing on the criticalities of the state space.
Also, since the methods presented in [13] are heuristic-based and not based on adaptive learning, the agent may spend all of its advising budget on early learning steps of the student that satisfy the importance threshold, while it may later experience even more important states that further exceed the given threshold.
The only other learning method for advising is introduced in [17] (Zimmer’s method). The method proposed there is described in more detail in Section 2.2. One significant difference is that it is based on the same reward received by the student, needing ad hoc modifications for each task to encourage the teacher towards a better advising policy. Our method uses a domain-independent reward signal based on the acting task Q-values and can be directly used in any task. Moreover, their method has greater data complexity since a complete batch of student training episodes is required for just one training episode of the teacher. As discussed in the previous section, our method may finish one teaching episode as soon as the budget finishes, completing an episode multiple times faster. Finally, and most importantly, QTeaching can be used in the more realistic setting where there is no knowledge of the student’s intended action.
Concerning the model selection criteria proposed in Section 4.1 for the teacher’s acting policy, to the best of the author’s knowledge there is no other work in the relevant literature examining these criteria and furthermore proposing performance variance, and specifically CV, as an important one. Most relevant works choose models based on their average performance, which as discussed previously, is not enough to evaluate the teaching effectiveness of a policy sampled infrequently and in parts.
7 Conclusions and Future Work
In this article, we discussed and proposed criteria, considerations and methods for the problem of learning teaching policies to produce and distribute advice.
Concerning advice production, we identify a model selection problem for the teacher, selecting the appropriate acting policy from which to advise. The experiments showed a significant relation between CV and teaching performance, promoting CV as an important criterion, among others tested, for selecting acting policies for advising. Moreover, average-reward RL was found to produce effective policies for sparse advising under budget, although these policies may underperform when used as acting ones.
Concerning advice distribution (i.e., teaching policy), we proposed a novel representation of the learning to teach problem as a constrained exploitation reinforcement learning problem. Based on this representation we proposed a novel RL algorithm for learning a teaching policy, QTeaching, able to advise even without knowledge of the student’s intended action. QTeaching was found to perform at least as well as the other compared methods while needing significantly less training time.
Advice distribution under budget is a challenging problem, both theoretically and practically, posing a series of problems such as the nonstationarity of the teaching task, as a result of having a learning student as part of the environment. Efficient and principled handling of the budget constraint is another challenge.
From our experiments, QTeaching can be considered a promising method based on a more formal understanding of the problem. It is significantly more efficient in terms of data complexity than Zimmer’s method, and it can learn teaching policies without assuming knowledge of the student’s intended action.
There are several future directions. QTeaching could be adapted to student agents with specific “disabilities” and could also be tested under different budget constraints to examine how the budget affects its teaching policies. Also, off-student’s policy QTeaching could be tested in multi-student scenarios, since not fitting to a particular student could prove effective when teaching multiple different students. Moreover, the theoretical properties of the algorithms should be studied, especially the case of learning a teaching and an acting policy at the same time, e.g., under which specific assumptions a teaching policy converges.
The general usefulness of CV as a criterion for selecting teachers should be studied, specifically how teacher selection criteria such as CV capture the robustness of a policy when that policy is used sparingly for advising.
Finally, other teaching architectures and representations should be studied, allowing, for example, a teacher to use only one value function for both advising under a budget and acting. Such a hybrid agent transitions smoothly from its actor role to the teacher’s one. A unified architecture and knowledge representation would further reveal the deep connection between acting and teaching, one we strongly believe exists.
References
 [1] Doran Chakraborty and Sandip Sen. Teaching new teammates. In Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems, pages 691–693. ACM, 2006.
 [2] Jeffery Allen Clouse. On Integrating Apprentice Learning and Reinforcement Learning. PhD thesis, 1996. AAI9709584.
 [3] Alessandro Lazaric. Transfer in reinforcement learning: a framework and a survey. In Reinforcement Learning, pages 143–173. Springer, 2012.
 [4] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3-4):293–321, 1992.
 [5] Philipp Rohlfshagen and Simon M Lucas. Ms Pac-Man versus Ghost Team CEC 2011 competition. In Evolutionary Computation (CEC), 2011 IEEE Congress on, pages 70–77. IEEE, 2011.
 [6] Gavin Rummery and Mahesan Niranjan. On-Line Q-Learning Using Connectionist Systems. Technical Report CUED/FINFENGRT 116, Engineering Department, Cambridge University, 1994.
 [7] Anton Schwartz. A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the tenth international conference on machine learning, volume 298, pages 298–305, 1993.
 [8] Peter Stone, Gal A Kaminka, Sarit Kraus, Jeffrey S Rosenschein, et al. Ad hoc autonomous agent teams: Collaboration without precoordination. In AAAI, 2010.
 [9] D.W. Stroock. An Introduction to Markov Processes. Graduate Texts in Mathematics. Springer Berlin Heidelberg, 2004.
 [10] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning, An Introduction. MIT Press, 1998.
 [11] Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(1):1633–1685, 2009.
 [12] Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. The Journal of Machine Learning Research, 10:1633–1685, 2009.
 [13] Lisa Torrey and Matthew Taylor. Teaching on a budget: agents advising agents in reinforcement learning. In Proceedings of the 2013 international conference on Autonomous agents and multiagent systems, pages 1053–1060. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
 [14] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3):279–292, 1992.
 [15] Adam White, Joseph Modayil, and Richard S Sutton. Surprise and curiosity for big data robotics. In AAAI-14 Workshop on Sequential Decision-Making with Big Data, Quebec City, Quebec, Canada, 2014.
 [16] Yusen Zhan and Matthew E. Taylor. Online Transfer Learning in Reinforcement Learning Domains. In Proceedings of the AAAI Fall Symposium on Sequential Decision Making for Intelligent Agents (SDMIA), November 2015.
 [17] Matthieu Zimmer, Paolo Viappiani, and Paul Weng. Teacher-Student Framework: a Reinforcement Learning Approach. In AAMAS Workshop Autonomous Robots and Multirobot Systems, 2014.