In the reinforcement learning (RL) framework, data-efficient approaches are especially important for real-world and commercial applications, such as robotics. In such domains, extensive interaction with the environment takes time and can be costly.
One data-efficient approach for RL is transfer learning (TL). Typically, when an RL agent leverages TL, it uses knowledge acquired in one or more (source) tasks to speed up its learning in a more complex (target) task. Most realistic TL settings require transfer of knowledge between different tasks or heterogeneous agents that can be vastly different from each other (e.g., humans and software agents).
Transferring between heterogeneous agents is often challenging since most methodologies involve exploiting the agents’ structural similarity to transfer knowledge between tasks. As an example, TL can be applied between two similar RL agents, which both use the same function approximation method, by transferring their learned parameters. In such a case, a Q-Value transfer solution could be used, combined with an algorithm constructing mappings between the state variables of the two tasks.
Whereas solutions for extracting similarity between tasks have been extensively studied in the past [11, 3], the main problem of transferring between very dissimilar agents (e.g., humans and software agents) remains.
Consider, for example, a game hint system for human players. The game hint system cannot directly transfer its internal knowledge to the human player. Moreover, it should transfer knowledge in a limited and prioritized way, since the attention span of humans is limited.
The only prominent knowledge transfer unit between all agents (software, physical or biological) is action. Action suggestion (advice) can be understood by very different agents. However, even when transferring using advice, four problems arise:
Decide what to advise (production of advice)
Decide when to advise (distribution of advice), especially when using a limited advice budget
Determine a common action language in order to appropriately express the advice between heterogeneous agents
Communicate the advice effectively, ensuring its timely and noiseless reception
This article focuses on the first two problems—those of deciding when and what to advise under a budget. Moreover, we use the game of Pac-Man to test our methods’ effectiveness in a complex domain.
Whereas prior work provides a formal understanding of RL students receiving advice and the implications for the student’s learning process (e.g., convergence properties), and other papers provide practical methods for a teacher to advise agents, this work attempts a new learning formulation of the problem and proposes a novel learning algorithm based on it. We identify and exploit the similarities of the advising under a budget (AuB) problem to the classic exploration-exploitation problem in RL and identify a sub-class of reinforcement learning problems: Constrained Exploitation Reinforcement Learning.
Most successful methodologies for AuB require students to inform their teacher of their intended action. This is not a realistic requirement in many real-world TL problems, since it assumes one more communication channel between the student and the teacher and thus requires some form of structural compliance from the student. An example of how restrictive this requirement is for real-world applications comes from the game hint system example. The system advises the human player on the next action in real time, but the human player could never be expected to announce an intended action beforehand. Part of this work’s goal is also to alleviate such a prerequisite and propose methods that can also work without such knowledge.
Specifically, the contributions of this article are:
An empirical study on determining an appropriate advising policy in the game of Pac-Man
A novel application of average reward reinforcement learning to produce advice
A novel formulation of the learning to advise under budget (AuB) problem as a problem of constrained exploitation RL
A novel RL algorithm for learning a teaching policy to distribute advice, able to train faster (lower data complexity) than previous learning approaches and advise even when not having knowledge of the student’s intended action
This section presents the necessary background to understand the methods proposed in this article. Brief introductions are provided to reinforcement learning and transfer learning, which are then followed by a more detailed discussion of the current advising methodologies.
2.1 Reinforcement Learning
Reinforcement Learning addresses the problem of how an agent can learn a behaviour through trial-and-error interactions with a dynamic environment. In an RL task the agent, at each time step $t$, senses the environment’s state, $s_t \in S$, where $S$ is the finite set of possible states, and selects an action $a_t \in A(s_t)$ to execute, where $A(s_t)$ is the finite set of possible actions in state $s_t$. The agent receives a reward, $r_{t+1} \in \mathbb{R}$, and moves to a new state $s_{t+1}$ according to a transition function, $T$, of the task with $T : S \times A \to S$. The general goal of the agent is to maximize the expected return, where the return, $R_t$, is defined as some specific function of the reward sequence given also a discounting parameter, $\gamma$. The parameter $\gamma$, where $0 \le \gamma \le 1$, controls the importance of short-term rewards over long-term ones, discounting the latter by powers of $\gamma$.
The outcome will be an action-value function $Q^{\pi}(s, a)$, which expresses the expected return starting from $s$, taking action $a$, and thereafter following policy $\pi$, which dictates how the agent acts in a certain situation in order to maximize the reward received over time.
2.2 Transfer Learning and Advising under a Budget
Transfer Learning refers to the process of using knowledge that has been acquired in a previously learned task, the source task, in order to enhance the learning procedure in a new and more complex task, the target task. The more similar these two tasks are, the easier it is to transfer knowledge between them. By similarity, we mean the similarity of their underlying Markov Decision Processes (MDPs), that is, the transition and reward functions of the two tasks and also their state and action spaces.
The type of knowledge that can be transferred between tasks varies among different TL methods, including value functions, entire policies, actions (policy advice) or a set of samples from a source task which can be used by a model-based RL algorithm in a target task.
Focusing specifically on policy advice under an advice budget constraint, we identify two aspects of the problem: a) learning a policy to produce advice and b) distributing the advice in the most appropriate way, while respecting the advice budget constraint. Most methods in the literature produce advice by greedily using a learned policy for the task in hand [13, 17, 16]. For advice distribution, most methods rely on some form of heuristic function (and not learning) based on which the teacher decides when to advise. Examples of such methods are Importance Advising and Mistake Correcting.
The Importance Advising method produces advice by repeatedly querying a learned policy’s value function, on each state the student faces, to obtain the best action for that state. Distribution of advice, that is, deciding when to advise or not, is determined by a heuristic logical expression of the form $\max_a Q(s, a) - \min_a Q(s, a) > \tau$, where $\tau$ is a threshold parameter on the state-action value gap between the best and the worst action for that state. If this value gap exceeds the threshold value, $\tau$, the state is considered critical and advice is given. The algorithm continues until the advice budget is exhausted.
Mistake Correcting (MC) differs from Importance Advising only in presuming knowledge of the student’s intended action. Consequently, it applies the Importance Advising criterion only if the student’s intended action is wrong, not wasting advice when the student does not need it.
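The two heuristics above can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function names, the use of `None` for "no advice", and the reading of a "wrong" student action as one that differs from the teacher’s greedy action are all assumptions made here.

```python
from typing import Optional, Sequence


def state_importance(q_values: Sequence[float]) -> float:
    """Gap between the best and the worst action value in the current state."""
    return max(q_values) - min(q_values)


def importance_advising(q_values: Sequence[float], budget: int,
                        threshold: float) -> Optional[int]:
    """Return the index of the advised action, or None when no advice is given."""
    if budget <= 0 or state_importance(q_values) <= threshold:
        return None
    # Advice is the teacher's greedy action for this state.
    return max(range(len(q_values)), key=lambda a: q_values[a])


def mistake_correcting(q_values: Sequence[float], intended_action: int,
                       budget: int, threshold: float) -> Optional[int]:
    """Same criterion, but advice is spent only when the student is about to err.

    Assumption: "wrong" means the intended action is not the teacher's greedy one.
    """
    advice = importance_advising(q_values, budget, threshold)
    if advice is None or advice == intended_action:
        return None
    return advice
```

The caller is expected to decrement the budget whenever a non-`None` action is returned.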
Zimmer’s method formulates the teaching problem as an RL one in order to learn an advice distribution policy. The teacher agent has an action set with two actions: advice and no advice. The teacher’s state space is an augmented version of the student’s one, consisting of the current state vector of the student, the intended action of the student (it is assumed that the student announces the intended action on every step), the remaining advice budget and the student’s training episode number. Moreover, the reward signal is a transformed version of the student’s reward with an extra positive reward for the teacher when the student reaches its goal in a small number of steps. We note that this method is tested only on the Mountain Car domain and the reward signal proposed for the teacher is domain-dependent.
The experimental domain for the teaching methods presented in this article is the game of Pac-Man. Pac-Man is a famous 1980s arcade game in which the player navigates a maze like the one in Figure 1, trying to earn points by touching edible items and trying to avoid being caught by the four ghosts. In our experiments, we use a Java implementation of the game provided by the Ms. Pac-Man vs. Ghosts League, which conducts annual competitions. Ghosts in our setting chase the player 80% of the time and choose actions randomly 20% of the time.
The player and all ghosts have four actions (move up, down, left, and right), but some actions are occasionally unavailable due to the restrictions in the maze. Four moves are required to travel between the small dots on the grid, which represent food pellets and are worth 10 points each. The larger dots are power pellets, which are worth 50 points each. When the player gets the larger ones, the ghosts become edible for a short time, during which they slow down and flee the player. Eating a ghost is worth 200 points (which doubles every time for the duration of a single power pill). Then the ghost respawns in the lair at the center of the maze. The episode ends if any ghost catches Pac-Man, or after 2000 steps.
This domain is discrete but has a very large state space. There are 1293 distinct locations in the maze, and a complete state consists of the locations of Pac-Man, the ghosts, the food pellets, and the power pills, along with each ghost’s previous move and whether or not it is edible. The combinatorial explosion of possible states makes it essential to approach this domain through high-level feature construction and Q-function approximation.
In this article, we follow previous work that adopted a high-level feature set (high-asymptote feature set) comprised of action-specific features. When using action-specific features, a feature set is really a set of functions $\{f_1(s, a), \ldots, f_n(s, a)\}$. All actions share one Q-function, which associates a weight $w_i$ with each feature. A Q-value is $Q(s, a) = w_0 + \sum_{i} w_i f_i(s, a)$. To achieve gradient-descent convergence, it is important to have the extra bias weight $w_0$ and also to normalize the features to the range $[0, 1]$.
For the state representation, we define a feature set which consists of features that count objects at a range of distances from Pac-Man, as used (and defined) in previous work.
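A shared linear Q-function over action-specific features can be sketched as follows. The feature functions, the toy state encoding and the weights are hypothetical and chosen only for illustration; they are not the feature set used in the experiments.

```python
from typing import Callable, Hashable, Sequence

# A feature maps a (state, action) pair to a value normalized to [0, 1].
Feature = Callable[[dict, Hashable], float]


def q_value(weights: Sequence[float], features: Sequence[Feature],
            state: dict, action: Hashable) -> float:
    """Q(s, a) = w_0 + sum_i w_i * f_i(s, a); weights[0] is the bias weight."""
    bias, feature_weights = weights[0], weights[1:]
    return bias + sum(w * f(state, action)
                      for w, f in zip(feature_weights, features))


def greedy_action(weights: Sequence[float], features: Sequence[Feature],
                  state: dict, actions: Sequence[Hashable]) -> Hashable:
    """All actions share the same weights; only the feature values differ."""
    return max(actions, key=lambda a: q_value(weights, features, state, a))


# Hypothetical toy state: per-action (pellet density, ghost proximity) in [0, 1].
toy_state = {"left": (1.0, 0.0), "right": (0.5, 1.0)}
toy_features = [lambda s, a: s[a][0], lambda s, a: s[a][1]]
toy_weights = [0.5, 2.0, -3.0]  # bias, pellet weight, ghost weight
```

With these toy numbers, moving left scores higher because it has more pellets and no nearby ghost.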
A perfect score in an episode is 5600 points, but this is quite difficult to achieve (for both human and agent players). An agent executing random actions earns an average of 250 points. The 7-feature set allows an agent to learn to catch some edible ghosts and achieve a per-episode average of 3800 points.
3 The Teaching Task
In this section we attempt a more formal understanding of a teaching task that is based on action advice. The necessary notation is presented in Table I.
Student. A student agent is an agent acting in an environment and capable of accepting advice from another agent. (Footnote 1: In this work we assume that a student agent always follows the given advice.)
Teacher. A teacher agent is an agent able to execute and inform a teaching policy (see Definition 3.7) to provide action advice to a student agent acting in a specific task.
Acting Task. The acting task is the task for which the teacher gives advice and can be defined as an MDP of the form $\langle S, A, T, R, \gamma \rangle$ on a given environment.
Teaching Task. The teaching task is the task of providing action advice to a student agent to assist it in learning the acting task faster or better. Any teaching task is accompanied by a finite advice budget.
Teaching Action Space. Given the action space, $A$, of the acting task, the action space of the teacher at time step $t$ is:
$$A^{\text{teach}}_t = \begin{cases} A \cup \{\bot\}, & b_t > 0 \\ \{\bot\}, & b_t = 0 \end{cases}$$
where $a \in A$ is an action of the acting task given as advice and $\bot$ is the no advice action, meaning that the teacher will not give advice in this step, allowing the student to act on its own; $b_t$ is the advice budget left at time step $t$.
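A minimal sketch of a budget-dependent teaching action space follows. It assumes that advice actions are available only while some budget remains and that the no advice action is always available; the sentinel `NO_ADVICE` and the function name are hypothetical.

```python
from typing import Hashable, List, Sequence

NO_ADVICE = "no_advice"  # hypothetical sentinel for the no advice action


def teaching_actions(acting_actions: Sequence[Hashable],
                     budget_left: int) -> List[Hashable]:
    """Actions available to the teacher at the current time step."""
    if budget_left > 0:
        # Any acting-task action can be given as advice, or the teacher can abstain.
        return list(acting_actions) + [NO_ADVICE]
    # Budget exhausted: the student must act on its own.
    return [NO_ADVICE]
```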
Teaching State Space. The teacher agent’s state space at each time step is defined over a tuple containing any knowledge we can have about the student and its MDP at that time step. For example, if the teacher observes the student’s current state, reward and action, the tuple consists of these three observations.
The teaching policy actually transforms the acting policy, $\pi$, of an actor agent (expressed through its respective state-action value function, $Q$) to a policy producing advice under budget. We should also note that a teaching policy will usually [13, 17, 16] advise the action $\arg\max_{a} Q(s, a)$, which means that the teaching policy is greedy with respect to the acting value function, $Q$.
As a minimal example of the proposed formulation, the Importance Advising method, which uses the state importance criterion (see Section 2.2), can be said to use a teaching state space, as it requires knowledge only about the current state, $s$, of the student, the advice budget, and an acting policy from which it produces advice.
3.2 Learning to Teach
The definitions presented in subsection 3.1 apply to any teacher agent even if it advises based on a heuristic function. In the following, we focus on teachers that use RL to learn a teaching policy (i.e., advice distribution policy).
In its most simplified version, the learning to teach task employs two agents: the teacher and the student. In the first learning phase a teacher agent has the role of the actor: it learns the acting task alone. It observes a state space $S$ and has an action set $A$. Based on a reward signal received from the environment, it learns a policy to achieve the acting task goal. In our context, this first learning phase can be seen as the advice production phase, since the teacher learns the policy that will be used to advise a student later on.
At any time step the teacher agent may have to stop acting, and a new agent, the student, enters the acting task and the corresponding environment.
Consequently, the teacher agent now has to learn and use a teaching policy for the specific task to achieve the teaching goal. In addition to the definitions given in Section 3, this second learning phase (learning a teaching policy) requires realizing and formulating the following:
Return Horizon. Even if the teaching task is formulated as an episodic one, the teaching episode, also referred to as a session, does not necessarily match the student’s learning episode. The teacher’s episode scope is greater and could track several learning episodes of the student.
Reward signal. A different return horizon implies a different task goal and consequently the teacher’s reward signal can be different from the student’s (e.g., encouraging the learning progress of the student more than its absolute learning performance).
Moreover, defining the teacher’s state space as a superset of the student’s state space (see Definition 3.6) indicates one more difficulty of the learning to teach task. From the teacher’s point of view the student can be considered a time-inhomogeneous Markov Chain. This is because the transition matrix of the student’s Markov Chain is dependent on time, since the student is learning and constantly changing its policy over time. The time inhomogeneity of this Markov Chain poses significant difficulties in handling the problem theoretically. Homogenizing it by defining it as a space-time Markov Chain can make practical solutions feasible, but theoretical treatment is still difficult (e.g., no stationary distributions exist in this case).
In general, every learning task can have its corresponding teaching task, which could be thought of as its dual. As learning to act in a specific task and teaching that task can be considered different tasks, they have their own goals and consequently are “described” by different reward signals.
As an example, in Zimmer’s method (see Section 2.2) a teacher agent for the Mountain Car domain has a different reward signal from that of the student, encouraging teaching policies that help the student reach its goal sooner.
Learning a teaching policy, as this is described above, could be modelled by many different types of Markov Processes. However, none of the classic MDP formulations completely models the specific learning problem as a whole either by not handling the non-stationarity of the problem or by not handling the specific budget constraint imposed on the advising action. This fact is the main motivation of Section 5, where we present our proposed method for learning teaching policies.
4 Learning to Produce Advice
In this section, we focus on the advice itself and its production (not its distribution). The main challenge in producing advice based on the Q-Values of an RL value function is that these values are valid only if the policy they represent is fully followed, not when this policy is sparingly sampled to produce advice.
Based on previous methods in the literature (see Section 6), the most common teacher’s criterion for selecting which action to advise is $\arg\max_{a} Q(s, a)$, that is, greedy selection of the best action based on the teacher’s acting value function. However, the value of $Q(s, a)$ is not correct under the advising scenario, since it is accurate only if the student continues following the teacher’s acting policy thereafter. Unfortunately, this is usually not the case in our context, since the student, after receiving advice, will often continue for a long period using its own policy exclusively. Even worse, in the early training phases, when advice is needed the most, the student’s policy will be vastly different from the teacher’s.
This realization is even more important if we consider how different the teacher and the student agents are allowed to be in our context. Consider a human student receiving advice in the game of Pac-Man. Human players often play fast-paced action games in a myopic and reactive manner, seeking short-term survival and not a long-term strategic advantage.
In that case, a human student infrequently advised by a policy learned using a high $\gamma$ value close to 1 will often be misled to locally sub-optimal actions, because these actions may be highly valued under the teacher’s far-sighted policy. The human player will probably not follow such a policy thereafter and has therefore been misled to an action that would be useful only if the rest of the teacher’s acting policy were followed too.
Ideally, we would like to use a teacher’s acting policy that would be mostly invariant to the student’s particularities. Such a teacher’s policy would advise actions that are good on average, whatever policy is followed thereafter by the student and whatever its internals and parameters are (e.g., its $\gamma$ value).
In this article, we propose that the above considerations should affect the way we learn policies intended for teachers. Selecting a specific policy for advising, the RL algorithm producing it and its parameters, form a model selection problem for RL teachers.
4.1 Model Selection for Teachers
In this section, we want to investigate how factors such as the teacher’s $\gamma$ value (see Section 2.1) influence advice quality for students that can possibly have very different characteristics (e.g., a myopic student and a far-sighted teacher). This is important in order to understand which teacher-agent differences affect the teaching performance the most.
To this end we also consider R-Learning, an average-reward RL algorithm that does not use a $\gamma$ value for the calculation of state-action values and relies on estimating the average reward received by the student, using its policy from any state and thereafter.
Specifically, R-Learning is an infinite-horizon RL algorithm where a different optimality criterion is used, such that the value of action $a$ in state $s$ under policy $\pi$ is defined as the expectation:
$$Q^{\pi}(s, a) = \sum_{k=1}^{\infty} E_{\pi}\left[ r_{t+k} - \rho^{\pi} \mid s_t = s, a_t = a \right]$$
where $\rho^{\pi}$ is the average expected reward per time step under policy $\pi$. The intuition behind R-Learning is that in the long run the average reward obtained by a specific policy is the same, but some state-action pairs receive better-than-average rewards for a while, while others may receive worse-than-average rewards. This transient, the difference to the average reward received, $r - \rho^{\pi}$, is what defines the state-action value. To keep a running estimate of the average reward, R-Learning uses a second update rule, and one more parameter, $\beta$, for the learning rate of that update.
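The two update rules of R-Learning can be sketched in tabular form as below. This follows the common textbook presentation of the algorithm rather than the function-approximation setup of the experiments; the class name, the default step sizes and the tabular representation are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Hashable, Sequence


class TabularRLearner:
    """Average-reward control: Q-values are relative to the average reward rho."""

    def __init__(self, actions: Sequence[Hashable],
                 alpha: float = 0.1, beta: float = 0.01) -> None:
        self.actions = list(actions)
        self.alpha = alpha   # learning rate of the value update
        self.beta = beta     # learning rate of the average-reward update
        self.rho = 0.0       # running estimate of the average reward per step
        self.q = defaultdict(float)

    def best_value(self, state: Hashable) -> float:
        return max(self.q[(state, a)] for a in self.actions)

    def update(self, state, action, reward, next_state) -> None:
        # First rule: move Q(s, a) towards r - rho + max_a' Q(s', a').
        target = reward - self.rho + self.best_value(next_state)
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
        # Second rule: update rho only when the taken action is greedy in s.
        if self.q[(state, action)] == self.best_value(state):
            self.rho += self.beta * (reward - self.rho
                                     + self.best_value(next_state)
                                     - self.best_value(state))
```

Note that no discount factor appears anywhere in the update, which is why such a teacher has no $\gamma$ value of its own.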
Using R-Learning to learn a teacher’s acting policy, along with the rest of the experiments presented in Section 4.2, we can assess the importance of the $\gamma$ value and of the $\gamma$ value mismatch between student and teacher. Moreover, we assess other factors that possibly influence the quality of advice, such as the performance of the teacher in the acting task, its performance variance and a possible relation of its average td-error, $\delta$, with the quality of advising:
$$\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a) \qquad (4)$$
This is also part of the Q-Learning update rule. Furthermore, by dividing (4) by the previous value estimation, $Q(s, a)$, we get the percentage of error in relation to it, which we can call the td-error percentage:
$$\delta_{\%} = \frac{\delta}{Q(s, a)}$$
where $Q(s, a) \neq 0$. In our context, when the teacher uses an acting policy to produce advice it can still compute, for each of the student’s experiences, its own td-error just as it would do if it were actually making a learning update. In the same context, we can intuitively say that $\delta_{\%}$ represents the teacher’s surprise at its new estimation of a state-action value. (Footnote 2: Note that this definition of surprise, although similar, is different from that presented in prior work, which normalizes for different learners and not for different state-action pairs.)
Consequently, a teacher with a high average td-error percentage is a teacher with a more unreliable value estimation and may therefore be less suitable as a teacher, since its action suggestions are based on a non-converged value function.
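As a small worked sketch, the td-error percentage can be computed from a single student experience as follows. It assumes the Q-Learning form of the td-error, $r + \gamma \max_{a'} Q(s', a') - Q(s, a)$, divided by the previous estimate $Q(s, a)$; averaging absolute values over experiences is a choice made here for illustration, and all function names are hypothetical.

```python
from typing import Iterable, Tuple


def td_error(reward: float, gamma: float, q_sa: float, max_q_next: float) -> float:
    """Q-Learning temporal-difference error for one experience."""
    return reward + gamma * max_q_next - q_sa


def td_error_percentage(reward: float, gamma: float,
                        q_sa: float, max_q_next: float) -> float:
    """TD error relative to the previous value estimate Q(s, a)."""
    if q_sa == 0:
        raise ValueError("undefined when the previous estimate Q(s, a) is zero")
    return td_error(reward, gamma, q_sa, max_q_next) / q_sa


def average_surprise(experiences: Iterable[Tuple[float, float, float]],
                     gamma: float) -> float:
    """Mean absolute td-error percentage over (reward, Q(s,a), max Q(s',.)) triples."""
    values = [abs(td_error_percentage(r, gamma, q, nq)) for r, q, nq in experiences]
    return sum(values) / len(values)
```

For example, with a reward of 1, $\gamma = 0.9$, a previous estimate of 2 and a best next value of 3, the td-error is 1.7 and the td-error percentage is 0.85.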
4.2 Experiments and Results
Based on the discussion in the previous section (Section 4.1), the main goal of the following experiments is to find the teacher’s policy parameters (such as $\gamma$) that most affect the quality of advice for different student parameters. The experimental design is as follows. In the first phase, we created $\gamma$-specific teachers by training five Q-Learning agents and one R-Learning agent for 1000 episodes. The Q-Learning agents all had the same parameters except $\gamma$, which took five different values. The rest of their parameters were fixed to the values used in previous work. The $\lambda$ parameter accounting for eligibility traces was set to zero so that the effect of experimentally controlling the $\gamma$ parameter is isolated. Finally, the $\beta$ parameter of R-Learning was set to 0.0001 (preliminary experiments found that it produced good results in Pac-Man).
After training for 1000 episodes, the $\gamma$-specific Q-Learning teachers and the R-Learning teacher were evaluated on 500 episodes of acting alone in the environment. We calculated their average episode score and the coefficient of variation of these scores, as both are possible determining factors of advice quality. The coefficient of variation was used as a measure of score dispersion, as it shows the extent of variability in relation to the mean of the score, allowing a clearer comparison of variance between methods with different average performance. It is a unit-less measure calculated as $CV = \sigma / \mu$, the standard deviation of the scores divided by their mean.
In Table II we can see their average episode score on 500 episodes along with the coefficient of variation of that score. R-Learning had significantly worse average acting performance than all versions of Q-Learning. Interestingly, episodic Q-Learning (with $\gamma$ close to 1) did not perform as well as expected. Moreover, a very low $\gamma$ value (0.05) came second, showing that a myopic RL agent can perform well in Pac-Man. This result indicates the highly stochastic nature of the game, where reactive short-sighted strategies, based more on survival, can perform better than far-sighted strategies.
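The coefficient of variation of a list of episode scores can be computed as in the sketch below. The use of the population standard deviation is an assumption (the sample standard deviation is an equally common choice), and the function name is hypothetical.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(scores: Sequence[float]) -> float:
    """Unit-less dispersion: standard deviation of the scores divided by their mean."""
    mean = statistics.fmean(scores)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    # Assumption: population standard deviation over the evaluation episodes.
    return statistics.pstdev(scores) / mean
```

Because the measure is relative to the mean, two teachers with very different average scores can still be compared on how consistent their performance is.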
After the initial training and the evaluation of the acting policies they learned, these agents could be used as teachers for tabula-rasa student agents. In these experiments we used a simple fixed teaching-advising policy called Every-4-Steps for all teachers since we focus only on the quality of the advice itself and not on the quality of its distribution to the student (teaching policy).
In the Every-4-Steps teaching policy, the teacher gives one piece of advice to the student every four steps. Using this fixed advising policy we can test and compare the efficacy of the advice when this is not given consecutively, thus testing how useful the advice is when the student does not take a complete policy trajectory from the teacher, but has to use its own policy in between.
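A fixed-interval teaching policy of this kind can be sketched as follows. Which step within each group of four receives the advice (here, steps 0, 4, 8, ...) and the function signature are assumptions for illustration.

```python
from typing import Hashable, Optional, Tuple


def every_n_steps_advice(step: int, budget_left: int, teacher_action: Hashable,
                         n: int = 4) -> Tuple[Optional[Hashable], int]:
    """Return (advice or None, remaining budget) for the given time step."""
    if budget_left > 0 and step % n == 0:
        # Spend one unit of budget on the teacher's suggested action.
        return teacher_action, budget_left - 1
    # In between, the student acts using its own policy.
    return None, budget_left


def advised_steps(num_steps: int, budget: int, n: int = 4) -> list:
    """Steps at which advice is actually given during one run."""
    steps = []
    for step in range(num_steps):
        advice, budget = every_n_steps_advice(step, budget, "advised_action", n)
        if advice is not None:
            steps.append(step)
    return steps
```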
Using the Every-4-Steps teaching policy and a fixed advice budget, we ran 30 trials of advising learning students for each $\gamma$-specific teacher-student pair. Specifically, the parameters of these teacher-student pairs come from the Cartesian product of the six teachers and the five student $\gamma$ values (30 pairs), where the R-Learning teacher in the first set is denoted with a “-” since it does not have a $\gamma$ value.
In Figure 2 we can see the average performance of each teacher-student pair compared to the same student not receiving advice at all. Combining these results with Table II of the teachers’ performance when they were acting alone, we can see that the best performer is not the best teacher, with best performer defined as the one with the best average score when acting alone in the task. The best example of this is R-Learning, whose average score was worse than that of any $\gamma$-specific Q-Learning agent; however, as we can see in Figure 2, it is almost as good a teacher as the best Q-Learning teacher. R-Learning advising improved every student’s score whatever their $\gamma$ value, while not resulting in negative transfer for any of them.
Moreover, we can see a pattern where the lower the coefficient of variation (CV) for the acting performance is, the better the teacher, indicating that CV can be an important criterion in model selection for teachers. This is non-trivial, since average agent performance (and not its variance) is the dominant model selection criterion adopted in most of the relevant literature in RL. Performance variance expressed by CV seems especially important in our context, that of sparse advising, where the advice should be good whatever the next actions of the student will be.
Based on the results presented here, we cannot observe any particular pattern relating teaching performance to the $\gamma$ values of a teacher-student pair. Interestingly, though, a teacher with a given $\gamma$ value is not necessarily the most helpful for a student, and certain teacher-student $\gamma$ combinations even result in significant negative transfer. The teacher with the episodic $\gamma$ value and the non-discounting R-Learning one were the most helpful to all students, showing that R-Learning can perform well in settings where the student’s $\gamma$ is unknown or varying, such as in the case of human students.
Having identified the possible use of R-Learning for producing acting policies suitable for advising and the importance of performance CV to model selection, we conducted one more experiment between identical teachers.
Specifically, we independently trained 30 Q-Learning teachers with the same parameters, feature sets and characteristics for 1000 episodes. Due to their different experiences and the stochasticity of the game, they naturally learned different policies (i.e., final feature weights in their function approximators). Then, the trained teachers played alone for 500 episodes and we recorded their average performance, their performance variance and their average TD-error percentage, as defined in Section 4.1. We then used the Every-4-Steps teaching policy with each one of them advising a standard Sarsa student who would learn the task for 1000 episodes. Finally, we recorded the student’s average score.
In Figure 3 we can see a correlation plot of the factors mentioned above, using a one-tailed non-parametric Spearman correlation test. Confirming the previous results, we can see the negative and statistically significant relation of CV to teaching performance. Acting performance also has a medium and positive correlation with teaching performance (student’s score), but it is marginally statistically insignificant. By weighing average performance in its calculation, CV has a stronger relation to teaching performance than standard statistical variance. Moreover, we see that the teacher’s surprise relates strongly and negatively to the acting performance of the teacher and not to its teaching performance.
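For readers who want to reproduce this kind of analysis, a Spearman rank correlation can be computed as the Pearson correlation of the ranks, as sketched below with the standard library only. The sketch covers the correlation coefficient, not the significance test, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))
```

In the setting above, `x` would hold one teacher statistic (e.g., the CV of each of the 30 teachers) and `y` the corresponding student scores.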
5 Learning to Distribute Advice
In this section we change focus from advice production to advice distribution, learning a teaching policy in order to most effectively distribute the advice budget.
5.1 Constrained Exploitation Reinforcement Learning
We attempt a more natural formulation of the AuB learning problem described in Section 3.2 by identifying it as an instance of a more generic reinforcement learning problem. This RL problem can be simply described as learning control with constraints imposed on the exploitation ability of the learning agent. These constraints can either be a finite number of times the agent can exploit using its policy, possibly states where it is only allowed to explore, or even perhaps a task where it is costly to have access to an optimal policy and we are allowed to use it only for a limited number of times.
How does this RL problem relate to the learning to teach problem? The first insight is that the advise/no-advise decision problem has a striking resemblance to the core exploration-exploitation problem of RL agents. Consider the learning to teach problem. We can view the problem as follows: when the teacher agent is advising, it is actually acting on the environment, because an obedient student agent will always apply its advice, thus becoming a deterministic actuator for the teacher. In the case of a non-obedient student, the teacher could be said to be using a stochastic actuator.
Consequently, we can view the teacher agent as an acting agent using a student agent as its actuator for the environment. Moreover, the teacher is acting greedily by advising its best action; thus, it exploits. Under this perspective, with advice seen as action, how could we view the no advice action of a teacher? The no advice action can be seen as “trusting” the student to control the environment autonomously. Thus, choosing not to advise in a specific state can be seen as denoting that state to be non-critical with respect to the remaining advice budget and the student’s learning progress, or denoting a lack of teacher’s knowledge for that state. From the teacher’s point of view, not advising can be seen as an exploration action. So controlling when not to advise can be seen as a directed exploration problem in MDPs. Imposing a budget constraint, that is a constraint on the number of times a teacher agent can advise (i.e., exploit) is a problem of constrained and directed exploitation.
We will consider a simple and motivating example of such a domain. In a grid world a robot learns an optimal path towards a rewarding goal state, while it should keep away from a specific damaging state. The robot is semi-autonomous: it can either control itself using its own policy or it can be teleoperated a limited number of times. For the robot’s operator, what is an optimal use of this finite number of control interventions? In which states would it be best to control the robot directly, leaving control of the rest to the robot?
Similarly to the previous example, learning and executing advising policies in a game can be another example of the constrained exploitation problem, which is also the main focus of this article. For example, in a video game like Pac-Man, a game hints system plays the role of the external optimal controller with a limited intervention budget. Such a hint system could suggest actions to human players—when these are most necessary—depending also on the player’s policy.
In the rest of this section, we use the term exploitation where one can think of advising and the term exploration where one can think of not advising, focusing on the broader learning problem.
5.2 Learning Constrained Exploitation policies
Formulating the constrained exploitation task as a reinforcement learning problem first requires defining a horizon for the returns. This horizon should differ from that of the actual underlying task (e.g., Pac-Man) for two reasons: (a) if the underlying task is episodic, the scope of an exploration-exploitation policy is naturally greater than a single episode and spans many episodes of the learning agent; and (b) if the underlying task is continuing, or requires several training episodes for the student, the exploration-exploitation policy may have to be evaluated over a shorter (finite) horizon (e.g., the first training episodes). The importance of exploration is usually limited in the late episodes, where the student may have already converged to a policy. A teaching policy should be evaluated primarily over a training period in which advice still matters.
Concerning the return horizon of a constrained exploitation task (and similarly to  but from a different perspective), we propose algorithmic convergence  as a suitable stopping criterion for an exploration-exploitation policy. This defines a meaningful horizon for exploration-exploitation tasks, since their goal is completed exactly then: not at the end of an episode, and not during the continued execution of an RL algorithm after convergence, where exploration may no longer affect the underlying policy. We proceed by defining the Convergence Horizon Return.
Convergence Horizon Return Let be the return of the rewards received by an exploration-exploitation policy, the value function of the underlying MDP and a small constant then:
where for the time step applies:
Given a small constant and the algorithmic convergence of the RL algorithm learning in the underlying MDP, the quantity . The algorithmic convergence will be realized either if the learning rate is discounted or if some temporal difference of the underlying algorithm tends to .
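One way to make the convergence horizon operational is to track the largest change in the learner's value estimates between successive snapshots and to stop accumulating return once that change falls below a small constant. The following is a minimal sketch under that assumption; the tabular representation and the names `convergence_horizon`, `horizon_return`, `value_snapshots` and `epsilon` are illustrative and are not taken from the definition above.

```python
from typing import Dict, Hashable, Optional, Sequence


def convergence_horizon(
    value_snapshots: Sequence[Dict[Hashable, float]],
    epsilon: float = 1e-3,
) -> Optional[int]:
    """Return the first snapshot index at which no state's value estimate
    changed by epsilon or more since the previous snapshot, or None if
    that never happens within the recorded snapshots."""
    for t in range(1, len(value_snapshots)):
        previous, current = value_snapshots[t - 1], value_snapshots[t]
        states = set(previous) | set(current)
        largest_change = max(
            (abs(current.get(s, 0.0) - previous.get(s, 0.0)) for s in states),
            default=0.0,
        )
        if largest_change < epsilon:
            return t
    return None


def horizon_return(rewards: Sequence[float], horizon: int, gamma: float = 1.0) -> float:
    """Discounted sum of the rewards received before the given horizon."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards[:horizon]))
```

In this reading, rewards received after the returned index do not contribute to the return of the exploration-exploitation policy.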
Having adopted the convergence horizon for the return of a teaching task as well, the next question is which rewards constitute that return.
One possible goal for any teacher advising with a finite amount of advice would be to help minimize the student’s regret with respect to the reward obtained by an optimal policy. However, since we do not assume knowledge of such a policy, and because the amount of advice is finite, a better goal could be to advise based on the state-action value of the advised action rather than its immediate reward. If the student were able to follow the rest of the teacher’s policy after receiving advice, then the advised action would be the best possible for the current state. Consequently, we define the notion of value regret.
Value Regret In a convergence horizon , the value regret, of an exploration-exploitation policy (i.e., teaching policy) with respect to both an acting policy obtained after the period and an acting policy (i.e., student’s policy), , in time step is:
where denotes the corresponding value function of .
The intuition behind this definition of regret in our context (where the acting agent is the student) is that the best teacher for any specific student would ideally be the student itself, once it has reached convergence or its near-optimal policy.
The important point here is that, because a student agent receives a finite amount of advice, it cannot improve its asymptotic performance. Consequently, the evaluation of a teaching policy should ideally be based on the student’s own optimal policy, which is its sustainable optimality, and not on that of some possibly very different teacher.
For example, consider two states in a teacher’s acting MDP, and . A student agent learning with a very simplistic state representation may observe these states as just one, , and not differentiate between them. Then, the student’s optimal action in state will have a different expected return than that obtained by the teacher from either or . Its sustainable optimality is defined by what is optimal given its simplistic internal representation. Any advice based on a finer representation may not be supported consistently by the student in the long run. A teaching policy should ideally be evaluated on how much it speeds up the student’s convergence to its own optimal policy.
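Under this intuition, value regret can be sketched as the value lost, measured by the converged value function, each time the acting agent takes an action other than the converged policy's greedy one. The code below is one plausible reading rather than the formal definition; the tabular `QTable` layout and the function name `value_regret` are assumptions made for illustration.

```python
from typing import Dict, Hashable, Iterable, Tuple

# Assumed layout: state -> action -> value under the converged policy.
QTable = Dict[Hashable, Dict[Hashable, float]]


def value_regret(
    q_converged: QTable,
    visited: Iterable[Tuple[Hashable, Hashable]],
) -> float:
    """Sum, over visited (state, action) pairs, of the gap between the
    best action value in that state and the value of the action taken."""
    total = 0.0
    for state, action in visited:
        action_values = q_converged[state]
        total += max(action_values.values()) - action_values[action]
    return total
```

A trajectory that always follows the converged greedy action has zero value regret in this sketch, which matches the idea that the converged student is its own best teacher.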
In the next section we propose a reward signal for teachers based on Value Regret.
5.3 The Q-Teaching algorithm
The Q-Teaching algorithm described and proposed in this section is an RL advising (teaching) algorithm learning a teaching policy. For this, we propose a novel reward scheme for the teacher based on the value regret (see Definition 5.2).
The key insight of the method is that of rewarding a teaching policy with quantities of the form where is an estimation of the student’s action in and is the teacher’s greedy action in (i.e., the action used for advice). This reward has a high value when the value of the greedy action is significantly higher than the value of the action that the student would take. This means that the teacher is encouraged to advise when the advised action is significantly better than the action the student would take.
For reasons of efficiency, and to emphasize the value impact of the advising action, Q-Teaching rewards all no-advice actions with zero. The advantages of such a scheme are that the teacher’s cumulative reward is based only on the value gain produced when advising, and that a teaching episode can finish when the budget is exhausted, without having to observe all the student’s episodes after that point. In preliminary experiments, rewarding no-advice actions too (which occur significantly more often than advice actions, whose number is capped by the budget) overpowered the advice actions, resulting in an imbalanced expression of the two actions in the teaching value function.
Still, when advising, the teacher should estimate in order to compute its reward. The simplest solution is that since we do not have access to the value function of the student or its internals, we use the acting value function of the teacher as an approximation for the optimal value function of the student, . To estimate the teacher has several options. If the teacher is notified of the intended action of the student beforehand, it can use that to compute the reward. If we assume no knowledge of the student’s intended action then some other estimation method for the student’s intended action should be used. An example of such an estimation method is used in the Predictive Advice method .
While predicting the actual student’s action () is possible, there are other—simpler—choices for this estimation too. For example, the Importance Advising (see Section 2.2) uses a very similar quantity for the advising threshold, of the form . For Importance Advising, we can say that —it pessimistically assumes the student will take the worst action, representing the risk of the state. The advantage of such an assignment is that it is based on a well-tested criterion  and that it does not need knowledge of the student’s intended action (desirable for most realistic settings). The disadvantage is that we have a less detailed reward which is also not adapting to the student’s specific necessities but mostly, to the domain’s characteristics.
Based on this dichotomy, we propose two versions of Q-Teaching (see Algorithm 1), the off-student’s policy Q-Teaching and the on-student’s policy Q-Teaching. The on-student’s policy Q-Teaching uses the value of the actual student’s action to compute the reward (thus it is directly influenced by its policy). We can intuitively say that on-student’s policy Q-Teaching will advise when the student is mostly expected to act sub-optimally with respect to the acting value function of the teacher, . On the other hand, the off-student’s policy Q-Teaching uses the criterion discussed above and the teaching policy is not directly influenced by the policy of the student. Specifically, it is rewarding its teaching policy, , at time-step with the q-value difference of the best action to the worst action, as these were found at time .
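The two reward variants can be contrasted in a few lines. The sketch below assumes a tabular acting value function stored as nested dictionaries; the function name `teaching_reward` and its parameters are hypothetical. Passing `student_action=None` selects the off-student's policy baseline (the worst action), while passing the student's action selects the on-student's policy baseline. No-advice steps receive zero reward, as described above.

```python
from typing import Dict, Hashable, Optional

# Assumed layout of the teacher's acting value function: state -> action -> value.
QTable = Dict[Hashable, Dict[Hashable, float]]


def teaching_reward(
    q_acting: QTable,
    state: Hashable,
    advised: bool,
    student_action: Optional[Hashable] = None,
) -> float:
    """Teacher's reward for one step, computed from the teacher's own
    acting values."""
    if not advised:
        return 0.0  # all no-advice actions are rewarded with zero
    values = q_acting[state]
    best_value = max(values.values())
    if student_action is None:
        # Off-student's policy: pessimistic baseline, the worst action.
        baseline_value = min(values.values())
    else:
        # On-student's policy: baseline is the student's own action.
        baseline_value = values[student_action]
    return best_value - baseline_value
```

In this sketch the off-student's policy reward depends only on the state, whereas the on-student's policy reward shrinks to zero whenever the student already intends to take the teacher's greedy action.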
The Q-Teaching algorithm proceeds as follows (see Algorithm 1). A teacher agent enters an RL acting task to learn an acting policy. It initializes two action-value functions, and , the acting value function and the teaching value function respectively (lines 1-2). Of course, it can also use an existing acting value function.
Being in time step and state the teacher queries its acting value function for the greedy action in that state (line 6). Depending on whether we use the off-student’s policy or the on-student’s policy Q-Teaching, the teacher sets a baseline action, , to either the worst possible action for that state or to the action just executed by the student (lines 8-12).
Then, the teacher chooses an action from based on and its exploration strategy. If the teacher chooses to advise (line 13) it gives the action as an advice to the student agent. If the teacher chooses not to advise, the student will proceed with its own policy.
In line 19 the teacher observes the student’s actual action and its new state and reward, . Once again, the student may be the teacher itself; in this case, it observes its own action, which was taken based on and its exploration strategy.
In line 20, the first Q-Learning update takes place for the acting value function based on the environment’s reward. For the teaching value function update, the teacher’s reward, is calculated first, based on the freshly updated values of the best and baseline actions, and respectively (lines 21-25).
Finally, a Q-Learning update for the teaching value function takes place based on the reward, and the algorithm continues in the same way until whichever of the following two events comes first: either the advice budget is exhausted or the student reaches a learning episode that we have predetermined as its convergence horizon. Either event completes one learning episode (session) for the teacher.
In this version, the Q-Teaching algorithm is based on the Q-Learning algorithm, although in principle any RL algorithm could be used for the underlying learning updates of Q-Teaching. However, if an off-policy RL algorithm such as Q-Learning is chosen for the updates of both the acting and the teaching value function, then the point of transition from acting to teaching is irrelevant to the learning progress of the two policies. Reducing the impact of the exploration policy on the learning updates allows for smoother interaction between the two policies and ensures that we continue to learn the same policies. In principle, a Q-Teaching agent is able to update both its acting and teaching value functions continually and refine not only when it should advise but also what it should advise.
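The walkthrough above can be condensed into a small tabular sketch. It is not a transcription of Algorithm 1: the class name `QTeacher`, its method names, the use of the plain acting state as the teaching state (without the augmented features discussed later), and the default hyperparameters are all assumptions made for illustration.

```python
import random
from collections import defaultdict

ADVISE, NO_ADVICE = 1, 0
TEACHING_ACTIONS = (ADVISE, NO_ADVICE)


class QTeacher:
    """Illustrative tabular teacher with an acting and a teaching value function."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1,
                 budget=10, on_student_policy=False):
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.budget = budget
        self.on_student_policy = on_student_policy
        self.q_act = defaultdict(float)    # (state, action) -> value
        self.q_teach = defaultdict(float)  # (state, teaching action) -> value

    def greedy_action(self, state):
        return max(self.actions, key=lambda a: self.q_act[(state, a)])

    def worst_action(self, state):
        return min(self.actions, key=lambda a: self.q_act[(state, a)])

    def choose(self, state):
        """Epsilon-greedy choice between advising and not advising."""
        if self.budget <= 0:
            return NO_ADVICE
        if random.random() < self.epsilon:
            return random.choice(TEACHING_ACTIONS)
        return max(TEACHING_ACTIONS, key=lambda u: self.q_teach[(state, u)])

    def step(self, state, teaching_action, student_action,
             executed_action, env_reward, next_state):
        """Apply both Q-Learning updates for one observed transition."""
        # Best and baseline actions are fixed before the acting update.
        best = self.greedy_action(state)
        baseline = student_action if self.on_student_policy else self.worst_action(state)

        # Acting value function update from the environment's reward.
        next_best = max(self.q_act[(next_state, a)] for a in self.actions)
        key = (state, executed_action)
        self.q_act[key] += self.alpha * (env_reward + self.gamma * next_best - self.q_act[key])

        # Teaching reward from the freshly updated acting values; zero when not advising.
        teach_reward = 0.0
        if teaching_action == ADVISE:
            teach_reward = self.q_act[(state, best)] - self.q_act[(state, baseline)]
            self.budget -= 1

        # Teaching value function update.
        next_teach = max(self.q_teach[(next_state, u)] for u in TEACHING_ACTIONS)
        t_key = (state, teaching_action)
        self.q_teach[t_key] += self.alpha * (teach_reward + self.gamma * next_teach - self.q_teach[t_key])
        return teach_reward
```

A teaching session in this sketch would call `choose` and `step` repeatedly until `budget` reaches zero or the student reaches its predetermined convergence horizon.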
Since our goal is to introduce Q-Teaching as a flexible and generic enough method to be applied to multiple domains, we propose a series of state features for the teaching policy that we think are necessary. From our experiments, Q-Teaching works best with an augmented version of the acting task state space (see Table III) similar to that of  (Zimmer’s method). Also in Table III, note the role of the student’s progress feature (): it homogenises the student’s Markov chain by inducing a state feature for time (see Section 3.2).
5.4 Experiments and Results
In this section, we present results from using Q-Teaching in the Pac-Man Domain. We evaluate both on-student’s policy Q-Teaching and off-student’s policy Q-Teaching, in two variations each: known or unknown student’s intended action. Note that methods like Zimmer’s and Mistake Correcting require knowledge of the student’s intended action.
We use two versions of students for the experiments: a low-asymptote and a high-asymptote Sarsa student. Referring to  and Section 2.3, the low-asymptote students receive a state vector of 16 primitive features related to the current game state, while the high-asymptote students receive a state vector of 7 highly engineered features providing more information. The low-asymptote students perform significantly worse than the high-asymptote ones.
Additionally, we choose to bootstrap all compared teaching methods with the same acting policy in order to compare only their advice distribution performance and not their quality of the advice. The acting policy used for producing advice comes from a high-asymptote Q-Teaching agent after 1000 episodes of learning. Moreover, we use Sarsa students in order to emphasize the ability to advise students that are different to the teacher. All learning methods (Zimmer’s and Q-Teaching) were trained for 500 teaching episodes (sessions) to be equally compared for their learning efficiency too.
The Q-Teaching learning parameters for the teaching policy were , decaying and whereas all Sarsa students had , and .
The evaluation was based on the student performance (game score), using the Total Reward TL metric  divided by the fixed number of training episodes. Student performance is evaluated every 10 advising (learning) episodes over 30 episodes in which the student acts alone and does not learn. For the comparisons between average score performances we used pairwise t-tests with Bonferroni correction. Statistically significant results are denoted with their significance level and always refer to paired comparisons.
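As a small illustration of the evaluation arithmetic, the sketch below computes the total reward normalised by the number of training episodes and the per-comparison significance level under a Bonferroni correction. The function names are hypothetical, and the t-tests themselves are not reproduced here.

```python
from typing import Sequence


def average_total_reward(episode_scores: Sequence[float]) -> float:
    """Total reward over training divided by the number of training episodes."""
    return sum(episode_scores) / len(episode_scores)


def pairwise_comparisons(num_methods: int) -> int:
    """Number of distinct pairs among the compared methods."""
    return num_methods * (num_methods - 1) // 2


def bonferroni_alpha(alpha: float, num_comparisons: int) -> float:
    """Significance level each individual test must meet so that the
    family-wise error rate stays at or below alpha."""
    return alpha / num_comparisons
```

For instance, comparing five methods pairwise gives ten tests, so a family-wise level of 0.05 corresponds to a per-test level of 0.005.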
In Figure 4(c), teacher agents advise a low-asymptote Sarsa student who always announces its intended action. We can see that Zimmer’s method performs best and off-student’s policy Q-Teaching comes second with a statistically significant difference (). The heuristic-based method Mistake Correcting with a tuned threshold value of comes third. On-student’s policy Q-Teaching performed worse than the previous three methods by a small margin, not having found as good an advice distribution policy (non-significant difference to Mistake Correcting). Finally, all methods performed statistically significantly better () than not advising, effectively speeding up the learning progress of the student.
In Figure 4(d), the teachers advise a high-asymptote Sarsa student. Here, the tuned version of Mistake Correcting () performed statistically significantly better () than all other methods, with the Q-Teaching methods coming second and third (respectively) and Zimmer’s method coming next (with non-significant differences between them).
For the case when the teacher agent is not aware of the student’s intended action, in Figure 4(a) the off-student-policy Q-Teaching performs best while Importance Advising () follows with a small performance difference (n.s.). Early Advising (giving all advice in the first steps) performs statistically significantly worse (at ) than both Q-Teaching and Importance Advising. In these experiments, we did not use on-student’s policy Q-Teaching since that requires knowing the student’s intended action to compute the reward.
In Figure 4(b), advising a high-asymptote Sarsa student, Q-Teaching had the second best performance, with the heuristic-based method Importance Advising () performing better (non-significant difference). For high-performing students, a poorly distributed advice budget can be much less effective. For example, if the teacher knows the student’s intended action, it does not spend advice in states where the student would choose the correct action anyway. This is emphasized in this specific case, since not advising did not perform significantly worse than the rest of the methods.
Finally, in Table V we can see the average total reward in 1000 training episodes for all the teaching methods. All methods knowing the student’s intention performed better than those not, taking advantage of that knowledge.
It is important to note that Q-Teaching, the only learning AuB method allowing students to not announce their intended action, performed relatively well compared to methods that know the student’s intended action, which is an advantage of the proposed method.
Another advantage is that off-student’s policy Q-Teaching can use the same teaching policy for very different students, since it is not directly influenced by the student’s policy or by the rewards received by the student when not advising (as in Zimmer’s method). This is a significant advantage in terms of learning speed and versatility, since heuristic methods have to be manually tuned for each student separately to find the optimum threshold, .
Moreover, while Zimmer’s method and Q-Teaching were both trained for 500 episodes (sessions), Q-Teaching training completed significantly faster, since Zimmer’s method has to observe all 1000 episodes of each student session to complete just one of its own, whereas Q-Teaching has an upper bound on its episode length. This upper bound is the algorithmic convergence of the student (e.g., the low-asymptote student requires only 500 episodes to converge), and in most cases a teaching episode will complete much sooner, when the budget is exhausted (around the 30th episode for the low-asymptote student). More specifically, in Table V we can see the average training time needed for each teacher in terms of the average observed student episodes in each of the 500 teacher episodes. In general, our proposed methods need at least less training time than Zimmer’s method. We should also note here that, although non-learning methods do not need training time, they require a significant and variable amount of manual parameter tuning to achieve the reported performance.
On-student’s policy Q-Teaching did not perform as well as expected, the main problem being the non-stationary reward, which depends on the student’s changing policy. We believe that this method needs significantly more training time than off-student’s policy Q-Teaching because of its non-stationary reward, and that it probably needs more informative features for the student’s current status. In our case, the only such feature was the student’s training episode, which is the most basic information available about the student. Moreover, the training episode feature is student-dependent, since its meaning varies among students: some students learn faster than others.
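A back-of-the-envelope calculation, using only the figures quoted above for the low-asymptote student, shows the scale of the difference. The measured averages are those reported in Table V; the helper name below is hypothetical, and the value of 30 is the approximate episode at which the budget is said to finish, not a measured mean.

```python
def observed_student_episodes(teacher_sessions: int, student_episodes_per_session: int) -> int:
    """Total student episodes a teacher observes over all of its training sessions."""
    return teacher_sessions * student_episodes_per_session


# Zimmer's method observes every one of the 1000 student episodes in each session.
zimmer_total = observed_student_episodes(500, 1000)

# Q-Teaching upper bound: the student's convergence horizon (500 episodes).
q_teaching_upper_bound = observed_student_episodes(500, 500)

# Q-Teaching when a session ends as the budget finishes (around episode 30).
q_teaching_budget_limited = observed_student_episodes(500, 30)

print(zimmer_total, q_teaching_upper_bound, q_teaching_budget_limited)
```

Under these assumptions the upper bound alone halves the number of observed student episodes, and sessions that end when the budget finishes observe far fewer still.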
6 Related Work
In the field of transfer learning in RL, an agent uses knowledge from a source task to aid its learning in a target task. However, agents there transfer knowledge from one task to another, and do so in an off-line manner. Other differences between this typical TL setting and Agent Advising are described in Section 2.2 of this article.
More closely related work has one RL agent teach another without a direct knowledge transfer. Examples of such work include imitation learning and apprentice learning. In these approaches an expert provides demonstrations of the task to a student, and the student then has to extract a policy either by learning directly from them or by building a model to generate mental experience. In our setting, the teacher does not provide a full-policy trajectory and is limited in its number of interventions (advice budget). Moreover, we do not require a student with special processing abilities beyond being able to receive advice.
In  a non-learning teaching framework for RL tasks is proposed based on action advice. The methods presented there are described in more detail in section 2.2. One drawback of these methods is that since they are based solely on the teacher’s q-values they are not able to handle non-stationarity in the student’s learning task, and also have to be given a threshold of q-value differences, above which a state is considered important. This parameter needs to be manually tuned for each student in contrast to off-student’s policy Q-Teaching which can learn a more generic teaching policy focusing on the criticalities of the state space.
Also, since the methods presented in  are heuristic-based and not based on adaptive learning, the agent may spend all of its advising budget on early learning steps of the student that satisfy the importance threshold, while it may later encounter even more important states that exceed the given threshold by a wider margin.
The only other learning method for advising is introduced in  (Zimmer’s method) and is described in more detail in Section 2.2. One significant difference is that Zimmer’s method is based on the same reward received by the student, and so needs ad-hoc modifications for each task to steer the teacher towards a better advising policy. Our method uses a domain-independent reward signal based on the acting task q-values and can be used directly in any task. Moreover, their method has greater data complexity, since a complete batch of student training episodes is required for just one training episode of the teacher. As discussed in the previous section, our method may finish one teaching episode as soon as the budget is exhausted, completing an episode many times faster. Finally, and most importantly, Q-Teaching can be used in the more realistic setting where there is no knowledge of the student’s intended action.
Concerning the model selection criteria proposed in Section 4.1 for the teacher’s acting policy, to the best of the authors’ knowledge there is no other work in the relevant literature that examines these criteria and, furthermore, proposes performance variance, and specifically CV, as an important one. Most relevant works choose models based on their average performance, which, as discussed previously, is not enough to evaluate the teaching effectiveness of a policy sampled infrequently and in parts.
7 Conclusions and Future Work
In this article, we discussed and proposed criteria, considerations and methods for the problem of learning teaching policies to produce and distribute advice.
Concerning advice production, we identify a model selection problem for the teacher, selecting the appropriate acting policy from which to advise. The experiments showed the significant relation of CV to the teaching performance, promoting CV as an important criterion—among others tested—for selecting acting policies for advising. Moreover, average-reward RL was found to produce effective policies for sparse advising under budget, although these policies may under-perform when used as acting ones.
Concerning advice distribution (i.e., teaching policy), we proposed a novel representation of the learning to teach problem as a constrained exploitation reinforcement learning problem. Based on this representation we proposed a novel RL algorithm for learning a teaching policy, Q-Teaching, which is able to advise even without knowledge of the student’s intended action. Q-Teaching was found to perform at least as well as the other compared methods while needing significantly less training time.
Advice distribution under budget is a challenging problem, both theoretically and practically, posing a series of problems such as the non-stationarity of the teaching task, as a result of having a learning student as part of the environment. Efficient and principled handling of the budget constraint is another challenge.
From our experiments, Q-Teaching can be considered a promising method based on a more formal understanding of the problem. It is significantly more efficient in terms of data complexity than Zimmer’s method, and it can learn teaching policies without the assumption of having knowledge for the student’s intended action.
There are several future directions. Q-Teaching could be adapted to student agents with specific “disabilities” and could also be tested under different budget constraints to examine how the budget affects its teaching policies. Also, off-student’s policy Q-Teaching could be tested on multi-student scenarios, since not fitting to a particular student could prove effective when teaching multiple different students. Moreover, the theoretical properties of the algorithms should be studied, especially the case of learning a teaching and an acting policy at the same time, e.g., under which specific assumptions a teaching policy converges.
The general usefulness of CV as a criterion for selecting teachers should be studied, specifically how teacher selection criteria such as CV capture the robustness of a policy when that policy is used sparingly for advising.
Finally, other teaching architectures and representations should be studied, allowing, for example, a teacher to use only one value function for both advising under a budget and acting. Such a hybrid agent transitions smoothly from its actor role to the teacher’s one. A unified architecture and knowledge representation would further reveal the deep connection between acting and teaching, one we strongly believe exists.
-  Doran Chakraborty and Sandip Sen. Teaching new teammates. In Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems, pages 691–693. ACM, 2006.
-  Jeffery Allen Clouse. On Integrating Apprentice Learning and Reinforcement Learning. PhD thesis, 1996. AAI9709584.
-  Alessandro Lazaric. Transfer in reinforcement learning: a framework and a survey. In Reinforcement Learning, pages 143–173. Springer, 2012.
-  Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3-4):293–321, 1992.
-  Philipp Rohlfshagen and Simon M Lucas. Ms Pac-Man versus Ghost Team CEC 2011 competition. In Evolutionary Computation (CEC), 2011 IEEE Congress on, pages 70–77. IEEE, 2011.
-  Gavin Rummery and Mahesan Niranjan. On-Line Q-Learning Using Connectionist Systems. Technical Report CUED/F-INFENG-RT 116, Engineering Department, Cambridge University, 1994.
-  Anton Schwartz. A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the tenth international conference on machine learning, volume 298, pages 298–305, 1993.
-  Peter Stone, Gal A Kaminka, Sarit Kraus, Jeffrey S Rosenschein, et al. Ad hoc autonomous agent teams: Collaboration without pre-coordination. In AAAI, 2010.
-  D.W. Stroock. An Introduction to Markov Processes. Graduate Texts in Mathematics. Springer Berlin Heidelberg, 2004.
-  Richard S. Sutton and Andrew G. Barto. Reinforcement Learning, An Introduction. MIT Press, 1998.
-  Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(1):1633–1685, 2009.
-  Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. The Journal of Machine Learning Research, 10:1633–1685, 2009.
-  Lisa Torrey and Matthew Taylor. Teaching on a budget: agents advising agents in reinforcement learning. In Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems, pages 1053–1060. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
-  Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3):279–292, 1992.
-  Adam White, Joseph Modayil, and Richard S Sutton. Surprise and curiosity for big data robotics. In AAAI-14 Workshop on Sequential Decision-Making with Big Data, Quebec City, Quebec, Canada, 2014.
-  Yusen Zhan and Matthew E. Taylor. Online Transfer Learning in Reinforcement Learning Domains. In Proceedings of the AAAI Fall Symposium on Sequential Decision Making for Intelligent Agents (SDMIA), November 2015.
-  Matthieu Zimmer, Paolo Viappiani, and Paul Weng. Teacher-Student Framework: a Reinforcement Learning Approach. In AAMAS Workshop Autonomous Robots and Multirobot Systems, 2014.