In one of the first papers on Artificial Intelligence, John McCarthy described an “Advice Taker” system that could learn by being told [mccarthy_programs_1959]. This idea was then elaborated in [hayes-roth_knowledge_1980, hayes-roth_advice_1981], where a general framework for learning from advice was proposed. This framework can be summarized in the following five steps [cohen_handbook_1982]:
1. Requesting or receiving the advice.
2. Converting the advice into an internal representation (Interpretation).
3. Converting the advice into a usable form (Operationalization).
4. Integrating the reformulated advice into the agent’s knowledge base.
5. Judging the value of the advice.
The first step describes how advice can be provided to the system. Step 2 is related to interpreting advice. Steps 3, 4 and 5 describe how advice can be integrated into the learning process. In what follows, we review advice-taking systems based on these three aspects. As interpreting advice is a relatively recent research question, and most existing methods predefine the meaning of teaching signals, we cover this aspect at the end of each section.
1.1 Providing advice
Advice can be provided in two different forms: general and contextual (Fig. 1, Table 1). We define general advice as general information about the task, such as concept definitions, behavioral constraints, and performance heuristics, that does not depend on the context in which it is provided. Such advice is self-sufficient in that it includes all the information required for being converted into a usable form (operationalization). It can take the form of if-then rules that inform the agent about the optimal behaviour, either directly by specifying which actions should be performed in different situations [maclin_creating_1996, kuhlmann_guiding_2004], or indirectly by defining task constraints [maclin_giving_2005, maclin_knowledge-based_2005, torrey_advice_2008]. As general advice is not state-dependent, it can be communicated to the system at any moment of the task, even prior to the learning process (off-line).
Contextual advice, on the other hand, is state-dependent, in that the communicated information depends on the current state of the task. So, unlike general advice, it must be provided interactively throughout the task. Typical examples include guidance [thomaz_reinforcement_2006, suay_effect_2011], instructions [clouse_teaching_1992, rosenstein_supervised_2004, pradyot_instructing_2012, najar_interactively_20] and feedback [whitehead_complexity_1991, judah_reinforcement_2010, celemin_coach:_2019, najar_interactively_20].
The means by which advice is communicated by the human teacher to the system vary. The most natural and challenging way is to provide advice through natural language [kuhlmann_guiding_2004, cruz_interactive_2015, PaleologueMPC18]. Alternative solutions include specifying general advice via hand-written rules [maclin_creating_1996, maclin_knowledge-based_2005, maclin_giving_2005, torrey_advice_2008] and delivering contextual advice either via artificial interfaces such as keyboards and mouse clicks [suay_effect_2011, knox_training_2013] or through gestures [najar_interactively_20].
|General constraints||[hayes-roth_advice_1981, mangasarian_knowledge-based_2004, kuhlmann_guiding_2004, maclin_giving_2005, maclin_knowledge-based_2005, torrey_advice_2008]|
|General instructions||[maclin_creating_1996, kuhlmann_guiding_2004, branavan_reinforcement_2009, vogel_learning_2010, branavan_reading_2010]|
|Guidance||[thomaz_socially_2006, thomaz_learning_2009, suay_effect_2011, subramanian_exploration_2016, chu_learning_2016]|
|Contextual instructions||[utgoff_two_1991, clouse_teaching_1992, nicolescu_natural_2003, rosenstein_supervised_2004, thomaz_robot_2007, rybski_interactive_2007, tenorio-gonzalez_dynamic_2010, branavan_reading_2010, pradyot_integrating_2012, grizou_robot_2013, macglashan_translating_2014, cruz_interactive_2015, mathewson_simultaneous_2016, najar_interactively_20]|
|Corrective feedback||[nicolescu_natural_2003, chernova_interactive_2009, argall_teacher_2011, celemin_coach:_2019]|
|Evaluative feedback||[dorigo_robot_1994, colombetti_behavior_1996, isbell_social_2001, kaplan_robotic_2002, thomaz_reinforcement_2006-1, kim_learning_2007, knox_interactively_2009, judah_reinforcement_2010, tenorio-gonzalez_dynamic_2010, lopes_simultaneous_2011, knox_combining_2010, knox_augmenting_2011, knox_reinforcement_2012-1, knox_reinforcement_2012, grizou_robot_2013, griffith_policy_2013, grizou_interactive_2014, loftin_strategy-aware_2014, ho_teaching_2015, loftin_learning_2016, mathewson_simultaneous_2016, macglashan2017interactive, najar_training_2016, najar_interactively_20]|
1.2 Learning from advice
The way advice is used for learning depends on which kind of information is communicated.
Learning from general constraints:
The first ever implemented advice-taking system relied on general constraints that were written as LISP expressions, and converted into plans using a predefined set of transformations [hayes-roth_advice_1981]. General constraints were defined as domain concepts, behavioural constraints and performance heuristics. When the executed advice led to unexpected or unfavorable consequences, learning was triggered by correcting the advice and refining the knowledge base through predefined learning rules.
Knowledge-Based Kernel Regression (KBKR) is a method that allows incorporating advice, given in the form of if-then rules, into a kernel-based regression model [mangasarian_knowledge-based_2004]. This method was used for providing advice to an RL agent with Support Vector Regression as value function approximation [maclin_knowledge-based_2005]. In this case, advice was provided in the form of constraints on action values (e.g. if condition then the value of a given action is bounded by a constant), and incorporated into the value function through the KBKR method. This approach was extended in [maclin_giving_2005], by proposing a new way of defining constraints on action values. In the new method, pref-KBKR (preference KBKR), the constraints were expressed in terms of action preferences (e.g. if condition then prefer action $a_1$ to action $a_2$). This method was also used in [torrey_advice_2008].
Learning from general instructions:
General instructions inform more directly about the optimal behaviour, compared to general constraints, by explicitly specifying what to do in different situations. Like general constraints, they can be provided in the form of if-then rules. For example, RATLE was an advice-taking RL system that built on the five-step advice-taking process, and used a Q-learning agent with a neural network for value function approximation [maclin_creating_1996]. Advice was written, using an expert programming language with a predefined set of domain-specific terms, in the form of if-then rules and while-repeat loops. It was then incorporated into the Q-function by using an extension of the Knowledge-Based Artificial Neural Network (KBANN) method, which allows incorporating knowledge expressed in the form of rules into a neural network [towell_knowledge-based_1994].
In [kuhlmann_guiding_2004], a SARSA($\lambda$) agent using linear tile-coding function approximation was augmented with an Advice Unit that computed additional action values. Advice was expressed in a specific formal language in the form of if-then rules. Each time a rule was activated in a given step, the value of the corresponding action in the Advice Unit was increased or decreased by a constant, depending on whether the rule advised for or against the action. These values were added to those generated by the function approximator, and presented to the learning algorithm.
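The additive mechanism described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [kuhlmann_guiding_2004]: the rule format, the constant `ADVICE_STEP`, the toy state feature `opponent_distance`, and all function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
# A rule is (condition, action, sign): sign is +1 to advise for, -1 to advise against.
Rule = Tuple[Callable[[State], bool], str, int]

ADVICE_STEP = 2.0  # hypothetical constant added or subtracted per active rule


def advice_values(state: State, actions: List[str], rules: List[Rule]) -> Dict[str, float]:
    """Accumulate the advice value of each action for the current step."""
    values = {a: 0.0 for a in actions}
    for condition, action, sign in rules:
        if condition(state):
            values[action] += sign * ADVICE_STEP
    return values


def combined_values(q_values: Dict[str, float], advice: Dict[str, float]) -> Dict[str, float]:
    """Add advice values to those produced by the function approximator."""
    return {a: q_values[a] + advice[a] for a in q_values}


# Toy usage with made-up numbers.
actions = ["hold", "pass"]
rules = [
    (lambda s: s["opponent_distance"] < 5.0, "pass", +1),
    (lambda s: s["opponent_distance"] < 5.0, "hold", -1),
]
state = {"opponent_distance": 3.0}
q = {"hold": 1.0, "pass": 0.5}
total = combined_values(q, advice_values(state, actions, rules))
best_action = max(total, key=total.get)
```

In this toy case the advice outweighs the learned values, so the agent passes even though its own estimate favours holding. Because the advice enters as an additive term, the learner can still override it once its own value estimates grow large enough.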
Besides if-then rules, general instructions can also be provided in the form of detailed plan descriptions, or recipes, containing the sequence of actions that should be performed [branavan_reinforcement_2009]. These batch instructions are themselves composed of a sequence of contextual instructions (cf. next paragraph). They can be considered as a special form of communicated demonstrations [lin_programming_1991, whitehead_learning_1991]. However, unlike in demonstrations or if-then rules, state information in general instructions can be implicit. For example, they can be implicitly formulated within the expression of the action (e.g. “Click start, point to search, and then click for files or folders.”) [branavan_reinforcement_2009].
Learning from contextual instructions:
In contrast to general instructions, a contextual instruction depends on the state in which it is provided. To use the terms of the advice-taking process, a part of the information that is required for operationalization is implicit. More specifically, the condition part of each instruction is not explicitly communicated by the teacher, but must be inferred by the learner from the current context. Consequently, contextual instructions must be progressively provided to the learning agent along the task. Contextual instructions can be either low-level or high-level [branavan_reading_2010]. Low-level instructions indicate the next action to be performed [grizou_robot_2013], while high-level instructions indicate a more extended goal without explicitly specifying the sequence of actions that should be executed [macglashan_translating_2014].
A mathematical formulation of contextual instructions was proposed in [pradyot_integrating_2012]. In particular, the authors distinguished two types of instructions: $\pi$-instructions and $\Phi$-instructions.
$\pi$-instructions modify the agent’s policy towards a specific action by setting its selection probability to 1. For example, pointing to an object makes the agent perform a predefined action on it. $\Phi$-instructions, on the other hand, reduce the complexity of the current state by projecting its representation into a subspace of features. For example, pointing to an object makes the agent consider only its features and ignore all other aspects.
We identify three ways of using contextual instructions (Fig. 2, Table 2). First, the communicated action can be simply executed. For example, verbal instructions can be used for guiding a robot along the task [thomaz_robot_2007, tenorio-gonzalez_dynamic_2010, cruz_interactive_2015]. In [nicolescu_natural_2003] and [rybski_interactive_2007], a Learning-from-Demonstrations (LfD) system was augmented with verbal instructions, in order to make the robot perform some actions during the demonstrations. This way of using instructions can be referred to as guidance, which is the term used in [thomaz_robot_2007] and [cruz_interactive_2015]. However, the term guidance can have other meanings that we detail later.
The second way of using instructions is to integrate the information about the action within the model of the task. In [utgoff_two_1991], the authors presented a State Preference method (SP), where a teacher interactively informed a Temporal Difference (TD) agent about the next preferred state. This information could be provided by telling the agent what action to perform. State preferences were transformed into linear inequalities, which were integrated into the TD algorithm in order to accelerate the learning process. In [clouse_teaching_1992], instructions were integrated into an RL algorithm by positively reinforcing the proposed action. In [rosenstein_supervised_2004], the authors presented an Actor-Critic architecture that used instructions for both decision-making and learning. For decision-making, the robot executed a composite real-valued action that was computed as a linear combination of the actor’s decision and the supervisor’s instruction. Then, the error between the instruction and the actor’s decision was used as an additional parameter to the TD error for updating the actor’s policy.
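The last scheme, where an instruction influences both the executed action and the actor update, can be sketched for a scalar action as follows. This is a hedged illustration rather than the exact algorithm of [rosenstein_supervised_2004]: the blending weight `k`, the linear actor `theta * feature`, and the particular form of the update are assumptions made here for concreteness.

```python
def composite_action(actor_action: float, supervisor_action: float, k: float) -> float:
    """Blend the actor's decision and the supervisor's instruction; k in [0, 1] weights the actor."""
    return k * actor_action + (1.0 - k) * supervisor_action


def actor_update(theta: float, feature: float, actor_action: float,
                 supervisor_action: float, explored_action: float,
                 td_error: float, k: float, lr: float) -> float:
    """One update of a linear actor (action = theta * feature).

    The reinforcement term is driven by the TD error, and the supervised term
    by the error between the instruction and the actor's own decision.
    """
    rl_term = td_error * (explored_action - actor_action)
    sl_term = supervisor_action - actor_action
    return theta + lr * (k * rl_term + (1.0 - k) * sl_term) * feature


# Toy usage: the actor proposes 1.0 and the supervisor instructs 3.0.
executed = composite_action(actor_action=1.0, supervisor_action=3.0, k=0.5)
new_theta = actor_update(theta=0.0, feature=1.0, actor_action=0.0,
                         supervisor_action=2.0, explored_action=0.0,
                         td_error=0.0, k=0.5, lr=0.1)
```

With `k` close to 0 the supervisor dominates both execution and learning, while `k` close to 1 recovers a plain actor-critic update.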
A third approach consists in using the provided instructions for building an instruction model besides the task model. Both models are then combined for decision-making. For example, in [pradyot_integrating_2012], the RL agent arbitrates between the action proposed by its Q-learning policy and the one proposed by the instruction model, based on a confidence criterion. The same arbitration criterion was used in [najar_interactively_20], to decide between the outputs of a Task Model and an Instruction Model.
Note that in Figure 2 and Table 2, we use the term shaping to describe the way instructions are integrated into the learning process, even though this term was never employed in the cited papers. This is justified by our general definition of shaping as the modification of an agent’s behaviour through the use of human teaching signals, and the similarity at the computational level of the methods used for instructions and feedback (cf. Section 2). Using the same terminology is important for unifying feedback and instructions under the same general view (cf. Section 4).
|Model-free reward shaping (1)||[clouse_teaching_1992]|
|Model-based value shaping (2)||[najar_social-task_2015]|
|Model-free value shaping (3)||[utgoff_two_1991, maclin_creating_1996, kuhlmann_guiding_2004, maclin_knowledge-based_2005, maclin_giving_2005, torrey_advice_2008]|
|Model-based policy shaping (4)||[grizou_robot_2013, pradyot_integrating_2012, najar_interactively_20]|
|Model-free policy shaping (5&6)||[thomaz_robot_2007, tenorio-gonzalez_dynamic_2010, cruz_interactive_2015, nicolescu_natural_2003, rybski_interactive_2007, rosenstein_supervised_2004, suay_effect_2011, thomaz_reinforcement_2006, celemin_coach:_2019]|
Learning from guidance:
Guidance is a term that is encountered in many papers and has been made popular by the work of Thomaz et al. [thomaz_socially_2006] on Socially Guided Machine Learning. In the broad sense, guidance represents the general idea of guiding the learning process of an agent. In this sense, all interactive learning methods such as demonstrations and feedback can be considered as a form of guidance.
A somewhat more specific definition of guidance is when human inputs are provided in order to bias the exploration strategy [thomaz_learning_2009]. For instance, in [subramanian_exploration_2016], demonstrations were provided in order to teach the agent how to explore interesting regions of the state space. In [chu_learning_2016], kinesthetic teaching was used for guiding the exploration process for learning object affordances.
In the most specific sense, guidance constitutes a form of advice that consists in suggesting a limited set of actions from all the possible ones [suay_effect_2011, thomaz_reinforcement_2006]. In this sense, it is a generalization of contextual instructions where the teacher proposes more than one single action at a time. For example, in [cruz_interactive_2015], the authors used both terms of advice and guidance for referring to instructions. It is to be noted that in these works, the use of guidance was limited to the execution of the suggested action, so it was not directly integrated into the agent’s policy.
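Guidance in this most specific sense, where the teacher narrows down the candidate actions, can be sketched as a restriction of the action-selection step. The function below is a hypothetical illustration (epsilon-greedy selection over a tabular set of action values), not the mechanism of any cited system.

```python
import random
from typing import Dict, Optional, Sequence


def select_action(q_values: Dict[str, float],
                  guidance: Optional[Sequence[str]] = None,
                  epsilon: float = 0.1,
                  rng: Optional[random.Random] = None) -> str:
    """Epsilon-greedy selection, restricted to the suggested actions when guidance is given."""
    rng = rng or random.Random()
    candidates = [a for a in q_values if guidance is None or a in guidance]
    if not candidates:  # guidance names no known action: fall back to all actions
        candidates = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(candidates)
    return max(candidates, key=q_values.get)


# Toy usage with made-up action values.
q = {"left": 0.9, "right": 0.2, "open": 0.1}
unguided = select_action(q, epsilon=0.0)
guided = select_action(q, guidance=["right", "open"], epsilon=0.0)
```

Here guidance only affects which action is executed. The action values themselves are untouched, which matches the observation above that guidance was not directly integrated into the agent's policy.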
Learning from feedback:
We distinguish two main forms of feedback: evaluative and corrective. Evaluative feedback, also called critique, consists in evaluating the quality of the agent’s actions. Corrective feedback, also called instructive feedback, implicitly implies that the performed action is wrong. However, it goes beyond simply criticizing the performed action, by informing the agent about the correct one. Both forms of feedback can be provided either interactively after each performed action, or a posteriori to the task execution, in a batch fashion. Examples of interactive feedback can be found in [knox_interactively_2009] for evaluative feedback, and in [celemin_coach:_2019] for corrective feedback. Examples of batch feedback can be found in [judah_reinforcement_2010] for evaluative feedback, and in [argall_teacher_2011] for corrective feedback. While corrective feedback is used in several works, much more emphasis has been put on evaluative feedback, especially as a standalone training method. So, in this paragraph, we focus on corrective feedback; and we dedicate the next section to evaluative feedback.
Corrective feedback generally takes the form of either a contextual instruction [chernova_interactive_2009] or a demonstration [nicolescu_natural_2003]. In the latter case, we also talk about corrective demonstrations. The only difference with instructions (resp. demonstrations) is that they are provided after an action (resp. a sequence of actions) is executed by the robot, not before. So, operationalization is made with respect to the previous state, not to the current one.
So far, corrective feedback has been mainly used for augmenting LfD systems [nicolescu_natural_2003, chernova_interactive_2009, argall_teacher_2011]. For example, in [chernova_interactive_2009], while the robot was reproducing the provided demonstrations, the teacher could interactively correct any incorrect action. In [nicolescu_natural_2003], corrective feedback took the form of a shadowed demonstration. The corrective demonstration was delimited by two predefined verbal commands that were pronounced by the teacher. In [argall_teacher_2011], the authors presented a framework based on advice-operators, allowing a teacher to correct entire segments of demonstrations through a visual interface. Advice-operators were defined as numerical operations that can be performed on state-action pairs. The teacher could choose an operator from a predefined set, and apply it to the segment to be corrected.
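The idea of advice-operators can be sketched as functions over state-action pairs that are applied to a selected segment of a recorded trajectory. The operator names (`slow_down`, `speed_up`) and the scaling operation are hypothetical examples chosen here, not the operator set of [argall_teacher_2011].

```python
from typing import Callable, Dict, List, Tuple

StateAction = Tuple[Tuple[float, ...], Tuple[float, ...]]  # (state, action)
Operator = Callable[[StateAction], StateAction]


def scale_action(factor: float) -> Operator:
    """Build an operator that multiplies every action component by `factor`."""
    def apply(pair: StateAction) -> StateAction:
        state, action = pair
        return state, tuple(factor * a for a in action)
    return apply


# Hypothetical predefined operator set offered to the teacher.
OPERATORS: Dict[str, Operator] = {
    "slow_down": scale_action(0.5),
    "speed_up": scale_action(2.0),
}


def correct_segment(trajectory: List[StateAction], start: int, end: int,
                    name: str) -> List[StateAction]:
    """Return a copy of the trajectory with the operator applied to trajectory[start:end]."""
    op = OPERATORS[name]
    return [op(p) if start <= i < end else p for i, p in enumerate(trajectory)]


# Toy usage: the teacher slows down the last two steps of an execution.
trajectory = [((0.0,), (1.0,)), ((1.0,), (4.0,)), ((2.0,), (4.0,))]
corrected = correct_segment(trajectory, 1, 3, "slow_down")
```

The corrected pairs can then be added to the demonstration set, so the teacher never has to produce a full new demonstration.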
In [celemin_coach:_2019], the authors took inspiration from advice-operators to propose learning from corrective feedback as a standalone method, contrasting with other methods for learning from evaluative feedback such as TAMER [knox_interactively_2009].
1.3 Interpreting Advice
The second step of the advice-taking process stipulates that advice needs to be converted into an internal representation. This step corresponds to interpreting the perceived advice. Predefining the meaning of contextual advice, for example by hand-coding the mapping between instructions and their corresponding actions, has been widely used in the literature [clouse_teaching_1992, nicolescu_natural_2003, rosenstein_supervised_2004, thomaz_robot_2007, rybski_interactive_2007, chernova_interactive_2009, tenorio-gonzalez_dynamic_2010, lockerd_tutelage_2004, pradyot_instructing_2012, cruz_interactive_2015, celemin_coach:_2019]. However, this solution has many drawbacks. First, it limits the possibility for the teacher to use their own preferred signals. Second, it becomes even more inconvenient when considering general instructions, which often require expert programming skills [maclin_creating_1996, mangasarian_knowledge-based_2004, kuhlmann_guiding_2004, maclin_giving_2005, maclin_knowledge-based_2005, torrey_advice_2008].
The ultimate goal of advice-taking methods is to allow robots to take advantage of human advice in a natural and unconstrained manner. However, natural language understanding still raises many challenges. To address this question, different approaches can be taken. Some methods relied on pre-trained parsers using supervised learning methods [kate_using_2006, zettlemoyer_learning_2009, matuszek_learning_2013]. For example, in [kuhlmann_guiding_2004], the system was able to convert domain-specific advice expressed in a constrained natural language into formal advice, by using a parser trained with annotated data.
More recent approaches take inspiration from the grounded language acquisition literature [mooney_learning_2008], to learn a model that grounds the meaning of instructions into concepts from the real world. For example, natural language instructions can be paired with demonstrations of the corresponding tasks to learn the mapping between instructions and actions [chen_learning_2011, tellex_understanding_2011, duvallet_imitation_2013].
In [macglashan_translating_2014], the authors proposed a model for grounding high-level instructions into reward functions from user demonstrations. The agent had access to a set of hypotheses about possible tasks, in addition to command-to-demonstration pairings. Generative models of tasks, language, and behaviours were then inferred using Expectation Maximization (EM) [dempster_maximum_1977]. The authors extended their model in [macglashan_training_2014], to ground command meanings in reward functions, using evaluative feedback instead of demonstrations. In addition to having a set of hypotheses about possible reward functions, the agent was also endowed with planning abilities that allowed it to infer a policy according to the most likely task. In a similar work [grizou_robot_2013], a generative model was used for inferring a task from unlabeled low-level contextual instructions. The robot inferred a task while learning the meaning of the interactively provided verbal instructions. As in [macglashan_training_2014], the robot had access to a set of hypotheses about possible tasks, in addition to a planning algorithm.
A different approach for interpreting instructions relies on Reinforcement Learning [branavan_reinforcement_2009, branavan_reading_2010, vogel_learning_2010, mathewson_simultaneous_2016, najar_interactively_20]. In [branavan_reinforcement_2009], the authors used a policy-gradient RL algorithm with a predefined reward function, to map textual low-level instructions into actions in a GUI application. Contextual low-level instructions were provided a priori as a general instruction, detailing the step-by-step sequence of actions that must be performed. This model was extended in [branavan_reading_2010], to allow for the interpretation of high-level instructions, by including a model of the environment as in model-based RL [sutton_reinforcement_1998]. In [vogel_learning_2010], the authors followed the same idea for interpreting navigational instructions, in a path-following task, using the SARSA algorithm with value-function approximation. The rewards were computed according to the deviation from a reference path. In [najar_interactively_20], human instructions were interpreted by a robot and used simultaneously for learning a task. Even though their meaning was not determined beforehand, the use of unlabeled instructions allowed the robot to learn the task faster, while reducing the number of interactions with the human teacher. The authors proposed an interpretation method allowing for sporadic instructions, as opposed to the standard RL-based interpretation methods used in [branavan_reinforcement_2009, branavan_reading_2010, vogel_learning_2010, mathewson_simultaneous_2016]. An extended comparison between different interpretation methods can be found in [najar2017shaping].
2 Evaluative feedback
Training a robot by evaluating its actions can be an alternative solution to the standard Reinforcement Learning approach, whenever the implementation of a proper reward function turns out to be challenging [kober_reinforcement_2013]. It can also be effective in situations where it is difficult for the teacher to execute demonstrations, and where instructions would require a sophisticated communication channel.
2.1 Providing evaluative feedback
In the literature, there exist different views about how to represent evaluative feedback. It can be represented as a scalar value [knox_interactively_2009], a binary value [thomaz_reinforcement_2006-1, najar_interactively_20], a positive reinforcer [kaplan_robotic_2002], or categorical information [loftin_learning_2016].
Traditionally, evaluative feedback has been largely considered as a reward shaping technique that consists in providing the robot with intermediate rewards to speed up the learning process [isbell_social_2001, thomaz_reinforcement_2006-1, tenorio-gonzalez_dynamic_2010, mathewson_simultaneous_2016]. In these works, evaluative feedback was considered in the same way as the feedback provided by the agent’s environment in RL; so intermediate rewards are homogeneous to MDP rewards.
Other works pointed out the difference that exists between immediate and delayed rewards [dorigo_robot_1994, colombetti_behavior_1996, knox_reinforcement_2012]. In particular, they considered evaluative feedback as immediate information about the value of an action [ho2017social]. For example, in [dorigo_robot_1994], the authors did not address temporal credit assignment, and the generated rewards constituted “immediate reinforcements in response to the actions of the learning agent”, which amounts to considering rewards as equivalent to action values. In the TAMER framework [knox_interactively_2009], human-generated rewards were used for computing a regression model $\hat{H}$, called the “Human Reinforcement Function”, to predict the amount of provided rewards for each state-action pair $(s, a)$. In [knox_reinforcement_2012], different discount factors for $\hat{H}$ were compared, and it was shown that setting the discount factor to zero was better suited, which amounts to considering $\hat{H}$ more as an action value function than as a reward function. (The authors proposed another mechanism for handling temporal credit assignment, in order to alleviate the effect of highly dynamical tasks [knox_interactively_2009]. In their system, human-generated rewards were distributed backward to previously performed actions within a fixed time window.) This myopic discounting strategy was also employed in [najar_training_2016], where the head nods and shakes of a human teacher were converted into binary values and used for updating a robot’s action values.
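The myopic use of human reward described above can be sketched with a tabular model that regresses towards the observed feedback and is then used directly as an action value. This is a simplified stand-in written for illustration: the class name, the tabular representation, and the incremental update are assumptions, whereas TAMER itself relies on a learned regression model.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple


class HumanReinforcementModel:
    """Tabular stand-in for a learned model of human reward, H_hat(s, a)."""

    def __init__(self, learning_rate: float = 0.5) -> None:
        self.learning_rate = learning_rate
        self.h_hat: Dict[Tuple[Hashable, Hashable], float] = defaultdict(float)

    def update(self, state: Hashable, action: Hashable, human_reward: float) -> None:
        """Move H_hat(s, a) towards the observed human reward (incremental regression)."""
        key = (state, action)
        self.h_hat[key] += self.learning_rate * (human_reward - self.h_hat[key])

    def greedy_action(self, state: Hashable, actions: Sequence[Hashable]) -> Hashable:
        """Myopic choice: with a zero discount factor, H_hat acts as an action value."""
        return max(actions, key=lambda a: self.h_hat[(state, a)])


# Toy usage with binary feedback, e.g. head shakes (-1) and nods (+1).
model = HumanReinforcementModel(learning_rate=0.5)
model.update("s0", "left", -1.0)
model.update("s0", "right", +1.0)
model.update("s0", "right", +1.0)
choice = model.greedy_action("s0", ["left", "right"])
```

No bootstrapping from successor states appears in the update, which is what distinguishes this treatment from feeding the same signal into a standard temporal-difference rule as a reward.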
In a different approach, evaluative feedback is not converted into a numerical value, but treated as a categorical piece of information that is directly used for deriving a policy, within a Bayesian framework [lopes_simultaneous_2011, griffith_policy_2013, loftin_learning_2016]. This approach is similar to the latter in that evaluative feedback only informs about the last performed action [ho_teaching_2015].
2.2 Learning from evaluative feedback
There exist different methods for deriving a policy from evaluative feedback that mostly depend on the adopted representation of the feedback (e.g., numerical values vs. categorical information). In the literature, we find different terminologies for qualifying the policy derivation methods, such as reward shaping [tenorio-gonzalez_dynamic_2010], interactive shaping [knox_interactively_2009] and policy shaping [griffith_policy_2013, cederborg_policy_2015]. In some works, the term shaping is not even adopted [loftin_learning_2016]. In this survey we consider the term shaping in its general meaning as influencing a learning system towards a desired behaviour. In this sense, all methods deriving a policy from evaluative feedback are considered as shaping methods. Here we propose a categorization of policy derivation methods under the scope of shaping.
To do so, we need to distinguish two cases. In the first case, evaluative feedback is combined with a model of the task that is derived from another source of information, such as a reward function. According to how it is represented, it can be used for biasing either the reward function [thomaz_reinforcement_2006-1], the value function [knox_reinforcement_2012-1], or the policy [griffith_policy_2013]. So, here the term shaping can be used for qualifying the combination method. In the second case, evaluative feedback constitutes the only available source of information [loftin_strategy-aware_2014]. However, we can consider this as a special case of the former situation in which the other source (e.g. reward function) provides no information. For example, we can consider a null reward function, a null value function, or a uniform policy. So, even when evaluative feedback is not combined with another source of information, we can still talk about a shaping method.
Thus, we can divide the literature on evaluative feedback into three groups: reward shaping, value shaping and policy shaping methods. Figure 3 and Table 3 summarize the different possibilities for shaping with evaluative feedback, depending on the adopted representation.
|Model-free reward shaping (1)||[isbell_social_2001, thomaz_reinforcement_2006-1, tenorio-gonzalez_dynamic_2010, mathewson_simultaneous_2016]|
|Model-based reward shaping (2)||[knox_combining_2010, knox_augmenting_2011, knox_reinforcement_2012-1]|
|Model-based value shaping (3)||[knox_combining_2010, knox_augmenting_2011, knox_reinforcement_2012-1]|
|Model-free value shaping (4)||[dorigo_robot_1994, colombetti_behavior_1996, najar_training_2016]|
|Model-free policy shaping (5)||[ho_teaching_2015, macglashan2017interactive, najar_interactively_20]|
|Model-based policy shaping (6)||[knox_combining_2010, knox_augmenting_2011, knox_reinforcement_2012-1, lopes_simultaneous_2011, griffith_policy_2013, loftin_learning_2016]|
After converting evaluative feedback into a numerical value, it can be considered as a delayed reward, just like MDP rewards, and used for computing a value function through temporal credit assignment [isbell_social_2001, thomaz_reinforcement_2006-1, tenorio-gonzalez_dynamic_2010, mathewson_simultaneous_2016]. This means that the effect of the provided feedback extends beyond the last performed action. When the robot also has access to a predefined reward function $R$, a new reward function $R'$ is computed by summing both forms of reward: $R' = R + R_h$, where $R_h$ is the human-delivered reward.
Knox and Stone [knox_combining_2010, knox_augmenting_2011, knox_reinforcement_2012-1] proposed eight different shaping methods for combining the human reinforcement function $\hat{H}$ with a predefined MDP reward function $R$. One of them, Reward Shaping, generalizes the reward shaping method by introducing a decaying weight factor $\beta$ that controls the contribution of $\hat{H}$ over $R$:
$$R'(s, a) = R(s, a) + \beta \, \hat{H}(s, a)$$
Although reward shaping has been effective in many domains [isbell_social_2001, thomaz_reinforcement_2006-1, tenorio-gonzalez_dynamic_2010, mathewson_simultaneous_2016], this way of providing intermediate rewards does not fit into the definition of potential-based reward shaping [ng_policy_1999, wiewiora_potential-based_2003]. Consequently, it has been shown that it can cause sub-optimal behaviours such as positive circuits [knox_reinforcement_2012, ho_teaching_2015].
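For reference, the potential-based condition of [ng_policy_1999] can be written as follows. The symbols $F$ (shaping reward), $\Phi$ (a potential function over states) and $\gamma$ (the discount factor) follow the usual convention for this result and are introduced here only for illustration.

```latex
F(s, a, s') = \gamma \, \Phi(s') - \Phi(s),
\qquad
R'(s, a, s') = R(s, a, s') + F(s, a, s')
```

Along any trajectory $s_0, \dots, s_T$, the discounted shaping terms telescope to $\gamma^{T} \Phi(s_T) - \Phi(s_0)$, so the shaping reward collected depends only on the start and end states and a loop cannot be exploited to accumulate it. Human-delivered rewards are not constrained to this form, which is why positive circuits can appear.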
Value shaping consists in considering evaluative feedback as an action-preference function. The numerical representation of evaluative feedback is used for modifying the Q-function rather than the reward function. Q-Augmentation [knox_combining_2010, knox_augmenting_2011, knox_reinforcement_2012-1] uses the human reinforcement function $\hat{H}$, weighted by a factor $\beta$, for augmenting the MDP Q-function:
$$Q'(s, a) = Q(s, a) + \beta \, \hat{H}(s, a)$$
When comparing different shaping methods, Knox and Stone observed that “the more a technique directly affects action selection, the better it does, and the more it affects the update to the Q function for each transition experience, the worse it does” [knox_reinforcement_2012-1]. In fact, this can be explained by the specificity of the Q-function with respect to other preference functions. Unlike other preference functions (e.g. the Advantage function [harmon_advantage_1994]), a Q-function also informs about the proximity to the goal. Evaluative feedback, however, informs about local preferences without necessarily including such information [ho_teaching_2015]. So, augmenting a Q-function with evaluative feedback may lead to convergence problems.
In policy shaping, evaluative feedback is used for biasing the MDP policy, without interfering with the value function. Action Biasing [knox_combining_2010, knox_augmenting_2011, knox_reinforcement_2012-1] uses the same equation as Q-Augmentation but only in decision-making, so that the agent’s Q-function is not modified:
$$a = \arg\max_{a'} \left[ Q(s, a') + \beta \, \hat{H}(s, a') \right]$$
Control Sharing [knox_combining_2010, knox_augmenting_2011, knox_reinforcement_2012-1] arbitrates between the decisions of both evaluation sources based on a probability criterion. A parameter $\beta$ is used as a threshold for determining the probability of selecting the decision according to $\hat{H}$:
$$P\left(a = \arg\max_{a'} \hat{H}(s, a')\right) = \min(\beta, 1)$$
Otherwise, the decision is made according to the MDP policy.
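The four combination techniques discussed in this section (Reward Shaping, Q-Augmentation, Action Biasing and Control Sharing) can be contrasted in a few lines. The sketch below assumes tabular values for a single state, a fixed weight `beta` (in the original work it decays over time), and a greedy MDP policy; the function names and toy numbers are illustrative only.

```python
import random
from typing import Dict, Optional

Values = Dict[str, float]


def reward_shaping(r: float, h_hat: float, beta: float) -> float:
    """R'(s, a) = R(s, a) + beta * H_hat(s, a)."""
    return r + beta * h_hat


def q_augmentation(q: Values, h_hat: Values, beta: float) -> Values:
    """Q'(s, a) = Q(s, a) + beta * H_hat(s, a); the result is used in the learning update."""
    return {a: q[a] + beta * h_hat[a] for a in q}


def action_biasing(q: Values, h_hat: Values, beta: float) -> str:
    """Pick argmax_a [Q(s, a) + beta * H_hat(s, a)] without altering Q itself."""
    return max(q, key=lambda a: q[a] + beta * h_hat[a])


def control_sharing(q: Values, h_hat: Values, beta: float,
                    rng: Optional[random.Random] = None) -> str:
    """With probability min(beta, 1) follow argmax H_hat, otherwise the (greedy) MDP policy."""
    rng = rng or random.Random()
    if rng.random() < min(beta, 1.0):
        return max(h_hat, key=h_hat.get)
    return max(q, key=q.get)


# Toy usage: the task model prefers "left", the teacher prefers "right".
q = {"left": 1.0, "right": 0.8}
h_hat = {"left": -0.5, "right": 0.5}
biased_choice = action_biasing(q, h_hat, beta=1.0)
shared_choice = control_sharing(q, h_hat, beta=1.0)
```

The first two functions change what the agent learns, whereas the last two only change what it does, which is the distinction drawn between reward or value shaping and policy shaping.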
Other policy shaping methods do not convert evaluative feedback into a scalar but into categorical information [loftin_learning_2016]. The distribution of provided feedback is used within a Bayesian framework in order to derive a policy. Griffith et al. [griffith_policy_2013] proposed a policy shaping method that outperformed Action Biasing, Control Sharing and Reward Shaping. After inferring the teacher’s policy from the feedback distribution, it computed the Bayes optimal combination with the MDP policy by multiplying both probability distributions.
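The multiplication of the two distributions can be illustrated as follows. This is a minimal sketch under stated assumptions: the teacher's policy is estimated from the difference between positive and negative feedback counts using an assumed feedback-consistency parameter `consistency` (strictly between 0 and 1), and the names `feedback_policy` and `combine` are hypothetical.

```python
def feedback_policy(delta, consistency):
    """Estimate, for each action, the probability that it is optimal.

    delta maps each action to (#positive - #negative) feedback received
    in the current state; consistency is assumed to lie in (0, 1).
    """
    probs = {}
    for action, d in delta.items():
        agree = consistency ** d
        disagree = (1.0 - consistency) ** d
        probs[action] = agree / (agree + disagree)
    return probs

def combine(pi_mdp, pi_feedback):
    """Multiply both distributions action-wise and renormalize."""
    product = {a: pi_mdp[a] * pi_feedback[a] for a in pi_mdp}
    total = sum(product.values())
    if total == 0.0:
        # Degenerate case: fall back on the MDP policy alone.
        return dict(pi_mdp)
    return {a: p / total for a, p in product.items()}
```

With no feedback on an action (`d = 0`), its estimated probability is 0.5, so the product leaves the relative preferences of the MDP policy unchanged for that action.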
It should be noted that in the aforementioned policy shaping methods, evaluative feedback was used only for biasing the MDP policy at decision-time, while reward and value shaping methods modified the task model. More recent policy shaping methods take a hybrid approach, where policy shaping is performed by modifying the agent’s policy. In [macglashan2017interactive] and [najar_interactively_20], evaluative feedback was used for updating the actor of an Actor-Critic architecture, without interfering with the value function. In [macglashan2017interactive] the update term was scaled by the gradient of the policy; whereas in [najar_interactively_20] the authors did not consider a multiplying factor for evaluative feedback.
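The difference between the two actor updates can be illustrated with a tabular softmax actor. This is a hedged sketch rather than a reproduction of either method: we assume that the actor stores one preference per action in the current state, that feedback is a scalar, and that the gradient-scaled variant uses the gradient of the log-policy with respect to the preferences. The second function is only one possible reading of an update without a multiplying factor. All names are hypothetical.

```python
import math

def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    top = max(prefs.values())
    exps = {a: math.exp(p - top) for a, p in prefs.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

def gradient_scaled_update(prefs, action, feedback, lr=0.1):
    """Feedback scaled by the gradient of log pi(action) w.r.t. preferences.

    For a softmax actor this gradient is (1 - pi[a]) for the chosen action
    and (-pi[a]) for every other action.
    """
    pi = softmax(prefs)
    return {a: p + lr * feedback * ((1.0 if a == action else 0.0) - pi[a])
            for a, p in prefs.items()}

def direct_update(prefs, action, feedback, lr=0.1):
    """Feedback added to the preference of the performed action only."""
    updated = dict(prefs)
    updated[action] += lr * feedback
    return updated
```

In both variants only the actor's preferences change; no value function is read or written, which is what distinguishes these updates from value shaping.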
Overall, policy shaping methods show better performance compared to other shaping methods [knox_reinforcement_2012-1, griffith_policy_2013, ho_teaching_2015]. In addition to performance, another advantage of policy shaping is that it is applicable to a wider range of methods that directly derive a policy, without computing a value function or even using rewards.
2.3 Interpreting evaluative feedback
As with instructions, some works proposed to interpret the meaning of evaluative feedback signals, in order to give more possibilities to the teacher for employing her own preferred signals.
In [kaplan_robotic_2002], a robot had the capacity to learn new stimuli as secondary reinforcers, by associating them with primary reinforcers through the clicker training method. In [kim_learning_2007], a binary classification of prosodic features was performed offline, before using it as a reward signal for task learning. In [lopes_simultaneous_2011], a predefined set of known feedback signals, both evaluative and corrective, was used for interpreting additional signals within an Inverse Reinforcement Learning framework [ng_algorithms_2000]. In [grizou_robot_2013], a robot learned to interpret evaluative feedback, while inferring the task using an EM algorithm. The robot knew the set of possible tasks, and was endowed with a planning algorithm allowing it to derive a policy for each possible task. This model was used for interpreting EEG-based evaluative feedback signals [grizou_interactive_2014]. Finally, some papers addressed the question of interpreting the teacher’s silence, which was referred to as implicit feedback [loftin_learning_2016].
Learning from Demonstration (LfD), also known as Programming by Demonstration (PbD) or Imitation Learning, appeared in the 1980s as a method for programming industrial robots [lozano-perez_robot_1983]. It has since given rise to numerous works in robotics aiming at developing intuitive teaching methods for non-expert users [billard_robot_2008, argall_survey_2009, chernova_robot_2014]. The main idea of LfD is to teach by example. The human teacher provides a set of examples of task executions, from which the robot must infer a model of the task. Each example is encoded as a sequence of state-action pairs.
3.1 Providing demonstrations
There exist different ways for a human teacher to demonstrate a task to a robot. The most natural way is to execute the task by herself, while the robot is observing. Demonstrations can be observed either through external sensors like a camera [atkeson_learning_1997], or by attaching sensors to the teacher’s body [calinon_incremental_2007]. This mode of demonstration, called imitation setting, is challenging as it requires mapping the perceived examples from the teacher into an internal representation that is directly exploitable by the robot. This issue is commonly referred to as the correspondence problem [nehaniv_correspondence_2002].
Practical solutions to circumvent this problem exist. Early works about LfD in industrial settings used teleoperation, where the demonstrator executed the task by directly controlling the robot’s actuators [lozano-perez_robot_1983]. This way, states and actions that need to be reproduced were directly experienced by the robot. More recent works use sophisticated teleoperation devices, including joysticks [abbeel_autonomous_2010], data gloves [dillmann_teaching_2004] and teleoperation suits [lieberman_improvements_2004], along with other methods for controlling the robot’s joints, such as kinesthetic teaching [akgun_trajectories_2012]. Shadowing provides an alternative approach to teleoperation that consists in controlling the robot indirectly through preprogrammed behaviours. For example, the robot can be endowed with a following behaviour, so it can be guided by the human while directly experiencing world state transitions [hayes_robot_1994, nicolescu_learning_2001, rybski_interactive_2007].
3.2 Learning from demonstrations
To be able to execute the task, the robot needs to build a model that generalizes over all provided task executions. The main challenge is to identify exactly what should be kept from the provided examples. This question involves many aspects, such as feature selection and dealing with sub-optimal demonstrations [Mueller2018].
Another important aspect is to identify sub-goals or important keyframes from the demonstrated trajectories [akgun_trajectories_2012]. For instance, we can distinguish two types of imitation: mimicry and goal emulation. Mimicry consists in replicating the trajectories of the demonstrator. Goal emulation, on the other hand, consists in reproducing the effects of the demonstrations by one’s own means. The difference between the two imitation modes has been defined in terms of granularity of the task [nehaniv_correspondence_2002]. Different degrees of granularity can be defined, in order to take into account intermediate effects or sub-goals.
Mimicry has been used since early works in industrial contexts, where the goal was to “play back” recorded trajectories. The earliest methods were based on symbolic reasoning, where each demonstration was discretized then transformed into a first-order logic representation [dufay_approach_1984]. More recent approaches rely on supervised learning methods that learn a mapping from states to actions [saunders_teaching_2006, chernova_interactive_2009, inamura_acquisition_1999].
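As an illustration of learning a mapping from states to actions, the following sketch uses a 1-nearest-neighbour classifier over demonstrated state-action pairs. The choice of classifier is ours and purely illustrative; it is not the method of any of the cited works. States are assumed to be numeric feature vectors, and the function names are hypothetical.

```python
def build_dataset(demonstrations):
    """Flatten demonstrations (sequences of state-action pairs) into one list."""
    return [(tuple(state), action)
            for demo in demonstrations
            for state, action in demo]

def predict_action(dataset, state):
    """Return the action demonstrated in the closest recorded state."""
    def squared_distance(recorded):
        return sum((x - y) ** 2 for x, y in zip(recorded, state))
    _, action = min(dataset, key=lambda pair: squared_distance(pair[0]))
    return action

# Two short demonstrations in a hypothetical 2-D state space.
demos = [
    [((0.0, 0.0), "forward"), ((1.0, 0.0), "left")],
    [((0.0, 1.0), "right")],
]
dataset = build_dataset(demos)
chosen = predict_action(dataset, (0.9, 0.1))  # closest to (1.0, 0.0)
```

Such a policy reproduces the demonstrated behaviour without any notion of the teacher's goal, which is what separates mimicry from goal emulation.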
Goal emulation techniques infer the teacher’s intention from the provided demonstrations. For example, Inverse Reinforcement Learning (IRL) consists in inferring the goal of the demonstrated task in the form of a reward function [ng_algorithms_2000]. The inferred reward function then enables the robot to derive a policy, through planning [abbeel_apprenticeship_2004, abbeel_autonomous_2010].
3.3 Interpreting demonstrations
In one respect, we can consider goal emulation techniques, such as IRL, as interpretation methods in that they infer the teacher’s intention from the observed behaviour. However, for these methods to work, they need appropriate state-action labels, expressed within the robot’s own referential. In other words, they need to overcome the correspondence problem. So, there is another level of interpretation of demonstrations that concerns the understanding of the observed states and actions, i.e. the mapping between the teacher’s states and actions and the robot’s states and actions [argall_survey_2009].
The resolution of the correspondence problem is still an open research question. However, the neuroscience literature has provided us with many insights about the mechanisms involved in recognizing others’ actions [rizzolatti2001neurophysiological]. For instance, it has been established that imitation is triggered by automatic activation of existing motor representations [brass_imitation:_2005].
These ideas inspired some solutions for the correspondence problem in robotics. For example, Alissandrakis et al. [alissandrakis_solving_2003] proposed a generic framework, ALICE, that builds a correspondence library, by comparing observations to generated behaviours through a predefined similarity measure. In a similar approach, Demiris and Hayes [demiris_imitation_2002] used a combination of inverse and forward models for both control and action recognition. Notably, these solutions reflect the importance of subjective experience in the interpretation process, that is the comparison between one’s own behaviour and the observed one.
In this section, we first compare the benefits and the limitations of the mentioned interactive learning methods. Then, we highlight their similarities.
4.1 Comparing different learning methods
When designing an interactive learning method, one may ask which one is better suited [suay_practical_2012]. The methods we presented so far rely on a wide variety of teaching signals and interaction modalities with the learning system.
In this survey, we categorized interactive learning methods according to the characteristics of teaching signals. These signals differ in how they are represented and in how they are integrated into the learning process. Particularly, each teaching signal requires a different level of involvement from the human teacher and provides a different level of control over the learning process. Some of them provide poor information about the policy, so the learning process relies mostly on autonomous exploration. Others are more informative regarding the policy, so the learning process mainly depends on the human teacher.
This aspect has been described by Breazeal and Thomaz as the guidance-exploration spectrum [breazeal_learning_2008]. In Section 1, we presented guidance as a special type of advice. So, in order to avoid confusion about the term guidance, we will use the term exploration-control spectrum instead of guidance-exploration (Fig. 4).
At one end of the exploration-control spectrum, we have autonomous learning methods that consist in endowing the learning agent with the capacity to evaluate its behaviour through predefined performance criteria. This can be done by implementing, for example, a reward function or a fitness function. These functions constitute evaluation sources that allow the agent to optimize its behaviour by trial-and-error, relying only on autonomous exploration, without requiring the help of a supervisor. However, a common issue in autonomous exploration is to find a suitable trade-off between exploration and exploitation. In fact, at no point of the learning process does the agent know whether its behaviour is the optimal one, or if it can still improve it. Consequently, it faces the dilemma of keeping the behaviour that it has already acquired, or exploring new ones. Systematic exploitation may lead the agent to sub-optimal behaviours, while exploration may be problematic in real-world applications.
Evaluative feedback constitutes another evaluation criterion that has many advantages over reward and fitness functions. First, like all interactive learning methods, it alleviates the limitations of autonomous learning, i.e. slow convergence and unsafe exploration. Whether it is represented as categorical information [griffith_policy_2013] or as immediate rewards [dorigo_robot_1994], it provides a more straightforward evaluation of the policy, as it directly informs about the optimality of the performed action [ho_teaching_2015].
Second, from an engineering point of view, evaluative feedback is generally easier to implement than a reward or a fitness function. For instance it is often hard, especially in complex environments, to design a priori an evaluation function that could anticipate all aspects of a task, and to take into account several criteria at once, such as risk and performance [kober_reinforcement_2013]. By contrast, evaluative feedback generally takes the form of binary values that can be easily implemented [knox_training_2013].
However, the informativeness of evaluative feedback is still limited, as it is only a reaction to the agent’s actions and does not communicate the optimal one. So, the agent still needs to explore different actions by trial-and-error, as in the autonomous learning setting. The main difference is that exploration is no longer required once the robot tries the optimal action and receives positive feedback. So, the trade-off between exploration and exploitation is less tricky to address than in autonomous learning.
The limitation in the informativeness of evaluative feedback can lead to poor performance. In fact, when it is the only available communicative channel, people tend to use it also as a form of guidance, in order to inform the agent about future actions [thomaz_reinforcement_2006-1]. This violates the assumption about how evaluative feedback should be used, which affects learning performance. Performance significantly improves when teachers are provided with an additional communicative channel for guidance [thomaz_reinforcement_2006]. This reflects the limitations of evaluative feedback and demonstrates that human teachers also need to provide guidance.
One possibility for improving the feedback channel is to allow for corrections and refinements [thomaz_asymmetric_2007]. Corrective feedback improves the informativeness of evaluative feedback, by allowing the teacher to inform the robot about the optimal action [celemin_coach:_2019]. Being also reactive to the robot’s actions, it still requires exploration. However, it removes the need to wait until the robot tries the correct action on its own.
On the other hand, corrective feedback requires more engineering effort than evaluative feedback, as it generally carries more than binary information. As it operates over the action space, it requires encoding the mapping between feedback signals and their corresponding actions. In this respect, it is homogeneous to contextual instructions, as both operate on the same space.
An even more informative form of corrective feedback is provided by corrective demonstrations, which extend beyond correcting one single action to correcting a whole sequence of actions [chernova_interactive_2009]. Corrective demonstrations operate on the same space as demonstrations, which requires more engineering than contextual instructions but also provides more control over the learning process.
The experiments of Thomaz and Breazeal have shown that human teachers want to provide guidance [thomaz_reinforcement_2006]. In contrast to feedback, guidance allows the agent to be informed about future aspects of the task, such as the next action to perform (instruction) [cruz_interactive_2015], an interesting region to explore (demonstration) [subramanian_exploration_2016] or a set of interesting actions to try [thomaz_reinforcement_2006].
However, the control over the learning process is exerted indirectly. By performing the communicated guidance, the robot does not directly integrate this information as being the optimal behaviour. Instead, it will be able to learn only through the experienced effects, for example by receiving a reward.
With respect to guidance, instructions inform more directly about the optimal policy, in two main aspects. First, instructions are a special case of guidance where the teacher communicates only the optimal action. Second, the information about the optimal action can be integrated into the learning process more directly than with pure guidance. The difference can be better explained in terms of operationalization. When the teacher tells the robot to perform an action a in state s as an instruction, learning can be done by integrating into the policy the information that “in state s the optimal action is a”. However, with guidance, operationalization still requires the evaluation of the performed action. So, guidance is only about limiting exploration, without providing full control over the learning process.
In Section 1, we presented two ways for providing instructions: providing general instructions in the form of if-then rules, or interactively providing contextual (low-level or high-level) instructions as the agent progresses in the task. The advantage of general instructions is that they do not depend on the dynamics of the task, so they can be provided at any time. This puts less interactive load on the teacher in that he/she is not required to stay concentrated in order to provide the correct information at the right moment. However, they present some drawbacks. First, they can be difficult to formulate. The teacher needs to gain insight about the task and the environment dynamics, in order to take into account different situations in advance and to formulate relevant rules [kuhlmann_guiding_2004]. Furthermore, they require knowledge of the robot’s sensors and effectors in order to correctly express conditions and actions. So, formulating rules requires expertise about the task, the environment and the robot. Second, general instructions are difficult to communicate. They require either expert programming skills or sophisticated natural language communication.
Contextual instructions, on the other hand, communicate a less sophisticated message at a time, which makes them easier to formulate and to provide. They only inform about the next action to perform, without expressing the condition, which can be inferred by the agent from the current task state. However, this makes them more prone to ambiguity. For instance, writing general instructions by hand allows the teacher to specify the features that are relevant to the application of each rule, i.e., to control generalization. With contextual instructions, however, generalization has to be inferred by the agent from the context.
Finally, interactively providing instructions makes it easy for the teacher to adapt to changes in the environment’s dynamics. However, this can be difficult to do in highly dynamical tasks, as the teacher needs a lapse of time to communicate each instruction.
Demonstrations are on the control end of the spectrum. They provide more control to the teacher over the learning process. In contrast with instructions, they inform about more than one single action, by communicating sequences of state-action pairs. However, providing such control requires overcoming the correspondence problem. This is generally addressed through teleoperation or kinesthetic teaching. These solutions overcome the correspondence problem in two ways. First, the state mapping is avoided as the robot experiences its own states. Second, the mapping of actions is made by controlling the robot joints, either through an interface, or by exerting forces on the robot’s body. This can be seen as sending a continuous stream of instructions: the commands sent via the joystick or the forces exerted on the robot’s kinesthetic device. So, we can consider demonstrations as a sequence of contextual instructions (this does not hold for the imitation setting, where the mapping between observed and experienced states is not given).
However, there still exist some differences between demonstrations and instructions. First, teleoperated or kinesthetic demonstrations provide more control, not only over the learning process, but also over task execution. When providing demonstrations, the teacher controls the robot joints, so the communicated instruction streams are systematically executed. With instructions, however, the robot is in control of its own actions. The teacher only communicates the action to perform, and the robot can decide whether to execute it or not.
Second, demonstrations involve more human load than instructions. Demonstrations require the teacher to be active in executing the task, while instructions involve only communication. This aspect confers some advantages to instructions in that they offer more possibilities in terms of interaction. Instructions can be provided with different modalities such as speech or gesture, and by using a wider variety of words or signals. Demonstrations, however, are constrained by the control interface. Moreover, demonstrations require continuous focus in providing complete trajectories, while instructions can be sporadic.
Therefore, instructions can be better suited in situations where demonstrations can be difficult to provide. For example, people with limited autonomy may be unable to demonstrate a task by themselves, or to control a robot’s joints. In these situations, communication is more convenient.
On the other hand, demonstrations are better suited to highly dynamical tasks and continuous environments, since instructions require some time to be communicated.
4.2 Toward a unified view
Overall, all interactive learning methods overcome the limitations of autonomous learning, by providing more control over the agent’s policy. However, more control means more interaction load. So, the autonomy of the learning process is important for minimizing the burden on the human teacher. A central question in the interactive learning literature is how to combine different learning modalities in order to take advantage of each one of them. So, often, different methods are combined within a single framework.
For example, RL can be augmented with evaluative feedback [judah_reinforcement_2010, sridharan_augmented_2011, knox_reinforcement_2012-1], corrective feedback [celemin_reinforcement_19], instructions [maclin_creating_1996, kuhlmann_guiding_2004, rosenstein_supervised_2004, pradyot_integrating_2012], instructions and evaluative feedback [najar_interactively_20], demonstrations [taylor_integrating_2011, subramanian_exploration_2016], demonstrations and evaluative feedback [leon_teaching_2011], or demonstrations, evaluative feedback and instructions [tenorio-gonzalez_dynamic_2010]. Demonstrations can be augmented with corrective feedback [chernova_interactive_2009, argall_teacher_2011], instructions [rybski_interactive_2007], instructions and feedback, both evaluative and corrective [nicolescu_natural_2003], or with prior Reinforcement Learning [syed_imitation_2007].
Integrating different teaching signals into one single and unified formalism remains an active research question. In this survey, we extracted several aspects that were shared across different approaches. We can see that the same overall process applies regardless of which specific teaching signals are in use (Fig. 5). For instance, whether we deal with contextual instructions or evaluative feedback, we need to go through the same overall process and ask the same questions about the computational implementation of these teaching signals. First, we need to think about how these signals must be encoded and whether or not their meaning will be hand-coded or interpreted by the learning agent. Second, we need to decide whether we should aggregate teaching signals into a “Teacher Model”, or directly use them for influencing the learning process (model-based vs. model-free interactive learning). Finally, we need to choose a specific computational mechanism through which teaching signals (or their aggregated model) should influence the learning process, as various shaping strategies can be considered: reward shaping, value shaping or policy shaping.
From this perspective, all shaping methods that were specifically designed for evaluative feedback could be used for instructions, and vice-versa. For example, all the methods proposed by Knox and Stone for learning from evaluative feedback [knox_combining_2010, knox_augmenting_2011, knox_reinforcement_2012-1], can be recycled for learning from instructions. Similarly, the confidence criterion used in [pradyot_integrating_2012] for learning from instructions can serve as another Control Sharing mechanism, similar to the one proposed in [knox_combining_2010, knox_augmenting_2011, knox_reinforcement_2012-1] for learning from evaluative feedback. Finally, we also note that the policy shaping method proposed in [griffith_policy_2013] is mathematically equivalent to the Boltzmann Multiplication reported in [wiering_ensemble_2008] as an ensemble method for combining multiple policies. Although ensemble methods have not been proposed for this purpose, they could also be used for policy shaping with both feedback and instructions [najar2017shaping].
Common aspects between advice and demonstrations:
Until recently, advice and demonstrations have been mainly considered as two complementary but distinct approaches, i.e., communication vs. action. However, these two approaches share many common aspects that are illustrated by the five-step advice-taking process [hayes-roth_knowledge_1980, hayes-roth_advice_1981]. The first step, requesting or receiving the advice, deals with transparency and active learning issues that are common to both advice and demonstration settings [utgoff_two_1991, thomaz_transparency_2006, chernova_confidence-based_2007, BroekensTAFFC]. The second step, converting the advice into an internal representation, also applies to demonstrations. In this case, it refers to the correspondence problem. With advice, we also have a correspondence problem that consists in interpreting the teaching signals, whether feedback or instructions. So, we can consider a more general correspondence problem, that is not specific to learning from demonstrations, and that consists in interpreting the perceived teaching signals, independently of their nature. The third, fourth and fifth steps, operationalization, integration and refinement, correspond to policy derivation. These three steps can sometimes be conflated, and we can group them into one more general step called shaping, i.e., using the interpreted teaching signal for biasing the behaviour, whether the teaching signal is aggregated into a separate model or directly integrated into the agent’s policy. These steps are also common to both advice and demonstrations.
General correspondence problem:
So far, the correspondence problem has been mainly addressed within the community of learning by imitation. Imitation is a special type of social learning in which the robot reproduces what it perceives. So, there is an assumption that what is seen has to be reproduced. Advice is different from imitation in that the robot has to reproduce what is communicated by the advice and not what is perceived. For instance, saying “turn left” requires the robot to perform the action of turning left, not to reproduce the sentence “turn left”.
However, evidence from neuroscience gave rise to a new understanding of the emergence of human language as a sophistication of imitation throughout evolution [adornetti_pragmatic_2015]. In this view, language is grounded in action, just like imitation [corballis_mirror_2010]. For example, there is evidence that the mirror neurons of monkeys also fire to the sounds of certain actions, such as the tearing of paper or the cracking of nuts [kohler_hearing_2002], and that spoken phrases about movements of the foot and the hand activate the corresponding mirror-neuron regions of the pre-motor cortex in humans [aziz-zadeh_congruent_2006].
So, one challenging question is whether we could unify the problem of interpreting any kind of teaching signal under the scope of one general correspondence problem. This is a relatively new research question, and few attempts have been made in this direction. For example, Cederborg and Oudeyer [cederborg_social_2014] proposed a unified theoretical framework for learning from different sources of information. Their main idea is to relax the assumptions about the meaning of teaching signals, by taking advantage of the coherence between the different information sources.
Finally, when comparing demonstrations with instructions, we mentioned that demonstrations could be considered as a way of providing continuous streams of instructions, with the subtle difference that demonstrations are systematically executed by the robot. Considering this analogy, the growing literature about interpreting instructions [branavan_reading_2010, vogel_learning_2010, grizou_robot_2013, najar_interactively_20] could provide insights for designing new ways of solving the correspondence problem in imitation.
In this paper, we provided an overview of the existing methods for integrating advice into a Reinforcement Learning process. We proposed a taxonomy of different types of teaching signals, and described them according to three main aspects: how they can be provided to the learning agent, how they can be integrated into the learning process, and how they can be interpreted by the agent if their meaning is not determined beforehand. Finally, we compared the benefits and limitations of using each type of teaching signal and proposed a unified view of interactive learning methods.
The computational questions covered in this survey extend beyond the boundaries of Artificial Intelligence, as similar research questions regarding the computational implementation of social learning strategies are also raised in the field of Cognitive Neuroscience [biele2011neural, najar_imitation_19, olsson2020neural]. Thus we think this survey can be of interest for both communities.
This work was supported by the Romeo2 project.