A Conceptual Framework for Externally-influenced Agents: An Assisted Reinforcement Learning Review

07/03/2020, by Adam Bignold et al. (University of Alberta, Deakin University, Federation University Australia)

A long-term goal of reinforcement learning agents is to be able to perform tasks in complex real-world scenarios. The use of external information is one way of scaling agents to more complex problems. However, there is a general lack of collaboration or interoperability between different approaches using external information. In this work, we propose a conceptual framework and taxonomy for assisted reinforcement learning, aimed at fostering such collaboration by classifying and comparing various methods that use external information in the learning process. The proposed taxonomy details the relationship between the external information source and the learner agent, highlighting the process of information decomposition, structure, retention, and how it can be used to influence agent learning. As well as reviewing state-of-the-art methods, we identify current streams of reinforcement learning that use external information in order to improve the agent's performance and its decision-making process. These include heuristic reinforcement learning, interactive reinforcement learning, learning from demonstration, transfer learning, and learning from multiple sources, among others. These streams of reinforcement learning operate with the shared objective of scaffolding the learner agent. Lastly, we discuss further possibilities for future work in the field of assisted reinforcement learning systems.







1 Introduction

Reinforcement learning (RL) [124] is a learning approach in which an agent interacts with its environment through sequential decisions, trying to find a (near-) optimal policy for an intended task. RL agents have the ability to improve while operating, to learn without supervision, and to adapt to changing circumstances [65]. By exploring, a standard agent learns solely from the signals it receives from the environment. The RL approach has shown success in domains such as robotics [73, 81], game-playing [136, 12], and inventory management [57], among others.

Like many machine learning techniques, RL faces the problem of high-dimensional spaces. As environments become larger, the agent's learning time increases and finding the optimal solution becomes impractical [22]. Early research on this topic [65, 85] argued that for RL to successfully scale to real-world scenarios, the use of information external to the environment would be needed. Different RL strategies using this approach have emerged in order to speed up the learning process. They use external information to assist the process of generalising the environment representation [107], the agent's decision-making process [59], or to provide more focused exploration [55].

In this article, we refer to external information as any kind of information provided to the agent originating from outside of the agent’s representation of the environment. This may include demonstrations [83, 110, 25], advice and critiques [78, 59], initial bias based on previously gathered data [130], or highly-detailed domain-specific shaping functions [108].

Assisted reinforcement learning (ARL) encompasses a range of techniques that use external information, either before, during, or after training, to improve the performance of the learner agent, as well as to scale RL to larger and more complex scenarios. While a relevant characteristic of RL is its ability to endow agents with new skills from the ground up, ARL also makes use of existing information and/or previously learned behaviour. Some methods for improving the agent's performance using external information include: directly altering weights for actions and states (biasing) [144]; altering the state or action space [53]; critiquing past or advising on future decision-making [138]; dynamically altering reward functions [78]; directly modifying the policy [59]; guiding exploration and action selection [55]; and creating information repositories/models to supplement the environmental information [107]. Figure 1 captures all of these methods in a basic view of the ARL conceptual framework used in this work. The classic RL approach is shown within the figure, where an agent performs an action on the environment, reaching a new state and obtaining a reward. In ARL, the response of the environment is also shared with the external information source, from which advice is given to the agent or, in some cases, changes are made directly to the environment [147].

Figure 1: Assisted reinforcement learning simplified framework. In autonomous reinforcement learning, an agent performs an action from a state and the environment responds, leading the agent to a new state and providing a reward. Assisted reinforcement learning adds an external information source, referred to as a trainer, teacher, advisor, or assistant, that observes the environment and the agent in order to generate advice. The trainer may advise the learner agent or, in some cases, directly modify the environment. Moreover, the agent may also actively ask the external information source for advice.

Current ARL approaches using external information constitute an important part of RL research. Some of the main streams that focus on the use of external information in RL include:

  1. Heuristic reinforcement learning [23].

  2. Interactive reinforcement learning [3, 31].

  3. Reinforcement learning from demonstration [6, 92].

  4. Transfer learning [130, 112].

  5. Multiple information sources [7, 98].

Each of these approaches is described as an example later in Section 4. Moreover, other RL approaches that may also draw significantly on external information include: Bayesian reinforcement learning [144], imitation learning [10, 99], preference-based learning [56], inverse reinforcement learning [100], and ensemble algorithms [145], among others.

In this article, we present a conceptual framework and a taxonomy for describing the practice of using external information, as well as a review of state-of-the-art ARL methods. A standardised ARL taxonomy will foster collaboration between different RL communities, improve comparability, allow a precise description of new approaches, and assist in identifying and addressing key questions for further research. The conceptual framework presented in this article, and the definition of external information on which it is based, are used to distinguish methods that employ external information from techniques such as traditional autonomous RL, state adaptation, dynamic programming, and fully supervised methods.

This article is organised as follows: Section 2 presents the conceptual framework for ARL as a method for formalising and describing ARL techniques. Section 3 defines a detailed taxonomy for ARL approaches, discussing the similarities between existing methods and ideas about the importance of providing a shared framework for representing and developing ARL. Section 4 frames current research in our proposed ARL taxonomy. This section aims to provide context and examples for the use of the framework for the collaboration and comparison of ARL techniques. Finally, Section 5 discusses the limitations of current state-of-the-art ARL methods, as well as provides goal-oriented discussion about some open questions and challenges to consider as future research directions.

2 A Conceptual Framework for Assisted Reinforcement Learning

In this section, we give more details about the ARL approach including some introductory examples of works in which external information sources have been used. Moreover, we define a conceptual framework identifying the different parts that comprise the underlying process used in ARL techniques. Based on this conceptual framework, in the following section, we define a more detailed taxonomy for ARL approaches.

2.1 Assisted Reinforcement Learning

The main strength of RL is its ability to endow an agent with new skills given no initial knowledge about the environment. With an appropriate reward function and enough interaction with its environment, an RL agent can learn (near-) optimal behaviour [124]. The agent's behaviour at every step is defined by its policy. The reward function promotes desirable behaviour and sometimes penalises undesirable behaviour. In the traditional view of RL, the reward function, and the rewards it produces, are internal to the environment [65]. Traditional RL, in which the environment is the sole provider of information to the agent, has been demonstrated to perform well in many different domains, especially when facing small and bounded problems [124]. However, RL has some difficulties when scaling up to large, unbounded environments, particularly regarding the time needed for the agent to learn the optimal policy [33, 35]. In RL, one approach to tackling this issue is to use external information to supplement the information that the environment provides [121, 90].
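As a concrete reference point for the autonomous setting described above, the learning step can be sketched as a tabular Q-learning update. This is a generic illustration with hypothetical names, not an algorithm taken from any of the surveyed works:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q(s, a) toward the bootstrapped
    target r + gamma * max_b Q(s', b)."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)

# A single rewarded transition nudges the value of the taken action upward.
Q = {}
q_learning_update(Q, s=0, a='right', r=1.0, s_next=1, actions=['left', 'right'])
```

Every ARL technique discussed below can be read as intervening somewhere in this loop: on the reward `r`, on the values in `Q`, or on how the next action is chosen.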

Information is considered external if it originates from outside of the agent's interactions with the environment. In this regard, internal information is determined solely through interactions with, and observations of, the environment. For example, in the case of a human, internal information would be anything the person can observe from the environment using their senses [97]. External information would be any information provided by peers, advisors, the internet, books, maps, and tutelage. In RL, anything external to the agent is usually considered part of the environment. In this regard, if an agent is learning in an environment, a person can be considered as part of it; therefore, the agent could model that person or communicate with them. Although external sources of information could simply be treated as part of the environment, doing so handicaps the agent unnecessarily. Some external sources of information need not be treated as part of the environment because they rely on conventions the agent is assumed to already understand. For instance, if an external source provides action advice using directions such as 'left' and 'right', the agent does not have to learn the meaning of these words from the ground up, or learn how to react to these instructions. Instead, we assume the agent knows that advice is coming, what it means, and how to use it. For example, if a person eats some berries and later becomes sick, the person may determine that those berries are poisonous. In this case, this would be internal information obtained by interaction with the environment. If instead a peer had previously advised the person that eating those berries would make them sick, that would be external information provided by an extrinsic source.

In this work, we review methods on externally-influenced agent learning, which we will refer to as assisted reinforcement learning. The ARL framework is defined to include any type of RL that uses external information to supplement agent learning and the decision-making process. Some common practices include the direct alteration of the agent’s understanding of the environment [107], focusing exploration efforts through critique and advice [138], or assisting the agent in the decision-making process [55]. For instance, existing ARL techniques include interactive reinforcement learning [3, 32], learning from demonstration [6, 92], and transfer learning [130, 112], among others.

In the case of Interactive Reinforcement Learning (IntRL), the RL approach is extended by involving an assistant directly into the agent training. An assistant, human or otherwise, can critique the agent’s decision-making and advise on which actions to take in the future [138]. The assistant provides their advice at any time during the learning process of the agent. In this regard, by critiquing past actions or advising on future decisions, both the agent and the assistant can quickly respond to changing environmental conditions [121].
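The interplay between interactive advice and autonomous behaviour can be sketched as an action-selection rule: follow the assistant when advice is available, otherwise act autonomously. This is a minimal illustration; the names and the epsilon-greedy fallback are our own assumptions rather than a specific method from the literature:

```python
import random

def select_action(Q, state, actions, advice=None, epsilon=0.1):
    """Follow the assistant's suggested action when advice is present;
    otherwise fall back to epsilon-greedy selection over learned values."""
    if advice is not None:          # assistant suggests an action for this state
        return advice
    if random.random() < epsilon:   # autonomous exploration
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```

Because advice only overrides individual selections, the underlying value estimates keep being updated from experience, so the agent remains able to act when the assistant is absent.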

Reinforcement Learning from Demonstration (RLfD) is similar to IntRL, differing primarily in information structure and timing. RLfD typically allows a teacher to provide examples of how to perform a task before the agent begins training. The agent then uses these examples either to start learning with behaviour that mimics the provided demonstrations [6] or to treat them as advice to guide exploration [92].
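One simple way demonstrations can seed learning, sketched below under our own assumptions (a tabular agent and a flat value bonus; actual RLfD methods vary widely), is to initialise the value table so that demonstrated state-action pairs look attractive early on:

```python
def q_from_demonstrations(demos, bonus=1.0):
    """Seed a Q-table so that demonstrated state-action pairs start with a
    value bonus, biasing early behaviour toward the teacher's examples."""
    Q = {}
    for trajectory in demos:            # each demo is a list of (state, action)
        for state, action in trajectory:
            Q[(state, action)] = Q.get((state, action), 0.0) + bonus
    return Q

# Two demonstrated trajectories; the repeated pair accumulates a larger bias.
Q = q_from_demonstrations([[(0, 'right'), (1, 'right')], [(0, 'right')]])
```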

Transfer Learning (TL) in RL scenarios allows agents to generalise information across tasks. TL assumes that, while training and testing environments may not have the same state space or distribution, the behaviour learned in a particular environment can be useful in other environments or agents [130]. By identifying the relationships between domains, an agent can use known skills and behaviour to accelerate learning in new areas. The agent's known skills and behaviour may be information it has learned itself from previous tasks or information given to it by other agents or sources about similar domains. The externally-sourced information that a TL agent receives allows the generalisation of known skills to new skills, often resulting in accelerated learning [112].
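The inter-task relationship at the heart of TL can be sketched as an explicit mapping applied to a learned value table. The hand-specified dictionaries here are a simplifying assumption; real transfer methods may learn these mappings or transfer other quantities entirely:

```python
def transfer_q(Q_source, state_map, action_map):
    """Project a source-task Q-table into a target task through
    hand-specified state and action mappings; unmapped entries are dropped."""
    return {
        (state_map[s], action_map[a]): v
        for (s, a), v in Q_source.items()
        if s in state_map and a in action_map
    }

# Hypothetical mappings: source state 0 corresponds to target state 'A'.
Q_source = {(0, 'l'): 2.0, (9, 'l'): 5.0}
Q_target = transfer_q(Q_source, state_map={0: 'A'}, action_map={'l': 'left'})
```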

The previously discussed RL approaches (i.e., IntRL, RLfD, and TL) are examples of ARL methods that use external information to supplement the agent’s decision-making process and learning. Additional examples of approaches that use an external information source to assist the agent are addressed in Section 4. The external information source is most commonly a human or another artificial agent. Regardless of the source, the use of external information has often been shown to improve an agent’s ability and learning speed. In the next section, we present a more detailed conceptual framework for ARL which is the base for the taxonomy we propose subsequently.

2.2 Conceptual Framework

The proposed ARL framework is built to improve the classification, comparability, and discussion of different externally-influenced RL methods. To achieve this aim, the framework has been designed using insights and observations drawn from many different ARL approaches. The result is a framework that can describe existing methods while also being flexible enough to include future research. The framework details are shown in Figure 2.

Figure 2: Detailed view of the assisted reinforcement learning framework. The diagram includes four processing components shown as dashed red boxes. Inside the assisted agent, one can observe three different points where it can receive possible modifications from the external model. Additionally, three communication links are shown with underlined text. This framework is subsequently used to further discuss the proposed ARL taxonomy.

The proposed ARL framework comprises four processing components, shown using red boxes in the diagram: the information source, advice interpretation, the external model, and the assisted agent itself. The external information source may not have perfect observability, and also may not know details about the RL agent (algorithms, weights, hyperparameters, etc.) or may instead make assumptions about it, e.g., that it is a value-based learner [128]. The processing components are responsible for providing, transforming, and storing information. We include the agent as one of the processing components since it is part of the RL process as well. However, an agent using ARL generally behaves as a traditional RL agent, i.e., it interacts with the environment by exploring/exploiting actions. Inside the agent, there are three different stages: reward update, internal processing, and action selection. Each of these stages may be altered by the external model using reward/state modifications, internal modifications, or action modifications, respectively. Moreover, the ARL framework also comprises three communication links that connect the four processing components, labelled: temporality, advice structure, and agent modification. These links, shown in Figure 2 as the lines connecting the processing components, convey information or denote constraints on the data, such as where or when to provide information.

The ARL framework describes the transmission, modification, and modality of sourced information. In this regard, we consider the ARL framework as a whole unit, comprising traditional autonomous RL plus the components and links for assistance. Thus, the taxonomy is a part of the framework and oriented to describe the assisted learning section. Although the framework has been developed on how ARL is usually built, not all ARL approaches use all the proposed components and links. Below, we briefly describe each of the components and links of the framework. They are subsequently used in the next section to describe in detail the proposed taxonomy.

  • Information source: is the origin of the assistance being provided to the agent. The source may be a human, a repository, or another agent. There may be multiple information sources providing assistance to an agent.

  • Temporality: determines both the time at which information is provided to the agent, and the frequency with which it is provided. Information may be provided before, during, or after agent training, and may occur multiple times throughout the learning process. Therefore, it is also responsible for how the information source communicates temporal issues to the advice interpreter.

  • Advice interpretation: denotes the process of transforming incoming information into a format better suited for the agent. This may involve extracting key frames from video, converting audio samples to rewards, or mapping information to states.

  • Advice structure: represents the structure of the advice after translation into a form suitable for the external model. Some approaches may not have an explicit external model; in that case, this structure might instead be used directly to modify the agent.

  • External model: is responsible for retaining and relaying the information between the source and the agent. The model may retain the received information in the learning model, using it for later decisions, or it may discard the received information as soon as it has been used.

  • Agent modification: denotes the approach that the agent uses to benefit from the incoming information. The most common modification approaches use information to alter the environmental reward signal, or to modify the agent's behaviour or decision-making process directly.

  • Assisted agent: is the RL agent receiving the external information or advice while learning a new task. The agent needs to work out how to incorporate the provided information into its own learning. If a different action is suggested by the trainer, the agent must decide whether or not to follow that advice.

3 Assisted Reinforcement Learning Taxonomy

In this section, we describe the processing components and communication links included in the proposed framework within an ARL taxonomy, and give more details of each of them. In this context, we refer to the taxonomy as a classification of the different elements of the ARL framework, i.e., processing components and communication links, and not as a way to classify each ARL method. Figure 3 shows all the elements of the proposed ARL taxonomy, including examples for each processing component and communication link. In the taxonomy, we include the agent as the component that receives the advice. Each of the seven elements, i.e., processing components and communication links, is described in detail in the following subsections.

Figure 3: The assisted reinforcement learning taxonomy. This figure shows the four processing components as dashed red boxes and the communication links as green parallelograms using underlined text. Examples for each component and method are included at the right.

3.1 Information Source

The external information source is the main factor that sets ARL apart from traditional RL approaches. It is responsible for introducing new information about the task to the agent, supplementing or replacing the information the agent receives from the environment. The source is external to the agent and the environment, providing information that either the agent may not have had access to, or would have eventually learned itself. The information source may be able to observe the environment, the agent, or the agent’s decision-making process. The objective of the information source is to assist the agent in achieving its goal faster.

There may be multiple information sources communicating with an agent. These may be humans, agents, other digital sources, or any combination of the three [64]. The use of multiple sources offers a wider range of available information to the agent. However, more complex modification methods may be required to manage the information and handle conflicting advice [67].

There are many examples of external information sources in current ARL literature, the most common of which are humans and additional reward functions [96, 140, 90]. For instance, RLfD and IntRL use human guidance to provide the agent with a generalised view of the solution [28, 123]. Moreover, the use of additional reward functions is one of the earliest examples of ARL. In such cases, the designer of the agent encodes some further information about the environment or goal as an additional reward, supplementing the original reward given by the environment.

An example of the use of additional reward functions can be found in Randløv and Alstrøm's bicycle experiment [108], in which they teach an agent to ride a bicycle towards a goal point. Without additional assistance, the RL agent would only receive a reward upon reaching the termination state. Randløv and Alstrøm encoded some of their knowledge as a shaping reward signal external to the environment, providing the agent with additional rewards if it is cycling towards the goal point. In this scenario, the system designers acted as an external information source, providing extra information to the RL agent. The use of this external information results in the agent learning the solution faster than using the traditional RL approach.
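The shaping idea behind the bicycle experiment can be sketched in potential-based form. This is a simplified, hypothetical reconstruction (Randløv and Alstrøm's actual shaping function was domain-specific), but it shows how an externally designed signal supplements a sparse environmental reward:

```python
def shaped_reward(env_reward, phi, state, next_state, gamma=0.99):
    """Augment the environment's reward with the potential-based term
    F = gamma * phi(s') - phi(s), which preserves the optimal policy."""
    return env_reward + gamma * phi(next_state) - phi(state)

# Hypothetical potential: negative distance to a goal located at position 10,
# so progress toward the goal yields a positive shaping bonus.
phi = lambda s: -abs(10 - s)
```

With `phi` encoding the designer's knowledge of where the goal lies, the agent receives informative feedback on every step instead of only at termination.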

Some other information sources include behaviours from past experiences or other agents, repositories of labelled data or examples, or distribution tables for initialising/biasing agent behaviour [38]. Video, audio, and text sources may be used as well [34]. However, these sources may require substantial amounts of interpretation and preprocessing to be of use.

The accuracy, availability, or consistency of the information source can affect the maximum utility of the information [142, 132]. Identifying inaccurate information before it is given to the agent can significantly improve performance [33, 31]. While the information source may perform validation and verification of the given advice, its primary duty remains simply to act as a supplementary source of information. In this regard, both validation and verification of information are functions better suited to the external model or the assisted agent.

3.2 Temporality

The temporal component, or temporality, refers to the time at which information is communicated by the information source. The information may be provided in full to the agent at a set time (either before, during, or after training). This is referred to as planned assistance [103, 27]. Alternatively, the information may be provided at any time during the agent’s operation, referred to as interactive assistance [106, 119].

Planned assistance, on the one hand, is common in ARL methods. Some examples are predefined additional shaping functions, agent policy initialisation based on either prior experience or a known distribution, and the creation of subgoals that lead the way to a final solution [103]. These methods let the experiment designer endow the agent with initial information about the environment or the goal to be achieved. By providing this initial knowledge, the designer can reduce the agent’s need for exploration.

The bicycle experiment discussed in the previous section is an example of planned assistance. As mentioned, the agent is learning to control a bicycle and must learn to steer it towards a goal [108]. Before the experiment, the designers give the agent additional information in the form of a reward signal that correlates to the direction of the goal state. This planned assistance approach helps the agent to narrow the search space by giving it extra information about the environment. This small yet beneficial initial information results in a significant improvement in the agent’s learning speed.

Another example of planned assistance is found in heuristic RL. Heuristic RL is a method of applying advice to agent decision-making. One example is an experiment which implements heuristic RL in the RoboCup soccer domain [23], a domain known for its large state space and continuous state range. In this environment, one team attempts to score a goal, while the other team tries to block the first team from scoring, such as in half-field offence [66, 62]. In this experiment using heuristic RL, the defending team is given initial advice before training. This advice consists of two rules: if the agent is not near the ball then move closer, and if the agent is near the ball then do something with it. The experiment results show that a team that uses planned assistance performs better than a team that is given no initial knowledge [23].
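The two defensive rules described above amount to a tiny heuristic policy, which can be sketched directly; the distance threshold and action names here are placeholders, not values from the RoboCup experiment:

```python
def heuristic_action(dist_to_ball, near_threshold=2.0):
    """Planned-assistance heuristic from the text: move toward the ball when
    far from it, otherwise act on the ball. Action names are placeholders."""
    if dist_to_ball > near_threshold:
        return 'move_to_ball'
    return 'play_ball'
```

Such rules do not solve the task by themselves; they bias early behaviour so the learner spends less time exploring clearly unhelpful regions of the state space.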

Interactive assistance, on the other hand, refers to information provided by the source repeatedly throughout the agent's learning. Information sources that assist interactively can often observe the agent's current state, or the environment the agent is operating in. In the current literature, humans are the most common information sources for interactive assistance [139, 122]. The human can observe how the agent is performing and its current state in the environment, and provide guidance or critiques of the agent's behaviour.

For example, Sophie’s Kitchen [138] presents an IntRL based agent, called Sophie, which attempts to bake a cake by interacting with the items and ingredients found in a kitchen. In this experiment, the agent will receive a reward if it successfully bakes the cake. At any point during the agent’s training, an observing human can provide the agent with an additional reward to supplement the reward signal given by the environment. If the agent performs an undesirable action, such as forgetting to add eggs to the cake, the human can punish the agent by providing an immediate negative reward. The human can also reward the agent for performing desirable actions, such as adding ingredients in the correct order. In this experiment, the human advisor is acting as an interactive information source.

Although the agent could learn the task without any assistance, the addition of the human advisor and interactive feedback allows the agent to learn the desired behaviour faster in comparison to autonomous RL [138]. The benefit of using interactive advice rather than planned advice is that the information source can react to the current state of the agent. Additionally, an interactive information source does not need to encode all possibly useful advice up front. Instead, it can choose to provide relevant information only when required. This approach does have a significant cost; the information source needs to be constantly observing the agent and determining what information is relevant. For instance, an approach using inverse RL through demonstrations may also consider providing failed examples to show the agent what not to do [115].

3.3 Advice Interpretation

The advice interpretation stage of the taxonomy denotes what transformations need to occur on the incoming information. The source provides information for the agent to use that may need to be translated into a format that the agent can understand. The information source may provide assistance in many different forms. Some examples include audio [37], video [34], text [86], distributions and probabilities [90], or prior learned behaviour from a different task or agent [43]. This information needs to be adapted for use by the agent in the current task. The product of the advice interpretation stage depends on the structure that the agent or external model requires.

A field where the interpretation of incoming advice is crucial is Transfer Learning (TL). The goal of TL is to use behaviour learned in a prior task to improve performance in a new, previously unseen task [40]. A critical step in TL is the mapping of states and observations between the old and new domains. The information source provides information to the agent that does not fully align with its current task. Therefore, it is crucial that the information provided can be correctly interpreted, so as to be useful in the current domain. Most commonly, this interpretation stage in TL is performed by hand. However, there have also been efforts to automate this stage [127, 93].

Another example of the use of the advice interpretation stage is with the sourcing of feedback for RL agents. In the Sophie's Kitchen experiment [138], discussed in the previous section, the agent can be given positive or negative feedback by a human regarding its choice of actions. In this experiment, the human creates either a green (positive) or a red (negative) bar to represent the desired feedback to be given to the agent. This bar is used to interpret the reward signal to give to the agent, with the colour of the bar designating whether the reward is positive or negative, and the size of the bar designating the magnitude of the reward. This type of feedback can also be extended to audio, where recorded phrases such as 'Good' or 'Well Done' are interpreted as positive rewards and 'Bad' or 'Try Again' are interpreted as negative rewards [135].
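A phrase-to-reward interpreter of the kind just described can be sketched as a small lookup; the phrase table below reuses the examples from the text, while the scalar reward values are our own assumption:

```python
def interpret_feedback(utterance):
    """Advice interpretation sketch: map a recognised phrase to a scalar
    reward; unrecognised phrases yield no reward."""
    positive = {'good', 'well done'}
    negative = {'bad', 'try again'}
    phrase = utterance.strip().lower()   # normalise the recognised speech
    if phrase in positive:
        return 1.0
    if phrase in negative:
        return -1.0
    return 0.0
```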

These methods can also be combined into a multi-modal architecture to provide advice to an RL robotic agent using audiovisual sensory inputs, such as work by Cruz et al. [34]. In this experiment, a simulated robot learns how to clean a table using a multi-modal associative function to integrate auditory and visual cues into a single piece of advice which is used by the RL algorithm. In this scenario, the external information source is a human trainer and the RL algorithm represents the integrated advice as a state-action pair.

3.4 Advice Structure

The advice structure component refers to the form that the agent or external model requires incoming information to take. The information that the agent uses can be represented in a number of ways. Some examples of advice structures include: Boolean values denoting positive or negative feedback; rules determining action selection; matrices for mapping prior experiences to new states; case-based reasoning structures for the agent to consult; or hierarchical decision trees to represent options for the agent to take [122, 68].

The simplest form of structure is binary, in which the information takes one of only two options, such as 'Good' or 'Bad'. An example of the use of a binary structure is the TAMER-RL agent [77]. TAMER-RL is an IntRL agent that uses binary feedback from an observing human. At any time step, the human can agree or disagree with the agent's last action. In this case, the feedback is a binary structure indicating agreement or disagreement.

A more complex advice structure is used in case-based RL agents [113]. A case in this context represents a generalised area of the state space and provides information about which actions to take in that state. The use of a case-based structure allows the agent to gain more information from the information source compared to a binary structure, at a cost of more complex sourcing and interpretation approaches.

One of the more common advice structures is a simple state-action pair. A state-action pair consists of a single state and an associated piece of advice. The associated advice may be an additional scalar reward or a recommended action. Using a state-action pair, sourced information is interpreted to provide advice for a given state. In the cleaning-table robot task [34], discussed in the previous section, the external trainer using multi-modal advice provides an action to be performed in specific states. Once the advice is processed using the multi-modal integration function, the proposed action is given to the RL agent to be executed as a state-action pair considering the agent’s current state. This state-action structure has also been used for other methods including TAMER-RL [77], Sophie’s Kitchen [138], and policy-shaping approaches [59].
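A state-action advice structure can be as simple as a lookup keyed by state. The states and values below are hypothetical; as described above, an entry may carry either a recommended action or an extra scalar reward.

```python
# Hypothetical state-action advice store: each entry pairs a state with
# either a recommended action or an additional scalar reward.
advice = {
    "near_edge":   {"action": "move_centre"},
    "holding_cup": {"reward": +0.5},
}

def advised_action(state, default_action):
    """Return the advised action for this state, or the agent's own choice."""
    entry = advice.get(state)
    if entry and "action" in entry:
        return entry["action"]
    return default_action

def shaped_reward(state, env_reward):
    """Add any advised reward bonus for this state to the environment reward."""
    entry = advice.get(state)
    bonus = entry.get("reward", 0.0) if entry else 0.0
    return env_reward + bonus
```

In states with no stored advice, both functions fall back to unassisted behaviour, matching the requirement that the agent operate normally without external information.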

A novel rule-based interactive advice structure is introduced in [16]. Interactive RL methods rely on constant human supervision and evaluation, requiring a substantial commitment from the advice-giver. This constraint restricts the user to providing advice relevant to the current state and no other, even when such advice may be applicable to multiple states. Allowing users to provide information in the form of rules, rather than per-state action recommendations, increases the information per interaction, and does not limit the information to the current state. Rules can be interactively created during the agent’s operation and be generalised over the state space while remaining flexible enough to handle potentially inaccurate or irrelevant information. The learner agent uses the rules as persistent advice allowing the retention and reuse of the information in the future. Rule-based advice significantly reduces human guidance requirements while improving agent performance.
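The contrast with per-state advice can be seen in a sketch where a single rule generalises over every state satisfying its condition. The rule and the state representation are illustrative, not taken from [16].

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Rule:
    """A persistent advice rule: applies to every state matching `condition`."""
    condition: Callable[[dict], bool]
    action: str

# One rule covers many states, unlike a per-state action recommendation.
rules = [Rule(condition=lambda s: s["distance_to_ball"] > 1.0,
              action="move_to_ball")]

def rule_advice(state: dict) -> Optional[str]:
    """Return the first applicable rule's action, or None to use the agent's own policy."""
    for rule in rules:
        if rule.condition(state):
            return rule.action
    return None
```

Because rules are retained, the same interaction keeps paying off in future states, which is the source of the reduced guidance requirement described above.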

3.5 External Model

The external model is responsible for retaining and relaying information between the information source and the agent. The external model receives interpreted information from the information source and may either retain the information for use by the agent when required or pass it to the agent immediately.

A retained model is an external model that stores all information provided by the information source [55]. A retained model may be used if the cost of acquiring information is greater than the cost of storing it, if the information provided is general or applies to multiple states, or if the information is gathered incrementally. In instances where information is gathered incrementally, using a retained model allows the agent to build up a knowledge base over time. The agent may consult with the model at any time to determine if a reward signal is to be altered, or if there is any extra information that may assist with decision-making.

An immediate model passes the information directly to the agent [91]. In this case, the information received is only relevant to the current time step, or the cost of reacquiring the information from the source is less than that of retaining the information.
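The distinction between the two model types can be sketched as follows; the class names and interfaces are illustrative, not from any cited system.

```python
class RetainedModel:
    """Sketch of a retained external model: stores every piece of advice so
    the agent can consult it at any later time step."""
    def __init__(self):
        self._store = {}

    def add(self, state, advice):
        self._store[state] = advice

    def consult(self, state):
        return self._store.get(state)   # advice persists across queries


class ImmediateModel:
    """Sketch of an immediate model: advice is handed over once, then dropped."""
    def __init__(self):
        self._advice = None

    def add(self, state, advice):
        self._advice = (state, advice)

    def consult(self, state):
        if self._advice and self._advice[0] == state:
            advice = self._advice[1]
            self._advice = None         # not retained for future use
            return advice
        return None
```

The retained model answers the same query repeatedly, building a knowledge base; the immediate model returns each piece of advice exactly once, reflecting advice that is relevant only to the current time step.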

Approaches can also combine the two, incorporating a retained model while passing some information through directly, such as in [33]. In this work, an RL agent uses a combination of interactive feedback and contextual affordances [36] to speed up the learning process of a robot performing a domestic task. On the one hand, contextual affordances are learned at the beginning of autonomous RL and are readily available from then on to avoid so-called failed states, i.e., states from which the robot is no longer able to finish the task successfully. On the other hand, interactive feedback is provided by an external advisor and used to suggest actions to perform while the robot is learning the task. This advice is given to the robot to be used in the current state and is discarded immediately after.

The external model may have different functions depending on its implementation. For instance, heuristic RL hosts a model that stores rules and advice that generalise over sections of the state space [49]. In TL, the external model may hold information regarding past experiences and policies from problems similar to the current domain [130, 11], or in inverse RL, the external model is a substitute for the reward function [1].

3.6 Agent Modification

The modification stage of the framework denotes how the information that the external model contains is used to assist the agent in achieving its goal. It is responsible for supplementing the agent’s reward, altering the agent’s policy, or helping with the decision-making process. A popular method for injecting external information into agent learning is shaping [116], which alters agent performance by modifying parameters of the learning process. Erez and Smart [53] propose a list of techniques through which shaping can be applied to RL agents. These include altering the reward, the agent’s policy, the agent’s learning parameters, and the environmental dynamics [147].

Altering the reward the agent receives is a straightforward method for influencing an agent’s learning. It is known as reward-shaping, in which the external information is used to bias the agent’s learning [96]. Special care must be taken to ensure that any modification of the reward signal does not alter the optimal policy, to avoid the agent exploiting the shaped reward in ways that do not align with the desired goal. This can be achieved by ensuring that additional rewards are potential-based, meaning that they are derived from the difference in the values of a potential function at the current and successor states [61]. However, recent work by [13] shows a flaw in this method when transforming non-potential-based reward-shaping into potential-based. Alternatively, the authors introduce a policy-invariant explicit shaping algorithm allowing for arbitrary advice, confirming that it ensures convergence to the optimal policy even when the advice is misleading and also accelerates learning when the advice is useful [13]. Shaping techniques have also been used to alter state-action pairs [146], for dynamic situations [61, 47], and for multi-agent systems [46].
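Potential-based shaping as in [61] can be sketched directly: the shaping term is the discounted difference of a potential function between the successor and current states. The distance-based potential below is an assumed example.

```python
def potential_shaping(phi, s, s_next, gamma=0.99):
    """Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s).

    Adding F to the environment reward leaves the optimal policy unchanged
    for any state-potential function phi [61].
    """
    return gamma * phi(s_next) - phi(s)

# Example potential (an assumption): closer to the goal state 10 is better.
phi = lambda s: -abs(10 - s)

shaped = potential_shaping(phi, s=4, s_next=5)  # a step towards the goal
print(shaped > 0)  # True: progress towards the goal is rewarded
```

Moving away from the goal yields a negative shaping term, so the extra signal cannot be farmed by cycling between states.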

Policy-shaping is the modification of the agent’s behaviour [59]. This modification can be done either by influencing how the agent makes decisions or by directly altering the agent’s learned behaviour. A simple method of policy-shaping involves forcing the agent to take certain actions if the information source has recommended them [60, 95]. This allows the external information source to guide the agent and take direct control over exploration/exploitation. Alternatively, the information source can choose to alter the agent’s behaviour directly by changing Q-values or installing rules that override the actions for chosen states [74]. This method of modification can improve agent performance rapidly, as it can give the agent partial solutions.
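A minimal sketch of the ‘forced action’ form of policy-shaping, where advice overrides the agent’s usual epsilon-greedy choice. The data structures here are assumptions for illustration.

```python
import random

def shaped_policy(state, q_values, advice, epsilon=0.1):
    """Policy-shaping sketch: an advised action overrides the agent's own
    epsilon-greedy selection; otherwise the agent decides as usual.

    q_values maps state -> {action: value}; advice maps state -> action.
    """
    if state in advice:                  # external source takes direct control
        return advice[state]
    actions = list(q_values[state])
    if random.random() < epsilon:        # usual exploration
        return random.choice(actions)
    return max(actions, key=lambda a: q_values[state][a])
```

With no advice for the current state, behaviour is identical to a standard epsilon-greedy agent, as the taxonomy requires of the assisted agent.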

Internal modification is a method of altering the parameters of the agent that are essential to its learning. Parameters such as the learning rate (α), discount factor (γ), and exploration rate (ε) are all internal to the RL agent and may be altered to affect its performance [137]. For example, if an advisor observes that an agent is repeating actions and not exploring enough, then the exploration rate or learning rate may be temporarily increased. Internal modification is a simple method to implement. However, it can be difficult at times to know which parameters to adjust, and to what degree.
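A hypothetical sketch of such an internal modification: an advisor raises the exploration rate when the agent appears stuck repeating a single action. The thresholds and boost values are assumed.

```python
def adapt_exploration(recent_actions, epsilon, boost=0.2, max_epsilon=0.5):
    """Internal modification sketch: if the last few actions are all the same,
    temporarily raise the exploration rate (capped at max_epsilon)."""
    if len(recent_actions) >= 5 and len(set(recent_actions)) == 1:
        return min(epsilon + boost, max_epsilon)
    return epsilon

print(adapt_exploration(["up"] * 8, epsilon=0.1))           # boosted
print(adapt_exploration(["up", "down", "up"], epsilon=0.1))  # 0.1, unchanged
```

The cap illustrates the caveat in the text: how far to push a parameter is itself a design decision, so an advisor would typically bound the adjustment.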

Environmental modification is an indirect method for influencing an RL agent. Altering the environment is not always achievable and may be a technique better suited for digital or simulated environments. Some examples of modifying the environment include altering or reducing the state space and observable information [71, 18], reducing the action space [118], modifying the agent’s starting state [48], or altering the dynamics of the environment to make the task easier to solve [89]. Below, we further describe these environmental modifications.

Reducing the state space can speed up the agent’s learning as there is less of the environment to search. While the agent cannot fully solve the task with an incomplete environment representation, it can learn the basic behaviour. The level of detail in the state representation can then be increased, allowing the agent to refine its policy towards the correct behaviour [71, 18]. Reducing the action space is similar in principle: the agent’s available actions are limited, and the agent attempts to learn the best behaviour it can with the actions it has available. Once a suitable behaviour has been achieved, new actions can be provided, and the agent can begin to learn more complex solutions [118]. Modifying the agent’s starting state alters where in the environment the agent begins learning. Using this approach, the agent can begin training close to the goal. As the agent learns how to navigate to the goal, the starting state is incrementally moved further away, allowing the agent to build upon its past knowledge of the environment [48]. Altering the dynamics of the environment involves changing how the environment operates to make the task easier for the agent to learn [147]. By altering attributes of the environment, such as reducing gravity, lowering the maximum driving speed, or reducing noise, the agent may learn the desired behaviour faster or more safely. After the agent learns a satisfactory behaviour, the environment dynamics can be returned to their typical levels [89].
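The starting-state modification in particular lends itself to a short sketch: training begins next to the goal, and the start is moved further away as episodes accumulate. The schedule below is an assumed example, not the method of [48].

```python
def start_state_curriculum(episode, goal=100, step=10):
    """Start-state modification sketch for a 1-D corridor of states 0..goal.

    Early episodes start close to the goal; every `step` episodes the
    allowed starting distance grows by 5, until the true start (state 0)
    is reached.
    """
    distance = min(goal, (episode // step + 1) * 5)
    return max(0, goal - distance)

print(start_state_curriculum(0))    # 95: starts right next to the goal
print(start_state_curriculum(500))  # 0: eventually the true starting state
```

Each stage lets the agent reuse the value estimates learned for states nearer the goal, which is exactly the knowledge build-up the text describes.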

3.7 Assisted Agent

The final component of the proposed ARL taxonomy is the RL agent. A key aspect of the taxonomy is that the agent, in the absence of any external information, should operate the same as any RL agent would. Given no external information, the agent should continue to explore and interact autonomously with its environment and attempt to achieve its goal.

In the next section, we present an in-depth look at some ARL techniques and describe them in terms of the taxonomy that has been presented in this section.

4 Assisted Reinforcement Learning Approaches

This section presents an in-depth analysis of some popular ARL approaches. Each technique is described as an instance of the proposed taxonomy shown in Section 3, in some cases using a specific approach and in other cases a set of them. Therefore, for each presented ARL approach, we show how each processing component and each communication link particularly adapts to the ARL taxonomy using current literature in the respective field for concrete examples.

4.1 Heuristic Reinforcement Learning

Heuristic RL uses pieces of information that generalise over an area of the state space. The information is used to assist the agent in decision-making and to reduce the searchable state space [15, 149]. An example of a heuristic is a rule. A rule can cover multiple states, making it an efficient means of delivering advice to an agent. In Section 3.2, we introduced a heuristic RL experiment applied to the RoboCup soccer domain [23]. In the RoboCup soccer domain, one team actively tries to score a goal, while the other team tries to block it. As mentioned, the defending team is given initial advice before training, consisting of two predefined rules. The following is an analysis of this heuristic RL example in terms of the ARL taxonomy.

  • Information source: The information source for the RoboCup experiment is a person. In this case, the person has previously experimented with the robot soccer domain and can advise the agent with some rules that will speed up learning.

    Figure 4: Heuristic RL components according to the proposed ARL taxonomy. The particular processing components and communication links illustrate a technique used in the RoboCup soccer domain [23].
  • Temporality: The advice for the agent is given before training begins. Once training has begun the person does not interact with the agent again. This is an example of planned assistance, where information is given to the agent at a fixed time, and the information is known by the information source in advance.

  • Advice interpretation: The information needs to be understandable by the agent. In the robot soccer domain, the person gives two rules: (i) if not near the ball, then move towards the ball, and (ii) if near the ball, do something with the ball. These rules are understandable by the human but need to be translated into machine code so that the agent can use them. This is usually a task easily performed by a knowledgeable human operator. The result is conditional rules such as: (i) IF NOT close_to_ball() THEN target_and_move(), and (ii) IF close_to_ball() THEN kick_ball().

  • Advice structure: The structure of the advice after being interpreted is a new rule. The rule needs to be compatible with the agent, including the ability to substitute variables and evaluate expressions.

  • External model: The external model used by the heuristic RL agent is a rule set. The external model retains all rules given to it. The model may also retain statistics about the rule relating to confidence, number of uses, and state space covered.

  • Agent modification: Heuristic RL uses the rule set to assist the agent in its decision-making. If a rule applies to the current state, then the action that the rule recommends is taken by the agent. This is a form of policy-shaping as the agent’s decision-making is directly manipulated by the external information.

  • Assisted Agent: The RL agent operates as usual. When it is time to decide on an action, the agent consults the external model. The external model tests all the rules it holds and checks whether any apply to the current state; if none do, the agent’s default decision-making mechanism is used.

Figure 4 shows how the heuristic RL approach fits into the proposed ARL taxonomy taking into consideration the previous definitions of processing components and communication links from the RoboCup soccer domain.

4.2 Interactive Reinforcement Learning

IntRL is another application of ARL. Most commonly, the information source is an observing human or a substitute for a human, such as an oracle, a simulated user, or another agent [141]. The human provides assessment and advice to the agent, reinforcing the agent’s past actions and guiding future decisions. The human can assess past actions in two ways, by stating that the agent’s chosen action is somehow correct or incorrect, or by telling the agent what the correct action to take is in that instance. Alternatively, the human can advise the agent on what actions to take in the future [84]. The human can recommend actions to take or to avoid, or provide more information about the current state to assist the agent in its decision-making [35].

IntRL applications include having a human provide additional reward information [80, 79], and having a human or agent provide action advice [150, 4]. All of these methods operate in real time and work similarly, differing mainly in the agent modification stage. The following is an analysis of these IntRL approaches in terms of the ARL taxonomy.

  • Information source: The information source is a human or simulated user. A simulated user is a program, analogous to a human, that acts as a human would in a given situation. The human can observe the agent’s current and past states, the past actions taken, and the action the agent proposes to take next.

  • Temporality: IntRL agents operate interactively. The advisor can provide information to the agent before, during, or after learning, and repeatedly throughout the learning process. This allows the advisor to react to current information and supply the agent with relevant advice.

  • Advice interpretation: The advisor provides either an assessment of past actions taken, recommendations about actions to take, or a reward signal. Computer simulated agents can receive this information as key presses. However, physical agents may receive this information through audio or video inputs [34]. In the case of audio inputs, these may be simple commands such as ’Correct’ or ’Go Right’, which can be translated to a form the agent can understand [37]. Supporting input modalities such as natural language makes systems based on IntRL more accessible to users who are not themselves familiar with RL.

  • Advice structure: A common structure of advice the agent requires is simply a state-action pair. Using this structure the human can assign advice to a state for the agent to use, such as: In this state, do this [9].

  • External model: Either retained or immediate models are commonly used [55, 75]. A retained model tracks what advice/feedback has been received for each state [55]. The agent can use this model to determine the human’s accuracy, consistency, and discount for each piece of advice received. The model acts as a lookup table for the agent, if advice exists for the current state, then the agent can use it. Alternative methods may not retain information given by the human and only use it for the current state [75].

    Figure 5: Interactive RL as the proposed ARL taxonomy. In this approach, interactive advice is given by the user and more commonly used as policy and reward shaping.
  • Agent modification: The most common methods of using the advice to modify the agent’s learning process are reward- and policy-shaping [84]. Reward-shaping uses assessment/critique gathered from the advisor to alter the reward given to the agent. If the advisor disagrees with a past action, then the reward received for that state-action pair is decreased. If the advisor recommends an action to take in the future, then policy-shaping can be used to override the agent’s usual action selection mechanism. One method of implementing policy-shaping for interactive advice is probabilistic policy reuse [55].

  • Assisted Agent: Most of the time, the RL agent operates as any other RL agent would, i.e., it performs actions in the environment by exploiting/exploring. The agent should continue to do so even if no advice from the trainer is given. Although a trainer could proactively provide advice to the learner, sometimes the student could decide to request such advice, and the trainer may or may not respond to that request. For instance, heuristics have been used to decide if the trainer should provide advice and/or if the learner should ask for it [4]. In contrast, recent work estimates the learner’s uncertainty in its current state, asking for advice when the level of uncertainty is above a predefined threshold.

Figure 5 shows how the IntRL approach is adapted to the proposed ARL taxonomy taking into account the previous definitions of processing components and communication links.

4.3 Reinforcement Learning from Demonstration

RLfD is a term coined by Schaal [111]. It refers to the setting where both a reward signal and demonstrations are available to learn from, combining the best of the fields of RL and Learning from Demonstration (LfD). Since RL presents an objective evaluation of behaviour, optimal behaviour can be achieved. Such an objective evaluation is not present in LfD [8], where only expert demonstrations are available to be mimicked and generalised; the student thus cannot surpass its master. Nevertheless, LfD is typically much more sample-efficient than RL. Therefore, the aim is to combine the fast LfD method with the objective behaviour evaluation and theoretical guarantees of RL.

Two different approaches have been proposed to use demonstrations in an RL setting. The first is the generation of an initial value function for temporal-difference learning by using the demonstrations as passive learning experiences for the RL agent [117]. The second approach derives an initial policy from the demonstrations and uses that to kickstart the RL agent [19, 120]. In this regard, Taylor et al. propose the Human-Agent Transfer (HAT) algorithm [131], which consists of three steps: (i) demonstration: the agent is teleoperated through the task and records all state-action transitions, (ii) policy summarising: in order to bootstrap autonomous learning, policy rules are derived from the recorded state-action transitions, and (iii) independent learning: autonomous reinforcement learning uses the policy summary to bias the learning. Below we use the HAT algorithm to describe how RLfD fits into the ARL taxonomy.
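The policy-summarising step (ii) can be sketched with a simple stand-in for HAT’s rule learner, here a majority-vote classifier over recorded state-action pairs. This is an illustrative simplification, not the published algorithm.

```python
from collections import Counter

def summarise_policy(demonstrations):
    """HAT-style step (ii) sketch: derive a policy summary from recorded
    state-action transitions.

    The published HAT algorithm learns policy rules with a classifier;
    as a stand-in, this takes the majority action seen in each state.
    """
    by_state = {}
    for state, action in demonstrations:
        by_state.setdefault(state, Counter())[action] += 1
    return {s: counts.most_common(1)[0][0] for s, counts in by_state.items()}

# Hypothetical recorded demonstrations (state, action).
demos = [("s0", "right"), ("s0", "right"), ("s0", "left"), ("s1", "up")]
print(summarise_policy(demos)["s0"])  # right
```

The resulting summary is exactly the kind of queryable external model described in the bullets that follow: given a state, it returns the action the demonstrator is hypothesised to take.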

  • Information source: An expert of the task (human or otherwise) can provide sample behaviour by demonstrating its execution of the task. Preferably these demonstrations are efficient and successful executions of the task.

  • Temporality: It uses planned assistance. Demonstrations are recorded and given to the learning agent before it starts training.

  • Advice interpretation: The received demonstrations must be first transformed into the agent’s perspective by encoding them as sequences of state-action pairs. These are then processed using a classifier, which serves as the LfD component, creating an approximation of the demonstrator’s policy using rules.

    Figure 6: RL from demonstration as the proposed ARL taxonomy. In this case, the processing components and communication links are defined from the HAT algorithm [131], which combines RL and LfD.
  • Advice structure: The information is encoded as a classifier that maps states to the actions which the demonstrator is hypothesised to execute in those states.

  • External model: The generated rules are stored in the external model and not modified anymore. The external model can be queried with a state and responds with the hypothesised demonstrator action in that state.

  • Agent modification: The action proposed by the demonstrator can be integrated into the agent through three action biasing methods: (i) attributing a value bonus to the Q-value for that state-action pair, (ii) extending the agent’s action set with an action that executes the hypothesised demonstrator action, and (iii) probabilistically choosing to execute the action suggested by the model.

  • Assisted agent: During its decision-making (when and how depends on the implemented modification method), the agent has the option to consult the external model to obtain the action that the demonstrator is assumed to take. This kind of agent is sometimes referred to as a curiosity-driven agent [104]. Otherwise, the agent acts as a usual RL agent.

Figure 6 shows how the RLfD approach is adapted to the proposed ARL taxonomy taking into account the previous definitions of processing components and communication links for the HAT algorithm.
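Of the three action-biasing methods listed above, method (i), a value bonus on the hypothesised demonstrator action, can be sketched as follows. The bonus magnitude is an assumed hyperparameter.

```python
def biased_q(q_values, state, action, model_action, bonus=0.5):
    """Action biasing (i): the Q-value of the state-action pair matching the
    demonstrator model's suggested action receives an additive bonus; all
    other actions are left unchanged."""
    base = q_values.get((state, action), 0.0)
    return base + (bonus if action == model_action else 0.0)

# Hypothetical Q-values; the demonstrator model suggests 'left'.
q = {("s0", "left"): 0.2, ("s0", "right"): 0.4}
best = max(["left", "right"], key=lambda a: biased_q(q, "s0", a, model_action="left"))
print(best)  # left
```

The bonus tilts early action selection towards the demonstrator’s behaviour while still letting the learned Q-values eventually dominate if they grow large enough.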

4.4 Transfer Learning

The idea of transferring information between tasks (or between agents), rather than learning every task from the ground up seems to be obvious in retrospect. While transfer between different tasks has long been studied in humans, it has only gained popularity in RL settings in the last decade [130]. We consider three distinct settings where TL can be useful.

First, an agent may have learned how to perform a task and a new agent must learn to perform that same task or a variation on the task under different circumstances. Let us consider two agents with different state features, i.e., different sensors, or different action spaces (or different actuators). In this case, an inter-task mapping [17, 129] can be hand specified or learned from data [133, 5] to relate the new target agent to the existing source agent. One of the simplest ways to reuse such knowledge is to embed it into the target task agent, e.g., directly reuse the Q-values that the source agent had learned [129].
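Direct Q-value reuse through an inter-task mapping can be sketched as follows. The mapping dictionaries stand in for a hand-specified inter-task mapping and are assumed examples.

```python
def transfer_q_values(source_q, state_map, action_map):
    """Sketch of Q-value transfer via an inter-task mapping.

    state_map / action_map take target-task states and actions to their
    source-task counterparts; target Q-values are initialised from the
    mapped source values (0.0 where the source has no estimate).
    """
    target_q = {}
    for s_t in state_map:
        for a_t in action_map:
            source_key = (state_map[s_t], action_map[a_t])
            target_q[(s_t, a_t)] = source_q.get(source_key, 0.0)
    return target_q

source_q = {("src_s0", "src_left"): 2.0}
target_q = transfer_q_values(source_q,
                             state_map={"tgt_s0": "src_s0"},
                             action_map={"tgt_left": "src_left"})
print(target_q[("tgt_s0", "tgt_left")])  # 2.0
```

The transferred values only initialise the target agent; learning then proceeds normally, so the agent can outgrow imperfect source knowledge.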

Second, let us now consider that the world may be non-stationary. In TL settings, it is common to assume that the agent is notified when the world (or a task in that world) changes. However, a TL agent sometimes does not need to detect changes [63] or to explicitly handle slow changes in the world over time [2]. As in the previous setting, the agent may want to modify the information, e.g., by using an inter-task mapping, to relate the two tasks. In addition, the agent may decide not to use its prior knowledge at all, e.g., to avoid negative transfer when the tasks are too dissimilar [129].

Third, TL could be a critical step within a curriculum learning approach [134, 14]. For example, previous work has shown that learning a sequence of tasks that gradually increase in difficulty can be faster than directly training on the final (difficult) task [133, 52]. In addition to curricula that are created by machine learning experts, curricula constructed by naive human participants have also been considered [105]. As a complementary problem, others have considered a learning agent autonomously creating its own curriculum [94, 39]. In all cases, the difficulty lies in scaffolding correctly so that the agent can learn quickly across a sequence of tasks. These approaches are distinct from multi-task learning [55], where the agent wants to learn over a distribution of tasks, and lifelong learning [26, 101], where learning a new task should also improve performance on previous tasks. The following is an analysis of TL methods in terms of the ARL taxonomy.

  • Information source: The information comes from an agent with different capabilities or the same agent that has trained on a different task.

  • Temporality: Transfer typically occurs when a task changes or when an agent first faces a novel task. In both cases, it is planned assistance, i.e., the source agent transfers knowledge to the target agent before the target agent begins learning. If the inter-task mapping is initially unknown, some time may be spent trying to learn an inter-task mapping or estimate task similarity to previous tasks. However, the more time spent before the transfer, the less impact transfer can have.

  • Advice interpretation: There are many types of information that can be transferred, including Q-values, rules, a model, etc. [129]. TL methods assume the target agent has access to the source agent’s ‘brain’, an assumption that may not always be true, e.g., if the designer of the source agent has not provided an API or if the source agent is a human.

  • Advice structure: The structure of the transferred knowledge is as varied as the types of information that can be provided. This variety of information includes Q-values, rules, or a model, among others.

  • External model: The source model is normally retained. Because the source task knowledge is not necessarily sufficient for optimal performance in the target task, it is important for the target agent to be able to learn to outperform the transferred information.

  • Agent modification: The target task agent uses the transferred information to bias its learning. The transferred knowledge is not typically modified. Instead, the target task agent builds on top of the knowledge, learning when to ignore it and instead follow the knowledge it has learned from the environment.

  • Assisted Agent: The agent is a typical RL agent that can take advantage of one or more types of prior knowledge.

Figure 7 shows how the TL approach can be represented within the proposed ARL taxonomy taking into account the previous definitions of processing components and communication links.

Figure 7: Transfer learning as the proposed ARL taxonomy. In this case, an agent with different capabilities (or the same agent) provides the model of a source task which is transferred to a target task.

4.5 Multiple Information Sources

While the majority of work in ARL is based on a single source of advice, several researchers have considered scenarios where multiple sources of advice may exist [20, 44, 58, 148]. Although the use of multiple information sources is not an ARL approach by itself and could comprise sources utilising any of the previously mentioned approaches, we include it here to highlight how multiple sources can be framed within the proposed taxonomy. The introduction of multiple advisors may have benefits for ARL agents, particularly in scenarios where each individual advisor has knowledge which is limited in some way [114], e.g., individual advisors may have expertise covering different sub-areas of the problem domain. However, it also introduces additional problems for the agent, such as handling inconsistencies or direct conflicts between the guidance provided by different advisors, or learning to judge the reliability of each advisor, possibly in a state-sensitive manner [150]. In the extreme case, an agent may even need to be able to identify and ignore the advice provided by deliberately malicious advisors [98]. The following is an analysis of approaches using multiple information sources with respect to the proposed ARL taxonomy.

  • Information source: Prior research has identified several scenarios in which an agent may have access to multiple sources of external information. Argall et al. [7] argue that when robots are applied to tasks within society in general, it is very likely that multiple users will interact with and guide the behaviour of a robot. In the context of TL, multiple sources of information may be derived either from experience on varying MDPs [102], or on alternative mappings from a single prior MDP to the current environment [125]. In multi-agent systems, each agent may serve as a potential source of information for every other agent [41, 54].

  • Temporality: Assistance may be planned or interactive. For instance, Argall et al. [7] have considered two different sources of information, in the form of teacher demonstrations and teacher feedback on trajectories generated by the learner. The former may be provided in advance of learning consisting of complete state-action trajectories, i.e., planned assistance, while the latter occurs on an interactive basis during learning, and structurally consists of a subset of the learner’s actions being flagged as correct by the teacher, i.e., interactive assistance.

    Figure 8: Multiple information sources as the proposed ARL taxonomy. In this case, there could be multiple humans or multiple agents. One important aspect is to integrate the different pieces of advice. The agent may also learn multiple policies as in multi-objective RL.
  • Advice interpretation: The majority of work so far on ARL from multiple information sources has assumed that these sources are homogeneous in terms of the timing and nature of the information provided. However, this need not be the case, and for heterogeneous information sources, some aspects of the advice may differ in terms of interpretation and structure. In this regard, the advice needs to be integrated considering either all possible sources (contributing equally or unequally), some sources (with the provided information partially or fully considered), or only one source at a time [114].

  • Advice structure: Each information source may use a different structure of advice. Therefore, any of the aforementioned structures from previous sections may be used individually, e.g., machine rule, state-action pair, rule system, value, or model. The multiple information sources may then be integrated into a single piece of advice, for instance using a multi-modal integration function [34].

    Approach | Information source | Temporality | Advice interpretation | Advice structure | External model | Agent modification | Assisted agent
    Heuristic reinforcement learning | Human domain expert | Planned | Convert rule to machine language | Machine rule | Retained rule set | Policy shaping | Normal agent
    Interactive reinforcement learning | Human / simulated user | Interactive | Convert modal cue to signal | State-action pair | Immediate | Policy / reward shaping | Curiosity-driven agent
    Reinforcement learning from demonstration | Domain expert | Planned | Convert demonstration to agent’s perspective | Rule system | Retained rule system | Action biasing | Curiosity-driven agent
    Transfer learning | Agent with different capabilities | Planned | Q-values, rules, or models | Value, rule, or model | Retained source model | Action biasing | Normal agent
    Multiple information sources | Multi-user or multi-agent system | Planned or interactive | Multi-source integration | Integrated advice | Separate or combined model | Weighted or unweighted combination | Multi-policy agent
    Table 1: Summary of the reviewed assisted reinforcement learning approaches adapted to the proposed taxonomy.
  • External model: An ARL agent must choose whether (i) to maintain a separate model for each information source, (ii) to combine the information from all sources into a single model, or (iii) a combination of both. An example of the latter approach is the inverse RL system presented in [70], which learns a model of each information source in the form of a feature-weighting function and then forms a combined feature-weighting via averaging. As noted by Karlsson [70], single-model approaches may encounter difficulties if dealing with information sources which are fundamentally incompatible with each other. An additional benefit of maintaining independent models is that these can also be augmented by additional data on characteristics of each information source, such as the reliability or consistency of its advice [7, 125].

  • Agent modification: Any of the modification approaches discussed in the earlier sections of this paper may also be applied in the context of multiple information sources, for example, methods from LfD [7], TL [125, 102], reward shaping [21, 76], and inverse RL [70, 126]. The main additional consideration is how these methods may be affected by the presence of multiple external models. The methods examined so far either use a combination of the models, weighted or unweighted [7, 70], or select a single best model to use [125].

  • Assisted agent: In most circumstances, the operation of the agent itself is largely unaffected by the presence of more than one information source. However, Tanwani and Billard [126] consider the task of performing inverse RL from multiple demonstrations provided by multiple experts, operating according to different strategies or preferences. To address the potential incompatibilities between these strategies, the agent attempts to learn a set of multiple policies, so as to be able to satisfy any expert strategy, including those not demonstrated to the agent. This approach is closely related to multi-policy algorithms developed for multi-objective RL [109].
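The weighted and unweighted model combinations described in the points above can be sketched in a few lines. This is a hypothetical illustration, not an implementation from [7] or [70]: the function name, the per-action preference-vector representation, and the example weights are all assumptions.

```python
def combine_advice(advice_values, weights=None):
    """Combine per-action preference vectors from several external
    models into a single advice vector. With `weights=None` the
    combination is a plain (unweighted) average; otherwise each
    source's vector is scaled by its normalised reliability weight."""
    n_sources = len(advice_values)
    if weights is None:
        weights = [1.0] * n_sources          # unweighted combination
    total = sum(weights)
    norm = [w / total for w in weights]      # normalise reliabilities
    n_actions = len(advice_values[0])
    return [
        sum(norm[i] * advice_values[i][a] for i in range(n_sources))
        for a in range(n_actions)
    ]

# Two sources disagree on the best of three actions; the source
# judged more reliable (weight 3) dominates the combination.
source_a = [1.0, 0.0, 0.0]
source_b = [0.0, 1.0, 0.0]
combined = combine_advice([source_a, source_b], weights=[3.0, 1.0])
best_action = max(range(len(combined)), key=combined.__getitem__)
```

Setting the weights from an estimate of each source's reliability or consistency recovers the weighted variant; omitting them gives the unweighted average.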

Figure 8 shows how an approach using multiple information sources is adapted to the proposed ARL taxonomy taking into account the previous definitions of processing components and communication links. Moreover, Table 1 summarises how each of the ARL approaches and examples reviewed in this section is adapted to the proposed taxonomy.

5 Future Directions

In this section, we propose some further possibilities for future work in the field of ARL. These open questions have been identified from the current literature. Many of these issues are shared with autonomous RL, but how they should be addressed within the ARL framework remains open.

5.1 Incorrect Assistance

A common assumption made by ARL methods is that all external information the agent receives is accurate, i.e., correct advice that assists the agent in completing its goal. However, the assumption that information will always be of use to the agent does not hold, especially when the information source is an observing human [51]. Humans may deliver advice late, leading the agent to associate it with the wrong state. The advice may be of short-term use to the agent yet prevent it from achieving optimal performance. Moreover, the human trainer may even be malicious and actively attempt to sabotage the agent's performance.

Incorrect information can be introduced by other sources as well. Some examples for non-human incorrect advice include behaviour transferred from another domain that does not align correctly, rules that generalise over multiple states which may cover exception states, and noisy or missing information from audio-visual sources [34].

Information given to agents may be correct initially but over time cease to be optimal [2]. Other advice may be accurate for most states; however, there can exist exception states for which the advice does not hold. These exception states can be the critical difference between an ordinary solution and the optimal solution. There is a need for research on how to identify and mitigate incorrect information in these scenarios, especially considering that even a very small amount of incorrect advice may be highly detrimental to the learning process [31].
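One simple way to mitigate persistently incorrect advice is to track each advisor's record and follow its advice only probabilistically. The sketch below is a hypothetical illustration, not a method from the cited works: the class, its prior counts, and the "helpfulness" feedback signal are all assumptions.

```python
import random

class TrustedAdvisor:
    """Track an advisor's reliability as a running success rate and
    follow its advice probabilistically, so that persistently wrong
    (or malicious) advice is gradually phased out of action selection."""

    def __init__(self, prior_successes=1, prior_failures=1):
        # Optimistic prior: a new advisor starts at reliability 0.5.
        self.successes = prior_successes
        self.failures = prior_failures

    @property
    def reliability(self):
        return self.successes / (self.successes + self.failures)

    def follow_advice(self, rng=random):
        # Follow the advisor with probability equal to its reliability.
        return rng.random() < self.reliability

    def update(self, advice_was_helpful):
        if advice_was_helpful:
            self.successes += 1
        else:
            self.failures += 1

advisor = TrustedAdvisor()
for _ in range(20):                 # the advice keeps proving unhelpful...
    advisor.update(advice_was_helpful=False)
# ...so its influence on action selection shrinks accordingly.
```

Judging whether advice "was helpful" is itself an open problem; any proxy (e.g., comparing returns with and without the advice) adds noise to the reliability estimate.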

5.2 Multiple Information Sources

As reviewed in the previous section, the use of multiple information sources may naturally arise in some application scenarios; it can increase the agent's knowledge of the environment and increase confidence in decision-making when the different sources agree on an action. However, the use of multiple sources raises additional questions:

  • What if the different sources disagree on the best action to take?

  • How can the agent identify the best information source to listen to?

  • How can the agent manage conflicting information?

  • How can the agent measure trust in the different information sources?

Additionally, the use of multiple sources may be extended to crowdsourcing [67]. In this context, crowdsourcing refers to the enlistment of a large number of people, paid or unpaid, typically via the internet, ranging in number from tens to tens of thousands. This raises challenges of malicious users, anonymity, and large uncertainty in the value and reliability of the information.
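For crowdsourced advice in particular, a crude first step toward measuring trust is to score each source by how often it agrees with the per-state majority recommendation. This is a hypothetical sketch (the function name and data layout are assumptions), and majority agreement can itself be fooled by colluding malicious users:

```python
from collections import Counter

def score_sources(recommendations):
    """Estimate each crowdsourced advisor's trustworthiness as its
    rate of agreement with the per-state majority recommendation.

    recommendations: dict mapping source name -> list of recommended
        actions, one entry per queried state.
    """
    names = list(recommendations)
    n_states = len(recommendations[names[0]])
    # Majority-voted action for each queried state.
    majority = [
        Counter(recommendations[n][s] for n in names).most_common(1)[0][0]
        for s in range(n_states)
    ]
    return {
        n: sum(a == m for a, m in zip(recommendations[n], majority)) / n_states
        for n in names
    }

votes = {
    "alice":   ["up", "left", "up", "right"],
    "bob":     ["up", "left", "up", "right"],
    "mallory": ["down", "down", "down", "down"],   # possibly malicious
}
trust = score_sources(votes)   # mallory scores far below alice and bob
```

The resulting scores could feed the weighted model combination discussed in the previous section, down-weighting sources that consistently disagree with the crowd.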

5.3 Explainability

Explainability refers to translating the agent's information into a form the human can understand [29, 45]. The reasons why an agent develops certain behaviours can sometimes be difficult for non-expert end-users to understand. When combining the RL method with policy modification methods such as rules, expert assistance, external models, and policy shaping, understanding why an agent chooses to take an action becomes even more difficult. Developing methods for understanding agent learning and decision-making is important, as it allows the human to remain informed of the agent's motivations and decisions, and to keep track of the accountability of the actions taken. This can be beneficial for artificial intelligence ethics and human-computer teaching, among other fields.

5.4 Two-Way Communication

Two-way communication refers to the ability of the information source and the agent to converse with each other, perhaps multiple times, before making a decision [72]. It allows the information source, presumably human, and the agent to ask each other questions, request more information, and clarify decision-making and its reasoning. Although the proposed framework includes two-way communication, as shown in Figure 1, most current ARL methods do not support it to the extent that non-expert human advisors can interact with the agent freely. For two-way communication to apply to non-expert human advisors, issues of explainability (as discussed in the previous section), timing, and agent initiation need to be addressed.

Timing refers to the time it takes to communicate back and forth. Agents sometimes have a fixed time limit during which they need to learn, communicate, and decide on the next action. Reducing the time each interaction with the human takes and reducing the number of interactions needed are two open areas of research. Agent initiation refers to the ability of the agent to initiate communication with the human source itself, for instance to request clarification on information or assistance for decision-making. A challenge for agent initiation is determining when and how often the agent should request assistance. Requests should be frequent enough to make use of the information source while not becoming a nuisance to the human or detracting from learning time, and should consider the cost of the request, e.g., in paid crowdsourcing.
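An agent-initiated request policy of the kind just described can be sketched as an uncertainty test under a query budget. This is a hypothetical illustration; the margin threshold and the budget mechanism are assumptions rather than a published method.

```python
def should_ask(q_values, budget_remaining, margin=0.1):
    """Ask for assistance only when the agent is uncertain (the gap
    between the two best Q-values is below `margin`) and the query
    budget -- standing in for the human's patience or a crowdsourcing
    cost -- has not been exhausted."""
    if budget_remaining <= 0:
        return False
    ranked = sorted(q_values, reverse=True)
    return (ranked[0] - ranked[1]) < margin

# Confident state: one clearly best action, so the agent acts alone.
acts_alone = not should_ask([0.9, 0.2, 0.1], budget_remaining=5)
# Uncertain state: a near-tie between the top actions, so the agent asks.
asks_human = should_ask([0.51, 0.49, 0.10], budget_remaining=5)
```

Tuning the margin trades off nuisance to the human against learning speed; more elaborate schemes could also weigh the expected value of the answer against the cost of asking.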

5.5 Other Challenges

There are also other challenges to be considered for future ARL systems. Although many of the issues described in this section are shared with autonomous RL [87], we focus the discussion on how externally-influenced agents in particular may be affected in the context of the ARL framework. While we describe the essential implications for ARL systems in each of the following areas, we note that each of them warrants further and deeper discussion.

  • Real-time policy inference: Many RL systems need to be deployed in real-world scenarios and, therefore, policy inference must happen in real-time [82]. ARL frameworks may introduce additional issues, since the external information source must observe and react to the RL agent's state as fast as possible; otherwise the assistance may become unnecessary or incorrect for the newly reached state.

  • Assistance delay: There are RL systems in which determining the state or receiving the reward signal may take weeks, such as a recommender system where the reward is based on user interaction [88]. In these contexts, the external information source may also introduce unknown delays in the system's actuators, sensors, or rewards, making the assistance mistimed, either delayed or premature, or in some cases conflicting or redundant with respect to the RL agent's autonomous operation.

  • Continuous states and actions: When an RL agent works in high-dimensional continuous state and action spaces [90, 9], learning can be difficult even in traditional RL [50]. In an ARL framework, additional problems may arise, as the external information used by the agent may not be accurate enough given the high dimensionality. In the presence of high-dimensional states and actions, even small differences in the received assistance may substantially slow the learning process, since these differences may in essence represent a very different state or action.

  • Safety constraints: In RL environments, there are safety constraints that should never, or at least rarely, be violated [69]. Special care is needed when receiving information from an external source, since the advisor may repeatedly direct the agent to unsafe states and, in turn, increase the time needed for learning.

  • Partially observable environments: In practice, many RL problems are partially observable [24]. For instance, partial observability may occur in non-stationary environments [90] or in the presence of stochastic transitions [30]. If the external information source lacks the observations needed to clearly infer the current state of the environment, it may give incorrect assistance to the learner agent.

  • Multi-objective reward: In many cases, RL agents need to balance multiple, conflicting subgoals and may therefore use multi-dimensional reward functions [143]. In this regard, an external information source may give priority to a particular subgoal over the others, unbalancing the global reward function. There could also be issues when multiple information sources are used, each covering or favouring different subgoals. Moreover, when using a multi-objective reward in TL, only some subgoals from the source task may be relevant in the target task; the RL agent should therefore also coordinate and filter relevant information.

  • Multi-agent systems: There could be multiple agents learning a task and multiple external information sources. In this case, an information source may provide advice generalised to all agents or directed specifically at one of them. Moreover, advice useful for one agent may be detrimental to another, depending on the state, the agent's current knowledge, or its particular reward function. When multiple information sources are available, an agent consulting an external source may need to discriminate which one is best for the particular state. Additionally, the teacher-student approach usually integrated into ARL requires the teacher to be an expert in the learning domain; in this regard, multiple learning agents may also advise each other while learning in a common environment [41].
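The safety-constraint concern above suggests filtering external advice before it reaches the actuators, falling back to the agent's own choice when the advice is unsafe. A minimal hypothetical sketch, where the unsafe-action set and all names are assumptions for illustration:

```python
def apply_advice_safely(advised_action, own_action, unsafe_actions):
    """Discard externally advised actions that violate a known safety
    constraint, so that a careless or malicious advisor cannot
    repeatedly push the agent into unsafe states."""
    if advised_action in unsafe_actions:
        return own_action
    return advised_action

unsafe = {"enter_restricted_zone"}
# Unsafe advice is overridden by the agent's own safe choice.
chosen = apply_advice_safely("enter_restricted_zone", "wait", unsafe)
```

In practice the unsafe set is rarely known in advance; it might itself be learned or transferred, as in safe-exploration approaches such as [69].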

6 Conclusions

In this article, we have presented an ARL framework, comprising all RL techniques that use external information. ARL methods use external information to supplement the information the agent receives from the environment to improve performance and decision-making.

To describe the different ARL methods, we propose a taxonomy to classify the different functions of an externally-influenced RL agent. Through the analysis of the current literature, we have found seven key features that make up an ARL technique. They are divided into four processing components and three communication links. A definition and examples of each of these seven features have been presented.

Additionally, we demonstrated the applicability of the framework on different ARL fields. These areas include heuristic RL, IntRL, RLfD, TL, and multiple information sources. Each of these fields has been analysed and described as applied to the presented taxonomy. Finally, we also present some ideas about areas for future research in order to extend the ARL field.


This work has been partially supported by the Australian Government Research Training Program (RTP) and the RTP Fee-Offset Scholarship through Federation University Australia.


  • [1] P. Abbeel and A. Y. Ng (2004) Apprenticeship learning via inverse reinforcement learning. In Proceedings of the International Conference on Machine learning ICML, pp. 1–8. External Links: ISBN 1581138385 Cited by: §3.5.
  • [2] V. Akila and G. Zayaraz (2015) A brief survey on concept drift. In Intelligent Computing, Communication and Devices, pp. 293–302. Cited by: §4.4, §5.1.
  • [3] S. Amershi, M. Cakmak, W. B. Knox, and T. Kulesza (2014) Power to the people: the role of humans in interactive machine learning. AI Magazine 35 (4), pp. 105–120. Cited by: item 2, §2.1.
  • [4] O. Amir, E. Kamar, A. Kolobov, and B. Grosz (2016) Interactive teaching strategies for agent training. In Proceedings of the International Joint Conference on Artificial Intelligence IJCAI, pp. 804–811. Cited by: 7th item, §4.2.
  • [5] H. B. Ammar, E. Eaton, P. Ruvolo, and M. E. Taylor (2015) Unsupervised Cross-Domain Transfer in Policy Gradient Reinforcement Learning via Manifold Alignment. In Proceedings of the Association for the Advancement of Artificial Intelligence conference AAAI, Cited by: §4.4.
  • [6] B. Argall, B. Browning, and M. Veloso (2007) Learning by demonstration with critique from a human teacher. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction HRI, pp. 57–64. External Links: ISBN 1595936173 Cited by: item 3, §2.1, §2.1.
  • [7] B. D. Argall, B. Browning, and M. Veloso (2009) Automatic weight learning for multiple data sources when learning from demonstration. In Proceedings of the IEEE International Conference on Robotics and Automation ICRA, pp. 226–231. Cited by: item 5, 1st item, 2nd item, 5th item, 6th item.
  • [8] B. D. Argall, S. Chernova, M. Veloso, and B. Browning (2009) A survey of robot learning from demonstration. Robotics and autonomous systems 57 (5), pp. 469–483. Cited by: §4.3.
  • [9] A. Ayala, C. Henríquez, and F. Cruz (2019) Reinforcement learning using continuous states and interactive feedback. In Proceedings of the International Conference on Applications of Intelligent Systems, pp. 1–5. Cited by: 4th item, 3rd item.
  • [10] J. Bandera, J. Rodriguez, L. Molina-Tanco, and A. Bandera (2012) A survey of vision-based architectures for robot learning by imitation. International Journal of Humanoid Robotics 9 (01), pp. 1250006. Cited by: §1.
  • [11] B. Banerjee (2007) General game learning using knowledge transfer. In Proceedings of the International Joint Conference on Artificial Intelligence IJCAI, pp. 672–677. Cited by: §3.5.
  • [12] P. Barros, A. Tanevska, and A. Sciutti (2020) Learning from learners: adapting reinforcement learning agents to be competitive in a card game. arXiv preprint arXiv:2004.04000. Cited by: §1.
  • [13] P. Behboudian, Y. Satsangi, M. E. Taylor, A. Harutyunyan, and M. Bowling (2020) Useful policy invariant shaping from arbitrary advice. In AAMAS Adaptive and Learning Agents Workshop ALA 2020, pp. 9. Cited by: §3.6.
  • [14] Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the International Conference on Machine learning ICML, New York, NY, USA, pp. 41–48. Cited by: §4.4.
  • [15] R. A. Bianchi, L. A. Celiberto Jr, P. E. Santos, J. P. Matsuura, and R. L. de Mantaras (2015) Transferring knowledge as heuristics in reinforcement learning: a case-based approach. Artificial Intelligence 226, pp. 102–121. Cited by: §4.1.
  • [16] A. Bignold (2019) Rule-based interactive assisted reinforcement learning. Ph.D. Thesis, Federation University Australia. Cited by: §3.4.
  • [17] H. Bou Ammar, M. E. Taylor, K. Tuyls, and G. Weiss (2011) Reinforcement learning transfer using a sparse coded inter-task mapping. In European Workshop on Multi-Agent Systems, pp. 1–16. Cited by: §4.4.
  • [18] M. Breyer, F. Furrer, T. Novkovic, R. Siegwart, and J. Nieto (2019) Comparing task simplifications to learn closed-loop object picking using deep reinforcement learning. IEEE Robotics and Automation Letters 4 (2), pp. 1549–1556. Cited by: §3.6, §3.6.
  • [19] T. Brys, A. Harutyunyan, H. B. Suay, S. Chernova, M. E. Taylor, and A. Nowé (2015) Reinforcement learning from demonstration through shaping. In Proceedings of the International Joint Conference on Artificial Intelligence IJCAI, pp. 26. Cited by: §4.3.
  • [20] T. Brys, A. Harutyunyan, P. Vrancx, A. Nowé, and M. E. Taylor (2017) Multi-objectivization and ensembles of shapings in reinforcement learning. Neurocomputing 263, pp. 48–59. Cited by: §4.5.
  • [21] T. Brys, A. Nowé, D. Kudenko, and M. E. Taylor (2014) Combining multiple correlated reward and shaping signals by measuring confidence.. In Proceedings of the Association for the Advancement of Artificial Intelligence conference AAAI, pp. 1687–1693. Cited by: 6th item.
  • [22] A. R. Cassandra and L. P. Kaelbling (2016) Learning policies for partially observable environments: scaling up. In Proceedings of the International Conference on Machine Learning ICML, pp. 362. Cited by: §1.
  • [23] L. A. Celiberto Jr, C. H. Ribeiro, A. H. Costa, and R. A. Bianchi (2007) Heuristic reinforcement learning applied to robocup simulation agents. In RoboCup 2007: Robot Soccer World Cup XI, pp. 220–227. External Links: ISBN 3540688463 Cited by: item 1, §3.2, Figure 4, §4.1.
  • [24] H. Chen, B. Yang, and J. Liu (2018) Partially observable reinforcement learning for sustainable active surveillance. In Proceedings of the International Conference on Knowledge Science, Engineering and Management, pp. 425–437. Cited by: 5th item.
  • [25] S. Chen, V. Tangkaratt, H. Lin, and M. Sugiyama (2019) Active deep Q-learning with demonstration. Machine Learning, pp. 1–27. Cited by: §1.
  • [26] Z. Chen and B. Liu (2016) Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool Publishers. Cited by: §4.4.
  • [27] S. Cheng, T. Chang, and C. Hsu (2013) A framework of an agent planning with reinforcement learning for e-pet. In Proceedings of the International Conference on Orange Technologies ICOT, pp. 310–313. Cited by: §3.2.
  • [28] L. C. Cobo, K. Subramanian, C. L. Isbell Jr, A. D. Lanterman, and A. L. Thomaz (2014) Abstraction from demonstration for efficient reinforcement learning in high-dimensional domains. Artificial Intelligence 216, pp. 103–128. Cited by: §3.1.
  • [29] F. Cruz, R. Dazeley, and P. Vamplew (2019) Memory-based explainable reinforcement learning. In Proceedings of the Australasian Joint Conference on Artificial Intelligence, pp. 66–77. Cited by: §5.3.
  • [30] F. Cruz, R. Dazeley, and P. Vamplew (2020) Explainable robotic systems: interpreting outcome-focused actions in a reinforcement learning scenario. arXiv preprint arXiv:2006.13615. Cited by: 5th item.
  • [31] F. Cruz, S. Magg, Y. Nagai, and S. Wermter (2018) Improving interactive reinforcement learning: what makes a good teacher?. Connection Science 30 (3), pp. 306–325. Cited by: item 2, §3.1, §5.1.
  • [32] F. Cruz, S. Magg, C. Weber, and S. Wermter (2014) Improving reinforcement learning with interactive feedback and affordances. In Proceedings of the Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics ICDL-EpiRob, pp. 165–170. Cited by: §2.1.
  • [33] F. Cruz, S. Magg, C. Weber, and S. Wermter (2016) Training agents with interactive reinforcement learning and contextual affordances. IEEE Transactions on Cognitive and Developmental Systems 8 (4), pp. 271–284. Cited by: §2.1, §3.1, §3.5.
  • [34] F. Cruz, G. I. Parisi, J. Twiefel, and S. Wermter (2016) Multi-modal integration of dynamic audiovisual patterns for an interactive reinforcement learning scenario. In Proceedings fo the IEEE/RSJ International Conference on Intelligent Robots and Systems IROS, pp. 759–766. Cited by: §3.1, §3.3, §3.3, §3.4, 3rd item, 4th item, §5.1.
  • [35] F. Cruz, G. I. Parisi, and S. Wermter (2018) Multi-modal feedback for affordance-driven interactive reinforcement learning. In Proceedings of the International Joint Conference on Neural Networks IJCNN, pp. 5515–5122. Cited by: §2.1, §4.2.
  • [36] F. Cruz, G. I. Parisi, and S. Wermter (2016) Learning contextual affordances with an associative neural architecture. In Proceedings of the European Symposium on Artificial Neural Network. Computational Intelligence and Machine Learning ESANN, pp. 665–670. Cited by: §3.5.
  • [37] F. Cruz, J. Twiefel, S. Magg, C. Weber, and S. Wermter (2015) Interactive reinforcement learning through speech guidance in a domestic scenario. In Proceedings of the International Joint Conference on Neural Networks IJCNN, pp. 1341–1348. Cited by: §3.3, 3rd item.
  • [38] F. Cruz, P. Wüppen, S. Magg, A. Fazrie, and S. Wermter (2017) Agent-advising approaches in an interactive reinforcement learning scenario. In Proceedings of the Joint IEEE International Conference on Development and Learning and Epigenetic Robotics ICDL-EpiRob, pp. 209–214. Cited by: §3.1.
  • [39] F. L. Da Silva and A. H. R. Costa (2018) Object-oriented curriculum generation for reinforcement learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems AAMAS, pp. 1026–1034. Cited by: §4.4.
  • [40] F. L. Da Silva and A. H. R. Costa (2019) A survey on transfer learning for multiagent reinforcement learning systems. Journal of Artificial Intelligence Research 64, pp. 645–703. Cited by: §3.3.
  • [41] F. L. Da Silva, R. Glatt, and A. H. R. Costa (2017) Simultaneously learning and advising in multiagent reinforcement learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems AAMAS, pp. 1100–1108. Cited by: 1st item, 7th item.
  • [42] F. L. Da Silva, P. Hernandez-Leal, B. Kartal, and M. E. Taylor (2020) Uncertainty-aware action advising for deep reinforcement learning agents. In Proceedings of the Association for the Advancement of Artificial Intelligence conference AAAI, pp. 5792–5799. Cited by: 7th item.
  • [43] F. L. Da Silva, G. Warnell, A. H. R. Costa, and P. Stone (2020) Agents teaching agents: a survey on inter-agent transfer learning. Autonomous Agents and Multi-Agent Systems 34 (1), pp. 9. Cited by: §3.3.
  • [44] F. L. Da Silva (2019) Integrating agent advice and previous task solutions in multiagent reinforcement learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems AAMAS, pp. 2447–2448. Cited by: §4.5.
  • [45] R. Dazeley, P. Vamplew, and F. Cruz (2020) A conceptual framework for establishing trust and social acceptance: a review of explainable reinforcement learning. Submitted to Artificial Intelligence. Cited by: §5.3.
  • [46] S. Devlin and D. Kudenko (2011) Theoretical considerations of potential-based reward shaping for multi-agent systems. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems AAMAS, pp. 225–232. Cited by: §3.6.
  • [47] S. Devlin and D. Kudenko (2012) Dynamic potential-based reward shaping. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems AAMAS, pp. 433–440. Cited by: §3.6.
  • [48] K. Dixon, R. J. Malak, and P. K. Khosla (2000) Incorporating prior knowledge and previously learned information into reinforcement learning agents. Carnegie Mellon University, Institute for Complex Engineered Systems. Cited by: §3.6, §3.6.
  • [49] M. Dorigo and L. Gambardella (2014) Ant-Q: a reinforcement learning approach to the traveling salesman problem. In Proceedings of International Conference on Machine Learning ICML, pp. 252–260. Cited by: §3.5.
  • [50] G. Dulac-Arnold, R. Evans, H. van Hasselt, P. Sunehag, T. Lillicrap, J. Hunt, T. Mann, T. Weber, T. Degris, and B. Coppin (2015) Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679. Cited by: 3rd item.
  • [51] K. Efthymiadis, S. Devlin, and D. Kudenko (2013) Overcoming erroneous domain knowledge in plan-based reward shaping. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems AAMAS, pp. 1245–1246. External Links: ISBN 1450319939 Cited by: §5.1.
  • [52] M. Eppe, S. Magg, and S. Wermter (2019) Curriculum goal masking for continuous deep reinforcement learning. In Proceedings of the Joint IEEE International Conference on Development and Learning and Epigenetic Robotics ICDL-EpiRob, pp. 183–188. Cited by: §4.4.
  • [53] T. Erez and W. D. Smart (2008) What does shaping mean for computational reinforcement learning?. In Proceedings of the IEEE International Conference on Development and Learning ICDL, pp. 215–219. External Links: ISBN 1424426618 Cited by: §1, §3.6.
  • [54] A. Fachantidis, M. E. Taylor, and I. Vlahavas (2019) Learning to teach reinforcement learning agents. Machine Learning and Knowledge Extraction 1 (1), pp. 21–42. Cited by: 1st item.
  • [55] F. Fernández and M. Veloso (2006) Probabilistic policy reuse in a reinforcement learning agent. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems AAMAS, pp. 720–727. Cited by: §1, §1, §2.1, §3.5, 5th item, 6th item, §4.4.
  • [56] J. Fürnkranz, E. Hüllermeier, W. Cheng, and S. Park (2012) Preference-based reinforcement learning: a formal framework and a policy iteration algorithm. Machine learning 89 (1-2), pp. 123–156. Cited by: §1.
  • [57] I. Giannoccaro and P. Pontrandolfo (2002) Inventory management in supply chains: a reinforcement learning approach. International Journal of Production Economics 78 (2), pp. 153–161. Cited by: §1.
  • [58] M. Gimelfarb, S. Sanner, and C. Lee (2018) Reinforcement learning with multiple experts: a Bayesian model combination approach. In Advances in Neural Information Processing Systems, pp. 9528–9538. Cited by: §4.5.
  • [59] S. Griffith, K. Subramanian, J. Scholz, C. Isbell, and A. L. Thomaz (2013) Policy shaping: integrating human feedback with reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2625–2633. Cited by: §1, §1, §1, §3.4, §3.6.
  • [60] J. Grizou, M. Lopes, and P. Oudeyer (2013) Robot learning simultaneously a task and how to interpret human instructions. In Proceedings of the Joint IEEE International Conference on Development and Learning and Epigenetic Robotics ICDL-EpiRob, pp. 1–8. Cited by: §3.6.
  • [61] A. Harutyunyan, S. Devlin, P. Vrancx, and A. Nowé (2015) Expressing arbitrary reward functions as potential-based advice.. In Proceedings of the Association for the Advancement of Artificial Intelligence conference AAAI, pp. 2652–2658. Cited by: §3.6.
  • [62] M. Hausknecht, P. Mupparaju, S. Subramanian, S. Kalyanakrishnan, and P. Stone (2016) Half field offense: an environment for multiagent learning and ad hoc teamwork. In AAMAS Adaptive and Learning Agents Workshop ALA 2016, Cited by: §3.2.
  • [63] P. Hernandez-Leal, Y. Zhan, M. E. Taylor, L. E. Sucar, and E. Munoz de Cote (2016-11) Efficiently detecting switches against non-stationary opponents. Autonomous Agents and Multi-Agent Systems, pp. 1–23. External Links: ISSN 1387-2532 Cited by: §4.4.
  • [64] C. L. Isbell, M. Kearns, D. Kormann, S. Singh, and P. Stone (2000) Cobot in LambdaMOO: a social statistics agent. In Proceedings of the Association for the Advancement of Artificial Intelligence conference AAAI, pp. 36–41. Cited by: §3.1.
  • [65] L. P. Kaelbling, M. L. Littman, and A. W. Moore (1996) Reinforcement learning: a survey. Journal of artificial intelligence research, pp. 237–285. Cited by: §1, §1, §2.1.
  • [66] S. Kalyanakrishnan, Y. Liu, and P. Stone (2006) Half field offense in RoboCup soccer: a multiagent reinforcement learning case study. In Robot Soccer World Cup, pp. 72–85. Cited by: §3.2.
  • [67] E. Kamar, S. Hacker, and E. Horvitz (2012) Combining human and machine intelligence in large-scale crowdsourcing. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems AAMAS, pp. 467–474. Cited by: §3.1, §5.2.
  • [68] F. Kaplan, P. Oudeyer, E. Kubinyi, and A. Miklósi (2002) Robotic clicker training. Robotics and Autonomous Systems 38 (3), pp. 197–206. Cited by: §3.4.
  • [69] T. G. Karimpanal, S. Rana, S. Gupta, T. Tran, and S. Venkatesh (2019) Learning transferable domain priors for safe exploration in reinforcement learning. In Proceedings of the International Joint Conference on Neural Networks IJCNN, pp. 1–8. Cited by: 4th item.
  • [70] J. Karlsson (2014) Learning to play games from multiple imperfect teachers. Master’s Thesis, Chalmers University of Technology, Gothenburg, Sweden. Cited by: 5th item, 6th item.
  • [71] M. Kerzel, H. B. Mohammadi, M. A. Zamani, and S. Wermter (2018) Accelerating deep continuous reinforcement learning through task simplification. In Proceedings of the International Joint Conference on Neural Networks IJCNN, pp. 1–6. Cited by: §3.6, §3.6.
  • [72] T. Kessler Faulkner, R. A. Gutierrez, E. S. Short, G. Hoffman, and A. L. Thomaz (2019) Active attention-modified policy shaping: socially interactive agents track. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems AAMAS, pp. 728–736. Cited by: §5.4.
  • [73] H. Kitano, M. Asada, Y. Kuniyoshi, I. Noda, E. Osawa, and H. Matsubara (1997) RoboCup: a challenge problem for AI. AI magazine 18 (1), pp. 73. External Links: ISSN 0738-4602 Cited by: §1.
  • [74] M. J. Knowles and S. Wermter (2008) The hybrid integration of perceptual symbol systems and interactive reinforcement learning. In Proceedings of the International Conference on Hybrid Intelligent Systems, pp. 404–409. Cited by: §3.6.
  • [75] W. B. Knox, B. D. Glass, B. C. Love, W. T. Maddox, and P. Stone (2012) How humans teach agents. International Journal of Social Robotics 4 (4), pp. 409–421. Cited by: 5th item.
  • [76] W. B. Knox, P. Stone, and C. Breazeal (2013) Training a robot via human feedback: a case study. In Proceedings of the International Conference on Social Robotics, pp. 460–470. Cited by: 6th item.
  • [77] W. B. Knox and P. Stone (2009) Interactively shaping agents via human reinforcement: the TAMER framework. In Proceedings of the International Conference on Knowledge Capture, pp. 9–16. External Links: ISBN 1605586587 Cited by: §3.4, §3.4.
  • [78] W. B. Knox and P. Stone (2010) Combining manual feedback with subsequent MDP reward signals for reinforcement learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems AAMAS, pp. 5–12. Cited by: §1, §1.
  • [79] W. B. Knox and P. Stone (2012) Reinforcement learning from human reward: discounting in episodic tasks. In Proceedings of the IEEE International Symposium on Robot and Human Interactive Communication RO-MAN, pp. 878–885. Cited by: §4.2.
  • [80] W. B. Knox and P. Stone (2012) Reinforcement learning from simultaneous human and MDP reward. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems AAMAS, pp. 475–482. Cited by: §4.2.
  • [81] J. Kober, J. A. Bagnell, and J. Peters (2013) Reinforcement learning in robotics: a survey. The International Journal of Robotics Research 32 (11), pp. 1238–1274. Cited by: §1.
  • [82] S. Koenig and R. G. Simmons (1993) Complexity analysis of real-time reinforcement learning. In Proceedings of the Association for the Advancement of Artificial Intelligence conference AAAI, pp. 99–107. Cited by: 1st item.
  • [83] G. Konidaris, S. Kuindersma, R. Grupen, and A. Barto (2012) Robot learning from demonstration by constructing skill trees. The International Journal of Robotics Research 31 (3), pp. 360–375. Cited by: §1.
  • [84] G. Li, R. Gomez, K. Nakamura, and B. He (2019) Human-centered reinforcement learning: a survey. IEEE Transactions on Human-Machine Systems 49 (4), pp. 337–349. Cited by: 6th item, §4.2.
  • [85] L. J. Lin (1991) Programming robots using reinforcement learning and teaching. In Proceedings of the Association for the Advancement of Artificial Intelligence conference AAAI, pp. 781–786. Cited by: §1.
  • [86] X. Liu, R. Deng, K. R. Choo, and Y. Yang (2019) Privacy-preserving reinforcement learning design for patient-centric dynamic treatment regimes. IEEE Transactions on Emerging Topics in Computing. Cited by: §3.3.
  • [87] D. J. Mankowitz, G. Dulac-Arnold, and T. Hester (2019) Challenges of real-world reinforcement learning. In ICML Workshop on Real-Life Reinforcement Learning, pp. 14. Cited by: §5.5.
  • [88] T. A. Mann, S. Gowal, R. Jiang, H. Hu, B. Lakshminarayanan, and A. Gyorgy (2018) Learning from delayed outcomes with intermediate observations. arXiv preprint arXiv:1807.09387. Cited by: 2nd item.
  • [89] C. Millan, B. Fernandes, F. Cruz, R. Dazeley, and S. Fernandes (2020) A robust approach for continuous interactive reinforcement learning. Submitted to IEEE Transactions on Neural Networks and Learning Systems, TNNLS. Cited by: §3.6, §3.6.
  • [90] C. Millán, B. Fernandes, and F. Cruz (2019) Human feedback in continuous actor-critic reinforcement learning. In Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning ESANN, pp. 661–666. Cited by: §2.1, §3.1, §3.3, 3rd item, 5th item.
  • [91] I. Moreira, J. Rivas, F. Cruz, R. Dazeley, A. Ayala, and B. Fernandes (2020) Deep reinforcement learning with interactive feedback in a human-robot environment. Submitted to MDPI Electronics. Cited by: §3.5.
  • [92] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel (2018) Overcoming exploration in reinforcement learning with demonstrations. In Proceedings of the IEEE International Conference on Robotics and Automation ICRA, pp. 6292–6299. Cited by: item 3, §2.1, §2.1.
  • [93] S. Narvekar, J. Sinapov, M. Leonetti, and P. Stone (2016) Source task creation for curriculum learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems AAMAS, pp. 566–574. Cited by: §3.3.
  • [94] S. Narvekar, J. Sinapov, and P. Stone (2017-08) Autonomous task sequencing for customized curriculum design in reinforcement learning. In Proceedings of the International Joint Conference on Artificial Intelligence IJCAI, Cited by: §4.4.
  • [95] N. Navidi (2020) Human AI interaction loop training: new approach for interactive reinforcement learning. arXiv preprint arXiv:2003.04203. Cited by: §3.6.
  • [96] A. Y. Ng, D. Harada, and S. Russell (1999) Policy invariance under reward transformations: theory and application to reward shaping. In Proceedings of the International Conference on Machine Learning ICML, Vol. 99, pp. 278–287. Cited by: §3.1, §3.6.
  • [97] Y. Niv (2009) Reinforcement learning in the brain. Journal of Mathematical Psychology 53 (3), pp. 139–154. Cited by: §2.1.
  • [98] L. Nunes and E. Oliveira (2003) Exchanging advice and learning to trust. Cooperative Information Agents VII, pp. 250–265. Cited by: item 5, §4.5.
  • [99] T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, J. Peters, et al. (2018) An algorithmic perspective on imitation learning. Foundations and Trends® in Robotics 7 (1-2), pp. 1–179. Cited by: §1.
  • [100] K. Parameswaran, R. Devidze, V. Cevher, and A. Singla (2019) Interactive teaching algorithms for inverse reinforcement learning. In Proceedings of the International Joint Conference on Artificial Intelligence IJCAI, pp. 2692–2700. Cited by: §1.
  • [101] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2019) Continual lifelong learning with neural networks: a review. Neural Networks. Cited by: §4.4.
  • [102] E. Parisotto, J. L. Ba, and R. Salakhutdinov (2015) Actor-mimic: deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342. Cited by: 1st item, 6th item.
  • [103] I. Partalas, D. Vrakas, and I. Vlahavas (2008) Reinforcement learning and automated planning: a survey. In Artificial Intelligence for Advanced Problem Solving Techniques, pp. 148–165. Cited by: §3.2, §3.2.
  • [104] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell (2017) Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17. Cited by: 7th item.
  • [105] B. Peng, J. MacGlashan, R. Loftin, M. L. Littman, D. L. Roberts, and M. E. Taylor (2017-05) Curriculum Design for Machine Learners in Sequential Decision Tasks (Extended Abstract). In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems AAMAS, Cited by: §4.4.
  • [106] P. M. Pilarski and R. S. Sutton (2012) Between instruction and reward: human-prompted switching. In AAAI Fall Symposium Series: Robots Learning Interactively from Human Teachers, pp. 45–52. Cited by: §3.2.
  • [107] B. Price and C. Boutilier (2003) Accelerating reinforcement learning through implicit imitation. Journal of Artificial Intelligence Research 19, pp. 569–629. Cited by: §1, §1, §2.1.
  • [108] J. Randløv and P. Alstrøm (1998) Learning to drive a bicycle using reinforcement learning and shaping. In Proceedings of the International Conference on Machine Learning ICML, pp. 463–471. Cited by: §1, §3.1, §3.2.
  • [109] D. M. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley (2013) A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research 48, pp. 67–113. Cited by: 7th item.
  • [110] L. Rozo, P. Jiménez, and C. Torras (2013) A robot learning from demonstration framework to perform force-based manipulation tasks. Intelligent Service Robotics 6 (1), pp. 33–51. Cited by: §1.
  • [111] S. Schaal (1997) Learning from demonstration. Advances in Neural Information Processing Systems 9, pp. 1040–1046. Cited by: §4.3.
  • [112] K. Shao, Y. Zhu, and D. Zhao (2018) StarCraft micromanagement with reinforcement learning and curriculum transfer learning. IEEE Transactions on Emerging Topics in Computational Intelligence 3 (1), pp. 73–84. Cited by: item 4, §2.1, §2.1.
  • [113] M. Sharma, M. P. Holmes, J. C. Santamaría, A. Irani, C. L. Isbell Jr, and A. Ram (2007) Transfer learning in real-time strategy games using hybrid CBR/RL. In Proceedings of the International Joint Conference on Artificial Intelligence IJCAI, Vol. 7, pp. 1041–1046. Cited by: §3.4.
  • [114] C. R. Shelton (2001) Balancing multiple sources of reward in reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1082–1088. Cited by: 3rd item, §4.5.
  • [115] K. Shiarlis, J. Messias, and S. Whiteson (2016) Inverse reinforcement learning from failure. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems AAMAS, pp. 1060–1068. Cited by: §3.2.
  • [116] B. F. Skinner (1975) The shaping of phylogenic behavior. Journal of the Experimental Analysis of Behavior 24 (1), pp. 117–120. Cited by: §3.6.
  • [117] W. D. Smart and L. P. Kaelbling (2002) Effective reinforcement learning for mobile robots. In Proceedings of the IEEE International Conference on Robotics and Automation ICRA, Vol. 4, pp. 3404–3410. Cited by: §4.3.
  • [118] M. Sridharan, B. Meadows, and R. Gomez (2017) What can I not do? towards an architecture for reasoning about and learning affordances. In Proceedings of the International Conference on Automated Planning and Scheduling, pp. 461–469. Cited by: §3.6, §3.6.
  • [119] C. Stahlhut, N. Navarro-Guerrero, C. Weber, and S. Wermter (2015) Interaction is more beneficial in complex reinforcement learning problems than in simple ones. In Proceedings of the Interdisziplinärer Workshop Kognitive Systeme (KogSys), pp. 142–150. Cited by: §3.2.
  • [120] H. B. Suay, T. Brys, M. E. Taylor, and S. Chernova (2016) Learning from demonstration for shaping through inverse reinforcement learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems AAMAS, pp. 429–437. Cited by: §4.3.
  • [121] H. B. Suay and S. Chernova (2011) Effect of human guidance and state space size on interactive reinforcement learning. In Proceedings of the IEEE International Symposium on Robot and Human Interactive Communication RO-MAN, pp. 1–6. Cited by: §2.1, §2.1.
  • [122] K. Subramanian, C. Isbell, and A. Thomaz (2011) Learning options through human interaction. In IJCAI Workshop on Agents Learning Interactively from Human Teachers (ALIHT), Cited by: §3.2, §3.4.
  • [123] K. Subramanian, C. L. Isbell Jr, and A. L. Thomaz (2016) Exploration from demonstration for interactive reinforcement learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems AAMAS, pp. 447–456. Cited by: §3.1.
  • [124] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §1, §2.1.
  • [125] E. Talvitie and S. P. Singh (2007) An experts algorithm for transfer learning. In Proceedings of the International Joint Conference on Artificial Intelligence IJCAI, pp. 1065–1070. Cited by: 1st item, 5th item, 6th item.
  • [126] A. K. Tanwani and A. Billard (2013) Transfer in inverse reinforcement learning for multiple strategies. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems IROS, pp. 3244–3250. Cited by: 6th item, 7th item.
  • [127] M. E. Taylor, G. Kuhlmann, and P. Stone (2008) Autonomous transfer for reinforcement learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems AAMAS, pp. 283–290. Cited by: §3.3.
  • [128] M. E. Taylor, P. Stone, and Y. Liu (2005) Value functions for RL-based behavior transfer: a comparative study. In Proceedings of the Association for the Advancement of Artificial Intelligence conference AAAI, Vol. 5, pp. 880–885. Cited by: §2.2.
  • [129] M. E. Taylor, P. Stone, and Y. Liu (2007) Transfer learning via inter-task mappings for temporal difference learning. Journal of Machine Learning Research 8 (1), pp. 2125–2167. Cited by: 3rd item, §4.4, §4.4.
  • [130] M. E. Taylor and P. Stone (2009) Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research 10 (7), pp. 1633–1685. Cited by: item 4, §1, §2.1, §2.1, §3.5, §4.4.
  • [131] M. E. Taylor, H. B. Suay, and S. Chernova (2011) Integrating reinforcement learning with human demonstrations of varying ability. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems AAMAS, pp. 617–624. Cited by: Figure 6, §4.3.
  • [132] M. E. Taylor, N. Carboni, A. Fachantidis, I. Vlahavas, and L. Torrey (2014) Reinforcement learning agents providing advice in complex video games. Connection Science 26 (1), pp. 45–63. Cited by: §3.1.
  • [133] M. E. Taylor, S. Whiteson, and P. Stone (2007) Transfer via Inter-Task Mappings in Policy Search Reinforcement Learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems AAMAS, pp. 156–163. Cited by: §4.4, §4.4.
  • [134] M. E. Taylor (2009-03) Assisting Transfer-Enabled Machine Learning Algorithms: Leveraging Human Knowledge for Curriculum Design. In The AAAI 2009 Spring Symposium on Agents that Learn from Human Teachers, Cited by: §4.4.
  • [135] A. C. Tenorio-Gonzalez, E. F. Morales, and L. Villaseñor-Pineda (2010) Dynamic reward shaping: training a robot by voice. In Advances in Artificial Intelligence–IBERAMIA 2010, pp. 483–492. Cited by: §3.3.
  • [136] G. Tesauro (1994) TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation 6 (2), pp. 215–219. Cited by: §1.
  • [137] G. Tesauro (2004) Extending Q-learning to general adaptive multi-agent systems. In Advances in Neural Information Processing Systems, pp. 871–878. Cited by: §3.6.
  • [138] A. L. Thomaz and C. Breazeal (2007) Asymmetric interpretations of positive and negative human feedback for a social learning agent. In Proceedings of the IEEE International Symposium on Robot and Human Interactive Communication RO-MAN, pp. 720–725. Cited by: §1, §2.1, §2.1, §3.2, §3.2, §3.3, §3.4.
  • [139] A. L. Thomaz, G. Hoffman, and C. Breazeal (2006) Reinforcement learning with human teachers: understanding how people want to teach robots. In Proceedings of the IEEE International Symposium on Robot and Human Interactive Communication RO-MAN, pp. 352–357. Cited by: §3.2.
  • [140] A. L. Thomaz, C. Breazeal, et al. (2006) Reinforcement learning with human teachers: evidence of feedback and guidance with implications for learning performance. In Proceedings of the Association for the Advancement of Artificial Intelligence conference AAAI, Vol. 6, pp. 1000–1005. Cited by: §3.1.
  • [141] A. L. Thomaz, G. Hoffman, and C. Breazeal (2005) Real-time interactive reinforcement learning for robots. In AAAI 2005 Workshop on Human Comprehensible Machine Learning, Cited by: §4.2.
  • [142] L. Torrey and M. E. Taylor (2013) Teaching on a Budget: Agents Advising Agents in Reinforcement Learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems AAMAS, Cited by: §3.1.
  • [143] P. Vamplew, C. Foale, and R. Dazeley (2020) A demonstration of issues with value-based multiobjective reinforcement learning under stochastic state transitions. arXiv preprint arXiv:2004.06277. Cited by: 6th item.
  • [144] N. Vlassis, M. Ghavamzadeh, S. Mannor, and P. Poupart (2012) Bayesian reinforcement learning. Reinforcement Learning, pp. 359–386. Cited by: §1, §1.
  • [145] M. A. Wiering and H. Van Hasselt (2008) Ensemble algorithms in reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 38 (4), pp. 930–936. Cited by: §1.
  • [146] E. Wiewiora, G. Cottrell, and C. Elkan (2003) Principled methods for advising reinforcement learning agents. In Proceedings of the International Conference on Machine Learning ICML, pp. 792–799. Cited by: §3.6.
  • [147] H. Xu, R. Bector, and Z. Rabinovich (2020) Teaching multiple learning agents by environment-dynamics tweaks. In AAMAS Adaptive and Learning Agents Workshop ALA 2020, pp. 8. Cited by: §1, §3.6, §3.6.
  • [148] T. Yamagata, R. Santos-Rodríguez, R. McConville, and A. Elsts (2019) Online feature selection for activity recognition using reinforcement learning with multiple feedback. arXiv preprint arXiv:1908.06134. Cited by: §4.5.
  • [149] M. Yang, H. Samani, and K. Zhu (2019) Emergency-response locomotion of hexapod robot with heuristic reinforcement learning using Q-learning. In Proceedings of the International Conference on Interactive Collaborative Robotics, pp. 320–329. Cited by: §4.1.
  • [150] Y. Zhan, H. B. Ammar, and M. E. Taylor (2016-07) Theoretically-Grounded Policy Advice from Multiple Teachers in Reinforcement Learning Settings with Applications to Negative Transfer. In Proceedings of the International Joint Conference on Artificial Intelligence IJCAI, Cited by: §4.2, §4.5.