Active Learning for Autonomous Intelligent Agents: Exploration, Curiosity, and Interaction

03/06/2014
by   Manuel Lopes, et al.
Inria
University of Zaragoza

In this survey we present different approaches that allow an intelligent agent to autonomously explore its environment to gather information and learn multiple tasks. Different communities have proposed different solutions that are, in many cases, similar and/or complementary. These solutions include active learning, exploration/exploitation, online learning and social learning. The common aspect of all these approaches is that it is the agent that selects and decides what information to gather next. Applications of these approaches already include tutoring systems, autonomous grasping learning, navigation and mapping, and human-robot interaction. We discuss how these approaches are related, explaining their similarities and their differences in terms of problem assumptions and metrics of success. We consider that such an integrated discussion will improve inter-disciplinary research and applications.


1 Introduction

One of the most remarkable aspects of human intelligence is its adaptation to new situations, new tasks and new environments. To fulfill the dream of Artificial Intelligence and to build truly Autonomous Intelligent Agents (robots included), it is necessary to develop systems that can adapt to new situations by quickly learning how to behave or how to modify their previous knowledge. Consequently, learning has taken an important role in the development of such systems. This paradigm shift has been motivated by the limitations of other approaches in coping with complex open-ended problems and fostered by the progress achieved in the fields of statistics and machine learning. Since the tasks to be learned are becoming increasingly complex, have to be executed in ever-changing environments and may involve interactions with people or other agents, learning agents are faced with situations that require either a lot of data to model and cover high-dimensional spaces and/or a continuous acquisition of new information to adapt to novel situations. Unfortunately, data is not always easy or cheap to obtain: it often requires a lot of time, energy, computational or human resources, and can be argued to be a limiting factor in the deployment of systems where learning is a key factor.

Consider for instance a robot learning from data obtained during operation. It is common to decouple the acquisition of training data from the learning process. However, the embodiment in this type of system provides a unique opportunity to exploit an active learning (AL) approach [Angluin, 1988, Thrun, 1995, Settles, 2009] to guide the robot's actions towards more efficient learning and adaptation and, consequently, to achieve better performance more rapidly.

Footnote 2: Active learning can also be used to describe situations where the student is involved in the learning process as opposed to passively listening to lectures, see for instance [Linder et al., 2001].

The robot example illustrates the main particularity of learning for autonomous agents: the abstract learning machine is embodied in a (cyber) physical environment and so it needs to find the relevant information for the task at hand by itself. Although these ideas have been around for more than twenty years [Schmidhuber, 1991b, Thrun, 1992, Dorigo and Colombetti, 1994, Aloimonos et al., 1988], in the last decade there has been a renewed interest from different perspectives in actively gathering data during autonomous learning. Broadly speaking, the idea of AL is to use the current knowledge the system has about the task that is currently being learned to select the most informative data to sample. In the field of machine learning this idea has been invigorated by the existence of huge amounts of unlabeled data freely available on the internet or from other sources. Labeling such data is expensive as it requires the use of experts or costly procedures. If similar accuracy can be obtained with less labeled data then huge savings, monetary and/or computational, could be made.

In the context of intelligent systems, another line of motivation and inspiration comes from the field of artificial development [Schmidhuber, 1991b, Weng et al., 2001, Asada et al., 2001, Lungarella et al., 2003, Oudeyer, 2011]. This field, inspired by developmental psychology, tries to understand biological development by creating computational models of the process that biological agents go through during their lifetimes. In such a process there are no clearly defined tasks and the agents have to create their own representations, decide what to learn and create their own learning experiments.

A limiting factor on active approaches is the limited theoretical understanding of some of their processes. Most theoretical results on AL are recent [Settles, 2009, Dasgupta, 2005, Dasgupta, 2011, Nowak, 2011]. A first intuition on why AL might require a smaller number of labeled data is that the system will only ask for data that might change its hypothesis, and so uninformative examples will not be used. Nevertheless, previous research provides an optimistic perspective on the applicability of AL to real applications, and indeed there are already many examples: image classification [Qi et al., 2008], text classification [Tong and Koller, 2001], multimedia [Wang and Hua, 2011], among many others (see [Settles, 2009] for a review). Active learning can also be used to plan experiments in genetics research, e.g. the robot scientist [King et al., 2004] eliminates redundant experiments based on inductive logic programming. Also, most algorithms already have an active extension: logistic regression [Schein and Ungar, 2007], support vector machines [Tong and Koller, 2001], Gaussian processes (GP) [Kapoor et al., 2007], neural networks [Cohn et al., 1996], mixture models [Cohn et al., 1996], inverse reinforcement learning [Lopes et al., 2009b], among many others.

In this paper we take a very broad perspective on the meaning of AL: any situation where an agent (or a team) actively looks for data instead of passively waiting to receive it. This description rules out those cases where a learning process uses data previously obtained in any possible way (e.g. by random movements, or with predefined paths; or by receiving data from people or other agents). Thus, the key property of such algorithms is the involvement of the agent in deciding what information best suits its learning task. There are multiple instances of this wide definition of AL, with sometimes unexplored links. We structure them in three big groups: a) exploration, where an agent explores its environment to learn; b) curiosity, where the agent discovers and creates its own goals; and c) interaction, where the existence of a human-in-the-loop is taken explicitly into account.

Classical Active Learning (AL): refers to a set of approaches in which a learning algorithm is able to interactively query a source of information to obtain the desired outputs at new data points [Settles, 2009].

Optimal Experimental Design (OED): an early perspective on active learning where the design of the experiments is optimal according to some statistical criteria [Schonlau et al., 1998]. It usually does not consider the interactive perspective of sensing.

Learning Problem: refers to the problem of estimating a function, including a policy, from data. The measures of success for such a problem vary depending on the domain. Also known as the pure exploration problem.

Optimization Problem: refers to the problem of finding a particular value of an unknown function from data. When compared with the Learning Problem, it is not interested in estimating the whole unknown function.

Optimization Algorithm: refers to methods to find the maximum/minimum of a given function. The solution to this problem may or may not require learning a model of the function to guide exploration. We distinguish it from the Learning Problem due to its specificities.

Bayesian Optimization: class of methods to solve an optimization problem that use statistical measures of uncertainty about the target function to guide exploration [Brochu et al., 2010].

Optimal Policy: in the formalism of Markov decision processes, the optimal policy is the policy that provides the maximum expected (delayed) reward. We will also use it to refer to any policy, exploration or not, that is optimal according to some criteria.

Exploration Policy: defines the decision algorithm, or policy, that selects which actions are taken during the active learning process. This policy is not, in general, the same as the optimal policy for the learning problem. See a discussion in [Duff, 2003, Şimşek and Barto, 2006, Golovin and Krause, 2010, Toussaint, 2012].

Empirical Measures: class of measures that estimate the progress of learning by measuring empirically how recent data has allowed the learning task to improve.

Table 1: Glossary

1.1 Exploration

Exploration by an agent (or a team of agents) is at the core of rover missions, search and rescue operations, environmental monitoring, surveillance and security, best teaching strategies, online publicity, among others. In all these situations the amount of time and resources for completing a task is limited or unknown. Also, there are often trade-offs to be made between different tasks such as surviving in a hostile environment, communicating with other agents, gathering more information to minimize risk, and collecting and analyzing samples. All these tasks must be accomplished in the end, but the order is relevant inasmuch as it helps subsequent tasks. For instance, collecting geological samples for analysis and communicating the results will be easier if the robot already has a map of the environment. Active strategies are of paramount importance to select the right tasks and actively execute them, maximizing the operation utility while minimizing the required resources or the time to accomplish the goal.

1.2 Curiosity

A more open-ended perspective on learning should consider cases where the task itself is not defined. Humans develop and grow in an open-ended environment without pre-defined goals. Due to this uncertainty we cannot assume that all situations are considered a priori, and the agent itself has to adapt and learn new tasks. Even more problematic is that the tasks faced are so complex that learning them might require the acquisition of new skills.

Recent results from neuroscience have given several insights into visual attention and general information seeking in humans and other animals. Results seem to indicate that curiosity is an intrinsic drive in most animals [Gottlieb et al., 2013]. As in animals with complex behaviors, an initial period of immaturity dedicated to play and learning might allow the agent to develop such skills. This is the main idea of developmental robotics [Weng et al., 2001, Asada et al., 2001, Elman, 1997, Lungarella et al., 2003, Oudeyer, 2011], where the complexity of the problems that the agent is able to solve increases with time. During this period the agent is not solving a task but learning for the sake of learning. This early stage is guided by curiosity and intrinsic motivation [Barto et al., 2004, Schmidhuber, 1991b, Oudeyer et al., 2005, Singh et al., 2005, Schmidhuber, 2006, Oudeyer et al., 2007] and its justification is that it is a skill that will lead to a better adaptation to a large distribution of problems [Singh et al., 2010b].

1.3 Interaction

Learning agents have intensively tackled the problem of acquiring robust and adaptable skills and behaviors for complex tasks from two different perspectives: programming by demonstration (a.k.a. imitation learning) and learning through experience. From an AL perspective, the main difference between these two approaches is the source of the new data. Programming by demonstration is based on examples provided by some external agent (usually a human). Learning through experience exploits the embodiment of the agent to gather examples by itself by acting on the world. In the abstract AL setting from machine learning, the new data/labels used to come from an oracle, and no special regard was given to what exactly the oracle is besides well-behaved properties such as absence of bias and consistency. More recently, data and labels may come from ratings and tagging provided by humans, resulting in bias and inconsistencies. This is also the case for agents interacting with humans, where applications have to take into account where that information comes from and what other sources of information might be exploited. For instance, humans may sometimes more easily provide information other than labels that can further guide exploration.


1.4 Organization

This review will consider AL in this general setting. We will first clarify the AL principles for autonomous intelligent agents in Sec. 2. Then the core review will be organized in three main parts: Sec. 3, AL during self-exploration; Sec. 4, autonomous discovery/creation of goals; and finally Sec. 5, AL with humans.

2 Active Learning for Autonomous Intelligent Agents

In this Section we provide an integrated perspective on the many approaches for active learning. The name active learning has mostly been used in machine learning but here we consider any situation where a learning agent uses its current hypothesis about the learning task to select what/where/how to learn next. Different communities formulated problems with similar ideas and all of them can be useful for autonomous intelligent agents. Different approaches are able to reduce the time, or samples, required to learn but they consider different fitness functions, learning algorithms and choices of what can be selected. Figure 1 shows the three main perspectives on single-task active learning. Exploration in reinforcement learning [Sutton and Barto, 1998], Bayesian optimization [Brochu et al., 2010], multi-armed bandits [Bubeck and Cesa-Bianchi, 2012], curiosity [Oudeyer and Kaplan, 2007], interactive machine learning [Breazeal et al., 2004] and active learning for classification and regression problems [Settles, 2009] all share many properties and face similar challenges. Interestingly, a better understanding of the different approaches from the various communities can lead to more powerful algorithms. Also, in some cases, solving the problem of one community requires relying on the formalism of another. For instance, active learning methods for regression that can synthesize queries need to find the most informative point. This is, in general, an optimization problem in high dimension and it is not possible to solve it exactly. Bayesian optimization methods can then be used to find the best point with a minimum of function evaluations [Brochu et al., 2010]. Another example, still for regression, is to decompose complex regression functions into a set of local regressions and then rely on multi-armed bandit algorithms to balance exploration in a more efficient way [Maillard, 2012].

Figure 1: Different choices in active learning. A robot might choose: to look for the most informative set of sampling locations, ignoring the travel and data acquisition cost and the information gathered on the way there, either by selecting a) among an infinite set of locations or b) by reducing its choices to a pre-defined set of locations; or c) to consider the best path, including the cost and the information gathered on the way.

Each of these topics would benefit from a dedicated survey and we do not aim at a definitive discussion of all the methods. In this section we will discuss all these approaches with the goal of understanding their similarities, strengths and domains of application. Due to the large variety of methods and formalisms we cannot describe the full details and mathematical theory, but we will provide references for most methods. This Section can be seen as a cookbook of active learning methods where all the design choices and tradeoffs are explained jointly with links to the theory and to examples of application (see Figure 2 for a summary).

2.1 Optimal Exploration Problem

To ground the discussion, let us consider a robot whose mission is to build a map of some physical quantities of interest over a region (e.g. obstacles, air pollution, density of traffic, presence of diamonds…). The agent will have a set of on-board capabilities for acting in the environment that will include moving along a path or to a specific position and using its sensors to obtain measurements about the quantity of interest. In addition to this, it may be possible to make decisions about other issues such as what algorithms should be used to process the obtained measurements or to fit the model of the environment. The set of all possible decisions will define the space of exploration policies $\Pi$. To derive an active algorithm for this task, we need to model the costs and the loss function associated with the actions of a specific exploration policy $\pi$. The most common costs include the cost of using each of the on-board sensors (e.g. energy consumption, time required to acquire the measurement or changes in the payload) and the cost of moving from one location to another (e.g. energy and the associated autonomy constraints). Regarding the loss function, it has to capture the error of the learned model w.r.t. the unknown true model. For instance, one may consider the uncertainty of the predictions at each point or the uncertainty on the locations of the objects of interest.

Footnote 4: The concept is similar to the policy in reinforcement learning, but here the policy is not optimizing total reward but, instead, exploration gain (to be defined later).

Figure 2: During autonomous exploration there are different choices that are made by an intelligent agent. These include what the agent selects to explore; how it evaluates its success; and how it estimates the information gain of each choice.

The optimal exploration policy is the one that simultaneously gives the best learned model but with the smallest possible cost:

$$\pi^{*} = \arg\max_{\pi \in \Pi} f\big(\pi,\; C(\pi),\; L_x(\hat{g}(x \in X;\, D \sim \pi))\big) \qquad (1)$$

where $\pi$ is an exploration policy (i.e. a sequence of actions possibly conditioned on the history of states and/or actions taken by the agent), $\Pi$ denotes the space of possible policies, $f$ is a function that summarizes the utility of the policy, and $X$ is a space of points that can be sampled. Function $f$ depends on the policy itself, the cost $C(\pi)$ of executing this policy and the loss of the policy $L_x(\hat{g})$. The loss depends on a function $\hat{g}$ learned with the dataset $D$ acquired following policy $\pi$. Equation 1 selects the best way to act, taking into account the task uncertainty along time. Clearly this problem is, in general, intractable and the following sections describe particular instantiations, approximations and models of this optimal exploration problem [Şimşek and Barto, 2006].

Footnote 5: Note that $\pi$ might have different semantics depending on the task at hand. It can be an exploration policy used to learn a model in a pure learning problem, or it can be an exploitation policy in an optimization setting. For a more detailed description of the relation of the exploration policy with the learning task see [Duff, 2003, Şimşek and Barto, 2006, Golovin and Krause, 2010, Toussaint, 2012].
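To make the structure of Eq. 1 concrete, the following is a minimal sketch under strong simplifying assumptions that are not part of the survey: the world is a 1-D line, a policy is an ordered sequence of sampling locations, the cost is the distance travelled, the loss is a crude coverage proxy (distance from each point of interest to its nearest sample), and the utility is the negative of loss plus weighted cost. The function names and the weight are hypothetical. The policy space is small enough to enumerate, which is exactly what becomes intractable in realistic problems.

```python
from itertools import permutations


def travel_cost(policy, start=0.0):
    """Cost term: total distance travelled when visiting locations in order."""
    cost, position = 0.0, start
    for location in policy:
        cost += abs(location - position)
        position = location
    return cost


def coverage_loss(policy, targets):
    """Loss term (assumed proxy): distance from each target to its nearest sample."""
    return sum(min(abs(t - s) for s in policy) for t in targets)


def utility(policy, targets, cost_weight):
    """Hypothetical utility f: higher is better, trading loss against cost."""
    return -(coverage_loss(policy, targets) + cost_weight * travel_cost(policy))


def best_policy(candidates, targets, budget, cost_weight=0.1):
    """Brute-force argmax over every ordered sequence of `budget` locations."""
    return max(permutations(candidates, budget),
               key=lambda p: utility(p, targets, cost_weight))


candidates = [0.0, 2.0, 5.0, 9.0]
targets = [0.0, 1.0, 2.0, 5.0, 6.0, 9.0]
print(best_policy(candidates, targets, budget=2))
```

The number of candidate sequences grows factorially with the budget, which is one way to see why the general problem calls for the approximations discussed in the following sections.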

Equation 1 is intentionally vague with respect to several crucial aspects of the optimization process. For instance, time is not included in any way, and just the abstract policy and the corresponding policy space are explicit. Also, many different cost and loss models can be fed into the function $f$, with the different choices resulting in different problems and algorithms. It is the aim of this work to build bridges between this general formulation and the problems and solutions proposed in different fields. However, before delving into the different instances of this problem, we briefly describe the three most common frameworks for active learning and then discuss possible choices for the policy space and the role of the terms $C$ and $L$ in the context of autonomous agents.

2.2 Learning Setups

2.2.1 Function approximation

Regression and classification problems are the most common problems in machine learning. In both cases, given a dataset of points $D = \{(x_i, y_i)\}$, the goal is to find an approximation $\hat{g}$ of the input-output relation $g: x \mapsto y$. Typical loss functions are the squared error $|g(x) - \hat{g}(x)|^2$ for regression and the 0-1 loss $I(g(x) \neq \hat{g}(x))$ for classification, with $I(\cdot)$ denoting the indicator function. In this setup the cost function directly measures the cost of obtaining measurements (e.g. collecting the measurement or moving to the next spot), if it exists. The active learning perspective corresponds to deciding for which input $x$ it is most relevant to ask for the corresponding label $y$. Some other restrictions can be included, such as being restricted to a finite set of input points (pool-based active learning) or having the points arrive sequentially and having to decide whether to query or not (online learning) (see [Settles, 2009] for a comprehensive discussion of the different settings).

2.2.2 Multi-Armed Bandits

An alternative formalism that is usually applied to discrete selection problems is the multi-armed bandit (MAB) formalism [Gittins, 1979, Bubeck and Cesa-Bianchi, 2012]. Multi-armed bandits define a problem where a player, at each round, can choose an arm among a set of possible ones. After playing the selected arm the player receives a reward. In the most common setting the goal of the player is to find a strategy that allows it to get the maximum possible cumulative reward. The loss in bandit problems is usually based on the concept of regret, that is, the difference between the reward that was collected and the reward that would have been collected if the player had known which was the best arm from the beginning [Auer et al., 2003]. Many algorithms have been proposed for different variants of the problem where, instead of minimizing regret, the player is tested after a learning period and has either to declare which is the best arm [Gabillon et al., 2011] or the value of all the arms [Carpentier et al., 2011].
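As an illustration of the bandit setting, here is a minimal sketch of one standard index strategy, UCB1, on Bernoulli arms. UCB1 is not described in the text above; it is chosen here only as a well-known example, and the arm means, horizon and seed are made-up values. The regret reported is the expected (pseudo) regret, computed from the true means that the player does not see.

```python
import math
import random


def ucb_index(total_reward, pulls, round_number):
    """Empirical mean plus an exploration bonus that shrinks with more pulls."""
    return total_reward / pulls + math.sqrt(2.0 * math.log(round_number) / pulls)


def ucb1(arm_means, rounds, seed=0):
    """Play `rounds` rounds of UCB1; return pull counts and cumulative pseudo-regret."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    best_mean = max(arm_means)
    regret = 0.0
    for t in range(1, rounds + 1):
        if t <= n_arms:
            arm = t - 1  # initialisation: play every arm once
        else:
            arm = max(range(n_arms),
                      key=lambda a: ucb_index(sums[a], counts[a], t))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0  # Bernoulli payoff
        counts[arm] += 1
        sums[arm] += reward
        regret += best_mean - arm_means[arm]
    return counts, regret


counts, regret = ucb1([0.2, 0.5, 0.8], rounds=500)
print(counts, regret)
```

The exploration bonus is what makes the player return occasionally to arms that currently look poor, which is the exploration/exploitation trade-off in its simplest form.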

2.2.3 MDP

The most general and well-known formalism to model sequential decision processes is the Markov decision process (MDP) [Bellman, 1952]. When there is no knowledge about the model of the environment and an agent has to optimize a reward function while interacting with the environment, the problem is called reinforcement learning (RL) [Sutton and Barto, 1998]. A sequential problem is modeled as a set of states $S$, actions $A$ that allow the system to change between states, and the rewards $r_t$ that the system receives at each time step $t$. The time evolution of the system is considered to depend on the current state $s_t$ and the chosen action $a_t$, i.e. $p(s_{t+1} \mid s_t, a_t)$. The goal of the agent is to find a policy, i.e. $\pi(a \mid s)$, that maximizes the total discounted reward $\sum_t \gamma^t r_t$, where $\gamma$ is the discount factor. For a complete treatment of the topic refer to [Kaelbling et al., 1996, Sutton and Barto, 1998, Szepesvári, 2011, Kober et al., 2013]. As the agent does not know the dynamics and the reward function, it cannot act optimally with respect to the cost function without first exploring the environment for that information. It can then either explicitly create a model of the environment and exploit it [Hester and Stone, 2011, Nguyen-Tuong and Peters, 2011] or directly try to find a policy that optimizes the behavior [Deisenroth et al., 2013]. The balance between the amount of exploration necessary to learn the model and the exploitation of the latter to collect reward is, in general, an intractable problem and is usually called the exploitation-exploration dilemma.
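The exploration-exploitation balance in the model-free case can be sketched with tabular Q-learning and an epsilon-greedy rule. This is a minimal illustration on an invented five-state chain (reward 1 on reaching the right-most state), not a method taken from the survey; the learning rate, discount, epsilon and episode counts are arbitrary assumptions.

```python
import random


def q_learning(n_states=5, episodes=200, alpha=0.5, gamma=0.9,
               epsilon=0.2, max_steps=100, seed=0):
    """Tabular Q-learning on a deterministic chain; actions are 0=left, 1=right."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _episode in range(episodes):
        state = 0
        for _step in range(max_steps):
            if rng.random() < epsilon:
                action = rng.randrange(2)  # explore: random action
            else:
                best = max(q[state])       # exploit: greedy, ties broken at random
                action = rng.choice([a for a in (0, 1) if q[state][a] == best])
            next_state = max(0, state - 1) if action == 0 else state + 1
            reached_goal = next_state == goal
            reward = 1.0 if reached_goal else 0.0
            # Bootstrapped target: no future value beyond the terminal goal state.
            target = reward if reached_goal else gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if reached_goal:
                break
    return q


for state, values in enumerate(q_learning()):
    print(state, [round(v, 3) for v in values])
```

With epsilon set to zero and deterministic tie-breaking the agent could stay at the left end indefinitely, which is a small-scale version of the dilemma described above.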

Partially observable Markov decision processes (POMDPs) generalize the concept to cases where the state is not directly observable [Kaelbling et al., 1998].

2.3 Space of Exploration Policies

The policy space $\Pi$ is defined by all possible sequences of actions that can be taken by the agent or, alternatively, by all the different closed-loop policies that generate such sequences. The simplest approach is to select a single data point from the environment database and use it to improve the model $\hat{g}$. In this case, $\Pi$ is defined by the set of all possible sequences of data points (or the algorithm, or sensor, that is used to select them). Another case is when autonomous agents gather information by moving in the environment. Here, the actions usually include all the trajectories necessary to sample particular locations (or the motion commands that take the agent to them).

Figure 3: Different possible choices available to an exploring agent. Considering an agent in an initial state (grey state), it has to decide where to explore next (information value indicated by the values in the nodes). From the current location all the states might be reachable (left figure), or there might be restrictions and some states might only be reachable after several steps (right figure). In the latter case the agent also has to evaluate what the possible actions are after each move.

However, the formulation of Eq. 1 is much more general and can incorporate any other possible decision to be made by the agent. An agent might try to select particular locations to maximize information or could select at a more abstract level between different regions, e.g. starting to map the beach or the forest. This idea can be pushed further. The agent might decide among different exploration types and request a helicopter survey of a particular location instead of measuring with its own sensors. In this case the robot selects among different exploration types. The agent might even decide between learning methods and representations that, in view of the current data, will behave better, produce more accurate models or result in better performance (see Section 3). This choice modifies the function used to compute the loss and can be changed several times during the learning process.

The following list summarizes the possible choices that have been considered in the literature in the context of active learning:

  • next location, or next locations

  • among a pre-defined partition of the space

  • among different exploration algorithms

  • learning methods

  • representations

  • others

2.4 Cost

The term $C(\pi)$ represents the cost of the policy, and we will assume that each action taken following $\pi$ incurs a cost which is independent of future actions. However, the cost of an action may depend on the history of actions and states of the agent. Indeed, modeling this dependency is an important design decision, especially for autonomous agents. Figure 3 illustrates the implications of this dependency. In the first example, the cost of an action depends only on the action. This is usually the case for costs associated with sensing the environment. In the second case, the cost depends on the previous action since it implies a non-zero-cost motion. This type of cost appears naturally for autonomous agents that need to move from one location to another. In many cases, the cost will consist of a combination of different costs that can individually depend or not on previous actions.

Footnote 6: Action is not precisely defined yet. The previous distinction abuses notation by abstracting over the specific action definition (e.g. local displacements or global coordinates). The important thing is that moving incurs a cost that depends on previous actions.
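The two kinds of cost can be separated in a few lines. The sketch below assumes, purely for illustration, a fixed sensing cost per measurement and a motion cost proportional to the distance from the previous sampling location on a 1-D line; the function name and the cost constants are hypothetical.

```python
def policy_cost(locations, sensing_cost=1.0, motion_cost_per_unit=0.5, start=0.0):
    """Total cost of visiting and measuring at `locations` in the given order."""
    total, previous = 0.0, start
    for location in locations:
        # History-independent part: depends only on the action (taking a measurement).
        total += sensing_cost
        # History-dependent part: depends on where the previous action left the agent.
        total += motion_cost_per_unit * abs(location - previous)
        previous = location
    return total


# Same set of samples, different order, different cost.
print(policy_cost([1.0, 2.0, 5.0]))
print(policy_cost([1.0, 5.0, 2.0]))
```

Because the motion term depends on the ordering, the same set of measurements can be cheap or expensive, which is why the selection of sampling points and the planning of the path cannot always be decoupled.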

2.5 Loss and Active Learning Tasks

Choice / Problem | Optimization | Learning
Point | Bayesian Optimization [Brochu et al., 2010] | Classical Active Learning [Settles, 2009]
Discrete tasks | Multi-armed bandits [Auer et al., 2003] | AL for MAB [Carpentier et al., 2011]
Trajectory | Exploration/Exploitation [Kaelbling et al., 1996] | Exploration
Table 2: Taxonomy of active learning

The term $L$ represents the loss incurred by the exploration policy. Recall that the agent's objective is to learn a model $\hat{g}$. The loss is therefore defined as a function of the goodness of the learned model. Obviously, the function varies with the task. It can be a discriminant function for classification, a function approximation for some quantity of interest or a policy mapping states to actions. In any case, the learned function will be determined by the flow of observations $D$ induced by the policy $\pi$ (e.g. training examples for a classifier or measurements of the environment to build a map).

Another important aspect that must be considered is when the loss is evaluated. One possibility is that only the final learned model is used to obtain the expected loss. In this case, mistakes made during training are not taken into account. Alternatively, one may consider the accumulated loss during the whole lifetime of the agent, where even the cost and errors made during the learning phase are taken into account. In this second setting one can also consider that no explicit learning phase exists. In the MAB literature these measures are known as the simple regret and the average regret. The latter tells, in hindsight, how much was lost by not always pulling the best arm, while the former tells how good the arm estimated as being the best actually is.
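The difference between the two evaluation schemes can be seen on a toy bandit. The sketch below uses made-up arm means and a uniform exploration schedule; the function names are illustrative only.

```python
def cumulative_regret(arm_means, pulls):
    """Expected reward lost over the whole run by not always pulling the best arm."""
    best = max(arm_means)
    return sum(best - arm_means[arm] for arm in pulls)


def average_regret(arm_means, pulls):
    """Cumulative regret divided by the number of pulls."""
    return cumulative_regret(arm_means, pulls) / len(pulls)


def simple_regret(arm_means, recommended_arm):
    """Gap between the best arm and the arm recommended after learning."""
    return max(arm_means) - arm_means[recommended_arm]


arm_means = [0.25, 0.5, 0.75]
pulls = [0, 1, 2, 0, 1, 2]  # uniform exploration, no exploitation at all

print(cumulative_regret(arm_means, pulls))         # 1.5
print(average_regret(arm_means, pulls))            # 0.25
print(simple_regret(arm_means, recommended_arm=2)) # 0.0
```

Uniform exploration pays a constant regret at every round, yet it can still end with the correct recommendation and therefore zero simple regret; which of the two matters depends on whether errors during learning are charged to the agent.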

Earlier on, we did not make explicit what the loss function aims to capture during the learning process. Again, there are two possible generic options to consider: learn the whole environment (what we consider to be a pure learning problem); or find a location of the environment that provides the highest value (optimization problem). Note that in both cases, it is necessary to learn a model $\hat{g}$ of the unknown true function $g$. However, in the first case we are interested in minimizing the error of the learned model

$$\int_{x \in X} \left| g(x) - \hat{g}(x;\, D \sim \pi) \right| \, dx \qquad (2)$$

while in the second case we are just interested in fitting a function $\hat{g}$ that helps us to find the maximum of $g$

$$\max_{x} g(x) - g\big(\arg\max_{x} \hat{g}(x;\, D \sim \pi)\big) \qquad (3)$$

irrespective of what the function $\hat{g}$ is actually approximating. In a multi-armed bandit setting this amounts to either just detecting which is the best arm or learning the payoff of all the arms. Table 2 summarizes this perspective. In this pure learning version of the multi-armed bandit problem, bounds on the simple regret can also be established [Carpentier et al., 2011, Gabillon et al., 2011]. For the general RL problem regret bounds have also been established [Jaksch et al., 2010].

2.6 Measures of Information

The utility of the policy in Eq. 1 is measured using a function $f$. Computing the information gain of a given sample is a difficult task which can be computationally very expensive or intractable. Furthermore, it can be implemented in multiple different ways depending on how the information is defined and on the assumptions and decisions made in terms of loss, cost and representation. Also, we note that in some cases, due to interdependencies between all the points, the order in which the samples are obtained might be relevant. The classification below follows the one proposed in [Settles, 2009] (also refer to [MacKay, 1992] for further details) and completes it by including empirical measures as a different way of assessing the information gain of a sample. The latter class of measures aims to cover those cases where there is no single model that covers the whole state-space, or where the agent lacks the knowledge to select which is the best one [Schmidhuber, 1991b, Oudeyer and Kaplan, 2007].

2.6.1 Uncertainty sampling and Entropy

The simplest way to select the new sample is to pick the one we are currently most uncertain about. Formally, this uncertainty can be modeled as the entropy of the output. In uncertainty sampling the query is made where the classifier is most uncertain [Lewis and Gale, 1994]; the approach is still used with support vector machines [Tong and Koller, 2001] and logistic regression [Schein and Ungar, 2007], among others.
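As a minimal sketch of entropy-based uncertainty sampling, the snippet below picks the pool item whose predicted label distribution has the highest Shannon entropy. The names `entropy`, `most_uncertain` and the example probabilities in `pool_probs` are illustrative assumptions, not taken from the cited works.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def most_uncertain(pool_probs):
    """Index of the pool item whose predicted label distribution has maximal entropy."""
    return max(range(len(pool_probs)), key=lambda i: entropy(pool_probs[i]))


# Hypothetical class probabilities predicted by some classifier for three unlabeled points.
pool_probs = [[0.95, 0.05], [0.55, 0.45], [0.80, 0.20]]
query_index = most_uncertain(pool_probs)  # the point closest to a 50/50 prediction
```

In a full active learning loop the selected point would be labeled, added to the training set, and the classifier refit before the next query.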

2.6.2 Minimizing the version space

The version space defines the subset of all possible models (or parameters of a model) that are consistent with the current samples and, therefore, provides the set of hypotheses we are still undecided about. This space cannot in general be computed, and it has been approximated in many different ways. An initial model considered Selective Sampling [Cohn et al., 1994], where a pool, or stream, of unlabeled examples exists and the learner may request labels from an oracle. The goal was to minimize the amount of labeled data needed to learn the concepts to a fixed accuracy. Query by committee [Seung et al., 1992, Freund et al., 1997] considers a committee of classifiers and measures the degree of disagreement among the committee members. Another perspective was proposed by [Angluin, 1988] to find the correct hypothesis using membership queries. In this method the learner has a class of hypotheses and has to identify the correct hypothesis exactly. Perhaps the best-studied approach of this kind is learning by queries [Angluin, 1988, Cohn et al., 1994, Baum, 1991]. Under this setting, several approaches have generalized methods based on binary search [Nowak, 2011, Melo and Lopes, 2013]. Also, active learning in support vector machines can be seen from a version space perspective or in terms of the uncertainty of the classifier [Tong and Koller, 2001].
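The query-by-committee idea can be sketched as follows, using vote entropy as the disagreement measure. This is only one of several possible disagreement measures, and the committee of threshold classifiers, the pool and all function names are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the label votes cast by the committee for a single point."""
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())


def select_query(committee, pool):
    """Pool point on which the committee members disagree the most."""
    return max(pool, key=lambda x: vote_entropy([h(x) for h in committee]))


# Hypothetical committee: threshold classifiers on a line, standing in for
# hypotheses that are all consistent with the data labeled so far.
committee = [lambda x, t=t: int(x >= t) for t in (0.3, 0.5, 0.7)]
pool = [0.1, 0.4, 0.9]
query = select_query(committee, pool)  # only 0.4 splits the committee
```

Querying the point with maximal disagreement removes, whatever the answer, some of the hypotheses still in the (approximated) version space.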

2.6.3 Variance reduction

Variance reduction aims to select the sample(s) that will minimize the variance of the estimation for unlabeled samples [Cohn et al., 1996]. There exist closed form solutions for some specific regression problems (e.g. linear regression or Gaussian mixture models). In other cases, the variance is computed over a set of possible unlabeled examples which may be computationally expensive. Finally, there are other decision-theoretic based measures such as the expected model change [Settles et al., 2007] or the expected error reduction [Roy and McCallum, 2001, Moskovitch et al., 2007] which select the sample that, in expectation, will result in the largest change in the model parameters or in the largest reduction in the generalization error, respectively.
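For the linear regression case, the closed form can be illustrated with a small sketch. It assumes a model that is linear in the features (1, x) with Gaussian noise and a small ridge term, so that the predictive variance at a point is proportional to a quadratic form in the inverse of the regularized design matrix. All names and the example data are assumptions made for illustration.

```python
def precision(xs, ridge=1e-3):
    """Entries (a, b, d) of the symmetric 2x2 matrix ridge*I + sum_i phi_i phi_i^T, with phi = (1, x)."""
    a, b, d = ridge, 0.0, ridge
    for x in xs:
        a += 1.0
        b += x
        d += x * x
    return a, b, d


def predictive_variance(prec, x):
    """phi^T A^{-1} phi for phi = (1, x); the variance up to the unknown noise scale."""
    a, b, d = prec
    det = a * d - b * b
    return (d - 2.0 * b * x + a * x * x) / det


def best_query(labeled_xs, candidates, reference):
    """Candidate whose inclusion minimizes the average predictive variance over the reference set."""
    def average_variance(c):
        prec = precision(labeled_xs + [c])
        return sum(predictive_variance(prec, r) for r in reference) / len(reference)
    return min(candidates, key=average_variance)


# Two labeled inputs close together; which new input helps most over [0, 1]?
chosen = best_query([0.0, 0.1], candidates=[0.05, 1.0],
                    reference=[0.0, 0.25, 0.5, 0.75, 1.0])
```

Note that the selection depends only on the input locations, not on the labels, which is what makes the linear case tractable in closed form.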

2.6.4 Empirical Measures

Empirical measures make fewer assumptions on the data-generating process and instead estimate empirically the expected quality of each data point or region [Schmidhuber, 1991b, Schmidhuber, 2006, Oudeyer and Kaplan, 2007, Oudeyer et al., 2007, Lopes et al., 2012]. These measures address problems where (parts of) the state space have properties that change over time, cannot be learned accurately, or require much more data than other parts given a particular learning algorithm. Efficient learning in those situations requires balancing exploration so that resources are assigned according to the difficulty of the task. In those cases where this prior information is available, it can be directly incorporated in the previous methods, although the increase in complexity may result in computationally expensive algorithms. When the underlying structure is completely unknown, it might be difficult to find proper models that take into account all the uncertainty. And even when there is a generative model that explains the data, its complexity will be very high.

Let us use a simple analogy to illustrate the main idea behind empirical measures. Signal theory tells us what sampling rate is required to accurately reconstruct a signal with a limited bandwidth. To estimate several signals, an optimal allocation of sampling resources would require knowledge of each signal's bandwidth. Without this knowledge, it is necessary to estimate simultaneously the signal and the optimal sampling rate, see Figure 6. Although for this simple case one can imagine how to create such an algorithm, the formalization of more complex problems might be difficult. Indeed, in real applications it is quite common to encounter similar problems. For instance, a robot might be able to recover the map in most parts of the environment but fail in the presence of mirrors. Or a visual attention system might end up spending most of its time looking at a TV set showing static.

The first attempt to develop empirical measures was made by [Schmidhuber, 1991a, Schmidhuber, 1991b], in which an agent could model its own expectation about how future experiences can improve model learning. After this seminal work, several measures to empirically estimate how data can improve task learning have been proposed, and an integrated view can be found in [Oudeyer et al., 2007]. Note that if there is an accurate generative model of the data, then empirical measures reduce to standard methods; see for instance the generalization of the R-max method [Brafman and Tennenholtz, 2003] to the use of empirical measures in [Lopes et al., 2012].

Figure 4: Intrinsic motivation systems rely on empirical measures of learning progress to select actions that promise higher learning gains. Instead of considering complex statistical generative models of the data, the actual results obtained by the learning system are tracked and used to create an estimator of the learning progress. From [Oudeyer et al., 2007].

In more concrete terms, empirical measures rely not on the statistical properties of a generative data model, but on tracking the evolution of the quality of estimation, see Figure 4.

A simple empirical measure of learning progress can be made by estimating the variation of the estimated prediction error. If we consider a loss model for the learning problem as: , where is the true model and is the observed data. Putting an absolute threshold directly on the loss is hard. Note that the predictive error has the entropy of the true distribution as a lower bound, which is unknown [Cohn et al., 1996]. Therefore, these methods drive exploration based on the learning progress instead of the current learner accuracy. Using the change in loss they may gain robustness by becoming independent of the loss’ absolute value and can potentially detect time-varying conditions [Oudeyer et al., 2007, Lopes et al., 2012].

We can define in terms of the change in the (empirically estimated) loss as follows. Let denote the experiences in except the last and is the transition model learned from the reduced data-set . We define . This estimates to which extent the last experiences help to learn a better model as evaluated over the complete data. Thus, if is small, then the last visitations in the data-set did not have a significant effect on improving . To note that finding a good estimator for the expected loss is not trivial and resampling methods might be required [Lopes et al., 2012]. See also [Oudeyer et al., 2007] for different definitions of learning progress.
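The change-in-loss idea described above can be sketched as follows. The sketch assumes a deliberately trivial learner (a mean predictor) and a mean squared error loss evaluated on the complete data. The exact estimator used in [Lopes et al., 2012] may differ, and, as the text notes, resampling may be needed for a reliable loss estimate. All names are hypothetical.

```python
def fit_mean(ys):
    """Trivial 'model': predict the mean of the observed targets."""
    return sum(ys) / len(ys)


def mse(prediction, ys):
    """Mean squared error of a constant prediction over the data."""
    return sum((y - prediction) ** 2 for y in ys) / len(ys)


def learning_progress(ys, k):
    """Loss of the model learned without the last k samples minus the loss of the
    model learned from all samples, both evaluated over the complete data."""
    if not 0 < k < len(ys):
        raise ValueError("k must satisfy 0 < k < len(ys)")
    reduced_model = fit_mean(ys[:-k])
    full_model = fit_mean(ys)
    return mse(reduced_model, ys) - mse(full_model, ys)


# The last two observations differ from the earlier ones, so they were informative.
informative = learning_progress([1.0, 1.0, 1.0, 1.0, 3.0, 3.0], k=2)
# Nothing new in the last two observations: no progress.
redundant = learning_progress([2.0] * 6, k=2)
```

A value close to zero suggests that further sampling of the same region is unlikely to improve the model, regardless of how large the absolute loss still is.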

2.7 Solving strategies

The optimal exploration problem defined in Eq. 1 is, in its most general case, computationally intractable. Note that we aim at finding an exploration policy, or an algorithm, that is able to minimize the amount of data required while minimizing the loss. In Fig. 1 that would amount to choosing, among all the possible trajectories of equivalent cost, the ones that provide the best fit. Furthermore, common statistical learning theory does not directly apply to most active learning algorithms and it is difficult to obtain theoretical guarantees about their properties. The main reason is that most theory on learning relies on the assumption that data is acquired randomly, i.e. the training data comes from the same distribution as the real data, while in active learning the agent itself chooses the next data point.

2.7.1 Theoretical guarantees for binary search

Despite the previous remarks, there are several cases where it is possible to show that active learning provides a gain and to obtain some guarantees. [Castro and Novak, 2008, Balcan et al., 2008] identify the expected gains that active learning can give in different classes of problems. For instance, [Dasgupta, 2005, Dasgupta, 2011] studied the problem of actively finding the optimal threshold on a line for a separable classification problem. A binary search applied to this problem yields an exponential gain in sample efficiency. Under what conditions, and for which problems, this gain still holds is currently under study. As discussed by the authors, in the worst case it might still be necessary to classify the whole dataset to identify the best possible classifier. However, if we consider the average case and the expected learning quality for finite sample sizes, results show that we can get exponential improvements over random exploration. Indeed, other authors have shown that generalized binary search algorithms can be derived for more complex learning problems [Nowak, 2011, Melo and Lopes, 2013].
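The threshold example can be made concrete with a short sketch. It assumes noise-free labels that switch from 0 to 1 exactly once along a sorted pool. The function name `find_threshold` and the oracle are hypothetical.

```python
def find_threshold(xs, oracle):
    """Locate the first positive point in a sorted pool using label queries.

    `oracle(x)` returns 0 below the unknown threshold and 1 at or above it.
    Returns the index of the first positive point and the number of queries made.
    """
    lo, hi = 0, len(xs)
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(xs[mid]) == 1:
            hi = mid        # threshold is at mid or to its left
        else:
            lo = mid + 1    # threshold is strictly to the right
    return lo, queries


xs = [i / 1000 for i in range(1000)]
first_positive, n_queries = find_threshold(xs, lambda x: int(x >= 0.3141))
```

With 1000 pool points the search needs about ten label queries, whereas labeling points at random would need on the order of the whole pool to pin down the threshold with the same precision.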

2.7.2 Greedy methods

Many practical solutions are greedy, i.e. they only look at directly maximizing a function. We note the difference between a greedy approach that directly maximizes a function and a myopic approach that ignores the long-term effects of those choices. As we discuss now, there are cases where greedy methods are not myopic. The question is how far greedy solutions are from the optimal exploration strategy. This is in general a complex combinatorial problem. If the loss function being minimized has some structural properties, then some guarantees can be found that relate the sample complexity of a given algorithm to that of the best possible polynomial-time algorithm. Under this approach the submodular property has been extensively used [Krause and Guestrin, 2005, Golovin et al., 2010b, Golovin and Krause, 2010, Maillard, 2012]. Submodular functions are functions that obey the diminishing returns property, i.e. if $A \subseteq B$ then, for any element $x \notin B$, $F(A \cup \{x\}) - F(A) \geq F(B \cup \{x\}) - F(B)$. This means that choosing a datapoint sooner during the optimization will always provide equal or more information than the same point later on.

A theorem from [Nemhauser et al., 1978] says that for monotonic submodular functions, the value of the function for the set obtained with the greedy algorithm is close to the value of the optimal set of the same size: it is at least a fraction $(1 - 1/e)$ of it. This means that even if we could solve the combinatorial problem exactly, the solution we get with the greedy algorithm is at most a fraction $1/e$ (about 37%) below the true optimal solution.
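A small sketch of the greedy algorithm on a monotone submodular function is given below, using set coverage as the objective, a standard textbook example of submodularity. The candidate sets and function names are made up for illustration.

```python
def coverage(selected, sets):
    """Number of distinct elements covered by the selected sets (monotone and submodular)."""
    covered = set()
    for i in selected:
        covered |= sets[i]
    return len(covered)


def greedy_max_coverage(sets, budget):
    """Repeatedly add the set with the largest marginal coverage gain."""
    chosen = []
    for _ in range(min(budget, len(sets))):
        remaining = [i for i in range(len(sets)) if i not in chosen]
        best = max(
            remaining,
            key=lambda i: coverage(chosen + [i], sets) - coverage(chosen, sets),
        )
        chosen.append(best)
    return chosen


# Hypothetical sensing regions: each candidate observation covers some locations.
sets = [{1, 2, 3, 4}, {3, 4, 5, 6}, {1, 2, 5, 6, 7}, {7, 8}]
chosen = greedy_max_coverage(sets, budget=2)
greedy_value = coverage(chosen, sets)
```

In this toy instance the greedy choice happens to match the best pair of sets; in general the guarantee is only the $(1 - 1/e)$ fraction of the optimum.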

Unfortunately not all problems are submodular. First, some target functions are not submodular. Second, online learning methods introduce bias since the order of the data changes the active learning results. Third, some problems cannot be solved using a greedy approach. For these problems a greedy algorithm can be exponentially bad (worse than random exploration). Also, a common situation is to have problems that are submodular given some unknown parameters, without which it is not possible to use the greedy algorithm. In this situation it is necessary to adopt an exploration/exploitation strategy: explore the parameter space to gather information about the properties of the loss function, and then exploit it.

2.7.3 Approximate Exploration

The most general case, as shown in Figure 1, is not submodular and the best solutions rely on PAC bounds. Two of the most influential works on the topic are $E^3$ [Kearns and Singh, 2002] and R-max [Brafman and Tennenholtz, 2003]. Both take into account how often a state-action pair has been visited to decide if further exploration is needed or if the model can be trusted enough (in a PAC setting) to be used for planning purposes. With different technical details, both algorithms guarantee that with high probability the system learns a policy whose value is close to the optimal one. Some other approaches consider limited look-ahead planning to approximately solve this problem [Sim and Roy, 2005, Krause and Guestrin, 2007].
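The visit-count mechanism shared by these algorithms can be sketched as below. This is a strongly simplified illustration in the spirit of R-max, not the published algorithm: it only shows how under-visited state-action pairs receive an optimistic reward, and it omits the model of transitions and the planning step. The class name, its methods and the threshold value are hypothetical.

```python
from collections import defaultdict


class OptimisticRewardModel:
    """Count-based bookkeeping: trust the empirical model only after enough visits."""

    def __init__(self, known_threshold, r_max):
        self.known_threshold = known_threshold  # visits required before a pair is 'known'
        self.r_max = r_max                      # upper bound on the reward
        self.counts = defaultdict(int)
        self.reward_sums = defaultdict(float)

    def update(self, state, action, reward):
        """Record one observed transition reward."""
        self.counts[(state, action)] += 1
        self.reward_sums[(state, action)] += reward

    def is_known(self, state, action):
        return self.counts[(state, action)] >= self.known_threshold

    def planning_reward(self, state, action):
        """Reward handed to the planner: empirical mean if known, optimistic otherwise."""
        key = (state, action)
        if self.is_known(state, action):
            return self.reward_sums[key] / self.counts[key]
        return self.r_max


model = OptimisticRewardModel(known_threshold=3, r_max=1.0)
for reward in (0.2, 0.4):
    model.update("s0", "left", reward)
optimistic = model.planning_reward("s0", "left")  # two visits: still unknown
model.update("s0", "left", 0.3)
empirical = model.planning_reward("s0", "left")   # three visits: now known
```

Planning with these optimistic rewards drives the agent towards pairs it has not visited enough, which is what yields exploration without an explicit exploration bonus schedule.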

2.7.4 No-regret

In the domain of multi-armed bandits several algorithms have been developed that can solve the optimization [Victor Gabillon et al., 2011] or the learning [Carpentier et al., 2011] problem with the best possible regret, sometimes taking into account specific knowledge about the statistical properties of each arm, but in many cases taking a distribution-free approach [Auer et al., 2003].

3 Exploration

In this section we present the main approaches to active learning, focusing in particular on systems with physical restrictions, i.e. where the cost depends on the state. This section organizes the literature according to what is being selected as the policy for exploration. The distinctions are not clear in some cases, and some works include aspects of more than one problem, or can be seen from different perspectives. We consider three different parts: the first two cover the greedy selection of points, either among an infinite set of points or among a finite set; the last part considers the cases where the selection explicitly takes into account longer time horizons. There is already a great variety of approaches, but the division mainly corresponds to classical active learning, multi-armed bandits and exploration-exploitation in reinforcement learning. We are interested in applications related to general autonomous agents and will only consider approaches focused on the classical active learning methods if they provide a novel idea.

3.1 Single-Point Exploration

This section describes works that, at each time step, choose the single best observation point to explore without any explicit long-term planning. This is the most common setting in active learning for function approximation problems [Settles, 2009], with examples including vehicle detection [Sivaraman and Trivedi, 2010] and object recognition [Kapoor et al., 2007], among others. Note that, as seen in Section 2.7, in some cases information measures were defined where a greedy choice is (quasi-)optimal. Figure 5 provides an example of this setting, where a robot is able to try to grasp an object at any point to learn the probability of success. At each new trial the robot can still choose amongst the same (infinite) set of grasping points.

Figure 5: Approximating a sinusoidally varying p in a one-dimensional input space, representing a robot actively learning which object locations afford a more successful grasp. (a) Robotic setup. (b) Estimated mean. The blue points are the observations generated from a Bernoulli experiment, using the true p (blue line). Failures are represented by crosses and successes by circles. The red line with marks is the approximated mean computed from the posterior. (c) Predicted posterior beta distributions for each point along x. From [Montesano and Lopes, 2012].

3.1.1 Learning reliability of actions

An example of the use of active learning under this setting, and of particular interest for physical systems, is to learn the reliability of actions. For instance, it has been suggested that grasping could be addressed by learning a function that relates a set of visual features with the probability of grasp success when a robot tries to grasp at those points [Saxena et al., 2006]. This process requires a large database of synthetically generated grasping points (as initially suggested by [Saxena et al., 2006]), or alternatively an active search that selects where to apply grasping actions to estimate their success [Salganicoff et al., 1996, Morales et al., 2004]. Another approach, proposed by [Montesano and Lopes, 2009, Montesano and Lopes, 2012] (see also Figure 5), derived a kernel-based algorithm to predict the probability of a successful grasp together with its uncertainty based on Beta priors. Another approach used Gaussian processes to directly model probability densities of successful grasps [Detry et al., 2009]. Clearly such success probabilities depend on the grasping policy being applied, and a combination of the two will be required to learn the best grasping strategy [Kroemer et al., 2009, Kroemer et al., 2010].

Another example is to learn several terrain properties in mobile robots, for tasks such as obstacle detection and terrain classification. [Dima et al., 2004] use active learning, based on density measures, to request from human users the correct labels for extensive datasets acquired by robots. Multiview approaches have also been used [Dima and Hebert, 2005]. Another property exploited by other authors is the traversability of given regions [Ugur et al., 2007].

A final example considers how to optimize the parameters of a controller whose results can only be evaluated as success or failure [Tesch et al., 2013]. The authors rely on Bayesian optimization to select which parameters are still expected to provide higher probabilities of success.

3.1.2 Learning general input-output relations

Several works explore different ways to learn input-output maps. A simple case is to learn forward or inverse kinematic or dynamical models of robots, but the map can also describe the effects of time-extended policies such as walking.

To learn the dynamical model of a robot, [Martinez-Cantin et al., 2010] considered how to select which measurement to gather next to improve the model. The authors consider a model parameterized by the location and orientation of a rigid body and their goal is to learn such parameters as fast as possible. They rely on uncertainty measures such as A-optimality.

For non-parametric models several works learn different models of the robot kinematics, using either nearest-neighbors [Baranes and Oudeyer, 2012] or local-linear maps [Rolf et al., 2011]. Empirical measures of learning progress were used by [Baranes and Oudeyer, 2012] and [Rolf et al., 2011].

3.1.3 Policies

Another example is to learn what action to apply in any given situation. In many cases this is learned from user input. This setting will be discussed in detail in Section 5.3.

[Chernova and Veloso, 2009] consider support vector machines as the classification method. The authors use the confidence of the SVM prediction: while the robot is moving, it queries the teacher whenever that confidence is low.

Under the formalism of inverse reinforcement learning, queries are made to a user that allow the learner to infer the correct reward [Lopes et al., 2009b, Melo and Lopes, 2010, Cohn et al., 2010, Cohn et al., 2011, Judah et al., 2012]. Initial sample complexity results show that these approaches can indeed provide gains in the average case [Melo and Lopes, 2013].

3.2 Multi-Armed Bandits

This section discusses works that, similarly to the previous section, solely choose a single exploration point. The main difference is that we consider here the setting where this choice is discrete, or categorical. There are several learning problems that fall under this setting: environmental sensing and online sensor selection, multi-task learning, online selection of learning/exploration strategy, among others (see Table 3).

There are two main origins for this discrete set of choices. One is that the problem is intrinsically discrete. For instance, the system may be able to select among a set of different sensors or different learning algorithms [Baram et al., 2004, Hoffman et al., 2011, Hester et al., 2013], or may be interested in learning from among a set of discrete tasks [Barto et al., 2004]. Another case is when the discretization is made to simplify the exploration problem in a continuous space, reducing the cases presented in Section 3.1 to a MAB problem. Examples include environmental sensing where the state is partitioned for computational purposes [Krause et al., 2008], or learning dynamical models of robots where the partition is created online based on the similarities of the function properties at each location [Oudeyer et al., 2005, Baranès and Oudeyer, 2009] (see Figure 6). In all cases the goal is to learn a function over the whole domain by learning a function in each partial domain, or to learn the relation of all the choices with their outputs. The best overall learning must be obtained within a limited time horizon.

In the recently introduced strategic student problem [Lopes and Oudeyer, 2012], the authors provide a unified view of these problems, following a computational approach similar to [Baram et al., 2004, Hoffman et al., 2011, Baranes and Oudeyer, 2012]. Given a finite set of different possible choices that can be explored, both problems can be approached in the same way, relying on variants of the EXP4 algorithm [Auer et al., 2003]. This algorithm considers adversarial bandit settings and relies on a collection of experts. The algorithm has zero regret on the choice of experts and each expert will track the recent quality of each choice.
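To give a flavour of this family of algorithms, the sketch below implements EXP3, the simpler expert-free relative of EXP4 from the same adversarial bandit literature. It is swapped in here for brevity and is not the variant used in the strategic student work. Rewards are assumed to lie in [0, 1], and `reward_fn`, `gamma` and the function names are illustrative assumptions.

```python
import math
import random


def exp3_probabilities(weights, gamma):
    """Mix the normalized weights with uniform exploration."""
    total = sum(weights)
    k = len(weights)
    return [(1.0 - gamma) * w / total + gamma / k for w in weights]


def exp3_update(weights, arm, reward, gamma):
    """Exponential update of the pulled arm using an importance-weighted reward."""
    probs = exp3_probabilities(weights, gamma)
    estimate = reward / probs[arm]  # unbiased estimate of the arm's reward
    new_weights = list(weights)
    new_weights[arm] *= math.exp(gamma * estimate / len(weights))
    return new_weights


def run_exp3(reward_fn, n_arms, n_rounds, gamma=0.1, seed=0):
    """Play n_rounds and return the final arm-selection probabilities."""
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    for t in range(n_rounds):
        probs = exp3_probabilities(weights, gamma)
        arm = rng.choices(range(n_arms), weights=probs)[0]
        weights = exp3_update(weights, arm, reward_fn(arm, t), gamma)
    return exp3_probabilities(weights, gamma)


# Hypothetical problem: only arm 1 (e.g. one task or strategy) ever yields reward.
final_probs = run_exp3(lambda arm, t: 1.0 if arm == 1 else 0.0, n_arms=2, n_rounds=500)
```

In a pure-exploration use, the reward passed to the update would be a measure of learning improvement, which changes over time; the uniform mixing term keeps every choice sampled occasionally so that such changes can be detected.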

We note that most algorithms for MAB were defined for the exploration-exploitation setting, but there are cases where there is a pure-exploration problem. The main difference is that if we define the learning improvement as reward, this reward will change with time, as sampling the same location will reduce its value. It is worth noting that if the reward function were known then most of these cases could be reduced to a submodular optimization where a greedy heuristic is quasi-optimal. When this is not the case, a MAB algorithm must be used to ensure proper exploration of all the arms [Lopes and Oudeyer, 2012, Golovin et al., 2010a].

One interesting aspect to note is that, in most cases, the optimal strategy is non-stationary. That is, for different time instants, the percentage of time applied to each choice is different. We can see that there is a developmental progression from learning simpler topics to more complex ones. In extreme cases, when little time is available, some choices are not studied at all. These results confirm the heuristics of learning progress given by [Schmidhuber, 1991b, Oudeyer et al., 2007]. Both works considered that at any time instant the learner must sample the task that has given the largest benefit in the recent past. For the case at hand we can see that the solution is to probe, at any time instant, the task whose learning curve has the highest derivative, and for smooth learning curves both criteria are equivalent.
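The learning-progress heuristic can be sketched as a comparison of recent error averages per task. The window-based estimate of the learning-curve slope, the rule of trying tasks with too little history first, and all names and numbers are assumptions made for illustration.

```python
def recent_progress(errors, window):
    """Decrease of the mean error between the previous window and the latest one."""
    if len(errors) < 2 * window:
        return float("inf")  # too little history: try this task first
    older = errors[-2 * window:-window]
    newer = errors[-window:]
    return sum(older) / window - sum(newer) / window


def pick_task(error_histories, window=2):
    """Task whose learning curve is currently decreasing the fastest."""
    return max(error_histories, key=lambda name: recent_progress(error_histories[name], window))


# Hypothetical prediction-error histories for three tasks.
error_histories = {
    "already_mastered": [0.1, 0.1, 0.1, 0.1],
    "learnable": [0.9, 0.8, 0.6, 0.5],
    "unlearnable_noise": [0.5, 0.5, 0.5, 0.5],
}
next_task = pick_task(error_histories)
```

Both the mastered task and the unlearnable one show no progress and are skipped, even though the latter still has a high error. A purely greedy selection like this one can miss tasks whose progress changes later, which is why the text recommends a bandit algorithm to keep all choices explored.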

Figure 6: An example of a regression problem where the properties of the function to be learned vary along space. An optimal sampling of such a signal will be non-uniform and could be computed efficiently if the signal properties were known. Without such information, exploration strategies must be devised that simultaneously learn the properties of the signal and sample it efficiently. See [Lopes and Oudeyer, 2012] for a discussion. From [Oudeyer and Kaplan, 2007].

We will now present some works that do active exploration by selecting among a finite set of choices. We divide the approaches in terms of choosing different (sub-) tasks or different strategies to explore, or learn a single task. Clearly this division depends on different nomenclatures and on how the problems are formulated.

3.2.1 Multiple (Sub-)Tasks

In this case we consider that there is a set of possible choices to be made, each corresponding to learning a different (sub-)task. This set can be pre-defined, or acquired autonomously (see Section 4), to have a large dictionary of skills that can be used in different situations or to create complex hierarchical controllers [Barto et al., 2004, Byrne, 2002].

Multi-task problems have been considered in classification tasks [Qi et al., 2008, Reichart et al., 2008]. Here active learning methods are used to improve not only one task, but the overall quality of the different tasks.

More interesting for our discussion are the works from [Singh et al., 2005, Oudeyer et al., 2007]. The authors divide the problem of learning complex agent-environment tasks into learning a set of macro-actions, or predictive models, in an autonomous way (see Section 4). These initial works took very naive approaches and were later improved with more efficient methods. [Oudeyer et al., 2007] initially considered that each parameter region gave a different learning gain, and the region giving the highest gain was selected. Taking into account the previous discussion, we know that a better exploration strategy must be applied, and the authors considered more robust measures and created a stochastic policy to provide efficient results in high-dimensional problems [Baranes and Oudeyer, 2012]. More recently, [Maillard, 2012] introduced a new formulation of the problem and a new algorithm with specific regret bounds. The initial work of [Singh et al., 2005] led to further improvements. The measures of progress that guide the selection of the macro-action to be chosen started to consider the change in value function during learning [Şimşek and Barto, 2006]. Similar ideas were applied to learn affordances [Hart and Grupen, 2013], where different controllers and their validity regions are learned following their learning progress.

In distributed sensing it is required to estimate which sensors provide the most information about an environmental quantity. Typically this quantity is time-varying and the goal is to actively estimate which sensors provide more information. When using a Gaussian process as function approximator it is important to consider exploration to find the properties of the kernel; then, for known kernel parameters, a simple offline policy provides optimal results [Krause and Guestrin, 2007]. This partition into a finite set of choices makes it possible to derive more efficient exploration/sensing strategies while still ensuring tight bounds [Krause et al., 2008, Golovin and Krause, 2010, Golovin et al., 2010a].

3.2.2 Multiple Strategies

The other major perspective is to consider that the choices are the different methods that can be used to learn the task; in this case a single task is often considered. This learning-how-to-learn approach makes explicit that a learning problem depends strongly on the method used to collect the data and on the algorithm used to learn the task.

Other approaches include the choice among the different teachers that are available to be observed [Price and Boutilier, 2003] where some of them might not even be cooperative [Shon et al., 2007], or even choose between looking/asking for a teacher demonstration or doing self-exploration [Nguyen and Oudeyer, 2012].

Another approach considers the problem of having different representations and selecting the best one. The representation that gives more progress will be used more frequently [Konidaris and Barto, 2008, Maillard et al., 2011].

The previously mentioned work of [Lopes and Oudeyer, 2012] also showed that the same algorithm can be used to select online which exploration strategy is best to learn the transition probability model of an MDP faster. The authors compared R-Max, and random. A similar approach was suggested by [Castronovo et al., 2012], where a list of possible exploration rewards is proposed and a bandit arm is assigned to each one. Both works took a simplified approach by considering that reset actions were available and that the choices were only made at the beginning of each episode. This limitation was recently lifted by considering that the agent can evaluate and select online the best exploration strategies [Hester et al., 2013]. In this work the authors relied on a factored representation of an MDP [Hester and Stone, 2012] and, using many different exploration bonuses, were able to define a large set of exploration strategies. At each instant the new algorithm computes the gain in reward for the selected exploration strategy and, simultaneously, the expected gain for all the other strategies using an importance sampling idea. Using such expected gains the system can select the best strategy online, giving better results than any single exploration strategy would.

Prob. | Choices | Topics | References
----- | ------- | ------ | ----------
reg. | n Regions | n Functions | [Baranes and Oudeyer, 2010, Baranes and Oudeyer, 2012]
mdp | n Environment | n Environments | [Barto et al., 2004, Oudeyer et al., 2005, Oudeyer et al., 2007]
reg. | n Environment | n Environments | [Lopes and Oudeyer, 2012]
reg. | Control or Task Space | Direct/Inv. Model | [Baranes and Oudeyer, 2012, Jamone et al., 2011, Rolf et al., 2011]
mdp | Exploration strategies | 1 Environment | [Baram et al., 2004, Krause et al., 2008, Lopes and Oudeyer, 2012]
mdp | n Teachers | 1 Environment | [Price and Boutilier, 2003, Shon et al., 2007]
reg. | Teacher, self-exploration | 1 Function | [Nguyen and Oudeyer, 2012]
mdp | n Representations | 1 Environment | [Konidaris and Barto, 2008, Maillard et al., 2011]

Table 3: Formulation of several Machine Learning problems as a Strategic Student Problem.

3.3 Long-term exploration

We now consider active exploration strategies in which the whole trajectory is considered within the optimization criteria instead of planning only a single step ahead. A real world example is the one of selecting informative paths for environmental monitoring, see Figure 7.

We divide this section into two parts. The first part, entitled Exploration in Dynamical Systems, considers exploration where the dynamical constraints of the system are taken into account. The second part considers similar aspects, specific to Reinforcement Learning and Markov Decision Processes. We make this distinction due to the different communities, formalisms and metrics commonly used in each domain.

Figure 7: In environmental monitoring it is necessary to find the trajectories that provide the most critical information about different variables. Selecting the most informative trajectories based on the space and time variation and the physical restrictions of the mobile sensors is a very complex problem. The figures show the trajectories followed by simulated aerial vehicles; samples are only allowed inside the US territory. Courtesy of [Marchant and Ramos, 2012].

3.3.1 Exploration in Dynamical Systems

The most representative example of such a problem is one of the best studied problems in robotics: simultaneous localization and mapping (SLAM). The goal is to build a map of an unknown environment while keeping track of the robot position within it. Early works focused on active localization given an a priori map. In this case, the objective is to actively move the robot to obtain a better localization. In [Fox et al., 1998] the belief over the robot position and orientation was obtained using a Monte Carlo algorithm. Actions are chosen using a utility function based on the expected entropy of the robot location. A set of predefined relative motions is considered and only moving costs are taken into account.
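The expected-entropy criterion for active localization can be sketched on a toy problem. The sketch assumes a histogram belief over the cells of a small cyclic corridor, deterministic motions and a noisy binary door sensor, and it ignores moving costs; it is not the Monte Carlo implementation of [Fox et al., 1998]. All names and numbers are hypothetical.

```python
import math


def entropy(belief):
    """Shannon entropy (in nats) of a discrete belief."""
    return -sum(p * math.log(p) for p in belief if p > 0.0)


def predict(belief, shift):
    """Belief after deterministically moving `shift` cells in a cyclic corridor."""
    n = len(belief)
    return [belief[(i - shift) % n] for i in range(n)]


def expected_posterior_entropy(belief, shift, doors, hit=0.9):
    """Entropy of the belief expected after moving and then sensing door / no door."""
    predicted = predict(belief, shift)
    total = 0.0
    for z in (0, 1):  # possible sensor readings
        likelihood = [hit if d == z else 1.0 - hit for d in doors]
        joint = [l * p for l, p in zip(likelihood, predicted)]
        p_z = sum(joint)
        if p_z > 0.0:
            total += p_z * entropy([j / p_z for j in joint])
    return total


def best_motion(belief, motions, doors):
    """Relative motion that minimizes the expected entropy of the robot location."""
    return min(motions, key=lambda m: expected_posterior_entropy(belief, m, doors))


doors = [0, 0, 1, 0]             # known map: a door only at cell 2
belief = [0.5, 0.5, 0.0, 0.0]    # robot is at cell 0 or cell 1
motion = best_motion(belief, motions=[0, 1], doors=doors)
```

Staying in place yields the same reading for both hypotheses and so no information, whereas moving one cell makes the door observation discriminate between them.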

The first attempts to actively explore the environment during SLAM aimed to maximize the expected information gain [Feder et al., 1999, Bourgault et al., 2002, Stachniss and Burgard, 2003, Stachniss et al., 2005]. The implementation details depend on the on-board sensors (e.g. sonar or laser), the SLAM representation (feature based or grid maps) and the technique (EKF, Monte Carlo). For instance, in [Feder et al., 1999] an EKF was used to represent the robot location and the map features measured using sonar. Actions were selected to minimize the total area of the error ellipses of the robot and of each landmark, by reducing the expected covariance matrix at the next time step. For grid maps, similar ideas have been developed using mutual information [Stachniss and Burgard, 2003], and it is even possible to combine both representations using a weighted criterion [Bourgault et al., 2002]. Most of the previous approaches consider just a single step ahead, have to discretize the action space, or ignore the information that will be obtained along the path and its effect on the quality of the map. A more elaborate strategy was proposed in [Sim and Roy, 2005], where an A-optimality criterion over the whole trajectory was used. To make the problem computationally tractable, only a set of predefined trajectories is considered, using breadth-first search. The work in [Martinez-Cantin et al., 2007] directly aims to estimate the trajectory (i.e. a policy) in a continuous action-state space, taking into account the cost to go there and all the information gathered along the path. The policies are parameterized using way-points and the optimization is done over the latter. Some works explore similar ideas in the context of navigation and obstacle avoidance. For instance, [Kneebone and Dearden, 2009] uses a POMDP framework to incorporate uncertainty into Rapidly-exploring Random Trees planning. The resulting policy takes into account the information the robot will obtain while executing the plan. Hence, the map is implicitly refined during the plan, resulting in an improved model of the environment.
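The one-step-ahead covariance criteria mentioned above can be sketched in a heavily simplified setting. The example below assumes independent scalar landmark estimates (so the covariance matrix is diagonal), a direct measurement with known noise variance, and a hypothetical mapping from each action to the landmarks it would observe. The action is chosen to minimize the trace of the expected posterior covariance, i.e. an A-optimality cost. A full EKF-SLAM system would instead propagate the joint robot and landmark covariance.

```python
def kalman_variance_update(prior_var, noise_var):
    """Posterior variance of a scalar Gaussian after one direct measurement."""
    return prior_var * noise_var / (prior_var + noise_var)

def expected_trace(variances, visible, noise_var):
    """Sum of variances (A-optimality cost) after observing the landmarks in `visible`."""
    return sum(kalman_variance_update(v, noise_var) if i in visible else v
               for i, v in enumerate(variances))

def select_action(variances, actions, noise_var=0.5):
    """`actions` maps an action name to the set of landmark indices it would observe."""
    return min(actions, key=lambda a: expected_trace(variances, actions[a], noise_var))
```

In this toy setting, observing one very uncertain landmark can reduce the trace more than observing two landmarks that are already well estimated, which is the kind of trade-off these criteria capture.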

The active mapping approaches described above deal mainly with mapping environments with obstacles. However, similar ideas have been used to map other phenomena such as rough terrain, gas concentration or other environmental monitoring tasks. In this setting, robots make it possible to cover larger areas and to reconfigure the sensor network dynamically during operation. This makes active strategies even more relevant than in traditional mapping. Robots must decide where, when and what to sample to accurately monitor the quantities of interest. In this domain it is important to consider learning non-stationary space-time models [Krause and Guestrin, 2007, Garg et al., 2012]. By exploiting submodularity it is possible to compute efficient paths for multiple robots, ensuring that they will gather information in a set of regions [Singh et al., 2007]. Without relying on a particular division into regions, but without any proven bounds, [Marchant and Ramos, 2012] used Bayesian optimization tools to find an informative path in a space-time model.

3.3.2 Exploration / Exploitation

Another setting where the learner actively plans its actions to improve learning is reinforcement learning (see an early review on the topic in [Thrun, 1992]). In this general setting the agent is not just learning but is simultaneously being evaluated on its actions. This means that the errors made during learning count towards the global evaluation. This is the most common setting in reinforcement learning (RL) approaches. Under our taxonomy this problem is also the most challenging one, as the choice of where to explore next depends on the current location and the system has to take into account how to travel to such locations.

As discussed before, this most general case, as shown in Figure 1, is not submodular and there is no hope of finding a computationally efficient method to solve it exactly. Initial proposals considered the uncertainty in the models and guided exploration based on this uncertainty and on other measures such as the recency of visits. The authors then proposed a never-ending exploration strategy that incorporates knowledge about both already well-known states and novel ones [Schmidhuber et al., 1997, Wiering and Schmidhuber, 1998].

The best solutions, with theoretical guarantees, aim at finding efficient algorithms that have a high probability of finding a solution that is approximately correct, following the standard probably approximately correct (PAC) learning framework [Strehl and Littman, 2008, Strehl et al., 2009]. Two of the most influential works on the topic are E3 [Kearns and Singh, 2002] and R-max [Brafman and Tennenholtz, 2003]. Both take into account how often a state-action pair has been visited to decide how much further exploration must be done. Specifically, in the case of R-max [Brafman and Tennenholtz, 2003], the algorithm divides the states into known and unknown based on the number of visits made. This number is defined based on general bounds for having a high certainty on the correct transition and reward model. Then the algorithm proceeds by considering a surrogate reward function that is R-max in unknown states and the observed reward in known states. For a further analysis and more recent algorithms see the discussion in [Strehl and Littman, 2008].
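The known/unknown mechanism of R-max can be sketched for a small tabular problem as follows. This is a simplified illustration, not the exact algorithm of [Brafman and Tennenholtz, 2003]: it uses a discounted formulation, tracks visits per state-action pair, takes the visit threshold m as a free parameter instead of deriving it from the PAC bounds, and replans with plain value iteration. The class and parameter names are assumptions made for this example.

```python
from collections import defaultdict

class RMax:
    def __init__(self, n_states, n_actions, r_max, m=5, gamma=0.95):
        self.nS, self.nA = n_states, n_actions
        self.r_max, self.m, self.gamma = r_max, m, gamma
        self.counts = defaultdict(int)        # visits of (s, a)
        self.reward_sum = defaultdict(float)  # summed reward of (s, a)
        self.next_counts = defaultdict(int)   # visits of (s, a, s')

    def known(self, s, a):
        """A pair is 'known' once it has been visited m times."""
        return self.counts[(s, a)] >= self.m

    def update(self, s, a, r, s_next):
        # Statistics are frozen once the pair becomes known.
        if not self.known(s, a):
            self.counts[(s, a)] += 1
            self.reward_sum[(s, a)] += r
            self.next_counts[(s, a, s_next)] += 1

    def plan(self, sweeps=200):
        """Value iteration on the optimistic surrogate model."""
        v_max = self.r_max / (1.0 - self.gamma)  # value of receiving r_max forever
        V = [0.0] * self.nS
        Q = [[0.0] * self.nA for _ in range(self.nS)]
        for _ in range(sweeps):
            for s in range(self.nS):
                for a in range(self.nA):
                    if self.known(s, a):
                        n = self.counts[(s, a)]
                        r = self.reward_sum[(s, a)] / n
                        exp_v = sum(self.next_counts[(s, a, s2)] / n * V[s2]
                                    for s2 in range(self.nS))
                        Q[s][a] = r + self.gamma * exp_v
                    else:
                        Q[s][a] = v_max  # optimism for unknown pairs
                V[s] = max(Q[s])
        return Q

    def act(self, s):
        """Greedy action with respect to the optimistic model."""
        Q = self.plan()
        return max(range(self.nA), key=lambda a: Q[s][a])
```

Because unknown pairs are assigned the maximal value, the greedy policy is drawn towards them until they have been visited m times, after which the empirical model takes over.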

PAC-RL measures consider that most of the time the agent will be executing a policy that is close to the optimal one. An alternative view is to check whether the cumulative reward is close to the best one, as in the notion of regret. Such regret measures have already led to several RL algorithms [Salganicoff and Ungar, 1995, Ortner, 2007, Jaksch et al., 2010].
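As a point of reference, one common way to formalize regret in the average-reward setting (the form used, for instance, in [Jaksch et al., 2010]) compares the reward accumulated by the learner over T steps with what an optimal policy would obtain on average. The symbols below are notation introduced here for illustration: \rho^{*} is the optimal average reward per step and r_t is the reward received at step t.

```latex
\Delta(T) = T\,\rho^{*} - \sum_{t=1}^{T} r_t
```

An algorithm whose regret grows sublinearly in T collects, on average, a reward per step that approaches the optimal one.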

Yet another approach considers Bayesian RL [Dearden et al., 1998, Poupart et al., 2006, Vlassis et al., 2012, Sorg et al., 2010c]. In this formalism the agent aims at finding a policy that is (close to) optimal taking into account the model uncertainty. The resulting policies implicitly solve the exploration-exploitation problem. Bayesian RL exploits prior knowledge about the transition dynamics to reason explicitly about the uncertainty of the estimated model. The Bayesian exploration bonus (BEB) approach [Kolter and Ng, 2009] mixes the ideas of Bayesian RL with R-max: the states are not explicitly separated into known and unknown, but instead each state gets a bonus proportional to the uncertainty in the model. The authors were able to show that this algorithm approximates the - hard to compute - Bayesian optimal solution.
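The difference between a hard known/unknown split and a smoothly decaying bonus can be illustrated with two small functions. The count-based form beta / (1 + n) follows the bonus used in BEB, with n standing for the (pseudo-)count of a state-action pair. The threshold variant is only a stand-in for the R-max mechanism, and the constants are arbitrary assumptions.

```python
def beb_bonus(count, beta=2.0):
    """Exploration bonus that decays smoothly with the visit (pseudo-)count."""
    return beta / (1.0 + count)

def threshold_bonus(count, m=5, bonus=2.0):
    """R-max-like bonus: full optimism until m visits, none afterwards."""
    return bonus if count < m else 0.0
```

The smooth bonus would be added to the estimated reward before planning, so rarely visited pairs look more attractive without ever being treated as entirely unknown.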

A recent approach considered how R-max can be generalized to the case where each state might have different statistical properties [Lopes et al., 2012]. Especially in the case where these properties are not known, empirical measures of learning progress have been proposed to allow the system to balance online the exploration necessary to verify the PAC-MDP conditions.

As a generalization of exploration methods in reinforcement learning, such as [Brafman and Tennenholtz, 2003], ideas have been suggested such as planning to be surprised [Sun et al., 2011] or the combination of empirical learning progress with visit counts [Hester and Stone, 2012]. This aspect will be further explored in Section 4.

We also note that the ideas and algorithms for exploration/exploitation are not limited to finite state representations: recent results extend them to POMDPs [Fox and Tennenholtz, 2007, Jaulmes et al., 2005, Doshi et al., 2008], Gaussian Process Dynamical Systems [Jung and Stone, 2010], structured domains [Hester and Stone, 2012, Nouri and Littman, 2010], and relational problems [Lang et al., 2010].

Most of the previous approaches are optimistic in the face of uncertainty. In the real world, exploration must most of the time be done in incremental and safe ways due to physical limits and safety issues. In most cases processes are not ergodic and care must be taken. Safe exploration techniques have started to be developed [Moldovan and Abbeel, 2012]. In this work the system is able to know if an exploration step can be reversed. This means that the robot can look ahead and estimate if it can return to the previous location. Results show that the exploration trajectory followed is different from that of other methods but allows the system to explore only the safe parts of the environment.
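The idea of checking whether an exploration step can be reversed can be sketched with a simple reachability test. This is a deterministic stand-in, not the probabilistic formulation of [Moldovan and Abbeel, 2012]: it assumes a known directed graph of transitions (a hypothetical dictionary from state to successor states) and allows a step only if the home state can still be reached afterwards.

```python
from collections import deque

def can_return(graph, start, home):
    """Breadth-first search: is `home` reachable from `start`?"""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == home:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def safe_actions(graph, state, home):
    """Successor states from which the agent can still get back to `home`."""
    return [nxt for nxt in graph.get(state, ()) if can_return(graph, nxt, home)]
```

In a non-ergodic environment this filter removes moves that lead into absorbing regions, at the price of leaving those regions unexplored.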

3.4 Others

There are other exploration methods that do not fit well in the previously defined structure, in most cases because they do not model explicitly the uncertainty. Relevant examples consider policy search and active vision. Other cases combine different methods to accomplish different goals.

3.4.1 Mixed Approaches

There are several methods that include several levels of active learning to accomplish complex tasks, see Figure 8.

Figure 8: Several problems require the use of active learning at several different levels and/or time scales. Here is the example of the SAGG-RIAC architecture. The structure is composed of two parts: a higher level for selecting target goals, and a lower level, which considers the active choice of the controllers to reach such goals. The system makes it possible to explore the space of reachable goals and learn the controllers required to reach them in a more efficient way. From [Baranes and Oudeyer, 2012].

In [Martinez-Cantin et al., 2009, Martinez-Cantin et al., 2010] the authors want to learn a dynamical model of a robot arm, or a good map of the environment, with the minimum amount of data. For this it is necessary to find a trajectory, consisting of a sequence of via-points, that reduces the uncertainty on the estimator as fast as possible. The main difficulty is that this is in itself a computationally expensive problem, and if it is to be used in real time, then efficient Bayesian optimization techniques must be used [Brochu et al., 2010].

Another example is the SAGG-RIAC architecture [Baranes and Oudeyer, 2012]. In this system a hierarchy of forward models is learned, and for this the system actively makes choices at two levels: in a goal space, it chooses what topic/region to sample (i.e. which goal to set), and in a control space, it chooses which motor commands to sample to improve its know-how towards goals chosen at the higher level.

We can also view the works of [Kroemer et al., 2009, Kroemer et al., 2010] as having a level of active exploration of good grasping points and another level of implicit exploration to find the best grasping strategies.

3.4.2 Implicit exploration

Learning in robots and data collection are always intertwined. Even if this data collection process is explicit in many cases, other approaches, although strongly dependent on that same process, address it only implicitly or as a side-effect of an optimization process [Deisenroth et al., 2013]. The most noteworthy examples are policy gradient methods and similar approaches [Sutton et al., 2000, Kober et al., 2013]. In these methods the learner tries to directly optimize a policy given experiments and the associated rewards. Some methods consider stochastic policies, and the noise in the policy is used to perform exploration and collect data [Peters et al., 2005]. The exploration is reduced by the same process that adjusts the parameters to improve the expected reward. Another line of research uses more classical optimization methods to find the set of parameters that maximizes a reward function [Stulp and Sigaud, 2012]. Recently, using a more accurate model of uncertainty, Bayesian optimization methods have been used to search for the policy parameters that result in the highest success rate [Tesch et al., 2013].
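The way policy noise acts as implicit exploration can be illustrated with a minimal likelihood-ratio policy gradient on a one-dimensional problem. The sketch assumes a Gaussian policy with learnable mean and log standard deviation, a toy quadratic reward with a hypothetical target, and a batch-mean baseline. It is not the algorithm of any of the cited works.

```python
import math
import random

def reward(action, target=2.0):
    """Toy reward: highest when the action hits the (assumed) target."""
    return -(action - target) ** 2

def train_gaussian_policy(iterations=200, batch=50, lr=0.05, seed=0):
    rng = random.Random(seed)
    mean, log_std = 0.0, 0.0  # policy: a ~ N(mean, exp(log_std)^2)
    for _ in range(iterations):
        std = math.exp(log_std)
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        rewards = [reward(mean + std * e) for e in noise]  # exploratory rollouts
        baseline = sum(rewards) / batch                    # variance reduction
        # Likelihood-ratio gradients of the expected reward:
        #   d log pi / d mean = e / std,   d log pi / d log_std = e^2 - 1
        grad_mean = sum((r - baseline) * e / std
                        for r, e in zip(rewards, noise)) / batch
        grad_log_std = sum((r - baseline) * (e * e - 1.0)
                           for r, e in zip(rewards, noise)) / batch
        mean += lr * grad_mean
        log_std += lr * grad_log_std
    return mean, math.exp(log_std)
```

No explicit exploration strategy appears in the code: the sampled noise provides the data, and the same gradient step that moves the mean towards the target also shrinks the standard deviation, so exploration decreases as the policy improves.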

3.4.3 Active Perception

Another common use of the word active is in active perception. Initially it was introduced because many computer vision problems become easier if more than one image, or even a stream of video, is available. An active motion of the camera can make such extra information much easier to obtain. More recently it was motivated by the possibilities opened by having a robot act in the environment to discover world properties.

This idea has been applied to segment objects and learn about their properties [Fitzpatrick et al., 2003], to disambiguate and model articulated objects [Katz et al., 2008], and to disambiguate sound [Berglund and Sitte, 2005], among others. Attention can also be seen as an instance of active perception: [Meger et al., 2008] presents an attention system that learns about objects in a real environment using SIFT features. Finally, in highly cluttered environments an active approach can also provide significant gains [van Hoof et al., 2012].

3.5 Open Challenges

Under the label of exploration we considered several domains that include standard active learning, exploration and exploitation problems, multi-armed bandits and general online learning problems. All these problems already have a large body of research, but there are still many open challenges.

Clearly a great deal of work is still necessary to expand the classes of problems that can be actively sampled in an efficient way. In all the settings we described there already exist many different approaches, many of them with formal guarantees. Nevertheless, for any particular instance of a problem it is not clear which method is the most efficient in practice, or how to synthesize the exploration strategies from a problem domain description.

Some of the heuristics and methods, and also many of the hypotheses and models, proposed in the developmental communities can be natural extensions to the active learning setting. For instance, there is very limited research on active learning for more complex models such as time-variant problems and domains with heteroscedastic noise and properties (see many of the differences in Table 4).

4 Curiosity

Most active approaches for learning address the problem of learning a single, well defined, task as fast as possible. Some of the examples given, such as safe exploration, already showed that in many cases there is a multi-criteria goal to be fulfilled. In a truly autonomous and intelligent system, knowing what tasks are worth exploring, or even which tasks are to be learned, is an ill-defined problem.

In the 50s and 60s researchers started to be amazed by the amount of time children and primates spend in tasks that do not have a clear objective return. This spontaneous motivation to explore and intrinsic curiosity towards novelty [Berlyne, 1960] challenged utilitarian perspectives on behavior. The main question is why so many animals are curious and have a long period of play, activities that from many perspectives can be considered risky and useless. One important reason seems to be that it is this intrinsic motivation that creates learning situations that will be useful in the future [Baldassarre, 2011, Singh et al., 2009]; only after going through school will that knowledge have some practical benefit. Intelligent agents are not myopically optimizing their behavior but are also gathering a large set of perceptual, motor, and cognitive skills that will be beneficial in a large set of possible future tasks. A major problem is how to define a criterion for successful learning if the task is just to explore for the sake of pure exploration. One hypothesis is that this stage results from an evolutionary process that leads to a better performance in a class of problems [Singh et al., 2010b]. Another is that intrinsic motivation is a way to deal with bounded agents for which maximizing the objective reward would be too difficult [Singh et al., 2010a, Sorg et al., 2010a]. Even for very limited time spans, where an agent wants to select a single action, there are many somewhat contradictory mechanisms for attention and curiosity [Gottlieb, 2012]. An agent might have preferences for specific stimuli, for actions that promise bigger learning gains, or for actions that provide the information required for reward prediction/gathering.

The idea of assuming that the future will bring new unknown tasks can be operationalized even in a single domain. Consider a dynamical environment (defined as an MDP) where there is a training phase of unknown length. In one approach the agent progressively learns how to reach all the states that can be reached in 1 step. After being sufficiently sure that it has found all such states and has a good enough policy to reach them, the system increases the number of steps and starts the process again. This work, suggested by [Auer et al., 2011, Lim and Auer, 2012], shows that it is possible to address such a problem and still ensure formal regret bounds. Under different formalisms we can also see the POWERPLAY system as a way to increasingly augment the complexity of already explained problems [Schmidhuber, 2011]. The approach from [Baranes and Oudeyer, 2012] can also be seen in this perspective, where the space of policy parameters is explored in an increasing order of complexity.

One of the earliest works that tried to operationalize these concepts was made by [Schmidhuber, 1991b]. More recently several researchers have extended the study to many other domains [Schmidhuber, 1995, Schmidhuber, 2006, Singh et al., 2005, Oudeyer et al., 2007]. Research in this field has considered new problems such as: situations where parts of the state space are unlearnable [Baranès and Oudeyer, 2009, Baranes and Oudeyer, 2012]; guide exploration in different spaces [Baranes and Oudeyer, 2012]; environmental changes [Lopes et al., 2012]; empirical measures of learning progress [Schmidhuber, 2006, Oudeyer et al., 2007, Baranès and Oudeyer, 2009, Baranes and Oudeyer, 2012, Hester et al., 2013, Lopes et al., 2012]; limited agents [Singh et al., 2010a, Sorg et al., 2010a, Sequeira et al., 2011]; open-ended problems [Singh et al., 2005, Oudeyer et al., 2007]; autonomous discovery of good representations [Luciw et al., 2011]; and selecting efficient exploration policies [Lopes and Oudeyer, 2012, Hester et al., 2013].

Some of these ideas are natural extensions to the active learning setting, e.g. time-variant problems and heteroscedastic domains, but, usually due to limited formal understanding, theoretical results have been limited. Table 4 shows a comparison of the main qualitative differences between the traditional perspective and these more recent generalizations.

Active Learning | Artificial Curiosity
Learn with reduced time/data | Learn with reduced time/data
Fixed tasks | Tasks change and are selected by the agent
Learnable everywhere | Parts might not be learnable
Everything can be learned in the limit | Not everything can be learned during a lifetime
Reduce uncertainty | Improve progress
Table 4: Active Learning vs Artificial Curiosity
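The last row of the table, reducing uncertainty versus improving progress, can be made concrete with a simple empirical measure of learning progress. The sketch below is one illustrative estimator, not the exact measure of any cited work: progress is the decrease of the mean prediction error between the two most recent windows of a region's error history. The window size, the optimistic value for regions with too little data, and the function names are assumptions.

```python
def learning_progress(errors, window=5):
    """Decrease of the mean prediction error between the two most recent windows."""
    if len(errors) < 2 * window:
        # Assumption: regions with too little data are treated optimistically.
        return float("inf")
    older = errors[-2 * window:-window]
    recent = errors[-window:]
    return sum(older) / window - sum(recent) / window

def choose_region(error_histories, window=5):
    """Select the region whose error is currently decreasing the fastest."""
    return max(error_histories,
               key=lambda name: learning_progress(error_histories[name], window))
```

A region with high but constant error (for example pure noise) yields zero progress and is ignored, whereas a purely uncertainty-driven learner would keep sampling it. This is how progress-based criteria cope with parts of the space that are not learnable.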

4.1 Creating Representations

A very important aspect of any learning machine is the ability to create, or at least select, its own representations. In many (most?) cases the success of a learning algorithm is critically dependent on the selected representations. Some variant of feature selection is the most common approach to the problem: it is assumed that a large bank of features exists and the learning algorithm chooses a good subset of them, considering sparsity or any other criterion. Nevertheless, the problem is not trivial and most heuristics are bound to fail in most cases [Guyon and Elisseeff, 2003].

Some works focused just on the perceptual capabilities of agents. For instance, [Meng and Lee, 2008] grows radial basis functions to learn mappings between sensory modalities by sampling locations with a high error. For the discussion in this document, particularly in this section, the most relevant works are not those that consider only what is the best representation for a particular task, but those that have a co-adaptation perspective and co-select the representation and the behavior. For instance, [Ruesch and Bernardino, 2009, Schatz and Oudeyer, 2009, Rothkopf et al., 2009] study the relation between the behavior of an agent and the most representative retinal distribution.

Several works consider how to learn a good representation of the state-space of an agent while exploring an environment. These learned representations are not only good to classify regions but also to navigate and create hierarchies of behavior [Luciw et al., 2011, Bakker and Schmidhuber, 2004]. Early works considered how a finite automaton and a hierarchy could be learned from data [Pierce and Kuipers, 1995].

Generalizations of those ideas consider how to detect regularities that identify non-static world objects and thus allowing to infer actions that change the world in the desired ways [Modayil and Kuipers, 2007].

4.2 Bounded Rationality

There are several models of artificial curiosity, or intrinsic motivation systems, that, in general, guide the behavior of the agent to novel situations. These models provide exploration bonuses, sometimes called intrinsic rewards, to focus attention on such novel situations. The advantages of such models for an autonomous agent are, in many situations, not clear.

An interesting perspective can be that of bounded rationality. Even if agents were able to see the whole environment, they might lack the reasoning and planning capabilities to behave optimally. Another way to see these works is to consider that the agent lives in a POMDP and that, in some cases, it is possible to find a different reward function that mitigates some of the partial observability problems.

A very interesting perspective was introduced with the definition of the optimal reward problem [Sorg et al., 2010a]. Here the authors consider that the learning agent is limited in its reasoning capabilities. If it tries to optimize the observed reward signal it will be sub-optimal in the task, and so another reward is found that allows the agent to learn the task. The authors have extended their initial approach to obtain a more practical algorithm using reward gradients [Sorg et al., 2010b] and by comparing different search methods [Sorg et al., 2011]. Recently the authors considered how the computational resources must be taken into account when choosing between optimizing a new reward or planning the next actions. Such a search for an extra reward signal can also be used to improve coordination in a multi-agent scenario [Sequeira et al., 2011].

4.3 Creating Skills

When an animal is faced with a new environment there are an infinite number of different tasks that it might try to achieve, e.g. learn the properties of all objects or understand its own dynamics in this new environment. It can be argued that there is the single goal of survival and that any sub-division is an arbitrary construct. We agree with this view but we consider that such sub-division will create a set of reusable sub-goals that might provide advantages in the single main goal.

This perspective on (sub) goal creation motivated one of the earliest computational models of intrinsically motivated systems [Barto et al., 2004, Singh et al., 2005], see Figure 9. There the authors, using the theory of options [Sutton et al., 1999], construct new goals (as options) every time the agent finds a new "salient" stimulus. In this toy example, turning on a light or ringing a bell are considered reusable skills that might be of interest at later stages; if a skill is learned that reaches such a state efficiently, the agent will be able to learn complex hierarchical skills by combining the basic actions and the newly learned skills.

The main criticism of those works is that the hierarchical nature of the problem was pre-designed and the saliency of novelty measures were tuned to the problem. To solve such limitations many authors have explored ways to autonomously define which skills must be created. Next we will discuss different approaches that have been proposed to create new skills in various problems.

In regression problems several authors reduced the problem of learning a single complex task to learning a set of multiple simpler tasks. In problems modeled as MDPs authors have considered how to create macro-states or macro-actions that can be reused in different problems or that allow the creation of a complex hierarchical control system. After such a division of a problem into a set of smaller problems it is necessary to decide what to learn at each time instant. For this, results from multi-armed bandits can be used, see [Lopes and Oudeyer, 2012] and Section 3.2.

Figure 9: The playroom domain where a set of motor skills is incrementally created and learned resulting in a set of reusable, and hierarchical, repertoire of skills. (a) Playroom domain; (b) Speed of learning of various skills; (c) The effect of intrinsically motivated learning when extrinsic reward is present. From [Singh et al., 2005].

4.3.1 Regression Models

In problems that consist in learning forward and backward maps among spaces (e.g. to learn dynamical models of systems), authors have considered how to incrementally create a partition of the space into regions of consistent properties [Oudeyer et al., 2007, Baranès and Oudeyer, 2009]. An initial theoretical study frames such a model as a multi-armed bandit over a pre-defined hierarchical partition of the space [Maillard, 2012].

The set of skills that is created by the system might represent many different problems. It can be seen as a hierarchical decomposition of skills, but also as a decomposition of a problem into several simpler, local problems. An example is the optimization setting of [Krause et al., 2008]. Here the authors try to find which regions of a given area must be sampled to provide more information about one of several environmental conditions. It considers an already known sub-division and learns the properties of each one. Yet, in real world applications, the repertoire of topics to choose from might not be provided initially or might evolve dynamically. The aforementioned works of [Oudeyer et al., 2007, Baranes and Oudeyer, 2012] consider initially a single region (a prediction task in the former and a control task in the latter) but then automatically and continuously construct new regions, by sub-dividing or joining previously existing ones.

In order to discover affordances of objects and new ways to manipulate them, [Hart et al., 2008] introduces an intrinsic reward that motivates the system to explore changes in the perceptual space. These changes are related to different motions of the objects upon contact from the robot arm.

A different perspective on regression methods is considering that the input space is a space of policy parameters and the output is whatever time-extended results of applying such policy. Taking into account this perspective, the approach from [Baranes and Oudeyer, 2012], similarly to POWERPLAY [Schmidhuber, 2011] and the approach from [Auer et al., 2011, Lim and Auer, 2012], explores the policy space in an increasing order of complexity of learning each behavior.

4.3.2 MDP

In the case of problems formulated as MDPs several researchers have defined automatic measures to create options or other equivalent state-action abstractions, see [Barto and Mahadevan, 2003] for an early discussion. [Mannor et al., 2004] considered approaches such as online clustering of the state-action space using measures of connectivity and variance of reward values. One such connectivity measure was introduced by [McGovern and Barto, 2001], where states that are present in multiple paths to the goals are considered sub-goals and an option is initiated to reach them. These states can be seen as "doors" connecting highly connected parts of the state-space. Other measures of connectivity have been suggested by [Menache et al., 2002, Şimşek and Barto, 2004, Şimşek et al., 2005, Simsek and Barto, 2008]. Even before the introduction of the options formalism, [Digney, 1998] introduced a method that would create skills based on reward gradients. [Hengst, 2002] exploited the factored structure of the problem to create the hierarchy, by measuring which factors are more predictable and connecting that to the different levels of the hierarchy. A more recent approach models the problem as a dynamic Bayesian network that explains the relation between different tasks [Jonsson and Barto, 2006]. Another perspective considers how to simultaneously learn different representations for the high level and the lower level. By ensuring that neighbor states at the lower level are clustered in the higher level, it is possible to create efficient hierarchies of behavior [Bakker and Schmidhuber, 2004].

An alternative perspective on the creation of a set of reusable macro actions is to exploit commonalities in collections of policies [Thrun et al., 1995, Pickett and Barto, 2002].

4.4 Diversity and Competence

For many learning problems we can define several spaces of parameters; the input parameters and the resulting behaviors are usually the obvious ones. Most of the previous concepts can be applied in different spaces and, depending on the metric of learning, a decision must be made on which of these spaces is better suited to guide exploration. The robot might detect salient events in the perceptual space, or generate new references in the control space of the robot or in the environment space. Although coming from different perspectives, developmental robotics [Baranes and Oudeyer, 2012] and evolutionary development [Lehman and Stanley, 2011] both argue that exploration in the behavior space might be more efficient and relevant than in the space of the parameters that generate that behavior.

The first perspective proposed by [Lehman and Stanley, 2011] is that many different genetic controller encodings might lead to very similar behaviors, and when considering also the morphological and environmental restrictions, the space of behaviors is much smaller than the space of controller encodings. The notion of diversity is not clear due to the redundancy in the control parameters, see [Mouret and Doncieux, 2011] for a discussion. It is interesting to note that in a more computational perspective, particle filters tend to also consider diversity criteria to detect convergence and improve efficiency [Gilks and Berzuini, 2002].
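Exploration driven by diversity in the behavior space is often implemented with a novelty score. The sketch below uses the mean Euclidean distance to the k nearest behaviors in an archive, in the spirit of novelty search. The representation of a behavior as a tuple of numbers, the value of k, and the function name are assumptions for illustration.

```python
import math

def novelty(behavior, archive, k=3):
    """Mean distance from `behavior` to its k nearest neighbors in the archive."""
    dists = sorted(math.dist(behavior, other) for other in archive)
    nearest = dists[:k]
    # An empty archive makes any behavior maximally novel.
    return sum(nearest) / len(nearest) if nearest else float("inf")
```

Two controllers with very different encodings that produce the same behavior descriptor receive the same score, so the search pressure is applied where it matters, in the smaller space of behaviors.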

From a robot controller point of view we can see a similar idea as proposed by [Baranes and Oudeyer, 2010], see Figure 10. In this case we consider redundant robots, where many different joint positions lead to the same task space position of the robot, and so a dramatic reduction of the size of the exploration space is achieved. The authors also introduced the concept of competence where, again for the case of redundant robots, the robot might prefer to be able to reach a larger volume of the task space, even without knowing all the possible solutions to reach each point, rather than being able to use all its dexterity in a small part of the task space while not knowing how to reach the rest.
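The redundancy argument can be checked on a small example. The sketch below assumes a planar arm with three revolute joints and equal link lengths (a hypothetical robot, chosen only for illustration) and computes the end-effector position from the joint angles.

```python
import math

def forward_kinematics(joint_angles, link_length=1.0):
    """End-effector position of a planar serial arm with equal link lengths."""
    x = y = absolute = 0.0
    for q in joint_angles:
        absolute += q  # absolute orientation of the current link
        x += link_length * math.cos(absolute)
        y += link_length * math.sin(absolute)
    return x, y
```

For instance, the joint vectors (0.2, 0.7, 0.6) and (1.5, -1.3, 0.7) are far apart in the controller space but use the same three absolute link orientations in a different order, so they reach the same point in the task space. Exploring in the task space treats them as a single outcome.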

Figure 10: Model of the correspondences between a controller space and a task space to be learned by a robot. Forward models define a knowledge of the effects caused by the execution of a controller. Inverse models, which define a skill or competence, are mechanisms that allow to retrieve one or several controller(s) (if it exists) allowing to achieve a given effect (or goal) y_i in the task space.

Other authors have also considered exploration in task space, e.g. [Jamone et al., 2011] and [Rolf et al., 2011]. We can refer again to the works of [Schmidhuber, 2011, Lim and Auer, 2012] and see that they also use as a criterion having access to the most diverse set of policies possible.

4.5 Development

The previous discussion might lead us to think that a pure data-driven approach might be sufficient to address all the real world complexity. Several authors consider that data-driven approaches must be combined with pre-structured information. For example, artificial development considers that the learning process is guided not only by the environment and the data it collects but also by the "genetic information" of the system [Elman, 1997, Lungarella et al., 2003].

In living organisms, it is believed that maturational constraints help reduce the complexity of learning in early stages, thus resulting in better and more efficient learning in the longer term. They do this by structuring the perceptual and motor space [Nagai et al., 2006, Lee et al., 2007, Lopes and Santos-Victor, 2007, Lapeyre et al., 2011, Baranes and Oudeyer, 2011, Oudeyer et al., 2013] or by developing intrinsic rewards that focus attention on informative experiences [Baldassarre, 2011, Singh et al., 2010b] and pre-dispositions to detect meaningful salient events, among many other aspects.

4.6 Open Challenges

In a broad perspective, open-ended learning and curiosity are still far from being a well understood, or even well formulated, problem. Evolutionary models [Singh et al., 2010b] and recent studies in neuroscience [Gottlieb et al., 2013] are starting to provide a clearer picture of if, and why, curiosity is an intrinsic drive in many animals. A clear understanding of why this drive exists, what triggers the drive to learn new tasks, and why agents seek complex situations will provide many insights into human cognition and into the development of autonomous and robust agents.

A related discussion is that a purely data-driven approach will not be able to address such long-term learning problems. For large, time-varying domains, prior information that provides exploration constraints will be a fundamental aspect of any algorithm. These developmental constraints, and all genetic information, will be fundamental to any such endeavor. We note that learning and development require the co-development of representations, exploration strategies, learning methods, and hierarchical organization of behavior, which will require the introduction of novel theoretical frameworks.

5 Interaction

The previous sections considered active learning where the agents act, or make queries, and either the environment or an oracle provides more data. Such an abstract formalism might not be the best model when the oracle is a human with specific reasoning capabilities. Humans have a tremendous amount of prior knowledge and inference capabilities that allow them to solve very complex problems, and so a benevolent teacher might guide exploration and provide information for learning. Feedback from a teacher can take the form of: initial conditions for further self-exploration in robotics [Nicolescu and Mataric, 2003], information about the task solution [Calinon et al., 2007], information about affordances [Ekvall and Kragic, 2004], information about the task representation [Lopes et al., 2007], among others. Figure 11 illustrates this process, where the world state, the signals produced by the teacher and the signals required by the learning algorithms are not in the same representation, so an explicit translation mechanism is required. An active learning approach can also allow a robot to ask a user about adequate state representations, see Fig. 12.

Figure 11: In many situations agents gather data from humans. These instructions need to be translated to a representation that is understood by the learning agent. From [Grizou et al., 2013].

It has been suggested that interactive learning, also called human-guided machine learning or learning with a human in the loop, might be a new perspective on robot learning that combines the ideas of learning by demonstration, learning by exploration, active learning and tutor feedback [Dillmann et al., 2000, Dillmann et al., 2002, Fails and Olsen Jr, 2003, Nicolescu and Mataric, 2003, Breazeal et al., 2004, Lockerd and Breazeal, 2004, Dillmann, 2004]. Under this approach the teacher interacts with the robot and provides extra feedback. Approaches have considered extra reinforcement signals [Thomaz and Breazeal, 2008], action requests [Grollman and Jenkins, 2007a, Lopes et al., 2009b], disambiguation among actions [Chernova and Veloso, 2009], preferences among states [Mason and Lopes, 2011], iterations between practice and user feedback sessions [Judah et al., 2010, Korupolu et al., 2012] and choosing actions that maximize the user feedback [Knox and Stone, 2009, Knox and Stone, 2010].

In this document we focus on the active perspective, so it is the learner that has to ask for such information. With a human in the loop, we have to consider the cost of making many queries in terms of the teacher’s tiredness. Studies and algorithms have considered this aspect and addressed the problem of deciding when to ask. Most approaches simply ask the user whenever information is needed [Nicolescu and Mataric, 2001] or when there is high uncertainty [Chernova and Veloso, 2009]. A more advanced approach makes queries only when it is too risky to try experiments [Doshi et al., 2008]. [Cakmak et al., 2010a] compare the results when the robot has the option of asking or not asking the teacher for feedback, and in more recent work they study how the robot can make different types of queries, including label, feature and demonstration queries [Cakmak and Thomaz, 2011, Cakmak and Thomaz, 2012].

Most of these systems have been developed to speed up learning or to provide a more intuitive way to program robots. There are reasons to believe that an interactive perspective on learning from demonstration might lead to better results, even for the same amount of data. The theoretical aspects of these interactive systems have not been explored, beyond the direct application of results from active learning. One justification for the need for, and expected gain from, such systems is discussed by [Ross and Bagnell, 2010]. Even if an agent learns from a good demonstration, when it executes the learned policy its error will grow with $T^2$ (where $T$ is the horizon of the task). The reason is that any deviation from the correct policy moves the learner to a region where the policy has a worse fit. If a new demonstration is requested in that new region, the system learns not only how to execute a good policy but also how to recover from small mistakes. This observation, as the authors note, was already made by [Pomerleau, 1992] without a proof.
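The compounding-error argument can be written as a bound. As a sketch, assume a per-step cost in $[0,1]$, a task horizon $T$, an expected total cost $J$ over those $T$ steps, and a learned policy $\hat{\pi}$ that disagrees with the expert $\pi^{*}$ with probability at most $\epsilon$ on states drawn from the expert's own trajectories. The symbols $J$, $\hat{\pi}$, $\pi^{*}$ and $\epsilon$ are notation introduced here for illustration. Under these assumptions, the guarantee of [Ross and Bagnell, 2010] for purely passive imitation has the form:

```latex
J(\hat{\pi}) \;\le\; J(\pi^{*}) + T^{2}\,\epsilon
```

Intuitively, a first mistake can happen at any of the $T$ steps, and once the learner has left the demonstrated state distribution it may pay up to the maximal cost on each of the remaining steps, which gives the quadratic dependence on the horizon. Requesting demonstrations in the states the learner actually visits is what removes this mismatch.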

Another reason to use interactive systems is that when users train the system they might become more comfortable with using it and more willing to accept it. See the work of [Ogata et al., 2003] for a study on this subject. The queries of the robot have the dual goal of allowing the robot to deal with its own limitations and giving the user information about the robot’s uncertainty on the task being learned [Fong et al., 2003, Chao et al., 2010].

There are many cases where the learning data comes directly from humans but no special uncertainty models are used. Such systems either have an intuitive interface to provide information to the system during teleoperation [Kristensen et al., 1999], or the system itself initiates questions based on perceptual saliency [Lutkebohle et al., 2009]. There are also cases where the authors just follow the standard active learning setting, e.g. to learn a better gesture classifier the system is able to ask the user to provide more examples of a given class [Francke et al., 2007], including for human-robot interfaces [Lee and Xu, 1996].

This section starts by presenting a perspective on the behavior of humans when they teach machines and the various ways in which a human can help a learning system. We then divide our review into systems for active learning from demonstration, where the learner asks questions of the user, and systems where the teacher intervenes whenever it is required. Finally, we show that it is sometimes important to explicitly learn the teaching behavior of the teacher.

5.1 Interactive Learning Scenarios

The type of feedback/guidance that a human can provide depends on the task, the human’s knowledge, how easy it is to provide each type of information, the communication channels between the system and the user, among many other factors. For instance, in a financial setting it is straightforward to attribute values to the outcomes of a policy, but in some tasks, dancing for instance, it is much easier to provide trajectory information. In some tasks a combination of both is required: when teaching how to play tennis it is easy to provide a numeric evaluation of the different policies, but only by being shown particular motions can a learner really improve its game.

The presence of other agents in the environment creates diverse opportunities for different learning and exploration scenarios. We can view the other agents as teachers that can behave in different ways. They can provide:

  • guidance on exploration

  • examples

  • task goals

  • task solutions

  • example trajectories

  • quantitative or qualitative evaluation on behavior

  • information about their preferences

By guiding exploration we consider that the agent is able to learn by itself but the extra feedback, or guidance, provided by the teacher will improve its learning speed. The teacher can demonstrate new tasks, and from this the learner might get several extra elements: the goal of the task, how to solve the task, or simply environment trajectories. Another perspective puts the teacher in the role of a judge who evaluates the behavior of the system, either by directly providing an evaluation of the learner’s behavior or by revealing his preferences. Several authors provided studies on how to model the different sources of information during social learning in artificial agents [Noble and Franks, 2002, Melo et al., 2007, Nehaniv, 2007, Lopes et al., 2009a, Cakmak et al., 2010b, Billing and Hellström, 2010].

Teacher               Examples
unaware               [Price and Boutilier, 2003]
batch                 [Argall et al., 2009, Lopes et al., 2010, Calinon et al., 2007]
active                Section 5.3
teaching              [Cakmak and Thomaz, 2012, Cakmak and Lopes, 2012]
mixed                 [Katagami and Yamada, 2000, Judah et al., 2010, Thomaz and Breazeal, 2008]
on-the-loop           [Grollman and Jenkins, 2007a, Knox and Stone, 2009, Mason and Lopes, 2011]
ambiguous protocols   [Grizou et al., 2013]

Table 5: Interactive Learning Teachers

We can describe interactive learning systems along another axis: the type of participation the human has in the process. Table 5 provides a non-exhaustive list of the different positions of a teacher during learning. First, the demonstrator can be completely unaware that a learner is observing him and collecting data for learning. Many systems are like this and use the observations as a dataset to learn from. More interesting cases are those where the teacher is aware of the situation and provides the learner with a batch of data; this is the most common setting. In the active approach the teacher is passive and only answers the questions of the learner (refer to Section 5.3), while in the teaching setting it is the teacher that actively selects the best demonstration examples, taking into account the task and the learner’s progress. Recent examples exist of the human on-the-loop setting, where the teacher observes the actions of the robot and only acts when it is required to make a correction or provide more data.

As usual, these approaches are not pure and many systems combine different perspectives. There are situations where different teachers are available to be observed and the learner chooses which one to observe [Price and Boutilier, 2003], where some of them might not even be cooperative [Shon et al., 2007], and where the learner even chooses between observing a demonstrator and learning by self-exploration [Nguyen et al., 2011].

Figure 12: Active learning can also be used to instruct a robot how to label states, making it possible to achieve a common framing and providing symbolic representations that allow more efficient planning systems. In active learning of grounded relational symbols, the robot generates situations in which it is uncertain about the symbol grounding. After having seen the examples in (1) and (2), the robot can decide whether it wants to see (3a) or (3b). An actively learning robot takes its current knowledge into account and prefers to see the more novel (3b). From [Kulick et al., 2013].

5.2 Human Behavior

Humans change the way they act when they are demonstrating actions to others [Nagai and Rohlfing, 2009]. This might help the learner by attracting attention to the relevant parts of the actions, but it also shows that humans will change the way a task is executed, see [Thomaz and Cakmak, 2009, Kaochar et al., 2011, Knox et al., 2012].

It is clear now that when teaching robots there is also a change in behavior [Thomaz et al., 2006, Thomaz and Breazeal, 2008, Kaochar et al., 2011]. An important aspect is that, many times, the feedback is ambiguous and deviates from the mathematical interpretation of a reward or a sample from a policy. For instance, in the work of [Thomaz and Breazeal, 2008] the teachers frequently gave a reward to exploratory actions even if the signal was used as a standard reward. Also, in some problems we can define an optimal teaching sequence but humans do not behave according to those strategies [Cakmak and Thomaz, 2010].

[Kaochar et al., 2011] developed a GUI to observe the teaching patterns of humans when teaching an electronic learner to achieve a complex sequential task (e.g. a search-and-detect scenario). The most interesting finding is that humans use all available channels of communication, including demonstration, examples, reinforcement and testing. The use of testing varies a lot among users, and without a fixed protocol many users will create very complex forms of interaction.

5.3 Active Learning by Demonstration

Social learning, that is, learning how to solve a task after seeing it being done, has been suggested as an efficient way to program robots. Typically, the burden of selecting informative demonstrations has been completely on the side of the teacher. Active learning approaches endow the learner with the power to select which demonstrations the teacher should perform. Several criteria have been proposed: game theoretic approaches [Shon et al., 2007], entropy [Lopes et al., 2009b, Melo and Lopes, 2010], query by committee [Judah et al., 2012], membership queries [Melo and Lopes, 2013], maximum classifier uncertainty [Chernova and Veloso, 2009], expected myopic gain [Cohn et al., 2010, Cohn et al., 2011] and risk minimization [Doshi et al., 2008].

One common goal is to find the correct behavior, defined as the one that matches the teacher, by repeatedly asking for the correct behavior in a given situation. This idea has been applied in situations as different as navigation [Lopes et al., 2009b, Cohn et al., 2010, Cohn et al., 2011, Melo and Lopes, 2010], simulated car driving [Chernova and Veloso, 2009] or object manipulation [Lopes et al., 2009b].

5.3.1 Learning Policies

Another learning task of interest is to acquire policies by querying an oracle. [Chernova and Veloso, 2009] used support-vector machine classifiers so that the learner queries the teacher when it is uncertain about the action to execute, as measured by the uncertainty of the classifier. They apply this uncertainty sampling perspective online, and thus only make queries in states that are actually encountered by the robot. A problem with this approach is that the information on the dynamics of the environment is not taken into account when learning the policy. To address this issue, [Melo and Lopes, 2010] proposed a method that computes a kernel based on MDP metrics [Taylor et al., 2008] that includes the information of the environment dynamics. In this way the topology of the dynamics is better preserved and thus better results can be obtained than with a simple classifier with a naive kernel. They use the method proposed by [Montesano and Lopes, 2012] to make queries where the confidence in the estimated policy is lower.
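The online query rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: a k-nearest-neighbour vote stands in for the support-vector machine classifier of the cited work, and the `teacher` function, the two-dimensional states, the action names and the threshold are all hypothetical.

```python
import math
from collections import Counter

def knn_action_confidence(demos, state, k=3):
    """Return (action, confidence) from a vote among the k nearest demonstrated states.

    Confidence is the winning vote count divided by k, so a learner that has
    seen fewer than k demonstrations can never be fully confident.
    """
    if not demos:
        return None, 0.0
    nearest = sorted(demos, key=lambda d: math.dist(d[0], state))[:k]
    votes = Counter(action for _, action in nearest)
    action, count = votes.most_common(1)[0]
    return action, count / k

def act_or_query(demos, state, teacher, threshold=0.8, k=3):
    """Act autonomously when confident; otherwise query the teacher and store the answer."""
    action, confidence = knn_action_confidence(demos, state, k)
    if confidence < threshold:
        action = teacher(state)
        demos.append((state, action))
        return action, True   # a query was made
    return action, False      # executed without asking

def teacher(state):
    # Hypothetical oracle: go left in the left half of the workspace, right otherwise.
    return "left" if state[0] < 0.5 else "right"

demos = []
# Queries are only considered in the states the robot actually encounters.
visited = [(0.1, 0.0), (0.2, 0.0), (0.9, 0.0), (0.15, 0.0),
           (0.8, 0.0), (0.85, 0.0), (0.12, 0.0), (0.88, 0.0)]
log = [act_or_query(demos, s, teacher) for s in visited]
n_queries = sum(queried for _, queried in log)
```

In this toy run the learner asks in the first six states and acts on its own in the last two, once each region has three agreeing neighbours. Note that the rule uses only classifier agreement and ignores the environment dynamics, which is the limitation the kernel-based method is meant to address.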

Directly under the inverse reinforcement learning formalism, one of the first approaches was proposed by [Lopes et al., 2009b]. After a set of demonstrations it is possible to compute the posterior distribution over rewards that explain the teacher’s behavior. By seeing each sample of the posterior distribution as a different expert, the authors took a query-by-committee perspective, allowing the learner to ask the teacher for the correct action in the state where there is the highest disagreement among the experts (or, more precisely, where the predicted policies differ the most). This work was later extended by considering not just the uncertainty on the policy but the expected reduction in the global uncertainty [Cohn et al., 2010, Cohn et al., 2011].

The learner can directly ask about the reward value at a given location [Regan and Boutilier, 2011], and it has been shown that reward queries can be combined with action queries [Melo and Lopes, 2013].
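The committee view of the reward posterior can be illustrated with a toy example. The following is a minimal sketch, not the algorithm of the cited papers: it assumes a hypothetical one-dimensional corridor in which each sampled reward places the goal in one cell, takes the sampled goals as given instead of running a posterior sampler, and uses vote entropy as the disagreement measure.

```python
import math
from collections import Counter

def policy_for_goal(goal, n_states):
    """Optimal policy in a 1-D corridor when the sampled reward sits at `goal`."""
    return {s: ("stay" if s == goal else "right" if s < goal else "left")
            for s in range(n_states)}

def vote_entropy(actions):
    """Entropy (in nats) of the committee's action votes for one state."""
    counts = Counter(actions)
    total = len(actions)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def most_disputed_state(committee, n_states):
    """Return the state where the committee's policies disagree the most."""
    scores = {s: vote_entropy([pi[s] for pi in committee]) for s in range(n_states)}
    return max(scores, key=scores.get), scores

# Hypothetical posterior samples: each one is a goal cell explaining the demonstrations.
sampled_goals = [3, 3, 4, 1]
n_states = 5
committee = [policy_for_goal(g, n_states) for g in sampled_goals]
query_state, scores = most_disputed_state(committee, n_states)
```

Here the committee agrees completely in state 0 and splits three ways in state 3, so state 3 is the one the learner would ask the teacher about. A criterion based on the expected reduction of global uncertainty would instead score each candidate query by its effect on the whole posterior.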

The previous works on active inverse reinforcement learning can be seen as a way to infer the preferences of the teacher. This problem of preference elicitation has been addressed in several domains [Fürnkranz and Hüllermeier, 2010, Chajewska et al., 2000, Braziunas and Boutilier, 2005, Viappiani and Boutilier, 2010, Brochu et al., 2007].

5.4 Online Feedback and Guidance

Another approach is to consider that the robot is always executing and that a teacher/user might interrupt it at any time and assume the command of the robot. These corrections will act as new demonstrations to be incorporated in the learning process.

The TAMER framework, and its extensions, consider how signals from humans can speed up exploration and learning in reinforcement learning tasks [Knox and Stone, 2009, Knox and Stone, 2010]. The interesting aspect is that the MDP reward is informationally poor but is sampled from the process, while the human reinforcement is rich in information but might have stronger biases. [Knox and Stone, 2009, Knox and Stone, 2010] presented the initial framework, where the agent learns to predict the human feedback and then selects actions to maximize the expected reward from the human. After learning to predict the human feedback, the agent also observes the reward from the environment. Combining both allows the robot to learn better: the information given by the user shapes the reward function [Ng et al., 1999], improving the learning rate of the agent. Recently this process was improved to allow both processes to occur simultaneously [Knox and Stone, 2012].
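The first stage described above, learning to predict the human signal and acting greedily on that prediction, can be sketched as follows. This is a simplified tabular illustration and not the implementation of [Knox and Stone, 2009]: the class `HumanFeedbackModel`, the `simulated_human` trainer, the four discrete states and the learning rate are all hypothetical, and issues such as delayed feedback are ignored.

```python
import random

class HumanFeedbackModel:
    """Tabular predictor of the reinforcement a human would give to (state, action)."""

    def __init__(self, actions, learning_rate=0.5):
        self.actions = list(actions)
        self.learning_rate = learning_rate
        self.h = {}  # (state, action) -> predicted human reinforcement

    def predict(self, state, action):
        return self.h.get((state, action), 0.0)

    def update(self, state, action, feedback):
        # Move the prediction a fraction of the way towards the observed feedback.
        old = self.predict(state, action)
        self.h[(state, action)] = old + self.learning_rate * (feedback - old)

    def best_action(self, state, rng):
        # Greedy with respect to predicted human reinforcement, ties broken at random.
        values = [self.predict(state, a) for a in self.actions]
        best = max(values)
        return rng.choice([a for a, v in zip(self.actions, values) if v == best])

def simulated_human(state, action):
    # Hypothetical trainer: approves "right" in even states and "left" in odd states.
    wanted = "right" if state % 2 == 0 else "left"
    return 1.0 if action == wanted else -1.0

rng = random.Random(0)
model = HumanFeedbackModel(actions=["left", "right"])
for _ in range(200):
    state = rng.randrange(4)
    action = model.best_action(state, rng)
    model.update(state, action, simulated_human(state, action))

learned = {s: model.best_action(s, rng) for s in range(4)}
```

Because the agent maximizes the predicted human signal directly, one negative signal is enough to make it switch actions in a state. Combining this prediction with the environment reward is a separate step that this sketch does not cover.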

It is important to ensure that the shaping made by a human does not change the task. [Zhang et al., 2009] introduced a method where the teacher is able to provide extra rewards to change the behavior of the learner while, at the same time, being subject to a limited budget on such extra rewards. Results showed that some tasks cannot be taught under a limited budget.
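One well-known sufficient condition for shaping that leaves the task unchanged is the potential-based form analysed by [Ng et al., 1999]. As a sketch, with $\Phi$ an arbitrary real-valued function over states, $\gamma$ the discount factor, $R$ the original reward and $R'$ the shaped reward (notation introduced here for illustration):

```latex
R'(s, a, s') = R(s, a, s') + F(s, s'), \qquad F(s, s') = \gamma\,\Phi(s') - \Phi(s)

\sum_{t=0}^{T-1} \gamma^{t} F(s_t, s_{t+1}) = \gamma^{T}\,\Phi(s_T) - \Phi(s_0)
```

Because the shaping terms telescope along any trajectory, the extra return depends only on the start and end states and not on the actions chosen in between. For bounded $\Phi$ and $\gamma < 1$ it tends to $-\Phi(s_0)$ as $T$ grows, so the ranking of policies is preserved. Rewards given freely by a human need not have this form, which is why they can alter the task.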

Other approaches considered that the learner can train by self-exploration and have several periods where the teacher is able to criticize its progress [Manoonpong et al., 2010, Judah et al., 2010].

Several works consider that initially the system will not show any initiative and will be operated by the user. Then, as learning progresses, the system starts acting according to the learned model while the teacher acts only when a correction, or an exception, is needed. For instance, in the dogged learning approach suggested in [Grollman and Jenkins, 2007a, Grollman and Jenkins, 2007b, Grollman and Jenkins, 2008] an AIBO robot is teleoperated and learns a policy from the user to dribble a ball towards a goal. After that training period the robot starts executing the learned policy but, at any time, the user can resume teleoperation to provide corrections. With this process a policy, encoded with a Gaussian process, can be learned with better quality. A similar approach was followed in the work of [Mason and Lopes, 2011]. The main difference is that here the robot does not learn a policy but instead learns the preferences of the user, and the interaction is done through a natural language interface. The authors consider a cleaning robot that is able to move objects in a room. Initially the robot has only a generic user profile that describes desired object locations; then, after several interactions, the robot moves the objects to the requested locations. Every time the user says that the room is clean/tidy, the robot memorizes the configuration and, through a kernel method, is able to generalize what is a clean or not clean room to different contexts. With the advent of compliant robots, the same approach can be used with corrections provided directly by moving the robot arm [Sauser et al., 2011].

An interesting aspect that has not been explored much is to consider delays in the user’s feedback. If such asynchronous behavior exists, then the agent must decide how to act while waiting for the feedback [Cohn et al., 2012].
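The correction loop shared by these systems can be sketched as below. This is an illustrative toy, not the method of any cited paper: a nearest-neighbour lookup stands in for the Gaussian process policy, states are one-dimensional distances to the ball, and the `user_override` function and the action names are hypothetical.

```python
def nearest_neighbor_policy(dataset, state):
    """Return the action stored with the closest demonstrated state (1-D states)."""
    _, action = min(dataset, key=lambda pair: abs(pair[0] - state))
    return action

def run_with_corrections(dataset, states, user_override):
    """Execute the learned policy; every user correction is stored as a new demonstration."""
    n_corrections = 0
    for state in states:
        proposed = nearest_neighbor_policy(dataset, state)
        correction = user_override(state, proposed)  # None means the user stays silent
        if correction is not None:
            dataset.append((state, correction))
            n_corrections += 1
    return n_corrections

def user_override(state, proposed):
    # Hypothetical user: wants "kick" when the ball is closer than 1.0, else "approach",
    # and only takes over when the robot's proposal differs from that.
    desired = "kick" if state < 1.0 else "approach"
    return desired if proposed != desired else None

dataset = [(5.0, "approach")]  # a single initial teleoperated demonstration
episode = [4.0, 2.0, 0.5, 0.8, 1.5, 0.9, 3.0]
first_pass = run_with_corrections(dataset, episode, user_override)
second_pass = run_with_corrections(dataset, episode, user_override)
```

In this run the user intervenes twice during the first episode and not at all during the second, which illustrates how corrections concentrate on the states where the learned policy is actually wrong.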

5.5 Ambiguous Protocols and Teacher Adaptation

In most of the previous discussion we considered that the feedback signals provided by the teacher have a semantic meaning that is known to the learner. Nevertheless, in many cases the signals provided by the teacher might be too noisy or have unknown meaning. Several of these works fall under the learning from communication framework [Klingspor et al., 1997], where a shared understanding between the robot and the teacher is fundamental to allow good interactive learning sessions.

The system in [Mohammad and Nishida, 2010] automatically learns different interaction protocols for navigation tasks, where the robot learns the actions it should make and which gestures correspond to those actions. In [Lopes et al., 2011, Grizou et al., 2013] the authors introduce a new algorithm for inverse reinforcement learning under multiple instructions with unknown symbols. At each step the learner executes an action and waits for the feedback from the user. This feedback can be understood as a correct/incorrect action, the name of the action itself or a silence. The main difficulty is that the user uses symbols that have an unknown correspondence with such feedback meanings. The learner simultaneously estimates the reward function, the feedback protocol being used and the meaning of the symbols used by the teacher. An earlier work considered such a process in isolation and proposed that learning the meaning of communication can be simplified by using the expectation from the already known task model [Kozima and Yano, 2001].
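The idea of estimating the task and the symbol meanings together can be illustrated with a toy consistency search. This is a deliberately simplified stand-in for the probabilistic estimation in the cited works: it assumes a hypothetical corridor with a few candidate goals, a teacher who only signals correct/incorrect using two unknown symbols (`"zip"` and `"zap"` are invented here), and no noise.

```python
def moves_toward(state, action, goal):
    """True if taking `action` (-1 or +1) from `state` gets closer to `goal`."""
    return abs(state + action - goal) < abs(state - goal)

def best_symbol_mapping(history, goal):
    """Best assignment of symbols to correct (True) / incorrect (False) for one task hypothesis."""
    symbols = sorted({symbol for _, _, symbol in history})
    best_score, best_mapping = -1, None
    for code in range(2 ** len(symbols)):  # enumerate every symbol-to-meaning assignment
        mapping = {sym: bool((code >> i) & 1) for i, sym in enumerate(symbols)}
        score = sum(mapping[sym] == moves_toward(s, a, goal) for s, a, sym in history)
        if score > best_score:
            best_score, best_mapping = score, mapping
    return best_score, best_mapping

def infer_task_and_symbols(history, candidate_goals):
    """Pick the task hypothesis under which the teacher's symbols are most consistent."""
    results = {g: best_symbol_mapping(history, g) for g in candidate_goals}
    best_goal = max(results, key=lambda g: results[g][0])
    return best_goal, results[best_goal][1], results

# (state, action, symbol) triples; the teacher's true goal is cell 3, unknown to the learner.
history = [(0, +1, "zip"), (2, +1, "zip"), (2, -1, "zap"),
           (0, -1, "zap"), (4, -1, "zip"), (4, +1, "zap")]
goal, mapping, results = infer_task_and_symbols(history, candidate_goals=[1, 2, 3])
```

Only under the hypothesis that the goal is cell 3 does a single symbol assignment explain all six signals, so the learner recovers both the task and the meaning of the symbols. Two hypotheses that are mirror images of each other would score equally with swapped meanings, which is one reason the real problem calls for more careful treatment than this sketch gives it.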

Other works, such as [Lauria et al., 2002, Kollar et al., 2010], consider the case of learning new instructions and guidance signals for already known tasks, thus providing more efficient commands for instructing the robot. This algorithm is different from typical learning by demonstration systems because data is acquired in an interactive and online setting. It is different from previous learning by interaction systems in the sense that the feedback signals received are unknown.

The shared understanding between the teacher and the agents needs also to include a shared knowledge of the names of states. In [Kulick et al., 2013] the authors take an active learning approach allowing the robot to learn state descriptions that are meaningful for the teacher, see Fig. 12.

5.6 Open Challenges

There are two big challenges in interactive systems. The first is to clearly understand the theoretical properties of such systems. Empirical results seem to indicate that an interactive approach is more sample efficient than any specific approach taken in isolation. Another aspect is the relation between active learning and optimal teaching, where there is not yet a clear understanding of the problems that can be learned efficiently but not taught, and vice versa. The second challenge is to accurately model the human, or more generally the cognitive/representational differences between the teacher and the learner, during the interactive learning process. This challenge includes how to create a shared representation of the problem, how to create interaction protocols, and physical interfaces, that enable such shared understanding, and how to exploit the multi-modal cues that humans provide during instruction and interaction.

6 Final Remarks

In this document we presented a general perspective on agents that, aiming at learning fast, look for the most important information required. To our knowledge, this is the first time that a unifying view of the methods and goals of different communities has been presented. Several further developments are still necessary in all these domains, but there is already an opportunity for a more multidisciplinary perspective that can give rise to more advanced methods.

References

  • [Aloimonos et al., 1988] Aloimonos, J., Weiss, I., and Bandyopadhyay, A. (1988). Active vision. Inter. Journal of Computer Vision, 1(4):333–356.
  • [Angluin, 1988] Angluin, D. (1988). Queries and concept learning. Machine Learning, 2:319–342.
  • [Argall et al., 2009] Argall, B., Chernova, S., and Veloso, M. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483.
  • [Asada et al., 2001] Asada, M., MacDorman, K., Ishiguro, H., and Kuniyoshi, Y. (2001). Cognitive developmental robotics as a new paradigm for the design of humanoid robots. Robotics and Autonomous Systems, 37:185–193.
  • [Auer et al., 2003] Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. (2003). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77.
  • [Auer et al., 2011] Auer, P., Lim, S. H., and Watkins, C. (2011). Models for autonomously motivated exploration in reinforcement learning. In Proceedings of the 22nd international conference on Algorithmic learning theory, ALT’11, pages 14–17, Berlin, Heidelberg. Springer-Verlag.
  • [Bakker and Schmidhuber, 2004] Bakker, B. and Schmidhuber, J. (2004). Hierarchical reinforcement learning based on subgoal discovery and subpolicy specialization. In Proc. of the 8-th Conf. on Intelligent Autonomous Systems, pages 438–445.
  • [Balcan et al., 2008] Balcan, M. F., Hanneke, S., and Wortman, J. (2008). The true sample complexity of active learning. In Conf. on Learning Theory (COLT).
  • [Baldassarre, 2011] Baldassarre, G. (2011). What are intrinsic motivations? a biological perspective. In Inter. Conf. on Development and Learning (ICDL’11).
  • [Baram et al., 2004] Baram, Y., El-Yaniv, R., and Luz, K. (2004). Online choice of active learning algorithms. The Journal of Machine Learning Research, 5:255–291.
  • [Baranes and Oudeyer, 2010] Baranes, A. and Oudeyer, P. (2010). Intrinsically motivated goal exploration for active motor learning in robots: A case study. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ Inter. Conf. on, pages 1766–1773.
  • [Baranes and Oudeyer, 2011] Baranes, A. and Oudeyer, P. (2011). The interaction of maturational constraints and intrinsic motivations in active motor development. In Inter. Conf. on Development and Learning (ICDL’11).
  • [Baranes and Oudeyer, 2012] Baranes, A. and Oudeyer, P. (2012). Active learning of inverse models with intrinsically motivated goal exploration in robots. Robotics and Autonomous Systems.
  • [Baranès and Oudeyer, 2009] Baranès, A. and Oudeyer, P.-Y. (2009). R-iac: Robust intrinsically motivated exploration and active learning. Autonomous Mental Development, IEEE Transactions on, 1(3):155–169.
  • [Barto and Mahadevan, 2003] Barto, A. and Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379.
  • [Barto et al., 2004] Barto, A., Singh, S., and Chentanez, N. (2004). Intrinsically motivated learning of hierarchical collections of skills. In Inter. Conf. on development and learning (ICDL’04), San Diego, USA.
  • [Baum, 1991] Baum, E. B. (1991). Neural net algorithms that learn in polynomial time from examples and queries. Neural Networks, IEEE Transactions on, 2(1):5–19.
  • [Bellman, 1952] Bellman, R. (1952). On the theory of dynamic programming. Proceedings of the National Academy of Sciences of the United States of America, 38(8):716.
  • [Berglund and Sitte, 2005] Berglund, E. and Sitte, J. (2005). Sound source localisation through active audition. In Intelligent Robots and Systems, 2005.(IROS 2005). 2005 IEEE/RSJ Inter. Conf. on, pages 653–658.
  • [Berlyne, 1960] Berlyne, D. (1960). Conflict, arousal, and curiosity. McGraw-Hill Book Company.
  • [Billing and Hellström, 2010] Billing, E. and Hellström, T. (2010). A formalism for learning from demonstration. Paladyn. Journal of Behavioral Robotics, 1(1):1–13.
  • [Bourgault et al., 2002] Bourgault, F., Makarenko, A., Williams, S., Grocholsky, B., and Durrant-Whyte, H. (2002). Information based adaptive robotic exploration. In IEEE/RSJ Conf. on Intelligent Robots and Systems (IROS).
  • [Brafman and Tennenholtz, 2003] Brafman, R. and Tennenholtz, M. (2003). R-max - a general polynomial time algorithm for near-optimal reinforcement learning. The Journal of Machine Learning Research, 3:213–231.
  • [Braziunas and Boutilier, 2005] Braziunas, D. and Boutilier, C. (2005). Local utility elicitation in gai models. In Twenty-first Conf. on Uncertainty in Artificial Intelligence, pages 42–49.
  • [Breazeal et al., 2004] Breazeal, C., Brooks, A., Gray, J., Hoffman, G., Lieberman, J., Lee, H., Thomaz, A. L., and Mulanda, D. (2004). Tutelage and collaboration for humanoid robots. Inter. Journal of Humanoid Robotics, 1(2).
  • [Brochu et al., 2010] Brochu, E., Cora, V., and De Freitas, N. (2010). A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Arxiv preprint arXiv:1012.2599.
  • [Brochu et al., 2007] Brochu, E., de Freitas, N., and Ghosh, A. (2007). Active preference learning with discrete choice data. In Advances in Neural Information Processing Systems.
  • [Bubeck and Cesa-Bianchi, 2012] Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Stochastic Systems, 1(4).
  • [Byrne, 2002] Byrne, R. W. (2002). Seeing actions as hierarchically organised structures: great ape manual skills. In The Imitative Mind. Cambridge University Press.
  • [Cakmak et al., 2010a] Cakmak, M., Chao, C., and Thomaz, A. (2010a). Designing interactions for robot active learners. IEEE Transactions on Autonomous Mental Development, 2(2):108–118.
  • [Cakmak et al., 2010b] Cakmak, M., DePalma, N., Arriaga, R., and Thomaz, A. (2010b). Exploiting social partners in robot learning. Autonomous Robots.
  • [Cakmak and Lopes, 2012] Cakmak, M. and Lopes, M. (2012). Algorithmic and human teaching of sequential decision tasks. In AAAI Conference on Artificial Intelligence (AAAI’12), Toronto, Canada.
  • [Cakmak and Thomaz, 2010] Cakmak, M. and Thomaz, A. (2010). Optimality of human teachers for robot learners. In Inter. Conf. on Development and Learning (ICDL).
  • [Cakmak and Thomaz, 2011] Cakmak, M. and Thomaz, A. (2011). Active learning with mixed query types in learning from demonstration. In Proc. of the ICML Workshop on New Developments in Imitation Learning.
  • [Cakmak and Thomaz, 2012] Cakmak, M. and Thomaz, A. (2012). Designing robot learners that ask good questions. In 7th ACM/IEEE Inter. Conf. on Human-Robot Interaction.
  • [Calinon et al., 2007] Calinon, S., Guenter, F., and Billard, A. (2007). On learning, representing and generalizing a task in a humanoid robot. IEEE Transactions on Systems, Man and Cybernetics, Part B. Special issue on robot learning by observation, demonstration and imitation, 37(2):286–298.
  • [Carpentier et al., 2011] Carpentier, A., Ghavamzadeh, M., Lazaric, A., Munos, R., and Auer, P. (2011). Upper confidence bounds algorithms for active learning in multi-armed bandits. In Algorithmic Learning Theory.
  • [Castro and Novak, 2008] Castro, R. and Novak, R. (2008). Minimax bounds for active learning. IEEE Trans. on Information Theory, 54(5):2339–2353.
  • [Castronovo et al., 2012] Castronovo, M., Maes, F., Fonteneau, R., and Ernst, D. (2012). Learning exploration/exploitation strategies for single trajectory reinforcement learning. 10th European Workshop on Reinforcement Learning (EWRL 2012).
  • [Chajewska et al., 2000] Chajewska, U., Koller, D., and Parr, R. (2000). Making rational decisions using adaptive utility elicitation. In National Conf. on Artificial Intelligence, pages 363–369. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999.
  • [Chao et al., 2010] Chao, C., Cakmak, M., and Thomaz, A. (2010). Transparent active learning for robots. In Human-Robot Interaction (HRI), 2010 5th ACM/IEEE Inter. Conf. on, pages 317–324.
  • [Chernova and Veloso, 2009] Chernova, S. and Veloso, M. (2009). Interactive policy learning through confidence-based autonomy. J. Artificial Intelligence Research, 34:1–25.
  • [Cohn et al., 1994] Cohn, D., Atlas, L., and Ladner, R. (1994). Improving generalization with active learning. Machine Learning, 15(2):201–221.
  • [Cohn et al., 1996] Cohn, D. A., Ghahramani, Z., and Jordan, M. I. (1996). Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145.
  • [Cohn et al., 2011] Cohn, R., Durfee, E., and Singh, S. (2011). Comparing action-query strategies in semi-autonomous agents. In Inter. Conf. on Autonomous Agents and Multiagent Systems.
  • [Cohn et al., 2012] Cohn, R., Durfee, E., and Singh, S. (2012). Planning delayed-response queries and transient policies under reward uncertainty. Seventh Annual Workshop on Multiagent Sequential Decision Making Under Uncertainty (MSDM-2012), page 17.
  • [Cohn et al., 2010] Cohn, R., Maxim, M., Durfee, E., and Singh, S. (2010). Selecting Operator Queries using Expected Myopic Gain. In 2010 IEEE/WIC/ACM Inter. Conf. on Web Intelligence and Intelligent Agent Technology, pages 40–47.
  • [Şimşek and Barto, 2004] Şimşek, O. and Barto, A. G. (2004). Using relative novelty to identify useful temporal abstractions in reinforcement learning. In Inter. Conf. on Machine Learning.
  • [Dasgupta, 2005] Dasgupta, S. (2005). Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems (NIPS), pages 337–344.
  • [Dasgupta, 2011] Dasgupta, S. (2011). Two faces of active learning. Theoretical computer science, 412(19):1767–1781.
  • [Dearden et al., 1998] Dearden, R., Friedman, N., and Russell, S. (1998). Bayesian q-learning. In AAAI Conf. on Artificial Intelligence, pages 761–768.
  • [Deisenroth et al., 2013] Deisenroth, M., Neumann, G., and Peters, J. (2013). A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1–2).
  • [Detry et al., 2009] Detry, R., Baseski, E., Popović, M., Touati, Y., Kruger, N., Kroemer, O., Peters, J., and Piater, J. (2009). Learning object-specific grasp affordance densities. In IEEE 8th Inter. Conf. on Development and Learning.
  • [Digney, 1998] Digney, B. (1998). Learning hierarchical control structures for multiple tasks and changing environments. In fifth Inter. Conf. on simulation of adaptive behavior on From animals to animats, volume 5, pages 321–330.
  • [Dillmann, 2004] Dillmann, R. (2004). Teaching and learning of robot tasks via observation of human performance. Robotics and Autonomous Systems, 47(2):109–116.
  • [Dillmann et al., 2000] Dillmann, R., Rogalla, O., Ehrenmann, M., Zollner, R., and Bordegoni, M. (2000). Learning robot behaviour and skills based on human demonstration and advice: the machine learning paradigm. In Inter. Symposium on Robotics Research (ISRR), volume 9, pages 229–238.
  • [Dillmann et al., 2002] Dillmann, R., Zöllner, R., Ehrenmann, M., Rogalla, O., et al. (2002). Interactive natural programming of robots: Introductory overview. In Tsukuba Research Center, AIST. Citeseer.
  • [Dima and Hebert, 2005] Dima, C. and Hebert, M. (2005). Active learning for outdoor obstacle detection. In Robotics Science and Systems Conf., Cambridge, MA.
  • [Dima et al., 2004] Dima, C., Hebert, M., and Stentz, A. (2004). Enabling learning from large datasets: Applying active learning to mobile robotics. In Robotics and Automation, 2004. Proceedings. ICRA’04. 2004 IEEE Inter. Conf. on, volume 1, pages 108–114.
  • [Dorigo and Colombetti, 1994] Dorigo, M. and Colombetti, M. (1994). Robot shaping: Developing autonomous agents through learning. Artificial intelligence, 71(2):321–370.
  • [Doshi et al., 2008] Doshi, F., Pineau, J., and Roy, N. (2008). Reinforcement learning with limited reinforcement: using bayes risk for active learning in pomdps. In 25th Inter. Conf. on Machine learning (ICML’08), pages 256–263.
  • [Duff, 2003] Duff, M. (2003). Design for an optimal probe. In Inter. Conf. on Machine Learning.
  • [Ekvall and Kragic, 2004] Ekvall, S. and Kragic, D. (2004). Interactive grasp learning based on human demonstration. In Robotics and Automation, 2004. Proceedings. ICRA’04. 2004 IEEE Inter. Conf. on, volume 4, pages 3519–3524.
  • [Elman, 1997] Elman, J. (1997). Rethinking innateness: A connectionist perspective on development, volume 10. The MIT press.
  • [Fails and Olsen Jr, 2003] Fails, J. and Olsen Jr, D. (2003). Interactive machine learning. In 8th Inter. Conf. on Intelligent user interfaces, pages 39–45.
  • [Feder et al., 1999] Feder, H. J. S., Leonard, J. J., and Smith, C. M. (1999). Adaptive mobile robot navigation and mapping. International Journal of Robotics Research, 18(7):650–668.
  • [Fitzpatrick et al., 2003] Fitzpatrick, P., Metta, G., Natale, L., Rao, S., and Sandini, G. (2003). Learning about objects through action: Initial steps towards artificial cognition. In IEEE Inter. Conf. on Robotics and Automation, Taipei, Taiwan.
  • [Fong et al., 2003] Fong, T., Thorpe, C., and Baur, C. (2003). Robot, asker of questions. Robotics and Autonomous systems, 42(3):235–243.
  • [Fox et al., 1998] Fox, D., Burgard, W., and Thrun, S. (1998). Active markov localization for mobile robots. Robotics and Autonomous Systems, 25(3):195–207.
  • [Fox and Tennenholtz, 2007] Fox, R. and Tennenholtz, M. (2007). A reinforcement learning algorithm with polynomial interaction complexity for only-costly-observable mdps. In National Conf. on Artificial Intelligence (AAAI).
  • [Francke et al., 2007] Francke, H., Ruiz-del Solar, J., and Verschae, R. (2007). Real-time hand gesture detection and recognition using boosted classifiers and active learning. Advances in Image and Video Technology, pages 533–547.
  • [Freund et al., 1997] Freund, Y., Seung, H., Shamir, E., and Tishby, N. (1997). Selective sampling using the query by committee algorithm. Machine learning, 28(2):133–168.
  • [Fürnkranz and Hüllermeier, 2010] Fürnkranz, J. and Hüllermeier, E. (2010). Preference learning: An introduction. Preference Learning, page 1.
  • [Garg et al., 2012] Garg, S., Singh, A., and Ramos, F. (2012). Efficient space-time modeling for informative sensing. In Sixth Inter. Workshop on Knowledge Discovery from Sensor Data, pages 52–60.
  • [Gilks and Berzuini, 2002] Gilks, W. and Berzuini, C. (2002). Following a moving target: Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(1):127–146.
  • [Gittins, 1979] Gittins, J. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society. Series B (Methodological), pages 148–177.
  • [Golovin et al., 2010a] Golovin, D., Faulkner, M., and Krause, A. (2010a). Online distributed sensor selection. In Proc. ACM/IEEE Inter. Conf. on Information Processing in Sensor Networks (IPSN).
  • [Golovin and Krause, 2010] Golovin, D. and Krause, A. (2010). Adaptive submodularity: A new approach to active learning and stochastic optimization. In Proc. Inter. Conf. on Learning Theory (COLT).
  • [Golovin et al., 2010b] Golovin, D., Krause, A., and Ray, D. (2010b). Near-optimal bayesian active learning with noisy observations. In Proc. Neural Information Processing Systems (NIPS).
  • [Gottlieb, 2012] Gottlieb, J. (2012). Attention, learning, and the value of information. Neuron, 76(2):281–295.
  • [Gottlieb et al., 2013] Gottlieb, J., Oudeyer, P.-Y., Lopes, M., and Baranes, A. (2013). Information seeking, curiosity and attention: computational and empirical mechanisms. Trends in Cognitive Sciences.
  • [Grizou et al., 2013] Grizou, J., Lopes, M., and Oudeyer, P.-Y. (2013). Robot Learning Simultaneously a Task and How to Interpret Human Instructions. In Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics (ICDL-EpiRob), Osaka, Japan.
  • [Grollman and Jenkins, 2007a] Grollman, D. and Jenkins, O. (2007a). Dogged learning for robots. In Robotics and Automation, 2007 IEEE Inter. Conf. on, pages 2483–2488.
  • [Grollman and Jenkins, 2007b] Grollman, D. and Jenkins, O. (2007b). Learning robot soccer skills from demonstration. In Development and Learning, 2007. ICDL 2007. IEEE 6th Inter. Conf. on, pages 276–281.
  • [Grollman and Jenkins, 2008] Grollman, D. and Jenkins, O. (2008). Sparse incremental learning for interactive robot control policy estimation. In Robotics and Automation, 2008. ICRA 2008. IEEE Inter. Conf. on, pages 3315–3320.
  • [Guyon and Elisseeff, 2003] Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182.
  • [Hart and Grupen, 2013] Hart, S. and Grupen, R. (2013). Intrinsically motivated affordance discovery and modeling. In Intrinsically Motivated Learning in Natural and Artificial Systems, pages 279–300. Springer.
  • [Hart et al., 2008] Hart, S., Sen, S., and Grupen, R. (2008). Intrinsically motivated hierarchical manipulation. In 2008 IEEE Conf. on Robots and Automation (ICRA), Pasadena, California.
  • [Hengst, 2002] Hengst, B. (2002). Discovering hierarchy in reinforcement learning with HEXQ. In Inter. Conf. on Machine Learning, pages 243–250.
  • [Hester et al., 2013] Hester, T., Lopes, M., and Stone, P. (2013). Learning exploration strategies. In AAMAS, USA.
  • [Hester and Stone, 2011] Hester, T. and Stone, P. (2011). Reinforcement Learning: State-of-the-Art, chapter Learning and Using Models. Springer.
  • [Hester and Stone, 2012] Hester, T. and Stone, P. (2012). Intrinsically motivated model learning for a developing curious agent. In AAMAS Workshop on Adaptive Learning Agents.
  • [Hoffman et al., 2011] Hoffman, M., Brochu, E., and de Freitas, N. (2011). Portfolio allocation for bayesian optimization. In Uncertainty in artificial intelligence, pages 327–336.
  • [Jaksch et al., 2010] Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. J. Mach. Learn. Res., 11:1563–1600.
  • [Jamone et al., 2011] Jamone, L., Natale, L., Hashimoto, K., Sandini, G., and Takanishi, A. (2011). Learning task space control through goal directed exploration. In Inter. Conf. on Robotics and Biomimetics (ROBIO’11).
  • [Jaulmes et al., 2005] Jaulmes, R., Pineau, J., and Precup, D. (2005). Active learning in partially observable markov decision processes. In NIPS Workshop on Value of Information in Inference, Learning and Decision-Making.
  • [Jonsson and Barto, 2006] Jonsson, A. and Barto, A. (2006). Causal graph based decomposition of factored mdps. The Journal of Machine Learning Research, 7:2259–2301.
  • [Judah et al., 2012] Judah, K., Fern, A., and Dietterich, T. (2012). Active imitation learning via reduction to iid active learning. In UAI.
  • [Judah et al., 2010] Judah, K., Roy, S., Fern, A., and Dietterich, T. (2010). Reinforcement learning via practice and critique advice. In AAAI Conf. on Artificial Intelligence (AAAI-10).
  • [Jung and Stone, 2010] Jung, T. and Stone, P. (2010). Gaussian processes for sample efficient reinforcement learning with rmax-like exploration. Machine Learning and Knowledge Discovery in Databases, pages 601–616.
  • [Kaelbling et al., 1998] Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1):99–134.
  • [Kaelbling et al., 1996] Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: A survey. J. Artificial Intelligence Research, 4:237–285.
  • [Kaochar et al., 2011] Kaochar, T., Peralta, R., Morrison, C., Fasel, I., Walsh, T., and Cohen, P. (2011). Towards understanding how humans teach robots. User Modeling, Adaption and Personalization, pages 347–352.
  • [Kapoor et al., 2007] Kapoor, A., Grauman, K., Urtasun, R., and Darrell, T. (2007). Active learning with gaussian processes for object categorization. In IEEE 11th Inter. Conf. on Computer Vision.
  • [Katagami and Yamada, 2000] Katagami, D. and Yamada, S. (2000). Interactive classifier system for real robot learning. In Robot and Human Interactive Communication, 2000. RO-MAN 2000. Proceedings. 9th IEEE Inter. Workshop on, pages 258–263.
  • [Katz et al., 2008] Katz, D., Pyuro, Y., and Brock, O. (2008). Learning to manipulate articulated objects in unstructured environments using a grounded relational representation. In RSS - Robotics Science and Systems IV, Zurich, Switzerland.
  • [Kearns and Singh, 2002] Kearns, M. and Singh, S. (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2):209–232.
  • [King et al., 2004] King, R., Whelan, K., Jones, F., Reiser, P., Bryant, C., Muggleton, S., Kell, D., and Oliver, S. (2004). Functional genomic hypothesis generation and experimentation by a robot scientist. Nature, 427(6971):247–252.
  • [Klingspor et al., 1997] Klingspor, V., Demiris, J., and Kaiser, M. (1997). Human-robot communication and machine learning. Applied Artificial Intelligence, 11(7):719–746.
  • [Kneebone and Dearden, 2009] Kneebone, M. and Dearden, R. (2009). Navigation planning in probabilistic roadmaps with uncertainty. ICAPS. AAAI.
  • [Knox et al., 2012] Knox, W., Glass, B., Love, B., Maddox, W., and Stone, P. (2012). How humans teach agents: A new experimental perspective. Inter. Journal of Social Robotics, Special Issue on Robot Learning from Demonstration.
  • [Knox and Stone, 2009] Knox, W. and Stone, P. (2009). Interactively shaping agents via human reinforcement: The tamer framework. In fifth Inter. Conf. on Knowledge capture, pages 9–16.
  • [Knox and Stone, 2010] Knox, W. and Stone, P. (2010). Combining manual feedback with subsequent mdp reward signals for reinforcement learning. In 9th Inter. Conf. on Autonomous Agents and Multiagent Systems (AAMAS’10), pages 5–12.
  • [Knox and Stone, 2012] Knox, W. and Stone, P. (2012). Reinforcement learning from simultaneous human and mdp reward. In 11th Inter. Conf. on Autonomous Agents and Multiagent Systems.
  • [Kober et al., 2013] Kober, J., Bagnell, D., and Peters, J. (2013). Reinforcement learning in robotics: a survey. Inter. Journal of Robotics Research, 32(11):1236–1272.
  • [Kollar et al., 2010] Kollar, T., Tellex, S., Roy, D., and Roy, N. (2010). Grounding verbs of motion in natural language commands to robots. In Inter. Symposium on Experimental Robotics (ISER), New Delhi, India.
  • [Kolter and Ng, 2009] Kolter, J. and Ng, A. (2009). Near-bayesian exploration in polynomial time. In 26th Annual Inter. Conf. on Machine Learning, pages 513–520.
  • [Konidaris and Barto, 2008] Konidaris, G. and Barto, A. (2008). Sensorimotor abstraction selection for efficient, autonomous robot skill acquisition. In Inter. Conf. on Development and Learning (ICDL’08).
  • [Korupolu et al., 2012] Korupolu, V. N. P., Sivamurugan, M., and Ravindran, B. (2012). Instructing a reinforcement learner. In Twenty-Fifth Inter. FLAIRS Conf.
  • [Kozima and Yano, 2001] Kozima, H. and Yano, H. (2001). A robot that learns to communicate with human caregivers. In First Inter. Workshop on Epigenetic Robotics, pages 47–52.
  • [Krause and Guestrin, 2005] Krause, A. and Guestrin, C. (2005). Near-optimal nonmyopic value of information in graphical models. In Uncertainty in AI.
  • [Krause and Guestrin, 2007] Krause, A. and Guestrin, C. (2007). Nonmyopic active learning of gaussian processes: an exploration-exploitation approach. In 24th Inter. Conf. on Machine learning.
  • [Krause et al., 2008] Krause, A., Singh, A., and Guestrin, C. (2008). Near-optimal sensor placements in gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9:235–284.
  • [Kristensen et al., 1999] Kristensen, S., Hansen, V., Horstmann, S., Klandt, J., Kondak, K., Lohnert, F., and Stopp, A. (1999). Interactive learning of world model information for a service robot. Sensor Based Intelligent Robots, pages 49–67.
  • [Kroemer et al., 2009] Kroemer, O., Detry, R., Piater, J., and Peters, J. (2009). Active learning using mean shift optimization for robot grasping. In Intelligent Robots and Systems, 2009. IROS 2009. IEEE/RSJ Inter. Conf. on, pages 2610–2615.
  • [Kroemer et al., 2010] Kroemer, O., Detry, R., Piater, J., and Peters, J. (2010). Combining active learning and reactive control for robot grasping. Robotics and Autonomous Systems, 58(9):1105–1116.
  • [Kulick et al., 2013] Kulick, J., Toussaint, M., Lang, T., and Lopes, M. (2013). Active learning for teaching a robot grounded relational symbols. In Inter. Joint Conference on Artificial Intelligence (IJCAI’13), Beijing, China.
  • [Lang et al., 2010] Lang, T., Toussaint, M., and Kersting, K. (2010). Exploration in relational worlds. Machine Learning and Knowledge Discovery in Databases, pages 178–194.
  • [Lapeyre et al., 2011] Lapeyre, M., Ly, O., and Oudeyer, P. (2011). Maturational constraints for motor learning in high-dimensions: the case of biped walking. In Inter. Conf. on Humanoid Robots (Humanoids’11), pages 707–714.
  • [Lauria et al., 2002] Lauria, S., Bugmann, G., Kyriacou, T., and Klein, E. (2002). Mobile robot programming using natural language. Robotics and Autonomous Systems, 38(3-4):171–181.
  • [Lee and Xu, 1996] Lee, C. and Xu, Y. (1996). Online, interactive learning of gestures for human/robot interfaces. In Robotics and Automation, 1996. Proceedings., 1996 IEEE Inter. Conf. on, volume 4, pages 2982–2987.
  • [Lee et al., 2007] Lee, M., Meng, Q., and Chao, F. (2007). Staged competence learning in developmental robotics. Adaptive Behavior, 15(3):241–255.
  • [Lehman and Stanley, 2011] Lehman, J. and Stanley, K. (2011). Abandoning objectives: Evolution through the search for novelty alone. Evolutionary Computation, 19(2):189–223.
  • [Lewis and Gale, 1994] Lewis, D. and Gale, W. (1994). A sequential algorithm for training text classifiers. In 17th annual Inter. ACM SIGIR Conf. on Research and development in information retrieval, pages 3–12. Springer-Verlag New York, Inc.
  • [Lim and Auer, 2012] Lim, S. and Auer, P. (2012). Autonomous exploration for navigating in mdps. JMLR.
  • [Linder et al., 2001] Linder, S., Nestrick, B., Mulders, S., and Lavelle, C. (2001). Facilitating active learning with inexpensive mobile robots. Journal of Computing Sciences in Colleges, 16(4):21–33.
  • [Lockerd and Breazeal, 2004] Lockerd, A. and Breazeal, C. (2004). Tutelage and socially guided robot learning. In Intelligent Robots and Systems, 2004 (IROS 2004). Proceedings. 2004 IEEE/RSJ Inter. Conf. on, volume 4, pages 3475–3480.
  • [Lopes et al., 2011] Lopes, M., Cederborg, T., and Oudeyer, P.-Y. (2011). Simultaneous acquisition of task and feedback models. In IEEE - International Conference on Development and Learning (ICDL’11), Frankfurt, Germany.
  • [Lopes et al., 2012] Lopes, M., Lang, T., Toussaint, M., and Oudeyer, P.-Y. (2012). Exploration in model-based reinforcement learning by empirically estimating learning progress. In Neural Information Processing Systems (NIPS’12), Tahoe, USA.
  • [Lopes et al., 2009a] Lopes, M., Melo, F., Kenward, B., and Santos-Victor, J. (2009a). A computational model of social-learning mechanisms. Adaptive Behavior, 17:467.
  • [Lopes et al., 2010] Lopes, M., Melo, F., Montesano, L., and Santos-Victor, J. (2010). Abstraction levels for robotic imitation: Overview and computational approaches. In Sigaud, O. and Peters, J., editors, From Motor to Interaction Learning in Robots, volume 264 of Studies in Computational Intelligence, pages 313–355. Springer.
  • [Lopes et al., 2007] Lopes, M., Melo, F. S., and Montesano, L. (2007). Affordance-based imitation learning in robots. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’07), USA.
  • [Lopes et al., 2009b] Lopes, M., Melo, F. S., and Montesano, L. (2009b). Active learning for reward estimation in inverse reinforcement learning. In Machine Learning and Knowledge Discovery in Databases (ECML/PKDD’09).
  • [Lopes and Oudeyer, 2012] Lopes, M. and Oudeyer, P.-Y. (2012). The strategic student approach for life-long exploration and learning. In IEEE International Conference on Development and Learning (ICDL), San Diego, USA.
  • [Lopes and Santos-Victor, 2007] Lopes, M. and Santos-Victor, J. (2007). A developmental roadmap for learning by imitation in robots. IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, 37(2).
  • [Luciw et al., 2011] Luciw, M., Graziano, V., Ring, M., and Schmidhuber, J. (2011). Artificial curiosity with planning for autonomous perceptual and cognitive development. In Inter. Conf. on Development and Learning (ICDL’11).
  • [Lungarella et al., 2003] Lungarella, M., Metta, G., Pfeifer, R., and Sandini, G. (2003). Developmental robotics: a survey. Connection Science, 15(4):151–190.
  • [Lutkebohle et al., 2009] Lutkebohle, I., Peltason, J., Schillingmann, L., Wrede, B., Wachsmuth, S., Elbrechter, C., and Haschke, R. (2009). The curious robot-structuring interactive robot learning. In Robotics and Automation, 2009. ICRA’09. IEEE Inter. Conf. on, pages 4156–4162.
  • [MacKay, 1992] MacKay, D. (1992). Information-based objective functions for active data selection. Neural computation, 4(4):590–604.
  • [Maillard, 2012] Maillard, O. (2012). Hierarchical optimistic region selection driven by curiosity. In Advances in Neural Information Processing Systems.
  • [Maillard et al., 2011] Maillard, O. A., Munos, R., and Ryabko, D. (2011). Selecting the state-representation in reinforcement learning. In Advances in Neural Information Processing Systems.
  • [Mannor et al., 2004] Mannor, S., Menache, I., Hoze, A., and Klein, U. (2004). Dynamic abstraction in reinforcement learning via clustering. In Inter. Conf. on Machine Learning, page 71.
  • [Manoonpong et al., 2010] Manoonpong, P., Wörgötter, F., and Morimoto, J. (2010). Extraction of reward-related feature space using correlation-based and reward-based learning methods. In 17th Inter. Conf. on Neural information processing: theory and algorithms - Volume Part I, ICONIP’10, pages 414–421, Berlin, Heidelberg. Springer-Verlag.
  • [Marchant and Ramos, 2012] Marchant, R. and Ramos, F. (2012). Bayesian optimisation for intelligent environmental monitoring. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ Inter. Conf. on, pages 2242–2249.
  • [Martinez-Cantin et al., 2009] Martinez-Cantin, R., de Freitas, N., Brochu, E., Castellanos, J., and Doucet, A. (2009). A Bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot. Autonomous Robots - Special Issue on Robot Learning, Part B.
  • [Martinez-Cantin et al., 2007] Martinez-Cantin, R., de Freitas, N., Doucet, A., and Castellanos, J. (2007). Active policy learning for robot planning and exploration under uncertainty. In Robotics: Science and Systems (RSS).
  • [Martinez-Cantin et al., 2010] Martinez-Cantin, R., Lopes, M., and Montesano, L. (2010). Body schema acquisition through active learning. In IEEE International Conference on Robotics and Automation (ICRA’10), Alaska, USA.
  • [Mason and Lopes, 2011] Mason, M. and Lopes, M. (2011). Robot self-initiative and personalization by learning through repeated interactions. In 6th ACM/IEEE International Conference on Human-Robot Interaction (HRI’11).
  • [McGovern and Barto, 2001] McGovern, A. and Barto, A. G. (2001). Automatic discovery of subgoals in reinforcement learning using diverse density. In Inter. Conf. on Machine Learning (ICML’01), San Francisco, CA, USA.
  • [Meger et al., 2008] Meger, D., Forssén, P., Lai, K., Helmer, S., McCann, S., Southey, T., Baumann, M., Little, J., and Lowe, D. (2008). Curious george: An attentive semantic robot. Robotics and Autonomous Systems, 56(6):503–511.
  • [Melo et al., 2007] Melo, F., Lopes, M., Santos-Victor, J., and Ribeiro, M. I. (2007). A unified framework for imitation-like behaviors. In 4th International Symposium on Imitation in Animals and Artifacts, Newcastle, UK.
  • [Melo and Lopes, 2010] Melo, F. S. and Lopes, M. (2010). Learning from demonstration using mdp induced metrics. In Machine learning and knowledge discovery in databases (ECML/PKDD’10).
  • [Melo and Lopes, 2013] Melo, F. S. and Lopes, M. (2013). Multi-class generalized binary search for active inverse reinforcement learning. submitted to Machine Learning.
  • [Menache et al., 2002] Menache, I., Mannor, S., and Shimkin, N. (2002). Q-cut: dynamic discovery of sub-goals in reinforcement learning. Machine Learning: ECML 2002, pages 187–195.
  • [Meng and Lee, 2008] Meng, Q. and Lee, M. (2008). Error-driven active learning in growing radial basis function networks for early robot learning. Neurocomputing, 71(7):1449–1461.
  • [Modayil and Kuipers, 2007] Modayil, J. and Kuipers, B. (2007). Autonomous development of a grounded object ontology by a learning robot. In National Conf. on Artificial Intelligence (AAAI).
  • [Mohammad and Nishida, 2010] Mohammad, Y. and Nishida, T. (2010). Learning interaction protocols using Augmented Bayesian Networks applied to guided navigation. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ Inter. Conf. on, pages 4119–4126.
  • [Moldovan and Abbeel, 2012] Moldovan, T. M. and Abbeel, P. (2012). Safe exploration in markov decision processes. CoRR, abs/1205.4810.
  • [Montesano and Lopes, 2009] Montesano, L. and Lopes, M. (2009). Learning grasping affordances from local visual descriptors. In IEEE International Conference on Development and Learning (ICDL’09), China.
  • [Montesano and Lopes, 2012] Montesano, L. and Lopes, M. (2012). Active learning of visual descriptors for grasping using non-parametric smoothed beta distributions. Robotics and Autonomous Systems, 60(3):452–462.
  • [Morales et al., 2004] Morales, A., Chinellato, E., Fagg, A., and del Pobil, A. (2004). An active learning approach for assessing robot grasp reliability. In IEEE/RSJ Inter. Conf. on Intelligent Robots and Systems (IROS 2004).
  • [Moskovitch et al., 2007] Moskovitch, R., Nissim, N., Stopel, D., Feher, C., Englert, R., and Elovici, Y. (2007). Improving the detection of unknown computer worms activity using active learning. In KI 2007: Advances in Artificial Intelligence, pages 489–493. Springer.
  • [Mouret and Doncieux, 2011] Mouret, J. and Doncieux, S. (2011). Encouraging behavioral diversity in evolutionary robotics: an empirical study. Evolutionary Computation.
  • [Nagai et al., 2006] Nagai, Y., Asada, M., and Hosoda, K. (2006). Learning for joint attention helped by functional development. Advanced Robotics, 20(10):1165–1181.
  • [Nagai and Rohlfing, 2009] Nagai, Y. and Rohlfing, K. (2009). Computational analysis of motionese toward scaffolding robot action learning. Autonomous Mental Development, IEEE Transactions on, 1(1):44–54.
  • [Nehaniv, 2007] Nehaniv, C. L. (2007). Nine billion correspondence problems. Cambridge University Press.
  • [Nemhauser et al., 1978] Nemhauser, G., Wolsey, L., and Fisher, M. (1978). An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 14(1):265–294.
  • [Ng et al., 1999] Ng, A. Y., Harada, D., and Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. In Inter. Conf. on Machine Learning.
  • [Nguyen and Oudeyer, 2012] Nguyen, M. and Oudeyer, P.-Y. (2012). Interactive learning gives the tempo to an intrinsically motivated robot learner. In IEEE-RAS Inter. Conf. on Humanoid Robots.
  • [Nguyen et al., 2011] Nguyen, S., Baranes, A., and Oudeyer, P. (2011). Bootstrapping intrinsically motivated learning with human demonstration. In Inter. Conf. on Development and Learning (ICDL’11).
  • [Nguyen-Tuong and Peters, 2011] Nguyen-Tuong, D. and Peters, J. (2011). Model learning for robot control: a survey. Cognitive Processing, 12(4):319–340.
  • [Nicolescu and Mataric, 2001] Nicolescu, M. and Mataric, M. (2001). Learning and interacting in human-robot domains. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on, 31(5):419–430.
  • [Nicolescu and Mataric, 2003] Nicolescu, M. and Mataric, M. (2003). Natural methods for robot task learning: Instructive demonstrations, generalization and practice. In second Inter. joint Conf. on Autonomous agents and multiagent systems, pages 241–248.
  • [Noble and Franks, 2002] Noble, J. and Franks, D. W. (2002). Social learning mechanisms compared in a simple environment. In Artificial Life VIII: Eighth Inter. Conf. on the Simulation and Synthesis of Living Systems, pages 379–385. MIT Press.
  • [Nouri and Littman, 2010] Nouri, A. and Littman, M. (2010). Dimension reduction and its application to model-based exploration in continuous spaces. Machine learning, 81(1):85–98.
  • [Nowak, 2011] Nowak, R. (2011). The geometry of generalized binary search. Information Theory, Transactions on, 57(12):7893–7906.
  • [Ogata et al., 2003] Ogata, T., Masago, N., Sugano, S., and Tani, J. (2003). Interactive learning in human-robot collaboration. In Intelligent Robots and Systems, 2003 (IROS 2003). Proceedings. 2003 IEEE/RSJ Inter. Conf. on, volume 1, pages 162–167.
  • [Ortner, 2007] Auer, P. and Ortner, R. (2007). Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems (NIPS).
  • [Oudeyer and Kaplan, 2007] Oudeyer, P. and Kaplan, F. (2007). What is intrinsic motivation? a typology of computational approaches. Frontiers in Neurorobotics, 1.
  • [Oudeyer, 2011] Oudeyer, P.-Y. (2011). Developmental Robotics. In Seel, N., editor, Encyclopedia of the Sciences of Learning, Springer Reference Series. Springer.
  • [Oudeyer et al., 2013] Oudeyer, P.-Y., Baranes, A., and Kaplan, F. (2013). Intrinsically motivated learning of real world sensorimotor skills with developmental constraints. In Baldassarre, G. and Mirolli, M., editors, Intrinsically Motivated Learning in Natural and Artificial Systems. Springer.
  • [Oudeyer et al., 2007] Oudeyer, P.-Y., Kaplan, F., and Hafner, V. (2007). Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286.
  • [Oudeyer et al., 2005] Oudeyer, P.-Y., Kaplan, F., Hafner, V., and Whyte, A. (2005). The playground experiment: Task-independent development of a curious robot. In AAAI Spring Symposium on Developmental Robotics, pages 42–47.
  • [Peters et al., 2005] Peters, J., Vijayakumar, S., and Schaal, S. (2005). Natural Actor-Critic. In Proc. 16th European Conf. Machine Learning, pages 280–291.
  • [Pickett and Barto, 2002] Pickett, M. and Barto, A. (2002). Policyblocks: An algorithm for creating useful macro-actions in reinforcement learning. In Inter. Conf. on Machine Learning, pages 506–513.
  • [Pierce and Kuipers, 1995] Pierce, D. and Kuipers, B. (1995). Learning to explore and build maps. In National Conf. on Artificial Intelligence, pages 1264–1264.
  • [Pomerleau, 1992] Pomerleau, D. (1992). Neural network perception for mobile robot guidance. Technical report, DTIC Document.
  • [Poupart et al., 2006] Poupart, P., Vlassis, N., Hoey, J., and Regan, K. (2006). An analytic solution to discrete bayesian reinforcement learning. In Inter. Conf. on Machine learning, pages 697–704.
  • [Price and Boutilier, 2003] Price, B. and Boutilier, C. (2003). Accelerating reinforcement learning through implicit imitation. J. Artificial Intelligence Research, 19:569–629.
  • [Qi et al., 2008] Qi, G., Hua, X., Rui, Y., Tang, J., and Zhang, H. (2008). Two-dimensional active learning for image classification. In Computer Vision and Pattern Recognition (CVPR’08).
  • [Regan and Boutilier, 2011] Regan, K. and Boutilier, C. (2011). Eliciting additive reward functions for markov decision processes. In Inter. Joint Conf. on Artificial Intelligence (IJCAI’11), Barcelona, Spain.
  • [Reichart et al., 2008] Reichart, R., Tomanek, K., Hahn, U., and Rappoport, A. (2008). Multi-task active learning for linguistic annotations. ACL 08.
  • [Rolf et al., 2011] Rolf, M., Steil, J., and Gienger, M. (2011). Online goal babbling for rapid bootstrapping of inverse models in high dimensions. In Development and Learning (ICDL), 2011 IEEE Inter. Conf. on.
  • [Ross and Bagnell, 2010] Ross, S. and Bagnell, J. A. D. (2010). Efficient reductions for imitation learning. In 13th Inter. Conf. on Artificial Intelligence and Statistics (AISTATS).
  • [Rothkopf et al., 2009] Rothkopf, C. A., Weisswange, T. H., and Triesch, J. (2009). Learning independent causes in natural images explains the space-variant oblique effect. In Development and Learning, 2009. ICDL 2009. IEEE 8th Inter. Conf. on, pages 1–6.
  • [Roy and McCallum, 2001] Roy, N. and McCallum, A. (2001). Toward optimal active learning through monte carlo estimation of error reduction. ICML, Williamstown.
  • [Ruesch and Bernardino, 2009] Ruesch, J. and Bernardino, A. (2009). Evolving predictive visual motion detectors. In Development and Learning, 2009. ICDL 2009. IEEE 8th Inter. Conf. on, pages 1–6.
  • [Salganicoff and Ungar, 1995] Salganicoff, M. and Ungar, L. (1995). Active exploration and learning in real-valued spaces using multi-armed bandit allocation indices. In Inter. Conf. on Machine Learning, pages 480–487.
  • [Salganicoff et al., 1996] Salganicoff, M., Ungar, L. H., and Bajcsy, R. (1996). Active learning for vision-based robot grasping. Machine Learning, 23(2).
  • [Sauser et al., 2011] Sauser, E., Argall, B., Metta, G., and Billard, A. (2011). Iterative learning of grasp adaptation through human corrections. Robotics and Autonomous Systems.
  • [Saxena et al., 2006] Saxena, A., Driemeyer, J., Kearns, J., and Ng, A. Y. (2006). Robotic grasping of novel objects. In Neural Information Processing Systems (NIPS).
  • [Schatz and Oudeyer, 2009] Schatz, T. and Oudeyer, P.-Y. (2009). Learning motor dependent Crutchfield’s information distance to anticipate changes in the topology of sensory body maps. In Development and Learning, 2009. ICDL 2009. IEEE 8th Inter. Conf. on, pages 1–6.
  • [Schein and Ungar, 2007] Schein, A. and Ungar, L. H. (2007). Active learning for logistic regression: an evaluation. Machine Learning, 68:235–265.
  • [Schmidhuber, 1991a] Schmidhuber, J. (1991a). Curious model-building control systems. In Inter. Joint Conf. on Neural Networks, pages 1458–1463.
  • [Schmidhuber, 1991b] Schmidhuber, J. (1991b). A possibility for implementing curiosity and boredom in model-building neural controllers. In From Animals to Animats: First Inter. Conf. on Simulation of Adaptive Behavior, pages 222–227, Cambridge, MA, USA.
  • [Schmidhuber, 1995] Schmidhuber, J. (1995). On learning how to learn learning strategies. Technical Report FKI-198-94, Fakultaet fuer Informatik, Technische Universitaet Muenchen.
  • [Schmidhuber, 2006] Schmidhuber, J. (2006). Developmental robotics, optimal artificial curiosity, creativity, music, and the fine arts. Connection Science, 18(2):173–187.
  • [Schmidhuber, 2011] Schmidhuber, J. (2011). Powerplay: Training an increasingly general problem solver by continually searching for the simplest still unsolvable problem. Technical report, http://arxiv.org/abs/1112.5309.
  • [Schmidhuber et al., 1997] Schmidhuber, J., Zhao, J., and Schraudolph, N. (1997). Reinforcement learning with self-modifying policies. In Learning to Learn, pages 293–309.
  • [Schonlau et al., 1998] Schonlau, M., Welch, W., and Jones, D. (1998). Global versus local search in constrained optimization of computer models. In Flournoy, N., Rosenberger, W., and Wong, W., editors, New Developments and Applications in Experimental Design, volume 34, pages 11–25. Institute of Mathematical Statistics.
  • [Sequeira et al., 2011] Sequeira, P., Melo, F., Prada, R., and Paiva, A. (2011). Emerging social awareness: Exploring intrinsic motivation in multiagent learning. In IEEE Inter. Conf. on Developmental Learning.
  • [Settles, 2009] Settles, B. (2009). Active learning literature survey. Technical Report CS Tech. Rep. 1648, University of Wisconsin-Madison.
  • [Settles et al., 2007] Settles, B., Craven, M., and Ray, S. (2007). Multiple-instance active learning. In Advances in neural information processing systems, pages 1289–1296.
  • [Seung et al., 1992] Seung, H., Opper, M., and Sompolinsky, H. (1992). Query by committee. In Fifth Annual Workshop on Computational Learning Theory, pages 287–294.
  • [Shon et al., 2007] Shon, A. P., Verma, D., and Rao, R. P. N. (2007). Active imitation learning. In AAAI Conf. on Artificial Intelligence (AAAI’07).
  • [Sim and Roy, 2005] Sim, R. and Roy, N. (2005). Global A-optimal robot exploration in SLAM. In IEEE Inter. Conf. on Robotics and Automation (ICRA).
  • [Şimşek and Barto, 2006] Şimşek, Ö. and Barto, A. (2006). An intrinsic reward mechanism for efficient exploration. In Inter. Conf. on Machine learning, pages 833–840.
  • [Simsek and Barto, 2008] Şimşek, Ö. and Barto, A. (2008). Skill characterization based on betweenness. In Neural Information Processing Systems (NIPS).
  • [Şimşek et al., 2005] Şimşek, Ö., Wolfe, A., and Barto, A. (2005). Identifying useful subgoals in reinforcement learning by local graph partitioning. In Inter. Conf. on Machine learning, pages 816–823.
  • [Singh et al., 2007] Singh, A., Krause, A., Guestrin, C., Kaiser, W., and Batalin, M. (2007). Efficient planning of informative paths for multiple robots. In Proc. of the Int. Joint Conf. on Artificial Intelligence.
  • [Singh et al., 2005] Singh, S., Barto, A., and Chentanez, N. (2005). Intrinsically motivated reinforcement learning. In Advances in neural information processing systems (NIPS), volume 17, pages 1281–1288.
  • [Singh et al., 2009] Singh, S., Lewis, R., and Barto, A. (2009). Where do rewards come from? In Annual Conf. of the Cognitive Science Society.
  • [Singh et al., 2010a] Singh, S., Lewis, R., Sorg, J., Barto, A., and Helou, A. (2010a). On Separating Agent Designer Goals from Agent Goals: Breaking the Preferences–Parameters Confound. Citeseer.
  • [Singh et al., 2010b] Singh, S., Lewis, R. L., Barto, A. G., and Sorg, J. (2010b). Intrinsically motivated reinforcement learning: an evolutionary perspective. IEEE Transactions on Autonomous Mental Development, 2(2).
  • [Sivaraman and Trivedi, 2010] Sivaraman, S. and Trivedi, M. (2010). A general active-learning framework for on-road vehicle recognition and tracking. Intelligent Transportation Systems, IEEE Transactions on, 11(2):267–276.
  • [Sorg et al., 2010a] Sorg, J., Singh, S., and Lewis, R. (2010a). Internal rewards mitigate agent boundedness. In Int. Conf. on Machine Learning (ICML).
  • [Sorg et al., 2010b] Sorg, J., Singh, S., and Lewis, R. (2010b). Reward design via online gradient ascent. In Advances of Neural Information Processing Systems, volume 23.
  • [Sorg et al., 2010c] Sorg, J., Singh, S., and Lewis, R. (2010c). Variance-based rewards for approximate Bayesian reinforcement learning. In 26th Conf. on Uncertainty in Artificial Intelligence.
  • [Sorg et al., 2011] Sorg, J., Singh, S., and Lewis, R. (2011). Optimal rewards versus leaf-evaluation heuristics in planning agents. In Twenty-Fifth AAAI Conf. on Artificial Intelligence.
  • [Stachniss and Burgard, 2003] Stachniss, C. and Burgard, W. (2003). Exploring unknown environments with mobile robots using coverage maps. In AAAI Conference on Artificial Intelligence.
  • [Stachniss et al., 2005] Stachniss, C., Grisetti, G., and Burgard, W. (2005). Information gain-based exploration using Rao-Blackwellized particle filters. In Robotics: Science and Systems.
  • [Strehl et al., 2009] Strehl, A. L., Li, L., and Littman, M. (2009). Reinforcement learning in finite MDPs: PAC analysis. J. of Machine Learning Research.
  • [Strehl and Littman, 2008] Strehl, A. L. and Littman, M. L. (2008). An analysis of model-based interval estimation for Markov decision processes. J. Comput. Syst. Sci., 74(8):1309–1331.
  • [Stulp and Sigaud, 2012] Stulp, F. and Sigaud, O. (2012). Policy improvement methods: Between black-box optimization and episodic reinforcement learning. In ICML.
  • [Sun et al., 2011] Sun, Y., Gomez, F., and Schmidhuber, J. (2011). Planning to be surprised: Optimal Bayesian exploration in dynamic environments. Artificial General Intelligence, pages 41–51.
  • [Sutton and Barto, 1998] Sutton, R. and Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.
  • [Sutton et al., 2000] Sutton, R., McAllester, D., Singh, S., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Adv. Neural Information Proc. Systems (NIPS), volume 12, pages 1057–1063.
  • [Sutton et al., 1999] Sutton, R., Precup, D., Singh, S., et al. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211.
  • [Szepesvári, 2011] Szepesvári, C. (2011). Reinforcement learning algorithms for MDPs. Wiley Encyclopedia of Operations Research and Management Science.
  • [Taylor et al., 2008] Taylor, J., Precup, D., and Panangaden, P. (2008). Bounding performance loss in approximate MDP homomorphisms. In Advances in Neural Information Processing Systems, pages 1649–1656.
  • [Tesch et al., 2013] Tesch, M., Schneider, J., and Choset, H. (2013). Expensive function optimization with stochastic binary outcomes. In Inter. Conf. on Machine Learning (ICML’13).
  • [Thomaz and Breazeal, 2008] Thomaz, A. and Breazeal, C. (2008). Teachable robots: Understanding human teaching behavior to build more effective robot learners. Artificial Intelligence, 172(6-7):716–737.
  • [Thomaz and Cakmak, 2009] Thomaz, A. and Cakmak, M. (2009). Learning about objects with human teachers. In ACM/IEEE Inter. Conf. on Human robot interaction, pages 15–22.
  • [Thomaz et al., 2006] Thomaz, A., Hoffman, G., and Breazeal, C. (2006). Reinforcement learning with human teachers: Understanding how people want to teach robots. In Robot and Human Interactive Communication, 2006. ROMAN 2006. The 15th IEEE Inter. Symposium on, pages 352–357.
  • [Thrun, 1992] Thrun, S. (1992). Efficient exploration in reinforcement learning. Technical Report CMU-CS-92-102, Carnegie-Mellon University.
  • [Thrun, 1995] Thrun, S. (1995). Exploration in active learning. Handbook of Brain Science and Neural Networks, pages 381–384.
  • [Thrun et al., 1995] Thrun, S., Schwartz, A., et al. (1995). Finding structure in reinforcement learning. Advances in neural information processing systems, pages 385–392.
  • [Tong and Koller, 2001] Tong, S. and Koller, D. (2001). Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45–66.
  • [Toussaint, 2012] Toussaint, M. (2012). Theory and Principled Methods for Designing Metaheuristics, chapter The Bayesian Search Game. Springer.
  • [Ugur et al., 2007] Ugur, E., Dogar, M. R., Cakmak, M., and Sahin, E. (2007). Curiosity-driven learning of traversability affordance on a mobile robot. In Development and Learning, 2007. ICDL 2007. IEEE 6th Inter. Conf. on, pages 13–18.
  • [van Hoof et al., 2012] van Hoof, H., Krömer, O., Amor, H., and Peters, J. (2012). Maximally informative interaction learning for scene exploration. In IEEE/RSJ Inter. Conf. on Intelligent Robots and Systems (IROS).
  • [Viappiani and Boutilier, 2010] Viappiani, P. and Boutilier, C. (2010). Optimal Bayesian recommendation sets and myopically optimal choice query sets. In Advances in Neural Information Processing Systems.
  • [Victor Gabillon et al., 2011] Gabillon, V., Lazaric, A., Ghavamzadeh, M., and Bubeck, S. (2011). Multi-bandit best arm identification. In Neural Information Processing Systems (NIPS’11).
  • [Vlassis et al., 2012] Vlassis, N., Ghavamzadeh, M., Mannor, S., and Poupart, P. (2012). Bayesian reinforcement learning. Reinforcement Learning, pages 359–386.
  • [Wang and Hua, 2011] Wang, M. and Hua, X. (2011). Active learning in multimedia annotation and retrieval: A survey. ACM Transactions on Intelligent Systems and Technology (TIST), 2(2):10.
  • [Weng et al., 2001] Weng, J., McClelland, J., Pentland, A., Sporns, O., Stockman, I., Sur, M., and Thelen, E. (2001). Autonomous mental development by robots and animals. Science, 291:599–600.
  • [Wiering and Schmidhuber, 1998] Wiering, M. and Schmidhuber, J. (1998). Efficient model-based exploration. In Inter. Conf. on Simulation of Adaptive Behavior: From Animals to Animats 6, pages 223–228.
  • [Zhang et al., 2009] Zhang, H., Parkes, D., and Chen, Y. (2009). Policy teaching through reward function learning. In ACM Conf. on Electronic commerce, pages 295–304.