CLIC: Curriculum Learning and Imitation for feature Control in non-rewarding environments

January 28, 2019 · Pierre Fournier et al.

In this paper, we propose an unsupervised reinforcement learning agent called CLIC for Curriculum Learning and Imitation for Control. This agent learns to control features in its environment without external rewards, and observes the actions of a third party agent, Bob, who does not necessarily provide explicit guidance. CLIC selects which feature to train on and what to imitate from Bob's behavior by maximizing its learning progress. We show that CLIC can effectively identify helpful behaviors in Bob's actions, and imitate them to control the environment faster. CLIC can also follow Bob when he acts as a mentor and provides ordered demonstrations. Finally, when Bob controls features that the agent cannot, or in the presence of a hierarchy between aspects of the environment, we show that CLIC ignores non-reproducible and already mastered behaviors, resulting in a greater benefit from imitation.


1 Introduction

Consider a reinforcement learning (rl) agent placed in an environment without external rewards, but with a third party agent, called Bob, acting independently of the agent, and in particular not providing any explicit guidance.

In the absence of external rewards, we endow the agent with an intrinsic motivation to maximally control the environment state. If the latter includes multiple features, e.g. corresponding to different objects as in Figure 1, the agent should seek to control them independently. Besides learning from intrinsic motivation, the agent should also take advantage of Bob’s presence. Even if Bob does not provide clear guidance to the RL agent, reproducing his actions may help gain control, and the agent should do so.

Learning simultaneously to control all the aspects of a realistic environment state is likely to be sub-optimal: some aspects may not be controllable at all, or become so only after having mastered others, while training too much on already mastered ones may slow down progress on new ones. Likewise, Bob may act on aspects that the agent cannot act on, or produce behaviors the agent already masters. In both cases, imitating his actions would be a waste of time. As a consequence, the agent can benefit from using curriculum learning to guide its acquisition of environment control, both to choose what to practice and what to imitate from Bob.

From these observations, this work addresses the problem of combining feature-based control in a non-rewarding discrete environment, and imitation learning applied to an ambiguous and unconstrained third party agent, with the help of curriculum learning.

Figure 1: Left. An environment with six objects, each one having its own state. To interact with an object, the agent has to go through and touch an unknown predefined ordered list of intermediate positions, only shown here in gray for the jigsaw puzzle. Top right. Example of a hierarchical relationship between two objects: the jigsaw piece is both an object and an intermediate position for the jigsaw puzzle, so setting the puzzle state to 1 requires setting the piece state to 1. Bottom right. Example of partial control over features: Bob has a key and can open the closed lock but the agent cannot, so trying to reproduce Bob’s actions will fail, as the action touch on the lock has no effect.

Feature-based control

The intrinsic motivation of our agent is feature control. Here we set aside the problem of discovering independently controllable features (Thomas et al., 2017) in the environment. We rather assume that some environment features, controllable or not, are readily available and focus on how to act on them, for example opening the box in Figure 1.

Using the vocabulary of Unicorn (Mankowitz et al., 2018), we define the act of controlling one feature independently as a task, and the value to set this feature to as a goal. We use Universal Value Function Approximators (uvfas) (Schaul et al., 2015) to learn in a discrete environment and in this multitask, multigoal and unsupervised setting.

Imitation learning

In Learning from Demonstrations (LfD) (Schaal, 1997) applied to rl, two hypotheses are frequently made: 1) the expert provides demonstrations to guide the agent, and 2) it does so for one task only, which the agent does not have to identify. In this work these requirements are not met. Instead, we assume that: 1) Bob acts independently of the agent in the environment, without the role of explicitly demonstrating useful behaviors, and 2) the agent does not know which feature Bob wants to control when acting. For example, if Bob moves one object among several, the agent should detect that Bob demonstrated a behavior that can be imitated to gain control over this object specifically and not over the others.

To incorporate the imitation of Bob’s behaviors into autonomous learning, we combine uvfas with an adapted version of the double DQN from Demonstrations (dqnfd) algorithm (Hester et al., 2017).

Curriculum Learning

To decide what to learn autonomously and what to imitate in Bob’s behavior, the agent integrates curriculum learning by maximizing its absolute learning progress (LP) (Oudeyer et al., 2007; Baranes & Oudeyer, 2010): when a feature is too hard to control or already mastered, the agent does not make progress on controlling it and focuses on others. Using absolute values ensures the agent will refocus on a task for which its competence drops (Colas et al., 2018).

Contributions

In this work, we introduce a new reinforcement learning setting, where an agent is placed in a non-rewarding discrete environment, and an unconstrained demonstrator Bob acts ambiguously, performing several tasks without the agent knowing what he is achieving. Combining feature control, imitation learning and curriculum learning, we propose an agent clic which learns in this setting by:

  1. using uvfas in the discrete action case for independently controlling multiple environment features, without external rewards.

  2. observing Bob’s ambiguous actions, determining the features that they help control, and imitating them.

  3. selecting which features to train on, and what to imitate from Bob, through absolute learning progress maximization.

Our results show that clic can:

  1. use its observations of Bob to learn more and faster control of its environment.

  2. be taught control of features in a certain order by being shown demonstrations in this order.

  3. identify and then ignore non-reproducible and already mastered behaviors when imitating Bob.

2 Related work

Our work falls at the intersection of three domains: unsupervised rl, imitation learning and curriculum learning.

Unsupervised Reinforcement Learning

There is a growing body of literature about learning a set of skills in an environment without any external reward signal (Machado & Bowling, 2016; Gregor et al., 2016; Eysenbach et al., 2018; Warde-Farley et al., 2018). The corresponding domain has recently been termed “unsupervised rl”.

A first concern with these approaches is what to learn in the absence of reward. Recent works like Andrychowicz et al. (2017) and Plappert et al. (2018) defined controlling the environment state as the objective of the agent: they represent goal states as some additional input, and also use uvfas and goal-conditioned policies to learn how to reach them. In this case, the agent rewards itself when it reaches a goal state. This work follows a similar approach but at a finer level, by learning to control state features.

A second concern is how to profit from the learned skills to address tasks of interest for an external user. In Eysenbach et al. (2018), three mechanisms are proposed for doing so: (i) treating the acquisition of skills as a pre-training process that accelerates a subsequent rl stage, (ii) using hierarchical rl based on the skills to build a higher-level policy that chooses the adequate skill at each time to maximize some external reward, or (iii) imitating an expert. Quite interestingly, in the latter case, the authors ask whether learning various skills can help imitate an expert, whereas in this paper we ask whether imitating another agent can help acquire useful skills.

Imitation learning

The potential of learning from demonstrations to accelerate skill acquisition is well-known in the context of both rl for robotics (Ijspeert et al., 2013) and agents in simulation (Levine & Koltun, 2013; Hester et al., 2017; Večerík et al., 2017). In this work, we build upon mechanisms proposed in Hester et al. (2017). From this perspective, our combination of intrinsically motivated learning and imitation learning can be seen as a kind of multitask, multigoal version of dqnfd, where tasks are features to control, and curriculum learning is applied to the choice of these tasks. Some multitask imitation learning agents have been proposed (Parisotto et al., 2015; Peng et al., 2018), but they are not endowed with curriculum learning capabilities.

Curriculum learning

Curriculum learning is a long standing research topic in developmental robotics (Baranes & Oudeyer, 2010; Forestier et al., 2017) and has recently become the focus of intensive research in rl (Graves et al., 2017; Blaes et al., 2018; Narvekar & Stone, 2018; Weinshall & Cohen, 2018). Our curriculum learning strategy is especially close to that of the curious algorithm of Colas et al. (2018), where a multitask multigoal agent called e-uvfa is also combined with absolute LP maximization to choose what task to train on.

Combining Imitation and Curriculum Learning

The interplay of curriculum and imitation learning in an autonomous agent is present in several works like Nguyen et al. (2011) and Duminy et al. (2019), which also demonstrate that imitation learning can drive the skill acquisition trajectory of an autonomous agent learning from intrinsic motivations. However, these works do not consider a single, ambiguous and potentially non-reproducible teacher as we do, and, above all, they address distinct technical issues since they do not use deep rl.

To summarize, with regard to existing work, we investigate unexplored questions at the intersection of unsupervised rl, imitation learning and curriculum learning.

3 Methods

Our framework is based on four components: (1) the intrinsic motivation of the agent consists in controlling environment features so that they take desired values (Section 3.1); (2) the agent learns how to satisfy its intrinsic motivations through a version of double dqn augmented to tackle individual feature control (Section 3.2); (3) when another agent acts in the environment (Section 3.3), our agent imitates the way it controls these features (Section 3.4); (4) to choose what to learn and imitate, the agent builds and follows a curriculum over features, based on learning progress maximization (Section 3.5).

3.1 Feature control as intrinsic motivation

In the absence of an external reward, our agent is driven by intrinsic motivations based on feature control, where tasks are features and goals are the values to set them to.

Following Abbeel & Ng (2004), we consider a Markov Decision Process without a reward function. States write $s = (s_1, \dots, s_N)$ and, in traditional goal-conditioned control, the agent learns an optimal policy $\pi^*_g$ for reaching a goal $g = (g_1, \dots, g_N)$ starting from $s_0$. If the goal is $g$ and the state space is discrete, a state $s$ is terminal if and only if $s_f = g_f$ for all $f$.

However, in many cases, an agent can be interested in controlling only a subset of the features. To this end, we consider a subset $F$ of features of interest, such that $s_f$ is the $f$-th feature of $s$ for $f \in F$. For each feature of interest $f \in F$, we define an intrinsic reward function $R_f$, with $R_f(s, g) = 1$ if $s_f = g_f$ and $R_f(s, g) = 0$ otherwise.
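As an illustration, here is a minimal Python sketch of this per-feature intrinsic reward, assuming states and goals are represented as plain arrays indexed by feature (this representation and the function name are our assumptions for illustration, not taken from the paper's code):

import numpy as np

def intrinsic_reward(state, goal, f):
    """R_f(s, g): 1 if the f-th feature of the state equals the goal value, else 0."""
    return 1.0 if state[f] == goal[f] else 0.0

# Example: two object features; the agent wants feature 1 set to 1.
s = np.array([0.0, 1.0])
g = np.array([1.0, 1.0])
print(intrinsic_reward(s, g, 0))  # 0.0 -- feature 0 does not yet match its desired value
print(intrinsic_reward(s, g, 1))  # 1.0 -- feature 1 already has the desired value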

3.2 Learning to control features

With these intrinsic rewards, the agent can learn from state-of-the-art rl methods. clic is based on the double dqn algorithm (Van Hasselt et al., 2015) applied to a uvfa (Schaul et al., 2015).

Thus we define a general feature-related action-value function $Q(s, a, g, f)$: the expected discounted sum of intrinsic rewards $R_f(\cdot, g)$ obtained after taking action $a$ in state $s$ while pursuing goal $g$ for feature $f$. The agent learns a policy $\pi(a \mid s, g, f)$ that maximizes $Q(s, a, g, f)$ for all $f \in F$. The value function $Q$ is approximated with a neural network taking $s$ and $g$ as input and providing $Q(s, a, g, f)$ for all $a$ and $f$. At the start of an episode, the agent samples a desired value $g_f$ and a feature $f$ following Section 3.5. clic then acts following $\pi$ until it reaches a state $s$ with $s_f = g_f$, or a timeout is reached. The trajectory obtained is a list of transition tuples $(s, a, s')$ that are stored in a standard replay buffer. At each step, clic minimizes the double dqn loss on mini-batches taken from the replay buffer:

$$\mathcal{L}_{dqn} = \mathbb{E}\Big[\big(R_f(s', g) + \gamma\, Q'\big(s', \operatorname*{arg\,max}_{a'} Q(s', a', g, f), g, f\big) - Q(s, a, g, f)\big)^2\Big] \qquad (1)$$

where $Q'$ is the target network (Van Hasselt et al., 2015).

A difference between clic and double dqn is that, once $f$ and $g_f$ are chosen, clic uses softmax exploration rather than $\epsilon$-greedy, as the decreasing-$\epsilon$ schedule of the latter is unfit to cope with cases where different features need different amounts of exploration. Also, we do not input $f$ to the neural network through a one-hot encoded vector, as proposed in two close algorithms (Peng et al., 2018; Colas et al., 2018), because we empirically observed that the network generalized too much between tasks, outputting high Q-values for uncontrolled features (we attribute this Q-value overestimation to the fact that masking only the first layer of weights with a one-hot encoded input vector is not sufficient to keep the outputs of the subsequent unmasked layers low for uncontrolled features).
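For concreteness, the sketch below shows softmax action selection over Q-values and the double dqn target used in loss (1). The tabulated-array interface and the function names are illustrative assumptions, not the released implementation.

import numpy as np

def softmax_action(q_values, temperature):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(probs), p=probs))

def double_dqn_target(reward, done, q_next_online, q_next_target, gamma=0.99):
    """Double dqn target: the online network picks the next action, the target network evaluates it."""
    if done:
        return reward
    best_action = int(np.argmax(q_next_online))
    return reward + gamma * q_next_target[best_action]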

3.3 Bob

The mechanisms above let clic learn autonomous control of different features of its environment. We now consider an agent, Bob, that clic may imitate.

Every $T_B$ steps, Bob chooses a feature $f_B$ to control and a value $g_B$ to set it to, and performs trajectories for $(f_B, g_B)$ (Algorithm 1, line 12). Section 4.1 describes how Bob chooses $f_B$ and $g_B$ but, in all cases, clic does not have access to this information.

Trajectories from Bob may be helpful to learn to control several features: if Bob changed feature $f$ and set it to value $g_f$ at some point, then Bob’s actions up to this point constitute a demonstration for setting $f$ to $g_f$, in the spirit of Hindsight Experience Replay (her) (Andrychowicz et al., 2017). Transitions from Bob's trajectories are augmented with such $(f, g_f)$ pairs, and clic can compute their rewards with $R_f$. These augmented transitions are stored both in the same replay buffer $\mathcal{B}$ as the agent’s own experience (Algorithm 1, line 15), and in a separate set $\mathcal{D}_B$ for imitation learning.
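The relabeling of Bob's trajectories can be sketched as follows: whenever a feature of interest changes value along the trajectory, the prefix of the trajectory up to that point is stored as a demonstration for setting that feature to its new value. Data structures and names are ours, for illustration under this reading of the mechanism.

def augment_bob_trajectory(trajectory, features_of_interest):
    """trajectory: list of (state, action, next_state) transitions produced by Bob.
    Returns (state, action, next_state, f, g_f) tuples: for every feature f that Bob
    changed at step t, every transition of the prefix [0..t] is relabeled with (f, g_f)."""
    augmented = []
    for t, (s, a, s_next) in enumerate(trajectory):
        for f in features_of_interest:
            if s_next[f] != s[f]:                      # Bob just changed feature f
                g_f = s_next[f]                        # treat the new value as the demonstrated goal
                for (s_p, a_p, s_p_next) in trajectory[: t + 1]:
                    augmented.append((s_p, a_p, s_p_next, f, g_f))
    return augmented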

3.4 Imitation learning

1:  Input: transition function $T$, empty replay buffer $\mathcal{B}$
2:  Initialize: state $s$, feature $f$, goal $g$, step $k = 0$
3:  loop
4:     $a \sim \mathrm{softmax}\big(Q(s, \cdot, g, f)\big)$
5:     $s' \sim T(s, a)$, $\mathcal{B} \leftarrow \mathcal{B} \cup \{(s, a, s')\}$
6:     $s \leftarrow s'$
7:     Minimize (1) on a batch from $\mathcal{B}$
8:     if terminal or timeout then
9:        Sample $f$ and $g$ (Section 3.5), reset $s$
10:     end if
11:     if $k \bmod T_B = 0$ then
12:        $\mathcal{D}_B \leftarrow$ Bob’s demonstrations for $(f_B, g_B)$
13:        for all $(s, a, s') \in \mathcal{D}_B$ do
14:           Augment $(s, a, s')$ with pairs $(f, g_f)$ (Section 3.3)
15:           $\mathcal{B} \leftarrow \mathcal{B} \cup \{(s, a, s', f, g_f)\}$
16:        end for
17:        for $n_{imit}$ steps do
18:           Sample feature $f$ (Section 3.5)
19:           Sample batch of transitions for $f$ from $\mathcal{D}_B$
20:           Minimize loss (2) on the batch
21:        end for
22:        Empty $\mathcal{D}_B$
23:     end if
24:     $k \leftarrow k + 1$
25:  end loop
Algorithm 1 clic: Curriculum Learning and Imitation for Control

After a set of trajectories from Bob is observed, the agent performs $n_{imit}$ steps of imitation on $\mathcal{D}_B$ (Algorithm 1, lines 18-20).

Extending Hester et al. (2017), we define a large margin classification loss for our general Q-values. Given a feature $f$ and a goal $g$, it writes:

$$\mathcal{L}_{margin} = \max_{a \in A}\big[Q(s, a, g, f) + l(a_B, a)\big] - Q(s, a_B, g, f) \qquad (2)$$

where $l(a_B, a)$ is a margin function that is $0$ when $a = a_B$ and positive otherwise. This loss ensures that the Q-value of Bob's action $a_B$ will be at least a margin above the Q-values of the other actions.

At each of the $n_{imit}$ imitation steps, clic selects a feature $f$ following Section 3.5, and minimizes $\mathcal{L}_{margin}$ on batches of Bob's augmented transitions from $\mathcal{D}_B$.
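A sketch of loss (2) on a single demonstrated transition, with Q-values given as a vector over actions; the concrete margin value and the array layout are illustrative assumptions:

import numpy as np

def large_margin_loss(q_values, bob_action, margin=0.8):
    """Push Q(s, a_B, g, f) at least `margin` above the Q-value of every other action."""
    q = np.asarray(q_values, dtype=float)
    l = np.full(q.shape, margin)
    l[bob_action] = 0.0            # the margin function l(a_B, a) is 0 for Bob's own action
    return float(np.max(q + l) - q[bob_action])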

3.5 Curriculum learning

clic uses its own learning progress to select which feature to practice (Section 3.2 and Algorithm 1, lines 2, 9), and which feature to imitate Bob on (Section 3.4 and Algorithm 1, line 18). To that end, it tracks its learning progress on each feature of interest, and preferentially samples features for which this learning progress is maximal.

Similar to Colas et al. (2018), we define the agent’s competence $C_f(k)$ at step $k$ for feature $f$ as the average success over a window of recent attempts at controlling $f$ — success meaning $s_f = g_f$ at the end of the episode with desired value $g_f$. The learning progress then writes

$$LP_f(k) = \big|C_f(k) - C_f(k - \Delta)\big|,$$

where $\Delta$ is the window offset. It is used to derive, for each feature $f$, a sampling probability at step $k$:

$$p_f(k) = \epsilon \cdot \frac{1}{|F|} + (1 - \epsilon) \cdot \frac{LP_f(k)}{\sum_{f' \in F} LP_{f'}(k)},$$

with $\epsilon$ controlling the sampling randomness: $\epsilon = 1$ means pure random sampling, while $\epsilon = 0$ means pure proportional sampling. To ensure a minimum amount of exploration over features while learning a curriculum, clic uses an intermediate value of $\epsilon$. We call clic-rnd the version of clic that samples features randomly without maximizing LP, using $\epsilon = 1$.

The use of absolute values in $LP_f$ ensures that a competence drop on a feature leads to training again on this feature (Baranes & Oudeyer, 2013; Colas et al., 2018). A threshold parameter is added so that only a sufficiently large competence drop is visible, which favors controlling new features over maintaining those already mastered at the highest level of competence.
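The curriculum can be sketched as below, assuming one success/failure window per feature and the mixture of uniform and LP-proportional sampling described above; the window size and the mixing weight follow our reconstruction, not necessarily the published values.

import numpy as np
from collections import deque

class FeatureCurriculum:
    def __init__(self, n_features, window=50, epsilon=0.2):
        self.results = [deque(maxlen=2 * window) for _ in range(n_features)]
        self.window = window
        self.epsilon = epsilon

    def record(self, f, success):
        """Store the outcome (success or failure) of an attempt at controlling feature f."""
        self.results[f].append(float(success))

    def learning_progress(self, f):
        """Absolute difference of competence between the older and the more recent half-window."""
        r = list(self.results[f])
        if len(r) < 2 * self.window:
            return 0.0
        older, recent = np.mean(r[: self.window]), np.mean(r[self.window :])
        return abs(recent - older)      # drops in competence also count as progress to recover

    def sample_feature(self):
        """Mix uniform sampling with LP-proportional sampling."""
        lp = np.array([self.learning_progress(f) for f in range(len(self.results))])
        uniform = np.ones_like(lp) / len(lp)
        proportional = lp / lp.sum() if lp.sum() > 0 else uniform
        probs = self.epsilon * uniform + (1 - self.epsilon) * proportional
        return int(np.random.choice(len(lp), p=probs))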

4 Experimental setup

4.1 Environments

Experiments are conducted in several discrete-state, discrete-action environments containing objects, in which imitating Bob’s actions is more or less challenging: a base environment with six independent, fully controllable objects; two partially controllable variants of it, in which only one or only three of the six objects can be controlled by the agent; and a distinct environment with fully controllable but hierarchically related objects.

Independent objects, full control

The base environment (Figure 1, left) is a grid world with six objects to interact with. The actions of the agent are up, down, right, left and touch. The perceived environment state is the agent's position in the grid plus the states of all objects. The agent always starts in the center of the grid, and we use 25% sticky actions (Machado et al., 2017) for stochasticity.

When Bob is present and acts in the environment, clic perceives and stores the states visited by Bob and the actions he takes. We assume that the only source of non-optimality in Bob’s actions is the sticky actions. Bob’s choice of objects to act on depends on the experiment and is detailed with each result.

To control objects and set the corresponding feature values to 1, clic has to go through and touch an unknown, predefined, ordered list of intermediate positions for each object. This setup abstractly reproduces the idea that several steps may be necessary to properly manipulate an object: a longer list means a more difficult object to manipulate.
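As a toy illustration of this mechanic, the sketch below models one object whose state switches to 1 only after the agent has touched its intermediate positions in the prescribed order; the example grid positions and what happens when a wrong position is touched are assumptions of ours.

class TouchSequenceObject:
    """One object controlled by touching an ordered list of grid positions."""

    def __init__(self, intermediate_positions):
        self.positions = list(intermediate_positions)   # e.g. [(2, 3), (2, 4), (1, 4)]
        self.progress = 0                                # index of the next position to touch
        self.state = 0

    def touch(self, agent_position):
        """Called when the agent executes `touch` at its current position."""
        if self.state == 1:
            return
        if agent_position == self.positions[self.progress]:
            self.progress += 1                           # correct next position: advance in the list
            if self.progress == len(self.positions):
                self.state = 1                           # whole list touched: the object is set
        # touching any other position has no effect in this sketch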

Controlling this environment — or one of the three others introduced next — means controlling its objects: the features of interest of clic are the object states. Thus, from now on, controlling a feature $f$ always means being consistently able to set the corresponding object state $s_f$ to any desired value $g_f$. In all environments, states are normalized so that each feature takes its values in $[0, 1]$.

Independent objects, partial control

In the base environment, the six objects are fully controllable by the clic agent. In realistic environments, however, some aspects are likely to be controllable by Bob only. The two partially controllable variants are copies of the base environment in which respectively only one and only three of the six objects can be controlled by clic, though Bob can control all of them. Technically, choosing the action touch on the intermediate positions associated with the uncontrollable objects simply has no effect (Figure 1, bottom right).

Hierarchically related objects, full control

In the base environment, the six objects are independent of each other: there is no overlap between their lists of intermediate positions. On the contrary, in the hierarchical environment, we modify these lists so that controlling Object $i$ becomes a prerequisite to controlling Object $i+1$. Simply put, setting Object $i+1$'s state to 1 means setting Object $i$'s state to 1 and then going through an additional list of intermediate positions (Figure 1, top right).

4.2 Parameters

For the sake of simplicity, in all experiments, the desired value for all features of interest is 1, even if the features take their values in $[0, 1]$. An episode timeout is reached after 200 steps, the batch size is 64, and the remaining clic hyperparameters (discount factor, Bob's period $T_B$, number of imitation steps $n_{imit}$, softmax exploration temperature, curriculum parameters and replay buffer size) are kept fixed across experiments. Q-values are approximated with a neural network with two 32-neuron hidden layers and relu activations. The code of the experiments is available at https://www.dropbox.com/s/zs1ukyo0la7hze2/clic.zip?dl=0.
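Assuming a standard deep-learning framework (the text does not specify one), the Q-network described above could look like the following PyTorch sketch, taking the concatenated state and goal as input and outputting one Q-value per (action, feature) pair:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps (state, goal) to Q-values for every (action, feature) pair."""

    def __init__(self, state_dim, goal_dim, n_actions, n_features):
        super().__init__()
        self.n_actions = n_actions
        self.n_features = n_features
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, n_actions * n_features),
        )

    def forward(self, state, goal):
        x = torch.cat([state, goal], dim=-1)
        q = self.net(x)
        return q.view(*q.shape[:-1], self.n_actions, self.n_features)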

5 Results

Features of interest being object states, we use the words feature and object interchangeably in this section. In all figures, $C_i(k)$ and $LP_i(k)$ refer to the competence and learning progress of clic (or clic-rnd) for Object $i$ at step $k$; $\bar{C}(k)$ is the average competence at step $k$, normalized over the number of controllable objects; $I_i(k)$ is the number of steps clic spends imitating Bob on Object $i$. Unless stated otherwise, when Bob is said to act on a set of objects, he randomly samples one of these objects at the start of each of his trajectories and acts to set its state to 1. Plain curves correspond to the median of 10 runs and shaded areas to their interquartile ranges.

The experiments aim at determining: 1) whether clic can identify useful demonstrations in Bob's actions with unknown goals and imitate them; 2) to what extent clic's learning trajectory can be influenced by Bob's behaviors; and 3) whether clic can identify and ignore behaviors that should not be reproduced, either because it will not be able to reproduce them or because it already knows how to. As explained in Section 3.5, the curriculum learning agent clic uses LP-proportional sampling with $\epsilon < 1$, whereas the basic agent clic-rnd uses random sampling with $\epsilon = 1$.

5.1 Imitating Bob

Figure 2: A. Average competence of clic in the base environment, when Bob does nothing or controls three or six objects. B. Average competence of clic in the hierarchical environment, when Bob does nothing, controls one intermediate object, or controls the two hardest objects of the hierarchy.

First we analyze the impact of observing and imitating Bob, both in the base environment (Figure 2 A) and in the hierarchical environment (Figure 2 B). In the base environment, Bob either does nothing, acts on Objects 1, 2 and 3, or acts on all objects. It is clear that clic learns faster when it sees Bob acting on more objects.

In the hierarchical environment, Bob either does nothing, controls the intermediate Object 4, or controls the two hardest objects of the hierarchy, 5 and 6. This environment is harder to control autonomously, as it requires more exploration to discover the hard objects. Results show that clic uses Bob’s behavior to gain control over the features Bob seeks to control, but also over those Bob controls as intermediate steps.

For example, when Bob controls Objects 5 and 6, the hierarchy implies that he also has to act on all the other objects beforehand. In this case, results show that clic can imitate Bob's actions to control more than only what Bob intended to achieve. When Bob controls only Object 4, clic gains control over the features corresponding to objects up to 4 as well, but discovering the harder ones autonomously remains too difficult.

5.2 Following Bob’s teaching

In Figure 2 A, when Bob controls only a subset of the features, clic first imitates him to gain control over these features, and then explores the rest of the environment autonomously. From another point of view, these results show that Bob can influence clic's learning trajectory just by showing it how to control some selected features. We now show that, acting as a mentor, Bob can completely control this learning trajectory.

In Figure 3, instead of choosing an object to control at random, Bob shows how to control Object 1 until the agent’s competence on this feature is above 0.9. Then Bob demonstrates control of Object 2, and so on. If the agent’s competence on a previously mastered object falls below 0.9, Bob provides demonstrations for it again. The figure compares the performance of clic (Figure 3 B) and clic-rnd (Figure 3 A) in the base environment (where the objects are of the exact same difficulty), with Bob acting as specified.

Without curriculum learning, clic-rnd’s learning trajectory starts to follow the order of Bob's demonstrations but does not do so up to the end: as clic-rnd learns autonomously in parallel with imitating Bob, and chooses randomly what to learn autonomously, nothing prevents it from learning to control features 4, 5 and 6 before Bob shows how to do so.

By contrast, clic strictly follows the curriculum taught by Bob, learning almost always one feature at a time. This time, during its autonomous learning phase, LP maximization pushes clic to stick to the features Bob demonstrated, until it does not make progress on them anymore. In other words, clic can be taught control over the environment in a desired order by simply showing it demonstrations in this order.

We observe that following Bob’s order does not speed up global control of the environment. Since all objects are independent, there is no transfer between learning the different features of interest, and the ability to focus on one feature at a time is balanced by the fact that the network overfits when it trains on a single feature.

Figure 3: Competence on each object when Bob provides demonstrations in order, for clic-rnd (A) and clic (B). LP maximization enables the agent to follow the order of Bob's demonstrations.

5.3 Ignoring non-reproducible behavior from Bob

Figure 4: Average competence of clic-rnd and clic over all controllable features in A. the base environment (all six objects controllable), B. the variant with three controllable objects, and C. the variant with one controllable object. The benefits of LP maximization increase with the number of uncontrollable aspects of the environment.

When Bob controls features that the agent cannot, imitating Bob can be a waste of time. Yet, if the agent knew in advance the subset of features it cannot control and imitate Bob on, learning would be simpler as the agent could focus on fewer aspects of the environment. But in this work, clic does not have this knowledge and relies on curriculum learning to discover what to learn and imitate.

In Figure 4, we perform experiments with (A) all six objects controllable by the agent, (B) half of them controllable, and (C) only one controllable. In each environment, Bob acts on all objects, including those that the agent cannot control, and we study the impact of LP maximization on the agent's performance.

Comparing the three blue curves shows that without curriculum learning, having fewer features to master only slightly boosts clic-rnd's performance on global control. As expected, without a mechanism to tell apart what can and cannot be learned, much of the advantage of having fewer features to learn is lost. We also note that the variability of clic-rnd increases with the number of uncontrollable features: achieving global control becomes dependent on how early it happens to discover the few controllable features.

In the fully controllable environment, LP maximization does not bring any benefit and clic does not outperform clic-rnd. Indeed, the objects are independent and of equal difficulty, so there can only be little transfer between them, and little gain from practicing them and imitating Bob on them in any specific order, as LP maximization pushes the agent to do.

In the partially controllable variants, on the contrary, the red curves show that LP maximization helps achieve maximum control faster in the presence of uncontrollable features: when three or five objects are uncontrollable, clic learns faster than clic-rnd. Besides, the impact of curriculum learning is greater when there are more uncontrollable features: the gap between clic-rnd and clic is wider with five uncontrollable objects than with three.

The reasons for this gain appear in Figure 5, which shows the impact of LP maximization when only one object is controllable. Figure 5 displays the evolution of the agent's learning progress (A) and competence (C) on the controllable object, the number of steps spent imitating Bob controlling it (B), and the number of steps uselessly spent imitating Bob controlling an uncontrollable object (D).

When it does not maximize LP (dashed blue), the agent ignores the LP bump (Figure 5 A) resulting from gaining some control over the feature (Figure 5 C). It therefore selects randomly what to train on, and above all what to imitate: the agent tries to reproduce Bob’s behavior when it can (Figure 5 B) as often as when it cannot (Figure 5 D), as shown by the blue curves lying at the same level. This is sub-optimal, as performance on uncontrollable objects will never increase.

Instead, when the agent maximizes LP (plain red), the bump in LP at 50k steps (Figure 5 A) results in more imitation for this feature (Figure 5 B) and more episodes spent trying to control it. Simultaneously, the agent stops focusing on the other features for which its learning progress is too small, among which the uncontrollable ones (Figure 5 D). This focus results in a faster learning pace on the controllable feature (Figure 5 C).

5.4 Ignoring Bob’s demonstrations for mastered features

Figure 5: Effect of LP maximization in the variant with one controllable object: A. Learning progress on the controllable object; B. Steps spent imitating Bob controlling it; C. Competence on the controllable object (same setting as Figure 4 C); D. Steps spent imitating Bob controlling an object that clic cannot control.

By contrast with the partially controllable variants, in the hierarchical environment all features are controllable, but some are easier than others. In particular, controlling Object $i$ is easier than controlling Object $i+1$, and any demonstration for Object $i+1$ contains a demonstration for Object $i$.

As a consequence, in such an environment, when Bob acts on random objects, he ends up providing more demonstrations for the easiest features. If Bob chooses to guide the agent as in Section 5.2, he initially focuses on these easy features, and the bias towards them is even stronger.

Figure 6 shows the global performance of clic-rnd and clic in these two scenarios where Bob's actions are biased towards easy features. In both scenarios, curriculum learning helps the agent ignore Bob's behavior on already mastered features. The effect of the curriculum is greater when Bob guides the agent, as the bias is stronger in this case, but the same effect holds when Bob chooses randomly what to do, at least early in learning.

The fact that curriculum learning enables clic to stop imitating Bob on easy features can be clearly seen in Figure 7. Here Bob guides the agent, and the bias towards easy features is strong. clic-rnd chooses randomly what to imitate among what it is shown, so it imitates Bob mainly on quickly mastered features.

For instance, at step 400K, Object 3 is not yet mastered by clic-rnd, so Bob demonstrates it along with its necessary intermediate steps, Objects 1 and 2; clic-rnd, observing these three objects being controlled, imitates Bob on all of them (Figure 7 B), even though it already knows how to control Objects 1 and 2. It thus loses precious imitation steps that could be used to make progress on Object 3.

Instead, thanks to LP maximization, clic stops imitating Bob on Object 1 (Figure 7 D) as soon as its competence on it reaches 1 and stops rising (Figure 7 C). This mechanism enables clic to learn more and faster by focusing its learning resources on non-mastered features only.

Figure 6: Average competence of clic-rnd and clic in the hierarchical environment, depending on Bob's policy: selecting objects to control at random, or teaching the agent in order.

6 Conclusion

Figure 7: In the hierarchical environment, when Bob teaches Objects 1 to 6 in order: A. Competence of clic-rnd on each object. B. Steps spent by clic-rnd imitating Bob controlling each object. C. Competence of clic on each object. D. Steps spent by clic imitating Bob controlling each object.

In this work, we proposed a new learning setting, with a non-rewarding environment where a third party agent called Bob acts without communicating the intent of its actions and in ways that can be non-reproducible for the agent. This setting, although discrete, is a first step towards real-life environments, where artificial agents will face a very large number of potential tasks and goals, next to other agents with different capabilities. We combined feature control, curriculum and imitation learning to build an agent called clic that addresses the issues raised by this challenging learning context.

In particular, we showed that clic predictably makes faster progress when it observes more behaviors from Bob, but also that it can leverage Bob's actions to make progress on more tasks than only those demonstrated. We demonstrated that Bob can mentor clic and control its developmental trajectory simply by providing ordered demonstrations. Finally, we showed that clic can effectively use learning progress maximization to tell apart what is and is not useful to learn and imitate, and thus learns faster both when the environment is partially controllable and when it contains a natural hierarchy.

A specificity of this work is that, rather than considering a human expert teaching the agent, as is usually the case in interactive learning, we considered an external software agent providing non-intentional demonstrations. As a consequence, we did not focus on limiting the amount of demonstrated behavior, as is often done in the domain (Kang et al., 2018). Another topic of interest that emerges from our work is the importance of the order of the demonstrations performed by a teaching agent. We leave these two topics for future work.

Additionally, our focus being on the combination of curriculum learning and imitation learning rather than on representation learning, experiments were performed in discrete state, discrete action environments with independent features. But the main ideas presented here could easily be extended to the continuous case, replacing dqnfd with ddpgfd (Večerík et al., 2017) and trying to learn independently controllable features (Thomas et al., 2017).

Acknowledgements

Anonymized.

References