An Extensible Interactive Interface for Agent Design

06/06/2019
by   Matthew Rahtz, et al.
ETH Zurich

In artificial intelligence, we often specify tasks through a reward function. While this works well in some settings, many tasks are hard to specify this way. In deep reinforcement learning, for example, directly specifying a reward as a function of a high-dimensional observation is challenging. Instead, we present an interface for specifying tasks interactively using demonstrations. Our approach defines a set of increasingly complex policies. The interface allows the user to switch between these policies at fixed intervals to generate demonstrations of novel, more complex, tasks. We train new policies based on these demonstrations and repeat the process. We present a case study of our approach in the Lunar Lander domain, and show that this simple approach can quickly learn a successful landing policy and outperforms an existing comparison-based deep RL method.

1 Introduction

In AI, we specify the desired behavior of robotic systems in several ways. A popular approach is the reward function. However, reward functions make strong assumptions about the design setting: for example, they are only readily applicable when the state space is defined in terms of human-understandable features, or when goal states can be expressed easily in a general-purpose programming language.

Approaches based on learned reward functions allow users to specify tasks in alternative, more flexible ways. For example, in Christiano et al. (2017) and Sadigh et al. (2017) the target behavior is specified through comparisons between states. Over time, the reward function learns a state ranking that agrees with (and, ideally, generalizes beyond) these comparisons, and is used to train a policy or optimize a trajectory. However, these methods often exhibit poor sample complexity on complex tasks, which has led to approaches that additionally leverage, for example, an externally generated set of expert demonstrations Ibarz et al. (2018).

In this work, we draw inspiration from hierarchical planning systems, in which plans consist of a sequence of high-level actions, each of which runs a simpler, primitive policy. Our approach combines two ideas: 1) we can use a set of primitive policies to efficiently define policies that perform complex tasks; and 2) in subsequent training rounds, we can use these complex policies as primitives themselves. Starting from an initial state, we sample a number of fixed-length rollouts from our primitive policies. We show these rollouts to a human designer, who selects the best one for the target task in an interface akin to those used in Christiano et al. (2017) and Sadigh et al. (2017). The process then repeats from the final state of the selected rollout until the designer indicates the end of an episode.

From this data, we can train new policies in two ways. First, from the implicit comparisons between the best rollout and the other rollouts, we learn a reward function using preference learning. This reward function can then be used to train a policy using reinforcement learning. Second, we can train goal classifiers based on states reached in the demonstrations, and train a policy using reinforcement learning to maximize the goal classification probability. This gives the user a quick way to specify behaviors whenever a goal state can be reached even with a sub-optimal demonstration.

We provide a demonstration of this approach in the Lunar Lander domain. We train policies in three stages. In the first stage, we use random behavior to learn two policies: stabilize and drop. Stabilize is trained to reach goal states where the lander is level and (close to) stationary. Drop is trained to turn off the engines for the final landing. In the second stage, we use a combination of random behavior and the stabilize policy to train policies that move stably to the left or stably to the right. In the final stage, we use all four policies to train a policy that lands without crashing. Our final solution is able to land successfully in over 90% of episodes.

2 Related work

This work continues a recent trend of agent specification using interpretable techniques.

One example is preference learning, where specification is based on comparisons between examples of behavior. However, in preference learning, the user has little control over exploration—that is, what kind of examples they are comparing, and therefore what kind of information they are supplying. Existing approaches for generating examples include selection based on reward uncertainty Christiano et al. (2017) and active synthesis of maximally-informative comparisons Sadigh et al. (2017). In contrast, our approach allows the user to influence exploration directly through demonstration.

Another example of this trend is natural language instruction. However, natural language must be grounded in the environment. This can be difficult; possibilities include demonstration of instructions Co-Reyes et al. (2018), examples of goal states corresponding to instructions Bahdanau et al. (2018), and rewards manually conditioned on instructions Hermann et al. (2017). We approach the problem from the opposite direction: instead of taking an existing vocabulary and grounding it in the environment, we give the user the means to define a vocabulary of behaviors for themselves.

3 Method

Training is based on demonstrations using a set of behaviors defined by the user. These behaviors, which we term behavioral primitives, are encoded as policies. The training process is iterative: experience generated by one set of demonstrations is used to define primitives for the next set of demonstrations, continuing until we obtain a primitive that can perform the full task. The user defines the first set of primitives by identifying interesting goal states in experience generated by a random policy and then training policies to reach each of those states. Thereafter, the user defines primitives based either on goal states in experience generated by previous demonstrations, or directly on the behaviors demonstrated. The training system therefore consists of three components:

  • An interface for defining goal states based on experience generated during training so far.

  • Apparatus for training behavioral primitives.

  • An interface for giving demonstrations using those behavioral primitives.

Each of these components is described in detail below.
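To make the interplay between these components concrete, the sketch below shows how one iteration of the process might be organized. Every helper name (`label_goal_states`, `train_primitive`, `collect_demonstrations`, `train_from_demonstrations`) is a hypothetical placeholder for the interfaces and training apparatus described in the following subsections, not part of a released implementation.

```python
# A hypothetical orchestration of the three components described above.
# Every helper passed in is a placeholder; none of these names come from
# a real library or from our actual code.

def iterated_training(env, num_iterations, random_policy, label_goal_states,
                      train_primitive, collect_demonstrations,
                      train_from_demonstrations):
    """One possible structure for the iterated training process."""
    # The first batch of experience comes from a random policy.
    experience = collect_demonstrations(env, policies=[random_policy])
    primitives = []
    for _ in range(num_iterations):
        # 1. The user browses past experience and labels goal states (Section 3.1).
        goal_examples = label_goal_states(experience)
        # 2. Train a behavioral primitive for each labeled goal (Section 3.2).
        primitives += [train_primitive(env, examples) for examples in goal_examples]
        # 3. The user gives demonstrations with the current primitives (Section 3.3),
        #    and those demonstrations are distilled into a new primitive (Section 3.4).
        experience = collect_demonstrations(env, policies=primitives)
        primitives.append(train_from_demonstrations(env, experience))
    return primitives
```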

3.1 Defining goal states

One way to define a behavioral primitive is by training a goal-conditioned policy based on a goal defined by the user. In contrast to previous work in goal-conditioned RL Schaul et al. (2015); Andrychowicz et al. (2017); Nair et al. (2018), we consider goal states that are abstract concepts, rather than specific environment states. Instead of referring to, say, a particular position in the room, one of our goal states might be ‘near a human’, yielding a behavioral primitive that moves to a human. This enlarges the set of possible behaviors that can be encoded as goal-conditioned policies. The user defines these abstract goal states by example. These examples are used to train a binary classifier (which we refer to as a discriminator) to recognize whether the agent is in the defined state.

The user begins by browsing videos of previous episodes generated during the training process. Initially, these episodes are generated by a random policy, but later in training they include episodes generated by demonstrations. Once the user has identified an ‘interesting’ state (a behavior she believes will be useful for later demonstrations), she labels video frames as positive or negative examples of that state. Frames are mapped to environment observations (which may be lower-dimensional than the video frames) by the system, creating a set of positive and negative examples of observations. These examples are then used to train a binary classifier, a discriminator, implemented using a neural network, to recognize the abstract state. In our experiments below, roughly 400 examples are required to train a robust discriminator for each state.
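As an illustration, such a discriminator could be a small feed-forward network trained with a binary cross-entropy loss on the labeled observations. The architecture and training settings below are illustrative assumptions rather than the exact ones used in our experiments.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Binary classifier: does an observation belong to the abstract goal state?"""

    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.net(obs).squeeze(-1)  # logits

def train_discriminator(positives, negatives, epochs=200, lr=1e-3):
    """positives/negatives: float tensors of shape (N, obs_dim) of labeled observations."""
    obs = torch.cat([positives, negatives])
    labels = torch.cat([torch.ones(len(positives)), torch.zeros(len(negatives))])
    model = Discriminator(obs.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(obs), labels)
        loss.backward()
        opt.step()
    return model
```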

3.2 Behavioral primitives from goal states

Figure 2: Demonstrations interface. Demonstrations are given one action at a time. Each action corresponds to some temporally-extended behavior generated by running the corresponding primitive policy for a fixed number of timesteps (here, three timesteps). Because the results of each primitive may be unpredictable, the user chooses between actions based on the simulated rollouts that would result (here, the filled circle moving to the position indicated by the unfilled circle). Once the user chooses an action, the final state from the corresponding rollout is used as the first state of the next set of rollouts.

To define primitive policies based on user-defined goal states, we use the discriminator's output probability as a reward signal and then train using an off-the-shelf reinforcement learning algorithm, Proximal Policy Optimization Schulman et al. (2017), from OpenAI Baselines Dhariwal et al. (2017). To maximize reward, the resulting policy must activate the discriminator as strongly as possible; in other words, it must move to and stay in the goal state.
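For concreteness, one way to wire the discriminator into training is as a Gym reward wrapper that substitutes the discriminator's output probability for the environment reward. The wrapper below is a sketch under that assumption (the `discriminator` argument is the classifier from Section 3.1); it is not the exact code we used with OpenAI Baselines.

```python
import gym
import torch

class DiscriminatorRewardWrapper(gym.Wrapper):
    """Replaces the environment reward with the goal discriminator's probability."""

    def __init__(self, env, discriminator):
        super().__init__(env)
        self.discriminator = discriminator  # e.g. the classifier from Section 3.1

    def step(self, action):
        obs, _, done, info = self.env.step(action)
        with torch.no_grad():
            logit = self.discriminator(torch.as_tensor(obs, dtype=torch.float32))
            reward = torch.sigmoid(logit).item()  # probability in [0, 1]
        return obs, reward, done, info

# The wrapped environment can then be handed to any RL algorithm (we use PPO)
# to train the behavioral primitive, e.g.:
#   env = DiscriminatorRewardWrapper(gym.make("LunarLander-v2"), discriminator)
```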

3.3 Demonstrations using behavioral primitives

The user gives demonstrations by using primitive policies as temporally-extended actions. Based on the current environment state, the user chooses a policy; the policy is run for a fixed number of timesteps; based on the new state, the user chooses a new policy; and so on. Primitive policies are thus similar to options in the options framework Sutton et al. (1999), with a termination condition based on the number of steps.

When the effect of each primitive is predictable, the choice of primitive at each step is straightforward. In general, however, we assume primitives will be somewhat unpredictable. This is partly because we do not expect trained primitives to be perfect (e.g. movement may be erratic). However, the environment itself may also be unpredictable. In the Lunar Lander game described below, for example, it is difficult to predict how the spacecraft will move in low gravity. To enable an informed choice, at each demonstration step we show the user not only the current environment state but also a video of the rollout that would result from running each policy from that state. Essentially, the user chooses by examining the short-term futures that would result from each action.

This interface is illustrated in Figure 2.
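The sketch below shows one possible implementation of a single demonstration step. It assumes primitives are callables mapping observations to actions and that the environment can be snapshotted with `copy.deepcopy`; a real implementation may need an environment-specific save/restore mechanism instead, and `choose` stands in for the user interface of Figure 2.

```python
import copy

def demonstration_step(env, obs, primitives, k, choose):
    """One step of the demonstrations interface (illustrative sketch).

    Each primitive is rolled out for k timesteps in a snapshot of the
    environment, the user picks the best rollout via `choose`, and the
    snapshot corresponding to that choice becomes the new current state.
    """
    candidates = []
    for policy in primitives:
        sim = copy.deepcopy(env)  # snapshot; real envs may need save/restore instead
        sim_obs, frames, done = obs, [], False
        for _ in range(k):
            sim_obs, _, done, _ = sim.step(policy(sim_obs))
            frames.append(sim_obs)
            if done:
                break
        candidates.append((sim, sim_obs, frames, done))

    # The user examines the candidate rollouts (their frames) and picks one.
    chosen = choose(candidates)
    env, obs, _, done = candidates[chosen]
    return env, obs, done
```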

3.4 Behavioral primitives from demonstrations

In addition to defining primitives from goal states, we also support defining behavioral primitives directly from behavior demonstrated by the user. Our early experiments used a simple imitation learning technique, behavioral cloning Pomerleau (1991). However, the resulting policies would often perform significantly worse than the demonstrations themselves. Instead, we note that our demonstrations offer an additional source of information. Each demonstrated action yields not only information about optimal behavior (the rollout from the selected primitive) but also comparisons to sub-optimal behaviors from the same state (the rollouts from the other primitives). These comparisons can be used to train a policy using preference learning techniques.

In particular, we implement preference learning based on Christiano et al. (2017). This involves training a neural network to predict, for each pairwise comparison between the chosen rollout and one of the other rollouts, which rollout was chosen. The prediction is made using a latent reward value calculated for each frame in each rollout. This predicted reward function is then used to train a policy using a reinforcement learning algorithm; again, we use Proximal Policy Optimization Schulman et al. (2017).
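Concretely, the comparison loss has the Bradley-Terry form used by Christiano et al. (2017): per-frame latent rewards are summed over each rollout, and the pair of sums is treated as logits for which rollout the user chose. The PyTorch sketch below illustrates the idea; the network size and other details are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardNet(nn.Module):
    """Predicts a latent per-frame reward from a single observation."""

    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.net(obs).squeeze(-1)

def preference_loss(reward_net, chosen, other):
    """Loss for one comparison; `chosen`/`other` have shape (k, obs_dim).

    Rollout returns are sums of per-frame latent rewards; the pair of returns
    is treated as logits for a two-way classification in which the rollout
    the user chose is the correct class (a Bradley-Terry model).
    """
    returns = torch.stack([reward_net(chosen).sum(), reward_net(other).sum()])
    target = torch.zeros(1, dtype=torch.long)  # index 0 = the chosen rollout
    return F.cross_entropy(returns.unsqueeze(0), target)
```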

(We also investigated combining behavioral cloning with preference learning, but we found this to result in worse performance than preference learning alone. Future work will investigate other ways of making use of both types of information.)

3.5 Iterated training

For simple tasks, it may be possible to give demonstrations of the full task using only the first set of primitives defined, and from those demonstrations train a final behavioral primitive that successfully performs the task. In general, however, we assume that initial primitives will only enable demonstration of some part of the task.

The user has a number of options for defining new primitives based on previous primitives.

Demonstration of new behavior. The user might directly demonstrate a behavior she intends to use in later demonstrations, distilling those demonstrations into a new primitive policy as described above. For example, using a set of basic quadcopter movement primitives the user might demonstrate a loop-the-loop behavior and use this as a new primitive.

Goal states from demonstrations. Demonstrations generate experience exploring parts of the state space not covered by the initial set of random behavior. From this experience new goal states can be defined from which new primitives can be trained. Returning to the quadcopter example, the user could demonstrate moving the quadcopter over a charging platform, and using that goal state, train a policy to move to the charging platform. Practically, this involves saving demonstrated episodes for browsing and labelling with the same interface as the initial set of goal states.

Curriculum learning with goal states. In some cases it may not be necessary to demonstrate exploration of new parts of the state space; interesting new states may be reached by simply running one of the existing primitive policies in the environment (assuming that the policy is stochastic and therefore does some exploration of its own). Consider training a robot to navigate a maze. Random exploration from the initial state may only wander around in the first part of the maze, so that the best goal state initially possible to define is only a short distance along the correct path. But by exploring randomly in the vicinity of this first state, it is easier to wander into a deeper part of the maze in which a second goal state may be defined, and so on. Essentially, we can train using a curriculum of gradually more advanced goals. Running a single policy in the environment can be seen as a special type of demonstration where only one action is available.
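As an illustration of this curriculum, the sketch below chains goal states together: each new goal is labeled in episodes generated by the most recent primitive, and a new primitive is trained to reach it. All helper functions are hypothetical placeholders.

```python
def goal_curriculum(env, random_policy, num_stages,
                    collect_episodes, label_goal_state, train_goal_primitive):
    """Curriculum of goal states (sketch; every helper is a placeholder).

    At each stage, episodes generated by the most recent primitive are
    browsed and labeled to define the next goal state, and a new primitive
    is trained to reach it, pushing gradually deeper into the state space
    (e.g. successive waypoints along a maze).
    """
    policy, primitives = random_policy, []
    for _ in range(num_stages):
        episodes = collect_episodes(env, policy)       # run the current primitive
        goal_examples = label_goal_state(episodes)     # user labels the next goal
        policy = train_goal_primitive(env, goal_examples)
        primitives.append(policy)
    return primitives
```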

This full training process is shown in Figure 1.

4 Case study: Lunar Lander

Figure 3: Lunar Lander gameplay. The user must guide the descent of the spacecraft, landing in the area designated by the two flags.

As a concrete example, we use this system to train an agent to play a video game. Lunar Lander is a simple game included with OpenAI Gym Brockman et al. (2016) in which a user must control a 2D spacecraft, landing gently in a designated landing zone on the lunar surface (Figure 3). With a (discrete) action space of ‘rotate spacecraft left’, ‘rotate spacecraft right’ and ‘fire thrusters at the bottom of the spacecraft’, the spacecraft is very hard to control; it is challenging to train a robust policy through simple imitation learning because it is difficult to give good demonstrations. Previous work with Lunar Lander has, for example, attempted to make control easier by assisting the user in the original action space Reddy et al. (2018). In contrast, we enable the user to define a new control space which is easier to use in the first place.

4.1 Goal states and behavioral primitives

Figure 4: Primitives defined for Lunar Lander. Starting from random behavior, we use goal states to define ‘stabilize descent’ and ‘drop with engines off’ primitives. We then define further goal states in experience generated by the ‘stabilize descent’ primitive to define ‘fly stably left’ and ‘fly stably right’ primitives.

Iteration 1: starting from random behavior, we define a stabilize descent goal state and primitive. Each episode begins with the spacecraft falling towards the surface at a random speed and angle; this primitive slows down the spacecraft and returns its angle to neutral. We train this primitive based on examples of the spacecraft being level and having a low velocity (angle and velocity are both included in the observation space).

Iteration 2: from experience generated by running the ‘stabilize descent’ policy, we define stably fly left and stably fly right primitives. One major difficulty when demonstrating with original controls is the need to rotate in order to move left or right. It is easy to rotate too much and become unstable. These primitives move the spacecraft slowly left or right without allowing the angle to deviate too much from neutral. We also train these primitives using goal states. We generate experience using the ‘stabilize descent’ primitive, which is not perfect and sometimes drifts slightly in one direction. We collect examples of this drifting and use those examples to train goal state discriminators and corresponding policies.

Iteration 3: we define a drop primitive, completely shutting off the spacecraft’s engines so that when the craft is sufficiently close to the lunar surface we can actually land. This primitive is defined based on instances from the initial set of random behavior in which the engine is not firing.

An illustration of these primitives is shown in Figure 4.

Iteration 4: finally, using these four primitives—‘stabilize descent’, ‘fly stably left’, ‘fly stably right’ and ‘drop’—we train a policy to actually play the game by demonstrating successful landings and training a policy from those demonstrations.

Defining this set of four demonstration primitives takes roughly one hour. Though the primitives are not perfect, they are sufficient to land the spacecraft in the landing zone in 80% of demonstrations.

4.2 Results

Figure 5: Lunar Lander training results—success rate against number of RL steps in the environment. Using our approach, we are able to train a policy which lands successfully in almost all episodes. A comparable preference learning baseline, Deep Reinforcement Learning from Human Preferences Christiano et al. (2017), succeeds in only one out of three runs. A baseline using imitation learning instead of reinforcement learning, DAgger Ross et al. (2011), is more robust but does not achieve high performance. (Successful landing rate is calculated using a moving window of ten episodes. Shaded regions indicate one standard deviation across three runs, each with a different random seed. Full training curve for DAgger not shown due to dissimilarity in training method.)

Starting from the four demonstration primitives described above, we perform three training runs, each with a different random initialization. In all runs, we found that only eight demonstrated episodes are required to train a successful policy. This requires roughly 15 minutes of human interaction time, followed by 90 minutes of training time (reinforcement learning using the reward function inferred from demonstrations). The resulting policy is capable of landing successfully in over 90% of episodes (see Figure 5).

We compare these results against two baselines, matching 15 minutes of human interaction time in each case. First, we compare to preference learning from randomly-selected examples of behavior generated while training, as in Christiano et al. (2017). Here, training is significantly less robust; we could train a successful policy in only one out of three runs. In the other two runs, the policy did not produce the kind of examples that would enable the user to give informative preferences, so the policy learned only to hover in mid-air rather than to land. Second, we compare to simple imitation learning using Dataset Aggregation (DAgger) Ross et al. (2011). Here the trained policy does learn to land, but often does so by crash-landing, and often misses the target area.

5 Conclusions and discussion

We have presented a proposal for an interactive training interface that combines several learning methods to enable a non-technical user to build a policy from scratch. We use this interface to build a policy that plays the Lunar Lander game, and find that our method outperforms two comparable baseline methods in robustness and policy performance.

5.1 Future work

Incorporating imitation learning. We were surprised to find that combining behavioral cloning with preference learning resulted in worse performance than preference learning alone. We would like to explore alternative methods of combining the two—for example, using behavioral cloning to pre-train, rather than as part of a combined loss.

Variable-timescale primitives. In this work, all primitives ran for the same number of timesteps. However, part of the power of an iterated training process is that demonstrations can take place at increasingly abstract levels as training progresses. To enable this, primitive policies would need to run for more timesteps, and a clear criterion is needed for how many steps each policy requires. One way to achieve this might be through flexible termination conditions, as used in the options framework Sutton et al. (1999).

5.2 Acknowledgements

We thank Rohin Shah and Adam Gleave for helpful comments and suggestions on an earlier version of this manuscript. This work was supported by the Center for Human-Compatible AI.

References