Visual Imitation Learning with Recurrent Siamese Networks

01/22/2019 · Glen Berseth et al., The University of British Columbia

People solve the difficult problem of understanding the salient features of both the observations of others and the relationship to their own state when learning to imitate specific tasks. In this work, we train a comparator network that computes distances between motions. Given a desired motion, the comparator provides a reward signal to the agent via the distance between the desired motion and the agent's motion. We train an RNN-based comparator model to compute distances in space and time between motion clips while training an RL policy to minimize this distance. Furthermore, we examine a challenging form of this problem where only a single demonstration is provided for a given task. We demonstrate our approach in the setting of deep-learning-based control for physical simulation of humanoid walking, in both 2D with 10 degrees of freedom (DoF) and 3D with 38 DoF.


1 Introduction

Often, in Reinforcement Learning (RL), the designer formulates a reward function to elicit the desired behaviour from the policy. However, people often modify or refine their objectives as they learn. For example, a gymnast learning how to perform a flip can understand the overall motion from observing a few demonstrations. Over time, drawing on previous experience, the gymnast learns to recognize the less obvious but significant state features that determine a good flipping motion. In the same vein, we gradually learn a distance function: as the agent explores and becomes more skilled, it refines its understanding of the state space, and the distance metric in turn becomes more accurate.

Robots and people may plan using an internal pose-space representation; however, when people observe others performing tasks, typically only visual information is available. Distances in pose space are often ill-suited for imitation, because changing some features can result in drastically different visual appearance. To understand how to perform tasks from visual observation, some mapping or transformation of the observations is needed so that a meaningful distance can be minimized. Even with a method to transform observations into a similar pose, every person has different capabilities. Because of this, people must learn how to transform demonstrations into a representation in which they can reproduce the behaviour to the best of their ability. In our work we construct a distance metric derived from the agent's visual perception, without the need for an intermediate pose representation, by allowing the agent to observe itself externally and compare that perception with a demonstration.

Searching for a distance function has been an active topic of research (Abbeel & Ng, 2004; Argall et al., 2009). Given some vector of features, the goal is to find a transformation of these features such that distances computed in the transformed space carry strong contextual meaning. With a meaningful transformation function, a distance can be computed between an agent and a demonstration. Previous work has explored state-based distance functions, but many rely on pose-based metrics (Ho & Ermon, 2016; Merel et al., 2017a) that come from an expert. Few use image-based inputs, and none consider the importance of learning a distance function in time as well as space (Sermanet et al., 2017; Finn et al., 2017; Liu et al., 2017; Dwibedi et al., 2018). In this work we train a recurrent siamese network to learn distances between videos (Chopra et al., 2005).

Imitation learning and RL often intersect when only the expert's state distribution is available. In this case the agent needs to try actions in the environment to find ones that lead to behaviour similar to the expert's. However, designing a reward function that provides a reasonable distance between the agent and the expert is complex. An important detail of imitating demonstrations is their sequential and causal nature: there is both an ordering and a speed at which the demonstration is performed. Matching the demonstration's state distribution is important, but state-by-state similarity can force the agent to imitate the demonstration's exact timing. This can be highly effective and lead to smooth motions, but the signal is sparse and constrains the result to have similar timing to the demonstration; when the agent's motion becomes desynchronized with the demonstration, the agent receives low reward. Consider the case where a robot has learned to stand before it can walk. This state is close to the demonstration, and actions that help the robot stand should be encouraged. We therefore learn a Recurrent Neural Network (RNN)-based distance function that can give reward for out-of-sync but similar behaviour. The work in Liu et al. (2017); Dwibedi et al. (2018); Sermanet et al. (2017); Peng et al. (2018b) also performs imitation from video observation, but each assumes some form of time alignment between the agent and the demonstration. Considering the data sparsity of the problem, we include data from other tasks in order to learn a more robust distance function in visual sequence space.

Our method has similarities to both Inverse Reinforcement Learning (IRL) (Abbeel & Ng, 2004) and Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016). The process of learning a cost function that understands the space of policies in order to find an optimal policy given a demonstration is fundamentally IRL, while using positive examples from the expert and negative examples from the policy is similar to the method GAIL uses to train a discriminator to understand the desired distribution. In this work we build upon these techniques by constructing a method that can learn policies using only noisy, partially observable visual data. We also construct a cost function that takes into account the demonstration timing as well as pose by using a recurrent siamese network. Our contribution rests on proposing and exploring this form of recurrent siamese network as a way to address the key problem of defining the reward structure for imitation learning for deep RL agents.

2 Preliminaries

In this section we outline key details of the general RL framework and the specialized formulations that we rely upon when developing our method in Section 3.

2.1 Reinforcement Learning

Using the RL framework formulated as a Markov Decision Process (MDP): at every time step $t$, the world (including the agent) exists in a state $s_t \in S$, wherein the agent is able to perform actions $a_t \in A$, sampled from a policy $\pi(a_t \mid s_t)$, which results in a new state $s_{t+1} \in S$ according to the transition probability function $P(s_{t+1} \mid s_t, a_t)$. Performing action $a_t$ from state $s_t$ produces a reward $r_t$ from the environment; the expected future discounted reward from executing a policy $\pi$ is:

$J(\pi) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right]$   (1)

where $T$ is the max time horizon, and $\gamma$ is the discount factor, indicating the planning horizon length.

The agent's goal is to optimize its policy, $\pi_{\theta}$, by maximizing $J(\pi_{\theta})$. Given policy parameters $\theta$, the goal is reformulated to identify the optimal parameters $\theta^{*}$:

$\theta^{*} = \arg\max_{\theta} J(\pi_{\theta})$   (2)

We use a Gaussian distribution to model the stochastic policy $\pi(a_t \mid s_t)$. Our stochastic policy is formulated as follows:

$\pi(a_t \mid s_t, \theta) = \mathcal{N}\big(\mu(s_t \mid \theta), \Sigma\big)$   (3)

where $\Sigma$ is a diagonal covariance matrix with entries $\sigma^{2}$ on the diagonal, similar to (Peng et al., 2017).

For policy optimization we employ stochastic policy gradient methods (Sutton et al., 2000). The gradient of the expected future discounted reward with respect to the policy parameters, $\theta$, is given by:

$\nabla_{\theta} J(\pi_{\theta}) = \int_{S} d_{\theta}(s) \int_{A} \nabla_{\theta} \log \pi_{\theta}(a \mid s)\, A(s, a)\, da\, ds$   (4)

where $d_{\theta}(s) = \int_{S} \sum_{t=0}^{T} \gamma^{t} p_{0}(s_{0})\, p(s_{0} \to s \mid t, \pi_{\theta})\, ds_{0}$ is the discounted state distribution, $p_{0}(s)$ represents the initial state distribution, and $p(s_{0} \to s \mid t, \pi_{\theta})$ models the likelihood of reaching state $s$ by starting at state $s_{0}$ and following the policy $\pi_{\theta}$ for $t$ steps (Silver et al., 2014). $A(s, a)$ represents an advantage function (Schulman et al., 2016).
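As a concrete illustration of Eqs. 3 and 4, the PyTorch sketch below samples actions from a diagonal-Gaussian policy and forms the standard score-function surrogate loss whose gradient matches the estimator above. The network sizes and names are illustrative placeholders, not the architecture used in this work.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian policy pi(a | s) = N(mu(s), diag(sigma^2))."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim))
        # State-independent log standard deviation, a common choice for continuous control.
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def dist(self, states):
        return torch.distributions.Normal(self.mu(states), self.log_std.exp())

def policy_gradient_loss(policy, states, actions, advantages):
    """Surrogate loss whose gradient is the estimator in Eq. 4:
    E[ grad log pi(a|s) * A(s, a) ]."""
    log_prob = policy.dist(states).log_prob(actions).sum(dim=-1)
    return -(log_prob * advantages).mean()
```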

2.2 Imitation Learning

Imitation learning is the process of training a new policy to reproduce the behaviour of some expert policy. Behavioural Cloning (BC) is a fundamental method for imitation learning. Given an expert policy $\pi^{*}$, possibly represented as a collection of trajectories $\tau^{*}$, a new policy $\pi_{\theta}$ can be learned to match these trajectories using supervised learning:

$\theta^{*} = \arg\max_{\theta}\; \mathbb{E}_{(s, a) \sim \tau^{*}}\big[\log \pi_{\theta}(a \mid s)\big]$   (5)

While this simple method can work well, it often suffers from distribution-mismatch issues, leading to compounding errors as the learned policy deviates from the expert's behaviour.
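A minimal sketch of one behavioural-cloning update under the maximum-likelihood form of Eq. 5, reusing the hypothetical GaussianPolicy from the earlier sketch; the expert batch variables are assumed to come from expert trajectories.

```python
import torch

def bc_update(policy, optimizer, expert_states, expert_actions):
    """One supervised step: maximize the log-likelihood of expert actions
    under the current policy (maximum-likelihood form of Eq. 5)."""
    log_prob = policy.dist(expert_states).log_prob(expert_actions).sum(dim=-1)
    loss = -log_prob.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```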

Similar to BC, IRL also learns to replicate some desired behaviour. However, IRL makes use of the environment, using the RL environment without a defined reward function. Here we describe maximum-entropy IRL (Ziebart et al., 2008). Given an expert trajectory $\tau^{*}$, a policy $\pi$ can be trained to produce similar trajectories by discovering a cost function that separates the expert trajectories from trajectories produced by the policy:

$\max_{c}\left(\min_{\pi}\; -H(\pi) + \mathbb{E}_{\pi}[c(s, a)]\right) - \mathbb{E}_{\pi^{*}}[c(s, a)]$   (6)

where $c$ is some learned cost function and $H(\pi)$ is a causal entropy term. $\pi^{*}$ is the expert policy, represented by a collection of trajectories. IRL searches for a cost function that is low for the expert and high for other policies. A policy can then be optimized by maximizing the reward function $r(s, a) = -c(s, a)$.

GAIL (Ho & Ermon, 2016) uses a Generative Adversarial Network (GAN)-based (Goodfellow et al., 2014) framework where the discriminator $D$ is trained with positive examples from the expert trajectories and negative examples from the policy. The generator is the combination of the environment and the current state-visitation distribution induced by the policy $\pi$:

$\min_{\pi} \max_{D}\; \mathbb{E}_{\pi}[\log D(s, a)] + \mathbb{E}_{\pi^{*}}[\log(1 - D(s, a))] - \lambda H(\pi)$   (7)

In this framework the discriminator provides rewards for the RL policy to optimize, as the probability of a state generated by the policy being in the expert distribution.
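The sketch below illustrates the GAIL-style pattern described above: a discriminator is trained to separate policy samples from expert samples, and its output is converted into a reward for the policy. The network sizes, the labelling convention, and the exact log-based reward are common choices for illustration rather than the specific formulation of Ho & Ermon (2016).

```python
import torch
import torch.nn as nn

state_dim, action_dim = 32, 8                 # illustrative dimensions

discriminator = nn.Sequential(                # D(s, a) -> logit
    nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
    nn.Linear(128, 1))

bce = nn.BCEWithLogitsLoss()

def discriminator_loss(policy_sa, expert_sa):
    """Policy samples labelled 1, expert samples labelled 0 (Eq. 7 convention)."""
    logits_pi = discriminator(policy_sa)
    logits_ex = discriminator(expert_sa)
    return bce(logits_pi, torch.ones_like(logits_pi)) + \
           bce(logits_ex, torch.zeros_like(logits_ex))

def gail_reward(sa):
    """Reward the policy for samples the discriminator classifies as expert-like."""
    with torch.no_grad():
        d = torch.sigmoid(discriminator(sa))
    return -torch.log(d + 1e-8)               # large when D believes the sample is expert
```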

3 Our Approach

In this section we describe our method for performing recurrent vision-based imitation learning.

3.1 Partially Observable Imitation Learning Without Actions

For many problems we want to learn how to replicate the behaviour of some expert without access to the expert's actions. Instead, we may only have access to an actionless, noisy observation of the expert that we call a demonstration. Recent work uses BC to learn a model that estimates the actions used, via maximum-likelihood estimation (Torabi et al., 2018). Still, BC often needs many expert examples and tends to suffer from state-distribution-mismatch issues between the expert policy and the student (Ross et al., 2011). Work in (Merel et al., 2017b) proposes a system based on GAIL that can learn a policy from a partial observation of the demonstration. In that work the state input to the discriminator is a customized version of the expert's pose and does not take into account the demonstration's sequential nature. The work in (Wang et al., 2017) provides a more robust GAIL framework along with a new model to encode motions for few-shot imitation. This model uses an RNN to encode a demonstration but relies on expert state and action observations. In our work we limit the agent to only a partial visual observation as a demonstration. Additional works learn implicit models of distance (Yu et al., 2018; Pathak et al., 2018; Finn et al., 2017; Sermanet et al., 2017), but none of these explicitly learn a sequential model that uses the demonstration timing. Another version of GAIL, infoGAIL (Li et al., 2017), was used on some pixel-based inputs. In contrast, here we train a recurrent siamese model that can be used to enable curriculum learning and allow for reasonable distances to be computed even when the agent and demonstration are out of sync or have different capabilities.

3.2 Distance-Based Reinforcement Learning

Given a distance function $d(s, s')$ that indicates how far away the agent is from some desired behaviour, a reward function over states can be constructed as $r(s_t) = -d(s_t, s^{*}_t)$. In this framework there is no reward signal coming from the environment; instead, the fixed rewards produced by the environment are replaced by the agent's learned model, which is used to compare the agent to some desired behaviour:

$\pi^{*} = \arg\max_{\pi}\; \mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t}\,\big(-d(s_t, s^{*}_t)\big)\right]$   (8)

Here $d$ can take many forms: in IRL it is some learned cost function, and in GAIL it is the discriminator probability. In this framework the function can be considered more general and interpreted as a distance from desired behaviour. Specifically, in our work $d$ is a learned distance between video clips.
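As a concrete sketch of this framework, the wrapper below discards the environment's reward and substitutes the negated learned distance between the agent's rendered view and the corresponding demonstration frame. The render_agent_view helper, the gym-style step signature, and the simple per-step frame alignment are assumptions for illustration.

```python
class DistanceRewardWrapper:
    """Wraps an environment so the reward comes from a learned distance to a
    demonstration clip instead of from the environment itself."""
    def __init__(self, env, distance_fn, demo_frames):
        self.env, self.distance_fn, self.demo = env, distance_fn, demo_frames
        self.t = 0

    def reset(self):
        self.t = 0
        return self.env.reset()

    def step(self, action):
        obs, _, done, info = self.env.step(action)     # discard the environment reward
        frame = self.env.render_agent_view()           # assumed helper returning a rendered frame
        demo_frame = self.demo[min(self.t, len(self.demo) - 1)]
        reward = -float(self.distance_fn(frame, demo_frame))
        self.t += 1
        return obs, reward, done, info
```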

Many different methods can be used to learn a distance function in state space. We use a standard triplet loss over time and task data (Chopra et al., 2005). The triplet loss minimizes the distance between two examples that are positive (very similar, or from the same class) and maximizes the distance between pairs of examples that are known to be unrelated.

Data used to train the siamese network is a combination of trajectories generated from simulating the agent in the environment as well as the demonstration:

$\mathcal{L}(s, s_p, s_n) = d\big(f(s), f(s_p)\big) + \max\big(\rho - d\big(f(s), f(s_n)\big),\, 0\big)$   (9)

where $(s, s_p)$ is a positive pair whose distance should be minimal and $(s, s_n)$ is a negative pair whose distance should be maximal. The margin $\rho$ is used as an attractor or anchor to pull the negative example output away and push values towards a 0 to 1 range. $f(\cdot)$ computes the output from the underlying network. A diagram of this image-based training process and design is shown in Figure 1(a). The distance between two states is calculated as $d(s_i, s_j) = \lVert f(s_i) - f(s_j) \rVert$ and the reward as $r(s_i, s_j) = -d(s_i, s_j)$. For recurrent models we use the same loss; however, the states are sequences. The sequence is fed into the RNN and a single output encoding is produced for that sequence. During RL training we compute a distance given the sequence of states observed so far in the episode. This is a very flexible framework that allows us to train a distance function in state space where all we need to provide is labels that denote whether two states, or sequences, are similar or not.
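A minimal sketch of this pairwise margin objective, assuming an embedding network f and Euclidean distances between embeddings; the margin value and the exact distance/reward forms here are illustrative.

```python
import torch
import torch.nn.functional as F

def siamese_pair_loss(f, anchor, positive, negative, margin=1.0):
    """Pull the positive pair together and push the negative pair apart,
    up to the margin (Eq. 9-style objective)."""
    d_pos = F.pairwise_distance(f(anchor), f(positive))
    d_neg = F.pairwise_distance(f(anchor), f(negative))
    return (d_pos.pow(2) + F.relu(margin - d_neg).pow(2)).mean()

def distance(f, x, y):
    """Distance used for the imitation reward r = -d (assumed form)."""
    return F.pairwise_distance(f(x), f(y))
```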

3.3 Sequence Imitation

Using a distance function in the space of states can allow us to find advantageous policies. The hazard with a state-only distance metric is that, when given demonstrations as sequences to imitate, the RL agent can suffer from phase mismatch. In this situation the agent may be performing the desired behaviour but at a different speed. As the demonstration timing and the agent diverge, the agent receives less reward, even though it is visiting states that exist elsewhere in the demonstration. If instead we consider the current state conditioned on the previous states, we can learn to give reward for visiting states that are merely out of sync with the demonstration motion.

This distance metric is formulated in a recurrent style, where the distance is computed from the current state conditioned on all previous states. The loss function remains the same as in Eq. 9, but the overall learning process changes to use an RNN-based model. A diagram of the method is shown in Figure 1(b). This model uses a time-distributed RNN. A single convolutional network $f$ is first used to transform the images of the agent and of the demonstration into encoding vectors. After the sequence of images is passed through $f$, the resulting encoded sequence is fed into the RNN until a final encoding is produced for the agent; the same process is applied by a copy of the RNN to produce a final encoding for the demonstration. The loss is computed in a similar fashion to (Mueller & Thyagarajan, 2016), using the sequence encodings of the images from the agent and from the demonstration. The reward at each timestep is computed from the distance between these two sequence encodings. At the beginning of every episode the RNN's internal state is reset. The policy and value function each use a multi-layer fully connected neural network.
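The sketch below mirrors the time-distributed structure just described: a shared convolutional encoder maps each frame to a vector, a GRU consumes the encoded sequence, and the final hidden state serves as the sequence embedding used for the distance. The layer sizes and input resolution are placeholder assumptions, not the exact architecture of Figure 1.

```python
import torch
import torch.nn as nn

class RecurrentSiameseEncoder(nn.Module):
    """Shared conv encoder applied per frame, followed by a GRU over time."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(                     # grayscale frames assumed
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(embed_dim), nn.ReLU())
        self.gru = nn.GRU(embed_dim, embed_dim, batch_first=True)

    def forward(self, frames):                          # frames: (B, T, 1, H, W)
        b, t = frames.shape[:2]
        per_frame = self.conv(frames.reshape(b * t, *frames.shape[2:]))
        _, h_final = self.gru(per_frame.reshape(b, t, -1))
        return h_final.squeeze(0)                       # (B, embed_dim) sequence encoding

# Distance between an agent clip and a demonstration clip (reward r = -d):
#   d = torch.norm(enc(agent_clip) - enc(demo_clip), dim=-1)
```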

(a) conv siamese network (TCN)
(b) conv RNN siamese network
Figure 1: Siamese network structure. The convolutional portion of the network consists of three convolutional layers of decreasing kernel size and stride; the features are then flattened and followed by two dense layers. The majority of the network uses ReLU activations, except the last layer, which uses a sigmoid activation. Dropout is used between the convolutional layers. The RNN-based model adds a GRU layer followed by a dense layer.

3.4 The RL Simulation Environment

Our simulation environment is similar to the OpenAI Gym robotics environments (Plappert et al., 2018). The demonstration the agent is learning to imitate is generated from a clip of mocap data, which is used to animate a second robot in the simulation. Frames from the simulation are captured and used as video input to train the distance metric. The policy is trained on pose data, given as link distances and velocities relative to the simulated robot's Centre of Mass (COM). This is a new simulation environment created to take motion capture data and produce multi-view video data that can be used for training RL agents or for generating data for computer vision tasks. The environment also includes new challenging and dynamic tasks for humanoid robots. The simulation and rendering have been optimized, and efficient EGL-based off-screen rendering is used to capture video data.

3.5 Data Augmentation

In a manner similar to how a person may learn to understand and reproduce a behaviour (Council et al., 2000; Gentner & Stevens, 2014), we apply a number of data augmentation methods to produce additional data for training the distance metric. Using methods analogous to the cropping and warping methods popular in computer vision (He et al., 2015), we randomly crop sequences and randomly warp the demonstration timing; a sketch of this warping and cropping is given below. The cropping is performed both by initializing the agent to random poses from the demonstration motion and by terminating episodes when the agent's head, hands or torso contact the ground. The motion warping is done by replaying the demonstration motion at different speeds. We also make use of time information in a similar way to (Sermanet et al., 2017), where observations at similar times in the same sequence are often correlated and observations at different times may have little similarity. To this end we generate more training samples using randomly cropped sequences from the same trajectory, as well as reversed and out-of-sync versions. Imitation data from other tasks is also used to help condition the distance-metric learning process. Motion clips for running, backflips, frontflips, dancing, punching, kicking and jumping are used along with the desired walking motion. See the supplementary document for more details on how positive and negative pairs are created from this data.
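A sketch of the timing warp and temporal crop described above, applied to a demonstration clip stored as an array of frames; the speed range and crop length are illustrative values, not the settings used in this work.

```python
import numpy as np

def warp_and_crop(frames, min_speed=0.75, max_speed=1.25, crop_len=32, rng=None):
    """Replay a clip at a random speed, then take a random temporal crop.

    frames: array of shape (T, H, W) holding a demonstration clip.
    """
    rng = rng or np.random.default_rng()
    speed = rng.uniform(min_speed, max_speed)
    # Resample frame indices to simulate a slower or faster demonstration.
    idx = np.clip(np.round(np.arange(0, len(frames), speed)).astype(int),
                  0, len(frames) - 1)
    warped = frames[idx]
    if len(warped) > crop_len:
        start = rng.integers(0, len(warped) - crop_len + 1)
        warped = warped[start:start + crop_len]
    return warped
```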

The algorithm used to train the distance metric and policy is outlined in Algorithm 1. Importantly, the RL environment generates two different state representations for the agent. The first is the internal robot pose; the second is the agent's rendered view. The rendered view is used with the distance metric to compute the similarity between the agent and the demonstration. We also attempted to use the visual features as the state input for the policy, but this resulted in poor policy quality.

1: Randomly initialize policy parameters θ and distance-metric parameters φ
2: Create experience memory D ← {}
3: Given a demonstration clip v*
4: while not done do
5:     for i ∈ {0, …, N} do
6:         τ_i ← {}
7:         for t ∈ {0, …, T} do
8:             a_t ← π(·|s_t, θ)
9:             s_{t+1}, v_t ← step(a_t)          ▷ pose state and rendered view
10:            r_t ← −d(v_{0:t}, v*_{0:t} | φ)    ▷ reward from the learned distance metric
11:            τ_i ← τ_i ∪ {(s_t, a_t, r_t, s_{t+1})}
12:            s_t ← s_{t+1}
13:         end for
14:     end for
15:     D ← D ∪ {τ_0, …, τ_N}
16:     Update the distance-metric parameters φ using D
17:     Update the policy parameters θ using {τ_0, …, τ_N}
18: end while
Algorithm 1 Visual Imitation Learning Algorithm
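For readers who prefer code, the following Python skeleton mirrors the structure of Algorithm 1 against hypothetical env, policy, and metric objects; it shows how reward relabelling, the distance-metric update, and the policy update interleave, and is a structural sketch rather than the exact implementation.

```python
def train(env, policy, metric, demo_clip, iterations=1000, episodes_per_iter=8):
    """Alternate between collecting episodes rewarded by the learned distance
    metric and updating both the metric and the policy (Algorithm 1 structure)."""
    memory = []                                    # stores (agent clip, demo clip) pairs
    for _ in range(iterations):
        batch = []
        for _ in range(episodes_per_iter):
            pose, frames, episode = env.reset(), [], []
            done = False
            while not done:
                action = policy.sample(pose)
                next_pose, frame, done = env.step(action)   # pose state plus rendered view
                frames.append(frame)
                # Reward from the learned distance between the clip so far and the demo.
                reward = -metric.distance(frames, demo_clip[:len(frames)])
                episode.append((pose, action, reward, next_pose))
                pose = next_pose
            batch.append(episode)
            memory.append((frames, demo_clip))
        metric.update(memory)                      # siamese / RNN distance-metric update
        policy.update(batch)                       # RL update on the relabelled rewards
```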

4 Experiments and Results

This section contains a collection of experiments and results produced to investigate the capabilities of the method.

4.1 2D Humanoid Imitation

The first experiment uses the method to learn a cyclic walking gait for a simulated humanoid robot given a single motion example, similar to (Peng & van de Panne, 2017). In this simulated robotics environment the agent learns to imitate a given reference walking motion. The agent's goal is to learn how to actuate Proportional Derivative (PD) controllers at each control step to mimic the desired motion. The simulation environment provides a hard-coded reward function based on the robot's pose that is used only to evaluate policy quality. The images captured from the simulation are converted to grey-scale. Between control timesteps the simulation collects images of the agent and of the rendered demonstration motion. The agent is able to learn a robust walking gait even though it is only given noisy partial observations of a demonstration. We find it extremely helpful to normalize the distance-metric outputs with an exponential kernel whose width parameter scales the filtering (Peng & van de Panne, 2017). Early in training the distance metric often produces noisy, large values; because the RL method constantly updates its reward-scaling statistics, this initial high-variance data reduces the significance of the better distance-metric values produced later on by scaling them to very small numbers. The improvement from using this normalized reward is shown in Figure 3(a). Example motion of the agent after learning is shown in Figure 2 and in the supplemental video.
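A sketch of this kind of normalization: the raw distance is squashed through an exponential kernel so that early, noisy, large distances map into a bounded reward. The Gaussian-style kernel and width value are assumptions; the exact form follows (Peng & van de Panne, 2017) in spirit only.

```python
import math

def normalized_reward(distance, width=2.0):
    """Map an unbounded distance to a (0, 1] reward; `width` controls how
    quickly the reward falls off as the distance grows (assumed form)."""
    return math.exp(-(distance / width) ** 2)
```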


Figure 2: Still frame shots from trained policy in the humanoid2d environment.
(a) Reward smoothing
(b) Siamese update frequency
(c) Ablation humanoid3d walking
Figure 3: Ablation analysis of the method. We find that training RL policies is sensitive to the size and distribution of rewards. The siamese network benefits from a number of training adjustments that make it more suitable for RL.

4.2 Algorithm Analysis

We compare the method to two other approaches that can be considered as learning a distance function in state space: GAIL, and a Variational Auto-Encoder (VAE) trained to produce an encoding whose distances are computed with the same method as the siamese network, shown in Figure 5(a). We find that the VAE method does not appear to capture the important distances between states, possibly due to the complexity of the decoding transformation. Similarly, we try a GAIL-type baseline and find that it either produces very jerky motion or stands still, both of which are contained in the imitation data. Our method, which considers the temporal structure of the data, produces higher-value policies.

Additionally, we create a multi-modal version of the method in which the same learning framework is used, but the bottom convolutional network is replaced with a dense network so that a distance metric is learned between agent poses and imitation video. The results of these models, along with using the default manual reward function, are shown in Figure 5(b). The multi-modal version appears to perform about equal to the vision-only model. In Figure 5(b) we also compare our method to a non-sequence-based model that is equivalent to TCN. On average our method achieves higher-value policies.

We conduct an additional ablation analysis in Figure 3(c) to compare the effects of additional data augmentation methods. The first augmentation method is Reference State Initialization (RSI), where the initial state of the agent is randomly selected from the expert mocap data. The second is Early Episode Sequence Priority (EESP), where we cut an earlier, smaller window out of a batch of sequences to train the RNN on. We find it very helpful to reduce RSI: if more episodes start in the same state, the temporal alignment of training batches for the RNN increases. We believe it is important that the distance metric be most accurate for the earlier states in an episode, so we also use EESP, increasing the probability of cropping earlier and shorter windows for RNN training batches. As the agent gets better, the average episode length increases, and so too does the average size of the cropped window. Last, we tried pretraining the distance function. This leads to mixed results; see Figure 4(a) and Figure 4(b). Often, pretraining overfits the initial data collected, leading to poor early RL training. However, in the long run pretraining does appear to improve over the non-pretrained version.
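A sketch of the EESP sampling bias: when selecting an RNN training window from an episode, earlier start indices and shorter windows are given higher probability. The geometric weighting and its decay value are illustrative choices, not the exact scheme used here.

```python
import numpy as np

def sample_training_window(episode_len, max_window=32, decay=0.9, rng=None):
    """Pick (start, length) for an RNN training window, favouring early,
    short windows (EESP-style bias); `decay` is an assumed weighting."""
    rng = rng or np.random.default_rng()
    starts = np.arange(episode_len)
    w_start = decay ** starts                     # earlier starts are more likely
    start = rng.choice(starts, p=w_start / w_start.sum())
    max_len = min(max_window, episode_len - start)
    lengths = np.arange(1, max_len + 1)
    w_len = decay ** lengths                      # shorter windows are more likely
    length = rng.choice(lengths, p=w_len / w_len.sum())
    return int(start), int(length)
```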

(a) siamese loss
(b) siamese loss with pretraining
Figure 4: Training losses for the siamese distance metric.
(a) baseline comparison (humanoid2d)
(b) sequence-based comparison (humanoid2d)
(c) TSNE embedding (humanoid3d)
Figure 5: Baseline comparisons between our sequence-based method, GAIL and TCN. In 5(a) we compare our method to GAIL and to a VAE where the reward is the Euclidean distance between encodings. We perform two additional baseline comparisons between our method and TCN in 5(b); these show that on average our method performs similarly to or better than TCN over time. In these plots the large solid lines are the average performance over a collection of policy training simulations, the dotted lines of the same colour are the performance values for each individual policy training run, and the filled-in areas around the average are the variance over the collection of policy training simulations.

4.3 3D Humanoid Robot Imitation

In these simulated robotics environments the agent learns to imitate a given reference motion of a walk or a run. A single imitation demonstration is provided by the simulation environment as a cyclic motion, similar to (Peng et al., 2018a). The agent is controlled and observed at a fixed frame rate. During learning, additional data from other tasks is included to train the distance metric; for the walking task this comprises walking-dynamic-speed, running, frontflips and backflips. We also include data from a modified version of the task, walking-dynamic-speed, that applies a randomly generated speed modifier to warp the demonstration timing. This additional data provides a richer understanding of distances in space and time. The input to the distance metric is a pair of frames from the simulation per control step; see Algorithm 1.

We find that using the RNN-based distance metric makes the learning process more gradual. This can be seen in Figure 5(b), where the original manually created reward gives sparse feedback, which leads to slow initial learning. Some example trajectories from the policy learned using the method are shown in Figure 6 (walking) and Figure 7 (running) and in the supplemental video.


Figure 6: Still frame shots of the agent’s motion after training on humanoid3d walking.


Figure 7: Still frame shots of the agent’s motion after training on humanoid3d running.

Sequence Encoding

Using the learned sequence encoder, a collection of motions from different classes are processed to create a TSNE embedding of the encodings (Maaten & Hinton, 2008). In Figure 5(c) we plot motions generated both from the learned policy and from the expert trajectories. Interestingly, there are clear overlaps in specific areas of the space for similar classes across learned and expert data. There is also a separation between motion classes in the data, and the cyclic nature of the walking cycle is clearly visible.

5 Discussion and Conclusion

Learning a distance metric is imperfect: the metric can compute inaccurate distances in areas of the state space it has not yet been trained on. This implies that when the agent explores and finds truly new and promising trajectories, the distance metric may produce poor results. We attempt to mitigate this effect by including training data from different tasks while training the distance metric. We believe the method would benefit greatly from a larger collection of multitask data and increased variation within each task. Additionally, if the distance metric's confidence were modelled, this information could be used to reduce variance and overconfidence during policy optimization.

It appears that Deep Deterministic Policy Gradient (DDPG) works well for this type of problem. Our hypothesis is that, because the learned reward function changes between data collection phases, it may be better to view this data as off-policy. Learning a reward function while training also adds variance to the policy gradient. This may indicate that the bias of off-policy methods is preferable to the added variance of on-policy methods. We also find it important to use a small learning rate for the distance metric; this reduces the reward variance between data collection phases and allows a more accurate value function to be learned. Another approach may be to use partially observable RL, which has the ability to learn a better model of the value function given a changing RNN-based reward function. Training the distance metric could benefit from additional regularization, such as constraining the KL divergence between updates to reduce variance. Another option is to learn a sequence-based policy as well, given that the rewards are no longer dependent on a single state observation.

We compared our method to GAIL but found that GAIL has limited temporal consistency, which led to learning very jerky and overactive policies. The use of a recurrent discriminator for GAIL may mitigate some of these issues. It is challenging to produce results better than the carefully, manually crafted reward functions used by some RL simulation environments that include motion phase information in the observations (Peng et al., 2018a, 2017). Still, as environments become increasingly realistic and grow in complexity, we will need more robust methods for describing the desired behaviour we want from the agent.

Training the distance metric is a complex balancing game. One might expect that the model should be trained early and quickly so that it understands the difference between a good and bad demonstration as soon as possible. However, learning quickly confuses the agent: rewards can change rapidly, which can cause the agent to diverge toward an unrecoverable part of policy space. Slower has been better; the distance metric may not be globally accurate, but it can be locally or relatively reasonable, which is enough to learn a good policy. As learning continues, these two optimizations can converge together.

References

6 Appendix

6.1 Datasets

The mocap clips used in the created environment come from the CMU mocap database and the SFU mocap database.

6.2 Training Details

The learning simulations are trained using Graphics Processing Units (GPUs). The simulation not only computes the interaction physics of the world but also renders the simulation scene in order to capture video observations. On average it takes multiple days to execute a single training simulation. The process of rendering and copying the images from the GPU is one of the most expensive operations in the method.

6.3 Distance Function Training

In Figure 8(b) we show the training curve for the recurrent siamese network. The model learns smoothly, considering that the training data is constantly changing as the RL agent explores. In Figure 8(a) the learning curve for the siamese RNN is shown after performing pretraining. We can see the overfitting that occurs during RL training; this overfitting can lead to poor reward prediction during the early phase of training.

(a) siamese loss
(b) siamese loss with pretraining
(c) siamese learning rate
Figure 8: Training losses for the siamese distance metric. Higher is better, as it indicates that the distance between sequences from the same class is smaller.

It can be difficult to train a sequence-based distance function. One particular challenge is training the distance function to be accurate across the space of possible states. We found that a good strategy is to focus on data from the beginning of episodes: if the model is not accurate on states seen early in an episode, the agent may never learn how to reach the good states later in the episode that the distance function understands better. Therefore, when constructing batches to train the RNN, we give higher probability to windows that start earlier in episodes, and a higher probability to shorter sequences. As the agent gets better and average episode lengths increase, so too do the randomly selected sequence windows.

7 Positive and Negative Examples

We use two methods to generate positive and negative examples. The first method is similar to TCN, where we can assume that sequences which overlap more in time are more similar. For each episode two sequences are generated, one for the agent and one for the imitation motion. We compute positive pairs by altering one of these sequences and comparing the altered version to its original version. Below we list the ways we alter sequences for positive pairs.

  1. Adding Gaussian noise to each state in the sequence (with fixed mean and variance)

  2. Creating out-of-sync versions, where the first state is removed from the first sequence and the last state from the second sequence

  3. Duplicating the first state in either sequence

  4. Duplicating the last state in either sequence

We alter sequences for negative pairs in the following ways (a code sketch of this pair construction follows the lists):

  1. Reversing the ordering of the second sequence in the pair.

  2. Randomly picking a state out of the second sequence and replicating it to be as long as the first sequence.

  3. Randomly shuffling one sequence.

  4. Randomly shuffling both sequences.
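A sketch of how such positive and negative alterations might be generated from a sequence stored as a NumPy array; the specific noise level and the uniform choice among alterations are illustrative assumptions.

```python
import numpy as np

def make_positive(seq, rng, noise_std=0.01):
    """One of the positive alterations listed above: noise, de-sync, or edge
    duplication (the noise level is an illustrative value)."""
    choice = rng.integers(4)
    if choice == 0:                                  # Gaussian noise on every state
        return seq + rng.normal(0.0, noise_std, size=seq.shape)
    if choice == 1:                                  # out of sync by one step
        return seq[1:]
    if choice == 2:                                  # duplicate the first state
        return np.concatenate([seq[:1], seq], axis=0)
    return np.concatenate([seq, seq[-1:]], axis=0)   # duplicate the last state

def make_negative(seq, rng):
    """One of the negative alterations listed above."""
    choice = rng.integers(3)
    if choice == 0:                                  # reversed ordering
        return seq[::-1].copy()
    if choice == 1:                                  # a single state replicated
        i = rng.integers(len(seq))
        return np.repeat(seq[i:i + 1], len(seq), axis=0)
    return rng.permutation(seq)                      # shuffled ordering
```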

The second method we use to create positive and negative examples is to include data from additional classes of motion. These classes denote different task types. For the humanoid3d environment we generate data for walking-dynamic-speed, running, backflipping and frontflipping. Pairs from the same task are labelled as positive, and pairs from different classes as negative.

7.1 RL Algorithm Analysis

It is not clear which RL algorithm works best for this type of imitation problem. A number of RL algorithms were evaluated on the humanoid2d environment, Figure 9(a). Surprisingly, Trust Region Policy Optimization (TRPO) (Schulman et al., 2015) did not work well in this framework; considering that it has a controlled policy-gradient step, we expected it to reduce the overall variance. We found that DDPG (Lillicrap et al., 2015) worked rather well. This could be related to the changing reward function: if the changing rewards are considered off-policy data, they can be easier to learn from. This can be seen in Figure 9(b), where DDPG is best at estimating the future discounted rewards in the environment. We also tried the Continuous Actor Critic Learning Automaton (CACLA) (Van Hasselt, 2012) and Proximal Policy Optimization (PPO) (Schulman et al., 2017); we found that PPO did not work particularly well on this task, which could also be related to added variance.

(a) Average Reward
(b) Bellman Error
(c) sequence-based comparison DDPG
Figure 9: RL algorithm comparison on humanoid2d environment.