Over the past decade, imitation learning (IL) has been successfully applied to a wide range of domains, including robot learning Englert et al. (2013); Schulman et al. (2013), autonomous navigation Choudhury et al. (2017); Ross et al. (2013), manipulation tasks Nair et al. (2017); Prieur et al. (2012), and self-driving cars Codevilla et al. (2018). Traditionally, IL aims to train an imitator to learn a control policy only from expert demonstrations. The imitator is typically presented with multiple demonstrations at training time, with the goal of distilling them into a policy. To learn effectively and efficiently, a large set of high-quality demonstrations is necessary. This is especially true of current state-of-the-art IL algorithms, such as dataset aggregation (DAgger) Ross et al. (2011) and generative adversarial imitation learning (GAIL) Ho and Ermon (2016). Although these approaches have been the dominant algorithms in IL, a major bottleneck for them is their reliance on high-quality demonstrations, which often require extensive supervision from human experts. In addition, the learned policy tends to overfit to the demonstration data, which prevents it from generalizing beyond them.
To overcome the aforementioned challenges in IL, a number of methods have been investigated to enhance generalizability and data efficiency, or to reduce the degree of human supervision. Initial efforts in this direction were based on the idea of meta learning Duan et al. (2017); Finn et al. (2017); Yu et al. (2018), in which the imitator is trained with a meta-learner and is able to quickly learn a new task from a small set of demonstrations. However, such schemes still require a tremendous amount of time and demonstration data to train the meta-learner, leaving much room for improvement. Thus, a rapidly growing body of literature on the use of forward/inverse dynamics models to learn within an environment in a self-supervised manner Agrawal et al. (2016); Nair et al. (2017); Pathak et al. (2018) has emerged in the past few years. One key advantage of this approach is that it provides an autonomous way of preparing training data, removing the need for human intervention. In this paper, we call it self-supervised IL.
Self-supervised IL allows an imitator to collect training data by itself instead of using predefined extrinsic reward functions or expert supervision during training. It only needs demonstrations during inference, drastically decreasing the time and effort required from human experts. Although the core principles of self-supervised IL are straightforward and have been exploited in many fields Agrawal et al. (2016); Nair et al. (2017); Pathak et al. (2017); Pathak et al. (2018), recent research efforts have been targeted at addressing the challenges of multi-modality and multi-step planning. For example, the use of a forward consistency loss and a forward regularizer has been extensively investigated to enhance the task performance of the imitator Agrawal et al. (2016); Pathak et al. (2018). This becomes especially essential when the lengths of trajectories grow and demonstration samples are sparse, as multiple paths may co-exist to lead the imitator from its initial observation to the goal observation. The issue of multi-step planning has also drawn a lot of attention from researchers, and is usually tackled by recurrent neural networks (RNNs) and step-by-step demonstrations Nair et al. (2017); Pathak et al. (2018).
The above self-supervised IL approaches report promising results; however, most of them are limited in applicability due to several drawbacks. First, traditional methods of data collection are usually inefficient and time-consuming. Inefficient data collection results in poor exploration, which degrades robustness to varying environmental conditions (e.g., noise in motor control) and generalizability to difficult tasks. Second, human bias is often introduced by restricting the data sampling range to specific interesting configurations Agrawal et al. (2016); Nair et al. (2017). Although a more general exploration strategy called curiosity-driven exploration was later proposed in Pathak et al. (2017), it focuses only on exploring states that are novel to the forward dynamics model, rather than those directly influential to the inverse dynamics model. Furthermore, it does not discuss the applicability to continuous control domains, and fails in high-dimensional action spaces according to our experiments in Section 4. Unlike the approaches discussed above, we do not propose to deal with multi-modality or multi-step planning. Instead, we focus our attention on improving the overall quality of the collected samples in the context of self-supervised IL. This motivates us to equip the model with the necessary knowledge to explore the environment in an efficient and effective fashion.
In this paper, we propose a simple but efficient IL scheme, called adversarial exploration strategy, that motivates exploration of an environment in a self-supervised manner (i.e., without any extrinsic reward or human demonstration). Inspired by Pinto et al. (2017); Shioya et al. (2018); Sukhbaatar et al. (2018), we implement our system by jointly training a deep reinforcement learning (DRL) agent and an inverse dynamics model that compete with each other. The former explores the environment to collect training data for the latter, and receives rewards from the latter if the data samples are considered hard. The latter is trained with the training data collected by the former, and only generates rewards when it fails to predict the true actions performed by the former. In such an adversarial setting, the DRL agent is rewarded only for the failure of the inverse dynamics model. Therefore, the DRL agent learns to sample hard examples to maximize the chance that the inverse dynamics model fails. On the other hand, the inverse dynamics model learns to be robust to the hard examples collected by the DRL agent by minimizing the probability of failure. Thus, as the inverse dynamics model becomes stronger, the DRL agent is also incentivized to search for harder examples to obtain rewards. Overly hard examples, however, may lead to very biased exploration and make the learning unstable. To stabilize the learning progress of the inverse dynamics model, we further propose a reward structure such that the DRL agent is encouraged to explore samples that are moderately hard for the inverse dynamics model, but not too hard for it to learn. The self-regulating feedback between the DRL agent and the inverse dynamics model allows them to automatically construct a curriculum for exploration.
We perform extensive experiments to evaluate adversarial exploration strategy on multiple OpenAI gym Brockman et al. (2016) robotic arm and hand manipulation environments simulated by the MuJoCo physics engine Todorov et al. (2012), including FetchReach, FetchPush, FetchPickAndPlace, FetchSlide, and HandReach. Learning to perform these robotic tasks is more practical than learning to perform most of the other OpenAI gym tasks (e.g., Atari games), because only a very limited set of chained actions will result in success. We examine the effectiveness of our method by comparing it against a number of baseline models. The experimental results show that our method is more effective and data-efficient than the baselines in both low- and high-dimensional observation spaces. We also demonstrate that in most of the cases the inverse dynamics model trained by our method is comparable to that directly trained with expert demonstrations in terms of performance. The above observations suggest that our method is superior to the baselines even in the absence of human priors. We further evaluate our method on environments with high-dimensional action spaces, and show that our method is able to achieve higher success rates than the baselines. The contributions of this work are summarized as follows:
We introduce an adversarial exploration strategy for self-supervised IL. It consists of a DRL agent and an inverse dynamics model designed for efficient exploration and data collection.
We employ a competitive scheme for the DRL agent and the inverse dynamics model, enabling them to automatically construct a curriculum for exploration of observation space.
We suggest a reward structure for the proposed scheme to stabilize the training progress.
We validate the proposed method and compare it with the baselines in both low- and high-dimensional state spaces for multiple robotic arm and hand manipulation tasks.
We demonstrate that the proposed method is suitable and effective for environments with high-dimensional action spaces.
The remainder of this paper is organized as follows. Section 2 introduces background materials. Section 3 describes the proposed adversarial exploration strategy in detail. Section 4 reports the experimental results, and provides an in-depth analysis of our method. Section 5 concludes this paper.
In this section, we briefly review DRL, policy gradient methods, and inverse dynamics models.
2.1 Deep Reinforcement Learning and Policy Gradient Methods
DRL trains an agent to interact with an environment $\mathcal{E}$. At each timestep $t$, the agent receives an observation $x_t \in \mathcal{X}$, where $\mathcal{X}$ is the observation space of $\mathcal{E}$. It then takes an action $a_t$ from the action space $\mathcal{A}$ based on its current policy $\pi$, receives a reward $r_t$, and transitions to the next observation $x_{t+1}$. The policy $\pi$ is represented by a deep neural network with parameters $\theta$, and is expressed as $\pi(a \mid x, \theta)$. The goal of the agent is to learn a policy to maximize the discounted sum of rewards $G_t$:
$$G_t = \sum_{\tau=t}^{T} \gamma^{\tau-t} r_\tau, \tag{1}$$
where $\gamma \in (0, 1]$ is the discount factor and $T$ is the horizon.
Policy gradient methods are a class of RL techniques that directly optimize the parameters of a stochastic policy approximator using policy gradients. Although these methods have achieved remarkable success in a variety of domains, the high variance of gradient estimates has been a major challenge. Trust region policy optimization (TRPO) Schulman et al. (2015) circumvented this problem by applying a trust-region constraint to the scale of policy updates. However, TRPO is a second-order algorithm, which is relatively complicated, and is not compatible with architectures that include noise or parameter sharing Schulman et al. (2017). In this paper, we employ a more recent policy gradient method, called proximal policy optimization (PPO) Schulman et al. (2017). PPO is an approximation to TRPO, which similarly prevents large changes to the policy between updates, but requires only first-order optimization. Compared to TRPO, PPO is more general, and empirically has better sample complexity while retaining the stability and reliability of TRPO. (For more details on PPO, please refer to our supplementary material.)
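As a minimal sketch of the discounted sum of rewards described in Section 2.1, the return at every timestep of a finite episode can be computed in a single backward pass. The function name `discounted_returns` and the example reward values are illustrative assumptions, not part of the original implementation.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return G_t = sum_{k >= t} gamma^(k - t) * r_k for every timestep t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that each G_t reuses the already computed G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Hypothetical three-step episode with a reward of 1 at every step.
    print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))
```

The backward recursion $G_t = r_t + \gamma G_{t+1}$ avoids recomputing the full sum for each timestep.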
2.2 Inverse Dynamics Model
An inverse dynamics model $I$ takes as input a pair of observations $(x_t, x_{t+1})$, and predicts the action $\hat{a}_t$ required to reach the next observation $x_{t+1}$ from the current observation $x_t$. It is usually expressed as:
$$\hat{a}_t = I(x_t, x_{t+1} \mid \theta_I), \tag{2}$$
where $(x_t, x_{t+1})$ are sampled from the collected data, and $\theta_I$ represents the trainable parameters of the inverse dynamics model. At training time, $\theta_I$ is iteratively updated to minimize the loss function:
$$L_I(a_t, \hat{a}_t \mid \theta_I) = d(a_t, \hat{a}_t), \tag{3}$$
where $d$ is a distance metric, and $a_t$ the ground truth action. During testing, a sequence of observations $\{\hat{x}_0, \hat{x}_1, \dots, \hat{x}_T\}$ is captured from an expert demonstration. A pair of observations $(\hat{x}_t, \hat{x}_{t+1})$ is fed into the inverse dynamics model at timestep $t$. Starting from $\hat{x}_0$, the objective of the inverse dynamics model is to predict a sequence of actions $\{\hat{a}_0, \dots, \hat{a}_{T-1}\}$ that reaches the final observation $\hat{x}_T$ as closely as possible.
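The test-time procedure above can be sketched as a simple rollout loop. The code below is only an illustration under stated assumptions: the one-dimensional toy dynamics (`toy_step`), the hand-written `toy_inverse_model`, and the choice to pair the imitator's own current observation with the next expert observation are all hypothetical and stand in for a learned model and a real environment.

```python
from typing import Callable, List, Tuple


def rollout_with_inverse_model(
    inverse_model: Callable[[float, float], float],
    step: Callable[[float, float], float],
    x0: float,
    expert_obs: List[float],
) -> Tuple[List[float], float]:
    """Follow an observation-only demonstration with an inverse dynamics model."""
    x = x0
    actions: List[float] = []
    # The first expert observation is the start; each later one is a sub-goal.
    for target in expert_obs[1:]:
        action = inverse_model(x, target)  # predicted action to reach the sub-goal
        x = step(x, action)                # environment transition
        actions.append(action)
    return actions, x


def toy_step(x: float, action: float) -> float:
    """Hypothetical dynamics: the action is added to the observation."""
    return x + action


def toy_inverse_model(x: float, x_next: float) -> float:
    """Exact inverse of toy_step, standing in for a trained model."""
    return x_next - x


if __name__ == "__main__":
    demo = [0.0, 1.0, 3.0, 2.0]
    print(rollout_with_inverse_model(toy_inverse_model, toy_step, demo[0], demo))
```

With a perfect inverse model the rollout reproduces the demonstration exactly; with a learned model, the gap between the final observation and the last expert observation gives a natural success criterion.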
In this section, we first describe the proposed adversarial exploration strategy. We then explain the training methodology in detail. Finally, we discuss a technique for stabilizing the training progress.
3.1 Adversarial Exploration Strategy
Fig. S1 (presented in our supplementary material) shows a framework that illustrates the proposed adversarial exploration strategy, which includes a DRL agent $P$ and an inverse dynamics model $I$. Assume that $\Phi_\pi : \{x_0, a_0, x_1, a_1, \dots, x_T\}$ is the sequence of observations and actions generated by $P$ as it explores $\mathcal{E}$ using a policy $\pi$. At each timestep $t$, $P$ collects a 3-tuple training sample $(x_t, a_t, x_{t+1})$ for $I$, while $I$ predicts an action $\hat{a}_t$ and generates a reward $r_t$ for $P$. In this work, $I$ is modified from Eq. 2 to include an additional hidden vector $h_t$, which recurrently encodes the information of the past observations. $I$ is thus formulated as:
$$\hat{a}_t = I(x_t, x_{t+1} \mid h_t, \theta_I), \tag{4}$$
where $\theta_I$ denotes the trainable parameters of $I$. $\theta_I$ is iteratively updated to minimize $L_I$ as follows:
$$\underset{\theta_I}{\text{minimize}} \; L_I(a_t, \hat{a}_t \mid \theta_I) = \underset{\theta_I}{\text{minimize}} \; \beta \, \| a_t - \hat{a}_t \|^2, \tag{5}$$
where $\beta$ is a scaling constant. We use mean squared error as the distance metric $d$, since we only consider continuous control domains in this paper. It can be replaced with a cross-entropy loss for discrete control tasks. We directly use $L_I$ as the reward $r_t$ for $P$, which is formulated as:
$$r_t = L_I(a_t, \hat{a}_t \mid \theta_I) = \beta \, \| a_t - I(x_t, x_{t+1} \mid h_t, \theta_I) \|^2. \tag{6}$$
Our method aims to improve both the quality and efficiency of the data collection process performed by $P$, as well as the performance of $I$. Therefore, the goal of the proposed framework is twofold. First, $P$ has to learn an adversarial policy $\pi_A$ such that its cumulative discounted reward $G_t$ is maximized. Second, $I$ needs to learn an optimal $\theta_I$ such that Eq. 6 is minimized. Minimizing $L_I$ (i.e., $r_t$) leads to a decreased $G_t$, forcing $P$ to enhance $\pi_A$ to explore more difficult samples to increase $G_t$. This implies that $P$ is motivated to focus on $I$'s weak points, instead of randomly collecting ineffective training samples. Training $I$ with hard samples not only accelerates its learning progress, but also helps to boost its performance.
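The coupling between the two models can be illustrated with a short sketch in which the reward handed to the exploring agent is simply the inverse dynamics model's prediction error on the collected sample. The function names and the use of a scaled squared error (with scaling constant `beta`) are assumptions made for illustration; a per-dimension mean would only change the constant.

```python
from typing import Sequence


def inverse_model_loss(
    action: Sequence[float], predicted: Sequence[float], beta: float = 1.0
) -> float:
    """Scaled squared error between the executed and the predicted action."""
    if len(action) != len(predicted):
        raise ValueError("action and prediction must have the same dimension")
    return beta * sum((a - p) ** 2 for a, p in zip(action, predicted))


def adversarial_reward(
    action: Sequence[float], predicted: Sequence[float], beta: float = 1.0
) -> float:
    """Reward for the exploring agent: the inverse model's loss on this sample."""
    return inverse_model_loss(action, predicted, beta)


if __name__ == "__main__":
    # A well-predicted sample earns a small reward, a badly predicted one a large reward.
    print(adversarial_reward([0.5, 0.5], [0.5, 0.4]))
    print(adversarial_reward([0.5, 0.5], [-0.5, 0.0]))
```

Because the same quantity is minimized by one model and maximized (in expectation) by the other, samples that the inverse dynamics model already predicts well stop being rewarding for the agent.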
3.2 Training Methodology
We describe the training methodology of our adversarial exploration strategy with the pseudocode presented in Algorithm 1. Assume that $P$'s policy $\pi_A$ is parameterized by a set of trainable parameters $\theta_P$, and is represented as $\pi_A(a_t \mid x_t, \theta_P)$. We create two buffers $Z_P$ and $Z_I$ for storing the training samples of $P$ and $I$, respectively. In the beginning, $Z_P$, $Z_I$, $\mathcal{E}$, $\theta_P$, $\theta_I$, $\pi_A$, as well as a cumulative timestep counter $c$ are initialized. A number of hyperparameters are set to appropriate values, including the number of iterations $N_{\text{iter}}$, the number of episodes $N_{\text{episode}}$, the horizon $T$, as well as the update period $T_P$ of $\theta_P$. At each timestep $t$, $P$ perceives the current observation $x_t$ from $\mathcal{E}$, takes an action $a_t$ according to $\pi_A(a_t \mid x_t, \theta_P)$, and receives the next observation $x_{t+1}$ and a termination indicator $\xi$ (lines 9 to 11). $\xi$ is set to 1 only when $t$ equals $T$; otherwise it is set to 0. We then store $(x_t, a_t, x_{t+1}, \xi)$ and $(x_t, a_t, x_{t+1})$ in $Z_P$ and $Z_I$, respectively. We update $\theta_P$ every $T_P$ timesteps using the samples stored in $Z_P$ (lines 13 to 21). At the end of each episode, we update $\theta_I$ with samples drawn from $Z_I$ according to the loss function $L_I$ defined in Eq. 5 (line 23).
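The control flow described above can be sketched as follows. This is a toy illustration only, not the authors' Algorithm 1: `ToyEnv`, `LinearInverseModel`, and `RandomAgent` are hypothetical stand-ins, the agent's `update` is a placeholder where a PPO update would go, and computing the rewards from the inverse model's loss at agent-update time is an assumption.

```python
import random
from typing import List, Tuple


class ToyEnv:
    """Hypothetical 1-D environment: next observation = observation + action."""
    def reset(self) -> float:
        self.x = 0.0
        return self.x

    def step(self, action: float) -> float:
        self.x += action
        return self.x


class LinearInverseModel:
    """Predicts a_hat = w * (x_next - x); the toy dynamics correspond to w = 1."""
    def __init__(self) -> None:
        self.w = 0.0

    def loss(self, x: float, a: float, x_next: float) -> float:
        return (a - self.w * (x_next - x)) ** 2

    def update(self, batch: List[Tuple[float, float, float]], lr: float = 0.05) -> None:
        for x, a, x_next in batch:
            d = x_next - x
            # Gradient step on (a - w * d)^2 with respect to w.
            self.w += lr * 2.0 * d * (a - self.w * d)


class RandomAgent:
    """Stand-in for the DRL agent; update() only counts calls (no learning)."""
    def __init__(self, rng: random.Random) -> None:
        self.rng = rng
        self.num_updates = 0

    def act(self, x: float) -> float:
        return self.rng.uniform(-1.0, 1.0)

    def update(self, batch: list) -> None:
        self.num_updates += 1  # a PPO update on (x, a, x_next, reward, done) would go here


def train(iterations=20, episodes=5, horizon=10, update_period=25, seed=0):
    rng = random.Random(seed)
    env, model, agent = ToyEnv(), LinearInverseModel(), RandomAgent(rng)
    agent_buffer, model_buffer, counter = [], [], 0
    for _ in range(iterations):
        for _ in range(episodes):
            x = env.reset()
            for t in range(horizon):
                a = agent.act(x)
                x_next = env.step(a)
                done = 1 if t == horizon - 1 else 0
                agent_buffer.append((x, a, x_next, done))
                model_buffer.append((x, a, x_next))
                counter += 1
                if counter % update_period == 0:
                    # Reward each stored transition with the inverse model's current loss.
                    batch = [(xi, ai, xn, model.loss(xi, ai, xn), di)
                             for xi, ai, xn, di in agent_buffer]
                    agent.update(batch)
                    agent_buffer.clear()
                x = x_next
            # End of episode: train the inverse model on samples from its buffer.
            model.update(rng.sample(model_buffer, min(32, len(model_buffer))))
    return model, agent


if __name__ == "__main__":
    trained_model, trained_agent = train()
    print(trained_model.w, trained_agent.num_updates)
```

The two buffers are kept separate because the agent's buffer is consumed and cleared at each agent update, whereas the inverse dynamics model keeps sampling from all data collected so far.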
3.3 Stabilization Technique
Although the adversarial exploration strategy is effective in collecting hard samples, it could be problematic if $P$ becomes too strong such that the collected samples are too difficult for $I$ to learn. Overly difficult samples result in a large variance in gradients derived from $L_I$, which in turn leads to a performance drop and instability in the learning progress. We analyze this phenomenon in greater detail in Section 4.5. To tackle the issue, we propose a training technique that reshapes $r_t$ as follows:
$$r_t := -\left| L_I(a_t, \hat{a}_t \mid \theta_I) - \delta \right| + \delta, \tag{7}$$
where $\delta$ is a pre-defined threshold value. This technique poses a restriction on the range of $r_t$, driving $P$ to gather moderate samples instead of overly hard ones. Note that the value of $\delta$ affects the learning speed and final performance. We illustrate the impact of $\delta$ on the learning curve of $I$ in Section 4.5.
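One reward shaping that is consistent with the description above is $r = \delta - |L_I - \delta|$, which peaks when the inverse dynamics model's loss equals the threshold and decreases as the loss moves away from it in either direction. The sketch below assumes this form; the function name and the example threshold are illustrative.

```python
def stabilized_reward(loss: float, delta: float) -> float:
    """Reshaped reward: largest (equal to delta) when the loss equals delta."""
    return -abs(loss - delta) + delta


if __name__ == "__main__":
    delta = 1.5  # hypothetical threshold
    for loss in (0.0, 0.75, 1.5, 3.0, 4.5):
        print(loss, stabilized_reward(loss, delta))
```

Under this shaping, samples with a loss above $2\delta$ receive a negative reward, so the agent is discouraged from seeking samples that are far too hard for the inverse dynamics model.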
In this section, we present experimental results for a series of robotic tasks, and validate that (i) our method is effective in both low- and high-dimensional observation spaces; (ii) our method is effective in environments with high-dimensional action spaces; (iii) our method is more data-efficient than the baseline models; and (iv) our method is robust against action space noise. We first introduce our experimental setup. Then, we report results for both robotic arm and hand manipulation tasks. Finally, we present a comprehensive set of ablative analyses to justify each of our design choices.
4.1 Experimental Setup
We first describe the environments and tasks. Next, we explain the evaluation procedure and the method for collecting expert demonstrations. We then go through the baselines used for comparison.
4.1.1 Environments and Tasks
We use OpenAI gym Brockman et al. (2016) environments simulated by the MuJoCo Todorov et al. (2012) physics engine, and evaluate our method on a number of robotic arm and hand manipulation tasks. We use the Fetch robot and the Shadow Dexterous Hand Plappert et al. (2018a) for the arm and hand manipulation tasks, respectively. For the arm manipulation tasks, which include FetchReach, FetchPush, FetchPickAndPlace, and FetchSlide, the imitator (i.e., the inverse dynamics model) takes as input the positions and velocities of a gripper and an object. It then computes the gripper's action in 3-dimensional space to manipulate the object. For the hand manipulation task HandReach, the imitator takes as input the positions and velocities of the fingers of a robotic hand, and computes the velocity of each joint to achieve the goal. In addition to low-dimensional observations (i.e., position, velocity, and gripper state), we also perform experiments for the above tasks using visual observations (i.e., high-dimensional observations) in the form of camera images taken from a third-person perspective. A detailed description of the above tasks is given in Plappert et al. (2018a). For the detailed configurations of these tasks, please refer to our supplementary material.
4.1.2 Evaluation Procedure
All of our experimental results are evaluated and averaged over 20 trials, corresponding to 20 different random initial seeds. In each trial, we train an imitator with the training data collected by its self-supervised data collection strategy. Please note that imitators implemented by different methods have different data collection strategies. We evaluate the imitator-under-test every 10K timesteps. The evaluation is performed by measuring the success rate over 500 episodes. At the beginning of each episode, the imitator receives a sequence of observations $\{\hat{x}_0, \hat{x}_1, \dots, \hat{x}_T\}$ from a successful expert demonstration. At each timestep $t$, the imitator infers an action $\hat{a}_t$ needed to reach the expert observation $\hat{x}_{t+1}$ from its current observation $x_t$. For a fair comparison, all imitators have the same model architecture, and are trained with the same amount of training data. The detailed configurations of the hyperparameters are summarized and discussed in the supplementary material.
4.1.3 Collection of Expert Demonstration
For each task mentioned in Section 4.1.1, we first randomly configure task-relevant settings (e.g., goal position, initial state, etc.). We then collect non-trivial and successful episodes generated by a pre-trained expert agent Andrychowicz et al. (2017). It should be noted that the collected demonstration data only contain observations. The interested reader is referred to our supplementary material for the implementation detail of the pre-trained expert agent, and the methodology we employed to filter out trivial episodes.
4.1.4 Baseline Methods
We compare our proposed methodology against the following four baseline models in our experiments.
Random: This method collects training samples by random exploration.
Demo: This method has the imitator trained directly with expert demonstrations. It provides the performance upper bound, since training data is the same as testing data.
Curiosity: This method trains a DRL agent via curiosity Pathak et al. (2017); Pathak et al. (2018) to collect training samples. Unlike the original implementation in Pathak et al. (2017), we replace its DRL algorithm with PPO, as training should be done on a single thread for a fair comparison with the other baseline methods. We believe it to be an important baseline due to its proven effectiveness in Pathak et al. (2018).
Noise: In this method, noise is injected into the parameter space of a DRL agent to encourage exploration Plappert et al. (2018b). Please note that the exploratory behavior relies entirely on the parameter space noise, without the use of any extrinsic reward. This method is included for comparison because of its superior performance and data efficiency in many DRL tasks.
4.2 Performance Comparison in Robotic Arm Manipulation Tasks
We compare the performance of the proposed method and the baselines on the robotic arm manipulation tasks described in Section 4.1.1. As opposed to discrete control domains, these tasks are especially challenging, as the sample complexity grows in continuous control domains. Furthermore, the imitator may not have the complete picture of the environment dynamics, increasing its difficulty to learn an inverse dynamics model. In FetchSlide, for instance, the movement of the object on the slippery surface is affected by both friction and the force exerted by the gripper. It thus motivates us to investigate whether the proposed method can help overcome the challenge. In the subsequent paragraphs, we discuss the experimental results in both low- and high-dimensional observation spaces. All of the experimental results are obtained by following the procedure described in Section 4.1.2.
Fig. 1 plots the learning curves for all of the methods in low-dimensional observation spaces. In all of the tasks, our method yields superior or comparable performance to the baselines except for Demo, which is trained directly with expert demonstrations. In FetchReach, it can be seen that every method achieves a success rate of 1.0. This implies that a sophisticated exploration strategy is not required to learn an inverse dynamics model in an environment where the dynamics are relatively simple. It should be noted that although all methods reach the same final success rate, ours learns significantly faster than Demo. In contrast, in FetchPush, our method is comparable to Demo, and demonstrates superior performance to the other baselines. Our method also learns drastically faster than all the other baselines, which confirms that the proposed strategy does improve the performance and efficiency of self-supervised IL. Our method is particularly effective in tasks that require an accurate inverse dynamics model. In FetchPickAndPlace, for example, our method surpasses all the other baselines. However, all methods including Demo fail to learn a successful inverse dynamics model in FetchSlide, which suggests that it is difficult to train an imitator when the outcome of an action is not completely dependent on the action itself. It is worth noting that Curiosity loses to Random in FetchPush and FetchSlide, and Noise performs even worse than these two methods in all of the tasks. We therefore conclude that Curiosity is not suitable for continuous control tasks, and that the parameter space noise strategy cannot be directly applied to self-supervised IL. In addition to the quantitative results presented above, we further discuss the empirical results qualitatively. Please refer to our supplementary material for the qualitative results.
The learning curves of all methods in high-dimensional observation spaces are illustrated in Fig. 2. It can be seen that our method performs significantly better than the other baseline methods in most of the tasks, and is comparable to Demo. In FetchPickAndPlace, ours is the only method that learns a successful inverse dynamics model. Similar to the results in low-dimensional settings, Curiosity is no better than Random in high-dimensional observation spaces. Note that we do not include the Noise baseline here because it already performs poorly in low-dimensional settings.
4.3 Performance Comparison in Robotic Hand Manipulation Task
Fig. 1 plots the learning curves for each of the methods considered. Please note that Curiosity, Noise, and our method are pre-trained with 30K samples collected by random exploration, as we observe that these methods on their own suffer from large errors in the early stage of training, which prevents them from learning at all. After the first 30K samples, they are trained with data collected by their respective strategies instead. From the results in Fig. 1, it can be seen that Demo easily stands out from the other methods as the best-performing model, surpassing them all by a considerable margin. Although our method is not as impressive as Demo, it significantly outperforms all of the other methods, reaching a success rate of 0.4 while the others are stuck at around 0.2.
4.4 Robustness to Noisy Action
We benchmark our method in an environment with noisy actions to validate its robustness. In this environment, Gaussian noise is injected into every action taken by the imitator, which results in unaligned data. Note that we only inject noise in the training phase, as we aim to benchmark the robustness of the data collection strategy. The scale of the injected noise can be found in the supplementary material. In Table 1, we report the performance drop rate for each method in all tasks. The performance drop rate is defined as $\frac{Pc_{\text{orig}} - Pc_{\text{noise}}}{Pc_{\text{orig}}}$, where $Pc_{\text{orig}}$ and $Pc_{\text{noise}}$ are the performance under the original setting and the action noise setting, respectively, and the performance is measured by the highest success rate during training. From Table 1, it can be seen that our method has the lowest performance drop rate in most of the tasks, which indicates that our method is robust to noisy actions. Please also note that although Curiosity and Noise also achieve a drop rate of 0% in HandReach and FetchSlide, we do not consider them robust due to their poor performance in the original environment (Fig. 1). Interestingly, our method actually demonstrates an increase in performance in FetchPush and HandReach, but we leave the investigation of this phenomenon for future work. To conclude, we find that the proposed method is more robust to unaligned data than the other baselines, making it a more practical choice in a real-world setting.
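As a small worked example of the performance drop rate, the sketch below assumes it is the relative decrease of the best success rate when action noise is introduced, so a negative value corresponds to an increase in performance. The function name and the success rates used are hypothetical and are not taken from Table 1.

```python
def performance_drop_rate(perf_orig: float, perf_noise: float) -> float:
    """Relative drop of the best success rate when action noise is added."""
    if perf_orig <= 0.0:
        raise ValueError("the original performance must be positive")
    return (perf_orig - perf_noise) / perf_orig


if __name__ == "__main__":
    # Hypothetical success rates, for illustration only.
    print(performance_drop_rate(0.8, 0.6))  # performance dropped
    print(performance_drop_rate(0.4, 0.5))  # performance increased (negative rate)
```

Because the rate is normalized by the original performance, a method that barely succeeds in the original setting can show a drop rate of 0% without being robust in any useful sense.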
4.5 Ablative Analysis
We further investigate the effectiveness of our method by a detailed analysis of the collected data, the stabilization technique, and the influence of $\delta$.
Training error distribution.
We plot the distribution of $L_I$ (Eq. 5) of the first 2K collected samples during the training phase in Fig. 3, where Ours(w stab) and Ours(w/o stab) denote our method with and without the use of the stabilization technique. The vertical axis corresponds to the number of samples, while the horizontal axis corresponds to $L_I$. The curves in Fig. 3 are smoothed by kernel density estimation. It can be seen that both Ours(w stab) and Ours(w/o stab) concentrate on notably higher values than Random. This indicates that the adversarial exploration strategy does help collect hard samples for the inverse dynamics model.
Effectiveness of stabilization.
From Fig. 3, it can be observed that Ours(w stab) has a lower mean loss than Ours(w/o stab), which implies that the stabilization technique successfully guides the DRL agent to favor moderately hard samples. We also observe that the center of the loss distribution for Ours(w stab) is close to the value of $\delta$, as shown in Fig. 3, confirming that our reward structure guides data collection by $P$. To further demonstrate the effectiveness of the stabilization technique, we plot the learning curves of Ours(w stab) and Ours(w/o stab) in Fig. 4. Although Ours(w/o stab) is comparable to Ours(w stab) for the initial 10K samples, it suffers from a significant degradation in performance for the rest of the training progress. This result indicates that the stabilization technique does improve the overall performance of our method.
Influence of $\delta$.
Fig. 5 plots the learning curves of our method for different values of $\delta$. From the experimental results, we observe that Ours(0.100) and Ours(3.000) perform comparably, which means that the choice of $\delta$ has little influence on the model’s performance.
From the analyses presented above, we conclude that the adversarial exploration strategy is effective in improving the overall quality of the collected data. Furthermore, the proposed stabilization technique is not sensitive to the choice of $\delta$, and guides data collection towards moderately hard samples, which assists in learning a better inverse dynamics model.
In this paper, we present an adversarial exploration strategy for self-supervised IL that consists of a DRL agent and an inverse dynamics model competing with each other. Through our experimental results, we demonstrate that our method improves the efficiency of data collection and further boosts the overall performance of the self-supervised IL imitator in several robotic tasks. In addition, we show that our method is more robust to action noise. To conclude, our method achieves a significant improvement over the other baselines in terms of performance and efficiency.
6 Framework of adversarial exploration strategy
7 Qualitative Analysis of Robotic Arm Manipulation Tasks
In addition to the quantitative results presented above, we further discuss the empirical results qualitatively. Through visualizing the training progress, we observe that our method initially acts like Random, but later focuses on interacting with the object in FetchPush, FetchSlide, and FetchPickAndPlace. This phenomenon indicates that the adversarial exploration strategy naturally gives rise to a curriculum that improves the learning efficiency, which resembles curriculum learning Bengio et al. (2009). Another benefit that comes with this phenomenon is that data collection is biased towards interactions with the object. Therefore, the DRL agent concentrates on collecting interesting samples that have greater significance, rather than trivial ones. For instance, the agent prefers pushing the object to swinging the robotic arm. On the other hand, although Curiosity explores the environment very thoroughly in the beginning by stretching the arm into numerous different poses, it quickly overfits to one specific pose. This keeps the error of its forward dynamics model low, making it less curious about its surroundings. Finally, we observe that the exploratory behavior of Noise does not change as frequently as that of ours, Random, and Curiosity. We believe that the method’s success in the original paper Plappert et al. (2018b) is largely due to extrinsic rewards. In the absence of extrinsic rewards, however, the method becomes less effective and unsuitable for data collection, especially in self-supervised IL.
8 Proximal Policy Optimization (PPO)
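We employ PPO Schulman et al. (2017) as the RL agent responsible for collecting training samples because of its ease of use and good performance. PPO computes an update at every timestep that minimizes the cost function while ensuring the deviation from the previous policy is relatively small. One of the two main variants of PPO is a clipped surrogate objective expressed as:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \; \text{clip}\left(r_t(\theta), 1 - \epsilon, 1 + \epsilon\right) \hat{A}_t \right) \right],$$
where $r_t(\theta) = \pi_\theta(a_t \mid x_t) / \pi_{\theta_{\text{old}}}(a_t \mid x_t)$ is the probability ratio, $\hat{A}_t$ is the advantage estimate, and $\epsilon$ a hyperparameter. The clipped probability ratio is used to prevent large changes to the policy between updates. The other variant employs an adaptive penalty on KL divergence, given by:
$$L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t \left[ r_t(\theta) \hat{A}_t - \beta \, \text{KL}\left[ \pi_{\theta_{\text{old}}}(\cdot \mid x_t), \pi_\theta(\cdot \mid x_t) \right] \right],$$
where $\beta$ is an adaptive coefficient adjusted according to the observed change in the KL divergence. In this work, we employ the former objective due to its better empirical performance.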
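To make the clipped surrogate objective concrete, the following sketch evaluates it over a small batch of probability ratios and advantage estimates in plain Python. The function name, the default clipping range of 0.2, and the example values are illustrative assumptions; an actual PPO implementation would compute this quantity on tensors and differentiate through it.

```python
from typing import Sequence


def clipped_surrogate(
    ratios: Sequence[float], advantages: Sequence[float], epsilon: float = 0.2
) -> float:
    """Batch average of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    if len(ratios) != len(advantages) or not ratios:
        raise ValueError("ratios and advantages must be non-empty and of equal length")
    total = 0.0
    for ratio, advantage in zip(ratios, advantages):
        clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # Taking the minimum makes the objective a pessimistic bound on the
        # unclipped surrogate.
        total += min(ratio * advantage, clipped_ratio * advantage)
    return total / len(ratios)


if __name__ == "__main__":
    # A ratio of 1.5 with a positive advantage is clipped to 1.2.
    print(clipped_surrogate([1.5], [1.0]))
    # A ratio of 0.5 with a negative advantage is clipped to 0.8.
    print(clipped_surrogate([0.5], [-1.0]))
```

In both examples the clipping removes the incentive to move the ratio further away from 1, which is how the objective limits the size of each policy update.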
9 Implementation Details of Inverse Dynamics Model
In the experiments, the inverse dynamics models of all methods employ the same network architecture. For the low-dimensional observation setting, we use 3 fully-connected (FC) layers with 256 hidden units followed by activation units. For the high-dimensional observation setting, we use a 3-layer convolutional neural network (CNN) followed by activation units. The convolutional layers are configured as (32, 8, 4), (64, 4, 2), and (64, 3, 1), with each element in the 3-tuple denoting the number of output features, the width/height of the filter, and the stride. The features extracted by the stacked convolutional layers are then fed forward to an FC layer with 512 hidden units followed by activation units.
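The convolutional configuration above determines the size of the feature vector passed to the 512-unit FC layer. The sketch below computes it under two assumptions that are not stated in the text: the convolutions are unpadded, and the input is a square image, with 84 pixels per side used purely as an example.

```python
from typing import List, Tuple

# (output features, filter width/height, stride), as listed in the text.
CONV_LAYERS: List[Tuple[int, int, int]] = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]


def conv_output_size(size: int, kernel: int, stride: int) -> int:
    """Spatial size after one unpadded convolution."""
    return (size - kernel) // stride + 1


def flattened_features(image_size: int, layers: List[Tuple[int, int, int]] = CONV_LAYERS) -> int:
    """Number of features after the stacked convolutions, flattened."""
    size = image_size
    channels = 0
    for channels, kernel, stride in layers:
        size = conv_output_size(size, kernel, stride)
    return channels * size * size


if __name__ == "__main__":
    # 84 x 84 is an assumed input resolution, not one given in the paper.
    print(flattened_features(84))
```

For the assumed 84 x 84 input, the spatial size shrinks from 84 to 20, 9, and finally 7, so the FC layer would receive 64 x 7 x 7 features.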
10 Implementation Details of Adversarial Exploration Strategy
For both low- and high-dimensional observation settings, we use the architecture proposed in Schulman et al. (2017). During training, we periodically update the DRL agent with a batch of transitions as described in Algorithm 1. We split the batch into several mini-batches, and update the RL agent with these mini-batches iteratively. The hyperparameters are listed in Table 2 (Our method).
11 Implementation Details of Curiosity
Our baseline Curiosity is implemented based on the work of Pathak et al. (2018), who propose to employ a curiosity-driven RL agent Pathak et al. (2017) to improve the efficiency of data collection. The curiosity-driven RL agent takes curiosity as an intrinsic reward signal, where curiosity is formulated as the error in the agent’s ability to predict the consequences of its own actions. The prediction is made by a forward dynamics model, written in the notation of Pathak et al. (2017) as:
$$\hat{\phi}(s_{t+1}) = f\left(\phi(s_t), a_t; \theta_F\right),$$
where $\hat{\phi}(s_{t+1})$ is the predicted feature encoding at the next timestep, $\phi(s_t)$ the feature vector at the current timestep, $a_t$ the action executed at the current timestep, and $\theta_F$ the parameters of the forward model $f$. The network parameters $\theta_F$ are optimized by minimizing the loss function $L_F$:
$$L_F\left(\phi(s_{t+1}), \hat{\phi}(s_{t+1})\right) = \frac{1}{2}\left\lVert \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \right\rVert_2^2.$$
For both the low- and high-dimensional observation settings, we use the architecture proposed in Schulman et al. (2017). The implementation of depends on the model architecture of the RL agent. For the low-dimensional observation setting, we implement with the architecture of low-dimensional observation PPO. Note that does not share parameters with the RL agent in this case. For the high-dimensional observation setting, we share the features extracted by the CNNs of the RL agent, then feed these features to which consists of an FC layer with 512 hidden units followed by activation. The hyperparameter settings can be found in Table 2 (Curiosity).
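The prediction-error signal used by Curiosity can be sketched as below. This is a minimal illustration that assumes the half squared L2 error between predicted and actual next-step features, following Pathak et al. (2017); the function name is hypothetical, and feature vectors are represented as plain lists of floats.

```python
def forward_model_error(predicted_features, next_features):
    """Half squared L2 distance between predicted and actual next-step features.

    A larger value means the forward dynamics model predicted the outcome of
    the action poorly, which a curiosity-driven agent treats as a higher
    intrinsic reward.
    """
    if len(predicted_features) != len(next_features):
        raise ValueError("feature vectors must have the same length")
    return 0.5 * sum(
        (predicted - actual) ** 2
        for predicted, actual in zip(predicted_features, next_features)
    )
```

When the forward model predicts the next features perfectly, the error and hence the intrinsic reward are zero, which is consistent with the loss of curiosity observed once the agent settles on one specific pose.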
12 Implementation Details of Noise
Table 2: Hyperparameter settings.

| Group | Hyperparameter | Value |
|---|---|---|
| Common | Batch size for inverse dynamics model update | 64 |
| Common | Learning rate of inverse dynamics model | 1e-3 |
| Common | Timesteps per episode | 50 |
| Common | Optimizer for inverse dynamics model | Adam |
| Our method | Number of batches for updating the inverse dynamics model | 25 |
| Our method | Batch size for RL agent | 2050 |
| Our method | Mini-batch size for RL agent | 50 |
| Our method | Number of training iterations | 200 |
| Our method | Number of training episodes per iteration | 10 |
| Our method | Horizon of RL agent | 50 |
| Our method | Update period of RL agent | 2050 |
| Our method | Learning rate of RL agent | 1e-3 |
| Our method | Optimizer for RL agent | Adam |
| Curiosity | Number of batches for updating the inverse dynamics model | 500 |
| Curiosity | Batch size for RL agent | 2050 |
| Curiosity | Mini-batch size for RL agent | 50 |
| Curiosity | Number of training iterations | 10 |
| Curiosity | Number of training episodes per iteration | 200 |
| Curiosity | Horizon of RL agent | 50 |
| Curiosity | Update period of RL agent | 2050 |
| Curiosity | Learning rate of RL agent | 1e-3 |
| Curiosity | Optimizer for RL agent | Adam |
| Noise | Number of batches for updating the inverse dynamics model | 500 |
| Noise | The other hyperparameters | Same as Plappert et al. (2018b) |
13 Implementation Details of Demo
We collect 1000 episodes of expert demonstrations using the procedure defined in Sec. S8 for training Demo. Each episode lasts 50 timesteps. The demonstration data is in the form of 3-tuples, each consisting of the current observation, the action, and the next observation. The pseudocode for training Demo is shown in Algorithm S1 below. In each training iteration, we randomly sample 200 episodes, i.e., 10k transitions (line 4). The sampled data is then used to update the inverse dynamics model (line 5).
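The sampling step of each Demo training iteration can be sketched as follows. This is an illustrative sketch, not Algorithm S1 itself: the function name is hypothetical, episodes are assumed to be stored as lists of (observation, action, next observation) tuples, and sampling without replacement is an assumption.

```python
import random


def sample_demo_transitions(episodes, num_episodes=200, rng=None):
    """Sample whole episodes and flatten them into one list of transitions.

    episodes: list of episodes, each a list of
              (observation, action, next_observation) tuples.
    Sampling without replacement is an assumption of this sketch.
    """
    rng = rng or random.Random()
    sampled_episodes = rng.sample(episodes, num_episodes)
    return [transition for episode in sampled_episodes for transition in episode]
```

With 1000 stored episodes of 50 timesteps each, sampling 200 episodes returns 200 * 50 = 10,000 transitions, matching the 10k transitions used per iteration.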
14 Configuration of Environments
We briefly explain the configuration of each environment below. For a detailed description, please refer to Plappert et al. (2018a).
FetchReach: Control the gripper to reach a goal position in 3D space. The imitator can fully comprehend the environment dynamics.
FetchPush: Control the Fetch robot to push the object to a target position. The imitator cannot fully comprehend the environment as the movement of the gripper may not affect the object.
FetchPickAndPlace: Control the gripper to grasp and lift the object to a goal position. In addition to the imitator not having the complete picture of the environment dynamics, this task requires a more accurate inverse dynamics model.
FetchSlide: Control the robot to slide the object to a goal position. The task requires an even more accurate inverse dynamics model, as the object’s movement on the slippery surface is hard to predict.
HandReach: Control the Shadow Dexterous Hand to reach a goal hand pose. The task is especially challenging due to its high-dimensional action space.
15 Setup of Expert Demonstration
We employ Deep Deterministic Policy Gradient combined with Hindsight Experience Replay (DDPG-HER) Andrychowicz et al. (2017) as the expert agent. For training and evaluation, we run the expert to collect transitions for 1000 and 500 episodes, respectively. To prevent the imitator from succeeding in the task without taking any action, we only collect successful and non-trivial episodes generated by the expert agent. Trivial episodes are filtered out based on the following task-specific schemes:
FetchReach: An episode is considered trivial if the distance between the goal position and the initial position is smaller than 0.2.
FetchPush: An episode is considered trivial if the distance between the goal position and the object position is smaller than 0.2.
FetchSlide: An episode is considered trivial if the distance between the goal position and the object position is smaller than 0.1.
FetchPickAndPlace: An episode is considered trivial if the distance between the goal position and the object position is smaller than 0.2.
HandReach: We do not filter out trivial episodes as this task is too difficult for most of the methods.
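The filtering schemes above can be sketched as a single predicate. The thresholds are the ones listed above; the function and variable names are hypothetical, positions are assumed to be 3D coordinates, and the distance is assumed to be Euclidean. For FetchReach, `start_position` stands for the initial position; for the other Fetch tasks it stands for the object position.

```python
import math

# Distance thresholds below which an episode counts as trivial.
# HandReach is absent because its trivial episodes are not filtered out.
TRIVIAL_DISTANCE = {
    "FetchReach": 0.2,
    "FetchPush": 0.2,
    "FetchSlide": 0.1,
    "FetchPickAndPlace": 0.2,
}


def is_trivial(task, goal_position, start_position):
    """Return True if the goal is closer to the start than the task threshold."""
    threshold = TRIVIAL_DISTANCE.get(task)
    if threshold is None:
        return False  # no filtering for this task
    return math.dist(goal_position, start_position) < threshold


def keep_episode(task, success, goal_position, start_position):
    """Keep only successful, non-trivial expert episodes."""
    return success and not is_trivial(task, goal_position, start_position)
```

For example, an object starting 0.15 away from the goal makes a FetchPush episode trivial, while the same configuration in FetchSlide is kept because its threshold is 0.1.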
16 Setup of Noisy Action
To test the robustness of our method to noisy actions, we add noise to the actions in the training stage. Let denote the predicted action by the imitator. The actual noisy action to be executed by the robot is defined as:
where is set as . Note that will be clipped in the range defined by each environment.
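The perturb-then-clip procedure can be sketched as follows. This sketch rests on assumptions: the noise is taken to be zero-mean Gaussian and independent per action dimension, the noise scale `sigma` is left as a required argument rather than given a value, and `low` and `high` stand for the action bounds defined by the environment. The function name is hypothetical.

```python
import random


def noisy_action(action, sigma, low, high, rng=None):
    """Add noise to each action dimension, then clip to the action range.

    Zero-mean Gaussian noise is an assumption of this sketch; sigma, low,
    and high must be supplied by the caller.
    """
    rng = rng or random.Random()
    perturbed = (value + rng.gauss(0.0, sigma) for value in action)
    return [min(high, max(low, value)) for value in perturbed]
```

Clipping after the perturbation keeps the executed action inside the valid range even when the noise pushes a component beyond the environment's bounds.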
- Agrawal et al. (2016) Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. 2016. Learning to poke by poking: Experiential learning of intuitive physics. In Proc. Advances in Neural Information Processing Systems (NIPS). pp. 5074-5082.
- Andrychowicz et al. (2017) Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. 2017. Hindsight experience replay. In Proc. Advances in Neural Information Processing Systems (NIPS). pp. 5048-5058.
- Bengio et al. (2009) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proc. Int. Conf. Machine Learning (ICML). pp. 41-48.
- Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. OpenAI gym. arXiv:1606.01540 (2016).
- Choudhury et al. (2017) Sanjiban Choudhury, Mohak Bhardwaj, Sankalp Arora, Ashish Kapoor, Gireeja Ranade, Sebastian Scherer, and Debadeepta Dey. 2017. Data-driven planning via imitation learning. arXiv:1711.06391 (2017).
- Codevilla et al. (2018) Felipe Codevilla, Matthias Müller, Alexey Dosovitskiy, Antonio López, and Vladlen Koltun. 2018. End-to-end driving via conditional imitation learning. In Proc. Int. Conf. Robotics and Automation (ICRA).
- Duan et al. (2017) Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. 2017. One-shot imitation learning. In Proc. Advances in Neural Information Processing Systems (NIPS). pp. 1087-1098.
- Englert et al. (2013) Peter Englert, Alexandros Paraschos, Jan Peters, and Marc Peter Deisenroth. 2013. Model-based imitation learning by probabilistic trajectory matching. In Proc. Int. Conf. Robotics and Automation (ICRA). pp. 1922-1927.
- Finn et al. (2017) Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. 2017. One-shot visual imitation learning via meta-learning. arXiv:1709.04905 (2017).
- Ho and Ermon (2016) Jonathan Ho and Stefano Ermon. 2016. Generative adversarial imitation learning. In Proc. Advances in Neural Information Processing Systems (NIPS). pp. 4565-4573.
- Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In Proc. Int. Conf. Machine Learning (ICML). pp. 1928-1937.
- Nair et al. (2017) Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, and Sergey Levine. 2017. Combining self-supervised learning and imitation for vision-based rope manipulation. In Proc. Int. Conf. Robotics and Automation (ICRA). pp. 2146-2153.
- Pathak et al. (2017) Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. 2017. Curiosity-driven exploration by self-supervised prediction. In Proc. Int. Conf. Machine Learning (ICML).
- Pathak et al. (2018) Deepak Pathak, Parsa Mahmoudieh, Michael Luo, Pulkit Agrawal, Dian Chen, Fred Shentu, Evan Shelhamer, Jitendra Malik, Alexei A. Efros, and Trevor Darrell. 2018. Zero-shot visual imitation. In Proc. Int. Conf. Learning Representations (ICLR).
- Pinto et al. (2017) Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. 2017. Robust adversarial reinforcement learning. In Proc. Int. Conf. Machine Learning (ICML).
- Plappert et al. (2018a) Matthias Plappert et al. 2018a. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv:1802.09464 (2018).
- Plappert et al. (2018b) Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. 2018b. Parameter space noise for exploration. In Proc. Int. Conf. Learning Representations (ICLR).
- Prieur et al. (2012) Urbain Prieur, Véronique Perdereau, and Alexandre Bernardino. 2012. Modeling and planning high-level in-hand manipulation actions from human knowledge and active learning from demonstration. In Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS). pp. 1330-1336.
- Ross et al. (2011) Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proc. Int. Conf. Artificial Intelligence and Statistics (AISTATS). pp. 627-635.
- Ross et al. (2013) Stéphane Ross, Narek Melik-Barkhudarov, Kumar Shaurya Shankar, Andreas Wendel, Debadeepta Dey, J Andrew Bagnell, and Martial Hebert. 2013. Learning monocular reactive UAV control in cluttered natural environments. In Proc. Int. Conf. Robotics and Automation (ICRA). pp. 1765-1772.
- Schulman et al. (2013) John Schulman, Ankush Gupta, Sibi Venkatesan, Mallory Tayson-Frederick, and Pieter Abbeel. 2013. A case study of trajectory transfer through non-rigid registration for a simplified suturing scenario. In Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS). pp. 4111-4117.
- Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. Trust region policy optimization. In Proc. Int. Conf. Machine Learning (ICML). pp. 1889-1897.
- Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv:1707.06347 (2017).
- Shioya et al. (2018) Hiroaki Shioya, Yusuke Iwasawa, and Yutaka Matsuo. 2018. Extending robust adversarial reinforcement learning considering adaptation and diversity. In Proc. Int. Conf. Learning Representations (ICLR) Workshop.
- Sukhbaatar et al. (2018) Sainbayar Sukhbaatar, Zeming Lin, Ilya Kostrikov, Gabriel Synnaeve, Arthur Szlam, and Rob Fergus. 2018. Intrinsic motivation and automatic curricula via asymmetric self-play. In Proc. Int. Conf. Learning Representations (ICLR).
- Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Proc. Advances in Neural Information Processing Systems (NIPS). pp. 1057-1063.
- Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. 2012. MuJoCo: A physics engine for model-based control. In Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS). pp. 5026-5033.
- Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning. pp. 5-32.
- Yu et al. (2018) Tianhe Yu, Chelsea Finn, Annie Xie, Sudeep Dasari, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. 2018. One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv:1802.01557 (2018).