Deep reinforcement learning (DRL) has demonstrated great promise in recent years [mnih2015human, alphago]. However, despite being shown to be a viable approach in robotics [pmlr-v87-kalashnikov18a, openai2018learning], DRL still suffers from low sample efficiency in practice—an acute issue in robot learning.
Given how critical this issue is, many diverse propositions have been presented. For brevity, we only recall the most related to our work. A first idea is to better utilize observed samples, e.g., memory replay [lin1992self] or hindsight experience replay (HER) [Andrychowicz2017HindsightReplay]. Although better observation reuse does not reduce sample requirements in a DRL algorithm, it decreases the needed number of actual interactions with the environment to obtain a satisfactory behavior, which is the most important factor in robot learning. Another idea is to exploit any a priori domain knowledge one may have (e.g., symmetry [kidzinski2018learning]) to support learning. Besides, a robot is generally expected to solve not only one fixed task, but multiple related ones. Multi-task reinforcement learning [Plappert2018Multi-GoalResearch] is considered beneficial as it would not be feasible to repeatedly solve each encountered task tabula rasa. Finally, in order to avoid or reduce learning on an actual robot, recent works have investigated how policies learned in a simulator can be transferred to a real robot [tobin2017domain].
In this work, in order to further improve the data efficiency of DRL, we propose two novel data augmentation methods that better reuse the samples observed in the true environment by exploiting the symmetries (i.e., any invariant transformation) of the problem. For simplicity, we present them in the same set-up as HER [Andrychowicz2017HindsightReplay], although the techniques can be instantiated in other settings.
The first technique, Kaleidoscope Experience Replay (KER), is based on reflectional symmetry: it posits that trajectories in robotic spaces can undergo multiple reflections and remain in the feasible workspace. Namely, for a given robotic problem, there is a set of reflections that can generate, from any observed trajectory, many new artificial feasible trajectories for training. For concreteness, in this paper we focus on reflections with respect to hyperplanes.
The second technique, Goal-augmented Experience Replay (GER), can be seen as a generalization of HER: any artificial goal generated by HER can instead be replaced by a random goal sampled in a small ball around it. This idea takes advantage of tasks where success is defined as reaching a final pose within a threshold distance of the goal (such tasks are common in robotics). Here, trajectories are augmented using the invariant transformation that consists in changing actual goals to random goals close to reached ones.
In Sec. II, we present the necessary background (DRL and HER) for our work. Sec. III introduces related work that seeks to increase data efficiency. Sec. IV details our two invariant transform data augmentation techniques. Sec. V presents experimental results on OpenAI Gym Fetch tasks [brockman2016openai], which demonstrate the effectiveness of our propositions. Sec. VI and Sec. VII highlight key lessons.
In this work, we consider robotic tasks that are modeled as multi-goal Markov decision processes [schaul2015universal] with continuous state and action spaces: $\langle \mathcal{S}, \mathcal{A}, \mathcal{G}, T, R, p, \gamma \rangle$, where $\mathcal{S}$ is a continuous state space, $\mathcal{A}$ is a continuous action space, $\mathcal{G}$ is a set of goals, $T$ is the unknown transition function that describes the effects of actions, and $R(s, a, s', g)$ is the immediate reward when reaching state $s'$ after performing action $a$ in state $s$ if the goal were $g$. Finally, $p$ is a joint probability distribution over initial states and initial goals, and $\gamma$ is a discount factor. In this framework, the robot learning problem corresponds to an RL problem that aims at obtaining a policy such that the expected discounted sum of rewards is maximized for any given goal.
Due to the continuity of the state-action spaces, this optimization problem is usually restricted to a class of parameterized policies. In DRL, the parameterization is defined by the neural network architecture. To learn such continuous policies, actor-critic algorithms [Konda1999] are commonly used. DDPG [Lillicrap2015] is a model-free, off-policy, state-of-the-art DRL algorithm that learns a deterministic policy, which is desirable in robotic tasks. In DDPG, the transitions are collected into a replay buffer to later update the action-value function in a semi-gradient way and the policy with the deterministic policy gradient [Silver2014]. Because the policy has to adapt to multiple goals, as in HER, we rely on universal value functions [schaul2015universal]: the classic inputs of the value function and the policy of DDPG are augmented with the desired goal.
When the reward function is sparse, as assumed here, the RL problem is particularly hard to solve. Specifically, we consider here reward functions that are described as follows:
$$R(s, a, s', g) = \mathbb{1}\left[ d(s', g) \le \epsilon_R \right] - 1 \tag{1}$$
where $\mathbb{1}$ is the indicator function, $d$ is a distance, and $\epsilon_R$ is a fixed threshold.
To tackle this issue, HER is based on the following principle: any trajectory that failed to reach its goal still carries useful information; it has at least reached the states of its trajectory path. Using this natural and powerful idea, memory replay can be augmented with the failed trajectories by changing their goals in hindsight and computing the new associated rewards.
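The relabeling principle can be sketched in a few lines of Python. This is only a minimal illustration, not the HER implementation: it assumes a reward of 0 when the achieved position lies within the threshold of the goal and -1 otherwise, it uses the last achieved position as the hindsight goal, and the dictionary keys `achieved_next`, `goal`, and `reward` are hypothetical names chosen for this sketch.

```python
import math


def sparse_reward(achieved, goal, threshold):
    """Return 0.0 if `achieved` is within `threshold` of `goal`, else -1.0."""
    return 0.0 if math.dist(achieved, goal) <= threshold else -1.0


def relabel_with_final_state(transitions, threshold):
    """Copy an episode, replacing its goal by the last achieved position.

    Rewards are recomputed for the new goal; the input list is not modified.
    """
    new_goal = transitions[-1]["achieved_next"]
    relabeled = []
    for t in transitions:
        reward = sparse_reward(t["achieved_next"], new_goal, threshold)
        relabeled.append({**t, "goal": new_goal, "reward": reward})
    return relabeled


# A failed two-step episode: the goal at x = 1.0 is never reached.
episode = [
    {"achieved_next": (0.1, 0.0, 0.0), "goal": (1.0, 0.0, 0.0), "reward": -1.0},
    {"achieved_next": (0.2, 0.0, 0.0), "goal": (1.0, 0.0, 0.0), "reward": -1.0},
]
hindsight = relabel_with_final_state(episode, threshold=0.05)
print([t["reward"] for t in hindsight])  # [-1.0, 0.0]
```

In hindsight the last transition becomes a success, so the replay buffer now contains a non-trivial reward signal even though the original episode failed.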
In the Fetch-v1 tasks solved by HER, the states generally comprise the positions of the gripper and the object together with the relative position of the gripper and the object, the velocities of the gripper and the object, the Euler angles of the object, and the distance moved in every step. The actions comprise the new position that the gripper should reach and the distance that each finger of the gripper should move. The dimensions of are (FetchPickAndPlace-v1) and (FetchPush-v1, FetchSlide-v1).
III Related Work
HER [Andrychowicz2017HindsightReplay, Plappert2018Multi-GoalResearch] has been extended in various ways. Prioritized replay was incorporated in HER to learn from more valuable episodes with higher priority [Zhao2018Energy-BasedPrioritization]. In [Fang2019DHER:Replay], HER was generalized to deal with dynamic goals. In [Gerken2019ContinuousControllers], a variant of HER was also investigated where completely random goals replace achieved goals and in [Rauber2019HindsightGradients], it was adapted to work with on-policy RL algorithms. All these extensions are orthogonal to our work and could easily be combined with KER. We leave these for future work.
Symmetry has been considered in MDPs [Zinkevich2001SymmetryLearning] and RL [Kamal2008ReinforcementStates, Agostini2009ExploitingSpacesE, Mahajan2017SymmetryLearning, Kidzinski2018LearningEnvironments, Amadio2019ExploitingTasks]. It can be known a priori or learned [Mahajan2017SymmetryLearning]. In this work, we assume the former, which is reasonable in many robotics tasks. A natural approach to exploit symmetry in sequential decision-making is by aggregating states that satisfy an equivalence relation induced by some symmetry [Zinkevich2001SymmetryLearning, Kamal2008ReinforcementStates]. Another related approach takes into account symmetry in the policy representation [Amadio2019ExploitingTasks]. Doing so reduces representation size and generally leads to faster solution times. However, the state-aggregated representation may be difficult to recover, especially if many symmetries are considered simultaneously. Still another approach is to use symmetry during training instead. One simple idea is to learn the Q-function by performing an additional symmetrical update [Agostini2009ExploitingSpacesE]. Another method is to augment the training points with their reflections [Kidzinski2018LearningEnvironments]. In this paper, we further generalize this idea as a data augmentation technique where many symmetries can be considered and pairs of symmetrical updates do not need to be applied simultaneously.
To the best of our knowledge, data augmentation has not been considered much to accelerate learning in RL. It has, however, been used extensively and with great success in machine learning [Baird1992DocumentModels] and in deep learning [KrizhevskyImagenetNetworks]. Interestingly, symmetries can also be exploited in neural network architecture design [Gens2014DeepNetworks]. However, in our case, the integration of symmetry in deep networks will be left as future work.
IV Invariant Transformations for RL
To reduce the number of interactions with the real environment, we propose to leverage symmetries (i.e., any invariant transformations) in the space of feasible trajectories in order to generate artificial training data from actual trajectories collected during the robot’s learning.
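We now present this idea more formally, starting with some notation and definitions. A trajectory $\tau$ of length $h$ with goal $g$ can be written as $\tau = (g, s_0, a_1, s_1, a_2, s_2, \ldots, s_h)$ where $g \in \mathcal{G}$, $s_0 \in \mathcal{S}$, and, for $i = 1, \ldots, h$, $a_i \in \mathcal{A}$ and $s_i \in \mathcal{S}$. We assume that all trajectories have a length not larger than $H$, which is true in robotics. The set of all trajectories is denoted $\Gamma$.
A trajectory $\tau$ is said to be feasible if for $i = 1, \ldots, h$, $T(s_{i-1}, a_i, s_i) > 0$. The set of feasible trajectories is denoted $\mathcal{T}$. A trajectory $\tau$ of length $h$ with goal $g$ is said to be successful if $d(s_h, g) \le \epsilon_R$, where $d$ and $\epsilon_R$ are the distance and the threshold of the reward function. The set of successful trajectories is denoted $\mathcal{T}^{+}$.
We can now define the different notions of symmetries that we use in this paper. A symmetry of $\Gamma$ is a one-to-one mapping $\sigma: \Gamma \to \Gamma$ such that $\sigma(\mathcal{T}) = \mathcal{T}$ where $\sigma(\mathcal{T}) = \{ \sigma(\tau) \mid \tau \in \mathcal{T} \}$. In words, a symmetry of $\Gamma$ leaves the space $\mathcal{T}$ invariant. As we only apply symmetries to feasible trajectories, we directly consider their restrictions to $\mathcal{T}$ and keep the same notation, i.e., $\sigma: \mathcal{T} \to \mathcal{T}$. A decomposable symmetry is a symmetry $\sigma$ such that there exist one-to-one mappings $\sigma_{\mathcal{G}}: \mathcal{G} \to \mathcal{G}$, $\sigma_{\mathcal{S}}: \mathcal{S} \to \mathcal{S}$, and $\sigma_{\mathcal{A}}: \mathcal{A} \to \mathcal{A}$ that satisfy for any $\tau = (g, s_0, a_1, s_1, \ldots, s_h) \in \mathcal{T}$:
$$\sigma(\tau) = (g', s'_0, a'_1, s'_1, \ldots, s'_h)$$
where $g' = \sigma_{\mathcal{G}}(g)$, $s'_0 = \sigma_{\mathcal{S}}(s_0)$, and, for $i = 1, \ldots, h$, $a'_i = \sigma_{\mathcal{A}}(a_i)$ and $s'_i = \sigma_{\mathcal{S}}(s_i)$. A reward-preserving symmetry is a symmetry $\sigma$ such that for any trajectory $\tau$, the rewards appearing in $\sigma(\tau)$ are exactly the same as those in $\tau$ in the same order. The previous definitions of symmetries can be naturally applied to the set of successful trajectories as well.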
Besides, note that any number of symmetries induces a group structure (i.e., they can be composed). Given a fixed set of symmetries and a trajectory, one could therefore generate many new trajectories by composing those symmetries. This property is useful if one only knows a fixed number of symmetries for a given problem, because a recursive application of those symmetries could lead to an exponential increase in the number of trajectories that could be used for training.
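To make the composition argument concrete, the following Python sketch enumerates the maps obtained by composing a few known symmetries. It assumes, for illustration only, that the symmetries are commuting reflections (here, sign flips of individual coordinates), in which case each non-empty subset of $n$ reflections yields one composed map, hence up to $2^n - 1$ new trajectories. The function names are hypothetical, and whether each composed map is a valid symmetry of a given task must still be checked.

```python
import itertools


def flip_x(point):
    """Reflect a 3D point across the plane x = 0."""
    return (-point[0], point[1], point[2])


def flip_y(point):
    """Reflect a 3D point across the plane y = 0."""
    return (point[0], -point[1], point[2])


def compose(*maps):
    """Return the map that applies `maps` one after the other."""
    def composed(point):
        for m in maps:
            point = m(point)
        return point
    return composed


def augment_by_composition(path, symmetries):
    """Apply every non-empty subset of `symmetries` to a list of points."""
    new_paths = []
    for k in range(1, len(symmetries) + 1):
        for subset in itertools.combinations(symmetries, k):
            transform = compose(*subset)
            new_paths.append([transform(p) for p in path])
    return new_paths


path = [(1.0, 2.0, 3.0)]
new_paths = augment_by_composition(path, [flip_x, flip_y])
print(len(new_paths))  # 3 new paths from 2 reflections
```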
As a general approach to increase data efficiency in DRL, one can leverage the symmetries of the set of feasible trajectories or of the set of successful trajectories for data augmentation. As an illustration, we propose ITER (Invariant Transform Experience Replay), an architecture that combines our two proposed techniques (explained in detail below):
Kaleidoscope Experience Replay (KER) is based on decomposable symmetries of the set of feasible trajectories.
Goal-Augmented Experience Replay (GER) is based on reward-preserving decomposable symmetries of the set of successful trajectories, but can be applied to all feasible trajectories in the same fashion as HER.
While the two methods are combined here, only one of them could be used instead. An overview of our architecture is illustrated in Fig. 1.
IV-A Kaleidoscope Experience Replay (KER)
KER uses reflectional symmetry (though more general invariant transformations, e.g., rotation or translation, could also be used in its place). Consider a 3D workspace with a bisecting plane as shown in Fig. 2. If a feasible trajectory is generated in the workspace (red in Fig. 2), natural symmetry would then yield a new feasible trajectory reflected on this plane. More generally, the plane may be rotated by some angle about an axis and still define an invariant symmetry for the robotic task.
We can now precisely define KER, which amounts to augmenting any original trajectory with a certain number of random reflectional symmetries. To ensure the generation of feasible symmetrical trajectories, we can first derive the maximum valid angle according to the relative pose between the working space of a robot and the position area of objects and goals in any given robotic manipulation task. A hyperparameter denotes the number of symmetric planes applied by KER for trajectory symmetrization. We consider that the center of the robot base is positioned at the origin of the world coordinate frame. For any given , there is always a symmetrical plane which is parallel to plane and goes through the center of the robot arm base; we refer to this decomposable symmetry as . Thus, the number of new trajectories generated is . If , KER directly applies to symmetrize the new trajectory with:
where the 3-tuple denotes the position and the 3-tuple denotes the orientation and is the distance between the two fingers of the robot. If , KER first generates a set of rotated symmetric planes which are rotated along axis by a set of random angles (sampled from a uniform distribution). For each plane, the associated decomposable symmetries are denoted as . The decomposition of is:
where , , are the standard rotation matrices about the x, y, and z axes, respectively, and maps an orientation matrix to relative angles of Cardan type (the Cardan mapping is a relative rotation about the x, y, and z axes, in that order; relative rotation means that after rotating about x, the newly rotated y axis is used, and likewise for z).
Note that instead of storing the reflected trajectories in the replay buffer, the random symmetries could instead be applied to sampled minibatches. This approach was tried previously for single-symmetry scenarios [kidzinski2018learning]. Doing so, however, is more computationally taxing, as transitions are reflected every time they are sampled, and, more significantly, leads to lower performance (see our project page, www.JuanRojas.net/ker, for this and more supplementary information). Our conjecture is that such an approach leads to a lower diversity in the minibatches.
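The position part of this reflection can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions rather than the implementation used in our experiments: it only handles 3D vector quantities (positions, velocities, displacement actions) and omits orientations and the finger distance; it assumes the symmetry plane is vertical, passes through the origin (the robot base), and is rotated about the z axis; and it assumes the angles are drawn uniformly from [0, `theta_max`]. The names `reflect_about_vertical_plane`, `ker_augment`, `n_planes`, and `theta_max` are hypothetical.

```python
import math
import random


def reflect_about_vertical_plane(vector, theta):
    """Reflect a 3D vector across the vertical plane through the origin
    obtained by rotating the xz-plane by `theta` about the z axis."""
    x, y, z = vector
    c, s = math.cos(2.0 * theta), math.sin(2.0 * theta)
    return (c * x + s * y, s * x - c * y, z)


def ker_augment(trajectory, n_planes, theta_max, rng):
    """Return `n_planes` reflected copies of a trajectory.

    Each step is a dict mapping a name to a 3D vector; the same plane is
    applied to every vector of every step so that the copy stays consistent.
    """
    copies = []
    for _ in range(n_planes):
        theta = rng.uniform(0.0, theta_max)  # assumed sampling range
        copies.append([
            {name: reflect_about_vertical_plane(vec, theta)
             for name, vec in step.items()}
            for step in trajectory
        ])
    return copies


rng = random.Random(0)
trajectory = [
    {"gripper": (0.5, 0.2, 0.4), "goal": (0.6, -0.1, 0.4), "action": (0.1, -0.3, 0.0)},
]
mirrored = ker_augment(trajectory, n_planes=2, theta_max=math.pi / 8, rng=rng)
print(len(mirrored))  # 2 reflected copies
```

With `theta = 0` the map reduces to the familiar mirror image `(x, y, z) -> (x, -y, z)`. Since a reflection is linear and preserves distances, the distance between the reflected gripper position and the reflected goal equals the original one, so the rewards of the original trajectory carry over unchanged.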
IV-B Goal-Augmented Experience Replay (GER)
GER exploits the formulation of any reward function (see Eqn. 1) that defines a successful trajectory as one whose end position is within a small radial threshold (a ball) centered around the goal. When the robot obtains a trajectory, we therefore know that it can in fact be considered successful for any goal within a ball centered around its end position. Based on this observation, GER augments trajectories by replacing the original goal with a random goal sampled within that ball. This ball can be formally described as $\{ g' \in \mathcal{G} \mid d(s, g') \le \epsilon \}$ where $s$ is the state reached in the original trajectory and $\epsilon$ is a threshold.
Formally, GER is based on reward-preserving decomposable symmetries of the set of successful trajectories, where the state and action mappings are identity mappings and the goal mapping is randomly chosen, conditional on a trajectory reaching some state, among the mappings that send the goal into the ball centered on that state. Interestingly, such symmetries, when viewed as mappings from the set of feasible trajectories to the set of successful trajectories, can be applied to the whole set of feasible trajectories to generate successful trajectories, which we do in our architecture. In this sense, GER is a generalization of HER and can be implemented in the same fashion. Thus our architecture, like HER, is also applied on minibatches. For each application of GER, the size of the minibatch is increased with the new artificial transitions.
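The goal-sampling step of GER can be sketched as follows in Python. This is a minimal sketch under assumptions that the text above does not fix: goals are drawn uniformly in volume from a Euclidean ball, and the names `sample_goal_in_ball`, `ger_augment`, `radius`, and `n_goals`, as well as the dictionary keys, are hypothetical. As in HER, rewards would then be recomputed for each new goal.

```python
import math
import random


def sample_goal_in_ball(center, radius, rng):
    """Draw a point uniformly from the ball of `radius` around `center`."""
    direction = [rng.gauss(0.0, 1.0) for _ in center]
    norm = math.sqrt(sum(d * d for d in direction))
    # Scaling the radius by u ** (1 / dim) makes the draw uniform in volume.
    scale = radius * rng.random() ** (1.0 / len(center)) / norm
    return tuple(c + scale * d for c, d in zip(center, direction))


def ger_augment(transition, reached, radius, n_goals, rng):
    """Return `n_goals` copies of `transition` whose goals lie near `reached`."""
    return [
        {**transition, "goal": sample_goal_in_ball(reached, radius, rng)}
        for _ in range(n_goals)
    ]


rng = random.Random(0)
transition = {"achieved_next": (0.30, 0.10, 0.42), "goal": (0.80, 0.50, 0.42)}
reached = (0.35, 0.12, 0.42)  # a state reached in the same trajectory
new_transitions = ger_augment(transition, reached, radius=0.05, n_goals=4, rng=rng)
in_ball = all(math.dist(t["goal"], reached) <= 0.05 + 1e-12 for t in new_transitions)
print(len(new_transitions), in_ball)
```

Setting `radius=0.0` places every sampled goal exactly on the reached state, which recovers the HER relabeling as a special case.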
V Experiments and Results
Our experimental evaluation is performed according to the same set-up as for the evaluation of HER [Andrychowicz2017HindsightReplay]. Namely, we use a simulated 7-DOF Fetch Robotics arm trained with DDPG on the pushing, sliding, and pick-and-place tasks from OpenAI Gym [brockman2016openai].
An epoch is defined as a fixed-size set of successive episodes. Since a trajectory, which is also defined as a single episode, can be qualified as successful (Sec. IV), we can compute a success rate over an epoch by counting the number of successful episodes. In order to highlight the difference between HER and ITER in the early learning stage, we use a smaller number of episodes (1 epoch = 100 episodes) to define a complete epoch than in the HER paper (1 epoch = 800 episodes).
The neural network learning the action-value function is composed of 3 hidden layers with 256 units and ReLU activation function. The output neuron predicting the Q-value does not have an activation function. The actor network approximating the policy is also composed of 3 hidden layers with 256 units and ReLU activation function. The activation function of the output layer is tanh.
The provided learning curves (Figs. 3-6) display the testing performance of the target networks of DDPG. It means that the exploration has been disabled and the policy is fully deterministic. After each learning epoch, the testing success rate is computed over 10 episodes. We display an average over 5 random seeds for each curve.
During a learning epoch, the exploration strategy, the learning rates, the soft update ratio, and the replay buffer size are all the same as in HER.
Like HER, GER uses the "future" strategy to choose which reached state will be the center of the ball for picking random goals: a random reached state that appeared later in the same trajectory as the current transition.
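As a hedged illustration of the "future" strategy, the Python sketch below picks the center of the ball for a given transition. The indexing convention is an assumption of this sketch: `achieved[k]` is the state reached after `k` transitions, so transition `t` (0-based) leads from `achieved[t]` to `achieved[t + 1]`, and the candidate centers are the states from `achieved[t + 1]` onward. The function name `sample_future_center` is hypothetical.

```python
import random


def sample_future_center(achieved, t, rng):
    """Return a state reached at or after the end of transition `t`."""
    future_index = rng.randint(t + 1, len(achieved) - 1)  # both ends inclusive
    return achieved[future_index]


rng = random.Random(0)
achieved = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (0.3, 0.1)]
center = sample_future_center(achieved, t=1, rng=rng)
print(center in achieved[2:])  # True
```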
The environment difficulty is the same as defined in OpenAI Gym (harder than the one in the original HER paper): is fixed to cm.
We design our experiments to demonstrate the effectiveness of our propositions and answer the following questions:
How does ITER (GER+KER) perform compared to HER on single- and multi-goal tasks?
How much does KER contribute to the performance of ITER? How many symmetries should be used?
What is the contribution of GER to the performance of ITER? What is the impact of ?
V-A Does ITER improve performance with respect to HER?
The best combination uses random symmetries for KER and applications of GER (where one of them uses a threshold equal to zero in order to also take full advantage of realized goals). ITER is more data-efficient than HER on different robotic tasks in both the multi-goal (Fig. 3) and single-goal (Fig. 4) settings.
The slide task is a difficult task where the result is determined by only a few contacts (generally one) between the gripper and the box. This makes it hard to learn and generalize, which would explain why ITER does not improve much over HER. The pick-and-place task in the single-goal setting is also a difficult task. Without the training tricks used in the HER paper, HER is not able to learn anything, while our approach starts to learn after about 180 epochs.
V-B How many symmetries should we use in KER?
In this experiment (Fig. 5), we use a single GER application with a zero threshold (i.e., HER). We observe that performance increases monotonically with the number of random symmetries, although the gain diminishes as that number grows.
V-C Does GER improve performance?
In this experiment, KER is not used and we vary the number of GER applications. We observe behavior similar to that of KER: the more GER is applied, the better the performance, until it reaches a ceiling (Fig. 6).
Because GER changes the size of the minibatches, we performed a control experiment, not shown here, where we increased the size of the minibatches in vanilla HER. We found that it did not improve the performance of HER.
VI-A Apply invariant transformations to the input or output of the replay buffer
One interesting question concerns how the new data should be used. We could either populate the replay buffer with the artificial transitions or apply the transformations to a minibatch sampled from the replay buffer. If the transformations are applied after sampling, the diversity of the minibatch could be limited because all the new artificial transitions come from the same source. Conversely, if the transformation is applied before storing, then we do not fully exploit the information contained in this transformation because we only sample it once where we could sample it several times. In our experiments, we noticed that applying KER before and GER after works better in practice.
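The two placements can be contrasted with a small Python sketch of a replay buffer that accepts one transformation on its input and one on its output. This is an illustrative skeleton, not our implementation: the class `AugmentedReplayBuffer`, its parameters `before_store` and `after_sample`, and the toy transformations `mirror_y` and `shift_goal` (stand-ins for KER and GER, respectively) are all hypothetical.

```python
import random


class AugmentedReplayBuffer:
    """Replay buffer with augmentation hooks on its input and its output."""

    def __init__(self, before_store, after_sample, seed=0):
        self.before_store = before_store  # applied once, when storing
        self.after_sample = after_sample  # applied at every sampling
        self.storage = []
        self.rng = random.Random(seed)

    def store(self, transition):
        """Store a transition together with its input-side augmentations."""
        self.storage.append(transition)
        self.storage.extend(self.before_store(transition))

    def sample(self, batch_size):
        """Sample a minibatch and enlarge it with output-side augmentations."""
        batch = self.rng.choices(self.storage, k=batch_size)
        extra = [new for t in batch for new in self.after_sample(t)]
        return batch + extra


def mirror_y(transition):
    """Toy stand-in for KER: one mirrored copy per stored transition."""
    x, y = transition["pos"]
    return [{**transition, "pos": (x, -y)}]


def shift_goal(transition):
    """Toy stand-in for GER: one copy with a slightly displaced goal."""
    return [{**transition, "goal": transition["goal"] + 0.01}]


buffer = AugmentedReplayBuffer(before_store=mirror_y, after_sample=shift_goal)
buffer.store({"pos": (1.0, 2.0), "goal": 0.5})
batch = buffer.sample(batch_size=4)
print(len(buffer.storage), len(batch))  # 2 8
```

The input-side hook fixes its random choices once per stored transition, while the output-side hook draws fresh ones at every sampling and increases the size of the minibatch.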
VI-B Performance drop with KER
After a certain number of epochs, we also observed an unexpected performance drop when applying KER. The larger the number of symmetries, the sooner this phenomenon appears. We first thought that the algorithm was overfitting the new artificial goals introduced by KER at the expense of real goals. However, in an experiment not shown here, we observed that the performance drop was also affecting the artificial goals. We have two other hypotheses to explain this phenomenon:
We update the policy on too much data that are unlikely to appear under the current policy. This implies that there is a mismatch between the state distribution induced by the minibatches and that of an optimal policy. If it is the case, we should weight the new artificial transitions to limit their importance on updates [Rauber2019HindsightGradients].
As DDPG is known to be very sensitive to hyperparameters [Henderson2017], they should be optimized for each new number of symmetries.
This investigation is left for future work.
We proposed two novel data augmentation techniques, KER and GER, to amplify the efficiency of observed samples in a memory replay mechanism. KER exploits reflectional symmetry in the feasible workspace (though in general it could be employed with other types of symmetries). GER, as an extension of HER, is specific to goal-oriented tasks where success is defined in terms of a thresholded distance. The combination of these techniques greatly accelerated learning, as demonstrated in our experiments.
Our next step is to use our method to solve the same tasks with a simulated Baxter robot, and then transfer to a real Baxter using a sim2real methodology. Furthermore, we aim at extending our proposition to other types of symmetries and to other robotic tasks as well.
This work is supported by Major Project of the Guangdong Province Department for Science and Technology [grant number 2016B0911006], by the Natural Science Foundation of China [grant numbers 61750110521 and 61872238], and the Shanghai Natural Science Foundation [grant number 19ZR1426700].