Towards More Sample Efficiency in Reinforcement Learning with Data Augmentation

10/19/2019, by Yijiong Lin, et al.

Deep reinforcement learning (DRL) is a promising approach for adaptive robot control, but its application to robotics is currently hindered by high sample requirements. We propose two novel data augmentation techniques for DRL in order to reuse observed data more efficiently. The first, called Kaleidoscope Experience Replay, exploits reflectional symmetries, while the second, called Goal-augmented Experience Replay, takes advantage of lax goal definitions. Our preliminary experimental results show a large increase in learning speed.


1 Introduction

Deep reinforcement learning (DRL) has demonstrated great promise in recent years (mnih2015human; alphago). However, despite being shown to be a viable approach in robotics (pmlr-v87-kalashnikov18a; openai2018learning), DRL still suffers from high sample complexity in practice—an acute issue in robot learning.

Given how critical this issue is, many diverse approaches have been proposed. For brevity, we only recall those most related to our work. A first idea is to better utilize observed samples, e.g., memory replay (lin1992self) or hindsight experience replay (HER) (Andrychowicz2017HindsightReplay). Although better observation reuse does not reduce the sample requirements of a DRL algorithm, it decreases the number of actual interactions with the environment, which is the most important factor in robot learning. Another idea is to exploit any a priori domain knowledge one may have (e.g., symmetry (kidzinski2018learning)) to support learning. Moreover, a robot is generally expected to solve not one fixed task, but multiple related ones. Multi-task reinforcement learning (Plappert2018Multi-GoalResearch) is considered beneficial here, as it would not be feasible to repeatedly solve each encountered task tabula rasa. Finally, in order to avoid or reduce learning on an actual robot, recent works have investigated how policies learned in a simulator can be transferred to a real robot (tobin2017domain).

In this work, in order to further reduce the number of actual samples required, we propose two novel data augmentation methods that better reuse the samples observed in the true environment by exploiting the symmetries (i.e., any invariant transformations) of the problem. For simplicity, we present them in the same setup as HER (Andrychowicz2017HindsightReplay), although the techniques can be instantiated in other settings with any DRL algorithm based on memory replay.

The first technique, Kaleidoscope Experience Replay (KER), is based on reflectional symmetry: it posits that trajectories in robotic workspaces can undergo multiple reflections and remain within the valid workspace. Namely, for a given robotic problem, there is a set of reflections that can generate, from any observed valid trajectory, many new artificial valid ones for training. For concreteness, in this paper we focus on reflections with respect to hyperplanes.

The second technique, Goal-augmented Experience Replay (GER), can be seen as a generalization of HER: any artificial goal generated by HER can instead be replaced by a random goal sampled in a small ball around it. This idea takes advantage of tasks where success is defined as reaching a final pose within a threshold distance of the goal (such tasks are common in robotics). Here, successful trajectories are augmented using the invariant transformation that consists in changing actual goals to random nearby goals.

In Sec. 2, we present this work’s setup. Sec. 3 introduces related work. Sec. 4 details our two data augmentation techniques. Sec. 5 presents preliminary results, and Sec. 6 highlights key lessons.

2 Background

In this work, we consider robotic tasks that are modeled as multi-goal Markov decision processes (schaul2015universal) with continuous state and action spaces: $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{G}, T, R, p_0, \gamma \rangle$, where $\mathcal{S}$ is a continuous state space, $\mathcal{A}$ is a continuous action space, $\mathcal{G}$ is a set of goals, $T(s' \mid s, a)$ is the unknown transition function that describes the effects of actions, $R(s, a, s', g)$ is the immediate reward when reaching state $s'$ after performing action $a$ in state $s$ if the goal were $g$, $p_0$ is a joint probability distribution over initial states and initial goals, and $\gamma \in [0, 1)$ is a discount factor. In this framework, the robot learning problem corresponds to an RL problem that aims at obtaining a policy $\pi : \mathcal{S} \times \mathcal{G} \to \mathcal{A}$ such that the expected discounted sum of rewards is maximized for any given goal.

When the reward function is sparse, as assumed here, this RL problem is particularly hard to solve. In particular, we consider here reward functions of the following form: $R(s, a, s', g) = \mathbb{1}[d(s', g) \le \epsilon_R] - 1$, where $\mathbb{1}$ is the indicator function, $d$ is a distance, and $\epsilon_R$ is a fixed threshold.
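As a concrete illustration, the following is a minimal Python sketch of such a sparse goal-conditioned reward, assuming a Euclidean distance; the function name and threshold value are illustrative, not taken from the paper.

```python
import numpy as np

# Minimal sketch of the sparse goal-conditioned reward described above.
# The Euclidean distance and the default threshold are illustrative assumptions.
def sparse_reward(reached_position, goal, eps_r=0.05):
    """Return 0 if the reached position is within eps_r of the goal, else -1."""
    d = np.linalg.norm(np.asarray(reached_position) - np.asarray(goal))
    return 0.0 if d <= eps_r else -1.0
```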

To tackle this issue, Andrychowicz2017HindsightReplay proposed HER, which is based on the following principle: any trajectory that failed to reach its goal still carries useful information, since it has at least reached the states along its path. Using this natural and powerful idea, the replay memory can be augmented with failed trajectories whose goals are changed in hindsight.
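For illustration, here is a minimal sketch of this hindsight relabeling (the final-state variant), assuming transitions stored as dictionaries with illustrative keys; this is a simplification, not the authors' implementation.

```python
# Sketch of hindsight relabeling: a failed episode is copied with its goal
# replaced by the goal it actually achieved, and rewards are recomputed.
# The transition keys ("achieved", "goal", "reward") are illustrative.
def her_relabel(episode, reward_fn):
    achieved_goal = episode[-1]["achieved"]          # position actually reached at the end
    relabeled = []
    for transition in episode:
        new_reward = reward_fn(transition["achieved"], achieved_goal)
        relabeled.append({**transition, "goal": achieved_goal, "reward": new_reward})
    return relabeled
```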

3 Related Work

HER (Andrychowicz2017HindsightReplay; Plappert2018Multi-GoalResearch) has been extended in various ways. Prioritized replay was incorporated into HER to learn with higher priority from more valuable episodes (Zhao2018Energy-BasedPrioritization). In (Fang2019DHER:Replay), HER was generalized to deal with dynamic goals. In (Gerken2019ContinuousControllers), a variant of HER was investigated where completely random goals replace achieved goals, and in (Rauber2019HindsightGradients), HER was adapted to work with on-policy RL algorithms. All these extensions are orthogonal to our work and could easily be combined with KER. We leave this for future work.

Symmetry has been considered in MDPs (Zinkevich2001SymmetryLearning) and RL (Kamal2008ReinforcementStates; Agostini2009ExploitingSpaces; Mahajan2017SymmetryLearning; Kidzinski2018LearningEnvironments; Amadio2019ExploitingTasks). It can be known a priori or learned (Mahajan2017SymmetryLearning). In this work, we assume the former, which is reasonable in many robotic tasks. A natural approach to exploiting symmetry in sequential decision-making is to aggregate states that satisfy an equivalence relation induced by a symmetry (Zinkevich2001SymmetryLearning; Kamal2008ReinforcementStates). Another related approach takes symmetry into account in the policy representation (Amadio2019ExploitingTasks). Doing so reduces the representation size and generally leads to faster solution times. However, the state-aggregated representation may be difficult to recover, especially if many symmetries are considered simultaneously. Yet another approach is to use symmetry during training instead. One simple idea is to learn the Q-function by performing an additional symmetrical update (Agostini2009ExploitingSpaces). Another method is to augment the training points with their reflections (Kidzinski2018LearningEnvironments). In this paper, we further generalize this idea into a data augmentation technique where many symmetries can be considered and pairs of symmetrical updates do not need to be applied simultaneously.

While, to the best of our knowledge, data augmentation has not been considered much to accelerate learning in RL, it has been used extensively and with great success in machine learning (Baird1992DocumentModels) and even more so in deep learning (KrizhevskyImagenetNetworks). Interestingly, symmetries can also be exploited in neural network architecture design (Gens2014DeepNetworks). However, in our case, the integration of symmetry into deep networks is left as future work.

4 Data Augmentation for RL

To reduce the number of interactions with the real environment, our goal is to generate artificial training data from the actual trajectories collected during the robot’s learning.

Our architecture leverages our two proposed techniques, Kaleidoscope Experience Replay (KER) and Goal-augmented Experience Replay (GER). While the two methods are combined here, each can also be used on its own. An overview of our architecture is illustrated in Fig. 1.

Figure 1: Framework overview: real and symmetrically transformed transitions are stored in the replay buffer. Sampled minibatches are then augmented with GER before updating the policy.
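The flow of Fig. 1 can be summarized by the sketch below. The helper callables (episode collection, KER reflection, GER augmentation, DDPG update) are hypothetical placeholders passed in as arguments, and the default hyperparameter values are only illustrative; this is not the authors' implementation.

```python
# Rough sketch of one training iteration following Fig. 1. The helper
# callables and hyperparameter values are illustrative placeholders.
def train_iteration(env, policy, replay_buffer,
                    collect_episode, ker_reflections, ger_augment, ddpg_update,
                    n_symmetries=8, n_ger=4):
    episode = collect_episode(env, policy)                # real interaction with the environment
    replay_buffer.store(episode)                          # store the real transitions
    for reflected in ker_reflections(episode, n_symmetries):
        replay_buffer.store(reflected)                    # store symmetric copies (KER)
    minibatch = replay_buffer.sample()                    # sample a training minibatch
    minibatch = ger_augment(minibatch, n_ger)             # augment goals at sampling time (GER)
    ddpg_update(policy, minibatch)                        # actor-critic update
```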

4.1 Kaleidoscope Experience Replay (KER)

KER uses reflectional symmetry (though more general invariant transformations could also be used in its place). Consider a 3D workspace with a bisecting plane, as shown in Fig. 2. If a valid trajectory is generated in the workspace (blue in Fig. 2), natural symmetry yields a new valid trajectory reflected across this plane. More generally, the plane may be rotated by some angle about a fixed axis and still define an invariant symmetry for the robotic task.

Figure 2: Kaleidoscope Experience Replay leverages natural symmetry. Valid trajectories are reflected with respect to a bisecting plane, which can itself be rotated by some angle about a fixed axis.

We can now precisely define KER: it augments any original trajectory with a certain number of random symmetries, where a random symmetry is a reflectional symmetry with respect to the bisecting plane after it has been rotated by a random angle about the fixed axis.
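As an illustration, here is a minimal sketch of one such random reflection applied to a sequence of 3D positions, assuming the mirror plane contains the vertical axis; a complete implementation would also have to reflect velocities, orientations, and actions consistently.

```python
import numpy as np

# Sketch of one random KER reflection on 3D positions. The choice of the
# vertical axis and the angle range are illustrative assumptions.
def reflect_point(point, theta):
    """Reflect a 3D point across the vertical plane rotated by angle theta."""
    normal = np.array([-np.sin(theta), np.cos(theta), 0.0])  # unit normal of the plane
    point = np.asarray(point, dtype=float)
    return point - 2.0 * np.dot(point, normal) * normal

def ker_reflect_trajectory(positions, rng=None):
    """Apply a single random reflectional symmetry to a sequence of 3D positions."""
    rng = np.random.default_rng() if rng is None else rng
    theta = rng.uniform(0.0, np.pi)                           # random plane orientation
    return [reflect_point(p, theta) for p in positions]
```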

Note that instead of storing the reflected trajectories in the replay buffer, the random symmetries could instead be applied to sampled minibatches. This approach was tried previously in single-symmetry scenarios (kidzinski2018learning). Doing so, however, is more computationally taxing, as transitions are reflected every time they are sampled, and, more significantly, it leads to lower performance (see our project page for this and more supplementary information; not referenced until after review). Our conjecture is that such an approach leads to lower diversity in the minibatches.

4.2 Goal-Augmented Experience Replay (GER)

GER exploits the formulation of any reward function that defines a successful trajectory as one whose end position is within a small radial threshold (a ball) centered around the goal. When the robot obtains a successful trajectory, we therefore know that it can in fact be considered successful for any goal within a ball centered around its end position. Based on this observation, GER augments successful trajectories by replacing the original goal with a random goal sampled within that ball. This ball can be formally described as $\{ g \in \mathcal{G} \mid d(g, s_T) \le \epsilon_G \}$, where $s_T$ is the final state reached in the original trajectory and $\epsilon_G$ is a threshold, which does not have to be constant for each application of GER. Therefore, GER can be seen as a generalization of HER and can be implemented in the same fashion. This is why, in our architecture, GER is applied on minibatches, like HER.
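For illustration, a minimal sketch of sampling one such augmented goal, assuming the goal space coincides with 3D positions; the uniform-in-volume sampling and the names used are assumptions, not the authors' implementation.

```python
import numpy as np

# Sketch of one GER goal sample: a goal drawn uniformly inside a ball of
# radius eps around the final achieved position of a successful trajectory.
# The 3D goal space and uniform sampling are illustrative assumptions.
def ger_sample_goal(final_position, eps, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)                # random unit direction
    radius = eps * rng.uniform() ** (1.0 / 3.0)           # uniform in the volume of the ball
    return np.asarray(final_position, dtype=float) + radius * direction
```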

5 Preliminary Experimental Results

Our experimental evaluation is performed according to the HER formulation (Andrychowicz2017HindsightReplay). Namely, we use a simulated 7-DOF Fetch Robotics arm trained with DDPG on the pushing, sliding, and pick-and-place tasks.

We design our experiments to demonstrate the effectiveness of each of our propositions and of their final combination, which uses 8 random symmetries for KER and 4 applications of GER (one of which uses a threshold equal to zero in order to also take full advantage of realized goals). We now present some initial experimental results. As shown in Fig. 3, our method vastly improves the learning speed compared to vanilla HER.

In our experiments, we have observed that performance improves monotonically with the number of random symmetries, although the gain diminishes as this number grows. Similar observations can be made for the number of applications of GER.

Figure 3: Comparison of vanilla HER with the combination of 8 KER symmetries and 4 GERs.

6 Conclusion

We proposed two novel data augmentation techniques, KER and GER, to amplify the efficiency of observed samples in a memory replay mechanism. KER exploits reflectional symmetries of the valid workspace (though in general it could be employed with other types of symmetries). GER, as an extension of HER, is specific to goal-oriented tasks where success is defined in terms of a thresholded distance. The combination of these techniques greatly accelerated learning, as demonstrated in our experiments.

Our next step is to use our method to solve the same tasks with a simulated Baxter robot, and then transfer the learned policies to a real Baxter using a sim2real methodology. Furthermore, we aim to extend our propositions to other types of symmetries and to other robotic tasks.

References