Policies Modulating Trajectory Generators (PMTG)
We propose an architecture for learning complex controllable behaviors by having simple Policies Modulate Trajectory Generators (PMTG), a powerful combination that can provide both memory and prior knowledge to the controller. The result is a flexible architecture that is applicable to a class of problems with periodic motion for which one has an insight into the class of trajectories that might lead to a desired behavior. We illustrate the basics of our architecture using a synthetic control problem, then go on to learn speed-controlled locomotion for a quadrupedal robot by using Deep Reinforcement Learning and Evolutionary Strategies. We demonstrate that a simple linear policy, when paired with a parametric Trajectory Generator for quadrupedal gaits, can induce walking behaviors with controllable speed from 4-dimensional IMU observations alone, and can be learned in under 1000 rollouts. We also transfer these policies to a real robot and show locomotion with controllable forward velocity.
The recent success of Deep Learning (DL) on simulated robotic tasks has opened an exciting research direction. Nevertheless, many robotic tasks such as locomotion still remain an open problem for learning-based methods due to their complexity and dynamics. From a DL perspective, one way to tackle these complex problems is to use increasingly complex policies (such as recurrent networks). Unfortunately, more complex policies are harder to train and require even more training data, which is often problematic for robotics.
Robotics is naturally a great playground for combining strong prior knowledge with DL. The robotics literature contains many forms of prior knowledge about locomotion tasks, and nature provides impressive examples of similar architectures. Note that this knowledge does not need to be in the form of perfect examples; it can also be in the form of intuition about the specific robotic problem. For locomotion, for example, it can be expressed as leg movement patterns based on a given gait and external parameters.
We incorporate this intuitive type of prior knowledge into learning in the form of a parameterized Trajectory Generator (TG). We keep the TG separate from the learned policy and define it as a stateful module that outputs actions (e.g. target motor positions) which depend on its internal state and external parameters. We introduce a new architecture in which the policy has control over the TG by modulating its parameters as well as directly correcting its output (Fig. 1). In exchange, the policy receives the TG's state as part of its observation. As the TG is stateful, these connections yield a controller that is implicitly recurrent while using a feed-forward Neural Network (NN) as the learned policy. The advantage of using a feed-forward NN is that learning is often significantly less demanding than with recurrent NNs. Moreover, this separation of the feed-forward policy and the stateful TG makes the architecture compatible with any reward-based learning method.
We call our architecture Policies Modulating Trajectory Generators (PMTG) to stress the interaction between the learned policy and the predefined TG. In essence, we replace the task of learning a locomotion controller by that of learning to modulate a TG in parallel with learning to control a robot.
In this manuscript, we first illustrate the architecture of PMTG using a synthetic control problem. Next, we tackle quadruped locomotion using PMTG. We use desired speed as the control input and different TGs that generate leg trajectories based on parameters such as stride length, frequency, and walking height. We train our policies in simulation using Reinforcement Learning (RL) or Evolutionary Strategies (ES) with policies as simple as one linear layer, using only a four-dimensional proprioceptive observation space (IMU). Finally, we transfer the learned policies to a real robot and demonstrate learned locomotion with controllable speed.
Prior work has used Central Pattern Generators (CPGs) to parameterize the movement of each actuated degree of freedom. To control locomotion, Gay et al. learned a neural network that modulated a CPG. Sharma and Kitani designed phase-parametric policies by exploiting the cyclic nature of locomotion. Tan et al. learned a feedback balance controller on top of a user-specified pattern generator. Although our method is inspired by Tan et al., there are key differences. In Tan et al., the pattern generator is fixed and independent of the feedback control, so the feedback control can only modify the gait in the vicinity of the signal defined by the pattern generator. In contrast, in our architecture, the feedback control modulates the TG, including its frequency. This is crucial since we are interested in dynamically changing the high-level behavior of the locomotion, which requires changing the underlying trajectory and its frequency.
In this paper, our focus is to learn controllable locomotion, in which the robot can change its behavior given external control signals (e.g. changing running speed with a remote). One way to achieve this is to train separate controllers for the corresponding behaviors and switch between them online according to the user-specified signals. However, abruptly switching controllers can cause jerky motion and loss of balance. An alternative is to learn a single generic controller and add the external control signals to the observation space. As only one controller is learned and no transitions are needed, this formulation significantly decreases the difficulty of the task. We choose the second approach and show that, with PMTG, we can learn controllable locomotion efficiently.
Our basic architecture is shown in Fig. 1 and consists of three main blocks: an existing/predefined controller, a learned policy, and the system to control (a robot). In this manuscript, we refer to the existing controller as the Trajectory Generator (TG), because we limit our experimental section to periodic motions. However, PMTG can be extended to various types of predefined controllers (e.g. a kinematic controller) in a straightforward manner.
Just like the robotic system to control, the TG is a black box from the policy's point of view. It receives a number of parameters as inputs from the policy, and it outputs its state and actions at every time step. Hence, learning a policy in PMTG is equivalent to learning to control the original dynamical system (robot) extended by the TG. One simply concatenates the action space of the original problem and the controllable parameters of the TG. Similarly, the observation space is extended with the state variables of the TG. Note that the state of the TG does not affect the reward.
The policy can optionally accept control inputs to allow external control of the robot. These control inputs are also appended to the robot’s observations/state and fed into the policy. This simple formulation allows PMTG to be trained using a large selection of policy optimization methods.
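As a minimal sketch of this formulation (all class and method names are hypothetical, not the paper's code), the extended control loop can be written as a thin wrapper that concatenates the TG's state and the control inputs to the observation and splits the policy output into TG parameters and corrective feedback actions:

```python
class PMTGWrapper:
    """Extends a robot environment with a trajectory generator (TG).

    Hypothetical sketch: the policy observes (robot observation +
    control inputs + TG state) and outputs (TG parameters + feedback
    actions); the robot receives the sum of the TG output and the
    feedback actions.
    """

    def __init__(self, env, tg, num_tg_params):
        self.env = env
        self.tg = tg
        self.num_tg_params = num_tg_params

    def observe(self, robot_obs, control_inputs):
        # Extended observation: robot state + external control + TG state.
        return list(robot_obs) + list(control_inputs) + list(self.tg.state())

    def step(self, policy_output):
        # Extended action space: TG parameters first, feedback actions after.
        tg_params = policy_output[: self.num_tg_params]
        a_fb = policy_output[self.num_tg_params :]
        a_tg = self.tg.step(tg_params)  # TG advances its internal state
        action = [u + v for u, v in zip(a_tg, a_fb)]  # summed actions
        return self.env.step(action)
```

Because the wrapper only reshapes observations and actions, any reward-based optimizer that works on the original environment works unchanged on the extended one.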
The outputs of the controller are the actions that control the robot's actuators. These actions are computed as the sum of the output of the TG and that of the policy (we use the subscript $fb$ for the policy's output because it computes feedback signals):

$$a(t) = a_{tg}(t) + a_{fb}(t).$$
One interpretation of this equation is that the TG generates possibly sub-optimal actions based on parameters chosen by the policy. To improve upon these sub-optimal actions, the policy learns correction terms that are added to the TG's output, or learns to suppress that output entirely if needed.
A different, yet important, interpretation is that the policy optimization algorithm can use the TG as a memory unit. Because we do not place restrictions on the type of TG, it makes sense to think about very simple choices of TGs that still provide the policy with useful memory. For example, imagine choosing a leaky integrate-and-fire neuron with a constant input as the TG and letting the policy control the integration leak rate. In this case, the policy could use the TG as a controllable clock signal. Because of this last interpretation, we only consider feed-forward neural networks for the policy in this work. All the memory is to be provided by the TG. As we will demonstrate using both the synthetic control problem and robot locomotion, these benefits of the TG allow us to efficiently learn architecturally simple policies (e.g. linear) that still generate complex and robust behavior.
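A toy version of this clock idea, a leaky integrate-and-fire unit whose leak rate the policy controls (a sketch; the names and constants are ours, not from the paper):

```python
class LeakyIntegratorTG:
    """Leaky integrate-and-fire unit with a constant drive input.

    The policy controls the leak rate. The state rises toward
    drive/leak_rate and "fires" (resets) at a threshold, so the unit
    acts as a clock whose period the policy can modulate, or stop
    entirely by leaking fast enough that the threshold is never hit.
    Hypothetical illustration, not the TG used in the experiments.
    """

    def __init__(self, threshold=1.0, drive=1.0, dt=0.01):
        self.v = 0.0  # internal state (membrane potential)
        self.threshold, self.drive, self.dt = threshold, drive, dt

    def step(self, leak_rate):
        # Euler step of dv/dt = drive - leak_rate * v.
        self.v += (self.drive - leak_rate * self.v) * self.dt
        if self.v >= self.threshold:  # fire and reset
            self.v = 0.0
            return 1.0
        return 0.0
```

With leak rate 0 the unit fires at a fixed period; with a large leak rate the state saturates below threshold and the clock is gated off.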
We now introduce a synthetic control problem to illustrate how PMTG works. We consider a 2D environment in which a point is to be moved along a desired cyclic trajectory to maximize the returned reward (Fig. 2). The input to the environment (the action space) is the desired next position of the point.
As prior knowledge, under the assumption that the desired trajectory is a highly deformed and displaced version of a figure-eight, we pick a figure-eight curve as the trajectory generator and allow the policy to change its amplitudes along the $x$ and $y$ axes (to limit the complexity of this example, we do not use an external control signal, nor do we allow the policy to control the offset or frequency of the trajectory generator) (Fig. 2):

$$x(t) = A_x \sin(t), \qquad y(t) = A_y \sin(2t),$$

where $t$ represents the current time step and is stored by the TG as its internal state.
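A minimal figure-eight TG in this spirit (a sketch using a standard Lissajous figure-eight; the exact parameterization in the experiments may differ):

```python
import math

class FigureEightTG:
    """Figure-eight trajectory generator with policy-controlled amplitudes.

    Internal state is the current time step; the policy supplies the
    x and y amplitudes at every call. Hypothetical sketch.
    """

    def __init__(self, period=100):
        self.t = 0          # internal state: current time step
        self.period = period

    def step(self, ax, ay):
        # Lissajous figure-eight: y completes two cycles per x cycle.
        phase = 2 * math.pi * self.t / self.period
        self.t += 1
        return ax * math.sin(phase), ay * math.sin(2 * phase)
```

Modulating `ax` and `ay` over time deforms the figure-eight, which is exactly the freedom the policy needs to track the deformed target curve.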
For the policy, the observations are the current position along the $x$ and $y$ coordinates and the state of the TG (the current time step). The actions are the desired next position and the parameters of the TG (the amplitudes used for the figure-eight). The reward is the negative Euclidean distance to a deformed figure-eight. We used the Proximal Policy Optimization (PPO) algorithm for learning, with a fully connected neural network with two layers and ReLU non-linearities (with a hyperparameter search for up to 200 neurons per layer).
Using this architecture, PMTG + PPO reaches almost optimal behavior with a reward close to zero (Fig. 3). For comparison, training a pure reactive controller using Vanilla PPO fails to produce a good result. The failure of the reactive controller can be explained by the nature of the task: a partially observable state space and the lack of memory to distinguish different phases of the target figure. Since the reactive controller lacks time-awareness (or external memory), we also tested Vanilla PPO with a time signal as an additional observation. This combination performed better than Vanilla PPO, but still worse than PMTG. In this example problem we showed a basic TG, its combination with a feed-forward policy, and how PMTG allows a feed-forward policy to learn a problem that is challenging for a pure reactive controller.
Robot locomotion is a challenging problem for learning. Partial observations, noisy sensors combined with latency, and rich contacts all increase the difficulty of the task. Despite the challenges, the nature of locomotion makes it a good fit for PMTG. TG design can be based on the idea that legs follow a periodic motion with specific characteristics. A precise definition of the legs' trajectories is not needed; instead, we can roughly define the family of trajectories using parameters such as stride length, leg clearance, walking height, and frequency. Fig. 4 shows a sample leg trajectory and the parameters based on this idea. The detailed definition of a TG for locomotion can be found in the Appendix.
The detailed architecture adapted to quadruped locomotion is shown in Fig. 5. At every time step, the policy receives the observations, the desired velocity (the control input), and the phase of the trajectory generator. It computes 3 parameters for the TG (frequency, amplitude, and walking height) and 8 actions for the legs of the robot, which are added directly to the TG's calculated leg positions. The sum of these actions is used as desired motor positions, which are tracked by Proportional-Derivative controllers. Since the policy dictates the frequency at each time step, it dictates the step size that is added to the TG's phase. This effectively allows the policy to warp time and use the TG in a time-independent fashion.
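The time-warping mechanism reduces to a phase accumulator whose per-step increment is the policy-chosen frequency (a sketch; the symbol names are ours):

```python
import math

def advance_phase(phase, frequency, dt):
    """Advance the TG phase by a policy-chosen frequency.

    The policy picks `frequency` at every control step, so it controls
    how fast the TG's internal clock runs: frequency 0 freezes the TG,
    and varying it warps time.
    """
    return (phase + 2 * math.pi * frequency * dt) % (2 * math.pi)
```

Because the policy also receives the phase back as an observation, it can react to where the TG is in its cycle rather than to wall-clock time.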
For locomotion, the design of the TG can be as simple as Fig. 4 or can be composed of more complex open-loop mechanisms. We use a stateful module that generates trajectories in an open-loop fashion using 3 parameters (walking height, amplitude, frequency). It is possible to use a TG that is pre-optimized for the given task, or hand-tuned to roughly generate a desired gait. For walking and running gaits, we used a TG that follows the gait shown in Fig. 6 and was pre-optimized as a standalone open-loop controller. Despite the pre-optimization, the TG by itself cannot provide stable forward locomotion, since it lacks feedback from the robot. In addition, for the bounding gait, we tested PMTG with a simpler, hand-tuned TG that is not optimized. The only behavior provided by this TG is swinging the front and back legs with a half-period phase difference (Fig. 6).
We train the locomotion policy using PyBullet to simulate the Minitaur robot from Ghost Robotics. As the training algorithm, we use both Evolutionary Strategies (specifically ARS) and Reinforcement Learning (specifically PPO). During training, we vary the desired forward velocity within each rollout: we start from zero, gradually increase the desired speed to the task's maximum, keep it there for a while, then decrease it back to zero by the end of the rollout. The exact speed profile is shown in Fig. 8. During each rollout, we add random directional perturbation forces (up to 60 N vertical, 10 N horizontal) to the robot multiple times to favor more stable behaviors. Each rollout ends either at a fixed time limit or when the robot falls. The reward function is calculated based on the difference between the desired speed and the robot's speed as:
$$r_t = \min\!\left(1,\; 2 - \frac{|v_t - \bar{v}_t|}{0.2\, v_{max}}\right),$$

where $v_{max}$ is the maximum desired velocity for the task, and $v_t$ and $\bar{v}_t$ are the robot's actual velocity and the target velocity at the current time step. We selected this reward function because it provides the maximum reward when the robot is within 20% of the top speed of the desired speed, and decreases linearly if the difference is higher.
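A direct transcription of this reward, under our reading that the reward is maximal within a band of 20% of the top speed around the target and decreases linearly outside it (the exact clipping used in the experiments may differ):

```python
def speed_tracking_reward(v, v_target, v_max):
    """Speed-tracking reward: 1.0 inside a band of 0.2*v_max around
    the target velocity, decreasing linearly as the error grows."""
    return min(1.0, 2.0 - abs(v - v_target) / (0.2 * v_max))
```

The flat band keeps the optimizer from chasing exact velocity matching, while the linear slope still penalizes falling behind a changing target.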
The observation includes the robot's roll and pitch angles and the angular velocities along these two axes (IMU sensor readings, 4 dimensions total). Overall, the policy uses 7 input dimensions: 4 observation dimensions, the desired velocity as the control input, and the phase of the TG represented by its sine and cosine. The action space of the policy is 11-dimensional: 8 dimensions for the legs (swing and extension for each leg), and 3 parameters consumed by the TG (frequency, amplitude, walking height).
For policy complexity, we evaluated both a two-layer fully connected neural network (up to 200 neurons per layer) and a simple linear policy (77 parameters). We trained the policies on 3 separate tasks: slow walking, fast walking, and bounding, each with a different maximum desired speed. For the bounding gait, we use a different TG with the phases shown in Fig. 6. For the walking gaits, the TG alone does not provide forward motion, but the robot does not immediately fall; the open-loop (TG only) bounding gait fails immediately.
Our architecture makes learning the complex locomotion task easier. When we use PMTG, both algorithms successfully learn controllable locomotion (Fig. 7). Both the linear controller and the two-layer feed-forward neural network achieve the desired behavior. The curves for Vanilla ES-Lin and Vanilla PPO show the results for a reactive controller instead of PMTG (we simply remove the TG, so the policy output alone drives the robot). Without PMTG, both algorithms fail to reach the optimal reward levels: the lower rewards show that the controller learns a walking behavior but cannot fully keep up with the changing target speed. (The literature contains successful learning of reactive controllers on locomotion tasks with less complexity, richer state spaces, and different reward functions.)
By combining PMTG with ARS and a linear policy, we achieved high data efficiency for learning locomotion. The linear policy has relatively few parameters (77), and learning with PMTG required comparatively few rollouts for the given locomotion task. Fig. 7 shows learning curves for the hyperparameters with the fastest learning speed (ES with 8 directions per iteration). We observe that it is possible to learn good policies with ES in fewer than 1000 rollouts. This is possible because we were able to embed prior knowledge into the TG and because the architecture reduces the complexity of the policy learning problem. The number of rollouts is low relative to the complexity of the locomotion task, which opens a research direction toward using PMTG for on-robot learning; we plan this as future work.
Next, we look at the characteristics of a sample converged controller using PMTG and ES with a linear policy. We focus on running instead of walking because it shows considerable changes in TG parameters and gait within a single rollout. Fig. 8 shows a single run after training: the robot has no trouble tracking the desired speed. Fig. 8 also shows that the policy significantly modulates both the amplitude (which commands stride length) and the frequency of the gait depending on the desired speed. These parameters affect the output coming from the TG, but they do not necessarily show the eventual leg movement, since the policy can add corrections. Fig. 8 shows the swing angle of one of the legs during the same rollout. The motion of the leg is periodic, but the shape of the signal changes significantly depending on the speed.
The reality gap between simulation and real environments is a major challenge for learning in robotics. In many scenarios, learning in simulation can easily converge to unrealistic behaviors or exploit the corner cases of a simulator. In PMTG, we provide a class of initial trajectories (TG) that the policy can build upon. The converged policies usually follow the characteristics of the TG, avoiding unrealistic behaviors. Additionally, we use randomization by applying random directional virtual forces to the robot during training to avoid overfitting to the simulation model.
We deployed a number of the learned controllers to the robot to see how our results transfer to the real world. We define success as the robot moving forward at various speeds without falling during the rollout. A short summary of these results is shown in the supplementary video. For slower walking, all the policies worked successfully on the robot. The emergent behaviors are similar to those in simulation: the robot slowly increases its speed, walks at different desired speeds, and slows down to a stop without any observable problems.
The policies trained for walking at faster speeds mostly completed the rollouts successfully, but the legs occasionally slipped at higher speeds. Although slippage affected the overall behavior of certain policies (e.g. distance covered, direction), the robot recovered and continued walking in most trials.
When we used the TG with a bounding gait, we observed different emergent gaits for different learning algorithms and hyperparameters. The policies trained with PPO were the most stable, jumping forward by modulating the walking-height parameter. The policy significantly overrode the gait using its correction ability. The resulting behavior shows the robot moving forward using jumps at different speeds. We also include these different behaviors in the supplementary video.
We introduced a new control architecture (PMTG) that represents prior knowledge as a parameterized Trajectory Generator (TG) and combines it with a learned policy. Unique to our approach, the policy modulates the trajectory generator at every time step based on observations coming from the environment and the TG's internal state. This allows the policy to generate many behaviors based on this prior knowledge, while also using the TG as a memory unit. This combination enables simple reactive policies (e.g. linear) to learn complex behaviors.
To validate our technique, we used PMTG to tackle quadruped robot locomotion. We show the generality of PMTG by training the architecture using both ES and RL algorithms on two different gaits. We used relatively simple policies considering the complexity of the locomotion task, and had success with a linear policy. The policy uses only IMU readings – a low dimensional set of observations for a robot locomotion task – and takes the desired velocity as an external control input. We showed successful transfer of these policies from simulation to the Minitaur robot.
We plan to use our current approach as a starting point for this class of architectures with simple learned policies. In this work we relied on ad hoc trajectory generators that were chosen based on our intuition about a given problem. In future work, we are interested in getting a deeper understanding of which types of trajectory generators work best for a specific domain, possibly extracting trajectory generators from demonstrations. Finally, we are interested in theoretical foundations for PMTG and how it relates to existing models, specifically recurrent ones.
The authors would like to thank the members of the Google Brain Robotics group for their support and help.
The phase $\phi$ of the trajectory generator (between $0$ and $2\pi$) is defined as:

$$\phi_{t+1} = \left(\phi_t + 2\pi f \,\Delta t\right) \bmod 2\pi,$$

where $f$ defines the frequency of the trajectory generator. In the PMTG architecture, $f$ is selected by the policy at each time step as an action.
In this work, we use the following trajectory generator for the legs: the swing $s(\phi)$ and extension $e(\phi)$ of each leg, as shown in Fig. 4, are periodic functions of the phase with the following parameters.
$c_{sw}$ defines the center for the swing DOF (in radians).
$c_{e}$ defines the center for the extension DOF. Extension is represented in terms of the rotation of the two motors in opposite directions, hence the unit is also radians. Since all legs share the same $c_{e}$, it corresponds to the walking height of the robot.
$A_{sw}$ defines the amplitude of the swing signal (in radians). This corresponds to the size of a stride during locomotion.
$A_{e}$ defines the amplitude of the extension during the swing phase. This corresponds to the ground clearance of the feet during the swing phase.
$\theta$ defines the extension difference between when the leg is at the end of the swing and when the leg is at the end of the stance. This is mostly useful for climbing up or down.
We compute a warped phase $\phi'$ based on the swing and stance phases:

$$\phi' = \begin{cases} \dfrac{\phi}{2\beta} & \text{if } \phi < 2\pi\beta, \\[4pt] \pi + \dfrac{\phi - 2\pi\beta}{2(1-\beta)} & \text{otherwise,} \end{cases}$$

where $\beta$ defines the proportion of the duration of the swing phase to the stance phase.
For each leg $i$, the phase is calculated separately as

$$\phi_i = \left(\phi + \Delta\phi_i\right) \bmod 2\pi,$$

where $\Delta\phi_i$ represents the phase difference of this leg compared to the first (left front) leg. This is defined by the selected gait (i.e. walking vs. bounding).
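The per-leg phase offsets can be sketched as a small lookup per gait. The bounding offsets below follow the description in the text (front and back legs a half period apart); the trot offsets are illustrative values we added, not taken from the paper:

```python
import math

# Phase offsets as fractions of a full cycle, relative to the
# left-front leg, in order: LF, RF, LB, RB.
GAIT_OFFSETS = {
    "bound": [0.0, 0.0, 0.5, 0.5],  # front pair vs. back pair, half period apart
    "trot":  [0.0, 0.5, 0.5, 0.0],  # illustrative diagonal-pair offsets
}

def leg_phases(base_phase, gait):
    """Per-leg phase: the base TG phase shifted by the gait-defined offset."""
    two_pi = 2 * math.pi
    return [(base_phase + d * two_pi) % two_pi for d in GAIT_OFFSETS[gait]]
```

Switching gaits then amounts to swapping the offset table while the same phase accumulator drives all four legs.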