Reinforcement learning (RL) agents are able to perform locomotion tasks using continuous actions and observations in animal-like domains such as the planar walker, half-cheetah, and humanoid [lillicrap2015continuous; haarnoja2018soft]. While the agents learn policies that successfully maximise a cumulative return, the behaviours of agents in such domains often appear idiosyncratic to humans. Most notably, locomotion in animals is associated with a natural rhythm and stereotypy, as in the case of a walking dog, a galloping horse, or the tripod gait of an ant. In the absence of this spatial order and temporal rhythm in simulated agents, the behaviours of the learned policies can look unnatural.
Idiosyncratic behaviour can arise in simulated agents because the physical simulation is incorrect, the task does not capture the tasks executed by embodied agents, or the constraints on a simulated agent are different from those on an embodied agent. The physical simulation itself is an inaccurate and incomplete model of the physical world – it allows perfect floors, instantaneous, noise-free actuation, and a windless environment. Embodied agents necessarily have behaviour policies that are robust to deviations from any precise environment crafted for simulation. Additionally, the mechanical and biological constraints that restrict the behaviour of an embodied agent, a robot or an animal, are different from those experienced by simulated agents. Embodied agents are also necessarily affected by notions of mechanical stress and fatigue. For example, an embodied walker would be less likely to drag a leg, due to the notion of wear, or flail around its arms, due to considerations of energy and efficiency.
In this work, we leverage order in the spatial structure of the domain. Specifically, the vast majority of animals have at least one plane of external symmetry, which is reflected in their stereotyped gaits when they are walking or running. Similarly, for simulated bodies with nearly identical limbs, if the agent knows how to move one limb relative to the body in a particular way, it can do the same for the other similar limbs. We use symmetry as a natural inductive bias in order to reduce the space of behaviour policies that can be achieved by agents, as well as to explore if this approach can boost the learning process.
2 Related work
We use invariances in the physical domain and task to augment learning. Invariances to scale, orientation, translation, and relative motion are common to the natural world and a hallmark of the dynamics of tangible objects and of continuous fields. The use of such invariances to reduce the dimensionality of the equations governing a physical process is thus ubiquitous in the natural sciences [barenblatt1996scaling; wilson1971renormalization]. In machine learning, supervised classification tasks use rotation, symmetry, and scale invariances to augment training data for improved performance [baird1992document; simard2003best; krizhevsky2012imagenet]. In RL, symmetry and rotation invariances have been used to augment data in AlphaGo [silver2016mastering]. In robotics and continuous control applications, ways to improve the real-world suitability and sample efficiency of RL agents include curriculum learning [bengio2009curriculum], learning from motion-capture data [merel2018neural], and using constraint-based objective functions [bohez2019value]. Our work follows the same theme of making the policies learned by RL agents more realistic, while aiming to make the learning process efficient.
3 Domain and tasks
To illustrate the use of symmetry, we can use any animal-like domain with a sagittal plane of symmetry, a plane of external symmetry common to many animals. We use the quadruped domain from Tassa et al. [deepmindcontrolsuite2018]. The body of the quadruped comprises a torso with four legs. Each leg is identical, with three hinge joints, as shown in Figure 1 (left). The joints are controlled by torque actuators. We consider the move tasks from Tassa et al. [deepmindcontrolsuite2018], in which positive reward is given for an upright orientation relative to the horizontal plane and for a forward velocity of the torso, relative to a specified desired velocity, in the frame of reference of the torso. Within the move tasks, the walk task and run task are specified by different values of the desired velocity. Details of the reward function are given in Tassa et al. [deepmindcontrolsuite2018]. The observations comprise joint angles and velocities, forces and torques on the joints, the state of the actuators, and the torso velocity.
In the move tasks, for every observation-action pair that the agent encounters, there is a corresponding mirrored observation-action pair that can be obtained by reflection about the plane of symmetry. This is illustrated in Figure 1 (right) by means of a pose with the legs shown in blue and its corresponding mirrored pose with the legs shown in pink. The mirroring operations for observations and actions are essentially matrix multiplications, and the mirrored quantities are obtained cheaply from the originals. The mirrored observation-action pair results in the same forward displacement, and therefore the same reward. Consequently, for every trajectory executed by the agent, there is a corresponding mirrored trajectory that results in the same sequence of rewards.
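As a minimal sketch of the mirroring operation described above: reflection about the sagittal plane amounts to multiplying the observation (or action) vector by a fixed signed permutation matrix that swaps left/right quantities and flips the sign of lateral ones. The 4-dimensional observation layout here is a hypothetical illustration, not the quadruped's actual observation spec.

```python
import numpy as np

# Hypothetical observation layout, for illustration only:
# [left_hip_angle, right_hip_angle, forward_velocity, lateral_velocity].
# Reflection swaps left/right joints and negates lateral quantities;
# forward quantities are unchanged.
M_obs = np.array([
    [0.0, 1.0, 0.0, 0.0],   # left hip  <- right hip
    [1.0, 0.0, 0.0, 0.0],   # right hip <- left hip
    [0.0, 0.0, 1.0, 0.0],   # forward velocity unchanged
    [0.0, 0.0, 0.0, -1.0],  # lateral velocity changes sign
])

def mirror(x, M):
    """Reflect an observation (or action) vector about the plane of symmetry."""
    return M @ x

obs = np.array([0.3, -0.1, 1.2, 0.4])
mirrored = mirror(obs, M_obs)
# The reflection is an involution: applying it twice recovers the original.
assert np.allclose(mirror(mirrored, M_obs), obs)
```

The same construction applies to the action vector with its own reflection matrix, so generating a mirrored transition costs only two small matrix multiplications.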
4 Symmetric policy
A symmetric gait such as a human walk is characterised by a phase difference between the legs, and an antisymmetric gait such as that of a galloping horse is characterised by an in-phase synchrony between the left and right halves of the body. When we use the term symmetric policy, we mean that the probability $\pi(a \mid s) = \pi(\bar{a} \mid \bar{s})$, where $\bar{s}$ and $\bar{a}$ are obtained by reflecting the state $s$ and action $a$, respectively, about a plane of symmetry. By this definition, the biological symmetric and antisymmetric gaits are both executed by a symmetric policy.
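The symmetric-policy condition can be checked numerically. The sketch below assumes a linear-Gaussian policy and 2-D reflection matrices `M_s` (states) and `M_a` (actions); none of these particulars come from the paper's setup. For a mean function $\mu(s) = As$, symmetry holds whenever $M_a A = A M_s$.

```python
import numpy as np

# Hypothetical reflection matrices for a toy 2-D state and 2-D action space.
M_s = np.diag([1.0, -1.0])   # reflect the lateral state component
M_a = np.diag([-1.0, 1.0])   # reflect the first action component

A = np.array([[0.0, -0.5],
              [0.7,  0.0]])  # policy mean: mu(s) = A @ s

def log_prob(a, s):
    """Log-density of a unit-variance Gaussian policy N(A s, I),
    with additive constants dropped (equal residuals give equal values)."""
    d = a - A @ s
    return -0.5 * d @ d

s = np.array([0.4, 1.1])
a = np.array([0.2, -0.3])
# This A satisfies M_a @ A = A @ M_s, so the policy is symmetric:
# pi(a|s) equals pi(M_a a | M_s s).
assert np.allclose(M_a @ A, A @ M_s)
assert np.isclose(log_prob(a, s), log_prob(M_a @ a, M_s @ s))
```

Because the reflection matrices are orthogonal, the Gaussian residual norm is preserved under mirroring, which is why the two log-densities agree.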
5 Proposed algorithm
We use MPO, an actor-critic algorithm described in Abdolmaleki et al. [abdolmaleki2018maximum]. To encourage the agent to take symmetries into account, we augment the batch of experiences generated by the actor and stored in the replay buffer by computing a set of corresponding mirrored trajectories for use by the learner of the MPO algorithm [abdolmaleki2018maximum]. The MPO algorithm has two steps, policy evaluation and policy improvement. In this section, we summarise the algorithm and specify the modifications required to use data from mirrored trajectories, with pseudocode of our approach given in Algorithm 1.
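The augmentation step can be sketched as follows. The dictionary-of-arrays batch format and the reflection matrices `M_obs`/`M_act` are illustrative assumptions, not the actual replay format used with MPO.

```python
import numpy as np

def augment_batch(batch, M_obs, M_act):
    """Append mirrored transitions to a batch drawn from the replay buffer.

    `batch` is a dict of arrays keyed by 's', 'a', 'r', 's_next'
    (a simplified stand-in for the real replay format). The reward is
    invariant under the reflection, so it is simply duplicated.
    """
    mirrored = {
        's': batch['s'] @ M_obs.T,
        'a': batch['a'] @ M_act.T,
        'r': batch['r'],
        's_next': batch['s_next'] @ M_obs.T,
    }
    return {k: np.concatenate([batch[k], mirrored[k]]) for k in batch}

# Toy example: 2-D observations, 1-D actions.
M_obs = np.diag([1.0, -1.0])   # second observation component flips sign
M_act = np.array([[-1.0]])     # the single action component flips sign
batch = {
    's': np.array([[0.4, 1.1]]),
    'a': np.array([[0.2]]),
    'r': np.array([0.5]),
    's_next': np.array([[0.3, -0.9]]),
}
out = augment_batch(batch, M_obs, M_act)  # batch size doubles from 1 to 2
```

Each learner step then sees both the sampled transitions and their reflections, at the cost of a constant factor in batch size.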
5.1 Policy Evaluation
We employ 1-step temporal difference (TD) learning, fitting a parametric Q-function $Q_\phi(s_t, a_t)$ with parameters $\phi$ by minimizing the squared TD error
$$L(\phi) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1})}\left[\left(Q_\phi(s_t, a_t) - y_t\right)^2\right],$$
where $y_t = r_t + \gamma \, \mathbb{E}_{a \sim \pi(\cdot \mid s_{t+1})}\left[Q_{\phi'}(s_{t+1}, a)\right]$ and $\gamma$ is the discount factor, which we optimize via gradient descent. We let $\phi'$ be the parameters of a target network that is held constant for a fixed number of steps (and then copied from the optimized parameters $\phi$). Using the fact that, for a symmetric policy, $Q(s_t, a_t) = Q(\bar{s}_t, \bar{a}_t)$, we also fit the parametric Q-function for the mirrored pairs $(\bar{s}_t, \bar{a}_t)$.
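As a toy sketch of this objective evaluated on both original and mirrored transitions: the linear Q-function and the 2-D state / 1-D action dimensions below are illustrative assumptions, not the paper's network.

```python
import numpy as np

def q(phi, s, a):
    """Hypothetical linear Q-function Q_phi(s, a) = phi . [s; a]."""
    return phi @ np.concatenate([s, a])

def td_loss(phi, phi_target, transitions, gamma=0.99):
    """Mean squared TD error over (s, a, r, s_next, a_next) tuples,
    where a_next is sampled from the current policy at s_next and
    phi_target parameterises the frozen target network."""
    err = 0.0
    for s, a, r, s_next, a_next in transitions:
        y = r + gamma * q(phi_target, s_next, a_next)  # TD target y_t
        err += (q(phi, s, a) - y) ** 2
    return err / len(transitions)

# Mirroring doubles the data the loss is evaluated on.
M_s, M_a = np.diag([1.0, -1.0]), np.array([[-1.0]])
orig = [(np.array([0.4, 1.1]), np.array([0.2]), 0.5,
         np.array([0.3, -0.9]), np.array([0.1]))]
mirrored = [(M_s @ s, M_a @ a, r, M_s @ sn, M_a @ an)
            for s, a, r, sn, an in orig]
loss = td_loss(np.zeros(3), np.zeros(3), orig + mirrored)
```

With zero parameters the prediction and bootstrap value vanish, so the loss reduces to the mean squared reward over the doubled batch.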
5.2 Policy Improvement
Step 1: We construct a non-parametric improved policy $q(a \mid s)$. This is done by maximizing $\mathbb{E}_{a \sim q}\left[Q_{\phi'}(s, a)\right]$ for states $s$ drawn from a replay buffer, while ensuring that the solution stays close to the current policy $\pi_k$, i.e. $\mathbb{E}_{s}\left[\mathrm{KL}\left(q(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s)\right)\right] < \epsilon$, as detailed in Abdolmaleki et al. [abdolmaleki2018maximum].
Step 2: We fit a new parametric policy to samples from $q(a \mid s)$ by solving the optimization problem $\pi_{k+1} = \arg\min_{\pi} \mathbb{E}_{s}\left[\mathrm{KL}\left(q(\cdot \mid s) \,\|\, \pi(\cdot \mid s)\right)\right]$, where $\pi_{k+1}$ is the new and improved policy, and where the optimization employs additional regularization [abdolmaleki2018maximum]. To learn from mirrored data, we repeat steps 1 and 2 for mirrored states $\bar{s}$ and mirrored actions $\bar{a}$ calculated from the original data.
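The two improvement steps can be sketched on a discrete toy action set. The closed-form exponential reweighting and the fixed temperature `eta` (from the KL constraint) follow the general MPO scheme, but the three-action categorical setting is an illustrative assumption.

```python
import numpy as np

def improved_policy(pi, q_values, eta=1.0):
    """Step 1 (sketch): non-parametric improved policy
    q(a|s) proportional to pi(a|s) * exp(Q(s,a) / eta)."""
    w = pi * np.exp(q_values / eta)
    return w / w.sum()

def fit_policy(q_nonparam):
    """Step 2 (sketch): weighted maximum likelihood. For an unconstrained
    categorical policy the optimum simply matches the non-parametric
    targets; a parametric policy would instead take gradient steps."""
    return q_nonparam.copy()

pi = np.array([0.25, 0.25, 0.5])       # current policy over 3 actions
q_values = np.array([1.0, 2.0, 0.0])   # Q-values for those actions
q_np = improved_policy(pi, q_values)
pi_new = fit_policy(q_np)
# The improved policy shifts mass toward higher-value actions.
assert q_np[1] > pi[1]
```

Running the same two steps on the mirrored states and actions, as described above, gives the symmetric counterpart of each update.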
We refer to training with the MPO algorithm of [abdolmaleki2018maximum] as "normal training" and training with the additional experience from the mirrored data, as described in Section 3, as "augmented training". For both training conditions, we use the same hyper-parameters as in [abdolmaleki2018maximum], with one main modification: we slowed down the actor to a trajectory-generation rate of 1000 s of environment steps every 30 s of real time, to bring it closer to the data-generation rate of real robots. The learner thus operates in a data-limited regime. The resulting cumulative reward as a function of gradient steps is shown in Figure 2. For each task, the augmented training condition achieves better performance, typically in fewer episodes. This preliminary result suggests that knowledge of symmetry, enforced through data augmentation in the policy optimisation step, is a useful bias for the agent to shape its policy.
We presented an approach for incorporating symmetry in the environment into an actor-critic architecture by augmenting the experiences generated by the actor. A different approach to inform an agent of symmetry in its domain could, for instance, compress the state representation by explicitly specifying the relationship between an observation-action pair and its mirrored version. This involves changing the underlying Markov decision process so that transitions are between "canonical" observations. In general, this means that transitions do not arise from a physical process, and it may require modifications to the algorithm to ensure that no undesirable bias is introduced. A promising extension of this work would be to move towards a policy that explicitly encodes the symmetry using hierarchy, with similar legs sharing their lower-level policy with the corresponding actuators of other legs, and using experiences from the other legs to update their policies. Using symmetric policies changes the space of policies that can be achieved, and therefore leads to a qualitative change in gait; we plan to investigate this further. This work can also be useful for transfer learning to tasks which are not symmetric: an agent may first learn to walk using knowledge of symmetry to be data-efficient, and then learn to navigate a non-uniform external landscape.
We would like to thank Ankit Anand, Eser Aygün, Tom Erez, Philippe Hamel, Yuval Tassa and Daniel Toyama.