Meta-Reinforcement Learning for Adaptive Motor Control in Changing Robot Dynamics and Environments

by   Timothée Anne, et al.

This work developed a meta-learning approach that adapts the control policy on the fly to changing conditions for robust locomotion. The proposed method constantly updates the interaction model, samples feasible sequences of actions, estimates the resulting state-action trajectories, and then applies the optimal actions to maximize the reward. To achieve online model adaptation, our proposed method learns a different latent vector for each training condition, and the latent vector is selected online given the newly collected data. Our work designs appropriate state space and reward functions, and optimizes feasible actions in an MPC fashion, sampling them directly in the joint space subject to constraints, hence requiring no prior design of specific walking gaits. We further demonstrate the robot's capability of detecting unexpected changes during interaction and adapting control policies quickly. The extensive validation on the SpotMicro robot in a physics simulation shows adaptive and robust locomotion skills under varying ground friction, external pushes, and different robot models including hardware faults and changes.






I Introduction

In robot motor control, developing responsive control policies that adapt to unforeseen environments is crucial to task success. These changes and unexpected situations can be intrinsic or extrinsic, such as robot damage, motor failure, varying friction, and external force disturbances. For robot locomotion, traditional approaches to planning and control require expert knowledge and accurate dynamics models and constraints of both the robot and the environment [1, 2], all of which are subject to changes that are difficult to know beforehand. Moreover, even data-efficient learning techniques such as Bayesian optimization, used to tune decision variables and control parameters, achieve adaptation only on a trial-by-trial basis [3] and require extensive computation [4], so they cannot respond to changes on the fly.

Recent advances in Reinforcement Learning (RL) have led to algorithms achieving human-like or animal-level performance in a range of difficult control tasks. Model-free RL can perform a global search of control parameters and obtain globally optimal gaits when combined with walking pattern generation [5]. Also, an RL-based feedback policy can achieve human-like bipedal walking by imitating human motion capture data [6]. With simulation training that includes actuator properties, a model-free RL scheme can train different locomotion policies separately and deploy them on a real quadrupedal robot [7]. By using multi-expert learning, a hierarchical RL architecture can learn to fuse multiple motor skills and generate multimodal locomotion coherently on a real quadruped [8]. However, in general, model-free RL algorithms have limited sample efficiency, resulting in long training times to produce viable policies. For example, it took a model-free RL algorithm 83 hours to achieve human performance on the Atari game suite, compared to 15 minutes for a human [9]. Similarly, AlphaStar [10] used the equivalent of 200 years of real-time play to reach expert human performance in StarCraft II.

On the other hand, model-based approaches can achieve comparatively high performance while being more sample-efficient by several orders of magnitude, converging faster than model-free approaches for locomotion tasks [11]. To deploy robots in the real world, online adaptation to changes in the environment is required, as pre-trained policies cannot account for all conditions, such as drastic changes to the environment or to the robot through amputation. Hence, meta-learning, or learning to learn, is a promising approach for solving such generic adaptations. Model-based meta-RL has been used in the real world to adapt the control of a six-legged millirobot to different floor conditions [12]. A model-based meta-RL algorithm, FAMLE, was used in the real world on a Minitaur quadruped, where a latent black-box context vector encoded different environment conditions [13].

Fig. 1: Adaptive and robust locomotion against uncertainties: (a) external force, (b) slippery ground, (c) faulty motors, and (d) leg amputation.
Fig. 2: Schematic view of the model-based meta-RL control framework for legged locomotion.

Our proposed method improves on this prior work and requires no prior knowledge of specific gaits. For example, FAMLE relies on sinusoidal gaits and therefore needs to optimize the amplitudes and phases of sinusoidal patterns by model predictive control at a low frequency of 0.5Hz. In our work, we sample directly in the joint space at a much higher frequency of 50Hz, and we further improve the sampling process by specifying constraints on the velocity, acceleration and jerk of the desired joint trajectories. Our study extensively validated the capability of adaptation in simulated test scenarios with large variations in floor friction, external forces or unexpected damage to joints.

Because it is based on the interaction model, our method allows the reward function to be changed online and can therefore modify the behavior of the robot. For example, the learned controller can track a variable forward velocity even though it has been trained on a fixed desired velocity. Likelihood estimation with condition latent vectors allows the meta-model to adapt to already seen conditions. Meta-training should allow "on the fly" optimization to better adapt to the current unknown condition.

In this paper, we present an improved model-based meta-RL approach to quadruped locomotion that is capable of online adaptation to changing environments, as shown in Fig. 1. The main contributions of this work are:

  1. The proposed algorithm is capable of learning from scratch and requires no prior knowledge of the type of gaits, such as periodic phases of leg movements.

  2. Our method introduces and applies hard constraints of velocity, acceleration and jerk on the sampled actions during the search process.

  3. The capability and robustness of online adaptation to changes in both the robot and the environment, such as external force disturbances, varying frictions, faulty motors and leg amputation.

The remainder of the paper is organized as follows. We outline the background in Section II and related work in Section III. In Section IV, we elaborate the methodology and technical details on the model-based RL algorithm and the improvements by meta-learning. Section V presents extensive simulation validations, results and analysis. Finally, we conclude and suggest future work in Section VI.

II Background

This section presents the preliminaries of RL, Model Predictive Control (MPC) and meta-learning.

II-A Reinforcement Learning

In reinforcement learning, the agent learns to solve a task in an unknown environment $\mathcal{E}$, defined by a Markov decision process $(\mathcal{S}, s_0, \mathcal{A}, \mathcal{T}, \mathcal{R})$, where $\mathcal{S}$ is the set of continuous states of the environment $\mathcal{E}$, $s_0$ the initial state, $\mathcal{A}$ the set of continuous actions the agent can perform in the environment, $\mathcal{T}$ the probabilistic transition function and $\mathcal{R}$ the reward function.

The goal of the agent is to learn a policy $\pi_\phi$, parameterized by $\phi$, which decides which action to perform given the current state to maximize the long-term reward $\sum_{t=0}^{H-1} \gamma^t r_t$, where $H$ is the horizon and $\gamma$ is the discount factor.
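As a small worked illustration of the long-term reward defined above, the following sketch computes a discounted sum over a finite horizon. The function name and the reward values are hypothetical and chosen only to make the arithmetic easy to check.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite sequence of rewards."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Three steps with reward 1 and discount 0.5: 1 + 0.5 + 0.25
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(example)  # 1.75
```

A discount factor below 1 makes rewards far in the future count less, which is one reason a short planning horizon can still give useful action choices.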

Model-free RL focuses on directly learning such a policy, whereas model-based RL focuses on learning a model of the transition function (the transition of states given the current state and actions), which can be used to train the policy with fictive transitions or with model predictive control.

II-B Model Predictive Control

Given the current state $s_t$ and a horizon $H$, Model Predictive Control (MPC) uses a forward model of dynamics to select an action sequence $a_{t:t+H-1}$ which maximizes the predicted cumulative reward $\sum_{k=0}^{H-1} \gamma^k r_{t+k}$. The agent performs the first action $a_t$ from the action sequence and collects the resulting state $s_{t+1}$. The MPC then repeats this optimization, which allows the agent to compensate for possible errors in the model prediction. In contrast to model-free RL, using model-based RL in an MPC fashion lets us change the reward online to control the agent's behavior.

II-C Meta-Learning

We use meta-learning to train an agent to solve several tasks, where the neural network learns to adapt to several varying conditions, such as different floor frictions, the presence of external disturbances or a damaged motor. For the neural network model, the initial set of weights $\theta^*$ must be found, such that only a small number of gradient descent steps with little collected data in an unknown environment can produce effective adaptations.

III Related Work

III-A Model-Free Deep Reinforcement Learning

Proximal Policy Optimization [14] has been used to train a model-free controller for Minitaur in simulation, producing trotting and galloping gaits [15] on the real robot. Soft Actor-Critic (SAC) [16] has also been used to train on the real Minitaur robot within two hours [17, 18]. Nevertheless, the limited sample efficiency of model-free methods motivates us to focus on model-based RL.

III-B Model-Based Deep Reinforcement Learning

There are three types of model-based RL: learning to predict the expected return from a starting state distribution, for example using Bayesian Optimization [19]; learning to predict the outcome from a given starting state and given policy [20, 21, 22]; and learning to model the transition function using a forward dynamical model. Here, we use the third type of model.

Forward models of dynamics are either deterministic or probabilistic. Deterministic models can be linear models [23] or neural networks [24], whereas probabilistic models estimate uncertainty, either to model stochastic environments or to estimate the long-term prediction uncertainty. Gaussian Processes [25, 26] can be used for this purpose, and Bayesian neural networks [27] can be used to scale the abilities of Gaussian Process models to higher dimensional environments.

For locomotion tasks, model-based RL with a forward model can match the performance of model-free methods while requiring at least an order of magnitude fewer samples [11]. In that approach, an ensemble of feed-forward neural networks models the forward dynamics of the environment with uncertainty estimation. MPC uses this uncertainty estimation to formulate a more robust control, which alleviates early overfitting in model-based RL. The same method has been used with meta-learning to adapt the control of a real six-legged millirobot to different floors [12].

III-C MPC and Meta-Learning

Several optimization methods have been used for model-based RL, for example, Model Predictive Path Integral [28], random shooting [17] or the cross-entropy method [29, 11]. We use random shooting for its simplicity, easy parallelization and proven performance on real robots [12].

There are two main meta-learning methods: either a meta-learner model outputs the set of initial weights of the learner [30], or the initial weights are optimized using a meta-loss, by gradient descent [31] or evolutionary strategies [32]. Gaussian processes have also been used [33], but only for low-dimensional environments. Meta-RL has been used with model-free RL [34], model-based RL [12] or a mix of both [35]. For model-based RL, gradient-based meta-learning was shown to be more data-efficient, resulting in better and faster adaptation [12]. Hence, we use gradient-based meta-learning.

To increase generalization and adaptation to unseen conditions of the environment, an adversarial loss has been used [36]. Other methods employ context variables [37], bias transformation [38], or condition latent vectors [13] to learn different inputs or sub-parts of the model for different conditions, and then adapt these sub-parts to the current condition.

IV Methodology

This section presents details of the model-based RL algorithm (Section IV-A) and the meta-learning algorithm (Section IV-B). We highlight our improvements, which result in a new robot capability of robust and versatile walking without a predefined, parameterized gait.

IV-A Model-Based Reinforcement-Learning algorithm

The model-based RL algorithm runs at 50Hz, sending desired actions to PD controllers running at 250Hz to generate torques for physics simulation. The algorithm is composed of two main parts: the forward dynamics model, and MPC. Fig. 2 illustrates the schematics of the control framework.
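The two control rates can be illustrated with a short sketch: each 50Hz action is held while the 250Hz PD loop runs five times. The gains, the joint inertia and the single-joint integrator below are hypothetical placeholders for the physics simulation, not values from this work.

```python
POLICY_HZ = 50    # rate at which the model-based RL algorithm emits actions
PD_HZ = 250       # rate of the low-level PD controllers
SUBSTEPS = PD_HZ // POLICY_HZ  # PD updates per action


def pd_torque(q_des, q, qd, kp, kd):
    """PD law: spring toward the desired angle plus velocity damping."""
    return kp * (q_des - q) - kd * qd


def hold_action(q_des, q, qd, kp=20.0, kd=0.5, inertia=0.01):
    """Hold one desired joint angle for a full policy period.

    The joint is modeled as a pure inertia integrated with semi-implicit
    Euler, which is only a stand-in for the physics engine.
    """
    dt = 1.0 / PD_HZ
    for _ in range(SUBSTEPS):
        tau = pd_torque(q_des, q, qd, kp, kd)
        qd = qd + (tau / inertia) * dt
        q = q + qd * dt
    return q, qd


q_end, qd_end = hold_action(q_des=0.1, q=0.0, qd=0.0)
print(SUBSTEPS, q_end)
```

With these toy values the joint moves toward the target during one policy period without reaching it, which is why the sequence of desired angles needs to be smooth.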

IV-A1 The Forward Model of Dynamics

We use a fully-connected feed-forward neural network $f_\theta$, with two hidden layers of 256 units using a ReLU (Rectified Linear Unit) activation function. It takes the concatenation of the current state and action ($[s_t, a_t]$) as input, and learns to predict the difference in the resulting state: $f_\theta(s_t, a_t) \approx s_{t+1} - s_t$, which is a standard means to get the prediction $\hat{s}_{t+1} = s_t + f_\theta(s_t, a_t)$.

The model parameter $\theta$ is the set of weights of the connections between the units. It is optimized using the gradient-based optimizer Adam [39] on a dataset $\mathcal{D}$ of triplets $(s_t, a_t, s_{t+1})$ using mean squared error as the loss function. We depict details of the model-based RL algorithm in Algorithm 1.
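A minimal NumPy sketch of such a forward model is given below, covering only the forward pass and the mean squared error on state differences. The dimensions (33 state values, 12 joint actions), the initialization scheme and all function names are assumptions made for illustration. Training with Adam is omitted because it would need an automatic differentiation library.

```python
import numpy as np

STATE_DIM = 33    # hypothetical: 12 joint angles + 12 joint velocities + 9 base values
ACTION_DIM = 12   # hypothetical: one desired angle per joint


def init_mlp(in_dim, out_dim, hidden=256, seed=0):
    """Two hidden layers of `hidden` units; He-style random weights, zero biases."""
    rng = np.random.default_rng(seed)
    sizes = [in_dim, hidden, hidden, out_dim]
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def predict_delta(params, state, action):
    """Predict the state difference from the concatenated state and action."""
    x = np.concatenate([state, action], axis=-1)
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x


def predict_next_state(params, state, action):
    """Next-state prediction: current state plus predicted difference."""
    return state + predict_delta(params, state, action)


def mse_loss(params, states, actions, next_states):
    """Mean squared error between predicted and observed state differences."""
    target = next_states - states
    return float(np.mean((predict_delta(params, states, actions) - target) ** 2))


params = init_mlp(STATE_DIM + ACTION_DIM, STATE_DIM)
s = np.zeros(STATE_DIM)
a = np.zeros(ACTION_DIM)
print(predict_next_state(params, s, a).shape)
```

Predicting the difference rather than the next state itself means an untrained network with small outputs already predicts "nothing changes", which is usually a reasonable starting point at a 50Hz step.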


Compared to the work in [13] for the Minitaur, our study formulates the state space as the joint angular positions and velocities, the base orientation angles and angular rates, and the linear base velocities. Including the angular velocity of the base is key to successfully controlling the robot at 50Hz. In contrast, only the Euler angles and angular rates of the base in the horizontal plane were used in [13] to control the Minitaur gait parameters at a much lower frequency of 0.5Hz.

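Algorithm 1 Model-based reinforcement learning algorithm
Input: Environment $\mathcal{E}$.
  $\mathcal{D}$ = episodes in $\mathcal{E}$ with a random controller
  $\theta$ = random weights
  for $n$ episodes do
     Train $f_\theta$ on $\mathcal{D}$ using Adam
     for $t = 0$ to $T$ steps do
        $a_t$ = MPC($f_\theta$, $s_t$, $a_{<t}$)
        $s_{t+1}$ = state of $\mathcal{E}$ after applying $a_t$
        $\mathcal{D} = \mathcal{D} \cup \{(s_t, a_t, s_{t+1})\}$
     end for
  end for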

IV-A2 Model Predictive Control

We implement random shooting, which is well suited to parallel computing; the algorithm is detailed in Algorithm 2. At each time step, $N$ action sequences of length $H$ are sampled. Each sequence is evaluated starting from the current state, using the model to estimate the corresponding state trajectory. From these trajectories, the long-term reward is computed, and the first action of the sequence with the highest estimated reward is selected.
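The random shooting step described above can be sketched as follows. This is a simplified version under stated assumptions: candidate actions are drawn uniformly within box limits rather than with the continuity constraints used in this work, and `model`, `reward_fn` and the toy one-dimensional system are hypothetical stand-ins.

```python
import numpy as np


def random_shooting(model, reward_fn, state, n_samples, horizon, gamma,
                    action_low, action_high, rng):
    """Return the first action of the best sampled action sequence."""
    action_dim = action_low.shape[0]
    # Candidate sequences, shape (n_samples, horizon, action_dim).
    actions = rng.uniform(action_low, action_high,
                          size=(n_samples, horizon, action_dim))
    # All rollouts start from the same current state.
    states = np.repeat(state[None, :], n_samples, axis=0)
    returns = np.zeros(n_samples)
    for t in range(horizon):
        next_states = model(states, actions[:, t])
        returns += (gamma ** t) * reward_fn(states, actions[:, t], next_states)
        states = next_states
    best = int(np.argmax(returns))
    return actions[best, 0]


def toy_model(states, actions):
    """Hypothetical dynamics: the action is added to the state."""
    return states + actions


def toy_reward(states, actions, next_states):
    """Hypothetical reward: stay close to the target value 1.0."""
    return -np.abs(next_states[:, 0] - 1.0)


rng = np.random.default_rng(0)
first_action = random_shooting(toy_model, toy_reward, np.zeros(1),
                               n_samples=512, horizon=3, gamma=0.9,
                               action_low=np.array([-1.0]),
                               action_high=np.array([1.0]), rng=rng)
print(first_action)
```

Because all candidates are rolled out as one batch, the per-step cost is a single batched model evaluation, which is what makes the method easy to parallelize. The reward function is an argument, so it can be swapped at run time without retraining the model.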

Real actuators have inherent limitations in velocity, acceleration and jerk. Instead of uniformly sampling desired joint angles within the limits, continuity constraints are used, where each desired joint state of the sequence is sampled using previous joint positions to ensure velocities, accelerations and jerks are smooth and below their respective limits.

As an improvement over the previous work [13], we enforced physical constraints during sampling of actions: $|\dot{q}| \le \dot{q}_{max}$, $|\ddot{q}| \le \ddot{q}_{max}$, and $|\dddot{q}| \le \dddot{q}_{max}$, where $q$, $\dot{q}$, $\ddot{q}$ and $\dddot{q}$ are the desired joint angle, velocity, acceleration and jerk, respectively. The limits of velocity, acceleration and jerk are the soft constraints for the smoothness and continuity of actions. For safety reasons, regarding the joint position limits, we further imposed hard constraints of sampled actions on $q$ to avoid hitting the physical limit of joint movements.
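One way to realize this kind of constrained sampling is sketched below. It is an illustration under assumptions, not the sampling procedure of this work: derivatives are approximated by finite differences over the last three desired angles, all limit values are hypothetical, and when the limits conflict the sketch lets velocity and position take precedence over acceleration and jerk.

```python
import numpy as np


def sample_next_angle(q_hist, dt, v_max, a_max, j_max, q_min, q_max, rng):
    """Sample the next desired joint angles from a feasible interval.

    q_hist has shape (3, n_joints): the last three desired angles, oldest first.
    """
    q3, q2, q1 = q_hist                      # q1 is the most recent
    v_prev = (q1 - q2) / dt
    a_prev = (v_prev - (q2 - q3) / dt) / dt
    # Jerk limit bounds the next acceleration, then clip to the acceleration limit.
    a_lo = np.clip(a_prev - j_max * dt, -a_max, a_max)
    a_hi = np.clip(a_prev + j_max * dt, -a_max, a_max)
    # Acceleration interval bounds the next velocity, then clip to the velocity limit.
    v_lo = np.clip(v_prev + a_lo * dt, -v_max, v_max)
    v_hi = np.clip(v_prev + a_hi * dt, -v_max, v_max)
    # Velocity interval bounds the next angle, then clip to the joint range.
    q_lo = np.clip(q1 + v_lo * dt, q_min, q_max)
    q_hi = np.clip(q1 + v_hi * dt, q_min, q_max)
    q_new = q_lo + rng.random(q1.shape) * (q_hi - q_lo)
    return np.clip(q_new, q_min, q_max)      # hard position limit


def sample_trajectory(q_hist, horizon, dt, v_max, a_max, j_max, q_min, q_max, rng):
    """Roll the sampler forward to obtain one action sequence."""
    hist = np.array(q_hist, dtype=float)
    out = []
    for _ in range(horizon):
        q_new = sample_next_angle(hist, dt, v_max, a_max, j_max, q_min, q_max, rng)
        out.append(q_new)
        hist = np.vstack([hist[1:], q_new])
    return np.array(out)


rng = np.random.default_rng(1)
traj = sample_trajectory(np.zeros((3, 12)), horizon=50, dt=0.02, v_max=5.0,
                         a_max=100.0, j_max=5000.0, q_min=-1.0, q_max=1.0, rng=rng)
print(traj.shape)
```

Compared with uniform sampling over the whole joint range, each candidate here stays close to the previous command, so far fewer samples are wasted on sequences the actuators could not follow.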

These improvements on sampling restrict the MPC to a more suitable subspace. During training in the default condition, they increased the distance traveled during a 10s episode by a factor of 2 and reduced the observed jerk (in rad/s³) by a factor of 5, in both cases with p-value 0.001 on 20 episodes.

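Algorithm 2 MPC algorithm
Inputs: A model $f_\theta$, initial state $s_0$, past actions $a_{past}$, horizon $H$ and discount factor $\gamma$.
  Sample $a^i_{0:H-1}$ using $a_{past}$ for continuity
  for $i = 1$ to $N$ do
     $s^i_0 = s_0$, $R_i = 0$
  end for
  for $t = 0$ to $H-1$ steps do
     for all $i = 1$ to $N$ samples do
        $s^i_{t+1} = s^i_t + f_\theta(s^i_t, a^i_t)$
        $R_i = R_i + \gamma^t r(s^i_t, a^i_t, s^i_{t+1})$
     end for
  end for
  $a^* = a^{i^*}_0$, $i^* = \arg\max_i R_i$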

IV-B Meta-learning algorithm

Before meta-training, an expert is trained for each training condition using the proposed model-based RL algorithm to collect its training data. To adapt the model to each condition $c$, a specific latent vector $z_c$ is optimized during meta-learning using the regression loss on the data of the corresponding condition. This vector of fixed dimension is then given to the input layer, alongside the current state and action, when the condition is selected. We use a first-order meta-learning algorithm called Reptile [40], which is composed of two phases: meta-training (Algorithm 3) and meta-adaptation (Algorithm 4).

IV-B1 Meta-training

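Algorithm 3 Meta-learning training, called once before adaptation
Inputs: datasets $\mathcal{D}_1, \dots, \mathcal{D}_C$ from $C$ different conditions.
  $\theta^*$ = random weights
  for $c = 1$ to $C$ do
     $z_c$ = random vector
  end for
  for $k = 1$ to $K$ do
     select a condition $c$, $\theta = \theta^*$
     for $j = 1$ to $J$ do
        $\theta, z_c = \mathrm{Adam}(\mathcal{L}(\mathcal{D}_c; \theta, z_c))$
     end for
     $\theta^* = \theta^* + \epsilon_k (\theta - \theta^*)$
  end for

Algorithm 4 Meta-learning adaptation, called at each step
Inputs: a meta-trained set of weights $\theta^*$, a list of learned condition latent vectors $z_1, \dots, z_C$, a dataset $\mathcal{D}$ of the past $M$ steps.
  for $c = 1$ to $C$ do
     $p_c$ = likelihood of $\mathcal{D}$ under $f_{\theta^*}(\cdot, z_c)$
  end for
  $c^* = \arg\max_c p_c$, $\theta = \theta^*$, $z = z_{c^*}$
  for $j = 1$ to $J$ do
     $\theta, z = \mathrm{Adam}(\mathcal{L}(\mathcal{D}; \theta, z))$
  end for
  return $\theta$, $z$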

The initial set of weights $\theta^*$ and each condition latent vector $z_c$ are optimized for adaptation. Meta-training is separated into two nested loops. In the inner loop, one training dataset $\mathcal{D}_c$ and its corresponding condition latent vector $z_c$ are selected. The model weights $\theta$ are initialized to $\theta^*$, and Adam [39] optimizes both $\theta$ and $z_c$ for the regression loss on the current dataset $\mathcal{D}_c$.

In the outer loop, $\theta^*$ is optimized by taking a small step, with a linearly decreasing schedule, towards the optimized weights $\theta$ of the inner loop. This allows $\theta^*$ to converge to a point near (in the Euclidean sense) the optimal set of weights of each training condition. We detail the algorithm in Algorithm 3.
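The two nested loops can be illustrated on a toy regression problem. In this sketch the "model" is a single shared slope plus one scalar latent offset per condition, the inner loop uses plain gradient descent as a stand-in for Adam, and conditions are visited in a fixed cycle. All of these choices, along with the names and hyper-parameters, are assumptions made to keep the example small.

```python
import numpy as np


def inner_loop(w, z, x, y, lr=0.1, steps=5):
    """A few gradient steps on the MSE of y_hat = w * x + z (stand-in for Adam)."""
    for _ in range(steps):
        err = w * x + z - y
        grad_w = 2.0 * np.mean(err * x)
        grad_z = 2.0 * np.mean(err)
        w, z = w - lr * grad_w, z - lr * grad_z
    return w, z


def reptile(datasets, outer_iters=100, eps0=1.0):
    """First-order meta-training of a shared weight and per-condition latents."""
    w_meta = 0.0
    latents = np.zeros(len(datasets))
    for k in range(outer_iters):
        c = k % len(datasets)                  # cycle through training conditions
        x, y = datasets[c]
        # Inner loop: start from the meta weight, adapt weight and latent together.
        w, latents[c] = inner_loop(w_meta, latents[c], x, y)
        # Outer loop: small step toward the adapted weight, linearly decreasing.
        eps = eps0 * (1.0 - k / outer_iters)
        w_meta += eps * (w - w_meta)
    return w_meta, latents


x = np.linspace(-1.0, 1.0, 21)
offsets = [-0.5, 0.0, 0.5]                     # one hidden offset per toy condition
datasets = [(x, 2.0 * x + b) for b in offsets]
w_meta, latents = reptile(datasets)
print(w_meta, latents)
```

In this toy setting the shared slope ends up at the value common to all conditions, while each latent absorbs the condition-specific offset, which mirrors the split between the shared weights and the condition latent vectors.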

IV-B2 Meta-adaptation

At each time step, we select the most likely training condition $c$ using the previous $M$ time steps, each condition latent vector $z_c$ and the set of weights $\theta^*$. We then optimize the corresponding latent vector $z_c$ and the set of model weights $\theta$, starting from $\theta^*$, using the same optimization procedure as the inner loop but with the past $M$ steps. We detail the algorithm in Algorithm 4.

After the set of weights and the condition latent vector are optimized for the current condition, we use the MPC to select the optimal action to apply. The new state information is then collected, and the whole meta-adaptation iterates. This procedure allows any change in the condition to be detected, so the agent can adapt accordingly.
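The condition selection step can be sketched as picking the latent vector whose predictions best explain the most recent transitions. The sketch below uses the lowest mean squared prediction error as a proxy for the highest likelihood, which is equivalent only under an assumed Gaussian noise model. The toy predictor and data are hypothetical.

```python
import numpy as np


def select_condition(predict, weights, latents, inputs, targets):
    """Return the index of the latent with the lowest error on recent data."""
    errors = np.array([np.mean((predict(weights, z, inputs) - targets) ** 2)
                       for z in latents])
    return int(np.argmin(errors)), errors


def toy_predict(w, z, x):
    """Hypothetical model: shared slope w plus a condition-specific offset z."""
    return w * x + z


rng = np.random.default_rng(0)
recent_x = rng.uniform(-1.0, 1.0, size=5)      # the past 5 time steps
recent_y = 2.0 * recent_x + 0.5                # generated under the third condition
best, errors = select_condition(toy_predict, 2.0, [-0.5, 0.0, 0.5],
                                recent_x, recent_y)
print(best, errors)
```

Running this selection at every step on a short window is what lets a change of condition show up as a switch in the selected index after only a few new transitions.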

IV-C Limitations

Apart from the standard robot control tasks of tuning PD gains and joint limits, the proposed method still requires fine-tuning of the reward function, the model architecture, and the hyper-parameters for the meta-learning and the adaptation. Using MPC instead of a neural network policy involves a trade-off between real-time computation and performance: MPC adapts better but requires more computation because of the sampling procedure.

Fig. 3: Walking in presence of large uncertainties: (a) constant external force disturbance, (b) low-friction slippery ground, (c) faulty motors, and (d) with one missing leg.

V Results

We used a custom version of the robot model (adapted from the open source SpotMicro robot [41]) in PyBullet simulation to validate our method. Here, we first present the learning capability of the model-based RL algorithm on SpotMicro with a first adaptation to a sequence of different conditions (Section V-A). We further validate the adaptation capability of the proposed meta-learning under various fixed frictions (Section V-B) and time-varying, decreasing friction (Section V-C).

V-A Overview

Fig. 4: Walking with a continuously changing velocity from 0.5m/s to -0.2m/s: (a) forward, (b) static, and (c) backward.
Fig. 5: Measured joints trajectories of SpotMicro from a 20s test scenario where the robot or environment changed every 5s: default friction (green), blocked right hip pitch joint (red), slippery ground (yellow) and constant external push (brown).
Expert \ Condition   Default           Slippery          Lateral Force     Damaged Motor
Default              100%, 3.2 ± 0.2   20%, 0.7 ± 0.7    40%, 2.0 ± 1.0    100%, 0.7 ± 0.2
Slippery             30%, 1.3 ± 0.5    90%, 2.4 ± 0.4    10%, 0.7 ± 0.4    100%, 0.4 ± 0.1
Lateral Force        0%                0%                70%, 2.3 ± 0.7    90%, 0.4 ± 0.1
Damaged Motor        90%, 0.5 ± 0.2    70%, 0.3 ± 0.2    30%, 0.6 ± 0.3    100%, 2.5 ± 0.1
Meta-Trained         70%, 3.0 ± 0.6    80%, 2.4 ± 0.7    40%, 2.0 ± 1.3    100%, 2.7 ± 0.1
TABLE I: Success rate and average distance of travel (m) over 10 episodes for the different experts and the meta-trained model.

We trained the expert model under the default friction condition, and the resulting default walking controller is robust to perturbations, withstanding several pushes of 10N for 0.2s. After 300 10s-episodes of training data in the given condition, the quadruped was able to walk on slippery, low-friction ground, against external forces, or with a blocked motor or a missing (amputated) leg. Tab. I compares the experts and the meta-trained model under different conditions.

Using the proposed meta-learning method, the agent was able to adapt to four different conditions: default, blocked front-right hip motor, slippery ground and external forces. Its average distance traveled was compared with that of the default expert over 20 episodes. The joint trajectories from these test scenarios are shown in Fig. 5.

Fig. 6: Online tracking of continuous and variable walking velocity using the model trained only at the fixed velocity of 0.5m/s, while the desired velocity in the reward function changes from 0.5m/s to -0.2m/s during the test scenario.
Fig. 7: Online tracking of discrete and variable walking velocity using the model trained only at the fixed velocity of 0.5m/s, while every 2s the desired velocity is randomly sampled between -0.2m/s and 0.5m/s during the test scenario.

The controller can achieve variable walking speed: despite being trained only at a constant desired forward velocity, it can be commanded with different desired velocities online and continuously, as shown in Fig. 4 and Fig. 6. Moreover, the trained expert controller can also generate continuous control actions to track discrete, discontinuous commanded velocities (see Fig. 7).

Fig. 8: Distance traveled for the full range of frictions from 3 expert models and the meta-trained model.
Fig. 9: Distance traveled with decreasing friction from the 3 expert models and the meta-trained model.

V-B Ground with Constant Friction

We evaluated the adaptation capability of the meta-trained model and compared it to the experts over the full range of frictions (0.1 to 0.8 in 0.05 increments). We first trained 5 sets of experts for frictions 0.2, 0.4 and 0.6, each using 300 10s-episodes, i.e., 50 minutes of data. Then we meta-trained 5 meta-models to adapt to these 3 frictions, each using the data of one set of experts, with the aim that they could afterwards adapt to the full range.

Each set of these 5 models was evaluated for each friction with 4 10s-episodes, so this gives 20 evaluations per expert and 20 evaluations for the meta-learning. The meta-trained models outperformed the experts on the full range of frictions, see Fig. 8. As expected, each expert had its best performance when the friction constant is around its trained value.

V-C Ground with Decreasing Friction


As a comparison, we used the same experts and meta-models and benchmarked their adaptation capability on a ground with continuously decreasing friction. We evaluated each set of models with 4 10s-episodes where the friction coefficient started at 0.8 and linearly decreased to 0.1. This gave 20 evaluations per expert and 20 evaluations for meta-learning.

The meta-trained models demonstrated better walking performance and traversed farther (3.38m) than the experts (3.07m), a difference supported by a t-test (see Fig. 9). In Fig. 10, the curves and shaded areas are the means and standard deviations of the velocity, respectively. Snapshots of the walking gait using the meta-trained model are shown in Fig. 3, and more details of the walking performance can be seen in the accompanying video.

Additionally, Fig. 11 depicts the estimated condition at each time step using the past 0.1s (i.e. 5 time steps). At the beginning of the episode, when friction was higher, the model estimated a friction of 0.6 to be more likely (from 0.8 down to 0.5, i.e., 0-4.5s), then switched to a friction of 0.4 (from 0.5 down to 0.3, i.e., 4.5-7s), and finished by estimating a more likely friction of 0.2 (from 0.3 down to 0.1, i.e., 7-10s).

Fig. 10: Forward walking velocity with decreasing friction from 3 expert models and the meta-trained model.
Fig. 11: Likelihood of each pre-trained condition using online meta-adaptation in presence of decreasing friction.

V-D Discussion

The results validate the efficiency and effectiveness of the meta-learning method to detect the most probable current condition and adapt accordingly to different ground friction coefficients. The online adaptive walking using the proposed meta-learning outperformed the specialized experts which were specifically trained on specific frictions. This demonstrated the capability of meta-learning to incorporate knowledge from all training data.

We also tested an extreme case of unseen hardware failure. We specifically designed a case for meta-training where a motor of one leg of the quadruped was blocked at a fixed joint position (an emulated actuator failure), and tested whether meta-learning can adapt to the same damage on a different leg faster than learning from scratch. Our investigation showed that the meta-adaptation is not able to adapt to such changes on a different leg, and we hypothesize that a second-order meta-learning algorithm, e.g. Model-Agnostic Meta-Learning [31], may be a better solution for such an extreme case.

VI Conclusion and future work

Based on the past work of model-based meta-RL [13], we have made contributions to improve the algorithm for adaptive and robust quadruped locomotion in changing robot dynamics (motor failure and amputation) and varying environmental constraints (time-varying friction, external pushes). In physics-based simulation, we have demonstrated this method can learn quadrupedal walking without using a periodic gait signal [13] or a phase vector [6]. Instead, by updating an interaction model of the robot and environment and applying the optimal control actions, a walking gait is naturally generated as the outcome of maximizing the task reward. We further validated the capability of our proposed framework in adapting to different conditions such as robot damage, changing friction, and external force disturbances.

Future work will apply this method on the real SpotMicro robot and identify potential issues of sim2real transfer, which will be addressed by new solutions for the meta-model and a more effective search in the model predictive control procedure. We hypothesize that the current meta-learning algorithm is efficient enough for multi-task learning, which can be further studied. Also, incorporating a second-order meta-learning algorithm [31] could potentially achieve better adaptation to novel situations.

VII Acknowledgement

This work has been supported by EPSRC UK Robotics and Artificial Intelligence Hub for Offshore Energy Asset Integrity Management (EP/R026173/1).