In robot motor control, developing responsive control policies that adapt to unforeseen environments is crucial to task success. These changes and unexpected situations can be intrinsic or extrinsic, such as robot damage, motor failure, varying friction, and external force disturbances. For robot locomotion, traditional approaches of planning and control require expert knowledge and accurate dynamics models and constraints of both the robot and the environment [1, 2], all of which are subject to unforeseeable changes that are difficult to know beforehand. Moreover, even data-efficient learning techniques such as Bayesian optimization for tuning decision variables and control parameters can only achieve adaptation on a trial-by-trial basis, and the extensive computation they require prevents responding to changes on the fly.
Recent advances in Reinforcement Learning (RL) have led to algorithms achieving human-like or animal-level performance in a range of difficult control tasks. Model-free RL can perform a global search of control parameters and obtain globally optimal gaits when combined with walking pattern generation. Also, an RL-based feedback policy can achieve human-like bipedal walking by imitating human motion capture data. With simulation training that includes actuator properties, a model-free RL scheme can train different locomotion policies separately and deploy them on a real quadrupedal robot. Using multi-expert learning, a hierarchical RL architecture can learn to fuse multiple motor skills and generate multimodal locomotion coherently on a real quadruped. However, in general, model-free RL algorithms have limited sample efficiency, resulting in long training times to produce viable policies. For example, it took a model-free RL algorithm 83 hours to achieve human performance on the Atari game suite, compared to 15 minutes for a human. Similarly, AlphaStar used 200 years of equivalent real-time experience to reach expert human performance playing StarCraft II.
On the other hand, model-based approaches can achieve comparatively high performance while being more sample-efficient by several orders of magnitude, converging faster than model-free approaches for locomotion tasks. To deploy robots in the real world, online adaptation to changes in the environment is required, as not all conditions can be covered by pre-trained policies, such as drastic changes to the environment or to the robot itself (e.g., by amputation). Hence, meta-learning, or learning to learn, is a novel and more promising approach for solving such generic adaptations. Model-based meta-RL has been used in the real world to adapt the control of a six-legged millirobot to different floor conditions. A model-based meta-RL algorithm, FAMLE, was deployed in the real world on a Minitaur quadruped, where a latent black-box context vector encoded different environment conditions.
Our proposed method makes new improvements that require no prior knowledge of specific gaits. For example, FAMLE relies on sinusoidal gaits and therefore needs to optimize the amplitudes and phases of sinusoidal patterns via model predictive control at a low frequency of 0.5 Hz. In our work, we directly sample in the joint space at a much higher frequency of 50 Hz, and we further improve the sampling process by specifying constraints on the velocity, acceleration and jerk of the desired joint trajectories. Our study extensively validated the capability of adaptation in simulated test scenarios with large variations in floor friction, external forces and unexpected damage to joints.
Based on the interaction model, our method allows the reward function to be changed online and is therefore able to modify the behavior of the robot. For example, the learned controller can track a variable forward velocity, even though it was trained on a fixed desired velocity. Likelihood estimation with condition latent vectors allows the meta-model to adapt to previously seen conditions. Meta-training should allow "on the fly" optimization to better adapt to the current unknown condition.
In this paper, we present an improved model-based meta-RL approach to quadruped locomotion that is capable of online adaptation to changing environments, as shown in Fig. 1. The main contributions of this work are:
The proposed algorithm is capable of learning from scratch and requires no prior knowledge of the type of gait, such as periodic phases of leg movements.
Our method introduces and applies hard constraints on the velocity, acceleration and jerk of the sampled actions during the search process.
The capability and robustness of online adaptation to changes in both the robot and the environment, such as external force disturbances, varying friction, faulty motors and leg amputation.
The remainder of the paper is organized as follows. We outline the background in Section II and related work in Section III. In Section IV, we elaborate on the methodology and technical details of the model-based RL algorithm and the improvements made by meta-learning. Section V presents extensive simulation validations, results and analysis. Finally, we conclude and suggest future work in Section VI.
This section presents the preliminaries of RL, Model Predictive Control (MPC) and meta-learning.
II-A Reinforcement Learning
In reinforcement learning, the agent learns to solve a task in an unknown environment $E$, defined by a Markov decision process $(S, s_0, A, T, R)$, where $S$ is the set of continuous states of the environment, $s_0$ the initial state, $A$ the set of continuous actions the agent can perform in the environment, $T(s_{t+1} \mid s_t, a_t)$ the probabilistic transition function and $R(s_t, a_t)$ the reward function.
The goal of the agent is to learn a policy $\pi_\theta$, parameterized by $\theta$, which decides which action to perform given the current state so as to maximize the long-term reward $\sum_{t=0}^{H} \gamma^t R(s_t, a_t)$, where $H$ is the horizon and $\gamma \in [0, 1]$ is the discount factor.
Model-free RL focuses on directly learning such a policy, whereas model-based RL focuses on learning a model of the transition function $T$ (the transition of states given the current state and action), which can be used to train the policy with fictive transitions or to act through model predictive control.
II-B Model Predictive Control
Given the current state $s_t$ and a horizon $H$, Model Predictive Control (MPC) uses a forward model of the dynamics to select an action sequence $(a_t, \ldots, a_{t+H})$ which maximizes the predicted cumulative reward $\sum_{k=t}^{t+H} R(\hat{s}_k, a_k)$. The agent performs the first action $a_t$ of the sequence and collects the resulting state $s_{t+1}$. The MPC then repeats this optimization at every time step, which allows the agent to alleviate possible errors in the model prediction. Compared to model-free RL, model-based RL used in an MPC fashion lets us change the reward online to control the agent's behavior.
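As an illustration, a random-shooting variant of this loop can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: `model`, `reward_fn`, the action bounds and all sizes are placeholder assumptions.

```python
import numpy as np

def random_shooting_mpc(state, model, reward_fn, n_samples=500, horizon=20,
                        action_dim=12):
    """Pick the first action of the best of n_samples random action sequences.

    `model(states, actions)` is assumed to return the predicted next states,
    `reward_fn(states, actions)` the one-step rewards; both are placeholders.
    """
    # Sample candidate action sequences uniformly in [-1, 1] (assumed bounds)
    sequences = np.random.uniform(-1.0, 1.0,
                                  size=(n_samples, horizon, action_dim))
    returns = np.zeros(n_samples)
    states = np.repeat(state[None, :], n_samples, axis=0)
    for t in range(horizon):
        actions = sequences[:, t, :]
        returns += reward_fn(states, actions)   # accumulate predicted reward
        states = model(states, actions)         # roll the learned model forward
    best = np.argmax(returns)
    return sequences[best, 0, :]                # execute only the first action
```

At the next time step the newly observed state replaces the model's prediction and the whole search is re-run, which is what gives MPC its robustness to model error.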
II-C Meta-Learning
We use meta-learning to train an agent to solve several tasks, where the neural network learns to adapt to several varying conditions, such as different floor frictions, the presence of external disturbances or having a damaged motor. For the neural network model, an initial set of weights $\theta_0$ must be found such that only a small number of gradient descent steps, with little collected data in an unknown environment, can produce effective adaptations.
III Related Work
III-A Model-Free Deep Reinforcement Learning
Proximal Policy Optimization has been used to train a model-free controller for the Minitaur in simulation, producing trotting and galloping gaits on the real robot. Soft Actor-Critic (SAC) has also been used to train on the real Minitaur robot within two hours [17, 18]. This limitation in sample efficiency motivates us to focus on model-based RL.
III-B Model-Based Deep Reinforcement Learning
There are three types of model-based RL: learning to predict the expected return from a starting state distribution, for example using Bayesian Optimization ; learning to predict the outcome from a given starting state and given policy [20, 21, 22]; and learning to model the transition function using a forward dynamical model. Here, we use the third type of model.
Forward models of dynamics are either deterministic or probabilistic. Deterministic models can be linear models or neural networks, while probabilistic models estimate uncertainty for modeling stochastic environments or estimating long-term prediction uncertainty; examples include Gaussian Processes [25, 26] and Bayesian neural networks, which can be used to scale the abilities of Gaussian Process models to higher-dimensional environments.
For locomotion tasks, model-based RL with a forward model can reach the same performance as model-free methods while requiring at least an order of magnitude fewer samples. An ensemble of feed-forward neural networks can be used to model the forward dynamics of the environment with uncertainty estimation; MPC uses this uncertainty estimation to formulate a more robust control which alleviates early overfitting in model-based RL. The same method has been used with meta-learning to adapt the control of a real six-legged millirobot to different floors.
III-C MPC and Meta-Learning
Several optimization methods have been used for model-based RL, for example Model Predictive Path Integral, random shooting, or the Cross-Entropy Method [29, 11]. We use random shooting for its simplicity, easy parallelism and proven performance on real robots.
There are two main meta-learning approaches: a meta-learner model outputs the set of initial weights of the learner, or the initial weights are optimized directly using a meta-loss, via gradient descent or evolutionary strategies. Gaussian processes have also been used, but only for low-dimensional environments. Meta-RL has been combined with model-free RL, model-based RL, or a mix of both. For model-based RL, gradient-based meta-learning was shown to be more data-efficient, resulting in better and faster adaptation. Hence, we use gradient-based meta-learning.
To increase generalization and adaptation to unseen conditions of the environment, an adversarial loss has been used. Other methods employ context variables, bias transformation, or condition latent vectors, to learn different inputs or sub-parts of the model for different conditions, and then adapt these sub-parts to the current condition.
This section presents details of the model-based RL algorithm (Section IV-A) and the meta-learning algorithm (Section IV-B). We highlight our improvements, which result in a new robot capability of robust and versatile walking without a predefined, parameterized gait.
IV-A Model-Based Reinforcement Learning Algorithm
The model-based RL algorithm runs at 50 Hz, sending desired actions to PD controllers running at 250 Hz to generate torques for the physics simulation. The algorithm is composed of two main parts: the forward dynamics model and MPC. Fig. 2 illustrates the schematics of the control framework.
IV-A1 The Forward Model of Dynamics
We use a fully-connected feed-forward neural network with two hidden layers of 256 units using the ReLU (Rectified Linear Unit) activation function. It takes the concatenation of the current state and action $(s_t, a_t)$ as input, and learns to predict the difference in the resulting state, $\Delta s_t = s_{t+1} - s_t$, which is a standard means to obtain the prediction $\hat{s}_{t+1} = s_t + \Delta s_t$.
The model parameter $\theta$ is the set of weights of the connections between the units. It is optimized using the gradient-based optimizer Adam on a dataset of triplets $(s_t, a_t, s_{t+1})$ using the mean squared error as the loss function. We depict details of the model-based RL algorithm in Algorithm 1.
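The forward pass of such a network can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the state and action dimensions and the weight initialization are illustrative, and the Adam training loop on the MSE loss is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM, HIDDEN = 18, 12, 256   # dims are placeholder assumptions

# Weights of the two hidden layers (paper: 256 ReLU units each) and output layer
W1 = rng.normal(0, 0.05, (STATE_DIM + ACTION_DIM, HIDDEN)); b1 = np.zeros(HIDDEN)
W2 = rng.normal(0, 0.05, (HIDDEN, HIDDEN));                 b2 = np.zeros(HIDDEN)
W3 = rng.normal(0, 0.05, (HIDDEN, STATE_DIM));              b3 = np.zeros(STATE_DIM)

def predict_next_state(s, a):
    """Forward pass: the network outputs the state difference Delta_s,
    and the prediction is s' = s + Delta_s, as described above."""
    x = np.concatenate([s, a], axis=-1)
    h = np.maximum(x @ W1 + b1, 0.0)          # first hidden layer, ReLU
    h = np.maximum(h @ W2 + b2, 0.0)          # second hidden layer, ReLU
    delta = h @ W3 + b3                       # predicted state difference
    return s + delta
```

Predicting the difference rather than the absolute next state keeps the regression target small and centered, which is the standard trick the paragraph above refers to.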
Compared to the work in  for the Minitaur, our study formulates the state space as: the angular joint positions and velocities, the base orientation angles and angular rates, and the linear base velocities. The addition of the angular velocity of the base is key to our success in controlling the robot at 50 Hz. In contrast, only Euler angles and angular rates of the base in the horizontal plane were used in  to control the Minitaur gait parameters at a much lower frequency of 0.5 Hz.
IV-A2 Model Predictive Control
We implement the method of random shooting, which is suitable for parallel computing; the algorithm is detailed in Algorithm 2. At each time step, $N$ action sequences of length $H$ are sampled. Each sequence is evaluated starting from the current state, using the model to estimate the corresponding state trajectory. From these trajectories, the long-term reward is computed, and the first action of the sequence with the highest estimated reward is selected.
Real actuators have inherent limitations in velocity, acceleration and jerk. Instead of uniformly sampling desired joint angles within the limits, continuity constraints are used: each desired joint state of the sequence is sampled using the previous joint positions to ensure that velocities, accelerations and jerks are smooth and below their respective limits.
As an improvement over the previous work , we enforce physical constraints during the sampling of actions: $|\dot{q}_d| \leq \dot{q}_{\max}$, $|\ddot{q}_d| \leq \ddot{q}_{\max}$, and $|\dddot{q}_d| \leq \dddot{q}_{\max}$, where $q_d$, $\dot{q}_d$, $\ddot{q}_d$ and $\dddot{q}_d$ are the desired joint angle, velocity, acceleration and jerk, respectively. The limits on velocity, acceleration and jerk are soft constraints for the smoothness and continuity of actions. For safety reasons, regarding the joint position limits, we further impose hard constraints on the sampled actions, $q_d \in [q_{\min}, q_{\max}]$, to avoid hitting the physical limits of joint movements.
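One way to realize such constrained sampling is to turn each derivative limit into an interval for the next desired joint position via finite differences over the last three targets, and sample uniformly from the intersection of the intervals. This is a hedged sketch: the limit values, the 50 Hz time step and the function name are illustrative assumptions, and an empty intersection is handled crudely by collapsing the interval.

```python
import numpy as np

def sample_constrained_action(q_hist, q_lim, v_max, a_max, j_max, dt=0.02,
                              rng=np.random.default_rng()):
    """Sample a desired joint position respecting velocity, acceleration and
    jerk limits, estimated by finite differences over the last three targets.

    q_hist = [q_{t-3}, q_{t-2}, q_{t-1}]; each entry is a joint-position array.
    """
    q3, q2, q1 = q_hist
    v1 = (q1 - q2) / dt                 # last velocity estimate
    a1 = (q1 - 2 * q2 + q3) / dt ** 2   # last acceleration estimate
    # Each limit bounds the next target q_t to an interval around q_{t-1}:
    lo = np.max([q_lim[0],
                 q1 - v_max * dt,                          # velocity limit
                 q1 + v1 * dt - a_max * dt ** 2,           # acceleration limit
                 q1 + v1 * dt + (a1 - j_max * dt) * dt ** 2], axis=0)  # jerk
    hi = np.min([q_lim[1],
                 q1 + v_max * dt,
                 q1 + v1 * dt + a_max * dt ** 2,
                 q1 + v1 * dt + (a1 + j_max * dt) * dt ** 2], axis=0)
    return rng.uniform(lo, np.maximum(lo, hi))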
This improvement to sampling confines the MPC search to a more suitable subspace. During training, it increased the distance traveled during a 10 s episode by a factor of about 2 compared to sampling without the constraints (p-value < 0.001 over 20 episodes). It also reduced the observed jerk by a factor of about 5 (p-value < 0.001 over 20 episodes).
IV-B Meta-Learning Algorithm
Before meta-training, an expert is trained for each training condition using the proposed model-based RL algorithm to collect its training data. To adapt the model to each condition $c$, a specific latent vector $h_c$ is optimized during meta-learning using the regression loss on the data of the corresponding condition. This vector of fixed dimension is then given to the input layer, alongside the current state and action, when the condition is selected. We use a first-order meta-learning algorithm called Reptile , which is composed of two phases: meta-training (Algorithm 3) and meta-adaptation (Algorithm 4).
The initial set of weights $\theta_0$ and each condition latent vector $h_c$ are optimized for adaptation. Meta-training is separated into two nested loops. In the inner loop, one training dataset $D_c$ and its corresponding condition latent vector $h_c$ are selected. The model weights are initialized to $\theta_0$, and Adam optimizes both the weights and the latent vector for the regression loss on the current dataset $D_c$.
In the outer loop, $\theta_0$ is optimized by taking a small step, with a linearly decreasing schedule, towards the optimized weights of the inner loop. This allows $\theta_0$ to converge to a point nearby (in the Euclidean sense) the optimal sets of weights of each training condition. We detail the algorithm in Algorithm 3.
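The outer loop can be sketched as a first-order Reptile update. In this minimal sketch, `inner_opt` stands in for the inner-loop Adam adaptation, the weights are flattened into one array, and all names and hyper-parameters are illustrative assumptions.

```python
import numpy as np

def reptile_meta_train(theta0, datasets, inner_opt, n_outer=1000, eps0=0.1):
    """First-order Reptile outer loop (sketch).

    inner_opt(weights, dataset) is assumed to run a few gradient steps on the
    regression loss for one training condition and return the adapted weights.
    """
    theta = np.copy(theta0)
    for k in range(n_outer):
        eps = eps0 * (1.0 - k / n_outer)            # linearly decaying step
        data = datasets[k % len(datasets)]          # pick one training condition
        adapted = inner_opt(np.copy(theta), data)   # inner-loop adaptation
        theta += eps * (adapted - theta)            # move toward adapted weights
    return theta
```

Because the update only moves the initialization toward each condition's adapted weights, no second-order derivatives are needed, which is what makes Reptile a first-order method.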
At each time step, we select the most likely training condition using the previous time steps, each condition latent vector and the set of weights $\theta_0$. We then optimize the corresponding latent vector and the set of model weights, starting from $\theta_0$, using the same optimization procedure as in the inner loop but with the past steps. We detail the algorithm in Algorithm 4.
After the set of weights and the condition latent vector are optimized for the current condition, we use the MPC to select the optimal action to apply, and then new state information is collected, and the whole meta-adaptation iterates. This procedure allows any changes in the condition to be detected, and therefore the agent can adapt accordingly.
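The condition-selection step can be sketched as choosing the latent vector whose conditioned model best explains the recent transitions. This is an illustrative sketch: `model` and all names are assumptions, and a squared one-step prediction error stands in for the likelihood estimation.

```python
import numpy as np

def most_likely_condition(model, latents, history):
    """Return the index of the condition latent vector with the lowest mean
    squared one-step prediction error over the recent history.

    `model(s, a, h)` is a placeholder for the meta-model conditioned on the
    latent vector h; `history` holds recent (s, a, s_next) transitions.
    """
    errors = []
    for h in latents:
        err = np.mean([np.sum((model(s, a, h) - s_next) ** 2)
                       for s, a, s_next in history])
        errors.append(err)
    return int(np.argmin(errors))
```

Re-running this cheap comparison at every time step is what lets the controller notice a condition change (e.g., the floor becoming more slippery) within a fraction of a second and switch its adaptation target.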
Apart from the standard classical robot control tuning of PD gains and joint limits, the proposed method still requires fine-tuning of the reward function, model architecture, and hyper-parameters for the meta-learning and the adaptation. The use of MPC instead of a neural network policy entails a trade-off between real-time computation and performance: MPC performs better in terms of adaptation but requires more computation for the sampling procedure.
We used a custom version of the robot model (adapted from the open-source SpotMicro robot ) in PyBullet simulation to validate our method. Here, we first present the learning capability of the model-based RL algorithm on SpotMicro, with adaptation to a sequence of different conditions (Section V-A). We further validate the adaptation capability of the proposed meta-learning under various fixed frictions (Section V-B) and a time-varying, decreasing friction (Section V-C).
| Expert \ Condition | Default | Slippery | Lateral Force | Damaged Motor |
|---|---|---|---|---|
| Default | 100%, 3.2 ± 0.2 | 20%, 0.7 ± 0.7 | 40%, 2.0 ± 1.0 | 100%, 0.7 ± 0.2 |
| Slippery | 30%, 1.3 ± 0.5 | 90%, 2.4 ± 0.4 | 10%, 0.7 ± 0.4 | 100%, 0.4 ± 0.1 |
| Lateral Force | 0% | 0% | 70%, 2.3 ± 0.7 | 90%, 0.4 ± 0.1 |
| Damaged Motor | 90%, 0.5 ± 0.2 | 70%, 0.3 ± 0.2 | 30%, 0.6 ± 0.3 | 100%, 2.5 ± 0.1 |
| Meta-Trained | 70%, 3.0 ± 0.6 | 80%, 2.4 ± 0.7 | 40%, 2.0 ± 1.3 | 100%, 2.7 ± 0.1 |
We trained the expert model with a default friction condition, and the resulting default controller for walking is robust to perturbations, withstanding several pushes of 10 N for 0.2 s. After 300 episodes of 10 s, which produced training data in the given condition, the quadruped was able to walk on slippery ground with a lower friction coefficient, against external forces, or with a blocked motor or a missing/amputated leg. Tab. I shows the comparison between experts and meta-trained models under different conditions.
Using the proposed meta-learning method, the agent was able to adapt to four different conditions: default, fixed front-right hip motor, slippery ground and external forces. It traveled a longer average distance than the default expert (averaged over 20 episodes). The joint trajectories from these test scenarios are shown in Fig. 5.
The controller can achieve variable walking speed: despite being trained only at a constant desired forward velocity, we can command different desired velocities online continuously, as shown in Fig. 4 and Fig. 6. Moreover, the trained expert controller can also generate continuous control actions to track discrete, discontinuous commanded velocities (see Fig. 7).
V-B Ground with Constant Friction
We evaluated the adaptation capability using the meta-trained model and compared it to experts over the full range of frictions (0.1 to 0.8 with 0.05 increments). We first trained 5 sets of experts for frictions 0.2, 0.4 and 0.6, using 300 episodes of 10 s, i.e., 50 minutes of data. Then we meta-trained 5 meta-models to adapt to these 3 frictions, each using one set of expert data, with the purpose that they could adapt to the full range afterwards.
Each set of these 5 models was evaluated for each friction with 4 episodes of 10 s, giving 20 evaluations per expert and 20 evaluations for the meta-learning. The meta-trained models outperformed the experts over the full range of frictions, see Fig. 8. As expected, each expert performed best when the friction coefficient was around its training value.
V-C Ground with Decreasing Friction
As a comparison, we used the same experts and meta-models and benchmarked their adaptation capability on ground with continuously decreasing friction. We evaluated each set of models with 4 episodes of 10 s, in which the friction coefficient started at 0.8 and linearly decreased to 0.1. This gave 20 evaluations per expert and 20 evaluations for the meta-learning.
The meta-trained models demonstrated better walking performance and traversed farther (3.38 m) than the experts (3.07 m), as confirmed by a t-test; see Fig. 9. In Fig. 10, the curves and shaded areas are the means and standard deviations of the velocity, respectively. Snapshots of the walking gait using the meta-trained model are shown in Fig. 3; more details of the walking performance can be seen in the accompanying video.
Additionally, Fig. 11 depicts the estimated condition at each time step using the past 0.1s (i.e. 5 time steps). At the beginning of the episode, when friction was higher, the model estimated a friction of 0.6 to be more likely (from 0.8 down to 0.5, i.e., 0-4.5s), then switched to a friction of 0.4 (from 0.5 down to 0.3, i.e., 4.5-7s), and finished by estimating a more likely friction of 0.2 (from 0.3 down to 0.1, i.e., 7-10s).
The results validate the efficiency and effectiveness of the meta-learning method in detecting the most probable current condition and adapting accordingly to different ground friction coefficients. The online adaptive walking using the proposed meta-learning outperformed the experts that were trained on specific friction values, demonstrating the capability of meta-learning to incorporate knowledge from all training data.
We have also pushed toward an extreme test case of unseen hardware failures. We specifically designed a case for meta-training where a motor of one leg of the quadruped was blocked at a fixed joint position (emulating actuator failure), and tested whether meta-learning can adapt to the same damage on a different leg faster than learning from scratch. Our investigation showed that the meta-adaptation is not able to adapt to such changes on a different leg, and we hypothesize that a second-order meta-learning algorithm, e.g. Model-Agnostic Meta-Learning , may be a better solution for such an extreme case.
VI Conclusion and Future Work
Based on the past work of model-based meta-RL , we have made contributions to improve the algorithm for adaptive and robust quadruped locomotion in changing robot dynamics (motor failure and amputation) and varying environmental constraints (time-varying friction, external pushes). In physics-based simulation, we have demonstrated this method can learn quadrupedal walking without using a periodic gait signal  or a phase vector . Instead, by updating an interaction model of the robot and environment and applying the optimal control actions, a walking gait is naturally generated as the outcome of maximizing the task reward. We further validated the capability of our proposed framework in adapting to different conditions such as robot damage, changing friction, and external force disturbances.
Future work will apply this method on the real SpotMicro robot and identify potential issues of sim2real transfer, which we will address with new solutions for the meta-model and a more effective search procedure for the model predictive control. We hypothesize that the current meta-learning algorithm is efficient enough for multi-task learning, which can be further studied. Also, incorporating a second-order meta-learning algorithm  could potentially achieve better adaptation to novel situations.
This work has been supported by EPSRC UK Robotics and Artificial Intelligence Hub for Offshore Energy Asset Integrity Management (EP/R026173/1).
-  P. Fankhauser et al., “Robust rough-terrain locomotion with a quadrupedal robot,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1–8.
-  I. Chatzinikolaidis et al., “Contact-implicit trajectory optimization using an analytically solvable contact model for locomotion on variable ground,” IEEE Robot. Autom. Lett., vol. 5, no. 4, pp. 6357–6364, 2020.
-  A. Rai et al., “Bayesian optimization using domain knowledge on the atrias biped,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1771–1778.
-  K. Yuan et al., “Bayesian optimization for whole-body control of high-degree-of-freedom robots through reduction of dimensionality,” IEEE Robot. Autom. Lett., vol. 4, no. 3, pp. 2268–2275, 2019.
-  H. Dallali et al., “On global optimization of walking gaits for the compliant humanoid robot, coman using reinforcement learning,” Cybernetics and Information Technologies, vol. 12, no. 3, pp. 39–52, 2012.
-  C. Yang et al., “Learning natural locomotion behaviors for humanoid robots using human bias,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 2610–2617, 2020.
-  J. Hwangbo et al., “Learning agile and dynamic motor skills for legged robots,” Science Robotics, vol. 4, no. 26, 2019.
-  C. Yang et al., “Multi-expert learning of adaptive legged locomotion,” Science Robotics, vol. 5, no. 49, 2020.
-  M. Hessel et al., “Rainbow: Combining improvements in deep reinforcement learning,” in AAAI-18, 2018, pp. 3215–3222.
-  O. Vinyals et al., “AlphaStar: Mastering the Real-Time Strategy Game StarCraft II,” 2019.
-  K. Chua et al., “Deep reinforcement learning in a handful of trials using probabilistic dynamics models,” in NeurIPS 2018, 2018, pp. 4759–4770.
-  A. Nagabandi et al., “Learning to adapt in dynamic, real-world environments through meta-reinforcement learning,” in ICLR 2019, 2019.
-  R. Kaushik et al., “Fast online adaptation in robotics through meta-learning embeddings of simulated priors,” 2020.
-  J. Schulman et al., “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017.
-  J. Tan et al., “Sim-to-real: Learning agile locomotion for quadruped robots,” in Robotics: Science and Systems XIV, 2018.
-  T. Haarnoja et al., “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in ICML 2018, 2018, pp. 1856–1865.
-  T. Haarnoja et al., “Learning to walk via deep reinforcement learning,” in Robotics: Science and Systems XV, 2019.
-  S. Ha et al., “Learning to Walk in the Real World with Minimal Human Effort,” arXiv:2002.08550 [cs], Feb. 2020.
-  E. Brochu et al., “A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning,” CoRR, vol. abs/1012.2599, 2010.
-  J. Lehman and K. O. Stanley, “Exploiting open-endedness to solve problems through the search for novelty,” in Artificial Life XI, 2008, pp. 329–336.
-  J.-B. Mouret and J. Clune, “Illuminating search spaces by mapping elites,” arXiv preprint arXiv:1504.04909, 2015.
-  S. Forestier et al., “Intrinsically motivated goal exploration processes with automatic curriculum learning,” CoRR, vol. abs/1708.02190, 2017.
-  S. Levine et al., “End-to-end training of deep visuomotor policies,” J. Mach. Learn. Res., vol. 17, pp. 39:1–39:40, 2016.
-  D. Ha and J. Schmidhuber, “Recurrent world models facilitate policy evolution,” in NeurIPS 2018, 2018, pp. 2455–2467.
-  M. Deisenroth and C. Rasmussen, “Pilco: A model-based and data-efficient approach to policy search,” in ICML 2011, 2011, pp. 465–472.
-  R. Kaushik et al., “Multi-objective model-based policy search for data-efficient learning with sparse rewards,” in CoRL 2018, vol. 87, 2018, pp. 839–855.
-  Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in ICML 2016, vol. 48, 2016, pp. 1050–1059.
-  G. Williams et al., “Model predictive path integral control using covariance variable importance sampling,” CoRR, vol. abs/1509.01149, 2015.
-  R. Y. Rubinstein and D. P. Kroese, The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning. Springer, 2004.
-  K. Li and J. Malik, “Learning to optimize,” in ICLR, 2017.
-  C. Finn et al., “Model-agnostic meta-learning for fast adaptation of deep networks,” in International Conference on Machine Learning, ICML, vol. 70. PMLR, 2017, pp. 1126–1135.
-  X. Song et al., “Rapidly adaptable legged robots via evolutionary meta-learning,” arXiv preprint arXiv:2003.01239, 2020.
-  S. Sæmundsson et al., “Meta reinforcement learning with latent variable gaussian processes,” arXiv preprint arXiv:1803.07551, 2018.
-  Y. Yang et al., “Norml: No-reward meta learning,” CoRR, 2019.
-  T. Hiraoka et al., “Meta-model-based meta-policy optimization,” arXiv preprint arXiv:2006.02608, 2020.
-  Z. Lin et al., “Model-based adversarial meta-reinforcement learning,” arXiv preprint arXiv:2006.08875, 2020.
-  R. Mendonca et al., “Meta-reinforcement learning robust to distributional shift via model identification and experience relabeling,” arXiv preprint arXiv:2006.07178, 2020.
-  C. Finn et al., “One-shot visual imitation learning via meta-learning,” CoRR, 2017.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR 2015, 2015.
-  A. Nichol et al., “On first-order meta-learning algorithms,” CoRR, 2018.
-  D.-Y. Kim, “Spotmicro - robot dog by kdy0523,” Feb 2019. [Online]. Available: https://www.thingiverse.com/thing:3445283