
Robotic Control Using Model Based Meta Adaption

by   Karam Daaboul, et al.

In machine learning, meta-learning methods aim for fast adaptability to unknown tasks using prior knowledge. Model-based meta-reinforcement learning combines reinforcement learning via world models with Meta Reinforcement Learning (MRL) for increased sample efficiency. However, adaption to unknown tasks does not always result in preferable agent behavior. This paper introduces a new Meta Adaptation Controller (MAC) that employs MRL to apply a preferred robot behavior from one task to many similar tasks. To do this, MAC aims to find actions an agent has to take in a new task to reach a similar outcome as in a learned task. As a result, the agent will adapt quickly to the change in the dynamic and behave appropriately without the need to construct a reward function that enforces the preferred behavior.



I Introduction

Adaptive behavior lies in the very nature of life as we know it. By forming a variety of behaviors, the animal brain enables its host to adapt to environmental changes continuously [sterling2015principles]. Toddlers, for example, can learn to walk on sand within moments, whereas robots often struggle to adapt quickly and behave rigidly when they encounter a task they have not seen before. Fast adaption is possible because animals do not learn from scratch but leverage prior knowledge to solve a new task. In machine learning, the domain of meta-learning takes inspiration from this phenomenon by enabling a learning machine to develop a hypothesis on how to solve a new task using information from prior hypotheses of similar tasks [schmidhuber:1987:srl]. Thus, it aims to learn models that are quickly adaptable to new tasks and can be described as a set of methods that apply a learned prior of common task structure to make a generalized inference with small amounts of data [schmidhuber:1987:srl], [Finn2018].
The domain of model-based reinforcement learning (MBRL) comprises methods that enable a Reinforcement Learning (RL) agent to successfully master complex behaviors using a deep neural network as a model of a task's system dynamics [Atkeson1997]. To solve an RL task, this dynamics model is utilized to optimize a sequence of actions (e.g., with model predictive control) or to optimize a policy, making MBRL more sample efficient than model-free reinforcement learning (MFRL) [Williams], [Deisenroth2011], [Nagabandi2017]. Even though MBRL methods show improved sample efficiency compared to MFRL approaches, the amount of training data needed to reach "good" performance scales exponentially with the dimensionality of the input state-action space of the dynamics model [Chatzilygeroudis2018]. Data scarcity is even more challenging when a system has to adapt online while executing a task. A robot, for example, might encounter sudden changes in system dynamics (e.g., damaged joints) or changes in environmental dynamics (e.g., new terrain conditions) that require fast online adaption. By combining meta-learning and MBRL, robots can learn how to quickly form new behaviors when the environment or system dynamics change [Nagabandi2018], [Saemundsson2018], [Kaushik2020], [Belkhale2021]. However, newly formed behavior might be undesirable even if the underlying task is mastered correctly according to the environment’s reward function. For example, as seen in Figure 1, a robot Ant trained to walk as fast as possible will start to jump or roll if the gravity of its environment is very low. In a real-world setting, such behavior might damage the robot. Therefore, the RL agent requires a tailored reward function to form behavior that does no damage. Designing such a reward function, however, is time-consuming and difficult to get right, especially when it has to work across various tasks.

Figure 1: Action sequences of an ant robot during meta-testing. The test task is to adapt to the gravity of . A model-based meta-reinforcement learning approach with MPC results in an undesired robot behavior (row 1). MAC finds a behavior similar to the one learned at its reference task (row 2).

This paper introduces a controller for MBRL that employs meta-learning to apply a selected robot behavior from one robotic task to a range of similar tasks. In other words, it aims to find actions the robot has to take in a new task to reach a similar outcome as in a learned task. Thus, it alleviates the need to construct a reward function that enforces preferred behavior. It builds on top of the FAMLE algorithm by Kaushik et al. [Kaushik2020], making use of an Embedding Neural Network (ENN) for quick adaption to new tasks through task embeddings as learned priors. By combining an ENN with an RL policy of a reference task, the controller predicts which actions need to be taken in unseen tasks to mimic the behavior of the reference task. While being initialized with the most likely embedding, a trained meta-model is adapted to approximate future environment states and compare them to the preferred states of the reference task. Actions leading to states in the unseen task that are very similar to those reached by the RL policy in the reference task are then chosen to be executed in the environment. To account for the usage and adaption of a meta-model during planning, we call our approach Meta Adaptation Controller (MAC).
First, we introduce related work and preliminaries. Next, the challenge and our approach to solving it are described. Finally, experiment results are presented that compare MAC with MPC employing different meta-learning methods.

II Related Work

In recent years, robotics has achieved remarkable success with model-based RL approaches [Zhang2018], [Nagabandi2019], [yang2020]. The agent can choose optimal actions by utilizing the experiences generated by the model [Nagabandi2017]. As a result, the amount of data required for model-based methods is typically much smaller than for their model-free counterparts, making these algorithms more attractive for robotic applications. One drawback in many of these works is the assumption that the environment is stationary. In real robot applications, however, many uncertainties are difficult to model or predict, some of which are internal (e.g., malfunctions [Nagabandi2018]) and others external (e.g., wind [Belkhale2021]). These uncertainties make the stationarity assumption impractical and can lead to suboptimal behavior or even catastrophic failure. Therefore, a quick adaptation of the learned model is critical.
"Gradient-based meta-learning methods leverage gradient descent to learn the commonalities among various tasks" [Lee2018, p. 1]. One such method, introduced by Finn et al. [Finn2017], is Model-Agnostic Meta-Learning (MAML). The key idea of MAML is to tune a model’s initial parameters such that the model has maximal performance on a new task after a small number of gradient steps on that task. Here, meta-learning is achieved with bi-level optimization: a model’s task-specific optimization and a task-agnostic meta-optimization. Instantiated for MFRL, MAML uses policy gradients of a neural network model, whereas, in MBRL, MAML is used to train a dynamics model. REPTILE by Nichol et al. [Nichol] is a first-order implementation of MAML. In contrast to MAML, task-specific gradients do not need to be differentiated through the optimization process. This makes REPTILE more computationally efficient with similar performance.
A model-based approach using gradient-based MRL was presented in the work of Nagabandi et al. [Nagabandi2018] and targets online adaption of a robotic system that encounters different system dynamics in real-world environments. In this context, Kaushik et al. [Kaushik2020] point out that in an MRL setup where situations do not possess strong global similarity, finding a single set of initial parameters is often not sufficient to learn quickly. One potential solution would be to find several initial sets of model parameters during meta-training and, when encountering a new task, use the most similar one so that an agent can adapt through several gradient steps. Their work, Fast Adaptation through Meta-Learning Embeddings (FAMLE), approaches this solution by extending a dynamics model’s input with a learnable d-dimensional vector describing a task. Similarly, Belkhale et al. [Belkhale2021] introduce a meta-learning approach that enables a quadcopter to adapt online to various physical properties of payloads (e.g., mass, tether length) using variational inference. Intuitively, each payload causes different system dynamics and therefore defines a task to be learned. Since it is unlikely that such dynamics can be modeled accurately by hand, and it is not realistic to know every payload’s property values beforehand, the meta-learning goal is rapid adaption to unknown payloads without prior knowledge of the payload’s physical properties. To this end, a probabilistic encoder network infers a task-specific latent vector that is fed into a dynamics network as an auxiliary input. Using the latent vector, the dynamics network learns to model the factors of variation that affect the payload’s dynamics and are not present in the current state. All these algorithms use MPC during online adaption. Our work introduces a new controller for online adaption in a model-based meta-reinforcement learning setting.

III Preliminaries

III-A Meta Learning

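Quick online adaption to new tasks can be viewed in the light of a few-shot learning setting, where the goal of meta-learning is to adapt a model $f_\theta$ to an unseen task $\mathcal{T}_j$ of a task distribution $p(\mathcal{T})$ with a small number $K$ of data samples [Finn2017]. The meta-learning procedure is usually divided into meta-training with meta-learning tasks $\mathcal{T}_i$ and meta-testing with meta-test tasks $\mathcal{T}_j$, both drawn from $p(\mathcal{T})$ without replacement [Finn2018]. During meta-training, task data may be split into train and test sets, $\mathcal{D}_{\mathcal{T}_i} = \{\mathcal{D}^{tr}_{\mathcal{T}_i}, \mathcal{D}^{test}_{\mathcal{T}_i}\}$, usually representing $K$ data points of a task $\mathcal{T}_i$. Meta-testing task data $\mathcal{D}_{\mathcal{T}_j}$ is held out during meta-training [Finn2018]. Meta-training is then performed with the data $\mathcal{D}_{\mathcal{T}_i}$ of all meta-learning tasks and can be viewed as bi-level learning of model parameters [Rajeswaran2019]. In the inner level, an update algorithm $\mathrm{Alg}$ with hyperparameters $\psi$ must find task-specific parameters $\phi_i$ by adjusting meta-parameters $\theta$. In the outer level, $\theta$ must be adjusted to minimize the cumulative loss of all $\phi_i$ across all learning tasks by finding common characteristics of different tasks through meta-parameters $\theta^*$:

$$\theta^* = \arg\min_{\theta} \sum_{i} \mathcal{L}\big(\phi_i, \mathcal{D}^{test}_{\mathcal{T}_i}\big), \qquad \phi_i = \mathrm{Alg}_{\psi}\big(\theta, \mathcal{D}^{tr}_{\mathcal{T}_i}\big) \tag{1}$$

Once $\theta^*$ is found, it can be used during meta-testing for quick adaption:

$$\phi_j = \mathrm{Alg}_{\psi}\big(\theta^*, \mathcal{D}_{\mathcal{T}_j}\big)$$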

III-B Model-based Reinforcement Learning

In RL, a task can be described as a Markov Decision Process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, \rho_0, p, H)$ with a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, an initial state distribution $\rho_0(s_0)$, a transition probability distribution $p(s_{t+1} \mid s_t, a_t)$, and a discrete-time finite or continuous-time infinite horizon $H$. MBRL methods sample ground truth data $\mathcal{D}$ from a specific task and use this data to train a dynamics model $\hat{p}_\theta(s_{t+1} \mid s_t, a_t)$ that estimates the underlying dynamics of the task to approximate which state follows which action. This is done by optimizing the weights $\theta$ to maximize the log-likelihood of the observed data:

$$\max_{\theta} \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}} \log \hat{p}_\theta(s_{t+1} \mid s_t, a_t)$$

The learned dynamics model is then utilized to optimize a sequence of actions (e.g., with model predictive control) or to optimize a policy [Deisenroth2011], [Nagabandi2017].
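As a rough illustration of the model-fitting step, the following sketch fits a two-parameter linear dynamics model to sampled transitions. Everything in it is an assumption made for illustration: the toy ground-truth dynamics `true_step`, the linear model, and the hyperparameters are hypothetical, and a deep neural network would take the place of the linear model in practice. With a fixed-variance Gaussian likelihood, minimizing the squared prediction error is equivalent to maximizing the log-likelihood.

```python
import random


def true_step(s, a):
    """Hypothetical ground-truth dynamics, used only to generate data."""
    return 0.9 * s + 0.5 * a


def collect(n, rng):
    """Sample n transitions (s, a, s_next) by taking random actions."""
    data = []
    for _ in range(n):
        s = rng.uniform(-1.0, 1.0)
        a = rng.uniform(-1.0, 1.0)
        data.append((s, a, true_step(s, a)))
    return data


def fit_dynamics(data, epochs=100, lr=0.5):
    """Fit s_next ~ w_s * s + w_a * a by gradient descent on the mean squared error."""
    w_s, w_a = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        g_s = g_a = 0.0
        for s, a, s_next in data:
            err = (w_s * s + w_a * a) - s_next
            g_s += 2.0 * err * s / n
            g_a += 2.0 * err * a / n
        w_s -= lr * g_s
        w_a -= lr * g_a
    return w_s, w_a


rng = random.Random(0)
w_s, w_a = fit_dynamics(collect(200, rng))
```

Because the toy data is noise-free, the fitted weights approach the coefficients of `true_step`; with noisy real data the fit would only be approximate.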

III-C Gradient-based Reinforcement Learning with REPTILE

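REPTILE from Nichol et al. [Nichol] is a first-order implementation of MAML. During meta-training, task data $\mathcal{D}_{\mathcal{T}_i}$ in the form of trajectories is sampled with roll-outs of a policy or by taking random actions. A single task $\mathcal{T}_i$ is sampled from $p(\mathcal{T})$ without replacement for each training iteration that passes the inner level and outer level once. In the inner level, task-specific parameters $\phi_i$ are generated by adapting $\theta$ with $k$ steps of stochastic gradient descent as $\mathrm{Alg}$ and its learning rate $\alpha_{in}$:

$$\phi_i = \mathrm{SGD}_k\big(\mathcal{L}_{\mathcal{T}_i}, \theta, \alpha_{in}\big)$$

In the outer level, the parameters $\theta$ are then adjusted to minimize the Euclidean distance between $\theta$ and $\phi_i$ with learning rate $\alpha_{out}$:

$$\theta \leftarrow \theta + \alpha_{out}\,(\phi_i - \theta)$$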
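The two levels can be sketched on a toy problem. The example below is an assumption-laden illustration, not the setup used in this paper: each hypothetical task is a one-dimensional regression `y = slope * x`, the model has a single scalar parameter, and tasks are drawn with replacement for brevity. The inner level runs a few SGD steps from the meta-parameter; the outer level moves the meta-parameter a fraction of the way toward the adapted parameter.

```python
import random


def inner_sgd(theta, slope, k, lr_in, rng):
    """Inner level: k SGD steps on the squared error of y = slope * x, starting at theta."""
    phi = theta
    for _ in range(k):
        x = rng.uniform(-1.0, 1.0)
        grad = 2.0 * (phi * x - slope * x) * x
        phi -= lr_in * grad
    return phi


def reptile(task_slopes, iterations=2000, k=5, lr_in=0.1, lr_out=0.1, seed=0):
    """Outer level: pull theta toward each task-specific solution phi."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(iterations):
        slope = rng.choice(task_slopes)  # sample one task per iteration
        phi = inner_sgd(theta, slope, k, lr_in, rng)
        theta += lr_out * (phi - theta)  # first-order meta-update, no second derivatives
    return theta


theta = reptile([1.0, 2.0, 3.0])
```

The meta-parameter settles between the task-specific optima, so that a few inner steps suffice to reach any one of them. No gradient is propagated through the inner loop, which is what makes the method first-order.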

Figure 2: A high-level overview of the meta adaption controller. While being initialized with the most likely embedding, a trained meta-model is adapted to approximate future environment states and compare them to states of the reference task. Actions leading to states in the unseen task that are very similar to those reached by the RL policy in the reference task are then chosen to be executed in the environment.

III-D Model-based Meta-Reinforcement Learning using task embeddings

Each dynamic encountered by an agent can be represented by an MDP and therefore interpreted as an RL task $\mathcal{T}$. Since, in real-world applications, new dynamics can appear at any time (e.g., a malfunctioning robot leg), a new task could appear at any time. Hence, a task can be understood as an arbitrary trajectory segment of $M$ timesteps under a specific dynamic. A meta-learner is trained to adapt to the distribution of these temporal fragments based on recent observations [Nagabandi2018], [Kaushik2020]. FAMLE by Kaushik et al. [Kaushik2020] extends a dynamics model’s input with an additional input $h_i$, which is a d-dimensional vector describing a task $\mathcal{T}_i$. By meta-training model parameters $\theta$ and embeddings $h_i$ jointly, several initial sets of model parameters are found, each conditioned on a task represented by $h_i$, resulting in a task-conditioned dynamics model $\hat{p}_\theta(s_{t+1} \mid s_t, a_t, h_i)$. If an unseen task appears, its similarity to prior tasks is measured. The most likely task embedding is then used to condition the model parameters and enable faster adaption.
The meta-training process is described in Algorithm 1. Prior to meta-training, $N$ tasks are sampled inside a simulation from $p(\mathcal{T})$, resulting in a set of meta-training tasks $\{\mathcal{T}_i\}_{i=0}^{N-1}$. For each $\mathcal{T}_i$, training data $\mathcal{D}_{\mathcal{T}_i}$ is sampled by a simulated robot randomly taking actions. Then, the task embeddings $h_i$ to be learned for each task are initialized. During meta-training, relating to the meta-learning goal defined in Equation 1, initial model parameters $\theta^*$ and task embeddings $h^*_{0:N-1}$ are found that minimize the loss for any task sampled from $p(\mathcal{T})$:

$$\theta^*, h^*_{0:N-1} = \arg\min_{\theta,\, h_{0:N-1}} \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})}\big[\mathcal{L}_{\mathcal{T}_i}(\theta, h_i)\big]$$

The loss of a task-conditioned dynamics model for a specific task $\mathcal{T}_i$ is as follows:

$$\mathcal{L}_{\mathcal{T}_i}(\theta, h_i) = \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}_{\mathcal{T}_i}}\big[-\log \hat{p}_\theta(s_{t+1} \mid s_t, a_t, h_i)\big]$$

Following the REPTILE algorithm, bi-level optimization is achieved by making a gradient-based, task-specific update of $\theta$ and $h_i$ in the inner level with a fixed task $\mathcal{T}_i$:

$$(\phi_i, h_i') = \mathrm{SGD}_k\big(\mathcal{L}_{\mathcal{T}_i}, \theta, h_i, \alpha_{in}\big)$$

and simultaneously updating $\theta$ and $h_i$ towards their task-specific counterparts in the outer level:

$$\theta \leftarrow \theta + \alpha_{out}\,(\phi_i - \theta), \qquad h_i \leftarrow h_i + \alpha_{out}\,(h_i' - h_i)$$

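Require: $p(\mathcal{T})$ Distribution over tasks
Require: $\alpha_{out}$ Learning rate outer-level
Require: $\alpha_{in}$ Learning rate inner-level
Require: $N$ Number of sampled tasks
Require: $\mathcal{D} = \emptyset$ Empty Dataset
Require: $\mathrm{Alg}$ as $k$ steps of stochastic gradient descent
1:  for $i = 0, \dots, N-1$ do
2:     Sample a training task $\mathcal{T}_i$ from $p(\mathcal{T})$
3:     Save the task: $\mathcal{T}_i$
4:     Collect task data: $\mathcal{D}_{\mathcal{T}_i}$
5:     Save task data: $\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{D}_{\mathcal{T}_i}$
6:  end for
7:  for each meta-training iteration do
8:     Sample task data: $\mathcal{D}_{\mathcal{T}_i} \sim \mathcal{D}$
9:     Perform $k$ steps of SGD with: $(\phi_i, h_i') = \mathrm{Alg}(\mathcal{L}_{\mathcal{T}_i}, \theta, h_i, \alpha_{in})$
10:     Perform update: $\theta \leftarrow \theta + \alpha_{out}\,(\phi_i - \theta)$
11:     Perform update: $h_i \leftarrow h_i + \alpha_{out}\,(h_i' - h_i)$
12:  end for
13:  return ($\theta$, $h_{0:N-1}$) as ($\theta^*$, $h^*_{0:N-1}$)
Algorithm 1 Meta-training process using REPTILE and an Embedding Neural Network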

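During online adaptation (i.e., meta-testing), the dynamics model is adapted based on recent observations under the assumption that a new task begins after every $M$ control steps. First, based on the recent observations $\mathcal{D}_{recent}$, the most likely situational embedding is determined:

$$h_{likely} = \arg\max_{h_i \in h^*_{0:N-1}} \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}_{recent}}\big[\log \hat{p}_{\theta^*}(s_{t+1} \mid s_t, a_t, h_i)\big] \tag{9}$$

Next, the dynamics model is updated online by simultaneously updating $\theta^*$ and $h_{likely}$ with $k$ gradient steps:

$$(\phi, h') = \mathrm{SGD}_k\big(\mathcal{L}_{\mathcal{D}_{recent}}, \theta^*, h_{likely}, \alpha_{in}\big) \tag{10}$$
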
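Selecting the most likely embedding can be sketched as a search over the learned embeddings for the one that best explains the recent observations. The snippet below is a minimal sketch under stated assumptions: the task-conditioned model `predict` is a hypothetical stand-in in which a scalar embedding scales the effect of the action, and the summed squared prediction error stands in for the negative log-likelihood (the two rank embeddings identically under a fixed-variance Gaussian likelihood).

```python
def predict(s, a, h):
    """Hypothetical task-conditioned model: the embedding h scales the action effect."""
    return s + h * a


def most_likely_embedding(embeddings, observations):
    """Return the index of the embedding with the lowest squared prediction error."""
    def error(h):
        return sum((predict(s, a, h) - s_next) ** 2 for s, a, s_next in observations)

    return min(range(len(embeddings)), key=lambda i: error(embeddings[i]))


# Three embeddings found during meta-training (illustrative values).
embeddings = [0.5, 1.0, 2.0]
# Recent (state, action, next state) observations from an unseen task.
observations = [(0.0, 1.0, 1.9), (1.9, -0.5, 0.95), (0.95, 0.2, 1.3)]
best = most_likely_embedding(embeddings, observations)
```

The selected embedding only initializes the model; the subsequent gradient steps on the recent observations refine both the parameters and the embedding.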
IV Agents forming adapted behavior through meta-learning

Figure 3 displays a robot incentivized to walk in one specific direction as fast as possible. The meta-learning objective is to walk successfully in different gravitational settings. First, as in algorithm 1, data is collected by randomly taking actions in different gravitational settings. Next, a meta-learning method (e.g., REPTILE) is used to train a meta-model. While the robot achieves good performance during meta-testing with MPC, its adapted behavior in low gravitational settings is to jump and roll since this results in the highest reward (Fig. 1). In a real-world setting, similar adverse behaviors could have unknown consequences like damage to the robot or its environment. Designing a reward function that enables intended adaption is not a promising approach. First, developing the proper function for one specific task takes many trials, which is time-consuming. Moreover, finding a reward function that works across various tasks is difficult. For example, settings with low gravity require constraining the motion of the robot not to jump or roll, whereas high gravity settings demand rotation flexibility. More complex meta-learning tasks are even more challenging.

V Apply preferred behavior to similar tasks

Instead of designing a reward function that provides the right incentives for different tasks, a technique is needed that guarantees correct motion with minimal supervision. One possible solution is to add constraints to the MPC optimization problem. These constraints force the states predicted by the model to be similar to predefined states that we call task-anchors $s^{anc}$. An example is shown in Fig. 3 (red arrows), where the task-anchor accounts for the robot’s rotational motion so that the robot adapts to low gravity conditions without jumping or rolling.

$$\max_{a_{t:t+H-1}} \sum_{\tau=t}^{t+H-1} r(\hat{s}_\tau, a_\tau)$$
$$\text{s.t.:}\quad \hat{s}_{\tau+1} \sim \hat{p}_\theta(\cdot \mid \hat{s}_\tau, a_\tau, h) \tag{12}$$
$$\mathrm{sim}\big(\hat{s}_{\tau+1}, s^{anc}\big) \geq \epsilon \tag{13}$$

Here $\epsilon$ is the similarity threshold.

Instead of finding the proper movement across all tasks, we only choose a movement of one specific task that may work well in similar tasks. This movement can be extracted from learned RL policies or classical feedback controllers. We call this task a reference task and the used policy (controller) a reference policy $\pi_{ref}$. Our algorithm aims to find actions the robot has to take in a new test task to reach a similar outcome as the desired outcome in the reference task using the reference policy. To achieve that, our algorithm utilizes two variations of a meta-trained embedding neural network (ENN). The first variation entails the meta-trained network with meta-parameters $\theta^*$ and learned embeddings $h^*_{0:N-1}$. The second variation entails the meta-trained network with meta-parameters $\theta^*$ conditioned on the embedding of a reference task $h_{ref}$.
Putting these pieces together, MAC (Fig. 2) optimizes a sequence of states and actions to maximize the predicted reward in the test task while also eventually ensuring dynamics feasibility:

$$\max_{\hat{s}_{t+1:t+H},\, a_{t:t+H-1}} \sum_{\tau=t}^{t+H-1} r(\hat{s}_\tau, a_\tau)$$
$$\text{s.t.:}\quad \hat{s}_{\tau+1} \sim \hat{p}_{\phi}(\cdot \mid \hat{s}_\tau, a_\tau, h') \tag{15}$$
$$\hat{s}_{\tau+1} \approx s^{ref}_{\tau+1} \tag{16}$$

Here $s^{ref}_{\tau+1}$ is the desired outcome of the reference task given the current state and using the reference policy:

$$s^{ref}_{\tau+1} \sim \hat{p}_{\theta^*}\big(\cdot \mid \hat{s}_\tau, \pi_{ref}(\hat{s}_\tau), h_{ref}\big)$$

The constraint 16 is approximated using a similarity measurement between the predicted states $\hat{s}_{\tau+1}$ and the reference state $s^{ref}_{\tau+1}$. As the similarity measure, we use the cosine similarity of the state vectors.
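For reference, cosine similarity between two state vectors can be computed as below. This is a plain standard-library sketch; the function name and the choice to return 0.0 for zero-length vectors are assumptions made here, not details taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Cosine similarity compares direction only, so a predicted state that is a scaled copy of the reference state counts as maximally similar. This is one reason the controller also takes the predicted reward into account when ranking candidates.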

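Figure 3: An ant robot incentivized to walk as fast as possible. The red arrows depict an anchor that restricts the robot in its rotation.

Require: $\theta^*$, $h^*_{0:N-1}$ Meta-learned parameters and embeddings
Require: Meta Adaption Controller MAC
Require: $\mathcal{D}_{recent} = \emptyset$ Empty set of recent observations
1:  while task not solved do
2:     Determine most likely embedding $h_{likely}$ given $\mathcal{D}_{recent}$ and $\theta^*$ {see Eq. 9}
3:     Execute $k$ steps of SGD using $\mathcal{D}_{recent}$ and receive $(\phi, h')$ {see Eq. 10}
4:     Execute action $a_t = \mathrm{MAC}(s_t, \phi, h')$ and receive state $s_{t+1}$ {see Alg. 3}
5:     Save observation $\mathcal{D}_{recent} \leftarrow \mathcal{D}_{recent} \cup \{(s_t, a_t, s_{t+1})\}$
6:     if $|\mathcal{D}_{recent}| > M$ then remove oldest observation from $\mathcal{D}_{recent}$
7:  end while
Algorithm 2 Meta-testing an Embedding Neural Network to adapt online using our control algorithm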

After meta-training an ENN with algorithm 1, meta-testing (i.e., online adaption) is executed with algorithm 2. Here the most likely embedding is determined, and the ENN is adapted to approximate future environment states. Subsequently, MAC is used to find the action to take in a new task that reaches a similar outcome as in the reference task.
The MAC procedure is shown in Algorithm 3, and can be summarized with the following steps:

  1. Given the current state $s_t$, sample a set of reference actions $A^{ref}$ using the reference policy $\pi_{ref}$.

  2. Using the embedding of the reference task $h_{ref}$, estimate a set of reference states $S^{ref}$ that contains information about what states would follow if the reference actions in the reference task were executed.

  3. Sample a set of actions $A$ from an unconditional Gaussian distribution $\mathcal{N}(\mu, \Sigma)$. These actions will be used together with the test task embedding $h'$ to predict the following states $\hat{S}$.

  4. Use the reward function $r$ to estimate a set of rewards $R$ and measure the state similarity of the predicted states $\hat{S}$ to the reference states $S^{ref}$.

  5. From the sets $\hat{S}$ and $A$, store only the most similar states and the actions used to reach these states.

  6. Repeat the procedure according to the planning horizon.

  7. Choose the first action of the most similar action sequence with the highest reward.

  8. Update the distribution towards action sequences with higher similarity and reward.

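Require: $r$ Reward function; $\mathrm{sim}$ similarity measure
Require: $E$ Action elites; $H$ planning horizon
Require: $\pi_{ref}$ Reference policy
Require: $(\phi, h')$ Task-specific configuration
Require: $(\theta^*, h_{ref})$ Reference task configuration
Require: $S$ Set of initial states
Require: $\hat{S} = \emptyset$ Empty set of next states
Require: $A = \emptyset$ Empty set of actions to next states
Require: $R = \emptyset$ Empty set of calculated rewards
1:  Use random distribution $\mathcal{N}(\mu, \Sigma)$ to sample actions
2:  for $\tau = 0, \dots, H-1$ do
3:     Sample set of ref. actions $A^{ref}$ with $\pi_{ref}(S)$
4:     Calculate ref. states $S^{ref}$ with $\hat{p}_{\theta^*}(\cdot \mid S, A^{ref}, h_{ref})$
5:     Sample set of actions $A$ from $\mathcal{N}(\mu, \Sigma)$
6:     Get set of predictions of next states $\hat{S}$ with $\hat{p}_{\phi}(\cdot \mid S, A, h')$
7:     Calculate rewards $R$ with $r$
8:     Calculate state similarity with $\mathrm{sim}(\hat{S}, S^{ref})$
9:     Update $S$ based on the similarity and $R$ so that it consists of the most similar states with the highest rewards
10:  end for
11:  Extract the first action $a^*$ of the most similar action sequence with the highest reward.
12:  Update $\mathcal{N}(\mu, \Sigma)$ based on $E$
13:  return $a^*$
Algorithm 3 Meta Adaption Controller (MAC)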
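The planning loop can be sketched in a few dozen lines. The example below is a simplified illustration and not the authors' implementation: the embedding-conditioned `model`, the `reference_policy`, and the `reward` are hypothetical toy stand-ins, candidates are ranked by summed cosine similarity before reward is considered, and the final update of the sampling distribution (step 8 of the summary) is omitted.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def model(state, action, h):
    """Hypothetical embedding-conditioned dynamics: h scales the action effect."""
    return [s + h * a for s, a in zip(state, action)]


def reference_policy(state):
    """Hypothetical reference policy: always step along the first axis."""
    return [1.0, 0.0]


def reward(state, next_state):
    """Hypothetical reward: progress along the first axis."""
    return next_state[0] - state[0]


def mac_action(state, h_test, h_ref, horizon=3, n_samples=200, n_elites=20, seed=0):
    rng = random.Random(seed)
    # Each candidate: (first action, current state, summed similarity, summed reward).
    candidates = [(None, list(state), 0.0, 0.0)] * n_samples
    elites = candidates[:n_elites]
    for _ in range(horizon):
        expanded = []
        for first, s, sim, ret in candidates:
            # Where would the reference policy lead in the reference task?
            s_ref = model(s, reference_policy(s), h_ref)
            # Sample a candidate action and predict the test-task state.
            a = [rng.gauss(0.0, 1.0) for _ in s]
            s_next = model(s, a, h_test)
            expanded.append((a if first is None else first, s_next,
                             sim + cosine_similarity(s_next, s_ref),
                             ret + reward(s, s_next)))
        # Keep only the candidates whose predicted states are most similar.
        expanded.sort(key=lambda c: c[2], reverse=True)
        elites = expanded[:n_elites]
        candidates = [elites[i % len(elites)] for i in range(n_samples)]
    # Among the most similar sequences, return the first action of the best-rewarded one.
    return max(elites, key=lambda c: c[3])[0]


action = mac_action([0.0, 0.0], h_test=0.5, h_ref=1.0)
```

In this toy setting the test task halves the effect of every action, and the selected action points along the same axis as the reference policy's step. Filtering by similarity first and by reward second is one possible reading of "the most similar action sequence with the highest reward"; a weighted combination of the two scores would be another.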

VI Experiments

We compare our meta adaption controller (MAC) algorithm with two baseline algorithms. The algorithms are compared during meta-testing within four different environments where the agent must quickly adapt to new tasks and collect as much reward as possible (Figure 4). The environments are based on the MuJoCo physics engine developed by Todorov et al. [todorov2012mujoco] and build on previous meta-RL literature [Nagabandi2018]: Halfcheetah-disabled, Halfcheetah-pier, Ant-disabled, and Ant-gravity.

Each baseline uses a multilayer perceptron (MLP) with three hidden layers, each consisting of 512 neurons. The first baseline (RMPC) trains the MLP with the first-order MAML implementation REPTILE by Nichol et al. [Nichol] and uses MPC for meta-testing. Following the approach of FAMLE by Kaushik et al. [Kaushik2020], the second baseline (FMPC) extends the MLP with an additional embedding input, meta-trains the resulting embedding neural network (ENN) using REPTILE, and uses MPC for meta-testing. To extract reference actions, MAC uses a policy from an actor-critic module trained on one reference task per environment with the soft actor-critic algorithm by Haarnoja et al. [Haarnoja2018]. The reference task in each environment corresponds to the goal of its base environment without any meta-task (i.e., Ant without disability and in the standard gravitational setting; HalfCheetah without disability). Further, MAC reuses the trained ENN as explained in Algorithm 2.
A hyperparameter search regarding meta-training and meta-testing was carried out for each combination of environment and algorithm. To compare the individual performances in each environment, we sampled a new task after 500 steps. During testing, every task is sampled five times with different environment seeds. Our experiments show that MAC outperforms both baselines, except in the Ant-gravity environment, where RMPC collects a higher reward because the robot jumps and rolls, as depicted in Figure 1. The experiment results are displayed in Table I.

Figure 4: Environments used to test our algorithm. In each environment, the corresponding robot needs to run as fast as possible in one direction. Halfcheetah-disabled blocks different joints of a cheetah robot. Halfcheetah-pier changes the cheetah’s limb flexibility while running on moving ground. Ant-disabled resizes different legs of an ant robot. In Ant-Gravity, the environment’s gravitational setting is adjusted.
| Environment | Tasks | Env steps | RMPC | FMPC | MAC |
|---|---|---|---|---|---|
| Ant-gravity | 8 | 20000 | 18840* | 5981 | 10838 |
| Ant-disabled | 1 | 2500 | 2001 | 1856 | 2397 |
| Halfcheetah-disabled | 2 | 5000 | 2109 | 9762 | 10513 |
| Halfcheetah-pier | 2 | 5000 | 2570 | 4342 | 4393 |

*: Robot jumps and rolls in the environment, leading to high rewards
Table I: Meta-testing in different environments

VII Conclusion

In this paper, we presented MAC, a robot control algorithm that employs meta-reinforcement learning to apply a preferred robot behavior from one task to many similar tasks. At its core is the combination of a meta-trained embedding neural network and an RL policy of a reference task. While the neural network is adapted online, the controller predicts which actions need to be taken in unseen tasks to mimic the behavior of the reference task obtained by the policy. Our experiments demonstrated that this mechanism works across various tasks in different environments and, in most of the tested environments, outperforms meta-testing with model predictive control in different model-based meta-reinforcement learning setups.