Bayesian Meta-Learning for Few-Shot Policy Adaptation Across Robotic Platforms

03/05/2021
by   Ali Ghadirzadeh, et al.

Reinforcement learning methods can achieve significant performance but require a large amount of training data collected on the same robotic platform. A policy trained with expensive data is rendered useless after even a minor change to the robot hardware. In this paper, we address the challenging problem of adapting a policy, trained to perform a task, to a novel robotic hardware platform given only a few demonstrations of robot motion trajectories on the target robot. We formulate it as a few-shot meta-learning problem where the goal is to find a meta-model that captures the common structure shared across different robotic platforms such that data-efficient adaptation can be performed. We achieve such adaptation by introducing a learning framework consisting of a probabilistic gradient-based meta-learning algorithm that models the uncertainty arising from the few-shot setting with a low-dimensional latent variable. We experimentally evaluate our framework on a simulated reaching task and a real-robot picking task using 400 simulated robots generated by varying the physical parameters of an existing set of robotic platforms. Our results show that the proposed method can successfully adapt a trained policy to different robotic platforms with novel physical parameters, and demonstrate the superiority of our meta-learning algorithm over state-of-the-art methods on the introduced few-shot policy adaptation problem.


I Introduction

Robot learning methods aim to develop data-driven approaches that can deal with the complexity and the diversity of robotic problems. Policy learning algorithms have achieved impressive results, but mostly on fixed robotic hardware; to account for even a minor change in the robot hardware, the policy has to be retrained from scratch. Training samples are usually quite expensive in such setups, and in most cases, retraining policies from scratch is not data-efficient. This problem becomes more pronounced when learning complex visuomotor skills, which typically require a significant amount of data. Therefore, the ability to adapt a policy to changes in the robot hardware using few training samples is an important property of a learning system. Likewise, the overall sample efficiency of robot learning algorithms would be significantly improved if we could train a policy for a task on a robotic platform once and adapt it to all other similar robots with only a few training samples. However, training general policies that adapt efficiently to various robots with different morphologies is a challenging problem for existing data-driven methods. Even though generic learning algorithms can train policies to acquire a wide range of manipulation skills [33, 41, 17], there is no way to transfer a trained policy to a new robot despite the common structure that is present in different robotic platforms.

Fig. 1: Meta-training robot policies on multiple robotic platforms. Top: the block diagram of the policy with the platform independent blocks in blue and the platform dependent block in green. Bottom: different robotic platforms based on which we generated 400 robots in simulation.

Platform-independent policy training can be formulated as a few-shot learning problem where the goal is to adapt an action-selection policy to a new platform from small amounts of robot data [28, 9]. Such a few-shot learning problem can be addressed by meta-learning [13], a learning-to-learn framework that leverages past knowledge to solve novel tasks more efficiently. In our case, a meta-learning agent can learn a representation of the common structure of joint-level motor commands for a given manipulation task across different robotic platforms.

However, ambiguities arising from the limited data can make it difficult to find a unique representation of a robotic platform. Standard meta-learning approaches yield a single solution that is consistent with the given data but might be sub-optimal when the data does not carry sufficient information about the task. Probabilistic meta-learning algorithms [15, 37, 19, 38] address this problem by leveraging Bayesian inference to generate multiple solutions that are consistent with the few-shot training examples. We therefore introduce a probabilistic meta-learning framework that models this uncertainty with a probability distribution from which multiple possible platform representations can be sampled.

The main contributions of this paper are: (1) introducing a learning framework to adapt a trained policy to a modified robot hardware or a novel but similar robotic platform, (2) introducing a probabilistic meta-learning framework that proposes multiple potential action-selection policies given a few demonstrated motion trajectories, and (3) experimentally evaluating the framework and benchmarking it against state-of-the-art meta-learning algorithms when adapting a trained policy to novel robotic platforms for a simulated reaching task and a real-robot picking task. We experimentally evaluate our framework on 400 robots in simulation, generated by modifying four existing robot platforms, namely the ABB YuMi, Baxter, Franka Emika, and Kinova robots. By training and evaluating a meta-learner on this setup, we show how a trained policy can be adapted to a new robot by providing only a few demonstrated motion trajectories. Our experimental results show superior performance compared to prior meta-learning algorithms [13, 19] for a robotic reaching task on different target robots. Our real-robot experiments also show that the method can successfully adapt a visuomotor policy trained for a grasping task to a YuMi robot with different gripper sizes.

II Related work

Transfer learning in robotics: Few-shot transfer learning has gained popularity in robotics research in recent years because it addresses both the sample-inefficiency problem of reinforcement learning (RL) algorithms and the high cost of recollecting real robot data for every new manipulation task. Prior work focused broadly on the simulation-to-real transfer learning problem [4, 3, 43, 35], few-shot imitation learning [12, 16, 47, 5, 27, 6], transfer of perception models across different domains [39, 22, 8], and the transfer of skills across tasks [36, 24, 21], dynamics [9, 2], and nonstationary environments [1, 45]. However, limited work has been done on transfer learning across robotic platforms [11, 10, 25, 40, 7].

Devin et al. [11] proposed to decompose a policy network into robot-specific and task-specific modules to facilitate the transfer of knowledge to novel combinations of pre-trained tasks and robots. We similarly decompose a policy into task- and robot-specific components, but in contrast to [11] our goal is to transfer manipulation skills to a novel robotic platform in a few-shot manner. Chen et al. [7] and Schaff et al. [40] proposed to learn a policy conditioned on an encoding of the robotic hardware, such as the kinematic structure of the robot. Their goal is to learn a general policy that transfers to a new robot either in a zero-shot manner or via fine-tuning [7], or to optimize the robotic hardware together with the policy [40]. We also learn a robot encoding, but instead of feeding it to the policy we use the encoding to initialize a network that is further updated by gradient descent. Huang et al. [25] suggested training one policy that controls each limb of an agent and coordinates between different limbs through a message-passing technique. Even though their approach demonstrated appealing results for walking gaits, it is not clear how to apply it to visuomotor policy training, where perception and control are trained together. Dasari et al. [10] introduced large-scale multi-task and multi-robot dynamics model training that can be used with model-based action-selection approaches such as model-predictive control [14]. We similarly train policies on large-scale datasets constructed in simulation by adjusting physical parameters of different robotic platforms; however, we aim to meta-train an action-selection policy that can then be efficiently adapted to a novel robot given only a few motion trajectories from the target robot.

Meta-learning: Meta-learning is a common approach to solve few-shot learning problems [42, 44, 26, 34]. A popular optimization-based meta-learning method is the model-agnostic meta-learning (MAML) framework [13]. However, an important challenge in the few-shot learning setup arises when the few examples do not contain sufficient information to properly solve a new task [15]. To account for such task ambiguities, several probabilistic frameworks based on MAML [15, 20, 46, 38] were introduced, as well as a variety of other Bayesian meta-learning frameworks [19, 37, 23, 29, 48]. Similar to [29, 38], we introduce a low-dimensional meta-task latent variable which we embed into the gradient-based meta-learner following MAML. In contrast to [38], we propose to perform the bi-level gradient descent optimization of MAML directly on the generated network parameters. We also find in Section V that our approach significantly outperforms VERSA [19] and a method based on amortized variational inference (AVI) [37].

III Preliminaries

Policy training with generative models: Similar to [18], we consider a finite-horizon Markov decision process defined by a tuple $(\mathcal{S}, \mathcal{U}, P, p(s_g), p(r \mid s_g, \tau))$, where $\mathcal{S}$ denotes both the set of end (terminal) states and the set of goal states $s_g$, $\mathcal{U}$ is the set of motor actions for the robot motors, $P$ is the state transition probability assuming a fixed initial state, $p(s_g)$ is the goal state distribution, and $p(r \mid s_g, \tau)$ is the probability of the reward conditioned on a goal state $s_g$ and a fixed-length sequence of open-loop motor actions $\tau = (u_1, \dots, u_T)$ with time index $t$. In order to complete a robotic manipulation task given a goal state $s_g$, we wish to find a goal-conditioned policy, parameterized by $\theta$, that assigns a distribution over the possible sequences of actions so as to maximize the expected reward. We follow [17, 18] and find the parameters $\theta$ by first introducing an action latent variable $z$ and a trajectory generative model $p_\omega(\tau \mid z)$, parametrized by $\omega$, that maps a latent action sample $z$ into a sequence of actions $\tau$. Given $p_\omega$, the policy parameters are then found by marginalizing over $z$ and maximizing the expected reward

$$\theta^* = \arg\max_\theta \; \mathbb{E}_{s_g \sim p(s_g)} \, \mathbb{E}_{z \sim \pi_\theta(z \mid s_g)} \, \mathbb{E}_{\tau \sim p_\omega(\tau \mid z)} \big[ r(s_g, \tau) \big], \qquad (1)$$

where the sub-policy $\pi_\theta(z \mid s_g)$, parametrized by $\theta$, assigns a distribution over the action latent variable $z$ conditioned on the goal state $s_g$. Intuitively, the action latent variable $z$ captures the high-level objective of the policy, while the generative model $p_\omega$ translates this objective into a sequence of low-level motor actions.
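To make the structure of Eq. (1) concrete, the following sketch composes a sub-policy and a trajectory generative model and estimates the expected reward by Monte Carlo sampling. This is an illustrative PyTorch sketch only; the module names, layer widths, and dimensionalities are our assumptions and are not taken from the paper.

```python
# Illustrative sketch (not the authors' code) of the hierarchical policy in
# Eq. (1): a sub-policy maps a goal state to a Gaussian over the action latent
# variable z, and a trajectory generative model decodes z into motor actions.
import torch
import torch.nn as nn

GOAL_DIM, LATENT_DIM, ACTION_DIM, HORIZON = 3, 6, 7, 40  # assumed sizes

class SubPolicy(nn.Module):
    """pi_theta(z | s_g): Gaussian distribution over the action latent variable."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(GOAL_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * LATENT_DIM))

    def forward(self, goal):
        mu, log_std = self.net(goal).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, log_std.exp())

class TrajectoryDecoder(nn.Module):
    """p_omega(tau | z): maps a latent action sample to a sequence of motor actions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, HORIZON * ACTION_DIM))

    def forward(self, z):
        return self.net(z).view(-1, HORIZON, ACTION_DIM)

def expected_reward(sub_policy, decoder, goal, reward_fn, n_samples=32):
    """Monte Carlo estimate of the expected reward in Eq. (1) for one goal state."""
    z = sub_policy(goal).rsample((n_samples,))   # (n_samples, LATENT_DIM)
    tau = decoder(z)                             # (n_samples, HORIZON, ACTION_DIM)
    return reward_fn(goal, tau).mean()           # reward_fn supplied by the environment
```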

The complete policy network is illustrated in Figure 2 in blue. A goal state $s_g$ is first mapped into a latent action sample $z$ by the sub-policy $\pi_\theta(z \mid s_g)$ and then to a sequence of motor actions $\tau$ by the generative model $p_\omega(\tau \mid z)$. The two parametric models $\pi_\theta$ and $p_\omega$ together form the policy. The sub-policy can be trained independently of the robotic hardware platform, since the structure of the action latent space can be fixed such that the same sub-policy can be used for all robotic platforms without any model adaptation (Sec. IV-C). However, the generative model generates platform-dependent low-level motor actions, which requires the parameters $\omega$ to be adapted to different platforms. The generative model can be trained on a dataset $\mathcal{D}$ of action sequences demonstrated on a given robotic platform based on variational autoencoders (VAEs) [32, 17], which maximize the following variational lower bound

$$\mathcal{L}(\omega, \phi) = \mathbb{E}_{\tau \sim \mathcal{D}} \Big[ \mathbb{E}_{q_\phi(z \mid \tau)} \big[ \log p_\omega(\tau \mid z) \big] - D_{\mathrm{KL}}\big( q_\phi(z \mid \tau) \,\|\, p(z) \big) \Big], \qquad (2)$$

where $q_\phi(z \mid \tau)$ denotes the approximate posterior distribution and $p(z)$ the prior over the action latent variable, which is set to the standard normal distribution.
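The lower bound of Eq. (2) is the standard VAE objective with a standard normal prior over the action latent variable. Below is a minimal sketch, assuming a Gaussian encoder and a squared-error reconstruction term; `decoder` can be any module mapping latent samples to trajectories (e.g., the `TrajectoryDecoder` sketched above), and all names and sizes are illustrative assumptions.

```python
# Illustrative sketch of the variational lower bound in Eq. (2) for the
# trajectory generative model; decoder is any module mapping z to trajectories.
import torch
import torch.nn as nn

LATENT_DIM, ACTION_DIM, HORIZON = 6, 7, 40   # assumed sizes

class TrajectoryEncoder(nn.Module):
    """Approximate posterior q_phi(z | tau) over the action latent variable."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(),
                                 nn.Linear(HORIZON * ACTION_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * LATENT_DIM))

    def forward(self, tau):
        mu, log_std = self.net(tau).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, log_std.exp())

def elbo(encoder, decoder, tau):
    """Eq. (2): reconstruction term minus KL to the standard normal prior."""
    q = encoder(tau)                                  # q_phi(z | tau)
    z = q.rsample()
    recon = decoder(z)                                # reconstructed trajectories
    log_lik = -((recon - tau) ** 2).sum(dim=(1, 2))   # Gaussian log-likelihood (up to a constant)
    prior = torch.distributions.Normal(torch.zeros_like(z), torch.ones_like(z))
    kl = torch.distributions.kl_divergence(q, prior).sum(dim=-1)
    return (log_lik - kl).mean()                      # maximize this lower bound
```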

Multi-task meta-learning: Meta-learning can be used to model the common structure shared across different robotic platforms in a way that enables few-shot transfer to related tasks. Let a task $\mathcal{T}_i$ be defined as the task of generating valid sequences of actions for the $i$-th robotic platform. Let $p(\mathcal{T})$ be an unknown distribution from which we sample infinitely many tasks, each of which is represented by a dataset $\mathcal{D}_i$ consisting of motor action sequences demonstrated on the platform $i$. A common representation of the tasks can be learned by training a meta-model using the MAML framework. MAML learns a common feature representation or network initialization from which an optimal solution on a novel task can be reached with only a small number of gradient steps. MAML consists of (a) a meta-train phase in which the meta-model is trained, and (b) an adaptation or meta-test phase in which the meta-model is adapted to a novel task using gradient descent. The meta-model parameters $\omega$ in phase (a) are trained by optimizing the following objective

$$\min_\omega \; \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\big( \omega - \alpha \nabla_\omega \mathcal{L}_{\mathcal{T}_i}(\omega; \mathcal{D}_i^{S}); \; \mathcal{D}_i^{Q} \big), \qquad (3)$$

where $\alpha$ is the learning rate, which itself is a trainable parameter, and $\mathcal{D}_i^{S}$ and $\mathcal{D}_i^{Q}$ are the support and the query sets, respectively, such that their union forms the entire task dataset $\mathcal{D}_i$. Meta-learning methods optimize for few-shot generalization in a nested scheme consisting of an inner loop, in which the model is adapted to individual tasks using the support set $\mathcal{D}_i^{S}$, and an outer loop, in which the meta-model's parameters are updated by optimizing the sum of the task-specific losses on the query sets $\mathcal{D}_i^{Q}$.
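The nested optimization of Eq. (3) can be sketched as follows, with the inner loop taking one gradient step on the support set and the outer loop back-propagating the query loss through that step. This is a generic MAML sketch under our own naming; `loss_fn`, `tasks`, and the functional parameter handling are assumptions, not the authors' implementation.

```python
# Illustrative MAML sketch for Eq. (3): one inner gradient step on the support
# set per task, then an outer update of the meta-parameters on the query sets.
import torch

def maml_outer_step(meta_params, loss_fn, tasks, inner_lr, meta_opt):
    """meta_params: list of tensors with requires_grad=True.
    loss_fn(params, data): task loss evaluated under the given parameter list.
    tasks: iterable of (support_set, query_set) pairs.
    inner_lr may itself be a trainable tensor, as in the paper."""
    meta_opt.zero_grad()
    outer_loss = 0.0
    for support, query in tasks:
        # Inner loop: adapt the meta-parameters to the task with one gradient step.
        grads = torch.autograd.grad(loss_fn(meta_params, support),
                                    meta_params, create_graph=True)
        adapted = [p - inner_lr * g for p, g in zip(meta_params, grads)]
        # Outer loop: accumulate the query-set loss of the adapted parameters.
        outer_loss = outer_loss + loss_fn(adapted, query)
    outer_loss.backward()   # differentiates through the inner update
    meta_opt.step()
    return outer_loss.item()
```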

IV Few-shot policy training based on meta-learning

In this section, we introduce our probabilistic meta-learning algorithm that outputs a probability distribution over the meta-model parameters. Using this distribution, we sample the parameters of the trajectory generative models that are used to model a given set of few demonstrated trajectories. We formally define the problem set-up in Sec. IV-A, and introduce the framework in Sec. IV-B. Finally, we explain the training of the sub-policy in Sec. IV-C.

IV-A Platform-independent policy training setup

Our goal is to devise a meta-learning algorithm that adapts the policy, more specifically, the trajectory generative model $p_\omega$ shown in Figure 2, to an unseen robotic platform, i.e., a novel meta-task $\mathcal{T}_i$, in a few-shot manner given a support set of demonstrated action sequences $\mathcal{D}_i^{S}$. The meta-model is trained on meta-train datasets $\mathcal{D}_i$, each containing valid sequences of motor actions to run on the $i$-th robot. The datasets are divided into support and query sets $\mathcal{D}_i^{S}$ and $\mathcal{D}_i^{Q}$ containing the support and query sequences of actions, respectively.

The platforms are generated in simulation by choosing a robot model and randomly adjusting the lengths of different robot links. Using existing motion planners, such as the rapidly-exploring random tree (RRT) [30], we then generate several sequences of motor actions that complete the desired goal in simulation. The generated trajectories and the label of the robotic platform form one meta-task dataset $\mathcal{D}_i$, while their union over all meta-tasks forms the meta-train dataset.
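As a rough picture of how such a meta-train dataset could be organized, the sketch below groups planner-generated action sequences per simulated robot and splits them into support and query sets. The data layout and the `plan_trajectories` helper are hypothetical; the paper does not specify a format.

```python
# Illustrative data layout (assumed, not the authors' format): one meta-task
# dataset per simulated robot, split into support and query action sequences.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class MetaTask:
    robot_id: str                 # label of the randomized robotic platform
    support: List[np.ndarray]     # few demonstrated action sequences, each (T, n_joints)
    query: List[np.ndarray]       # held-out action sequences for the outer loop

def make_meta_task(robot_id, trajectories, n_support=5):
    """Split planner-generated trajectories (e.g., from RRT) into support/query sets."""
    return MetaTask(robot_id, trajectories[:n_support], trajectories[n_support:])

# meta_train_set = [make_meta_task(rid, plan_trajectories(rid)) for rid in robot_ids]
# where plan_trajectories is a hypothetical wrapper around the motion planner.
```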

Fig. 2: The computation graph of the action-selection policy at meta-test time.

IV-B Our proposed probabilistic meta-learning framework

Since the small size of a dataset representing a new meta-task gives rise to ambiguities in the meta-model parameters, we extend MAML to a probabilistic setting. The idea of our probabilistic meta-learning algorithm is to represent meta-tasks in a low-dimensional latent space. Similar to [38], we obtain such a low-dimensional representation of the meta-tasks by introducing a meta-task latent variable $z_{\mathcal{T}}$ and a network generative model $g_\psi$, parametrized by $\psi$, that generates the high-dimensional meta-parameters $\omega$ conditioned on the representation $z_{\mathcal{T}}$. Similar to the original MAML's inner-loop optimization, we perform gradient descent optimization directly on the meta-parameters $\omega$, as opposed to [38], which performs the gradient descent in the task latent space. Our choice is a more straightforward extension of MAML into a conditional and probabilistic meta-learning algorithm while still outperforming state-of-the-art meta-learning algorithms in this problem setting. We leave the comparison between our meta-learning algorithm and [38] to future work.

Using the meta-task latent variable $z_{\mathcal{T}}$ we can first generate different initial meta-models that are well-suited for the few-shot gradient descent updates and then choose the one that results in the best performance on the given novel platform. The variable $z_{\mathcal{T}}$ is sampled from a variational distribution $q_\nu(z_{\mathcal{T}} \mid \mathcal{D}^{S})$, parametrized by $\nu$, which is trained jointly with the network generative model using the meta-train dataset as explained next. Note that we choose $q_\nu$ to be a Gaussian distribution.

During the meta-train phase (a), the goal is to find the parameters of the variational distribution $q_\nu$ and the network generative model $g_\psi$ that yield a distribution over the possible meta-model parameters $\omega$. By sampling from this distribution, different platform-dependent trajectory generative models are generated that can be efficiently optimized, as in MAML, by one or a few gradient updates. More precisely, $\nu$ and $\psi$ are optimized by maximizing the following variational lower bound [32]

$$\mathcal{L}(\nu, \psi) = \sum_{\mathcal{T}_i} \Big( \mathbb{E}_{z_{\mathcal{T}} \sim q_\nu(z_{\mathcal{T}} \mid \mathcal{D}_i^{S})} \big[ -\mathcal{L}_{\mathcal{T}_i}( \omega_i' ; \mathcal{D}_i^{Q} ) \big] - \beta \, D_{\mathrm{KL}}\big( q_\nu(z_{\mathcal{T}} \mid \mathcal{D}_i^{S}) \,\|\, p(z_{\mathcal{T}}) \big) \Big), \quad \omega_i' = g_\psi(z_{\mathcal{T}}) - \alpha \nabla_{\omega} \mathcal{L}_{\mathcal{T}_i}\big( g_\psi(z_{\mathcal{T}}) ; \mathcal{D}_i^{S} \big), \qquad (4)$$

where $p(z_{\mathcal{T}})$ is the prior over the meta-task latent variable, $\alpha$ is the step size of the gradient step as in the MAML adaptation step, and $\beta$ is a scalar parameter that balances the two loss terms. Note that we omitted the action latent variable $z$ from the notation in Eq. (4) for the sake of clarity. Please refer to Sec. IV-C for more details.
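A minimal sketch of the meta-train objective in Eq. (4), assuming an encoder that returns a Gaussian over the meta-task latent variable and a network generator that returns the meta-parameters as a list of tensors. Minimizing the returned loss corresponds to maximizing the lower bound; the function names, the inner learning rate, and the value of the balancing weight are our assumptions.

```python
# Illustrative sketch of the meta-train objective in Eq. (4). The encoder maps
# the support set to q(z_T | D^S); the network generator maps a sample z_T to
# the meta-parameters; one MAML-style gradient step on the support loss gives
# the adapted parameters, whose query loss is balanced against a KL term.
import torch

def meta_train_loss(encoder, generator, task_loss, support, query,
                    inner_lr=0.01, beta=1e-3):
    """Negative of the lower bound in Eq. (4) for a single meta-task
    (minimizing this loss maximizes the bound). All values are assumptions."""
    q_z = encoder(support)                       # Gaussian q_nu(z_T | D^S)
    z = q_z.rsample()                            # reparameterized sample
    theta = generator(z)                         # list of generated parameter tensors
    # Inner loop directly on the generated parameters (in contrast to LEO [38]).
    grads = torch.autograd.grad(task_loss(theta, support), theta, create_graph=True)
    adapted = [p - inner_lr * g for p, g in zip(theta, grads)]
    nll = task_loss(adapted, query)              # query-set term of Eq. (4)
    prior = torch.distributions.Normal(torch.zeros_like(z), torch.ones_like(z))
    kl = torch.distributions.kl_divergence(q_z, prior).sum()
    return nll + beta * kl
```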

During the meta-test phase (b), illustrated in Fig. 2 in green, a few sequences of motor actions from a new robotic platform are given to the algorithm, which then returns a distribution over the parameters of the trajectory generative model. More specifically, a few sequences of motor actions are first recorded from human demonstrations. The action sequences are then given to the variational distribution $q_\nu$, from which we sample several different latent task representations $z_{\mathcal{T}}$. These are further mapped to meta-parameters $\omega$ by the network generative model $g_\psi$. Finally, a set of task parameters is found by applying the MAML-style gradient descent update on the loss computed from the demonstrated sequences.
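The meta-test procedure described above could look roughly as follows: encode the few demonstrations, draw several task latent samples, generate candidate meta-parameters, and apply one gradient step on the demonstration loss to each candidate. The helper names and the choice of 20 candidates (matching Sec. V-C) are assumptions for illustration.

```python
# Illustrative meta-test adaptation, following the text: sample several task
# latents from q(z_T | demos), generate candidate parameters, and apply one
# gradient step on the demonstration loss to each candidate.
import torch

def adapt_to_new_robot(encoder, generator, task_loss, demos,
                       n_candidates=20, inner_lr=0.01):
    q_z = encoder(demos)                         # posterior over the task latent
    candidates = []
    for _ in range(n_candidates):
        theta = generator(q_z.sample())          # candidate meta-parameters
        grads = torch.autograd.grad(task_loss(theta, demos), theta)
        candidates.append([p - inner_lr * g for p, g in zip(theta, grads)])
    return candidates   # each candidate is then evaluated on the target robot (Sec. V-C)
```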

IV-C Training the sub-policy

In this section, we explain how we can train a single sub-policy that can be used with different trajectory generative models on different robotic platforms. The sub-policy is trained on only one robot, and the training is performed independently of and prior to the training of the meta-model in phase (a).

The training of the sub-policy as described in [18] requires a trained generative model, which in turn requires a training dataset of motor trajectories. We therefore choose one robotic platform for which we can obtain a training dataset of motor action sequences. We train a VAE on this dataset by optimizing Eq. (2). The obtained generative model is only used to train the sub-policy $\pi_\theta$ with the Expectation-Maximization algorithm introduced in [18].

However, to be able to use the obtained sub-policy with different trajectory generative models on different robotic platforms, we need to ensure a consistent structure of the action latent variable $z$. We define a set of trajectories to be consistent if they result in the same end state after executing them on the respective robotic platforms. For a set of consistent trajectories, the goal is then to obtain the same latent action sample $z$ for all robotic platforms. This is achieved using the reference platform's dataset and the previously trained VAE encoder introduced in Eq. (2). In particular, a given trajectory, regardless of the robotic platform, is matched with a consistent trajectory from the reference platform by comparing the end states. The latter is mapped to a latent action sample $z$ by the VAE encoder. The resulting $z$ and the considered trajectory can then be used to train the meta-model in Eq. (4).
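A possible sketch of this matching step, under the assumption that each reference-platform trajectory is stored together with its end state: the trajectory from an arbitrary platform is matched to the reference trajectory with the closest end state, which is then encoded by the previously trained VAE encoder.

```python
# Illustrative sketch (assumed) of the consistency mapping: match a trajectory
# from an arbitrary platform to the reference-platform trajectory with the
# closest end state, and encode the matched trajectory with the trained VAE.
import numpy as np

def shared_latent(end_state, reference_set, vae_encoder):
    """reference_set: list of (end_state, trajectory) pairs from the reference robot.
    Returns the latent action sample z used to train the meta-model in Eq. (4)."""
    _, ref_traj = min(reference_set,
                      key=lambda pair: np.linalg.norm(pair[0] - end_state))
    return vae_encoder(ref_traj)     # latent action sample shared across platforms
```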

V Experiments

We apply our proposed learning framework to a robotic reaching task, in which the end-effector of the robot has to reach different 3D target positions, and a visuomotor picking task similar to [8], in which a YuMi robot is trained to pick different objects in a cluttered background given image pixels as the input. We experimentally evaluate (1) the generality of our learned meta-model by training it on three different scenarios described below, (2) the benefits of the probabilistic formulation of our method by benchmarking it against MAML, and (3) the benefits of gradient-based optimization by benchmarking it against two gradient-free methods described in Sec. V-B. We experimentally evaluate the generality of our meta-learning method when adapting the policy to (i) a slightly modified robotic hardware generated by changing the length of a link of the robot, and (ii) a completely new robotic platform which is not used during the meta-training phase. We created 400 different robots in simulation by adjusting the mechanical parameters of four 7 degree-of-freedom robotic platforms, ABB YuMi, Kinova, Franka Emika and Baxter, shown in Figure 1. We consider three different scenarios: (a) meta-training given data from robots built on one of the platforms (100 robots), and adapting to novel robots built on the same platform, (b) meta-training given all of the simulated robots and adapting as in (a), and (c) meta-training given data from robots built on only three platforms (300 robots) and adapting to novel robots built on the excluded platform. Scenarios (a) and (b) are used to evaluate criterion (i), while scenario (c) is used to evaluate criterion (ii).

At meta-test time, we sample novel robots for each platform introduced in Figure 1. We use 5 demonstrated motion trajectories to adapt the trajectory generative model produced by the meta-learning agent to the target platform. The demonstrated trajectories are consistent across platforms as defined in Sec. IV-C. We evaluate the performance of the obtained policy on the novel robot using Eq. (1).

V-A Network architectures

In this section, we introduce the network architectures of the sub-policy, the trajectory generative model, and our proposed meta-learner model. Following [8], the sub-policy consists of four convolutional layers followed by three fully connected dense layers. For the picking task, the sub-policy maps an input image, and for the reaching task a 2D target position, into a 6D action latent variable; in the latter case, the convolutional layers are excluded. The generative model, consisting of two dense layers, then maps the 6D action latent variable into a trajectory of motor actions over a fixed number of time-steps. The motor actions are joint-level position commands sent to a position controller.

Our proposed meta-model consists of an encoder and a network generative model. The encoder maps 5 demonstrated motor trajectories into the 2D task latent variable using 4 dense layers. The network generative model consists of 5 dense layers and generates the parameters of the meta-model neural network (having 832 weights and 106 biases) given the 2D task latent variable as the input. The meta-model has the same architecture as the trajectory generative model, and is adapted to a platform-specific generative model after one or a few gradient updates. The Rectified Linear Unit (ReLU) is used as the non-linearity throughout our network architecture.
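The description above can be summarized in a sketch of the meta-learner. The hidden widths below are assumptions; the only sizes taken from the text are the 2D task latent, the 6D action latent, and the 832 weights and 106 biases of the generated meta-model (a 6-8-98 two-layer network is one configuration consistent with those totals, but the paper does not state it).

```python
# Illustrative sketch of the meta-learner of Sec. V-A. Hidden widths are
# assumptions; the 6-8-98 meta-model below is one configuration consistent
# with the reported 832 weights and 106 biases, but is not stated in the text.
import torch
import torch.nn as nn

TASK_LATENT_DIM = 2       # 2D meta-task latent variable (stated)
ACTION_LATENT_DIM = 6     # 6D action latent variable (stated)
HIDDEN, TRAJ_DIM = 8, 98  # assumed hidden size and flattened trajectory size
N_WEIGHTS = ACTION_LATENT_DIM * HIDDEN + HIDDEN * TRAJ_DIM   # = 832
N_BIASES = HIDDEN + TRAJ_DIM                                 # = 106

class TaskEncoder(nn.Module):
    """4 dense layers mapping 5 stacked demonstrations to a Gaussian over z_T."""
    def __init__(self, demo_dim):
        super().__init__()
        widths, layers, d = (256, 128, 64), [], demo_dim      # assumed widths
        for h in widths:
            layers += [nn.Linear(d, h), nn.ReLU()]
            d = h
        layers.append(nn.Linear(d, 2 * TASK_LATENT_DIM))
        self.net = nn.Sequential(*layers)

    def forward(self, demos):                                 # demos: flat vector
        mu, log_std = self.net(demos).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, log_std.exp())

class NetworkGenerator(nn.Module):
    """5 dense layers mapping z_T to the weights and biases of the meta-model."""
    def __init__(self):
        super().__init__()
        widths, layers, d = (64, 128, 256, 512), [], TASK_LATENT_DIM  # assumed widths
        for h in widths:
            layers += [nn.Linear(d, h), nn.ReLU()]
            d = h
        layers.append(nn.Linear(d, N_WEIGHTS + N_BIASES))
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        w1, w2, b = self.net(z).split(
            [ACTION_LATENT_DIM * HIDDEN, HIDDEN * TRAJ_DIM, N_BIASES])
        return [w1.view(HIDDEN, ACTION_LATENT_DIM),      # layer-1 weights of the meta-model
                w2.view(TRAJ_DIM, HIDDEN),               # layer-2 weights
                b[:HIDDEN], b[HIDDEN:]]                  # layer-1 and layer-2 biases
```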

Fig. 3: Scenario (a): The average and the 95% confidence interval of the reaching error (in cm) of the adapted policies for all methods. The meta-learner is trained and tested on the data generated by the same platform. Y stands for YuMi, B for Baxter, K for Kinova, and F for Franka Emika, while the numbers denote the index of the test robot.

V-B Benchmarking methods

We benchmark our method against MAML [13], VERSA [19] and a method based on amortized variational inference (AVI) [37] using three different meta-train and meta-test datasets constructed according to the three scenarios. These methods are used to adapt the parameters of the trajectory generative model given a few demonstrated motion trajectories during the meta-test phase.

MAML: The meta-model in MAML has the same architecture as the trajectory generative model (introduced in Sec. V-A) and is adapted to a platform-specific generative model using one gradient update.

VERSA: We adopt the VERSA model used for few-shot view reconstruction in [19] to obtain the motor trajectory generative model. The meta-learner consists of an encoder that, similarly to our approach, maps 5 motor trajectories into a 2D task latent variable. However, the task latent variable is given directly as an input to the trajectory generative model, as opposed to the network generative model as in our method. Therefore, in VERSA, the parameters of the trajectory generative model are platform-independent.

AVI: In order to study the benefit of the gradient descent component of our method, we exclude the gradient optimization step from our method. The meta-learner in this case consists of a task encoder and a network generative model.

For both VERSA and AVI, we use the same network architecture as described in Sec. V-A. We train all the methods, including ours, for 1000 epochs using the ADAM optimizer [31]; the learning rates and all other hyper-parameters are found by searching for optimal values.

Fig. 4: Scenario (b): The average and 95% confidence interval of the reaching error as in Fig. 3. The meta-learner is trained and tested on the data generated by all four platforms.
Fig. 5: Scenario (c): The average and 95% confidence interval of the reaching error as in Fig. 3. The meta-learner is trained on the data from all platforms except the one that is used for meta-testing.

V-C Evaluating the meta-learning agent

For our method, VERSA and AVI, we sampled 20 policies from the learned distribution based on the 5 motion trajectories. In Fig. 3-5 we provide the average and 95% confidence interval of the reaching error of the obtained 20 policies for the robotic reaching task. For MAML we report the reaching error of the single obtained solution. Our experimental results show that our method outperforms MAML, VERSA, and AVI in most cases in all three scenarios. In scenario (a), in which the meta-train and meta-test datasets include data from only one platform, we observe that in most cases our method performs considerably better than the others, also with lower variance (Fig. 3). This experiment suggests that our few-shot learning approach performs well when adapting the policy to changes in the same robot hardware.

For scenario (b), in which the meta-train dataset contains data from all 4 robot platforms, MAML performs significantly worse compared to scenario (a) (Fig. 4). This means that providing a more diverse dataset deteriorates MAML's adaptation performance: such diversity makes it difficult for MAML to find a single network initialization from which different meta-tasks can be solved. For this scenario, our method provides the best average performance in most cases, and achieves significant gains compared to prior methods on the YuMi and Kinova platforms. These results demonstrate the effectiveness of our method when adapting to a wider range of meta-tasks.

Fig. 6: Visualization of the meta-task latent space. Left: meta-task latent samples corresponding to the training dataset of a model trained for scenario (b). Right: meta-task latent samples corresponding to the training dataset of a model trained for scenario (c) as well as to the testing dataset of the excluded robot (in red).

Next, we evaluate our method based on criterion (ii) using scenario (c), in which the meta-learner should adapt to out-of-distribution tasks (Fig. 5). We observe that none of the methods yield satisfactory performance. The results of our method are slightly better for the Baxter and YuMi robots, potentially due to the similarities between these two robots. To further analyze the results, we also studied the latent structure of our method for models trained in scenarios (b) and (c). In Fig. 6 we illustrate the mean values of the meta-task latent variable obtained from the Gaussian encoder for all the robots used to generate the meta-train datasets in scenarios (b) and (c). The left figure illustrates the latent space of a model trained on scenario (b). We see that our method separates the four robotic platforms well in the meta-task latent space. The right three figures in Fig. 6 illustrate the meta-task latent space of three independently trained models for scenario (c), in which Baxter robots are excluded from the meta-train phase. The figures also show the Baxter robots encoded at meta-test time. The results suggest that the Baxter robots (at meta-test) are encoded close to the YuMi robots that were used to train the meta-learner, which is consistent with the results visualized in Fig. 5.

Beyond these visualizations, we add one new experiment to understand how we might mitigate the challenge of adapting to out-of-distribution tasks. In particular, to avoid meta-testing on completely out-of-distribution tasks as in scenario (c), we included 10 and 20 meta-task samples of the excluded platform in the meta-train dataset and repeated the experiment. The results (Fig. 7) are significantly better than the results reported for scenario (c), suggesting that a relatively small amount of meta-training data on these platforms is particularly helpful for alleviating out-of-distribution challenges.

Finally, we present our experimental results for the visuomotor picking task on a real YuMi robot. First, we built 4 different robot grippers, each longer than the original YuMi robot gripper by a different amount. Using our method, we meta-trained on 100 simulated robots as described in scenario (a) and meta-tested on the real YuMi robot with the four different grippers. We then measured the picking success rate after 5-shot adaptation to the YuMi robot with the different gripper sizes. These results validate that our approach can scale to a real robot with image observations.

Fig. 7: The average and 95% confidence interval of the reaching error as in Fig. 3. Our meta-learner is trained on the data from all platforms except the one used for meta-testing, of which 0, 10 and 20 meta-task samples are used during meta-training.

VI Conclusion

The primary goal of this study was to answer the following question: is it possible to obtain a meta-learner that captures the common structure shared across different robotic platforms such that a policy can be adapted to a novel platform in an unambiguous and data-efficient manner? We approached this problem by introducing a MAML-based probabilistic meta-learning framework in which the meta-tasks are represented by a low-dimensional meta-task latent variable. We exploited the latent meta-task variable to model the meta-task uncertainties originating from the few-shot formulation of the policy adaptation problem. We studied the performance of our proposed framework on a reaching task performed in simulation on 400 robots generated by varying the physical parameters of existing robotic platforms. Our method obtained superior results compared to the benchmark MAML, VERSA and AVI frameworks. Policy adaptation was particularly successful for robotic platforms whose variations of the physical parameters were present in the meta-train dataset. However, the performance of all considered frameworks dropped when the adaptation was done on a novel robotic platform without a similar representative in the meta-train dataset. We leave such out-of-distribution cases, as well as evaluation on more complex visuomotor scenarios using several more platforms, to future work. Given the promising results in this study, we ultimately aim to apply our framework to a multi-robot multi-task scenario.

Acknowledgments

This work was supported by Knut and Alice Wallenberg Foundation, the EU through the project EnTimeMent and the Swedish Foundation for Strategic Research through the COIN project.

References

  • [1] M. Al-Shedivat, T. Bansal, Y. Burda, I. Sutskever, I. Mordatch, and P. Abbeel (2018) Continuous adaptation via meta-learning in nonstationary and competitive environments. In ICLR, Cited by: §II.
  • [2] O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al. (2020) Learning dexterous in-hand manipulation. The International Journal of Robotics Research. Cited by: §II.
  • [3] K. Arndt, A. Ghadirzadeh, M. Hazara, and V. Kyrki (2020) Few-shot model-based adaptation in noisy conditions. arXiv preprint arXiv:2010.08397. Cited by: §II.
  • [4] K. Arndt, M. Hazara, A. Ghadirzadeh, and V. Kyrki (2020) Meta reinforcement learning for sim-to-real domain adaptation. In ICRA, Cited by: §II.
  • [5] A. Bonardi, S. James, and A. J. Davison (2020) Learning one-shot imitation from humans without humans. IEEE Robotics and Automation Letters. Cited by: §II.
  • [6] J. Bütepage, A. Ghadirzadeh, Ö. Öztimur Karadag, M. Björkman, and D. Kragic (2020) Imitating by generating: deep generative models for imitation of interactive tasks. Frontiers in Robotics and AI. Cited by: §II.
  • [7] T. Chen, A. Murali, and A. Gupta (2018) Hardware conditioned policies for multi-robot transfer learning. In NeurIPS, Cited by: §II, §II.
  • [8] X. Chen, A. Ghadirzadeh, M. Björkman, and P. Jensfelt (2020) Adversarial feature training for generalizable robotic visuomotor control. In ICRA, Cited by: §II, §V-A, §V.
  • [9] I. Clavera, A. Nagabandi, S. Liu, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn (2019) Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. In ICLR, Cited by: §I, §II.
  • [10] S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn (2019) RoboNet: large-scale multi-robot learning. In Conference on Robot Learning, Cited by: §II, §II.
  • [11] C. Devin, A. Gupta, T. Darrell, P. Abbeel, and S. Levine (2017) Learning modular neural network policies for multi-task and multi-robot transfer. In ICRA, Cited by: §II, §II.
  • [12] Y. Duan, M. Andrychowicz, B. Stadie, O. J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba (2017) One-shot imitation learning. In NeurIPS, Cited by: §II.
  • [13] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. ICML. Cited by: §I, §I, §II, §V-B.
  • [14] C. Finn and S. Levine (2017) Deep visual foresight for planning robot motion. In ICRA, Cited by: §II.
  • [15] C. Finn, K. Xu, and S. Levine (2018) Probabilistic model-agnostic meta-learning. In NeurIPS, Cited by: §I, §II.
  • [16] C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine (2017) One-shot visual imitation learning via meta-learning. In Conference on Robot Learning, Cited by: §II.
  • [17] A. Ghadirzadeh, A. Maki, D. Kragic, and M. Björkman (2017) Deep predictive policy training using reinforcement learning. In IROS, Cited by: §I, §III, §III.
  • [18] A. Ghadirzadeh, P. Poklukar, V. Kyrki, D. Kragic, and M. Björkman (2020) Data-efficient visuomotor policy training using reinforcement learning and generative models. ArXiv:2007.13134. Cited by: §III, §IV-C.
  • [19] J. Gordon, J. Bronskill, M. Bauer, S. Nowozin, and R. Turner (2019) Meta-learning probabilistic inference for prediction. In ICLR, Cited by: §I, §I, §II, §V-B, §V-B.
  • [20] E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths (2018) Recasting gradient-based meta-learning as hierarchical bayes. In ICLR, Cited by: §II.
  • [21] A. Gupta, R. Mendonca, Y. Liu, P. Abbeel, and S. Levine (2018) Meta-reinforcement learning of structured exploration strategies. In NeurIPS, Cited by: §II.
  • [22] A. Hämäläinen, K. Arndt, A. Ghadirzadeh, and V. Kyrki (2019) Affordance learning for end-to-end visuomotor robot control. In IROS, Cited by: §II.
  • [23] J. Harrison, A. Sharma, and M. Pavone (2018) Meta-learning priors for efficient online bayesian regression. In International Workshop on the Algorithmic Foundations of Robotics, Cited by: §II.
  • [24] K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. Riedmiller (2018) Learning an embedding space for transferable robot skills. In International Conference on Learning Representations, Cited by: §II.
  • [25] W. Huang, I. Mordatch, and D. Pathak (2020) One policy to control them all: shared modular policies for agent-agnostic control. ICML. Cited by: §II, §II.
  • [26] M. A. Jamal and G. Qi (2019) Task agnostic meta-learning for few-shot learning. In CVPR, Cited by: §II.
  • [27] S. James, M. Bloesch, and A. J. Davison (2018) Task-embedded control networks for few-shot imitation learning. In Conference on Robot Learning, Cited by: §II.
  • [28] R. Julian, B. Swanson, G. S. Sukhatme, S. Levine, C. Finn, and K. Hausman (2020) Efficient adaptation for end-to-end vision-based robotic manipulation. arXiv preprint arXiv:2004.10190. Cited by: §I.
  • [29] J. Kaddour, S. Sæmundsson, and M. P. Deisenroth (2020) Probabilistic active meta-learning. ArXiv:2007.08949. Cited by: §II.
  • [30] S. Karaman and E. Frazzoli (2010) Incremental sampling-based algorithms for optimal motion planning. Robotics Science and Systems VI. Cited by: §IV-A.
  • [31] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §V-B.
  • [32] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. ArXiv:1312.6114. Cited by: §III, §IV-B.
  • [33] S. Levine, C. Finn, T. Darrell, and P. Abbeel (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research. Cited by: §I.
  • [34] H. Li, W. Dong, X. Mei, C. Ma, F. Huang, and B. Hu (2019) LGM-net: learning to generate matching networks for few-shot learning. ArXiv:1905.06331. Cited by: §II.
  • [35] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel (2018) Sim-to-real transfer of robotic control with dynamics randomization. In ICRA, Cited by: §II.
  • [36] L. Pinto and A. Gupta (2017) Learning to push by grasping: using multiple tasks for effective learning. In ICRA, Cited by: §II.
  • [37] S. Ravi and A. Beatson (2019) Amortized bayesian meta-learning. In ICLR, Cited by: §I, §II, §V-B.
  • [38] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell (2019) Meta-learning with latent embedding optimization. In International Conference on Learning Representations, Cited by: §I, §II, §IV-B.
  • [39] F. Sadeghi, A. Toshev, E. Jang, and S. Levine (2018) Sim2real viewpoint invariant visual servoing by recurrent control. In CVPR, Cited by: §II.
  • [40] C. Schaff, D. Yunis, A. Chakrabarti, and M. R. Walter (2019) Jointly learning to construct and control agents using deep reinforcement learning. In ICRA, Cited by: §II, §II.
  • [41] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. ArXiv:1707.06347. Cited by: §I.
  • [42] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In NeurIPS, Cited by: §II.
  • [43] X. Song, Y. Yang, K. Choromanski, K. Caluwaerts, W. Gao, C. Finn, and J. Tan (2020) Rapidly adaptable legged robots via evolutionary meta-learning. ArXiv:2003.01239. Cited by: §II.
  • [44] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In CVPR, Cited by: §II.
  • [45] A. Xie, J. Harrison, and C. Finn (2020) Deep reinforcement learning amidst lifelong non-stationarity. arXiv preprint arXiv:2006.10701. Cited by: §II.
  • [46] J. Yoon, T. Kim, O. Dia, S. Kim, Y. Bengio, and S. Ahn (2018) Bayesian model-agnostic meta-learning. In NeurIPS, Cited by: §II.
  • [47] T. Yu, C. Finn, A. Xie, S. Dasari, T. Zhang, P. Abbeel, and S. Levine (2018) One-shot imitation from observing humans via domain-adaptive meta-learning. ArXiv:1802.01557. Cited by: §II.
  • [48] Y. Zou and X. Lu (2020) Gradient-em bayesian meta-learning. ArXiv:2006.11764. Cited by: §II.