Imagine a company that is developing software for robots to be deployed in households or flexible manufacturing situations. Each of these settings might be fairly different in terms of the types of objects to be manipulated, the distribution over object arrangements, or the typical goals. However, they all have the same basic underlying kinematic and physical constraints, and could in principle be solved by the same general-purpose task and motion planning (tamp) system. Unfortunately, tamp is highly computationally intractable in the worst case, involving a combination of search in symbolic space, search for motion plans, and search for good values for continuous parameters such as object placements and robot configurations that satisfy task requirements.
A robot faced with a distribution over concrete tasks can learn to perform tamp more efficiently by adapting its search strategy to suit these tasks. It can learn a small set of typical grasps for the objects it handles frequently, or good joint configurations for taking objects out of a milling machine in its workspace. This distribution cannot be anticipated by the company for each robot, so the best the company can do is to ship robots that are equipped to learn very quickly once they begin operating in their respective new workplaces.
The problem faced by this hypothetical company can be framed as one of meta-learning: given a set of tasks drawn from some meta-level task distribution, learn some structure or parameters that can be used as a prior so that the system, when faced with a new task drawn from that same distribution, can learn very quickly to behave effectively.
Concretely, in this work we focus on improving the interface between symbolic aspects of task planning and continuous aspects of motion planning. At this interface, given a symbolic plan structure, it is necessary to select values for continuous parameters that will make lower-level motion-planning queries feasible, or to determine that the symbolic structure itself is infeasible. Typical strategies are to search over randomly sampled values for these parameters, or to use hand-coded “generators” to sample them [1, 2].
Our strategy is to learn deterministic functions, which we call specializers, that map a symbolic operator (such as place(objA)) and a detailed world state description (including object shapes, sizes, poses, etc.) into continuous parameter values for the operator (such as a grasp pose). Importantly, rather than focusing on learning a single set of specializers from a large amount of data at deployment time, we will focus on meta-learning approaches that allow specializers to be quickly adapted online. We will use deep neural networks to represent specializers and backpropagation to train them.
We compare two different modular meta-learning strategies: one, based on maml [3], focuses on learning neural network weights that can be quickly adapted via gradient descent in a new task; the other, based on BounceGrad [4], focuses on learning a fixed set of neural network “modules” that can be quickly chosen among in a new task.
We demonstrate the effectiveness of these approaches in an object manipulation domain, illustrated in Figure 1, in which the robot must move all of the objects from one table to another. This general goal is constant across tasks, but different tasks will vary the object shapes and sizes, requiring the robot to learn to manipulate these different types of objects. We conjecture that the meta-learning approach will allow the system, at meta-training time, to discover generally useful, task-independent strategies, such as placing objects near the back of the available space; and, at deployment time, to quickly learn to adapt to novel object geometries. Note that our methods are agnostic to exactly which aspects of the environments are common to all tasks and which vary – these concepts are naturally discovered by the meta-learning algorithm. We show that the meta-learning strategies perform better than both a random sampler and a reasonable set of hand-built, task-agnostic, uninformed specializers.
II Related Work
Our work is focused on choosing continuous action parameters within the context of a symbolic plan. We are not learning control policies for tasks [5, 6, 7], nor are we learning planning models of tasks [8]. We assume any necessary policies and planning models exist; our goal is to make planning with such models more efficient by learning specializers via modular meta-learning. There is existing work on learning samplers for continuous action parameters in the context of task and motion planning [9, 10, 11, 12], but these works do not explicitly consider the problem of learning samplers that can be quickly adapted to new tasks. Our work addresses this problem using meta-learning. Specializers are closely related to the notion of generators, used in some task and motion planning systems [1, 2].
Meta-learning is a particularly important paradigm for learning in robotics, where training data can be very expensive to acquire, because it dramatically lowers the data requirements for learning new tasks. Although it has a long history in the transfer learning literature [13], meta-learning was recently applied with great effectiveness to problems in robotics by Finn et al. [3].
Learning Modular Structure: Our approach is a modular learning approach, in the sense of Andreas et al. [14]: the learned specializers are associated with planning operators, allowing them to be recombined in new ways to solve novel problems. Andreas et al. [15] use reinforcement learning to train subtask modules in domains with decomposable goals. Unlike in our work, they assume a policy sketch (sequence of symbolic operators) is given. The challenge in their work instead stems from learning to map the symbolic language to control actions. Devin et al. [16] decompose visuomotor policies into “task-specific” and “robot-specific” modules, and show that these modules can be composed to solve new tasks. Separate lines of work have focused on learning modular parameterized skills from the perspectives of control theory [17, 18] and learning from demonstration [19, 20].
Modular Meta-Learning: Modular meta-learning was developed by Alet et al. [4] and forms the inspiration for this work. Their work includes an EM-like training procedure that alternates between composing and optimizing neural network modules, and also includes a mechanism for choosing the best compositional structure of the modules to fit a small amount of training data on a new task.
III Problem Setting
In this section, we describe the class of task and motion planning meta-learning problems.
III-A Task and Motion Planning
Robot task and motion planning (tamp) problems are typically formulated as discrete-time planning problems in a hybrid discrete-continuous state transition system [21, 22], with discrete variables modeling which objects are being manipulated and other task-level aspects of the domain, and continuous variables modeling the robot configuration, object poses, and other continuous properties of the world.
A world state $s \in \mathcal{S}$ is a complete description of the state of the world, consisting of $(q, o_1, \ldots, o_n, x_1, \ldots, x_m)$, where $q$ is the robot configuration, the $o_i$ are the states of each object in the domain, and the $x_i$ are other discrete or continuous state variables, such as the index of the held object. An object’s state $o_i$ has $d$ dimensions describing its properties, including pose and shape, as well as possibly other attributes such as mass, color, or owner.
We now define a tamp problem, using some definitions from Garrett et al. [21]. A predicate is a Boolean function. A fluent is an evaluation of a predicate on a tuple $(\bar{o}, \bar{\theta})$, where $\bar{o}$ is a set of discrete objects and $\bar{\theta}$ is a set of continuous values. A set of basis predicates $\mathcal{B}$ can be used to completely describe a world state. Given a world state $s$, the set of basis fluents $\Phi_{\mathcal{B}}(s)$ is the maximal set of atoms that are true in $s$ and can be constructed from the basis predicate set $\mathcal{B}$. A set of derived predicates can be defined in terms of basis predicates. A planning state $I$ is a set of fluents, including a complete set of basis fluents and any number of derived fluents; it is assumed that any fluent not in $I$ is false.
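An action or operator $a$ is given by an argument tuple $\bar{O}$ and a parameter tuple $\bar{\Theta}$, a set of fluent preconditions $\mathrm{pre}(a)$ on $(\bar{O}, \bar{\Theta})$, and a set of positive and negative fluent effects $\mathrm{eff}(a)$ on $(\bar{O}, \bar{\Theta})$. An action instance $a(\bar{o}, \bar{\theta})$ is an action $a$ with arguments and parameters $(\bar{O}, \bar{\Theta})$ replaced with discrete objects $\bar{o}$ and continuous values $\bar{\theta}$. An action instance $a(\bar{o}, \bar{\theta})$ is applicable in planning state $I$ if $\mathrm{pre}(a(\bar{o}, \bar{\theta})) \subseteq I$. The result of applying an action instance $a(\bar{o}, \bar{\theta})$ to a planning state $I$ is a new state $I' = (I \setminus \mathrm{eff}^-(a(\bar{o}, \bar{\theta}))) \cup \mathrm{eff}^+(a(\bar{o}, \bar{\theta}))$, where $\mathrm{eff}^+$ and $\mathrm{eff}^-$ are the positive and negative fluents in $\mathrm{eff}$, respectively. For $a$ to be a well-formed action, $\mathrm{eff}^+$ and $\mathrm{eff}^-$ must be structured so that the planning state $I'$ resulting from applying $a$ is valid (contains a complete set of basis fluents). Then $I'$ determines a world state $s' \in \mathcal{S}$, and the action can be viewed as a deterministic transition on world states.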
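A tamp problem $(\mathcal{A}, I, G)$ is given by a set of actions $\mathcal{A}$, an initial planning state $I$, and a goal set of fluents $G$. A sequence of action instances, $\pi = (a_1(\bar{o}_1, \bar{\theta}_1), \ldots, a_k(\bar{o}_k, \bar{\theta}_k))$, is called a plan. A plan $\pi$ is a task-level solution for problem $(\mathcal{A}, I, G)$ if $a_1(\bar{o}_1, \bar{\theta}_1)$ is applicable in $I$, each $a_i(\bar{o}_i, \bar{\theta}_i)$ is applicable in the $(i-1)$th state resulting from application of the previous actions, and $G$ is a subset of the final state.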
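A sequence of actions with arguments instantiated but not parameters, $\hat{\pi} = (a_1(\bar{o}_1, \bar{\Theta}_1), \ldots, a_k(\bar{o}_k, \bar{\Theta}_k))$, is called a plan skeleton. A plan skeleton $\hat{\pi}$ is task-feasible for problem $(\mathcal{A}, I, G)$ if there exist continuous values $\bar{\theta}_1, \ldots, \bar{\theta}_k$ such that plan $(a_1(\bar{o}_1, \bar{\theta}_1), \ldots, a_k(\bar{o}_k, \bar{\theta}_k))$ is a task-level solution for $(\mathcal{A}, I, G)$. A world-state trajectory $\tau(\pi, I)$ for plan $\pi$ and initial planning state $I$ is the sequence of world states $s_0, s_1, \ldots, s_k$ induced by the sequence of planning states $I_0 = I, I_1, \ldots, I_k$ resulting from applying $\pi$ starting from $I$. A plan $\pi$ is a complete solution for the tamp problem $(\mathcal{A}, I, G)$ if, letting $\tau(\pi, I) = (s_0, s_1, \ldots, s_k)$, there exist robot trajectories $\tilde{\tau}_1, \ldots, \tilde{\tau}_k$ such that $\tilde{\tau}_i$ is a collision-free path (a motion plan) from $q_{i-1}$ to $q_i$, the robot configurations in world states $s_{i-1}$ and $s_i$ respectively.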
Finding good search strategies for solving tamp problems is an active area of research, partly owing to the difficulty of finding good continuous parameter values that produce a complete solution. Our learning method could be adapted for use in many tamp approaches, but for clarity we focus on a very simple one, sketched below. See Section V for discussion of the tamp system we use in our implementation. (Footnote 1: Generally, not all of the elements of $\bar{\theta}$ are actually free parameters given a skeleton. Some elements of $\bar{\theta}$ may be uniquely determined by other parameters, or by the state in which an action is applied. We will not complicate our notation by explicitly handling these cases.)
The problems of symbolic task planning to yield plausible plan skeletons (Line 1) and collision-free motion planning (Line 3) are both well studied, and effective solutions exist. We focus on the problem of searching over continuous values $\bar{\theta}_1, \ldots, \bar{\theta}_k$ for the skeleton parameters (Line 2), to find a set resulting in a plan $\pi$ that is a complete solution to the tamp problem. We first outline two simple strategies for finding $\bar{\theta}_1, \ldots, \bar{\theta}_k$. In random sampling, we perform a simple depth-first backtracking search: sample values for $\bar{\theta}_1$ uniformly at random from some legal set, check that there is a motion plan from $q_0$ to $q_1$, continue on to sample $\bar{\theta}_2$ if so, and either sample again or backtrack to a higher node in the search tree if not. In the hand-crafted strategy, we rely on a human programmer to write one or more specializers $\sigma_a^i$ for each action $a$. A specializer is a function $\sigma(I, \bar{o}, j) = \bar{\theta}$, where $I$ is a planning state, $\bar{o}$ are the discrete object arguments to which $a$ will be applied, and $j$ is the step of the skeleton where a particular instance of $a$ occurs. The specializer returns a vector of continuous parameter values $\bar{\theta}$ for $a$. So, in this hand-crafted strategy, for each plan skeleton $\hat{\pi}$ we need only consider the following discrete set of plans $\pi$:
$$\pi = \left(a_1(\bar{o}_1, \sigma_{a_1}^{i_1}(I_0, \bar{o}_1, 1)), \ldots, a_k(\bar{o}_k, \sigma_{a_k}^{i_k}(I_{k-1}, \bar{o}_k, k))\right),$$
where $I_{j-1}$ is the planning state in which step $j$ is applied and the values $i_1, \ldots, i_k$ select which specializer to use for each step. Each setting of the $i_j$ values yields a different plan $\pi$.
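The random-sampling strategy described above can be sketched as a short recursive search. This is only an illustrative sketch: the names `backtracking_sample`, `sample`, `motion_plan_exists`, and `tries_per_step` are hypothetical, a single float stands in for each vector of continuous parameter values, and the feasibility check is a caller-supplied stand-in for a real motion planner.

```python
import random
from typing import Callable, List, Optional, Sequence

Params = float  # stand-in for one step's vector of continuous parameter values


def backtracking_sample(
    num_steps: int,
    sample: Callable[[int, random.Random], Params],
    motion_plan_exists: Callable[[Sequence[Params], Params], bool],
    rng: random.Random,
    tries_per_step: int = 10,
) -> Optional[List[Params]]:
    """Depth-first backtracking search over one parameter value per skeleton step.

    Returns one value per step, or None if the sampling budget is exhausted.
    """

    def extend(prefix: List[Params]) -> Optional[List[Params]]:
        step = len(prefix)
        if step == num_steps:
            return prefix  # every step has a feasible value
        for _ in range(tries_per_step):
            value = sample(step, rng)
            # Keep the value only if a motion plan reaches it from the prefix.
            if motion_plan_exists(prefix, value):
                result = extend(prefix + [value])
                if result is not None:
                    return result
        return None  # give up here; the caller resamples its own value

    return extend([])


# Toy usage: values must increase from one step to the next.
toy_plan = backtracking_sample(
    num_steps=3,
    sample=lambda step, rng: rng.random(),
    motion_plan_exists=lambda prefix, value: not prefix or value > prefix[-1],
    rng=random.Random(0),
    tries_per_step=20,
)
```

The per-step budget is one simple way to decide when to stop resampling and backtrack; the text does not specify how that decision is made.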
Now, it is sensible to combine the search over both skeletons and specializers into a single discrete search. Let $\Sigma(W)$ be a set of specializers (the reason for this notation will become clear in the next section) and $\mathcal{A}(\Sigma(W))$ be a discrete set of “actions” obtained by combining each action $a \in \mathcal{A}$ with each specializer $\sigma_a^i \in \Sigma(W)$, for valid $i$. We obtain our algorithm for planning with specializers:
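A minimal sketch of such a combined discrete search follows, under simplifying assumptions. The function `plan_with_specializers`, the toy `State` type, and the `is_complete_solution` callback (standing in for the applicability and motion-planning checks) are hypothetical names introduced for illustration. For brevity every specializer is evaluated on the same state rather than on the state reached at its own step.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Sequence, Tuple

State = Dict[str, float]                          # toy planning state
Specializer = Callable[[State, str, int], float]  # (state, object, step) -> parameter
Step = Tuple[str, str]                            # (action name, object argument)
Plan = List[Tuple[str, str, float]]               # fully instantiated action instances


def plan_with_specializers(
    skeletons: Sequence[Sequence[Step]],
    specializers: Dict[str, List[Specializer]],
    state: State,
    is_complete_solution: Callable[[Plan], bool],
) -> Optional[Plan]:
    """Search jointly over plan skeletons and per-step specializer choices."""
    for skeleton in skeletons:
        # One index range per step: which specializer of that action to use.
        choices = [range(len(specializers[action])) for action, _ in skeleton]
        for picks in product(*choices):
            plan = [
                (action, obj, specializers[action][i](state, obj, j))
                for j, ((action, obj), i) in enumerate(zip(skeleton, picks))
            ]
            if is_complete_solution(plan):
                return plan
    return None  # no skeleton/specializer combination worked


# Toy usage: two candidate "place" specializers; only values above 0.5 succeed.
toy_specializers = {"place": [lambda s, o, j: 0.2, lambda s, o, j: 0.8]}
toy_result = plan_with_specializers(
    skeletons=[[("place", "objA")]],
    specializers=toy_specializers,
    state={},
    is_complete_solution=lambda plan: all(value > 0.5 for _, _, value in plan),
)
```

Because each specializer is deterministic, the inner loop enumerates a finite set of candidate plans per skeleton instead of sampling continuous values.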
III-B Learning Specializers
We begin by defining our learning problem for just a single task. A single-task specializer learning problem is a tuple $(\mathcal{A}, \mathcal{D}, \Sigma, W)$, where $\mathcal{A}$ is a set of actions specifying the dynamics of the domain, $\mathcal{D}$ is a training set of (initial state, goal) problem instances $(I, G)$, $\Sigma$ is a set of functional forms for specializers (such as neural network architectures), and $W$ is a set of initial weights such that $\Sigma(W)$ is a set of fully instantiated specializers that can be used for planning with the Plan algorithm.
Informally, the objective of our learning problem is to find $W^*$ such that planning with $\Sigma(W^*)$ will, in expectation over new $(I, G)$ problem instances drawn from the same distribution as $\mathcal{D}$, be likely to generate complete solutions. The function classes $\Sigma$ of the specializers are given (just like in the hand-crafted strategy), but the weights $W$ are learned.
Although our ultimate objective is to improve the efficiency of the overall planner, that is done by replacing the search over continuous parameter values with a deterministic choice or search over a finite set of parameter values provided by the specializers; so, our objective is really that these specializers be able to solve problems from $\mathcal{D}$.
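Most directly, we could try to minimize single-task loss on $\mathcal{D}$, so that $W^* = \arg\min_{W} \mathcal{L}_{\mathcal{A},\mathcal{D},\Sigma}(W)$, where:
$$\mathcal{L}_{\mathcal{A},\mathcal{D},\Sigma}(W) = \sum_{(I,G) \in \mathcal{D}} \begin{cases} 0 & \text{if } \mathrm{Plan}(\mathcal{A}, I, G, \Sigma, W) \text{ is a complete solution for } (\mathcal{A}, I, G), \\ 1 & \text{otherwise.} \end{cases}$$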
Unfortunately, this loss is much too difficult to optimize in practice; in Section IV we will outline strategies for smoothing and approximating the objective.
III-C Meta-Learning Specializers
In meta-learning, we wish to learn, from several training tasks, some form of a prior that will enable us to learn to perform well quickly on a new task. A specializer meta-learning problem, given by a tuple $(\mathcal{A}, \mathcal{D}_1, \ldots, \mathcal{D}_N, \Sigma)$, differs from a single-task specializer learning problem both in that it has multiple datasets $\mathcal{D}_j$, and in that it has a different objective. We make the implicit assumption, standard in meta-learning approaches, that there is a hierarchical distribution over $(I, G)$ problems that the robot will encounter: we define a task to be a single distribution over $(I, G)$, and assume we have a distribution over tasks.
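Let $\mathrm{Learn}(\mathcal{A}, \mathcal{D}, \Sigma, W)$ be a specializer learning algorithm that returns $W^*$, tailored to work well on problems drawn from $\mathcal{D}$. Our meta-learning objective will then be to find a value of $W$ that serves as a good prior for Learn on new tasks, defined by new $(I, G)$ distributions. Formally, the meta-learning objective is to find $W^*_{\mathrm{meta}} = \arg\min_{W} \mathcal{L}^{\mathrm{meta}}_{\mathcal{A},\mathcal{D}_1,\ldots,\mathcal{D}_N,\Sigma}(W)$, where the meta-learning loss is, letting $j$ index over tasks:
$$\mathcal{L}^{\mathrm{meta}}_{\mathcal{A},\mathcal{D}_1,\ldots,\mathcal{D}_N,\Sigma}(W) = \sum_{j=1}^{N} \mathcal{L}_{\mathcal{A},\mathcal{D}_j^{\mathrm{test}},\Sigma}\left(\mathrm{Learn}(\mathcal{A}, \mathcal{D}_j^{\mathrm{train}}, \Sigma, W)\right).$$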
The idea is that a new set of weights obtained by starting with $W$ and applying Learn on a training set $\mathcal{D}_j^{\mathrm{train}}$ from task $j$ should perform well on a held-out test set $\mathcal{D}_j^{\mathrm{test}}$ from task $j$.
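After learning $W^*_{\mathrm{meta}}$, the robot is deployed. When it is given a small amount of training data $\mathcal{D}_{\mathrm{new}}$ drawn from a new task, it will call $\mathrm{Learn}(\mathcal{A}, \mathcal{D}_{\mathrm{new}}, \Sigma, W^*_{\mathrm{meta}})$ to get a new set of weights $W_{\mathrm{new}}$, then use the planner $\mathrm{Plan}(\mathcal{A}, I, G, \Sigma, W_{\mathrm{new}})$ to solve future problem instances $(I, G)$ from this new task. If the meta-learning algorithm is effective, it will have learned to (1) learn quickly (from a small dataset $\mathcal{D}_{\mathrm{new}}$) to (2) plan quickly (using the specializers $\Sigma(W_{\mathrm{new}})$ in place of a full search over continuous parameter values $\bar{\theta}$), motivating our title.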
In this section, we begin by describing two single-task specializer learning algorithms, and then we discuss a specializer meta-learning algorithm that can be used with any specializer learning algorithm.
IV-A Single-Task Specializer Learning Algorithms
Recall that an algorithm for learning specializers takes as input $(\mathcal{A}, \mathcal{D}, \Sigma, W)$, where $\mathcal{A}$ are the actions describing domain dynamics, $\mathcal{D}$ is a dataset of $(I, G)$ problem instances, $\Sigma$ is a set of functional forms for specializers, and $W$ is a set of initial weights such that $\Sigma(W)$ is a set of fully instantiated specializers. The job of a learning algorithm is to return $W^*$.
We consider two algorithms: alternating descent (ad) and subset selection (ss).
a) Alternating Descent: ad adjusts the weights $W$ to tune them to dataset $\mathcal{D}$.
If we knew, for each $(I, G) \in \mathcal{D}$, the optimal plan skeleton and choices of specializers that lead to a complete solution $\pi$ for the tamp problem $(\mathcal{A}, I, G)$, then we could adjust the elements of $W$ corresponding to the chosen specializers in order to improve the quality of $\pi$. However, this optimal set of actions and specializers is not known, so we instead perform an EM-like alternating optimization, much like the optimization performed at meta-training time in BounceGrad [4] or at meta-test time in moma [4]. In particular, we use the PlanT algorithm (described in detail later) to find a skeleton $\hat{\pi}$ and sequence of specializers $\sigma_1, \ldots, \sigma_k$ to be optimized. PlanT is an approximation of Plan (Section III-A) that can return illegal plans. Then, we adjust the elements of $W$ corresponding to the $\sigma_j$ so the plan becomes “less illegal.”
More formally, we assume the existence of a predicate loss function $L_p$ for each predicate $p$ in the domain, which takes the arguments of predicate $p$ ($\bar{o}$ and $\bar{\theta}$) and a world state $s$, and returns a non-negative loss measuring the degree to which the fluent $p(\bar{o}, \bar{\theta})$ is violated in $s$. If $p(\bar{o}, \bar{\theta})$ is true in $s$, then $L_p(\bar{o}, \bar{\theta}, s)$ must be zero. For example, if fluent $\phi(o, v)$ asserts that the pose of object $o$ should be the value $v$, then we might use the squared distance $(\hat{v} - v)^2$ as the predicate loss, where $\hat{v}$ is the actual pose of $o$ in $s$.
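Consider the situation in which we run PlanT, and it returns a plan $\pi = (a_1(\bar{o}_1, \bar{\theta}_1), \ldots, a_k(\bar{o}_k, \bar{\theta}_k))$ created from a plan skeleton $\hat{\pi}$ and the chosen specializers $\sigma_1, \ldots, \sigma_k$. From this, we can compute both the induced trajectory of planning states $I_1, \ldots, I_k$, and the induced trajectory of world states $s_1, \ldots, s_k$. We can now define a trajectory loss function on $\pi$ for $(I, G)$:
$$\mathcal{L}_{\mathcal{A},I,G}(\pi) = \sum_{j=1}^{k} \sum_{\phi \in \mathrm{eff}^+(a_j(\bar{o}_j, \bar{\theta}_j))} L_{p_\phi}(\bar{o}_j, \bar{\theta}_j, s_j).$$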
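This is a sum over steps in the plan, and for each step, a sum over positive fluents in its effects, of the degree to which that fluent is violated in the resulting world state $s_j$. Here, $p_\phi$ is the predicate associated with fluent $\phi$. Recall that $\bar{\theta}_j = \sigma_j(I_{j-1}, \bar{o}_j, j; W)$, where we have included $W$ to expose the specializers’ parametric forms. Thus, we have:
$$\mathcal{L}_{\mathcal{A},I,G}(W) = \sum_{j=1}^{k} \sum_{\phi \in \mathrm{eff}^+(a_j(\bar{o}_j, \bar{\theta}_j))} L_{p_\phi}\left(\bar{o}_j, \sigma_j(I_{j-1}, \bar{o}_j, j; W), s_j\right).$$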
If the $L_p$ are differentiable with respect to the $\bar{\theta}$, and the functional forms $\Sigma$ generating the $\bar{\theta}$ are differentiable with respect to $W$ and their continuous inputs, then $W$ can be adjusted via a gradient step to reduce $\mathcal{L}_{\mathcal{A},I,G}(W)$. This method will adjust only the values of $W$ that were used in the specializers chosen by PlanT. We use feedforward neural networks to implement each individual $\sigma$. The overall algorithm is:
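To make the gradient step concrete, here is a deliberately tiny sketch. It assumes a one-parameter linear "specializer" whose output is `w * x` and a squared-distance predicate loss, so the gradient can be written by hand. All names (`predicate_loss`, `trajectory_loss`, `gradient_step`) and the scalar setup are hypothetical stand-ins for the neural-network specializers and backpropagation used in the text.

```python
from typing import List, Tuple

# Each step is (x_j, actual_j): a scalar input feature seen by the specializer
# and the value actually attained in the resulting world state.
Step = Tuple[float, float]


def predicate_loss(asserted: float, actual: float) -> float:
    """Squared distance; zero exactly when the fluent holds."""
    return (actual - asserted) ** 2


def trajectory_loss(w: float, steps: List[Step]) -> float:
    """Sum of predicate losses over plan steps for the toy specializer w * x."""
    return sum(predicate_loss(w * x, actual) for x, actual in steps)


def trajectory_loss_grad(w: float, steps: List[Step]) -> float:
    """Hand-derived derivative of trajectory_loss with respect to w."""
    return sum(2.0 * (w * x - actual) * x for x, actual in steps)


def gradient_step(w: float, steps: List[Step], lr: float) -> float:
    """One descent step on the weight used by the chosen specializer."""
    return w - lr * trajectory_loss_grad(w, steps)


toy_steps = [(1.0, 2.0), (2.0, 4.0)]
w_before = 0.0
w_after = gradient_step(w_before, toy_steps, lr=0.05)
```

In this toy case the loss drops from 20 at `w_before` to 5 at `w_after`; in the full method the same kind of step would be taken by backpropagating through the networks of only those specializers that PlanT selected.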
We now describe in detail the PlanT procedure, which is an approximation of Plan. Generally, while we are learning $W$, we will not have a complete and correct set of specializers, but we still need to assemble plans in order to adjust the $W$. In addition, to avoid local optima, and inspired by the use of simulated annealing for structure search in BounceGrad [4] and moma [4], we do not always consider the plan $\pi$ with least loss early on. PlanT, rather than trying to find a $\pi$ that is a complete solution to the problem, treats SymbolicPlan as a generator, generates symbolic plans that are not necessarily valid solutions, and for each one that is feasible with respect to motion planning, computes its loss. It then samples a plan to return using a Boltzmann distribution derived from the losses, with “temperature” parameter $T$ computed as a function of the number of optimization steps done so far. This temperature should be chosen to go to zero as the number of optimization steps increases.
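The plan-sampling step can be illustrated with a small sketch. The inverse-linear `temperature_schedule` below is an assumption chosen only because it goes to zero as the step count grows; the text does not state which schedule is used, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def boltzmann_probs(losses: Sequence[float], temperature: float) -> List[float]:
    """Probabilities proportional to exp(-loss / temperature)."""
    # Subtracting the minimum loss keeps every exponent <= 0 (no overflow)
    # and leaves the normalized distribution unchanged.
    lowest = min(losses)
    weights = [math.exp(-(loss - lowest) / temperature) for loss in losses]
    total = sum(weights)
    return [w / total for w in weights]


def temperature_schedule(step: int, initial: float = 1.0) -> float:
    """Hypothetical annealing schedule that tends to zero as step grows."""
    return initial / (1.0 + step)


def sample_plan_index(losses: Sequence[float], step: int, rng: random.Random) -> int:
    """Pick the index of a candidate plan; low-loss plans are favored."""
    probs = boltzmann_probs(losses, temperature_schedule(step))
    return rng.choices(range(len(losses)), weights=probs, k=1)[0]


candidate_losses = [1.0, 2.0, 3.0]
early = boltzmann_probs(candidate_losses, temperature_schedule(0))
late = boltzmann_probs(candidate_losses, temperature_schedule(1000))
```

Early in training the distribution spreads probability over several candidate plans; as the temperature shrinks it concentrates on the plan with least loss.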
b) Subset Selection: ss assumes that $\Sigma$ includes a large set of specializers, and simply selects a subset of them to use during planning, without making any adjustments to the weights $W$. Let $\Sigma_a$ be the set of specializers for action $a$ and let integer $r$ be a parameter of the algorithm. The ss algorithm simply finds the size-$r$ subset $S_a \subseteq \Sigma_a$ for each action $a$ such that the loss $\mathcal{L}_{\mathcal{A},\mathcal{D},\Sigma}(\bigcup_a S_a)$ is minimized. There are many strategies for finding such a set; in our experiments, we have a small number of actions and set $r = 1$, and so we can exhaustively evaluate all possible combinations. (Footnote 2: Technically speaking, the argument to $\mathcal{L}$ should be all the weights $W$; we can assume that $\bigcup_a S_a$ denotes the following operation: leave the elements of $W$ that parameterize the $S_a$ unchanged, and set the rest to 0.)
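Exhaustive subset selection is easy to sketch with the standard library. In the sketch below, specializers are represented only by string labels and the `loss` callback is a hypothetical stand-in for evaluating planning performance on a batch of training problems.

```python
from itertools import combinations, product
from typing import Callable, Dict, List, Tuple

Selection = Dict[str, Tuple[str, ...]]  # action name -> chosen specializer labels


def subset_selection(
    specializers: Dict[str, List[str]],
    subset_size: int,
    loss: Callable[[Selection], float],
) -> Selection:
    """Exhaustively pick, per action, the subset of specializers with least loss."""
    actions = sorted(specializers)
    # All subsets of the requested size for each action (capped at what exists).
    per_action = [
        list(combinations(specializers[a], min(subset_size, len(specializers[a]))))
        for a in actions
    ]
    candidates = [dict(zip(actions, choice)) for choice in product(*per_action)]
    return min(candidates, key=loss)


# Toy usage: three grasp specializers, one place specializer, subset size 1.
toy_losses = {"g0": 3.0, "g1": 1.0, "g2": 2.0}
toy_selection = subset_selection(
    {"grasp": ["g0", "g1", "g2"], "place": ["p0"]},
    subset_size=1,
    loss=lambda chosen: toy_losses[chosen["grasp"][0]],
)
```

The number of candidates is the product of the per-action subset counts, which is why exhaustive evaluation is only practical with few actions and small subsets.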
IV-B Specializer Meta-Learning Algorithm
Recall that the meta-learning problem is to take in $N$ different datasets $\mathcal{D}_1, \ldots, \mathcal{D}_N$ and return $W^*_{\mathrm{meta}}$, which should be a good starting point for a learning algorithm on a new task. This ideal objective is difficult to optimize, so we must make approximations.
We begin by describing the meta-learning algorithm, which follows a strategy very similar to maml [3]. We do stochastic gradient descent in an outer loop: draw a task $\mathcal{D}_j$ from the task distribution, use some learning algorithm Learn to compute a new set of weights $W'$ for $\mathcal{D}_j^{\mathrm{train}}$ starting from $W$, and update $W$ with a gradient step to reduce the trajectory loss on $\mathcal{D}_j^{\mathrm{test}}$ evaluated using $W'$.
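In practice, we drop the Hessian term in the gradient by taking the gradient with respect to $W'$ rather than $W$ (so $\nabla_{W} \mathcal{L} \approx \nabla_{W'} \mathcal{L}$); this is done successfully by many maml implementations, for efficiency. We define:
$$\mathcal{L}^{T}_{\mathcal{A},\mathcal{D},\Sigma}(W) = \sum_{(I,G) \in \mathcal{D}} \mathcal{L}_{\mathcal{A},I,G}\left(\mathrm{PlanT}(\mathcal{A}, I, G, \Sigma, W)\right).$$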
This represents the smoothed trajectory loss for the best plan we can find using the given $W$, summed over all planning problems in $\mathcal{D}$. When we compute the gradient, we ignore the dependence of the plan structure on $W$. We estimate $W^*_{\mathrm{meta}}$ as follows:
When Learn is the subset selection learner (ss), the Learn procedure returns only a subset of the $W$, corresponding to the chosen specializers. Only the weights in that subset are updated with a gradient step on the test data.
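The outer loop described in this section can be sketched on a toy problem. Everything here is an assumption made for illustration: each task is a pair of scalar targets (train, test), the loss is the quadratic `(w - target)^2`, Learn is a few plain gradient steps, and the names `learn` and `meta_learn` and all hyperparameter values are hypothetical. The sketch shows only the first-order update, in which the test-set gradient is evaluated at the adapted weights and applied to the original weights.

```python
import random
from typing import Sequence, Tuple

Task = Tuple[float, float]  # (train target, test target) for a toy quadratic loss


def grad(w: float, target: float) -> float:
    """Derivative of the toy loss (w - target)^2 with respect to w."""
    return 2.0 * (w - target)


def learn(w: float, train_target: float, inner_lr: float, inner_steps: int) -> float:
    """Inner loop: adapt the weights to one task with a few gradient steps."""
    for _ in range(inner_steps):
        w = w - inner_lr * grad(w, train_target)
    return w


def meta_learn(
    w: float,
    tasks: Sequence[Task],
    inner_lr: float = 0.1,
    outer_lr: float = 0.05,
    inner_steps: int = 3,
    outer_iters: int = 200,
    seed: int = 0,
) -> float:
    """Outer loop with the first-order approximation (no Hessian term)."""
    rng = random.Random(seed)
    for _ in range(outer_iters):
        train_target, test_target = rng.choice(tasks)
        adapted = learn(w, train_target, inner_lr, inner_steps)
        # Gradient is taken with respect to the adapted weights, then applied
        # to the original weights.
        w = w - outer_lr * grad(adapted, test_target)
    return w


toy_tasks = [(-1.0, -1.0), (1.0, 1.0)]
w_meta = meta_learn(5.0, toy_tasks)
```

With these toy tasks the meta-learned weight ends up between the two task targets, from which a few inner steps reach either one. With the subset selection learner, only the weights of the selected specializers would receive the outer update.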
We demonstrate the effectiveness of our approach in an object manipulation domain where the robot is asked to move all of the objects from one table to another, as illustrated in Figure 1. The object geometries vary across tasks, while a single task is a distribution over initial configurations of the objects on the starting table. We consider 6 tasks for training and 3 tasks for evaluation, across 3 object types: cylinders, bowls, and vases. The phrase “final task” will refer to a random sample of one of the 3 evaluation tasks.
To avoid complications surrounding manipulation, we use a KUKA iiwa robot arm without a gripper. Grasp legality is computed using a simple end effector pose test based on the geometry (radius, height, and shape) of the object being grasped. We require that cylinders are grasped from the side, while bowls and vases are grasped from above, on their lip. The robot base pose is automatically set to one of two legal positions: one in front of each table.
There are four operators: moveToGrasp and moveToPlace move the robot (and any held object) to a configuration suitable for grasping or placing an object, grasp picks an object up, and place places it onto the table. All operators take in the ID of the object being interacted with as a discrete argument, set through the search over plan skeletons. The continuous parameters learned by our specializers are the target end effector pose for each operation; we employ an inverse kinematics solver to try reaching these poses. We learn three specializers for each of the first three operators, and one specializer for place due to its relative simplicity. The state representation is a vector containing the end effector pose, each object’s position, object geometry information, robot base position, and the ID of the currently-grasped object (if any).
All specializers are implemented as fully connected, feedforward deep neural networks with hidden layer sizes [100, 50, 20], a capacity which preliminary experiments found necessary. We use the Adam optimizer [23] with initial learning rate , decaying by 10% every 1000 iterations. We also batch our implementation and use batch size 32.
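The decay rule can be written as a one-line schedule. The sketch assumes a staircase interpretation (the rate is multiplied by 0.9 once per 1000 completed iterations); the initial rate is left as a required argument because its value is not given in the text, and the function name is hypothetical.

```python
def learning_rate(iteration: int, initial_lr: float,
                  decay: float = 0.9, every: int = 1000) -> float:
    """Staircase schedule: shrink the rate by 10% per `every` iterations."""
    return initial_lr * decay ** (iteration // every)


# Relative rates for a unit initial value at a few iteration counts.
relative_rates = [learning_rate(i, 1.0) for i in (0, 999, 1000, 2500)]
```

A smooth exponential decay with the same per-1000-iteration factor would be an equally valid reading of the description.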
For motion planning, we use the RRT-Connect algorithm [24]; we check for infeasibility crudely by giving the algorithm a computation allotment, implemented as a maximum number of random restarts to perform, after which an (infeasible) straight-line trajectory is returned. We use Fast-Forward [25] as our symbolic planner. For simulation and visualization, we use the pybullet [26] software.
A major source of difficulty in this domain is that the end effector poses chosen by the specializers must be consistent with both each other (place pose depends on grasp pose, etc.) and the object geometries. Furthermore, placing the first few objects near the front of the goal table would impede the robot’s ability to place the remaining objects. We should expect that the general strategies discovered by our meta-learning approach would handle these difficulties.
To implement the discrete search over plan skeletons and specializers, we adopt the tamp approach of Srivastava et al. [2], which performs optimistic classical planning using abstracted fluents, attempts to find a feasible motion plan, and incorporates any infeasibilities back into the initial state as logical fluents. We also set a depth limit for this iterative process, so that we only ever search over five candidate plan skeletons per training iteration. During evaluation, there is no such limit. For each skeleton, we search exhaustively over all combinations of one specializer per operator, of which there are 27 possibilities.
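The count of 27 follows from the specializer counts stated earlier (three each for moveToGrasp, moveToPlace, and grasp, and one for place). The short sketch below enumerates the combinations; the dictionary and variable names are illustrative only.

```python
from itertools import product

# Number of learned specializers per operator, as stated in the text.
SPECIALIZERS_PER_OPERATOR = {
    "moveToGrasp": 3,
    "moveToPlace": 3,
    "grasp": 3,
    "place": 1,
}

# One specializer index per operator: 3 * 3 * 3 * 1 combinations.
combinations_per_skeleton = list(
    product(*(range(n) for n in SPECIALIZERS_PER_OPERATOR.values()))
)
print(len(combinations_per_skeleton))  # 27
```

With at most five candidate skeletons per training iteration, this bounds the discrete search at 135 skeleton and specializer combinations per iteration.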
Evaluation: We evaluate the MetaLearn algorithm with both the alternating descent (ad) learner and the subset selection (ss) learner. We test against two baselines, random sampling and the hand-crafted strategy, both of which are described in Section III-A. The random sampler is conditional, sampling only end effector poses that satisfy the kinematic constraints of the operators. At final task time with the ad learner, we optimize the specializers on 10 batches of training data, then evaluate on a test set of 50 problems from this task. At final task time with the ss learner, we choose the single specializer per operator that performs the best over one batch of training data, then use only that subset to evaluate on a test set. Note that we should expect the test set evaluation to be much faster with the ss learner than with the ad learner, since we are planning with fewer specializers.
The results show that both meta-learning approaches perform much better at the final task than the baselines do. The random sampler fails because it expends significant effort trying to reach infeasible end effector poses, such as those behind the objects. The hand-crafted specializers, though they perform better than the random sampler, suffer from a lack of context: because they are task-agnostic, they cannot specialize, and so search time is wasted on continuous parameter values that are inappropriate for the current task, making timeouts more frequent. Furthermore, the hand-crafted strategy does not adapt to the state (such as the locations of objects surrounding one being grasped). Qualitatively, we found that the general strategies we outlined earlier for succeeding in this domain were meta-learned by our approach.
Notably, the alternating descent (ad) learner performs better than the subset selection (ss) learner, likely because in the former, the specializer weights are optimized for the final task rather than held fixed. These findings suggest that this sort of fine-tuning is an important step to learning specializers in our domain. However, this improvement comes at the cost of much longer training times, since the ad learner performs an inner gradient computation which the ss learner does not. Another finding is that the ss learner expends much less search effort than the ad learner, which is expected since the ss learner plans with only one specializer per operator.
| Setting | System | Final Task Solve % | Train Iters to 50% | Search Effort | Train Time (Hours) |
|---|---|---|---|---|---|
| 3 obj. | Baseline: Random | 24 | N/A | 52.2 | N/A |
| 3 obj. | Baseline: Hand-crafted | 68 | N/A | 12.1 | N/A |
| 3 obj. | Meta-learning: ad | 100 | 500 | 2.5 | 4.3 |
| 3 obj. | Meta-learning: ss | 100 | 500 | 2.0 | 0.6 |
| 5 obj. | Baseline: Random | 14 | N/A | 81.3 | N/A |
| 5 obj. | Baseline: Hand-crafted | 44 | N/A | 34.3 | N/A |
| 5 obj. | Meta-learning: ad | 88 | 2.1K | 8.6 | 7.4 |
| 5 obj. | Meta-learning: ss | 72 | 6.8K | 4.1 | 1.5 |
| 7 obj. | Baseline: Random | 0 | N/A | N/A | N/A |
| 7 obj. | Baseline: Hand-crafted | 18 | N/A | 64.0 | N/A |
| 7 obj. | Meta-learning: ad | 76 | 5.1K | 18.3 | 12.3 |
| 7 obj. | Meta-learning: ss | 54 | 9.2K | 7.8 | 2.1 |
VI Conclusion and Future Work
We used modular meta-learning to address the problem of learning continuous action parameters in multi-task tamp.
One interesting avenue for future work is to allow the specializers to be functions of the full plan skeleton, which would provide them with context necessary for picking good parameter values in more complex domains. Another is to remove the assumption of deterministic specializers by having them either be stochastic neural networks or output a distribution over the next state, reparameterized using Gumbel-Softmax [27]. Finally, we hope to explore tasks requiring planning under uncertainty. These tasks would require more sophisticated compositional structures; we would need to search over tree-structured policies rather than sequential plans as we did in this work. This search could be made tractable using heuristics for solving pomdps [28, 29].
We gratefully acknowledge support from NSF grants 1420316, 1523767, and 1723381; from AFOSR grant FA9550-17-1-0165; from Honda Research; and from Draper Laboratory. Rohan is supported by an NSF Graduate Research Fellowship. Any opinions, findings, and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.
[1] L. P. Kaelbling and T. Lozano-Pérez, “Hierarchical task and motion planning in the now,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 1470–1477.
[2] S. Srivastava, E. Fang, L. Riano, R. Chitnis, S. Russell, and P. Abbeel, “Combined task and motion planning through an extensible planner-independent interface layer,” in Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014, pp. 639–646.
[3] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” arXiv preprint arXiv:1703.03400, 2017.
[4] F. Alet, T. Lozano-Pérez, and L. P. Kaelbling, “Modular meta-learning,” arXiv preprint arXiv:1806.10166 (to appear in CoRL 18), 2018.
[5] J. Peters, J. Kober, K. Mülling, O. Krämer, and G. Neumann, “Towards robot skill learning: From simple skills to table tennis,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2013, pp. 627–631.
[6] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[7] B. Argall and A. Billard, “A survey of tactile human-robot interactions,” Robotics and Autonomous Systems, vol. 58, no. 10, pp. 1159–1176, 2010.
[8] H. Pasula, L. S. Zettlemoyer, and L. P. Kaelbling, “Learning symbolic models of stochastic domains,” J. Artif. Intell. Res. (JAIR), vol. 29, 2007.
[9] B. Kim, L. P. Kaelbling, and T. Lozano-Pérez, “Learning to guide task and motion planning using score-space representation,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 2810–2817.
[10] B. Kim, L. P. Kaelbling, and T. Lozano-Pérez, “Guiding search in continuous state-action spaces by learning an action sampler from off-target search experience,” in Proceedings of Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[11] Z. Wang, C. R. Garrett, L. P. Kaelbling, and T. Lozano-Pérez, “Active model learning and diverse action sampling for task and motion planning,” arXiv preprint arXiv:1803.00967, 2018.
[12] R. Chitnis, D. Hadfield-Menell, A. Gupta, S. Srivastava, E. Groshev, C. Lin, and P. Abbeel, “Guided search for task and motion plans using learned heuristics,” in Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE, 2016, pp. 447–454.
[13] S. J. Pan, Q. Yang et al., “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[14] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, “Neural module networks,” in
[15] J. Andreas, D. Klein, and S. Levine, “Modular multitask reinforcement learning with policy sketches,” arXiv preprint arXiv:1611.01796, 2016.
[16] C. Devin, A. Gupta, T. Darrell, P. Abbeel, and S. Levine, “Learning modular neural network policies for multi-task and multi-robot transfer,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 2169–2176.
[17] S. Schaal, “Dynamic movement primitives - A framework for motor control in humans and humanoid robotics,” in Adaptive motion of animals and machines. Springer, 2006, pp. 261–280.
[18] G. Neumann, C. Daniel, A. Paraschos, A. Kupcsik, and J. Peters, “Learning modular policies for robotics,” Frontiers in computational neuroscience, vol. 8, p. 62, 2014.
[19] G. Konidaris, S. Kuindersma, R. Grupen, and A. Barto, “Robot learning from demonstration by constructing skill trees,” The International Journal of Robotics Research, vol. 31, no. 3, pp. 360–375, 2012.
[20] K. Shiarlis, M. Wulfmeier, S. Salter, S. Whiteson, and I. Posner, “TACO: Learning task decomposition via temporal alignment for control,” arXiv preprint arXiv:1803.01840, 2018.
[21] C. R. Garrett, T. Lozano-Pérez, and L. P. Kaelbling, “Sampling-based methods for factored task and motion planning,” arXiv preprint arXiv:1801.00680 (to appear in IJRR), 2018.
[22] M. Toussaint, “Logic-geometric programming: An optimization-based approach to combined task and motion planning,” in IJCAI, 2015, pp. 1930–1936.
[23] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[24] J. J. Kuffner and S. M. LaValle, “RRT-connect: An efficient approach to single-query path planning,” in Robotics and Automation, 2000. Proceedings. ICRA’00. IEEE International Conference on, vol. 2. IEEE, 2000, pp. 995–1001.
[25] J. Hoffmann and B. Nebel, “The FF planning system: Fast plan generation through heuristic search,” Journal of Artificial Intelligence Research, vol. 14, pp. 253–302, 2001.
[26] E. Coumans, Y. Bai, and J. Hsu, “Pybullet physics engine,” 2018. [Online]. Available: http://pybullet.org/
[27] E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with Gumbel-Softmax,” arXiv preprint arXiv:1611.01144, 2016.
[28] H. Kurniawati, D. Hsu, and W. S. Lee, “SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces,” in Robotics: Science and systems, vol. 2008. Zurich, Switzerland., 2008.
[29] J. Pineau, G. Gordon, S. Thrun et al., “Point-based value iteration: An anytime algorithm for POMDPs,” in IJCAI, vol. 3, 2003, pp. 1025–1032.