## I Introduction

We are interested in robots that can learn a wide variety of tasks efficiently.
Recently, there has been an increasing interest in the *one-shot imitation learning* problem [atkeson1997learning, duan2017one, finn2017one, huang2019neural, pathakICLR18zeroshot, xu2018neural, yu2018one], where the goal is to learn to execute a previously unseen task from only a single demonstration of the task.
This setting is also referred to as *meta-learning* [vinyals2016matching, finn2017one], where the meta-training stage uses a set of tasks in a given domain to simulate the one-shot testing scenario. This allows the learned model to generalize to previously unseen tasks with a single demonstration in the meta-testing stage.

The main shortcoming of these one-shot approaches is that they typically require a large amount of data for meta-training (400 meta-training tasks in [huang2019neural] and 1000 in [xu2018neural] for the Block Stacking task [xu2018neural]) to generalize reliably to unseen tasks. However, this requirement is infeasible in realistic domains with hierarchical and long-horizon tasks that involve long-term environment interactions.

The primary contribution of this paper is to formulate one-shot imitation learning for long-horizon tasks as a planning problem to leverage the structure of the symbolic domain definition.
This allows us to disentangle the policy execution from the inter-task generalization, which significantly reduces the number of tasks required in meta-training. In our formulation, the planner focuses on the policy execution, while the symbol grounding problem learns to map the continuous input state (*e.g.,* object poses or images) to the symbolic state representation required by the planner. In this case, the inter-task generalization is solely handled by the symbol grounding and disentangled from the policy execution. We argue that it is much easier for the symbol grounding problem to achieve inter-task generalization compared to the black-box policy networks used in previous works [duan2017one, huang2019neural, finn2017one, xu2018neural] because the symbol grounding function can be shared among the tasks in the same or similar domains. We further improve the generalization of symbol grounding by proposing a *modular* Symbol Grounding Network (SGN) to infer the symbolic state of a given continuous input state. The modularity of our SGN enables effective parameter sharing among the symbolic states to further improve data efficiency.

The central technical challenge is that the outputs of the Symbol Grounding Network can be error-prone in low-data regimes. In this case, the SGN can output *invalid* symbolic states that lead to subsequent symbolic planning failures.
Consider the Block Stacking domain shown in Figure 1 as an example: it is possible (and likely, as we will show in the experiments) for the SGN to output that both On(A, B) and Clear(B) are true. However, this is inconsistent, since the two conditions cannot be true simultaneously.

Our solution to this challenge is a continuous relaxation of the symbolic planner that replaces the set-theoretic representation [ghallab2004automated] in the symbolic planner with probabilistic symbols [konidaris2015symbol].
This allows the planner to *directly* plan on a probability distribution over the symbolic states. This addresses the aforementioned inconsistency because the symbol grounding stage is no longer forced to make discrete decisions, and can simply provide a continuous estimate of the symbolic state. We show that our continuous relaxation of the symbolic planner can still leverage the information provided by the continuous outputs of the SGN to complete the task from a single demonstration.

Figure 1 compares the proposed framework with the symbolic planner and other neural network based approaches for one-shot imitation learning. We refer to our continuous relaxation of the symbolic planner as Continuous Planner (CP). Compared to symbolic planners, our Continuous Planner can take continuous states as inputs. Compared to existing one-shot imitation models, our formulation decouples the demonstration interpretation model from the policy model and leads to better data efficiency.

In summary, the main contributions of our work are: (i) formulating one-shot imitation as planning to disentangle the policy execution from inter-task generalization, leading to better data efficiency; (ii) proposing the Continuous Planner, which allows the planner to operate directly on the distribution over the symbolic states and resolves the invalid state problem; (iii) introducing modularity to the Symbol Grounding Networks to further improve the inter-task generalization.

## II Related Work

Structures in one-shot imitation learning. The goal of one-shot imitation learning is to translate a task demonstration (observation) into an executable policy [duan2017one, finn2017one, goo2018one, xu2018neural, huang2019neural]. Real-world tasks are often long-horizon and multi-step, posing severe challenges for simple techniques such as behavior cloning-based methods [duan2017one]. Recent works in one-shot imitation learning aim to mitigate this challenge by imposing modular or hierarchical task structures in order to learn reusable subtask policies [niekum2015learning, shiarlis2018taco, goo2018one, xu2018neural, huang2019neural].
NTP [xu2018neural] decomposes demonstration with hierarchical programs. NTG [huang2019neural] models the task structure as a graph generation problem.
However, these works rely on function approximators to *implicitly* model the transitions among the subtasks, which often limits their ability to generalize.
Our model explicitly grounds the pre- and post-condition symbols of each subtask via our symbolic planning formulation.

High-level task planning. Our planning formulation to handle the symbol uncertainty is primarily inspired by Konidaris et al. [konidaris2015symbol], wherein probabilistic symbolic representation is used to replace the classic set-theoretic representation [ghallab2004automated]
to express the symbol uncertainty in symbol acquisition.
A major distinction in our formulation is that we still assume a *deterministic domain*, where the transition between the symbolic states via actions is known and deterministic.
This allows us to derive a continuous relaxation of the planner (Section IV-C) that greatly reduces the computational complexity. The relaxation assumes deterministic transitions while still handling uncertainty in symbol grounding. A similar deterministic assumption has also been leveraged under POMDP [littman1996algorithms, kaelbling1998planning, ross2008online], in which the deterministic POMDP (DET-POMDP) [littman1996algorithms] formulation assumes deterministic transition function and known observation function. Our formulation can be viewed as a relaxed DET-POMDP formulation with an unknown observation function.
Lastly, such formulation involving uncertainty is also related to a large body of research in probabilistic planning [yoon2007ff, yoon2008probabilistic, little2007probabilistic, jetchev2013learning, huber2000hybrid]. However, these works only handle the uncertainty in the state transitions but not uncertainty in the symbols themselves [konidaris2015symbol].

Symbol Grounding for Manipulation. Our formulation connects the symbolic planner with the continuous input state. This is related to progress on symbol grounding of geometric continuous state for manipulation planning [bidot2017geometric, gravot2005asymov, kaelbling2010hierarchical]. The main difference is that most of these approaches assume that they are given the mapping from geometric to symbolic states. In contrast, our Symbol Grounding Networks do not assume a given mapping and can use arbitrary continuous states as input. The idea of learning this symbol grounding has also been explored for manipulation [abdo2012low, dearden2014manipulation, sjoo2011learning]. However, these approaches either do not consider planning on the symbols or do not take the symbol uncertainty into account during planning, as our Continuous Planner does.

## III Problem Setup and Preliminaries

### III-A One-Shot Imitation Learning

The goal of one-shot imitation learning [duan2017one] is to execute a previously *unseen* task from a single demonstration.
Let $\mathbb{T}$ be the set of tasks in a domain, such as block stacking. Given a demonstration $d_\tau = [o_1, \dots, o_T]$ for task $\tau \in \mathbb{T}$, where each $o_t$ is a continuous state, the goal is to have a model $\phi(\cdot)$ output a policy $\pi = \phi(d_\tau)$ that completes a new instance of the task. Modeling $\phi$ poses an extreme data efficiency challenge because it requires learning to interpret the demonstrations and to perform the task simultaneously. Most previous works adopt a meta-learning setup [finn2017maml], where $\mathbb{T}$ is further divided into mutually exclusive sets of meta-training tasks $\mathbb{T}_{seen}$ and unseen meta-testing tasks $\mathbb{T}_{unseen}$. $\phi$ is trained on $\mathbb{T}_{seen}$ so that $\pi = \phi(d_\tau)$ can successfully complete the tasks $\tau \in \mathbb{T}_{unseen}$. The assumption is that we have sufficient demonstrations and tasks in $\mathbb{T}_{seen}$ so that the learned $\phi$ can generalize to tasks in $\mathbb{T}_{unseen}$ at test time.

### III-B Symbolic Planning

In this work, we formulate one-shot imitation learning as a classical symbolic planning problem.
Following the definition in [ghallab2004automated], a *planning problem* contains an initial state $s_{init}$, a goal $s_{goal}$, and a set of operators $\mathcal{O}$.
Each operator $op \in \mathcal{O}$ is defined by

$$op = \big(\text{name}(op),\ \text{precond}(op),\ \text{effect}(op)\big),$$

where $\text{name}(op)$ includes the name of the operator and the list of arguments, $\text{precond}(op)$ specifies the conditions that need to be satisfied in order to apply the operator, and $\text{effect}(op)$ defines how the state is updated after applying the operator.
The *state* $s$ is defined as the set of all ground atoms that are true (*e.g.,* {On(A,B), Clear(A)}). A ground atom is false if it is not present in the state. A ground atom, such as On(A,B), consists of the predicate (On(·,·)) and the objects (A and B) for the arguments.
An action $a$ is a grounded operator, where all the arguments of the operator are substituted by objects.
The solution of the planning problem is a plan, that is, a sequence of actions $[a_1, \dots, a_N]$. When starting at the initial state $s_{init}$, the plan will lead to a state that satisfies the goal $s_{goal}$.
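
The set-theoretic representation above can be sketched in a few lines of Python. This is only an illustration under our own naming: the `Action` class, the `applicable` and `apply` helpers, and the grounded `Stack(A,B)` action with its precondition and effect sets are assumptions made for the example, not definitions taken from the paper's domain file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A grounded operator in the set-theoretic representation."""
    name: str
    precond: frozenset   # ground atoms that must be true
    add: frozenset       # positive effects
    delete: frozenset    # negative effects


def applicable(state, action):
    # An action is applicable when every precondition atom is in the state.
    return action.precond <= state


def apply(state, action):
    # Deterministic transition: remove negative effects, then add positive ones.
    if not applicable(state, action):
        raise ValueError(f"{action.name} is not applicable")
    return (state - action.delete) | action.add


# Hypothetical grounded action for illustration: place block A onto block B.
stack_a_b = Action(
    name="Stack(A,B)",
    precond=frozenset({"Clear(A)", "Clear(B)"}),
    add=frozenset({"On(A,B)"}),
    delete=frozenset({"Clear(B)"}),
)

s_init = frozenset({"Clear(A)", "Clear(B)"})
goal = frozenset({"On(A,B)"})

s_next = apply(s_init, stack_a_b)
goal_reached = goal <= s_next   # the goal holds when all its atoms are in the state
```

Here a one-step plan `[Stack(A,B)]` takes the initial state to a state that contains every goal atom, and the same action is no longer applicable afterwards because `Clear(B)` has been deleted.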

We use the Planning Domain Definition Language (PDDL) to specify our planning problem. In PDDL, the planning problem is split into a domain file and a problem file. The domain file contains the set of operators $\mathcal{O}$ and the predicates. The problem file contains the initial state $s_{init}$ and the goal $s_{goal}$. Consider the block stacking domain as an example. All block stacking tasks share the same domain file, while each of the tasks has its own problem file to specify the initial configuration/state and goal configuration/state. When we say that the meta-training tasks $\mathbb{T}_{seen}$ and the meta-testing tasks $\mathbb{T}_{unseen}$ are in the same domain, we assume that they share the same domain file, which is available at training time.

## IV Our Method

We address the one-shot imitation learning problem, where the goal is to complete a previously unseen task based on a single demonstration. We formulate it as a planning problem and use our Symbol Grounding Networks (SGN) to explicitly ground the symbols. This disentangles the policy execution from the inter-task generalization and leads to better data efficiency compared to previous one-shot approaches. The key challenge is that the SGNs are error-prone with limited training data and can output “invalid” states. This leads to subsequent failure of the symbolic planner. We address this challenge by proposing the Continuous Planner to directly plan on the probabilistic outputs of SGNs. An overview of our approach is shown in Figure 2. We will first formulate one-shot imitation learning as a planning problem in Section IV-A. Next, we will discuss the details of our Symbol Grounding Networks and Continuous Planner in Section IV-B and Section IV-C. Finally, we will include details of learning and inference of our framework in Section IV-D.

### IV-A One-Shot Imitation as a Planning Problem

Existing one-shot imitation learning methods [duan2017one, huang2019neural, xu2018neural] parameterize the model $\phi$, which maps a demonstration to a policy, as policy models conditioned on demonstrations. While these methods have been shown to generalize to the unseen tasks $\mathbb{T}_{unseen}$, training such policy networks requires a large amount of data in the meta-training tasks $\mathbb{T}_{seen}$ because the policy networks need to simultaneously interpret demonstrations and perform tasks. We formulate one-shot imitation as a symbolic planning problem: we disentangle the modeling of the compound $\phi$ into learning a Symbol Grounding Network (SGN) and performing Continuous Planning (CP), $\phi(\cdot) = \mathrm{CP}(\mathrm{SGN}(\cdot))$. In this case, the inter-task generalization is handled by the SGN, while the CP can focus on policy execution. Such disentanglement significantly reduces the complexity of generalizing to unseen tasks.

Now we introduce the symbolic planning formulation of one-shot imitation learning. We can think of one-shot imitation learning as specifying the goal of the task using a demonstration. In this case, if we can map the demonstration $d_\tau = [o_1, \dots, o_T]$ to a symbolic goal $s_{goal}$, and map the currently observed continuous state $o$ to the corresponding symbolic state $s_{init}$, then we can use planning to solve the task based on the operators defined in the domain file. We observe that both $s_{goal}$ and $s_{init}$ can be obtained by solving the symbol grounding problem that maps a continuous state $o$ to the corresponding symbolic state $s$. For $s_{goal}$, we can use the symbol grounding of the demonstration $d_\tau$ for a task $\tau$. The symbolic state for the final continuous state $o_T$ is guaranteed to satisfy the goal. As for the initial state $s_{init}$, we can obtain it by symbol grounding of the currently observed continuous state $o$. We address symbol grounding by proposing the Symbol Grounding Networks (SGN). By predicting both the current and the goal symbolic states, we have:

$$\pi(a \mid o) = \mathrm{CP}\big(a \mid s_{init},\, s_{goal}\big) = \mathrm{CP}\big(a \mid \mathrm{SGN}(o),\, \mathrm{SGN}(o_T)\big). \tag{1}$$

Our formulation operates as a closed-loop policy conditioned on the demonstration if we view the plan as a multi-step action. We can further improve the goal recognition by conditioning it on the entire demonstration $d_\tau$ instead of just the final observation $o_T$. The outline is shown in Figure 2(b).

### IV-B Symbol Grounding Networks

Eq. (1) formalizes the disentanglement of symbol grounding and planning. Based on this formulation, CP is independent of the inter-task distribution shift and focuses on the policy execution. In this case, the Symbol Grounding Network (SGN) plays an important role in handling the inter-task generalization. Given a demonstration $d_\tau$ for an unseen task $\tau \in \mathbb{T}_{unseen}$, the SGN needs to recognize the corresponding goal $s_{goal}$ despite being trained only on meta-training tasks. More generally, the SGN needs to map the current continuous state $o$ from any task to the corresponding symbolic state. While still challenging, the SGN is easier to optimize than the original compound problem because we can expect the symbol grounding to be shared among similar domains.

Nevertheless, as the SGN plays an important role in the inter-task generalization of our model, we still need to make sure that it is data efficient. We thus leverage the recent progress on modular neural networks [andreas2016neural], which have been shown to improve generalization in Visual Question Answering [andreas2016neural] and Policy Learning [wang2018nervenet] by parameter sharing among model components. Let $o$ be the current continuous state input to the SGN. For each predicate p, we have a predicate module $f_p$, and for each object b, we have an object module $f_b$. Both are parameterized by Multi-Layer Perceptrons. A ground atom p(b1,b2) is classified by:

$$P\big(p(b_1, b_2) \mid o\big) = f_p\big([\,f_{b_1}(o)\,;\, f_{b_2}(o)\,]\big), \tag{2}$$

where each object module $f_b$ extracts an embedding from the current continuous state $o$. The embeddings from the object modules are then concatenated by $[\,\cdot\,;\,\cdot\,]$ and fed as input to the predicate module $f_p$. This allows us to share the parameters for symbol grounding across the ground atoms in the domain and improves the data efficiency of our SGN. Figure 3 shows examples of the ground atoms.
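
To make the parameter sharing concrete, the following is a minimal sketch of the modular structure: one small network per object, one per predicate, and a ground atom scored by passing the concatenated object embeddings through the predicate module. It is an untrained toy in pure Python. The layer sizes, the ReLU hidden activation, the sigmoid output, and all function names are assumptions made for illustration and are not the paper's implementation.

```python
import math
import random


def make_mlp(in_dim, hidden, out_dim, rng):
    """Randomly initialized two-layer perceptron, stored as two (W, b) layers."""
    def layer(n_in, n_out):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        return weights, [0.0] * n_out
    return [layer(in_dim, hidden), layer(hidden, out_dim)]


def forward(mlp, x):
    (w1, b1), (w2, b2) = mlp
    # Hidden layer with ReLU, followed by a linear output layer.
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w1, b1)]
    return [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(w2, b2)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)
STATE_DIM, EMB_DIM, HIDDEN = 6, 4, 8   # illustrative sizes only

# One module per object and one per predicate; every ground atom reuses them.
object_modules = {b: make_mlp(STATE_DIM, HIDDEN, EMB_DIM, rng) for b in ("A", "B")}
predicate_modules = {"On": make_mlp(2 * EMB_DIM, HIDDEN, 1, rng)}


def ground_atom_prob(pred, b1, b2, o):
    """Probability that pred(b1, b2) holds in continuous state o."""
    e1 = forward(object_modules[b1], o)
    e2 = forward(object_modules[b2], o)
    logit = forward(predicate_modules[pred], e1 + e2)[0]   # list concatenation
    return sigmoid(logit)


o = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]   # e.g. the stacked poses of two blocks
p_on_ab = ground_atom_prob("On", "A", "B", o)
p_on_ba = ground_atom_prob("On", "B", "A", o)
```

Both `On(A,B)` and `On(B,A)` are scored by the same `On` module and the same two object modules, so adding a ground atom adds no new parameters. Only the order of the concatenated embeddings differs.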

### IV-C Continuous Planner

We have discussed how to map a continuous state to a probabilistic symbolic state through our SGN in Eq. (2). The next step is to perform planning on the outputs. As the symbolic planner requires discrete symbolic state input, an obvious approach is to discretize the probabilistic output (*e.g.,* with a threshold of 0.5).
The main drawback of this strategy is that there is no guarantee that it yields a “valid” symbolic state. For instance, after the discretization, it is possible that the output symbolic set contains both On(A,B) (A is on B) and Clear(B) (B has nothing on top), which cannot both be true at the same time.
Addressing this challenge is particularly crucial for one-shot imitation learning in low-sample regimes because the SGN is more likely to make inaccurate predictions.

Without additional handcrafted rules, invalid state checking can be posed as a satisfiability (SAT) problem given the planning domain, but applying it to every neural network output is computationally prohibitive and it is still non-trivial to map invalid states to valid ones.
We address this challenge by proposing a continuous relaxation of the symbolic planner to allow it to *directly* plan on the probabilistic outputs in Eq. (2).
We achieve this by replacing the set-theoretic representation [ghallab2004automated] in the deterministic planner with probabilistic symbols [konidaris2015symbol] and aiming to find a plan towards the recognized goal distribution. This allows our Continuous Planner to output a list of actions based on the SGN outputs.

Classical symbolic planning involves the following key steps: (i) Define the current symbolic state; (ii) Find the list of applicable actions; (iii) Select one of the applicable actions; (iv) Apply the action and arrive at a new state; and (v) Stop when reaching the goal. We now explain the continuous relaxation of all these five steps by replacing the symbolic states with probabilistic symbols or distributions over states. More importantly, because of the discrete and deterministic nature of the planning problem, we can derive efficient iterative formulas for all these steps based on the outputs of SGN without marginalizing a large state space.

State Representation. To handle the uncertainty in the outputs of the SGN, we adopt the probabilistic symbolic state representation from [konidaris2015symbol]. For example, assume that we only have three possible ground atoms: Clear(A), Clear(B), and On(A, B). If the SGN outputs the following probabilities:

$$P\big(\text{Clear(A)}\big) = p_1, \tag{3}$$

$$P\big(\text{Clear(B)}\big) = p_2, \tag{4}$$

$$P\big(\text{On(A, B)}\big) = p_3, \tag{5}$$

then it can be seen as specifying a distribution over all the $2^3 = 8$ possible symbolic states. We will use this probability distribution as the state representation of our Continuous Planner instead of the set of true ground atoms. We use $Z$ to denote the distribution over symbolic states, which maps a symbolic state $s$ to its corresponding probability $Z(s)$. Given that we know the set of all ground atoms $G$, and assume conditional independence among the ground atoms, we can represent $Z$ compactly with $P(g \mid Z)$, the probability that ground atom $g$ is true given the distribution $Z$, for all $g \in G$.
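
As a small worked example, the sketch below expands per-atom probabilities into the full distribution over the 8 symbolic states under the independence assumption. The numerical values are made up for illustration and are not taken from the paper, and the function name `state_distribution` is our own.

```python
from itertools import product


def state_distribution(atom_probs):
    """Expand per-atom probabilities into a distribution over all symbolic
    states, assuming the ground atoms are independent."""
    atoms = sorted(atom_probs)
    dist = {}
    for truth in product([False, True], repeat=len(atoms)):
        prob = 1.0
        for atom, is_true in zip(atoms, truth):
            prob *= atom_probs[atom] if is_true else 1.0 - atom_probs[atom]
        state = frozenset(a for a, t in zip(atoms, truth) if t)
        dist[state] = prob
    return dist


# Illustrative SGN outputs (made up for this sketch).
atom_probs = {"Clear(A)": 0.9, "Clear(B)": 0.6, "On(A,B)": 0.7}
Z = state_distribution(atom_probs)

# Probability mass on states that contain the contradictory pair.
invalid_mass = sum(p for s, p in Z.items() if {"On(A,B)", "Clear(B)"} <= s)

# Thresholding every atom at 0.5 commits to a single state.
thresholded = frozenset(a for a, p in atom_probs.items() if p > 0.5)
```

With these numbers, thresholding keeps all three atoms and therefore commits to a state in which both `On(A,B)` and `Clear(B)` hold, whereas the distribution `Z` simply spreads its mass over the alternatives without making that commitment.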

Applicable Actions. As discussed in Section III-B, a symbolic action is applicable if all the ground atoms in its precondition are true. Now that our state representation is no longer discrete, there are no longer “applicable” or “inapplicable” actions. Instead, given the current distribution $Z$ over symbolic states $s$, the applicable probability of an action $a$ is:

$$\rho(a \mid Z) = \sum_{s \vdash a} Z(s) = \prod_{g \in \text{precond}(a)} P(g \mid Z), \tag{6}$$

where $s \vdash a$ means that $s$ satisfies the precondition set of $a$, and $P(g \mid Z)$ is the probability that ground atom $g$ is true under $Z$. The summation can be written as the product of the probabilities of the ground atoms in the precondition set because of the conditional independence.
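
The equivalence between the summation over states and the product over precondition atoms can be checked numerically. The sketch below computes the applicable probability both ways; the atom probabilities and the precondition set are illustrative assumptions, as are the function names.

```python
import math
from itertools import product


def applicable_prob(atom_probs, precond):
    """Product form: multiply the probabilities of the precondition atoms."""
    return math.prod(atom_probs[g] for g in precond)


def applicable_prob_by_enumeration(atom_probs, precond):
    """Summation form: add up Z(s) over every state s that satisfies precond."""
    atoms = sorted(atom_probs)
    total = 0.0
    for truth in product([False, True], repeat=len(atoms)):
        state = {a for a, t in zip(atoms, truth) if t}
        if precond <= state:
            total += math.prod(
                atom_probs[a] if a in state else 1.0 - atom_probs[a] for a in atoms
            )
    return total


atom_probs = {"Clear(A)": 0.9, "Clear(B)": 0.6, "On(A,B)": 0.7}
precond = {"Clear(A)", "Clear(B)"}   # hypothetical precondition of Stack(A,B)

fast = applicable_prob(atom_probs, precond)
slow = applicable_prob_by_enumeration(atom_probs, precond)
```

The product form touches only the atoms in the precondition, while the enumeration grows exponentially with the number of ground atoms, which is why the compact form matters in practice.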

Action Selection. As defined in Eq. (6), we no longer have a list of applicable actions to apply, but the likelihood of the actions being applicable. In this case, the selection of actions becomes a ranking of the applicability of actions.

Action Application. In symbolic planners, the current symbolic state is moved to a new symbolic state by action application. We would like an analogous notion for distributions over states. In this case, we consider the shift of the distribution over states by an *attempt* of the action. Given the current distribution over states, the distribution would shift if we attempt an action because it is possible that the action succeeds and changes the current symbolic state. At the same time, it is also possible that the action fails because the precondition is not satisfied. The new distribution $Z'$ after attempting $a$ from $Z$ can be described by:

$$Z'(s') = \sum_{s \vdash a} \mathbb{1}\big[\gamma(s, a) = s'\big]\, Z(s) + \mathbb{1}\big[s' \nvdash a\big]\, Z(s'), \tag{7}$$

where $\gamma(s, a)$ is the symbolic state we get by applying $a$ to $s$ and we use $s \vdash a$ to represent that $s$ satisfies $\text{precond}(a)$. The first term captures the transitions in which the attempt of $a$ is successful and the second term captures the failure of the action. Based on this definition, the probability of a specific ground atom $g$ in the new distribution $Z'$ is:

$$P(g \mid Z') = \sum_{s'} \mathbb{1}\big[g \in s'\big]\, Z'(s') \tag{8}$$

$$\phantom{P(g \mid Z')} = \sum_{s \vdash a} \mathbb{1}\big[g \in \gamma(s, a)\big]\, Z(s) + \sum_{s \nvdash a} \mathbb{1}\big[g \in s\big]\, Z(s). \tag{9}$$

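We consider three types of ground atom $g$: (i) $g$ is in the positive effect set of $a$. In this case, we know automatically that $g \in \gamma(s, a)$ is true as long as we have $s \vdash a$. In addition, when $g$ is not itself a precondition of $a$, conditional independence gives $\sum_{s \nvdash a} \mathbb{1}[g \in s]\, Z(s) = \big(1 - \rho(a \mid Z)\big)\, P(g \mid Z)$, where $\rho(a \mid Z) = \sum_{s \vdash a} Z(s)$ is the applicable probability of Eq. (6). Eq. (9) can be rewritten as:

$$P(g \mid Z') = \sum_{s \vdash a} Z(s) + \sum_{s \nvdash a} \mathbb{1}\big[g \in s\big]\, Z(s) \tag{10}$$

$$\phantom{P(g \mid Z')} = \rho(a \mid Z) + \big(1 - \rho(a \mid Z)\big)\, P(g \mid Z). \tag{11}$$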

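(ii) $g$ is in the negative effect set of $a$. In this case, we have $g \notin \gamma(s, a)$ when $s \vdash a$, and the first term in Eq. (9) is $0$. For the second term, we have $\sum_{s \nvdash a} \mathbb{1}[g \in s]\, Z(s) = P(g \mid Z) - \sum_{s \vdash a} \mathbb{1}[g \in s]\, Z(s)$, and thus:

$$P(g \mid Z') = P(g \mid Z) - \sum_{s \vdash a} \mathbb{1}\big[g \in s\big]\, Z(s) \tag{12}$$

$$\phantom{P(g \mid Z')} = \begin{cases} P(g \mid Z) - \rho(a \mid Z), & g \in \text{precond}(a), \\ \big(1 - \rho(a \mid Z)\big)\, P(g \mid Z), & g \notin \text{precond}(a), \end{cases} \tag{13}$$

where $\rho(a \mid Z) = \sum_{s \vdash a} Z(s)$ is the applicable probability of Eq. (6), and the second case uses the conditional independence of the ground atoms.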

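(iii) $g$ is unaffected by $a$. In this case, $g \in \gamma(s, a)$ means that $g \in s$, and Eq. (9) can be rewritten as:

$$P(g \mid Z') = \sum_{s \vdash a} \mathbb{1}\big[g \in s\big]\, Z(s) + \sum_{s \nvdash a} \mathbb{1}\big[g \in s\big]\, Z(s) \tag{14}$$

$$\phantom{P(g \mid Z')} = \sum_{s} \mathbb{1}\big[g \in s\big]\, Z(s) = P(g \mid Z). \tag{15}$$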

These derivations allow us to efficiently compute the action application iteratively using only the per-atom probabilities $P(g \mid Z)$, without marginalizing over the large discrete state space.
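
The three cases can be checked against a brute-force evaluation of the attempted transition. The sketch below is our own illustration: `attempt` updates the per-atom probabilities directly, while `attempt_by_enumeration` enumerates every symbolic state, applies the action where its precondition holds, and re-marginalizes. The action, the probabilities, and the handling of atoms that are also preconditions are assumptions of this sketch. The comparison covers a single attempt from a factorized distribution; reusing the closed form over several attempts treats the updated probabilities as independent again.

```python
import math
from itertools import product


def attempt(P, precond, add, delete):
    """Per-atom probabilities after attempting an action (closed form)."""
    rho = math.prod(P[g] for g in precond)   # applicable probability
    new_P = {}
    for g, p in P.items():
        # Mass of the states in which g holds but the precondition fails.
        fail_and_g = p - rho if g in precond else p * (1.0 - rho)
        if g in add:            # (i) positive effect
            new_P[g] = rho + fail_and_g
        elif g in delete:       # (ii) negative effect
            new_P[g] = fail_and_g
        else:                   # (iii) unaffected
            new_P[g] = p
    return new_P


def attempt_by_enumeration(P, precond, add, delete):
    """Same quantity, computed by summing over every symbolic state."""
    atoms = sorted(P)
    new_P = dict.fromkeys(atoms, 0.0)
    for truth in product([False, True], repeat=len(atoms)):
        s = {a for a, t in zip(atoms, truth) if t}
        z = math.prod(P[a] if a in s else 1.0 - P[a] for a in atoms)
        s_next = (s - delete) | add if precond <= s else s   # succeed or fail
        for g in s_next:
            new_P[g] += z
    return new_P


# Hypothetical Stack(A,B) action and illustrative probabilities.
precond, add, delete = {"Clear(A)", "Clear(B)"}, {"On(A,B)"}, {"Clear(B)"}
P = {"Clear(A)": 0.9, "Clear(B)": 0.6, "On(A,B)": 0.7}

closed_form = attempt(P, precond, add, delete)
brute_force = attempt_by_enumeration(P, precond, add, delete)

# A distribution concentrated on one state reproduces the symbolic update.
certain = attempt({"Clear(A)": 1.0, "Clear(B)": 1.0, "On(A,B)": 0.0},
                  precond, add, delete)
```

In this example the attempt raises the probability of `On(A,B)` and lowers that of `Clear(B)`, which moves mass away from the contradictory states without any explicit validity check.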

Goal Satisfaction. Similarly, the goal satisfaction condition is no longer defined by the presence of the ground atom in the symbolic state because neither the goal nor the current state is symbolic. Instead, we have the current distribution over symbolic state, and the distribution of goal state. The objective of our search is thus to match the two distributions.

With these continuous relaxations, we have defined all the operations (i) to (v) on the symbolic states in terms of the distribution over states. In addition, our derivation is efficient as it only iteratively operates on the per-atom probabilities $P(g \mid Z)$, which can be mapped directly to the SGN outputs using Eq. (2). In this case, our Continuous Planner can directly run the search to match the goal distribution on the outputs of the SGN. Note that our Continuous Planner has exactly the same behavior as the symbolic planner if the distribution concentrates on a single state. Therefore, the Continuous Planner generalizes the symbolic planner to handle distributions over states.
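
Putting steps (i) to (v) together, a forward search over attempted actions might look like the sketch below. Several choices here are assumptions made purely for illustration and are not specified in the text: the L1 distance between per-atom probabilities as the measure of goal mismatch, the tolerance `tol`, the pruning threshold `min_rho`, the breadth-first strategy, and the two hypothetical actions.

```python
import math
from collections import deque, namedtuple

Action = namedtuple("Action", "name precond add delete")


def rho(P, act):
    """Applicable probability: product over the precondition atoms."""
    return math.prod(P[g] for g in act.precond)


def attempt(P, act):
    """Per-atom probabilities after attempting `act`."""
    r = rho(P, act)
    new_P = {}
    for g, p in P.items():
        fail_and_g = p - r if g in act.precond else p * (1.0 - r)
        if g in act.add:
            new_P[g] = r + fail_and_g
        elif g in act.delete:
            new_P[g] = fail_and_g
        else:
            new_P[g] = p
    return new_P


def distance(P, goal_P):
    # Assumed mismatch measure: L1 distance between per-atom probabilities.
    return sum(abs(P[g] - goal_P[g]) for g in goal_P)


def continuous_plan(P0, goal_P, actions, tol=0.25, min_rho=0.3, max_depth=4):
    """Breadth-first forward search on distributions over symbolic states."""
    queue = deque([(P0, [])])
    while queue:
        P, plan = queue.popleft()
        if distance(P, goal_P) <= tol:       # (v) goal distribution matched
            return plan
        if len(plan) >= max_depth:
            continue
        # (ii)-(iii) rank actions by applicability and skip unlikely ones.
        for act in sorted(actions, key=lambda a: rho(P, a), reverse=True):
            if rho(P, act) >= min_rho:
                queue.append((attempt(P, act), plan + [act.name]))   # (iv)
    return None


actions = [
    Action("Stack(A,B)", {"Clear(A)", "Clear(B)"}, {"On(A,B)"}, {"Clear(B)"}),
    Action("Unstack(A,B)", {"On(A,B)", "Clear(A)"}, {"Clear(B)"}, {"On(A,B)"}),
]
P0 = {"Clear(A)": 0.9, "Clear(B)": 0.8, "On(A,B)": 0.1}       # noisy current state
goal_P = {"Clear(A)": 0.9, "Clear(B)": 0.1, "On(A,B)": 0.9}   # noisy goal
plan = continuous_plan(P0, goal_P, actions)
```

The search never converts the probabilities to a discrete state: applicability is a ranking, action application shifts the per-atom probabilities, and termination is a comparison between two distributions.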

### IV-D Learning and Inference

Learning. Our Continuous Planner is a continuous relaxation of the symbolic planner and does not require training. Therefore, we only have to train the SGN. We learn Eq. (2) with full supervision. The procedure is summarized in Algorithm 1 and Figure 2(a). The aligned symbolic state for each continuous state $o_t$ in a demonstration is computed from the action annotation used in previous works [huang2019neural, xu2018neural].

Inference. As defined in Eq. (1), our model can act as a closed-loop policy conditioned on the demonstration $d_\tau$. The procedure is summarized in Algorithm 2 and Figure 2(b). As the predicted current and goal states are simply distributions over states, it is possible that the model fails to reach the goal after executing the actions in the plan. In this case, we update the initial symbolic state and re-plan.
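
The re-planning loop can be sketched as follows. This is not a reproduction of Algorithm 2: the function `run_closed_loop`, its stopping rule (an empty plan means the goal is matched), the retry budget, and the toy stand-ins for the environment, the SGN, and the planner are all hypothetical and exist only so that the sketch runs.

```python
def run_closed_loop(observe, execute, sgn, planner, demo, max_replans=5):
    """Plan, execute, re-observe, and repeat until nothing is left to do."""
    goal = sgn(demo[-1])               # goal distribution from the final demo state
    for _ in range(max_replans):
        current = sgn(observe())       # re-ground the current continuous state
        plan = planner(current, goal)
        if not plan:                   # empty plan: current already matches goal
            return True
        for action in plan:
            execute(action)
    return False


# --- Toy stand-ins (hypothetical) ---
world = {"Clear(A)", "Clear(B)"}
attempts = []


def observe():
    return set(world)


def execute(action):
    attempts.append(action)
    # The first attempt slips and leaves the world unchanged.
    if action == "Stack(A,B)" and len(attempts) > 1:
        world.discard("Clear(B)")
        world.add("On(A,B)")


def sgn(observation):
    atoms = ["Clear(A)", "Clear(B)", "On(A,B)"]
    return {g: 0.9 if g in observation else 0.1 for g in atoms}


def planner(current, goal):
    needs_stack = goal["On(A,B)"] > 0.5 and current["On(A,B)"] < 0.5
    return ["Stack(A,B)"] if needs_stack else []


demo = [{"Clear(A)", "Clear(B)"}, {"Clear(A)", "On(A,B)"}]
success = run_closed_loop(observe, execute, sgn, planner, demo)
```

In the toy run the first execution fails, the loop re-grounds the unchanged world, plans the same action again, and succeeds on the second attempt.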

## V Experiments

Our goal is to produce a closed-loop policy to solve a previously unseen task based on a single demonstration. The key insight is to formulate it as a planning problem along with the symbol grounding problem. We propose the Continuous Planner, which directly operates on the outputs of the Symbol Grounding Networks. Our experiments aim to answer the following questions: (1) How does our formulation of one-shot imitation learning as a planning problem compare to alternative policy representations? (2) How well does our Continuous Planner, as a continuous relaxation of the symbolic planner, handle the outputs of our Symbol Grounding Networks? We answer these questions by evaluating our method on two task domains: Block Stacking and Object Sorting in BulletPhysics [huang2019neural]. We compare the proposed framework with alternative formulations of one-shot imitation learning and evaluate the importance of our continuous relaxation.

Implementation Details. We use the object poses as continuous input states to our SGN. Both the object modules and the predicate modules of the SGN are 2-layer perceptrons with 128 hidden units in each layer. For all the planner based methods including our Continuous Planner, we use a simple forward planning algorithm for a fair comparison.

### V-A Baselines for Comparison

We compare the following baselines for the experiments:

Neural Task Graph Networks (NTG) [huang2019neural]. NTG is the closest to our work among the deep learning-based one-shot imitation learning approaches [duan2017one, xu2018neural, finn2017one]. In contrast to our planning-based formulation, NTG parameterizes the policy with a graphical structure to modularize the policy. This modularity does improve the data efficiency compared to the approaches without it [duan2017one, xu2018neural, finn2017one], and thus we see NTG as our main comparison among all the previous works. Note that in the graph-based formulation, the policy execution is still coupled with the inter-task generalization, in contrast to our decoupling using the planning-based formulation.

Symbolic Planner + Discrete SGN (SP). As previously discussed, one may discretize the output of the Symbol Grounding Network and plan with a classical symbolic planner (SP). The goal of comparing with SP is to show the effectiveness of our Continuous Planner. We use the same SGN as our Continuous Planner and discretize the outputs by picking the symbolic state with the highest probability from the distribution over states given by the SGN.

SP + Manual Heuristics. We have briefly discussed in Section IV-C that while it is possible to formulate the invalid state checking as a SAT problem, it leads to a computational bottleneck. An alternative is to manually design rules to identify invalid states. For example, as mentioned above, the state where both On(A, B) and Clear(B) are true is invalid. In addition, one needs to define how to rectify an observed invalid state into a valid one. For example, when we see both On(A, B) and Clear(B), we need to further decide which one to keep to make the state valid. Hence, a baseline is to manually define domain-dependent heuristics to rectify the states. One could also use methods like Markov Logic Networks [richardson2006markov] with infinite weights on the rules of invalid states for this. The drawback of this baseline is that defining the rules requires extensive domain knowledge, whereas our method is domain-independent. Again, we use the same SGN as SP for a fair comparison.

### V-B Evaluating Block Stacking

Experimental Setups. We follow the setup of Block Stacking tasks in previous works [huang2019neural, xu2018neural], where the goal is to stack the set of 8 blocks into a target configuration. The blocks are 5 cm cubes. The final block configuration of the demonstration is used as the goal state. For the Manual Heuristics baseline, we have three rules for invalid states in this domain. First, if an object is on top of another object, then the bottom object cannot be Clear. Further, a block can only be on top of another block and can only have one other block on top of it.
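
As an illustration of what such hand-written rules look like, the sketch below detects violations of the three Block Stacking rules in a discretized state. The tuple encoding of ground atoms, the function name, and the reading of the second and third rules as "at most one block directly below" and "at most one block directly above" are our assumptions. The sketch only detects violations and does not decide how to rectify them.

```python
def find_violations(state):
    """Return the violated rules for a discretized block-stacking state.

    `state` is a set of tuples such as ("On", "A", "B") or ("Clear", "B").
    """
    on_pairs = sorted((t[1], t[2]) for t in state if t[0] == "On")
    clear = {t[1] for t in state if t[0] == "Clear"}
    found = []

    # Rule 1: the bottom block of an On relation cannot be Clear.
    for top, bottom in on_pairs:
        if bottom in clear:
            found.append(f"On({top},{bottom}) contradicts Clear({bottom})")

    tops = [top for top, _ in on_pairs]
    bottoms = [bottom for _, bottom in on_pairs]

    # Rule 2: a block can be on top of only one other block.
    for block in sorted(set(tops)):
        if tops.count(block) > 1:
            found.append(f"{block} is on top of more than one block")

    # Rule 3: a block can have only one other block on top of it.
    for block in sorted(set(bottoms)):
        if bottoms.count(block) > 1:
            found.append(f"{block} has more than one block on top of it")

    return found


valid_state = {("On", "A", "B"), ("Clear", "A")}
invalid_state = {("On", "A", "B"), ("Clear", "B"), ("On", "C", "B")}

no_violations = find_violations(valid_state)
violations = find_violations(invalid_state)
```

Every rule here is specific to Block Stacking, which is the domain knowledge that the Continuous Planner avoids having to encode.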

Results. The results are shown in Figure 5. The x-axis shows the number of training tasks used in $\mathbb{T}_{seen}$ and the y-axis shows the success rate of the model on tasks in $\mathbb{T}_{unseen}$ given a single demonstration. All of the planning-based methods outperform the policy networks of NTG [huang2019neural], even though the NTG policy is already parameterized by the task graph to improve the data efficiency. The results show the importance of our formulation of one-shot imitation learning as a planning problem to disentangle the inter-task generalization from the task execution. Our Continuous Planner is able to directly plan on the probabilistic outputs of the SGN and significantly outperforms the symbolic planner baseline, in which we are forced to make uninformed discrete decisions based on the SGN outputs that can easily lead to invalid states. In addition, our Continuous Planner is able to perform comparably to the Manual Heuristics baseline without using any further manually designed rules and heuristics to rectify the invalid states. This demonstrates the domain-independent scalability of our method.

Figure 4 compares the proposed Continuous Planner and the symbolic planner baseline. In Figure 4(a), the discretized SGN outputs fail to recognize that the goal configuration requires On(H, G). This prevents the symbolic planner from completing the task. On the other hand, the continuous outputs of the SGN are still able to inform our Continuous Planner about the goal On(H, G). Figure 4(b) shows an important case for another task. The original discretized output of the SGN is an invalid state, where both On(E, A) and Clear(A) are present. The symbolic planner fails to reach the goal because of this incorrect state discretization. On the other hand, our Continuous Planner can still operate on this distribution over states to match the goal distribution and successfully complete the task.

### V-C Evaluating Object Sorting

Experimental Setups. The goal of the Object Sorting domain is to move the objects scattered on the tabletop to the corresponding containers shown in the demonstrations. We use four object categories and four containers. We consider the challenging setting of Huang et al. [huang2019neural], where the task is initialized in such a way that it requires alternative solutions that are distinct from the demonstration. For this domain, the rule for our Manual Heuristics baseline is that an object cannot be at different locations at the same time. We use this as a hard constraint and maximize the total probability.
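
One way to read this heuristic is sketched below, under assumptions of our own: each object occupies exactly one of a fixed set of locations, the location atoms are treated as independent, and "maximizing the total probability" means choosing the most probable state that satisfies the constraint. Under these assumptions the most probable consistent state keeps, for each object, the location with the highest predicted probability. The object and location names and the numbers are hypothetical.

```python
def rectify_locations(location_probs):
    """Pick one location per object.

    `location_probs` maps each object to {location: predicted probability}.
    With independent atoms and exactly one true location per object, the
    most probable consistent assignment is the per-object argmax.
    """
    return {
        obj: max(probs, key=probs.get)
        for obj, probs in location_probs.items()
    }


# Illustrative SGN outputs: thresholding at 0.5 would place the apple in two bins.
location_probs = {
    "apple": {"bin1": 0.7, "bin2": 0.6, "table": 0.1},
    "mug": {"bin1": 0.2, "bin2": 0.3, "table": 0.8},
}

thresholded = {
    obj: sorted(loc for loc, p in probs.items() if p > 0.5)
    for obj, probs in location_probs.items()
}
rectified = rectify_locations(location_probs)
```

The rectified state is valid by construction, but the rule that produces it has to be written separately for each domain.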

Results. As shown in Figure 6, the proposed Continuous Planner quickly achieves perfect performance with only 8 training tasks, without any additional manual rules and heuristics to rectify the outputs of the SGN. On the other hand, both the symbolic planner and the NTG baselines are still unable to achieve a 100% success rate with 15 training tasks. The main challenge is that the model has to infer alternative solutions to the Object Sorting task that are not observed in the demonstration. To address this challenge, the NTG baseline uses a specifically designed task graph generation model to complete the task graph from the demonstration, which enables NTG to outperform the symbolic planner baseline. Compared to NTG, our Continuous Planner approach formulates one-shot imitation as a planning problem and is thus capable of completing the tasks with alternative solutions. This leads to better generalization not only between tasks but also between the alternative solutions to complete the same task.

## VI Conclusion

We presented a new formulation of one-shot imitation learning as a planning problem. This disentangles the inter-task generalization from the policy execution and leads to better data efficiency. The key challenge is that the Symbol Grounding Networks that connect the planner with the continuous input state can be unreliable without sufficient training data. This introduces uncertainty in the state representation. We address this by replacing the set-theoretic representation in the symbolic planner with probabilistic symbols and generalizing the planner to operate on the distribution over states. This allows us to derive the continuous relaxation of the symbolic planner, which significantly improves the performance over the symbolic planner baseline by handling the symbol uncertainty. We show that the resulting Continuous Planner is able to outperform state-of-the-art one-shot imitation learning approaches on two challenging task domains. This shows the importance of both our planning formulation and continuous relaxation.

Acknowledgements. Toyota Research Institute (“TRI”) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
