Generalized Inverse Planning: Learning Lifted non-Markovian Utility for Generalizable Task Representation

by Sirui Xie, et al.

In searching for a generalizable representation of temporally extended tasks, we identify two necessary constituents: the utility needs to be non-Markovian so that it transfers temporal relations invariantly under a probability shift, and it needs to be lifted so that it abstracts out specific grounding objects. In this work, we study learning such utility from human demonstrations. While inverse reinforcement learning (IRL) has been accepted as a general framework for utility learning, its fundamental formulation is tied to one concrete Markov Decision Process. Thus the learned reward function does not specify the task independently of the environment. Going beyond that, we define a domain of generalization that spans a set of planning problems following a schema. We hence propose a new quest, Generalized Inverse Planning, for utility learning in this domain. We further outline a computational framework, Maximum Entropy Inverse Planning (MEIP), that learns non-Markovian utility and associated concepts in a generative manner. The learned utility and concepts form a task representation that generalizes regardless of probability shift or structural change. Seeing that the proposed generalization problem has not been widely studied yet, we carefully define an evaluation protocol, with which we illustrate the effectiveness of MEIP on two proof-of-concept domains and one challenging task: learning to fold from demonstrations.



Humans learn underlying utility by observing others’ behaviors. It is widely accepted that we humans have a Theory of Mind (ToM): we assume others to be boundedly rational agents and inversely solve for their utility to understand their planned behaviors Baker et al. (2009). The inferred utility is associated with some concepts, which together specify the task. In a similar context, this utility then generalizes and incentivizes us to act similarly. In this work, we study a formal definition of such generalization and a proper machinery to learn such utility.

The utility we want to study is different from the reward function in classical reinforcement learning. In their seminal book, Sutton and Barto (1998) distinguish planning from reinforcement learning as requiring some explicit deliberation using a world model. It is this deliberation that the utility we discuss here is expected to capture.

Figure 1: (a) An MDP on which the agent needs to reach without hitting a (highlighted with the green arrow). Any Markovian rewards that make this expected behavior optimal are coupled with . (b) The default and weird sequences for folding a cloth. The former should have higher utility. The underlying utility should also generalize robustly.

Apart from these philosophical concerns, there are also computational issues in the setup of classical reinforcement learning. One of our key insights is that tasks specified with Markovian reward functions do not generalize over environments. Consider a didactic example from Littman et al. (2017), see Fig.1a. The desired behavior we want to specify is ‘maximizing the probability of reaching the goal without hitting a bad state ’. It is equivalent to a temporal description ‘do not visit a until you reach the ’, which unfortunately cannot be represented with Markovian rewards independently of the environment. Concretely, let’s assume a discount of and a reward of +1 for reaching the goal. If , setting to the bad state encourages the desired behavior. But if , this reward needs to be . Even though this example may seem contrived, it captures the essence of the limitation of Markovian rewards. There are more natural examples in our daily life. Imagine you are folding your clothes after laundry. Normally, you fold the left and right sleeves before you fold the garment in half from top to bottom, see Fig.1b. Searching your memory, you will probably realize this has been the default order ever since you learned to fold clothes as a kid, whether the garment is a T-shirt or a sweater, and whether it is your all-time favorite or a brand-new one. Remarkably, the utility you learned on this temporally extended task exhibits robust generalization. In other words, we have the cognitive capability to learn a task representation with a utility function that is independent of the environment.
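The coupling between a Markovian reward and the transition probability can be checked numerically. The following is a minimal sketch on a hypothetical two-action MDP, not the MDP of Fig.1a: the actions `shortcut` and `detour`, the slip probability `p_slip`, the discount of 0.9 and the bad-state reward `r_bad` are all illustrative assumptions. The detour reaches the goal with certainty, so it is always the desired behavior, yet whether a fixed penalty makes it optimal depends on `p_slip`.

```python
def action_values(p_slip, r_bad, gamma=0.9, detour_steps=3):
    """Expected discounted return of each action in the hypothetical MDP.

    shortcut: one step; reaches the goal (+1) with probability 1 - p_slip,
              otherwise lands in the bad state (reward r_bad). Both terminal.
    detour:   reaches the goal (+1) with certainty on step `detour_steps`.
    """
    shortcut = (1.0 - p_slip) * 1.0 + p_slip * r_bad
    detour = gamma ** (detour_steps - 1) * 1.0
    return {"shortcut": shortcut, "detour": detour}


def prefers_detour(p_slip, r_bad, gamma=0.9, detour_steps=3):
    """True if the safe detour is the optimal action under this reward."""
    values = action_values(p_slip, r_bad, gamma, detour_steps)
    return values["detour"] > values["shortcut"]


def penalty_threshold(p_slip, gamma=0.9, detour_steps=3):
    """Bad-state reward below which the detour becomes optimal."""
    return (gamma ** (detour_steps - 1) - (1.0 - p_slip)) / p_slip


if __name__ == "__main__":
    # The same penalty of -0.5 is evaluated in two environments.
    for p in (0.5, 0.1):
        print(p, prefers_detour(p, r_bad=-0.5), penalty_threshold(p))
```

In this toy setting a penalty of -0.5 yields the desired behavior when `p_slip` is 0.5 but not when it is 0.1, where the penalty would have to be below roughly -0.9. The reward therefore has to be retuned for every environment, which is the limitation discussed above.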

In fact, a slip in the transition probability is only the mildest of environmental shifts. In the example of cloth folding, hoodies and T-shirts have different structures in the underlying probabilistic graphical models (PGM). Concretely, this is because they have different numbers of edges in their contours and thus different numbers of nodes in their object-oriented probabilistic graphs. Despite this difference, the utility learning mechanism ought to be tolerant of the heterogeneous nature of demonstrations. The learned utility should also help when folding a sweater after seeing these demonstrations. From a classicist’s perspective, this requirement goes beyond the formulation of RL, Markov Decision Processes (MDPs), where the language that specifies the structure of the MDP (which is essentially a PGM) is constrained to be propositional. For readers who are not familiar with classical AI or computational linguistics, propositional logic can be understood as a language without object-orientation. Its object-oriented counterpart is first-order or relational logic. Intuitively, utility associated with a relation, e.g., “symmetric”, can generalize better than utility associated with a grounded description such as “the left and right sleeves are symmetric”. In classical AI, this property is called lifted, in the sense of not being grounded with specific entities.

How can machines learn utility that is both non-Markovian and lifted? The closest solution in the literature is Inverse Reinforcement Learning (IRL) Abbeel and Ng (2004). However, IRL adopts the fundamental modeling of an MDP. Given an MDP and a set of demonstrations from it, IRL learns a Markovian reward function by matching the mean statistics of states or state-action pairs. This learned reward function can only encourage the expected behavior in the identical MDP, because it falls into the trap of the didactic example above. Apart from that, utility learned with vanilla IRL also fails to generalize to a PGM with a different structure. In this work, we propose a joint treatment for learning a generalizable task representation from human demonstrations.

Our contributions are threefold:

  • We characterize the domain of generalization in this utility learning problem by formally defining a schema for the planning task that represents a set of planning problems. The target utility should be learned in a subset of this domain and successfully generalize to other problem instances. This is in stark contrast to the formulation of IRL for which only one planning problem is of interest. We hence term the problem generalized inverse planning.

  • We propose an energy-based model (EBM) for generalized inverse planning. In Statistics, energy-based models are also dubbed descriptive models. This is because the model is expected to match the minimal statistical description in data. It is this description that differentiates the proposed model, Maximum Entropy Inverse Planning (MEIP), from prior art such as MaxEnt-IRL Ziebart et al. (2008). Instead of matching the mean statistics in demonstrations under a Markovian assumption, MEIP matches the ordinal statistics, a description that sufficiently captures the temporal relations in non-Markovian planning. This model can be learned with Maximum Likelihood and sampled with Monte Carlo Tree Search (MCTS). To combine MEIP with a first-order concept language, we further introduce a boosting method to pursue abstract concepts associated with the utility.

  • Seeing that the generalization problem in this work has barely been studied systematically, we carefully design an evaluation protocol. Under this protocol, we validate the generalizability of utility learned with MEIP in two proof-of-concept experiments for environmental change and a challenging task, learning to fold clothes.


Inverse Reinforcement Learning

Imitation learning, also known as learning from demonstrations, is a long-standing problem in the artificial intelligence community. The earliest works mostly adopt the paradigm of Behavior Cloning (BC), which directly supervises the policy at each step of the provided sequences Hayes and Demiris (1994), Amit and Matari (2002). Atkeson and Schaal (1997) were the first to consider the temporal drift in sequences. Nonetheless, BC is generally believed to be less transferable because it does not generatively model the decision-making sequences. Alternatively, Ng et al. (2000) proposed another paradigm, IRL, to inversely solve for the reward function in an MDP. The learned reward function is expected to help the agent learn a new policy that hopefully covers more states in the MDP. Together with some extensions such as Abbeel and Ng (2004) and Ratliff et al. (2006), they set up the standard formulation of IRL.

Consider a finite-horizon MDP, which is a tuple , where is the set of states, is the set of actions, is the Markovian transition probability, is the Markovian reward function and is the horizon. Abbeel and Ng (2004) assume the Markovian reward to be linear in some predefined features of states and derive the feature expectation as . Specifically, the underlying unknown reward is assumed to be with unknown parameter . Given a set of demonstration trajectories from this MDP with ,

, the value with the estimated parameter



The algorithm of IRL always runs with an off-the-shelf reinforcement learning method to generate trajectories with the optimal value for the same MDP except that the reward is the current estimation. We thus have two sets of trajectories. Assuming humans behave rationally when demonstrating, Abbeel and Ng then propose a learning objective to maximize the margin between these two sets. When the model converges, the mean statistics from and should be matched.
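The mean statistics matched by IRL can be sketched as an empirical feature expectation. The snippet below is an illustration under stated assumptions rather than the formulation of Abbeel and Ng (2004) verbatim: `feature_expectation` and `one_hot` are hypothetical names, states are integer indices, and the discount `gamma` defaults to 1 for the finite-horizon case.

```python
def one_hot(state, num_states=3):
    """Hypothetical state feature: indicator vector of the visited state."""
    return [1.0 if i == state else 0.0 for i in range(num_states)]


def feature_expectation(trajectories, featurize, gamma=1.0):
    """Average (optionally discounted) feature counts over trajectories."""
    dim = len(featurize(trajectories[0][0]))
    mean = [0.0] * dim
    for states in trajectories:
        for t, state in enumerate(states):
            for k, value in enumerate(featurize(state)):
                mean[k] += (gamma ** t) * value / len(trajectories)
    return mean


# Two trajectories that visit the same states in opposite temporal order.
forward = feature_expectation([[0, 1, 2]], one_hot)
backward = feature_expectation([[2, 1, 0]], one_hot)
```

With `gamma` equal to 1, `forward` and `backward` are identical, which illustrates why matching mean statistics alone cannot distinguish the temporal order of states.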

To account for humans’ bounded rationality and imperfect demonstrations, Ziebart et al. (2008) introduced a probabilistic model for IRL. They started from a statistical mechanics perspective: the IRL model above does not address the ambiguity that many reward functions can lead to the same feature counts. They propose MaxEnt-IRL to maximize the entropy of the distribution of trajectories to break the tie while matching the mean statistics. The distribution of a trajectory in a deterministic MDP is thus


where is the partition function. For non-deterministic MDPs, there is an extra factor to account for the transition probability . In MaxEnt-IRL, the reward function is learned with Maximum Likelihood Estimate (MLE). Some variants such as Boularias et al. (2011) and Ortega and Braun (2013) learn by minimizing relative entropy (KL divergence).
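As a rough illustration of the MaxEnt-IRL model for a deterministic MDP, the sketch below assumes the standard exponential form in which the probability of a trajectory is proportional to the exponential of its linear reward, and it enumerates a small finite set of candidate trajectories so that the partition function is an explicit sum. The function names and the toy feature counts are assumptions made for this example.

```python
import math


def trajectory_distribution(feature_counts, theta):
    """p(trajectory) proportional to exp(theta . f) over an enumerated set."""
    scores = [sum(t * f for t, f in zip(theta, fc)) for fc in feature_counts]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    partition = sum(weights)
    return [w / partition for w in weights]


def log_likelihood_gradient(demo_mean, feature_counts, theta):
    """Gradient of the average demo log-likelihood w.r.t. theta.

    It equals the demonstrated feature expectation minus the feature
    expectation under the current model.
    """
    probs = trajectory_distribution(feature_counts, theta)
    expected = [
        sum(p * fc[k] for p, fc in zip(probs, feature_counts))
        for k in range(len(theta))
    ]
    return [d - e for d, e in zip(demo_mean, expected)]


candidates = [[1.0, 0.0], [0.0, 1.0]]  # feature counts of two trajectories
gradient = log_likelihood_gradient([1.0, 0.0], candidates, [0.0, 0.0])
```

The gradient vanishes exactly when the model's feature expectation matches that of the demonstrations, which is the mean-matching condition described above.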

We derive our model of Maximum Entropy Inverse Planning with the same principle. But in stark contrast to matching the mean statistics under the assumption of Markovian independence, MEIP matches an ordinal statistic, the Kendall rank correlation, to account for the non-Markovian temporal relations. To some degree, we share the same spirit with Xu et al. (2009) and Garrett et al. (2016). But their models are discriminative, and such models are believed to be more data-hungry and less generalizable than analysis by synthesis Grenander (1993).

Structured Utility Function

Prior attempts to make utility functions structured fall into two regimes: those that adopt First-order Logic (FOL) for lifting Džeroski et al. (2001), Kersting and Driessens (2008) and those that adopt Linear Temporal Logic (LTL) for non-Markovian rewards Li et al. (2017), Littman et al. (2017), Icarte et al. (2018). Leveraging the expressive powers of language and axioms, these structured utility functions can be generalized over a task domain. We draw inspiration from them. There are works that also inversely solve for structured utility functions from demonstrations Munzer et al. (2015), Vazquez-Chanlatte et al. (2018), but they only focus on one regime. To the best of our knowledge, our work is the first to provide a unified probabilistic model for both regimes. Our framework adopts a relational language for abstract and composable concepts and matches ordinal statistics with a descriptive model. The learned structured utility function is expected to represent a task in Generalized Planning, to be elaborated below.

Generalized Inverse Planning

The planning we want to discuss in this work is generalized in the sense that its representation generalizes over a domain with multiple planning problems Srivastava et al. (2011). It is thus fundamentally different from the formulation of reinforcement learning which only maximizes rewards on one specific MDP.

Definition 1 (Planning Domain)

A planning domain consists of a set of fluents which are real-valued functions and a set of actions which are associated with some entities or parameters or both.

Obviously, fluents here are generalized predicates in classical symbolic planning, whose arity should be given in advance and whose value may vary with time. Actions may also be associated with preconditions and effects, specified by sets of fluent values. This definition makes the planning domain lifted to the first order: it abstracts out instances of objects, agents, and events, and only describes their classes and relations. It therefore generalizes to arbitrary numbers of instances. In the literature, the STRIPS-style action language Fikes and Nilsson (1971) provides such abstraction.

Definition 2 (Stochastic Planning Problem)

A stochastic planning problem is given by a domain , a set of entities , a set of transition probabilities for , a reward function and a distribution over initial states. A state is specified by a set of fluent values.

Here we do not assume the transition probabilities to be Markovian. The definition can thus be easily generalized to Partially Observable Kaelbling et al. (1998) or semi-Markovian Sutton et al. (1999) settings. States are sets of fluent values, which may be grounded spatial relations, grounded temporal relations, or outputs of functions. The reward function is preferred over Goal because the extensionality of the latter might be hard to specify, particularly for inversion from demonstrations. Because, as introduced earlier, Markovian rewards may not generalize over a domain independently of the environment, we assume to be non-Markovian for each planning problem. A plan is a (sub)optimal trajectory of states for a planning problem.

Given a set of plans from different planning problems in the same domain, a task is their abstraction that generalizes these plans and captures their underlying optimality. It may abstract out numerical values in . At the minimal level, it is the action or transition at each state that we care about. This principle was previously discussed in Martin and Geffner (2004), Natarajan et al. (2011), and Silver et al. (2020), which directly imitate the behavior with expressive policies whose hypothesis space is logically or programmatically constrained. But depending completely on the policy also constrains the capability of generalization. For example, the learned policy may not work or generalize well if the action set is (uncountably) infinite. To this end, an ordering in the state space is still needed for a task to represent the (possibly bounded) rationality in demonstrated plans Khardon (1999).

Definition 3 (Planning Task)

A planning task is a tuple , where is the domain and is a partial ordering relation over states.

This is a minimal algebraic description of a task for generalized planning. Generalized Inverse Planning is then to inversely solve, from demonstrations, for a representation of the task that maintains this algebraic structure.

Our proposal towards generalized inverse planning is a nested algorithm. In the inner loop, it learns a numerical representation of with maximum likelihood. In the outer loop, it pursues first-order concepts with maximum a posteriori (MAP) estimation. We introduce them from the inside out in the next two sections and provide pseudo-code in Alg.1 and Alg.2.

Maximum Entropy Inverse Planning

Descriptive Model and Maximum Entropy

We adopt descriptive modeling for inverse planning. Different from a discriminative model, which models a conditional probability, e.g., a classifier, a descriptive model specifies the probability distribution of the signal, based on an energy function defined on the signal through some descriptive feature statistics extracted from the signal Wu et al. (2019). In the literature of modern AI, it is also known as an energy-based model LeCun et al. (2006). From the discussion above, it becomes clear that given a set of plans from different problems in the same domain, the minimum statistical description to match should be ordinal to account for temporal relations. Therefore, different from the feature expectation in IRL, we match the Kendall ranking statistics for inverse planning given plans from the same domain:


where scores the concordance or discordance of a ranking function for temporally indexed states and where from a plan for a problem :


The range of Kendall τ is [-1, 1], where 1 indicates a perfect match. However, normally there can be multiple distributions of plans and thus multiple ranking functions that match this ordinal statistic. Similar to MaxEnt-IRL, we employ the principle of maximum entropy Jaynes (1957) to resolve ambiguities in choosing distributions. Concretely, we maximize the entropy of the distribution over plans under the constraint that the Kendall τ can be matched between the demonstrations and generated plans:


Under the KKT conditions, we can derive the Boltzmann form of this distribution from Eq.5’s Lagrangian:


where is the partition function, is the Lagrangian multiplier and accounts for the transition probability .
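The ordinal statistic and the resulting Boltzmann form can be sketched as follows. This is a simplified reading with several assumptions: the Kendall statistic is computed in its tau-a form between the temporal index and the ranking value of each state, later states are assumed to rank higher in a perfect match, the plan distribution is normalized over an enumerated candidate set, and the transition-probability factor is omitted. The state labels and the `ranking` table are hypothetical.

```python
import math


def kendall_tau(utilities):
    """Kendall tau-a between temporal order and ranking values of one plan."""
    n = len(utilities)
    if n < 2:
        raise ValueError("need at least two states to form a pair")
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            if utilities[j] > utilities[i]:
                score += 1  # concordant: the later state ranks higher
            elif utilities[j] < utilities[i]:
                score -= 1  # discordant: the later state ranks lower
    return score / (n * (n - 1) / 2)


def plan_distribution(candidate_plans, ranking, lam=1.0):
    """Boltzmann weights exp(lam * tau), normalized over the candidates."""
    taus = [kendall_tau([ranking[s] for s in plan]) for plan in candidate_plans]
    weights = [math.exp(lam * tau) for tau in taus]
    partition = sum(weights)
    return [w / partition for w in weights]


ranking = {"flat": 0.0, "sleeves_folded": 1.0, "folded_in_half": 2.0}
default_plan = ["flat", "sleeves_folded", "folded_in_half"]
weird_plan = ["flat", "folded_in_half", "sleeves_folded"]
probs = plan_distribution([default_plan, weird_plan], ranking)
```

Under this toy ranking the default folding order is perfectly concordant and therefore receives the larger probability, while the weird order has one discordant pair out of three.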

Utility Learning for Order Preserving

Comparing Eq.6 with Eq.2, it is easy to see the correspondence between and . To further derive the utility function of plans from Eq.6, we first specify the concrete form of . We assume that for each state , which is a set of grounded fluent values, there is a vector of first-order concepts that generalizes over the domain . We will introduce how this vector can be learned in the next section; here we simply assume it is given. We further assume that the ranking function is piece-wise linear with respect to this concept vector . Being piece-wise linear is a general assumption because functions with this characteristic are in theory as expressive as artificial neural networks. Specifically, we discretize each entry into bins and attach a vector as a functional:


Obviously, there is no need to separate and anymore. So we drop for the derivation below.
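One possible reading of the binned ranking function is sketched below. It assumes that each concept value lies in a known interval, that bins have equal width, and that the ranking score is the sum of one learned weight per concept, selected by the bin the concept value falls into. The names `bin_index` and `ranking_score` and the value range are assumptions for illustration only.

```python
def bin_index(value, low, high, num_bins):
    """Index of the equal-width bin that contains `value`, clipped to range."""
    if value <= low:
        return 0
    if value >= high:
        return num_bins - 1
    return int((value - low) / (high - low) * num_bins)


def ranking_score(concepts, weights, low=0.0, high=1.0):
    """Sum over concepts of the weight attached to the active bin.

    concepts: list of concept values, one per concept.
    weights:  list of per-concept weight vectors, one entry per bin.
    """
    return sum(
        w[bin_index(c, low, high, len(w))] for c, w in zip(concepts, weights)
    )


# Two concepts, each discretized into two bins.
score = ranking_score([0.1, 0.9], [[0.0, 1.0], [2.0, 5.0]])
```

The score is linear in the bin indicators, so the weights can be learned with linear methods even though the function is non-linear in the raw concept values.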

To further illustrate the utility function in MEIP, we can rewrite as:


Therefore we have


It is easy to see that this reward function is non-Markovian, in stark contrast to in MaxEnt-IRL.

We can solve for the by maximum likelihood (MLE) over given demonstrations, which according to Jaynes (1957) implies maximum entropy:


Notice that is dropped as a constant. If is differentiable w.r.t. , consider Eq.10’s gradient


the second term can be approximated by sampling. There is a contrastive view here: when maximizing the likelihood of demonstrations, we maximize their Kendall τ and minimize the Kendall τ of generated plans. The underlying intuition is that at convergence, for pairs of states in the demonstrations, if ; for pairs in generated plans, if ; for pairs with one state in demonstrations and the other one from generated plans, . We refer to the ordinal relations listed here as , resembling ground-truth labels.

However, is not differentiable. To this end, we need to learn a classifier with to match the order described above, mimicking the optimization in Eq.11. One way is to directly relax with a discriminator . To some extent, this approximation shares a similar spirit with Generative Adversarial Imitation Learning (GAIL) Ho and Ermon (2016) and some variants Finn et al. (2016). But different from them, we explicitly consider the temporal relation in the non-Markovian utility.

We can also learn this classifier with max-margin methods. Essentially, it is a Ranking SVM model Liu (2011) taking both the demonstrations and the sampled plans as supervision. Consider there are pairs of states in total from expected to fulfill the ordinal relation above, we have this Quadratic Optimization problem:


It can be solved by off-the-shelf Gradient Descent (GD) or Quadratic Programming (QP). Because the number of constraints and slack variables grows quadratically in the size of each plan, we only consider the ordinal relations within each planning problem . The parameters, however, are shared across all problems in the domain . If we only consider the linear, primal form of the problem, there are also efficient methods for training Joachims (2006).
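The max-margin variant can be sketched as a linear Ranking SVM in its primal form, trained by plain subgradient descent on a pairwise hinge loss. The sketch makes several assumptions: only pairs from one demonstrated plan are used, a later state is assumed to be preferred over an earlier one, and the step size, number of epochs and feature vectors are arbitrary. Pairs from sampled plans and mixed pairs would be appended to the same list with their own ordering.

```python
def temporal_pairs(plan_features):
    """All (preferred, other) pairs of one plan; later states are preferred."""
    n = len(plan_features)
    return [
        (plan_features[j], plan_features[i])
        for i in range(n)
        for j in range(i + 1, n)
    ]


def train_ranking_svm(pairs, dim, c=1.0, lr=0.1, epochs=200):
    """Minimize 0.5*|w|^2 + c * sum(max(0, 1 - w.(hi - lo))) over pairs."""
    w = [0.0] * dim
    for _ in range(epochs):
        grad = list(w)  # gradient of the regularizer
        for hi, lo in pairs:
            diff = [h - l for h, l in zip(hi, lo)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            if margin < 1.0:  # violated pair contributes a hinge subgradient
                for k in range(dim):
                    grad[k] -= c * diff[k]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def score(w, features):
    """Linear ranking score of one state."""
    return sum(wi * fi for wi, fi in zip(w, features))


demo = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]  # hypothetical state features
pairs = temporal_pairs(demo)
w = train_ranking_svm(pairs, dim=2)
```

After training, `score(w, hi) > score(w, lo)` holds for every pair in this toy example, so the learned linear function preserves the demonstrated order.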

Sampling Method

We need to sample from the EBM to calculate parts in Eq.12 that correspond to the second term in Eq.11. Thus we need to have the distribution of trajectories from the EBM. In this work, we only consider planning problems where states are discrete. We leave the continuous state space as our future work. We can factorize the probability of a plan by conditioning:


We take its logarithmic form to connect with Eq.8 and Eq.9:


where takes both the action and transition probability into account.

Footnote 1: The main concern here is that grounded actions in plans from different may not generalize over . Bonet and Geffner (2018) discuss learning abstract actions for generalized planning. Here we do not enforce a relational form in the action space . We assume the equivalence between actions causing the same transitions. We further assume there is an absorbing state failure for each state transition, thus is stochastic with .

Since here we do not explicitly specify the action in (see Footnote 1), we decompose it to be


It then becomes clear that to sample from the descriptive model, we just sample proportionally to :


can be acquired from Monte Carlo Tree Search, for which the reward at each node is initialized with . After its convergence, we can sample trajectories according to the value of each branch.
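The sampling step can be sketched as drawing each next state with probability proportional to the exponential of its branch value. In the sketch below, `branch_values_of` is a hypothetical callable that stands in for converged MCTS value estimates; it receives the whole history so that the values may be non-Markovian. The toy values are assumptions chosen to make the outcome easy to follow.

```python
import math
import random


def sample_next(branch_values, rng):
    """Sample a successor with probability proportional to exp(value)."""
    states = list(branch_values)
    top = max(branch_values.values())  # shift for numerical stability
    weights = [math.exp(branch_values[s] - top) for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


def sample_plan(start, branch_values_of, horizon, rng):
    """Roll out one plan; stops early when a state has no successors."""
    plan = [start]
    for _ in range(horizon):
        branches = branch_values_of(plan)
        if not branches:
            break
        plan.append(sample_next(branches, rng))
    return plan


def toy_branch_values(history):
    """Hypothetical stand-in for value estimates returned by MCTS."""
    last = history[-1]
    if last == "start":
        return {"a": 1000.0, "b": 0.0}
    if last == "a":
        return {"goal": 0.0}
    return {}


plan = sample_plan("start", toy_branch_values, horizon=5, rng=random.Random(0))
```

Because the value gap in the toy example is very large, the sampler behaves almost greedily here; with closer values it would spread probability across branches.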

It is worth noticing that the same sampling method can be used to plan for optimal utility when transferred to other problems in the same domain . So generally speaking, for sampled plans, there is a min-max view:

Input: A Set of Concepts from the Concept Language; MCTS Convergence Conditions; Hyperparams.
Data: Human Demonstrations
Result: Learned Parameters of Utility
Init: , value function in MCTS
while utility function not converged do
    while MCTS not stopped do
        MCTS rollout, generate
        Compute Kendall and with Eq. 8
        Value iteration via MCTS with Kendall
    end while
    sample trajectories with MCTS according to converged values of branches
    update with and (See Eq. 11)
end while
Algorithm 1 Maximum Entropy Inverse Planning

Learning First-Order Concepts

In the previous section, we introduced a descriptive model for inverse planning, given some concepts that generalize over the planning domain. In this section, we introduce a formalism for concepts with these characteristics and describe how we learn them in Generalized Inverse Planning.

Concept Language

As introduced in Sec.3 (Generalized Inverse Planning), a planning domain is defined in a lifted manner, abstracting out instances of entities in specific problems. Therefore the computational form of the utility learned from Generalized Inverse Planning should also be lifted and be invariant to the variation of numbers of instances. In AI, first-order logic with quantifiers and aggregators is a formalism to express this invariance. To this end, we employ a modified concept language Donini et al. (1997) as a grammar for elements in such that we can learn these first-order concepts in a top-down manner. Concept languages have the expressive power of subsets of standard first-order logic yet with a syntax that is suited for representing classes of entities. Adopting the terminology of FOL, each concept is represented by a first-order formula. However, in the original concept languages, concepts are assumed independent, which might not be the case for our utility function (see Footnote 2).

Footnote 2: Recall that we assume the ranking function to be (piece-wise) linear in Eq.7, which requires concepts to be independent from each other. Utility functions whose concepts are not independent will never be expressed without this modification.

So the primary modification we introduce is to complete it as in FOL, explicitly accounting for formulas which take other formulas as terms with a syntax:

where the notation highlights different concepts. denotes predicates with binary value domains, i.e., relations; denotes functions with either real or discrete value domains (see Footnote 3).

Footnote 3: They may have neural network equivalents structured as Multi-Layer Perceptrons (MLP) or Graph Neural Nets (GNN).

They all have their own arities. are concepts represented by atomic formulas with a syntax:

is the set of aggregators and is the set of quantifiers. is a set that is either the value domain of a fluent or the truth domain of a predicate . is the domain of interest that can either be the extension of a certain predicate, which is denoted as , or the universe () of entities. The dimension of the domain should match the arity of and when placed together in and . Among predicates , are primitives. Other predicates can be their constituents’ negation () or conjunction (). They can also be a result of permuting another predicate’s arguments (), if the arity is larger than 1 and arguments are from the same class. They can even be defined transitively (), as in original concept languages. Apparently, .
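To illustrate why quantified and aggregated concepts are invariant to the number of entities, the sketch below evaluates two lifted concepts on states with different numbers of entities. The state encoding, the class labels and the `folded` predicate are hypothetical and much simpler than the concept language described above.

```python
def extension(state, entity_class):
    """Entities of the given class: the domain a concept ranges over."""
    return [e for e, c in state["classes"].items() if c == entity_class]


def all_folded(state, entity_class):
    """Universally quantified concept: every entity of the class is folded."""
    return all(e in state["folded"] for e in extension(state, entity_class))


def fraction_folded(state, entity_class):
    """Aggregated concept: share of entities of the class that are folded."""
    entities = extension(state, entity_class)
    if not entities:
        return 0.0
    return sum(1 for e in entities if e in state["folded"]) / len(entities)


tshirt = {
    "classes": {"left_sleeve": "sleeve", "right_sleeve": "sleeve", "body": "body"},
    "folded": {"left_sleeve", "right_sleeve"},
}
hoodie = {
    "classes": {
        "left_sleeve": "sleeve",
        "right_sleeve": "sleeve",
        "hood": "hood",
        "body": "body",
    },
    "folded": {"left_sleeve"},
}
```

The same two concept definitions apply to both garments without modification, whereas a grounded description would have to enumerate the entities of each garment separately.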

Concept Pursuit

Considering every combination over the full bank of concepts would be computationally intractable. A Bayesian treatment of concept induction can provide a principled way to incorporate the prior of Occam’s razor into probabilistic grammars. Let us denote the selected concepts with , where coincides with Eq.7; the posterior of the utility function given demonstrated plans would be


To obtain the MAP estimate efficiently, we adopt stepwise greedy search over :


Specifically, concepts in are first sorted by their complexity, reflecting the prior . Note that concepts are mutually exclusive if they share the same and differ only in or , so they are stored in the same slot in the sorted list. The levels of complexity are naturally discretized by the number of primitive fluents or predicates involved. We start from the simplest level. At each level, the concept that brings the largest margin is added to in a step-wise greedy manner. The selection terminates at the current level when the marginal benefit in the posterior is below a threshold. Then we move on to the next level. Since some complex concepts may have information overlap with simpler ones, e.g. vs , the simpler ones are removed from when the more complex ones are added. The first term in Eq.19 is equivalent to the MLE in Eq.11 and can be solved with MEIP. Therefore the complete algorithm is nested.

This derivation leads us to a boosting method Friedman (2001). A similar greedy search strategy was proposed for feature selection in IRL by Bagnell et al. (2007). But different from them, we further assume the increase from is always more significant than the former term, such that we only need to consider a subset of with the lowest complexity at a time.
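The level-wise greedy search can be sketched as follows. The callable `log_posterior` is a hypothetical stand-in for fitting the inner model on the selected concepts and adding the log prior; the toy scoring function, the concept names and the threshold are assumptions. The step that removes simpler concepts overlapping with newly added complex ones is omitted for brevity.

```python
def pursue_concepts(levels, log_posterior, threshold=1e-3):
    """Greedy forward selection, one complexity level at a time.

    levels: lists of candidate concepts, ordered from simple to complex.
    log_posterior: callable mapping a list of selected concepts to a score.
    """
    selected = []
    best = log_posterior(selected)
    for level in levels:
        candidates = list(level)
        while candidates:
            gains = [
                (log_posterior(selected + [c]) - best, c) for c in candidates
            ]
            gain, concept = max(gains, key=lambda item: item[0])
            if gain < threshold:
                break  # marginal benefit too small, go to the next level
            selected.append(concept)
            candidates.remove(concept)
            best += gain
    return selected


USEFULNESS = {"a": 1.0, "c": 0.5}  # hypothetical likelihood gain per concept


def toy_log_posterior(concepts):
    """Toy score: summed usefulness minus a complexity penalty per concept."""
    return sum(USEFULNESS.get(c, 0.0) for c in concepts) - 0.1 * len(concepts)


chosen = pursue_concepts([["a", "b"], ["c"]], toy_log_posterior)
```

In the toy run, concept `a` is added at the first level, `b` is rejected because its gain is negative, and `c` is added at the second level.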


Evaluation Protocol

To help readers better understand the generalization problem studied in this work, we provide a systematic introduction to our evaluation protocol. Note that even though it is the learned utility that is to be evaluated, it cannot be evaluated without planned behaviors. This is because the optimal utility cannot be trivially defined for most tasks where Generalized Inverse Planning is meaningful, an issue that originally motivated the proposal of IRL (Abbeel and Ng, 2004). The protocol consists of the following steps:

  1. The agent first learns the utility in the environment where the demos come from;

  2. We then transfer this agent to another environment, which is built under the same schema and thus belongs to the same planning task. This new environment is a result of probability shift or structural change or both;

  3. Given the symbolic world model of the new environment, the agent optimizes for a policy that maximizes the reward function with model-based methods such as MCTS;

  4. We test whether the behavior of the agent in this new environment is consistent with the demo in terms of temporal relations.

We adopt the following two sets of evaluation metrics, depending on the diversity of the desired behavior:

  1. For simple tasks where the ground-truth behavior is one single sequence, such as the didactic example where is the only desired sequence, we evaluate the learned utility by a Monte Carlo estimate of the probability that the desired sequence is executed by an MCTS agent’s planned behavior in the new environment.

  2. Most of the time, the ground-truth optimal utility is not clear to us, especially its numerical value. But we can extract the ground-truth concepts and their ordering from the demos. We evaluate the learned utility with the objective of IRL or inverse planning: the matching in the statistics. Specifically, given the planned behaviors and the demonstrated ones, we measure their mean matching to check if the learned utility can attain a similar behavior from the perspective of IRL, and we measure their Kendall τ (ordinal matching) to check if the learned utility can attain a similar behavior from the perspective of inverse planning.
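The first metric can be sketched as a simple Monte Carlo estimate. Here `noisy_planner` is a hypothetical stand-in for an MCTS agent planning in the new environment, and the state names, the success rate of 0.8 and the sample size are assumptions.

```python
import random


def estimate_sequence_probability(sample_plan, desired, num_samples, rng):
    """Fraction of sampled plans that equal the desired sequence exactly."""
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    hits = sum(1 for _ in range(num_samples) if sample_plan(rng) == desired)
    return hits / num_samples


def noisy_planner(rng):
    """Hypothetical agent: follows the desired sequence 80% of the time."""
    if rng.random() < 0.8:
        return ["s0", "s1", "goal"]
    return ["s0", "bad"]


estimate = estimate_sequence_probability(
    noisy_planner, ["s0", "s1", "goal"], num_samples=2000, rng=random.Random(0)
)
```

The estimate approaches the true probability as the number of sampled plans grows; an agent that always produces the desired sequence scores exactly 1.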

Experiment 1: Probability Shift

We first conduct an experiment with the didactic example from Littman et al. (2017) introduced above to illustrate that utility learned with MEIP can generalize regardless of probability shift. Here probability shift refers to a change in a distribution in . The desired behavior is ‘do not visit a bad state until you reach the goal’. We recommend that readers review Fig.1a for the problem setup. When learning the utility, the agent is provided with demonstrations collected from the MDP with . It is also provided with a black-box model to simulate this MDP during MCTS. The learned utility is then tested in another MDP, with . In both environments, we evaluate agents’ behaviors by estimating the probability of the desired sequence . The result is shown in Fig.2: although both agents with MEIP and MaxEnt-IRL behave perfectly with , only the agent with MEIP still performs perfectly with . The MaxEnt-IRL agent discards temporal relations in utility and only matches mean statistics , and thus prefers in the first transition after a probability shift.

We also conducted an empirical study to explore the boundary of generalization of both MEIP and MaxEnt-IRL. If , the difference between MEIP and MaxEnt-IRL would be insignificant, since the reward learned by MaxEnt-IRL can also discourage agents from taking . We also notice that if , which induces an extremely high probability for agents to move to after taking , neither MEIP nor MaxEnt-IRL learns meaningful utility.
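The evaluation in this experiment is a Monte Carlo estimate of the probability of ‘reaching the goal without hitting a bad state’. A minimal sketch of such an estimator is given below. The toy sampler is purely hypothetical and does not reproduce the MDP of Fig.1a; in the actual protocol the trajectories would come from an MCTS agent planning with the learned utility. All names (`estimate_success_probability`, `toy_sampler`, `p_slip`) are introduced for illustration only.

```python
import random


def reaches_goal_without_bad(trajectory):
    """True if 'goal' appears before any 'bad' state in the trajectory."""
    for state in trajectory:
        if state == "bad":
            return False
        if state == "goal":
            return True
    return False


def estimate_success_probability(sample_trajectory, is_desired, n_samples=10000, seed=0):
    """Monte Carlo estimate of P(is_desired(trajectory)) under a trajectory sampler."""
    rng = random.Random(seed)
    hits = sum(is_desired(sample_trajectory(rng)) for _ in range(n_samples))
    return hits / n_samples


def toy_sampler(p_slip):
    """Hypothetical stand-in for an agent: one risky step, then the goal."""
    def sample(rng):
        middle = "bad" if rng.random() < p_slip else "safe"
        return ["start", middle, "goal"]
    return sample
```

For example, `estimate_success_probability(toy_sampler(0.3), reaches_goal_without_bad)` should return a value close to 0.7.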

Figure 2: The probability of ‘reaching the goal state without hitting a bad state’ in the training environment () and the testing () for agents whose utility is learned with MEIP and MaxEnt-IRL respectively. is the optimal.

Experiment 2: Structural Change

Structural change happens when the utility is transferred between two environments whose underlying PGMs for the transition have the same fluent nodes but different causal structures or different numbers of nodes. Note that causal structures of environments always implicitly enforce ordering in demonstration sequences. If the ordinal information in these sequences only reflects causality, there should be no difference between MEIP and IRL. However, we humans are cultural creatures. There are many things we do in a particular order not because it is the only feasible way, but because of social conventions, e.g. the traditional order of a wedding ceremony, the stroke order in handwriting, etc. We acquire social utility by following these conventions. It is in these situations that MEIP differs from IRL.

Figure 3: There are 3 stages with abundant preps. Alex was shown demos by the chief who followed a prescribed order to fetch different numbers of torch, bamboo, clay.

Consider the scenario in Fig.3 that hypothetically took place in early history, when utterance was not at all easy for our ancestors. Alex was a newcomer to a tribe. He was invited to a ritual hosted by the chief of the tribe. The ritual went like this: There were 3 stages in total. In stage 1, Alex saw the chief take out all torches . After stage 2, Alex was shown 1 or 3 bamboos . Eventually, the chief fetched 4 pieces of clay from stage 3. As a member of this tribe, Alex needs to understand the process of this ceremony. The metric to test his understanding is how he would imagine himself hosting it, presuming some environmental changes.

Figure 4: (a) Ordinal and mean matching when agents are given world models that do not enforce the demonstrated temporal ordering. (b) Results after an extra change in entity quantity. These results are the average of trajectory samples from convergences of MCTS with the learned utility.

The first evaluation metric, mean matching, tests whether the learned utility is associated with the correct set of concepts and thus can be transferred to another environment with different quantities of objects. In the demonstrated setting, there are 5 torches , 5 bamboos and 5 pieces of clay in each stage. After the structural change, there are 6 of each object. The ground-truth concepts of these demonstrations are , , . In plain English, the learner needs to fetch all torches from stage 1, any number of bamboos from stage 2 and 4 pieces of clay from stage 3. We estimate the mean statistics of matching these concepts in sequences planned with utility learned by MEIP and MaxEnt-IRL. Both agents successfully discover the correct set of concepts with the help of our concept language (Fig.4(a)). These concepts empower them to generalize learned utility to environments with different quantities of entities (Fig.4(b)).

The second metric is for ordinal relations in sequences. We estimate the Kendall's τ of planned sequences with the ground-truth ordering . To test this generalization, the structural change needs to alter causality. This alteration is done by controlling the world model provided to agents. When the transition in the world model directly enforces the required ordering, both agents have when planning with this world model. However, after changing to a transition without a causality constraint on ordering, namely one where agents can choose to enter the stages in any order, only the MEIP agent can attain the desired ordering () with the learned utility. We also tried a controlled setting where we provide a world model that does not have causality constraints to both agents. The contrast between the planned behaviors of the MEIP agent and the MaxEnt-IRL agent remains the same. These results show that MEIP can learn not only correct concepts but also the desired temporal order. MaxEnt-IRL, on the other hand, fails to capture the ordinal information. As illustrated in Fig.4, mean matching does not enforce ordinal utility learning.

Figure 5: Qualitative results of transferring the learned utility to folding different clothes. Note that the shape of polygons and the number of edge segments in the testing clothes are substantially different from the one in demos.

Experiment 3: Learning to Fold

Another task that reflects the ubiquity of this social utility in our daily life is the running example from the introduction, cloth folding. If the utility were only associated with the final state, in which the cloth becomes a square whose area is within a range, there would not be those exemplar sequences you recall when you hear of this task. We conduct an experiment to learn the utility of cloth folding in a visually and geometrically authentic simulator.

To represent the cloth to be folded under our formulation, we adopt a grammar, Spatial And-Or-Graph (S-AOG) (Zhu and Mumford, 2007), for the image input. Details of this grammar, as well as other technical details, can be found in the supplementary. Given the visual input of a cloth, the agent should parse it with this grammar to acquire a first-order representation. For the sake of simplicity, we ignore the uncertainty from perception. We also endow the agent with an action model, assuming it can imagine geometric transformations like a toddler. As such, the agent has sufficient knowledge of each problem instance to perform rollouts with MCTS. In our experiments, we collect 15 “good folds” from human demonstrators. When presented with these folding sequences, the agent with MEIP learns the utility and concepts that successfully generalize to unseen clothes, see Fig.5. Even though all shirts (or sweaters) look similar in appearance, their underlying structures are significantly different in the number, location, and orientation of edges and vertices, see Fig.6.

Figure 6: Nodes of the underlying structures, illustrated as edges and vertices of different clothes. They differ significantly across clothes.

Concluding Remarks

In this work we propose a new quest for learning generalizable task representation, especially its utility, from demonstrations. This problem lies outside the regime of inverse reinforcement learning and is thus dubbed Generalized Inverse Planning. We then outline the computational principles in the cognitive process Marr and Poggio (1976) of lifted non-Markovian utility learning, which we model as Maximum Entropy Inverse Planning (MEIP). Compared to existing inverse reinforcement learning methods, our model learns a task representation that generalizes regardless of probability shift and structural change in the environment. To highlight this contribution, we exclude irrelevant representation learning by adopting classical assumptions and representations in planning, i.e. grounded semantics of entities and relations are given a priori, as well as an action model over them. This kind of assumption was also made in original works on IRL Abbeel and Ng (2004), Ratliff et al. (2006), Ziebart et al. (2008), Munzer et al. (2015). As a caveat, we are aware that such assumptions on priors might be regarded as too strong nowadays, since recent progress in deep reinforcement learning successfully induces them with a weaker assumption of relational inductive biases Zambaldi et al. (2019). Our model is general enough to be extended to those neural networks. It would be interesting future work to investigate their synergy.


The authors thank Prof. Ying Nian Wu and Dr. Yixin Zhu of UCLA Statistics Department, Prof. Guy Van den Broeck of UCLA Computer Science Department for useful discussions. The work reported herein was supported by ONR MURI grant N00014-16-1-2007, ONR N00014-19-1-2153, and DARPA XAI N66001-17-2-4029.


  • P. Abbeel and A. Y. Ng (2004) Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st international conference on Machine learning, pp. 1. Cited by: Introduction, Inverse Reinforcement Learning, Inverse Reinforcement Learning, Evaluation Protocol, Concluding Remarks.
  • R. Amit and M. Matarić (2002) Learning movement sequences from demonstration. In Proceedings 2nd International Conference on Development and Learning. ICDL 2002, pp. 203–208. Cited by: Inverse Reinforcement Learning.
  • C. G. Atkeson and S. Schaal (1997) Robot learning from demonstration. In ICML, Vol. 97, pp. 12–20. Cited by: Inverse Reinforcement Learning.
  • J. Bagnell, J. Chestnutt, D. M. Bradley, and N. D. Ratliff (2007) Boosting structured prediction for imitation learning. In Advances in Neural Information Processing Systems, pp. 1153–1160. Cited by: Concept Pursuit.
  • C. L. Baker, R. Saxe, and J. B. Tenenbaum (2009) Action understanding as inverse planning. Cognition 113 (3), pp. 329–349. Cited by: Introduction.
  • B. Bonet and H. Geffner (2018) Features, projections, and representation change for generalized planning. arXiv preprint arXiv:1801.10055. Cited by: footnote 1.
  • A. Boularias, J. Kober, and J. Peters (2011) Relative entropy inverse reinforcement learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 182–189. Cited by: Inverse Reinforcement Learning.
  • F. M. Donini, M. Lenzerini, D. Nardi, and W. Nutt (1997) The complexity of concept languages. Information and Computation 134 (1), pp. 1–58. Cited by: Concept Language.
  • S. Džeroski, L. De Raedt, and K. Driessens (2001) Relational reinforcement learning. Machine learning 43 (1-2), pp. 7–52. Cited by: Structured Utility Function.
  • R. E. Fikes and N. J. Nilsson (1971) Strips: a new approach to the application of theorem proving to problem solving. Artificial Intelligence 2 (3), pp. 189 – 208. Cited by: Generalized Inverse Planning.
  • C. Finn, P. Christiano, P. Abbeel, and S. Levine (2016) A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852. Cited by: Utility Learning for Order Preserving.
  • J. H. Friedman (2001) Greedy function approximation: a gradient boosting machine. Annals of statistics, pp. 1189–1232. Cited by: Concept Pursuit.
  • C. R. Garrett, L. P. Kaelbling, and T. Lozano-Pérez (2016) Learning to rank for synthesizing planning heuristics. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 3089–3095. Cited by: Inverse Reinforcement Learning.
  • U. Grenander (1993) General pattern theory: a mathematical study of regular structures oxford mathematical monographs. Oxford University Press: Clarendon. Cited by: Inverse Reinforcement Learning.
  • G. M. Hayes and J. Demiris (1994) A robot controller using learning by imitation. University of Edinburgh, Department of Artificial Intelligence. Cited by: Inverse Reinforcement Learning.
  • J. Ho and S. Ermon (2016) Generative adversarial imitation learning. In Advances in neural information processing systems, pp. 4565–4573. Cited by: Utility Learning for Order Preserving.
  • R. T. Icarte, T. Klassen, R. Valenzano, and S. McIlraith (2018) Using reward machines for high-level task specification and decomposition in reinforcement learning. In International Conference on Machine Learning, pp. 2107–2116. Cited by: Structured Utility Function.
  • E. T. Jaynes (1957) Information theory and statistical mechanics. Physical review 106 (4), pp. 620. Cited by: Descriptive Model and Maximum Entropy, Utility Learning for Order Preserving.
  • T. Joachims (2006) Training linear svms in linear time. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 217–226. Cited by: Utility Learning for Order Preserving.
  • L. P. Kaelbling, M. L. Littman, and A. R. Cassandra (1998) Planning and acting in partially observable stochastic domains. Artificial intelligence 101 (1-2), pp. 99–134. Cited by: Generalized Inverse Planning.
  • K. Kersting and K. Driessens (2008) Non-parametric policy gradients: a unified treatment of propositional and relational domains. In Proceedings of the 25th international conference on Machine learning, pp. 456–463. Cited by: Structured Utility Function.
  • R. Khardon (1999) Learning action strategies for planning domains. Artificial Intelligence 113 (1-2), pp. 125–148. Cited by: Generalized Inverse Planning.
  • Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang (2006) A tutorial on energy-based learning. Predicting structured data 1 (0). Cited by: Descriptive Model and Maximum Entropy.
  • X. Li, C. Vasile, and C. Belta (2017) Reinforcement learning with temporal logic rewards. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3834–3839. Cited by: Structured Utility Function.
  • M. L. Littman, U. Topcu, J. Fu, C. Isbell, M. Wen, and J. MacGlashan (2017) Environment-independent task specifications via gltl. arXiv preprint arXiv:1704.04341. Cited by: Introduction, Structured Utility Function, Experiment 1: Probability Shift.
  • T. Liu (2011) Learning to rank for information retrieval. Springer Science & Business Media. Cited by: Utility Learning for Order Preserving.
  • D. Marr and T. Poggio (1976) From understanding computation to understanding neural circuitry. Cited by: Concluding Remarks.
  • D. Marr (1982) Vision: a computational investigation into the human representation and processing of visual information. Henry Holt and Co., Inc., USA. External Links: ISBN 0716715678 Cited by: Representation for Visual Input of Clothes.
  • M. Martin and H. Geffner (2004) Learning generalized policies from planning examples using concept languages. Applied Intelligence 20 (1), pp. 9–19. Cited by: Generalized Inverse Planning.
  • T. Munzer, B. Piot, M. Geist, O. Pietquin, and M. Lopes (2015) Inverse reinforcement learning in relational domains. In Twenty-Fourth International Joint Conference on Artificial Intelligence, Cited by: Structured Utility Function, Concluding Remarks.
  • S. Natarajan, S. Joshi, P. Tadepalli, K. Kersting, and J. Shavlik (2011) Imitation learning in relational domains: a functional-gradient boosting approach. In Twenty-Second International Joint Conference on Artificial Intelligence, Cited by: Generalized Inverse Planning.
  • A. Y. Ng, S. J. Russell, et al. (2000) Algorithms for inverse reinforcement learning. In ICML, Vol. 1, pp. 2. Cited by: Inverse Reinforcement Learning.
  • M. Nitzberg and D. B. Mumford (1990) The 2.1-d sketch. IEEE Computer Society Press. Cited by: Representation for Visual Input of Clothes.
  • P. A. Ortega and D. A. Braun (2013) Thermodynamics as a theory of decision-making with information-processing costs. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 469 (2153), pp. 20120683. Cited by: Inverse Reinforcement Learning.
  • N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich (2006) Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pp. 729–736. Cited by: Inverse Reinforcement Learning, Concluding Remarks.
  • T. Silver, K. R. Allen, A. K. Lew, L. P. Kaelbling, and J. Tenenbaum (2020) Few-shot bayesian imitation learning with logical program policies. In AAAI, pp. 10251–10258. Cited by: Generalized Inverse Planning.
  • E. S. Spelke and K. D. Kinzler (2007) Core knowledge. Developmental science 10 (1), pp. 89–96. Cited by: Representation for Visual Input of Clothes.
  • S. Srivastava, N. Immerman, and S. Zilberstein (2011) A new representation and associated algorithms for generalized planning. Artificial Intelligence 175 (2), pp. 615–647. Cited by: Generalized Inverse Planning.
  • R. S. Sutton and A. G. Barto (1998) Reinforcement learning: an introduction. MIT press. Cited by: Introduction.
  • R. S. Sutton, D. Precup, and S. Singh (1999) Between mdps and semi-mdps: a framework for temporal abstraction in reinforcement learning. Artificial intelligence 112 (1-2), pp. 181–211. Cited by: Generalized Inverse Planning.
  • M. Vazquez-Chanlatte, S. Jha, A. Tiwari, M. K. Ho, and S. Seshia (2018) Learning task specifications from demonstrations. In Advances in Neural Information Processing Systems, pp. 5367–5377. Cited by: Structured Utility Function.
  • Y. N. Wu, R. Gao, T. Han, and S. Zhu (2019) A tale of three probabilistic families: discriminative, descriptive, and generative models. Quarterly of Applied Mathematics 77 (2), pp. 423–465. Cited by: Descriptive Model and Maximum Entropy.
  • Y. Xu, A. Fern, and S. Yoon (2009) Learning linear ranking functions for beam search with application to planning.. Journal of Machine Learning Research 10 (7). Cited by: Inverse Reinforcement Learning.
  • V. Zambaldi, D. Raposo, A. Santoro, V. Bapst, Y. Li, I. Babuschkin, K. Tuyls, D. Reichert, T. Lillicrap, E. Lockhart, et al. (2019) Deep reinforcement learning with relational inductive biases. In International Conference on Learning Representations, Cited by: Concluding Remarks.
  • S. Zhu and D. Mumford (2007) A stochastic grammar of images. Now Publishers Inc. Cited by: Representation for Visual Input of Clothes, Experiment 3: Learning to Fold.
  • B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey (2008) Maximum entropy inverse reinforcement learning. In AAAI, Vol. 8, pp. 1433–1438. Cited by: 2nd item, Inverse Reinforcement Learning, Concluding Remarks.

Algorithmic Details

Maximum Entropy Inverse Planning (MEIP)

For the purpose of generalization, the ranking function of utility is designed to be a piece-wise linear function. When , the form of the function is:

Cases where has higher dimensions can be derived accordingly. To ensure continuity, we further add a constraint on the connection between two consecutive bins, i.e. .
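To make the continuity constraint concrete, the sketch below evaluates a one-dimensional piece-wise linear function whose pieces meet at shared bin boundaries. It is an illustration under assumptions, not the parameterization used by MEIP: the function is stored by its values at the bin boundaries (`knots`, `values`, both hypothetical names), so that two consecutive bins agree at their common boundary by construction.

```python
from bisect import bisect_right


def piecewise_linear(x, knots, values):
    """Evaluate a continuous piece-wise linear function at x.

    knots  : increasing bin boundaries, e.g. [0.0, 1.0, 2.0]
    values : function value at each boundary (same length as knots)
    Inputs outside the range are clamped to the nearest boundary.
    """
    if len(knots) != len(values) or len(knots) < 2:
        raise ValueError("need at least two knots and one value per knot")
    x = min(max(x, knots[0]), knots[-1])
    # Index of the bin [knots[k], knots[k + 1]] that contains x.
    k = min(bisect_right(knots, x) - 1, len(knots) - 2)
    x0, x1 = knots[k], knots[k + 1]
    y0, y1 = values[k], values[k + 1]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
```

Storing boundary values rather than a slope and intercept per bin is one way to satisfy the constraint without adding it explicitly to the optimization.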

The training process of MEIP follows the principle of analysis by synthesis. According to Eq.11 there are two parts to be optimized, one for demonstrations and one for sampled plans . Specifically, the learning is implemented with a Ranking SVM, see Eq.12. Ordinal relations between pairs are specified in the main text after Eq.11. For a pair , we set for the Ranking SVM.
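The pairwise objective behind a Ranking SVM can be sketched as below. This is a generic illustration with a linear utility over feature vectors, a unit margin and plain subgradient descent; it does not reproduce Eq.12, and the names `pairwise_hinge_loss`, `train_ranking_svm`, `margin` and `reg` are assumptions for the example. Each pair is given as (features of the preferred item, features of the other item).

```python
def score(w, x):
    """Linear utility of feature vector x under weights w."""
    return sum(wi * xi for wi, xi in zip(w, x))


def pairwise_hinge_loss(w, pairs, margin=1.0, reg=0.01):
    """Regularized hinge loss over (preferred, other) feature pairs."""
    loss = 0.5 * reg * sum(wi * wi for wi in w)
    for better, worse in pairs:
        loss += max(0.0, margin - (score(w, better) - score(w, worse)))
    return loss


def train_ranking_svm(pairs, dim, lr=0.1, epochs=200, margin=1.0, reg=0.01):
    """Subgradient descent on the pairwise hinge loss."""
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [reg * wi for wi in w]
        for better, worse in pairs:
            if score(w, better) - score(w, worse) < margin:
                # Violated pair: push the preferred item's score up.
                for k in range(dim):
                    grad[k] -= better[k] - worse[k]
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w
```

In the analysis-by-synthesis loop described above, the pairs would be rebuilt after each round of sampling, with demonstrations and sampled plans supplying the two sides.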

The sampling process is based on Monte Carlo Tree Search (MCTS). First, we initialize the value function of the MCTS. We keep performing rollouts to compute for trajectories until the value functions in MCTS converge. After that, we sample a small number of trajectories according to the values on the branches of the Monte Carlo Tree. These trajectories, together with the human demonstrations , are used to update the utility function.
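One plausible reading of sampling trajectories according to the values on the branches is to walk down the converged tree and pick each child with probability proportional to the exponential of its value. The sketch below implements that reading only; the softmax rule, the `temperature` parameter and the `Node` structure are assumptions, not details taken from the method.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    action: str
    value: float = 0.0
    children: list = field(default_factory=list)


def sample_trajectory(root, rng, temperature=1.0):
    """Sample one root-to-leaf action sequence, softmax-weighted by child values."""
    trajectory, node = [], root
    while node.children:
        weights = [math.exp(child.value / temperature) for child in node.children]
        node = rng.choices(node.children, weights=weights, k=1)[0]
        trajectory.append(node.action)
    return trajectory


def sample_trajectories(root, n, seed=0):
    """Draw a small batch of trajectories from a converged tree."""
    rng = random.Random(seed)
    return [sample_trajectory(root, rng) for _ in range(n)]
```

The batch returned by `sample_trajectories` would then be paired with the demonstrations in the utility update.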

Concept Pursuit

As mentioned in the main text, we generate first-order concepts from a concept language. However, the size of this language could be infinitely large. Thus we adopt the principle of Occam’s razor. We describe the algorithm for concept pursuit in Algorithm 2.

There are three main steps in the pursuit process. First, we sort the concepts according to their complexity. The measurement of complexity is based on the number of predicates (or functions) in a concept. The more predicates (functions) a concept contains, the higher its complexity. This discrete nature introduces levels in the complexity.

Input: Sorted Concept Set ; Threshold
Data: The Set of State Pairs :
Result: Selected Concept Set ; Optimized Parameters
Init: Selected Concepts Set
foreach concept subset  do
       foreach first-order concept  do
             violation count
             foreach state pair  do
                   if   then
                   end if
             end foreach
            if  then
                   remove from
             end if
       end foreach
end foreach
foreach concept subset  do
       while True do
             foreach  do
                   compute by calling algorithm 1 with
                   add to
             end foreach
            if  then
             end if
             end if
       end while
end foreach
Algorithm 2 Concept Pursuit

The second step is to exclude concepts that are not relevant at all. We can select those relevant concepts by simply counting the number of violations in the state pairs. The last and most important step is to select concepts according to Eq. 19 for the utility we learn. Each concept is added to the selected set in a step-wise greedy manner. Note that concepts that share the same predicate (or function) and the same domain but different quantifiers are mutually exclusive.
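The second and third steps can be sketched as follows. This is a simplified illustration under stated assumptions: `violates` and `score` are hypothetical callables standing in for the violation test on a state pair and for the selection criterion of Eq. 19 (which in the method involves re-fitting the utility), the direction of the threshold test is assumed, and the level-wise loop over complexity and the mutual-exclusion rule between quantifiers are omitted.

```python
def prune_by_violations(concepts, state_pairs, violates, threshold):
    """Keep concepts whose violation count over the state pairs is within the threshold."""
    kept = []
    for concept in concepts:
        count = sum(1 for pair in state_pairs if violates(concept, pair))
        if count <= threshold:
            kept.append(concept)
    return kept


def greedy_pursuit(candidates, score, min_gain=1e-6):
    """Step-wise greedy selection: add the concept with the largest score gain."""
    selected = []
    best = score(selected)
    remaining = list(candidates)
    while remaining:
        gains = [(score(selected + [c]) - best, c) for c in remaining]
        gain, choice = max(gains, key=lambda g: g[0])
        if gain <= min_gain:
            break  # no remaining concept improves the criterion
        selected.append(choice)
        remaining.remove(choice)
        best += gain
    return selected
```

Passing the candidates in order of increasing complexity makes ties resolve in favor of simpler concepts, in the spirit of Occam's razor.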

Details for Experiment 2

Concept Space in Ritual Learning

In the ritual learning experiment, we only consider the concepts that are generated by the rule in which is the quantifier, is the predicate, and is the domain. Here the domain in which is the set of object types and is the set of stages. A primitive concept in this experiment can be expressed as:

The primitive predicate, picked, describes whether the host carries some type of object from a specific stage. As introduced in the main text, primitive concepts represented by atomic formulas with the primitive predicate are terms for formulas of more complex concepts. Here these more complex concepts are encodings of the interrelation between primitive concepts.
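As a rough illustration of this concept space, the sketch below enumerates primitive concepts as combinations of a quantifier, the predicate picked, an object type and a stage, and checks them against the objects picked in a sequence. The concrete quantifier set (`"all"`, `"any"`, and exact counts), the `Concept` tuple and the `holds` function are assumptions made for the example rather than the grammar of the concept language.

```python
from itertools import product
from typing import NamedTuple, Union


class Concept(NamedTuple):
    quantifier: Union[str, int]  # "all", "any", or an exact count
    predicate: str
    obj_type: str
    stage: int


def enumerate_primitive_concepts(quantifiers, obj_types, stages, predicate="picked"):
    """All quantifier x object type x stage combinations for one predicate."""
    return [Concept(q, predicate, o, s) for q, o, s in product(quantifiers, obj_types, stages)]


def holds(concept, picked_counts, available):
    """Check a concept against picked counts keyed by (object type, stage)."""
    key = (concept.obj_type, concept.stage)
    n = picked_counts.get(key, 0)
    if concept.quantifier == "all":
        return n == available[key]
    if concept.quantifier == "any":
        return n > 0
    return n == concept.quantifier
```

Because "all" is evaluated against what is available rather than a fixed number, the same concept still holds when the quantity of objects per stage changes, whereas an exact-count concept does not.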

Experimental Details

The Environment is designed to be an analogy to a ritual. There are different stages in the environment. Both the agent and the demonstrator are required to choose a certain stage before they can advance to pick up objects in it. After locking down a specific stage, one will be asked to choose one type of object and pick up the of them ( means all). None of the stages can be visited more than once. The ritual will be terminated after all stages are visited.

Human Demonstrations consist of at least 3-5 sequences. A sequence consists of an ordered description of the objects that the demonstrator obtains at each stage. The demonstrator can only choose to pick one type of object at one stage, without limitation on the quantity. Examples of demonstrations are or . A longer demo is possible if it is legal in the environment, although each demo has only 3 stages in our experiment. Note that all demos must contain exactly the same set of specific concepts.


Hyperparameters of this experiment are as follows. MCTS convergence condition: terminate after 3000 iterations. Number of sampled trajectories : 5. Upper confidence bound coefficient: 1.

Details for Experiment 3

Representation for Visual Input of Clothes

We adopt a stochastic grammar, Spatial And-Or-Graph (S-AOG) Zhu and Mumford (2007), for the visual input of all clothes. The design of this grammar follows Gestalt Laws in vision Marr (1982). Here we informally summarize its production rules. A cloth is an And node that produces a set of polygons. The number of polygons may change after being folded. Since some polygons may be occluded in the visual input, we adopt a 2.1D representation Nitzberg and Mumford (1990). The 2.1D representation is a layered representation. In our case, the order of the layers is consistent with the folding order. All polygons belong to the same class with a template set of fluents. They produce a set of line segments, edges. Different configurations of edges in one polygon constitute an Or node. All edges also belong to the same class. Each edge is associated with two vertices as its attributes. Each vertex is specified by its coordinates. Classes introduced above are regarded as domains in the concept language.

The full fluent set for this grammar is designed following axioms in Euclidean geometry, as shown in Table 1. Edges, vertices, surfaces and their relations are also believed to be part of our core knowledge developed at an early age Spelke and Kinzler (2007). Fluents for edges are categorized into functions/relations between edges, e.g. parallel, and functions/relations between one edge and one vertex of another edge, e.g. distance. Other fluents, such as Logo and Neck, are visual features of polygons. With these classes and fluents, we can generate concepts from the concept language.

Figure 7: A minimal parse graph of the S-AOG

During planning, the visual input of the cloth from each situation is parsed into a parse graph () of the S-AOG. As shown in Figure 7, is a hierarchical representation in which the terminal node is an edge with two vertices and non-terminal nodes are polygons. Clothes with the simplest structures, such as a shirt, are initially parsed into three polygons and their own affiliated edges.

Name                Arity    Type
Edge Length         Unary    function
Logo                Unary    function
Neck                Unary    function
Vt2Vt Distance      Binary   function
Vt2Edge Distance    Binary   function
Parallel            Binary   predicate
Perpendicular       Binary   predicate
Vertex on Edge      Binary   predicate
Edge on Edge        Binary   predicate
Vertex in Polygon   Binary   predicate
Table 1: Fluent space for learning to fold
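Several of the geometric fluents in Table 1 can be computed directly from vertex coordinates. The sketch below is an illustration only: it represents an edge as a pair of 2D points, uses an absolute tolerance `tol`, and takes the vertex-to-edge distance to be the distance to the closest point on the segment. All of these choices, and the function names, are assumptions rather than details of the simulator.

```python
import math


def edge_length(edge):
    """Unary function: Euclidean length of an edge ((x0, y0), (x1, y1))."""
    p, q = edge
    return math.dist(p, q)


def _direction(edge):
    (px, py), (qx, qy) = edge
    return qx - px, qy - py


def parallel(e1, e2, tol=1e-9):
    """Binary predicate: direction vectors have zero cross product."""
    (ax, ay), (bx, by) = _direction(e1), _direction(e2)
    return abs(ax * by - ay * bx) <= tol


def perpendicular(e1, e2, tol=1e-9):
    """Binary predicate: direction vectors have zero dot product."""
    (ax, ay), (bx, by) = _direction(e1), _direction(e2)
    return abs(ax * bx + ay * by) <= tol


def vt2edge_distance(vertex, edge):
    """Binary function: distance from a vertex to the closest point on a segment."""
    (px, py), (qx, qy) = edge
    dx, dy = qx - px, qy - py
    seg2 = dx * dx + dy * dy
    if seg2 == 0:
        return math.dist(vertex, (px, py))
    t = ((vertex[0] - px) * dx + (vertex[1] - py) * dy) / seg2
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.dist(vertex, (px + t * dx, py + t * dy))


def vertex_on_edge(vertex, edge, tol=1e-9):
    """Binary predicate: the vertex lies on the segment, up to tolerance."""
    return vt2edge_distance(vertex, edge) <= tol
```

Because these fluents depend only on relative geometry, concepts built from them can be evaluated on clothes whose polygons differ in shape and number of edges.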

Experimental Details

Figure 8: The action is a folding line in the simulator

The Environment is a simulator for folding. The demonstrator is asked to draw a folding line that splits a polygon into two new polygons. In Figure 8, we show an example of a legal folding line in the simulator. The polygon is separated by the folding line, and the smaller part is flipped to the back of the larger one after each fold.

Human Demonstrations are given by folding a demo shirt (or sweater). In a demo, states of the shirt are recorded and serialized as a sequence. A fold is not reversible, so the demonstrator needs to consider the final state at every step. If the demonstrator makes a “bad fold”, it can have a significant impact on the final state. Unlike the previous experiment, the demonstrator does not have a prescribed concept set during the demo process. Instead, the demonstrator needs to conduct a folding sequence that leads to good final states, which meet their own criteria of “good folds” based on default sequences in their real-life habits. We collected 15 sequences as demonstrations.

Figure 9: (a) The discretized action space for folds. (b) Samples of folds with high probabilities.

The Action Space of a folding line is discretized to make MCTS applicable. It is defined as a tuple and each parameter is discretized accordingly. As illustrated in Figure 9(a), is the coordinate of a point on the grid. The grid is a discretization of the bounding box of a shirt. is the radius of the folding line and is the angle. Note that some folding lines may be redundant, so we need to check the uniqueness of each folding line.

Even though we have discretized the action space, it is still too large for MCTS with limited computational resources. Thus, it is necessary to restrict MCTS to a reasonable number of legal folds. The solution is to learn a probability distribution over the action space. Folds that are similar to demo folds are associated with high probabilities. We assume that each parameter in follows a normal distribution around some exemplars. See