Introduction
Humans learn underlying utility by observing others’ behaviors. It is widely accepted that humans have a Theory of Mind (ToM): we treat others as boundedly rational agents and inversely solve for their utility to understand their planned behaviors Baker et al. (2009). The inferred utility is associated with some concepts, which together specify the task. In a similar context, this utility can then generalize and incentivize us to act similarly. In this work, we study a formal definition of such generalization and a proper machinery to learn such utility.
The utility we want to study is different from the reward function in classical reinforcement learning. In their seminal book, Sutton and Barto (1998) distinguish planning from reinforcement learning as requiring some explicit deliberation using a world model. It is this deliberation that the utility we discuss here expects to capture.
Apart from these philosophical concerns, there are also computational issues in the setup of classical reinforcement learning. One of our key insights is that tasks specified with Markovian reward functions do not generalize over environments. Consider a didactic example from Littman et al. (2017), see Fig.1a. The desired behavior we want to specify is ‘maximizing the probability of reaching the goal without hitting a bad state’. It is equivalent to the temporal description ‘do not visit a bad state until you reach the goal’, which unfortunately cannot be represented with Markovian rewards independently of the environment. Concretely, let’s assume a discount of and a reward of +1 for reaching the goal. If , setting to the bad state encourages the desired behavior. But if , this reward needs to be . Even though this example may seem contrived, it captures the essence of the limitation of Markovian rewards. There are more natural examples in our daily life. Imagine you are folding your clothes after laundry. Normally, you fold the left and right sleeves before you fold the garment in half from top to bottom, see Fig.1b. Searching your memory, you will probably realize this has been the default order ever since you learned to fold clothes as a kid, no matter whether the one you fold is a T-shirt or a sweater, your all-time favorite or a brand-new one. Remarkably, the utility you learned on this temporally extended task exhibits robust generalization. In other words, we have the cognitive capability to learn a task representation with a utility function that is independent of the environment.
In fact, a slip in the transition probability is only the mildest kind of environmental shift. In the example of cloth folding, hoodies and T-shirts have different structures in the underlying probabilistic graphical models (PGM). Concretely, this is because they have different numbers of edges in their contours and thus different numbers of nodes in their object-oriented probabilistic graphs. Despite this difference, the utility learning mechanism ought to be tolerant of the heterogeneous nature of demonstrations. The learned utility should also help when folding a sweater after seeing these demonstrations. From a classicist’s perspective, this requirement goes beyond the formulation of RL, the Markov Decision Process (MDP), where the language that specifies the structure of the MDP (which is essentially a PGM) is constrained to be propositional. For readers who are not familiar with classical AI or computational linguistics, propositional logic can be understood as a language without object-orientation. Its object-oriented counterpart is first-order or relational logic. Intuitively, utility associated with a relation, e.g., “symmetric”, can generalize better than utility associated with a grounded description such as “the left and right sleeves are symmetric”. In classical AI, this property is called lifted, in the sense of not being grounded with specific entities.
How can machines learn utility that is both non-Markovian and lifted? The closest solution in the literature is Inverse Reinforcement Learning (IRL) Abbeel and Ng (2004). However, IRL adopts the fundamental modeling of the MDP. Given an MDP and a set of demonstrations from it, IRL learns a Markovian reward function by matching the mean statistics of states or state-action pairs. This learned reward function can only encourage the expected behavior in the identical MDP, because it falls into the trap of the didactic example above. Apart from that, utility learned with vanilla IRL also fails to generalize to a PGM with a different structure. In this work, we propose a joint treatment for learning generalizable task representations from human demonstrations. Our contributions are threefold:

We characterize the domain of generalization in this utility learning problem by formally defining a schema for the planning task that represents a set of planning problems. The target utility should be learned in a subset of this domain and successfully generalize to other problem instances. This is in stark contrast to the formulation of IRL for which only one planning problem is of interest. We hence term the problem generalized inverse planning.

We propose an energy-based model (EBM) for generalized inverse planning. In statistics, energy-based models are also dubbed descriptive models. This is because the model is expected to match the minimal statistical description in data. It is this description that differentiates the proposed model, Maximum Entropy Inverse Planning (MEIP), from prior art such as MaxEntIRL Ziebart et al. (2008). Instead of matching the mean statistics in demonstrations under a Markovian assumption, MEIP matches the ordinal statistics, a description that sufficiently captures the temporal relations in non-Markovian planning. This model can be learned with Maximum Likelihood and sampled with Monte Carlo Tree Search (MCTS). To combine MEIP with a first-order concept language, we further introduce a boosting method to pursue abstract concepts associated with the utility.
Since the generalization problem in this work has barely been studied systematically, we carefully design an evaluation protocol. Under this protocol, we validate the generalizability of utility learned with MEIP in two proof-of-concept experiments for environmental change and in a challenging task, learning to fold clothes.
Background
Inverse Reinforcement Learning
Imitation learning, also known as learning from demonstrations, is a long-standing problem in the artificial intelligence community. The earliest works mostly adopt the paradigm of Behavior Cloning (BC), directly supervising the policy at each step of the provided sequences Hayes and Demiris (1994), Amit and Matarić (2002). Atkeson and Schaal (1997) were the first to consider the temporal drifting in sequences. Nonetheless, BC is generally believed to be less transferable because it does not generatively model the decision-making sequences. Alternatively, Ng et al. (2000) proposed another paradigm, IRL, to inversely solve for the reward function in an MDP. The learned reward function is expected to help the agent learn a new policy, one that hopefully covers more states in the MDP. Together with some extensions such as Abbeel and Ng (2004) and Ratliff et al. (2006), they set up the standard formulation of IRL.
Consider a finite-horizon MDP, which is a tuple , where is the set of states, is the set of actions, is the Markovian transition probability, is the Markovian reward function and is the horizon. Abbeel and Ng (2004) assume the Markovian reward to be linear in some predefined features of states and derive the feature expectation as . Specifically, the underlying unknown reward is assumed to be with unknown parameter . Given a set of demonstration trajectories from this MDP with , , the value with the estimated parameter is
(1)
The IRL algorithm always runs with an off-the-shelf reinforcement learning method to generate trajectories with the optimal value for the same MDP, except that the reward is the current estimate. We thus have two sets of trajectories. Assuming humans behave rationally when demonstrating, Abbeel and Ng then propose a learning objective to maximize the margin between these two sets. When the model converges, the mean statistics from and should be matched.
To account for humans’ bounded rationality and imperfect demonstrations, Ziebart et al. (2008) introduced a probabilistic model for IRL. They start from a statistical mechanics perspective: the IRL model above does not address the ambiguity that many reward functions can lead to the same feature count. They propose MaxEntIRL, which maximizes the entropy of the distribution over trajectories to break the tie while matching the mean statistics. The distribution of a trajectory in a deterministic MDP is thus
(2) 
where is the partition function. For non-deterministic MDPs, there is an extra factor to account for the transition probability . In MaxEntIRL, the reward function is learned with Maximum Likelihood Estimation (MLE). Some variants such as Boularias et al. (2011) and Ortega and Braun (2013) learn by minimizing relative entropy (KL divergence).
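As an illustration of the Boltzmann form behind Eq.2, the following minimal sketch computes a MaxEnt distribution over a small, fully enumerated set of trajectories in a deterministic MDP. It assumes a reward that is linear in trajectory feature counts; the names `feature_counts` and `theta` and the toy numbers are illustrative assumptions, not values from the original work.

```python
import math

def maxent_trajectory_distribution(feature_counts, theta):
    """Return p(tau) proportional to exp(theta . f_tau) over enumerated trajectories.

    feature_counts: one feature-count vector per trajectory (assumed given).
    theta: reward parameter vector (assumed given).
    """
    scores = [sum(t * f for t, f in zip(theta, fc)) for fc in feature_counts]
    shift = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - shift) for s in scores]
    partition = sum(weights)  # normalizer over the enumerated set
    return [w / partition for w in weights]

# Toy example: three trajectories described by two feature counts each.
feature_counts = [(2.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
theta = (1.0, -1.0)
probs = maxent_trajectory_distribution(feature_counts, theta)
```

Enumerating trajectories is only feasible for tiny problems; it is used here to keep the normalizer explicit.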
We derive our model of Maximum Entropy Inverse Planning from the same principle. But in stark contrast to matching the mean statistics under the assumption of Markovian independence, MEIP matches the ordinal statistics, namely the Kendall rank correlation, to account for the non-Markovian temporal relations. To some degree, we share the same spirit as Xu et al. (2009) and Garrett et al. (2016). But their models are discriminative, which are believed to be more data-hungry and less generalizable than analysis by synthesis Grenander (1993).
Structured Utility Function
Prior attempts to make utility functions structured fall into two regimes: those that adopt First-order Logic (FOL) for lifting Džeroski et al. (2001), Kersting and Driessens (2008) and those that adopt Linear Temporal Logic (LTL) for non-Markovian rewards Li et al. (2017), Littman et al. (2017), Icarte et al. (2018). Leveraging the expressive power of language and axioms, these structured utility functions can be generalized over a task domain. We draw inspiration from them. There are works that also inversely solve for structured utility functions from demonstrations Munzer et al. (2015), Vazquez-Chanlatte et al. (2018), but they only focus on one regime. To the best of our knowledge, our work is the first to provide a unified probabilistic model for both regimes. Our framework adopts a relational language for abstract and composable concepts and matches ordinal statistics with a descriptive model. The learned structured utility function is expected to represent a task in Generalized Planning, to be elaborated below.
Generalized Inverse Planning
The planning we want to discuss in this work is generalized in the sense that its representation generalizes over a domain with multiple planning problems Srivastava et al. (2011). It is thus fundamentally different from the formulation of reinforcement learning which only maximizes rewards on one specific MDP.
Definition 1 (Planning Domain)
A planning domain consists of a set of fluents which are real-valued functions and a set of actions which are associated with some entities or parameters or both.
Fluents here are generalizations of predicates in classical symbolic planning, whose arity should be given in advance and whose value may vary with time. Actions may also be associated with preconditions and effects, specified by sets of fluent values. This definition lifts the planning domain to the first-order level: it abstracts out instances of objects, agents, and events, and describes only their classes and relations. It therefore generalizes to arbitrary numbers of instances. In the literature, the STRIPS-style action language Fikes and Nilsson (1971) provides such abstraction.
Definition 2 (Stochastic Planning Problem)
A stochastic planning problem is given by a domain , a set of entities , a set of transition probabilities for , a reward function and a distribution over initial states. A state is specified by a set of fluent values.
Here we do not assume the transition probabilities to be Markovian. The definition can thus be easily generalized to partially observable Kaelbling et al. (1998) or semi-Markovian Sutton et al. (1999) settings. States are sets of fluent values, which may be grounded spatial relations, grounded temporal relations, or outputs of functions. The reward function is preferred over a Goal because the extensionality of the latter might be hard to specify, particularly for inversion from demonstrations. As introduced earlier, Markovian rewards may not generalize over a domain independently of the environment, so we assume to be non-Markovian for each planning problem. A plan is a (sub)optimal trajectory of states for a planning problem.
Given a set of plans from different planning problems in the same domain, a task is their abstraction that generalizes these plans and captures their underlying optimality. It may abstract out numerical values in . At the minimal level, it is the action or transition at each state that we care about. This principle was previously discussed in Martin and Geffner (2004), Natarajan et al. (2011), and Silver et al. (2020), which directly imitate the behavior with expressive policies whose hypothesis space is logically or programmatically constrained. But depending completely on the policy also constrains the capability of generalization. For example, the learned policy may not work or generalize well if the action set is (uncountably) infinite. To this end, an ordering in the state space is still needed for a task to represent the (possibly bounded) rationality in demonstrated plans Khardon (1999).
Definition 3 (Planning Task)
A planning task is a tuple , where is the domain and is a partial ordering relation over states.
This is the minimal algebraic description of a task for generalized planning. Generalized Inverse Planning is to inversely solve, from demonstrations, for a representation of the task that maintains this algebraic structure.
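One way to picture Definitions 1 and 3 is as plain data structures. The sketch below is only an illustration under stated assumptions: the class names, the field names, the fluent `folded`, the action `fold` and the toy ordering are all hypothetical and do not come from the original work.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Tuple

# A grounded fluent value: (fluent name, entity arguments, real value).
FluentValue = Tuple[str, Tuple[str, ...], float]
# A state is specified by a set of fluent values.
State = FrozenSet[FluentValue]

@dataclass(frozen=True)
class PlanningDomain:
    fluent_arity: Dict[str, int]  # fluent name -> arity, given in advance
    action_arity: Dict[str, int]  # action name -> number of entity parameters

@dataclass(frozen=True)
class PlanningTask:
    domain: PlanningDomain
    precedes: Callable[[State, State], bool]  # ordering relation over states

# Hypothetical toy domain with one unary fluent and one unary action.
domain = PlanningDomain(fluent_arity={"folded": 1}, action_arity={"fold": 1})

def more_folded(a: State, b: State) -> bool:
    """Toy strict ordering: b is preferred when its fluent values sum higher."""
    return sum(v for _, _, v in a) < sum(v for _, _, v in b)

task = PlanningTask(domain=domain, precedes=more_folded)
before = frozenset({("folded", ("left_sleeve",), 0.0)})
after = frozenset({("folded", ("left_sleeve",), 1.0)})
```

The domain mentions no specific entities, so the same `PlanningTask` can be paired with problems that contain different numbers of entities.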
Our proposal towards generalized inverse planning is a nested algorithm. In the inner loop, it learns a numerical representation of with maximum likelihood. In the outer loop, it pursues first-order concepts with maximum a posteriori (MAP) estimation. We introduce them from the inside out in the next two sections and provide pseudocode in Alg.1 and Alg.2.
Maximum Entropy Inverse Planning
Descriptive Model and Maximum Entropy
We adopt descriptive modeling for inverse planning. Different from a discriminative model, which models a conditional probability, e.g. a classifier, a descriptive model specifies the probability distribution of the signal, based on an energy function defined on the signal through some descriptive feature statistics extracted from the signal Wu et al. (2019). In the literature of modern AI, it is also known as an energy-based model LeCun et al. (2006). From the discussion above, it becomes clear that given a set of plans from different problems in the same domain, the minimum statistical description to match should be ordinal to account for temporal relations. Therefore, different from the feature expectation in IRL, we match the Kendall ranking statistics for inverse planning given plans from the same domain:
(3)
where scores the concordance or discordance of a ranking function for temporally indexed states and where from a plan for a problem :
(4) 
The range of Kendall is , where indicates a perfect match. However, normally there can be multiple distributions of plans and thus multiple ranking functions that match these ordinal statistics. Similar to MaxEntIRL, we employ the principle of maximum entropy Jaynes (1957) to resolve ambiguities in choosing distributions. Concretely, we maximize the entropy of the distribution over plans under the constraint that the Kendall can be matched between the demonstrations and generated plans:
(5)
Under the KKT conditions, we can derive the Boltzmann form of this distribution from Eq.5’s Lagrangian:
(6) 
where is the partition function, is the Lagrangian multiplier and accounts for the transition probability .
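To make the ordinal statistic concrete, here is a minimal sketch of a Kendall rank correlation between the temporal order of states in one plan and the scores a ranking function assigns to them. It assumes the convention that later states should receive higher scores and that ties count as neither concordant nor discordant; the exact form of Eq.4 may differ, and the function name is an illustrative assumption.

```python
from itertools import combinations

def kendall_tau(scores):
    """Kendall rank correlation between time index and ranking score.

    scores[t] is the (assumed given) ranking-function value of the state at
    time t. A pair (i, j) with i < j is concordant if scores[j] > scores[i].
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two states to compare")
    total = 0
    for i, j in combinations(range(n), 2):
        diff = scores[j] - scores[i]
        total += (diff > 0) - (diff < 0)  # +1 concordant, -1 discordant, 0 tie
    return total / (n * (n - 1) / 2)
```

For example, `kendall_tau([0.1, 0.5, 0.9])` is 1.0 because every later state is ranked higher, while a fully reversed scoring gives -1.0.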
Utility Learning for Order Preserving
Comparing Eq.6 with Eq.2, it is easy to see the correspondence between and . To further derive the utility function of plans from Eq.6, we first specify the concrete form of . We assume that for each state , which is a set of grounded fluent values, there is a vector of first-order concepts that generalizes over the domain . We will introduce how this vector can be learned in the next section; here we simply assume it is given. We further assume that the ranking function is piecewise linear with respect to this concept vector. Being piecewise linear is a general assumption because functions with this characteristic are in theory as expressive as artificial neural networks. Specifically, we discretize each entry into bins and attach a vector as a functional:
(7)
Obviously, there is no need to separate and anymore. So we drop for the derivation below.
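The binning idea can be sketched as follows. This is a simplified illustration, assuming shared bin edges for all concept entries and one weight per (concept, bin) cell; with one-hot bin indicators the score is linear in the indicator features and piecewise constant in each raw concept value, which may be coarser than the functional used in Eq.7. All names and numbers are illustrative assumptions.

```python
import numpy as np

def one_hot_bins(concepts, edges):
    """Map each concept value to a one-hot indicator over its bin.

    concepts: array of shape (D,) with one value per concept.
    edges: array of shape (K-1,) of increasing bin edges, shared by all concepts.
    Returns an array of shape (D, K).
    """
    idx = np.digitize(concepts, edges)  # bin index in 0..K-1 per concept
    feats = np.zeros((len(concepts), len(edges) + 1))
    feats[np.arange(len(concepts)), idx] = 1.0
    return feats

def ranking_score(concepts, edges, weights):
    """Sum the weight of the active bin of every concept entry."""
    return float(np.sum(one_hot_bins(concepts, edges) * weights))

# Toy example: three concepts, three bins each.
edges = np.array([1.0, 2.0])
weights = np.array([[1.0, 2.0, 3.0],
                    [10.0, 20.0, 30.0],
                    [100.0, 200.0, 300.0]])
score = ranking_score(np.array([0.5, 1.5, 2.5]), edges, weights)
```

Here the three concept values fall into bins 0, 1 and 2 respectively, so the score adds up the weights 1, 20 and 300.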
To further illustrate the utility function in MEIP, we can rewrite as:
(8) 
Therefore we have
(9) 
It is easy to see that this reward function is non-Markovian, in stark contrast to in MaxEntIRL.
We can solve for the by maximum likelihood (MLE) over given demonstrations, which according to Jaynes (1957) implies maximum entropy:
(10) 
Notice that is dropped as a constant. If is differentiable w.r.t. , consider Eq.10’s gradient
(11) 
the second term can be approximated by sampling. There is a contrastive view here: when maximizing the likelihood of demonstrations, we maximize their Kendall and minimize the Kendall of generated plans. The underlying intuition is that at convergence, for pairs of states in the demonstrations, if ; for pairs in generated plans, if ; for pairs with one state in demonstrations and the other one from generated plans, . We refer to the ordinal relations listed here as , resembling ground-truth labels.
However, is not differentiable. To this end, we need to learn a classifier with to match the order described above, mimicking the optimization in Eq.11. One way is to directly relax with a discriminator . To some extent, this approximation is similar in spirit to Generative Adversarial Imitation Learning (GAIL) Ho and Ermon (2016) and some variants Finn et al. (2016). But different from them, we explicitly consider the temporal relation in the non-Markovian utility.
We can also learn this classifier with max-margin methods. Essentially, it is a Ranking SVM model Liu (2011) taking both the demonstrations and the sampled plans as supervision. Suppose there are pairs of states in total from expected to fulfill the ordinal relation above; we then have this Quadratic Optimization problem:
(12) 
It can be solved by off-the-shelf Gradient Descent (GD) or Quadratic Programming (QP). Since the number of constraints and slack variables grows quadratically in the size of each plan, we only consider the ordinal relations within each planning problem . But the parameters are shared across all problems in the domain . If we only consider the linear, primal form of the problem, there are also efficient methods for training Joachims (2006).
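A minimal sketch of the gradient-descent route is given below. It assumes a linear ranking function and replaces the slack variables of the constrained problem with the equivalent pairwise hinge loss, optimized by plain subgradient descent. It is not the exact formulation of Eq.12, and the hyperparameters, function name and toy features are illustrative assumptions.

```python
import numpy as np

def train_ranking_svm(pairs, dim, reg=0.01, lr=0.1, epochs=200):
    """Fit w so that w . x_high exceeds w . x_low by a margin of 1.

    pairs: list of (x_low, x_high) feature vectors, where x_high should be
    ranked above x_low according to the desired ordinal relation.
    """
    w = np.zeros(dim)
    for _ in range(epochs):
        grad = reg * w  # gradient of the L2 regularizer
        for x_low, x_high in pairs:
            d = np.asarray(x_high, dtype=float) - np.asarray(x_low, dtype=float)
            if w @ d < 1.0:  # margin violated: hinge loss is active
                grad = grad - d / len(pairs)
        w = w - lr * grad
    return w

# Toy plan with three states; later states should rank higher.
states = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([2.0, 1.0])]
pairs = [(states[0], states[1]), (states[0], states[2]), (states[1], states[2])]
w = train_ranking_svm(pairs, dim=2)
```

In the setting described above, the pairs would come from both the demonstrations and the sampled plans, restricted to pairs within the same planning problem.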
Sampling Method
We need to sample from the EBM to calculate the parts of Eq.12 that correspond to the second term in Eq.11. Thus we need the distribution of trajectories under the EBM. In this work, we only consider planning problems where states are discrete and leave continuous state spaces for future work. We can factorize the probability of a plan by conditioning:
(13) 
We take its logarithmic form to connect with Eq.8 and Eq.9:
(14) 
where takes both the action and transition probability into account. Since here we do not explicitly specify the action in , we decompose it to be
(15)
[Footnote 1: The main concern here is that grounded actions in plans from different may not generalize over . Bonet and Geffner (2018) discuss learning abstract actions for generalized planning. Here we do not enforce a relational form in the action space . We assume the equivalence between actions causing the same transitions. We further assume there is an absorbing state failure for each state transition, thus is stochastic with .]
It then becomes clear that to sample from the descriptive model, we just sample proportionally to :
(16) 
can be acquired from Monte Carlo Tree Search, for which the reward at each node is initialized with . After its convergence, we can sample trajectories according to the value of each branch.
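The sampling step can be sketched as follows, assuming the branch values have already been estimated (for instance by MCTS) and that successors are drawn from a Boltzmann distribution over those values, in line with the exponential form of Eq.6. The tree, the value table and the function names are hypothetical placeholders.

```python
import math
import random

def branch_probabilities(values):
    """Softmax over branch values: p(branch) proportional to exp(value)."""
    shift = max(values.values())  # subtract the max for numerical stability
    weights = {k: math.exp(v - shift) for k, v in values.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

def sample_plan(root, children, values, rng):
    """Walk down the tree, sampling each successor from the softmax."""
    plan = [root]
    state = root
    while children.get(state):
        probs = branch_probabilities({c: values[c] for c in children[state]})
        names = list(probs)
        state = rng.choices(names, weights=[probs[n] for n in names], k=1)[0]
        plan.append(state)
    return plan

# Hypothetical search tree and converged branch values.
children = {"s0": ["a", "b"], "a": ["goal"], "b": []}
values = {"a": 2.0, "b": 0.0, "goal": 1.0}
plan = sample_plan("s0", children, values, random.Random(0))
```

Replacing the sampling step with an argmax over branch values turns the same routine into a planner for the highest-utility plan.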
It is worth noting that the same sampling method can be used to plan for optimal utility when transferred to other problems in the same domain . So generally speaking, for sampled plans, there is a min-max view:
(17) 
Learning FirstOrder Concepts
In the previous section, we introduced a descriptive model for inverse planning, given some concepts that generalize over the planning domain. In this section, we introduce a formalism for concepts with these characteristics and how we learn them as in Generalized Inverse Planning.
Concept Language
As introduced in Sec.3 (Generalized Inverse Planning), a planning domain is defined in a lifted manner, abstracting out instances of entities in specific problems. Therefore the computational form of the utility learned from Generalized Inverse Planning should also be lifted and be invariant to the variation of numbers of instances. In AI, first-order logic with quantifiers and aggregators is a formalism to express this invariance. To this end, we employ a modified concept language Donini et al. (1997) as a grammar for elements in such that we can learn these first-order concepts in a top-down manner. Concept languages have the expressive power of subsets of standard first-order logic yet with a syntax that is suited for representing classes of entities. Adopting the terminology of FOL, each concept is represented by a first-order formula. However, in the original concept languages, concepts are assumed independent, which might not be the case for our utility function. [Footnote 2: Recall that we assume the ranking function to be (piecewise) linear in Eq.7, which requires concepts to be independent from each other. Utility functions whose concepts are not independent will never be expressed without this modification.] So the primary modification we introduce is to complete it as in FOL, explicitly accounting for formulas which take other formulas as terms with a syntax:
where the notation highlights different concepts. denotes predicates with binary value domains, i.e. relations, and denotes functions with either real or discrete value domains. [Footnote 3: They may have neural network equivalents structured as Multi-Layer Perceptrons (MLP) or Graph Neural Nets (GNN).] They all have their own arities. are concepts represented by atomic formulas with a syntax:
is the set of aggregators and is the set of quantifiers. is a set that is either the value domain of a fluent or the truth domain of a predicate . is the domain of interest that can either be the extension of a certain predicate, which is denoted as , or the universe () of entities. The dimension of the domain should match the arity of and when placed together in and . Among predicates , are primitives. Other predicates can be their constituents’ negation () or conjunction (). They can also be a result of permuting another predicate’s arguments (), if the arity is larger than 1 and arguments are from the same class. They can even be defined transitively (), as in original concept languages. Apparently, .
Concept Pursuit
Considering all combinations over the full bank of concepts would be computationally intractable. A Bayesian treatment of concept induction can provide a principled way to incorporate the prior of Occam’s razor into probabilistic grammars. Let us denote the selected concepts with , where coincides with Eq.7; the posterior of the utility function given demonstrated plans would be
(18) 
To obtain the MAP estimate efficiently, we adopt stepwise greedy search over :
(19) 
Specifically, concepts in are first sorted by their complexity, reflecting the prior . Note that concepts are mutually exclusive if they share the same and differ only in or , so they are stored in the same slot in the sorted list. The levels of complexity are naturally discretized by the number of primitive fluents or predicates involved. We start from the simplest level. At each level, the concept that brings the largest margin is added to in a stepwise greedy manner. The selection terminates at the current level when the marginal benefit in the posterior is below a threshold. Then we move on to the next level. Since some complex concepts may have information overlap with simpler ones, e.g. vs , the simpler ones are removed from when the more complex ones are added. The first term in Eq.19 is equivalent to the MLE in Eq.11 and can be solved with MEIP. Therefore the complete algorithm is nested.
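The level-by-level greedy search described above can be sketched as follows. This is a simplified illustration: `score` stands in for the posterior objective obtained after refitting the inner model with a given concept set, and the step that removes simpler concepts overlapping with newly added complex ones is omitted. All names and the toy scores are hypothetical.

```python
def greedy_concept_pursuit(levels, score, threshold):
    """Select concepts level by level, from simple to complex.

    levels: list of candidate lists, sorted by increasing complexity.
    score: callable mapping a list of selected concepts to the objective
           value (assumed to refit the inner model internally).
    threshold: minimum marginal gain required to add another concept.
    """
    selected = []
    best = score(selected)
    for candidates in levels:
        remaining = list(candidates)
        while remaining:
            gains = [(score(selected + [c]) - best, c) for c in remaining]
            gain, choice = max(gains, key=lambda g: g[0])
            if gain < threshold:
                break  # marginal benefit too small: move to the next level
            selected.append(choice)
            remaining.remove(choice)
            best += gain
    return selected

# Toy objective: each concept contributes a fixed, independent gain.
usefulness = {"a": 1.0, "b": 0.5, "c": 0.01}
toy_score = lambda concepts: sum(usefulness[c] for c in concepts)
selected = greedy_concept_pursuit([["a", "c"], ["b"]], toy_score, threshold=0.1)
```

In this toy run, concept `c` is skipped at the first level because its gain is below the threshold, and the search then proceeds to the second level.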
This derivation leads us to a boosting method Friedman (2001). A similar greedy search strategy was proposed for feature selection in IRL by Bagnell et al. (2007). But different from them, we further assume the increase from is always more significant than the former term, such that we only need to consider a subset of with the lowest complexity at a time.
Experiments
Evaluation Protocol
To help readers better understand the generalization problem studied in this work, we provide a systematic introduction to our evaluation protocol. Note that even though it is the learned utility that is to be evaluated, it cannot be evaluated without planned behaviors. This is because the optimal utility cannot be trivially defined for most tasks where Generalized Inverse Planning is meaningful, an issue that originally motivated the proposal of IRL (Abbeel and Ng, 2004).


The agent first learns the utility in the environment where the demos come from;

We then transfer this agent to another environment, which is built under the same schema thus in the same planning task. This new environment is a result of probability shift or structural change or both;

Given the symbolic world model of the new environment, the agent optimizes for a policy that maximizes the reward function with model-based methods such as MCTS;

Test if the behavior of the agent in this new environment is consistent with the demo in terms of temporal relations.
We adopt the following two sets of evaluation metrics, depending on the diversity of the desired behavior:


For simple tasks where the ground-truth behavior is a single sequence, as in the didactic example where is the only desired sequence, we evaluate the learned utility by a Monte Carlo estimate of the probability that an MCTS agent’s planned behavior in the new environment executes the desired sequence.

Most of the time, the ground-truth optimal utility is not clear to us, especially its numerical value. But we can extract the ground-truth concepts and their ordering from the demos. We evaluate the learned utility with the objective of IRL or inverse planning: the matching in the statistics. Specifically, given the planned behaviors and the demonstrated ones, we measure their mean matching to check if the learned utility can attain a similar behavior from the perspective of IRL, and we measure their Kendall (ordinal matching) to check if the learned utility can attain a similar behavior from the perspective of inverse planning.
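The first metric can be sketched as a plain Monte Carlo estimate. The sketch assumes a `rollout` callable that returns one planned state sequence per call (for example, one run of an MCTS agent in the new environment); this callable, the state names and the sample count are hypothetical placeholders.

```python
import random

def estimate_sequence_probability(rollout, desired, n_samples, rng):
    """Fraction of sampled rollouts that exactly match the desired sequence."""
    hits = sum(1 for _ in range(n_samples) if rollout(rng) == desired)
    return hits / n_samples

# Toy stand-in for a planning agent: takes the desired path 3 times out of 4.
desired = ["start", "safe", "goal"]

def toy_rollout(rng):
    return desired if rng.random() < 0.75 else ["start", "bad"]

estimate = estimate_sequence_probability(toy_rollout, desired, 2000, random.Random(0))
```

The estimate approaches the true probability of the desired sequence as the number of samples grows.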
Experiment 1: Probability Shift
We first conduct an experiment with the didactic example from Littman et al. (2017) introduced above to illustrate that utility learned with MEIP can generalize regardless of probability shift. Here probability shift refers to a change in a distribution in . The desired behavior is ‘do not visit a bad state until you reach the goal’. We refer readers to Fig.1a for the problem setup. When learning the utility, the agent is provided with demonstrations collected from the MDP with . It is also provided with a black-box model to simulate this MDP during MCTS. The learned utility is then tested in another MDP, with . In both environments, we evaluate agents’ behaviors by estimating the probability of the desired sequence . The result is shown in Fig.2: although both agents with MEIP and MaxEntIRL behave perfectly with , only the agent with MEIP still performs perfectly with . The MaxEntIRL agent discards temporal relations in utility and only matches mean statistics , thus prefers in the first transition after a probability shift.
We also conducted an empirical study to explore the boundary of generalization of both MEIP and MaxEntIRL. If , the difference between MEIP and MaxEntIRL would be insignificant since the reward learned from MaxEntIRL can also discourage agents from taking . We also notice that if , which induces an extremely high probability for agents to move to after taking , neither MEIP nor MaxEntIRL learns meaningful utility.
Experiment 2: Structural Change
Structural change happens when the utility is transferred between two environments whose underlying PGMs for the transition have the same fluent nodes but different causal structures or different numbers of nodes. Note that causal structures of environments always implicitly enforce ordering in demonstration sequences. If the ordinal information in these sequences only reflects causality, there should be no difference between MEIP and IRL. However, we humans are cultural creatures. There are many things we do in a certain order not because it is the only feasible way, but due to social conventions, e.g. the traditional order in a wedding ceremony, the stroke order in handwriting, etc. We acquire social utility by following these conventions. It is in these situations that MEIP differs from IRL.
Consider the scenario in Fig.3 that hypothetically took place in early history, when utterance was not at all easy for our ancestors. Alex was a newcomer to a tribe. He was invited to a ritual hosted by the chief of the tribe. The ritual went like this: there were 3 stages in total. In stage 1, Alex saw the chief take out all torches. After stage 2, Alex was shown 1 or 3 bamboos. Eventually, the chief fetched 4 pieces of clay from stage 3. As a member of this tribe, Alex needs to understand the process of this ceremony. The metric to test his understanding is how he would imagine himself hosting it, presuming some environmental changes.
The first evaluation metric, mean matching, tests whether the learned utility is associated with the correct set of concepts and thus can be transferred to another environment with different quantities of objects. In the demonstrated setting, there are 5 torches, 5 bamboos and 5 pieces of clay in each stage. After the structural change, there are 6 of each object. The ground-truth concepts of these demonstrations are , , . In plain English, the learner needs to fetch all torches from stage 1, any number of bamboos from stage 2 and 4 pieces of clay from stage 3. We estimate the mean statistics of matching these concepts in sequences planned with utility learned by MEIP and MaxEntIRL. Both agents successfully discover the correct set of concepts with the help of our concept language (Fig.3(a)). These concepts empower them to generalize learned utility to environments with different quantities of entities (Fig.3(b)).
The second metric is for ordinal relations in sequences. We estimate the Kendall of planned sequences with the ground-truth ordering . To test this generalization, structural change needs to alter causality. This alteration is done by controlling the world model provided to agents. When the transition in the world model directly enforces the required ordering, both agents have when planning with this world model. However, after changing to a transition without causality constraints on ordering, namely, when agents can choose to enter the stages in any order, only the MEIP agent can attain the desired ordering () with the learned utility. We also tried a controlled setting where we provide a world model that does not have causality constraints to both agents. The contrast between the planned behaviors of the MEIP agent and the MaxEntIRL agent remains the same. These results show that MEIP can learn not only the correct concepts but also the desired temporal order. On the other hand, MaxEntIRL fails to capture the ordinal information. As illustrated in Fig.4, mean matching does not enforce ordinal utility learning.
Experiment 3: Learning to Fold
Another task that reflects the ubiquity of this social utility in our daily life is the running example from the introduction, cloth folding. If the utility were associated only with the final state, in which the cloth becomes a square whose area is within a range, there would not be those exemplar sequences you recall when you hear of this task. We conduct an experiment to learn the utility of cloth folding in a visually and geometrically authentic simulator.
To represent the cloth to be folded under our formulation, we adopt a grammar, the Spatial And-Or Graph (SAOG) (Zhu and Mumford, 2007), for the image input. Details of this grammar, as well as other technical details, can be found in the supplementary. Given the visual input of a cloth, the agent should parse it with this grammar to acquire a first-order representation. For the sake of simplicity, we ignore the uncertainty from perception. We also endow the agent with an action model, assuming it can imagine geometric transformations like a toddler. As such, the agent has sufficient knowledge of each problem instance to roll out with MCTS. In our experiments, we collect 15 “good folds” from human demonstrators. When presented with these folding sequences, the agent with MEIP learns the utility and concepts that successfully generalize to unseen clothes, see Fig.5. Even though all shirts (or sweaters) look similar in appearance, their underlying structures differ significantly in the number, location, and orientation of edges and vertices, see Fig.6.
Concluding Remarks
In this work we propose a new quest for learning generalizable task representations, especially their utility, from demonstrations. This problem lies outside the regime of inverse reinforcement learning and is thus dubbed Generalized Inverse Planning. We then outline the computational principles in the cognitive process Marr and Poggio (1976) of lifted non-Markovian utility learning, which we model as Maximum Entropy Inverse Planning (MEIP). Compared to existing inverse reinforcement learning methods, our model learns a task representation that generalizes regardless of probability shift and structural change in the environment. To highlight this contribution, we exclude irrelevant representation learning by adopting classical assumptions and representations in planning, i.e. grounded semantics of entities and relations are given a priori, as well as an action model over them. This kind of assumption was also made in the original works on IRL Abbeel and Ng (2004), Ratliff et al. (2006), Ziebart et al. (2008), Munzer et al. (2015). We are aware that such assumptions on priors might be regarded as too strong today, since recent progress in deep reinforcement learning successfully induces them under a weaker assumption of relational inductive biases Zambaldi et al. (2019). Our model is general enough to be extended to those neural networks. It would be interesting future work to investigate their synergy.
Acknowledgments
The authors thank Prof. Ying Nian Wu and Dr. Yixin Zhu of UCLA Statistics Department, Prof. Guy Van den Broeck of UCLA Computer Science Department for useful discussions. The work reported herein was supported by ONR MURI grant N000141612007, ONR N000141912153, and DARPA XAI N660011724029.
References
 Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st international conference on Machine learning, pp. 1. Cited by: Introduction, Inverse Reinforcement Learning, Inverse Reinforcement Learning, Evaluation Protocol, Concluding Remarks.
 Learning movement sequences from demonstration. In Proceedings 2nd International Conference on Development and Learning. ICDL 2002, pp. 203–208. Cited by: Inverse Reinforcement Learning.
 Robot learning from demonstration. In ICML, Vol. 97, pp. 12–20. Cited by: Inverse Reinforcement Learning.
 Boosting structured prediction for imitation learning. In Advances in Neural Information Processing Systems, pp. 1153–1160. Cited by: Concept Pursuit.
 Action understanding as inverse planning. Cognition 113 (3), pp. 329–349. Cited by: Introduction.
 Features, projections, and representation change for generalized planning. arXiv preprint arXiv:1801.10055. Cited by: footnote 1.
 Relative entropy inverse reinforcement learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 182–189. Cited by: Inverse Reinforcement Learning.
 The complexity of concept languages. Information and Computation 134 (1), pp. 1–58. Cited by: Concept Language.
 Relational reinforcement learning. Machine learning 43 (1–2), pp. 7–52. Cited by: Structured Utility Function.
 STRIPS: a new approach to the application of theorem proving to problem solving. Artificial Intelligence 2 (3), pp. 189–208. Cited by: Generalized Inverse Planning.

 A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852. Cited by: Utility Learning for Order Preserving.
 Greedy function approximation: a gradient boosting machine. Annals of statistics, pp. 1189–1232. Cited by: Concept Pursuit.
 Learning to rank for synthesizing planning heuristics. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 3089–3095. Cited by: Inverse Reinforcement Learning.
 General pattern theory: a mathematical study of regular structures. Oxford Mathematical Monographs. Oxford University Press: Clarendon. Cited by: Inverse Reinforcement Learning.
 A robot controller using learning by imitation. University of Edinburgh, Department of Artificial Intelligence. Cited by: Inverse Reinforcement Learning.
 Generative adversarial imitation learning. In Advances in neural information processing systems, pp. 4565–4573. Cited by: Utility Learning for Order Preserving.
 Using reward machines for high-level task specification and decomposition in reinforcement learning. In International Conference on Machine Learning, pp. 2107–2116. Cited by: Structured Utility Function.
 Information theory and statistical mechanics. Physical review 106 (4), pp. 620. Cited by: Descriptive Model and Maximum Entropy, Utility Learning for Order Preserving.
 Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 217–226. Cited by: Utility Learning for Order Preserving.
 Planning and acting in partially observable stochastic domains. Artificial intelligence 101 (1–2), pp. 99–134. Cited by: Generalized Inverse Planning.
 Nonparametric policy gradients: a unified treatment of propositional and relational domains. In Proceedings of the 25th international conference on Machine learning, pp. 456–463. Cited by: Structured Utility Function.
 Learning action strategies for planning domains. Artificial Intelligence 113 (1–2), pp. 125–148. Cited by: Generalized Inverse Planning.
 A tutorial on energy-based learning. Predicting structured data 1 (0). Cited by: Descriptive Model and Maximum Entropy.
 Reinforcement learning with temporal logic rewards. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3834–3839. Cited by: Structured Utility Function.
 Environment-independent task specifications via GLTL. arXiv preprint arXiv:1704.04341. Cited by: Introduction, Structured Utility Function, Experiment 1: Probability Shift.
 Learning to rank for information retrieval. Springer Science & Business Media. Cited by: Utility Learning for Order Preserving.
 From understanding computation to understanding neural circuitry. Cited by: Concluding Remarks.
 Vision: a computational investigation into the human representation and processing of visual information. Henry Holt and Co., Inc., USA. External Links: ISBN 0716715678 Cited by: Representation for Visual Input of Clothes.
 Learning generalized policies from planning examples using concept languages. Applied Intelligence 20 (1), pp. 9–19. Cited by: Generalized Inverse Planning.
 Inverse reinforcement learning in relational domains. In Twenty-Fourth International Joint Conference on Artificial Intelligence. Cited by: Structured Utility Function, Concluding Remarks.
 Imitation learning in relational domains: a functional-gradient boosting approach. In Twenty-Second International Joint Conference on Artificial Intelligence. Cited by: Generalized Inverse Planning.
 Algorithms for inverse reinforcement learning. In ICML, Vol. 1, pp. 2. Cited by: Inverse Reinforcement Learning.
 The 2.1d sketch. IEEE Computer Society Press. Cited by: Representation for Visual Input of Clothes.
 Thermodynamics as a theory of decision-making with information-processing costs. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 469 (2153), pp. 20120683. Cited by: Inverse Reinforcement Learning.
 Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pp. 729–736. Cited by: Inverse Reinforcement Learning, Concluding Remarks.

 Few-shot Bayesian imitation learning with logical program policies. In AAAI, pp. 10251–10258. Cited by: Generalized Inverse Planning.
 Core knowledge. Developmental science 10 (1), pp. 89–96. Cited by: Representation for Visual Input of Clothes.
 A new representation and associated algorithms for generalized planning. Artificial Intelligence 175 (2), pp. 615–647. Cited by: Generalized Inverse Planning.
 Reinforcement learning: an introduction. MIT press. Cited by: Introduction.
 Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial intelligence 112 (1–2), pp. 181–211. Cited by: Generalized Inverse Planning.
 Learning task specifications from demonstrations. In Advances in Neural Information Processing Systems, pp. 5367–5377. Cited by: Structured Utility Function.
 A tale of three probabilistic families: discriminative, descriptive, and generative models. Quarterly of Applied Mathematics 77 (2), pp. 423–465. Cited by: Descriptive Model and Maximum Entropy.
 Learning linear ranking functions for beam search with application to planning. Journal of Machine Learning Research 10 (7). Cited by: Inverse Reinforcement Learning.
 Deep reinforcement learning with relational inductive biases. In International Conference on Learning Representations. Cited by: Concluding Remarks.
 A stochastic grammar of images. Now Publishers Inc. Cited by: Representation for Visual Input of Clothes, Experiment 3: Learning to Fold.
 Maximum entropy inverse reinforcement learning. In AAAI, Vol. 8, pp. 1433–1438. Cited by: 2nd item, Inverse Reinforcement Learning, Concluding Remarks.
Algorithmic Details
Maximum Entropy Inverse Planning (MEIP)
For the purpose of generalization, the ranking function of utility is designed to be a piecewise linear function. When , the form of the function is:
Cases where has higher dimensions can be derived accordingly. To ensure continuity, we further add a constraint on the connection between two consecutive bins, i.e. .
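To make the continuity constraint concrete, here is a minimal one-dimensional sketch of a continuous piecewise linear function over bins. It parameterizes the function by its values at the bin boundaries, so adjacent bins agree at their shared boundary by construction. This parameterization, the clamping outside the outermost bins, and all names are assumptions for illustration; the paper's own parameterization may differ.

```python
from bisect import bisect_right


def piecewise_linear(x, edges, knot_values):
    """Evaluate a continuous piecewise linear function at x.

    edges: increasing bin boundaries e_0 < e_1 < ... < e_K
    knot_values: function value at each boundary (length K + 1)
    Inputs outside [e_0, e_K] are clamped to the nearest boundary
    (an assumption of this sketch).
    """
    if len(edges) != len(knot_values) or len(edges) < 2:
        raise ValueError("need one knot value per boundary and at least one bin")

    x = min(max(x, edges[0]), edges[-1])
    # Index of the bin containing x; the last boundary belongs to the last bin.
    k = min(bisect_right(edges, x) - 1, len(edges) - 2)
    left, right = edges[k], edges[k + 1]
    t = (x - left) / (right - left)
    # Linear interpolation inside the bin.
    return (1.0 - t) * knot_values[k] + t * knot_values[k + 1]


edges = [0.0, 1.0, 2.0, 3.0]
values = [0.0, 2.0, 1.0, 4.0]
print(piecewise_linear(0.5, edges, values))
print(piecewise_linear(1.0, edges, values))
print(piecewise_linear(2.5, edges, values))
```

Each bin can have its own slope, so the function need not be monotone, yet it has no jumps where bins meet.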
The training process of MEIP follows the principle of analysis by synthesis. According to Eq.11 there are two parts to be optimized, one for demonstrations and one for sampled plans . Specifically, the learning is implemented with a Ranking SVM, see Eq.12. Ordinal relations between pairs are specified in the main text after Eq.11. For a pair , we set for the Ranking SVM.
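The pairwise ranking step can be illustrated with a generic linear Ranking SVM trained by subgradient descent on the pairwise hinge loss. This is a textbook formulation, not a transcription of Eq.12: the feature vectors, the unit margin, the regularization weight, and the learning rate are all assumptions chosen for the example.

```python
def score(w, x):
    """Linear utility of a feature vector x under weights w."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_ranking_svm(pairs, dim, reg=0.01, lr=0.1, epochs=200):
    """Fit w so that score(w, better) exceeds score(w, worse) by a margin.

    pairs: list of (better_features, worse_features) tuples, e.g. a
    demonstration paired with a sampled plan (hypothetical features).
    Minimizes reg/2 * |w|^2 + mean(max(0, 1 - w . (better - worse))).
    """
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [reg * wi for wi in w]  # gradient of the L2 regularizer
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            if score(w, diff) < 1.0:  # margin violated: hinge is active
                for i, di in enumerate(diff):
                    grad[i] -= di / len(pairs)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


# Toy data: the first feature marks preferred trajectories.
pairs = [([1.0, 0.0], [0.0, 1.0]),
         ([2.0, 0.0], [0.0, 1.0]),
         ([1.0, 0.0], [0.0, 2.0])]
w = train_ranking_svm(pairs, dim=2)
print(w)
```

Only pairs that violate the margin contribute to the update, which is what makes the learned function depend on the ordering between trajectories rather than on their absolute scores.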
The sampling process is based on Monte Carlo Tree Search (MCTS). First, we initialize the value function of the MCTS. We keep performing rollouts to compute for trajectories until the value functions in the MCTS converge. After that, we sample a small number of trajectories according to the values on the branches of the Monte Carlo tree. These trajectories, together with the human demonstrations , are used to update the utility function.
Concept Pursuit
As mentioned in the main text, we generate first-order concepts from a concept language. However, the size of this language could be infinitely large. Thus we adopt the principle of Occam’s razor. We describe the algorithm for concept pursuit in Algorithm 2.
There are three main steps in the pursuit process. First, we sort the concepts according to their complexity. The measure of complexity is the number of predicates (or functions) in a concept: the more predicates (functions) a concept contains, the higher its complexity. This discrete nature introduces levels in the complexity.
The second step is to exclude concepts that are not relevant at all. We can select those relevant concepts by simply counting the number of violations in the state pairs. The last and most important step is to select concepts according to Eq. 19 for the utility we learn. Each concept is added to the selected set in a stepwise greedy manner. Note that concepts that share the same predicate (or function) and the same domain but different quantifiers are mutually exclusive.
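The three steps can be sketched as below. The `violations` and `gain` callables are placeholders: the first stands for counting violations in the state pairs, and the second stands in for the selection criterion of Eq. 19, which is not reproduced here. The `Concept` fields, thresholds, and toy numbers are likewise assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    quantifier: str
    predicate: str
    domain: str
    complexity: int  # number of predicates/functions in the concept


def pursue_concepts(candidates, violations, gain, max_violations=0, min_gain=0.0):
    """Greedy stepwise concept selection (illustrative sketch)."""
    # Step 1: sort by complexity so that simpler concepts win ties.
    ordered = sorted(candidates, key=lambda c: c.complexity)
    # Step 2: drop concepts that are violated too often in the state pairs.
    remaining = [c for c in ordered if violations(c) <= max_violations]
    # Step 3: add the concept with the largest gain, one at a time.
    selected = []
    while remaining:
        best = max(remaining, key=lambda c: gain(selected, c))
        if gain(selected, best) <= min_gain:
            break
        selected.append(best)
        # Same predicate and domain with another quantifier is mutually exclusive.
        remaining = [c for c in remaining
                     if (c.predicate, c.domain) != (best.predicate, best.domain)]
    return selected


all_torch = Concept("all", "picked", "torch@stage1", 1)
any_torch = Concept("any", "picked", "torch@stage1", 1)
any_bamboo = Concept("any", "picked", "bamboo@stage2", 1)
all_clay = Concept("all", "picked", "clay@stage3", 1)

toy_violations = {all_torch: 0, any_torch: 0, any_bamboo: 0, all_clay: 3}
toy_gain = {all_torch: 2.0, any_torch: 1.0, any_bamboo: 1.5, all_clay: 5.0}

chosen = pursue_concepts(
    [any_torch, all_clay, any_bamboo, all_torch],
    violations=lambda c: toy_violations[c],
    gain=lambda selected, c: toy_gain[c],
)
print(chosen)
```

In this toy run, the violated concept is filtered out before its gain is ever considered, and selecting the "all" torch concept removes its "any" counterpart from further consideration.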
Details for Experiment 2
Concept Space in Ritual Learning
In the ritual learning experiment, we only consider the concepts that are generated by the rule in which is the quantifier, is the predicate, and is the domain. In the ritual learning experiment, the domain in which is the set of object types and is the set of stages. A primitive concept in this experiment can be expressed as:
The primitive predicate, picked, describes whether the host carries some type of object from a specific stage. As introduced in the main text, primitive concepts represented by atomic formulas with the primitive predicate are terms for formulas of more complex concepts. Here these more complex concepts are encodings of the interrelation between primitive concepts.
Experimental Details
The Environment is designed to be an analogy to a ritual. There are different stages in the environment. Both the agent and the demonstrator are required to choose a stage before they can advance to pick up objects in it. After locking down a specific stage, one is asked to choose one type of object and pick up the of them ( means all). No stage can be visited more than once. The ritual terminates after all stages have been visited.
Human Demonstrations consist of at least 35 sequences. A sequence consists of ordered descriptions of the objects that the demonstrator obtains at each stage. The demonstrator can pick only one type of object at each stage, without limitation on the quantity. Examples of demonstrations are or . A longer demo is possible if it is legal in the environment, although each demo has only 3 stages in our experiment. Note that all demos must contain exactly the same set of specific concepts.
Hyperparameters of this experiment are as follows. MCTS convergence condition: terminate after 3000 iterations. Number of sampled trajectories : 5. Upper confidence bound coefficient: 1.
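To show where the upper confidence bound coefficient enters, the sketch below uses the standard UCB1 selection rule commonly paired with MCTS. Whether the original implementation uses exactly this form is an assumption; the constants mirror the hyperparameters listed above, and the function names and child statistics are hypothetical.

```python
import math

UCB_COEFFICIENT = 1.0         # upper confidence bound coefficient from the list above
MAX_ITERATIONS = 3000         # MCTS terminates after this many iterations
NUM_SAMPLED_TRAJECTORIES = 5  # trajectories sampled per utility update


def ucb_score(total_value, visits, parent_visits, c=UCB_COEFFICIENT):
    """Standard UCB1 score: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited children first
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits):
    """children: dict mapping action -> (total_value, visits)."""
    return max(children, key=lambda a: ucb_score(*children[a], parent_visits))


children = {"stage1": (5.0, 10), "stage2": (1.0, 1), "stage3": (0.0, 0)}
first_choice = select_child(children, parent_visits=11)

visited_only = {"stage1": (5.0, 10), "stage2": (1.0, 1)}
second_choice = select_child(visited_only, parent_visits=11)
print(first_choice, second_choice)
```

A larger coefficient shifts selection toward rarely visited branches, while a smaller one favors branches with a high mean value.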
Details for Experiment 3
Representation for Visual Input of Clothes
We adopt a stochastic grammar, the Spatial And-Or Graph (SAOG) Zhu and Mumford (2007), for the visual input of all clothes. The design of this grammar follows the Gestalt laws in vision Marr (1982). Here we informally summarize its production rules. A cloth is an And node that produces a set of polygons. The number of polygons may change after folding. Since some polygons may be occluded in the visual input, we adopt a 2.1D representation Nitzberg and Mumford (1990). The 2.1D representation is a layered representation. In our case, the order of the layers is consistent with the folding order. All polygons belong to the same class with a template set of fluents. They produce a set of line segments, the edges. Different configurations of edges in one polygon constitute an Or node. All edges also belong to the same class. Each edge is associated with two vertices as its attributes. Each vertex is specified by its coordinates. The classes introduced above are regarded as domains in the concept language.
The full fluent set for this grammar is designed following the axioms of Euclidean geometry, as shown in Table 1. Edges, vertices, surfaces and their relations are also believed to be part of our core knowledge developed at an early age Spelke and Kinzler (2007). Fluents for edges are categorized into functions/relations between edges, e.g. parallel, and functions/relations between one edge and one vertex of another edge, e.g. distance. Other fluents, such as Logo and Neck, are visual features of polygons. With these classes and fluents, we can generate concepts from the concept language.
During planning, the visual input of the cloth from each situation is parsed into a parse graph () of the SAOG. As shown in Figure 7, is a hierarchical representation in which the terminal nodes are edges with two vertices and the nonterminal nodes are polygons. Clothes with the simplest structures, such as a shirt, are initially parsed into three polygons and their affiliated edges.
Name                 Arity    Type
Edge Length          Unary    function
Logo                 Unary    function
Neck                 Unary    function
Vt2Vt Distance       Binary   function
Vt2Edge Distance     Binary   function
Parallel             Binary   predicate
Perpendicular        Binary   predicate
Vertex on Edge       Binary   predicate
Edge on Edge         Binary   predicate
Vertex in Polygon    Binary   predicate
Experimental Details
The Environment is a simulator for folding. The demonstrator is asked to draw a folding line that splits a polygon into two new polygons. In Figure 8, we show an example of a legal folding line in the simulator. The polygon is separated by the folding line, and the smaller part is flipped to the back of the larger one after each fold.
Human Demonstrations are given by folding a demo shirt (or sweater). In a demo, the states of the shirt are recorded and serialized as a sequence. A fold is not reversible, so the demonstrator needs to consider the final state at every step. A “bad fold” could have a significant impact on the final state. Unlike the previous experiment, the demonstrator does not have a prescribed concept set during the demo process. Instead, the demonstrator needs to produce a folding sequence that leads to good final states, which meet their own criteria of “good folds” based on default sequences in their real-life habits. We collected 15 sequences as demonstrations.
The Action Space of a folding line is discretized to make MCTS applicable. It is defined as a tuple and each parameter is discretized accordingly. As illustrated in (a), is the coordinate of a point on the grid. The grid is a discretization of the bounding box of a shirt. is the radius of the folding line and is the angle. Note that some folding lines may be redundant, so we need to check the uniqueness of each folding line.
Even though we have discretized the action space, it is still too large for MCTS with limited computational resources. Thus, it is necessary to restrict the MCTS to a reasonable number of legal folds. The solution is to learn a probability distribution over the action space, in which folds that are similar to demo folds are associated with high probabilities. We assume that each parameter in follows a normal distribution around some exemplars. See (b).
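One way to realize such a distribution is sketched below: each discretized fold is weighted by a mixture of independent Gaussians centred on the demonstrated folds, and only the most probable folds are handed to the MCTS. The parameter layout (grid x, grid y, radius, angle), the standard deviations, the toy grid, and the top-k cut-off are all assumptions for illustration, and angle wrap-around is ignored for brevity.

```python
import math
from itertools import product


def gaussian_weight(value, mean, sigma):
    """Unnormalized normal density of `value` around `mean`."""
    return math.exp(-0.5 * ((value - mean) / sigma) ** 2)


def action_prior(actions, exemplars, sigmas):
    """Probability of each discretized fold under the exemplar mixture."""
    weights = []
    for action in actions:
        # Sum over exemplars of a product of per-parameter Gaussians.
        w = sum(
            math.prod(gaussian_weight(a, e, s)
                      for a, e, s in zip(action, exemplar, sigmas))
            for exemplar in exemplars
        )
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


def top_k_actions(actions, probs, k):
    """Keep the k most probable folds as the legal set for the search."""
    ranked = sorted(zip(probs, actions), reverse=True)
    return [action for _, action in ranked[:k]]


# Hypothetical discretization: 4 x 4 grid, two radii, three angles (degrees).
actions = list(product(range(4), range(4), [1, 2], [0, 45, 90]))
exemplars = [(1, 1, 1, 90)]       # one demonstrated fold
sigmas = (1.0, 1.0, 1.0, 30.0)    # assumed spread per parameter

probs = action_prior(actions, exemplars, sigmas)
candidates = top_k_actions(actions, probs, k=5)
print(candidates)
```

Pruning to the top-k folds trades completeness for tractability: folds far from every exemplar are never explored, which matches the intent of restricting the search to a reasonable number of legal folds.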