Apprenticeship learning, or learning behavior by observing expert demonstrations, allows artificial agents to learn to perform tasks without requiring the system designer to explicitly specify reward functions or objectives in advance. In stochastic domains such as Markov Decision Processes (MDPs), apprenticeship learning has typically been accomplished by means of inverse reinforcement learning (IRL), in which agents infer a reward function presumed to underlie the observed behavior. IRL has recently been criticized, especially in the context of learning ethical behavior, because the resulting reward functions (1) may not be easily explained, and (2) cannot represent complex temporal objectives.
A separate line of work has proposed using linear temporal logic (LTL) as a specification language for agents in MDPs. An agent in a stochastic domain may be provided a formula in LTL, which it must satisfy with maximal probability. These approaches require the LTL specification to be given a priori (e.g., by the system designer), although some approaches construct specifications from natural language instructions.
This paper proposes combining the virtues of these approaches by inferring LTL formulas from observed behavior trajectories. Specifically, this inference problem can be formulated as multiobjective optimization over the space of LTL formulas. The two objective functions represent (1) the extent to which the given formula explains the observed behavior, and (2) the complexity of the given formula. The resulting specifications are interpretable, and can be subsequently applied to new problems, but do not need to be specified in advance by the system designer.
The key contributions of this work are (1) the introduction of this problem and its formulation as an optimization problem; and (2) the notion of violation cost, and the state- and action-based objectives based on this notion.
In the remainder of the paper, we first discuss related work; we then describe our formulation of this problem as multiobjective optimization, defining a notion of “violation cost” and then describing state-based and action-based objectives, corresponding to inferring a specification from “what actually happened” and “what the demonstrator expected to happen” respectively. We demonstrate the usefulness of the formulation by using genetic programming to optimize these objectives in two domains, called SlimChance and CleaningWorld. We discuss issues pertaining to our approach and directions for future work, and summarize our results.
II. Related Work
The proposed problem draws primarily upon ideas from apprenticeship learning (particularly, inverse reinforcement learning), stochastic planning with temporal logic specifications, and inferring temporal logic descriptions of systems.
II-A Apprenticeship Learning
Apprenticeship learning, the problem of learning correct behavior by observing the policies or behavioral trajectories of one or more experts, has predominantly been accomplished by inverse reinforcement learning (IRL) [19, 1]. IRL algorithms generally compute a reward function that “explains” the observed trajectories (typically, by maximally differentiating them from random behavior). Complete discussion of the many types of IRL algorithms is beyond the scope of this paper.
The proposed approach bears some resemblance to IRL, particularly in its inputs (sets of finite behavioral trajectories). Instead of computing a reward function based on the observed trajectories, however, the proposed approach computes a formula in linear temporal logic that optimally “explains” the data. This addresses recent criticisms claiming that IRL is insufficient in morally and socially important domains because (1) reward functions can be difficult for human instructors to understand and correct, and (2) some moral and social goals may be too temporally complex to be representable using reward functions.
II-B Stochastic Planning with Temporal Logic Specifications
There has been a wealth of work in recent years on providing agents in stochastic domains (namely, Markov Decision Processes) with specifications in linear temporal logic (LTL). The problem is to compute a policy that satisfies a given LTL formula with maximal probability; we describe the most straightforward approach further in section III-C.
More sophisticated approaches consider the same problem in the face of uncertain transition dynamics [25, 8], partial observability [23, 22], and multi-agent domains [16, 11]. Also relevant to the proposed approach is the idea of “weighted skipping” that appears (in deterministic domains) in [21, 24, 15].
The problem of inferring LTL specifications from behavior trajectories is complementary to the problem of stochastic planning with LTL specifications, much as IRL is complementary to “traditional” reinforcement learning (RL). Specifications learned using the proposed approach may be used for planning, and trajectories generated from planning agents may be used to infer the underlying LTL specification.
II-C Inferring Temporal Logic Rules from Agent Behavior
The task of generating temporal logic rules that describe data is not a new one. Automatic identification of temporal logic rules describing the behavior of software programs (in the category of “specification mining”) has been attempted in, e.g., [9, 10, 17]. Lemieux et al.’s Texada allows users to enter custom templates for formulas and retrieves all formulas satisfied by the observed traces up to user-defined support and confidence thresholds; this differs from the work of Gabel and Su, who decompose complex specifications into combinations of predefined templates. Specifications in a temporal logic (rPSTL) have also been inferred from data in continuous control systems. Each of these approaches deals with (deterministic) program traces.
The proposed approach is most strongly influenced by prior work that casts the task of inferring temporal logic specifications for finite state machines as a multiobjective optimization problem amenable to genetic programming. Much of our approach follows from this work; our novel contributions are the introduction of the problem of applying such methods to agent behavior in stochastic domains, and in particular our notion of the violation cost as an objective function.
III. Background

In this section we provide formal definitions of Markov Decision Processes (MDPs) and linear temporal logic (LTL); we then outline an existing approach for planning to satisfy (with maximum probability) LTL formulas in MDPs.
III-A Markov Decision Processes
The proposed approach pertains to agents in Markov Decision Processes (MDPs) augmented with a set AP of atomic propositions. Since reward functions are not important to this problem, we omit them. All notation and references to MDPs in this paper assume this construction.
Formally, a Markov Decision Process is a tuple M = (S, A, A(·), T, s_0, AP, L), where:

S is a (finite) set of states;

A is a (finite) set of actions;

A(s) ⊆ A specifies which actions are available in each state s;

T : S × A × S → [0, 1] is a transition function, with T(s, a, s′) = 0 if a ∉ A(s), so that T(s, a, s′) is the probability of transitioning to s′ by beginning in s and taking action a;

s_0 ∈ S is an initial state;

AP is a set of atomic propositions; and

L : S → 2^AP is the labeling function, so that L(s) is the set of propositions that are true in state s.
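This labeled, reward-free MDP can be captured directly in code. The sketch below (in Python, with illustrative names; none of the identifiers are prescribed by the paper) encodes the tuple components and checks that each T(s, a, ·) is a probability distribution:

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple

State = str
Action = str
Prop = str

@dataclass
class LabeledMDP:
    """A reward-free MDP augmented with atomic propositions, as defined above."""
    states: List[State]
    actions: Callable[[State], List[Action]]               # A(s): available actions
    trans: Dict[Tuple[State, Action], Dict[State, float]]  # T(s, a) -> {s': prob}
    initial: State                                         # s_0
    props: List[Prop]                                      # AP
    label: Callable[[State], FrozenSet[Prop]]              # L(s): true propositions

    def check_transitions(self) -> bool:
        # Each T(s, a, .) must sum to 1 over successor states.
        return all(abs(sum(d.values()) - 1.0) < 1e-9 for d in self.trans.values())
```

A two-state instance with a single proposition suffices to exercise the definition.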
A trajectory in an MDP specifies the path of an agent through the state space. A finite trajectory is a finite sequence of state-action pairs followed by a final state (e.g., τ = s_0 a_0 s_1 a_1 ⋯ s_T); an infinite trajectory is an infinite sequence of state-action pairs (e.g., τ = s_0 a_0 s_1 a_1 ⋯). A sequence (finite or infinite) is only a trajectory if a_t ∈ A(s_t) for all t. We will denote by Traj_fin(M) the set of all finite trajectories in an MDP M, and by Traj_inf(M) the set of all infinite trajectories in M. We will denote by τ|_T the T-time-step truncation of an infinite trajectory τ.
A policy maps each finite trajectory to a probability distribution over the agent’s next action. A policy is said to be deterministic if, for each trajectory, the returned distribution allots nonzero probability to only one action; we then write π(τ) = a. A policy is said to be stationary if the returned distribution depends only on the last state of the trajectory; we then write π(s).
We denote by Traj_inf(π) the set of all infinite trajectories that may occur under a given policy π; more formally, these are the infinite trajectories in which every action taken receives nonzero probability under π.
III-B Linear Temporal Logic
Linear temporal logic (LTL) is a multimodal logic over propositions that linearly encodes time. Its syntax is as follows:

φ ::= p | ¬φ | φ ∧ ψ | X φ | G φ | F φ | φ U ψ,  where p ∈ AP.

Here X φ means “in the next time step, φ”; G φ means “in all present and future time steps, φ”; F φ means “in some present or future time step, φ”; and φ U ψ means “φ will be true until ψ holds”.
The truth value of an LTL formula is evaluated over an infinite sequence of valuations ν = ν_0 ν_1 ⋯, where ν_t ⊆ AP for all t. We say ν ⊨ φ if φ is true given the infinite sequence of valuations ν.

There is thus a clear mapping between infinite trajectories and LTL formulas. We abuse notation slightly and define L(τ) = L(s_0) L(s_1) ⋯ for an infinite trajectory τ = s_0 a_0 s_1 a_1 ⋯.

We abuse notation further and say that τ ⊨ φ, for any τ ∈ Traj_inf(M), if L(τ) ⊨ φ.
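As an illustration of these semantics, a naive recursive evaluator can check a formula against a trace of valuations. Since code cannot hold an infinite sequence, the sketch below uses bounded finite-trace semantics (X is false at the last position; G, F, and U quantify only over the remaining finite suffix), which only approximates the infinite-trace semantics defined above; the nested-tuple encoding of formulas is an assumption for illustration:

```python
# Formulas as nested tuples: ("prop", "p"), ("not", f), ("and", f, g),
# ("X", f), ("G", f), ("F", f), ("U", f, g).
def holds(formula, trace, t=0):
    """Evaluate an LTL formula at position t of a finite trace of valuations
    (sets of true propositions), under bounded finite-trace semantics."""
    op = formula[0]
    if op == "prop":
        return formula[1] in trace[t]
    if op == "not":
        return not holds(formula[1], trace, t)
    if op == "and":
        return holds(formula[1], trace, t) and holds(formula[2], trace, t)
    if op == "X":  # false at the last position of a finite trace
        return t + 1 < len(trace) and holds(formula[1], trace, t + 1)
    if op == "G":  # holds at every remaining position
        return all(holds(formula[1], trace, k) for k in range(t, len(trace)))
    if op == "F":  # holds at some remaining position
        return any(holds(formula[1], trace, k) for k in range(t, len(trace)))
    if op == "U":  # second argument eventually holds, first holds until then
        return any(holds(formula[2], trace, k) and
                   all(holds(formula[1], trace, j) for j in range(t, k))
                   for k in range(t, len(trace)))
    raise ValueError(f"unknown operator {op}")
```

For example, on the trace ∅, {p}, {p}, the formula F p holds while G p does not.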
We define the probability Pr(π ⊨ φ) that a given policy π satisfies an LTL formula φ as the probability that an infinite trajectory generated under π will satisfy φ.
Each LTL formula can be translated into a deterministic Rabin automaton (DRA), a finite automaton over infinite words. DRAs are the standard approach to model checking for LTL. A DRA is a tuple D = (Q, Σ, δ, q_0, Acc), where:
Q is a finite set of states;

Σ is an alphabet (in this case, Σ = 2^AP, so words are infinite sequences of valuations);

δ : Q × Σ → Q is a (deterministic) transition function;

q_0 ∈ Q is an initial state; and

Acc = {(L_1, K_1), …, (L_M, K_M)}, where L_m, K_m ⊆ Q for all m, specifies the acceptance conditions.
A run of a DRA is an infinite sequence of DRA states ρ = q_0 q_1 q_2 ⋯ such that there is some word ν = ν_0 ν_1 ⋯ with q_{t+1} = δ(q_t, ν_t) for all t. A run ρ is considered accepting if there exists some m such that every state in L_m is visited only finitely often in ρ, and some state in K_m is visited infinitely often in ρ.
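For an ultimately periodic (“lasso”) run, the states visited infinitely often are exactly those on the loop, so Rabin acceptance reduces to a finite check over the loop. A minimal sketch (the lasso representation is an assumption for illustration, not part of the paper):

```python
def rabin_accepting(loop_states, acc_pairs):
    """Check Rabin acceptance for an ultimately periodic run.

    loop_states: states on the repeating loop (visited infinitely often).
    acc_pairs:   list of (L_m, K_m) pairs of state sets.
    Accepting iff some pair has an empty intersection with L_m and a
    nonempty intersection with K_m."""
    inf = set(loop_states)
    return any(not (inf & L) and bool(inf & K) for L, K in acc_pairs)
```

For instance, a run looping on a K-state is accepting, while one that keeps revisiting an L-state is not.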
III-C Stochastic Planning with LTL Specifications
Planning to satisfy a given LTL formula within an MDP with maximum probability generally follows a standard automata-based approach, which we summarize here.
The planning agent runs the DRA for the formula alongside the MDP by constructing a product MDP, which augments the state space to include information about the current DRA state.
Formally, the product of an MDP M and a DRA D is an MDP whose states are pairs (s, q) of an MDP state and a DRA state, whose transition function moves the MDP component according to T while updating the DRA component according to δ and the label of the new MDP state, and whose initial state is (s_0, δ(q_0, L(s_0))).
The agent constructs the product MDP, and then computes its accepting maximal end components (AMECs). An end component of an MDP is a pair (B, d) of a set of states B and an action restriction d (a mapping from states to sets of actions) such that (1) any agent in B that performs only actions as specified by d will remain in B; and (2) any agent with a policy assigning nonzero probability to all actions in d is guaranteed to eventually visit each state in B infinitely often.
An end component thus specifies a set of states B such that, with an appropriate choice of policy, the agent can guarantee that it will remain in B forever and that it will reach every state in B infinitely often. An end component is maximal if it is not a proper subset of another end component. An end component is accepting if there is some m such that (1) no state (s, q) ∈ B has q ∈ L_m; and (2) there exists some (s, q) ∈ B with q ∈ K_m. In this case, by entering B and choosing an appropriate policy (for instance, a uniformly random policy over d), the agent guarantees that the DRA run will be accepting. A method for computing the AMECs of the product MDP can be found in the literature.
IV. Optimization Problem
Suppose that an agent is given some set of finite behavior trajectories {τ_1, …, τ_N}, where τ_i ∈ Traj_fin(M) for all i.
We refer to the agent whose trajectories are observed as the demonstrator, and the agent that observes the trajectories as the apprentice. There may be several demonstrators satisfying the same objectives; this does not affect the proposed approach.
The proposed problem is to infer an LTL specification that well (and succinctly) explains the observed trajectories. This can be cast as a multiobjective optimization problem with two objective functions:
An objective function representing how well a candidate LTL formula explains the observed trajectories (and distinguishes them from random behavior); and
An objective function representing the complexity of a candidate LTL formula.
This section proceeds by describing a notion of “violation cost” (and defining the violation cost of infinite trajectories and policies) and using it to define two alternate objective functions representing (a) how well a candidate formula explains the actual observed state sequence (a “state-based” objective function), and (b) how well a candidate formula explains the actions of the demonstrator in each state (an “action-based” objective function). We then describe the simple notion of formula complexity we will utilize, and formulate the optimization problem.
Iv-a Violation Cost
We are interested in computing LTL formulas that well explain the demonstrator’s trajectories. These formulas should be satisfied by the observed behavior, but not by random behavior within the same MDP (since, for example, the trivial formula “true” will be satisfied by the observed behavior, but also by random behavior). Ideally we could assign a “cost” either to trajectories (finite or infinite) or to policies (and, particularly, to the uniformly random policy in M), where the cost of a trajectory or policy corresponds to its adherence to or deviance from the specification. Given such a cost function c, the objective would be to minimize the total cost of the observed trajectories less the expected cost of the uniformly random (stationary) policy over M.
The obvious choice of such a cost function (over infinite trajectories τ) would be the indicator function which returns 1 if τ ⊭ φ and 0 otherwise; this function may be extended to general policies by taking its expectation over the trajectories the policy generates. This function, however, cannot distinguish between small and large deviances from the specification. For example, given the specification G p, this function cannot differentiate between a trajectory in which p is almost always true and one in which p is never true. We thus propose a more sophisticated cost function.
For I, a set of nonnegative integers, we define τ ∖ I to be the subsequence of τ omitting the state-action pairs with time step indices in I. For example, (s_0 a_0 s_1 a_1 s_2 a_2 ⋯) ∖ {1} = s_0 a_0 s_2 a_2 ⋯. Each time step with an index in I is said to be “skipped”.
We define the violation cost of an infinite trajectory τ subject to the formula φ as the (discounted) minimum number of time steps that must be skipped in order for the agent to satisfy the formula:

cost_v(τ, φ) = min { Σ_{t ∈ I} γ^t : τ ∖ I ⊨ φ },

where γ ∈ (0, 1) is a discount factor.
Note that if τ ⊨ φ, then cost_v(τ, φ) = 0.
In order to define a similar measure for policies, we must construct an augmented product MDP, which is similar to the product MDP described in section III-C, but allows an agent to “skip” states by performing at each time step (simultaneously with its normal action) a “DRA action” ε ∈ {ε_go, ε_skip}, where ε_go causes the DRA to transition as usual, and ε_skip causes the DRA to not update in response to the new state.
Formally, given an MDP M and a DRA D corresponding to the specification φ, we may construct such an augmented product MDP by pairing MDP states with DRA states as in section III-C and extending each action with a DRA action. An additional dummy state and action are included so that the agent may choose to “skip” time step 0; this is necessary for the case in which the initial state itself violates the specification.
Note that the transition dynamics of the augmented product MDP are such that I, the set of “skipped” time step indices, is exactly the set of time steps at which the agent performs ε_skip.
Define the transition cost in the augmented product MDP to be 1 for transitions in which the agent performs ε_skip, and 0 otherwise.
The violation cost of a (non-product) trajectory can then be rewritten as a discounted sum of the transition costs at each stage, minimized over the DRA actions, subject to the constraint that the DRA run resulting from carrying out the trajectory and the DRA actions must be accepting. This indicates that the violation cost of a policy may be thought of as the state-value function for the policy with respect to the transition cost. Indeed, we will define the violation cost of a policy this way.
We define a product policy to be a stationary policy over the states of the augmented product MDP. When we consider the violation cost of a policy, we will assume a product policy of this form.
There are two reasons for this. First, when evaluating a candidate specification, we wish to assume the demonstrator had knowledge of that specification (or else we would be unable to notice complex temporal patterns in agent behavior), and thus that the demonstrator’s policy is over product states. Second, we wish to allow the demonstrator to observe the new (non-product) state before deciding whether to “skip” the new time step. That is, the new state should be observed before the DRA action ε is chosen, which is inconsistent with the typical policy over the product space.
We can easily construct a product policy from the uniformly random policy on M: in every product state, it selects uniformly at random among the MDP actions available there.
Upon constructing the augmented product MDP, we compute its AMECs (as in section III-C). Then let B be the union of the states of these AMECs, and let B̄ be the set of states in the product space from which no state in B can be reached; these can be determined by breadth-first search.
We can use a form of the Bellman update equation to perform policy evaluation on a product policy π. For each state in B̄, we initialize the cost of the state to the maximum discounted cost, 1/(1 − γ), and we do not update these costs. This is done to enforce the constraint that the minimization should be over accepting DRA runs; otherwise, the violation cost would always be trivially zero (since the cost-free DRA action ε_go would always be picked). The update equation has the following form:

V(s, q) ← Σ_a π(a | s, q) Σ_{s′} T(s, a, s′) · min{ 1 + γ V(s′, q), γ V(s′, δ(q, L(s′))) }    (4)
The min in (4) is where the optimization over ε (implicitly) occurs. Choosing ε_skip incurs a transition cost of 1 and causes the DRA to remain in state q; choosing ε_go incurs no transition cost, but causes the DRA to transition to state δ(q, L(s′)). The ability of the demonstrator to optimize over ε after observing the new state corresponds to the location of the min in the Bellman update.
We define the violation cost of a policy as the value, at the initial product state, of the function that results when running this update equation to convergence.
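A minimal policy-evaluation sketch of this update follows. It assumes the successor distribution under the fixed product policy has been precomputed, that the skip cost of 1 is paid immediately while both continuations are discounted by γ, and that blocked states (those from which no AMEC is reachable) are pinned at the maximum discounted cost 1/(1 − γ); all names are illustrative:

```python
def violation_cost(states, succ, gamma=0.95, blocked=frozenset(), sweeps=500):
    """Evaluate the violation cost of a fixed product policy.

    succ[x]: list of (prob, x_skip, x_advance) triples for product state x,
             where x_skip keeps the DRA state (skip) and x_advance moves the
             DRA on the new state's label (go).
    blocked: states from which no AMEC is reachable; their cost is pinned at
             the maximum discounted cost and never updated."""
    vmax = 1.0 / (1.0 - gamma)
    V = {x: (vmax if x in blocked else 0.0) for x in states}
    for _ in range(sweeps):
        for x in states:
            if x in blocked:
                continue
            # Per observed successor, choose the cheaper of skipping (cost 1,
            # DRA unchanged) and advancing the DRA (cost 0).
            V[x] = sum(p * min(1.0 + gamma * V[x_skip], gamma * V[x_adv])
                       for p, x_skip, x_adv in succ[x])
    return V
```

In a three-state example where advancing from state C leads only to a blocked state, the evaluation correctly prefers repeated skipping until the discounted blocked cost becomes the cheaper option.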
We now consider state-based (“what actually happened”) and action-based (“what the agent expected to happen”) objective functions, for explaining sets of finite trajectories.
The crux of both the state- and action-based objective functions is Algorithm 1. Given a finite sequence of states s_0 s_1 ⋯ s_T, Algorithm 1 determines the “optimal product-space interpretation” of that sequence. We define a product-space interpretation of a sequence of states in an MDP as a sequence of DRA states q_0 q_1 ⋯ q_T such that, for all t, either q_{t+1} = δ(q_t, L(s_{t+1})), or q_{t+1} = q_t (a skip). That is, a product-space interpretation specifies a possible trajectory in the product space that is consistent with the observed trajectory in M.
Algorithm 1 uses dynamic programming to determine, for each time step t, the set of DRA states that the demonstrator could be in at time t (lines 3, 7, and 10), as well as the minimal violation cost that would need to be accrued in order to be in each such state (lines 2, 4, 12, and 16). The sequence that achieves this minimal cost is also computed (lines 5, 13, and 17).
The apprentice then assumes that the demonstrator acted randomly from the final observed time step onward. Although this assumption is probably incorrect, it is not entirely unreasonable: it avoids the assumption that the demonstrator attempted to satisfy the formula after the observed trajectory ends, which would artificially drive the net violation cost down, and it allows the apprentice to reuse values that are already computed in order to evaluate the random policy.
Employing this assumption, the apprentice determines the optimal product-space interpretation as the interpretation minimizing the accrued skip cost plus the discounted violation cost of the random policy from the resulting final product state.
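The table computation at the heart of Algorithm 1 can be sketched as a forward pass over DRA states, where each step either advances the DRA on the observed valuation for free or skips it at a discounted cost. This simplified sketch omits the random-continuation term and the back-pointers that recover the minimizing sequence; the DRA is given as a transition dictionary over hashable valuations:

```python
def dra_dp(dra_delta, q0, valuations, gamma=0.95):
    """Forward dynamic program over DRA states (a simplified Algorithm 1 table).

    dra_delta:  dict mapping (dra_state, valuation) -> dra_state.
    valuations: the observed sequence L(s_0), L(s_1), ... as frozensets.
    Returns, for each reachable DRA state, the minimal discounted skip cost
    needed to be in that state after processing the whole sequence."""
    costs = {q0: 0.0}
    for t, v in enumerate(valuations):
        nxt = {}
        for q, c in costs.items():
            # Option 1: advance the DRA on valuation v (no cost).
            q_adv = dra_delta[(q, v)]
            nxt[q_adv] = min(nxt.get(q_adv, float("inf")), c)
            # Option 2: skip this time step (DRA stays put, pay gamma^t).
            nxt[q] = min(nxt.get(q, float("inf")), c + gamma ** t)
        costs = nxt
    return costs
```

On a two-state DRA for G p, a trace that violates p once can stay in the accepting state at the cost of a single (discounted) skip.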
IV-A1 State-based objective function
We first consider an approach to estimating the violation cost of a finite trajectory that considers only the states visited in the trajectory, ignoring the demonstrator’s actions.
Thus the state-based objective function for a formula φ is the sum of the estimated violation costs of all observed finite trajectories, less N times the expected violation cost of the random policy from the initial state.
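Given per-trajectory violation-cost estimates and the random policy’s expected violation cost, this combination is straightforward to compute (function and argument names are illustrative, not from the paper):

```python
def state_based_objective(trajectory_costs, random_policy_cost):
    """State-based objective: total estimated violation cost of the N observed
    trajectories, minus N times the random policy's expected violation cost
    from the initial state (so formulas trivially satisfied by random
    behavior gain no advantage)."""
    n = len(trajectory_costs)
    return sum(trajectory_costs) - n * random_policy_cost
```

A formula that random behavior violates heavily (large random-policy cost) while the demonstrations satisfy it (small trajectory costs) yields a strongly negative, i.e. good, objective value.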
The main drawback of the state-based approach is that by ignoring the observed actions, the apprentice neglects a crucial detail: what the demonstrator “expected” or “intended” to satisfy may differ from what actually was satisfied. The fact that an event did not occur does not mean that the demonstrator was not attempting to make it occur with maximal probability, particularly if it is a very rare event. To solve this problem, we consider an action-based approach.
IV-A2 Action-based objective function
We now consider estimating the violation cost of a finite trajectory by using the observed state-action pairs to compute a partial policy over the augmented product MDP.
To compute the action-based violation cost of a set of trajectories (Algorithm 2), the apprentice first runs Algorithm 1 to determine the optimal product-space interpretation of each trajectory (line 4), and uses this to compute the resulting sequence of product states visited by the demonstrator.
The action-based objective function is then the violation cost of the resulting partial product policy, less the expected violation cost of the random policy.
IV-B Formula Complexity
Given two formulas that equally distinguish between the observed behavior and random behavior, we wish to select the less complex of the two. Here it suffices to simply minimize the number of nodes in the parse tree for the LTL formula (that is, the total number of symbols in the formula). There are more sophisticated ways to evaluate formula complexity, but they are not necessary for our purposes.
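With formulas encoded as parse trees (here, the nested-tuple encoding assumed earlier for illustration), this complexity measure is a one-line recursion:

```python
def complexity(formula):
    """Number of nodes in a formula's parse tree, with ("prop", name) atoms
    and (op, child, ...) connectives: one node per operator or proposition."""
    if formula[0] == "prop":
        return 1
    return 1 + sum(complexity(child) for child in formula[1:])
```

For example, G p has complexity 2 (one temporal operator plus one proposition).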
IV-C Multiobjective Optimization Problem
V. Demonstrations

To demonstrate the effectiveness of the proposed objective functions, we employed genetic programming to evolve a set of LTL formulas (where formulas are represented by their parse trees) in two domains. A summary of the domains used is in Table I. In all demonstrations, we used MOEAFramework for genetic programming, using standard tree crossover and mutation operations. We consider (separately) the state-based and action-based objectives. NSGA-II over each set of objectives was run for a fixed number of generations and population size; this process was repeated twenty times. We employed BURLAP for MDP planning, and Rabinizer 3 for converting LTL formulas to DRAs. In each case, we restricted search to formulas of a fixed syntactic form.
Table I: Summary of the domains used (domain, number of states, number of actions, “actual” specification, and running times in seconds for the state-based and action-based objectives).
The tables in this section show formulas that are Pareto efficient in at least two NSGA-II runs; that is, there were no solutions within those runs that outperformed them on both objectives. For any Pareto-inefficient formula, there is some other formula that both (1) better explains the demonstrated trajectories (as measured by the violation-cost objective function) and (2) is simpler. Thus it is reasonable to restrict consideration to only Pareto efficient solutions.
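For concreteness, Pareto efficiency over the two (minimized) objectives can be computed as follows; representing each solution as a (violation-cost value, complexity) pair is an illustrative assumption:

```python
def pareto_front(solutions):
    """Return the Pareto-efficient subset of (violation_cost, complexity)
    pairs: a solution is kept iff no other solution is at least as good on
    both objectives and strictly better on one (both minimized)."""
    def dominates(a, b):
        return a[0] <= b[0] and a[1] <= b[1] and a != b
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o != s)]
```

A solution that ties another on complexity but has a higher violation cost is dominated and dropped.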
V-A SlimChance domain
The SlimChance domain consists of two states: a “good” state and a “bad” state. The agent has two actions: one action always leads to the bad state, while the other leads to the good state with small probability and to the bad state otherwise. Performing the latter action is thus “trying” to make the good state occur, but will rarely succeed.
The set of atomic propositions for this problem consists of a single proposition p, which is true in the good state but false in the bad state. We then suppose that the agent is attempting to satisfy a simple LTL formula over p.
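A sketch of the SlimChance transition dynamics follows; the action names “try” and “idle” and the success probability p = 0.05 are illustrative assumptions, since the paper does not fix them here:

```python
import random

def slimchance_step(state, action, p=0.05, rng=random):
    """One transition of the SlimChance domain.

    'try' reaches the good state with small probability p (the 'slim chance');
    'idle' always leads to the bad state, regardless of the current state."""
    if action == "try":
        return "good" if rng.random() < p else "bad"
    return "bad"
```

Injecting a deterministic stub for `rng` makes the rare-success behavior easy to exercise.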
A demonstrator attempting to minimize violation cost generated three trajectories of 10 time steps each (any occurrences of the good state within them occurred randomly).
Tables II and III show all solutions that were Pareto efficient in at least two runs, for the state-based and action-based objectives respectively. The results emphasize the distinction between the two objective functions. In Table II the correct formula is Pareto efficient in two runs, but in most runs an obviously incorrect formula is the only Pareto efficient one. In contrast, Table III shows that under the action-based objective, the true formula is Pareto efficient in all twenty runs.
V-B CleaningWorld domain
In the CleaningWorld domain, the agent is a vacuum-cleaning robot in a dirty room. The room is characterized by some initial amount of dirt, and the agent has some battery level. The actions available to the agent are: vacuuming, which reduces both the amount of dirt and the battery level by one; docking, which plugs the robot into a charger, allowing it to increment its battery level for each time step it remains docked; undocking, which unplugs the robot from the charger; and waiting, which allows the robot to remain docked if it is currently docked, but otherwise simply decrements the battery level. If the robot’s battery dies, the robot may only perform the dummy waiting action. The domain has two propositions: one that is true iff the room contains no dirt, and one that is true iff the battery is dead. There are also propositions corresponding to each action (where, e.g., the vacuuming proposition is true whenever the agent’s last action was to vacuum). The agent is to satisfy an LTL objective over these propositions.
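A sketch of the (deterministic) CleaningWorld dynamics as described above; the action names, the choice to charge only while waiting docked, and the exact dead-battery behavior are modeling assumptions rather than details fixed by the paper:

```python
def cleaningworld_step(dirt, battery, docked, action):
    """One deterministic CleaningWorld transition over (dirt, battery, docked)."""
    if battery == 0:
        # Dead battery: only the dummy waiting action is possible; nothing changes.
        return dirt, 0, docked
    if action == "vacuum":
        return max(dirt - 1, 0), battery - 1, docked
    if action == "dock":
        return dirt, battery, True
    if action == "undock":
        return dirt, battery, False
    if action == "wait":
        if docked:
            return dirt, battery + 1, True   # charging while docked
        return dirt, battery - 1, False      # idling drains the battery
    raise ValueError(f"unknown action {action}")
```

Vacuuming drains dirt and battery together, so satisfying a cleanliness objective requires interleaving docking and charging.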
An agent attempting to minimize violation cost for this specification produced three demonstration trajectories of 10 time steps each. Because CleaningWorld is deterministic, all three trajectories were identical. Here we represent each state by a pair (d, b), where d is the amount of dirt still in the room and b is the robot’s current battery level.
Tables IV and V show all solutions that were Pareto efficient in at least two runs, for the state-based and action-based objectives respectively. Two formulas are generated in all 20 runs by both objective functions. These formulas arguably describe the agent’s behavior better than the “actual” specification: they are simpler while generating identical trajectories. This is reflected by the fact that they were produced by the algorithm in both the state-based and action-based runs, whereas the “actual” specification is Pareto dominated by them under either objective.