# Interpretable Apprenticeship Learning with Temporal Logic Specifications

Recent work has addressed using formulas in linear temporal logic (LTL) as specifications for agents planning in Markov Decision Processes (MDPs). We consider the inverse problem: inferring an LTL specification from demonstrated behavior trajectories in MDPs. We formulate this as a multiobjective optimization problem, and describe state-based ("what actually happened") and action-based ("what the agent expected to happen") objective functions based on a notion of "violation cost". We demonstrate the efficacy of the approach by employing genetic programming to solve this problem in two simple domains.


## I Introduction

Apprenticeship learning, or learning behavior by observing expert demonstrations, allows artificial agents to learn to perform tasks without requiring the system designer to explicitly specify reward functions or objectives in advance. Apprenticeship learning has been accomplished in stochastic domains, such as Markov Decision Processes (MDPs), by means of inverse reinforcement learning (IRL), in which agents infer some reward function presumed to underlie the observed behavior. IRL has recently been criticized, especially in the context of learning ethical behavior [2], because the resulting reward functions (1) may not be easily explained, and (2) cannot represent complex temporal objectives.

Recent work (e.g., [5, 8, 25]) has proposed using linear temporal logic (LTL) as a specification language for agents in MDPs. An agent in a stochastic domain may be provided a formula in LTL, which it must satisfy with maximal probability. These approaches require the LTL specification to be provided a priori (e.g., by the system designer, although [6] construct specifications from natural language instruction).

This paper proposes combining the virtues of these approaches by inferring LTL formulas from observed behavior trajectories. Specifically, this inference problem can be formulated as multiobjective optimization over the space of LTL formulas. The two objective functions represent (1) the extent to which the given formula explains the observed behavior, and (2) the complexity of the given formula. The resulting specifications are interpretable, and can be subsequently applied to new problems, but do not need to be specified in advance by the system designer.

The key contributions of this work are (1) the introduction of this problem and its formulation as an optimization problem; and (2) the notion of violation cost, and the state- and action-based objectives based on this notion.

In the remainder of the paper, we first discuss related work; we then describe our formulation of this problem as multiobjective optimization, defining a notion of “violation cost” and then describing state-based and action-based objectives, corresponding to inferring a specification from “what actually happened” and “what the demonstrator expected to happen” respectively. We demonstrate the usefulness of the formulation by using genetic programming to optimize these objectives in two domains, called SlimChance and CleaningWorld. We discuss issues pertaining to our approach and directions for future work, and summarize our results.

## II Related Work

The proposed problem draws primarily upon ideas from apprenticeship learning (particularly, inverse reinforcement learning), stochastic planning with temporal logic specifications, and inferring temporal logic descriptions of systems.

### II-A Apprenticeship Learning

Apprenticeship learning, the problem of learning correct behavior by observing the policies or behavioral trajectories of one or more experts, has predominantly been accomplished by inverse reinforcement learning (IRL) [19, 1]. IRL algorithms generally compute a reward function that “explains” the observed trajectories (typically, by maximally differentiating them from random behavior). Complete discussion of the many types of IRL algorithms is beyond the scope of this paper.

The proposed approach bears some resemblance to IRL, particularly in its inputs (sets of finite behavioral trajectories). Instead of computing a reward function based on the observed trajectories, however, the proposed approach computes a formula in linear temporal logic that optimally “explains” the data. This addresses the criticisms of [2], who claim that IRL is insufficient in morally and socially important domains because (1) reward functions can be difficult for human instructors to understand and correct, and (2) some moral and social goals may be too temporally complex to be representable using reward functions.

### II-B Stochastic Planning with Temporal Logic Specifications

There has been a wealth of work in recent years on providing agents in stochastic domains (namely, Markov Decision Processes) with specifications in linear temporal logic (LTL). The most straightforward approach is that of [5], which we describe further in section III-C. The problem is to compute a policy which satisfies a given LTL formula with maximal probability.

More sophisticated approaches consider the same problem in the face of uncertain transition dynamics [25, 8], partial observability [23, 22], and multi-agent domains [16, 11]. Also relevant to the proposed approach is the idea of “weighted skipping” that appears (in deterministic domains) in [21, 24, 15].

The problem of inferring LTL specifications from behavior trajectories is complementary to the problem of stochastic planning with LTL specifications, much as IRL is complementary to “traditional” reinforcement learning (RL). Specifications learned using the proposed approach may be used for planning, and trajectories generated from planning agents may be used to infer the underlying LTL specification.

### II-C Inferring Temporal Logic Rules from Agent Behavior

The task of generating temporal logic rules that describe data is not a new one. Automatic identification of temporal logic rules describing the behavior of software programs (in the category of "specification mining") has been attempted in, e.g., [9, 10, 17]. Lemieux et al.'s Texada [17] allows users to enter custom templates for formulas and retrieves all formulas satisfied by the observed traces up to user-defined support and confidence thresholds; this differs from the work of Gabel and Su, who decompose complex specifications into combinations of predefined templates. Specifications in a temporal logic (rPSTL) have also been inferred from data in continuous control systems in [13]. All of these approaches deal with (deterministic) program traces.

The proposed approach is most strongly influenced by [4], which casts the task of inferring temporal logic specifications for finite state machines as a multiobjective optimization problem amenable to genetic programming. Much of our approach follows from this work; our novel contribution is introducing the problem of applying such methods to agent behavior in stochastic domains, and in particular our notion of the violation cost as an objective function.

## III Preliminaries

In this section we provide formal definitions of Markov Decision Processes (MDPs) and linear temporal logic (LTL); we then outline the approach taken in [5] for planning to satisfy (with maximum probability) LTL formulas in MDPs.

### III-A Markov Decision Processes

The proposed approach pertains to agents in Markov Decision Processes (MDPs) augmented with a set Π of atomic propositions. Since reward functions are not important to this problem, we omit them. All notation and references to MDPs in this paper assume this construction.

Formally, a Markov Decision Process is a tuple

 M = ⟨S, U, A, P, s0, Π, L⟩

where

• S is a (finite) set of states;

• U is a (finite) set of actions;

• A : S → 2^U specifies which actions are available in each state;

• P : S × U × S → [0, 1] is a transition function, with P(s, a, s′) = 0 if a ∉ A(s), so that P(s, a, s′) is the probability of transitioning to s′ by beginning in s and taking action a;

• s0 ∈ S is an initial state;

• Π is a set of atomic propositions; and

• L : S → 2^Π is the labeling function, so that L(s) is the set of propositions that are true in state s.

A trajectory in an MDP specifies the path of an agent through the state space. A finite trajectory is a finite sequence of state-action pairs followed by a final state (e.g., τ = (s0, a0), (s1, a1), …, (sT−1, aT−1), sT); an infinite trajectory takes T = ∞, and is an infinite sequence of state-action pairs (e.g., τ = (s0, a0), (s1, a1), ⋯). A sequence (finite or infinite) is only a trajectory if at ∈ A(st) and P(st, at, st+1) > 0 for all t. We will denote by FTraj_M the set of all finite trajectories in an MDP M, and by ITraj_M the set of all infinite trajectories in M. We will denote by τ|T the T-time step truncation of an infinite trajectory τ.
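
As a concrete illustration, the MDP tuple and the trajectory-validity conditions above can be sketched in Python; all identifiers here are our own, not taken from the paper's implementation:

```python
from dataclasses import dataclass

# A minimal sketch of the MDP tuple defined above; names are illustrative.
@dataclass
class MDP:
    states: set
    actions: set
    available: dict   # s -> set of available actions A(s)
    P: dict           # (s, a, s') -> transition probability
    s0: object
    props: set        # atomic propositions
    label: dict       # s -> set of propositions true in s

def is_trajectory(mdp, pairs, final_state):
    """Check that (s0, a0), ..., (s_{T-1}, a_{T-1}), sT is a valid finite
    trajectory: every action is available in its state and every
    transition has nonzero probability."""
    seq = [s for s, _ in pairs] + [final_state]
    for t, (s, a) in enumerate(pairs):
        if a not in mdp.available[s]:
            return False
        if mdp.P.get((s, a, seq[t + 1]), 0.0) <= 0.0:
            return False
    return True
```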

A policy π is a probability distribution over an agent's next action, given its previous (finite) trajectory. A policy is said to be deterministic if, for each trajectory, the returned distribution allots nonzero probability to only one action; we then write π(τ) for the chosen action. A policy is said to be stationary if the returned distribution depends only on the last state of the trajectory; we then write π(s, a) for the probability of taking action a in state s.

We denote by ITraj^π_M the set of all infinite trajectories that may occur under a given policy π. More formally,

 ITraj^π_M = {τ = (s0, a0), (s1, a1), ⋯ ∈ ITraj_M : π(τ|T, aT) > 0 for all T}

### III-B Linear Temporal Logic

Linear temporal logic (LTL) [20] is a multimodal logic over propositions that linearly encodes time. Its syntax is as follows:

 ϕ ::= ⊤ | ⊥ | p, where p ∈ Π | ¬ϕ | ϕ1 ∧ ϕ2 | ϕ1 ∨ ϕ2 | Xϕ | Gϕ | Fϕ | ϕ1 U ϕ2

Here Xϕ means "in the next time step, ϕ"; Gϕ means "in all present and future time steps, ϕ"; Fϕ means "in some present or future time step, ϕ"; and ϕ1 U ϕ2 means "ϕ1 will be true until ϕ2 holds".

The truth-value of an LTL formula is evaluated over an infinite sequence of valuations σ = σ0, σ1, ⋯, where σt ∈ 2^Π for all t. We say σ ⊨ ϕ if ϕ is true given the infinite sequence of valuations σ.

There is thus a clear mapping between infinite trajectories and such sequences of valuations. We abuse notation slightly and define

 L((s0, a0), (s1, a1), ⋯) = L(s0), L(s1), ⋯

We abuse notation further and say that τ ⊨ ϕ, for any τ ∈ ITraj_M, if L(τ) ⊨ ϕ.
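
Since satisfaction is defined over label sequences, a small evaluator is easy to sketch. The paper's semantics are over infinite sequences (handled via DRAs below); the following bounded evaluator uses a common finite-trace reading of G, F, and U and is purely illustrative. Formulas are represented as nested tuples, a convention we adopt only for illustration:

```python
# Finite-trace LTL evaluation over a list of label sets (valuations).
# Formulas are nested tuples, e.g. ("G", ("ap", "p")).
def holds(phi, trace, t=0):
    op = phi[0]
    if op == "true":  return True
    if op == "false": return False
    if op == "ap":    return phi[1] in trace[t]
    if op == "not":   return not holds(phi[1], trace, t)
    if op == "and":   return holds(phi[1], trace, t) and holds(phi[2], trace, t)
    if op == "or":    return holds(phi[1], trace, t) or holds(phi[2], trace, t)
    if op == "X":     return t + 1 < len(trace) and holds(phi[1], trace, t + 1)
    if op == "G":     return all(holds(phi[1], trace, u) for u in range(t, len(trace)))
    if op == "F":     return any(holds(phi[1], trace, u) for u in range(t, len(trace)))
    if op == "U":     # phi1 holds until some point where phi2 holds
        return any(holds(phi[2], trace, u) and
                   all(holds(phi[1], trace, v) for v in range(t, u))
                   for u in range(t, len(trace)))
    raise ValueError(op)
```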

We define the probability that a given policy π satisfies an LTL formula ϕ by

 Pr^π_M(ϕ) = Pr{τ ∈ ITraj^π_M : τ ⊨ ϕ}

That is, the probability that an infinite trajectory under π will satisfy ϕ.

Each LTL formula can be translated into a deterministic Rabin automaton (DRA), a finite automaton over infinite words. DRAs are the standard approach to model checking for LTL. A DRA is a tuple

 D = ⟨Q, Σ, δ, q0, F⟩

where

• Q is a finite set of states;

• Σ is an alphabet (in this case, Σ = 2^Π, so that words are infinite sequences of valuations);

• δ : Q × Σ → Q is a (deterministic) transition function;

• q0 ∈ Q is an initial state; and

• F = {(L1, K1), …, (Lk, Kk)}, where Li, Ki ⊆ Q for all i, specifies the acceptance conditions.

A run of a DRA is an infinite sequence of DRA states q0, q1, ⋯ such that there is some word σ = σ0, σ1, ⋯ such that qt+1 = δ(qt, σt) for all t. A run ρ is considered accepting if there exists some i such that every state in Li is visited only finitely often in ρ, and some state in Ki is visited infinitely often in ρ.
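
The Rabin acceptance condition can be checked mechanically for ultimately periodic words (a finite prefix followed by an endlessly repeated cycle), for which the set of infinitely-visited states is computable. The representation below is our own illustration, not the paper's:

```python
from dataclasses import dataclass

# Illustrative DRA representation; letters are frozensets of propositions.
@dataclass
class DRA:
    Q: set
    delta: dict              # (q, letter) -> q'
    q0: object
    F: list                  # list of (L_i, K_i) pairs of state sets

def accepts_lasso(dra, prefix, cycle):
    """Check Rabin acceptance of the word prefix . cycle^omega."""
    q = dra.q0
    for letter in prefix:
        q = dra.delta[(q, letter)]
    # Iterate the cycle until the state at the cycle boundary repeats;
    # the states in the repeating portion are exactly those visited
    # infinitely often.
    seen_at_boundary, visited = {}, []
    while q not in seen_at_boundary:
        seen_at_boundary[q] = len(visited)
        for letter in cycle:
            visited.append(q)
            q = dra.delta[(q, letter)]
    inf = set(visited[seen_at_boundary[q]:])
    return any(not (inf & L_i) and bool(inf & K_i) for L_i, K_i in dra.F)
```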

### III-C Stochastic Planning with LTL Specifications

Planning to satisfy a given LTL formula ϕ within an MDP M with maximum probability generally follows the approach of [5]. The planning agent runs the DRA for ϕ alongside M by constructing a product MDP which augments the state space to include information about the current DRA state.

Formally, the product of an MDP M and a DRA D is an MDP

 M× = ⟨S×, U×, A×, P×, s×0, Π×, L×⟩

where

• S× = S × Q;

• U× = U, with A×((s, q)) = A(s);

• P×((s, q), a, (s′, q′)) = {P(s, a, s′) if q′ = δ(q, L(s′)); 0 otherwise};

• s×0 = (s0, δ(q0, L(s0))), Π× = Π, and L×((s, q)) = L(s).

The agent constructs the product MDP M×, and then computes its accepting maximal end components (AMECs). An end component of an MDP is a pair (T, A′) of a set of states T and an action restriction A′ (a mapping from states to sets of available actions) such that (1) any agent in T that performs only actions as specified by A′ will remain in T; and (2) any agent with a policy assigning nonzero probability to all actions in A′ is guaranteed to eventually visit each state in T infinitely often.

An end component thus specifies a set of states T such that with an appropriate choice of policy, the agent can guarantee that it will remain in T forever, and that it will reach every state in T infinitely often. An end component is maximal if it is not a proper subset of another end component. An end component (T, A′) of the product MDP is accepting if there is some i such that (1) if (s, q) ∈ T, then q ∉ Li; and (2) there exists some (s, q) ∈ T such that q ∈ Ki. In this case, by entering T and choosing an appropriate policy (for instance, a uniformly random policy over A′), the agent guarantees that the DRA run will be accepting. A method for computing the AMECs of the product MDP is found in [3].
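
The two defining conditions of an end component (closure under the action restriction, and the ability to revisit every state) can be checked directly for small MDPs. The following brute-force sketch uses our own data-structure conventions and is illustrative, not the algorithm of [3]:

```python
def is_end_component(states, restrict, P):
    """Check that (states, restrict) is an end component of an MDP whose
    transition function P maps (s, a, s') -> probability:
    (1) allowed actions never leave the component, and
    (2) every state reaches every other under the restriction."""
    states = set(states)
    # (1) closure under the action restriction
    for s in states:
        for a in restrict[s]:
            if any(p > 0 and s2 not in states
                   for (s1, a1, s2), p in P.items() if s1 == s and a1 == a):
                return False
    # (2) strong connectivity under the restriction
    def reachable(src):
        seen, stack = {src}, [src]
        while stack:
            s = stack.pop()
            for (s1, a1, s2), p in P.items():
                if s1 == s and a1 in restrict[s] and p > 0 and s2 not in seen:
                    seen.add(s2)
                    stack.append(s2)
        return seen
    return all(states <= reachable(s) for s in states)
```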

The problem of satisfying ϕ with maximal probability is thus reduced to the problem of reaching, with maximal probability, any state in any AMEC. [5] shows how this reachability problem can be solved using linear programming.

## IV Optimization Problem

Suppose that an agent is given some set of finite behavior trajectories T = {τ1, …, τm}, where τi ∈ FTraj_M for all i.

We refer to the agent whose trajectories are observed as the demonstrator, and the agent that observes the trajectories as the apprentice. There may be several demonstrators satisfying the same objectives; this does not affect the proposed approach.

The proposed problem is to infer an LTL specification that well (and succinctly) explains the observed trajectories. This can be cast as a multiobjective optimization problem with two objective functions:

1. An objective function representing how well a candidate LTL formula explains the observed trajectories (and distinguishes them from random behavior); and

2. An objective function representing the complexity of a candidate LTL formula.

This section proceeds by describing a notion of “violation cost” (and defining the violation cost of infinite trajectories and policies) and using it to define two alternate objective functions representing (a) how well a candidate formula explains the actual observed state sequence (a “state-based” objective function), and (b) how well a candidate formula explains the actions of the demonstrator in each state (an “action-based” objective function). We then describe the simple notion of formula complexity we will utilize, and formulate the optimization problem.

### IV-A Violation Cost

We are interested in computing LTL formulas that well explain the demonstrator's trajectories. These formulas should be satisfied by the observed behavior, but not by random behavior within the same MDP (since, for example, the trivial formula ⊤ is satisfied by the observed behavior, but also by random behavior). Ideally we could assign a "cost" either to trajectories (finite or infinite) or to policies (and, particularly, to the uniformly random policy in M), where the cost of a trajectory or policy corresponds to its adherence to or deviance from the specification. Given such a cost function, the objective would be to minimize the cost of the observed behavior relative to the cost of πrand, the uniformly-random (stationary) policy over M:

 πrand(s, a) = {1/|A(s)| if a ∈ A(s); 0 otherwise}

The obvious choice of such a cost function (over infinite trajectories τ) would be the indicator function which returns 0 if τ ⊨ ϕ and 1 otherwise; this function may be extended to general policies by taking its expectation over ITraj^π_M. This function, however, cannot distinguish between small and large deviations from the specification. For example, given the specification Gp, this function cannot differentiate between τ such that p is almost always true and τ such that p is never true. We thus propose a more sophisticated cost function.

For N ⊆ ℕ0, a set of nonnegative integers, we define τ∖N to be the subsequence of τ omitting the state-action pairs with time step indices in N. For example, ((s0, a0), (s1, a1), (s2, a2), ⋯)∖{1} = (s0, a0), (s2, a2), ⋯. Each time step with an index in N is said to be "skipped".

We define the violation cost of an infinite trajectory τ subject to the formula ϕ as the (discounted) minimum number of time steps that must be skipped in order for the trajectory to satisfy the formula:

 Violϕ(τ) = min_{N ⊆ ℕ0 : τ∖N ⊨ ϕ} ∑_{t=0}^∞ γ^t 1_{t ∈ N} (1)

Note that if τ ⊨ ϕ, then Violϕ(τ) = 0.
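
For intuition, the minimization in (1) can be evaluated by brute force on a finite trace, given a finite-trace satisfaction check supplied by the caller. This exponential-time sketch is purely illustrative; all conventions are our own:

```python
from itertools import combinations

def violation_cost_bruteforce(trace, sat, gamma=0.9):
    """Minimal discounted skip cost over all skip sets N, as in eq. (1),
    restricted to a finite trace and a caller-supplied finite-trace
    satisfaction check `sat`. Exponential in the trace length."""
    T = len(trace)
    best = float("inf")
    for k in range(T + 1):
        for N in combinations(range(T), k):
            kept = [trace[t] for t in range(T) if t not in N]
            if sat(kept):
                best = min(best, sum(gamma ** t for t in N))
    return best
```

For example, under a finite-trace reading of Gp, the minimal skip set is exactly the set of steps where p fails.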

In order to define a similar measure for policies, we must construct an augmented product MDP M⊗, which is similar to M× as described in section III-C, but which allows an agent to "skip" states by performing at each time step, simultaneously with its normal action, a "DRA action" ã ∈ {keep, susp}, where keep causes the DRA to transition as usual, and susp causes the DRA to not update in response to the new state.

Formally, given an MDP M and a DRA D corresponding to the specification ϕ, we may construct a product MDP M⊗ as follows:

• S⊗ = (S × Q) ∪ {s⊗−1} and U⊗ = U × {keep, susp}, where s⊗−1 is a new initial state;

• Otherwise, the transition probabilities are

 P⊗((s, q), (a, ã), (s′, q′)) = {P(s, a, s′) if q′ = δ(q, L(s′)) and ã = keep; P(s, a, s′) if q′ = q and ã = susp; 0 otherwise}

The state s⊗−1 and its associated action are added so that the agent may choose to "skip" time step 0. This is necessary for the case that L(s0) violates the specification.

Note that the transition dynamics of M⊗ are such that N (the set of "skipped" time step indices) can be defined as

 N = {t ∈ ℕ0 : ã_{t−1} = susp} (2)

Define the transition cost in M⊗ as

 TC(s⊗, (a, ã), s⊗′) = 1_{ã = susp} (3)

The violation cost of a (non-product) trajectory can then be rewritten as a discounted sum of the transition costs at each stage, minimized over the DRA actions ã, subject to the constraint that the DRA run resulting from carrying out the trajectory and the DRA actions must be accepting. This indicates that the violation cost of a policy may be thought of as the state-value function for that policy with respect to the transition cost TC. Indeed, we will define the violation cost of a policy this way.

We define a product policy to be a stationary policy π⊗ over the augmented product state space S⊗. When we consider the violation cost of a policy, we will assume a product policy of this form.

There are two reasons for this. First, when evaluating a candidate specification, we wish to assume the demonstrator had knowledge of that specification (or else we would be unable to notice complex temporal patterns in agent behavior), and thus that the demonstrator's policy is over product states. Second, we wish to allow the demonstrator to observe the new (non-product) state before deciding whether to "skip" the corresponding time step. That is, s′ should be observed before ã is chosen, which is inconsistent with the typical notion of a policy over the product space.

We can easily construct a product policy π⊗rand from the uniformly random policy πrand on M: we define π⊗rand((s, q), a) = πrand(s, a) for all q ∈ Q.

Upon constructing the product MDP M⊗, we compute its AMECs (as in section III-C). Then let SAMEC denote the union of the state sets of these AMECs, and let Bad be the set of states in the product space from which no state in SAMEC can be reached; these can be determined by breadth-first search.

We can use a form of the Bellman update equation to perform policy evaluation on a product policy π⊗. For each state in Bad, we initialize the cost of this state to the maximum discounted cost, 1/(1−γ), and we do not update these costs. This is done to enforce the constraint that the minimization should be over accepting DRA runs; otherwise, the violation cost would always be trivially zero (since susp would always be picked). The update equation has the following form:

 Viol^{(k+1)}((s,q)) ← ∑_{a∈A(s)} π⊗((s,q),a) ∑_{s′∈S} P(s,a,s′) min{1 + γ Viol^{(k)}((s′,q)), γ Viol^{(k)}((s′,δ(q,L(s′))))} (4)

The min in (4) is where the optimization over ã (implicitly) occurs. Choosing susp incurs a transition cost of 1 and causes the DRA to remain in state q; choosing keep incurs no transition cost, but causes the DRA to transition to state δ(q, L(s′)). The ability of the demonstrator to optimize over ã after observing the new state corresponds to the position of the min within the Bellman update.

We define the violation cost of a policy as the function that results when running this update equation to convergence:

 Viol^{π⊗}_ϕ((s,q)) = lim_{k→∞} Viol^{(k)}((s,q)) (5)
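
A policy-evaluation loop implementing update (4), with Bad states pinned at the maximum discounted cost, might look as follows. The data-structure choices are our own; this is a sketch, not the paper's implementation:

```python
def evaluate_violation(policy, P, delta, label, bad, gamma=0.9, iters=200):
    """Policy evaluation for the violation-cost update (4): at each
    successor the demonstrator picks the cheaper of susp (cost 1, DRA
    stays at q) and keep (cost 0, DRA moves to delta(q, L(s'))).
    `policy[(s, q)]` maps actions to probabilities; `P[(s, a)]` maps
    successor states to probabilities; states in `bad` are pinned at
    the maximum discounted cost 1/(1-gamma)."""
    V = {x: (1.0 / (1.0 - gamma) if x in bad else 0.0) for x in policy}
    for _ in range(iters):
        for (s, q), acts in policy.items():
            if (s, q) in bad:
                continue
            v = 0.0
            for a, pa in acts.items():
                for s2, p in P[(s, a)].items():
                    q_keep = delta[(q, label[s2])]
                    v += pa * p * min(1.0 + gamma * V[(s2, q)],
                                      gamma * V[(s2, q_keep)])
            V[(s, q)] = v
    return V
```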

We now consider state-based (“what actually happened”) and action-based (“what the agent expected to happen”) objective functions, for explaining sets of finite trajectories.

The crux of both the state- and action-based objective functions is Algorithm 1. Given a finite sequence of states s0, …, sT, Algorithm 1 determines the "optimal product-space interpretation" of that sequence. We define a product-space interpretation of a sequence of states in an MDP M as a sequence of DRA states q0, …, qT such that, for all t, either qt+1 = δ(qt, L(st+1)), or qt+1 = qt. That is, a product-space interpretation specifies a possible trajectory in M⊗ that is consistent with the observed trajectory in M.

Algorithm 1 uses dynamic programming to determine, for each time step t, the set of DRA states that the demonstrator could be in at time t (lines 3, 7, and 10), as well as the minimal violation cost that would need to be accrued in order to be in each such state (lines 2, 4, 12, and 16). The sequence that achieves this minimal cost is also computed (lines 5, 13, and 17).
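
The per-time-step bookkeeping just described can be sketched as a dynamic program over dictionaries mapping each reachable DRA state to its minimal discounted skip cost. Backpointers and the final random-policy term are omitted, and all conventions here are our own:

```python
def interpret(states_seq, delta, label, q_init, gamma=0.9):
    """For a finite state sequence s_0..s_T, compute for each reachable
    DRA state the minimal discounted skip cost of a product-space
    interpretation ending there (a sketch of Algorithm 1's DP)."""
    # At t = 0 the DRA has consumed L(s_0) (cost 0), or skipped it (cost 1).
    cost = {delta[(q_init, label[states_seq[0]])]: 0.0}
    cost[q_init] = min(cost.get(q_init, float("inf")), 1.0)
    for t in range(1, len(states_seq)):
        nxt = {}
        for q, c in cost.items():
            q_keep = delta[(q, label[states_seq[t]])]
            nxt[q_keep] = min(nxt.get(q_keep, float("inf")), c)          # keep
            nxt[q] = min(nxt.get(q, float("inf")), c + gamma ** t)       # skip
        cost = nxt
    return cost
```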

The apprentice then assumes that the demonstrator acted randomly from time step T onward. Although this assumption is probably incorrect, it is not entirely unreasonable, since it avoids the assumption that the demonstrator attempted to satisfy the formula after time step T, which would artificially drive the net violation cost down; it also allows the apprentice to reuse values that are already computed in order to evaluate the random policy.

Employing this assumption, the apprentice determines the optimal product-space interpretation as the one ending in the DRA state q∗_T, where

 q∗_T = argmin_{q ∈ Q_T} (C_T[q] + γ^{T+1} Viol^{π⊗rand}_ϕ((s_T, q))) (6)

Here C_T[q] denotes the minimal violation cost computed by Algorithm 1 for ending in DRA state q at time T.

#### IV-A1 State-based objective function

We first consider an approach to estimating the violation cost of a finite trajectory that considers only the states visited in the trajectory, ignoring the demonstrator’s actions.

The state-based violation cost is the minimand of (6), which is the second value returned by Algorithm 1:

 ViolS_ϕ(τ) = C_T[s⊗_T] + γ^{T+1} Viol^{π⊗rand}_ϕ(s⊗_T) (7)

Thus the state-based objective function for ϕ is the sum of the estimated violation costs of all observed finite trajectories, less m times the expected violation cost of the random policy from the initial state:

 ObjS(ϕ) = (∑_{i=1}^m ViolS_ϕ(τ_i)) − m Viol^{π⊗rand}_ϕ(s⊗_{−1}) (8)

The main drawback of the state-based approach is that by ignoring the observed actions, the apprentice neglects a crucial detail: what the demonstrator "expected" or "intended" to satisfy may differ from what actually was satisfied. The fact that an event did not occur does not mean that the demonstrator was not attempting to make it occur with maximal probability, particularly if it is a very rare event. To address this problem, we consider an action-based approach.

#### IV-A2 Action-based objective function

We now consider estimating the violation cost of a finite trajectory by using the observed state-action pairs to compute a partial policy over the product MDP M⊗.

To compute the action-based violation cost of a set of trajectories (Algorithm 2), the apprentice first runs Algorithm 1 to determine the optimal product-space interpretation for each trajectory τ_i (line 4), and uses this to compute the resulting product-space sequence s⊗_{i0}, …, s⊗_{iT}, where s⊗_{it} = (s_{it}, q_{it}).

The assumption that, for each i and t, the demonstrator performed a_{it} when in the inferred product MDP state s⊗_{it} induces an action restriction A∗ (lines 6 and 11), where

 A∗(s⊗) = {⋃_{i,t : s⊗ = s⊗_{it}} {a_{it}} if this union is nonempty; A⊗(s⊗) otherwise}

The apprentice may then compute, using the Bellman update (4), the violation cost of the policy π⊗_{A∗} that uniformly-randomly chooses an action from A∗(s⊗) at each state (line 14):

 π⊗_{A∗}(s⊗, a) = {1/|A∗(s⊗)| if a ∈ A∗(s⊗); 0 otherwise}
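
Constructing the action restriction and the induced uniform policy is straightforward; the following sketch uses our own conventions for representing interpreted trajectories:

```python
def action_restriction(interpreted, available):
    """Build A*(s~) from interpreted product-space trajectories: the
    union of observed actions at each product state, defaulting to all
    available actions at unvisited states. Each trajectory is a list
    of ((s, q), a) pairs."""
    observed = {}
    for traj in interpreted:
        for x, a in traj:
            observed.setdefault(x, set()).add(a)
    return {x: observed.get(x, acts) for x, acts in available.items()}

def uniform_policy(restriction):
    """Uniformly random product policy over the restricted actions."""
    return {x: {a: 1.0 / len(acts) for a in acts}
            for x, acts in restriction.items()}
```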

The action-based objective function is then

 ObjA(ϕ) = Viol^{π⊗A∗}_ϕ(s⊗_{−1}) − Viol^{π⊗rand}_ϕ(s⊗_{−1}) (9)

### IV-B Formula Complexity

Given two formulas that equally distinguish between the observed behavior and random behavior, we wish to select the less complex of the two. Here it suffices to simply minimize the number of nodes in the parse tree for the LTL formula (that is, the total number of symbols in the formula). There are also more sophisticated ways to evaluate formula complexity (such as that used in [4]), but they are not necessary for our purposes.
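
With formulas represented as nested tuples (an illustrative convention of ours, not the paper's), counting parse-tree nodes is a short recursion:

```python
def formula_complexity(phi):
    """Number of nodes in the parse tree of a formula represented as
    nested tuples, e.g. ("G", ("ap", "p")) has complexity 2."""
    if phi[0] in ("ap", "true", "false"):
        return 1  # a proposition or constant is a single leaf node
    return 1 + sum(formula_complexity(child) for child in phi[1:])
```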

### IV-C Multiobjective Optimization Problem

Given some set of finite trajectories T, we thus frame the problem of inferring an LTL formula ϕ that describes T as

 min_{ϕ ∈ LTL} (Obj(ϕ), FC(ϕ))

where Obj is either ObjS, as described in (8), or ObjA, as described in (9), and FC is the formula complexity (in this case, the number of nodes in the formula's parse tree) as specified in section IV-B.

## V Examples

To demonstrate the effectiveness of the proposed objective functions, we employed genetic programming to evolve a set of LTL formulas (represented by their parse trees) in two domains. A summary of the domains used appears in Table I. In all demonstrations, we used MOEAFramework [12] for genetic programming, with standard tree crossover and mutation operations [14]. We consider (separately) the state-based and action-based objectives. NSGA-II over each set of objectives was run for generations with a population size of . This process was repeated times. We employed BURLAP [18] for MDP planning, and Rabinizer 3 [7] for converting LTL formulas to DRAs. In each case, we restricted the search to formulas of the form .

The tables in this section show formulas that are Pareto efficient in at least two NSGA-II runs; that is, no solutions within those runs outperformed them on both objectives. For any Pareto-inefficient formula, there is some formula which both (1) better explains the demonstrated trajectories (as measured by the violation-cost objective function) and (2) is simpler. Thus it is reasonable to restrict consideration to Pareto efficient solutions.
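
Filtering a population down to its Pareto-efficient solutions can be sketched directly from the definition (a naive quadratic scan; NSGA-II maintains this set far more efficiently):

```python
def pareto_front(solutions):
    """Filter (violation_objective, complexity) pairs down to the
    Pareto-efficient set: keep a solution unless some other solution
    is at least as good on both objectives and strictly better on one."""
    front = []
    for i, (v1, c1) in enumerate(solutions):
        dominated = any(
            v2 <= v1 and c2 <= c1 and (v2 < v1 or c2 < c1)
            for j, (v2, c2) in enumerate(solutions) if j != i)
        if not dominated:
            front.append((v1, c1))
    return front
```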

### V-A SlimChance domain

The SlimChance domain consists of two states: a "good" state and a "bad" state. The agent has two actions. If the agent performs the first, the next state is always the bad state; if the agent performs the second, the next state is the good state with small probability, and the bad state otherwise. Thus, performing the second action amounts to "trying" to make the good state occur, but rarely succeeds.
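
A simulation of these dynamics might look as follows; the state and action names ("good"/"bad", "try"/"idle") and the success probability epsilon are placeholders of ours, since the paper's identifiers are not reproduced here:

```python
import random

def slimchance_step(action, epsilon=0.1, rng=random.random):
    """One transition of the (hypothetically named) SlimChance domain:
    "try" reaches the good state with small probability epsilon; the
    other action always leads to the bad state."""
    if action == "try":
        return "good" if rng() < epsilon else "bad"
    return "bad"
```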

The set of atomic propositions for this problem consists of a single proposition, which is true in the good state but false in the bad state. We then suppose that the agent is attempting to satisfy a simple LTL formula over this proposition.

A demonstrator attempting to minimize violation cost generated three trajectories of 10 time steps each. This resulted in the following trajectories (note that any visits to the good state occurred randomly):