## 1 Introduction

In inverse reinforcement learning (IRL) [ng2000algorithms, abbeel2004apprenticeship, ziebart2008maximum], given a reward-free MDP and a set of demonstrations, the goal is to infer a reward function that best explains the demonstrations.

Most existing IRL algorithms learn a Markovian reward function (i.e., a memoryless reward function that is independent of the history of visited states). However, it is challenging to apply IRL to tasks with complicated memory structures, as the whole history of previously visited states may need to be considered to determine the reward at the current state.

Another challenge for IRL is that the effectiveness of the learned reward function depends on the quality of the demonstrations provided beforehand. To address this challenge, active IRL methods [lopes2009active, odom] have been proposed that actively query the demonstrator for samples at specific states or regions in the feature space. However, for tasks with complicated memory structures, queries on specific states or regions may not suffice: the high-level event sequences are more critical for recovering the reward function and should be queried instead.

In this paper, we develop an iterative algorithm that alternates between a task inference module, which infers the high-level memory structure of the task in the form of a deterministic finite automaton (DFA), and a reward learning module, which learns a reward function with the inferred memory structure. In each iteration, the task inference module produces a series of queries to be answered by the demonstrator. Each query is a sequence of high-level events; the demonstrator executes the sequence in the environment and answers the query based on the binary outcome of whether the task is completed. The reward learning module incorporates the DFA states into the MDP states and creates a product automaton. It then learns a Markovian reward function for this product automaton by performing deep maximum entropy IRL. At the end of each iteration, the algorithm computes an optimal policy with respect to the learned reward function. This iterative process continues until the computed policy achieves satisfactory performance in completing the task.

In the reward learning module, we adapt the deep maximum entropy IRL algorithm [wulfmeier2015maximum] to the proposed framework, and use a convolutional neural network (CNN) to represent the reward function. The reward function takes as input the states of the MDP and the DFA. Essentially, by using a CNN, we avoid the need for manual feature engineering. Moreover, CNNs are able to learn the most salient features corresponding to successful task completion. Hence the learned reward function has the potential for strong generalization performance.

To test the validity of the proposed algorithm, we create a task-oriented navigation environment. The experiments show that our algorithm can successfully infer the underlying memory structure of the task and use it to guide IRL. The proposed algorithm outperforms three baselines in both training and test performance: a memoryless IRL method, a memory-based behavioral cloning method, and a memory-augmented IRL method.

Here we list the contributions of the paper: 1) We offer an algorithm for actively inferring the memory structure in the form of a DFA. 2) We propose a method for inducing high-level task information encoded in the DFA in the IRL loop. The DFA tracks the progress in the execution of the task. 3) We use deep reward learning in the proposed IRL framework, which improves generalizability of the proposed algorithm.

## 2 Background

Generally, the problem of inverse reinforcement learning (IRL) can be described as follows: Given a reward-free MDP $\mathcal{M}$ and a set of demonstration trajectories $D$, infer a reward function that can optimally interpret the demonstrations in some pre-specified way. Different works on IRL can be distinguished by the following three aspects: first, the reward parameterization; second, the way a policy is generated from a given reward function; third, the interpretation of the demonstrations using the output policy. In this section, we briefly describe the maximum entropy inverse reinforcement learning (MaxEnt IRL) algorithm [ziebart2008maximum]. We show the limitations of this algorithm for learning behaviors for complicated task structures, which motivate the proposed algorithm.

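We adopt the following settings for the environment model and demonstrations, which are commonly used in IRL. The environment is modeled as a reward-free MDP $\mathcal{M} = (S, A, P, \mu_0, L)$, where $S$ is a state space; $A$ is an action space; $P: S \times A \to \Delta(S)$ (where $\Delta(S)$ is the set of all probability distributions over $S$) is the transition function; $\mu_0$ is an initial distribution over $S$; and $L: S \to E$ is a labeling function with $E$ as a finite set of events. Let $D = \{\tau_i\}_{i=1}^{N}$ be a set of demonstration trajectories, where $\tau_i$ is a finite sequence of states and actions for all $1 \le i \le N$. We define the task to be learned by a mapping $T: E^* \to \{0, 1\}$, where $E^*$ is the Kleene star of $E$, and $T(\omega) = 1$ denotes that a sequence of events $\omega \in E^*$ can complete the task.

The reward function is assumed to have a memory structure encoded by a deterministic finite automaton (DFA). A DFA is a tuple $\mathcal{A} = (Q, \Sigma, \delta, q_0, F)$ where $Q$ is a set of states; $\Sigma$ is a set of input symbols (also called the alphabet); $\delta: Q \times \Sigma \to Q$ is a deterministic transition function; $q_0 \in Q$ is the initial state; and $F \subseteq Q$ is a set of final states (also called *accepting* states). Given a finite sequence of input symbols $\omega = \sigma_1 \sigma_2 \cdots \sigma_k$ in $\Sigma^*$ for some $k \in \mathbb{N}$, the DFA generates a unique sequence of states $q_0, q_1, \ldots, q_k$ in $Q$ such that for each $1 \le i \le k$, $q_i = \delta(q_{i-1}, \sigma_i)$. We denote the last state reached by taking the sequence of inputs $\omega$ from $q_0$ as $\delta(q_0, \omega)$. The sequence $\omega$ is *accepted* by $\mathcal{A}$ if and only if $\delta(q_0, \omega) \in F$. Let $\mathcal{L}(\mathcal{A}) \subseteq \Sigma^*$ be all the finite sequences of input symbols that are accepted by $\mathcal{A}$, which is also referred to as the *accepted language* of $\mathcal{A}$.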
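The DFA definition above can be made concrete with a small sketch. The class name `DFA`, its method names, and the toy automaton are illustrative assumptions, not part of the paper; the toy task simply requires event `"a"` to occur before event `"b"`.

```python
from typing import Dict, Hashable, List, Sequence, Set, Tuple

State = Hashable
Symbol = Hashable


class DFA:
    """Deterministic finite automaton with transition table `delta`."""

    def __init__(
        self,
        delta: Dict[Tuple[State, Symbol], State],
        initial: State,
        accepting: Set[State],
    ) -> None:
        self.delta = delta
        self.initial = initial
        self.accepting = accepting

    def run(self, word: Sequence[Symbol]) -> List[State]:
        """Return the unique state sequence visited while reading `word`."""
        states = [self.initial]
        for symbol in word:
            states.append(self.delta[(states[-1], symbol)])
        return states

    def accepts(self, word: Sequence[Symbol]) -> bool:
        """A word is accepted iff the last visited state is accepting."""
        return self.run(word)[-1] in self.accepting


# Hypothetical toy task: "a" must occur, and "b" must occur after it.
toy_dfa = DFA(
    delta={
        (0, "a"): 1, (0, "b"): 0,
        (1, "a"): 1, (1, "b"): 2,
        (2, "a"): 2, (2, "b"): 2,
    },
    initial=0,
    accepting={2},
)

print(toy_dfa.run(["a", "b"]))      # [0, 1, 2]
print(toy_dfa.accepts(["b", "a"]))  # False
```

The state reached after reading a word plays the role of the task memory: two histories that end in the same DFA state are treated identically from then on.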

### 2.1 Maximum Entropy IRL

In the original MaxEnt IRL algorithm [ziebart2008maximum], the reward function is parameterized as a linear combination of a given set of feature functions. In other words, given a set of features $\{\phi_i\}_{i=1}^{K}$ where $\phi_i: S \times A \to \mathbb{R}$ for $1 \le i \le K$, the reward function is parameterized by $\theta \in \mathbb{R}^K$ and $r_\theta(s, a) = \sum_{i=1}^{K} \theta_i \phi_i(s, a)$. The basic assumption is that the expected total reward over the distribution of trajectories is the same as the empirical average reward over demonstration trajectories. With the principle of maximum entropy, it can be derived that the probability of generating any (finite-length) trajectory $\tau$ is proportional to the exponential of the total reward of $\tau$:

$$P(\tau \mid \theta) \propto \exp\Big(\sum_{(s_t, a_t) \in \tau} r_\theta(s_t, a_t)\Big). \tag{1}$$
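As a small numerical illustration of the maximum entropy principle stated above, the sketch below normalizes exponentiated total rewards over a finite candidate set of trajectories. The function name and the restriction to a finite, explicitly enumerated set of trajectories are assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def maxent_trajectory_probs(total_rewards: Sequence[float]) -> List[float]:
    """Probabilities proportional to exp(total reward) over a finite set."""
    shift = max(total_rewards)  # subtract the max for numerical stability
    weights = [math.exp(r - shift) for r in total_rewards]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]


# Three hypothetical trajectories with total rewards 1, 1, and 0.
probs = maxent_trajectory_probs([1.0, 1.0, 0.0])
print(probs)
```

Trajectories with equal total reward receive equal probability, and a reward gap of 1 corresponds to a probability ratio of $e$.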

While linear parameterization is commonly used in the IRL literature [abbeel2004apprenticeship, ratliff2006maximum, neu2007apprenticeship, klein2012inverse], it suffers from several drawbacks. On the one hand, it requires human knowledge to provide properly designed reward features, which can be labor-intensive; on the other hand, if the given features fail to encode all the essential requirements to generate the demonstrations, there is no way to recover from this flaw by learning from demonstrations. One way to deal with this problem is to use nonlinear reward models such as Gaussian processes [levine2011nonlinear, levine2010feature] or neural networks to automatically construct reward features from expert demonstrations.

## 3 Active Task-Inference-Guided Deep Inverse Reinforcement Learning

In this section, we introduce the active task-inference-guided deep IRL (ATIG-DIRL), which iteratively infers the memory structure of tasks and incorporates the memory structure into reward learning.

### 3.1 Overview

ATIG-DIRL, as illustrated in Algorithm 1, consists of a task inference module and a reward learning module. The task inference module utilizes L* learning [angluin1987learning], an active automaton inference algorithm, as the template to iteratively infer a DFA from queries and counterexamples. The inferred DFA encodes the high-level temporal structure to help IRL recover reward functions. Following L* learning, the task inference module generates two kinds of queries with Boolean answers, namely the *membership* query and the *conjecture* query. A membership query asks whether a sequence of high-level events can lead to task completion. We defer the details of answering membership queries to Sec. 3.2. After a number of membership queries, the inference engine outputs a hypothesis DFA and a set of corresponding demonstrations for answering the membership queries (line 5 in Alg. 1). The task inference module then asks a conjecture query about whether the hypothesis DFA can help recover a satisfactory reward function (line 6 in Alg. 2). The conjecture query is answered by the reward learning module; the details are in Sec. 3.3. If the answer to the conjecture query is True, both the automaton inference process and the IRL are finished. Otherwise, the answer is False and there exists a counterexample, in the form of a sequence of high-level events, that illustrates the difference between the conjectured DFA and the memory structure of the task. Such a counterexample triggers a new iteration with the next round of membership queries.

### 3.2 Task Inference Module

To generate a hypothesis DFA $\mathcal{A} = (Q, \Sigma, \delta, q_0, F)$ whose alphabet is the event set, i.e., $\Sigma = E$, ATIG-DIRL produces a number of membership queries. Each membership query asks whether following an event sequence $\omega \in E^*$ leads to task completion, i.e., whether the task mapping satisfies $T(\omega) = 1$. The task $T$ is unknown, but given a sequence $\omega$, one can observe $T(\omega)$ from the environment. To answer a membership query, we rely on a demonstrator to make a demonstration in the MDP environment and generate a state sequence whose induced event sequence is $\omega$. If the event sequence completes the task, i.e., $T(\omega) = 1$, the answer to this membership query is True; otherwise the answer is False. After answering the membership queries, ATIG-DIRL generates a hypothesis DFA following the procedures in [angluin1987learning].

After obtaining a hypothesis DFA, ATIG-DIRL asks a conjecture query on whether this DFA is sufficient to recover the reward function. To answer this query, we follow the procedure introduced in Sec. 3.3, which is summarized in Alg. 2.

Demonstration trajectories. When the task inference module asks a query $\omega$ and the task outcome is $T(\omega) = 1$, the query encodes an event sequence that leads to task completion. The demonstrator then produces several demonstrations following the same event sequence and adds them to the set of demonstrations.
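The handling of a membership query described above can be sketched as follows. This is a minimal illustration under stated assumptions: the callables `demonstrate`, `label`, and `task` are hypothetical stand-ins for the human demonstrator, the labeling function, and the environment's task outcome, and the toy environment has only two cells.

```python
from typing import Callable, Hashable, List, Optional, Sequence, Tuple

Event = Hashable
MdpState = Hashable


def answer_membership_query(
    query: Sequence[Event],
    demonstrate: Callable[[Sequence[Event]], Sequence[MdpState]],
    label: Callable[[MdpState], Event],
    task: Callable[[List[Event]], bool],
) -> Tuple[bool, Optional[List[MdpState]]]:
    """Answer one query; return the demonstration only if the task succeeds."""
    states = list(demonstrate(query))        # demonstrator acts in the MDP
    events = [label(s) for s in states]      # state sequence -> event sequence
    completed = bool(task(events))           # observed from the environment
    return completed, (states if completed else None)


# Hypothetical two-cell world: cell 0 emits "a", cell 1 emits "b".
cell_of_event = {"a": 0, "b": 1}
event_of_cell = {0: "a", 1: "b"}
demonstrate = lambda query: [cell_of_event[e] for e in query]
label = event_of_cell.__getitem__
task = lambda events: events == ["a", "b"]

print(answer_membership_query(["a", "b"], demonstrate, label, task))
print(answer_membership_query(["b", "a"], demonstrate, label, task))
```

Only successful queries contribute trajectories, so the demonstration set grows with examples of task completion while failed queries still inform the automaton inference.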

### 3.3 Reward Learning Module

This module is concerned with learning a reward function using the hypothesis DFA. Although previous deep MaxEnt IRL methods [wulfmeier2015maximum, finn2016guided, wulfmeier2016watch, wulfmeier2017large] can construct reward features automatically from demonstrations, they suffer from the fundamental limitation that the learned reward function is Markovian. As a result, the learned policy has to be independent of the history of states, which does not suffice for tasks with complicated memory structures. To address this issue, we propose a new maximum entropy deep IRL algorithm (inspired by the work in [wulfmeier2015maximum]), as described in Alg. 2.

The key idea is to use the hypothesis DFA and the MDP to create a product automaton and then learn a reward function over the state space of this product automaton. We define the product automaton as follows:

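Product automaton. Let $\mathcal{M} = (S, A, P, \mu_0, L)$ be a reward-free MDP and $\mathcal{A} = (Q, \Sigma, \delta, q_0, F)$ be a DFA with $\Sigma = E$. The product automaton $\mathcal{M}^e = \mathcal{M} \times \mathcal{A}$ is a tuple $(S^e, A, P^e, S^e_0, L^e, F^e)$ such that $S^e = S \times Q$ is a finite set of states; $S^e_0 \subseteq S^e$ is the initial set of states, where for each $(s, q) \in S^e_0$, $\mu_0(s) > 0$ and $q = \delta(q_0, L(s))$; $P^e$ is the transition function of the product automaton defined as

$$P^e\big((s', q') \mid (s, q), a\big) = \begin{cases} P(s' \mid s, a) & \text{if } q' = \delta(q, L(s')), \\ 0 & \text{otherwise;} \end{cases}$$

$L^e((s, q)) = L(s)$ is a labeling function; and $F^e = S \times F$ is a finite set of accepting states.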
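A minimal sketch of a product transition is given below. It assumes the common construction in which the DFA component advances on the label of the successor MDP state; the function name, the dictionary-based encodings, and the toy numbers are illustrative assumptions.

```python
from typing import Dict, Hashable, Tuple

MdpState = Hashable
DfaState = Hashable
Action = Hashable
Event = Hashable


def product_transition(
    P: Dict[Tuple[MdpState, Action], Dict[MdpState, float]],
    delta: Dict[Tuple[DfaState, Event], DfaState],
    label: Dict[MdpState, Event],
    state: Tuple[MdpState, DfaState],
    action: Action,
) -> Dict[Tuple[MdpState, DfaState], float]:
    """Distribution over product states reached from `state` under `action`."""
    s, q = state
    out: Dict[Tuple[MdpState, DfaState], float] = {}
    for s_next, prob in P[(s, action)].items():
        # Assumption: the DFA reads the event emitted by the successor state.
        q_next = delta[(q, label[s_next])]
        out[(s_next, q_next)] = out.get((s_next, q_next), 0.0) + prob
    return out


# Hypothetical two-state MDP with a single action and a two-state DFA.
P = {(0, "go"): {0: 0.2, 1: 0.8}, (1, "go"): {1: 1.0}}
label = {0: "none", 1: "goal"}
delta = {(0, "none"): 0, (0, "goal"): 1, (1, "none"): 1, (1, "goal"): 1}

print(product_transition(P, delta, label, (0, 0), "go"))
```

Because the DFA is deterministic, each MDP successor maps to exactly one product successor, so the product inherits its stochasticity entirely from the MDP.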

We apply the proposed IRL algorithm on the product automaton. As a result, the reward depends on both the current state in the MDP and the current state in the DFA, which together form the current state of the product automaton. DFA states can be considered as different stages in task implementation. Since the DFA state is an input to the reward function, the learned reward and the corresponding induced policy are functions of the stage of the task.

Unlike MaxEnt IRL, where the reward function is modeled as a linear combination of pre-specified features, we model the reward function as a neural network. The set of reward parameters is represented by $\theta$, which is the weight vector of the reward network $r_\theta$. The objective is to maximize the posterior probability of observing the demonstration trajectories $D$ and reward parameters $\theta$ given a reward structure:

$$\mathcal{L}(\theta) = \log P(D, \theta \mid r_\theta) = \underbrace{\log P(D \mid r_\theta)}_{\mathcal{L}_D} + \underbrace{\log P(\theta)}_{\mathcal{L}_\theta}. \tag{2}$$

$\mathcal{L}_D$ is the log likelihood of the demonstration trajectories in $D$ given the reward function $r_\theta$. $\mathcal{L}_\theta$ can be interpreted as either the logarithm of the prior distribution at $\theta$ or as a differentiable regularization term on $\theta$. In this work we assume a uniform prior distribution over $\theta$, so what remains is the maximization of $\mathcal{L}_D$.

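Since we apply IRL on the product automaton, the demonstration trajectories are projected onto the state space of the product automaton, i.e., $D^e = \{\tau^e_i\}_{i=1}^{N}$, where each $\tau^e_i$ is a sequence of product states $s^e_t = (s_t, q_t)$ and actions $a_t$. Let $\pi_\theta$ be the policy corresponding to $r_\theta$; then $\mathcal{L}_D$ can be expressed as

$$\mathcal{L}_D = \sum_{\tau^e \in D^e} \sum_{(s^e_t, a_t) \in \tau^e} \log \pi_\theta(a_t \mid s^e_t) + C, \tag{3}$$

where $C$ is a constant that depends on $D^e$ and the transition dynamics of the product automaton.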

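The computation of $\pi_\theta$ given $r_\theta$ is essentially a maximum entropy reinforcement learning problem [zhou2018infinite]. It can be proved [zhou2018infinite] that for any $\theta$, there exists a unique function $Q_\theta: S^e \times A \to \mathbb{R}$ which is the fixed point of

$$Q_\theta(s^e, a) = r_\theta(s^e, a) + \gamma \sum_{\bar{s}^e \in S^e} P^e(\bar{s}^e \mid s^e, a) \log \sum_{a' \in A} \exp Q_\theta(\bar{s}^e, a'), \tag{4}$$

where $\gamma \in (0, 1)$ is the discount factor, which is a hyper-parameter. Note that $Q_\theta$ is an implicit function of $\theta$, as $r_\theta$ is parametrized by $\theta$ and $Q_\theta$ is derived from $r_\theta$ by Eq. 4. The policy $\pi_\theta$ can be derived from $Q_\theta$ as shown in the equation below

$$\pi_\theta(a \mid s^e) = \frac{\exp Q_\theta(s^e, a)}{\sum_{a' \in A} \exp Q_\theta(s^e, a')}. \tag{5}$$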
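The fixed-point computation and the resulting policy can be sketched for a small tabular problem. The sketch assumes the standard soft (maximum entropy) Bellman recursion, $Q(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \log \sum_{a'} \exp Q(s',a')$, with a softmax policy over $Q$; the function names, the tabular encoding, and the two-state example are hypothetical.

```python
import math
from typing import Dict, List


def logsumexp(xs: List[float]) -> float:
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def soft_value_iteration(
    r: List[List[float]],
    P: List[List[Dict[int, float]]],
    gamma: float,
    tol: float = 1e-10,
    max_iters: int = 10_000,
) -> List[List[float]]:
    """Iterate the soft Bellman backup until the update is below `tol`."""
    n_s, n_a = len(r), len(r[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(max_iters):
        V = [logsumexp(row) for row in Q]  # soft state value
        Q_new = [
            [r[s][a] + gamma * sum(p * V[t] for t, p in P[s][a].items())
             for a in range(n_a)]
            for s in range(n_s)
        ]
        diff = max(abs(Q_new[s][a] - Q[s][a])
                   for s in range(n_s) for a in range(n_a))
        Q = Q_new
        if diff < tol:
            break
    return Q


def soft_policy(Q: List[List[float]]) -> List[List[float]]:
    """Softmax policy: pi(a|s) = exp(Q(s,a) - logsumexp_a' Q(s,a'))."""
    return [[math.exp(q - logsumexp(row)) for q in row] for row in Q]


# Hypothetical MDP: action 0 stays, action 1 switches state;
# only taking action 1 in state 0 is rewarded.
P = [[{0: 1.0}, {1: 1.0}], [{1: 1.0}, {0: 1.0}]]
r = [[0.0, 1.0], [0.0, 0.0]]
Q = soft_value_iteration(r, P, gamma=0.9)
pi = soft_policy(Q)
print(pi)
```

Since the backup is a contraction for a discount factor below one, the iteration converges from any initialization; the resulting policy remains stochastic, placing more mass on the rewarded action without excluding the other.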

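To find the optimal $\theta$, we perform gradient ascent using the gradient of $\mathcal{L}_D$ with respect to $\theta$, expressed as

$$\nabla_\theta \mathcal{L}_D = \sum_{\tau^e \in D^e} \sum_{(s^e_t, a_t) \in \tau^e} \big( \nabla_\theta Q_\theta(s^e_t, a_t) - \nabla_\theta V_\theta(s^e_t) \big), \tag{6}$$

where $V_\theta(s^e) = \log \sum_{a \in A} \exp Q_\theta(s^e, a)$ for any $s^e \in S^e$. To compute the right-hand side of Eq. 6, we compute the gradients of $Q_\theta$ and $V_\theta$ as

$$\nabla_\theta Q_\theta(s^e, a) = \nabla_\theta r_\theta(s^e, a) + \gamma \sum_{\bar{s}^e \in S^e} P^e(\bar{s}^e \mid s^e, a) \, \nabla_\theta V_\theta(\bar{s}^e), \tag{7}$$

$$\nabla_\theta V_\theta(s^e) = \sum_{a \in A} \pi_\theta(a \mid s^e) \, \nabla_\theta Q_\theta(s^e, a). \tag{8}$$

Since $\gamma < 1$, it can be shown that for any $\theta$, there exists a unique solution to Eq. 7. Therefore, there is also a unique solution to Eq. 8.

Once we have performed a gradient ascent step using Eq. 6 to update the weight vector $\theta$ of the neural network, we have automatically updated the reward network $r_\theta$.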
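The gradient recursion can be checked numerically on a tiny problem. The sketch below assumes the standard soft Bellman form, $Q = r + \gamma P V$ with $V(s) = \log \sum_a \exp Q(s,a)$, so that $\nabla Q = \nabla r + \gamma P \nabla V$ and $\nabla V(s) = \sum_a \pi(a \mid s) \nabla Q(s,a)$. To stay self-contained it replaces the neural network with a tabular reward (one parameter per state-action pair), which is a simplifying assumption; all function names are hypothetical.

```python
import math


def lse(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def soft_q(r, P, gamma, iters=1000):
    """Soft Q-function by repeated soft Bellman backups."""
    n_s, n_a = len(r), len(r[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        V = [lse(row) for row in Q]
        Q = [[r[s][a] + gamma * sum(p * V[t] for t, p in P[s][a].items())
              for a in range(n_a)] for s in range(n_s)]
    return Q


def log_pi(r, P, gamma, s, a):
    Q = soft_q(r, P, gamma)
    return Q[s][a] - lse(Q[s])


def grad_log_pi(r, P, gamma, s, a, iters=1000):
    """Gradient of log pi(a|s) w.r.t. each table entry r[i][j] (flattened)."""
    n_s, n_a = len(r), len(r[0])
    K = n_s * n_a  # parameter k corresponds to entry (k // n_a, k % n_a)
    Q = soft_q(r, P, gamma, iters)
    pi = [[math.exp(q - lse(row)) for q in row] for row in Q]
    dQ = [[[0.0] * K for _ in range(n_a)] for _ in range(n_s)]
    for _ in range(iters):
        # Gradient of the soft value: policy-weighted average of dQ.
        dV = [[sum(pi[u][b] * dQ[u][b][k] for b in range(n_a))
               for k in range(K)] for u in range(n_s)]
        # Gradient of Q: reward gradient plus discounted expected dV.
        dQ = [[[(1.0 if k == u * n_a + b else 0.0)
                + gamma * sum(p * dV[t][k] for t, p in P[u][b].items())
                for k in range(K)] for b in range(n_a)] for u in range(n_s)]
    dV_s = [sum(pi[s][b] * dQ[s][b][k] for b in range(n_a)) for k in range(K)]
    return [dQ[s][a][k] - dV_s[k] for k in range(K)]


def finite_difference(r, P, gamma, s, a, eps=1e-5):
    """Central-difference estimate of the same gradient, for comparison."""
    n_a = len(r[0])
    grads = []
    for k in range(len(r) * n_a):
        i, j = divmod(k, n_a)
        hi = [row[:] for row in r]
        lo = [row[:] for row in r]
        hi[i][j] += eps
        lo[i][j] -= eps
        grads.append((log_pi(hi, P, gamma, s, a)
                      - log_pi(lo, P, gamma, s, a)) / (2 * eps))
    return grads


# Hypothetical MDP: action 0 stays, action 1 switches state.
P = [[{0: 1.0}, {1: 1.0}], [{1: 1.0}, {0: 1.0}]]
r = [[0.0, 1.0], [0.0, 0.0]]
analytic = grad_log_pi(r, P, 0.9, s=0, a=1)
numeric = finite_difference(r, P, 0.9, s=0, a=1)
print(analytic)
print(numeric)
```

With a neural reward, the indicator term in the update of `dQ` would be replaced by the backpropagated gradient of the network output, while the rest of the recursion stays the same.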

Monte Carlo evaluation. Evaluating the task performance of a reward network amounts to computing the success ratio of the optimal policy for that reward network. We use the Monte Carlo approach to compute this success ratio empirically. Concretely, we use the policy to produce a set of state sequences, convert each state sequence to an event sequence $\omega$ via the labeling function, and then observe the task outcome $T(\omega)$. The ratio of sequences with $T(\omega) = 1$ to the total number of sequences yields the success ratio (line 11 in Alg. 2).
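The Monte Carlo estimate described above can be sketched as follows. The rollout sampler, labeling function, and task check are hypothetical stand-ins passed in as callables; the toy policy succeeds with probability 0.8 by construction.

```python
import random
from typing import Callable, Hashable, List


def success_ratio(
    sample_states: Callable[[random.Random], List[Hashable]],
    label: Callable[[Hashable], Hashable],
    task: Callable[[List[Hashable]], bool],
    n_rollouts: int,
    seed: int = 0,
) -> float:
    """Fraction of sampled rollouts whose event sequence completes the task."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_rollouts):
        states = sample_states(rng)           # roll out the policy
        events = [label(s) for s in states]   # states -> events
        successes += int(bool(task(events)))  # observe task outcome
    return successes / n_rollouts


# Hypothetical setup: the task requires event "a" followed by event "b".
label = {0: "a", 1: "b"}.__getitem__
task = lambda events: events == ["a", "b"]
noisy_policy = lambda rng: [0, 1] if rng.random() < 0.8 else [1, 0]

ratio = success_ratio(noisy_policy, label, task, n_rollouts=1000)
print(ratio)
```

The estimate fluctuates with the number of rollouts, so the threshold comparison that follows should use enough samples for the ratio to be stable.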

Stopping criterion. After every fixed number of gradient ascent iterations (this number is a hyper-parameter), the module evaluates the success rate of the computed optimal policy for the current reward network, and stops the iterations once the success rate stops changing significantly according to a pre-specified threshold (line 12 in Alg. 2).

Counterexamples. After we finish the iterations of gradient ascent, if the success ratio of the optimal policy for the obtained reward network is smaller than a pre-specified threshold, then the module produces a counterexample to be used by the task inference module in the next iteration of Alg. 1. To produce the counterexample, we apply Monte Carlo simulations and find an event sequence $\omega$ on which the task outcome disagrees with the hypothesis DFA, i.e., $T(\omega) \neq I_{\mathcal{A}}(\omega)$, where $I_{\mathcal{A}}(\omega)$ is $1$ when $\omega$ is accepted by the DFA and $0$ otherwise (line 17 in Alg. 1).
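The counterexample search can be sketched as a scan over sampled event sequences for one on which the observed task outcome and the hypothesis DFA disagree. The function name and the toy task and hypothesis are illustrative assumptions.

```python
from typing import Callable, Hashable, Iterable, List, Optional

Event = Hashable


def find_counterexample(
    event_sequences: Iterable[List[Event]],
    task: Callable[[List[Event]], bool],
    dfa_accepts: Callable[[List[Event]], bool],
) -> Optional[List[Event]]:
    """Return the first sequence where task outcome and DFA verdict differ."""
    for events in event_sequences:
        if bool(task(events)) != bool(dfa_accepts(events)):
            return events
    return None


# Hypothetical task: exactly "a" then "b".
task = lambda events: events == ["a", "b"]
# Hypothetical (too permissive) hypothesis: accept anything ending in "b".
hypothesis = lambda events: len(events) > 0 and events[-1] == "b"

samples = [["a", "b"], ["a"], ["b"]]
print(find_counterexample(samples, task, hypothesis))  # ['b']
```

A return value of `None` means the sampled sequences did not expose a disagreement, which does not by itself prove that the hypothesis DFA is correct.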

## 4 Related Work

Recently, there have been interesting works on IRL with task information. The first attempt to incorporate task evaluation into IRL was to augment the demonstrations with evaluations of their task performance. [el2016score] and [burchfiel2016distance] augmented each demonstration trajectory with a score rated by experts. The Boolean labels and the continuous scores can be used to train a classification or a regression model to evaluate policies. Their experimental results showed that such data augmentation helps reduce the number of demonstrations. But since the task is not explicitly defined, the learned policy evaluation model may be neither reliable nor interpretable. [pan2018human] assumed that the experts provide a set of subgoal states for each demonstration. However, the learning agent does not understand how or why the demonstrator picks this set of critical subgoal states, especially if the number of subgoals is inconsistent across demonstrations. Though the robot may recognize some similar states using the learned reward features in a new environment, it cannot tell whether all of the previous subgoals are still necessary or whether they should be executed in the same order. With our method, the agent can search the extended state space for a sequence of subgoals that implements the task, which is not necessarily the same as the one shown in the training environments.

Several works address policy learning with assumptions about the task structure. [niekum2012learning] and [michini2015bayesian] use Bayesian inference to segment unstructured demonstration trajectories. [shiarlis2018taco] assumed that the expert performs a given sequence of (symbolic) subtasks in each demonstration. They solve the problem of temporal alignment for the demonstrations and policy learning for each subtask simultaneously.

Another related work is hierarchical IRL (HIRL) [krishnan2016hirl], which assumes that each demonstration trajectory corresponds to a set of subtasks. The subtasks are separated by critical “transition states”, i.e., states where transitions happen consistently across all the demonstrations. Transitions are defined based on changes in local linearity with respect to a kernel function. Once the transition states are recognized, for each trajectory, the state space is augmented with features that capture the visitation history of the transition states. HIRL suffers from three limitations compared to our proposed method. First, the assumption that subtasks correspond to states where local linearity changes with respect to a kernel function is limited to certain classes of problems, and hence HIRL may not be able to confidently detect the transition states. Second, it assumes that the task can always be decomposed into a sequence of subtasks. If there are two possible sequences of subgoals that can be followed to reach the final winning state, HIRL is unable to learn a decomposition for such a task. Third, their method does not benefit from the use of deep neural networks and automatic feature learning.

## 5 Experiments

We create a task-oriented navigation environment, which is a simulated environment that can model various navigation tasks. Six types of objects are present in the environment: building, grass, tree, rock, barrel, and tile. A region is defined as any square neighborhood in the environment. Each region belongs to one of the types defined in Table 1. Region types are used to define navigation tasks, as we will see later. Visiting a region means visiting the center block of that region. Fig. 1 depicts one instantiation of the environment used as the training environment for our experiments.

Agent. The agent is an aerial vehicle that can navigate in the environment. At each time step, it is located on top of an environment block, and it can choose one of the actions {"up", "down", "left", "right"} and move one block in that direction.

The navigation task. For the rest of this section, we consider the following navigation task as the test bed for the experiments: {visit , , , , in this order. Any other order leads to failure}. Fig. 2 depicts the DFA encoding this task. The DFA states are defined in Table 2. The goal of ATIG-DIRL is to infer an equivalent DFA and use it for reward learning. We have used the libalf library [bollig2010libalf] to infer the DFA.

Table 1: Region types.

| Region type | Definition |
|---|---|
| Type-0 region | a region with more than buildings |
| Type-1 region | a region with more than trees |
| Type-2 region | a region with more than barrels |
| Type-3 region | a region with more than stones |
| Irrelevant region | none of the above |

Table 2: DFA states.

| Visit history | DFA state |
|---|---|
| None of have been visited | |
| Only has been visited | |
| have been visited in this order | |
| have been visited in this order | |
| have been visited in this order | |
| have been visited in this order | |
| Failure state: any scenario other than the above moves the DFA into an absorbing state | |

Reward network. The input to the convolutional layers of the reward network is the neighborhood of the agent in the environment. Each block in the neighborhood is represented by a one-hot encoded vector specifying the object that occupies that block. There are 6 objects, so the input to the convolutional layers is a tensor. The network has two convolutional layers: the first layer has kernels and the second layer has kernels; all kernels are of size with a stride of . After the convolutions, there is a flattening layer. This is where the DFA state and the action are provided as input, by appending them to the output of the flattening layer. The DFA state and the action are represented by one-hot encoded vectors of length 10 and 4, respectively. Note that the DFA has 7 states, but we use zero padding to make the one-hot encoded vector of size 10. Then there are two fully connected layers of size and , and the output layer has 1 neuron. Each hidden layer is followed by a ReLU layer.

Baselines. We implemented three baseline methods to compare with ATIG-DIRL: memoryless IRL, IRL augmented with information bits (IRL-IB), and memory-based behavioral cloning (BC). The memoryless IRL is a basic deep MaxEnt IRL method [wulfmeier2015maximum] that learns a reward function as a function of the states of the MDP. The IRL-IB method is a deep MaxEnt IRL method that uses a fixed-size memory to augment the state space of the MDP. In our experiment, since there are 4 region types, the memory is a vector of size 4, where each element corresponds to one of the region types and indicates whether that region type has been visited. The memory-based BC method operates on the same extended state space as ATIG-DIRL, and uses behavioral cloning [pomerleau1991efficient] to learn a policy that mimics the demonstrations. The memory-based BC method is in essence a simplified version of the method introduced in [shiarlis2018taco], in which the task label is provided for each individual state in the demonstrations, and the agent does not need to learn the alignment between the demonstrations and task sketches.

### 5.1 ATIG-DIRL vs. Baselines

The primary criterion we use in evaluating the performance of a trained model is the task success ratio. All methods are trained using the same training environment (Fig. 1).

Training performance. ATIG-DIRL was able to infer a DFA equivalent to the underlying DFA (Fig. 2) in three iterations of the main algorithm (Alg. 1), and the reported results correspond to the last call to Alg. 2, where the correct DFA is used in the IRL loop. The ATIG-DIRL algorithm and the memory-based BC method perform well at training time (Fig. 2(a) and Fig. 2(b)), with the ATIG-DIRL agent still outperforming the BC agent. The memoryless IRL method and the IRL-IB method, however, perform poorly even on the training environment (Fig. 2(a)). As expected, the memoryless IRL method fails at this experiment because it can only learn a memoryless reward function. The IRL-IB method, in contrast, fails because its memory is unable to capture the memory structure required for the navigation task. Concretely, its memory is unable to keep track of repeated occurrences of the same high-level event.
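The limitation of a visited-set memory can be seen in a few lines. In this sketch the region names `"A"` and `"B"`, the required order, and both function names are hypothetical, and the bit encoding (1 for visited, 0 otherwise) is an assumed convention.

```python
from typing import Hashable, Sequence, Tuple


def information_bits(
    events: Sequence[Hashable], region_types: Sequence[Hashable]
) -> Tuple[int, ...]:
    """One bit per region type: 1 if that type was ever visited, else 0."""
    visited = set(events)
    return tuple(1 if t in visited else 0 for t in region_types)


def ordered_progress(
    events: Sequence[Hashable], required_order: Sequence[Hashable]
) -> int:
    """DFA-like memory: number of required events matched so far,
    or -1 as soon as an event deviates from the required order."""
    progress = 0
    for event in events:
        if progress < len(required_order) and event == required_order[progress]:
            progress += 1
        else:
            return -1
    return progress


region_types = ("A", "B")
required_order = ("A", "B", "A")  # hypothetical task with a repeated event

# The bit vector cannot tell these histories apart...
print(information_bits(["A", "B"], region_types))       # (1, 1)
print(information_bits(["A", "B", "A"], region_types))  # (1, 1)
# ...while an automaton-style memory can.
print(ordered_progress(["A", "B"], required_order))       # 2
print(ordered_progress(["A", "B", "A"], required_order))  # 3
```

The bit vector also ignores ordering, so `["B", "A"]` produces the same memory as `["A", "B"]`, whereas the automaton-style memory marks it as a failure.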

Test performance. To compare the generalization capability of the four methods, we tested them on 10 different randomly generated environments. The ATIG-DIRL method outperforms all the baselines on the test cases (Fig. 2(c)). The memoryless IRL and IRL-IB methods fail to generalize because they are not equipped with an appropriate memory structure. The memory-based BC method, however, is equipped with the same memory structure as ATIG-DIRL. The difference in generalizability between these two methods arises because ATIG-DIRL learns a reward function, which generalizes significantly better to test environments, whereas the behavioral cloning agent learns to shallowly imitate the demonstrations by directly learning a policy.

## 6 Conclusion and Future Work

We have proposed a new IRL algorithm, active task-inference-guided deep IRL (ATIG-DIRL), that learns the memory structure of the task in the form of a deterministic finite automaton (DFA), and uses the DFA to extend the state space of the original MDP so that we can infer a Markovian reward function using deep MaxEnt IRL. The proposed algorithm learns a reward function over the extended state space obtained by composing the state spaces of the MDP and the inferred DFA. By modeling the reward function as a convolutional neural network, the algorithm can automatically extract local features that are essential for task implementation. We show with experiments that the learned reward function can be used to generate policies that achieve near-perfect task performance in new environments without expert demonstrations, while methods such as memory-based behavioral cloning, IRL with information bits, and memoryless IRL have poor generalization performance. For future work, we would like to apply our algorithm to more complex environments with continuous state spaces.