# Inverse Reinforcement Learning with Missing Data

We consider the problem of recovering an expert's reward function with inverse reinforcement learning (IRL) when there are missing/incomplete state-action pairs or observations in the demonstrated trajectories. This issue of missing trajectory data or information occurs in many situations, e.g., GPS signals from vehicles moving on a road network are intermittent. In this paper, we propose a tractable approach to directly compute the log-likelihood of demonstrated trajectories with incomplete/missing data. Our algorithm is efficient in handling a large number of missing segments in the demonstrated trajectories, as it performs the training with incomplete data by solving a sequence of systems of linear equations, and the number of such systems to be solved does not depend on the number of missing segments. Empirical evaluation on a real-world dataset shows that our training algorithm outperforms other conventional techniques.


## 1 Introduction

The inverse reinforcement learning (IRL) problem involves an agent using demonstrated trajectories to learn an expert's reward function Russell (1998). IRL has received a great deal of research attention Ng and Russell (2000); Abbeel and Ng (2004); Ziebart et al. (2008); Levine et al. (2011) due to its real-world applications. The rationale behind this approach is that although a reward function is a more succinct and generalizable representation of an expert's behavior, it is often difficult for the expert to elucidate his/her reward function, as opposed to giving demonstrations. However, a common issue in obtaining the demonstrated trajectories is that certain state-action pairs are missing (namely, missing segments) due to, for instance, technical issues or privacy concerns. Some examples of this issue are the loss of GPS signal from vehicles moving on a road network, or the limited field of view of a robot seeking to learn a patroller's behavior in order to penetrate the patrol without being spotted (Bogert and Doshi, 2014). Missing data is also a common issue in healthcare applications (Penny and Atkinson, 2012). Hence, learning a reward function from incomplete demonstrations in a scalable manner is an indispensable milestone to make IRL more practical. This is also our aim in this paper.

A straightforward approach to train IRL models with missing data is to use only the connected segments in the incomplete demonstrations Bogert and Doshi (2014). This clearly ignores the information provided by the missing segments, which could be incorporated into IRL algorithms to improve performance. A prominent line of work that uses the missing segment information Bogert et al. (2016); Shahryari and Doshi (2017); Bogert and Doshi (2017) is based on the expectation maximization (EM) algorithm and the latent maximum entropy principle Wang et al. (2002) to generalize the maximum entropy IRL Ziebart et al. (2008). However, apart from inheriting the approximation of the latent maximum entropy principle, a more severe limitation is the expensive sampling of the missing state-action pairs in the EM algorithm, which is not effective when the length of the missing segment increases, as shown in our experiments. More importantly, our experiments show that if the sampling of paths for missing segments is insufficient, the EM algorithm can even be outperformed by the above naive approach, which does not justify the complication of using missing segments. These shortcomings of existing works and the importance of missing data leave an important open question of how to exploit the information of missing segments accurately and efficiently in IRL, which we address in this paper.

Contribution. In this work, we propose a novel and efficient algorithm to train maximum entropy IRL models with full or missing data. Our contribution is twofold. First, we show that the training of maximum entropy IRL models with complete trajectories can be done efficiently through solving several systems of linear equations, and the number of such systems to be solved does not depend on the number of demonstrated trajectories. Thus, our approach is scalable and efficient in training IRL models with large numbers of states/actions and trajectories. Moreover, we establish conditions under which these systems of linear equations have unique solutions. We also show that, under the same conditions, the conventional value iteration approach that has been popularly used to train maximum entropy IRL models also converges to unique fixed point solutions from any starting points.

Second, we propose a novel way to directly compute the log-likelihood of demonstrated trajectories with missing data. Based on this, we design a tractable training algorithm to deal with the issue of missing data/information. Our approach relies on the observation that the probability of reaching any state from another state can be computed without sampling paths between the two states. Instead, we show that such probabilities can be obtained by solving a system of linear equations. We further propose a way to compose such systems, so that we only need to solve one linear system to obtain all the probabilities of the missing segments in trajectories sharing the same set of zero-reward absorbing states. Moreover, we show that these systems of linear equations always have unique solutions. The main advantage of our algorithm is that the number of such systems to be solved is independent of the number of missing segments in the demonstrated trajectories, which makes the algorithm highly scalable in large-scale settings.

We empirically evaluate the performance of our algorithm using a dataset of real-world taxi trajectories in a large road network (thousands of links). We show that, with full demonstrated trajectories, our approach is able to speed up the training process by up to 60 times, as compared to the widely-used value iteration method (Ziebart et al., 2008). When the demonstrated trajectories contain missing segments, we show that our algorithm outperforms two conventional approaches, i.e., the EM method and the (naive) one that ignores all the missing segments in the trajectories.

The remainder of this paper is organized as follows. In Section 2, we give a brief introduction to maximum entropy IRL. In Section 3, we present our approach to train IRL models via solving systems of linear equations, and discuss conditions under which such systems have unique solutions. In Section 4, we present our main IRL training algorithm with missing data. We evaluate and compare our approach in Section 5, and conclude in Section 6.

Notation. We use $|\mathcal{S}|$ to denote the cardinality of set $\mathcal{S}$. We use $M^\top$ and $M^{-1}$ to denote the transpose and the inverse of matrix M, respectively. We use $z_s$ to denote the $s$-th element of vector z, and $M_{s,s'}$ to denote the element of matrix M at the $s$-th row and $s'$-th column. We also use $\partial M/\partial \theta_t$ to denote an element-wise first-order derivative of M w.r.t. parameter $\theta_t$, i.e., a matrix (or vector) of the same size as M with the entry at the $s$-th row and $s'$-th column being $\partial M_{s,s'}/\partial \theta_t$. The Hadamard product (element-wise product) between two matrices is denoted by $\circ$. We use $\|\mathbf{z}\|_\infty$ to denote the infinity norm (or maximum norm) of vector z.

## 2 Background

A Markov decision process (MDP) for an agent is defined as $(\mathcal{S}, \mathcal{A}, p, r, \eta)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a finite set of actions, $p$ is a transition probability function, i.e., $p(s'|a,s)$ is the probability of moving to state $s'$ from $s$ by performing action $a$, $r(s'|s,\theta)$ is a reward function of parameters $\theta$ and a feature vector $f(s'|s)$ associated with the transition from $s$ to $s'$, and $\eta$ is a discount factor. In the context of maximum entropy inverse reinforcement learning (ME-IRL) (Ziebart et al., 2008) we assume that $\eta = 1$. In this paper, for notational simplicity, we assume that the reward function has a linear-in-parameters form $r(s'|s,\theta) = \theta^\top f(s'|s)$, but our results can be straightforwardly applied to nonlinear reward functions (e.g., reward functions represented by artificial neural networks). We also denote by $\mathcal{D}$ the set of zero-reward absorbing states.

According to Ziebart et al. (2008), to train a ME-IRL model we need to compute values $z_s$, for all states $s \in \mathcal{S}$, which satisfy the following recursive equations

$$
z_s=\begin{cases}\sum_{a\in\mathcal{A}}\sum_{s'\in\mathcal{S}} p(s'|a,s)\, e^{r(s'|s,\theta)}\, z_{s'} & \text{if } s\in\mathcal{S}\setminus\mathcal{D}\\[2pt] 1 & \text{if } s\in\mathcal{D}.\end{cases} \tag{1}
$$

Given vector z, we can obtain the log-likelihood of demonstrated trajectories (and its gradients) by computing the local action probabilities of the following form

$$
P(a|s)=\frac{\sum_{s'} p(s'|a,s)\, e^{r(s'|s,\theta)}\, z_{s'}}{z_s}, \tag{2}
$$

$$
\frac{\partial P(a|s)}{\partial \theta_t}=\frac{1}{z_s}\left(\sum_{s'} p(s'|a,s)\, e^{r(s'|s,\theta)}\big(f(s'|s)_t\, z_{s'}+J_{z_{s'},t}\big)\right)-\frac{\left(\sum_{s'} p(s'|a,s)\, e^{r(s'|s,\theta)}\, z_{s'}\right) J_{z_s,t}}{z_s^2}, \tag{3}
$$

where $J_z$ is the Jacobian matrix of z, i.e., $J_{z_s,t} = \partial z_s/\partial\theta_t$. So, to efficiently train IRL models, given any parameters $\theta$, we need to quickly compute z and its Jacobian matrix. Traditional approaches rely on value iteration, which can be time-consuming and inaccurate. The method proposed in the following provides a more efficient way to obtain z and $J_z$.

## 3 Scalable IRL Training Approach

In this section, we show that the vector z as well as its Jacobian matrix can be obtained by solving systems of linear equations. Our approach is scalable, in the sense that it is much faster than the conventional value iteration approach (Ziebart et al., 2008), and would be useful to train IRL models with large numbers of states and demonstrated trajectories. We describe our approach in the following.

Suppose that the states in $\mathcal{S}$ are numbered in such a way that the absorbing states in $\mathcal{D}$ are numbered last. We define a matrix M of size $|\mathcal{S}|\times|\mathcal{S}|$ and a vector b of size $|\mathcal{S}|$ with elements

$$
M_{s,s'}=\sum_{a\in\mathcal{A}} p(s'|a,s)\, e^{r(s'|s,\theta)},\ \forall s,s'\in\mathcal{S};\qquad b_s=\begin{cases}1 & \text{if } s\in\mathcal{D}\\ 0 & \text{otherwise.}\end{cases}
$$

We can reformulate the recursive system (1) as a system of linear equations as

$$
\mathbf{z}=M\mathbf{z}+\mathbf{b}\quad\text{or}\quad \mathbf{z}=(I-M)^{-1}\mathbf{b}, \tag{4}
$$

where I is the identity matrix of size $|\mathcal{S}|\times|\mathcal{S}|$. So, we can obtain z by solving a system of linear equations, which is computationally convenient. Moreover, taking the gradient of z w.r.t. parameter $\theta_t$ we have

$$
\frac{\partial \mathbf{z}}{\partial\theta_t}=(I-M)^{-1}U^t\mathbf{z}, \tag{5}
$$

where $U^t$ is a matrix of size $|\mathcal{S}|\times|\mathcal{S}|$ with elements

$$
U^t_{s,s'}=\frac{\partial M_{s,s'}}{\partial\theta_t}=\sum_{a\in\mathcal{A}} p(s'|a,s)\, e^{r(s'|s,\theta)} f(s'|s)_t,
$$

where $f(s'|s)_t$ is the $t$-th element of the feature vector $f(s'|s)$. If we define $F^t$ as a matrix of size $|\mathcal{S}|\times|\mathcal{S}|$ with elements $F^t_{s,s'}=f(s'|s)_t$, we can write $U^t=M\circ F^t$. So again, the Jacobian of z w.r.t. $\theta$ can be obtained conveniently by solving a system of linear equations. It is possible to further speed up the computation of the Jacobian of z by defining a matrix H of size $|\mathcal{S}|\times T$ whose $t$-th column is $(M\circ F^t)\mathbf{z}$, where $T$ is the size of $\theta$. The Jacobian matrix of z can then be computed as $J_z=(I-M)^{-1}H$, which is also a linear system. Moreover, the following theorem shows that, under some conditions, the systems (4) and (5) have unique solutions.
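To make the linear-system view concrete, the following NumPy sketch solves (4) for z and one Jacobian column as in (5). The matrices `M`, `b`, and `Ft` are made-up toy values for a four-state instance with one absorbing state, not data from the paper.

```python
import numpy as np

# Toy instance: 4 states, state 3 is the single zero-reward absorbing state.
# M[s, s'] = sum_a p(s'|a,s) * exp(r(s'|s, theta))  (made-up values).
M = np.array([
    [0.0, 0.3, 0.2, 0.1],
    [0.0, 0.0, 0.4, 0.3],
    [0.0, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.0],   # absorbing state: no outgoing mass
])
b = np.array([0.0, 0.0, 0.0, 1.0])  # b_s = 1 iff s is absorbing

# Solve (I - M) z = b directly instead of running value iteration.
I = np.eye(4)
z = np.linalg.solve(I - M, b)

# One Jacobian column: solve (I - M) Jz_t = (M ∘ F^t) z, with a made-up
# feature matrix F^t whose (s, s') entry stands in for f(s'|s)_t.
Ft = np.full((4, 4), 0.5)
Jz_t = np.linalg.solve(I - M, (M * Ft) @ z)
```

Because every system here shares the coefficient matrix $(I-M)$, a single factorization of it can be reused across all right-hand sides, which is where the speedup over repeated value iteration comes from.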

###### Theorem 1 (Conditions for the existence and uniqueness of z and Jz)

$(I-M)$ is invertible and systems (4) and (5) have unique solutions if one of the following conditions holds:

• (i) $\prod_{i=1}^{n-1}\left(\sum_{a\in\mathcal{A}} p(s_{i+1}|a,s_i)\right)=0$ for any sequence $(s_1,\ldots,s_n)\subseteq\mathcal{S}$ such that $s_1=s_n$.

• (ii) $\sum_{s'\in\mathcal{S}} M_{s,s'}<1$ for all $s\in\mathcal{S}$, i.e., $\|M\|_\infty<1$.

We provide the proof of Theorem 1 in the Appendix. Condition (i) holds if the network of states is cycle-free, i.e., once we leave a state, the probability of returning to that state is zero. Condition (ii) generally holds when the magnitudes of the parameters $\theta$ are large enough and the reward function takes significantly negative values.

An alternative approach to compute vector z is to use value iteration (Ziebart et al., 2008), i.e., we perform $\mathbf{z}^{(k+1)}\leftarrow M\mathbf{z}^{(k)}+\mathbf{b}$ for $k=0,1,\ldots$ until converging to a fixed point solution. This approach has been used in several studies to train IRL models. However, to the best of our knowledge, there is no study investigating conditions under which this iterative process converges. Proposition 2 below shows that, under the same conditions stated in Theorem 1, the value iteration converges to a unique fixed point solution from any starting vector $\mathbf{z}^{(0)}$. The proof of Proposition 2 can be found in the Appendix.

###### Proposition 2 (Conditions for the convergence of the value iteration)

If one of the two conditions stated in Theorem 1 holds, the value iteration procedure $\mathbf{z}^{(k+1)}\leftarrow M\mathbf{z}^{(k)}+\mathbf{b}$ always converges to a unique fixed point solution from any starting vector $\mathbf{z}^{(0)}$. Moreover, if Condition (ii) holds, then the fixed point solution lies in $[0,1]^{|\mathcal{S}|}$. In addition, if Condition (i) holds, then the value iteration converges after a finite number of iterations.

Remark. If both Conditions (i) and (ii) hold, then the value iteration will converge to the unique fixed point solution in $[0,1]^{|\mathcal{S}|}$ after a finite number of iterations. If only Condition (i) holds, then it is not necessary that the fixed point solution lies in $[0,1]^{|\mathcal{S}|}$, and if only Condition (ii) holds, then the value iteration procedure may need an infinite number of iterations to converge to the fixed point solution. For the latter, Proposition 3 gives an estimate of the number of iterations necessary to get a good approximation of the fixed point solution.

###### Proposition 3

If Condition (ii) of Theorem 1 holds, then for any $\epsilon>0$, $\|\mathbf{z}^{(k)}-\mathbf{z}^*\|_\infty\le\epsilon$ for all $k\ge \ln\!\big(\epsilon/\|\mathbf{z}^{(0)}-\mathbf{z}^*\|_\infty\big)/\ln\gamma$, where $\gamma=\|M\|_\infty<1$.

Proof. From the proof of Proposition 2 we have the inequality $\|\mathbf{z}^{(k+1)}-\mathbf{z}^*\|_\infty\le\gamma\|\mathbf{z}^{(k)}-\mathbf{z}^*\|_\infty$, for $k=0,1,\ldots$ This leads to $\|\mathbf{z}^{(k)}-\mathbf{z}^*\|_\infty\le\gamma^k\|\mathbf{z}^{(0)}-\mathbf{z}^*\|_\infty$, $\forall k$. So, for all $k\ge \ln\!\big(\epsilon/\|\mathbf{z}^{(0)}-\mathbf{z}^*\|_\infty\big)/\ln\gamma$ we have $\gamma^k\|\mathbf{z}^{(0)}-\mathbf{z}^*\|_\infty\le\epsilon$, and hence $\|\mathbf{z}^{(k)}-\mathbf{z}^*\|_\infty\le\epsilon$, as required.
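As a sanity check of the discussion above, the value iteration z ← Mz + b and the direct solve of (4) can be compared on a toy instance; `M` and `b` below are made-up values with the cycle-free structure of Condition (i), so the iteration stops after finitely many sweeps.

```python
import numpy as np

# Made-up cycle-free instance: M is strictly upper triangular (no cycles),
# so value iteration converges after at most |S| sweeps.
M = np.array([
    [0.0, 0.3, 0.2, 0.1],
    [0.0, 0.0, 0.4, 0.3],
    [0.0, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.0],
])
b = np.array([0.0, 0.0, 0.0, 1.0])

z = np.ones(4)                      # arbitrary starting vector
for _ in range(100):
    z_next = M @ z + b              # one value-iteration sweep
    if np.max(np.abs(z_next - z)) < 1e-12:
        z = z_next
        break
    z = z_next

z_direct = np.linalg.solve(np.eye(4) - M, b)  # the direct solve in (4)
```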

It is possible to further accelerate the computation of the vectors z and their Jacobian matrices. The idea is to compose the linear systems from different demonstrated trajectories into one linear system. First, let us assume that we need to compute the log-likelihood of demonstrated trajectories $\sigma_1,\ldots,\sigma_N$. For each observation $\sigma_i$, let $\mathcal{D}_i$ denote the corresponding set of zero-reward absorbing states, and $\mathbf{z}^i$ the corresponding vector z. We also define a matrix Z of size $|\mathcal{S}|\times N$ whose $i$-th column is the vector $\mathbf{z}^i$. Now, if we assume that the feature vector $f(s'|s)$ only depends on states $s$ and $s'$, then the matrix M is the same over trajectories. So, if we define a matrix B with elements $B_{s,i}=1$ if $s\in\mathcal{D}_i$, and $B_{s,i}=0$ otherwise, then Z and the gradients of Z w.r.t. the parameters are solutions to the following systems of linear equations

$$
Z=(I-M)^{-1}B,\quad\text{and}\quad \frac{\partial Z}{\partial\theta_t}=(I-M)^{-1}(M\circ F^t)Z. \tag{6}
$$

In summary, to compute the log-likelihood (and its gradients) of the demonstrated trajectories, one just needs to solve (6) to obtain Z and the Jacobian matrix of Z. This requires solving $T+1$ systems of linear equations, where $T$ is the number of features considered. In large-scale applications, we typically do not invert the matrix $(I-M)$ to solve the linear systems. Instead, we can use more stable and less time-consuming methods (e.g., LU factorization). Since the linear systems in (6) all involve $(I-M)$, the LU factorization appears to be an efficient approach. The idea is to decompose the matrix $(I-M)$ into two matrices L and U as $(I-M)=LU$, where L is a lower triangular matrix with unit diagonal and U is an upper triangular matrix. Then, to solve a linear system $(I-M)X=B$, we just need to perform two simple tasks, namely, (i) finding Y such that $LY=B$ and (ii) finding X such that $UX=Y$. Since the factorization step (i.e., finding L and U) is more expensive than the two steps (i) and (ii), we just need to do the factorization once and reuse the matrices L, U for all the systems of linear equations.
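The factor-once, solve-many pattern can be sketched with SciPy's LU routines as follows; the matrices here are randomly generated stand-ins for $(I-M)$, B, and $F^t$, not quantities from the paper.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 50
M = rng.uniform(0.0, 1.0, (n, n)) / n      # made-up M with small row sums
A = np.eye(n) - M
lu, piv = lu_factor(A)                     # expensive factorization, done once

B = rng.uniform(0.0, 1.0, (n, 3))          # stand-in for the matrix B in (6)
Z = lu_solve((lu, piv), B)                 # Z = (I - M)^{-1} B

Ft = rng.uniform(0.0, 1.0, (n, n))         # stand-in feature matrix F^t
dZ = lu_solve((lu, piv), (M * Ft) @ Z)     # gradient system, reusing L and U
```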

## 4 Training Algorithm with Missing Data

We need demonstrated trajectories $\sigma_1,\ldots,\sigma_N$ to train an IRL model. We now consider the case where some state-action pairs on the trajectories are missing. The question is how to train the IRL model in such situations.

To make the presentation simple, we consider a trajectory containing a pair of states $(s_i, s_j)$ such that the state-action observations between these two states are missing. To train the model, we need to compute or approximate $P(s_j|s_i)$, i.e., the probability of reaching state $s_j$ from state $s_i$. Naive approaches might be to ignore all the missing segments, or to enumerate all possible states and actions that allow moving from $s_i$ to $s_j$. However, enumerating all possible paths between two states is not tractable in practice.

### 4.1 The EM Approach

Another conventional approach is the EM method, which is a popular way to deal with missing variables and has been considered for training IRL models with missing data (Shahryari and Doshi, 2017; Bogert et al., 2016). The idea is to alternate between an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate of the parameters, and a maximization (M) step, which maximizes the expectation function created by the E step. The key feature of the EM is to define a function as the expected value of the log-likelihood function of the parameters $\theta$, w.r.t. the current conditional distribution of unobserved trajectories given the observed state-actions and the current estimate of the model parameters. To this end, we need to build this function for each incomplete pair of states $(s_i,s_j)$, where the expectation is taken over complete trajectories $\sigma$ from $s_i$ to $s_j$ (inclusive) under the parameter estimate at the current iteration. We can approximate the expectation by sampling over the distribution of $\sigma$.

There are two main issues associated with the use of the EM method. First, the EM requires sampling trajectories between every incomplete pair, which is expensive, especially when the number of such pairs is large. Second, the EM method requires solving several maximum likelihood estimation problems until reaching a fixed point solution, which is indeed an expensive procedure.

To sample $\sigma$, a straightforward approach is to start from $s_i$ and sample an action and a next state according to the probabilities $P(a|s)$ and $p(s'|a,s)$. Repeating this until we reach $s_j$ yields one complete trajectory between $s_i$ and $s_j$. Given a set of trajectory samples, the expectation can be approximated by the empirical average over these samples.

However, if the current parameter estimate is far from the optimal value, the probability of reaching $s_j$ from $s_i$ may be small, and it is very inefficient to sample a complete trajectory from $s_i$ to $s_j$ following the above procedure. In the experiment, we only enumerate a subset of trajectories for each missing segment by employing breadth-first search (BFS) without marking visited states and with a limited depth $H$. This enumerates all possible trajectories of length less than or equal to $H$. If BFS with depth $H$ cannot find any path connecting a missing segment, the missing segment is omitted from the training. This is to keep the training time reasonable. The expectation is then approximated by averaging over these enumerated trajectories. We name this method EM-BFS-$H$. The purpose is to empirically assess how sampling a subset of paths affects the EM algorithm.
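The depth-limited BFS enumeration can be sketched as follows; the successor map `succ`, the states, and the helper name `enumerate_paths` are hypothetical, and, as in the text, visited states are not marked, so all paths of at most `max_depth` transitions are enumerated.

```python
from collections import deque

def enumerate_paths(succ, s_start, s_end, max_depth):
    """Enumerate all paths from s_start to s_end with at most max_depth
    transitions, without marking visited states (as in EM-BFS-H)."""
    paths = []
    queue = deque([[s_start]])
    while queue:
        path = queue.popleft()
        if path[-1] == s_end and len(path) > 1:
            paths.append(path)       # reached the end of the missing segment
            continue
        if len(path) - 1 >= max_depth:
            continue                 # depth limit H reached, drop this branch
        for nxt in succ.get(path[-1], []):
            queue.append(path + [nxt])
    return paths

# Toy road graph: 0 -> {1, 2}, 1 -> {3}, 2 -> {3}
succ = {0: [1, 2], 1: [3], 2: [3]}
paths = enumerate_paths(succ, 0, 3, 3)   # [[0, 1, 3], [0, 2, 3]]
```

With a depth limit below the shortest connecting path (e.g., `max_depth=1` here), the search finds nothing and the missing segment would be omitted, which is exactly the failure mode discussed above.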

### 4.2 Likelihood of Incomplete Trajectories

In the following we present a tractable way to compute the log-likelihood of incomplete trajectories without path/trajectory sampling. To do so, let us consider an incomplete pair $(s_i,s_j)$ such that the state-action pairs between these two states are missing. To compute the log-likelihood of the trajectory that contains the pair $(s_i,s_j)$, we need to compute $P(s_j|s_i)$, the probability of reaching $s_j$ from $s_i$. To do this in a tractable way, we define $\pi(s)$ as the probability of reaching state $s$ from state $s_i$. The values of $\pi(s)$, $s\in\mathcal{S}$, satisfy the following equations

$$
\pi(s)=\begin{cases}\sum_{s'\in\mathcal{S}}\sum_{a\in\mathcal{A}} p(s|a,s')P(a|s')\pi(s') & \forall s\ne s_i\\[2pt] 1 & s=s_i.\end{cases} \tag{7}
$$

So, if we define d as a vector of size $|\mathcal{S}|$ with all zero elements except $d_{s_i}=1$, and a matrix $Q^{s_i}$ of size $|\mathcal{S}|\times|\mathcal{S}|$ with elements

$$
Q^{s_i}_{s,s'}=\begin{cases}\sum_{a\in\mathcal{A}} p(s|a,s')P(a|s'), & \forall s,s'\in\mathcal{S},\ s\ne s_i\\[2pt] 0 & \text{if } s=s_i,\end{cases}
$$

then we can write (7) as

$$
\boldsymbol{\pi}=Q^{s_i}\boldsymbol{\pi}+\mathbf{d},\quad\text{or}\quad \boldsymbol{\pi}=(I-Q^{s_i})^{-1}\mathbf{d}, \tag{8}
$$

where $\boldsymbol{\pi}$ is a vector of size $|\mathcal{S}|$ with entries $\pi(s)$, $s\in\mathcal{S}$, and I is the identity matrix of size $|\mathcal{S}|\times|\mathcal{S}|$. Proposition 4 shows that $(I-Q^{s_i})$ is invertible, which guarantees the existence and uniqueness of the solution to (8).

###### Proposition 4

$(I-Q^{\bar s})$ is invertible, for any $\bar s\in\mathcal{S}$.

Proof. For any $\bar s\in\mathcal{S}$, we define a matrix P with entries $P_{s,s'}=\sum_{a\in\mathcal{A}} p(s'|a,s)P(a|s)$ for $s\ne\bar s$ and $P_{\bar s,s'}=0$, and write

$$
\sum_{s'\in\mathcal{S}} P_{s,s'}=\begin{cases}\sum_{s'\in\mathcal{S}}\sum_{a\in\mathcal{A}} p(s'|a,s)P(a|s)=1 & \text{if } s\ne\bar s\\[2pt] 0 & \text{if } s=\bar s.\end{cases}
$$

So, P is a sub-stochastic matrix (it contains non-negative entries and every row adds up to at most 1). Moreover, P contains no recurrent class. So, $(I-P)$ is invertible. Furthermore, $(I-Q^{\bar s})$ is also invertible, which is the desired result.

So, we can obtain $\boldsymbol{\pi}$ as $(I-Q^{s_i})^{-1}\mathbf{d}$. We also need the gradients of $\boldsymbol{\pi}$ for the maximum likelihood estimation. These can be obtained by taking the Jacobian of $\boldsymbol{\pi}$ w.r.t. the parameters $\theta$ as $J_\pi=(I-Q^{s_i})^{-1}H$, where H is of size $|\mathcal{S}|\times T$ and its $t$-th column is the vector $(\partial Q^{s_i}/\partial\theta_t)\boldsymbol{\pi}$; recall that $T$ is the size of $\theta$, i.e., the number of features considered in the IRL model. So, in summary, we can compute the log-likelihood of the demonstrated trajectories with missing data, as well as its gradients, by solving several systems of linear equations.
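A small NumPy sketch of the reach-probability computation in (7)-(8): build $Q^{s_i}$ from the policy-induced transition probabilities, zero out the row of the start state, and solve one linear system. The transition matrix `T` is a made-up toy chain, not data from the paper.

```python
import numpy as np

# T[s, s'] = sum_a p(s|a,s') * P(a|s'): probability of moving from s' to s
# under the current policy (made-up four-state chain; both branches out of
# state 0 eventually reach state 3).
T = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
])
n, s_i = 4, 0

Q = T.copy()
Q[s_i, :] = 0.0                  # row of the start state is zeroed, as in (7)
d = np.zeros(n)
d[s_i] = 1.0                     # d has a single 1 at s_i

pi = np.linalg.solve(np.eye(n) - Q, d)   # pi[s] = prob. of reaching s from s_i
```

Here `pi` comes out as `[1, 0.5, 0.5, 1]`: each branch is reached with probability 0.5, and state 3 is reached with certainty, with no path sampling involved.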

### 4.3 Composition Algorithm

We show above that one can obtain the log-likelihood of trajectories with missing data by solving one system of linear equations for each incomplete pair $(s_i,s_j)$. This would be time-consuming if the demonstrated trajectories contain a large number of such pairs. We show in the following that it is possible to obtain the probabilities of all the incomplete pairs in trajectories sharing the same set of zero-reward absorbing states by solving only one system of linear equations, instead of one linear system per incomplete pair.

Assume that we observe incomplete pairs $(u_1,v_1),\ldots,(u_K,v_K)$ in demonstrated trajectories that share the same set of zero-reward absorbing states, where $u_k,v_k\in\mathcal{S}$ for $k=1,\ldots,K$. We define a matrix $Q^0$ of size $(|\mathcal{S}|+1)\times(|\mathcal{S}|+1)$ with entries

$$
Q^0_{s,s'}=\sum_{a\in\mathcal{A}} p(s|a,s')P(a|s'),\ \forall s,s'\in\mathcal{S};\qquad Q^0_{s,|\mathcal{S}|+1}=Q^0_{|\mathcal{S}|+1,s}=0,\ \forall s\in\mathcal{S},
$$

and a matrix D of size $(|\mathcal{S}|+1)\times K$ with entries

$$
D_{s,k}=1 \ \text{ if } s=|\mathcal{S}|+1 \text{ or } s=u_k,\quad\text{and}\quad D_{s,k}=0 \ \text{ otherwise.}
$$

The following theorem indicates how to obtain all the probabilities $P(v_k|u_k)$, $k=1,\ldots,K$, by solving only one system of linear equations.

###### Theorem 5

If $\Pi$ is a solution to the system of linear equations $(I-Q^0)\Pi=D$, then the probabilities $P(v_k|u_k)$, $k=1,\ldots,K$, can be read off directly from the corresponding entries of $\Pi$.

The proof of Theorem 5 is given in the Appendix. As a result, the gradient of $\Pi$ w.r.t. a parameter $\theta_t$ can also be obtained by solving the following linear system

$$
\frac{\partial\Pi}{\partial\theta_t}=(I-Q^0)^{-1}\frac{\partial Q^0}{\partial\theta_t}\Pi. \tag{9}
$$

Moreover, similarly to the proof of Proposition 4, we can show that $Q^0$ is a sub-stochastic matrix with no recurrent class. As a result, $(I-Q^0)$ is invertible, which ensures that the linear system in Theorem 5 and (9) always have unique solutions.

Algorithm 1 describes the basic steps to compute the log-likelihood value and its gradients in the case of missing data. The total number of linear systems to be solved in Algorithm 1 scales with the number of groups of trajectories, where each group contains trajectories that share the same set of zero-reward absorbing states. Clearly, the number of linear systems to be solved does not depend on the number of missing segments in the demonstrated trajectories. Moreover, Algorithm 1 can be implemented in a parallel manner, which would help to speed up the computation. If the data is complete (no missing segments), then we just need to remove Step 2. Note that in this case we only need to solve $T+1$ systems of linear equations to obtain the log-likelihood value and its gradients, and this number does not depend on the number of demonstrated trajectories.

Similarly to the previous section, for each set of zero-reward absorbing states, Step 2 requires solving systems of linear equations that all involve the matrix $(I-Q^0)$. The LU factorization is also a convenient approach to achieve good performance here. Technically speaking, one can first decompose $(I-Q^0)$ into a lower triangular matrix L and an upper triangular matrix U, and then use these two matrices to solve the linear systems in Step 2.
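The composition idea, one linear system with many right-hand sides instead of one system per missing pair, can be sketched as below. This is a simplified stand-in: `Q` is a random sub-stochastic matrix playing the role of $Q^0$ (without the auxiliary $(|\mathcal{S}|+1)$-th state the paper adds), and each column of `D` is the indicator of one start state $u_k$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
Q = rng.uniform(0.0, 1.0, (n, n)) / (1.1 * n)  # made-up sub-stochastic matrix

starts = [0, 2, 4]                 # start states u_k of the missing pairs
D = np.zeros((n, len(starts)))
for k, u in enumerate(starts):
    D[u, k] = 1.0                  # one indicator column per missing pair

# One multi-RHS solve replaces K separate per-pair systems; column k of Pi
# holds the reach probabilities from u_k.
Pi = np.linalg.solve(np.eye(n) - Q, D)
```

As in Section 3, a single LU factorization of $(I-Q)$ can be shared between this solve and the gradient systems in (9).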

## 5 Experiments

This section evaluates the empirical performance of our approach using a real-world dataset. We compare our decomposition method with the EM method and the (naive) one that simply ignores the missing segments in the dataset. Our dataset contains larger numbers of states/actions/trajectories, as compared to those used with the EM in previous studies (Shahryari and Doshi, 2017; Bogert et al., 2016), which makes it more suitable to illustrate the scalability of our approach in handling the issue of missing data. For the sake of comparison, we also compare our approach with the traditional value iteration (Ziebart et al., 2008) in the case of no-missing data.

Dataset. The dataset contains 1832 real-world trajectories of taxi drivers. The road network consists of 7288 links, which are states in the model. The action is moving from a link to a consecutive link with no uncertainty. There are four features, in which three boolean features are left turn, U-turn, and the incidence matrix, and one real-valued feature is the traveling time between consecutive links. This dataset has been used in some route choice modeling studies (Fosgerau et al., 2013; Mai et al., 2015).

Generating Missing Dataset. To evaluate the performance of the different approaches in the context of missing data, we take the full trajectories from the above dataset and remove some observed links to generate missing datasets. This allows us to assess our algorithm with different "missing levels". We define the missing probability as the probability that a link (except the origin and the destination links) is removed from a trajectory in the dataset. To increase the difficulty of recovering the reward function, the removed links in a trajectory are consecutive. For a trajectory of a given length, the length of the missing segment follows a binomial distribution whose number of trials is the number of removable links and whose success probability is the missing probability. For each trajectory, such a length is randomly sampled, and a segment of this length is randomly removed from the trajectory. In the experiment, there are 10 different random runs to account for the randomness in the generation of the missing dataset.

In this setting, as the missing probability increases, the expected length of missing segments in the dataset increases. This, in turn, makes the learning of the reward function more difficult as observed in the experimental result. To assess the performance of the reward recovery, the log-likelihood of the no-missing dataset using the learned reward function is computed. The higher the log-likelihood, the better the reward function is.
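The missing-data generation described above can be sketched as follows; the binomial-trials convention (interior links only) is an assumption of this sketch, and `remove_segment` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

def remove_segment(traj, p_missing):
    """Remove one consecutive interior segment whose length is drawn from a
    binomial distribution, keeping origin and destination links intact."""
    interior = len(traj) - 2           # links that may be removed
    if interior <= 0:
        return traj
    m = rng.binomial(interior, p_missing)
    if m == 0:
        return traj
    start = rng.integers(1, interior - m + 2)  # segment stays strictly inside
    return traj[:start] + traj[start + m:]

traj = list(range(10))                 # a toy trajectory of 10 links
short = remove_segment(traj, 0.4)      # origin 0 and destination 9 survive
```

Raising `p_missing` lengthens the expected removed segment, which reproduces the increasing difficulty discussed above.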

Comparison between Composition and MaxEnt Algorithms on No-Missing Dataset. To compare the composition algorithm with the popular MaxEnt IRL algorithm (i.e., value iteration) (Ziebart et al., 2008), we run both algorithms on the no-missing dataset. They give similar reward functions, with no significant difference in log-likelihood. On the other hand, the iterative algorithm used by MaxEnt to compute the log-likelihood (and its derivatives) incurs much more computing time than the composition algorithm. Overall, our composition algorithm is about 60 times faster than MaxEnt while returning a similar reward function.

Comparison between Composition and EM Algorithms on Missing Dataset. Fig. 1 shows a comparison of the different approaches in terms of reward recovery and computing time. Regarding performance, Fig. 1(a) shows the log-likelihood of the no-missing dataset under the rewards learned by the different methods. When the missing probability is small, the information from missing segments is not enough to make a visible difference among the methods. However, as the missing probability increases, the use of missing segments yields a significant improvement of the composition method over the others: the composition method's log-likelihood achieves both a larger value and a smaller confidence interval. Surprisingly, the EM methods are outperformed by the naive method using only connected segments when the missing probability is large. This is because the length of the missing segments increases in this case, which makes the depth-limited BFS sampling inaccurate. This illustrates the adverse effect of sampling only a subset of trajectories when the missing segment length is large. On the other hand, increasing the depth of the BFS increases the execution time significantly (Fig. 1(b)). Regarding our composition method, its execution time to compute the log-likelihood is remarkably smaller than that of the EM methods. (The execution time is measured using a computer with an i7-6700 CPU @ 3.40GHz and 16GB of RAM.) It is also interesting to see that, as expected, the computing time of the composition method does not grow much as the missing probability increases. These observations demonstrate the scalability of the composition method in a real-world application.

## 6 Conclusion

We have presented a novel IRL training algorithm that clearly outperforms previous approaches, in both the no-missing and missing data cases. The main advantage of our approach is that the IRL training procedure is performed by solving a number of systems of linear equations, and this number depends neither on the number of demonstrated trajectories in the complete-data case nor on the number of missing segments in the missing-data case. This makes our approach highly scalable for large-scale applications. Many applications would potentially benefit from our approach, for example, problems of learning experts' reward functions when datasets are not entirely available due to technical issues or privacy concerns. In future work, we plan to investigate generative adversarial ideas (Ho and Ermon, 2016; Goodfellow et al., 2014) for imitation learning in the context of missing information.

## References

• Abbeel and Ng (2004) Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proc. ICML, 2004.
• Bogert and Doshi (2014) Bogert, K. and Doshi, P. Multi-robot inverse reinforcement learning under occlusion with interactions. In Proc. AAMAS, pages 173–180, 2014.
• Bogert and Doshi (2017) Bogert, K. and Doshi, P. Scaling expectation-maximization for inverse reinforcement learning to multiple robots under occlusion. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 522–529. International Foundation for Autonomous Agents and Multiagent Systems, 2017.
• Bogert et al. (2016) Bogert, K., Lin, J. F.-S., Doshi, P., and Kulic, D. Expectation-maximization for inverse reinforcement learning with hidden data. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pages 1034–1042. International Foundation for Autonomous Agents and Multiagent Systems, 2016.
• Fosgerau et al. (2013) Fosgerau, M., Frejinger, E., and Karlstrom, A. A link based network route choice model with unrestricted choice set. Transportation Research Part B: Methodological, 56:70–80, 2013.
• Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
• Granas and Dugundji (2013) Granas, A. and Dugundji, J. Fixed point theory. Springer Science & Business Media, 2013.
• Ho and Ermon (2016) Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
• Levine et al. (2011) Levine, S., Popović, Z., and Koltun, V. Nonlinear inverse reinforcement learning with Gaussian processes. In Proc. NIPS, pages 19–27, 2011.
• Mai et al. (2015) Mai, T., Fosgerau, M., and Frejinger, E. A nested recursive logit model for route choice analysis. Transportation Research Part B: Methodological, 75:100–112, 2015.
• Ng and Russell (2000) Ng, A. Y. and Russell, S. Algorithms for inverse reinforcement learning. In Proc. ICML, pages 663–670, 2000.
• Penny and Atkinson (2012) Penny, K. I. and Atkinson, I. Approaches for dealing with missing data in health care studies. Journal of clinical nursing, 21(19pt20):2722–2729, 2012.
• Russell (1998) Russell, S. Learning agents for uncertain environments. In Proc. COLT, pages 101–103, 1998.
• Shahryari and Doshi (2017) Shahryari, S. and Doshi, P. Inverse reinforcement learning under noisy observations. arXiv preprint arXiv:1710.10116, 2017.
• Wang et al. (2002) Wang, S., Rosenfeld, R., Zhao, Y., and Schuurmans, D. The latent maximum entropy principle. In Proceedings IEEE International Symposium on Information Theory, page 131. IEEE, 2002.
• Ziebart et al. (2008) Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.

## Appendix

### Proof of Theorem 1

We first prove the convergence under Condition (i) by introducing the following lemma.

###### Lemma 1

For any sequence {s_1, …, s_n} with s_i ∈ S and n > |S|, there is at least one pair (s_i, s_{i+1}) such that

 p(s_{i+1}|a, s_i) = 0, ∀a ∈ A.

Proof. Because n > |S| and there are |S| states in S, there are two identical states in the sequence, i.e., there exist i < j such that s_i = s_j. The hypothesis of Condition (i) guarantees that there exists t, i ≤ t < j, such that

 ∑_{a∈A} p(s_{t+1}|a, s_t) = 0,

which leads to the desired result.

Now we consider, for any n > |S|, the matrix M^n with entries

 M^n_{s,s'} = ∑_{s_1,…,s_n ∈ S; s_1 = s, s_n = s'} (∏_{i=1}^{n−1} M_{s_i, s_{i+1}}) = ∑_{s_1,…,s_n ∈ S; s_1 = s, s_n = s'} ∏_{i=1}^{n−1} (∑_{a∈A} p(s_{i+1}|a, s_i) e^{r(s_{i+1}|s_i, θ)}).

Using Lemma 1 we have that, if n > |S|, then

 ∏_{i=1}^{n−1} (∑_{a∈A} p(s_{i+1}|a, s_i) e^{r(s_{i+1}|s_i, θ)}) = 0

for any sequence {s_1, …, s_n} with s_1 = s and s_n = s'. So, M^n_{s,s'} = 0 for any s, s' ∈ S and n > |S|. In other words,

 lim_{n→∞} M^n = 0.

This implies that the moduli of the eigenvalues of M lie within the unit disc, and consequently, I − M is invertible.
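This limiting behavior is easy to check numerically. The sketch below uses a made-up 3-state instance (not from the paper): M is nilpotent, mirroring Lemma 1 where every path visiting more than |S| states has weight zero, so the Neumann series I + M + M² + ⋯ is finite and equals (I − M)⁻¹, confirming invertibility.

```python
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_add(A, B):
    n = len(A)
    return [[A[i][j] + B[i][j] for j in range(n)] for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# A nilpotent M: transitions only move "forward", so every path with more
# than |S| = 3 states has weight zero and M^3 = 0 (cf. Lemma 1).
M = [[0.0, 0.5, 0.2],
     [0.0, 0.0, 0.7],
     [0.0, 0.0, 0.0]]

# Partial sums of the Neumann series S = I + M + M^2 + M^3.
S = identity(3)
P = identity(3)
for _ in range(3):
    P = mat_mul(P, M)
    S = mat_add(S, P)

# Verify (I - M) S = I, i.e. S = (I - M)^{-1}, so I - M is invertible.
I_minus_M = mat_add(identity(3), [[-x for x in row] for row in M])
check = mat_mul(I_minus_M, S)
print(all(abs(check[i][j] - (1.0 if i == j else 0.0)) < 1e-12
          for i in range(3) for j in range(3)))  # True
```

The same check works for any M whose powers vanish; nilpotency just makes the series terminate exactly.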

To prove the convergence under Condition (ii), we write matrix M in the following canonical form

 M = (Σ R; O_1 O_2),

where R and O_2 consist of the last |D| rows and |D| columns of M corresponding to the absorbing states D, and O_1 and O_2 are matrices of zero elements. We define a matrix M' of the same size as M,

 M' = (Σ R'; O_1 I),

where I is the identity matrix of size |D| × |D| and R' is a matrix of size (|S| − |D|) × |D| with entries

 R'_{s,k} = (1/|D|)(1 − ∑_{s'∈S} M_{s,s'}) > 0, ∀s ∈ S∖D, ∀k ∈ D.

So, under the hypothesis of the theorem, it is easy to verify that M' is the transition matrix of an absorbing Markov chain. Since Σ is the transient-to-transient block of this chain, we have the following well-known result

 lim_{n→∞} Σ^n = 0.

So

 lim_{n→∞} M^n = lim_{n→∞} (Σ^n Σ^{n−1}R; O_1 O_2) = 0.

Similarly to the proof of Case (i), I − M is invertible.
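The block argument can also be illustrated numerically. In the sketch below (made-up transition numbers), Σ plays the role of the transient block of a small absorbing chain; its spectral radius is below one, so its powers vanish:

```python
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Transient-to-transient block Sigma of a toy absorbing chain with two
# transient states and one absorbing state: each row sums to less than 1,
# the remaining probability mass flowing to the absorbing state through R.
Sigma = [[0.5, 0.3],
         [0.2, 0.4]]

P = [row[:] for row in Sigma]
for _ in range(199):          # P = Sigma^200
    P = mat_mul(P, Sigma)

print(max(x for row in P for x in row) < 1e-12)  # True: Sigma^n -> 0
```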

### Proof of Proposition 2

First, we assume that Condition (i) holds. At iteration n of the value iteration procedure, we write z^n = M^n z^0 + ∑_{i=0}^{n−1} M^i b, where M^n stands for MM⋯M (n times). Using the proof of Theorem 1, we have M^n = 0 for any n > |S|. So, for n > |S| we can write z^n = ∑_{i=0}^{|S|} M^i b, which depends on neither z^0 nor n. This remark indicates that z^n converges to the unique fixed-point solution after at most |S| + 1 iterations. This completes the proof for Condition (i).
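As a sanity check of this argument, the sketch below (a hypothetical nilpotent 3-state M and an arbitrary vector b, not taken from the paper) runs value iteration z ← Mz + b from two different starting points and confirms both reach the same fixed point after |S| iterations:

```python
def mat_vec(M, z):
    return [sum(M[i][j] * z[j] for j in range(len(z))) for i in range(len(M))]

# Hypothetical M satisfying Condition (i): nilpotent, M^3 = 0 for |S| = 3.
M = [[0.0, 0.5, 0.2],
     [0.0, 0.0, 0.7],
     [0.0, 0.0, 0.0]]
b = [0.0, 0.0, 1.0]

# Value iteration z <- Mz + b from two different starting points.
z1 = [10.0, -3.0, 7.0]
z2 = [0.0, 0.0, 0.0]
for _ in range(3):            # |S| = 3 iterations suffice
    z1 = [m + c for m, c in zip(mat_vec(M, z1), b)]
    z2 = [m + c for m, c in zip(mat_vec(M, z2), b)]

print(z1 == z2)  # True: both runs reach the unique fixed point
```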

We now move to Condition (ii). We define the mapping f(z) = Mz + b and let τ = max_{s∈S} ∑_{s'∈S} M_{s,s'}, which satisfies τ < 1 under the hypothesis of the theorem. Consider the bounded convex set B = {z : ||z||_∞ ≤ ||b||_∞/(1 − τ)}. Given a vector z ∈ B, we have ||f(z)||_∞ ≤ τ||z||_∞ + ||b||_∞ ≤ ||b||_∞/(1 − τ). So, f is a mapping from B to itself. According to the Brouwer fixed-point theorem (Granas and Dugundji, 2013), there is at least one fixed-point solution in B. As stated in Theorem 1, there always exists a unique fixed-point solution, meaning that the fixed-point solution always lies in B. Moreover, given two points z^1, z^2 ∈ B we have

 ||f(z^1) − f(z^2)||_∞ = max_{s∈S} |∑_{s'∈S} M_{s,s'}(z^1_{s'} − z^2_{s'})| ≤ (max_{s∈S} ∑_{s'∈S} M_{s,s'}) max_{s'∈S} |z^1_{s'} − z^2_{s'}| ≤ τ||z^1 − z^2||_∞.

So, f is a contraction mapping on B, meaning that value iteration converges to a fixed-point solution from any starting point in B. As shown previously, this fixed-point solution is unique. This completes the proof for Condition (ii).
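The contraction property and the resulting convergence can be illustrated with a small numerical sketch (all numbers made up; every row of M sums to τ = 0.6 < 1):

```python
# Substochastic M: every row sums to tau = 0.6 < 1 (made-up numbers).
M = [[0.1, 0.3, 0.2],
     [0.2, 0.1, 0.4],
     [0.3, 0.2, 0.1]]
b = [1.0, 0.0, 2.0]
tau = 0.6

def f(z):
    return [sum(M[i][j] * z[j] for j in range(3)) + b[i] for i in range(3)]

# One application of f contracts the sup-norm distance by at least tau.
z1, z2 = [5.0, -1.0, 3.0], [0.0, 0.0, 0.0]
d0 = max(abs(a - c) for a, c in zip(z1, z2))
d1 = max(abs(a - c) for a, c in zip(f(z1), f(z2)))
print(d1 <= tau * d0 + 1e-12)  # True

# Hence iterating f converges to the unique fixed point from any start.
z = [0.0, 0.0, 0.0]
for _ in range(200):
    z = f(z)
residual = max(abs(zi - fi) for zi, fi in zip(z, f(z)))
print(residual < 1e-10)  # True
```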

### Proof for Theorem 5

First, let us consider a pair of missing data (u_k, v_k). We create an artificial state h such that

 ∑_{a∈A} p(h|a, s) = 0, ∀s ∈ S,
 ∑_{a∈A} p(s|a, h) = 0, ∀s ∈ S∖{u_k},
 ∃a ∈ A such that p(u_k|a, h) = 1.

Basically, we require that it is impossible to reach h from other states in S, and that from h we can only reach u_k, with probability 1. Now we define the following recursive equations

 π^k(s) = ∑_{s'∈S∪{h}} ∑_{a∈A} p(s|a, s') P(a|s') π^k(s'), ∀s ∈ S;  π^k(h) = 1. (10)

One can show that if a function π^k satisfies (10), then π^k(s) = P(s|u_k), ∀s ∈ S. Now, we define a matrix Q^k of size (|S|+1) × (|S|+1) with elements Q^k_{s,s'} = ∑_{a∈A} p(s|a, s') P(a|s'), and a vector q of size |S|+1 with all zero elements except q_h = 1. We also number state h as |S|+1, so the last column and row of Q^k and the last element of q correspond to state h. So, (10) can be written as

 π^k = Q^k π^k + q. (11)

Note that the last row of Q^k is an all-zero vector. Using π^k_{|S|+1} = π^k(h) = 1, we decompose (11) as

 Q^k π^k + q = Q^0 π^k + (Q^k_{1,|S|+1}, Q^k_{2,|S|+1}, …, Q^k_{|S|,|S|+1}, 0)^⊤ π^k_{|S|+1} + q = Q^0 π^k + D_{:,k},

where Q^0 is the matrix obtained from Q^k by replacing its last column with zeros (Q^0 is the same for all k, since only the last column of Q^k involves the artificial state h), and D_{:,k} is the k-th column of D. So, π^k is a solution to the following system

 (I − Q^0) π^k = D_{:,k},

which also means that if matrix Π is a solution to the system of linear equations (I − Q^0)Π = D, then

 π^k = Π_{:,k} and Π_{v_k,k} = π^k(v_k) = P(v_k|u_k),

which is the desired result.
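To illustrate the computation this result justifies, the sketch below builds a toy instance (all transition probabilities made up) with three states plus the artificial state h, and solves (I − Q^0)π^k = D_{:,k} by plain Gaussian elimination; for this acyclic toy chain the solution entries recover the conditional probabilities P(s|u_k):

```python
def solve(A, y):
    """Plain Gaussian elimination with partial pivoting."""
    n = len(A)
    A = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (A[i][n] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
    return x

# Q[s][s'] = policy-induced probability of moving from s' to s, for states
# {0, 1, 2} plus h = 3; h reaches u_k = 0 with probability 1 (made-up chain:
# 0 -> 1 w.p. 0.4, 0 -> 2 w.p. 0.6, 1 -> 2 w.p. 1, 2 absorbing).
Q = [[0.0, 0.0, 0.0, 1.0],
     [0.4, 0.0, 0.0, 0.0],
     [0.6, 1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]   # last row all zero: h is unreachable
q = [0.0, 0.0, 0.0, 1.0]     # q_h = 1

# Q0 = Q with its last column zeroed; the right-hand side uses pi(h) = 1.
Q0 = [row[:-1] + [0.0] for row in Q]
D_col = [Q[i][3] * 1.0 + q[i] for i in range(4)]

# Solve (I - Q0) pi = D_col.
I_minus_Q0 = [[(1.0 if i == j else 0.0) - Q0[i][j] for j in range(4)]
              for i in range(4)]
pi = solve(I_minus_Q0, D_col)
print([round(p, 6) for p in pi])  # [1.0, 0.4, 1.0, 1.0]
```

Here pi[1] = 0.4 and pi[2] = 1.0 match the reachability probabilities P(1|u_k) and P(2|u_k) computed by hand, and solving with the full matrix D would recover all missing pairs at once, one column per pair.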