1 Introduction
Inverse reinforcement learning (IRL) [Ng00:Algorithms; Russell98:Learning] refers to the problem of ascertaining an agent's preferences from observations of its behavior on a task. It inverts RL, focusing on learning the reward function given information about optimal action trajectories. IRL lends itself naturally to a robot learning from demonstrations by a human teacher (often called the expert) in controlled environments, and therefore finds application in robot learning from demonstration [Argall09:Survey; Osa18:Algorithmic].
Previous methods for IRL [Abbeel04:Apprenticeship; BabesVroman11:Apprenticeship; Boularias12:Structured; Choi11:Inverse; Ramachandran07:Bayesian; Ziebart08:Maximum] typically operate on large batches of observations and yield an estimate of the expert's reward function in a one-shot manner. These methods fill the need of applications that predominantly center on imitation learning, where the task being performed is observed and must be replicated subsequently. However, newer applications of IRL are motivating the need for continuous learning from streaming data or data in mini-batches. Consider, for example, the task of forecasting a person's goals in an everyday setting by observing their activities with a body camera [Rhinehart17:First]. Alternately, a robotic learner observing continuous patrols from a vantage point is tasked with penetrating the patrols and reaching a goal location quickly without being spotted [Bogert14:Multi]. Both of these applications offer streaming observations and no episodic tasks, and would benefit from progressively learning and assessing the other agent's preferences.
In this paper, we present the first formal framework to facilitate investigations into online IRL. The framework, labeled incremental IRL (I2RL), establishes the key components of this problem and rigorously defines the notion of an incremental variant of IRL. Next, we focus on methods for online IRL. Very few methods exist [Jin10:Convergence; Rhinehart17:First] that are suited to online IRL, and we cast these in the formal context provided by I2RL. More importantly, we introduce a new method that generalizes recent advances in maximum entropy IRL with hidden training data to an online setting. We also establish key theoretical properties of this new method. In particular, we show that its performance improves monotonically with more data, and that we may probabilistically bound the error in estimating the true reward after some data, both under full observability and when some data is hidden.
Our experiments evaluate the benefit of online IRL on the previously introduced robotic application of penetrating continuous patrols under occlusion [Bogert14:Multi]. We comprehensively demonstrate that the proposed incremental method achieves learning performance similar to that of the previous batch method, but in significantly less time, thereby suffering far fewer timeouts and achieving a significantly improved success rate. Consequently, this paper makes important initial contributions toward the nascent problem of online IRL by offering both a formal framework, I2RL, and a new general method that performs well.
2 Background on IRL
Informally, IRL refers to both the problem and the method by which an agent learns the preferences of another agent that explain the latter's observed behavior [Russell98:Learning]. Usually considered an "expert" in the task it is performing, the observed agent, say $E$, is modeled as executing the optimal policy of a standard MDP defined as $\langle S, A, T, R_E \rangle$. The learning agent $L$ is assumed to perfectly know the parameters of the MDP except the reward function. Consequently, the learner's task may be viewed as finding a reward function under which the expert's observed behavior is optimal.
This problem is ill-posed in general because for any given behavior there are infinitely many reward functions that align with the behavior. Ng and Russell [Ng00:Algorithms] first formalized this task as a linear program in which the reward function that maximizes the difference in value between the expert's policy and the next-best policy is sought. Abbeel and Ng [Abbeel04:Apprenticeship] present an algorithm that allows the expert to provide task demonstrations instead of its policy. The reward function is modeled as a linear combination of $K$ binary features, $\phi_k : S \times A \to [0,1]$, $k \in \{1, \ldots, K\}$, each of which maps a state from the set of states $S$ and an action from the set of $E$'s actions $A$ to a value in [0,1]. Note that non-binary feature functions can always be converted into binary feature functions, although there will be more of them. Throughout this article, we assume that these features are known to or selected by the learner. The reward function for expert $E$ is then defined as $R_E(s,a) = \boldsymbol{\theta}^\top \boldsymbol{\phi}(s,a) = \sum_{k=1}^{K} \theta_k \, \phi_k(s,a)$, where $\theta_k$ are the weights in vector $\boldsymbol{\theta}$; let $\mathcal{R}$ be the continuous space of the reward functions. The learner's task is reduced to finding a vector of weights that completes the reward function, and subsequently the MDP, such that the demonstrated behavior is optimal.
To assist in finding the weights, feature expectations are calculated for the expert's demonstration and compared to those of possible trajectories [Ziebart08:Maximum]. A demonstration is provided as one or more trajectories, each a sequence of length-$T$ state-action pairs, $\langle (s_1,a_1), (s_2,a_2), \ldots, (s_T,a_T) \rangle$, corresponding to an observation of the expert's behavior across $T$ time steps. Feature expectations of the expert are discounted averages over all observed trajectories, $\hat{\phi}_k = \frac{1}{|\mathcal{X}|} \sum_{X \in \mathcal{X}} \sum_{t=1}^{T} \gamma^t \phi_k(s_t,a_t)$, where $X$ is a trajectory in the set of all observed trajectories $\mathcal{X}$, and $\gamma \in (0,1)$ is a discount factor. Given a set of reward weights, the expert's MDP is completed and solved optimally to produce $\pi_E$. The difference $\hat{\phi}_k - \phi_k^{\pi_E}$ provides a gradient with respect to the reward weights for a numerical solver.
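As a concrete illustration, the discounted feature expectations above can be computed directly from a demonstration. The sketch below uses a toy indicator feature; the function and feature names are illustrative, not from the paper.

```python
import numpy as np

def feature_expectations(trajectories, phi, num_features, gamma=0.95):
    """Discounted feature expectations averaged over a demonstration.

    trajectories: list of trajectories, each a list of (state, action) pairs.
    phi: feature map returning a vector of length num_features.
    """
    phi_hat = np.zeros(num_features)
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            phi_hat += (gamma ** t) * phi(s, a)  # discount by time step
    return phi_hat / len(trajectories)  # average over trajectories

# Toy binary feature: "does the action index match the state index?"
phi = lambda s, a: np.array([1.0 if s == a else 0.0])
demo = [[(0, 0), (1, 1), (2, 0)]]  # one trajectory of three state-action pairs
print(feature_expectations(demo, phi, 1))  # -> [1.95] = 1 + 0.95 + 0
```

The returned vector is exactly the empirical statistic that the solver compares against the learned policy's feature expectations.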
2.1 Maximum Entropy IRL
While expected to be valid in some contexts, the max-margin approach of Abbeel and Ng [Abbeel04:Apprenticeship] introduces a bias into the learned reward function in general. To address this, Ziebart et al. [Ziebart08:Maximum] find the distribution with maximum entropy over all trajectories that is constrained to match the observed feature expectations:
$$\max_{\Delta} \left( -\sum_{X \in \mathbb{X}} P(X) \log P(X) \right) \quad \text{subject to} \quad \sum_{X \in \mathbb{X}} P(X) = 1, \qquad \sum_{X \in \mathbb{X}} P(X) \sum_{t=1}^{T} \gamma^t \phi_k(s_t,a_t) = \hat{\phi}_k \;\; \forall k \qquad (1)$$
Here, $\Delta$ is the space of all distributions $P(X)$ over the space $\mathbb{X}$ of all trajectories. We denote the analytical feature expectation on the left-hand side of the second constraint above as $E_{\mathbb{X}}[\phi_k]$. Equivalently, as the distribution $P(X)$ is parameterized by the learned weights $\boldsymbol{\theta}$, $E_{\mathbb{X}}[\phi_k]$ represents the feature expectations of the policy computed using the learned reward function. The benefit is that this distribution makes no further assumptions beyond those needed to match its constraints and is maximally noncommittal to any one trajectory. As such, it is the most generalizable, being the least wrong most often of all alternative distributions. A disadvantage of this approach is that it becomes intractable for long trajectories because the set of trajectories grows exponentially with length. In this regard, another formulation defines the maximum entropy distribution over policies [Boularias12:Structured], the size of which is also large but fixed.
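The solution to this program is log-linear in the reward weights, $P(X) \propto \exp(\boldsymbol{\theta}^\top \boldsymbol{\phi}(X))$. Over a small enumerable set of trajectories the distribution can be sketched directly (illustrative code, not the authors' implementation):

```python
import numpy as np

def maxent_trajectory_distribution(traj_features, theta):
    """P(X) proportional to exp(theta . phi(X)), where each trajectory X is
    summarized by its discounted feature vector."""
    scores = np.exp(traj_features @ theta)
    return scores / scores.sum()

# Two trajectories with feature vectors [1, 0] and [0, 1].
features = np.array([[1.0, 0.0], [0.0, 1.0]])
print(maxent_trajectory_distribution(features, np.zeros(2)))  # uniform: [0.5 0.5]
```

With zero weights no trajectory is preferred, so the distribution is maximally noncommittal, matching the entropy objective; nonzero weights tilt probability toward trajectories with higher cumulative reward.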
2.2 IRL under Occlusion
Our motivating application involves a subject robot that must observe other mobile robots from a fixed vantage point. Its sensors allow it a limited observation area; within this area it can observe the other robots fully, while outside this area it cannot observe at all. Previous methods [Bogert14:Multi; Bogert15:Toward] denote this special case of partial observability, where certain states are either fully observable or fully hidden, as occlusion. Subsequently, the trajectories gathered by the learner exhibit missing data associated with time steps where the expert robot is in one of the occluded states. The empirical feature expectation of the expert will therefore exclude the occluded states (and the actions in those states).
To ensure that the feature expectation constraint in IRL accounts for the missing data, Bogert and Doshi [Bogert14:Multi], while maximizing entropy over policies [Boularias12:Structured], limit the calculation of feature expectations for policies to the observable states only. A recent approach [Bogert16:Expectation] improves on this method by taking an expectation over the missing data conditioned on the observations. Completing the missing data in this way allows the use of all states in the constraint, and with it the Lagrangian dual's gradient as well. The nonlinear program in (1) is modified to account for the hidden data and its expectation.
Let $Y$ be the observed portion of a trajectory, $Z$ be one way of completing the hidden portions of this trajectory, $\mathbb{Z}$ be the set of all possible $Z$, and $X = Y \cup Z$. Now we may treat $Z$ as a latent variable and take the expectation to arrive at a new definition for the expert's feature expectations:
$$\hat{\phi}^{Z|Y}_k = \frac{1}{|\mathcal{Y}|} \sum_{Y \in \mathcal{Y}} \sum_{Z \in \mathbb{Z}} P(Z \mid Y; \boldsymbol{\theta}) \sum_{t=1}^{T} \gamma^t \phi_k(s_t,a_t) \qquad (2)$$
where $P(Z \mid Y; \boldsymbol{\theta})$ is the distribution over completions conditioned on the observed portion, $\mathcal{Y}$ is the set of all observed $Y$, and $\mathbb{X}$ is the set of all complete trajectories. The program in (1) is modified by replacing $\hat{\phi}_k$ with $\hat{\phi}^{Z|Y}_k$, as we show below. Notice that in the case of no occlusion $\mathbb{Z}$ is empty and $Y = X$. Therefore $\hat{\phi}^{Z|Y}_k = \hat{\phi}_k$, and this method reduces to (1). Thus, this method generalizes the previous maximum entropy IRL method.
$$\max_{\Delta} \left( -\sum_{X \in \mathbb{X}} P(X) \log P(X) \right) \quad \text{subject to} \quad \sum_{X \in \mathbb{X}} P(X) = 1, \qquad \sum_{X \in \mathbb{X}} P(X) \sum_{t=1}^{T} \gamma^t \phi_k(s_t,a_t) = \hat{\phi}^{Z|Y}_k \;\; \forall k \qquad (3)$$
However, the program in (3) becomes nonconvex due to the presence of the latent variable $Z$. As such, finding its optima by Lagrangian relaxation is not trivial. Wang et al. [Wang12:Latent] suggest a log-linear approximation to obtain the maximizing distribution, and cast the problem of finding the reward weights as likelihood maximization that can be solved within the schema of expectation-maximization [Dempster77:EM]. An application of this approach to the problem of IRL under occlusion yields the following two steps, with more details in [Bogert16:Expectation]:
E-step: This step involves calculating Eq. 2 to arrive at $\hat{\phi}^{Z|Y,(i)}_k$, a conditional expectation of the feature functions using the parameter $\boldsymbol{\theta}^{(i-1)}$ from the previous iteration. We may initialize the parameter vector randomly.
M-step: In this step, the modified program (3) is optimized by utilizing $\hat{\phi}^{Z|Y,(i)}_k$ from the E-step above as the expert's constant feature expectations to obtain $\boldsymbol{\theta}^{(i)}$. Normalized exponentiated gradient descent [Steinhardt14:Adaptivity] solves the program.
As EM may converge to local optima, this process is repeated with random initial $\boldsymbol{\theta}$, and the solution with the maximum entropy is chosen as the final one.
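The E- and M-steps with random restarts can be sketched as a generic loop; the callables below stand in for the actual steps described above, and all names are illustrative assumptions:

```python
import random

def latent_maxent_em(init, e_step, m_step, entropy,
                     num_restarts=5, max_iters=50, tol=1e-6):
    """EM with random restarts: run E/M to convergence from several random
    initializations and keep the solution with maximum entropy.

    init() -> random initial weight vector theta.
    e_step(theta) -> expected feature counts given current theta.
    m_step(phi_hat) -> new theta matching those feature counts.
    entropy(theta) -> entropy of the induced distribution, used to pick
    among local optima.
    """
    best_theta, best_entropy = None, float("-inf")
    for _ in range(num_restarts):
        theta = init()  # random initial reward weights
        for _ in range(max_iters):
            phi_hat = e_step(theta)       # E-step: expected feature counts
            new_theta = m_step(phi_hat)   # M-step: weights matching them
            converged = max(abs(a - b) for a, b in zip(new_theta, theta)) < tol
            theta = new_theta
            if converged:
                break
        h = entropy(theta)
        if h > best_entropy:  # keep the maximum-entropy local optimum
            best_theta, best_entropy = theta, h
    return best_theta

# Toy fixed-point problem: the composed E/M update halves theta, so the
# loop should converge to (approximately) zero from any restart.
theta = latent_maxent_em(init=lambda: [random.uniform(-1, 1)],
                         e_step=lambda th: [0.5 * t for t in th],
                         m_step=lambda ph: ph,
                         entropy=lambda th: -sum(t * t for t in th))
```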
3 Incremental IRL (I2RL)
We present our framework, labeled I2RL, in order to realize IRL in an online setting. We then present an initial set of methods cast into the framework of I2RL. In addition to casting previous techniques for online IRL into it, we introduce a new method that generalizes maximum entropy IRL involving latent variables.
3.1 Framework
Expanding on the notation and definitions given previously in Section 2, we introduce some new notation, which will help us in defining I2RL rigorously.
Definition 1 (Set of fixed-length trajectories).
The set of all trajectories of finite length $T$ from an MDP attributed to the expert $E$ is defined as $\mathbb{X}^T = \{(s_1, a_1, \ldots, s_T, a_T) \mid s_t \in S, a_t \in A, t \in \{1, \ldots, T\}\}$.
Let $\mathbb{T}$ be a bounded set of natural numbers. Then, the set of all trajectories is $\mathbb{X} = \bigcup_{T \in \mathbb{T}} \mathbb{X}^T$. Recall that a demonstration is some finite set of trajectories of varying lengths, $\mathcal{X} \subseteq \mathbb{X}$, and it includes the empty set.¹ Subsequently, we may define the set of demonstrations.
¹Repeated trajectories in a demonstration can usually be excluded for many methods without impacting the learning.
Definition 2 (Set of demonstrations).
The set of demonstrations is the set of all subsets of the space of trajectories of varying lengths. Therefore, it is the power set $2^{\mathbb{X}}$.
In the context of the definitions above, traditional IRL attributes an MDP without the reward function to the expert and usually involves determining an estimate of the expert's reward function, $\hat{R}_E$, which best explains the observed demonstration $\mathcal{X}$. As such, we may view IRL as a function that maps the expert's MDP sans reward function and a demonstration to a reward estimate: $(M_E/R_E, \mathcal{X}) \mapsto \hat{R}_E$.
To establish the definition of I2RL, we must first define a session of I2RL. Let $\hat{R}_E^0$ be an initial estimate of the expert's reward function.
Definition 3 (Session).
A session of I2RL represents the following function: $\zeta_i(M_E/R_E, \mathcal{X}_i, \hat{R}_E^{i-1}) \to \hat{R}_E^i$. In this session, where $i \geq 1$, I2RL takes as input the expert's MDP sans the reward function, the demonstration observed since the previous session, $\mathcal{X}_i$, and the reward function estimate learned from the previous session, $\hat{R}_E^{i-1}$. It yields a revised estimate of the expert's rewards from this session of IRL.
Note that we may replace the reward function estimates with some statistic that is sufficient to compute them (e.g., the expert's feature expectations).
We may let the sessions run infinitely. Alternately, we may establish stopping criteria for I2RL, which would allow us to automatically terminate the sessions once a criterion is satisfied. Let $LL(\hat{R}_E^i \mid \mathcal{X}_{1:i})$ be the log likelihood of the demonstrations received up to session $i$ given the current estimate of the expert's reward function. We may view this likelihood as a measure of how well the learned reward function explains the observed data.
Definition 4 (Stopping criterion #1).
Terminate the sessions of I2RL when $\left| LL(\hat{R}_E^i \mid \mathcal{X}_{1:i}) - LL(\hat{R}_E^{i-1} \mid \mathcal{X}_{1:i-1}) \right| \leq \epsilon$, where $\epsilon$ is a very small positive number and is given.
Definition 4 reflects the fact that additional sessions are no longer impacting the learning significantly. On the other hand, a more effective stopping criterion is possible if we know the expert's true policy. We utilize the inverse learning error [Choi11:Inverse] in this criterion, which gives the loss of value if $L$ uses the learned policy on the task instead of the expert's: $ILE = \lVert V^{\pi_E} - V^{\pi_L} \rVert_1$. Here, $V^{\pi_E}$ is the optimal value function of $E$'s MDP and $V^{\pi_L}$ is the value function due to utilizing the learned policy $\pi_L$ in $E$'s MDP. Notice that when the learned reward function results in an optimal policy identical to $E$'s, the ILE will be zero; it increases monotonically as the two policies increasingly diverge in value.
Definition 5 (Stopping criterion #2).
Terminate the sessions of I2RL when $\lVert V^{\pi_E} - V^{\pi_L^i} \rVert_1 \leq \epsilon$, where $\epsilon$ is a very small positive error and is given. Here, $\pi_L^i$ is the optimal policy obtained from solving the expert's MDP with the reward function learned in session $i$.
Obviously, prior knowledge of the expert’s policy is not common. Therefore, we view this criterion as being more useful during the formative assessments of I2RL methods.
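ILE needs only the two value functions over states; a minimal sketch follows (the choice of the 1-norm as the default is an assumption here, as is every name):

```python
import numpy as np

def inverse_learning_error(v_expert, v_learned, p=1):
    """ILE: p-norm distance between the expert's optimal value function and
    the value function of the learned policy, both on the expert's MDP."""
    return float(np.linalg.norm(np.asarray(v_expert) - np.asarray(v_learned),
                                ord=p))

print(inverse_learning_error([1.0, 2.0], [1.0, 2.0]))  # 0.0: values agree
print(inverse_learning_error([1.0, 2.0], [1.0, 1.5]))  # 0.5
```

As noted above, a zero ILE indicates that the learned policy loses no value relative to the expert's, even if the two policies differ on some states.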
Definition 6 (I2RL).
Incremental IRL (I2RL) is a sequence of learning sessions $\{\zeta_1, \zeta_2, \zeta_3, \ldots\}$, which continue infinitely, or until either stopping criterion #1 or #2 is met, depending on which one is chosen a priori.
3.2 Methods
Our goal is to facilitate a portfolio of methods, each with its own appealing properties, under the framework of I2RL. This would enable online IRL in various applications. An early method for online IRL [Jin10:Convergence] modifies Ng and Russell's linear program [Ng00:Algorithms] to take as input a single trajectory (instead of a policy) and replaces the linear program with an incremental update of the reward function. We may easily present this method within the framework of I2RL. A session of this method is realized as follows: each $\mathcal{X}_i$ is a single state-action pair $(s, a)$; the initial reward function $\hat{R}_E^0$ is zero, whereas for $i \geq 1$, $\hat{R}_E^i = \hat{R}_E^{i-1} + \alpha \Delta$, where $\Delta$ is the difference in expected value between the observed action at state $s$ and the (predicted) optimal action obtained by solving the MDP with the reward function $\hat{R}_E^{i-1}$, and $\alpha$ is the learning rate. While no explicit stopping criterion is specified, the incremental method terminates when it runs out of observed state-action pairs. Jin et al. [Jin10:Convergence] provide the algorithm for this method as well as error bounds.
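A session of this early method can be sketched as a single tabular update; the exact update rule and step-size schedule in Jin et al. may differ, so treat this as illustrative:

```python
def incremental_session(theta, state, action, value_gap, alpha=0.1):
    """One online-IRL session on a single state-action pair (sketch): nudge
    the reward estimate for (state, action) by the value gap between the
    observed action and the currently predicted optimal action."""
    theta = dict(theta)  # copy the tabular reward estimate
    theta[(state, action)] = theta.get((state, action), 0.0) + alpha * value_gap
    return theta

# One session: the observed action was worth 2.0 more than predicted.
theta = incremental_session({}, state=0, action=1, value_gap=2.0)
print(theta[(0, 1)])  # 0.2
```

Sessions simply repeat this update over the stream of state-action pairs, terminating when the stream is exhausted.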
A recent method by Rhinehart and Kitani [Rhinehart17:First] performs online IRL for activity forecasting. Casting this method into the framework of I2RL, a session is $\zeta_i(M_E/R_E, \mathcal{X}_i, \boldsymbol{\theta}^{i-1})$, which yields $\boldsymbol{\theta}^i$. The input observations for the session, $\mathcal{X}_i$, are all the activity trajectories observed between reaching the previous goal and reaching the next one. The session IRL finds the reward weights that minimize the margin between the expert's empirical feature expectations and those of the learned policy, using gradient descent. Here, the expert's policy is obtained by using soft value iteration to solve the complete MDP, which includes a reward function estimate obtained using the previous weights $\boldsymbol{\theta}^{i-1}$. No explicit stopping criterion is utilized for the online learning, thereby emphasizing its continuity.
3.2.1 Incremental Latent MaxEnt
We present a new method for online IRL under the I2RL framework, which modifies the latent maximum entropy (LME) optimization of Section 2.2. It offers the capability to perform online IRL in contexts where portions of the observed trajectory may be occluded.
For differentiation, we refer to the original method as the batch version. Recall the feature expectation of the expert computed in Eq. 2 as part of the E-step. If $X_j = Y_j \cup Z_j$ is a trajectory in session $i$ composed of the observed portion $Y_j$ and the hidden portion $Z_j$, and $\hat{\phi}^{Z|Y,(1:i)}_k$ is the expectation computed using all trajectories obtained so far, we may rewrite Eq. 2 for feature $k$ as:
$$\hat{\phi}^{Z|Y,(1:i)}_k = \frac{1}{|\mathcal{Y}_{1:i}|} \left( |\mathcal{Y}_{1:i-1}| \; \hat{\phi}^{Z|Y,(1:i-1)}_k + \sum_{Y \in \mathcal{Y}_i} \sum_{Z \in \mathbb{Z}} P(Z \mid Y; \boldsymbol{\theta}) \sum_{t=1}^{T} \gamma^t \phi_k(s_t,a_t) \right) \qquad (4)$$
A session of our incremental LME takes as input the expert's MDP sans the reward function, the current session's trajectories, the number of trajectories observed until the start of this session, and the expert's empirical feature expectation and reward weights from the previous session. More concisely, each session is denoted by $\zeta_i(M_E/R_E, \mathcal{Y}_i, |\mathcal{Y}_{1:i-1}|, \hat{\phi}^{Z|Y,(1:i-1)}, \boldsymbol{\theta}^{i-1})$. In each session, the feature expectations using that session's observed trajectories are computed, and the output feature expectations are obtained by including these as shown above in Eq. 4. Of course, each session may involve several iterations of the E- and M-steps until the converged reward weights $\boldsymbol{\theta}^i$ are obtained, thereby giving the corresponding reward function estimate. Here, the expert's feature expectation is a sufficient statistic for computing the reward function. We refer to this method as LME I2RL.
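The update in Eq. 4 maintains the expert's feature expectations as a running average over all trajectories seen so far, which is exactly the sufficient statistic that a session passes forward. A sketch with illustrative names, where each session trajectory has already been reduced to its expected, discounted feature vector:

```python
def update_feature_expectations(phi_hat_prev, n_prev, session_phi_hats):
    """Fold the current session's per-trajectory (expected) feature vectors
    into the running average over all trajectories seen so far (Eq. 4 style)."""
    n_new = len(session_phi_hats)
    total = [n_prev * p for p in phi_hat_prev]  # undo the previous average
    for phi in session_phi_hats:
        total = [t + f for t, f in zip(total, phi)]
    n = n_prev + n_new
    return [t / n for t in total], n

# One prior trajectory averaged to [1.0]; one new trajectory contributes [3.0].
phi_hat, n = update_feature_expectations([1.0], n_prev=1, session_phi_hats=[[3.0]])
print(phi_hat, n)  # [2.0] 2
```

Only the running average and the trajectory count need to be stored between sessions; no past trajectories are revisited.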
Wang et al. [Wang2002] show that if the distribution over the trajectories in (3) is log-linear, then the reward function that maximizes the entropy of the trajectory distribution also maximizes the log likelihood of the observed portions of the trajectories. Given this linkage with log likelihood, the stopping criterion #1 given in Def. 4 is utilized. In other words, the sessions terminate when $\left| LL(\boldsymbol{\theta}^i \mid \mathcal{Y}_{1:i}) - LL(\boldsymbol{\theta}^{i-1} \mid \mathcal{Y}_{1:i-1}) \right| \leq \epsilon$, where $\boldsymbol{\theta}^i$ fully parameterizes the reward function estimate from session $i$ and $\epsilon$ is a given acceptable difference. The algorithm for this method is presented in the supplementary file.
Incremental LME admits some significant convergence guarantees, with a confidence of meeting the desired error specification on the demonstration likelihood. We state these results with discussions but defer the detailed proofs to the supplementary file.
Lemma 1 (Monotonicity).
Incremental LME increases the demonstration likelihood monotonically with each new session, $LL(\boldsymbol{\theta}^i \mid \mathcal{Y}_{1:i}) \geq LL(\boldsymbol{\theta}^{i-1} \mid \mathcal{Y}_{1:i-1})$, once sufficient training data has accumulated, yielding a feasible solution to I2RL with missing training data.
While Lemma 1 suggests that the log likelihood of the demonstration can only improve from session to session after a significant amount of training data has been accumulated, a stronger result illuminates the confidence with which it approaches, over a series of sessions, the log likelihood of the expert's true weights $\boldsymbol{\theta}^*$. To establish this convergence result, we first focus on the full-observability setting. For a desired bound on the log-likelihood loss (the difference between the likelihood w.r.t. the expert's $\boldsymbol{\theta}^*$ and that w.r.t. the learned $\boldsymbol{\theta}^i$) for session $i$ in this setting, the confidence is bounded as follows:
Theorem 1 (Confidence for incremental max-entropy IRL).
Given $\mathcal{X}_{1:i}$ as the (fully observed) demonstration up to session $i$, $\boldsymbol{\theta}^*$ the expert's true weights, and $\boldsymbol{\theta}^i$ the solution of session $i$ for incremental max-entropy IRL, the log-likelihood loss $LL(\boldsymbol{\theta}^* \mid \mathcal{X}_{1:i}) - LL(\boldsymbol{\theta}^i \mid \mathcal{X}_{1:i})$ is bounded by $\epsilon$ with probability at least $1 - \delta$, where $\delta$ shrinks as the accumulated demonstration grows.
As a step toward relaxing the assumption of full observability made above, we first consider the error in approximating the feature expectations of the unobserved portions of the data, accumulated from the first to the current session of I2RL. Notice that $\hat{\phi}^{Z|Y,(1:i)}_k$, given by Eq. 4, is an approximation of the full-observability expectation, estimated by sampling the hidden $Z$ from $P(Z \mid Y; \boldsymbol{\theta})$ [Bogert16:Expectation]. The following lemma relates the error due to this sampling-based approximation to the difference between the feature expectations of the learned policy and those estimated for the expert's true policy.
Lemma 2 (Constraint bounds for incremental LME).
Suppose $\mathcal{Y}_{1:i}$ has portions of its trajectories occluded from the learner. Let $\epsilon_s$ be a bound on the error after $N$ samples are used for the approximation. Then, with probability at least $1 - \delta_s$, the difference between the feature expectations of the learned policy and those estimated for the expert's true policy is bounded in terms of $\epsilon$ and $\epsilon_s$, where $\epsilon$ and $\delta$ are as defined in Theorem 1.
Theorem 1 combined with Lemma 2 now allows us to completely relax the assumption of full observability as follows:
Theorem 2 (Confidence for incremental LME).
Let $\mathcal{Y}_{1:i}$ be the observed portions of the demonstration until session $i$, let $\epsilon_s$, $\delta_s$, and $N$ be inputs as defined in Lemma 2, and let $\boldsymbol{\theta}^i$ be the solution of session $i$ for incremental LME; then the log-likelihood loss for session $i$ is bounded by $\epsilon$ with a confidence of at least $1 - \delta'$, where $\delta'$ combines the confidence parameters of Theorem 1 and Lemma 2.
Given the required inputs and a fixed $\epsilon$, Theorem 2 computes a confidence. The amount of required demonstration can be decided based on the desired magnitude of this confidence.
4 Experiments
We evaluate the benefit of online IRL on the perimeter patrol domain introduced by Bogert and Doshi [Bogert14:Multi], simulated in ROS using data and files made publicly available. It involves a robotic learner observing two guards continuously patrol a hallway, as shown in Fig. 1 (left). The learner is tasked with reaching the cell marked 'X' (Fig. 1 (right)) without being spotted by either patroller. Each guard can see up to 3 grid cells in front of it. This domain differs from the usual applications of IRL toward imitation learning. In particular, the learner must solve its own distinct decision-making problem (modeled as another MDP) that is reliant on knowing how the guards patrol, which can be estimated by inferring each guard's preferences. The guards utilized two types of binary state-action features, leading to a total of six: does the current action in the current state make the guard change its grid cell? And, is the robot's current cell in a given region of the grid, which is divided into five regions? The true weight vector for these feature functions is known and is used for evaluation.
Notice that the learner’s vantage point limits its observability. Therefore, this domain requires IRL under occlusion. Among the methods discussed in Section 2, LME allows IRL when portions of the trajectory are hidden, as mentioned previously. To establish the benefit of I2RL, we compare the performances of both batch and incremental variants of this method.
The efficacy of the methods was compared using the following metrics: learned behavior accuracy (LBA), which is the percentage of overlap between the learned policy's actions and the demonstrated policy's actions; ILE, which was defined previously in Section 3.1; and success rate, which is the percentage of runs where the learner $L$ reaches the goal state undetected. Note that when the learned behavior accuracy is high, we expect ILE to be low. However, as MDPs admit multiple optimal policies, a low ILE need not translate into a high behavior accuracy. As such, these two metrics are not strictly correlated.
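For concreteness, LBA reduces to per-state action agreement between the two policies; a sketch in which policies are plain dicts (an illustrative assumption):

```python
def learned_behavior_accuracy(learned_policy, expert_policy):
    """Percentage of states on which the learned policy's action matches
    the demonstrated (expert) policy's action."""
    matches = sum(1 for s, a in expert_policy.items()
                  if learned_policy.get(s) == a)
    return 100.0 * matches / len(expert_policy)

# The policies agree on one of two states.
print(learned_behavior_accuracy({0: "move", 1: "turn"},
                                {0: "move", 1: "stop"}))  # 50.0
```

Note that a 100% LBA forces ILE to zero, but the converse fails when the MDP has multiple optimal policies, consistent with the caveat above.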
We report the LBA, ILE, and the duration in seconds of the inverse learning for both batch and incremental LME in Fig. 2, under 30% and 70% degrees of observability. Each data point is averaged over 100 trials for a fixed degree of observability and a fixed number of state-action pairs in the demonstration. While the entire demonstration is given as input to the batch variant, the input $\mathcal{X}_i$ for each session of the incremental variant has one trajectory. As such, the incremental learning stops when there are no more trajectories remaining to be processed. To better understand any differentiation in performance, we introduce a third variant that does not utilize the previous session's reward weights, instead initializing them randomly in each session; we label it incremental LME with random weights.
As the size of the demonstration increases, both batch and incremental variants exhibit similar quality of learning, although initially the incremental variant performs slightly worse. Importantly, incremental LME achieves these learning accuracies in significantly less time than batch, with the speed-up ratio increasing to four as the demonstration grows.
Is there a benefit to the reduced learning time? We show the success rates of the learner when each of the three methods is utilized for IRL in Fig. 2. Incremental LME begins to demonstrate comparatively better success rates already under 30% observability, and improves further when the observability is at 70%. While batch LME's success rate does not exceed 40%, the incremental variant succeeds in reaching the goal location undetected in about 65% of the runs under full observability (the last data point). A deeper analysis of these differences reveals that batch LME suffers from a large percentage of timeouts: more than 50% under low observability, dropping to 10% under full observability. A timeout occurs when IRL fails to converge to a reward estimate in the amount of time allotted to each run. In contrast, incremental LME suffers from very few timeouts. Of course, other factors play a role in success as well.
5 Concluding Remarks
This paper makes an important contribution toward the nascent problem of online IRL by offering the first formal framework, I2RL, to help analyze the class of methods for online IRL. We presented a new method within the I2RL framework that generalizes recent advances in maximum entropy IRL to online settings. Our comprehensive experiments show that the new I2RL method indeed improves over the previous state of the art in time-limited domains, approximately reproducing its accuracy but in significantly less time. In particular, we have shown that, given practical constraints on computation time for an online IRL application, the new method suffers fewer timeouts and is thus able to solve the problem with a higher success rate. In addition to experimental validation, we have also established key theoretical properties of the new method, ensuring the desired monotonic progress within a precomputable confidence of convergence. Future avenues for investigation include understanding how I2RL can address challenges related to scalability to a larger number of experts, as well as accommodating unknown dynamics of the experts.
References

[1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Twenty-first International Conference on Machine Learning (ICML), pages 1–8, 2004.
[2] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
[3] M. Babes-Vroman, V. Marivate, K. Subramanian, and M. Littman. Apprenticeship learning about multiple intentions. In 28th International Conference on Machine Learning (ICML), pages 897–904, 2011.
[4] K. Bogert and P. Doshi. Multi-robot inverse reinforcement learning under occlusion with state transition estimation. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 1837–1838, 2015.
[5] K. Bogert and P. Doshi. Toward estimating others' transition models under occlusion for multi-robot IRL. In 24th International Joint Conference on Artificial Intelligence (IJCAI), pages 1867–1873, 2015.
[6] K. Bogert, J. F.-S. Lin, P. Doshi, and D. Kulic. Expectation-maximization for inverse reinforcement learning with hidden data. In 2016 International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 1034–1042, 2016.
[7] A. Boularias, O. Krömer, and J. Peters. Structured apprenticeship learning. In European Conference on Machine Learning and Knowledge Discovery in Databases, Part II, pages 227–242, 2012.
[8] J. Choi and K.-E. Kim. Inverse reinforcement learning in partially observable environments. Journal of Machine Learning Research, 12:691–730, 2011.
[9] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39:1–38, 1977.
[10] Z.-j. Jin, H. Qian, S.-y. Chen, and M.-l. Zhu. Convergence analysis of an incremental approach to online inverse reinforcement learning. Journal of Zhejiang University – Science C, 12(1):17–24, 2010.
[11] A. Ng and S. Russell. Algorithms for inverse reinforcement learning. In Seventeenth International Conference on Machine Learning (ICML), pages 663–670, 2000.
[12] T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, and J. Peters. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1-2):1–179, 2018.
[13] D. Ramachandran and E. Amir. Bayesian inverse reinforcement learning. In 20th International Joint Conference on Artificial Intelligence (IJCAI), pages 2586–2591, 2007.
[14] N. Rhinehart and K. M. Kitani. First-person activity forecasting with online inverse reinforcement learning. In International Conference on Computer Vision (ICCV), 2017.
[15] S. Russell. Learning agents for uncertain environments (extended abstract). In Eleventh Annual Conference on Computational Learning Theory (COLT), pages 101–103, 1998.
[16] J. Steinhardt and P. Liang. Adaptivity and optimism: An improved exponentiated gradient algorithm. In 31st International Conference on Machine Learning (ICML), pages 1593–1601, 2014.
[17] S. Wang, R. Rosenfeld, Y. Zhao, and D. Schuurmans. The latent maximum entropy principle. In IEEE International Symposium on Information Theory, page 131, 2002.
[18] S. Wang, D. Schuurmans, and Y. Zhao. The latent maximum entropy principle. ACM Transactions on Knowledge Discovery from Data, 6(8), 2012.
[19] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In 23rd National Conference on Artificial Intelligence (AAAI), Volume 3, pages 1433–1438, 2008.