
IRL with Partial Observations using the Principle of Uncertain Maximum Entropy

The principle of maximum entropy is a broadly applicable technique for computing a distribution with the least amount of information possible while constrained to match empirically estimated feature expectations. However, in many real-world applications that use noisy sensors computing the feature expectations may be challenging due to partial observation of the relevant model variables. For example, a robot performing apprenticeship learning may lose sight of the agent it is learning from due to environmental occlusion. We show that in generalizing the principle of maximum entropy to these types of scenarios we unavoidably introduce a dependency on the learned model to the empirical feature expectations. We introduce the principle of uncertain maximum entropy and present an expectation-maximization based solution generalized from the principle of latent maximum entropy. Finally, we experimentally demonstrate the improved robustness to noisy data offered by our technique in a maximum causal entropy inverse reinforcement learning domain.



1 Introduction

The principle of maximum entropy is a technique for finding a distribution over random variable(s) that contains the least amount of information while still conforming to constraints. It has existed in various forms since the early 20th century but was formalized by Jaynes [9]. In a commonly encountered form, the constraints consist of matching sufficient statistics (e.g., feature expectations) under the maximum entropy model being learned and those observed empirically.

However, in many cases the feature expectations are not directly observable. It could be that the model contains hidden variables, that some data is missing or corrupted by noise, or that the relevant variables are only partially observable using existing sensors or processes. Such scenarios are common when attempting to use maximum entropy (MaxEnt) models to perform inference on real-world data.

As an example, let us take a simple natural language processing model. Using the principle of maximum entropy, each model element $x \in X$ may be a word in a vocabulary, and we wish to obtain a model $Pr(x)$ that matches the empirical distribution of words in a given document, $\widetilde{Pr}(x)$, as given by the expectation of some interesting features $\phi_k(x)$.

However, if the data gathered is an audio recording of spoken words, then the words themselves are never directly observed. Instead, we may extract observations from the recording that only partially reveal the words being spoken. For instance, suppose each observation $o$ corresponds to a phoneme. The observation function $Pr(o|x)$, the probability of hearing a phoneme for a given word, will be non-deterministic, as different dialects and accents pronounce the same word in different ways. Further, a poor-quality voice recording may cause uncertainty in the phoneme being spoken, forcing the use of a more general $Pr(o|x)$ to correctly model the data received.

We make the following contributions in this work.

  1. We introduce the principle of uncertain maximum entropy as a new non-linear program and analyze it.

  2. We show a solution to the non-linear program using expectation-maximization, which has been previously used to solve the principle of latent maximum entropy [16, 5].

  3. We apply the new principle to the domain of inverse reinforcement learning (IRL) [11] by generalizing the maximum causal entropy IRL algorithm [21] to observation noise. We experimentally demonstrate the efficacy of our new technique in a toy IRL application.

2 Background

We briefly review the principles of maximum entropy and latent maximum entropy, and discuss their use in IRL.

2.1 Principle of Maximum Entropy

The principle of maximum entropy may be expressed as a non-linear program:

$$\max_{Pr} \; -\sum_{x \in X} Pr(x) \log Pr(x) \qquad (1)$$

subject to

$$\sum_{x \in X} Pr(x) = 1, \qquad \sum_{x \in X} Pr(x)\,\phi_k(x) = \tilde{\phi}_k \quad \forall k$$

where $\tilde{\phi}_k = \sum_{x \in X} \widetilde{Pr}(x)\,\phi_k(x)$ are the empirical feature expectations. Notably, this program is known to be convex, which provides a number of benefits. Particularly relevant is that we may find a closed-form definition of $Pr(x)$, and solving the problem's dual is guaranteed to also solve the primal problem [6]. The solution is a log-linear or Boltzmann distribution over the model elements, $Pr(x) = e^{\sum_k \lambda_k \phi_k(x)} / Z(\lambda)$, where $Z(\lambda) = \sum_{x \in X} e^{\sum_k \lambda_k \phi_k(x)}$.
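As a concrete sketch of this program, the snippet below (with made-up features and empirical expectations) fits the Lagrange multipliers by minimizing the convex dual $\log Z(\lambda) - \lambda \cdot \tilde{\phi}$ and recovers the log-linear solution:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up example: 4 model elements, 2 binary features.
phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0],
                [0.0, 0.0]])              # phi[x, k] = phi_k(x)
phi_tilde = np.array([0.6, 0.5])          # empirical feature expectations

def dual(lam):
    # log Z(lambda) - lambda . phi_tilde; minimizing this convex dual
    # enforces the feature-matching constraints of the primal program
    return np.log(np.exp(phi @ lam).sum()) - lam @ phi_tilde

lam = minimize(dual, np.zeros(2)).x
p = np.exp(phi @ lam)
p /= p.sum()                              # the log-linear (Boltzmann) solution

print(p @ phi)                            # matches phi_tilde
```

Because the program is convex, any local minimum of the dual yields multipliers whose log-linear distribution satisfies the constraints.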

2.2 Principle of Latent Maximum Entropy

First presented in [17], the principle of latent maximum entropy generalizes the principle of maximum entropy to models with hidden variables that are never empirically observed.

Split each $x$ into $y$ and $z$. $y$ is the component of $x$ that is perfectly observed, while $z$ is perfectly un-observed and completes $x$. Thus, $x = (y, z)$ and $Pr(y) = \sum_{z} Pr(y, z)$. Latent maximum entropy corrects for the hidden portion of $x$ in the empirical data by summing over all $z$, which is every way of completing a given $y$ to arrive at an $x$. Its non-linear program is identical to Eq. 1 except the empirical feature expectations become $\tilde{\phi}_k = \sum_{y} \widetilde{Pr}(y) \sum_{z} Pr(z|y)\,\phi_k(y, z)$.

Since $Pr(z|y) = Pr(y, z) / \sum_{z'} Pr(y, z')$, the right side of the feature expectations contains a dependency on the model being learned, meaning the program is no longer convex and only an approximate solution can be found if we still desire a log-linear model for $Pr(x)$. This leads to an expectation-maximization solution: the resulting algorithm fills in the missing portions of the data with the complete model given the current parameters, which are successively improved by EM. To our knowledge, [17] is the first to apply EM to the principle of maximum entropy to account for incomplete data.

As the methodology and arguments used in this work are very similar to those in [18], the reader is encouraged to review that work for more background and discussion.

2.3 Inverse Reinforcement Learning

A Markov Decision Process [13] is defined as the tuple $(S, A, T, R, \gamma, \mu_0)$, where $S$ is a set of states, $A$ is a set of actions, $T : S \times A \times S \to [0, 1]$ is a function describing transition probabilities $Pr(s'|s, a)$, $R : S \times A \to \mathbb{R}$ is a reward function mapping states and actions to a real number, $\gamma$ is a discount rate, and $\mu_0$ is the initial state distribution. MDPs model a rational agent that is Markovian, i.e., its state variable completely describes all relevant information such that it does not need to consider its history when making decisions. An MDP is solved by using reinforcement learning to arrive at optimal behavior, which may be expressed as a policy $\pi : S \to A$ that describes the action to choose in every state.

IRL reverses this process. First presented by [11], we begin with an MDP missing its reward function and the behavior of an assumed-optimal agent. We are tasked with finding a reward function $R$ such that the agent's behavior is optimal in the resulting complete MDP. Applications of IRL include apprenticeship learning [1], wherein a learner agent wishes to perform a task as well as an expert agent. The learner utilizes IRL to recover a reward function that describes the expert's behavior and uses the learned rewards to complete its own MDP. By solving this MDP the learner arrives at a policy that may perform the same task in a different way, such as when it is physically dissimilar from the expert.

Unfortunately, IRL is ill-posed, as there exist an infinite number of reward functions under which a given behavior is optimal [11]. This makes evaluating the performance of IRL algorithms complicated, as directly comparing learned reward functions is not fruitful. In our experiments, we use a popular measure, inverse learning error (ILE) [7], of a learned reward, obtained as ILE $= \| V^{\pi^*} - V^{\pi^L} \|_1$, where $V^{\pi^*}$ is the value function computed using the true reward and the expert's optimal policy, and $V^{\pi^L}$ is the value function using the true rewards and the optimal policy for the learned reward.
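As an illustrative sketch (not the paper's evaluation code), ILE can be computed by exact policy evaluation; the tiny MDP, rewards, and policies below are made-up examples:

```python
import numpy as np

def policy_value(T, R, policy, gamma):
    """Evaluate a deterministic policy: V = (I - gamma * P_pi)^-1 r_pi.
    T[s, a, s'] are transition probs, R[s, a] rewards, policy[s] an action index."""
    n = T.shape[0]
    P = T[np.arange(n), policy]              # state-to-state matrix under the policy
    r = R[np.arange(n), policy]              # per-state reward under the policy
    return np.linalg.solve(np.eye(n) - gamma * P, r)

def ile(T, R_true, expert_policy, learned_policy, gamma=0.9):
    # Both value functions use the TRUE reward; only the policies differ.
    v_star = policy_value(T, R_true, expert_policy, gamma)
    v_learned = policy_value(T, R_true, learned_policy, gamma)
    return np.abs(v_star - v_learned).sum()  # L1 norm

# Hypothetical 2-state, 2-action MDP
T = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])
R_true = np.array([[1.0, 0.0], [0.0, 1.0]])
expert = np.array([0, 1])
ile_same = ile(T, R_true, expert, expert)
print(ile_same)   # 0.0 when the learned policy matches the expert's
```

ILE is zero exactly when the policy optimal for the learned reward attains the expert's value under the true reward, which is why it sidesteps direct comparison of reward functions.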

To address the ill-posedness of IRL, [20] introduced maximum entropy IRL. This algorithm makes use of the principle of maximum entropy by choosing $X$ to be the set of all possible MDP trajectories of the target agent. By assuming $R$ is a linear combination of feature functions, MaxEntIRL finds the one set of weights that results in the maximum entropy distribution over $X$ while matching empirical feature expectations. Later, [19] presented the MaxCausalEntIRL algorithm, which extended MaxEntIRL to non-deterministic MDPs. [5] extended MaxEntIRL to scenarios with missing data by employing the principle of latent maximum entropy. The resulting algorithm, HiddenDataEM, is restricted to scenarios with perfectly unknown sections of an agent's trajectory, due to occlusion of the environment for instance. It requires perfect knowledge of all other portions of the trajectory and is therefore unable to make use of any available partial information.

3 Uncertain Maximum Entropy

For our application domain of IRL with noisy real-world data we require an algorithm that both accounts for the ill-posedness of the problem and can make use of imperfect empirical data from noisy sensors. We begin by generalizing the principle of latent maximum entropy (and by extension, the principle of maximum entropy) to uncertain data, which we call the principle of uncertain maximum entropy.

Assume we want a maximum entropy model of some variables $X$ given we have observations $O$. Critically, we desire that the model does not include $O$, as the observations themselves pertain solely to the data-gathering technique used to produce them, not the elements or model being observed. Additionally, we assume the existence of a static observation function $Pr(o|x)$. Let $\widetilde{Pr}(o)$ be the empirical distribution of observations obtained from data. Our new non-linear program is:

$$\max_{Pr} \; -\sum_{x \in X} Pr(x) \log Pr(x) \qquad (2)$$

subject to

$$\sum_{x \in X} Pr(x) = 1, \qquad \sum_{x \in X} Pr(x)\,\phi_k(x) = \sum_{o \in O} \widetilde{Pr}(o) \sum_{x \in X} Pr(x|o)\,\phi_k(x) \quad \forall k$$

To solve Eq. 2, we first take the Lagrangian:

$$\Lambda(Pr, \lambda, \eta) = -\sum_{x} Pr(x) \log Pr(x) + \sum_k \lambda_k \left( \sum_{x} Pr(x)\,\phi_k(x) - \sum_{o} \widetilde{Pr}(o) \sum_{x} Pr(x|o)\,\phi_k(x) \right) + \eta \left( \sum_{x} Pr(x) - 1 \right)$$

Next, we find the Lagrangian's gradient so that we can set it to zero and attempt to solve for $Pr(x)$:

$$\frac{\partial \Lambda}{\partial Pr(x)} = -\log Pr(x) - 1 + \sum_k \lambda_k \left( \phi_k(x) - \frac{\partial}{\partial Pr(x)} \sum_{o} \widetilde{Pr}(o) \sum_{x'} Pr(x'|o)\,\phi_k(x') \right) + \eta$$
Unfortunately, the existence of $Pr(x|o)$ on the right side of the constraints causes the gradient to be non-linear in $Pr$. However, Wang et al. [18] note that we may approximate the gradient, which then allows $Pr(x)$ to be log-linear. In other words:

$$\frac{\partial}{\partial Pr(x)} \sum_{o} \widetilde{Pr}(o) \sum_{x'} Pr(x'|o)\,\phi_k(x') \approx 0$$

which gives,

$$Pr(x) \approx \frac{e^{\sum_k \lambda_k \phi_k(x)}}{Z(\lambda)}, \qquad Z(\lambda) = \sum_{x} e^{\sum_k \lambda_k \phi_k(x)} \qquad (5)$$

Now we enter our approximation back into the Lagrangian to arrive at an approximate dual:

$$\Lambda_{dual}(\lambda) \approx \log Z(\lambda) - \sum_k \lambda_k \sum_{o} \widetilde{Pr}(o) \sum_{x} Pr(x|o)\,\phi_k(x) \qquad (6)$$

We would now try to find the dual's gradient and use it to minimize the dual. Unfortunately, the presence of $Pr(x|o)$ still admits no closed-form solution in general. We will instead employ another technique to minimize it, as we show next.

3.1 EM Solution

Inspired by Wang et al. [18], the constraints of Eq. 2 can be satisfied by locally maximizing the likelihood of the observed data. One method is to derive an EM algorithm utilizing our log-linear model. We begin with the log likelihood of all the observations; subscripts indicate that a distribution is parameterized by the respective variable:

$$L(\lambda) = \sum_{o} \widetilde{Pr}(o) \log Pr_\lambda(o) = \sum_{o} \widetilde{Pr}(o) \log \sum_{x} Pr(o|x)\,Pr_\lambda(x)$$
The equality above follows because the observation function does not depend on $\lambda$. This leaves $Pr_\lambda(x)$ as the only function which depends on $\lambda$. The EM algorithm proceeds by maximizing $Q(\lambda, \lambda')$, and upon convergence $\lambda' = \lambda$, at which point the likelihood of the data is at a local maximum.

For the sake of completeness, we note that the remaining terms of the likelihood are the conditional entropy of the latent variables and the expected log observations. These impact the overall data likelihood only, not the model solution, because the observations are not included in the model. We now substitute the log-linear model of Eq. 5 into $Q(\lambda, \lambda')$:

$$Q(\lambda, \lambda') = \sum_{o} \widetilde{Pr}(o) \sum_{x} Pr_{\lambda'}(x|o) \left( \sum_k \lambda_k \phi_k(x) - \log Z(\lambda) \right) \qquad (9)$$
Notice that Eq. 9 is similar to Eq. 6. One important difference is that Eq. 9 is easier to solve, as $Pr_{\lambda'}(x|o)$ depends on $\lambda'$ and not $\lambda$. In fact, maximizing Eq. 9 is equivalent to solving the following program by minimizing its dual, Eq. 6:

$$\max_{Pr} \; -\sum_{x} Pr(x) \log Pr(x)$$

subject to

$$\sum_{x} Pr(x) = 1, \qquad \sum_{x} Pr(x)\,\phi_k(x) = \sum_{o} \widetilde{Pr}(o) \sum_{x} Pr_{\lambda'}(x|o)\,\phi_k(x) \quad \forall k$$

which equals Eq. 2 at convergence. We now arrive at the following EM algorithm:

1: Start: Initialize $\lambda^0$ randomly
2: E Step: Using $\lambda^t$, compute $\tilde{\phi}_k = \sum_{o} \widetilde{Pr}(o) \sum_{x} Pr_{\lambda^t}(x|o)\,\phi_k(x)$
3: M Step: Solve Equation 1's program with $\tilde{\phi}_k$ to arrive at a new $\lambda^{t+1}$
4: Repeat: Until $\lambda$ converges
Algorithm 1 uMaxEnt
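A minimal sketch of Algorithm 1 on a made-up model with three elements, two features, and two observations; the feature matrix, observation function, and empirical distribution below are illustrative assumptions, not values from the paper:

```python
import numpy as np
from scipy.optimize import minimize

phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])                  # phi[x, k]: features of element x
Pr_o_given_x = np.array([[0.8, 0.2],
                         [0.3, 0.7],
                         [0.5, 0.5]])         # static observation function Pr(o|x)
emp_o = np.array([0.55, 0.45])                # empirical distribution of observations

def fit_maxent(phi_tilde, lam0):
    # M step: ordinary MaxEnt, i.e. minimize the dual log Z(lam) - lam . phi_tilde
    dual = lambda lam: np.log(np.exp(phi @ lam).sum()) - lam @ phi_tilde
    return minimize(dual, lam0).x

lam = np.zeros(2)
for _ in range(200):                          # EM loop of Algorithm 1
    p = np.exp(phi @ lam); p /= p.sum()       # current log-linear model Pr_lam(x)
    joint = p[:, None] * Pr_o_given_x         # joint[x, o] = Pr_lam(x) Pr(o|x)
    post = joint / joint.sum(axis=0)          # E step: posterior Pr_lam(x|o)
    phi_tilde = (post @ emp_o) @ phi          # corrected feature expectations
    lam = fit_maxent(phi_tilde, lam)          # M step

p = np.exp(phi @ lam); p /= p.sum()           # converged model
print(p @ Pr_o_given_x)                       # predicted Pr(o), close to emp_o
```

At convergence the learned model both matches the corrected feature expectations and locally maximizes the likelihood of the observations, mirroring the fixed point of Eq. 2.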

4 Analysis of New Principle

To show that the principle of uncertain maximum entropy generalizes the principle of maximum entropy, let $O$ equal $X$ with $Pr(o|x) = \mathbb{1}(o = x)$. Then $Pr(x|o) = \mathbb{1}(x = o)$ and $\sum_{o} \widetilde{Pr}(o) \sum_{x} Pr(x|o)\,\phi_k(x) = \sum_{x} \widetilde{Pr}(x)\,\phi_k(x)$. Hence, Eq. 2's constraint terms reduce to those of Eq. 1. Similarly, the principle of uncertain maximum entropy generalizes the principle of latent maximum entropy [16] by allowing $O = Y$; then $Pr(x|o) = Pr(z|y)$ since $x = (y, z)$.

Additionally, in both cases the generalization holds with an arbitrary $O$ so long as the corresponding conditions relating the observation function to $X$ and to $Y$, respectively, are satisfied.

One valuable property for statistical inference models is the ability to guarantee correctness in the infinite limit of available data by making use of the law of large numbers. For linear and non-linear programs, this is satisfied by showing that the constraints hold true as the empirical distribution of data converges to the true distribution. Next, we show that the principle of uncertain maximum entropy exhibits this attribute, while the principle of maximum entropy does not except in special cases.

Notice that $\widetilde{Pr}(o) \to Pr(o)$ in the infinite limit of data, where $Pr(o) = \sum_{x} Pr(x)\,Pr(o|x)$, and therefore the constraints of Eq. 2 are satisfied as:

$$\sum_{o} Pr(o) \sum_{x} Pr(x|o)\,\phi_k(x) = \sum_{x} \sum_{o} Pr(x, o)\,\phi_k(x) = \sum_{x} Pr(x)\,\phi_k(x)$$
To use the principle of maximum entropy with imperfect data the observations must be converted into the model elements. For instance, in statistical physics, where the technique was first developed, the observations themselves are the result of a large number of smaller elements, such as measuring the temperature of a gas. So long as we constrain the model with such statistical moments, we may safely use the principle of maximum entropy.

In general, however, we must accept some amount of error in the observation-to-model-element conversion. As a result, in the infinite limit of data the converted empirical feature expectations do not converge to the model's, and therefore the equality constraints do not hold. The effects of this are ad hoc: for instance, errors in continuous data may be bounded to within an acceptable tolerance, while categorization errors in discrete models may result in unbounded error. This makes it difficult to use the technique with large data sets or in automated inference applications with noisy sensor data.

To illustrate, suppose we choose the most likely $x$ for each observed $o$; in other words, each $x$'s probability is the sum of the probabilities of those observed $o$'s for which it is the most likely element. In the limit, $\widetilde{Pr}(x) \to \sum_{o} Pr(o)\,\delta_{x,\arg\max_{x'} Pr(x'|o)}$, where $\delta$ is the Kronecker delta. This distribution does not equal $Pr(x)$ in general.
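The bias of this most-likely conversion can be checked numerically; in the following sketch with a hypothetical two-element model and noisy two-symbol observation function, the converted distribution settles on a value other than the true $Pr(x)$ even with infinite data:

```python
import numpy as np

true_x = np.array([0.5, 0.5])                 # true Pr(x)
Pr_o_given_x = np.array([[0.9, 0.1],          # x=0 usually emits o=0
                         [0.4, 0.6]])         # x=1 is ambiguous
Pr_o = true_x @ Pr_o_given_x                  # limiting Pr(o) = [0.65, 0.35]

# Most-likely conversion: credit each observation's mass to the x that
# most likely produced it (argmax over column o of Pr(o|x)).
ml_x = np.zeros(2)
for o in range(2):
    ml_x[np.argmax(Pr_o_given_x[:, o])] += Pr_o[o]

print(ml_x)   # [0.65, 0.35]: biased away from the true [0.5, 0.5]
```

Because the argmax discards the non-maximal probability mass, no amount of additional data corrects the discrepancy.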

To demonstrate empirically, we generated a large number of random maximum entropy programs of random size with random observation functions. For each, we generate a number of observations as well as their corresponding most-likely $x$'s and use these in our new technique and the principle of maximum entropy, respectively. Figure 1 shows the Kullback–Leibler divergence of the resulting learned models from the true models as the number of observations provided increases. Notice that uMaxEnt continues improving as more observations are given until it reaches the performance of the best-case control Inf Obs (where $\widetilde{Pr}(o)$ is set to the true $Pr(o)$). Meanwhile, ML MaxEnt converges at a higher KLD, and no additional data improves its performance.

Figure 1: Average KLD between the learned distribution over $X$ and the true distribution as observations increase in randomly generated uncertain maximum entropy programs. Note log scale on both axes. Error bars are standard deviation; each point is the result of 1100 runs.

5 Experiment

To evaluate the performance of our new technique in our inverse reinforcement learning application domain, we implemented a new algorithm, uMaxCausalEntIRL, by generalizing MaxCausalEntIRL using the principle of uncertain maximum entropy. Each $x$ becomes a tuple of the states and actions corresponding to one trajectory of the MDP. The empirical data of the agent being learned from may now be thought of as a specialized hidden Markov model, as shown in figure 2. In this view we concretely see the empirical data's dependency on the model being learned: the distribution over all states and actions over time given the observations may not be calculated due to the missing distribution $Pr(a|s)$. This distribution, the agent's policy, is only available after our EM-based algorithm is complete.

Figure 2: Plate diagram of a partially observed MDP-based agent. Only the observation is seen each timestep. Note $Pr(a|s)$, the agent's policy, which is unknown and must be learned to compute the complete distribution.

Our scenario is that of a fugitive being unknowingly tracked by a radio tower that is listening for pings coming from a tracker device hidden on them. We attempt to discern the fugitive's movements within a space, including identifying a safe house. Complicating matters, the tracking is imperfect, becoming less accurate the farther the fugitive is from the tower. Further, the tower may miss a transmission, and a mountain range greatly increases the probability of missing transmissions coming from behind it. We illustrate this scenario in figure 3.

Figure 3: Illustration of the fugitive scenario. The fugitive is attempting to reach safe house 1 (SH1) and may start in any state. Note the location of the radio tower on the left and the impassible mountain range.

We model the fugitive's movement using an MDP with 11 states and four actions corresponding to moving in the four cardinal directions. Movement is uncertain, as the fugitive may attempt to move in a direction but be unable to during a given timestep. This is modeled as an action having its intended movement effect 90% of the time, with the remaining probability mass distributed among the other actions' results (invalid movements cause the fugitive to remain in the same state). There are two potential safe houses as marked on the map; the correct one is SH1, and the fugitive is attempting to reach it. The fugitive may start from any state.

We model the tower's observations using 12 observations: 11 each corresponding to one state, representing that state being the most likely position of the fugitive, and one representing "no ping received". We vary the amount of noise in the observations by changing the standard deviation of the observation function; a "low" setting makes the "correct" observation (and those nearby) much more likely, while "high" spreads probability mass among most of the observations. We illustrate an example with these two standard deviation settings in figure 4.

Figure 4: Illustration of the observation function for typical low (top) and high (bottom) observation error settings. The location of the fugitive in both cases is the bottom left. Bars are arranged such that each is over its corresponding state. The "no ping received" observation is shown in the last column on the right.
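One possible way to construct such a distance-dependent observation function is sketched below; the grid coordinates, noise scale, and miss probability are purely illustrative assumptions, not the experiment's exact parameters:

```python
import numpy as np

# 11 state coordinates on a small grid plus the tower's location (made-up layout).
state_xy = np.array([(x, y) for x in range(4) for y in range(3)][:11], float)
tower_xy = np.array([0.0, 1.0])

def obs_distribution(s, sigma_base=0.5, sigma_per_dist=0.3, p_miss=0.1):
    """Pr(o|s): 11 state-ping observations plus a final 'no ping received' entry.
    Noise grows with the fugitive's distance from the tower."""
    dist_to_tower = np.linalg.norm(state_xy[s] - tower_xy)
    sigma = sigma_base + sigma_per_dist * dist_to_tower   # farther -> noisier
    d = np.linalg.norm(state_xy - state_xy[s], axis=1)    # distance to each ping's state
    ping = np.exp(-0.5 * (d / sigma) ** 2)                # Gaussian-like spread
    ping = (1.0 - p_miss) * ping / ping.sum()             # reserve mass for a miss
    return np.append(ping, p_miss)                        # last entry: "no ping"

P = obs_distribution(0)
print(P.sum())   # a valid distribution over the 12 observations
```

Raising `sigma_base` reproduces the "high" error setting qualitatively: probability mass spreads across most observations instead of concentrating near the true state.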

We developed a number of controls to better illustrate our new technique's performance. TRUE is the best-case scenario, in which we record the fugitive's true trajectories and use MaxCausalEntIRL. ML is an unrealistic worst-case procedure in which we simply record the most likely state for each received observation; in the case of a "missed ping" observation this absurdly records SH2. A more realistic procedure, WOERR, discards the "missed ping" observations to create missing data. Both of these controls use MaxCausalEntIRL on their resulting data sets. Finally, cHiddenDataEM, a simple adaptation of HiddenDataEM to use MaxCausalEntIRL, performs HiddenDataEM on the dataset produced by WOERR to account for the missing data.

We present our results in figure 5, showing the average ILE achieved by each algorithm as the number of trajectories increases. Notice that in the low error setting cHiddenDataEM improves at first, but as more trajectories are provided it plateaus at a relatively high error; ML fails to improve at all, and WOERR stops improving after only a few trajectories. In the high error setting the control algorithms show minimal improvement after the first few trajectories. Only uMaxCausalEntIRL is able to learn accurately in both error settings.

Figure 5: Average ILE achieved by various algorithms in the fugitive scenario with low (top) and high (bottom) observation error settings. Note log scale on the y axis. Error bars are standard deviation. Points are the result of 800 (low) and 160 (high) trials.

6 Related Work

To our knowledge, maximum entropy models utilizing imperfect data have received comparatively little attention. By contrast, applications with missing data are more common and may use latent maximum entropy to account for it, such as missing labels in a supervised learning scenario [8], species distributions [12], or language modeling [17]. In the context of IRL, missing data may be interpreted as occlusion of the agent being observed [2, 3, 4].

IRL strongly motivates the use of partial observations due to the desire to deploy applications like apprenticeship learning in the real world. [10] first introduced MaxEntIRL with partial observations; however, the learned model was not exploited to improve the state distribution, leading to potentially incorrect results. [14] developed a partial-observation MaxEntIRL model based upon latent maximum entropy that included the observations in the model. However, the resulting model is difficult to interpret from a theoretical standpoint because the joint distribution over model elements and observations is log-linear, rather than the distribution over the model elements alone. Recently, [15] presented IRL with hidden internal states, but it still requires perfect observation of "external" states and actions.

7 Concluding Remarks

The inability to make use of partial observations has been a significant obstacle to deploying apprenticeship learning in many real-world scenarios due to the rarity of error-free data. In this work we presented a new IRL technique, uMaxCausalEntIRL, based upon the principle of uncertain maximum entropy, and demonstrated that it successfully generalizes MaxCausalEntIRL. Future work will focus on further reducing the engineering required by relaxing our algorithm's requirements, allowing the transition and observation functions to be unknown and estimated automatically from data, and on deploying our algorithm in an autonomous robotic apprenticeship learning scenario.


  • [1] P. Abbeel and A. Y. Ng (2004) Apprenticeship learning via inverse reinforcement learning. In Twenty-first International Conference on Machine Learning (ICML), pp. 1–8. Cited by: §2.3.
  • [2] K. Bogert and P. Doshi (2014) Multi-robot inverse reinforcement learning under occlusion with interactions. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems, AAMAS ’14, pp. 173–180. Cited by: §6.
  • [3] K. Bogert and P. Doshi (2015) Toward estimating others' transition models under occlusion for multi-robot IRL. In 24th International Joint Conference on Artificial Intelligence (IJCAI), pp. 1867–1873. Cited by: §6.
  • [4] K. Bogert and P. Doshi (2018) Multi-robot inverse reinforcement learning under occlusion with estimation of state transitions. Artificial Intelligence 263, pp. 46 – 73. Cited by: §6.
  • [5] K. Bogert, J. F. Lin, P. Doshi, and D. Kulic (2016) Expectation-maximization for inverse reinforcement learning with hidden data. In 2016 International Conference on Autonomous Agents and Multiagent Systems, pp. 1034–1042. Cited by: item 2, §2.3.
  • [6] S. Boyd and L. Vandenberghe (2002) Convex Optimization. External Links: Document, 1111.6189v1, ISBN 9780521833783, ISSN 1558-254X Cited by: §2.1.
  • [7] J. Choi and K. Kim (2011) Inverse reinforcement learning in partially observable environments. J. Mach. Learn. Res. 12, pp. 691–730. Cited by: §2.3.
  • [8] Y. Grandvalet and Y. Bengio (2006) Entropy regularization. Cited by: §6.
  • [9] E. T. Jaynes (1957-05) Information theory and statistical mechanics. Phys. Rev. 106, pp. 620–630. Cited by: §1.
  • [10] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert (2012) Activity forecasting. In European Conference on Computer Vision (ECCV), pp. 201–214. Cited by: §6.
  • [11] A. Ng and S. Russell (2000) Algorithms for inverse reinforcement learning. In Seventeenth International Conference on Machine Learning, pp. 663–670. Cited by: item 3, §2.3, §2.3.
  • [12] S. J. Phillips, R. P. Anderson, and R. E. Schapire (2006) Maximum entropy modeling of species geographic distributions. Ecological modelling 190 (3-4), pp. 231–259. Cited by: §6.
  • [13] M. L. Puterman (1994) Markov decision processes: discrete stochastic dynamic programming. 1st edition, John Wiley & Sons, Inc., New York, NY, USA. External Links: ISBN 0471619779 Cited by: §2.3.
  • [14] S. Shahryari and P. Doshi (2017) Inverse Reinforcement Learning Under Noisy Observations ( Extended Abstract ). In AAMAS, pp. 1733–1735. Cited by: §6.
  • [15] V. V. Unhelkar and J. A. Shah (2019) Learning Models of Sequential Decision-Making with Partial Specification of Agent Behavior. Proceedings of the AAAI Conference on Artificial Intelligence 33, pp. 2522–2530. External Links: Document, ISSN 2159-5399 Cited by: §6.
  • [16] S. Wang, R. Rosenfeld, Y. Zhao, and D. Schuurmans (2002) The Latent Maximum Entropy Principle. In IEEE International Symposium on Information Theory, pp. 131–131. Cited by: item 2, §4.
  • [17] S. Wang, R. Rosenfeld, and Y. Zhao (2001) Latent maximum entropy principle for statistical language modeling. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU'01), pp. 182–185. Cited by: §2.2, §2.2, §6.
  • [18] S. Wang, D. Schuurmans, and Y. Zhao (2012) The Latent Maximum Entropy Principle. ACM Transactions on Knowledge Discovery from Data (TKDD) 6 (2), pp. 8:1–8:42. External Links: Document, Link Cited by: §2.2.
  • [19] B. D. Ziebart, J. A. Bagnell, and A. K. Dey (2010) Modeling interaction via the principle of maximum causal entropy. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), J. Fürnkranz and T. Joachims (Eds.), pp. 1255–1262. Cited by: §2.3.
  • [20] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey (2008) Maximum entropy inverse reinforcement learning. In 23rd National Conference on Artificial Intelligence - Volume 3, pp. 1433–1438. Cited by: §2.3.
  • [21] B. Ziebart (2010-12) Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. Ph.D. Thesis, Carnegie Mellon University. Cited by: item 3.