1 Introduction
The principle of maximum entropy is a technique for finding a distribution over random variable(s) that contains the least amount of information while still conforming to constraints. It has existed in various forms since the early 20th century but was formalized by Jaynes [9]. In a commonly encountered form, the constraints consist of matching the sufficient statistics (e.g., feature expectations) under the maximum entropy model being learned to those observed empirically. However, in many cases the feature expectations are not directly observable. It could be the case that the model contains hidden variables, that some data is missing or corrupted by noise, or that the data is only partially observable using existing sensors or processes. Such scenarios are common when attempting to use maximum entropy (MaxEnt) models to perform inference with real-world data.
As an example, let us take a simple natural language processing model. Using the principle of maximum entropy, each $x \in X$ may be a word in a vocabulary, and we wish to obtain a model that matches the empirical distribution of words in a given document, $\widetilde{Pr}(X)$, as given by the expectation of some interesting features $\phi(x)$. However, if the data gathered is an audio recording of spoken words, then the words themselves are never directly observed. Instead, we may extract observations from the recording that only partially reveal the words being spoken. For instance, suppose each observation $o \in Obs$ corresponds to a phoneme. $Pr(o|x)$, the probability of hearing a phoneme for a given word, will be non-deterministic, as different dialects and accents pronounce the same word in different ways. Further, a poor quality voice recording may cause uncertainty in the phoneme being spoken, forcing the use of a more general $Pr(o|x)$ to correctly model the data received.

We make the following contributions in this work.

We introduce the principle of uncertain maximum entropy as a new nonlinear program and analyze it.
2 Background
We briefly review the principles of maximum entropy and latent maximum entropy, and discuss their use in IRL.
2.1 Principle of Maximum Entropy
The principle of maximum entropy may be expressed as a nonlinear program:

$$\max_{Pr(X)} \; -\sum_{x \in X} Pr(x) \log Pr(x)$$

subject to

$$\sum_{x \in X} Pr(x) = 1, \qquad \sum_{x \in X} Pr(x)\,\phi_k(x) = \tilde{\phi}_k \quad \forall k \qquad (1)$$

where $\tilde{\phi}_k$ are the empirical feature expectations, $\tilde{\phi}_k = \sum_{x \in X} \widetilde{Pr}(x)\,\phi_k(x)$. Notably, this program is known to be convex, which provides a number of benefits. Particularly relevant is that we may find a closed-form definition of $Pr(X)$, and solving the problem's dual is guaranteed to also solve the primal problem [6]. The solution is a log-linear or Boltzmann distribution over the model elements,

$$Pr(x) = \frac{e^{\sum_k \lambda_k \phi_k(x)}}{Z(\lambda)},$$

where $Z(\lambda) = \sum_{x \in X} e^{\sum_k \lambda_k \phi_k(x)}$.
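To make the dual solution concrete, here is a minimal numerical sketch (the helper names are hypothetical, not from the paper): since the gradient of the dual with respect to $\lambda$ is the difference between the model's feature expectations and the empirical ones, simple gradient descent on the dual recovers the maximum entropy distribution.

```python
import numpy as np

def maxent_distribution(lam, features):
    """Log-linear model Pr(x) ∝ exp(Σ_k λ_k φ_k(x)).
    features: (n_elements, n_features) array of φ_k(x)."""
    logits = features @ lam
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def solve_maxent(features, phi_tilde, lr=0.5, iters=2000):
    """Minimize the dual by gradient descent; the gradient is
    E_model[φ] − φ~ (model expectations minus empirical ones)."""
    lam = np.zeros(features.shape[1])
    for _ in range(iters):
        p = maxent_distribution(lam, features)
        lam -= lr * (features.T @ p - phi_tilde)
    return lam

# toy check: with one-hot features φ_k(x) = 1[x = k], the maximum
# entropy model should reproduce any empirical distribution exactly
features = np.eye(3)
phi_tilde = np.array([0.5, 0.3, 0.2])
lam = solve_maxent(features, phi_tilde)
p = maxent_distribution(lam, features)
```

Because the dual is convex, any gradient-based optimizer works here; plain gradient descent keeps the sketch self-contained.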
2.2 Principle of Latent Maximum Entropy
First presented in [17], the principle of latent maximum entropy generalizes the principle of maximum entropy to models with hidden variables that are never empirically observed.
Split each $x$ into $y$ and $z$. $y$ is the component of $x$ that is perfectly observed, while $z$ is perfectly unobserved and completes $x$. Thus, $x = y \cup z$, and $Y$ and $Z$ are the sets of all $y$ and $z$, respectively. Latent maximum entropy corrects for the hidden portion of $x$ in the empirical data by summing over all $z \in Z_y$, which is every way of completing a given $y$ to arrive at an $x$. Its nonlinear program is identical to Eq. 1 except that the empirical feature expectations become $\tilde{\phi}_k = \sum_{y} \widetilde{Pr}(y) \sum_{z \in Z_y} Pr(z|y)\,\phi_k(x)$.
Since $Pr(z|y)$ is derived from the model $Pr(X)$, the right side of the feature expectations contains a dependency on the model being learned, meaning the program is no longer convex, and only an approximate solution can be found if we still desire a log-linear model for $Pr(X)$. This leads to an expectation-maximization (EM) solution: the resulting algorithm fills in the missing portions of the data using the model under the current parameters, which are successively improved by EM. To our knowledge, [17] is the first to apply EM to the principle of maximum entropy to account for incomplete data.
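The corrected feature expectations can be illustrated with a small sketch (hypothetical array shapes, assuming discrete $Y$ and $Z$): the hidden component is filled in using the current model's $Pr(z|y)$.

```python
import numpy as np

def corrected_expectations(p_tilde_y, p_x, features):
    """Latent-MaxEnt empirical feature expectations:
        φ~_k = Σ_y Pr~(y) Σ_z Pr(z|y) φ_k(y ∪ z)
    p_tilde_y: (nY,)      empirical distribution over observed components
    p_x:       (nY, nZ)   current model Pr(x) = Pr(y, z)
    features:  (nY, nZ, K) feature values φ_k(y ∪ z)."""
    # Pr(z|y) = Pr(y, z) / Pr(y), using the current model
    p_z_given_y = p_x / p_x.sum(axis=1, keepdims=True)
    return np.einsum('y,yz,yzk->k', p_tilde_y, p_z_given_y, features)
```

With no hidden component (nZ = 1) this reduces to the ordinary empirical expectations of Eq. 1.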
2.3 Inverse Reinforcement Learning
Inverse reinforcement learning (IRL) reverses the usual process of solving a Markov decision process (MDP) [13] for an optimal policy. First presented by [11], we begin with an MDP missing its reward function $R$, along with the behavior of an assumed-optimal agent. We are tasked with finding an $R$ such that the agent's behavior is optimal in the resulting complete MDP. Applications of IRL include apprenticeship learning [1], wherein a learner agent wishes to perform a task as well as an expert agent does. The learner utilizes IRL to recover a reward function that describes the expert's behavior and uses the learned rewards to complete its own MDP. By solving this MDP, the learner arrives at a policy that may perform the same task in a different way, such as when it is physically dissimilar from the expert.
Unfortunately, IRL is ill-posed, as there exist an infinite number of reward functions under which a given behavior is optimal [11]. This makes evaluating the performance of IRL algorithms complicated, as directly comparing learned reward functions is not fruitful. In our experiments, we use a popular measure, inverse learning error (ILE) [7], of a learned reward, obtained as ILE $= \| V^{\pi^*} - V^{\pi^L} \|_1$, where $V^{\pi^*}$ is the value function computed using the true reward and the expert's optimal policy, and $V^{\pi^L}$ is the value function computed using the true rewards and the optimal policy for the learned reward.
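For reference, ILE can be computed with exact policy evaluation on small discrete MDPs. The following is a sketch (variable names and the discount value are illustrative, not from the paper) using the linear-system form of policy evaluation, $V = (I - \gamma P_\pi)^{-1} R_\pi$.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9):
    """Exact policy evaluation: V = (I − γ P_π)^{-1} R.
    P: (A, S, S) transition tensor, R: (S,) state rewards,
    policy: (S,) deterministic action index per state."""
    S = R.shape[0]
    P_pi = P[policy, np.arange(S), :]     # (S, S): row s is P(· | s, policy[s])
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R)

def ile(P, R_true, pi_expert, pi_learned, gamma=0.9, ord=1):
    """Inverse learning error: ||V^{π*} − V^{π_L}||_p under the true reward."""
    v_star = policy_evaluation(P, R_true, pi_expert, gamma)
    v_learned = policy_evaluation(P, R_true, pi_learned, gamma)
    return np.linalg.norm(v_star - v_learned, ord)
```

When the learned policy matches the expert's, ILE is zero; it grows as the learned reward induces behavior that loses value under the true reward.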
To address the ill-posedness of IRL, [20] introduced Maximum Entropy IRL (MaxEntIRL). This algorithm makes use of the principle of maximum entropy by choosing $X$ to be the set of all possible MDP trajectories of the target agent. By assuming $R$ is a linear combination of feature functions, MaxEntIRL finds the one set of weights that results in the maximum entropy distribution over $X$ while matching empirical feature expectations. Later, [19] presented the MaxCausalEntIRL algorithm, which extended MaxEntIRL to non-deterministic MDPs. [5] extended MaxEntIRL to scenarios with missing data by employing the principle of latent maximum entropy. The resulting algorithm, HiddenDataEM, is restricted to scenarios with perfectly unknown sections of an agent's trajectory, due to occlusion of the environment for instance. It requires perfect knowledge of all other portions of the trajectory and is therefore unable to make use of any available partial information.
3 Uncertain Maximum Entropy
For our application domain of IRL with noisy real-world data, we require an algorithm that both accounts for the ill-posedness of the problem and can make use of imperfect empirical data from noisy sensors. We begin by generalizing the principle of latent maximum entropy (and by extension, the principle of maximum entropy) to uncertain data, which we call the principle of uncertain maximum entropy.
Assume we want a maximum entropy model of some variables $X$ given that we have observations $Obs$. Critically, we desire that the model does not include $Obs$, as the observations themselves pertain solely to the data-gathering technique used to produce them, not the elements or model being observed. Additionally, we assume the existence of a static observation function $Pr(o|x)$. Let $\widetilde{Pr}(o)$ be the empirical distribution of observations obtained from data. Our new nonlinear program is:
$$\max_{Pr(X)} \; -\sum_{x \in X} Pr(x) \log Pr(x)$$

subject to

$$\sum_{x \in X} Pr(x) = 1, \qquad \sum_{o \in Obs} \widetilde{Pr}(o) \sum_{x \in X} Pr(x|o)\,\phi_k(x) = \sum_{x \in X} Pr(x)\,\phi_k(x) \quad \forall k \qquad (2)$$
To solve Eq. 2, we first take the Lagrangian:

$$\mathcal{L}(Pr(X), \lambda, \eta) = -\sum_{x} Pr(x) \log Pr(x) + \eta\Big(\sum_{x} Pr(x) - 1\Big) + \sum_k \lambda_k \Big( \sum_{o} \widetilde{Pr}(o) \sum_{x} Pr(x|o)\,\phi_k(x) - \sum_{x} Pr(x)\,\phi_k(x) \Big) \qquad (3)$$
Next, we find the Lagrangian's gradient so that we can set it to zero and attempt to solve for $Pr(X)$:

$$\frac{\partial \mathcal{L}}{\partial Pr(x)} = -\log Pr(x) - 1 + \eta + \sum_k \lambda_k \Big( \sum_{o} \widetilde{Pr}(o) \frac{\partial}{\partial Pr(x)} \sum_{x'} Pr(x'|o)\,\phi_k(x') - \phi_k(x) \Big) \qquad (4)$$
Unfortunately, the existence of $Pr(x|o)$ on the right side of the constraints causes the gradient to be nonlinear in $Pr(X)$, since $Pr(x|o) = Pr(o|x)\,Pr(x) / \sum_{x'} Pr(o|x')\,Pr(x')$. However, Wang et al. [18] note that we may approximate the gradient by treating $Pr(x|o)$ as constant with respect to $Pr(x)$, which then allows $Pr(X)$ to be log-linear. In other words:

$$\frac{\partial}{\partial Pr(x)} \sum_{x'} Pr(x'|o)\,\phi_k(x') \approx 0,$$

which gives

$$Pr(x) \approx \frac{e^{\sum_k \lambda_k \phi_k(x)}}{Z(\lambda)}, \qquad Z(\lambda) = \sum_{x \in X} e^{\sum_k \lambda_k \phi_k(x)}. \qquad (5)$$
Now we enter our approximation back into the Lagrangian to arrive at an approximate dual:

$$\mathcal{L}_{dual}(\lambda) \approx \log Z(\lambda) - \sum_k \lambda_k \sum_{o} \widetilde{Pr}(o) \sum_{x} Pr(x|o)\,\phi_k(x) \qquad (6)$$
We would now find the dual's gradient and use it to minimize the dual. Unfortunately, the presence of $Pr(x|o)$, which itself depends on $Pr(X)$, still admits no closed-form solution in general. We instead employ another technique to minimize it, as we show next.
3.1 EM Solution
Inspired by Wang et al. [18], the constraints of Eq. 2 can be satisfied by locally maximizing the likelihood of the observed data. One method is to derive an EM algorithm utilizing our log-linear model. We begin with the log likelihood of all the observations; note that subscripts indicate a distribution is parameterized with the respective variable:

$$L(\lambda) = \sum_{o} \widetilde{Pr}(o) \log Pr_\lambda(o) = \sum_{o} \widetilde{Pr}(o) \log \sum_{x} Pr(o|x)\,Pr_\lambda(x) \qquad (7)$$

$$= Q(\lambda, \lambda^t) + \sum_{o} \widetilde{Pr}(o) \sum_{x} Pr_{\lambda^t}(x|o) \log Pr(o|x) - \sum_{o} \widetilde{Pr}(o) \sum_{x} Pr_{\lambda^t}(x|o) \log Pr_\lambda(x|o) \qquad (8)$$

where $Q(\lambda, \lambda^t) = \sum_{o} \widetilde{Pr}(o) \sum_{x} Pr_{\lambda^t}(x|o) \log Pr_\lambda(x)$.
Equation 8 follows because the observation function does not depend on $\lambda$. This leaves $Pr_\lambda(x)$ as the only function within $Q$ which depends on $\lambda$. The EM algorithm proceeds by iteratively maximizing $Q(\lambda, \lambda^t)$, and upon convergence $\lambda = \lambda^t$, at which point the likelihood of the data is at a local maximum.
For the sake of completeness, we note that the third term of Eq. 8 is the conditional entropy of the latent variables and the second is the expected log observations. These impact the overall data likelihood only, not the model solution, because the observations are not included in the model. We now substitute the log-linear model of $Pr(x)$ (Eq. 5) into $Q$:

$$Q(\lambda, \lambda^t) = \sum_{o} \widetilde{Pr}(o) \sum_{x} Pr_{\lambda^t}(x|o) \Big( \sum_k \lambda_k \phi_k(x) - \log Z(\lambda) \Big) \qquad (9)$$
Notice that Eq. 9 is similar to Eq. 6. One important difference is that Eq. 9 is easier to solve, as $Pr(x|o)$ depends on $\lambda^t$ and not $\lambda$. In fact, maximizing $Q(\lambda, \lambda^t)$ is equivalent to solving the following program by minimizing its dual, Eq. 6:
$$\max_{Pr(X)} \; -\sum_{x \in X} Pr(x) \log Pr(x)$$

subject to

$$\sum_{x \in X} Pr(x) = 1, \qquad \sum_{o \in Obs} \widetilde{Pr}(o) \sum_{x \in X} Pr_{\lambda^t}(x|o)\,\phi_k(x) = \sum_{x \in X} Pr(x)\,\phi_k(x) \quad \forall k \qquad (10)$$
which equals Eq. 2 at convergence. We now arrive at the following EM algorithm. E step: using the current parameters $\lambda^t$, compute $\hat{\phi}_k = \sum_{o} \widetilde{Pr}(o) \sum_{x} Pr_{\lambda^t}(x|o)\,\phi_k(x)$. M step: solve the standard maximum entropy program (Eq. 1) with $\hat{\phi}_k$ in place of the empirical feature expectations to obtain $\lambda^{t+1}$. Repeat until convergence.
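This EM algorithm can be sketched numerically as follows (hypothetical function and variable names; a sketch assuming discrete $X$ and $Obs$, with the M step implemented as the same dual gradient descent used for ordinary MaxEnt):

```python
import numpy as np

def umaxent_em(p_tilde_o, p_o_given_x, features, iters=50, lr=0.5, inner=500):
    """EM for uncertain MaxEnt (sketch).
    p_tilde_o:   (nO,)    empirical observation distribution Pr~(o)
    p_o_given_x: (nO, nX) static observation function Pr(o|x)
    features:    (nX, K)  feature values φ_k(x)."""
    lam = np.zeros(features.shape[1])
    for _ in range(iters):
        # current log-linear model Pr(x)
        logits = features @ lam
        p_x = np.exp(logits - logits.max()); p_x /= p_x.sum()
        # E step: Pr(x|o) ∝ Pr(o|x) Pr(x), then corrected expectations
        joint = p_o_given_x * p_x                        # (nO, nX)
        p_x_given_o = joint / joint.sum(axis=1, keepdims=True)
        phi_hat = (p_tilde_o @ p_x_given_o) @ features   # Σ_o Pr~(o) Σ_x Pr(x|o) φ(x)
        # M step: fit λ so model expectations match φ̂ (dual gradient descent)
        for _ in range(inner):
            logits = features @ lam
            p_x = np.exp(logits - logits.max()); p_x /= p_x.sum()
            lam -= lr * (features.T @ p_x - phi_hat)
    return lam
```

As a sanity check, with a noiseless (identity) observation function and one-hot features, the fixed point reproduces the empirical distribution, matching the reduction to ordinary MaxEnt discussed in the next section.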
4 Analysis of New Principle
To show that the principle of uncertain maximum entropy generalizes the principle of maximum entropy, let $Obs = X$ and let $Pr(o|x)$ equal $\delta_{ox}$, the Kronecker delta. Then $Pr(x|o) = \delta_{xo}$ and $\sum_{o} \widetilde{Pr}(o) \sum_{x} Pr(x|o)\,\phi_k(x) = \sum_{x} \widetilde{Pr}(x)\,\phi_k(x)$. Hence, we recover Eq. 1's constraint terms. Similarly, the principle of uncertain maximum entropy generalizes the principle of latent maximum entropy [16] by allowing $Obs = Y$: then $Pr(o|x) = \delta_{oy}$ and $Pr(x|o) = Pr(z|y)$, since $x = y \cup z$.
Additionally, in both cases the generalization holds with an arbitrary $Obs$ so long as the observation function is deterministic: $Pr(o|x) \in \{0, 1\}$ with each $o$ identifying a unique $x$, and $Pr(o|x) \in \{0, 1\}$ with each $o$ identifying a unique $y$, respectively.
One valuable property for statistical inference models is the ability to guarantee correctness in the infinite limit of available data by making use of the law of large numbers. For linear and nonlinear programs, this is satisfied by showing that the constraints hold true as the empirical distribution of data converges to the true distribution. Next, we show that the principle of uncertain maximum entropy exhibits this attribute, while the principle of maximum entropy does not except in special cases.
Notice that $Pr(o) = \sum_{x} Pr(o|x)\,Pr(x)$, and therefore in the infinite limit of data, where $\widetilde{Pr}(o) \rightarrow Pr(o)$, the constraints of Eq. 2 are satisfied as:

$$\sum_{o} Pr(o) \sum_{x} Pr(x|o)\,\phi_k(x) = \sum_{x} \phi_k(x) \sum_{o} Pr(o)\,Pr(x|o) = \sum_{x} Pr(x)\,\phi_k(x) \qquad (11)$$
To use the principle of maximum entropy with imperfect data, the observations must be converted into the model elements. For instance, in statistical physics, where the technique was first developed, the observations themselves are the result of a large number of smaller elements, such as when measuring the temperature of a gas. So long as we constrain the model with such statistical moments, we may safely use the principle of maximum entropy.
In general, however, we must accept some amount of error in the observation-to-model-element conversion. As a result, in the infinite limit of data the converted empirical feature expectations do not converge to $\sum_{x} Pr(x)\,\phi_k(x)$, and therefore the equality constraints do not hold. The effects of this are ad hoc: for instance, errors in continuous data may be bounded to within an acceptable tolerance, while categorization errors in discrete models may result in unbounded error. This makes it difficult to use the technique with large data sets or in automated inference applications with noisy sensor data.
To illustrate, suppose we choose the most likely $x$ for an observed $o$; in other words, each $x$'s probability is the sum of the probabilities of the observed $o$'s for which it is the most likely element. In the limit, $Pr_{ML}(x) = \sum_{o} Pr(o)\,\delta_{x, \arg\max_{x'} Pr(x'|o)}$, where $\delta$ is the Kronecker delta. This distribution does not in general equal $Pr(x)$, the true distribution.
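A tiny worked example (numbers chosen purely for illustration) makes the bias concrete: the most-likely conversion shifts mass toward elements favored by the posterior argmax, while averaging the full posterior recovers $Pr(x)$ exactly.

```python
import numpy as np

# Hypothetical two-element model: Pr(x) = [0.6, 0.4]
p_true = np.array([0.6, 0.4])
# Observation function Pr(o|x): x0 always emits o0; x1 emits o0 60% of the time
p_o_given_x = np.array([[1.0, 0.0],
                        [0.6, 0.4]])

p_o = p_true @ p_o_given_x              # Pr(o) = Σ_x Pr(o|x) Pr(x) = [0.84, 0.16]
joint = p_o_given_x * p_true[:, None]   # Pr(x, o)
p_x_given_o = joint / joint.sum(axis=0) # Pr(x|o), columns sum to 1

# most-likely conversion: each o's full mass goes to argmax_x Pr(x|o)
ml = np.zeros(2)
for o in range(2):
    ml[np.argmax(p_x_given_o[:, o])] += p_o[o]
# ml == [0.84, 0.16], which does not equal Pr(x) = [0.6, 0.4]

# averaging the posterior instead recovers the true distribution:
recovered = p_x_given_o @ p_o           # Σ_o Pr(o) Pr(x|o) = Pr(x)
```

The last line is exactly the identity used in Eq. 11, which is why the uncertain-MaxEnt constraints remain consistent in the infinite-data limit while the most-likely conversion does not.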
To demonstrate empirically, we generated a large number of random maximum entropy programs of random size with random observation functions. For each, we generate a number of observations as well as their corresponding most-likely $x$, and use these in our new technique and the principle of maximum entropy, respectively. Figure 1 shows the Kullback–Leibler divergence of the resulting learned models from the true models as the number of observations provided increases. Notice that uMaxEnt continues improving as more observations are given until it reaches the performance of the best-case control Inf Obs (where $\widetilde{Pr}(o)$ is set to $Pr(o)$). Meanwhile, ML MaxEnt converges at a higher KL divergence, and no additional data improves its performance.

5 Experiment
To evaluate the performance of our new technique in our inverse reinforcement learning application domain, we implemented a new algorithm, uMaxCausalEntIRL, by generalizing MaxCausalEntIRL using the principle of uncertain maximum entropy. Each $x$ becomes a tuple of states and actions corresponding to one trajectory of the MDP. The empirical data of the agent being learned from may now be thought of as a specialized hidden Markov model, as shown in figure 2. In this view we concretely see the empirical data's dependency on the model being learned: the distribution over all states and actions over time given the observations may not be calculated due to the missing distribution $Pr(a|s)$. This distribution, the agent's policy, is only available after our EM-based algorithm is complete.

Our scenario is that of a fugitive being unknowingly tracked by a radio tower that is listening for pings coming from a tracker device hidden on them. We attempt to discern the fugitive's movements within a space, including identifying a safe house. Complicating matters, the tracking is imperfect, becoming less accurate the farther away the fugitive is from the tower. Further, it is possible the tower may miss a transmission, and there is a mountain range that greatly increases the probability of missing transmissions coming from behind it. We illustrate this scenario in figure 3.
We model the fugitive's movement using an MDP with 11 states and four actions corresponding to moving in the four cardinal directions. Movement is uncertain, as the fugitive may attempt to move in a direction but be unable to during a given timestep. This is modeled as an action having its intended movement effect 90% of the time, with the remaining probability mass distributed among the other actions' results (invalid movements cause the fugitive to remain in the same state). There are two potential safe houses marked on the map; the correct one is SH1, and the fugitive is attempting to reach it. The fugitive may start from any state.
We model the tower's observations using 12 observations: 11 corresponding each to one state, representing that state being the most likely position of the fugitive, and one representing "no ping received". We vary the amount of noise in the observations by changing the standard deviation of the observation distribution; a "low" setting makes the "correct" observation (and those nearby) much more likely, while a "high" setting spreads probability mass among most of the observations. We illustrate an example with these two standard deviation settings in figure 4.
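An observation function of this shape can be sketched as follows (an entirely hypothetical parameterization; the paper's exact distribution is not specified here): states near the fugitive's true position receive more mass under a Gaussian kernel whose standard deviation is the noise knob, and a "no ping" outcome grows more likely with distance from the tower.

```python
import numpy as np

def make_observation_function(positions, tower, sigma, miss_base=0.05):
    """Build Pr(o|x): nS 'state pinged' observations plus one 'no ping'.
    positions: (nS, 2) state coordinates; tower: (2,) tower coordinates;
    sigma: observation noise level (standard deviation of the kernel)."""
    nS = len(positions)
    p = np.zeros((nS, nS + 1))
    for x in range(nS):
        d = np.linalg.norm(positions - positions[x], axis=1)
        weights = np.exp(-d**2 / (2 * sigma**2))   # correct/nearby states likelier
        # missed-ping probability grows with distance from the tower
        miss = min(0.9, miss_base * (1 + np.linalg.norm(positions[x] - tower)))
        p[x, :nS] = (1 - miss) * weights / weights.sum()
        p[x, nS] = miss
    return p

positions = np.array([[i, 0.0] for i in range(11)])
tower = np.array([0.0, 0.0])
p_low = make_observation_function(positions, tower, sigma=0.5)   # "low" noise
p_high = make_observation_function(positions, tower, sigma=3.0)  # "high" noise
```

With the low sigma, most of each row's mass sits on the correct state's observation; with the high sigma, the mass spreads across many observations, mirroring the two settings illustrated in figure 4.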
We developed a number of controls to better illustrate our new technique's performance. TRUE is the best-case scenario, where we record the fugitive's true trajectories and use MaxCausalEntIRL. ML is an unrealistic worst-case procedure, where we simply record the most likely state for each received observation; in the case of a "missed ping" observation this absurdly records SH2. A more realistic procedure, WOERR, discards the "missed ping" observations to create missing data. Both of these controls use MaxCausalEntIRL on their resulting data sets. Finally, cHiddenDataEM, a simple adaptation of HiddenDataEM to use MaxCausalEntIRL, performs HiddenDataEM on the dataset produced by WOERR to account for the missing data.
We present our results in figure 5, showing the average ILE achieved by each algorithm as the number of trajectories increases. Notice that in the low-error setting cHiddenDataEM improves at first, but as more trajectories are provided it stops improving at a relatively high error; ML fails to improve at all, and WOERR stops improving after a small number of trajectories. In the high-error setting, the control algorithms show minimal improvement after the first few trajectories. Only uMaxCausalEntIRL is able to learn accurately in both error settings.
6 Related Work
To our knowledge, maximum entropy models utilizing imperfect data have received comparatively little attention. By contrast, applications with missing data are more common and may use latent maximum entropy to account for it, such as missing labels in a supervised learning scenario [8], species distributions [12], or language modeling [17]. In the context of IRL, missing data may be interpreted as occlusion of the agent being observed ([2], [3], [4]).

IRL strongly motivates the use of partial observations due to the desire to deploy applications like apprenticeship learning in the real world. [10] first introduced MaxEntIRL with partial observations; however, the learned model was not exploited to improve the state distribution, leading to potentially incorrect results. [14] developed a MaxEntIRL-with-partial-observations model based upon latent maximum entropy that included the observations in the model. However, the resulting model is difficult to interpret from a theoretical standpoint, as the joint distribution over model elements and observations is log-linear rather than $Pr(X)$. Recently, [15] presented IRL with hidden internal states, but it still requires perfect observation of "external" states and actions.

7 Concluding Remarks
The inability to make use of partial observations has been a significant obstacle to deploying apprenticeship learning in many real-world scenarios due to the rarity of error-free data. In this work we presented a new IRL technique, uMaxCausalEntIRL, based upon the principle of uncertain maximum entropy, and demonstrated that it successfully generalizes MaxCausalEntIRL. Future work will focus on further reducing the engineering required by relaxing our algorithm's requirements, allowing the transition and observation functions to be unknown and estimated automatically from data, and on deploying our algorithm in an autonomous robotic apprenticeship learning scenario.
References

[1] (2004) Apprenticeship learning via inverse reinforcement learning. In Twenty-first International Conference on Machine Learning (ICML), pp. 1–8.
[2] (2014) Multi-robot inverse reinforcement learning under occlusion with interactions. In Proceedings of the 2014 International Conference on Autonomous Agents and Multiagent Systems (AAMAS '14), pp. 173–180.
[3] (2015) Toward estimating others' transition models under occlusion for multi-robot IRL. In 24th International Joint Conference on Artificial Intelligence (IJCAI), pp. 1867–1873.
[4] (2018) Multi-robot inverse reinforcement learning under occlusion with estimation of state transitions. Artificial Intelligence 263, pp. 46–73.
[5] (2016) Expectation-maximization for inverse reinforcement learning with hidden data. In 2016 International Conference on Autonomous Agents and Multiagent Systems, pp. 1034–1042.
[6] (2002) Convex Optimization. ISBN 9780521833783.
[7] (2011) Inverse reinforcement learning in partially observable environments. Journal of Machine Learning Research 12, pp. 691–730.
[8] (2006) Entropy regularization.
[9] (1957) Information theory and statistical mechanics. Physical Review 106, pp. 620–630.
[10] (2012) Activity forecasting. In European Conference on Computer Vision (ECCV), pp. 201–214.
[11] (2000) Algorithms for inverse reinforcement learning. In Seventeenth International Conference on Machine Learning (ICML), pp. 663–670.
[12] (2006) Maximum entropy modeling of species geographic distributions. Ecological Modelling 190 (3–4), pp. 231–259.
[13] (1994) Markov Decision Processes: Discrete Stochastic Dynamic Programming. 1st edition, John Wiley & Sons, Inc., New York, NY, USA. ISBN 0471619779.
[14] (2017) Inverse reinforcement learning under noisy observations (extended abstract). In AAMAS, pp. 1733–1735.
[15] (2019) Learning models of sequential decision-making with partial specification of agent behavior. Proceedings of the AAAI Conference on Artificial Intelligence 33, pp. 2522–2530.
[16] (2002) The latent maximum entropy principle. In IEEE International Symposium on Information Theory, pp. 131–131.
[17] (2001) Latent maximum entropy principle for statistical language modeling. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU '01), pp. 182–185.
[18] (2012) The latent maximum entropy principle. ACM Transactions on Knowledge Discovery from Data (TKDD) 6 (2), pp. 8:1–8:42.
[19] (2010) Modeling interaction via the principle of maximum causal entropy. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 1255–1262.
[20] (2008) Maximum entropy inverse reinforcement learning. In 23rd National Conference on Artificial Intelligence, Volume 3, pp. 1433–1438.
[21] (2010) Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. Ph.D. Thesis, Carnegie Mellon University.