    # A new inequality for maximum likelihood estimation in statistical models with latent variables

Maximum-likelihood estimation (MLE) is arguably the most important tool for statisticians, and many methods have been developed to find the MLE. We present a new inequality involving posterior distributions of a latent variable that holds under very general conditions. It is related to the EM algorithm and has clear potential to be used in a similar fashion.


## 1 Introduction

Ever since R.A. Fisher introduced the concepts of likelihood and maximum-likelihood estimation (Fisher, 1922), it has been a goal to find the maximum-likelihood estimate for a given statistical model, which in principle is done by solving the associated score equations. Maximum-likelihood estimation is arguably the most widely used principle for statistical inference and is underpinned by a large body of theory.

However, because solving the score equations is often infeasible, auxiliary methods have been developed specifically to facilitate maximization of the likelihood function, most notably the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). The huge increase in computational power over the last decades has also facilitated many new methods, in particular simulation-based approaches.

#### Latent variables

A notable challenge in maximum-likelihood estimation (and indeed, in any statistical inference) is the presence of latent, or unobserved, variables. A latent variable is characterized by the fact that it acts as part of the statistical model but is not observed. In terms of maximum-likelihood inference, this implies the presence of a sum (if the latent variable is discrete) or an integral (if it is continuous) in the likelihood expression. Integrals are notoriously difficult to evaluate, so other approaches are often needed.

There are three common approaches to handle latent variables in maximum likelihood estimation (sometimes in combination):

• EM algorithm and its derivatives

• Monte Carlo methods

• Approximation methods, such as Laplace’s method.

In this article, we present and prove a new inequality involving latent variables which, when it holds, guarantees an increase in likelihood. The resulting theorem is related to the EM algorithm in that it also uses the posterior of the latent variable, but our inequality is more general and does not imply an algorithm per se. We will briefly discuss some practicalities but otherwise leave applications for future work.

## 2 Theorem

Suppose that we are given a statistical family consisting of an observation $y$, a latent variable $w$ and an unknown parameter $\theta$ in a parameter space $\Theta$.

Assume that the joint variable $(y, w)$ is dominated; that is, it has a joint density with respect to $\nu \times \mu$ for a measure $\nu$ on the sample space of $y$ and a measure $\mu$ on the sample space of $w$, and assume that for every $\theta$, the joint likelihood is non-zero for almost all $w$.

Let $L(\theta, w)$ be the joint likelihood of $(y, w)$, and let $L(\theta) = \int L(\theta, w)\,d\mu(w)$ and $P_\theta(w \mid y)$ denote the marginal likelihood and the posterior distribution of $w$ given $y$, respectively.

###### Theorem 1.

Let $\theta_1, \theta_2 \in \Theta$. Then $L(\theta_2) > L(\theta_1)$ if and only if the following inequality is true:

$$\int \min\left(1,\frac{L(\theta_2,w)}{L(\theta_1,w)}\right)\,dP_{\theta_1}(w\mid y) \;>\; \int \min\left(1,\frac{L(\theta_1,w)}{L(\theta_2,w)}\right)\,dP_{\theta_2}(w\mid y), \tag{1}$$

where $P_\theta(w \mid y)$ is the posterior distribution of $w$ under $\theta$ given $y$.

That is, by integrating the "truncated likelihood ratios" under the two posterior distributions, we can compare the likelihood of $y$ under $\theta_1$ and $\theta_2$.

###### Proof.

Let $A$ denote the subset of the latent space where $L(\theta_2, w) \leq L(\theta_1, w)$. First consider the left integral:

$$\begin{aligned}
\int \min\left(1,\frac{L(\theta_2,w)}{L(\theta_1,w)}\right)\,dP_{\theta_1}(w\mid y)
&= \int \min\left(1,\frac{L(\theta_2,w)}{L(\theta_1,w)}\right)\frac{L(\theta_1,w)}{L(\theta_1)}\,d\mu(w)\\
&= \int_A \frac{L(\theta_2,w)}{L(\theta_1)}\,d\mu(w) + \int_{A^c} \frac{L(\theta_1,w)}{L(\theta_1)}\,d\mu(w)\\
&= \frac{1}{L(\theta_1)}\int \mathbf{1}_A(w)\,L(\theta_2,w) + \mathbf{1}_{A^c}(w)\,L(\theta_1,w)\,d\mu(w). \tag{2}
\end{aligned}$$

We get a similar result for the right integral, with $L(\theta_1)$ replaced by $L(\theta_2)$ in the denominator. Now the theorem follows. ∎
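For completeness, the analogous computation for the right-hand integral can be written out as follows, where $A$ denotes the set of $w$ with $L(\theta_2, w) \leq L(\theta_1, w)$, so that the minimum equals $1$ on $A$ and $L(\theta_1,w)/L(\theta_2,w)$ on $A^c$:

$$\int \min\left(1,\frac{L(\theta_1,w)}{L(\theta_2,w)}\right)\,dP_{\theta_2}(w\mid y) = \frac{1}{L(\theta_2)}\int \mathbf{1}_A(w)\,L(\theta_2,w) + \mathbf{1}_{A^c}(w)\,L(\theta_1,w)\,d\mu(w).$$

Since the two sides of (1) thus share the same positive integral and differ only in the factors $1/L(\theta_1)$ and $1/L(\theta_2)$, inequality (1) holds precisely when $L(\theta_2) > L(\theta_1)$.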

### Remarks

Note that the truncated likelihood ratio is numerically stable due to its upper bound of 1.
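To illustrate the theorem numerically, the following sketch checks inequality (1) in a simple Gaussian toy model (our own illustrative choice, not taken from the theorem) where the posterior is available in closed form, and compares the result with the exact marginal likelihoods:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (illustrative assumption): w ~ N(theta, 1) and y | w ~ N(w, 1),
# so marginally y ~ N(theta, 2) and the posterior of w given y is N((y + theta)/2, 1/2).

def joint_lik(theta, w, y):
    """Joint likelihood L(theta, w), up to a constant factor."""
    return np.exp(-0.5 * (y - w) ** 2 - 0.5 * (w - theta) ** 2)

def truncated_ratio_integral(theta_num, theta_den, y, n=200_000):
    """Monte Carlo estimate of  int min(1, L(theta_num,w)/L(theta_den,w)) dP_theta_den(w|y)."""
    w = rng.normal((y + theta_den) / 2, np.sqrt(0.5), size=n)  # exact posterior draws
    return np.mean(np.minimum(1.0, joint_lik(theta_num, w, y) / joint_lik(theta_den, w, y)))

y = 1.0
theta1, theta2 = 0.0, 0.8   # theta2 is closer to y, so L(theta2) > L(theta1)

lhs = truncated_ratio_integral(theta2, theta1, y)   # left side of (1)
rhs = truncated_ratio_integral(theta1, theta2, y)   # right side of (1)

# Exact marginal likelihoods, up to a common constant: y ~ N(theta, 2)
L1 = np.exp(-(y - theta1) ** 2 / 4)
L2 = np.exp(-(y - theta2) ** 2 / 4)

print(lhs > rhs, L2 > L1)   # the two comparisons should agree, per Theorem 1
```

Here the posterior can be sampled exactly, so the only error is the Monte Carlo error of the two integral estimates.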

#### Application of the theorem

In general, the posterior distributions $P_{\theta_1}(w \mid y)$ and $P_{\theta_2}(w \mid y)$ are not available in closed form. Therefore, Monte Carlo methods would typically have to be applied.

Note that the theorem does not imply an algorithm per se. However, by identifying $\theta_1$ with the current estimate and $\theta_2$ with a proposed estimate, we can outline an algorithm, provided we can generate new proposed estimates. This will obviously depend on the model in question, and we leave this for future work.
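As a purely illustrative sketch of such an algorithm, one could accept a proposed estimate whenever a Monte Carlo estimate of inequality (1) favours it. The Gaussian random-walk proposal scheme and the toy model below are our own hypothetical choices and are not prescribed by the theorem:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model (illustrative assumption): w ~ N(theta, 1), y | w ~ N(w, 1),
# with posterior w | y ~ N((y + theta)/2, 1/2). The MLE of theta is y.

def joint_lik(theta, w, y):
    """Joint likelihood L(theta, w), up to a constant factor."""
    return np.exp(-0.5 * (y - w) ** 2 - 0.5 * (w - theta) ** 2)

def lhs_minus_rhs(theta1, theta2, y, n=100_000):
    """Monte Carlo estimate of (left side) - (right side) of inequality (1)."""
    w1 = rng.normal((y + theta1) / 2, np.sqrt(0.5), size=n)  # draws from P_theta1(w|y)
    w2 = rng.normal((y + theta2) / 2, np.sqrt(0.5), size=n)  # draws from P_theta2(w|y)
    lhs = np.mean(np.minimum(1.0, joint_lik(theta2, w1, y) / joint_lik(theta1, w1, y)))
    rhs = np.mean(np.minimum(1.0, joint_lik(theta1, w2, y) / joint_lik(theta2, w2, y)))
    return lhs - rhs

y, theta = 1.0, -2.0            # start far from the MLE at theta = y = 1.0
for _ in range(200):
    proposal = theta + rng.normal(0.0, 0.3)       # hypothetical random-walk proposal
    if lhs_minus_rhs(theta, proposal, y) > 0:     # (1) holds: proposal improves L
        theta = proposal

print(theta)  # should have drifted towards the MLE at y
```

Near the maximum, the two sides of (1) become nearly equal and Monte Carlo noise can occasionally cause a wrong acceptance, but the likelihood is almost flat there, so the estimate simply hovers near the MLE.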

## 3 Discussion

With its very general setting, the presented theorem should be applicable in a wide range of models, since latent variables are present in many classes of statistical models.

An interesting aspect is that the theorem does not require any assumptions on the parameter space $\Theta$. The most important assumption is that $L(\theta, w)$ is non-zero for almost all $w$, which can actually be relaxed to some extent.

### Comparison to the EM algorithm

A close relative is the EM algorithm and its derivatives: using the current estimate $\theta_1$, the EM algorithm proposes a new estimate $\theta_2$ that is guaranteed to improve the likelihood.

However, there are some notable differences:

• The EM algorithm and its derivatives use only the posterior distribution of the current estimate.

• Application of the theorem requires using two posterior distributions.

We believe the great potential of the presented theorem lies in combining it with a clever "proposal scheme" for new estimates. Unlike the EM algorithm and its derivatives, we are free to choose any proposal for an updated estimate. Since the proposal scheme can be tailored to a specific model, it should be possible to create new and powerful methods to facilitate maximum-likelihood estimation.

## References

• Dempster et al. (1977) Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977), ‘Maximum likelihood from incomplete data via the EM algorithm’, Journal of the Royal Statistical Society: Series B (Methodological) 39(1), 1–22.
• Fisher (1922) Fisher, R. A. (1922), ‘On the mathematical foundations of theoretical statistics’, Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 222(594-604), 309–368.