Since, and even before, R. A. Fisher introduced the concept of likelihood and maximum-likelihood estimation (Fisher, 1922), it has been a goal to find the maximum of the likelihood for a given statistical model, which in principle is done by solving the associated score equations. Maximum-likelihood estimation is arguably the most widely used principle for statistical inference and is underpinned by extensive theory.
However, because solving the score equations is often infeasible, auxiliary methods have been developed specifically to facilitate maximization of the likelihood function, most notably the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). The large increase in computational power over recent decades has also facilitated many new methods, in particular simulation-based approaches.
A notable challenge in maximum likelihood estimation (and indeed in any statistical inference) is the presence of latent or unobserved variables. A latent variable is part of the statistical model but is not observed. For maximum likelihood inference, this implies that the likelihood expression contains a sum (if the latent variable is discrete) or an integral (if it is continuous). Integrals are notoriously difficult to evaluate, so other approaches are often needed.
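As a sketch of what this looks like, write $x$ for the observation, $z$ for the latent variable, $\theta$ for the parameter and $f_\theta(x, z)$ for the joint density (notation assumed here for illustration). The likelihood is then obtained by marginalizing out $z$:

```latex
L(\theta) = \sum_{z} f_\theta(x, z) \quad \text{(discrete } z\text{)},
\qquad
L(\theta) = \int f_\theta(x, z)\, dz \quad \text{(continuous } z\text{)}.
```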
There are three common approaches to handle latent variables in maximum likelihood estimation (sometimes in combination):
The EM algorithm and its derivatives
Monte Carlo methods
Approximation methods, such as Laplace’s method
In this article, we present and prove a new inequality involving latent variables which, when it holds, guarantees an increase in likelihood. The resulting theorem is related to the EM algorithm in that it also uses the posterior of the latent variable, but our inequality is more general and does not imply an algorithm per se. We briefly discuss some practicalities but otherwise leave applications for future work.
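Suppose that we are given a statistical family consisting of an observation $x \in \mathcal{X}$, a latent variable $z \in \mathcal{Z}$ and an unknown parameter $\theta$ in parameter space $\Theta$.
Assume that the joint variable $(x, z)$ is dominated; that is, it has a density $f_\theta(x, z)$ with respect to $\mu \otimes \nu$ for a measure $\mu$ on $\mathcal{X}$ and $\nu$ on $\mathcal{Z}$, and assume that for every $\theta \in \Theta$, $f_\theta(x, z)$ is non-zero for almost all $z$.
Let $f_\theta(x) = \int_{\mathcal{Z}} f_\theta(x, z)\, d\nu(z)$ be the marginal density for $x$, and let $L(\theta) = f_\theta(x)$ and $L(\theta \mid z) = f_\theta(x, z) / f_\theta(x)$ denote marginal and posterior likelihoods, respectively.
Theorem. Let $\theta, \theta' \in \Theta$. Then $L(\theta') > L(\theta)$ iff the following inequality is true:
\[
\int_{\mathcal{Z}} \min\left(1, \frac{f_{\theta'}(x, z)}{f_\theta(x, z)}\right) dP_\theta(z)
>
\int_{\mathcal{Z}} \min\left(1, \frac{f_\theta(x, z)}{f_{\theta'}(x, z)}\right) dP_{\theta'}(z),
\]
where $P_\theta$ is the posterior distribution of $z$ under $\theta$ given $x$.
That is, by integrating the "truncated likelihood-ratios" under the posterior distributions, we can compare the likelihood of $x$ under $\theta$ and $\theta'$.
Proof. Let $A$ denote the subset of $\mathcal{Z}$ where $f_{\theta'}(x, z) < f_\theta(x, z)$. First consider the left integral:
\[
\int_{\mathcal{Z}} \min\left(1, \frac{f_{\theta'}(x, z)}{f_\theta(x, z)}\right) dP_\theta(z)
= \int_A \frac{f_{\theta'}(x, z)}{f_\theta(x, z)} \, \frac{f_\theta(x, z)}{L(\theta)}\, d\nu(z)
+ \int_{A^c} \frac{f_\theta(x, z)}{L(\theta)}\, d\nu(z)
= \frac{1}{L(\theta)} \int_{\mathcal{Z}} \min\bigl(f_\theta(x, z), f_{\theta'}(x, z)\bigr)\, d\nu(z).
\]
We get a similar result for the right integral with $L(\theta)$ replaced by $L(\theta')$. Since the common integral of the minimum is strictly positive by the non-zero assumption, the left side exceeds the right side exactly when $1/L(\theta) > 1/L(\theta')$. Now the theorem follows. ∎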
Note that the truncated likelihood-ratio is numerically stable due to the upper limit of 1.
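When the latent variable is discrete, both sides of the inequality are finite sums and can be checked exactly. The following is a minimal sketch under assumed choices that are not from the article: a two-component Gaussian mixture with fixed weights, a single made-up observation `x`, and a parameter `theta = (mean_0, mean_1)`. All function names are hypothetical.

```python
import math

WEIGHTS = (0.3, 0.7)  # fixed mixture weights (assumption of this toy model)

def normal_pdf(x, mean):
    """Density of a unit-variance normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)

def joint(x, z, theta):
    """Joint density of (x, z): mixture weight times component density."""
    return WEIGHTS[z] * normal_pdf(x, theta[z])

def likelihood(x, theta):
    """Marginal likelihood: the latent label z is summed out."""
    return sum(joint(x, z, theta) for z in (0, 1))

def truncated_side(x, theta_from, theta_to):
    """Posterior mean, under theta_from, of min(1, f_theta_to / f_theta_from)."""
    lik = likelihood(x, theta_from)
    total = 0.0
    for z in (0, 1):
        posterior = joint(x, z, theta_from) / lik
        ratio = joint(x, z, theta_to) / joint(x, z, theta_from)
        total += min(1.0, ratio) * posterior
    return total

x = 1.2                                          # made-up observation
theta, theta_new = (-1.0, 2.0), (0.0, 1.5)       # made-up parameter values
lhs = truncated_side(x, theta, theta_new)
rhs = truncated_side(x, theta_new, theta)
# The two booleans should agree: the inequality holds iff the likelihood increases.
print(lhs > rhs, likelihood(x, theta_new) > likelihood(x, theta))
```

Swapping the roles of `theta` and `theta_new` flips both comparisons together, which is one quick way to sanity-check the equivalence on other parameter pairs.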
Application of the theorem
In general, the two posterior distributions appearing in the theorem are not available in closed form. Therefore, Monte Carlo methods would have to be applied.
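To illustrate the Monte Carlo route, the sketch below estimates both sides by averaging the truncated ratio over posterior draws. It assumes a toy model chosen so that the posterior can be sampled exactly and the answer verified: $z \sim N(\theta, 1)$ and $x \mid z \sim N(z, 1)$, which gives $z \mid x \sim N((x + \theta)/2, 1/2)$ and $x \sim N(\theta, 2)$. In a realistic model the exact sampler would have to be replaced by something like MCMC; the model, the numbers and the function names are all assumptions of this example.

```python
import math
import random

def log_joint(x, z, theta):
    """Log joint density of (x, z), up to a constant that cancels in ratios."""
    return -0.5 * (z - theta) ** 2 - 0.5 * (x - z) ** 2

def sample_posterior(x, theta, rng):
    """Exact posterior draw of z given x in this conjugate toy model."""
    return rng.gauss(0.5 * (x + theta), math.sqrt(0.5))

def mc_side(x, theta_from, theta_to, n, rng):
    """Monte Carlo estimate of the posterior mean of the truncated ratio."""
    total = 0.0
    for _ in range(n):
        z = sample_posterior(x, theta_from, rng)
        log_ratio = log_joint(x, z, theta_to) - log_joint(x, z, theta_from)
        total += math.exp(min(0.0, log_ratio))  # equals min(1, ratio), computed on the log scale
    return total / n

def marginal_likelihood(x, theta):
    """Closed-form density of x ~ N(theta, 2), used only to verify the result."""
    return math.exp(-0.25 * (x - theta) ** 2) / math.sqrt(4 * math.pi)

rng = random.Random(0)
x, theta, theta_new = 1.0, -1.0, 0.5   # made-up values
lhs = mc_side(x, theta, theta_new, 20000, rng)
rhs = mc_side(x, theta_new, theta, 20000, rng)
print(lhs, rhs, marginal_likelihood(x, theta_new) > marginal_likelihood(x, theta))
```

Because each term is bounded between 0 and 1, the estimator has variance at most 1/(4n), so the comparison is only unreliable when the two likelihoods are nearly equal.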
Note that the theorem does not imply an algorithm per se. However, by identifying one parameter value with the current estimate and the other with a proposed estimate, we can outline an algorithm, provided that we can come up with new proposed estimates. How to do so will depend on the model in question, and we leave this for future work.
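One naive way to turn this outline into code is to draw random-walk proposals and keep a proposal only when the inequality holds. The sketch below is a hypothetical illustration, not a scheme proposed in the article: it assumes a two-component Gaussian mixture with fixed equal weights, five made-up observations, and a latent label vector small enough that both sides can be computed exactly by enumeration.

```python
import itertools
import math
import random

WEIGHTS = (0.5, 0.5)                 # fixed mixture weights (assumption)
XS = (-1.1, -0.8, 0.9, 1.3, 1.0)     # made-up observations
LABELINGS = list(itertools.product((0, 1), repeat=len(XS)))  # all values of z

def normal_pdf(x, mean):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)

def joint(zs, theta):
    """Joint density of the data and the label vector zs; theta = (mean_0, mean_1)."""
    return math.prod(WEIGHTS[z] * normal_pdf(x, theta[z]) for x, z in zip(XS, zs))

def likelihood(theta):
    """Marginal likelihood: sum of the joint density over all label vectors."""
    return sum(joint(zs, theta) for zs in LABELINGS)

def side(theta_from, theta_to):
    """Exact posterior mean, under theta_from, of min(1, f_theta_to / f_theta_from)."""
    lik = likelihood(theta_from)
    total = 0.0
    for zs in LABELINGS:
        f_from = joint(zs, theta_from)
        total += min(1.0, joint(zs, theta_to) / f_from) * f_from / lik
    return total

def ascend(theta, n_steps, step_sd, rng):
    """Random-walk proposals, accepted only when the inequality holds."""
    for _ in range(n_steps):
        candidate = tuple(t + rng.gauss(0.0, step_sd) for t in theta)
        if side(theta, candidate) > side(candidate, theta):
            theta = candidate
    return theta

start = (0.0, 0.5)
final = ascend(start, n_steps=200, step_sd=0.3, rng=random.Random(1))
print(final, likelihood(start), likelihood(final))
```

Since every accepted step satisfies the inequality, the likelihood never decreases along the path. The random-walk proposal is only a placeholder; a model-specific proposal would presumably be far more efficient.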
With its very general setting, the presented theorem should be applicable in a wide range of models, since latent variables are present in many classes of statistical models.
Notably, the theorem does not require any assumptions on the parameter space. The most important assumption is that the joint density is non-zero for almost all values of the latent variable, which can actually be relaxed to some extent.
Comparison to the EM algorithm
A close relative is the EM algorithm and its derivatives: using the current estimate, the EM algorithm proposes a new estimate that is guaranteed not to decrease the likelihood.
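For reference, the standard EM update can be written as follows, with $f_\vartheta(x, z)$ denoting the joint density and $P_\theta$ the posterior of $z$ given $x$ under the current estimate $\theta$ (notation assumed here). This is the textbook formulation and is not specific to this article:

```latex
\theta^{\mathrm{new}} \in \arg\max_{\vartheta \in \Theta}
  \int \log f_\vartheta(x, z)\, dP_\theta(z),
\qquad
L(\theta^{\mathrm{new}}) \ge L(\theta).
```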
However, there are some notable differences:
The EM algorithm and its derivatives use only the posterior distribution of the current estimate.
Application of the theorem requires using two posterior distributions.
We believe the greatest potential of the presented theorem lies in combining it with a clever "proposal scheme" for new estimates. Unlike the EM algorithm and its derivatives, we are free to choose any proposal for an updated estimate. Since the proposal scheme can be tailored to a specific model, it should be possible to create new and powerful methods to facilitate maximum likelihood estimation.
- Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977), ‘Maximum likelihood from incomplete data via the EM algorithm’, Journal of the Royal Statistical Society: Series B (Methodological) 39(1), 1–22.
- Fisher, R. A. (1922), ‘On the mathematical foundations of theoretical statistics’, Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 222(594-604), 309–368.