There has been some recent and rather lively debate as to whether the profile likelihood, obtained by maximizing out nuisance parameters in the full likelihood, can be considered a “true” likelihood function in the remaining parameters, with arguments drawing on probabilistic, possibilistic and even philosophical perspectives (e.g., Aitkin, 2005, 2010; Evans, 2015; Maclaren, 2018; Robert, 2018). Here, a “true” likelihood is a function that corresponds to some joint probability distribution on the data for each value of the model parameters. (Maclaren (2018) argues for a different notion of likelihood based on possibility rather than probability, with addition replaced by maximization. This approach is worth further consideration; in this note, however, we stick with the classical probability-based notion of likelihood.)
The consensus from the statistical literature seems to be “no”, in general. Aitkin (2005) states rather unequivocally that “the profile likelihood is not a likelihood, but a likelihood maximized over nuisance parameters given the values of the parameters of interest.” In other words, the maximization operator does not generally take probability distributions to probability distributions, but merely to a “slice” in a probability distribution (hence the “profile” moniker).
Of course, from a frequentist point of view the profile likelihood can still exhibit likelihood-type statistical properties, regardless of whether or not it corresponds to a true likelihood. These properties include consistency, asymptotic normality and asymptotic efficiency of its maximizer, with the profile likelihood ratio test even exhibiting Wilks’ phenomenon under some general conditions (Murphy & van der Vaart, 2000).
From a Bayesian point of view, nuisance parameters are usually dealt with via marginalization instead of maximization. In contrast to the profile likelihood, there is little debate as to whether the marginal likelihood corresponds to a true likelihood, as “integration over variables takes probability distributions to probability distributions” (Maclaren, 2018). Indeed, the elementary concept of a marginal probability is constructed precisely by integrating joint probabilities over a subset of variables.
While maximization and marginalization are two seemingly disparate operators, it turns out that in the special case of normal models, the profile likelihood for the mean parameter(s) is precisely equivalent to the marginal likelihood obtained by integrating over Jeffreys prior on the nuisance variance parameters. In this case, profile likelihood can be considered a true likelihood insofar as a marginal likelihood is a true likelihood. This equivalence is exact for normal models, and we speculate that, like other results from likelihood theory, it may only be asymptotically true for other exponential families.
2 Profile likelihood, marginal likelihood and the equivalent prior for normal models
Let $Y_1, \ldots, Y_n$ be a random sample from a normal distribution with mean $\mu$ and variance $\sigma^2$. The likelihood function (using Bayesian notation) is given by
\[
p(y \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right).
\]
The maximum likelihood estimator of $\sigma^2$ for each given $\mu$ is
\[
\hat{\sigma}^2_\mu = \frac{1}{n} \sum_{i=1}^n (y_i - \mu)^2,
\]
so that the profile likelihood for $\mu$ is
\[
L_p(\mu) = p(y \mid \mu, \hat{\sigma}^2_\mu) = (2\pi\hat{\sigma}^2_\mu)^{-n/2} e^{-n/2} \propto \left( \sum_{i=1}^n (y_i - \mu)^2 \right)^{-n/2}.
\]
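As a quick numerical sanity check (our own illustration; the data and the value of $\mu$ are arbitrary), one can verify the closed-form maximizer $\hat{\sigma}^2_\mu$ against a brute-force grid search over $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(size=10)   # illustrative data; any sample works
n = len(y)

def log_lik(mu, v):
    """Log of the normal likelihood p(y | mu, sigma^2 = v)."""
    return -0.5 * n * np.log(2 * np.pi * v) - np.sum((y - mu) ** 2) / (2 * v)

mu = 0.3                                   # an arbitrary value of the interest parameter
v_hat = np.sum((y - mu) ** 2) / n          # closed-form maximizer sigma^2_hat(mu)
grid = np.linspace(0.01, 10.0, 100_000)    # brute-force search over sigma^2
assert log_lik(mu, v_hat) >= log_lik(mu, grid).max() - 1e-9
```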
It is not immediately clear that this function corresponds to a valid probability distribution in $y$ for each $\mu$. This kind of ambiguity is precisely what has fuelled the debate over whether profile likelihoods can be considered true likelihoods.
On the other hand, consider a Jeffreys prior on the variance, $p(\sigma^2 \mid \mu) \propto 1/\sigma^2$, where the mean $\mu$ is treated as given. Integrating out $\sigma^2$ leads to the marginal likelihood
\[
p(y \mid \mu) = \int_0^\infty p(y \mid \mu, \sigma^2)\, \frac{1}{\sigma^2}\, d\sigma^2 \propto \left( \sum_{i=1}^n (y_i - \mu)^2 \right)^{-n/2},
\]
by noticing that the integrand is the kernel of an Inverse-Gamma distribution in $\sigma^2$ with shape parameter $n/2$ and scale parameter $\frac{1}{2}\sum_{i=1}^n (y_i - \mu)^2$. We see that the marginal likelihood coincides exactly with the profile likelihood, up to a multiplicative constant not depending on $\mu$, that is,
\[
L_p(\mu) \propto p(y \mid \mu)
\]
for Jeffreys prior $p(\sigma^2 \mid \mu) \propto 1/\sigma^2$ on the variance $\sigma^2$.
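The equivalence can also be checked numerically. The sketch below (our own illustration, assuming nothing beyond the formulas above) integrates the normal likelihood against the Jeffreys prior $1/\sigma^2$ by trapezoidal quadrature on the log scale and confirms that the ratio to the profile likelihood does not depend on $\mu$:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=10)   # illustrative data
n = len(y)

u = np.linspace(-12.0, 12.0, 4001)   # quadrature grid over log(sigma^2)
v, h = np.exp(u), 24.0 / 4000

def profile(mu):
    # L_p(mu) up to a constant: (sum of squared deviations)^(-n/2)
    return np.sum((y - mu) ** 2) ** (-n / 2)

def marginal(mu):
    # Integrate p(y | mu, v) * (1/v) dv; the substitution v = e^u gives
    # dv = v du, so the Jeffreys factor 1/v cancels the Jacobian.
    s = np.sum((y - mu) ** 2)
    f = (2 * np.pi * v) ** (-n / 2) * np.exp(-s / (2 * v))
    return (f.sum() - 0.5 * (f[0] + f[-1])) * h   # trapezoidal rule

mus = np.linspace(-1.0, 1.0, 5)
ratios = np.array([marginal(m) / profile(m) for m in mus])
assert np.allclose(ratios, ratios[0], rtol=1e-5)   # constant in mu
```

The constant ratio is exactly the Inverse-Gamma normalizing constant appearing in the derivation above.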
A practical consequence is that the profile likelihood can be used to construct valid posterior distributions for Bayesian inference. Given a prior $p(\mu)$ on $\mu$, the “profile posterior”
\[
p(\mu \mid y) \propto L_p(\mu)\, p(\mu)
\]
is precisely the same as the marginal posterior obtained from integrating over Jeffreys prior on $\sigma^2$. For example, profile posterior distributions resulting from various priors on $\mu$, including an improper flat prior, can be obtained via Gibbs sampling, following Chapter 8.2.1 of Kroese & Chan (2014), say. Indeed, following the example in that book, Figure 1 displays the corresponding profile posterior distributions for $\mu$ given a random sample of 10 observations from a standard normal distribution.
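As a minimal sketch of such a sampler (our own illustration, not the code from Kroese & Chan), assume the improper flat prior $p(\mu) \propto 1$ together with Jeffreys prior $p(\sigma^2) \propto 1/\sigma^2$; both full conditionals are then standard:

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(size=10)   # 10 observations from a standard normal, as in the note
n, ybar = len(y), y.mean()

def gibbs(n_iter=20_000, burn=2_000):
    sigma2 = y.var()      # arbitrary starting value
    draws = []
    for t in range(n_iter):
        # mu | sigma2, y ~ N(ybar, sigma2/n)        (flat prior on mu)
        mu = rng.normal(ybar, np.sqrt(sigma2 / n))
        # sigma2 | mu, y ~ InvGamma(n/2, S_mu/2)    (Jeffreys prior on sigma2)
        s_mu = np.sum((y - mu) ** 2)
        sigma2 = 1.0 / rng.gamma(n / 2, 2.0 / s_mu)
        if t >= burn:
            draws.append(mu)
    return np.array(draws)

mu_draws = gibbs()
```

Under these priors the marginal (equivalently, profile) posterior of $\mu$ is a location-scale Student-$t$ centred at $\bar{y}$, so the sampler output can be checked against that closed form.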
The Jeffreys prior can therefore be thought of as an “equivalent prior” that makes marginalizing the likelihood equivalent to maximizing the likelihood. Analogous results for the multivariate and regression cases are also elementary to show.
Let $Y_1, \ldots, Y_n \sim N(\mu, \Sigma)$, where $\mu$ is a $d$-vector of mean parameters of interest and $\Sigma$ is a nuisance variance matrix. Then the profile likelihood for $\mu$ is equivalent to the marginal likelihood for $\mu$ obtained by integrating over Jeffreys prior $p(\Sigma) \propto |\Sigma|^{-(d+1)/2}$ on $\Sigma$.
Let $Y_i \sim N(x_i^\top \beta, \sigma^2)$, $i = 1, \ldots, n$, where each $x_i$ is a vector of covariates, $\beta$ is an associated vector of mean parameters of interest and $\sigma^2$ is a nuisance variance parameter. Then the profile likelihood for $\beta$ is equivalent to the marginal likelihood for $\beta$ obtained by integrating over Jeffreys prior $p(\sigma^2) \propto 1/\sigma^2$ on $\sigma^2$.
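The regression case can be verified numerically along the same lines as the location case. The sketch below is our own illustration: the design matrix, coefficients and quadrature grid are arbitrary choices, and the check is that the ratio of marginal to profile likelihood is constant across test values of $\beta$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=n)   # simulated data

u = np.linspace(-12.0, 12.0, 4001)   # quadrature grid over log(sigma^2)
v, h = np.exp(u), 24.0 / 4000

def profile(beta):
    # Profile likelihood up to a constant: ||y - X beta||^2 to the power -n/2
    return np.sum((y - X @ beta) ** 2) ** (-n / 2)

def marginal(beta):
    # Likelihood integrated against Jeffreys prior 1/v;
    # the Jacobian dv = v du cancels the 1/v factor.
    s = np.sum((y - X @ beta) ** 2)
    f = (2 * np.pi * v) ** (-n / 2) * np.exp(-s / (2 * v))
    return (f.sum() - 0.5 * (f[0] + f[-1])) * h   # trapezoidal rule

betas = [rng.normal(size=p) for _ in range(5)]
ratios = np.array([marginal(b) / profile(b) for b in betas])
assert np.allclose(ratios, ratios[0], rtol=1e-5)   # constant in beta
```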
Our contribution to the ongoing debate over the nature of the profile likelihood is to provide a simple (counter-)example in which the profile likelihood is identical to a marginal likelihood. We find it rather remarkable and somewhat counter-intuitive that marginalization can be made equivalent to maximization via a particular choice of prior on the nuisance parameters. That this equivalent prior happens to be the well-known Jeffreys prior is also an interesting coincidence, but perhaps not completely unexpected as both the profile likelihood and Jeffreys prior are constructed to be “non-informative” in some frequentist or Bayesian sense, respectively. Of course, whether the improper Jeffreys prior constitutes a “true” prior that can be integrated over is another debate, perhaps for another day.
In keeping with Aitkin (2005), we suspect that normal models are the only class of models for which this equivalence is exact. However, we also speculate that a generalization may hold asymptotically for other exponential families. Heuristically speaking, the score equations for exponential families, whilst typically not solvable in closed form, can be linearized in their parameters, with leading term proportional to the Hessian of the likelihood, the inverse of which forms the basis of Jeffreys prior. It is also well known that exponential families for the data induce exponential families in the model parameters (which is why exponential families always have conjugate priors). These two ingredients combine to give us hope that the (linearized) profile likelihood might pop up as the normalizing constant when integrating out an exponential family likelihood over the Jeffreys prior, just as it did in the normal case. This is a lead worth exploring further.
We thank Dr Yao-ban Chan (Melbourne) for comments that improved this note.
- Aitkin, M. (2005). Profile Likelihood. In Encyclopedia of Biostatistics. John Wiley & Sons.
- Aitkin, M. (2010). Statistical Inference: An Integrated Bayesian/Likelihood Approach. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. CRC Press.
- Evans, M. (2015). Measuring Statistical Evidence Using Relative Belief. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. CRC Press.
- Kroese, D.P. & Chan, J.C.C. (2014). Statistical Modeling and Computation. Springer, New York.
- Maclaren, O.J. (2018). Is profile likelihood a true likelihood? An argument in favor. arXiv preprint, arxiv.org/abs/1801.04369.
- Murphy, S.A. & van der Vaart, A.W. (2000). On Profile Likelihood. Journal of the American Statistical Association, 95, 449–465.
- Robert, C.P. (2018). Blog post, xianblog.wordpress.com/2018/03/27/are-profile-likelihoods-likelihoods/.