Unknown parameters of mixture models are frequently estimated via the Maximum Marginal Likelihood (MML) method, which employs the marginal probability of the observed data pawitan ; cox ; jelinek_review ; rabiner_review ; ephraim_review . A local maximization of the marginal likelihood can be carried out via one of several computationally feasible algorithms, e.g. the Expectation-Maximization (EM) method jelinek_review ; rabiner_review ; ephraim_review .
There is, however, a range of problems where MML does not apply due to observational nonidentifiability: the full model (including hidden variables) is identifiable, but the observed (marginal) model is not. Hence the maxima of the marginal likelihood are (generally infinitely) degenerate, and the outcome of MML does depend on the initial point of the maximization. Resolving the nonidentifiability in such situations is not hopeless, precisely because the full model is identifiable. However, the standard likelihood maximization cannot be employed, since there are hidden (not observed) variables. We emphasize that some information about unknown parameters is always lost after marginalization cox . Observational nonidentifiability is an extreme case of this.
Nonidentifiability in mixture models is studied in teicher ; rothenberg ; ito ; watanabe ; hsiao ; ran_hu ; welcher ; manski ; allman ; gu ; see hsiao ; ran_hu ; welcher for reviews. In such models even an infinitely large number of observed data samples cannot guarantee the perfect recovery of parameters (i.e. the convergence to true parameter values), because the maxima of the likelihood are infinitely degenerate rothenberg . A common attitude towards nonidentifiable models is that they are in a certain sense rare and do not have much practical importance. This is incorrect: almost any model becomes nonidentifiable if the number of unknown parameters is sufficiently large, i.e. if the model is sufficiently realistic watanabe ; hsiao . Moreover, nonidentifiability can be present effectively, due to unresponsiveness of a many-parameter likelihood along sufficiently many directions sethna1 ; sethna2 ; see sethna3 for a review. The simplest scenario of this is realized via small eigenvalues of the likelihood Hessian. For practical purposes such an effective nonidentifiability (which is generically found in systems biology and chemistry sethna1 ; sethna2 ; sethna3 ) is indistinguishable from the true one.
Aiming to solve the problem of observational nonidentifiability, we extend the marginal likelihood $L(\theta)$ via a one-parameter generalized function $F_\beta(\theta)$, which is constructed by analogy to the free energy in statistical physics. The positive parameter $\beta$ is an analogue of the inverse temperature from statistical physics, and the marginal likelihood is recovered for $\beta=1$. We show that $F_\beta$ inherits pertinent features of $L$; e.g. it obeys the conditionality principle and concavity (for $\beta<1$), and it allows the search for its local maxima via a suitably generalized expectation-maximization method. Its maximization resolves the degeneracy of $L$. It does have relations with the maximum entropy method (for $\beta<1$) and with entropy minimization (for $\beta>1$). For several models we found an optimal value of $\beta$, which appears to be close to, but strictly smaller than, 1. We also show numerically that maximizing $F_\beta$ leads to better results than (i) a random selection of one of the many outcomes provided by maximizing the usual likelihood $L$; (ii) averaging over many such random selections; see section V. Both (i) and (ii) would be among the standard reactions of practitioners to (effective) nonidentifiability.
For $\beta\to\infty$ we get another known quantity: $F_\infty$ coincides with the h-likelihood pawitan ; jelinek_review ; rabiner_review ; ephraim_review , i.e. the full likelihood (including both observed and hidden variables), where the values of the hidden variables are replaced by their maximum a posteriori (MAP) estimates from the observed data nelder ; bjorn ; scand . The h-likelihood is employed in Hidden Markov Models (HMM), where efficient methods of maximizing it are known as Viterbi Training (VT) or k-means segmentation jelinek_review ; rabiner_review ; ephraim_review ; rabiner ; merhav . When the h-likelihood is applied to an observationally nonidentifiable situation, its results converge to boundary values of the parameters (e.g. zero or one for unknown probabilities), as was demonstrated by analyzing an exactly solvable HMM model nips . Such results are inferior to random selection (see (i) and (ii) above) if there is no prior information that the model is indeed sparse in this sense; cf. section VI. This feature is one reason why the h-likelihood maximization leads to obvious failures even in simple models nelder ; meng . In particular, it cannot serve as a general tool for resolving observational nonidentifiability.
$F_\beta$ also relates to a recent trend in Bayesian statistics, where the model probability is raised to a certain positive power, akin to (6) bissiri ; holmes ; hagan ; friel ; miller . In this way people deal with misspecified models bissiri ; holmes ; miller , facilitate the computation of Bayesian factors for model selection friel , regularize models hagan , etc.; see miller for a recent review. The raising to a power emerges from decision theory (as applied to misspecified models) bissiri and presents a general method for making Bayesian models more robust. Among the actively researched issues here is the selection of the power parameter holmes .
This paper is written in the style of the book by Cox and Hinkley cox : it is example-based and informal, not least because it employs ideas of statistical physics. It is organized as follows. Section II.1 recalls the definition of the observational nonidentifiability we set out to study. Section II.2 defines the generalized likelihood $F_\beta$ and discusses its features inherited from the usual likelihood $L$. Sections II.3 and II.4 study the simplest nonidentifiable examples that illustrate features of $F_\beta$. Section III defines the main model we shall focus on. It amounts to a finite mixture with unknown probabilities. Section IV studies for this model the generalized likelihood $F_\beta$. Numerical comparison with the random selection methods is discussed in section V. Section VI studies the maximization of $F_\infty$ and shows in which sense this is related to entropy minimization. We summarize in the last section.
II Free energy as generalized likelihood
II.1 Defining observational nonidentifiability
We are given two random variables $X$ and $Y$ with values $x$ and $y$, respectively. We assume that $X$ is hidden, while $Y$ is the observed variable, i.e. we assume a mixture model. The joint probabilities of $(X,Y)$ generally depend on unknown parameters $\theta$. Suppose we are given the observation data
where $y_1,\dots,y_N$ are values of $Y$ generated independently from each other. Then $\theta$ can be estimated via the (marginal, logarithmic) likelihood
Within the maximum likelihood method, the unknown $\theta$ can be determined from the maxima of this likelihood. Since $X$ is hidden, we can easily run into the nonidentifiability problem, where (at least two) different values of $\theta$ lead to the same probability for all values of $y$ teicher ; rothenberg ; ito ; hsiao :
Eqs. (4) imply that the maxima of the marginal likelihood are degenerate; see below for examples. In addition to (4), we shall require that the full model is still identifiable, i.e. that equality of the joint probabilities for all values of $(x,y)$ does imply equality of the parameters:
We shall propose a solution to this type of nonidentifiability. Below we shall focus on the most acute situation, where the marginal probability in (4) coincides for a continuum of parameter values, and the sample length $N$ in (2) is very large: $N\to\infty$. Note that other (weaker) forms of nonidentifiability are possible and well documented in the literature: the weakest form of nonidentifiability is when it is restricted to a measure-zero subset of the parameter domain (generic identifiability) allman . A stronger form is that of partial nonidentifiability, where some information on the parameters (e.g. certain bounds) can still be recovered from observations; see gu for a recent discussion.
II.2 Generalized likelihood: definition and features
Instead of (3) we set out to maximize over $\theta$ its generalization, viz. the negative free energy
where $\beta>0$ is a parameter. An obvious feature of (6) is that for $\beta=1$ we return from (6) to the (marginal) likelihood function in (3). Hence, if we apply the maximization of (6) with $\beta\approx 1$ to an identifiable model, we expect to get results that are close to those found via maximization of the likelihood. The meaning of (6) is that it sums over all values of the hidden variable $x$, but does not reduce the outcome to the usual (marginal) likelihood.
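To make definition (6) concrete, here is a minimal numerical sketch of $F_\beta$ for a discrete model, assuming the joint probabilities are arranged as an array over $(x,y)$; the function name and the numbers are our own, purely illustrative.

```python
import numpy as np

def generalized_likelihood(joint, data, beta):
    """F_beta = (1/beta) * sum_i log( sum_x p(x, y_i)^beta ), cf. Eq. (6).

    joint : array of shape (n_x, n_y) with joint probabilities p(x, y)
    data  : observed y-indices, i.e. the sample y_1, ..., y_N
    beta  : positive inverse-temperature parameter
    """
    return sum(np.log(np.sum(joint[:, y] ** beta)) for y in data) / beta

joint = np.array([[0.25, 0.25],
                  [0.25, 0.25]])
# beta = 1 gives the marginal log-likelihood sum_i log p(y_i):
print(generalized_likelihood(joint, [0, 1], 1.0))   # = 2 * log(0.5) ~ -1.386
```

For $\beta=1$ each term reduces to $\ln p(y_i)$, i.e. the marginal log-likelihood; the decrease of the output as $\beta$ grows anticipates the monotonicity discussed in section II.2.5.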
Below we discuss several features that $F_\beta$ inherits from the usual likelihood $L$. These features motivate introducing $F_\beta$ as a generalization of $L$. The first such feature is apparent from the fact that $F_\beta$ in (6) is to be maximized over the unknown parameter $\theta$ pawitan . If we reparametrize $\theta$ via a bijective (one-to-one) function $\eta(\theta)$, i.e. if the full information on $\theta$ is retained in $\eta$, then the maximization outcomes
are related via the same function: $\hat\eta=\eta(\hat\theta)$.
II.2.2 Relations of (6) to nonequilibrium free energy
Relations between statistical physics and probabilistic inference frequently proceed via the Gibbs distribution, where minus the logarithm of the probability to be inferred is interpreted as the physical energy (both these quantities are additive for independent events), while the physical temperature is set to 1; see mezard for a textbook presentation of this analogy and lamont for a recent review. The main point of making this analogy is that powerful approximate methods of statistical physics can be applied to inference mezard ; lamont .
In the context of mixture models we can carry this analogy one step further. The analogy is now structural, i.e. it relates to the form of (6), and not to the applicability of any approximate method. We relate $-\ln p(x,y;\theta)$ to the energy of a physical system, where $x$ and $y$ are respectively fast (hidden) and slow (observed) variables. Here fast and slow connect with (resp.) hidden and observed, which agrees with the set-up of statistical physics, where only a part of the variables is observed free . Then (6) connects to the negative nonequilibrium free energy at inverse temperature $\beta$ free . Here nonequilibrium means that only one variable (i.e. the fast one) is thermalized (i.e. its conditional probability is Gibbsian), while the free energy has several physical meanings free ; e.g. it is a generating function for calculating various averages, and also the (physical) work done under a slow change of suitable externally driven parameters free . The maximization of (6) naturally relates to the physical tendency of decreasing free energy (one formulation of the second law of thermodynamics) free .
Though formal, this correspondence with statistical physics will be instrumental in interpreting $F_\beta$. E.g. we shall see that the maximizer of $F_\beta$ is unique (in contrast to the maximizers of $L$), and this fact can be related to sufficiently high temperatures that simplify the free-energy landscape.
II.2.3 Relations with h-likelihood
For $\beta\to\infty$ we revert from (6) to
where $\hat x(y_i)$ is the MAP (maximum a posteriori) estimate of the hidden variable given the observed $y_i$ ephraim_review ; rabiner ; nips . The meaning of (8) is obvious in the context of (5): once we cannot employ the maximum likelihood method to the full model (since we do not know what to take for the hidden variable), we first estimate the hidden variable from data (2) via the MAP method, and then proceed a la the usual likelihood 111Note that in (8) the maximization over the hidden variable was carried out for a given value of $y_i$, i.e. we did not apply it to the whole sample (2). Doing so leads to a different quantity instead of (8). We did not see applications of that quantity in the literature. One possible reason for this is that its definition makes an unwarranted (though not strictly forbidden) assumption that the hidden variable is fixed during the sample generation process. At any rate, we applied it to models and noted that its results for parameter estimation are worse than those of (8). Hence we stick to (8)..
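The limit behind (8) can be checked numerically: $\frac{1}{\beta}\ln\sum_x p^\beta(x,y)\to\ln\max_x p(x,y)$ as $\beta\to\infty$, so the sum over the hidden variable is dominated by its MAP term. A small sketch with illustrative numbers (the names are ours):

```python
import numpy as np

def h_likelihood(joint, data):
    """beta -> infinity limit of F_beta: each hidden x is replaced by the
    MAP term, since (1/beta) * log(sum_x p^beta) -> log(max_x p)."""
    return sum(np.log(np.max(joint[:, y])) for y in data)

joint = np.array([[0.4, 0.1],    # rows: hidden x; columns: observed y
                  [0.2, 0.3]])
data = [0, 1]
# F_beta at large beta is already very close to the h-likelihood:
F_200 = sum(np.log(np.sum(joint[:, y] ** 200)) for y in data) / 200
print(h_likelihood(joint, data), F_200)
```

The hard (max) assignment of the hidden variable is what connects (8) to Viterbi-style training.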
It is known that the ordinary maximum likelihood method has an appealing feature of conditionality, which is formulated in several related forms cox , and closely connects to other fundamental principles of statistics, e.g. to the likelihood principle cox ; berger ; evans . We now find out to what extent the conditionality principle is inherited by the generalized likelihood $F_\beta$ defined in (6).
First we note that $F_\beta$ obeys the weak conditionality principle berger . To define this principle we should enlarge the original pair of random variables to a triple by adding a random variable $Z$, which assumes (for simplicity) a finite set of values. Now $X$ and $Y$ are still (resp.) the hidden and observed variables, while $Z$ determines the choice of the experiment and does not depend on the unknown parameter $\theta$ berger ; evans . The choice is made before observing $Y$, i.e. before collecting the sample (2), and the (marginal) probability of $Z$ does not depend on $\theta$. For this extended experiment the data amounts to sample (2) plus the indicator for the choice of the experiment. Then the analogue of (6) is defined as
where $p(z)$ is the probability of the experiment choice $Z=z$. It is seen that the inference for the extended experiment produces the same result as the inference for the partial experiment, where the value of $Z$ was fixed beforehand (i.e. the choice of the experiment was not a part of the data):
This is the weak conditionality principle, which holds for the generalized likelihood $F_\beta$.
However, a stronger form of the conditionality principle does not hold for $F_\beta$, because this form mixes observable and hidden variables. Define a new random variable $W$ that depends on $X$ and $Y$ and assumes values $w$ cox . Assume that the marginal probability of $W$ does not depend on $\theta$, i.e. $W$ is an ancillary variable with respect to estimating $\theta$ fraser_review 222Recall that ancillary variables need not always exist for a given model sze .. Now (6) reads
where the internal sum runs over the set of values of $(x,y)$ compatible with a fixed value $w$ of $W$, while $w$ goes over all its values. One defines a new experiment, where it is a priori known that the value of $W$ is restricted to a specific value from its range. The generalized likelihood for this experiment is
II.2.5 Monotonicity and concavity
$F_\beta$ is monotonically decreasing over $\beta$:
since the derivative $\partial F_\beta/\partial\beta$ is a weighted sum of negative entropies.
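This monotonicity is easy to check numerically on a single term of (6); the probabilities below are illustrative.

```python
import numpy as np

# One term of (6) for a fixed observed y: f(beta) = (1/beta) * log(sum_x p^beta);
# its beta-derivative is minus the entropy of the tilted distribution over beta^2,
# hence nonpositive.
p = np.array([0.5, 0.3, 0.2])                  # illustrative column p(x, y)
betas = np.linspace(0.2, 5.0, 25)
f = np.array([np.log(np.sum(p ** b)) / b for b in betas])
assert np.all(np.diff(f) < 0)                  # monotonically decreasing in beta
```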
Let $p(x,y;\theta)$ be defined over a partially convex parameter set, i.e. if $\theta_1$ and $\theta_2$ belong to this set, then for any $0\le\lambda\le1$ there exists $\theta_\lambda$ in the set such that $p(x,y;\theta_\lambda)=\lambda\, p(x,y;\theta_1)+(1-\lambda)\, p(x,y;\theta_2)$; such a model is studied below in section III. Now for $\beta<1$, $F_\beta$ from (6) is a concave function, since it is a linear combination of superpositions of two strictly concave functions, $u^\beta$ and $\ln u$:
For $\beta>1$, we note that a superposition of a strictly convex function and a monotonic one is pseudo-convex pseudoconvex . Pseudo-convex functions do share many important features of convex functions, but $F_\beta$ is generally not pseudo-convex, since besides the superposition of $u^\beta$ and $\ln u$, (6) involves a summation over the data, and the sum of two pseudo-convex functions is generally not pseudo-convex pseudoconvex . In section VI we shall show numerically that the maximizers of $F_\infty$ relate to those of a generalized Schur-convex function; see Appendix D.
II.2.6 Relations with the maximum entropy method
The maximization of the generalized likelihood (6) will now be related to the maximum entropy method jaynes ; jaynes_2 ; skyrms ; enk ; cheeseman . Recall that the method addresses the problem of recovering the unknown probabilities of a random variable on the grounds of certain constraints imposed on those probabilities. The type and number of those constraints are not decided within the method itself jaynes_2 ; enk , though the method can give some recommendations for selecting relevant constraints; see Appendix C. The probabilities are then determined from the constrained maximization of the entropy jaynes ; jaynes_2 ; skyrms ; enk ; cheeseman . The intuitive rationale of the method is that it provides the most unbiased choice of probability compatible with the constraints.
To find this relation, we expand (6) for a small $1-\beta$ (i.e. $\beta$ close to 1)
where $S$ is the entropy of the hidden variable for a fixed observation $y_i$ and fixed parameters $\theta$. When expanding over $1-\beta$ we need to assume that it is small, but eventually a milder condition suffices, because the terms in (18–20) stay finite.
The zero-order term of the expansion is naturally the marginal likelihood; see (18). But, as we explained around (4), even when $N\to\infty$ in (16), the maximization of the likelihood does not lead to a single result if the model is not identifiable. This degeneracy will be (at least partially) lifted if the next-order term is taken into account; cf. (18). For $\beta<1$ this term will tend to lift the degeneracy by selecting those maxima which achieve the largest average entropy. Hence for a small, but positive $1-\beta$, the results of maximizing the likelihood will (effectively) serve as constraints when maximizing the entropy. This is the relation between maximizing $F_\beta$ (for $\beta$ slightly below 1) and entropy maximization 333Note that the idea of lifting degeneracies of the maximum likelihood by maximizing the entropy over those degenerate solutions appeared recently in the quantum maximum likelihood method hradil ; singa . But there the degeneracies of the likelihood are due to incomplete (noisy) data, i.e. they appear in an identifiable model..
Note that when the observed frequencies converge to the true probabilities of $Y$, i.e. when $N\to\infty$ in (16), and when $\theta$ is fixed to its true value, the average entropy becomes the conditional entropy of the hidden variable given the observed one jaynes_2 . The appearance of the conditional entropy is reasonable, given the fact that $Y$ is an observed variable.
Within the second-order term
the fluctuations of the entropy enter into consideration: the degeneracy will be lifted by (simultaneously) maximizing the entropy variance and maximizing the entropy; see (18, 19).
Likewise, for $\beta>1$ (but $\beta$ close to 1) the first-order term predicts that among the degenerate maxima of the likelihood, those of minimal entropy will be selected.
II.2.7 $Q$-function and generalized EM procedure
$F_\beta$ in (6) admits a representation via a suitably generalized $Q$-function, i.e. its local maxima can be calculated via a (generalized) expectation-maximization (EM) algorithm. Let us define for two different values $\theta$ and $\theta'$ of the parameters:
where the weight defined by (15) is formally a conditional probability. For $\beta=1$ we revert from (21) to the average of the usual $Q$-function ephraim_review ; wu , i.e. the full log-likelihood averaged over the hidden variable given the observed data and calculated at trial values $\theta$ and $\theta'$.
Now employ the non-negativity of the relative entropy:
Hence, if for a fixed $\theta$ we choose $\theta'$ such that $Q_\beta(\theta'|\theta)>Q_\beta(\theta|\theta)$, then $F_\beta(\theta')>F_\beta(\theta)$. Eq. (23) shows the main idea of EM: defining
Eq. (25) shows that if we find $\theta^*$ such that the maximum of $Q_\beta(\theta'|\theta^*)$ over $\theta'$ is reached at $\theta'=\theta^*$, i.e.
then $\theta^*$ can be a local maximum of $F_\beta$, or an inflection point of $F_\beta$ (which has a direction along which it maximizes), or, for a multidimensional $\theta$, a saddle point. Eq. (26) holds if (24) converges. Thus, similarly to the usual likelihood, $F_\beta$ can be partially (i.e. generally not globally) maximized via (21).
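For the finite mixture of section III (unknown probabilities of the hidden variable and unknown conditional probabilities of the observed one), one iteration of this generalized EM can be sketched as follows: the E-step computes tilted responsibilities proportional to $p^\beta(x,y_i)$, i.e. the formal conditional probability from (15), and the M-step re-estimates the probabilities from them; $\beta=1$ recovers the standard EM step. The parametrization and names are illustrative, not the exact notation of the paper.

```python
import numpy as np

def em_step_beta(pi, C, data, beta):
    """One generalized EM step for a finite mixture p(x, y) = pi[x] * C[x, y].

    pi   : p(x), shape (n_x,)
    C    : p(y|x), shape (n_x, n_y), rows sum to 1
    data : observed y-indices
    beta : inverse temperature; beta = 1 recovers the standard EM step
    """
    n_x, n_y = C.shape
    joint = pi[:, None] * C
    new_pi = np.zeros(n_x)
    new_C = np.zeros((n_x, n_y))
    for y in data:
        w = joint[:, y] ** beta          # E-step: tilted responsibilities
        w /= w.sum()
        new_pi += w
        new_C[:, y] += w
    new_pi /= new_pi.sum()               # M-step: re-normalized estimates
    new_C /= new_C.sum(axis=1, keepdims=True)
    return new_pi, new_C

def F_beta(pi, C, data, beta):
    joint = pi[:, None] * C
    return sum(np.log(np.sum(joint[:, y] ** beta)) for y in data) / beta

pi, C = np.array([0.5, 0.5]), np.array([[0.9, 0.1], [0.2, 0.8]])
data, beta = [0, 0, 1], 0.8
new_pi, new_C = em_step_beta(pi, C, data, beta)
# The step does not decrease F_beta, in line with (23)-(26):
print(F_beta(new_pi, new_C, data, beta) >= F_beta(pi, C, data, beta))  # True
```

The monotone increase follows from the Jensen-type bound above: the M-step maximizes the tilted lower bound exactly, so each iteration cannot decrease $F_\beta$.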
II.3 First example (discrete random variables)
The following example is among the simplest ones, but it does illustrate several general points of the approach based on maximizing $F_\beta$. A binary random variable $X$ is hidden, while its noisy version $Y$ is observed. The joint probability of $(X,Y)$ reads
where $a$ and $b$ are unknown parameters:
$a$ relates to the prior probability of the unobserved $X$, and $b$ relates to the noise. Since the marginal probability of $Y$ holds:
even with an infinite set of $Y$-observations one can determine only the product $ab$, but not the separate factors $a$ and $b$. On the other hand, the full model (27) is identifiable with respect to $a$ and $b$, i.e. we have nonidentifiability in the sense of (4, 5). Appendix A.1 discusses a Bayesian approach to solving this nonidentifiability. As expected, if good (sharp) prior probabilities for $a$ or for $b$ are available, then the nonidentifiability can be resolved. However, when no prior information is available, one is invited to employ noninformative priors jaynes , which are improper for this model and do not lead to any sensible outcomes; see Appendix A.1. To the same end, Appendix A.2 studies a decision-theoretic (maximin) approach to this model, which also does not assume any prior information on $a$ and/or $b$. This approach also does not lead to sensible results. Thus, Appendices A.1 and A.2 argue that the estimation of parameters in (27) is a nontrivial problem.
Now the maximization equations reduce from (29) to
For $\beta>1$ the only solution of (32) is the trivial one, which is far from holding the constraint on the product $ab$; hence we disregard the domain $\beta>1$. For $\beta<1$, but $\beta$ close to 1, there is a non-zero solution of (32) that provides the global maximum of $F_\beta$. This solution is certainly better than the previous one, but it also does not hold the constraint on $ab$ exactly. The exact recovery of this constraint is achieved only in the limit $\beta\to1$. For any $\beta<1$ the maximization of $F_\beta$ thus reproduces $ab$ only approximately. Both these facts are seen from (32).
The situation is different for $\beta\to\infty$: under the assumed conditions on $a$ and $b$, we get two maxima of $F_\infty$ related to each other by the transformation $a\leftrightarrow b$:
Both solutions hold the constraint on $ab$; in a sense these are the most extreme possibilities that hold this constraint 444Note that (33) can be obtained in a more artificial way, by replacing $x$ in the full likelihood by a real-valued estimate, and then maximizing over $a$ and $b$; cf. this procedure with (8). This replacement is formal, since the joint probability is (strictly speaking) not defined for a real $x$. Still, for this model this formal procedure leads to (33)..
We emphasize that one does not need to focus exclusively on maximizing $F_\beta$ over $a$ and $b$. We note that the exponent of $F_\beta$ is finite, and hence we can consider it (after normalization) as a joint density of $a$ and $b$, which is still symmetric with respect to $a\leftrightarrow b$.
Returning to solutions (32) and (33), let us argue that there is a sense in which (32) is better than (33). To this end, we should enlarge our consideration and ask which solution is more suitable from the viewpoint of finding an estimate of the hidden variable given the observed value of $Y$. This estimation can be done via maximizing the overlap (or the risk function) over the estimate; see (27). The quality of the estimation can be judged via the average overlap [cf. (28)]:
If the values of $a$ and $b$ are known precisely, then (34) yields the benchmark value of the average overlap attained with the true parameters. Employing in (34) solution (33) instead, we get a larger value. This overconfidence is not desirable, because with approximate values of the parameters we do not expect to have a better estimation quality than with the true values. In contrast, using (32) in (34) we get a reasonable conclusion:
Hence, from this viewpoint, the best regime is $\beta$ slightly below 1, since then we approximately hold the constraint on $ab$ and avoid the above overconfidence. Moreover, this solution is unique, in contrast to (33).
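The degeneracy at $\beta=1$ and its lifting for $\beta<1$ can be seen numerically on a stand-in model with the structure of (27): hidden $x$ is Bernoulli with parameter $a$, and $y=1$ is observed with probability $b$ only when $x=1$, so that the marginal $p(y=1)=ab$ depends on the parameters only through their product. The model form and the numbers are our illustrative assumptions, not the exact Eq. (27).

```python
import numpy as np

def joint(a, b):
    # rows: x in {0, 1}; columns: y in {0, 1}; marginal p(y=1) = a * b
    return np.array([[1 - a, 0.0],
                     [a * (1 - b), a * b]])

def F(a, b, beta, counts):
    # F_beta per observed datum, in the N -> infinity limit of (6)
    p = joint(a, b)
    return sum(n * np.log(np.sum(p[:, y] ** beta))
               for y, n in enumerate(counts)) / beta

ab_true = 0.8 * 0.5                    # true parameters (a, b) = (0.8, 0.5)
counts = [1 - ab_true, ab_true]        # limiting frequencies of y = 0 and y = 1

# beta = 1: parameter pairs with equal product ab are indistinguishable ...
print(F(0.8, 0.5, 1.0, counts) - F(0.5, 0.8, 1.0, counts))  # ~ 0 (degenerate)
# ... while beta < 1 distinguishes them, lifting the degeneracy:
print(F(0.8, 0.5, 0.9, counts) - F(0.5, 0.8, 0.9, counts))  # > 0
```

In this sketch the $\beta<1$ criterion prefers the parameter pair with the larger conditional entropy of the hidden variable, in line with the entropy-maximization interpretation of section II.2.6.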
II.4 Second example (continuous random variables)
While the previous example showed that the maximization of $F_\beta$ can produce reasonable results, here we discuss a continuous-variable example, where a similar maximization leads nowhere without additional assumptions on the model. Consider an analogue of (27):
where $x$ (hidden) and $y$ (observed) are nonnegative, continuous random variables, while the two unknown parameters are positive. The full model is identifiable; e.g. the maximum-likelihood estimates of the parameters are expressed via the observed values of $x$ and $y$. But the marginal model is not identifiable, since
it depends only on the ratio of the two unknown parameters; cf. (28). Maximizing the marginal likelihood (for a large number of observations in (2)) leads to the correct value of this ratio. But the individual values of the unknown parameters are not determined in this way.
where $\Gamma(\cdot)$ is Euler's Gamma function. It is seen that $F_\beta$ is expressed in terms of two combinations of the unknown parameters, and hence its maximization can be carried out independently over these combinations. Now the maximization of (39) produces a finite outcome (see below), while the maximization of (38) leads to divergent (zero or infinite) outcomes depending on $\beta$. Hence $F_\beta$ does not have maxima at positive and finite values of the parameters, as required for having a reasonable model in (36). Note that this situation is worse than the maximization of the marginal likelihood, because there at least the value of the ratio was recovered correctly (in the limit of an infinite number of observations).
That is, for $\beta<1$, but $\beta$ close to 1, we get a unique maximization outcome for the two parameters. Note that the maximization of $F_\infty$ is still not sensible, since it leads to degenerate (boundary) values.
To conclude this continuous-variable example: here the maximization of $F_\beta$ produces unique and correct results for the unknown parameters (correct in the sense of reproducing their ratio), at the cost of the additional assumption (40). If this assumption is not made, then only the maximization of $F_1$, i.e. of the usual marginal likelihood, is sensible for this model. The maximization of $F_\infty$ is never sensible here.
III Mixture model with unknown probabilities
Now we focus on a sufficiently general mixture model, which will allow us to study in detail the structure of $F_\beta$ and its dependence on $\beta$. In the mixture model (1) the probabilities of the hidden variable and the conditional probabilities of the observed one are unknown. The prior information on them is introduced below. We shall skip the parameter $\theta$ and denote the unknown probabilities by hats:
Then $F_\beta$ reads from (6)
which is also produced by the maximization of $F_1$ from (44). Eq. (45) contains the known quantities (note the normalization constraint). If all the probabilities are unknown (apart from holding (45)), then the number of unknown variables exceeds the number of known ones. As expected, (45) will not give a unique solution, and the model is nonidentifiable; cf. (4).
Apart from (45), further constraints are also possible. Such constraints amount to various forms of prior information; e.g. the unknown probabilities can hold a linear constraint:
where the constraint involves a function of $x$ and $y$ with a known average. For instance, this average can refer to the correlation between $X$ and $Y$. Another example of (46) is when one of the probabilities is known precisely. Note that several linear constraints can be implemented simultaneously; this does not increase the analytical difficulty of treating the model. Constraints similar to (46) decrease the number of (effectively) unknown variables, but we shall focus on the situation where they cannot select a single solution of (45), i.e. the nonidentifiability is kept.
Once the maximization of the marginal likelihood does not lead to any definite outcome, we look at maximizing $F_\beta$. To this end, it will be useful to recall the concavity of $F_\beta$ for $\beta<1$; cf. (16). The advantage of linear constraints [cf. (45, 46)] is that the unknown probabilities are defined over a convex set. Eq. (16) means that for $\beta<1$ there can be only a single internal (with respect to the convex set) point where the gradient of $F_\beta$ vanishes, and this point is the global maximum of $F_\beta$.
IV Maximizing the generalized likelihood for $\beta<1$
IV.1 Known probabilities of the hidden variable
As the first exercise in maximizing $F_\beta$ for the present model, let us assume that the (prior) probabilities of the hidden variable are known. Hence
The Lagrange function reads:
where the second term contains the Lagrange multipliers of (47). Now the condition of vanishing gradient amounts to
The right-hand side of (49) is constant; hence so must be its left-hand side, which is only possible under
Once (50) solves (49), it is the global maximum of $F_\beta$, since the latter is concave. Recall that the frequencies entering (50) are generally the observed frequencies of (2). Though (50) may not be very useful by itself, it still shows that maximizing $F_\beta$ under (47) leads to a reasonable null model in a nonidentifiable situation. Imposing other constraints does lead to nontrivial predictions, as we now proceed to show.
IV.2 Known average
Let us turn to maximizing $F_\beta$ under constraint (46). The Lagrange function reads: