Statistical analysis of samples from mixtures of subpopulations is usually carried out by assuming that the underlying probability law is governed by a mixture model. There are many applications in which mixture models are a central part of the data analysis [2, 3, 4, 5]. An important one is population genetics, where mixed-population datasets are modeled by finite mixture models. A finite mixture model can be represented by
where and are, respectively, the probability law governing and the relative population size of the subpopulation. encapsulates all the latent parameters used to specify the model including s.
In a common setting, a mixture model with latent parameters is assumed to be the law governing a given dataset. The latent parameters are then estimated by methods such as maximum likelihood estimation. Before starting the parameter estimation, however, one needs to answer a fundamental question: is the model identifiable? Identifiability means that there is a one-to-one map between the latent parameters and the probability law, up to permutations of the subpopulations, which will be discussed later [6, 7, 8, 9].
In this paper, we consider finite mixture models with observed variables which have the conditional independence property. This means that variables are distributed in each mixture component according to a product probability measure, i.e., for any
(see Section 2 for more details). In addition, we assume that the observed variables are discrete random variables taking values in a finite set with cardinality. This finite set is also referred to as the state space in the literature. This class of mixture models is called finite mixtures of finite product measures. Throughout this paper, by mixture models we mean this class of distributions.
Studying identifiability of statistical models has a long history [6, 7, 8, 9]. For finite mixtures of finite product measures the problem has been studied in various articles (see for instance [10, 11]). It is well known that there are some Bernoulli Mixture Models (BMMs) which are not identifiable [12, 13]. We note that, as mentioned in , only counting the dimensionality of the set of the latent parameters and then comparing it to the dimension of the mixture distribution is not sufficient to establish an identifiability argument. A novel method has been used in , based on the seminal result of Kruskal [14, 15], where the authors show that the parameters of a finite mixture of finite product measures are generically identifiable if . This means that if , then the measure of the parameters that make the problem non-identifiable is zero.
In many applications, such as population genetics , after modeling the drawn dataset by mixture models, the next step is to cluster the given dataset into subpopulations (called population stratification in the population genetics literature). The key observation is that many observed variables are not useful for clustering, which means that their probability mass functions (pmfs) are exactly the same in the different mixture components. This means that the set of possible parameters for this type of mixture model has measure zero. On the other hand, a small fraction of the observed variables are useful for clustering, which means that their pmfs are separated enough to make reliable clustering possible . A fundamental (and also theoretical) question is: how many separable variables are required for a mixture model to be identifiable? We note that, in this case, we cannot use the result of generic identifiability, because the parameters lie in a set of measure zero. In this paper, we establish connections between the identifiability of the parameters of a mixture model and the separability of its variables. To this end, two notions of separability are introduced: strongly and weakly separable variables, which are described in the next two paragraphs. We note that our study can be thought of as a worst-case analysis of the identifiability of finite mixtures of finite product measures.
In a strongly separable variable, the pmfs of the variable in the mixture components are strictly different from each other, for each element of the state space. This is a natural definition for good variables, as strongly separable variables are essentially the ones that are useful for clustering datasets drawn from mixture models. In , a mathematical analysis of the problem of reliable clustering of datasets drawn from mixture models, based on the separability of mixture components, is presented. However, the identifiability of mixture models based on separability conditions is not explored in . We establish a condition on the number of strongly separable variables that results in the identifiability of the parameters of the corresponding mixture model. In particular, we find a sharp threshold such that identifiability is guaranteed if the number of strongly separable variables is greater than or equal to it. The threshold is , where is the number of mixture components. We note that this threshold is independent of , the size of the state space of the variables, which is in contrast to the result of generic identifiability. We also show that this threshold is tight by introducing a family of non-identifiable mixture models with fewer than strongly separable variables.
We note that for a strongly separable variable, the number of conditions that need to be satisfied scales with . To study the effect of this scaling on identifiability, we introduce weakly separable variables. In a weakly separable variable, the pmfs of the variable in the mixture components are strictly different from each other for at least one (not necessarily all) state space element. This means that for any weakly separable variable, the number of required conditions does not scale with . Observe that, by definition, any strongly separable variable is also weakly separable. Therefore, one would naturally expect the number of weakly separable variables required for identifiability to be much larger than , because order-wise fewer conditions are imposed. However, we prove that if a mixture model has at least weakly separable variables, then it is identifiable. This means that the penalty of considering weakly separable variables instead of strongly separable variables is at most one.
We notice that the threshold of is also observed in identifiability results for other problems, such as Latent Block Models (LBMs) , binomial mixture models , mixture models from grouped samples [19, 20], and the topic modeling problem . However, the model of this paper is essentially different from them, and their proof ideas are also very different from what we develop in this paper. For the case of binomial mixture models, the result of follows as a special case of our results, when the mixture components have i.i.d. variables.
To prove sufficiency for both strongly and weakly separable variables, we introduce a multi-variable polynomial, called the characteristic polynomial, that represents a mixture model. Then, we prove that the identifiability of the mixture model is equivalent to the identifiability of its characteristic polynomial in the space of polynomials. This allows us to exploit properties of polynomials to prove the identifiability argument. For the converse, we introduce a family of non-identifiable mixture models that have fewer than strongly separable variables for arbitrary and .
The rest of the paper is organized as follows. In Section 2, we define the problem. In Section 3, we describe the main results of the paper. Finally, Sections 4 and 5 contain the proofs of the main theorems of the paper.
In this paper all vectors are columnar and they are denoted by bold letters, like. For any positive integer , we define . The transpose operator is denoted by . The polynomial identity is denoted by which is used when two polynomials have the same coefficients. We use the notation for the Kronecker product of two vectors and . For any vector , we define . Let us define
Also denotes the closure of .
2 Problem Statement
We consider a finite mixture model with mixture components and observed variables where each observed variable takes its value in a finite set (finite state space) with cardinality . For each mixture component, there is a generative model according to which the observed variables attain their realizations. In this model, the variable in the mixture component is generated from a finite probability measure denoted by (we call the frequency vector of the variable in the mixture component). We assume the conditional independence structure in the mixture model, which means that the observed variables are independent in each mixture component. Let us denote the frequency of the variable, in the mixture component, for any , by , which is also denoted by . We denote the collection of the frequencies for the mixture component by a matrix . Let denote the collection of the frequencies in the mixture model in a specific order.
In our model, each data instance is generated from one of the mixture components according to a sampling distribution with pmf . Let us denote the latent parameters of the mixture model by . We also denote the set of all possible latent parameters of the mixture models (as defined above) by . Also let denote the closure of . Here, (footnote 1: the main difference between and is that in the former the probability of sampling from each mixture component is positive, whereas in the latter it may be zero).
In this paper, our objective is to recover the latent parameters from the mixture distribution. The formal definition of the mixture distribution is as follows.
The mixture distribution of a mixture model with latent parameters is defined as
In other words, for any , we define
Here is the vector containing all for any in a specific order.
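As a minimal sketch of the definition above, the mixture distribution of a finite mixture of finite product measures can be tabulated by enumerating every joint outcome. The names `weights` (sampling probabilities of the components) and `freqs[k][i][s]` (frequency of state `s` for variable `i` in component `k`) are hypothetical and only stand in for the latent parameters; the numbers are made up for illustration.

```python
from fractions import Fraction
from itertools import product


def mixture_distribution(weights, freqs):
    """Return {x: P(x)} over all joint outcomes x of the observed variables.

    weights[k]     : probability of sampling from component k (assumed name)
    freqs[k][i][s] : probability that variable i equals state s in component k
    """
    n = len(freqs[0])      # number of observed variables
    l = len(freqs[0][0])   # size of the common state space
    dist = {}
    for x in product(range(l), repeat=n):
        total = 0
        for w, comp in zip(weights, freqs):
            term = w
            for i, s in enumerate(x):
                term *= comp[i][s]  # conditional independence within a component
            total += term
        dist[x] = total
    return dist


# Illustrative parameters: 2 components, 2 binary variables.
weights = [Fraction(1, 3), Fraction(2, 3)]
freqs = [
    [[Fraction(1, 2), Fraction(1, 2)], [Fraction(1, 4), Fraction(3, 4)]],
    [[Fraction(1, 5), Fraction(4, 5)], [Fraction(2, 3), Fraction(1, 3)]],
]
dist = mixture_distribution(weights, freqs)
```

Exact rational arithmetic is used so that two mixture distributions can later be compared for equality without floating-point tolerance.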
In the following definitions, we define the notion of the identifiability of a mixture model.
For any and any permutation on , we define as follows.
For any , if there exists a permutation on such that , then we write .
A mixture model with latent parameters and mixture distribution is said to be identifiable, iff for any with mixture distribution , such that (they have the same mixture distributions), we have . In other words, a mixture model is identifiable, iff its latent parameters can be identified from its mixture distribution , up to a label swapping.
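The role of label swapping in this definition can be illustrated with a short sketch: relabeling the components changes the latent parameters but not the mixture distribution, which is why identifiability can only hold up to a permutation. The helper names (`relabel`, `same_up_to_relabeling`) and the parameter layout are assumptions made for this example.

```python
from fractions import Fraction
from itertools import permutations, product


def mixture_distribution(weights, freqs):
    """Tabulate P(x) for all x; freqs[k][i][s] is an assumed parameter layout."""
    n, l = len(freqs[0]), len(freqs[0][0])
    dist = {}
    for x in product(range(l), repeat=n):
        total = 0
        for w, comp in zip(weights, freqs):
            term = w
            for i, s in enumerate(x):
                term *= comp[i][s]
            total += term
        dist[x] = total
    return dist


def relabel(weights, freqs, perm):
    """Apply a permutation of the component labels to the latent parameters."""
    return [weights[j] for j in perm], [freqs[j] for j in perm]


def same_up_to_relabeling(wa, fa, wb, fb):
    """True iff (wb, fb) is obtained from (wa, fa) by permuting components."""
    if len(wa) != len(wb):
        return False
    return any(
        relabel(wa, fa, p) == (list(wb), list(fb))
        for p in permutations(range(len(wa)))
    )


weights = [Fraction(1, 3), Fraction(2, 3)]
freqs = [
    [[Fraction(1, 2), Fraction(1, 2)], [Fraction(1, 4), Fraction(3, 4)]],
    [[Fraction(1, 5), Fraction(4, 5)], [Fraction(2, 3), Fraction(1, 3)]],
]
swapped_w, swapped_f = relabel(weights, freqs, (1, 0))
```

Here `swapped_w, swapped_f` differs from `weights, freqs` as a list of parameters, yet both produce the same table of probabilities.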
For the identifiability of a mixture model, it is essential that the mixture components are separated in some sense. In this paper, we first define the notion of strongly separable variables as a measure of the difference among mixture components. In our definition, a variable is said to be strongly separable if and only if its frequencies are different for any two distinct mixture components and for every state space element. In what follows, we give the formal definition of the strongly separable variables of a mixture model.
Consider a mixture model with latent parameters . In this mixture model, the variable is said to be strongly separable, iff for any and any distinct , we have . In other words, the variable is strongly separable, iff for any we have . Also the number of strongly separable variables of a mixture model with latent parameters is denoted by .
We note that strongly separable variables are global. In other words, in a strongly separable variable, every pair of mixture components has different frequencies.
We note that the number of conditions that must be satisfied for the strong separability of a variable scales with the state space size . For studying the effect of this scaling on identifiability, we define weakly separable variables in this paper. The definition of the weakly separable variables is as follows.
Consider a mixture model with latent parameters . In this mixture model, the variable is said to be weakly separable, iff there is an , such that for any distinct we have . (Footnote 2: Note that in this case, for any distinct , we have .) Also, the number of weakly separable variables of a mixture model with latent parameters is denoted by .
Observe that any strongly separable variable is also weakly separable. This means that we have the inequality for any latent parameters . In particular, for binary state spaces the definitions of weakly separable and strongly separable variables coincide, resulting in for any .
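The two definitions can be checked mechanically. The following sketch counts strongly and weakly separable variables for a given table of frequencies; the function names and the layout `freqs[k][i][s]` are assumptions introduced for illustration, and the example values are made up.

```python
from fractions import Fraction as F
from itertools import combinations


def separates(freqs, i, s):
    """True iff state s of variable i has pairwise distinct frequencies across components."""
    K = len(freqs)
    return all(freqs[a][i][s] != freqs[b][i][s] for a, b in combinations(range(K), 2))


def is_strongly_separable(freqs, i):
    # Every state must separate every pair of components.
    return all(separates(freqs, i, s) for s in range(len(freqs[0][i])))


def is_weakly_separable(freqs, i):
    # At least one state must separate every pair of components.
    return any(separates(freqs, i, s) for s in range(len(freqs[0][i])))


def count_separable(freqs):
    """Return (number of strongly separable, number of weakly separable) variables."""
    n = len(freqs[0])
    strong = sum(is_strongly_separable(freqs, i) for i in range(n))
    weak = sum(is_weakly_separable(freqs, i) for i in range(n))
    return strong, weak


# 2 components, 3 variables, 3 states each (illustrative values).
freqs = [
    [[F(1, 5), F(3, 10), F(1, 2)], [F(1, 5), F(3, 10), F(1, 2)], [F(1, 3), F(1, 3), F(1, 3)]],
    [[F(1, 10), F(1, 2), F(2, 5)], [F(1, 5), F(1, 2), F(3, 10)], [F(1, 3), F(1, 3), F(1, 3)]],
]
strong, weak = count_separable(freqs)
```

In this example the first variable is strongly separable, the second is only weakly separable (its first state has the same frequency in both components), and the third is not separable at all, so the weak count exceeds the strong count, consistent with the inequality above.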
In the next section, we analyze the problem of identifiability of mixture models and its relation to the number of separable variables (strong or weak). In particular, we show that there is a sharp threshold on the number of separable variables (strong or weak) that implies the identifiability of the corresponding mixture model.
3 Main Results
The main result of this paper for the strongly separable variables is summarized in the following theorem.
Theorem 1 (Necessary and sufficient condition for the identifiability based on the strongly separable variables). Any mixture model with latent parameters , such that , is identifiable. Conversely, for any , there is a mixture model with latent parameters , which has strongly separable variables and is not identifiable.
The theorem states that if then the identifiability of the parameters from the mixture distribution is guaranteed. On the other hand, if there is no guarantee that the problem is identifiable. Hence, in the worst-case analysis, the identifiability is possible if and only if we have at least strongly separable variables.
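To make the lack of a guarantee concrete, the sketch below exhibits two sets of latent parameters with two components and two binary variables, both strongly separable, that are not relabelings of each other yet induce exactly the same mixture distribution. The numbers are chosen by hand for illustration and are not taken from the construction in Section 4.3; the helper names are hypothetical.

```python
from fractions import Fraction as F
from itertools import permutations, product


def mixture_distribution(weights, freqs):
    """Tabulate P(x) for all x; freqs[k][i][s] is an assumed parameter layout."""
    n, l = len(freqs[0]), len(freqs[0][0])
    dist = {}
    for x in product(range(l), repeat=n):
        total = 0
        for w, comp in zip(weights, freqs):
            term = w
            for i, s in enumerate(x):
                term *= comp[i][s]
            total += term
        dist[x] = total
    return dist


def same_up_to_relabeling(wa, fa, wb, fb):
    """True iff the two parameter sets agree after some permutation of components."""
    K = len(wa)
    return any(
        all(wa[p[k]] == wb[k] and fa[p[k]] == fb[k] for k in range(K))
        for p in permutations(range(K))
    )


w = [F(1, 2), F(1, 2)]
# Each inner pair is [P(variable = 0), P(variable = 1)] within one component.
freqs_a = [
    [[F(4, 5), F(1, 5)], [F(4, 5), F(1, 5)]],
    [[F(1, 5), F(4, 5)], [F(1, 5), F(4, 5)]],
]
freqs_b = [
    [[F(9, 10), F(1, 10)], [F(29, 40), F(11, 40)]],
    [[F(1, 10), F(9, 10)], [F(11, 40), F(29, 40)]],
]
dist_a = mixture_distribution(w, freqs_a)
dist_b = mixture_distribution(w, freqs_b)
```

Both models have the same single-variable marginals (each equal to 1/2) and the same joint probability of observing (1, 1), namely 17/50, which fixes the whole distribution on two binary variables.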
Our threshold depends only on the number of mixture components , and it is independent of . One might naturally expect the threshold to vary with the size of the state space, as in the generic identifiability result, which is . Also, our threshold is , while in generic identifiability the threshold is . This means that, by relying on strongly separable variables, order-wise more variables are required in comparison with generic identifiability.
For the proof of the theorem, we first introduce a multi-variable polynomial, called the characteristic polynomial, which is built from the parameters of the problem. We show that the identifiability of a mixture model follows from the identifiability of its characteristic polynomial in the class of polynomials. Then, by exploiting properties of polynomials, we prove the sufficiency part of the theorem. For the necessity part, we introduce a family of non-identifiable mixture models that shows the necessity of the condition in the worst-case regime.
The result for the weakly separable variables is also provided in the next theorem.
Theorem 2 (Sufficient condition for the identifiability based on the weakly separable variables). Any mixture model with latent parameters such that is identifiable.
The theorem states that, although weakly separable variables satisfy order-wise fewer conditions, it suffices to have just one extra variable in comparison with the strong separability measure.
It is easy to see that in a more general model, if the observed variables have state spaces with (possibly) different sizes denoted by , then the notion of weakly separable variables can be defined similarly, and the sufficiency of weakly separable variables for identifiability still holds. To see this, it suffices to set and then consider each observed variable as an instance with a state space of cardinality , by setting frequencies to be zero in the variable.
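The padding argument can be sketched as follows, assuming the common cardinality is taken to be the largest of the individual state space sizes. The function name and the layout `freqs[k][i]` (a frequency vector whose length may depend on the variable `i`) are hypothetical.

```python
from fractions import Fraction as F


def pad_to_common_state_space(freqs):
    """Extend every frequency vector with zeros up to the largest state space size."""
    l = max(len(vec) for comp in freqs for vec in comp)
    return [[list(vec) + [0] * (l - len(vec)) for vec in comp] for comp in freqs]


# 2 components; variable 0 has 2 states, variable 1 has 3 states (illustrative values).
freqs = [
    [[F(1, 4), F(3, 4)], [F(1, 2), F(1, 3), F(1, 6)]],
    [[F(2, 3), F(1, 3)], [F(1, 5), F(2, 5), F(2, 5)]],
]
padded = pad_to_common_state_space(freqs)
```

Each padded vector still sums to one, and a state that separated the components before padding still separates them afterwards, so weak separability is unaffected. The added states have frequency zero in every component, which is why this reduction fits the weak notion rather than the strong one.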
The proofs of the theorems are available in the next two sections.
4 Proof of Theorem 1
The proof consists of three steps. First, we prove the sufficiency part of the theorem for the binary state space, i.e., the case . Then we extend the proof to non-binary state spaces. Finally, the necessity part of the theorem is proved by providing a class of non-identifiable mixture models. These three steps are presented in the next three subsections.
4.1 Proof of the sufficiency part of Theorem 1 for
In this part, we use the notation instead of . Note that we have in the binary case. Also, we use and instead of and , respectively. First we need to define the characteristic polynomial of the latent parameters.
The characteristic polynomial of latent parameters is an -variable polynomial that is defined as follows.
We also denote the characteristic polynomials by in a brief way.
The characteristic polynomial of the latent parameters plays an important role in our proofs. In particular, in the next lemma, we prove that the identifiability of a mixture model is equivalent to the identifiability of its characteristic polynomial.
Lemma 1. For any , the following propositions are equivalent.
(i) A mixture model with the latent parameters is identifiable.
(ii) For any satisfying the polynomial identity , we have .
Proof. See Appendix A. ∎
Lemma 1 shows that the identifiability of mixture models of products of Bernoulli measures is equivalent to the identifiability of a class of multi-variable polynomials. This connection makes it possible to prove an identifiability result in the class of multi-variable polynomials and use it to prove the identifiability of mixture models.
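The kind of correspondence behind this lemma can be illustrated numerically. The definition of the characteristic polynomial is not reproduced here, so the sketch below works with an assumed multilinear polynomial, the sum over components of the weight times the product of (1 + f x_i), where f is the probability that variable i equals 1 in that component. This form is an assumption for illustration and may differ from the paper's definition. Its coefficients are the joint moments of the binary mixture, and the mixture distribution can be recovered from them by inclusion-exclusion.

```python
from fractions import Fraction as F
from itertools import combinations, product
from math import prod


def coefficients(weights, f):
    """Coefficient of prod_{i in S} x_i in sum_k w_k prod_i (1 + f[k][i] x_i), for every subset S."""
    n = len(f[0])
    return {
        S: sum(w * prod(fk[i] for i in S) for w, fk in zip(weights, f))
        for r in range(n + 1)
        for S in combinations(range(n), r)
    }


def pmf_from_coefficients(coef, n):
    """Recover P(x) for binary x from the coefficients by inclusion-exclusion."""
    pmf = {}
    for x in product((0, 1), repeat=n):
        ones = [i for i in range(n) if x[i] == 1]
        zeros = [i for i in range(n) if x[i] == 0]
        total = 0
        for r in range(len(zeros) + 1):
            for U in combinations(zeros, r):
                total += (-1) ** r * coef[tuple(sorted(ones + list(U)))]
        pmf[x] = total
    return pmf


def pmf_direct(weights, f):
    """P(x) computed directly from the mixture of products of Bernoulli measures."""
    n = len(f[0])
    return {
        x: sum(
            w * prod(fk[i] if x[i] == 1 else 1 - fk[i] for i in range(n))
            for w, fk in zip(weights, f)
        )
        for x in product((0, 1), repeat=n)
    }


weights = [F(1, 4), F(3, 4)]
f = [[F(1, 2), F(1, 3), F(1, 5)], [F(1, 6), F(3, 4), F(2, 5)]]  # f[k][i] = P(variable i = 1 | component k)
coef = coefficients(weights, f)
```

Since the coefficients and the mixture distribution determine each other, two parameter sets have the same distribution exactly when they yield the same polynomial of this assumed form.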
Now, we state the following theorem, which concludes the proof of the sufficiency part of Theorem 1 for the binary case.
Theorem 3. Let with . Then, for any the identity implies .
To prove Theorem 3, we establish a stronger statement. We relax the condition in the definition of to and form a new set of latent parameters, denoted by . Note that we have . Also denotes the closure of . We now prove the statement of the theorem for the case and , which is stronger than the theorem. (Footnote 3: Note that all of the prior definitions for naturally extend to the elements of .) We use this modification because it allows us to establish an inductive proof.
The proof is based on an induction on . The case is trivial. Assume that the theorem is proved for any . We will prove the theorem for . Assume that for some and , we have the identity and also . We will show that . Note that we have
Without loss of generality, assume that the variable is strongly separable in . Letting in (1) yields
Note that the term in LHS of (1) becomes zero by choosing . Also notice that for any , because the variable is strongly separable. Now two cases may happen.
Case one. First assume that the summation on the RHS of (2) has fewer than non-zero terms, i.e., there is a , such that . (Footnote 4: Note that the LHS of (2) contains terms, which means that the number of distinct non-zero polynomials in the summation is . This is due to the strong separability of the variable .) Without loss of generality, assume that . In this case, we can use the induction hypothesis on the identity in (2). More precisely, since we have set , the identity (2) can be written as
We note that , since the variable has been assumed to be strongly separable and . Now we consider the two sets of latent parameters corresponding to the two sides of (3). Observe that the number of strongly separable variables in the problem corresponding to the LHS of (3) is at least , which is greater than . This shows that we can use the induction hypothesis in this case.
Hence, by using the induction hypothesis, we conclude that there is a permutation on , such that for any and , we have . Now let be some strongly separable variables of . The existence of them is guaranteed by the assumption of the induction. Now if we let for any in (1), we have
We notice that the above polynomial identity holds due to the fact that each term in the summation in the LHS of (1), which corresponds to some , becomes zero, since we have set . Also, for the RHS of (1), the term in the summation, for any becomes zero, due to the fact that we have set in (1), where .
Since the variable has been assumed to be strongly separable in , for any , we conclude that . This shows that (4) is a non-zero polynomial. Hence by the identity of two (non-zero) polynomials in (4), we conclude that for any and also .
A similar argument can be established to show that we have for any . Combining the results shows that for any , since we have
By applying the induction hypothesis to (6), we conclude that there is a permutation on with the following property: for any and , we have and . (Footnote 5: In fact, the permutation is equal to the permutation , which is defined previously. However, we do not use this identity, so there is no need to prove it.) Combining the results shows that for the permutation on , which is defined as
we have . This shows that and completes the proof.
Case two. Now assume that the summation on the RHS of (2) has exactly non-zero terms. We notice that the proof of case one does not depend on the location of the first chosen strongly separable variable. In other words, if for some and some , where is the location of a strongly separable variable in , it is possible to set in (1) such that the assumption of case one holds, then the proof is complete. Hence, we assume that for any , which is the location of a strongly separable variable of , and for any , if we set in (1), then the RHS of the result has exactly non-zero terms, i.e., for any and for any . Denote the locations of some strongly separable variables in by . Note that we still assume that the variable is strongly separable. Without loss of generality, assume that . Following these assumptions, we set in (2) and conclude that
where and for any . Note that, because of the considerations discussed above, both sides of (7) have exactly terms, i.e., for any and for any . Also, the number of the remaining strongly separable variables of the problem corresponding to the LHS of (7) is at least . Hence, using the induction hypothesis, we conclude that there is a bijection such that for any and we have . Observe that because , we have . This shows that for any . Note that this contradicts the assumption made in case two, since now if we set in (1), then the RHS of the result has fewer than terms. This shows that case one must always occur, which completes the proof.
4.2 Proof of the sufficiency part of Theorem 1 for
In this part, we focus on state spaces larger than binary. The main idea of the proof is to introduce auxiliary binary mixture models, generated by projecting the variables onto binary spaces, so that the result of Theorem 3 can be applied.
Consider latent parameters , where and . Also assume that and . Note that are the mixture distributions of the problems. We want to show that . We introduce two auxiliary binary mixture models and as follows. For any and , let and . Also let and . Note that, by the definition of strongly separable variables, they are preserved by this transformation. In other words, if the variable is strongly separable in , then it is strongly separable in , too. Hence, we have .
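The idea of projecting onto binary spaces can be sketched as follows, under the assumption that each variable is replaced by the indicator of one chosen state; the specific projection used in the proof may differ. The sketch checks that the binary model built from the projected frequencies has exactly the distribution obtained by pushing the original mixture distribution through the same indicator map. The names `project_to_binary`, `pushforward`, and `chosen` are hypothetical.

```python
from fractions import Fraction as F
from itertools import product


def mixture_distribution(weights, freqs):
    """Tabulate P(x) for all x; freqs[k][i][s] is an assumed parameter layout."""
    n, l = len(freqs[0]), len(freqs[0][0])
    dist = {}
    for x in product(range(l), repeat=n):
        total = 0
        for w, comp in zip(weights, freqs):
            term = w
            for i, s in enumerate(x):
                term *= comp[i][s]
            total += term
        dist[x] = total
    return dist


def project_to_binary(freqs, chosen):
    """Binary frequencies: state 1 means 'variable i equals chosen[i]', state 0 means it does not."""
    return [
        [[1 - comp[i][chosen[i]], comp[i][chosen[i]]] for i in range(len(comp))]
        for comp in freqs
    ]


def pushforward(dist, chosen):
    """Distribution of the indicator vector (1[x_i == chosen[i]])_i under dist."""
    out = {}
    for x, p in dist.items():
        y = tuple(int(x[i] == chosen[i]) for i in range(len(x)))
        out[y] = out.get(y, 0) + p
    return out


weights = [F(1, 4), F(3, 4)]
freqs = [
    [[F(1, 2), F(1, 3), F(1, 6)], [F(1, 5), F(2, 5), F(2, 5)]],
    [[F(1, 6), F(1, 2), F(1, 3)], [F(3, 5), F(1, 5), F(1, 5)]],
]
chosen = [0, 2]  # state kept for each variable (illustrative choice)
binary_freqs = project_to_binary(freqs, chosen)
```

If a variable is strongly separable in the original model, its chosen state has distinct frequencies across components, so the projected binary variable is strongly separable as well.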
See appendix B. ∎
Using Lemma 2 and Theorem 3, we conclude that . This implies that there is a permutation on such that . This shows that for any and any we have . Using the definitions of the auxiliary problems, we conclude that for any and . It is also concluded that for any .
Now we claim that and this completes the proof. For this purpose, we need to show that for any , and we have . Using the symmetry in the problem, it suffices to prove that .
Again, we introduce two auxiliary binary mixture models and as follows. For any and , let and . For any , let and . Also, let and for any and . Similarly, we have .
See appendix C. ∎
Using Lemma 3 and Theorem 3, we conclude that . This yields that there is a permutation on , such that we have . Hence, for any and any , we have . Thus, we conclude that for any and . Choose such that the variable is strongly separable in . This is possible due to the assumption of the existence of at least strongly separable variables in . (Footnote 6: For the case , this assumption may be incorrect. However, the proof for the case is trivial and can be done directly.) We claim that . Note that for any we have . Note that because of the strong separability of the variable, the set has exactly elements. This shows that or for any . Hence we have .
Now using , we conclude that . It is also assumed that and . Hence, we have . Similarly, we can conclude that and hence, the proof is completed.
4.3 Proof of the necessary part of Theorem 1
In this part, for any we introduce a problem such that