# Maximum Likelihood Estimation of a Semiparametric Two-component Mixture Model using Log-concave Approximation

Motivated by studies in biological sciences to detect differentially expressed genes, a semiparametric two-component mixture model with one known component is being studied in this paper. Assuming the density of the unknown component to be log-concave, which contains a very broad family of densities, we develop a semiparametric maximum likelihood estimator and propose an EM algorithm to compute it. Our new estimation method finds the mixing proportions and the distribution of the unknown component simultaneously. We establish the identifiability of the proposed semiparametric mixture model and prove the existence and consistency of the proposed estimators. We further compare our estimator with several existing estimators through simulation studies and apply our method to two real data sets from biological sciences and astronomy.


## 1 Introduction

In this paper, we consider the following two-component mixture model,

 g(x)=(1−p)f0(x)+pf(x), (1.1)

where the probability density function (pdf) f0 is known, whereas the mixing proportion p and the pdf f are unknown. Model (1.1) is motivated by studies in the biological sciences to cluster differentially expressed genes in microarray data (Bordes et al., 2006). Typically we build a test statistic, say ti, for each gene i. Under the null hypothesis, which presumes no difference in expression levels under two or more conditions, ti is assumed to have a known distribution (in general Student's t or Fisher's F). Under the alternative hypothesis, the distribution is unknown. Thus, the distribution of the test statistics is modelled by (1.1), where p is the proportion of non-null statistics. The estimation of p and the pdf f tells us the probability Pi that gene i is differentially expressed given ti:

 Pi = p f(ti) / ((1−p) f0(ti) + p f(ti)).
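As an illustration, once p, f0 and f are available, the posterior probability above is a direct computation. The following Python sketch uses hypothetical normal components (not from the paper) via scipy:

```python
from scipy.stats import norm

def posterior_prob(t, p, f0_pdf, f_pdf):
    # P_i: posterior probability that the statistic t comes from the
    # unknown component f in the mixture g = (1-p) f0 + p f
    num = p * f_pdf(t)
    return num / ((1 - p) * f0_pdf(t) + num)

# hypothetical example: null N(0,1), alternative N(3,1), p = 0.2
P = posterior_prob(3.0, 0.2, norm(0, 1).pdf, norm(3, 1).pdf)
```

With these stand-in components, a statistic near the alternative mean yields a posterior probability close to one.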

Bordes et al. (2006) considered model (1.1) where f is assumed to be symmetric. They obtained some identifiability results under moment and symmetry conditions, and proposed to estimate the model under the symmetry of f. In addition, they proved the consistency of their estimator under mild conditions.

Song et al. (2010) considered another special case,

 g(x) = (1−p)ϕσ(x) + p f(x),

where ϕσ is a normal density with mean 0 and unknown standard deviation σ. This model was inspired by sequential clustering (Song & Nicolae, 2009), which finds candidates for centers of clusters first, then carries out a local search to find the objects that belong to those clusters, and finally selects the best cluster. Song et al. (2010) proposed an EM-type estimator and a maximizing-type estimator for their model, both of which can be easily extended to models where the known component is not normal.

A slightly different model is considered by Xiang et al. (2014),

 g(x) = (1−p)f0(x; ξ) + p f(x−μ),

where ξ is a possibly unknown parameter of the known component, and μ is a non-null location parameter for f. They proposed a new effective estimator based on the minimum profile Hellinger distance (MPHD). They established the existence and uniqueness of their estimator and also proved its consistency under some regularity conditions. Their method actually does not require f to be symmetric and thus can be applied to a more general model. For some other alternative estimators, see, for example, Patra & Sen (2016) and Ma et al. (2015).

In this paper, we propose to estimate (1.1) using a new approach, by imposing a log-concave assumption on f, i.e. f = e^ϕ with ϕ ∈ Φ; here Φ denotes the family of concave functions on R^d which are upper semicontinuous and coercive in the sense that ϕ(x) → −∞ as ||x|| → ∞. Note that ϕ needs to be coercive in order for e^ϕ to be a density function. Many common parametric families of distributions belong to the family of log-concave densities, for example, the normal distribution, exponential distribution, logistic distribution, etc. We propose to estimate the new model by maximizing a semiparametric mixture likelihood. Compared to the kernel density estimation of f used by many existing methods (Bordes et al., 2006; Xiang et al., 2014; Ma et al., 2015), the new method does not require the choice of one or more bandwidths (Samworth (2017)). We establish the identifiability of the proposed semiparametric mixture model and prove the existence and consistency of the proposed estimators. We further compare our estimator with several existing estimators through simulation studies and apply our method to two real data sets from biological sciences and astronomy.

The rest of the paper is organized as follows. In Section 2 we discuss some identifiability issues for model (1.1). Section 3 introduces our maximum likelihood estimator, and a detailed EM-type algorithm is provided; existence and consistency properties of our estimator are established, with detailed proofs given in the Appendix (Section 7). Section 4 demonstrates the finite sample performance of our proposed estimator by comparing it with many other existing algorithms. Two real data applications are given in Section 5. Section 6 gives a brief discussion.

## 2 Identifiability

Note that the model (1.1) is non-identifiable without any constraint on the density f; see, e.g., Bordes et al. (2006) and Patra & Sen (2016). However, a parametric model for f might create biased or even misleading statistical inference when the model assumption is incorrect. In this paper, we assume f to be log-concave, i.e. f = e^ϕ, where ϕ is a concave function. Log-concave densities have attracted much attention in recent years, since the class is very flexible and can be estimated by a nonparametric maximum likelihood estimator without requiring the choice of any tuning parameter. For more details, see Cule et al. (2010a), Dümbgen et al. (2009), Walther et al. (2009), Dümbgen et al. (2011) and the review of the recent progress in log-concave density estimation by Samworth (2017).

###### Proposition 2.1.

Assume f0(x) > 0 in a neighborhood of a point a; then model (1.1) is identifiable if,

 lim_{x→a+} f(x)/f0(x) = 0 or lim_{x→a−} f(x)/f0(x) = 0.
###### Remark 2.2.

Proposition 2.1 also holds if a = +∞ or a = −∞, and this result is much more general (it requires a much weaker condition) than the result of Proposition 3(i) of Bordes et al. (2006).

###### Remark 2.3.

Proposition 2.1 guarantees that model (1.1) is identifiable if the support of f is strictly contained in the support of f0 and the two supports have different Lebesgue measure.

Now, for a log-concave density, it is easy to derive the following more specific result,

###### Proposition 2.4.

Assume f = e^ϕ and ϕ ∈ Φ. Model (1.1) is identifiable if log f0(x)/|x|^k → 0, for some k ∈ (0,1), as x → +∞ or x → −∞.

###### Example 2.1.

If f0 is the density of a t distribution with ν degrees of freedom, then model (1.1) is identifiable.

###### Proof.

Since f0 is the density of the t distribution with ν degrees of freedom, we have,

 log f0(x) = log Γ((ν+1)/2) − (1/2) log(νπ) − log Γ(ν/2) − ((ν+1)/2) log(1 + x²/ν).

Thus for any k ∈ (0,1), log f0(x)/|x|^k → 0, as x → +∞ or x → −∞. And by Proposition 2.4, we can conclude that model (1.1) is identifiable when f is log-concave. ∎
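As a numerical sanity check of this tail behavior, the following scipy sketch (with a standard normal standing in for a generic log-concave f, a choice of ours for illustration) shows that the ratio f(x)/f0(x) against a t null with 5 degrees of freedom vanishes in the tails, as the proof requires:

```python
from scipy.stats import norm, t

# log-concave f (normal here) vs heavy-tailed t_5 null f0:
# the ratio f/f0 should decrease to 0 as |x| grows
xs = [2.0, 5.0, 10.0, 20.0]
ratios = [norm.pdf(x) / t.pdf(x, df=5) for x in xs]
```

The polynomial tail of the t density always dominates the (at most) exponential tail of a log-concave density, so the ratio is strictly decreasing along `xs` and essentially zero at the last point.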

###### Remark 2.5.

Similarly, one can check that when f0 is the pdf of an F distribution, log-normal distribution, or Pareto distribution, then model (1.1) is identifiable under the condition that f is log-concave.

In general, we need f(x)/f0(x) → 0 as x → +∞ or x → −∞ to make model (1.1) identifiable. Since f0 is known, it is not difficult to give some sufficient conditions with respect to different f0's. Here we give two examples of easy-to-apply conditions.

###### Example 2.2.

Suppose f0 is the density of a normal distribution with mean μ and variance σ²; then model (1.1) is identifiable if limsup_{x→+∞} ϕ(x)/x² < −1/(2σ²), or limsup_{x→−∞} ϕ(x)/x² < −1/(2σ²), or the condition of Remark 2.3 holds.

###### Proof.

Suppose limsup_{x→+∞} ϕ(x)/x² < −1/(2σ²), or limsup_{x→−∞} ϕ(x)/x² < −1/(2σ²). We have,

 ϕ(x) − log f0(x) = ϕ(x) + log(√(2π)σ) + (x−μ)²/(2σ²)
  = x²(ϕ(x)/x² + log(√(2π)σ)/x² + (1/(2σ²))(1 − μ/x)²)
  → −∞, as x → +∞ or x → −∞.

Hence f(x)/f0(x) → 0 as x → +∞ or x → −∞, and Proposition 2.1 asserts the identifiability of model (1.1). ∎

###### Remark 2.6.

Under the constraints set by Example 2.2, Models 1, 4, and 5 from Section 4 are identifiable.

###### Example 2.3.

Suppose f0 is the density of an exponential distribution with rate λ; then model (1.1) is identifiable if limsup_{x→+∞} ϕ(x)/x < −λ, or the condition of Remark 2.3 holds.

###### Proof.

Suppose limsup_{x→+∞} ϕ(x)/x < −λ. We have,

 ϕ(x) − log f0(x) = ϕ(x) − log λ + λx
  = x(ϕ(x)/x − (log λ)/x + λ)
  → −∞, as x → +∞.

Hence f(x)/f0(x) → 0 as x → +∞, and again Proposition 2.1 ensures the identifiability of model (1.1). ∎

###### Remark 2.7.

Under the constraints set by Example 2.3, Model 3 from Section 4 is identifiable.

## 3 Maximum Likelihood Estimation

Suppose we have a random sample of n i.i.d. observations X1, …, Xn from the density g in (1.1), where f is a log-concave density, i.e., f = e^ϕ with ϕ ∈ Φ. Then with the empirical distribution Qn = (1/n)∑_{i=1}^n δ_{Xi}, where δ_x is the degenerate distribution at x, the log-likelihood of our random sample can be written as,

 L(p, ϕ, Qn) = n∫log(g) dQn = ∑_{i=1}^n log((1−p)f0(Xi) + p e^{ϕ(Xi)}),

subject to the condition that ∫e^{ϕ(x)} dx = 1. A natural approach to estimate p and ϕ is to find the maximizer of L(p, ϕ, Qn).

### 3.1 Algorithm

We propose to estimate p and ϕ by maximizing L(p, ϕ, Qn), using the EM algorithm that consists of iterating an E step and an M step until convergence:

"E step": Given p^(k) and f^(k) = e^{ϕ^(k)},

 ω_i^(k+1) = (1−p^(k)) f0(xi) / ((1−p^(k)) f0(xi) + p^(k) f^(k)(xi)),

"M step":

 p^(k+1) = (1/n) ∑_{i=1}^n (1 − ω_i^(k+1)),
 ϕ^(k+1) = argmax_{ϕ∈Φ1, ∫e^{ϕ(x)}dx=1} ∑_{i=1}^n (1 − ω_i^(k+1)) ϕ(xi),
 f^(k+1) = e^{ϕ^(k+1)}.

We find ϕ^(k+1) using an active set algorithm, which is described in Dümbgen et al. (2007) and implemented in the R package logcondens by Rufibach & Duembgen (2010). Throughout this paper, we use "EM_logconcave" to refer to our method. The following result establishes the monotonicity of the EM_logconcave algorithm.
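For intuition, the E step and the mixing-proportion update of the M step can be sketched in a few lines of Python. The log-concave update of ϕ is deliberately omitted, since in practice it is carried out by the weighted active set algorithm of the R package logcondens; the component densities below are hypothetical stand-ins, not from the paper:

```python
import numpy as np
from scipy.stats import norm

def em_step(x, p, f0_pdf, f_pdf):
    """One E step and the p-update of the M step for model (1.1).
    The log-concave refit of f is kept abstract here."""
    # E step: posterior weight that x_i comes from the known null f0
    w = (1 - p) * f0_pdf(x) / ((1 - p) * f0_pdf(x) + p * f_pdf(x))
    # M step (mixing proportion part)
    p_new = np.mean(1 - w)
    return w, p_new

# illustration with the unknown component held fixed at a stand-in N(3,1)
rng = np.random.default_rng(0)
lab = rng.random(2000) < 0.3
x = np.where(lab, rng.normal(3, 1, 2000), rng.normal(0, 1, 2000))
p = 0.5
for _ in range(100):
    _, p = em_step(x, p, norm(0, 1).pdf, norm(3, 1).pdf)
```

With both component densities held fixed at the truth, iterating the p-update alone drives p toward the true mixing proportion (0.3 in this synthetic example).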

###### Proposition 3.8.

Let l^(k) = L(p^(k), ϕ^(k), Qn); then

 l^(k+1) ≥ l^(k),

for any k ≥ 0.

### 3.2 Theoretical Properties

In general, for any distribution Q on R^d, we define,

 L(p, ϕ, Q) = ∫log((1−p)f0 + p e^ϕ) dQ.

For the existence of a maximizer of L(p, ϕ, Q), we follow the approach of Dümbgen et al. (2011). We define the convex support of Q as,

 csupp(Q) = ⋂{C : C ⊆ R^d closed and convex, Q(C) = 1}.
###### Theorem 3.9.

For a fixed distribution Q on R^d, assume there exists some integer k ≥ 1, such that,

 ∫||x||^k Q(dx) < ∞  and  interior(csupp(Q)) ≠ ∅.

Let ~Φd denote the family of concave, upper semicontinuous ϕ with ∫e^{ϕ(x)} dx ≤ 1 (see the proof in the Appendix); then

 L(Q) = sup_{p∈[0,1], ϕ∈~Φd} L(p, ϕ, Q)

is real. In that case, there exists,

 (p0, ϕ0) ∈ argmax_{p∈[0,1], ϕ∈~Φd} L(p, ϕ, Q).

Moreover,

 interior(csupp(Q)) ⊆ dom(ϕ0) = {x ∈ R^d : ϕ0(x) > −∞} ⊆ csupp(Q).

The proof of Theorem 3.9 is given in the Appendix (Section 7).

###### Example 3.1.

Assume Q represents the distribution of Model 1 from Section 4. For any integer k ≥ 1, the conditions of Theorem 3.9 are satisfied, and the class of log-concave densities contains all the pdfs of the normal distribution, logistic distribution, Laplace distribution, etc. Theorem 3.9 then implies that the maximum of L(p, ϕ, Q) exists over p ∈ [0,1] and ϕ ∈ ~Φd.

In general, the maximizer of L(p, ϕ, Q) is not unique. But if Q has density g0 = (1−p0)f0 + p0 e^{ϕ0}, where model (1.1) is identifiable, then (p0, ϕ0) is the unique maximizer. This is because, as noted by Dümbgen et al. (2011), if we have (p1, ϕ1) such that L(p1, ϕ1, Q) = L(p0, ϕ0, Q), then letting g1 = (1−p1)f0 + p1 e^{ϕ1},

 ∫log(g0(x)/g1(x)) g0(x) dx = 0.

Note the above integral is exactly the Kullback-Leibler divergence, which is nonnegative and equals 0 iff g0 = g1 almost everywhere. Thus (p1, ϕ1) = (p0, ϕ0), except that ϕ1 and ϕ0 may differ on a set of Lebesgue measure zero.

Next we establish the consistency of our maximum likelihood estimator. First, we introduce some notation,

 Qk = {Q on R^d : ∫||x||^k Q(dx) < ∞},
 Q0 = {Q on R^d : interior(csupp(Q)) ≠ ∅}.

In what follows, we consider the convergence of distributions under Mallows' distance (Mallows (1972)). Specifically, for two distributions Q, Q′,

 Dk(Q, Q′) = inf_{X∼Q, X′∼Q′} {E||X − X′||^k}^{1/k}.

It is known that Dk(Qn, Q) → 0 is equivalent to Qn → Q weakly together with ∫||x||^k Qn(dx) → ∫||x||^k Q(dx) (Bickel & Freedman (1981); Mallows (1972)). Here weak convergence means convergence in distribution.
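For distributions on the real line with k = 1, Mallows' distance coincides with the 1-Wasserstein distance, which scipy exposes directly; a quick sketch:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# D_1 between two empirical distributions on R equals the
# 1-Wasserstein distance between them
a = np.array([0.0, 1.0, 2.0])
b = np.array([1.0, 2.0, 3.0])  # b is a shifted by 1
d = wasserstein_distance(a, b)
```

Shifting an entire sample by 1 moves every unit of mass a distance of 1, so the distance here is exactly 1.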

Now we are ready to state our consistency theorem.

###### Theorem 3.10.

Assume (a) Q ∈ Qk ∩ Q0; and (b) for some fixed integer k ≥ 1, the unknown density f = e^ϕ satisfies the following condition: there exist a ∈ R and b > 0 such that ϕ(x) ≤ a − b||x||. Let (Qn) be a sequence of distributions in Qk ∩ Q0 such that Dk(Qn, Q) → 0 for some Q ∈ Qk ∩ Q0. Suppose f0 is upper semi-continuous and bounded. Then

 lim_{n→∞} L(Qn) = L(Q).

Assume there exist maximizers (pn, ϕn) of L(Qn) and a unique maximizer (p∗, ϕ∗) of L(Q). Let fn = e^{ϕn} and f∗ = e^{ϕ∗}; then

 lim_{n→∞} pn = p∗,
 lim_{n→∞, x→y} fn(x) = f∗(y),  ∀ y ∈ R^d ∖ ∂{f∗ > 0},
 limsup_{n→∞, x→y} fn(x) ≤ f∗(y),  ∀ y ∈ ∂{f∗ > 0},
 lim_{n→∞} ∫ |fn(x) − f∗(x)| dx = 0.

Practically, Qn will be the empirical distribution function, which automatically satisfies the above assumption.

## 4 Simulation

In this section, we investigate the finite sample performance of our algorithm and compare it to the estimator proposed by Patra & Sen (2016), the Symmetrization estimator by Bordes et al. (2006), the EM-type and maximizing-type estimators by Song et al. (2010), and the minimum profile Hellinger distance (MPHD) estimator by Xiang et al. (2014).

In order to test our method under different settings, we simulate samples of i.i.d. random variables with the common density g given by the following six models:

• Model 1: ,

• Model 2: ,

• Model 3: ,

• Model 4: ,

• Model 5: ,

• Model 6: .

For each sample we estimate p, the mean (μ) of the unknown component, and the classification error. The detailed calculation is explained below.

For our algorithm and the algorithm by Xiang et al. (2014), final estimators ^p and ^f are always produced; thus the estimated probability ^wi that the i-th observation is from the known component f0, given xi, can be calculated by

 ^wi = (1−^p) f0(xi) / ((1−^p) f0(xi) + ^p ^f(xi)).

For other methods, ^f may not always be given directly. As suggested by Song et al. (2010), we estimate wi by the following,

 ^wi = 2(1−^p) f0(xi) / ((1−^p) f0(xi) + ^h(xi)),

where ^h is the kernel density estimator of g with Gaussian kernel and Silverman's "rule of thumb" bandwidth (Silverman, 1986). Note that the algorithm proposed by Patra & Sen (2016) can actually estimate f when f is non-increasing. But we find that the algorithm works best when f0 and f have the same support, and it often produces unreliable estimates when the two supports differ from each other. Thus, we do not use it to estimate ^wi for Patra & Sen (2016)'s algorithm even when the true f does decrease on its support; instead, we follow Song et al. (2010)'s recommendation to get ^wi.

The algorithms by Xiang et al. (2014) and Bordes et al. (2006) give a final mean estimator ^μ directly. For other methods, after we get ^wi, we estimate μ by the following weighted sum,

 ^μ = ∑_{i=1}^n (1−^wi) Xi / ∑_{i=1}^n (1−^wi).
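A minimal sketch of this weighted-sum estimator (function and variable names are ours, for illustration):

```python
import numpy as np

def weighted_mean(x, w_hat):
    # (1 - w_hat_i) estimates the probability that x_i comes from the
    # unknown component, so mu is estimated by a weighted average
    x = np.asarray(x, dtype=float)
    w = np.asarray(w_hat, dtype=float)
    return np.sum((1 - w) * x) / np.sum(1 - w)
```

An observation with ^wi = 1 (surely from the known component) contributes nothing to the estimate, while one with ^wi = 0 contributes with full weight.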

Last, we report the classification error (Cla_error) based on ^wi as the mean squared error between ^wi and the true wi, i.e.,

 Cla_error = (1/n) ∑_{i=1}^n (^wi − wi)²,

where wi = 1 if Xi is from the known component f0 and wi = 0 if Xi is from the unknown component f.
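The classification error is a one-line computation; a small sketch (names are ours):

```python
import numpy as np

def cla_error(w_hat, w_true):
    # mean squared error between estimated posterior weights and
    # the true 0/1 component labels
    w_hat = np.asarray(w_hat, dtype=float)
    w_true = np.asarray(w_true, dtype=float)
    return np.mean((w_hat - w_true) ** 2)
```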

For Model 1, Table 1 reports the bias and MSE of the estimates of p, the bias and MSE of the estimates of μ, and the mean classification error of the different methods over repeated samples. Similar reports for the other models can be found in Tables 2-6. Simulation results for additional sample sizes are reported in the Appendix (Section 7). We report the results of Bordes et al. (2006)'s algorithm only for Models 1, 2 and 6, since this method fails for the other models, in which the true f's are not symmetric on their supports.

All the simulation results strongly suggest that our method is very competitive and often outperforms all other methods. Moreover, our method is even more favorable when the sample size gets larger.

To better display our simulation results, we also plot the MSE of the point estimates of p and μ against the different models for all the methods mentioned above, except for the method by Bordes et al. (2006), as it fails to estimate p and μ for half of the models discussed here. Figure 1 shows that the curve representing our method always lies at the bottom, which demonstrates the effectiveness of our algorithm, while the maximizing-type estimator by Song et al. (2010) gives the worst results in terms of MSE.

## 5 Real Data Application

### 5.1 Prostate Data

In this section we consider the prostate data of Efron (2012), consisting of genetic expression levels related to prostate cancer patients. The data set is a 6033 × 102 matrix, with entry xij the expression level for gene i on patient j, i = 1, …, 6033, j = 1, …, 102. Among the 102 patients, 50 of them are normal control subjects and 52 of them are prostate cancer patients. The goal of the study is to discover the potential genes that are differentially expressed between normal control and prostate cancer patients.

A two-sample t-test is performed to test the significance of each gene,

 ti = (x̄i(1) − x̄i(2)) / si,

where x̄i(1) and x̄i(2) are the sample means of gene i over the two groups, and si is an estimate of the standard error of their difference. These two-sided t-tests produce p-values, and the distribution of these p-values under the null hypothesis (i.e., gene i is not differentially expressed) has a uniform density, while under the alternative hypothesis (i.e., gene i is differentially expressed) it has a non-increasing density.
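Converting t statistics to two-sided p-values is straightforward with scipy; the choice df = 100 below, which would correspond to a pooled two-sample test with 102 subjects, is our assumption for illustration:

```python
from scipy.stats import t as t_dist

def two_sided_pvalue(t_stat, df):
    # p-value of a two-sided t-test; uniformly distributed on [0,1]
    # when the null hypothesis holds
    return 2 * t_dist.sf(abs(t_stat), df)
```

A statistic of 0 gives a p-value of 1, while large |t| gives a p-value near 0, which is why non-null genes pile up near zero and produce the non-increasing alternative density.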

The estimates of p are reported in Table 7. We can see that the Symmetrization estimate by Bordes et al. (2006) and the maximizing-type estimate by Song et al. (2010) are relatively big. The estimation procedure by Bordes et al. (2006) assumes the density function under the alternative hypothesis to be symmetric, while in our example this density is non-increasing, which implies the violation of their symmetry assumption. It is known that the maximizing-type estimator by Song et al. (2010) tends to overestimate p, which can also be seen in Table 7. We also want to point out that several approaches have been proposed by Efron (2012) to estimate p as well, including an estimator based on the central matching method (please see Efron (2012) and Efron et al. (2007) for detailed descriptions of those estimators), and Table 7 shows that our estimator gives the closest value to Efron's result.

Figure 2 shows that our estimate of the density f under the alternative tends to have a much smaller support compared to the one given by Patra & Sen (2016). Again, as we noted before, the method by Patra & Sen (2016) assumes that f is decreasing on the whole support of f0. In reality, smaller p-values tend to indicate the alternative hypothesis, hence it actually makes sense that the support of f for this prostate data may be much smaller than that of f0. The estimate produced by Bordes et al. (2006) is not very reliable here, since the density f is not symmetric; for this example, an alternative is to apply Bordes et al. (2006)'s method to the original statistics ti directly.

### 5.2 Carina Data

Carina is one of the seven dwarf spheroidal (dSph) satellite galaxies of the Milky Way. Here we consider the data consisting of radial velocities (RV) of stars from the Carina galaxy, obtained with the Magellan and MMT telescopes (Walker et al. (2007)). The stars of the Milky Way contribute contamination to this data set. We assume the distribution of the RV of Milky Way stars is known from the Besançon Milky Way model (Robin et al. (2003)). Now we would like to analyze this data set to better understand the mixture distribution of the RV of stars in the Carina galaxy.

The estimates of p are reported in Table 8. Again we see that Song et al. (2010)'s maximizing-type estimator gives a relatively big estimate. The other estimates are relatively close to each other.

Next, in Figure 3 we plot the histogram of the RV data overlaid with our estimated two components of the mixture density, and we can see that our estimate approximates the real data fairly well. The component corresponding to the stars of Carina looks very symmetric, and in fact astronomers usually assume the distribution to be Gaussian; this is also why the density estimation proposed by Patra & Sen (2016), which requires a non-increasing density, does not work here.

## 6 Discussion

In this paper we study the two-component mixture model with one component completely known. A nonparametric maximum likelihood estimator is developed via an EM algorithm and log-concave approximation. Unlike most existing estimation procedures, our new method finds the mixing proportion and the distribution of the unknown component simultaneously, without any selection of a tuning parameter, and the proposed EM algorithm satisfies the non-decreasing property of the traditional EM algorithm. Simulation results show that our method compares favorably with many other competing estimation methods.

We are able to prove the existence and consistency of our maximum likelihood estimator for a general distribution Q. But we do require some extra conditions on f0 and f. A possible future research direction is to ease these assumptions and make the theory more general. In addition, it would also be of interest to apply our method to a more general model where the known component also contains some unknown parameters.

## 7 Appendix

### 7.1 Theoretical Proof

###### Proof of Proposition 2.1.

Following Patra & Sen (2016), define

 p0 = p{1 − essinf f/f0},

where essinf h = sup{c ∈ R : Leb({x : h(x) < c}) = 0}, and Leb represents the Lebesgue measure. Now if essinf f/f0 > 0, there must exist some c > 0 such that f(x) ≥ c f0(x) almost everywhere, which contradicts the fact that f(x)/f0(x) → 0 as x → a+ or x → a−. Hence we can conclude that essinf f/f0 = 0, and consequently p0 = p, which means that if we write g in the form (1.1), then p is fixed and equals p0. Consequently f is fixed as well, and our model (1.1) is identifiable. ∎

###### Proof of Proposition 2.4.

Since f = e^ϕ is a log-concave density, there exist constants a and b > 0 such that ϕ(x) ≤ a − b|x| (see Cule et al. (2010b)), which implies

 ϕ(x) − log f0(x) ≤ a − b|x| − log f0(x).

Now if log f0(x)/|x|^k → 0, for some k ∈ (0,1), apparently,

 −b|x| − log f0(x) = |x|^k(−b|x|^{1−k} − log f0(x)/|x|^k) → −∞, as x → +∞ or x → −∞.

Hence f(x)/f0(x) → 0 as x → +∞ or x → −∞, which shows that the condition of Proposition 2.1 holds. Thus, model (1.1) is identifiable from Proposition 2.1. ∎

###### Proof of Theorem 3.9.

When maximizing L(p, ϕ, Q) over all concave ϕ satisfying the conditions of the theorem, we may restrict our attention to functions such that ∫e^{ϕ(x)} dx ≤ 1. For if ∫e^{ϕ(x)} dx = c > 1, replacing ϕ(x) with ϕ(x) − log c for all x, the value of the criterion would be greater than or equal to the original one. Note that since log c > 0, the new concave function still satisfies the conditions above. We denote by ~Φd the family of all ϕ with these properties.

Now we show that L(Q) < ∞. Suppose that ϕ ∈ ~Φd is such that M = sup_x ϕ(x) > 0. For α > 0, let D_{−αM} = {x ∈ R^d : ϕ(x) ≥ −αM}; hence D_{−αM} is closed and convex. For any p ∈ [0,1], we have the following estimate,

 L(p, ϕ, Q) = ∫log((1−p)f0 + p e^ϕ) dQ
  ≤ ∫log((1−p)m(x) + p) Q(dx) + ∫ϕ dQ
  ≤ ∫log(m(x) + 1) Q(dx) − αM Q(R^d ∖ D_{−αM}) + M Q(D_{−αM})
  = ∫log(m(x) + 1) Q(dx) − (α+1)M(α/(α+1) − Q(D_{−αM})),

where m(x) = f0(x) e^{−ϕ(x)}. Note that ∫log(m(x)+1) Q(dx) exists since Q ∈ Qk. By Lemma 4.1 of Dümbgen et al. (2011), for any fixed α > 0,

 Leb(D_{−αM}) ≤ (1+α)^d M^d e^{−M}/∫(1+α