1 Introduction
In this paper, we consider the following two-component mixture model,
(1.1)  h(x) = πf(x) + (1 − π)f₀(x),
where the probability density function (pdf) f₀ is known, whereas the mixing proportion π and the pdf f are unknown. Model (1.1) is motivated by studies in the biological sciences that cluster differentially expressed genes in microarray data (Bordes et al., 2006). Typically we build a test statistic, say z_i, for each gene i. Under the null hypothesis, which presumes no difference in expression levels under two or more conditions, z_i is assumed to have a known distribution (in general Student's t or Fisher's F). Under the alternative hypothesis, the distribution is unknown. Thus, the distribution of the test statistic is modelled by (1.1), where π is the proportion of non-null statistics. Estimating π and the pdf f tells us the probability that gene i is differentially expressed given z_i.

Bordes et al. (2006) considered model (1.1) where f is assumed to be symmetric. They obtained some identifiability results under moment and symmetry conditions, and proposed to estimate this model under the symmetry of f. In addition, they proved the consistency of their estimator under mild conditions. Song et al. (2010) considered another special case,
where f is a normal density with mean μ and unknown standard deviation σ. This model was inspired by sequential clustering (Song & Nicolae, 2009), which first finds candidates for the centers of clusters, then carries out a local search to find the objects that belong to those clusters, and finally selects the best cluster. Song et al. (2010) proposed an EM-type estimator and a maximizing-type estimator for their model, both of which can be easily extended to models where f is not normal.

A slightly different model is considered by Xiang et al. (2014), where σ is a possibly unknown scale parameter, and μ is a non-null location parameter for f. They proposed a new effective estimator based on the minimum profile Hellinger distance (MPHD). They established the existence and uniqueness of their estimator and also proved its consistency under some regularity conditions. Their method does not actually require f to be symmetric and thus can be applied to a more general model. For some other alternative estimators see, for example, Patra & Sen (2016) and Ma et al. (2015).
In this paper, we propose to estimate (1.1) using a new approach that imposes a log-concavity assumption on f, i.e. f = exp φ with φ ∈ Φ; here Φ denotes the family of concave functions φ: ℝ → [−∞, ∞) which are upper semicontinuous and coercive in the sense that φ(x) → −∞ as |x| → ∞. Note that φ needs to be coercive in order for exp φ to be a density function. Many common parametric families of distributions belong to the family of log-concave densities, for example the normal, exponential, and logistic distributions. We propose to estimate the new model by maximizing a semiparametric mixture likelihood. Compared to the kernel density estimation of f used by many existing methods (Bordes et al., 2006; Xiang et al., 2014; Ma et al., 2015), the new method does not require the choice of one or more bandwidths (Samworth, 2017). We establish the identifiability of the proposed semiparametric mixture model and prove the existence and consistency of the proposed estimators. We further compare our estimator with several existing estimators through simulation studies and apply our method to two real data sets, one from the biological sciences and one from astronomy.

The rest of the paper is organized as follows. In Section 2 we discuss some identifiability issues for model (1.1). Section 3 introduces our maximum likelihood estimator, and a detailed EM-type algorithm is provided. Existence and consistency properties of our estimator are established, with detailed proofs given in the Appendix (Section 7). Section 4 demonstrates the finite-sample performance of our proposed estimator by comparison with many other existing algorithms. Two real data applications are given in Section 5. Section 6 gives a brief discussion.
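Before moving on, the log-concavity of the familiar families mentioned above can be checked numerically. The sketch below is an illustration we add here, not part of the original analysis: it tests concavity of a log-density on a grid via second differences, and also shows a non-log-concave counterexample (Student's t), whose log-density is convex in the tails.

```python
import numpy as np
from scipy import stats

def is_log_concave(logpdf, grid):
    """Check concavity of a log-density on a grid via second differences."""
    vals = logpdf(grid)
    second_diff = vals[:-2] - 2.0 * vals[1:-1] + vals[2:]
    return bool(np.all(second_diff <= 1e-9))

grid = np.linspace(-5.0, 5.0, 401)
print(is_log_concave(stats.norm.logpdf, grid))      # True: normal is log-concave
print(is_log_concave(stats.logistic.logpdf, grid))  # True: logistic is log-concave
print(is_log_concave(stats.t(3).logpdf, grid))      # False: Student's t is not
```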
2 Identifiability
Note that model (1.1) is not identifiable without any constraint on the density f; see, e.g., Bordes et al. (2006) and Patra & Sen (2016). However, a parametric model for f might produce biased or even misleading statistical inference when the model assumption is incorrect. In this paper, we assume f to be log-concave, i.e. f = exp φ, where φ is a concave function. Log-concave densities have attracted a lot of attention in recent years, since the class is very flexible and can be estimated by the nonparametric maximum likelihood estimator without requiring the choice of any tuning parameter. For more details, see Cule et al. (2010a), Dümbgen et al. (2009), Walther et al. (2009), Dümbgen et al. (2011), and the review of recent progress in log-concave density estimation by Samworth (2017).

Proposition 2.1.
Assume around a neighborhood of , then model (1.1) is identifiable if,
Remark 2.2.
Remark 2.3.
Now, for a log-concave density f, it is easy to derive the following more specific result.
Proposition 2.4.
Assume and . Model (1.1) is identifiable if , for some , as or .
Example 2.1.
If f₀ is the density of a Student's t distribution with ν degrees of freedom, then model (1.1) is identifiable.
Proof.
Remark 2.5.
Similarly, one can check that when f₀ is the pdf of an F distribution, lognormal distribution, or Pareto distribution, then model (1.1) is identifiable under an analogous condition. In general, we need f(x)/f₀(x) → 0 as x → ∞ or as x → −∞ to make model (1.1) identifiable. Since f₀ is known, it is not difficult to give sufficient conditions for different f₀'s. Here we give two examples of easy-to-apply conditions.
Example 2.2.
Proof.
Suppose , or . Since
Hence f(x)/f₀(x) → 0 as x → ∞ or x → −∞, and Proposition 2.1 asserts the identifiability of model (1.1). ∎
Remark 2.6.
Example 2.3.
Proof.
3 Maximum Likelihood Estimation
Suppose we have a random sample of n i.i.d. observations x₁, …, xₙ from the density h(x) = πf(x) + (1 − π)f₀(x), where f is a log-concave density, i.e., f = exp φ with φ concave. Then, with the empirical distribution ℚₙ = n⁻¹ Σ_i δ_{x_i}, where δ_x is the degenerate distribution function at x, the log-likelihood of our random sample can be written as
ℓ(π, φ) = n⁻¹ Σ_i log( π exp{φ(x_i)} + (1 − π) f₀(x_i) ),
subject to the condition that ∫ exp{φ(x)} dx = 1. A natural approach to estimate π and φ is to find the maximizer of ℓ(π, φ).
3.1 Algorithm
We propose to estimate π and φ by maximizing ℓ(π, φ), using an EM algorithm that iterates the following E step and M step until convergence:
"E step": Given the current estimates π^(k) and φ^(k), compute, for i = 1, …, n,
θ_i^(k) = π^(k) exp{φ^(k)(x_i)} / ( π^(k) exp{φ^(k)(x_i)} + (1 − π^(k)) f₀(x_i) ).
"M step": Update π^(k+1) = n⁻¹ Σ_i θ_i^(k), and let φ^(k+1) be the weighted log-concave maximum likelihood estimator based on x₁, …, xₙ with weights θ_1^(k), …, θ_n^(k).
We find φ^(k+1) using an active set algorithm, which is described in Dümbgen et al. (2007) and implemented in the R package logcondens by Rufibach & Duembgen (2010). Throughout this paper, we use "EM_logconcave" to refer to our method. The following result establishes the monotonicity property of the EM_logconcave algorithm.
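To illustrate the flavor of the E and M steps, here is a minimal, self-contained sketch of such an EM iteration. Two hedges: the actual M step fits the weighted log-concave maximum likelihood estimator via logcondens's active set algorithm, whereas this stand-alone sketch substitutes a weighted normal fit for the unknown component (a special case of a log-concave density); and the data, sample size, and parameter values are made up purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data from h(x) = pi * f(x) + (1 - pi) * f0(x),
# with f0 = N(0, 1) known and f = N(3, 1), pi = 0.3 unknown to the algorithm.
n, pi_true = 2000, 0.3
from_unknown = rng.random(n) < pi_true
x = np.where(from_unknown, rng.normal(3.0, 1.0, n), rng.normal(0.0, 1.0, n))

f0 = stats.norm(0.0, 1.0).pdf  # known component density

# Crude initialization: start the unknown component near the upper tail.
pi_hat, mu, sigma = 0.5, float(np.quantile(x, 0.9)), 1.0
for _ in range(200):
    f_hat = stats.norm(mu, sigma).pdf(x)
    # E step: posterior probability that each observation came from the
    # unknown component, given the current parameter estimates.
    num = pi_hat * f_hat
    theta = num / (num + (1.0 - pi_hat) * f0(x))
    # M step: update pi, then refit the unknown component with weights theta.
    # (The actual algorithm computes a weighted log-concave MLE here instead.)
    pi_hat = float(theta.mean())
    mu = float(np.sum(theta * x) / theta.sum())
    sigma = float(np.sqrt(np.sum(theta * (x - mu) ** 2) / theta.sum()))

print(f"pi_hat = {pi_hat:.3f}, mu_hat = {mu:.3f}")
```

With this well-separated setup the estimates land near the true values (π ≈ 0.3, μ ≈ 3); the substituted normal M step is what would change when plugging in the log-concave solver.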
Proposition 3.0.
Let , then
for any .
3.2 Theoretical Properties
In general, for any distribution on , we define,
For the existence of a maximizer of , we follow the approach of Dümbgen et al. (2011). We define the convex support of as,
Theorem 3.9.
For fixed , assume , and there exist some integer , such that,
Let , then
is real. In that case, there exists,
Moreover,
Example 3.1.
Assume ℚ represents the distribution of Model 1, so that f₀ is the corresponding known pdf. For any fixed integer, the class in question contains the pdfs of the normal, logistic, and Laplace distributions, among others, and Theorem 3.9 implies that the maximum of ℓ exists over π and φ.
In general, the maximizer of ℓ is not unique. But if ℚ has density h = πf + (1 − π)f₀, where the model is identifiable, then (π, f) is the unique maximizer. This is because, as noted by Dümbgen et al. (2011), if we had another maximizer with mixture density h', the difference of the two criterion values can be written as the integral ∫ log(h/h') dℚ. Note that this integral is exactly the Kullback-Leibler divergence, which is nonnegative and equals zero iff h = h' almost everywhere. Thus the maximizer is unique, except that h and h' may differ on a set of Lebesgue measure zero.

Next we establish the consistency of our maximum likelihood estimator. First, we introduce some notation.
In what follows, we consider the convergence of distributions under Mallows' distance (Mallows, 1972). Specifically, for two distributions F and G with finite second moments,
d₂(F, G) = inf { E(X − Y)² }^{1/2},
where the infimum is taken over all pairs of random variables (X, Y) with marginal distributions F and G. It is known that d₂(Fₙ, F) → 0 is equivalent to Fₙ ⇒ F together with convergence of second moments (Bickel & Freedman, 1981; Mallows, 1972). Here ⇒ means weak convergence, or convergence in distribution.
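For two equal-size samples, Mallows' d₂ between the empirical distributions reduces to matching the sorted samples (the quantile coupling is optimal). A small sketch, assuming equal sample sizes for simplicity:

```python
import numpy as np

def mallows_d2(x, y):
    """Mallows' d2 distance between two equal-size empirical distributions.
    The optimal coupling matches order statistics (the quantile coupling)."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    assert x.shape == y.shape, "sketch assumes equal sample sizes"
    return float(np.sqrt(np.mean((x - y) ** 2)))

print(mallows_d2([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))  # 1.0: a unit location shift
```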
Now we are ready to state our consistency theorem.
Theorem 3.10.
Assume, (a). ; (b). for some fixed integer , the unknown density satisfies the following condition: , where , , such that, . Let be a sequence of distributions in such that for some . Suppose is upper semicontinuous and is bounded. Then
Assume there exist maximizers of , and a unique maximizer of , where . Let , , then
In practice, ℚₙ will be the empirical distribution function, which automatically satisfies the above assumption.
4 Simulation
In this section, we investigate the finite-sample performance of our algorithm and compare it to the estimator proposed by Patra & Sen (2016), the symmetrization estimator by Bordes et al. (2006), the EM-type and maximizing-type estimators by Song et al. (2010), and the minimum profile Hellinger distance estimator by Xiang et al. (2014).
In order to test our method under different settings, we simulate samples of n i.i.d. random variables with the common distribution given by the following six models:

Model 1: ,

Model 2: ,

Model 3: ,

Model 4: ,

Model 5: ,

Model 6: .
For each sample we estimate π, the mean of the unknown component, and the classification error. The detailed calculations are explained below.
For our algorithm and the algorithm by Xiang et al. (2014), final estimators π̂ and f̂ are always produced; thus the estimated probability α̂_i that the ith observation is from the known component f₀, given z_i, can be calculated by
α̂_i = (1 − π̂) f₀(z_i) / ( (1 − π̂) f₀(z_i) + π̂ f̂(z_i) ).
For the other methods, f̂ may not always be given directly. As suggested by Song et al. (2010), we then estimate α̂_i by
α̂_i = (1 − π̂) f₀(z_i) / h̃(z_i),
where h̃ is the kernel density estimator of the mixture density h with a Gaussian kernel and Silverman's "rule of thumb" bandwidth (Silverman, 1986). Note that the algorithm proposed by Patra & Sen (2016) can actually estimate f when f is nonincreasing. But we find the algorithm works best when f and f₀ have the same support, and it often produces unreliable estimates when the two supports differ. Thus we do not use it to estimate f for Patra & Sen (2016)'s algorithm even when the true f does decrease on its support; instead, we follow Song et al. (2010)'s recommendation to get α̂_i.
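As a sketch of this posterior-probability calculation (with a hypothetical f₀, fitted component, and mixing proportion chosen purely for illustration):

```python
import numpy as np
from scipy import stats

def posterior_known(z, pi_hat, f0, f_hat):
    """Estimated probability that observation z_i came from the known
    component f0, given estimates pi_hat and f_hat."""
    known = (1.0 - pi_hat) * f0(z)
    return known / (known + pi_hat * f_hat(z))

# Hypothetical choices purely for illustration: f0 = N(0, 1),
# estimated unknown component N(3, 1), estimated proportion 0.3.
z = np.array([-1.0, 0.0, 1.5, 3.0])
alpha = posterior_known(z, 0.3, stats.norm(0.0, 1.0).pdf, stats.norm(3.0, 1.0).pdf)
print(np.round(alpha, 3))  # near 1 for points close to 0, near 0 close to 3
```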
The algorithms by Xiang et al. (2014) and Bordes et al. (2006) give a final mean estimator directly. For the other methods, after we get α̂_i, we estimate the mean μ of the unknown component by the weighted sum
μ̂ = Σ_i (1 − α̂_i) z_i / Σ_i (1 − α̂_i).
Lastly, we report the classification error (Cla_error) based on α̂_i as the mean squared error between α̂_i and the true α_i, i.e.,
Cla_error = n⁻¹ Σ_i (α̂_i − α_i)²,
where α_i = 1 if z_i is from the known component f₀ and α_i = 0 if z_i is from the unknown component f.
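A minimal sketch of the classification-error computation, on made-up membership probabilities and labels:

```python
import numpy as np

def classification_error(alpha_hat, alpha_true):
    """Mean squared error between estimated membership probabilities
    alpha_hat and the true 0/1 labels (1 = known component)."""
    alpha_hat = np.asarray(alpha_hat, dtype=float)
    alpha_true = np.asarray(alpha_true, dtype=float)
    return float(np.mean((alpha_hat - alpha_true) ** 2))

# Four observations: the first two truly from the known component.
print(classification_error([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0]))  # 0.025
```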
For model 1, Table 1 reports the bias and MSE of the estimates of π, the bias and MSE of the estimates of μ, and the mean classification error of the different methods over the repetitions, for three values of π and a fixed sample size. Similar reports for the other models can be found in Tables 1-6. Simulation results for two further sample sizes are reported in the Appendix (Section 7). We report the results of Bordes et al. (2006)'s algorithm only for models 1, 2 and 6, since this method fails to estimate μ for the other models, in which the true f's are not symmetric on their supports.
EM_logconcave  Patra  Bordes  Song EM  Song max  Xiang  

0.002(0.0004)  0.009(0.0007)  0.001(0.0009)  0.08(0.0066)  0.087(0.0122)  0.006(0.0006)  
0.063(0.0180)  0.152(0.0650)  0.021(0.0426)  0.116(0.0446)  0.675(0.5680)  0.116(0.0396)  
Cla_error  0.0960  0.1056  0.1052  0.1102  0.1052  0.0973 
0.002(0.0004)  0.025(0.0011)  0.001(0.0006)  0.132(0.0177)  0.106(0.0149)  0.007(0.0006)  
0.018(0.0042)  0.051(0.0073)  0.000(0.0046)  0.185(0.0375)  0.322(0.1392)  0.013(0.0056)  
Cla_error  0.1094  0.1219  0.1198  0.1352  0.1206  0.1104 
0.001(0.0002)  0.252(0.0020)  0.001(0.0003)  0.107(0.0118)  0.063(0.0047)  0.009(0.0003)  
0.005(0.0013)  0.066(0.0057)  0.000(0.0016)  0.118(0.0153)  0.128(0.0220)  0.002(0.0021)  
Cla_error  0.0645  0.0739  0.0694  0.0834  0.0721  0.0664 
EM_logconcave  Patra  Bordes  Song EM  Song max  Xiang  

0.008(0.0014)  0.023(0.0015)  0.015(0.0012)  0.15(0.0228)  0.382(0.1496)  0.017(0.0019)  
0.018(0.0015)  0.027(0.0016)  0.029(0.0017)  0.014(0.0007)  0.199(0.0401)  0.007(0.0010)  
Cla_error  0.1270  0.1520  0.1511  0.1676  0.1847  0.1339 
0.001(0.0007)  0.046(0.0030)  0.040(0.0024)  0.248(0.0811)  0.228(0.0548)  0.047(0.0035)  
0.003(0.0001)  0.011(0.0002)  0.032(0.0011)  0.038(0.0015)  0.077(0.0064)  0.022(0.0012)  
Cla_error  0.1609  0.1990  0.1974  0.2638  0.1887  0.1753 
0.001(0.0004)  0.074(0.0059)  0.070(0.0055)  0.311(0.0974)  0.099(0.0105)  0.060(0.0043)  
0.001(0.00004)  0.019(0.0004)  0.033(0.0011)  0.040(0.0016)  0.025(0.0007)  0.030(0.0014)  
Cla_error  0.1000  0.1264  0.1261  0.2103  0.1129  0.1142 
EM_logconcave  Patra  Bordes  Song EM  Song max  Xiang  

0.001(0.0002)  0.001(0.0006)  NA  0.060(0.0038)  0.410(0.1698)  0.024(0.0011)  
0.006(0.0082)  0.039(0.0152)  NA  0.048(0.0149)  1.140(1.3094)  0.105(0.0184)  
Cla_error  0.0709  0.0851  NA  0.0879  0.1568  0.0790 
0.000(0.0003)  0.013(0.0006)  NA  0.073(0.0057)  0.259(0.0681)  0.042(0.0028)  
0.003(0.0021)  0.011(0.0030)  NA  0.018(0.0030)  0.502(0.2578)  0.091(0.0157)  
Cla_error  0.0595  0.0767  NA  0.0790  0.1166  0.0732 
0.001(0.0002)  0.228(0.0010)  NA  0.231(0.0012)  0.104(0.0112)  0.071(0.0060)  
0.001(0.0013)  0.002(0.0014)  NA  0.002(0.0014)  0.159(0.0283)  0.104(0.0224)  
Cla_error  0.0260  0.0325  NA  0.0322  0.0526  0.0617 
EM_logconcave  Patra  Bordes  Song EM  Song max  Xiang  

0.000(0.0002)  0.005(0.0005)  NA  0.006(0.0003)  0.106(0.0160)  0.056(0.0041)  
0.023(0.0321)  0.286(0.1299)  NA  0.304(0.1353)  1.066(1.4174)  0.738(1.0279)  
Cla_error  0.0112  0.0139  NA  0.0137  0.0205  0.0215 
0.000(0.0002)  0.009(0.0004)  NA  0.014(0.0005)  0.067(0.0057)  0.049(0.0030)  
0.005(0.0074)  0.148(0.0332)  NA  0.185(0.0459)  0.333(0.1439)  0.676(0.6616)  
Cla_error  0.0110  0.0157  NA  0.0163  0.0160  0.0207 
0.001(0.0001)  0.023(0.0007)  NA  0.006(0.0002)  0.038(0.0019)  0.066(0.0048)  
0.002(0.0052)  0.025(0.0077)  NA  0.054(0.0098)  0.147(0.0352)  0.718(0.5396)  
Cla_error  0.0045  0.0078  NA  0.0089  0.0108  0.0319 
EM_logconcave  Patra  Bordes  Song EM  Song max  Xiang  

0.001(0.0002)  0.006(0.0006)  NA  0.018(0.0005)  0.110(0.0154)  0.037(0.0018)  
0.004(0.0193)  0.399(0.2021)  NA  0.427(0.2126)  1.129(1.4965)  0.923(0.9073)  
Cla_error  0.0012  0.0044  NA  0.0045  0.0105  0.0092 
0.000(0.0003)  0.009(0.0005)  NA  0.023(0.0008)  0.131(0.0208)  0.061(0.0044)  
0.001(0.0079)  0.173(0.0384)  NA  0.213(0.0538)  0.681(0.5620)  0.650(0.4645)  
Cla_error  0.0007  0.0079  NA  0.0090  0.0202  0.0189 
0.001(0.0001)  0.223(0.0007)  NA  0.009(0.0002)  0.079(0.0068)  0.081(0.0071)  
0.000(0.0047)  0.027(0.0061)  NA  0.051(0.0077)  0.313(0.1123)  0.473(0.2482)  
Cla_error  0.0003  0.0022  NA  0.0030  0.0190  0.0426 
EM_logconcave  Patra  Bordes  Song EM  Song max  Xiang  

0.010(0.0003)  0.007(0.0006)  0.083(0.0012)  0.022(0.0006)  0.077(0.0080)  0.001(0.0003)  
0.171(0.0455)  0.029(0.0262)  0.075(0.0843)  0.083(0.0257)  0.414(0.2369)  0.001(0.0177)  
Cla_error  0.0440  0.0450  0.0455  0.0457  0.0468  0.0435 
0.001(0.0003)  0.031(0.0015)  0.002(0.0006)  0.053(0.0031)  0.038(0.0028)  0.002(0.0631)  
0.010(0.0115)  0.148(0.0264)  0.003(0.0066)  0.182(0.0375)  0.020(0.0185)  0.018(0.0066)  
Cla_error  0.0094  0.0672  0.0658  0.0680  0.0656  0.0631 
0.001(0.0001)  0.059(0.0037)  0.002(0.0004)  0.063(0.0043)  0.008(0.0006)  0.001(0.0034)  
0.004(0.0072)  0.169(0.0307)  0.001(0.0024)  0.174(0.0321)  0.055(0.0073)  0.003(0.0034)  
Cla_error  0.0046  0.0637  0.0570  0.0643  0.0567  0.0545 
All the simulation results strongly suggest that our method is very competitive and often outperforms all other methods. Moreover, our method is even more favorable when the sample size gets larger.
To better display our simulation results, we also plot the MSE of the point estimates of π and μ against the different models for all the methods mentioned above, except for the method by Bordes et al. (2006), as that method fails to estimate π and μ for half of the models discussed here. Figure 1 shows that the curve representing our method always lies at the bottom, which demonstrates the effectiveness of our algorithm, while the maximizing-type estimator by Song et al. (2010) gives the worst results in terms of MSE.
5 Real Data Application
5.1 Prostate Data
In this section we consider the prostate data of Efron (2012), consisting of genetic expression levels related to prostate cancer patients. The data set is a 6033 × 102 matrix, with entry x_{ij} the expression level for gene i on patient j, i = 1, …, 6033, j = 1, …, 102. Among the 102 patients, 50 of them are normal control subjects and 52 of them are prostate cancer patients. The goal of the study is to discover the potential genes that are differentially expressed between the normal controls and the prostate cancer patients.
A two-sample t-test is performed to test the significance of each gene via
t_i = ( x̄_i^(2) − x̄_i^(1) ) / ( s_i (1/50 + 1/52)^{1/2} ),
where x̄_i^(1) and x̄_i^(2) are the sample means of gene i over the control and cancer groups, respectively, and s_i is the pooled standard deviation estimate for gene i. These two-sided tests produce p-values, and the distribution of these p-values under the null hypothesis (i.e., gene i is not differentially expressed) has a uniform density, while under the alternative hypothesis (i.e., gene i is differentially expressed) it has a nonincreasing density.
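A sketch of this per-gene testing step on synthetic data; the matrix dimensions, group sizes, and effect size here are made up and much smaller than in the real prostate data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical stand-in for the expression matrix: 200 genes x 12 patients,
# the first 6 columns controls and the last 6 cases; the first 20 genes
# receive a mean shift under the cases, the rest are null.
x = rng.normal(size=(200, 12))
x[:20, 6:] += 2.0
control, case = x[:, :6], x[:, 6:]

# Pooled two-sample t-test for each gene; two-sided p-values.
t_stat, p_val = stats.ttest_ind(case, control, axis=1, equal_var=True)

print(p_val.shape)                # one p-value per gene
print(int((p_val < 0.05).sum()))  # rejections concentrate among the shifted genes
```

Under the null the p-values are uniform on [0, 1]; the shifted genes pull mass toward zero, which is exactly the nonincreasing alternative density described above.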
The estimates of π are reported in Table 7. We can see that the estimate by Bordes et al. (2006) and the maximizing-type estimate by Song et al. (2010) are relatively big. The estimation procedure of Bordes et al. (2006) assumes the density function under the alternative hypothesis to be symmetric, while in our example this density is nonincreasing, so their symmetry assumption is violated. It is known that the maximizing-type estimator by Song et al. (2010) tends to overestimate π, which can also be seen in Table 7. We also want to point out that several approaches to estimating π have been proposed by Efron (2012) as well, among them an estimator based on the central matching method (see Efron (2012) and Efron et al. (2007) for detailed descriptions of those estimators), and Table 7 shows that our estimator gives the closest value to Efron's result.
EM_logconcave  Patra  Bordes  Song EM  Song max  Xiang 
0.0173  0.0817  0.1975  0.0076  0.6132  0.1915 
Figure 2 shows that our estimate of the density under the alternative tends to have a much smaller support compared to the one given by Patra & Sen (2016). Again, as we noted before, the method of Patra & Sen (2016) assumes that f is decreasing on the whole support of f₀. In reality, however, smaller p-values tend to indicate the alternative hypothesis, so it actually makes sense that the support of f for these prostate data may be much smaller than [0, 1]. The estimate produced by Bordes et al. (2006) is not very reliable, since the density f here is not symmetric. For this example, if we apply Bordes et al. (2006)'s method to the original t statistics directly, the estimate is .
5.2 Carina Data
Carina is one of the seven dwarf spheroidal (dSph) satellite galaxies of the Milky Way. Here we consider data consisting of the radial velocities (RV) of stars from the Carina galaxy, obtained with the Magellan and MMT telescopes (Walker et al., 2007). Stars of the Milky Way contribute contamination to this data set. We assume the distribution of the RV of Milky Way stars is known from the Besancon Milky Way model (Robin et al., 2003). We would now like to analyze this data set to better understand the mixture distribution of the RV of stars in the Carina galaxy.
The estimates of π are reported in Table 8. Again we see that Song et al. (2010)'s maximizing-type estimator gives a relatively big estimate. The other estimates are relatively close to each other.
EM_logconcave  Patra  Bordes  Song EM  Song max  Xiang 
0.354  0.364  0.363  0.370  0.687  0.385 
Next, in Figure 3, we plot the histogram of the RV data overlaid with our estimates of the two components of the mixture density; we can see that our estimate approximates the real data fairly well. The component corresponding to the stars of Carina looks very symmetric, and in fact astronomers usually assume the distribution to be Gaussian, which is why the density estimation method proposed by Patra & Sen (2016) does not work here.
6 Discussion
In this paper we study the two-component mixture model with one component completely known. A nonparametric maximum likelihood estimator is developed via an EM algorithm and log-concave approximation. Unlike most existing estimation procedures, our new method finds the mixing proportion and the distribution of the unknown component simultaneously, without requiring the selection of any tuning parameter, and the proposed EM algorithm preserves the monotone likelihood-ascent property of the traditional EM algorithm. Simulation results show that our method compares favorably with many other competing estimation methods.
We are able to prove the existence and consistency of our maximum likelihood estimator for a general distribution ℚ, but we do require some extra conditions on f and f₀. A possible future research direction is to relax these assumptions and make the results more general. In addition, it would also be of interest to apply our method to a more general model in which the known component f₀ also contains some unknown parameters.
7 Appendix
7.1 Theoretical Proof
Proof of Proposition 2.1.
According to Patra & Sen (2016), if we let F and F₀ be the cumulative distribution functions of f and f₀, respectively, then the mixing proportion admits a characterization in which λ represents the Lebesgue measure. Now, if the identifiability condition failed, there must exist some quantity such that equality holds almost everywhere, which contradicts the fact assumed in the proposition. Hence we can conclude that π is fixed, and consequently f is fixed as well, so our model (1.1) is identifiable. ∎
Proof of Proposition 2.4.
Proof of Theorem 3.9.
Suppose , , and . For any concave function satisfying the above conditions, there exist , such that , thus for any , , and thus we have . When maximizing over all , we may restrict our attention to functions such that ; for if , replacing with for all would make the value of greater than or equal to the original . Note that since , the new concave function still satisfies the conditions above, i.e., , and . We denote by the family of all such functions with these properties.
Now we show that . Suppose that is such that . Let , hence is closed and convex. For any , we have the following estimate,
Note that exists since . By Lemma 4.1 of Dümbgen et al. (2011), for any fixed ,