1 Introduction
Let $\{f(x;\theta): \theta \in \Theta\}$ be a parametric distribution family with density function $f(x;\theta)$ with respect to some $\sigma$-finite measure. Denote by
$G = \sum_{k=1}^{K} w_k \delta_{\theta_k}$
a distribution assigning probability $w_k$
on $\theta_k$. A distribution with the following density function
$f(x;G) = \sum_{k=1}^{K} w_k f(x;\theta_k)$
is called a finite mixture. We call $f(x;\theta_k)$ the subpopulation density function, $\theta_k$ the subpopulation parameter, and $w_k$ the mixing weight of the $k$th subpopulation. We use $F(x;\theta)$ and $F(x;G)$
for the cumulative distribution functions (CDFs) of $f(x;\theta)$
and $f(x;G)$ respectively. Let
$\mathbb{G}_K = \{\textstyle\sum_{k=1}^{K} w_k \delta_{\theta_k}: w_k \geq 0, \sum_{k=1}^{K} w_k = 1, \theta_k \in \Theta\}$
be a space of mixing distributions with at most $K$ support points. A mixture distribution of (exactly) order $K$ has its mixing distribution being a member of $\mathbb{G}_K \setminus \mathbb{G}_{K-1}$.
We study the problem of learning the mixing distribution $G$ given a set of independent and identically distributed (IID) observations $x_1, \ldots, x_n$ from a mixture $f(x;G)$. Throughout the paper, we assume the order $K$ of $G$ is known and $\{f(x;\theta)\}$ is a known location-scale family. That is,
$f(x;\theta) = \sigma^{-1} f((x-\mu)/\sigma)$
for some probability density function $f(\cdot)$
with respect to Lebesgue measure, where $\theta = (\mu, \sigma)$ with $\mu \in \mathbb{R}$ and $\sigma > 0$.

Finite mixture models provide a natural representation of a heterogeneous population that is believed to be composed of several homogeneous subpopulations (Pearson, 1894; Schork et al., 1996). They are also useful for approximating distributions with unknown shapes, which is particularly relevant in image generation (Kolouri et al., 2018), image segmentation (Farnoosh and Zarpak, 2008), object tracking (Santosh et al., 2013), and signal processing (Plataniotis and Hatzinakos, 2000).
In statistics, the most fundamental task is to learn the unknown parameters. In the early days, the method of moments was the method of choice under finite mixture models for its ease of computation (Pearson, 1894). Nowadays, the maximum likelihood estimate (MLE) is the first choice due to its statistical efficiency and the availability of the easy-to-use EM algorithm. Under a finite location-scale mixture model, the log-likelihood function of $G$ is given by
$\ell_n(G) = \sum_{i=1}^{n} \log f(x_i; G). \qquad (1)$
The log-likelihood is unbounded: at a mixing distribution that places a subpopulation location at an observed value, we have $\ell_n(G) \to \infty$ as the corresponding scale parameter goes to 0. Hence, the MLE of $G$ is not well defined. Various remedies, such as the penalized maximum likelihood estimate (pMLE), have been proposed to overcome this obstacle (Chen et al., 2008; Chen and Tan, 2009). At the same time, the MLE can be thought of as a special minimum distance estimator: it minimizes the Kullback-Leibler divergence between the empirical distribution and the assumed model. Other divergences and distances have been investigated in the literature, as in Choi (1969); Yakowitz (1969); Woodward et al. (1984); Clarke and Heathcote (1994); Cutler and Cordero-Brana (1996); Deely and Kruse (1968). Recently, the Wasserstein distance has drawn increased attention in the machine learning community due to its intuitive interpretation and good geometric properties (Evans and Matsen, 2012; Arjovsky et al., 2017). A Wasserstein distance based estimator for learning finite mixture models is absent from the literature.

Are there any benefits to learning finite location-scale mixtures by the minimum Wasserstein distance estimator (MWDE)? This paper answers this question from several angles. We find that the MWDE is consistent and derive a numerical solution under finite location-scale mixtures. We compare the robustness of the MWDE with the pMLE in the presence of outliers and mild model misspecifications. We conclude that the MWDE suffers some efficiency loss against the pMLE in general without an obvious gain in robustness. Through this paper, we better understand the pros and cons of the MWDE under finite location-scale mixtures. We reaffirm the general superiority of the likelihood-based learning strategies even for the non-regular finite location-scale mixtures.
In the next section, we first introduce the Wasserstein distance and some of its properties. This is followed by a formal definition of the MWDE and a discussion of its existence and consistency under finite location-scale mixtures. In Section 2.4, we give some algebraic results that are essential for computing the Wasserstein distance between the empirical distribution and a finite location-scale mixture. We then develop a BFGS algorithmic scheme for computing the MWDE of the mixing distribution. In addition, we briefly review the penalized likelihood approach and its numerical issues. In Section 3, we characterize the efficiency properties of the MWDE relative to the pMLE in various circumstances via simulation. We also study their robustness when the data contain outliers or are contaminated, or when the model is misspecified. We then apply both methods in an image segmentation example. We conclude the paper with a summary in Section 4.
2 Wasserstein Distance and the Minimum Distance Estimator
2.1 Wasserstein Distance
The Wasserstein distance is a distance between probability measures. Let $(\mathcal{X}, d)$ be a Polish space endowed with a ground distance $d$, and let $\mathcal{P}(\mathcal{X})$ be the space of Borel probability measures on $\mathcal{X}$. Let $\mu \in \mathcal{P}(\mathcal{X})$ be a probability measure. If for some $r \geq 1$,
$\int_{\mathcal{X}} d^r(x, x_0) \, d\mu(x) < \infty$
for some (and thus any) $x_0 \in \mathcal{X}$, we say $\mu$ has finite $r$th moment. Denote by $\mathcal{P}_r(\mathcal{X})$ the space of probability measures with finite $r$th moment. For any $\mu, \nu \in \mathcal{P}_r(\mathcal{X})$, we use $\Pi(\mu, \nu)$ to denote the space of the bivariate probability measures on $\mathcal{X} \times \mathcal{X}$ whose marginals are $\mu$ and $\nu$. Namely,
$\Pi(\mu, \nu) = \{\pi \in \mathcal{P}(\mathcal{X} \times \mathcal{X}): \pi(A \times \mathcal{X}) = \mu(A), \ \pi(\mathcal{X} \times B) = \nu(B)\}.$
The Wasserstein distance is defined as follows.
Definition 2.1 (Wasserstein distance).
For any $\mu, \nu \in \mathcal{P}_r(\mathcal{X})$ with $r \geq 1$, the $r$th Wasserstein distance between $\mu$ and $\nu$ is
$W_r(\mu, \nu) = \Big( \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} d^r(x, y) \, d\pi(x, y) \Big)^{1/r}.$
Suppose $X$ and $Y$
are two random variables whose distributions are $F_X$
and $F_Y$ and whose induced probability measures are $\mu$ and $\nu$. We regard the Wasserstein distance between $\mu$ and $\nu$ also as the distance between the random variables or the distributions: $W_r(X, Y) = W_r(F_X, F_Y) = W_r(\mu, \nu)$. The Wasserstein distance is a distance on $\mathcal{P}_r(\mathcal{X})$ as shown by Villani (2003, Theorem 7.3). For any $\mu, \nu, \eta \in \mathcal{P}_r(\mathcal{X})$, it has the following properties:

Non-negativity: $W_r(\mu, \nu) \geq 0$, and $W_r(\mu, \nu) = 0$ if and only if $\mu = \nu$;

Symmetry: $W_r(\mu, \nu) = W_r(\nu, \mu)$;

Triangle inequality: $W_r(\mu, \nu) \leq W_r(\mu, \eta) + W_r(\eta, \nu)$.
The Wasserstein distance has many nice properties. Let us write $\Rightarrow$ for convergence in distribution or in measure. Villani (2003, Theorem 7.12) shows that it has the following properties:

Property 1. For any $\mu_n, \mu \in \mathcal{P}_r(\mathcal{X})$, $W_r(\mu_n, \mu) \to 0$ implies $\mu_n \Rightarrow \mu$.

Property 2. $W_r(\mu_n, \mu) \to 0$ as $n \to \infty$ if and only if both

$\mu_n \Rightarrow \mu$, and

$\int d^r(x, x_0) \, d\mu_n(x) \to \int d^r(x, x_0) \, d\mu(x)$

for some (and thus any) $x_0 \in \mathcal{X}$.
Computing the Wasserstein distance involves a challenging optimization problem in general, but it has a simple solution in a special case. Suppose $\mathcal{X} = \mathbb{R}$ is the space of real numbers, $d(x, y) = |x - y|$, and $F_X$ and $F_Y$ are univariate distributions. Let $F_X^{-1}(t)$ and $F_Y^{-1}(t)$ for $t \in (0, 1)$
be their quantile functions. We can easily compute the Wasserstein distance based on the following property.

Property 3. $W_r^r(F_X, F_Y) = \int_0^1 |F_X^{-1}(t) - F_Y^{-1}(t)|^r \, dt$.
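Property 3 reduces the one-dimensional Wasserstein distance to a coupling of quantile functions. As a minimal illustration (our own sketch, not part of the paper's algorithm), for two empirical distributions of equal sample size the quantile functions are piecewise constant on the order statistics, so the distance reduces to matching sorted samples:

```python
import numpy as np

def wasserstein_p(x, y, p=2):
    """Empirical p-Wasserstein distance between two equal-size 1-D samples.

    For equal-size empirical distributions, the quantile formula of
    Property 3 reduces to averaging |x_(i) - y_(i)|^p over order statistics.
    """
    xs = np.sort(np.asarray(x, dtype=float))
    ys = np.sort(np.asarray(y, dtype=float))
    if xs.shape != ys.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(xs - ys) ** p) ** (1.0 / p))
```

For unequal sample sizes, the same formula applies after evaluating both piecewise-constant quantile functions on a common grid.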
2.2 Minimum Wasserstein Distance Estimator
Let $W_r$ be the Wasserstein distance with ground distance $d(x, y) = |x - y|$ for univariate random variables. Let $x_1, \ldots, x_n$ be a set of IID observations from a finite location-scale mixture of order $K$ and $F_n$ be the empirical distribution. We introduce the MWDE of the mixing distribution, namely
$\hat{G}_n = \arg\min_{G \in \mathbb{G}_K} W_r(F_n, F(\cdot; G)). \qquad (2)$
As we pointed out earlier, the MLE is not well defined under finite location-scale mixtures. Is the MWDE well defined? We examine the existence, or sensibility, of the MWDE. We show that the MWDE exists when $f(\cdot)$ satisfies certain conditions.

Assume that $r \geq 1$ and that $f(\cdot)$ is bounded, continuous, and has finite $r$th moment. Under these conditions, we can see that $W_r(F_n, F(\cdot; G)) < \infty$
for any $G \in \mathbb{G}_K$. When $n \leq K$, the solution to (2) merits special attention. Let $G_\sigma$ be a mixing distribution assigning probability $1/n$ on each $(x_i, \sigma)$. When $\sigma \to 0$, each subpopulation in the mixture degenerates to a point mass at $x_i$. Hence, as $\sigma \to 0$,
$W_r(F_n, F(\cdot; G_\sigma)) \to 0.$
Since no $G \in \mathbb{G}_K$ has zero distance from $F_n$, the MWDE does not exist unless we expand the parameter space to include $\sigma = 0$. To remove this technical artifact, in the MWDE definition we expand the space of $G$ to allow zero subpopulation scales. We denote by $\delta_{(x, 0)}$ a distribution with point mass at $(x, 0)$. With this expansion, $\hat{G}_n = n^{-1} \sum_{i=1}^n \delta_{(x_i, 0)}$ is the MWDE when $n \leq K$.
Let $W^* = \inf\{W_r(F_n, F(\cdot; G)): G \in \mathbb{G}_K\}$. Clearly, $W^* < \infty$. By definition, there exists a sequence of mixing distributions $G^{(t)}$ such that $W_r(F_n, F(\cdot; G^{(t)})) \to W^*$ as $t \to \infty$. Suppose one mixing weight of $G^{(t)}$ has limit 0. Removing this support point and rescaling, we get a new mixing distribution sequence that still satisfies $W_r(F_n, F(\cdot; G^{(t)})) \to W^*$. For this reason, we may assume that its mixing weights have nonzero limits, selecting a converging subsequence if necessary to ensure the limits exist. Further, when the mixing weights of $G^{(t)}$ assume their limiting values while the subpopulation parameters are kept the same, we still have $W_r(F_n, F(\cdot; G^{(t)})) \to W^*$ as $t \to \infty$. In the following discussion, we therefore consider a sequence of mixing distributions whose mixing weights are fixed.
Suppose the first subpopulation of $G^{(t)}$ has its scale parameter $\sigma_1^{(t)} \to \infty$ as $t \to \infty$. With the boundedness assumption on $f(\cdot)$, the mass of this subpopulation spreads thinly over the entire real line because $\sigma^{-1} f((x - \mu)/\sigma) \to 0$ uniformly in $x$ as $\sigma \to \infty$. For any fixed finite interval $[-M, M]$, this thinning makes the probability that this subpopulation assigns to $[-M, M]$ vanish as $t \to \infty$. It implies that for any given $u \in (0, w_1/2)$, some quantile of the mixture at level $u$ or $1 - u$ diverges:
$|F^{-1}(u; G^{(t)})| \to \infty \ \text{or} \ |F^{-1}(1 - u; G^{(t)})| \to \infty$
as $t \to \infty$. In comparison, the empirical quantile satisfies $|F_n^{-1}(u)| < \infty$ for any $u \in (0, 1)$. By Property 3 of the Wasserstein distance, these lead to $W_r(F_n, F(\cdot; G^{(t)})) \to \infty$ as $t \to \infty$. This contradicts the assumption $W_r(F_n, F(\cdot; G^{(t)})) \to W^* < \infty$. Hence, $\sigma_1^{(t)} \to \infty$ is not a possible scenario, nor is $\sigma_k^{(t)} \to \infty$ for any $k$.

Can a subpopulation of $G^{(t)}$ instead have its location parameter $|\mu_1^{(t)}| \to \infty$? For definiteness, let this subpopulation be the first one. Note that at least a $w_1$-sized probability mass of this subpopulation is contained in a range centered at $\mu_1^{(t)}$. Because of this, when $|\mu_1^{(t)}| \to \infty$, some quantiles of $F(\cdot; G^{(t)})$ diverge while the corresponding empirical quantiles stay bounded. Therefore, $W_r(F_n, F(\cdot; G^{(t)})) \to \infty$ by Property 3. This contradicts $W^* < \infty$. Hence, $|\mu_1^{(t)}| \to \infty$ is not a possible scenario either. For the same reason, we cannot have $|\mu_k^{(t)}| \to \infty$ for any $k$.

After ruling out $\sigma_k^{(t)} \to \infty$ and $|\mu_k^{(t)}| \to \infty$, we find $G^{(t)}$ has a converging subsequence whose limit is a proper mixing distribution in the expanded $\mathbb{G}_K$. This limit is then an MWDE and the existence is verified.
The MWDE may not be unique, and the mixing distribution may lead to a mixture with degenerate subpopulations. We will show that the MWDE is consistent as the sample size goes to infinity. Thus, having degenerate subpopulations in the learned mixture is a mathematical artifact, and the MWDE remains a sensible solution. In contrast, no matter how large the sample size becomes, there are always degenerate mixing distributions with unbounded likelihood values.
2.3 Consistency of MWDE
We consider the problem when $x_1, \ldots, x_n$ are IID observations from a finite location-scale mixture of order $K$. The true mixing distribution is denoted as $G^*$. Assume that $f(\cdot)$ is bounded, continuous, and has finite $r$th moment. We say the location-scale mixture is identifiable if
$F(x; G_1) = F(x; G_2)$
for all $x$ given $G_1, G_2 \in \mathbb{G}_K$ implies $G_1 = G_2$. We allow subpopulation scale $\sigma = 0$. The most commonly used finite location-scale mixtures, such as the normal mixture, are well known to be identifiable (Teicher, 1961). Holzmann et al. (2004) give a sufficient condition for the identifiability of general finite location-scale mixtures. Let $\phi(t)$
be the characteristic function of $f(\cdot)$. The finite location-scale mixture is identifiable if $\phi(t) \neq 0$ for any $t$.

We consider the MWDE based on the Wasserstein distance with ground distance $d(x, y) = |x - y|$ for some $r \geq 1$. The MWDE under the finite location-scale mixture model as defined in (2) is asymptotically consistent.
Theorem 2.1.
With the same conditions on the finite location-scale mixture and the same notation as above, we have the following conclusions.

1. For any sequence $\{G_n\} \subset \mathbb{G}_K$ and $G^* \in \mathbb{G}_K$, $W_r(F(\cdot; G_n), F(\cdot; G^*)) \to 0$ implies $G_n \to G^*$ as $n \to \infty$.

2. The MWDE satisfies $W_r(F(\cdot; \hat{G}_n), F(\cdot; G^*)) \to 0$ as $n \to \infty$ almost surely.

3. The MWDE is consistent: $\hat{G}_n \to G^*$ as $n \to \infty$ almost surely.
Proof.
We present these three conclusions in the current order because it is easy to understand. For the sake of the proof, a different order is better. For ease of presentation, we write $\hat{F} = F(\cdot; \hat{G}_n)$ and $F^* = F(\cdot; G^*)$ in this proof.

We first prove the second conclusion. By the triangle inequality and the definition of the minimum distance estimator, we have
$W_r(\hat{F}, F^*) \leq W_r(\hat{F}, F_n) + W_r(F_n, F^*) \leq 2 W_r(F_n, F^*).$
Note that $F_n$ is the empirical distribution and $F^*$ is the true distribution, so $F_n \to F^*$ uniformly almost surely. At the same time, under the assumption that $f(\cdot)$ has finite $r$th moment, $F^*$ also has finite $r$th moment. The $r$th moment of $F_n$ converges to that of $F^*$ almost surely. Given the ground distance $d(x, y) = |x - y|$, the
$r$th moment in the Wasserstein distance sense is the usual moment in probability theory. By Property 2, we conclude
$W_r(F_n, F^*) \to 0$ almost surely,
as both conditions there are satisfied. This establishes the second conclusion.

Conclusion 3 is implied by Conclusions 1 and 2. With Conclusion 2 already established, we need only prove Conclusion 1 to complete the whole proof. By Helly's lemma (Van der Vaart, 2000, Lemma 2.5) again, $G_n$ has a converging subsequence, though the limit can be a sub-probability measure. Without loss of generality, we assume that $G_n$ itself converges with limit $\bar{G}$. If $\bar{G}$ is a sub-probability measure, so would be the limit of $F(\cdot; G_n)$. This would lead to
$W_r(F(\cdot; G_n), F(\cdot; G^*)) \not\to 0,$
which violates the theorem condition. If $\bar{G}$ is a proper distribution in $\mathbb{G}_K$, the theorem condition forces
$F(x; \bar{G}) = F(x; G^*)$ for all $x$;
then by the identifiability condition, we have $\bar{G} = G^*$. This implies $G_n \to G^*$ and completes the proof. ∎
The multivariate normal mixture is another type of location-scale mixture. The above consistency result for the MWDE can be easily extended to finite multivariate normal mixtures.
Theorem 2.2.
Consider the problem when $x_1, \ldots, x_n$ are IID observations from a finite multivariate normal mixture distribution of order $K$, and let $\hat{G}_n$ be the minimum Wasserstein distance estimator defined by (2). Let the true mixing distribution be $G^*$. The MWDE is consistent: $\hat{G}_n \to G^*$ as $n \to \infty$ almost surely.
The rigorous proof is long though the conclusion is obvious. We offer a less formal proof based on several well-known probability theory results:

(I) A multivariate random variable sequence $X_n$ converges in distribution to $X$ if and only if $a^\top X_n$ converges in distribution to $a^\top X$
for any unit vector
$a$;

(II) $X$ is multivariate normal if and only if $a^\top X$ is normal for all unit vectors $a$;

(III) The normal distribution has finite moments of any order.
Let $X_n$ be a random vector with distribution $F(\cdot; G_n)$ for some $G_n$, $n = 1, 2, \ldots$, in a general mixture model setting. Suppose that, as $n \to \infty$, with the notation we introduced previously,
$W_r(F(\cdot; G_n), F(\cdot; G^*)) \to 0.$
Then for any unit vector $a$, based on Property 2 of the Wasserstein distance and the result (I), we can see that $a^\top X_n$ converges in distribution to the corresponding projection under $G^*$.

Next, we apply this result to the normal mixture so that $F(\cdot; G_n)$ becomes $\Phi(\cdot; G_n)$, which stands for a finite multivariate normal mixture with mixing distribution $G_n$. In this case, $X_n$ is a random vector with distribution $\Phi(\cdot; G_n)$. Let $(\mu_k, \Sigma_k)$ be generic subpopulation parameters. We can see that the distribution of $a^\top X_n$ is a finite normal mixture with subpopulation parameters $(a^\top \mu_k, a^\top \Sigma_k a)$ and mixing weights the same as those of $G_n$. Let the mixing distributions after projection be $G_n^a$ and $G^{*a}$.
2.4 Numerical Solution to MWDE
Both in applications and in simulation experiments, we need an effective way to compute the MWDE. We develop an algorithm that leverages the explicit form of the Wasserstein distance between two measures on $\mathbb{R}$ for the numerical solution to the MWDE. The strategy works for any order of the Wasserstein distance, but we only provide specifics for $W_2$ as it is the most widely used.
Let $Z$ be a random variable with density function $f(\cdot)$. Denote the mean and variance of $Z$
by $\bar{\mu}$ and $\bar{\sigma}^2$. Recall that $d(x, y) = |x - y|$ and $r = 2$. Let $x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(n)}$ be the order statistics and $F^{-1}(t; G)$ be the $t$th quantile of the mixture for $t \in (0, 1)$. Since the empirical quantile function is piecewise constant with $F_n^{-1}(t) = x_{(i)}$ for $t \in ((i-1)/n, i/n]$, we can expand the squared distance between the empirical distribution and $F(\cdot; G)$ as follows:
$W_2^2(F_n, F(\cdot; G)) = \sum_{i=1}^{n} \int_{(i-1)/n}^{i/n} \{x_{(i)} - F^{-1}(t; G)\}^2 \, dt.$
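This squared-distance objective can be approximated numerically from the quantile coupling. The sketch below is our own discretization on a midpoint grid, assuming a quantile function for the model is available as a callable; it is not the paper's exact computational scheme:

```python
import numpy as np

def w2_objective(x, quantile_fn, grid=200):
    """Approximate squared W2 distance between the empirical distribution
    of x and a model distribution given by its quantile function.

    Uses the quantile-coupling formula with a midpoint rule on each of the
    n intervals ((i-1)/n, i/n], where the empirical quantile is constant.
    """
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)
    t = (np.arange(n * grid) + 0.5) / (n * grid)   # midpoints in (0, 1)
    emp_q = np.repeat(xs, grid)                    # piecewise-constant empirical quantile
    mod_q = np.array([quantile_fn(ti) for ti in t])
    return float(np.mean((emp_q - mod_q) ** 2))
```

Increasing `grid` refines the approximation of the integral over each interval.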
The MWDE minimizes $W_2^2(F_n, F(\cdot; G))$ with respect to $G$. The mixing weights and subpopulation scale parameters in this optimization problem have natural constraints: $w_k \geq 0$, $\sum_k w_k = 1$, and $\sigma_k > 0$. We may replace the optimization problem with an unconstrained one by a parameter transformation, for instance a log-ratio transformation of the mixing weights and a logarithmic transformation of the scale parameters, for $k = 1, \ldots, K$. We may then minimize the objective with respect to the transformed parameters over an unconstrained space. Furthermore, we adopt the quasi-Newton BFGS algorithm (Nocedal and Wright, 2006, Section 6.1). To use this algorithm, it is best to provide the gradients of the objective function, which are given as follows:
for $k = 1, \ldots, K$.

Since the objective function is non-convex, the algorithm may find a local minimum instead of the global minimum as required for the MWDE. We use multiple initial values for the BFGS algorithm and regard the solution with the lowest objective value as the MWDE. We leave the algebraic details to the Appendix.
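One way to realize the unconstrained reparameterization of the weights and scales is sketched below. The softmax-style weight map and log scale map are a hypothetical choice of ours, one of several transformations that remove the constraints; they are not necessarily the exact maps used in the paper:

```python
import numpy as np

def from_unconstrained(u, v):
    """Map unconstrained vectors (u, v) to valid mixture parameters.

    u: length-K real vector; softmax yields weights on the simplex.
    v: length-K real vector; exponentiation yields positive scales.
    """
    e = np.exp(u - np.max(u))      # subtract max for numerical stability
    w = e / e.sum()
    sigma = np.exp(v)
    return w, sigma

def to_unconstrained(w, sigma):
    """A right inverse (u is only determined up to an additive constant)."""
    return np.log(w), np.log(sigma)
```

With this map, BFGS can run on $(u, v)$ and the location parameters jointly without any constraint handling.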
This algorithm involves computing the quantiles $F^{-1}(t; G)$ repeatedly, which may lead to a high computational cost. Since $F(x; G)$ is continuous and increasing in $x$, the quantile can be found efficiently via a bisection method. Fortunately, the required quantities have simple analytical forms under two widely used location-scale mixtures, which make the computation efficient:

When $f(x) = (2\pi)^{-1/2} \exp(-x^2/2)$, which is the density function of the standard normal, the relevant quantities can be expressed in closed form through the standard normal density and CDF.

For a finite mixture of location-scale logistic distributions, we have the standard density
$f(x) = e^{-x}/(1 + e^{-x})^2$
and the corresponding closed-form expressions. (3)
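The bisection step for the mixture quantile can be sketched as follows for the normal case (a minimal illustration assuming standard normal subpopulation densities; the bracket of ten scales around each location is our own heuristic for an initial search interval):

```python
import math

def mixture_cdf(x, w, mu, sigma):
    """CDF of a finite normal mixture with weights w, locations mu, scales sigma."""
    return sum(wk * 0.5 * (1.0 + math.erf((x - mk) / (sk * math.sqrt(2.0))))
               for wk, mk, sk in zip(w, mu, sigma))

def mixture_quantile(t, w, mu, sigma, tol=1e-10):
    """Solve F(x; G) = t by bisection; F is continuous and increasing."""
    lo = min(m - 10.0 * s for m, s in zip(mu, sigma))
    hi = max(m + 10.0 * s for m, s in zip(mu, sigma))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, w, mu, sigma) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Each bisection call costs a number of CDF evaluations logarithmic in the required precision.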
2.5 Penalized Maximum Likelihood Estimator
A well-investigated inference method under finite mixtures of location-scale families is the pMLE (Tanaka, 2009; Chen et al., 2008). Chen et al. (2008) consider this approach for finite normal mixture models. They recommend the penalized log-likelihood function
$p\ell_n(G) = \ell_n(G) + p_n(G)$
for some positive penalty size $a_n$ and sample variance $S_n^2$, where the log-likelihood function $\ell_n(G)$ is given in (1). They suggest learning the mixing distribution via the pMLE defined as
$\hat{G}_n = \arg\max_{G \in \mathbb{G}_K} p\ell_n(G).$
The size of $a_n$ controls the strength of the penalty, and a recommended value is given in Chen et al. (2008). Regularizing the likelihood function via a penalty function fixes the problem caused by degenerate subpopulations (i.e., some $\sigma_k \to 0$). The pMLE is shown to be strongly consistent when the number of components has a known upper bound under the finite normal mixture model.
The penalized likelihood approach can be easily extended to finite mixtures of location-scale families. Let $f(\cdot)$ be the density function of the location-scale family as before. We may replace the sample variance $S_n^2$
in the penalty function by any scale-invariant statistic, such as the sample interquartile range. This is applicable even if the variance of $f(\cdot)$
is not finite.

We can use the EM algorithm for numerical computation. Let $z_i = (z_{i1}, \ldots, z_{iK})^\top$ be the membership vector of the $i$th observation. That is, the $k$th entry of $z_i$ is 1 when the response value is an observation from the $k$th subpopulation and 0 otherwise. When the complete data are available, the penalized complete-data log-likelihood function of $G$ is given by
$p\ell_n^c(G) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \{\log w_k + \log f(x_i; \theta_k)\} + p_n(G).$
Given the observed data and a proposed mixing distribution $G^{(t)}$, we have the conditional expectation
$w_{ik}^{(t)} = \mathbb{E}[z_{ik} \mid x_i; G^{(t)}] = \frac{w_k^{(t)} f(x_i; \theta_k^{(t)})}{\sum_{j=1}^{K} w_j^{(t)} f(x_i; \theta_j^{(t)})}.$
After this E-step, we define
$Q(G; G^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik}^{(t)} \{\log w_k + \log f(x_i; \theta_k)\} + p_n(G).$
Note that the subpopulation parameters are well separated in $Q(G; G^{(t)})$. The M-step is to maximize $Q(G; G^{(t)})$ with respect to $G$. The solution is given by the mixing distribution with mixing weights
$w_k^{(t+1)} = n^{-1} \sum_{i=1}^{n} w_{ik}^{(t)}$
and the subpopulation parameters
$(\mu_k^{(t+1)}, \sigma_k^{(t+1)}) = \arg\max_{\mu, \sigma} \Big\{ \sum_{i=1}^{n} w_{ik}^{(t)} \log\{\sigma^{-1} f((x_i - \mu)/\sigma)\} + p_n(\sigma) \Big\}, \qquad (4)$
with the notational convention that $p_n(\sigma)$ stands for the penalty contribution of a single subpopulation scale.

For a general location-scale mixture, the M-step (4) may not have a closed-form solution, but it is merely a simple two-variable optimization. There are many effective algorithms in the literature to solve this problem. The EM algorithm for the pMLE increases the value of the penalized likelihood after each iteration. Hence, it converges as long as the penalized likelihood function has an upper bound. We do not give a proof as it is a standard result.
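In the normal-mixture case, the M-step does have a closed form. The sketch below implements one EM iteration under the assumption that the penalty takes the form $a_n(S_n^2/\sigma_k^2 + \log \sigma_k^2)$ per component, which is our reading of the penalty in Chen et al. (2008); with that assumption, the scale update is available in closed form:

```python
import numpy as np

def em_step(x, w, mu, sigma, a_n, s2):
    """One EM iteration for the penalized normal-mixture likelihood.

    x: 1-D data array; w, mu, sigma: current length-K parameter arrays.
    a_n: penalty size; s2: sample variance used in the penalty.
    """
    x = np.asarray(x, dtype=float)[:, None]
    # E-step: posterior membership probabilities under normal subpopulations
    dens = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)
    r = w * dens
    r /= r.sum(axis=1, keepdims=True)
    nk = r.sum(axis=0)
    # M-step: weights and locations take their usual MLE-style forms
    w_new = nk / len(x)
    mu_new = (r * x).sum(axis=0) / nk
    # Penalized scale update: the assumed penalty adds 2*a_n*s2 and 2*a_n
    ss = (r * (x - mu_new) ** 2).sum(axis=0)
    sigma_new = np.sqrt((2.0 * a_n * s2 + ss) / (2.0 * a_n + nk))
    return w_new, mu_new, sigma_new
```

Because `2*a_n*s2` is added to the within-component sum of squares, the updated scales are bounded away from zero, which is exactly how the penalty prevents degenerate subpopulations.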
3 Experiments
We now study the performance of the MWDE and pMLE under finite location-scale mixtures. We explore the potential advantages of the MWDE and quantify its efficiency loss, if any, by simulation experiments. Consider the following three location-scale families (Chen et al., 2020):

Normal distribution: $f(x) = (2\pi)^{-1/2} \exp(-x^2/2)$. The mean and variance of the subpopulation with parameter $(\mu, \sigma)$ are given by $\mu$ and $\sigma^2$.

Logistic distribution: $f(x) = e^{-x}/(1 + e^{-x})^2$. Its mean and variance are given by $\mu$ and $\pi^2 \sigma^2 / 3$.

Gumbel distribution (type I extreme-value distribution): $f(x) = \exp\{-x - e^{-x}\}$. Its mean and variance are given by $\mu + \gamma \sigma$ and $\pi^2 \sigma^2 / 6$, where $\gamma \approx 0.5772$ is the Euler constant.
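These moment formulas can be collected in a small helper (a sketch of ours; the string labels for the families are our own naming, not the paper's):

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler constant

def moments(family, mu, sigma):
    """Mean and variance of a location-scale subpopulation from the three
    families used in the simulation study."""
    if family == "normal":
        return mu, sigma ** 2
    if family == "logistic":
        return mu, (math.pi ** 2 / 3.0) * sigma ** 2
    if family == "gumbel":
        return mu + EULER_GAMMA * sigma, (math.pi ** 2 / 6.0) * sigma ** 2
    raise ValueError(f"unknown family: {family}")
```

Such a helper is convenient for setting simulation parameters so that the three families are matched on their first two moments.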
We will also include a real-data example to compare the image segmentation results of the MWDE and pMLE.
3.1 Performance Measure
For vector-valued parameters, the commonly used performance metric of their estimators is the mean squared error (MSE). A mixing distribution with a finite and fixed number of support points can be regarded as a real-valued vector in theory. Yet the mean squared errors of the mixing weights, the subpopulation means, and the subpopulation scales are not comparable in terms of the learned finite mixture. In this study, we use two performance metrics specific to finite mixture models. Let $\hat{G}$ and $G^*$ be the learned mixing distribution and the true mixing distribution. We use the $L_2$ distance between the learned mixture and the true mixture as the first performance metric. The $L_2$ distance between two mixtures $f(\cdot; G_1)$ and $f(\cdot; G_2)$ is defined to be
$\|f(\cdot; G_1) - f(\cdot; G_2)\|_2^2 = \mathbf{w}_1^\top A \mathbf{w}_1 - 2\, \mathbf{w}_1^\top B \mathbf{w}_2 + \mathbf{w}_2^\top C \mathbf{w}_2,$
where $A$, $B$, and $C$ are three square matrices of size $K \times K$ with their $(j, k)$th elements given by the pairwise integrals of the corresponding subpopulation density functions.
Given an observed value $x$
of a unit from the true mixture population, by Bayes' theorem, the most probable membership of this unit is given by
$k^*(x) = \arg\max_k \{w_k^* f(x; \theta_k^*)\}.$
Following the same rule, if $\hat{G}$ is the learned mixing distribution, then the most likely membership of the unit with observed value $x$ is
$\hat{k}(x) = \arg\max_k \{\hat{w}_k f(x; \hat{\theta}_k)\}.$
We cannot directly compare $k^*(x)$ and $\hat{k}(x)$ because the subpopulations themselves are not labeled. Instead, the adjusted Rand index (ARI) is a good performance metric for clustering accuracy. Suppose the observations in a dataset are divided into clusters $U_1, \ldots, U_S$ by one approach, and clusters $V_1, \ldots, V_T$ by another. Let $n_{st} = |U_s \cap V_t|$, $a_s = |U_s|$, and $b_t = |V_t|$ for $s = 1, \ldots, S$ and $t = 1, \ldots, T$, where $|\cdot|$ is the number of units in a set. The ARI between these two clustering outcomes is defined to be
$\mathrm{ARI} = \frac{\sum_{s,t} \binom{n_{st}}{2} - \big[\sum_s \binom{a_s}{2} \sum_t \binom{b_t}{2}\big]/\binom{n}{2}}{\tfrac{1}{2}\big[\sum_s \binom{a_s}{2} + \sum_t \binom{b_t}{2}\big] - \big[\sum_s \binom{a_s}{2} \sum_t \binom{b_t}{2}\big]/\binom{n}{2}}.$
When the two clustering approaches completely agree with each other, the ARI value is 1. When data are assigned to clusters randomly, the expected ARI value is 0. ARI values close to 1 indicate a high degree of agreement. We compute the ARI based on the clusters formed by $k^*(\cdot)$ and $\hat{k}(\cdot)$.
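The ARI translates directly into code via the contingency table of the two labelings (a self-contained sketch; the degenerate case where both partitions are trivial is defined to give 1 here):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_a, labels_b):
    """ARI from the contingency table of two labelings of the same n units."""
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    a = Counter(labels_a)
    b = Counter(labels_b)
    sum_ab = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:  # both partitions trivial; define ARI as 1
        return 1.0
    return (sum_ab - expected) / (max_index - expected)
```

Note that the ARI is invariant to relabeling the clusters, which is precisely why it is suitable when the subpopulations carry no fixed labels.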
For each simulation setting, we choose or generate a true mixing distribution $G^*$, then generate a random sample from the mixture $f(x; G^*)$. This is repeated $R$ times. Let $\hat{G}^{(j)}$ be the mixing distribution learned from the $j$th data set. We obtain the two performance metrics as follows:

Mean $L_2$ distance: $\mathrm{ML2} = R^{-1} \sum_{j=1}^{R} \|f(\cdot; \hat{G}^{(j)}) - f(\cdot; G^*)\|_2$;

Mean adjusted Rand index: $\mathrm{MARI} = R^{-1} \sum_{j=1}^{R} \mathrm{ARI}_j$.

The lower the ML2 and the higher the MARI, the better the estimator performs.
3.2 Performance under Homogeneous Model
The homogeneous location-scale model is a special mixture model with a single subpopulation ($K = 1$). Both the MWDE and MLE are applicable for parameter estimation. There have been no studies of the MWDE in this special case in the literature. It is therefore of interest to see how the MWDE performs under this model.
Under the three location-scale models given earlier, the MWDE has closed analytical forms. Using the same notation introduced previously, their analytical forms are as follows.

Normal distribution:

Gumbel distribution:
where