I Introduction
Finding significant groups in a set of data points is a central problem in many fields. Consequently, clustering has received a lot of attention, and many methods, algorithms and software packages are available today. Among these techniques, parametric finite mixture models play a paramount role, due to their interesting mathematical properties as well as to the existence of maximum likelihood estimators based on expectationmaximization (EM) algorithms. While the finite Gaussian mixture (GMM) [1] is the model of choice, it is extremely sensitive to the presence of outliers. Alternative robust models have been proposed in the statistical literature, such as mixtures of tdistributions [2] and their numerous variants, e.g. [3, 4, 5, 6, 7, 8]. In contrast to the Gaussian case, no closedform solution exists for the tdistribution and tractability is maintained via the use of EM and a Gaussian scale mixture representation, where
is an observed vector,
is the multivariate Gaussian distribution with mean
and covariance , and is the gamma distribution of a univariate positive variable parameterized by . [mixture]1In the case of mixtures of tdistributions, with mixing coefficients , , a latent variable can also be introduced. Its distribution is a mixture of gamma distributions that accounts for the componentdependent [2]. Clustering is then usually performed associating a positive variable , distributed as , with each observed point . The distributions of both and do not depend on . The observed data are drawn from i.i.d. variables, distributed according to the tmixture, or one of its variants [2, 3, 4, 5, 6, 7, 8].[notiid]1 In this paper we propose a finite mixture model in which variable is used as a weight to account for the reliability of the observed and this independently on its assigned cluster. The distribution of is not a gamma mixture anymore but has to depend on to allow each data point to be potentially treated differently. In contrast to mixtures of tdistributions, it follows that the observed data are independent but not
identically distributed. We introduce the weighteddata Gaussian mixture model (WDGMM). We distinguish two cases, namely (i) the weights are known a priori and hence they are fixed, and (ii) the weights are modeled as variables and hence they are iteratively updated, given initial estimates. We show that in the case of fixed weights, the GMM parameters can be estimated via an extension of the standard EM which will be referred to as the
fixed weighteddata EM algorithm (FWDEM). Then we consider the more general case of weights that are treated as random variables. We model these variables with gamma distributions (one distribution for each variable) and we formally derive a closedform EM algorithm which will be referred to as the weighteddataEM algorithm (WDEM). While the Mstep of the latter is similar to the Mstep of FWDEM, the Estep is considerably different as both the posterior probabilities (responsibilities) and the parameters of the posterior gamma distributions (the weights) are updated (EZstep and EWstep). [pearson]1 The responsibilities are computed using the Pearson type VII distribution (the reader is referred to
[5] for a recent discussion regarding this distribution), also called the ArellanoValle and Bolfarine generalized tdistribution [9], and the parameters of the posterior gamma distributions are computed from the prior gamma parameters and from the Mahalanobis distance between the data and the mixture means. Note that the weights play a different role than the responsibilities. Unlike the responsibilities, which are probabilities, the weights are random variables that can take arbitrary positive values. Their posterior means can be used as an absolute measure of the relevance of the data. Typically, an outlying data point which is far from any cluster center will have a small weight while it may still be assigned with a significant responsibility value to the closest cluster. Responsibilities indicate which cluster center is the closest but not if any of them is close at all.The idea of weighteddata clustering has already been proposed in the framework of nonparametric clustering methods such as
means and spectral clustering, e.g.
[10, 11, 12, 13]. These methods generally propose to incorporate prior information in the clustering process in order to prohibit atypical data (outliers) to contaminate the clusters. The idea of modeling data weights as random variables and to estimate them via EM was proposed in [14] in the particular framework of Markovian brain image segmentation. In [14] it is shown that specific expert knowledge is not needed and that the dataweight distribution guide the model towards a satisfactory segmentation. A variational EM is proposed in [14] as their formulation has no closed form. In this paper we build on the idea that, instead of relying on prior information about atypical data, e.g. [10, 11, 12, 13], we devise a novel EM algorithm that updates the weight distributions. [robustexplanation]1The proposed method belongs to the robust clustering category of mixture models because observed data that are far away from the cluster centers have little influence on the estimation of the means and covariances.[modelselectionintro]1An important feature of mixture based clustering methods is to perform model selection on the premise that the number of components in the mixture corresponds to the number of clusters in the data. Traditionally, model selection is performed by obtaining a set of candidate models for a range of values of
(assuming that the true value is in this range). The number of components is selected by minimizing a model selection criteria, such as the Bayesian inference criterion (BIC), minimum message length (MML), Akaike’s information criteria (AIC) to cite just a few
[1, 15]. The disadvantage of these methods is twofold. Firstly, a whole set of candidates has to be obtained and problems associated with running EM many times may emerge. Secondly, they provide a number of components that optimally approximate the density and not the true number of clusters present in the data. More recently, there seems to be a consensus among mixture model practitioners that a wellfounded and computationally efficient model selection strategy is to start with a large number of components and to merge them [16]. [15] proposes a practical algorithm that starts with a very large number of components (thus making the algorithm robust to initialization), iteratively annihilates components, redistributes the observations to the other components, and terminates based on the MML criterion. [17] starts with an overestimated number of components using BIC, and then merges them hierarchically according to an entropy criterion. More recently [18] proposes a similar method that merges components based on measuring their pairwise overlap.Another trend in handling the issue of finding the proper number of components is to consider Bayesian nonparametric mixture models. This allows the implementation of mixture models with an infinite number of components via the use of Dirichlet process mixture models. In [19, 20]
an infinite Gaussian mixture (IGMM) is presented with a computationally intensive Markov Chain Monte Carlo implementation. [igmm]3 At first glance, IGMM may appear similar to FWDEM. However, these two algorithms are quite different. While IGMM is fully Bayesian the proposed FWDEM is not, in the sense that no priors are assumed on the parameters, typically the means and covariance matrices. IGMM implies Student predictive distributions while FWDEM involves only Gaussian distributions.
More recently, more flexibility in the cluster shapes has been allowed by considering infinite mixture of infinite Gaussian mixtures (IGMM) [21]. The flexibility is however limited to a cluster composed of subclusters of identical shapes and orientations, which may alter the performance of this approach. Altogether, IGMM and IGMM are not designed to handle outliers, as illustrated in Section VIII, Figs. 18f and 18g. Infinite Student mixture models have also been considered [22], but inference requires a variational Bayes approximation which generates additional computational complexity.
Bayesian nonparametrics, although promising techniques, require a fully Bayesian setting. The latter, however, induces additional complexity for handling priors and hyperpriors, especially in a multivariate context. In contrast, our latent variable approach allows exact inference. With respect to model selection, we therefore propose to extend the method of [15] to weighteddata mixtures. We formally derive an MML criterion for the weighteddata mixture model and we plug this criterion into an efficient algorithm which, starting with a large number of components, simultaneously estimates the model parameters, the posterior probabilities of the weights and the optimal number of components.
[avintro]1 We also propose to apply the proposed weighteddata robust clustering method to the problem of fusing auditory and visual information. This problem arises when the task is, e.g. to detect a person that is both seen and heard, such as an active speaker. Singlemodality signals – visiononly or audioonly – are often either weak or ambiguous, and it may be useful to combine information from different sensors, e.g. cameras and microphones. There are several difficulties associated with audiovisual fusion from a data clustering perspective: the two sensorial modalities (i) live in different spaces, (ii) are contaminated by different types of noise with different distributions, (iii) have different spatiotemporal distributions, and (iv) are perturbed by different physical phenomena, e.g. acoustic reverberations, lighting conditions, etc. For example, a speaker may face the camera while he/she is silent and may emit speech while he/she turns his/her face away from the camera. Speech signals have sparse spectrotemporal structure and they are mixed with other sound sources, such as music or background noise. Speaker faces may be totally or partially occluded, in which case face detection and localization is extremely unreliable. We show that the proposed method is well suited to find audiovisual clusters and to discriminate between speaking and silent people.
The remainder of this paper is organized as follows. Section II outlines the weighteddata mixture model; Section III sketches the FWDEM algorithm. Weights modeled with random variables are introduced in Section IV and the WDEM is described in detail in Section V. Section VI details how to deal with an unknown number of clusters and Section VII addresses the issue of algorithm initialization. In Section VIII the proposed algorithms are tested and compared with several other parametric and nonparametric clustering methods. Section IX addresses clustering of audiovisual data. Section X concludes the paper. Additional results and videos are available online.^{1}^{1}1https://team.inria.fr/perception/research/wdgmm/
Ii Gaussian Mixture with Weighted Data
In this Section, we present the intuition and the formal definition of the proposed weighteddata model. Let be a random vector following a multivariate Gaussian distribution with mean and covariance , namely , with the notation . Let be a weight indicating the relevance of the observation . Intuitively, higher the weight , stronger the impact of . The weight can therefore be incorporated into the model by “observing times”. In terms of the likelihood function, this is equivalent to raise to the power , i.e.
. However, the latter is not a probability distribution since it does not integrate to one. It is straightforward to notice that
. [precision]1Therefore, plays the role of the precision and is different for each datum . Subsequently, we write:(1) 
from which we derive a mixture model with components:
(2) 
where are the mixture parameters, are the mixture coefficients satisfying and , are the parameters of the th component and is the number of components. We will refer to the model in (2) as the weighteddata Gaussian mixture model (WDGMM). Let be the observed data and be the weights associated with . We assume each is independently drawn from (2) with . The observeddata loglikelihood is:
(3) 
It is well known that direct maximization of the loglikelihood function is problematic in case of mixtures and that the expected completedata loglikelihood must be considered instead. Hence, we introduce a set of hidden (assignment) variables associated with the observed variables and such that , if and only if is generated by the th component of the mixture. In the following we first consider a fixed (given) number of mixture components , we then extend the model to an unknown , thus estimating the number of components from the data.
Iii EM with Fixed Weights
The simplest case is when the weight values are provided at algorithm initialization, either using some prior knowledge or estimated from the observations (e.g. Section VII), and are then kept fixed while alternating between the expectation and maximization steps. In this case, the expected completedata loglikelihood is:
(4) 
where denotes the expectation with respect to the distribution . The th EM iteration consists of two steps namely, the evaluation of the posterior distribution given the current model parameters and the weights (Estep), and the maximization of (4) with respect to (Mstep):
(5) 
It is straightforward to show that this yields the following FWDEM algorithm:
Iiia The EStep
IiiB The MStep
Expanding (4) we get:
(7) 
where denotes equality up to a constant that does not depend on . By canceling out the derivatives with respect to the model parameters, we obtain the following update formulae for the mixture proportions, means, and covariances matrices:
(8) 
(9) 
(10) 
Iv Modeling the Weights
As we already remarked, the weights play the role of precisions. The notable difference between standard finite mixture models and the proposed model is that there is a different weight , hence a different precision, associated with each observation . Within a Bayesian formalism, the weights may be treated as random variables, rather than being fixed in advance, as in the previous case. Since (1) is a Gaussian, a convenient choice for the prior on ,
is the conjugate prior of the precision with known mean, i.e. a gamma distribution. This ensures that the weight posteriors are gamma distributions as well. Summarizing we have:
(11) 
where is the gamma distribution, is the gamma function, and are the parameters of the prior distribution of
. The mean and variance of the random variable
are given by:(12)  
(13) 
V EM with Random Weights
In this section we derive the WDEM algorithm associated to a model in which the weights are treated as random variables following (11). The gamma distribution of each is assumed to be parameterized by . Within this framework, the expectation of the completedata loglikelihood is computed over both the assignment and weight variables:
(14) 
where we used the notation . We notice that the posterior distribution factorizes on :
and each one of these factors can be be decomposed as:
(15) 
where the two quantities on the righthand side of this equation have closedform expressions. The computation of each one of these two expressions leads to two sequential steps, the EWstep and the EZstep, of the expectation step of the proposed algorithm.
Va The EZ Step
The marginal posterior distribution of is obtained by integrating (15) over . As previously, we denote the responsibilities with . The integration computes:
(16) 
where denotes the Pearson type VII probability distribution function, which can be seen as a generalization of the tdistribution:
(17) 
VB The EW Step
The posterior distribution of , namely is a gamma distribution, because it is the conjugate prior of the precision of the Gaussian distribution. Therefore, we only need to compute the parameters of the posterior gamma distribution:
(18) 
where the parameters of the posterior gamma distribution are evaluated with:
(19)  
(20) 
The conditional mean of , namely , can then be evaluated with:
(21) 
[weffect]1 While estimating the weights themselves is not needed by the algorithm, it is useful to evaluate them in order to fully characterize the observations and to discriminate between inliers and outliers. First notice that the marginal posterior distribution of is a mixture of gamma distributions:
(22) 
and therefore the posterior mean of is evaluated with:
(23) 
By inspection of (19), (20), and (21) it is easily seen that the value of decreases as the distance between the cluster centers and observation increases. Importantly, the evaluation of enables outlier detection. Indeed, an outlier is expected to be far from all the clusters, and therefore all will be small, leading to a small value of . It is worth noticing that this is not possible using only the responsibilities , since they are normalized by definition, and therefore their value is not an absolute measure of the datum’s relevance, but only a relative measure of it.
VC The Maximization Step
This step maximizes the expected completedata loglikelihood over the mixture parameters. By expanding (14), we have:
(24) 
The parameter updates are obtained from canceling out the derivatives of the expected completedata loglikelihood (24). As with standard Gaussian mixtures, all the updates are closedform expressions:
(25) 
(26) 
(27) 
It is worth noticing that the Mstep of the WDEM algorithm is very similar to the Mstep of the FWDEM algorithm (section III). Indeed, the above iterative formulas, (25), (26), (27) are identical to the formulas (8), (9), (10), except that the fixed weights are here replaced with the posterior means of the random weights, .
Vi Estimating the Number of Components
So far it has been assumed that the number of mixture components is provided in advance. This assumption is unrealistic for most realworld applications. In this Section we propose to extend the method and algorithm proposed in [15] to the weighteddata clustering model. An interesting feature of this model selection method is that it does not require parameter estimation for many different values of , as it would be the case with the Bayesian information criterion (BIC) [23]. Instead, the algorithm starts with a large number of components and iteratively deletes components as they become irrelevant. Starting with a large number of components has the additional advantage of making the algorithm robust to initialization. Formally, the parameter estimation problem is cast into a transmission encoding problem and the criterion is to minimize the expected length of the message to be transmitted:
(28) 
In this context, the observations and the parameters have to be quantized to finite precision before the transmission. This quantization sets a trade off between the two terms of the previous equation. Indeed, when truncating to high precision, may be long, but will be short, since the parameters fit well the data. Conversely, if the quantization is coarse, may be short, but will be long. The optimal quantization step can be found by means of the Taylor approximation [15]. In that case, the optimization problem corresponding to the minimum message length (MML) criterion, is:
(29) 
where is the expected Fisher information matrix (FIM) and denotes the dimensionality of the model, namely the dimension of the parameter vector . Since the minimization (29) does not depend on the weight parameters, will be omitted for simplicity.
In our particular case, as in the general case of mixtures, the Fisher information matrix cannot be obtained analytically. Indeed, the direct optimization of the loglikelihood does not lead to closedform solutions. Nevertheless, it was noticed that the complete FIM upper bounds the FIM [15], and that the expected completedata loglikelihood lower bounds the loglikelihood. This allows us to write the following equivalent optimization problem:
(30) 
where denotes the expected completeFIM and is evaluated with (24).
As already mentioned, because there is a different weight for each observation , the observed data are not identically distributed and our model cannot be considered a classical mixture model. For this reason, the algorithm proposed in [15] cannot be applied directly to our model. Indeed, in the proposed WDGMM setting, the completeFIM is:
(31) 
where is the Fisher information matrix for the th observation with respect to the parameter vector (mean and the covariance) of the th component, is defined in (17), and is the Fisher information matrix of the multinomial distribution, namely the diagonal matrix . We can evaluate from (31):
(32) 
where denotes the number of free parameters of each component. For example, when using diagonal covariance matrices or when using full covariance matrices.
Importantly, one of the main advantages of the methodology proposed in [15] is that one has complete freedom to choose a prior distribution on the parameters, . In our case, inspired by (32), we select the following prior distributions for the parameters:
(33)  
(34) 
By substitution of (32)–(34) into (30) we obtain the following optimization problem:
(35) 
where we used .
One may notice that (35) does not make sense (diverges) if any of the ’s is allowed to be null. However, in the current length coding framework, there is no point in transmitting the parameters of an empty component. Therefore, we only focus on the nonempty components, namely those components for which . Let denote the index set of nonempty components and let be its cardinality. We can rewrite (35) as:
(36) 
The above minimization problem can be solved by modifying the EM algorithm described in Section V (notice that there is an equivalent derivation for the fixedweigth EM algorithm described in Section III). Indeed, we remark that the minimization (36) is equivalent to using a symmetric improper Dirichlet prior for the proportions with exponent . Moreover, since the optimization function for the parameters of the Gaussian components is the same (equivalently, we used a flat prior for the mean vector and covariance matrix), their estimation formulas (26) and (27) still hold. Therefore, we only need to modify the estimation of the mixture proportions, namely:
(37) 
The operator in (37) verifies whether the th component is supported by the data. When one of the components becomes too weak, i.e. the required minimum support cannot be obtained from the data, this component is annihilated. In other words, its parameters will not be estimated, since there is no need in transmitting them. One has to be careful in this context, since starting with a large value of may lead to several empty components. In order to avoid this singular situation, we adopt the componentwise EM procedure (CEM) [24], as proposed in [15] as well. Intuitively, we run both E and M steps for one component, before moving to the next component. More precisely, after running the EZ and EW steps for the component , its parameters are updated if , otherwise the component is annihilated if . The rationale behind this procedure is that, when a component is annihilated its probability mass is immediately redistributed among the remaining components. Summarizing, CEM updates the components one by one, whereas the classical EM simultaneously updates all the components.
The proposed algorithm is outlined in alg:cem_for_wdgmm. In practice, an upper and a lower number of components, and , are provided. Each iteration of the algorithm consists of componentwise E and M steps. If needed, some of the components are annihilated, and the parameters are updated accordingly, until the relative length difference is below a threshold, . In that case, if the message length, i.e. (36) is lower than the current optimum, the parameters, weights, and length are saved in , and respectively. In order to explore the full range of , the less populated component is artificially annihilated, and CEM is run again. [complexity]3The complexity of alg:cem_for_wdgmm is similar to the complexity of the algorithm in [15], with the exception of the EW step. However, the most computationally intensive part of this step (matrix inversion and matrixvector multiplications in (20)) is already achieved in the EZ step.
Vii Algorithm Initialization
[initnew]1 The EM algorithms proposed in Section III, Section V, and Section VI require proper initialization of both the weights (one for each observation and either a fixed value or parameters ) and of the model parameters. The means algorithm is used for an initial clustering, from which values for the model parameters are computed. In this section we concentrate onto the issue of weight initialization. An interesting feature of our method is that the only constraint on the weights is that they must be positive. Initial values may depend on expert or prior knowledge and may be experiment or goaldependent. This model flexibility allows the incorporation of such prior knowledge. In the absence of any prior information/knowledge, we propose a datadriven initialization scheme and make the assumption that densely sampled regions are more important that sparsely sampled ones. We note that a similar strategy could be used if one wants to reduce the importance of dense data and to give more importance to small groups of data or to sparse data.
We adopt a well known data similarity measure based on the Gaussian kernel, and it follows that the weight associated with the data point is evaluated with:
(38) 
where is the Euclidean distance, denotes the set containing the nearest neighbors of , and is a positive scalar. In all the experiments we used for the simulated datasets and for the real datasets. In both cases, we used . In the case of the FWDEM algorithm, the weights thus initialized remain unchanged. However, in the case of the WDEM algorithm, the weights are modeled as latent random variables drawn from a gamma distribution, hence one needs to set initial values for the parameters of this distribution, namely and in (11). Using (12) and (13) one can choose to initialize these parameters such as and , such that the mean and variance of the prior distribution are and respectively.
Viii Experimental Validation
Samples of the SIM dataset with no outliers (top row) and contaminated with 50% outliers (bottom row). The 600 inliers are generated from Gaussian mixtures while the 300 outliers are generated from a uniform distribution.
Data Set  

SIMEasy  
SIMUnbalanced  
SIMOverlapped  
SIMMixed  
MNIST [25]  
Wav [26]  
BCW [27]  
Letter Recognition [28] 
The proposed algorithms were tested and evaluated using eight datasets: four simulated datasets and four publicly available datasets that are widely used for benchmarking clustering methods. The main characteristics of these datasets are summarized in Table I. The simulated datasets (SIM) are designed to evaluate the robustness of the proposed method with respect to outliers. The simulated inliers are drawn from Gaussian mixtures while the simulated outliers are drawn from a uniform distribution, e.g. Fig. 9. The SIM datasets have different cluster configurations in terms of separability, shape and compactness. The eight datasets that we used are the following:

SIMEasy: Five clusters that are well separated and compact.

SIMUnbalanced: Four clusters of different size and density.

SIMOverlapped: Four clusters, two of them overlap.

SIMMixed: Six clusters of different size, compactness and shape.

MNIST contains instances of handwritten digit images normalized to the same size [25]. We preprocessed these data with PCA to reduce the dimension from 784 to 141, by keeping of the variance.

Wav is the Waveform Database Generator [26].

BCW refers to the Breast Cancer Wisconsin data set [27], in which each instance represents a digitized image of a fine needle aspirate (FNA) of breast mass.

[letter_recognition]1Letter Recognition contains singleletter images that were generated by randomly distorting the images of the 26 uppercase letters from 20 different commercial fonts [28]
. Each letter/image is described by 16 features. This dataset is available through the UCI machine learning repository.
Dataset  WDEM  FWDEM  GMM  GMM+U  FMuMST  IGMM  GMM  KMeans  KKMeans  Ncut  HAC 

MNIST  
WAV  
BCW  
Letter Recognition 
Data set  WDEM  FWDEM  GMM  GMM+U  FMuMST  IGMM  GMM  KMeans  KKMeans  Ncut  HAC 

MNIST  
WAV  
BCW 
Comments
There are no comments yet.