Finding significant groups in a set of data points is a central problem in many fields. Consequently, clustering has received a lot of attention, and many methods, algorithms and software packages are available today. Among these techniques, parametric finite mixture models play a paramount role, due to their interesting mathematical properties as well as to the existence of maximum likelihood estimators based on expectation-maximization (EM) algorithms. While the finite Gaussian mixture (GMM)  is the model of choice, it is extremely sensitive to the presence of outliers. Alternative robust models have been proposed in the statistical literature, such as mixtures of t-distributions  and their numerous variants, e.g. [3, 4, 5, 6, 7, 8]. In contrast to the Gaussian case, no closed-form solution exists for the t-distribution and tractability is maintained via the use of EM and a Gaussian scale mixture representation, where
is an observed vector,
is the multivariate Gaussian distribution with meanand covariance , and is the gamma distribution of a univariate positive variable parameterized by . [mixture]1In the case of mixtures of t-distributions, with mixing coefficients , , a latent variable can also be introduced. Its distribution is a mixture of gamma distributions that accounts for the component-dependent . Clustering is then usually performed associating a positive variable , distributed as , with each observed point . The distributions of both and do not depend on . The observed data are drawn from i.i.d. variables, distributed according to the t-mixture, or one of its variants [2, 3, 4, 5, 6, 7, 8].
[not-iid]1 In this paper we propose a finite mixture model in which variable is used as a weight to account for the reliability of the observed and this independently on its assigned cluster. The distribution of is not a gamma mixture anymore but has to depend on to allow each data point to be potentially treated differently. In contrast to mixtures of t-distributions, it follows that the observed data are independent but not
identically distributed. We introduce the weighted-data Gaussian mixture model (WD-GMM). We distinguish two cases, namely (i) the weights are known a priori and hence they are fixed, and (ii) the weights are modeled as variables and hence they are iteratively updated, given initial estimates. We show that in the case of fixed weights, the GMM parameters can be estimated via an extension of the standard EM which will be referred to as thefixed weighted-data EM algorithm (FWD-EM). Then we consider the more general case of weights that are treated as random variables. We model these variables with gamma distributions (one distribution for each variable) and we formally derive a closed-form EM algorithm which will be referred to as the weighted-data
EM algorithm (WD-EM). While the M-step of the latter is similar to the M-step of FWD-EM, the E-step is considerably different as both the posterior probabilities (responsibilities) and the parameters of the posterior gamma distributions (the weights) are updated (E-Z-step and E-W-step). [pearson]1 The responsibilities are computed using the Pearson type VII distribution (the reader is referred to for a recent discussion regarding this distribution), also called the Arellano-Valle and Bolfarine generalized t-distribution , and the parameters of the posterior gamma distributions are computed from the prior gamma parameters and from the Mahalanobis distance between the data and the mixture means. Note that the weights play a different role than the responsibilities. Unlike the responsibilities, which are probabilities, the weights are random variables that can take arbitrary positive values. Their posterior means can be used as an absolute measure of the relevance of the data. Typically, an outlying data point which is far from any cluster center will have a small weight while it may still be assigned with a significant responsibility value to the closest cluster. Responsibilities indicate which cluster center is the closest but not if any of them is close at all.
The idea of weighted-data clustering has already been proposed in the framework of non-parametric clustering methods such as
-means and spectral clustering, e.g.[10, 11, 12, 13]. These methods generally propose to incorporate prior information in the clustering process in order to prohibit atypical data (outliers) to contaminate the clusters. The idea of modeling data weights as random variables and to estimate them via EM was proposed in  in the particular framework of Markovian brain image segmentation. In  it is shown that specific expert knowledge is not needed and that the data-weight distribution guide the model towards a satisfactory segmentation. A variational EM is proposed in  as their formulation has no closed form. In this paper we build on the idea that, instead of relying on prior information about atypical data, e.g. [10, 11, 12, 13], we devise a novel EM algorithm that updates the weight distributions. [robust-explanation]1The proposed method belongs to the robust clustering category of mixture models because observed data that are far away from the cluster centers have little influence on the estimation of the means and covariances.
[model-selection-intro]1An important feature of mixture based clustering methods is to perform model selection on the premise that the number of components in the mixture corresponds to the number of clusters in the data. Traditionally, model selection is performed by obtaining a set of candidate models for a range of values of
(assuming that the true value is in this range). The number of components is selected by minimizing a model selection criteria, such as the Bayesian inference criterion (BIC), minimum message length (MML), Akaike’s information criteria (AIC) to cite just a few[1, 15]. The disadvantage of these methods is twofold. Firstly, a whole set of candidates has to be obtained and problems associated with running EM many times may emerge. Secondly, they provide a number of components that optimally approximate the density and not the true number of clusters present in the data. More recently, there seems to be a consensus among mixture model practitioners that a well-founded and computationally efficient model selection strategy is to start with a large number of components and to merge them .  proposes a practical algorithm that starts with a very large number of components (thus making the algorithm robust to initialization), iteratively annihilates components, redistributes the observations to the other components, and terminates based on the MML criterion.  starts with an overestimated number of components using BIC, and then merges them hierarchically according to an entropy criterion. More recently  proposes a similar method that merges components based on measuring their pair-wise overlap.
Another trend in handling the issue of finding the proper number of components is to consider Bayesian non-parametric mixture models. This allows the implementation of mixture models with an infinite number of components via the use of Dirichlet process mixture models. In [19, 20]
an infinite Gaussian mixture (IGMM) is presented with a computationally intensive Markov Chain Monte Carlo implementation. [igmm]3 At first glance, IGMM may appear similar to FWD-EM. However, these two algorithms are quite different. While IGMM is fully Bayesian the proposed FWD-EM is not, in the sense that no priors are assumed on the parameters, typically the means and covariance matrices. IGMM implies Student predictive distributions while FWD-EM involves only Gaussian distributions.
More recently, more flexibility in the cluster shapes has been allowed by considering infinite mixture of infinite Gaussian mixtures (IGMM) . The flexibility is however limited to a cluster composed of sub-clusters of identical shapes and orientations, which may alter the performance of this approach. Altogether, IGMM and IGMM are not designed to handle outliers, as illustrated in Section VIII, Figs. 18-f and 18-g. Infinite Student mixture models have also been considered , but inference requires a variational Bayes approximation which generates additional computational complexity.
Bayesian non-parametrics, although promising techniques, require a fully Bayesian setting. The latter, however, induces additional complexity for handling priors and hyper-priors, especially in a multi-variate context. In contrast, our latent variable approach allows exact inference. With respect to model selection, we therefore propose to extend the method of  to weighted-data mixtures. We formally derive an MML criterion for the weighted-data mixture model and we plug this criterion into an efficient algorithm which, starting with a large number of components, simultaneously estimates the model parameters, the posterior probabilities of the weights and the optimal number of components.
[av-intro]1 We also propose to apply the proposed weighted-data robust clustering method to the problem of fusing auditory and visual information. This problem arises when the task is, e.g. to detect a person that is both seen and heard, such as an active speaker. Single-modality signals – vision-only or audio-only – are often either weak or ambiguous, and it may be useful to combine information from different sensors, e.g. cameras and microphones. There are several difficulties associated with audio-visual fusion from a data clustering perspective: the two sensorial modalities (i) live in different spaces, (ii) are contaminated by different types of noise with different distributions, (iii) have different spatiotemporal distributions, and (iv) are perturbed by different physical phenomena, e.g. acoustic reverberations, lighting conditions, etc. For example, a speaker may face the camera while he/she is silent and may emit speech while he/she turns his/her face away from the camera. Speech signals have sparse spectro-temporal structure and they are mixed with other sound sources, such as music or background noise. Speaker faces may be totally or partially occluded, in which case face detection and localization is extremely unreliable. We show that the proposed method is well suited to find audio-visual clusters and to discriminate between speaking and silent people.
The remainder of this paper is organized as follows. Section II outlines the weighted-data mixture model; Section III sketches the FWD-EM algorithm. Weights modeled with random variables are introduced in Section IV and the WD-EM is described in detail in Section V. Section VI details how to deal with an unknown number of clusters and Section VII addresses the issue of algorithm initialization. In Section VIII the proposed algorithms are tested and compared with several other parametric and non-parametric clustering methods. Section IX addresses clustering of audio-visual data. Section X concludes the paper. Additional results and videos are available online.111https://team.inria.fr/perception/research/wdgmm/
Ii Gaussian Mixture with Weighted Data
In this Section, we present the intuition and the formal definition of the proposed weighted-data model. Let be a random vector following a multivariate Gaussian distribution with mean and covariance , namely , with the notation . Let be a weight indicating the relevance of the observation . Intuitively, higher the weight , stronger the impact of . The weight can therefore be incorporated into the model by “observing times”. In terms of the likelihood function, this is equivalent to raise to the power , i.e.
. However, the latter is not a probability distribution since it does not integrate to one. It is straightforward to notice that. [precision]1Therefore, plays the role of the precision and is different for each datum . Subsequently, we write:
from which we derive a mixture model with components:
where are the mixture parameters, are the mixture coefficients satisfying and , are the parameters of the -th component and is the number of components. We will refer to the model in (2) as the weighted-data Gaussian mixture model (WD-GMM). Let be the observed data and be the weights associated with . We assume each is independently drawn from (2) with . The observed-data log-likelihood is:
It is well known that direct maximization of the log-likelihood function is problematic in case of mixtures and that the expected complete-data log-likelihood must be considered instead. Hence, we introduce a set of hidden (assignment) variables associated with the observed variables and such that , if and only if is generated by the -th component of the mixture. In the following we first consider a fixed (given) number of mixture components , we then extend the model to an unknown , thus estimating the number of components from the data.
Iii EM with Fixed Weights
The simplest case is when the weight values are provided at algorithm initialization, either using some prior knowledge or estimated from the observations (e.g. Section VII), and are then kept fixed while alternating between the expectation and maximization steps. In this case, the expected complete-data log-likelihood is:
where denotes the expectation with respect to the distribution . The -th EM iteration consists of two steps namely, the evaluation of the posterior distribution given the current model parameters and the weights (E-step), and the maximization of (4) with respect to (M-step):
It is straightforward to show that this yields the following FWD-EM algorithm:
Iii-a The E-Step
Iii-B The M-Step
Expanding (4) we get:
where denotes equality up to a constant that does not depend on . By canceling out the derivatives with respect to the model parameters, we obtain the following update formulae for the mixture proportions, means, and covariances matrices:
Iv Modeling the Weights
As we already remarked, the weights play the role of precisions. The notable difference between standard finite mixture models and the proposed model is that there is a different weight , hence a different precision, associated with each observation . Within a Bayesian formalism, the weights may be treated as random variables, rather than being fixed in advance, as in the previous case. Since (1) is a Gaussian, a convenient choice for the prior on ,
is the conjugate prior of the precision with known mean, i.e. a gamma distribution. This ensures that the weight posteriors are gamma distributions as well. Summarizing we have:
where is the gamma distribution, is the gamma function, and are the parameters of the prior distribution of
. The mean and variance of the random variableare given by:
V EM with Random Weights
In this section we derive the WD-EM algorithm associated to a model in which the weights are treated as random variables following (11). The gamma distribution of each is assumed to be parameterized by . Within this framework, the expectation of the complete-data log-likelihood is computed over both the assignment and weight variables:
where we used the notation . We notice that the posterior distribution factorizes on :
and each one of these factors can be be decomposed as:
where the two quantities on the right-hand side of this equation have closed-form expressions. The computation of each one of these two expressions leads to two sequential steps, the E-W-step and the E-Z-step, of the expectation step of the proposed algorithm.
V-a The E-Z Step
The marginal posterior distribution of is obtained by integrating (15) over . As previously, we denote the responsibilities with . The integration computes:
where denotes the Pearson type VII probability distribution function, which can be seen as a generalization of the t-distribution:
V-B The E-W Step
The posterior distribution of , namely is a gamma distribution, because it is the conjugate prior of the precision of the Gaussian distribution. Therefore, we only need to compute the parameters of the posterior gamma distribution:
where the parameters of the posterior gamma distribution are evaluated with:
The conditional mean of , namely , can then be evaluated with:
[w-effect]1 While estimating the weights themselves is not needed by the algorithm, it is useful to evaluate them in order to fully characterize the observations and to discriminate between inliers and outliers. First notice that the marginal posterior distribution of is a mixture of gamma distributions:
and therefore the posterior mean of is evaluated with:
By inspection of (19), (20), and (21) it is easily seen that the value of decreases as the distance between the cluster centers and observation increases. Importantly, the evaluation of enables outlier detection. Indeed, an outlier is expected to be far from all the clusters, and therefore all will be small, leading to a small value of . It is worth noticing that this is not possible using only the responsibilities , since they are normalized by definition, and therefore their value is not an absolute measure of the datum’s relevance, but only a relative measure of it.
V-C The Maximization Step
This step maximizes the expected complete-data log-likelihood over the mixture parameters. By expanding (14), we have:
The parameter updates are obtained from canceling out the derivatives of the expected complete-data log-likelihood (24). As with standard Gaussian mixtures, all the updates are closed-form expressions:
It is worth noticing that the M-step of the WD-EM algorithm is very similar to the M-step of the FWD-EM algorithm (section III). Indeed, the above iterative formulas, (25), (26), (27) are identical to the formulas (8), (9), (10), except that the fixed weights are here replaced with the posterior means of the random weights, .
Vi Estimating the Number of Components
So far it has been assumed that the number of mixture components is provided in advance. This assumption is unrealistic for most real-world applications. In this Section we propose to extend the method and algorithm proposed in  to the weighted-data clustering model. An interesting feature of this model selection method is that it does not require parameter estimation for many different values of , as it would be the case with the Bayesian information criterion (BIC) . Instead, the algorithm starts with a large number of components and iteratively deletes components as they become irrelevant. Starting with a large number of components has the additional advantage of making the algorithm robust to initialization. Formally, the parameter estimation problem is cast into a transmission encoding problem and the criterion is to minimize the expected length of the message to be transmitted:
In this context, the observations and the parameters have to be quantized to finite precision before the transmission. This quantization sets a trade off between the two terms of the previous equation. Indeed, when truncating to high precision, may be long, but will be short, since the parameters fit well the data. Conversely, if the quantization is coarse, may be short, but will be long. The optimal quantization step can be found by means of the Taylor approximation . In that case, the optimization problem corresponding to the minimum message length (MML) criterion, is:
where is the expected Fisher information matrix (FIM) and denotes the dimensionality of the model, namely the dimension of the parameter vector . Since the minimization (29) does not depend on the weight parameters, will be omitted for simplicity.
In our particular case, as in the general case of mixtures, the Fisher information matrix cannot be obtained analytically. Indeed, the direct optimization of the log-likelihood does not lead to closed-form solutions. Nevertheless, it was noticed that the complete FIM upper bounds the FIM , and that the expected complete-data log-likelihood lower bounds the log-likelihood. This allows us to write the following equivalent optimization problem:
where denotes the expected complete-FIM and is evaluated with (24).
As already mentioned, because there is a different weight for each observation , the observed data are not identically distributed and our model cannot be considered a classical mixture model. For this reason, the algorithm proposed in  cannot be applied directly to our model. Indeed, in the proposed WD-GMM setting, the complete-FIM is:
where is the Fisher information matrix for the -th observation with respect to the parameter vector (mean and the covariance) of the -th component, is defined in (17), and is the Fisher information matrix of the multinomial distribution, namely the diagonal matrix . We can evaluate from (31):
where denotes the number of free parameters of each component. For example, when using diagonal covariance matrices or when using full covariance matrices.
Importantly, one of the main advantages of the methodology proposed in  is that one has complete freedom to choose a prior distribution on the parameters, . In our case, inspired by (32), we select the following prior distributions for the parameters:
where we used .
One may notice that (35) does not make sense (diverges) if any of the ’s is allowed to be null. However, in the current length coding framework, there is no point in transmitting the parameters of an empty component. Therefore, we only focus on the non-empty components, namely those components for which . Let denote the index set of non-empty components and let be its cardinality. We can rewrite (35) as:
The above minimization problem can be solved by modifying the EM algorithm described in Section V (notice that there is an equivalent derivation for the fixed-weigth EM algorithm described in Section III). Indeed, we remark that the minimization (36) is equivalent to using a symmetric improper Dirichlet prior for the proportions with exponent . Moreover, since the optimization function for the parameters of the Gaussian components is the same (equivalently, we used a flat prior for the mean vector and covariance matrix), their estimation formulas (26) and (27) still hold. Therefore, we only need to modify the estimation of the mixture proportions, namely:
The operator in (37) verifies whether the -th component is supported by the data. When one of the components becomes too weak, i.e. the required minimum support cannot be obtained from the data, this component is annihilated. In other words, its parameters will not be estimated, since there is no need in transmitting them. One has to be careful in this context, since starting with a large value of may lead to several empty components. In order to avoid this singular situation, we adopt the component-wise EM procedure (CEM) , as proposed in  as well. Intuitively, we run both E and M steps for one component, before moving to the next component. More precisely, after running the E-Z and E-W steps for the component , its parameters are updated if , otherwise the component is annihilated if . The rationale behind this procedure is that, when a component is annihilated its probability mass is immediately redistributed among the remaining components. Summarizing, CEM updates the components one by one, whereas the classical EM simultaneously updates all the components.
The proposed algorithm is outlined in alg:cem_for_wdgmm. In practice, an upper and a lower number of components, and , are provided. Each iteration of the algorithm consists of component-wise E and M steps. If needed, some of the components are annihilated, and the parameters are updated accordingly, until the relative length difference is below a threshold, . In that case, if the message length, i.e. (36) is lower than the current optimum, the parameters, weights, and length are saved in , and respectively. In order to explore the full range of , the less populated component is artificially annihilated, and CEM is run again. [complexity]3The complexity of alg:cem_for_wdgmm is similar to the complexity of the algorithm in , with the exception of the E-W step. However, the most computationally intensive part of this step (matrix inversion and matrix-vector multiplications in (20)) is already achieved in the E-Z step.
Vii Algorithm Initialization
[init-new]1 The EM algorithms proposed in Section III, Section V, and Section VI require proper initialization of both the weights (one for each observation and either a fixed value or parameters ) and of the model parameters. The -means algorithm is used for an initial clustering, from which values for the model parameters are computed. In this section we concentrate onto the issue of weight initialization. An interesting feature of our method is that the only constraint on the weights is that they must be positive. Initial values may depend on expert or prior knowledge and may be experiment- or goal-dependent. This model flexibility allows the incorporation of such prior knowledge. In the absence of any prior information/knowledge, we propose a data-driven initialization scheme and make the assumption that densely sampled regions are more important that sparsely sampled ones. We note that a similar strategy could be used if one wants to reduce the importance of dense data and to give more importance to small groups of data or to sparse data.
We adopt a well known data similarity measure based on the Gaussian kernel, and it follows that the weight associated with the data point is evaluated with:
where is the Euclidean distance, denotes the set containing the nearest neighbors of , and is a positive scalar. In all the experiments we used for the simulated datasets and for the real datasets. In both cases, we used . In the case of the FWD-EM algorithm, the weights thus initialized remain unchanged. However, in the case of the WD-EM algorithm, the weights are modeled as latent random variables drawn from a gamma distribution, hence one needs to set initial values for the parameters of this distribution, namely and in (11). Using (12) and (13) one can choose to initialize these parameters such as and , such that the mean and variance of the prior distribution are and respectively.
Viii Experimental Validation
|Letter Recognition |
The proposed algorithms were tested and evaluated using eight datasets: four simulated datasets and four publicly available datasets that are widely used for benchmarking clustering methods. The main characteristics of these datasets are summarized in Table I. The simulated datasets (SIM) are designed to evaluate the robustness of the proposed method with respect to outliers. The simulated inliers are drawn from Gaussian mixtures while the simulated outliers are drawn from a uniform distribution, e.g. Fig. 9. The SIM datasets have different cluster configurations in terms of separability, shape and compactness. The eight datasets that we used are the following:
SIM-Easy: Five clusters that are well separated and compact.
SIM-Unbalanced: Four clusters of different size and density.
SIM-Overlapped: Four clusters, two of them overlap.
SIM-Mixed: Six clusters of different size, compactness and shape.
MNIST contains instances of handwritten digit images normalized to the same size . We preprocessed these data with PCA to reduce the dimension from 784 to 141, by keeping of the variance.
Wav is the Waveform Database Generator .
BCW refers to the Breast Cancer Wisconsin data set , in which each instance represents a digitized image of a fine needle aspirate (FNA) of breast mass.
[letter_recognition]1Letter Recognition contains single-letter images that were generated by randomly distorting the images of the 26 uppercase letters from 20 different commercial fonts 
. Each letter/image is described by 16 features. This dataset is available through the UCI machine learning repository.