1 Introduction
A mixture model is a probabilistic model that can infer subpopulations from the overall population without additional information (within the paradigm of unsupervised learning). Mixture models closely correspond to mixture distributions of the probability distributions of observations. In general, the structure of a mixture model involves assumptions about latent variables that determine the membership of each observation. Given a dataset, we can treat it as a sample; a mixture model can then estimate the parameters of the probability distributions that generated the points of this dataset, as well as assign to each observation a vector of probabilities indicating its original distribution.
Comparing different mixture models can be considered a generalization of the problem of comparing different distributions. From the viewpoint of optimal transport theory, the Wasserstein distance is an important tool for measuring similarity while maintaining the explainable nature of mixture models.
In this paper we derive one possible approximation of the Wasserstein distance computed between mixture models, which reduces to a linear optimization problem, and we present examples of its usage.
2 Related work
Gaussian mixture models with the Wasserstein distance find their place in many areas of machine learning. In the case of generative networks [3], the use of the Wasserstein distance has been shown to allow modeling more complex distributions. Autoencoder architectures equipped with the Wasserstein distance (WAE), unlike variational autoencoders (VAE), allow the use of a deterministic mapping to a latent space [7]. In image processing, Gaussian mixture models equipped with the Wasserstein distance proved to be useful in tasks of color transfer and texture synthesis [2]. When dealing with heterogeneous data, mixture models have the advantage of simplicity, and the Wasserstein distance provides a suitable convergence rate [5]. Moreover, the Wasserstein distance holds an important place in optimal transport theory [6][8].
3 Problem formulation
Let $p(x \mid \theta)$ be the probability distribution of the given data with an unknown vector of parameters $\theta$. Modeling the data using statistics and machine learning comes down to modeling this probability distribution. In real-world applications, data is usually composed of multiple different probability distributions. Hence comes the elementary idea of modeling the data using a mixture model, where each observation is assigned a probability of originating from a given probability distribution. The problem of choosing the type of probability distribution for each component is usually skipped by assuming normality (Gaussianity) of the individual components, as the normal distribution has important probabilistic properties. This approach focuses on a general summary of the very origin of the data, therefore its applications are widespread:

in cluster analysis, Gaussian mixture models (GMM) may be seen as an extension of the K-means algorithm, yielding additional information about given observations;

in supervised learning, associating a type of label from the training data with one or more components may give us a similarity function between observations, based on whether they originate from the same probability distribution;

in natural language processing, the distribution of words in documents can be modeled as a mixture of different categorical distributions.
3.1 Big data
Nowadays, dealing with big data is a popular issue. While focusing on a big volume of moderately dimensional data, mixture models can help summarize the most common types of observations. Suppose that the size of the data makes it impractical to repeatedly perform calculations using the entire dataset. If we could summarize the data by creating representations which allow us to maintain the most important features of the data, as well as to perform calculations yielding approximate but much faster solutions, we would save a lot of computing power and time in practical applications. Mixture models may be considered one such approach, in which the data representation is made of components understood as parameters of probability distributions. The mixture model of a given dataset is itself an approximation of the underlying probability distribution. While it gives a way to compare different observations from the same dataset, one may think about comparing different representations, i.e. different mixture models. Suppose that we split a labeled dataset into datasets based on the label, then compute a mixture model for each of these datasets. Under the assumption that different labels indicate a different distribution of features, comparing the mixture models allows us to assess whether two datasets originate from similar sources. This problem is more widely known as the schema matching problem and is a common task in data integration and database management.
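The per-label summarization described above can be sketched as follows. This is a minimal illustration using scikit-learn's `GaussianMixture`; the toy data, the number of components, and all variable names are our own choices, not taken from the paper:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy labeled dataset: two "labels" drawn from slightly different sources.
X_a = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
X_b = rng.normal(loc=0.5, scale=1.2, size=(500, 2))

# Summarize each label's subset by a small Gaussian mixture model.
gmm_a = GaussianMixture(n_components=3, random_state=0).fit(X_a)
gmm_b = GaussianMixture(n_components=3, random_state=0).fit(X_b)

# The fitted parameters (weights, means, covariances) are the compact
# representation of each dataset; later sections compare such representations.
print(gmm_a.weights_.shape, gmm_a.means_.shape, gmm_a.covariances_.shape)
```

Once each labeled subset is compressed to a handful of parameters, any further comparison only touches those parameters, never the raw observations.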
3.2 Comparing mixture models
Mixture models, by the very way they are calculated, are based on the values of many observations. The only difference between the resulting models must be a manifestation of the different values of the respective observations. This interpretation yields a corollary: the difference between models can be measured by how much and how many observations making up one mixture model must be transformed in order for the resulting model to become more similar to the one with which it is compared. This intuition is realized in the Wasserstein metric, where the distance between two probability distributions is the amount of "work" that needs to be done in order to transform one distribution into another. Further explanation is provided in the following sections.
Gaussian mixture models allow us to summarize large datasets, while the Wasserstein distance provides a tool for comparing the different representations.
4 Gaussian mixture models
Henceforth we will focus on Gaussian mixture models, i.e. mixture models with only normal components.
Definition 1.
Let $\pi_1, \ldots, \pi_K \in [0, 1]$ s.t. $\sum_{k=1}^{K} \pi_k = 1$, and let $\mu_k \in \mathbb{R}^d$, $\Sigma_k \in \mathbb{R}^{d \times d}$ for $k = 1, \ldots, K$. A Gaussian mixture model of $K$ components is a probability distribution defined as:
$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),$$
where $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is a normal distribution with the mean vector $\mu_k$ and the covariance matrix $\Sigma_k$ as parameters corresponding to the $k$-th component.
Thereafter, if not stated otherwise, we will assume that distributions are defined over $\mathbb{R}^d$ with dimension $d \geq 1$. For the sake of simplicity, where it is not necessary we will omit the dimension.
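The mixture density of Definition 1 can be evaluated directly. Below is a small NumPy sketch; the helper names and the example parameters are ours, chosen purely for illustration:

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate normal N(mean, cov) at point x."""
    d = mean.shape[0]
    diff = x - mean
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def gmm_pdf(x, weights, means, covs):
    """Mixture density: sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(w * gaussian_pdf(x, m, c) for w, m, c in zip(weights, means, covs))

# A two-component GMM in R^2 with weights 0.6 and 0.4.
weights = np.array([0.6, 0.4])
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
p = gmm_pdf(np.array([0.0, 0.0]), weights, means, covs)
```

At the origin the first component dominates, so the density is close to $0.6 \cdot \mathcal{N}(0 \mid 0, I)$.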
Fitting a Gaussian mixture model to given data is the task of finding appropriate values of the parameters $\theta = (\pi_k, \mu_k, \Sigma_k)_{k=1}^{K}$ s.t. the resulting model describes the dataset. Using maximum likelihood estimation (MLE), let $X = (x_1, \ldots, x_n)$ be a vector of observations from our dataset. The joint probability distribution is then defined as:
$$p(X \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta).$$
The likelihood function is defined as:
$$L(\theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k).$$
Unfortunately, differentiating and comparing to zero will not allow us to solve this equation analytically. In order to help with this, we introduce latent variables $z_i$ that explain which component generated a given observation. Then:
$$p(x_i, z_i = k \mid \theta) = \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k).$$
4.1 EM algorithm
We can notice that knowing either the parameters $\theta$ or the latent variables $z$ allows us to compute the missing part. Furthermore, having a random guess about the parameters, we can evaluate the probabilities of the latent variables and then estimate new parameters. Repeating this process, as well as measuring progress with the log-likelihood, is a sketch of an iterative method known as the expectation–maximization (EM) algorithm. Let
$$\gamma_{ik} = p(z_i = k \mid x_i, \theta),$$
the responsibility of the $k$-th component for the $i$-th observation. By Bayes' theorem we can simplify it to this form:
$$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}.$$
And since maximizing the expected complete-data log-likelihood with fixed responsibilities admits closed-form solutions, in summary we have:
$$N_k = \sum_{i=1}^{n} \gamma_{ik}, \qquad \mu_k = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} x_i, \qquad \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^T, \qquad \pi_k = \frac{N_k}{n}.$$
As for the expectation phase (E), we evaluate $\gamma_{ik}$ given the initial $\theta$. The maximization step (M) consists of solving for the new parameters; solving for $\pi_k$ is performed using Lagrange multipliers (to enforce $\sum_k \pi_k = 1$). These steps are repeated until the stop conditions are met.
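The alternation of E and M steps can be sketched in a minimal one-dimensional implementation. This is an illustrative sketch only (fixed iteration count, quantile-based initialization of our own choosing, no numerical safeguards), not a library-grade EM:

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=50):
    """Minimal EM for a 1-D Gaussian mixture (illustrative, not robust)."""
    n = len(x)
    pi = np.full(k, 1.0 / k)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)  # spread initial means over data
    var = np.full(k, np.var(x))
    for _ in range(n_iter):
        # E step: responsibilities gamma[i, j] = p(z_i = j | x_i, theta).
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2 * np.pi * var)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate weights, means and variances from responsibilities.
        nk = gamma.sum(axis=0)
        pi = nk / n
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 300)])
pi, mu, var = em_gmm_1d(x, k=2)
```

On this well-separated toy sample the estimated means land close to the true component means of -3 and 3.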
4.2 Bayesian Gaussian mixture models
Finding the parameters of a GMM with the EM algorithm does not cover certain hyperparameters, e.g. the number of components. One can imagine that having $n$ observations and $n$ components in the form of Dirac delta functions would perfectly model a dataset, yet it would not be useful. The number of components can also be a very important parameter for regularization against overfitting, a phenomenon in which the model may not be able to generalize outside of the training set. A Bayesian interpretation allows us to use a prior probability distribution (the Dirichlet distribution) to model the parameter space. Estimating the approximate posterior distribution over the parameters of a Gaussian mixture distribution yields the number of components from the dataset.
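The Bayesian pruning of components can be observed with scikit-learn's `BayesianGaussianMixture`. The data, the component budget of 10, and the 1% activity threshold below are our own illustrative choices:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Data generated by 2 true components; we deliberately allow up to 10.
X = np.vstack([rng.normal(-4, 1, (300, 2)), rng.normal(4, 1, (300, 2))])

# A Dirichlet-process prior on the weights prunes unnecessary components.
bgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.01,  # small value favors fewer active components
    random_state=0,
).fit(X)

# Effective number of components: those with non-negligible posterior weight.
active = int((bgmm.weights_ > 0.01).sum())
```

Most of the 10 allowed components receive negligible weight, so the effective model size is driven by the data rather than fixed in advance.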
5 Wasserstein distance
Definition 2.
Let $\mu$ and $\nu$ be two $d$-dimensional probability distributions and let $\Gamma(\mu, \nu)$ be the set of probability distributions over $\mathbb{R}^d \times \mathbb{R}^d$ whose marginals are $\mu$ and $\nu$ as the first and second factors respectively. Let $p \geq 1$. The $p$-th Wasserstein distance between $\mu$ and $\nu$ is defined as:
$$W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int \|x - y\|^p \, d\gamma(x, y) \right)^{1/p}.$$
We may notice that, looking at the measures corresponding to the distributions $\mu$ and $\nu$, the set $\Gamma(\mu, \nu)$ is compact in the sense of weak convergence, therefore the infimum is achieved.
Lemma 1.
Let $x_1, x_2$ and $y_1, y_2$ be one-dimensional ($d = 1$) elements from the supports of $\mu$ and $\nu$ s.t. $x_1 \leq x_2$ and $y_1 \leq y_2$. Then the following inequality holds:
$$|x_1 - y_1|^p + |x_2 - y_2|^p \leq |x_1 - y_2|^p + |x_2 - y_1|^p.$$
Proof.
We can notice that:
Since points lie on the same line:
From Jensen’s inequality we have:
Symmetrically we get the result:
Summing gives us the inequality. ∎
Theorem 1.
Let $F$ and $G$ be the cumulative distribution functions of the distributions $\mu$ and $\nu$; by $F^{-1}$ and $G^{-1}$ we mean the inverse cumulative distribution functions, or quantile functions. For $p \geq 1$ we have:
$$W_p(\mu, \nu) = \left( \int_0^1 \left| F^{-1}(t) - G^{-1}(t) \right|^p \, dt \right)^{1/p}.$$
Proof.
If $\gamma$ satisfies the infimum in the definition of the Wasserstein distance, then $\gamma$ must be monotone. Otherwise, by the lemma, there would exist a better fit, where swapping $y_1$ and $y_2$ gives a smaller value.
Let ; then we can notice that . Indeed,
Therefore we conclude that:
∎
Thereafter, if not stated otherwise, we will consider $W_2$, i.e. the Wasserstein distance for $p = 2$.
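For empirical one-dimensional distributions with equally many samples, the quantile formula of Theorem 1 (with $p = 1$) reduces to averaging distances between sorted samples; SciPy's `wasserstein_distance` computes the same quantity. A small sanity-check sketch:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 5000)
b = rng.normal(2.0, 1.0, 5000)

# Quantile formula for empirical samples of equal size: the empirical
# quantile functions are step functions over the sorted samples.
w1_manual = np.mean(np.abs(np.sort(a) - np.sort(b)))
w1_scipy = wasserstein_distance(a, b)
```

Since `b` is `a`'s generating distribution shifted by 2, both values come out close to 2.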
5.1 Connections with transportation theory
While considering probability as a mass over some space, the Wasserstein distance realizes the optimal transport problem of transforming one probability distribution into another. Suppose we have a cost function $c(x, y)$ and probability distributions $\mu, \nu$. A transport plan is a function $\gamma$ s.t. $\gamma(A, B)$ is the volume of mass that needs to be moved from $A$ to $B$. The cost of a transport plan is:
$$C(\gamma) = \int c(x, y) \, d\gamma(x, y).$$
Depending on the selection of the function $c$, taking the infimum over possible plans yields the cost of optimal transport.
6 Wasserstein distance between two Gaussian mixture models
In order to calculate the Wasserstein distance between Gaussian mixture models, we would need to calculate the inverse cumulative distribution function of a mixture of normal distributions. Since this is analytically intractable, a similar idea is adopted instead.
Theorem 2.
[2]
Definition 3.
Let $\mu = \sum_{k=1}^{K} \pi_k \mu_k$ and $\nu = \sum_{l=1}^{L} \rho_l \nu_l$ be two GMMs.¹ We define the approximate Wasserstein distance between $\mu$ and $\nu$ in the following way:
$$MW_2(\mu, \nu)^2 = \min_{w \in \Pi(\pi, \rho)} \sum_{k=1}^{K} \sum_{l=1}^{L} w_{kl} \, W_2(\mu_k, \nu_l)^2,$$
where $\Pi(\pi, \rho)$ is the set of nonnegative matrices whose rows sum to $\pi$ and whose columns sum to $\rho$.
¹A similar definition is proposed in [2], but we only consider the finite case.
The proposed Wasserstein distance between mixture models is a straightforward extension of the intuitions behind the original Wasserstein distance. From the transport point of view, we are looking for the best assignment between corresponding components. This can be extended to an infinite-dimensional form, i.e. when the number of components is not finite. The main difference is that here we do not seek the best transportation plan understood as a function or a measure, but a matrix of size $K \times L$.
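The approximate distance of Definition 3 can be sketched by combining the well-known closed-form $W_2$ between two Gaussians with a small linear program over the component weights. The helper names below are our own, and the LP formulation is one straightforward encoding of the marginal constraints, not necessarily the paper's implementation:

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.optimize import linprog

def w2_gaussians(m1, c1, m2, c2):
    """Closed-form squared 2-Wasserstein distance between two Gaussians."""
    s = sqrtm(sqrtm(c2) @ c1 @ sqrtm(c2)).real
    return float(np.sum((m1 - m2) ** 2) + np.trace(c1 + c2 - 2 * s))

def mw2(weights1, params1, weights2, params2):
    """Approximate squared Wasserstein distance between two GMMs via an LP."""
    k, l = len(weights1), len(weights2)
    cost = np.array([[w2_gaussians(m1, c1, m2, c2) for (m2, c2) in params2]
                     for (m1, c1) in params1])
    # Equality constraints: plan rows sum to weights1, columns to weights2.
    a_eq = []
    for i in range(k):
        row = np.zeros((k, l)); row[i, :] = 1; a_eq.append(row.ravel())
    for j in range(l):
        col = np.zeros((k, l)); col[:, j] = 1; a_eq.append(col.ravel())
    b_eq = np.concatenate([weights1, weights2])
    res = linprog(cost.ravel(), A_eq=np.array(a_eq), b_eq=b_eq, bounds=(0, None))
    return res.fun

# Two identical two-component GMMs: the distance should be zero.
g1 = ([0.5, 0.5], [(np.zeros(2), np.eye(2)), (np.ones(2) * 3, np.eye(2))])
g2 = ([0.5, 0.5], [(np.zeros(2), np.eye(2)), (np.ones(2) * 3, np.eye(2))])
d = mw2(g1[0], g1[1], g2[0], g2[1])
```

For identical mixtures the optimal plan is the identity assignment, so the minimum is 0.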
6.1 Dual problem
Let us consider a more general problem. Let nonnegative values $c_{kl}$, $\pi_k$, and $\rho_l$ be given; the problem is the following:
$$\min_{w} \sum_{k,l} c_{kl} w_{kl} \quad \text{s.t.} \quad \sum_{l} w_{kl} = \pi_k, \quad \sum_{k} w_{kl} = \rho_l, \quad w_{kl} \geq 0.$$
In the case of the Wasserstein distance, $c_{kl}$ is the Wasserstein distance between the $k$-th component from the first mixture and the $l$-th component from the second mixture, $\pi_k$ are the weights of the components from the first mixture, and $\rho_l$ are the weights of the components from the second mixture.
The dual problem has the following form:
$$\max_{u, v} \sum_{k} \pi_k u_k + \sum_{l} \rho_l v_l \quad \text{s.t.} \quad u_k + v_l \leq c_{kl}.$$
It is worth noticing that the dual form immediately yields a feasible solution: setting all $u_k = v_l = 0$.
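The primal-dual pair can be checked numerically on a tiny instance. The cost matrix and weights below are arbitrary illustrative values; both programs are encoded directly with SciPy's `linprog`:

```python
import numpy as np
from scipy.optimize import linprog

# A tiny transport instance: costs c[i, j], supplies pi, demands rho.
c = np.array([[1.0, 3.0], [2.0, 1.0]])
pi = np.array([0.4, 0.6])
rho = np.array([0.5, 0.5])

# Primal: minimize sum c_ij w_ij over plans w with the given marginals.
# Variable order: [w00, w01, w10, w11].
a_eq = np.array([[1, 1, 0, 0],   # row 0 sums to pi[0]
                 [0, 0, 1, 1],   # row 1 sums to pi[1]
                 [1, 0, 1, 0],   # column 0 sums to rho[0]
                 [0, 1, 0, 1]],  # column 1 sums to rho[1]
                dtype=float)
primal = linprog(c.ravel(), A_eq=a_eq, b_eq=np.concatenate([pi, rho]),
                 bounds=(0, None))

# Dual: maximize pi @ u + rho @ v s.t. u_i + v_j <= c_ij.
# linprog minimizes, so we negate the objective; variables are unbounded.
# Variable order: [u0, u1, v0, v1]; one inequality row per cost entry c_ij.
a_ub = np.array([[1, 0, 1, 0], [1, 0, 0, 1],
                 [0, 1, 1, 0], [0, 1, 0, 1]], dtype=float)
dual = linprog(-np.concatenate([pi, rho]), A_ub=a_ub, b_ub=c.ravel(),
               bounds=(None, None))
```

All-zero dual variables are trivially feasible here (the costs are nonnegative), and by strong duality the two optimal values coincide.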
6.2 Solving with linear programming
Finding the Wasserstein distance between two mixture models comes down to solving a particular transport problem. Therefore we can use the notions of graph theory: on a directed complete bipartite graph we have a cost on each edge being the Wasserstein distance between the given components, a capacity at each edge corresponding to the weights of the components, and an amount of flow, i.e. the value sought. We can use the network simplex algorithm to solve such a problem.
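The bipartite formulation maps directly onto NetworkX's `network_simplex`. A sketch under our own assumptions: weights and costs are scaled to integers (the algorithm is exact for integer data), and the small cost matrix is illustrative rather than taken from real component distances:

```python
import networkx as nx

# Scale weights and costs to integers for an exact network simplex run.
pi = [40, 60]            # component weights of the first mixture (x 100)
rho = [50, 50]           # component weights of the second mixture (x 100)
cost = [[1, 3], [2, 1]]  # pairwise component distances (illustrative integers)

g = nx.DiGraph()
for i, w in enumerate(pi):
    g.add_node(("a", i), demand=-w)   # sources supply their component weight
for j, w in enumerate(rho):
    g.add_node(("b", j), demand=w)    # sinks demand their component weight
for i in range(len(pi)):
    for j in range(len(rho)):
        g.add_edge(("a", i), ("b", j),
                   weight=cost[i][j],
                   capacity=min(pi[i], rho[j]))

flow_cost, flow = nx.network_simplex(g)
```

The returned flow dictionary is the (scaled) optimal transport plan between the two sets of components, and `flow_cost` is the scaled objective value.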
6.3 GMM with Wasserstein distance as a classifier
We present an algorithm for the classification problem using Gaussian mixture models and the Wasserstein distance.
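A minimal sketch of such a classifier, under assumptions of our own: one reference GMM is fitted per class, a new chunk of data is summarized by its own GMM, and the chunk is assigned to the class whose model is nearest. For simplicity we work in one dimension and treat the component weights as roughly equal, in which case the transport LP reduces to an assignment problem:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.mixture import GaussianMixture

def w2_1d(m1, v1, m2, v2):
    """Squared 2-Wasserstein distance between two 1-D Gaussians (variances v)."""
    return (m1 - m2) ** 2 + (np.sqrt(v1) - np.sqrt(v2)) ** 2

def gmm_distance(g1, g2):
    """Approximate distance between two fitted 1-D GMMs, assuming roughly
    equal component weights: the transport LP reduces to an assignment."""
    c = np.array([[w2_1d(m1, v1, m2, v2)
                   for m2, v2 in zip(g2.means_.ravel(), g2.covariances_.ravel())]
                  for m1, v1 in zip(g1.means_.ravel(), g1.covariances_.ravel())])
    rows, cols = linear_sum_assignment(c)
    return c[rows, cols].mean()

rng = np.random.default_rng(0)
# One reference GMM per class, fitted on labeled training data.
train = {0: rng.normal(0, 1, (1000, 1)), 1: rng.normal(5, 2, (1000, 1))}
models = {y: GaussianMixture(2, random_state=0).fit(x) for y, x in train.items()}

# Classify a new chunk by the nearest class model in this distance.
chunk = rng.normal(5, 2, (200, 1))
chunk_model = GaussianMixture(2, random_state=0).fit(chunk)
pred = min(models, key=lambda y: gmm_distance(models[y], chunk_model))
```

With unequal weights, `linear_sum_assignment` would be replaced by the linear program from the previous section; the overall scheme stays the same.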
7 Experiments
7.1 STL-10 dataset
In the first experiment we used features extracted from an autoencoder neural network trained for image recognition. The original dataset is the STL-10 dataset [1], whose images are labeled as one of ten possible classes. Since the extracted representation is high-dimensional, during the experiment we randomly chose a subset of dimensions. The obtained results have been compared with another distance measure and the quadratic Jensen–Rényi divergence.
The task of matching data types using the proposed method, i.e. Gaussian mixture models with the Wasserstein distance, relies only on the applied preprocessing. While autoencoder representations can be summaries themselves [4], given the large volume of the data, our method is far more practical.
7.2 Text data
In the second experiment we operated on a large volume of short text data, divided into chunks of examples with the same label. The task was to predict the label of an entire chunk. Preprocessing consists of transforming characters into features based on the length and frequency of occurrence of given letters and signs. During the experiment we performed k-fold cross-validations. The results were compared with a different approach (the KNN algorithm) and a similar approach with a different distance function. The results indicate that the proposed framework works better than the compared methods.
8 Conclusions and future work
We derived an approximate, easy-to-calculate version of the Wasserstein distance between Gaussian mixture models, which may find many applications in various fields of machine learning. In the case of big data, the greatest advantage is the avoidance of multiple calculations over the entire dataset, as the obtained summary allows for the estimation of similarity based only on the compacted data representations.
Future work may involve a statistical analysis of the properties extracted by Gaussian mixture models from a dataset, e.g. selecting important observations that may have had the greatest impact on the parameters.
References
 [1] A. Coates, H. Lee, A. Y. Ng, ”An Analysis of Single Layer Networks in Unsupervised Feature Learning”, AISTATS, 2011.
 [2] J. Delon and A. Desolneux, "A Wasserstein-type distance in the space of Gaussian Mixture Models", arXiv:1907.05254v4, 2019.
 [3] B. Gaujac, I. Feige, and D. Barber, ”Gaussian mixture models with Wasserstein distance,” arXiv:1806.04465v1, 2018.
 [4] M. Przyborowski, T. Tajmajer, Ł. Grad, A. Janusz, P. Biczyk, D. Ślęzak, "Toward Machine Learning on Granulated Data - a Case of Compact Autoencoder-based Representations of Satellite Images", IEEE BigData 2018, pp. 2657-2662, 2018.
 [5] S. Ozkan and G. B. Akar, "Improved deep spectral convolution network for hyperspectral unmixing with multinomial mixture kernel and endmember uncertainty", arXiv:1808.01104v1, 2018.
 [6] G. Peyré and M. Cuturi, ”Computational Optimal Transport”, arXiv:1803.00567v4, 2020.
 [7] M. Śmieja, M. Wołczyk, J. Tabor, B. C. Geiger, "SeGMA: Semi-Supervised Gaussian Mixture Autoencoder", arXiv:1906.09333v2, 2019.
 [8] A. Takatsu and T. Yokota, ”Cone structure of Wasserstein spaces”, arXiv:0812.2752v3, 2009.