Schema matching using Gaussian mixture models with Wasserstein distance

by Mateusz Przyborowski, et al.
Uniwersytet Warszawski

Gaussian mixture models find their place as a powerful tool, mostly in the clustering problem, but with proper preparation also in feature extraction, pattern recognition, image segmentation and machine learning in general. When faced with the problem of schema matching, different mixture models computed on different pieces of data can maintain crucial information about the structure of the dataset. In order to measure or compare results from mixture models, the Wasserstein distance can be very useful; however, it is not easy to calculate for mixture distributions. In this paper we derive one possible approximation of the Wasserstein distance between Gaussian mixture models and reduce it to a linear problem. Furthermore, application examples concerning real-world data are shown.



1 Introduction

A mixture model is a probabilistic model that is able to infer subpopulations from the total population without additional information (within the paradigm of unsupervised learning). Mixture models closely correspond to mixture distributions of the probability distributions of observations. In general, in the structure of a mixture model, we make assumptions about latent variables that determine the membership of each observation. Given a dataset, we can treat it as a sample; a mixture model can then estimate the parameters of the probability distributions that generated the points of this dataset, as well as assign to each observation a vector of probabilities indicating its originating distribution.

Comparing different mixture models can be considered a generalization of the problem of comparing different distributions. From the viewpoint of optimal transport theory, the Wasserstein distance is an important method for measuring similarities while maintaining the explainable nature of mixture models.
In this paper we derive one possible approximation of the Wasserstein distance computed between mixture models, which may be reduced to a linear optimization problem, and we present examples of its usage.

2 Related work

Gaussian mixture models with the Wasserstein distance find their place in many areas of machine learning. In the case of generative networks [3], the use of the Wasserstein distance has been shown to allow modeling of more complex distributions. Autoencoder architectures equipped with the Wasserstein distance (WAE), unlike variational autoencoders (VAE), allow the use of a deterministic mapping to a latent space [7]. In image processing, Gaussian mixture models equipped with the Wasserstein distance have proved useful in tasks of color transfer and texture synthesis [2]. When dealing with heterogeneous data, mixture models have the advantage of simplicity, and the Wasserstein distance provides a suitable convergence rate [5]. Moreover, the Wasserstein distance holds an important place in optimal transport theory [6][8].

3 Problem formulation

Let $P_{\theta}$ be the probability distribution of the given data, with an unknown vector of parameters $\theta$. Modeling the data using statistics and machine learning comes down to modeling this probability distribution. In real-world applications, data is usually composed of multiple different probability distributions. Hence comes the elementary idea of modeling the data using a mixture model, where each observation is assigned a probability of originating from a given component distribution. The problem of choosing the type of probability distribution for each component is usually skipped by assuming normality (Gaussianity) of individual components, as the normal distribution has important probabilistic properties. This approach focuses on a general summary of the very origin of the data, therefore its applications are widespread:

  1. in cluster analysis, Gaussian mixture models (GMM) may be seen as an extension to K-means algorithm, yielding additional information about given observations;

  2. in supervised learning, associating a label type from the training data with one or more components may give us a similarity function between observations, based on whether they originate from the same probability distribution;

  3. in natural language processing, distribution of words in documents can be modelled as mixture of different categorical distributions.

3.1 Big data

Nowadays, dealing with big data is a common challenge. While focusing on a large volume of moderately dimensional data, mixture models can help summarize the most common types of observations. Suppose that the size of the data makes it impractical to repeatedly perform calculations over the entire dataset. If we could summarize the data by creating representations that maintain the most important features of the data, and that allow calculations yielding approximate but much faster solutions, we would save a lot of computing power and time in practical applications. Mixture models may be considered one such approach, in which the data representation is made of components understood as parameters of probability distributions. The mixture model of a given dataset is itself an approximation of the underlying probability distribution. While it gives a way to compare different observations from the same dataset, one may also think about comparing different representations, i.e. different mixture models. Suppose that we split a labeled dataset into datasets based on the label, then compute a mixture model for each such dataset. Under the assumption that different labels indicate different distributions of features, comparing the mixture models allows us to determine whether two datasets originate from similar sources. This problem is more widely known as the schema matching problem and is a common task in data integration and database management.

3.2 Comparing mixture models

Mixture models, by the very way they are calculated, are based on the values of many observations. Any difference between the resulting models must be a manifestation of different values of the respective observations. This interpretation yields a corollary: the difference between models could be measured by how much, and how many, of the observations making up one mixture model must be transformed in order for the resulting model to become more similar to the one with which it is compared. This intuition is realized in the Wasserstein metric, where the distance between two probability distributions is the amount of "work" that needs to be done in order to transform one distribution into another. Further explanation is provided in the following sections.
Gaussian mixture models allow us to summarize large datasets, while the Wasserstein distance provides a tool for comparing the different representations.

4 Gaussian mixture models

Henceforth we will focus on Gaussian mixture models, i.e. mixture models with only normal components.

Definition 1.

Let $\pi_1, \dots, \pi_K \geq 0$ s.t. $\sum_{k=1}^{K} \pi_k = 1$, and let $\mu_k \in \mathbb{R}^d$, $\Sigma_k \in \mathbb{R}^{d \times d}$ for $k = 1, \dots, K$. A Gaussian mixture model with $K$ components is a probability distribution defined as:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),$$

where $\mathcal{N}(\cdot \mid \mu_k, \Sigma_k)$ is a normal distribution with the mean vector $\mu_k$ and the covariance matrix $\Sigma_k$ as parameters corresponding to the $k$-th component.

Thereafter, if not stated otherwise, we will assume that distributions are defined over $\mathbb{R}^d$ with dimension $d \geq 1$. For the sake of simplicity, we will omit the dimension where it is not necessary.

Fitting a Gaussian mixture model to given data is the task of finding appropriate values of the parameters $\theta = (\pi_k, \mu_k, \Sigma_k)_{k=1}^{K}$ s.t. the resulting model describes the dataset. Using maximum likelihood estimation (MLE), let $x_1, \dots, x_N$ be the observations from our dataset. The joint probability distribution is then defined as:

$$p(x_1, \dots, x_N \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta).$$

The likelihood function is defined as:

$$L(\theta; x_1, \dots, x_N) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k).$$

Unfortunately, differentiating and comparing to zero will not allow us to solve this equation analytically, because of the sum inside the logarithm of the likelihood. In order to help with this, we introduce latent variables $z_n \in \{1, \dots, K\}$ that indicate which component generated a given observation. Then:

$$p(x_n \mid \theta) = \sum_{k=1}^{K} p(z_n = k) \, p(x_n \mid z_n = k, \theta), \qquad p(z_n = k) = \pi_k.$$

4.1 EM algorithm

We can notice that knowing either the parameters $\theta = (\pi_k, \mu_k, \Sigma_k)_{k=1}^{K}$ or the latent assignments allows us to compute the missing part. Furthermore, having a random guess about the parameters, we can evaluate the membership probabilities and then estimate new parameters. Repeating this process, while measuring progress with the log-likelihood, is a sketch of an iterative method known as the expectation–maximization (EM) algorithm. Let

$$\gamma_{nk} = p(z_n = k \mid x_n, \theta) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}, \qquad N_k = \sum_{n=1}^{N} \gamma_{nk};$$

then maximizing the expected complete-data log-likelihood with respect to the parameters yields the updates:

$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \, x_n, \qquad \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^{\top}, \qquad \pi_k = \frac{N_k}{N}.$$

In the expectation phase (E), we evaluate $\gamma_{nk}$ given the current (initially guessed) parameters. The maximization step (M) consists of solving for $\mu_k$, $\Sigma_k$ and $\pi_k$; solving for $\pi_k$ under the constraint $\sum_k \pi_k = 1$ is performed using Lagrange multipliers. These steps are repeated until the stop conditions are met.

$\theta \leftarrow$ initial guess
repeat
  E step: evaluate $\gamma_{nk}$ given the current $\theta$
  M step: update $\pi_k, \mu_k, \Sigma_k$ from $\gamma_{nk}$
until stop condition satisfied
Algorithm 1 EM algorithm
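The E and M steps above can be sketched in a few lines of NumPy. The following is a minimal illustration for one-dimensional data (the function name, quantile-based initialization and fixed iteration count are our own choices, not from the paper); a production implementation would add covariance regularization and a log-likelihood-based stopping rule:

```python
import numpy as np

def fit_gmm_em(x, K, n_iter=200):
    """Fit a 1D Gaussian mixture with K components by plain EM (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    pi = np.full(K, 1.0 / K)
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)  # spread initial means over the data
    var = np.full(K, x.var())
    for _ in range(n_iter):
        # E step: responsibilities gamma_{nk}, shape (n, K)
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        gamma = pi * dens
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from responsibility-weighted data
        Nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk + 1e-9
        pi = Nk / n
    return pi, mu, var
```

On well-separated data the recovered means approach the true cluster centers; a fixed iteration count stands in for the stop condition of Algorithm 1.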

4.2 Bayesian Gaussian mixture models

Finding the parameters of a GMM with the EM algorithm does not determine certain hyperparameters, e.g. the number of components $K$. One can imagine that, given $N$ observations, a model with $N$ components in the form of Dirac delta functions would fit the dataset perfectly, yet it would not be useful. The number of components can also be a very important parameter for regularizing against overfitting, a phenomenon in which the model may not be able to generalize outside of the training set. The Bayesian interpretation allows us to use a prior probability distribution (the Dirichlet distribution) to model the parameter space. Estimating the approximate posterior distribution over the parameters of a Gaussian mixture distribution yields the effective number of components from the dataset.

5 Wasserstein distance

Definition 2.

Let $\mu$ and $\nu$ be two $d$-dimensional probability distributions and let $\Gamma(\mu, \nu)$ be the set of probability distributions on $\mathbb{R}^d \times \mathbb{R}^d$ whose marginals are $\mu$ and $\nu$ as the first and second factors, respectively. Let $p \geq 1$. The $p$-th Wasserstein distance between $\mu$ and $\nu$ is defined as:

$$W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|^p \, d\gamma(x, y) \right)^{1/p}.$$

We may notice that, looking at the measures corresponding to the distributions $\mu$ and $\nu$, the set $\Gamma(\mu, \nu)$ is compact in the sense of weak convergence; therefore the infimum is attained.

Lemma 1.

Let $x_1 \leq x_2$ and $y_1 \leq y_2$ be one-dimensional ($d = 1$) elements from the supports of $\mu$ and $\nu$, respectively. Then for $p \geq 1$ the following inequality holds:

$$|x_1 - y_1|^p + |x_2 - y_2|^p \leq |x_1 - y_2|^p + |x_2 - y_1|^p.$$

Proof. We can notice that:

$$x_1 - y_2 \leq x_1 - y_1 \leq x_2 - y_1 \quad \text{and} \quad x_1 - y_2 \leq x_2 - y_2 \leq x_2 - y_1.$$

Since these points lie on the same line and $(x_1 - y_1) + (x_2 - y_2) = (x_1 - y_2) + (x_2 - y_1)$, there exists $\lambda \in [0, 1]$ s.t. $x_1 - y_1 = \lambda (x_1 - y_2) + (1 - \lambda)(x_2 - y_1)$ and $x_2 - y_2 = (1 - \lambda)(x_1 - y_2) + \lambda (x_2 - y_1)$. From Jensen's inequality applied to the convex function $t \mapsto |t|^p$ we have:

$$|x_1 - y_1|^p \leq \lambda |x_1 - y_2|^p + (1 - \lambda) |x_2 - y_1|^p.$$

Symmetrically we get:

$$|x_2 - y_2|^p \leq (1 - \lambda) |x_1 - y_2|^p + \lambda |x_2 - y_1|^p.$$

Summing gives us the inequality. ∎

Theorem 1.


Let $F$ and $G$ be the cumulative distribution functions of one-dimensional distributions $\mu$ and $\nu$, and by $F^{-1}$ and $G^{-1}$ we mean the inverse cumulative distribution functions, or quantile functions. For $p \geq 1$ we have:

$$W_p(\mu, \nu)^p = \int_0^1 \left| F^{-1}(t) - G^{-1}(t) \right|^p \, dt.$$

Proof. If $\gamma$ attains the infimum in the definition of the Wasserstein distance, then $\gamma$ must be monotone. Otherwise, by the lemma, there would exist a better fit: whenever pairs $(x_1, y_2)$ and $(x_2, y_1)$ with $x_1 \leq x_2$, $y_1 \leq y_2$ carry mass, swapping them to $(x_1, y_1)$ and $(x_2, y_2)$ gives a smaller value.
Let $\gamma^* = (F^{-1}, G^{-1})_{\#} \lambda_{(0,1)}$ be the push-forward of the Lebesgue measure on $(0, 1)$; then we can notice that $\gamma^*$ is monotone and has marginals $\mu$ and $\nu$. Indeed, for any Borel set $A$,

$$\gamma^*(A \times \mathbb{R}) = \lambda\left( \{ t \in (0,1) : F^{-1}(t) \in A \} \right) = \mu(A),$$

and analogously for the second marginal. Therefore we conclude that:

$$W_p(\mu, \nu)^p = \int_0^1 \left| F^{-1}(t) - G^{-1}(t) \right|^p \, dt. \qquad \blacksquare$$
Thereafter, if not stated otherwise, we will consider $W_2$, i.e. the Wasserstein distance for $p = 2$.
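For empirical one-dimensional distributions with equally many, equally weighted atoms, the quantile functions in Theorem 1 are step functions, so the integral reduces to an average over sorted samples. A small illustrative sketch (the helper name is our own, not from the paper):

```python
import numpy as np

def wasserstein_1d(x, y, p=1):
    """W_p between two empirical 1D distributions with equal sample sizes.

    By Theorem 1, W_p^p is the integral of |F^-1(t) - G^-1(t)|^p over (0, 1);
    for n equally weighted atoms the quantile functions are step functions,
    so the integral reduces to an average over sorted samples.
    """
    x, y = np.sort(x), np.sort(y)
    assert len(x) == len(y), "equal sample sizes keep the sketch simple"
    return np.mean(np.abs(x - y) ** p) ** (1.0 / p)
```

For example, shifting a sample by a constant $c$ yields a distance of exactly $c$ for any $p$, matching the intuition of "work" needed to move the mass.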

5.1 Connections with transportation theory

While considering probability as a mass over some space, the Wasserstein distance realizes the optimal transport problem of transforming one probability distribution into another. Suppose we have a cost function $c : \mathbb{R}^d \times \mathbb{R}^d \to [0, \infty)$ and probability distributions $\mu, \nu$. A transport plan is a measure $\gamma \in \Gamma(\mu, \nu)$ s.t. $\gamma(A \times B)$ is the volume of mass that needs to be moved from $A$ to $B$. The cost of a transport plan is:

$$C(\gamma) = \int_{\mathbb{R}^d \times \mathbb{R}^d} c(x, y) \, d\gamma(x, y).$$

Depending on the selection of the function $c$, taking the infimum over possible plans yields the cost of optimal transport.

6 Wasserstein distance between two Gaussian mixture models

In order to calculate the Wasserstein distance between Gaussian mixture models, we would need to calculate the inverse cumulative distribution function of a mixture of normal distributions. Since this is not possible analytically, a similar idea is adopted.

Theorem 2.

Let $\mu = \mathcal{N}(m_1, \Sigma_1)$ and $\nu = \mathcal{N}(m_2, \Sigma_2)$ be normal distributions. Then:

$$W_2(\mu, \nu)^2 = \|m_1 - m_2\|_2^2 + \mathrm{Tr}\!\left( \Sigma_1 + \Sigma_2 - 2 \left( \Sigma_2^{1/2} \Sigma_1 \Sigma_2^{1/2} \right)^{1/2} \right).$$

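The 2-Wasserstein distance between two individual Gaussian components has a well-known closed form, $W_2^2 = \|m_1 - m_2\|^2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_2^{1/2} \Sigma_1 \Sigma_2^{1/2})^{1/2})$, which can be evaluated with an eigendecomposition-based matrix square root. A minimal sketch (helper names are ours):

```python
import numpy as np

def sqrtm_psd(A):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

def gaussian_w2(m1, S1, m2, S2):
    """Closed-form 2-Wasserstein distance between N(m1, S1) and N(m2, S2)."""
    rS2 = sqrtm_psd(S2)
    cross = sqrtm_psd(rS2 @ S1 @ rS2)
    d2 = np.sum((np.asarray(m1) - np.asarray(m2)) ** 2) \
        + np.trace(S1 + S2 - 2 * cross)
    return np.sqrt(max(d2, 0.0))  # clip tiny negative values from round-off
```

For equal covariances the covariance term vanishes and the distance reduces to the Euclidean distance between the means.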
Definition 3.
¹ A similar definition is proposed in [2], but we only consider the finite case.

Let $\mu = \sum_{i=1}^{m} \pi_i \mu_i$ and $\nu = \sum_{j=1}^{n} \rho_j \nu_j$ be two GMMs. We define the approximate Wasserstein distance between $\mu$ and $\nu$ in the following way:

$$\widehat{W}(\mu, \nu)^2 = \min_{T \in \Pi(\pi, \rho)} \sum_{i=1}^{m} \sum_{j=1}^{n} T_{ij} \, W_2(\mu_i, \nu_j)^2,$$

where $\Pi(\pi, \rho) = \left\{ T \in \mathbb{R}_{\geq 0}^{m \times n} : \sum_{j} T_{ij} = \pi_i, \ \sum_{i} T_{ij} = \rho_j \right\}$.

The proposed Wasserstein distance between mixture models is a straightforward extension of the intuitions lying behind the original Wasserstein distance. From the transport point of view, we are looking for the best assignment between the components of the corresponding mixtures. This can be extended to an infinite-dimensional form, i.e. when the number of components is not finite. The main difference is that here we do not seek the best transportation plan understood as a function or a measure, but a matrix of size $m \times n$.

6.1 Dual problem

Let us consider a more general problem. Let $c_{ij}, a_i, b_j \geq 0$ for $i = 1, \dots, m$, $j = 1, \dots, n$, with $\sum_i a_i = \sum_j b_j$, be given nonnegative numbers; the problem is the following:

$$\min_{x_{ij} \geq 0} \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij} x_{ij} \quad \text{s.t.} \quad \sum_{j=1}^{n} x_{ij} = a_i, \quad \sum_{i=1}^{m} x_{ij} = b_j.$$

In the case of the Wasserstein distance, $c_{ij}$ is the Wasserstein distance between the $i$-th component of the first mixture and the $j$-th component of the second mixture, $a_i$ are the weights of the components of the first mixture, and $b_j$ are the weights of the components of the second mixture.
The dual problem has the following form:

$$\max_{u, v} \sum_{i=1}^{m} a_i u_i + \sum_{j=1}^{n} b_j v_j \quad \text{s.t.} \quad u_i + v_j \leq c_{ij}.$$

It is worth noticing that the dual form immediately yields a feasible solution: setting all $u_i = v_j = 0$.

6.2 Solving with linear programming

Finding the Wasserstein distance between two mixture models comes down to solving a particular transport problem. Therefore we can use the notions of graph theory: over a directed complete bipartite graph we have a cost on each edge, being the Wasserstein distance between the given components; a supply or demand at each node, corresponding to the weight of the component; and the amount of flow, i.e. the value sought. We can use the network simplex algorithm to solve such a problem.
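As a sketch of this route, the transport problem can also be handed to a generic LP solver instead of the network simplex; below we use SciPy's HiGHS backend (the function name and setup are our own, for illustration only):

```python
import numpy as np
from scipy.optimize import linprog

def transport_lp(cost, a, b):
    """Solve min_T sum_ij T_ij * cost_ij s.t. row sums = a, column sums = b, T >= 0."""
    m, n = cost.shape
    A_eq = []
    for i in range(m):  # row-sum constraints: sum_j T_ij = a_i
        row = np.zeros((m, n))
        row[i, :] = 1.0
        A_eq.append(row.ravel())
    for j in range(n - 1):  # column sums; the last one is redundant since sum(a) == sum(b)
        col = np.zeros((m, n))
        col[:, j] = 1.0
        A_eq.append(col.ravel())
    b_eq = np.concatenate([a, b[:-1]])
    res = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(m, n)
```

With `cost` filled by the pairwise component distances $W_2(\mu_i, \nu_j)^2$ and `a`, `b` set to the mixture weights, `res.fun` is the squared approximate distance of Definition 3. For large numbers of components a dedicated network simplex implementation would be faster.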

6.3 GMM with Wasserstein distance as a classifier

We present the algorithm for classification problem using Gaussian mixture models and Wasserstein distance.

Input: $D$, a training dataset with $c$ different classes; $T_1, \dots, T_s$, test datasets, where within each dataset $T_j$ every observation has the same label.
1. Split the training dataset into sets $D_1, \dots, D_c$ based on the label.
2. Fit a Gaussian mixture model $G_i$ for each $D_i$.
3. Fit a Gaussian mixture model $H_j$ for each $T_j$.
4. Compute the Wasserstein distance $w_{ij}$ between $G_i$ and $H_j$ for each $i, j$.
5. Label the set $T_j$ with the label of the set $D_{i^*}$, where $i^* = \arg\min_i w_{ij}$.
Algorithm 2 GMM with Wasserstein distance
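The steps of Algorithm 2 can be sketched end to end. To keep the example self-contained we deliberately simplify: each split is summarized by a single Gaussian (a $K = 1$ "mixture"), so the transport problem of step 4 is trivial and the closed-form component distance suffices; a faithful implementation would fit multi-component GMMs and solve the transport LP. All function names are our own:

```python
import numpy as np

def fit_gaussian(X):
    """Summarize a dataset by a single Gaussian (a K=1 'mixture')."""
    X = np.asarray(X, dtype=float)
    return X.mean(axis=0), np.cov(X, rowvar=False) + 1e-9 * np.eye(X.shape[1])

def gaussian_w2(m1, S1, m2, S2):
    """Closed-form 2-Wasserstein distance between Gaussians (repeated for self-containment)."""
    w, V = np.linalg.eigh(S2)
    rS2 = (V * np.sqrt(np.clip(w, 0, None))) @ V.T
    w2, V2 = np.linalg.eigh(rS2 @ S1 @ rS2)
    cross = (V2 * np.sqrt(np.clip(w2, 0, None))) @ V2.T
    d2 = np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * cross)
    return np.sqrt(max(d2, 0.0))

def classify_chunks(train_splits, test_chunks):
    """Assign to each test chunk the label of the nearest training split (steps 1-5)."""
    models = {lab: fit_gaussian(X) for lab, X in train_splits.items()}
    out = []
    for chunk in test_chunks:
        mc, Sc = fit_gaussian(chunk)
        out.append(min(models, key=lambda lab: gaussian_w2(*models[lab], mc, Sc)))
    return out
```

Because whole chunks, rather than single observations, are labeled, the method operates purely on the compacted representations, which is the big-data advantage discussed in Section 3.1.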

7 Experiments

7.1 STL-10 dataset

In the first experiment we used features extracted from an autoencoder neural network which was trained for image recognition. The original dataset is the STL-10 dataset [1], which consists of images labeled with one of ten possible classes. The extracted representation is high-dimensional; therefore, during the experiment we randomly chose a subset of the dimensions. The received results were compared with another distance and the quadratic Jensen-Rényi divergence.

Figure 1: The x-axis indicates the number of selected components. At each step the experiment was repeated several times. Solid lines stand for the mean results, while the shaded area indicates the standard deviation from the mean.

The task of matching data types using the proposed method, i.e. Gaussian mixture models with the Wasserstein distance, relies only on the applied preprocessing. While autoencoder representations can serve as summaries themselves [4], given a large volume of data, our method is far more practical.

7.2 Text data

In the second experiment we operated on a large volume of short text data, divided into chunks of examples sharing the same label. The task was to predict the label of an entire chunk. Preprocessing consisted of transforming characters into features based on the length and the frequency of occurrence of given letters and signs. During the experiment we performed repeated cross-validations. The results were compared with a different approach (the KNN algorithm) and with a similar approach using a different distance function.

Figure 2: Mean results from the cross-validations. Standard deviations are indicated.

The results indicate that the proposed framework performs better than the compared methods.

8 Conclusions and future work

We derived an approximate, easy-to-calculate version of the Wasserstein distance between Gaussian mixture models, which may find many applications in various fields of machine learning. In the case of big data, the greatest advantage is the avoidance of multiple calculations over the entire dataset, as the obtained summaries allow for the estimation of similarity based only on the compacted data representations.
Future work may involve a statistical analysis of the properties extracted by Gaussian mixture models from a dataset, e.g. selecting important observations that may have had the greatest impact on the parameters.