 # A Unified Framework for Gaussian Mixture Reduction with Composite Transportation Distance

Gaussian mixture reduction (GMR) is the problem of approximating a finite Gaussian mixture by one with fewer components. It is widely used in density estimation, nonparametric belief propagation, and Bayesian recursive filtering. Although optimization and clustering-based algorithms have been proposed for GMR, they are either computationally expensive or lacking in theoretical supports. In this work, we propose to perform GMR by minimizing the entropic regularized composite transportation distance between two mixtures. We show our approach provides a unified framework for GMR that is both interpretable and computationally efficient. Our work also bridges the gap between optimization and clustering-based approaches for GMR. A Majorization-Minimization algorithm is developed for our optimization problem and its theoretical convergence is also established in this paper. Empirical experiments are also conducted to show the effectiveness of GMR. The effect of the choice of transportation cost on the performance of GMR is also investigated.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Finite mixture models provide a convenient parametric framework to approximate distributions with unknown shapes. It is well known that finite mixture models can approximate any continuous distribution with arbitrary precision (McLachlan and Peel, 2004). They are therefore widely used in applications such as image generation (Kolouri et al., 2017), image segmentation (Farnoosh and Zarpak, 2008), object tracking (Santosh et al., 2013), and signal processing (Kostantinos, 2000)

. Among many choices, the Gaussian mixture model is the most commonly used due to its mathematical simplicity. A finite Gaussian mixture model is defined as follows.

###### Definition 1 (Finite Gaussian Mixture).

Let be the density function of a

-dimensional Gaussian random vector with mean

and covariance matrix . The following convex combination of Gaussian density functions

 ϕ(x;G)=K∑k=1wkGϕ(x;μkG,ΣkG)

is the density function of a Gaussian mixture of order .

Notation in the definition is for the mixing distribution whose measure is given by

. The cumulative distribution function (CDF) of the Gaussian mixture is given by

where is the CDF of . We also refer to as the number of components. When , it is a finite Gaussian mixture. We call mixing weights, where is the -dimensional simplex. We call the component parameters.

When Gaussian mixtures are used for density approximation, there is always a trade-off between the accuracy of the approximation and computational efficiency. A larger order

improves the approximation precision but increases the computational cost for downstream applications. For instance, it increases the cost of evaluating the log-likelihood function. Moreover, when Gaussian mixture models are used to approximate density functions in some Bayesian inference procedures, the number of components of the mixture increases exponentially

(Manzar, 2017)

due to some recursive operations. An example is in belief propagation for finding the marginal probabilities in graphical models, each time the belief is updated, the orders of the mixture distributions are multiplied. When the messages–the reusable partial sum for the marginalization calculations– are modeled as Gaussian mixtures

(Sudderth et al., 2010), their orders increase exponentially and quickly become intractable with iterations. In recursive Bayesian filtering, when the prior and likelihood are both Gaussian mixtures, the order of the posterior mixture increases exponentially. In these cases, to make inferences within a reasonable amount of time, some intermediate approximation steps are helpful to prevent the order of the mixture from exploding. Gaussian mixture reduction (GMR) is a technique useful for such purposes.

The problem of GMR is to approximate a Gaussian mixture by another with fewer components. GMR can also be a tool to obtain a much simplified Gaussian mixture estimate from a kernel density estimate

(Scott and Szewczyk, 2001). There has been a rich literature on GMR and most approaches belong to one of the three general types: greedy algorithm based (Salmond, 1990; Runnalls, 2007; Huber and Hanebeck, 2008), optimization based (Williams and Maybeck, 2006), and clustering based (Schieferdecker and Huber, 2009; Goldberger and Roweis, 2005; Davis and Dhillon, 2007; Assa and Plataniotis, 2018; Zhang and Kwok, 2010; Vasconcelos and Lippman, 1999; Yu et al., 2018)

. Each type of these approaches has its own advantages and limitations. For a greedy algorithm based approach, at each step it chooses two components from the current mixture and merges them into a single Gaussian distribution with the same first two moments. Although it eventually reduces the original mixture to a mixture of the target order, the greedy algorithm merely follows some instinct without an ultimate optimality goal. The optimization based approach addresses this issue by introducing an ultimate optimality goal.

Williams and Maybeck (2006) proposes to search for a reduced mixture that minimizes the distance. Faithfully minimizing the distance is, however, computationally expensive. The third type of approaches clusters original components following the idea of the k-means algorithm. These clustering based approaches are fast but their theoretical support is not investigated.

Contribution In this work, we propose a principled GMR approach through the entropic regularized transportation distance between two mixtures as defined in Section 2.1. We emphasize interpretability, speed for computation, and the quality of the approximation. Moreover, our framework is applicable to any finite mixture models. Our contributions are in three folds: 1) we show in Section 3 that the proposed approach provides a unified framework for clustering based algorithms for GMR; 2) we develop an iterative algorithm for minimizing the entropic regularized composite transportation distance and establish its convergence; 3) we empirically demonstrate the effect of the cost function on the GMR performance.

## 2 Composite Transportation Distance and GMR

In this section, we define the composite transportation distance in Section 2.1. The GMR formulation and the corresponding numerical algorithms are given in Section 2.2. The analysis of the convergence of the algorithm is given in Section 2.3.

### 2.1 Composite Transportation Distance

We first introduce composite distances between finite Gaussian mixtures (Chen et al., 2017; Delon and Desolneux, 2019). The distance is based on the notion that the finite Gaussian mixtures are also discrete measures on the space of Gaussian distributions.

###### Definition 2 (Composite Transportation Distance).

Let and be two finite Gaussian mixtures with mixing distributions and where and are the component parameters. Let be a non-negative and measurable function,

 CFG:=[c(Φ(⋅;Fn),Φ(⋅;Gm))]nm∈RN×M+ (1)

and

 Π(wF,wG)={π∈RN×M+:π1M=wF,πT1N=wG}.

The composite transportation distance between two Gaussian mixtures is defined as

 Ic(Φ(⋅;F),Φ(⋅;G))=infπ∈Π(wF,wG)⟨π,CFG⟩ (2)

where is the matrix inner product.

In the above definition, is called the transportation plan. When the transportation cost function

has a closed-form and the orders of the mixtures are not large, a linear programming approach can compute this distance efficiently

(Peyré et al., 2017). We now give two examples of the cost function under finite Gaussian mixtures.

###### Example 1 (Composite Wasserstein Distance).

The squared -Wasserstein distance between Gaussian measures may serve as a cost function:

 c(Φ(⋅;μ1,Σ1),Φ(⋅;μ2,Σ2)) :=W22(Φ(⋅;μ1,Σ1),Φ(⋅;μ2,Σ2)) =∥μ1−μ2∥2+tr(Σ1+Σ2−2(Σ1/21Σ2Σ1/21)1/2)
###### Example 2 (Composite KL Divergence).

The KL divergence between two Gaussian measures may serve as a cost function:

 c(Φ(⋅;μ1,Σ1),Φ(⋅;μ2,Σ2)) :=DKL(Φ(⋅;μ1,Σ1)∥Φ(⋅;μ2,Σ2)) =12{tr(Σ−12Σ1)+log|2πΣ2|−d}−logϕ(μ1;μ2,Σ2).

In both cases, given and , the costs go to infinity both when

goes to infinity or when the largest eigenvalue of

goes to infinity. Thus, the space of satisfying is compact by allowing zero eigenvalues in Example 1.

As is the case with the traditional optimal transportation distance, when the orders of the mixtures get larger, the computation of the composite transportation distance becomes more and more expensive. To solve the computational issue, Cuturi and Doucet (2014) considers an entropic regularized version of transportation distance that gives an approximate solution to the original transportation problem but can be computed much faster. Under the Gaussian mixture, we adopt the idea in Cuturi and Doucet (2014) and define the entropic regularized distance as follows.

###### Definition 3 (Entropic Regularized Composite Transportation Distance).

Let , , and be the same as before. Let . The entropic regularized composite transportation distance between two Gaussian mixtures is defined to be

for some regularization strength parameter and with given by (1).

When , it reduces to the composite transportation distance. We will highlight their difference when necessary.

### 2.2 GMR Formulation and Algorithm

Given a -component Gaussian mixture , GMR aims at reducing it to a -component mixture with . After specifying the cost function , we propose a GMR solution to be

 G:=arginfG∈GMIλc(Φ(⋅;F),Φ(⋅;G)) (3)

where is the space of all mixing distributions with at most distinct support points. Since is given and fixed, for simplicity of notation, we drop in our notation and denote as .

We now prescribe a generic numerical algorithm for (3). By removing a redundant constraint on the transportation plan , we find a simpler numerical approach. The following theorem gives an equivalence result on the optimization problems, which leads to an efficient algorithm.

###### Theorem 1.

Let and be two mixing distributions,

 Π1(wF)={π∈RN×M+:π1M=wF},
 πG=arginfπ∈Π1(wF){⟨π,CFG⟩−λH(π)},

and

 Jλc(G):=infπ∈Π1(wF){⟨π,CFG⟩−λH(π)}.

Then

 infG∈GMIλc(G)=infG∈GM, wG=πTG1NJλc(G).

The proof of the theorem is given in Section A.2 of the appendix. Based on this theorem, we now focus on the optimization problem

 G=arginfG∈GM, wG=πTG1NJλc(G). (4)

Compared with the original objective function, the transportation plan in the new objective function is required to satisfy only one marginal constraint. The optimal transportation plan is unique and has an analytical form. This result enables us to develop a computationally efficient Majorization-Minimization (MM) algorithm (Hunter and Lange, 2004) as described in Algorithm 1.

For the ease of understanding, we give a brief overview of MM here.

###### Definition 4 (Majorization function).

The function is said to majorize a real-valued function at the point provided

 g(x|xt)≥f(x) for all x, g(xt|xt)=f(xt).

Starting at an initial value , the MM algorithm minimizes the majorization function as a surrogate for the true objective function, and iteratively updates the solution by

 xt+1=argminxg(x|xt)

until the surrogate converges. This iterative procedure guarantees  (Hunter and Lange, 2004). The key to the success of the MM algorithm is to find a majorization function that is much easier to minimize than the original objective function.

We now design a MM algorithm for in (4). Let be the mixing distribution obtained at the th MM iteration. Our MM algorithm is an iterative scheme given in the following majorization and minimization steps:

Majorization step Let

 π(t)=arginfπ∈Π1(wF){⟨π,CFGt⟩−λH(π)}. (6)

Define a majorization function of at to be

 L(G|Gt)=⟨π(t),CFG⟩−λH(π(t)).

Minimization step Solve for .

It is obvious that majorizes at . The in (6) has an analytical solution via the method of Lagrange multipliers (see Section A.2 in the appendix). The surrogate function has the nice property of allowing us to update component parameters and mixing weights in separately. More specifically, given , the weight parameters are updated by

 wmGt+1=∑nπnm(t). (7)

The component parameters are updated as solutions to the following optimization problem, one component at a time and possibly in parallel:

 Gmt+1=arginfθ∑nπnm(t)c(Φ(⋅;Fn),Φ(⋅;θ)). (8)

Solving (8) is also called barycenter problem (Cuturi and Doucet, 2014). For some specific , the barycenter has a closed-form solution. Let and , their Wasserstein barycenter and KL barycenter are given as follows.

###### Example 3 (Wasserstein Barycenter of Gaussian Measures).

The minimizer of is unique and a Gaussian measure with mean and the covariance is the unique positive definite root of the matrix equation

 M∑m=1λm(Σ1/2ΣmΣ1/2)1/2=Σ. (9)

See Agueh and Carlier (2011).

###### Example 4 (KL Barycenter of Gaussian Measures).

The minimizer of when is constrained to be a Gaussian measure is a Gaussian distribution with mean and the covariance

 ¯Σ=M∑m=1λm(Σm+(μm−¯μ)(μm−¯μ)T). (10)

See Section A.1 in the appendix for a proof.

Different cost functions lead to different Gaussian barycenters. Hence, the Wasserstein and KL barycenters of Gaussian measures may differ markedly. An example is given in Figure 1 which depicts the covariance matrices of the barycenters of four 2-dimensional Gaussian measures arranged with respect to different values.

### 2.3 Convergence of the MM Algorithm

Recall that is the output of the th MM iteration. As a direct result of MM algorithm, we have . Moreover, it is obvious that for all . Hence, the monotonic and non-negative sequence must converge to some limit . The convergence of is not as obvious. We give the following theorem with its proof in the appendix.

###### Theorem 2.

Let be the set of stationary points of and be the sequence generated by for some initial value . Suppose the following conditions:

• is compact;

• for all ;

• is a closed point-to-set map over the complement of

are satisfied, then all the limit points of are stationary points of , and the sequence converges monotonically to for some .

If the cost function is chosen as either the -Wasserstein distance or the KL divergence, the monotonicity of the MM iteration implies that the component parameters are confined in a compact space. Similar to Wu and others (1983), we are able to verify that all conditions in this theorem hold. Hence, our algorithm converges.

## 3 Comparison to Related Work

In this section, we show that our approach includes many clustering based approaches for GMR in the literature as special cases. The GMR is formulated as a clustering problem in the probability measure space (Schieferdecker and Huber, 2009; Yu et al., 2018). The clustering based GMR approaches are iterative procedures that involve the following two steps: (1) assignment step where the Gaussian components of the original mixture are partitioned into clusters and (2) update step

where the centriod of clusters are updated based on the components in the new clusters. As in the case of clustering in the vector space, the clustering based algorithms for GMR can also be classified into

hard clustering and soft clustering algorithms. We discuss these two cases separately.

Hard Clustering  Schieferdecker and Huber (2009) proposes a procedure for GMR which is summarized in Algorithm 2 following the k-means algorithm. The same idea is also proposed in Goldberger and Roweis (2005); Davis and Dhillon (2007). At each step of the algorithm, given the current cluster centers, the original mixture components are partitioned into disjoint clusters based on their KL-divergence to cluster centers. We say the th component in the original mixture is assigned to the th cluster if its KL-divergence to the th cluster center is the smallest. Then the Gaussian components in the same cluster are merged to a single Gaussian via moment matching and this Gaussian is the updated cluster center. These updated cluster centers are the components of the reduced mixture. The KL-divergence is used to measure the similarity of Gaussian components of the mixtures, it can be replaced by other similarity measures. For example,  Assa and Plataniotis (2018) replaces the KL-divergence with the -Wasserstein distance between Gaussians and updates cluster center via Wasserstein barycenter. These hard clustering algorithms seem to work well in practice, khowever, the convergence of these algorithms is not discussed. Indeed, it is not guaranteed that their objective function, the distance between the original and reduced mixture, is monotonically decreasing after each iteration.

Soft Clustering A straightforward approach for GMR is to generate samples from the original -component mixture and fit a -component mixture based on these samples. That is, let and maximize the sample log-likelihood . The estimator for the mixing distribution gets more and more accurate as the sample size increases. When

, based on the law of large numbers, the sample log-likelihood converges to the population log-likelihood

 ℓ(G)=EX∼ϕ(x;F){logϕ(X;G)}.

Therefore, we could maximize the population log-likelihood and use its maximizer as a solution to GMR. Following a similar idea, Yu et al. (2018) proposes to perform GMR by maximizing

 ℓI(G)=Eϕ(x1,⋯,xI;F){logϕ(x1,⋯,xI;G)}.

However, they fail to observe that and that maximizing is equivalent to maximizing for any . Hence, in terms of maximizing population log-likelihood, this hyper-parameter is redundant. Regardless, as neither nor its gradient is tractable, Yu et al. (2018) proposes to maximize the variational lower bound of . They suggest that

 ℓI(G)≥N∑n=1wnFM∑m=1znm{logwmGznm+IEnm} (12)

for any and such that . This inequality seems false, which can be checked numerically. By Jensen’s inequality, we may instead get

 ℓI(G)≥IN∑n=1wnFM∑m=1znm{logwmGznm+Enm}

which is smaller than the right hand side in (12).

Although (12) does not perfectly bound , regarding it as a lower bound still leads to an effective numerical algorithm in Yu et al. (2018). For a given and , maximizing  (12) with respect to gives the tightest “lower bound” of in this form. Hence, an algorithm is feasible by iteratively updating and as proposed by Yu et al. (2018), which is summarized as Algorithm 3. Since can be interpreted as assigning a fraction of the th component in to the th cluster in , it is regarded as a soft clustering algorithm. In addition, the size of controls the “hardness” of the assignment in the clustering step. As , this soft clustering becomes the hard clustering and reduces to the algorithm proposed by Schieferdecker and Huber (2009).  Yu et al. (2018) chooses .

Connection with Known Algorithms Our algorithm may be viewed as solving a pure optimization problem as in Section 2.2. From the clustering point of view, it includes both hard and soft clustering based algorithms as special cases. With some specific choices of cost functions and , our algorithm covers various clustering based algorithms in the literature as summarized in Table 1. Moreover, these clustering based algorithms are computationally efficient for GMR as shown in  Schieferdecker and Huber (2009). We explain their connection in detail as follows.

Assignment Step The optimal transportation plan in our algorithm is to assign the Gaussian components of the original mixture to clusters. When we put , our algorithm becomes a hard clustering procedure because the optimal transportation plan will assign the entire th component in the original mixture to the “nearest” cluster. Consequently, the destination cluster is completely decided by the cost function. More specifically, letting the cost function be the KL-divergence from the Gaussian component to the destination cluster center, our algorithm covers that of Schieferdecker and Huber (2009); letting the cost function be the -Wasserstein distance between Gaussians (W2 for short), our algorithm covers Assa and Plataniotis (2018). When we put , the proposed algorithm becomes a soft clustering procedure. In this case, the optimal transportation plan must have for all , due to the regularization term, see (5). It is therefore clear that may be regarded as the fraction of the th component assigned to the th cluster. Let , then  (5) becomes

 πnmwnF =exp(−Cnm/λ)∑kexp(−Cnk/λ) ={wmG}1/λexp{(I/λ)Enm}∑j{wjG}1/λexp{(I/λ)Enj}

which is the same of Yu et al. (2018) when . For this cost function, we also have

 Cnm/λ=−λ−1logwmG−(I/λ)Enm=−λ−1logwmG+(I/λ)DKL% (Φ(⋅;Fn)∥Φ(⋅;Gm))+C.

which is the weighted average of the distance between two sets of mixing proportions and the KL-divergence between the Gaussian components. Hereafter, we call the defined in this equation the modified KL-divergence (MKL). The regularization strength and the hyper-parameter play a role in the assignment step as and determine the importance of the mixing weights and the KL-divergence. For a fixed , as , which means each component in the original mixture is split equally to every cluster. For a fixed , the larger the value of , the stronger the effect of KL-divergence in similarity measure. Recall that is redundant from the maximum population log-likelihood point of view, however, it plays an important role from the minimum composite transportation distance point of view. The effect of the value of and on the GMR is illustrated by the experiment in Section 4.2.

Update Step Our algorithm updates the cluster center by the barycenter with respect to the chosen cost function. When the KL-divergence (or MKL) is chosen as the cost function, the KL barycenter is the solution to (10). Interestingly, we find that this solution is the same as the moment matching solution given in (11). Therefore, even though we update the cluster centers by barycenters, the update step in our algorithm reduces to that in Schieferdecker and Huber (2009); Yu et al. (2018). Similarly, by choosing the W2 distance between Gaussian components as the cost function, the cluster centers are updated according to (9) which reduces to the update step in Assa and Plataniotis (2018).

## 4 Experiments

In this section, we first use the belief propagation example to illustrate the use of GMR. We also study the effect of the cost functions on the performance of GMR. The code for the experiments can be found at https://github.com/SarahQiong/CTDGMR.

### 4.1 GMR for Belief Propagation

This experiment illustrates the effectiveness of GMR when applied to finding a lower order mixture approximation to the message in belief propagation. A brief introduction to belief propagation and corresponding notation is given in Section B.2 in the appendix. We consider the graphical model in Figure 1(a) following Yu et al. (2018). In this model, the local potential associated with the th edge is given by , where values are marked alongside the edges. The local evidence potential associated with the th node is a two-component Gaussian mixture with and . Under this setup, the message is conceptually a finite mixture density function which has an explicit expression. Exact inference is computationally feasible for the first iterations. This is because the number of mixture components in the message grows exponentially with the number of iterations and the exact message/belief becomes intractable. To overcome this difficulty, one may use GMR to approximate the message with a lower order mixture before updating the belief in the next iteration, thereby confining the order below a manageable size. In this experiment, we use our proposed GMR algorithms to approximate the message with a mixture of order after each iteration when needed. This leads to approximate inferences. We evaluate the performance of the approximate inferences by the distance between the exact belief and the approximate beliefs (averaged over nodes). The comparison is computationally feasible for the first iterations due to limited computer memory. The results are averaged over 100 trials.

The average computational time of the exact and approximate inferences are shown in Figure 1(c). Clearly, the computational time of the exact inference increases exponentially with number of iterations. All proposed GMR algorithms are effective at saving computational time in belief propagation. The approximate beliefs are comparable to the exact belief in terms of the distance. It can be seen from Figure 1(d) that the distance is no larger than for all algorithms. Figure 3: The heat map of the density functions of the original mixture in (a) and reduced mixtures in (b), (c), and (d) grouped by regularization strength λ.

Although the distance increases from the first iteration to the second iteration, it does not increase further from the second to the third iteration. It is reasonable to conjecture that the approximate inference would not get worse in subsequent iterations. In terms of the distance, the approximate belief based on hard clustering with W2 cost function is closest to the exact belief. The density functions of the exact and approximate beliefs based on one of the trials are given in Figure 1(b). In this trial, the exact belief is a -component mixture while the approximate belief has only 16 components. The density functions of the exact and approximate beliefs are so close that we cannot tell them apart, for all cost functions and levels of regularization. In summary, the GMR is a useful technique for approximate inference in belief propagation.

### 4.2 Choice of Cost Functions & Hyper-parameters

Our proposed GMR algorithms only require the cost function to be continuous and non-negative. Therefore, the three choices in Table 1 are only a small fraction of countless possibilities. We have yet to study the Hellinger distance, distance and many other distances. An important factor in choosing the cost function is the computational simplicity of the barycenter. Afterwards, the quality of the GMR outcome is the ultimate concern. This experiment investigates how the cost functions and the hyper-parameter affect the GMR outcome in various situations. In summary, different cost functions and regularization strengths can lead to very different outcomes.

In this experiment, we choose three bivariate Gaussian mixtures to be reduced. The parameter values of these three Gaussian mixtures are given in Section B.3 in the appendix. The first one is of order 8 but has only 4 modes, and it is reduced to a mixture of order 4. The second mixture is of order 32 but has only 16 modes, and it is reduced to a mixture of order 16. The third mixture is of order 18 but only has 6 modes, and it is reduced to a mixture of order 6 or of order 12. We let , and experiment on three cost functions: KL, MKL, and W2. For MKL, we let . The combination leads to 12 GMR outcomes for each original mixture and order of the reduction. The original mixture and 12 GMR outcomes in each case are shown in the four rows of Figure 3 in the form of heat maps of their density functions.

The GMR based on MKL has the best overall performance when , regardless of values. The KL based GMR has comparable performance except for . The W2 based GMR has poor performance in general but works well for the second mixture when , and for reducing the third mixture to order when . In summary, different cost functions may perform well in different situations. Their performances are also heavily influenced by the choice of hyper-parameters. It is an interesting research problem to identify the most suitable cost functions together with the values of the hyper-parameters in various applications.

## 5 Discussion

In this work, we propose a principled GMR approach through the entropic regularized transportation distance between two mixtures. The proposed approach provides a unified framework that bridges the optimization and clustering based algorithms. The framework covers many existing methods with different choices of the cost function. Our GMR algorithms are computationally efficient by selecting cost functions with easy-to-compute barycenters. This framework is equally applicable to non-Gaussian finite mixtures. Experiments show that different cost functions can lead to very different outcomes. Hence, it is fruitful to search for the most suitable cost functions in different applications.

## References

• M. Agueh and G. Carlier (2011) Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis 43 (2), pp. 904–924. Cited by: Example 3.
• A. Assa and K. N. Plataniotis (2018) Wasserstein-distance-based Gaussian mixture reduction. IEEE Signal Processing Letters 25 (10), pp. 1465–1469. Cited by: §B.1, §1, Table 1, §3, §3, §3.
• Y. Chen, T. T. Georgiou, and A. Tannenbaum (2017) Optimal transport for Gaussian mixture models. arXiv preprint arXiv:1710.07876. Cited by: §2.1.
• M. Cuturi and A. Doucet (2014) Fast computation of Wasserstein barycenters. In

International Conference on Machine Learning

,
pp. 685–693. Cited by: §2.1, §2.2.
• J. V. Davis and I. S. Dhillon (2007) Differential entropic clustering of multivariate Gaussians. In Advances in Neural Information Processing Systems, pp. 337–344. Cited by: §1, Table 1, §3.
• J. Delon and A. Desolneux (2019) A Wasserstein-type distance in the space of Gaussian Mixture Models. arXiv preprint arXiv:1907.05254. Cited by: §2.1.
• R. Farnoosh and B. Zarpak (2008) Image segmentation using Gaussian mixture model. IUST international journal of engineering science 19 (1-2), pp. 29–32. Cited by: §1.
• J. Goldberger and S. T. Roweis (2005) Hierarchical clustering of a mixture model. In Advances in Neural Information Processing Systems, pp. 505–512. Cited by: §1, Table 1, §3.
• M. F. Huber and U. D. Hanebeck (2008) Progressive Gaussian mixture reduction. In 2008 11th International Conference on Information Fusion, pp. 1–8. Cited by: §1.
• D. R. Hunter and K. Lange (2004) A tutorial on MM algorithms. The American Statistician 58 (1), pp. 30–37. Cited by: §2.2, §2.2.
• S. Kolouri, G. K. Rohde, and H. Hoffman (2017) Sliced Wasserstein distance for learning Gaussian mixture models. arXiv preprint arXiv:1711.05376. Cited by: §1.
• N. Kostantinos (2000) Gaussian mixtures and their applications to signal processing. Advanced signal processing handbook: theory and implementation for radar, sonar, and medical imaging real time systems, pp. 3–1. Cited by: §1.
• D. G. Luenberger, Y. Ye, et al. (1984) Linear and nonlinear programming. Vol. 2, Springer. Cited by: §A.3.
• A. Manzar (2017) Recursive bayesian filtering through a mixture of gaussian and discrete particles. Master’s Thesis, Queen’s University. Cited by: §1.
• G. McLachlan and D. Peel (2004) Finite mixture models. John Wiley & Sons. Cited by: §1.
• G. Peyré, M. Cuturi, et al. (2017) Computational optimal transport. arXiv preprint arXiv:1803.00567. Cited by: §2.1.
• A. R. Runnalls (2007) Kullback-Leibler approach to Gaussian mixture reduction. IEEE Transactions on Aerospace and Electronic Systems 43 (3), pp. 989–999. Cited by: §B.1, §1.
• D. J. Salmond (1990) Mixture reduction algorithms for target tracking in clutter. In Signal and Data Processing of Small Targets 1990, Vol. 1305, pp. 434. Cited by: §B.1, §1.
• D. H. H. Santosh, P. Venkatesh, P. Poornesh, L. N. Rao, and N. A. Kumar (2013) Tracking multiple moving objects using Gaussian mixture model. International Journal of Soft Computing and Engineering (IJSCE) 3 (2), pp. 114–119. Cited by: §1.
• D. Schieferdecker and M. F. Huber (2009) Gaussian mixture reduction via clustering. In 2009 12th International Conference on Information Fusion, pp. 1536–1543. Cited by: §1, Table 1, §3, §3, §3, §3, §3, §3, Algorithm 2.
• D. W. Scott and W. F. Szewczyk (2001) From kernels to mixtures. Technometrics 43 (3), pp. 323–335. Cited by: §1.
• E. B. Sudderth, A. T. Ihler, M. Isard, W. T. Freeman, and A. S. Willsky (2010) Nonparametric belief propagation. Communications of the ACM 53 (10), pp. 95–103. Cited by: §1.
• N. Vasconcelos and A. Lippman (1999) Learning mixture hierarchies. In Advances in Neural Information Processing Systems, pp. 606–612. Cited by: §1.
• J. L. Williams and P. S. Maybeck (2006) Cost-function-based hypothesis control techniques for multiple hypothesis tracking. Mathematical and Computer Modelling 43 (9-10), pp. 976–989. Cited by: §B.1, §1.
• C. J. Wu et al. (1983) On the convergence properties of the EM algorithm. The Annals of statistics 11 (1), pp. 95–103. Cited by: §2.3.
• L. Yu, T. Yang, and A. B. Chan (2018) Density-preserving hierarchical EM algorithm: Simplifying Gaussian mixture models for approximate inference. IEEE transactions on pattern analysis and machine intelligence 41 (6), pp. 1323–1337. Cited by: §1, Table 1, §3, §3, §3, §3, §3, §4.1, Algorithm 3.
• K. Zhang and J. T. Kwok (2010) Simplifying mixture models through function approximation.

IEEE Transactions on Neural Networks

21 (4), pp. 644–658.
Cited by: §1.

## Appendix A Theoretical Results

This section contains the proofs of the theoretical results.

### a.1 KL Barycenter of Gaussian Measures

###### Theorem 3.

Let be a multivariate Gaussian measure of dimension with mean and covariance matrix , . Let . Define the function

 f(η)=M∑m=1λmDKL(νm∥η)

on the space of multivariate Gaussian measures of dimension .

Then, is minimized when is a multivariate Gaussian measure of dimension with mean

 ¯μ=M∑m=1λmμm

and the covariance

 ¯Σ=M∑m=1λm(Σm+(μm−¯μ)(μm−¯μ)T).
###### Proof.

By some simple algebra, we find the KL-divergence between two Gaussian measures is given by

 DKL(Φ(⋅;μ1,Σ1)∥Φ(⋅;μ2,Σ2))=12{log|Σ2||Σ1|+tr(Σ−12Σ1)+(μ2−μ1)TΣ−12(μ2−μ1)−d}

where is the determinant of the matrix. Therefore, we can write

 f(η)=f(μ,Σ)=12M∑m=1λm{log|Σ||Σm|+tr(Σ−1Σm)+(μ−μm)TΣ−1(μ−μm)}+C

for some constant . We now use the following linear algebra formulas

 ∂log|X|∂X=(X−1)T=(XT)−1,
 ∂tr(AX−1B)∂X=−(X−1BAX−1)T,

and

 ∂∂x(x−s)TW(x−s)=2W(x−s)

to work out partial derivatives of with respect to and . They are given by

 ∂f∂μ =2M∑m=1λmΣ−1(μ−μm); ∂f∂Σ

Setting both partial derivatives to , we obtain

 ¯μ=M∑m=1λmμm

and the covariance

 ¯Σ=M∑m=1λm(Σm+(μm−¯μ)(μm−¯μ)T).

Clearly, these solutions are the mean and covariance matrix of that minimizes . This completes the proof. ∎

### a.2 Results on Numerical Algorithm

Recall that the entropic regularized composite transportation distance is defined to be

 Iλc(G)=infπ∈Π(wF,wG){⟨π,CFG⟩−λH(π)}.

It involves a constrained optimization problem with respect to both marginal measures of . The equivalence result in the following theorem reduces the constraint to a single marginal measure and enables us to design an efficient algorithm.

###### Theorem 4.

Let and be two mixing distributions. Define

 Jλc(G):=infπ∈Π1(wF){⟨π,CFG⟩−λH(π)}

where

 Π1(wF)={π∈RN×M+:π1M=wF}.

Denote

 πG=arginfπ∈Π1(wF){⟨π,CFG⟩−λH(π)},

then

 infG∈GMIλc(G)=infG∈GM, wG=πTG1NJλc(G). (13)
###### Proof.

We proceed to show that both LHS RHS and LHS RHS are true for two sides in (13).

We now prove LHS RHS. Let , then . Hence,

 πG∗∈Π(wF,wG∗)and% RHS=arginfG∈GM,wG=πTG1NJλc(G)=Jλc(G⋆).

Therefore, following the definition of , we have

 Iλc(G∗)=infπ∈Π(wF,wG∗){⟨π,CFG⋆⟩−λH(π)}≤{⟨πG∗,CFG⋆⟩−λH(πG∗)}=Jλc(G∗)

Therefore,

 LHS=infG∈GMIλc(G)≤Iλc(G⋆)≤Jλc(G⋆)=RHS.

This proves LHS RHS. Next, we prove LHS RHS.

Let and . Hence,

 π∗∈Π1(wF)andLHS=infG∈GMIλc(G)=Iλc(G⋆).

It is seen that (a)