# Gaussian Mixture Reduction Using Reverse Kullback-Leibler Divergence

We propose a greedy mixture reduction algorithm which is capable of pruning mixture components as well as merging them based on the Kullback-Leibler divergence (KLD). The algorithm is distinct from the well-known Runnalls' KLD-based method since it is not restricted to merging operations. The capability of pruning (in addition to merging) gives the algorithm the ability to preserve the peaks of the original mixture during the reduction. Analytical approximations are derived to circumvent the computational intractability of the KLD, which results in a computationally efficient method. The proposed algorithm is compared with Runnalls' and Williams' methods in two numerical examples, using both simulated and real-world data. The results indicate that the performance and computational complexity of the proposed approach make it an efficient alternative to existing mixture reduction methods.


## I Introduction

Mixture densities appear in various problems of estimation theory. The existing solutions for these problems often require efficient strategies to reduce the number of components in the mixture representation because of computational limits. For example, time series problems can involve an ever increasing number of components in a mixture over time. The Gaussian sum filter [1] for nonlinear state estimation, Multi-Hypotheses Tracking (MHT) [2] and the Gaussian Mixture Probability Hypothesis Density (GM-PHD) filter [3] for multiple target tracking can be listed as examples of algorithms which require mixture reduction (abbreviated as MR in the sequel) in their implementation.

Several methods have been proposed in the literature addressing the MR problem. In [4] and [5], the components of the mixture were successively merged in pairs to minimize a cost function. A Gaussian MR algorithm using homotopy to avoid local minima was suggested in [6]. Merging statistics for greedy MR for multiple target tracking were discussed in [7]. A Gaussian MR algorithm using clustering techniques was proposed in [8]. In [9], Crouse et al. presented a survey of Gaussian MR algorithms such as West's algorithm [10], constraint optimized weight adaptation [11], Runnalls' algorithm [12] and Gaussian mixture reduction via clustering [8] and compared them in detail.

Williams and Maybeck [13] proposed using the Integral Square Error (ISE) approach for MR in the multiple hypothesis tracking context. One distinctive feature of the method is the availability of exact analytical expressions for evaluating the cost function between two Gaussian mixtures. In contrast, Runnalls [12] proposed using an upper bound on the Kullback-Leibler divergence (KLD) as a distance measure between the original mixture density and its reduced form at each step of the reduction. The motivation for the choice of an upper bound in Runnalls' algorithm is based on the premise that the KLD between two Gaussian mixtures cannot be calculated analytically. Runnalls' approach aims to minimize the KLD from the original mixture to the approximate one, which we refer to here as the Forward-KLD (FKLD). This choice of cost function results in an algorithm which reduces the number of components only by merging them at each reduction step.

In this paper, we propose a KLD-based MR algorithm. Our aim is to find an efficient method for minimizing the KLD from the approximate mixture to the original one, which we refer to as the Reverse-KLD (RKLD). The resulting algorithm has the ability to choose between pruning or merging components at each step of the reduction, unlike Runnalls' algorithm. This enables, for example, the possibility to prune low-weight components while keeping the heavy-weight components unaltered. Furthermore, we present approximations which are required to overcome the analytical intractability of the RKLD between Gaussian mixtures, making the implementation fast and efficient.

The rest of this paper is organized as follows. In Section II we present the necessary background required for the MR problem we intend to solve in this paper. Two of the most relevant works and their strengths and weaknesses are described in Section III. The proposed solution and its strengths are presented in Section IV. Approximations for the fast computation of the proposed divergence are given in Section V. The proposed MR algorithm using the approximations is evaluated and compared to the alternatives on two numerical examples in Section VI. The paper is concluded in Section VII.

## II Background

A mixture density is a convex combination of (more basic) probability densities, see e.g. [14]. A normalized mixture with $N$ components is defined as

$$p(x)=\sum_{I=1}^{N} w_I\, q(x;\eta_I), \qquad (1)$$

where the terms $w_I$ are positive weights summing up to unity, and $\eta_I$ are the parameters of the component density $q(\cdot;\eta_I)$.

The mixture reduction problem (MRP) is to find an approximation of the original mixture using fewer components. Ideally, the MRP is formulated as a nonlinear optimization problem where a cost function measuring the distance between the original and the approximate mixture is minimized. The optimization problem is solved by numerical solvers when it is not analytically tractable. The numerical optimization based approaches can be computationally quite expensive, in particular for high dimensional data, and they generally suffer from the problem of local optima [9, 6, 13]. Hence, a common alternative solution has been the greedy iterative approach.

In the greedy approach, the number of components in the mixture is reduced one at a time. By applying the same procedure repeatedly, a desired number of components is reached. To reduce the number of components by one, a decision has to be made between two types of operations, namely, the pruning and the merging operations. Each of these operations is considered to be a hypothesis in a greedy MRP and is denoted as $H_{IJ}$, $0 \le I \le N$, $1 \le J \le N$, $I \neq J$, where $I=0$ corresponds to pruning.

#### II-1 Pruning

Pruning is the simplest operation for reducing the number of components in a mixture density. It is denoted with the hypothesis $H_{0J}$ in the sequel. In pruning, one component of the mixture is removed and the weights of the remaining components are rescaled such that the mixture integrates to unity. For example, choosing the hypothesis $H_{0J}$, i.e., pruning component $J$ from (1), results in the reduced mixture

$$\hat{p}(x|H_{0J})\triangleq\frac{1}{1-w_J}\sum_{\substack{I=1\\ I\neq J}}^{N} w_I\, q(x;\eta_I). \qquad (2)$$
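Concretely, the pruning step (2) amounts to dropping one component and renormalizing the remaining weights. A minimal sketch in Python (the function and variable names are ours, not from the paper):

```python
def prune(weights, params, J):
    """Prune component J from a mixture (eq. (2)): drop it and rescale
    the remaining weights by 1/(1 - w_J) so they sum to one again."""
    wJ = weights[J]
    new_weights = [w / (1.0 - wJ) for i, w in enumerate(weights) if i != J]
    new_params = [p for i, p in enumerate(params) if i != J]
    return new_weights, new_params

# Example: prune the third component of a 3-component mixture.
w, eta = prune([0.5, 0.3, 0.2], [(-1.0, 1.0), (0.0, 1.0), (2.0, 1.0)], J=2)
print(w)  # [0.625, 0.375]
```

Note that the relative proportions of the surviving components are unchanged; only the normalization is restored.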

#### II-2 Merging

The merging operation approximates a pair of components in a mixture density with a single component of the same type. It is denoted with the hypothesis $H_{IJ}$, $1 \le I \neq J \le N$, in the sequel. In general, an optimization problem minimizing a divergence between the normalized pair of the mixture and the single component is used for this purpose. Choosing the FKLD as the cost function for merging two components leads to a moment matching operation. More specifically, if the hypothesis $H_{IJ}$ is selected, i.e., if the components $I$ and $J$ are chosen to be merged, the parameters $\eta_{IJ}$ of the merged component are found by minimizing the divergence from the normalized two-component kernel to a single component as follows:

$$\eta_{IJ}=\operatorname*{arg\,min}_{\eta}\; D_{KL}\big(\hat{w}_I\, q(x;\eta_I)+\hat{w}_J\, q(x;\eta_J)\,\big\|\, q(x;\eta)\big),$$

where $\hat{w}_I \triangleq w_I/(w_I+w_J)$, $\hat{w}_J \triangleq w_J/(w_I+w_J)$, and $D_{KL}(p\|q)$ denotes the KLD from $p$ to $q$, which is defined as

$$D_{KL}(p\|q)\triangleq\int p(x)\log\frac{p(x)}{q(x)}\,dx. \qquad (3)$$

The minimization of the above cost function usually results in a single component several of whose moments are matched to those of the two-component mixture. The reduced mixture after merging is then given as

$$\hat{p}(x|H_{IJ})\triangleq\sum_{\substack{K=1\\ K\neq I,\,K\neq J}}^{N} w_K\, q(x;\eta_K)+(w_I+w_J)\, q(x;\eta_{IJ}). \qquad (4)$$
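For scalar Gaussian components, the FKLD-optimal merge reduces to matching the mean and variance of the normalized pair. A sketch of this moment matching step (names are ours; the scalar formulas follow from matching the first two moments):

```python
def merge_moment_match(wI, muI, varI, wJ, muJ, varJ):
    """Merge two weighted scalar Gaussians into one by moment matching."""
    w = wI + wJ
    a, b = wI / w, wJ / w          # normalized weights (the w-hats in the text)
    mu = a * muI + b * muJ         # matched mean
    var = a * (varI + muI**2) + b * (varJ + muJ**2) - mu**2  # matched variance
    return w, mu, var

# With w1 = w2 = 0.5 and means at -3 and 3 (unit variances), the merged
# component is N(x; 0, 1 + 4*w1*w2*3^2) = N(x; 0, 10), cf. eq. (11).
print(merge_moment_match(0.5, -3.0, 1.0, 0.5, 3.0, 1.0))  # (1.0, 0.0, 10.0)
```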

There are two different types of greedy approaches in the literature, namely, local and global approaches. The local approaches consider only the merging hypotheses $H_{IJ}$, $1 \le I \neq J \le N$. In general, the merging hypothesis which provides the smallest divergence is selected. This divergence considers only the components to be merged and neglects the others; therefore these methods are called local. Well-known examples of local approaches are given in [15, 16].

In the global approaches, both pruning and merging operations are considered, and the divergence between the original and the reduced mixtures, i.e., $D\big(p(\cdot)\,\|\,\hat{p}(\cdot|H_{IJ})\big)$, is minimized in the decision. Because the decision criterion for a global approach involves all of the components of the original mixture, global approaches are in general computationally more costly. On the other hand, since the global properties of the original mixture are taken into account, they provide better performance. In the following, we propose a global greedy MR method that can be implemented efficiently.

## III Related work

In this section, we give an overview and discussion of two well-known global MR algorithms related to the current work.

### III-A Runnalls' Method

Runnalls' method [12] is a global greedy MR algorithm that minimizes the FKLD, i.e., $D_{KL}\big(p(\cdot)\,\|\,\hat{p}(\cdot|H_{IJ})\big)$. Unfortunately, the KLD between two Gaussian mixtures cannot be calculated analytically. Runnalls uses an analytical upper bound for the KLD which can only be used for comparing merging hypotheses. The upper bound, which is given as

$$B(I,J)\triangleq w_I\, D_{KL}\big(q(x;\eta_I)\,\|\,q(x;\eta_{IJ})\big)+w_J\, D_{KL}\big(q(x;\eta_J)\,\|\,q(x;\eta_{IJ})\big), \qquad (5)$$

is used as the cost of merging the components $I$ and $J$, where $q(x;\eta_{IJ})$ is the merged component density. Hence, the original global decision statistic for merging is replaced with its local approximation to obtain the decision rule as follows:

$$I^*,J^*=\operatorname*{arg\,min}_{1\le I\neq J\le N} B(I,J). \qquad (6)$$
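For scalar Gaussian components, the bound (5) can be evaluated in closed form because the KLD between individual Gaussians is analytic. A minimal sketch (our names; `kld_gauss` is the standard scalar Gaussian KLD):

```python
import math

def kld_gauss(m1, v1, m2, v2):
    """Closed-form D_KL(N(m1,v1) || N(m2,v2)) for scalar Gaussians."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def runnalls_cost(wI, muI, varI, wJ, muJ, varJ):
    """Upper bound B(I,J) of eq. (5), using the moment-matched merged component."""
    w = wI + wJ
    mu = (wI * muI + wJ * muJ) / w
    var = (wI * (varI + muI**2) + wJ * (varJ + muJ**2)) / w - mu**2
    return wI * kld_gauss(muI, varI, mu, var) + wJ * kld_gauss(muJ, varJ, mu, var)

# Merging identical components costs nothing; the cost grows with separation.
print(runnalls_cost(0.5, 0.0, 1.0, 0.5, 0.0, 1.0))  # 0.0
```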

### III-B Williams' Method

Williams and Maybeck proposed a global greedy MR algorithm in [13] where the ISE is used as the cost function. The ISE between two probability distributions $p$ and $q$ is defined by

$$D_{ISE}(p\|q)\triangleq\int |p(x)-q(x)|^2\,dx. \qquad (7)$$

The ISE has the properties of a metric, such as symmetry and the triangle inequality, and it is analytically tractable for Gaussian mixtures. Williams' method minimizes the ISE over all pruning and merging hypotheses, i.e.,

$$I^*,J^*=\operatorname*{arg\,min}_{\substack{0\le I\le N,\;1\le J\le N\\ I\neq J}} D_{ISE}\big(p(\cdot)\,\|\,\hat{p}(\cdot|H_{IJ})\big). \qquad (8)$$
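The closed-form evaluation behind Williams' method relies on the Gaussian product identity $\int \mathcal{N}(x;a,A)\,\mathcal{N}(x;b,B)\,dx = \mathcal{N}(a;b,A+B)$. A sketch for scalar Gaussian mixtures (names are ours):

```python
import math

def npdf(x, m, v):
    """Scalar Gaussian density N(x; m, v)."""
    return math.exp(-((x - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)

def ise(mix_p, mix_q):
    """D_ISE between two scalar Gaussian mixtures given as [(w, mu, var), ...],
    expanded as <p,p> - 2<p,q> + <q,q> via the Gaussian product identity."""
    def cross(a, b):
        return sum(wa * wb * npdf(ma, mb, va + vb)
                   for wa, ma, va in a for wb, mb, vb in b)
    return cross(mix_p, mix_p) - 2.0 * cross(mix_p, mix_q) + cross(mix_q, mix_q)

p = [(0.5, -3.0, 1.0), (0.5, 3.0, 1.0)]
print(ise(p, p))                         # 0.0 (a density has zero ISE to itself)
print(ise(p, [(1.0, 0.0, 10.0)]) > 0.0)  # True
```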

### III-C Discussion

In this subsection, we first illustrate the behavior of Runnalls' and Williams' methods on a very simple MR example. Second, we provide a brief discussion of the characteristics observed in the examples along with their implications.

###### Example 1.

Consider the Gaussian mixture with two components given below:

$$p(x)=w_1\,\mathcal{N}(x;-\mu,1)+w_2\,\mathcal{N}(x;\mu,1), \qquad (9)$$

where $w_1+w_2=1$ and $\mu>0$. We would like to reduce the mixture to a single component and hence we consider the two pruning hypotheses $H_{01}$, $H_{02}$ and the merging hypothesis $H_{12}$. The reduced mixtures under these hypotheses are given as

$$\hat{p}(x|H_{01})=\mathcal{N}(x;\mu,1), \qquad (10a)$$
$$\hat{p}(x|H_{02})=\mathcal{N}(x;-\mu,1), \qquad (10b)$$
$$\hat{p}(x|H_{12})=\mathcal{N}(x;\bar{\mu},\Sigma), \qquad (10c)$$

where $\bar{\mu}$ and $\Sigma$ are computed via moment matching as

$$\bar{\mu}=(w_2-w_1)\,\mu, \qquad (11a)$$
$$\Sigma=1+4\,w_1 w_2\,\mu^2. \qquad (11b)$$

Noting that the optimization problem

$$\mu^*,P^*=\operatorname*{arg\,min}_{\mu,P}\; D_{KL}\big(p(\cdot)\,\|\,\mathcal{N}(\cdot;\mu,P)\big) \qquad (12)$$

has the solution $\mu^*=\bar{\mu}$ and $P^*=\Sigma$, we see that the density $\hat{p}(\cdot|H_{12})$ is the best reduced mixture with respect to the FKLD. Similarly, Runnalls' method would select $H_{12}$ as the best mixture reduction hypothesis because it considers only the merging hypotheses.

###### Example 2.

We consider the same MR problem as in Example 1 with Williams' method. The ISE between the original mixture and the reduced mixture can be written as

$$D_{ISE}\big(p(\cdot)\,\|\,\hat{p}(\cdot|H_{IJ})\big)\overset{+}{=}\int \hat{p}(x|H_{IJ})^2\,dx-2\int p(x)\,\hat{p}(x|H_{IJ})\,dx, \qquad (13)$$

where the sign $\overset{+}{=}$ means equality up to an additive constant independent of the hypothesis $H_{IJ}$. Letting $\hat{p}(x|H_{IJ})=\mathcal{N}(x;\mu_{IJ},\Sigma_{IJ})$, we can calculate the ISE using (13) as

$$D_{ISE}\big(p(\cdot)\,\|\,\hat{p}(\cdot|H_{IJ})\big)\overset{+}{=}\mathcal{N}(\mu_{IJ};\mu_{IJ},2\Sigma_{IJ})-2w_1\,\mathcal{N}(\mu_{IJ};-\mu,1+\Sigma_{IJ})-2w_2\,\mathcal{N}(\mu_{IJ};\mu,1+\Sigma_{IJ}). \qquad (14)$$

Hence, for the hypotheses listed in (10), we have

$$\begin{aligned}
D_{ISE}\big(p(\cdot)\,\|\,\hat{p}(\cdot|H_{01})\big)&\overset{+}{=}(1-2w_2)\,\mathcal{N}(\mu;\mu,2)-2w_1\,\mathcal{N}(\mu;-\mu,2) && (15)\\
&=(1-2w_2)\,\mathcal{N}(0;0,2)-2w_1\,\mathcal{N}(\mu;-\mu,2), && (16)\\
D_{ISE}\big(p(\cdot)\,\|\,\hat{p}(\cdot|H_{02})\big)&\overset{+}{=}(1-2w_1)\,\mathcal{N}(-\mu;-\mu,2)-2w_2\,\mathcal{N}(-\mu;\mu,2) && (17)\\
&=(1-2w_1)\,\mathcal{N}(0;0,2)-2w_2\,\mathcal{N}(\mu;-\mu,2), && (18)\\
D_{ISE}\big(p(\cdot)\,\|\,\hat{p}(\cdot|H_{12})\big)&\overset{+}{=}\mathcal{N}(\bar{\mu};\bar{\mu},2\Sigma)-2w_1\,\mathcal{N}(\bar{\mu};-\mu,1+\Sigma)-2w_2\,\mathcal{N}(\bar{\mu};\mu,1+\Sigma) && (19)\\
&=\mathcal{N}(0;0,2\Sigma)-2w_1\,\mathcal{N}(2w_2\mu;0,1+\Sigma)-2w_2\,\mathcal{N}(2w_1\mu;0,1+\Sigma). && (20)
\end{aligned}$$

Restricting ourselves to the case where $w_1=w_2=\tfrac{1}{2}$ and $\mu$ is very large, we can see that

$$D_{ISE}\big(p(\cdot)\,\|\,\hat{p}(\cdot|H_{01})\big)\approx-\frac{1}{\sqrt{4\pi}}\,e^{-\mu^2}, \qquad (21)$$
$$D_{ISE}\big(p(\cdot)\,\|\,\hat{p}(\cdot|H_{02})\big)\approx-\frac{1}{\sqrt{4\pi}}\,e^{-\mu^2}, \qquad (22)$$
$$D_{ISE}\big(p(\cdot)\,\|\,\hat{p}(\cdot|H_{12})\big)\approx\frac{1}{\sqrt{2\pi}\,\mu}\left(\frac{1}{\sqrt{2}}-\frac{2}{\sqrt{e}}\right)\approx-0.51\,\frac{1}{\sqrt{2\pi}\,\mu}. \qquad (23)$$

For sufficiently large $\mu$ values, it is now easily seen that

$$D_{ISE}\big(p(\cdot)\,\|\,\hat{p}(\cdot|H_{12})\big)<D_{ISE}\big(p(\cdot)\,\|\,\hat{p}(\cdot|H_{01})\big)=D_{ISE}\big(p(\cdot)\,\|\,\hat{p}(\cdot|H_{02})\big). \qquad (24)$$

Hence, under the aforementioned conditions, the merging hypothesis $H_{12}$ is selected by Williams' method.

As illustrated in Example 1, the FKLD has a tendency towards the merging operation no matter how separated the components of the original mixture are. Similarly, Runnalls' method considers only the merging operations, ruling out the pruning hypotheses from the start. The significant preference towards the merging operation tends to produce reduced mixtures which may have significant support over regions where the original mixture has negligible probability mass. This is called the zero-avoiding behavior of the KLD in the literature [14, p. 470]. Such a tendency may be preferable in some applications such as minimum mean square error (MMSE) based estimation. On the other hand, it may also lead to a loss of important details of the original mixture, e.g., the mode, which might be less desirable in applications such as maximum a posteriori (MAP) estimation. In such applications, having the pruning operation as a preferable alternative might preserve significant information while at the same time keeping a reasonable level of overlap between the supports of the original and the reduced mixtures.

Example 2 illustrated that a similar tendency toward merging (when the components are far from each other) can appear with Williams' method in some specific cases (weights being equal). It must be mentioned here that, in Example 2, if the weights of the components were not the same, Williams' method would choose to prune the component with the smaller weight. Therefore, the tendency toward merging well-separated components is significantly weaker in Williams' method than in the FKLD and Runnalls' method. It is also important to mention that, in some target tracking algorithms such as MHT and GM-PHD filters, mixtures containing components with identical weights are commonly encountered.

Williams' method, being a global greedy approach to MR, is computationally quite expensive for mixture densities with many components. The computational burden results from the following facts. Reducing a mixture with $N$ components to a mixture with $N-1$ components involves $O(N^2)$ hypotheses. Since the computational load of calculating the ISE between mixtures of $N$ and $N-1$ components is $O(N^2)$, reducing the number of components by one has computational complexity $O(N^4)$ with Williams' method. On the other hand, using the upper bound (5), Runnalls' method avoids the computations associated with the components which are not directly involved in the merging operation, resulting in just $O(N^2)$ computations for the same reduction. Another disadvantage of Williams' method is that the ISE does not scale up with the dimension nicely, as pointed out in an example in [12].

## IV Proposed Method

We here propose a greedy global MR algorithm based on the KLD which can give credit to pruning operations and avoid merging (unlike Runnalls' method) when the components of the mixture are far away from each other. The MR method we propose minimizes the KLD from the reduced mixture to the original mixture, i.e., the RKLD. Hence we solve the following optimization problem:

$$I^*,J^*=\operatorname*{arg\,min}_{\substack{0\le I\le N,\;1\le J\le N\\ I\neq J}} D_{KL}\big(\hat{p}(x|H_{IJ})\,\|\,p(x)\big). \qquad (25)$$

By using the RKLD as the cost function, we aim to enable pruning and avoid the ever-merging behavior of Runnalls' method unless merging is found necessary. We illustrate the characteristics of this cost function in MR with the following examples.
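The effect of the cost in (25) can be checked numerically in one dimension. Below, a simple Riemann-sum approximation of the RKLD (the grid and example values are our choices) shows that, for the well-separated two-component mixture of Example 1 with equal weights, the merging hypothesis is costlier than pruning:

```python
import math

def npdf(x, m, v):
    return math.exp(-((x - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)

def rkld(p_hat, p, lo=-40.0, hi=40.0, n=8001):
    """Riemann-sum approximation of D_KL(p_hat || p) on [lo, hi]."""
    dx = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * dx
        a = p_hat(x)
        if a > 1e-300:  # skip points where the integrand underflows
            total += a * math.log(a / p(x)) * dx
    return total

mu = 4.0
p = lambda x: 0.5 * npdf(x, -mu, 1.0) + 0.5 * npdf(x, mu, 1.0)
prune_cost = rkld(lambda x: npdf(x, mu, 1.0), p)           # H01: prune one mode
merge_cost = rkld(lambda x: npdf(x, 0.0, 1.0 + mu**2), p)  # H12: moment-matched
print(prune_cost)               # close to log 2 ~ 0.693
print(merge_cost > prune_cost)  # True: RKLD penalizes merging separated modes
```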

###### Example 3.

We consider the same MR problem as in Examples 1 and 2 when $\mu$ is very small. When $\mu$ is sufficiently close to zero, we can express $D_{KL}\big(\hat{p}(x|H_{IJ})\,\|\,p(x)\big)$ using a second-order Taylor series approximation around $\mu=0$ as follows:

$$D_{KL}\big(\hat{p}(x|H_{IJ})\,\|\,p(x)\big)\approx\frac{1}{2}\,c_{IJ}\,\mu^2, \qquad (26)$$

where

$$c_{IJ}\triangleq\left.\frac{\partial^2}{\partial\mu^2}D_{KL}\big(\hat{p}(x|H_{IJ})\,\|\,p(x)\big)\right|_{\mu=0}. \qquad (27)$$

This is because, when $\mu=0$, both $p(x)$ and $\hat{p}(x|H_{IJ})$ are equal to $\mathcal{N}(x;0,1)$ and therefore $D_{KL}=0$. Since $D_{KL}$ is minimized at $\mu=0$, the first derivative of $D_{KL}$ also vanishes at $\mu=0$. The second derivative term is given by tedious but straightforward calculations as

$$c_{IJ}=\int\frac{\Big(\left.\frac{\partial}{\partial\mu}p(x)\right|_{\mu=0}-\left.\frac{\partial}{\partial\mu}\hat{p}(x|H_{IJ})\right|_{\mu=0}\Big)^2}{\mathcal{N}(x;0,1)}\,dx. \qquad (28)$$

Using the identity

$$\frac{\partial}{\partial\mu}\mathcal{N}(x;\mu,1)=(x-\mu)\,\mathcal{N}(x;\mu,1), \qquad (29)$$

we can now calculate $c_{01}$, $c_{02}$ and $c_{12}$ as

$$c_{01}=4w_1^2, \qquad (30a)$$
$$c_{02}=4w_2^2, \qquad (30b)$$
$$c_{12}=0. \qquad (30c)$$

Hence, when $\mu$ is sufficiently small, the RKLD cost function results in the selection of the merging operation, similar to the FKLD and Runnalls' method.

###### Example 4.

We consider the same MR problem as in Examples 1 and 2 when $\mu$ is very large. In the following, we calculate the RKLDs for the hypotheses given in (10).

• $H_{01}$ and $H_{02}$: We can write the RKLD for $H_{01}$ as

$$D_{KL}\big(\hat{p}(\cdot|H_{01})\,\|\,p(\cdot)\big)=-\mathbb{E}_{H_{01}}[\log p(x)]-H\big(\hat{p}(\cdot|H_{01})\big), \qquad (31)$$

where the notation $\mathbb{E}_{H_{IJ}}[\cdot]$ denotes the expectation of the argument with respect to the density $\hat{p}(\cdot|H_{IJ})$ and $H(\cdot)$ denotes the entropy of the argument density. Under the assumption that $\mu$ is very large, we can approximate the expectation as

$$\begin{aligned}
\mathbb{E}_{H_{01}}[\log p(x)]&\triangleq\int\mathcal{N}(x;\mu,1)\log\big(w_1\,\mathcal{N}(x;-\mu,1)+w_2\,\mathcal{N}(x;\mu,1)\big)\,dx && (32)\\
&\approx\int\mathcal{N}(x;\mu,1)\log\big(w_2\,\mathcal{N}(x;\mu,1)\big)\,dx && (33)\\
&=\log w_2-\log\sqrt{2\pi}-\frac{1}{2}, && (34)
\end{aligned}$$

where we used the fact that, over the effective integration range (around the mean $\mu$), we have

$$w_1\,\mathcal{N}(x;-\mu,1)\approx 0. \qquad (35)$$

Substituting this result and the entropy of $\mathcal{N}(x;\mu,1)$ into (31) (the entropy of a scalar Gaussian density $\mathcal{N}(x;\mu,\Sigma)$ is equal to $\frac{1}{2}\log(2\pi e\,\Sigma)$), we obtain

$$D_{KL}\big(\hat{p}(\cdot|H_{01})\,\|\,p(\cdot)\big)\approx-\log w_2. \qquad (36)$$

Using similar arguments, we can easily obtain

$$D_{KL}\big(\hat{p}(\cdot|H_{02})\,\|\,p(\cdot)\big)\approx-\log w_1. \qquad (37)$$

Noting that the logarithm function is monotonic, we can also see that the approximations given on the right hand sides of the above equations are upper bounds for the corresponding RKLDs.

• $H_{12}$: We now calculate a lower bound on $D_{KL}\big(\hat{p}(\cdot|H_{12})\,\|\,p(\cdot)\big)$ and show that this lower bound is greater than both $-\log w_1$ and $-\log w_2$ when $\mu$ is sufficiently large:

$$D_{KL}\big(\hat{p}(x|H_{12})\,\|\,p(x)\big)=-\mathbb{E}_{H_{12}}[\log p(x)]-H\big(\hat{p}(\cdot|H_{12})\big). \qquad (38)$$

We consider the following facts:

$$\mathcal{N}(x;-\mu,1)\ge\mathcal{N}(x;\mu,1)\quad\text{when }x\le 0, \qquad (39a)$$
$$\mathcal{N}(x;\mu,1)\ge\mathcal{N}(x;-\mu,1)\quad\text{when }x\ge 0. \qquad (39b)$$

Using the identities in (39), we can obtain a bound on the expectation as

$$\begin{aligned}
\mathbb{E}_{H_{12}}[\log p(x)]&\triangleq\mathbb{E}_{H_{12}}\big[\log\big(w_1\,\mathcal{N}(x;-\mu,1)+w_2\,\mathcal{N}(x;\mu,1)\big)\big]\\
&\le\int_{x\le 0}\mathcal{N}(x;\bar{\mu},\Sigma)\log\mathcal{N}(x;-\mu,1)\,dx+\int_{x>0}\mathcal{N}(x;\bar{\mu},\Sigma)\log\mathcal{N}(x;\mu,1)\,dx && (40)\\
&=-\int_{x\le 0}\mathcal{N}(x;\bar{\mu},\Sigma)\Big(\log\sqrt{2\pi}+\frac{(x+\mu)^2}{2}\Big)dx-\int_{x>0}\mathcal{N}(x;\bar{\mu},\Sigma)\Big(\log\sqrt{2\pi}+\frac{(x-\mu)^2}{2}\Big)dx && (41)\\
&=-\log\sqrt{2\pi}-\int_{x\le 0}\mathcal{N}(x;\bar{\mu},\Sigma)\,\frac{x^2+2\mu x+\mu^2}{2}\,dx-\int_{x>0}\mathcal{N}(x;\bar{\mu},\Sigma)\,\frac{x^2-2\mu x+\mu^2}{2}\,dx && (42)\\
&=-\log\sqrt{2\pi}-\frac{\bar{\mu}^2+\Sigma+\mu^2}{2}-\mu\int_{x\le 0}x\,\mathcal{N}(x;\bar{\mu},\Sigma)\,dx+\mu\int_{x>0}x\,\mathcal{N}(x;\bar{\mu},\Sigma)\,dx && (43)\\
&=-\log\sqrt{2\pi}-\frac{\bar{\mu}^2+\Sigma+\mu^2}{2}-\mu\int_{x\le 0}x\,\mathcal{N}(x;\bar{\mu},\Sigma)\,dx+\mu\Big(\bar{\mu}-\int_{x\le 0}x\,\mathcal{N}(x;\bar{\mu},\Sigma)\,dx\Big) && (44)\\
&=-\log\sqrt{2\pi}-\frac{\bar{\mu}^2+\Sigma+\mu^2}{2}+\mu\bar{\mu}-2\mu\int_{x\le 0}x\,\mathcal{N}(x;\bar{\mu},\Sigma)\,dx && (45)\\
&=-\log\sqrt{2\pi}-\frac{(\mu-\bar{\mu})^2+\Sigma}{2}-2\mu\bar{\mu}\,\Phi\Big(\frac{-\bar{\mu}}{\sqrt{\Sigma}}\Big)+2\mu\sqrt{\Sigma}\,\phi\Big(\frac{-\bar{\mu}}{\sqrt{\Sigma}}\Big), && (46)
\end{aligned}$$

where we have used the result

$$\int_{-\infty}^{\bar{x}}x\,\mathcal{N}(x;\mu,\Sigma)\,dx=\mu\,\Phi\Big(\frac{\bar{x}-\mu}{\sqrt{\Sigma}}\Big)-\sqrt{\Sigma}\,\phi\Big(\frac{\bar{x}-\mu}{\sqrt{\Sigma}}\Big). \qquad (47)$$

Here, the functions $\phi(\cdot)$ and $\Phi(\cdot)$ denote the probability density function and the cumulative distribution function of a Gaussian random variable with zero mean and unit variance, respectively. Substituting the upper bound (46) into (38), we obtain

$$\begin{aligned}
D_{KL}\big(\hat{p}(x|H_{12})\,\|\,p(x)\big)&\ge-\frac{1}{2}-\frac{1}{2}\log\Sigma+\frac{(\mu-\bar{\mu})^2+\Sigma}{2}+2\mu\bar{\mu}\,\Phi\Big(\frac{-\bar{\mu}}{\sqrt{\Sigma}}\Big)-2\mu\sqrt{\Sigma}\,\phi\Big(\frac{-\bar{\mu}}{\sqrt{\Sigma}}\Big) && (48)\\
&\approx-\frac{1}{2}-\frac{1}{2}\log(4w_1w_2\mu^2)+2\,g(w_1,w_2)\,\mu^2 && (49)
\end{aligned}$$

for sufficiently large $\mu$ values, where we used the definitions (11) of $\bar{\mu}$ and $\Sigma$, and the fact that, as $\mu$ tends to infinity, we have

$$-\frac{\bar{\mu}}{\sqrt{\Sigma}}\to\frac{w_1-w_2}{\sqrt{4w_1w_2}}. \qquad (50)$$

The coefficient $g(w_1,w_2)$ in (49) is defined as

$$g(w_1,w_2)\triangleq w_1+(w_2-w_1)\,\Phi\Big(\frac{w_1-w_2}{\sqrt{4w_1w_2}}\Big)-\sqrt{4w_1w_2}\,\phi\Big(\frac{w_1-w_2}{\sqrt{4w_1w_2}}\Big). \qquad (51)$$

As $\mu$ tends to infinity, the dominating term on the right hand side of (49) becomes the last term. By generating a simple plot, one can see that the function $g(w_1,w_2)$ is positive, which makes the right hand side of (49) go to infinity as $\mu$ tends to infinity. Consequently, the cost of the merging operation exceeds that of both pruning hypotheses for sufficiently large $\mu$ values. Therefore, the component having the minimum weight is pruned when the components are sufficiently separated.
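The pruning approximations (36) and (37) are easy to confirm numerically. A sketch using a Riemann sum (the grid and example weights are our choices):

```python
import math

def npdf(x, m, v):
    return math.exp(-((x - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)

def rkld(p_hat, p, lo=-40.0, hi=40.0, n=8001):
    """Riemann-sum approximation of D_KL(p_hat || p) on [lo, hi]."""
    dx = (hi - lo) / (n - 1)
    s = 0.0
    for i in range(n):
        x = lo + i * dx
        a = p_hat(x)
        if a > 1e-300:
            s += a * math.log(a / p(x)) * dx
    return s

w1, w2, mu = 0.3, 0.7, 5.0   # well-separated components
p = lambda x: w1 * npdf(x, -mu, 1.0) + w2 * npdf(x, mu, 1.0)
d01 = rkld(lambda x: npdf(x, mu, 1.0), p)   # prune component 1, cf. (36)
d02 = rkld(lambda x: npdf(x, -mu, 1.0), p)  # prune component 2, cf. (37)
print(d01, -math.log(w2))  # both close to -log(0.7)
print(d02, -math.log(w1))  # both close to -log(0.3)
```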

As illustrated in Example 4, when the components of the mixture are far away from each other, RKLD refrains from merging them. This property of RKLD is known as zero-forcing in the literature [14, p. 470].

The RKLD between two Gaussian mixtures is analytically intractable except for trivial cases. Therefore, to be able to use RKLD in MR, approximations are necessary just as in the case of FKLD with Runnalls’ method. We propose such approximations of RKLD for the pruning and merging operations in the following section.

## V Approximations for RKLD

In Sections V-A and V-B, specifically tailored approximations for the cost functions of the pruning and merging hypotheses are derived, respectively. Before proceeding further, we introduce a lemma which is used in the derivations.

###### Lemma 1.

Let $q_K$, $q_I$, and $q_J$ be three probability densities and let $w_I$ and $w_J$ be two positive real numbers. The following inequality holds:

$$\int q_K(x)\log\frac{q_K(x)}{w_I\,q_I(x)+w_J\,q_J(x)}\,dx\le-\log\Big(w_I\exp\big(-D_{KL}(q_K\|q_I)\big)+w_J\exp\big(-D_{KL}(q_K\|q_J)\big)\Big). \qquad (52)$$
###### Proof.

For the proof see Appendix -A. ∎
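As a sanity check, Lemma 1 can be verified numerically for scalar Gaussians, with the left-hand side of (52) computed by a Riemann sum and the KLDs on the right-hand side evaluated in closed form (all example values are ours):

```python
import math

def npdf(x, m, v):
    return math.exp(-((x - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)

def kld_gauss(m1, v1, m2, v2):
    """Closed-form D_KL(N(m1,v1) || N(m2,v2))."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

# q_K = N(0,1), q_I = N(1,1), q_J = N(-2,2); w_I, w_J > 0.
wI, wJ = 0.3, 0.7
qK = lambda x: npdf(x, 0.0, 1.0)
mix = lambda x: wI * npdf(x, 1.0, 1.0) + wJ * npdf(x, -2.0, 2.0)

lo, hi, n = -20.0, 20.0, 8001
dx = (hi - lo) / (n - 1)
lhs = sum(qK(lo + i * dx) * math.log(qK(lo + i * dx) / mix(lo + i * dx)) * dx
          for i in range(n) if qK(lo + i * dx) > 1e-300)
rhs = -math.log(wI * math.exp(-kld_gauss(0.0, 1.0, 1.0, 1.0))
                + wJ * math.exp(-kld_gauss(0.0, 1.0, -2.0, 2.0)))
print(lhs <= rhs + 1e-6)  # True: (52) holds for this example
```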

### V-A Approximations for pruning hypotheses

Consider the mixture density defined as

$$p(x)=\sum_{I=1}^{N}w_I\,q_I(x), \qquad (53)$$

where $q_I(x)\triangleq q(x;\eta_I)$. Suppose we reduce the mixture in (53) by pruning the $I$th component as follows:

$$\hat{p}(x|H_{0I})=\frac{1}{1-w_I}\sum_{i\in\{1,\dots,N\}\setminus\{I\}}w_i\,q_i(x). \qquad (54)$$

An upper bound can be obtained using the fact that the logarithm function is monotonically increasing:

$$\begin{aligned}
D_{KL}\big(\hat{p}(\cdot|H_{0I})\,\|\,p(\cdot)\big)&=\int\hat{p}(x|H_{0I})\log\frac{\hat{p}(x|H_{0I})}{p(x)}\,dx\\
&=\int\hat{p}(x|H_{0I})\log\frac{\hat{p}(x|H_{0I})}{(1-w_I)\,\hat{p}(x|H_{0I})+w_I\,q_I(x)}\,dx\\
&\le-\log(1-w_I)+\int\hat{p}(x|H_{0I})\log\frac{\hat{p}(x|H_{0I})}{\hat{p}(x|H_{0I})}\,dx\\
&=-\log(1-w_I). \qquad (55)
\end{aligned}$$

This upper bound is rather crude when the $I$th component density is close to other component densities in the mixture. Therefore we compute a tighter bound on $D_{KL}\big(\hat{p}(\cdot|H_{0I})\,\|\,p(\cdot)\big)$ using the log-sum inequality [17]. Before we derive the upper bound, we first define the following unnormalized density:

$$r(x)=\sum_{i\in\{1,\dots,N\}\setminus\{I,J\}}w_i\,q_i(x), \qquad (56)$$

where $J\in\{1,\dots,N\}$ and $J\neq I$.

We can rewrite the RKLD between $\hat{p}(\cdot|H_{0I})$ and $p(\cdot)$ as

$$\begin{aligned}
D_{KL}\big(\hat{p}(\cdot|H_{0I})\,\|\,p(\cdot)\big)&=\int\frac{1}{1-w_I}\big(r(x)+w_J\,q_J(x)\big)\log\frac{\frac{1}{1-w_I}\big(r(x)+w_J\,q_J(x)\big)}{r(x)+w_I\,q_I(x)+w_J\,q_J(x)}\,dx\\
&=-\log(1-w_I)+\frac{1}{1-w_I}\int\big(r(x)+w_J\,q_J(x)\big)\log\frac{r(x)+w_J\,q_J(x)}{r(x)+w_I\,q_I(x)+w_J\,q_J(x)}\,dx. \qquad (57)
\end{aligned}$$

Using the log-sum inequality, we can obtain the following expression:

$$\begin{aligned}
D_{KL}\big(\hat{p}(\cdot|H_{0I})\,\|\,p(\cdot)\big)&\le-\log(1-w_I)+\frac{1}{1-w_I}\int r(x)\log\frac{r(x)}{r(x)}\,dx+\frac{1}{1-w_I}\int w_J\,q_J(x)\log\frac{w_J\,q_J(x)}{w_I\,q_I(x)+w_J\,q_J(x)}\,dx\\
&=-\log(1-w_I)+\frac{w_J}{1-w_I}\log w_J+\frac{w_J}{1-w_I}\int q_J(x)\log\frac{q_J(x)}{w_I\,q_I(x)+w_J\,q_J(x)}\,dx. \qquad (58)
\end{aligned}$$

Applying the result of Lemma 1 to the integral, we can write

$$D_{KL}\big(\hat{p}(\cdot|H_{0I})\,\|\,p(\cdot)\big)\le-\log(1-w_I)-\frac{w_J}{1-w_I}\log\Big(1+\frac{w_I}{w_J}\exp\big(-D_{KL}(q_J\|q_I)\big)\Big). \qquad (59)$$

Since $J$ is arbitrary, as long as $J\neq I$, we obtain the upper bound given below:

$$D_{KL}\big(\hat{p}(\cdot|H_{0I})\,\|\,p(\cdot)\big)\le\min_{J\in\{1,\dots,N\}\setminus\{I\}}\Big[-\log(1-w_I)-\frac{w_J}{1-w_I}\log\Big(1+\frac{w_I}{w_J}\exp\big(-D_{KL}(q_J\|q_I)\big)\Big)\Big]. \qquad (60)$$

The proposed approximate divergence for pruning component $I$, i.e., the right hand side of (60), will be denoted by $D_{0I}$, where $1\le I\le N$, in the rest of this paper. In the following, we illustrate the advantage of the proposed approximation in a numerical example.
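For scalar Gaussian components, the pruning cost (60) is cheap to evaluate since the KLDs between individual Gaussian components are closed form. A minimal sketch (function and variable names are ours):

```python
import math

def kld_gauss(m1, v1, m2, v2):
    """Closed-form D_KL(N(m1,v1) || N(m2,v2))."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def prune_cost(weights, comps, I):
    """Upper bound (60) on D_KL(p_hat(.|H_0I) || p) for pruning component I.
    comps is a list of (mean, variance) pairs."""
    wI = weights[I]
    mI, vI = comps[I]
    best = float("inf")
    for J, (wJ, (mJ, vJ)) in enumerate(zip(weights, comps)):
        if J == I:
            continue
        d = kld_gauss(mJ, vJ, mI, vI)  # D_KL(q_J || q_I)
        val = (-math.log(1.0 - wI)
               - (wJ / (1.0 - wI)) * math.log(1.0 + (wI / wJ) * math.exp(-d)))
        best = min(best, val)
    return best

# Pruning a duplicate of another component costs (nearly) nothing ...
print(prune_cost([0.5, 0.5], [(0.0, 1.0), (0.0, 1.0)], 0))
# ... while pruning a well-separated component approaches the crude
# bound -log(1 - w_I) of eq. (55).
print(prune_cost([0.5, 0.5], [(-8.0, 1.0), (8.0, 1.0)], 0))  # close to log 2
```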

###### Example 5.

Consider Example 1 with the mixture (9) and the hypothesis (10b). In Figure 1, the exact divergence $D_{KL}\big(\hat{p}(\cdot|H_{02})\,\|\,p(\cdot)\big)$, which is computed numerically, its crude approximation given in (55), and the proposed approximation $D_{02}$ are shown for different values of $\mu$ (with the weights held fixed). Both the exact divergence and the proposed upper bound converge to the crude bound $-\log(1-w_2)$ when the pruned component has small overlapping probability mass with the other component. The proposed bound brings a significant improvement over the crude bound when the amount of overlap between the components increases.

### V-B Approximations for merging hypotheses

Consider the problem of merging the $I$th and the $J$th components of the mixture density (53), where the resulting approximate density is given as follows:

$$\hat{p}(x|H_{IJ})=w_{IJ}\,q_{IJ}(x)+\sum_{i\in\{1,\dots,N\}\setminus\{I,J\}}w_i\,q_i(x), \qquad (61)$$

where $w_{IJ}\triangleq w_I+w_J$ and $q_{IJ}$ is the merged component density.

We are interested in the RKLD between $\hat{p}(\cdot|H_{IJ})$ and $p(\cdot)$. Two approximations of this quantity with different accuracy and computational cost are given in Sections V-B1 and V-B2.

#### V-B1 A simple upper bound

We can compute a bound on $D_{KL}\big(\hat{p}(\cdot|H_{IJ})\,\|\,p(\cdot)\big)$ as follows:

$$\begin{aligned}
D_{KL}\big(\hat{p}(\cdot|H_{IJ})\,\|\,p(\cdot)\big)&=\int\big(r(x)+w_{IJ}\,q_{IJ}(x)\big)\log\frac{r(x)+w_{IJ}\,q_{IJ}(x)}{r(x)+w_I\,q_I(x)+w_J\,q_J(x)}\,dx\\
&\le\int r(x)\log\frac{r(x)}{r(x)}\,dx+\int w_{IJ}\,q_{IJ}(x)\log\frac{w_{IJ}\,q_{IJ}(x)}{w_I\,q_I(x)+w_J\,q_J(x)}\,dx\\
&=w_{IJ}\log w_{IJ}+w_{IJ}\int q_{IJ}(x)\log\frac{q_{IJ}(x)}{w_I\,q_I(x)+w_J\,q_J(x)}\,dx, \qquad (62)
\end{aligned}$$

where the log-sum inequality is used and $r(x)$ is defined in (56). Using Lemma 1 for the second term on the right hand side of (62), we obtain

$$D_{KL}\big(\hat{p}(\cdot|H_{IJ})\,\|\,p(\cdot)\big)\le w_{IJ}\log w_{IJ}-w_{IJ}\log\Big(w_I\exp\big(-D_{KL}(q_{IJ}\|q_I)\big)+w_J\exp\big(-D_{KL}(q_{IJ}\|q_J)\big)\Big).$$