
# Learning Sparse Mixture Models

This work approximates high-dimensional density functions with an ANOVA-like sparse structure by mixtures of wrapped Gaussian and von Mises distributions. When the dimension $d$ is very large, training the model parameters with standard learning algorithms becomes infeasible due to the curse of dimensionality. Therefore, assuming that each component of the model depends on an a priori unknown number of variables that is much smaller than the space dimension $d$, we first define an algorithm that determines the mixture model's set of active variables by the Kolmogorov-Smirnov and correlation tests. Then, restricting the learning procedure to the set of active variables, we iteratively determine the set of variable interactions of the marginal density function and simultaneously learn the parameters, using the Kolmogorov-Smirnov and correlation coefficient tests together with the proximal Expectation-Maximization algorithm. As the numerical examples show, the learning procedure considerably reduces the algorithm's complexity with respect to the input dimension $d$ and increases the model's accuracy for the given samples.

## 1 Introduction

One of the most recurrent problems in multivariate function approximation is the curse of dimensionality. An algorithm is said to suffer from the curse of dimensionality if its cost depends exponentially on the dimension of the data. To circumvent the problem, several authors have studied functions that are sparse with respect to their arguments. A widespread example appears in compressive sensing [19], where the target function can be recovered precisely under the assumption that the input vector is sparse with respect to a suitable norm. Another theory assumes that the target function can be decomposed into a sum or product of much lower-dimensional functions [15, 2, 4]. In the specific case of the ANOVA decomposition [14], it is assumed that only a small number of ANOVA terms, whose dimension is small compared to $d$, are relevant. This implies that the function to be approximated can be written as a sum of functions that each depend on only a limited number of variables [16, 9], i.e., only a certain number of variables interact with each other. The notions of superposition dimension and truncation dimension [19] were introduced to limit the number of ANOVA terms (equal to $2^d$) and, in addition, the dimension of each of them. Fruitful research has been done in this direction, for instance on regression problems [17, 9] and density function approximation [6, 7, 3]. We therefore introduce finite sparse mixture models, which are inspired by the ANOVA decomposition of sparse functions: we assume that each mixture component depends on a number of interacting variables that is smaller than the space dimension $d$.

### 1.1 Prior work

Given data of a multivariate random variable of potentially very large dimension $d$, the objective of our work is to approximate its density function $f$ through a mixture of wrapped Gaussian or von Mises distributions. One of the best known methods is Expectation-Maximization (EM), which maximizes the likelihood of the data. When the dimension is high, however, the algorithm cannot be applied naively without prior knowledge of the sparsity of the density function $f$. In a previous paper [6] we therefore took the sparsity assumption on the density function of the mixture model into account. The algorithm proved to be very effective in approximating periodic B-splines and the first Friedman function, and in image classification.

### 1.2 Our contribution

This paper is an extension of our previous work "Sparse ANOVA Inspired Mixture Models" [6]. In particular, we improve the learning algorithm by first determining the active variables of the density function and then restricting the study to the set of active variables. This approach is most efficient when some variables play no role in the approximation of the density function, in which case it considerably reduces the computation time and memory. We assign masses to the variables according to the amount of information they contain. It is thus possible to obtain an accurate approximation of the density function by its marginal over the variables that carry the most information.

### 1.3 Outline of paper

Section 2 introduces the notation. In Section 3, we introduce a sparse mixture model from the parametric family of multivariate wrapped Gaussian and von Mises distributions. Furthermore, we derive the marginal and the conditional density function of the wrapped Gaussian distribution, which will later help us approximate the target density function iteratively.

In Section 3.2, we give an algorithm that determines the set of active variables of a sparse mixture model by the Kolmogorov-Smirnov and correlation coefficient tests. The model learning can then be restricted to the set of active variables, which considerably reduces the complexity of the model training if the number of active variables is much smaller than $d$.

In Section 4, we define an algorithm that iteratively estimates the set of interacting variables and the parameters of the marginal density function. Finally, in Section 5, we test our model on sparse mixtures of wrapped Gaussians, B-spline functions, and the California Housing prices data.

## 3 Sparse Mixture Models

Under similar assumptions as in [6, Section 2], we approximate the density function of an unknown distribution, given a finite number of weighted samples, by a finite-dimensional sparse mixture model whose probability density function (pdf) is given by

$$
f(x \mid \alpha, \theta) = \sum_{k=1}^{K} \alpha_{u_k}\, p(x_{u_k} \mid \theta_{u_k}), \tag{1}
$$

where each $u_k \subseteq [d]$ is an index set, the mixing weights $\alpha_{u_k}$ are nonnegative and sum to one, and $p(\cdot \mid \theta_{u_k})$ is a probability density function with finite-dimensional parameter $\theta_{u_k}$. We consider here equally weighted samples, i.e., all samples carry the same weight. As in [6], we do not require the index sets to be pairwise different, i.e., there may exist $k \neq j$ such that $u_k = u_j$. We therefore keep track of the number of mixture components that share a given index set. Mixture models whose density function has the form (1) are called sparse mixture models (sparse MM).

To approximate the unknown target density function, we use the parametric family of sparse wrapped Gaussian distributions, with either diagonal or full covariance matrices, and the family of sparse von Mises distributions. Recall that the wrapped Gaussian distribution is obtained by wrapping the Gaussian distribution around the torus. Indeed, if $Y$ is a Gaussian distributed random variable (RV), the corresponding wrapped Gaussian RV is given by $X = Y \bmod p$, where $p$ denotes the period. Since we are interested in approximating $1$-periodic functions on the unit torus, the wrapped random variable becomes $X = Y \bmod 1$. Its pdf is defined as

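$$
p_G\big(x_{u_k} \mid \theta_{u_k} = (\mu_{u_k}, \Sigma_{u_k})\big) = \sum_{l \in \mathbb{Z}^{|u_k|}} \mathcal{N}(x_{u_k} + l \mid \mu_{u_k}, \Sigma_{u_k}) = \mathcal{N}_w(x_{u_k} \mid \mu_{u_k}, \Sigma_{u_k}), \tag{2}
$$

where $\mathcal{N}$ ($\mathcal{N}_w$) denotes the pdf of the $|u_k|$-dimensional (wrapped) Gaussian distribution with mean $\mu_{u_k}$ and symmetric positive definite (SPD) covariance matrix $\Sigma_{u_k}$. If the wrapped Gaussian distribution has a diagonal covariance matrix, its pdf simplifies to a product of univariate wrapped Gaussian density functions,

$$
p_G\big(x_{u_k} \mid \theta_{u_k} = (\mu_{u_k}, \Sigma_{u_k})\big) = \sum_{l \in \mathbb{Z}^{|u_k|}} \prod_{i \in u_k} \mathcal{N}(x_i + l_i \mid \mu_i, \Sigma_i) = \prod_{i \in u_k} \mathcal{N}_w(x_i \mid \mu_i, \Sigma_i), \tag{3}
$$

where $\mathcal{N}$ ($\mathcal{N}_w$) is now the pdf of the univariate (wrapped) Gaussian distribution with parameters $\mu_i$ and $\Sigma_i$. Since the infinite series defining the wrapped Gaussian pdf cannot be evaluated exactly in practice, it has to be truncated. Because the covariance matrix is positive definite, it can be shown that

$$
\mathcal{N}_w(x_{u_k} \mid \mu_{u_k}, \Sigma_{u_k}) \approx \mathcal{N}^B_w(x_{u_k} \mid \mu_{u_k}, \Sigma_{u_k}),
$$

where

$$
\mathcal{N}^B_w(x_{u_k} \mid \mu_{u_k}, \Sigma_{u_k}) := \sum_{l \in ([-B,B] \cap \mathbb{Z})^{|u_k|}} \mathcal{N}(x_{u_k} + l \mid \mu_{u_k}, \Sigma_{u_k})
$$

for a suitably chosen $B$. For instance, [8, 10] derived values of $B$, depending on the standard deviation $\sigma$, for which $\mathcal{N}^B_w$ approximates the ground truth density function $\mathcal{N}_w$ well. It has been shown by [10] that for

$$
B = \begin{cases} 1, & \text{if } \sigma \ge 2\pi, \\ 0, & \text{otherwise,} \end{cases} \tag{4}
$$

and by [8] that for

$$
B = \begin{cases} 1, & \text{if } \sigma < 2\pi/3, \\ 2, & \text{if } 2\pi/3 \le \sigma < 4\pi/3, \end{cases}
$$

the approximation is very accurate. As the space dimension increases, the cost of evaluating the truncated sum increases as well. In the rest of the paper we consider the truncated function $\mathcal{N}^B_w$ instead of $\mathcal{N}_w$.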
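The effect of the truncation level can be checked numerically. The following is a minimal sketch for the univariate case with period 1; the function names, the midpoint rule and the parameter values are our own illustrative choices, not taken from the paper.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the univariate Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def wrapped_normal_pdf(x, mu, sigma, B=1):
    """Truncated wrapped Gaussian on [0, 1): sum over windings l = -B, ..., B."""
    return sum(normal_pdf(x + l, mu, sigma) for l in range(-B, B + 1))

def mass_on_unit_interval(pdf, n=2000):
    """Midpoint-rule approximation of the integral of pdf over [0, 1)."""
    h = 1.0 / n
    return h * sum(pdf((i + 0.5) * h) for i in range(n))

# Hypothetical parameters: the mean lies close to the boundary, so wrapping matters.
mu, sigma = 0.9, 0.2
masses = {}
for B in (0, 1, 3):
    masses[B] = mass_on_unit_interval(lambda x: wrapped_normal_pdf(x, mu, sigma, B))
```

For these values, $B = 0$ captures only about 69% of the probability mass on the unit interval, while $B = 1$ already recovers essentially all of it.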

###### Remark 3.1.

The pdf of the wrapped Gaussian distribution from (2) can be interpreted as the marginal density function with respect to $x$ of the joint pdf

$$
f(x, l \mid \mu, \Sigma) = \mathcal{N}(x + l \mid \mu, \Sigma),
$$

where $\mu$ and $\Sigma$ are the wrapped normal distribution parameters of $X$ and the hidden variable $l$ denotes the winding number.

The von Mises distribution, which represents the restriction of the pdf of an isotropic normal distribution to the unit circle, has the pdf

$$
p_M\big(x_{u_k} \mid \theta_{u_k} = (\mu_{u_k}, \kappa_{u_k})\big) = \prod_{i \in u_k} \frac{1}{I_0(\kappa_i)} \exp\big(\kappa_i \cos(2\pi(x_i - \mu_i))\big),
$$

where $\mu_{u_k}$ represents the mean, $\kappa_{u_k}$ the concentration, and $I_0$ is the modified Bessel function of the first kind of order $0$.
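As a sanity check of the normalization, the density above can be evaluated with a few lines of code. This is a sketch under our own assumptions: `bessel_i0` uses the standard power series of $I_0$, and all function names and parameter values are illustrative.

```python
import math

def bessel_i0(kappa, terms=60):
    """Modified Bessel function of the first kind of order 0 via its power series."""
    total, term = 0.0, 1.0
    for m in range(terms):
        total += term
        term *= (kappa / 2.0) ** 2 / (m + 1) ** 2
    return total

def von_mises_pdf(x, mu, kappa):
    """Univariate von Mises density with period 1."""
    return math.exp(kappa * math.cos(2.0 * math.pi * (x - mu))) / bessel_i0(kappa)

def sparse_von_mises_pdf(x_u, mu_u, kappa_u):
    """Product form over the variables of one index set u_k."""
    return math.prod(von_mises_pdf(x, m, k) for x, m, k in zip(x_u, mu_u, kappa_u))

# Midpoint rule on [0, 1): the density should integrate to one.
n = 1000
mass = sum(von_mises_pdf((i + 0.5) / n, 0.3, 2.0) for i in range(n)) / n
```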

To ensure a good approximation accuracy by the parametric family of sparse mixture models of wrapped Gaussian and von Mises distributions, we furthermore assume that the ground function $f$ is smooth enough and has compact support, since Gaussian Mixture Models (GMM) have proved to be good approximators for continuous density functions with compact support [1].

Under the same assumptions as above, we derive the form of the marginal density function of $f$, where $f$ is defined as in (1). Later, in Section 4, we introduce an algorithm that iteratively approximates the marginal density function of $f$.

###### Definition 3.2.

Let $X$ be a continuous random variable with probability density function $f(\cdot \mid \theta)$. For every $u \subseteq [d]$, the marginal probability density function with respect to $X_u$ is defined as

$$
f_{X_u}(x_u \mid \theta_u) = P_u f(x \mid \theta) = \int_{\mathbb{T}^{d - |u|}} f(x_u, x_{u^c} \mid \theta)\, \mathrm{d}x_{u^c},
$$

where $u^c := [d] \setminus u$ and $P_u$ is the projection operator.

The linearity of the integral, and thus of the projection operator, immediately implies that the marginal distribution of a mixture model is equal to the mixture of the marginals of its components. Since two different components may have the same marginal (i.e., the same parameters), such components are merged by summing their mixing weights. This implies that the number of components of the marginal is smaller than or equal to the number of components of the ground mixture model. Furthermore, if the multivariate random variable $X$ is componentwise independent or (wrapped) Gaussian distributed with parameter $\theta$, then the marginal distribution with respect to the subset $X_u$ of random variables is of the same family as the ground distribution, with parameters $\theta_u$. For the special case of the wrapped Gaussian distribution with dependent random variables, this statement also holds. Before stating the theorem on the marginals of sparse mixture models of wrapped Gaussian or von Mises distributions, let us first recall the marginal and conditional distributions of a multivariate wrapped Gaussian distribution.

###### Lemma 3.3.

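Let $u \subseteq [d]$ and $u^c := [d] \setminus u$. Let furthermore $X = (X_u, X_{u^c})$ be a multivariate continuous random variable following a wrapped Gaussian distribution, i.e.,

$$
X \sim \mathcal{N}_w(\mu, \Sigma),
$$

with parameters

$$
\mu := \begin{pmatrix} \mu_u \\ \mu_{u^c} \end{pmatrix}, \qquad \Sigma := \begin{pmatrix} \Sigma_{uu} & \Sigma_{uu^c} \\ \Sigma_{u^c u} & \Sigma_{u^c u^c} \end{pmatrix},
$$

such that $\mu_u, \mu_{u^c}$ are the means and $\Sigma_{uu}, \Sigma_{u^c u^c}$ are positive definite covariance matrices. Then the marginal distributions of $X_u$ and $X_{u^c}$ are also wrapped Gaussian distributions, such that

$$
X_u \sim \mathcal{N}_w(\mu_u, \Sigma_{uu}), \qquad X_{u^c} \sim \mathcal{N}_w(\mu_{u^c}, \Sigma_{u^c u^c}).
$$

The conditional distribution of $X_{u^c}$ is given by

$$
X_{u^c} \mid (X_u, L_u) = (x_u, l_u) \sim \mathcal{N}_w(\bar{\mu}, \bar{\Sigma}), \tag{5}
$$

with

$$
\bar{\mu} = \mu_{u^c} + \Sigma_{u^c u} (\Sigma_{uu})^{-1} (x_u + l_u - \mu_u), \qquad \bar{\Sigma} = \Sigma_{u^c u^c} - \Sigma_{u^c u} (\Sigma_{uu})^{-1} \Sigma_{u u^c},
$$

where $l_u$ denotes the $|u|$-dimensional winding number [8].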
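As a worked instance of the conditional formulas, consider the bivariate case $d = 2$ with $u = \{1\}$ and $u^c = \{2\}$. The scalar entries $\sigma_{11}, \sigma_{12}, \sigma_{22}$ and the numerical values below are our own illustrative choices, not taken from the paper.

$$
\bar{\mu} = \mu_2 + \frac{\sigma_{12}}{\sigma_{11}} (x_1 + l_1 - \mu_1), \qquad \bar{\Sigma} = \sigma_{22} - \frac{\sigma_{12}^2}{\sigma_{11}}.
$$

With $\mu = (0.2, 0.7)^\top$, $\sigma_{11} = 0.04$, $\sigma_{12} = 0.01$, $\sigma_{22} = 0.09$, $x_1 = 0.3$ and winding number $l_1 = 0$, this gives

$$
\bar{\mu} = 0.7 + 0.25 \cdot 0.1 = 0.725, \qquad \bar{\Sigma} = 0.09 - \frac{0.0001}{0.04} = 0.0875.
$$

For the winding number $l_1 = 1$, the conditional mean shifts to $\bar{\mu} = 0.7 + 0.25 \cdot 1.1 = 0.975$, while $\bar{\Sigma}$ is unchanged.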

###### Theorem 3.4.

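Let $X$ be a continuous random variable of a sparse mixture model of wrapped Gaussian distributions, with density function $f$ of the form (1). Let furthermore $X_u$ be an $|u|$-dimensional random variable with $u \subseteq [d]$. Then the marginal distribution with respect to $X_u$ is also a sparse mixture model with density function

$$
f_{X_u} = \sum_{t=1}^{T_u} \alpha_t\, p(\cdot \mid \theta_{u_t}), \tag{6}
$$

where $T_u \le K$ and each $u_t$ is an element of

$$
U_{X_u} = \{ v \cap u \mid v \in U \},
$$

and $U$ is the collection of the index sets of interacting variables of $f$. The mixing weights and density functions of the marginal distribution are respectively

$$
\alpha_t = \sum_{k=1}^{K} \alpha_k\, \chi_{\{\theta_{\xi_k} = \theta_{u_t}\}}, \qquad p(\cdot \mid \theta_{u_t}) = \begin{cases} 1, & \text{if } u_t = \emptyset, \\ P_{\xi_k}\, p(\cdot \mid \theta_{u_k}), & \text{otherwise,} \end{cases}
$$

such that $\xi_k := u_k \cap u$ and, for $u_t = \xi_k$,

$$
\theta_{u_t} = (\mu_{u_t}, \Sigma_{u_t u_t}) = (\mu_{\xi_k}, \Sigma_{\xi_k \xi_k}).
$$

###### Proof.

By definition of the marginal density function and by the linearity of the integral, the marginal density of the mixture model satisfies

$$
\begin{aligned}
f_{X_u} &= \int_{\mathbb{T}^{d - |u|}} f \, \mathrm{d}x_{u^c} = \sum_{k=1}^{K} \alpha_k \int_{\mathbb{T}^{d - |u|}} p(\cdot \mid \theta_{u_k}) \, \mathrm{d}x_{u^c} \\
&= \sum_{k=1}^{K} \alpha_k\, P_u\, p(\cdot \mid \theta_{u_k}) = \sum_{k=1}^{K} \alpha_k\, P_{\xi_k}\, p(\cdot \mid \theta_{u_k}).
\end{aligned}
$$

The definition of the marginal density function implies that

$$
p(\cdot \mid \theta_{u_t}) = \begin{cases} 1, & \text{if } \xi_k = \emptyset, \\ P_{\xi_k}\, p(\cdot \mid \theta_{u_k}), & \text{otherwise.} \end{cases}
$$

Lemma 3.3 implies that for each $k$ the marginal distribution of the $k$-th mixture component is a wrapped Gaussian distribution with parameter

$$
\theta_{u_k \cap u} = (\mu_{\xi_k}, \Sigma_{\xi_k \xi_k}),
$$

if $\xi_k \neq \emptyset$. Thus the marginal of the mixture model is

$$
f_{X_u} = \sum_{k=1}^{K} \alpha_k\, p(\cdot \mid \theta_{\xi_k}),
$$

where $p(\cdot \mid \theta_{\xi_k})$ denotes the pdf of the wrapped Gaussian distribution with parameter $\theta_{\xi_k}$. If there exist $k \neq j$ such that $\theta_{\xi_k} = \theta_{\xi_j}$, then we combine both components into a single one by summing up their weights. ∎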

Theorem 3.4 shows that the marginal of a sparse mixture model from the parametric family of wrapped Gaussian or von Mises distributions may contain the uniform distribution as a mixture component.
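The bookkeeping behind Theorem 3.4 can be sketched in a few lines for the diagonal-covariance case, where the marginal parameters are simply the restriction of each component's parameters to $u_k \cap u$. The function name, the data layout and the example values (borrowed from a two-component mixture on the index sets $\{0,1,8\}$ and $\{0,14\}$) are our own assumptions.

```python
from collections import defaultdict

def marginal_sparse_mixture(index_sets, weights, means, variances, u):
    """Marginalise a diagonal sparse mixture onto the variables in u.

    means[k] and variances[k] map a variable index of component k to its parameter.
    Returns {(xi, params): weight}; xi == () stands for the uniform component.
    """
    u = set(u)
    merged = defaultdict(float)
    for u_k, alpha_k, mu_k, var_k in zip(index_sets, weights, means, variances):
        xi = tuple(sorted(set(u_k) & u))                    # xi_k = u_k ∩ u
        params = tuple((i, mu_k[i], var_k[i]) for i in xi)  # restricted parameters
        merged[(xi, params)] += alpha_k                     # equal marginals are merged
    return dict(merged)

# Hypothetical two-component example.
index_sets = [(0, 1, 8), (0, 14)]
weights = [0.7, 0.3]
means = [{0: 0.5, 1: 0.5, 8: 0.5}, {0: 0.25, 14: 0.25}]
variances = [{0: 0.001, 1: 0.001, 8: 0.001}, {0: 0.001, 14: 0.001}]

marginal_01 = marginal_sparse_mixture(index_sets, weights, means, variances, {0, 1})
marginal_5 = marginal_sparse_mixture(index_sets, weights, means, variances, {5})
```

Marginalising onto $\{0, 1\}$ keeps two components, on $\{0, 1\}$ and on $\{0\}$. Marginalising onto $\{5\}$ leaves only empty intersections, so both components collapse into a single uniform component carrying the total weight.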

### 3.2 Determination of Active Variables

Assuming that the above assumptions are fulfilled, we can considerably reduce the complexity of learning the parameters of the sparse mixture model by removing the independent, uniformly distributed random variables. Indeed, for a random variable $X = (X_A, X_{A^c})$ with density $h$, Bayes' theorem yields

$$
h(x) = p_1(x_A \mid \theta_1)\, p_2(x_{A^c} \mid x_A, \theta_2), \tag{7}
$$

where $p_1$ denotes the marginal density function with respect to $X_A$ and $p_2$ the conditional density function of $X_{A^c}$ given $X_A$. If $X_A$ and $X_{A^c}$ are independent, then $p_2(x_{A^c} \mid x_A, \theta_2) = p_2(x_{A^c} \mid \theta_2)$. If we further assume that $p_1$ is a sparse density function of the form (1) and $p_2$ is the uniform density function, then

$$
h(x) = \sum_{k=1}^{K} \alpha_k\, p(x_{u_k} \mid \theta_k), \tag{8}
$$

since the multivariate uniform density function on the unit torus is equal to $1$ everywhere. We can therefore introduce the notion of active and inactive variables for sparse density functions.

###### Definition 3.5.

Let $X$ be a multivariate random variable with density function

$$
f(x) = \sum_{k=1}^{K} \alpha_k\, p(x_{u_k} \mid \theta_k), \tag{9}
$$

a mixture model of wrapped Gaussian or von Mises distributions, and let $U$ denote the collection of its index sets $u_k$. The set of active variables of $f$ is defined by

$$
A_f := \{ i \in [d] \mid \exists\, u \in U : i \in u \}, \tag{10}
$$

and any random variable $X_i$ such that $i \in A_f$ is called active.

Thus an active random variable is either non-uniformly distributed or dependent on some $X_j$ that is non-uniformly distributed. Otherwise the random variable is called inactive. Taking as an example the density function of the sparse mixture model defined in (1), the active set of each mixture component is given by $A^{u_k}_f = u_k$, which yields

$$
A_f = \bigcup_{k=1,\dots,K} A^{u_k}_f.
$$

Based on this, we can determine $A_f$ iteratively by checking which feature variables are non-uniformly distributed, with the help of the Kolmogorov-Smirnov test, or which depend on the non-uniform random variables. Since testing independence is generally not trivial, we only test the random variables for correlation. Given a large enough number of weighted samples, the active set of the density function can be determined explicitly by Algorithm 1.
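The two tests can be sketched as follows. This is not the paper's Algorithm 1 but a simplified illustration under our own assumptions: samples are equally weighted, the Kolmogorov-Smirnov threshold is taken from the Dvoretzky-Kiefer-Wolfowitz bound at level `alpha`, dependence is only probed by the Pearson correlation with a fixed threshold, and all names and the synthetic data are hypothetical.

```python
import math
import random

def ks_distance_to_uniform(values):
    """Kolmogorov-Smirnov distance between the empirical cdf and the uniform cdf on [0, 1]."""
    xs = sorted(values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

def pearson(a, b):
    """Sample Pearson correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def active_variables(columns, alpha=1e-6, corr_threshold=0.2):
    """Indices that are non-uniform, or correlated with a non-uniform variable."""
    n = len(columns[0])
    ks_threshold = math.sqrt(math.log(2.0 / alpha) / (2.0 * n))  # DKW bound
    non_uniform = [j for j, col in enumerate(columns)
                   if ks_distance_to_uniform(col) > ks_threshold]
    active = set(non_uniform)
    for j, col in enumerate(columns):
        if j not in active and any(
            abs(pearson(col, columns[i])) > corr_threshold for i in non_uniform
        ):
            active.add(j)
    return sorted(active)

# Synthetic data: columns 0 and 1 are non-uniform, column 2 is uniform but
# correlated with column 1, columns 3 and 4 are independent uniform noise.
rng = random.Random(0)
n = 2000
y = [rng.random() for _ in range(n)]
columns = [
    [rng.gauss(0.5, math.sqrt(0.001)) % 1.0 for _ in range(n)],
    [v * v for v in y],
    y,
    [rng.random() for _ in range(n)],
    [rng.random() for _ in range(n)],
]
detected = active_variables(columns)
```

Correlation only detects linear dependence, so this sketch can miss variables that depend on the active ones in a nonlinear, uncorrelated way.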

To illustrate the concept, let us consider two density functions of mixtures of wrapped Gaussian distributions, which we will study in detail throughout the paper.

###### Example 3.6.

Consider the $d$-dimensional density functions

$$
f_j(x) := \sum_{k=1}^{K_j} \alpha_k\, p(x_{u_k} \mid \mu_k, \Sigma_k), \qquad p(x_{u_k} \mid \mu_k, \Sigma_k) := \sum_{l \in \mathbb{Z}^{|u_k|}} \mathcal{N}(x_{u_k} + l \mid \mu_k, \Sigma_k),
$$

where $j = 1, 2$, with $K_1 = 2$ and $K_2 = 3$ components. The first function $f_1$ has the parameters

$$
\begin{aligned}
U &:= \{u_1, u_2\} = \{\{0, 1, 8\}, \{0, 14\}\}, \\
\alpha &:= (0.7, 0.3), \\
\mu &:= \tfrac{1}{2}\big((1, 1, 1)^\top, (0.5, 0.5)^\top\big), \\
\Sigma_k &:= \sigma^2 I_{|u_k|}, \quad \sigma^2 := 0.001.
\end{aligned}
$$

The second function $f_2$ has the parameters

$$
\begin{aligned}
U &:= \{u_1, u_2, u_3\} = \{\{0, 1, 8\}, \{0, 14\}, \{5\}\}, \\
\alpha &:= (0.7, 0.2, 0.1), \\
\mu &:= \tfrac{1}{2}\big((1, 1, 1)^\top, (0.5, 0.5)^\top, 0.3\big), \\
\Sigma_k &:= \sigma^2 I_{|u_k|}, \quad \sigma^2 := 0.001.
\end{aligned}
$$

Following the definition of an active variable, we can directly read off the active sets from the function definitions, which are respectively

$$
A_{f_1} = \{0, 1, 8, 14\}, \qquad A_{f_2} = \{0, 1, 5, 8, 14\}.
$$

Applying Algorithm 1, the Kolmogorov-Smirnov distance of the weighted samples along each dimension shows that the variables with index in $A_{f_1}$ are non-uniformly distributed for the first density function $f_1$, and the variables with index in $A_{f_2}$ are non-uniformly distributed for the second function $f_2$. Moreover, the correlation estimates show that all variables are uncorrelated.

Assume that the active set $A_f$ of the sparse mixture model is already known and ordered increasingly. We can then iterate over the indices of the active variables to determine the marginal distribution of the corresponding subset of the random variable $X$ with probability density function $f$. For the sake of simplicity, we denote by $f_i$ the marginal probability density function with respect to the active variables with index at most $i$, and we write $\bar{i}$ for this index set. Let $r(i)$ denote the position of $i$ in $A_f$, which is at the same time the iteration number. Theorem 3.4 implies that for each $i \in A_f$ the index set of all interacting variables of the marginal mixture model $f_i$ is equal to

$$
U^{r(i)} = \{ u^{r(i)} := u \cap \bar{i} \mid u \in U \}, \tag{11}
$$

with parameter set

$$
\theta^{r(i)} = \big\{ \theta_{u_k \cap v} := (\theta_j)_{j \in u_k \cap v},\ k = 1, \dots, K \;\big|\; u_k \in U \text{ and } v = u_k^{r(i)} \in U^{r(i)} \big\}.
$$

Note that before the first iteration the function $f_i$ is the probability density function of the uniform distribution, and after the last iteration the marginal density function is equal to $f$. By definition of the marginal mixture model, it follows that $A_{f_i} \subseteq A_f$ and $A^{w}_{f_i} \subseteq A^{u_k}_f$ for the corresponding components. Thus, for a fixed $i$, we can further define the residual active set of the ground function $f$ with respect to the marginal density function $f_i$ by

$$
\overline{A^{r(i)}_f} := A_f \setminus A_{f_i},
$$

and the residual active set of each mixture component density function with respect to its marginal counterpart by

$$
\overline{A^{r(i), u_k}_f} := A^{u_k}_f \setminus A^{w}_{f_i},
$$

where $w = u_k^{r(i)} \in U^{r(i)}$. This notion of residual active set will be useful to considerably reduce the complexity of the algorithm presented in Section 4, which iteratively approximates the marginal density function of the sparse mixture model. These new notions can be illustrated by two concrete examples: in the following we consider two density functions of sparse wrapped Gaussian mixture models and compute their marginal distributions with respect to subsets of the random variables.

In the following we introduce two notions of effective dimension, which are useful when dealing with very high-dimensional sparse functions.

###### Definition 3.7.

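Let $f$ be a function and $\eta$ a given level. The superposition dimension at level $\eta$ is defined as

$$
d_s := \operatorname*{argmin}_{s \in [d-1]} \Bigg\{ \sum_{\substack{\emptyset \neq u \subseteq [d] \\ |u| \le s}} \frac{\sigma^2(f_u)}{\sigma^2(f)} \ge \eta \Bigg\}, \tag{12}
$$

i.e., the smallest $s$ for which the ANOVA terms of order at most $s$ explain at least the fraction $\eta$ of the variance. Here the $|u|$-dimensional functions

$$
f_u = \sum_{v \subseteq u} (-1)^{|u| - |v|} P_v f
$$

are the ANOVA terms of the function $f$, $P_v$ denotes the projection operator of Definition 3.2, and $\sigma^2(\cdot)$ the variance of the corresponding functions. The second notion of effective dimension is the truncation dimension $d_t$, which is defined as

$$
d_t := \operatorname*{argmin}_{s \in [d-1]} \Bigg\{ \sum_{\emptyset \neq u \subseteq [s]} \frac{\sigma^2(f_u)}{\sigma^2(f)} \ge \eta \Bigg\}.
$$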
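Given the variances of the ANOVA terms, both effective dimensions can be read off directly. The sketch below assumes that the term variances are already known and stored in a dictionary keyed by index tuples (1-based), uses the ANOVA identity $\sigma^2(f) = \sum_{u \neq \emptyset} \sigma^2(f_u)$, and searches over $s = 1, \dots, d$; the function names and the example variances are hypothetical.

```python
def superposition_dimension(term_variances, eta):
    """Smallest s such that terms with |u| <= s explain a fraction >= eta of the variance."""
    total = sum(term_variances.values())
    d = max(max(u) for u in term_variances)
    for s in range(1, d + 1):
        explained = sum(v for u, v in term_variances.items() if len(u) <= s)
        if explained >= eta * total:
            return s
    return d

def truncation_dimension(term_variances, eta):
    """Smallest s such that terms with u inside {1, ..., s} explain a fraction >= eta."""
    total = sum(term_variances.values())
    d = max(max(u) for u in term_variances)
    for s in range(1, d + 1):
        explained = sum(v for u, v in term_variances.items() if max(u) <= s)
        if explained >= eta * total:
            return s
    return d

# Hypothetical variances sigma^2(f_u) of the non-empty ANOVA terms of a 3-variate function.
term_variances = {(1,): 0.5, (2,): 0.3, (1, 2): 0.15, (3,): 0.04, (1, 2, 3): 0.01}
d_s = superposition_dimension(term_variances, eta=0.9)
d_t = truncation_dimension(term_variances, eta=0.9)
```

In this example the first-order terms explain 84% of the variance and the terms up to order two explain 99%, so the superposition dimension at level 0.9 is 2. The terms supported on $\{1, 2\}$ explain 95%, so the truncation dimension at level 0.9 is also 2.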

We will later combine these two notions of effective dimension to introduce a sparsity criterion for the density functions we want to approximate. First, the function $f$ from (3.6) can be rewritten as

$$
f = \sum_{w \in W} g_w,
$$

where $W$ is a collection of index sets and the $g_w$ are linear combinations of lower-dimensional functions depending only on the variables with index in $w$. For this class of density functions, it has been shown in [6, Proposition 2.1] that the ANOVA decomposition of $f$ is equal to

$$
f = \sum_{u \in \bar{W}} f_u,
$$

where $\bar{W}$ denotes the set of all $w \in W$ and all their subsets. Then the superposition dimension defined in (12) is also

$$
d_s = \operatorname*{argmin}_{s \in [d-1]} \Bigg\{ \sum_{\substack{\emptyset \neq u \in \bar{W} \\ |u| \le s}} \frac{\sigma^2(f_u)}{\sigma^2(f)} \ge \eta \Bigg\},
$$

and the truncation dimension is

$$
d_t = \operatorname*{argmin}_{s \in [d-1]} \Bigg\{ \sum_{\emptyset \neq u \in \bar{W}_s} \frac{\sigma^2(f_u)}{\sigma^2(f)} \ge \eta \Bigg\},
$$

where

$$
\bar{W}_s := \{ u \subseteq [d] \mid u \in \bar{W} \text{ and } u \subseteq [s] \}.
$$

Considering the ANOVA decomposition of the marginal density function of $f$ associated to an arbitrary but fixed $\xi \subseteq [d]$, it follows that

$$
P_\xi f = \sum_{u \subseteq [d]} (P_\xi f)_u,
$$

where

$$
(P_\xi f)_u = \sum_{v \subseteq u} (-1)^{|u| - |v|} P_v P_\xi f. \tag{13}
$$

By definition, $P_\xi f$ depends only on the variables $x_\xi$. [6, Proposition 2.1] implies that $(P_\xi f)_u = 0$ for all $u$ that are not included in $\xi$. Thus

$$
P_\xi f = \sum_{u \subseteq \xi} (P_\xi f)_u.
$$

For $v \subseteq \xi$ it holds that $P_v P_\xi f = P_v f$. Hence equation (13) implies that $(P_\xi f)_u = f_u$ for all $u \subseteq \xi$. Thus, with [16, Lemma 2.9], the truncation dimension from Definition 3.7 becomes

$$
d_t = \operatorname*{argmin}_{s \in [d-1]} \Bigg\{ \frac{\sigma^2(P_{[s]} f)}{\sigma^2(f)} \ge \eta \Bigg\}.
$$

Using this, we can introduce an iterative algorithm that approximates the marginal density function $P_{[s]} f$ for any $f$ of the form (1), under the assumption that a large enough number of samples is provided. We assume that the function $f$ is an accurate approximation of the ground truth function $g$. If $f$ is sparse in the sense of equation (1), then there exists an index set $v$ such that

$$
\| g - P_v f \| \le \| g - f \| + \| f_{X_{v^c}} - I \| \, \| P_v f \| \to 0,
$$

and the maximal number of interacting variables is very small with respect to the space dimension.

## 4 Learning Sparse Mixture Models

In the rest of this paper, we assume that all variables in (1) are active. For learning the sparse MM, we propose an algorithm which iteratively approximates the marginals with respect to the first $i$ variables for $i = 1, \dots, d$. In the following, we give an idea of the algorithm by describing its first two steps. Let the samples be given.

Step 1: Find an approximation of the first marginal by

$$
f^1(x_1) = \alpha^1_0 + \sum_{k=1}^{K_1} \alpha^1_k\, p(x_1 \mid \mu^1_k, \sigma^1_k) \tag{14}
$$

from the samples as follows:

• Determine $K_1$ by the BIC method described in Appendix 7.4.

• Apply a univariate EM algorithm to compute $\alpha^1_k$, $\mu^1_k$ and $\sigma^1_k$, and to determine, for each sample and each $k = 0, \dots, K_1$, the probability that the sample belongs to the $k$-th mixture component.
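Step 1 can be illustrated with a small EM loop for a model of the form (14), i.e., a uniform component plus truncated wrapped Gaussians on $[0, 1)$. This is a simplified sketch under our own assumptions: it uses plain EM rather than the proximal EM of the paper, fixes $K_1$ instead of selecting it by BIC, treats the winding number as a latent variable with truncation $B$, and all names and the synthetic data are hypothetical.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_uniform_plus_wrapped(xs, mus, sigmas, B=1, iters=50):
    """Plain EM for f(x) = a_0 + sum_k a_k N_w^B(x | mu_k, sigma_k) on [0, 1)."""
    K = len(mus)
    alphas = [1.0 / (K + 1)] * (K + 1)      # alphas[0] is the uniform weight
    mus, sigmas = list(mus), list(sigmas)
    shifts = range(-B, B + 1)
    n = len(xs)
    for _ in range(iters):
        w_sum = [0.0] * (K + 1)             # summed responsibilities per component
        m1 = [0.0] * K                      # responsibility-weighted sum of (x + l)
        m2 = [0.0] * K                      # responsibility-weighted sum of (x + l)^2
        for x in xs:
            # E-step: joint weight of (component k, winding number l) for this sample
            joint = [[alphas[k + 1] * normal_pdf(x + l, mus[k], sigmas[k])
                      for l in shifts] for k in range(K)]
            total = alphas[0] + sum(sum(row) for row in joint)
            w_sum[0] += alphas[0] / total
            for k in range(K):
                for l, p in zip(shifts, joint[k]):
                    r = p / total
                    w_sum[k + 1] += r
                    m1[k] += r * (x + l)
                    m2[k] += r * (x + l) ** 2
        # M-step: closed-form updates on the unwrapped values x + l
        alphas = [w / n for w in w_sum]
        for k in range(K):
            mean = m1[k] / w_sum[k + 1]
            var = m2[k] / w_sum[k + 1] - mean ** 2
            mus[k] = mean % 1.0
            sigmas[k] = math.sqrt(max(var, 1e-8))
    return alphas, mus, sigmas

# Synthetic samples: 40% uniform noise, 60% wrapped Gaussian around 0.5.
rng = random.Random(1)
xs = [rng.random() if rng.random() < 0.4 else rng.gauss(0.5, 0.05) % 1.0
      for _ in range(2000)]
alphas, mus, sigmas = em_uniform_plus_wrapped(xs, mus=[0.4], sigmas=[0.1])
```

The responsibilities computed in the E-step are the per-sample membership probabilities that Step 2 reuses as sample weights.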

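Step 2: Find an approximation of the first two marginals by the following steps:

• For each $k \in \{0, \dots, K_1\}$, determine whether the correspondingly weighted samples are uniformly distributed and uncorrelated, using the Kolmogorov-Smirnov test in Appendix 7.2 and the correlation estimate in Appendix 7.1. Then we get

$$
\{0, \dots, K_1\} = K_{nu} \cup K_u,
$$

where $K_{nu}$ denotes the indices of those mixture summands in (14) for which the samples are not uniformly distributed, and $K_u$ the remaining ones.

• For each $k \in K_{nu}$ and the corresponding weighted samples, determine

$$
f^2_k(x_2) = \alpha^2_{k,0} + \sum_{l=1}^{L_k} \alpha^2_{k,l}\, p(x_2 \mid \mu^2_{k,l}, \sigma^2_{k,l}) \tag{15}
$$

by computing

• $L_k$ by the BIC method described in Appendix 7.4.

• $\alpha^2_{k,l}$, $\mu^2_{k,l}$ and $\sigma^2_{k,l}$ by a univariate EM algorithm. These parameters will be used as initial ones in the next EM step.

• Case 1: If $0 \in K_u$, compute the parameters and determine the membership probabilities of the samples in the MM

$$
\begin{aligned}
f^{1,2}(x_1, x_2) ={}& \alpha^1_0 + \sum_{k \in K_u} \alpha^1_k\, p(x_1 \mid \mu^1_k, \sigma^1_k) \\
&+ \sum_{k \in K_{nu}} \Bigg( \sum_{l=1}^{L_k} \alpha^{1,2}_{k,l}\, p(x_1, x_2 \mid \mu^{1,2}_{k,l}, \Sigma^{1,2}_{k,l}) + \alpha^{1,2}_{k,0}\, p(x_1 \mid \mu^1_k, \sigma^1_k) \Bigg)
\end{aligned} \tag{16}
$$

with the initialization, for $k \in K_{nu}$,

$$
\alpha^{1,2}_{k,l} := \alpha^1_k \alpha^2_{k,l}, \qquad \mu^{1,2}_{k,l} := \begin{pmatrix} \mu^1_k \\ \mu^2_{k,l} \end{pmatrix}, \qquad \Sigma^{1,2}_{k,l} := \begin{pmatrix} \sigma^1_k & 0 \\ 0 & \sigma^2_{k,l} \end{pmatrix}. \tag{17}
$$

Case 2: If $0 \in K_{nu}$, compute the parameters and determine the membership probabilities of the samples in the MM

$$
\begin{aligned}
f^{1,2}(x_1, x_2) ={}& \alpha^{1,2}_{0,0} + \sum_{l=1}^{L_0} \alpha^{1,2}_{0,l}\, p(x_2 \mid \mu^{1,2}_{0,l}, \sigma^{1,2}_{0,l}) + \sum_{k \in K_u} \alpha^1_k\, p(x_1 \mid \mu^1_k, \sigma^1_k) \\
&+ \sum_{k \in K_{nu} \setminus \{0\}} \Bigg( \sum_{l=1}^{L_k} \alpha^{1,2}_{k,l}\, p(x_1, x_2 \mid \mu^{1,2}_{k,l}, \Sigma^{1,2}_{k,l}) + \alpha^{1,2}_{k,0}\, p(x_1 \mid \mu^1_k, \sigma^1_k) \Bigg).
\end{aligned} \tag{18}
$$

We use the same initialization (17) for $k \in K_{nu} \setminus \{0\}$ and

$$
\alpha^{1,2}_{0,l} := \alpha^1_0 \alpha^2_{0,l}, \qquad \mu^{1,2}_{0,l} := \mu^2_{0,l}, \qquad \sigma^{1,2}_{0,l} := \sigma^2_{0,l}. \tag{19}
$$

If we use an MM of wrapped Gaussians with only diagonal covariance matrices, Step 2.3 is superfluous and the new parameters are those from the initialization.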
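The product initialization (17) for one non-uniform component $k$ can be written down directly. The sketch below follows the formula as stated, placing $\sigma^1_k$ and $\sigma^2_{k,l}$ on the diagonal; the function name, the dictionary layout and the numerical values are hypothetical.

```python
def init_bivariate_components(alpha1_k, mu1_k, sigma1_k, alphas2, mus2, sigmas2):
    """Initial bivariate components for one x1-component k, built from its univariate x2-fit.

    alphas2, mus2, sigmas2 hold the parameters for l = 1, ..., L_k from the univariate EM step.
    """
    components = []
    for a2, m2, s2 in zip(alphas2, mus2, sigmas2):
        components.append({
            "alpha": alpha1_k * a2,                     # product of mixing weights
            "mu": (mu1_k, m2),                          # stacked means
            "Sigma": ((sigma1_k, 0.0), (0.0, s2)),      # diagonal initial covariance
        })
    return components

# Hypothetical parameters of one x1-component and its two x2-components.
components = init_bivariate_components(0.5, 0.5, 0.01, [0.6, 0.4], [0.2, 0.8], [0.02, 0.03])
```

Starting from a diagonal covariance means the subsequent EM step only has to learn the off-diagonal interaction between $x_1$ and $x_2$.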

###### Remark 4.1.

If we consider the sparse mixture model of diagonal wrapped Gaussian or von Mises distributions, then the estimation steps in Algorithm 2 reduce to fitting univariate marginal distributions. This considerably reduces the computational complexity (time and storage).