1 Introduction
One of the most recurrent problems in multivariate function approximation theory is the curse of dimensionality. An algorithm is said to face the curse of dimensionality if its complexity depends exponentially on the dimension of the data. In order to circumvent this problem, several authors have focused on the study of functions that are sparse with respect to their arguments. A widespread example appears in compressive sensing [19], where the target function can be recovered precisely by assuming that the input vector is sparse with respect to the
norm. Another theory assumes that the target function can be decomposed into a sum or product of much lower-dimensional functions [15, 2, 4]. In the specific case of the ANOVA decomposition [14], it is assumed that only a minimal number of ANOVA terms, whose dimension is small compared to , are relevant. This implies that the function to be approximated can be factorized into a sum of functions that depend on only a limited number of variables [16, 9], i.e., only a certain number of variables interact with each other. Thus the notions of superposition dimension and truncation dimension [19] were introduced to penalize the number of ANOVA terms (equal to ) and, in addition, the dimension of each of them. Several fruitful pieces of research have been carried out in this direction, as in regression problems [17, 9] and density function approximation [6, 7, 3]. Therefore, we introduce finite sparse mixture models, which are inspired by the ANOVA decomposition of sparse functions. Indeed, we assume that each mixture component may depend only on a smaller number of interacting variables than the space dimension .
1.1 Prior work
Given
data of a multivariate random variable
of potentially very large dimension, the objective of our work is to approximate the density function through a mixture of wrapped Gaussian or von Mises distribution models. One of the best-known methods is Expectation-Maximization (EM), which maximizes the likelihood of the data. It should be noted that in the case where the dimension is high, it is impossible to apply the algorithm naively without prior knowledge of the sparsity of the density function . Thus, in a previous paper [6] we tried to take into account the sparsity assumption on the density function of the mixture model. The algorithm proved to be very effective in approximating periodic B-splines, the first Friedman function, and in image classification.
1.2 Our contribution
The current paper is an extension of our previous work "Sparse ANOVA Inspired Mixture Models" [6]. In particular, we improve the learning algorithm by first determining the active variables of the density function and then restricting the study to this set of active variables. This approach is all the more efficient if we assume that some variables do not play any role in the approximation of , which considerably reduces the computational time and space. To this end, we assign masses to the variables according to the amount of information they contain. It is thus possible to obtain an accurate approximation of the density function by its marginal, which contains the variables with the most information.
1.3 Outline of the paper
Section 2 introduces the notation. In Section 3, we introduce a sparse mixture model from the parametric family of multivariate wrapped Gaussian and von Mises distributions. Furthermore, we derive the marginal and conditional density functions of the wrapped Gaussian distribution, which will later help us approximate the target density function iteratively.
In Section 4, we present an algorithm that determines the set of active variables of a sparse mixture model by the Kolmogorov-Smirnov test and a correlation coefficient test. The model learning can then be restricted to the set of active variables, which considerably reduces the complexity of the model training if we assume that .
Also in Section 4, we define an algorithm that iteratively estimates the set of interacting variables as well as the parameters of the marginal density functions. Later, in Section 5, we test our model on sparse mixtures of wrapped Gaussians, a B-spline function, and the California housing prices data.
2 Preliminaries and notation
3 Sparse Mixture Models
3.1 Sparse additive Model
Under similar assumptions as in [6, Section 2], we try to approximate the density function of an unknown distribution given a finite number of weighted samples
by a finite dimensional sparse mixture model, whose probability density function (pdf) is given by
(1) 
where and
is a probability density function with
dimensional parameter . We will here consider samples which are equally weighted, i.e., for all . Similarly to [6], we will also assume that the index sets may not be pairwise different, i.e., there may exist such that but . Thus denotes the number of mixture components such that . Mixture models whose density function has the form (1) are called sparse mixture models (sparse MM). The parametric family of sparse wrapped Gaussian distributions with both diagonal and full covariance matrices on the one side, and the family of sparse von Mises distributions on the other, will be used to approximate the unknown target density function. Recall that the wrapped Gaussian distribution is obtained by wrapping the Gaussian distribution around the torus. Indeed, if is a Gaussian distributed random variable (RV), the corresponding wrapped Gaussian RV is given by , where denotes the period. Since we are interested in approximating periodic functions on the unit torus, the wrapped random variable becomes . Their pdfs are defined as
(2)
where () denotes the pdf of the dimensional (wrapped) Gaussian distribution with mean and symmetric positive definite (SPD) covariance matrix . If the wrapped Gaussian distribution has a diagonal covariance matrix, then its pdf simplifies to a product of univariate wrapped Gaussian density functions as
(3) 
where () is the pdf of the univariate (wrapped) Gaussian density function with parameters and . Since it is practically impossible to compute the probability density function of the wrapped Gaussian distribution numerically, and due to the assumption that its covariance matrix is positive definite, it can be shown that
where
for a suitably chosen . For instance, [8, 10] have derived some values of
depending on the standard deviation
for which approximates well the ground truth density function . It has been shown by [10] for
(4)
and by [8] for
that the approximation is very accurate. As the space dimension increases, also increases. Thus we will consider the truncated function instead of in the rest of the paper.
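The truncated pdf can be sketched as follows: a minimal univariate version in Python, assuming the unit torus (period 1); the truncation bound `K` is a free parameter here, to be chosen depending on the standard deviation as in [8, 10].

```python
import math

def wrapped_gaussian_pdf(x, mu, sigma, K=3, period=1.0):
    """Truncated pdf of a univariate wrapped Gaussian on [0, period).

    The exact pdf sums Gaussian densities over all integer windings k;
    here the sum is truncated to |k| <= K, which is accurate when sigma
    is small relative to the period (cf. the bounds cited in the text).
    """
    total = 0.0
    for k in range(-K, K + 1):
        z = (x - mu + k * period) / sigma
        total += math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return total
```

Because the truncated sum of shifted Gaussians is smooth and periodic, a Riemann sum over one full period integrates it to 1 up to the truncation error.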
Remark 3.1.
The pdf of the wrapped Gaussian distribution from (2) can be interpreted as the marginal density function with respect to of the joint pdf
where and
are the wrapped normal distribution parameters of
and the hidden variable denotes the number of windings, i.e., .
The von Mises distribution, which represents the restriction of the pdf of an isotropic normal distribution to the unit circle, has the pdf
where represents the mean and the concentration, and is the modified Bessel function of the first kind of order .
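For illustration, a minimal sketch of the univariate von Mises pdf; the Bessel function of order 0 is evaluated here by its power series, and the number of series terms `n_terms` is an implementation choice, not from the text.

```python
import math

def von_mises_pdf(x, mu, kappa, n_terms=20):
    """Pdf of the univariate von Mises distribution on [-pi, pi):
    exp(kappa * cos(x - mu)) / (2 * pi * I0(kappa)).

    I0, the modified Bessel function of the first kind of order 0,
    is approximated by its power series sum_k (kappa/2)^(2k) / (k!)^2.
    """
    i0 = sum((kappa / 2.0) ** (2 * k) / math.factorial(k) ** 2
             for k in range(n_terms))
    return math.exp(kappa * math.cos(x - mu)) / (2.0 * math.pi * i0)
```

The series converges very quickly for moderate concentrations, so a small `n_terms` already normalizes the density correctly.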
To ensure a good approximation accuracy by the parametric family of sparse mixture models of wrapped and von Mises distribution, we assume furthermore that the ground function
is smooth enough and has compact support, since Gaussian Mixture Models (GMM) have proved to be good approximators for continuous density functions with compact support
[1].
Under the same assumptions as above, we will derive a form of the marginal density function of , where is defined as in (1). Indeed, we will introduce later in Section 4 an algorithm that iteratively approximates the marginal density functions of .
Definition 3.2.
Let
be a continuous random variable with probability density function
. For every , the marginal probability density function with respect to is defined as
where and is the projection operator.
The linearity of the integral immediately implies that the marginal distribution of a mixture model is equal to the mixture of the marginals of its components, and thus that the projection operator is linear. Since two different components may have the same marginal (i.e., the same parameters), they are put together by summing their mixing weights. This implies that the number of components of the marginal is smaller than or equal to the number of components of the ground mixture model. Furthermore, if the multivariate random variable is componentwise independent or (wrapped) Gaussian distributed with parameters , then the marginal distribution with respect to the subset of random variables is of the same family as the ground distribution, with parameters . For the special case of the wrapped Gaussian distribution with dependent random variables the assumption also holds. Before stating the theorem on the marginals of sparse mixture models of wrapped Gaussian or von Mises distributions, let us first recall the marginal and conditional distributions of a multivariate wrapped Gaussian distribution.
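For mixtures with diagonal (co)variances, this marginalization amounts to dropping the discarded coordinates from each component and merging components whose restricted parameters coincide. A hypothetical helper sketching this (the list-of-lists parameter layout is an assumption for illustration):

```python
def marginal_of_mixture(weights, means, variances, keep):
    """Marginal of a mixture with diagonal (co)variances.

    Marginalizing drops the coordinates not in `keep` from each
    component; components whose restricted parameters coincide are
    merged by summing their mixing weights, so the marginal has at
    most as many components as the ground mixture.
    """
    merged = {}
    for w, mu, var in zip(weights, means, variances):
        key = (tuple(mu[i] for i in keep), tuple(var[i] for i in keep))
        merged[key] = merged.get(key, 0.0) + w
    new_weights = list(merged.values())
    new_means = [list(k[0]) for k in merged]
    new_vars = [list(k[1]) for k in merged]
    return new_weights, new_means, new_vars
```

For example, two components that differ only in a dropped coordinate collapse into a single component carrying their combined weight.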
Lemma 3.3.
Let and . Let furthermore be a multivariate continuous random variable of a wrapped Gaussian distribution, i.e.,
with parameters
such that are the mean parameters and are positive definite covariance matrices. Then the marginal distributions of and are also wrapped Gaussian distributions, such that
The conditional distribution of is given by
(5) 
with
where denotes the dimensional winding number [8].
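The conditioning formulas in (5) follow the standard Schur-complement rule for Gaussian distributions; Lemma 3.3 additionally shifts the conditioning value by the winding number. A stdlib-only sketch for the ordinary (non-wrapped) Gaussian, conditioning on a single coordinate so the conditioning block is a scalar:

```python
def gaussian_conditional_on_one(mu, Sigma, j, x_j):
    """Parameters of X_{-j} | X_j = x_j for a multivariate Gaussian.

    Scalar-block Schur-complement formulas:
      mu_i|j     = mu_i + Sigma_ij / Sigma_jj * (x_j - mu_j)
      Sigma_ik|j = Sigma_ik - Sigma_ij * Sigma_jk / Sigma_jj
    The wrapped case of Lemma 3.3 applies the same formulas with
    x_j shifted by the winding number.
    """
    d = len(mu)
    keep = [i for i in range(d) if i != j]
    mu_c = [mu[i] + Sigma[i][j] / Sigma[j][j] * (x_j - mu[j]) for i in keep]
    Sigma_c = [[Sigma[i][k] - Sigma[i][j] * Sigma[j][k] / Sigma[j][j]
                for k in keep] for i in keep]
    return mu_c, Sigma_c
```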
Theorem 3.4.
Let be a continuous random variable of a sparse mixture model of wrapped Gaussian distribution, with density function . Let furthermore be an dimensional random variable and . Then the marginal distribution with respect to is also a sparse mixture model with density function
(6) 
where and is an element of
, and is the collection of the indices of interacting variables of . The mixing weights and density functions of the marginal distribution are, respectively,
such that and for
Proof.
By definition of the marginal density function and by the linearity of the integral, the marginal density of the mixture model satisfies
The definition of the marginal density function implies that
The theorem on conditional wrapped Gaussian distribution, implies that for each the marginal distribution of each mixture component is a wrapped Gaussian distribution with parameter
if . Thus the marginal of the mixture model yields
where denotes the probability density function of the wrapped Gaussian with parameters . If there exist such that , then we combine both components by summing up their weights, reducing the number of mixture components by one. ∎
Theorem 3.4
shows that the marginal of a sparse mixture model of a parametric family of wrapped Gaussian or von Mises distributions may contain the uniform distribution as a mixing component.
3.2 Determination of Active Variables
Assuming that the above assumptions are fulfilled, we can considerably reduce the complexity of learning the parameters of the sparse mixture models by removing the independent uniformly distributed random variables. Indeed, for random variables such that and
are independent, Bayes' theorem yields
(7) 
where denotes the marginal density function with respect to and the conditional density function of . By assumption, , since both random variables are independent. If we further assume that is a sparse density function of the form (1) and is the uniform density function, then
(8) 
since the multivariate uniform density function on is equal to everywhere. Therefore we can introduce the notion of active and inactive variables for sparse density functions.
Definition 3.5.
Let be a multivariate random variable with density function
(9) 
a wrapped Gaussian or von Mises mixture model. The set of active variables of is defined by
(10) 
and any random variable such that is called active.
Thus an active random variable is either non-uniformly distributed or dependent on some such that . Otherwise, the random variable is called inactive. Taking as an example the density function of the sparse mixture model defined in (1), the active set of each mixture component is given by , which yields
Based on , we can determine iteratively by checking with the help of the Kolmogorov-Smirnov test which variables are non-uniformly distributed, or which depend on the non-uniform random variables. Since testing independence is generally not trivial, we will only test the random variables for correlations. Given a large enough number of weighted samples, we can explicitly determine the active set of a density function by Algorithm 1.
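Algorithm 1 itself is not reproduced here, but its two core tests can be sketched in plain Python. The thresholds `ks_thresh` and `corr_thresh` are illustrative placeholders, not values from the paper; a real implementation would use KS critical values depending on the sample size and iterate the correlation check to a fixed point.

```python
import math

def ks_uniform_distance(samples):
    """Kolmogorov-Smirnov distance between the empirical cdf of
    `samples` (assumed to lie in [0, 1]) and the uniform cdf F(x) = x."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # compare F(x) = x with the empirical cdf just before/after x
        d = max(d, abs(x - i / n), abs(x - (i + 1) / n))
    return d

def pearson_corr(a, b):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def active_set(data, ks_thresh=0.05, corr_thresh=0.1):
    """Sketch of Algorithm 1: a variable is active if it is non-uniform
    (large KS distance) or correlated with an active variable.
    A single correlation pass is used here for simplicity."""
    dim = len(data[0])
    cols = [[row[j] for row in data] for j in range(dim)]
    active = {j for j in range(dim)
              if ks_uniform_distance(cols[j]) > ks_thresh}
    for j in range(dim):
        if j not in active and any(
                abs(pearson_corr(cols[j], cols[k])) > corr_thresh
                for k in active):
            active.add(j)
    return sorted(active)
```

On samples where one coordinate is tightly concentrated and another is uniform and independent, only the concentrated coordinate is reported as active.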
To better understand the concept, let us consider two density functions of mixtures of wrapped Gaussian distributions, which we will study in detail throughout the paper.
Example 3.6.
Consider the dimensional density functions
where The first function has the parameters
The second function has the parameters
Following the definition of an active variable, we can directly read the active set from the function definition which are respectively
Formally applying Algorithm 1, the plot of the Kolmogorov-Smirnov distance of the weighted samples along each dimension shows that the variables whose indices are elements of are non-uniformly distributed for the first density function, and the variables with index in are non-uniformly distributed for the second function . All variables are uncorrelated, as the correlation estimate shows.
Assuming that the active set of the sparse mixture model is already known and increasingly ordered, we can iterate over the indices of the active variables to determine the marginal distribution of the subset of the random variable with probability density function . For the sake of simplicity, we denote by the marginal probability density function with respect to . It is equal to the marginal density function with respect to , where . Let represent the position of in and, in the same way, the number of iterations. Theorem 3.4 implies that for each the index set of all interacting variables of the marginal mixture model with respect to is equal to
(11) 
with parameters set
Note that for the function is the probability density function of the uniform distribution, and for the marginal density function is equal to . By definition of the marginal mixture model, it follows that the active set and for all and . Thus we can further define, for a fixed , the residual active set of the ground function with respect to the marginal density function by
and the residual active set of each mixture component density function with respect to its marginal counterpart by
where . This notion of residual active set will be useful to considerably reduce the complexity of the algorithm presented in Section 4, which approximates iteratively the marginal density functions of the sparse mixture model. These new notions can be illustrated by two concrete examples. Indeed, in the following we will consider two dimensional density functions of sparse wrapped Gaussian mixture models. We will compute their marginal distributions with respect to the subsets of random variables
In the following, we introduce two notions of effective dimension for dealing with very high-dimensional sparse functions.
Definition 3.7.
Let be a function and . The superposition dimension at level is defined as
(12) 
where the dimensional functions
are the ANOVA terms of the function , denotes the projection operator of Definition 3.2, and
the variance of the corresponding functions. The second notion of
effective dimension is the truncation dimension , which is defined as
We will combine these two notions of effective dimension later in Section LABEL:sec:num_approx to introduce a sparsity criterion for the density functions we want to approximate. First, the function from (3.6) can be rewritten as
where and are linear combinations of lower-dimensional functions depending only on variables with index sets in . For this class of density functions, it has been shown in [6, Proposition 2.1] that the ANOVA decomposition of is equal to
where denotes the set of all and all their subsets. Then the superposition dimension defined in (12) is also
and the truncation dimension
where
Considering the ANOVA decomposition of the marginal density function of associated to an arbitrary but fixed it follows that
where
(13) 
We know by definition that depends only on the variables . Lemma [6, Proposition 2.1] implies that for all such that is not included in . Thus
For it holds that . Hence equation (13) implies that for all . Thus, with [16, Lemma 2.9], the truncation dimension defined in Definition 3.7 yields
Using this, we can introduce an iterative algorithm which approximates the marginal density function for any of the form (1), under the assumption that a large enough number of samples is provided. The function is an accurate approximation of the ground function . If is sparse in the sense of equation (1), then there exists an element such that
and the maximal number of interacting variables is very small with respect to the space dimension.
4 Learning Sparse Mixture Models
In the rest of this paper, we assume that all variables in (1)
are active.
For learning the sparse MM, we propose an algorithm which iteratively approximates the marginals
for .
In the following, we give an idea of the algorithm by describing its first two steps.
Let the samples be given.
Step 1: Find an approximation of the first marginal by
(14) 
from the samples as follows:

Determine by the BIC method described in Appendix 7.4.

Apply a univariate EM algorithm to compute , and and to determine the probability , , that belongs to the th mixture component.
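Step 1 can be sketched as a plain univariate Gaussian-mixture EM; the wrapped case would additionally sum over winding numbers in the E-step, and the quantile-based initialization is an implementation choice, not from the paper.

```python
import math

def em_univariate(samples, K, iters=50):
    """Minimal univariate Gaussian-mixture EM (sketch of Step 1).

    Returns mixing weights, means, variances and responsibilities
    gamma[i][k], the probability that sample i belongs to component k.
    For wrapped Gaussians the E-step would also marginalize over
    winding numbers (omitted here).
    """
    n = len(samples)
    xs = sorted(samples)
    mu = [xs[int((k + 0.5) * n / K)] for k in range(K)]  # quantile init
    var = [1.0] * K
    w = [1.0 / K] * K
    gamma = []
    for _ in range(iters):
        # E-step: responsibilities of each component for each sample
        gamma = []
        for x in samples:
            p = [w[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-0.5 * (x - mu[k]) ** 2 / var[k])
                 for k in range(K)]
            s = sum(p)
            gamma.append([pk / s for pk in p])
        # M-step: responsibility-weighted maximum-likelihood updates
        for k in range(K):
            nk = sum(g[k] for g in gamma)
            w[k] = nk / n
            mu[k] = sum(g[k] * x for g, x in zip(gamma, samples)) / nk
            var[k] = sum(g[k] * (x - mu[k]) ** 2
                         for g, x in zip(gamma, samples)) / nk + 1e-8
    return w, mu, var, gamma
```

On well-separated data the recovered means converge to the cluster centers within a few dozen iterations.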
Step 2: Find an approximation of the first two marginals by the following steps:

For each , determine whether the weighted samples are uniformly distributed and uncorrelated by the Kolmogorov-Smirnov test in Appendix 7.2 and the correlation estimate in Appendix 7.1. Then we get
where denotes the indices of those mixture summands in (14) for which the samples are not uniformly distributed, and the other ones.

For each and samples determine
(15) by computing

by the BIC method described in Appendix 7.4.

, and by a univariate EM algorithm. These parameters will be used as initial ones in the next EM step.


Case 1: If , set and compute the parameters and determine the probability , , , , in the MM
(16) with and initialization for as
(17) Case 2: If , compute the parameters and determine the probability , , , in the MM
(18) We use the same initialization (17) for and
(19)
If we use a MM with wrapped Gaussians with just diagonal covariance matrices, Step 2.3 is superfluous and the new parameters are those from the initialization.
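The BIC-based choice of the number of components referred to above can be sketched generically; the parameter count `3K - 1` assumes a univariate Gaussian mixture, and the exact criterion of Appendix 7.4 may differ.

```python
import math

def bic_select(samples, max_K, fit, loglik):
    """Choose the number of mixture components K by the Bayesian
    information criterion BIC(K) = k_params * ln(n) - 2 * log-likelihood.

    `fit(samples, K)` returns a fitted model and `loglik(model, samples)`
    its log-likelihood; a univariate K-component Gaussian mixture has
    3K - 1 free parameters (weights, means, variances).
    """
    n = len(samples)
    best_K, best_bic = None, math.inf
    for K in range(1, max_K + 1):
        model = fit(samples, K)
        bic = (3 * K - 1) * math.log(n) - 2.0 * loglik(model, samples)
        if bic < best_bic:
            best_K, best_bic = K, bic
    return best_K
```

If extra components do not improve the log-likelihood, the `ln(n)` penalty makes BIC prefer the smallest model.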
Remark 4.1.
If we consider the sparse mixture model of diagonal wrapped Gaussian or von Mises distributions, then the estimation steps and in Algorithm 2 reduce to fitting univariate marginal distributions. This considerably decreases the computational (time and storage) complexity.