Kernel methods constitute of a wide class of algorithms for non-parametric data analysis of massive high dimensional datasets. Typically, a limited set of underlying factors generates the high dimensional observable parameters via non-linear mappings. The non-parametric nature of these methods enables to uncover hidden structures in the data. These methods extend the well known Multi-Dimensional Scaling (MDS) cox:MDS ; kruskal:MDS method. They are based on an affinity kernel construction that encapsulates the relations (distances, similarities or correlations) among multidimensional data points. Spectral analysis of this kernel provides an efficient representation of the data that simplifies its analysis. Methods such as Isomap tenenbaum:Isomap , LLE roweis:LLE , Laplacian eigenmaps belkin:LaplacianEigenmaps , Hessian eigenmaps donoho:HessianEigenmaps and local tangent space alignment yang:LTSA ; zhang:LTSA , extend the MDS paradigm by assuming to satisfy the manifold assumption. Under this assumption, the data is assumed to be sampled from a low intrinsic dimensional manifold that captures the dependencies between the observable parameters. The corresponding spectral-based embedding performed by these methods preserves the geometry of the manifold that incorporates the underlying factors in the data.
The diffusion maps (DM) method coifman:DM
is a kernel-based method that defines diffusion similarities for data analyzes by imposing a Markovian process over the dataset. It defines a transition probability operator based on local affinities between multidimensional data points. By spectral decomposition of this operator, the data is embedded into a low dimensional Euclidean space, where Euclidean distances represent the diffusion distances in the ambient space. When the data is sampled from a low dimensional manifold, the diffusion paths follow the manifold and the diffusion distances capture its geometry.
DM embedding was utilized for a wide variety of data and pattern analysis techniques. For example it was used to improve audio quality by suppressing transient interference talmon2013single . It was utilized in Schclar2010111SeamanticVisualWords2009 , gene expression analysis Cancer2007 and source localization TalmonSourceLocal2011 . Furthermore, the DM method can be utilized for fusing different sources of data LafonDataFusion2006 ; Keller2010 .
, are obtained by the principal eigenvectors of the corresponding diffusion operator. These eigenvectors represent the long-term behavior of the diffusion process that captures its metastable stateshuisinga:metastability as it converges to a unique stationary distribution.
The MGC framework wolf:MGC ; wolf:MGC-sampta enhances the DM method by incorporating information about data distribution in addition to the local distances on which DM is based. This distribution is modeled by a probability measure, which is assumed to quantify the likelihood of data presence over the geometry of the space. The measure and its support in this method replace the manifold assumption. Thus, the diffusion process is accelerated in high density areas in the data rather than being depended solely on the manifold geometry. As shown in wolf:MGC , the compactness of the associated integral operator enables to achieve dimensionality reduction by utilizing the DM framework.
This MGC construction consists of two independent data points representations. The first represent the domain on which the measure is defined or, equivalently, the support of the measure. The second represent the domain on which the MGC kernel function and the resulting diffusion process are defined. These measure domain and the analyzed domain may, in some cases, be identical, but separate sets can also be considered by the MGC-based construction. The latter case utilizes a training dataset, which is used as the measure domain to analyze any similar data that is used as the analyzed domain. Furthermore, instead of using the collected data as an analyzed domain, it can be designed as a dictionary or as a grid of representative data points that capture the essential structure of the MGC-based diffusion.
In general, kernel methods can find geometrical meaning in a given data via the application of spectral decomposition. However, this representation changes as additional data points are added to the given dataset. Furthermore, the required computational complexity, which is dictated by spectral decomposition, isthat is not feasible for a very large dataset. For example, a segmentation of a medium size image of pixels requires a kernel matrix of size . The size of such matrix necessitated about GByte of memory assuming double precision. Spectral decomposition procedure applied to such a matrix is a formidable slow task. Hence, there is a growing need to have more computationally efficient methods that are practical for processing large datasets.
Recently, a method to produce random Fourier features from a given data and a positive kernel was proposed in Recht07 . The suggested method provides a random mapping of the data such that the inner product of any two mapped points approximates the kernel function with high probability. This scheme utilizes Bochner s theorem Bochner62
that says that any such kernel is a Fourier transform of a uniquely defined probability measure. Later, this work was extended inMaji09 ; Vedaldi2012 to find explicit features for image classification.
In this paper, we focus on deriving a representation that preserves the diffusion distances between multidimensional data points based on the MGC framework wolf:MGC ; wolf:MGC-sampta . This representation is applicable to process efficiently very large datasets by imposing a Markovian diffusion process to define and represent the non-linear relations between multidimensional data points. It provides a diffusion distance metric that correlates with the intrinsic geometry of the data. The suggested representation and its computational complexity cost per a given data point are invariant to the dataset size.
The paper is organized as follows: Section 2 presents the problem formulation. Section 3 provides an explicit formulation for the diffusion distance. The main results of this paper are established in Sections 4 and 5 that present the suggested low-dimensional embedding of the data and its characterization. Practical demonstration of the proposed methodology is given in Section 6.
2 Problem formulation and mathematical preliminaries
Consider a big dataset such that for any practical purposes the size of is considered to be infinite. Without-loss-of-generality, we assume that for all , . Implementation of a kernel method, which uses a full spectral decomposition, becomes impractical when the dataset size is big. Instead, we suggest to represent the given dataset via the density of data points in it using the MGC kernel. In other words, let be the density function of . We aim to find an explicit embedding function denoted by , where is the embedding dimension such that . Each member in , which is denoted by , , depends on the density function .
DM provides a multiscale view of the data via a family of geometries that are referred to by diffusion geometries. Each geometry is defined by both the associated diffusion metric and the diffusion time parameter that are linked by where is a localization parameter. The diffusion maps are the associated functions that embed the data into Euclidean spaces, where the diffusion geometries are preserved such that , .
Given an accuracy requirement , we aim to design an embedding that preserves the diffusion geometry for such that for all ,
We call the embedding the diffusion representation. From the requirement in Eq. (2.1), the Euclidian distance between pairs of representatives approximates the diffusion distance between the corresponding data points over the density of these data points in the MGC-based kernel when . If Eq. (2.1) holds, then preserves the diffusion geometry of the dataset in this sense.
The rest of this section is dedicated to provide additional details regarding the diffusion geometries that utilize the MGC kernel.
2.1 Diffusion geometries
A family of diffusion geometries of a measurable space with a measure is determined by imposing a Markov process over the space. Given a non-negative symmetric kernel function , then an associated Markov process over the data via the stochastic kernel is
where is the local volume function. In a discrete setting, it is called the degree function. In a continues settings, the local volume function is defined by
The associated Markovian process over is defined via the conjugate operator of the integral operator that is denoted by
. Thus, for any initial probability distributionover , is the probability distribution over after a single time step. The probability distribution over after time steps is given by the -th power of . Specifically, if the initial probability measure is concentrated in a specific data point , i.e. , then the probability distribution after time steps is , denoted also by . Thus, is the probability that a random walker, which started his walk in , will end in after time steps. Based on this, the -time diffusion geometry is defined by the distances between probability distributions such that for all
Equations 2.1 and 2.4 suggest that the embedding approximately preserves (for ) the distance between probability distributions. Such a family of geometries can be defined for any Markovian process and not necessarily for a diffusion process. It is proved in coifman:DM that under specific conditions, the defined Markovian process approximates the diffusion over a manifold from which the dataset is sampled. If the Markovian process is ergodic, then it has a unique probability distribution , to which it converges independently of its initial distribution, namely, for any , , independently of . This probability measure is an normalization of the local volume function (Eq. (2.3)), i.e. .
2.2 Measure-based Gaussian Correlation kernel
The measure-based Gaussian Correlation (MGC) kernel wolf:MGC ; wolf:MGC-sampta defines the affinities between elements in , which in this context, it is referred to as the analyzed domain via their relations with the reference dataset that is referred to as the measure domain. This framework enables to have a flexible representation of , as long as , which in some sense characterizes the data, is sufficiently large. For example, it was shown in MGCunigrid that in order to compute the MGC-based diffusion to a large, perhaps infinite dataset, it suffices to sample it to compute the associated DM of the sampled dataset and to extend it to the rest of the dataset via an out-of-sample extension procedure. The sampling rate, shown in MGCunigrid , depends on the density of and on the required accuracy. In this sense, is considered as a grid for the whole dataset.
Mathematically, for the analyzed domain and for the measure domain with a density function defined on the measure domain, the MGC kernel is defined as
where is an
unit matrix. For a fixed mean vectorand a covariance matrix , is the normalized Gaussian function given by
Since the MGC kernel in Eq. (2.5) is symmetric and positive, it can be utilized to establish a Markov process as was described in Section 2.1. The associated diffusion parameters from Eqs. (2.2) , (2.3) and (2.4) are , and , respectively.
3 Explicit forms for the diffusion distance and stationary distribution
In general, the integral in Eq. (2.5
) does not have an explicit form. However, for our purposes, we adopt the Gaussian Mixture Model (GMM), which assumes that the density
is a superposition of normal distributions. Under this assumption,takes the form
for appropriate mean vectors and covariance matrices , , (see Eq. (2.6
)). Estimating Eq. (3.1) is a generally known problem that has been extensively investigated such as in Day1969 ; mclachlan88 with many published implementations. Such an estimation enables to provide an explicit (closed form) representation of the diffusion geometry in Eq. (2.4).
First, a closed form for the inner product , , is presented. This inner product closed form enables to get an explicit formulation for the first time step of the MGC-based DM distance . This formalism is established in Theorem 3.1, whose proof appears in A.
Assume that the GMM assumption in Eq. (3.1) holds and let be the inner product . Denote and . Let be defined by
Then, for any , the kernel affinity , the stationary distribution and the inner product have explicit forms given by
4 Explicit Diffusion Maps of the analyzed domain
The diffusion distance provides a relation between a pair of data points in the analyzed domain. In this section, we find a representation of any data point in the analyzed domain that preserves the diffusion distance relation. We assume that the covariances in the GMM (Eq. (3.1)) are all identical, namely, , , and that the analyzed domain is a subset of the unit ball in . The Taylor extension
where is a scalar is reformulated with to be
The exponent in Eq. (4.2) is formulated as an inner product such that
where is a vector of infinite length whose first term () is and for we have the generating function where is the Kronecker product operator. In the following we consider the vector as a single term of that corresponds to the th term from the relevant Taylor expansion.
For the diffusion representation, we reformulate the inner product in Eq. (3.5) to be a vector multiplication of two vectors. The first depends only on and the second depends only on . Let be a -dimensional vector with the entries (Eq. (3.2)). Then, the inner product in Eq. (3.5) is given by the matrix multiplication
where the and entry in the matrix is given by . The scalar depends on both and . From the assumption on the GMM structure, we have . Let then can be reformulated by
Hence, the matrix is the result from an outer-product of the form
and the infinite dimensional vector is the weighted sum of the inner product in the exponent decompositions (Eq. (4.3)).
5 Truncated diffusion representation
The diffusion representation in Lemma 4.1 is an infinite size vector. In this section, this vector is truncated and an error estimate for the resulted truncated diffusion distance is provided. The infinite number of entries in originates from the Gaussian decomposition multiplication in Eq. (4.3). Let be a Taylor expended function on the interval , where is an arbitrary upper limit of . Then, the Lagrange remainder of the th order Taylor (Maclaurin) expansion of on this interval is given by
Since we have
Let be the th order truncated version of from Eq (4.10) that is given by,
where the matrix is given by
and is a vector that contains the first terms from . Furthermore, let be the vector that contains the truncated terms from the th term in to . Then, the diffusion distance error is given by
where the matrix is given by
The Frobenius matrix norm and the Euclidean vector norm are compatible Meyer:2000 such that
The product is the remainder of the Taylor series. By using Eq. (5.2), we get
where is the solution to the Trust Region Problem (TRP) of the form
where and . TRP in Eq. (5.11) is widely investigated in the literature. Important properties are given in GQT66 ; G81 ; MS83 . In B, we provide complementary theoretical details about this problem. In the following, we assume that the solution was found such that the maximal value of is known.
Lemma 5.1 provides an additional bound for .
Assume that is defined according to Eq. (5.6). Then, for all
The inner product is the reminder of the Taylor series of with terms. Hence,
where . The derivative is given by
By examining the stationary distribution component in the denominator of Eq. (5.5), we assume that a given a minimal threshold establishes that any , which is smaller than , is negligible. By substituting Eq. (5.17) into Eqs. (5.16) and into Eq. (5.5) we get
By imposing the requirement in Eq. (2.1) for in the form
we find the minimal such that
The number of Taylor terms of the Taylor expansion is derived by computing Eq. (5.20) for an increasing integers from until the requirement in Eq. (5.20) is satisfied. We expect that for a practical use, iterations are sufficient.
Let be any two data points. Assume we have a properly trained GMM and the number of Taylor terms is designed according to Eq. (5.20). Then, the error in the diffusion distance (Eq. (2.1)) is bounded by .
Let and be the corresponding embedding of and , respectively. Furthermore, let and be the corresponding truncated embedding of and , respectively. From Eq. (5.19) we have and . Therefore, from the triangle inequality
Hance, . ∎
5.1 The computational complexity to compute the diffusion representation
Assume that GMM-based and are designed according to Eq. (5.20). Assume that is the length of the representation that is dictated by the dimension of the measurement such that . The computation of in Lemma. 4.1 has a computational complexity of due to the multiplication and due to the truncation of that has Taylor terms. We also assume that . Hence, the computational complexity for the computation of the terms in the vector is negligible compared to the computational complexity of .
6 Experimental Results
This section presents several examples that demonstrate the principles of the MGC-based closed-form embedding. The first example presents an analysis of a density function for which the stationary distribution is analytically known. The closed-form stationary distribution in this case is compared to the analytical stationary distribution. The second example compares the analytical distance computation with the corresponding distance that is computed via the closed-form embedding defined in Eq. (4.10).
6.1 Example I: A data analysis with an analytically known stationary distribution function
Let the density function includes two flat squares one above the other with probability to draw samples from the lower square and to draw samples from the upper square. In other words,
where is given by
and is the Gauss error function.
Given a trained GMM, the stationary distribution is computed by using Eq. (3.4). First, data points were randomly selected from the distribution in Eq. (6.1). Then, a grid was constructed to compute the stationary distribution via Eq. (3.4) based on the analytical solution in Eq. (6.2).
Figure 6.1(b) presents the stationary distribution (Eq. (3.4)) with a minor distortion compared to the analytical distribution (Eq. (6.2)) presented in Fig. 6.1(a). The error, which is presented in Fig. 6.1(c), is a result from the GMM-based training on a small set of data points. The difference between the closed form stationary distribution and the analytical stationary distribution is in the order of and it is the result from the GMM training error. In the following, we assume that the GMM-based training error is negligible.
6.2 Example II: Estimation of the diffusion distance error originated from the truncated representation (Eq. (5.3))
In this section, we show the diffusion distance error from the proposed truncated representation (Eq. (5.3)) of a specific data. Figures 6.2 and 6.3 compare between the diffusion distance in Eq. (3.6) and the corresponding distance computed by Eq. (4.11), respectively, using the truncated representations and computed by
A data of data points was randomly selected from the distribution in Eq. (6.1). Furthermore, the bound on the worst-case diffusion distance error is computed using Eq. (5.18) for the corresponding and where .
Figure 6.2 displays the normalized histogram of (Eq. (6.3)) for different values of and different values of . Figures 6.2(a) and 6.2(b) display a reduction of the distance error as a function of and , respectively. As increases, is reduced in Fig. 6.2(a). Furthermore, as the number of Taylor terms increases, is reduced in Fig. 6.2(b).
Figure 6.3 compares the bound to the worst-case of from the randomized data points as a function of and as a function of . Figure. 6.3(a) compares between the worst-case of the diffusion distance error on the dataset and the corresponding bound as a function of and as a function of .
The presented methodology provides an alternative to a non-parametric kernel method approach for obtaining data representations via spectral decompositions of a big kernel operator or matrix with finite settings. The presentation of our approach is based on the MGC diffusion kernel wolf:MGC and on the resulting measure-based DM embedding obtained by its decomposition. We showed that when the underlying measure is modeled by a GMM, an equivalent embedding, which preserves the diffusion geometry of the data, can be computed without the need to decompose the full kernel. Furthermore, this computation does not depend on the size of the dataset, therefore, it is suitable for modern scenarios where large amounts of data are available for analysis.
The proposed data representation is achieved by finding and decomposing an explicit form of the measure-based diffusion distance in terms of GMM. The decomposition is done by utilizing a Taylor decomposition and by an associated bound for the number of Taylor terms that was formulated for a given measure-based diffusion distance error tolerance requirement. The resulted representation length grows exponentially with the number of Taylor terms, hence, for a practical use, the error tolerance requirement should consider the data dimensionality.
7.1 Future work
Future work will focus on extending the proposed methodology to a more general settings in the following ways below to reduce the representation dimension and to generalize the explicit form for additional kernels.
The utilized generating function for the inner products decomposition is based on the Kronecker product and hence incorporates its structure. This structure maybe utilized to reduce the representation length by using a variant of random projection. Thus, the dimensionality of the representation can reduced.
The proposed methodology is valid for additional Gaussian based kernels such as the standard DM kernel. The formulation of an explicit form for this kernel is straightforward for the above analysis. Hence, similar considerations may lead to an explicit form for the diffusion distance in this case. This research direction may lead to more general explicit forms that maybe used to represent datasets.
We provided an explicit form for the inner product associated with the measure-based diffusion distance, therefore, we are able to explicitly and efficiently compute a matrix of pair-wise distances that consider the entire dataset. This matrix can be decomposed via the application of MDS and the proposed representation can be projected to the resulted MDS space that may has a low dimension due to the involved spectral decomposition.
The explicit resulted representation form can be used to find the most distant data points in terms of the diffusion distance by solving a proper optimization problem. The resulted data points can be used to form a dictionary or to provide data points samples for any general use.
The proposed explicit map and the explicit diffusion distance are differentiable, hence, both may be used to characterize data points in terms of change rate in the diffusion maps as a function of a given specific feature. This characterization may be helpful in identifying both the anomality and the associated features that are the source for anomality occurrence .
The proposed representation corresponds to . An additional interesting research direction is to generalize the analysis to any to allow a time diffusion multiscale analysis of datasets.
This research was partially supported by the Israeli Ministry of Science & Technology (Grants No. 3-9096, 3-10898), US-Israel Binational Science Foundation (BSF 2012282), Blavatnik Computer Science Research Fund and Blavatink ICRC Funds.
Appendix A Proof of Theorem 3.1
First, we introduce some basic Gaussian distribution related equalities. Denoteand In the following, we will use the normal distribution identity
Another useful Gaussian relation is due to rescaling of the covariance matrix and the mean vector in Eq. (2.6) such that
Now, we are ready for the proof of Theorem 3.1.