Measuring the discrepancy between probability distributions is at the heart of statistics and machine learning problems. A classic example in statistics is the hypothesis testing in higher dimensions, which has attracted a plethora of interest in recent years gretton2012kernel; ramdas2017wasserstein; chwialkowski2015fast. Similarly, in generative modeling, leveraging probability metrics and discrepancy measures as an alternative to the adversarial networks, used in Generative Adversarial Networks (GANs), has become an exciting topic dziugaite2015training; mohamed2016learning; li2017mmd; arjovsky2017wasserstein. Notably, variations of the Wasserstein distances and the Maximum Mean Discrepancy (MMD) have enjoyed ample attention from the community and have incited many enthralling works in the literature.
There are specific challenges with measuring the discrepancy between two high-dimensional probability distributions, including the high computational cost (e.g., for p-Wasserstein distances), and growing sample complexity, i.e., in the sense of the dependence of convergence rate of a given metric between a measure and its empirical counterpart on the number of samples genevay2019sample. The community has tackled these challenges from different angles in recent years. One of the thought-provoking approaches is via slicing high-dimensional distributions over their one-dimensional marginals and comparing their marginal distributions kolouri2019sliced; nadjahi2019asymptotic. The idea of slicing distributions is related to the Radon transform and has been successfully used in, for instance, sliced-Wasserstein distances in various applications rabin2011wasserstein; kolouri2016sliced; carriere2017sliced; deshpande2018generative; kolouri2018sliced; nadjahi2019approximate. More recently, kolouri2019generalized extended the idea of linear slices of distributions, used in sliced-Wasserstein distances, to non-linear slicing of high-dimensional distributions, which is rooted in the generalized Radon transform.
In this paper, we leverage the idea of slicing high-dimensional distributions and introduce a broad family of probability metrics named Generalized Sliced Probability Metrics (GSPMs). We provide a geometric interpretation of these metrics, and show their connection to the well-celebrated MMDs. GSPMs are built based on the idea of ’slicing’ high-dimensional distributions, or the pushforward measure of the high-dimensional input distributions for a real function. We emphasize that GSPMs are not a subclass of Integral Probability Measures (IPMs) muller1997integral; dziugaite2015training, however, they share many commonalities. We show that a subset of GSPMs is equivalent to MMDs, and leverage this connection to define geometrically interpretable kernels for MMDs that were not explored in the literature prior to this work.
Finally, following the work of arbel2019maximum, we identify some regularity conditions under which we show that the introduced kernels, which are rooted in GSPMs, satisfy the conditions for the global convergence of gradient flows. Hence, the proposed kernels are suitable for applications dealing with probability flows and implicit generative modeling.
Let and be probability measures defined on a measurable space, , with corresponding densities and . In addition, let denote a metric space. Let be a class of real-valued bounded measurable functions on . Then the slice of a probability measure , with respect to , is the pushforward measure . We use the equivalent terminology that the slice of a
-dimensional probability density function(), with respect to a function , is a one-dimensional probability density function that is defined as:
where is a one-dimensional Dirac function. Intuitively, is the distribution of (which is scalar) when s are i.i.d samples from , .
2.1 Radon Transform
In Radon transform, we are interested in the question of whether one can recover the distribution from its slices . In other words, when does the set preserve the information contained in ?
Classical Radon Transform: Denote by the unit sphere in a -dimensional Euclidean space. The classical Radon transform shows that when the function class is “linear”, i.e., , the corresponding slices, contain all the required information to recover the distribution . The previous statement implies that the classical Radon transform map is invertible, i.e. we have
for and , where
is a one-dimensional high-pass filter with a Fourier transform, appearing as a result of the Fourier slice theorem. The geometric interpretation of this process is that integrates
along the hyperplane.
Generalized Radon Transform: Classical Radon transform can be extended to the generalized Radon transform (GRT) to integrate over hypersurfaces i.e. -dimensional manifolds, . The literature on GRT focuses on parametric functions defined on with and . These functions are so-called “defining functions”. To ensure that GRT is invertible, the following necessary conditions are identified homan2017injectivity:
must be a real-valued function on to guarantee the smoothness of hyper-surfaces,
must be homogeneous of degree one in , i.e., . The condition is required to guarantee a unique parametrization of hypersurfaces,
must be non-degenerate in the sense that . The non-degenerate assumption ensures that the -dimensional hypersurfaces do not collapse to points, and the integrals are well defined,
The mixed Hessian of must be strictly positive, i.e., for , and . This condition is a local form of the Bolker’s condition (See homan2017injectivity), which allows one to locally identify with the covector .
The linear function class is one example of such family of “defining functions”. Invertibility of GRTs is a long standing research problem. We provide below a number of well-studied classes of “defining functions”, that ensure invertibility of GRTs.
In kuchment2006generalized, it is shown that the circular defining function, with and
provides an injective GRT. Homogeneous polynomials with an odd degree also define an injective GRTehrenpreis2003universality, i.e. , where we use the multi-index notation , , and . The summation here iterates over all possible multi-indices , such that , where represents the polynomial degree and . The parameter set for homogeneous polynomials is then set to be . One can see that the choice of recovers the linear case , in that the set of the multi-indices with becomes and includes elements. We note that GRT was also the basis for the recently proposed generalized sliced-Wasserstein distances kolouri2019generalized.
3 Generalized Sliced Probability Metrics (GSPMs)
In this section, we show that any probability metric between one-dimensional probability measures can be extended to higher-dimensions via the concept of generalized slicing. Let be a metric for one-dimensional probability measures. Then, for probability measures and defined on with respective densities and , the proposed GSPM is defined as follows:
where . Let us first show that GSPM is a metric. Non-negativity and symmetry immediately follow from non-negativity and symmetry of , while triangle inequality follows from the Minkowski inequality:
Finally, the identity of indiscernibles states that, if and only if (iff) . The forward proof is straightforward: results in and since is a metric for . If for , we can conclude that iff the GRT is injective. Hence, if GRT is injective then GSPMs provide a metric. Otherwise, GSPMs are pseudo-metrics.
Equation (4) is based on the expected value of , when where
is the uniform distribution on. Here we show that the max version of GSPMs are also metrics. Substituting the expected value with supremum, leads to a metric defined as:
Verifying the metric properties for Eq. (5) is trivial, given the properties of (see the supplementary material). Note that the recently proposed distances like Sliced Wasserstein (SW) distances and max-SW distances are a special case of GSPMs and Max-GSPMs.
4 GSPMs and MMDs
The seminal work by gretton2007kernel; gretton2012kernel on maximum mean discrepancy (MMD) provides a framework for efficient comparison of probability distributions. MMD is an integral probability metric sejdinovic2013equivalence, and has become a popular choice of comparison between distributions in a wide variety of applications, e.g., generative modeling li2017mmd; tolstikhin2018wasserstein, and gradient flows arbel2019maximum. In practice, MMD is defined with respect to a Reproducing Kernel Hilbert Space (RKHS), with a unique kernel. Like other kernel methods, the choice of kernel is often an application-dependent choice. In what follows, we show that an interesting family of GSPMs could be related to MMDs. Notably, we combine generalized slices together with a specific family of distances, which both have clear geometric interpretations, and obtain MMDs with well-defined kernels.
Consider Equation (4) for the special case of and , where is a positive(-definite) linear operator. The positive assumption enforces to be a norm (i.e., the weighted Euclidean norm). If is positive semi-definite, then would become a pseudo-metric, and as a consequence also becomes a pseudo-metric. Given a linear operator, , we can write:
We focus on practical settings where we only observe samples and from these distributions. Substituting the empirical distribution in Equation (1) give us the empirical slices as and . Using a common trick-of-trade in statistics, and without the loss of generality, we consider a smoothened version of the empirical slices via a radial basis function (RBF), , where identifies the radius of the RBF (). Note that using is equivalent to assuming smoothness priors on the slices.
By plugging in the (smoothened) empirical sliced distributions into (6), we obtain:
Equation (7) is also the squared MMD with the particular kernel shown there-in. Note that one can use the Monte-Carlo integral approximation to obtain an algorithmic way of calculating the kernel for any feasible , , and .
We now argue that these family of kernels are positive definite (PD). Indeed,
is a dot-product kernel, which is by definition PD, and summation/integration of PD kernels results in a PD kernel. Therefore,
is a PD kernel. Below, we study some special interesting cases of the GSPMs based on , and their equivalent MMD form based on kernels.
4.1 First example:
When , the GSPM is a generalized-sliced distance between the two distributions. This subsection shows that the work of knop2018cramerwold follows this setting. In addition, we demonstrate that while such generalized sliced distance might not be as interesting for , from a geometric point of view, it becomes appealing for more complex family of slices (e.g., homogeneous polynomials).
Assuming the RBF is a Gaussian, and using the inner product between two Gaussians, one can show that the dot-product kernel in Eq. (8) boils down to:
The geometric interpretation of Equation (10) is quite interesting. First note that therefore, the pre-image of a scalar in the range of is a hyper-surface in . This means that all points living on a hyper-surface would be projected to the same scalar in the range of (i.e., iso-hyper-surface). Therefore, while and could be far away from one another (in a Euclidean sense), as long as they live on the same or nearby iso-hyper-surfaces they will considered to be similar (with respect to ). Figure 1 demonstrates this effect and shows different s, from family of linear functions parameterized by on a unit sphere (a), and family of polynomials of degree 5 (b), for which samples are considered near-by/far-away.
As a special case, knop2018cramerwold used linear slices (i.e., ) and showed that when is the Gaussian function, then Equation (9) has a closed form:
where is a Kummer’s confluent hypergeometric function barnard1998gram and can be approximated as:
The above kernel also holds when is the Fourier transform, which is due to the fact that the Fourier transform is a unitary linear operator, i.e. satisfies
. However, note that the Fourier transform of a PDF is the characteristic function. Therefore, one would be considering L2-norm squared of the characteristics functions of the slices.
Recall that in these derivations, we started by fixing a slicing operation (linear slices), and used a specific distance, i.e.
distance, and that we know the geometric meaning of both of these steps and their implications. Then, we ended up with a novel PD kernel that defines a MMD, which inherits these geometric properties. Here we emphasize that the distance used here (and inknop2018cramerwold) is a Sliced-. In the next section, we study the specific case of the Generalized-Sliced-Cramér distance.
4.2 Second example: is the cumulative integral operator
Now, we choose as the cumulative integral operator:
Note that such is a positive definite operator. In this setting, the distance is the 2-Cramér distance Cramer1928composition between the two one-dimensional probability distributions, and , which is recently used in various publications bellemare2017cramer; kolouri2020sliced. The Cramér distance shares some common characteristics to those of the Wasserstein distances. In fact, the 1-Cramér distance and the 1-Wasserstein distance are equivalent. It is straightforward to show that
is unbounded. Note that is the CDF of an RBF, and therefore its integral is unbounded. However, assuming that the integral domain is , we can find closed form solutions for . For instance, for we have that is a step function and,
The boundedness assumption enforces us to use kernels with a bounded range (hence, Gaussian kernels won’t be allowed in this setting). Our experiments indicate that smoothstep functions, often used in computer graphics, are well-suited candidates for . The ’th order smoothstep function is defined as:
We include the derivations of , with the smoothstep functions, in the supplementary material.
4.3 Third example: is a generic integral transform
Integral transforms provide a broad family of linear operators, which could be used in Equation (6) to define novel distances/pseudo-distances (depending on the invertibility of the transform). The integral transform of a function,
, is a generic linear transform defined as:
where is the integral kernel or the nucleus of the transform. In this work, we suffice to mention this family of linear operators as an interesting class of operators for further studies.
5 GSPM Gradient Flows
Gradient flows have become increasingly popular in implicit generative modeling csimcsekli2018sliced; arbel2019maximum; kolouri2019generalized
, where the aim is to minimize a functional in the Wasserstein space (i.e., the space of probability measures with bounded second-order moments, metrized by the Wasserstein-2 metric), given as follows:
In this section, we will exploit the connections that we developed between GSPMs and MMD (as detailed in Section 4) and develop a globally convergent algorithm for solving problems of the form of (13) by building up on the recent theoretical results given in arbel2019maximum.
We now present the GSPM-flows, that aim at generating a path of measures which minimizes the squared GSPM between an initial measure and a target measure as goes to infinity. In particular, we will consider the gradient flow, informally expressed as follows:
where denotes a notion of a gradient in the Wasserstein space ambrosio2008gradient. Such gradient flows are of particular interest for generative modeling, since if the solution paths of the flow can be shown to converge to the global optimum
, then one can approximately simulate the gradient flow in order to solve the minimization problem and estimate.
Under appropriate conditions ambrosio2008gradient, a path is a solution of (14) if and only if it solves a continuity equation of the form:
denotes the divergence operator and
is a vector field, given as follows:arbel2019maximum
where is defined in (9).
The partial differential equation representation (15) has important practical implications, since such PDEs are often associated with a McKean-Vlasov (MV) process bogachev2015fokker, which can be used for developing practical algorithms. In particular, associated to the continuity equation, we can define a MV process as a solution to the following differential equation:
where denotes the state of the process at time . Here, evolves through the drift function , which requires the knowledge of , i.e., the density function of . The interest in this process is that the probability density functions of solve the continuity equation, hence, solving the optimization problem (13) reduces to simulating (16).
Unfortunately, exact simulation of (16) is often intractable due to (i) the process is continuous-time, it needs to be discretized, (ii) the drift depends on the density , which is not available in general. We will focus on the discretization of the process first, then we will develop a particle-based approach to alleviate the second problem.
In order to discretize (16), we consider the noisy Euler-Maruyama scheme, proposed in arbel2019maximum, given as follows:
where is a step-size, denotes the iterations, denotes the density of , denotes an inverse temperature variable, and is a standard Gaussian variable. If for all , this scheme reduces to the standard Euler-Maruyama discretization, whereas a positive would drive the scheme to explore the space in a more efficient way.
As one of our main contributions, we will now identify sufficient regularity conditions on the defining function and the smoothing function , which will be required for the convergence analysis of the gradient flow and its discretization (17).
is a linear, bounded, positive semi-definite operator with the corresponding operator norm .
There exists a constant , such that (for any ) for all and
for all .
There exists a constant , such that the following inequalities hold: , , , and .
We now present our main result.
Furthermore, let be the iterates obtained by (17). If as , then the following bound holds:
where denotes the density of and
The proof is given in the supplement. This result shows that, with sufficiently regular and , the noisy Euler scheme (17) can achieve the global optimum, where the convergence rate depends on the structure of and .
Even though Theorem 1 hints the potential of the proposed gradient flow, the discretization scheme (17) is unfortunately still intractable due to the dependency of on . In order to obtain a practical algorithm, we finally consider a particle system that serves as an approximation to the original system (17), and given as follows:
where denotes the particle index and denotes the empirical distribution of . Here, the idea is to approximate by by evolving different particles at the same time. Similar schemes have proved successful in generative modeling csimcsekli2018sliced and Bayesian machine learning liu2016stein. Moreover, one can further show that the particle system converges to the true system (17) with a rate of durmus2018elementary; arbel2019maximum.
6 Numerical Experiments
In this section, we present our experimental results that illustrate our framework.
6.1 Gradient flow – Synthetic
We first perform a numerical experiment with synthetic datasets to demonstrate the performance of the proposed GSPM-MMD kernels. To simplify the presentation, in our first experiment, we assumed the noise for all (i.e., the standard Euler-Maruyama discretization). We consider three two-dimensional target distributions, namely the Swiss Roll, the 8-Gaussians, and the 25-Gaussians distributions. The source distribution is initialized with samples from a Gaussian distribution. Figure 2 shows the datasets and the flow (calculated using GSPM-MMD). We calculate the gradient flow updates (See Equation (20)) to match the source and the target distributions.
For our method, we used the GSPM-MMD kernel with and when is the cumulative integral operator (i.e., the 2-Cramér distance). For simplicity, we used linear slices . Also, as a standard baseline for comparison, we apply the Gaussian kernel and minimize the MMD flow. In each iteration of the gradient flow, we measure the 2-Wasserstein distance between the updated source and the target distribution. For each method we vary , and repeat the experiments times. Figure 3 compares the algorithms on the three datasets and for various . For the cumulative integral operator we used slices.
Effect of noise: The addition of noise lessens the effect of a poor choice of by allowing the particles to explore the space in a more efficient manner. To demonstrate the effect of the addition of noise to the updates (See Equation (20)) we repeated the experiment in Figure 2 for the Swiss Roll dataset, but with a poor choice of . From Figure 2, one can see that is too small for calculating an effective flow. Hence, we chose and solved a noisy gradient flow problem with the MMD-RBF kernel (baseline) and the GSPM-MMD with kernel. We selected the initial and decayed the noise in each gradient iteration with a rate ( being the iteration). The log 2-Wasserstein between the source and target distributions is depicted in Figure 4. As expected, addition of noise improves the overall performance of gradient flows.
Linear or non-linear slices: So far, in our experiments, we have only used linear slices, i.e., . Here, we compare GSPM-MMD flows solely based on the choice of linear and non-linear slices. For the non-linear slices, in this experiment, we use the homogeneous polynomials of degree 5 (see Figure 1). To ensure a fair comparison, we chose the number of random slices for both GSPM-MMD kernels to . Figure 5 shows the comparison between linear and polynomial slices.
6.2 Gradient flow – MNIST
To show the effectiveness of the proposed distances in higher dimensions, we designed the following experiment. We first learn a simple convolutional auto-encoder, with an added classifier on its bottleneck to ensure a discriminative space embedding, to embed the MNIST dataset into a-dimensional space. Then we solve the gradient flow problem in the embedded space with particles initialized from a Gaussian distribution.
Similar to the previous experiments, we use MMD-RBF, GSPM-MMD with (denoted as GSPM-MMD 1), and GSPM-MMD with being the cumulative integral (denoted as GSPM-MMD 2) and calculate the flow between the source and target distributions. We measure the 2-Wasserstein distance between the distributions at each iteration. The experiments were repeated times and the average performance for each method is reported in Figure 6 (top row). After the convergence, we sort the particles according to the output of the classifier and feed them to the decoder network to visualize the corresponding digits for each method (See the bottom row in Figure 6). We note that same was used for all three methods, and linear slicing was used in this experiment. We conclude that the GSPM-MMD with the cumulative integral operator, which corresponds to the sliced-Cramér distance, seems to achieve a superior performance in comparison with the other two kernels.
We introduced a new family of distances, denoted as Generalized Sliced Probability Metrics (GSPMs), which calculate the expected distances between slices (i.e., one-dimensional marginals) of two input distributions. We then showed that a subset of the proposed distances is equivalent to the squared maximum mean discrepancy (MMD) with new kernels introduced in this work, denoted as GSPM-MMD kernels. Furthermore, we applied the GSPM-MMD kernels in the domain of gradient flows for implicit generative modeling, which has recently attracted ample attention from the research community. More importantly, we identified sufficient regularity conditions on the building elements of our proposed distance (and consequently the proposed kernels) for guaranteeing global convergence of the gradient flow. Finally, we provide extensive ablation experiments to test our proposed distance on synthetic and real datasets.
We invoke the following lemma from steinwart2008support to prove our result.
(Lemma 4.34 in steinwart2008support) Let be an open subset, be a kernel on , be a feature space of , and be a feature map of . Let be an index such that the mixed partial derivative of with respect to the coordinates and exists and is continuous. Then the partial derivative with respect to the -th coordinate exists, is continuous, and for all we have
Proof of Theorem 1
We prove that the kernel in (9) has -Lipschitz gradients:
and satisfies the following inequality:
Then the rest of the proof follows from arbel2019maximum, Proposition 1 (existence and uniqueness) and Proposition 8 (convergence of the Euler scheme).
Recalling the definition of from Lemma 2, we have that
Applying Lemma 2, we can simplify above to get
On the other hand,
Due to Lipschitz continuity of as well as Lemma 2, the above entails that
Combining above with (23), we have by triangle inequality that
Integrating above over and interchanging the integral with the norm on the left-hand-side proves Condition 21 with .