How many moments does MMD compare?
We present a new way to study Mercer kernels: to a special kernel K we associate a pseudo-differential operator p(𝐱, D) such that ℱ p(𝐱, D)^† p(𝐱, D) ℱ^-1 acts on smooth functions in the same way as the integral operator associated with K (where ℱ is the Fourier transform). We show that kernels defined by pseudo-differential operators can uniformly approximate any continuous Mercer kernel on a compact set. The symbol p(𝐱, 𝐲) encapsulates useful information about the structure of the Maximum Mean Discrepancy (MMD) distance defined by the kernel K. We approximate p(𝐱, 𝐲) by the sum of the first r terms of its Singular Value Decomposition, denoted by p_r(𝐱, 𝐲). If the ordered singular values of the integral operator associated with p(𝐱, 𝐲) decay rapidly, the MMD distance defined by the new symbol p_r differs only slightly from the original one. Moreover, the new MMD distance can be interpreted as an aggregated result of comparing r local moments of two probability distributions. The latter result holds under the condition that the right singular vectors of the integral operator associated with p are uniformly bounded. Even if this condition is not satisfied, we can still show that the Hilbert-Schmidt distance between p and p_r vanishes as r grows. Thus, we report an interesting phenomenon: the MMD distance measures the difference between two probability distributions with respect to a certain number of local moments, r^∗, and this number r^∗ depends on the rate at which the singular values of p decay.
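The truncation step can be illustrated numerically. The sketch below is not the paper's construction: it assumes a hypothetical Gaussian-shaped symbol sampled on a uniform grid and uses a matrix SVD as a stand-in for the SVD of the integral operator (quadrature weights are omitted, so only the relative error is meaningful). It checks the standard fact that the Frobenius (discrete Hilbert-Schmidt) error of the rank-r truncation equals the root of the sum of squares of the discarded singular values.

```python
import numpy as np


def truncated_symbol(P, r):
    """Return the rank-r SVD truncation of the sampled symbol P and all its singular values."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    # Keep only the r leading singular triplets.
    P_r = (U[:, :r] * s[:r]) @ Vt[:r, :]
    return P_r, s


# Hypothetical smooth symbol p(x, y) on [-1, 1]^2 (an assumption for illustration only).
n = 200
x = np.linspace(-1.0, 1.0, n)
y = np.linspace(-1.0, 1.0, n)
P = np.exp(-(x[:, None] - y[None, :]) ** 2)

r = 5  # number of retained terms
P_r, s = truncated_symbol(P, r)

# Discrete Hilbert-Schmidt (Frobenius) distance between p and p_r.
hs_error = np.linalg.norm(P - P_r, "fro")
# The same quantity computed from the discarded singular values.
tail = np.sqrt(np.sum(s[r:] ** 2))

print("relative HS error:", hs_error / np.linalg.norm(P, "fro"))
print("error from singular-value tail:", tail / np.linalg.norm(P, "fro"))
```

When the singular values decay quickly, as they do for this smooth example, a small r already gives a small relative error, which is the regime in which the abstract says the MMD distance changes only slightly.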