Generative modeling, in its unconditional form, refers to the problem of estimating the data generating distribution: given i.i.d. samples $\{x_i\}_{i=1}^n$ with an unknown distribution $P_X$, a generative model seeks to find a parametric distribution that closely resembles $P_X$. In modern deep generative models, we often approach this problem via a latent variable, i.e., we assume that there is some variable $Z \in \mathcal{Z}$ associated with the observed data $X \in \mathcal{X}$ that follows a known distribution $P_Z$ (also referred to as the prior in generative models). Thus, we can learn a mapping $f: \mathcal{Z} \to \mathcal{X}$ such that the distribution after transformation, denoted by $P_{f(Z)}$, aligns well with the data generating distribution $P_X$. Therefore, sampling from $P_X$ becomes convenient since $P_Z$ can be efficiently sampled. Frequently, $f$ is parameterized by deep neural networks and optimized with stochastic gradient descent (SGD).
Existing generative modeling methods variously optimize the transformation $f$, most commonly modeling it as a Maximum Likelihood Estimation (MLE) or distribution matching problem. For instance, given data $X$, a variational autoencoder (VAE) (Kingma and Welling, 2013) first constructs $Z$ through the approximate posterior $q_{Z|X}$ and maximizes a lower bound of the likelihood $p_{f(Z)}(X)$. Generative adversarial networks (GANs) (Goodfellow et al., 2014) rely on a simultaneously learned discriminator such that samples of $P_{f(Z)}$ are indistinguishable from samples of $P_X$. Results in (Arjovsky et al., 2017; Li et al., 2017) suggest that GANs minimize the distributional discrepancies between $P_{f(Z)}$ and $P_X$. Flow-based generative models optimize $p_{f(Z)}(X)$ explicitly through the change of variable rule and by efficiently calculating the Jacobian determinant of the inverse mapping $f^{-1}$.
In all examples above, the architecture or objective notwithstanding, the common goal is to find a suitable function $f$ that reduces the difference between $P_{f(Z)}$ and $P_X$. Thus, a key component in many deep generative models is to learn a forward operator as defined below.
Definition 1.1 (Forward operator).
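A forward operator $f^\star \in \mathcal{C}: \mathcal{Z} \to \mathcal{X}$ is defined to be a mapping associated with some latent variable $Z \sim P_Z$ such that $f^\star = \arg\min_{f \in \mathcal{C}} d(P_{f(Z)}, P_X)$ for some function class $\mathcal{C}$ and a distance measure $d(\cdot, \cdot)$.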
Motivation: The specifics of the forward operator may differ from case to case. But its properties and how it is estimated numerically greatly influence the empirical performance of the model. For instance, mode collapse issues in GANs are well known and solutions continue to emerge (Srivastava et al., 2017). To learn the forward operator, VAEs use an approximate posterior that may sometimes fail to align with the prior (Kingma et al., 2016; Dai and Wipf, 2019). Flow-based generative models enable direct access to the posterior likelihood, yet in order to tractably evaluate the Jacobian of the transformation during training, one must either restrict the expressiveness at each layer (Dinh et al., 2017; Kingma and Dhariwal, 2018) or use more involved solutions (Chen et al., 2018). Of course, solutions to mitigate these weaknesses (Ho et al., 2019) remain an active area of research.
The starting point of our work is to evaluate the extent to which we can radically simplify the forward operator in deep generative models. Consider some desirable properties of a hypothetical forward operator (in Def. 1.1): (a) Upon convergence, the learned operator $f^\star$ minimizes the distance between $P_X$ and $P_{f(Z)}$ over all possible operators of a certain class. (b) The training directly learns the mapping from the prior distribution $P_Z$, rather than a variational approximation. (c) The forward operator $f^\star$ can be efficiently learned and sample generation is also efficient. It would appear that these criteria violate the “no free lunch rule”, and some compromise must be involved. Our goal is to investigate this trade-off: which design choices can make this approach work? Specifically, a well studied construct in dynamical systems, namely the Perron-Frobenius operator (Lemmens and Nussbaum, 2012), suggests an alternative linear route to model the forward operator. Here, we show that if we are willing to give up on a few features in existing models (this may be acceptable depending on the downstream use case), then the forward operator in generative models can be efficiently approximated as the estimation of a closed-form linear operator in the reproducing kernel Hilbert space (RKHS). With simple adjustments of existing results, we identify a novel way to replace the expensive training for generative tasks with a simple principled kernel approach.
Contributions. Our results are largely based on results in kernel methods and dynamical systems, but we demonstrate their relevance in generative modeling and complement recent ideas that emphasize links between deep generative models and dynamical systems. Our contributions are
(a) We propose a non-parametric method for transferring a known prior density linearly in RKHS to an unknown data density, which is equivalent to learning a nonlinear forward operator in the input space. When compared to its functionally-analogous module used in other deep generative methods, our method avoids multiple expensive training steps, yielding significant efficiency gains;
(b) We evaluate this idea in multiple scenarios and show competitive generation performance and efficiency benefits with pre-trained autoencoders on popular image datasets including MNIST, CIFAR-10, CelebA and FFHQ;
(c) As a special use case, we demonstrate the advantages over other methods in limited data settings.
We briefly introduce reproducing kernel Hilbert space (RKHS) and kernel embedding of probability distributions, concepts we will use frequently.
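Definition 2.1 (RKHS (Aronszajn, 1950)).
For a set $\mathcal{X}$, let $\mathcal{H}$ be a set of functions $g: \mathcal{X} \to \mathbb{R}$. Then, $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) with a product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ if there exists a function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ (called a reproducing kernel) such that (i) $\forall x \in \mathcal{X}, g \in \mathcal{H}$: $g(x) = \langle g, k(x, \cdot) \rangle_{\mathcal{H}}$; (ii) $\mathcal{H} = \mathrm{cl}(\mathrm{span}(\{k(x, \cdot) : x \in \mathcal{X}\}))$, where $\mathrm{cl}(\cdot)$ is the set closure.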
Table 1: Notation (rows: mean embedding operator; kernel mean embedding).
The function $\phi(x) = k(x, \cdot): \mathcal{X} \to \mathcal{H}$ is referred to as the feature mapping of the induced RKHS $\mathcal{H}$. A useful identity derived from feature mappings is the kernel mean embedding: it defines a mapping from a probability measure on $\mathcal{X}$ into an element in the RKHS.
Definition 2.2 (Kernel Mean Embedding (Smola et al., 2007)).
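Given a probability measure $p$ on $\mathcal{X}$ with an associated RKHS $\mathcal{H}$ equipped with a reproducing kernel $k$ such that $\sup_{x \in \mathcal{X}} k(x, x) < \infty$, the kernel mean embedding of $p$ in RKHS $\mathcal{H}$, denoted by $\mu_p \in \mathcal{H}$, is defined as $\mu_p = \int k(x, \cdot)\, p(x)\, dx$, and the mean embedding operator $\mathcal{E}: L^1(\mathcal{X}) \to \mathcal{H}$ is defined as $\mu_p = \mathcal{E} p$.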
For characteristic kernels, the operator $\mathcal{E}$ is injective. Thus, two distributions $p, q$ on $\mathcal{X}$ are identical iff $\mathcal{E} p = \mathcal{E} q$.
This property allows the use of Maximum Mean Discrepancy (MMD) for distribution matching (Gretton et al., 2012; Li et al., 2017), which is common; see (Muandet et al., 2017; Zhou et al., 2018). For a finite number of samples $\{x_i\}_{i=1}^n$ drawn from the probability measure $p$, an unbiased empirical estimate of $\mu_p$ is $\hat{\mu}_p = \frac{1}{n} \sum_{i=1}^n k(x_i, \cdot)$, such that $\lim_{n \to \infty} \hat{\mu}_p = \mu_p$.
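As a rough illustration of the empirical mean embedding and its use for distribution matching, the sketch below evaluates the empirical embedding at query points and computes an unbiased estimate of the squared MMD. It assumes a Gaussian kernel with a hand-picked bandwidth; the function names (`rbf_kernel`, `mean_embedding_at`, `mmd2_unbiased`) are illustrative and do not come from the paper.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix with entries exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def mean_embedding_at(samples, query, bandwidth=1.0):
    """Evaluate the empirical mean embedding (1/n) sum_i k(x_i, .) at each query point."""
    return rbf_kernel(query, samples, bandwidth).mean(axis=1)

def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of the squared MMD between the samples x and y."""
    n, m = x.shape[0], y.shape[0]
    k_xx = rbf_kernel(x, x, bandwidth)
    k_yy = rbf_kernel(y, y, bandwidth)
    k_xy = rbf_kernel(x, y, bandwidth)
    # Drop the diagonal terms k(x_i, x_i) so the within-sample averages are unbiased.
    np.fill_diagonal(k_xx, 0.0)
    np.fill_diagonal(k_yy, 0.0)
    return k_xx.sum() / (n * (n - 1)) + k_yy.sum() / (m * (m - 1)) - 2.0 * k_xy.mean()
```

With a characteristic kernel such as the Gaussian one, the estimate should be close to zero for two samples from the same distribution and clearly positive otherwise.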
Definition 2.3 (Covariance/Cross-covariance Operator).
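Let $X, Z$ be random variables defined on $\mathcal{X} \times \mathcal{Z}$ with joint distribution $P_{X,Z}$ and marginal distributions $P_X$, $P_Z$. Let $(k, \phi, \mathcal{H})$ and $(l, \psi, \mathcal{G})$, associated with $Z$ and $X$, be two sets of (a) bounded kernel, (b) their corresponding feature map, and (c) their induced RKHS, respectively. The (uncentered) covariance operator $\mathcal{C}_{ZZ}: \mathcal{H} \to \mathcal{H}$ and cross-covariance operator $\mathcal{C}_{XZ}: \mathcal{H} \to \mathcal{G}$ are defined as
$$\mathcal{C}_{ZZ} = \mathbb{E}_{Z}[\phi(Z) \otimes \phi(Z)], \qquad \mathcal{C}_{XZ} = \mathbb{E}_{X,Z}[\psi(X) \otimes \phi(Z)],$$
where $\otimes$ is the outer product operator.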
3 Simplifying the estimation of the forward operator
Forward operator as a dynamical system:
The dynamical system view of generative models has been described by others (Chen et al., 2018; Grathwohl et al., 2019; Behrmann et al., 2019). These strategies model the evolution of latent variables in a residual neural network in terms of its dynamics over continuous or discrete time $t$, and consider the output function $f$ as the evaluation function at a predetermined boundary condition $t = t_1$. Specifically, given an input (i.e., initial condition) $z$, $f$ is defined as
$$f(z) = s(t_1) \quad \text{s.t.} \quad s(0) = z, \quad \frac{ds(t)}{dt} = \Delta_t(s(t)),$$
where $\Delta_t$ is a time-dependent neural network function and $s(t)$ is the intermediate solution at $t$. This view of generative models is not limited to specific methods or model archetypes, but is generally useful, for example, by viewing the outputs of each hidden layer as evaluations in discrete-time dynamics. After applying $f$ on a random variable $Z$, the marginal density of the output over any subspace $\Lambda \subseteq \mathcal{X}$ can be expressed as
$$\int_{\Lambda} p_{f(Z)}(x)\, dx = \int_{f^{-1}(\Lambda)} p_Z(z)\, dz. \tag{3}$$
If there exists some neural network instance $\Delta_t^\star$ such that the corresponding output function $f^\star$ satisfies $P_X = P_{f^\star(Z)}$, then by Def. 1.1, $f^\star$ is a forward operator. Let $\{x_i\}_{i=1}^n$ be a set of i.i.d. samples drawn from $P_X$. In typical generative learning, either maximizing the likelihood or minimizing the distributional divergence requires evaluating and differentiating through $f$ or $f^{-1}$ many times.
Towards a one-step estimation of forward operator:
Since $f$ and $f^{-1}$ in (3) will be highly nonlinear in practice, evaluating them and computing the gradients can be expensive. Nevertheless, the dynamical systems literature suggests a linear extension of $f^\star$, namely the Perron-Frobenius operator or transfer operator, that conveniently transfers $p_Z$ to $p_X$.
Definition 3.1 (Perron-Frobenius operator (Mayer, 1980)).
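Given a dynamical system $f: \mathcal{X} \to \mathcal{X}$, the Perron-Frobenius (PF) operator $\mathcal{P}: L^1(\mathcal{X}) \to L^1(\mathcal{X})$ is an infinite-dimensional linear operator defined as $\int_{\Lambda} (\mathcal{P} p)(x)\, dx = \int_{f^{-1}(\Lambda)} p(x)\, dx$ for all $\Lambda \subseteq \mathcal{X}$.
Although in Def. 3.1 the PF operator $\mathcal{P}$ is defined for self-maps, it is trivial to extend to mappings $f: \mathcal{Z} \to \mathcal{X}$ by restricting the RHS integral to $\mathcal{Z}$.
It can be seen that, for the forward operator $f^\star$, the corresponding PF operator $\mathcal{P}$ satisfies
$$p_X = \mathcal{P} p_Z.$$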
If $\mathcal{P}$ can be efficiently computed, transferring the tractable density $p_Z$ to the target density $p_X$ can be accomplished simply by applying $\mathcal{P}$. However, since $\mathcal{P}$ is an infinite-dimensional operator on $L^1(\mathcal{X})$, it is impractical to instantiate it explicitly and exactly. Nonetheless, there exist several methods for estimating the Perron-Frobenius operator, including Ulam’s method (Ulam, 1960) and the Extended Dynamical Mode Decomposition (EDMD) (Williams et al., 2015a). Both strategies project $\mathcal{P}$ onto a finite number of hand-crafted basis functions; this may suffice in many settings but may fall short in modeling highly complex dynamics.
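To make the finite-basis idea concrete, the following is a minimal sketch of Ulam's method for a one-dimensional map on $[0, 1)$, where the basis functions are indicator functions of equal-width bins. The doubling map, the bin count, and the function names are assumptions chosen for illustration, not part of the paper's method.

```python
import numpy as np

def ulam_matrix(f, n_bins=20, n_points=20000, seed=0):
    """Estimate the bin-to-bin transition matrix of a map f on [0, 1).

    Entry (i, j) approximates the probability that a point drawn uniformly
    from bin i is mapped by f into bin j.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n_points)
    src = np.minimum((x * n_bins).astype(int), n_bins - 1)
    dst = np.minimum((f(x) * n_bins).astype(int), n_bins - 1)
    counts = np.zeros((n_bins, n_bins))
    np.add.at(counts, (src, dst), 1.0)  # accumulate repeated (src, dst) pairs
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_sums, 1.0)

def doubling_map(x):
    """Illustrative dynamical system: x -> 2x mod 1."""
    return (2.0 * x) % 1.0

# Transfer a density (stored as bin masses) one step forward: p_next = p @ P.
P = ulam_matrix(doubling_map)
p0 = np.zeros(20)
p0[:10] = 0.1          # all mass spread uniformly over [0, 0.5)
p1 = p0 @ P            # approximately uniform over [0, 1)
```

The resolution of the transferred density is capped by the number of bins, which is the limitation that motivates the kernel-embedded form.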
Kernel-embedded form of PF operator:
A natural extension of the PF operator is to represent $\mathcal{P}$ by an infinite set of functions (Klus et al., 2020), e.g., projecting it onto the bases of an RKHS via the kernel trick. There, for a characteristic kernel $l$, the kernel mean embedding uniquely identifies an element $\mu_X = \mathcal{E}_l p_X \in \mathcal{G}$ for any $p_X \in L^1(\mathcal{X})$. Thus, to approximate $\mathcal{P}$, we may alternatively solve for the dynamics from $p_Z$ to $p_X$ in their embedded form. Using Tab. 1 notations, we have the following linear operator that defines the dynamics between two embedded densities.
Definition 3.2 (Kernel-embedded Perron-Frobenius operator (Klus et al., 2020)).
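Given $p_Z \in L^1(\mathcal{Z})$ and $p_X \in L^1(\mathcal{X})$. Denote $k$ as the input kernel and $l$ as the output kernel. Let $\mu_Z = \mathcal{E}_k p_Z$ and $\mu_X = \mathcal{E}_l p_X$ be their corresponding kernel mean embeddings. The kernel-embedded Perron-Frobenius (kPF) operator, denoted by $\mathcal{P}_{\mathcal{E}}: \mathcal{H} \to \mathcal{G}$, is defined as
$$\mathcal{P}_{\mathcal{E}} = \mathcal{C}_{XZ}\, \mathcal{C}_{ZZ}^{-1}.$$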
Proposition 3.1 (Song et al. (2013)).
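With the above definition, $\mathcal{P}_{\mathcal{E}}$ satisfies
$$\mu_X = \mathcal{P}_{\mathcal{E}}\, \mu_Z$$
under the conditions: (i) $\mathcal{C}_{ZZ}$ is injective; (ii) $\mu_Z \in \mathrm{range}(\mathcal{C}_{ZZ})$; (iii) $\mathbb{E}[g(X) \mid Z = \cdot] \in \mathcal{H}$ for any $g \in \mathcal{G}$.
The last two assumptions can sometimes be difficult to satisfy for certain RKHS (see Theorem 2 of Fukumizu et al. (2013)). In such cases, a relaxed solution can be constructed by replacing $\mathcal{C}_{ZZ}^{-1}$ by a regularized inverse $(\mathcal{C}_{ZZ} + \lambda I)^{-1}$ or a Moore-Penrose pseudoinverse $\mathcal{C}_{ZZ}^{\dagger}$.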
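The following proposition shows commutativity between the (kernel-embedded) PF operator and the mean embedding operator, showing the equivalence of $\mathcal{P}_{\mathcal{E}}$ to $\mathcal{P}$ when $l$ is characteristic.
Proposition 3.2 (Klus et al., 2020).
With the above notations, $\mathcal{E}_l \circ \mathcal{P} = \mathcal{P}_{\mathcal{E}} \circ \mathcal{E}_k$.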
Transferring embedded densities with the kPF operator:
The kPF operator is a powerful tool that allows transferring embedded densities in RKHS. The main steps are:
(1) Use the mean embedding operator $\mathcal{E}_k$ on $p_Z$. Let us denote the result by $\mu_Z$.
(2) Transfer $\mu_Z$ using the kPF operator $\mathcal{P}_{\mathcal{E}}$ to get the mean embedding of $p_X$, given by $\mu_X$.
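Of course, in practice with finite data $\{x_i\}_{i=1}^n$ and $\{z_i\}_{i=1}^n$, $\mathcal{P}_{\mathcal{E}}$ must be estimated empirically (see Klus et al. (2020) for an error analysis):
$$\hat{\mathcal{P}}_{\mathcal{E}} = \hat{\mathcal{C}}_{XZ} (\hat{\mathcal{C}}_{ZZ} + \lambda I)^{-1} = \Psi (\Phi^\top \Phi + \lambda n I)^{-1} \Phi^\top,$$
where $\Phi = [\phi(z_1), \ldots, \phi(z_n)]$ and $\Psi = [\psi(x_1), \ldots, \psi(x_n)]$ are simply the corresponding feature matrices for samples of $Z$ and $X$, and $\lambda$ is a small penalty term.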
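Because the feature matrices are never formed explicitly, the empirical operator is applied through Gram matrices. The sketch below assumes the regularized form $\Psi (K_{ZZ} + \lambda n I)^{-1} \Phi^\top$ with a Gaussian input kernel; applying it to $\phi(z^*)$ yields expansion coefficients over the training features $\psi(x_i)$. The function names and default values are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix with entries exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def kpf_coefficients(z_train, z_new, lam=1e-3, bandwidth=1.0):
    """Coefficients W such that the transferred sample for z_new[j] is Psi @ W[:, j].

    W = (K_ZZ + lam * n * I)^{-1} k_Z(z_new), where K_ZZ is the Gram matrix of
    the prior samples and k_Z(z_new) holds kernel values against the new points.
    """
    n = z_train.shape[0]
    gram_zz = rbf_kernel(z_train, z_train, bandwidth)
    cross = rbf_kernel(z_train, z_new, bandwidth)
    # Solve the linear system instead of forming an explicit inverse.
    return np.linalg.solve(gram_zz + lam * n * np.eye(n), cross)
```

The only training cost is one linear solve against an $n \times n$ matrix, which is where the closed-form efficiency comes from; the same matrix can be factorized once and reused for every new batch of prior samples.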
Learning kPF for unconditional generative modeling:
Some generative modeling methods such as VAEs and flow-based formulations explicitly model the latent variable $Z$ as conditionally dependent on the data variable $X$. This allows deriving/optimizing the likelihood $p_{f(Z)}(X)$. This is desirable but may not be essential in all applications. To learn a kPF, however, $X$ and $Z$ can be independent RVs. While it may not be immediately obvious why we could assume this independence, we can observe the following property for the empirical kPF operator, assuming that the empirical covariance operator $\hat{\mathcal{C}}_{ZZ}$ is non-singular:
$$\hat{\mathcal{P}}_{\mathcal{E}}\, \hat{\mu}_Z = \Psi (\Phi^\top \Phi)^{-1} \Phi^\top \cdot \tfrac{1}{n} \Phi \mathbf{1}_n = \tfrac{1}{n} \Psi \mathbf{1}_n = \hat{\mu}_X. \tag{7}$$
Suppose that $\{x_i\}_{i=1}^n$ and $\{z_j\}_{j=1}^n$ are independently sampled from the marginals $P_X$ and $P_Z$. It is easy to verify that (7) holds for any pairing $\{(x_i, z_j)\}$. However, instantiating the RVs in this way rules out the use of kPF for certain downstream tasks such as controlled generation or mode detection, since $Z$ does not contain information regarding $X$. Nevertheless, if sampling is our only goal, then this instantiation of kPF will suffice.
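Mapping $Z$ to $\mathcal{G}$:
Now, since $\mathcal{P}_{\mathcal{E}}$ is a deterministic linear operator, we can easily set up a scheme to map samples of $Z$ to elements of $\mathcal{G}$ where the expectation of the mapped samples equals $\mu_X$.
Define $\phi(z) = k(z, \cdot)$ and $\psi(x) = l(x, \cdot)$ as feature maps of kernels $k$ and $l$. We can rewrite $\mu_X$ as
$$\mu_X = \mathcal{P}_{\mathcal{E}}\, \mathbb{E}_Z[\phi(Z)] = \mathbb{E}_Z[\mathcal{P}_{\mathcal{E}}\, \phi(Z)] = \mathbb{E}_Z\big[\psi\big(\psi^{-1}(\mathcal{P}_{\mathcal{E}}\, \phi(Z))\big)\big]. \tag{8}$$
Here $\psi^{-1}$ is the inverse or the preimage map of $\psi$. Such an inverse, in general, may not exist (Kwok and Tsang, 2004; Honeine and Richard, 2011). We will discuss a procedure to approximate $\psi^{-1}$ in §4.1. In what follows, we will temporarily assume that an exact preimage map exists and is tractable to compute.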
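Define $\Psi^* = \hat{\mathcal{P}}_{\mathcal{E}}\, \phi(Z)$ as the transferred sample in $\mathcal{G}$ using the empirical embedded PF operator $\hat{\mathcal{P}}_{\mathcal{E}}$. Then the next result shows that asymptotically the transferred samples converge in distribution to the target distribution.
Proposition 3.3.
As $n \to \infty$, $\psi^{-1}(\Psi^*) \xrightarrow{D} P_X$. That is, the preimage of the transferred sample approximately conforms to $P_X$ under previous assumptions when $n$ is large.
Proof. Since $\hat{\mathcal{P}}_{\mathcal{E}} \to \mathcal{P}_{\mathcal{E}}$ as $n \to \infty$, the proof immediately follows from (8). ∎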
4 Sample generation using the Kernel transfer operator
At this point, the transferred sample $\Psi^*$, obtained by the kPF operator, remains an element of RKHS $\mathcal{G}$. To translate the samples back to the input space, we must find the preimage $x^*$ such that $\psi(x^*) = \Psi^*$.
4.1 Solving for an approximate preimage
Solving the preimage in kernel-based methods is known to be ill-posed (Mika et al., 1999) because the mapping $\psi(\cdot)$ is not necessarily surjective, i.e., a unique preimage may not exist. Often, an approximate preimage is constructed instead based on relational properties among the training data in the RKHS. We consider two options in our framework: (1) the MDS-based method (Kwok and Tsang, 2004; Honeine and Richard, 2011), which optimally preserves the distances in RKHS to the preimages in the input space, and (2) the weighted Fréchet mean (Friedman et al., 2001), which in Euclidean space takes the form
$$\hat{x} = \frac{\sum_{i \in \mathcal{N}} s_i\, x_i}{\sum_{i \in \mathcal{N}} s_i},$$
where $\mathcal{N}$ indexes a neighborhood of training samples based on pairwise distance or similarity in RKHS, following (Kwok and Tsang, 2004). The weighted Fréchet mean preimage uses the inner product weights $s_i = \langle \Psi^*, \psi(x_i) \rangle_{\mathcal{G}}$ as measures of similarity to interpolate training samples. On the toy data (as in Fig. 2), the weighted Fréchet mean produces fewer samples that deviate from the true distribution and is easier to compute. Based on this observation, we use the weighted Fréchet mean as the preimage module for all experiments that require samples, while acknowledging that other preimage methods can also be substituted in.
With all the ingredients in hand, we now present an algorithm for sample generation using the kPF operator in Alg. 1. The idea is simple yet powerful: at training time, we construct the empirical kPF operator $\hat{\mathcal{P}}_{\mathcal{E}}$ using the training data $\{x_i\}$ and samples of the known prior $\{z_i\}$. At test time, we transfer new points sampled from $P_Z$ to feature maps in $\mathcal{G}$, and construct their preimages as the generated output samples.
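The train/test procedure just described can be sketched end to end in a few lines. This is an illustrative Euclidean version under several assumptions that are not confirmed by the text: Gaussian kernels for both input and output, a neighborhood chosen as the `gamma` training points with the largest similarity, and negative similarities clipped to zero so that the weighted mean stays a convex combination. It is not a transcription of Alg. 1, and all names and defaults are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix with entries exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def generate_samples(x_train, z_train, z_new, lam=1e-3, gamma=10,
                     bw_in=1.0, bw_out=1.0):
    """Transfer prior samples z_new through an empirical kPF and take preimages."""
    n = x_train.shape[0]
    # Training: regularized Gram system for the prior samples.
    gram_zz = rbf_kernel(z_train, z_train, bw_in)
    coeffs = np.linalg.solve(gram_zz + lam * n * np.eye(n),
                             rbf_kernel(z_train, z_new, bw_in))   # shape (n, m)
    # Similarities s[i, j] = <transferred sample j, psi(x_i)> in the output RKHS.
    sims = rbf_kernel(x_train, x_train, bw_out) @ coeffs           # shape (n, m)
    out = np.empty((z_new.shape[0], x_train.shape[1]))
    for j in range(z_new.shape[0]):
        idx = np.argsort(sims[:, j])[-gamma:]          # most similar training points
        weights = np.clip(sims[idx, j], 0.0, None)     # assumption: drop negative weights
        if weights.sum() <= 0.0:
            weights = np.ones(len(idx))                # fall back to a plain average
        out[j] = weights @ x_train[idx] / weights.sum()
    return out
```

Since each output is a convex combination of training points, generated samples cannot leave the convex hull of the training set.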
4.2 Image generation
Image generation is a common application for generative models (Goodfellow et al., 2014; Dinh et al., 2017). While our proposal is not image specific, constructing sample preimages in a high dimensional space with limited training samples can be challenging, since the space of images is usually not dense in a reasonably sized neighborhood. However, empirically images often lie near a low dimensional manifold in the ambient space (Seung and Lee, 2000), and one may utilize an autoencoder (AE) to embed the images onto a latent space that represents coordinates on a learned manifold. If the learned manifold lies close to the true manifold, we can learn densities on the manifold directly (Dai and Wipf, 2019).
Therefore, for image generation tasks, the training data is first projected onto the latent space of a pretrained AE. Then, the operator will be constructed using the projected latent representations, and samples will be mapped back to image space with the decoder of the AE. Our setup can be viewed analogously to other generative methods based on so-called “ex-post” density estimation of latent variables (Ghosh et al., 2020). We also restrict the AE latent space to a hypersphere to ensure that (a) $k$ and $l$ are bounded and (b) the space is geodesically convex and complete, which is required by the preimage computation. To compute the weighted Fréchet mean on a hypersphere, we adopt the recursive algorithm in Chakraborty and Vemuri (2015) (see Appendix D for details).
5 Experimental Results
Goals. In our experiments, we seek to answer three questions: (a) With sufficient data, can our method generate new data with comparable performance with other state-of-the-art generative models? (b) If only limited data samples were given, can our method still estimate the density with reasonable accuracy? (c) What are the runtime benefits, if any?
Datasets/setup. To answer the first question, we evaluate our method on standard vision datasets, including MNIST, CIFAR10, and CelebA, where the number of data samples is much larger than the latent dimension. We compare our results with other VAE variants (Two-stage VAE (Dai and Wipf, 2019), WAE (Arjovsky et al., 2017), CV-VAE (Ghosh et al., 2020)) and flow-based generative models (Glow (Kingma and Dhariwal, 2018), CAGlow (Liu et al., 2019)). The second question is due to the broad use of kernel methods in small sample size settings. For this more challenging case, we randomly choose 100 training samples (% of the full dataset) from CelebA and evaluate the quality of generation compared to other density approximation schemes. We also use a dataset of T1 Magnetic Resonance (MR) images from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study.
Distribution transfer with many data samples. We evaluate the quality by calculating the Fréchet Inception Distance (FID) (Heusel et al., 2017) with 10K generated images from each model. Here, we use a pretrained regularized autoencoder (Ghosh et al., 2020) with a latent space restricted to the hypersphere (denoted by SRAE) to obtain smooth latent representations. We compare our kPF to competitive end-to-end deep generative baselines (i.e., flow and VAE variants) as well as other density estimation models over the same SRAE latent space. For the latent space models, we experimented with Glow (Kingma and Dhariwal, 2018), VAE, Gaussian mixture model (GMM), and two proposed kPF operators with Gaussian kernel (RBF-kPF) and NTK (NTK-kPF) as the input kernel. The use of NTK is motivated by promising results at the interface of kernel methods and neural networks (Jacot et al., 2018; Arora et al., 2020). Implementation details are included in the appendix.
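For readers unfamiliar with the metric, FID is the Fréchet distance between two Gaussians fitted to feature vectors of real and generated images. The sketch below computes that distance with NumPy only, assuming the features (Inception activations in the standard FID setup) have already been extracted into arrays; the function name is illustrative, and this is not the evaluation code used for the reported numbers.

```python
import numpy as np

def frechet_distance(feat_a, feat_b):
    """Frechet distance between Gaussians fitted to two sets of feature vectors.

    d^2 = ||mu_a - mu_b||^2 + Tr(C_a) + Tr(C_b) - 2 Tr((C_a C_b)^{1/2})
    """
    mu_a, mu_b = feat_a.mean(axis=0), feat_b.mean(axis=0)
    cov_a = np.cov(feat_a, rowvar=False)
    cov_b = np.cov(feat_b, rowvar=False)
    # For PSD covariances the eigenvalues of C_a C_b are real and non-negative,
    # so the trace of the matrix square root is the sum of their square roots.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = float(np.sum((mu_a - mu_b) ** 2))
    return mean_term + float(np.trace(cov_a) + np.trace(cov_b)) - 2.0 * float(trace_sqrt)
```

Lower values indicate that the generated features are statistically closer to the real ones.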
Comparative results are shown in Table 2. We see that for images with structured feature spaces, e.g., MNIST and CelebA, our method matches other non-adversarial generative models, which provides evidence in support of the premise that the forward operator can be simplified.
Further, we present qualitative results on all datasets (in Fig. 3), where we compare our kPF operator based model with other density estimation techniques on the latent space. Observe that our model generates comparable visual results as .
Since kPF learns the distribution on a pre-trained AE latent space for image generation, using a more powerful AE can offer improvements in generation quality. In Fig. 4, we present representative images by learning our kPF on the NVAE (Vahdat and Kautz, 2020) latent space, pre-trained on the FFHQ dataset. NVAE builds a hierarchical prior and achieves state-of-the-art generation quality among VAEs. We see that kPF can indeed generate high-quality and diverse samples with the help of the NVAE encoder/decoder. In fact, any AE/VAE may be substituted in, assuming that the latent space is smooth.
Summary: When a sufficient number of samples are available, our algorithm performs as well as the alternatives, which is attractive given the efficient training. In Fig. 5, we present comparative results of FIDs with respect to the training time. Since kPF can be computed in closed form, it achieves a significant training efficiency gain compared to other deep generative methods while delivering competitive generative quality.
Distribution transfer with limited data samples. Next, we present our evaluations when only a limited number of samples are available. Here, each of the density estimators was trained on latent representations of the same set of 100 randomly sampled CelebA images, and 10K images were generated to evaluate FID (see Table 3). Our method outperforms Glow and VAE, while offering competitive performance with GMM. Surprisingly, GMM remains a strong baseline for both tasks, which agrees with results in Ghosh et al. (2020). However, note that GMM is restricted by its parametric form and is less flexible than our method (as shown in Fig 2).
Learning from few samples is common in biomedical applications where acquisition is costly. Motivated by interest in making synthetic but statistics-preserving data (rather than the real patient records) publicly available to researchers (see NIH N3C Data Overview), we present results on generating high-resolution brain images: samples from group AD (diagnosed as Alzheimer’s disease) and samples from group CN (control normals). For , using our kernel operator, we can generate high-quality samples that are in-distribution. We present comparative results with VAEs. The generated samples in Fig. 6 clearly show that our method generates sharper images. To check if the results are also scientifically meaningful, we test consistency between group difference testing (i.e., cases versus controls differential tests on each voxel) on the real images (groups were AD and CN) and the same test performed on the generated samples (AD and CN groups), using a FWER-corrected two-sample $t$-test (Ashburner and Friston, 2000). The results (see Fig. 6) show that while there is a deterioration in regions identified to be affected by disease (different across groups), many statistically significant regions from tests on the real images are preserved in voxel-wise tests on the generated images.
Summary: We achieve improvements in the small sample size setting compared to other generative methods. This is useful in many data-poor settings. For larger datasets, our method still compares competitively with alternatives, but with a smaller resource footprint.
Our proposed simplifications can be variously useful, but deriving the density of the posterior given a mean embedding or providing an exact preimage for the generated sample in RKHS is unresolved at this time. While density estimation from kPF has been partially addressed in Schuster et al. (2020b), finding the preimage is often ill-posed. The weighted Fréchet mean preimage only provides an approximate solution and is evaluated empirically, and due to the interpolation-based sampling strategy, samples cannot be obtained beyond the convex hull of training examples. Making $X$ and $Z$ independent RVs also limits its use for certain downstream tasks such as representation learning or semantic clustering. Finally, like other kernel-based methods, the super-quadratic memory/compute cost can be a bottleneck on large datasets and kernel approximation (e.g., Rahimi and Recht, 2009) may have to be applied; we discuss this in Appendix F.
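As one example of the kind of kernel approximation mentioned above, the sketch below builds random Fourier features for a Gaussian kernel, so that an $n \times n$ Gram matrix can be replaced by an $n \times D$ feature matrix with $D \ll n$. This is a generic illustration of the random-features idea under assumed names and defaults; it is not claimed to be the scheme described in Appendix F.

```python
import numpy as np

def random_fourier_features(x, n_features=2000, bandwidth=1.0, seed=0):
    """Map x of shape (n, d) to features whose inner products approximate
    the Gaussian kernel exp(-||x - y||^2 / (2 * bandwidth^2))."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Frequencies are drawn from the kernel's spectral density, N(0, I / bandwidth^2).
    freqs = rng.normal(scale=1.0 / bandwidth, size=(d, n_features))
    phases = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x @ freqs + phases)
```

With explicit finite-dimensional features, the regularized solve involves a $D \times D$ matrix instead of an $n \times n$ one, at the price of an approximation error that shrinks as $D$ grows. The same seed must be used for every set of points that will be compared.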
We show that using recent developments in regularized autoencoders, a linear kernel transfer operator can potentially be an efficient substitute for the forward operator in some generative models, if some compromise in capabilities/performance is acceptable. Our proposal, despite its simplicity, shows comparable empirical results to other generative models, while offering efficiency benefits. Results on brain images also show promise for applications to high-resolution 3D imaging data generation, which is being pursued actively in the community.
- Wasserstein generative adversarial networks. International Conference on Machine Learning, pp. 214–223. Cited by: §1, §5.
- Theory of reproducing kernels. Transactions of the American Mathematical Society 68 (3), pp. 337–404. Cited by: Definition 2.1.
- On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8141–8150. Cited by: item c, item d.
- Harnessing the power of infinitely wide deep nets on small-data tasks. In International Conference on Learning Representations. Cited by: §5.
- Voxel-based morphometry—the methods. Neuroimage 11 (6), pp. 805–821. Cited by: §5.
- Invertible residual networks. In International Conference on Machine Learning, pp. 573–582. Cited by: §3.
- Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences 113 (15), pp. 3932–3937. Cited by: Appendix A.
- Recursive Fréchet mean computation on the Grassmannian and its applications to computer vision. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4229–4237. Cited by: Appendix D, item b.
- . In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 6571–6583. Cited by: §1, §3.
- Diagnosing and enhancing VAE models. In International Conference on Learning Representations. Cited by: §1, §4.2, Figure 3, §5.
- Density estimation using Real NVP. In International Conference on Learning Representations. Cited by: §1, §4.2.
- . In Applied Soft Computing Technologies: The Challenge of Complexity, A. Abraham, B. de Baets, M. Köppen, and B. Nickolay (Eds.), Berlin, Heidelberg, pp. 425–438. Cited by: §J.2.
- The elements of statistical learning. Vol. 1, Springer Series in Statistics, New York. Cited by: §4.1.
- Kernel Bayes’ rule: Bayesian inference with positive definite kernels. The Journal of Machine Learning Research 14 (1), pp. 3753–3783. Cited by: Appendix A, §2, §3.
- From variational to deterministic autoencoders. In International Conference on Learning Representations. Cited by: §J.2, §4.2, Table 2, §5, §5, §5.
- Generative adversarial nets. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Eds.), Vol. 27, pp. 2672–2680. Cited by: §1, §4.2.
- Scalable reversible generative models with free-form continuous dynamics. In International Conference on Learning Representations. Cited by: §3.
- A kernel two-sample test. The Journal of Machine Learning Research 13 (1), pp. 723–773. Cited by: §2.
- The elements of statistical learning. Springer Series in Statistics, Springer New York Inc., New York, NY, USA. Cited by: Appendix C.
- Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §J.3.
- GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 6626–6637. Cited by: §5.
- Flow++: improving flow-based generative models with variational dequantization and architecture design. In International Conference on Machine Learning, pp. 2722–2730. Cited by: §1.
- Preimage problem in kernel-based machine learning. IEEE Signal Processing Magazine 28 (2), pp. 77–88. Cited by: §3, §4.1.
- Neural tangent kernel: convergence and generalization in neural networks. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 8571–8580. Cited by: item b, Appendix A, §5.
- Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. Cited by: §1.
- Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 10215–10224. Cited by: §1, Figure 3, §5, §5.
- Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 4743–4751. Cited by: §1.
- Eigendecompositions of transfer operators in reproducing kernel Hilbert spaces. Journal of Nonlinear Science 30 (1), pp. 283–315. Cited by: §3, §3, Definition 3.2, Proposition 3.2.
- The pre-image problem in kernel methods. IEEE Transactions on Neural Networks 15 (6), pp. 1517–1525. Cited by: Appendix C, §3, §4.1, §4.1.
- Nonlinear Perron-Frobenius theory. Vol. 189, Cambridge University Press. Cited by: §1.
- MMD GAN: towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30, pp. 2203–2213. Cited by: §1, §2.
- Conditional adversarial generative flow for controllable image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: Table 2, §5.
- The Ruelle-Araki transfer operator in classical statistical mechanics. Cited by: Definition 3.1.
- Kernel PCA and de-noising in feature spaces. In Advances in Neural Information Processing Systems, M. Kearns, S. Solla, and D. Cohn (Eds.), Vol. 11, pp. 536–542. Cited by: §4.1.
- . Cited by: §2.
- Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou (Eds.), Vol. 21. Cited by: §6.
- A new iterative method for finding approximate inverses of complex matrices. Abstract and Applied Analysis 2014, pp. 563787. Cited by: Appendix E.
- Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31. Cited by: Appendix H.
- Kernel conditional density operators. Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, S. Chiappa and R. Calandra (Eds.), Proceedings of Machine Learning Research, Vol. 108, pp. 993–1004. Cited by: Appendix B.
- Kernel conditional density operators. S. Chiappa and R. Calandra (Eds.), Proceedings of Machine Learning Research, Vol. 108, Online, pp. 993–1004. Cited by: §6.
- The manifold ways of perception. Science 290 (5500), pp. 2268–2269. Cited by: §4.2.
- A Hilbert space embedding for distributions. In Algorithmic Learning Theory, M. Hutter, R. A. Servedio, and E. Takimoto (Eds.), Berlin, Heidelberg, pp. 13–31. Cited by: Definition 2.2.
- Kernel embeddings of conditional distributions: a unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine 30 (4), pp. 98–111. Cited by: §2, Proposition 3.1.
- VEEGAN: reducing mode collapse in GANs using implicit variational learning. arXiv preprint arXiv:1705.07761. Cited by: §1.
- Wasserstein auto-encoders. In International Conference on Learning Representations. Cited by: §J.2.
- A collection of mathematical problems. New York 29. Cited by: §3.
- NVAE: a deep hierarchical variational autoencoder. In Neural Information Processing Systems (NeurIPS). Cited by: §J.2, §5.
- Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, T. Leen, T. Dietterich, and V. Tresp (Eds.), Vol. 13. Cited by: Appendix F.
- A data-driven approximation of the Koopman operator: extending dynamic mode decomposition. Journal of Nonlinear Science 25 (6), pp. 1307–1346. Cited by: §3.
- A data-driven approximation of the Koopman operator: extending dynamic mode decomposition. Journal of Nonlinear Science 25 (6). Cited by: Appendix A.
- Nyströmformer: a Nyström-based algorithm for approximating self-attention. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pp. 14138–14148. Cited by: Appendix E.
- Statistical tests and identifiability conditions for pooling and analyzing multisite datasets. Proceedings of the National Academy of Sciences 115 (7), pp. 1481–1486. Cited by: §2.
Appendix A Choice of Kernel is relevant yet flexible
In some cases, one would focus on identifying (finite) eigenfunctions and modes of the underlying operator (Williams et al., 2015b; Brunton et al., 2016). Rather than finding the modes that best characterize the dynamics, we care most about minimizing the error of the transferred density, and therefore about whether the span of functions is rich/expressive enough. In particular, condition (iii) in Proposition 3.2 requires an input RKHS spanned by sufficiently rich bases (Fukumizu et al., 2013). For this reason, the choice of kernel here cannot be completely ignored, since it determines the family of functions contained in the induced RKHS.
To explore the appropriate kernel setup for our application, we empirically evaluate the effect of using several different kernels via a simple experiment on MNIST. We first train an autoencoder to embed MNIST digits onto a hypersphere , then generate samples from kPF by the procedure described in Alg. 1 using the respective kernel function as the input kernel . Subplots (b) and (c) in Fig. 7 show the generated samples using the Radial Basis Function (RBF) kernel and the arc-cosine kernel, respectively. Observe that the choice of kernel has an influence on the sample population, and a kernel function with superior empirical behavior is desirable.
Motivated by this observation, we evaluated the Neural Tangent Kernel (NTK) (Jacot et al., 2018), a well-studied neural kernel in recent works. We use it for the following reasons:
(a) NTK, in theory, corresponds to a trained infinitely-wide neural network, which spans a rich set of functions that satisfies the assumption.
(b) For well-conditioned inputs (i.e., no duplicates) on the hypersphere, the positive-definiteness of NTK is proved in (Jacot et al., 2018). Therefore, invertibility of the Gram matrix is almost guaranteed if the prior distribution is restricted to a hypersphere.
(c) NTK can be non-asymptotically approximated (Arora et al., 2019).
(d) Unlike other parametric kernels such as RBF kernels, NTK is less sensitive to hyperparameters, as long as the number of units used is large enough (Arora et al., 2019).
Subplot (d) of Fig. 7 shows that kPF learned with NTK as the input kernel is able to generate samples that are more consistent with the data distribution. However, we should note that NTK is merely a convenient choice of kernel that requires less tuning, and is not otherwise central to our work. In fact, as shown in our experiment in Tab. 2, a well-tuned RBF kernel may also achieve similar performance. Indeed, in practice, any suitable choice of kernel may be conveniently adopted into the proposed framework without major modifications.
Appendix B Density Estimation with kPF Operator
The displayed transferred density with the kPF operator on toy data in Fig. 1 is approximated using the empirical kernel conditional density operator (CDO) (Schuster et al., 2020a), since there are currently no known methods that can exactly reconstruct a density from the transferred mean embeddings. The marginalized transferred density has the following form
where is the covariance operator of an independent reference density in . The above density function is also an element of RKHS , and therefore we can evaluate the density at by using the reproducing property . The results in Schuster et al. (2020a) show that the empirical estimate of may be constructed from samples of the reference density and training samples and as
In Fig. 8, we use a uniform density in the square as the reference density and construct using samples from . Due to the form of the empirical kernel CDO, where the estimated density function is a linear combination of (as in Eq. 12), the approximation can be inaccurate if reference samples are relatively sparse and the densities are ‘sharp’. In those cases, to obtain a better density estimate, we may either increase the number of reference samples used to construct the empirical CDO (which can be computationally difficult due to the need to compute ), or, with some prior knowledge of the true distribution, choose a reference density which is localized around the ground truth density.
Therefore, to show a more faithful density estimate of the transferred distribution for visualization purposes, we use a composite of the uniform density and the true density with weight as the reference density in Fig. 2. In this case, approximately 20% of the reference samples concentrate around the high-density areas of the true density, which helps to form a better basis for . Note that the choice of does not affect the transferred density embedding since it is independent of and . After this modification, the reconstructed density reflects the true density more accurately than GMM and GLOW, indicating that the distribution transferred by kPF in the RKHS better matches the ground truth distribution. This is also reflected in the generated samples in Fig. 9, where samples generated by the kPF operator are clearly more aligned with the ground truth distribution.
Appendix C Effect of on Sample Quality
In the sampling stage, our proposed method finds the approximate preimage of the transferred kernel embeddings by taking the weighted Fréchet mean of the top neighbors among the training samples. The choice of therefore influences the quality of generation.
From Figure 10, we can observe that, in general, FID worsens as increases. This observation aligns with our intuition of preserving only the local similarities represented by the kernel, and similar ideas have been previously used in the literature (Hastie et al., 2001; Kwok and Tsang, 2004). However, significantly decreasing leads to the undesirable result where the generator merely generates the training samples (in the extreme case where , generated samples will just be reconstructions of training samples). Therefore, in our experiments, we choose to achieve a balance between generation quality and the distance to training samples.
Appendix D Weighted Fréchet Mean on the Hypersphere
While the weighted Fréchet Mean in Euclidean space can be computed in closed-form as a weighted arithmetic mean (as in Eq. 10), on the hypersphere there is no known closed-form solution. Thus, we adopt the iterative algorithm in (Chakraborty and Vemuri, 2015) for an approximate solution given data points
and weight vector:
where, . This algorithm iterates through the data points once, yielding a complexity of only , where is the dimension of . Under the prescribed iteration, converges asymptotically to the true Fréchet mean for finite data points. We refer the readers to (Chakraborty and Vemuri, 2015) for further details.
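The one-pass recursion described above can be sketched as follows. This is a minimal illustration, assuming the standard incremental form in which the running estimate is moved along the geodesic toward each new point by the fraction of total weight that the point carries; the function names are hypothetical, weights are assumed positive, and antipodal pairs of points (where the geodesic is not unique) are not handled.

```python
import numpy as np

def geodesic_step(m, x, t):
    """Point at fraction t along the great-circle arc from m to x (unit vectors)."""
    cos_theta = np.clip(np.dot(m, x), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-12:
        # The two points coincide numerically; nothing to interpolate.
        return m.copy()
    out = (np.sin((1.0 - t) * theta) * m + np.sin(t * theta) * x) / np.sin(theta)
    return out / np.linalg.norm(out)

def weighted_frechet_mean_sphere(points, weights):
    """Single pass over the points; each step costs O(d)."""
    m = points[0] / np.linalg.norm(points[0])
    w_sum = float(weights[0])
    for x, w in zip(points[1:], weights[1:]):
        w_sum += float(w)
        # Move toward the new point by its share of the accumulated weight.
        m = geodesic_step(m, x / np.linalg.norm(x), float(w) / w_sum)
    return m
```

For two orthogonal unit vectors with equal weights, the estimate is the midpoint of the connecting arc, which matches the intuition of an average on the sphere.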
Appendix E Fast approximation of Moore-Penrose inverse
When computing the inverted kernel matrix in Alg. 1, conventional approaches typically perform SVD or Cholesky decomposition. Both procedures are hard to parallelize, and therefore can be slow when is large. Alternatively, we can utilize an iterative procedure proposed in Razavi et al. (2014) to approximate the Moore-Penrose inverse.
Since this iterative procedure mostly involves matrix multiplications, it can be efficiently parallelized and implemented on GPU. The same procedure has also seen success in approximating large self-attention matrices in language modeling (Xiong et al., 2021). For the NVAE experiment, we run this iteration for steps and use .
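As an illustration of such a multiplication-only scheme, the sketch below uses the third-order iteration as it appears in the Nyströmformer work, Z ← ¼ Z (13I − AZ (15I − AZ (7I − AZ))), with the initialization Z₀ = Aᵀ / (‖A‖₁ ‖A‖∞). We assume this is the form intended here; the function name and the iteration count are hypothetical choices for the example, not the settings used in the experiments.

```python
import numpy as np

def iterative_pinv(A, n_iter=20):
    """Approximate the Moore-Penrose inverse using only matrix products."""
    A = np.asarray(A, dtype=float)
    I = np.eye(A.shape[0])
    # Scaled transpose as the starting point; the scaling keeps the
    # eigenvalues of A @ Z inside (0, 1], which the iteration needs.
    Z = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    for _ in range(n_iter):
        AZ = A @ Z
        Z = 0.25 * Z @ (13.0 * I - AZ @ (15.0 * I - AZ @ (7.0 * I - AZ)))
    return Z
```

With E = I − AZ, one step maps E to (3E³ + E⁴)/4, so the residual shrinks cubically once it is small. Poorly conditioned matrices need more iterations before that regime is reached.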
Appendix F Nyström Approximation of kPF
Due to the need to store and compute a kernel matrix inverse or , the memory and computational complexity of kPF is at least and , respectively. The super-quadratic complexity hinders the use of kPF on extremely large datasets. In our experiments, we already adopted a simple subsampling strategy that randomly selects 10k training samples from each dataset ( 50k samples) to fit our hardware configuration, which works well. But for larger datasets with potentially more modes, a larger set of subsamples must be considered, and in those cases kPF may not be suitable for commodity/affordable hardware. To overcome this problem, we can combine kPF with conventional kernel approximation methods such as the Nyström method (Williams and Seeger, 2001).
Let be a size subset of the training set (which we refer to as the landmark points) and be their corresponding kernel feature maps. The weighting coefficients for each prior sample derived in Alg. 1 can be approximated by
where , , , , and the last identity is due to applying the Woodbury formula on . Assuming , the memory complexity is reduced to and the computation complexity to .
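To make the role of the Woodbury formula concrete, the sketch below solves a ridge-regularized system with a Nyström-approximated kernel matrix without ever forming an n×n matrix. The regularizer `lam`, the RBF kernel, and all function names are assumptions made for the example; the exact expression used for the kPF weights may differ. With C the n×m kernel block between all points and the landmarks and W the m×m landmark block, the identity used is (λI + C W⁻¹ Cᵀ)⁻¹ = λ⁻¹ [I − C (λW + CᵀC)⁻¹ Cᵀ].

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth):
    """Gaussian kernel matrix between the rows of X and the rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def nystrom_solve(X, landmark_idx, v, bandwidth=1.0, lam=1e-2):
    """Apply (lam*I + C W^{-1} C^T)^{-1} to v via the Woodbury identity."""
    C = rbf_kernel(X, X[landmark_idx], bandwidth)  # n x m
    W = C[landmark_idx]                            # m x m landmark block
    inner = lam * W + C.T @ C                      # only an m x m system
    return (v - C @ np.linalg.solve(inner, C.T @ v)) / lam
```

Only the n×m block C is stored, and the dominant cost is forming CᵀC, which is where the reduction in memory and computation comes from when m is much smaller than n.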
We empirically evaluated the Nyström-approximated kPF on the CelebA experiment and present the result in Tab. 4. It can be observed that when is sufficiently large, the performance of Nyström-approximated kPFs is as good as that of the ones using the full kernel matrices.
Appendix G Does kPF Memorize Training Data?
Since in kPF, samples are generated by linearly interpolating between training samples, it is natural to wonder whether it ‘fools’ the metrics by simply replicating training samples. For comparison, we consider an alternative scheme that generates data through direct manipulation of the training data, namely Kernel Density Estimation (KDE).
We fit KDEs by varying noise levels and compare their FIDs and nearest samples in the latent space to kPF in Fig. 11. We observe that, although KDE can reach very low FIDs when is small, almost all new samples closely resemble some instance in the training set, which is a clear indication of memorization. In contrast, kPF can generate diverse samples that do not simply replicate the observed data.
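The KDE baseline and the nearest-sample check can be sketched as below. We assume a Gaussian KDE, for which sampling amounts to picking a training latent uniformly at random and adding isotropic noise of scale `sigma`; the function names are hypothetical, and any projection back onto the hypersphere is omitted.

```python
import numpy as np

def kde_sample(train, sigma, n_samples, rng):
    """Draw from a Gaussian KDE: a random training point plus N(0, sigma^2 I) noise."""
    idx = rng.integers(0, len(train), size=n_samples)
    noise = rng.standard_normal((n_samples, train.shape[1]))
    return train[idx] + sigma * noise

def nearest_train_distance(samples, train):
    """Euclidean distance from each sample to its closest training point."""
    diffs = samples[:, None, :] - train[None, :, :]
    return np.sqrt((diffs ** 2).sum(-1)).min(axis=1)
```

As `sigma` shrinks, the nearest-training-point distances collapse toward zero, which is the memorization pattern described above.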
Appendix H Assessing Sample Diversity
Although FID is one of the most frequently used measures for assessing sample quality of generative models, certain diversity considerations, such as mode collapse, may not be conveniently deduced from it (Sajjadi et al., 2018). To enable explicit examination of generative models with respect to both accuracy (i.e., generating samples within the support of the data distribution) and diversity (i.e., covering the support of the data distribution as much as possible), Sajjadi et al. (2018) proposed an approach to evaluate generative models with generalized definitions of precision and recall between distributions. Quality of generation can then be assessed by evaluating the PRD curve, which depicts the trade-offs between accuracy (precision) and diversity (recall). We present the PRD curves in Fig. 12. The observations align with our results in Tab. 2 and kPF performs competitively in both accuracy and sample diversity.
Appendix I Exploring Kernel Configurations
To investigate the implication of kernel choices on generation quality, we tested 25 different kernel configurations for CelebA generation (results are presented in Tab. 5). For RBF kernels used in the CelebA experiments of the main text, we use a bandwidth of when used as input kernel and , and we adopt the same notation here.
It can be seen that kernel configurations and parameters indeed have a non-trivial impact on the generation quality, with NTK-kPF being the most robust to the choice of parameters. This aligns with our previous observations and offers some support for using NTK as an input kernel despite the additional compute cost.
Appendix J Experimental Details
In this section, we provide the detailed specifications for all of our experiments. We have also provided our code in the supplemental material.
J.1 Density Estimation on Toy Densities
We generated samples from each of the toy densities to learn the kPF operator. The input kernel is a ReLU-activated NTK corresponding to a fully-connected network with depth and width at each layer, and the output kernel is a Gaussian kernel. Unless specified otherwise, we always use a Gaussian kernel as the output kernel for the remainder of this appendix. The bandwidth of the output kernel was adjusted separately for density estimation and sampling for the purpose of demonstration. For comparison, we also fit a 10-component GMM and a Glow model with 50 coupling layers, each of which was trained until convergence.
J.2 Image Generation with Computer Vision Datasets
To generate results in Tab. 2 and Tab. 3, we first trained an autoencoder for each dataset following the model setup in (Ghosh et al., 2020), which uses a modified Wasserstein autoencoder (Tolstikhin et al., 2018) architecture with batch normalization. Additionally, we applied spectral normalization on both the encoder and the decoder, following (Ghosh et al., 2020), to obtain a regularized autoencoder. The latent representations were projected onto a hypersphere before decoding to image space. We trained the models on two NVIDIA GTX 1080TI GPUs. A detailed model specification is provided below in Table 6.
We used an NTK with and as the input kernel (i.e. the embedding kernel of ) for NTK-kPF, and a Gaussian kernel with bandwidth for RBF-kPF. The bandwidth for the output Gaussian kernel is selected by grid search over , where is the empirical data standard deviation, based on cross-validation of a degree 3 polynomial kernel MMD between the sampled and the ground-truth latent points. Further, to mitigate the deterioration of performance of kernel methods in a high-dimensional setting due to the curse of dimensionality (Evangelista et al., 2006), in practice, we model as a space with fewer dimensions than the input space . As a rule of thumb, we choose such that .
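The selection criterion mentioned above, a degree 3 polynomial kernel MMD between sampled and ground-truth latent points, can be sketched as follows. This uses the biased (V-statistic) estimate of the squared MMD; the kernel scaling `gamma`, the offset `coef0`, and the function names are assumptions made for the example, since the text does not specify them.

```python
import numpy as np

def poly_kernel(X, Y, degree=3, gamma=None, coef0=1.0):
    """Polynomial kernel (gamma * <x, y> + coef0) ** degree."""
    if gamma is None:
        gamma = 1.0 / X.shape[1]  # assumed default: one over the dimension
    return (gamma * (X @ Y.T) + coef0) ** degree

def mmd2_biased(X, Y, degree=3):
    """Biased estimate of squared MMD between the samples X and Y."""
    k_xx = poly_kernel(X, X, degree).mean()
    k_yy = poly_kernel(Y, Y, degree).mean()
    k_xy = poly_kernel(X, Y, degree).mean()
    return k_xx + k_yy - 2.0 * k_xy
```

In a grid search, one would compute this quantity between held-out latent codes and kPF samples for each candidate bandwidth and keep the bandwidth with the smallest value.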
To generate images from kPF learned on the NVAE latent space, we used the pre-trained checkpoints provided in (Vahdat and Kautz, 2020) to obtain the latent embeddings for 2000 FFHQ images. We then construct the kPF from the concatenated latent space of the lowest resolution ( ). During sampling, prior samples at those resolutions are replaced by the kPF samples, while for other resolutions samples are still generated from the inferred Gaussian distributions. The batchnorm statistics were readjusted for iterations following (Vahdat and Kautz, 2020). We use RBF kernels as input and output kernels, with bandwidths , chosen by the median heuristic ( for input and for output in our experiments).
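For reference, a common form of the median heuristic sets the RBF bandwidth to the median pairwise Euclidean distance among the samples. The sketch below assumes that variant (others use the median squared distance or a rescaled version), and the function name is hypothetical.

```python
import numpy as np

def median_heuristic_bandwidth(X):
    """Median of the pairwise Euclidean distances between distinct rows of X."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    upper = np.triu_indices(len(X), k=1)  # each unordered pair once
    return float(np.median(dists[upper]))
```

For the three one-dimensional points 0, 1, and 3, the pairwise distances are 1, 2, and 3, so the heuristic returns 2.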
J.3 Image Generation for Brain Images
For the high-resolution brain imaging dataset, we used a custom version of ResNet (He et al., 2016) with 3D convolutions. The detailed architecture is shown in Fig. 7. Due to the large size of the data, we trained the model on 4 NVIDIA Tesla V100 GPUs.
Mandatory ADNI statement regarding data use.
Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found in the ADNI Acknowledgement List.
The T1 MR brain dataset we utilize consists of images from subjects diagnosed with Alzheimer’s disease and healthy controls/normal subjects. Images were first coregistered to an MNI template and segmented to preserve only the white matter and grey matter. Then, all images were resliced and resized to and rescaled to the range of . Voxel-based morphometry (VBM) was used to obtain the -value map of data and generated images.
Appendix K More Samples
In this section, we present additional uncurated sets of samples on MNIST, CIFAR-10, and CelebA based on the pre-trained SRAE, and on FFHQ based on NVAE. From the figures, it can be seen that kPF produces consistent and diverse samples, often better in quality than the alternatives.