Latent variable modeling with random features

06/19/2020 ∙ by Gregory W. Gundersen, et al. ∙ 18

Gaussian process-based latent variable models are flexible and theoretically grounded tools for nonlinear dimension reduction, but generalizing to non-Gaussian data likelihoods within this nonlinear framework is statistically challenging. Here, we use random features to develop a family of nonlinear dimension reduction models that are easily extensible to non-Gaussian data likelihoods; we call these random feature latent variable models (RFLVMs). By approximating a nonlinear relationship between the latent space and the observations with a function that is linear with respect to random features, we induce closed-form gradients of the posterior distribution with respect to the latent variable. This allows the RFLVM framework to support computationally tractable nonlinear latent variable models for a variety of data likelihoods in the exponential family without specialized derivations. Our generalized RFLVMs produce results comparable with other state-of-the-art dimension reduction methods on diverse types of data, including neural spike train recordings, images, and text data.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 9

page 15

Code Repositories

rflvm

Random feature latent variable models in Python


view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Many dimension reduction techniques, such as principal component analysis

(Pearson, 1901; Tipping and Bishop, 1999) and factor analysis (Lawley and Maxwell, 1962)

, make two modeling assumptions: (1) the observations are Gaussian distributed, and (2) the latent structure is a linear function of the observations. However, for many applications, proper analysis requires us to break both of these assumptions. For example, in computational neuroscience, scientists collect firing rates for thousands of neurons simultaneously. These data are observed as counts, and neuroscientists believe that the biologically relevant latent structure is nonlinear with respect to the observations

(Saxena and Cunningham, 2019).

To capture nonlinear relationships in latent variable models, one approach is to assume that the mapping between the latent manifold and observations is Gaussian process (GP)-distributed. A GP is a prior distribution over the space of real-valued functions, which makes posterior inference in GP-based models tractable when the GP prior is conjugate to the data likelihood. This leads to the Gaussian process latent variable model (GPLVM, Lawrence, 2004).

The basic GPLVM model with a radial basis function (RBF) kernel has nice statistical properties that allow for exact, computationally tractable inference methods to be used when the number of observations is a reasonable size. Deviating from this basic model, however, leads to challenges with inference. For Poisson GPLVMs, we cannot integrate out the GP-distributed functional map, and we no longer have closed form expressions for the gradient of the posterior with respect to the latent space. This renders maximum a posteriori (MAP) estimation difficult, leading to solutions at poor local optima

(see Wu et al., 2017).

Random Fourier features (RFFs, Rahimi and Recht, 2008) were developed to avoid working with dimensional matrices when fitting kernel machines. RFFs accelerate kernel machines by using a low-dimensional, randomized approximation of the inner product associated with a given shift-invariant kernel. For this approximation, RFFs induce a nonlinear map using a linear function of random features.

We propose to use RFFs to approximate the kernel function in a GPLVM to create a flexible, tractable, and modular framework for fitting GP-based latent variable models. In the context of GPLVMs, RFF approximations allow for closed-form gradients of the objective function with respect to the latent variable. In addition, we can tractably explore the space of stationary covariance functions by using a Dirichlet process mixture prior for the spectral distribution of frequencies (BaNK, Oliva et al., 2016), leading to a flexible latent variable model.

This paper makes the following contributions to the space of nonlinear latent variable models: (1) we represent the nonlinear mapping in GPLVMs using a linear function of random Fourier features; (2) we leverage this representation to generalize GPLVMs to non-Gaussian likelihoods; (3) we place a prior on the random features to allow data-driven exploration over the space of shift-invariant kernels, to avoid putting restrictions on the kernel’s functional form. We validate our approach on diverse simulated data sets, and we show how results from RFLVMs compare with state-of-the-art methods on a variety of image, text, and scientific data sets. We release a Python library111https://github.com/gwgundersen/rflvm with modular and extensible code for reproducing and building on our work.

2 Random Feature Latent Variable Models

2.1 Random features for kernel machines

Here we briefly review random Fourier features (Rahimi and Recht, 2008) to motivate a randomized approximation of the GP-distributed maps in GPLVMs. Bochner’s theorem (Bochner, 1959) states that any continuous shift-invariant kernel on is positive definite if and only if

is the Fourier transform of a non-negative measure

. If the kernel is properly scaled, the kernel’s Fourier transform is guaranteed to be a density. Let , and let denote its complex conjugate. Observe that

(1)

So

is an unbiased estimate of

. If we drop the imaginary part for real-valued kernels, we can re-define by Euler’s formula. Then we can use Monte Carlo integration to approximate Eq. 1 as , where

(2)

We draw samples from , and the definition in Eq. 2 doubles the number of RFFs to . A representer theorem (Kimeldorf and Wahba, 1971; Schölkopf et al., 2001) says that the optimal solution to the objective function of a kernel method, , is linear in pairwise evaluations of the kernel. Using this random projection, we can represent as

(3)

In the second equality, the kernel trick implicitly lifts the data into a reproducing kernel Hilbert space in which the optimal solution is linear with respect to the features. The randomized approximation of this inner product allows us to replace expensive calculations involving the kernel with an -dimensional inner product.

For example, the predictive mean in GP regression implicitly uses the representer theorem and kernel trick (Williams and Rasmussen, 2006). RFFs have been used to reduce the computational costs of fitting GP regression models from to  (Lázaro-Gredilla et al., 2010; Hensman et al., 2017). However, RFFs have not yet been used to make GPLVMs more computationally tractable.

2.2 Gaussian process latent variable models

Now we introduce the basic GPLVM framework (Lawrence, 2004). Let be an matrix of observations and features, and let be an matrix of latent variables where . If we take the mean function to be zero, and the observations to be Gaussian distributed, the GPLVM is:

(4)

where is an covariance matrix defined by a positive definite kernel function and . Due to conjugacy between the GP prior on and Gaussian likelihood on , we can integrate out in closed form. The resulting marginal likelihood for is . We cannot find the optimal analytically, but various approximations have been proposed. We can obtain a MAP estimate by integrating out the GP-distributed maps and then optimizing with respect to the posterior using scaled conjugate gradients (Lawrence, 2004, 2005), where computation scales as . To scale up GPLVM inference, we may use sparse inducing point methods where the computational complexity is , for inducing points (Lawrence, 2007).

Alternatively, we can introduce a variational Bayes approximation of the posterior and minimize the Kullback–Leibler divergence between the posterior and the variational approximation with the latent variables

marginalized out. However, integrating out in the approximate marginal likelihood is only tractable when we assume that we have Gaussian observations and when we use an RBF kernel with automatic relevance determination, which limits its flexibility. This variational approach may be scaled using sparse inducing point methods. This approach is referred to as a Bayesian GPLVM (Titsias and Lawrence, 2010; Damianou et al., 2016).

2.3 Generative model for RFLVMs

The generative model of an RFLVM takes the form:

(5)

Following exponential family notation, is a likelihood function, is an invertible link function that maps the real numbers onto the likelihood’s support, and are other likelihood-specific parameters. Following Wilson and Adams (2013) and Oliva et al. (2016), we assume is a Dirichlet process mixture of Gaussians (DP-GMM, Ferguson, 1973; Antoniak, 1974). By sampling from the posterior of

, we can explore the space of stationary kernels and estimate the kernel hyperparameters in a Bayesian way. We assign each

in to a mixture component with the variable , which is distributed according to a Chinese restaurant process (CRP, Aldous, 1985) with concentration parameter

. This prior introduces additional random variables: the mixture means

, and the mixture covariance matrices where is the number of clusters in the current Gibbs sampling iteration.

The randomized map in Eq. 2 allows us to approximate the original GPLVM in Eq. 4 as

(6)

We approximate in Eq. 4 as , where . This is a Gaussian RFLVM when is a Gaussian distribution and is the identity function. Because the prior distribution on the mapping weights

is Gaussian, the model is analogous to Bayesian linear regression given

; if we integrate out , we recover a marginal likelihood that approximates the GPLVM’s marginal likelihood.

We use this representation to generalize the RFLVM to other observation types in the exponential family. For example, a Poisson RFLVM takes the following form:

(7)

For distributions including the Bernoulli, binomial, and negative binomial, the functional form of the data likelihood is

(8)

for some functions of the data , , and . The general form of this logistic RFLVM is then:

(9)

For example, by setting , , and , we get the negative binomial RFLVM with feature-specific dispersion parameter .

2.4 Inference for RFLVMs

We now present a general Gibbs sampling framework for all RFLVMs. First, we write the Gibbs sampling steps to estimate the posterior of the covariance kernel. Next, we describe estimating the latent variable by taking the MAP estimate. Then, we sample the data likelihood-specific parameters and linear coefficients . Variables subscripted with zero, e.g., , denote hyperparameters. While the number of mixture components may change across sampling iterations, let denote the number of components in the current Gibbs sampling step. We initialize all the parameters in our model by drawing from the prior, except for , which we initialize with PCA.

First, we sample following Algorithm 8 from Neal (2000). Let , and let denote the same sum with excluded. Then we sample the posterior of from the following discrete distribution for :

(10)

Given assignments and RFFs , the posterior of is inverse-Wishart distributed. Given , the posterior of

is normally distributed 

(Gelman et al., 2013):

(11)

We cannot sample from the full conditional distribution of , but prior work suggested a Metropolis–Hastings (MH) sampler using proposal distribution set to the prior (Eq. 5) and acceptance ratio  (Oliva et al., 2016):

(12)

Finally, we sample the DP-GMM concentration parameter  (Escobar and West, 1995). We augment the model with variable to make sampling conditionally conjugate:

(13)

For the Gaussian RFLVM (Eq. 6), let . We integrate out and in closed form to obtain a marginal likelihood,

(14)

where , , , and . See Appendix B or Minka (2000) for details. However, inference can be slow because marginalizing out introduces dependencies between the latent variables, and the complexity becomes . Alternatively, we can Gibbs sample and take the MAP estimate of using the original log likelihood where the computational complexity is .

In the Poisson RFLVM (Eq. 7), we no longer have the option of marginalizing out . Instead, we take iterative MAP estimates of and . Given , inference for

is analogous to Bayesian inference for a Poisson generalized linear model (GLM). In Secs. 

3.1 and 3.2, we show that, by inducing closed-form gradients with respect to through RFFs, this iterative MAP procedure produces results that are competitive with specialized GP-based latent variable models on count data.

For logistic RFLVMs (Eq. 9), we use Pólya-gamma augmentation (Polson et al., 2013) to make inference tractable. A random variable

is Pólya-gamma distributed with parameters

and , denoted , if

(15)

where denotes equality in distribution and are independent gamma random variables. The identity critical for Pólya-gamma augmentation is

(16)

where and . If we define , then Eq. 16 allows us to rewrite the likelihood in Eq. 8 as proportional to a Gaussian. Furthermore, we can sample conditioned on as . This enables convenient, closed-form Gibbs sampling steps of , conditioned on Pólya-gamma augmentation variables :

(17)

where and . This technique has been used to derive Gibbs samplers for binomial regression (Polson et al., 2013), negative binomial regression (Zhou et al., 2012), and correlated topic models (Chen et al., 2013; Linderman et al., 2015). Here, we use it to derive a sampling approach for logistic RFLVMs.

RFLVMs are identifiable up to the rotation and scale of the latent variable . As a result, MAP estimates of between iterations are unaligned as they can be arbitrarily rescaled and rotated through inference. Thus, a point estimate of that is a function of the Monte Carlo samples of , e.g. the expectation of across the samples, will not be meaningful. To this end, we arbitrarily fix the rotation of

by taking the singular value decomposition (SVD) of the MAP estimate,

, and setting

to be the left singular vectors corresponding to the

largest singular values, where and . Then we rescale

so that the covariance of the latent space is the identity matrix. This has the effect of enforcing orthogonality, and does not allow heteroskedasticity in the latent dimensions.

3 Experiments

Figure 1: Simulated data with Gaussian emissions. (Left) Inferred latent variables for both a GPLVM and Gaussian RFLVM. (Upper middle) Comparison of estimated for a single feature as estimated by GPLVM and RFLVM. (Lower middle) Comparison of MSE reconstruction error on held out for increasing , where is the number of inducing points for GPLVM and random Fourier features for RFLVM. (Right) Ground truth covariance matrix compared with RFLVM estimation for an increasing number of random Fourier features .

In our results, we refer to the Gaussian-distributed GPLVM using inducing point methods for inference as GPLVM (Titsias and Lawrence, 2010). We fit all GPLVM experiments using the GPy package (GPy, 2012)

. We refer to the Poisson-distributed GPLVM using a double Laplace approximation as

DLA-GPLVM (Wu et al., 2017). DLA-GPLVM is designed to model multi-neuron spike train data, and the code222https://github.com/waq1129/LMT initializes the latent space using the output of a Poisson linear dynamical system (Macke et al., 2011), and places a GP prior on . To make the experiments comparable for all GPLVM experiments, we initialize DLA-GPLVM with PCA and assume . We refer to our GPLVM with random Fourier features as RFLVM and explicitly state the assumed distribution. In Sec. 3.1, we use a Gaussian RFLVM with the linear coefficients marginalized out (Eq. 14) for a fairer comparison with the GPLVM. In Sec. 3.3, we use a Gaussian RFLVM without marginalizing out the linear coefficients because inference was faster on the larger datasets.

Since hyperparameter tuning our model on each dataset would be both time-consuming and unfair without also tuning the baselines, we fixed the hyperparameters across experiments. We used 2000 Gibbs sampling iterations with burn-in steps, , and . We initialized and . In Sec. 3.2, we used and visualized after the best affine transformation onto the 2-D rat positions following Wu et al. (2017). For computational reasons, MNIST and CIFAR-10 were subsampled (see Appendix A for details).

3.1 Simulated data

We first evaluate RFLVM on simulated data. We set to be a two-dimensional S-shaped manifold, sampled functions from a Gaussian process with an RBF kernel, and then generated observations for Gaussian emissions (Eq. 4) and Poisson emissions (Eq. 7). For all simulations, we used , and .

For these experiments, we computed the mean-squared error (MSE) between test set observations, , and predicted observations , where we held out 20% of the observations for the test set. To evaluate our latent space results, we projected the estimated latent space,

, onto the hyperplane that minimized the squared error with the ground truth,

, and calculated the value between the true and the projected latent space . We evaluated our model’s ability to estimate the GP outputs by comparing the MSE between the estimated and the true generating

. We computed the mean and standard error of the MSE and

results by running each experiment five times.

We compared the performance of a Gaussian RFLVM to the GPLVM. We ran these experiments across multiple values of , where denotes the number of random features for the RFLVM and the number of inducing points for the GPLVM. Both models recovered the true latent variable accurately and estimated the nonlinear maps, , well (Fig. 1, upper middle). Empirically, a GPLVM shows better performance for estimating than the RFLVM (Fig. 1, lower middle). We hypothesize that this is because Nyström’s method has better generalization error bounds than RFFs when there is a large gap in the eigenspectrum (Yang et al., 2012), which is the case for . However, we see that the RFLVM approximates the true given enough random features (Fig. 1, right), though perhaps less accurately than the GPLVM (Fig. 1, lower middle).

To demonstrate the utility of our model beyond Gaussian distributed data, we compared results for count data from a Poisson RFLVM and a DLA-GPLVM. Additionally, we compared results to our own naive implementation of the Poisson GPLVM that performs coordinate ascent on and by iteratively taking MAP estimates without using RFFs. We refer to this method as MAP-GPLVM. The MAP-GPLVM appears to get stuck in poor local modes (Wu et al., 2017) because we do not have gradients of the posterior in closed form (Fig. 2, left). Both DLA-GPLVM and RFLVM, however, do have closed-form gradients and approximate the true manifold with similar and MSE values for and .

Figure 2: Simulated data with Poisson emissions. (Left four plots) The true latent variable compared with estimated using a MAP-GPLVM, a DLA-GPLVM, and a Poisson RFLVM. (Middle) Comparison of for a single feature as estimated by DLA-GPLVM and RFLVMs. (Right) MSE and between the true and , and the true and , respectively.

3.2 Hippocampal place cell data

Figure 3: Hippocampal place cells.

(Left three plots) Inferred latent space for the DLA-GPLVM and the Poisson RFLVM. The points are colored by three major regions of the true rat position in a W-shaped maze. (Right two plots) KNN accuracy using 5-fold cross validation and

performance of the best affine transformation from onto the rat positions . Error bars computed using five trials.
Figure 4: MNIST digits. Digits visualized in 2-D latent space inferred from DLA-GPLVM (left) and Poisson RFLVM (right). Following Lawrence (2004), we plotted images in a random order while not plotting any images that result in an overlap. The RFLVM’s latent space is visualized as a histogram of 1000 draws after burn-in. The plotted points are the sample posterior mean.

Next, we checked whether a non-Gaussian RFLVM recovers an interpretable latent space when applied to a scientific problem. In particular, we use an RFLVM to model hippocampal place cell data (Wu et al., 2017). Place cells, a type of neuron, are activated when an animal enters a particular place in its environment. Here, is an matrix of count-valued spikes where indexes time and indexes neurons. These data were jointly recorded while measuring the position of a rat in a W-shaped maze. We are interested in reconstructing the latent positions of the rat with .

We quantified goodness-of-fit of the latent space by assessing how well the RFLVM captures known structure, in the form of held-out sample labels, in the low-dimensional space. After estimating , we performed -nearest neighbors (KNN) classification on with

. We ran this classification five times using 5-fold cross validation. We report the mean and standard deviation of KNN accuracy across five of these experiments.

The Poisson RFLVM and DLA-GPLVM have similar performance in terms of how well they cluster samples in the latent space as measured by KNN accuracy using regions of the maze as labels. Furthermore, the models have similar performance in recovering the true rat positions , measured by performance (Fig. 3). These results suggest that our generalized RFLVM framework finds structure even in empirical, complex, non-Gaussian data and is competitive with models built for this specific task.

3.3 Text and image data

Finally, we examine whether an RFLVM captures the latent space of text, image, and empirical data sets. We hold out the labels and use them to evaluate the estimated latent space using the same KNN evaluation from Sec. 3.2. Across all eight data sets, the Poisson and negative binomial RFLVMs infer a low-dimensional latent variable that generally captures the latent structure as well as or better than linear methods like PCA and NMF (Lee and Seung, 1999). Moreover, adding nonlinearity but retaining a Gaussian data likelihood—as with real-valued models like Isomap (Tenenbaum et al., 2000)

, a variational autoencoder

(VAE, Kingma and Welling, 2013), and the Gaussian RFLVM, or even using the Poisson-likelihood DLA-GPLVM—perform worse than the Poisson and negative binomial RFLVMs (Tab. 1, Fig. 4). We posit that this improved performance is because the generating process from the latent space to the observations for these data sets is (in part) nonlinear, non-RBF, and integer-valued.

PCA NMF Isomap VAE
Bridges
CIFAR-10
Congress
MNIST
Montreal
Newsgroups
Spam
Yale
DLA-GPLVM Gaussian RFLVM Poisson RFLVM Neg. binom. RFLVM
Bridges
CIFAR-10
Congress
MNIST
Montreal
Newsgroups
Spam
Yale
Table 1:

Classification accuracy evaluated by fitting a KNN classifier (

) with five-fold cross validation. Mean accuracy and standard error were computed by running each experiment five times.

4 Conclusion

We presented a framework that uses random Fourier features to induce computational tractability between the latent variables and GP-distributed maps in Gaussian process latent variable models. Our approach allows the Gaussian model to be extended to arbitrary distributions, and we derived an RFLVM for Gaussian, Poisson and logistic distributions. We described distribution-specific inference techniques for each posterior sampling step. Our empirical results showed that each was competitive in downstream analyses with existing distribution-specific approaches on diverse data sets including synthetic, image, text, and multi-neuron spike train data. We are particularly interested in exploring extensions of our generalized RFLVM framework to more sophisticated models such as extending GP dynamic state-space models (Ko and Fox, 2011) to count data and neuroscience applications, which assume temporal structure in .

RFLVMs have a number of limitations that motivate future work. First, the latent variables are unidentifiable up to scale and rotation. Our rescaling procedure (Sec. 2.4

) does not allow heteroscedastic dimensions and enforces orthogonality between the Gaussian latent variables. This prevents the use of more structured priors, such as a GP prior on

, since any inferred structure is eliminated between iterations. We are interested in adopting constraints from factor analysis literature to address the identifiability issues without a restrictive rescaling procedure (Erosheva and Curtis, 2011; Millsap, 2001; Ghosh and Dunson, 2009). Second, label switching in mixture models is a well-studied challenge that is present in our model. Enforcing identifiability may improve inference and model interpretability (Stephens, 2000). Finally, our model has a number of hyperparameters such as the latent dimension, the number of random Fourier features, and the number of Gibbs sampling iterations. Both simplifying the model and estimating these hyperparameters from data are two important directions to improve the usability of RFLVMs.

References

  • D. J. Aldous (1985) Exchangeability and related topics. In École d’Été de Probabilités de Saint-Flour XIII—1983, pp. 1–198. Cited by: §2.3.
  • C. E. Antoniak (1974) Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, pp. 1152–1174. Cited by: §2.3.
  • S. Bochner (1959) Lectures on Fourier integrals. Vol. 42, Princeton University Press. Cited by: §2.1.
  • J. Chen, J. Zhu, Z. Wang, X. Zheng, and B. Zhang (2013) Scalable inference for logistic-normal topic models. In Advances in Neural Information Processing Systems, pp. 2445–2453. Cited by: §2.4.
  • A. C. Damianou, M. K. Titsias, and N. D. Lawrence (2016) Variational inference for latent variables and uncertain inputs in Gaussian processes.

    Journal of Machine Learning Research

    17 (1), pp. 1425–1486.
    Cited by: §2.2.
  • E. A. Erosheva and S. M. Curtis (2011) Dealing with rotational invariance in Bayesian confirmatory factor analysis. Department of Statistics, University of Washington, Seattle, Washington, USA. Cited by: §4.
  • M. D. Escobar and M. West (1995) Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90 (430), pp. 577–588. Cited by: §2.4.
  • T. S. Ferguson (1973) A Bayesian analysis of some nonparametric problems. The Annals of Statistics, pp. 209–230. Cited by: §2.3.
  • A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin (2013) Bayesian data analysis. Chapman and Hall/CRC. Cited by: §2.4.
  • M. Gentzkow and J. M. Shapiro (2010) What drives media slant? Evidence from US daily newspapers. Econometrica 78 (1), pp. 35–71. Cited by: 3rd item.
  • J. Ghosh and D. B. Dunson (2009) Default prior distributions and efficient posterior computation in Bayesian factor analysis. Journal of Computational and Graphical Statistics 18 (2), pp. 306–320. Cited by: §4.
  • GPy (2012) GPy: a Gaussian process framework in python. Note: http://github.com/SheffieldML/GPy Cited by: §3.
  • J. Hensman, N. Durrande, and A. Solin (2017) Variational Fourier features for Gaussian processes. Journal of Machine Learning Research 18 (1), pp. 5537–5588. Cited by: §2.1.
  • G. Kimeldorf and G. Wahba (1971) Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications 33 (1), pp. 82–95. Cited by: §2.1.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. Cited by: §3.3.
  • J. Ko and D. Fox (2011) Learning GP-BayesFilters via Gaussian process latent variable models. Autonomous Robots 30 (1), pp. 3–23. Cited by: §4.
  • D. N. Lawley and A. E. Maxwell (1962) Factor analysis as a statistical method. Journal of the Royal Statistical Society. Series D (The Statistician) 12 (3), pp. 209–229. Cited by: §1.
  • N. D. Lawrence (2004)

    Gaussian process latent variable models for visualisation of high dimensional data

    .
    In Advances in Neural Information Processing Systems, pp. 329–336. Cited by: §1, §2.2, Figure 4.
  • N. D. Lawrence (2005) Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research 6 (Nov), pp. 1783–1816. Cited by: §2.2.
  • N. D. Lawrence (2007) Learning for larger datasets with the Gaussian process latent variable model. In Artificial Intelligence and Statistics, pp. 243–250. Cited by: §2.2.
  • M. Lázaro-Gredilla, J. Quiñonero-Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal (2010) Sparse spectrum Gaussian process regression. Journal of Machine Learning Research 11, pp. 1865–1881. Cited by: §2.1.
  • D. D. Lee and H. S. Seung (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401 (6755), pp. 788–791. Cited by: §3.3.
  • S. Linderman, M. J. Johnson, and R. P. Adams (2015) Dependent multinomial models made easy: stick-breaking with the Pólya-gamma augmentation. In Advances in Neural Information Processing Systems, pp. 3456–3464. Cited by: Appendix D, §2.4.
  • J. H. Macke, L. Buesing, J. P. Cunningham, M. Y. Byron, K. V. Shenoy, and M. Sahani (2011) Empirical models of spiking in neural populations. In Advances in Neural Information Processing Systems, pp. 1350–1358. Cited by: §3.
  • R. E. Millsap (2001) When trivial constraints are not trivial: the choice of uniqueness constraints in confirmatory factor analysis. Structural Equation Modeling 8 (1), pp. 1–17. Cited by: §4.
  • T. Minka (2000) Bayesian linear regression. Technical report Cited by: §2.4.
  • R. M. Neal (2000) Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9 (2), pp. 249–265. Cited by: §2.4.
  • J. B. Oliva, A. Dubey, A. G. Wilson, B. Póczos, J. Schneider, and E. P. Xing (2016) Bayesian nonparametric kernel-learning. In Artificial Intelligence and Statistics, pp. 1078–1086. Cited by: §1, §2.3, §2.4.
  • K. Pearson (1901) LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2 (11), pp. 559–572. Cited by: §1.
  • N. G. Polson, J. G. Scott, and J. Windle (2013) Bayesian inference for logistic models using Pólya–gamma latent variables. Journal of the American Statistical Association 108 (504), pp. 1339–1349. Cited by: §C.1, Appendix D, §2.4.
  • A. Rahimi and B. Recht (2008) Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177–1184. Cited by: §1, §2.1.
  • S. Saxena and J. P. Cunningham (2019) Towards the neural population doctrine. Current Opinion in Neurobiology 55, pp. 103–111. Cited by: §1.
  • B. Schölkopf, R. Herbrich, and A. J. Smola (2001) A generalized representer theorem. In

    International Conference on Computational Learning Theory

    ,
    pp. 416–426. Cited by: §2.1.
  • M. Stephens (2000) Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62 (4), pp. 795–809. Cited by: §4.
  • J. B. Tenenbaum, V. De Silva, and J. C. Langford (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), pp. 2319–2323. Cited by: §3.3.
  • M. E. Tipping and C. M. Bishop (1999) Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61 (3), pp. 611–622. Cited by: §1.
  • M. Titsias and N. D. Lawrence (2010) Bayesian Gaussian process latent variable model. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 844–851. Cited by: §2.2, §3.
  • C. K. I. Williams and C. E. Rasmussen (2006) Gaussian processes for machine learning. Vol. 2, MIT Press. Cited by: §2.1.
  • A. Wilson and R. Adams (2013) Gaussian process kernels for pattern discovery and extrapolation. In Proceedings of International Conference on Machine Learning, pp. 1067–1075. Cited by: §2.3.
  • A. Wu, N. A. Roy, S. Keeley, and J. W. Pillow (2017) Gaussian process based nonlinear latent structure discovery in multivariate spike train data. In Advances in Neural Information Processing Systems, pp. 3496–3505. Cited by: §1, §3.1, §3.2, §3, §3.
  • T. Yang, Y. Li, M. Mahdavi, R. Jin, and Z. Zhou (2012) Nyström method vs random Fourier features: a theoretical and empirical comparison. In Advances in Neural Information Processing Systems, pp. 476–484. Cited by: §3.1.
  • M. Zhou and L. Carin (2012) Augment-and-conquer negative binomial processes. In Advances in Neural Information Processing Systems, pp. 2546–2554. Cited by: §C.2.
  • M. Zhou, L. Li, D. Dunson, and L. Carin (2012) Lognormal and gamma mixed negative binomial regression. In Proceedings of International Conference on Machine Learning, Vol. 2012, pp. 1343. Cited by: §2.4.

Appendix A Experiments

a.1 Additional results

Figure 5: Latent space and generated faces for the Yale dataset using a Poisson RFLVM.
Figure 6: Latent space for CIFAR-10 and generated digits for MNIST using a Poisson RFLVM.

a.2 Data descriptions and preprocessing

Figure 7: (Left) The number of bicycle crossings over the Queensboro Bridge from April through November 2017. (Right) The number of cyclists on Berri St. in Montreal throughout 2015.

a.3 Experiment details

GPLVM baselines: We used GPy’s implementation BayesianGPLVMMiniBatch, which supports inducing points and prediction on held out data.

Appendix B Marginal likelihood in Bayesian linear regression

For ease of notation, we drop the subscript, and therefore and . Consider the linear regression model

(18)

where , an

matrix. A common conjugate prior on

is a normal–inverse–gamma distribution,

(19)

We can write the functional form of the posterior and prior terms in (19) as

(20)

We can combine the likelihood’s Gaussian kernel with the prior’s kernel in the following way:

(21)

where and are defined as

(22)

Now our posterior can be written as

(23)

We can see that we have an -variate normal distribution on the first line. If we ignore and inverse–gamma prior normalizer, we can combine the bottom two lines to be proportional to an inverse–gamma distribution,

(24)

Now define and as

(25)

Thus, we can write our posterior as

(26)

Now to compute the log marginal likelihood, we want

(27)

Using the definitions in (22) and (25), we can write the joint as

(28)

The integral over is only over the Gaussian kernel, which allows us to compute it immediately:

(29)

The terms in (28) cancel, and the first line of (28) reduces to

(30)

We can compute the second integral in (27) because we know the normalizing constant of the gamma kernel,

(31)

Putting everything together, we see that the marginal likelihood is

(32)

Appendix C Negative binomial Gibbs sampler updates

c.1 Sampling

Let be a Pólya-Gamma distributed random variable with parameters and , denoted . Polson et al. (2013) proved two useful properties of Pólya-Gamma variables. First,

(33)

where and . And second,

(34)

Now consider an NB likelihood on ,

(35)

Using (33), we can express the -th term in the negative binomial likelihood using the following variable substitutions,

(36)

This gives us

(37)

where

(38)

Finally, note that

(39)

If we vectorize across , we can sample each following Polson et al. (2013)’s proposed Gibbs sampler:

(40)

where

(41)

c.2 Sampling

Consider the hierarchical model

(42)

Zhou and Carin (2012) showed we can sample as follows: