Accelerated Inference for Latent Variable Models

Michael Minyi Zhang, et al. · Princeton University · ETH Zurich · 05/19/2017

Inference for latent feature models in the Bayesian nonparametric setting is generally difficult, especially in high-dimensional settings, because it usually requires proposing features from some prior distribution. In special cases, where the integration is tractable, we can sample feature assignments according to a predictive likelihood; however, this still may not be efficient in high dimensions. We present a novel method to accelerate the mixing of latent variable model inference by proposing feature locations from the data, as opposed to the prior. This sampling method promotes proper mixing of the Markov chain Monte Carlo sampler, is computationally attractive because it can be performed in parallel, and is theoretically guaranteed to converge to the posterior distribution as its limiting distribution.


1 Introduction

Bayesian nonparametric (BNP) models appear to be perfectly suited for the era of big data (Jordan, 2011), in which ever-expanding databases of high-dimensional data cannot be dealt with simplistically. Generative process priors like the Dirichlet process (Ferguson, 1973) or the Indian buffet process (Griffiths and Ghahramani, 2011) allow for modeling latent variables such as clusters or otherwise unobservable features in our data, and for adapting the complexity of the model to the complexity of the data. Even if we had some understanding of the latent structure in the data, we would not necessarily know its exact form and implications for the model a priori. The BNP solution, which divides the data into discrete features and clusters, fosters interpretable models that naturally lead to new hypotheses about the information in such databases (Kim et al., 2015). For example, in a general medical records dataset containing billions of observations, a cluster (or feature) composed of 0.001% of the population still includes tens of thousands of people; yet in a small fraction of the data, these clusters may be seen as outliers at best, and it would be difficult to characterize their meaning. Exploring large, complicated databases with the intention of finding interpretable new features, by adapting the model complexity to the complexity of the data, is BNP's most prominent and promising feature. Unfortunately, BNP fails to deliver on this promise due to at least three limitations:

  1. Base measure definition: The prior distribution over the parameters associated with a latent feature is governed by a (typically diffuse) base measure, which spreads probability mass over the entire parameter space associated with our predefined likelihood model. In a nonparametric setting, where we typically assume an unbounded number of distinct features, it is generally hard to specify informative priors, meaning the base measure behaves more as a regularizer than as a source of prior knowledge about the problem at hand (Gelman and Shalizi, 2013). In a high-dimensional parameter space, this means that, in the absence of highly informative priors, the prior probability mass around the true parameter will be vanishingly small.

  2. Computational complexity: Inference in BNP models is challenging because it involves exploring a countably infinite-dimensional state space. This has two consequences for designing scalable inference. First, parallelization is challenging due to alignment issues: with an unbounded set of features, whose members change as inference proceeds, partitioning the model in a consistent manner is difficult. In particular, many methods such as collapsed samplers and variational approximations explicitly rely upon sequentially processing the data. Second, exploring the infinite-dimensional space requires proposing previously unseen features. As discussed above, in high-dimensional spaces the prior probability mass near "good" features can be vanishingly small, so such exploration can take an infeasibly long time, which is unacceptable under finite computing budgets.

  3. Shallow likelihood models: Likelihood models are typically hand-engineered and simplistically pass information about the parameters from the prior to the observed data. These simple likelihoods limit the type of information that can be represented and cannot capture high-level, complex features in images or audio, for example. We need different likelihood models, learned from the data, to better characterize the relation between the observed data and an interpretable model.

In this paper, we propose a method that addresses the first two limitations, and apply it to Dirichlet process mixture models. These two limitations are in a sense intrinsic to the BNP setting: while we focus on the Dirichlet process, they are inherent in models that involve a countably infinite parameter space and are less problematic in finite-dimensional models. The third limitation, conversely, is a more general limitation of the Bayesian modeling paradigm. There are many lines of research aiming to address the problem of shallow likelihood models, including variational autoencoders (Kingma and Welling, 2014), adversarial variational Bayes (Mescheder et al., 2017), and deep hierarchical implicit models (Tran et al., 2017). While the problem of shallow likelihood models is certainly relevant in the BNP setting, incorporating these approaches is not straightforward in our current inference strategy.

The main idea in this paper is an algorithm that relies on an adaptation of a popular MCMC algorithm (Neal, 2000) together with a parallelizable procedure in which, at regular intervals, the different computation nodes share summary statistics and newly introduced features. To improve convergence, instead of sampling from the base measure, we propose a simple mechanism that samples feature locations from the data points that are not well characterized by their current clusters. With this method, we are able to find new clusters efficiently without needing to sample from the base measure.

We begin this paper with a brief introduction to the Dirichlet process and to inference for Dirichlet process mixture models in Section 2. In Section 3, we present our novel accelerated inference algorithm, first as an exact MCMC method and then as an approximate parallelizable algorithm. We then illustrate the performance of our algorithm in Section 4 on high-dimensional, high signal-to-noise-ratio data for which standard MCMC inference procedures for BNP models fail to find any meaningful structure, while our algorithm finds relevant clusters. Lastly, we conclude in Section 5 with a discussion of our work and directions for future research.

2 Background

Many nonparametric models involve placing an infinite-dimensional prior on a latent variable space; the inference challenge then becomes inferring the latent representation. In this paper, we focus on mixture models, where each observation is associated with a single latent variable. The most common nonparametric prior in this context is the Dirichlet process, yielding the Dirichlet process mixture model (Antoniak, 1974). However, our proposed method could be applied to a much wider family of nonparametric models beyond the Dirichlet process, which we intend to investigate rigorously in future research.

2.1 The Dirichlet process

The Dirichlet process (DP) is a distribution over distributions, $\mathrm{DP}(\alpha, H)$, whose finite-dimensional marginal distributions are Dirichlet distributed (Ferguson, 1973). We can represent a sample from a DP as a discrete measure, $G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k}$. A priori, the atom sizes $\pi_k$ are independent of the atom locations $\theta_k$, and we can equivalently describe the law of the Dirichlet process in terms of a distribution over an infinite sequence of weights, $\pi \sim \mathrm{GEM}(\alpha)$ (where GEM refers to "Griffiths, Engen and McCloskey" and denotes the stick-breaking representation of the Dirichlet process; Pitman, 2002), and a distribution over an infinite sequence of locations drawn from the base measure, $\theta_k \sim H$, which represents the family of distributions our parameters are drawn from.

In a modeling context, the discrete nature of $G$ means that a single atom can be associated with multiple observations, allowing us to use the DP to cluster data by associating item $i$ with parameter $\theta_k$ with probability $\pi_k$. This mixture model is known as a Dirichlet process mixture model (DPMM), and can be written as:

$$\pi \sim \mathrm{GEM}(\alpha), \qquad \theta_k \sim H, \qquad z_i \mid \pi \sim \mathrm{Discrete}(\pi), \qquad x_i \mid z_i, \{\theta_k\} \sim f(x_i \mid \theta_{z_i}) \qquad (1)$$

Here $x_i$ represents the data and $z_i$ represents the latent variable assigning observation $i$ to a feature. The infinite a priori number of clusters assumed by the Dirichlet process is attractive in a clustering/mixture modeling scenario because it allows us to avoid pre-specifying a fixed number of clusters while still providing an a posteriori finite number of clusters. (For a more substantial introduction to the Dirichlet process and Bayesian nonparametrics, we direct the reader to Emily Fox's Ph.D. thesis (Fox, 2009) and Gershman and Blei's introductory tutorial (Gershman and Blei, 2012).)
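To make the stick-breaking construction concrete, the following sketch draws cluster assignments and parameters from a truncated DP mixture prior. The function name, the truncation level, and the Dirichlet base measure used in the example are illustrative assumptions of ours rather than part of the paper's implementation.

import numpy as np

def sample_dp_mixture(n, alpha, base_sampler, truncation=500, rng=None):
    # Truncated stick-breaking (GEM) construction of a DP mixture prior:
    # beta_k ~ Beta(1, alpha), pi_k = beta_k * prod_{j<k} (1 - beta_j),
    # atom locations theta_k ~ H (the base measure).
    rng = np.random.default_rng(rng)
    betas = rng.beta(1.0, alpha, size=truncation)
    sticks = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    pi = betas * sticks
    pi /= pi.sum()                             # renormalize the truncated weights
    thetas = [base_sampler(rng) for _ in range(truncation)]
    z = rng.choice(truncation, size=n, p=pi)   # cluster assignments z_i ~ Discrete(pi)
    return z, thetas, pi

# Example: base measure H = Dirichlet(1, ..., 1) over a 10-dimensional simplex.
z, thetas, pi = sample_dp_mixture(
    n=1000, alpha=1.0, base_sampler=lambda r: r.dirichlet(np.ones(10)))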

2.2 Inference in the Dirichlet process

One reason for the popularity of the Dirichlet process is that its Dirichlet marginals lead to conjugacy and allow us to construct relatively straightforward samplers (see Neal, 2000 for a number of different MCMC sampling techniques for the DPMM). Loosely, these algorithms can be split into uncollapsed and collapsed samplers. Uncollapsed, or batch, samplers (Ishwaran and Zarepour, 2002; Walker, 2007; Ge et al., 2015) explicitly instantiate the mixing proportions $\pi$. These algorithms are inherently parallelizable, since the cluster allocation probabilities are conditionally independent given the mixing proportions. However, due to the infinite-dimensional nature of $\pi$, these methods require sampling previously unseen features, and they rely on the idea that some of these sampled features will be close to those found in the posterior. Collapsed, or sequential, samplers integrate out the mixing proportions (Neal, 2000; Ishwaran and James, 2001). This typically allows for better mixing, but introduces dependencies that make parallelization challenging: the conditional distribution of an allocation $z_i$ depends on the cluster allocations of all the other observations, denoted $z_{-i}$. Evaluating this conditional distribution is therefore unparallelizable without excessive and costly communication between processors.
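To make the parallelization argument concrete, here is a minimal sketch (our own notation, not the paper's code) of the uncollapsed allocation step: given instantiated weights and atoms, every assignment depends only on its own observation, so the loop below can be split freely across processors.

import numpy as np

def uncollapsed_assignments(X, pi, thetas, log_lik, rng=None):
    # p(z_i | pi, theta, x_i) depends only on (x_i, pi, theta), so each
    # observation can be updated independently of all the others.
    rng = np.random.default_rng(rng)
    z = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        logp = np.log(pi) + np.array([log_lik(x, th) for th in thetas])
        logp -= logp.max()                 # stabilize before exponentiating
        p = np.exp(logp)
        p /= p.sum()
        z[i] = rng.choice(len(pi), p=p)
    return z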

Numerous distributed inference procedures have been developed for fast inference in DP-based models (Ge et al., 2015; Broderick et al., 2013; Williamson et al., 2013; Zhang, 2016; Lovell et al., 2012). Smyth et al. (2009) propose an asynchronous method for the hierarchical Dirichlet process (Teh et al., 2004) which distributes the data across processors and performs Gibbs sampling based on an approximate marginalized distribution. Williamson et al. (2013) propose an exact sampler for the DPMM, but it requires that all observations associated with a particular cluster reside on the same processor. Ge et al. (2015) introduce an efficient map-reduce procedure for slice sampling the cluster allocations. Zhang (2016) developed an exact sampler for completely random measures (CRMs) which exploits the conditional independencies of the features by partitioning the random measure into a finite instantiated partition and an infinite uninstantiated partition. The instantiated partition runs the inherently parallel sampler with the mixing proportions instantiated, while one processor, selected at random, samples the uninstantiated partition and proposes new features based on the predictive distribution of the cluster assignments.

Many of these distributed methods perform well in low-dimensional settings. However, they tend to struggle as the number of dimensions increases. This is because they all primarily address the issue of sampling the cluster allocations $z_i$, and assume the existence of an appropriate method for sampling the parameters $\theta_k$. Efficient exploration of the parameter space requires a method for sampling from the base measure in a manner that encourages samples with high posterior probability. In proposing new features, we can either propose features from an uncollapsed representation, meaning we draw features from the prior, $\theta_k \sim H$, and assign observations to clusters with likelihood $f(x_i \mid \theta_k)$; or we can marginalize out the feature locations and sample cluster assignments from a collapsed likelihood, $p(x_i \mid \{x_j : z_j = k\}) = \int f(x_i \mid \theta)\, p(\theta \mid \{x_j : z_j = k\})\, d\theta$.

In a non-conjugate setting, both collapsed and uncollapsed algorithms require proposing new features. In a high-dimensional space, this can lead to very slow mixing, since randomly sampled features are unlikely to be close to the data. The conjugate sampler is more likely to sample good feature locations because it draws from the expectation of the likelihood with respect to the prior distribution on features, but this requires the marginal likelihood in closed form, which, in general, is only available in the narrow conjugate setting.
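For the Multinomial-Dirichlet model used later in Section 4, the collapsed predictive takes the standard Dirichlet-multinomial (Polya) form. The sketch below, a helper of our own with a symmetric Dirichlet hyperparameter assumed, shows the closed-form quantity a conjugate collapsed sampler evaluates.

import numpy as np
from scipy.special import gammaln

def log_marginal_multinomial_dirichlet(x, cluster_counts, gamma=1.0):
    # Collapsed predictive log p(x | data already in the cluster) for a
    # Multinomial likelihood with a symmetric Dirichlet(gamma) prior; pass
    # all-zero cluster_counts for a brand-new cluster.
    alpha = cluster_counts + gamma             # posterior Dirichlet parameters
    n = x.sum()
    return (gammaln(alpha.sum()) - gammaln(alpha.sum() + n)
            + np.sum(gammaln(alpha + x) - gammaln(alpha))
            + gammaln(n + 1) - np.sum(gammaln(x + 1)))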

3 Method

Our novel sampling method combines distributed, parallel sampling for the Dirichlet process with an improved proposal mechanism for new features. Essentially, we propose, in parallel, new features centered at observations that are poorly modeled by their assigned clusters. We begin by proposing an asymptotically exact, non-distributed sampler in Section 3.1. While this method is not a practical, scalable algorithm, it provides the inspiration for the two-stage inference method described in Section 3.3. The first stage of this algorithm is an acceleration stage, where features are proposed in parallel to ensure a high-quality pool of potential features; this stage is an approximation to the exact sampler of Section 3.1. To correct for the approximations introduced in the first stage, the method then proceeds with a distributed MCMC algorithm that guarantees asymptotic convergence.

3.1 An exact, serial sampler with data-driven proposals

Neal's Algorithm 8 describes a collapsed sampler where, at each iteration, an observation's parameter is selected from the union of the parameters associated with the other data points and a finite set of "unobserved" parameters sampled from the base measure $H$. These unobserved parameters are refreshed each iteration to ensure posterior convergence. Algorithm 8 samples the unobserved parameters via i.i.d. draws from $H$. This method of proposing new features essentially means we must, slowly, explore a high-dimensional space in the hope of finding meaningful clusters. Despite being guaranteed to converge asymptotically, we would not expect this inference strategy to converge in any realistic amount of time.

We replace Algorithm 8's method of introducing new features with our proposed empirical feature proposal, given in Algorithm 1 below. Neal's Algorithm 8 augments the DPMM with $m$ auxiliary variables that represent uninstantiated latent features. For the instantiated features, we sample their values from the appropriate conditional distributions; the uninstantiated features are resampled from the prior. After the allocations have been resampled, any uninstantiated features are removed and a new set of features is sampled from the prior. Rather than resample the uninstantiated features directly from the prior, we resample them according to a Markovian method. If the number of uninstantiated features is less than $m$, we sample new features from the prior; if it is greater than $m$, we randomly delete uninstantiated features. We then resample each uninstantiated feature location from a mixture distribution:

$$q(\theta) = \epsilon \sum_{i=1}^{N} \frac{f(x_i \mid \theta_{z_i})^{-1}}{\sum_{j=1}^{N} f(x_j \mid \theta_{z_j})^{-1}}\, \delta_{x_i}(\theta) + (1 - \epsilon)\, H(\theta) \qquad (2)$$

With probability $\epsilon$, we sample from an empirical distribution of the data, with point masses inversely proportional to each observation's likelihood given its current cluster assignment. (The feature locations we draw from this distribution must have the same support as the parameter space; for the Multinomial-Dirichlet models in Section 4, we normalize the draws so that the vector sums to one.) With probability $1 - \epsilon$, we sample from the prior. To ensure validity of the sampler, we accept the feature proposal using a Metropolis-Hastings correction. In the case where we sample features from the base measure, the Metropolis-Hastings acceptance probability simplifies to one and we automatically accept any draw from the prior, as we would in Algorithm 8. The Metropolis-Hastings sampling step for uninstantiated features is described in Algorithm 1.

for each uninstantiated feature k do
       Sample u ∼ Uniform(0, 1)
       if u < ε then
              Set the proposal distribution q to the empirical distribution of the data,
              with point masses inversely proportional to f(x_i | θ_{z_i})
       else
              Set q = H (the base measure)
       Draw θ* ∼ q and accept the new state for θ_k with probability min(1, r), where
              r = [H(θ*) q(θ_k)] / [H(θ_k) q(θ*)]
Algorithm 1: Empirical Feature Proposal Algorithm
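As a companion to Algorithm 1, the sketch below shows one possible implementation of the empirical proposal itself (the inverse-likelihood weighting and the simplex projection used in the Multinomial-Dirichlet case); the Metropolis-Hastings correction would then compare the drawn candidate against the base measure. Function and variable names are our own illustrative choices.

import numpy as np

def propose_feature_from_data(X, z, thetas, log_lik, rng=None):
    # Draw a candidate feature location from the data, favoring observations
    # that are poorly explained by their currently assigned cluster: the point
    # mass on x_i is proportional to 1 / f(x_i | theta_{z_i}).
    rng = np.random.default_rng(rng)
    log_w = np.array([-log_lik(X[i], thetas[z[i]]) for i in range(len(X))])
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    i = rng.choice(len(X), p=w)
    candidate = np.asarray(X[i], dtype=float)
    return candidate / candidate.sum()     # normalize onto the simplex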

3.2 An approximate distributed inference algorithm

The exact sampler described above requires access to all of the data on a single processor, both when sampling from the empirical distribution and when sampling the collapsed feature assignments, which makes the algorithm inherently unparallelizable. Worse yet, as the dimensionality of the data grows, the probability of accepting a state sampled from the data goes to zero. To overcome these issues, we next propose an approximation of our accelerated algorithm that we can easily parallelize.

Our approximation consists of two components. The first is an approximation of the collapsed sampling algorithm in which we replace the global summary statistics required by the sampler with summary statistics that are local to each processor. We choose this approximation because collapsed samplers, though inherently unparallelizable, exhibit better mixing properties than uncollapsed ones. Smyth et al. (2009) introduce this approximation in their distributable sampler and demonstrate that, although it yields an incorrect sampler, it still produces good empirical results. Explicitly, we accelerate sampling with a modified version of Algorithm 8 (Neal, 2000), which introduces $m$ auxiliary features representing finite realizations of currently unoccupied clusters. Letting $n_k$ denote the number of observations assigned to cluster $k$, clusters with $n_k > 0$ are instantiated and the auxiliary features are unoccupied. Algorithm 8 samples the cluster assignments according to the following probability:

$$P(z_i = k \mid z_{-i}, x_i, \theta) \propto \begin{cases} n_{-i,k}\, f(x_i \mid \theta_k) & \text{for instantiated clusters } k, \\ \frac{\alpha}{m}\, f(x_i \mid \theta_k) & \text{for auxiliary features } k, \end{cases} \qquad (3)$$

where $n_{-i,k}$ represents the number of observations, besides observation $i$, allocated to cluster $k$. We then draw new realizations for the unoccupied features from the prior and update the posterior values of the instantiated features. In a single-processor setting, we always have the exact value of $n_{-i,k}$. In the parallel setting, however, we no longer have the precise feature counts, which would force inter-processor communication at every state transition of $z_i$. Therefore, when running the algorithm in parallel we approximate $n_{-i,k}$ with $n^{(p)}_{-i,k}$, the local count of cluster $k$ on processor $p$, excluding observation $i$.

The other approximation is to automatically accept new states for the empty features proposed from the empirical distribution described in Algorithm 1, so as to minimize the amount of time wasted exploring portions of the feature space that lack useful information about the data. In practice, this means that we sample new features at each iteration according to Equation 3. This approximation sacrifices the theoretical correctness imposed by the Metropolis-Hastings step in order to explore the state space more quickly. In contrast to DP samplers that sample feature locations from the prior or assign cluster allocations based on the collapsed predictive likelihood, our accelerated sampler proposes features centered at the observations with the lowest likelihood given their current feature allocations. (We write that the proposed feature is proportional to the data because, in the multinomial-Dirichlet examples of Section 4.2, features proposed from the observations are normalized so that the proposed feature sums to one.) We assume that new features accepted on different processors are distinct, and thus we do not consider the problem of feature alignment. Furthermore, by dividing computation across multiple processors we can propose many times more features with which to explore possible new clusters that will persist after acceleration. After a fixed number of sub-iterations of our sampler, we trigger a global synchronization step in which each processor sends updated features, feature counts, and feature summary statistics, in order to instantiate new features on all processors and update posterior values for the global parameters.
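The following sketch illustrates the local-count approximation of Equation 3 on a single processor's data shard. Variable names, the data layout, and the treatment of the auxiliary features as the last m entries of the parameter list are assumptions of ours rather than details taken from the paper's code.

import numpy as np

def local_gibbs_step(X_local, z_local, thetas, local_counts, alpha, m, log_lik, rng=None):
    # Approximate collapsed allocation step on one processor: Equation 3, but
    # with the global count n_{-i,k} replaced by the count local to this shard.
    rng = np.random.default_rng(rng)
    K = len(thetas) - m                       # number of instantiated features
    for i, x in enumerate(X_local):
        counts = local_counts.astype(float)
        counts[z_local[i]] -= 1.0             # exclude observation i itself
        weights = counts.copy()
        weights[K:] = alpha / m               # auxiliary features share mass alpha
        logp = np.log(weights + 1e-12) + np.array([log_lik(x, t) for t in thetas])
        p = np.exp(logp - logp.max())
        p /= p.sum()
        k_new = int(rng.choice(len(thetas), p=p))
        local_counts[z_local[i]] -= 1
        local_counts[k_new] += 1
        z_local[i] = k_new
    return z_local, local_counts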

3.3 A two-stage, asymptotically exact parallel inference procedure

The method described in Section 3.2 is obviously not a correct MCMC sampler for the DPMM, since it does not apply the appropriate Metropolis-Hastings corrections; moreover, in the parallel setting, the reliance on local counts is an approximation of the total count of observations assigned to each feature. However, we can use it as an acceleration stage in a two-stage algorithm. Any asymptotically correct MCMC sampler for the DPMM will eventually converge to the true posterior regardless of its starting point, but the time required to achieve arbitrary closeness to this posterior depends strongly on that starting point. We therefore begin our inference procedure by running the approximate accelerated sampler for some pre-determined number of iterations, and then switch to an exact distributable inference algorithm (specifically, the one described in Zhang, 2016). The major additional benefit of our accelerated sampler is that it encourages fast mixing of the MCMC sampler without the need to integrate the latent features out of the likelihood. Thus, we can use a variety of priors on the features without encountering the problem of proposing features from the prior in a high-dimensional space. Our method in the DPMM case is suitably general for a wide class of data modeling scenarios: although the most common type of mixture is the Gaussian mixture model, we do not place any assumption on the form of the likelihood, and we will see an example of our method applied to count data to demonstrate its flexibility.
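A minimal sketch of the two-stage schedule, assuming user-supplied sweep and synchronization routines (all names here are hypothetical): run the approximate accelerated sweeps for a fixed number of iterations, then hand the current state to the exact distributed sampler.

def run_two_stage(state, accelerated_sweep, exact_sweep, synchronize,
                  n_accel=100, n_exact=900, sync_every=10):
    # Stage 1: approximate accelerated sweeps (fast mixing, not exact).
    # Stage 2: asymptotically exact distributed sweeps (correct limiting posterior).
    for t in range(n_accel + n_exact):
        sweep = accelerated_sweep if t < n_accel else exact_sweep
        state = sweep(state)                   # each sweep runs in parallel over processors
        if (t + 1) % sync_every == 0:
            state = synchronize(state)         # global synchronization step
    return state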

Algorithm 2 provides the exact details of one iteration of our accelerated inference procedure. We keep track of three sets of features: the features instantiated on all processors, the newly introduced features on each processor that are not yet instantiated on all processors, and the unoccupied features on each processor. In the accelerated stage, each processor is allowed to accept new features, which is why the latter two sets are indexed per processor. Once the accelerated stage ends, only one processor, drawn uniformly at random, may sample new features, while all other processors may only sample instantiated features. Furthermore, we assign cluster indicators for the globally instantiated features according to an uncollapsed sampler in order to maintain the theoretical correctness of the algorithm. Lastly, if the current iteration triggers a synchronization step, then all processors send summary statistics and new features to the master processor; the master processor samples new states for the global parameters and sends the new features to all processors.

if in the accelerated stage then
       for each processor p in parallel do
              for each observation i on processor p do
                    Sample z_i according to Equation 3, using the counts local to processor p
              for each unoccupied feature on processor p do
                    Propose a new location from the empirical distribution of Algorithm 1 and accept it automatically
else
       for each processor p in parallel do
              for each observation i on processor p do
                    if p is the processor selected to propose new features then
                          Sample z_i from the collapsed conditional, allowing previously unseen features
                          for each newly instantiated feature do
                                Sample its location from its conditional distribution
                    else
                          Sample z_i from the uncollapsed conditional over the instantiated features only
if synchronization step then
       On the master processor, gather summary statistics and newly instantiated features from all processors.
       Sample the global parameters (feature locations and mixing proportions) from their full conditionals.
       Transmit the updated global state to all processors.
Algorithm 2: Accelerated DP Inference Algorithm
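For concreteness, the sketch below shows one possible form of the synchronization block at the end of Algorithm 2 for the Multinomial-Dirichlet model of Section 4; the dictionary layout and the symmetric Dirichlet hyperparameter are assumptions of ours, not the paper's implementation.

import numpy as np

def synchronize(local_stats, local_new_features, gamma=1.0, rng=None):
    # local_stats: one dict per processor mapping a globally known feature id
    # to that processor's summed count vector for the feature.
    # local_new_features: one dict per processor with count vectors of features
    # newly accepted on that processor since the last synchronization.
    rng = np.random.default_rng(rng)
    merged = {}
    for stats in local_stats:                     # gather and sum shared statistics
        for k, counts in stats.items():
            merged[k] = merged.get(k, 0) + counts
    for p, new in enumerate(local_new_features):  # append newly introduced features
        for k, counts in new.items():
            merged[("new", p, k)] = counts
    # Resample global feature locations from their full conditionals
    # (Dirichlet posterior for a Multinomial likelihood with symmetric prior gamma).
    thetas = {k: rng.dirichlet(np.asarray(counts, dtype=float) + gamma)
              for k, counts in merged.items()}
    return thetas, merged                         # state broadcast back to all processors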

4 Experimental Results

4.1 Synthetic example

As a basic evaluation of our accelerated sampler, we first generated synthetic 10-dimensional data, with 1,000 training observations and 100 test observations, from the prior of a Dirichlet process mixture of multinomials:

$$G \sim \mathrm{DP}(\alpha, \mathrm{Dirichlet}(\gamma)), \qquad \theta_i \sim G, \qquad x_i \sim \mathrm{Multinomial}(n_i, \theta_i) \qquad (4)$$

and applied our sampler to these data. Even in this simple example, we can see that our accelerated sampler performs favorably in comparison to the collapsed sampler (Neal's Algorithm 8) and the uncollapsed sampler (Ge et al.'s distributable slice sampler) with regard to convergence time. Furthermore, even though one might expect the accelerated sampler to overfit the data with respect to the number of features, it does not overfit relative to the "true" number of features (the number of instantiated features drawn from the prior).
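As an illustration of this synthetic setup, a possible simulation is sketched below; the concentration parameter, Dirichlet hyperparameter, truncation level, and counts per observation are our own guesses, since the exact values are not reproduced here.

import numpy as np

# Simulate 10-dimensional multinomial mixture data from a truncated DP prior.
rng = np.random.default_rng(0)
alpha, D, T = 1.0, 10, 200                       # concentration, dims, truncation
betas = rng.beta(1.0, alpha, size=T)
pi = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
pi /= pi.sum()                                   # truncated GEM(alpha) weights
thetas = rng.dirichlet(np.ones(D), size=T)       # multinomial parameters from the prior
z = rng.choice(T, size=1100, p=pi)               # cluster assignments
X = np.array([rng.multinomial(50, thetas[k]) for k in z])
X_train, X_test = X[:1000], X[1000:]             # 1,000 training / 100 test observations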

Figure 1: Test set predictive log likelihood vs. log time (seconds) and number of features vs. log time.

In a small, low-dimensional setting we can easily run the MCMC sampler to convergence. However, in a big data scenario with high-dimensional data, we cannot realistically run the sampler until convergence, especially under fixed computational budgets. In the next section, we motivate the need for our accelerated MCMC method in this big data scenario by examining three image data sets on which traditional sampling techniques mix slowly or fail to mix.

4.2 Image data sets

We apply our accelerated inference technique to three high-dimensional data sets (code is available at https://github.com/michaelzhang01/acceleratedDP): the MNIST handwritten digit dataset (LeCun and Cortes, 1998), consisting of a training set of 60,000 images and a test set of 10,000 images; the Cropped Extended Yale Face Dataset B (Lee et al., 2005), divided into 1,818 training images and 606 test images; and the CIFAR-10 image data set (Krizhevsky, 2009), converted to greyscale, with 50,000 training images and 10,000 test images. For all three datasets, we model the data with a multinomial likelihood and a Dirichlet prior. In each experiment we initialized all observations to the same cluster and distributed the data randomly across processors for the distributable algorithms (our method and the uncollapsed sampler). We ran each sampler for either a fixed number of iterations or two days (whichever came first), with a global synchronization step at regular intervals (synchronization steps are used for our method and the uncollapsed sampler only), and stopped the accelerated sampling after a fixed number of initial iterations. For the random and k-means initialization tests (labeled "Rand. Init." and "KM Init." in the figures, respectively), we distributed the data to initial clusters either by random sampling or by k-means allocation.
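To make the evaluation concrete, the sketch below shows one way count-valued image data can enter the multinomial likelihood, and how the test-set predictive log likelihood reported in Table 1 could be computed from recovered feature locations and mixing proportions. The rounding convention and function names are our own assumptions.

import numpy as np
from scipy.special import gammaln

def image_to_counts(img):
    # Treat a greyscale image as a vector of pixel-intensity "counts" so that
    # it can be modeled with a multinomial likelihood.
    return np.rint(np.asarray(img, dtype=float).ravel()).astype(int)

def predictive_log_lik(x, thetas, weights):
    # Test-set predictive log likelihood under a fitted mixture of multinomials:
    # log sum_k w_k Multinomial(x | theta_k), with thetas of shape (K, D).
    n = x.sum()
    log_coeff = gammaln(n + 1) - gammaln(x + 1).sum()
    log_components = (np.log(weights) + log_coeff
                      + (x * np.log(np.clip(thetas, 1e-12, None))).sum(axis=1))
    m = log_components.max()
    return m + np.log(np.exp(log_components - m).sum())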

For each of the datasets, we can see that the uncollapsed samplers have difficulty proposing good features in high-dimensional space. In terms of predictive log likelihood, we perform favorably on each data set against all of the sampling methods, and are surpassed only by variational inference. For the uncollapsed sampler, the results demonstrate that it is difficult to propose good locations from the prior in high-dimensional space and, as a result, very few new features are instantiated relative to the size of the data. While the collapsed sampler can still propose good features, the timing results in Figures 2, 4, and 6 show that Neal's Algorithm 8 is far slower than the accelerated algorithm with regard to convergence time, because it is not an inherently parallelizable sampling algorithm. In fact, the collapsed sampler never even reaches convergence to a local mode in any of the image data set experiments.

Moreover, the features learned by the accelerated sampler are clearly superior to those learned by the uncollapsed algorithm (see Figure 8) and are comparable to those learned by variational inference (Figures 12, 13, 14). In comparison to the collapsed algorithm, our accelerated algorithm has mixed much better; this is apparent when inspecting some of the collapsed features, which tend to be less sharp or contain obviously mis-clustered observations (the fourth feature learned in Figure 15 is a composite of the digits "four" and "nine"). Although the VI results indicate that VI is faster and performs better than our method in terms of predictive log likelihood, we emphasize that the variational approach is an approximate method of Bayesian inference that would typically be faster than sampling-based approaches, whereas our algorithm is an exact inference method that performs better than the other MCMC methods examined in this paper and, critically, is parallelizable, unlike the VI and collapsed algorithms.

Furthermore, the results for random and k-means initialization of the uncollapsed and collapsed samplers demonstrate that we need a smarter way of learning new features than simply having good initial states. The results show that, under these initializations, we can learn some features of the data but do not fare as well as with the accelerated sampler. We can also see, from the poor quality of the collapsed and uncollapsed results, that our base measure has difficulty accurately representing our high-dimensional data sets. This is the difficulty we raised in the first point of our introduction, and it is why we need the accelerated method: proposing features from the data better represents complicated datasets.

Algorithm            MNIST      Yale       CIFAR-10
Accelerated          -8.61e7    -2.43e8    -1.48e8
Acc. KM Init.        -9.62e7    -2.77e8    -1.50e8
Acc. Rand. Init.     -9.14e7    -2.54e8    -1.54e8
Uncollapsed          -2.30e8    -8.39e8    -2.05e8
Unc. KM Init.        -1.23e8    -3.08e8    -1.60e8
Unc. Rand. Init.     -1.24e8    -3.34e8    -1.58e8
Collapsed            -1.38e8    -2.72e8    -1.52e8
Coll. KM Init.       -1.12e8    -2.67e8    -1.51e8
Coll. Rand. Init.    -1.29e8    -2.64e8    -1.51e8
Variational          -3.44e7    -1.20e8    -5.91e7
Table 1: Comparison of final predictive log likelihood results for MNIST, Yale, and CIFAR data.
Figure 2: Test set predictive log likelihood vs. log time (seconds) and number of features vs. log time.
Figure 3: Popularity of each instantiated feature.
Figure 4: Test set predictive log likelihood vs. log time (seconds) and number of features vs. log time.
Figure 5: Popularity of each instantiated feature.
Figure 6: Test set predictive log likelihood vs. log time (seconds) and number of features vs. log time.
Figure 7: Popularity of each instantiated feature. Uncollapsed sampler put all observations into one cluster.
Figure 8: Yale faces features (left), MNIST features (middle) and CIFAR features (right) obtained via uncollapsed sampling, sorted in descending order of popularity.
Figure 9: MNIST features obtained via accelerated sampling, sorted in descending order of popularity.
Figure 10: Yale faces features obtained via accelerated sampling, sorted in descending order of popularity.
Figure 11: CIFAR features obtained via accelerated sampling, sorted in descending order of popularity.
Figure 12: MNIST features obtained via variational inference, sorted in descending order of popularity.
Figure 13: Yale faces features obtained via variational inference, sorted in descending order of popularity.
Figure 14: CIFAR features obtained via variational inference, sorted in descending order of popularity.
Figure 15: MNIST features obtained via collapsed sampling, sorted in descending order of popularity.

Figure 16: Yale faces features obtained via collapsed sampling, sorted in descending order of popularity.

Figure 17: CIFAR features obtained via collapsed sampling, sorted in descending order of popularity.

5 Discussion

We have introduced a novel method to overcome a problem inherent in Bayesian nonparametric latent variable models, namely learning unobservable features in a high-dimensional regime, while also providing a data-parallel inference method suitable for "big data" scenarios. To accelerate the mixing of the MCMC sampler, during the accelerated stage we propose feature locations from observations that are poorly fit by their currently allocated features, in order to find better locations for new features.

After the accelerated stage, our method reverts to an exact algorithm, which is easily and inherently parallelizable without excessive processor communication, in order to guarantee that the inference algorithm converges to the correct limiting posterior distribution. An additional benefit of our sampler is that the technique works for a general choice of likelihood and prior, whereas a collapsed sampler restricts the model to a narrow range of conjugate options.

At first, the number of features discovered by the accelerated method may seem excessive, but we are using a very simple likelihood to model the digits instead of a more complicated model that is invariant to rotations or scalings of the data. This is apparent in the features discovered by accelerated sampling and variational inference, where a large proportion of the popular clusters learned on the MNIST data set are various forms of "ones". Furthermore, we could introduce merge steps in order to prune the number of features if the number learned is a concern. Additionally, our method is suitable for other latent feature models, for example sparse latent factor models like the Indian buffet process (Griffiths and Ghahramani, 2011), but for demonstration purposes we only examined the DPMM in this paper.

Given our Bayesian formulation of the problem, we have a natural generative model from which we can simulate GAN-type behavior. In contrast to GANs, the DPMM and BNP models in general do not need to train a discriminator to generate data; instead, we fit a latent variable model through MCMC inference, which then provides a way to generate data from the hierarchical model placed on the data. Because our method allows us to learn features in complex, high-dimensional settings where previously it was difficult to do so, we now have an opportunity to realize the promises and theoretical benefits of Bayesian nonparametric modeling on complicated datasets.

In future work, we hope to demonstrate the continued success of Bayesian nonparametrics in modeling complex data, along with the additional benefits that Bayesian and Bayesian nonparametric methods provide: a natural quantification of the uncertainty in our results and predictions, and a theoretically motivated methodology for adapting model complexity to the data.

Acknowledgments

The contribution of Michael Zhang and Sinead Williamson was funded by NSF grant IIS 1447721. Sinead Williamson’s contribution was written before her employment at Amazon.

References

  • Antoniak (1974) Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann. Statist., 2(6):1152–1174.
  • Broderick et al. (2013) Broderick, T., Kulis, B., and Jordan, M. (2013). MAD-Bayes: MAP-based asymptotic derivations from Bayes. In Dasgupta, S. and McAllester, D., editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 226–234, Atlanta, Georgia, USA. PMLR.
  • Ferguson (1973) Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist., 1(2):209–230.
  • Fox (2009) Fox, E. B. (2009). Bayesian nonparametric learning of complex dynamical phenomena. PhD thesis, Massachusetts Institute of Technology.
  • Ge et al. (2015) Ge, H., Chen, Y., Wan, M., and Ghahramani, Z. (2015). Distributed inference for Dirichlet process mixture models. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2276–2284.
  • Gelman and Shalizi (2013) Gelman, A. and Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66(1):8–38.
  • Gershman and Blei (2012) Gershman, S. J. and Blei, D. M. (2012). A tutorial on Bayesian nonparametric models. Journal of Mathematical Psychology, 56(1):1–12.
  • Griffiths and Ghahramani (2011) Griffiths, T. L. and Ghahramani, Z. (2011). The Indian buffet process: An introduction and review. The Journal of Machine Learning Research, 12:1185–1224.
  • Ishwaran and James (2001) Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. JASA, 96(453):161–173.
  • Ishwaran and Zarepour (2002) Ishwaran, H. and Zarepour, M. (2002). Exact and approximate sum representations for the Dirichlet process. Canadian Journal of Statistics, 30(2):269–283.
  • Jordan (2011) Jordan, M. I. (2011). The era of big data. ISBA Bulletin, 18(2):1–3.
  • Kim et al. (2015) Kim, B., Shah, J. A., and Doshi-Velez, F. (2015). Mind the gap: A generative approach to interpretable feature selection and extraction. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 2251–2259. Curran Associates, Inc.
  • Kingma and Welling (2014) Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. ICLR 2014.
  • Krizhevsky (2009) Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
  • LeCun and Cortes (1998) LeCun, Y. and Cortes, C. (1998). The MNIST database of handwritten digits.

  • Lee et al. (2005) Lee, K.-C., Ho, J., and Kriegman, D. J. (2005). Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5):684–698.
  • Lovell et al. (2012) Lovell, D., Adams, R. P., and Mansinghka, V. K. (2012). Parallel Markov chain Monte Carlo for Dirichlet process mixtures. In Workshop on Big Learning, NIPS.
  • Mescheder et al. (2017) Mescheder, L. M., Nowozin, S., and Geiger, A. (2017). Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. CoRR, abs/1701.04722.
  • Neal (2000) Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265.
  • Pitman (2002) Pitman, J. (2002). Combinatorial stochastic processes. Technical Report 621, Dept. Statistics, UC Berkeley. Lecture notes for St. Flour course.
  • Smyth et al. (2009) Smyth, P., Welling, M., and Asuncion, A. U. (2009). Asynchronous distributed learning of topic models. In Advances in Neural Information Processing Systems, pages 81–88.
  • Teh et al. (2004) Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2004). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101.
  • Tran et al. (2017) Tran, D., Ranganath, R., and Blei, D. M. (2017). Deep and hierarchical implicit models. CoRR, abs/1702.08896.
  • Walker (2007) Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics - Simulation and Computation, 36(1):45–54.
  • Williamson et al. (2013) Williamson, S. A., Dubey, A., and Xing, E. (2013). Parallel Markov chain Monte Carlo for nonparametric mixture models. In Proceedings of the 30th International Conference on Machine Learning, pages 98–106.
  • Zhang (2016) Zhang, M. M. (2016). Distributed inference in Bayesian nonparametric models using partially collapsed MCMC. Master’s thesis, The University of Texas at Austin.