1 Research Goals:
My goals at the outset of this experiment were to:

Learn more about nonparametric models

Investigate different nonparametric topic models and compare them to an LDA process run on a New York Times dataset with a fixed number of topics.

Pick higher-performing algorithms and apply recent research in Indian Buffet Process submodularity to these algorithms to assess the speed increase.
I should note that my experiments were not successful. The nonparametric topic modeling algorithms that I compared to LDA had consistently worse perplexity scores. Thus, I did not go forward with the proposed implementation of submodular methods. This speaks less to the potential of nonparametric models than to the need for more work in the field and the need to lower the barrier to entry: I found that unifying introductory texts were few and far between.
Thus, I will treat this paper as the start of a tutorial I plan to build out in the next few months. While I am certainly not an expert, I approached the subject as a beginner just a few weeks ago. This paper gives me an opportunity to offer a beginner's look at nonparametrics, one that I will expand on. Although I feel that I have not contributed in a significant way research-wise, if I expand this paper over the next few months to give a better high-level introduction to nonparametric modeling, this can be an important contribution.
I have tried to make my emphasis in this paper orthogonal to my presentation. That is, I will go lightly on the topics that I covered heavily during my presentation and go more deeply into the background of the processes.
2 Introduction
Bayesian hierarchical techniques present unified, flexible and consistent methods for modeling real-world problems. In this section, we will review parametric models by using the Latent Dirichlet Allocation model (LDA), presented by Blei et al. (2003) [1], as an example. We will then introduce Bayesian Processes and show how they factor into nonparametric models.
2.1 Bayesian Parametric Models: Latent Dirichlet Allocation (LDA)
The LDA topic model is a prototypical generative Bayesian model. It is "generative" in the sense that it assumes that some real-world observations (in this case, word counts) are generated by a series of draws from underlying distributions. According to Blei's hypothesis, the corpus is generated as follows:
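For each document $\mathbf{w}$ in corpus $D$:

Choose $N \sim \mathrm{Poisson}(\xi)$

Choose $\theta \sim \mathrm{Dir}(\alpha)$

For each of the $N$ words $w_n$:

Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$

Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$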
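The generative steps above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the function names, the toy topic-word matrix `toy_beta`, and the Poisson rate `xi=20` are assumptions made for the example, not values from the experiments in this paper.

```python
import math
import random

def sample_poisson(rng, lam):
    """Poisson draw via Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_dirichlet(rng, alpha):
    """Dirichlet draw obtained by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(rng, probs):
    """Index drawn with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(rng, alpha, beta, xi):
    """One document from the LDA generative process.

    alpha: Dirichlet parameter over topics (length k)
    beta:  k rows, each a distribution over the vocabulary
    xi:    Poisson rate for the document length
    """
    n_words = sample_poisson(rng, xi)          # N ~ Poisson(xi)
    theta = sample_dirichlet(rng, alpha)       # theta ~ Dir(alpha)
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(rng, theta)     # z_n ~ Multinomial(theta)
        w = sample_categorical(rng, beta[z])   # w_n ~ p(w | z_n, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words

rng = random.Random(0)
# Hypothetical toy setup: 2 topics over a 4-word vocabulary.
toy_beta = [[0.70, 0.20, 0.05, 0.05],
            [0.05, 0.05, 0.20, 0.70]]
theta, topics, words = generate_document(rng, [0.5, 0.5], toy_beta, xi=20)
```

Note that the number of topics is fixed by the length of `alpha`, which is exactly the quantity the user must choose in advance.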

These underlying distributions are hidden from us when we observe our sample set (the corpus), but through various methods we can infer their parameterization. This allows us to predict future samples, compress information, and explain existing samples in useful ways.
LDA is a mixture model: it assumes that each word is assigned a class, or "topic", and that each document is represented by a mixture of these topics. The overall number of topics in a corpus, k, must be specified by the user in advance, and a good value is often difficult to choose. To overcome this limitation, we introduce nonparametric models.
2.2 Bayesian Nonparametric Models: An Introduction
Gaussian Processes: Function Modeling
We start our discussion of nonparametrics with Gaussian Processes, since their construction mirrors the way in which the processes we actually use will be constructed, and I feel that anything Gaussian is a naturally easier concept to grasp. Let's say $f$ is distributed according to a Gaussian Process with mean measure $m$ and covariance $K$:
$$f \sim \mathcal{GP}(m, K)$$
The mean and covariance measures are simply functions that take some possibly infinite subset $x$ of Euclidean space (or, more generally, Hilbert space) and return a measure. When using processes in Bayesian nonparametrics, it is most helpful to think of a "measure" as our prior belief, expressed not as a parameter but as a mapping function.
In other words, much like a function maps an input space to a quantity, measures are used to quantify finite sets. For example, let's take the mean measure. In the Gaussian Process, $m(x) = \mathbb{E}[f(x)]$. This maps a finite (or infinite) set $x$ to an expected value. In practice, when we have no information about $f$, it's common to choose $m(x) = 0$ as a suitable measure prior. (This is reflected in Figure 2.a, which shows our prior beliefs about the $\mathcal{GP}$, where the span of the process is given by the gray region, with mean 0 and constant variance.)
Over a finite subset of Euclidean space, $x$, the measure takes an expected, real value, and the Gaussian Process takes a finite-dimensional distribution: a joint Gaussian. For example, if $x_1$, $x_2$ are subsets of the real line, then $f(x_1)$, $f(x_2)$ are distributed as:
$$\begin{pmatrix} f(x_1) \\ f(x_2) \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} m(x_1) \\ m(x_2) \end{pmatrix}, \begin{pmatrix} K(x_1, x_1) & K(x_1, x_2) \\ K(x_2, x_1) & K(x_2, x_2) \end{pmatrix} \right)$$
This joint Gaussian property of the $\mathcal{GP}$, over an infinite subspace, can describe the set of points forming functions, shown in Figure 2. We can envision starting with a finite number of discrete subsets of the real line, represented as the blue points in the diagram, and continuing to add samples. As the number of subsets we add approaches infinity, we have encompassed the entire real line, and created a joint Gaussian whose realization forms a continuous function. This ability of the $\mathcal{GP}$ to model functions allows it to be used in kernel-based nonlinear regression as a prior. (For more information, see [3].)
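To make the "finite subset gives a joint Gaussian" idea concrete, here is a minimal sketch that draws one realization of a zero-mean Gaussian Process on a finite grid of the real line. The squared-exponential kernel, its length scale, the grid, and the small `jitter` term added for numerical stability are all assumptions chosen for illustration; the paper does not prescribe a particular covariance.

```python
import math
import random

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between two real inputs (assumed kernel)."""
    return variance * math.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)

def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def sample_gp_prior(rng, xs, jitter=1e-6):
    """Draw f(xs) from a zero-mean GP restricted to the finite set xs."""
    n = len(xs)
    cov = [[rbf_kernel(xs[i], xs[j]) + (jitter if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    lower = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # f = L z has mean zero and covariance L L^T = cov.
    return [sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(n)]

rng = random.Random(0)
grid = [i * 0.5 for i in range(11)]   # 11 points on [0, 5]
f_values = sample_gp_prior(rng, grid)
```

Refining the grid adds more jointly Gaussian coordinates to the same draw, which is the finite-to-infinite picture described above.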
Dirichlet Processes: Probability Modeling
Similarly to how the Gaussian Process can be applied over infinite subspaces to model functions, the Dirichlet Process can model continuous probability densities. This is also rather intuitive.
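Let's start with a finite set, $\Omega$. If $(A_1, \ldots, A_k)$ is a partition of $\Omega$ such that $\bigcup_i A_i = \Omega$ and $A_i \cap A_j = \emptyset$ for $i \neq j$, then:
$$(G(A_1), \ldots, G(A_k)) \sim \mathrm{Dir}(\alpha G_0(A_1), \ldots, \alpha G_0(A_k))$$
where $G$ is drawn from a Dirichlet Process (DP) with base measure $G_0$ and concentration parameter $\alpha$:
$$G \sim \mathrm{DP}(\alpha, G_0) \quad \text{where } \alpha > 0$$
We can think of the base measure in the Dirichlet Process the same way as we think of the mean measure in the Gaussian Process, simply as a mapping from subsets of Hilbert space to reals, in this case reals in $[0, 1]$.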
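The partition property can be checked numerically. The sketch below assumes, purely for illustration, a uniform base measure on $[0, 1]$ and a partition into intervals; the masses that a Dirichlet Process draw assigns to those intervals are then a single Dirichlet draw whose parameters are the concentration times the interval lengths. The helper names and the chosen `edges` are hypothetical.

```python
import random

def dirichlet_draw(rng, params):
    """Dirichlet draw obtained by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(p, 1.0) for p in params]
    total = sum(draws)
    return [d / total for d in draws]

def dp_on_partition(rng, alpha, edges):
    """Masses a DP draw assigns to the intervals [edges[i], edges[i+1]).

    Assumes the base measure is Uniform[0, 1], so the base mass of an
    interval is simply its length.
    """
    base_mass = [hi - lo for lo, hi in zip(edges[:-1], edges[1:])]
    return dirichlet_draw(rng, [alpha * m for m in base_mass])

rng = random.Random(1)
edges = [0.0, 0.1, 0.4, 0.7, 1.0]
masses = dp_on_partition(rng, alpha=5.0, edges=edges)
```

Each draw sums to one, and averaging many draws recovers the base masses, which is the sense in which the base measure plays the role of a mean.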
Now, using a Dirichlet Process to model the space of continuous probability functions makes sense when we consider the special property of the Dirichlet Distribution that a sample from the distribution is a tuple that sums to 1, essentially a probability mass function. Thus, a Dirichlet Process over infinitely many partitions gives a continuous probability distribution in much the same way that a sample from a Gaussian Process over infinitely many subsets of the real line gives a continuous function.
The Dirichlet Process has found many uses in modeling probability distributions. One example is the Hierarchical Dirichlet Process (HDP) by Yee Whye Teh [4]. This model uses two stacked Dirichlet Processes. The first is sampled to provide the base distribution for the second, allowing a fluid construction of point mixture modeling known as the "Chinese Restaurant Franchise". However, the HDP has noted flaws. In particular, because the HDP draws the class proportions from a dataset-level joint distribution, it assumes that the weight of a component in the entire dataset (in our application, the "corpus") is correlated with the proportion of that component expressed within a datapoint (a "document"). In other words, the probability of a datapoint exhibiting a class is correlated with the weight of that class within the datapoint. Intuitively, in the topic modeling context, we might argue that rare corpus-level topics are often expressed to a great degree within specific documents, which suggests that this assumption is a weakness of the HDP.
Beta Process: Point Event Modeling
The Beta Process offers us a way to perform class assignment without this correlational bias. We will see more in the next section.
Taking a step back, a draw from a Beta Distribution is a special case of a Dirichlet distribution draw over two classes. Similarly, the Beta Process is defined over the product space $\Omega \times [0, 1]$. A draw from a Beta Process is given as:
$$B = \sum_{k=1}^{\infty} \pi_k \delta_{\omega_k}$$
where $B \sim \mathrm{BP}(c, B_0)$, and $\delta_{\omega_k}$ can be thought of as a unit measure at $\omega_k$. $\pi_k$ and $\omega_k$ are defined by the Lévy measure, $\nu$, of the Beta Process' product space, which (as given by Hjort (1990) [6]) is
$$\nu(d\omega, d\pi) = c\, \pi^{-1} (1 - \pi)^{c - 1}\, d\pi\, B_0(d\omega)$$
which can be passed as the mean measure to a Poisson Process. (Here $c$ is the concentration parameter and $B_0$ is the continuous, finite base measure.) (This is commonly used in Lévy-Khintchine formulations of stochastic processes.)
Therefore, we can see that BP draws need not sum to one. Taken over infinite subspaces, the Beta Process can be used to model CDFs, as point measurements are in $[0, 1]$, but draws are not dependent on the sum of subspaces, as they are in the Dirichlet. Taking the aggregate of point measurements, we can derive CDFs, which are especially useful in fields like survival analysis.
As we can see, point measurements can also be used to model class assignments. Thus, Beta Processes can effectively model each class membership separately. Per datapoint, the probability mass behind subspaces, or classes, is no longer bounded by one as in the Dirichlet Process. (For more on Beta Processes see Paisley [7] and Zhou [8].)
2.3 Bernoulli Processes and the Indian Buffet Process
Here we construct such a mixed-membership modeling scheme. The Beta Process is conjugate to the Bernoulli family. (A Bernoulli process is a very simple stochastic process that can be thought of as a sequence of coin flips with probability $p$.) This makes the Beta-Bernoulli pairing an ideal candidate for mixed-membership applications like topic modeling, where each data point is expressed as a combination of latent classes.
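As a rough illustration of the Beta-Bernoulli pairing, the sketch below uses a finite approximation with a fixed number of classes: each class gets its own Beta-distributed weight, and each datapoint flips one coin per class. The Beta(alpha / K, 1) parameterization is the finite model whose limit Griffiths and Ghahramani [9] take to obtain the IBP; the function name and the particular values of `n_classes` and `alpha` are assumptions for the example.

```python
import random

def sample_finite_beta_bernoulli(rng, n_points, n_classes, alpha):
    """Finite Beta-Bernoulli feature model (a truncation, not the full process)."""
    # One independent weight per class; nothing forces these to sum to one.
    weights = [rng.betavariate(alpha / n_classes, 1.0) for _ in range(n_classes)]
    # Each datapoint flips a separate coin for every class.
    assignments = [[1 if rng.random() < w else 0 for w in weights]
                   for _ in range(n_points)]
    return weights, assignments

rng = random.Random(2)
weights, assignments = sample_finite_beta_bernoulli(
    rng, n_points=6, n_classes=20, alpha=3.0)
classes_per_point = [sum(row) for row in assignments]
```

Unlike a Dirichlet draw, the entries of `weights` are unconstrained in their total, so a datapoint may hold several classes at once or none at all.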
Griffiths and Ghahramani [9] recognized this in 2005, and constructed the Indian Buffet Process (IBP) by marginalizing out the Beta Process. As Thibaux and Jordan [10] would later show in 2007, given:
$$B \sim \mathrm{BP}(c, B_0), \qquad X_i \mid B \sim \mathrm{BeP}(B)$$
where BeP is the Bernoulli Process and BP is the Beta Process with concentration $c$ and base measure $B_0$, the Beta Process can be marginalized out to give:
$$X_{n+1} \mid X_1, \ldots, X_n \sim \mathrm{BeP}\left( \frac{c}{c + n} B_0 + \sum_{j} \frac{m_{n,j}}{c + n} \delta_{\omega_j} \right)$$
where $B_0$ is a continuous base distribution with mass $\gamma$, $\delta_{\omega_j}$ is the unit point mass at $\omega_j$, and $m_{n,j}$ is the number of datapoints in $X_1, \ldots, X_n$ that exhibit class $\omega_j$. Thus, we see through this marginalization that the class labels are independent of order: as long as the order of the $\omega_j$ is consistent across $X_1, \ldots, X_n$, the BeP construction remains the same. This realization can be used to prove exchangeability of the beta-assigned class labels, proving that these points are a de Finetti mixture. (For more details, see [10]. De Finetti mixtures are great because they can be realized in provably defined probability distributions.)
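The marginalized form reads directly as a sequential sampling scheme, sketched below under stated assumptions. With `n` datapoints already seen, the next datapoint takes each existing class with probability (count of that class) / (concentration + n), and then opens a Poisson-distributed number of new classes with rate concentration * mass / (concentration + n). The names `mass` and `concentration`, and the Knuth-style Poisson sampler, are choices made for this example.

```python
import math
import random

def sample_poisson(rng, lam):
    """Poisson draw via Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_ibp(rng, n_points, mass, concentration=1.0):
    """Sequentially sample class memberships from the IBP predictive rule.

    Returns (rows, counts): rows[i] is the set of class indices held by
    datapoint i, and counts[j] is how many datapoints hold class j.
    """
    counts, rows = [], []
    for n in range(n_points):              # n datapoints already seen
        denom = concentration + n
        # Existing classes are reused in proportion to their popularity.
        row = {j for j, m in enumerate(counts) if rng.random() < m / denom}
        # New classes come from the continuous part of the base measure.
        n_new = sample_poisson(rng, concentration * mass / denom)
        row.update(range(len(counts), len(counts) + n_new))
        counts.extend([0] * n_new)
        for j in row:
            counts[j] += 1
        rows.append(row)
    return rows, counts

rows, counts = sample_ibp(random.Random(0), n_points=10, mass=2.0)
```

The number of classes is never fixed in advance; it grows as more datapoints are observed, which is the nonparametric behavior the paper is after.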
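If we model these processes as their corresponding distributions, we can derive a probability function:
$$P(Z) = \prod_{k=1}^{K} \int \left( \prod_{i=1}^{N} P(z_{ik} \mid \pi_k) \right) p(\pi_k)\, d\pi_k$$
where the probability of observing a set of class assignments for each datapoint is given by $P(Z)$, with $Z$ the binary matrix of class assignments for $N$ datapoints and $K$ classes. Under the exchangeability property of the Beta-Bernoulli construction given above, the Indian Buffet Process can be expressed over an infinite number of classes:
$$P([Z]) = \frac{(\alpha\beta)^{K_+}}{\prod_{h \geq 1} K_h!} \exp\left\{ -\alpha \sum_{j=1}^{N} \frac{\beta}{\beta + j - 1} \right\} \prod_{k=1}^{K_+} \frac{\Gamma(m_k)\, \Gamma(N - m_k + \beta)}{\Gamma(N + \beta)}$$
where $\Gamma$ represents the Gamma function, $\alpha$ and $\beta$ are concentration parameters, $K_+$ is the number of active classes, and $m_k$ is the number of datapoints assigned to class $k$. $K_h$ represents the history term of the ordering. In addition to reordering the class labels in a way that sums probability only over the "active" set of topics, this construction collapses multiple classes assigned to the same datapoints into a single class label. For more details on the combinatorial ordering scheme see slide 22 of my presentation, included in this folder. This formulation groups the class labels of the IBP datapoints in a way that allows the probability distribution to remain well-defined even as the set of classes is unbounded. Thus, we can draw inference on the infinite parameter space through a finite set of observations.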
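A small sketch can show how such a probability is evaluated over only the active classes. It assumes the standard two-parameter IBP form with concentration parameters `alpha` and `beta`, in which identical columns of the binary assignment matrix share a "history" and are counted together; treat the exact formula in the docstring as an assumption of this example rather than a statement of what the experiments used.

```python
import math
from collections import Counter

def ibp_log_prob(z, alpha, beta=1.0):
    """Log-probability of the equivalence class of a binary matrix z.

    Assumed form (two-parameter IBP):
      P([Z]) = (alpha*beta)^K+ / prod_h K_h!
               * exp(-alpha * sum_j beta / (beta + j - 1))
               * prod_k Gamma(m_k) Gamma(N - m_k + beta) / Gamma(N + beta)
    z is a list of N rows, each a list of 0/1 entries of equal length.
    """
    n = len(z)
    columns = [tuple(row[k] for row in z) for k in range(len(z[0]))]
    active = [col for col in columns if any(col)]   # drop all-zero classes
    histories = Counter(active)                     # K_h for each history h
    log_p = len(active) * math.log(alpha * beta)
    log_p -= sum(math.lgamma(k_h + 1) for k_h in histories.values())
    log_p -= alpha * sum(beta / (beta + j - 1) for j in range(1, n + 1))
    for col in active:
        m_k = sum(col)
        log_p += (math.lgamma(m_k) + math.lgamma(n - m_k + beta)
                  - math.lgamma(n + beta))
    return log_p

example = [[1, 1, 0],
           [1, 0, 0],
           [0, 1, 0]]
log_p = ibp_log_prob(example, alpha=1.0)
```

Because only active columns and their histories enter the computation, the value does not change when columns are permuted or when empty columns are appended.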
3 Applications of the Indian Buffet Process: Experiments
The Indian Buffet Process has been applied to a host of mixed-membership modeling schemes. Like the Hierarchical Dirichlet Process, it models class membership in a nonparametric way, which makes it appealing for use in nonparametric Bayesian models. Moreover, because it escapes the correlation that the HDP introduces, the IBP has recently been studied in topic modeling applications.
One simple example is the Focused Topic Model (Williamson et al., 2010) [11]. The Focused Topic Model uses an IBP Compound Dirichlet Process, which effectively decouples high corpus-level probability mass for a topic from high document-level probability mass. The topic weights $\theta_m$ for document $m$ are now generated from a $\mathrm{Dir}(b_m \cdot \phi)$, where $b_m$ represents the binary draw from an IBP and $\phi$ holds the corpus-level topic masses. Thus, strong corpus-level topics may be "shut off" in the document.
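The "shut off" mechanism can be sketched as a Dirichlet draw restricted to the topics that a binary mask switches on. The mask `b`, the positive corpus-level masses `phi`, and the function name are hypothetical names for this example, and the sketch covers only this one step, not the full Focused Topic Model.

```python
import random

def sample_focused_topic_weights(rng, b, phi):
    """Draw document topic weights from a Dirichlet over the active topics only.

    b:   binary mask, b[k] = 1 if topic k is switched on for this document
    phi: positive corpus-level masses, one per topic
    """
    weights = [0.0] * len(b)
    active = [k for k, on in enumerate(b) if on]
    if not active:
        return weights                      # no topic is active in this document
    # Dirichlet draw via normalized Gamma variates, restricted to active topics.
    draws = {k: rng.gammavariate(phi[k], 1.0) for k in active}
    total = sum(draws.values())
    for k in active:
        weights[k] = draws[k] / total
    return weights

rng = random.Random(3)
phi = [5.0, 0.2, 0.2, 3.0]      # topic 0 is strong at the corpus level
b = [0, 1, 1, 0]                # but this document switches it off
theta_m = sample_focused_topic_weights(rng, b, phi)
```

Here the strongest corpus-level topic receives exactly zero weight in the document, while two rare topics share all of the mass.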
Archambeau et al. take this a step further in [12] by applying an IBP compound Dirichlet to both the document topics and the words. In other words, while the Focused Topic Model decouples the corpus-level strength of a topic from the probability of a document expressing a topic, it still allows words to be weighted strongly under high-probability topics, encouraging distributions across words to be more uniform. Archambeau et al. add the IBP compound Dirichlet to word probabilities as well, creating a doubly compounded model, which they call LiDA.
Mingyuan Zhou took a different approach [13]. Instead of directly incorporating the IBP, he built a topic model around the Beta-Negative Binomial Process, a hierarchical topic model which is related to the IBP (the IBP is a Beta-Bernoulli construction) but is not binary, and thus has greater expressiveness. He presented his work at the 2014 NIPS conference.
Experiments:
I ran trials for all three of these on a New York Times dataset of 5294 documents and 5065 words, which I compared against a control of LDA with 55 topics. I chose this set of documents because it was the current 30-day window of articles that we use for training our recommendations system. I held out 10 percent of articles for a perplexity test. The log-perplexity of the held-out set was 7.9 for LDA on 55 topics, 2.099 for LiDA, which scaled to 66 topics, 5.6 for FTM and 5.4 for BNBP.
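For readers unfamiliar with the metric, the sketch below shows one common way to compute a per-word log-perplexity on held-out documents, given document-topic weights and topic-word distributions. It is an assumption-laden illustration: the natural logarithm, the per-word normalization, and the use of already-inferred `thetas` are choices made here, and the scores reported above may have been computed differently by each implementation.

```python
import math

def log_perplexity(docs, thetas, beta):
    """Negative average per-word log-likelihood on held-out documents.

    docs:   list of documents, each a list of word indices
    thetas: one topic-weight vector per document
    beta:   one word distribution per topic
    """
    total_log_lik, total_words = 0.0, 0
    for words, theta in zip(docs, thetas):
        for w in words:
            # p(w | theta, beta) = sum_k theta_k * beta_k[w]
            p = sum(t * topic[w] for t, topic in zip(theta, beta))
            total_log_lik += math.log(p)
            total_words += 1
    return -total_log_lik / total_words

# Hypothetical toy check: uniform topics over a 4-word vocabulary.
uniform_beta = [[0.25] * 4, [0.25] * 4]
score = log_perplexity([[0, 1, 2], [3, 3]], [[0.5, 0.5], [0.9, 0.1]], uniform_beta)
```

Lower values mean the model assigns higher probability to unseen words; a model that is uniform over a vocabulary of size V scores log V.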
4 Submodularity of the IBP
Current inference algorithms for IBP-related processes involve either sequential Gibbs samplers or sequential variational inference, effectively creating an NP-hard problem. Recent work by Reed and Ghahramani [14] examining the IBP has focused on its submodularity properties. Although I explored this in detail in my presentation, I will not go deeply into it here, as it is not relevant to the results above.
Acknowledgments
I'd like to thank Dr. Lozano and Dr. Aravkin for a truly excellent semester, and my coworkers at the New York Times, especially Daeil Kim, for suggestions and moral support.
5 Conclusion
In this paper, I've tried to give a brief tutorial on some commonly used Bayesian nonparametrics, as well as explain some motivations behind their use. I've tried to steer clear of the metaphors many introductory reviews use to explain nonparametrics, like stick-breaking processes or Pólya urns, in favor of more general explanations. I've reviewed some recent implementations of nonparametric models and compared them on a single dataset.
Although the performance of the models I tested was generally lacking and the methods were slow, I still feel that nonparametrics offer a sophisticated approach towards constructing flexible Bayesian models. With more research, I'm confident that the field could produce some promising results.
References
[1] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet Allocation." Ed. John Lafferty. Journal of Machine Learning Research 3 (2003): 993-1022.
[2] Ibrahim, Joseph George, and Ming-Hui Chen. Bayesian Survival Analysis. New York: Springer, 2001. Print.
[3] Rasmussen, Carl Edward, and Christopher K.I. Williams Gaussian Processes for Machine Learning. Cambridge, Mass.: MIT, 2006. Print.
[4] Teh, Yee Whye, Michael I. Jordan, Matthew J. Beal, and David M. Blei. "Hierarchical Dirichlet Processes." Journal of the American Statistical Association 101.476 (2006): 1566-1581. Print.
[5] Sudderth, Eric. ”Graphical Models for Visual Object Recognition and Tracking.” Doctoral Thesis, Massachusetts Institute of Technology (2006).
[6] Hjort, Nils Lid. "Nonparametric Bayes Estimators Based on Beta Processes in Models for Life History Data." The Annals of Statistics (1990): 1259-1294. Print.
[7] Paisley, John, and Michael Jordan. "A Constructive Definition of the Beta Process." Technical Report. 2014. Print.
[8] Zhou, Mingyuan, Lauren Hannah, David Dunson, and Lawrence Carin. "Beta-Negative Binomial Process and Poisson Factor Analysis." Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS) (2012).
[9] Griffiths, Thomas, and Zoubin Ghahramani. "Infinite Latent Feature Models and the Indian Buffet Process." Presented at NIPS, 2005 (2005). Print.
[10] Thibaux, Romain, and Michael Jordan. ”Hierarchical Beta Processes and the Indian Buffet Process.” Presented at AISTATS 2007 Conference (2007).
[11] Williamson, Sinead, Chong Wang, Katherine Heller, and David Blei. ”The IBP Compound Dirichlet Process and Its Application to Focused Topic Modeling.” Proceedings of the 27th International Conference on Machine Learning (2010).
[12] Archambeau, Cedric, Balaji Lakshminarayanan, and Guillaume Bouchard. ”Latent IBP Compound Dirichlet Allocation.” IEEE Transactions on Pattern Analysis and Machine Intelligence: 1. Print.
[13] Zhou, Mingyuan. "Beta-Negative Binomial Process and Exchangeable Random Partitions for Mixed-Membership Modeling." Presented at Neural Information Processing Systems 2014. (2014)
[14] Reed, Colorado, and Zoubin Ghahramani. ”Scaling the Indian Buffet Process via Submodular Maximization.” Presented at International Conference on Machine Learning 2013 (2013).