## 1 Introduction

Dirichlet Process mixture models [Antoniak. (1974)] allow for non-parametric or infinite mixture modeling, where the number of densities or mixture components is not fixed ahead of time, but is allowed to grow (slowly) with the number of data items. This is achieved by using as a prior the Dirichlet Process (DP), which is a distribution over distributions, and has the additional property that draws from it are discrete (w.p. 1) with infinite support [Antoniak. (1974); Ferguson. (1973)]. The popular LDA model [D. Blei and Jordan. (2003)] may be considered as a parametric restriction of the HDP mixture model. LDA and its non-parametric counterpart HDP have since been used extensively as priors for modeling text collections [Blunsom et al. (2009); Sharif-razavian and Zollmann. (2008)]. However, many applications require joint analysis of groups of data, such as a collection of text documents, where the mixture components, or topics (as they are called for text data), are shared across the documents. This calls for a coupling of multiple DPs, one for each document, where the base distribution is discrete and shared. The hierarchical Dirichlet Process (HDP) [Y. Teh and Blei. (2006)] does so by placing a DP prior on a shared base distribution, so that the model now has two levels of DPs.

The HDP mixture model belongs to the family of non-parametric admixture models [E. Erosheva and Lafferty. (2004)], where each composite data item or group gets assigned to a mixture over the mixture components or topics, enabling group specific mixtures to share mixture components. Hence the HDP family leads to group level distributions with shared mixture component distributions, leading to a family of distributions over distributions. While this adds more flexibility to the groups of data items, the ability to cluster groups themselves is lost, since each group now has a distinct mixture of topics associated with it. This additional capability is desired in many applications. For instance, consider the analysis of patient profiles in hospitals [A. Rodriguez and Gelfand. (2008)], where we would like to cluster patients in each hospital and additionally cluster the hospitals with common distributions over patient profiles. The nested Dirichlet Process (nDP) achieves this by constructing a DP mixture over possible group level distributions from which the distribution for each hospital is drawn, thus clustering hospitals based on the specific group level distribution chosen. This DP mixture has a base distribution that is itself a DP (instead of a draw from a DP, as in the case of the HDP), from which the group level distributions over patient profiles are drawn. Since the patient profiles are themselves appropriately chosen distributions, the nDP results in a distribution over distributions over distributions, unlike the HDP and the DP, which are distributions over distributions. The nDP model therefore becomes a prior for non-parametrically modeling mixtures of mixtures over appropriately chosen component distributions. The nested CRP (nCRP) [D. Blei and Tanenbaum. (2010)], a closely related model, proposes a model for multi-level hierarchical mixture modeling to discover topic hierarchies of arbitrary depth through the predictive distribution obtained by integrating out the DP in a multi-level nDP.

While the nDP family enables multi-level non-parametric mixture modeling, it is limited by the fact that it does not allow sharing of mixture components across group specific distributions at each level. For instance, in the previous example, group level distributions in hospitals do not share mixture components (patient profiles). In several real world applications, a need arises for multi-level non-parametric mixture modeling where at each level, group specific mixtures are required to share mixture components. This necessitates multi-level non-parametric admixture modeling. For instance, imagine a corpus containing descriptions related to entities, such as a shared set of researchers who have authored a large body of scientific literature, or a shared set of personalities discussed across news articles, such that each entity can be represented as a mixture of topics. Here, topic mixtures, corresponding to entities, are required to be shared across data groups or documents. In addition, we would like topics themselves to be shared across the topic mixtures corresponding to entities.

One could attempt to model this problem of non-parametric entity-topic modeling with nDP. The nDP can be imagined as first creating a discrete set of mixtures over topics, each mixture representing an entity, and then choosing exactly one of these entities for each document. In this sense, the nDP is a mixture of admixtures. However, a major shortcoming of the nDP for entity analysis is the restrictive assumption of a single entity being associated with a document. In research papers, multiple authors are associated with any document, and any news article typically discusses multiple news personalities. This requires each document to have a distribution over entities. In other words, we need a model that is an admixture of admixtures motivating the need for multi-level admixture modeling.

In this paper, we address non-parametric multi-level admixture models. To the best of our knowledge, there is no prior work that addresses this problem. We propose the nested HDP (nHDP), comprising multiple levels of HDP, where the base distribution of each HDP is itself an HDP. For inference using the nHDP, we propose the nested CRF (nCRF), which extends the Chinese Restaurant Franchise (CRF) analogy of the HDP to multiple levels by integrating out each HDP. However, due to strong coupling between the CRF layers, inference using the nCRF poses computational challenges. We propose a scalable algorithm for inference in the multi-level setting with a direct sampling scheme, based on that for the HDP, where the mixture component associated with an observation is directly sampled at each level, based on the counts of table assignments and stick-breaking weights at each of the levels.

We apply the two-level nHDP to address the problem of non-parametric entity topic analysis for simultaneous discovery of entities and topics from document collections. The two-level nHDP belongs to the same class of models as a two-level nDP, in the sense that it specifies a distribution over distributions (entities) over distributions (topics). However, unlike the nDP, it first creates a discrete set of entities, and models each group as a document specific mixture over these entities using an HDP. Similarly, it creates a discrete set of topics and models each entity as a distribution over these topics using another level of HDP, leading to two levels of HDPs. Apart from addressing the novel problem of multi-level admixture modeling, to the best of our knowledge, ours is the first attempt at entity topic modeling that is non-parametric in both entities and topics. The Author Topic Model falls out as a parametric version of this model, when the entity set is observed for each document, and the number of topics is fixed. In experiments on publication datasets from NIPS and DBLP, with authors as entities, we show that the nHDP generalizes better under different levels of available author information. More interestingly, the model is able to detect authors completely hidden in the entire corpus with reasonable accuracy.

## 2 Related Work

In this section, we review existing literature on Bayesian nonparametric modeling and entity-topic analysis.

Bayesian Nonparametric Models: We review the Dirichlet Process (DP) [Ferguson. (1973)], the Hierarchical Dirichlet Process (HDP) [Y. Teh and Blei. (2006)] and the nested Dirichlet Process (nDP) [A. Rodriguez and Gelfand. (2008)] in detail in Sec. 3.

The MLC-HDP [D. Wulsin and Litt. (2012)] is a -layer model proposed for human brain seizures data. The -level truncation of the model is closely related to the HDP and the nDP. Like the HDP, it shares mixture components across groups (documents) and assigns individual data points to the same set of mixtures, and like the nDP it clusters each of the groups or documents using a higher level mixture. In other words, this is a nonparametric mixture of admixtures, while our proposed nested HDP is a nonparametric admixture of admixtures.

The nested Chinese Restaurant Process (nCRP) [D. Blei and Tanenbaum. (2010)] extends the Chinese Restaurant Process analogy of the Dirichlet Process to an infinitely-branched tree structure over restaurants to define a distribution over finite length paths of trees. This can be used as a prior to learn hierarchical topics from documents, where each topic corresponds to a node of this tree, and each document is generated by a random path over these topics. The nCRP is also closely connected to the nDP in that the predictive distribution obtained by integrating out the DPs at each level from a K-level nDP leads to an nCRP. However, while the nCRP and the nDP facilitate multi-level non-parametric mixture modeling, they are not suitable for modeling multi-level non-parametric admixtures.

An extension to the nCRP model, also called the nested HDP, has recently been proposed on arXiv [J. Paisley and Jordan. (2012)]. In the spirit of the HDP, which has a top level DP providing the base distribution for document specific DPs, this model has a top level nCRP, which becomes the base distribution for document specific nCRPs. In contrast, our model for multi-level non-parametric admixtures has nested HDPs, in the sense that one HDP directly serves as the base distribution for another HDP, as in the nested DP [A. Rodriguez and Gelfand. (2008)], where one DP serves as the base distribution for another DP. This parallel with the nested DP motivates the nomenclature of our model as the nested HDP.

Next, we briefly review prior work on entity-topic modeling, which involves simultaneously modeling entities and topics in documents, an application we use throughout the paper to motivate our model. The literature mostly contains parametric models, where the numbers of topics and entities are known ahead of time. The LDA model [D. Blei and Jordan. (2003)] is the most popular parametric topic model; it infers a known number of latent topics from document collections. LDA models each document as a distribution over a finite set of topics and each topic as a distribution over words. The author-topic model (ATM) [M. Rosen-Zvi and Smyth. (2004)] extends LDA to capture the known authors of each document by modeling a document as a uniform distribution over a known author set and authors as distributions over topics, which are themselves distributions over words. Hence, the ATM can be used for parametric entity-topic modeling where the authors correspond to entities in documents. The Author Recipient Topic model [A. McCallum and Wang. (2004)] distinguishes between sender and recipient entities and learns the topics and topic distributions of sender-recipient pairs. In [D. Newman and Smyth. (2006)], the authors analyze entity-topic relationships from textual data containing entity words and topic words, which are pre-annotated. The Entity Topic Model [H. Kim and Han. (2012)] proposes a generative model which is parametric in both entities and topics and assumes observed entities for each document.

There has been very little work on nonparametric entity-topic modeling, which would enable discovery of entities in settings where entities are partially or completely unobserved in documents. The Author Disambiguation Model [Dai and Storkey. (2009)] is a nonparametric model for author entities along with topics. Primarily focusing on author disambiguation from noisy mentions of author names in documents, this model treats entities and topics symmetrically, generating entity-topic pairs from a DP prior. In contrast, our model treats each entity as a distribution over topics, thus explicitly modeling the fact that authors of documents have preferences over specific topics. We perform experiments in Section 7 to demonstrate the effectiveness of our model for non-parametric entity topic analysis.

## 3 Background

Consider a setting where observations are organized in groups. Let $x_{ji}$ denote the $i$-th observation in the $j$-th group. For a corpus of documents, $x_{ji}$ is the $i$-th word occurrence in the $j$-th document. In the context of this paper, we use group synonymously with document, and data item with word in a document. We assume that each $x_{ji}$ is independently drawn from a mixture model and has an associated mixture component parameterized by a factor, say $\theta_{ji}$, representing a topic. For each group $j$, let the associated factors $\theta_{j1}, \theta_{j2}, \ldots$ be drawn independently from a prior distribution $G_j$. Finally, let $F(\theta_{ji})$ denote the distribution of $x_{ji}$ given factor $\theta_{ji}$. Therefore, the generative model is given by

$$\theta_{ji} \mid G_j \sim G_j, \qquad x_{ji} \mid \theta_{ji} \sim F(\theta_{ji}). \tag{1}$$

The central question in analyzing a corpus of documents is the parametrization of the distributions $G_j$: what parameters to share and what priors to place on them. The LDA model [D. Blei and Jordan. (2003)] is the most popular parametric topic model; it assumes that $G_j$ is a distribution over a finite number of topics for each document, with a Dirichlet prior on the topic weights. The choice of Dirichlet prior is based on the conjugacy of the Dirichlet distribution with the multinomial, which leads to efficient inference. However, in most realistic scenarios, the number of topics is not known in advance.

Bayesian non-parametric modeling is a paradigm that enables us to choose a prior for $G_j$ that allows for a countably infinite number of mixture components. This enables working with mixture models without having to fix the number of mixture components in advance, by working with $G_j$ of the form $G_j = \sum_{k=1}^{\infty} \pi_{jk}\, \delta_{\phi_{jk}}$ with atoms $\phi_{jk} \sim H$, a base distribution. We start with one such prior, the Dirichlet Process, which considers each of the distributions $G_j$ in isolation, then the Hierarchical Dirichlet Process, which ensures sharing of atoms among the different $G_j$s, and finally the nested Dirichlet Process, which additionally clusters the groups by ensuring that not all the $G_j$s are distinct.

Dirichlet Process: We start with a formal definition of the Dirichlet process as a prior for the distribution $G$. Let $(\Theta, \mathcal{B})$ be a measurable space. A Dirichlet Process (DP) [Ferguson. (1973); Antoniak. (1974)] is a measure over measures on that space. Let $H$ be a finite measure on the space. Let $\alpha_0$ be a positive real number. We say that $G$ is DP distributed with concentration parameter $\alpha_0$ and base distribution $H$, written $G \sim \mathrm{DP}(\alpha_0, H)$, if for any finite measurable partition $(A_1, \ldots, A_r)$ of $\Theta$, we have

$$(G(A_1), \ldots, G(A_r)) \sim \mathrm{Dir}(\alpha_0 H(A_1), \ldots, \alpha_0 H(A_r)). \tag{2}$$

The stick-breaking representation provides a constructive definition for samples drawn from a DP, by explicitly drawing the mixture weights for $G$. It can be shown [Sethuraman. (1994)] that a draw $G$ from $\mathrm{DP}(\alpha_0, H)$ can be written as

$$G = \sum_{k=1}^{\infty} \beta_k\, \delta_{\phi_k}, \tag{3}$$

where the atoms $\phi_k$ are drawn independently from $H$ and the corresponding weights $\beta_k$ follow a stick-breaking construction: $\beta_k = \beta'_k \prod_{l<k} (1 - \beta'_l)$ with $\beta'_k \sim \mathrm{Beta}(1, \alpha_0)$. This is also called the GEM distribution: $\beta \sim \mathrm{GEM}(\alpha_0)$. The stick-breaking construction shows that draws from the DP are necessarily discrete, with infinite support, and the DP is therefore suitable as a prior distribution on mixture components for ‘infinite’ mixture models. Subsequently, factors $\theta_i$ are drawn from $G$, followed by draws $x_i \sim F(\theta_i)$ (similar to Eqn. 1). The generation of $G$ from the DP prior followed by the generation of $\theta_i$ and $x_i$ constitutes the Dirichlet Process mixture model [Ferguson. (1973)].
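The stick-breaking construction can be illustrated with a short simulation. The following is only a sketch under stated assumptions: the infinite sum is truncated after a fixed number of atoms, the base distribution is taken to be a standard normal purely for illustration, and the function and parameter names (`stick_breaking_weights`, `sample_truncated_dp`, `truncation`) are hypothetical rather than part of the model.

```python
import random

def stick_breaking_weights(alpha, truncation, rng):
    """Return the first `truncation` GEM(alpha) weights and the unbroken leftover mass."""
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for _ in range(truncation):
        fraction = rng.betavariate(1.0, alpha)  # Beta(1, alpha) share of the remainder
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining

def sample_truncated_dp(alpha, base_sampler, truncation, rng):
    """Approximate a draw from DP(alpha, H) by its first `truncation` atoms."""
    weights, leftover = stick_breaking_weights(alpha, truncation, rng)
    atoms = [base_sampler(rng) for _ in range(truncation)]  # atoms drawn i.i.d. from H
    return atoms, weights, leftover

rng = random.Random(0)
# Assumption: H is a standard normal, chosen only to make the example concrete.
atoms, weights, leftover = sample_truncated_dp(
    alpha=2.0, base_sampler=lambda r: r.gauss(0.0, 1.0), truncation=50, rng=rng)
```

The leftover mass shrinks geometrically in expectation as the truncation level grows, which is why a moderate truncation is often an acceptable approximation in practice.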

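Another commonly used perspective of the DP is the Chinese Restaurant Process (CRP) [Pitman. (2002)], which shows that the DP tends to cluster draws from $G$. Let $\theta_1, \ldots, \theta_i$ denote the sequence of draws from $G$, and let $\phi_1, \ldots, \phi_K$ be the distinct atoms of $G$ observed among these draws. The CRP considers the predictive distribution of the $(i+1)$-th draw given the first $i$ draws after integrating out $G$:

$$\theta_{i+1} \mid \theta_1, \ldots, \theta_i, \alpha_0, H \sim \sum_{k=1}^{K} \frac{n_k}{i + \alpha_0}\, \delta_{\phi_k} + \frac{\alpha_0}{i + \alpha_0}\, H, \tag{4}$$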

where $n_k = \sum_{i'=1}^{i} \mathbb{1}[\theta_{i'} = \phi_k]$ is the number of previous draws equal to $\phi_k$. The above conditional may be understood in terms of the following restaurant analogy. Consider an initially empty ‘restaurant’ that can accommodate an infinite number of ‘tables’. The $(i+1)$-th ‘customer’ entering the restaurant chooses a table for himself, conditioned on the seating arrangement of all previous customers. He chooses the $k$-th table with probability proportional to $n_k$, the number of people already seated at the table, and with probability proportional to $\alpha_0$, he chooses a new (currently unoccupied) table. Whenever a new table is chosen, a new ‘dish’ is drawn ($\phi \sim H$) and associated with the table. The CRP readily lends itself to sampling-based inference strategies for the DP.

Hierarchical Dirichlet Process: Now reconsider our grouped data setting. If each $G_j$ is drawn independently from a DP, then w.p. 1 the atoms for each $G_j$ are distinct when the base distribution $H$ is continuous. This would mean that no topic is shared across documents, which is undesirable. The Hierarchical Dirichlet Process (HDP) [Y. Teh and Blei. (2006)] addresses this problem by modeling the base distribution $G_0$ of the DP prior in turn as a draw from a DP, instead of using the continuous distribution $H$ directly. Since draws from a DP are discrete, this ensures that the same atoms are shared across all the $G_j$s. Specifically, given a distribution $H$ on the space $(\Theta, \mathcal{B})$ and positive real numbers $\alpha_0$ and $\gamma$, we denote by $\mathrm{HDP}(\alpha_0, \gamma, H)$ the following generative process:

$$G_0 \mid \gamma, H \sim \mathrm{DP}(\gamma, H), \qquad G_j \mid \alpha_0, G_0 \sim \mathrm{DP}(\alpha_0, G_0) \ \text{ for each group } j. \tag{5}$$

When the generation of the $G_j$s as described in Eqn. 5 is followed by generation of $\theta_{ji}$ and $x_{ji}$ as in Eqn. 1, we get the HDP mixture model.

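Using the stick-breaking construction, the global measure $G_0$, distributed as a Dirichlet process, can be expressed as $G_0 = \sum_{k=1}^{\infty} \beta_k\, \delta_{\phi_k}$, where the topics $\phi_k$ as before are drawn from $H$ independently ($\phi_k \sim H$) and the stick-breaking weights $\beta \sim \mathrm{GEM}(\gamma)$ represent ‘global’ popularities of these topics. Since $G_0$ has as its support the topics $\phi_k$, each group-specific distribution $G_j$ necessarily has support at these topics, and can be written as follows:

$$G_j = \sum_{k=1}^{\infty} \pi_{jk}\, \delta_{\phi_k}, \qquad \pi_j \mid \alpha_0, \beta \sim \mathrm{DP}(\alpha_0, \beta), \tag{6}$$

where $\pi_j$ denotes the topic popularities for the $j$-th group.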
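To make the relationship between global and group-specific topic popularities concrete, the sketch below draws both under a finite truncation. It relies on an assumption not made in the text: the global weights are truncated to a fixed number of topics, with the leftover stick mass folded into the last one. Under that truncation, a DP whose base measure is the finite vector of global weights reduces to a Dirichlet distribution with parameters equal to the concentration times those weights. All names (`gem_weights`, `dirichlet`, `sample_group_weights`) are hypothetical.

```python
import random

def gem_weights(concentration, truncation, rng):
    """Truncated stick-breaking weights; the leftover mass goes to the last atom."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        fraction = rng.betavariate(1.0, concentration)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)
    return weights

def dirichlet(params, rng):
    """Dirichlet draw via normalized Gamma variates (tiny floor keeps parameters positive)."""
    draws = [rng.gammavariate(max(a, 1e-12), 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]

def sample_group_weights(n_groups, alpha0, gamma, truncation, rng):
    """Global popularities from GEM(gamma), then per-group popularities centred on them."""
    beta = gem_weights(gamma, truncation, rng)
    pis = [dirichlet([alpha0 * b for b in beta], rng) for _ in range(n_groups)]
    return beta, pis

rng = random.Random(1)
beta, pis = sample_group_weights(n_groups=3, alpha0=5.0, gamma=2.0, truncation=10, rng=rng)
```

Every group places its mass on the same truncated set of topics, which is the sharing property that independent DPs with a continuous base distribution lack. A larger group-level concentration keeps each group's weights closer to the global ones.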

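Analogously to the CRP for the DP, the Chinese Restaurant Franchise (CRF) provides an interpretation of the predictive distribution for the next draw from an HDP after integrating out the $G_j$s and $G_0$. Let $\theta_{j1}, \theta_{j2}, \ldots$ denote the sequence of draws from each $G_j$, $\psi_{j1}, \ldots, \psi_{jm_{j\cdot}}$ the sequence of draws from $G_0$ made for group $j$, and $\phi_1, \ldots, \phi_K$ the sequence of draws from $H$. Then the conditional distribution of $\theta_{ji}$ given $\theta_{j1}, \ldots, \theta_{j,i-1}$ and $G_0$, after integrating out $G_j$, is as follows (similar to that in Eqn. 4):

$$\theta_{ji} \mid \theta_{j1}, \ldots, \theta_{j,i-1}, \alpha_0, G_0 \sim \sum_{t=1}^{m_{j\cdot}} \frac{n_{jt\cdot}}{i - 1 + \alpha_0}\, \delta_{\psi_{jt}} + \frac{\alpha_0}{i - 1 + \alpha_0}\, G_0, \tag{7}$$

where $n_{jt\cdot}$ is the number of factors in group $j$ associated with $\psi_{jt}$, $m_{jk}$ is the number of draws $\psi_{jt}$ in group $j$ that are equal to $\phi_k$, and dots indicate marginal counts. As $G_0$ is also distributed according to a Dirichlet Process, we can integrate it out similarly to get the conditional distribution of $\psi_{jt}$:

$$\psi_{jt} \mid \psi_{11}, \psi_{12}, \ldots, \psi_{21}, \ldots, \psi_{j,t-1}, \gamma, H \sim \sum_{k=1}^{K} \frac{m_{\cdot k}}{m_{\cdot\cdot} + \gamma}\, \delta_{\phi_k} + \frac{\gamma}{m_{\cdot\cdot} + \gamma}\, H. \tag{8}$$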

These equations may be interpreted using a restaurant analogy with tables and dishes. Consider a set of restaurants, one corresponding to each group. Customers entering each of the restaurants select a table according to a group specific CRP (Eqn. 7). The restaurants share a common menu of dishes $\phi_1, \ldots, \phi_K$. Dishes are assigned to the tables of each restaurant according to another CRP (Eqn. 8). Let $t_{ji}$ be the (table) index of the element of $\psi_{j1}, \psi_{j2}, \ldots$ associated with $\theta_{ji}$, and let $k_{jt}$ be the (dish) index of the element of $\phi_1, \ldots, \phi_K$ associated with $\psi_{jt}$. Then the two conditional distributions above can also be written in terms of the indexes $t_{ji}$ and $k_{jt}$ instead of referring to the distributions directly. If we draw $\theta_{ji}$ via choosing a summation term, we set $\theta_{ji} = \psi_{jt}$ and let $t_{ji} = t$ for the chosen $t$. If the second term is chosen, we increment $m_{j\cdot}$ by 1, draw $\psi_{jm_{j\cdot}} \sim G_0$, and set $\theta_{ji} = \psi_{jm_{j\cdot}}$ and $t_{ji} = m_{j\cdot}$. This CRF analogy leads to efficient Gibbs sampling-based inference strategies for the HDP mixture model [Y. Teh and Blei. (2006)].
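The seating process of the Chinese Restaurant Franchise can be simulated directly, as in the following sketch. It only tracks table and dish indices and does not draw dish parameters from the base distribution. The function names and the representation of the state as plain lists of counts are illustrative assumptions, not an inference algorithm.

```python
import random

def crp_choice(counts, concentration, rng):
    """Pick an existing index w.p. proportional to its count, or len(counts) for a new one."""
    weights = list(counts) + [concentration]
    return rng.choices(range(len(weights)), weights=weights)[0]

def simulate_crf(group_sizes, alpha0, gamma, rng):
    """Seat customers restaurant by restaurant; return each customer's dish index."""
    tables_per_dish = []  # number of tables, across all restaurants, serving each dish
    assignments = []
    for n_customers in group_sizes:
        customers_per_table = []  # occupancy of each table in this restaurant
        dish_of_table = []        # dish index served at each table
        dishes = []
        for _ in range(n_customers):
            t = crp_choice(customers_per_table, alpha0, rng)  # restaurant-level CRP
            if t == len(customers_per_table):                 # a new table is opened
                k = crp_choice(tables_per_dish, gamma, rng)   # franchise-level CRP
                if k == len(tables_per_dish):                 # a new dish joins the menu
                    tables_per_dish.append(0)
                tables_per_dish[k] += 1
                customers_per_table.append(0)
                dish_of_table.append(k)
            customers_per_table[t] += 1
            dishes.append(dish_of_table[t])
        assignments.append(dishes)
    return assignments, tables_per_dish

assignments, tables_per_dish = simulate_crf(
    [5, 8, 3], alpha0=1.0, gamma=1.0, rng=random.Random(0))
```

Because new tables in every restaurant choose from the same menu, dishes that are popular in one restaurant are more likely to be served in the others, which is how the HDP shares topics across groups.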

Nested Dirichlet Process: In other applications of grouped data, we may want to cluster observations in each group by learning group specific mixture distributions and simultaneously cluster these group specific distributions, inducing a clustering over the groups themselves. For example, when analyzing patient records in multiple hospitals, we may want to cluster the patients in each hospital by learning a distribution over patient profiles and cluster hospitals having the same distribution over patient profiles. The HDP cannot do this, since each group specific mixture is distinct.

This problem is addressed by the nested Dirichlet Process [A. Rodriguez and Gelfand. (2008)], which first defines a set of distributions with an infinite support:

(9) |

and then draws the group specific distributions, that we now term as , from a mixture over these set of :

We denote the generation process as The process ensures non-zero probability of different groups selecting the same , leading to clustering of the groups themselves. Using Eqn. 3, the draws can be characterized as:

(10) |

where the base distribution of the outer DP is in turn another DP, unlike the HDP where it is DP distributed. Thus the nDP can be viewed as a distribution on the space of distributions on distributions.

The nDP can be expressed with the following restaurant analogy with two levels of restaurants. Each group (hospital/document) is associated with an ‘outer’ level restaurant while each distribution corresponds to an ‘inner’ level restaurant. Each outer restaurant picks a distribution , through picking a ’dish’ from a global menu of dishes across outer restaurants based on the dish’s popularity according to . Each dish in this menu, that corresponds to a unique inner restaurant, defines a specific distribution over patient profiles. Hence each outer restaurant gets a distribution corresponding to one of the inner restaurants through this process, leading to a grouping of the outer restaurants (hospitals) based on the inner restaurant (distribution over patient profiles) chosen. The customer entering an outer restaurant goes to the corresponding inner restaurant, with index , such that . Now the customer selects a table in this restaurant, with the index, say, . The data is generated from the corresponding .

A Note on Notation: The nDP brings into focus the idea of nesting, where the distributions at one level (level 0) are themselves atoms for the next level (the level 1 mixture distribution). Hence, with the nDP, we introduce the notion of levels into our notation through superscripts for random variables. For the rest of the paper, the superscript of a random variable indicates the level of the variable. The notation table provides a ready summary of the notation used through the rest of the paper.

Nested Chinese Restaurant Process: The nDP can be viewed as a tool for building a non-parametric mixture of mixtures. The Nested Chinese Restaurant Process (nCRP) [D. Blei and Tanenbaum. (2010)] is a closely related model for multi-level clustering. The nCRP extends the CRP by creating an infinitely-branched tree structure over restaurants to define a distribution over finite length paths of trees for modeling topic hierarchies from documents. The nCRP can be interpreted with a restaurant analogy consisting of multiple levels of restaurants, as described in [D. Blei and Tanenbaum. (2010)]: “A tourist arrives at the city for a culinary vacation. On the first evening, he enters the root Chinese restaurant and selects a dish using the CRP distribution, based on its popularity (equation 4). On the second evening, he goes to the restaurant identified on the first night's dish and chooses a second dish using a CRP distribution based on the popularity of the dishes in the second night's restaurant. He repeats this process forever.” The nCRP is closely connected to the nDP, since a K-level nCRP can be obtained by integrating out the DP at each level in a K-level nDP, facilitating multi-level non-parametric mixture models.
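The tourist analogy for the nCRP amounts to applying a CRP choice at every node along a path in a tree. The sketch below samples fixed-depth paths in this way. The tree is stored as a mapping from a node (identified by the tuple of choices leading to it) to the visit counts of its children. This representation, the fixed depth, and the names `sample_ncrp_path` and `gamma` are assumptions made for illustration only.

```python
import random
from collections import defaultdict

def sample_ncrp_path(child_counts, depth, gamma, rng):
    """Sample one root-to-depth path, updating the visit counts of the tree in place."""
    path = ()
    for _ in range(depth):
        counts = child_counts[path]  # visit counts of the children of the current node
        weights = list(counts) + [gamma]  # last entry stands for a new child
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(counts):
            counts.append(0)  # a previously unvisited restaurant is created
        counts[choice] += 1
        path = path + (choice,)
    return path

rng = random.Random(0)
tree = defaultdict(list)
paths = [sample_ncrp_path(tree, depth=3, gamma=1.0, rng=rng) for _ in range(20)]
```

Each sampled path selects exactly one child per level, which reflects the mixture (rather than admixture) character of the nCRP discussed in the text.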

Multi-level Admixture models: The nDP enables modeling a non-parametric mixture of non-parametric mixtures, while the nCRP provides a hierarchical prior for multi-level non-parametric mixture models. In other words, the multi-level nDP leads to a prior where each distribution at a specific level $l$ is a mixture over a distinct set of distributions at the previous level $l-1$. Hence, there are no atoms in common between distributions at each level. The nDP and multi-level nDP are therefore not suited for applications that require mixture components to be shared across group specific distributions at each level. Several real world scenarios are however more effectively modeled by multi-level admixture models, where each level has a group of distributions which share mixture components.

An example of entity-topic modeling for document collections clearly illustrates the limitation of existing models. Here, we would like to model documents as having distributions over a set of latent entities, with multiple documents sharing entities. We would like to model the entities themselves as distributions over a set of latent topics, with the ability for multiple entities to share topics. This constitutes a two level admixture model, where group specific distributions at one level (the ’entity’ distributions over topics) must share atoms (topics), which are themselves distributions at the previous level (the ’topic’ distributions over words).

The author-topic model (ATM) [M. Rosen-Zvi and Smyth. (2004)], an extension of LDA, captures this modeling scenario for the parametric case where the entities (authors) for each document are observed and the number of topics is known in advance. The ATM captures the known authors of each document by modeling documents as uniform distributions over the corresponding sets of authors, and authors as distributions over topics. Each word is generated by first sampling one of the known authors of the document (recording the global index of this author), followed by sampling a topic from the topic distribution of that author:

(11) |
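The generative step of the ATM described above can be sketched as follows. This is a minimal illustration in which the author-topic and topic-word distributions are assumed to be given as lists of probabilities; the function name and argument names are hypothetical.

```python
import random

def generate_atm_document(author_ids, author_topic, topic_word, n_words, rng):
    """Generate (author, topic, word) index triples for one document under the ATM."""
    triples = []
    for _ in range(n_words):
        author = rng.choice(author_ids)  # uniform over the document's known authors
        topic_probs = author_topic[author]
        topic = rng.choices(range(len(topic_probs)), weights=topic_probs)[0]
        word_probs = topic_word[topic]
        word = rng.choices(range(len(word_probs)), weights=word_probs)[0]
        triples.append((author, topic, word))
    return triples

# Toy parameters: two authors, two topics, three vocabulary words.
author_topic = [[1.0, 0.0], [0.0, 1.0]]
topic_word = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
doc = generate_atm_document([0, 1], author_topic, topic_word, n_words=30,
                            rng=random.Random(0))
```

The author set of each document and the number of topics are fixed inputs here, which are exactly the two restrictions that a non-parametric treatment aims to remove.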

The ATM however cannot handle the more realistic scenario of non-parametric modeling, where the number of topics is not fixed in advance and the author set for each document is not fully observed. Such an application calls for multi-level non-parametric admixture modeling, a previously unexplored problem. Motivated by this, we propose the nested Hierarchical Dirichlet Process (nHDP) for multi-level non-parametric admixture modeling.

## 4 Nested Hierarchical Dirichlet Processes

In this section, we introduce the Nested Hierarchical Dirichlet Processes. For this, we first introduce 2-nHDP i.e. the two level nested HDP for non-parametric modeling of entities and topics and then generalize this to L-nHDP for any given number of L levels.

### 4.1 Two-level Nonparametric Admixture Model

Recall that in [M. Rosen-Zvi and Smyth. (2004)], the authors approach the problem of modeling the topics and entities for the application of author-topic modeling by taking a two level approach. Our aim is to build a two-level admixture model, the 2-nHDP, for a non-parametric treatment of this problem. Before this, we first present a simpler intermediate model which we call DP-HDP, an extension of the nDP, for ungrouped data, where the words are not grouped into documents, leading to a mixture of admixtures model. (This can also be interpreted as a use case for single document analysis instead of a collection of documents.) We then extend it to grouped data (multiple documents) to build the 2-nHDP, which models non-parametric admixtures of admixtures. We generalize this to the (L+1)-nHDP in Section 4.2.

DP-HDP for Ungrouped Data: Consider an entity-topic modeling scenario where the observed data i.e. set of words is not grouped as documents. One could conceive performing such two-level modeling for such data with the nDP. In nDP, entities are of equation 9 with as the topic variables drawn from a base distribution . However, the nDP is unsuitable for such analysis, since the entities drawn from a DP, with a continuous base distribution , do not share topic atoms. This can be modified by first creating a set of entities such that they share topics. One way to do this is to follow the HDP construction for entities:

(12) |

This can be followed by drawing the entity for each word from a mixture over the s:

(13) |

This may be interpreted as creating a countable set of entities by defining topic preferences (distributions over topics) for each of them, and then defining a ‘global popularity’ of the entities. Using Eqn. 3, we observe that . Observe the relationship with the nDP (Eqn. 10). Like nDP, this also defines a distribution over the space of distributions on distributions. But, instead of a DP base distribution for the outer DP, we have achieved sharing of topics using a HDP base distribution. We will write .

Note that multiple words can choose the same entity. As before, entity can now be used as prior for sampling topics, say for individual words which chose that entity, using

(14) |

We will call this the DP-HDP mixture model. Note that one can also alternatively use this model for grouped data where each group or document is associated with a single entity and each word in the document chooses topic as per the entity distribution over topics.

2-nHDP for Grouped Data: We extend the DP-HDP model presented above to the setting of grouped data, since most applications involve multiple documents, e.g. news articles, scientific literature, images, etc. In the single entity model, since a document is associated with one entity, a single entity is sampled for all the words in the document. Now, in the case of multiple entities per document, we first sample an entity for each word in the document, and then sample a topic according to the entity specific distribution over topics.

As in the previous model, we first create a set of entities as distributions over a common set of topics () by drawing independently from an HDP (Equation 12), and then create a global mixture over these entities (Equation 13).

Earlier in the absence of groupings, this global popularity was used to sample entities for all the words. Now, for each document , we define a local popularity of entities, derived from their global popularity :

(15) |

Now, sampling each factor in group is preceded by choosing an entity by sampling according to local entity popularity . Note that .

Note that the above equation 15 is similar to the stick breaking definition of HDP in Equation 6. We can see that is drawn from a HDP with the base distribution over atoms instead of topics . This distribution over is again an HDP. Therefore, we can write:

(16) |

We refer to the two HDPs as the inner and outer HDPs and hence call this model the 2-nHDP. We can write . Similar to the nDP and the DP-HDP (Eqn. 13), this again defines a distribution over the space of distributions over distributions. The 2-nHDP mixture model is completely defined by subsequently sampling , followed by .

An alternative characterization of the 2-nHDP mixture model is using the topic index and entity index corresponding to :

(17) |

This may be understood as first creating entity-specific distributions over topics using the global topic popularities, followed by the creation of document-specific distributions over entities using the global entity popularities. Using these parameters, the content of each document is generated by repeatedly sampling an entity index from the document-specific distribution over entities, a topic index from the distribution over topics of the chosen entity, and finally a word from the chosen topic.
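The generative reading of the 2-nHDP given above can be sketched with a finite truncation at both levels. The following is an illustrative approximation under stated assumptions, not the model's exact definition. The numbers of topics and entities are truncated to fixed values. Each DP over a truncated weight vector is replaced by the corresponding Dirichlet distribution. Topics are taken to be symmetric-Dirichlet distributions over a vocabulary. All names, including the hyperparameters `gamma0`, `alpha0`, `gamma1`, `alpha1` and `eta`, are hypothetical.

```python
import random

def gem_weights(concentration, truncation, rng):
    """Truncated stick-breaking weights; the leftover mass goes to the last atom."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        fraction = rng.betavariate(1.0, concentration)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)
    return weights

def dirichlet(params, rng):
    """Dirichlet draw via normalized Gamma variates (tiny floor keeps parameters positive)."""
    draws = [rng.gammavariate(max(a, 1e-12), 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]

def draw_index(probs, rng):
    return rng.choices(range(len(probs)), weights=probs)[0]

def generate_two_level_corpus(n_docs, n_words, n_topics, n_entities, vocab_size,
                              gamma0, alpha0, gamma1, alpha1, eta, rng):
    """Return documents as lists of (entity, topic, word) index triples."""
    topic_word = [dirichlet([eta] * vocab_size, rng) for _ in range(n_topics)]
    topic_popularity = gem_weights(gamma0, n_topics, rng)       # global topic weights
    entity_topic = [dirichlet([alpha0 * b for b in topic_popularity], rng)
                    for _ in range(n_entities)]                 # inner level: entities
    entity_popularity = gem_weights(gamma1, n_entities, rng)    # global entity weights
    corpus = []
    for _ in range(n_docs):
        doc_entity = dirichlet([alpha1 * b for b in entity_popularity], rng)  # outer level
        doc = []
        for _ in range(n_words):
            e = draw_index(doc_entity, rng)       # entity for this word
            z = draw_index(entity_topic[e], rng)  # topic from that entity
            w = draw_index(topic_word[z], rng)    # word from that topic
            doc.append((e, z, w))
        corpus.append(doc)
    return corpus

corpus = generate_two_level_corpus(n_docs=3, n_words=20, n_topics=8, n_entities=5,
                                   vocab_size=12, gamma0=2.0, alpha0=3.0, gamma1=2.0,
                                   alpha1=3.0, eta=0.5, rng=random.Random(0))
```

Both levels share atoms in this sketch: all entities place mass on the same truncated set of topics, and all documents place mass on the same truncated set of entities.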

Observe the connection with the ATM in Eqn. 11. The main difference is that the sets of entities and topics are infinite. Separately, each document now has a distinct non-uniform distribution over entities.

Also, observe that we have preserved the HDP notation to the extent possible, to facilitate understanding. To distinguish between variables corresponding to the two HDP levels in this model, we use the superscript for symbols corresponding to the inner HDP, which models entities as distributions over topics, and superscript for symbols corresponding to the outer HDP, which models documents as distributions over entities. Going forward, we follow the same convention for naming variables in the multi-level HDP with multiple levels of nesting.

### 4.2 Multi-level Non-parametric Admixture modeling

We now present the (L+1)-nHDP, a generalization of the 2-nHDP proposed in Section 4.1, which can be used for multi-level non-parametric admixture modeling.

The 2-nHDP was constructed by first creating a set of entities, drawing each of these distributions from an inner HDP with base distribution . This is followed by drawing document-specific distributions at the outermost level from the outer HDP, whose base distribution is the inner HDP. To extend this to multiple levels, at each level we draw group-level distributions from an HDP whose base distribution is the HDP of the previous level.

Let denote the number of levels of nesting, indexed by . Throughout the rest of this section, the superscript of a random variable denotes its level. The nested HDP comprises multiple levels of HDPs, where the base distribution of the HDP at level is the HDP at level . The innermost level is 0 while the outermost level is . The groups in the outermost level correspond to documents in the case of entity topic modeling. At the innermost level 0, we have an HDP with a base distribution from which the innermost-level entities are drawn. In the case of entity topic modeling, these innermost-level entities are topics, which are modeled as distributions over words.

At level 0, the innermost level, we draw level-1 entities from an HDP with base distribution . This step corresponds to Equation 12 of the 2-nHDP and constitutes a non-parametric admixture over atoms drawn from . Note that in the case of two-level models, we had termed as entities. In this multi-level model, we term these entities level-1 entities, and topics can be considered level-0 entities. Hence, at level 0, we have

(18) | |||

We denote the HDP distribution itself at level by , which subsequently becomes the base distribution for next level HDPs. At any level , becomes the base distribution of the level HDP, while the group level distributions at the previous level, , become the atoms for the group level distributions that we construct at the level,

(19) | |||

For the HDP at the outermost level , the base distribution is , the HDP from the previous level. At this level we have a set of M groups, which correspond to the documents in the case of document modeling. While it is possible to develop a multi-level admixture model where the number of groups is unobserved at every level, in this paper we assume the number of groups at the outermost level to be an observed quantity, in line with the document modeling use case. Hence, at level , we have

(20) | |||

Each observed data item that resides with one of the outermost groups is now associated with an entity (group-level distribution) from each HDP level , which itself is a distribution over entities drawn from the previous-level HDP. Hence we generate the data as follows. First generate from the group-level distribution at the outermost group . For any level , we select . Note that thus sampled is equal to one of the variables, (which are themselves distributions over atoms drawn from the previous-level HDP). is equal to one of at the innermost level zero. Finally, data items are generated as .
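The chain of selections described above, from the outermost group down to a level-0 entity and then to a data item, can be sketched as follows. This is a simplified illustration under explicit assumptions: every level is truncated to a fixed number of entities, and the group-level distributions are arbitrary random finite distributions rather than HDP draws, so the sharing induced by the nested HDPs is not modeled. The names `build_levels` and `sample_item` are hypothetical.

```python
import random

def random_distribution(size, rng):
    """An arbitrary finite distribution (normalized exponential draws)."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def build_levels(level_sizes, n_docs, rng):
    """level_sizes[l] is the (truncated) number of level-l entities, l = 0..L.
    dists[l] holds the group-level distributions at level l, each over the
    level-l entities. Groups at level l < L are the level-(l+1) entities;
    groups at level L are the documents."""
    n_levels = len(level_sizes)
    dists = []
    for level in range(n_levels):
        n_groups = n_docs if level == n_levels - 1 else level_sizes[level + 1]
        dists.append([random_distribution(level_sizes[level], rng)
                      for _ in range(n_groups)])
    return dists

def sample_item(doc, dists, topics, rng):
    """Walk from the outermost level to level 0, then emit a word index."""
    group, path = doc, []
    for level in range(len(dists) - 1, -1, -1):
        weights = dists[level][group]
        group = rng.choices(range(len(weights)), weights=weights)[0]
        path.append(group)  # index of the level-`level` entity selected
    word = rng.choices(range(len(topics[group])), weights=topics[group])[0]
    return path, word

rng = random.Random(0)
sizes = [5, 4, 3]  # 5 topics, 4 level-1 entities, 3 level-2 entities
dists = build_levels(sizes, n_docs=2, rng=rng)
topics = [random_distribution(10, rng) for _ in range(sizes[0])]
path, word = sample_item(0, dists, topics, rng)
```

Here `path` lists the selected entity index at each level from outermost to innermost, so its last element is the topic that generates `word`.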

Similar to the 2-nHDP, (L+1)-nHDP can be defined using the index of the atom at each level corresponding to data item as follows.

(21) |

### 4.3 Nested Chinese Restaurant Franchise

In this section, we derive the predictive distribution for the next draw at various levels from the nHDP given previous draws, after integrating out the various group level distributions and at each level. We also provide a restaurant analogy for the nHDP in terms of multiple levels of nested CRFs, corresponding to the multiple levels of HDP. This will be useful for the inference algorithm that we describe in Section 5.

We start with the outermost level L. Let denote the sequence of draws from , and denote the sequence of draws from . Then the conditional distribution of given all previous draws after integrating out looks as follows:

(22) |

where , . Next, we integrate out , which is also distributed according to a Dirichlet process:

(23) |

Note that here refers to the number of unique atoms already drawn from the base HDP of . Observe that each variable gets assigned to one of the variables, from which is drawn (recall ). Hence, the predictive distribution for , given is obtained by integrating out the corresponding group-level distribution . Similarly, for any general level , given that , , is drawn by integrating out the group-level distribution . For level , let denote the sequence of previous draws from . Then,

(24) |

where , the number of times component was picked. As is also distributed according to a Dirichlet Process, we can integrate it out similarly and write the conditional distribution of as follows with , and is the previous level HDP :

(25) |

At level 0, the predictive distribution for , given can be obtained by integrating out replacing with 0 in equation 24. Similarly, the predictive distribution for , draws from , can be obtained by integrating out as follows.

(26) |

At this level, each is assigned to one of the that are drawn from , the base distribution of the nHDP. Given the that corresponds to , the observed data is generated as . Note that each of the conditional distributions for and is similar to those for the CRF (Eqns. 7 and 8). We interpret these distributions as a nested Chinese Restaurant Franchise (nCRF), involving CRFs with multiple levels of nesting.
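For reference, the two CRF-style predictive rules that recur at every level can be written as a short numeric sketch. It assumes the standard Chinese Restaurant Franchise form: an existing table is chosen in proportion to its customer count and a new table in proportion to the group-level concentration, while a new table picks an existing dish in proportion to the number of tables serving it and a new dish in proportion to the base-level concentration. The function names and the symbols `alpha` and `gamma` are assumptions made for illustration.

```python
def crp_predictive(counts, concentration):
    """Probabilities of choosing each existing item, with the last entry
    being the probability of creating a new item."""
    total = sum(counts) + concentration
    return [c / total for c in counts] + [concentration / total]

def dish_predictive(table_counts, table_dishes, menu_counts, alpha, gamma):
    """Marginal probability of each dish for the next customer in one
    restaurant, after summing over the table choice.

    table_counts[t]: customers at table t of this restaurant
    table_dishes[t]: dish index served at table t
    menu_counts[k]:  number of tables (across restaurants) serving dish k
    The last entry of the result is the probability of a brand-new dish."""
    table_probs = crp_predictive(table_counts, alpha)
    menu_probs = crp_predictive(menu_counts, gamma)
    probs = [0.0] * (len(menu_counts) + 1)
    for t, dish in enumerate(table_dishes):
        probs[dish] += table_probs[t]          # sit at an existing table
    for k, p in enumerate(menu_probs):
        probs[k] += table_probs[-1] * p        # open a new table, then pick a dish
    return probs

probs = dish_predictive([3, 1], [0, 1], [2, 1], alpha=1.0, gamma=1.0)
```

In the nested setting the same computation would be applied at each level, with the dish chosen at one level identifying the restaurant entered at the level below.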

We now describe in detail the restaurant analogy for the nested Chinese Restaurant Franchise. The nCRF comprises multiple levels of CRFs. At each level , there exist multiple restaurants , each containing a countably infinite number of tables. Each table in restaurant of level is associated with a dish from the global menu of dishes specific to that level. is the distribution over the dishes in the global menu at level , modeling the global popularity of the dishes.

Imagine a customer on a culinary vacation. We trace the journey of this customer to show the process of generating , the word in the document , through the dishes he selects at restaurants at various levels. The customer first enters the restaurant in the outermost level as the customer and chooses a table with index , based on the popularity of the table governed by . Each table in this level restaurant is associated with a dish from a global menu at level L. Each of these dishes has a one-to-one correspondence with a unique restaurant at level , leading to nesting between CRF levels. We use the variable to denote the level dish thus chosen by the customer through his table selection, to denote the index of the dish within the global menu, and to denote the level restaurant corresponding to the dish chosen. The customer now enters the restaurant at level and repeats this process by selecting a table based on the distribution .

At any intermediate level , the customer enters the restaurant , governed by the dish chosen at the previous level. He then selects a table . Each table in this restaurant has a dish from the global level menu governing the dish chosen by the customer. Each dish in the global menu corresponds to a unique restaurant in the previous level. This process continues until, at level 0, the customer enters restaurant governed by the dish selected in level . The customer then chooses a table which is associated with a dish , say for some . The word is generated from the corresponding innermost-level dish (topic) .
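The customer's journey can be simulated directly from seating counts. The sketch below is a minimal illustration of the prior process only (no likelihood and no inference), assuming a single table-level concentration `alpha` and a single menu-level concentration `gamma` shared across all levels, whereas the model allows these to differ per level. The class `NestedCRF` and its method names are hypothetical.

```python
import random
from collections import defaultdict

def crp_choice(counts, concentration, rng):
    """Pick an existing index in proportion to its count,
    or len(counts) to signal a new item."""
    weights = list(counts) + [concentration]
    return rng.choices(range(len(weights)), weights=weights)[0]

class NestedCRF:
    def __init__(self, n_levels, alpha=1.0, gamma=1.0, seed=0):
        self.n_levels = n_levels  # levels 0..L, so n_levels = L + 1
        self.alpha, self.gamma = alpha, gamma
        self.rng = random.Random(seed)
        # Per level: restaurant id -> list of [customer_count, dish_index].
        self.tables = [defaultdict(list) for _ in range(n_levels)]
        # Per level: number of tables serving each dish of the global menu.
        self.menu = [[] for _ in range(n_levels)]

    def seat(self, document):
        """Seat one customer (word) of `document`; return the dish index
        chosen at each level, from the outermost level down to level 0."""
        restaurant, path = document, []
        for level in range(self.n_levels - 1, -1, -1):
            tables = self.tables[level][restaurant]
            t = crp_choice([count for count, _ in tables], self.alpha, self.rng)
            if t == len(tables):  # new table: order a dish from the level menu
                menu = self.menu[level]
                k = crp_choice(menu, self.gamma, self.rng)
                if k == len(menu):  # brand-new dish at this level
                    menu.append(0)
                menu[k] += 1
                tables.append([0, k])
            tables[t][0] += 1
            # The dish chosen here names the restaurant at the next level down.
            restaurant = tables[t][1]
            path.append(restaurant)
        return path

crf = NestedCRF(n_levels=3)
paths = [crf.seat(doc) for doc in range(4) for _ in range(25)]
```

The last element of each path is the level-0 dish, that is, the topic from which the corresponding word would be generated.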

### 4.4 Variations of multi-level nHDP

Recall that at any given level of the (L+1)-nHDP, the HDP distribution of the previous level becomes the base distribution of the level HDP, while the group-level distributions at the previous level, , become the atoms for the group-level distributions at the level, . This leads to multi-level admixture modeling where each entity at level models a distribution over entities at level . However, one can also consider a variation where entities at a given level are associated with a single entity at the previous level, leading to a mixture instead of an admixture at this specific level. In other words, we replace a given level HDP with a DP to associate a single level- entity with the group at the next level. This leads to a multi-level model with admixtures at some levels and mixtures at other levels. We note that the DP-HDP model (for grouped data) that associates a single entity with each document (Section 4.1) is an instance of such a variation. While these variations open avenues for investigating a new set of modeling techniques, we restrict our work to multi-level admixture modeling. Inference in these models should be an extension of that for our admixture model (see Section 5).

Notation | Description of Notation |
---|---|

level index indicated in a superscript | |

Restaurant index | |

Document Index (Used instead of as index of observed group/restaurant at outermost level ) | |

Word (customer) Index within document | |

Dish index in various contexts | |

Number of dishes in the global menu at level | |

Number of tables in restaurant of level | |

Index of the outermost level () | |

word observed in document | |

Table index assigned to word of document for level | |

Dish index assigned to table of restaurant at level | |

Dish index at level assigned to word of document | |

Restaurant index at level (also level dish index) for word of document | |

dish in the global menu at level | |

Dish assigned to table in restaurant at level | |

Dish assigned to word of document at level | |

Number of customers at table in restaurant in level | |

Number of tables in restaurant in level that got assigned dish | |

Base distribution of nHDP | |

Base distribution of the HDP at level : | |

Base distribution at level for group level DP at level | |

Group level distribution at level | |

Concentration parameter of the group level DP at level | |

Concentration parameter of the base DP at level |
