A hierarchical structure has multiple layers, and each layer contains a number of nodes that are linked to the nodes in the higher and lower layers, as illustrated in Figure 1. This kind of structure is pervasive and has been adopted in many sub-fields of artificial intelligence. One example is found in text mining. Consider all the papers in a scientific journal (e.g., Artificial Intelligence). An author-paper-word atm3 hierarchical structure emerges, given that each author writes and publishes a number of scientific papers in this journal, and each paper is composed of several different words. Learning from the author-paper-word structure is useful for collaborator recommendation, author disambiguation, paper clustering, statistical machine translation Xiong201654 , and so on. Another example occurs in image processing. The scene-image-feature hierarchical structure is formed because each image may belong to several scenes, such as beach or urban Boutell20041757 , and an image is also described by an abundance of features, such as grayscale and texture. Learning from the scene-image-feature structure could at least benefit image search and context-sensitive image enhancement.
Current state-of-the-art Bayesian approaches to learning from this hierarchical structure are mainly based on topic models blei2003latent ; 7015568 , which are a kind of probabilistic graphical model Flach2001199 and were originally designed for modeling a two-level hierarchical structure: document-word. Their basic idea is to construct a Bayesian prior based on manipulations of probability distributions, e.g., Dirichlet and multinomial distributions Blei:2012:PTM , to map documents and words into a latent topic space. For example, papers in the Artificial Intelligence Journal cover multiple research topics, such as machine learning, intelligent robotics, case-based reasoning, and knowledge representation. Each paper in this journal could be seen as a combination of these research topics, and each topic is described by a weighted word vector. Beyond the two-level hierarchical structure, some three-level hierarchical structures have also been successfully modelled by incorporating additional document side information, such as author-document-word atm3 , emotion-document-word emotion , entry-document-word entry and label-document-word labels .
A major issue in existing (parametric) topic model-based hierarchical structure modeling is that the number of hidden topics in the defined priors needs to be fixed in advance. This number is usually chosen with domain knowledge. After fixing the number of topics, Dirichlet, multinomial, and other fixed-dimensional distributions can be adopted as the building blocks for (parametric) topic models. However, discovering an appropriate number is very difficult and sometimes unrealistic in many real-world applications. For example, limiting any given corpus to a fixed, exact number of topics is unrealistic. Furthermore, a poorly chosen number may lead to overfitting, where there are too many topics so that relatively specific topics do not generalise well to unseen observations, or to underfitting, the opposite case, where there are too few topics so that unrelated observations are assigned to the same topic 6784083 . Ideally, this number should be inferred from the data, i.e., let the data speak. A number of methods can be used to select the number of topics, including cross-validation techniques griffiths2004finding , but they are not efficient because the algorithm has to be restarted a number of times before determining the optimal number of topics griffiths2004finding ; 6784083 .
One elegant approach to resolve the above issue is Bayesian nonparametric learning - a key approach for learning the number of mixtures in a mixture model (also called the model selection problem) Gershman20121 . The idea of Bayesian nonparametric learning is to use stochastic processes to replace the traditional fixed-dimensional probability distributions. The merit of these stochastic processes is that they have a theoretically infinite number of factors (we do not distinguish between factor and topic throughout this paper) and let the data determine the number of factors actually used. Many probabilistic models with fixed dimensions have been extended to infinite ones with the help of stochastic processes. One typical example is the famous Gaussian mixture model, which was extended into an infinite Gaussian mixture model rasmussen1999infinite using the Dirichlet process. As for hierarchical structure modeling, the hierarchical Dirichlet process (HDP) teh2006hierarchical is the most well known; it uses the relationship between a stochastic process and its base measure to capture the hierarchical structure in data (more details are given in the preliminary knowledge section). Due to its success, many extensions have been developed to account for different situations, such as a supervised version 6784083 for modeling additional labels and an incremental version 6137314 for streaming data.
However, this state-of-the-art HDP-based work can only model one special type of hierarchical structure, whereas there are actually two types, as shown in Figure 1, which are distinguished by the number of parent nodes for each node. In Type-I hierarchical structures, as illustrated in Figure 0(a), each node has one and only one parent node, which could be seen as a group and which is in turn assigned to higher-level groups. In Type-II hierarchical structures, as illustrated in Figure 0(b), each node may have more than one parent node. In this paper, we term this structure a cooperative hierarchical structure. Type-II is more general than Type-I, because Type-I can be seen as a special case of Type-II. Note that the renowned hierarchical Dirichlet process and its extensions (e.g., HDP-HMM Wulsin201455 , HDP-based hierarchical distance-dependent Chinese Restaurant process (hddCRP) DBLP:conf/uai/GhoshRSS14 , and HDP-based scene detection DBLP:conf/ijcai/MitraBB15 ) are all designed for Type-I hierarchical structures and fail to model Type-II hierarchical structures. Consider the former example of an author-paper-word structure. Using a Type-I hierarchical structure would, in this case, imply that each paper was written by only one author. The same applies to scene-image-feature structures. Although it is reasonable in some situations, the constraint of the Type-I hierarchical structure is too restrictive to model many real-world phenomena, so a new Bayesian nonparametric prior is needed for modeling Type-II hierarchical structures.
This paper proposes a Bayesian nonparametric model for cooperative hierarchical structures, based on the renowned hierarchical Dirichlet process (HDP), which we call the cooperative hierarchical Dirichlet process (CHDP). More specifically, it is built on two operations for random measures from the Dirichlet process: Inheritance, taken from the hierarchical Dirichlet process, and Cooperation, an innovation proposed in this paper to account for multiple parent nodes in Type-II hierarchical structures. We have designed two mechanisms for Cooperation: one is Superposition and the other is Maximization. Based on these operations, we propose the cooperative hierarchical Dirichlet process along with its two constructive representations. Although the proposed CHDP elegantly captures cooperative hierarchical structures, it also brings additional challenges to model inference. To resolve these challenges, we introduce two inference algorithms based on the two proposed representations. Experiments on synthetic and real-world tasks show the properties of the proposed CHDP and its usefulness in cooperative hierarchical structure modeling.
In summary, the two main contributions of this article are as follows:
we propose a cooperative hierarchical Dirichlet process based on operations on random measures (Inheritance, Cooperation: Superposition and Cooperation: Maximization), which can be used to model the cooperative hierarchical structures that cannot be modelled by existing Bayesian nonparametric models;
we propose two constructive representations (i.e., the international restaurant process and stick-breaking) and the corresponding inference algorithms for the cooperative hierarchical Dirichlet process to facilitate model inference, which address the challenges brought about by Inheritance, Cooperation: Superposition and Cooperation: Maximization between the random measures.
The remainder of this article is organized as follows. Section 2 discusses related work. The definitions and constructive representations of the DP and the HDP, which are the preliminary knowledge of the proposed model, are reviewed in Section 3. The CHDP and its two constructive representations are presented in Section 4 with two corresponding inference algorithms in Section 5. Section 6 evaluates the properties of CHDP and conducts comparative experiments on real-world tasks. Section 7 concludes this study and discusses possible future work.
2 Related work
This section reviews studies on hierarchical structures using Bayesian nonparametric models. We organize the existing work in this area into two groups: one group aims to learn a hierarchical structure from (plain) data; the other group aims to learn from data with a hierarchical structure. Although the two groups are similar, they are developed for different situations: the input of the first group is a plain dataset (e.g., a collection of documents or images) and the output is a hierarchical structure; the input of the second group is a hierarchical data structure and the output is a new hidden factor space. Our study in this paper is within the second group.
2.1 Learning out hierarchical structures using Bayesian nonparametrics
Hierarchical structures play an important role in machine learning because they are pervasively applied and reflect the human habit to organize information, so learning out a hierarchical structure from plain data attracts a lot of attention from researchers in the Bayesian nonparametric field. Compared to other efforts on this task, Bayesian nonparametric models have the advantage that the learned hierarchical structure is more flexible which means there is no bound of depth and/or width, making it easy to incorporate the newly arrived data.
nCRP-based. A tree is viewed as a nested sequence of partitions by the nested Chinese restaurant process (nCRP) griffiths2004hierarchical ; ncrp2010 , where a measurable space is first partitioned by a CRP blackwell1973ferguson and each area in this partition is further partitioned into several areas using CRP. In this way, a tree with infinite depth and branching can be generated. A datum (e.g., a document) is associated with a path in the tree using DP by nCRP ncrp2010 or a flexible Martingale steinhardt2012flexible prior, and it can associate with a subtree of the generated tree using the HDP teh2006hierarchical prior in the nested HDP nhdp2015 instead of a path.
Stick-breaking-based. It is known that the traditional stick-breaking process sethuraman1994constructive can infer an infinite set, and it has also been revised to infer an infinite tree structure. An iterative stick-breaking process is used to construct a Polya tree (PT) 10.2307/2242009 in a nested fashion, and a datum is associated with a leaf node of the generated tree. The traditional stick-breaking process is revised to generate breaks with a tree structure and results in tree structured stick-breaking (TSSB) ghahramani2010tree where a datum is attached to a node in the generated tree.
Diffusion-based. This kind of method holds the idea that data are generated by a diffusion procedure with several divergences during this procedure and additional time varying continuous stochastic processes (i.e., Markov process) are needed for divergence control. A datum is placed at the end of the branches of diffusions. Both Kingman’s coalescent teh2009bayesian ; KINGMAN1982235 ; teh2011modelling and the Dirichlet diffusion tree (DDT) neal2003density define a prior for an infinite (binary) tree. DDT is extended to a more general structure: multifurcating branches by the Pitman-Yor diffusion tree (PYDT) knowles2011pitman ; 6777276 and to feature hierarchy by the beta diffusion tree (BDT) heaukulani2014beta .
Motivated by the deep belief network (DBN) hinton2006fast , the Poisson gamma belief network (PGBN) zhou2015gamma was proposed to learn a hierarchical structure where nodes have nonnegative real-valued weights rather than the binary-valued weights of the DBN, and the width of each layer is flexible rather than fixed. Each layer node can be seen as an abstract feature expression of the input data.
To summarize, a variety of excellent work has been proposed in this direction, but this is beyond the scope of this work.
2.2 Learning from hierarchical structures using Bayesian nonparametrics
The most well-known and significant Bayesian nonparametric model for learning from hierarchical structures is the hierarchical Dirichlet process (HDP) teh2006hierarchical , which is based on layering DPs. Each node in the hierarchical structure is assigned a DP, and the relationship between nodes is modeled by the relation between a DP and its base measure. Due to its success, many extensions have been developed to account for different situations: supervised HDP 6784083 is proposed to incorporate additional label information of hierarchical structures; dynamic HDP dynamichdp ; Zhang:2010 is used to model the time-varying change of hierarchical structures; incremental HDP 6137314 is for streaming hierarchical structures; the tree extension of HDP Canini:2011 and the combination with the deep Boltzmann machine (DBM) salakhutdinov2009deep are used to learn different levels of abstract features 6389680 ; and the adapted HDP 6916915 can fuse multiple heterogeneous aspects.
A similar idea was adopted in the gamma-negative binomial process nbpcm ; 7373353 , beta-negative binomial process 6802382 , hierarchical beta process hbp and hierarchical Poisson models hpm . Different stochastic processes, e.g., beta, Gamma, Poisson and negative binomial processes, used in these models are piled to account for different kinds of data (i.e., binary or count data) in the hierarchical structure. Note that these models can also be used to learn out a hierarchical structure if the hidden layers are fixed in advance for plain data.
To summarize, current state-of-the-art research in this group is mostly based on the hierarchical idea originally designed in HDP, so they can only be applied to Type-I hierarchical structures, as discussed in the introduction.
3 Preliminary knowledge
The CHDP is built on two existing Bayesian nonparametric priors: the Dirichlet process (DP) and the hierarchical Dirichlet process (HDP). In this section, we review their definitions and constructive representations that have been used to understand and build the proposed CHDP in the following section. Some important notations used throughout this paper are summarized in Table 1.
|a measurable space|
|a random measure from DP|
|/||global random measure from DP at the first layer|
|/||a random measure from DP at the second layer|
|/||a random measure from DP at the third layer|
|-th random measure from DP at the -th layer|
|the number of random measures at -th layer|
|base measure of DP|
|the parameter of (when it is a Dirichlet distribution)|
|a random partition|
|a measurable set in a random partition|
|an index of a measurable set/partition/factor/topic/dish|
|the number of measurable sets in a partition/factors/topics/dishes|
|a chef/node at the second layer|
|number of chefs/nodes at the second layer|
|number of restaurants/nodes at the third layer|
|a table in a restaurant|
|the table number in restaurant|
|the table number in restaurant served by chef|
|the number of tables served by menu option of chef|
|the number of tables served by dish|
|a menu option on the personal menu|
|the number of menu options on the personal menu of chef|
|the number of menu options with dish name|
|the number of different words in a corpus|
|-th partition/factor/topic/dish of DP (one point in )|
|assigned factor to menu option of chef|
|assigned factor to table in restaurant|
|/||assigned factor to data/customer / in restaurant|
|concentration parameter of general DP|
|concentration parameter of global DP at first layer|
|concentration parameter of DPs at second layer|
|concentration parameter of DPs at third layer|
|-th stick break from beta distribution|
|the stick weight of -th atom/factor from general DP|
|-th stick break from beta distribution|
|the stick weight of -th atom/factor from global DP at first layer|
|-th stick break from beta distribution|
|the stick weight of -th atom/factor from DP at second layer|
|-th stick break from beta distribution|
|the stick weight of -th atom/factor from DP at third layer|
|the assigned index of factor/dish of a node at first layer for a option of|
|the assigned index of factor/option of a node at second layer for a table of|
|the assigned index of factor/table of a node at third layer for a data of|
|the number of data/customers in restaurant|
|the number of data/customers sitting at table of restaurant|
|the number of data/customers sitting at table of restaurant served by chef|
|the variational parameters for stick breaks at the top layer|
|the variational parameters for stick breaks at the second layer|
|the variational parameters for stick breaks at the third layer|
|the variational parameters for|
|the variational parameters for|
|the variational parameters for|
|the variational parameter for|
3.1 Dirichlet process
The Dirichlet process ferguson1973bayesian ; DP2010 is the pioneer and foundation of Bayesian nonparametric learning. Its definition is as follows: [Dirichlet Process] A Dirichlet process (DP) ferguson1973bayesian ; DP2010 , which is specified by a base measure $H$ on a measurable space $\Theta$ and a concentration parameter $\alpha$, is a set of countably infinite random variables that can be seen as the measures on measurable sets from a random infinite partition of $\Theta$. For any finite partition $(A_1, \dots, A_K)$ of $\Theta$, the variables (measures on these measurable sets) from the DP follow a Dirichlet distribution parameterized by the measures from the base measure,
$$\big(G(A_1), \dots, G(A_K)\big) \sim \mathrm{Dir}\big(\alpha H(A_1), \dots, \alpha H(A_K)\big),$$
where $G$ is a realization of $\mathrm{DP}(\alpha, H)$ and $\mathrm{Dir}(\cdot)$ denotes the Dirichlet distribution.
Since $G$ is a discrete measure with probability one ferguson1973bayesian , the mass will concentrate on one point (i.e., $\theta_k$, called a topic/a factor/an atom in this paper; we do not distinguish these terms throughout this paper) of $\Theta$, so an alternative definition of $G$ is
$$G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k}, \tag{1}$$
where $\{\theta_k\}_{k=1}^{\infty}$ denotes countably infinite points in the measurable space $\Theta$, which are sampled according to the base measure $H$; $\pi_k$ is the measure value from $G$ on a measurable set $A_k$ and can be seen as the (normalized) weight of $\theta_k$ in $G$; $\delta_{\theta_k}$ is a Dirac measure parameterized by $\theta_k$ (i.e., $\delta_{\theta_k}(\theta) = 1$ if $\theta = \theta_k$; 0, otherwise). One draw from $G$ would be one of $\{\theta_k\}$ according to their relative weights $\{\pi_k\}$.
Considering its infinite and discrete nature, the DP is commonly adopted as the prior for mixture models rasmussen1999infinite , such as:
$$G \sim \mathrm{DP}(\alpha, H), \qquad \phi_i \mid G \sim G, \qquad x_i \mid \phi_i \sim F(\phi_i), \tag{2}$$
where $x_i$ is a data point generated according to a distribution $F$ parameterized by a draw $\phi_i$ from $G$. Due to the discrete nature of $G$, we have $\phi_i \in \{\theta_k\}_{k=1}^{\infty}$, with the implication of data clustering according to their assigned $\theta_k$. For computational convenience, $F$ is normally set as a multinomial distribution because it is conjugate with the Dirichlet distribution. Document modeling is a successful application of this mixture model: $\theta_k$ is a $V$-dimensional (normalized) vector (named a topic), where $V$ is the number of different words in a text corpus.
In Bayesian posterior analysis of the DP, a representation of $G$ from a DP is needed. According to whether $G$ is represented explicitly or not, there are two kinds of constructive representations: the Chinese restaurant process (CRP) representation and the stick-breaking representation.
3.1.1 Chinese restaurant process (CRP) representation
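A marginal constructive representation is the Chinese restaurant process blackwell1973ferguson , which directly generates $\phi_i$ for the $i$-th data point (they are exchangeable) with $G$ marginalized out as follows:
$$\phi_i \mid \phi_1, \dots, \phi_{i-1}, \alpha, H \sim \sum_{k=1}^{K} \frac{n_k}{i-1+\alpha}\, \delta_{\theta_k} + \frac{\alpha}{i-1+\alpha}\, H, \tag{3}$$
where $n_k$ is the number of previous draws equal to $\theta_k$ and $K$ is the number of distinct values among them; $\frac{n_k}{i-1+\alpha}$ is the probability of taking the previous ones and $\frac{\alpha}{i-1+\alpha}$ is the probability of taking a new one according to $H$. Here, the weights $\{\pi_k\}$ in Eq. (1) are implicitly reflected by the ratio of $\theta_k$ in $\{\phi_i\}$.
The name comes from a metaphor used to understand Eq. (3). In a Chinese restaurant, the $i$-th customer walks into this restaurant and chooses to sit at an occupied table $k$ with the probability $\frac{n_k}{i-1+\alpha}$ or a new table with the probability $\frac{\alpha}{i-1+\alpha}$. If the customer picks an occupied table, she eats the dish already on the table; if a new table is picked, she needs to order a new dish for the table from $H$. As a result, $\phi_i$ is the dish eaten by the $i$-th customer.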
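The seating rule of the metaphor can be simulated directly. The following is only a minimal sketch: the function name `sample_crp`, the parameter `alpha` (standing for the concentration parameter) and the fixed seed are illustrative assumptions, and tables are tracked by index only, without drawing the dishes themselves from the base measure.

```python
import random


def sample_crp(n_customers, alpha, seed=0):
    """Seat customers one at a time; return each customer's table and the table sizes."""
    rng = random.Random(seed)
    table_sizes = []   # table_sizes[k] = number of customers already at table k
    assignments = []   # assignments[i] = table chosen by the i-th customer
    for _ in range(n_customers):
        # An occupied table k has weight n_k; a new table has weight alpha.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)   # new table: a fresh dish would be drawn from the base measure
        else:
            table_sizes[k] += 1
        assignments.append(k)
    return assignments, table_sizes


assignments, table_sizes = sample_crp(n_customers=100, alpha=2.0)
print(len(table_sizes), "tables used by 100 customers")
```

The number of occupied tables is not fixed in advance; it grows with the number of customers, which is the property that lets the data determine the number of factors.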
3.1.2 Stick-breaking representation
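Another explicit way (named stick-breaking) to construct $G$ is proposed in sethuraman1994constructive as follows
$$\pi'_k \sim \mathrm{Beta}(1, \alpha), \qquad \pi_k = \pi'_k \prod_{l=1}^{k-1} (1 - \pi'_l), \qquad \theta_k \sim H, \qquad G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k},$$
where $\mathrm{Beta}(\cdot)$ denotes a Beta distribution and $\pi'_k$ is the $k$-th random break from a unit stick with Beta distribution parameterized by $1$ and $\alpha$. We can see that the weights in Eq. (1) can be explicitly represented by $\{\pi_k\}_{k=1}^{\infty}$.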
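A short sketch of the stick-breaking construction follows. Because an infinite sequence cannot be stored, the sketch truncates the stick after a fixed number of breaks and gives the leftover mass to the last atom; the truncation level, the function name and the parameter names are assumptions made for illustration and are not part of the construction itself.

```python
import random


def stick_breaking_weights(alpha, truncation, seed=0):
    """Return `truncation` weights obtained by repeatedly breaking a unit stick."""
    rng = random.Random(seed)
    weights = []
    remaining = 1.0                               # length of the stick not yet assigned
    for _ in range(truncation - 1):
        break_k = rng.betavariate(1.0, alpha)     # k-th break, Beta(1, alpha)
        weights.append(break_k * remaining)       # weight of the k-th atom
        remaining *= 1.0 - break_k
    weights.append(remaining)                     # truncation: last atom takes the remainder
    return weights


pi = stick_breaking_weights(alpha=2.0, truncation=20)
print(round(sum(pi), 6))
```

A smaller `alpha` tends to put most of the mass on the first few atoms, while a larger `alpha` spreads it over more atoms.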
3.2 Hierarchical Dirichlet processes
The hierarchical Dirichlet process teh2006hierarchical is built by piling a DP above another DP through an elegant method that can share the factors across the hierarchical structure. Its definition is as follows: [Hierarchical Dirichlet Process] A hierarchical Dirichlet process (HDP) teh2006hierarchical is a distribution over a set of random probability measures over $\Theta$. The process defines a set of random probability measures $\{G_d\}_{d=1}^{D}$ and a global random probability measure $G_0$. The global measure $G_0$ is distributed as a Dirichlet process parameterized by a concentration parameter $\gamma$ and a base (probability) measure $H$,
$$G_0 \sim \mathrm{DP}(\gamma, H).$$
Each random measure $G_d$ is conditionally independent from the others given $G_0$, and is also distributed as a Dirichlet process with the parameter $\alpha$ and a base probability measure $G_0$,
$$G_d \sim \mathrm{DP}(\alpha, G_0).$$
This definition actually defines an operation between two DPs, which will be discussed in more detail in the following section. It was originally designed to model group data. For example, there are $D$ documents (i.e., groups) and each $G_d$ could be adopted to model one document using the mixture idea in Eq. (2). Note that extending the above two-layer HDP to more layers is straightforward under this definition.
Analogous to DP, the representation for HDP is also required for model inference. There are two candidates: Chinese restaurant franchise representation and stick-breaking representation.
3.2.1 Chinese restaurant franchise (CRF) representation
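Similar to the CRP for DP, HDP has its own marginal representation with $G_0$ and $G_d$ marginalized out (named the Chinese Restaurant Franchise) as follows:
$$\begin{aligned} \psi_{d,t} \mid \{\psi_{d',t'}\}, \gamma, H &\sim \sum_{k=1}^{K} \frac{m_k}{m_{\cdot} + \gamma}\, \delta_{\theta_k} + \frac{\gamma}{m_{\cdot} + \gamma}\, H, \\ \phi_{d,n} \mid \phi_{d,1}, \dots, \phi_{d,n-1}, \alpha, G_0 &\sim \sum_{t=1}^{T_d} \frac{n_{d,t}}{n-1+\alpha}\, \delta_{\psi_{d,t}} + \frac{\alpha}{n-1+\alpha}\, G_0, \end{aligned} \tag{4}$$
where $m_k$ denotes the number of $\psi_{d,t}$ associated with $\theta_k$, $m_{\cdot} = \sum_k m_k$, and $n_{d,t}$ denotes the number of $\phi_{d,n}$ associated with $\psi_{d,t}$ in $G_d$. Note that although $G_0$ appears in the above representation, we do not need to represent it explicitly, as we can use the first line of Eq. (4) when we need to sample from $G_0$ in the second line of Eq. (4).
The metaphor for CRF in Eq. (4) is as follows teh2006hierarchical . There are $D$ Chinese restaurants with a shared menu. The $n$-th customer walks into the $d$-th restaurant and picks an occupied table $t$ at which to sit with the probability $\frac{n_{d,t}}{n-1+\alpha}$ or a new table with the probability $\frac{\alpha}{n-1+\alpha}$. If this customer picks an occupied table, she just eats the dish already on that table; if a new table is picked, she needs to order a new dish. The new dish is ordered from the menu according to its popularity. The probability that the ordered dish is the same as the one on other tables is $\frac{m_k}{m_{\cdot}+\gamma}$ and the probability that it is a new dish is $\frac{\gamma}{m_{\cdot}+\gamma}$, where $m_k$ is the number of tables with the same dish $\theta_k$. As a result, $\psi_{d,t}$ is the dish on table $t$ of restaurant $d$, and $\phi_{d,n}$ is the dish eaten by customer $n$ in restaurant $d$.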
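The franchise metaphor can be sketched as a two-level version of the single-restaurant seating rule. This is an illustrative simulation only: `alpha` and `gamma` stand for the restaurant-level and menu-level concentration parameters respectively, all names are assumptions, and dishes are tracked by index rather than drawn from a base measure.

```python
import random


def sample_crf(customers_per_restaurant, alpha, gamma, seed=0):
    """Simulate a Chinese restaurant franchise; dishes are shared across restaurants."""
    rng = random.Random(seed)
    dish_table_counts = []   # dish_table_counts[k] = tables (all restaurants) serving dish k
    restaurants = []
    for n_customers in customers_per_restaurant:
        table_sizes = []     # customers at each table of this restaurant
        table_dish = []      # dish index served at each table
        customer_dish = []   # dish eaten by each customer
        for _ in range(n_customers):
            # Occupied table t has weight n_dt; a new table has weight alpha.
            t = rng.choices(range(len(table_sizes) + 1), weights=table_sizes + [alpha])[0]
            if t == len(table_sizes):
                # New table: order dish k with weight m_k, or a new dish with weight gamma.
                k = rng.choices(range(len(dish_table_counts) + 1),
                                weights=dish_table_counts + [gamma])[0]
                if k == len(dish_table_counts):
                    dish_table_counts.append(0)
                dish_table_counts[k] += 1
                table_sizes.append(0)
                table_dish.append(k)
            table_sizes[t] += 1
            customer_dish.append(table_dish[t])
        restaurants.append({"table_sizes": table_sizes,
                            "table_dish": table_dish,
                            "customer_dish": customer_dish})
    return restaurants, dish_table_counts


restaurants, dish_table_counts = sample_crf([30, 40, 50], alpha=1.0, gamma=2.0)
print(len(dish_table_counts), "dishes shared by", len(restaurants), "restaurants")
```

Because every new table orders from the same popularity counts, a dish introduced in one restaurant can later be served in another, which is how factors are shared across groups.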
3.2.2 Stick-breaking representation
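As for the stick-breaking-based representation, there are two versions teh2006hierarchical ; sethuraman1994constructive for HDP. In this paper, we adopt Sethuraman's version sethuraman1994constructive ; wang2011online (with two layers) as follows:
$$\begin{aligned} \beta'_k &\sim \mathrm{Beta}(1, \gamma), & \beta_k &= \beta'_k \prod_{l=1}^{k-1} (1 - \beta'_l), & \theta_k &\sim H, & G_0 &= \sum_{k=1}^{\infty} \beta_k \delta_{\theta_k}, \\ \pi'_{d,t} &\sim \mathrm{Beta}(1, \alpha), & \pi_{d,t} &= \pi'_{d,t} \prod_{s=1}^{t-1} (1 - \pi'_{d,s}), & c_{d,t} &\sim \mathrm{Mult}(\beta), & G_d &= \sum_{t=1}^{\infty} \pi_{d,t} \delta_{\psi_{d,t}}, \quad \psi_{d,t} = \theta_{c_{d,t}}, \end{aligned}$$
where $c_{d,t}$ denotes an index to one of $\{\theta_k\}$. Sethuraman's version has an advantage in that the stick weights at different layers are decoupled, which makes the posterior inference easier. From this constructive representation, we can see the factor sharing property of HDP. The $G_d$ at the lower layer shares the factors of $G_0$ at the higher layer. Another interesting point is that the constructions of $\{\pi_{d,t}\}$ and $\{\beta_k\}$ are independent, and the only connections between $G_d$ and $G_0$ are the relationships between $\psi_{d,t}$ and $\theta_k$.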
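The decoupling of stick weights across layers can be seen in a truncated sketch of the two-layer construction. The truncation levels, the function names and the names `gamma` and `alpha` for the two concentration parameters are assumptions for illustration; atoms are represented by their indices only.

```python
import random


def stick_breaking(concentration, truncation, rng):
    """Truncated stick-breaking weights; the last atom takes the leftover mass."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        b = rng.betavariate(1.0, concentration)
        weights.append(b * remaining)
        remaining *= 1.0 - b
    weights.append(remaining)
    return weights


def sample_hdp_sticks(n_groups, gamma, alpha, top_truncation=10, group_truncation=5, seed=0):
    rng = random.Random(seed)
    top_weights = stick_breaking(gamma, top_truncation, rng)    # weights of the global measure
    groups = []
    for _ in range(n_groups):
        # Group-level sticks are built independently of the top-level sticks.
        local_weights = stick_breaking(alpha, group_truncation, rng)
        # Each group-level stick points at one top-level atom, chosen by the top-level weights.
        indices = rng.choices(range(top_truncation), weights=top_weights, k=group_truncation)
        atom_mass = [0.0] * top_truncation     # mass the group puts on each shared atom
        for w, c in zip(local_weights, indices):
            atom_mass[c] += w
        groups.append({"indices": indices, "atom_mass": atom_mass})
    return top_weights, groups


top_weights, groups = sample_hdp_sticks(n_groups=3, gamma=2.0, alpha=1.0)
print([round(m, 3) for m in groups[0]["atom_mass"]])
```

The only link between a group and the global measure in this sketch is the list of indices, mirroring the remark that the two sets of stick weights are constructed independently.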
4 Cooperative hierarchical Dirichlet processes
As discussed in the Introduction, there are two types of hierarchical structures. In this section, we formally define and model the second type: the cooperative hierarchical structure.
[Cooperative Hierarchical Structure] A cooperative hierarchical structure (CHS), as illustrated in Figure 0(b), is composed of nodes assigned to different layers. Each node in the structure may link to multiple parent nodes and child nodes.
A real-world example of CHS is author-paper-word data. This data has three layers of nodes: nodes in the first layer denote authors; nodes in the second layer denote papers; nodes in the third layer denote words. If an author writes a paper, there is a link between the two corresponding nodes; similarly, there is a link between a paper and a word if this paper contains this word.
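As a small illustration of the author-paper-word example, the links between adjacent layers can be stored as (parent, child) pairs, and the distinction between the two types of hierarchical structure reduces to counting parents. The toy data and function names below are hypothetical.

```python
from collections import defaultdict


def parents_of(links):
    """Map each child node to the set of its parent nodes, given (parent, child) links."""
    parents = defaultdict(set)
    for parent, child in links:
        parents[child].add(parent)
    return dict(parents)


def is_type_one(links):
    """True if every linked child has exactly one parent (a Type-I structure)."""
    return all(len(p) == 1 for p in parents_of(links).values())


# Hypothetical toy data: two authors, three papers, a few words.
author_paper = [("author_1", "paper_1"), ("author_2", "paper_1"),
                ("author_1", "paper_2"), ("author_2", "paper_3")]
paper_word = [("paper_1", "topic"), ("paper_1", "model"),
              ("paper_2", "model"), ("paper_3", "image")]

# paper_1 has two authors, so the author-paper layer is cooperative (Type-II).
print(is_type_one(author_paper))  # False
```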
Note that there is an implicit assumption of HDP in Definition 2 that each node can only have one parent node, so HDP fails to model CHS. To capture CHS, we first formally define three operations on random measures from DP as follows: [Inheritance] A probability measure $G'$ is the Inheritance from another probability measure $G$ from DP on space $\Theta$ by taking $G$ as its base measure,
$$G \sim \mathrm{DP}(\alpha, H), \qquad G' \sim \mathrm{DP}(\alpha', G),$$
where $\alpha$ and $\alpha'$ are DP parameters. The discrete nature of $G$ enables $G'$ to inherit factors/atoms from $G$.
Note that this operation is a more formal definition than the one in Definition 3.2.
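[Cooperation: Superposition] A measure $G$ is the Superposition of two probability measures, i.e., $G_1$ and $G_2$, from DP on the same space $\Theta$, if
$$G = w_1 G_1 \oplus w_2 G_2, \qquad w_1, w_2 \geq 0, \quad w_1 + w_2 = 1,$$
where $G$ is a new probability measure on space $\Theta$ and $\oplus$ denotes the convex combination. For any given partition $(A_1, \dots, A_K)$ on $\Theta$, it has
$$\big(G(A_1), \dots, G(A_K)\big) = w_1 \big(G_1(A_1), \dots, G_1(A_K)\big) + w_2 \big(G_2(A_1), \dots, G_2(A_K)\big).$$
Extending the Superposition to more than two probability measures is straightforward.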
[Cooperation: Maximization] A measure $G$ is the Maximization of two probability measures, i.e., $G_1$ and $G_2$, from DP on the same space $\Theta$, if
$$G = G_1 \vee G_2,$$
where $G$ is a new probability measure on the space $\Theta$ and $\vee$ is a Zadeh operator borrowed from fuzzy logic which denotes the maximization (here, $\vee$ is a little different from its original definition, because there will be normalization after taking the maximum). For any given partition $(A_1, \dots, A_K)$ on $\Theta$, it has
$$\big(G(A_1), \dots, G(A_K)\big) \propto \big(\max\{G_1(A_1), G_2(A_1)\}, \dots, \max\{G_1(A_K), G_2(A_K)\}\big).$$
Extending the Maximization to more than two probability measures is also straightforward.
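The difference between the two cooperation mechanisms is easiest to see on small discrete measures. The sketch below represents a measure as a dictionary from atom name to mass; it assumes that the mixing weights of the Superposition are supplied by the caller and that the Maximization is taken atom by atom and then normalized, which is one reading of the definitions above rather than the authors' implementation. All names and numbers are hypothetical.

```python
def superposition(measures, weights):
    """Convex combination of discrete measures given as {atom: mass} dictionaries."""
    combined = {}
    for measure, w in zip(measures, weights):
        for atom, mass in measure.items():
            combined[atom] = combined.get(atom, 0.0) + w * mass
    return combined


def maximization(measures):
    """Atom-wise maximum of discrete measures, renormalized to sum to one."""
    combined = {}
    for measure in measures:
        for atom, mass in measure.items():
            combined[atom] = max(combined.get(atom, 0.0), mass)
    total = sum(combined.values())
    return {atom: mass / total for atom, mass in combined.items()}


g1 = {"topic_a": 0.7, "topic_b": 0.3}
g2 = {"topic_b": 0.5, "topic_c": 0.5}

sup = superposition([g1, g2], [0.5, 0.5])
mx = maximization([g1, g2])
print({k: round(v, 3) for k, v in sup.items()})
print({k: round(v, 3) for k, v in mx.items()})
```

In this toy case, Superposition averages the two parents, so `topic_b` (shared by both) receives the largest mass, whereas Maximization keeps the strongest opinion about each atom, so `topic_a` stays dominant. This illustrates why the two mechanisms are not interchangeable.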
The defined Superposition and Maximization are two cooperation mechanisms between random measures, and they are not interchangeable. With the help of these two mechanisms, we can model the many-to-many relationship of CHS defined in Definition 4. Next, we define a new Bayesian nonparametric prior to model CHS as follows: [Cooperative Hierarchical Dirichlet Process] A cooperative hierarchical Dirichlet process (CHDP) is a distribution over a set of random probability measures (over $\Theta$) located at multiple layers. It defines:
Each layer $l$ has a number $N_l$ of random probability measures $\{G^{l}_{j}\}_{j=1}^{N_l}$, where $N_1 = 1$ for the first layer;
At the first layer ($l = 1$), a single global random probability measure $G^{1}_{1}$ is defined, which is distributed as a Dirichlet process parameterized by a concentration parameter $\alpha_1$ and a base probability measure $H$, i.e., $G^{1}_{1} \sim \mathrm{DP}(\alpha_1, H)$;
At each following layer $l > 1$, each probability measure $G^{l}_{j}$ at layer $l$ is the Inheritance from the cooperation of the probability measures at the upper layer $l-1$ which link to $G^{l}_{j}$,
$$G^{l}_{j} \sim \mathrm{DP}(\alpha_l, \bar{G}^{l}_{j}),$$
where $\alpha_l$ is the DP parameter at the layer $l$ and $\bar{G}^{l}_{j}$ is from Superposition in Definition 4,
$$\bar{G}^{l}_{j} = w_{i_1} G^{l-1}_{i_1} \oplus \cdots \oplus w_{i_m} G^{l-1}_{i_m},$$
or Maximization in Definition 4,
$$\bar{G}^{l}_{j} = G^{l-1}_{i_1} \vee \cdots \vee G^{l-1}_{i_m},$$
where each $G^{l-1}_{i}$ denotes a random measure at layer $l-1$ with a link to $G^{l}_{j}$ and $\{i_1, \dots, i_m\}$ are the indices of the linked measures at layer $l-1$.
The above CHDP defines a prior, and we should specify the data likelihood to complete the data generation process: a parameter $\phi_{d,n}$ is sampled from the random measure of node $d$ at the bottom layer and is used to generate the data $x_{d,n}$. $H$ is the base measure of the top-layer DP and defines the parameter space; it is normally set as a Dirichlet distribution for discrete data (e.g., documents). For example, when applied to author-document-word data, $\theta_k$ is named the $k$-th topic, $x_{d,n}$ is the $n$-th word of document $d$, and $H$ is a Dirichlet distribution on the $(V-1)$-simplex, where $V$ is the vocabulary size.
Comparing Definitions 3.2 and 2, we can draw the conclusion that HDP can be seen as a special case of CHDP with each child node/probability measure having only one parent node/probability measure. If the cooperative/Type-II hierarchical structure degenerates into a Type-I hierarchical structure, the CHDP will degenerate into an HDP as well.
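To make the generative definition concrete, the following sketch draws the weights of a three-layer author-document CHDP on a finite, shared set of atoms. It relies on several assumptions that are not part of the definition: the global measure is truncated to a small number of atoms; Inheritance is then realised as a Dirichlet draw, which follows from the DP definition when the base measure is supported on finitely many atoms; mixing weights in the Superposition are taken to be uniform; and all function names, parameter names and the toy link structure are hypothetical.

```python
import random


def stick_breaking(concentration, truncation, rng):
    """Truncated stick-breaking weights; the last atom takes the leftover mass."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        b = rng.betavariate(1.0, concentration)
        weights.append(b * remaining)
        remaining *= 1.0 - b
    weights.append(remaining)
    return weights


def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def inherit(concentration, base, rng):
    """Inheritance on a finite atom set: a Dirichlet(concentration * base) draw."""
    draws = []
    for b in base:
        shape = concentration * b
        draws.append(rng.gammavariate(shape, 1.0) if shape > 0.0 else 0.0)
    if sum(draws) == 0.0:          # guard against numerical underflow
        return normalize(base)
    return normalize(draws)


def cooperate(parents, mode):
    """Combine parent weight vectors defined over the same atoms."""
    n_atoms = len(parents[0])
    if mode == "superposition":    # uniform mixing weights are an assumption
        return [sum(p[k] for p in parents) / len(parents) for k in range(n_atoms)]
    if mode == "maximization":     # atom-wise maximum followed by normalization
        return normalize([max(p[k] for p in parents) for k in range(n_atoms)])
    raise ValueError(f"unknown cooperation mode: {mode}")


def sample_chdp(doc_authors, n_authors, mode, truncation=8,
                concentrations=(2.0, 5.0, 5.0), seed=0):
    rng = random.Random(seed)
    c1, c2, c3 = concentrations
    global_w = stick_breaking(c1, truncation, rng)                     # first layer
    author_w = [inherit(c2, global_w, rng) for _ in range(n_authors)]  # second layer
    doc_w = []
    for authors in doc_authors:                                        # third layer
        base = cooperate([author_w[a] for a in authors], mode)
        doc_w.append(inherit(c3, base, rng))
    return global_w, author_w, doc_w


doc_authors = [[0], [0, 1], [1]]   # hypothetical: the second document has two authors
_, _, docs_sup = sample_chdp(doc_authors, n_authors=2, mode="superposition")
_, _, docs_max = sample_chdp(doc_authors, n_authors=2, mode="maximization")
print([round(w, 3) for w in docs_sup[1]])
```

When every document has a single author, `cooperate` returns the parent unchanged under either mode, so the sketch reduces to a three-layer HDP, in line with the special-case remark above.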
In Figures 1(a) and 1(b), we compare the graphical models of HDP and CHDP for a particular hierarchical structure: author-document-word, where this simple data includes three documents written by two authors and each document has a number of words. We also use colors to show how HDP and CHDP are used to model a hierarchical structure. It can be seen that the random measures at the author and document layers of the HDP in Figure 1(a) have a one-to-many relationship, whereas Figure 1(b) (or CHDP) shows a many-to-many relationship. The ability of CHDP to model this many-to-many relationship is due to the designed cooperation. Therefore, CHDP is more powerful than HDP for more general hierarchical structure modeling. Note that the many-to-many relationship between the documents and words is modeled by both HDP and CHDP through the mixture likelihood.
Two similar studies have been published on the convex combination of DPs. Lin and Fisher lin2012coupling proposed to use the convex combination of a finite number of DPs at a high layer as a new measure for the low layer, and Chen chen2013dependent further extended this idea to all normalized random measures with DP as a special case. We want to highlight that although the idea of Cooperation: Superposition in this paper is similar to their work, they are different. The idea in lin2012coupling ; chen2013dependent is to directly use the new measure as the measure of the nodes at a lower layer, so the difference between two new measures relies only on the different mixing weights. For example, two lower-layer measures $G_j$ and $G_{j'}$ with the same parents are different only if their mixing weights $w_j$ are different from $w_{j'}$. However, in our CHDP, we use this convexly combined measure as the base measure of a new DP, which introduces additional flexibility (controlled by the concentration parameter $\alpha$ of this DP) beyond the mixing weights. For example, $G_j$ and $G_{j'}$ may be different even though $w_j$ and $w_{j'}$ are the same. When modeling hierarchical structures, it is usually assumed that the whole structure is given, and sometimes the mixing weights of the nodes may also be observed. In the situation where the mixing weights are known, CHDP shows more model flexibility than the deterministic method in lin2012coupling ; chen2013dependent . Note that we assume the mixing weights are given in this paper; it would be straightforward to model these mixing weights in CHDP by simply adding a Dirichlet prior to them. As for Cooperation: Maximization, we found no similar research in the literature.
Next, we introduce two constructive representations for CHDP: international restaurant process representation (marginal one) and stick-breaking representation (explicit one).
4.1 International restaurant process (IRP) representation
The marginal representation of CHDP with , , and marginalized out (named the international restaurant process) is as follows
where denotes the number of associated with in ; denotes the number of associated with ; and denotes the number of associated with . is the cooperation between the parent random measures of . If Superposition is adopted, then
If Maximization is adopted, then
where and are authors linked to . This completes the marginal representation.
Similar to the Chinese restaurant process of DP outlined in Section 3.1.1 and the Chinese restaurant franchise (CRF) of HDP in Section 3.2.1, a metaphor is introduced to ease the understanding of IRP. Since CHDP is based on a three-layer HDP, we describe the metaphor for the three-layer HDP first, and then introduce the one for CHDP. Note that the CRF in Section 3.2.1 covers only a two-layer HDP.
As shown in Figure 2(b), the metaphor for the three-layer HDP is as follows: there is a global menu with different dishes shared by all chefs from different countries (i.e., China, India, Italy, France). Each chef has a personal menu with dish names as menu options according to their preference and ability (note that menu options are not necessarily distinct: different options could, in fact, be the same dish). There are also several (national) restaurants . Each restaurant employs one (and only one) chef, but a chef can work in different restaurants at the same time. For example, a French restaurant hires a French chef, but this chef may also work in other French restaurants. In each restaurant, there are multiple tables , and each table is served with a dish cooked by the chef of this restaurant. When a customer walks into a restaurant , she sits at an occupied table with the probability or a new table with the probability . If an occupied table is selected, she simply eats the dish on this table; if the table is new, the customer needs to order a dish for this table from the personal menu of the chef. If option on the menu is selected with the probability , she eats it; if she is not satisfied by any of the current options on the menu with the probability , the chef has to add a new option to the menu from the global shared menu. If dish on the global menu is selected with the probability , she eats it; if none of the dishes on the global menu satisfies this customer with the probability , the chefs have to add a new dish to this global menu (while embarrassedly looking up a recipe book ).
As shown in Figure 2(a), the metaphor for the IRP is as follows: the background is almost the same as the one in the three-layer HDP, but each restaurant in IRP can employ a number of chefs from different countries, and a chef can work in different restaurants. For example, an international restaurant may have a Chinese chef, a French chef, an Italian chef, and an Indian chef (hence the name, international restaurant). When a customer walks into an international restaurant and needs to order a dish for an empty table, she can order from the menu of any chef working in this restaurant. If option on the menu of a chef is selected with the probability , she eats it; if she is not satisfied by the current options on the menu with the probability , she can ask this chef to add a new option to his menu from the globally shared menu.
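The seating procedure of the metaphor can be simulated with the following minimal sketch. It assumes Superposition with uniform mixing weights over the chefs of a restaurant, and the usual Chinese-restaurant rule that an existing choice is picked in proportion to its count and a new one in proportion to a concentration parameter. The class name `IRP` and the parameters `alpha_table`, `alpha_menu`, and `gamma` are hypothetical names introduced here, not the paper's notation. A restaurant with a single chef recovers the three-layer HDP metaphor.

```python
import random

class IRP:
    """Toy simulator of the international restaurant metaphor (prior only)."""

    def __init__(self, chefs_of, alpha_table=1.0, alpha_menu=1.0, gamma=1.0, seed=0):
        self.chefs_of = chefs_of              # restaurant -> list of chefs
        self.alpha_table = alpha_table        # weight of opening a new table
        self.alpha_menu = alpha_menu          # weight of adding a menu option
        self.gamma = gamma                    # weight of adding a global dish
        self.rng = random.Random(seed)
        self.dish_counts = []                 # dish -> number of menu options naming it
        self.menus = {}                       # chef -> list of [dish, tables_served]
        self.tables = {r: [] for r in chefs_of}  # restaurant -> [chef, option, customers]

    def _pick(self, weights):
        """Sample an index with probability proportional to the weights."""
        return self.rng.choices(range(len(weights)), weights=weights)[0]

    def _new_option(self, chef):
        """The chef adds a menu option taken from (or added to) the global menu."""
        dish = self._pick(self.dish_counts + [self.gamma])
        if dish == len(self.dish_counts):
            self.dish_counts.append(0)        # brand-new dish on the global menu
        self.dish_counts[dish] += 1
        self.menus[chef].append([dish, 0])
        return len(self.menus[chef]) - 1

    def _new_table(self, restaurant):
        """Order for a new table from the menu of one of the restaurant's chefs."""
        chef = self.rng.choice(self.chefs_of[restaurant])  # uniform mixing weights
        menu = self.menus.setdefault(chef, [])
        option = self._pick([entry[1] for entry in menu] + [self.alpha_menu])
        if option == len(menu):
            option = self._new_option(chef)
        menu[option][1] += 1
        self.tables[restaurant].append([chef, option, 0])
        return len(self.tables[restaurant]) - 1

    def seat(self, restaurant):
        """Seat one customer and return the index of the dish she eats."""
        tables = self.tables[restaurant]
        table = self._pick([entry[2] for entry in tables] + [self.alpha_table])
        if table == len(tables):
            table = self._new_table(restaurant)
        chef, option, _ = tables[table]
        tables[table][2] += 1
        return self.menus[chef][option][0]

irp = IRP({"r1": ["chinese", "french"], "r2": ["french", "italian", "indian"]})
dishes = [irp.seat(r) for r in ["r1", "r2"] * 50]
```

Because the French chef works in both restaurants, dishes ordered from his menu are shared between `r1` and `r2`, which is the sharing mechanism the metaphor describes.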
4.2 Stick-breaking representation
Based on the stick-breaking process for HDP teh2006hierarchical , we develop the following stick-breaking representation for CHDP
where is from the cooperation of the parent random measures of :
If Superposition is used, then
If Maximization is used, then
and have links to . When applied to author-paper-word, is named the -th topic, is the -th word of a document , is the topic assignment of word , and is a Dirichlet distribution parameterized by .
Note that there is no one-to-one mapping between and . In fact, their relationship is . The same holds for and , whose relation is .
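A truncated version of the stick-breaking construction can be sketched as follows. The sketch makes several assumptions that go beyond the text: the sticks are truncated at a fixed number of topics, lower-layer weights over the shared topics are drawn as Dirichlet vectors (the finite-atom form of a DP), Superposition is taken as a convex combination of the parent weights, and Maximization is taken as their elementwise maximum followed by renormalization. All function names and numeric values are hypothetical.

```python
import numpy as np

def stick_breaking(concentration, truncation, rng):
    """Truncated stick-breaking weights: w_k = v_k * prod_{j<k} (1 - v_j)."""
    v = rng.beta(1.0, concentration, size=truncation)
    v[-1] = 1.0  # close the stick so the truncated weights sum to one
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

def superposition(parent_weights, mixing):
    """Convex combination of the parents' topic weights."""
    return np.asarray(mixing, dtype=float) @ np.asarray(parent_weights, dtype=float)

def maximization(parent_weights):
    """Elementwise maximum over parents, renormalized (an assumed reading)."""
    top = np.max(parent_weights, axis=0)
    return top / top.sum()

rng = np.random.default_rng(1)

# Global layer: weights over a truncated set of shared topics.
global_weights = stick_breaking(concentration=3.0, truncation=6, rng=rng)

# Author layer: two authors, each a Dirichlet draw centred on the global weights.
author_weights = rng.dirichlet(20.0 * global_weights, size=2)

# Cooperation between the two authors of one paper.
paper_base_sup = superposition(author_weights, mixing=[0.5, 0.5])
paper_base_max = maximization(author_weights)

# Paper layer: a fresh draw whose base measure is the cooperated measure.
paper_weights = rng.dirichlet(20.0 * paper_base_sup)
```

The cooperated weights act only as a base measure, so `paper_weights` can deviate from `paper_base_sup` even when the mixing weights are fixed.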
5 Model Inference
With the observed CHS, the final aim of inference is to obtain the posterior distribution of the latent variables in CHDP. Naturally, different representations of CHDP lead to different representations of the posterior distribution. Therefore, we develop a Markov chain Monte Carlo ref1 algorithm in Section 5.1, based on IRP, which approximates the target posterior distribution using samples, and a variational inference blei2016variational algorithm in Section 5.2, based on the stick-breaking representation, which approximates the target posterior distribution through optimization. The main difficulty facing both inference algorithms lies in the cooperation, i.e., superposition and maximization.
5.1 Gibbs sampler
In this section, we design a Markov chain Monte Carlo algorithm to obtain samples from the posterior distribution of CHDP based on the IRP representation. Since the difference between CHDP and the three-layer HDP, and hence the main difficulty, lies in sampling , we focus on its inference under the two kinds of cooperation: Superposition and Maximization.
Sampling for CHDP-Superposition. This should be sampled from , but is a superposition of a number of so it is different from the one in HDP and hard to marginalize out. The from superposition is,
where , the components of the left-hand side correspond to the observed dishes, and the remaining part accounts for the new dishes made by the chefs . Since Superposition is used, each component is a summation across all chefs. Note that the summation also eases the normalization because the summation of the left-hand side is simply .
Considering the above and IRP representation, can be seen as all the menu options of the chefs serving in restaurant , and the sampling of is only a selection procedure from these candidate menu options. Following this idea, we obtain the posterior distribution of as,
Another sampling method for CHDP-Superposition is to introduce an auxiliary variable for the sampling of which is given in Appendix 1.
Sampling for CHDP-Maximization. Similar to CHDP-Superposition, the difficulty also lies in the fact that the is a maximization of a number of here. The from maximization is,
Under the IRP representation, the sampling here can also be regarded as a menu-option selection procedure. Compared with CHDP-Superposition, the difference is that not all the menu options of the chefs serving in restaurant are candidates: CHDP-Maximization only takes as candidates the menu options from the chefs who are the best at these options. Finally, the posterior distribution of is,
where is the indicator function, which equals 1 if the condition is satisfied and 0 otherwise. Here, the indicator functions serve as the candidate filter. Note that the normalization is nontrivial for CHDP-Maximization because some options are removed from the candidate list, so the weights of each chef no longer sum to one.
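The candidate filtering described above can be sketched as follows. This is only an illustration of the selection step under stated assumptions: each chef's menu is given as hypothetical (dish, weight) pairs, a chef counts as the best at a dish when his total weight on that dish is the largest among the restaurant's chefs, and the likelihood term of the actual posterior is omitted. The function name and toy menus are not from the paper.

```python
from collections import defaultdict

def candidate_probabilities(menus, cooperation):
    """Return {(chef, option_index): probability} over candidate menu options.

    menus maps each chef of one restaurant to a list of (dish, weight) options.
    cooperation is either "superposition" or "maximization".
    """
    # Total weight each chef puts on each dish (a menu may repeat a dish).
    skill = {chef: defaultdict(float) for chef in menus}
    for chef, options in menus.items():
        for dish, weight in options:
            skill[chef][dish] += weight

    candidates = {}
    for chef, options in menus.items():
        for index, (dish, weight) in enumerate(options):
            best = max(skill[other][dish] for other in menus)
            # Indicator filter: under maximization an option survives only
            # if its chef is (one of) the best at the dish.
            if cooperation == "superposition" or skill[chef][dish] == best:
                candidates[(chef, index)] = weight

    # Explicit normalization over the surviving candidates.
    total = sum(candidates.values())
    return {key: value / total for key, value in candidates.items()}

menus = {
    "chinese": [("noodles", 0.6), ("pizza", 0.4)],
    "italian": [("pizza", 0.7), ("noodles", 0.3)],
}
sup = candidate_probabilities(menus, "superposition")
mx = candidate_probabilities(menus, "maximization")
```

In this toy example Superposition keeps all four options, whereas Maximization keeps only the Chinese chef's noodles and the Italian chef's pizza. The surviving weights sum to 1.3 rather than to the number of chefs, which is why the normalizer has to be computed explicitly.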
The posterior distributions of the remaining variables simply follow the three-layer HDP. Due to space limitations, we list them in Appendix 1. The entire inference procedure based on IRP is summarized in Algorithm 1.
5.2 Variational inference
Unlike the sampler designed in the previous section, which uses samples to approximate the posterior distribution of the latent variables, variational inference blei2016variational casts this distribution approximation problem as an optimization problem. While samplers have the advantage of being asymptotically exact, they are usually not efficient in practice when facing large-scale data. Optimization-based variational inference wang2011online is more tractable than samplers, with only a small loss in terms of theoretical accuracy. We therefore develop a variational inference algorithm for CHDP, described as follows, to handle large-scale data.
The core idea of variational inference is to propose a number of (normally independent) variational distributions of the latent variables with corresponding variational parameters, and to reduce the distance (usually the Kullback-Leibler (KL) divergence) between the true posterior distribution and these variational distributions by adjusting the values of the variational parameters. However, the infinite number of factors and their weights make the posterior inference of the stick weights even harder. One common workaround in nonparametric Bayesian learning is truncation. The truncation method fox2009bayesian ; willsky2009nonparametric , which uses a relatively large truncation level as the (potential) maximum number of topics, is widely accepted. For CHDP, we define the following variational distributions for the latent variables using the stick-breaking representation: