1 Introduction
Latent variable models (LVMs) (Bishop, 1998; Knott and Bartholomew, 1999; Blei, 2014) are a major workhorse in machine learning (ML) for extracting latent patterns underlying data, such as themes behind documents and motifs hidden in genome sequences. To properly capture these patterns, LVMs are equipped with a set of components, each of which aims to capture one pattern and is usually parametrized by a vector. For instance, in topic models (Blei et al., 2003), each component (referred to as a topic) is in charge of capturing one theme underlying documents and is represented by a multinomial vector.
While existing LVMs have demonstrated great success, they are less capable of addressing two new problems that have emerged due to the growing volume and complexity of data. First, it is often the case that the frequency of patterns follows a power-law distribution (Wang et al., 2014; Xie et al., 2015), where a handful of patterns occur very frequently whereas most patterns are of low frequency. Existing LVMs lack the capability to capture infrequent patterns, possibly because of the design of the objective function used for training. For example, a maximum likelihood estimator is rewarded for modeling the frequent patterns well, as they are the major contributors to the likelihood function. Infrequent patterns, on the other hand, contribute much less to the likelihood, so it is not very rewarding to model them well and LVMs tend to ignore them. Yet infrequent patterns often carry valuable information and should not be ignored. For instance, in a topic-modeling-based recommendation system, an infrequent topic (pattern) like losing weight is more likely to improve the click-through rate than a frequent topic like politics. Second, the number of components $K$ strikes a tradeoff between model size (complexity) and modeling power. For a small $K$, the model is not expressive enough to sufficiently capture the complex patterns behind data; for a large $K$, the model would be of large size and complexity, incurring high computational overhead. How to reduce model size while preserving modeling power is a challenging issue.
To cope with the two problems, several studies (Zou and Adams, 2012; Xie et al., 2015; Xie, 2015) propose a “diversification” approach, which encourages the components of an LVM to be mutually “dissimilar”. First, regarding capturing infrequent patterns, as posited by Xie et al. (2015), “diversified” components are expected to be less aggregated over frequent patterns, so that part of them are spared to cover the infrequent patterns. Second, concerning shrinking model size without compromising modeling power, Xie (2015) argued that “diversified” components bear less redundancy and are mutually complementary, making it possible to capture information sufficiently well with a small set of components, i.e., obtaining LVMs possessing high representational power and low computational complexity.
The existing studies (Zou and Adams, 2012; Xie et al., 2015; Xie, 2015) of “diversifying” LVMs mostly focus on point estimation (Wasserman, 2013) of the model components, under a frequentist-style regularized optimization framework. In this paper, we study how to promote diversity under an alternative learning paradigm: Bayesian inference (Jaakkola and Jordan, 1997; Bishop and Tipping, 2003; Neal, 2012), where the components are treated as random variables whose posterior distribution is computed from data under certain priors. Compared with point estimation, Bayesian learning offers complementary benefits. First, it offers a “model-averaging” (Jaakkola and Jordan, 1997; Bishop and Tipping, 2003) effect for LVMs when they are used for decision-making and prediction, because the parameters are integrated out under a posterior distribution, which potentially alleviates overfitting on training data. Second, it provides a natural way to quantify the uncertainty of model parameters and of the downstream decisions and predictions made from them (Jaakkola and Jordan, 1997; Bishop and Tipping, 2003; Neal, 2012). Affandi et al. (2013) investigated the “diversification” of Bayesian LVMs using the determinantal point process (DPP) (Kulesza et al., 2012) prior. While Markov chain Monte Carlo (MCMC) methods (Affandi et al., 2013) have been developed for approximate posterior inference under the DPP prior, the DPP is not amenable to the other mainstream paradigm of approximate inference, variational inference (Wainwright et al., 2008), which is usually more efficient (Hoffman et al., 2013) than MCMC. In this paper, we propose alternative diversity-promoting priors that overcome this limitation.
We propose two approaches with complementary advantages to perform diversity-promoting Bayesian learning of LVMs. Following Xie et al. (2015), we adopt a notion of diversity under which component vectors are more diverse when the pairwise angles between them are larger. First, we define mutual angular Bayesian network (MABN) priors over the components, which assign higher probability density to components that have larger mutual angles, and use these priors to affect the posterior via Bayes’ rule. Specifically, we build a Bayesian network (Koller and Friedman, 2009) whose nodes represent the directional vectors of the components and whose local probabilities are parameterized by von Mises-Fisher (Mardia and Jupp, 2009) distributions that entail an inductive bias towards vectors with larger mutual angles. The MABN priors are amenable to approximate posterior inference of model components. In particular, they facilitate variational inference, which is usually more efficient than MCMC sampling. Second, since it is not flexible (or even possible) to define priors that capture certain diversity-promoting effects, such as small variance of mutual angles, we adopt a posterior regularization approach (Zhu et al., 2014b), in which a diversity-promoting regularizer is directly imposed over the post-data distributions to encourage diversity, and the regularizer can be flexibly defined to accommodate various desired diversity-promoting goals. We instantiate the two approaches on the Bayesian mixture of experts model (BMEM) (Waterhouse et al., 1996), and experiments demonstrate the effectiveness and efficiency of our approaches.
We also study how to “diversify” Bayesian nonparametric LVMs (BNLVMs) (Ferguson, 1973; Ghahramani and Griffiths, 2005; Hjort et al., 2010). Unlike parametric LVMs, where the number of components is set to a finite value and does not change throughout the execution of the algorithm, in BNLVMs the number of components is unbounded and can in principle be infinite. As more data accumulates, new components are dynamically added. Compared with parametric LVMs, BNLVMs possess the following advantages: (1) they are highly flexible and adaptive: if new data cannot be well modeled by existing components, new components are automatically invoked; (2) the “best” number of components is determined according to the fit to data, rather than being set manually, which is a challenging task even for domain experts. To “diversify” BNLVMs, we extend the MABN prior to an Infinite Mutual Angular (IMA) prior that encourages infinitely many components to have large angles. In this prior, the components are mutually dependent, which poses great challenges for posterior inference. We develop an efficient sampling algorithm based on slice sampling (Teh and Ghahramani, 2007) and Riemann manifold Hamiltonian Monte Carlo (Girolami and Calderhead, 2011). We apply the IMA prior to induce diversity in the infinite latent feature model (ILFM) (Ghahramani and Griffiths, 2005), and experiments on various datasets demonstrate that the IMA prior is able to (1) achieve better performance with fewer components; (2) better capture infrequent patterns; and (3) reduce overfitting.
The major contributions of this work are:

We propose a mutual angular Bayesian network (MABN) prior which is biased towards components having large mutual angles, to promote diversity in Bayesian LVMs.

We develop an efficient variational inference method for posterior inference of model components under the MABN priors.

To flexibly accommodate various diversity-promoting effects, we study a posterior regularization approach which directly imposes diversity-promoting regularization over the post-data distributions.

We extend the MABN prior from the finite case to the infinite case and apply it to “diversify” Bayesian nonparametric models.

We develop an efficient sampling algorithm based on slice sampling and Riemann manifold Hamiltonian Monte Carlo for “diversified” BNLVMs.

Using the Bayesian mixture of experts model and the infinite latent feature model as case studies, we empirically demonstrate the effectiveness and efficiency of our methods.
The rest of the paper is organized as follows. Section 2 reviews related work. In Sections 3 and 4, we introduce how to promote diversity in Bayesian parametric and nonparametric LVMs, respectively. Section 5 gives experimental results and Section 6 concludes the paper.
2 Related Work
Recent works (Zou and Adams, 2012; Xie et al., 2015; Xie, 2015) have studied the diversification of components in LVMs under a point estimation framework. Zou and Adams (2012) leverage the determinantal point process (DPP) (Kulesza et al., 2012) to promote diversity in latent variable models. Xie et al. (2015) propose a mutual angular regularizer that encourages model components to be mutually different, where the dissimilarity is measured by angles. Cogswell et al. (2015) define a covariance-based regularizer to reduce the correlation among hidden units in neural networks, for the sake of alleviating overfitting.
Diversity-promoting Bayesian learning of LVMs has been investigated by Affandi et al. (2013), who utilize the DPP prior to induce a bias towards diverse components and develop a Gibbs sampling (Gilks, 2005) algorithm. However, the determinant in the DPP makes variational-inference-based algorithms very difficult to derive. The conference version of this paper (Xie et al., 2016) introduced a mutual angular prior to “diversify” Bayesian parametric LVMs. This work extends the study of “diversification” to nonparametric models where the number of components is infinite.
Diversity-promoting regularization has been investigated in other problems as well, such as ensemble learning and classification. In ensemble learning, many studies (Kuncheva and Whitaker, 2003; Banfield et al., 2005; Partalas et al., 2008; Yu et al., 2011) explore how to select a diverse subset of base classifiers or regressors, with the aim of improving generalization performance and reducing computational complexity. In multi-way classification, Malkin and Bilmes (2008) propose to use the determinant of a covariance matrix to encourage “diversity” among classifiers. Jalali et al. (2015) propose a class of variational Gram functions (VGFs) to promote pairwise dissimilarity among classifiers.
3 Diversity-Promoting Bayesian Learning of Parametric Latent Variable Models
In this section, we study how to “diversify” parametric Bayesian LVMs where the number of components is finite. We investigate two approaches: prior control and posterior regularization, which have complementary advantages.
3.1 Diversity-Promoting Mutual Angular Prior
The first approach we take is to define a prior which has an inductive bias towards components that are more “diverse” and to use it to affect the posterior via Bayes’ rule. We refer to this approach as prior control. While diversity can be defined in various ways, following Xie et al. (2015) we adopt the notion that a set of component vectors is more diverse if the pairwise angles between them are larger. We desire the prior to have two traits. First, to favor diversity, it should assign a higher density to components having larger mutual angles. Second, it should facilitate posterior inference. In Bayesian learning, the ease of posterior inference relies heavily on the prior (Blei and Lafferty, 2006; Wang and Blei, 2013).
One possible solution is to turn the mutual angular regularizer $\Omega(A)$ (Xie et al., 2015), which encourages a set of component vectors $A$ to have large mutual angles, into a distribution $p(A) = \frac{1}{Z}\exp(\Omega(A))$ based on the Gibbs measure (Kindermann et al., 1980), where $Z$ is the partition function guaranteeing that $p(A)$ integrates to one. The concern is that it is unclear whether $Z = \int_{A}\exp(\Omega(A))\,\mathrm{d}A$ is finite, i.e., whether $p(A)$ is proper. When an improper prior is utilized in Bayesian learning, the posterior is also highly likely to be improper, except in a few special cases (Wasserman, 2013). Performing inference on improper posteriors is problematic.
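Here we define mutual angular Bayesian network (MABN) priors possessing the aforementioned two traits, based on the Bayesian network (Koller and Friedman, 2009) and the von Mises-Fisher (Mardia and Jupp, 2009) distribution. For technical convenience, we decompose each real-valued component vector $a_i$ into $a_i = g_i\tilde{a}_i$, where $g_i = \|a_i\|_2$ is the magnitude and $\tilde{a}_i$ is the direction ($\|\tilde{a}_i\|_2 = 1$). Let $\tilde{A} = \{\tilde{a}_i\}_{i=1}^{K}$ denote the directional vectors. Note that the angle between two vectors is invariant to their magnitudes; therefore, the mutual angles of the component vectors in $A$ are the same as the angles of the directional vectors in $\tilde{A}$. We first construct a prior which prefers vectors in $\tilde{A}$ to possess large angles. The basic idea is to use a Bayesian network (BN) to characterize the dependency among directional vectors and to design local probabilities that entail an inductive bias towards large mutual angles. In the BN shown in Figure 1, each node $i$ represents a directional vector $\tilde{a}_i$ and its parents are nodes $1, \dots, i-1$. We define a local probability at node $i$ to encourage $\tilde{a}_i$ to have large mutual angles with $\tilde{a}_1, \dots, \tilde{a}_{i-1}$. Since these directional vectors lie on a sphere, we use the von Mises-Fisher (vMF) distribution to model them. The probability density function of the vMF distribution is $f(x) = C_p(\kappa)\exp(\kappa\mu^\top x)$, where the random variable $x \in \mathbb{R}^p$ lies on a $(p-1)$-dimensional sphere ($\|x\|_2 = 1$), $\mu$ is the mean direction with $\|\mu\|_2 = 1$, $\kappa > 0$ is the concentration parameter and $C_p(\kappa)$ is the normalization constant. The local probability at node $i$ is defined as a vMF distribution whose density is
$$p(\tilde{a}_i \mid \{\tilde{a}_j\}_{j=1}^{i-1}) = C_p(\kappa)\exp\left(\kappa\left(-\frac{\sum_{j=1}^{i-1}\tilde{a}_j}{\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2}\right)^{\top}\tilde{a}_i\right) \tag{1}$$
with mean direction $-\sum_{j=1}^{i-1}\tilde{a}_j / \|\sum_{j=1}^{i-1}\tilde{a}_j\|_2$.
Now we explain why this local probability favors large mutual angles. Since $\tilde{a}_i$ and $\tilde{a}_j$ are unit-length vectors, $\tilde{a}_j^\top\tilde{a}_i$ is the cosine of the angle between $\tilde{a}_i$ and $\tilde{a}_j$. If $\tilde{a}_i$ has larger angles with $\{\tilde{a}_j\}_{j=1}^{i-1}$, then the negative cosine similarity $(-\sum_{j=1}^{i-1}\tilde{a}_j)^\top\tilde{a}_i$ would be larger, and accordingly $p(\tilde{a}_i \mid \{\tilde{a}_j\}_{j=1}^{i-1})$ would be larger. This statement is true for all $i > 1$. As a result, $p(\tilde{A}) = p(\tilde{a}_1)\prod_{i=2}^{K}p(\tilde{a}_i \mid \{\tilde{a}_j\}_{j=1}^{i-1})$ would be larger if the directional vectors have larger mutual angles. For the magnitudes of the components, which have nothing to do with the mutual angles, we sample $g_i$ for each component independently from a gamma distribution with shape parameter $\alpha_1$ and rate parameter $\alpha_2$.
The generative process of $A$ is summarized as follows:

Draw $\tilde{a}_1 \sim \mathrm{vMF}(\mu_0, \kappa)$, where $\mu_0$ is a given mean direction

For $i = 2, \dots, K$, draw $\tilde{a}_i \sim \mathrm{vMF}\left(-\frac{\sum_{j=1}^{i-1}\tilde{a}_j}{\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2}, \kappa\right)$

For $i = 1, \dots, K$, draw $g_i \sim \mathrm{Gamma}(\alpha_1, \alpha_2)$

For $i = 1, \dots, K$, let $a_i = g_i\tilde{a}_i$
The probability distribution over $A$ can be written as
$$p(A) = C_p(\kappa)\exp(\kappa\mu_0^\top\tilde{a}_1)\prod_{i=2}^{K}C_p(\kappa)\exp\left(\kappa\left(-\frac{\sum_{j=1}^{i-1}\tilde{a}_j}{\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2}\right)^{\top}\tilde{a}_i\right)\prod_{i=1}^{K}\frac{\alpha_2^{\alpha_1}g_i^{\alpha_1-1}e^{-\alpha_2 g_i}}{\Gamma(\alpha_1)} \tag{2}$$
According to the factorization theorem (Koller and Friedman, 2009) of Bayesian networks, it is easy to verify that $\int_{A}p(A)\,\mathrm{d}A = 1$, thus $p(A)$ is a proper prior.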
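The generative process above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: it is restricted to three dimensions (where the vMF distribution has a closed-form inverse CDF), the root mean direction `mu0` and all function names are hypothetical, and the parent sum is assumed to be nonzero.

```python
import math
import random


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def sample_vmf_3d(mu, kappa, rng):
    """Draw one vMF sample on the unit sphere in R^3 (mu must be unit length)."""
    # Closed-form inverse CDF of w = mu . x; this shortcut is valid only for p = 3.
    u = rng.random()
    w = 1.0 + math.log(u + (1.0 - u) * math.exp(-2.0 * kappa)) / kappa
    # Orthonormal basis (e1, e2) of the plane orthogonal to mu.
    helper = [1.0, 0.0, 0.0] if abs(mu[0]) < 0.9 else [0.0, 1.0, 0.0]
    proj = sum(h * m for h, m in zip(helper, mu))
    e1 = normalize([h - proj * m for h, m in zip(helper, mu)])
    e2 = [mu[1] * e1[2] - mu[2] * e1[1],
          mu[2] * e1[0] - mu[0] * e1[2],
          mu[0] * e1[1] - mu[1] * e1[0]]
    phi = 2.0 * math.pi * rng.random()
    r = math.sqrt(max(0.0, 1.0 - w * w))
    return [w * m + r * (math.cos(phi) * a + math.sin(phi) * b)
            for m, a, b in zip(mu, e1, e2)]


def sample_mabn_components(K, kappa, mu0, shape, rate, seed=0):
    """Draw K components: directions sequentially, magnitudes independently."""
    rng = random.Random(seed)
    dirs = [sample_vmf_3d(normalize(mu0), kappa, rng)]
    for _ in range(1, K):
        parent_sum = [sum(d[c] for d in dirs) for c in range(3)]
        # Mean direction points away from the (assumed nonzero) parent sum.
        mean_dir = normalize([-c for c in parent_sum])
        dirs.append(sample_vmf_3d(mean_dir, kappa, rng))
    # random.gammavariate takes (shape, scale), so scale = 1 / rate.
    mags = [rng.gammavariate(shape, 1.0 / rate) for _ in range(K)]
    comps = [[g * c for c in d] for g, d in zip(mags, dirs)]
    return comps, dirs
```

With a large concentration such as `kappa=50.0`, each new direction lands close to the negated sum of its parents, so the second vector has an obtuse angle with the first; smaller values of `kappa` weaken this preference.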
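When inferring the posterior of model components using a variational inference method, we need to compute the expectation of $1/\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2$ appearing in the local probability $p(\tilde{a}_i \mid \{\tilde{a}_j\}_{j=1}^{i-1})$, where $\tilde{a}_1, \dots, \tilde{a}_{i-1}$ are the parent directional vectors of $\tilde{a}_i$; this is extremely difficult. To address this issue, we define an alternative local probability $\hat{p}(\tilde{a}_i \mid \{\tilde{a}_j\}_{j=1}^{i-1})$ that achieves a similar modeling effect as $p(\tilde{a}_i \mid \{\tilde{a}_j\}_{j=1}^{i-1})$, but greatly facilitates variational inference. We reparametrize the local probability defined in Eq.(1) using the Gibbs measure:
$$\hat{p}(\tilde{a}_i \mid \{\tilde{a}_j\}_{j=1}^{i-1}) \propto \exp\left(\kappa\left(-\sum_{j=1}^{i-1}\tilde{a}_j\right)^{\top}\tilde{a}_i\right) = C_p\left(\kappa\left\|\sum_{j=1}^{i-1}\tilde{a}_j\right\|_2\right)\exp\left(\kappa\left\|\sum_{j=1}^{i-1}\tilde{a}_j\right\|_2\left(-\frac{\sum_{j=1}^{i-1}\tilde{a}_j}{\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2}\right)^{\top}\tilde{a}_i\right) \tag{3}$$
which is another vMF distribution with mean direction $-\sum_{j=1}^{i-1}\tilde{a}_j / \|\sum_{j=1}^{i-1}\tilde{a}_j\|_2$ and concentration parameter $\kappa\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2$. This reparameterized local probability is proportional to $\exp(\kappa(-\sum_{j=1}^{i-1}\tilde{a}_j)^\top\tilde{a}_i)$, which measures the negative cosine similarity between $\tilde{a}_i$ and its parent vectors. Thereby, $\hat{p}$ still encourages large mutual angles between vectors, as $p$ does. The difference between $\hat{p}$ and $p$ is that in $\hat{p}$ the term $\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2$ is moved from the denominator to the normalizer, so we avoid computing the expectation of $1/\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2$. This incurs a new problem, namely that we need to compute the expectation of $\log C_p(\kappa\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2)$, which is also hard due to the complex form of the $C_p(\cdot)$ function; we resolve this problem as detailed in Section 3.1.1. We refer to the MABN prior defined in Eq.(2) as type I MABN and that with local probability defined in Eq.(3) as type II MABN.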
3.1.1 Approximate Inference Algorithms
We develop algorithms to infer the posteriors of components under the MABN prior. Since exact posteriors are intractable, we resort to approximate inference techniques. Two main paradigms of approximate inference methods are: (1) variational inference (VI) (Wainwright et al., 2008); (2) Markov chain Monte Carlo (MCMC) sampling (Gilks, 2005). These two approaches possess benefits that are mutually complementary. MCMC can achieve a better approximation of the posterior than VI since it generates samples from the exact posterior while VI seeks an approximation. However, VI can be computationally more efficient (Hoffman et al., 2013).
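Variational Inference
The basic idea of VI (Wainwright et al., 2008) is to use a “simpler” variational distribution $q(A)$ to approximate the true posterior $p(A \mid D)$ by minimizing the Kullback-Leibler divergence between these two distributions, which is equivalent to maximizing the following variational lower bound w.r.t. $q(A)$:
$$\mathbb{E}_{q(A)}[\log p(D \mid A)] + \mathbb{E}_{q(A)}[\log p(A)] - \mathbb{E}_{q(A)}[\log q(A)] \tag{4}$$
where $p(A)$ is the MABN prior and $p(D \mid A)$ is the data likelihood. Here we choose $q(A)$ to be a mean-field variational distribution $q(A) = \prod_{k=1}^{K}q(\tilde{a}_k)q(g_k)$, where each $q(\tilde{a}_k)$ is a vMF distribution and each $q(g_k)$ is a gamma distribution. Given the variational distribution, we first compute the analytical expression of the variational lower bound, in which we particularly discuss how to compute $\mathbb{E}_{q(A)}[\log p(A)]$. If we choose $p(A)$ to be the type I MABN prior (Eq.(2)), we need to compute $\mathbb{E}_{q(A)}[\log p(\tilde{a}_i \mid \{\tilde{a}_j\}_{j=1}^{i-1})]$, which is very difficult to deal with due to the presence of $1/\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2$. Instead we choose the type II MABN prior for the convenience of deriving the variational lower bound. Under the type II MABN, we need to compute $\mathbb{E}_{q(A)}[\log Z_i]$ for all $i$, where $Z_i = 1/C_p(\kappa\|\sum_{j=1}^{i-1}\tilde{a}_j\|_2)$ is the partition function of $\hat{p}(\tilde{a}_i \mid \{\tilde{a}_j\}_{j=1}^{i-1})$. The analytical form of this expectation is difficult to derive as well due to the complexity of the $C_p(\cdot)$ function: $C_p(\kappa) = \frac{\kappa^{p/2-1}}{(2\pi)^{p/2}I_{p/2-1}(\kappa)}$, where $I_{p/2-1}(\cdot)$ is the modified Bessel function of the first kind at order $p/2-1$. To address this issue, we derive an upper bound of $\log Z_i$ and compute the expectation of the upper bound, which is relatively easy to do. Consequently, we obtain a further lower bound of the variational lower bound and learn the variational and model parameters w.r.t. the new lower bound.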
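To make the difficulty with the vMF normalizer concrete, the sketch below evaluates its logarithm numerically using the standard formula $C_p(\kappa) = \kappa^{p/2-1} / ((2\pi)^{p/2} I_{p/2-1}(\kappa))$. It is a minimal illustration with hypothetical function names: the Bessel function is computed from its power series, which is adequate for moderate $\kappa$ but is not a production-quality routine. The expectation of this quantity under the variational distribution has no simple closed form, which is why the text resorts to a bound.

```python
import math


def log_bessel_i(nu, x, terms=200):
    """log I_nu(x) via the power series sum_m (x/2)^(2m+nu) / (m! * Gamma(m+nu+1))."""
    log_terms = [(2 * m + nu) * math.log(x / 2.0)
                 - math.lgamma(m + 1) - math.lgamma(m + nu + 1)
                 for m in range(terms)]
    top = max(log_terms)
    # Log-sum-exp keeps the summation numerically stable.
    return top + math.log(sum(math.exp(t - top) for t in log_terms))


def log_vmf_normalizer(p, kappa):
    """log C_p(kappa) for the vMF density C_p(kappa) * exp(kappa * mu^T x) in R^p."""
    nu = p / 2.0 - 1.0
    return (nu * math.log(kappa)
            - (p / 2.0) * math.log(2.0 * math.pi)
            - log_bessel_i(nu, kappa))
```

For $p = 3$ the normalizer reduces to $\kappa / (4\pi\sinh\kappa)$, which gives a convenient sanity check for the series implementation.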
Now we proceed to derive the upper bound of , which equals to . Applying the inequality (Bouchard, 2007), where is a variational parameter, we have
(5) 
Then applying the inequality (Bouchard, 2007), where is another variational parameter and , we have
(6) 
Finally, applying the following integrals on a high-dimensional sphere: (1) , (2) , (3) , we get
(7) 
The expectation of this upper bound is much easier to compute. Specifically, we need to tackle , which can be computed as
(8)  
where , , , and .
MCMC Sampling
One potential drawback of the variational inference approach is that a large approximation error can be incurred if the variational distribution is far from the true posterior. We therefore present an alternative approximate inference method, Markov chain Monte Carlo (MCMC) (Gilks, 2005), which draws samples from the exact posterior distribution and uses the samples to represent the posterior. Specifically, we choose the Metropolis-Hastings (MH) algorithm (Gilks, 2005), which generates samples from an adaptive proposal distribution, computes acceptance probabilities based on the unnormalized true posterior and uses the acceptance probabilities to decide whether a sample should be accepted or rejected. The most commonly used proposal distribution is based on a random walk: the newly proposed sample at step $t+1$ comes from a random perturbation around the previous sample at step $t$. For the directional variables $\{\tilde{a}_i\}$ and the magnitude variables $\{g_i\}$, we define the proposal distributions to be a von Mises-Fisher distribution and a normal distribution, respectively, each centered at the previous sample:
$$q(\tilde{a}_i^{(t+1)} \mid \tilde{a}_i^{(t)}) = C_p(\hat{\kappa})\exp\left(\hat{\kappa}\,\tilde{a}_i^{(t)\top}\tilde{a}_i^{(t+1)}\right), \qquad q(g_i^{(t+1)} \mid g_i^{(t)}) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(g_i^{(t+1)} - g_i^{(t)})^2}{2\sigma^2}\right) \tag{9}$$
where $\hat{\kappa}$ and $\sigma$ denote the concentration and standard deviation of the proposals. The magnitude $g_i$ is required to be positive, but the Gaussian distribution may generate non-positive samples. To address this problem, we adopt a truncated sampler (Wilkinson, 2015), which repeatedly draws samples until a positive value is obtained. Under such a truncated sampling scheme, the MH acceptance ratio needs to be modified accordingly. Please refer to (Wilkinson, 2015) for details.
MH eventually converges to a stationary distribution where the generated samples represent the true posterior. The downside of MCMC is that it can take a long time to converge, which usually makes it computationally less efficient than variational inference (Hoffman et al., 2013). Under the MH algorithm, the MABN prior is more efficient than the DPP prior. In each iteration, the MABN prior needs to be evaluated, at a cost quadratic in the number of components $K$, whereas evaluating the DPP has a cubic complexity in $K$.
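The truncated random-walk step for a positive magnitude variable can be sketched as follows. This is an illustrative sketch, not the authors' code: the target is any unnormalized log density on the positive reals (passed in as `log_target`, a hypothetical name), and the acceptance correction shown is the standard one for a Gaussian proposal that is redrawn until positive, whose density is a normal density divided by $\Phi(g/\sigma)$.

```python
import math
import random


def std_normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def propose_positive(g, sigma, rng):
    """Gaussian random-walk proposal, redrawn until the value is positive."""
    while True:
        cand = rng.gauss(g, sigma)
        if cand > 0.0:
            return cand


def mh_positive(log_target, g0, sigma, n_samples, seed=0):
    """Metropolis-Hastings on (0, inf) with a truncated Gaussian proposal."""
    rng = random.Random(seed)
    g, samples = g0, []
    for _ in range(n_samples):
        cand = propose_positive(g, sigma, rng)
        # Truncation makes the proposal asymmetric:
        # q(g | cand) / q(cand | g) = Phi(g / sigma) / Phi(cand / sigma).
        log_correction = (math.log(std_normal_cdf(g / sigma))
                          - math.log(std_normal_cdf(cand / sigma)))
        log_accept = log_target(cand) - log_target(g) + log_correction
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < log_accept:
            g = cand
        samples.append(g)
    return samples
```

As a usage example, running the sampler on an unnormalized Gamma(3, 2) log density, `lambda g: 2.0 * math.log(g) - 2.0 * g`, should yield a sample mean near the true mean of 1.5 after burn-in.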
3.2 Diversity-Promoting Posterior Regularization
In practice, one may desire to achieve more than one diversity-promoting effect in LVMs. For example, the mutual angular regularizer (Xie et al., 2015) aims to encourage the pairwise angles between components to have not only a large mean, but also a small variance, such that the components are uniformly “different” from each other and evenly spread out to different directions in the space. It would be extremely difficult, if at all possible, to define a proper prior that can accommodate all desired effects. For instance, the MABN priors defined above can encourage the mutual angles to have a large mean, but are unable to promote a small variance. To overcome such inflexibility of the prior control method, we resort to a posterior regularization approach (Zhu et al., 2014b). Instead of designing a Bayesian prior to encode the diversification desideratum and indirectly influencing the posterior, posterior regularization directly imposes a control over the post-data distributions to achieve certain goals. Given a prior $\pi(A)$ over the components $A$ and the likelihood $p(D \mid A)$ of the data $D$, computing the posterior $p(A \mid D)$ is equivalent to solving the following optimization problem (Zhu et al., 2014b)
$$\sup_{q(A)}\ \mathbb{E}_{q(A)}[\log p(D \mid A)\pi(A)] - \mathbb{E}_{q(A)}[\log q(A)] \tag{10}$$
where $q(A)$ is any valid probability distribution. The basic idea of posterior regularization is to impose a certain regularizer $\mathcal{R}(q(A))$ over $q(A)$ to incorporate prior knowledge and structural bias (Zhu et al., 2014b) and to solve the following regularized problem
$$\sup_{q(A)}\ \mathbb{E}_{q(A)}[\log p(D \mid A)\pi(A)] - \mathbb{E}_{q(A)}[\log q(A)] + \lambda\mathcal{R}(q(A)) \tag{11}$$
where $\lambda$ is a tradeoff parameter. Through properly designing $\mathcal{R}(q(A))$, many diversity-promoting effects can be flexibly incorporated. Here we present a specific example while noting that many other choices are applicable. Gaining insight from (Xie et al., 2015), we define $\mathcal{R}(q(A))$ as
(12) 
where is the non-obtuse angle measuring the dissimilarity between and , and the regularizer is defined as the mean of pairwise angles minus their variance. The intuition behind this regularizer is: if the mean of the angles is larger (indicating these vectors are more different from each other on the whole) and the variance of the angles is smaller (indicating these vectors spread out evenly to different directions), then these vectors are more diverse. Note that it is very difficult to design priors that simultaneously achieve these two effects.
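The verbal description of the regularizer (mean of pairwise non-obtuse angles minus their variance) can be illustrated with a short sketch. The code below is an assumption-laden illustration rather than Eq.(12) itself: it operates on a plain list of vectors, weights the variance term by one, and takes the non-obtuse angle to be the arccosine of the absolute cosine similarity; the function names are hypothetical.

```python
import math
from itertools import combinations


def nonobtuse_angle(u, v):
    """Angle in [0, pi/2] between two nonzero vectors, ignoring orientation."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly above 1.
    cos_abs = min(1.0, abs(dot) / (norm_u * norm_v))
    return math.acos(cos_abs)


def mean_minus_variance_of_angles(vectors):
    """Mean of pairwise non-obtuse angles minus their (population) variance."""
    angles = [nonobtuse_angle(u, v) for u, v in combinations(vectors, 2)]
    mean = sum(angles) / len(angles)
    var = sum((t - mean) ** 2 for t in angles) / len(angles)
    return mean - var
```

An orthonormal set attains the maximal mean angle of $\pi/2$ with zero variance, whereas a tightly clustered set scores much lower, matching the intuition given in the text.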
While posterior regularization is more flexible, it lacks some strengths possessed by the prior control method for our purpose of diversifying latent variable models. First, prior control is a more natural way of incorporating prior knowledge, with a solid theoretical foundation. Second, prior control can facilitate sampling-based algorithms that are not applicable to the above posterior regularization. (Footnote 1: Note that there do exist some examples of posterior regularization that have nice sampling-based algorithms, such as the max-margin topic models with a Gibbs classifier (Zhu et al., 2014a).) In sum, the two approaches have complementary advantages and should be chosen according to the specific problem context.
3.3 “Diversifying” Bayesian Mixture of Experts Model
In this section, we apply the two approaches developed above to “diversify” the Bayesian mixture of experts model (BMEM) (Waterhouse et al., 1996).
3.3.1 BMEM with Mutual Angular Prior
The mixture of experts model (MEM) (Jordan and Jacobs, 1994) has been widely used for machine learning tasks where the distribution of input data is so complicated that a single model (“expert”) cannot be effective for all the data. MEM assumes that the input data inherently belongs to multiple latent groups and that one “expert” is allocated to each group to handle the data therein. Here we consider a classification task whose goal is to learn binary linear classifiers given the training data , where is the input feature vector and is the class label. We assume there are latent experts where each expert is a classifier with coefficient vector . Given a test example , it first goes through a gate function that decides, in a probabilistic way, which expert is best suited to classify this example. A discrete variable is utilized to indicate the selected expert and the probability that (assigning example to expert ) is , where is a coefficient vector characterizing the selection of expert . Given the selected expert, the example is classified using the coefficient vector corresponding to that expert. As described in Figure 2, the generative process of is as follows

For

Draw , where

Draw .
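The gate-then-classify process described above can be sketched as follows. Since the formulas are not reproduced in the text here, the sketch makes two explicit assumptions: the gate is a softmax over linear scores of the input, and each expert is a logistic (binary linear) classifier. All names (`gate_weights`, `expert_weights`, and the functions) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gate_probabilities(x, gate_weights):
    """Probability of routing x to each expert (assumed softmax gate)."""
    return softmax([dot(eta, x) for eta in gate_weights])


def generate_label(x, expert_weights, gate_weights, rng):
    """Sample an expert index z, then a binary label y from that expert."""
    probs = gate_probabilities(x, gate_weights)
    z = rng.choices(range(len(probs)), weights=probs)[0]
    y = 1 if rng.random() < sigmoid(dot(expert_weights[z], x)) else 0
    return z, y


def predictive_probability(x, expert_weights, gate_weights):
    """P(y = 1 | x), marginalizing over the selected expert."""
    probs = gate_probabilities(x, gate_weights)
    return sum(p * sigmoid(dot(beta, x))
               for p, beta in zip(probs, expert_weights))
```

Marginalizing over the expert indicator, as in `predictive_probability`, gives the class probability used at prediction time; diversity among the experts' coefficient vectors is what the priors of Section 3.1 are meant to encourage.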

As of now, the model parameters and are deterministic variables. Next we place a prior over them to enable Bayesian learning (Waterhouse et al., 1996) and desire this prior to be able to promote diversity among the experts to retain the advantages of “diversifying” LVMs as stated before. The mutual angular Bayesian network prior can be applied to achieve this goal
where and .
3.3.2 BMEM with Mutual Angular Posterior Regularization
As an alternative approach, the diversity in BMEM can be imposed by placing the mutual angular regularizer (Eq.(12)) over the postdata posteriors (Zhu et al., 2014b). Here we instantiate the general diversitypromoting posterior regularization defined in Eq.(11) to BMEM, by specifying the following parametrization. The latent variables in BMEM include , and and the postdata distribution over them is defined as . For computational tractability, we define and to be: and where , are von MisesFisher distributions and , are gamma distributions, and define to be multinomial distributions: where is a multinomial vector. The priors over and are specified to be: and where , are vMF distributions and , are gamma distributions. Under such parametrization, we solve the following diversitypromoting posterior regularization problem
(13) 
Note that other parametrizations are also valid, such as placing Gaussian priors over and and setting , to be Gaussian.
4 Diversity-Promoting Bayesian Nonparametric Modeling
In the last section, we studied how to promote diversity among a finite number of components in parametric LVMs. In this section, we investigate how to achieve this goal in nonparametric LVMs where the number of components is infinite in principle. We extend the mutual angular Bayesian network (MABN) prior defined in the last section to an Infinite Mutual Angular (IMA) prior that encourages infinitely many components to have large angles. In this prior, the components are mutually dependent, which poses great challenges for posterior inference. We develop an efficient sampling algorithm based on slice sampling (Teh and Ghahramani, 2007) and Riemann manifold Hamiltonian Monte Carlo (Girolami and Calderhead, 2011). We apply the IMA prior to induce diversity in the infinite latent feature model (ILFM) (Ghahramani and Griffiths, 2005).
4.1 Bayesian Nonparametric Latent Variable Models
A BNLVM consists of an infinite number of components, each parameterized by a vector. For example, in the Dirichlet process Gaussian mixture model (DPGMM) (Rasmussen, 1999; Blei et al., 2006), the components are clusters, each parameterized by a Gaussian mean vector. In the Indian buffet process latent feature model (IBPLFM) (Ghahramani and Griffiths, 2005), the components are features, each parameterized by a weight vector. Given these infinitely many components, BNLVMs design a mechanism to select one or a finite subset of them to model each observed data example. For example, in DPGMM, a Chinese restaurant process (CRP) (Aldous, 1985) is designed to assign each data example to one of the infinite number of clusters. In IBPLFM, an Indian buffet process (IBP) (Ghahramani and Griffiths, 2005) is utilized to select a finite set of features from the infinite feature pool to reconstruct each data example. A BNLVM typically consists of two priors. One is a base distribution from which the parameter vectors of components are drawn. The other is a stochastic process, such as the CRP or the IBP, which designates how to select components to model data. The prior studied in this paper belongs to the first category. It is commonly assumed that the parameter vectors of the components are independently drawn from the same base distribution. For example, in both DPGMM and IBPLFM, the mean vectors and weight vectors are independently drawn from a Gaussian distribution. In this paper, we aim to design a prior that encourages the component vectors to be mutually different and “diverse”. Under such a prior the component vectors are no longer independent, which presents great challenges for posterior inference.
4.2 Infinite Mutual Angular Prior
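In the MABN prior, the components are added one by one. Each new component is encouraged to have large angles with the previous ones. This adding process can be repeated infinitely many times, resulting in a prior that encourages an infinite number of directional vectors $\{\tilde{a}_i\}_{i=1}^{\infty}$ to have large mutual angles
$$p(\{\tilde{a}_i\}_{i=1}^{\infty}) = p(\tilde{a}_1)\prod_{i=2}^{\infty}p(\tilde{a}_i \mid \{\tilde{a}_j\}_{j=1}^{i-1}) \tag{14}$$
The factorization theorem (Koller and Friedman, 2009) of Bayesian networks ensures that $p(\{\tilde{a}_i\}_{i=1}^{\infty})$ integrates to one. The magnitudes $\{g_i\}_{i=1}^{\infty}$ do not affect angles (hence diversity) and can be generated independently from a gamma distribution $p(g_i)$.
The generative process of the components $\{a_i\}_{i=1}^{\infty}$ can be summarized as follows:

Sample $\tilde{a}_1 \sim p(\tilde{a}_1)$

For $i = 2, \dots, \infty$, sample $\tilde{a}_i \sim p(\tilde{a}_i \mid \{\tilde{a}_j\}_{j=1}^{i-1})$

For $i = 1, \dots, \infty$, sample $g_i \sim p(g_i)$

For $i = 1, \dots, \infty$, let $a_i = g_i\tilde{a}_i$
The probability distribution over $\{a_i\}_{i=1}^{\infty}$ can be written as
$$p(\{a_i\}_{i=1}^{\infty}) = p(\tilde{a}_1)\prod_{i=2}^{\infty}p(\tilde{a}_i \mid \{\tilde{a}_j\}_{j=1}^{i-1})\prod_{i=1}^{\infty}p(g_i) \tag{15}$$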
4.3 Diversity-Promoting Infinite Latent Feature Model
In this section, using the infinite latent feature model (ILFM) (Griffiths and Ghahramani, ) as an instance of BNLVM, we showcase how to promote diversity among the components therein with the IMA prior. Given a set of data examples $X = \{x_n\}_{n=1}^{N}$ where $x_n \in \mathbb{R}^{D}$, ILFM aims to invoke a finite subset of features from an infinite feature collection $\{a_k\}_{k=1}^{\infty}$ to construct these data examples. Each feature (which is a component in this LVM) is parameterized by a vector $a_k \in \mathbb{R}^{D}$. For each data example $x_n$, a subset of features is selected to construct it. The selection is denoted by a binary vector $z_n \in \{0,1\}^{\infty}$, where $z_{nk} = 1$ denotes that the $k$-th feature is invoked to construct the $n$-th example and $z_{nk} = 0$ otherwise. Given the parameter vectors of the features and the selection vector $z_n$, the example $x_n$ can be represented as: $x_n \sim \mathcal{N}(\sum_{k=1}^{\infty}z_{nk}a_k, \sigma^2 I)$. The binary selection vectors can be drawn either from an Indian buffet process (IBP) (Ghahramani and Griffiths, 2005) or from a stick-breaking construction (Teh and Ghahramani, 2007). Let $\mu_k$ be the prior probability that feature $k$ is present in a data example, and let the features be permuted such that their prior probabilities are in decreasing order: $\mu_{(1)} > \mu_{(2)} > \cdots$. According to the stick-breaking construction, these prior probabilities can be generated in the following way: $\nu_k \sim \mathrm{Beta}(\alpha, 1)$, $\mu_{(k)} = \nu_k\mu_{(k-1)} = \prod_{l=1}^{k}\nu_l$. Given $\mu_{(k)}$, the binary indicator $z_{nk}$ is generated as $z_{nk} \mid \mu_{(k)} \sim \mathrm{Bernoulli}(\mu_{(k)})$. To reduce the redundancy among the features, we impose the IMA prior over their parameter vectors to encourage them to be mutually different, which results in an IMA-LFM model.
4.4 Algorithm
In this section, we develop a sampling algorithm to infer the posteriors of the feature vectors $A = \{a_k\}$ and the selection variables $Z = \{z_n\}$ in the IMA-LFM model. Two major challenges need to be addressed. First, the prior over $A$ is not conjugate to the likelihood function $p(X \mid A, Z)$. Second, the parameter vectors are usually high-dimensional, which leads to slow mixing. To address the first challenge, we adopt the slice sampling algorithm (Teh and Ghahramani, 2007). This algorithm introduces an auxiliary slice variable $s \sim \mathrm{Uniform}(0, \mu^{*})$, where $\mu^{*}$ is the prior probability of the last active feature. A feature $k$ is active if there exists an example $n$ such that $z_{nk} = 1$ and is inactive otherwise. In the sequel, we discuss the sampling of other variables.
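For intuition about the quantities the sampler manipulates, the stick-breaking construction of the feature probabilities from Section 4.3 can be sketched as below. This is a prior-only illustration under stated assumptions, not the posterior sampler: the construction is truncated to a finite number of features, `alpha` stands for the concentration parameter of the Beta(alpha, 1) stick proportions in the construction of Teh and Ghahramani (2007), and the function names are hypothetical.

```python
import random


def stick_breaking_probabilities(alpha, num_features, rng):
    """Decreasing feature probabilities: each is the previous one times a Beta(alpha, 1) draw."""
    probs, mu = [], 1.0
    for _ in range(num_features):
        nu = rng.betavariate(alpha, 1.0)
        mu *= nu
        probs.append(mu)
    return probs


def sample_feature_indicators(probs, num_examples, rng):
    """Binary matrix Z with Z[n][k] ~ Bernoulli(probs[k]), drawn independently."""
    return [[1 if rng.random() < mu else 0 for mu in probs]
            for _ in range(num_examples)]
```

Because the probabilities shrink multiplicatively, only finitely many features fall above any positive slice level, which is what lets a slice sampler work with a finite set of features at each iteration.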
Sample New Features
Let be the maximal feature index with and be the index such that all active features have index ( itself would be inactive feature). If the new value of makes , then we draw new (inactive) features, including the parameter vectors and prior probabilities. The prior probabilities are drawn sequentially from using adaptive rejection sampling (ARS) (Gilks and Wild, 1992). The parameter vectors are drawn sequentially from
(16) 
where we draw from which is a von MisesFisher distribution and draw from a Gamma distribution, then multiply and together since they are independent. For each new feature , the corresponding binary selection variables are initialized to zero.
Sample Existing
We sample from , where .
Sample
Given , we only need to sample for from , where