Gibbs Max-margin Topic Models with Data Augmentation

10/10/2013 · Jun Zhu et al., Tsinghua University

Max-margin learning is a powerful approach to building classifiers and structured output predictors. Recent work on max-margin supervised topic models has successfully integrated it with Bayesian topic models to discover discriminative latent semantic structures and make accurate predictions for unseen testing data. However, the resulting learning problems are usually hard to solve because of the non-smoothness of the margin loss. Existing approaches to building max-margin supervised topic models rely on an iterative procedure to solve multiple latent SVM subproblems with additional mean-field assumptions on the desired posterior distributions. This paper presents an alternative approach by defining a new max-margin loss. Namely, we present Gibbs max-margin supervised topic models, a latent variable Gibbs classifier to discover hidden topic representations for various tasks, including classification, regression and multi-task learning. Gibbs max-margin supervised topic models minimize an expected margin loss, which is an upper bound of the existing margin loss derived from an expected prediction rule. By introducing augmented variables and integrating out the Dirichlet variables analytically by conjugacy, we develop simple Gibbs sampling algorithms with no restricting assumptions and no need to solve SVM subproblems. Furthermore, each step of the "augment-and-collapse" Gibbs sampling algorithms has an analytical conditional distribution, from which samples can be easily drawn. Experimental results demonstrate significant improvements on time efficiency. The classification performance is also significantly improved over competitors on binary, multi-class and multi-label classification tasks.


1 Introduction

As the availability and scope of complex data increase, developing statistical tools to discover latent structures and reveal hidden explanatory factors has become a major theme in statistics and machine learning. Topic models represent one type of such useful tools to discover latent semantic structures that are organized in an automatically learned latent topic space, where each topic (i.e., a coordinate of the latent space) is a unigram distribution over the terms in a vocabulary. Due to its nice interpretability and extensibility, the Bayesian formulation of topic models (Blei et al., 2003) has motivated substantially broad extensions and applications to various fields, such as document analysis, image categorization (Fei-Fei and Perona, 2005), and network data analysis (Airoldi et al., 2008).

Besides discovering latent topic representations, many models usually have a goal to make good predictions, such as relational topic models (Chang and Blei, 2009), whose major goal is to make accurate predictions on the link structures of a document network. Another example is supervised topic models, our focus in this paper, which learn a prediction model for regression and classification tasks. As supervising information (e.g., user-input rating scores for product reviews) gets easier to obtain on the Web, developing supervised latent topic models has attracted a lot of attention. Both maximum likelihood estimation (MLE) and max-margin learning have been applied to learn supervised topic models. Different from the MLE-based approaches (Blei and McAuliffe, 2007), which define a normalized likelihood model for response variables, max-margin supervised topic models, such as maximum entropy discrimination LDA (MedLDA) (Zhu et al., 2012), directly minimize a margin-based loss derived from an expected (or averaging) prediction rule.

By performing discriminative learning, max-margin supervised topic models can discover predictive latent topic representations and have shown promising performance in various prediction tasks, such as text document categorization (Zhu et al., 2012) and image annotation (Yang et al., 2010). However, their learning problems are generally hard to solve due to the non-smoothness of the margin-based loss function. Most existing solvers rely on a variational approximation scheme with strict mean-field assumptions on posterior distributions, and they normally need to solve multiple latent SVM subproblems in an EM-type iterative procedure. By showing a new interpretation of MedLDA as a regularized Bayesian inference method, the recent work (Jiang et al., 2012) successfully developed Monte Carlo methods for such max-margin topic models, with a weaker mean-field assumption. Though the prediction performance is improved because of more accurate inference, the Monte Carlo methods still need to solve multiple SVM subproblems. Thus, their efficiency could be limited as learning SVMs is normally computationally demanding. Furthermore, due to the dependence on SVM solvers, it is not easy to parallelize these algorithms for large-scale data analysis tasks, although substantial efforts have been made to develop parallel Monte Carlo methods for unsupervised topic models (Newman et al., 2009; Smola and Narayanamurthy, 2010; Ahmed et al., 2012).

This paper presents Gibbs MedLDA, an alternative formulation of max-margin supervised topic models, for which we can develop simple and efficient inference algorithms. Technically, instead of minimizing the margin loss of an expected (averaging) prediction rule as adopted in existing max-margin topic models, Gibbs MedLDA minimizes the expected margin loss of many latent prediction rules, of which each rule corresponds to a configuration of topic assignments and the prediction model, drawn from a post-data posterior distribution. Theoretically, the expected margin loss is an upper bound of the existing margin loss of an expected prediction rule. Computationally, although the expected margin loss can be hard in developing variational algorithms, we successfully develop simple and fast collapsed Gibbs sampling algorithms without any restricting assumptions on the posterior distribution and without solving multiple latent SVM subproblems. Each of the sampling substeps has a closed-form conditional distribution, from which a sample can be efficiently drawn. Our algorithms represent an extension of the classical ideas of data augmentation (Dempster et al., 1977; Tanner and Wong, 1987; van Dyk and Meng, 2001) and its recent developments in learning fully observed max-margin classifiers (Polson and Scott, 2011) to learn the sophisticated latent topic models. We further generalize the ideas to develop a Gibbs MedLDA regression model and a multi-task Gibbs MedLDA model, and we also develop efficient collapsed Gibbs sampling algorithms for them with data augmentation. Empirical results on real data sets demonstrate significant improvements in time efficiency. The classification performance is also significantly improved in binary, multi-class, and multi-label classification tasks.

The rest of the paper is structured as follows. Section 2 summarizes some related work. Section 3 reviews MedLDA and its EM-type algorithms. Section 4 presents Gibbs MedLDA and its sampling algorithms for classification. Section 5 presents two extensions of Gibbs MedLDA for regression and multi-task learning. Section 6 presents empirical results. Finally, Section 7 concludes and discusses future directions.

2 Related Work

Max-margin learning has been very successful in building classifiers (Vapnik, 1995) and structured output prediction models (Taskar et al., 2003) in the last decade. Recently, research on learning max-margin models in the presence of latent variables has received increasing attention because of the promise of using latent variables to capture the underlying structures of complex problems. Deterministic approaches (Yu and Joachims, 2009) fill in the unknown values of the hidden structures with point estimates (e.g., MAP estimates) and then define a max-margin loss function on the filled-in structures. Probabilistic approaches instead aim to infer an entire posterior distribution over the hidden structures given evidence and some prior distribution, following the Bayesian way of thinking. Though the former are powerful, we focus on Bayesian approaches, which can naturally incorporate prior beliefs, maintain the entire distribution profile of latent structures, and be extended to nonparametric methods. One representative work along this line is maximum entropy discrimination (MED) (Jaakkola et al., 1999; Jebara, 2001), which learns a distribution of model parameters given a set of labeled training data.

MedLDA (Zhu et al., 2012) is one extension of MED to infer hidden topical structures from data and MMH (max-margin Harmoniums) (Chen et al., 2012) is another extension that infers the hidden semantic features from multi-view data. Along similar lines, recent work has also successfully developed nonparametric Bayesian max-margin models, such as infinite SVMs (iSVM) (Zhu et al., 2011b) for discovering clustering structures when building SVM classifiers and infinite latent SVMs (iLSVM) (Zhu et al., 2011a) for automatically learning predictive features for SVM classifiers. Both iSVM and iLSVM can automatically resolve the model complexity (e.g., the number of components in a mixture model or the number of latent features in a factor analysis model). The nonparametric Bayesian max-margin ideas have been proven to be effective in dealing with more challenging problems, such as link prediction in social networks (Zhu, 2012) and low-rank matrix factorization for collaborative recommendation (Xu et al., 2012, 2013).

One common challenge of these Bayesian max-margin latent variable models is posterior inference, which is normally intractable. Almost all existing work adopts a variational approximation scheme with some mean-field assumptions. Very little research has been done on developing Monte Carlo methods, except the work (Jiang et al., 2012), which still makes mean-field assumptions. The work in the present paper provides a novel way to formulate Bayesian max-margin models, and we show that these new formulations admit very simple and efficient Monte Carlo inference algorithms without making restricting assumptions. The key step in deriving our algorithms is a data augmentation formulation of the expected margin-based loss. Other work on inferring the posterior distributions of latent variables includes max-margin min-entropy models (Miller et al., 2012), which learn a single set of model parameters, in contrast to our focus on inferring a full posterior distribution over models.

Data augmentation refers to methods of augmenting the observed data so as to make it easy to analyze with an iterative optimization or sampling algorithm. For deterministic algorithms, the technique has been popularized in the statistics community by the seminal expectation-maximization (EM) algorithm (Dempster et al., 1977) for maximum likelihood estimation (MLE) with missing values. For stochastic algorithms, the technique has been popularized in statistics by Tanner and Wong's data augmentation algorithm for posterior sampling (Tanner and Wong, 1987) and in physics by Swendsen and Wang's sampling algorithms for Ising and Potts models (Swendsen and Wang, 1987). When using the idea to solve estimation or posterior inference problems, the key step is to find a set of augmented variables, conditioned on which the distribution of our models can be easily sampled. The speed of mixing or convergence is another important concern when designing a data augmentation method. While the conflict between simplicity and speed is a common phenomenon with many standard augmentation schemes, some work has demonstrated that with more creative augmentation schemes it is possible to construct EM-type algorithms (Meng and van Dyk, 1997) or Markov chain Monte Carlo methods (known as slice sampling) (Neal, 1997) that are both fast and simple. We refer the readers to (van Dyk and Meng, 2001) for an excellent review of the broad literature on data augmentation and an effective search strategy for selecting good augmentation schemes.

For our focus on max-margin classifiers, the recent work (Polson and Scott, 2011) provides an elegant data augmentation formulation for support vector machines (SVM) with fully observed input data, which leads to analytical conditional distributions that are easy to sample from and fast to mix. Our work in the present paper builds on the method of Polson and Scott and presents a successful implementation of data augmentation to deal with the challenging posterior inference problems of Bayesian max-margin latent topic models. Our approach can be generalized to deal with other Bayesian max-margin latent variable models, e.g., max-margin matrix factorization (Xu et al., 2013), as reviewed above.

Finally, some preliminary results were presented in a conference paper (Zhu et al., 2013a). This paper presents a full extension.

3 MedLDA

We begin with a brief overview of MedLDA and its learning algorithms, which motivate our developments of Gibbs MedLDA.

3.1 MedLDA: a Regularized Bayesian Model

We consider binary classification with a labeled training set D = {(w_d, y_d)}_{d=1}^{D}, where the response variable y takes values from the output space Y = {+1, −1}. Basically, MedLDA consists of two parts — an LDA model for describing the input documents W = {w_d}, where w_d = {w_dn} denote the words appearing in document d, and an expected classifier for considering the supervising signal y = {y_d}. Below, we introduce each of them in turn.

LDA: Latent Dirichlet allocation (LDA) (Blei et al., 2003) is a hierarchical Bayesian model that posits each document as an admixture of K topics, where each topic Φ_k is a multinomial distribution over a V-word vocabulary. For document d, the generating process can be described as

  1. draw a topic proportion θ_d ∼ Dir(α)

  2. for each word n (1 ≤ n ≤ N_d):

    1. draw a topic assignment z_dn ∼ Mult(θ_d), where z_dn is a K-dimensional binary vector with only one nonzero entry

    2. draw the observed word w_dn ∼ Mult(Φ_{z_dn})

where Dir(α) is a Dirichlet distribution; Mult(·) is a multinomial distribution; and Φ_{z_dn} denotes the topic selected by the nonzero entry of z_dn. For a fully-Bayesian LDA, the topics are random samples drawn from a prior, e.g., Φ_k ∼ Dir(β).
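To make the generative process concrete, the following is a minimal NumPy sketch that samples a toy corpus from the process above. All sizes and hyperparameter values (D, K, V, N, α, β) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def generate_lda_corpus(D=5, K=3, V=20, N=50, alpha=0.1, beta=0.01, seed=0):
    """Sample a toy corpus from the LDA generative process."""
    rng = np.random.default_rng(seed)
    # Fully-Bayesian LDA: draw each topic Phi_k ~ Dir(beta) over the V-term vocabulary.
    Phi = rng.dirichlet(np.full(V, beta), size=K)
    docs = []
    for _ in range(D):
        theta = rng.dirichlet(np.full(K, alpha))            # theta_d ~ Dir(alpha)
        z = rng.choice(K, size=N, p=theta)                  # z_dn ~ Mult(theta_d)
        w = np.array([rng.choice(V, p=Phi[k]) for k in z])  # w_dn ~ Mult(Phi_{z_dn})
        docs.append((z, w))
    return Phi, docs
```

Each document is thus a bag of words whose topic mixture θ_d is document-specific, while the topics Φ are shared across the corpus.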

Given a set of documents W = {w_d}_{d=1}^{D}, we let z_d = {z_dn}_{n=1}^{N_d} denote the set of topic assignments for document d and let Z = {z_d} and Θ = {θ_d} denote all the topic assignments and mixing proportions for the whole corpus, respectively. Then, LDA infers the posterior distribution p(Θ, Z, Φ | W) ∝ p0(Θ, Z, Φ) p(W | Z, Φ) using Bayes' rule, where p0(Θ, Z, Φ) is the prior defined according to the generating process of LDA, and p(W) is the marginal evidence. We can show that the posterior distribution by Bayes' rule is the solution of an information theoretical optimization problem

(1)  min_{q(Θ,Z,Φ) ∈ P}  KL( q(Θ,Z,Φ) ‖ p0(Θ,Z,Φ) ) − E_q[ log p(W | Z, Φ) ]

where KL(q ‖ p) is the Kullback-Leibler divergence and P is the space of probability distributions with an appropriate dimension. In fact, if we add the constant log p(W) to the objective, the problem becomes the minimization of the KL-divergence KL( q(Θ,Z,Φ) ‖ p(Θ,Z,Φ | W) ), whose solution is the desired posterior distribution by Bayes' rule. One advantage of this variational formulation of Bayesian inference is that it can be naturally extended to include regularization terms on the desired post-data posterior distribution q. This insight has been taken to develop regularized Bayesian inference (RegBayes) (Zhu et al., 2011a), a computational framework for doing Bayesian inference with posterior regularization. (Posterior regularization was first used in (Ganchev et al., 2010) for maximum likelihood estimation and was later extended in (Zhu et al., 2011a) to Bayesian and nonparametric Bayesian methods.) As shown in (Jiang et al., 2012) and detailed below, MedLDA is one example of RegBayes models. Moreover, as we shall see in Section 4, our Gibbs max-margin topic models follow this similar idea too.

Expected Classifier: Given a training set D, an expected (or averaging) classifier chooses a posterior distribution q(h | D) over a hypothesis space H of classifiers such that the q-weighted (expected) classifier h_q(x) = sign E_q[h(x)] will have the smallest possible risk. MedLDA follows this principle to learn a posterior distribution q(η, Θ, Z, Φ | D) such that the expected classifier

(2)  ŷ = sign F(w),  where F(w) = E_q[ F(η, z; w) ],

has the smallest possible risk, approximated by the training error R_D(q) = Σ_d I(ŷ_d ≠ y_d). The latent discriminant function is defined as F(η, z; w) = η⊤z̄, where z̄ is the average topic assignment associated with the words w, a vector with element z̄_k = (1/N) Σ_n z_nk, and η is the vector of classifier weights. Note that the expected classifier and the LDA likelihood are coupled via the latent topic assignments Z. The strong coupling makes it possible for MedLDA to learn a posterior distribution that can describe the observed words well and make accurate predictions.

Regularized Bayesian Inference: To integrate the above two components for hybrid learning, MedLDA regularizes the properties of the topic representations by imposing the following max-margin constraints, derived from the classifier (2), on the standard LDA inference problem (1):

(3)  y_d E_q[ η⊤z̄_d ] ≥ ℓ − ξ_d,  ξ_d ≥ 0,  ∀d,

where ℓ (ℓ ≥ 1) is the cost of making a wrong prediction, and ξ = {ξ_d} are non-negative slack variables for inseparable cases. Let L(q) = KL( q ‖ p0(η, Θ, Z, Φ) ) − E_q[ log p(W | Z, Φ) ] be the objective for doing standard Bayesian inference with the classifier weights η included. MedLDA solves the regularized Bayesian inference (Zhu et al., 2011a) problem

(4)  min_{q ∈ P, ξ}  L(q) + 2c Σ_d ξ_d
s.t.:  y_d E_q[ η⊤z̄_d ] ≥ ℓ − ξ_d,  ξ_d ≥ 0,  ∀d,

where the margin constraints directly regularize the properties of the post-data distribution and c is the positive regularization parameter. Equivalently, MedLDA solves the unconstrained problem (if not specified, q is subject to the constraint q ∈ P)

(5)  min_{q ∈ P}  L(q) + 2c R(q),

where R(q) = Σ_d max(0, ℓ − y_d E_q[ η⊤z̄_d ]) is the hinge loss that upper-bounds the training error of the expected classifier (2). Note that the constant 2 is included simply for convenience.

3.2 Existing Iterative Algorithms

Since it is difficult to solve problem (4) or (5) directly because of the non-conjugacy (between priors and likelihood) and the max-margin constraints, corresponding to a non-smooth posterior regularization term in (5), both variational and Monte Carlo methods have been developed for approximate solutions. It can be shown that the variational method (Zhu et al., 2012) is a coordinate descent algorithm to solve problem (5) with the fully-factorized assumption that

q(η, Θ, Z, Φ) = q(η) ( Π_d q(θ_d) Π_n q(z_dn) ) Π_k q(Φ_k),

while the Monte Carlo methods (Jiang et al., 2012) make the weaker assumption that

q(η, Θ, Z, Φ) = q(η) q(Θ, Z, Φ).

All these methods have a similar EM-type iterative procedure, which solves many latent SVM subproblems, as outlined below.

Estimate q(η): Given q(Θ, Z, Φ), we solve problem (5) with respect to q(η). In the equivalent constrained form, this step solves

(6)  min_{q(η), ξ}  KL( q(η) ‖ p0(η) ) + 2c Σ_d ξ_d
s.t.:  y_d E_q[η]⊤ E_q[z̄_d] ≥ ℓ − ξ_d,  ξ_d ≥ 0,  ∀d.

This problem is convex and can be solved with Lagrangian methods. Specifically, let μ_d be the Lagrange multipliers, one per constraint. When the prior p0(η) is the commonly used standard normal distribution, we have the optimum solution q(η) = N(κ, I), where κ = Σ_d μ_d y_d E_q[z̄_d]. It can be shown that the dual problem of (6) is the dual of a standard binary linear SVM, and we can solve it or its primal form efficiently using existing high-performance SVM learners. We denote the optimum solution of this problem by q*(η).

Estimate q(Θ, Z, Φ): Given q(η), we solve problem (5) with respect to q(Θ, Z, Φ). In the constrained form, this step solves

(7)  min_{q(Θ,Z,Φ), ξ}  L( q(Θ,Z,Φ) ) + 2c Σ_d ξ_d
s.t.:  y_d E_q[η]⊤ E_q[z̄_d] ≥ ℓ − ξ_d,  ξ_d ≥ 0,  ∀d.

Although we can solve this problem using Lagrangian methods, it would be hard to derive the dual objective. An effective approximation strategy was used in (Zhu et al., 2012; Jiang et al., 2012), which updates q(Θ, Z, Φ) for only one step with q(η) fixed at q*(η). By fixing q(η) at q*(η), we have the solution

q(Θ, Z, Φ) ∝ p(W, Θ, Z, Φ) exp( κ⊤ Σ_d μ_d y_d z̄_d ),

where the second term indicates the regularization effects due to the max-margin posterior constraints. For those data with non-zero Lagrange multipliers (i.e., support vectors), the second term will bias MedLDA towards a new posterior distribution that favors more discriminative representations on these "hard" data points. The Monte Carlo methods (Jiang et al., 2012) directly draw samples from the posterior distribution q(Θ, Z, Φ) or its collapsed form using Gibbs sampling to estimate E_q[z̄_d], the expectations required to learn q(η). In contrast, the variational methods (Zhu et al., 2012) solve problem (7) using coordinate descent to estimate q(Θ, Z, Φ) with a fully factorized assumption.

4 Gibbs MedLDA

Now, we present Gibbs max-margin topic models for binary classification and their “augment-and-collapse” sampling algorithms. We will discuss further extensions in the next section.

4.1 Learning with an Expected Margin Loss

As stated above, MedLDA chooses the strategy to minimize the hinge loss of an expected classifier. In learning theory, an alternative approach to building classifiers with a posterior distribution of models is to minimize an expected loss, under the framework known as Gibbs classifiers (or stochastic classifiers) (McAllester, 2003; Catoni, 2007; Germain et al., 2009) which have nice theoretical properties on generalization performance.

For our case of inferring the distribution of latent topic assignments Z and the classification model η, the expected margin loss is defined as follows. If we have drawn a sample of the topic assignments Z and the prediction model η from a posterior distribution q(η, Z), we can define the linear discriminant function F(η, z; w) = η⊤z̄ as before and make a prediction using the latent prediction rule

(8)  ŷ_d = sign F(η, z_d; w_d) = sign( η⊤z̄_d ).

Note that the prediction is a function of the configuration (η, Z). Let ζ_d = ℓ − y_d η⊤z̄_d, where ℓ is a cost parameter as defined before. The hinge loss of the stochastic classifier is

R(η, Z) = Σ_d max(0, ζ_d),

a function of the latent variables (η, Z), and the expected hinge loss is

R′(q) = E_q[ R(η, Z) ] = Σ_d E_q[ max(0, ζ_d) ],

a function of the posterior distribution q(η, Z). Since, for any (η, Z), the hinge loss max(0, ζ_d) is an upper bound of the training error of the latent Gibbs classifier (8), that is,

max(0, ζ_d) ≥ ℓ I( y_d ≠ ŷ_d ),

we have

R′(q) ≥ ℓ Σ_d E_q[ I( y_d ≠ ŷ_d ) ],

where I(·) is an indicator function that equals 1 if the predicate holds and 0 otherwise. In other words, the expected hinge loss is an upper bound of the expected training error of the Gibbs classifier (8). Thus, it is a good surrogate loss for learning a posterior distribution that could lead to a low training error in expectation.
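The pointwise bound behind this argument — that the hinge term dominates the ℓ-scaled error indicator for every draw of (η, Z) — can be checked numerically. In this sketch, y, the discriminant values f, and ℓ are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
ell = 1.0
y = rng.choice([-1.0, 1.0], size=1000)   # true labels
f = rng.standard_normal(1000)            # discriminant values eta^T zbar for one draw
zeta = ell - y * f
hinge = np.maximum(0.0, zeta)            # per-document hinge loss
err = ell * (np.sign(f) != y)            # ell-scaled misclassification indicator
assert np.all(hinge >= err)              # the pointwise bound holds on every draw
```

Averaging both sides over draws of (η, Z) from q yields exactly the expected-loss bound above.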

Then, with the same goal as MedLDA of finding a posterior distribution that on one hand describes the observed data and on the other hand predicts as well as possible on training data, we define Gibbs MedLDA as solving the new regularized Bayesian inference problem

(9)  min_{q ∈ P}  L(q) + 2c R′(q).

Note that we have written the expected margin loss R′ as a function of the complete distribution q(η, Θ, Z, Φ). This does not conflict with our definition of R′ as a function of the marginal distribution q(η, Z), because the other irrelevant variables (i.e., Θ and Φ) are integrated out when we compute the expectation.

Comparing to MedLDA in problem (5), we have the following lemma by applying Jensen's inequality. The expected hinge loss R′ is an upper bound of the hinge loss of the expected classifier (2):

R′(q) ≥ R(q) = Σ_d max(0, ℓ − y_d E_q[ η⊤z̄_d ]),

and thus the objective in (9) is an upper bound of that in (5) when the c values are the same.

4.2 Formulation with Data Augmentation

If we directly solve problem (9), the expected hinge loss is hard to deal with because of the non-differentiable max function. Fortunately, we can develop a simple collapsed Gibbs sampling algorithm with analytical forms of local conditional distributions, based on a data augmentation formulation of the expected hinge-loss.

Let φ(y_d | z_d, η) = exp{ −2c max(0, ζ_d) } be the unnormalized likelihood of the response variable for document d. Then, problem (9) can be written as

(10)  min_{q ∈ P}  L(q) − E_q[ log φ(y | Z, η) ],

where φ(y | Z, η) = Π_d φ(y_d | z_d, η). Solving problem (10) with the constraint that q ∈ P, we can get the normalized posterior distribution

q(η, Θ, Z, Φ) = p0(η, Θ, Z, Φ) p(W | Z, Φ) φ(y | Z, η) / ψ(y, W),

where ψ(y, W) is the normalization constant. Due to the complicated form of φ, this distribution will not have simple conditional distributions if we want to derive a Gibbs sampling algorithm for q directly. This motivates our exploration of data augmentation techniques. Specifically, using the ideas of data augmentation (Tanner and Wong, 1987; Polson and Scott, 2011), we have Lemma 4.2.

Lemma 4.2 [Scale Mixture Representation] The unnormalized likelihood can be expressed as

φ(y_d | z_d, η) = ∫_0^∞ (2πλ_d)^{−1/2} exp( −(λ_d + c ζ_d)² / (2λ_d) ) dλ_d.

Due to the fact that 2c max(0, ζ_d) = 2 max(0, c ζ_d) for c > 0, we have φ(y_d | z_d, η) = exp{ −2 max(0, c ζ_d) }; then we can follow the proof in (Polson and Scott, 2011) to get the result. Lemma 4.2 indicates that the posterior distribution of Gibbs MedLDA can be expressed as the marginal of a higher-dimensional distribution that includes the augmented variables λ = {λ_d}_{d=1}^{D}, that is,

(11)  q(η, Θ, Z, Φ) = ∫_{R_+^D} q(η, λ, Θ, Z, Φ) dλ,

where R_+ is the set of positive real numbers; the complete posterior distribution is

q(η, λ, Θ, Z, Φ) = p0(η, Θ, Z, Φ) p(W | Z, Φ) φ(y, λ | Z, η) / ψ(y, W),

and the unnormalized joint distribution of y and λ is

φ(y, λ | Z, η) = Π_d (2πλ_d)^{−1/2} exp( −(λ_d + c ζ_d)² / (2λ_d) ).

In fact, we can show that the complete posterior distribution q(η, λ, Θ, Z, Φ) is the solution of the data augmentation problem of Gibbs MedLDA

min_{q(η,λ,Θ,Z,Φ) ∈ P}  L( q(η, λ, Θ, Z, Φ) ) − E_q[ log φ(y, λ | Z, η) ],

which is again subject to the normalization constraint that q is a valid distribution. The first term in the objective is L(q) = KL( q ‖ p0(η, λ, Θ, Z, Φ) ) − E_q[ log p(W | Z, Φ) ]. If we like to impose a prior distribution on the augmented variables λ, one good choice can be an improper uniform prior. The objective of this augmented problem is an upper bound of the objective in (10) (thus, also an upper bound of MedLDA's objective due to Lemma 4.1): by Jensen's inequality, for any conditional distribution q(λ | Ξ), where Ξ = (η, Θ, Z, Φ) denotes all the random variables in MedLDA, the augmented objective dominates the objective in (10), with equality when q(λ | Ξ) is the exact conditional distribution of λ.
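The scale-mixture identity of the Polson–Scott type that underlies the lemma, ∫_0^∞ (2πλ)^{−1/2} exp(−(λ + ζ)² / (2λ)) dλ = exp(−2 max(0, ζ)), can be sanity-checked by simple quadrature. The grid size and integration cutoff below are arbitrary choices:

```python
import numpy as np

def augmented_integral(zeta, n=20000, umax=12.0):
    """Evaluate the scale-mixture integral numerically.

    Uses the substitution lambda = u^2, which removes the
    lambda^{-1/2} singularity at the origin, then trapezoidal quadrature.
    """
    u = np.linspace(1e-8, umax, n)
    lam = u * u
    f = (2.0 / np.sqrt(2.0 * np.pi)) * np.exp(-((lam + zeta) ** 2) / (2.0 * lam))
    return np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(u))  # trapezoidal rule
```

For ζ > 0 the integral matches exp(−2ζ), and for ζ ≤ 0 it equals 1, which is exactly the SVM pseudo-likelihood with the hinge term in the exponent.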

4.3 Inference with Collapsed Gibbs Sampling

Although with the above data augmentation formulation we can do Gibbs sampling to infer the complete posterior distribution q(η, λ, Θ, Z, Φ), and thus q(η, Θ, Z, Φ) by ignoring λ, the mixing rate would be slow because of the large sample space of the latent variables. One way to effectively reduce the sample space and improve mixing rates is to integrate out the intermediate Dirichlet variables (Θ, Φ) and build a Markov chain whose equilibrium distribution is the resulting marginal distribution q(η, λ, Z). We propose to use collapsed Gibbs sampling, which has been successfully used in LDA (Griffiths and Steyvers, 2004). With the data augmentation representation, this leads to an "augment-and-collapse" sampling algorithm for Gibbs MedLDA, as detailed below.

For the data augmented formulation of Gibbs MedLDA, by integrating out the Dirichlet variables (Θ, Φ), we get the collapsed posterior distribution

q(η, λ, Z) ∝ p0(η) φ(y, λ | Z, η) Π_d [ δ(C_d + α) / δ(α) ] Π_k [ δ(C_k + β) / δ(β) ],  where δ(x) = Π_i Γ(x_i) / Γ(Σ_i x_i),

Γ(·) is the Gamma function; C_k^t is the number of times that term t is assigned to topic k over the whole corpus; C_k = {C_k^t}_{t=1}^{V} is the set of word counts associated with topic k; C_d^k is the number of times that terms are associated with topic k within the d-th document; and C_d = {C_d^k}_{k=1}^{K} is the set of topic counts for document d. Then, the conditional distributions used in collapsed Gibbs sampling are as follows.

For η: Let us assume its prior is the commonly used isotropic Gaussian distribution p0(η) = Π_{k=1}^{K} N(η_k; 0, ν²), where ν is a non-zero parameter. Then, we have the conditional distribution of η given the other variables:

(12)  q(η | Z, λ) = N(η; μ, Σ),

a K-dimensional Gaussian distribution, where the posterior covariance matrix and the posterior mean are

Σ = ( I/ν² + c² Σ_d λ_d^{−1} z̄_d z̄_d⊤ )^{−1}  and  μ = Σ ( c Σ_d y_d (λ_d + c ℓ) λ_d^{−1} z̄_d ).

Therefore, we can easily draw a sample from this K-dimensional multivariate Gaussian distribution. The inverse can be robustly done using Cholesky decomposition, an O(K³) procedure. Since K is normally not large, the inversion can be done efficiently, especially in applications where the number of documents is much larger than the number of topics.
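Drawing η from the Gaussian conditional in Eq. (12) reduces to one Cholesky factorization per iteration. The sketch below illustrates the mechanics; the precision-matrix and mean forms, as well as all inputs (z̄_d, λ_d, c, ℓ, ν), follow our reading of the conditional and should be treated as assumptions:

```python
import numpy as np

def draw_eta(zbar, lam, y, c=1.0, ell=1.0, nu=1.0, rng=None):
    """Draw one sample eta ~ N(mu, Sigma) for a conditional of the form in Eq. (12).

    zbar: (D, K) mean topic assignments; lam: (D,) augmented variables; y: (D,) labels.
    Assumed forms: Sigma^{-1} = I/nu^2 + c^2 sum_d zbar_d zbar_d^T / lam_d,
    mu = Sigma (c sum_d y_d (lam_d + c*ell)/lam_d * zbar_d).
    """
    rng = rng or np.random.default_rng()
    D, K = zbar.shape
    prec = np.eye(K) / nu**2 + c**2 * (zbar.T * (1.0 / lam)) @ zbar
    b = c * ((y * (lam + c * ell) / lam)[:, None] * zbar).sum(axis=0)
    mu = np.linalg.solve(prec, b)          # posterior mean Sigma @ b
    L = np.linalg.cholesky(prec)           # prec = L L^T
    u = rng.standard_normal(K)
    # x = L^{-T} u has covariance (L L^T)^{-1} = Sigma, so mu + x ~ N(mu, Sigma).
    return mu + np.linalg.solve(L.T, u)
```

Working with the precision matrix directly avoids ever forming Σ explicitly, which is the numerically robust route mentioned in the text.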

For Z: The conditional distribution of Z given the other variables (η, λ) factorizes over documents. By canceling common factors, we can derive the conditional distribution of one variable z_dn given the others, Z_¬dn, as:

(13)  q(z_dn = k | Z_¬dn, η, λ, w_dn = t) ∝ ( (C_{k,¬n}^t + β_t)(C_{d,¬n}^k + α_k) / (Σ_{t′} C_{k,¬n}^{t′} + Σ_{t′} β_{t′}) ) · exp( c y_d (λ_d + c ℓ) η_k / (λ_d N_d) − c² ( η_k² + 2 η_k Λ_{dn} ) / (2 λ_d N_d²) ),

where the subscript ¬n indicates that word n is excluded from the corresponding document or topic counts; N_d is the number of words in document d; and Λ_{dn} = η⊤z̄_{d,¬n} is the discriminant function value without word n. We can see that the first term is from the LDA model for observed word counts and the second term is from the supervised signal y.
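The structure of one such substep — an LDA count term multiplied by an exponential tilt from the supervised signal — can be sketched as follows. The specific constants inside the supervised factor are our assumptions for illustration and should be checked against the paper's Eq. (13):

```python
import numpy as np

def sample_topic(rng, ndk, nkt, nk, V, eta, lam_d, y_d, N_d, F_rest,
                 alpha=0.1, beta=0.01, c=1.0, ell=1.0):
    """Sample one topic assignment from a collapsed conditional of the Eq. (13) form.

    ndk: (K,) per-document topic counts (current word excluded);
    nkt: (K,) corpus counts of this word's term per topic (current word excluded);
    nk:  (K,) total corpus counts per topic;
    F_rest: discriminant value computed without the current word.
    """
    lda_term = (nkt + beta) / (nk + V * beta) * (ndk + alpha)   # LDA count factor
    sup_term = np.exp(c * y_d * (lam_d + c * ell) * eta / (lam_d * N_d)
                      - c * c * (eta**2 + 2.0 * eta * F_rest) / (2.0 * lam_d * N_d**2))
    p = lda_term * sup_term
    p /= p.sum()                                                # normalize over K topics
    return rng.choice(len(p), p=p), p
```

The key practical point is that the supervised tilt is cheap: it depends on η, λ_d, and the running discriminant value, so each word can still be sampled in O(K) time as in collapsed LDA.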

1:  Initialization: set λ = 1 and randomly draw each z_dn from a uniform distribution.

2:  for m = 1 to M do
3:     draw the classifier η from the normal distribution (12)
4:     for d = 1 to D do
5:        for each word n in document d do
6:           draw the topic z_dn from the multinomial distribution (13)
7:        end for
8:        draw λ_d^{−1} (and thus λ_d) from the inverse Gaussian distribution (14).
9:     end for
10:  end for
Algorithm 1 Collapsed Gibbs Sampling Algorithm for GibbsMedLDA Classification Models

For λ: Finally, the conditional distribution of the augmented variables λ given the other variables (Z, η) is fully factorized, and we can derive the conditional distribution for each λ_d as

q(λ_d | Z, η) ∝ λ_d^{−1/2} exp( −(λ_d + c ζ_d)² / (2λ_d) ) = GIG( λ_d; 1/2, 1, c² ζ_d² ),

where GIG(x; p, a, b) ∝ x^{p−1} exp( −(a x + b/x)/2 ) is a generalized inverse Gaussian distribution (Devroye, 1986). Therefore, we can derive that λ_d^{−1} follows an inverse Gaussian distribution

(14)  p( λ_d^{−1} | Z, η ) = IG( λ_d^{−1}; 1/(c |ζ_d|), 1 ),

where IG(x; a, b) ∝ x^{−3/2} exp( −b (x − a)² / (2 a² x) ) for a, b > 0.

With the above conditional distributions, we can construct a Markov chain which iteratively draws samples of the classifier weights η using Eq. (12), the topic assignments Z using Eq. (13), and the augmented variables λ using Eq. (14), given an initial condition. To sample from an inverse Gaussian distribution, we apply the transformation method with multiple roots (Michael et al., 1976), which is very efficient with a constant time complexity per draw. Overall, drawing the global classifier weights costs O(K³ + D K²) per iteration, while drawing the topic assignments and augmented variables costs O(N K) per iteration, where N is the total number of words in all documents. If K is not large, which is the common case in practice as N is often very large, sampling the topic assignments dominates; if K is very large, which is not common in practice, drawing the global classifier weights will dominate. In our experiments, we initially set λ = 1 and randomly draw Z from a uniform distribution. In training, we run this Markov chain to finish the burn-in stage with M iterations, as outlined in Algorithm 1. Then, we draw a sample η̂ as the Gibbs classifier to make predictions on testing data.
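Sampling λ_d^{−1} from Eq. (14) needs only a single inverse Gaussian draw per document. Below is a sketch of the transformation method with multiple roots (Michael et al., 1976), assuming the IG(mean μ, shape λ) parameterization:

```python
import numpy as np

def sample_inverse_gaussian(mu, lam, rng):
    """Draw one sample from IG(mu, lam) via the transformation method
    with multiple roots (Michael et al., 1976)."""
    nu = rng.standard_normal()
    y = nu * nu
    # Smaller root of the quadratic induced by the chi-square transformation.
    x = mu + (mu * mu * y) / (2.0 * lam) \
        - (mu / (2.0 * lam)) * np.sqrt(4.0 * mu * lam * y + (mu * y) ** 2)
    # Accept the smaller root with probability mu / (mu + x); otherwise take mu^2 / x.
    return x if rng.random() <= mu / (mu + x) else mu * mu / x
```

Each draw uses one normal variate, one uniform variate, and O(1) arithmetic, which is why this substep adds almost no overhead to the sampler.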

In general, there is no theoretical guarantee that a Markov chain constructed using data augmentation converges to the target distribution (see (Hobert, 2011) for a failure example). However, for our algorithms, we can justify that the Markov transition distribution of the chain satisfies the positivity condition from (Hobert, 2011), i.e., the transition probability from one state to any other state is larger than 0. This condition implies that the Markov chain is Harris ergodic (Tan, 2009, Lemma 1). Therefore, no matter how the chain is started, our sampling algorithms can be employed to effectively explore the intractable posterior distribution. In practice, this sampling algorithm, as well as the ones to be presented, requires only a few iterations to give stable prediction performance, as we shall see in Section 6.5.1. More theoretical analysis, such as of convergence rates, requires a good bit of technical Markov chain theory and is left as future work.

4.4 Prediction

To apply the Gibbs classifier η̂, we need to infer the topic assignments for a testing document, denoted by z. A fully Bayesian treatment needs to compute an integral in order to get the posterior distribution of the topic assignments given the training data D and the testing document content w:

p(z | w, D) = ∫ p(z | w, Φ) p(Φ | D) dΦ,

where the integral is taken over the probability simplices of the topics, and the equality holds due to the conditional independence assumption of the documents given the topics. Various approximation methods can be applied to compute the integral. Here, we take the approach applied in (Zhu et al., 2012; Jiang et al., 2012), which uses a point estimate of the topics Φ from training data and makes predictions based on it. Specifically, we use a point estimate (a Dirac measure) to approximate the probability distribution p(Φ | D). For the collapsed Gibbs sampler, an estimate of Φ using the samples is the posterior mean

Φ̂_kt ∝ C_k^t + β_t.

Then, given a testing document w, we infer its latent components z using Φ̂ by drawing samples from the local conditional distribution

(15)  p(z_n = k | z_¬n) ∝ Φ̂_{k, w_n} ( C_¬n^k + α_k ),

where C_¬n^k is the number of times that the terms in this document are assigned to topic k, with the n-th term excluded. To start the sampler, we randomly set each word to one topic. Then, we run the Gibbs sampler for a few iterations until some stop criterion is satisfied, e.g., after a few burn-in steps or when the relative change of the data likelihood is lower than some threshold. Here, we adopt the latter, the same as in (Jiang et al., 2012). After this burn-in stage, we keep one sample of z for prediction using the stochastic classifier. Empirically, using the average of a few (e.g., 10) samples of z̄ could lead to slightly more robust predictions, as we shall see in Section 6.5.4.

5 Extensions to Regression and Multi-task Learning

The above ideas can be naturally generalized to develop Gibbs max-margin supervised topic models for various prediction tasks. In this section, we present two examples for regression and multi-task learning, respectively.

5.1 Gibbs MedLDA Regression Model

We first discuss how to generalize the above ideas to develop a regression model, where the response variable takes real values. Formally, the Gibbs MedLDA regression model also has two components — an LDA model to describe input bag-of-words documents and a Gibbs regression model for the response variables. Since the LDA component is the same as in the classification model, we focus on presenting the Gibbs regression model.

5.1.1 The Models with Data Augmentation

If a sample of the topic assignments Z and the prediction model η are drawn from the posterior distribution q(η, Z), we define the latent regression rule as

(16)  ŷ(η, z; w) = η⊤z̄.

To measure the goodness of the prediction rule (16), we adopt the widely used ε-insensitive loss

R_ε(η, Z) = Σ_d max( 0, |Δ_d| − ε ),

where Δ_d = y_d − η⊤z̄_d is the margin between the true score and the predicted score. The ε-insensitive loss has been successfully used in learning fully observed support vector regression (Smola and Scholkopf, 2003). In our case, the loss is a function of the predictive model η as well as the topic assignments Z, which are hidden from the input data. To resolve this uncertainty, we define the expected ε-insensitive loss

R_ε(q) = E_q[ Σ_d max( 0, |Δ_d| − ε ) ],

a function of the desired posterior distribution q(η, Z).
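The ε-insensitive loss itself is a one-line function; a sketch, with an illustrative default for ε:

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """max(0, |y - yhat| - eps): zero inside the eps-tube, linear outside."""
    return np.maximum(0.0, np.abs(np.asarray(y_true) - np.asarray(y_pred)) - eps)
```

Errors smaller than ε are not penalized at all, which is what makes support vector regression insensitive to small noise and sparse in its support vectors.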

With the above definitions, we can follow the same principle as Gibbs MedLDA to define the Gibbs MedLDA regression model as solving the regularized Bayesian inference problem

(17)  min_{q ∈ P}  L(q) + 2c R_ε(q).

Note that, as in the classification model, we have put the complete distribution q(η, Θ, Z, Φ) as the argument of the expected loss R_ε, which only depends on the marginal distribution q(η, Z). This does not affect the results because we are taking the expectation to compute R_ε and any irrelevant variables will be marginalized out.

As in the Gibbs MedLDA classification model, we can show that R_ε(q) is an upper bound of the ε-insensitive loss of MedLDA's expected prediction rule, by applying Jensen's inequality to the convex max function. We can reformulate problem (17) in the same form as problem (10), with the unnormalized likelihood

φ(y_d | z_d, η) = exp{ −2c max( 0, |Δ_d| − ε ) }.

Then, we have the dual scale of mixture representation, by noting that

(18)  max( 0, |Δ_d| − ε ) = max( 0, Δ_d − ε ) + max( 0, −Δ_d − ε ).

[Dual Scale Mixture Representation] For regression, the unnormalized likelihood can be expressed as the product of two scale mixtures of Gaussians. By the equality (18), we have φ(y_d | z_d, η) = exp{ −2c max(0, Δ_d − ε) } exp{ −2c max(0, −Δ_d − ε) }, and each of the exponential terms can be formulated as a scale mixture of Gaussians due to Lemma 4.2. Then, the data augmented learning problem of the Gibbs MedLDA regression model is

min_{q ∈ P}  L(q) − E_q[ log φ(y, λ, ω | Z, η) ],

where λ and ω are the two sets of augmented variables, one for each exponential term. Solving the augmented problem and integrating out (Θ, Φ), we can get the collapsed posterior distribution q(η, λ, ω, Z).

5.1.2 A Collapsed Gibbs Sampling Algorithm

Following similar derivations as in the classification model, the Gibbs sampling algorithm to infer the posterior has the following conditional distributions, with an outline in Algorithm 2.

For η: Again, with the isotropic Gaussian prior p0(η) = Π_k N(η_k; 0, ν²), we have

(19)  q(η | Z, λ, ω) = N(η; μ, Σ),

where the posterior covariance matrix and the posterior mean are