In recent years the field of natural language processing (NLP) has decisively shifted away from bag-of-words representations towards neural models. These models represent text using embeddings that are learned from data in an end-to-end manner. A potential drawback to such embeddings is that learned representations tend to be entangled, in the sense that an embedding is a monolithic vector that encodes some unknown set of characteristics of the input data. When one is interested in training a model solely for a particular task, entanglement is not necessarily a problem, so long as the trained model achieves sufficiently robust predictive performance. However, there are cases where it is desirable to learn a representation that factors into distinct, complementary sets of features, i.e., is disentangled.
One reason we may want a disentangled representation is interpretability. Separating representations into distinct factors that correspond to identifiable subsets of features, such as the topic and political leaning of an opinion piece, allows one to more easily reason about which features informed a prediction. A second reason to induce disentangled representations is data efficiency. Suppose that we were to train a model on images that contain $K$ categories of shapes, each of which may assume any of $M$ categories of colors. If a model can separate shape from color, then it should generalize to shape and color combinations not observed in the training data. This means that we can hope to train such a model on $O(K + M)$ examples, rather than a dataset in which all combinations of features are present, which would require $O(K \times M)$ examples. Learning disentangled representations thus provides a strategy for factorizing a problem in a high-dimensional feature space into problems in lower-dimensional feature spaces.
In computer vision, there has been considerable effort to develop methods for inducing disentangled representations in semi- and un-supervised settings [kingma2013auto-encoding, higgins2016beta, siddharth2017learning, esmaeili2018structured, zhao2017infovae, gao2018auto, achille2018information, kim2018disentangling, chen2018isolating]. Many of these approaches define deep generative models such as variational autoencoders (VAEs) [kingma2013auto, rezende2014stochastic] or generative adversarial networks (GANs) [goodfellow2014generative, chen2016infogan]. In NLP, work on learning disentangled representations has been more limited [ruder2016hierarchical, he-2017, zhang2017aspect, jain-EMNLP-18]. A large body of pre-neural work exists on aspect-based topic models that derive from Latent Dirichlet Allocation (LDA) [blei2003latent]. This includes approaches for sentiment analysis [brody2010unsupervised, sauper2010incorporating, sauper2011content, mukherjee2012aspect, sauper2013automatic, kim2013hierarchical], and models in the factorial LDA family [paul2010two, paul2012factorial, wallace2014large].
There has been relatively little work in NLP on learning disentangled representations with neural architectures. One reason for this is that work on deep generative models for text is not as well established as work for images. Early approaches in this space, such as the Neural Variational Document Model (NVDM) [miao2016neural] and autoencoding LDA [srivastava2017autoencoding], developed neural topic models in which the generative model is either an LDA-style mixture or a SAGE-style [SAGE] log-linear combination over topics. More recently there have been some efforts to develop deep generative models with interpretable aspects, chiefly the work by hu2017controlled, which combines a recurrent VAE architecture with a set of aspect discriminators to induce a structured representation.
In this paper we explore the effectiveness of neural topic models for learning disentangled representations. We treat review datasets as a particular case study, where we consider the task of learning structured representations for both the reviewer and the reviewed item. Reviews comprise several variables of interest, such as the aspect of the item being discussed, user sentiment regarding each aspect, and characteristics of the item for each aspect (i.e., sub-aspects). More concretely, in any review corpus, items will very likely share certain aspects, each affecting the rating separately. For example, in the case of restaurant reviews, all establishments will serve food and have a location. Similarly, every beer will have an aroma and appearance. More generally, each aspect may contain nested sub-aspects: A restaurant can serve Italian, Chinese, or fast food; and a beer can be dark or light in appearance.
In this paper we develop autoencoding models that induce representations of review texts that capture this structure. Such representations support aspect-based item comparisons and also provide a degree of interpretability. To realize these goals, our model, VALTA, combines topic modelling and recommender systems in a structured VAE framework. We model reviews in a structured manner by associating an aspect with each sentence in a review, and use aspect-specific topics to define a log-linear likelihood, similar to the one used in SAGE [SAGE], the NVDM [miao2016neural], and ProdLDA [srivastava2017autoencoding]. Topic and aspect weights are predicted based on user and item embeddings. The result is a highly structured model in which both aspects and sub-aspects are interpretable, and topics have high predictive power in terms of perplexity and coherence scores. The learned representations can be used for downstream tasks such as genre discovery and aspect-based retrieval.
Table 1: Notation.

| Symbol | Description |
| --- | --- |
| | number of aspects |
| | number of sub-aspects |
| | number of hidden units |
| | parameters of generative model |
| | parameters of inference model |
| | review written by user about item |
| | aspect log probabilities |
| | aspect assignment of a sentence |
| | hidden representation of item |
| | hidden representation of user |
| | hidden representation of a sentence |
| | aspect-specific topic distributions |
| | aspect-importance of item |
| | aspect-preference of user |
| | global rating bias |
| | item rating bias |
| | user rating bias |
| | true rating by user for item |
| | prediction of rating by user for item |
2 Background and Preliminaries
Review datasets have been widely studied in the context of recommender systems [bennett2007netflix]. Matrix factorization techniques [koren2009matrix, mnih2008probabilistic, bao2014topicmf] are widely used to predict ratings by representing each user and item with a low-dimensional vector, which we sometimes refer to as an embedding. Since these approaches consider the ratings alone, they ignore the text of the review, which is a key source of information. mcauley2013hidden proposed combining topic models and matrix factorization techniques to learn topics and ratings simultaneously. Subsequent approaches aimed to exploit review text in addition to ratings [mcauley2013hidden, diao2014jointly, zheng2017joint, catherine2017transnets, cheng2018aspect]. These efforts have shown that topic models can indeed act as a good regularizer for rating prediction, particularly for users or items with few reviews [mcauley2013hidden, cheng2018aspect]. In the last few years, both recommender systems and topic modelling approaches have shifted towards deep learning methods [srivastava2017autoencoding, miao2016neural, diao2014jointly], many of which also exploit text to predict ratings.
While neural recommender systems can achieve good predictive performance, it is unclear how they do so, because learned feature vectors are optimized only to code indiscriminately for (unknown) predictive combinations of attributes. Such entangled representations do not reveal any information about the structure of the data, which in turn hinders model interpretability and generalizability. By imbuing representations with probabilistic semantics, we can design models that explicitly tease out structured embeddings, components of which may then be re-used.
In prior work, deep generative models have been proposed to learn representations of text via variational autoencoders (VAEs) [kingma2013auto, rezende2014stochastic]. VAEs jointly optimize a generative network and an inference network. The former, $p_\theta(x, z)$, specifies a distribution over a set of hidden variables $z$ and observed variables $x$. The latter is a conditional distribution $q_\phi(z \mid x)$. Defining $q(x)$ as the empirical distribution, these two models are trained by optimizing the evidence lower bound (ELBO),

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q(x)} \Big[ \mathbb{E}_{q_\phi(z \mid x)} \big[ \log p_\theta(x, z) - \log q_\phi(z \mid x) \big] \Big]. \quad (1)$$
Variants of VAEs have been used to develop autoencoding topic models [srivastava2017autoencoding, miao2016neural]. These models achieve predictive performance (in terms of perplexity) that is competitive with other bag-of-words models, but lack the explicit structure that we aim to capture here. More specifically, the prior in existing VAE-based approaches takes the form of a Gaussian with a diagonal covariance matrix, where each dimension of the Gaussian corresponds to a topic. By contrast, we here aim to characterize groups of topics that correspond to specific aspects of interest. One means of realizing this would be to posit $K$ Gaussians, one per aspect, each comprising its own set of topics.
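For reference, the KL regularizer that such diagonal-Gaussian approaches add to the ELBO has a simple closed form. The following is a minimal sketch (NumPy; the dimensionality is illustrative only, not the paper's configuration):

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.

    Closed form: 0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2).
    """
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

# A posterior equal to the standard-normal prior incurs zero KL penalty.
mu = np.zeros(50)        # e.g., one mean per topic dimension (hypothetical size)
log_var = np.zeros(50)   # log-variance of 0 => unit variance
print(gaussian_kl(mu, log_var))  # 0.0
```

Because each dimension contributes independently to this penalty, nothing in the objective itself ties groups of dimensions to aspects, which motivates the structured prior developed below.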
To learn structured representations of reviews, we begin by identifying key axes of variation in review datasets. We define three variable categories: items, users, and review texts. We assume $K$ aspects of interest for all items. For example, in the case of beer reviews, these aspects correspond to properties such as appearance and aroma. We further decompose each aspect into $T$ topics. A topic within appearance might be, e.g., dark versus pale beer. Reflecting these structural assumptions, our model defines aspect-specific embeddings that in turn yield distributions over topics. Thus, a representation of a sweet, dark beer should place a relatively large mass on the dark topic of appearance, and a large mass on the sweet topic of taste.
The relative importance of aspects may vary for both items and users. A restaurant, for example, may be located on the water or on a famous city street, in which case the location is likely to be its most salient aspect. Similarly, lagers are not typically renowned for their smell. Users will have their own weightings of aspect importance. A particular user may be concerned primarily with food quality over price, and might prefer Chinese food. Others, meanwhile, may emphasize location or ambience. This sort of structure is similar to the aspect-aware topic model proposed in [cheng2018aspect].
The words contained in a review are a function of the aspects and topics, and their relative importance for particular user-item pairs. To accurately learn topics and predict ratings, we now introduce variables that are defined at the review level. A naive approach would be to encode the review text and then train the generative model to learn both topics and ratings for item $i$ and user $u$. Here we propose an approach that is directly motivated by the observation that the aspects and topics discussed in a review $r_{ui}$ will depend on a combination of the aspect preferences of $u$ and the relative salience of the respective aspects for $i$. Therefore, rather than encoding $r_{ui}$ directly, we encode information about $u$ and $i$ separately, and then combine these representations to yield a joint embedding for $r_{ui}$ and predict the rating $\hat{r}_{ui}$. Table 1 presents the notation we use throughout this paper.
It is likely that most reviews will contain at least some words about all aspects (although the prevalence of individual aspects will vary across reviews). Thus it is intuitive to attempt to infer which parts of a review talk about which aspect. In our model we make the simplifying assumption that every sentence within a review discusses only a single aspect. One could alternatively assign aspects at the word- or paragraph-level. However, sentence-level assignment constitutes an intuitive compromise, and is also consistent with prior work [mcauley2012learning, lu2011multi]. Note that while the aspect assignment varies between sentences within a review, we keep the topic proportions fixed for that particular review.
Following prior work [mcauley2013hidden, diao2014jointly, cheng2018aspect], we assume the input representation for item $i$ is a bag-of-words vector encoding the words used across all reviews written about this item. Similarly, we define the input vector for user $u$ as a bag-of-words induced over all reviews that they have written. This representation has been shown to perform well in terms of capturing characteristics of items and users [mcauley2013hidden, catherine2017transnets, zheng2017joint], but it does not take into account the relative importance of different aspects with respect to both $u$ and $i$. Nor have such models explicitly accounted for the intuitive observation that different parts of reviews (probably) discuss different aspects, which we achieve via sentence-wise aspect assignments based on encoded sentences (one per sentence in the review written by $u$ for $i$).
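As a toy illustration of this input representation (the five-word vocabulary and the reviews below are hypothetical, not the paper's preprocessing pipeline), the user and item inputs can be built by pooling word counts across their reviews:

```python
from collections import Counter

VOCAB = ["hoppy", "dark", "sweet", "pale", "bitter"]
WORD2IDX = {w: k for k, w in enumerate(VOCAB)}

def bow_vector(reviews):
    """Bag-of-words counts pooled over a collection of review texts."""
    counts = Counter(w for r in reviews for w in r.split() if w in WORD2IDX)
    return [counts[w] for w in VOCAB]

# All reviews written about one item ...
item_reviews = ["dark sweet malty", "dark bitter"]
# ... and all reviews written by one user.
user_reviews = ["pale hoppy", "sweet pale"]

print(bow_vector(item_reviews))  # [0, 2, 1, 0, 1]
print(bow_vector(user_reviews))  # [1, 0, 1, 2, 0]
```

The two pooled vectors are what the user and item encoders consume; the per-sentence encodings used for aspect assignment would be built the same way, one vector per sentence.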
We provide a schematic of our model in Figure 1. The inference and generative models are defined to codify the structure discussed above. Specifically, given the topic distribution for item $i$ and user $u$, the sentence aspect assignments, and the review $r_{ui}$, we define the inference model
An obvious choice for the likelihood model is to define a decoder for the entire review $r_{ui}$. However, this would entangle the different aspects discussed in said review. To ensure that the generative model associates different dimensions with specific aspects, we define our generative network at the sentence level
We note that the topic distribution $\theta$ is fixed at the review level. This reflects the assumption that, given the item and user, the specific topics of interest will not change, as the opinion of the user and the characteristics of the item are fixed. The only axis of variation is the user’s decision regarding which aspect to write about in any given sentence. However, we must ensure that the generative model focuses only on the assigned aspect, rather than the topics of all aspects. We enforce this by multiplying the columns of $\theta$ with the (nearly) one-hot aspect assignment vector $a_s$: $\theta_s = a_s^\top \theta$. Because $a_s$ resembles a one-hot vector, this effectively masks the topic distributions pertaining to the other (unassigned) aspects. Thus, only the topic distribution corresponding to the single selected aspect is responsible for reconstructing the sentence.
We use the generative model for text introduced in prior work [srivastava2017autoencoding]. This model induces probabilities for each word by feeding the masked topic vector $\theta_s$ through a single-layer neural network and applying a log softmax,

$$\log p(w \mid \theta_s) = \log \mathrm{softmax}(W \theta_s)_w,$$

where the log word probabilities are computed using a single-layer perceptron with weights $W$.
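The masking and decoding steps can be sketched as follows (NumPy; the shapes and random weights are illustrative stand-ins, not the trained model):

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()  # shift for numerical stability
    return z - np.log(np.exp(z).sum())

K, T, V = 3, 4, 10  # aspects, topics per aspect, vocabulary size (hypothetical)
rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(T), size=K)  # per-aspect topic proportions, (K, T)
W = rng.normal(size=(T, V))                # single-layer decoder weights

a_s = np.array([0.0, 1.0, 0.0])  # (nearly) one-hot aspect assignment for sentence s
theta_s = a_s @ theta            # mask: keeps only the selected aspect's topics
log_probs = log_softmax(theta_s @ W)  # log word probabilities for the sentence

assert np.allclose(np.exp(log_probs).sum(), 1.0)  # a proper distribution over words
assert np.allclose(theta_s, theta[1])  # an exact one-hot selects aspect 1's topics
```

With an exactly one-hot `a_s`, the product reduces to row selection; with the relaxed (Concrete) samples used in training, it is a soft selection dominated by one aspect.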
Table 2: Dataset statistics.

| Dataset | Aspects | Items | Users | Reviews | Sentences |
| --- | --- | --- | --- | --- | --- |
| Beer (BeerAdvocate) | Aroma, Taste, Mouthfeel, Look | 4,923 | 2,017 | 127,346 | 1,515,517 |
| Restaurant (Yelp) | Price, Ambiance, Food, Service | 13,847 | 6,588 | 140,139 | 1,416,317 |
| Clothing (Amazon) | Formality, Appearance, Type | 12,203 | 73,903 | 80,285 | 447,920 |
| Movie (Amazon) | Genre, Awards, Screen Play | 7,590 | 2,288 | 100,489 | 1,446,690 |
3.1 Concrete Distribution
An important factor in our model is the choice of distributions for the aspect assignments $a_s$ and the topic proportions $\theta$. The natural choices are discrete and Dirichlet distributions, respectively, as they represent aspect assignments and topic proportions. This is problematic in practice because discrete variables are not amenable to the reparameterization trick, thus precluding gradient estimation via standard backpropagation. In the case of Dirichlet distributions, several methods have been proposed to allow for sampling via reparameterization [ruiz2016generalized, figurnov2018implicit]. However, in practice these methods dramatically increase training time in our implementation because the base system, PyTorch, does not provide GPU implementations for these distributions at the time of writing. In this work we choose to model both variables $a_s$ and $\theta$ using the Concrete distribution, a relaxation of discrete distributions implemented via a Gumbel softmax [maddison2016concrete, jang2016categorical]. The Gumbel distribution can be sampled in a reparameterized way by drawing $u \sim \mathrm{Uniform}(0, 1)$ and then computing $g = -\log(-\log u)$. If $a_s$ has aspect log-probabilities $\pi_1, \ldots, \pi_K$, then we can sample from a continuous approximation of the discrete distribution by sampling a set of i.i.d. $g_1, \ldots, g_K$ and applying the transformation

$$a_{s,k} = \frac{\exp\big((\pi_k + g_k) / \tau\big)}{\sum_{k'=1}^{K} \exp\big((\pi_{k'} + g_{k'}) / \tau\big)},$$
where $\tau$ is a temperature parameter controlling the relaxation. The sample is a continuous approximation of the desired one-hot representation. The role of $\tau$ is critical in our model, as it dictates the peakiness of the samples. In the case of $a_s$, we keep the temperature low to enforce the assumption that each sentence discusses only a single aspect. However, we do not wish for the topic proportions $\theta$ to be close to a one-hot vector, as this would restrict items to a single topic within each aspect. To encourage $\theta$ to mimic a Dirichlet distribution, we set the temperature to higher values, thereby encouraging all dimensions within each aspect to contribute. In our experiments, we observed that sampling $\theta$ with a low temperature results in few dimensions within each aspect learning anything meaningful about the review.
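The sampling procedure above can be sketched in a few lines (NumPy; the two temperatures mirror the low/high settings used for aspect assignments and topic proportions):

```python
import numpy as np

def concrete_sample(log_probs, tau, rng):
    """Reparameterized sample from a Concrete (Gumbel-softmax) relaxation."""
    u = rng.uniform(size=log_probs.shape)
    g = -np.log(-np.log(u))         # Gumbel(0, 1) noise
    z = (log_probs + g) / tau       # temperature-scaled perturbed logits
    z = z - z.max()                 # shift for numerical stability
    return np.exp(z) / np.exp(z).sum()  # softmax => point on the simplex

rng = np.random.default_rng(0)
log_probs = np.log(np.array([0.7, 0.2, 0.1]))

hard = concrete_sample(log_probs, tau=0.66, rng=rng)  # peaky, near one-hot (a_s)
soft = concrete_sample(log_probs, tau=5.0, rng=rng)   # spread out (theta)
assert np.isclose(hard.sum(), 1.0) and np.isclose(soft.sum(), 1.0)
```

Every operation here is differentiable in `log_probs`, which is what makes the relaxation compatible with standard backpropagation, unlike a hard `argmax` draw.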
3.2 Rating Prediction
A good representation of a review should not only contain informative topics, but should also assist in accurately predicting the rating linked to the review. In this subsection, we extend VALTA to predict ratings in combination with learning aspects and topics. We take several factors into account when predicting the rating. As discussed above, we assume that users have different aspect preferences and that items exhibit different aspect importance. To extract aspect-importance vectors $\gamma_i$ and $\gamma_u$ for item $i$ and user $u$, respectively, we use the weights of the sentence encoder that is responsible for predicting the aspect assignment.
As this encoder is trained at the sentence level (and is thus compelled to extract words associated with aspects), we can re-use its weights to extract an aspect-importance vector from a collection of reviews. We then average these two embeddings to obtain the aspect importance for a particular $(u, i)$ pair
One could consider learning item and user embeddings that are separate from those used in the topic model. However, as discussed in [mcauley2013hidden], coupling the item and user embeddings with their topic model representations helps to learn topics that explain the diversity in ratings. Thus, in our model we re-use the input to the Concrete distribution to predict the rating associated with each aspect as
Based on this structure, we predict the overall rating as

$$\hat{r}_{ui} = \mu + b_i + b_u + \sum_{k=1}^{K} \gamma_{ui,k} \, r_{ui,k},$$
where $\mu$ is the global rating bias, and $b_i$ and $b_u$ are the item and user biases, respectively. This approach is similar to the family of aspect-aware latent factor models (ALFM) proposed in [cheng2018aspect]. Following prior work, we use the mean squared error (MSE) loss for the recommender model component. This may also be interpreted probabilistically, where we model the rating $r_{ui}$ as a Gaussian distribution centered at the prediction $\hat{r}_{ui}$.
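As an illustrative sketch only (the exact form of the paper's prediction equation did not survive extraction, so the bias-plus-weighted-sum form and all values below are assumptions in the ALFM spirit):

```python
import numpy as np

def predict_rating(mu, b_i, b_u, gamma_ui, aspect_ratings):
    """Global/item/user biases plus an aspect-importance-weighted sum of
    per-aspect rating contributions; gamma_ui sums to 1 over aspects."""
    return mu + b_i + b_u + float(np.dot(gamma_ui, aspect_ratings))

mu, b_i, b_u = 3.5, 0.2, -0.1                # hypothetical bias terms
gamma_ui = np.array([0.5, 0.3, 0.2])         # aspect importance for this (u, i)
aspect_ratings = np.array([0.4, -0.2, 0.1])  # per-aspect rating contributions

print(predict_rating(mu, b_i, b_u, gamma_ui, aspect_ratings))  # 3.76
```

The MSE loss between such predictions and the observed ratings is what couples the recommender component to the topic model during training.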
In this subsection, we put everything together to define a unified objective for VALTA. For clarity, we decompose our objective into four terms, which together define a lower bound on the log marginal likelihood, analogous to the VAE objective defined in equation 1.
The first term is the expected log likelihood of the review
Note that this expectation is defined w.r.t. the inference model $q_\phi$, which we omit for simplicity.
The second term is the likelihood of the rating
Finally, as with a normal VAE we incorporate two regularization terms in the form of KL divergences between the encoder distribution and the prior
where the four terms are responsible for reconstructing the review text, predicting the rating, matching the aspect distribution in the encoder to its prior, and matching the topic distributions in the encoder to their prior, respectively.
4 Related Work
A comprehensive literature review of recommender systems is beyond the scope of this work. Here we discuss models that exploit both ratings and reviews to jointly learn topics and predict ratings. We divide these models into three classes: 1) probabilistic topic models; 2) deep learning-based approaches; and 3) VAEs. VALTA belongs to the last category.
In the first class, the most closely related approach to our work is the aspect-aware topic model (ATM) [cheng2018aspect], which considers a similar decomposition of reviews to aspects and sub-aspects. In the same paper, the authors also propose an aspect-aware latent factor model (ALFM) which exploits the parameters learned from the ATM to predict ratings. While VALTA shares the idea of further decomposition of aspects with ATM, it is trained to learn topics and predict ratings jointly rather than sequentially.
Another related model is factorial LDA [NIPS2012_4784], which learns a factorized topic structure. Its approach to learning structured topics differs from VALTA's in that factorial LDA learns topics as tuples, while VALTA learns topics as hierarchies. Other approaches similar to ours include [mcauley2013hidden, diao2014jointly, zhao2015improving, zhang2014explicit]; however, they are all purely probabilistic models, and furthermore they do not consider hierarchical topics.
Two recent, closely related deep learning-based methods are [catherine2017transnets, zheng2017joint]. Both exploit the review texts for $(u, i)$ pairs to predict ratings. While these models perform well in terms of rating prediction, they are not designed to learn topics. VAEs have also been used for collaborative filtering [li2017collaborative, karamanolakis2018item]. However, as pointed out by [liang2018variational], these approaches tend to under-fit the data. In the vision domain, the idea of capitalizing on more complex priors in VAEs has become popular and has been strongly associated with disentanglement [kingma2014semi, narayanaswamy2017learning, esmaeili2018structured]. However, this idea has received less attention in natural language processing.
To our knowledge, VALTA is the only VAE-based approach that considers hierarchical topics. Furthermore, our architecture is unique in that it couples a sentence-level decoder with item and user encoders. We also note that aspect classification has been separately studied in the context of sentiment analysis [poria2016aspect, mcauley2012learning, lu2011multi, schouten2016survey].
To assess the structured representations learned by VALTA, we evaluate on a number of tasks and datasets. The experiments are designed to evaluate the quality of aspects and topics, rating prediction, and the structure of the representations. We implement all VAE-based models in Probabilistic Torch [siddharth2017learning], a library for deep probabilistic models. In all experiments, we set the temperature for $a_s$ and $\theta$ to 0.66 and 5.0, respectively. The number of hidden units for all models is 256, followed by a for the beer review and for all other datasets.
The main findings in our experiments are as follows.
VALTA can disentangle aspects of a review in a fully unsupervised manner. We demonstrate this on the CitySearch and BeerAdvocate datasets, which have been annotated with aspect-specific ratings [mcauley2012learning, ganu2009beyond].
VALTA learns word distributions for every aspect and topic that have higher coherence scores [lau2014machine] than baseline methods at both the sentence and review level, indicating greater interpretability.
VALTA learns a representation that can be used to make aspect-based comparisons of items and users.
In all but one dataset, VALTA produces the most accurate rating predictions of all models considered.
5.1 Datasets and Preprocessing
A summary of the data we use in this paper is shown in Table 2. We focus on datasets that exhibit clear structure. We chose the BeerAdvocate dataset (http://snap.stanford.edu/data/) and restaurant data from the Yelp Dataset Challenge (https://www.yelp.com/dataset_challenge/), as they both contain explicit aspects. For the other two datasets, we selected Clothing and Movie reviews from Amazon [mcauley2015image, he2016ups].
We preprocessed the datasets as follows. We used the spaCy library to remove all stop words, and we removed all words that occurred fewer than five times. We also used spaCy for sentence segmentation. The resultant vocabulary size for these datasets varies from 20,000 to 30,000. We also filtered reviews such that we include only items and users for which we have at least five reviews.
We compare our model to a diverse set of baseline models, including probabilistic, VAE-based, and aspect-based models. We also include a simplified version of our own model, which we call variational review LDA (VRLDA); it follows the implementation of VALTA save for the aspects. In other words, the representation used by VRLDA is a single flat vector. The full list of baselines is: LDA [blei2003latent], Local-LDA [brody2010unsupervised], HFT [mcauley2013hidden], MF [koren2009matrix], NVDM [miao2016neural], ProdLDA [srivastava2017autoencoding], and VRLDA.
In Table 5, we show the top 10 words per topic for the BeerAdvocate and Yelp data. Words associated with sub-aspects are clearly related to each other. For example, in the beer dataset, the topic “dark” contains words such as “black”, “tans”, and “brown”. Furthermore, we can see that the topics within every aspect are also correlated with one another. In the beer example, if we look at the topic neighbours of “dark”, we can see the topic “yellow”. Note that the “dark” and “yellow” topics are learned within the same aspect in our model. The same pattern can be observed in the Yelp data, where we recover topics corresponding to food types, such as “Chinese”, “Pizza”, and “Breakfast”.
5.4 Quantitative Assessment
We perform several quantitative evaluations of our model. We first demonstrate that we can successfully disentangle different aspects at the sentence level. We evaluate this on the two available annotated datasets: CitySearch [ganu2009beyond] and BeerAdvocate [mcauley2012learning]. Prior work on sentence aspect classification shows that Local-LDA is one of the most successful models at capturing aspects in an unsupervised way [lu2011multi]. Therefore, we compare against both LDA and Local-LDA. We also train a fully supervised SVM classifier (using the SVM implementation in scikit-learn 0.19.2 [scikit-learn]) on the labeled data. As presented in Table 3, VALTA outperforms the other approaches in terms of both accuracy and F1 score.
Next, we quantitatively evaluate the top words learned for each topic. According to lau2014machine, NPMI is a good metric for evaluating topics in terms of matching human judgment. We measure NPMI both at the sentence and the review level to take both aspects and topics into account. For every baseline, we compare only at the input level on which it was trained. For example, LDA is trained at the review level, while Local-LDA is trained at the sentence level. We report results in Table 4. We can see that VALTA performs significantly better than the baselines at both the sentence and review levels. Note that we keep the overall number of topics for all baselines the same as for VALTA.
5.5 Genre Discovery and Aspect-Based Analysis
In Table 5, we show that we can successfully learn a structured representation of aspects and topics. An interesting question is whether, based on this representation, we have managed to cluster the items in a reasonable way. Furthermore, can we now perform an aspect-based comparison of different items? In this section, we investigate this question both qualitatively and quantitatively. We hypothesize that if the learned representation of an item is sufficiently rich in capturing the structure of the data, then even a simple classifier should be able to accurately distinguish between the categories. After training VALTA and the other baselines, we fit a multi-class SVM to classify the items (using the SVC implementation in scikit-learn 0.19.2). Results are shown in Figure 2 (left). VALTA outperforms other approaches with respect to clustering items in an unsupervised manner due to its structured nature.
To inspect VALTA’s ability to enable aspect-based comparison, we perform the following experiments: for both the BeerAdvocate and Yelp restaurant data, we manually select three categories of items that differ with respect to every aspect. We encode all the reviews associated with these items and plot the histogram of the topic-proportion parameters, as well as their kernel density estimates, for different aspects.
Figure 2 shows that item representations cluster appropriately within topics. For example, American Porter is a sweet, dark beer; thus the histogram for American Porter lies on the “dark” side of the appearance aspect and on the “sweet” side of the taste aspect. American IPA, on the other hand, is a bitter, pale beer; thus it has very little overlap with American Porter in either aspect. Note that in the beer example we trained with two topics per aspect. Since these are parameters of the Concrete distribution, which sum to one, we only need to plot one of the two values, as the other provides no additional information.
5.6 Recommendation Performance
As noted in the methodology section, VALTA’s generative model is also trained to predict ratings for pairs of users and items, based on their aspect and topic representations. Results in terms of MSE are shown in Table 6. VALTA outperforms the baselines on two of the datasets by taking both the rating and our structured review representation into account, and performs reasonably close to the state of the art (HFT) in the other cases.
We have proposed VALTA, a novel VAE-based model that instantiates structured probabilistic topic models in combination with an inference network to learn aspect-based representations of reviews. VALTA uncovers interpretable aspects, and additional structure (sub-aspects) beneath these. These representations enable one to measure similarity with respect to individual aspects, and thus to perform aspect-wise clustering. Furthermore, we demonstrated that these representations afford improved generalization, as assessed in zero-shot settings.
Our hope is that structured (disentangled) representations will see increased development and use in natural language processing (NLP) applications, as these may allow greater generalizability and transparency.
Appendix A NPMI estimation
We define the probability of a word $w_i$, $p(w_i)$, and the joint probability of two words $w_i$ and $w_j$ occurring together, $p(w_i, w_j)$, as their relative document frequencies. Let $D(w_i)$ be the total number of documents in which $w_i$ is present and $D(w_i, w_j)$ be the total number of documents in which $w_i$ and $w_j$ are both present:

$$p(w_i) = \frac{D(w_i)}{N}, \qquad p(w_i, w_j) = \frac{D(w_i, w_j)}{N},$$

where $N$ is the total number of documents.
NPMI is typically computed over the top words of a given topic. The NPMI for a given pair of words is

$$\mathrm{NPMI}(w_i, w_j) = \frac{\log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}}{-\log p(w_i, w_j)}.$$
In order to compute the NPMI for a particular topic $t$, we average this value over all pairs of its top words $w_1, \ldots, w_n$:

$$\mathrm{NPMI}(t) = \frac{1}{\binom{n}{2}} \sum_{i < j} \mathrm{NPMI}(w_i, w_j).$$
Finally, we compute the overall NPMI as the average of NPMIs for every topic:
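Putting the definitions above together, a self-contained sketch of the per-topic NPMI computation (toy corpus; this sketch assumes no top word occurs in every document, which would make the denominator zero):

```python
import math
from itertools import combinations

def npmi(w1, w2, docs):
    """NPMI of a word pair under document-level co-occurrence probabilities."""
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    if p12 == 0.0:
        return -1.0  # words never co-occur: minimum NPMI
    return math.log(p12 / (p1 * p2)) / -math.log(p12)

def topic_npmi(top_words, docs):
    """Average NPMI over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, docs) for a, b in pairs) / len(pairs)

# Each document is represented as its set of words.
docs = [{"dark", "black", "brown"}, {"dark", "brown"}, {"pale", "yellow"}]
print(topic_npmi(["dark", "brown"], docs))  # 1.0
```

Words that always co-occur (and share the same marginals) score 1, words that never co-occur score -1, matching the normalization described above.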
Appendix B Top Words
In Figure 7, we show the top 10 words for the datasets.