HyperMiner
Improved Embedded Topic Models in Hyperbolic Space
Embedded topic models are able to learn interpretable topics even with large and heavy-tailed vocabularies. However, they generally hold the Euclidean embedding space assumption, leading to a basic limitation in capturing hierarchical relations. To this end, we present a novel framework that introduces hyperbolic embeddings to represent words and topics. With the tree-likeness property of hyperbolic space, the underlying semantic hierarchy among words and topics can be better exploited to mine more interpretable topics. Furthermore, due to the superiority of hyperbolic geometry in representing hierarchical data, tree-structure knowledge can also be naturally injected to guide the learning of a topic hierarchy. Therefore, we further develop a regularization term based on the idea of contrastive learning to inject prior structural knowledge efficiently. Experiments on both topic taxonomy discovery and document representation demonstrate that the proposed framework achieves improved performance against existing embedded topic models.
With a long track record of success in a variety of applications Wang et al. (2007); Mimno et al. (2009); Rubin et al. (2012); Jiang et al. (2015); Wang et al. (2018); Jelodar et al. (2020), topic models have emerged as one of the most powerful tools for automatic text analysis. Typically, given a collection of documents, a topic model aims to identify a group of salient topics by capturing common word co-occurrence patterns. Despite their popularity, traditional topic models such as Latent Dirichlet Allocation (LDA) Blei et al. (2003) and its variants Griffiths et al. (2003); Blei and Lafferty (2006); Mcauliffe and Blei (2007); Zhou et al. (2012); Paisley et al. (2014)
are plagued by complicated posterior inference, presenting a challenge to create deeper and more expressive models of text. Fortunately, recent developments of Variational AutoEncoders (VAEs) and Autoencoding Variational Inference (AVI)
Kingma and Welling (2013); Rezende et al. (2014) have shed light on this problem, resulting in the proposal of a series of Neural Topic Models (NTMs) Miao et al. (2016); Srivastava and Sutton (2017); Zhang et al. (2018); Nan et al. (2019). With better flexibility and scalability, NTMs have gained increasing research interest over the past few years.

Parallel to neural topic modeling, the idea of bringing word embeddings Mikolov et al. (2013a, b) into topic models has also attracted much attention. Considering the large performance degradation on short texts due to limited word co-occurrence information, some early works Petterson et al. (2010); Nguyen et al. (2015); Li et al. (2016) exploit word embeddings as complementary metadata and incorporate them into the generative process of topic models. Recently, more flexible ways Zhao et al. (2018b); Wang et al. of combining word embeddings have been explored thanks to the development of NTMs. For example, Bianchi et al. (2020) use word embeddings directly as part of the encoder’s input. In particular, a novel one called the Embedded Topic Model (ETM) Dieng et al. (2020)
stands out for its performance as well as the elegant way it integrates word embeddings. Specifically, by representing topics as points in the word embedding space, ETM assigns probabilities to words based on their (inner product) distances from each topic embedding. As a result, semantically related words tend to fall around the same topic, thus facilitating the discovery of more interpretable topics.
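This softmax-over-similarities construction can be sketched in a few lines of NumPy. This is an illustrative sketch with made-up shapes and variable names, not the authors' implementation:

```python
import numpy as np

def etm_topic_word_dist(rho, alpha):
    """Per-topic distribution over the vocabulary: beta_k = softmax(rho^T alpha_k).

    rho:   (L, V) word embedding matrix (L = embedding dim, V = vocab size)
    alpha: (L, K) topic embeddings (K = number of topics)
    Returns beta of shape (K, V); each row is a distribution over words.
    """
    logits = alpha.T @ rho                       # (K, V) inner-product similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    beta = np.exp(logits)
    return beta / beta.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
beta = etm_topic_word_dist(rng.normal(size=(16, 100)), rng.normal(size=(16, 4)))
```

Words whose embeddings have a large inner product with a topic embedding receive high probability under that topic, which is the mechanism described above.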
Under the inspiration of ETM, Duan et al. (2021a)
have extended a similar idea to hierarchical topic modeling and proposed SawETM. In addition to mapping words and hierarchical topics into a shared embedding space, SawETM develops a unique Sawtooth Connection module to capture the dependencies between topics at different layers, which, in turn, empowers it to support a deep network structure. While achieving promising results, both ETM and SawETM hold the Euclidean embedding space assumption, leading to a fundamental limitation: their ability to model complex patterns (such as social networks, knowledge graphs, and taxonomies) is inherently bounded by the dimensionality of the embedding space
Nickel and Kiela (2017); Nickel et al. (2014). As a consequence, the underlying semantic hierarchy among the words and topics can hardly be expressed adequately in a relatively low-dimensional embedding space, as illustrated on the left side of Figure 1.

Apart from the difficulty in capturing the implicit semantic hierarchy, another concomitant problem is the dilemma of incorporating explicit structural knowledge. Assuming we have a prior taxonomy of concepts and wish to use it to guide the learning of hierarchical topics, it is challenging to preserve the structure between concepts in Euclidean space by constraining the word and topic embeddings. To cope with this issue, TopicNet Duan et al. (2021b)
employs the Gaussian-distributed embeddings as a substitute for the vector embeddings to represent words and topics. As such, the prior knowledge of hypernym relations between concepts could be naturally injected via the encapsulation of probability densities. However, maintaining the semantic hierarchy in such an embedding space still suffers from a certain degree of distortion, as it poses a challenge to the optimization of KL divergence between distributions. Furthermore, the introduction of Gaussian-distributed embeddings entails a great demand on memory, limiting its potential scalability to large vocabularies and high-dimensional embedding spaces.
To overcome the above shortcomings brought by the Euclidean embedding space, we propose to compute embeddings in hyperbolic space. Distinguished by its tree-likeness properties Gromov (1987); Hamann (2018), hyperbolic space has been consistently shown to be superior to Euclidean space in modeling hierarchical data Ganea et al. (2018); Sala et al. (2018); Tifrea et al. (2018); Cho et al. (2019). By measuring the distance between words and topics in hyperbolic embedding space, the model is encouraged to better capture the underlying semantic hierarchy among words. As shown on the right side of Figure 1, general words such as “new” and “does” fall around the center; they stay close to all other points because they often co-occur with other words. More specific words like “moon” and “nasa”, in contrast, fall near the boundary and are only close to nearby points. Moreover, hyperbolic space also provides a better platform for injecting prior structural knowledge, since hierarchical relations can be effectively preserved by imposing constraints on the distances between word and topic embeddings. In a nutshell, the main contributions of this paper are as follows:
We propose to compute the distance between topics and words in hyperbolic embedding space on the basis of existing embedded topic models, which is beneficial to both the mining of implicit semantic hierarchy and the incorporation of explicit structural knowledge.
We design a node-level graph representation learning scheme that can inject prior structural knowledge to effectively guide the learning of a meaningful topic taxonomy.
Extensive experiments on topic quality and document representation demonstrate that the proposed approach achieves competitive performance against baseline methods.
ETM Dieng et al. (2020) is a neural topic model that builds on two main techniques: LDA Blei et al. (2003) and word embeddings Mikolov et al. (2013a, b). To marry the probabilistic topic modeling of LDA with the contextual information brought by word embeddings, ETM maintains vector representations of both words and topics and uses them to derive the per-topic distribution over the vocabulary. Specifically, consider a corpus whose vocabulary comprises $V$ distinct terms; we denote the word embedding matrix as $\rho \in \mathbb{R}^{L \times V}$, where $L$ is the dimensionality of the embedding space. For each topic $k$, there is also an embedding representation $\alpha_k \in \mathbb{R}^{L}$; ETM then defines the per-topic distribution $\beta_k$ over the vocabulary as
$$\beta_k = \mathrm{softmax}\big(\rho^{\top} \alpha_k\big) \qquad (1)$$
With the above definition, ETM specifies a generative process analogous to LDA. Let $w_{dn}$ denote the $n$-th word in the $d$-th document; the generative process is as follows.

1. Draw topic proportions $\theta_d \sim \mathcal{LN}(0, I)$.
2. For each word $w_{dn}$ in the document:
   1. Draw topic assignment $z_{dn} \sim \mathrm{Cat}(\theta_d)$.
   2. Draw word $w_{dn} \sim \mathrm{Cat}(\beta_{z_{dn}})$.
Here $\mathcal{LN}(0, I)$ in step 1 denotes the logistic-normal distribution Atchison and Shen (1980), which transforms a standard Gaussian random variable to the simplex. By taking the inner product of the word embedding matrix $\rho$ and the topic embedding $\alpha_k$ to derive $\beta_k$, the intuition behind ETM is that semantically related words will be assigned to similar topics. With this property, ETM has been demonstrated to improve the quality of the learned topics, especially in the presence of large vocabularies. Like most NTMs, ETM is fitted via an efficient amortized variational inference algorithm.

In this part, we briefly review some key concepts of hyperbolic geometry. A comprehensive and in-depth description can be found in Lee (2013) and Nickel and Kiela (2018). Under the mathematical framework of Riemannian geometry, hyperbolic geometry specializes in the case of constant negative curvature. Intuitively, hyperbolic space can be understood as a continuous version of trees: the volume of a ball expands exponentially with its radius, just as the number of nodes in a binary tree grows exponentially with its depth. Mathematically, there exist multiple equivalent models of hyperbolic space with different definitions and metrics. Here, we consider two representative ones in light of optimization simplicity and stability: the Poincaré ball model Nickel and Kiela (2017) and the Lorentz model Nickel and Kiela (2018).
The Poincaré ball model of an $n$-dimensional hyperbolic space with curvature $-c$ ($c > 0$) is defined by the Riemannian manifold $(\mathbb{B}^{n}_{c}, g^{\mathbb{B}})$, where $\mathbb{B}^{n}_{c} = \{x \in \mathbb{R}^{n} : c\|x\|^{2} < 1\}$ is the open $n$-dimensional ball with radius $1/\sqrt{c}$ and $g^{\mathbb{B}}$ is the metric tensor that can be converted from the Euclidean metric tensor $g^{E} = \mathbf{I}_{n}$ as

$$g^{\mathbb{B}}_{x} = \big(\lambda^{c}_{x}\big)^{2} g^{E}, \quad \text{with } \lambda^{c}_{x} = \frac{2}{1 - c\|x\|^{2}} \qquad (2)$$
The Lorentz model (also named the hyperboloid model) of an $n$-dimensional hyperbolic space with curvature $-c$ ($c > 0$) is defined by the Riemannian manifold $(\mathbb{H}^{n}_{c}, g^{\mathbb{H}})$, where $\mathbb{H}^{n}_{c} = \{x \in \mathbb{R}^{n+1} : \langle x, x \rangle_{\mathcal{L}} = -1/c,\ x_{0} > 0\}$ and $g^{\mathbb{H}} = \mathrm{diag}(-1, 1, \dots, 1)$. Let $\langle \cdot, \cdot \rangle_{\mathcal{L}}$ denote the Lorentzian inner product. For $x, y \in \mathbb{R}^{n+1}$, the inner product induced by $g^{\mathbb{H}}$ is calculated as

$$\langle x, y \rangle_{\mathcal{L}} = -x_{0} y_{0} + \sum_{i=1}^{n} x_{i} y_{i} \qquad (3)$$
An intuitive illustration of the equivalence between the Poincaré ball model and the Lorentz model, together with some other related operations in hyperbolic space, is provided in Appendix C.
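The equivalence of the two models can also be checked numerically with the standard diffeomorphism that lifts a Poincaré-ball point onto the hyperboloid. A minimal sketch for curvature $-1$, with names of our choosing:

```python
import numpy as np

def lorentz_inner(x, y):
    """Lorentzian inner product: <x, y>_L = -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lorentz_distance(x, y):
    """Geodesic distance on the hyperboloid with curvature -1."""
    # Clamp the argument at 1.0 to guard against round-off below arccosh's domain.
    return np.arccosh(np.maximum(-lorentz_inner(x, y), 1.0))

def ball_to_hyperboloid(p):
    """Standard diffeomorphism from the Poincare ball to the hyperboloid."""
    n2 = np.sum(p * p)
    return np.concatenate(([1.0 + n2], 2.0 * p)) / (1.0 - n2)
```

Lifted points satisfy $\langle x, x \rangle_{\mathcal{L}} = -1$, and hyperboloid distances between lifted points agree with Poincaré-ball distances.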
In this section, we elaborate on how the introduced hyperbolic embeddings facilitate both the mining of the implicit semantic hierarchy and the incorporation of explicit tree-structure knowledge, each of which encourages the model to find more interpretable topics. Our code is available at https://github.com/NoviceStone/HyperMiner.
Theoretically, the idea of representing words and topics in hyperbolic space is orthogonal to a wide range of topic models employing the word embedding technique. To provide the foundation for the subsequent injection of structural knowledge, we here apply our method to a hierarchical embedded topic model, SawETM Duan et al. (2021a). SawETM utilizes the adapted Poisson gamma belief network (PGBN) Zhou et al. (2016) as its generative module (decoder) and decomposes the topic matrices into the inner product of topic embeddings at adjacent layers. The novelty of our method lies in the fact that hierarchical relations can be better reflected by the distances between embeddings in hyperbolic space. Mathematically, the generative model with $T$ latent layers is formulated as

$$
\begin{aligned}
\theta^{(T)}_{j} &\sim \mathrm{Gam}\big(\gamma,\, 1/c^{(T+1)}_{j}\big), \\
\theta^{(t)}_{j} &\sim \mathrm{Gam}\big(\Phi^{(t+1)} \theta^{(t+1)}_{j},\, 1/c^{(t+1)}_{j}\big), \quad t = T-1, \dots, 1, \\
x_{j} &\sim \mathrm{Pois}\big(\Phi^{(1)} \theta^{(1)}_{j}\big)
\end{aligned}
\qquad (4)
$$
The above formula clearly describes how the multi-layer document representation is generated via a top-down process. Specifically, the latent representation $\theta^{(T)}_{j}$ of the top layer is sampled from a fixed gamma prior distribution; then, at each intermediate layer $t$, the latent units $\theta^{(t)}_{j}$ are factorized into the product of the factor loading matrix $\Phi^{(t+1)}$ and the latent units of the layer above. At the bottom layer, the observed word count vector $x_{j}$ is modeled with the Poisson distribution. Note that the subscript $j$ denotes the document index and the remaining variables ($\gamma$ and $c^{(t)}_{j}$) are hyperparameters. In particular, the factor loading matrix $\Phi^{(t)}$ of layer $t$ is derived based on the distance between the topic embeddings at two adjacent layers, i.e., $\alpha^{(t-1)}$ and $\alpha^{(t)}$, where $\alpha^{(0)}$ represents the word embeddings. Since all embeddings are projected into hyperbolic space to fully explore the underlying semantic hierarchy among the words and topics, we design our similarity score function as

$$
\Phi^{(t)}_{k',k} = \frac{\exp\big(\!-d(\alpha^{(t-1)}_{k'}, \alpha^{(t)}_{k})\big)}{\sum_{k''} \exp\big(\!-d(\alpha^{(t-1)}_{k''}, \alpha^{(t)}_{k})\big)}, \quad d \in \{d_{\mathbb{B}}, d_{\mathcal{L}}\},
$$
$$
d_{\mathbb{B}}(u, v) = \mathrm{arcosh}\Big(1 + \frac{2\|u - v\|^{2}}{(1 - \|u\|^{2})(1 - \|v\|^{2})}\Big), \qquad d_{\mathcal{L}}(u, v) = \mathrm{arcosh}\big(\!-\langle u, v \rangle_{\mathcal{L}}\big) \qquad (5)
$$
where $d_{\mathbb{B}}$ and $d_{\mathcal{L}}$ are the distance functions of the Poincaré ball model and the Lorentz model, respectively. As the two models of hyperbolic space are mathematically equivalent, we take the Poincaré ball as an example for analysis. Eq. (5) shows that the distance changes smoothly with respect to the norms of $u$ and $v$. This locality plays a crucial role in learning continuous embeddings of hierarchies. For instance, the origin of $\mathbb{B}^{n}$ has zero norm, so it has a relatively small distance to all other points, which exactly corresponds to the root node of a tree. On the other hand, points close to the boundary of the ball have norms close to one, so the distance between them grows quickly, which properly reflects the relationships between the leaf nodes of a tree.
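This root-versus-leaf behavior can be verified numerically. The sketch below places one "general word" at the origin and two "specific words" near the boundary (illustrative points of our choosing), and shows that the boundary points are proportionally farther from each other than from the origin, more so than Euclidean geometry would predict:

```python
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    """Closed-form Poincare-ball distance (curvature -1)."""
    sq = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x * x)) * (1.0 - np.sum(y * y))
    return np.arccosh(1.0 + 2.0 * sq / max(denom, eps))

root = np.array([0.0, 0.0])     # a general word near the origin
leaf_a = np.array([0.95, 0.0])  # two specific words near the boundary,
leaf_b = np.array([0.0, 0.95])  # in different directions

d_root_leaf = poincare_distance(root, leaf_a)
d_leaf_leaf = poincare_distance(leaf_a, leaf_b)
```

In Euclidean space the leaf-to-leaf distance would be only $\sqrt{2}$ times the root-to-leaf distance; in the ball the ratio is noticeably larger, mirroring how leaves of a tree are far apart even when both are close to the root.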
Hierarchical structures are ubiquitous in knowledge representation and reasoning Nickel and Kiela (2018). Particularly, mining a set of meaningful topics organized into a hierarchy from massive text corpora is intuitively appealing, as it allows users to easily access the information of their interest. However, most existing hierarchical topic models struggle to realize this goal without any supervision, and some appropriate guidance with prior structural knowledge proves to be helpful for mitigating this issue Duan et al. (2021b); Wang et al. (2022).
We assume the prior knowledge takes the form of a concept taxonomy, which is compatible with the deep structure of the hierarchical topic model proposed in Section 3.1. In detail, the taxonomy exhibits a top-down reification process, and connected concepts in two adjacent layers follow the hypernym relation, as shown in Figure 2(a). Meanwhile, to keep the taxonomy consistent with the corresponding dataset, we construct it by traversing each word in the vocabulary to find its ancestors along the hypernym paths provided by WordNet Miller (1995).
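The construction amounts to walking hypernym chains upward and grouping ancestors by depth. The sketch below mimics this on a tiny hand-written hypernym map (a hypothetical stand-in for WordNet's hypernym lookup, which the actual construction uses):

```python
# Hypothetical miniature hypernym map; the real construction queries WordNet.
HYPERNYM = {
    "dog": "animal", "cat": "animal",
    "oak": "plant",  "rose": "plant",
    "animal": "living_thing", "plant": "living_thing",
    "living_thing": "entity",
}

def hypernym_path(word):
    """Walk the hypernym chain from a vocabulary word up to the root."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path

def build_taxonomy(vocab):
    """Group ancestors by depth (distance from the root) across the vocabulary."""
    layers = {}
    for w in vocab:
        for depth, concept in enumerate(reversed(hypernym_path(w))):
            layers.setdefault(depth, set()).add(concept)
    return layers

taxonomy = build_taxonomy(["dog", "cat", "oak", "rose"])
```

Each resulting layer of concepts then serves as the prior for the topics at the corresponding layer of the model.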
Although the paradigm of contrastive learning has been successfully applied to graph representation learning in Euclidean space You et al. (2020); Hassani and Khasahmadi (2020); Zhu et al. (2020); Peng et al. (2020), those contrastive algorithms are not directly applicable in our case. On the one hand, we focus more on learning node representations without destroying their prior hierarchical structure. On the other hand, hyperbolic space possesses distinctive properties (e.g., hierarchical awareness and spacious room) compared to its Euclidean counterpart. Consequently, to accommodate these two differences, we design a node-level hyperbolic contrastive loss such that the prior knowledge can be effectively injected as an inductive bias to influence the learning of the topic taxonomy.
Specifically, we set the number of topics at each layer of the hierarchical topic model to be the same as the number of concepts at the corresponding layer of the taxonomy; each topic is then assigned a corresponding concept as its semantic prior. Since concepts are connected in the taxonomy, such relations can be transferred accordingly between topics, which provides the basis for picking positive and negative pairs among the topic and word embeddings. Let $\mathcal{E}$ denote the set of all embeddings; each embedding $e_{i} \in \mathcal{E}$ is then associated with two groups of embeddings, the positive samples $\mathcal{P}_{i}$ and the negative samples $\mathcal{N}_{i}$, respectively. The average hyperbolic contrastive loss is defined as (we omit the layer superscript for simplicity of notation)
$$\mathcal{L}_{\mathrm{CL}} = -\frac{1}{|\mathcal{E}|} \sum_{e_{i} \in \mathcal{E}} \log \frac{\exp\big(s(e_{i}, e_{i}^{+})/\tau\big)}{\exp\big(s(e_{i}, e_{i}^{+})/\tau\big) + \sum_{e_{j} \in \mathcal{N}_{i}} \exp\big(s(e_{i}, e_{j})/\tau\big)} \qquad (6)$$
where $s(\cdot, \cdot)$ is the similarity score function defined in Eq. (5) and $\tau$ is the temperature parameter. Note that $e_{i}^{+}$ is a positive sample drawn from $\mathcal{P}_{i}$ and $\mathcal{N}_{i}$ is the set of negative samples.
Inspired by the homophily property (i.e., similar actors tend to associate with each other) observed in many graph networks Nickel and Kiela (2017), we take the one-hop neighbors of each anchor, i.e., its parent node and its child nodes, as positive samples to maintain the hierarchical semantic information. For the negative samples, we select the embeddings from the non-first-order neighbors that have the highest similarity scores with the anchor embedding.
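An InfoNCE-style loss of this kind is straightforward to sketch. The snippet below is an illustrative stand-in (our names and sample points): it uses the negative Poincaré distance as the similarity, whereas the actual model uses the score function of Eq. (5):

```python
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    """Closed-form Poincare-ball distance (curvature -1)."""
    sq = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x * x)) * (1.0 - np.sum(y * y))
    return np.arccosh(1.0 + 2.0 * sq / max(denom, eps))

def hyperbolic_info_nce(anchor, positive, negatives, tau=0.5):
    """InfoNCE-style loss with similarity = -hyperbolic distance.

    anchor, positive: (D,) points in the Poincare ball
    negatives:        (M, D) negative samples
    """
    sims = -np.array([poincare_distance(anchor, positive)]
                     + [poincare_distance(anchor, n) for n in negatives])
    logits = sims / tau
    logits -= logits.max()                  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                # the positive sits at index 0

anchor = np.array([0.1, 0.0])
loss = hyperbolic_info_nce(anchor,
                           positive=np.array([0.12, 0.0]),
                           negatives=np.array([[0.8, 0.0], [-0.8, 0.0]]))
```

Minimizing this loss pulls the anchor toward its one-hop neighbors while pushing it away from the hard negatives, which is how the taxonomy structure is preserved in the embedding space.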
As most existing NTMs can be viewed as the extensions of the framework of VAEs Kingma and Welling (2013); Rezende et al. (2014), they generally develop a similar training objective to VAEs, which is to maximize the Evidence Lower BOund (ELBO). For our generative model, the ELBO of each document can be derived as
$$\mathcal{L}_{\mathrm{ELBO}} = -\sum_{t=1}^{T} \mathbb{E}_{q}\Big[\mathrm{KL}\big(q(\theta^{(t)}_{j} \mid x_{j}) \,\big\|\, p(\theta^{(t)}_{j} \mid \Phi^{(t+1)}, \theta^{(t+1)}_{j})\big)\Big] + \mathbb{E}_{q}\big[\log p(x_{j} \mid \Phi^{(1)}, \theta^{(1)}_{j})\big] \qquad (7)$$
where the first term is the Kullback–Leibler divergence that constrains the approximate posterior $q(\theta^{(t)}_{j} \mid x_{j})$ to be close to the prior, and the second term denotes the expected log-likelihood, or reconstruction error. Since our generative model employs gamma-distributed latent variables, designing a sampling-based inference network raises the difficulty of reparameterizing a gamma-distributed random variable. Therefore, inspired by Zhang et al. (2018), we instead utilize a Weibull distribution to approximate the conditional posterior, as its analytic KL expression and efficient reparameterization make it easy to estimate the gradient of the ELBO with respect to the network parameters. The implementation details of our variational encoder are described in Appendix B.

Furthermore, to inject the prior knowledge that guides the learning of a topic taxonomy, we train the ELBO jointly with a regularization term specified by the contrastive loss proposed in Section 3.2:
$$\mathcal{L} = \mathcal{L}_{\mathrm{ELBO}} - \lambda \mathcal{L}_{\mathrm{CL}} \qquad (8)$$
where $\lambda$ is the hyper-parameter used to control the impact of the regularization term, whose detailed effect is investigated in Appendix D. We summarize the complete learning procedure in Algorithm 1.
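The Weibull reparameterization that makes this gradient estimation tractable can be sketched in a few lines (an illustrative sketch with our own names, not the authors' implementation): a Weibull$(k, \lambda)$ sample is a deterministic, differentiable transform $\theta = \lambda(-\ln(1-u))^{1/k}$ of uniform noise $u$.

```python
import math
import numpy as np

def weibull_rsample(k, lam, size, rng):
    """Reparameterized Weibull(k, lam) draw: theta = lam * (-log(1 - u))**(1/k).

    The transform is differentiable in (k, lam), so gradients of the ELBO can
    flow through the sample (the 'reparameterization trick').
    """
    u = rng.uniform(size=size)
    return lam * (-np.log1p(-u)) ** (1.0 / k)

rng = np.random.default_rng(42)
theta = weibull_rsample(k=2.0, lam=1.0, size=200_000, rng=rng)
```

As a sanity check, the empirical mean matches the analytic Weibull mean $\lambda\,\Gamma(1 + 1/k)$, and all draws are nonnegative, as required for gamma-like latent units.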
| Dataset | Number of docs | Vocabulary size | Total number of words | Categories |
|---|---|---|---|---|
| 20NG | 18,846 | 8,581 | 1,446,490 | 20 |
| TMN | 32,597 | 13,368 | 592,973 | 7 |
| WIKI | 28,472 | 20,000 | 3,719,096 | N/A |
| RCV2 | 804,414 | 7,282 | 60,209,009 | N/A |
We conduct our experiments on four benchmark datasets with various sizes and document lengths, including 20Newsgroups (20NG) Lang (1995), Tag My News (TMN) Vitale et al. (2012), WikiText-103 (WIKI) Merity et al. (2016), and Reuters Corpus Volume II (RCV2) Lewis et al. (2004). The statistics of these datasets are presented in Table 1. In particular, TMN is a short text corpus with an average of about 20 words per document; 20NG and TMN are the two corpora that are associated with document labels.
As baselines, we choose several exemplary ones from the state-of-the-art topic models, including: 1) LDA Blei et al. (2003), one of the most widely used topic models; 2) ProdLDA Srivastava and Sutton (2017), an NTM which replaces the mixture model in LDA with a product of experts; 3) ETM Dieng et al. (2020), an NTM that marries conventional topic models with word embeddings; 4) WHAI Zhang et al. (2018), a hierarchical NTM which develops a deep Weibull variational encoder based on PGBN Zhou et al. (2016); 5) SawETM Duan et al. (2021a), which proposes a Sawtooth Connection module to build the dependencies between topics at different layers; 6) TopicNet Duan et al. (2021b), a knowledge-based hierarchical NTM that guides topic discovery through prior semantic graph. All baselines are implemented meticulously according to their official code.
We aim to evaluate our model’s performance in terms of both topic quality and document representation. For topic quality, we adopt topic coherence (TC) and topic diversity (TD) as performance metrics. Given a reference corpus, TC measures the interpretability of each topic by computing the semantic coherence of its most significant words Mimno et al. (2011). Precisely, we apply the widely used Normalized Pointwise Mutual Information (NPMI) Aletras and Stevenson (2013) and compute it over the top 10 words of each topic, with the original document collection of each dataset serving as the reference corpus. Note that the value of NPMI ranges from $-1$ to $1$, and higher values indicate better interpretability. TD, as the name suggests, measures how diverse the discovered topics are. Following Dieng et al. (2020), we define TD to be the percentage of unique words in the top 25 words of all learned topics; TD close to 0 indicates redundant topics and TD close to 1 implies more varied topics.
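The TD metric in particular is a one-liner; a minimal sketch (with toy topics of our own) is:

```python
def topic_diversity(topics, topn=25):
    """TD = fraction of unique words among the top-n words of all topics."""
    top_words = [w for topic in topics for w in topic[:topn]]
    return len(set(top_words)) / len(top_words)

# Two redundant topics share all their top words; two varied topics share none.
redundant = [["space", "nasa", "moon"], ["space", "nasa", "moon"]]
varied = [["space", "nasa", "moon"], ["game", "team", "score"]]
```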
On the other hand, since per-document topic proportions can be viewed as unsupervised document representations, we evaluate the quality of such representations by performing document clustering tasks. We report the purity and Normalized Mutual Information (NMI) Manning et al. (2010) on the two datasets that provide document labels, i.e., 20NG and TMN. Concretely, with the default training/test split of each dataset, we first train a topic model on the training set, and the trained model is then used to extract features of all test documents. Subsequently, we apply the KMeans algorithm to the extracted features and calculate the purity and NMI of the KMeans clusters (denoted by km-Purity and km-NMI). Note that both metrics range from 0 to 1, and higher scores indicate better performance. For the hierarchical topic models, we take the latent units of the first layer as the document feature.
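For reference, the two clustering metrics can be computed from scratch given cluster assignments and gold labels. The following is an illustrative NumPy sketch (standard definitions, our function names), independent of any particular KMeans implementation:

```python
import numpy as np

def purity(y_true, y_pred):
    """Fraction of documents whose cluster's majority label matches their label."""
    correct = 0
    for c in np.unique(y_pred):
        labels = y_true[y_pred == c]
        correct += np.bincount(labels).max()  # size of the majority label
    return correct / len(y_true)

def nmi(y_true, y_pred, eps=1e-12):
    """Normalized mutual information between cluster and label assignments."""
    n = len(y_true)
    contingency = np.zeros((y_true.max() + 1, y_pred.max() + 1))
    for t, p in zip(y_true, y_pred):
        contingency[t, p] += 1
    pxy = contingency / n                       # joint distribution
    px = pxy.sum(axis=1, keepdims=True)         # label marginal
    py = pxy.sum(axis=0, keepdims=True)         # cluster marginal
    mi = np.sum(pxy * np.log((pxy + eps) / (px @ py + eps)))
    hx = -np.sum(px * np.log(px + eps))
    hy = -np.sum(py * np.log(py + eps))
    return mi / np.sqrt(hx * hy + eps)          # geometric-mean normalization
```

A perfect clustering (up to a relabeling of cluster ids) yields purity 1 and NMI 1.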
Considering that not all discovered topics are interpretable Zhao et al. (2018a), we select the top 50% of topics with the highest NPMI values and report the average score over those selected topics to evaluate the topic quality comprehensively. Figure 3 exhibits the performance comparison of the different models. Note that HyperMiner is the variant of SawETM that replaces the inner product between Euclidean embeddings with the distance between hyperbolic embeddings, corresponding to the model proposed in Section 3.1, while HyperMiner-KG is an advanced version of HyperMiner that guides the learning of the topic taxonomy with external structural knowledge, as introduced in Section 3.2. As shown in Figure 3, HyperMiner achieves consistent performance gains over SawETM on all datasets, in terms of both TC and TD, which demonstrates the superiority of hyperbolic geometry in uncovering the latent hierarchies among topics and words. In addition, as knowledge-guided topic models, both TopicNet and HyperMiner-KG perform better than those without any supervision, indicating the positive role of prior knowledge in helping to mine more interpretable and diverse topics. Notably, HyperMiner-KG still performs slightly better than TopicNet while consuming less memory. We attribute this result to our naturally designed framework of injecting tree-structure knowledge in a contrastive manner.
| Method | 20NG km-Purity | 20NG km-NMI | TMN km-Purity | TMN km-NMI |
|---|---|---|---|---|
| LDA Blei et al. (2003) | 38.43 ± 0.52 | 35.98 ± 0.39 | 48.17 ± 0.69 | 30.96 ± 0.78 |
| ProdLDA Srivastava and Sutton (2017) | 39.21 ± 0.63 | 36.52 ± 0.51 | 55.28 ± 0.67 | 35.57 ± 0.72 |
| ETM Dieng et al. (2020) | 42.68 ± 0.71 | 37.72 ± 0.64 | 59.35 ± 0.74 | 38.75 ± 0.86 |
| WHAI Zhang et al. (2018) | 40.89 ± 0.35 | 38.90 ± 0.27 | 58.06 ± 0.45 | 37.34 ± 0.48 |
| SawETM Duan et al. (2021a) | 43.36 ± 0.48 | 41.59 ± 0.62 | 61.13 ± 0.56 | 40.78 ± 0.63 |
| TopicNet Duan et al. (2021b) | 42.94 ± 0.41 | 40.76 ± 0.53 | 60.52 ± 0.50 | 40.09 ± 0.54 |
| HyperETM | 43.63 ± 0.51 | 39.06 ± 0.64 | 61.22 ± 0.62 | 40.52 ± 0.71 |
| HyperMiner | 44.37 ± 0.38 | 42.83 ± 0.45 | 62.96 ± 0.48 | 41.93 ± 0.52 |
| HyperMiner-KG | 45.16 ± 0.35 | 43.65 ± 0.39 | 63.84 ± 0.43 | 42.81 ± 0.47 |
| Method | dim 1 (smallest) | dim 2 | dim 3 | dim 4 | dim 5 (largest) |
|---|---|---|---|---|---|
| ETM Dieng et al. (2020) | 19.87 ± 0.81 | 33.64 ± 0.69 | 39.06 ± 0.54 | 42.13 ± 0.47 | 43.85 ± 0.42 |
| HyperETM | 24.33 ± 0.76 | 36.57 ± 0.65 | 40.92 ± 0.56 | 43.04 ± 0.43 | 44.38 ± 0.40 |
| SawETM Duan et al. (2021a) | 16.74 ± 0.78 | 27.05 ± 0.66 | 31.68 ± 0.51 | 34.06 ± 0.42 | 35.42 ± 0.37 |
| HyperMiner | 20.16 ± 0.80 | 29.73 ± 0.63 | 33.04 ± 0.49 | 34.98 ± 0.41 | 36.01 ± 0.36 |
| TopicNet Duan et al. (2021b) | 20.29 ± 0.58 | 31.26 ± 0.51 | 34.57 ± 0.45 | 36.84 ± 0.39 | 38.02 ± 0.36 |
| HyperMiner-KG | 22.83 ± 0.55 | 33.15 ± 0.50 | 36.28 ± 0.43 | 38.11 ± 0.40 | 39.46 ± 0.34 |
Table 2 shows the clustering performance of the different models. We run all compared models five times with different random seeds and report the mean and standard deviation. From the results presented above, we have the following remarks:
i) For all the evaluation metrics, our proposed variants perform consistently better than their prototypical models (HyperETM versus ETM, and HyperMiner versus SawETM), which demonstrates that the introduced hyperbolic embeddings are beneficial to both the discovery of high-quality topics and the learning of good document representations.

ii) As a knowledge-guided topic model, HyperMiner-KG achieves a significant improvement over its base model SawETM, while TopicNet suffers a slight performance degradation compared to SawETM, which also serves as its base model in the original paper. This observation shows that, with the hyperbolic contrastive loss, our model not only successfully injects the knowledge into the learning of hierarchical topics, but also achieves a better balance among the comprehensive metrics of topic modeling.

iii) The superior performance of our model on TMN also suggests its potential for short-text topic modeling.

To further investigate the effectiveness of our method under different dimensional settings, we compare the extrinsic predictive performance of document representations through classification tasks. Consistent with the practice in the clustering tasks, we first collect the features of the training set and the test set separately, as inferred by a well-trained topic model. Then we train an SVM classifier using the training features and their corresponding labels. Finally, we use the trained classifier to predict the labels of the test documents and compute the accuracy. Table 3 reports the classification results of different embedded topic models. The improved variants with our method surpass their base counterparts in various dimensionality settings. In particular, the performance gap is further widened in low-dimensional embedding spaces, confirming the natural advantage of the hyperbolic distance metric in learning useful document representations.

As our proposed HyperMiner-KG imposes a prior hierarchical constraint (i.e., the concept taxonomy) on the embedding space, the topic and word embeddings are learned to preserve this structure as much as possible. Therefore, to verify the effectiveness of our proposed regularization term for injecting structural knowledge, we visualize the two-dimensional hyperbolic embedding space learned by HyperMiner-KG, as displayed in Figure 4. Figure 4(a) exhibits the learned topic embeddings and the concept hierarchy used to guide them, in which we can see that the distribution of topic embeddings well preserves the semantic structure of the prior knowledge. Specifically, for topics guided by higher-level and more general concepts (e.g., physical_entity.n.01), their embeddings tend to locate near the center of the disc, while for those led by more specific concepts (e.g., substance.n.07), the embeddings scatter around the boundary. Figure 4(b) presents the distribution of learned word embeddings, which also reflects the underlying lexical hierarchy of the corpus. Words that co-occur frequently with many different terms (e.g., organization, subject) tend to fall around the center of the disc, so as to maintain a small distance from arbitrary words. In contrast, words with precise meanings fall near the edge, where there is more spacious room, and keep a small distance only from words with similar semantics.
Furthermore, to qualitatively demonstrate the crucial role of prior structural knowledge in helping to discover more interpretable topics, we show the contents (i.e., top words) of some topics learned by HyperMiner-KG, as illustrated in Figure 5. We observe that in a majority of cases the prior concepts successfully guide the topics to learn semantically consistent contents (e.g., the topics guided by coding_system and instrumentality). Moreover, the contents of topics at different layers are also semantically related, owing to the concepts guiding them. For instance, the content of the topic guided by whole.n.02 covers the contents of the topics led by living_thing.n.01 and artifact.n.01, respectively. Another interesting phenomenon is that the topic led by round_shape.n.01 involves not only words related to shapes but also some other words such as files and ftp. The reason could be that one of its child concepts, disk, often co-occurs with those words in the given corpus, suggesting that topic learning is co-influenced by both the data likelihood and the knowledge regularization.
Historically, many attempts have been made to develop hierarchical topic models. Expanding on their flat counterparts, hierarchical topic models aim to generate an intuitive topic hierarchy by capturing the correlations among topics. Constrained by the requirement of accurate posterior inference, early works Griffiths et al. (2003); Blei and Lafferty (2006); Paisley et al. (2014); Zhou et al. (2016) primarily focused on learning latent topic hierarchies purely from data, with some additional probabilistic assumptions imposed. There is also a small body of work that tries to integrate domain knowledge into the process of discovering topic hierarchies. For example, anchored CorEx Gallagher et al. (2017) takes user-provided seed words and learns informative topics by maximizing total correlation while preserving the information related to the anchor words. More recently, JoSH Meng et al. (2020) adopts a more effective strategy that takes a category hierarchy as guidance and models category-word semantic correlation via joint spherical text and tree embedding. Different from anchored CorEx and JoSH, which deviate from the conventional topic modeling framework, our approach still follows the regular probabilistic generative process. In addition, we use a concept taxonomy to guide topic learning so that more fine-grained topics can be mined. In this regard, our proposed HyperMiner-KG is most related to TopicNet Duan et al. (2021b), yet it is more efficient with a smaller storage footprint.
This paper presents a novel framework that introduces hyperbolic embeddings to represent words and topics on top of existing embedded topic models. By using the hyperbolic distance to measure the semantic similarity between words and topics, the model can better explore the underlying semantic hierarchy to find more interpretable topics. Besides, a hyperbolic contrastive loss has been further proposed, which effectively injects prior structural knowledge into hierarchical topic models to guide learning a meaningful topic taxonomy. Our method shows appealing properties that can overcome several shortcomings of existing embedded topic models. Extensive experiments have been carried out, demonstrating that our method achieves consistent performance improvements in discovering high-quality topics and deriving useful document representations.
Bo Chen acknowledges the support of NSFC (U21B2006 and 61771361), Shaanxi Youth Innovation Team Project, the 111 Project (No. B18039) and the Program for Oversea Talent by Chinese Central Government.
For all authors…
Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
Did you describe the limitations of your work? We describe the limitations of our work in the appendix.
Did you discuss any potential negative societal impacts of your work? We discuss the negative societal impacts in the appendix.
Have you read the ethics review guidelines and ensured that your paper conforms to them?
If you are including theoretical results…
Did you state the full set of assumptions of all theoretical results?
Did you include complete proofs of all theoretical results?
If you ran experiments…
Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? They will be included in the supplementary material.
Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)?
Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)?
Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)?
If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…
If your work uses existing assets, did you cite the creators?
Did you mention the license of the assets?
Did you include any new assets either in the supplemental material or as a URL?
Did you discuss whether and how consent was obtained from people whose data you’re using/curating?
Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?
If you used crowdsourcing or conducted research with human subjects…
Did you include the full text of instructions given to participants and screenshots, if applicable?
Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?
Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?
In this paper, we propose a method to improve existing embedded topic models (ETMs) by introducing the hyperbolic distance to measure the semantic similarity between topics and words. Additionally, to mine a meaningful topic taxonomy with the guidance of prior structural knowledge, we further develop a regularization term based on contrastive learning that can effectively inject prior knowledge into hierarchical topic models. The main limitation of our work is the possible mismatch between the given prior knowledge and the target corpus. Specifically, to provide proper guidance for mining an interpretable topic taxonomy, the prior structural knowledge should be well matched with the corresponding dataset. Although we present a seemingly effective heuristic strategy that finds the ancestor concepts of each word in the vocabulary, there are certainly better ways to construct qualified priors to guide the learning of topic hierarchies. This is beyond the scope of this paper, and we will conduct a thorough investigation of the issue in future work.
Our work builds on advanced topic modeling techniques and can thus be used for regular text analysis, for example, topic discovery and obtaining document representations. Furthermore, our work provides a way to inject prior knowledge as an inductive bias that influences topic learning, which is particularly useful when users are only interested in certain types of information. Imagine a user whose goal is to extract material about a specific topic from a large amount of news; our model can act as a good filter. Or consider recommending papers to researchers: their browsing history, as prior knowledge, reflects their preferences and can be incorporated into the model so that only papers on related topics are presented, thus improving recommendation accuracy. A potential negative societal impact of our work could arise from malicious attempts to change the model's behavior by injecting deliberate human prejudice, which may harm the fairness of the community. However, we hope our work will be used to enable new downstream applications that benefit community development.
We summarize the training/test split of each dataset in Table B.1. In particular, 20NG (http://qwone.com/~jason/20Newsgroups/) and TMN (http://acube.di.unipi.it/tmn-dataset/) are used to evaluate both topic quality and document representation, and their document collections are divided into standard training sets and test sets. WIKI (https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/) and RCV2 (https://trec.nist.gov/data/reuters/reuters.html) are only used for topic discovery, so we use all documents for training.
| Dataset | Vocabulary size | Number of training docs | Number of test docs |
|---|---|---|---|
| 20NG | 8,581 | 11,314 | 7,532 |
| TMN | 13,368 | 26,077 | 6,520 |
| WIKI | 20,000 | 28,472 | / |
| RCV2 | 7,282 | 804,414 | / |
In this part, we give an example to illustrate how the prior concept taxonomy (or structural knowledge) is constructed. Specifically, given the vocabulary of a dataset, we first filter out those words that are not included in the WordNet thesaurus. For each of the remaining words, we then find its ancestor concepts along the hypernym paths integrated in WordNet (e.g., for the word coffee, its hypernym path contains a series of increasingly abstract concepts, i.e., beverage, food, substance, physical_entity, as displayed in Figure B.1). After traversing all the words, we obtain a concept tree of great depth, but the number of nodes in the deepest layers may be very small. Therefore, to keep the number of nodes growing as the layer gets deeper, we choose to reserve the several layers closest to the root node. Words in the deeper layers are connected directly to their ancestor concepts in the deepest layer that has been preserved.
In this way, we construct a concept taxonomy with a depth of 5 (the layer of words is excluded) for each dataset. More precisely, the number of concepts in each layer is [2, 12, 83, 325, 560] for 20NG; [2, 11, 84, 366, 683] for TMN; [2, 12, 91, 408, 810] for WIKI; [2, 11, 70, 306, 540] for RCV2.
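The construction described above can be sketched as follows. This is our illustrative sketch, not the authors' code: the tiny hand-coded hypernym map is hypothetical and stands in for a real WordNet lookup (e.g., via NLTK's WordNet interface).

```python
# Sketch of the concept-taxonomy construction. HYPERNYM_PATHS is a
# hypothetical stand-in for WordNet hypernym-path lookups.
HYPERNYM_PATHS = {
    # word: root-to-word chain of ancestor concepts (hypothetical excerpt)
    "coffee": ["entity", "physical_entity", "substance", "food", "beverage", "coffee"],
    "tea":    ["entity", "physical_entity", "substance", "food", "beverage", "tea"],
    "oak":    ["entity", "physical_entity", "object", "plant", "tree", "woody_plant", "oak"],
}

def build_concept_tree(vocab, depth=5):
    """Return parent->child edges of a concept tree with `depth` concept layers.

    Words missing from the thesaurus are filtered out; words whose path is
    deeper than `depth` are connected directly to their deepest preserved
    ancestor, as described in the text."""
    edges = set()
    for word in vocab:
        path = HYPERNYM_PATHS.get(word)
        if path is None:
            continue  # word not covered by the thesaurus
        kept = path[:depth] + [word]  # reserve the layers closest to the root
        edges.update(zip(kept, kept[1:]))
    return edges
```

For example, `oak` has a 6-concept path, so after truncation to 5 layers it is re-attached directly under `tree`, its deepest preserved ancestor.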
Since the exact posterior distribution is intractable in our generative model, we design a sampling-based inference network to approximate it, as is done in most neural topic models. Given the hierarchical structure, in which deep-layer latent variables struggle to receive effective information from the original input, we draw on LadderVAE Sønderby et al. [2016] and use a skip-connected deterministic upward path to infer the hidden features of the input
(9)
where MLP is a multi-layer perceptron consisting of two fully connected layers, each followed by a ReLU activation. The obtained hidden features are subsequently combined with the prior from the stochastic up-down path to approximate the variational posterior, which is expressed as
(10)
where the hidden features and the prior are concatenated along the topic dimension and passed through a simple fully connected layer with identity activation, followed by an element-wise nonlinearity that ensures the shape and scale parameters of the Weibull distribution are positive. The reason for using the Weibull distribution to approximate the gamma-distributed conditional posterior has been explained in the main body. Note that both the shape and scale parameters are inferred through the neural networks, using the combination of the bottom-up likelihood information and the top-down prior information as input. Figure B.2 depicts the overall inference process.

All our experiments are performed on a single Nvidia GeForce RTX 3090 GPU card, with PyTorch as the programming platform. For the MLP module in the inference network, we set the number of hidden neurons to 300. In addition, we add a batch normalization layer to prevent overfitting. For all the embedded topic models, we set the embedding size to 50. To optimize our models, we use the Adam Kingma and Ba [2014] optimizer with a learning rate of 0.01, and the size of each mini-batch is set to 200 for all datasets. Moreover, for our proposed HyperMiner-KG, the number of negative samples is set to 256 for each anchor when calculating the hyperbolic contrastive loss. It is also worth noting that the number of topics at each layer in the hierarchical topic models is set to be consistent with the number of concepts at the corresponding layer of the constructed concept taxonomy; please refer to Section B.2 for the detailed settings. For the single-layer topic models, we set the number of topics to be the same as the number of concepts at the deepest layer of the concept taxonomy.
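The Weibull reparameterization used in the variational posterior can be sketched as below. This is a minimal NumPy illustration, not the authors' implementation: the skip-connected MLP and the concatenation with the prior are omitted, and `raw_k`/`raw_lam` stand for unconstrained network outputs.

```python
# Minimal sketch of reparameterized Weibull sampling: a softplus keeps the
# shape k and scale lam positive, and uniform noise u gives a differentiable
# sample via the Weibull inverse CDF.
import numpy as np

def softplus(x):
    # numerically stable log(1 + exp(x)), ensuring positivity
    return np.logaddexp(0.0, x)

def sample_weibull(raw_k, raw_lam, rng):
    """Reparameterized Weibull sample from unconstrained network outputs."""
    k = softplus(raw_k)      # shape parameter, > 0
    lam = softplus(raw_lam)  # scale parameter, > 0
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=np.shape(raw_k))
    return lam * (-np.log(1.0 - u)) ** (1.0 / k)

rng = np.random.default_rng(0)
theta = sample_weibull(np.zeros(4), np.zeros(4), rng)  # non-negative topic weights
```

Because the noise enters only through the inverse CDF, gradients can flow through `k` and `lam` during training, which is the point of the reparameterization.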
In this section, we offer an intuitive explanation of the equivalence of the two hyperbolic models mentioned in the main text. First, we clarify the concept of a geodesic. In geometry, a geodesic is a curve representing the shortest path between two points on a surface. In Euclidean space, the geodesic is the straight line connecting two points, and its length is the familiar Euclidean distance, determined only by the coordinates of the two points. In a hyperbolic space, by contrast, the length of a geodesic depends not only on the coordinates of its endpoints but also on the curvature of the space. This is illustrated on the left side of Figure C.3: as the (negative) curvature decreases, the corresponding curvature radius decreases, but the distance between x and y increases and the geodesic lines bend closer to the origin.
The right side of Figure C.3 depicts the projection of a geodesic on the Lorentz surface to the corresponding geodesic in the Poincaré disk. We say that the two models are mathematically equivalent because points in the Poincaré disk and points in the Lorentz space can be mapped to each other while all geometric properties, including isometry, are preserved. For example, to map a point in the Lorentz model to the corresponding point in the Poincaré ball, we have the following diffeomorphism Nickel and Kiela [2018], where
(11)
Furthermore, points in the Poincaré ball can be mapped back into the Lorentz model via
(12)
The lengths of geodesics in the Poincaré disk and the Lorentz space are calculated by the respective distance functions defined in Eq. (5). Note, however, that the mathematical equivalence of the two models does not mean that the lengths computed by the two distance functions are exactly the same.
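The mapping between the two models can be verified numerically. The sketch below is ours (assuming unit negative curvature) and uses the standard closed forms from Nickel and Kiela [2018]: a point on the hyperboloid is projected into the ball and mapped back unchanged.

```python
# Round-trip check of the Lorentz <-> Poincaré diffeomorphism
# (unit curvature): x = (x0, x1, ..., xn) with -x0^2 + ||x_rest||^2 = -1.
import numpy as np

def lorentz_to_poincare(x):
    # project the hyperboloid point into the open unit ball
    return x[1:] / (x[0] + 1.0)

def poincare_to_lorentz(p):
    sq = np.dot(p, p)  # ||p||^2 < 1 inside the ball
    return np.concatenate(([1.0 + sq], 2.0 * p)) / (1.0 - sq)

# Lift a Euclidean vector onto the hyperboloid: x0 = sqrt(1 + ||v||^2)
v = np.array([0.3, -0.4, 0.5])
x = np.concatenate(([np.sqrt(1.0 + v @ v)], v))

p = lorentz_to_poincare(x)      # point in the Poincaré ball
x_back = poincare_to_lorentz(p)  # recovers x, illustrating the equivalence
```

The round trip recovering `x` exactly is what "mathematically equivalent" means here: the two coordinate systems describe the same underlying geometry.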
A Riemannian manifold is a differentiable manifold equipped with a metric tensor. It can be locally approximated by a linear Euclidean space at an arbitrary point, and the approximating space is termed the tangent space at that point. Hyperbolic spaces are smooth Riemannian manifolds with constant negative curvature. Several vector operations are essential for learning embeddings in a hyperbolic space; we introduce them in the following.
Exponential and logarithmic maps. An exponential map projects a tangent vector onto the manifold. A logarithmic map projects points on the manifold back to the tangent space, and the two maps are inverse to each other.
Parallel transport. A parallel transport moves a tangent vector along the surface of a curved manifold, for example, from the tangent space at one point to the tangent space at another.
The concrete formulas of these operations in the Poincaré ball and the Lorentz model are summarized in Table C.2, where the Möbius addition Ungar [2007] and the gyration operator Ungar [2007] are used.
| | Poincaré Ball Model | Lorentz Model |
|---|---|---|
| Log map | | |
| Exp map | | |
| Transport | | |
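As a concrete illustration of the exponential and logarithmic maps just described, the sketch below (ours, assuming unit negative curvature and the base point fixed at the origin, where the Möbius addition vanishes) uses the standard closed forms in the Poincaré ball.

```python
# Exponential and logarithmic maps in the Poincaré ball at the origin
# (unit curvature). The two maps are inverse to each other.
import numpy as np

def exp0(v, eps=1e-12):
    """Map a tangent vector at the origin onto the Poincaré ball."""
    n = np.linalg.norm(v)
    return np.tanh(n) * v / max(n, eps)

def log0(y, eps=1e-12):
    """Map a point of the ball back to the tangent space at the origin."""
    n = np.linalg.norm(y)
    return np.arctanh(n) * y / max(n, eps)

v = np.array([0.7, -1.2, 0.4])
y = exp0(v)       # lands strictly inside the unit ball
v_back = log0(y)  # inverse property: log0(exp0(v)) == v
```

Since tanh is bounded by 1, `exp0` maps any tangent vector strictly inside the ball, and `arctanh` undoes it; at base points other than the origin, the Möbius addition and gyration terms of Table C.2 come into play.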
| HyperMiner-KG | 20NG | | TMN | |
|---|---|---|---|---|
| | km-Purity | km-NMI | km-Purity | km-NMI |
| | 43.76 ± 0.32 | 42.63 ± 0.38 | 62.25 ± 0.46 | 41.32 ± 0.49 |
| | 44.13 ± 0.33 | 42.96 ± 0.36 | 62.73 ± 0.47 | 41.68 ± 0.48 |
| | 44.48 ± 0.39 | 43.28 ± 0.41 | 63.07 ± 0.52 | 42.06 ± 0.54 |
| | 45.16 ± 0.35 | 43.65 ± 0.39 | 63.84 ± 0.48 | 42.81 ± 0.52 |
| | 44.81 ± 0.37 | 43.47 ± 0.38 | 63.39 ± 0.46 | 42.34 ± 0.50 |
To investigate the effect of the regularization term (prior structural knowledge) in HyperMiner-KG, we further evaluate the quality of document representations learned by HyperMiner-KG with different regularization coefficients on document clustering tasks.
As Table C.3 shows, with the increase of the regularization coefficient, HyperMiner-KG performs better on both km-Purity and km-NMI, proving that incorporating prior structural knowledge is beneficial for learning better document representations. However, a larger regularization coefficient is not always better; there is a most suitable value, which in our experiments is 5.