Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too!

by   Suzanna Sia, et al.
Johns Hopkins University

Topic models are a useful analysis tool to uncover the underlying themes within document collections. Probabilistic models which assume a generative story have been the dominant approach for topic modeling. We propose an alternative approach based on clustering readily available pre-trained word embeddings while incorporating document information for weighted clustering and reranking top words. We provide benchmarks for the combination of different word embeddings and clustering algorithms, and analyse their performance under dimensionality reduction with PCA. The best performing combination for our approach is comparable to classical models, and complexity analysis indicates that this is a practical alternative to traditional topic modeling.


1 Introduction

For exploratory document analysis, which aims to uncover main themes and underlying narratives within a corpus (Boyd-Graber et al., 2017), topic models are undeniably the standard approach. But in times of distributed and even contextualized embeddings, are they the only option?

This work explores an alternative to topic modeling by reformulating ‘key themes’ or ‘topics’ as clusters of words under the modern distributed representation learning paradigm. Unsupervised pre-trained word embeddings provide a representation for each word type as a vector. This allows us to cluster them based on their distance in high-dimensional space. The goal of this work is not to strictly outperform, but rather to benchmark standard clustering of modern embedding methods against the classical approach of Latent Dirichlet Allocation (LDA; Blei et al., 2003). We restrict our study to influential embedding methods and focus on centroid-based clustering algorithms, as they provide a natural way to obtain the top words in each cluster based on distance from the cluster center.

Aside from reporting the best performing combination of word embeddings and clustering algorithm, we are also interested in whether there are consistent patterns across the choice of embeddings and clustering algorithm. A word embedding method that does consistently well across clustering algorithms would suggest that it is a good representation for unsupervised document analysis. Similarly, a clustering algorithm that performs consistently well across embeddings would suggest that the assumptions of this algorithm are more likely to generalize even with future advances in word embedding methods.

Finally, we seek to incorporate document information directly into the clustering algorithm, and quantify the effects of two key methods, 1) weighting terms during clustering and 2) reranking terms for obtaining the top representative words. Our contributions are as follows:

  • To our knowledge, this is the first work which systematically applies centroid based clustering algorithms on embedding methods for document analysis.

  • We analyse how clustering embeddings directly can potentially achieve lower computational complexity and runtime as compared to probabilistic generative approaches.

  • Our proposed approach for incorporating document information into clustering and reranking of top words results in sensible topics; the best performing combination is comparable with LDA, but with smaller time complexity and empirical runtime.

  • We find that the dimensions of some word embeddings can be reduced by more than 50% before clustering.

2 Related Work and Background

This work focuses on centroid-based k-means (KM) and spherical k-means (SK) for hard clustering, and GMM for soft clustering. (We also experiment with k-medoids but observe that this is strictly worse than KM; see Appendix A.)

We apply clustering to pre-trained embeddings, namely word2vec Mikolov et al. (2013), Glove Pennington et al. (2014), FastText Bojanowski et al. (2017), Spherical Meng et al. (2019), ELMo Peters et al. (2018) and BERT Devlin et al. (2018). Word2vec, Glove, FastText and Spherical all have dimension size 300, BERT has 768 and ELMo has 1024 dimensions.

Clustering word embeddings has been used for readability assessment (Cha et al., 2017), argument mining (Reimers et al., 2019), and document classification and document clustering (Sano et al., 2017). To our knowledge, there is no prior work that studies the interaction between word embeddings and clustering algorithms on unsupervised document analysis in a direct comparison with standard LDA (Blei et al., 2003). Most related perhaps is the work of de Miranda et al. (2019), who explore this idea with self-organising maps, but do not provide any quantitative results. (As this is a preprint, we gladly welcome any pointers to related work we missed.)

3 Methodology

3.1 General Clustering Approach

We first preprocess and extract the vocabulary from our training documents (subsection 5.1). Each word is converted to its embedding representation, following which we apply the various clustering algorithms to obtain clusters, using weighted (subsection 3.3) or unweighted word types. After the clustering algorithm has converged, we obtain the top words from each cluster for evaluation.
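The clustering step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `weighted_kmeans`, its parameters, the random initialisation and the toy data are all hypothetical. With `weights=None` it reduces to ordinary (unweighted) k-means.

```python
import numpy as np

def weighted_kmeans(X, k, weights=None, n_iter=50, seed=0):
    """Lloyd's algorithm with optional per-type weights (hypothetical helper).

    X: (n, m) array of word-type embeddings; weights: (n,) non-negative scores.
    Returns (centers, labels).
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: initial centers are k distinct word types chosen at random.
    centers = X[rng.choice(n, size=k, replace=False)].copy()

    def assign(current):
        # Squared Euclidean distance from every type to every center.
        d2 = ((X[:, None, :] - current[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        new_centers = centers.copy()
        for c in range(k):
            mask = labels == c
            if w[mask].sum() > 0:
                # Weighted mean: heavily weighted types pull the center harder.
                new_centers[c] = np.average(X[mask], axis=0, weights=w[mask])
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    return centers, assign(centers)

# Toy data: two well-separated groups of 2-d "embeddings".
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [10.0, 10.0], [10.2, 9.9], [9.9, 10.1]])
centers, labels = weighted_kmeans(X, k=2, weights=[5, 1, 1, 1, 1, 5])
```

The only difference between the weighted and unweighted variants in this sketch is the centroid update, which is a weighted rather than a plain mean.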

3.2 Obtaining the top-J words

In traditional topic modeling (LDA), the top-J words are those with highest probability under each topic-word distribution. For centroid-based clustering algorithms, the top words are naturally those closest to the cluster center, and for probabilistic clustering, the top words are those with highest probability under the cluster parameters. Formally, this means choosing the set of J types that minimises the summed distance to the cluster center (centroid-based clustering) or maximises the summed probability under the cluster parameters (probabilistic clustering).
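For the centroid-based case, the selection of top words can be sketched as below. The helper name `top_j_words` and the toy vocabulary are hypothetical, and Euclidean distance is assumed (the KM setting); for SK one would swap in cosine similarity, and for GMM the probability under each component.

```python
import numpy as np

def top_j_words(X, vocab, centers, labels, j=10):
    """Return, for each cluster, its j member words closest to the center."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    topics = []
    for c, center in enumerate(np.asarray(centers, dtype=float)):
        members = np.flatnonzero(labels == c)
        dists = np.linalg.norm(X[members] - center, axis=1)
        nearest = members[np.argsort(dists)[:j]]
        topics.append([vocab[i] for i in nearest])
    return topics

# Toy example with 1-d structure embedded in 2-d.
toy_X = [[0, 0], [1, 0], [3, 0], [10, 0], [11, 0]]
toy_vocab = ["a", "b", "c", "d", "e"]
toy_topics = top_j_words(toy_X, toy_vocab,
                         centers=[[0, 0], [10, 0]],
                         labels=[0, 0, 0, 1, 1], j=2)
```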


3.3 Incorporating document information

We explore various methods to incorporate corpus information into the clustering algorithm. Specifically, we examine three different schemes to assign scores to word types: term frequency (TF), term frequency–inverse document frequency (TF-IDF) and term frequency–document frequency (TF-DF).

These scores are then used for weighting word types when clustering (weighted models), reranking top words (reranked models), both, or neither, i.e., using uniform weights.
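As an illustration of how such scores can be used for reranking, the sketch below computes a term-frequency score and re-sorts a candidate list by it. Both function names are hypothetical, and TF is assumed here to be the relative frequency of a type in the training corpus, since the exact score definitions are not reproduced in this text.

```python
from collections import Counter

def term_frequency(docs):
    """Relative corpus frequency of each word type (docs: list of token lists)."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def rerank(candidates, scores, j=10):
    """Re-sort candidate words (e.g. those nearest a cluster center) by score."""
    ranked = sorted(candidates, key=lambda w: scores.get(w, 0.0), reverse=True)
    return ranked[:j]

toy_docs = [["god", "faith", "god"], ["god", "chip"]]
tf = term_frequency(toy_docs)
best = rerank(["faith", "god"], tf, j=1)
```

The same scores could be passed as per-type weights to a weighted clustering routine, which is the other use described above.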

        FastText                          Glove
        weighted KM      + reranking      weighted KM      + reranking
        christians       jesus            control          drive
        religious        bible            switch           power
        belief           christian        connected        control
        christianity     church           setup            chip
        relevation       faith            capability       memory
        believers        christians       supplied         speed
        fundamentalist   religion         switching        machine
        fundamentalist   christ           switched         display
        nonchristians    religious        enabled          hardware
        unbelievers      belief           powering         output
NPMI    -0.190           0.438            -0.238           0.2017

Table 1: Top 10 topic words on 20NG for a particular topic, comparing FastText and Glove embeddings using weighted KM before and after reranking top words. We observe large gains in NPMI score averaged across all topics.

4 Computational Complexity

The complexity of KM is $O(tknm)$, and of GMM is $O(tknm^3)$, where $t$ refers to the number of iterations, $k$ is the number of clusters (topics), $n$ is the number of word types (unique vocabulary), and $m$ is the dimension of the embeddings. (In general, the $t$ required for convergence differs by clustering algorithm and embedding representation. However, we can specify the maximum number of iterations as a constant factor for worst-case analysis.) Weighted variants have a one-off cost of weight initialisation, and contribute a constant multiplicative factor when recalculating the centroid in the clustering algorithm. Reranking has an additional $O(n \log n_k)$ factor, where $n_k$ is the average number of elements in a cluster. In contrast, LDA via collapsed Gibbs sampling has a complexity of $O(tkN)$, where $N$ is the number of all tokens in the corpus. When $N \gg nm$, clustering methods can potentially achieve better performance-complexity tradeoffs.
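To make the comparison concrete, the leading-order operation counts can be evaluated for hypothetical sizes. The sketch assumes the standard bounds (iterations × clusters × types × dimensions for k-means, the same with cubed dimensions for GMM, and iterations × clusters × tokens for collapsed Gibbs LDA). The numbers are made up for illustration and do not describe the experiments, and in practice the iteration counts differ between methods.

```python
def kmeans_cost(t, k, n, m):
    """Leading-order cost of k-means: t iterations, k clusters, n types, m dims."""
    return t * k * n * m

def gmm_cost(t, k, n, m):
    """Leading-order cost of a full-covariance GMM (cubic in the dimension m)."""
    return t * k * n * m ** 3

def lda_gibbs_cost(t, k, n_tokens):
    """Leading-order cost of collapsed Gibbs LDA over all corpus tokens."""
    return t * k * n_tokens

# Hypothetical sizes: 10k word types, 300-d embeddings, 50M corpus tokens.
t, k, n, m, n_tokens = 100, 20, 10_000, 300, 50_000_000
km = kmeans_cost(t, k, n, m)
lda = lda_gibbs_cost(t, k, n_tokens)
gmm = gmm_cost(t, k, n, m)
```

With these illustrative sizes the token count far exceeds the product of vocabulary size and embedding dimension, so the k-means count is well below the LDA count, while the cubic dimension term makes the GMM count much larger than both.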

5 Experiments

5.1 Experimental Setup


We use the 20 newsgroup dataset (20NG), which is a common text analysis dataset containing around 18000 documents and 20 categories. (The 20 newsgroup dataset can be obtained from http://qwone.com/~jason/20Newsgroups/) We adopt the standard 60-40 train-test split of the dataset: we run the clustering algorithm on the training set, obtain the top 10 words of each cluster, and evaluate these on the test split. We present results averaged across 5 random seeds.


We remove stopwords, punctuation and digits, lowercase tokens, and exclude words that appear in fewer than 5 documents. For contextualised word embeddings (BERT and ELMo), sentences served as the context window to obtain the token representations, which were averaged to obtain the type representation. For BERT, we experiment with two variants: BERT(ns), which ignores subword tokens, and BERT, which averages the subword token representations. (Taking the first subword embedding as a representation for the whole word performs consistently worse than averaging subword representations.)
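The averaging of contextual token representations into a single vector per word type can be sketched as follows. The helper name `type_embeddings` and the toy vectors are hypothetical, and the sketch assumes the per-token vectors have already been extracted from the contextual model.

```python
import numpy as np
from collections import defaultdict

def type_embeddings(token_vectors):
    """Average contextual token vectors into one vector per word type.

    token_vectors: iterable of (word, vector) pairs, one per token occurrence.
    """
    sums = {}
    counts = defaultdict(int)
    for word, vec in token_vectors:
        vec = np.asarray(vec, dtype=float)
        sums[word] = vec.copy() if word not in sums else sums[word] + vec
        counts[word] += 1
    return {word: sums[word] / counts[word] for word in sums}

# Two occurrences of "bank" in different contexts, one of "river".
toy_tokens = [("bank", [1.0, 0.0]), ("bank", [3.0, 2.0]), ("river", [0.0, 1.0])]
types = type_embeddings(toy_tokens)
```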

Evaluation Metric

We evaluate the clustering results using the topical coherence metric normalised pointwise mutual information (NPMI; Bouma, 2009), which has been shown to correlate with human judgements (Chang et al., 2009). NPMI ranges from $-1$ to $1$.
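A minimal sketch of the metric is given below. It assumes that word probabilities are estimated from document-level co-occurrence in a reference corpus and that a topic's score is the mean NPMI over all pairs of its top words. The function names and the handling of the boundary cases are assumptions, not a description of the evaluation code used here.

```python
import math
from itertools import combinations

def npmi(word_a, word_b, doc_sets):
    """NPMI of two words, with probabilities estimated from document co-occurrence."""
    n_docs = len(doc_sets)
    p_a = sum(word_a in d for d in doc_sets) / n_docs
    p_b = sum(word_b in d for d in doc_sets) / n_docs
    p_ab = sum(word_a in d and word_b in d for d in doc_sets) / n_docs
    if p_ab == 0:
        return -1.0  # the words never co-occur
    if p_ab == 1:
        return 1.0   # the words co-occur in every document (limit value)
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

def topic_npmi(top_words, doc_sets):
    """Mean NPMI over all unordered pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, doc_sets) for a, b in pairs) / len(pairs)

toy_corpus = [{"a", "b"}, {"a", "b"}, {"c"}, {"c"}]
score = topic_npmi(["a", "b", "c"], toy_corpus)
```

In the toy corpus, "a" and "b" always appear together (NPMI of 1) while "c" never appears with either (NPMI of -1), so mixing them into one topic gives a negative score.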

5.2 Results and Discussion


Embedding  Training data (words/tokens)     Unweighted                Reranked                  Weighted                  Weighted + reranked
                                            KM      SK      GMM       KM      SK      GMM       KM      SK      GMM       KM      SK      GMM
Word2vec   3B (news)                        -0.555  -0.451  -0.286    0.166   0.188   0.146     -0.291  -0.092  0.091     0.166   0.233   0.177
FastText   2B (Wikipedia)                   -0.561  -0.657  -0.419    0.225   0.142   0.196     -0.382  -0.187  0.212     0.235   0.240   0.253
Glove      840B (Common Crawl)              -0.436  -0.111  -0.299    0.182   0.213   0.155     -0.043  0.179   0.233     0.219   0.237   0.240
BERT(ns)   0.8B (Books) + 2.5B (Wikipedia)  0.001   0.013   0.151     0.253   0.241   0.222     0.180   0.193   0.242     0.250   0.244   0.241
BERT       0.8B (Books) + 2.5B (Wikipedia)  -0.080  -0.076  0.083     0.159   0.174   0.159     0.201   0.185   0.253     0.252   0.253   0.255
ELMo       0.8B (1 Billion Word)            -0.808  -0.733  -0.289    0.050   0.039   0.111     0.165   -0.461  0.165     0.139   0.164   0.177
Spherical  2B (Wikipedia)                   -0.197  -0.222  -0.398    0.212   0.227   0.185     0.180   0.159   0.250     0.282   0.271   0.247
average                                     -0.337  -0.289  -0.167    0.176   0.174   0.161     0.024   0.017   0.209     0.225   0.231   0.227
std. dev.                                   0.276   0.269   0.226     0.057   0.060   0.036     0.221   0.226   0.052     0.046   0.030   0.030

Table 2: NPMI results (higher is better) for pre-trained word embeddings with k-means (KM), spherical k-means (SK) and GMM, with and without weighting and reranking of top words. LDA has an NPMI score of 0.279, while the best performing combination, Spherical embeddings with weighted KM and reranking, achieves a slightly better (but not statistically different) NPMI of 0.282. All results are averaged across 5 random seeds.
Figure 1: Plots showing the effect of PCA dimension reduction on different embedding and clustering algorithms. Panels: (a) clustering with KM, (b) clustering with KM and reranking, (c) clustering with GMM. Note: plots do not include ELMo, as using PCA greatly reduces topic cohesion, or BERT First Word, as it follows a similar trend to BERT Average.


A simple LDA run, which performs no better than our best method (weighted KM with reranking), takes about a minute using MALLET (McCallum, 2002). In comparison, running the clustering on CPU takes little more than 10 seconds using sklearn (Pedregosa et al., 2011), and a third of that using custom JAX implementations on GPU (Bradbury et al., 2018).

Incorporating Document Information

We find that simply using Term Frequency (TF) outperforms the other weighting schemes (subsection 3.3). In particular, TF-IDF, perhaps surprisingly, is a poorer reweighting scheme (see Appendix B). Therefore, our results in Table 2 and subsequent analysis utilise TF for weighted clustering and reranking.

Analysis of Algorithms - Weighted Clustering

Under the unweighted clustering of vocabulary types, all clustering algorithm and embedding combinations perform poorly compared to LDA. GMM outperforms KM and SK for both weighted and unweighted variants across all embedding methods (two-tailed t-test for GMM vs KM and SK).

Analysis of Algorithms - Reranking

For KM and SK, extracting the top topic words (subsection 3.2) before reranking results in reasonable looking themes, but scores poorly on NPMI. Reranking top topic words with a window size of 100 results in a large improvement for KM and SK. Examples before and after reranking are provided in Table 1. This indicates that cluster centers are surrounded by low frequency types, even if the clusters are centered around valid themes.

For GMM, the gains from reranking are much less pronounced. We found that the top topic words before and after reranking for BERT-GMM have an average Jaccard similarity score of 0.910, indicating that the Gaussians are already centered at word types of high frequency in the training corpus, and have fundamentally different cluster centers from those learned by KM.
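The Jaccard similarity between two top-word lists is the size of their intersection divided by the size of their union. A minimal sketch, with a hypothetical helper name and toy word lists:

```python
def jaccard(words_a, words_b):
    """Jaccard similarity between two word lists, treated as sets."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty lists are identical
    return len(a & b) / len(a | b)

before = ["god", "jesus", "faith", "bible"]
after = ["god", "jesus", "faith", "church"]
similarity = jaccard(before, after)
```

Here three of the five distinct words are shared, giving a similarity of 0.6. A score close to 1 means reranking left the top words almost unchanged.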

Analysis of Embedding Method

The best performers are Spherical embeddings (Meng et al., 2019), which achieve the top 2 NPMI scores in the table with KM and SK. ELMo performs poorly compared to the other embeddings; this could be due to the layer-wise combination of ELMo embeddings, which have been tuned for other tasks. The performance of FastText and Glove is reliant on weighted clustering and reranking, with which they can even achieve similar performance to BERT. This has important implications for practical applications where GPU resources are not always available to efficiently extract BERT embeddings from pre-trained models.

BERT embeddings perform consistently well across the clustering algorithm variants with weighting and reranking. Interestingly, the exclusion of words that are tokenized into subwords (BERT(ns)) does not negatively impact topic coherence. This suggests that compound words that can be tokenized into subwords are not critical to finding coherent topics.

5.3 Dimensionality Reduction

We apply PCA to the word embeddings before clustering to estimate the amount of redundancy in the dimensions of large embeddings, which impact clustering complexity (section 4). Across both KM and GMM, BERT embeddings can be reduced to 300 dimensions. Coherence for both Spherical and BERT begins to fall at 200 dimensions, but this effect can be mitigated with reranking.

We observe that for GMM, we can safely reduce the dimensions of BERT embeddings from 768 to 100, and even achieve better performance at lower dimensionality. The reduction is consistent across different types of embeddings, indicating that GMM performs better under lower dimensionality (Figure 1). However, given the cubic complexity of GMM in the number of dimensions (section 4), KM, which achieves comparable performance, might be preferred in practical settings.
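Reducing the embeddings with PCA before clustering can be sketched with a plain SVD. The helper name `pca_reduce` is hypothetical and the random matrix merely stands in for real embeddings; a library implementation such as the one in scikit-learn could be used instead.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X onto their top principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:n_components].T

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(50, 8))  # stand-in for (n types, m dims)
reduced = pca_reduce(fake_embeddings, n_components=3)
```

The reduced matrix can then be passed to the clustering algorithm in place of the original embeddings, which lowers the dimension term in the complexity of both KM and GMM.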

6 Conclusion

We outlined a methodology for clustering word embeddings for unsupervised document analysis, and presented a systematic comparison of various influential embedding methods and clustering algorithms. Our experiments suggest that pre-trained word embeddings combined with weighted clustering algorithms and reranking provide a viable alternative to traditional topic modeling at lower complexity and runtime.


References

  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
  • Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
  • Bouma (2009) Gerlof Bouma. 2009. Normalized (pointwise) mutual information in collocation extraction. Proceedings of GSCL, pages 31–40.
  • Boyd-Graber et al. (2017) Jordan Boyd-Graber, Yuening Hu, David Mimno, et al. 2017. Applications of topic models. Foundations and Trends® in Information Retrieval, 11(2-3):143–296.
  • Bradbury et al. (2018) James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. 2018. JAX: composable transformations of Python+NumPy programs.
  • Cha et al. (2017) Miriam Cha, Youngjune Gwon, and HT Kung. 2017. Language modeling by clustering with word embeddings for text readability assessment. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 2003–2006.
  • Chang et al. (2009) Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L Boyd-Graber, and David M Blei. 2009. Reading tea leaves: How humans interpret topic models. In Advances in neural information processing systems, pages 288–296.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • McCallum (2002) Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. Http://mallet.cs.umass.edu.
  • Meng et al. (2019) Yu Meng, Jiaxin Huang, Guangyuan Wang, Chao Zhang, Honglei Zhuang, Lance Kaplan, and Jiawei Han. 2019. Spherical text embedding. In Advances in Neural Information Processing Systems, pages 8206–8215.
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • de Miranda et al. (2019) Guilherme Raiol de Miranda, Rodrigo Pasti, and Leandro Nunes de Castro. 2019. Detecting topics in documents by clustering word vectors. In International Symposium on Distributed Computing and Artificial Intelligence, pages 235–243. Springer.
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
  • Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.
  • Reimers et al. (2019) Nils Reimers, Benjamin Schiller, Tilman Beck, Johannes Daxenberger, Christian Stab, and Iryna Gurevych. 2019. Classification and clustering of arguments with contextualized word embeddings. arXiv preprint arXiv:1906.09821.
  • Sano et al. (2017) Motoki Sano, Austin J Brockmeier, Georgios Kontonatsios, Tingting Mu, John Y Goulermas, Jun’ichi Tsujii, and Sophia Ananiadou. 2017. Distributed document and phrase co-embeddings for descriptive clustering. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 991–1001.

Appendix A k-means (KM) vs k-medoids (KD)

To further understand the effect of other centroid based algorithms on topic coherence, we also applied the k-medoids (KD) clustering algorithm. KD is a hard clustering algorithm similar to KM but less sensitive to outliers.

As we can see in Table 3, KD usually did as well as or worse than KM. KD also did relatively poorly after frequency reranking. Where KD did do better than KM, the difference is not very striking and the NPMI scores were still well below those of the other top performing models.

Appendix B Comparing Different Reranking Schemes

As mentioned in the paper, after clustering the embeddings, instead of directly retrieving the top-J terms, we can rerank the terms based on a scoring metric and then retrieve the top-J terms with the highest scores. We compare term frequency (TF), term frequency–inverse document frequency (TF-IDF) and term frequency–document frequency (TF-DF), equations for which are presented in subsection 3.3. To get a single value for each word for TF-IDF, we sum over all the documents to get one aggregated value.

We present the results of using different reranking schemes for KM (Table 4) and for KM weighted by term frequency (Table 5).

Compared to the TF results in the main paper, other reranking schemes such as aggregated TF-IDF and TF-DF produce more coherent topics than the original hard clustering, but fare worse than reranking with TF.

Embedding             KM      KD      KM (reranked)  KD (reranked)
Word2Vec              -0.555  -0.890  0.166          -0.038
FastText              -0.561  -0.445  0.225          0.194
Glove                 -0.436  -0.490  0.182          0.025
BERT (reduced)        0.001   -0.117  0.253          0.197
BERT 12 (average)     -0.080  -0.014  0.159          0.215
BERT 12 (first word)  -0.057  -0.014  0.159          0.216
ELMO                  -0.808  -0.801  0.050          -0.057
Spherical             -0.197  -0.198  0.212          0.200
average               -0.337  -0.370  0.176          0.119
std. dev.             0.294   0.343   0.061          0.120

Table 3: Results for pre-trained word embeddings with k-means (KM) and k-medoids (KD), before and after reranking of top words using term frequency.

Embedding         TF     TF-IDF  TF-DF
Word2Vec          0.166  0.145   0.169
FastText          0.225  0.191   0.220
Glove             0.181  0.152   0.181
BERT(ns)          0.253  0.231   0.248
BERT(average)     0.159  0.152   0.168
BERT(first-word)  0.159  0.135   0.165
ELMO              0.050  0.035   0.083
Spherical         0.212  0.192   0.213
average           0.176  0.154   0.181
std. dev.         0.061  0.057   0.049

Table 4: Results for k-means (without weighting) with pre-trained word embeddings using different reranking metrics: TF, TF-IDF and TF-DF.

Embedding         TF     TF-IDF  TF-DF
Word2Vec          0.166  0.112   0.141
FastText          0.235  0.248   0.239
Glove             0.219  0.210   0.235
BERT(ns)          0.250  0.232   0.245
BERT(average)     0.252  0.230   0.251
BERT(first-word)  0.258  0.187   0.218
ELMO              0.139  0.090   0.134
Spherical         0.282  0.263   0.273
average           0.225  0.197   0.217
std. dev.         0.049  0.0634  0.051

Table 5: Results for k-means weighted by term frequency with pre-trained word embeddings, using different reranking metrics: TF, TF-IDF and TF-DF.