For exploratory document analysis, which aims to uncover the main themes and underlying narratives within a corpus (Boyd-Graber et al., 2017), topic models are undeniably the standard approach. But in times of distributed and even contextualized embeddings, are they the only option?
This work explores an alternative to topic modeling by reformulating ‘key themes’ or ‘topics’ as clusters
of words under the modern distributed representation learning paradigm. Unsupervised pre-trained word embeddings provide a vector representation for each word type, which allows us to cluster words by their distance in high-dimensional space. The goal of this work is not strictly to outperform, but rather to benchmark standard clustering of modern embeddings against the classical approach of Latent Dirichlet Allocation (LDA; Blei et al., 2003). We restrict our study to influential embedding methods and focus on centroid-based clustering algorithms, as they provide a natural way to obtain the top words in each cluster based on distance from the cluster center.
Aside from reporting the best-performing combination of word embeddings and clustering algorithm, we are also interested in whether there are consistent patterns across the choice of embeddings and clustering algorithm. A word embedding method that does consistently well across clustering algorithms would suggest that it is a good representation for unsupervised document analysis. Similarly, a clustering algorithm that performs consistently well across embeddings would suggest that its assumptions are more likely to generalize even with future advances in word embedding methods.
Finally, we seek to incorporate document information directly into the clustering algorithm, and quantify the effects of two key methods: 1) weighting terms during clustering, and 2) reranking terms when obtaining the top representative words. Our contributions are as follows:
To our knowledge, this is the first work that systematically applies centroid-based clustering algorithms to embedding methods for document analysis.
We analyse how clustering embeddings directly can potentially achieve lower computational complexity and runtime than probabilistic generative approaches.
Our proposed approach for incorporating document information into clustering and reranking of top words results in sensible topics; the best-performing combination is comparable with LDA, but with lower time complexity and empirical runtime.
We find that the dimensions of some word embeddings can be reduced by more than 50% before clustering.
2 Related Work and Background
This work focuses on centroid-based k-means (KM) and spherical k-means (SK) for hard clustering, and Gaussian Mixture Models (GMM) for soft clustering. (We also experiment with k-medoids but observe that it is strictly worse than KM; see Appendix A.)
We apply clustering to pre-trained embeddings, namely word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), FastText (Bojanowski et al., 2017), Spherical (Meng et al., 2019), ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018). Word2vec, GloVe, FastText and Spherical all have 300 dimensions, BERT has 768, and ELMo has 1024.
Clustering word embeddings has been used for readability assessment (Cha et al., 2017), argument mining (Reimers et al., 2019), and document classification and document clustering (Sano et al., 2017). To our knowledge, there is no prior work that studies the interaction between word embeddings and clustering algorithms for unsupervised document analysis in a direct comparison with standard LDA (Blei et al., 2003). Most closely related is the work of de Miranda et al. (2019), who pursue this idea with self-organising maps, but do not provide any quantitative results. (As this is a preprint, we gladly welcome pointers to related work we may have missed.)
3.1 General Clustering Approach
We first preprocess and extract the vocabulary from our training documents (subsection 5.1). Each word is converted to its embedding representation, following which we apply the various clustering algorithms to obtain clusters, using weighted (subsection 3.3) or unweighted word types. After the clustering algorithm has converged, we obtain the top words from each cluster for evaluation.
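As a minimal sketch of this pipeline (with a toy random lookup standing in for a real pre-trained embedding model, and hypothetical variable names), the procedure amounts to:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for a pre-trained embedding lookup (hypothetical vectors;
# in the paper these would be word2vec/GloVe/BERT type embeddings).
rng = np.random.default_rng(0)
vocab = ["game", "team", "score", "bible", "faith", "church"]
embedding_of = {w: rng.normal(size=50) for w in vocab}

# One vector per word *type*; we cluster the types, not the documents.
X = np.stack([embedding_of[w] for w in vocab])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Top-J words per cluster = the J types closest to each cluster center.
def top_words(center, j=3):
    dists = np.linalg.norm(X - center, axis=1)
    return [vocab[i] for i in np.argsort(dists)[:j]]

tops = [top_words(c) for c in km.cluster_centers_]
```

With real embeddings the nearest types to each center form the cluster's topic words; here the vectors are random, so the groupings are arbitrary.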
3.2 Obtaining the top-J words
In traditional topic modeling (LDA), the top words are those with the highest probability under each topic-word distribution. For centroid-based clustering algorithms, the top words are naturally those closest to the cluster center, and for probabilistic clustering, the top words are those with the highest probability under the cluster parameters. Formally, for a cluster with center $\mu_k$, this means choosing the set of types $T_k \subseteq V$ with $|T_k| = J$ that minimizes $\sum_{w \in T_k} d(e_w, \mu_k)$, where $e_w$ is the embedding of type $w$ and $d$ is the clustering distance.
3.3 Incorporating document information
We explore various methods to incorporate corpus information into the clustering algorithm. Specifically, we examine three different schemes to assign scores to word types:
These scores are then used for weighting word types when clustering, for reranking top words, for both, or for neither (i.e., uniform weights); model names are marked accordingly.
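A compact sketch of how weighting and reranking plug into centroid clustering (random stand-in embeddings and frequencies; `sample_weight` is sklearn's mechanism for weighted k-means):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
vocab = [f"w{i}" for i in range(40)]
X = rng.normal(size=(40, 16))            # stand-in type embeddings
tf = rng.integers(1, 100, size=40)       # stand-in term-frequency scores

# Weighted clustering: frequent types pull the centroids toward them.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X, sample_weight=tf)

# Reranking: take the types nearest each center (a window), then reorder
# them by TF and keep the top J as the cluster's representative words.
def top_words(center, window=10, j=5):
    nearest = np.argsort(np.linalg.norm(X - center, axis=1))[:window]
    reranked = sorted(nearest, key=lambda i: -tf[i])
    return [vocab[i] for i in reranked[:j]]

tops = [top_words(c) for c in km.cluster_centers_]
```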
4 Computational Complexity
The complexity of KM is $O(IKVD)$, and of GMM is $O(IKVD^3)$, where $I$ is the number of iterations (in general, the number of iterations required for convergence differs by clustering algorithm and embedding representation, but we can specify a maximum number of iterations as a constant factor for worst-case analysis), $K$ is the number of clusters (topics), $V$ is the number of word types (unique vocabulary), and $D$ is the dimension of the embeddings. Weighted variants have a one-off cost for weight initialisation, and contribute a constant multiplicative factor when recalculating the centroids in the clustering algorithm. Reranking adds an $O(Km)$ factor, where $m$ is the average number of elements in a cluster. In contrast, LDA via collapsed Gibbs sampling has a complexity of $O(IKN)$, where $N$ is the number of tokens in the corpus. When $VD < N$, clustering methods can potentially achieve better performance-complexity tradeoffs.
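To make the comparison concrete, a back-of-the-envelope example with purely illustrative values ($K=20$ topics, $V=10{,}000$ types, $D=100$ dimensions after reduction, $N=2{,}000{,}000$ tokens):

```latex
\begin{align*}
\text{KM, per iteration:} \quad KVD &= 20 \cdot 10^{4} \cdot 10^{2} = 2 \times 10^{7}\\
\text{LDA, per Gibbs sweep:} \quad KN &= 20 \cdot 2 \times 10^{6} = 4 \times 10^{7}
\end{align*}
```

Here $VD = 10^{6} < N = 2 \times 10^{6}$, so each clustering iteration is cheaper than one Gibbs sweep over the corpus.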
5.1 Experimental Setup
We use the 20 Newsgroups dataset (20NG), a common text analysis dataset containing around 18,000 documents in 20 categories (available from http://qwone.com/~jason/20Newsgroups/). We adopt the dataset's standard 60-40 train-test split, run the clustering algorithms on the training set, obtain the top 10 words of each cluster, and evaluate them on the test split. We present results averaged across 5 random seeds.
We remove stopwords, punctuation and digits, lowercase tokens, and exclude words that appear in fewer than 5 documents. For contextualised word embeddings (BERT and ELMo), sentences serve as the context window for obtaining token representations, which are averaged to obtain the type representation. For BERT, we experiment with two variants: BERT(ns), which ignores subword tokens, and BERT, which averages the subword token representations. (Taking the first subword embedding as the representation for the whole word performs consistently worse than averaging subword representations.)
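The token-to-type averaging can be sketched as follows (random vectors stand in for the contextualised token embeddings a BERT/ELMo forward pass would produce):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(2)

# Stand-in for contextualised token embeddings: in the real pipeline each
# occurrence of a word in a sentence gets its own vector from BERT/ELMo.
token_occurrences = [("church", rng.normal(size=8)) for _ in range(3)] + \
                    [("game", rng.normal(size=8)) for _ in range(5)]

# Average all token-level vectors of a word to get its *type* representation.
sums = defaultdict(lambda: np.zeros(8))
counts = defaultdict(int)
for word, vec in token_occurrences:
    sums[word] += vec
    counts[word] += 1
type_embedding = {w: sums[w] / counts[w] for w in sums}
```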
5.2 Results and Discussion
(Table: training data (words/tokens) of each embedding method — 840B (Common Crawl); 0.8B (Books) + 2.5B (Wikipedia); 0.8B (Books) + 2.5B (Wikipedia); 0.8B (1 Billion Word).)
A simple LDA run, which performs no better than our best method (KM), takes about a minute using MALLET (McCallum, 2002); running the clustering on CPU takes little more than 10 seconds using sklearn (Pedregosa et al., 2011), and a third of that using custom JAX implementations on GPU (Bradbury et al., 2018).
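The clustering step itself is a handful of dense array operations, which is why vectorized and GPU implementations are fast. A minimal NumPy sketch of one Lloyd iteration (not the paper's actual JAX code) makes the $O(KVD)$ inner term visible:

```python
import numpy as np

def kmeans_step(X, centers):
    """One Lloyd iteration: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    # Pairwise squared distances, shape (V, K): the O(KVD) term.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    new_centers = np.stack([
        X[assign == k].mean(axis=0) if (assign == k).any() else centers[k]
        for k in range(len(centers))
    ])
    return new_centers, assign

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
centers = X[rng.choice(200, size=5, replace=False)]
for _ in range(10):
    centers, assign = kmeans_step(X, centers)
```

In JAX the same step can be `jit`-compiled and run on GPU essentially unchanged.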
Incorporating Document Information
We find that simply using term frequency (TF) outperforms the other weighting schemes (subsection 3.3). In particular, TF-IDF is, perhaps surprisingly, a poorer reweighting scheme (see Appendix B). Our results in Table 2 and subsequent analysis therefore use TF for weighted clustering and reranking.
Analysis of Algorithms - Weighted Clustering
Under unweighted clustering of vocabulary types, all combinations of clustering algorithm and embedding method perform poorly compared to LDA. GMM outperforms KM and SK for both weighted and unweighted variants across all embedding methods (two-tailed t-test for GMM vs. KM and SK).
Analysis of Algorithms - Reranking
For KM and SK, extracting the top topic words (subsection 3.2) before reranking results in reasonable-looking themes, but scores poorly on NPMI. Reranking the top topic words with a window size of 100 yields a large improvement for KM and SK. Examples before and after reranking are provided in Table 1. This indicates that cluster centers are surrounded by low-frequency types, even when the clusters are centered around valid themes.
For GMM, the gains from reranking are much less pronounced. The top topic words before and after reranking for BERT-GMM have an average Jaccard similarity of 0.910, indicating that the Gaussians are already centered at word types with high frequency in the training corpus, and thus have fundamentally different cluster centers from those learned by KM.
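The Jaccard score used above compares the top-word sets before and after reranking; for two word sets it is simply:

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two word sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

# Hypothetical top-5 lists before and after reranking.
before = ["god", "jesus", "faith", "church", "bible"]
after  = ["god", "jesus", "faith", "church", "christ"]
```

With four of six distinct words shared, `jaccard(before, after)` is 4/6; a score near 1 means reranking barely changed the top words.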
Analysis of Embedding Method
The best performers are Spherical embeddings (Meng et al., 2019), which achieve the top two NPMI scores in the table with KM and SK. ELMo performs poorly compared to the other embeddings; this could be due to the layer-wise combination of ELMo embeddings having been tuned for other tasks. FastText and GloVe rely on weighted clustering and reranking for their performance, with which they can achieve results similar to BERT. This has important implications for practical applications where GPU resources are not always available to efficiently extract BERT embeddings from pre-trained models.
BERT embeddings perform consistently well across the clustering algorithm variants with weighting and reranking. Interestingly, excluding words that are tokenized into subwords (BERT(ns)) does not negatively impact topic coherence. This suggests that compound words that can be tokenized into subwords are not critical to finding coherent topics.
5.3 Dimensionality Reduction
We apply PCA to the word embeddings before clustering to estimate the amount of redundancy in the dimensions of large embeddings, which impacts clustering complexity (section 4). Across both KM and GMM, BERT embeddings can be reduced to 300 dimensions. Performance for both Spherical and BERT begins to fall at 200 dimensions, but this effect can be mitigated with reranking.
We observe that for GMM we can safely reduce the dimensions of BERT embeddings from 768 to 100, and even achieve better performance at lower dimensionality. The reduction is consistent across different types of embeddings, indicating that GMM performs better at lower dimensionality (Figure 1). However, given the cubic complexity of GMM in the number of dimensions (section 4), KM, which achieves comparable performance, might be preferred in practical settings.
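The reduction step can be sketched with a plain SVD-based PCA (random vectors stand in for the actual type embeddings):

```python
import numpy as np

def pca_reduce(X, d):
    """Project embeddings onto their top-d principal components."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered matrix; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 768))   # e.g. BERT-sized type embeddings
X_red = pca_reduce(X, 100)        # cluster these instead of the 768-d originals
```

Clustering then runs on `X_red`, cutting the $D$ factor in the complexity terms of section 4.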
We outlined a methodology for clustering word embeddings for unsupervised document analysis, and presented a systematic comparison of various influential embedding methods and clustering algorithms. Our experiments suggest that pre-trained word embeddings, combined with weighted clustering and reranking, provide a viable alternative to traditional topic modeling at lower complexity and runtime.
- Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
- Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
- Bouma (2009) Gerlof Bouma. 2009. Normalized (pointwise) mutual information in collocation extraction. Proceedings of GSCL, pages 31–40.
- Boyd-Graber et al. (2017) Jordan Boyd-Graber, Yuening Hu, David Mimno, et al. 2017. Applications of topic models. Foundations and Trends® in Information Retrieval, 11(2-3):143–296.
- Bradbury et al. (2018) James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. 2018. JAX: composable transformations of Python+NumPy programs.
- Cha et al. (2017) Miriam Cha, Youngjune Gwon, and HT Kung. 2017. Language modeling by clustering with word embeddings for text readability assessment. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 2003–2006.
- Chang et al. (2009) Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L Boyd-Graber, and David M Blei. 2009. Reading tea leaves: How humans interpret topic models. In Advances in neural information processing systems, pages 288–296.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- McCallum (2002) Andrew Kachites McCallum. 2002. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu.
- Meng et al. (2019) Yu Meng, Jiaxin Huang, Guangyuan Wang, Chao Zhang, Honglei Zhuang, Lance Kaplan, and Jiawei Han. 2019. Spherical text embedding. In Advances in Neural Information Processing Systems, pages 8206–8215.
- Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- de Miranda et al. (2019) Guilherme Raiol de Miranda, Rodrigo Pasti, and Leandro Nunes de Castro. 2019. Detecting topics in documents by clustering word vectors. In International Symposium on Distributed Computing and Artificial Intelligence, pages 235–243. Springer.
- Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
- Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.
- Reimers et al. (2019) Nils Reimers, Benjamin Schiller, Tilman Beck, Johannes Daxenberger, Christian Stab, and Iryna Gurevych. 2019. Classification and clustering of arguments with contextualized word embeddings. arXiv preprint arXiv:1906.09821.
- Sano et al. (2017) Motoki Sano, Austin J Brockmeier, Georgios Kontonatsios, Tingting Mu, John Y Goulermas, Jun’ichi Tsujii, and Sophia Ananiadou. 2017. Distributed document and phrase co-embeddings for descriptive clustering. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 991–1001.
Appendix A k-means (KM) vs k-medoids (KD)
To further understand the effect of other centroid-based algorithms on topic coherence, we also applied the k-medoids (KD) clustering algorithm. KD is a hard clustering algorithm similar to KM but less sensitive to outliers.
As shown in Table 3, KD did as well as or worse than KM in almost all cases, and performed relatively poorly after frequency reranking. Where KD did do better than KM, the difference was not striking, and its NPMI scores remained well below those of the other top-performing models.
Appendix B Comparing Different Reranking Schemes
As mentioned in the paper, after clustering the embeddings, instead of directly retrieving the top-J terms, we can rerank the terms by a scoring metric and then retrieve the top-J highest-ranked terms. We compare term frequency (TF), term frequency–inverse document frequency (TF-IDF) and term frequency–document frequency (TF-DF); the equations are presented in subsection 3.3. To obtain a single value per word for TF-IDF, we sum over all documents to get one aggregated score.
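Under one plausible reading of these definitions (the exact equations are in subsection 3.3, so the formulas below are an illustrative interpretation, not the paper's), the three scores can be computed as:

```python
import math
from collections import Counter

# Toy tokenized corpus.
docs = [["cat", "sat", "mat"], ["cat", "cat", "dog"], ["dog", "barks"]]
n_docs = len(docs)

tf = Counter(w for d in docs for w in d)          # corpus term frequency
df = Counter(w for d in docs for w in set(d))     # document frequency

# Aggregated TF-IDF: sum tf(w, d) * idf(w) over all documents d.
idf = {w: math.log(n_docs / df[w]) for w in df}
tfidf = {w: sum(d.count(w) for d in docs) * idf[w] for w in tf}

# TF-DF: reward words that are both frequent and widespread.
tfdf = {w: tf[w] * df[w] for w in tf}
```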
Compared to the TF results in the main paper, the other reranking schemes, aggregated TF-IDF and TF-DF, produce more coherent topics than the original hard clustering but fare worse than reranking with TF.
| BERT 12 (average) | -0.080 | -0.014 | 0.159 | 0.215 |
| BERT 12 (first word) | -0.057 | -0.014 | 0.159 | 0.216 |