Discovering topics in text datasets by visualizing relevant words

07/18/2017 ∙ by Franziska Horn, et al. ∙ Berlin Institute of Technology (Technische Universität Berlin)

When dealing with large collections of documents, it is imperative to quickly get an overview of the texts' contents. In this paper we show how this can be achieved by using a clustering algorithm to identify topics in the dataset and then selecting and visualizing relevant words, which distinguish a group of documents from the rest of the texts, to summarize the contents of the documents belonging to each topic. We demonstrate our approach by discovering trending topics in a collection of New York Times article snippets.




1 Introduction

Large, unstructured text datasets, e.g. in the form of data dumps leaked to journalists, are becoming more and more frequent. To quickly get an overview of the contents of such datasets, tools for exploratory analysis are essential.

We propose a method for extracting from a set of texts the relevant words that distinguish these documents from others in the dataset. By using the DBSCAN clustering algorithm [4], the documents in a dataset can be grouped to reveal salient topics. We can then summarize the texts belonging to each topic by visualizing the extracted relevant words in word clouds, thereby enabling one to grasp the contents of the documents at a glance.

By identifying relevant words in clusters of recent New York Times article snippets, we demonstrate how our approach can reveal trending topics.

All tools discussed in this paper as well as code to replicate the experiments are available as an open source Python library.

1.1 Related work

Identifying relevant words in text documents was traditionally limited to the area of feature selection, where different approaches were used to discard ‘irrelevant’ features in an attempt to improve the classification performance by reducing noise as well as to save computational resources [5]. However, the primary objective here was not to identify words that best describe the documents belonging to certain clusters, but to identify features that are particularly uninformative in a classification task and can be disregarded. Other work focused on selecting keywords for individual documents, e.g. based on tf-idf variants [9] or by using classifiers [8, 16]. Yet, while these keywords might provide adequate summaries of single documents, they do not necessarily overlap with keywords found for other documents about the same topic, and it is therefore difficult to aggregate them to get an overview of the contents of a group of texts.

Current tools available for creating word clouds as a means of summarizing a (collection of) document(s) mostly rely on term frequencies (while ignoring stopwords), possibly combined with part-of-speech tagging and named entity recognition to identify words of interest [6, 11]. While an approach based on tf-idf features selects words occurring frequently in a group of documents, these words do not reliably distinguish the documents from texts belonging to other clusters [7].

In more recent work, relevant features were selected using layerwise relevance propagation (LRP) to trace a classifier’s decision back to the samples’ input features [3, 13]. This was successfully used to understand the classification decisions made by a convolutional neural network trained on a text categorization task and to subsequently determine relevant features for individual classes by aggregating the LRP scores computed on the test samples [1, 2]. While in classification settings LRP works well for identifying relevant words describing different classes of documents, this method is not suited to our case as we are dealing with unlabeled data.

2 Methods

To get a quick overview of a text dataset, we want to identify and visualize the ‘relevant words’ occurring in the collection of texts. We define relevant words as some characteristic features of the documents, which distinguish them from other documents. As the first step in this process, the texts therefore have to be preprocessed and transformed into feature vectors (Section 2.1). While relevant words are supposed to occur often in the documents of interest, they should also distinguish them from other documents. When analyzing a whole dataset it is therefore most revealing to look at individual clusters and obtain the relevant words for each cluster, i.e. find the features that distinguish one cluster (i.e. topic) from another. To cluster the documents in a dataset, we use the DBSCAN algorithm (Section 2.2).

The relevant words for a cluster $c$ are identified by computing a relevancy score $r_c(t_i)$ for every word $t_i$ (with $i \in \{1, \dots, T\}$, where $T$ is the number of unique terms in the given vocabulary); the word clouds are then created from the top-ranking words. The easiest way to compute relevancy scores is to simply check for frequent features in a selection of documents. However, this does not necessarily produce features that additionally occur infrequently in other clusters. Therefore, we instead compute a score for each word indicating in how many documents of one cluster it occurs compared to other clusters (Section 2.3).

2.1 Preprocessing & Feature extraction

All texts in a dataset are preprocessed by lowercasing and removing non-alphanumeric characters. Then each text is transformed into a bag-of-words (BOW) feature vector by first computing a normalized count, the term frequency (tf), for each word in the text, and then weighting this by the word’s inverse document frequency (idf) to reduce the influence of very frequent but inexpressive words that occur in almost all documents (such as ‘and’ and ‘the’) [10, 15]. The idf of a term $t_i$ is calculated as the logarithm of the total number of documents, $N$, divided by the number of documents $d_k$ which contain term $t_i$, i.e.

$$\mathrm{idf}(t_i) = \log \frac{N}{|\{d_k : t_i \in d_k\}|}.$$

The entry corresponding to the term $t_i$ in the tf-idf feature vector $x_k$ of a document $d_k$ is then

$$x_{ki} = \mathrm{tf}_k(t_i) \cdot \mathrm{idf}(t_i).$$

In addition to single terms, we also consider meaningful combinations of two words (i.e. bigrams) as features. However, to avoid inflating the feature space too much (since relevancy scores later have to be computed for every feature), only distinctive bigrams are selected. This is achieved by computing a bigram score for every combination of two words occurring in the corpus, similarly to [12], and then selecting those with a score significantly higher than that of random word combinations. Further details can be found in the appendix of [7].
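The preprocessing and tf-idf weighting described in this section can be illustrated with a minimal, standard-library-only sketch. It is not the code of the accompanying library: the function names `preprocess` and `tfidf_vectors` are hypothetical, the term frequency is normalized by document length (one possible normalization, assumed here), and bigrams are left out.

```python
import math
import re
from collections import Counter


def preprocess(text):
    """Lowercase the text and split it on runs of non-alphanumeric characters."""
    # Note: this simple pattern also drops non-ASCII letters.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def tfidf_vectors(texts):
    """Return one sparse {term: tf-idf weight} dict per text."""
    docs = [preprocess(t) for t in texts]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        # Assumption: tf is the raw count divided by the document length.
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors


texts = [
    "The inauguration of the president",
    "The Australian Open and the final",
    "Protest marches after the inauguration",
]
vecs = tfidf_vectors(texts)
# 'the' occurs in every document, so its idf (and tf-idf weight) is zero.
print(vecs[0]["the"], vecs[0]["president"])
```

Words that occur in all documents receive a weight of zero, which is the intended effect of the idf factor on inexpressive words.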

2.2 Clustering

To identify relevant words summarizing the different topics in the dataset, the texts first have to be clustered. For this, we use density-based spatial clustering of applications with noise (DBSCAN) [4], a clustering algorithm that identifies clusters as areas of high density in the feature space, separated by areas of low density. This algorithm was chosen as it does not assume that the clusters have a certain shape (unlike e.g. the k-means algorithm, which assumes spherical clusters) and it allows for noise in the dataset, i.e. does not enforce that all samples belong to a certain cluster.

DBSCAN is based on pairwise distances between samples. It first identifies ‘core samples’ in areas of high density and then iteratively expands a cluster by adding other samples whose distance to them is below a user-defined threshold. As the cosine similarity is a reliable measure of similarity for text documents, we compute the pairwise distances used in the DBSCAN algorithm by first reducing the documents’ tf-idf feature vectors to 250 linear kernel PCA components to remove noise and create more overlap between the feature vectors [14], and then computing the cosine similarity between these vectors and subtracting it from 1 to transform it into a distance measure. As clustering is an unsupervised process, a value for the distance threshold has to be chosen such that the obtained clusters seem reasonable. In the experiments described below, we found that enforcing a suitably chosen minimum cosine similarity to other samples in the cluster (i.e. using a distance threshold of one minus this similarity) leads to texts about the same topic being grouped together.
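The distance computation and the cluster expansion can be sketched as follows. This is a simplified, standard-library-only illustration rather than the implementation used for the experiments: the kernel PCA step is omitted, `cosine_distance` and `dbscan` are hypothetical helper names, and the threshold `eps=0.5` and `min_samples=3` are illustrative values only.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat empty vectors as maximally dissimilar
    return 1.0 - dot / (norm_u * norm_v)


def dbscan(dist, eps, min_samples):
    """Minimal DBSCAN on a precomputed distance matrix; label -1 means noise."""
    n = len(dist)
    # Each sample counts as its own neighbor (distance 0).
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise, may become a border sample
            continue
        cluster += 1  # i is a core sample: start a new cluster
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border sample joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core sample: keep expanding
    return labels


docs = [
    {"trump": 1.0, "inauguration": 1.0},
    {"trump": 1.0, "inauguration": 0.8, "washington": 0.2},
    {"inauguration": 1.0, "trump": 0.9},
    {"avalanche": 1.0, "italy": 1.0},
    {"avalanche": 0.9, "italy": 1.0, "hotel": 0.3},
    {"avalanche": 1.0, "hotel": 0.2, "italy": 0.8},
    {"tennis": 1.0},
]
dist = [[cosine_distance(u, v) for v in docs] for u in docs]
labels = dbscan(dist, eps=0.5, min_samples=3)
print(labels)
```

In this toy example the first three and the next three vectors form two clusters, while the last vector shares no terms with any other and is labeled as noise.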

We denote as $y_k$ the cluster that document $d_k$ was assigned to in the clustering procedure.

2.3 Identifying relevant words

Relevant words for each cluster are identified by computing a relevancy score $r_c(t_i)$ for every word $t_i$ and then selecting the highest scoring words.

We compute a score for each word depending on the number of documents it occurs in from one cluster compared to the documents from other clusters. We call the fraction of documents from a target cluster $c$ that contain the word $t_i$ this word’s true positive rate (TPR):

$$\mathrm{TPR}_c(t_i) = \frac{|\{k : y_k = c \wedge t_i \in d_k\}|}{|\{k : y_k = c\}|}.$$

Correspondingly, we can compute a word’s false positive rate (FPR) as the mean plus the standard deviation of the TPRs of this word for all other clusters:

$$\mathrm{FPR}_c(t_i) = \mathrm{mean}(\{\mathrm{TPR}_l(t_i) : l \neq c\}) + \mathrm{std}(\{\mathrm{TPR}_l(t_i) : l \neq c\}).$$

(We are not taking the maximum of the other clusters’ TPRs for this word to avoid a large influence of a cluster with maybe only a few samples.)

The objective is to find words that occur in many documents from the target cluster (i.e. have a large $\mathrm{TPR}_c(t_i)$), but only occur in few documents of other clusters (i.e. have a low $\mathrm{FPR}_c(t_i)$). One way to identify such words would be to compute the difference between both rates, i.e.

$$r_c^{\mathrm{diff}}(t_i) = \max\{\mathrm{TPR}_c(t_i) - \mathrm{FPR}_c(t_i),\, 0\},$$

which is similar to traditional feature selection approaches [5]. However, while this score yields words that occur more often in the target cluster than in other clusters, it does not take into account the relative differences. For example, to be able to detect emerging topics in newspaper articles, we are not necessarily interested in words that occur often in today’s articles and infrequently in yesterday’s. Instead, we acknowledge that most articles published today will not be about some new event; there will only be significantly more such articles than yesterday. Therefore, we propose instead a rate quotient, which gives a score of 1 to every word that has a TPR about three times higher than its FPR:

$$r_c^{\mathrm{quot}}(t_i) = \frac{\min\{\max\{z_c(t_i),\, 1\},\, 4\} - 1}{3} \quad \text{with} \quad z_c(t_i) = \frac{\mathrm{TPR}_c(t_i)}{\max\{\mathrm{FPR}_c(t_i),\, \epsilon\}},$$

where $\epsilon$ is a small constant that avoids a division by zero.

While the rate quotient extracts relevant words that would otherwise go unnoticed, for a given FPR it assigns the same score to every word whose TPR exceeds this multiple of the FPR, no matter by how much. Therefore, to create a proper ranking amongst all relevant words, we take the mean of both scores to compute the final score,

$$r_c(t_i) = \frac{1}{2}\left(r_c^{\mathrm{diff}}(t_i) + r_c^{\mathrm{quot}}(t_i)\right),$$

which results in the TPR-FPR relation shown in Fig. 1.

Figure 1: Relevancy score depending on a word’s TPR and FPR for a cluster.
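The scoring scheme of this section can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' implementation: the function name `relevancy_scores` is hypothetical, documents are represented as plain sets of words, the population standard deviation is used, and the clipping bounds of the rate quotient (1 and 4) and the small constant `eps` are chosen so that the quotient score is 0 when the TPR does not exceed the FPR and saturates at 1 when the TPR is four times the FPR (i.e. three times higher).

```python
from statistics import mean, pstdev


def relevancy_scores(docs, labels, target, eps=1e-3):
    """Score every word for cluster `target`; needs at least two clusters."""
    clusters = sorted(set(labels) - {-1})  # -1 marks noise samples
    vocab = set().union(*docs)

    def tpr(word, cluster):
        # Fraction of the cluster's documents that contain the word.
        members = [d for d, l in zip(docs, labels) if l == cluster]
        return sum(word in d for d in members) / len(members)

    scores = {}
    for word in vocab:
        tpr_c = tpr(word, target)
        others = [tpr(word, c) for c in clusters if c != target]
        # Assumption: population standard deviation of the other TPRs.
        fpr_c = mean(others) + pstdev(others)
        r_diff = max(tpr_c - fpr_c, 0.0)
        z = tpr_c / max(fpr_c, eps)
        # Assumed clipping: 0 for z <= 1, saturating at 1 for z >= 4.
        r_quot = (min(max(z, 1.0), 4.0) - 1.0) / 3.0
        scores[word] = 0.5 * (r_diff + r_quot)
    return scores


docs = [
    {"inauguration", "trump", "the"},
    {"inauguration", "trump", "the"},
    {"inauguration", "the", "crowd"},
    {"avalanche", "italy", "the"},
    {"avalanche", "hotel", "the"},
    {"avalanche", "italy", "the", "trump"},
]
labels = [0, 0, 0, 1, 1, 1]
scores = relevancy_scores(docs, labels, target=0)
top_words = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_words)
```

In this toy example 'the' occurs in every document of both clusters and therefore scores zero, whereas 'inauguration' occurs only in the target cluster and receives the highest score.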

3 Experiments & Results

To illustrate how the identified relevant words can help when exploring new datasets, we test the previously described methods on recent article snippets from the New York Times. The code to replicate the experiments is available online and includes functions to cluster documents, extract relevant words and visualize them in word clouds, as well as highlight relevant words in individual documents.

To see if our approach can be used to discover trending topics, we use newspaper article snippets from the week of President Trump’s inauguration (January 2017), as well as from the three weeks prior (including the last week of 2016), downloaded with the New York Times Archive API.

Figure 2: Relevant words in NY Times article snippets during the week of President Trump’s inauguration (green/up) and three weeks prior (red/down).
Figure 3: Frequencies of selected words in NY Times article snippets from different days.
Figure 4: Word clouds created from the relevant words identified for two of over 50 clusters during the inauguration week in January 2017 and corresponding headlines.

Even before clustering the texts, if we simply split them manually into the articles published during the week of the inauguration and those published before, the identified relevant words already reveal clear trends (Fig. 2). Unsurprisingly, the main focus that week was on the inauguration itself, but it also becomes apparent that it would be followed by protest marches and that the Australian Open was taking place at that time. When looking at the occurrence frequencies of different words over time (Fig. 3), we can see the spike of ‘Trump’ on the day of his inauguration. However, while some stopwords occur equally frequently on all days, other rather meaningless words such as ‘Tuesday’ have clear spikes as well (on Tuesdays). Therefore, care has to be taken when contrasting articles from different times to compute relevant words, as such meaningless words can easily be picked up simply because, for example, one month contains more Tuesdays than another month used for comparison.

To identify trending topics, the articles from the week of the inauguration were clustered using DBSCAN. When enforcing the minimum cosine similarity to other samples of a cluster described in Section 2.2 as well as at least three articles per cluster, we obtain over 50 clusters for this week (as well as several articles considered ‘noise’). While some clusters correspond to specific sections of the newspaper (e.g. corrections to articles published the days before), others indeed refer to meaningful events that happened that week, e.g. the nomination of Betsy DeVos or an avalanche in Italy (Fig. 4).

4 Conclusion

Examining the relevant words that summarize different groups of documents in a dataset is a very helpful step in the exploratory analysis of a collection of texts. It allows one to quickly grasp the contents of documents belonging to certain clusters and can help identify salient topics, which is important when faced with a large dataset in which documents of interest need to be found quickly.

We have explained how to compute a relevancy score for individual words depending on the number of documents in the target cluster this word occurs in compared to other clusters. This method is very fast and robust with respect to small or varying numbers of samples per cluster. The usefulness of our approach was demonstrated by using the obtained word clouds to identify trending topics in recent New York Times article snippets.

We hope the provided code will encourage other people faced with large collections of texts to quickly dive into the analysis and to thoroughly explore new datasets.


We would like to thank Christoph Hartmann for his helpful comments on an earlier version of this manuscript. Franziska Horn acknowledges funding from the Elsa-Neumann scholarship from the TU Berlin.


  • Arras et al. [2016a] Leila Arras, Franziska Horn, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. Explaining Predictions of Non-Linear Classifiers in NLP. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 1–7. Association for Computational Linguistics, 2016a.
  • Arras et al. [2016b] Leila Arras, Franziska Horn, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. "What is relevant in a text document?": An interpretable machine learning approach. arXiv preprint arXiv:1612.07843, 2016b.
  • Bach et al. [2015] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7):e0130140, 2015.
  • Ester et al. [1996] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.
  • Forman [2003] George Forman. An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, 3:1289–1305, 2003.
  • Heimerl et al. [2014] Florian Heimerl, Steffen Lohmann, Simon Lange, and Thomas Ertl. Word cloud explorer: Text analytics based on word clouds. In System Sciences (HICSS), 2014 47th Hawaii International Conference on, pages 1833–1842. IEEE, 2014.
  • Horn et al. [2017] Franziska Horn, Leila Arras, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. Exploring text datasets by visualizing relevant words. arXiv preprint arXiv:1707.05261, 2017.
  • Hulth [2003] Anette Hulth. Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of the 2003 conference on Empirical methods in natural language processing, pages 216–223. Association for Computational Linguistics, 2003.
  • Lee and Kim [2008] Sungjick Lee and Han-joon Kim. News keyword extraction for topic tracking. In Networked Computing and Advanced Information Management, 2008. NCM’08. Fourth International Conference on, volume 2, pages 554–559. IEEE, 2008.
  • Manning et al. [2008] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008. ISBN 0521865719, 9780521865715.
  • McNaught and Lam [2010] Carmel McNaught and Paul Lam. Using wordle as a supplementary research tool. The qualitative report, 15(3):630, 2010.
  • Mikolov et al. [2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
  • Montavon et al. [2017] Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Methods for interpreting and understanding deep neural networks. arXiv preprint arXiv:1706.07979, 2017.
  • Schölkopf et al. [1998] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural computation, 10(5):1299–1319, 1998.
  • Yang and Pedersen [1997] Yiming Yang and Jan O. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, ICML ’97, pages 412–420, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc. ISBN 1-55860-486-3.
  • Zhang et al. [2006] Kuo Zhang, Hui Xu, Jie Tang, and Juanzi Li. Keyword Extraction Using Support Vector Machine, pages 85–96. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006.