Topic Discovery in Massive Text Corpora Based on Min-Hashing

07/03/2018, by Gibran Fuentes-Pineda et al.

The task of discovering topics in text corpora has been dominated by Latent Dirichlet Allocation and other topic models for over a decade. In order to apply these approaches to massive text corpora, the vocabulary needs to be reduced considerably and large computer clusters and/or GPUs are typically required. Moreover, the number of topics must be provided beforehand, but this number depends on the corpus characteristics and is often difficult to estimate, especially for massive text corpora. Unfortunately, both topic quality and time complexity are sensitive to this choice. This paper describes an alternative approach to discovering topics, based on Min-Hashing, which can handle massive text corpora and large vocabularies using modest computer hardware and does not require the number of topics to be fixed in advance. The basic idea is to generate multiple random partitions of the corpus vocabulary to find sets of highly co-occurring words, which are then clustered to produce the final topics. In contrast to probabilistic topic models, where topics are distributions over the complete vocabulary, the topics discovered by the proposed approach are sets of highly co-occurring words. Interestingly, these topics underlie various thematics with different levels of granularity. An extensive qualitative and quantitative evaluation using the 20 Newsgroups (18K), Reuters (800K), Spanish Wikipedia (1M), and English Wikipedia (5M) corpora shows that the proposed approach consistently discovers meaningful and coherent topics. Remarkably, the time complexity of the proposed approach is linear with respect to corpus and vocabulary size; a non-parallel implementation was able to discover topics from the entire English edition of Wikipedia, with over 5 million documents and 1 million words, in less than 7 hours.


1 Introduction

In Natural Language Processing and Information Retrieval, topics are hidden semantic structures that capture the thematics of a collection of text documents. The automatic discovery of these structures from the vector space model has been a challenging and widely studied problem for several decades. This problem has become more important with the advent of the World Wide Web and the proliferation of large-scale text corpora, as topics can provide valuable insights into the content of the documents and serve as a meaningful way to organize and browse such massive amounts of data. Moreover, topics have been found useful for several applications such as hashtag recommendation [26], online community detection [48], recommender systems [14, 8], depression detection [35], link prediction [34], and crime prediction [42], among others.

Many different approaches for discovering topics have been proposed in the past few decades, including Latent Semantic Analysis (LSA) [13], Probabilistic Latent Semantic Analysis [20], and both directed (e.g. Latent Dirichlet Allocation [3] and the Correlated Topic Model [2]) and undirected topic models (e.g. Boltzmann Machines [37, 32] and Neural Autoregressive Distribution Estimators [23]). Among the most successful approaches to topic discovery has been Latent Dirichlet Allocation (LDA) [3], a directed graphical model with latent topic variables, where topics are distributions over the complete vocabulary and documents are in turn distributions over topics. Exact inference in LDA is intractable, so approximate inference based on Markov Chain Monte Carlo (MCMC) sampling methods (e.g. Gibbs sampling) is typically performed. However, MCMC sampling does not scale well with corpus and/or vocabulary size, and in recent years much research has been devoted to devising more scalable inference methods for LDA, including parallel and distributed versions of the MCMC-based sampling process (e.g. AliasLDA [25], LightLDA [45], SaberLDA [27], NomadLDA [44], POBP [43], LDA* [46], HarpLDA [47] and WarpLDA [7]) and variational Bayesian formulations (e.g. Online LDA [18, 19], HSVG [31] and SCVB0 [15]). Moreover, LDA and most topic models require the number of topics to be provided beforehand, but this number depends on the corpus characteristics and is often difficult to estimate, especially for massive text corpora. Unfortunately, both the quality of the discovered topics and the time complexity are sensitive to this number.

In this paper, we describe a different approach to topic discovery, called Sampled Min-Hashing (SMH), which builds upon previous work on object discovery from large-scale image collections [16]. The basic idea is to generate multiple random partitions of the corpus vocabulary by applying Min-Hashing to the word occurrence space spanned by inverted file lists in order to find sets of highly co-occurring words, which are then clustered to produce the final topics. As opposed to LDA and other topic models, where topics are distributions over the complete vocabulary, SMH topics are sets of highly co-occurring words. Moreover, SMH does not require the number of topics to be provided beforehand. We show that SMH can consistently discover meaningful topics from various corpora, scaling well to corpora with large numbers of documents and large vocabularies. Interestingly, the topics discovered by SMH range from general (i.e. present in a large portion of the corpus) to more specific (i.e. present in a smaller portion of the corpus). We present an extensive evaluation and analysis of the impact of SMH parameters and the vocabulary size on the coherence of the discovered topics, based on the methodology proposed by Lau et al. [24].

The remainder of the paper is organized as follows. In Sect. 2, we review some related work on Min-Hashing and beyond-pairwise relationship mining. Section 3 describes the original Min-Hashing scheme for pairwise similarity. SMH is presented in detail in Sect. 4. The experimental evaluation of the coherence of the discovered topics and the scalability of the approach are reported in Sect. 5. Finally, Sect. 6 concludes with some remarks and future work.

2 Related Work

Locality Sensitive Hashing (LSH) is a randomized algorithm for performing approximate similarity search in high-dimensional spaces. The general idea of LSH is to define a suitable family of similarity-preserving hash functions for randomly projecting the high-dimensional space onto a lower-dimensional subspace such that the distances between items are approximately preserved. Originally, LSH was proposed for efficient pairwise similarity search on large-scale datasets. However, it has also been used to compute a fast proposal distribution when sampling mixtures of exponential families [1], to efficiently find high-confidence association rules without support pruning [10], to retrieve inner products in collaborative filtering [40], and to accelerate deep neural networks [41]. In general, LSH has allowed for greater scalability in different applications.

Multiple LSH schemes have been proposed for different metric spaces such as the Hamming distance [21], the Euclidean distance [12], and the Jaccard similarity [4, 11]. In particular, Min-Hashing [4, 11], an LSH scheme to perform similarity search for sets based on the Jaccard similarity, has been of special interest for document and image retrieval applications because documents and images are often represented as sets of words or visual words. However, the original Min-Hashing scheme assumes a set representation (i.e. presence or absence of words/visual words) of documents or images, which is not suitable for many applications where the frequency of occurrence is important [38, 5]. For this reason, extensions to the original Min-Hashing scheme have been proposed for bags with both integer and real-valued multiplicities (e.g. [9, 30, 22, 17]).

Although pairwise similarity search is a building block for several applications, some problems require searching for higher-order relationships (e.g. estimating multi-way associations among words in a corpus [28], clustering collinear points in high-dimensional spaces [16], or modeling 3D objects for retrieval and recognition [49]). However, the complexity of finding higher-order relationships increases exponentially with the order of the relationship and the total number of elements in the dataset. Interestingly, the space partitioning induced by Min-Hashing schemes not only approximately preserves pairwise similarities but also higher-order relationships based on the Jaccard Co-occurrence Coefficient, an extension of the Jaccard similarity for measuring beyond-pairwise relationships among sets [16, 39]. Shrivastava and Li [39] proposed a new bucketing scheme for Min-Hashing in order to perform $k$-way similarity searches, which was applied to finding sets of semantically similar words and to enhancing document retrieval with multiple queries.

Min-Hashing has also been exploited to mine visual word co-occurrences from a collection of images by applying it to the inverted file lists instead of the bag-of-words representation of images [9, 16]. In particular, Sampled Min-Hashing (SMH) [16] can discover objects by treating each space partitioning induced by Min-Hashing as a sample of high beyond-pairwise co-occurrences and by clustering overlapping partition cells, which are composed of visual word sets that frequently co-occur in the collection, to form complete objects. Here, we hypothesize that words frequently occurring together in the same document in a given corpus likely belong to the same topic and that we can therefore discover topics by applying SMH to the inverted file lists of the corpus, which represent word occurrences. We generalize SMH to take word frequencies into account, which have been shown to be relevant in Natural Language Processing and Information Retrieval tasks.

3 Min-Hashing for Similarity Search

Min-Hashing is an LSH scheme in which hash functions are defined with the property that the probability of any pair of sets $A$ and $B$ having the same MinHash value is equal to their Jaccard similarity, i.e.,

$P[h(A) = h(B)] = \dfrac{|A \cap B|}{|A \cup B|} = J(A, B)$   (1)

A MinHash function $h$ can be implemented as follows. First, a random permutation $\pi$ of all the elements of the universal set is generated. Then, the first element of the sequence induced by $\pi$ on each set $A$ is assigned as its MinHash value, that is to say $h(A) = \min\{\pi(A)\}$. Since similar sets share many elements, they have a high probability of taking the same MinHash value, whereas dissimilar sets have a low probability. Usually, $m$ different MinHash values are computed for each set from $m$ different hash functions using independent random permutations. It has been shown that the proportion of identical MinHash values between two sets over the $m$ independent MinHash functions is an unbiased estimator of their Jaccard similarity [4].
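To make the estimator concrete, the following is a minimal Python sketch (not the authors' implementation) of this procedure; it simulates random permutations with universal hashing over token hashes, and the MD5-based token hashing and the number of hash functions are illustrative choices.

```python
import hashlib
import random

def minhash_signature(s, num_hashes=128, seed=0):
    """Compute a MinHash signature for a set of string tokens.

    Each of the num_hashes functions is h_i(x) = (a_i * x + b_i) mod p applied
    to a fixed integer hash of each token; the signature keeps the minimum per function.
    """
    rng = random.Random(seed)
    p = (1 << 61) - 1  # a large Mersenne prime
    params = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(num_hashes)]
    signature = []
    for a, b in params:
        min_val = min((a * int(hashlib.md5(tok.encode()).hexdigest(), 16) + b) % p
                      for tok in s)
        signature.append(min_val)
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Fraction of identical MinHash values: an unbiased Jaccard estimator."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

A = {"topic", "model", "word", "corpus"}
B = {"topic", "model", "word", "document"}
sa, sb = minhash_signature(A), minhash_signature(B)
print(estimate_jaccard(sa, sb))  # should be close to |A ∩ B| / |A ∪ B| = 3/5
```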

The original Min-Hashing scheme has been extended to perform similarity search on integer and real-valued bags [6, 9, 30, 22, 17], generalizing the Jaccard similarity to

$J(A, B) = \dfrac{\sum_i \min(A_i, B_i)}{\sum_i \max(A_i, B_i)}$   (2)

where $A_i$ and $B_i$ are the integer or real-valued multiplicities of the $i$-th element in the bags $A$ and $B$ respectively. (Note that Eq. 2 reduces to Eq. 1 if all multiplicities in bags $A$ and $B$ are either 0 or 1, i.e. $A$ and $B$ represent sets, since $\sum_i \min(A_i, B_i)$ counts the number of common elements and $\sum_i \max(A_i, B_i)$ counts the number of elements appearing in either bag.) In particular, Chum et al. [9] proposed a simple strategy for bags with integer-valued multiplicities where each bag $A$ is converted to a set by replacing the multiplicity $A_i$ of the $i$-th element with $A_i$ distinct new elements. In this way, an extended universal set is created from the new elements of all bags, where the number of new elements generated for each original element is given by its maximum multiplicity over all bags. Thus, applying the original Min-Hashing scheme described above to the converted bags adheres to the property that the collision probability equals the generalized Jaccard similarity of Eq. 2. In general, it has been established that in order for a hash function $h$ to have the property that $P[h(A) = h(B)] = J(A, B)$, it must be an instance of Consistent Sampling [30] (see Definition 3.1).
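As a rough illustration of the conversion strategy of Chum et al. [9] described above, the following sketch (our own toy example, not code from [9]) expands each element into as many distinct copies as its multiplicity and verifies that the plain Jaccard similarity of the expanded sets equals the generalized Jaccard similarity of Eq. 2.

```python
from collections import Counter

def bag_to_set(bag):
    """Convert an integer-multiplicity bag into a plain set by replacing an
    element e with multiplicity m by the m distinct elements (e, 1), ..., (e, m)."""
    expanded = set()
    for element, multiplicity in bag.items():
        for k in range(1, multiplicity + 1):
            expanded.add((element, k))
    return expanded

A = Counter({"cat": 3, "dog": 1})
B = Counter({"cat": 1, "dog": 2})
SA, SB = bag_to_set(A), bag_to_set(B)
# |SA ∩ SB| / |SA ∪ SB| = (min-sum) / (max-sum) = (1 + 1) / (3 + 2) = 0.4
print(len(SA & SB) / len(SA | SB))
```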

Definition 3.1 (Consistent Sampling [30]).

Given a bag $A$ with multiplicities $A_i$, consistent sampling generates a sample $(i, z)$, with $0 < z \le A_i$, that has the following two properties.

  1. Uniformity: Each sample should be drawn uniformly at random from $\{(i, z) : i \in A,\ 0 < z \le A_i\}$, where $A_i$ is the multiplicity of the $i$-th element in $A$. In other words, the probability of drawing $i$ as a sample of $A$ is proportional to its multiplicity $A_i$, and $z$ is uniformly distributed.

  2. Consistency: If $B_i \le A_i$ for all $i$, then any sample $(i, z)$ drawn from $A$ that satisfies $z \le B_i$ will also be a sample from $B$.

Once the MinHash values for each bag have been computed, $l$ tuples of $r$ different MinHash values are defined as follows

$g_j(A) = \left( h_{(j-1) \cdot r + 1}(A),\ h_{(j-1) \cdot r + 2}(A),\ \dots,\ h_{j \cdot r}(A) \right), \quad j = 1, \dots, l,$

where $h_i(A)$ is the $i$-th MinHash value of bag $A$. Thus, $l$ different hash tables are constructed and each bag $A$ is stored in the bucket corresponding to $g_j(A)$ in the $j$-th hash table. Two bags $A$ and $B$ are stored in the same hash bucket of the $j$-th hash table iff $g_j(A) = g_j(B)$, i.e. all $r$ MinHash values of the tuple are the same for both bags. Since similar bags are expected to share several MinHash values, there is a high probability that they will have an identical tuple. In contrast, dissimilar bags will seldom have the same MinHash value, and therefore the probability that they will have an identical tuple will be low. More precisely, the probability that two bags $A$ and $B$ share the $r$ different MinHash values of a given tuple is

$P[g_j(A) = g_j(B)] = J(A, B)^r$   (3)

Consequently, the probability that two bags $A$ and $B$ have at least one identical tuple over the $l$ hash tables is

$P_{\mathrm{collision}}[A, B] = 1 - \left( 1 - J(A, B)^r \right)^l$   (4)

To search for bags similar to a given query bag $Q$, first the $l$ different tuples $g_1(Q), \dots, g_l(Q)$ are computed. Then, the corresponding buckets in the $l$ hash tables are inspected and all bags stored in them are retrieved. Finally, the retrieved bags are sorted in descending order of their Jaccard similarity with the query bag $Q$; typically, retrieved bags with a similarity lower than a given threshold are discarded.
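The following is a minimal sketch of the bucketing and query procedure just described, assuming explicit random permutations over a small integer universe; the toy sets and parameter values ($r = 3$, $l = 20$) are illustrative only.

```python
import random
from collections import defaultdict

def make_minhash_funcs(num_funcs, universe_size, seed=0):
    """One random permutation per MinHash function over item ids 0..universe_size-1."""
    rng = random.Random(seed)
    funcs = []
    for _ in range(num_funcs):
        perm = list(range(universe_size))
        rng.shuffle(perm)
        funcs.append(lambda s, p=perm: min(p[i] for i in s))
    return funcs

def build_tables(sets, hash_funcs, r, l):
    """Store every set in l hash tables, keyed by its r-MinHash tuple g_j."""
    tables = [defaultdict(list) for _ in range(l)]
    for name, s in sets.items():
        for j in range(l):
            key = tuple(h(s) for h in hash_funcs[j * r:(j + 1) * r])
            tables[j][key].append(name)
    return tables

def query(q, hash_funcs, tables, r):
    """Return the names of stored sets that share at least one tuple with q."""
    candidates = set()
    for j, table in enumerate(tables):
        key = tuple(h(q) for h in hash_funcs[j * r:(j + 1) * r])
        candidates.update(table.get(key, []))
    return candidates

r, l = 3, 20
funcs = make_minhash_funcs(r * l, universe_size=1000)
sets = {"A": {1, 2, 3, 4, 5}, "B": {1, 2, 3, 4, 6}, "C": {100, 200, 300}}
tables = build_tables(sets, funcs, r, l)
print(query({1, 2, 3, 4, 7}, funcs, tables, r))  # likely {'A', 'B'}, very unlikely 'C'
```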

4 Sampled Min-Hashing for Topic Discovery

4.1 Min-Hashing for Mining Beyond-Pairwise Relationships

In order to measure beyond-pairwise relationships between multiple sets, the Jaccard similarity in Eq. 1 can be generalized to a Jaccard Co-occurrence Coefficient ($JCC$) for $k$ sets as follows

$JCC(S_1, \dots, S_k) = \dfrac{|S_1 \cap S_2 \cap \cdots \cap S_k|}{|S_1 \cup S_2 \cup \cdots \cup S_k|}$   (5)

where the numerator is the number of elements that all $k$ sets have in common, and the denominator corresponds to the number of elements that appear at least once in the $k$ sets $S_1, \dots, S_k$.

The property that a hash function $h$ defined by a Min-Hashing scheme adheres to Eq. 1 can be directly extended to $k$ sets [16, 39], i.e.

$P[h(S_1) = h(S_2) = \cdots = h(S_k)] = JCC(S_1, \dots, S_k)$   (6)

More generally, we can define a Jaccard Co-occurrence Coefficient for $k$ bags as

$JCC(B_1, \dots, B_k) = \dfrac{\sum_i \min(B_{1i}, \dots, B_{ki})}{\sum_i \max(B_{1i}, \dots, B_{ki})}$   (7)

where $B_{1i}, \dots, B_{ki}$ are the multiplicities of the $i$-th element in bags $B_1, \dots, B_k$ respectively. From Definition 3.1, it follows that

$P[h(B_1) = h(B_2) = \cdots = h(B_k)] = JCC(B_1, \dots, B_k)$   (8)

for any hash function $h$ generated with consistent sampling. Eq. 8 holds because the $k$ bags will have an identical MinHash value every time the sample drawn from the maximum multiplicity $\max(B_{1i}, \dots, B_{ki})$ falls at or below the minimum multiplicity $\min(B_{1i}, \dots, B_{ki})$, which happens a fraction $\frac{\sum_i \min(B_{1i}, \dots, B_{ki})}{\sum_i \max(B_{1i}, \dots, B_{ki})}$ of the time given that all samples are drawn uniformly at random from the multiplicities of each bag.
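A direct way to compute the Jaccard Co-occurrence Coefficient of Eq. 7 for a handful of bags is sketched below; the toy inverted-file bags (document id mapped to term frequency) are hypothetical.

```python
def jcc_bags(*bags):
    """Jaccard Co-occurrence Coefficient for k bags (Eq. 7):
    sum of element-wise minima over sum of element-wise maxima."""
    elements = set().union(*bags)
    num = sum(min(bag.get(e, 0) for bag in bags) for e in elements)
    den = sum(max(bag.get(e, 0) for bag in bags) for e in elements)
    return num / den if den else 0.0

# Three toy inverted file bags: {document id: term frequency}
W1 = {0: 2, 1: 3, 2: 1}
W2 = {0: 1, 1: 2, 3: 4}
W3 = {0: 2, 1: 1, 2: 2}
print(jcc_bags(W1, W2, W3))  # (1 + 1 + 0 + 0) / (2 + 3 + 2 + 4) = 2/11
```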

As in Min-Hashing for pairwise similarity search, $l$ tuples of $r$ MinHash values are computed, and the probability that $k$ bags have an identical tuple $g_j$ is given by

$P[g_j(B_1) = \cdots = g_j(B_k)] = JCC(B_1, \dots, B_k)^r$   (9)


Figure 1: Probability of $k$ bags having an identical tuple as a function of their $JCC$ for different tuple sizes ($r$).

Figure 1 shows the probability of $k$ bags having an identical tuple as a function of their $JCC$ for different tuple sizes $r$. As can be observed, the probability increases with larger $JCC$ values, while it decreases exponentially for larger tuple sizes $r$. Larger tuple sizes reduce the probability that bags with small $JCC$ values have an identical tuple, but at the cost of also reducing the probability that bags with larger $JCC$ values have an identical tuple. However, we can increase the latter probability by increasing the number of tuples $l$. Specifically, the probability that $k$ bags have at least one identical tuple over the $l$ different tuples is

$P_{\mathrm{collision}}[B_1, \dots, B_k] = 1 - \left( 1 - JCC(B_1, \dots, B_k)^r \right)^l$   (10)

Therefore, the choice of $r$ and $l$ becomes a trade-off between precision and recall.

As illustrated in Fig. 2, the probability that $k$ bags have at least one identical tuple approximates a co-occurrence filter such that

$P_{\mathrm{collision}}[B_1, \dots, B_k] \approx \begin{cases} 1 & \text{if } JCC(B_1, \dots, B_k) \ge s^* \\ 0 & \text{if } JCC(B_1, \dots, B_k) < s^* \end{cases}$

where $s^*$ is a threshold parameter of the filter defined by the user.

Figure 2: Collision probability of $k$ bags as a function of their $JCC$ for different co-occurrence thresholds ($s^*$) and tuple sizes ($r$).

Given the threshold $s^*$ and the tuple size $r$, we can obtain the number of tuples $l$ by setting $P_{\mathrm{collision}}$ in Eq. 10 to 0.5 at $JCC = s^*$ and solving for $l$, which gives

$l = \dfrac{\log(0.5)}{\log(1 - s^{*r})}$   (11)

Note that the number of tuples $l$ increases exponentially as the tuple size $r$ increases and/or the threshold $s^*$ decreases.
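Eq. 11 can be turned into a small helper that, given $s^*$ and $r$, returns the number of tuples to compute; rounding up to the next integer is our assumption, since the rounding convention is not stated here.

```python
import math

def num_tuples(s_star, r):
    """Number of tuples l so that bags with JCC >= s_star collide in at
    least one hash table with probability >= 0.5 (Eq. 11), rounded up."""
    return math.ceil(math.log(0.5) / math.log(1.0 - s_star ** r))

for s_star in (0.04, 0.06, 0.08, 0.10):
    print(s_star, num_tuples(s_star, r=2))
# e.g. s* = 0.10 and r = 2 give l = ceil(log 0.5 / log 0.99), about 69 tuples
```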

4.2 Topic Discovery

Finding word co-occurrences has been a recurrent task in Natural Language Processing for several decades because they underlie different linguistic phenomena such as semantic relationships and lexico-syntactic constraints. Here we hypothesize that highly co-occurring words likely belong to the same topic, and we propose to mine those words by applying Min-Hashing to the occurrence pattern of each word in a given corpus. To achieve this, we represent each document in the corpus by a bag-of-words and the occurrence pattern of each word in the vocabulary by its corresponding inverted file bag, whose elements are document IDs and whose multiplicities are the frequencies with which the word occurs in each document.

After computing the $l$ tuples and storing each inverted file bag in the corresponding hash tables, we extract each set of inverted file bags that have an identical tuple (i.e. that are stored in the same bucket of the same hash table) and that contains more than two bags, since we are considering beyond-pairwise co-occurrences. We call these sets co-occurring word sets (CWS) because they are composed of inverted file bags corresponding to words with high $JCC$ values. In this approach, the Min-Hashing threshold $s^*$ (see Eq. 10) controls the degree of co-occurrence of the words in each CWS: larger values of $s^*$ will produce CWS with higher $JCC$ values, whereas smaller values of $s^*$ will also produce CWS with smaller $JCC$ values. In order to reduce the space complexity, since the $l$ tuples are generated from independent hash functions, we can compute them one by one so that only one hash table (instead of $l$) is maintained in memory at any moment.
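The following sketch illustrates the sampling step just described on a toy corpus: it builds inverted file bags, min-hashes them with the integer-multiplicity expansion of [9], and keeps buckets with at least three words as CWS. It is a simplified re-implementation for illustration, not the authors' code; the pseudo-random hashing, the bucket-size cutoff of 3, and the toy documents are our assumptions.

```python
import random
from collections import defaultdict

def inverted_file_bags(corpus_bags):
    """One inverted file bag per word: {document id: term frequency}."""
    inv = defaultdict(dict)
    for doc_id, bag in enumerate(corpus_bags):
        for word, freq in bag.items():
            inv[word][doc_id] = freq
    return inv

def minhash_bag(bag, func_seed):
    """MinHash of an integer-multiplicity bag: the bag is expanded into the
    elements (doc, 1)...(doc, freq) as in [9], and each expanded element gets
    a pseudo-random value that depends only on the element and the function seed."""
    return min(random.Random(hash((func_seed, doc, k))).random()
               for doc, freq in bag.items()
               for k in range(1, freq + 1))

def sample_cws(inv, r, l, min_size=3, seed=0):
    """Process one hash table at a time: bucket words by their r-MinHash tuple
    and keep buckets with at least min_size words as co-occurring word sets."""
    cws = []
    for j in range(l):
        table = defaultdict(list)
        for word, bag in inv.items():
            key = tuple(minhash_bag(bag, (seed, j, i)) for i in range(r))
            table[key].append(word)
        cws.extend(frozenset(b) for b in table.values() if len(b) >= min_size)
    return cws

docs = [{"stock": 2, "market": 1, "index": 1},
        {"stock": 1, "market": 2, "share": 1},
        {"stock": 1, "index": 2, "market": 1},
        {"game": 2, "season": 1, "team": 1},
        {"team": 2, "game": 1, "season": 2}]
inv = inverted_file_bags(docs)
print(sample_cws(inv, r=2, l=50))  # likely includes word sets from both themes
```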

We name this approach Sampled Min-Hashing (SMH) because each hash table associated with a tuple generates CWS by sampling the word occurrence space spanned by the inverted file bags; that is, each hash table randomly partitions the vocabulary based on the word occurrences. In SMH, multiple random partitions are induced by the $l$ different hash tables, each of which generates several CWS. Representative and stable words belonging to the same topic are expected to be present in multiple CWS (i.e. to lie on overlapping inter-partition cells). Therefore, we cluster CWS that share many words in an agglomerative manner to form the final topics. We measure the proportion of words shared between two CWS $C_1$ and $C_2$ by their overlap coefficient, namely

$\mathrm{ovr}(C_1, C_2) = \dfrac{|C_1 \cap C_2|}{\min(|C_1|, |C_2|)}$

This agglomerative clustering can be formulated as finding the connected components of an undirected graph $G$ whose vertices are the CWS and whose edges connect every pair of CWS with an overlap coefficient greater than a threshold $\epsilon$, i.e. $\mathrm{ovr}(C_1, C_2) > \epsilon$. Each connected component of $G$ is a cluster composed of the CWS that form a topic. Given that the Jaccard similarity is a lower bound of the overlap coefficient, i.e. $J(C_1, C_2) \le \mathrm{ovr}(C_1, C_2)$, we can efficiently find these CWS pairs by using Min-Hashing for pairwise similarity search (see Sect. 3), thus avoiding the overhead of computing the overlap coefficient between all CWS pairs. An overview of the whole topic discovery process by SMH is shown in Fig. 3.
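A minimal sketch of the clustering stage follows. For clarity it finds overlapping CWS pairs by brute force and merges them with union-find, whereas the approach described above uses Min-Hashing to avoid the quadratic scan; the threshold value and toy CWS are illustrative.

```python
from collections import defaultdict

def overlap(c1, c2):
    """Overlap coefficient: |C1 ∩ C2| / min(|C1|, |C2|)."""
    return len(c1 & c2) / min(len(c1), len(c2))

def cluster_cws(cws_list, epsilon=0.7):
    """Merge CWS into topics: connected components of the graph whose edges
    join CWS pairs with overlap coefficient greater than epsilon."""
    parent = list(range(len(cws_list)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for i in range(len(cws_list)):
        for j in range(i + 1, len(cws_list)):
            if overlap(cws_list[i], cws_list[j]) > epsilon:
                union(i, j)

    clusters = defaultdict(list)
    for i in range(len(cws_list)):
        clusters[find(i)].append(cws_list[i])
    # Each topic is the union of the words of all CWS in its cluster
    return [frozenset().union(*group) for group in clusters.values()]

cws = [frozenset({"stock", "market", "index"}),
       frozenset({"market", "index", "share"}),
       frozenset({"game", "team", "season"})]
print(cluster_cws(cws, epsilon=0.6))
# two topics: {stock, market, index, share} and {game, team, season}
```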


Figure 3: Overview of topic discovery by Sampled Min-Hashing.

The agglomerative clustering merges chains of CWS with high overlap coefficients into the same topic. As a result, CWS associated with the same topic can belong to the same cluster even if they do not share words with one another, as long as they are members of the same chain. In general, the generated clusters have the property that for any CWS, there exists at least one other CWS in the same cluster with which it has an overlap coefficient greater than the given threshold $\epsilon$. Note that this is a connectivity-based clustering procedure which generates clusters based on the minimum similarity of all pairs of sets. Because of this, the number of topics produced by SMH depends on the parameter configuration and the word co-occurrence characteristics of the corpus. This contrasts with LDA and other topic models, where the number of topics is given in advance by the user.

Finally, each topic discovered by SMH is represented by the set of all words in the CWS that belong to the topic. Therefore, the number of words in a topic also depends on the parameter configuration and on the degree of co-occurrence of words belonging to the same topic in the corpus. This again contrasts with LDA and other topic models, where topics are represented as distributions over the complete vocabulary, although only the top-$N$ most probable words (typically $N$ is set to 5 or 10) of each topic are shown to the user. For each topic discovered by SMH, words are ordered in descending order of the number of CWS in which they appear, so that more representative and coherent words are shown first to the user.

5 Experimental Results

We evaluated the coherence of the topics discovered by SMH on the Reuters corpus [36] using different parameter settings and vocabulary sizes. We also compared SMH to Online LDA with respect to both topic coherence and scalability using corpora of increasing sizes. Specifically, we performed experiments on the 20 Newsgroups (a collection of approximately 18K newsgroup documents), Reuters (a collection of approximately 800K news articles), and Spanish and English Wikipedia (collections of approximately 1 million and 5 million encyclopedia entries, respectively; the English dump is from 2016-11-01 and the Spanish dump from 2017-04-20) corpora. In all 4 corpora, a standard list of stop words was removed and the remaining vocabulary was restricted to the most frequent words in each corpus (see Tables 5 and 6 for the vocabulary sizes). It is worth noting that these vocabulary sizes are considerably larger than what is typically used in topic models (e.g. in [18] topics were discovered from Nature and Wikipedia articles using vocabularies of only a few thousand words). Here, we decided to use larger vocabulary sizes in order to evaluate the scalability and robustness of SMH with respect to both corpus and vocabulary size. The source code for all the reported topic discovery experiments is available at https://github.com/gibranfp/SMH-Topic-Discovery, and an implementation of Sampled Min-Hashing is available at https://github.com/gibranfp/Sampled-MinHashing.


Figure 4: NPMI scores for topics discovered by SMH with different co-occurrence thresholds ($s^*$). Median NPMI is shown as a solid yellow line and mean NPMI as a dashed green line.

In order to evaluate topic coherence, we relied on the Normalized Pointwise Mutual Information (NPMI), since it strongly correlates with human judgments and outperforms other metrics [24]. NPMI is defined for an ordered topic $t$ from its top-$N$ words $w_1, \dots, w_N$ as follows

$\mathrm{NPMI}(t) = \sum_{j=2}^{N} \sum_{i=1}^{j-1} \dfrac{\log \frac{P(w_j, w_i)}{P(w_i) P(w_j)}}{-\log P(w_i, w_j)}$   (12)

Following Lau et al.'s [24] topic coherence evaluation methodology (we used the implementation by Lau et al. [24], available at https://github.com/jhlau/topic_interpretability), all 4 corpora were lemmatized using NLTK's WordNet lemmatizer [29]. NPMI scores were then computed from the top-10 words of each topic, and the lexical probabilities $P(w_i)$, $P(w_j)$ and $P(w_i, w_j)$ were calculated by sampling word counts within a sliding context window over an external reference corpus, in this case the lemmatized English Wikipedia.
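For reference, a small sketch of the per-topic NPMI computation of Eq. 12 is given below. It estimates the lexical probabilities from a toy list of context windows, sums over word pairs, and simply skips pairs that never co-occur; these are simplifying assumptions rather than the behavior of the reference implementation.

```python
import math
from itertools import combinations

def npmi_topic(top_words, windows):
    """NPMI coherence of a topic from its top-N words (Eq. 12).

    windows is a list of sets of words, one per sliding context window taken
    from the reference corpus; probabilities are fractions of windows containing
    each word or word pair."""
    n = len(windows)
    def p(*words):
        return sum(1 for w in windows if all(x in w for x in words)) / n
    score = 0.0
    for wi, wj in combinations(top_words, 2):
        p_i, p_j, p_ij = p(wi), p(wj), p(wi, wj)
        if p_ij == 0:
            continue  # undefined pair; skipped here as a simplifying choice
        pmi = math.log(p_ij / (p_i * p_j))
        score += pmi / -math.log(p_ij)
    return score

windows = [{"stock", "market", "index"}, {"stock", "market"},
           {"game", "team"}, {"market", "index"}, {"stock", "index"}]
print(npmi_topic(["stock", "market", "index"], windows))
```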

As mentioned in Sect. 4, in SMH both the number of topics and the number of words in each topic depend on the parameter settings and the characteristics of the corpus. In order to make the evaluation of all models comparable, we ordered the topics of each model in descending order of the average number of documents in which their top-10 words appear (all topics with fewer than 10 words were discarded), taking into account only the top 400 topics. In addition, only clusters with at least 5 CWS were considered, in order to avoid random topics that may not be meaningful.

$s^*$    NPMI Avg    NPMI Med    NPMI STD    #Topics    Time (s)
0.04     0.111       0.08        0.100       1708       2111
0.06     0.112       0.08        0.106       1038       1288
0.08     0.112       0.08        0.108       743        591
0.10     0.107       0.06        0.115       572        358
Table 1: NPMI statistics for SMH with different co-occurrence thresholds ($s^*$).

5.1 Evaluation of SMH Parameters

SMH has 3 main parameters that can affect its behavior and output: the threshold , the tuple size , and the overlap coefficient . We ran experiments with a range of different values for these parameters in order to evaluate their impact on the time required to discover topics, the number of discovered topics, and the coherence of the top 400 topics. In the following, we describe each of these experiments in detail and discuss results.

The threshold $s^*$ is an SMH parameter that roughly controls the degree to which a group of words must co-occur in order to be stored in the same bucket of at least one hash table (see Eq. 10 and Fig. 2) and therefore be considered a co-occurring word set (CWS). For smaller $s^*$ values, words with lower co-occurrence (i.e. with a smaller $JCC$) can still be considered a CWS. Conversely, larger $s^*$ values require words to have a higher co-occurrence (i.e. a larger $JCC$) to be considered a CWS. Accordingly, smaller $s^*$ values require more tuples than larger values. We evaluated the coherence of the topics discovered by SMH using $s^*$ values of 0.04, 0.06, 0.08, and 0.10. Setting the tuple size $r$ to 2 and using Eq. 11, we obtained the corresponding number of tuples (hash tables) for these values, approximately 433, 192, 108, and 69, respectively. Figure 4 shows the distribution of NPMI scores for the 4 different $s^*$ values. Interestingly, the coherence of the discovered topics remains stable over the range of $s^*$ values and only noticeably declines when $s^* = 0.10$. As shown in Table 1, the $s^*$ value has a greater impact on the number of discovered topics than on NPMI scores, with the number of topics decreasing quickly as $s^*$ grows. This is because many relevant CWS tend to lie towards small $JCC$ values and are therefore not found by SMH with larger $s^*$ values. We can also observe that the discovery time grows rapidly as $s^*$ decreases, since smaller $s^*$ values require more tuples to be computed. Thus, smaller $s^*$ values may improve recall but at the cost of increased discovery time.

$r$    NPMI Avg    NPMI Med    NPMI STD    #Topics    Time (s)
2      0.112       0.08        0.108       743        743
3      0.115       0.06        0.119       580        9588
4      0.120       0.07        0.122       544        139954
Table 2: NPMI statistics for SMH with different tuple sizes ($r$).

The tuple size $r$ is another SMH parameter; it determines how closely the probability of finding a CWS approximates a unit step function (see Eq. 10 and Fig. 2), such that only CWS with a $JCC$ larger than $s^*$ are likely to be found by SMH. We evaluated SMH with tuple sizes $r$ equal to 2, 3, and 4, computing the corresponding number of tuples for each tuple size with Eq. 11. Table 2 shows the NPMI statistics, the number of discovered topics, and the discovery time for the 3 different tuple sizes. Note that the average and median NPMI scores as well as the standard deviation are very similar for the 3 tuple sizes. On the other hand, the number of discovered topics consistently decreases for larger tuple sizes, since the probability of finding a CWS more closely approximates a unit step function and as a result there are fewer false positives. However, the discovery time grows exponentially with the tuple size, since a significantly larger number of tuples is then required. Therefore, a larger tuple size may improve precision but at a high computational cost.

$\epsilon$    NPMI Avg    NPMI Med    NPMI STD    #Topics    Time (s)
0.5           0.063       0.04        0.066       1842       2278
0.7           0.101       0.07        0.093       1795       2066
0.9           0.111       0.08        0.100       1708       2111
Table 3: NPMI statistics for SMH with different overlap coefficient thresholds ($\epsilon$).

Finally, we evaluated the impact of the overlap coefficient threshold $\epsilon$. This parameter specifies the degree of overlap that 2 CWS must have in order to be merged into the same cluster and thus become part of the same topic. Small $\epsilon$ values allow pairs of CWS that share only a small proportion of words to be merged into the same cluster, whereas larger values require a larger proportion of shared words. Table 3 presents the NPMI statistics as well as the number of discovered topics and the discovery time for SMH with different $\epsilon$ values. We can observe that NPMI scores were considerably lower for $\epsilon = 0.5$ than for $\epsilon = 0.7$ and $\epsilon = 0.9$, while the number of discovered topics and the discovery time were very similar for the 3 values. The reason $\epsilon = 0.5$ produces topics with lower NPMI scores is that the threshold becomes too low, which causes many CWS from different topics to be merged into a single topic.

5.2 Impact of the Vocabulary Size

Reducing the vocabulary to the most frequent words is a common approach to improving the quality of the discovered topics and speeding up discovery. Here, we evaluate the impact of different vocabulary sizes on the coherence of the topics discovered by SMH with fixed values of $s^*$, $r$, and $\epsilon$. Table 4 shows the NPMI statistics, the number of discovered topics, and the discovery time for the Reuters corpus with vocabularies composed of the top 20,000, 40,000, 60,000, 80,000, and 100,000 words. In general, NPMI scores decrease as the vocabulary size increases. This is expected, since larger vocabularies introduce less common words which may not appear frequently in the reference corpus from which NPMI's lexical probabilities are sampled. However, the number of discovered topics consistently increases with larger vocabularies because additional topics are formed with the extra words. Surprisingly, the discovery time was very similar for the 5 different vocabulary sizes, despite the largest vocabulary containing 5 times more words than the smallest.

Vocabulary size    NPMI Avg    NPMI Med    NPMI STD    #Topics    Time (s)
20000              0.130       0.11        0.109       334        2120
40000              0.121       0.10        0.102       610        2162
60000              0.114       0.09        0.098       909        2044
80000              0.115       0.09        0.100       1228       2072
100000             0.111       0.08        0.101       1708       2111
Table 4: NPMI statistics for SMH with different vocabulary sizes.

5.3 Comparison with Online LDA


Figure 5: NPMI scores for SMH (top 200, 400 and 600) and Online LDA (topic number set to 200, 400 and 600) topics discovered from 20 Newsgroups (top) and Reuters (bottom). Median NPMI is shown as a solid yellow line and mean NPMI as a dashed green line.

LDA and its variants have been the dominant approach to topic discovery for over a decade. Therefore, we compared the coherence of SMH and Online LDA topics using the 20 Newsgroups and Reuters corpora. Online LDA is a scalable LDA variant which uses stochastic variational inference instead of Gibbs sampling to approximate the posterior distribution that defines the topics of the corpus (we used the implementation included in scikit-learn [33], which is based on the code originally provided by the authors). This variant allows topic discovery at a larger scale without the need for a computer cluster and has become a popular alternative to the original LDA. The NPMI scores for topics discovered by SMH (top 200, 400 and 600 topics) and Online LDA (number of topics set to 200, 400 and 600) are shown in Fig. 5. For both corpora, the distributions of NPMI scores of SMH and Online LDA topics are very similar. Note that an increase in the number of topics tends to shift the distribution of NPMI scores towards lower values for both approaches, since more topics with less common words are considered. However, the effect is more severe for Online LDA than for SMH.

5.4 Scalability

In order to evaluate the scalability of SMH, we discovered topics from the 20 Newsgroups, Reuters, and Spanish and English Wikipedia corpora, whose sizes range from thousands to millions of documents and whose vocabularies range from thousands of words to as much as one million words. We also compared the time required by SMH to discover topics with that of Online LDA (due to its high memory and computational requirements, it was not possible to run Online LDA on the Spanish and English Wikipedia corpora). All experiments were performed on a Dell PowerEdge server with 2 Intel Xeon X5650 CPUs at 2.67 GHz (12 cores) and 32 GB of RAM; for comparison purposes, each experiment used only a single thread. Table 5 presents the discovery time in seconds for SMH with tuple size $r = 2$ and thresholds $s^* = 0.04$, $0.06$, $0.08$, and $0.10$, compared with Online LDA at 200, 400, and 600 topics. Note that the time complexity of SMH is linear with respect to both corpus and vocabulary size. Remarkably, SMH took at most 6.4 hours (when $s^* = 0.04$) and as little as 58 minutes (when $s^* = 0.10$) to process the entire English edition of Wikipedia, which contains over 5 million documents with a vocabulary of 1 million words. Although the time required by SMH and Online LDA to process the 20 Newsgroups corpus was very similar, SMH was significantly faster than Online LDA on the Reuters corpus.

                  SMH ($s^*$)                        Online LDA (#topics)
Corpus            0.04     0.06     0.08     0.10    200      400       600
20 Newsgroups     158      49       26       16      138      236       311
Reuters           2111     1288     591      358     9744     101418    138144
Wikipedia (Es)    8173     4181     2332     1483    --       --        --
Wikipedia (En)    22777    10669    5353     3475    --       --        --
Table 5: Time in seconds to discover topics on the 20 Newsgroups, Reuters and Wikipedia corpora with vocabularies of 10000, 100000 and 1000000 words respectively.

5.5 Examples of Discovered Topics by SMH

Size Top 10 words
20 Newsgroups (vocabulary of 20K words)
religion, atheist, religious, atheism, belief, christian, faith, argument, bear, catholic,…
os, cpu, pc, memory, windows, microsoft, price, fast, late, manager,…
game, season, team, play, score, minnesota, win, move, league, playoff,…
rfc, crypt, cryptography, hash, snefru, verification, communication, privacy, answers, signature,…
decision, president, department, justice, attorney, question, official, responsibility, yesterday, conversation,…
meter, uars, balloon, ozone, scientific, foot, flight, facility, experiment, atmosphere,…
dementia, predisposition, huntington, incurable, ross, forgetfulness, suzanne, alzheimer, worsen, parkinson,…
Reuters (vocabulary of 100K words)
point, index, market, high, stock, close, end, share, trade, rise,…
voter, election, poll, party, opinion, prime, seat, candidate, presidential, hold,…
play, team, match, game, win, season, cup, couch, final, champion,…
wrongful, fujisaki, nicole, acquit, ronald, jury, juror, hiroshi, murder, petrocelli,…
window, nt, microsoft, computer, server, software, unix, company, announce, machine,…
mexico, mexican, peso, city, state, trade, foreign, year, share, government,…
spongiform, encephalopathy, bovine, jakob, creutzfeldt, mad, cow, wasting, cjd, bse,…
Wikipedia Spanish (vocabulary of 1M words)
amerindios, residiendo, afroamericanos, hispanos, isleños, asiáticos, latinos, pertenecían, firme, habkm²,…
river, plate, juniors, boca, racing, libertadores, clubes, posiciones, lorenzo, rival,…
depa, billaba, obiwan, kenobi, padmé, haruun, vaapad, amidala, syndulla, skywalker,…
touchdowns, touchdown, quarterback, intercepciones, pases, yardas, nfl, patriots, recepciones, jets,…
canción, disco, álbum, canciones, sencillo, unidos, reino, unido, número, música,…
poeta, poesía, poemas, poetas, mundo, escribió, literatura, poema, nacional, siglo,…
Wikipedia English (vocabulary of 1M words)
families, householder, capita, makeup, latino, median, hispanic, racial, household, census,…
dortmund, borussia, schalke, leverkusen, werder, bayer, eintracht, wolfsburg, vfl, vfb,…
padmé, luminara, amidala, barriss, talzin, offee, unduli, skywalker, darth, palpatine,…
touchdown, yard, quarterback, pas, quarter, interception, fumble, rush, sack, bowl,…
release, album, guitar, single, bass, vocal, band, drum, chart, record,…
vowel, consonant, noun, plural, verb, pronoun, syllable, tense, adjective, singular,…
mesoamerican, mesoamerica, olmec, michoacán, preclassic, abaj, takalik, exact, corn, veracruz,…
Table 6: Sample topics discovered by SMH from the 20 Newsgroups, Reuters, and Spanish and English Wikipedia corpora with vocabulary size () of 20K, 100K, 1M, and 1M words respectively. For each topic, its size and the top-10 words are presented.

Table 6 shows examples of the topics discovered by SMH from the 20 Newsgroups, Reuters, and Spanish and English Wikipedia corpora (the complete set of topics discovered by SMH from each corpus is available at https://github.com/gibranfp/SMH-Topic-Discovery/blob/master/example_topics/). In general, these topics range from small (tens of words) to large (hundreds of words) and from specific (e.g. the Star Wars universe or the O.J. Simpson murder case) to general (e.g. demography or elections). In the case of the 20 Newsgroups corpus (18K newsgroup emails), SMH discovered several topics that loosely correspond to the main thematics of the different newsgroups from which the documents were collected. For example, the sample topics from 20 Newsgroups in Table 6 are related to religion, computers, sports, cryptography, politics, space, and medicine. On the other hand, most topics from the Reuters corpus (800K news articles) are related to major world events, important world news, economy, finance, popular sports, and technology; the Reuters sample topics in Table 6 are related to the stock market, elections, football, the O.J. Simpson murder case, Microsoft Windows, the Mexican economy, and mad cow disease. Finally, a wide variety of topics were discovered from both the Spanish and English editions of Wikipedia, including demography, history, sports, series, and music. It is also worth noting the similarity of some discovered topics that appear in both the English and Spanish Wikipedia, e.g. the sample topics related to demography, the Star Wars universe, and American football.

6 Conclusion and Future Work

We presented Sampled Min-Hashing (SMH), a simple approach to automatically discover topics from collections of text documents leveraging Min-Hashing to efficiently mine and cluster beyond-pairwise word co-occurrences from inverted file bags. This approach proved to be highly effective and scalable to massive datasets, offering an alternative to topic models. Moreover, SMH does not require a fixed number of topics to be determined in advance. Instead, its performance depends on the inherent co-occurrence of words found in the given corpus and its own parameter settings. In contrast to LDA and other topic models, where topics are distributions over the complete vocabulary, SMH topics are subsets of highly co-occurring word sets. We showed that SMH can discover meaningful and coherent topics from corpora of different sizes and diverse domains, on a par with those discovered by Online LDA. Interestingly, the topics discovered by SMH have different levels of generality that go from specific (e.g. a topic related to a particular capital stock) to general (e.g. a topic related to stock markets in general).

In SMH, Min-Hashing is repurposed as a method for mining highly co-occurring word sets by considering each hash table as a sample of word co-occurrence. The coherence of the topics discovered by this approach was stable over a range of parameter settings. In particular, our experiments demonstrated the stability of SMH over different values of the co-occurrence threshold $s^*$, the tuple size $r$, and the overlap coefficient threshold $\epsilon$. In our evaluation we found that many interesting co-occurring word sets lie towards smaller $JCC$ values, and we thus posit that smaller $s^*$ values may improve recall, albeit at a higher computational cost. Similarly, larger tuple sizes may improve precision at a very high computational cost. Fortunately, our empirical results suggest that a tuple size of $r = 2$ combined with a small co-occurrence threshold $s^*$ provides a good trade-off between recall, precision and efficiency. With those parameter values, maximum coherence is achieved by setting the overlap coefficient threshold $\epsilon$ to 0.9 without compromising speed or recall. Finally, smaller vocabulary sizes tend to slightly improve topic coherence while considerably reducing recall. In general, we found that SMH can produce coherent topics with large vocabularies at a high recall rate, showing its robustness to noisy and uncommon words.

We demonstrated the scalability of SMH by applying it to corpora with increasing numbers of documents and vocabulary sizes. We found that the discovery time required by SMH grows linearly with both corpus and vocabulary size. Remarkably, SMH performed topic discovery on the entire English edition of Wikipedia, which contains over 5 million documents and 1 million words, in at most 6.4 hours on relatively modest computing resources. As opposed to Online LDA, SMH's discovery time is not directly affected by the number of discovered topics, but instead by the number of documents in the corpus, the vocabulary size, the threshold $s^*$, and the tuple size $r$. On the Reuters corpus, which has more than 800,000 documents and 100,000 words, SMH was significantly faster than Online LDA and, when the number of topics was set to 400 and 600, its advantage was greater still. The current implementation of SMH does not take advantage of multi-core processors or distributed systems. However, given that its hash tables can be computed independently, SMH should be highly parallelizable. In the future, we plan to develop a parallel version of SMH which could scale to even larger corpora and make the discovery process much faster.

Acknowledgements

We acknowledge the resources and services provided for this project by the High Performance Computing Laboratory (LUCAR) of IIMAS-UNAM. We also thank Adrian Durán Chavesti for his invaluable help while we used LUCAR and Derek Cheung for proofreading the manuscript.

References

  • [1] Amr Ahmed, Sujith Ravi, Shravan Narayanamurthy, and Alex Smola. Fastex: Hash clustering with exponential families. In P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2807–2815. MIT Press, Cambridge, MA, 2012.
  • [2] David Blei and John Lafferty. Correlated topic models. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems, pages 147–154. MIT Press, Cambridge, MA, 2006.
  • [3] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
  • [4] Andrei Z. Broder. On the resemblance and containment of documents. Computer, 33(11):46–53, 2000.
  • [5] Christopher Buckley. The importance of proper weighting methods. In Proceedings of the Workshop on Human Language Technology, pages 349–352, 1993.
  • [6] Moses Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing, pages 380–388, 2002.
  • [7] Jianfei Chen, Kaiwei Li, Jun Zhu, and Wenguang Chen. Warplda: A cache efficient o(1) algorithm for latent dirichlet allocation. Proceedings of the VLDB Endowment, 9(10):744–755, 2016.
  • [8] Konstantinos Christidis and Gregoris Mentzas. A topic-based recommender system for electronic marketplace platforms. Expert Systems with Applications, 40(11):4370–4379, 2013.
  • [9] Ondrej Chum, James Philbin, and Andrew Zisserman. Near duplicate image detection: min-hash and tf-idf weighting. In Proceedings of the British Machine Vision Conference, 2008.
  • [10] Edith Cohen, Mayur Datar, Shinji Fujiwara, Aristides Gionis, Piotr Indyk, Rajeev Motwani, Jeffrey D. Ullman, and Cheng Yang. Finding interesting associations without support pruning. IEEE Transactions on Knowledge and Data Engineering, 12(1), 2001.
  • [11] Edith Cohen, Mayur Datar, Shinji Fujiwara, Aristides Gionis, Piotr Indyk, Rajeev Motwani, Jeffrey D. Ullman, and Cheng Yang. Finding interesting associations without support pruning. IEEE Transactions on Knowledge and Data Engineering, 13(1):64–78, 2001.
  • [12] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, pages 253–262, 2004.
  • [13] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
  • [14] Fabiano Fernandes dos Santos, Marcos Aurélio Domingues, Camila Vaccari Sundermann, Veronica Oliveira de Carvalho, Maria Fernanda Moura, and Solange Oliveira Rezende. Latent association rule cluster based model to extract topics for classification and recommendation applications. Expert Systems with Applications, 2018.
  • [15] James Foulds, Levi Boyles, Christopher DuBois, Padhraic Smyth, and Max Welling. Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 446–454, 2013.
  • [16] Gibran Fuentes Pineda, Hisashi Koga, and Toshinori Watanabe. Scalable object discovery: A hash-based approach to clustering co-occurring visual words. IEICE Transactions on Information and Systems, E94-D(10):2024–2035, 2011.
  • [17] Bernhard Haeupler, Mark Manasse, and Kunal Talwar. Consistent weighted sampling made fast, small, and easy. CoRR, abs/1410.4266, 2014.
  • [18] Matthew D. Hoffman, David M. Blei, and Francis Bach. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 23, 2010.
  • [19] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
  • [20] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57, 1999.
  • [21] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the 30th ACM Symposium on Theory of Computing, pages 604–613, 1998.
  • [22] Sergey Ioffe. Improved consistent sampling, weighted minhash and l1 sketching. In Proceedings of the IEEE International Conference on Data Mining, pages 246–255, 2010.
  • [23] Hugo Larochelle and Stanislas Lauly. A neural autoregressive topic model. In Advances in Neural Information Processing Systems 25, pages 2717–2725, 2012.
  • [24] Jey Han Lau, David Newman, and Timothy Baldwin. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 530–539, 2014.
  • [25] Aaron Q. Li, Amr Ahmed, Sujith Ravi, and Alexander J. Smola. Reducing the sampling complexity of topic models. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 891–900, 2014.
  • [26] Jia Li and Hua Xu. Suggest what to tag: Recommending more precise hashtags based on users’ dynamic interests and streaming tweet content. Knowledge-Based Systems, 106:196–205, 2016.
  • [27] Kaiwei Li, Jianfei Chen, Wenguang Chen, and Jun Zhu. Saberlda: Sparsity-aware learning of topic models on gpus. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, pages 497–509, 2017.
  • [28] Ping Li and Kenneth W. Church. A sketch algorithm for estimating two-way and multi-way associations. Comput. Linguist., 33(3):305–354, 2007.
  • [29] Edward Loper and Steven Bird. Nltk: The natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, pages 63–70, 2002.
  • [30] Mark Manasse, Frank McSherry, and Kunal Talwar. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010.
  • [31] David Mimno, Matthew D. Hoffman, and David M. Blei. Sparse stochastic inference for latent Dirichlet allocation. In International Conference on Machine Learning, 2012.
  • [32] Nitish Srivastava, Ruslan Salakhutdinov, and Geoffrey Hinton. Modeling documents with a deep Boltzmann machine. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2013.
  • [33] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [34] Daniele Quercia, Harry Askham, and Jon Crowcroft. Tweetlda: Supervised topic classification and link prediction in twitter. In Proceedings of the 4th Annual ACM Web Science Conference, pages 247–250, 2012.
  • [35] Philip Resnik, William Armstrong, Leonardo Max Batista Claudino, Thang Nguyen, Viet-An Nguyen, and Jordan L. Boyd-Graber. Beyond LDA: exploring supervised topic modeling for depression-related language in twitter. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 99–107, 2015.
  • [36] Tony Rose, Mark Stevenson, and Miles Whitehead. The reuters corpus volume 1-from yesterday’s news to tomorrow’s language resources. In LREC, volume 2, pages 827–832, 2002.
  • [37] Ruslan Salakhutdinov and Geoffrey E. Hinton. Replicated softmax: An undirected topic model. In Advances in Neural Information Processing Systems 22, pages 1607–1614, 2009.
  • [38] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):512–523, 1988.
  • [39] Anshumali Shrivastava and Ping Li. Beyond pairwise: Provably fast algorithms for approximate k-way similarity search. In Advances in Neural Information Processing Systems, pages 791–799, 2013.
  • [40] Anshumali Shrivastava and Ping Li. Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, pages 2321–2329, 2014.
  • [41] Ryan Spring and Anshumali Shrivastava. Scalable and sustainable deep learning via randomized hashing. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 445–454, 2017.
  • [42] Xiaofeng Wang, Matthew S. Gerber, and Donald E. Brown. Automatic crime prediction using events extracted from twitter posts. In Proceedings of the 5th International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction, pages 231–238, 2012.
  • [43] JianFeng Yan, Jia Zeng, Zhi-Qiang Liu, Lu Yang, and Yang Gao. Towards big topic modeling. Information Sciences, 390(Supplement C):15–31, 2017.
  • [44] Hsiang-Fu Yu, Cho-Jui Hsieh, Hyokun Yun, S.V.N. Vishwanathan, and Inderjit S. Dhillon. A scalable asynchronous distributed algorithm for topic modeling. In Proceedings of the 24th International Conference on World Wide Web, pages 1340–1350, 2015.
  • [45] Jinhui Yuan, Fei Gao, Qirong Ho, Wei Dai, Jinliang Wei, Xun Zheng, Eric Po Xing, Tie-Yan Liu, and Wei-Ying Ma. Lightlda: Big topic models on modest computer clusters. In Proceedings of the 24th International Conference on World Wide Web, pages 1351–1361, 2015.
  • [46] Lele Yut, Ce Zhang, Yingxia Shao, and Bin Cui. Lda*: A robust and large-scale topic modeling system. Proceedings of the VLDB Endowment, 10(11), 2017.
  • [47] Bingjing Zhang, Bo Peng, and Judy Qiu. High performance lda through collective model communication optimization. Procedia Computer Science, 80(Supplement C):86–97, 2016.
  • [48] Zhongying Zhao, Shengzhong Feng, Qiang Wang, Joshua Zhexue Huang, Graham J. Williams, and Jianping Fan. Topic oriented community detection through social objects and link analysis in social networks. Knowledge-Based Systems, 26:164–173, 2012.
  • [49] Dengyong Zhou, Jiayuan Huang, and Bernhard Schölkopf. Learning with hypergraphs: Clustering, classification, and embedding. In Advances in Neural Information Processing Systems 19, pages 1601–1608, 2006.