Currently, probabilistic topic models are important tools for improving automatic text processing including information retrieval, text categorization, summarization, etc. Besides, they can be useful in supporting expert analysis of document collections, news flows, or large volumes of messages in social networks [1, 2, 3]. To facilitate this analysis, such approaches as automatic topic labeling and various visualization techniques have been proposed [2, 5].
Boyd-Graber et al. indicate that to be understandable by humans, topics should be specific, coherent, and informative. Relationships between the topic components can be inferred. A comparison of four topic visualization approaches concluded that manual topic labels include a considerable number of phrases; users prefer shorter labels with more general words, and tend to incorporate phrases and more generic terminology when using a more complex network graph. Blei and Lafferty visualize topics with ngrams consisting of words mentioned in these topics. These works show that phrases and knowledge about hyponyms/hypernyms are important for topic representation.
In this paper we describe an approach to integrating large manual lexical resources such as WordNet or EuroVoc, together with automatically extracted n-grams, into probabilistic topic models in order to improve the coherence and informativeness of generated topics. The structure of the paper is as follows. Section 2 reviews related work. Section 3 describes the proposed approach. Section 4 enumerates the automatic quality measures used in the experiments. Section 5 presents the results obtained on several text collections according to the automatic measures. Section 6 describes the results of manual evaluation of combined topic models for the thematic analysis of an Islam-related Internet site.
2 Related Work
Topic modeling approaches are unsupervised statistical algorithms that usually consider each document as a "bag of words". There have been several attempts to enrich word-based topic models (i.e., unigram topic models) with additional prior knowledge or multiword expressions.
Andrzejewski et al. incorporated knowledge through Must-Link and Cannot-Link primitives represented by a Dirichlet Forest prior. These primitives were later reused in a model where similar words are encouraged to have similar topic distributions. However, all such methods incorporate knowledge in a hard and topic-independent way, which is a simplification, since two words that are similar in one topic are not necessarily of equal importance for another topic.
Xie et al.  proposed a Markov Random Field regularized LDA model (MRF-LDA), which utilizes the external knowledge to improve the coherence of topic modeling. Within a document, if two words are labeled as similar according to the external knowledge, their latent topic nodes are connected by an undirected edge and a binary potential function is defined to encourage them to share the same topic label. Distributional similarity of words is calculated beforehand on a large text corpus.
One line of work gathers so-called lexical relation sets (LR-sets) for word senses described in WordNet. The LR-sets include synonyms, antonyms, and adjective-attribute related words. To adapt LR-sets to a specific domain corpus and to remove inappropriate lexical relations, the correlation matrix for word pairs in each LR-set is calculated. This matrix is first used to filter out inappropriate senses and then to modify the initial LDA topic model according to the generalized Polya urn model. The generalized Polya urn model boosts probabilities of related words in word-topic distributions.
Gao and Wen presented the Semantic Similarity-Enhanced Topic Model, which accounts for corpus-specific word co-occurrence and for word semantic similarity calculated on WordNet paths between corresponding synsets, using the generalized Polya urn model. They apply their topic model to categorizing short texts.
All the above-mentioned approaches to adding knowledge to topic models are limited to single words. Approaches using ngrams in topic models can be subdivided into two groups. The first group of methods tries to create a unified probabilistic model accounting for both unigrams and phrases. Bigram-based approaches include the Bigram Topic Model and the LDA Collocation Model. The Topical N-Gram Model was proposed to allow the generation of ngrams based on the context. However, all these models are rather complex and hard to compute on real datasets.
The second group of methods is based on preliminary extraction of ngrams and their further use in topic generation. Initial studies of this approach used only bigrams [15, 16]. Nokel and Loukachevitch proposed the LDA-SIM algorithm, which integrates top-ranked ngrams and terms of information-retrieval thesauri into topic models (thesaurus relations were not utilized). They create similarity sets of expressions having the same word components and sum up the frequencies of similarity set members if they co-occur in the same text.
In this paper we describe an approach to integrating whole manual thesauri into topic models together with multiword expressions.
3 Approach to Integrating Whole Thesauri into Topic Models
In our approach we develop a previously proposed idea of constructing similarity sets that link ngram phrases with each other and with single words. Phrases and words are included in the same similarity set if they share a component word, for example: weapon, nuclear weapon, weapon of mass destruction; discrimination, racial discrimination. The underlying assumption is that if expressions from the same similarity set co-occur in the same document, their contribution to the document's topics is greater than their raw frequencies suggest, and therefore their frequencies should be increased. In such an approach, the algorithm can "see" similarities between different multiword expressions with the same component word.
In our approach, we first include related single words and phrases from a thesaurus such as WordNet or EuroVoc in these similarity sets. Then we add preliminarily extracted ngrams to these sets and, in this way, use two different sources of external knowledge. We use the same LDA-SIM algorithm as in the earlier work, but study which types of semantic relations can be introduced into such similarity sets and be useful for improving topic models. The pseudocode of the LDA-SIM algorithm is presented in Algorithm 1. It operates on similarity sets, whose expressions can comprise single words, thesaurus phrases, or generated noun compounds.
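The component-based grouping described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `build_similarity_sets`, the tiny `STOP_WORDS` list, and the whitespace tokenization without lemmatization are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical stop list: function words such as "of" should not link expressions.
STOP_WORDS = {"of", "the", "a", "an", "and", "in", "for"}


def build_similarity_sets(expressions):
    """Return {expression: set of other expressions sharing a component word}."""
    # Index every expression under each of its content words.
    by_component = defaultdict(set)
    for expr in expressions:
        for word in expr.split():
            if word not in STOP_WORDS:
                by_component[word].add(expr)

    # An expression is similar to everything indexed under any of its words.
    similarity_sets = {}
    for expr in expressions:
        related = set()
        for word in expr.split():
            if word not in STOP_WORDS:
                related |= by_component[word]
        related.discard(expr)  # an expression is not listed as similar to itself
        similarity_sets[expr] = related
    return similarity_sets


expressions = [
    "weapon",
    "nuclear weapon",
    "weapon of mass destruction",
    "discrimination",
    "racial discrimination",
]
similarity_sets = build_similarity_sets(expressions)
```

With the example expressions from the text, `similarity_sets["weapon"]` contains the two weapon phrases and `similarity_sets["racial discrimination"]` contains only `discrimination`. A real pipeline would presumably match lemmas rather than surface forms.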
We can compare this approach with the approaches applying the generalized Polya urn model [9, 10, 11]. To add prior knowledge, those approaches change topic distributions for related words globally in the collection. We modify topic probabilities for related words and phrases locally, in specific texts, only when related words (phrases) co-occur in these texts.
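The local, per-document nature of this modification can be sketched as follows. This is only an illustration of the frequency-summing step described in the text, not a reproduction of Algorithm 1; the function name `enhance_counts` and the toy counts are hypothetical.

```python
def enhance_counts(doc_counts, similarity_sets):
    """Add to each expression's count the counts of similar expressions
    that occur in the same document (absent expressions contribute 0)."""
    enhanced = {}
    for expr, count in doc_counts.items():
        bonus = sum(doc_counts.get(other, 0)
                    for other in similarity_sets.get(expr, ()))
        enhanced[expr] = count + bonus
    return enhanced


# Toy similarity sets and one document's raw counts (illustrative values only).
toy_sets = {
    "weapon": {"nuclear weapon", "weapon of mass destruction"},
    "nuclear weapon": {"weapon", "weapon of mass destruction"},
}
toy_doc = {"weapon": 2, "nuclear weapon": 3, "treaty": 1}
toy_enhanced = enhance_counts(toy_doc, toy_sets)
```

Because the bonus is computed from the counts of a single document, related expressions reinforce each other only where they actually co-occur, which is the contrast with the global changes made by Polya-urn-based approaches.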
4 Automatic Measures to Estimate the Quality of Topic Models
To estimate the quality of topic models, we use two main automatic measures: topic coherence and kernel uniqueness. For human content analysis, measures of topic coherence and kernel uniqueness are both important and complement each other. Topics can be coherent but have a lot of repetitions. On the other hand, generated topics can be very diverse, but incoherent within each topic.
Topic coherence is an automatic metric of interpretability. It has been shown that the coherence measure correlates highly with expert estimates of topic interpretability [10, 19]. Mimno et al. described an experiment comparing expert evaluation of LDA-generated topics with automatic topic coherence measures. It was found that most "bad" topics consisted of words without clear relations to each other.
Newman et al. asked users to score topics on a 3-point scale, where 3=“useful” (coherent) and 1=“useless” (less coherent). They instructed the users that one indicator of usefulness is the ease with which one could think of a short label to describe a topic. Then several automatic measures, including WordNet-based measures and corpus co-occurrence measures, were compared. It was found that the automatic measure with the highest correlation with human evaluation is word co-occurrence calculated as pointwise mutual information (PMI) on Wikipedia articles. Later, Lau et al. showed that normalized pointwise mutual information (NPMI) calculated on Wikipedia articles correlates even more strongly with human scores.
We calculate automatic topic coherence in two variants, based on PMI and on NPMI. The coherence of a topic is the median PMI (NPMI) over the word pairs representing the topic; it is usually calculated over the most probable elements of the topic (ten elements in our study). The coherence of the model is the median of the topic coherence values. To make this measure more objective, it should be calculated on an external corpus. In our case, we use Wikipedia dumps.
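The median-based coherence described above can be sketched in Python. This is a minimal illustration under stated assumptions: probabilities are estimated from document frequencies in a reference corpus, unseen pairs are smoothed with a small `eps`, and all names (`topic_coherence`, `doc_freq`, `pair_freq`) are hypothetical rather than taken from the paper.

```python
import math
from itertools import combinations
from statistics import median


def pair_score(p_xy, p_x, p_y, normalized):
    """PMI, or NPMI when normalized=True, for one word pair."""
    pmi = math.log(p_xy / (p_x * p_y))
    if not normalized:
        return pmi
    if p_xy >= 1.0:  # the pair occurs in every document
        return 1.0
    return pmi / -math.log(p_xy)


def topic_coherence(top_words, doc_freq, pair_freq, n_docs,
                    normalized=False, eps=1e-12):
    """Median PMI (or NPMI) over all pairs of a topic's top words."""
    scores = []
    for a, b in combinations(top_words, 2):
        # Smooth zero co-occurrence counts so the logarithm stays finite.
        joint = max(pair_freq.get(frozenset((a, b)), 0), eps) / n_docs
        scores.append(pair_score(joint, doc_freq[a] / n_docs,
                                 doc_freq[b] / n_docs, normalized))
    return median(scores)


def model_coherence(topics, doc_freq, pair_freq, n_docs, normalized=False):
    """Median of the per-topic coherence values."""
    return median(topic_coherence(t, doc_freq, pair_freq, n_docs, normalized)
                  for t in topics)


# Toy reference statistics (illustrative values only).
n_docs = 100
doc_freq = {"a": 10, "b": 10, "c": 10}
pair_freq = {frozenset(("a", "b")): 10,
             frozenset(("a", "c")): 1,
             frozenset(("b", "c")): 1}
```

In the toy data, `a` and `b` always co-occur, so their NPMI is close to 1, while `c` co-occurs with them at the rate expected under independence, giving a PMI close to 0.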
Human-constructed topics usually have unique main words. The measure of kernel uniqueness shows to what extent topics are different from each other and is calculated as the number of unique elements among most probable elements of topics (kernels) in relation to the whole number of elements in kernels.
If the uniqueness of the topic kernels is close to zero, then many topics are similar to each other and contain the same words in their kernels. In this paper, the kernel of a topic means the ten most probable words in the topic. We also calculate perplexity, a standard quality measure for language models, and use it as an additional check of model quality.
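A minimal sketch of the kernel uniqueness computation is given below. It assumes that "unique elements" means distinct elements across all kernels, which is one reading of the definition above; the function name and toy kernels are hypothetical.

```python
def kernel_uniqueness(kernels):
    """Share of distinct elements among all elements of all topic kernels."""
    all_elements = [element for kernel in kernels for element in kernel]
    if not all_elements:
        raise ValueError("at least one non-empty kernel is required")
    return len(set(all_elements)) / len(all_elements)


# Two toy kernels of two elements each; "church" is repeated.
toy_kernels = [["church", "priest"], ["church", "mosque"]]
toy_uniqueness = kernel_uniqueness(toy_kernels)
```

Under this reading, a model whose kernels never repeat an element scores 1.0, and the score falls as more topics share the same top words.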
5 Use of Automatic Measures to Assess Combined Models
For evaluating topics with automatic quality measures, we used several English text collections and one Russian collection (Table 1). We experiment with three thesauri: WordNet (https://wordnet.princeton.edu/; 155 thousand entries), the information-retrieval thesaurus of the European Union EuroVoc (http://eurovoc.europa.eu/drupal/; 15161 terms), and the Russian thesaurus RuThes (http://www.labinform.ru/pub/ruthes/index_eng.htm; 115 thousand entries).
|Text collection||Number of texts||Number of words|
English part of
|English part of||23545||53 mln|
|ACL Anthology||10921||48 mln|
|NIPS Conference||17400||5 mln|
|Russian banking texts||10422||32 mln|
At the preprocessing step, documents were processed by morphological analyzers, and noun groups were extracted. As baselines, we use the unigram LDA topic model and the LDA topic model extended with the 1000 ngrams having the highest NC-value, extracted from the collection under analysis.
As was found before [15, 17], adding ngrams without accounting for relations between their components considerably worsens perplexity because of the vocabulary growth (lower perplexity is better) and leaves the other automatic quality measures practically unchanged (Table 2).
We add the WordNet data in the following steps. At the first step, we include WordNet synonyms (including multiword expressions) in the proposed similarity sets (LDA-Sim+WNsyn). At this step, frequencies of synonyms found in the same document are summed up in the process of LDA topic learning, as described in Algorithm 1. We can see that the kernel uniqueness becomes very low, and the topics are very close to each other in content (Table 2: LDA-Sim+WNsyn). At the second step, we add the direct relatives of each word (hyponyms, hypernyms, etc.) to the similarity sets. Now the frequencies of semantically related words are added up, enhancing their contribution to all topics of the current document.
Table 2 shows that these two steps lead to a great degradation of the topic model on most measures in comparison with the initial unigram model: the uniqueness of kernels drops abruptly, and perplexity at the second step grows several-fold (Table 2: LDA-Sim+WNsynrel). It is evident that at this step the model has poor quality. When we look at the topics, the cause of the problem becomes clear: the obtained topics are overgeneralized. The topics are built around very general words such as "person", "organization", "year", etc. These words were already frequent in the collection and then received additional frequencies from their frequent synonyms and related words.
We then suppose that these general words were used in the texts to discuss specific events and objects. Therefore, we change the construction of the similarity sets in the following way: we do not add a word's hyponyms to its similarity set. Thus hyponyms, which are usually more specific and concrete, obtain additional frequencies from upper synsets and increase their contribution to the document topics, while the frequencies and contributions of hypernyms to the document topics remain unchanged. We see a great improvement in model quality: the kernel uniqueness improves considerably, perplexity decreases to levels comparable with the unigram model, and the topic coherence characteristics also improve for most collections (Table 2: LDA-Sim+WNsynrel/hyp).
We further combine the WordNet-based similarity sets with n-grams having the same components. All measures significantly improve for all collections (Table 2: LDA-Sim+WNsr/hyp+Ngrams). At the last step, we apply to ngrams the same approach that was previously used for hyponym-hypernym relations: frequencies of shorter ngrams and words are added to the frequencies of longer ngrams, but not vice versa. In this way we try to increase the contribution of more specific, longer ngrams to the topics. It can be seen (Table 2) that the kernel uniqueness grows significantly; at this step it is 1.3-1.6 times greater than for the baseline models, reaching 0.76 on the ACL collection (Table 2: LDA-Sim+WNsr/hyp+Ngrams/l).
In the second series of experiments, we applied the EuroVoc information-retrieval thesaurus to two European Union collections: Europarl and JRC. The EuroVoc thesaurus is much smaller than WordNet; it contains terms from the economic and political domains and does not include general abstract words. The results are shown in Table 3. It can be seen that including EuroVoc synsets improves the topic coherence and increases kernel uniqueness (in contrast to the results with WordNet). Adding ngrams further improves the topic coherence and kernel uniqueness.
Finally, we experimented with the Russian banking collection using the RuThes thesaurus. In this case we obtained an improvement already with the RuThes synsets, and adding ngrams again further improved topic coherence and kernel uniqueness (Table 4).
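The asymmetric treatment of hypernyms and hyponyms can be sketched as follows. This is an illustrative reading of the modification, not the authors' code: each word receives counts from its synonyms and direct hypernyms but not from its hyponyms. The function names and the toy relations (which are not claimed to match WordNet) are hypothetical.

```python
def build_donor_sets(synonyms, hypernyms):
    """For each word, collect the words whose counts are added to it:
    its synonyms and direct hypernyms, but never its hyponyms."""
    donors = {}
    for word in set(synonyms) | set(hypernyms):
        donors[word] = set(synonyms.get(word, ())) | set(hypernyms.get(word, ()))
    return donors


def enhance_counts_directed(doc_counts, donors):
    """Add to each word's count the counts of its donors in the same document."""
    return {
        word: count + sum(doc_counts.get(d, 0) for d in donors.get(word, ()))
        for word, count in doc_counts.items()
    }


# Toy relations: hypernyms[w] lists the direct hypernyms of w.
toy_synonyms = {"gun": {"firearm"}, "firearm": {"gun"}}
toy_hypernyms = {"rifle": {"firearm"}, "firearm": {"weapon"}}
toy_donors = build_donor_sets(toy_synonyms, toy_hypernyms)

toy_counts = {"weapon": 4, "firearm": 1, "rifle": 2}
toy_directed = enhance_counts_directed(toy_counts, toy_donors)
```

In the toy document, the general word `weapon` keeps its raw count, while the more specific `firearm` and `rifle` are boosted by the counts of the words above them, which mirrors the intended shift of weight toward specific terms.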
It is worth noting that adding ngrams sometimes worsens the TC-NPMI measure, especially on the JRC collection. This is due to the fact that in these evaluation frameworks, the topics’ top elements contain a lot of multiword expressions, which rarely occur in Wikipedia, used for the coherence calculation, therefore the utilized automatic coherence measures can have insufficient evidence for correct estimates.
6 Manual Evaluation of Combined Topic Models
To estimate the quality of topic models in a real task, we chose the Islam informational portal "Golos Islama" (Islam Voice; https://golosislama.com/, in Russian). This portal contains both news articles related to Islam and articles discussing Islam basics. We supposed that the thematic analysis of this specialized site can be significantly improved with domain-specific knowledge described in thesaurus form. We extracted the site contents using Open Web Spider (https://github.com/shen139/openwebspider/releases) and obtained 26,839 pages.
To combine knowledge with a topic model, we used RuThes thesaurus together with the additional block of the Islam thesaurus. The Islam thesaurus contains more than 5 thousand Islam-related terms including single words and expressions.
For each combined model, we ran two experiments with 100 topics and with 200 topics. The generated topics were evaluated by two linguists, who had previously worked on the Islam thesaurus. The evaluation task was formulated as follows: the experts should read the top elements of the generated topics and try to formulate labels of these topics. The labels should be different for each topic in the set generated with a specific model. The experts should also assign scores to the topics’ labels:
2, if the label describes all or almost all of the ten top elements of the topic,
1, if the description is partial, that is, several elements do not correspond to the label,
0, if the label cannot be formulated.
We then sum up all the scores for each model under consideration and compare the totals. Thus, the maximum total score is 200 for a 100-topic model and 400 for a 200-topic model. In this experiment we do not measure inter-annotator agreement for each topic, but try to capture the experts' general impression.
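The aggregation of expert scores is simple arithmetic; a short sketch makes the bounds explicit. The function names and the example scores are hypothetical and are not data from the experiment.

```python
VALID_SCORES = (0, 1, 2)


def total_score(topic_scores):
    """Sum per-topic label scores, each of which must be 0, 1, or 2."""
    for score in topic_scores:
        if score not in VALID_SCORES:
            raise ValueError(f"unexpected score: {score!r}")
    return sum(topic_scores)


def max_total_score(n_topics):
    """Upper bound of the total: every topic receives the top score of 2."""
    return 2 * n_topics


# Illustrative scores for a four-topic model.
example_total = total_score([2, 1, 0, 2])
```

For a 100-topic model the bound is `max_total_score(100)`, that is, 200, which matches the maximum stated above.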
|N||Model||Score (100 topics)||KernU (100 topics)||Prpl (100 topics)||Relc (100 topics)||Score (200 topics)||KernU (200 topics)||Prpl (200 topics)||Relc (200 topics)|
|8||LDA-Sim+ synrel+ More10phrases||150||0.587||2022||0.26||295||0.526||1758||0.25|
|9||LDA-Sim+ synrel/hyp+ 1000phrases||153||0.656||2163||0.26||310||0.603||1900||0.24|
|10||LDA-Sim+ synrel/hyp+ More10phrases||174||0.636||2476||0.24||302||0.244||2476||0.24|
|11||LDA-Sim+ synrel/GL+ More10phrases||186||0.655||1772||0.25||350||0.612||1464||0.25|
|12||LDA-Sim+ synrel/GL/hyp||184||0.686||2203||0.24||346||0.644||1812||0.23|
Due to the complicated character of the Islam portal contents for automatic extraction (numerous words and names difficult for Russian morphological analyzers), we did not use automatic extraction of multiword expressions and exploited only phrases described in RuThes or in the Islam Thesaurus. We added thesaurus phrases in two ways: most frequent 1000 phrases (as in [15, 17]) and phrases with frequency more than 10 (More10phrases): the number of such phrases is 9351.
The results of the evaluation are shown in Table 5. The table contains the overall expert scores for a topic model (Score), kernel uniqueness as in the previous section (KernU), and perplexity (Prpl). In addition, for the kernels of each model, we calculated the average number of known relations between the topics' elements: thesaurus relations (synonyms and direct relations between concepts) and component-based relations between phrases (Relc).
It can be seen that if we add phrases without accounting for component similarity (Runs 2, 3), the quality of topics decreases: the more phrases are added, the more the quality degrades. The human scores also confirm this. But if the similarity between phrase components is taken into account, the quality of topics improves significantly and becomes better than for unigram models (Runs 4, 5). All measures are better, and relational coherence between kernel elements also grows. The number of added phrases matters little.
Adding single-word synonyms decreases the quality of the models (Run 6) according to human scores, although all other measures behave differently: kernel uniqueness is high, perplexity decreases, and relational coherence grows. The problem with this model is that non-topical, general words are grouped together and reinforce one another but do not appear related to any topic. Adding all thesaurus relations is not very beneficial (Runs 7, 8). If we consider all relations except hyponyms, the human scores are better for the corresponding runs (Runs 9, 10). Relational coherence in the topics' kernels reaches very high values: a quarter of all elements are related to each other, but this does not help to improve the topics. The explanation is the same: general words can be grouped together.
|N||Unigram topic||Phrase-enriched topic||Thesaurus-enriched topic|
||Syria topic (Run 1)||Syria topic (Run 5)||Syria topic (Run 12)|
||Relation coherence 0.11||Relation coherence 0.13||Relation coherence 0.36|
|5.||оппозиция||режим асада||башар асад|
||(opposition)||(al-Assad regime)||(Bashar al-Assad)|
|8.||дамаск||режим башара асада||режим асада|
||(Damascus)||(Bashar al-Assad regime)||(al-Assad regime)|
||(President)||(Syrian authorities)||(regime)|
||Orthodox church topic||Orthodox church topic||Orthodox church topic|
||Relation coherence 0.04||Relation coherence 0.2||Relation coherence 0.33|
|1.||православный||русская православная церковь||церковь|
||(orthodox)||(Russian orthodox church)||(church)|
|6.||русский||православная церковь||русская православная церковь|
||(Russian)||(orthodox church)||(Russian orthodox church)|
||(priest)||(state language)||(ROC, abbr. for Russian church)|
||(Kirill, orthodox patriarch)||(priest)||(cathedral)|
Finally, we removed General Lexicon concepts from the RuThes data. These are top-level, non-thematic concepts that can occur in arbitrary domains. We then considered the all-relations and the without-hyponyms variants (Runs 11, 12). These last variants achieved the maximal human scores because they add thematic knowledge and avoid general knowledge, which can distort topics. Kernel uniqueness is also maximal.
Table 6 shows similar topics obtained with the unigram model, the phrase-enriched model (Run 5), and the thesaurus-enriched model (Run 12). The Run-5 model adds thesaurus phrases with frequency greater than 10 and accounts for the component similarity between phrases. The Run-12 model accounts for both component relations and hypernym thesaurus relations. All topics are of high quality and quite understandable. The experts gave them the same high scores.
Phrase-enriched and thesaurus-enriched topics convey the content using both single words and phrases. It can be seen that phrase-enriched topics contain more phrases. Sometimes the phrases create not very convincing relations, such as Russian church - Russian language. The relation is explainable but does not seem very topical in this case.
The thesaurus-enriched topics seem to convey the contents in the most concentrated way. In the Syria topic, the general word country is absent; instead of UN (United Nations), the topic contains the word rebel, which is closer to the Syrian situation. In the Orthodox church topic, the unigram variant contains the extra word year, and the relations of the words Moscow and Kirill to other words in the topic can be inferred only from encyclopedic knowledge.
In this paper we presented an approach for introducing thesaurus information into topic models. The main idea of the approach is based on the assumption that if related words or phrases co-occur in the same text, their frequencies should be enhanced, which leads to their larger mutual contribution to the topics found in this text.
In the experiments on four English collections, it was shown that the direct implementation of this idea using WordNet synonyms and/or direct relations leads to a great degradation of the unigram model. However, correcting the initial assumptions and excluding hyponyms from the frequency summation improves the model and makes it much better than the initial model on several measures. Adding ngrams in a similar manner further improves the model.
Introducing information from the domain-specific thesaurus EuroVoc improved the initial model without the additional assumption, which can be explained by the absence of general abstract words in such information-retrieval thesauri.
We also considered the thematic analysis of an Islam Internet site and evaluated the combined topic models manually. We found that the best, most understandable topics are obtained by adding domain-specific thesaurus knowledge (domain terms, synonyms, and relations).
This study is supported by Russian Scientific Foundation in part concerning the combined approach uniting thesaurus information and probabilistic topic models (project N16-18-02074). The study on application of the approach to content analysis of Islam sites is supported by Russian Foundation for Basic Research (project N 16-29-09606).
- [1] Blei, D.: Probabilistic topic models. Communications of the ACM, 55(4), 77–84 (2012)
- [2] Smith, A., Lee, T. Y., Poursabzi-Sangdeh, F., Boyd-Graber, J., Elmqvist, N., Findlater, L.: Evaluating Visual Representations for Topic Understanding and Their Effects on Manually Generated Labels. Transactions of the Assoc. for Computational Linguistics, 5, 1–15 (2017)
- [3] Chang, J., Boyd-Graber, J., Wang Ch., Gerrich S., Blei, D.: Reading tea leaves: How humans interpret topic models. In Proceedings of the 24th Annual Conference on Neural Information Processing Systems, pp. 288–296 (2009)
- [4] Boyd-Graber, J., Mimno, D., Newman, D.: Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements. CRC Handbooks of Modern Statistical Methods. CRC Press, Boca Raton, Florida (2014)
- [5] Blei, D., Lafferty, J.: Visualizing topics with multi-word expressions. https://arxiv.org/pdf/0907.1013.pdf (2009)
- [6] Andrzejewski, D., Zhu, X., Craven, M.: Incorporating domain knowledge into topic modeling via dirichlet forest priors. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 25–32 (2011)
- [7] Newman, D., Bonilla, E., Buntine, W.: Improving topic coherence with regularized topic models. In Advances in Neural Information Processing Systems, pp. 496–504 (2011)
- [8] Xie, P., Yang D., Xing, E.: Incorporating word correlation knowledge into topic modeling. In Proceedings of NAACL-2015, pp. 725–734 (2015)
- [9] Chen, Z., Mukherjee, A., Liu, B., Hsu, M., Castellanos, M., Ghosh, R.: Discovering coherent topics using general knowledge. In Proceedings of the 22nd ACM international conference on Information and Knowledge Management. ACM, pp. 209–218 (2013)
- [10] Mimno, D., Wallach, H., Talley, E., Leenders, M., McCallum, A.: Optimizing semantic coherence in topic models. In Proceedings of EMNLP’11, pp. 262–272 (2011)
- [11] Gao, Y., Wen, D.: Semantic Similarity-Enhanced Topic Models for Document Analysis. In Smart Learning Environments, Springer Berlin Heidelberg, pp. 45–56 (2015)
- [12] Wallach, H.: Topic modeling: Beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning, pp. 977–984 (2006)
- [13] Griffiths, Th., Steyvers, M., Tenenbaum, J.: Topics in semantic representation. Psychological Review, 114(2), 211–244 (2007)
- [14] Wang, X., McCallum, A., Wei, X.: Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, pp. 697–702 (2007)
- [15] Lau, J., Baldwin, T., Newman, D.: On collocations and topic models. ACM Transactions on Speech and Language Processing, 10(3), 1–14 (2013)
- [16] Nokel, M., Loukachevitch, N.: A method of accounting bigrams in topic models. In Proceedings of the 11th Workshop on Multiword Expressions (2015)
- [17] Nokel, M., Loukachevitch, N.: Accounting ngrams and multi-word terms can improve topic models. In Proceedings of the 11th Workshop on Multiword Expressions (2016)
- [18] Loukachevitch, N., Dobrov B.: RuThes linguistic ontology vs. Russian wordnets. In Proceedings of Global WordNet Conference GWC-2014 (2014)
- [19] Lau, J., Newman, D., Baldwin, T.: Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Proceedings of the European Chapter of the Association for Computational Linguistics (2014)
- [20] Bouma, G.: Normalized (pointwise) mutual information in collocation extraction. In Proceedings of the Biennial GSCL Conference, Potsdam, Germany, pp. 31–40 (2009)
- [21] Frantzi, K., Ananiadou, S.: The c-value/nc-value domain-independent method for multi-word term extraction. Journal of Natural Language Processing, 6(3), 145–179 (1999)