Structuring knowledge and finding relevant literature have always been important problems in science and education. Library cataloging systems have served as effective search and structuring tools since as early as the fifth century B.C., but they lack the flexibility to deal with the large-scale, diverse documents of the Internet. Modern search engines, in contrast, can process huge corpora, but they usually handle only short, specific queries and provide no overview of knowledge domains. To combine the strengths of both approaches, various exploratory search systems have been proposed. They fit well into scientific and popular scientific scenarios, where a user wants to grasp a broad area without a particular request in mind, or starts from an abstract description of several concepts related to that area.
An integral part of building an exploratory search engine is constructing a representation of the documents in the corpus that can be used for searching and visualization. Two generic representations are common in NLP: 1) long, sparse count or TF-IDF vectors, and 2) short, dense embeddings (such as paragraph2vec or topic model distributions). The latter are preferable nowadays due to their compact size and better capture of synonymy and relatedness.
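To make the first, sparse option concrete, here is a minimal sketch of TF-IDF weighting over tokenized documents (a toy illustration in pure Python, not the preprocessing pipeline used in the paper; the function name and corpus are invented for the example):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a sparse TF-IDF dict {term: weight}."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

docs = [["topic", "model", "search"],
        ["search", "engine", "index"],
        ["topic", "hierarchy"]]
vecs = tfidf_vectors(docs)
```

Each vector stores only the terms of its own document, which is what makes the representation sparse; dense embeddings instead assign every document a fixed small number of components.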
Among recent approaches to building dense document representations, probabilistic topic models have been shown to perform on par with Skip-Gram Negative Sampling while offering faster training and interpretable vector components, which is crucial in large-scale exploratory search scenarios.
Heterogeneous documents on the Internet pose additional challenges for exploratory search engine design. First, the vocabularies used by different Internet sources are often dissimilar and noisy, which can prevent their successful incorporation into a single model. Topical imbalance causes further problems: some knowledge domains may become underrepresented in a topic model because their source contributes fewer documents. Second, the number of concepts discussed in the documents grows with the number of documents, and in large-scale scenarios the number of topics may become so large that a user is overwhelmed if all topics are displayed simultaneously. Turning to hierarchical extensions of topic models (hierarchical topic models, or HTMs) can be a beneficial solution in this case.
Our research aims to provide initial answers to these problems, and our contribution is therefore twofold: first, we provide an algorithm that can incorporate data into a hierarchical topic model over web-scale numbers of heterogeneous documents without suppressing underrepresented knowledge domains; second, we implement a web service called Rysearch that allows users to observe topics and subtopics on a hierarchical knowledge map, as well as to perform inexact search over the search index.
We demonstrate the utility of the algorithm and the service on two popular scientific datasets and make our implementations available online.
Aggregation of Heterogeneous Sources
| Approach | Topic quality (level 1) | Topic quality (level 2) | AP@100 | AP@200 | AP@500 | AP@1000 |
|---|---|---|---|---|---|---|
| No init, no fixed vocab (Baseline) | 21.8 | 20.0 | 10.8 | 21.9 | 38.4 | 50.9 |
| No init, fixed vocab | 23.3 | 19.3 | 37.8 | 42.6 | 48.6 | 58.0 |
| Init, no fixed vocab | 24.1 | 21.0 | 36.0 | 40.3 | 45.0 | 54.8 |
| Iterative init, no fixed vocab | 25.8 | 21.8 | 45.3 | 44.1 | 45.2 | 54.6 |
| Iterative init, fixed vocab | 27.4 | 21.9 | 44.6 | 45.7 | 49.3 | 57.2 |
| Init, fixed vocab (Proposed) | 28.4 | 23.6 | 42.2 | 47.4 | 51.7 | 59.4 |
In our experiments we use two datasets gathered from popular scientific websites: Postnauka.ru (around 3000 documents) as the initial collection and Habr.com (around 10000 documents) as the collection to add, or added collection. Postnauka.ru covers many major scientific topics, and its articles carry tags manually assigned by the website editors, which allows building a highly interpretable model. Habr.com is a blog platform focused on IT.
The baseline approach to this problem is to merge the collections and train a new model from scratch. As Table 1 shows, this leads to a loss of interpretability at each level of the hierarchy, and many smaller topics get lost. The baseline approach can only discover knowledge domains of roughly equal sizes, which does not correspond well to the data: most of the constructed topics turn out to be detailed topics of the bigger source, while the topics of the smaller source merge together despite being more interpretable and diverse.
We propose two modifications to the baseline approach: initialization and fixed vocabulary.
Initialization means setting the initial parameters of an HTM to some approximation before training on the merged collection; we use the parameters of an interpretable HTM trained on the initial collection. Since the topic modeling problem has multiple solutions, this approach finds a solution close to the initial model: the interpretable topics of the initial collection that are absent from the new collection remain almost intact, while the topics specific to the added collection expand and may change their vocabulary accordingly.
The fixed vocabulary modification prevents the HTM from extending its vocabulary with new words occurring in the added documents. It is equivalent to adding the new words to the stopword list, which is known to improve text analysis quality when the new words are rare.
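The two modifications can be sketched together as a preprocessing step before training on the merged collection (a toy illustration, assuming a topic model stored as a topic-to-word-distribution mapping; the function and variable names are invented for the example):

```python
def prepare_merged_training(initial_model, initial_vocab, added_docs,
                            fix_vocabulary=True):
    """Warm-start training on the merged collection.

    Initialization: start from a copy of the interpretable initial
    model's parameters instead of a random point.
    Fixed vocabulary: optionally drop out-of-vocabulary words from the
    added documents, as if they were stopwords.
    """
    merged_model = {topic: dict(word_probs)
                    for topic, word_probs in initial_model.items()}
    if fix_vocabulary:
        added_docs = [[w for w in doc if w in initial_vocab]
                      for doc in added_docs]
    return merged_model, added_docs

# Toy example: one topic over a three-word vocabulary; "blockchain" and
# "devops" are new words from the added collection and get filtered out.
initial_model = {"physics": {"quantum": 0.6, "field": 0.3, "energy": 0.1}}
initial_vocab = {"quantum", "field", "energy"}
added_docs = [["quantum", "blockchain"], ["energy", "devops", "field"]]
model, docs = prepare_merged_training(initial_model, initial_vocab, added_docs)
```

In the actual system the warm start amounts to copying the trained Phi matrix of the initial HTM, but the shape of the idea is the same.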
We also explore an iterative version of the proposed algorithm: the added collection is split into batches of gradually increasing size (each no larger than 10% of the merged collection size at the previous iteration), and each iteration is initialized with the model from the previous one. The results are shown in Table 1.
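The 10% batching rule above can be sketched as follows (a minimal illustration with invented names; the document counts match our datasets, 3000 initial and 10000 added documents):

```python
def iterative_batches(added_docs, initial_size, max_fraction=0.1):
    """Split the added collection into batches, each at most
    `max_fraction` of the merged collection size so far."""
    batches, merged_size, i = [], initial_size, 0
    while i < len(added_docs):
        step = max(1, int(merged_size * max_fraction))
        batches.append(added_docs[i:i + step])
        merged_size += len(batches[-1])  # the batch joins the merged collection
        i += step
    return batches

batches = iterative_batches(list(range(10000)), initial_size=3000)
```

Because each batch enlarges the merged collection, the batch sizes grow geometrically (300, 330, 363, ...), so the whole added collection is absorbed in a modest number of iterations.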
Rysearch (http://github.com/AVBelyy/Rysearch) is a web application which provides tools for searching over and exploring heterogeneous hierarchical topic models, represented as an interactive topical map (built with the FoamTree visualization library, https://carrotsearch.com/foamtree/). A fragment of such a map can be seen in Figure 1.
We utilize the BigARTM library for efficient construction of topic models. It was shown in  that ARTM implemented in BigARTM outperforms LDA models implemented in the state-of-the-art Vowpal Wabbit (https://github.com/JohnLangford/vowpal_wabbit/) and Gensim  libraries in terms of training and inference time, as well as on held-out perplexity. Additional benefits of the ARTM approach include controlled sparsity of parameters (which helps reduce the size of a search index in exploratory search engines), interpretability of components, and easy composability of linguistic priors, or regularizers. For a survey of regularizers, we refer to . The hierarchical extension of ARTM (or hARTM) that we use is proposed in .
Topic Model Parameters
In our experiments, we build a two-level hARTM model with 21 topics at the upper level (twenty subject topics, one per knowledge domain, and one background topic for the lexicon occurring in all domains simultaneously) and 61 topics at the lower level (sixty subject topics and one background topic), with topical edges between pairs of subject topics. We use smoothing and decorrelating ARTM regularizers at both levels to create background topics with common vocabulary and subject topics with diverse, dissimilar vocabularies.
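As an illustration, a configuration along these lines might look roughly as follows in BigARTM's Python interface (a sketch, not our exact setup: the tau values, data paths, and the choice of the last topic in each level as the background topic are placeholder assumptions, and API details should be checked against the library documentation):

```python
import artm

# Assumed to be prepared from the corpus beforehand (hypothetical path).
batch_vectorizer = artm.BatchVectorizer(data_path="batches",
                                        data_format="batches")
dictionary = batch_vectorizer.dictionary

hier = artm.hARTM(dictionary=dictionary)

# Level 1: twenty subject topics plus one background topic.
level1 = hier.add_level(num_topics=21)
level1.regularizers.add(artm.SmoothSparsePhiRegularizer(
    name="smooth_bg_1", tau=1.0, topic_names=["topic_20"]))  # background
level1.regularizers.add(artm.DecorrelatorPhiRegularizer(
    name="decorr_1", tau=1e4))

# Level 2: sixty subject topics plus one background topic.
level2 = hier.add_level(num_topics=61, parent_level_weight=1)
level2.regularizers.add(artm.SmoothSparsePhiRegularizer(
    name="smooth_bg_2", tau=1.0, topic_names=["topic_60"]))
level2.regularizers.add(artm.DecorrelatorPhiRegularizer(
    name="decorr_2", tau=1e4))

hier.fit_offline(batch_vectorizer=batch_vectorizer)
```

Smoothing with a positive tau concentrates common vocabulary in the background topics, while the decorrelator pushes the subject topics toward dissimilar word distributions.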
In the paper, we propose two modifications to the baseline algorithm, namely initialization and fixed vocabulary. We then perform an ablation study, showing that each modification individually improves model quality and that their combination works best. Here we report further results from the same ablation study. First, let us define short names for the combinations of modifications:
D- I- is the baseline algorithm, with non-fixed dictionary and no initialization.
D+ I- has fixed dictionary, but no initialization.
D- I+ has non-fixed dictionary with initialization.
D+ I+ is the proposed algorithm, with both fixed dictionary and initialization.
D+ I+- is the iterative version of the proposed algorithm, where the added collection is merged into the topic model in a succession of small batches.
D- I+- is the iterative version of the proposed algorithm with non-fixed dictionary.
The number of edges depends on the application and can be chosen using Figure 2. We can observe that for better-performing combinations the line is "steeper" in the regions of low and high probabilities and reaches a plateau in the middle, meaning that edges of high and low probability are well separated; the number-of-edges plot might thus also serve as a simple proxy for the quality of a topical hierarchy.
- Anton Belyy, Mariia Seleznova, Aleksei Sholokhov, and Konstantin Vorontsov. Quality evaluation and improvement for hierarchical topic modeling. Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue", pages 110–124, 2018.
- Lois Mai Chan and Athena Salaba. Cataloging and classification: an introduction. Rowman & Littlefield, 2015.
- N. A. Chirkova and K. V. Vorontsov. Additive regularization for hierarchical multimodal topic modeling. Journal Machine Learning and Data Analysis, 2(2):187–200, 2016.
- Sergey I. Nikolenko. Topic quality metrics based on distributed word representations. In Proc. 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1029–1032, 2016.
- Anna Potapenko, Artem Popov, and Konstantin Vorontsov. Interpretable probabilistic embeddings: bridging the gap between topic models and neural networks. In Conference on Artificial Intelligence and Natural Language, pages 167–180. Springer, 2017.
- Radim Rehurek and Petr Sojka. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Citeseer, 2010.
- Konstantin Vorontsov and Anna Potapenko. Tutorial on probabilistic topic modeling: additive regularization for stochastic matrix factorization. In International Conference on Analysis of Images, Social Networks and Texts, pages 29–46. Springer, 2014.
- Konstantin Vorontsov, Oleksandr Frei, Murat Apishev, Peter Romov, and Marina Dudarenko. BigARTM: open source library for regularized multimodal topic modeling of large collections. In International Conference on Analysis of Images, Social Networks and Texts, pages 370–381. Springer, 2015.
- Ryen W. White and Resa A. Roth. Exploratory search: beyond the query-response paradigm. Synthesis Lectures on Information Concepts, Retrieval, and Services, 1(1):1–98, 2009.