Towards Large-Scale Exploratory Search over Heterogeneous Sources

by Mariia Seleznova, et al.

Since time immemorial, people have been looking for ways to organize scientific knowledge into some systems to facilitate search and discovery of new ideas. The problem was partially solved in the pre-Internet era using library classifications, but nowadays it is nearly impossible to classify all scientific and popular scientific knowledge manually. There is a clear gap between the diversity and the amount of data available on the Internet and the algorithms for automatic structuring of such data. In our preliminary study, we approach the problem of knowledge discovery on web-scale data with diverse text sources and propose an algorithm to aggregate multiple collections into a single hierarchical topic model. We implement a web service named Rysearch to demonstrate the concept of topical exploratory search and make it available online.





Structuring knowledge and finding relevant literature have always been important problems in science and education. Library cataloging systems [2] have remained effective search and structuring tools since as early as the fifth century B.C., but they have recently shown a lack of flexibility in dealing with the large-scale and diverse documents of the Internet. At the same time, modern search engines, although able to process huge corpora, do not usually provide an overview of knowledge domains and only process short and specific queries. To combine both approaches, various exploratory search systems [9] have been proposed. They fit scientific and popular-science scenarios well, where a user wants to grasp a wide area without a particular request, or with only an abstract description of several concepts related to that area.

An integral part of building an exploratory search engine is constructing a representation of the documents in the corpus that can be used for searching and visualization. Two generic representations are used in NLP: 1) long and sparse count / TF-IDF vectors, and 2) short and dense embeddings (such as paragraph2vec or topic models). The latter are preferable nowadays due to their compact size and their better capture of synonymy and relatedness.

Among recent approaches for building dense document representations, probabilistic topic models have been shown to perform on par with Skip-Gram Negative Sampling while enjoying improved training performance and interpretability of vector components [5], which is crucial in large-scale exploratory search scenarios.

Heterogeneous documents on the Internet pose additional challenges for exploratory search engine design. First, vocabularies used in different Internet sources are often dissimilar and noisy, which might prevent their successful incorporation into a single model. Additional problems occur with topical imbalance, when some knowledge domains become underrepresented in a topic model because their source contains fewer documents. Second, the number of concepts discussed in the documents grows proportionally with the number of documents, and in large-scale scenarios the number of topics may become so large that a user gets confused if all the topics are displayed simultaneously. Turning to hierarchical extensions of topic models (hierarchical topic models, or HTMs) might be a beneficial solution in this case.

Our research aims at providing initial answers to these problems, and our contribution is therefore twofold. First, we provide an algorithm that is able to incorporate data into a hierarchical topic model without suppressing underrepresented knowledge domains over web-scale numbers of heterogeneous documents. Second, we implement a web service called Rysearch that allows users to observe topics and subtopics in a hierarchical knowledge map, as well as to perform inexact search over the search index.

We demonstrate the utility of the algorithm and the service on two popular scientific datasets and make our implementations available online.

Aggregation of Heterogeneous Sources

| Method | Topic quality, level 1 | Topic quality, level 2 | Edges AP@100 | Edges AP@200 | Edges AP@500 | Edges AP@1000 |
|---|---|---|---|---|---|---|
| No init, no fixed vocab (Baseline) | 21.8 | 20.0 | 10.8 | 21.9 | 38.4 | 50.9 |
| No init, fixed vocab | 23.3 | 19.3 | 37.8 | 42.6 | 48.6 | 58.0 |
| Init, no fixed vocab | 24.1 | 21.0 | 36.0 | 40.3 | 45.0 | 54.8 |
| Iterative init, no fixed vocab | 25.8 | 21.8 | 45.3 | 44.1 | 45.2 | 54.6 |
| Iterative init, fixed vocab | 27.4 | 21.9 | 44.6 | 45.7 | 49.3 | 57.2 |
| Init, fixed vocab (Proposed) | 28.4 | 23.6 | 42.2 | 47.4 | 51.7 | 59.4 |

Table 1: Comparison of the baseline algorithm and the proposed modifications. Each proposed modification is shown to increase HTM quality on its own, and their combination gives the best result. To compare hierarchical topic models, we use the average topic quality based on word embeddings from [4] for each hierarchy level, as well as ranking quality metrics for topic hierarchy edges from [1].


In our experiments we use two datasets gathered from popular scientific websites: one (around 3000 documents) as the initial collection and another (around 10000 documents) as the collection to add, or added collection. The initial collection covers many major scientific topics, and its articles have tags manually assigned by the website editors, which allows building a highly interpretable model. The source of the added collection is a blog platform focused on IT.

Baseline Algorithm

The baseline approach to the problem is to merge the collections and build a new model. As Table 1 shows, this loses interpretability at each level of the hierarchy, and many smaller topics get lost. The baseline approach can only discover knowledge domains of roughly equal size, which does not correspond well to the data used: most of the constructed topics turn out to be detailed topics of the bigger source, while the topics of the smaller source merge together despite being more interpretable and diverse.

Proposed Algorithm

We propose two modifications to the baseline approach: initialization and fixed vocabulary.

Initialization means setting the initial parameters of an HTM to some approximation before training the model on the merged collection. We use the parameters of an interpretable HTM of the initial collection. As the topic modeling problem has multiple solutions, this approach makes it possible to find a solution that is close to the initial model. The interpretable topics of the initial collection that are not present in the new collection remain almost intact, while the topics specific to the added collection expand and may change their vocabulary accordingly.
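The effect of initialization can be illustrated with a minimal sketch. It swaps in plain, flat PLSA trained by EM for the hierarchical ARTM model used in the paper, purely for illustration; the names `normalize`, `fit_plsa`, and `init_phi`, the toy vocabulary, and the toy documents are all hypothetical. The model is started from the topic-word matrix of an "initial" model rather than from random values, and then trained on the merged collection.

```python
def normalize(row):
    """Scale non-negative weights so they sum to one (uniform if all are zero)."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]


def fit_plsa(docs, vocab, init_phi, iters=20):
    """EM for plain PLSA, started from init_phi instead of random values.

    docs: list of {word: count}; every word is assumed to be in vocab.
    init_phi: one row per topic, init_phi[t][w] approximates p(w | t).
    Returns (phi, theta) with theta[d][t] = p(t | d).
    """
    index = {word: i for i, word in enumerate(vocab)}
    num_topics = len(init_phi)
    # A tiny floor keeps zero entries of the initial model trainable.
    phi = [normalize([p + 1e-6 for p in row]) for row in init_phi]
    theta = [[1.0 / num_topics] * num_topics for _ in docs]
    for _ in range(iters):
        n_wt = [[0.0] * len(vocab) for _ in range(num_topics)]
        for d, doc in enumerate(docs):
            n_td = [0.0] * num_topics
            for word, count in doc.items():
                w = index[word]
                joint = [phi[t][w] * theta[d][t] for t in range(num_topics)]
                z = sum(joint)
                if z == 0:
                    continue
                for t in range(num_topics):
                    share = count * joint[t] / z  # E-step responsibility
                    n_wt[t][w] += share
                    n_td[t] += share
            theta[d] = normalize(n_td)
        phi = [normalize(row) for row in n_wt]  # M-step
    return phi, theta


vocab = ["quantum", "particle", "gene", "cell"]
# Hypothetical parameters of a model trained on the initial collection.
init_phi = [[0.6, 0.4, 0.0, 0.0],   # topic 0: physics
            [0.0, 0.0, 0.6, 0.4]]   # topic 1: biology
merged_docs = [
    {"quantum": 3, "particle": 2},  # initial collection
    {"gene": 3, "cell": 2},         # initial collection
    {"gene": 1, "cell": 4},         # added collection
]
phi, theta = fit_plsa(merged_docs, vocab, init_phi)
```

In this toy run the physics topic, which has no counterpart in the added document, keeps its word distribution, while the biology topic shifts its weight toward "cell" to absorb the added document.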

The fixed vocabulary modification prevents the HTM from extending its vocabulary with new words occurring in the added documents. It is equivalent to adding the new words to the stopword list, which is known to improve text analysis quality when the new words are rare.
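As a rough sketch of the fixed vocabulary idea, the snippet below drops every token of the added documents that does not occur in the initial collection before the documents reach the model. The helper names `build_vocabulary` and `restrict_to_vocabulary` and the toy documents are hypothetical, and a real pipeline would apply this at the dictionary level of the topic modeling library rather than on raw token lists.

```python
from collections import Counter


def build_vocabulary(docs):
    """Collect the set of words seen in the initial collection."""
    return {word for doc in docs for word in doc}


def restrict_to_vocabulary(docs, vocabulary):
    """Drop out-of-vocabulary tokens.

    Returns bag-of-words counters and the share of tokens that was dropped.
    """
    kept, total, dropped = [], 0, 0
    for doc in docs:
        in_vocab = [word for word in doc if word in vocabulary]
        total += len(doc)
        dropped += len(doc) - len(in_vocab)
        kept.append(Counter(in_vocab))
    return kept, (dropped / total if total else 0.0)


initial_docs = [["quantum", "particle", "energy"], ["gene", "cell", "energy"]]
added_docs = [["server", "cell", "energy", "kubernetes"], ["particle", "compiler"]]

vocabulary = build_vocabulary(initial_docs)
filtered, dropped_share = restrict_to_vocabulary(added_docs, vocabulary)
```

Tracking the dropped share is a design choice of this sketch: if a large fraction of the added tokens is discarded, the assumption that the new words are rare no longer holds.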

We also explore the iterative version of the proposed algorithm: the added collection is separated into batches of gradually increasing size (not larger than 10% of the merged collection size in the previous iteration) and the next iteration is initialized with the previous iteration model. The results are shown in Table 1.
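One possible batch schedule for the iterative version is sketched below. The text only bounds each batch by 10% of the merged collection size at the previous iteration; taking the largest batch allowed by that bound at every step is an assumption of this sketch, and `batch_schedule` is a hypothetical helper name. The collection sizes are the approximate ones quoted above.

```python
def batch_schedule(initial_size, added_size, max_percent=10):
    """Split the added collection into batches of gradually increasing size.

    Each batch holds at most max_percent of the documents merged so far.
    """
    merged = initial_size
    remaining = added_size
    batches = []
    while remaining > 0:
        cap = max(1, merged * max_percent // 100)  # integer arithmetic, no rounding issues
        batch = min(remaining, cap)
        batches.append(batch)
        merged += batch      # the next model is trained on the enlarged collection
        remaining -= batch
    return batches


batches = batch_schedule(initial_size=3000, added_size=10000)
```

With these sizes the first two batches contain 300 and 330 documents. Each training iteration would then be initialized with the model from the previous iteration.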


Rysearch is a web application that provides tools for searching over and exploring heterogeneous hierarchical topic models, represented as an interactive topical map (built with the FoamTree visualization library). A fragment of such a map can be seen in Figure 1.

Figure 1: Different levels of detail of hierarchical cells on the map: (a) a topic, (b) a topic with three sub-topics, (c) a set of documents pertaining to the specific sub-topic, which can be further zoomed in by clicking on a (…) cell.

Supplementary material


We use the BigARTM library for efficient construction of topic models. It was shown in [8] that ARTM implemented in BigARTM outperforms LDA models implemented in the state-of-the-art Vowpal Wabbit and Gensim [6] libraries in terms of training and inference time, as well as held-out perplexity. Additional benefits of the ARTM approach include controlled sparsity of parameters (which helps reduce the size of a search index in exploratory search engines), interpretability of components, and easy composability of linguistic priors, or regularizers. For a survey of regularizers, we refer to [7]. The hierarchical extension of ARTM (or hARTM) that we use is proposed in [3].

Topic Model Parameters

In our experiments, we build a two-level hARTM model with 21 topics (twenty subject topics, each for a specific knowledge domain, and one for the background lexicon, occurring in all the domains simultaneously) at the upper level and 61 topics (sixty subject topics and one background topic) at the lower level, with topical edges between the pairs of subject topics. We use smoothing and decorrelating ARTM regularizers at both levels to create background topics with common vocabulary and subject topics with diverse and dissimilar vocabulary.

Quality Evaluation

In the paper, we propose two modifications to the baseline algorithm, namely initialization and fixed vocabulary. We then perform an ablation study, in which we show that each modification individually improves the model's quality and that their combination works best. Here we report more results related to the same ablation study. First, let us define the following short names for the proposed combinations of modifications:

  • D- I- is the baseline algorithm, with non-fixed dictionary and no initialization.

  • D+ I- has fixed dictionary, but no initialization.

  • D- I+ has non-fixed dictionary with initialization.

  • D+ I+ is the proposed algorithm, with both fixed dictionary and initialization.

  • D+ I+- is the iterative version of the proposed algorithm, where the added collection is merged into the topic model in a succession of small batches.

  • D- I+- is the iterative version of the proposed algorithm with non-fixed dictionary.

Figure 2: Number of edges in each combination, calculated as where is the model’s parameter matrix.

The number of edges depends on the application and can be chosen using Figure 2. We can observe that for better-performing combinations the line is “steeper” in the areas of low and high probabilities and reaches a plateau in the middle, meaning that edges of high and low probability are well separated, so the number-of-edges plot might also serve as a simple proxy for the quality of a topical hierarchy.
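A curve of this kind can be sketched as follows, assuming that an edge is counted whenever its entry in the model's parameter matrix exceeds a probability threshold. The exact formula behind Figure 2 is not reproduced here, so this counting rule, the helper `edge_count`, and the toy matrix `psi` of edge probabilities between four subtopics (rows) and two parent topics (columns) are all assumptions made for illustration.

```python
def edge_count(psi, threshold):
    """Number of topic-subtopic edges whose probability exceeds the threshold."""
    return sum(1 for row in psi for p in row if p > threshold)


# Hypothetical edge probabilities: most edges are clearly strong or clearly weak.
psi = [
    [0.95, 0.03],
    [0.90, 0.05],
    [0.04, 0.92],
    [0.50, 0.45],
]
thresholds = [0.01, 0.1, 0.3, 0.6, 0.93]
curve = [edge_count(psi, t) for t in thresholds]
```

For this toy matrix the count stays constant between the thresholds 0.1 and 0.3, which is the kind of plateau described above: any threshold inside the plateau selects the same set of edges.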

Figure 3: Ranking quality measures for each combination’s topical hierarchy.

In Figure 3 we provide ranking quality measures calculated with embedding similarity between each pair of topics’ top 10 words to show how well the model ranks the hierarchy edges from the best to the worst ones. These metrics depend on the number of edges and are explained in more detail in [1].
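For orientation, a common definition of average precision at k is sketched below. The metrics in [1] may be defined differently (for example, in how relevance is derived from embedding similarity or how the score is normalized), so this is an illustration of the general idea only. The function name `average_precision_at_k` and the binary relevance labels are hypothetical.

```python
def average_precision_at_k(relevance, k):
    """Average precision over the top-k ranked edges.

    relevance: 0/1 labels of edges, ordered from the highest-ranked edge down.
    The score is the mean of precision@i over the relevant positions i <= k.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


good_ranking = [1, 1, 0, 1, 0]  # relevant edges mostly at the top
bad_ranking = [0, 0, 1, 1, 1]   # relevant edges mostly at the bottom

ap_good = average_precision_at_k(good_ranking, k=5)
ap_bad = average_precision_at_k(bad_ranking, k=5)
```

A model that places good edges near the top of the ranking gets a higher score than one that places them near the bottom, which is the property the AP@k columns of Table 1 are meant to capture.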


  • Belyy et al. [2018] Anton Belyy, Mariia Seleznova, Aleksei Sholokhov, and Konstantin Vorontsov. Quality evaluation and improvement for hierarchical topic modeling. Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue”, pages 110–124, 2018.
  • Chan and Salaba [2015] Lois Mai Chan and Athena Salaba. Cataloging and classification: an introduction. Rowman & Littlefield, 2015.
  • Chirkova and Vorontsov [2016] NA Chirkova and KV Vorontsov. Additive regularization for hierarchical multimodal topic modeling. Journal Machine Learning and Data Analysis, 2(2):187–200, 2016.
  • Nikolenko [2016] Sergey I. Nikolenko. Topic quality metrics based on distributed word representations. In Proc. 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1029–1032, 2016.
  • Potapenko et al. [2017] Anna Potapenko, Artem Popov, and Konstantin Vorontsov. Interpretable probabilistic embeddings: bridging the gap between topic models and neural networks. In Conference on Artificial Intelligence and Natural Language, pages 167–180. Springer, 2017.
  • Rehurek and Sojka [2010] Radim Rehurek and Petr Sojka. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Citeseer, 2010.
  • Vorontsov and Potapenko [2014] Konstantin Vorontsov and Anna Potapenko. Tutorial on probabilistic topic modeling: Additive regularization for stochastic matrix factorization. In International Conference on Analysis of Images, Social Networks and Texts, pages 29–46. Springer, 2014.
  • Vorontsov et al. [2015] Konstantin Vorontsov, Oleksandr Frei, Murat Apishev, Peter Romov, and Marina Dudarenko. Bigartm: Open source library for regularized multimodal topic modeling of large collections. In International Conference on Analysis of Images, Social Networks and Texts, pages 370–381. Springer, 2015.
  • White and Roth [2009] Ryen W White and Resa A Roth. Exploratory search: Beyond the query-response paradigm. Synthesis lectures on information concepts, retrieval, and services, 1(1):1–98, 2009.