Construction and Quality Evaluation of Heterogeneous Hierarchical Topic Models

11/07/2018 ∙ by Anton Belyy, et al. ∙ 0

In our work, we propose to represent a hierarchical topic model (HTM) as a set of flat models, or layers, and a set of topical hierarchies, or edges. We suggest several quality measures for edges of hierarchical models, resembling those proposed for flat models. We conduct an assessment experiment and show a strong correlation between the proposed measures and human judgement on topical edge quality. We also introduce the heterogeneous algorithm to build hierarchical topic models for heterogeneous data sources. We show how making certain adjustments to the learning process helps to retain the original structure of customized models while allowing for slight coherent modifications for new documents. We evaluate this approach using the proposed measures and show that the proposed heterogeneous algorithm significantly outperforms the baseline concat approach. Finally, we implement our own exploratory search engine (ESE) called Rysearch, which demonstrates the potential of the ARTM approach for visualizing large heterogeneous document collections.




1 Problem statement

We are given a set of document collections $D_1, \dots, D_n$. Each collection contains documents composed of several modalities (unigrams, or words, and user-annotated tags). To avoid specifying the modality, we will often use the term token, which means a word or a tag. A collection $D_i$ possesses its own vocabulary $W_i$, which comprises tokens of two modalities: $W_i = W_i^1 \cup W_i^2$ (words and tags, respectively). A common vocabulary $W$ denotes all the words or tags that occur in some collection-specific vocabulary and is defined as $W = \bigcup_{i=1}^{n} W_i$. We define the size of a vocabulary as the sum of the sizes of the vocabulary for each modality: $|W_i| = |W_i^1| + |W_i^2|$. A document is represented as a set of tokens in each modality, with the order of tokens being insignificant (bag-of-words model).

A set of document collections is said to be heterogeneous if all of the following are true:

  1. document collections vary significantly in document size;

  2. document collections vary significantly in vocabulary size;

  3. document collections have significantly different topical structures.

In each of the above-mentioned cases, the significance of a variation is defined empirically by human experts and we will not formalize it further. "Topical structure" in the third criterion is also a semi-empirical term here, which will be formalized in the following sections.

In this setup, we are asked to propose an algorithm for building hierarchical topic models (HTMs) over heterogeneous sources that will "respect" the differences of each individual source and produce an overall coherent and easy-to-navigate model. Since these properties are defined informally, we are further asked to propose evaluation criteria for HTMs and to compare the proposed algorithm to a baseline method for aggregating heterogeneous sources into an HTM.

To demonstrate the validity of the proposed algorithm and quality measures for creating exploratory search systems, we are further asked to implement a simple exploratory search engine (ESE) for two heterogeneous sources, Postnauka and Habrahabr, with navigation, searching and automatic tagging facilities. This should also serve as a demonstration of the applicability of the hARTM approach (discussed later in more detail) for building hierarchical navigators over large collections, continuing the work of Chirkova et al. [Chirkova2016], but extending it to a more realistic heterogeneous case.

Therefore, the contribution of our work is threefold:

  1. We propose new evaluation measures for hierarchical topic models,

  2. We propose an algorithm to build hierarchical models in the case of heterogeneous data sources,

  3. We implement an exploratory search engine based on the proposed algorithm.

2 Probabilistic topic modeling

In probabilistic topic modeling a document collection is a set of triples $(d, w, t)$ sampled from a discrete distribution $p(d, w, t)$ over $D \times W \times T$. Here $D$ is a set of documents, $W$ is a collection vocabulary, i.e. a set of words (tokens) appearing in documents, and $T$ is a set of topics. $d$ and $w$ are observable variables, while $t$ is a latent variable. Thus, a topic model of $D$ consists of probability distributions $p(w \mid t)$ for each $t \in T$ and $p(t \mid d)$ for each $d \in D$, while distributions $p(w \mid d)$ are given. An estimator for the given variables is

$$\hat{p}(w \mid d) = \frac{n_{dw}}{n_d},$$

where $n_{dw}$ is the number of times the token $w$ appears in the document $d$, and $n_d$ is the number of words in $d$.

Let us make several standard assumptions. First, the order of tokens in a document is not important (bag-of-words model). Second, the token distributions over topics are not document dependent, i.e. $p(w \mid t, d) = p(w \mid t)$.

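Under the listed assumptions a flat (one-level) topic model is described by the formula

$$p(w \mid d) = \sum_{t \in T} p(w \mid t)\, p(t \mid d) = \sum_{t \in T} \phi_{wt} \theta_{td}.$$

The problem can equivalently be stated as a matrix factorization:

$$F \approx \Phi \Theta, \quad F = (\hat{p}(w \mid d))_{W \times D}, \quad \Phi = (\phi_{wt})_{W \times T}, \quad \Theta = (\theta_{td})_{T \times D},$$

or as an optimization task:

$$L(\Phi, \Theta) = \sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \phi_{wt} \theta_{td} \to \max_{\Phi, \Theta}, \quad (1)$$

subject to the columns of $\Phi$ and $\Theta$ being non-negative and summing to one.

Note that problem (1) is not well-posed, as it admits infinitely many solutions: for any matrix $S$ of rank $|T|$ we have $\Phi \Theta = (\Phi S)(S^{-1} \Theta)$. Typically this is solved by reducing the solution space or by introducing a regularization term $R$ into the objective function:

$$L(\Phi, \Theta) + R(\Phi, \Theta) \to \max_{\Phi, \Theta}.$$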
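The non-uniqueness of the factorization can be checked numerically. The sketch below uses small hand-picked matrices (the values of `phi`, `theta` and `s` are illustrative assumptions, not taken from the paper) and exact rational arithmetic to confirm that two different factor pairs give the same product.

```python
from fractions import Fraction as F

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Toy Phi (3 tokens x 2 topics) and Theta (2 topics x 2 documents);
# every column is a probability distribution.
phi = [[F(1, 2), F(0)],
       [F(1, 2), F(1, 4)],
       [F(0),    F(3, 4)]]
theta = [[F(1, 3), F(1)],
         [F(2, 3), F(0)]]

# An arbitrary invertible 2x2 matrix S and its inverse (chosen by hand).
s = [[F(2), F(1)],
     [F(1), F(1)]]
s_inv = [[F(1), F(-1)],
         [F(-1), F(2)]]

phi2 = matmul(phi, s)          # Phi * S
theta2 = matmul(s_inv, theta)  # S^{-1} * Theta

# Both factor pairs reproduce exactly the same matrix of p(w|d).
same_product = matmul(phi, theta) == matmul(phi2, theta2)
```

For this particular `s` the transformed factors are no longer stochastic matrices, so they are not valid topic models. The ambiguity that matters in practice comes from those matrices `S` that keep both factors stochastic, and regularization is one way to choose among the remaining solutions.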

In fact, we can classify topic modeling approaches by their regularization term. Probabilistic Latent Semantic Analysis (PLSA) [hofmann1999probabilistic] is the basic approach, which does not use any regularization:

$$R(\Phi, \Theta) = 0.$$

Latent Dirichlet Allocation (LDA) [blei2003latent] is another common approach, where it is assumed that the columns of the $\Phi$ and $\Theta$ matrices are random variables with Dirichlet distributions $\mathrm{Dir}(\beta)$ and $\mathrm{Dir}(\alpha)$, respectively. Using Bayesian inference we can show that it is equivalent to the following regularizer:

$$R(\Phi, \Theta) = \sum_{t \in T} \sum_{w \in W} (\beta_w - 1) \ln \phi_{wt} + \sum_{d \in D} \sum_{t \in T} (\alpha_t - 1) \ln \theta_{td}.$$

Finally, the Additive Regularization of Topic Models (ARTM) [vorontsov2015additive] approach allows combining an arbitrary set of differentiable functions $R_i$ with weights $\tau_i$ into a single additive regularizer:

$$R(\Phi, \Theta) = \sum_{i} \tau_i R_i(\Phi, \Theta).$$

These additive regularizers can encode some a priori knowledge about the topics, which we want to enforce (e.g. topics need to be sparse, decorrelated, and coherent), but they can also enforce structures, such as a hierarchical relationship between topics. In the following section we will discuss it in more detail.

3 Hierarchical topic modeling

In this section, we describe the development of hierarchical topic models. Most of the advancements in this field were designed as extensions of the LDA framework, starting with hLDA, proposed in 2004 by David Blei, so we will describe these extensions in detail. We will also cover hierarchical ARTM (hARTM) [Chirkova2016], which is an extension of the ARTM framework.

Extensions of LDA framework

Several works were proposed from 2004 to 2016 to support hierarchies in the state-of-the-art LDA model.

Some of them focus on extending the generative process of LDA. In hLDA [Blei2004] a topical hierarchy is represented as a tree, where each sub-topic has exactly one parent. hPAM [Mimno2007] overcomes this limitation and represents hierarchies as multilevel acyclic graphs. hHDP [Zavitsanos2011] also uses a multilevel graph model for hierarchies and additionally provides ways to estimate the number of levels and the number of topics on each level of the hierarchy.

Other works focus on scalability and performance on huge-scale datasets. [Wang2014] proposes a method that is both scalable and interpretable by humans. [pujara2012large] proposes a simple iterative meta-algorithm that builds hierarchies in a top-down fashion using MapReduce. It also provides an open-source implementation, which is an extension of the MapReduce LDA [zhai2012mr] package.

Hierarchical ARTM

Let us define a topic hierarchy as an oriented multipartite (multilevel) acyclic graph where each level is a flat topic model. The edges connect topics from neighboring levels and represent parent-child relations in the hierarchy.

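Let us assume that $l \ge 1$ levels of a topic hierarchy are built, the $l$-th level has the parameters $\Phi^l, \Theta^l$, and $A$ is the set of $l$-th level topics. To build the $(l+1)$-th level, we model the $l$-th level distributions of tokens over topics $p(w \mid a)$ as mixtures of the next level distributions $p(w \mid t)$, where $T$ is the set of $(l+1)$-th level topics. This leads to the formula

$$p(w \mid a) = \sum_{t \in T} p(w \mid t)\, p(t \mid a), \quad a \in A.$$

The equivalent matrix factorization problem is $\Phi^l \approx \Phi^{l+1} \Psi$, where $\Psi = (\psi_{ta})_{T \times A}$, $\psi_{ta} = p(t \mid a)$, contains the conditional distributions of the $(l+1)$-th level topics given the $l$-th level topics.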

4 Quality of flat topic models

We briefly survey quality measures that have been proposed for flat topic models. We denote by $(w_1^t, \dots, w_k^t)$ the sequence of top tokens of a topic $t$, that is, the ordered sequence of the $k$ tokens from the $t$-th column of matrix $\Phi$ with the highest probability:

$$\phi_{w_1^t t} \ge \phi_{w_2^t t} \ge \dots \ge \phi_{w_k^t t}.$$

Usually, $k$ is fixed to a small number beforehand. Then, all of the major evaluation measures for flat topic models can be expressed in the following form:

$$Q(t) = \frac{1}{N} \sum_{1 \le i < j \le k} f(w_i^t, w_j^t), \quad (6)$$

where the inner term $f$ is a measure of cooccurrence of top tokens $w_i^t$ and $w_j^t$, and $N$ is a measure-specific norm.

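We begin with coherence proposed in [Mimno2011]. For a topic $t$ it is defined as

$$\mathrm{coh}(t) = \sum_{i=2}^{k} \sum_{j=1}^{i-1} \log \frac{\mathrm{df}(w_i^t, w_j^t) + \varepsilon}{\mathrm{df}(w_j^t)}, \quad (7)$$

where $\mathrm{df}(w)$ is the document frequency of top token $w$ (i.e., the number of documents with at least one token $w$), $\mathrm{df}(w_i, w_j)$ is the co-document frequency of top tokens $w_i$ and $w_j$ (i.e., the number of documents containing at least one token $w_i$ and at least one token $w_j$), and $\varepsilon$ is a small smoothing constant. Two modifications have been proposed to the original coherence. The first one is tf-idf coherence, which is defined as

$$\mathrm{coh}_{\text{tf-idf}}(t) = \sum_{i=2}^{k} \sum_{j=1}^{i-1} \log \frac{\sum_{d:\, w_i^t, w_j^t \in d} \text{tf-idf}(w_i^t, d)\, \text{tf-idf}(w_j^t, d) + \varepsilon}{\sum_{d:\, w_j^t \in d} \text{tf-idf}(w_j^t, d)}, \quad (8)$$

where the tf-idf measure is computed with augmented frequency,

$$\text{tf-idf}(w, d) = \left( \frac{1}{2} + \frac{n_{dw}}{2 \max_{w' \in d} n_{dw'}} \right) \log \frac{|D|}{|\{d' \in D : w \in d'\}|},$$

where $n_{dw}$ is the number of occurrences of token $w$ in document $d$. Usage of the tf-idf score is preferred to co-document frequency, since the latter can be skewed towards tokens that commonly occur together but do not define interpretable topics, as has been shown previously.

The next modification is word embedding coherence, which we define as

$$\mathrm{coh}_{\text{vec}}(t) = \frac{1}{k(k-1)} \sum_{i \ne j} \rho\big(v(w_i^t), v(w_j^t)\big), \quad (9)$$

where $v : W \to \mathbb{R}^m$ is a mapping from tokens to $m$-dimensional vectors and $\rho$ is a distance function.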

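Another class of quality measures is based on the idea of pointwise mutual information (PMI) between top tokens in a topic. Following [lau2014machine], we survey three variations of this idea, each given by its pairwise term:

  • the basic pairwise PMI measure [newman2010automatic], defined as $\mathrm{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}$ (10),

  • the normalized PMI measure [bouma2009normalized], defined as $\mathrm{NPMI}(w_i, w_j) = \frac{\mathrm{PMI}(w_i, w_j)}{-\log p(w_i, w_j)}$ (11),

  • the pairwise log conditional probability (LCP) measure [Mimno2011], defined as $\mathrm{LCP}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_j)}$ (12).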
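As an illustration of the co-occurrence-based measures above, the following minimal sketch computes coherence and NPMI on a toy corpus. It assumes document-level probability estimates (a token "occurs" if it appears in a document at least once), a smoothing constant of 1 for coherence as in the original definition of [Mimno2011], and averaging NPMI over unordered pairs; the corpus, the function names and the `eps` values are hypothetical.

```python
import math
from itertools import combinations

# Toy corpus: each document is the set of tokens it contains.
docs = [
    {"topic", "model", "lda"},
    {"topic", "model", "corpus"},
    {"topic", "lda"},
    {"search", "engine"},
]

def df(*tokens):
    """(Co-)document frequency: number of documents containing all given tokens."""
    return sum(1 for d in docs if all(t in d for t in tokens))

def p(*tokens):
    """Document-level probability estimate of (co-)occurrence."""
    return df(*tokens) / len(docs)

def coherence(top, eps=1.0):
    """Coherence of a list of top tokens, ordered by decreasing probability."""
    return sum(math.log((df(top[i], top[j]) + eps) / df(top[j]))
               for i in range(1, len(top)) for j in range(i))

def pmi(wi, wj, eps=1e-12):
    return math.log((p(wi, wj) + eps) / (p(wi) * p(wj)))

def npmi(wi, wj, eps=1e-12):
    return pmi(wi, wj, eps) / -math.log(p(wi, wj) + eps)

def mean_pairwise(f, top):
    """Average a pairwise measure over all unordered pairs of top tokens."""
    pairs = list(combinations(top, 2))
    return sum(f(a, b) for a, b in pairs) / len(pairs)

good_topic = ["topic", "model", "lda"]
bad_topic = ["topic", "search", "engine"]

coh_good, coh_bad = coherence(good_topic), coherence(bad_topic)
npmi_good = mean_pairwise(npmi, good_topic)
npmi_bad = mean_pairwise(npmi, bad_topic)
```

On this corpus the topic whose tokens share documents scores higher on both measures than the topic that mixes unrelated tokens. Every top token must occur in at least one document, otherwise the denominators are zero.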


Having defined measures (7)–(12), let us compare them in terms of correlation with human judgement of topical quality. We borrow these comparative results from the recent paper by Sergey Nikolenko [Nikolenko2016], where an assessment experiment was conducted. Three different models, namely pLSA, LDA, and ARTM, were assessed, where each topic was described by its 20 most probable words. The human experts were asked the following question: "Do you understand why these words are collected together in this topic?", with three possible answers: (a) absolutely not; (b) partially; (c) yes. The measures (7)–(12) were calculated as described above; for word embedding coherence a pre-trained word2vec [mikolov2013distributed] model of dimension 500, trained on a large Russian language corpus [ai2015evaluating, panchenko2018russe], was used as the mapping $v$, along with inverse cosine similarity as the distance function $\rho$. Area under the ROC curve (ROC-AUC) [hand2001simple, ling2003auc], where scores are given by the described measures and labels are manually assigned by human experts, is used for evaluation. The results of the experiment are shown in Table 1.

Model   Coherence (7)   tf-idf coherence (8)   Word embedding coherence (9)   PMI (10)   NPMI (11)   LCP (12)
pLSA    0.7720          0.8910                 0.8954*                        0.8675     0.8707      0.8811
LDA     0.7817          0.8748                 0.8786*                        0.8469     0.8372      0.8541
ARTM    0.7513          0.8439                 0.8543*                        0.6637     0.6973      0.7738
Table 1: Comparison results between flat topic quality measures: area under curve (AUC) between human-assessed evaluation and automatic quality measures, borrowed from [Nikolenko2016]. The best result for each model is marked with an asterisk.

We see that the word embedding coherence is in closer agreement with human assessors than the other measures. The reason for this might be that it measures paradigmatic relatedness [schutze1993vector] of tokens, as the underlying word2vec model is trained to make words with similar contexts have similar vectors, and this approximates human understanding of word similarity better than syntagmatic relatedness, which is measured by cooccurrence- and tf-idf-based measures.
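For reference, the ROC-AUC used in this comparison can be computed directly from its pairwise interpretation: it is the probability that a randomly chosen "good" topic receives a higher score than a randomly chosen "bad" one. The sketch below assumes binary labels (how the three-way human answers were binarized is not stated here) and uses made-up scores.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical quality scores of five topics and assessor labels (1 = interpretable).
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]
auc = roc_auc(scores, labels)  # 5 of the 6 positive/negative pairs are ordered correctly
```

This quadratic-time form is adequate for a few hundred assessed topics; a rank-based formula gives the same value faster on larger samples.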

5 Visualization of topic models

Since its inception, topic modeling has been successfully used for visualizing and navigating through large scientific corpora. Oftentimes, when introducing a topic model with new properties, authors provide a convenient visualization of their models. In their seminal work, Blei et al. [blei2006dynamic] proposed a dynamic topic model to track the evolution of topics over time. They demonstrated it by building a dynamic topic model over a collection of Science journal articles from 1880 through 2000. Another interesting visualization system, RoseRiver [cui2014hierarchical], provides a flow visualization of hierarchical topic models of real-world news data. A more intuitive way to navigate topical hierarchies, a Sunburst chart, is used in Hiérarchie [smith2014hiearchie], an hLDA-based framework.

Figure 1: Topic Model Visualization Engine (TMVE). On the front page, a user can choose a generic topic, then display all documents relevant to it. From an article screen, the user can also choose from relevant documents and/or relevant topics, thus easily navigating around an area of interest. Their navigator is accessible online at

Generic topical navigators for displaying flat topic models have been proposed as well. In 2012, Chaney et al. [Chaney2012] proposed the Topic Model Visualization Engine (TMVE), a user-friendly system to navigate over the English Wikipedia collection. A demonstration of its capabilities is given in Figure 1. Some systems, such as Termite [Chuang2012], LDAVis [sievert2014ldavis] or Serendip [alexander2014serendip], focus more on the internal structure of topic models than on creating a navigator over a collection of documents. Termite is a convenient tool to visualize the term-topic matrix $\Phi$ of a flat topic model. It highlights the most probable terms for each topic and introduces a new saliency measure to select the most relevant terms in a topic. A similar relevance measure is introduced in [sievert2014ldavis], along with the LDAVis system. Their system can be used both for deep analysis of each topic using its most relevant keywords and for displaying inter-topical relations by measuring topical distance and performing multidimensional scaling to a flat map. Apart from these levels of abstraction, Serendip provides an additional level of close-ups to individual passages and words. It makes use of the $\Phi$ and $\Theta$ distributions and, similar to Termite, displays the distribution of topics for each document; like RoseRiver, it also tracks the evolution of topics, but on a document scale.

Figure 2: TopicPanorama. A dense and informative representation of topic model is available on the main screen. It includes: a) level-of-detail (LOD) visualization of topic inter-relations, d) lead-lag temporal dynamics between topics, (e-f) most relevant words and documents of each topic.

The idea to align topics and/or documents on a flat map and provide a meaning for distance on the map is ubiquitous in the topic modeling visualization field. TopicPanorama [wang2016topicpanorama] uses a circular graph representation and brings together topics discussed in the same document collections. The model is built on various data sources, including news articles, tweets, and blog data. Various levels of detail are also displayed, which can be seen in Figure 2. Continuing the idea of multi-collection visualization, Oelke et al. [oelke2014comparative] present DiTop-View: a system for rich visualization of topical inter-relations. They propose topic coins as a new way to display topics suitable for comparative analysis, and several heuristic criteria to determine topics that are specific to given collections. Gansner et al. [gansner2012], in a service they called TwitterScope, work with a single Twitter source. They propose a way to build a flat map with similar topics grouped into "countries", and a way to update this map in real time as new data appears.

An interesting and relevant solution to the topic alignment problem is proposed in [fedoryakatechnology]. They formulate the problem as a search for a linear spectre of topics, or a permutation $\pi$ over topics, such that the sum of distances $\rho$ between neighboring topics in this permutation is minimal:

$$\sum_{i=1}^{|T|-1} \rho\big(t_{\pi(i)}, t_{\pi(i+1)}\big) \to \min_{\pi}. \quad (13)$$

If $\rho$ is expressed as a square matrix $M = (m_{ij})$, where $m_{ij} = \rho(t_i, t_j)$, then problem (13) can be formulated as a search for a Hamiltonian path of minimal weight in a complete weighted undirected graph with weight matrix $M$, which can be solved either exactly (which is only feasible for small $|T|$) or approximately. A reasonable choice of intrinsic (that is, only depending on the $\Phi$ matrix of a topic model) measures $\rho$, as well as usage of the polynomial time Lin-Kernighan-Helsgaun algorithm [helsgaun2000effective] to solve problem (13), are proposed. Another notable contribution of [fedoryakatechnology] is the open source VisARTM application, which supports automatic construction of topic models with required properties using the BigARTM library. Different visualization modes are available, including temporal and hierarchical visualizations and in-depth textual statistics for each document in a collection.
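To make the minimal Hamiltonian path formulation concrete, here is a brute-force sketch that solves it exactly for a handful of topics. It is not the Lin-Kernighan-Helsgaun heuristic mentioned above; the function names and the toy distance matrix are assumptions chosen for illustration.

```python
from itertools import permutations

def path_cost(order, dist):
    """Sum of distances between neighboring topics in the given order."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))

def best_spectrum(dist):
    """Exhaustive search over all topic orders; feasible only for small |T|."""
    n = len(dist)
    return min(permutations(range(n)), key=lambda order: path_cost(order, dist))

# Toy example: four topics placed on a line, distance = absolute difference.
positions = [0, 3, 1, 2]
dist = [[abs(a - b) for b in positions] for a in positions]

order = best_spectrum(dist)
cost = path_cost(order, dist)
```

The search visits all `n!` permutations, so it stops being practical at around ten topics, which is why an approximate solver is needed for realistic models.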

Visualization of topic models is a rich field of study with an abundance of recent contributions, of which we covered only a few. For a complete interactive review of modern visualization systems, see [kucher2015text]. Another survey, in the Russian language, is presented in [aysina2015review].

In Chapter 1, the research problem was stated and supported by the existing work. Probabilistic flat and hierarchical topic modeling, the area of our foremost contribution, was introduced. Major quality measures for flat topic models, which will form the basis of measures for hierarchical topic models, were described and compared. Multiple recent visualization frameworks for topic modeling were demonstrated, which supports the point of applicability of topic modeling for exploratory search engine construction.

6 Quality of topical edges

In Section 4, we surveyed classical measures of topical quality, which are in agreement with human understanding of what makes a topic good or bad. We have also shown that these measures can be expressed as an average of functions depending on the top tokens of the topic $t$.

However, a hierarchical topic model consists not only of its levels, but also of relations between topics from the neighboring levels, whereas the classical measures of the model's levels leave these dependencies completely out of consideration. Hence, they only partially depict the quality of a hierarchical model. This section aims to bridge the gap by proposing several quality measures for the "parent-child" relations between topics in a hierarchical model. We propose two kinds of quality measures, extrinsic and intrinsic. The former take advantage of word cooccurrences in external corpora, and the latter only use internal parameters of a topic model, hence the names.

6.1 Extrinsic similarity based measures

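We extend the classical evaluation scheme from Eq. 6 to a hierarchical form for an edge between a parent topic $a$ and a child topic $t$:

$$Q(a, t) = \frac{1}{N} \sum_{i=1}^{k} \sum_{j=1}^{k} \big[w_i^a \ne w_j^t\big]\, f(w_i^a, w_j^t),$$

where $f$ is a measure of cooccurrence of top tokens $w_i^a$ and $w_j^t$, and $N$ is a measure-specific norm. We refer to Section 4 and denote document frequency as $\mathrm{df}(w)$, co-document frequency as $\mathrm{df}(w_i, w_j)$, and the vector mapping from a token to a real-valued $m$-dimensional vector as $v$. Our measures are then obtained by choosing $f$ analogously to the flat measures of Section 4, based either on (co-)document frequencies or on distances between token vectors, with $N$ being the number of distinct top token pairs.

To estimate (co-)document frequencies and to train the mapping $v$, a large external corpus (e.g. Wikipedia or Twitter) is usually used.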

6.2 Intrinsic similarity based measures

Another option for comparing parent and child topics is provided by a topic model itself: we can compare them as probability distributions. Two standard similarity measures for distributions $P$ and $Q$ are the Hellinger distance, $H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{w} (\sqrt{p_w} - \sqrt{q_w})^2}$, and the Kullback-Leibler divergence, $KL(P \,\|\, Q) = \sum_{w} p_w \log \frac{p_w}{q_w}$. The first one is a bounded measure and can be interpreted as a distance between two topics in some space. The second is an unbounded asymmetric measure and can be interpreted as "how much information will be lost if we substitute parent topic P with some child topic Q".
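Both intrinsic measures are straightforward to compute from two columns of the $\Phi$ matrices. The sketch below uses made-up token distributions for one parent and two candidate child topics; the `eps` floor inside the KL divergence is an assumption added to avoid division by zero when a child assigns zero probability to a token.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions, bounded by [0, 1]."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q); terms with p_w = 0 contribute nothing, q_w is floored at eps."""
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)

# Hypothetical distributions over a four-token vocabulary.
parent = [0.5, 0.3, 0.2, 0.0]
child_close = [0.6, 0.3, 0.1, 0.0]
child_far = [0.0, 0.0, 0.1, 0.9]

h_close = hellinger(parent, child_close)
h_far = hellinger(parent, child_far)
kl_forward = kl_divergence(parent, child_close)
kl_backward = kl_divergence(child_close, parent)
```

The child that shares most of its probability mass with the parent is closer in Hellinger distance, and swapping the arguments of the KL divergence changes its value, which illustrates the asymmetry mentioned above.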


7 Quality of topical hierarchy

The goal now is to combine the edge measures into a single construction that would serve as a representative quality score for a hierarchy as a whole.

Hereafter we work with a normalized version of the matrix $\Psi$. Normalization allows us to apply a shared topic-agnostic threshold and to rank all values of the matrix on the same scale.

7.1 Averaging quality

In the spirit of [Mimno2011], where the average coherence was used as a model quality measure, let us consider the average edge quality as a quality measure for a hierarchical model:

$$Q_{\text{avg}}(\gamma) = \frac{1}{|E_\gamma|} \sum_{(a, t) \in E_\gamma} Q(a, t), \quad E_\gamma = \{(a, t) : \psi_{ta} \ge \gamma\}.$$

The particular hierarchy configuration depends on the chosen threshold $\gamma$ for $\psi_{ta}$, which determines what probability is sufficient to include an edge connecting $a$ and $t$ into the hierarchy. Therefore different thresholds lead to different values of the quality measure.
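The dependence of the averaged score on the threshold can be seen on a small example. This is a minimal sketch: the matrices, the function name and the use of a non-strict comparison with the threshold are assumptions.

```python
def average_edge_quality(psi, quality, threshold):
    """Mean quality of the edges whose normalized link weight reaches the threshold.

    psi[t][a] is the normalized weight of the link between child t and parent a,
    quality[t][a] is the score of that edge under some edge quality measure.
    Returns (mean, edges); mean is None when no edge passes the threshold.
    """
    edges = [(t, a) for t, row in enumerate(psi)
             for a, weight in enumerate(row) if weight >= threshold]
    if not edges:
        return None, edges
    return sum(quality[t][a] for t, a in edges) / len(edges), edges

# Hypothetical model with 3 child topics and 2 parent topics.
psi = [[0.9, 0.1],
       [0.2, 0.8],
       [0.5, 0.5]]
quality = [[0.7, 0.1],
           [0.2, 0.9],
           [0.4, 0.6]]

loose_mean, loose_edges = average_edge_quality(psi, quality, threshold=0.5)
strict_mean, strict_edges = average_edge_quality(psi, quality, threshold=0.8)
```

Raising the threshold keeps only the two strongest links here, and the average quality rises accordingly, so scores are comparable only between hierarchies built with the same threshold.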

7.2 Ranking quality

Another approach to form a quality measure with an interpretable value is to consider the process of establishing a hierarchy as a ranking process. Suppose that we have built a model, i.e. we have matrices $\Phi$, $\Theta$, and $\Psi$ for each level. It would be natural to accept only the edges that are most meaningful from a human's point of view. As our edge measures turned out to be good approximators of the assessors' judgment, we can choose only the edges with the top scores of some fixed measure. If our model is "good", then the top-$k$ scored edges (let us call them "the request") should match the top-$k$ maximal elements of $\Psi$ (let us call them "the response"). The difference between the request and the response for each $k$ was evaluated by common ranking measures, such as:

  • Average Precision @ $k$, described in [li2011short],

  • Normalized Discounted Cumulative Gain (NDCG) @ $k$, described in [li2011short],

  • Inverse Defect Pairs (Inverse DP) @ $k$, the inverse value of the number of pairs that appear in the wrong order (i.e. are reversed) in the response.
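The three ranking measures listed above can be sketched as follows. The exact variants used in the experiments are not specified here, so this example makes several assumptions: relevance is binary (an edge is relevant if it belongs to the request), Average Precision is normalized by `min(k, |request|)`, and Inverse DP is reported as infinity when no pair is reversed. Edge names and scores are made up.

```python
import math

def top_k(scores, k):
    """Edges sorted by decreasing score, truncated to the first k."""
    return sorted(scores, key=lambda edge: -scores[edge])[:k]

def average_precision_at_k(response, relevant, k):
    hits, total = 0, 0.0
    for i, edge in enumerate(response[:k], start=1):
        if edge in relevant:
            hits += 1
            total += hits / i  # precision at the position of each hit
    return total / min(k, len(relevant)) if relevant else 0.0

def ndcg_at_k(response, relevant, k):
    dcg = sum(1.0 / math.log2(i + 1)
              for i, edge in enumerate(response[:k], start=1) if edge in relevant)
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0

def inverse_defect_pairs(response, quality):
    """Inverse of the number of response pairs ordered against the quality scores."""
    defects = sum(1 for i in range(len(response)) for j in range(i + 1, len(response))
                  if quality[response[i]] < quality[response[j]])
    return math.inf if defects == 0 else 1.0 / defects

# Hypothetical (parent, child) edges with quality scores and link weights.
quality = {("a", "x"): 0.9, ("a", "y"): 0.7, ("b", "y"): 0.6, ("b", "z"): 0.2}
psi = {("a", "x"): 0.8, ("a", "y"): 0.3, ("b", "y"): 0.5, ("b", "z"): 0.6}

k = 2
request = set(top_k(quality, k))   # edges a human would most likely accept
response = top_k(psi, k)           # edges the model considers strongest
ap = average_precision_at_k(response, request, k)
ndcg = ndcg_at_k(response, request, k)
inv_dp = inverse_defect_pairs(top_k(psi, 3), quality)
```

In this example the model's second strongest edge is the one with the lowest quality score, which is what lowers all three values.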

8 Heterogeneous topic modeling

Let us denote by $\mathrm{hARTM}(D, M_0)$ an iterative algorithm which constructs a hierarchical hARTM model over a collection $D$ with an initial approximation $M_0$. We propose two meta-algorithms, concat and heterogeneous, which re-use the $\mathrm{hARTM}$ procedure to build a hierarchical topic model for heterogeneous collections $D_1, \dots, D_n$.

8.1 concat algorithm

In this algorithm, we simply concatenate all document collections and build a topic model upon the result.

  1. Concatenate all collections: $D = D_1 \cup \dots \cup D_n$;

  2. Build an hARTM model $M$ over $D$;

  3. Return $M$.

8.2 heterogeneous algorithm

In this algorithm, without loss of generality, we denote the first document collection $D_1$ as the base collection. Collections $D_2, \dots, D_n$ will be denoted as new collections. Additionally, let $M_0$ denote the initial model, built upon the base collection and provided as a parameter. Our algorithm works in an iterative fashion, improving the current model $M_i$ on each iteration.

  1. Initialize the train set: $D = D_1$;

  2. For each iteration $i = 1, 2, \dots$:

    1. Form a new small batch $B_i$ of documents from the new collections;

    2. Add the new batch to the train set: $D \leftarrow D \cup B_i$;

    3. Build an hARTM model $M_i = \mathrm{hARTM}(D, M_{i-1})$;

  3. Return the final model.
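The control flow of the two meta-algorithms can be sketched as below. The `build` callable stands in for the hARTM training procedure and is hypothetical, as are the `Model` placeholder type, the `batch_size` parameter and the way batches are formed (consecutive slices of the new documents). The toy builder only records how many documents it saw and how many times the model was refined.

```python
from typing import Callable, List, Optional, Sequence

Doc = str
Model = dict  # placeholder for a hierarchical topic model
Builder = Callable[[List[Doc], Optional[Model]], Model]

def concat(collections: Sequence[Sequence[Doc]], build: Builder) -> Model:
    """Baseline: train one model on the union of all collections."""
    train = [doc for coll in collections for doc in coll]
    return build(train, None)

def heterogeneous(collections: Sequence[Sequence[Doc]], base_model: Model,
                  build: Builder, batch_size: int = 2) -> Model:
    """Start from a model of the base collection and add new documents in small batches."""
    train = list(collections[0])
    pending = [doc for coll in collections[1:] for doc in coll]
    model = base_model
    for start in range(0, len(pending), batch_size):
        train.extend(pending[start:start + batch_size])
        model = build(train, model)  # previous model is the initial approximation
    return model

def toy_build(train: List[Doc], init: Optional[Model]) -> Model:
    """Stand-in for hARTM training, used only to trace the calls."""
    return {"n_docs": len(train), "updates": (init or {}).get("updates", 0) + 1}

collections = [["d1", "d2", "d3"], ["n1", "n2"], ["m1"]]
base = toy_build(list(collections[0]), None)
m_concat = concat(collections, toy_build)
m_hetero = heterogeneous(collections, base, toy_build, batch_size=2)
```

Both variants end up training on the same six documents, but the heterogeneous variant reaches them through a chain of models, each initialized with the previous one, which is what lets the structure of the base model carry over.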

9 Exploratory search engine

To demonstrate the quality of the heterogeneous algorithm, as well as to prove the applicability of hierarchical topic modeling for creating exploratory search engines, we have implemented the Rysearch system. It is a web application with rich visualization, searching and article navigation facilities.

In the following subsections we will discuss its three key components, namely client-server architecture, user interface, and user experience, in greater detail.

9.1 Client-server architecture

Figure 3: Client-server architecture of Rysearch.

Rysearch has been designed as an experimental system, with the potential to scale for real life scenarios. Scaling, in this case, means being able to put up with high work load from two ends: larger streams of data coming from multiple heterogeneous sources, and increasing number of simultaneous end-users. In such setup, application design should be highly modular, with small independent blocks instead of a single monolith architecture. Our server side meets these demands in the following ways.

Firstly, its storage facilities make heavy use of MongoDB database and its features, such as advanced full-text indexing and extensible structure of MongoDB documents.

Secondly, its backend architecture is designed hierarchically, with ARTM proxy module receiving frontend queries and balancing them evenly between workers, or ARTM bridges (these modules are called bridges as they were designed as thin clients that provide a ‘‘bridge’’ between BigARTM library and frontend servers). If one ARTM bridge becomes idle, it receives a new portion of work in a round-robin fashion. If an ARTM bridge stops responding to periodic ping queries, it is considered faulty and is excluded from round robin. Inside ARTM bridge are two database connectors, DataSource and Model, which handle requests to MongoDB and ARTM model databases, respectively. The former is used for displaying documents and full text searching. The latter is used for displaying hierarchical topic map, retrieving documents of a particular topic and for document searching.
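The dispatching policy described above (round-robin over workers, with unresponsive workers excluded) can be sketched in a few lines. This is an illustrative model only, not the Rysearch implementation: the class and method names are hypothetical, and networking, pings and timeouts are replaced by an explicit `report_ping` call.

```python
class RoundRobinProxy:
    """Hands out workers in a fixed cyclic order, skipping those marked faulty."""

    def __init__(self, bridges):
        self.bridges = list(bridges)
        self.alive = set(self.bridges)
        self.next_index = 0

    def report_ping(self, bridge, responded):
        """Exclude a bridge from the rotation once it fails to answer a ping."""
        if not responded:
            self.alive.discard(bridge)

    def pick(self):
        """Return the next healthy bridge in round-robin order."""
        for _ in range(len(self.bridges)):
            bridge = self.bridges[self.next_index]
            self.next_index = (self.next_index + 1) % len(self.bridges)
            if bridge in self.alive:
                return bridge
        raise RuntimeError("no healthy ARTM bridge available")

proxy = RoundRobinProxy(["bridge-0", "bridge-1", "bridge-2"])
before = [proxy.pick() for _ in range(4)]
proxy.report_ping("bridge-1", responded=False)  # bridge-1 stops answering
after = [proxy.pick() for _ in range(3)]
```

Whether a recovered bridge is re-admitted to the rotation is not stated in the text, so this sketch leaves it excluded.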

Thirdly, all backend and frontend modules of Rysearch are implemented as microservices, that is, they are loosely connected via ZeroMQ [hintjens2013zeromq], or ØMQ, a protocol for asynchronous messaging without a dedicated message broker. This architectural style is associated with increased deployability, modifiability and resilience to design erosion, according to [chenmicroservices].

Being a microservice-based application, Rysearch also benefits from the possibility to write services in different programming languages, as well as to choose the most suitable framework for each service. As such, all backend services are written in Python, using the pyzmq library and the Python wrapper for BigARTM. The frontend service is written in JavaScript, based on the node.js platform with the node package manager (npm) and the express.js framework for serving client requests (using the GET and POST methods of the HTTP protocol).

The main benefit of the node.js / express.js framework is that it was designed to serve requests in an asynchronous fashion using a single-threaded non-blocking event loop. This is very useful for building fast applications with high throughput. When the main screen of Rysearch is being loaded, a single query ("show me the topic map") is sent to the frontend, which serves the reply from its internal cache, without sending queries to the backend. All the heavy computation that cannot be pre-computed is forwarded to backend workers using a proxy, as we described earlier, but when possible, responses are served by fast and high-throughput frontend servers. Another benefit of these frameworks is the ability to communicate asynchronously with the backend, using the zmq binding for node.js, without any hassle. If we served user queries in a synchronous manner, say using a thread pool of workers, we would have to do extra work to synchronize communication between our frontend and the backend proxy, which could be error-prone and unstable.

The client side is written in JavaScript, uses the Bootstrap and jQuery libraries, and is maintained with the Bower package manager. To communicate with the server side, it sends asynchronous JSON-formatted (AJAX) requests and asynchronously waits for the responses in the same format. All messages, except an incoming document upload, are sent with the GET method. Typically, a frontend server either sends the response back (using a previously cached response from the backend), or forwards the client's request (with little to no preprocessing). In the latter case, a client ID is put into the message queue and a message is sent to the backend proxy. When a reply from the proxy arrives (or when the timeout of 1 minute is reached), the corresponding client ID is taken from the message queue and the backend's response (again, with little to no postprocessing) is sent to the client. Uploading of a document (for document search, which we describe later) from a client is done using XMLHttpRequest level 2 [xhr2], or XHR2, which provides an API to transfer files from a client to a frontend server. The file is then stored in a temporary disk storage, shared between frontend and backend servers, and a link to it is sent via the ØMQ channel. It is inconvenient to set up a shareable storage between servers just for uploading files, and in future versions we aim to come up with a different scheme for processing file uploads.

9.2 User interface

Rysearch is a single-page web application, meaning that it loads a single HTML page and dynamically updates the page as the user interacts with the app. In each state of user interaction, the application can display one of two screens, or views: map view or document view.

Map view
Figure 4: The map view of Rysearch user interface. On top of the page there are (a) navigation bar and (b) search widget, where users type in text queries or upload documents. Most of the page is occupied by (c) hierarchical map made out of tiles, which represent topics and their respective subtopics.

On this screen, a user is presented with an overview of a whole document collection of an exploratory search engine. Documents are organized into topics, and topics are organized into sub-topics, in a one-to-many fashion: each document or sub-topic can be part of several sub-topics and topics, respectively. To find information of interest, a user can either zoom in to a desired level of detail (see Figure 5), or use a search widget. We will explore searching facilities later, but for now we note that the search can go either by typing text queries, as in classical search engines, or by uploading a document, which is specific to topic model-based search.

If, after performing a search or after exploring some topic in great depth, a user wishes to go back, he can do so by following breadcrumbs on a navigation bar. As a user opens a new tile, searches something, or opens a document, a new link is created there to the previous state on the map to allow to go back in the exploration process.

To display an interactive topical map, we use the client-side FoamTree visualization library. It takes tree-like structures (such as a topic-subtopics-documents structure, where one-to-many relations are duplicated across the levels of hierarchy) as an input and produces an interactive tiled map by iteratively constructing hierarchical Voronoi diagrams, embedded either into the whole screen or into cells of a higher level. As FoamTree allows specifying approximate positions of high-level cells on the map by submitting an initial ordering of Voronoi tiles, we make use of linear spectres, discussed in Section 5, to bring similar high-level topics as close as possible on the map.

Figure 5: Different levels of detail of topical cells on the map. By default, each topic is seen in the state (a), which is a superficial view of top-3 tags associated with each top-level topic. By clicking on the cell this view is changed into (b), where more detailed view of top-3 tags of each sub-topic is displayed. Another click changes cell into state (c), where we see top-10 documents of a selected sub-topic. By clicking on (…) tile, this cell can be infinitely deepened, resulting in an ‘‘infinite scroll’’, which is seen at the bottom sub-topic of a cell (d).
Document view
Figure 6: The document view of Rysearch user interface. There are (a) document title and author, as well as an indicator of which source this document comes from. For collections where documents are tagged by authors we display (b) the original tags, as well as top-5 recommended tags. Below is (c) document text, and to the right are (d) top-5 recommended similar documents.

On this view, a document meta-information, contents, and suggestions are presented. Meta-information includes the title, the author’s name, and the icon representing the specific collection the document belongs to. Contents is a textual representation of a document, without any images or markup. Suggestions include suggested tags and suggested documents. The former can be helpful for users, who wish to perform a full-text search on Rysearch or elsewhere, but don’t know which specific keywords to use. The latter provides another, horizontal level of navigation: suppose a user wants to explore documents similar to the given, but does not want to go up and down cells on the map view. Then he or she can use suggested documents as an alternative way of exploring the area of interest.

9.3 User experience

In this subsection, we explore a set of typical goals users may have while using the proposed search engine, and demonstrate, by referring to the user screens, that they indeed achieve these goals. We present these demonstrations in Figures 7, 8, 9, and 10.

Figure 7: Scenario 1: a user explores a known area of interest. A set of sub-topics and documents most relevant to a topic appears as they click on that topic. If they are dissatisfied with the presented results, they go deeper into the (sub-)topic, which is possible because of the ‘‘infinite scroll’’, or wider to other topics, as similar topics are placed closer on the map thanks to linear spectres. Therefore, to explore an area of interest, either in depth or superficially, a user does not typically have to jump between distant parts of the map or between different browser windows: a typical search is concentrated in a small region.
Figure 8: Scenario 2: a user wishes to discover areas of interest by typing in a text query. As they type the search query, the map is dynamically rebuilt, thanks to the asynchronous architecture and the dynamic nature of the FoamTree visualization library. A set of the most important topics (up to 5 of them) is highlighted by tile size, with an additional indicator of the number of documents inside each topic that are relevant to the exact text query. The user can thus see the relevant areas even before they finish typing their query. They can also read specific documents that matched the query, as these documents are highlighted in bold in the ‘‘infinite scroll’’ view on the map.
Figure 9: Scenario 3a: a user wishes to discover areas of interest by uploading a document which superficially discusses multiple topics. Suppose a user wants to find out what the International laboratory ‘‘Computer technologies’’ is by uploading its news web page to Rysearch. The resulting highlighted topics discuss (a) international affairs of countries, (b) science and education, and (c) information technology. The user, in this case, will gain a correct understanding of the laboratory in a matter of seconds.
Figure 10: Scenario 3b: a user wants to discover areas of interest by uploading a scientific article which makes a focused contribution to a few scientific fields. They upload a paper [aleksandrov2012method], which contributes to the field of bioinformatics by using methods from mathematics and computer science. Again, in this case, the engine was able to highlight all major areas of interest, without requiring the user to even open the article.

In Chapter 2, we introduced quality measures for topical edges of hierarchical models, arguing that previously proposed measures only partially capture the quality of HTMs. Since these measures need to be aggregated to obtain a single quality score of a hierarchical model, we proposed two approaches to such aggregation: averaging and ranking. The former is simpler and more intuitive, while the latter is more interpretable. We also proposed two algorithms for building a topic model over heterogeneous sources, concat and heterogeneous. Finally, we described in detail an exploratory search engine, which makes use of the proposed algorithm to build an interpretable hierarchical map of popular scientific topics.

In the following sections we show the agreement between the proposed measures and human judgement on topical edge quality, and compare the proposed algorithms using these measures.

10 Datasets

To construct ‘‘parent-child’’ topic pairs for human annotation, we trained three two-level hierarchical topic models on three datasets:

  • Postnauka, a popular scientific website with edited articles on a wide spectrum of topics, focusing on human sciences,

  • Habrahabr and Geektimes, social blogging platforms specializing in Computer science, engineering and IT entrepreneurship,

  • Elementy, a popular scientific website with a particular focus on life sciences.

Dataset     Documents   Unique words   Unique tags   Parent topics   Child topics
Postnauka   2976        43196          1799          20              58
Habrahabr   81076       588400         77102         6               15
Elementy    2017        40452          -             9               25
Table 2: Dataset parameters: the collection size (number of documents), the number of unique words in the collection, the number of unique tags, the number of first-level (parent) topics, and the number of second-level (child) topics.

The first two collections, Postnauka and Habrahabr, were also used to construct the search index for Rysearch. Detailed characteristics of these datasets and the corresponding topic models are provided in Table 2.

To provide a mapping from words to vectors for the EmbedSim measure (Eq. 15), we used a pre-trained RusVectores [kutuzov2016webvectors] model, which was trained on the external Russian National Corpus and Russian Wikipedia (600 million tokens, resulting in more than 392 thousand unique word embeddings). To measure document frequency and co-document frequency for the CoocSim measure (Eq. 15), we used the Habrahabr and Postnauka collections, as co-document frequency is difficult to compute on larger external corpora.
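The two statistics mentioned above, document frequency and co-document frequency, can be counted with a short sketch like the one below. It only illustrates the counting step; the function name `document_frequencies` and the toy documents are hypothetical, and the way the counts enter the CoocSim formula is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def document_frequencies(documents):
    """Count document and co-document frequencies.

    documents: iterable of token lists (bag-of-words documents).
    Returns (df, codf), where df[w] is the number of documents containing w
    and codf[(u, v)] with u < v is the number of documents containing both.
    """
    df = Counter()
    codf = Counter()
    for doc in documents:
        tokens = sorted(set(doc))  # each token counts once per document
        df.update(tokens)
        codf.update(combinations(tokens, 2))
    return df, codf


# Toy collection, for illustration only.
docs = [
    ["gene", "protein", "cell"],
    ["gene", "cell", "cell"],
    ["protein", "server"],
]
df, codf = document_frequencies(docs)
print(df["gene"], codf[("cell", "gene")])
```

The number of counted pairs grows quadratically with the number of distinct tokens in a document, which is consistent with the remark that co-document frequency is hard to compute on large external corpora.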

11 Assessment experiment

11.1 Assessment task statement

In the assessment experiment, conducted on the Yandex.Toloka crowdsourcing platform, the participating experts were asked the following question: ‘‘given a pair of topics, $t_1$ and $t_2$, decide whether one is a subtopic of the other’’. Possible answers were: ‘‘$t_1$ is a subtopic of $t_2$’’, ‘‘$t_2$ is a subtopic of $t_1$’’ and ‘‘these topics are not related’’. Each topic was represented by the 10 top tokens of its probability distribution over tokens.

After the experiment was finished, the first two answers were grouped to denote a single answer ‘‘these topics are somehow related’’ as it was often difficult for assessors to distinguish between a parent and a child given their top tokens.

11.2 Quality control

The workers were selected from the pool of top-50% Yandex.Toloka assessors, a ranking that takes into account the ratings received in all annotation jobs previously completed by an assessor. Selected workers were required to undergo training before entering the experiment. The training consisted of 22 pairs of topics, which we labelled manually.

Experts could skip some tasks if they were not sure. To ensure qualified responses during the experiment, we banned workers who skipped more than 10 tasks in a row. To ensure the diversity of responses, we additionally allowed each worker to annotate no more than 125 edges.
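The two rules above (a ban after more than 10 skipped tasks in a row, and a cap of 125 annotated edges per worker) can be expressed as a small check over a worker's task log. This is a minimal sketch under our own assumptions: the function name `review_worker`, the log format, and the status strings are hypothetical, and the actual checks were configured on the crowdsourcing platform rather than coded this way.

```python
MAX_CONSECUTIVE_SKIPS = 10   # more skips in a row than this leads to a ban
MAX_EDGES_PER_WORKER = 125   # cap on the number of annotated edges
EDGES_PER_TASK = 5           # one task consists of annotating 5 edges


def review_worker(task_log):
    """Apply the quality-control rules to one worker.

    task_log: list of booleans in submission order, True meaning the task
        was skipped and False meaning it was completed.
    Returns (status, edges_annotated).
    """
    skips_in_a_row = 0
    edges = 0
    for skipped in task_log:
        if skipped:
            skips_in_a_row += 1
            if skips_in_a_row > MAX_CONSECUTIVE_SKIPS:
                return "banned", edges
        else:
            skips_in_a_row = 0
            edges += EDGES_PER_TASK
            if edges >= MAX_EDGES_PER_WORKER:
                return "limit reached", edges
    return "active", edges


print(review_worker([True] * 11))
print(review_worker([False] * 30))
```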

Upon successful completion of a task (which consisted of annotating 5 edges) each worker received $0.02, or $2.4 per hour on average.

11.3 Experiment results

Overall, 68 trusted workers participated in our study, each contributing around 100 assessed topical pairs. Assessing one pair of topics, given their 10 top tokens, took a participant around 5 seconds on average. Each topic pair was evaluated by five different experts, which gave us 6750 expert annotations for 1350 unique pairs.

Our participants were mainly from Russia and Ukraine, with age varying from 21 to 64 years.

Figure 11: Probability distributions of the proposed measure scores for ‘‘bad’’ and ‘‘good’’ edges from the assessment task.
Agreed assessors Edge count Edge percentage
3 374 27.7%
4 468 34.7%
5 508 37.6%
Table 3: Inter-assessor agreement. For each pair of topics, we calculate how many assessors made the same verdict (that the topics from the pair are related or that they are not). For 5 assessors per pair, there is always a majority decision, but it can be reached by either 3, 4, or 5 assessors. In the second and the third column we show the quantity and the percentage of the edges with the number of agreed assessors from the first column.
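The agreement count described in the caption of Table 3 can be reproduced with the following sketch. The answer strings and the helper names are hypothetical; the edge counts in `table3` are taken from Table 3 and are used only to recompute the percentages.

```python
from collections import Counter


def merge_answer(answer):
    """Collapse both 'subtopic' answers into a single 'related' verdict."""
    return "unrelated" if answer == "unrelated" else "related"


def agreement(answers):
    """Return (majority verdict, number of assessors who gave it)."""
    counts = Counter(merge_answer(a) for a in answers)
    return counts.most_common(1)[0]


# Five hypothetical answers for one pair of topics.
votes = ["t1 in t2", "t2 in t1", "unrelated", "t1 in t2", "unrelated"]
print(agreement(votes))  # ('related', 3)

# Recompute the percentages of Table 3 from its edge counts.
table3 = {3: 374, 4: 468, 5: 508}
total = sum(table3.values())
shares = {k: round(100 * v / total, 1) for k, v in table3.items()}
print(total, shares)
```

With five assessors and two possible verdicts a tie is impossible, so the majority verdict is always defined.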

If many people think that there is a ‘‘topic – subtopic’’ relation between two particular topics in a model, a good measure should give a high score to such a pair of topics. In this case we say that the measure approximates the assessors’ opinion. Moreover, we want the measure to order the model edges consistently with this statement: the more people agree that the relation exists, the higher the measure score should be. To check whether the proposed measures satisfy this constraint, consider the following classification problem. Let us define the assessors’ judgment on an edge as follows: if at least 4 of 5 assessors agreed that the edge exists in the hierarchy, the judgment is equal to 1 (the edge is ‘‘good’’), and to -1 (the edge is ‘‘bad’’) otherwise. Table 3 shows that for 72.3% of the assessed topical edges (those on which 4 or 5 assessors agreed) we can determine whether they are ‘‘bad’’ or ‘‘good’’. Let the edges of a hierarchical model be the objects: the positive and negative classes consist of the edges with a positive and a negative assessors’ opinion respectively. Let the classifier based on the measure be the following:


$$a(t_p, t_c) = \operatorname{sign}\bigl(\mu(t_p, t_c) - w\bigr),$$

where $t_p$ and $t_c$ are the topics from the parent and child levels of the model respectively, $\mu$ is one of the proposed measures, and $w$ is a margin of the classifier. With the classifier written in this form, we can calculate ROC AUC for each classifier by varying $w$ and thus estimate the quality of each measure: better approximators are expected to have better scores.
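As an illustration of this evaluation, the sketch below computes ROC AUC for a threshold classifier directly from measure scores and assessor labels. It uses the rank interpretation of ROC AUC (the probability that a random ‘‘good’’ edge receives a higher score than a random ‘‘bad’’ one), so no particular margin has to be fixed. The scores and labels are made up for the example, and the function names are hypothetical.

```python
def classify(score, w):
    """Threshold classifier: 1 ('good') if the score exceeds w, else -1."""
    return 1 if score > w else -1


def roc_auc(scores, labels):
    """ROC AUC as the share of (good, bad) pairs ranked correctly.

    Ties between a good and a bad edge count as half a correct pair.
    """
    good = [s for s, y in zip(scores, labels) if y == 1]
    bad = [s for s, y in zip(scores, labels) if y == -1]
    correct = 0.0
    for g in good:
        for b in bad:
            if g > b:
                correct += 1.0
            elif g == b:
                correct += 0.5
    return correct / (len(good) * len(bad))


# Hypothetical measure scores and assessor judgments for six edges.
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
labels = [1, 1, 1, -1, -1, -1]
print(roc_auc(scores, labels))
print([classify(s, 0.5) for s in scores])
```

The pairwise loop is quadratic in the number of edges, which is acceptable for a few thousand edges; `sklearn.metrics.roc_auc_score` computes the same quantity if scikit-learn is available.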

Measure Score
EmbedSim 0.878
CoocSim 0.815
KLSim 0.790
HellingerSim 0.766
Table 4: ROC AUC scores for the proposed measures.

Table 4 presents the ROC AUC score for each classifier. The best classification quality was demonstrated by the classifier based on the EmbedSim measure (AUC = 0.878). The other measures demonstrated moderate yet acceptable consistency with the assessors’ opinion: all of their AUC values were above 0.75.

Figure 11 helps to interpret this result. In each graph the red line is the density of the measure value for bad edges, and the green one is the same for good edges. The better a single vertical line (a threshold on the measure value) separates bad edges from good ones, the better the measure is. In further experiments we use the EmbedSim measure, as it demonstrated the best consistency with the assessors’ judgment.

Figure 12: Topics and their subtopics from the assessment task scored with the EmbedSim measure as hierarchy edges. Each topic or subtopic is represented by its 10 top tokens.

To understand how these measures work, let us consider an example. We are given six ‘‘parent-child’’ pairs of topics that were assessed by humans. Three of them are labeled as ‘‘good’’ (there is a semantic similarity between parent and child), and the other three are labeled as ‘‘bad’’ (little or no similarity). In Figure 12 one can see these pairs on the right, along with their scores given by the EmbedSim measure. The higher the score, the more confident the measure is. On the left there is a distribution of all the edges from the assessment task described in Section 11.1. Y-coordinates of points are assigned according to the measure score, and colors are set according to the assessors’ verdicts.

12 Comparison of concat and heterogeneous algorithms

12.1 Averaging measures

Figure 13 depicts the values of the averaging quality for all possible values of the threshold. The heterogeneous model has a higher score than the less elaborate concat model no matter what threshold was set. However, the value of this measure (the Y-coordinate of the curves in Figure 13) is hard to interpret.

Figure 13: Averaging quality for the EmbedSim edge measure. The considered models (concat and heterogeneous) are described in Section 8.

12.2 Ranking measures

Figure 14: The ranking quality for the EmbedSim edge measure. The considered models (concat and heterogeneous) are described in Section 8.

Figure 14 shows that in all cases the ranking quality scores are, again, higher for the heterogeneous model. One may interpret this result as follows: if a model is ‘‘good’’, then its top-$k$ edges should match the top-$k$ edges of the measure precisely enough, no matter what $k$ was set. According to Figure 14 this holds for all ranking measures, but the biggest gap was given by Average Precision. Hence, if one wants to compare the quality of two different hierarchies, the advice may be the following: take Embedding similarity (EmbedSim) as the edge measure and plot the Average Precision @ $k$ graph. The better model will be the one having better score(s) at the desired value(s) of $k$.

There is also a notable advantage of the ranking approach over the averaging approach: it allows choosing the optimal number of hierarchical edges in the model, which we discuss in the following subsection.
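For readers unfamiliar with the metric, the sketch below computes one common variant of Average Precision at a cutoff from a ranked list of binary relevance flags. It is not necessarily the exact definition used in the earlier chapter: here precision is averaged over the relevant positions inside the cutoff, and how the flags are derived from the model and the measure rankings is left as an assumption. The function name and the sample flags are hypothetical.

```python
def average_precision_at_k(relevance, k):
    """Average Precision over the first k positions of a ranked list.

    relevance: list of 0/1 flags, ordered from the highest-ranked edge down.
    Precision is averaged over the positions (within the first k) that hold
    a relevant edge; the result is 0.0 if there are none.
    """
    hits = 0
    precision_sum = 0.0
    for position, relevant in enumerate(relevance[:k], start=1):
        if relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


# Hypothetical relevance flags for five ranked edges.
ranked = [1, 1, 0, 1, 0]
curve = [average_precision_at_k(ranked, k) for k in range(1, len(ranked) + 1)]
print(curve)
```

Evaluating the function for every cutoff, as in `curve`, gives the kind of graph that can be compared between two hierarchies at the desired cutoffs.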

12.3 Automated quality improvement

Figure 15: (left) A subset of first level (parent) topics of a hierarchy with their respective subtopics. Each topic or subtopic is represented by its 10 top tokens. The presented edges are divided into ‘‘good’’ (empty circles), ‘‘bad’’ (filled circles) and ‘‘moderate’’ (semi-filled circles). (right) The hierarchy from the left with the same first level (parent) topics presented. The new subtopics are chosen through the ranking approach to edge selection.

The left side of Figure 15 demonstrates a subset of the parent topics of the concat model with their child topics. According to our method, we plotted an inverse DP@$k$ graph for the EmbedSim measure of the edges of this model (see Section 12.2), found the value of $k$ at which it attains its maximum, and built a new hierarchy that contained only the top-$k$ edges. The right side of Figure 15 demonstrates how the quality of the same model increased without rebuilding the model itself. One can see that the new hierarchy looks more consistent and elaborate in comparison with the previous one.
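The edge selection step described above can be sketched as follows: rank the edges by their measure score, find the cutoff at which a ranking-quality curve is maximal, and keep only the edges above that cutoff. The function name `prune_edges`, the sample edges, and the curve values are hypothetical; the curve is passed in as plain numbers because its computation is defined elsewhere in the paper.

```python
def prune_edges(edge_scores, quality_curve):
    """Keep the top-k edges, where k maximizes the given quality curve.

    edge_scores: dict mapping (parent, child) to the edge measure score.
    quality_curve: list whose element at index k - 1 is the ranking quality
        at cutoff k (one value per possible cutoff).
    Returns the list of kept edges, best first.
    """
    ranked = sorted(edge_scores, key=edge_scores.get, reverse=True)
    cutoffs = range(1, len(quality_curve) + 1)
    # On ties, max() returns the smallest cutoff with the maximal value.
    best_k = max(cutoffs, key=lambda k: quality_curve[k - 1])
    return ranked[:best_k]


# Hypothetical edge scores and quality curve, for illustration only.
edge_scores = {
    ("biology", "genetics"): 0.82,
    ("biology", "servers"): 0.11,
    ("physics", "quantum optics"): 0.74,
    ("physics", "cooking"): 0.20,
}
quality_curve = [0.5, 0.9, 0.6, 0.4]
print(prune_edges(edge_scores, quality_curve))
```

Because only the set of edges changes, the underlying topic model does not need to be retrained.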

In Chapter 3, we conducted experiments on three Russian popular scientific datasets to validate that the measures proposed in the previous chapter are in agreement with human judgement on topical edge quality. We also compared the algorithms for heterogeneous topic modeling defined in Section 8 and found that the more elaborate heterogeneous algorithm outperforms the basic concat approach. Finally, we presented an empirical method that improves the quality of already built models using the ranking quality approach.

In this work, we proposed several automated measures for ‘‘parent-child’’ relations of a topic hierarchy. We showed that the EmbedSim measure based on word embeddings reaches significant consistency with the assessors’ judgment on whether the connection between topics exists or not. Other measures demonstrated moderate yet acceptable consistency and can also be used in conjunction with EmbedSim.

We also proposed two approaches for measuring the quality of a hierarchy as a whole. Using measures of edge quality, we examined the averaging and ranking approaches to building an aggregated quality measure, and showed that better models reach higher scores than less elaborate models.

Finally, we demonstrated several applications of the proposed framework. First and foremost, we implemented an exploratory search engine, Rysearch, which uses hierarchical topic modeling and the proposed algorithm to visualize popular scientific topics in a hierarchical cell map. Another example is the use of the proposed ranking approach for choosing the optimal set of edges to be included in a hierarchy.

Our work extends existing quality measures from flat topic models to hierarchical ones which, to the best of our knowledge, has not been done before. The results were presented at the 24th International Conference on Computational Linguistics and Intellectual Technologies (Dialogue’2018) and at the 60th Scientific Conference of MIPT, winning the best paper award in the latter.

We can thus conclude that the research goal has been achieved and all research tasks have been accomplished.