HiTR: Hierarchical Topic Model Re-estimation for Measuring Topical Diversity of Documents

10/12/2018
by Hosein Azarbonyad, et al.

A high degree of topical diversity is often considered an important characteristic of interesting text documents. A recent proposal for measuring topical diversity identifies three distributions for assessing the diversity of documents: distributions of words within documents, words within topics, and topics within documents. Topic models play a central role in this approach; hence, their quality is crucial to the efficacy of measuring topical diversity. The quality of topic models is degraded by two problems: generality and impurity of topics. General topics contain only information common to the background corpus and are assigned to most of the documents. Impure topics contain words that are not related to the topic; impurity lowers the interpretability of topic models, and impure topics are likely to be assigned to documents erroneously. We propose a hierarchical re-estimation process aimed at removing generality and impurity. Our approach has three re-estimation components: (1) document re-estimation, which removes general words from the documents; (2) topic re-estimation, which re-estimates the distribution over words of each topic; and (3) topic assignment re-estimation, which re-estimates for each document its distribution over topics. For measuring the topical diversity of text documents, our HiTR approach improves over the state of the art, as measured on the PubMed dataset.
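
The abstract does not specify how each re-estimation step is computed. As a rough illustration only, the sketch below assumes a parsimonious-language-model-style EM update: an item's probability is discounted in proportion to how well a background distribution already explains it, and items that fall below a threshold are pruned. The function name `reestimate` and the parameters `lam`, `threshold`, and `iters` are hypothetical and are not taken from the paper.

```python
def reestimate(counts, background, lam=0.5, threshold=1e-4, iters=100):
    """Illustrative re-estimation of a distribution against a background.

    counts     : dict mapping item -> raw frequency (e.g. word counts in a document)
    background : dict mapping item -> background probability (e.g. corpus-wide)
    lam        : assumed mixing weight of the specific (non-background) model
    threshold  : items whose re-estimated probability drops below this are pruned
    """
    # Start from the maximum-likelihood estimate over items that actually occur.
    counts = {t: c for t, c in counts.items() if c > 0}
    total = sum(counts.values())
    p = {t: c / total for t, c in counts.items()}

    for _ in range(iters):
        # E-step: share of each item's count attributed to the specific model
        # rather than to the background.
        expected = {}
        for t, prob in p.items():
            specific = lam * prob
            general = (1.0 - lam) * background.get(t, 0.0)
            expected[t] = counts[t] * specific / (specific + general)

        # M-step: renormalise, then prune items that are now negligible.
        norm = sum(expected.values())
        kept = {t: e / norm for t, e in expected.items() if e / norm >= threshold}
        if not kept:
            break
        z = sum(kept.values())
        p = {t: v / z for t, v in kept.items()}
    return p


# Hypothetical toy document: "the" is frequent but also very common in the corpus.
doc_counts = {"the": 10, "topic": 3, "diversity": 2}
corpus_probs = {"the": 0.5, "topic": 0.01, "diversity": 0.005}
print(reestimate(doc_counts, corpus_probs))
```

Under the same assumption, the routine could be reused at the other two levels by changing what plays the role of "item" and "background": words within a topic against the average topic, and topics within a document against the corpus-wide topic distribution.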


research
06/26/2018

Unveiling the semantic structure of text documents using paragraph-aware Topic Models

Classic Topic Models are built under the Bag Of Words assumption, in whi...
research
08/05/2020

BATS: A Spectral Biclustering Approach to Single Document Topic Modeling and Segmentation

Existing topic modeling and text segmentation methodologies generally re...
research
03/30/2021

Local and Global Topics in Text Modeling of Web Pages Nested in Web Sites

Topic models are popular models for analyzing a collection of text docum...
research
04/16/2021

Hierarchical Topic Presence Models

Topic models analyze text from a set of documents. Documents are modeled...
research
07/11/2000

Two Steps Feature Selection and Neural Network Classification for the TREC-8 Routing

For the TREC-8 routing, one specific filter is built for each topic. Eac...
research
01/31/2018

On the Topic of Jets

We introduce jet topics: a framework to identify underlying classes of j...
research
04/19/2018

Invitación al estudio estadístico del lenguaje

Invitation to the statistical study of language: The topic of this prese...
