My Approach = Your Apparatus? Entropy-Based Topic Modeling on Multiple Domain-Specific Text Collections

11/25/2019
by Julian Risch, et al.

Comparative text mining extends from genre analysis and political bias detection to the revelation of cultural and geographic differences, through to the search for prior art across patents and scientific papers. These applications use cross-collection topic modeling for the exploration, clustering, and comparison of large sets of documents, such as digital libraries. However, topic modeling on documents from different collections is challenging because of domain-specific vocabulary. We present a cross-collection topic model combined with automatic domain term extraction and phrase segmentation. This model distinguishes collection-specific and collection-independent words based on information entropy and reveals commonalities and differences of multiple text collections. We evaluate our model on patents, scientific papers, newspaper articles, forum posts, and Wikipedia articles. In comparison to state-of-the-art cross-collection topic modeling, our model achieves up to 13 [...] perplexity, and up to 31 [...]. Importantly, our approach is the first topic model that ensures disjunct general and specific word distributions, resulting in clear-cut topic representations.
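The abstract says that words are separated into collection-specific and collection-independent ones based on information entropy, but it does not give the formula. The following is a minimal sketch of one way such a score could be computed, not the authors' model: it takes the normalized Shannon entropy of a word's relative frequency across collections. The function names, the toy collections, and the 0.5 threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def collection_entropy(word, counts_per_collection):
    """Normalized Shannon entropy of a word's spread over collections.

    Returns a value in [0, 1]: close to 1 when the word is equally
    frequent in every collection, 0 when it occurs in only one.
    Illustrative assumption; the paper's exact definition may differ.
    """
    # Relative frequency of the word within each collection.
    rel_freqs = []
    for counts in counts_per_collection:
        total = sum(counts.values())
        rel_freqs.append(counts[word] / total if total else 0.0)

    mass = sum(rel_freqs)
    if mass == 0 or len(counts_per_collection) < 2:
        return 0.0

    # Turn the relative frequencies into a distribution over collections.
    probs = [f / mass for f in rel_freqs]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Divide by the maximum possible entropy, log(number of collections).
    return max(0.0, entropy / math.log(len(counts_per_collection)))


def is_collection_specific(word, counts_per_collection, threshold=0.5):
    """Label a word as collection-specific if its entropy is low.

    The threshold is a hypothetical value chosen for this sketch.
    """
    return collection_entropy(word, counts_per_collection) < threshold


# Toy collections (invented for illustration, not from the paper).
patents = Counter("the apparatus comprises a sensor the apparatus".split())
papers = Counter("the approach uses a sensor the approach".split())
collections = [patents, papers]
```

With these toy counts, `is_collection_specific("apparatus", collections)` is true because the word occurs only in the patent text, while `"the"` and `"sensor"` are spread evenly and count as collection-independent.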

