G2T: A Simple but Effective Framework for Topic Modeling based on Pretrained Language Model and Community Detection

04/13/2023
by Leihang Zhang, et al.

It has been reported that clustering-based topic models, which cluster high-quality sentence embeddings and apply an appropriate word selection method, can generate better topics than generative probabilistic topic models. However, these approaches offer no principled way to select appropriate parameters, and they are incomplete as models because they overlook the quantitative relations between words and topics and between topics and documents. To solve these issues, we propose graph to topic (G2T), a simple but effective framework for topic modeling. The framework is composed of four modules. First, document representations are obtained using pretrained language models. Second, a semantic graph is constructed according to the similarity between document representations. Third, communities in the document semantic graph are identified, and the relationship between topics and documents is quantified accordingly. Fourth, the word–topic distribution is computed based on a variant of TF-IDF. Automatic evaluation suggests that G2T achieves state-of-the-art performance on both English and Chinese documents of different lengths. Human judgements demonstrate that G2T can produce topics with better interpretability and coverage than the baselines. In addition, G2T not only determines the number of topics automatically but also gives the probabilistic distributions of words in topics and of topics in documents. Finally, G2T is publicly available, and the distillation experiments offer insight into how it works.
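The four modules described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and every name in it is hypothetical. It rests on several simplifying assumptions: the toy `embeddings` stand in for pretrained language model outputs, a fixed cosine `threshold` decides graph edges, connected components serve as a crude stand-in for a real community detection algorithm, documents are assigned to topics by hard membership rather than a probabilistic distribution, and the word scoring is a generic class-based TF-IDF rather than the specific variant used in the paper.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_graph(embeddings, threshold):
    """Module 2 (assumed form): link documents whose similarity >= threshold."""
    adj = {i: set() for i in range(len(embeddings))}
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def connected_components(adj):
    """Module 3 stand-in: connected components instead of community detection."""
    seen, components = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components

def topic_words(docs, communities, top_k=2):
    """Module 4 stand-in: score words per topic with a class-based TF-IDF."""
    counts_per_topic = [
        Counter(w for d in community for w in docs[d].lower().split())
        for community in communities
    ]
    n_topics = len(counts_per_topic)
    # Number of topics in which each word occurs at least once.
    topic_freq = Counter(w for counts in counts_per_topic for w in counts)
    ranked_topics = []
    for counts in counts_per_topic:
        total = sum(counts.values())
        scores = {
            w: (c / total) * math.log(1 + n_topics / topic_freq[w])
            for w, c in counts.items()
        }
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        ranked_topics.append(ranked[:top_k])
    return ranked_topics

# Module 1 stand-in: hand-made 2-D vectors instead of real model embeddings.
docs = ["cats purr softly", "cats chase mice",
        "stocks fell sharply", "stocks rallied today"]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]

graph = build_graph(embeddings, threshold=0.8)
communities = connected_components(graph)
topics = topic_words(docs, communities)
print(communities)
print(topics)
```

In this sketch the number of topics is never specified in advance: it falls out of the graph structure, which mirrors the abstract's claim that the topic number is determined automatically.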

