Graph-based Topic Extraction from Vector Embeddings of Text Documents: Application to a Corpus of News Articles

10/28/2020
by M. Tarik Altuncu, et al.

Production of news content is growing at an astonishing rate. To help manage and monitor the sheer amount of text, there is an increasing need to develop efficient methods that can provide insights into emerging content areas, and stratify unstructured corpora of text into 'topics' that stem intrinsically from content similarity. Here we present an unsupervised framework that brings together powerful vector embeddings from natural language processing with tools from multiscale graph partitioning that can reveal natural partitions at different resolutions without making a priori assumptions about the number of clusters in the corpus. We show the advantages of graph-based clustering through end-to-end comparisons with other popular clustering and topic modelling methods, and also evaluate different text vector embeddings, from classic Bag-of-Words to Doc2Vec to the recent transformer-based model BERT. This comparative work is showcased through an analysis of a corpus of US news coverage during the presidential election year of 2016.
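
The general idea of the abstract (embed documents as vectors, build a similarity graph, then read off partitions at several resolutions without fixing the number of clusters in advance) can be illustrated with a deliberately simplified sketch. This is not the authors' method: multiscale graph partitioning is replaced here by a much cruder stand-in, namely thresholding cosine similarities and taking connected components, with the threshold playing the role of a resolution parameter. The function names, the thresholds, and the four toy documents are all assumptions made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def bow_vectors(docs):
    """Bag-of-Words term-count vectors, one Counter per document."""
    return [Counter(doc.lower().split()) for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_graph(vectors, threshold):
    """Undirected graph linking documents whose similarity is >= threshold."""
    adj = {i: set() for i in range(len(vectors))}
    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def connected_components(adj):
    """Stand-in for graph partitioning: group nodes by connected component."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        parts.append(sorted(component))
    return sorted(parts)


def partitions_by_resolution(docs, thresholds):
    """Map each threshold (a crude resolution knob) to a partition of docs."""
    vectors = bow_vectors(docs)
    return {t: connected_components(similarity_graph(vectors, t))
            for t in thresholds}


# Hypothetical toy corpus, invented for this sketch (not from the paper).
toy_docs = [
    "news election vote senate campaign",
    "news election vote president campaign",
    "news football match goal league",
    "news football match coach league",
]

partitions = partitions_by_resolution(toy_docs, [0.1, 0.5, 0.9])
for threshold, parts in partitions.items():
    print(threshold, parts)
# 0.1 [[0, 1, 2, 3]]
# 0.5 [[0, 1], [2, 3]]
# 0.9 [[0], [1], [2], [3]]
```

On this toy corpus the number of groups is never specified: it emerges from the graph at each threshold, going from one coarse group to two topical groups to four singletons. A real pipeline would presumably swap the Bag-of-Words vectors for richer embeddings and the connected-components step for a proper multiscale community detection method.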

research
08/03/2018

Content-driven, unsupervised clustering of news articles through multiscale graph partitioning

The explosion in the amount of news and journalistic content being gener...
research
08/19/2021

A Framework for Neural Topic Modeling of Text Corpora

Topic Modeling refers to the problem of discovering the main topics that...
research
04/15/2021

Vec2GC – A Graph Based Clustering Method for Text Representations

NLP pipelines with limited or no labeled data, rely on unsupervised meth...
research
03/15/2022

TSM: Measuring the Enticement of Honeyfiles with Natural Language Processing

Honeyfile deployment is a useful breach detection method in cyber decept...
research
08/31/2019

Extracting information from free text through unsupervised graph-based clustering: an application to patient incident records

The large volume of text in electronic healthcare records often remains ...
research
11/02/2020

Biased TextRank: Unsupervised Graph-Based Content Extraction

We introduce Biased TextRank, a graph-based content extraction method in...
research
04/16/2021

Tracing Topic Transitions with Temporal Graph Clusters

Twitter serves as a data source for many Natural Language Processing (NL...
