Efficient Clustering from Distributions over Topics

12/15/2020
by Carlos Badenes-Olmedo, et al.

There are many scenarios in which we may want to find pairs of textually similar documents in a large corpus (e.g., a researcher doing a literature review, or an R&D project manager analyzing project proposals). Discovering those connections programmatically can help experts achieve their goals, but brute-force pairwise comparison is computationally infeasible when the corpus is too large. Some algorithms in the literature divide the search space into regions containing potentially similar documents, which are then processed separately from the rest in order to reduce the number of pairs compared. However, such unsupervised methods still incur high time costs. In this paper, we present an approach that relies on the results of a topic modeling algorithm over the documents in a collection as a means to identify smaller subsets of documents where the similarity function can then be computed. This approach has obtained promising results when identifying similar documents in the domain of scientific publications. We have compared our approach against state-of-the-art clustering techniques and with different configurations of the topic modeling algorithm. Results suggest that our approach outperforms (> 0.5) the other analyzed techniques in terms of efficiency.
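The general idea of restricting similarity computations to subsets derived from topic distributions can be illustrated with a minimal sketch. It is not the authors' algorithm: the bucketing rule (grouping documents by their `k` most probable topics), the use of Jensen-Shannon divergence as the similarity basis, the function names, and the toy distributions are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations
import math


def top_topics_key(dist, k=2):
    """Hypothetical bucketing rule: indices of the k most probable topics, order ignored."""
    ranked = sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)
    return tuple(sorted(ranked[:k]))


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1]."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def candidate_pairs(topic_dists, k=2):
    """Yield only pairs of documents that fall into the same bucket."""
    buckets = defaultdict(list)
    for doc_id, dist in topic_dists.items():
        buckets[top_topics_key(dist, k)].append(doc_id)
    for members in buckets.values():
        yield from combinations(members, 2)


def similar_pairs(topic_dists, k=2, threshold=0.9):
    """Compute similarity (1 - JSD) for candidate pairs and keep those above the threshold."""
    result = []
    for a, b in candidate_pairs(topic_dists, k):
        sim = 1.0 - js_divergence(topic_dists[a], topic_dists[b])
        if sim >= threshold:
            result.append((a, b, sim))
    return result


# Toy topic distributions (assumed, as if produced by a 4-topic model).
docs = {
    "d1": [0.70, 0.20, 0.05, 0.05],
    "d2": [0.65, 0.25, 0.05, 0.05],
    "d3": [0.05, 0.05, 0.20, 0.70],
    "d4": [0.05, 0.05, 0.25, 0.65],
}
pairs = sorted(candidate_pairs(docs))
matches = similar_pairs(docs)
```

With four documents, a brute-force comparison would evaluate six pairs; here only the pairs sharing a bucket are scored. A coarse rule like this trades recall for speed, since similar documents whose top topics differ slightly land in different buckets.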


