SimDoc: Topic Sequence Alignment based Document Similarity Framework

11/15/2016
by Gaurav Maheshwari, et al.

Document similarity is the problem of estimating the degree to which a given pair of documents has similar semantic content. An accurate document similarity measure can improve several enterprise-relevant tasks such as document clustering, text mining, and question answering. In this paper, we show that a document's thematic flow, which is often disregarded by bag-of-words techniques, is pivotal in estimating document similarity. To this end, we propose a novel semantic document similarity framework, called SimDoc. We model documents as topic sequences, where topics represent latent generative clusters of related words. We then use a sequence alignment algorithm to estimate the semantic similarity of two documents. We further introduce a novel mechanism for computing topic-topic similarity, which is used to fine-tune the system. In our experiments, SimDoc outperforms many contemporary bag-of-words techniques both in accurately computing document similarity and in practical applications such as document clustering.
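
To make the idea of comparing documents as topic sequences concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not say which alignment algorithm SimDoc uses, so a Smith-Waterman-style local alignment is assumed here. The names `align_score`, `similarity`, `TOPIC_SIM`, `gap_penalty`, and `threshold` are hypothetical, and the topic-topic similarity values are made up for illustration.

```python
from typing import List, Sequence

# Hypothetical topic-topic similarity matrix for three topics (values in [0, 1]).
# In a real system this would be derived from a trained topic model.
TOPIC_SIM: List[List[float]] = [
    [1.0, 0.2, 0.1],
    [0.2, 1.0, 0.6],
    [0.1, 0.6, 1.0],
]


def align_score(seq_a: Sequence[int], seq_b: Sequence[int],
                topic_sim: Sequence[Sequence[float]],
                gap_penalty: float = 0.5, threshold: float = 0.5) -> float:
    """Smith-Waterman-style local alignment score of two topic sequences."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    table = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            # Shift the similarity so that weakly related topics count against
            # the alignment (assumed scoring scheme, not taken from the paper).
            match = topic_sim[seq_a[i - 1]][seq_b[j - 1]] - threshold
            table[i][j] = max(
                0.0,                             # start a new local alignment
                table[i - 1][j - 1] + match,     # align the two topics
                table[i - 1][j] - gap_penalty,   # gap in seq_b
                table[i][j - 1] - gap_penalty,   # gap in seq_a
            )
            best = max(best, table[i][j])
    return best


def similarity(seq_a: Sequence[int], seq_b: Sequence[int],
               topic_sim: Sequence[Sequence[float]] = TOPIC_SIM) -> float:
    """Alignment score normalised by the smaller self-alignment score."""
    raw = align_score(seq_a, seq_b, topic_sim)
    denom = min(align_score(seq_a, seq_a, topic_sim),
                align_score(seq_b, seq_b, topic_sim))
    return raw / denom if denom > 0 else 0.0


# Each document is a sequence of topic ids, one per sentence or text segment.
doc_a = [0, 1, 2, 1]
doc_b = [1, 2, 1, 0]  # same topics as doc_a, but in a different order
print(similarity(doc_a, doc_a))
print(similarity(doc_a, doc_b))
```

In this sketch, `doc_a` and `doc_b` contain exactly the same topics, so a bag-of-topics comparison would treat them as identical. The alignment-based score for the reordered pair is lower than the score of `doc_a` against itself, which is the kind of thematic-flow sensitivity the abstract describes.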

research
08/17/2012

Content-based Text Categorization using Wikitology

A major computational burden, while performing document clustering, is t...
research
12/29/2011

Document Clustering based on Topic Maps

Importance of document clustering is now widely acknowledged by research...
research
02/03/2023

ANTM: An Aligned Neural Topic Model for Exploring Evolving Topics

As the amount of text data generated by humans and machines increases, t...
research
12/29/2011

A comparison of two suffix tree-based document clustering algorithms

Document clustering as an unsupervised approach extensively used to navi...
research
11/30/2017

Calculating Semantic Similarity between Academic Articles using Topic Event and Ontology

Determining semantic similarity between academic documents is crucial to...
research
01/11/2022

Structure with Semantics: Exploiting Document Relations for Retrieval

Retrieving relevant documents from a corpus is typically based on the se...
research
05/14/2020

An Efficient Shared-memory Parallel Sinkhorn-Knopp Algorithm to Compute the Word Mover's Distance

The Word Mover's Distance (WMD) is a metric that measures the semantic d...
