SimDoc: Topic Sequence Alignment based Document Similarity Framework

11/15/2016
by Gaurav Maheshwari, et al.

Document similarity is the problem of estimating the degree to which a given pair of documents has similar semantic content. An accurate document similarity measure can improve several enterprise-relevant tasks such as document clustering, text mining, and question answering. In this paper, we show that the thematic flow of documents, which is often disregarded by bag-of-words techniques, is pivotal in estimating their similarity. To this end, we propose a novel semantic document similarity framework called SimDoc. We model documents as topic sequences, where topics represent latent generative clusters of related words. We then use a sequence alignment algorithm to estimate the semantic similarity of two documents. We further conceptualize a novel mechanism for computing topic-topic similarity to fine-tune our system. In our experiments, we show that SimDoc outperforms many contemporary bag-of-words techniques both in accurately computing document similarity and in practical applications such as document clustering.
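To make the idea of comparing topic sequences concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not say which alignment algorithm or topic-topic similarity SimDoc uses, so this example assumes a Smith-Waterman-style local alignment and takes the topic-topic similarity matrix as a given input. The names `align_score`, `normalized_similarity`, `gap_penalty`, and `match_threshold` are hypothetical and chosen only for illustration.

```python
import numpy as np


def align_score(seq_a, seq_b, topic_sim, gap_penalty=0.5, match_threshold=0.5):
    """Local alignment score of two topic-id sequences.

    Assumption: Smith-Waterman-style scoring. topic_sim[i, j] is a
    topic-topic similarity in [0, 1] supplied by the caller.
    """
    n, m = len(seq_a), len(seq_b)
    table = np.zeros((n + 1, m + 1))
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Shift the similarity so that dissimilar topics score negatively.
            s = topic_sim[seq_a[i - 1], seq_b[j - 1]] - match_threshold
            table[i, j] = max(
                0.0,                              # start a new local alignment
                table[i - 1, j - 1] + s,          # align the two topics
                table[i - 1, j] - gap_penalty,    # gap in seq_b
                table[i, j - 1] - gap_penalty,    # gap in seq_a
            )
            best = max(best, table[i, j])
    return float(best)


def normalized_similarity(seq_a, seq_b, topic_sim, **kwargs):
    """Scale the alignment score by the larger self-alignment score."""
    raw = align_score(seq_a, seq_b, topic_sim, **kwargs)
    denom = max(
        align_score(seq_a, seq_a, topic_sim, **kwargs),
        align_score(seq_b, seq_b, topic_sim, **kwargs),
    )
    return raw / denom if denom > 0 else 0.0


# Toy setup: three topics that are only similar to themselves.
toy_topic_sim = np.eye(3)
same_order = normalized_similarity([0, 1, 2], [0, 1, 2], toy_topic_sim)
reversed_order = normalized_similarity([0, 1, 2], [2, 1, 0], toy_topic_sim)
```

In this toy setup the two documents in the second comparison contain exactly the same topics, so a bag-of-words measure over topics would rate them identical, while the alignment score drops because the order of topics differs. That is the sense in which thematic flow carries information that bag-of-words representations discard.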
