Semantic Document Distance Measures and Unsupervised Document Revision Detection

09/05/2017
by Xiaofeng Zhu, et al.

In this paper, we model the document revision detection problem as a minimum cost branching problem that relies on computing document distances. Furthermore, we propose two new document distance measures, word vector-based Dynamic Time Warping (wDTW) and word vector-based Tree Edit Distance (wTED). Our revision detection system is designed for large-scale corpora and implemented in Apache Spark. Using the Wikipedia revision dumps (https://snap.stanford.edu/data/wiki-meta.html) and simulated data sets, we demonstrate that our system detects revisions more precisely than state-of-the-art methods.
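
To make the idea of a word vector-based Dynamic Time Warping distance more concrete, the following is a minimal sketch only, not the authors' wDTW. It assumes each document has already been reduced to a sequence of paragraph vectors (for example, averaged word embeddings) and that cosine distance is the local cost; the names `cosine_distance` and `dtw_distance` and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """Cosine distance between two vectors (assumed local cost)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Treat an all-zero vector as maximally distant from anything.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def dtw_distance(doc_a: List[Vector], doc_b: List[Vector]) -> float:
    """Classic DTW over two sequences of paragraph vectors."""
    n, m = len(doc_a), len(doc_b)
    if n == 0 or m == 0:
        return 0.0 if n == m else math.inf
    # cost[i][j] = cheapest alignment of the first i and first j paragraphs.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = cosine_distance(doc_a[i - 1], doc_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # doc_b paragraph j is reused
                cost[i][j - 1],      # doc_a paragraph i is reused
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


# Toy paragraph vectors (hypothetical data, two dimensions for readability).
original = [[1.0, 0.0], [0.0, 1.0]]
revised = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # one paragraph duplicated
unrelated = [[0.0, 1.0], [1.0, 0.0]]            # paragraphs swapped

print(dtw_distance(original, revised))    # 0.0
print(dtw_distance(original, unrelated))  # 2.0
```

In a revision detection setting, pairwise distances of this kind would serve as edge weights in a directed graph over documents, on which a minimum cost branching is then computed; that step is not shown here.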

