Quantifying the Dissimilarity of Texts

05/03/2023
by Benjamin Shade, et al.

Quantifying the dissimilarity of two texts is an important aspect of a number of natural language processing tasks, including semantic information retrieval, topic classification, and document clustering. In this paper, we compared the properties and performance of different dissimilarity measures D using three different representations of texts – vocabularies, word frequency distributions, and vector embeddings – and three simple tasks – clustering texts by author, subject, and time period. Using the Project Gutenberg database, we found that the generalised Jensen–Shannon divergence applied to word frequencies performed strongly across all tasks, that D's based on vector embedding representations led to stronger performance for smaller texts, and that the optimal choice of approach was ultimately task-dependent. We also investigated, both analytically and numerically, the behaviour of the different D's when the two texts varied in length by a factor h. We demonstrated that the (natural) estimator of the Jaccard distance between vocabularies was inconsistent and computed explicitly the h-dependency of the bias of the estimator of the generalised Jensen–Shannon divergence applied to word frequencies. We also found numerically that the Jensen–Shannon divergence and embedding-based approaches were robust to changes in h, while the Jaccard distance was not.
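
To make two of the measures named above concrete, the following is a minimal Python sketch of the Jaccard distance between vocabularies and the Jensen-Shannon divergence between word frequency distributions. It is an illustration under stated assumptions, not the authors' implementation: the function names and the whitespace tokenizer are hypothetical, and the divergence shown is the standard (Shannon) case in bits rather than the generalised form studied in the paper.

```python
import math
from collections import Counter


def tokenize(text):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def jaccard_distance(tokens_a, tokens_b):
    """1 - |A intersect B| / |A union B| for the two vocabularies (sets of word types)."""
    vocab_a, vocab_b = set(tokens_a), set(tokens_b)
    union = vocab_a | vocab_b
    if not union:
        return 0.0  # convention chosen here: two empty texts are identical
    return 1.0 - len(vocab_a & vocab_b) / len(union)


def shannon_entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def jensen_shannon_divergence(tokens_a, tokens_b):
    """Standard JSD (equal weights) between the word frequency distributions."""
    if not tokens_a or not tokens_b:
        raise ValueError("both texts must contain at least one token")
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    n_a, n_b = len(tokens_a), len(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    # Relative frequencies over the joint vocabulary (Counter returns 0 for missing words).
    p = [counts_a[w] / n_a for w in vocab]
    q = [counts_b[w] / n_b for w in vocab]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return shannon_entropy(m) - 0.5 * (shannon_entropy(p) + shannon_entropy(q))


if __name__ == "__main__":
    text_a = tokenize("the cat sat on the mat")
    text_b = tokenize("the dog sat on the log")
    print("Jaccard distance:", jaccard_distance(text_a, text_b))
    print("Jensen-Shannon divergence (bits):", jensen_shannon_divergence(text_a, text_b))
```

With base-2 logarithms and equal weights, this divergence lies between 0 (identical frequency distributions) and 1 (no shared words). Both functions take token lists, so texts of different lengths can be passed in directly to explore how each measure responds to a length ratio h.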

