Re-evaluating Word Mover's Distance

05/30/2021
by Ryoma Sato, et al.

The word mover's distance (WMD) is a fundamental technique for measuring the similarity of two documents. The crux of WMD is that it can take advantage of the underlying geometry of the word space by employing an optimal transport formulation. The original study on WMD reported that WMD outperforms classical baselines such as bag-of-words (BOW) and TF-IDF by significant margins on various datasets. In this paper, we point out that the evaluation in the original study could be misleading. We re-evaluate the performance of WMD and the classical baselines and find that the classical baselines are competitive with WMD if we employ appropriate preprocessing, i.e., L1 normalization. This result is not intuitive: WMD should be superior to BOW because WMD can take the underlying geometry into account, whereas BOW cannot. Our analysis shows that the baselines' competitiveness stems from the high-dimensional nature of the underlying metric. We find that WMD in high-dimensional spaces behaves more similarly to BOW than in low-dimensional spaces due to the curse of dimensionality.
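To make the preprocessing step concrete, the following is a minimal sketch of L1 normalization applied to bag-of-words vectors. It is not the paper's evaluation code: the toy documents, the helper names (`bow`, `l1_normalize`, `l1_distance`), and the choice of the L1 distance for comparison are all assumptions made for illustration. It only shows why normalization can matter: raw counts make a document look far from a longer copy of itself, whereas L1-normalized vectors do not depend on document length.

```python
from collections import Counter


def bow(tokens):
    """Raw bag-of-words: map each word to its count."""
    return Counter(tokens)


def l1_normalize(counts):
    """Divide every count by the total so the entries sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def l1_distance(a, b):
    """Sum of absolute differences over the union of both vocabularies."""
    vocab = set(a) | set(b)
    return sum(abs(a.get(w, 0.0) - b.get(w, 0.0)) for w in vocab)


# Hypothetical toy documents (illustration only).
doc_short = "obama speaks media".split()
doc_long = doc_short * 5                   # same content, five times longer
doc_other = "band plays concert".split()   # unrelated content

# Raw counts: the longer copy looks farther away than the unrelated document.
raw_same = l1_distance(bow(doc_short), bow(doc_long))
raw_other = l1_distance(bow(doc_short), bow(doc_other))

# L1-normalized: the longer copy collapses onto the original document.
norm_same = l1_distance(l1_normalize(bow(doc_short)), l1_normalize(bow(doc_long)))
norm_other = l1_distance(l1_normalize(bow(doc_short)), l1_normalize(bow(doc_other)))

print(raw_same, raw_other)
print(norm_same, norm_other)
```

WMD operates on normalized word histograms by construction, so comparing it against unnormalized BOW is one way a comparison could mix the effect of normalization with the effect of the transport geometry.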

research
06/26/2019

Hierarchical Optimal Transport for Document Representation

The ability to measure similarity between documents enables intelligent ...
research
04/04/2021

Non-negative matrix and tensor factorisations with a smoothed Wasserstein loss

Non-negative matrix and tensor factorisations are a classical tool in ma...
research
07/14/2023

Fast Algorithms for a New Relaxation of Optimal Transport

We introduce a new class of objectives for optimal transport computation...
research
12/29/2022

On the Geometry of Reinforcement Learning in Continuous State and Action Spaces

Advances in reinforcement learning have led to its successful applicatio...
research
05/16/2022

A scalable deep learning approach for solving high-dimensional dynamic optimal transport

The dynamic formulation of optimal transport has attracted growing inter...
research
04/23/2019

Wasserstein-Fisher-Rao Document Distance

As a fundamental problem of natural language processing, it is important...
research
02/25/2020

A CLT in Stein's distance for generalized Wishart matrices and higher order tensors

We study the convergence along the central limit theorem for sums of ind...
