Hierarchical Optimal Transport for Document Representation

06/26/2019
by Mikhail Yurochkin, et al.

The ability to measure similarity between documents enables intelligent summarization and analysis of large corpora. Existing document distances either fail to incorporate semantic similarities between words or scale poorly. As an alternative, we introduce hierarchical optimal transport as a meta-distance between documents, where documents are modeled as distributions over topics, which themselves are modeled as distributions over words. We then solve an optimal transport problem on the smaller topic space to compute a similarity score. We give conditions on the topics under which this construction defines a distance, and we relate it to the word mover's distance. We evaluate our technique on k-NN classification and show that it is more interpretable and scalable than current methods, with comparable performance at a fraction of the computational cost.
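
The two-level construction described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `ot_cost` and `hott`, the toy embeddings, and the toy topics are all assumptions made for illustration, and the transport problems are solved with SciPy's general-purpose linear programming routine rather than a dedicated optimal transport solver. The ground cost between topics is taken to be the optimal transport cost between their word distributions under Euclidean distances between word embeddings, and the document distance is then an optimal transport problem over the much smaller topic space.

```python
import numpy as np
from scipy.optimize import linprog


def ot_cost(a, b, cost):
    """Exact optimal transport cost between histograms a and b.

    Solved as a linear program over the flattened (row-major) transport plan.
    """
    n, m = cost.shape
    # Row-sum constraints: mass leaving source bin i equals a[i].
    rows = np.kron(np.eye(n), np.ones((1, m)))
    # Column-sum constraints: mass arriving at target bin j equals b[j].
    cols = np.kron(np.ones((1, n)), np.eye(m))
    res = linprog(
        cost.ravel(),
        A_eq=np.vstack([rows, cols]),
        b_eq=np.concatenate([a, b]),
        bounds=(0, None),
        method="highs",
    )
    return res.fun


def hott(doc_a, doc_b, topics, word_vecs):
    """Hierarchical distance: OT over topics, with OT over words as ground cost."""
    # Word-level ground cost: Euclidean distance between embeddings.
    diff = word_vecs[:, None, :] - word_vecs[None, :, :]
    word_cost = np.linalg.norm(diff, axis=-1)
    # Topic-level ground cost: transport cost between topic word distributions.
    k = topics.shape[0]
    topic_cost = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            c = ot_cost(topics[i], topics[j], word_cost)
            topic_cost[i, j] = topic_cost[j, i] = c
    # Document-level problem lives on the k topics, not the full vocabulary.
    return ot_cost(doc_a, doc_b, topic_cost)


# Toy data (hypothetical): 4 words in 2-D, 2 topics, 2 documents.
word_vecs = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
topics = np.array([[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]])
doc_a = np.array([0.9, 0.1])  # mostly topic 0
doc_b = np.array([0.2, 0.8])  # mostly topic 1

d = hott(doc_a, doc_b, topics, word_vecs)
print(d)
```

In a real pipeline the topic-to-topic cost matrix depends only on the topics, so it would be computed once and reused for every document pair; the sketch recomputes it on each call only to stay short.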


research
07/24/2023

Towards Generalising Neural Topical Representations

Topic models have evolved from conventional Bayesian probabilistic model...
research
05/30/2021

Re-evaluating Word Mover's Distance

The word mover's distance (WMD) is a fundamental technique for measuring...
research
01/29/2020

Multi-Marginal Optimal Transport Defines a Generalized Metric

We prove that the multi-marginal optimal transport (MMOT) problem define...
research
01/28/2020

Structural-Aware Sentence Similarity with Recursive Optimal Transport

Measuring sentence similarity is a classic topic in natural language pro...
research
04/23/2019

Wasserstein-Fisher-Rao Document Distance

As a fundamental problem of natural language processing, it is important...
research
04/30/2021

Embedding Semantic Hierarchy in Discrete Optimal Transport for Risk Minimization

The widely-used cross-entropy (CE) loss-based deep networks achieved sig...
research
02/21/2016

Interactive Storytelling over Document Collections

Storytelling algorithms aim to 'connect the dots' between disparate docu...
