Self-Supervised Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference

06/02/2021
by Dvir Ginzburg, et al.

We present a novel model for ranking a collection of documents according to their semantic similarity to a source (query) document. Although document-to-document similarity ranking has been studied, most modern methods are limited to relatively short documents or rely on the existence of "ground-truth" similarity labels. Yet in most real-world cases, similarity ranking is an unsupervised problem because similarity labels are unavailable. Moreover, an ideal model should not be restricted by document length. Hence, we introduce SDR, a self-supervised method for document similarity that can be applied to documents of arbitrary length. Importantly, SDR can be effectively applied to extremely long documents, exceeding Longformer's maximum of 4,096 tokens. Extensive evaluations on large document datasets show that SDR significantly outperforms its alternatives across all metrics. To accelerate future research on unlabeled long-document similarity ranking, and as an additional contribution to the community, we publish two human-annotated test sets for evaluating long-document similarity. The SDR code and datasets are publicly available.

research
05/09/2022

Long Document Re-ranking with Modular Re-ranker

Long document re-ranking has been a challenging problem for neural re-ra...
research
06/03/2019

Unsupervised Neural Generative Semantic Hashing

Fast similarity search is a key component in large-scale information ret...
research
03/15/2022

FastKASSIM: A Fast Tree Kernel-Based Syntactic Similarity Metric

Syntax is a fundamental component of language, yet few metrics have been...
research
07/01/2020

Self-supervised Deep Reconstruction of Mixed Strip-shredded Text Documents

The reconstruction of shredded documents consists of coherently arrangin...
research
10/12/2022

Lbl2Vec: An Embedding-Based Approach for Unsupervised Document Retrieval on Predefined Topics

In this paper, we consider the task of retrieving documents with predefi...
research
10/13/2020

Aspect-based Document Similarity for Research Papers

Traditional document similarity measures provide a coarse-grained distin...
research
08/20/2021

Supervised Contrastive Learning for Interpretable Long Document Comparison

Recent advancements in deep learning techniques have transformed the are...
