Supervised Contrastive Learning for Interpretable Long Document Comparison

08/20/2021
by   Akshita Jha, et al.

Recent advances in deep learning have transformed the area of semantic text matching. However, most state-of-the-art models are designed to operate on short documents such as tweets, user reviews, and comments, and have fundamental limitations when applied to long-form documents such as scientific papers, legal documents, and patents. Handling such long documents poses three primary challenges: (i) the presence of different contexts for the same word throughout the document; (ii) small sections of contextually similar text between two documents alongside dissimilar text in the remaining parts, which defies the basic understanding of "similarity"; and (iii) the coarse nature of a single global similarity measure, which fails to capture the heterogeneity of the document content. In this paper, we describe CoLDE (Contrastive Long Document Encoder), a transformer-based framework that addresses these challenges and allows for interpretable comparisons of long documents. CoLDE uses unique positional embeddings and a multi-headed chunkwise attention layer in conjunction with a contrastive learning framework to capture similarity at three different levels: (i) high-level similarity scores between a pair of documents, (ii) similarity scores between different sections within and across documents, and (iii) similarity scores between different chunks within the same document and across documents. These fine-grained similarity scores aid interpretability. We evaluate CoLDE on three long-document datasets, namely ACL Anthology publications, Wikipedia articles, and USPTO patents. Besides outperforming state-of-the-art methods on the document comparison task, CoLDE also proves interpretable and robust to changes in document length and text perturbations.
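To make the three levels of similarity concrete, here is a minimal sketch in plain Python. It is not CoLDE's encoder or training procedure: it assumes chunk embeddings are already available as small vectors, and it uses mean pooling and cosine similarity as stand-ins for whatever the model actually learns. The function names (`cosine`, `mean_vector`, `chunk_similarities`, `section_similarities`, `document_similarity`) and the toy data are hypothetical, chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Assumed layout: a document is a list of sections,
# and a section is a list of chunk embedding vectors.

def chunk_similarities(doc_a, doc_b):
    """Chunk-level scores: one row per chunk of doc_a, one column per chunk of doc_b."""
    chunks_a = [chunk for section in doc_a for chunk in section]
    chunks_b = [chunk for section in doc_b for chunk in section]
    return [[cosine(a, b) for b in chunks_b] for a in chunks_a]

def section_similarities(doc_a, doc_b):
    """Section-level scores, using the mean of each section's chunk vectors (an assumption)."""
    sections_a = [mean_vector(section) for section in doc_a]
    sections_b = [mean_vector(section) for section in doc_b]
    return [[cosine(a, b) for b in sections_b] for a in sections_a]

def document_similarity(doc_a, doc_b):
    """Document-level score, using the mean of the section vectors (an assumption)."""
    vec_a = mean_vector([mean_vector(section) for section in doc_a])
    vec_b = mean_vector([mean_vector(section) for section in doc_b])
    return cosine(vec_a, vec_b)

# Toy 2-dimensional embeddings: the first sections agree, the second sections oppose.
doc_a = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0]]]
doc_b = [[[1.0, 0.0]], [[0.0, -1.0]]]

print(document_similarity(doc_a, doc_b))
print(section_similarities(doc_a, doc_b))
print(chunk_similarities(doc_a, doc_b))
```

In this toy pair the single document-level score hides the fact that the first sections match and the second sections point in opposite directions, which is the kind of heterogeneity that section-level and chunk-level scores are meant to expose.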


research
04/19/2023

Shuffle Divide: Contrastive Learning for Long Text

We propose a self-supervised learning method for long text documents bas...
research
07/29/2023

Analysing the Resourcefulness of the Paragraph for Precedence Retrieval

Developing methods for extracting relevant legal information to aid lega...
research
12/13/2022

Attentive Deep Neural Networks for Legal Document Retrieval

Legal text retrieval serves as a key component in a wide range of legal ...
research
03/15/2022

FastKASSIM: A Fast Tree Kernel-Based Syntactic Similarity Metric

Syntax is a fundamental component of language, yet few metrics have been...
research
06/02/2021

Self-Supervised Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference

We present a novel model for the problem of ranking a collection of docu...
research
10/13/2020

Aspect-based Document Similarity for Research Papers

Traditional document similarity measures provide a coarse-grained distin...
research
05/25/2023

Efficient Document Embeddings via Self-Contrastive Bregman Divergence Learning

Learning quality document embeddings is a fundamental problem in natural...
