Supervised Contrastive Learning for Interpretable Long Document Comparison

08/20/2021
by Akshita Jha, et al.

Recent advances in deep learning have transformed semantic text matching. However, most state-of-the-art models are designed for short documents such as tweets, user reviews, and comments, and they have fundamental limitations when applied to long-form documents such as scientific papers, legal documents, and patents. Handling such long documents raises three primary challenges: (i) the same word appears in different contexts throughout the document; (ii) two documents may share small sections of contextually similar text while the remaining parts are dissimilar, which defies the basic understanding of "similarity"; and (iii) a single global similarity measure is too coarse to capture the heterogeneity of the document content. In this paper, we describe CoLDE (Contrastive Long Document Encoder), a transformer-based framework that addresses these challenges and allows for interpretable comparisons of long documents. CoLDE uses unique positional embeddings and a multi-headed chunkwise attention layer in conjunction with a contrastive learning framework to capture similarity at three different levels: (i) high-level similarity scores between a pair of documents, (ii) similarity scores between different sections within and across documents, and (iii) similarity scores between different chunks within the same document and across documents. These fine-grained similarity scores improve interpretability. We evaluate CoLDE on three long-document datasets, namely ACL Anthology publications, Wikipedia articles, and USPTO patents. Besides outperforming state-of-the-art methods on the document comparison task, CoLDE also proves interpretable and robust to changes in document length and to text perturbations.
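
To make the idea of chunk-level and document-level similarity concrete, here is a minimal sketch. It is not CoLDE: the learned transformer encoder, positional embeddings, chunkwise attention, and contrastive training are all replaced by a toy hashed bag-of-words embedding. The names `chunk_text`, `embed_chunk`, `chunk_similarity_matrix`, and `document_score`, the chunk size, and the mean-of-best-match aggregation are assumptions made purely for illustration. Section-level scores are omitted.

```python
import math
import zlib
from typing import List

DIM = 64  # hypothetical embedding size for this toy example


def chunk_text(text: str, chunk_size: int = 8) -> List[List[str]]:
    """Split a document into fixed-size word chunks (chunk_size is an assumption)."""
    words = text.lower().split()
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


def embed_chunk(words: List[str]) -> List[float]:
    """Toy stand-in for a learned encoder: a hashed bag-of-words vector."""
    vec = [0.0] * DIM
    for word in words:
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    return vec


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def chunk_similarity_matrix(doc_a: str, doc_b: str) -> List[List[float]]:
    """Chunk-level scores: entry [i][j] compares chunk i of doc_a with chunk j of doc_b."""
    emb_a = [embed_chunk(c) for c in chunk_text(doc_a)]
    emb_b = [embed_chunk(c) for c in chunk_text(doc_b)]
    return [[cosine(a, b) for b in emb_b] for a in emb_a]


def document_score(matrix: List[List[float]]) -> float:
    """Document-level score: average, over chunks of doc_a, of the best match in doc_b.

    This aggregation is an illustrative choice, not the one used by CoLDE.
    """
    if not matrix or not matrix[0]:
        return 0.0
    return sum(max(row) for row in matrix) / len(matrix)
```

Inspecting the matrix rather than only the aggregated score shows which chunks drive the overall similarity. That is the kind of fine-grained view the abstract argues a single global measure cannot provide.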

12/13/2022

Attentive Deep Neural Networks for Legal Document Retrieval

Legal text retrieval serves as a key component in a wide range of legal ...
06/02/2021

Self-Supervised Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference

We present a novel model for the problem of ranking a collection of docu...
11/15/2016

SimDoc: Topic Sequence Alignment based Document Similarity Framework

Document similarity is the problem of estimating the degree to which a g...
03/15/2022

FastKASSIM: A Fast Tree Kernel-Based Syntactic Similarity Metric

Syntax is a fundamental component of language, yet few metrics have been...
08/17/2012

Content-based Text Categorization using Wikitology

A major computational burden, while performing document clustering, is t...
10/13/2020

Aspect-based Document Similarity for Research Papers

Traditional document similarity measures provide a coarse-grained distin...
05/04/2020

Noise Pollution in Hospital Readmission Prediction: Long Document Classification with Reinforcement Learning

This paper presents a reinforcement learning approach to extract noise i...