Learning Bill Similarity with Annotated and Augmented Corpora of Bills

09/14/2021
by Jiseon Kim, et al.

Bill writing is a critical element of representative democracy. However, it is often overlooked that most legislative bills are derived, or even directly copied, from other bills. Despite the significance of bill-to-bill linkages for understanding the legislative process, existing approaches fail to address semantic similarities across bills, let alone the reordering and paraphrasing that are prevalent in legal document writing. In this paper, we overcome these limitations by proposing a 5-class classification task that closely reflects the nature of the bill generation process. To this end, we construct a human-labeled dataset of 4,721 bill-to-bill relationships at the subsection level and release this annotated dataset to the research community. To augment the dataset, we generate synthetic data with varying degrees of similarity, mimicking the complex bill writing process. We use BERT variants and apply multi-stage training, fine-tuning our models first on the synthetic data and then on the human-labeled data. We find that predictive performance improves significantly when training with both human-labeled and synthetic data. Finally, we apply our trained model to infer section- and bill-level similarities. Our analysis shows that the proposed methodology successfully captures the similarities across legal documents at various levels of aggregation.
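
To make the idea of synthetic pairs with varying degrees of similarity concrete, the following is a minimal sketch and not the authors' procedure. It assumes that a subsection can be perturbed by copying, reordering, or truncating its sentences, and that an unrelated subsection serves as a negative example. The function names, the naive sentence splitter, and the four label strings are all hypothetical; the abstract does not name the five classes.

```python
import random


def split_sentences(text):
    # Naive period-based splitter, for illustration only (assumption);
    # real bill text would need a tokenizer aware of legal citations.
    return [s.strip() for s in text.split(".") if s.strip()]


def join_sentences(sentences):
    return ". ".join(sentences) + "."


def make_synthetic_pairs(subsection, unrelated, rng):
    """Return (text_a, text_b, label) triples.

    The labels are hypothetical placeholders, not the paper's classes.
    """
    sents = split_sentences(subsection)

    # Same sentences in a shuffled order, mimicking reordering.
    reordered = sents[:]
    rng.shuffle(reordered)

    # Keep only the first half of the sentences, mimicking partial reuse.
    kept = sents[: max(1, len(sents) // 2)]

    return [
        (subsection, join_sentences(sents), "identical"),
        (subsection, join_sentences(reordered), "reordered"),
        (subsection, join_sentences(kept), "partial"),
        (subsection, unrelated, "unrelated"),
    ]


if __name__ == "__main__":
    source = "The agency shall issue rules. The rules take effect in 90 days. Funds are authorized."
    other = "This section defines terms used in the Act."
    for a, b, label in make_synthetic_pairs(source, other, random.Random(0)):
        print(label, "|", b)
```

Under multi-stage training as described above, pairs like these would be used for a first fine-tuning pass, followed by a second pass on the human-labeled pairs.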

