Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset

07/01/2022
by Peter Henderson, et al.

One concern with the rise of large language models is their potential for significant harm, particularly from pretraining on biased, obscene, copyrighted, and private information. Emerging ethical approaches have attempted to filter pretraining material, but such approaches have been ad hoc and have failed to take context into account. We offer an approach to filtering grounded in law, which has directly addressed the tradeoffs in filtering material. First, we gather and make available the Pile of Law, a 256GB (and growing) dataset of open-source English-language legal and administrative data, covering court opinions, contracts, administrative rules, and legislative records. Pretraining on the Pile of Law may help with legal tasks that have the promise to improve access to justice. Second, we distill the legal norms that governments have developed to constrain the inclusion of toxic or private content into actionable lessons for researchers and discuss how our dataset reflects these norms. Third, we show how the Pile of Law offers researchers the opportunity to learn such filtering rules directly from the data, providing an exciting new research direction in model-based processing.

