MultiLegalPile: A 689GB Multilingual Legal Corpus

06/03/2023
by Joel Niklaus, et al.

Large, high-quality datasets are crucial for training Large Language Models (LLMs). So far, however, few datasets are available for specialized, critical domains such as law, and those that exist are often English-only. We curate and release MultiLegalPile, a 689GB corpus in 24 languages from 17 jurisdictions. The corpus includes diverse legal data sources with varying licenses and allows for pretraining NLP models under fair use, with more permissive licenses for the Eurlex Resources and Legal mC4 subsets. We pretrain two RoBERTa models and one Longformer multilingually, as well as 24 monolingual models, one on each language-specific subset, and evaluate them on LEXTREME. Additionally, we evaluate the English and multilingual models on LexGLUE. Our multilingual models set a new state of the art (SotA) on LEXTREME, and our English models do so on LexGLUE. We release the dataset, the trained models, and all of the code under the most open licenses possible.

research
06/15/2023

SCALE: Scaling up the Complexity for Advanced Language Model Evaluation

Recent strides in Large Language Models (LLMs) have saturated many NLP b...
research
03/07/2023

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

As language models grow ever larger, the need for large-scale high-quali...
research
01/30/2023

LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain

Lately, propelled by the phenomenal advances around the transformer arch...
research
08/08/2023

SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore

The legality of training language models (LMs) on copyrighted or otherwi...
research
10/02/2021

Swiss-Judgment-Prediction: A Multilingual Legal Judgment Prediction Benchmark

In many jurisdictions, the excessive workload of courts leads to high de...
research
05/02/2023

MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset

Sentence Boundary Detection (SBD) is one of the foundational building bl...
research
07/01/2022

Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset

One concern with the rise of large language models lies with their poten...
