LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development

05/12/2023
by   Ilias Chalkidis, et al.

In this work, we conduct a detailed analysis of the performance of legal-oriented pre-trained language models (PLMs). We examine the interplay between their original objective, acquired knowledge, and legal language understanding capacities, which we define as upstream, probing, and downstream performance, respectively. We consider not only the models' size but also the pre-training corpora used as important dimensions in our study. To this end, we release a multinational English legal corpus (LeXFiles) and a legal knowledge probing benchmark (LegalLAMA) to facilitate training and detailed analysis of legal-oriented PLMs. We release two new legal PLMs trained on LeXFiles and evaluate them alongside others on LegalLAMA and LexGLUE. We find that probing performance strongly correlates with upstream performance in related legal topics. On the other hand, downstream performance is mainly driven by the model's size and prior legal knowledge, which can be estimated by upstream and probing performance. Based on these findings, we conclude that both dimensions are important for those seeking to develop domain-specific PLMs.

research
10/04/2021

JuriBERT: A Masked-Language Model Adaptation for French Legal Text

Language models have proven to be very useful when adapted to specific d...
research
06/05/2023

LexGPT 0.1: pre-trained GPT-J models with Pile of Law

This research aims to build generative language models specialized for t...
research
10/03/2021

LexGLUE: A Benchmark Dataset for Legal Language Understanding in English

Law, interpretations of law, legal arguments, agreements, etc. are typic...
research
10/24/2022

Legal-Tech Open Diaries: Lesson learned on how to develop and deploy light-weight models in the era of humongous Language Models

In the era of billion-parameter-sized Language Models (LMs), start-ups h...
research
11/23/2022

Agent-Specific Deontic Modality Detection in Legal Language

Legal documents are typically long and written in legalese, which makes ...
research
10/02/2021

Swiss-Judgment-Prediction: A Multilingual Legal Judgment Prediction Benchmark

In many jurisdictions, the excessive workload of courts leads to high de...
research
08/08/2023

SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore

The legality of training language models (LMs) on copyrighted or otherwi...
