Large Scale Legal Text Classification Using Transformer Models

10/24/2020
by   Zein Shaheen, et al.

Large multi-label text classification is a challenging Natural Language Processing (NLP) problem concerned with classifying text in datasets that have thousands of labels. We tackle this problem in the legal domain, where datasets such as JRC-Acquis and EURLEX57K, labeled with the EuroVoc vocabulary, were created within the legal information systems of the European Union. The EuroVoc taxonomy includes around 7000 concepts. In this work, we study the performance of various recent transformer-based models in combination with strategies such as generative pretraining, gradual unfreezing, and discriminative learning rates in order to reach competitive classification performance, and we present new state-of-the-art results of 0.661 (F1) for JRC-Acquis and 0.754 (F1) for EURLEX57K. Furthermore, we quantify the impact of individual steps, such as language model fine-tuning or gradual unfreezing, in an ablation study, and we provide reference dataset splits created with an iterative stratification algorithm.
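Two of the strategies named above, discriminative learning rates and gradual unfreezing, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the grouping of the model into layer groups, and the default decay factor of 2.6 (the value suggested in the ULMFiT paper) are assumptions made only for illustration.

```python
from typing import List


def discriminative_learning_rates(
    top_lr: float, n_groups: int, decay: float = 2.6
) -> List[float]:
    """Return one learning rate per layer group, ordered bottom (0) to top.

    The top group receives top_lr; each group below it receives the rate of
    the group above divided by `decay` (hypothetical default: 2.6).
    """
    return [top_lr / (decay ** (n_groups - 1 - i)) for i in range(n_groups)]


def gradual_unfreezing_schedule(n_groups: int) -> List[List[int]]:
    """Return, for each epoch, the indices of the trainable layer groups.

    Epoch 0 trains only the top group; every later epoch unfreezes one more
    group below, until all groups are trainable.
    """
    return [
        list(range(n_groups - 1 - epoch, n_groups)) for epoch in range(n_groups)
    ]


if __name__ == "__main__":
    # Example with 4 hypothetical layer groups (e.g. embeddings up to the classifier head).
    for epoch, groups in enumerate(gradual_unfreezing_schedule(4)):
        print(f"epoch {epoch}: trainable groups {groups}")
    print(discriminative_learning_rates(top_lr=1e-3, n_groups=4))
```

In this sketch, lower layers change more slowly than the task-specific top layers, and they only start changing after the layers above them have been trained for at least one epoch.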
