MultiEURLEX – A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer

09/02/2021
by Ilias Chalkidis, et al.

We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated into 23 languages and annotated with multiple labels from the EUROVOC taxonomy. We highlight the effect of temporal concept drift and the importance of chronological, rather than random, splits. We use the dataset as a testbed for zero-shot cross-lingual transfer, where we exploit annotated training documents in one language (source) to classify documents in another language (target). We find that fine-tuning a multilingually pretrained model (XLM-ROBERTA, MT5) in a single source language leads to catastrophic forgetting of multilingual knowledge and, consequently, to poor zero-shot transfer to other languages. Adaptation strategies (partial fine-tuning, adapters, BITFIT, LNFIT), originally proposed to accelerate fine-tuning for new end-tasks, help retain multilingual knowledge from pretraining and substantially improve zero-shot cross-lingual transfer, but their impact also depends on the pretrained model used and the size of the label set.
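The difference between a chronological and a random split can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `date` field, the toy documents, the 80/10/10 ratios, and the function names are hypothetical and are not taken from the dataset or the paper.

```python
import random
from datetime import date


def chronological_split(docs, train_frac=0.8, dev_frac=0.1):
    """Sort documents by date, then cut: oldest go to train, newest to test."""
    ordered = sorted(docs, key=lambda d: d["date"])
    n = len(ordered)
    train_end = int(n * train_frac)
    dev_end = train_end + int(n * dev_frac)
    return ordered[:train_end], ordered[train_end:dev_end], ordered[dev_end:]


def random_split(docs, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle documents, then cut with the same ratios (dates are ignored)."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = int(n * train_frac)
    dev_end = train_end + int(n * dev_frac)
    return shuffled[:train_end], shuffled[train_end:dev_end], shuffled[dev_end:]


# Hypothetical toy corpus: one document per year, 2000 through 2009.
docs = [{"id": i, "date": date(2000 + i, 1, 1)} for i in range(10)]

train, dev, test = chronological_split(docs)
r_train, r_dev, r_test = random_split(docs)
```

With the chronological split, every training document predates every development and test document, so evaluation measures how a model copes with later documents. A random split mixes periods across the subsets, which can hide the effect of temporal concept drift.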


research
06/08/2022

Realistic Zero-Shot Cross-Lingual Transfer in Legal Topic Classification

We consider zero-shot cross-lingual transfer in legal topic classificati...
research
03/04/2023

DiTTO: A Feature Representation Imitation Approach for Improving Cross-Lingual Transfer

Zero-shot cross-lingual transfer is promising, however has been shown to...
research
05/01/2020

From Zero to Hero: On the Limitations of Zero-Shot Cross-Lingual Transfer with Multilingual Transformers

Massively multilingual transformers pretrained with language modeling ob...
research
09/25/2022

An Empirical Study on Cross-X Transfer for Legal Judgment Prediction

Cross-lingual transfer learning has proven useful in a variety of Natura...
research
05/01/2020

XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning

In order to simulate human language capacity, natural language processin...
research
10/11/2022

Contrastive Training Improves Zero-Shot Classification of Semi-structured Documents

We investigate semi-structured document classification in a zero-shot se...
research
08/02/2021

PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors

EuroVoc is a multilingual thesaurus that was built for organizing the le...
