PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors

08/02/2021
by Andrei-Marius Avram, et al.

EuroVoc is a multilingual thesaurus built to organize the legislative documentation of the European Union institutions. It contains thousands of categories at different levels of specificity, and its descriptors are assigned to legal texts in almost thirty languages. In this work we propose a unified framework for EuroVoc classification in 22 languages, based on fine-tuning modern Transformer-based pretrained language models. We extensively evaluate our trained models and show that they significantly improve on the results obtained by a similar tool, JEX, on the same dataset. The code and the fine-tuned models have been open-sourced, together with a programmatic interface that simplifies loading the weights of a trained model and classifying a new document.
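The abstract does not say how model outputs are turned into descriptors. As a minimal sketch of one common multi-label decision rule (an assumption, not necessarily the procedure used by the authors), each descriptor gets an independent sigmoid score and the highest-scoring descriptors above a threshold are returned. The function name, the descriptor labels, the logits, the threshold, and the `top_k` value below are all hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Tuple


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def select_descriptors(
    logits: Dict[str, float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
    top_k: int = 6,          # assumed maximum number of labels to return
) -> List[Tuple[str, float]]:
    """Return up to top_k (label, probability) pairs with probability >= threshold,
    sorted from most to least probable."""
    scored = [(label, sigmoid(z)) for label, z in logits.items()]
    kept = [(label, p) for label, p in scored if p >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Hypothetical per-descriptor logits, standing in for the output of a
# fine-tuned classifier on one document (labels are placeholders, not
# real EuroVoc identifiers).
example_logits = {"D1": 2.0, "D2": -1.0, "D3": 0.3, "D4": -3.2}
predictions = select_descriptors(example_logits)

for label, prob in predictions:
    print(f"{label}: {prob:.3f}")
```

Scoring each descriptor independently, rather than with a softmax over all of them, lets a document receive several descriptors at once, which matches the multi-label nature of EuroVoc annotation.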

research
09/15/2022

ÚFAL CorPipe at CRAC 2022: Effectivity of Multilingual Models for Coreference Resolution

We describe the winning submission to the CRAC 2022 Shared Task on Multi...
research
09/22/2021

Role of Language Relatedness in Multilingual Fine-tuning of Language Models: A Case Study in Indo-Aryan Languages

We explore the impact of leveraging the relatedness of languages that be...
research
09/15/2023

Resolving Legalese: A Multilingual Exploration of Negation Scope Resolution in Legal Documents

Resolving the scope of a negation within a sentence is a challenging NLP...
research
09/02/2021

MultiEURLEX – A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer

We introduce MULTI-EURLEX, a new multilingual dataset for topic classifi...
research
05/31/2022

Multilingual Transformers for Product Matching – Experiments and a New Benchmark in Polish

Product matching corresponds to the task of matching identical products ...
research
11/28/2019

Legal document retrieval across languages: topic hierarchies based on synsets

Cross-lingual annotations of legislative texts enable us to explore majo...
research
11/21/2022

OSDG 2.0: a multilingual tool for classifying text data by UN Sustainable Development Goals (SDGs)

Despite concrete indicators and targets, monitoring the progress of the ...
