MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset

05/02/2023
by Tobias Brugger, et al.

Sentence Boundary Detection (SBD) is one of the foundational building blocks of Natural Language Processing (NLP), and incorrectly split sentences strongly affect the output quality of downstream tasks. It is a challenging task for algorithms, especially in the legal domain, given the complex and varied sentence structures used. In this work, we curated a diverse multilingual legal dataset consisting of over 130,000 annotated sentences in 6 languages. Our experimental results indicate that the performance of existing SBD models is subpar on multilingual legal data. We trained and tested monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers, demonstrating state-of-the-art performance. We also show that our multilingual models outperform all baselines in the zero-shot setting on a Portuguese test set. To encourage further research and development by the community, we have made our dataset, models, and code publicly available.
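
To make the difficulty concrete, the following is a minimal sketch of why legal text trips up simple splitters. It is not the CRF, BiLSTM-CRF, or transformer approach described above; it is a rule-based illustration only. The example sentence, the function names, and the abbreviation list are all assumptions made for this sketch and are not taken from the dataset or the released code.

```python
import re

# Hypothetical example text; not taken from the MultiLegalSBD dataset.
TEXT = "The claim fails under Art. 12 para. 3 of the Act. The appeal is dismissed."

# Illustrative abbreviation list (an assumption, far from complete).
ABBREVIATIONS = {"art.", "para.", "no.", "sec."}


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def abbreviation_aware_split(text):
    """Re-join naive pieces whenever a piece ends in a known abbreviation."""
    sentences, buffer = [], ""
    for piece in naive_split(text):
        if not piece:
            continue  # skip the empty piece produced by empty input
        buffer = f"{buffer} {piece}".strip()
        last_token = buffer.split()[-1].lower()
        if last_token not in ABBREVIATIONS:
            sentences.append(buffer)
            buffer = ""
    if buffer:
        # Text ended on an abbreviation; keep what was collected.
        sentences.append(buffer)
    return sentences


if __name__ == "__main__":
    print(naive_split(TEXT))               # over-splits at "Art." and "para."
    print(abbreviation_aware_split(TEXT))  # keeps the citation in one sentence
```

The naive splitter breaks the citation into fragments, while the abbreviation-aware variant recovers two sentences. A hand-maintained list like this does not scale across languages and citation styles, which is one reason learned models trained on annotated data are attractive for this task.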

research
09/15/2023

Resolving Legalese: A Multilingual Exploration of Negation Scope Resolution in Legal Documents

Resolving the scope of a negation within a sentence is a challenging NLP...
research
06/09/2021

Probing Multilingual Language Models for Discourse

Pre-trained multilingual language models have become an important buildi...
research
06/03/2023

MultiLegalPile: A 689GB Multilingual Legal Corpus

Large, high-quality datasets are crucial for training Large Language Mod...
research
08/24/2021

Are the Multilingual Models Better? Improving Czech Sentiment with Transformers

In this paper, we aim at improving Czech sentiment with transformer-base...
research
06/29/2023

LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT

We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic ...
research
04/21/2020

Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation

We present an easy and efficient method to extend existing sentence embe...
research
09/30/2021

Multi-granular Legal Topic Classification on Greek Legislation

In this work, we study the task of classifying legal texts written in th...