SCALE: Scaling up the Complexity for Advanced Language Model Evaluation

06/15/2023
by   Vishvaksenan Rasiah, et al.
Recent strides in Large Language Models (LLMs) have saturated many NLP benchmarks (even professional domain-specific ones), emphasizing the need for new, more challenging ones to properly assess LLM capabilities. In this paper, we introduce a novel NLP benchmark that poses challenges to current LLMs across four key dimensions: processing long documents (up to 50K tokens), utilizing domain-specific knowledge (embodied in legal texts), multilingual understanding (covering five languages), and multitasking (comprising legal document-to-document Information Retrieval, Court View Generation, Leading Decision Summarization, Citation Extraction, and eight challenging Text Classification tasks). Our benchmark comprises diverse legal NLP datasets from the Swiss legal system, allowing for a comprehensive study of the underlying non-English, inherently multilingual, federal legal system. Despite recent advances, efficiently processing long documents for intense review and analysis tasks remains an open challenge for language models. Comprehensive, domain-specific benchmarks that require high expertise to develop are also rare, as are multilingual benchmarks. This scarcity underscores the value of our contribution, since most public models are trained predominantly on English corpora, while other languages remain understudied, particularly for practical domain-specific NLP tasks. Our benchmark allows for testing and advancing state-of-the-art LLMs. As part of our study, we evaluate several pre-trained multilingual language models on our benchmark to establish strong baselines as a point of reference. Despite the large size of our datasets (tens to hundreds of thousands of examples), existing publicly available models struggle with most tasks, even after in-domain pretraining. We publish all resources (benchmark suite, pre-trained models, code) under a fully permissive open CC BY-SA license.

research
06/03/2023

MultiLegalPile: A 689GB Multilingual Legal Corpus

Large, high-quality datasets are crucial for training Large Language Mod...
research
10/10/2022

HumSet: Dataset of Multilingual Information Extraction and Classification for Humanitarian Crisis Response

Timely and effective response to humanitarian crises requires quick and ...
research
01/30/2023

LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain

Lately, propelled by the phenomenal advances around the transformer arch...
research
05/25/2020

MaintNet: A Collaborative Open-Source Library for Predictive Maintenance Language Resources

Maintenance record logbooks are an emerging text type in NLP. They typic...
research
11/30/2022

BudgetLongformer: Can we Cheaply Pretrain a SotA Legal Language Model From Scratch?

Pretrained transformer models have achieved state-of-the-art results in ...
research
01/31/2021

Multilingual Email Zoning

The segmentation of emails into functional zones (also dubbed email zoni...
research
02/15/2022

MuLD: The Multitask Long Document Benchmark

The impressive progress in NLP techniques has been driven by the develop...
