BLUEX: A benchmark based on Brazilian Leading Universities Entrance eXams

07/11/2023
by Thales Sales Almeida, et al.

One common trend in recent studies of language models (LMs) is the use of standardized tests for evaluation. However, despite being the fifth most spoken language worldwide, Portuguese has seen few such evaluations, mainly due to the lack of high-quality datasets available to the community. To address this gap, we introduce the Brazilian Leading Universities Entrance eXams (BLUEX), a dataset of entrance exams from the two leading universities in Brazil: UNICAMP and USP. The dataset includes annotated metadata for evaluating the performance of NLP models on a variety of subjects. Furthermore, BLUEX includes a collection of recently administered exams that are unlikely to be included in the training data of many popular LMs as of 2023. The dataset is also annotated to indicate the position of images in each question, providing a valuable resource for multimodal language understanding and reasoning. We describe the creation and characteristics of BLUEX and establish a benchmark through experiments with state-of-the-art LMs, demonstrating its potential for advancing the state of the art in natural language understanding and reasoning in Portuguese. The data and relevant code can be found at https://github.com/Portuguese-Benchmark-Datasets/BLUEX.
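As a concrete illustration, the sketch below shows one way the per-question annotations could be consumed programmatically, for example to select the text-only subset for evaluating LMs without vision capabilities. It assumes the questions are distributed as JSON files in the repository; the directory layout and field names used here ("subject", "has_associated_images", "alternatives", "answer") are illustrative assumptions, not the dataset's documented schema.

```python
# Minimal sketch: load BLUEX-style question files and filter by metadata.
# Assumption (not the documented schema): each question is a JSON file with
# fields such as "question", "alternatives", "answer", "subject", and
# "has_associated_images"; adjust names to the actual dataset layout.
import json
from pathlib import Path


def load_questions(root: str):
    """Yield question dicts from all JSON files under `root`."""
    for path in Path(root).rglob("*.json"):
        with path.open(encoding="utf-8") as f:
            yield json.load(f)


def text_only(questions):
    """Keep questions without associated images, e.g. for text-only evaluation."""
    return [q for q in questions if not q.get("has_associated_images", False)]


if __name__ == "__main__":
    questions = list(load_questions("BLUEX/questions"))  # hypothetical path
    subset = text_only(questions)
    print(f"{len(subset)} text-only questions out of {len(questions)}")
```

A similar filter over the subject metadata would let one report per-discipline accuracy, which is the kind of breakdown the annotations are intended to support.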

Related research

BanglaBERT: Combating Embedding Barrier for Low-Resource Language Understanding (01/01/2021)
Pre-training language models on large volume of data with self-supervise...

RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark (10/29/2020)
In this paper, we introduce an advanced Russian general language underst...

GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-distribution Generalization Perspective (11/15/2022)
Pre-trained language models (PLMs) are known to improve the generalizati...

SLURP: A Spoken Language Understanding Resource Package (11/26/2020)
Spoken Language Understanding infers semantic meaning directly from audi...

M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models (06/08/2023)
Despite the existence of various benchmarks for evaluating natural langu...

Can Prompt Probe Pretrained Language Models? Understanding the Invisible Risks from a Causal View (03/23/2022)
Prompt-based probing has been widely used in evaluating the abilities of...

Decoding the Underlying Meaning of Multimodal Hateful Memes (05/28/2023)
Recent studies have proposed models that yielded promising performance f...
