A Data Bootstrapping Recipe for Low Resource Multilingual Relation Classification

10/18/2021
by Arijit Nag, et al.

Relation classification (sometimes called 'extraction') requires trustworthy datasets for fine-tuning large language models, as well as for evaluation. Data collection is challenging for Indian languages, because they are syntactically and morphologically diverse and differ from resource-rich languages like English. Despite recent interest in deep generative models for Indian languages, relation classification is still not well served by public datasets. In response, we present IndoRE, a dataset with 21K entity- and relation-tagged gold sentences in three Indian languages, plus English. We start with a multilingual BERT (mBERT) based system that captures entity span positions and type information and provides competitive monolingual relation classification. Using this system, we explore and compare transfer mechanisms between languages. In particular, we study the accuracy-efficiency tradeoff between expensive gold instances and translated and aligned 'silver' instances. We release the dataset for future research.
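To make the idea of capturing entity span positions and type information concrete, here is a minimal sketch of one common way to expose both to a BERT-style encoder: wrapping the subject and object spans in typed marker tokens before tokenization. The abstract does not specify the actual input encoding, so the marker format (`<S:TYPE>`, `<O:TYPE>`), the function name `mark_entities`, and the example sentence are all illustrative assumptions rather than the authors' method.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def mark_entities(
    tokens: List[str],
    subj: Span,
    subj_type: str,
    obj: Span,
    obj_type: str,
) -> List[str]:
    """Wrap subject and object spans in typed marker tokens.

    The marker strings are hypothetical; a real system would choose its
    own scheme and register the markers with its tokenizer.
    """
    for start, end in (subj, obj):
        if not 0 <= start < end <= len(tokens):
            raise ValueError("span must be non-empty and inside the sentence")
    if subj[0] < obj[1] and obj[0] < subj[1]:
        raise ValueError("entity spans must not overlap")

    # Non-overlapping, non-empty spans have distinct starts and distinct ends,
    # so these dictionaries never lose an entry to a key collision.
    opens = {subj[0]: f"<S:{subj_type}>", obj[0]: f"<O:{obj_type}>"}
    closes = {subj[1]: f"</S:{subj_type}>", obj[1]: f"</O:{obj_type}>"}

    marked: List[str] = []
    for i in range(len(tokens) + 1):
        # Emit a closing marker before an opening one so adjacent spans nest correctly.
        if i in closes:
            marked.append(closes[i])
        if i in opens:
            marked.append(opens[i])
        if i < len(tokens):
            marked.append(tokens[i])
    return marked


# Illustrative sentence (not taken from IndoRE).
example = mark_entities(
    ["Tagore", "was", "born", "in", "Kolkata"],
    subj=(0, 1), subj_type="PERSON",
    obj=(4, 5), obj_type="LOCATION",
)
print(" ".join(example))
```

The marked sequence would then be passed to the encoder, and the representations at the marker positions could feed a relation classifier. Because the markers carry no language-specific content, the same scheme applies unchanged to gold sentences and to translated, aligned 'silver' sentences once their entity spans are known.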

research
05/08/2023

MultiTACRED: A Multilingual Version of the TAC Relation Extraction Dataset

Relation extraction (RE) is a fundamental task in information extraction...
research
04/17/2021

DiS-ReX: A Multilingual Dataset for Distantly Supervised Relation Extraction

Distant supervision (DS) is a well established technique for creating la...
research
10/11/2022

Multilingual BERT has an accent: Evaluating English influences on fluency in multilingual models

While multilingual language models can improve NLP performance on low-re...
research
10/10/2022

HumSet: Dataset of Multilingual Information Extraction and Classification for Humanitarian Crisis Response

Timely and effective response to humanitarian crises requires quick and ...
research
01/11/2023

Multilingual Entity and Relation Extraction from Unified to Language-specific Training

Entity and relation extraction is a key task in information extraction, ...
research
05/15/2023

Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages

While natural language processing tools have been developed extensively ...
research
01/25/2022

Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources

In recent years, large-scale data collection efforts have prioritized th...