BenCoref: A Multi-Domain Dataset of Nominal Phrases and Pronominal Reference Annotations

04/07/2023
by shadman-rohan, et al.

Coreference resolution is a well-studied problem in NLP. While it has been widely studied for English and other resource-rich languages, coreference resolution in Bengali remains largely unexplored due to the absence of relevant datasets. Bengali is a low-resource language and is morphologically richer than English. In this article, we introduce a new dataset, BenCoref, comprising coreference annotations for Bengali texts gathered from four distinct domains. This relatively small dataset contains 5,200 mention annotations forming 502 mention clusters within 48,569 tokens. We describe the process of creating this dataset and report the performance of multiple models trained on BenCoref. We expect that our work sheds some light on how coreference phenomena vary across domains in Bengali and encourages the development of additional resources for the language. Furthermore, we found poor cross-lingual performance in the zero-shot setting from English, highlighting the need for more language-specific resources for this task.
