CLSE: Corpus of Linguistically Significant Entities

11/04/2022
by Aleksandr Chuklin, et al.

One of the biggest challenges of natural language generation (NLG) is the proper handling of named entities. Named entities are a common source of grammar mistakes such as wrong prepositions, wrong article handling, or incorrect entity inflection. When linguistic representation is not factored in, such errors are often underrepresented, both when evaluating on a small set of arbitrarily picked argument values and when translating a dataset from a linguistically simpler language, like English, to a linguistically complex language, like Russian. For some applications, however, broadly precise grammatical correctness is critical: native speakers may find entity-related grammar errors silly, jarring, or even offensive. To enable the creation of more linguistically diverse NLG datasets, we release a Corpus of Linguistically Significant Entities (CLSE) annotated by expert linguists. The corpus includes 34 languages and covers 74 different semantic types to support various applications from airline ticketing to video games. To demonstrate one possible use of CLSE, we produce an augmented version of the Schema-Guided Dialog Dataset, SGD-CLSE. Using CLSE entities and a small number of human translations, we create a linguistically representative NLG evaluation benchmark in three languages: French (high-resource), Marathi (low-resource), and Russian (highly inflected). We establish quality baselines for neural, template-based, and hybrid NLG systems and discuss the strengths and weaknesses of each approach.
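The contrast between arbitrarily picked argument values and linguistically representative ones can be illustrated with a minimal sketch. It assumes each entity carries a tuple of linguistic features and picks one entity per distinct feature combination, so that every grammatical pattern is exercised at least once. The record layout, the field names (`name`, `type`, `features`), and the function `sample_by_signature` are hypothetical and chosen for illustration only; they are not the CLSE schema or the authors' sampling procedure.

```python
import random
from collections import defaultdict

# Hypothetical entity records. Field names and feature labels are
# illustrative assumptions, not the actual CLSE annotation format.
ENTITIES = [
    {"name": "Москва", "type": "City", "features": ("Fem", "Sing")},
    {"name": "Казань", "type": "City", "features": ("Fem", "Sing")},
    {"name": "Париж", "type": "City", "features": ("Masc", "Sing")},
    {"name": "Лондон", "type": "City", "features": ("Masc", "Sing")},
    {"name": "Афины", "type": "City", "features": ("Plur",)},
]


def sample_by_signature(entities, semantic_type, per_signature=1, seed=0):
    """Pick up to `per_signature` entities for each distinct feature tuple."""
    rng = random.Random(seed)

    # Group entities of the requested semantic type by their feature tuple.
    groups = defaultdict(list)
    for entity in entities:
        if entity["type"] == semantic_type:
            groups[entity["features"]].append(entity)

    # Draw from every group so that no linguistic pattern is left out.
    picked = []
    for signature in sorted(groups):
        candidates = groups[signature]
        count = min(per_signature, len(candidates))
        picked.extend(rng.sample(candidates, count))
    return picked


if __name__ == "__main__":
    for entity in sample_by_signature(ENTITIES, "City"):
        print(entity["name"], entity["features"])
```

Picking two cities at random from this list could easily yield two feminine singular names and never test plural agreement. Sampling per feature combination guarantees that each pattern appears in the evaluation set.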


