HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text

07/08/2021
by Vivek Srivastava, et al.

Text generation is a highly active area of research in the computational linguistics community. Evaluating generated text is a challenging task, and multiple theories and metrics have been proposed over the years. Unfortunately, text generation and evaluation remain relatively understudied for code-mixed languages, in which words and phrases from multiple languages are mixed in a single utterance of text or speech, because high-quality resources are scarce. To address this challenge, we present a corpus (HinGE) for Hinglish, a widely popular code-mixed language (code-mixing of Hindi and English). For each pair of parallel Hindi-English sentences, HinGE contains Hinglish sentences generated by humans as well as by two rule-based algorithms. In addition, we demonstrate the inefficacy of widely used evaluation metrics on code-mixed data. The HinGE dataset will facilitate progress in natural language generation research for code-mixed languages.

research
07/24/2021

MIPE: A Metric Independent Pipeline for Effective Code-Mixed NLG Evaluation

Code-mixing is a phenomenon of mixing words and phrases from two or more...
research
06/17/2022

BITS Pilani at HinglishEval: Quality Evaluation for Code-Mixed Hinglish Text Using Transformers

Code-Mixed text data consists of sentences having words or phrases from ...
research
02/23/2023

MUTANT: A Multi-sentential Code-mixed Hinglish Dataset

The multi-sentential long sequence textual data unfolds several interest...
research
07/31/2017

The Code2Text Challenge: Text Generation in Source Code Libraries

We propose a new shared task for tactical data-to-text generation in the...
research
08/23/2021

CGEMs: A Metric Model for Automatic Code Generation using GPT-3

Today, AI technology is showing its strengths in almost every industry a...
research
08/04/2021

Quality Evaluation of the Low-Resource Synthetically Generated Code-Mixed Hinglish Text

In this shared task, we seek the participating teams to investigate the ...
research
05/05/2020

Russian Natural Language Generation: Creation of a Language Modelling Dataset and Evaluation with Modern Neural Architectures

Generating coherent, grammatically correct, and meaningful text is very ...
