Codeswitched Sentence Creation using Dependency Parsing

12/05/2020
by   Dhruval Jain, et al.
0

Codeswitching has become one of the most common occurrences across multilingual speakers of the world, especially in countries like India which encompasses around 23 official languages with the number of bilingual speakers being around 300 million. The scarcity of Codeswitched data becomes a bottleneck in the exploration of this domain with respect to various Natural Language Processing (NLP) tasks. We thus present a novel algorithm which harnesses the syntactic structure of English grammar to develop grammatically sensible Codeswitched versions of English-Hindi, English-Marathi and English-Kannada data. Apart from maintaining the grammatical sanity to a great extent, our methodology also guarantees abundant generation of data from a minuscule snapshot of given data. We use multiple datasets to showcase the capabilities of our algorithm while at the same time we assess the quality of generated Codeswitched data using some qualitative metrics along with providing baseline results for couple of NLP tasks.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
05/02/2022

Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation

Multi-modal Machine Translation (MMT) enables the use of visual informat...
research
12/07/2019

PidginUNMT: Unsupervised Neural Machine Translation from West African Pidgin to English

Over 800 languages are spoken across West Africa. Despite the obvious di...
research
04/12/2023

ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning

Over the last few years, large language models (LLMs) have emerged as th...
research
09/07/2018

Multitask and Multilingual Modelling for Lexical Analysis

In Natural Language Processing (NLP), one traditionally considers a sing...
research
09/27/2017

Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets

With the ever-growing amounts of textual data from a large variety of la...
research
01/31/2021

Multilingual Email Zoning

The segmentation of emails into functional zones (also dubbed email zoni...
research
04/27/2023

A Modular Approach for Multilingual Timex Detection and Normalization using Deep Learning and Grammar-based methods

Detecting and normalizing temporal expressions is an essential step for ...

Please sign up or login with your details

Forgot password? Click here to reset