Lexical Normalization for Code-switched Data and its Effect on POS-tagging

by Rob van der Goot, et al.

Social media provides an unfiltered stream of user-generated input, leading to creative language use and many interesting linguistic phenomena that were previously not available in such abundance. However, this language is harder to process automatically. One particularly challenging phenomenon is the use of multiple languages within one utterance, also called Code-Switching (CS). Whereas monolingual social media data already poses many problems for natural language processing, CS adds another challenging dimension. One solution that is commonly used to improve processing of social media data is to translate input texts to standard language first. This normalization has been shown to improve performance on many natural language processing tasks. In this paper, we focus on normalization in the context of code-switching. We introduce a variety of models to perform normalization on CS data, and analyse the impact of word-level language identification on normalization. We show that the performance of the proposed normalization models is generally high, but language labels are only slightly informative. We also carry out POS tagging as extrinsic evaluation and show that automatic normalization of the input leads to 3.2 increase of 6.8




Leveraging Pretrained Word Embeddings for Part-of-Speech Tagging of Code Switching Data

Linguistic Code Switching (CS) is a phenomenon that occurs when multilin...

Part of speech tagging for code switched data

We address the problem of Part of Speech tagging (POS) in the context of...

Training a code-switching language model with monolingual data

A lack of code-switching data complicates the training of code-switching...

Sequence-to-Sequence Lexical Normalization with Multilingual Transformers

Current benchmark tasks for natural language processing contain text tha...

Is this word borrowed? An automatic approach to quantify the likeliness of borrowing in social media

Code-mixing or code-switching are the effortless phenomena of natural sw...

Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities

The wide accessibility of social media has provided linguistically under...

To Normalize, or Not to Normalize: The Impact of Normalization on Part-of-Speech Tagging

Does normalization help Part-of-Speech (POS) tagging accuracy on noisy, ...
