Lexical Normalization for Code-switched Data and its Effect on POS-tagging

06/01/2020
by Rob van der Goot, et al.

Social media provides an unfiltered stream of user-generated input, leading to creative language use and many interesting linguistic phenomena that were previously not available in such abundance. However, this language is harder to process automatically. One particularly challenging phenomenon is the use of multiple languages within one utterance, also called Code-Switching (CS). Whereas monolingual social media data already poses many problems for natural language processing, CS adds another challenging dimension. One solution that is commonly used to improve processing of social media data is to translate input texts to standard language first. This normalization has been shown to improve performance on many natural language processing tasks. In this paper, we focus on normalization in the context of code-switching. We introduce a variety of models to perform normalization on CS data, and analyse the impact of word-level language identification on normalization. We show that the performance of the proposed normalization models is generally high, but language labels are only slightly informative. We also carry out POS tagging as extrinsic evaluation and show that automatic normalization of the input leads to 3.2 increase of 6.8
