Automatic Normalization of Word Variations in Code-Mixed Social Media Text

04/03/2018
by Rajat Singh, et al.

Social media platforms such as Twitter and Facebook are becoming popular in multilingual societies. This trend gives rise to a blending of South Asian languages with English. Such blends of multiple languages, known as code-mixed data, have recently become popular in research communities for various NLP tasks. Code-mixed data contain anomalies such as grammatical errors and spelling variations. In this paper, we leverage the contextual property of words: different spelling variations of a word share similar contexts in a large, noisy social media text. We capture the different variations of words that belong to the same context in an unsupervised manner using distributed representations of words. Our experiments reveal that preprocessing a code-mixed dataset with our approach improves performance on state-of-the-art part-of-speech tagging (POS tagging) and sentiment analysis tasks.
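The core idea of the abstract, that spelling variants of a word appear in similar contexts, can be illustrated with a minimal sketch. This is not the authors' implementation. It swaps the learned distributed representations for simple count-based co-occurrence vectors so that it runs with the Python standard library alone, and it adds a string-similarity filter as an assumption to keep unrelated words with similar contexts apart. The function names, the thresholds, and the toy Hindi-English corpus are all hypothetical.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens.

    Count-based stand-in (an assumption) for learned word embeddings.
    """
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def normalization_map(sentences, min_context_sim=0.5, min_string_sim=0.7):
    """Map each word to the most frequent word that looks and behaves like it.

    Both thresholds are illustrative values, not taken from the paper.
    """
    vectors = context_vectors(sentences)
    freq = Counter(w for tokens in sentences for w in tokens)
    mapping = {}
    for word in vectors:
        group = [word] + [
            other for other in vectors
            if other != word
            and SequenceMatcher(None, word, other).ratio() >= min_string_sim
            and cosine(vectors[word], vectors[other]) >= min_context_sim
        ]
        # Prefer the most frequent spelling; break ties alphabetically.
        mapping[word] = min(group, key=lambda w: (-freq[w], w))
    return mapping


def normalize(tokens, mapping):
    """Replace each token by its canonical spelling; unknown tokens pass through."""
    return [mapping.get(t, t) for t in tokens]


# Toy romanized Hindi-English corpus (made up for illustration).
corpus = [
    "yeh movie bahut achi hai".split(),
    "yeh movie bahut acchi hai".split(),
    "yeh movie bahot acchi hai".split(),
    "yeh song bahut acchi hai".split(),
]
mapping = normalization_map(corpus)
print(normalize("yeh song bahot achi hai".split(), mapping))
# prints ['yeh', 'song', 'bahut', 'acchi', 'hai']
```

In this toy run, "achi" and "bahot" are mapped to the more frequent spellings "acchi" and "bahut", while "movie" and "song" stay distinct despite sharing contexts, because their spellings differ too much. On a real corpus, the count vectors would typically be replaced by trained embeddings, and the thresholds would need tuning.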

research
11/15/2016

Recurrent Neural Network based Part-of-Speech Tagger for Code-Mixed Social Media Text

This paper describes Centre for Development of Advanced Computing's (CDA...
research
05/30/2018

A Corpus of English-Hindi Code-Mixed Tweets for Sarcasm Detection

Social media platforms like twitter and facebook have become two of th...
research
05/22/2018

Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model & Levenshtein Distance

Building tools for code-mixed data is rapidly gaining popularity in the ...
research
03/15/2017

Is this word borrowed? An automatic approach to quantify the likeliness of borrowing in social media

Code-mixing or code-switching are the effortless phenomena of natural sw...
research
11/28/2017

Surfacing contextual hate speech words within social media

Social media platforms have recently seen an increase in the occurrence ...
research
10/25/2021

Battling Hateful Content in Indic Languages HASOC '21

The extensive rise in consumption of online social media (OSMs) by a lar...
research
07/29/2020

Development of POS tagger for English-Bengali Code-Mixed data

Code-mixed texts are widespread nowadays due to the advent of social med...
