Language Identification of Hindi-English tweets using code-mixed BERT

07/02/2021
by   Mohd Zeeshan Ansari, et al.

Language identification of social media text has been an active research problem in recent years. In non-English-speaking states, social media messages are predominantly code-mixed. Prior knowledge acquired by pre-training contextual embeddings has shown state-of-the-art results on a range of downstream tasks. Recently, models such as BERT have shown that, given a large amount of unlabeled data, pretrained language models are even more beneficial for learning common language representations. This paper presents extensive experiments that exploit transfer learning and fine-tune BERT models to identify language on Twitter. The work uses a collection of Hindi-English-Urdu code-mixed text for language pre-training and Hindi-English code-mixed text for subsequent word-level language classification. The results show that representations pre-trained on code-mixed data produce better results than their monolingual counterparts.
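Word-level classification with a BERT-style model usually has to reconcile one label per word with a tokenizer that may split a word into several subword pieces. The sketch below illustrates one common way to do this and is not taken from the paper: the tag set `LABELS`, the splitter `toy_subword_split` (a stand-in for a real subword tokenizer), the ignore value `-100`, and the example sentence are all assumptions chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Assumed conventions for this sketch (not from the paper).
IGNORE = -100                               # label id skipped when scoring pieces
LABELS: Dict[str, int] = {"hi": 0, "en": 1}  # hypothetical Hindi/English tag set


def toy_subword_split(word: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in for a subword tokenizer: fixed-size chunks.

    Continuation pieces are prefixed with '##'. Words must be non-empty.
    """
    chunks = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [chunks[0]] + ["##" + c for c in chunks[1:]]


def align_labels(
    words: List[str],
    tags: List[str],
    split: Callable[[str], List[str]] = toy_subword_split,
) -> Tuple[List[str], List[int]]:
    """Expand word-level tags to piece level.

    The first piece of each word carries the word's label id; every
    continuation piece gets IGNORE so it does not count as a separate word.
    """
    pieces: List[str] = []
    label_ids: List[int] = []
    for word, tag in zip(words, tags):
        sub = split(word)
        pieces.extend(sub)
        label_ids.extend([LABELS[tag]] + [IGNORE] * (len(sub) - 1))
    return pieces, label_ids


def word_level_predictions(label_ids: List[int], piece_preds: List[int]) -> List[int]:
    """Keep only the prediction made at the first piece of each word."""
    return [pred for lab, pred in zip(label_ids, piece_preds) if lab != IGNORE]


# Illustrative romanized Hindi-English sentence with one tag per word.
words = ["mujhe", "weekend", "pe", "movie", "dekhni", "hai"]
tags = ["hi", "en", "hi", "en", "hi", "hi"]
pieces, label_ids = align_labels(words, tags)
```

With this alignment, a token classifier can be trained on `label_ids` while skipping positions marked `IGNORE`, and `word_level_predictions` maps piece-level outputs back to one language tag per word.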


research
05/25/2023

Comparative Study of Pre-Trained BERT Models for Code-Mixed Hindi-English Data

The term "Code Mixed" refers to the use of more than one language in the...
research
04/13/2021

Understanding Transformers for Bot Detection in Twitter

In this paper we shed light on the impact of fine-tuning over social med...
research
04/18/2022

L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models

Code-switching occurs when more than one language is mixed in a given se...
research
08/31/2016

Demographic Dialectal Variation in Social Media: A Case Study of African-American English

Though dialectal language is increasingly abundant on social media, few ...
research
07/11/2020

Feature Selection on Noisy Twitter Short Text Messages for Language Identification

The task of written language identification involves typically the detec...
research
05/11/2021

Role of Artificial Intelligence in Detection of Hateful Speech for Hinglish Data on Social Media

Social networking platforms provide a conduit to disseminate our ideas, ...
research
12/07/2020

An Empirical Survey of Unsupervised Text Representation Methods on Twitter Data

The field of NLP has seen unprecedented achievements in recent years. Mo...
