Feature Selection on Noisy Twitter Short Text Messages for Language Identification

07/11/2020
by Mohd Zeeshan Ansari, et al.

Written language identification is the task of detecting the languages present in a sample of text. A sequence of text need not belong to a single language; it may be a mixture of text written in multiple languages. Social media platforms generate such text in large volumes because of their flexible and user-friendly environment. Such text contains a very large number of features, which are essential for developing statistical, probabilistic, and other kinds of language models. Among these features, some are informative while others are irrelevant or redundant, and they affect the performance of the learning model in different ways. Feature selection methods are therefore important for choosing the features that are most relevant for an efficient model. In this article, we consider the Hindi-English language identification task, as Hindi and English are two of the most widely spoken languages in India. We apply different feature selection algorithms across various learning algorithms to analyze the effect of both the algorithm and the number of features on task performance. The methodology focuses on word-level language identification using a novel dataset of 6903 tweets extracted from Twitter. Various n-gram profiles are examined with different feature selection algorithms over many classifiers. Finally, an exhaustive comparative analysis of the overall experiments conducted for the task is presented.
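
To make the idea of feature selection over n-gram profiles concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the abstract does not name the feature selection algorithms or classifiers used, so the chi-square statistic is assumed here purely for illustration. The function names (`char_ngrams`, `chi_square_scores`, `select_top_k`) and the six toy words are hypothetical and are not drawn from the paper's dataset.

```python
from collections import Counter


def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word, padded with '_' at both ends."""
    padded = f"_{word.lower()}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def chi_square_scores(samples, n=2):
    """Score each n-gram by the chi-square statistic of its 2x2
    presence-by-label contingency table.

    samples: list of (word, label) pairs with exactly two distinct labels.
    """
    labels = sorted({label for _, label in samples})
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two labels")
    first = labels[0]
    total = len(samples)
    n_first = sum(1 for _, label in samples if label == first)
    n_second = total - n_first

    present_first = Counter()  # words of the first label containing the n-gram
    present_all = Counter()    # words of either label containing the n-gram
    for word, label in samples:
        for gram in char_ngrams(word, n):
            present_all[gram] += 1
            if label == first:
                present_first[gram] += 1

    scores = {}
    for gram, n_present in present_all.items():
        a = present_first[gram]  # present, first label
        b = n_present - a        # present, second label
        c = n_first - a          # absent, first label
        d = n_second - b         # absent, second label
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[gram] = total * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring n-grams; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:k]]


# Hypothetical word-level samples: romanized Hindi ("hi") and English ("en").
SAMPLES = [
    ("hai", "hi"), ("nahi", "hi"), ("kya", "hi"),
    ("the", "en"), ("this", "en"), ("that", "en"),
]

if __name__ == "__main__":
    bigram_scores = chi_square_scores(SAMPLES, n=2)
    print(select_top_k(bigram_scores, k=5))
```

The selected n-grams would then serve as the reduced feature set for a word-level classifier. Varying `n` and `k` corresponds to the kind of comparison across n-gram profiles and feature counts that the abstract describes.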
