A Simple and Efficient Probabilistic Language model for Code-Mixed Text

06/29/2021
by M Zeeshan Ansari, et al.

Conventional natural language processing approaches are not well suited to social media text because of its colloquial discourse and non-homogeneous characteristics. Language identification in a multilingual document is a preliminary subtask in several information extraction applications, such as information retrieval, named entity recognition, and relation extraction. The problem is often more challenging in code-mixed documents, wherein foreign-language words are drawn into the base language while framing the text. Word embeddings are powerful language modeling tools for representing text documents and are useful for obtaining similarity between words or documents. We present a simple probabilistic approach for building efficient word embeddings for code-mixed text and exemplify it on language identification of Hindi-English short text messages scraped from Twitter. We examine its efficacy for the classification task using bidirectional LSTMs and SVMs and observe improved scores over various existing code-mixed embeddings.

research
11/23/2020

Evaluating Input Representation for Language Identification in Hindi-English Code Mixed Text

Natural language processing (NLP) techniques have become mainstream in t...
research
04/02/2023

MMT: A Multilingual and Multi-Topic Indian Social Media Dataset

Social media plays a significant role in cross-cultural communication. A...
research
07/06/2018

Natural Language Processing for Information Extraction

With rise of digital age, there is an explosion of information in the fo...
research
06/23/2021

Clinical Named Entity Recognition using Contextualized Token Representations

The clinical named entity recognition (CNER) task seeks to locate and cl...
research
05/10/2012

Discrimination of English to other Indian languages (Kannada and Hindi) for OCR system

India is a multilingual multi-script country. In every state of India th...
research
10/26/2018

Automatic Identification and Ranking of Emergency Aids in Social Media Macro Community

Online social microblogging platforms including Twitter are increasingly...
research
08/30/2016

Language Detection For Short Text Messages In Social Media

With the constant growth of the World Wide Web and the number of documen...
