Is this word borrowed? An automatic approach to quantify the likeliness of borrowing in social media

by   Jasabanta Patro, et al.

Code-mixing or code-switching are the effortless phenomena of natural switching between two or more languages in a single conversation. Use of a foreign word in a language; however, does not necessarily mean that the speaker is code-switching because often languages borrow lexical items from other languages. If a word is borrowed, it becomes a part of the lexicon of a language; whereas, during code-switching, the speaker is aware that the conversation involves foreign words or phrases. Identifying whether a foreign word used by a bilingual speaker is due to borrowing or code-switching is a fundamental importance to theories of multilingualism, and an essential prerequisite towards the development of language and speech technologies for multilingual communities. In this paper, we present a series of novel computational methods to identify the borrowed likeliness of a word, based on the social media signals. We first propose context based clustering method to sample a set of candidate words from the social media data.Next, we propose three novel and similar metrics based on the usage of these words by the users in different tweets; these metrics were used to score and rank the candidate words indicating their borrowed likeliness. We compare these rankings with a ground truth ranking constructed through a human judgment experiment. The Spearman's rank correlation between the two rankings (nearly 0.62 for all the three metric variants) is more than double the value (0.26) of the most competitive existing baseline reported in the literature. Some other striking observations are, (i) the correlation is higher for the ground truth data elicited from the younger participants (age less than 30) than that from the older participants, and (ii )those participants who use mixed-language for tweeting the least, provide the best signals of borrowing.


page 1

page 2

page 3

page 4


Automatic Normalization of Word Variations in Code-Mixed Social Media Text

Social media platforms such as Twitter and Facebook are becoming popular...

PESTO: Switching Point based Dynamic and Relative Positional Encoding for Code-Mixed Languages

NLP applications for code-mixed (CM) or mix-lingual text have gained a s...

Lexical Normalization for Code-switched Data and its Effect on POS-tagging

Social media provides an unfiltered stream of user-generated input, lead...

Harnessing Code Switching to Transcend the Linguistic Barrier

Code mixing (or code switching) is a common phenomenon observed in socia...

Tuiteamos o pongamos un tuit? Investigating the Social Constraints of Loanword Integration in Spanish Social Media

Speakers of non-English languages often adopt loanwords from English to ...

Shared Lexical Items as Triggers of Code Switching

Why do bilingual speakers code-switch (mix their two languages)? Among t...

Please sign up or login with your details

Forgot password? Click here to reset