Harnessing Code Switching to Transcend the Linguistic Barrier

Code mixing (or code switching) is a common phenomenon observed in social-media content generated by a linguistically diverse user-base. Studies show that in the Indian sub-continent, a substantial fraction of social media posts exhibit code switching. While the difficulties posed by code mixed documents to further downstream analyses are well-understood, lending visibility to code mixed documents under certain scenarios may have utility that has been previously overlooked. For instance, a document written in a mixture of multiple languages can be partially accessible to a wider audience; this could be particularly useful if a considerable fraction of the audience lacks fluency in one of the component languages. In this paper, we provide a systematic approach to sample code mixed documents leveraging a polyglot embedding based method that requires minimal supervision. In the context of the 2019 India-Pakistan conflict triggered by the Pulwama terror attack, we demonstrate an untapped potential of harnessing code mixing for human well-being: starting from an existing hostility diffusing hope speech classifier solely trained on English documents, code mixed documents are utilized as a bridge to retrieve hope speech content written in a low-resource but widely used language - Romanized Hindi. Our proposed pipeline requires minimal supervision and holds promise in substantially reducing web moderation efforts.


1 Introduction

Analyzing geopolitical events through the lens of social media is a highly active research domain. From referendums with far-reaching political consequences (e.g., Brexit [1]) to sensitive and highly polarizing issues like mass shootings in the US [2], large scale social media analysis has the potential to offer important insights to political and social scientists. While social media discussions lend a great platform to exchange ideas, share opinions and debate issues, tackling online attacks targeted at certain individuals, or communities forms an important modern-day social media challenge to ensure human well-being.

The typical approach to moderating online hate consists of detecting hate speech for subsequent moderation. However, one recent line of work argued in favor of identifying positive content in the context of heated online discussions between nuclear adversaries on the brink of waging full-fledged war [3]. In a substantial corpus of 2.04 million YouTube comments on videos relevant to the 2019 India-Pakistan conflict triggered by the Pulwama terror attack, [3] advocates the importance of hostility-diffusing comments and defines a new task of detecting hostility-diffusing hope speech.

While [3] presents an important study of a modern conflict between two nuclear adversaries with a long, acrimonious history and a grim projection of consequences should there be a full-blown war (100 million projected deaths as forecast by [4]), typical of several studies conducted in linguistically diverse regions, the focus is largely restricted to the English subset of the comments (921,235 English comments posted by 392,460 users). With a combined base of more than 500 million speakers of Hindi in India and Pakistan as opposed to nearly 250 million speakers of English, and a considerable fraction using Romanized Hindi on the web, extending this result to the Romanized Hindi subset of comments has understandable benefits. However, such omissions are common in social analyses due to a lack of NLP (Natural Language Processing) resources in the native language. Most NLP pipelines are monolingual: state-of-the-art POS (Part-of-Speech) taggers, parsers, or NER (Named Entity Recognition) taggers are rarely trained to handle multi-lingual documents and largely focus on English. Moreover, apart from typical colloquial style social media content, the presence of code mixing [5, 6], i.e., seamless alternation between multiple languages within the same document boundary (e.g., a tweet, or a comment on a YouTube video), makes the task substantially more challenging. (Throughout the paper, we use the terms code switching and code mixing interchangeably.)

While the challenges posed by code switching to downstream analyses are well-documented, in this paper, we focus on a largely under-explored research question: How can code switching be harnessed for social good and human well-being?

Our research question is motivated by a simple intuition that a short text document is likely to express a consistent sentiment; if reliable linguistic separation of such code mixed documents can be achieved, the Hindi portion of the comments can be further harnessed to explore similar comments in the Hindi subset for which we require no further training (the hope speech classifier we used in this paper is trained on English comments). Effectively, our method uses the Hindi portions of code mixed comments as a seed set of comments to mine similar content authored in Hindi. Our approach presents a compelling case study on how code switching can be harnessed as a bridge to detect peace-seeking content written in a low-resource language. A reliable system to identify code mixed hope speech documents has additional untapped benefits. Intuitively, a code mixed document written in two dominant languages in a linguistically diverse region is likely to be partially accessible to a wider set of audience.

Contributions: Our contributions are the following.

  1. Human well-being: We focus on an important task of detecting hostility-diffusing hope speech [3]. Social media will play an increasingly important role in understanding and analyzing modern conflicts [7]; online discussion between countries with a long history of conflict is an under-studied yet highly important area.

  2. Framework: Code switching is typically viewed as an impediment to effective corpus analysis; to the best of our knowledge, we are the first to highlight its untapped potential in acting as a bridge. While the role of the mother tongue as a conversational lubricant in a code switched environment has been previously studied in educational settings [8], harnessing code switching to effectively sample content from a sub-corpus written in a different language has, to our knowledge, never been explored.

  3. Machine Learning: We leverage recent literature in language identification and Active Sampling to sample documents exhibiting high levels of code mixing and provide an end-to-end pipeline to sample from Romanized Hindi starting with a hope speech classifier trained on English documents. Our results indicate that our approach considerably reduces the moderation effort needed to detect hope speech written mostly in Romanized Hindi.

2 Related Work

Code switching has been a widely studied area in linguistics for nearly half a century [9]. While recent work on analyzing the social aspects of code mixing in online communities is gaining importance [10], typically, code switching is viewed as an impediment to downstream NLP analyses and much of the focus in the community is concentrated in detecting token-level language and switch points [11, 12, 13] for cleaner linguistic separation. To the best of our knowledge, harnessing code switching for social good and human well-being has been largely unexplored. Our work draws inspiration from field-work in classroom settings showing how code switching helps students overcome linguistic barriers and how native tongue is used in a code mixed setting as a “conversational lubricant” [8].

Our work focuses on an important domain of online hostility-diffusion between civilians of nuclear adversaries first studied in [3]. We use several resources presented in the paper (e.g., data set, language identification method with minor modification). However, in [3] the primary focus was mostly restricted to the English subset of comments, whereas in our work, we focus on leveraging an untapped potential of code switching and propose a pipeline to identify hostility-diffusing hope speech from the Hindi sub-corpus, a task not addressed in [3].

The importance of a robust token-level language identification system was explored in [13]. The study demonstrated that typical document-level language identification systems are a poor fit for code mixed documents. In the context of Indian social media, [13] also provided statistics on the use of Romanized Hindi and code-mixed text, revealing significant use of both. Language preferences for expressing opinion were further investigated in [14], which found that negative opinion is often expressed in Hindi, and in [10], which related code switching to Wikipedia editor success.

Several studies have addressed challenges in analyzing code mixed text by using a token-level language-identification step in their NLP pipelines [15, 16, 17]. [17] in particular presents an HMM-based unsupervised token-level language-identification method to analyze code-switching statistics on social media.

Recent studies have used sentence embeddings for sampling comments similar to a “query” document. [18, 19] used sentence embeddings in active learning settings to expand a training data set. In both cases, sampling based on the sentence embeddings yielded a far better rate of acquisition of the desired class compared to random sampling. We utilize the polyglot embeddings themselves as sentence embeddings in our rare positive mining task.

3 Problem Definition

3.1 Task: Hope Speech Detection

We focus on the prediction task of hope speech detection, first proposed in [3] in the context of online discussions relevant to the 2019 India-Pakistan conflict. Aimed at diffusing hostility, a hope speech classifier is a nuanced classifier (precise definition of hope speech with illustrative examples is presented in [3]) to detect content that contains a unifying message focusing on the war’s futility, the importance of peace, and the human and economic costs involved, or expresses criticism of either the author’s own nation’s entities or policies, or the actions or entities of the two involved countries.

Data set: Our data set, $\mathcal{D}$, consists of 2.04 million comments posted by 791,289 users on 2,890 YouTube videos relevant to this India-Pakistan conflict. Our main focus is on the English and Romanized Hindi subsets denoted as $\mathcal{D}_{en}$ (921,235 comments) and $\mathcal{D}_{hi}$ (1,033,908 comments), respectively.

Annotated data set: The hope speech classifier is trained on an annotated data set of 2,277 positive and 7,716 negative English comments, and an in-the-wild performance (on data not belonging to the train or test set) of 84.68% precision was reported.

3.2 An Illustrative Example

To motivate our intuition, we first provide an illustrative example of a code switched comment exhibiting hope speech along with a loose translation. English, Hindi and neutral tokens (e.g., proper nouns, numerals, or technology terms) are color-coded with blue, red and black respectively (color scheme is consistent throughout the paper).

  Comment: I am Indian and I say peace is the only solution ankh k badle ankh mangoge toh sari dunya andhi hojayegi
  Loose translation: I am Indian, and I say peace is the only solution; an eye for an eye makes the whole world blind.

In the above example, both the Hindi and English components exhibit peace-seeking intent. Our main goal in this paper is to harness the Hindi components present in these highly code mixed hope speech comments to detect hope speech in the Hindi sub-corpus. Associated research questions are the following:

  • How can we sample code mixed documents?

  • How can we harness the Hindi part of a code mixed document to sample hope speech from the Hindi portion?

3.3 A Challenging Data Set

Similar to most data sets of noisy, short social media texts generated in a linguistically diverse region, our data set exhibits a considerable presence of out-of-vocabulary (OOV) words, code mixing, and grammar and spelling disfluencies. In addition to these challenges, given that a vast majority of the content contributors do not speak English as their first language, we noticed varying levels of English proficiency in the corpus with a substantial incidence of phonetic spelling errors (e.g., [thankyou pakusta for hiumaniti no war aman ssnti kayam kare] loosely translates to Thank you Pakistan for humanity; let peace prevail.); 32% of the time, the word liar was misspelled as lier. Since Romanized Hindi does not have any standard spelling (e.g., the word aman meaning peace is spelled in the corpus as amun, amaan and aman), a high level of spelling variation added to the challenges.

How hard is it to sample hope speech? On a random sample of 1,000 comments from the Romanized Hindi subset $\mathcal{D}_{hi}$, our annotators found 18 positives (i.e., 1.8%). (For all tasks, two annotators proficient in English, Hindi, Urdu, and Dutch were used. Across all rounds of labeling, the minimum Fleiss’ $\kappa$ measure was high (0.84), indicating strong inter-rater agreement. After independent labeling, differences were resolved through discussion.) This result aligns with results reported in [3] where only 2.45% of randomly sampled English comments were marked as hope speech. Additionally, a previous study of a multilingual Hindi-English tweet corpus observed that Hindi was more commonly used to express negative sentiment [14]. The minuscule presence of hope speech indicates that detecting such content is essentially a rare positive mining task and automated methods are essential.

3.4 Our Pipeline

Research question: How to harness code switching to sample hope speech from the Hindi subset $\mathcal{D}_{hi}$?

Figure 1: System diagram.

A schematic diagram of our pipeline to sample hope speech from the Hindi subset, $\mathcal{D}_{hi}$, is presented in Figure 1. Our pipeline consists of the following steps.

  1. Identify the subset with substantial code mixing, $\mathcal{D}_{CM}$, from the full data set $\mathcal{D}$.

  2. Run the hope speech classifier (trained on annotated English comments) on $\mathcal{D}_{CM}$ and construct the subset containing comments predicted as hope speech.

  3. Construct a Hindi seed set by transforming each comment in the predicted subset, discarding any tokens not written in Romanized Hindi.

  4. Using the Hindi seed set as the seed set, retrieve the nearest neighbors in the comment embedding space from $\mathcal{D}_{hi}$.

  5. Manually inspect the obtained sampled comments to detect hope speech.

Steps 1, 2, 3, and 4 require minimal manual supervision. Step 5 is the only step that requires substantial manual effort. Our results indicate that we obtained a nearly 10-fold improvement over our baseline.
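The automated steps above can be sketched as a single function that wires the stages together. This is only an illustrative skeleton: the function name `sample_hindi_hope_candidates` and its four callable arguments are hypothetical placeholders for the CMI estimator, the English-trained classifier, the Hindi extractor, and the nearest-neighbor sampler; only the 0.4 threshold is taken from the paper.

```python
from typing import Callable, List


def sample_hindi_hope_candidates(
    comments: List[str],
    estimate_cmi: Callable[[str], float],      # hypothetical: estimated CMI of a comment
    is_hope_speech: Callable[[str], bool],     # hypothetical: English-trained classifier
    extract_hindi: Callable[[str], str],       # hypothetical: drops non-Hindi tokens
    nn_sample: Callable[[List[str]], List[str]],  # hypothetical: neighbors from the Hindi pool
    cmi_threshold: float = 0.4,
) -> List[str]:
    """Return candidate Hindi comments for manual inspection (step 5)."""
    # Step 1: keep comments with substantial estimated code mixing.
    code_mixed = [c for c in comments if estimate_cmi(c) >= cmi_threshold]
    # Step 2: keep code mixed comments predicted as hope speech.
    predicted_hope = [c for c in code_mixed if is_hope_speech(c)]
    # Step 3: reduce each comment to its Hindi sub-part, dropping empty results.
    hindi_seeds = [s for s in (extract_hindi(c) for c in predicted_hope) if s]
    # Step 4: expand the seed set with nearest neighbors from the Hindi pool.
    return nn_sample(hindi_seeds)
```

Passing the stages in as callables mirrors the paper's remark that any step can be swapped for a better-performing component without disturbing the flow.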

4 Methods and Results

Research question: how to sample code mixed documents?

4.1 Code Mixing Index (CMI)

We used a well-known metric to measure the extent of code switching in a document, the Code Mixing Index (CMI), first proposed in [11]. Essentially, CMI measures the presence of a dominant language in a document. Let a document $d$ expressed with $n$ different languages, $L_1, \ldots, L_n$, and neutral tokens be represented as a sequence of $k$ words: $w_1, \ldots, w_k$. Let $lang(w_j)$ return the language of word $w_j$ (or neutral if it is a neutral token), and let $u$ denote the number of neutral tokens in $d$. For each language $L_i$, $count(L_i)$ denotes the total number of utterances of $L_i$ in the document, i.e., $count(L_i) = \sum_{j=1}^{k} \mathbb{I}(lang(w_j) = L_i)$, where $\mathbb{I}$ is the indicator function. The CMI of the document $d$, CMI($d$), is measured as:

$$\mathrm{CMI}(d) = 1 - \frac{\max_{i} count(L_i)}{k - u}.$$

In the boundary condition, where every word in the document is a neutral token ($k = u$), CMI is defined as 0; hence, $0 \le \mathrm{CMI}(d) < 1$. A low CMI value indicates minimal code switching, i.e., the document is almost entirely written in the dominant language. Understandably, when $n = 2$, the highest possible CMI is 0.5, indicating equal presence of the two component languages. When $lang(\cdot)$ is estimated using a language identification method, we denote the estimated CMI of a document as $\widehat{\mathrm{CMI}}(d)$.

We now illustrate with an example: [bilkul sahi baat kahi aapne imran khan saab please please no more war only peace] (loosely translates to You’ve spoken the absolute truth Mr. Imran Khan, please no more war, only peace.). In this example, $count(\text{English}) = 7$, $count(\text{Hindi}) = 6$, $k = 15$, and $u = 2$. Hence, CMI of the document is $1 - \frac{7}{15 - 2} = 0.46$. We considered documents with $\widehat{\mathrm{CMI}}$ greater than or equal to 0.4 as documents exhibiting significant code mixing.
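As a minimal sketch of the definition above, the following function computes CMI from a list of per-token language labels. The label strings (`"en"`, `"hi"`, `"neutral"`) and the function name `cmi` are illustrative choices, and the labels for the example comment are assigned by hand to match the counts quoted in the text.

```python
from collections import Counter

NEUTRAL = "neutral"  # illustrative label for neutral tokens


def cmi(labels):
    """Code Mixing Index of one document, given one language label per token."""
    # Count tokens per language, ignoring neutral tokens.
    counts = Counter(label for label in labels if label != NEUTRAL)
    non_neutral = sum(counts.values())
    # Boundary condition: every token is neutral (or the document is empty).
    if non_neutral == 0:
        return 0.0
    # One minus the share of the dominant language among non-neutral tokens.
    return 1.0 - max(counts.values()) / non_neutral


# bilkul sahi baat kahi aapne | imran khan | saab | please please no more war only peace
example_labels = ["hi"] * 5 + [NEUTRAL] * 2 + ["hi"] + ["en"] * 7
print(round(cmi(example_labels), 2))  # 0.46
```

A document would then count as significantly code mixed when this value, computed from estimated labels, is at least 0.4.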

Figure 2: A t-SNE [20] plot of the polyglot document-embedding space. The code-mixed region (black) lies between the Hindi (red) and English (blue) language clusters.

4.2 Estimating CMI

In order to sample documents with high $\widehat{\mathrm{CMI}}$, we need a reliable token-level language identification module. We used the polyglot-embedding based method proposed in [3] (we denote this method by $\mathcal{L}_{polyglot}$). We chose $\mathcal{L}_{polyglot}$ because it requires minimal supervision and is particularly well-suited for noisy social media texts [19]. In particular, $\mathcal{L}_{polyglot}$ involves obtaining the document embeddings, and then using $k$-Means on these embeddings. The method is shown to reveal highly precise language clusters. Previous use-cases of $\mathcal{L}_{polyglot}$ were limited to document-level language identification. In our experiments we found that without any significant modification, the technique is capable of token-level language identification with considerable accuracy. Our token-level language identification follows the same method presented in [3]. We consider a token as a single-word document, obtain its embedding, and assign it the language of the nearest cluster center in the document embedding space.

Pakistan, he, army, media, Modi, Pak, Pakistani, Kashmir, pilot, attack, video, news, khan, jai, 2, hind, Imran, Muslim, sir, 1
Table 1: Top 20 neutral words by frequency detected by $\mathcal{L}_{polyglot}$, the polyglot-embedding based language identifier.

Detecting neutral tokens: Neutral tokens are identified using a simple heuristic: for a two-language scenario, a token is marked neutral if it is approximately equidistant from the two respective cluster centers. For a given token, $w$, let the Euclidean distance of $w$ from the English cluster center and the Hindi cluster center in the comment embedding space be represented as $d_{en}$ and $d_{hi}$, respectively. Let the distance between the two cluster centers be expressed as $d_c$. Then $lang(w)$ = neutral iff $d_{en}$ and $d_{hi}$ are approximately equal, as judged against $d_c$ and a tolerance threshold.

Table 1 shows the top 20 (ranked by frequency) neutral tokens detected by our method when the tolerance threshold is set to 0.1. They broadly include proper nouns (e.g., Modi, Khan, Pakistan), numerals (e.g., 1), technical terms (e.g., video) and overloaded words (e.g., he; he in Hindi is the verb is, and is the third-person singular masculine pronoun in English).
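The nearest-cluster-center assignment and the neutral-token heuristic can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the two-dimensional vectors are toy stand-ins for polyglot embeddings, the function name `identify_token` is hypothetical, and the specific neutral rule used here (the two distances differ by at most `eps` times the distance between the centers) is an assumption made for this sketch.

```python
import numpy as np


def identify_token(vec, center_en, center_hi, eps=0.1):
    """Label a token embedding as 'en', 'hi', or 'neutral'."""
    d_en = float(np.linalg.norm(vec - center_en))      # distance to English center
    d_hi = float(np.linalg.norm(vec - center_hi))      # distance to Hindi center
    d_c = float(np.linalg.norm(center_en - center_hi))  # distance between centers
    # ASSUMED neutral rule: approximately equidistant relative to d_c.
    if abs(d_en - d_hi) <= eps * d_c:
        return "neutral"
    # Otherwise assign the language of the nearest cluster center.
    return "en" if d_en < d_hi else "hi"


# Toy cluster centers; real centers would come from k-Means on embeddings.
center_en = np.array([0.0, 0.0])
center_hi = np.array([10.0, 0.0])

print(identify_token(np.array([1.0, 0.5]), center_en, center_hi))   # en
print(identify_token(np.array([9.0, -0.5]), center_en, center_hi))  # hi
print(identify_token(np.array([5.2, 3.0]), center_en, center_hi))   # neutral
```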

Code switched hope speech: I am Pakistani agar ap dono country ny war karne ha to gurbat khatam karnay ke war karo dono countries bht gareeb hain plzz dont do war war is not solution of peace
Loose translation: I am Pakistani. If both countries have to wage a war, wage a war to end poverty; both countries are very poor. Please, do not war, war is not solution of peace.

Code switched hope speech: please media walo nafrat phailana chhod do we want peace only jai hind
Loose translation: Please media folks, stop spreading hate. We want peace; hail India!

Code switched hope speech: absolutely right i think ab netao kee jung ko ham dono mumalik ne rad karna hai we people of both countries want peace peace and peace
Loose translation: Absolutely right. I think politicians’ war has to be prevented by common people of both countries. We people of both countries want peace, peace, and peace.

Table 2: Random sample of code mixed hope speech obtained by the hope speech classifier run on the code mixed subset $\mathcal{D}_{CM}$.

Code switched hope speech: bhai ap bhi khuch rahiye allah apki har farmaish puri kare and I repeat again bhai I love you all my dear brothers and sisters mujhe ekh dusre se indian or pakistaani keh kar bulana bilkul pasand nahi hum sab bhai or bhen hai or rahenge we are good humans of earth
Loose translation: Brother, you also be happy. May Allah grant all your wishes and I repeat again, brother, I love you all my dear brothers and sisters. I don’t like to identify each other as Indian or Pakistani, we are all brothers and sisters, we are good humans of earth.

Code switched hope speech: galiyan dene se kya hoga beach me to begunah awam mare gii orr ham jo jang jang krte henn kha jang itnii asan he no this war is end of the world because Ind an Pak is newclear states
Loose translation: Nothing will come out of abusing, we will cry for war while innocent civilians will die. Who said that war is easy? No, this war is end of the world because Ind and Pak are nuclear states.

Code switched hope speech: daikhou dosto apaas me baahss bazii maat kro plz such me I have love for both Pakistan and India bus apaas me muhabaatey rakhou I m student of 9th lakn me muhabaat chataa hu donoo countries me choroo fazoul ki nafrateey love you Pakistan and my neighbour country India
Loose translation: Look friends, stop quarrelling among each other. Please, for real, I have love for both Pakistan and India, just harbor love between each other. I am a student of 9th grade, but I want love between both countries. Leave this useless hate, love you Pakistan and my neighbor country India.

Table 3: Random sample of hope speech obtained through NN-Sample seeded with code mixed comments.
Corpus | CMI | $\widehat{\mathrm{CMI}}$ | Overall RMSE
0.03 0.04 0.05
0.10 0.12
0.38 0.45
Table 4: CMI estimation root mean squared error.
Hope speech: khuda kare dono mulko ke beech aman ho jaye hum yahi chahate hai
Loose translation: God willing, peace between two countries happens. I want that.

Hope speech: Europe main sub aik sath rahe aur kitne mulk aik sath rehte hain aur aik hum hai ka larte hi rahe ga kab samjhe ga hum
Loose translation: In Europe, so many countries are living together. Whereas we are still fighting, when will we understand?

Hope speech: koi mulk nhi chahta aur na kisi mulk ki awaam chaahti ki uske padosi mulk se ki aapsi taaluqaat kharaab ho
Loose translation: No country or its civilians want that the relationship with its neighboring country sours.

Table 5: Random sample of hope speech obtained through NN-Sample seeded with the Hindi sub-parts of code mixed comments.
Rows: True Label; columns: Predicted Label
702 325 144
334 4690 56
85 148 3235
Table 6: Confusion matrix of token-level performance evaluation of $\mathcal{L}_{polyglot}$, the polyglot-embedding based language identifier, on 300 annotated comments.

On a data set of 300 comments with gold standard token-level annotation, we found that $\mathcal{L}_{polyglot}$ performs token-level language detection with considerable accuracy. As shown in Table 6, the overall accuracy of $\mathcal{L}_{polyglot}$ is 88.76%.
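The reported overall accuracy follows directly from the confusion matrix in Table 6: correctly labeled tokens lie on the diagonal. The short check below reproduces the figure; the variable names are illustrative, and no class names are attached to the rows because the arithmetic does not depend on them.

```python
import numpy as np

# Token-level confusion matrix from Table 6 (rows: true label, columns: predicted label).
confusion = np.array([
    [702, 325, 144],
    [334, 4690, 56],
    [85, 148, 3235],
])

correct = int(np.trace(confusion))   # tokens on the diagonal: 8627
total = int(confusion.sum())         # all annotated tokens: 9719
accuracy = correct / total
print(f"{accuracy:.2%}")  # 88.76%
```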

Estimating CMI: Once the reliability of token-level language identification by $\mathcal{L}_{polyglot}$ is established, we next evaluate the reliability of the estimate for CMI (i.e., $\widehat{\mathrm{CMI}}$). We first define the subset with substantial estimated code mixing, $\mathcal{D}_{CM}$, as the following: $\mathcal{D}_{CM} = \{d \text{ s.t. } d \in \mathcal{D}_{en} \cup \mathcal{D}_{hi} \text{ and } \widehat{\mathrm{CMI}}(d) \ge 0.4\}$, i.e., the document is either part of the English or Romanized Hindi subset and its estimated CMI is high, indicating nearly equal presence of Hindi and English. As shown in Figure 2, $\mathcal{D}_{CM}$ indeed falls in the overlapping region of the English and Hindi clusters. We manually inspected and annotated 1,000 randomly sampled comments from $\mathcal{D}_{CM}$ and found that 95.9% of comments exhibited code switching. We further obtained token-level consensus labels for 100 randomly sampled comments each from $\mathcal{D}_{en}$, $\mathcal{D}_{hi}$, and $\mathcal{D}_{CM}$. Table 4 compares the ground truth CMI and $\widehat{\mathrm{CMI}}$ and demonstrates that we achieved a reasonable approximation of true CMI using $\mathcal{L}_{polyglot}$.
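Both operations in this paragraph are simple to state in code: selecting documents whose estimated CMI is at least 0.4, and scoring the estimate against ground truth with root mean squared error. The sketch below uses made-up comment identifiers and CMI values purely for illustration; the function names are hypothetical.

```python
import math


def select_code_mixed(estimated_cmi, threshold=0.4):
    """Return ids of documents whose estimated CMI meets the threshold."""
    return [doc_id for doc_id, value in estimated_cmi.items() if value >= threshold]


def rmse(true_values, estimated_values):
    """Root mean squared error between true and estimated CMI values."""
    if len(true_values) != len(estimated_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - e) ** 2 for t, e in zip(true_values, estimated_values)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative values only, not taken from the corpus.
estimated = {"c1": 0.46, "c2": 0.05, "c3": 0.40}
print(select_code_mixed(estimated))  # ['c1', 'c3']
```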

4.3 Sampling Hope Speech From $\mathcal{D}_{hi}$

Research question: How to harness the Hindi part of a code mixed document to sample hope speech from the Hindi subset $\mathcal{D}_{hi}$?

Once we identify a comment subset with substantial code mixing, $\mathcal{D}_{CM}$, obtaining hope speech comments using an off-the-shelf hope speech classifier is straightforward. Out of 36,969 comments in $\mathcal{D}_{CM}$, the classifier predicted a set of 199 comments as positives. Upon manual annotation, we obtained 149 positives, i.e., 74.87% positives. Understandably, due to the presence of code switching, the in-the-wild precision in $\mathcal{D}_{CM}$ is lower than the previously reported [3] in-the-wild precision of 84.68% in the English subset $\mathcal{D}_{en}$. Table 2 lists a subset of randomly sampled comments from the predicted set. We noted that the Hindi components of the comments were consistent with the overall sentiment of the comments.

A noisy approximation of the Hindi sub-part of these comments can be obtained by discarding non-Hindi tokens.

  Comment: I love India I am Pakistani mein amun chahta hon khuda ke waste jang nai peace peace peace
  Loose translation: I love India, I am Pakistani. I want peace for God’s sake, not war, peace peace peace.

For instance, the above comment is transformed into [mein amun chahta hon khuda ke jang nai] (loosely translates to I want peace for God not war) when we discard non-Hindi tokens using the polyglot-embedding based language identifier $\mathcal{L}_{polyglot}$. Waste is both a valid English and Hindi word (meaning sake), and the language detector errs by labeling it as English. We admit that it is possible to use more sophisticated methods to extract Hindi that consider context (e.g., considering context to assign a label to a fence word) and possibly squeeze more performance out of the pipeline. However, we are primarily interested in establishing a blueprint for harnessing code switching for social good and testing the robustness of our pipeline without resorting to performance-driven engineering. In every step of our pipeline, a better-performing algorithm (e.g., a better language detection module, a more sophisticated method to extract Hindi, more powerful comment embeddings, a more effective sampling technique) can be plugged in without disturbing the flow and with a possibility of performance improvement.
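The transformation above amounts to a filter over per-token language labels. The sketch below reproduces the example; the function name `extract_hindi` is hypothetical, and the labels are assigned by hand to mimic the detector's output described in the text, including the mislabeling of waste as English.

```python
def extract_hindi(tokens, labels):
    """Keep only tokens labeled as Hindi, preserving their order."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return " ".join(tok for tok, lab in zip(tokens, labels) if lab == "hi")


comment = ("I love India I am Pakistani mein amun chahta hon "
           "khuda ke waste jang nai peace peace peace")
tokens = comment.split()

# Hand-assigned labels standing in for the detector's predictions.
labels = (["en", "en", "neutral", "en", "en", "neutral"]  # I love India I am Pakistani
          + ["hi"] * 6                                     # mein amun chahta hon khuda ke
          + ["en"]                                         # waste (detector error)
          + ["hi"] * 2                                     # jang nai
          + ["en"] * 3)                                    # peace peace peace

print(extract_hindi(tokens, labels))  # mein amun chahta hon khuda ke jang nai
```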

Active Sampling: Once we extract the Hindi sub-parts of the predicted hope speech comments, our next task is to find comments in the Hindi subset $\mathcal{D}_{hi}$ that are similar to these Hindi sub-parts. To this end, we use a recently proposed [19] Active Sampling algorithm which samples nearest neighbors in the comment embedding space to identify rare positives. Our choice of this Active Sampling technique is motivated by its effectiveness in mining rare positives and its reported robustness to spelling variations, which is particularly critical because our corpus contains noisy social media texts and Romanized Hindi does not have standard spelling rules. Following [19], we used cosine distance of the embeddings as the distance measure.

Our sampling algorithm is described in Algorithm 1. This algorithm takes a seed set, $\mathcal{S}$, and a sample pool, $\mathcal{P}$, as inputs and outputs a set, $\mathcal{E}$, containing nearest neighbors of $\mathcal{S}$ in the comment-embedding space. Initially, our expanded set, $\mathcal{E}$, is an empty set. At each step, we expand this set with nearest neighbors that are not already present in the expanded set or the seed set. The function getNearestNeighbor(c, dist) returns the comment in $\mathcal{P}$ with minimum distance from c greater than or equal to dist. The size parameter is set to 5, i.e., for each comment, we add five unique nearest neighbors. We set $\mathcal{P}$ to $\mathcal{D}_{hi}$ since we are interested in detecting hope speech in Hindi.

Baselines: Recall that a random sample of 1,000 comments from $\mathcal{D}_{hi}$ yielded only 1.8% positives; this is our primary baseline method (denoted as random-Sample($\mathcal{D}_{hi}$)).

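Initialization: $\mathcal{E} \leftarrow \{\}$
Main loop: foreach comment $c \in \mathcal{S}$ do
          count $\leftarrow$ 0; dist $\leftarrow$ 0
          while count < size do
                   neighbor $\leftarrow$ getNearestNeighbor(c, dist)
                   if neighbor $\notin \mathcal{E} \cup \mathcal{S}$ then
                            $\mathcal{E} \leftarrow \mathcal{E} \cup \{$neighbor$\}$; count $\leftarrow$ count + 1
                   end if
                   dist $\leftarrow$ distance(c, neighbor) + $\epsilon$
          end while
end foreach
Output: $\mathcal{E}$
Algorithm 1 NN-Sample($\mathcal{S}$, $\mathcal{P}$)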
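A runnable approximation of this sampling loop is sketched below with NumPy. It is an illustration under stated assumptions, not the authors' code: embeddings are plain non-zero vectors, the repeated getNearestNeighbor calls are replaced by a single walk over the pool in order of increasing cosine distance, and seeds are assumed not to be members of the pool, so the seed-set exclusion check is omitted.

```python
import numpy as np


def cosine_distances(query, pool):
    """Cosine distance from one query vector to every row of the pool."""
    query = query / np.linalg.norm(query)
    pool = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    return 1.0 - pool @ query


def nn_sample(seed_vecs, pool_vecs, size=5):
    """Return pool indices of up to `size` new nearest neighbors per seed."""
    expanded = []   # expanded set, kept in insertion order
    seen = set()    # fast membership test for the expanded set
    for seed in seed_vecs:
        # Visit pool items from nearest to farthest for this seed.
        order = np.argsort(cosine_distances(seed, pool_vecs), kind="stable")
        count = 0
        for idx in order:
            if count == size:
                break
            idx = int(idx)
            if idx not in seen:  # skip neighbors already in the expanded set
                seen.add(idx)
                expanded.append(idx)
                count += 1
    return expanded


# Toy 2-D embeddings standing in for comment embeddings.
pool = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0]])
seeds = np.array([[1.0, 0.0], [0.0, 1.0]])
print(nn_sample(seeds, pool, size=2))  # [0, 1, 2, 3]
```

Sorting the pool once per seed is quadratic in practice for large pools; an approximate nearest-neighbor index could be substituted without changing the logic.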
Method Performance

NN-Sample() 17.4%
NN-Sample() 26.2%
NN-Sample() 20.6%
NN-Sample() 30.0%

Table 7: Sampling performance. 500 comments are randomly sampled from every NN-sample(.) set.
Method Performance

NN-Sample() 0.05
NN-Sample() 0.43

Table 8: Estimated CMI comparison.
Figure 3: A 2D visualization showing the sampling results against the embedding space. Discarding non-Hindi tokens retrieves documents with low CMI written mostly in Hindi.

Table 7 compares the performance of our sampling method against the baseline (we do not explicitly mention the sample pool, which is consistently set to the Hindi subset $\mathcal{D}_{hi}$ across all NN-Sample methods). We obtained substantial improvement over the baseline. Both NN-Sample variants seeded with classifier-predicted comments require human inspection only at the last step of our pipeline. Our results indicate that our approach can substantially reduce manual moderation effort in detecting hope speech. Effectively, we sampled hope speech from a Hindi corpus simply relying on a classifier trained on English comments and harnessing code switching as a bridge. In all steps of the pipeline, we perform noisy approximations in estimating CMI, extracting Hindi sub-parts of comments and, of course, detecting hope speech. If we introduce a little more supervision and instead expand the manually annotated hope speech set, performance improves, as expected. Our results indicate that using minimal manual supervision we can sample with as high as 30% accuracy from the Hindi subset $\mathcal{D}_{hi}$.

Research question: What is the benefit of extracting the Hindi sub-part? Both NN-Sample variants seeded with unmodified code mixed comments are outperformed by the corresponding variants seeded with only the Hindi sub-parts (see Table 7). We were curious to analyze whether extracting the Hindi allows sampling from the sub-region of $\mathcal{D}_{hi}$ mostly written in pure Hindi. As shown in Table 8, without removing the non-Hindi part, NN-Sample (intuitively) yielded a set with a high level of code mixing. Nearest neighbors of a code switched comment are likely other code switched comments. However, sampling using just the Hindi sub-part yielded substantially less code mixing: Table 5 shows considerably less code mixing than Table 3. Our intuition that removing non-Hindi tokens is crucial for sampling from the low CMI region of $\mathcal{D}_{hi}$ is further supported by Figure 3, wherein the sample seeded with Hindi sub-parts has a better spread over $\mathcal{D}_{hi}$ while the sample seeded with unmodified code mixed comments is mostly located in the code mixed region.

5 Conclusion

In NLP literature, typically, code switching is viewed as an impediment to downstream analyses. In this paper, we first raise a novel proposition that code switching can be harnessed for social good and human well-being by using it as a bridge to retrieve hostility-diffusing content written in a low-resource language. Our approach is appealing for its minimal supervision requirements. In the context of hostility diffusing hope speech comments, our methods can be used to broaden the reach of such content overcoming the varied language skills of linguistically diverse regions and transcending language barriers.


  • [1] Fabio Celli, Evgeny Stepanov, Massimo Poesio, and Giuseppe Riccardi. Predicting brexit: Classifying agreement is better than sentiment and pollsters. In Proceedings of the Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media (PEOPLES), pages 110–118, 2016.
  • [2] Dorottya Demszky, Nikhil Garg, Rob Voigt, James Zou, Jesse Shapiro, Matthew Gentzkow, and Dan Jurafsky. Analyzing polarization in social media: Method and application to tweets on 21 mass shootings. In Proceedings of NAACL-HLT 2019, pages 2970–3005. ACL, June 2019.
  • [3] Shriphani Palakodety, Ashiqur R. KhudaBukhsh, and Jaime G. Carbonell. Hope speech detection: A computational analysis of the voice of peace. CoRR, abs/1909.12940, 2019.
  • [4] Owen B Toon, Charles G Bardeen, Alan Robock, Lili Xia, Hans Kristensen, Matthew McKinzie, RJ Peterson, Cheryl S Harrison, Nicole S Lovenduski, and Richard P Turco. Rapidly expanding nuclear arsenals in pakistan and india portend regional and global catastrophe. Science Advances, 5(10):eaay5478, 2019.
  • [5] John J Gumperz. Discourse strategies, volume 1. Cambridge University Press, 1982.
  • [6] Carol Myers-Scotton. Dueling languages: Grammatical structure in code-switching. Clarendon Press, 1993.
  • [7] Thomas Zeitzoff. How social media is changing conflict. Journal of Conflict Resolution, 61(9):1970–1991, 2017.
  • [8] Wolfgang Butzkamm. Code-switching in a bilingual history lesson: The mother tongue as a conversational lubricant. International Journal of Bilingual Education and Bilingualism, 1(2):81–99, 1998.
  • [9] Peter Auer. Code-switching in conversation: Language, interaction and identity. Routledge, 2013.
  • [10] Michael Yoder, Shruti Rijhwani, Carolyn Rosé, and Lori Levin. Code-switching as a social act: The case of Arabic Wikipedia talk pages. In Proceedings of the Second Workshop on NLP and Computational Social Science, pages 73–82. ACL, August 2017.
  • [11] Amitava Das and Björn Gambäck. Identifying languages at the word level in code-mixed indian social media text. In Proceedings of the 11th International Conference on Natural Language Processing, pages 378–387, 2014.
  • [12] Shruti Rijhwani, Royal Sequiera, Monojit Choudhury, Kalika Bali, and Chandra Shekhar Maddila. Estimating code-switching on twitter with a novel generalized word-level language detection technique. In Proceedings of ACL 2017, pages 1971–1982, 2017.
  • [13] Spandana Gella, Kalika Bali, and Monojit Choudhury. “ye word kis lang ka hai bhai?” testing the limits of word level language identification. In Proceedings of the 11th International Conference on Natural Language Processing, pages 368–377. NLP Association of India, December 2014.
  • [14] Koustav Rudra, Shruti Rijhwani, Rafiya Begum, Kalika Bali, Monojit Choudhury, and Niloy Ganguly. Understanding language preference for expression of opinion and sentiment: What do hindi-english speakers do on twitter? In Proceedings of EMNLP 2016, pages 1131–1141, 2016.
  • [15] Dong-Phuong Nguyen and A. Seza Dogruoz. Word level language identification in online multilingual communication. In Proceedings of EMNLP 2013, pages 857–862. ACL, 2013.
  • [16] Heba Elfardy and Mona T. Diab. Sentence level dialect identification in arabic. In ACL, 2013.
  • [17] Shruti Rijhwani, Royal Sequiera, Monojit Choudhury, Kalika Bali, and Chandra Shekhar Maddila. Estimating code-switching on twitter with a novel generalized word-level language detection technique. In Proceedings of the ACL 2017, pages 1971–1982. ACL, July 2017.
  • [18] Ashutosh Kumar, Satwik Bhattamishra, Manik Bhandari, and Partha Talukdar. Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation. In Proceedings of NAACL-HLT 2019, pages 3609–3619, 2019.
  • [19] Shriphani Palakodety, Ashiqur R. KhudaBukhsh, and Jaime G. Carbonell. Voice for the voiceless: Active sampling to detect comments supporting the rohingyas. CoRR, abs/1910.03206, 2019.
  • [20] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.