The exploration of social media and the exponential growth of its data offer various insights into language use. One example is the phenomenon known as code-mixing, where people from multilingual backgrounds mix two languages while communicating on social media. As traditional NLP tools often fail or give poor results on such data (Bergsma et al., 2012), identifying language tags with high accuracy becomes an important prerequisite. Language identification on such data at both the word and sentence level remains a challenging task due to several factors, such as inconsistent phonetic transliteration, borrowing of words, spelling errors, intra-sentential switching of matrix and embedded languages, and the use of numeric and special characters in words.
Here, we discuss our system for word-level language identification of low-resource Bengali-English code-mixed data collected from Twitter (https://twitter.com/), where both languages are typed in Roman script. Bengali is an Indo-Aryan language spoken by 8.10% of India's population (https://en.wikipedia.org/wiki/Languages_of_India) and is also the official language of Bangladesh. The script in which Bengali is natively written is the Eastern Nagari script (https://www.omniglot.com/writing/bengali.htm).
Employing LSTMs, two models were developed, one using character-based encoding and the other using Bengali root phones. Combining these two models, two ensemble models were prepared using stacking and thresholding. Finally, performance metrics such as accuracy and macro-averaged F1-score, along with confusion matrices, were used to compare the results. With respect to the related work in Section 2, our system achieved comparable results using a fraction of the data.
2 Related Work
The LID problem in code-mixed texts has already been studied by many researchers in the context of social media communication, using both supervised and unsupervised techniques. Nguyen and Doğruöz (2013) experimented with linear-chain CRFs, but their contextual features were limited to word bigrams. Kim et al. (2014) studied the linguistic behavior of bilingual Twitter users from Qatar, Switzerland and Québec, and acknowledged that code-mixing could not be studied due to the absence of appropriate LID tools. Das and Gambäck (2014) introduced the code-mixing index to evaluate the level of mixing of the languages, and built a Bengali-English LID system using an SVM trained on weighted n-grams, a dictionary, minimum edit distance and a 7-word context window, obtaining an F1-score of 76.37%. Barman et al. (2014) used char-n-grams, capitalization and a dictionary, along with a decision tree trained on word length, to train an SVM model for code-mixed LID; they also used a CRF to capture context. Jhamtani et al. (2014) combined two classifiers into an ensemble model for Hindi-English code-mixed LID: the first classifier used modified edit distance, word frequency and character n-grams as features, while the second used the output of the first for the current word, along with the language and POS tags of neighboring words, to give the final tag. Piergallini et al. (2016) used n-grams along with capitalization as features to train a linear model for Swahili-English word-level LID, reaching an accuracy of 96.5%. Jaech et al. (2016) employed a hierarchical neural model, where a CNN was trained on char2vec and a bi-LSTM on sentence-level word embeddings, creating a system that scored an F1-score of 93.1% on Spanish-English data. Rijhwani et al. (2017) demonstrated a generalized unsupervised model based on HMMs for code-mixed LID over an arbitrarily large set of languages. Singer et al. (2003) used phones and acoustic properties for language identification in speech data. To the best of our knowledge, none of these works have used character or phonetic encoding in an LSTM architecture for a textual language identification task.
3 Data Sets
We collected transliterated Bengali words in Roman script and English words from the ICON 2016 (http://ltrc.iiit.ac.in/icon2016/) and ICON 2017 (https://ltrc.iiit.ac.in/icon2017/) contests and from the code-mixed data used in Mandal and Das (2018). Some additional English words were collected from online resources (https://github.com/first20hours/google-10000-english). The datasets had no intersection. All words were converted to lowercase, and words of length less than 3 or containing any special or numeric characters (e.g. ri8, 2toh) were removed. Words with more than 2 consecutive identical characters were normalized to exactly 2 (e.g. goood to good). Words with both EN and BN components, e.g. journey(ta), an EN word with a BN suffix, were discarded as well. The distribution for training, development and testing was 6632x2, 300x2 and 700x2 words, respectively. It should be noted that the amount of data is comparatively small, as Bengali falls into the low-resource category when dealing with data from social media.
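The filtering and normalization steps above can be sketched as follows; `clean` is a hypothetical helper name, and the exact rules are our reading of the description:

```python
import re

def clean(word):
    """Filter and normalize a raw token as described above (a sketch)."""
    word = word.lower()
    # discard words shorter than 3 characters or containing
    # numeric/special characters (e.g. "ri8", "2toh")
    if len(word) < 3 or not word.isalpha():
        return None
    # collapse runs of 3+ identical characters down to 2 ("goood" -> "good")
    return re.sub(r"(.)\1{2,}", r"\1\1", word)
```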
4 Support Vector Machine Model
As a baseline, we built a supervised model using a Support Vector Machine (SVM) on the bag-of-words (BOW) principle. The training and development sets were merged to create the training data in this case. Character-level bigrams, trigrams and quadgrams were then extracted, and an SVM with linear kernel, as implemented in the scikit-learn package (http://scikit-learn.org/stable/), was trained on them. This system achieved an accuracy of 83.64% on the test data.
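A minimal sketch of this baseline, assuming scikit-learn's `CountVectorizer` and `LinearSVC` with character 2- to 4-grams; the toy word lists here are illustrative, not our training data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Bag-of-words over character bigrams, trigrams and quadgrams,
# fed to a linear-kernel SVM.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 4)),
    LinearSVC(),
)

words = ["khabar", "bhalo", "kemon", "good", "morning", "people"]
labels = [0, 0, 0, 1, 1, 1]  # 0 = BN (romanized), 1 = EN
model.fit(words, labels)
```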
5 Neural Model
We then moved to neural models to achieve better accuracy. Two encoding methods were tried, namely character-based and phone-based; both are described in detail below.
5.1 Character Encoding
We decided to build a neural model based on character encoding to capture patterns at an elementary level. Similar encoding methods have been tested before for character-level NMT in a seq2seq architecture (Lee et al., 2016). To build the encoder, a dictionary mapping each character to its index in the English alphabet was created. The algorithm fetches one character at a time from the word and replaces it with the respective index; for example, good is encoded as [7,15,15,4] and bad as [2,1,4].
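The encoder reduces to a one-liner (hypothetical helper name `char_encode`):

```python
def char_encode(word):
    # map each letter to its 1-based index in the English alphabet
    return [ord(c) - ord("a") + 1 for c in word.lower()]
```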
5.2 Phonetic Encoding
It is evident that different languages have different phonetic properties (https://en.wikipedia.org/wiki/Phonetics), especially when they have distant origins. To exploit this, we developed a language identification neural model based on phonetic encoding. One of our major aims was to obtain such encodings in terms of Bengali root phones. Thus, a phonetic library containing two parts, namely root phones and similar phone groups, was prepared by the authors in 3 steps. First, character-level bigrams and trigrams were collected from the Bengali words in the training set and sorted in non-increasing order of frequency. We did not extend to quadgrams since, from observing the data set, users generally use a maximum of 3 characters to phonetically represent a Bengali character. Then, from the code chart of Bengali Unicode characters (https://unicode.org/charts/PDF/Unicode-10.0/), the pronunciation of each Bengali character written in Roman script was gathered, and a root phone list (RP) was prepared. Consulting the (2,3)-gram frequency table, we noticed that some of the transliterations collected from the Unicode chart are rarely used in real life for phonetic typing; such instances were discarded from the list (e.g. nga, nya, ddha). Next, for each root phone, all the corresponding (phonetically similar) bigrams and trigrams with high frequency were gathered and formed into groups, i.e. similar phone groups (SPG). Some bigrams and trigrams had quite high frequency yet were not included in the SPG because they did not correspond to any of the root phones (e.g. bhi). The phonetic library was finally checked and verified by a linguist. The complete model consisting of RP and SPG is shown in Table 1. In each phonetic group, we set the first phone as the respective root phone for ease of searching and retrieval of the root index.
Table 1: Similar phone groups; the first phone of each group is the root phone.

(aa, a), (i, ee), (u, w), (r, ri), (e), (ai, oi), (o, oo),
(au, ou, ow), (ka, k), (kha, kh), (ga, g), (gha, gh),
(ca, c), (cha, ch), (sa, s, sh), (jha, jh), (bha, bh, v),
(ta, t), (tha, th), (da, d), (dha, dh), (na, n), (pa, p),
(pha, ph, f), (ba, b), (ma, m), (ya, y), (ra, rh),
(la, l), (ja, j, z), (ha, h)
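The first step of the library construction, collecting character bigrams and trigrams sorted by frequency, can be sketched as follows (hypothetical helper `char_ngrams`):

```python
from collections import Counter

def char_ngrams(words, ns=(2, 3)):
    """Count character n-grams and return them in non-increasing
    order of frequency, as used to build the (2,3)-gram table."""
    counts = Counter()
    for w in words:
        for n in ns:
            counts.update(w[i:i + n] for i in range(len(w) - n + 1))
    return counts.most_common()
```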
The encoding routine is based on a variable-length sliding window, with window size n ∈ {3, 2, 1}. In each iteration the algorithm takes n characters, starting with n = 3. If the character cluster appears in the similar phone groups, we append the index of its root phone; otherwise we reduce n by 1 and repeat the process until n = 1. If no n-gram is found in the phone groups, the sentinel value 35 is appended (the number of root phones being 31). In the next iteration, the window starts at the character immediately after the end of the previous match. The algorithm stops when the window has traversed the whole word. Pseudo-code is given below.
while index < len(word):
    for j in (3, 2, 1):
        if word[index:index+j] in SPG:
            enc.append(root_phone_index); index += j; break
    else:
        enc.append(35); index += 1
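A runnable version of this routine, using only a subset of the similar phone groups from Table 1; the index values follow the worked examples in the text, and the full table would populate `SPG` completely:

```python
# Subset of the similar-phone-group table: each member phone maps to the
# index of its root phone (values taken from the examples in the text).
SPG = {
    "kha": 10, "kh": 10,
    "ba": 24, "b": 24,
    "r": 4, "ri": 4,
    "ka": 9, "k": 9,
    "o": 7, "oo": 7,
    "cha": 14, "ch": 14,
    "i": 2, "ee": 2,
}
UNKNOWN = 35  # sentinel for clusters absent from the phonetic library

def phone_encode(word):
    enc, index = [], 0
    while index < len(word):
        for j in (3, 2, 1):  # try trigram, then bigram, then unigram
            chunk = word[index:index + j]
            if len(chunk) == j and chunk in SPG:
                enc.append(SPG[chunk])
                index += j
                break
        else:
            enc.append(UNKNOWN)
            index += 1
    return enc
```

On this subset, both `phone_encode("khabar")` and `phone_encode("khbr")` yield [10, 24, 4], matching the kha-ba-r example discussed in the text.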
On encoding the words of the combined dataset with the above algorithm, the sentinel 35 (i.e. an n-gram absent from the phonetic library) occurred 82 times for the EN data and only once for the BN data. As an example, khabar, meaning 'food', is encoded as [10,24,4], which is essentially kha-ba-r. Moreover khbr, a rarer phonetic typing of the same word, produces an identical encoding. Though this normalizing ability works quite well on many such instances, it fails on others: korchi produces [9,7,4,14,2] while krci, a less accurately typed transliteration of the same word, results in [9,4,13,2]. The normalized root phone frequency graph is shown in Fig 1, where the y-axis denotes the normalized frequency and the x-axis the root phones (in the same order as in Table 1). The 3 spikes at the start are the frequencies of the phones i, e, o, from which we infer that these 3 RP are quite essential when transliterating BN in Roman script. In the rest of the graph, a clear separation between the blue and green lines shows how EN and BN differ in terms of root phones.
We adopted the Long Short-Term Memory (LSTM) variant of the RNN architecture, implemented in Keras (https://keras.io/), to train both of our models. We chose it specifically because it outperforms almost all other models in sequence learning (Greff et al., 2017). Two LSTM models were built, one for each type of encoding. We set the same hyperparameters for both: sigmoid activation, the adam optimizer, binary cross-entropy loss, 500 epochs and a batch size of 1658; all other parameters were kept at their defaults. The significant difference between the two models was the architecture: for character encoding the layer sizes were 15-35-25-1, and for phonetic encoding 15-15-40-1, meaning the input and output layers have 15 nodes and 1 node respectively, with the middle two numbers giving the sizes of the two hidden layers. The encodings were padded to length 15 to make them uniform and finally converted into one-hot vectors. The target labels for Bengali and English were set to 0 and 1, respectively; thus the outputs of the trained models are fuzzy values between them (this can be seen clearly in the scatter plots in the next section).
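The input preparation described here (pad each encoding to length 15, then one-hot it) can be sketched as follows; the helper names and vocabulary size are ours:

```python
SEQ_LEN = 15   # fixed input length
VOCAB = 36     # e.g. indices 0..35 cover the phonetic encoding

def pad(seq, length=SEQ_LEN, pad_value=0):
    # right-pad (or truncate) an encoding to the fixed length
    return (seq + [pad_value] * length)[:length]

def one_hot(seq, size=VOCAB):
    # turn each index into a one-hot vector
    return [[1 if i == idx else 0 for i in range(size)] for idx in seq]

x = one_hot(pad([10, 24, 4]))  # 15 x 36 input for one word
```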
For both the character and phonetic models, a simple round-up logic was first used for testing on the development data (dev_round). Since the fuzzy-value scatter plots suggested there was scope for improvement, we used brute force to find a threshold on the development data (dev_thresh) in order to tune the trained model. Finally, using this threshold, we evaluated the model on our test data (test_thresh). The details of both models, along with the ensemble models, are given below.
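The brute-force threshold search can be sketched as follows (hypothetical helper `best_threshold`, sweeping cut-offs in steps of 0.01):

```python
def best_threshold(fuzzy, labels, step=0.01):
    """Return the cut-off in [0, 1] that maximizes accuracy on a
    development set of fuzzy outputs (0 = BN, 1 = EN)."""
    best_t, best_acc = 0.5, 0.0
    for i in range(int(1 / step) + 1):
        t = i * step
        preds = [1 if f >= t else 0 for f in fuzzy]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```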
7.1 Character Model
The results of the character model are shown in Table 2, and the scatter plot of fuzzy outputs on the development data in Fig 2. Simple round-up of the fuzzy output gave an accuracy of 91.50%. Using brute force, the threshold at which accuracy peaks on the development data was found to be 0.92, giving a slight improvement in accuracy of 0.66%. Finally, on the test data, this tuned model achieved an accuracy of 91.71%.
7.2 Phonetic Model
Results of the phonetic model are shown in Table 3. With simple round-up on the development data the accuracy was modest (82.5%), but using the same threshold technique as in the character model it improved significantly (by 6.16%) to 88.66%; the threshold in this case was 0.95. The scatter plot of the fuzzy values is shown in Fig 3. Using the threshold, the model achieved an accuracy of 90.42% on the test data and a precision of 91.74%, which is slightly better than that of the character model.
7.3 Ensemble Model
Two ensemble methods were tried: stacking and thresholding. For the former, logistic regression was used as the combiner, trained on the fuzzy values produced by the two models on the development data. For the latter, the mean of the two fuzzy values was taken and, using the same brute-force technique as for the character and phonetic models, the threshold was found to be 0.9. The architecture is shown in Fig 4.
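The thresholding ensemble reduces to averaging the two fuzzy outputs and applying the tuned cut-off; a sketch, with a helper name of our choosing and 0.9 as the threshold reported above:

```python
def ensemble_tag(f_char, f_phon, threshold=0.9):
    """Average the character and phonetic models' fuzzy outputs
    (0 = BN, 1 = EN) and apply the tuned threshold."""
    mean = (f_char + f_phon) / 2
    return "EN" if mean >= threshold else "BN"
```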
The stacking method showed a slight improvement, achieving an accuracy of 91.78%, while the threshold technique improved accuracy by a further 0.57%. The gain from the neural networks is quite noticeable: an increase in accuracy of 8.71% over the SVM n-gram based model.
7.4 Error Analysis
We started our error analysis by preparing confusion matrices (CM) for the four models; the values observed on the test data are shown in Table 5. Here, correctly predicted BN was counted as TP and correctly predicted EN as TN.
From CM 1 in Table 5, we found that the character model is quite balanced in terms of bias, as the TP and TN counts are close. Unlike the character model, CM 2 (phonetic model) shows TP considerably higher than TN; thus its precision is higher by 0.15% compared to the character model. From CM 3, we observed that TN is much higher than TP (by 39), from which it can be inferred that the logistic regression model is biased towards EN. On the contrary, CM 4 shows TP much higher than TN (by 41). This is because the regression range of the models' fuzzy outputs for BN is much wider than that for EN, so calculating the threshold with accuracy as the metric favors TP.
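The counts and precision used in this analysis follow the usual definitions, with BN as the positive class; a minimal sketch:

```python
def confusion(preds, golds):
    """TP/TN/FP/FN with BN as the positive class, as in Table 5."""
    pairs = list(zip(preds, golds))
    tp = sum(p == "BN" and g == "BN" for p, g in pairs)
    tn = sum(p == "EN" and g == "EN" for p, g in pairs)
    fp = sum(p == "BN" and g == "EN" for p, g in pairs)
    fn = sum(p == "EN" and g == "BN" for p, g in pairs)
    return tp, tn, fp, fn

def precision(tp, fp):
    return tp / (tp + fp)
```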
One of the primary drawbacks of our system is that it operates at the word level rather than the sentence level, and thus fails to capture context. This problem is evident when words with similar spellings but belonging to different languages are assigned the same tag; e.g., in our test data, the word "choke" was present in both the EN and BN (where it means 'eyes') data but was always tagged as BN. The system is also not designed to handle words with numeric or special characters (e.g. ri8 instead of right) or elongated words (e.g. goood instead of good), though simple normalization prior to feeding would solve the latter issue.
9 Conclusion & Future Work
In the present work, we have built a system for word-level LID. A new method for word encoding using root phones has been introduced. Employing the character and phonetic encoding methods, we designed two deep LSTM models, and finally two ensemble models using stacking and threshold techniques. The stacking method achieved an accuracy of 91.78%, while the threshold model obtained the best accuracy with a score of 92.35%. Considering that our training data was quite small (about 13k words), the architecture seems quite promising. In future, we aim to gather more data from social media. This would not only improve accuracy in general, but the n-gram statistics may also help enrich and fine-tune our phonetic library, thus improving the encoding quality. We would also like to test a similar methodology on other language pairs to evaluate its applicability.
- Barman et al. (2014) Utsab Barman, Amitava Das, Joachim Wagner, and Jennifer Foster. 2014. Code mixing: A challenge for language identification in the language of social media. In Proceedings of the first workshop on computational approaches to code switching. pages 13–23.
- Bergsma et al. (2012) Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, and Theresa Wilson. 2012. Language identification for creating language-specific twitter collections. In Proceedings of the second workshop on language in social media. Association for Computational Linguistics, pages 65–74.
- Das and Gambäck (2014) Amitava Das and Björn Gambäck. 2014. Identifying languages at the word level in code-mixed indian social media text.
- Greff et al. (2017) Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. 2017. LSTM: A search space odyssey. IEEE transactions on neural networks and learning systems 28(10):2222–2232.
- Jaech et al. (2016) Aaron Jaech, George Mulcaire, Mari Ostendorf, and Noah A Smith. 2016. A neural model for language identification in code-switched tweets. In Proceedings of The Second Workshop on Computational Approaches to Code Switching. pages 60–64.
- Jhamtani et al. (2014) Harsh Jhamtani, Suleep Kumar Bhogi, and Vaskar Raychoudhury. 2014. Word-level language identification in bi-lingual code-switched texts. In Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing.
- Kim et al. (2014) Suin Kim, Ingmar Weber, Li Wei, and Alice Oh. 2014. Sociolinguistic analysis of twitter in multilingual societies. In Proceedings of the 25th ACM conference on Hypertext and social media. ACM, pages 243–248.
- Lee et al. (2016) Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2016. Fully character-level neural machine translation without explicit segmentation. arXiv preprint arXiv:1610.03017 .
- Mandal and Das (2018) Soumil Mandal and Dipankar Das. 2018. Analyzing roles of classifiers and code-mixed factors for sentiment identification. arXiv preprint arXiv:1801.02581 .
- Nguyen and Doğruöz (2013) Dong Nguyen and A. Seza Doğruöz. 2013. Word level language identification in online multilingual communication. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. pages 857–862.
- Piergallini et al. (2016) Mario Piergallini, Rouzbeh Shirvani, Gauri S Gautam, and Mohamed Chouikha. 2016. Word-level language identification and predicting codeswitching points in swahili-english language data. In Proceedings of the Second Workshop on Computational Approaches to Code Switching. pages 21–29.
- Rijhwani et al. (2017) Shruti Rijhwani, Royal Sequiera, Monojit Choudhury, Kalika Bali, and Chandra Shekhar Maddila. 2017. Estimating code-switching on twitter with a novel generalized word-level language detection technique. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). volume 1, pages 1971–1982.
- Singer et al. (2003) Elliot Singer, Pedro A Torres-Carrasquillo, Terry P Gleason, William M Campbell, and Douglas A Reynolds. 2003. Acoustic, phonetic, and discriminative approaches to automatic language identification. In Eighth European conference on speech communication and technology.