A Parts-of-Speech (POS) tagger is a piece of software that reads text in some language and assigns part-of-speech tags, such as noun, verb, or adjective, to each word/token. POS tags are useful for building parse trees, which may be used to build Named Entity Recognizers (NER) or dependency parsers. POS tagging is also useful for building lemmatizers, which reduce a word to its root form. POS taggers for widely spoken languages have been developed in abundance, but such resources are very scarce for low-resource languages.
On the other hand, code-mixing is simply the mixing of two or more languages in communication. With the emergence of social media, a large amount of digital code-mixed data is being generated, because people nowadays are very comfortable with multilingualism. This phenomenon has led some researchers to regard code-mixed text as a new language.
As mentioned earlier, POS tagging systems for low-resource languages are hard to come by, so developing one that caters to code-mixed text is far from trivial. POS tagging systems developed for code-mixed data can help solve many complex Natural Language Processing (NLP) tasks, and hence we attempt to develop one in this work. We focus on creating a POS tagger for English-Bengali code-mixed data, as languages such as Bengali are morphologically rich.
Our method involves scraping code-mixed English-Bengali tweets from Twitter and cleaning them. The Bengali words in these tweets were in Roman script. The cleaned tweets were used as a development dataset for building our system. Our system starts by tagging the individual tokens of a tweet with their respective language: English, Bengali, or Unknown. This step gives rise to segments (sub-sequences) of the tweet that are written in the same language. Tokens tagged as Unknown were discarded. The segments are then passed to two POS taggers, one designed for English and the other for Bengali. The outputs of the POS taggers are then joined to obtain the final POS-tagged code-mixed tweet. Since the English and Bengali POS tagging modules use different tag sets, we further map the tags to a manually defined universal POS tag set. This step produces a final POS-tagged tweet with uniform tags. The architecture of the proposed model is shown in Figure 1.
The remainder of the paper is organized as follows. Section 2 gives a brief overview of the state of the art in this domain. Section 3 describes the data preparation steps. Section 4 describes the pipeline used to POS tag the code-mixed tweets. This is followed by the results in Section 5 and concluding remarks in Section 6.
2 Related Work
Over the years, a lot of significant work has been done in the field of Parts of Speech tagging. The first significant POS tagger, a rule-based tagger, appeared in the early nineties (Karlsson et al., 2011). One English rule-based tagger achieved an accuracy of 99.5% (Samuelsson and Voutilainen, 1997). POS taggers based on statistical models such as bigrams, trigrams, and Markov models were also used during this time (DeRose, 1988; Cutting et al., 1992; Dermatas and Kokkinakis, 1995; Meteer et al., 1991; Merialdo, 1994). Subsequently, a POS tagger combining statistical methods with a rule-based approach was proposed by Brill (1992).
Neural networks were subsequently used for POS tagging for the first time.
A POS tagger for the Bengali language was also built by Seddiqui et al. (2003), based on an analysis of Bengali morphemes. Other work on Bengali POS tagging includes Hasan et al. (2007) and Dandapat et al. (2007), whose approaches were rule-based and semi-supervised.
Pimpale and Patel (2016) attempted to tag code-mixed data using the Stanford POS tagger. They trained the POS tagger on constrained data of Hindi, Bengali, and Telugu, mixed with English, and reported an accuracy of 71%. Similarly, Sarkar (2016) used an HMM model on constrained code-mixed data and achieved an accuracy of 75.60%.
A pipeline architecture for POS tagging of code-mixed data was first used by Barman et al. (2016). Their training data was very small, and their LID (language identification) and transliteration steps relied on Support Vector Machines (SVM) and manual transliteration. Our approach also uses a pipeline architecture similar to theirs, but our model does not require any annotated data to train the system. In addition, the LID and transliteration modules in our case have been fully trained on much larger data, using deep learning architectures.
3 Data Preparation
We decided to use a development dataset for building our system. Note that this data was used to build the proposed system and not to train it. Since code-mixed data consisting of English and Bengali is difficult to find, we decided to scrape such data from Twitter. The collected tweets contained varying degrees of noise and hence needed to be cleaned before they could be used to develop our system. After cleaning, the tweets were passed to a Language Tagger module that tagged every token of a tweet with its corresponding language (English, Bengali, or Unknown, in this case).
3.1 Tweet Scraping and Cleaning
Initially, we had to assemble the development data, consisting of English-Bengali code-mixed text, that would be used to build the POS tagger model. For this, we scraped tweets from Twitter, as it is a social media platform with a huge repository of such data. Our tweet scraper module used Twint (https://pypi.org/project/twint/), a Python package that helps to scrape tweets. The program was fed a list of Romanized Bengali keywords to be used for scraping. The Twint object then iterates over the keywords and retrieves the tweets that match each keyword.
Using this method, 5,148 code-mixed tweets containing English and Romanized Bengali words were collected. The collected tweets were noisy and hence needed to be cleaned before proceeding. The cleaning module applied several steps, removing links, smileys, emojis, hashtags, and mentions (usernames).
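The cleaning step described above could be sketched roughly as follows. This is a minimal illustration and not the cleaning module used in this work: the regular expressions, the function name `clean_tweet`, and the choice to drop a hashtag together with its text are all assumptions.

```python
import re

# Patterns for the noise types named in the text. The exact expressions are
# assumptions; the original cleaning module is not specified in detail.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")  # assumption: drop the hashtag and its text
SMILEY_RE = re.compile(r"(?<!\w)[:;=8][\-o\*']?[\)\(\]\[dDpP/\\|]+(?!\w)")
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF\uFE0F\u200D]"
)


def clean_tweet(text: str) -> str:
    """Remove links, mentions, hashtags, smileys and emojis, then squeeze spaces."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, SMILEY_RE, EMOJI_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


if __name__ == "__main__":
    raw = "Movie ta bhalo chilo :) @friend #weekend https://t.co/abc \U0001F600"
    print(clean_tweet(raw))
```

Links are removed first so that the `:/` inside a URL is never mistaken for a smiley.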
3.2 Language Tagging and Segmentation
We observed that there is no end-to-end POS tagger available that can jointly tag English and Bengali tokens. We therefore decided to segment the cleaned tweets into Bengali and English parts, so that the tokens in each language segment can be tagged separately with their respective POS tags.
For segmenting the tweets, the words needed to be tagged with their corresponding language. To develop such a Language Tagging (LT) model, we collected 11,060 Romanized Bengali words and 7,223 English words. We developed a binary classification model that takes as input the tokens of a tweet (as character embeddings) and outputs the language of each word, either English or Bengali. The tokens were fed to a stacked LSTM of size 2. The output vectors from the LSTM cells were then fed to a fully connected layer, which mapped each word to its language. For this model, the activation was Sigmoid, the optimizer was Adam, and the loss was Binary Crossentropy. The batch size was 30. The model was trained for 30 epochs and validated using a validation split of 0.2.
The architecture of the language tagging module is shown in Figure 2. The model achieved a validation accuracy of 91%. Note that characters other than letters and digits were tagged as ‘Unknown’. Tweets that contained no language tag, i.e., only Unknown tags, were discarded. An example of language tagging is shown in Table 1. Statistics of the tweets after cleaning and language tagging are shown in Table 2.
Table 2: Statistics of the tweets after cleaning and language tagging (LT).
|Statistic|Value|
|No. of tweets before LT|5,148|
|No. of tweets after LT|5,012|
|No. of tokens before LT|1,44,17|
|No. of tokens after LT|1,41,47|
|No. of tweets with no language tag|136|
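The tagging rules stated above (non-alphanumeric tokens become Unknown, and tweets with only Unknown tags are dropped) can be sketched as below. The trained LSTM classifier is replaced here by a hypothetical stand-in, `toy_classifier`, and the function names and the tag strings "EN", "BN" and "Unknown" are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

TaggedToken = Tuple[str, str]  # (token, language tag)


def tag_languages(tokens: List[str], classify: Callable[[str], str]) -> List[TaggedToken]:
    """Tag each token as EN, BN, or Unknown.

    `classify` stands in for the trained binary classifier and is only called
    for tokens that contain at least one letter or digit.
    """
    tagged = []
    for token in tokens:
        if not any(ch.isalnum() for ch in token):
            tagged.append((token, "Unknown"))
        else:
            tagged.append((token, classify(token)))
    return tagged


def has_language(tagged: List[TaggedToken]) -> bool:
    """A tweet is kept only if at least one token has a real language tag."""
    return any(lang != "Unknown" for _, lang in tagged)


def toy_classifier(token: str) -> str:
    """Hypothetical stand-in for the LSTM model: a tiny English word list."""
    english = {"i", "had", "to", "go", "urgently"}
    return "EN" if token.lower() in english else "BN"


if __name__ == "__main__":
    print(tag_languages(["I", "had", "!!", "karon"], toy_classifier))
```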
After the language tagging is done, a segmentation module partitions the code-mixed input into segments according to its language tags. In our case, segments are sub-sequences of the instance that are written in the same language. Examples of segmentation are shown below, where strings in brackets denote segments:
1. (Movie)En (ta bhalo chilo)Bn (but mid point)En (e amar khub)Bn (boring)En (lagte shuru korlo)Bn.
2. (I had to go)En (karon o khub)Bn (urgently)En (daklo amaye)Bn.
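Segmentation of this kind amounts to grouping consecutive tokens that share a language tag. The sketch below uses `itertools.groupby` on the second example; the function name `segment`, the tag strings, and the decision to drop Unknown tokens before grouping are assumptions made for illustration.

```python
from itertools import groupby
from typing import List, Tuple


def segment(tagged_tokens: List[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    """Group consecutive (token, language) pairs into same-language segments."""
    # Tokens tagged Unknown are discarded before grouping (assumed ordering).
    known = [(tok, lang) for tok, lang in tagged_tokens if lang != "Unknown"]
    return [
        (lang, [tok for tok, _ in group])
        for lang, group in groupby(known, key=lambda pair: pair[1])
    ]


if __name__ == "__main__":
    tokens = "I had to go karon o khub urgently daklo amaye".split()
    langs = ["EN", "EN", "EN", "EN", "BN", "BN", "BN", "EN", "BN", "BN"]
    for lang, words in segment(list(zip(tokens, langs))):
        print(lang, words)
```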
3.3 Language Switch Analysis
Language tagged tweets were then analyzed to examine switching patterns. For this, the tweets were tokenized and a list of bigrams was extracted. Since the tokens of the tweets are tagged with their specific language, we could count the bigrams of each type: EN-EN (both tokens of the bigram are in English), BN-BN (both tokens of the bigram are in Bengali), EN-BN (the first token of the bigram is in English and the second in Bengali), and BN-EN (the first token of the bigram is in Bengali and the second in English).
Table 3: Bigram language-switch statistics, with columns Switch, Count, Freq >500, and Freq >1000.
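Counting the four bigram types from a sequence of language tags can be sketched as follows. The function name `switch_counts` and the sample tag sequence are illustrative assumptions, and the counts printed are for this one toy sentence only, not for the dataset.

```python
from collections import Counter
from typing import List


def switch_counts(lang_tags: List[str]) -> Counter:
    """Count language-tag bigrams such as EN-EN, EN-BN, BN-EN and BN-BN."""
    return Counter(f"{first}-{second}" for first, second in zip(lang_tags, lang_tags[1:]))


if __name__ == "__main__":
    # Tags for "I had to go karon o khub urgently daklo amaye".
    tags = ["EN", "EN", "EN", "EN", "BN", "BN", "BN", "EN", "BN", "BN"]
    print(switch_counts(tags))
```

EN-BN and BN-EN bigrams are the language switching points; EN-EN and BN-BN bigrams stay within one language.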
4 Parts of Speech Tagging
After the data preparation step, the language tagged segments are passed to the corresponding language POS tagger for the final tagging. Two different POS tagging systems were used for English and Bengali. For POS tagging the English segments, we used the Stanford POS tagger (https://nlp.stanford.edu/software/tagger.shtml) and recorded the output.
For the Bengali segments, we used a tagger developed by Das et al. (2014). They trained the tagger on 10,000 Bengali (Devanagari) POS tagged sentences and tested it on 2,000 Bengali (Devanagari) sentences. Their model achieved 92% accuracy. To use their model, we had to transliterate the Bengali segments into the corresponding Devanagari script. The model developed for this purpose is described in Section 4.1.
4.1 Bengali Transliteration
To develop the transliteration system, we initially collected 22,781 Romanized Bengali words and manually transliterated them to their Devanagari counterparts. We developed a Sequence-to-Sequence model that takes as input the Romanized Bengali words and outputs the Bengali words in the Devanagari script. The embedding used in this model was at the character level.
The model consists of two parts: an Encoder and a Decoder. The Encoder takes as input Romanized Bengali characters, creates one-hot vectors from them, and passes these to an embedding layer. The output of the embedding layer is given to a stacked LSTM cell, which produces a context vector for the input word. The Decoder takes as input the Bengali characters in Devanagari script, creates one-hot vectors from them, and passes these to an embedding layer. The output of the embedding layer is given to a stacked LSTM cell, which is initialized with the state of the Encoder. The stacked LSTM cell then produces Bengali characters (in Devanagari script) as output, offset by one time step. The activation of the model was Softmax, the optimizer was Adam, and the loss was Sparse Categorical Crossentropy. The batch size was 1024. The model was trained for 50 epochs and validated using a validation split of 0.1.
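The one-time-step offset between decoder input and decoder output is the usual teacher-forcing arrangement. The sketch below shows only that data preparation step and not the network itself; the special symbols `<pad>`, `<s>` and `</s>`, the helper names, and the placeholder target string are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

PAD, START, END = "<pad>", "<s>", "</s>"  # assumed special symbols


def build_vocab(words: Iterable[str]) -> Dict[str, int]:
    """Map every character seen in `words`, plus the special symbols, to an id."""
    chars = sorted({ch for word in words for ch in word})
    return {symbol: idx for idx, symbol in enumerate([PAD, START, END] + chars)}


def make_decoder_pair(target_word: str, vocab: Dict[str, int]) -> Tuple[List[int], List[int]]:
    """Return (decoder_input, decoder_target) for one target-script word.

    The target is the input shifted left by one step, so at each position the
    decoder is trained to predict the next character.
    """
    ids = [vocab[START]] + [vocab[ch] for ch in target_word] + [vocab[END]]
    return ids[:-1], ids[1:]


if __name__ == "__main__":
    vocab = build_vocab(["abc"])  # placeholder standing in for a target-script word
    decoder_input, decoder_target = make_decoder_pair("abc", vocab)
    print(decoder_input, decoder_target)
```

Integer ids of this kind are what a Sparse Categorical Crossentropy loss expects as targets.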
The validation accuracy of the model was recorded as 87%. The architecture of the model is shown in Figure 3.
The transliterated segments are then fed to the Bengali POS tagger and the corresponding outputs are recorded.
After POS tagging both the English and Bengali segments, the results are joined together to get a POS tagged code-mixed tweet.
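The routing and joining of segments could look roughly like the sketch below. The three callables are hypothetical stand-ins for the English tagger, the Bengali tagger, and the transliteration model; their signatures, and the choice to keep the original Romanized token in the output, are assumptions.

```python
from typing import Callable, List, Tuple

Segment = Tuple[str, List[str]]            # (language, tokens)
Tagger = Callable[[List[str]], List[str]]  # tokens -> one POS tag per token


def tag_code_mixed(
    segments: List[Segment],
    tag_english: Tagger,
    tag_bengali: Tagger,
    transliterate: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Send each segment to its language's tagger and join results in order."""
    tagged = []
    for lang, tokens in segments:
        if lang == "EN":
            tags = tag_english(tokens)
        else:
            # Bengali segments are transliterated before tagging.
            tags = tag_bengali([transliterate(tok) for tok in tokens])
        tagged.extend((tok, lang, tag) for tok, tag in zip(tokens, tags))
    return tagged


if __name__ == "__main__":
    segments = [("EN", ["I", "had", "to", "go"]), ("BN", ["karon", "o", "khub"])]
    # Dummy taggers that emit a constant tag, purely to show the data flow.
    result = tag_code_mixed(
        segments,
        tag_english=lambda toks: ["EN_TAG"] * len(toks),
        tag_bengali=lambda toks: ["BN_TAG"] * len(toks),
        transliterate=lambda word: word,
    )
    print(result)
```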
4.2 Mapping to Universal POS Tag Set
The final POS tagged code-mixed tweets need to be generalized to a universal system, because the English and Bengali POS taggers follow different grammars and therefore use different POS tag sets. To simplify this situation, we use a universal POS tag set that comprises the tags shown in Table 4.
The table shows the universal tags in bold and italics while the other texts define the universal tag.
For mapping the English POS tags to this universal POS tag set, we use map_tag, a function provided by NLTK. It maps the English tags to the universal tags based on pre-defined rules.
The mapping of the Bengali POS tags (Stanford POS tags) to the universal POS tag set is shown in Table 5. Here, text in bold and italics denotes the universal tag, while the other text denotes the Stanford POS tags.
Table 5: Mapping of system tags (Syst. Tag) to universal tags (Univ. Tag).
Finally, the POS tagged segments (mapped to the universal POS tagset) are recorded as the final output.
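The mapping step can be pictured as a per-language lookup table with a fallback for unmapped tags. The entries below are illustrative placeholders only and do not reproduce Table 4 or Table 5: the English entries follow the common Penn Treebank to universal correspondence, while the Bengali entries and the fallback tag "X" are assumptions.

```python
from typing import Dict, List, Tuple

# Illustrative subsets only; these are NOT the tables used in this work.
EN_TO_UNIVERSAL: Dict[str, str] = {
    "NN": "NOUN", "NNS": "NOUN", "VB": "VERB", "VBD": "VERB",
    "JJ": "ADJ", "RB": "ADV", "PRP": "PRON", "DT": "DET",
    "IN": "ADP", "CD": "NUM",
}
BN_TO_UNIVERSAL: Dict[str, str] = {  # hypothetical system tags
    "NN": "NOUN", "VM": "VERB", "JJ": "ADJ", "PRP": "PRON",
}


def to_universal(
    tagged: List[Tuple[str, str, str]], fallback: str = "X"
) -> List[Tuple[str, str]]:
    """Map (token, language, system_tag) triples to (token, universal_tag)."""
    tables = {"EN": EN_TO_UNIVERSAL, "BN": BN_TO_UNIVERSAL}
    return [(tok, tables[lang].get(tag, fallback)) for tok, lang, tag in tagged]


if __name__ == "__main__":
    sample = [("Movie", "EN", "NN"), ("bhalo", "BN", "JJ"), ("chilo", "BN", "VM")]
    print(to_universal(sample))
```

Keeping one table per language makes it possible to swap in a different tagger, and hence a different source tag set, without touching the rest of the pipeline.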
5 Results
Since there is no automated evaluation metric for assessing the quality of POS tagging of a code-mixed sentence, we hired a linguist who was proficient in both Bengali and English. The linguist was asked to prepare test data comprising 100 English-Bengali code-mixed sentences and then to POS tag the tokens separately, based on the universal POS tag set. The linguist was told to take the context of the sentence into account while tagging the tokens. This approach was used to properly:
- tag ambiguous words, such as ’to’, which occurs in both English and Bengali;
- tag words at the switching point.
The same test data was tagged using our system as well. To calculate the agreement between the manual annotation and the system annotation, we used Krippendorff’s Alpha (Krippendorff, 2011); the metrics and the confusion matrix are shown in Table 6.
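For readers unfamiliar with the agreement measure, the sketch below computes Krippendorff's Alpha for two annotators, nominal labels, and no missing values, using the coincidence-matrix formulation. Treating the linguist and the system as the two annotators and the nominal variant as the one used are assumptions; the function name is hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def krippendorff_alpha_nominal(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Krippendorff's alpha for two annotators, nominal data, no missing values."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of units.")

    # Coincidence matrix: each unit contributes the ordered pairs (a, b) and (b, a).
    coincidences = Counter()
    for a, b in zip(labels_a, labels_b):
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1

    totals = Counter()
    for (c, _), count in coincidences.items():
        totals[c] += count
    n = sum(totals.values())  # total number of pairable values (2 per unit)

    observed = sum(count for (c, k), count in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k] for c in totals for k in totals if c != k)
    if expected == 0:
        return 1.0  # only one label ever used; treated as perfect agreement here
    return 1.0 - (n - 1) * observed / expected


if __name__ == "__main__":
    gold = ["NOUN", "VERB", "ADJ", "NOUN", "PRON"]
    system = ["NOUN", "VERB", "NOUN", "NOUN", "PRON"]
    print(round(krippendorff_alpha_nominal(gold, system), 3))
```

A value of 1 indicates perfect agreement, and a value near 0 indicates agreement no better than chance.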
The inter-system annotation agreement scores in Table 6 evaluate the overall system. To dive deeper, we evaluated every sentence of the test data. This was done using two methods.
Method 1: For a code-mixed sentence, the POS tag of every token in the manually annotated sentence was compared to the POS tag of the corresponding token in the system annotated sentence, and scoreA was calculated from these token-level matches.
Method 2: POS tagging of tokens that lie at a language switching point, i.e., in an EN-BN or BN-EN bigram, is of utmost importance, as the context of the two words may change and, as a result, the POS tags may also differ. In this context, scoreB was calculated by multiplying scoreA by 0.25 and taking the absolute value of the result, provided the POS tags at the language switching point match between the manually annotated sentence and the system annotated sentence. The multiplying factor was kept at 0.25 as there can be four bigrams, i.e., EN-EN, BN-BN, EN-BN, and BN-EN.
If there is more than one switching point and the POS tags match, the multiplying factor is applied once per switching point. So, if there are two switching points and the POS tags match, scoreA is multiplied by 0.25 twice to get scoreB. In general,
$$\text{scoreB} = \left|\,\text{scoreA} \times 0.25^{\,n}\,\right|^{*},$$
where $n$ denotes the number of language switching points present and the trailing * denotes that the formula holds only if certain conditions are met.
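The per-sentence scoring can be sketched as below, with heavy caveats. The factor of 0.25 per switching point and the absolute value follow the description above. The definition of scoreA as the fraction of tokens with matching tags is an assumption, as are the function names, and the case in which switching-point tags do not match is left out because it is not specified.

```python
from typing import List


def score_a(gold_tags: List[str], system_tags: List[str]) -> float:
    """ASSUMED definition: fraction of tokens whose tags match."""
    if len(gold_tags) != len(system_tags) or not gold_tags:
        raise ValueError("Tag sequences must be non-empty and of equal length.")
    return sum(g == s for g, s in zip(gold_tags, system_tags)) / len(gold_tags)


def switching_points(lang_tags: List[str]) -> List[int]:
    """Indices i where token i and token i + 1 are in different languages."""
    return [i for i in range(len(lang_tags) - 1) if lang_tags[i] != lang_tags[i + 1]]


def score_b(score_a_value: float, n_switch_points: int) -> float:
    """Stated formula: |scoreA * 0.25 ** n|, for matching switching-point tags."""
    return abs(score_a_value * 0.25 ** n_switch_points)


if __name__ == "__main__":
    langs = ["EN", "EN", "BN", "EN"]
    gold = ["PRON", "VERB", "NOUN", "ADV"]
    system = ["PRON", "VERB", "NOUN", "ADJ"]
    a = score_a(gold, system)
    n = len(switching_points(langs))
    print(a, n, score_b(a, n))
```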
Using the above methods, scoreA and scoreB were calculated for every sentence, and the average over the whole test data was then computed. With Method 1, our system achieved an accuracy of 72.72%, and with Method 2 the accuracy increased to 75.29%.
6 Conclusion
In this work, we have devised a modular system that can POS tag English-Bengali code-mixed sentences. The system uses sub-modules to do so. Because the sub-modules can be trained for any given language, the proposed approach can be used to tag code-mixed data involving any pair of languages.
The system can be enhanced further if the sub-modules are trained on more annotated data. For example, if the POS tagger for the Bengali language had been trained on more data, the problem of tagging unseen tokens with ’UN’ tags could have been avoided. The problem of wrongly tagged tokens, e.g., tagging NOUN as ADJ or VERB and tagging PRON as NOUN or VERB, could also have been reduced. This would have made the Bengali POS tagging module more robust. The same applies to the transliteration module.
In the future, we would like to develop an end-to-end system, so that the errors of one sub-module do not propagate to the other sub-modules.
References
- Part-of-speech tagging of code-mixed social media content: pipeline, stacking and joint modelling. In Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 30–39.
- A simple rule-based part of speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, pp. 152–155.
- A practical part-of-speech tagger. In Third Conference on Applied Natural Language Processing, pp. 133–140.
- Automatic part-of-speech tagging for Bengali: an approach for morphologically rich languages in a poor resource scenario. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 221–224.
- Automatic detection of subject/object drops in Bengali. In 2014 International Conference on Asian Language Processing (IALP), pp. 91–94.
- Automatic stochastic tagging of natural language texts. Computational Linguistics 21 (2), pp. 137–163.
- Grammatical category disambiguation by statistical optimization. Computational Linguistics 14 (1), pp. 31–39.
- Comparison of different POS tagging techniques (n-gram, HMM and Brill’s tagger) for Bangla. In Advances and Innovations in Systems, Computing Sciences and Software Engineering, pp. 121–126.
- Constraint grammar: a language-independent system for parsing unrestricted text. Vol. 4, Walter de Gruyter.
- Computing Krippendorff’s alpha-reliability.
- Conditional random fields: probabilistic models for segmenting and labeling sequence data.
- Tagging English text with a probabilistic model. Computational Linguistics 20 (2), pp. 155–171.
- POST: using probabilities in language processing. In IJCAI, pp. 960–965.
- Neural network approach to word category prediction for English texts. In Proceedings of the 13th Conference on Computational Linguistics, Volume 3, pp. 213–218.
- Experiments with POS tagging code-mixed Indian social media text. arXiv preprint arXiv:1610.09799.
- Comparing a linguistic and a stochastic tagger. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, pp. 246–253.
- Part-of-speech tagging for code-mixed Indian social media text at ICON 2015. arXiv preprint arXiv:1601.01195.
- Parts of speech tagging using morphological analysis in Bangla. In Proceedings of the 6th International Conference on Computer and Information Technology (ICCIT).
- Shallow parsing with conditional random fields. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pp. 134–141.
- Conditional random field based POS tagger for Hindi. Proceedings of the MSPIL, pp. 63–68.