Recurrent Neural Network based Part-of-Speech Tagger for Code-Mixed Social Media Text

11/15/2016
by   Raj Nath Patel, et al.
CDAC

This paper describes Centre for Development of Advanced Computing's (CDACM) submission to the shared task 'Tool Contest on POS tagging for Code-Mixed Indian Social Media (Facebook, Twitter, and Whatsapp) Text', collocated with ICON-2016. The shared task was to predict the Part-of-Speech (POS) tag at the word level for a given text. Code-mixed text is generated mostly on social media by multilingual users. The presence of multilingual words, transliterations, and spelling variations makes such content linguistically complex. In this paper, we propose an approach to POS tag code-mixed social media text using the Recurrent Neural Network Language Model (RNN-LM) architecture. We submitted results for Hindi-English (hi-en), Bengali-English (bn-en), and Telugu-English (te-en) code-mixed data.

1 Introduction

Code-Mixing and Code-Switching are observed in the text or speech produced by a multilingual user. Code-Mixing occurs when a user changes the language within a sentence, i.e. a clause, phrase, or word of one language is used within an utterance of another language. Code-Switching, in contrast, refers to the co-occurrence of speech extracts from two different grammatical systems.

The language analysis of code-mixed text is a non-trivial task. Traditional approaches to POS tagging are not effective for such text, as it generally does not adhere to a single grammatical structure. Many studies have shown that RNN based POS taggers produce comparable results and are the state of the art for some languages. However, to the best of our knowledge, no study has been done on RNN based POS tagging of code-mixed data.

In this paper, we propose a POS tagger using the RNN-LM architecture for code-mixed Indian social media text. Earlier, researchers have adopted the RNN-LM architecture for Natural Language Understanding (NLU) [Yao et al.2013, Yao et al.2014] and Translation Quality Estimation [Patel and Sasikumar2016]. RNN-LM models are similar to other vector-space language models [Bengio et al.2003, Morin and Bengio2005, Schwenk2007, Mnih and Hinton2009] in that each word is represented with a high-dimensional real-valued vector. We modified the RNN-LM architecture to predict the POS tag of a word, given the word and its context. Consider the following example:

Input:

Output: G_N G_PRP G_N CC G_V G_R G_R

In the above sentence, to predict the POS tag (G_N) for the word ‘’ using an RNN-LM model with window size 3, the input will be ‘’, whereas in a standard RNN-LM model, ‘’ would be the input with ‘’ as the output. We discuss the details of the various models and their implementation in Section 3.
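Since the example tokens above were lost in extraction, the windowing idea can be illustrated with a small Python sketch; the sentence, tags, and padding symbol below are purely hypothetical placeholders, not the paper's data.

```python
# Minimal sketch of context-window construction for the tagger.
# The tokens, tags, and <PAD> symbol are illustrative assumptions.

def context_windows(tokens, window=3, pad="<PAD>"):
    """Return, for each position, the window of tokens centred on it."""
    half = window // 2
    padded = [pad] * half + list(tokens) + [pad] * half
    return [padded[i:i + window] for i in range(len(tokens))]

tokens = ["main", "ghar", "ja", "raha", "hoon"]   # hypothetical code-mixed sentence
tags = ["G_PRP", "G_N", "G_V", "G_V", "G_V"]      # hypothetical gold tags

for win, tag in zip(context_windows(tokens, window=3), tags):
    # The modified RNN-LM sees the window as input and predicts the centre word's tag,
    # instead of predicting the next word as a standard RNN-LM would.
    print(win, "->", tag)
```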

In this paper, we show that our approach achieves results close to state-of-the-art systems such as the Stanford tagger (a maximum-entropy based POS tagger, http://nlp.stanford.edu/software/tagger.shtml) [Toutanova et al.2003] and HunPos (a hidden Markov model based POS tagger, https://code.google.com/archive/p/hunpos/) [Halácsy et al.2007].

2 Related Work

POS tagging has been investigated for decades in the Natural Language Processing (NLP) literature. Different methods such as Support Vector Machines [Màrquez and Giménez2004], Decision Trees [Schmid and Laws2008], Hidden Markov Models (HMM) [Kupiec1992], and Conditional Random Field Autoencoders [Ammar et al.2014] have been tried for this task. Among these, Neural Network (NN) based models are the most closely related to this paper. Within the NN family, the RNN is a widely used network for various NLP applications [Mikolov et al.2010, Mikolov et al.2013a, Mikolov et al.2013b, Socher et al.2013a, Socher et al.2013b].

Recently, RNN based models have been used to POS tag formal text, but they have not yet been tried on code-mixed data. Wang et al. (2015) applied a Bidirectional Long Short-Term Memory (LSTM) network to the Penn Treebank WSJ test set and reported state-of-the-art performance. Qin (2015) showed that RNN models outperform Majority Voting (MV) and HMM techniques for POS tagging of Chinese Buddhist text. Zennaki et al. (2015) used RNNs for resource-poor languages and reported results comparable to state-of-the-art systems [Das and Petrov2011, Duong et al.2013, Gouws and Søgaard2015].

Work on POS tagging of code-mixed Indian social media text is still at a nascent stage. Vyas et al. (2014) and Jamatia et al. (2015) worked on data labeling and automatic POS tagging of such data using various machine learning techniques. Building on that labeled data, Pimpale and Patel (2015) and Sarkar (2015) used word embeddings as an additional feature for machine learning based classifiers for POS tagging.

3 Experimental Setup

3.1 RNN Models

There are many variants of RNN networks for different applications. For this task, we used the Elman network [Elman1990], Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber1997], Deep LSTM, and the Gated Recurrent Unit (GRU) [Cho et al.2014], which are widely used RNN models in the NLP literature.

In the following sub-sections, we give a brief description of each model with its defining equations (1, 2, and 3). In the equations, $x_t$ and $y_t$ are the input and output vectors respectively, and $h_t$ and $h_{t-1}$ represent the current and previous hidden states. The $W$'s are the weight matrices and the $b$'s are the bias vectors. $\odot$ denotes elementwise multiplication of vectors. We use $\sigma$, the logistic sigmoid, and $\tanh$, the hyperbolic tangent, to add nonlinearity to the network, with a $\mathrm{softmax}$ function at the output layer.

3.1.1 Elman

Elman and Jordan [Jordan1986] networks are the simplest networks in the RNN family and are known as Simple_RNN. The Elman network is defined by the following set of equations:

$h_t = \sigma(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$
$y_t = \mathrm{softmax}(W_{hy} h_t + b_y)$        (1)
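As a concrete reading of equation (1), here is a minimal NumPy sketch of one Elman forward step; NumPy (rather than the Theano code actually used) and the explicit parameter names are assumptions made for readability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def elman_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One Simple_RNN (Elman) step as in equation (1)."""
    h_t = sigmoid(W_xh @ x_t + W_hh @ h_prev + b_h)  # recurrent hidden update
    y_t = softmax(W_hy @ h_t + b_y)                  # distribution over POS tags
    return h_t, y_t
```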

3.1.2 LSTM

The LSTM is found to be better than Simple_RNN at modeling long-range dependencies. Simple_RNN also suffers from the problem of vanishing and exploding gradients [Bengio et al.1994]. LSTM and other complex RNN models tackle this problem by introducing a gating mechanism. Many variants of LSTM [Graves2013, Yao et al.2014, Jozefowicz et al.2015] have been tried in the literature for various tasks. We implemented the following version:

$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$
$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$
$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$
$\tilde{c}_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
$h_t = o_t \odot \tanh(c_t)$
$y_t = \mathrm{softmax}(W_{hy} h_t + b_y)$        (2)

where $i_t$, $f_t$, $o_t$ are the input, forget, and output gates respectively, $\tilde{c}_t$ is the new memory content, and $c_t$ is the updated memory.
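The LSTM step of equation (2) can be sketched in the same NumPy style; the dictionary-of-parameters layout is an assumption made for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM step as in equation (2); P maps names like 'W_xi' and 'b_i' to arrays."""
    i_t = sigmoid(P["W_xi"] @ x_t + P["W_hi"] @ h_prev + P["b_i"])        # input gate
    f_t = sigmoid(P["W_xf"] @ x_t + P["W_hf"] @ h_prev + P["b_f"])        # forget gate
    o_t = sigmoid(P["W_xo"] @ x_t + P["W_ho"] @ h_prev + P["b_o"])        # output gate
    c_tilde = np.tanh(P["W_xc"] @ x_t + P["W_hc"] @ h_prev + P["b_c"])    # new memory content
    c_t = f_t * c_prev + i_t * c_tilde                                    # updated memory
    h_t = o_t * np.tanh(c_t)                                              # new hidden state
    return h_t, c_t
```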

3.1.3 Deep LSTM

In this paper, we used a Deep LSTM with two layers. A Deep LSTM is created by stacking multiple LSTMs on top of each other: the output of the lower LSTM forms the input to the upper LSTM. For example, if $h_t^{(1)}$ is the output of the lower LSTM, then we apply a matrix transform to it to form the input for the upper LSTM. The matrix transformation enables two consecutive LSTM layers to have different sizes.
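Under the same assumptions, a two-layer Deep LSTM step can be sketched by feeding a transformed copy of the lower layer's hidden state to the upper layer; the inter-layer matrix name `W_12` is hypothetical, and `lstm_step` is the helper from the sketch above.

```python
def deep_lstm_step(x_t, states, P1, P2, W_12):
    """One step of a two-layer Deep LSTM, reusing lstm_step from the sketch above."""
    (h1, c1), (h2, c2) = states
    h1, c1 = lstm_step(x_t, h1, c1, P1)         # lower LSTM reads the word-window input
    h2, c2 = lstm_step(W_12 @ h1, h2, c2, P2)   # matrix transform lets the layers differ in size
    return (h1, c1), (h2, c2)
```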

3.1.4 GRU

The GRU is quite similar to the LSTM, but has no separate memory unit. The GRU also uses a different gating mechanism, with reset ($r_t$) and update ($z_t$) gates. The following set of equations defines a GRU model:

$r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1} + b_r)$
$z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1} + b_z)$
$\tilde{h}_t = \tanh(W_{xh} x_t + W_{hh} (r_t \odot h_{t-1}) + b_h)$
$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$
$y_t = \mathrm{softmax}(W_{hy} h_t + b_y)$        (3)
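And the analogous sketch for one GRU step as in equation (3); the output softmax is the same as in the Elman sketch and is omitted here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, P):
    """One GRU step as in equation (3); P maps names like 'W_xr' and 'b_r' to arrays."""
    r_t = sigmoid(P["W_xr"] @ x_t + P["W_hr"] @ h_prev + P["b_r"])               # reset gate
    z_t = sigmoid(P["W_xz"] @ x_t + P["W_hz"] @ h_prev + P["b_z"])               # update gate
    h_tilde = np.tanh(P["W_xh"] @ x_t + P["W_hh"] @ (r_t * h_prev) + P["b_h"])   # candidate state
    h_t = z_t * h_prev + (1.0 - z_t) * h_tilde                                   # interpolated state
    return h_t
```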

3.2 Implementation

All the models were implemented using the THEANO framework (http://deeplearning.net/software/theano/) [Bergstra et al.2010, Bastien et al.2012]. For all the models, the word embedding dimensionality was 100, the number of hidden units was 100, and the context word window size was 5. We initialized all the square weight matrices as random orthogonal matrices. All the bias vectors were initialized to zero. The other weight matrices were sampled from a zero-mean Gaussian distribution.
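A minimal sketch of this initialization scheme follows; the Gaussian standard deviation of 0.01 below is an assumed placeholder, not a value taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def orthogonal(n):
    """Random orthogonal matrix (QR of a Gaussian matrix), used for square weight matrices."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

EMB_DIM, HID_DIM, WINDOW = 100, 100, 5                      # settings reported above

W_hh = orthogonal(HID_DIM)                                  # square recurrent matrix: orthogonal
b_h = np.zeros(HID_DIM)                                     # bias vectors: zero
W_xh = rng.normal(0.0, 0.01, (HID_DIM, EMB_DIM * WINDOW))   # non-square: zero-mean Gaussian
                                                            # (0.01 std is an assumption)
```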

We trained all the models using Truncated Back-Propagation-Through-Time (T-BPTT) [Werbos1990] with stochastic gradient descent. Standard values of the hyper-parameters were used for RNN model training, as suggested in the literature [Yao et al.2014, Patel and Sasikumar2016]. The depth of BPTT was fixed to 7 for all the models. We trained each model for 50 epochs and used AdaDelta [Zeiler2012] to adapt the learning rate of each parameter automatically.
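For reference, the AdaDelta update used to adapt each parameter's learning rate can be sketched as follows; the decay rate and epsilon are the usual defaults suggested by Zeiler (2012), not values reported here.

```python
import numpy as np

def adadelta_update(param, grad, state, rho=0.95, eps=1e-6):
    """One AdaDelta step [Zeiler2012]: running averages of squared gradients and updates."""
    state["Eg2"] = rho * state["Eg2"] + (1.0 - rho) * grad ** 2
    delta = -np.sqrt(state["Edx2"] + eps) / np.sqrt(state["Eg2"] + eps) * grad
    state["Edx2"] = rho * state["Edx2"] + (1.0 - rho) * delta ** 2
    return param + delta

# per-parameter state, e.g. state = {"Eg2": np.zeros_like(W), "Edx2": np.zeros_like(W)}
```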

3.3 Data

We used the data shared by the contest organizers [Jamatia and Das2016]. The code-mixed data of bn-en, hi-en and te-en was shared separately for the Facebook (fb), Twitter (twt) and Whatsapp (wa) posts and conversations with Coarse-Grained (CG) and Fine-Grained (FG) POS annotations. We combined the data from fb, twt, and wa for CG and FG annotation of each language pair. The data was divided into training, testing, and development sets. Testing and development sets were randomly sampled from the complete data. Table 1 details sizes of the different sets at the sentence and token level. Tag-set counts for CG and FG are also provided.

We preprocessed the text for Mentions, Hashtags, Smilies, URLs, Numbers, and Punctuation. In the preprocessing, we mapped all the words of a group to a single new token, as they share the same POS tag. For example, all Mentions like @dhoni, @bcci, and @iitb were mapped to @user; all Hashtags like #dhoni, #bcci, and #iitb were mapped to #user.
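One possible implementation of this preprocessing, using regular expressions; the URL and number replacement tokens are illustrative assumptions (the paper only names @user and #user explicitly).

```python
import re

# Map each surface-form group to a single token, since all members share the same POS tag.
RULES = [
    (re.compile(r"@\w+"), "@user"),                    # Mentions: @dhoni, @bcci, ... -> @user
    (re.compile(r"#\w+"), "#user"),                    # Hashtags: #dhoni, #bcci, ... -> #user
    (re.compile(r"https?://\S+|www\.\S+"), "<url>"),   # URLs (token name assumed)
    (re.compile(r"\d+"), "<number>"),                  # Numbers (token name assumed)
]

def preprocess(tokens):
    out = []
    for tok in tokens:
        for pattern, replacement in RULES:
            if pattern.fullmatch(tok):
                tok = replacement
                break
        out.append(tok)
    return out

print(preprocess(["@dhoni", "scored", "100", "#bcci"]))   # ['@user', 'scored', '<number>', '#user']
```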

                 #sentences                #tokens             #tags
code-mix   training  dev  testing   training   dev  testing   CG   FG
hi-en          2430  100      100      37799  1888     1457   18   40
bn-en           524   50       50      11977  1477     1231   18   38
te-en          1779  100      100      26470  1436     1543   18   50

Table 1: Data Distribution; CG: Coarse-Grained, FG: Fine-Grained

3.4 Methodology

The RNN-LM models use only the context words' embeddings as input features. We experimented with three model configurations. In the first setting (Simple_RNN, LSTM, Deep LSTM, GRU), we learn the word representations from scratch along with the other model parameters. In the second configuration (GRU_Pre), we pre-trained the word representations using the word2vec tool [Mikolov et al.2013b] and fine-tuned them during the training of the other network parameters. Pre-training not only guides the learning towards minima with better generalization in non-convex optimization [Bengio2009, Erhan et al.2010] but also improves the accuracy of the system [Kreutzer et al.2015, Patel and Sasikumar2016]. In the third setting (GRU_Pre_Lang), we also added the language of each word as an additional feature alongside the context words. The vector representations of languages are learned from scratch, in the same way as the word vectors.
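The input to the third configuration (GRU_Pre_Lang) can be pictured as below: the embeddings of the context words are concatenated with learned embeddings of their language tags. The lookup-table layout and the language-embedding dimensionality are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, LANG_DIM = 100, 10                      # LANG_DIM is an assumed value

word_emb = {"school": rng.normal(size=EMB_DIM)}  # pre-trained with word2vec, then fine-tuned
lang_emb = {"en": rng.normal(size=LANG_DIM),     # learned from scratch, like the word vectors
            "hi": rng.normal(size=LANG_DIM)}

def input_vector(window_words, window_langs):
    """Concatenate context-word embeddings with their language embeddings (GRU_Pre_Lang)."""
    vecs = [word_emb.get(w, np.zeros(EMB_DIM)) for w in window_words]
    vecs += [lang_emb.get(l, np.zeros(LANG_DIM)) for l in window_langs]
    return np.concatenate(vecs)

x = input_vector(["they", "go", "to", "school", "."], ["en", "en", "en", "en", "en"])
print(x.shape)   # (5*100 + 5*10,) = (550,)
```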

4 Results

We used the F1-score to evaluate the experiments; the results are displayed in Table 2. We trained the models as described in Section 3.4. To compare our results, we also trained the Stanford and HunPos taggers on the same data; their scores are also given in Table 2.
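The token-level F1 evaluation itself can be reproduced with scikit-learn, as in the sketch below; the gold and predicted tags are hypothetical, and the weighted averaging mode is an assumption, since the averaging used for the table is not stated here.

```python
from sklearn.metrics import f1_score

gold = ["G_N", "G_V", "G_PRT", "G_N", "G_J"]   # hypothetical gold tags
pred = ["G_N", "G_V", "G_N",   "G_N", "G_N"]   # hypothetical system output

# Weighted average over tag classes; the exact averaging used for Table 2 is an assumption.
print(f1_score(gold, pred, average="weighted"))
```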

From the table, it is evident that pre-training and using language as an additional feature are both helpful. The accuracy of our best system (GRU_Pre_Lang) is comparable to that of Stanford and HunPos. GRU models outperform the other models (Simple_RNN, LSTM, Deep LSTM) for this task as well, as Chung et al. (2014) reported for a suite of NLP tasks.

                hi-en %F1       bn-en %F1       te-en %F1
model            CG     FG      CG     FG      CG     FG
Simple_RNN    78.16  68.73   70.16  64.49   72.27  69.04
LSTM          62.75  53.94   41.91  35.05   57.59  51.45
Deep LSTM     70.07  59.78   54.64  46.88   65.86  59.45
GRU           78.29  69.32   71.90  64.96   72.40  68.72
GRU_Pre       80.51  71.72   74.77  68.54   74.02  70.05
GRU_Pre_Lang  80.92  73.10   74.05  69.23   74.00  70.33
HunPos        77.50  69.04   76.55  71.02   74.30  70.73
Stanford      79.89  73.91   79.36  73.44   77.05  73.42

Table 2: F1 scores for different experiments

5 Submission to the Shared Task

The contest had two types of submissions: constrained, restricted to only the data shared by the organizers and the participants' own implemented systems; and unconstrained, where participants were allowed to use publicly available resources (training data, implemented systems, etc.).

We submitted for all the language pairs (hi-en, bn-en, and te-en) and domains (fb, twt, and wa). For the constrained submission, the output of GRU_Pre_Lang was used. We trained the Stanford POS tagger on the same data for the unconstrained submission. Jamatia and Das (2016) evaluated all the submitted systems against another gold test set and reported the results.

6 Analysis

We did a preliminary analysis of our systems and report a few observations in this section.

  • The POS categories contributing most to the errors are G_X, G_V, G_N, and G_J for the coarse-grained systems, and V_VM, JJ, N_NN, and N_NNP for the fine-grained systems. A confusion matrix analysis showed that these POS tags are mostly confused with one another. For instance, words with the G_J tag were wrongly tagged 28 times, of which 17 times they were tagged as G_N.

  • RNN models require a large corpus to train the model parameters. From the results, we observe that for hi-en and te-en, with only approximately 2K training sentences, the results of the best RNN model (GRU_Pre_Lang) are comparable to Stanford and HunPos. For bn-en, the corpus was very small (only approximately 0.5K sentences) for RNN training, which resulted in poorer performance compared to Stanford and HunPos. Based on this and earlier work on RNN based POS tagging, we expect that RNN models could achieve state-of-the-art accuracy given a sufficient amount of training data.

  • In general, LSTM and Deep LSTM models perform better than Simple_RNN, but here Simple_RNN outperforms both LSTM and Deep LSTM. The reason could be that there is too little data to train such complex models.

  • A few orthographically similar words of English and Hindi that take different POS tags are given with examples in Table 3. The system gets confused when POS tagging such words. By adding language as an additional feature, we were able to tag these types of words correctly.

word   lang   example                    POS
are    hi     are shyaam kidhar ho?      PSP
are    en     they are going.            G_V
to     hi     tumane to dekha hi nhi.    G_PRT
to     en     they go to school.         CC
hi     hi     mummy to aisi hi hain.     G_V
hi     en     hi, how are you.           G_PRT

Table 3: Similar words in hi-en data

7 Conclusion and Future Work

We developed a language-independent and generic POS tagger for social media text using RNN networks. We tried Simple_RNN, LSTM, Deep LSTM, and GRU models. We showed that GRU outperforms the other models and also benefits from pre-training and from language as an additional feature. The accuracy of our approach is also comparable to that of Stanford and HunPos.

In the future, we could try RNN models with more features such as the POS tags of context words, prefixes and suffixes, word length, and position. Character-level information has also been found to be a very useful feature in RNN based POS taggers.

References

  • [Ammar et al.2014] Waleed Ammar, Chris Dyer, and Noah A Smith. 2014. Conditional random field autoencoders for unsupervised structured prediction. In Advances in Neural Information Processing Systems, pages 3311–3319.
  • [Bastien et al.2012] Frederic Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud Bergeron, Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio. 2012. Theano: new features and speed improvements. In NIPS 2012 Deep Learning Workshop.
  • [Bengio et al.1994] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. In IEEE Transactions on Neural Networks, pages 157–166.
  • [Bengio et al.2003] Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. In Journal of Machine Learning Reseach, volume 3.
  • [Bengio2009] Yoshua Bengio. 2009. Learning Deep Architectures for AI. Foundations and trends® in Machine Learning, 2(1):1–127.
  • [Bergstra et al.2010] James Bergstra, Olivier Breuleux, Frederic Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for scientific computing conference (SciPy), volume 4.
  • [Cho et al.2014] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar.
  • [Chung et al.2014] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In arXiv:1412.3555 [cs.NE].
  • [Das and Petrov2011] Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 600–609. Association for Computational Linguistics.
  • [Duong et al.2013] Long Duong, Paul Cook, Steven Bird, and Pavel Pecina. 2013. Simpler unsupervised pos tagging with bilingual projections. In ACL (2), pages 634–639.
  • [Elman1990] Jeffrey L Elman. 1990. Finding Structure in Time. Cognitive science, 14(2):179–211.
  • [Erhan et al.2010] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. 2010. Why Does Unsupervised Pre-training Help Deep Learning? Journal of Machine Learning Research, 11(Feb):625–660.
  • [Gouws and Søgaard2015] Stephan Gouws and Anders Søgaard. 2015. Simple task-specific bilingual word embeddings. In Proceedings of NAACL-HLT, pages 1386–1390.
  • [Graves2013] Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv:1308.0850.
  • [Halácsy et al.2007] Péter Halácsy, András Kornai, and Csaba Oravecz. 2007. Hunpos: An open source trigram tagger. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pages 209–212. Association for Computational Linguistics.
  • [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. In Neural computation, pages 1735–1780.
  • [Jamatia and Das2016] Anupam Jamatia and Amitava Das. 2016. Task Report: Tool Contest on POS Tagging for Code-Mixed Indian Social Media (Facebook, Twitter, and Whatsapp) Text@ICON 2016. In Proceedings of ICON 2016.
  • [Jamatia et al.2015] Anupam Jamatia, Björn Gambäck, and Amitava Das. 2015. Part-of-speech tagging for code-mixed English-Hindi Twitter and Facebook chat messages. RECENT ADVANCES IN, page 239.
  • [Jordan1986] Michael I Jordan. 1986. Attractor Dynamics and Parallellism in a Connectionist Sequential Machine. In Proceedings of 1986 Cognitive Science Conference, pages 531–546.
  • [Jozefowicz et al.2015] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. 2015. An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning, pages 2342–2350.
  • [Kreutzer et al.2015] Julia Kreutzer, Shigehiko Schamoni, and Stefan Riezler. 2015. QUality Estimation from ScraTCH (QUETCH): Deep Learning for Word-level Translation Quality Estimation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 316–322, Lisboa, Portugal.
  • [Kupiec1992] Julian Kupiec. 1992. Robust part-of-speech tagging using a hidden markov model. Computer Speech & Language, 6:225–242.
  • [Màrquez and Giménez2004] L Màrquez and J Giménez. 2004. A general pos tagger generator based on support vector machines. Journal of Machine Learning Research.
  • [Mikolov et al.2010] Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of Interspeech, volume 2, Makuhari, Chiba, Japan.
  • [Mikolov et al.2013a] Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013a. Exploiting Similarities among Languages for Machine Translation. In CoRR, pages 1–10.
  • [Mikolov et al.2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
  • [Mnih and Hinton2009] Andriy Mnih and Geoffrey E. Hinton. 2009. A scalable hierarchical distributed language model. In Advances in neural information processing systems, pages 1081–1088.
  • [Morin and Bengio2005] Frederic Morin and Yoshua Bengio. 2005. Hierarchical Probabilistic Neural Network Language Model. In Aistats, volume 5, pages 246–252.
  • [Patel and Sasikumar2016] Raj Nath Patel and M Sasikumar. 2016. Translation Quality Estimation using Recurrent Neural Network. In Proceedings of the First Conference on Machine Translation, volume 2, pages 819–824, Berlin, Germany. Association for Computational Linguistics.
  • [Pimpale and Patel2015] Prakash B. Pimpale and Raj Nath Patel. 2015. Experiments with POS Tagging Code-mixed Indian Social Media Text. ICON.
  • [Qin2015] Longlu Qin. 2015. POS tagging of Chinese Buddhist texts using Recurrent Neural Networks. Technical report, Stanford University.
  • [Sarkar2015] Kamal Sarkar. 2015. Part-of-Speech Tagging for Code-mixed Indian Social Media Text at ICON 2015. ICON.
  • [Schmid and Laws2008] Helmut Schmid and Florian Laws. 2008. Estimation of conditional probabilities with decision trees and an application to fine-grained POS tagging. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1, pages 777–784. Association for Computational Linguistics.
  • [Schwenk2007] Holger Schwenk. 2007. Continuous space language models. In Computer Speech and Language, volume 21, pages 492–518.
  • [Socher et al.2013a] Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013a. Parsing With Compositional Vector Grammars. In Proceedings of ACL 2013, pages 455–465.
  • [Socher et al.2013b] Richard Socher, Alex Perelygin, and Jy Wu. 2013b. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP, pages 1631–1642.
  • [Toutanova et al.2003] Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pages 173–180. Association for Computational Linguistics.
  • [Vyas et al.2014] Yogarshi Vyas, Spandana Gella, Jatin Sharma, Kalika Bali, and Monojit Choudhury. 2014. Pos tagging of english-hindi code-mixed social media content. In EMNLP, volume 14, pages 974–979.
  • [Wang et al.2015] Peilu Wang, Yao Qian, Frank K Soong, Lei He, and Hai Zhao. 2015. Part-of-speech tagging with bidirectional long short-term memory recurrent neural network. arXiv preprint arXiv:1510.06168.
  • [Werbos1990] Paul J. Werbos. 1990. Backpropagation through time: what it does and how to do it. In Proceedings of the IEEE, volume 78, pages 1550–1560.
  • [Yao et al.2013] Kaisheng Yao, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi, and Dong Yu. 2013. Recurrent neural networks for language understanding. In INTERSPEECH, pages 2524–2528.
  • [Yao et al.2014] Kaisheng Yao, Baolin Peng, Yu Zhang, Dong Yu, Geoffrey Zweig, and Yangyang Shi. 2014. Spoken language understanding using long short-term memory neural networks. In Spoken Language Technology Workshop (SLT), IEEE, pages 189–194.
  • [Zeiler2012] Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. arXiv:1212.5701 [cs.LG].
  • [Zennaki et al.2015] Othman Zennaki, Nasredine Semmar, and Laurent Besacier. 2015. Unsupervised and Lightly Supervised Part-of-Speech Tagging Using Recurrent Neural Networks. In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation, Shanghai, China.