This contains RNN based word level quality estimation, and Part-of-Speech-Tagger
This paper describes the Centre for Development of Advanced Computing's (CDACM) submission to the shared task 'Tool Contest on POS tagging for Code-Mixed Indian Social Media (Facebook, Twitter, and Whatsapp) Text', co-located with ICON-2016. The shared task was to predict the Part of Speech (POS) tag at word level for a given text. Code-mixed text is generated mostly on social media by multilingual users. The presence of multilingual words, transliterations, and spelling variations makes such content linguistically complex. In this paper, we propose an approach to POS tag code-mixed social media text using the Recurrent Neural Network Language Model (RNN-LM) architecture. We submitted results for Hindi-English (hi-en), Bengali-English (bn-en), and Telugu-English (te-en) code-mixed data.
Code-Mixing and Code-Switching are observed in the text or speech produced by a multilingual user. Code-Mixing occurs when a user changes the language within a sentence, i.e. a clause, phrase, or word of one language is used within an utterance of another language. In contrast, Code-Switching is the co-occurrence of speech extracts belonging to two different grammatical systems.
The language analysis of code-mixed text is a non-trivial task. Traditional approaches to POS tagging are not effective for this text, as it generally does not adhere to any grammatical structure. Many studies have shown that RNN based POS taggers produce comparable results and are even the state of the art for some languages. However, to the best of our knowledge, no study has been done on RNN based POS tagging of code-mixed data.
In this paper, we propose a POS tagger using the RNN-LM architecture for code-mixed Indian social media text. Earlier, researchers have adopted the RNN-LM architecture for Natural Language Understanding (NLU) [Yao et al.2013, Yao et al.2014] and Translation Quality Estimation [Patel and Sasikumar2016]. RNN-LM models are similar to other vector-space language models [Bengio et al.2003, Morin and Bengio2005, Schwenk2007, Mnih and Hinton2009], where each word is represented by a high dimensional real-valued vector. We modified the RNN-LM architecture to predict the POS tag of a word, given the word and its context. Consider the following example:
Output : G_N G_PRP G_N CC G_V G_R G_R
In the above sentence, to predict POS tag (G_N) for the word ‘’ using an RNN-LM model with window size 3, the input will be ‘’. Whereas, in standard RNN-LM model, ‘’ will be the input with ‘’ as the output. We will discuss details of various models tried and their implementations in section 3.
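The windowed input described above can be sketched as follows. This is a minimal illustration, assuming the window is centred on the word being tagged and padded at the sentence boundaries; the padding token, the function name, and the placeholder words `w1` to `w7` are hypothetical, and only the tag sequence is taken from the example.

```python
from typing import List

PAD = "<pad>"  # hypothetical padding token, not taken from the paper


def context_windows(tokens: List[str], size: int) -> List[List[str]]:
    """Return one centred window of `size` tokens for every position."""
    if size < 1 or size % 2 == 0:
        raise ValueError("window size must be a positive odd number")
    half = size // 2
    padded = [PAD] * half + list(tokens) + [PAD] * half
    return [padded[i:i + size] for i in range(len(tokens))]


# Placeholder words stand in for the sentence; the tags are those of the example.
tokens = ["w1", "w2", "w3", "w4", "w5", "w6", "w7"]
tags = ["G_N", "G_PRP", "G_N", "CC", "G_V", "G_R", "G_R"]

for window, tag in zip(context_windows(tokens, 3), tags):
    print(window, "->", tag)
```

Each window is the model input and the tag of its middle word is the target, in contrast to a standard language model, where the target would be the next word.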
In this paper, we show that our approach achieves results close to those of state-of-the-art systems such as Stanford (a Maximum-Entropy based POS tagger, http://nlp.stanford.edu/software/tagger.shtml) [Toutanova et al.2003] and HunPos (a Hidden Markov Model based POS tagger, https://code.google.com/archive/p/hunpos/) [Halácsy et al.2007].
Recently, RNN based models have been used to POS tag formal text, but they have not yet been tried on code-mixed data. Wang et al. (2015) applied a Bidirectional Long Short-Term Memory (LSTM) network to the Penn Treebank WSJ test set and reported state-of-the-art performance. Qin (2015) showed that RNN models outperform Majority Voting (MV) and HMM techniques for POS tagging of Chinese Buddhist text. Zennaki et al. (2015) used RNNs for resource-poor languages and reported results comparable with state-of-the-art systems [Das and Petrov2011, Duong et al.2013, Gouws and Søgaard2015].
Work on POS tagging code-mixed Indian social media text is still at a very nascent stage. Vyas et al. (2014) and Jamatia et al. (2015) worked on data labeling and automatic POS tagging of such data using various machine learning techniques. Building further on that labeled data, Pimpale and Patel (2015) and Sarkar (2015) tried word embeddings as an additional feature for machine learning based classifiers for POS tagging.
In the following sub-sections, we give a brief description of each model with its mathematical equations (1, 2, and 3). In the equations, $x_t$ and $y_t$ are the input and output vectors respectively. $h_t$ and $h_{t-1}$ represent the current and previous hidden states respectively. $W$ are the weight matrices and $b$ are the bias vectors. $\odot$ is the elementwise multiplication of vectors. We used $\sigma$, the logistic sigmoid, and $\tanh$, the hyperbolic tangent function, to add nonlinearity to the network, with the softmax function at the output layer.
Elman and Jordan [Jordan1986] networks are the simplest networks in the RNN family and are known as Simple_RNN. The Elman network is defined by the following set of equations:
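In their standard form, these equations read as follows, with $x_t$ the input, $h_t$ and $h_{t-1}$ the current and previous hidden states, and $y_t$ the output. The subscripts on the weight matrices $W$ and bias vectors $b$, and the choice of the logistic sigmoid $\sigma$ for the hidden layer, are assumed notation rather than details confirmed by the text.

```latex
\begin{aligned}
h_t &= \sigma\left(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h\right) \\
y_t &= \mathrm{softmax}\left(W_{hy}\, h_t + b_y\right)
\end{aligned}
```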
LSTM is found to be better for modeling of long-range dependencies than Simple_RNN. Simple_RNN also suffers from the problem of vanishing and exploding gradient [Bengio et al.1994]. LSTM and other complex RNN models tackle this problem by introducing a gating mechanism. Many variants of LSTM [Graves2013, Yao et al.2014, Jozefowicz et al.2015] have been tried in literature for the various tasks. We implemented the following version:
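A common peephole-free formulation is sketched here, with $i_t$, $f_t$, and $o_t$ the input, forget, and output gates, $\tilde{c}_t$ the new memory content, and $c_t$ the updated memory. Whether the implemented version included peephole connections or other variations is not stated, so the weight subscripts and the exact form should be read as assumptions.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi}\, x_t + W_{hi}\, h_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf}\, x_t + W_{hf}\, h_{t-1} + b_f\right) \\
o_t &= \sigma\left(W_{xo}\, x_t + W_{ho}\, h_{t-1} + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_{xc}\, x_t + W_{hc}\, h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\left(c_t\right) \\
y_t &= \mathrm{softmax}\left(W_{hy}\, h_t + b_y\right)
\end{aligned}
```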
where $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates respectively. $\tilde{c}_t$ is the new memory content and $c_t$ is the updated memory.
In this paper, we used a Deep LSTM with two layers. A Deep LSTM is created by stacking multiple LSTMs on top of each other, so that the output of the lower LSTM forms the input to the upper LSTM. At each time step, we apply a matrix transform to the output of the lower LSTM to form the input for the upper LSTM. This matrix transformation enables us to have two consecutive LSTM layers of different sizes.
GRU is quite similar to the LSTM, but without a separate memory unit. The GRU network also uses a different gating mechanism, with update ($z_t$) and reset ($r_t$) gates. The following set of equations defines a GRU model:
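A standard GRU formulation is sketched here, with $z_t$ the update gate, $r_t$ the reset gate, and $\tilde{h}_t$ the candidate hidden state. The interpolation in the fourth line follows the convention of Chung et al. (2014); the weight subscripts and this choice of convention are assumptions, since the implemented form is not recoverable from the text.

```latex
\begin{aligned}
z_t &= \sigma\left(W_{xz}\, x_t + W_{hz}\, h_{t-1} + b_z\right) \\
r_t &= \sigma\left(W_{xr}\, x_t + W_{hr}\, h_{t-1} + b_r\right) \\
\tilde{h}_t &= \tanh\left(W_{xh}\, x_t + W_{hh}\left(r_t \odot h_{t-1}\right) + b_h\right) \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t \\
y_t &= \mathrm{softmax}\left(W_{hy}\, h_t + b_y\right)
\end{aligned}
```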
All the models were implemented using the THEANO framework (http://deeplearning.net/software/theano/#download) [Bergstra et al.2010, Bastien et al.2012]. For all the models, the word embedding dimensionality was 100, the number of hidden units was 100, and the context word window size was 5 ( ). We initialized all the square weight matrices as random orthogonal matrices. All the bias vectors were initialized to zero. Other weight matrices were sampled from a Gaussian distribution with mean 0 and variance.
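The initialization scheme can be illustrated with a short standard-library sketch. It assumes Gram-Schmidt orthogonalization of Gaussian rows as the way to obtain a random orthogonal matrix, which is one common choice and not necessarily the one used here; the standard deviation `0.01`, the tag-set size `18`, and all function names are hypothetical.

```python
import random
from typing import List

Matrix = List[List[float]]


def random_orthogonal(n: int, rng: random.Random) -> Matrix:
    """Build an n x n matrix with orthonormal rows (Gram-Schmidt on Gaussian rows)."""
    rows: Matrix = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for q in rows:  # remove the component along each accepted row
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-8:  # resample in the unlikely degenerate case
            rows.append([a / norm for a in v])
    return rows


def gaussian_matrix(n_rows: int, n_cols: int, std: float, rng: random.Random) -> Matrix:
    """Sample every entry from a zero-mean Gaussian with the given standard deviation."""
    return [[rng.gauss(0.0, std) for _ in range(n_cols)] for _ in range(n_rows)]


rng = random.Random(0)
hidden, n_tags = 100, 18                 # n_tags is a hypothetical tag-set size
W_hh = random_orthogonal(hidden, rng)    # square matrix: random orthogonal
W_hy = gaussian_matrix(n_tags, hidden, std=0.01, rng=rng)  # std is an assumed value
b_h = [0.0] * hidden                     # biases start at zero
```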
We trained all the models using Truncated Back-Propagation-Through-Time (T-BPTT) [Werbos1990] with stochastic gradient descent. Standard values of the hyper-parameters were used for RNN model training, as suggested in the literature [Yao et al.2014, Patel and Sasikumar2016]. The depth of BPTT was fixed to 7 for all the models. We trained each model for 50 epochs and used Ada-delta [Zeiler2012] to adapt the learning rate of each parameter automatically ( and ).
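The per-parameter adaptation performed by Ada-delta can be sketched as below. The decay rate `rho=0.95` and the constant `eps=1e-6` are commonly quoted defaults and are assumptions here, not values taken from the text; the class name and the toy objective are illustrative only.

```python
import math
from typing import List


class Adadelta:
    """Minimal Adadelta update for a flat list of parameters."""

    def __init__(self, n_params: int, rho: float = 0.95, eps: float = 1e-6):
        self.rho = rho    # decay rate of the running averages (assumed default)
        self.eps = eps    # numerical stability constant (assumed default)
        self.avg_sq_grad = [0.0] * n_params   # running average of g^2
        self.avg_sq_step = [0.0] * n_params   # running average of (delta x)^2

    def step(self, params: List[float], grads: List[float]) -> List[float]:
        updated = []
        for i, (p, g) in enumerate(zip(params, grads)):
            self.avg_sq_grad[i] = self.rho * self.avg_sq_grad[i] + (1 - self.rho) * g * g
            # The step size is the ratio of the two RMS terms; no global learning rate.
            delta = -(math.sqrt(self.avg_sq_step[i] + self.eps)
                      / math.sqrt(self.avg_sq_grad[i] + self.eps)) * g
            self.avg_sq_step[i] = self.rho * self.avg_sq_step[i] + (1 - self.rho) * delta * delta
            updated.append(p + delta)
        return updated


# Toy usage: minimise f(x) = x^2, whose gradient is 2x.
optimizer = Adadelta(n_params=1)
x = [1.0]
for _ in range(50):
    x = optimizer.step(x, [2.0 * x[0]])
print(x[0])
```

Because each parameter keeps its own running averages, parameters with consistently large gradients take proportionally smaller steps, which is the automatic adaptation referred to above.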
We used the data shared by the contest organizers [Jamatia and Das2016]. The code-mixed data of bn-en, hi-en and te-en was shared separately for the Facebook (fb), Twitter (twt) and Whatsapp (wa) posts and conversations with Coarse-Grained (CG) and Fine-Grained (FG) POS annotations. We combined the data from fb, twt, and wa for CG and FG annotation of each language pair. The data was divided into training, testing, and development sets. Testing and development sets were randomly sampled from the complete data. Table 1 details sizes of the different sets at the sentence and token level. Tag-set counts for CG and FG are also provided.
We preprocessed the text for Mentions, Hashtags, Smilies, URLs, Numbers, and Punctuation. In the preprocessing, we mapped all the words of a group to a single new token, as they have the same POS tag. For example, all the Mentions like @dhoni, @bcci, and @iitb were mapped to @user; all the Hashtags like #dhoni, #bcci, and #iitb were mapped to #user.
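This group-to-token mapping can be sketched with a few regular expressions. Only the `@user` and `#user` targets come from the description above; the other placeholder tokens (`<url>`, `<num>`, `<smiley>`, `<punct>`) and all the patterns are illustrative assumptions.

```python
import re
from typing import List

# Rules are tried in order, so smilies are matched before generic punctuation.
RULES = [
    (re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE), "<url>"),
    (re.compile(r"^@\w+$"), "@user"),
    (re.compile(r"^#\w+$"), "#user"),
    (re.compile(r"^[+-]?\d+(?:[.,]\d+)*$"), "<num>"),
    (re.compile(r"^[:;=8][-o']?[)(\]\[DPp/\\|]+$"), "<smiley>"),
    (re.compile(r"^[^\w\s]+$"), "<punct>"),
]


def normalize_token(token: str) -> str:
    """Map a token to its group placeholder, or return it unchanged."""
    for pattern, placeholder in RULES:
        if pattern.match(token):
            return placeholder
    return token


def normalize(tokens: List[str]) -> List[str]:
    return [normalize_token(t) for t in tokens]


print(normalize(["@dhoni", "#bcci", "hello", "42", ":)", "!!", "http://t.co/x"]))
```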
The RNN-LM models use only the context words' embeddings as input features. We experimented with three RNN model configurations. In the first setting (Simple_RNN, LSTM, Deep LSTM, GRU), we learned the word representations from scratch along with the other model parameters. In the second configuration (GRU_Pre), we pre-trained the word representations using the word2vec tool [Mikolov et al.2013b] and fine-tuned them while training the other parameters of the network. Pre-training not only guides the learning towards minima with better generalization in non-convex optimization [Bengio2009, Erhan et al.2010] but also improves the accuracy of the system [Kreutzer et al.2015, Patel and Sasikumar2016]. In the third setting (GRU_Pre_Lang), we also added the language of each word as an additional feature alongside the context words. We learned the vector representations of the languages from scratch, in the same way as those of the words.
We used F1-Score to evaluate the experiments; the results are displayed in Table 2. We trained the models as described in Section 3.4. To compare our results, we also trained the Stanford and HunPos taggers on the same data; their accuracy is also given in Table 2.
From the table, it is evident that pre-training and using language as an additional feature are helpful. Also, the accuracy of our best system (GRU_Pre_Lang) is comparable to that of Stanford and HunPos. GRU models also outperform the other models (Simple_RNN, LSTM, Deep LSTM) on this task, as reported by Chung et al. (2014) for a suite of NLP tasks.
|hi-en %F1 score||bn-en %F1 score||te-en %F1 score|
The contest had two types of submissions: constrained, in which participants were restricted to using only the data shared by the organizers with their own implemented systems; and unconstrained, in which participants were allowed to use publicly available resources (training data, implemented systems, etc.).
We submitted for all the language pairs (hi-en, bn-en, and te-en) and domains (fb, twt, and wa). For the constrained submission, the output of GRU_Pre_Lang was used. For the unconstrained submission, we trained the Stanford POS tagger on the same data. Jamatia and Das (2016) evaluated all the submitted systems against another gold-test set and reported the results.
We carried out a preliminary analysis of our systems and report a few observations in this section.
The POS categories contributing most to the errors are G_X, G_V, G_N, and G_J for the coarse-grained systems and V_VM, JJ, N_NN, and N_NNP for the fine-grained systems. Our confusion matrix analysis also showed that these POS tags are mostly confused with one another. For instance, the G_J tag was mistagged 28 times, and in 17 of those cases it was tagged as G_N.
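A confusion analysis of this kind can be sketched by counting (gold, predicted) pairs over the mistagged tokens. The tag sequences below are toy data made up for illustration and do not reproduce the counts reported above; the function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def confusion_counts(gold: List[str], predicted: List[str]) -> Counter:
    """Count (gold, predicted) pairs for the tokens that were tagged wrongly."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter((g, p) for g, p in zip(gold, predicted) if g != p)


def most_confused_with(errors: Counter, tag: str) -> List[Tuple[str, int]]:
    """List the wrong tags assigned to `tag`, most frequent first."""
    per_tag = Counter({p: n for (g, p), n in errors.items() if g == tag})
    return per_tag.most_common()


# Toy data for illustration only.
gold = ["G_J", "G_J", "G_N", "G_V", "G_J"]
pred = ["G_N", "G_N", "G_N", "G_N", "G_V"]

errors = confusion_counts(gold, pred)
print(most_confused_with(errors, "G_J"))
```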
RNN models require a large corpus to train the model parameters. From the results, we can observe that for hi-en and te-en, with only approx. 2K training sentences, the results of the best RNN model (GRU_Pre_Lang) are comparable to those of Stanford and HunPos. For bn-en, the corpus was very small (only approx. 0.5K sentences) for RNN training, which resulted in poor performance compared to Stanford and HunPos. Given this and the earlier work on RNN based POS tagging, we can expect that RNN models could achieve state-of-the-art accuracy given a sufficient amount of training data.
In general, LSTM and Deep LSTM models perform better than Simple_RNN. Here, however, Simple_RNN outperforms both LSTM and Deep LSTM. The reason could be the small amount of data available for training such complex models.
A few orthographically similar words of English and Hindi that have different POS tags are given with examples in Table 3. The system gets confused when POS tagging such words. By adding language as an additional feature, we were able to tag these types of words correctly.
|Word|Language|Example|POS tag|
|---|---|---|---|
|are|hi|are shyaam kidhar ho?|PSP|
|are|en|they are going.|G_V|
|to|hi|tumane to dekha hi nhi.|G_PRT|
|to|en|they go to school.|CC|
|hi|hi|mummy to aisi hi hain.|G_V|
|hi|en|hi, how are you.|G_PRT|
We developed a language independent and generic POS tagger for social media text using RNN networks. We tried Simple_RNN, LSTM, Deep LSTM, and GRU models. We showed that GRU outperforms the other models and also benefits from pre-training and from language as an additional feature. Also, the accuracy of our approach is comparable to that of Stanford and HunPos.
In the future, we could try RNN models with more features, such as the POS tags of context words, prefixes and suffixes, word length, and position. Word characters have also been found to be a very useful feature in RNN based POS taggers.
Conditional random field autoencoders for unsupervised structured prediction. In Advances in Neural Information Processing Systems, pages 3311–3319.
NIPS 2012 deep learning workshop.
Estimation of conditional probabilities with decision trees and an application to fine-grained POS tagging. In Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1, pages 777–784. Association for Computational Linguistics.