A hierarchical character-word neural network for language identification
Social media messages' brevity and unconventional spelling pose a challenge to language identification. We introduce a hierarchical model that learns character and contextualized word-level representations for language identification. Our method performs well against strong baselines, and can also reveal code-switching.
Language identification (language ID), despite being described as a solved problem more than ten years ago [McNamee2005], remains difficult, particularly for short texts, informal styles, and closely related language pairs, and is an active area of research [Gella et al.2014, Wang et al.2015, Baldwin and Lui2010]. These difficult cases are often found in social media content. Progress on language ID in social media is especially needed because downstream tasks, like machine translation or semantic parsing, depend on correct language ID.
We adapt recent advances in neural modeling, developed for language modeling and other tasks, to language ID. Specifically, we adapt a hierarchical character-word neural architecture from kim2015character, demonstrating that it works well for language ID. Our model, which we call C2V2L ("character to vector to language"), is hierarchical in the sense that it explicitly builds a continuous representation for each word from its character sequence, capturing orthographic and morphology-related patterns, and then combines those word-level representations in context, finally classifying the full word sequence. Our model does not require any special handling of casing or punctuation, nor do we need to remove the URLs, usernames, or hashtags, and it is trained end-to-end using standard procedures.
We demonstrate the model's state-of-the-art performance in experiments on two difficult language ID datasets consisting of tweets. This hierarchical technique works better than previous classifiers using character or word n-gram features, as well as a similar neural model that treats an entire tweet as a single character sequence. We find further that the model can benefit from additional out-of-domain data, unlike much previous work, and with very little modification can annotate word-level code-switching. We also confirm that smoothed character n-gram language models perform very well for language ID tasks.
Our model has two main components, though they are trained together, end-to-end. (Code will be made available on GitHub after publication.)
The first, “char2vec,” applies a convolutional neural network (CNN) to a whitespace-delimited word’s Unicode character sequence, providing a word vector. The second is a bidirectional LSTM recurrent neural network (RNN) that maps a sequence of such word vectors to a label (a language).
The first layer of char2vec embeds characters. An embedding is learned for each Unicode code point that appears at least twice in the training data, including punctuation, emoji, and other symbols. If V is the set of characters, we let the size of the character embedding layer be d = ⌈log2 |V|⌉. (If each dimension of the character embedding vector holds just one bit of information, then d bits should be enough to uniquely encode each character.) The character embedding matrix is E ∈ R^(d×|V|). Words are given to the model as a sequence of characters. When each character in a word of length l is replaced by its embedding vector, we get a matrix C ∈ R^(d×(l+2)). There are l + 2 columns in C because padding characters are added to the left and right of each word.
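The parenthetical above implies that the embedding dimension is the number of bits needed to distinguish the characters in the vocabulary; under that reading, d = ⌈log2 |V|⌉. A minimal sketch, using the vocabulary sizes reported later for the two datasets (956 and 5,796 characters):

```python
import math

def char_embedding_dim(vocab_size):
    # one bit per dimension suffices to tell vocab_size characters apart
    return math.ceil(math.log2(vocab_size))

# vocabulary sizes reported for TweetLID and Twitter70, respectively
tweetlid_d = char_embedding_dim(956)
twitter70_d = char_embedding_dim(5796)
```

This gives a 10-dimensional character embedding for TweetLID and a 13-dimensional one for Twitter70.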
The char2vec architecture uses two sets of filter banks. The first set is comprised of matrices H_a ∈ R^(d×3), where a ranges from 1 to n1. The matrix C is narrowly convolved with each of the H_a, a bias term b_a is added, and a ReLU non-linearity, relu(x) = max(0, x), is applied to produce an output T1. T1 is of size n1 × l, with one row for each of the filters and one column for each of the l characters in the input word. Since each of the H_a is a filter with a width of three characters, the columns of T1 each hold a representation of a character tri-gram. During training, we apply dropout on T1 to regularize the model. The matrix T1 is then convolved with a second set of filters G_b ∈ R^(n1×w), where b ranges from 1 to 3n2 and n2 controls the number of filters of each of the possible widths, w = 3, 4, or 5. Another convolution and ReLU non-linearity is applied to get T2. Max-pooling across time is used to create a fixed-size vector y from T2. The dimension of y is 3n2, corresponding to the number of filters used.
Similar to kim2015character, who use a highway network after the max-pooling layer, we apply a residual network layer. Both highway and residual network layers allow values from the previous layer to pass through unchanged, but the residual layer is preferred in our case because it uses half as many parameters [He et al.2015]. The residual network uses a matrix W ∈ R^(3n2×3n2) and bias vector b to create the vector z = y + f(y), where f(y) = relu(Wy + b). The resulting vector z is used as a word embedding vector in the word-level LSTM portion of the model.
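The char2vec forward pass described above can be sketched in NumPy. This is an unofficial illustration, not the released implementation: the sizes (d = 10, n1 = 4, n2 = 2), the random parameters, and the example word are all placeholders, and dropout is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def narrow_conv(X, filters):
    """Narrow 1-D convolution. X: (d, L); filters: (n, d, w) -> (n, L - w + 1)."""
    n, d, w = filters.shape
    L = X.shape[1]
    out = np.empty((n, L - w + 1))
    for t in range(L - w + 1):
        out[:, t] = np.tensordot(filters, X[:, t:t + w], axes=([1, 2], [0, 1]))
    return out

def char2vec(char_ids, E, H, b1, G, b2, W, b3):
    C = E[:, char_ids]                              # (d, l + 2), incl. padding
    T1 = relu(narrow_conv(C, H) + b1[:, None])      # (n1, l): tri-gram features
    pooled = []
    for Gw, bw in zip(G, b2):                       # second bank: widths 3, 4, 5
        T2 = relu(narrow_conv(T1, Gw) + bw[:, None])
        pooled.append(T2.max(axis=1))               # max-pool across time
    y = np.concatenate(pooled)                      # (3 * n2,)
    return y + relu(W @ y + b3)                     # residual layer output z

# illustrative sizes, not the paper's tuned values
d, n1, n2, vocab = 10, 4, 2, 20
E = rng.normal(size=(d, vocab))
H, b1 = rng.normal(size=(n1, d, 3)), np.zeros(n1)
G = [rng.normal(size=(n2, n1, w)) for w in (3, 4, 5)]
b2 = [np.zeros(n2) for _ in range(3)]
W, b3 = rng.normal(size=(3 * n2, 3 * n2)), np.zeros(3 * n2)

word = np.array([0, 5, 3, 7, 3, 9, 2, 1])           # length-6 word plus 2 pads
z = char2vec(word, E, H, b1, G, b2, W, b3)
```

The output z has dimension 3·n2 regardless of word length, which is what lets the word-level LSTM consume words of any length.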
There are three differences between our version of the model and the one described by kim2015character. First, we use two layers of convolution instead of just one, inspired by [Ling et al.2015a], which uses a 2-layer LSTM for character modeling. Second, we use the ReLU function as a nonlinearity, as opposed to the tanh function. ReLU has been highly successful in computer vision applications in conjunction with convolutional layers [Jarrett et al.2009]. Finally, we use a residual network layer instead of a highway network layer after the max-pooling step, to reduce the model size.
It is possible to use bi-LSTMs instead of convolutional layers in char2vec as done by ling2015finding. We explored this option in preliminary experiments and found that using convolutional layers has several advantages, including a large improvement in speed for both the forward and backward pass, many fewer parameters, and improved language ID accuracy.
The sequence of word embedding vectors is processed by a bi-LSTM, which outputs a sequence of vectors [v_1, ..., v_T], where T is the number of words in the tweet. All LSTM gates are used as defined by sak2014long. Dropout is used as a regularizer on the inputs to the LSTM [Zaremba et al.2014]. The output vectors v_i are transformed into probability distributions over the set of languages by applying an affine transformation followed by a softmax: p_i = softmax(A v_i + b).
(These word-level predictions, we will see in §5.4, are useful for annotating code-switching.) The sentence-level prediction is then given by an average of the word-level language predictions: p = (1/T) Σ_i p_i.
The final affine transformation can be interpreted as a language embedding, where each language is represented by a vector of the same dimensionality as the LSTM outputs. The goal of the LSTM then is (roughly) to maximize the dot product of each word’s representation with the language embedding(s) for that sentence. The only supervision in the model comes from computing the loss of sentence-level predictions.
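The word-level softmax followed by sentence-level averaging can be sketched as follows; the dimensions (4 words, hidden size 8, 6 languages) and random LSTM outputs are illustrative placeholders:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict(V, A, b):
    """V: (T, h) LSTM output vectors; A: (L, h) language embeddings; b: (L,)."""
    P = np.stack([softmax(A @ v + b) for v in V])   # word-level distributions
    return P, P.mean(axis=0)                        # sentence-level average

rng = np.random.default_rng(1)
V = rng.normal(size=(4, 8))                 # 4 words, hidden size 8 (illustrative)
A, b = rng.normal(size=(6, 8)), np.zeros(6)  # 6 languages (e.g. the TweetLID set)
P_words, p_sentence = predict(V, A, b)
```

Because each row of A acts as a language embedding, A @ v is exactly the dot product of the word representation with each language vector, as described above.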
We consider two datasets: TweetLID and Twitter70. Summary statistics for each of the datasets are provided in Table 1 including the number of training examples per dataset.
The TweetLID dataset [Zubiaga et al.2014] comes from a language ID shared task that focused on six commonly spoken languages of the Iberian peninsula: Spanish, Portuguese, Catalan, Galician, English, and Basque. There are approximately 15,000 Tweets in the training data and 25,000 in the test set. The data is unbalanced, with the majority of examples being in the Spanish language. The “undetermined” label (’und’), comprising 1.4% of the training data, is used for Tweets that use only non-linguistic tokens or belong to an outside language. Additionally, some Tweets are ambiguous (’amb’) among a set of languages (2.3%), or code-switch between languages (2.4%). The evaluation criteria take into account all of these factors, requiring prediction of at least one acceptable language for an ambiguous Tweet or all languages present for a code-switched Tweet. The fact that hundreds of Tweets were labeled ambiguous or undetermined by annotators who were native speakers of these languages reveals the difficulty of this task.
For tweets labeled as ambiguous or containing multiple languages, the training objective distributes the “true” probability mass evenly across each of the label languages, e.g. 50% Spanish and 50% Catalan.
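One way to implement this soft-label objective (a sketch; the paper does not spell out the loss computation) is cross-entropy against a target distribution that splits the gold mass evenly over the labeled languages:

```python
import math

def soft_label_xent(pred, gold_langs):
    """pred: predicted probability per language; gold_langs: gold label indices."""
    mass = 1.0 / len(gold_langs)                    # e.g. 0.5 each for two labels
    return -sum(mass * math.log(pred[g]) for g in gold_langs)

# a tweet labeled 50% language 0 (say, Spanish) and 50% language 1 (Catalan)
loss = soft_label_xent([0.5, 0.5, 0.0], [0, 1])
```

With a single gold label this reduces to the standard cross-entropy loss.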
The TweetLID shared task had two tracks: one that restricted participants to only use the official training data and another that was unconstrained, allowing the use of any external data. There were 12 submissions in the constrained track and 9 in the unconstrained track. Perhaps surprisingly, most participants performed worse on the unconstrained track than they did on the constrained one.
As supplementary data for our unconstrained-track experiments, we collected data from Wikipedia for each of the six languages in the TweetLID corpus. Participants in the TweetLID shared task also used Wikipedia as a data source for the unconstrained track. We split the text into 25,000 sentence fragments per language, with each fragment of length comparable to that of a Tweet. The Wikipedia sentence fragments are easily distinguished from tweets. Wikipedia fragments are more formal and are more likely to use complex words; for example, one fragment reads "ring homomorphisms are identical to monomorphisms in the category of rings." In contrast, tweets tend to use variable spelling and simpler words, as in "Haaaaallelujaaaaah http://t.co/axwzUNXk06" and "@justinbieber: Love you mommy http://t.co/xEGAxBl6Cc http://t.co/749s6XKkgK awe". Previous work confirms that language ID is more challenging on social media text than on sentence fragments taken from more formal text, like Wikipedia [Carter2012]. Despite the domain mismatch, we find in §5.2 that additional text at training time is useful for our model.
The TweetLID training data is too small to divide into training and validation sets. We created a tuning set by adding samples taken from Twitter70 and from the 2014 Workshop on Computational Approaches to Code Switching [Solorio et al.2014]
to the official TweetLID training data. We used this augmented dataset with a 4:1 train/eval split for hyperparameter tuning. (We used this augmented data to tune hyperparameters for both constrained and unconstrained models. However, after setting hyperparameters, we trained our constrained model using only the official training data, and the unconstrained model using only the training data plus Wikipedia. Thus, no extra data was used to learn actual model parameters for the constrained case.)
The Twitter70 dataset was published by the Twitter Language Engineering Team in November 2015. (For clarity, we refer to this data as "Twitter70," but it can be found in the Twitter blog post under the name "recall oriented"; see http://t.co/EOVqA0t79j.) The languages come from the Afroasiatic, Dravidian, Indo-European, Sino-Tibetan, and Tai-Kadai families. Anyone who wants to use the data must re-download the Tweets using the Twitter API, and between publication and download some Tweets are lost due to account deletion or changes in privacy settings. At the time the data was published, there were approximately 1,500 Tweets for each language. We were able to download 82% of the Tweets, but the amount we could access varied by language, with as many as 1,569 examples for Sindhi and as few as 371 and 39 examples for Uyghur and Oriya, respectively. The median number of Tweets per language was 1,083. To our knowledge, there are no published benchmarks on this dataset.
Unlike TweetLID, the Twitter70 data has no unknown or ambiguous labels. Some tweets do contain code-switching but that is not labeled as such; a single language is assigned. There is no predefined test set so we used the last digit of the identification number to partition them. Identifiers ending in zero (15%) were used for the test set and those ending in one (5%) were used for tuning.
When processing the input at the character level, the vocabulary for each data source is defined as the set of Unicode code-points that occur at least twice in the training data: 956 and 5,796 characters for TweetLID and Twitter70, respectively. A small number of languages, such as Mandarin, are responsible for most of the characters in the Twitter70 vocabulary.
One recent work processed the input one byte at a time instead of by character [Gillick et al.2015]. In early experiments, we found that when using bytes the model would often make mistakes that should have been much more obvious from the orthography alone. We do not recommend using the byte sequence for language ID.
An advantage of the hybrid character-word model is that only limited preprocessing is required. The runtime of training char2vec is proportional to the longest word in a minibatch, and the data contains many long, repetitive character sequences such as "hahahaha…" or "arghhhhh…". To deal with these, we restricted any sequence of repeating characters to at most five repetitions, where the repeating pattern can be from one to four characters. Many tweets string together large numbers of Twitter usernames or hashtags without spaces between them; these create extra-long "words" that cause our implementation to use more memory and do extra computation during training. To solve this, we enforce the constraint that there must be a space before any URL, username, or hashtag. To deal with the few remaining extra-long character sequences, we force word breaks in non-space character sequences every 40 bytes. This primarily affects languages that are not space-delimited, like Chinese. We do not perform any special handling of casing or punctuation, nor do we need to remove the URLs, usernames, or hashtags as has been done in previous work [Zubiaga et al.2014]. The same preprocessing is used when training the n-gram models.
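The three preprocessing rules can be approximated with regular expressions. This is a rough sketch, not the authors' code: the forced break here counts characters rather than UTF-8 bytes, and the URL/username/hashtag patterns are simplified.

```python
import re

def preprocess(text):
    # cap any repeating pattern of 1-4 characters at five repetitions
    text = re.sub(r"(.{1,4}?)\1{5,}", lambda m: m.group(1) * 5, text)
    # force a space before URLs, usernames, and hashtags
    text = re.sub(r"(?<=\S)(?=https?://|[@#])", " ", text)
    # break non-space runs every 40 characters (bytes in the paper)
    text = re.sub(r"(\S{40})(?=\S)", r"\1 ", text)
    return text
```

For example, `preprocess` shortens "hahahahahahahaha" to "hahahaha" plus one more repetition ("ha" × 5), and splits "word#tag" into "word #tag".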
Training is done using minibatches of size 25 and a learning rate of 0.001, using the Adam optimization method [Kingma and Ba2014]. For the Twitter70 dataset, we used 5% held-out data for tuning and 15% for evaluation. To tune, we trained 15 models with random hyperparameters and selected the one that performed best on the development set. Training is run for 80,000 minibatches for TweetLID and 100,000 for the Twitter70 dataset.
There are only four hyperparameters to tune for each model: the number of filters in the first convolutional layer, the number of filters in the second convolutional layer, the size of the word-level LSTM vector, and the dropout rate. We designed our model to have as few hyperparameters as possible so that we could be more confident that our models were properly tuned. The selected hyperparameter values are listed in Table 2. There are 193K parameters in the TweetLID model and 427K in the Twitter70 model.
Table 2: Selected hyperparameter values.

                        TweetLID  Twitter70
  1st Conv. Layer (n1)     50        59
  2nd Conv. Layer (n2)     93       108
For all the studies below on language identification, we compare to two baselines: i) langid.py, a popular open-source language ID package, and ii) a classifier using character n-gram language models. For the TweetLID dataset, additional comparisons are included, as described next. In addition, we test our model's word-level performance on a code-switching dataset.
The first baseline, based on the langid.py
package, uses a naive Bayes classifier over byte n-gram features [Lui and Baldwin2012]. The pretrained model distributed with the package is designed to perform well on a wide range of domains, and achieved high performance on "microblog messages" (tweets) in the original paper. langid.py
uses feature selection for domain adaptation and to reduce the model size; thus, retraining it on in-domain data as we do in this paper does not provide an entirely fair comparison. However, we include it for its popularity and importance.
The second baseline is built from character n-gram language models, one for each language ℓ. It assigns each tweet the label ℓ̂ = argmax_ℓ P(tweet | ℓ), i.e., applying Bayes' rule with a uniform class prior [Dunning1994]. For TweetLID, the rare 'und' class was handled with a rejection model. Specifically, after ℓ̂ is chosen, a log-likelihood ratio test is applied to decide whether to reject the decision in favor of the 'und' class, using the language models for ℓ̂ and 'und', with a threshold chosen to optimize performance on the dev set. The models were trained using Witten-Bell smoothing [Bell et al.1989], but otherwise the default parameters of the SRILM toolkit [Stolcke2002] were used. (Witten-Bell smoothing works well with comparatively small vocabularies, such as character sets.)
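The Bayes-rule baseline can be sketched as follows. This is an illustration, not the actual system: add-one smoothing stands in for the Witten-Bell smoothing used via SRILM, the rejection model is omitted, and the tiny training texts are invented.

```python
import math
from collections import Counter

class CharNgramLM:
    """Add-one-smoothed character n-gram language model (simplified stand-in)."""
    def __init__(self, texts, n=3):
        self.n = n
        self.ngrams, self.contexts = Counter(), Counter()
        self.vocab = set()
        for t in texts:
            t = "^" * (n - 1) + t + "$"            # pad with start/end symbols
            self.vocab.update(t)
            for i in range(n - 1, len(t)):
                self.ngrams[t[i - n + 1:i + 1]] += 1
                self.contexts[t[i - n + 1:i]] += 1

    def logprob(self, text):
        t = "^" * (self.n - 1) + text + "$"
        lp = 0.0
        for i in range(self.n - 1, len(t)):
            num = self.ngrams[t[i - self.n + 1:i + 1]] + 1
            den = self.contexts[t[i - self.n + 1:i]] + len(self.vocab)
            lp += math.log(num / den)
        return lp

def classify(text, lms):
    # Bayes' rule with a uniform class prior: argmax over P(text | language)
    return max(lms, key=lambda lang: lms[lang].logprob(text))

lms = {
    "en": CharNgramLM(["the cat sat", "the dog ran", "a cat ran"]),
    "es": CharNgramLM(["el gato", "el perro", "un gato"]),
}
```

With a uniform prior, maximizing the class posterior is equivalent to maximizing the per-language likelihood, which is all `classify` does.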
TweetLID model training ignores tweets labeled as ambiguous or containing multiple languages, and the unconstrained TweetLID models use a simple interpolation of TweetLID and Wikipedia component models. The n-gram order was chosen to minimize perplexity using 5-fold cross-validation, yielding n = 5 for TweetLID and Twitter70; the order for the Wikipedia models was chosen the same way.
Note that both of these baselines are generative, learning separate models for each language. In contrast, the neural network models explored here are trained on all languages, so parameters may be shared across languages. In particular, for the hierarchical model, a character sequence corresponding to a word in more than one language (e.g., "no" in English and Portuguese) has a language-independent word embedding.
In the constrained track of the 2014 shared task, hurtado2014elirf attained the highest performance (75.2 macro-averaged F1). They used a set of one-vs-all SVM classifiers with character n-gram features, and returned all languages for which the classification confidence was above a fixed threshold. This provides our third, strongest baseline.
In the unconstrained track, the winning team was gamallo2014comparing, using a naive Bayes classifier on word unigrams. They incorporated Wikipedia text to train their model, and were the only team in the competition whose unconstrained model outperformed their constrained one. We compare to their constrained-track result here.
We also consider a version of our model, “C2L,” which uses only the char2vec component of C2V2L, treating the entire tweet as a single word. This tests the value of the intermediate word representations in C2V2L; C2L has no explicit word representations. Hyperparameter tuning was carried out separately for C2L.
The first column of Table 3 shows the aggregate results across all labels. Our model achieves the state of the art on this task, surpassing the shared task winner, hurtado2014elirf. As expected, C2L fails to match the performance of C2V2L, demonstrating that there is value in the hierarchical representations. The performance of the n-gram LM baseline is notably strong, beating eleven of the twelve submissions to the TweetLID shared task. We also report category-specific performance for our models and baselines in Table 3. Note that performance on underrepresented categories such as 'glg' and 'und' is much lower than on the other categories. The category breakdown is not available for previously published results.
One important advantage of our model is its ability to handle special categories of tokens that would otherwise require special treatment as out-of-vocabulary symbols, such as URLs, hashtags, emojis, usernames, etc. Anecdotally, we observe that the input gates of the word-level LSTM are less likely to open for these special classes of tokens. This is consistent with the hypothesis that the model has learned to ignore tokens that are non-informative with respect to language ID.
We augmented C2V2L’s training data with 25,000 fragments of Wikipedia text, weighting the TweetLID training examples ten times more strongly. After training on the combined data, we “fine-tune” the model on the TweetLID data for 2,000 minibatches. We found this stage necessary to correct for bias away from the undetermined language category, which does not occur in the Wikipedia data. The same hyperparameters were used as in the constrained experiment.
For the n-gram baseline, we interpolated the language models trained on TweetLID and Wikipedia for each language. Interpolation weights given to the Wikipedia language models, set by cross-validation, ranged from 16% for Spanish to 39% for Galician, the most and least common labels, respectively.
We also compare to the unconstrained-track results of hurtado2014elirf and gamallo2014comparing.
The results for these experiments are given in Table 4. Like gamallo2014comparing, we see a benefit from the use of out-of-domain data, giving a new state of the art on this task as well. Overall, the n-gram language model does not benefit from Wikipedia, but if the undetermined category, which is not observed in the Wikipedia training data, is ignored, there is a net gain in performance.
Table 5: Top seven most similar words from the training data and their cosine similarities for inputs "couldn't", "@maria_sanchez", and "noite".
In Table 5, we show the top seven neighbors of selected input words based on cosine similarity. In the left column, we see that words with similar features, such as the presence of the "n't" contraction, can be grouped together by char2vec. In the middle column, an out-of-vocabulary username is supplied and similar usernames are retrieved. When working with n-gram features, removing usernames is common, but some previous work demonstrates that they still carry useful information for predicting the language of a tweet [Jaech and Ostendorf2015]. The third example, "noite" (the Portuguese word for "night"), shows that the word embeddings are largely invariant to changes in punctuation and capitalization.
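The neighbor lists in Table 5 reduce to ranking the vocabulary by cosine similarity to a query word's char2vec embedding. A minimal sketch, with hypothetical 2-dimensional toy embeddings in place of real char2vec outputs:

```python
import numpy as np

def nearest_words(query, vocab_vecs, k=7):
    """Rank words by cosine similarity to a query embedding vector."""
    def cos(u, v):
        return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    ranked = sorted(vocab_vecs.items(), key=lambda kv: cos(query, kv[1]),
                    reverse=True)
    return ranked[:k]

# toy embeddings (hypothetical); real neighbors would use char2vec vectors
vocab = {
    "noite": np.array([1.0, 0.1]),
    "Noite": np.array([0.9, 0.2]),
    "manha": np.array([-0.2, 1.0]),
}
top = nearest_words(np.array([1.0, 0.0]), vocab, k=2)
```

Because char2vec maps any character sequence to a vector, this retrieval also works for out-of-vocabulary queries such as unseen usernames.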
We compare C2V2L to langid.py and the 5-gram language model on the Twitter70 dataset; see Table 6. Since no results have previously been published on this data, these are the only two comparisons available. Although the 5-gram model achieves the best performance, the results are virtually identical to those for C2V2L except for the closely related Bosnian-Croatian language pair.
The lowest performance for all the models is on closely related language pairs. For example, using the C2V2L model, the F1 score for Danish is only 62.7, due to confusion with the mutually intelligible Norwegian [Van Bezooijen et al.2008]. Distinguishing Bosnian from Croatian, two varieties of a single language, is also difficult. Languages that have unique orthographies, such as Greek and Korean, are identified with near-perfect accuracy by all of the models.
A potential advantage of the C2V2L model over the n-gram models is the ability to share information between related languages. In Figure 2, we show a t-SNE plot of the language embedding vectors taken from the softmax layer of our model, trained with a rank constraint of 10 on the softmax layer. Many of the languages in the plot appear close to related languages. (The rank constraint was added for visualization; without it, the model learns language embeddings that are all roughly orthogonal to each other, making t-SNE visualization difficult.) Another advantage of the C2V2L model is that the word-level predictions provide an indication of code-switching, explored next.
Because C2V2L produces language predictions for every word before making the tweet-level prediction, the same architecture can be used in word-level analysis of code-switched text, i.e., text that switches between multiple languages. Training a model that predicts code-switching at the token level requires a dataset that has language labels at this level. We used the Spanish-English dataset from the EMNLP 2014 shared task on Language Identification in Code-Switched Data [Solorio et al.2014]: a collection of monolingual and code-switched tweets in English and Spanish.
To train and predict at the word level, we simply remove the final average over the word predictions, and calculate the loss as the sum of the cross-entropy between each word's prediction and the corresponding gold label. Both the char2vec and word LSTM components of the model architecture are unaffected, other than retraining their parameters. (Potentially, both sentence-level and word-level supervision could be used to train the same model, but we leave that for future work.) To tune hyperparameters, we trained 10 models with random parameter settings on 80% of the data from the training set, and chose the settings from the model that performed best on the remaining 20%. We then retrained on the full training set with these settings.
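Dropping the final average turns the sentence-level loss into a per-word sum of cross-entropies, which can be sketched as (the probabilities and labels here are invented for illustration):

```python
import numpy as np

def word_level_loss(P, gold):
    """P: (T, L) per-word language distributions; gold: length-T label indices."""
    return -float(sum(np.log(P[t, g]) for t, g in enumerate(gold)))

# a two-word code-switched example: word 0 labeled language 0, word 1 language 1
P = np.array([[0.5, 0.5],
              [0.25, 0.75]])
loss = word_level_loss(P, [0, 1])
```

Each word contributes its own cross-entropy term, so the supervision signal reaches every token rather than only the tweet-level average.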
C2V2L performed well at this task, scoring 95.1 F1 for English (which would have achieved second place in the shared task, out of eight entries), 94.1 for Spanish (second place), 36.2 for named entities (fourth place), and 94.2 for Other (third place). (Full results for the 2014 shared task are omitted for space but can be found at http://emnlp2014.org/workshops/CodeSwitch/results.php.) While our code-switching results are not quite state-of-the-art, they show that our model learns to make accurate word-level predictions.
The task of language ID has a long history both in the speech domain [House and Neuburg1977] and for text [Cavnar and Trenkle1994]. Previous work on the text domain mostly uses word or character -gram features combined with linear classifiers [Hurtado et al.2014, Gamallo et al.2014].
Recently published work by xeroxTweetlid showed that combining an n-gram language model classifier (similar to our n-gram baseline) with information from the Twitter social graph improves language ID performance on TweetLID from 74.7 to 76.6, only slightly better than our model's performance of 76.2.
Bergsma2012Language created their own multilingual Twitter dataset and tested both a discriminative model based on n-grams plus hand-crafted features, and a compression-based classifier. Since the Twitter API requires researchers to re-download tweets based on their identifiers, published datasets quickly go out of date when the tweets in question are no longer available online, making it difficult to compare against prior work.
Several other studies have investigated the use of character sequence models in language processing. These techniques were first applied only to create word embeddings [dos Santos and Zadrozny2015, dos Santos and Guimaraes2015], and later extended to have the word embeddings feed directly into a word-level RNN. Applications include part-of-speech (POS) tagging [Ling et al.2015b], language modeling [Ling et al.2015a], dependency parsing [Ballesteros et al.2015], translation [Ling et al.2015b], and slot filling text analysis [Jaech et al.2016]. The work is divided in terms of whether the character sequence is modeled with an LSTM or a CNN, though virtually all now leverage the resulting word vectors in a word-level RNN. We are not aware of prior results comparing LSTMs and CNNs on a specific task, but the reduction in model size compared to word-only systems is reported to be much greater for LSTM architectures. All analyses report that the greatest performance improvements from character sequence models are for infrequent and previously unseen words, as expected.
In addition to the 2014 Workshop on Computational Approaches to Code Switching [Solorio et al.2014], word-level language identification in code-switched text has been studied by Mandal2015AdaptiveVI in the context of question answering and by Garrette2015UnsupervisedCF in the context of language modeling for document transcription. Both used primarily character n-gram features. The use of character representations is well motivated for code-switching language ID, since the presence of multiple languages means that one is more likely to encounter a previously unseen word.
We present C2V2L, a hierarchical neural model for language ID that outperforms previous work on the challenging TweetLID task. We also find that smoothed character n-gram language models can work well as classifiers for language ID on short texts. Without feature engineering, our n-gram model beat eleven of the twelve submissions in the TweetLID shared task, and gives the best reported performance on the Twitter70 dataset, where training data for some languages is quite small. In future work, we plan to further adapt C2V2L for analyzing code-switching, having found that it already offers good performance without any change to the architecture.
dos Santos, C. and Guimaraes, V. 2015. Boosting named entity recognition with neural character embeddings. In Proc. ACL Named Entities Workshop, pages 25–33.