Language Identification (LID) is the Natural Language Processing (NLP) task of automatically recognizing the language that a document is written in. While this task was called “solved” by some authors over a decade ago, it has seen a resurgence in recent years thanks to the rise in popularity of social media (jauhiainen2018automatic; jaech2016hierarchical), and the corresponding daily creation of millions of new messages in dozens of different languages, including rare ones that are not often included in language identification systems. Moreover, these messages are typically very short (Twitter messages were until recently limited to 140 characters) and very noisy (including an abundance of spelling mistakes, non-word tokens like URLs, emoticons, or hashtags, as well as foreign-language words in messages of another language), whereas LID was declared solved on long and clean documents. Indeed, several studies have shown that LID systems trained to a high accuracy on traditional documents suffer significant drops in accuracy when applied to short social-media texts (lui2012langid; carter2013microblog).
Given its massive scale, multilingual nature, and popularity, Twitter has naturally attracted the attention of the LID research community. Several attempts have been made to construct LID datasets from that resource. However, a major challenge is to assign each tweet in the dataset to the correct language among the more than 70 languages used on the platform. The three commonly-used approaches are to rely on human labeling (lui2014accurate; tromp2011graph), machine detection (tromp2011graph; jurgens2017incorporating), or user geolocation (carter2013microblog; blodgett2017dataset; bergsma2012language). Human labeling is an expensive process in terms of workload, and it is thus infeasible to apply it to create a massive dataset and get the full benefit of Twitter’s scale. Automated LID labeling of this data creates a noisy and imperfect dataset, which is to be expected since the purpose of these datasets is to create new and better LID algorithms. And user geolocation is based on the assumption that users in a geographic region use the language of that region; an assumption that is not always correct, which is why this technique is usually paired with one of the other two. Our first contribution in this paper is to propose a new approach to build and automatically label a Twitter LID dataset, and to show that it scales up well by building a dataset of over 18 million labeled tweets. Our hope is that our new Twitter dataset will become a benchmarking standard in the LID literature.
Table 1: Key results of the papers reviewed in Section 2, with the results langid.py obtained on the same datasets as benchmark.
|Paper||Input||Algorithm||Metric||Result||langid.py|
|Tromp & Pechenizkiy (tromp2011graph)||Character n-grams||Graph||Accuracy||0.975||0.941|
|Carter et al. (carter2013microblog)||Social network information||Prior probabilities||Accuracy||0.972||0.886|
|Gamallo et al. (gamallo2014comparing)||Words||Dictionary||F1-score||0.733||N/A|
|Jaech et al. (jaech2016hierarchical)||Words||LSTM||F1-score||0.912||0.879|
|Kocmi & Bojar (kocmi2017lanidenn)||Character n-grams||GRU||Accuracy||0.955||0.912|
|Jurgens et al. (jurgens2017incorporating)||Character n-grams||Encoder-decoder||Accuracy||0.982||0.960|
Traditional LID models (lui2012langid; carter2013microblog; gamallo2014comparing) proposed different ideas to design a set of useful features. This set of features is then passed to traditional machine learning algorithms such as Naïve Bayes (NB). The resulting systems are capable of labeling thousands of inputs per second with moderate accuracy. Meanwhile, neural network models (kocmi2017lanidenn; jurgens2017incorporating) approach the problem by designing a deep and complex architecture such as a gated recurrent unit (GRU) or an encoder-decoder network. These models take the message text itself as input, as a sequence of character embeddings, and automatically learn its hidden structure via a deep neural network. Consequently, they obtain better results in the task, but at the cost of efficiency. To alleviate this drawback, our second contribution in this paper is to propose a shallow but efficient neural LID algorithm. We followed previous neural LID work (kocmi2017lanidenn; jurgens2017incorporating) in using character embeddings as inputs. However, instead of using a deep neural net, we propose to use a shallow ngram-regional convolutional neural network (CNN) with an attention mechanism to learn the input representation. We show experimentally that the ngram-regional CNN is the best choice to tackle the efficiency bottleneck in neural LID. We also illustrate the behaviour of the attention structure in focusing on the most important features in the text for the task. Compared with other benchmarks on our Twitter datasets, our proposed model consistently achieves new state-of-the-art results, with an improvement of 5% in accuracy and F1 score and a competitive inference time.
The rest of this paper is structured as follows. After a background review in the next section, we will present our Twitter dataset in Section 3. Our novel LID algorithm will be the topic of Section 4. We will then present and analyze some experiments we conducted with our algorithm in Section 5, along with benchmarking tests of popular LID systems and of systems from the literature, before drawing some concluding remarks in Section 6. Our Twitter dataset and our LID algorithm’s source code are publicly available (https://github.com/duytinvo/LID_NN).
2 Related Work
In this section, we will consider recent advances on the specific challenge of language identification in short text messages. Readers interested in a general overview of the area of LID, including older work and other challenges in the area, are encouraged to read the thorough survey of (jauhiainen2018automatic).
2.1 Probabilistic LID
One of the first, if not the first, systems for LID specialized for short text messages is the graph-based method of (tromp2011graph). Their graph is composed of vertices, or character n-grams (n = 3) observed in messages in all languages, and of edges, or connections between successive n-grams weighted by the observed frequency of that connection in each language. Identifying the language of a new message is then done by identifying the most probable path in the graph that generates that message. Their method achieves an accuracy of 0.975 on their own Twitter corpus.
Carter, Weerkamp, and Tsagkias proposed an approach for LID that exploits the very nature of social media text (carter2013microblog). Their approach computes the prior probability of the message being in a given language independently of the content of the message itself, in five different ways: by identifying the language of external content linked to by the message, the language of previous messages by the same user, the language used by other users explicitly mentioned in the message, the language of previous messages in the on-going conversation, and the language of other messages that share the same hashtags. They achieve a top accuracy of 0.972 when combining these five priors with a linear interpolation.
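To make the idea of combining priors concrete, the following is a minimal sketch of a linear interpolation over per-language probability distributions. The function name, the two example priors, and the weights are hypothetical and chosen only for illustration; they are not taken from Carter et al.

```python
def interpolate_priors(priors, weights):
    """Linearly interpolate several per-language probability distributions.

    priors:  list of dicts mapping language code -> probability
    weights: one non-negative weight per prior (normalized internally)
    """
    if len(priors) != len(weights):
        raise ValueError("one weight per prior is required")
    total = sum(weights)
    combined = {}
    for prior, weight in zip(priors, weights):
        for lang, prob in prior.items():
            combined[lang] = combined.get(lang, 0.0) + (weight / total) * prob
    return combined


# Hypothetical priors from two of the five sources (linked content, same author).
link_prior = {"en": 0.7, "nl": 0.3}
author_prior = {"en": 0.2, "nl": 0.8}

scores = interpolate_priors([link_prior, author_prior], weights=[1.0, 3.0])
best = max(scores, key=scores.get)
```

With these made-up weights the author prior dominates, so the interpolated distribution favours the language the user wrote in previously.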
One of the most popular language identification packages is the langid.py library proposed in (lui2012langid), thanks to the fact that it is an open-source, ready-to-use library written in the Python programming language. It is a multinomial Naïve Bayes classifier trained on character n-grams (1 ≤ n ≤ 4) from 97 different languages. The training data comes from longer document sources, both formal ones (government publications, software documentation, and newswire) and informal ones (online encyclopedia articles and websites). While their system is not specialized for short messages, the authors claim their algorithm can generalize across domains off-the-shelf, and they conducted experiments using the Twitter datasets of (tromp2011graph) and (carter2013microblog) that achieved accuracies of 0.941 and 0.886 respectively, which is weaker than the specialized short-message LID systems of (tromp2011graph) and (carter2013microblog).
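As an illustration of the kind of features described above, the sketch below counts all character n-grams with 1 ≤ n ≤ 4 in a string. It is a simplified stand-in written for this discussion, not the feature extraction code of langid.py itself; the function name and defaults are assumptions.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=4):
    """Count every character n-gram of length n_min..n_max in text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


features = char_ngrams("hello")
```

Counts of this kind can then serve as the event counts of a multinomial Naïve Bayes model.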
Starting from the basic observation of Zipf’s Law, that each language has a small number of words that occur very frequently in most documents, the authors of (gamallo2014comparing) created a dictionary-based algorithm they called Quelingua. This algorithm includes ranked dictionaries of the 1,000 most popular words of each language it is trained to recognize. Given a new message, recognized words are given a weight based on their rank in each language, and the identified language is the one with the highest sum of word weights. Quelingua achieves an F1-score of 0.733 on the TweetLID competition corpus (zubiaga2014overview), a narrow improvement over a trigram Naïve Bayes classifier which achieves an F1-Score of 0.727 on the same corpus, but below the best results achieved in the competition.
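The ranked-dictionary idea can be sketched as follows. The source only states that recognized words receive a weight based on their rank; the specific weighting used here, 1 / (rank + 1), the tiny word lists, and the function names are assumptions made for illustration and are not Quelingua's actual choices.

```python
def build_ranks(words_by_frequency):
    """Map each word to its rank (0 = most frequent)."""
    return {word: rank for rank, word in enumerate(words_by_frequency)}


def identify(message, ranked_dicts):
    """Return the language whose dictionary gives the highest summed weight."""
    tokens = message.lower().split()
    scores = {}
    for lang, ranks in ranked_dicts.items():
        # Assumed weighting: higher-ranked (more frequent) words count more.
        scores[lang] = sum(1.0 / (ranks[t] + 1) for t in tokens if t in ranks)
    return max(scores, key=scores.get)


# Toy dictionaries; a real system would hold the 1,000 most frequent words.
ranked_dicts = {
    "en": build_ranks(["the", "of", "and", "is"]),
    "fr": build_ranks(["de", "la", "le", "est"]),
}

guess = identify("le chat est sur la table", ranked_dicts)
```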
2.2 Neural Network LID
Neural network models have been applied to many NLP problems in recent years with great success, achieving excellent performance on challenges ranging from text classification (Vo18) to sequence labeling (yang2018ncrf). In LID, the authors of (jaech2016hierarchical) built a hierarchical system of two neural networks. The first level is a Convolutional Neural Network (CNN) that converts white-space-delimited words into a word vector. The second level is a Long Short-Term Memory (LSTM) network (a type of recurrent neural network (RNN)) that takes in sequences of word vectors outputted by the first level and maps them to language labels. They trained and tested their network on Twitter’s official Twitter70 dataset (https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html), and achieved an F-score of 0.912, compared to langid.py’s performance of 0.879 on the same dataset. They also trained and tested their system using the TweetLID corpus and achieved an F1-score of 0.762, above the system of (gamallo2014comparing) presented earlier, and above the top system of the TweetLID competition, the SVM LID system of (hurtado2014elirf), which achieved an F1-score of 0.752.
The authors of (kocmi2017lanidenn) also used an RNN system, but preferred the Gated Recurrent Unit (GRU) architecture to the LSTM, indicating it performed slightly better in their experiments. Their system breaks the text into non-overlapping 200-character segments, and feeds character n-grams (n = 8) into the GRU network to classify each letter into a probable language. The segment’s language is simply the most probable language over all letters, and the text’s language is the most probable language over all segments. The authors tested their system on short messages, but not on tweets; they built their own corpus of short messages by dividing their data into 200-character segments. On that corpus, they achieve an accuracy of 0.955, while langid.py achieves 0.912.
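The segment-then-aggregate procedure can be sketched as below. This is a simplification under stated assumptions: aggregation is done by majority vote over hard labels rather than over probabilities, and `toy_char_classifier` is a hypothetical stand-in for the GRU network that labels each character.

```python
from collections import Counter


def split_segments(text, segment_len=200):
    """Split text into non-overlapping segments of at most segment_len characters."""
    return [text[i:i + segment_len] for i in range(0, len(text), segment_len)]


def majority(labels):
    """Return the most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def classify_text(text, char_classifier, segment_len=200):
    """Label each character, vote per segment, then vote over segments."""
    if not text:
        raise ValueError("cannot classify an empty text")
    segment_langs = [
        majority(char_classifier(segment))
        for segment in split_segments(text, segment_len)
    ]
    return majority(segment_langs)


def toy_char_classifier(segment):
    """Hypothetical per-character labeller: ASCII -> 'en', anything else -> 'other'."""
    return ["en" if ord(ch) < 128 else "other" for ch in segment]


label = classify_text("a" * 450, toy_char_classifier)
```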
The authors of (jurgens2017incorporating) also created a character-level LID network using a GRU architecture, in the form of a three-layer encoder-decoder RNN. They trained and tested their system using their own Twitter dataset, and achieved an F1-score of 0.982, while langid.py achieved 0.960 on the same dataset.
To summarize, we present the key results of the papers reviewed in this section in Table 1, along with the results langid.py obtained on the same datasets as benchmark.
3 Our Twitter LID Datasets
3.1 Source Data and Language Labeling
Unlike other authors who built Twitter datasets, we chose not to mine tweets from Twitter directly through their API, but instead to use tweets that have already been downloaded and archived on the Internet Archive (https://archive.org/details/twitterstream). This has two important benefits: this site makes its content freely available for research purposes, unlike Twitter which comes with restrictions (especially on distribution), and the tweets are backed up permanently, as opposed to Twitter where tweets may be deleted at any time and become unavailable for future research or replication of past studies. The Internet Archive has made available a set of 1.7 billion tweets collected over the year of 2017 in a 600GB JSON file which includes all tweet metadata attributes (https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json.html). Five of these attributes are of particular importance to us: the unique tweet ID number, the unique user ID number, the text content of the tweet in UTF-8 characters, the tweet’s language as determined by Twitter’s automated LID software, and the user’s self-declared language.
We begin by filtering the corpus to keep only those tweets where the user’s self-declared language and the tweet’s detected language correspond; that language becomes the tweet’s correct language label. This operation cuts out roughly half the tweets, and leaves us with a corpus of about 900 million tweets in 54 different languages. Table 2 shows the distribution of languages in that corpus. Unsurprisingly, it is a very imbalanced distribution of languages, with English and Japanese together accounting for 60% of all tweets. This is consistent with other studies and statistics of language use on Twitter, going as far back as 2013 (https://www.statista.com/statistics/267129/most-used-languages-on-twitter/). It does however make it very difficult to use this corpus to train an LID system for other languages, especially for one of the dozens of seldom-used languages. This was our motivation for creating a balanced Twitter dataset.
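The labeling rule described above can be sketched in a few lines. The JSON field names used here (`id`, `text`, `lang`, and `user.lang`) are assumptions based on the public tweet JSON layout, and the function name and sample records are hypothetical; this is not the authors' released code.

```python
import json


def label_tweets(lines):
    """Yield (tweet_id, text, language) for tweets whose two language fields agree."""
    for line in lines:
        tweet = json.loads(line)
        tweet_lang = tweet.get("lang")                  # assumed: Twitter's detected language
        user_lang = (tweet.get("user") or {}).get("lang")  # assumed: self-declared language
        if tweet_lang and tweet_lang == user_lang:
            yield tweet["id"], tweet["text"], tweet_lang


# Two made-up records, one per line as in a JSON-lines dump.
sample = [
    json.dumps({"id": 1, "text": "hello", "lang": "en", "user": {"id": 10, "lang": "en"}}),
    json.dumps({"id": 2, "text": "bonjour", "lang": "fr", "user": {"id": 11, "lang": "en"}}),
]

kept = list(label_tweets(sample))
```

Only the first record survives, since the second tweet's detected language disagrees with its author's declared language.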
3.2 Our Balanced Datasets
When creating a balanced Twitter LID dataset, we face a design question: should our dataset seek to maximize the number of languages present, to make it more interesting and challenging for the task of LID, at the cost of having fewer tweets per language in order to include seldom-used languages? Or should we maximize the number of tweets per language, to make the dataset more useful for training deep neural networks, at the cost of having fewer languages present and eliminating the seldom-used languages? To circumvent this issue, we propose to build three datasets: a small-scale one with more languages but fewer tweets, a large-scale one with more tweets but fewer languages, and a medium-scale one that is a compromise between the two extremes. Moreover, since we plan for our datasets to become standard benchmarking tools, we have subdivided the tweets of each language in each dataset into training, validation, and testing sets.
Small-scale dataset: This dataset is composed of 28 languages with 13,000 tweets per language, subdivided into 7,000 training set tweets, 3,000 validation set tweets, and 3,000 testing set tweets. There is thus a total of 364,000 tweets in this dataset. Referring to Table 2, this dataset includes every language that represents 0.002% or more of the Twitter corpus. To be sure, it is possible to create a smaller dataset with all 54 languages but far fewer tweets per language, but we feel that this is the lower limit to be useful for training LID deep neural systems.
Medium-scale dataset: This dataset keeps 22 of the 28 languages of the small-scale dataset, but has 10 times as many tweets per language. In other words, each language has a 70,000-tweet training set, a 30,000-tweet validation set, and a 30,000-tweet testing set, for a total of 2,860,000 tweets.
Large-scale dataset: Once again, we increased tenfold the number of tweets per language, and kept the 14 languages that had sufficient tweets in our initial 900 million tweet corpus. This gives us a dataset where each language has 700,000 tweets in its training set, 300,000 tweets in its validation set, and 300,000 tweets in its testing set, for a total of 18,200,000 tweets. Referring to Table 2, this dataset includes every language that represents 0.1% or more of the Twitter corpus.
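The dataset totals quoted above follow directly from the per-language split and the number of languages. The short check below reproduces the arithmetic; the variable and function names are illustrative only.

```python
# Per-language split of the small-scale dataset; the other two scale it by 10 and 100.
SPLIT = {"train": 7_000, "validation": 3_000, "test": 3_000}

# (number of languages, scale factor relative to the small-scale split)
DATASETS = {"small": (28, 1), "medium": (22, 10), "large": (14, 100)}


def dataset_total(num_languages, scale):
    """Total tweets = languages x (train + validation + test) x scale."""
    per_language = sum(SPLIT.values()) * scale
    return num_languages * per_language


totals = {name: dataset_total(*spec) for name, spec in DATASETS.items()}
```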
4 Proposed Model
Since many languages have unclear word boundaries, character n-grams, rather than words, have become widely used as input in LID systems (lui2012langid; kocmi2017lanidenn; tromp2011graph; jurgens2017incorporating). With this in mind, the LID problem can be defined as such: given a tweet $tw$ consisting of $n$ ordered characters ($tw = [ch_1, ch_2, \dots, ch_n]$) selected within the vocabulary set $char$ of $V$ unique characters ($|char| = V$) and a set $l$ of $L$ languages ($l = \{l_1, l_2, \dots, l_L\}$), the aim is to predict the language $\hat{l}$ present in $tw$ using a classifier:

$$\hat{l} = \underset{l_i \in l}{\arg\max}\; S(l_i \mid tw) \tag{1}$$

where $S(l_i \mid tw)$ is a scoring function quantifying how likely language $l_i$ was used given the observed message $tw$.
Most statistical LID systems follow the model of (lui2012langid). They start off by using what is called a one-hot encoding technique, which represents each character $ch_i$ of the tweet $tw$ as a one-hot vector $x_i^{oh} \in \{0,1\}^{V}$ according to the index of this character in the vocabulary $char$ of $V$ unique characters. This transforms $tw$ into a matrix $X^{oh}$:

$$X^{oh} = [x_1^{oh}, x_2^{oh}, \dots, x_n^{oh}] \in \{0,1\}^{V \times n} \tag{2}$$

$X^{oh}$ is passed to a feature extraction function, for example row-wise sum or tf-idf weighting, to obtain a feature vector $h$.
$h$ is finally fed to a classifier model for either discriminative scoring (e.g. Support Vector Machine) or generative scoring (e.g. Naïve Bayes).
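The one-hot encoding and row-wise sum steps can be sketched as follows. This is an illustrative toy, not code from any of the cited systems; mapping out-of-vocabulary characters to an all-zero column is an assumption made here for simplicity.

```python
def one_hot_matrix(tweet, vocab):
    """Return one one-hot column (list of length V) per character of the tweet."""
    index = {ch: i for i, ch in enumerate(vocab)}
    columns = []
    for ch in tweet:
        column = [0] * len(vocab)
        if ch in index:  # assumption: unknown characters give an all-zero column
            column[index[ch]] = 1
        columns.append(column)
    return columns


def row_wise_sum(columns, vocab_size):
    """Sum the one-hot columns, giving one count per vocabulary character."""
    return [sum(column[i] for column in columns) for i in range(vocab_size)]


vocab = ["a", "b", "c"]
X_oh = one_hot_matrix("abca", vocab)
h = row_wise_sum(X_oh, len(vocab))
```

The resulting vector is simply a character-count feature vector, which a classifier such as Naïve Bayes or an SVM could then consume.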
Unlike statistical methods, a typical neural network LID system, as illustrated in Figure 1(a), first passes this input through an embedding layer to map each character $ch_i \in tw$ to a dense low-dimensional vector $x_i \in \mathbb{R}^{d}$, where $d$ denotes the dimension of the character embedding. Given an input tweet $tw$ of $n$ characters, after passing through the embedding layer, we obtain an embedded matrix:

$$X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{d \times n} \tag{3}$$

The embedded matrix $X$ is then fed through a neural network architecture $f$, which transforms it into an output vector $f(X)$ of length $L$ that represents the likelihood of each language, and which is passed through a $\mathrm{Softmax}$ function. This updates Equation 1 as:

$$\hat{l} = \underset{l_i \in l}{\arg\max}\; \mathrm{Softmax}\big(f(X)\big)_i \tag{4}$$
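As an illustration of the embedding, scoring, and softmax pipeline described above, the sketch below uses randomly initialized (untrained) parameters and a placeholder network, a mean over character embeddings followed by a linear layer. That placeholder, the dimensions, and all names are assumptions for illustration; they stand in for whatever architecture a real system would use.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(tweet, table, weights, languages):
    """Embed characters, apply a placeholder network, and take the arg max."""
    X = [table[ch] for ch in tweet if ch in table]  # one embedding per known character
    if not X:
        raise ValueError("no known characters in the tweet")
    d = len(X[0])
    pooled = [sum(x[k] for x in X) / len(X) for k in range(d)]  # placeholder: mean pooling
    logits = [sum(w * v for w, v in zip(row, pooled)) for row in weights]
    probs = softmax(logits)
    best = max(range(len(languages)), key=lambda i: probs[i])
    return languages[best], probs


rng = random.Random(0)
d = 4
languages = ["en", "fr", "vi"]
table = {ch: [rng.uniform(-1, 1) for _ in range(d)] for ch in "abcdefghijklmnopqrstuvwxyz "}
weights = [[rng.uniform(-1, 1) for _ in range(d)] for _ in languages]

label, probs = predict("hello world", table, weights, languages)
```

Because the parameters are random, the predicted label is meaningless; the point is only the shape of the computation, with one probability per language.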
Tweets in particular are noisy messages which can contain a mix of multiple languages. To deal with this challenge, most previous neural network LID systems used deep sequence neural layers, such as an encoder-decoder (jurgens2017incorporating) or a GRU (kocmi2017lanidenn), to extract global representations at a high computational cost. By contrast, we propose to employ a shallow (single-layer) convolutional neural network (CNN) to locally learn region-based features. In addition, we propose to use an attention mechanism to proportionally merge together these local features for an entire tweet. We hypothesize that the attention mechanism will effectively capture which local features of a particular language are the dominant features of the tweet. There are two major advantages of our proposed architecture: first, the use of the CNN, which has fewer parameters than the other neural networks, simplifies the neural network model and decreases the inference latency; and second, the use of the attention mechanism makes it possible to model the mix of languages while maintaining a competitive performance.
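4.1 ngram-regional CNN Model
To begin, we present a traditional CNN with an ngram-regional constraint as our baseline. CNNs have been widely used in both image processing (Lecun98) and NLP (Collobert11). The convolution operation of a filter with a region size $m$ is parameterized by a weight matrix $W_{cnn} \in \mathbb{R}^{d_{cnn} \times md}$ and a bias vector $b_{cnn} \in \mathbb{R}^{d_{cnn}}$, where $d_{cnn}$ is the dimension of the CNN and $d$ is the dimension of the character embedding. The inputs are a sequence of $m$ consecutive input columns in the embedded matrix $X = [x_1, \dots, x_n]$, represented by a concatenated vector $x_{i:i+m-1} \in \mathbb{R}^{md}$. The region-based feature vector $c_i$ is computed as follows:

$$c_i = g\big(W_{cnn}\,(x_i \oplus x_{i+1} \oplus \dots \oplus x_{i+m-1}) + b_{cnn}\big) \tag{5}$$

where $\oplus$ denotes a concatenation operation and $g$ is a non-linear function. The region filter is slid from the beginning to the end of $X$ to obtain a convolution matrix $C$:

$$C = [c_1, c_2, \dots, c_{n-m+1}] \in \mathbb{R}^{d_{cnn} \times (n-m+1)} \tag{6}$$

The first novelty of our CNN is that we add a zero-padding constraint at both sides of $X$ to ensure that the number of columns in $C$ is equal to the number of columns in $X$. Consequently, each feature vector $c_i$ corresponds to an input vector $x_i$ at the same index position $i$, and is learned from concatenating the surrounding $m$-gram embeddings. Particularly:

$$p = \frac{m-1}{2} \tag{7}$$

where $p$ is the number of zero-padding columns added on each side. Finally, in a normal CNN, a row-wise max-pooling function is applied on $C$ to extract the $d_{cnn}$ most salient features, as shown in Equation 8:

$$h = \Big[\max_{i} C_{1,i},\; \max_{i} C_{2,i},\; \dots,\; \max_{i} C_{d_{cnn},i}\Big] \tag{8}$$

However, one weakness of this approach is that it extracts the most salient features out of sequence.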
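The effect of the zero-padding constraint can be checked with a small pure-Python sketch of a single-layer regional convolution followed by row-wise max-pooling. The use of tanh as the non-linearity, the restriction to odd region sizes, and the toy weights are assumptions made for illustration; this is not the released implementation.

```python
import math


def regional_conv(X, W, b, m):
    """Apply a region filter of size m to X with zero-padding on both sides.

    X: list of n embedding vectors, each of length d
    W: d_cnn rows, each of length m * d;  b: list of d_cnn biases
    Returns one feature vector per input position (n vectors of length d_cnn).
    """
    if m % 2 == 0:
        raise ValueError("this sketch assumes an odd region size m")
    d = len(X[0])
    p = (m - 1) // 2                      # zero-padding columns on each side
    zero = [0.0] * d
    padded = [zero] * p + X + [zero] * p
    C = []
    for i in range(len(X)):
        # Concatenate the m embeddings centred on position i.
        region = [value for x in padded[i:i + m] for value in x]
        C.append([
            math.tanh(sum(w * v for w, v in zip(row, region)) + bias)
            for row, bias in zip(W, b)
        ])
    return C


def max_pool(C):
    """Row-wise max-pooling: the largest value of each feature over all positions."""
    return [max(c[j] for c in C) for j in range(len(C[0]))]


X = [[1.0], [2.0], [3.0]]                 # n = 3 characters, d = 1
W = [[1.0, 1.0, 1.0]]                     # d_cnn = 1, region size m = 3
b = [0.0]
C = regional_conv(X, W, b, m=3)
h = max_pool(C)
```

With padding, the output has exactly one feature vector per input character, which is what later allows each attention weight to be tied to a specific character position.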
4.2 Attention Mechanism
Instead of the traditional pooling function of Equation 8, a second important innovation of our CNN model is to use an attention mechanism to model the interaction between region-based features from the beginning to the end of an input. Figure 1(b) illustrates our proposed model. Given a sequence of regional feature vectors $C = [c_1, c_2, \dots, c_n]$ as computed in Equation 6, we pass it through a fully-connected hidden layer to learn a sequence of regional hidden vectors $Z = [z_1, z_2, \dots, z_n] \in \mathbb{R}^{d_{hd} \times n}$ using Equation 9:

$$z_i = g_2\big(W_{hd}\, c_i + b_{hd}\big) \tag{9}$$

$g_2$ is a non-linear activation function, $W_{hd}$ and $b_{hd}$ denote model parameters, and $d_{hd}$ is the dimension of the hidden layer. We followed Yang et al. (Yang16) in employing a regional context vector $u \in \mathbb{R}^{d_{hd}}$ to measure the importance of each window-based hidden vector. The regional importance factors are computed by:

$$e_i = u^{\top} z_i \tag{10}$$

The importance factors are then fed to a $\mathrm{Softmax}$ layer to obtain the normalized weight:

$$\alpha_i = \frac{\exp(e_i)}{\sum_{j=1}^{n} \exp(e_j)} \tag{11}$$

The final representation of a given input is computed by a weighted sum of its regional feature vectors:

$$h = \sum_{i=1}^{n} \alpha_i\, c_i \tag{12}$$
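As an illustration of the attention pooling described in this subsection, the sketch below computes hidden vectors, importance factors, normalized weights, and the weighted sum for a toy input. Using tanh as the activation and the particular toy parameters are assumptions; the names are illustrative and do not come from the released code.

```python
import math


def attention_pool(C, W_hd, b_hd, u):
    """Attention pooling over regional feature vectors.

    C:    list of n feature vectors (length d_cnn)
    W_hd: d_hd rows of length d_cnn;  b_hd: list of d_hd biases
    u:    context vector of length d_hd
    Returns (weighted sum of the feature vectors, normalized weights).
    """
    # Hidden vector for each region (tanh is an assumed choice of activation).
    Z = [
        [math.tanh(sum(w * c for w, c in zip(row, c_i)) + bias)
         for row, bias in zip(W_hd, b_hd)]
        for c_i in C
    ]
    # Importance factor: dot product of each hidden vector with the context vector.
    e = [sum(u_k * z_k for u_k, z_k in zip(u, z_i)) for z_i in Z]
    # Softmax normalization (shifted for numerical stability).
    shift = max(e)
    exps = [math.exp(value - shift) for value in e]
    total = sum(exps)
    alpha = [value / total for value in exps]
    # Weighted sum of the regional feature vectors.
    dim = len(C[0])
    representation = [sum(a * c_i[j] for a, c_i in zip(alpha, C)) for j in range(dim)]
    return representation, alpha


C = [[1.0, 0.0], [0.0, 1.0]]              # two regions, d_cnn = 2
W_hd = [[1.0, -1.0]]                       # d_hd = 1
b_hd = [0.0]
u = [1.0]
representation, alpha = attention_pool(C, W_hd, b_hd, u)
```

In this toy case the first region receives the larger weight, so the final representation leans towards the first feature vector. Because each weight is tied to one position, the weights can also be displayed per character.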
5 Experimental Results
Table 3: Parameter settings of our two models.
|Parameter||our CNN||our att CNN|
Table 4: Benchmark results on our three datasets.
|Model||Small-scale dataset||Medium-scale dataset||Large-scale dataset|
For the benchmarks, we selected five systems. We picked first the langid.py library (https://github.com/saffsd/langid.py), which is frequently used to compare systems in the literature. Since our work is in neural-network LID, we selected two neural network systems from the literature, specifically the encoder-decoder EquiLID system (https://github.com/davidjurgens/equilid) of (jurgens2017incorporating) and the GRU neural network LanideNN system (https://github.com/tomkocmi/LanideNN) of (kocmi2017lanidenn). Finally, we included CLD2 (https://github.com/CLD2Owners/cld2) and CLD3 (https://github.com/google/cld3), two implementations of the Naïve Bayes LID software used by Google in their Chrome web browser (lui2014accurate; jauhiainen2018automatic; bergsma2012language) and sometimes used as a comparison system in the LID literature (blodgett2017dataset; jurgens2017incorporating; bergsma2012language; lui2012langid; kocmi2017lanidenn). We obtained publicly-available implementations of each of these algorithms, and tested them all against our three datasets. In Table 4, we report each algorithm’s accuracy and F1 score, the two metrics usually reported in the LID literature. We also included precision and recall values, which are necessary for computing F1 score. And finally we included the speed in number of messages handled per second. This metric is not often discussed in the LID literature, but is of particular importance when dealing with a massive dataset such as ours or a massive streaming source such as Twitter.
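For reference, the relationship between these metrics can be sketched as below. The sketch macro-averages precision, recall, and F1 over languages, which is an assumption: the averaging scheme is not stated in this section. The function name and the toy labels are illustrative.

```python
def evaluate(gold, predicted):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    pairs = list(zip(gold, predicted))
    labels = sorted(set(gold) | set(predicted))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    count = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / count,
        "recall": sum(recalls) / count,
        "f1": sum(f1s) / count,
    }


gold = ["en", "en", "fr", "fr"]
predicted = ["en", "fr", "fr", "fr"]
metrics = evaluate(gold, predicted)
```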
We compare these benchmarks to our two models: the improved CNN as described in Section 4.1 and our proposed CNN model with an attention mechanism of Section 4.2. These are labelled CNN and Attention CNN in Table 4. In both models, we filter out characters that appear fewer than a minimum number of times and apply dropout. The ADAM optimization algorithm and early stopping are employed during training. The full list of parameters and settings is given in Table 3. It is worth noting that we selected this configuration randomly, without any tuning process.
The first thing that stands out from these results is the speed difference between algorithms. CLD3 and langid.py can both process several thousand messages per second, and CLD2 is an order of magnitude faster still, but the two neural network systems are considerably slower, at less than a dozen messages per second. This is the efficiency trade-off of neural-network LID systems we mentioned in Section 1; although to be fair, we should also point out that those two systems are research prototypes and thus may not have been fully optimized.
In terms of accuracy and F1 score, langid.py, LanideNN, and EquiLID have very similar performances. All three consistently score above 0.90, and each achieves the best accuracy or the best F1 score at some point, if only by 0.002. By contrast, CLD2 and CLD3 have weaker performances; significantly so in the case of CLD3. In all cases, using our small-, medium-, or large-scale test set does not significantly affect the results.
All the benchmark systems were tested using the pre-trained models they come with. For comparison purposes, we retrained langid.py from scratch using the training and validation portions of our datasets, and ran the tests again. Surprisingly, we find that the results are worse for all metrics compared to using their pre-trained model, and moreover that using the medium- and large-scale datasets gives significantly worse results than using the small-scale dataset. This may be a result of the fact that the corpus the langid.py software was originally trained with and optimized for is drastically different from ours: it is an imbalanced dataset of 18,269 tweets in 9 languages. Our larger corpora, being more drastically different from the original, give increasingly worse performances. This observation may also explain the almost 10% variation in performance of langid.py reported in the literature and reproduced in Table 1. The fact that the message handling speed of the library drops massively compared to its pre-trained results further indicates how the software was optimized to use its corpus. Based on this initial result, we decided not to retrain the other benchmark systems.
The last two lines of Table 4 report the results of our basic CNN and our attention CNN LID systems. It can be seen that both of them outperform the benchmark systems in accuracy, precision, recall, and F1 score in all experiments. Moreover, the attention CNN outperforms the basic CNN in every metric (we will explore the benefit of the attention mechanism in the next subsection). In terms of processing speed, only the CLD2 system surpasses ours, but it does so at the cost of a 10% drop in accuracy and F1 score. Looking at the choice of datasets, it can be seen that training with the large-scale dataset leads to a nearly 1% improvement compared to the medium-sized dataset, which also gives a 1% improvement compared to the small-scale dataset. While it is expected that using more training data will lead to a better system and better results, the small improvement indicates that even our small-scale dataset has sufficient messages to allow the network training to converge.
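Table 5: Tweets correctly identified by the attention CNN but misclassified by the regular CNN, with their attention values.
|IDs||Tweets with attention values||att. CNN||CNN|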
5.3 Impact of Attention Mechanism
We can further illustrate the impact of our attention mechanism by displaying the importance factor corresponding to each character in selected tweets. Table 5 shows a set of tweets that were correctly identified by the attention CNN but misclassified by the regular CNN in three different languages: English, French, and Vietnamese. The color intensity of a letter’s cell is proportional to the attention mechanism’s normalized weight, that is, to the focus the network puts on that character. In other words, the attention CNN puts more importance on the features that have the darkest color.
The case studies of Table 5 show the noise-tolerance that comes from the attention mechanism. It can be seen that the system puts virtually no weight on URL links, on hashtags, or on usernames. We should emphasize that our system does not implement any text preprocessing steps; the input tweets are kept as-is. Despite that, the network learned to distinguish between words and non-words, and to focus mainly on the former. In fact, when the network does put attention on these elements, it is when they appear to use real words (e.g. “star” and “seed” in a username, “mother” and “none” in a hashtag). This also illustrates how the attention mechanism can pick out fine-grained features within noisy text: in those examples, it was able to focus on real-word components of longer non-word strings.
The examples of Table 5 also show that the attention CNN learns to focus on common words to recognize languages. Some of the highest-weighted characters in the example tweets are found in common determiners, adverbs, and verbs of each language. These include “in”, “des”, “le”, “est”, “quá”, and “nhất”. These letters and words contribute significantly to identifying the language of a given input.
Finally, when multiple languages are found within a tweet, the network successfully captures all of them. For example, one tweet switches from French to Spanish and another mixes both English and Vietnamese. In both cases, the network identifies features of both languages; it focuses strongly on “est” and “y” in the former, and on “Don’t” and “bài” in the latter. A third message mixes three languages, Vietnamese, English, and Korean, and the network focuses on all three parts, picking out “nhật” and “mừng” in Vietnamese and “#생일축하해” and “#태형생일” in Korean, as well as the English portion. Since our system is set up to classify each tweet into a single language, the strongest feature of each tweet wins out and the message is classified in the corresponding language. Nonetheless, it is significant to see that features of all languages present in the tweet are picked out, and a future version of our system could successfully decompose the tweets into portions of each language.
6 Conclusion
In this paper, we first demonstrated how to build balanced, automatically-labelled, and massive LID datasets. These datasets are taken from Twitter, and are thus composed of real-world and noisy messages. We applied our technique to build three datasets ranging from hundreds of thousands to tens of millions of short texts. Next, we proposed our new neural LID system, a CNN-based network with an attention mechanism to mitigate the performance bottleneck issue while still maintaining a state-of-the-art performance. The results obtained by our system surpassed five benchmark LID systems by 5% to 10%. Moreover, our analysis of the attention mechanism shed some light on the inner workings of the typically-black-box neural network, and demonstrated how it helps pick out the most important linguistic features while ignoring noise. All of our datasets and source code are publicly available at https://github.com/duytinvo/LID_NN.