Language Detection For Short Text Messages In Social Media

08/30/2016 ∙ by Ivana Balazevic, et al.

With the constant growth of the World Wide Web and, accordingly, of the number of documents written in different languages, the need for reliable language detection tools has increased as well. Platforms such as Twitter, with predominantly short texts, are becoming important information resources, which additionally calls for language detection algorithms tailored to short texts. In this paper, we show how incorporating personalized user-specific information into the language detection algorithm leads to a significant improvement of detection results. To choose the best algorithm for language detection on short text messages, we investigate several machine learning approaches: well-known classifiers such as SVM and logistic regression, a dictionary based approach, and a probabilistic model based on modified Kneser-Ney smoothing. Furthermore, we explore extending the probabilistic model with additional user-specific information, namely evidence accumulation per user and the user interface language, with the goal of improving classification performance. The proposed approaches are evaluated on randomly collected Twitter data containing Latin as well as non-Latin alphabet languages, their results are compared, and the best performing algorithm is selected. This algorithm is then evaluated against two existing general-purpose language detection tools, the Chromium Compact Language Detector 2 (CLD2) and langid, both of which it significantly outperforms. Additionally, we give a preview of the benefits and possible applications of a reliable language detection algorithm.


1 Introduction

Language detection is a natural language processing task of identifying the language a given document is written in. It is often the first step in a document processing pipeline. Moreover, it is considered to be a critical preprocessing step in applications that require language-specific modeling, such as search engines, where depending on the detected language different tokenizers may be used. Another common example of applying language detection is as a preceding step to machine translation, since the language of the text to be translated is not always specified. Therefore, a reliable language detection tool is needed.

Even though language detection itself has been studied since the 1960s, short texts appearing on social media websites and forums still require special treatment, due to their specific type of language: acronyms, abbreviations, spelling mistakes, emoticons, newly coined words, etc. Therefore, despite the fact that language detection is a long-known problem, an appropriate solution for short text classification is yet to be found. Platforms such as Twitter, where these short texts are used, have recently become important real-time information resources [1, 2]. A wide range of applications is connected to their usage: event detection [3, 4], media analysis [5], opinion mining [6, 7], predicting movie ratings [8], etc. The majority of social network users contribute by writing posts in their own languages, but since this multi-language environment can potentially affect the outcomes of content retrieval and analysis of those posts, the ultimate goal is to properly separate the posts and obtain monolingual content. Therefore, language detection is an important step in facilitating content analysis of social media websites.

In this paper, three approaches to language detection for short text messages are developed and tested on Twitter data: support vector machines (SVMs) and logistic regression, a probabilistic model based on modified Kneser-Ney smoothing, and a dictionary based approach. One important contribution of this paper is the extension of the probabilistic model to include additional information extracted from the Tweet objects. The first hypothesis examined here is that users mostly tweet in only a few languages, so storing information about the languages connected to a particular user should improve the classification accuracy of the original model. Furthermore, the language that users choose as their user interface (UI) language should carry information about the languages those users tweet in, so it is considered relevant as well. It is important to mention that this kind of meta-information is not Twitter-specific: it can be extracted from any website that provides some access to user profiles, e.g. social media websites, forums, blogs, etc., which makes this contribution widely applicable.

Many different approaches to the language detection problem have been developed so far. Some of the best known models include the one of Cavnar and Trenkle [9], popularized in the textcat tool, the Chromium Compact Language Detector 2 (CLD2) [10], originally extracted from the source code of Google Chromium's library by Michael McCandless and developed further by Dick Sites, and langid [11], an off-the-shelf language identification tool by Lui and Baldwin. The Cavnar and Trenkle method uses a per-language character frequency model and classifies documents via their relative “out-of-place” distance from each language (see [9] for more details). Variants on this method include Bayesian models for character sequence prediction [12], dot products of word frequency vectors [13], and information-theoretic measures of document similarity [14, 15]. CLD2 and langid are both naive Bayes classifiers: CLD2 probabilistically detects over 80 languages in Unicode UTF-8 text and, for mixed-language input, returns the top three languages found together with their approximate percentages of the total text bytes, while langid is trained on 97 languages, uses a multinomial event model over a mixture of byte n-grams, and is designed to be used off-the-shelf [11]. In the Results Section, the performance of CLD2 and langid is compared to the performance of the methods developed in this paper. Additionally, kernel methods such as support vector machines (SVMs) were recently successfully applied to the same task [16, 17, 18], which motivated testing their performance on the short text dataset used in this paper. Recently, approaches based on deep neural network architectures are becoming increasingly common, with very promising results in language detection on speech data [19, 20]. Even though these architectures are out of the scope of this paper, they could be investigated in future work.

However, the main difference between most of the approaches mentioned in the previous paragraph and the problem tackled in this paper is that they are trained on large corpora of long, structured, well-written texts: e.g. the design target of CLD2 is web pages with at least 200 characters (approximately two sentences), and it is not designed to do well on very short texts. The only one of the mentioned algorithms whose authors claim good performance on short texts from the microblog domain is langid, which is why we chose to compare its performance with ours in the Results Section of this paper. An interesting study by Baldwin and Lui [21] explores the impact of document length on language detection and concludes that accuracy improves significantly as document length increases. Therefore, in order to achieve high accuracy on short texts, a method has to learn the particular characteristics of those texts, since this rather specific type of language is difficult to match for methods trained on external corpora. The main advantage of the methods implemented in this paper over methods targeted at long documents is that they have been specifically developed to recognize the language of short, noisy texts. The problem of short text language detection has been investigated by Nakatani Shuyo, who developed Language Detection with Infinity Gram (ldig). He reports quite impressive results of “over 99% accuracy for 19 languages” on a corpus containing 700,000 labeled tweets. However, a drawback of ldig is that it is limited to Latin alphabet languages only, while some of the most common Twitter languages include Japanese, Korean, Chinese, etc. Furthermore, his analysis is limited to texts longer than 3 words, which is often not the case when dealing with Twitter data. Another example of short text language detection is given in [1], relying on the results of the aforementioned textcat tool. Similarly to Shuyo's work, the results presented there cover a dataset containing only 5 Latin alphabet languages, which is not nearly enough to consider the problem of reliable short text language detection solved. In this paper, we try to overcome the difficulties that short texts pose compared to longer ones, while at the same time including non-Latin alphabet languages in the dataset, since they form an important subset of languages that should not be excluded.

1.1 Datasets

The dataset used for this task consists of .json files in which the Tweet objects are stored. Each Tweet object contains different types of relevant information, such as the unique identifier of the tweet itself, the text it contains, information about its author, and the time and location at the point of creation. However, only parts of this information are considered relevant for the classification task. The files used contain tweets collected via the Twitter API in April 2012; in total, around 22,000 tweets in 16 different languages were randomly collected at different time points during two days. Languages appearing in fewer than 3 tweets are discarded and, due to insufficient domain knowledge, Indonesian and Malay are grouped together into one language. The language distribution across the dataset is shown in Fig. 1. When collecting the data, we complied with Twitter's Terms of Use. As expected, almost half of the tweets are written in English. The languages following English by number of tweets are Malay, Japanese, Korean, Spanish, Portuguese, and Dutch. Even though the distribution of the number of tweets per language in the dataset is highly skewed, it corresponds quite well to the distribution of Twitter languages given in [22].

To ease the language labeling of the tweets and reduce the manual work, the open-source language detection library Chromium Compact Language Detector 2 (CLD2) is used. The labels obtained by CLD2 are then manually checked and all wrongly assigned labels are corrected, in order to obtain a clean dataset and avoid repeating the mistakes that CLD2 made. Finally, a column indicating the tweet language is added to the existing .csv file. However, since only around 8,000 tweets are obtained using this rather time-consuming approach, additional tweets are collected from users whose user IDs already exist in the dataset, under the assumption that the majority of users tweet in only one or two languages. The language of the tweet already present in the dataset is therefore assigned to all remaining tweets from the same user. To prevent mistakes introduced by this approach, all newly collected data is later manually rechecked again.
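A minimal sketch of this label propagation step is given below, assuming a single hypothetical file with column names "user_id", "text", and "language"; the authors' actual file layout is not specified in the paper.

```python
import pandas as pd

# Hypothetical sketch of the label propagation described above; tweets whose
# language is still unknown inherit the label of an already-labeled tweet by
# the same user. Column and file names are assumptions, not the authors' schema.
tweets = pd.read_csv("tweets.csv")
labeled = tweets.dropna(subset=["language"])
user_lang = labeled.drop_duplicates("user_id").set_index("user_id")["language"]

mask = tweets["language"].isna() & tweets["user_id"].isin(user_lang.index)
tweets.loc[mask, "language"] = tweets.loc[mask, "user_id"].map(user_lang)
# Propagated labels are then manually rechecked, as stated in the text.
```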

Figure 1: Distribution of languages in the dataset

2 Methods

The general task of language detection is to predict, for a given text, the language in which the text is written. A typical naive approach would be to show the text to a language expert, who would then decide on the language it is written in. However, that would require a separate expert for each of the languages, and the solution becomes even more problematic if the database of texts is not static but changes over time, where scalability becomes an issue. Therefore, a machine learning approach is needed. In the machine learning approach to the language detection problem, we are given a certain amount of data (a set of texts in different languages) and the corresponding labels (the languages those texts belong to). The labels have previously been assigned to the data by some form of annotation procedure. Even though the need for human language experts still exists in the annotation step, once the initial amount of data is labeled, the algorithm does the rest of the work. Having the labels and not just the raw texts makes this a supervised learning problem, in which each example is a pair of an input object and a target value. The goal of supervised learning is to predict the correct output value for each input object.

2.1 Preprocessing

Preprocessing is usually the first part of a machine learning document processing pipeline, preceding the extraction of features from the data. In this paper, preprocessing consists of editing the tweet texts and assigning them the corresponding language labels. The text editing comprises cleaning the texts, removing all information considered unnecessary for the task, and transforming all texts into the same, mutually comparable form. As the first preprocessing step, all links and expressions addressing a particular user (the @user_name form) are removed from every tweet text using simple regular expressions, as they are considered irrelevant for differentiating between languages. In addition, all emoticons are removed, since they retain the same form across languages. The text is then converted to lowercase, multiple white spaces are trimmed, and all punctuation marks are removed. This procedure transforms all texts into a uniform format, to improve accuracy when performing their mutual comparisons.
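A minimal sketch of these cleaning steps is shown below; the exact regular expressions used by the authors are not given, so the patterns here are assumptions that follow the listed steps.

```python
import re
import string

URL_RE     = re.compile(r"https?://\S+")   # assumed pattern for links
MENTION_RE = re.compile(r"@\w+")           # assumed pattern for @user_name mentions

def preprocess(text: str) -> str:
    text = URL_RE.sub(" ", text)                 # remove links
    text = MENTION_RE.sub(" ", text)             # remove @user_name mentions
    text = text.lower()                          # lowercase
    # strip punctuation; most ASCII emoticons (e.g. ":)") consist of punctuation
    # characters and are removed by this step as well
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()     # trim multiple white spaces
    return text

print(preprocess("@alice Check this out: http://t.co/xyz  So cool!!!"))
# -> "check this out so cool"
```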

2.2 Feature Extraction

In this paper, two different types of features are extracted from the tweet texts, depending on the classifier used: character n-grams and bag-of-words features. Character n-grams can be described as all character substrings of length n in the given text, while the bag-of-words features are defined as an unordered collection of the words in the text. Whenever possible, the character n-gram feature model is chosen over the bag-of-words model, which is justified by the specific type of language used in the dataset: the character n-gram model is more robust against misspellings, abbreviations, acronyms, and word derivations than bag-of-words, since it does not strictly impose splitting the texts by white spaces.

For SVM and logistic regression classification, character n-grams are chosen as the appropriate feature type. After extracting the n-grams, the next step is to transform this collection of features into numerical feature vectors, which is a standard step before applying most of the machine learning algorithms to text data. This task can be done in many ways - from the simplest one of having a binary indicator whether a particular n-gram appeared in a text to counting the occurrences of that n-gram in a text and optionally applying different kinds of normalizations to those counts.

In order to choose the most appropriate feature type for this task, different values of n and normalization types are evaluated on a held-out development set using a 5-fold cross-validation procedure. For every combination of n (only values considered reasonable for short Twitter texts are examined) and normalization type (tf-idf and length normalization are examined here), the micro- and macro-averaged F1-scores are computed using the scikit-learn software [23]. For both classifiers, a single value of n combined with tf-idf weighting is chosen as the feature type, since it slightly outperformed all other parameter combinations.
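A sketch of this selection step is given below. The helper load_tweets() and the 1-4 character n-gram range are assumptions for illustration; the paper does not restate the n chosen for these classifiers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.svm import LinearSVC

# load_tweets() is a hypothetical helper returning cleaned texts and labels.
texts, labels = load_tweets()

# Character n-grams with tf-idf weighting, as described in the text.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 4))
X = vectorizer.fit_transform(texts)

scores = cross_validate(LinearSVC(), X, labels, cv=5,
                        scoring=["f1_micro", "f1_macro"])
print(scores["test_f1_micro"].mean(), scores["test_f1_macro"].mean())
```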

In the probabilistic approach, character n-grams are again chosen as the best suited feature type. Due to the use of modified Kneser-Ney smoothing [24] and its recursive nature, character 1-4-grams are chosen as the features: the Kneser-Ney probabilities of the higher order n-grams are computed using the probabilities of the lower order n-grams. Limiting the order of the character n-grams to 4 seemed the most reasonable choice here, since tweets are too short to extract features longer than 4-grams. No normalization is performed in this approach, since the Kneser-Ney algorithm is designed to work directly on n-gram counts.

In the dictionary based approach, the bag-of-words feature model is the only possible choice, since this approach relies on matching the words from a dictionary against the words in the text, so the features necessarily have to be whole words. Therefore, the character n-gram model is not considered here.

2.3 Classification

SVM

Support vector machines (SVMs) are supervised learning models, used mostly for classification and regression problems. In the following paragraphs, SVM classification is described for the two-class case only for simplicity, since multi-class classification is an extension of that model [25]; the multi-class support in this paper is handled according to the one-vs-one scheme. SVM classification aims to maximize the margin, i.e. the distance of the data points of both classes from the decision boundary, based on structural risk minimization [26]. One way to achieve this is by solving the dual optimization problem [27]:

\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j \, y_i y_j \, k(\mathbf{x}_i, \mathbf{x}_j)

subject to

0 \le \alpha_i \le C, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0,

where \mathbf{x}_i are the feature vectors and y_i \in \{-1, +1\} are the corresponding class labels. The function k(\cdot, \cdot) is the so-called kernel function, which describes the similarity between two documents and allows the extension of SVMs to nonlinear problems. The parameter C is a regularization constant, which allows some points in the training set to be misclassified in order to avoid overfitting. All data points with \alpha_i > 0 are the so-called support vectors, i.e. those data points that lie on or inside the margin. The \alpha_i are typically equal to 0 for most of the documents considered, which makes SVMs extremely efficient: when assigning a label to a new data point, only the documents with \alpha_i > 0 have to be considered. After the training phase is completed, a new data point \mathbf{x} is classified according to the following expression [27]:

f(\mathbf{x}) = \operatorname{sgn}\!\left( \sum_{i=1}^{N} \alpha_i y_i \, k(\mathbf{x}_i, \mathbf{x}) + b \right),

where b is the bias term. One of the most important points when choosing an SVM as a classification method is the choice of kernel function, since it significantly influences the classification accuracy. In this paper, the linear kernel is chosen, since it performs well in cases where the dimensionality is much higher than the number of data points. Additionally, computing the linear kernel is computationally cheaper than computing any other kernel function (e.g. RBF, polynomial, etc.), since a linear kernel is just a simple dot product in the feature space. Therefore, training an SVM with a linear kernel is faster than with any other kernel, particularly when using a dedicated library such as LibLinear [28]. Finally, most text classification problems are linearly separable [29], so no kernel other than the linear one is needed. SVM is chosen as one of the models in this work because of its many advantages [23]: it is effective in high dimensional spaces, it remains effective even when the number of dimensions exceeds the number of samples (usually the case with categorization of text documents), it uses only a subset of training points in the decision function and is therefore memory efficient, and it is unlikely to overfit, since the ratio of the number of data points to the number of effective dimensions is typically high [30], given that an appropriate regularization term is used.

Logistic Regression

Logistic regression, despite having the word “regression” in its name, is a linear model for classification rather than regression [31]. The logistic regression classification paradigm is described here for the two-class case only. It is a probabilistic statistical model in which the probabilities describing the possible assignments to the different classes are modeled using the logistic function [23]:

P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^{\top} \mathbf{x})},

where y is the assigned class label, \mathbf{x} is the data point, \mathbf{w} is the vector of regression coefficients, and P(y = 1 \mid \mathbf{x}) is the probability of \mathbf{x} being drawn from the positive class. A new data point gets assigned to the class with the highest probability. As an optimization problem, two-class L2-penalized logistic regression minimizes the following cost function:

\min_{\mathbf{w}} \; \frac{1}{2} \mathbf{w}^{\top} \mathbf{w} + C \sum_{i=1}^{N} \log\!\left( 1 + \exp\!\left( -y_i \, \mathbf{w}^{\top} \mathbf{x}_i \right) \right),

where \frac{1}{2} \mathbf{w}^{\top} \mathbf{w} is the L2-regularization term and C is the inverse regularization constant. The reasons for using logistic regression in this work are its simplicity (it creates a linear decision boundary), its effectiveness in high dimensional spaces, and the fact that it is unlikely to overfit when an appropriate regularization term is chosen.

After extracting the character n-gram features as described in the Feature Extraction Section, the obtained feature matrix containing the tf-idf features and the label vector containing the corresponding class labels are split into training and test parts, which is repeated in a 5-fold cross-validation procedure. After being trained on the training data using the scikit-learn implementations of the two classifiers, the learned models are applied to the test data. During the classification procedure, special treatment is given to texts containing characters belonging to Thai, Arabic, Korean, Japanese, or Chinese, due to the very large number of different characters present in each of those languages, which causes the classifiers to perform poorly on texts in those languages. Therefore, SVM and logistic regression are not trained on texts from those languages; instead, a separate rule is applied to determine which of those languages a text originated from. For this purpose, a threshold value is determined experimentally on the held-out development set: if the characters specific to one of those scripts make up more than the specified percentage of the text length, the text is assigned the language label to which those characters belong. The biggest challenge here is differentiating between Japanese and Chinese, since both use Kanji characters. However, the number of Kanji characters used in Chinese is much larger than in Japanese, and Hiragana and Katakana are specific to Japanese only, which enables a successful separation of the two languages.
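A simplified sketch of this script-based rule follows. The Unicode ranges and the 0.5 threshold are illustrative assumptions; the paper determines the actual threshold on the held-out development set.

```python
import re

SCRIPT_RANGES = {
    "th": re.compile(r"[\u0e00-\u0e7f]"),   # Thai
    "ar": re.compile(r"[\u0600-\u06ff]"),   # Arabic
    "ko": re.compile(r"[\uac00-\ud7af]"),   # Hangul syllables
    "ja": re.compile(r"[\u3040-\u30ff]"),   # Hiragana + Katakana (Japanese only)
    "zh": re.compile(r"[\u4e00-\u9fff]"),   # CJK unified ideographs (Kanji/Hanzi)
}

def detect_by_script(text: str, threshold: float = 0.5):
    """Return a language code if one script dominates the text, else None."""
    if not text:
        return None
    for lang, pattern in SCRIPT_RANGES.items():
        if len(pattern.findall(text)) / len(text) > threshold:
            # Because "ja" is checked before "zh", texts dominated by
            # Hiragana/Katakana are labeled Japanese; texts dominated by
            # Kanji/Hanzi characters fall through to "zh".
            return lang
    return None
```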

Dictionary Based Approach

The dictionary based method is by far the simplest of all the methods tested: it relies on having a dictionary of all possible words for each language, comparing the words in the text with the words in each dictionary, and counting the number of hits per text and per language. The winning language label for each text is the one with the highest number of hits. The primary advantage of the dictionary based approach is its simplicity; it does not require a training phase and the algorithm itself is very easy to implement. However, it usually cannot compete with more powerful methods, as shown in the Results Section. Additionally, storing all the dictionaries requires a lot of memory, which is dealt with here by representing each dictionary as a bloom filter rather than iterating over every dictionary file.

The first step of the dictionary based approach is to download a dictionary of all the words for each of the languages. For that, the GNU Aspell dictionaries are used. The dictionary files are preprocessed so that they contain only one word per line and are stored as bloom filters, for time and space efficiency. A bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set, where false positive matches are possible while false negatives are not; in other words, it can answer “possibly in the set” or “definitely not in the set”. Each word in a tweet text is checked against every bloom filter and the number of (possible) hits is counted accordingly. The text is then assigned to the language with the highest number of hits. Some Asian languages (Thai, Korean, Japanese, and Chinese) are handled the same way as described in the previous paragraph, since there are no GNU Aspell dictionaries for those languages.
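A minimal, self-contained sketch of this approach is given below. The bloom filter parameters and the word-list file names (assumed to be the preprocessed Aspell dictionaries, one word per line) are placeholders, not the authors' actual configuration.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, word: str):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{word}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, word: str):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word))

def build_filter(path: str) -> BloomFilter:
    bf = BloomFilter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            bf.add(line.strip())
    return bf

# Placeholder file names for the preprocessed Aspell word lists.
filters = {lang: build_filter(f"aspell_{lang}.txt") for lang in ["en", "es", "de", "nl"]}

def classify(text: str) -> str:
    hits = {lang: sum(w in bf for w in text.split()) for lang, bf in filters.items()}
    return max(hits, key=hits.get)   # language with the most dictionary hits
```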

Probabilistic Model Based on Modified Kneser-Ney Smoothing

The probabilistic model implemented in this work outputs a vector of probabilities of a given text belonging to each of the languages present in the dataset. The language with the highest assigned probability is chosen as the class label. The algorithm is based on modified Kneser-Ney smoothing, a slightly altered version of Kneser-Ney smoothing that has been shown to outperform the original version [24]. Kneser-Ney smoothing makes use of absolute discounting, subtracting a fixed value from the counts to de-emphasize n-grams with low frequencies, and it takes into account the frequency of unigrams in relation to the possible higher order n-grams in which those unigrams are contained. Additionally, this smoothing assigns a probability greater than zero to all n-grams that do not appear in the training set but are present in the test set. Modified Kneser-Ney smoothing computes the conditional probability p_{KN}(w_i \mid w_{i-n+1}^{i-1}) for each n-gram in every text, where w_i is the i-th character and w_{i-n+1}^{i-1} denotes the preceding characters. The probability value for the whole text is then computed as:

P(\text{text}) = \prod_{i=1}^{N} p_{KN}(w_i \mid w_{i-n+1}^{i-1}),

where N is the number of n-grams in the text. The conditional probability is then defined recursively as:

p_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\{c(w_{i-n+1}^{i}) - D(c(w_{i-n+1}^{i})),\, 0\}}{\sum_{w_i} c(w_{i-n+1}^{i})} + \gamma(w_{i-n+1}^{i-1}) \, p_{KN}(w_i \mid w_{i-n+2}^{i-1}),

where c(\cdot) denotes the count, \gamma(w_{i-n+1}^{i-1}) is the scaling factor that makes the distribution sum to 1, and D is the discount factor, defined as:

D(c) = \begin{cases} 0 & \text{if } c = 0 \\ D_1 & \text{if } c = 1 \\ D_2 & \text{if } c = 2 \\ D_{3+} & \text{if } c \ge 3 \end{cases}

i.e. instead of using a single discount for all non-zero counts as in Kneser-Ney smoothing, three different parameters D_1, D_2, and D_{3+} are applied to n-grams with one, two, and three or more counts, respectively. To make the distribution sum to 1, \gamma is defined as:

\gamma(w_{i-n+1}^{i-1}) = \frac{D_1 N_1(w_{i-n+1}^{i-1} \bullet) + D_2 N_2(w_{i-n+1}^{i-1} \bullet) + D_{3+} N_{3+}(w_{i-n+1}^{i-1} \bullet)}{\sum_{w_i} c(w_{i-n+1}^{i})},

where N_r(w_{i-n+1}^{i-1} \bullet) stands for the number of words that follow the context w_{i-n+1}^{i-1} with exactly r counts (r or more counts in the case of N_{3+}) and \bullet is a free variable over which the counts have been summed. The estimates for the optimal discount values D_1, D_2, and D_{3+} are computed as a function of the training data counts [32]:

D_1 = 1 - 2Y \frac{n_2}{n_1}, \qquad D_2 = 2 - 3Y \frac{n_3}{n_2}, \qquad D_{3+} = 3 - 4Y \frac{n_4}{n_3},

where Y = \frac{n_1}{n_1 + 2 n_2} and n_r is the number of n-grams appearing exactly r times in the training data. It is important to mention that if the smoothing term were omitted, the probability of a whole text would be 0 whenever the text contains an n-gram that is present in the test data but never appeared in the training data. After computing the probabilities for each text and each language, the text is assigned the language label with the highest probability score. The advantage of this probabilistic model over the other methods is that it takes the n-grams with zero counts into account by smoothing the probability function, which should in turn lead to higher accuracy. However, the method also has certain drawbacks, such as its complexity and its high space and time usage due to the many recursive calls.

In this approach, the training phase consists of counting the occurrences of each 1-4-gram in the texts of each language. This dictionary of n-gram counts is then fed into the test phase, which iterates over every n-gram of each text in the test data and computes the modified Kneser-Ney probability of that n-gram belonging to a certain language. The probability of the whole text belonging to a certain language is calculated by multiplying the probabilities of all n-grams contained in that text, and the language with the highest probability is chosen. One important difference is that the special handling of non-Latin languages used for SVM, logistic regression, and the dictionary based approach is not needed here, since the probabilistic model performs well on texts in those languages.
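To make the training and test phases concrete, the following sketch shows a compact character-level implementation of this scheme. It is a simplified illustration, not the authors' code: it applies the count-dependent discounts D1, D2, D3+ at every order but interpolates raw counts rather than the continuation counts used in full modified Kneser-Ney [24], and it backs off to a uniform distribution over the observed characters.

```python
import math
from collections import defaultdict

MAX_ORDER = 4

class LanguageModel:
    def __init__(self):
        # counts[k][context][char]: how often `char` follows the (k-1)-character `context`
        self.counts = [defaultdict(lambda: defaultdict(int)) for _ in range(MAX_ORDER + 1)]
        self.vocab = set()
        self.discounts = {}

    def train(self, texts):
        for text in texts:
            for i, ch in enumerate(text):
                self.vocab.add(ch)
                for k in range(1, MAX_ORDER + 1):
                    start = i - (k - 1)
                    if start < 0:
                        continue
                    self.counts[k][text[start:i]][ch] += 1
        for k in range(1, MAX_ORDER + 1):
            n = {1: 0, 2: 0, 3: 0, 4: 0}   # n[r]: number of k-grams seen exactly r times
            for following in self.counts[k].values():
                for c in following.values():
                    if c in n:
                        n[c] += 1
            Y = n[1] / (n[1] + 2 * n[2]) if (n[1] + 2 * n[2]) else 0.5
            D1 = 1 - 2 * Y * n[2] / n[1] if n[1] else 0.5
            D2 = 2 - 3 * Y * n[3] / n[2] if n[2] else 1.0
            D3 = 3 - 4 * Y * n[4] / n[3] if n[3] else 1.5
            self.discounts[k] = (D1, D2, D3)

    def _discount(self, c, k):
        D1, D2, D3 = self.discounts[k]
        return 0.0 if c == 0 else D1 if c == 1 else D2 if c == 2 else D3

    def prob(self, ch, context):
        """P(ch | context), interpolating discounted counts with the lower order."""
        lower = (1.0 / max(len(self.vocab), 1)) if not context else self.prob(ch, context[1:])
        k = len(context) + 1
        following = self.counts[k].get(context, {})
        total = sum(following.values())
        if total == 0:
            return lower                         # unseen context: back off completely
        D1, D2, D3 = self.discounts[k]
        reserved = (D1 * sum(1 for v in following.values() if v == 1)
                    + D2 * sum(1 for v in following.values() if v == 2)
                    + D3 * sum(1 for v in following.values() if v >= 3))
        gamma = reserved / total                 # mass handed down to the lower order
        c = following.get(ch, 0)
        return max(c - self._discount(c, k), 0.0) / total + gamma * lower

    def log_score(self, text):
        return sum(math.log(self.prob(ch, text[max(0, i - MAX_ORDER + 1):i]) + 1e-12)
                   for i, ch in enumerate(text))

def detect(text, models):
    """models: dict mapping language label -> trained LanguageModel."""
    return max(models, key=lambda lang: models[lang].log_score(text))
```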

2.4 Including Additional Information

The most important contribution and the biggest novelty of this paper compared to previous work done in the language detection field is the extension of the probabilistic model to include additional personalized user information. Due to the fact that the output of the probabilistic model is in the form of a probability distribution over different languages, additional information can be added to the model in order to improve the predictions. Two types of information are investigated here and added to the original model and their impact on the predictions is evaluated. This information includes prior information about the language usage by a particular user and information about the user interface language.

First, a prior frequency distribution over languages is defined for each user. This distribution is initialized uniformly, with an experimentally determined value, as described later in the Results Section. For every new text in the test data, the chosen language is no longer determined by looking only at the probability distribution given by the classifier, but at the product of that distribution with the user-specific prior. This prior is adjusted each time a new text is observed and its language determined, by increasing the count for that language; in this way, user-specific evidence accumulation is incorporated into the model. Since it is assumed that an average user tweets in only a few languages, this adjustment is expected to improve the overall results, and it should matter most for texts where the original classifier is highly uncertain. This is best illustrated by an example: if a user frequently tweeted in language A, but the Kneser-Ney probability for the current tweet is not large enough to decide for language A and is very close to the probability of language B, the accumulated user data will tip the decision towards the more frequently used language, in this case A. On the other hand, if the Kneser-Ney probability for language B is much higher than the one for language A, the algorithm will still decide for language B, provided an appropriate prior distribution is chosen. The value to which the flat prior distribution is initialized determines how important a new data point is: if the prior is set to a low value, each new data point influences the distribution greatly; if it is set to a high value, a lot of new data has to be seen to significantly change the distribution.

Furthermore, it is assumed that the user interface (UI) language chosen by the user carries additional information about the language the user tweets in, i.e. those two languages should coincide in some cases. Therefore, the information about the UI language is included in the original model as follows: the prior distribution is no longer initialized to a uniform value at the beginning of the classification procedure; instead, the value for the UI language is increased, again by an experimentally determined amount.
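The sketch below illustrates both mechanisms, the per-user evidence accumulation from the previous paragraph and the UI-language initialization described here. The language list and the constants (1.0 for the flat prior, 5.0 for the UI-language boost) are placeholders; the paper determines both values experimentally on a development set.

```python
import numpy as np

LANGS = ["en", "es", "pt", "nl", "ja", "ko"]   # placeholder label set
FLAT_PRIOR, UI_BOOST = 1.0, 5.0                # assumed constants

user_counts = {}   # user_id -> per-language evidence counts

def predict(user_id, ui_language, kn_probs):
    """kn_probs: probability vector over LANGS from the probabilistic model."""
    if user_id not in user_counts:
        prior = np.full(len(LANGS), FLAT_PRIOR)
        if ui_language in LANGS:
            prior[LANGS.index(ui_language)] += UI_BOOST   # UI-language initialization
        user_counts[user_id] = prior
    prior = user_counts[user_id]
    scores = kn_probs * (prior / prior.sum())   # combine classifier output with the prior
    winner = int(np.argmax(scores))
    user_counts[user_id][winner] += 1.0         # evidence accumulation for this user
    return LANGS[winner]
```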

3 Results

In the first part of this Section, we give a brief introduction to the accuracy measures used to compare the performance of the classifiers. To gain better insight into the behavior of the different methods, their prediction results are compared in the second part. The effect of adding information to the model beyond the character n-grams is evaluated in the following Subsection. The best performing method is then chosen and its results are compared to those of CLD2 and langid in the last Subsection.

3.1 Accuracy Measures

A good classifier is defined as one for which the number of true positives (TP) and true negatives (TN) is high, while at the same time the number of false positives (FP) and false negatives (FN) is kept low. To summarize these outcomes, two accuracy measures are used in assessing the classifiers' performance: the micro- and the macro-averaged F1-score. In order to understand the F1-score, precision and recall need to be explained first.

Precision is defined as:

\text{precision} = \frac{TP}{TP + FP}

and it measures the ability of the classifier not to assign a sample to a class to which it does not belong. Recall is defined as:

\text{recall} = \frac{TP}{TP + FN}

and it measures the ability of the classifier to find all the samples that actually belong to a class. The F1-score is defined as:

F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}

and it is the weighted average (harmonic mean) of precision and recall. The micro-averaged F1-score calculates the metrics globally by counting the total numbers of TPs, FPs, and FNs, while the macro-averaged F1-score calculates the metrics for each label and takes their unweighted mean, not taking label imbalance into account [23]. Because of the large differences in sample sizes between the languages in the dataset used in this paper, the difference between the micro- and macro-averaged F1-scores is expected to be large as well. The lack of training data for some languages (e.g. Turkish, Italian, German, see Fig. 1) does not allow the classifier to learn a correct representation for those languages. Therefore, the F1-scores for those labels are expected to be low, which in turn affects the macro-averaged F1-score negatively.
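The small example below, on made-up labels rather than data from the paper, illustrates the difference between the two averaging schemes: a rare, misclassified class drags the macro average down while barely affecting the micro average.

```python
from sklearn.metrics import f1_score

y_true = ["en"] * 8 + ["es"] * 4 + ["tr"] * 2
y_pred = ["en"] * 8 + ["es"] * 4 + ["en", "es"]   # both rare "tr" samples misclassified

print(f1_score(y_true, y_pred, average="micro"))  # ~0.86
print(f1_score(y_true, y_pred, average="macro"))  # ~0.61
```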

3.2 Performance Comparison of Different Methods

In order to assess the quality of the SVM and logistic regression predictions, a 5-fold cross-validation procedure is performed for both classification methods, and the results are reported as mean micro- and macro-averaged F1-scores with their standard deviations over the folds. Additionally, SVM and logistic regression are combined into an ensemble learning method, under the assumption that this will increase the classification accuracy, as suggested in [33] and [34]. In the ensemble, the confidence scores for choosing a specific language are the probabilities produced by the two classifiers: if the predicted language for a certain text differs between the classifiers, the label is taken from the classifier with the higher confidence score. However, since the scikit-learn implementation of the SVM classifier does not directly provide probability scores, those are obtained via Platt scaling [35] by setting the probability parameter to True. The corresponding micro- and macro-averaged F1-scores can be seen in Table 1. Contrary to the initial hypothesis, the SVM by itself outperforms the ensemble classifier in both micro- and macro-averaged F1-scores. Evidently, logistic regression fails to capture some information about the correct decision boundary between the language classes while still being highly confident about its predictions. To assess the performance of the probabilistic model based on modified Kneser-Ney smoothing, the 5-fold cross-validation procedure is performed again. As shown in Table 1, the probabilistic model with Kneser-Ney smoothing significantly outperforms the traditional classifiers, presumably due to its smoothing of the n-gram count distribution to account for zero-probability n-grams. It is important to mention that no smoothing is applied to the traditional classifiers, since that would make the feature matrices no longer sparse, which would in turn result in huge computational costs.
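A sketch of this confidence-based ensemble is shown below; X_train, y_train, and X_test are assumed to be the tf-idf character n-gram features and labels prepared earlier, and probability=True enables Platt scaling for the SVM.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# X_train, y_train, X_test: assumed tf-idf character n-gram data from earlier steps.
svm = SVC(kernel="linear", probability=True).fit(X_train, y_train)
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

svm_p, lr_p = svm.predict_proba(X_test), logreg.predict_proba(X_test)

# For each text, keep the label from the more confident of the two classifiers
# (when they agree, the label is the same either way).
use_svm = svm_p.max(axis=1) >= lr_p.max(axis=1)
pred = np.where(use_svm,
                svm.classes_[svm_p.argmax(axis=1)],
                logreg.classes_[lr_p.argmax(axis=1)])
```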

method              | micro-averaged F1 | macro-averaged F1
--------------------|-------------------|------------------
SVM                 |                   |
logistic regression |                   |
ensemble learning   |                   |
probabilistic model |                   |
dictionary based    |                   |

Table 1: Micro- and macro-averaged F1-scores for different methods

To assess the dictionary based method, no training phase is necessary, since the algorithm simply compares the words of the tweets with the words in the provided dictionaries. Its performance is nevertheless measured on the same test sets as the other methods, to ensure a fair comparison. The dictionary method clearly performs far worse than the other methods, which confirms the initially expected outcome. Possible reasons are that the texts in the dataset are very short (a tweet is limited to 140 characters), which is not enough to accumulate a distinct number of hits for different languages, and that they are full of spelling mistakes, which alter the original words so that they no longer exactly match the dictionary entries. Additionally, this outcome shows the weakness of the bag-of-words feature model compared to the n-gram model used in the other approaches.

3.3 What if we add additional information?

Under the hypothesis that classification accuracy can be improved by adding information other than n-gram frequency counts to our model, different extensions are tested on top of the probabilistic model, since that is the model that yielded the best results in the comparison above. First, the effect of accumulating data per user is evaluated. For each new tweet, we extract its user ID; if this user ID has already appeared in the dataset, we increment the count for the number of tweets in the predicted language for that user. If, however, this is the first tweet by that user, the prior distribution of tweet counts per language is set to a uniform distribution. The exact value of this uniform prior is determined experimentally on a held-out development set, and several values are tested to analyze the influence that the height of the prior has on the F1-scores. Fig. 2 shows the micro- and macro-averaged F1-scores for the different prior values: the lower the prior value, the better both the micro- and the macro-averaged scores get. Therefore, the best performing model is the one with the lowest of the tested flat prior values; its micro- and macro-averaged F1-scores are given in Fig. 2 and Table 2. Additionally, all configurations that include the user-specific information perform better than the one without it, which confirms the hypothesis that including user-specific information is a valuable extension of the model. The second hypothesis involves the users' UI language. The UI language is incorporated into the prior distribution as follows: the count for that language is increased relative to the counts for the other languages. The exact increase is again determined experimentally, with several values tested on the held-out development set, while the counts for all other languages are kept at the best flat prior value from Fig. 2. All of the tested increase values improve the classification accuracy compared to the procedure without the UI language information, as can be seen in Fig. 3, and the best performing value is used in the final model. Compared to the model without the UI language information, a very significant increase is achieved in the macro-averaged F1-score. A potential reason is that the UI language information helps most for those texts whose languages have little training data, which improves the overall macro-averaged score. The micro-averaged score also improves compared to the model without the UI language information.

Figure 2: Micro- and macro-averaged F1-scores for different prior distribution values
Figure 3: Micro- and macro-averaged F1-scores for assigning different importance values to the UI language information

method                           | micro-averaged F1 | macro-averaged F1
---------------------------------|-------------------|------------------
prob. model                      |                   |
prob. model + evidence acc.      |                   |
prob. model + evidence acc. + UI |                   |

Table 2: Micro- and macro-averaged F1-scores after adding user-specific information to the probabilistic model

In conclusion, the chosen classification method is the probabilistic model with modified Kneser-Ney smoothing over character n-gram features, extended with user-specific information in the form of evidence accumulation and the UI language. The performance of all the variants tested on top of the probabilistic model, including this final one, is summarized in Table 2.

Additionally, the distribution of F1-scores across the different languages is plotted for the best performing method in Fig. 4. Comparing the results in Fig. 4 with the number of samples per category shown in Fig. 1, it can be seen that the 4 languages with the fewest samples (Turkish, Tagalog, Italian, German) are also the ones with the lowest F1-scores. This confirms the conjecture that not having enough training data hurts classification accuracy. As expected, languages with a large number of samples (English, Malay, Spanish, Portuguese, Dutch) achieve high F1-scores. It is interesting to note that very good results are also obtained for texts in non-Latin alphabet languages such as Russian, Japanese, Korean, Arabic, and Thai, an indication that the representations of those categories are easily learned by the model due to the distinctive characters used in those languages. The only non-Latin language for which the results are not as high is Chinese, possibly due to the large overlap between the characters used in Japanese and Chinese.

Figure 4: The distribution of F1-scores across different categories

3.4 Performance Comparison with CLD2 and langid

The goal of this Section is to compare the performance of the best method from the previous Section with the performance of the Chromium Compact Language Detector 2 and the langid tools. The categorization is done both on raw tweets and on preprocessed ones, where the preprocessing procedure is the same as the one described in the Preprocessing Section of this paper. For CLD2, the predicted language is the one to which the algorithm assigns the highest confidence, while langid outputs only the single language with the highest confidence. The obtained results are shown in Table 3. The results obtained by CLD2 improve significantly when the preprocessing methods developed in this paper are applied, compared to the results on the raw data, but they are still considerably worse than the results achieved by our algorithm. The macro-averaged F1-scores differ somewhat more between the classifiers than the micro-averaged ones. The reason may lie in the fact that the probabilistic model outperforms CLD2 and langid for some languages with little available data, an effect we attribute to the use of prior information. Results achieved by langid do not depend much on preprocessing the data, but they are significantly worse than the results achieved by our algorithm in both micro- and macro-averaged F1-scores, even though the authors claim it should perform well across different domains.
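For reference, the two baselines can be queried as sketched below, assuming the langid and pycld2 Python packages as bindings for the two tools evaluated here.

```python
import langid
import pycld2

text = "este es un ejemplo corto"

lang, score = langid.classify(text)            # e.g. ('es', ...)
is_reliable, _, details = pycld2.detect(text)  # details[0] holds the top guess
cld2_lang = details[0][1]                      # ISO code of the most confident language

print(lang, cld2_lang, is_reliable)
```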

method                           | micro-averaged F1 | macro-averaged F1
---------------------------------|-------------------|------------------
CLD2 (raw data)                  |                   |
CLD2 (preprocessed data)         |                   |
langid (raw data)                |                   |
langid (preprocessed data)       |                   |
prob. model + evidence acc. + UI |                   |

Table 3: Micro- and macro-averaged F1-scores of the best performing classification method compared to Chromium Compact Language Detector 2 and langid

To sum up, our probabilistic model with modified Kneser-Ney smoothing and the addition of user-specific data outperforms both CLD2 and langid in the micro-averaged as well as the macro-averaged F1-score (Table 3), which is considered a very significant improvement. These results support the statement from the Introduction that CLD2 is not well suited for language detection on short texts and therefore confirm the initially stated need for a better algorithm. Additionally, we show that algorithms like langid, which work considerably well across domains, have certain difficulties competing with an algorithm designed specifically for a single domain.

3.5 Getting More Out of the Data

In this Section, attention is drawn to possible applications of a reliable short text language detection algorithm. Even though language detection by itself is an interesting and challenging task, many more conclusions and appealing statistics can be drawn once it has been applied to real-world data. Therefore, the connections of the predicted languages with the UI language and with the time of posting (which roughly reflects location through timezones) are investigated.

Fig. 5 shows the relationship between the predicted language and the UI language. It is not surprising that users of many different nationalities (assuming that nationality usually corresponds to the user's UI language) tweet in English, or that many users have English as their UI language independently of which language they tweet in. On the other hand, it is interesting to see that users who often tweet in Spanish have, apart from English, Spanish and Portuguese set as their UI languages, which illustrates the geographic and lexical similarity of those languages. It is also interesting to notice that users who tweet in Dutch mostly have English as their UI language, followed by Dutch, whereas users who tweet in French mostly have French as their UI language, followed by English. This might be a good indication of how widespread the usage of English is in a particular country.

Figure 5: The percentage of predicted language vs. the UI language

In Fig. 6, the predicted languages are compared with regard to the UTC time at which the tweets in those languages are posted. The time is given as hours in the 0-23 range, where e.g. the number 5 represents all tweets posted between 05:00h and 05:59h. It can be seen that some languages (e.g. English, Portuguese, Spanish) have a rather even distribution of occurrence throughout the whole day, since they are widespread across different continents and therefore different timezones. On the contrary, there are no Chinese tweets after 4 pm UTC, since the local time in China is then already midnight. As expected, for languages prevalent mostly in one timezone (e.g. French, Dutch), peaks in the number of tweets can be observed in the late afternoon and evening.

Figure 6: The percentage of predicted language vs. the time when the tweet is posted

4 Conclusions

In this paper, different algorithmic approaches to language detection for short texts in social media are investigated. The first approach includes the use of well-known classifiers such as SVM and logistic regression, as well as a combination of both. The second approach is based on a probabilistic model with modified Kneser-Ney smoothing, extended with additional information specific to a single user. The last approach is a simple dictionary based method. When comparing the classification performance of all the algorithms, the probabilistic model outperforms the other methods. The dictionary method achieves by far the worst results, since short tweets full of spelling mistakes, abbreviations, and acronyms do not match most of the words present in the Aspell dictionaries; the other two approaches are trained directly on Twitter data, which gives them a significant advantage over the dictionary method. After introducing additional user information into the probabilistic model, namely the user interface language and a record of the languages the user previously tweeted in, its classification accuracy increases even further. The reason why the probabilistic model with modified Kneser-Ney smoothing performs better than all the other methods presumably lies in its smoothing of the n-gram probability distribution: it can effectively handle n-grams that appear in the test data but not in the training data, while incorporating the relationship between lower and higher order n-grams.

The main goal of this paper was to develop a language detection algorithm aimed at short texts, since that is where most of the general language detection tools fail. Therefore, the results obtained by the above mentioned algorithms are compared with the already existing general language detection tools Chromium Compact Language Detector 2 (CLD2) and langid. Both SVM and logistic regression and especially the probabilistic model provided a substantial increase in the accuracy results compared to those two tools. This improvement becomes even more pronounced after introducing additional user-specific information into the model, which brings us one step closer to solving the task of reliably detecting the language of short texts.

Acknowledgments

This work was supported by the Brain Korea 21 Plus Program, through the National Research Foundation of Korea funded by the Ministry of Education and by the Federal Ministry for Education and Research (BMBF) under Grant 01IS14013A-E and Grant 01GQ1115. Correspondence to KRM.

References

  • [1] Simon Carter, Wouter Weerkamp, and Manos Tsagkias. Microblog language identification: Overcoming the limitations of short, unedited and idiomatic text. Lang. Resour. Eval., 47(1):195–215, March 2013.
  • [2] Gene Golovchinsky and Miles Efron. Making sense of twitter search. In Proc. CHI2010 Workshop on Microblogging: What and How Can We Learn From It? April 11, 2010.
  • [3] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. Earthquake shakes twitter users: Real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web, WWW ’10, pages 851–860, New York, NY, USA, 2010. ACM.
  • [4] Sarah Vieweg, Amanda L. Hughes, Kate Starbird, and Leysia Palen. Microblogging during two natural hazards events: What twitter may contribute to situational awareness. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’10, pages 1079–1088, New York, NY, USA, 2010. ACM.
  • [5] D. L. Altheide. Qualitative Media Analysis (Qualitative Research Methods). Sage Pubn Inc, 1996.
  • [6] Bernard J. Jansen, Mimi Zhang, Kate Sobel, and Abdur Chowdury. Twitter power: Tweets as electronic word of mouth. J. Am. Soc. Inf. Sci. Technol., 60(11):2169–2188, November 2009.
  • [7] A. Tumasjan, T.O. Sprenger, P.G. Sandner, and I.M. Welpe. Predicting elections with twitter: What 140 characters reveal about political sentiment. In Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media, pages 178–185, 2010.
  • [8] Andrei Oghina, Mathias Breuss, Manos Tsagkias, and Maarten de Rijke. Predicting imdb movie ratings using social media. In Proceedings of the 34th European Conference on Advances in Information Retrieval, ECIR’12, 2012.
  • [9] William B. Cavnar and John M. Trenkle. N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 161–175, 1994.
  • [10] Dick Sites. Compact language detector 2. https://github.com/CLD2Owners/cld2, 2014.
  • [11] Marco Lui and Timothy Baldwin. Langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, ACL ’12, pages 25–30, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
  • [12] Ted Dunning. Statistical identification of language. Technical report, 1994.
  • [13] Marc Damashek. Gauging similarity with n-grams: Language-independent categorization of text. Science, 267(5199):843–849, 1995.
  • [14] Javed A. Aslam and Meredith Frost. An information-theoretic measure for document similarity. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval, SIGIR ’03, pages 449–450, New York, NY, USA, 2003. ACM.
  • [15] Bruno Martins and Mário J. Silva. Language identification in web pages. In Proceedings of the 2005 ACM Symposium on Applied Computing, SAC ’05, pages 764–768, New York, NY, USA, 2005. ACM.
  • [16] O Teytaud and Radwan Jalam. Kernel based text categorization. In 12th International Joint Conference on Neural Networks (IJCNN), Washington, US, pages 1892–1897. IEEE, 2001.
  • [17] Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. Text classification using string kernels. J. Mach. Learn. Res., 2:419–444, March 2002.
  • [18] Canasai Kruengkrai, Prapass Srichaivattana, Virach Sornlertlamvanich, and Hitoshi Isahara. Language identification based on string kernels. In Proceedings of the 5th International Symposium on Communications and Information Technologies (ISCIT 2005), pages 896–899, 2005.
  • [19] Ignacio Lopez-Moreno, Javier Gonzalez-Dominguez, and Oldrich Plchot. Automatic language identification using deep neural networks. In Proc. ICASSP, 2014.
  • [20] Ruben Zazo, Alicia Lozano-Diez, Javier Gonzalez-Dominguez, Doroteo T. Toledano, and Joaquin Gonzalez-Rodriguez. Language identification in short utterances using long short-term memory (LSTM) recurrent neural networks. PLoS ONE, 11(1):1–17, 01 2016.
  • [21] Timothy Baldwin and Marco Lui. Language identification: The long and the short of the matter. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT ’10, pages 229–237, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
  • [22] Delia Mocanu, Andrea Baronchelli, Bruno Gonçalves, Nicola Perra, and Alessandro Vespignani. The twitter of babel: Mapping world languages through microblogging platforms. CoRR, abs/1212.5238, 2012.
  • [23] F. Pedregosa et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [24] Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, ACL ’96, pages 310–318, Stroudsburg, PA, USA, 1996. Association for Computational Linguistics.
  • [25] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res., 2:265–292, March 2002.
  • [26] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995.
  • [27] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–202, 2001.
  • [28] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A practical guide to support vector classification. Technical report, Department of Computer Science, National Taiwan University, 2003.
  • [29] Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning, ECML ’98, pages 137–142, London, UK, 1998. Springer-Verlag.
  • [30] Mikio L. Braun, Joachim M. Buhmann, and Klaus-Robert Müller. On relevant dimensions in kernel feature spaces. Journal of Machine Learning Research, 9:1875–1908, 2008.
  • [31] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
  • [32] M. Finke, J. Fritsch, P. Geutner, K. Ries, T. Zeppenfeld, and A. Waibel. The janus-rtk switchboard/callhome 1997 evaluation system. In Proceedings of LVCSR Hub 5-E Workshop, Baltimore.
  • [33] David Opitz and Richard Maclin. Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11:169–198, 1999.
  • [34] Lior Rokach. Ensemble-based classifiers. Artif. Intell. Rev., 33(1-2):1–39, February 2010.
  • [35] John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.