Comparative Studies of Detecting Abusive Language on Twitter
The context-dependent nature of online aggression makes annotating large collections of data extremely difficult. Previously studied datasets in abusive language detection have been too small to train deep learning models effectively. Recently, Hate and Abusive Speech on Twitter, a dataset much greater in size and reliability, has been released. However, this dataset has not yet been studied comprehensively. In this paper, we conduct the first comparative study of various learning models on Hate and Abusive Speech on Twitter, and discuss the possibility of using additional features and context data for improvements. Experimental results show that bidirectional GRU networks trained on word-level features, with Latent Topic Clustering modules, are the most accurate models, scoring 0.805 F1.
Abusive language refers to any type of insult, vulgarity, or profanity that debases the target; it also can be anything that causes aggravation Spertus (1997); Schmidt and Wiegand (2017). Abusive language is often reframed as, but not limited to, offensive language Razavi et al. (2010), cyberbullying Xu et al. (2012), othering language Burnap and Williams (2014), and hate speech Djuric et al. (2015).
Recently, an increasing number of users have been subjected to harassment, or have witnessed offensive behaviors online Duggan (2017). Major social media companies (e.g. Facebook, Twitter) have utilized multiple resources (artificial intelligence, human reviewers, user reporting processes, etc.) in an effort to censor offensive language, yet it seems nearly impossible to successfully resolve the issue Robertson (2017); Musaddique (2017).
The main reason abusive language detection fails is the subjective and context-dependent nature of abusive language Chatzakou et al. (2017). For instance, a message can be regarded as harmless on its own, but when taking previous threads into account it may be seen as abusive, and vice versa. This aspect makes detecting abusive language extremely laborious even for human annotators; therefore it is difficult to build a large and reliable dataset Founta et al. (2018).
This quantity is not sufficient to train the large number of parameters in deep learning models. For this reason, these datasets have mainly been studied with traditional machine learning methods. Most recently, Founta et al. (2018) introduced Hate and Abusive Speech on Twitter, a dataset containing 100K tweets with cross-validated labels. Although the size of this corpus gives it great potential for training deep models, there are no baseline reports to date.
This paper investigates the efficacy of different learning models in detecting abusive language. We compare accuracy using the most frequently studied machine learning classifiers as well as recent neural network models (the code can be found at: https://github.com/younggns/comparative-abusive-lang). Reliable baseline results are presented with the first comparative study on this dataset. Additionally, we demonstrate the effect of different features and variants, and describe the possibility for further improvements with the use of ensemble models.
used Support Vector Machine (SVM), both with word-level features to classify offensive language. Xiang et al. (2012) generated topic distributions with Latent Dirichlet Allocation Blei et al. (2003), also using word-level features in order to classify offensive tweets.
More recently, distributed word representations and neural network models have been widely applied for abusive language detection. Djuric et al. (2015) used the Continuous Bag Of Words model with the paragraph2vec algorithm Le and Mikolov (2014) to detect hate speech more accurately than plain Bag Of Words models. Badjatiya et al. (2017) implemented Gradient Boosted Decision Trees classifiers using word representations trained by deep learning models. Other researchers have investigated character-level representations and their effectiveness compared to word-level representations Mehdad and Tetreault (2016); Park and Fung (2017).
As traditional machine learning methods have relied on feature engineering (e.g. n-grams, POS tags, user information) Schmidt and Wiegand (2017), researchers have proposed neural-based models with the advent of larger datasets. Convolutional Neural Networks and Recurrent Neural Networks have been applied to detect abusive language, and they have outperformed traditional machine learning classifiers such as Logistic Regression and SVM Park and Fung (2017); Badjatiya et al. (2017). However, there are no studies investigating the efficiency of neural models with large-scale datasets of over 100K tweets.
This section describes in detail our implementations of traditional machine learning classifiers and neural network based models. Furthermore, we describe the additional features and variant models investigated.
We implement five feature engineering based machine learning classifiers that are most often used for abusive language detection.
In data preprocessing, text sequences are converted into Bag Of Words (BOW) representations, and normalized with Term Frequency-Inverse Document Frequency (TF-IDF) values.
We experiment with word-level features using n-grams ranging from 1 to 3, and character-level features from 3 to 8-grams.
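The feature extraction described above can be sketched in plain Python. This is only an illustrative sketch: the helper names (`word_ngrams`, `char_ngrams`, `tfidf`), whitespace tokenization, lower-casing, and the smoothed IDF formula are assumptions, since the text does not state which TF-IDF variant or tokenizer was used.

```python
import math
from collections import Counter


def word_ngrams(text, n_min=1, n_max=3):
    """Word n-grams for n in [n_min, n_max] (whitespace tokenization is an assumption)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n_min=3, n_max=8):
    """Character n-grams for n in [n_min, n_max]."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def tfidf(docs, analyzer):
    """Return one {feature: weight} dict per document.

    Assumed weighting: raw term count times a smoothed IDF,
    log((1 + N) / (1 + df)) + 1. The paper does not specify the variant.
    """
    counts = [Counter(analyzer(doc)) for doc in docs]
    doc_freq = Counter(feature for c in counts for feature in c)
    n_docs = len(docs)
    return [
        {f: tf * (math.log((1 + n_docs) / (1 + doc_freq[f])) + 1.0)
         for f, tf in c.items()}
        for c in counts
    ]
```

Passing `word_ngrams` or `char_ngrams` as the `analyzer` yields the word-level or character-level Bag Of Words weights, respectively; features that occur in every document receive the lowest IDF.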
Each classifier is implemented with the following specifications:
Naïve Bayes (NB): Multinomial NB with additive smoothing constant 1
Logistic Regression (LR): Linear LR with L2 regularization constant 1 and limited-memory BFGS optimization
Support Vector Machine (SVM): Linear SVM with L2 regularization constant 1 and logistic loss function
Along with traditional machine learning approaches, we investigate neural network based models to evaluate their efficacy within a larger dataset.
In particular, we explore Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and their variant models.
A pre-trained GloVe Pennington et al. (2014) representation is used for word-level features.
CNN: We adopt the implementation of Kim (2014) as the baseline. The word-level CNN models have 3 convolutional filters of different sizes [1,2,3] with ReLU activation, and a max-pooling layer. For the character-level CNN, we use 6 convolutional filters of various sizes [3,4,5,6,7,8], then add max-pooling layers followed by 1 fully-connected layer with a dimension of 1024.
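To make the convolution and pooling step concrete, the following is a minimal sketch of how one filter of each width could turn a sequence of word vectors into a fixed-size feature vector. It is not the authors' implementation: the function names are hypothetical, only a single filter per width is shown, and the weights are passed in by hand rather than learned.

```python
def conv_relu_maxpool(embeddings, kernel, bias=0.0):
    """One 1-D convolutional filter, then ReLU and max-over-time pooling.

    embeddings: list of T word vectors (each a list of D floats)
    kernel:     list of k weight vectors (each D floats); k is the filter width
    """
    k = len(kernel)
    activations = []
    for start in range(len(embeddings) - k + 1):
        window = embeddings[start:start + k]
        total = bias + sum(w * x
                           for w_vec, x_vec in zip(kernel, window)
                           for w, x in zip(w_vec, x_vec))
        activations.append(max(0.0, total))  # ReLU
    # Assumption: sequences shorter than the filter width contribute 0.0.
    return max(activations) if activations else 0.0


def cnn_features(embeddings, kernels):
    """Concatenate the pooled output of every filter, e.g. widths 1, 2 and 3."""
    return [conv_relu_maxpool(embeddings, kernel) for kernel in kernels]
```

Because max-over-time pooling keeps one value per filter, the resulting vector has the same length for tweets of any length, which is what allows it to be fed to a fully-connected output layer.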
Park and Fung (2017) proposed a HybridCNN model which outperformed both word-level and character-level CNNs in abusive language detection. In order to evaluate the HybridCNN for this dataset, we concatenate the output of max-pooled layers from word-level and character-level CNN, and feed this vector to a fully-connected layer in order to predict the output.
All three CNN models (word-level, character-level, and hybrid) use cross entropy with softmax as their loss function and Adam Kingma and Ba (2014) as the optimizer.
cell for each recurrent unit. From extensive parameter-search experiments, we chose 1 encoding layer with 50 dimensional hidden states and an input dropout probability of 0.3. The RNN models use cross entropy with sigmoid as their loss function and Adam as the optimizer.
As a possible improvement, we apply a self-matching attention mechanism on RNN baseline models Wang et al. (2017) so that they may better understand the data by retrieving text sequences twice. We also investigate a recently introduced method, Latent Topic Clustering (LTC) Yoon et al. (2018). The LTC method extracts latent topic information from the hidden states of the RNN, and uses it as additional information in classifying the text data.
While manually analyzing the raw dataset, we noticed that looking at the tweet a user has replied to or quoted provides significant contextual information. We call these “context tweets”. As humans can better understand a tweet with reference to its context, our assumption is that computers also benefit from taking context tweets into account in detecting abusive language.
As shown in the examples below, (2) is labeled abusive due to the use of vulgar language.
However, the intention of the user can be better understood with its context tweet (1).
(1) I hate when I’m sitting in front of the bus and somebody with a wheelchair get on.
(2) I hate it when I’m trying to board a bus and there’s already an as**ole on it.
Similarly, context tweet (3) is important in understanding the abusive tweet (4), especially in identifying the target of the malice.
(3) Survivors of #Syria Gas Attack Recount ‘a Cruel Scene’.
(4) Who the HELL is “LIKE” ING this post? Sick people….
Huang et al. (2016) used several attributes of context tweets for sentiment analysis in order to improve the baseline LSTM model. However, their approach was limited because the meta-information they focused on (author information, conversation type, use of the same hashtags or emojis) is highly dependent on data.
In order to avoid data dependency, text sequences of context tweets are directly used as an additional feature of neural network models. We use the same baseline model to convert context tweets to vectors, then concatenate these vectors with outputs of their corresponding labeled tweets. More specifically, we concatenate max-pooled layers of context and labeled tweets for the CNN baseline model. As for RNN, the last hidden states of context and labeled tweets are concatenated.
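The concatenation described above can be sketched as follows. The function name is hypothetical, and the zero-vector fallback for tweets without a context tweet is an assumption; the text does not say how such tweets are handled.

```python
def combine_with_context(labeled_vec, context_vec=None):
    """Concatenate the encoding of a labeled tweet with that of its context tweet.

    For the CNN baseline both inputs would be max-pooled feature vectors;
    for the RNN baseline they would be last hidden states.
    """
    if context_vec is None:
        # Assumption: a zero vector stands in for a missing context tweet.
        context_vec = [0.0] * len(labeled_vec)
    return list(labeled_vec) + list(context_vec)
```

The classifier layer that follows therefore sees an input twice as wide as in the baseline model, with the second half carrying the context.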
Hate and Abusive Speech on Twitter Founta et al. (2018) classifies tweets into 4 labels: “normal”, “spam”, “hateful” and “abusive”. We were only able to crawl 70,904 tweets out of 99,996 tweet IDs, mainly because tweets had been deleted or user accounts had been suspended. Table 1 shows the distribution of labels of the crawled data.
In the data preprocessing steps, user IDs, URLs, and frequently used emojis are replaced with special tokens. Since hashtags tend to have a high correlation with the content of the tweet Lehmann et al. (2012), we use a segmentation library (WordSegment module description page: https://pypi.org/project/wordsegment/) Segaran and Hammerbacher (2009) for hashtags to extract more information.
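A minimal sketch of the token replacement step, using only regular expressions, might look like the following. The token names (`<url>`, `<user>`, `<hashtag>`) and the patterns are assumptions, emoji replacement is omitted, and hashtag bodies are kept whole here rather than split by a segmentation library.

```python
import re

# Hypothetical patterns; the paper does not give the exact rules it used.
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")


def preprocess(tweet):
    """Replace URLs and user IDs with special tokens and mark hashtags."""
    tweet = URL_RE.sub("<url>", tweet)      # URLs first, so their text is not re-matched
    tweet = USER_RE.sub("<user>", tweet)
    # Keep the hashtag body; a word segmenter would split it into words next.
    tweet = HASHTAG_RE.sub(r"<hashtag> \1", tweet)
    return tweet.lower().split()
```

In a fuller pipeline, the hashtag body left after `<hashtag>` would be handed to the segmentation library so that a tag such as "SyriaGasAttack" contributes separate words.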
For character-level representations, we apply the method proposed by Zhang et al. (2015). Tweets are transformed into one-hot encoded vectors using 70 character dimensions: 26 lowercase letters, 10 digits, and 34 special characters including whitespace.
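The one-hot encoding above can be illustrated with a short sketch. The exact set of 34 special characters is an assumption chosen to reach 70 symbols in total, and the fixed length of 140 characters is likewise assumed; neither is stated in the text.

```python
import string

# Assumed alphabet: 26 letters + 10 digits + 34 special characters (incl. whitespace).
SPECIALS = " \n" + "-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{}"
ALPHABET = string.ascii_lowercase + string.digits + SPECIALS
assert len(ALPHABET) == 70 and len(set(ALPHABET)) == 70
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot_encode(text, max_len=140):
    """Return a max_len x 70 matrix of 0/1 values (lists of lists).

    Characters outside the alphabet and padding positions become all-zero rows.
    """
    rows = []
    for ch in text.lower()[:max_len]:
        row = [0] * len(ALPHABET)
        idx = CHAR_INDEX.get(ch)
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    rows.extend([0] * len(ALPHABET) for _ in range(max_len - len(rows)))
    return rows
```

Each tweet thus becomes a fixed-size matrix that a character-level CNN or RNN can consume directly, without any word-level vocabulary.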
In training the feature engineering based machine learning classifiers, we truncate vector representations according to the TF-IDF values (the top 14,000 and 53,000 for word-level and character-level representations, respectively) to avoid overfitting. For neural network models, words that appear only once are replaced with unknown tokens.
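The unknown-token replacement for neural network models can be sketched as below. The function name and the `<unk>` token string are assumptions; counting is done over whatever corpus is passed in.

```python
from collections import Counter


def replace_rare_words(tokenized_tweets, min_count=2, unk="<unk>"):
    """Replace every word seen fewer than min_count times in the corpus with unk."""
    counts = Counter(tok for tweet in tokenized_tweets for tok in tweet)
    return [[tok if counts[tok] >= min_count else unk for tok in tweet]
            for tweet in tokenized_tweets]
```

With the default `min_count=2`, exactly the words that appear only once are mapped to the unknown token, which keeps the vocabulary (and the embedding table) from growing with one-off spellings.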
Since the dataset used is not split into train, development, and test sets, we perform 10-fold cross validation and average the results over 5 tries; we randomly divide the dataset into train, development, and test sets by a ratio of 85:5:10, respectively. In order to evaluate the overall performance, we calculate the weighted average of precision, recall, and F1 scores of all four labels, “normal”, “spam”, “hateful”, and “abusive”.
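As a sketch of the evaluation metric, the weighted averages can be computed by weighting each label's score by its share of the true labels. The function name is hypothetical, and weighting by support is an assumption about what "weighted average" means here, although it is the common convention.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Support-weighted average of per-label precision, recall and F1."""
    support = Counter(y_true)
    total = len(y_true)
    w_precision = w_recall = w_f1 = 0.0
    for label in sorted(support):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support[label]
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = support[label] / total  # share of true examples with this label
        w_precision += weight * precision
        w_recall += weight * recall
        w_f1 += weight * f1
    return w_precision, w_recall, w_f1
```

Under this weighting, frequent labels such as “normal” dominate the overall score, so per-label results remain necessary for judging performance on rare labels such as “hateful”.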
As shown in Table 2, neural network models are more accurate than feature engineering based models (e.g. NB, SVM), with the exception of the LR model: the best LR model has the same F1 score as the best CNN model.
Among traditional machine learning models, the most accurate in classifying abusive language is the LR model followed by ensemble models such as GBT and RF. Character-level representations improve F1 scores of SVM and RF classifiers, but they have no positive effect on other models.
For neural network models, the RNN with LTC modules has the highest accuracy score, but its improvement over the baseline model and the attention-added model is not significant. Similarly, HybridCNN does not improve on the baseline CNN model. For both CNN and RNN models, character-level features significantly decrease the accuracy of classification.
The use of context tweets generally has little effect on baseline models; however, it noticeably improves the scores of several metrics. For instance, the CNN with context tweets scores the highest recall and F1 for “hateful” labels, and RNN models with context tweets have the highest recall for “abusive” tweets.
While character-level features are known to improve the accuracy of neural network models Badjatiya et al. (2017), they reduce classification accuracy for Hate and Abusive Speech on Twitter. We conclude this is because of the lack of labeled data as well as the significant imbalance among the different labels. Unlike neural network models, character-level features in traditional machine learning classifiers have positive results because we have trained the models only with the most significant character elements using TF-IDF values.
Variants of neural network models also suffer from data insufficiency. However, these models perform well on “spam” (14%) and “hateful” (4%) tweets, the less frequent labels. The highest F1 score for “spam” is from the RNN-LTC model (0.551), and the highest for “hateful” is from the CNN with context tweets (0.309). Since each variant model excels in different metrics, we expect to see additional improvements with the use of ensemble models of these variants in future work.
In this paper, we report the baseline accuracy of different learning models as well as their variants on the recently introduced dataset, Hate and Abusive Speech on Twitter. Experimental results show that bidirectional GRU networks with LTC provide the most accurate results in detecting abusive language. Additionally, we present the possibility of using ensemble models of variant models and features for further improvements.
K. Jung is with the Department of Electrical and Computer Engineering, ASRI, Seoul National University, Seoul, Korea. This work was supported by the National Research Foundation of Korea (NRF) funded by the Korea government (MSIT) (No. 2016M3C4A7952632), the Technology Innovation Program (10073144) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).
We would also like to thank Yongkeun Hwang and Ji Ho Park for helpful discussions and their valuable insights.
Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.