Transfer Learning for Hate Speech Detection in Social Media

06/10/2019 ∙ by Marian-Andrei Rizoiu, et al. ∙ 0

In today's society, more and more people are connected to the Internet, and its information and communication technologies have become an essential part of everyday life. Unfortunately, the flip side of this increased connectivity to social media and other online content is cyber-bullying and cyber-hatred, among other harmful and anti-social behaviors. Models based on machine learning and natural language processing provide a way to detect this hate speech in web text in order to make discussion forums and other media and platforms safer. The main difficulty, however, is annotating a sufficiently large number of examples to train these models. In this paper, we report on developing automated text analytics methods capable of jointly learning a single representation of hate from several smaller, unrelated data sets. We train and test our methods on a total of 37,520 English tweets that have been annotated for differentiating harmless messages from racist or sexist content in the first detection task, and from hateful or offensive content in the second detection task. Our most sophisticated method combines a deep neural network architecture with transfer learning. It is capable of creating word and sentence embeddings that are specific to these tasks while also embedding the meaning of generic hate speech. It achieves a macro-averaged F1 of 78% and 72% in the first and second task, respectively. This method enables generating an interpretable two-dimensional text visualization, called the Map of Hate, that is capable of separating different types of hate speech and explaining what makes text harmful. These methods and insights hold potential not only for safer social media, but also for a reduced need to expose human moderators and annotators to distressing online messaging.




1 Introduction

Ubiquitous access to the Internet has brought profound change to our lifestyle: information and online social interactions are at our fingertips. However, it also brings new challenges, such as the unprecedented liberalization of hate speech. Hate speech is defined as “public speech that expresses hate or encourages violence towards a person or group based on something such as race, religion, sex, or sexual orientation” (footnote 1, last accessed on 15 May 2019). The proliferation of online hate speech is suspected to be an important culprit in creating the conditions for political violence and in exacerbating ethnic violence, such as the Rohingya crisis in Myanmar [Reuters2018]. Considerable pressure is mounting on social media platforms to detect and eliminate hate speech in a timely manner, alongside cyber-bullying and offensive content [Zhang, Robinson, and Tepper2018].

In this work, we address three open questions relating to the detection of hateful content (i.e., hate speech, racist, offensive, or sexist content). The first question relates to constructing a general detection system for textual hate content. There is a considerable amount of work on detecting hate speech [Cheng, Danescu-Niculescu-Mizil, and Leskovec2015, Waseem2016, Chatzakou et al.2017, Wulczyn, Thain, and Dixon2017, Davidson et al.2017, Fehn Unsvåg and Gambäck2018]. However, most of this work relies on hand-crafted features, user information, or platform-specific metadata, which limits its generalization to new data sets and data sources. The first question is: can we design a “general purpose” hate embedding and detection system, which does not rely on expensive hand-crafted features, but which is capable of adapting to a particular learning task? The second question relates to data availability. Recently, publicly available hateful speech data sets have started to appear. However, these are most often small in scale and not representative of the entire spectrum of hateful content. The question is: can we leverage multiple smaller, unrelated datasets to learn jointly, and transfer information between apparently unrelated learning tasks? The third question relates to the interpretation and analysis of hate speech: can we construct a tool for separating types of hate speech, and for characterizing what makes particular language hateful?

This paper addresses the above three open questions by leveraging two unrelated hate speech data sets. We address the first two open questions by proposing a novel neural network transfer learning pipeline. We use state-of-the-art pre-trained word embeddings such as Embeddings from Language Models (ELMo) [Peters et al.2018], which we adapt to the current learning tasks using a bidirectional Long Short-Term Memory (bi-LSTM) [Hochreiter and Schmidhuber1997] unit. This creates a single representation space capable of successfully embedding hateful content for multiple learning tasks. We show that the system is capable of transferring knowledge from one task to another, thus boosting performance. The system operates solely on the analyzed text, and it is therefore platform and source independent. Given a new learning task, the hateful embeddings trained on other tasks can be directly applied, or, if labeled data is available for the new task, it can be used to contextualize the embeddings.

We address the third open question by building the Map of Hate, a two-dimensional representation of the hateful embeddings described above. We show that the map is capable of separating classes of hateful content, and of revealing the impact of jointly leveraging multiple data sets.

The three main contributions of this work are as follows:

  1. We assemble DeepHate — a deep neural network architecture — capable of creating task-specific word and sentence embeddings. This allows a higher performance in hate speech detection.

  2. We propose t-DeepHate, which connects the architecture with transfer learning methods that allow leveraging several smaller, unrelated data sets to train an embedding capable of representing “general purpose” hate speech.

  3. We introduce the Map of Hate – an interpretable 2D visualization of hateful content, capable of separating different types of hateful content and explaining what makes text hateful.

The rest of the paper is organized as follows: In Section 2, we provide background to hate speech detection, hate speech mapping, and transfer learning to explain how this work contributes to the existing body of knowledge. In Section 3, we describe our proposed hate speech detection models, including both the deep neural network architecture and its transfer learning augmentation, which allows training and making predictions on multiple data sets and multiple learning problems. In Section 4, we explain and justify the materials, evaluation methods, and additional text processing details in our experiments. In Sections 5 and 6, we present our main findings and conclude the study, respectively.

2 Background

In this section, we present in their respective subsections a survey of related work on hate speech detection, hate speech mapping, and transfer learning. This connects our study to the existing body of knowledge and also serves as our computational motivation.

2.1 Related Work on Hate Speech Detection

Hate speech detection has been a very active research field in the late 2010s. Earlier approaches in 2015–2017 were based on simpler classifiers and hard-coded features. For example, Waseem and Hovy (2016) used a Logistic Regression model with character level features to classify tweets – short messages from Twitter, a major social media platform. Davidson et al. (2017) also used this modeling method (i.e., Logistic Regression) for tweet classification, but with word level features, part-of-speech, sentiment, and some meta-data associated with the tweets. User features (e.g., number of friends, followers, gender, geographic location, anonymity status, active vs. not-active status, among others) have also been shown to be useful in identifying aggressive and anti-social behaviour [Cheng, Danescu-Niculescu-Mizil, and Leskovec2015, Waseem2016, Chatzakou et al.2017, Wulczyn, Thain, and Dixon2017]. However, Fehn Unsvåg and Gambäck (2018) have shown that user features only slightly improve the classifier’s ability to detect hate speech, when tested on three Twitter data sets with a Logistic Regression model. Another drawback is that these user features are often limited or unavailable.

The most recent approaches, as illustrated below by three 2017–2018 papers, have investigated the usefulness of neural models with the intention of reducing the feature engineering overhead. First, Park and Fung (2017) have used a neural approach consisting of two binary classifiers: a Convolutional Neural Network (CNN) with word and character level embeddings for predicting abusive speech, and a Logistic Regression classifier with n-gram features for discriminating between different types of abusive speech (i.e., racism, sexism, or both). Second, Zhang, Robinson, and Tepper (2018) have applied pre-trained word embeddings and CNNs with Gated Recurrent Units (GRUs) to the modeling of long dependencies between features. Third, Founta et al. (2018) have built two neural classifiers: one for textual content and another one for user features. Their experiments conclude that when the networks are trained jointly, the overall performance increases.

This work proposes a neural model, which extends the prior literature in two significant ways. First, it uses a bi-LSTM to adapt pre-trained embeddings to the hate speech domain. Second, it employs a transfer learning setup to construct a single set of hate speech embeddings while leveraging multiple datasets.

Figure 1: The conceptual scheme of DeepHate, our hate language detection deep neural network architecture. The text of tweets is processed through two pre-trained units (shown in blue background): the Pre-processing unit and the ELMo embedding (Embeddings from Language Model) unit. The result is a general-purpose numerical embedding for each token in the input. These embeddings are passed through three more units (shown in white background): the bi-directional LSTM (Long Short-Term Memory unit), the Max-pooling unit, and the classification unit. The output of the chain is the hate prediction for the input text.

2.2 Related Work on Hate Speech Mapping

In line with our own work to map hate speech, some research has been devoted to interpreting the results produced by neural network-based models. Park and Fung (2017) have clustered the vocabulary of the Waseem (2016) data set using the fine-tuned embedding from their model and found that the clusters clearly grouped sexist, racist, harassing, etc. words. Wang (2018) has presented the following three methods for interpretability: 1) iterative partial occlusion (i.e., masking input words) to study the network's sensitivity to the input length; 2) its opposite problem, called lack of localization, in which the model is insensitive to any region of the input; and 3) maximum activations of the final max pooling layer of a CNN-GRU network to identify the lexical units that contribute to the classification. According to these results, long inputs are more difficult to classify and not all the maximally activated units are hateful.

To overcome the limitations of small data sets on sexist speech detection, Sharifirad et al. (2018) have applied text augmentation (i.e., increasing the length of the instances), and text generation (i.e., increasing the size of the data set by adding new instances) with some success. With the same goal, Sharifirad et al. (2018) have studied the effect of using word embeddings learned from data sets containing abusive language and compared it with lexicon-based features. Their experiments have shown that a Logistic Regression classifier with abusive word embeddings trained with a couple of hundred training instances is able to outperform the same classifier with lexicon-based features on full data sets.

Finally, Karan and Šnajder (2018) have applied the frustratingly easy domain adaptation (FEDA) framework [Daume III2007] to hate speech detection. This method joins two data sets A and B from different domains by copying the features of each instance three times: 1) an unaltered copy, shared by both domains; 2) an A-specific copy, which is 0 for all instances not from A; and 3) a B-specific copy, which is 0 for all instances not from B. The study has tested the use of Support Vector Machines (SVM) for classification, concluding that domain adaptation boosts the classifier’s performance significantly in six of the nine tested cases (or data sets).
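The three-copy feature layout used by FEDA can be illustrated with a minimal sketch. The function name `feda_augment`, the domain labels "A" and "B", and the toy feature vectors are assumptions made for illustration; this is not the code used in the cited study.

```python
def feda_augment(features, domain):
    """Return the FEDA-style augmented copy of one feature vector.

    The output is the concatenation of three blocks:
    [shared copy, A-specific copy, B-specific copy].
    The block of the other domain is filled with zeros.
    """
    zeros = [0.0] * len(features)
    if domain == "A":
        return list(features) + list(features) + zeros
    if domain == "B":
        return list(features) + zeros + list(features)
    raise ValueError(f"unknown domain: {domain!r}")


# Toy instances (hypothetical values), one from each domain.
x_a = [1.0, 2.0]
x_b = [3.0, 4.0]

aug_a = feda_augment(x_a, "A")
aug_b = feda_augment(x_b, "B")
print(aug_a)  # [1.0, 2.0, 1.0, 2.0, 0.0, 0.0]
print(aug_b)  # [3.0, 4.0, 0.0, 0.0, 3.0, 4.0]
```

A single classifier trained on the augmented vectors can then put weight on the shared block for patterns common to both domains, and on a domain-specific block for patterns that only hold in one of them.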

2.3 Transfer Learning as Domain Adaptation

Transfer learning is the idea of utilizing features, weights, or otherwise defined knowledge acquired for one task to solve another related problem. Transfer learning has been extensively used for domain adaptation and building models to solve problems where only limited data is available to train, validate, and evaluate the outcomes [Pan and Yang2010]. To the best of our knowledge, transfer learning has never been applied to the problem of hate speech detection.

Formally, transfer learning involves the concepts of domains and learning tasks. Given a source domain $\mathcal{D}_S$ and source learning task $\mathcal{T}_S$, a target domain $\mathcal{D}_T$ and target learning task $\mathcal{T}_T$, transfer learning aims to make a contribution (i.e., improvement) to the learning of the target predictive function $f_T(\cdot)$ in $\mathcal{D}_T$ using the knowledge in $\mathcal{D}_S$ and $\mathcal{T}_S$, where $\mathcal{D}_S \neq \mathcal{D}_T$, or $\mathcal{T}_S \neq \mathcal{T}_T$ [Pan and Yang2010].

In our experiments, the label distribution of the two tasks is different, but related, since the two data sets we used are annotated for analyzing different types of hate speech. Hence, $\mathcal{Y}_S \neq \mathcal{Y}_T$, where $\mathcal{Y}_S$ and $\mathcal{Y}_T$ are the classes to be learned in the source and target tasks. As explained in Section 3.2, our transfer learning model shares the weights of the features across different classifiers, which is beneficial when only limited training data is available.

3 Model

In this section, we describe our proposed hate speech detection models. First, in Section 3.1, we present the proposed deep neural network architecture, which takes as input the raw text of tweets and learns to predict whether the text is hateful, and its hate category. Then, in Section 3.2, we augment the architecture to allow training and making predictions on multiple data sets and multiple learning problems.

Figure 2: The conceptual scheme of t-DeepHate – our transfer learning architecture. Given two datasets of tweets (the red and the blue dataset), the text of their tweets is processed through a shared pipeline (consisting of the pre-processing unit, ELMo, the bi-LSTM and the max-pooling components) and a task-specific component (the Hate classification). This ensures that the representation space constructed at the output of the max-pooling unit is adequate for both learning tasks.

3.1 The Hate Speech Detection Pipeline

Fig. 1 shows the conceptual schema of our model DeepHate, which contains the five units detailed here below.

The pre-processing unit. Compared to text data gathered from other sources, Twitter data sets tend to be noisier. Typically, they contain a substantially larger number of misspellings, non-standard abbreviations, Internet-related symbols and slang, as well as other irregularities. In order to achieve better classification results, we pre-process the textual data prior to training our learners. During the pre-processing step, we remove repetitive punctuation, redundant white spaces, emojis, as well as Uniform Resource Locators (URLs). We add one space before every remaining punctuation mark. Note that we neither apply stemming nor remove stopwords. This pre-processing results in cleaner text than the original noisy input, in which words and punctuation are clearly separated by a single white space.
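The pre-processing steps listed above can be sketched with regular expressions. This is only an illustration under assumptions: the punctuation set, the emoji code-point ranges, and the order of the steps are chosen here for the example and are not taken from the paper.

```python
import re

# Assumed patterns (not from the paper).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji coverage: common pictograph and symbol blocks only.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# A run of the same punctuation mark, e.g. "!!!" or "??".
REPEATED_PUNCT_RE = re.compile(r"([!?.,;:])\1+")
# A punctuation mark together with any white space in front of it.
PUNCT_RE = re.compile(r"\s*([!?.,;:])")
WHITESPACE_RE = re.compile(r"\s+")


def preprocess(text):
    """Clean one tweet: drop URLs and emojis, collapse repeated
    punctuation, put one space before punctuation, squeeze white space."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = REPEATED_PUNCT_RE.sub(r"\1", text)
    text = PUNCT_RE.sub(r" \1", text)
    text = WHITESPACE_RE.sub(" ", text)
    return text.strip()


print(preprocess("Check this out!!!   http://t.co/xyz so   cool, right??"))
# Check this out ! so cool , right ?
```

The sketch deliberately keeps stopwords and word forms untouched, matching the statement that no stemming or stopword removal is applied. A narrow punctuation set like this one also splits decimals such as "3.5", so a real pipeline would likely need a more careful tokenizer.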

The ELMo unit. In order to process natural language text, we map each English term to a numerical vector representation, dubbed the word representation. Training word representations typically requires far more textual data than our available hate speech data sets contain. Therefore, we opt to start from pre-trained word embedding models, among which we select the aforementioned ELMo. ELMo is itself a Neural Network (NN) model, which takes as its input a sentence, and outputs a vector representation for each word in the sentence. Peters et al. (2018) show that ELMo’s predictive performance improves when it is used in conjunction with another (pre-trained) word embedding model. Here, we use the Global Vectors (GloVe) embedding [Pennington, Socher, and Manning2014], with a 200-dimensional embedding pre-trained on a data set of two billion tweets. GloVe constructs a lookup table between the words observed in the training set and their pre-trained representations. However, whenever we use embeddings trained on other data sets, we run the risk of encountering previously unseen words, which do not have a corresponding embedding. ELMo addresses this issue by training a NN to encode words at the character level. When an unseen word is encountered, ELMo uses this NN to construct the vector representation of the word starting from its spelling.

Polysemy – i.e., the fact that a word may have multiple possible meanings – is another problem for word embedding methods that employ look-up tables, as each word can only have one entry in the table, and precisely one representation. ELMo addresses polysemy by first “reading” through the whole sentence, then adjusting each word representation according to its context. This means that the same word may have different representations in different sentences. In this work, we employ the pre-trained ELMo 5.5B (footnote 2, last accessed on 15 May 2019) together with the Twitter-trained GloVe embeddings (footnote 3, last accessed on 15 May 2019).

The bi-LSTM. The ELMo model described above is pre-trained for general purposes, and consequently its constructed embeddings may have limited usefulness for the hate speech detection application (as shown in Section 5.3). Thus, we add a bi-LSTM layer with randomly initialized weights to adapt the ELMo representation to the hate speech detection domain. The bi-LSTM module scans each sentence twice: the forward scan is from left to right, and the backward scan is from right to left. This scanning produces two task-specific word representations for each word in the sentence, i.e., one from each scan. We have also tried Gated Recurrent Units (GRUs) [Cho et al.2014], but given their lower prediction performance, we only present the LSTM results in the rest of this paper.

The max-pooling unit. The next unit of the pipeline in Fig. 1 is the max-pooling layer, which constructs the embedding of an entire sentence starting from word representations. It takes as input the numerical representation of each word of the sentence and constructs a fixed-length vector by taking the maximum value for each dimension of the word representations. This produces a sentence representation defined in the same high-dimensional space as the word embeddings. We have also implemented an attention mechanism [Vaswani et al.2017], but given its lower prediction performance, we only present the max-pooling results in the rest of this paper.
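The dimension-wise maximum described above can be illustrated with a minimal sketch in plain Python. The function name `max_pool` and the toy three-dimensional word vectors are assumptions for illustration; in the actual pipeline the operation would run on the bi-LSTM outputs.

```python
def max_pool(word_vectors):
    """Dimension-wise maximum over a list of equal-length word vectors.

    Returns one vector with the same number of dimensions as each word
    vector, regardless of how many words the sentence contains.
    """
    if not word_vectors:
        raise ValueError("need at least one word vector")
    return [max(column) for column in zip(*word_vectors)]


# Toy sentence of three words, each with a 3-dimensional representation.
sentence = [
    [0.1, 0.9, -0.3],
    [0.7, 0.2, -0.1],
    [0.4, 0.5, -0.8],
]
sentence_vector = max_pool(sentence)
print(sentence_vector)  # [0.7, 0.9, -0.1]
```

Because the output length depends only on the embedding size, sentences of different lengths map to vectors of the same length, which is what the downstream classifier requires.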

The hate classification unit. The last unit of the pipeline is a differentiable classifier, which inputs the previously constructed sentence representation and outputs the final prediction, which is used to calculate the loss according to the ground truth, and to train the model. The weights of all the trainable modules in Fig. 1 (i.e., the Bi-LSTM, the Max-pooling and the Hate classifier) are trained end-to-end via back-propagation.

3.2 Transfer Learning Setup

In the transfer learning setup, we address two learning tasks simultaneously — that is, predicting the type of hate speech in two unrelated data sets. Intuitively, jointly solving both tasks would allow the insights learned from one task to be transferred to the other task.

A mix of shared and individual processing units. Fig. 2 shows t-DeepHate, the transfer learning schema that we have developed for leveraging multiple data sets and for solving multiple learning problems. In the transfer learning schema, we first mix data from different data sets together and we send the mixed data batch through the training pipeline. The pre-processing and the ELMo embedding units are the same as in the non-transfer settings, both having pre-trained weights. The bi-LSTM and the max-pooling units are also shared (i.e., they process text from multiple data sets, and they are trainable via back-propagation). When the tweet representation is obtained, we separate the data from different data sets, and we feed it into classifiers dedicated to each task. This ensures that the final predictions for each data set are made independently of one another, as the prediction targets may be different for each task.
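The control flow of this schema (one shared pass over a mixed batch, followed by routing to a per-task classifier) can be sketched as follows. The sketch involves no learning: `shared_encoder` and `make_head` are toy stand-ins invented for this example, and only the routing logic is meant to mirror the description above.

```python
import random

# Label sets of the two tasks, as listed in Table 1.
LABELS = {
    "waseem": ["racist", "sexist", "harmless"],
    "davidson": ["hateful", "offensive", "harmless"],
}


def shared_encoder(text):
    """Toy stand-in for the shared units (pre-processing, ELMo,
    bi-LSTM, max-pooling); returns a tiny integer 'representation'."""
    tokens = text.lower().split()
    return [len(tokens), sum(len(token) for token in tokens)]


def make_head(labels):
    """Toy stand-in for a task-specific classifier over one label set."""
    def head(representation):
        return labels[sum(representation) % len(labels)]
    return head


HEADS = {name: make_head(labels) for name, labels in LABELS.items()}


def mixed_batch(datasets, batch_size, rng):
    """Draw one batch of (dataset_name, text) pairs from the pooled data."""
    pool = [(name, text) for name, texts in datasets.items() for text in texts]
    return rng.sample(pool, batch_size)


def forward(batch):
    # 1) Shared pass: every tweet goes through the same encoder.
    encoded = [(name, shared_encoder(text)) for name, text in batch]
    # 2) Task-specific pass: route each representation to its own classifier.
    predictions = {name: [] for name in HEADS}
    for name, representation in encoded:
        predictions[name].append(HEADS[name](representation))
    return predictions


datasets = {
    "waseem": ["first toy tweet", "second toy tweet here"],
    "davidson": ["another toy tweet", "one more"],
}
batch = mixed_batch(datasets, batch_size=4, rng=random.Random(0))
predictions = forward(batch)
print(predictions)
```

In a trainable version, the loss of each head would be back-propagated through the shared encoder, so both data sets would shape the same representation space while each prediction stays within its own label set.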

A more comprehensive hate representation. As visible from Fig. 2, apart from the last unit, the entire processing pipeline is shared among all learning tasks. As a result, we learn a single, more comprehensive word representation and, consequently, a tweet representation containing the features required by all the different tasks. By assuming that there are common features for all types of hate speech, multiple small data sets can be used together to train a larger model without over-fitting. This joint learning of more comprehensive word representations forms the main underlying hypothesis of our Map of Hate (a two-dimensional visualization of hateful language), which we introduce and illustrate in Section 5.3. In other words, our transfer learning model enables building a single, more general representation that is useful both for prediction over multiple data sets and for visualizing the results as one two-dimensional plot.

4 Experimental Setup

This section presents the two datasets employed in this work (Section 4.1), and our hate speech prediction setup, the baselines, and some technical details relating to the implementation of our models (Section 4.2).

4.1 Datasets

The Waseem data set [Waseem2016] is publicly available, and it consists of 15,216 instances from Twitter that were annotated as Racist, Sexist or Harmless. As shown in Table 1, this data set is very imbalanced, with the majority class being Harmless, which is meant to reflect a real-world scenario where hate speech is less frequent than neutral tweets.

There are some questions about the quality of labeling in the Waseem data set, namely the quantity of false positives, considering that the data set was compiled to quantify the agreement between expert and amateur raters. This is acknowledged by Waseem (2016), who observes that “the main cause of error are false positives”. Here are some examples of such sexism false positives:

  • @FarOutAkhtar How can I promote gender
    equality without sounding preachy or being
    a ‘‘feminazi’’? #AskFarhan 
  • Yes except the study @Liberal_fem (the
    Artist Formerly known as Mich_something)
    offered’s author says it does NOT prove
    bias @TamedInsanity 
  • In light of the monster derailment that
    is #BlameOneNotAll here are some mood
    capturing pics for my feminist pals 

The Davidson data set [Davidson et al.2017] is also publicly available. It consists of 22,304 instances from Twitter annotated as Hate, Offensive and Harmless. This data set was compiled by searching for tweets using the lexicon from

Data set    Classes                         #instances
Waseem      Racist, Sexist, Harmless        15,216
Davidson    Hateful, Offensive, Harmless    22,304

Table 1: Data sets for hate speech detection on Twitter in English

4.2 Hate speech prediction

Prediction setup. We predict the types of hate speech (hate, offensive and harmless for Davidson, and racist, sexist and harmless for Waseem) using four classifiers: the Davidson and the Waseem baselines, and our approaches DeepHate and t-DeepHate. Given the imbalance between classes in both datasets, we over-sample the smaller classes to obtain a balanced dataset. We also tried under-sampling the larger classes, but we obtained lower results. Each classifier is trained on a fixed portion of each dataset, and tested on the remainder. The procedure is repeated 10 times, and we report the mean and standard deviation. t-DeepHate is trained on the training portions of both datasets simultaneously; the others are trained on each dataset individually. We evaluate the prediction performance using the F1 measure, which is the harmonic mean of precision and recall, i.e., a classifier needs to obtain high precision and high recall simultaneously to achieve a high F1. We average the F1 over all classes in each dataset (i.e., we compute the macro-F1), so that smaller classes are equally represented in the final score.
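The two ingredients of this setup, over-sampling the smaller classes and scoring with macro-F1, can be sketched in plain Python. The function names, the toy labels, and the choice of sampling with replacement up to the size of the largest class are assumptions for illustration; the paper does not specify these details.

```python
import random
from collections import Counter


def oversample(examples, labels, rng):
    """Resample the smaller classes (with replacement) until every class
    has as many examples as the largest one."""
    by_class = {}
    for example, label in zip(examples, labels):
        by_class.setdefault(label, []).append(example)
    target = max(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((item, label) for item in items + extra)
    rng.shuffle(balanced)
    return balanced


def f1_for_class(y_true, y_pred, label):
    """F1 of one class: harmonic mean of its precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy imbalanced training data.
balanced = oversample(["a", "b", "c", "d"],
                      ["harmless", "harmless", "harmless", "hate"],
                      random.Random(0))
class_sizes = Counter(label for _, label in balanced)

# Toy predictions: one hateful tweet is mistaken for an offensive one.
y_true = ["hate", "hate", "offensive", "offensive", "harmless", "harmless"]
y_pred = ["hate", "offensive", "offensive", "offensive", "harmless", "harmless"]
score = macro_f1(y_true, y_pred)
print(class_sizes["harmless"], class_sizes["hate"])  # 3 3
print(round(score, 4))  # 0.8222
```

In the toy example the per-class F1 scores are 2/3 (hate), 4/5 (offensive) and 1 (harmless), so the macro average is 37/45. Each class contributes equally to the average, however few examples it has.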

Baselines. We compared our proposed method with two baselines: Waseem and Hovy (2016) and Davidson et al. (2017). Waseem and Hovy (2016) applied pre-processing similar to ours, removing punctuation, excess white-spaces, URLs, and stop words, and applying lower-casing and the Porter stemmer for removing morphological and inflexional endings from words in English. They used a Logistic Regression model with character n-grams of lengths up to four as features. Davidson et al. (2017) applied the same pre-processing as Waseem and Hovy (2016) and also used a Logistic Regression model. They include several word level features, such as 1-3 word n-grams weighted with TF-IDF and 1-3 Part-of-Speech n-grams. They also include tweet level features, such as readability scores taken from a modified version of the Flesch-Kincaid Grade Level and Flesch Reading Ease scores (where the number of sentences is fixed to one), and sentiment scores derived from a sentiment lexicon designed for social media [Hutto and Gilbert2014]. In addition, they include binary and count indicator features for hashtags, mentions, retweets, URLs, and the number of words, characters and syllables in each tweet. Since they do not carry out an ablation study, it is unclear which features are most predictive for hate speech.
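To make the character-level baseline features concrete, the following sketch counts all character n-grams of lengths one to four in a string. It only illustrates the feature type; the function name `char_ngrams` is an assumption, and this is not the baseline authors' implementation.

```python
from collections import Counter


def char_ngrams(text, max_n=4):
    """Count every character n-gram of length 1..max_n in the text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


features = char_ngrams("hate")
print(sorted(features))
# ['a', 'at', 'ate', 'e', 'h', 'ha', 'hat', 'hate', 't', 'te']
```

Such counts form a sparse feature vector per tweet, which a linear model like Logistic Regression can consume directly. Character n-grams are also fairly robust to the misspellings common in tweets.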

Figure 6: Prediction performances on two datasets: Davidson and Waseem. (a) Boxplot summarizing macro-F1 score for each dataset, and each approach (baseline, DeepHate, t-DeepHate). Red diamonds and values indicate mean F1. (b)(c) Confusion matrix for one prediction made by the baseline (b) and by t-DeepHate (c) on the Davidson dataset.

Deep learners implementation details. Our proposed methods were implemented in PyTorch [Paszke et al.2017]. We use GloVe with 200 dimensions to initialize the word representations used by ELMo. The ELMo embeddings have 4096 dimensions. We use a 2-layer stacked bi-LSTM with a hidden vector size of 512 dimensions. During training, we use the Adam optimizer with weight decay and a fixed initial learning rate. As the error function, we use the Cross Entropy Loss. We train the classifiers for a fixed number of epochs with a fixed batch size, and we test the model performance at regular epoch intervals. The most important learning parameters (i.e., the hidden state vector size in the bi-LSTM unit, and the batch size of the learning process) were tuned via Grid Search on the validation set (detailed in the appendix).

5 Main findings

In this section, we explore the capabilities and the limits of our proposed hate speech detection pipelines – DeepHate and t-DeepHate. In Section 5.1, we present their prediction performances, particularly when little and increasing volumes of labeled data are available. In Section 5.2, we explore examples of correctly and incorrectly predicted hate speech alongside the indicators of each decision. Finally, in Section 5.3, we construct the Map of Hate, a two-dimensional visualization of hateful and harmless tweets, and we explain the effects of jointly building a single tweet representation space and task-specific representations.

Figure 9: Prediction performances with limited amounts of training data on the Davidson (a) and the Waseem datasets (b). The x-axis shows the percentage of the training set used for training, and the y-axis shows the macro-F1 measure. Each bar shows the mean value over 10 runs, and the standard deviation.
Figure 10: Warning: this figure contains real-world examples of offensive language! Automatically highlighting offensive terms. Examples of six tweets from the Davidson and Waseem datasets, together with their predicted and observed hate category. The top three tweets are hateful (sexist, offensive and racist, respectively) and their category was correctly predicted. The bottom three tweets are harmless, but they were incorrectly predicted as hateful. The color map shows how many times a word’s representation was selected for the tweet representation, normalized by the size of the embedding (here 512).

5.1 Hate speech prediction performance

Predict hate speech in two datasets. We follow the setup outlined in Section 4.2, and we measure the prediction performances of DeepHate and t-DeepHate against the two baselines. Fig. 6(a) shows as boxplots the prediction performances for each classifier and each dataset.

We make several observations. Firstly, our models DeepHate and t-DeepHate outperform the baselines on the Waseem dataset, and they under-perform them on the Davidson dataset. We posit that this is due to the external information – statistics, user information and tweet metadata – employed by the baselines. We chose not to use this additional information in our approaches, as it would render the obtained results applicable solely to the Twitter data source. Our aim is to build a hate speech detection system which is not reliant on any information external to the analyzed text. Figs. 6(b) and 6(c) outline the difficulty of the problem by showing the confusion matrices obtained on Davidson, by the baseline and by DeepHate. Visibly, the Hate class is confused with Offensive in 41% of the cases. By manually inspecting some failed predictions, we notice that it is particularly difficult, even for a human, to differentiate between purposely hateful language (with a particular target in mind) and generally offensive texts (usually without a target).

Secondly, we observe that the Davidson baseline outperforms the Waseem baseline on both the Davidson and Waseem datasets. This shows that the external features built by Davidson are more informative for Twitter-originating hate speech than Waseem’s. Thirdly, we observe that t-DeepHate outperforms DeepHate on both datasets, admittedly not by a large margin. Note that for these predictions, the learners were trained on all the available training data of each dataset. Next, we investigate the performances of our approaches when only limited data is available.

Predict hate speech using limited amounts of data. The advantage of jointly leveraging multiple datasets emerges when only limited amounts of labeled data are available. Here, we restrain the amount of training data: after sampling the training set and the testing set as in Section 4.2, we further subsample the training set so that only a percentage of it is available for the model training. For t-DeepHate, we perform this additional subsampling for only one of the datasets, and we keep all the training examples of the other dataset. In doing so, we re-enact the typical situation in which a hate speech classifier must be learned with very limited amounts of labeled data, but with the help of a larger unrelated hate speech dataset.

We vary the percentage of training data in the downsampled dataset, and we show in Figs. 9(a) and 9(b) the mean prediction performance and its standard deviation on the Davidson and Waseem datasets, respectively. Visibly, the performances of DeepHate exhibit a large variation, even when a large share of the training set is observed. In comparison, t-DeepHate is considerably more stable and it consistently outperforms DeepHate, showcasing the importance of building text embeddings from both the larger and the down-sampled datasets jointly. Notably, the Davidson baseline appears particularly strong: when trained on only a small fraction of the training dataset, it achieves a macro-F1 score close to the one it obtains when trained on the entire training set. This shows that the user information and the tweets’ non-textual metadata are particularly indicative of hateful content.

Figure 15: The Map of Hate constructed on the Davidson dataset (a)(b) and the Waseem dataset (c)(d), by using the tweet embeddings generated by DeepHate (a)(c) and t-DeepHate (b)(d).
Figure 19: The Map of Hate constructed by t-DeepHate jointly on of the Davidson and Waseem datasets. Crosses and circles show incorrectly and correctly predicted examples, respectively. (a) All the six classes in the two datasets. (b) The Harmless classes of the two datasets overlap. (c) The hateful classes: Hate and Offensive (Davidson), Racism and Sexism (Waseem).

5.2 Highlighting hateful content

Compute word contributions to predictions. The max-pooling is a dimension-wise max operation, i.e. for each dimension it selects the maximum value over all word embeddings in the forward and backward passes of the Bi-LSTM. Assume that each word in the sentence is represented by a numerical vector with a fixed number of dimensions. For each dimension, the max-pooling picks the maximum value across the words. We count the number of times each word is selected to represent the whole sentence across the dimensions, and we normalize these counts by the number of dimensions (a word can be picked at most once per dimension). This yields a score between zero (the word is never selected) and one (the word is always selected). The higher the score of a word, the more its representation contributes to the final sentence representation, and to the hate prediction.
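The scoring scheme above can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the authors' code: `word_contributions` is a hypothetical helper, the hidden states are tiny made-up vectors, and ties are resolved in favor of the earliest word.

```python
def word_contributions(hidden_states):
    """Score each word by how often max-pooling selects it.

    hidden_states: one list of floats per word, all of the same length
    (e.g. concatenated forward and backward Bi-LSTM outputs).
    Returns one score in [0, 1] per word; the scores sum to 1.
    """
    n_words = len(hidden_states)
    n_dims = len(hidden_states[0])
    counts = [0] * n_words
    for d in range(n_dims):
        # Index of the word holding the maximum value in dimension d
        # (the first such word if several are tied).
        winner = max(range(n_words), key=lambda w: hidden_states[w][d])
        counts[winner] += 1
    # Normalize by the number of dimensions.
    return [c / n_dims for c in counts]


# Three words, four dimensions (toy values).
states = [
    [0.1, 0.9, 0.2, 0.0],
    [0.8, 0.3, 0.7, 0.5],
    [0.2, 0.1, 0.1, 0.4],
]
scores = word_contributions(states)
print(scores)  # [0.25, 0.75, 0.0]
```

Here the second word wins three of the four dimensions and would be highlighted most strongly, while the third word never reaches the pooled sentence representation.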

Correctly and incorrectly predicted examples. Fig. 10 shows six examples of tweets in our datasets, with each word highlighted according to its score (Warning! This figure contains real-world examples of offensive language). The top three examples are correctly predicted as sexist, offensive and racist, respectively. We observe that most tweets labeled as sexism (Waseem dataset) start with “I’m not a sexist, but …” and variations. While this might raise questions about data sampling bias in the construction of the data set, this behavior is captured in the higher weights assigned to “but” (ending of “I’m not a sexist, but”), “woman” and (strangely) “RT” (i.e. retweet). In the offensive example (Davidson dataset), we notice that offensive slang words are correctly scored higher. We also notice that racism examples (such as the third tweet in Fig. 10) tend to refer exclusively to the Islamic religion and its followers, which can bias the learned embedding. However, we observe that words such as “destructive” are correctly recognized as indicators of hate speech for this example. The bottom three lines in Fig. 10 show examples where harmless tweets are incorrectly labeled as hateful. This happens mainly due to their writing style and choice of words. The tweet misclassified as sexism uses language similar to sexism to raise awareness against it, and as a result, it is itself classified as sexist. Similarly, the incorrectly labeled offensive example is written in a style similar to other offensive tweets in our dataset. Finally, the falsely racist tweet uses Islam-related terminology, and because of the data sampling bias, it is classified as racist.

Figure 23: The impact of building task-specific word embeddings. (a) The prediction performance (mean macro-F1 and standard deviation) of DeepHate with general-purpose and with task-specific embeddings. (b)(c) The Map of Hate constructed using general-purpose embeddings (b) and using task-specific embeddings (c).

5.3 The Map of Hate

Construction of the Map of Hate. To understand the impact of the different modeling choices, we construct the Map of Hate – a two-dimensional visualization of the space of hateful text built using t-SNE [Maaten and Hinton2008], a technique originally designed to visualize high-dimensional data. Given the tweet representations built by DeepHate and t-DeepHate (the output of the max-pooling units in Figs. 2 and 1, respectively), t-SNE builds a mapping to a 2D space in which the pairwise Euclidean distances correspond to the distances between pairs of tweet representations in the high-dimensional space. Figs. (c)c and (a)a visualize the Map of Hate constructed by DeepHate on a sample of of the tweets in the Davidson and Waseem data sets, respectively. We observe that the tweets of different classes in Davidson appear clustered more closely together than in Waseem. Noticeably, the racist and sexist tweets in Waseem appear scattered throughout the harmless tweets. This seems to be indicative of the false positives discussed in Section 4.1.
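As a reminder of what the mapping optimizes, the standard t-SNE formulation (Maaten and Hinton 2008) is summarized below. The notation is ours and not taken from this paper: x_i denotes the high-dimensional tweet representations, y_i the 2D map points, N the number of tweets, and σ_i a per-point bandwidth set from the chosen perplexity.

```latex
% Neighbor probabilities in the high-dimensional space
p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

% Similarities in the 2D map, using a heavy-tailed Student-t kernel
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Objective minimized over the map points y_1, \dots, y_N
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Because the objective matches neighborhood similarities rather than raw distances, tight clusters in the map are more trustworthy than the absolute distances between far-apart groups, which is worth keeping in mind when reading the maps discussed here.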

The effects of transfer learning. Figs. (d)d and (b)b show the tweet sample in Davidson and Waseem data sets, respectively, projected in the joint space constructed by t-DeepHate. We observe that the Davidson tweets appear to lose their clustering, and break down into subgroups. This explains the slightly lower prediction results obtained by t-DeepHate on Davidson. The situation is reversed on Waseem, where tweets belonging to different hate categories are clustered more tightly together – explaining the better performances on this data set. It appears the quality of representation on Davidson is slightly negatively impacted by labeling quality in Waseem. However, the prediction on Waseem benefits from a large increase – the new space is more adequate to represent its tweets thanks to the Davidson data set.

Fig. (a)a visualizes in the same figure the tweets from both Davidson and Waseem, projected into the t-DeepHate space (an interactive version of this map is available online). The circles represent the tweets correctly predicted by t-DeepHate, and the crosses those incorrectly predicted. Fig. (b)b further details only the harmless tweets from both datasets, and Fig. (c)c the hateful tweets from both datasets (racism and sexism from Waseem, and hate and offensive from Davidson). Several conclusions emerge. Firstly, the hateful and harmless content appears separated in the joint space (bottom-left for hateful and top-right for harmless). Secondly, the harmless tweets from both datasets overlap (Fig. (b)b), which is correct and intuitive since both classes stand for the same type of content. Thirdly, the hateful content (Fig. (c)c) has a more complex dynamic: the offensive (Davidson) tweets occupy most of the space, while most of the sexist and racist (Waseem) tweets appear tightly clustered on the sides. This is indicative of the sampling bias in Waseem. Lastly, the sexist and racist (Waseem) tweets sprinkled throughout the harmless area appear overwhelmingly misclassified, which can be explained by the false positives in Waseem.

Task-specific word embeddings. Here, we explore the impact of building task-specific word embeddings, i.e. using the bi-directional LSTM to adapt the ELMo word embeddings. Fig. (a)a shows the prediction performance of DeepHate on the Davidson and Waseem data sets, with the general-purpose ELMo word embeddings and with the task-specific embeddings. We observe that the task-specific embeddings provide a consistent performance increase for both datasets. Fig. (b)b depicts the Map of Hate constructed by DeepHate using general-purpose embeddings, and Fig. (c)c the one constructed using task-specific embeddings. Visibly, when using general-purpose embeddings, the tweets belonging to different classes do not appear clearly separated, whereas they appear clustered when using task-specific embeddings. This highlights the impact of performing domain adaptation for textual embedding.

6 Conclusion

With our social interactions and information increasingly moving online, more and more emphasis is placed on identifying and resolving the issues that the Internet raises for our society. To illustrate the immense worldwide popularity of going online as part of everyday life to consume or produce content: according to the “Household Use of Information Technology” survey for 2016–2017 by the Australian Bureau of Statistics (ABS; last accessed on 15 May 2019), web usage has grown and stabilized at almost of Australian households having access to the Internet (up to for those households that have children aged under 15 years) and also almost of Australian people aged 15 years or over. The top online activities are entertainment and social networking, followed by Internet banking. Unfortunately, the same survey also gives evidence of the unwanted, harmful, and anti-social flip side of the increased connectivity to social media and other online content. Namely, in 2016–2017, and of the participating web-connected households with children aged 5–14 reported that a child had been exposed to harmful content and subject to cyber-bullying, respectively.

Consequently, it is of utmost importance to make social media safer by detecting and reducing hateful, offensive, or otherwise unwanted social interactions. In this paper, we have considered machine learning and natural language processing as a way to differentiate harmless tweets from racist, sexist, hateful, or offensive messages on Twitter. More specifically, we have trained and tested a deep neural network architecture, without and with transfer learning, using cross-validation on a total of 37,520 English tweets. Our most sophisticated method is capable of creating word and sentence embeddings that are specific to these racism, sexism, hatred, and offensive detection tasks while also leveraging several smaller, unrelated data sets to embed the meaning of generic hate speech. Its prediction correctness, measured as the macro-averaged F1, ranges from 72% to 78% in these detection tasks. The method enables visualizing text in an interpretable fashion in two dimensions.

Our methods are key to analysing social media content at scale in order to make these web platforms safer and to better understand the genre of hate speech and its sub-genres. Our automated text processing and visualization methods are capable of separating different types of hate speech and explaining what makes text harmful. Their use could even reduce the need to expose human moderators and annotators to distressing messaging on social media platforms.


  • [Chatzakou et al.2017] Chatzakou, D.; Kourtellis, N.; Blackburn, J.; De Cristofaro, E.; Stringhini, G.; and Vakali, A. 2017. Mean birds: Detecting aggression and bullying on Twitter. In Proceedings of the 2017 ACM on Web Science Conference, WebSci ’17, 13–22.
  • [Cheng, Danescu-Niculescu-Mizil, and Leskovec2015] Cheng, J.; Danescu-Niculescu-Mizil, C.; and Leskovec, J. 2015. Antisocial behavior in online discussion communities. In Proceedings of the 9th International Conference on Web and Social Media, 61–70.
  • [Cho et al.2014] Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Empirical Methods in Natural Language Processing (EMNLP), 1724–1734.
  • [Daume III2007] Daume III, H. 2007. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, 256–263.
  • [Davidson et al.2017] Davidson, T.; Warmsley, D.; Macy, M. W.; and Weber, I. 2017. Automated hate speech detection and the problem of offensive language. In International Conference on Weblogs and Social Media.
  • [Fehn Unsvåg and Gambäck2018] Fehn Unsvåg, E., and Gambäck, B. 2018. The effects of user features on Twitter hate speech detection. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), 75–85.
  • [Founta et al.2018] Founta, A.-M.; Chatzakou, D.; Kourtellis, N.; Blackburn, J.; Vakali, A.; and Leontiadis, I. 2018. A unified deep learning architecture for abuse detection. Computing Research Repository (CoRR) abs/1802.00385.
  • [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9:1735–80.
  • [Hutto and Gilbert2014] Hutto, C. J., and Gilbert, E. 2014. Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Adar, E.; Resnick, P.; Choudhury, M. D.; Hogan, B.; and Oh, A. H., eds., ICWSM.
  • [Karan and Šnajder2018] Karan, M., and Šnajder, J. 2018. Cross-domain detection of abusive language online. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), 132–137.
  • [Maaten and Hinton2008] Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-sne. JMLR 9(Nov):2579–2605.
  • [online supplement2019] online supplement. 2019. Appendix: Transfer Learning for Hate Speech Detection in Social Media.
  • [Pan and Yang2010] Pan, S. J., and Yang, Q. 2010. A survey on transfer learning. Trans. on Knowledge and Data Eng. 22(10):1345–1359.
  • [Park and Fung2017] Park, J. H., and Fung, P. 2017. One-step and two-step classification for abusive language detection on Twitter. In Proceedings of the First Workshop on Abusive Language Online, 41–45.
  • [Paszke et al.2017] Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in pytorch. In NIPS-W.
  • [Pennington, Socher, and Manning2014] Pennington, J.; Socher, R.; and Manning, C. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.
  • [Peters et al.2018] Peters, M.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2227–2237.
  • [Reuters2018] Reuters. 2018. Why facebook is losing the war on hate speech in myanmar. (Accessed on 16/05/2019).
  • [Sharifirad, Jafarpour, and Matwin2018] Sharifirad, S.; Jafarpour, B.; and Matwin, S. 2018. Boosting text classification performance on sexist tweets by text augmentation and text generation using a combination of knowledge graphs. In Workshop on Abusive Language Online (ALW2), 107–114.
  • [Vaswani et al.2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in neural information processing systems, 5998–6008.
  • [Wang2018] Wang, C. 2018. Interpreting neural network hate speech classifiers. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), 86–92.
  • [Waseem and Hovy2016] Waseem, Z., and Hovy, D. 2016. Hateful symbols or hateful people? predictive features for hate speech detection on twitter. In NAACL Student Research Workshop, 88–93.
  • [Waseem2016] Waseem, Z. 2016. Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In Workshop on NLP and Computational Social Science, 138–142.
  • [Wulczyn, Thain, and Dixon2017] Wulczyn, E.; Thain, N.; and Dixon, L. 2017. Ex machina: Personal attacks seen at scale. In World Wide Web, WWW ’17, 1391–1399.
  • [Zhang, Robinson, and Tepper2018] Zhang, Z.; Robinson, D.; and Tepper, J. 2018. Detecting hate speech on Twitter using a convolution-GRU based deep neural network. In Gangemi, A.; Navigli, R.; Vidal, M.; Hitzler, P.; Troncy, R.; Hollink, L.; Tordai, A.; and Alam, M., eds., ESWC 2018: The Semantic Web, 745–760.


This document accompanies the submission Transfer Learning for Hate Speech Detection in Social Media. The information in this document complements the submission, and it is presented here for completeness. It is not required for understanding the main paper, nor for reproducing the results.

Appendix A Supplemental figures

This section presents the supplemental figures mentioned in the main text. Fig. 24 shows the prediction performances of t-DeepHate on the increasing partial training set – other dataset. The barplot shows the prediction performances on the complete dataset. For example, the bar for Davidson at shows the prediction performance on the Davidson dataset when t-DeepHate was trained on of the training data in Waseem, and on the complete training data of the Davidson dataset. For each dataset, the complete training set is of the dataset, and testing is performed on of the dataset.

Fig. (a)a shows the results of the grid search for the optimal size of the hidden state vector. Fig. (b)b shows the results of the grid search for the optimal learning batch size.

Figure 24: t-DeepHate: prediction performances on the increasing partial training set – other dataset.
Figure 27: Grid search for optimal hyper-parameter values: the hidden state size of the bi-LSTM (a) and the learning batch size (b).