Nowadays, hate speech is becoming a pressing issue and occurs in multiple domains, mostly in the major social media platforms or political speeches. Hate speech is defined as verbal communication that denigrates a person or a community on some characteristics such as race, color, ethnicity, gender, sexual orientation, nationality, or religion (Nockleby2000; davidson2017automated). Some examples given by schmidt-wiegand-2017-survey are:
Go fucking kill yourself and die already a useless ugly pile of shit scumbag.
The Jew Faggot Behind The Financial Collapse.
Hope one of those bitches falls over and breaks her leg.
Several sensitive comments on social media platforms have led to crime against minorities (Williams2020). Hate speech can be considered an umbrella term for phenomena that different authors have given different names. xu-etal-2012-learning; Hosseinmardi2015; Zhong referred to it as cyberbullying, while davidson2017automated used the term offensive language for expressions that are strongly impolite, rude, or vulgar towards an individual or group, and that can even ignite fights or be hurtful. Use of words like f**k, n*gga, b*tch is common in social media comments, song lyrics, etc. Although these terms can be treated as obscene and inappropriate, some people also use them in non-hateful ways in different contexts (davidson2017automated). This makes it challenging for hate speech systems to distinguish between hate speech and offensive content. davidson2017automated tried to distinguish between the two classes in their Twitter dataset.
These days due to globalization and online media streaming services, we are exposed to different cultures across the world through movies. Thus, an analysis of the amount of hate and offensive content in the media that we consume daily could be helpful.
Two research questions guided our research:
RQ 1. What are the limitations of social media hate speech detection models to detect hate speech in movie subtitles?
RQ 2. How to build a hate and offensive speech classification model for movie subtitles?
To address the problem of hate speech detection in movies, we chose three different models. We have used the BERT (devlin-etal-2019-bert) model, due to the recent success in other NLP-related fields, a Bi-LSTM (Hochreiter1997) model to utilize the sequential nature of movie subtitles and a classic Bag of Words (BoW) model as a baseline system.
The paper is structured as follows: Section 2 gives an overview of the related work in this topic and Section 3 describes the research methodology and the annotation work, while in Section 4 we discuss the employed datasets and the pre-processing steps. Furthermore, Section 5 describes the implemented models while Section 6 presents the evaluation of the models, the qualitative analysis of the results and the annotation analysis followed by Section 7, which covers the threats to the validity of our research. Finally, we end with the conclusion in Section 8 and propose further work directions in Section 9.
2 Related Work
Some of the existing hate speech detection models classify comments as targeting certain commonly attacked communities, such as gay, black, and Muslim people, whereas in actuality, some of these comments did not have the intention of targeting a community (Borkan2019; Dixon2018). Mathew2021 introduced a benchmark dataset consisting of hate speech drawn from two social media platforms, Twitter and Gab. In the social media space, a key challenge is to distinguish hate speech from offensive text. Although the two might appear semantically similar, they have subtle differences. Therefore, Mathew2021 addressed the bias and interpretability aspects of hate speech and performed a three-class classification (i.e., hate, offensive, or normal). They reported a best macro-averaged F1-score of 68.7% with their BERT-HateXplain model. It is also one of the models that we use in our study, as it is one of the 'off-the-shelf' hate speech detection models that can easily be employed for the topic at hand.
Lexicon-based detection methods have low precision because they classify the messages based on the presence of particular hate speech-related terms, particularly those insulting, cursing, and bullying words. davidson2017automated used a crowdsourced hate speech lexicon to identify tweets with the occurrence of hate speech keywords to filter tweets. They then used crowdsourcing to label these tweets into three classes: hate speech, offensive language, and neither. In their dataset, the more generic racist and homophobic tweets were classified as hate speech, whereas the ones involving sexist and abusive words were classified as offensive. It is one of the datasets we have used in exploring transfer learning and model fine-tuning in our study.
Due to global events, hate speech also plagues online news platforms. In the news domain, context knowledge is required to identify hate speech. gao2017 conducted a study on a dataset prepared from user comments on news articles from the Fox News platform. It is the second dataset we have used to explore transfer learning from the news domain to movie subtitles in our study.
Several other authors have collected data from different online platforms and labeled them manually. Some of these data sources are: Twitter (Xiang2012; xu-etal-2012-learning), Instagram (Hosseinmardi2015; Zhong), Yahoo! (Nobata2016; Djuric2015), YouTube (Dinakar2012) and Whisper (silva2016analyzing), to name a few. Most of the data sources used in the previous studies are based on social media, news, and micro-blogging platforms. However, the existence of hate speech in movie dialogues has been overlooked. Thus in our study, we first explore how the different existing ML (Machine Learning) models classify hate and offensive speech in movie subtitles and propose a new dataset compiled from six movie subtitles.
3 Research Methodology
To investigate the problem of detecting hate and offensive speech in movies, we used different machine learning models trained on social media content such as tweets or discussion thread comments from news articles. Here, the models in our research were developed and evaluated on an in-domain 80% train and 20% test split data using the same random state to ensure comparability.
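The fixed-seed 80:20 split described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `train_test_split_fixed`, the seed value 42, and the toy comments are assumptions made for this sketch and are not taken from our experiments.

```python
import random

def train_test_split_fixed(samples, test_fraction=0.2, seed=42):
    """Shuffle a copy of `samples` with a fixed seed and split it 80:20.

    The seed value is a placeholder; what matters is that every model
    is trained and tested with the same value.
    """
    rng = random.Random(seed)      # fixed seed -> identical split on every run
    shuffled = list(samples)       # copy, so the caller's list stays untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Toy data standing in for (text, label) pairs from a social media corpus.
comments = [(f"comment {i}", "normal" if i % 3 else "hate") for i in range(10)]
train, test = train_test_split_fixed(comments)
print(len(train), len(test))
```

Because the generator is re-created from the same seed on every call, each of the three model types would see exactly the same train and test partitions, which is what makes their scores comparable.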
We have developed six different models: two Bi-LSTM models, two BoW models, and two BERT models. For each pair, one model has been trained on a dataset consisting of Twitter posts and the other on a dataset consisting of Fox News discussion threads. The trained models have been used to classify movie subtitles in order to evaluate their performance under domain adaptation from social media content to movies. In addition, another state-of-the-art BERT-based classification model called HateXplain (Mathew2021) has been used to classify the movies out of the box. While it is also possible to further fine-tune the HateXplain model, we restrict ourselves to reporting the results of the 'off-the-shelf' classification system on new domains, such as movie subtitles.
Furthermore, the movie dataset we have collected (see Section 4) is used to train domain-specific BoW, Bi-LSTM, and BERT models using 6-fold cross-validation, where each movie serves as one fold, and we report the averaged results. Finally, we identified the best model trained on social media content based on the macro-averaged F1-score and fine-tuned it on the movie dataset, again using 6-fold cross-validation, to investigate fine-tuning and transfer learning capabilities for hate speech in movie subtitles.
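The movie-wise cross-validation can be illustrated with a short sketch in which every fold holds out exactly one movie. The `(movie, text, label)` tuple layout, the function name `leave_one_movie_out`, and the toy movie names are assumptions made for this example only.

```python
from collections import defaultdict

def leave_one_movie_out(samples):
    """Yield (held_out_movie, train, test) with one movie per fold.

    `samples` is a list of (movie, text, label) tuples; this layout is
    an assumption made for the sketch.
    """
    by_movie = defaultdict(list)
    for sample in samples:
        by_movie[sample[0]].append(sample)
    for held_out in by_movie:                      # insertion order is preserved
        test = by_movie[held_out]
        train = [s for movie, group in by_movie.items()
                 if movie != held_out for s in group]
        yield held_out, train, test

# Toy data with three placeholder movies; the real dataset has six.
toy = [
    ("Movie A", "line 1", "normal"),
    ("Movie A", "line 2", "offensive"),
    ("Movie B", "line 3", "normal"),
    ("Movie C", "line 4", "hate"),
]
folds = list(leave_one_movie_out(toy))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

With six movies this yields six folds, and no subtitle of the held-out movie ever appears in the training portion of its fold.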
3.1 Annotation Guidelines
In our annotation guidelines, we defined hateful speech as a language used to express hatred towards a targeted individual or group or is intended to be derogatory, to humiliate, or to insult the members of the group, based on attributes such as race, religion, ethnic origin, sexual orientation, disability, or gender. Although the meaning of hate speech is based on the context, we provided the above definition agreeing to the definition provided by Nockleby2000; davidson2017automated. Offensive speech uses profanity, strongly impolite, rude, or vulgar language expressed with fighting or hurtful words to insult a targeted individual or group (davidson2017automated). We used the same definition also for offensive speech in the guidelines. The remaining subtitles were defined as normal.
3.2 Annotation Task
For the annotation of movie subtitles, we have used Amazon Mechanical Turk (MTurk) crowdsourcing. Before the main annotation task, we conducted an annotation pilot study, for which 40 subtitle texts were randomly chosen from the movie subtitle dataset. This sample included 10 hate speech, 10 offensive, and 20 normal subtitles that had been manually annotated by experts. In total, 100 MTurk workers were assigned to the annotation task. We used the built-in MTurk qualification requirements (HIT approval rate higher than 95% and number of HITs approved larger than 5000) to recruit workers during the pilot task. Each worker was assessed for accuracy, and the 13 workers who completed the task with the highest annotation accuracy were chosen for the main study task. The rest of the workers were compensated for the task they had completed in the pilot study and blocked from participating in the main annotation task. For each HIT, the workers were paid 40 cents, both for the pilot and the main annotation task.
For the main task, the 13 chosen MTurk workers were first assigned to one movie subtitle annotation to further examine the annotator agreement, as described in Section 6.3. Two annotators were replaced during the main annotation task with the next-best workers identified in the pilot study. This process was repeated after each movie annotation for the remaining five movies. One batch consists of 40 subtitles, which were displayed to the worker in chronological order. Each batch has been annotated by three workers. Figure 1 shows the first four questions of a batch from the movie American History X (AmericanHistoryX).
|Movie||Normal||Offensive||Hate||Total subtitles|
|American History X||0.83||0.13||0.03||1565|
|The Wolf of Wall Street||0.81||0.19||0.001||3063|
The publicly available Fox News corpus (https://github.com/sjtuprog/fox-news-comments) consists of 1,528 annotated comments compiled from ten discussion threads that happened on the Fox News website in 2016. The corpus does not differentiate between offensive and hateful comments. This corpus has been introduced by gao2017 and has been annotated by two trained native English speakers. We have identified 13 duplicates and two empty comments in this corpus and removed them for accurate training results. The second publicly available corpus we use consists of 24,802 tweets (https://github.com/t-davidson/hate-speech-and-offensive-language/). We identified 204 of them as duplicates and removed them, again to achieve accurate training results. The corpus has been introduced by davidson2017automated and was labeled by CrowdFlower workers as hate speech, offensive, and neither. The last class is referred to as normal in this paper. The distribution of the normal, offensive, and hate classes can be found in Table 1.
The novel movie dataset we introduce consists of six movies. The movies have been chosen based on keyword tags provided by the IMDB website (https://www.imdb.com/search/keyword/). The tags hate-speech and racism were chosen because we assumed that the corresponding movies were likely to contain a lot of hate and offensive speech. The tag friendship was chosen to obtain contrasting movies containing many normal subtitles and less hate speech content. In addition, we excluded movie genres like documentaries, fantasy, or musicals to keep the movies comparable to each other. Namely, we have chosen the movies BlacKkKlansman (BlacKkKlansman), which was tagged as hate-speech, Django Unchained (DjangoUnchained), American History X (AmericanHistoryX) and Pulp Fiction (PulpFiction), which were tagged as racism, whereas South Park (SouthPark) as well as The Wolf of Wall Street (WolfOfWallStreet) were tagged as friendship in December 2020. The detailed distribution of the normal, offensive, and hate classes, movie-wise, can be found in Table 2.
The goal of the pre-processing step was to make the text of the Tweets and conversational discussions as comparable as possible to the movie subtitles since we assume that this will improve the transfer learning results. Therefore, we did not use pre-processing techniques like stop word removal or lemmatization.
4.2 Data Cleansing
After performing a manual inspection, we applied certain rules to remove the textual noise from our datasets. For the Twitter and Fox News datasets, we removed the following noise: (1) repeated punctuation marks, (2) multiple username tags, (3) emoticon character encodings, and (4) website links. For the movie subtitle dataset, we removed: (1) sound expressions, e.g., [PEOPLE CHATTERING], [DOOR OPENING], (2) the name header of the current speaker, e.g., "DIANA: Hey, what’s up?", which indicates that Diana is about to speak, (3) HTML tags, (4) subtitles without any alphabetic characters, and (5) non-ASCII characters.
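The cleansing rules listed above can be approximated with a handful of regular expressions. The following is a minimal sketch, not our released pre-processing script: all patterns, the function names `clean_post` and `clean_subtitle`, and the decision to strip every username tag and numeric character entity are assumptions made for illustration.

```python
import re

# Patterns for social media text (assumed approximations of the rules).
URL = re.compile(r"https?://\S+")              # website links
USER = re.compile(r"@\w+")                     # username tags
ENTITY = re.compile(r"&#\d+;")                 # emoticon character encodings
REPEAT_PUNCT = re.compile(r"([!?.,])\1+")      # "!!!" -> "!"

# Patterns for subtitle text (assumed approximations of the rules).
HTML_TAG = re.compile(r"<[^>]+>")              # e.g. <i>...</i>
SOUND = re.compile(r"\[[^\]]*\]")              # e.g. [DOOR OPENING]
SPEAKER = re.compile(r"^[A-Z][A-Z .'-]*:\s*")  # e.g. "DIANA: "
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

def clean_post(text):
    """Remove links, username tags, entities, and repeated punctuation."""
    text = URL.sub("", text)
    text = USER.sub("", text)
    text = ENTITY.sub("", text)
    text = REPEAT_PUNCT.sub(r"\1", text)
    return re.sub(r"\s+", " ", text).strip()

def clean_subtitle(text):
    """Return the cleaned subtitle, or None if no letters remain."""
    text = HTML_TAG.sub("", text)
    text = SOUND.sub("", text)
    text = SPEAKER.sub("", text.strip())
    text = NON_ASCII.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text if re.search(r"[A-Za-z]", text) else None

print(clean_post("@a @b Wow!!! see http://t.co/x &#128514;"))
print(clean_subtitle("DIANA: Hey, what's up?"))
print(clean_subtitle("<i>[PEOPLE CHATTERING]</i>"))
```

Note that this sketch collapses an ellipsis to a single period and drops typographic apostrophes along with other non-ASCII characters; whether that is acceptable depends on the downstream tokenizer.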
4.3 Subtitle format conversion
The downloaded subtitle files are provided by the website www.opensubtitles.org (https://www.opensubtitles.org/) and are free to use for scientific purposes. The files are available in the SRT format (https://en.wikipedia.org/wiki/SubRip), in which each subtitle is stored together with the time frame during which it appears on the screen. We performed the following operations to create the movie dataset: (1) converted the SRT format to CSV format by separating start time, end time, and the subtitle text, (2) combined fragmented subtitles, i.e., subtitles that belong to one sentence but were split across multiple screen frames, by identifying sentence-ending punctuation marks, and (3) combined single-word subtitles with the previous subtitle, because single-word subtitles tend to be reactions to what has been said before.
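The three conversion steps can be sketched as follows. This is a simplified illustration under stated assumptions: the function names `parse_srt` and `merge_rows`, the set of sentence-ending marks (`.`, `!`, `?`), and the sample dialogue are hypothetical, and the released scripts may handle more edge cases.

```python
import csv
import io
import re

TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")

def parse_srt(srt_text):
    """Step 1: return [start, end, text] rows from SRT-formatted text."""
    rows = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                text = " ".join(part.strip() for part in lines[i + 1:])
                rows.append([match.group(1), match.group(2), text])
                break
    return rows

def merge_rows(rows):
    """Steps 2 and 3: join sentence fragments and single-word subtitles."""
    merged = []
    for start, end, text in rows:
        if merged:
            prev = merged[-1]
            prev_open = not prev[2].rstrip().endswith((".", "!", "?"))
            single_word = len(text.split()) == 1
            if prev_open or single_word:
                prev[1] = end                      # extend the time span
                prev[2] = f"{prev[2]} {text}"
                continue
        merged.append([start, end, text])
    return merged

SAMPLE = """1
00:00:01,000 --> 00:00:03,000
Hello there,

2
00:00:03,500 --> 00:00:05,000
my friend.

3
00:00:06,000 --> 00:00:06,500
Right.

4
00:00:07,000 --> 00:00:09,000
Let's go home.
"""

merged = merge_rows(parse_srt(SAMPLE))
buffer = io.StringIO()
csv.writer(buffer).writerows([["start", "end", "text"], *merged])
print(buffer.getvalue())
```

In this sample, the first two entries are joined because the first one does not end a sentence, and the single-word entry "Right." is attached to them, leaving two rows in the resulting CSV.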
5 Experimental Setup
For the development of BERT-based models, we rely on the TFBertForSequenceClassification class, which is provided by HuggingFace (https://huggingface.co/transformers), initialized from the pre-trained bert-base-uncased model. A learning rate of 3e-06 and the sparse categorical cross-entropy loss function were used. All the models used the Adam optimizer (kingma2017adam). We describe the detailed hyper-parameters for all models and experiments in Appendix A.1.
6 Results and Annotation Analysis
In this section, we discuss the classification results obtained from the various hate speech classification models. We also briefly present a qualitative exploration of the annotated movie datasets. The model referred to as LSTM in the tables is the Bi-LSTM model.
6.1 Classification results and Discussion
We have introduced a new dataset of movie subtitles in the field of hate speech research. A total of six movies, each consisting of sequential subtitles, have been annotated.
First, we experimented with the HateXplain model (Mathew2021) by testing the model’s performance on the movie dataset. We achieved a macro-averaged F1-score of 66% (see Table 3). Next, we observed how the different models (BoW, Bi-LSTM, and BERT) perform using transfer learning and how their results compare to those of this state-of-the-art model.
|Model||Class||F1-Score||Macro AVG F1|
We trained and tested the BERT, Bi-LSTM, and BoW model by applying an 80:20 split on the social media datasets (see Table 4). When applied to the Fox News dataset, we observed that BERT performed better than both BoW and Bi-LSTM with a small margin in terms of macro-averaged F1-score. Hate is detected close to 50% whereas normal is detected close to 80% for all three models on F1-score.
When applied on the Twitter dataset, results are almost the same for the BoW and Bi-LSTM models, whereas the BERT model performed close to 10% better by reaching a macro-averaged F1-score of 76%. All the models have a high F1-score of above 90% for identifying offensive class. This goes along with the fact that the offensive class is the dominant one in the Twitter dataset (Table 1).
Hence, by looking at the macro-averaged F1-score values, BERT performed best in the task for training and testing on social media content on both datasets.
|Dataset||Model||Class||F1-Score||Macro AVG F1|
When trained on the Fox News dataset, BoW and Bi-LSTM performed similarly by poorly detecting hate in the movies. In contrast, BERT identified the hate class more than twice as well by reaching an F1-score of 39%.
When trained on the Twitter dataset, BERT performed almost double in terms of macro-averaged F1-score than the other two models. Even though the detection for the offensive class was high on the Twitter dataset (see Table 4) the models did not perform as well on the six movies, which could be due to the domain change. However, BERT was able to perform better on the hate class, even though it was trained on a small proportion of hate content in the Twitter dataset. The other two models performed very poorly.
|Dataset||Model||Class||F1-Score||Macro AVG F1|
To address RQ 2, we train new models from scratch on the six movies dataset using 6-fold cross-validation (see Table 6). In this setup, each fold represents one movie that is exchanged iteratively during evaluation.
Compared to the domain adaptation (see Table 5), the BoW and Bi-LSTM models performed better. Bi-LSTM distinguished better than BoW among hate and offensive while maintaining a good identification of the normal class resulting in a better macro-averaged F1-score of 71% as compared to 64% for the BoW model. BERT performed best across all three classes resulting in 10% better results compared to the Bi-LSTM model on macro-averaged F1-score, however, it has similar results when compared to the domain adaptation (see Table 5) results.
Furthermore, the absolute number of hateful subtitles in the movies The Wolf of Wall Street (3), South Park (10), and Pulp Fiction (16) is very small. Hence, the cross-validation results with one of these three movies as the test set are very sensitive to individual errors, since misclassifying only a few hateful subtitles already amounts to a high relative share.
|Dataset||Model||Class||F1-Score||Macro AVG F1|
The macro-averaged F1-score increased compared to the domain adaptation (see Table 5) from 64% to 89% for the model trained on the Fox News dataset. For the Twitter dataset, the macro-averaged F1-score is comparable to the domain adaptation (see Table 5) and in-domain results (see Table 6). Compared to the results of the HateXplain model (see Table 3), the identification of normal utterances is comparable, whereas our BERT model identified the offensive class much better, with an increase of 48%, and the hate class worse, with a decrease of 18%.
The detailed results of all experiments are given in Appendix A.2.
6.2 Qualitative Analysis
In this section, we investigate the unsuccessfully classified utterances (see Figure 2) of all six movies by the BERT model trained on the Twitter dataset and fine-tuned with the six movies via 6-fold cross-validation (see Table 7) to analyze the model addressing RQ 2.
The majority of unsuccessfully classified utterances (564) are offensive classified as normal and vice versa resulting in 69%. Hate got classified as offensive in 5% of all cases and offensive as hate in 8%. The remaining misclassification is between normal and hate resulting in 18%, which we refer to as the most critical for us to analyze further.
We looked at the individual utterances of the hate class misclassified as normal (37 utterances). We observed that most of them were sarcastic and those did not contain any hate keywords, whereas some could have been indirect or context-dependent, for example, the utterance "It’s just so beautiful. We’re cleansing this country of a backwards race of chimpanzees" indirectly and sarcastically depicts hate speech which our model could not identify. We assume that our model has shortcomings in interpreting those kinds of utterances correctly.
Furthermore, we analyzed the utterances of the class normal which were misclassified as hate (60 utterances). We observed that around a third of them were actual hate but were misclassified by our annotators as normal, hence those were correctly classified as hate by our model. We noticed that a fifth of them contain the keyword "Black Power", which we refer to as normal whereas the BERT model classified them as hate.
|Dataset||Model||Class||F1-Score||Macro AVG F1|
6.3 Annotation Analysis
Using MTurk crowdsourcing, a total of 10,688 subtitles (from the six movies) were annotated. For 81% of the subtitles, all three workers agreed on the same class. Only 0.7% of the subtitles received complete disagreement (i.e., each of the three workers chose a different class).
To ensure the quality of the classes for the training, we chose majority voting. In the case of disagreement, we took the offensive class as the final class of the subtitle. One reason why workers do disagree might be that they do interpret a scene differently. We think that providing the video and audio clips of the subtitle frames might help to disambiguate such confusions.
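The label aggregation described above, majority voting over three workers with the offensive class as the fallback for complete disagreement, can be sketched in a few lines. The function names `final_label` and `full_agreement_rate` and the toy annotations are assumptions made for this example.

```python
from collections import Counter

def final_label(votes, fallback="offensive"):
    """Majority vote over three worker labels.

    If all three workers disagree, no label reaches two votes and the
    fallback class is returned.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else fallback

def full_agreement_rate(annotations):
    """Share of subtitles on which all workers chose the same class."""
    unanimous = sum(len(set(votes)) == 1 for votes in annotations)
    return unanimous / len(annotations)

# Toy annotations: one list of three worker labels per subtitle.
annotations = [
    ["normal", "normal", "normal"],
    ["hate", "hate", "offensive"],
    ["normal", "offensive", "hate"],
    ["offensive", "offensive", "offensive"],
]
print([final_label(votes) for votes in annotations])
print(full_agreement_rate(annotations))
```

The check `count >= 2` relies on there being exactly three votes per subtitle; with a different number of annotators the tie-breaking rule would need to be restated.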
Let us consider an example from one of the annotation batches that describes a scene where the shooting of an Afro-American appears to happen. Subtitle 5 in that batch reads "Shoot the nigger!", and subtitle 31 states "Just shit. Got totally out of control.". The latter was interpreted as normal by a worker who might not be sensitive to the word shit, as offensive speech by a worker who is, in fact, sensitive to the word shit, and as hate speech by a worker who thinks that the word shit refers to the Afro-American.
The movie Django Unchained (DjangoUnchained) was tagged as racism and has been annotated as the most hateful movie (see Table 2), followed by BlacKkKlansman (BlacKkKlansman) and American History X (AmericanHistoryX), which were tagged as hate-speech or racism. This indicates that hate speech and racist comments often go together. As expected, movies tagged as friendship, like The Wolf of Wall Street (WolfOfWallStreet) and South Park (SouthPark), were less hateful. Surprisingly, the percentage of offensive speech increases when the percentage of hate decreases, making the movies tagged as friendship the most offensive in our movie dataset.
7 Threats to Validity
The pre-processing of the movies or the social media datasets could have deleted crucial parts which would have made a hateful tweet normal, for example. Thus the training on such datasets could impact the training negatively.
Movies are not real; they are more like a very good simulation. Thus, the hate speech in them is simulated and arranged. Documentaries may be better suited since they tend to cover real-case scenarios.
The annotations could be wrong since the task of identifying hate speech is subjective.
Movies might not contain a lot of hate speech, hence the need to detect them is very minor.
As the annotation process was done batch-wise, annotators might lose crucial contextual information when the batch change happens, as it misses the chronological order of the dialogue.
Only textual data might not provide enough contextual information for the annotators to correctly annotate the dialogues as the other modalities of the movies (audio and video) are not considered.
8 Conclusion
In this paper, we applied different approaches to detect hate and offensive speech in a novel proposed movie subtitle dataset. In addition, we proposed a technique to combine fragments of movie subtitles and made the social media text content more comparable to movie subtitles (for training purposes).
For the classification, we used two techniques of transfer learning, i.e., domain adaptation and fine-tuning. The former was used to evaluate three different ML models, namely Bag of Words for a baseline system, transformer-based systems as they are becoming the state-of-the-art classification approaches for different NLP tasks, and Bi-LSTM-based models as our movie dataset represents sequential data for each movie. The latter was performed only on the BERT model and we report our best result by cross-validation on the movie dataset.
All three models were able to perform well for the classification of the normal class. Whereas when it comes to the differentiation between offensive and hate classes, BERT achieved a substantially higher F1-score as compared to the other two models.
The produced artifacts could have practical significance in the field of movie recommendations. We will release the annotated datasets, keeping all the contextual information (time offsets of the subtitle, different representations, etc.), the fine-tuned and newly trained models, as well as the Python source code and pre-processing scripts, to pursue research on hate speech on movie subtitles (https://github.com/uhh-lt/hatespeech).
9 Further Work
The performance of hate speech detection in movies can be improved by increasing the existing movie dataset with movies that contain a lot of hate speech.
Moreover, multi-modal models can also improve performance by using speech or images. In addition, some kinds of hate speech can only be detected through the combination of different modalities, like some memes in the Hateful Memes Challenge by Facebook (FacebookHateMeme2021), e.g., a picture whose text says "look how many people love you" whereas the image shows an empty desert.
Furthermore, we also did encounter the widely reported sparsity of hate speech content, which can be mitigated by using techniques such as data augmentation, or balanced class distribution. We intentionally did not perform shuffling of all six movies before splitting into k-folds to retain a realistic scenario where a classifier is executed on a new movie.
Another interesting aspect that can be looked at is the identification of the target groups of the hate speech content in movies and to see the more prevalent target groups. This work can also be extended for automated annotation of movies to investigate the distribution of offensive and hate speech.
Appendix A
A.1 Hyperparameter values for experiments
All the models used the Adam optimizer (kingma2017adam). Bi-LSTM and BoW used the cross-entropy loss function, whereas our BERT models used the sparse categorical cross-entropy loss function. Further values for the hyperparameters for each experiment are shown in Table 8.
For all the Bi-LSTM models except the one trained on the Twitter dataset, the architecture consists of an embedding layer followed by two Bi-LSTM layers stacked one after another. Finally, a Dense layer with a softmax activation function outputs the class.
For training with Twitter (both in-domain and domain adaptation), a single Bi-LSTM layer is used.
The BoW model uses two hidden layers consisting of 100 neurons each.
BERT uses TFBertForSequenceClassification model and BertTokenizer as its tokenizer from the pretrained model bert-base-uncased.
|Model||Train-Dataset||Test-Dataset||Learning Rate||Epochs||Batch Size|
|BoW||Fox News||Fox News||1e-03||8||32|
|BERT||Fox News||Fox News||3e-06||17||32|
|BERT||Fox News and Movies||Movies||3e-06||6||32|
|BERT||Twitter and Movies||Movies||3e-06||6||32|
|Bi-LSTM||Fox News||Fox News||1e-03||8||32|
A.2 Additional Performance Metrics for Experiments
We report precision, recall, F1-score and macro averaged F1-score for every experiment in Table 9.
|Model||Train-Dataset||Test-Dataset||Category||Precision||Recall||F1-Score||Macro AVG F1|
|BoW||Fox News||Fox News||normal||0.81||0.84||0.83||0.63|
|BoW||Fox News||Fox News||hate||0.45||0.41||0.43||0.63|
|BERT||Fox News||Fox News||normal||0.84||0.87||0.86||0.68|
|BERT||Fox News||Fox News||hate||0.57||0.46||0.51||0.68|
|BERT||Fox News and Movies||Movies||normal||0.97||0.97||0.97||0.89|
|BERT||Fox News and Movies||Movies||hate||0.83||0.81||0.82||0.89|
|BERT||Twitter and Movies||Movies||normal||0.97||0.97||0.97||0.77|
|BERT||Twitter and Movies||Movies||offensive||0.76||0.76||0.75||0.77|
|BERT||Twitter and Movies||Movies||hate||0.57||0.73||0.59||0.77|
|Bi-LSTM||Fox News||Fox News||normal||0.83||0.72||0.77||0.62|
|Bi-LSTM||Fox News||Fox News||hate||0.39||0.55||0.46||0.62|
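For reference, the macro-averaged F1-score reported in the last column is the unweighted mean of the per-class F1-scores. The sketch below computes it from gold and predicted labels in plain Python; the function names and the toy labels are assumptions made for this illustration, and the final two lines re-derive two macro averages from the rounded per-class values in the table above.

```python
def f1_per_class(y_true, y_pred):
    """Return {label: F1} computed from gold and predicted labels."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(scores):
    """Unweighted mean over classes, so rare classes count as much as frequent ones."""
    return sum(scores.values()) / len(scores)

# Toy labels, not taken from our datasets.
y_true = ["normal", "normal", "hate", "offensive"]
y_pred = ["normal", "hate", "hate", "offensive"]
toy_scores = f1_per_class(y_true, y_pred)
print(toy_scores, macro_f1(toy_scores))

# Re-deriving macro averages from per-class F1 values listed in the table.
print(round(macro_f1({"normal": 0.83, "hate": 0.43}), 2))                     # BoW, Fox News
print(round(macro_f1({"normal": 0.97, "offensive": 0.75, "hate": 0.59}), 2))  # BERT, Twitter and Movies
```

Because every class contributes equally to the average, a model that ignores the small hate class is penalized even when its accuracy on the dominant class is high.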