In recent years, abusive language and hate speech have become pernicious problems for online communication platforms, including newspapers, that host comment sections [green_no_2018, gardiner_dark_2016]. At their inception, comment sections were welcomed by newspapers as a way to cultivate critical discourse between journalists and their audience and to generate traffic (cf. [Reich2011, Papacharissi2004]). In 2008, upon opening comment sections for its articles, the major U.S. media platform National Public Radio (NPR, www.npr.org) wrote: “We are providing a forum for infinite conversations on NPR.org. Our hopes are high. We hope the conversations will be smart and generous of spirit. We hope the adventure is exciting, fun, helpful and informative” [meyer_npr_2008]. In 2016, inundated with trolls and toxic comments, NPR, like many other major newspapers, shut down its comment sections [montgomery_beyond_2016, green_no_2018, gardiner_dark_2016]. Newspapers that opt to keep comment sections open have faced increasing legal scrutiny and significant comment moderation costs [green_no_2018, gardiner_dark_2016, niemann_what_2020].
In response to the increase in toxic comments and the associated costs, there has been growing interest in the field of abusive language detection (ALD). ALD is an application of machine learning (ML) and natural language processing (NLP) that can be used to develop (semi-)automated comment moderation tools [salminen_developing_2020, brunk_can_2019, niemann_abusive_2020]. The realization that at least parts of the moderation workflow would need to be (semi-)automated came early on [Paulussen2011], and work on the corresponding systems began more than a decade ago [Yin2009].
Against the backdrop of the refugee crisis in 2015, the field of abusive comment detection experienced a surge following Nobata et al.’s influential paper [nobata_abusive_2016]. Since then, scholars around the world have primarily focused on creating machine learning models that improve detection quality, which is reflected in a growing number of (meta-)studies [fortuna_survey_2018, niemann_abusive_2020, Pamungkas2021, Yin2021]. Linked but typically subordinate to the machine learning work, individual contributions cover related topics such as labels [niemann_abusive_2020], datasets [vidgen_directions_2021, Poletto2020], or the integration into moderation platforms [Loosen2017, Niemann2021]. However, academics are not the only ones working on a resolution to this pressing issue. For example, The New York Times (NYT, www.nytimes.com), one of the biggest U.S. newspapers, partnered with Alphabet (https://abc.xyz/) to build a tool called Perspective (www.perspectiveapi.com) that automatically flags toxic comments [lecher_alphabet_2017]. At its roll-out in 2017, Perspective was reportedly already automatically approving about 20% of the total comments received by the NYT [lecher_alphabet_2017].
However, despite all these efforts, the core problem (the effective, efficient, and ideally error-free automated detection of abusive comments) has still not been solved, as many classifiers show fairly good but often exaggerated performance [Yin2021]. While this may be largely attributable to the complexity of the classification task and to mistakes or imprecisions in the experimental setups, another reason, so far largely unconsidered, might lie within the data and the notion of language itself.
Typically, a dataset is randomly split into training and testing subsets so that the results reflect the classifier’s performance on future data [sogaard_we_2020]. However, classifiers evaluated in this random train-test fashion tend to yield overoptimistic results, so that the deployed model performs worse than expected. An important reason for this drop in performance is that the random train-test split falsely assumes a static language environment [vidgen_challenges_2019, sogaard_we_2020, lazaridou_pitfalls_2021]. When split in this fashion, the training and testing data share the same time period. This condition does not hold when the model is deployed. The goal of a deployed classifier is to predict future data, but a classifier trained under the false assumption that language is static will rely on a training corpus that is increasingly less representative of future data.
Language and the subjects of our online commentary are constantly in flux. Some words that were once considered to have negative meanings like "wicked" and "sick" can now have positive meanings, and vice versa [zeitlin_11_2019]. These processes of semantic change are known as amelioration and pejoration, respectively [frermann_bayesian_2016, lukes_sentiment_2018, cook_automatically_2010]. Language in online spaces like comment forums often changes especially rapidly as different forms of "netspeak" (i.e., internet slang) emerge and fade in social networks [goel_social_2016, eisenstein_diffusion_2014]. The people, places, topics, and trends in newspaper articles are also constantly changing. For example, during the COVID-19 pandemic, words like coronavirus, social distancing, and quarantine abruptly became widespread in news coverage. The wider phenomenon of dynamically changing data streams is known as concept drift in the computer science literature. In light of temporal changes in language, it is critical to determine how abusive language detection systems degrade over time.
This paper examines the temporal effects on the performance of abusive language detection classifiers trained on a German news comment dataset from Nov. 2018 to Jun. 2020. Our goals are to determine whether random splits tend to overestimate model performance compared to time-stratified evaluation, and to measure temporal degradation, i.e., whether and by how much the performance of a model decreases as the time between the training and testing data increases [lazaridou_pitfalls_2021].
The contribution of this paper is to caution practitioners of abusive language detection about the problem of concept-drifting data and to provide evidence of performance degradation when standard ML techniques are applied naively. Newspapers that implement semi-automated ALD systems must be aware that maintaining the model’s advertised performance benchmarks will require re-training on new data or implementing other, more complex adaptation strategies.
The rest of the paper is structured as follows. Sec. 2 describes the phenomenon of concept drift and explores past findings in NLP applications. A detailed review of concept drift adaptation strategies is outside of the scope of this paper, but we reference several recognized review papers. Sec. 3 describes our dataset of German-language newspaper comments. Sec. 4 describes two experiments we conducted to examine the temporal dynamics of ALD classifiers as well as our text-preprocessing techniques and automated machine learning (Auto-ML) approach to model development. In Sec. 5, we report the results of our experiments and provide further evidence of temporal changes in language in our German newspaper comment dataset. Finally, in Sec. 6 we outline some of the implications of our findings for practitioners of ALD and provide directions for future investigation.
2 Related Work
In many machine learning applications, models need to adapt to dynamic environments where the incoming data or the target outputs change unpredictably. In this section, we describe the problem of changing data as the concept drift problem and summarize some of the most relevant NLP papers that deal with concept drift and temporal degradation.
2.1 Concept Drift
Comments posted to a newspaper website arrive as a data stream (although data streams are sometimes differentiated by the high rate of speed and real-time or "one-pass" analysis) [bifet_machine_2018]. Temporal changes in a data stream, i.e., changes in the mapping between the input data and the target output variable across time, are a well-studied phenomenon called concept drift [dries_adaptive_2009, kifer_detecting_2004, gama_survey_2014]. Concept drift between time $t_0$ and time $t_1$ can be written as
$$\exists X : p_{t_0}(X, y) \neq p_{t_1}(X, y),$$
where $p_{t_0}$ is the joint distribution at time $t_0$ between the set of input variables $X$ and the target variable $y$ [gama_survey_2014].
There are two different types of concept drift: real concept drift and virtual concept drift. Real concept drift refers to changes in the conditional distribution of the target given the input, $p(y \mid X)$, that occur either with or without changes in the input distribution $p(X)$ [gama_survey_2014]. In abusive comment moderation, real concept drift is a change in the types of content moderators consider abusive. These changes may occur independently of changes in the incoming data, for example when new hate speech regulations are put into effect or when management decides to raise the threshold for restricting comments.
Virtual concept drift describes changes in the distribution of the incoming data, $p(X)$, that do not affect the conditional distribution $p(y \mid X)$ [gama_survey_2014]. In abusive comment moderation, virtual concept drift is the more likely type of drift as the language of abuse and the targets of abuse (e.g., people, places, organizations) change. Virtual drift also encompasses changes in the class distribution (i.e., changes in the proportion of abusive comments and clean comments).
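The distinction between the two drift types can be read off the product-rule factorization of the joint distribution. The block below is a short sketch in the notation of [gama_survey_2014]; the symbols $X$ (input variables), $y$ (target variable), and the time points $t_0$ and $t_1$ are notational assumptions introduced here for illustration.

```latex
\begin{align}
  p_t(X, y) &= p_t(y \mid X)\, p_t(X) \\
  \text{real drift:} \quad
    & p_{t_0}(y \mid X) \neq p_{t_1}(y \mid X) \\
  \text{virtual drift:} \quad
    & p_{t_0}(X) \neq p_{t_1}(X)
      \ \text{ while } \
      p_{t_0}(y \mid X) = p_{t_1}(y \mid X)
\end{align}
```

Under this reading, either type of drift changes the joint distribution $p_t(X, y)$, but only real drift changes what a moderator would decide for a given comment.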
There has been growing interest in the domains of concept drift detection and adaptive learning to develop concept drift adaptation strategies. Adaptive learning is a concept drift adaptation strategy in which the model is updated online during operation [gama_survey_2014]. A wide variety of adaptive learning algorithms have been proposed, ranging from simple sliding-window techniques, in which older data is gradually dropped from the training corpus, to complex learning algorithms that combine drift detection methods with sophisticated data forgetting mechanisms. For comprehensive reviews of concept drift adaptation strategies, we refer readers to Gama et al. 2014 [gama_survey_2014] and Tsymbal 2004 [Tsymbal_2004].
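The simplest of these strategies, the sliding window, can be sketched in a few lines. The example below is a minimal illustration and not the procedure of any cited work; the function name `sliding_window`, the dictionary fields, and the 120-day window are assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def sliding_window(comments, now, window_days=120):
    """Keep only comments posted within the last `window_days` days before `now`."""
    cutoff = now - timedelta(days=window_days)
    return [c for c in comments if cutoff <= c["date"] < now]


# Hypothetical toy stream of moderated comments.
comments = [
    {"date": datetime(2019, 1, 15), "text": "old comment", "rejected": 0},
    {"date": datetime(2020, 3, 1), "text": "recent comment", "rejected": 1},
    {"date": datetime(2020, 5, 20), "text": "newest comment", "rejected": 0},
]

# Only the comments inside the window would be used for re-training.
window = sliding_window(comments, now=datetime(2020, 6, 1), window_days=120)
print([c["text"] for c in window])  # ['recent comment', 'newest comment']
```

In such a scheme the model would be re-fit on `window` at regular intervals, so that outdated comments no longer influence the classifier.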
2.2 Drift and Degradation in NLP Tasks
The problem of concept drift and temporal degradation has been studied in a variety of NLP tasks including document classification [lazaridou_pitfalls_2021, huang_examining_2018, rocha_exploiting_2008, silic_exploring_2012], sentiment analysis [lukes_sentiment_2018, muller_addressing_2020, bifet_sentiment_2010, a_bechini_addressing_2021], named entity recognition [rijhwani_temporally-informed_2020], fake review detection [mohawesh_analysis_2021], spam filtering [delany_case-based_2005], and abusive language detection [nobata_abusive_2016, florio_time_2020]. In the study covering the longest time interval, [huang_examining_2018] classified sentences from American political platforms between 1948 and 2016 as either Democrat or Republican. By training and testing on data from different time intervals, they showed that F1-scores degraded by 40 points in some cases. Others, such as [lazaridou_pitfalls_2021, muller_addressing_2020, rijhwani_temporally-informed_2020, florio_time_2020], have shown that classifiers can degrade over much shorter periods, on the order of months.
In most cases, such as [lazaridou_pitfalls_2021, silic_exploring_2012, a_bechini_addressing_2021, rijhwani_temporally-informed_2020, florio_time_2020], this degradation accumulates steadily over time due to gradual concept drift in the data [gama_survey_2014]. However, in some cases, an abrupt change in the data (i.e., abrupt concept drift [gama_survey_2014]) can cause a sudden drop in performance. For example, [muller_addressing_2020] examined concept drift in vaccine-related sentiment analysis and found that the performance of outdated classifiers suddenly dropped by about 20% in the early months of 2020 when the COVID-19 pandemic began to receive global attention.
To the best of our knowledge, Florio et al. [florio_time_2020] is the only other work to measure temporal degradation in the context of abusive language detection. They trained two models, a Support Vector Machine (SVM) and Google’s Bidirectional Encoder Representations from Transformers (BERT) [devlin_bert_2019] adapted for Italian, on an Italian-language Twitter dataset of 4,000 samples from 2015 to 2017. Both models were then evaluated on monthly evaluation datasets of 2,000 samples from Sept. 2018 to Feb. 2019. Within the six months of evaluation data, the BERT and SVM models lost 0.227 and 0.284 F1 points, respectively, a drop that would severely cripple the usability of these models in a real-world setting.
Although past findings of concept drift indicate its relevance to ALD systems, our study contributes several novel insights. All languages undergo temporal changes. However, it is not clear how these processes manifest in different languages. A review by Niemann et al. [niemann_abusive_2020] found that German datasets for abusive language detection are both relatively rare (compared to English, Italian, and Indonesian) and tend to yield worse F1 scores. Thus our dataset provides a unique perspective on the problem of concept drift in a German ALD setting. Our data also overlaps the emergence of the COVID-19 pandemic, which was accompanied by intense news coverage and changes in our vocabulary. The period of our data allows for an examination of the concept drift associated with the pandemic. Overall, this work aims to integrate insights from concept drift literature and the field of ALD, an intersection that is so far poorly understood.
The basis for the conducted experiments (cf., Sec. 4) is provided by an extensive German news comment dataset. The dataset was provided by one of the largest German newspapers and contains comments submitted to the newspaper’s website by the readers. To provide a safe discussion space and to prevent legal trouble, each incoming comment is checked by a team of professional community managers before being published (pre-moderation process, cf., [Reich2011, Grimmelmann2015]).
| Statistic | Count |
|---|---|
| Number of comments | 256,173 |
| Number of accepted comments | 239,323 |
| Number of rejected comments | 16,850 |
If a comment is deemed non-publishable (e.g., because it contains racist or sexist content), it will not appear on the website. Unlike scraped datasets, however, our dataset includes all comments, even those deemed unfit for publication. As depicted in Tab. 1, the dataset consists of more than 250,000 comments: around 17,000 (6.6%) were rejected by moderators, and the remaining 240,000 (93.4%) were considered non-problematic. Each instance within the dataset is represented as shown in Tab. 2. Each entry contains a unique identifier (ID) as well as the date and time at which the reader posted the comment. All data within the dataset originates from user comments posted between 01.11.2018 and 29.06.2020. Further entries are the textual content of each comment and the resulting moderation decision ("0" if the comment was accepted by moderators, "1" if it was rejected). Lastly, each comment’s length (number of characters) is listed.
| Attribute | Description | Type | Range |
|---|---|---|---|
| Date | Date and time the comment was posted | datetime | 01.11.2018 - 29.06.2020 |
| Text | Text of the comment | text | - |
| Rejected | Decision if the comment is rejected by moderators | bool | [0,1] |
| Comment_length | Number of characters within a comment | int | [0,26516] |
4 Experimental setup
This section describes the experimental setup for two tests: the time-stratified vs. random split test and the temporal degradation test. In the time-stratified vs. random split test, we use a time-stratified evaluation procedure to examine how random train-test splits in concept-drifting data can result in overoptimistic measures of performance. In the temporal degradation test, we sequentially chunk our dataset and measure how classifier performance depends on the time interval between training and testing data. We also describe our preprocessing procedure and our Auto-ML approach to model selection.
Before being transformed into numerical features, all text was preprocessed with stopword removal, lemmatization, and lower-casing. The preprocessed text was then numerically represented with Term Frequency-Inverse Document Frequency (TF-IDF) vectors. In both the time-stratified vs. random split test and the temporal degradation test, we selected the 3,000 most frequent unigrams and bigrams as TF-IDF features. We removed words that appeared fewer than five times in the dataset from the possible feature space. We used random undersampling across the entire moderator-labeled dataset to achieve a balanced class distribution.
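Random undersampling, the balancing step described above, can be sketched as follows. This is a minimal illustration rather than the exact code used in the experiments; the function name `undersample` and the fixed seed are assumptions made for reproducibility of the example.

```python
import random


def undersample(labels, seed=0):
    """Return sorted indices of a class-balanced subset via random undersampling."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    # Every class is reduced to the size of the smallest class.
    n_min = min(len(idx) for idx in by_class.values())
    keep = []
    for idx in by_class.values():
        keep.extend(rng.sample(idx, n_min))
    return sorted(keep)


# Toy label vector with a class imbalance similar in spirit to the dataset.
labels = [0] * 93 + [1] * 7
kept = undersample(labels)
balanced = [labels[i] for i in kept]
print(len(balanced), sum(balanced))  # 14 7
```

If scikit-learn's `TfidfVectorizer` were used for the feature step (the implementation is not named in the text), its `ngram_range`, `max_features`, and `min_df` parameters would be the natural counterparts of the settings above, with the caveat that `min_df` counts documents rather than word occurrences.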
4.2 Auto-ML Setup
Finding the optimal ML configuration for a problem usually involves a repetitive and time-consuming process of testing different models, hyperparameters, preprocessing techniques, and feature engineering strategies. The goal of Auto-ML is to automate much of this workflow and reduce the developer’s bias towards prioritizing specific models or configurations over others [yao_taking_2019, jorgensen_multi-class_2020].
In this paper, we used the popular Auto-sklearn Auto-ML library to train our classifiers [Feurer2015]. Auto-sklearn is built on the well-known scikit-learn Python ML library and uses Bayesian optimization methods across 15 classifiers, 14 feature preprocessing methods, 4 data preprocessing methods, and over 100 hyperparameters (model dependent) to construct an ensemble of the best-performing models [Feurer2015] (see https://automl.github.io/auto-sklearn/master/ for the Auto-sklearn documentation). In the time-stratified vs. random split test, we capped each Auto-sklearn instance at one day of run time with 20GB of available memory. In the temporal degradation test, we capped each Auto-sklearn instance at 12 hours of run time with 25GB of available memory. All 15 classifiers were included in the parameter search space for both experiments.
4.3 Time-stratified vs. random split test
The time-stratified vs. random split experiment adapted from [lazaridou_pitfalls_2021] compares the popular random train-test data split with a time-stratified train-test split. A time-stratified split means the model is trained on one time period of data and then evaluated on a subsequent time period with no overlap in the time period of the training and evaluation datasets. On the other hand, in a random split, the evaluation dataset is selected at random from the entire time period of data meaning there is total overlap in the time period of the training and evaluation datasets. The goal is to determine whether the random split tends to overestimate performance on future unseen data.
In order to compare the two splitting strategies, we create an evaluation dataset and two training datasets—the time-stratified dataset and the control dataset. The evaluation dataset contains all comments from the last eight months (i.e., Nov. 2019 through June 2020). The time-stratified dataset contains all comments from the first month until the evaluation period (i.e., Nov. 2018 to Nov. 2019). The control dataset contains all comments from the entire corpus period (i.e., Nov. 2018 to June 2020). In other words, the control dataset and evaluation dataset have overlapping time periods, whereas the time-stratified dataset and evaluation dataset do not. The model trained on the time-stratified dataset is evaluated on its ability to predict future comments posted after the time period of its training data. In contrast, the model trained on the control dataset is trained to predict comments posted during the time period of its training data.
The experiment has two additional requirements. First, the control and time-stratified training datasets must be of equal size. Second, no comments from the evaluation dataset may appear in either training dataset (i.e., the control-evaluation and time-stratified-evaluation pairs are disjoint sets). These requirements ensure that the two training datasets differ only in the time period of training data and that they are tested on the same evaluation dataset. We achieve this by randomly undersampling the control dataset to be the same size as the time-stratified dataset and then updating the evaluation dataset to exclude all comments present in the control dataset. The flowchart in Figure 1 shows the four-stage process for creating the time-stratified, control, and evaluation datasets for the time-stratified vs. random split test. Figure 2 shows the composition of clean and abusive comments in the training and evaluation datasets. The evaluation dataset is examined at a monthly resolution to assess how the performance of the control-trained classifier degrades as the time between the control period and the evaluation period increases.
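The construction of the three datasets can be sketched as follows. This is a simplified illustration under stated assumptions, not a reproduction of the flowchart in Figure 1: the function name `build_datasets`, the `id` and `date` fields, the toy corpus, and the choice of 1 Nov. 2019 as the exact boundary are all hypothetical.

```python
import random
from datetime import date


def build_datasets(comments, eval_start, seed=0):
    """Create time-stratified, control, and evaluation sets from dated comments."""
    rng = random.Random(seed)
    # Split the corpus at the start of the evaluation period.
    time_stratified = [c for c in comments if c["date"] < eval_start]
    eval_period = [c for c in comments if c["date"] >= eval_start]
    # Control set: random sample from the whole corpus, same size as time-stratified.
    control = rng.sample(comments, len(time_stratified))
    # Evaluation set: evaluation-period comments that are not in the control set.
    control_ids = {c["id"] for c in control}
    evaluation = [c for c in eval_period if c["id"] not in control_ids]
    return time_stratified, control, evaluation


# Toy corpus: five comments per month from Nov. 2018 through June 2020.
comments = []
for m in range(20):
    year, month = divmod(2018 * 12 + 10 + m, 12)
    for k in range(5):
        comments.append({"id": len(comments), "date": date(year, month + 1, 1 + k)})

ts, control, evaluation = build_datasets(comments, eval_start=date(2019, 11, 1))
print(len(ts), len(control))  # 60 60
```

Because the time-stratified set ends before the evaluation period begins, it is disjoint from the evaluation set by construction; only the control set requires the explicit exclusion step.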
4.4 Temporal degradation test
In the temporal degradation test, we examine how the performance of a classifier changes over time. We measure temporal degradation by splitting the dataset into sequential chunks for training and evaluation and observing how classifier performance changes as the time interval between the training and evaluation data changes. The expectation is that as the time interval between training and evaluation data increases, the performance will degrade as language and news fluctuate.
In this experiment, we split the dataset into five consecutive four-month chunks. All chunks are then undersampled to match the number of comments in the smallest chunk (n=5007). Next, we fit an Auto-sklearn classifier to each chunk and evaluate it on all other chunks. When training and testing on the same chunk, we use a random 20% hold-out set for evaluation and the remaining 80% for training.
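The chunking scheme and the resulting grid of train-evaluation pairs can be sketched as follows. The helper names `chunk_index` and `train_eval_pairs` are hypothetical, and the sketch assumes that chunks are aligned to calendar months starting in Nov. 2018.

```python
from datetime import date


def chunk_index(d, start=date(2018, 11, 1), months_per_chunk=4):
    """Map a date to the index of its consecutive four-month chunk."""
    months_since_start = (d.year - start.year) * 12 + (d.month - start.month)
    return months_since_start // months_per_chunk


def train_eval_pairs(n_chunks=5):
    """List (train_chunk, eval_chunk, gap); gap > 0 means evaluation lies in the future."""
    return [(i, j, j - i) for i in range(n_chunks) for j in range(n_chunks)]


print(chunk_index(date(2018, 11, 1)), chunk_index(date(2020, 6, 29)))  # 0 4

pairs = train_eval_pairs()
forward = [p for p in pairs if p[2] > 0]    # training chunk precedes evaluation chunk
backward = [p for p in pairs if p[2] < 0]   # training chunk follows evaluation chunk
print(len(pairs), len(forward), len(backward))  # 25 10 10
```

The five pairs with a gap of zero correspond to training and testing on the same chunk, where the 80/20 hold-out split applies.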
To gain further insight into why performance varies, we also measure the corpus similarity between each pair of chunks with the Spearman rank correlation coefficient [kilgarriff_comparing_2001, mukaka_statistics_2012]. We calculate the Spearman coefficient with TF-IDF-ranked word lists for the two corpora being compared (using the Python implementation documented at https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html). A Spearman coefficient of 0 indicates no correlation, whereas a coefficient of 1 indicates perfect correlation. Given the temporal changes in language, we expect that the further apart two corpora are in time, the lower their correlation.
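The similarity measure can be illustrated with a dependency-free sketch of the Spearman coefficient, computed as the Pearson correlation of rank vectors. This is not the SciPy routine referenced above; the function names, the toy word scores, and the restriction to the vocabulary shared by both corpora are assumptions made for the example.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def corpus_similarity(scores_a, scores_b):
    """Spearman correlation of word scores over the vocabulary shared by both corpora."""
    shared = sorted(set(scores_a) & set(scores_b))
    ranks_a = average_ranks([scores_a[w] for w in shared])
    ranks_b = average_ranks([scores_b[w] for w in shared])
    return pearson(ranks_a, ranks_b)


# Hypothetical TF-IDF-style scores for two chunks.
chunk_a = {"wahl": 0.9, "steuer": 0.6, "klima": 0.4, "rente": 0.1}
chunk_b = {"wahl": 0.8, "steuer": 0.7, "klima": 0.2, "rente": 0.3, "corona": 0.95}
print(round(corpus_similarity(chunk_a, chunk_b), 2))  # 0.8
```

In this toy example, two of the four shared words swap rank positions between the chunks, which lowers the coefficient from 1 to 0.8.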
5 Experimental results
In this section, we describe the results of the time-stratified vs. random split test and the temporal degradation test outlined in Sec. 4. We also provide further insights into the impact of the COVID-19 pandemic on classifier performance. In the time-stratified vs. random split test, we show that emerging new words during the evaluation period are dominated by pandemic-related vocabulary. Furthermore, in both tests, our time-stratified evaluation procedures show clear performance drops during the pandemic’s early months.
5.1 Time-stratified vs random split test
The results of our time-stratified vs. random split test confirm that a classifier evaluated with a random splitting technique (the control-trained classifier) will yield overoptimistic results when compared to a time-stratified approach to evaluation. The classifier trained on the control dataset, which contains comments from the same time period as the evaluation dataset, had an overall F1-score of 0.632 compared with an F1-score of 0.590 for the classifier trained on the time-stratified dataset. An examination of the monthly performance across the evaluation dataset (Fig. 3) shows that the control-trained classifier consistently performs better than the time-stratified-trained classifier. It also shows that the performance gap between the two classifiers grows as the time-stratified training data becomes increasingly outdated. Table 3 similarly shows the monthly performance of the two classifiers on the evaluation dataset with precision and recall data. In an abusive language detection environment, recall, i.e., the number of abusive comments detected divided by the total number of abusive comments in the evaluation dataset, is critical. In a semi-automated abusive language detection system, the goal is to detect as many abusive comments as possible (high recall), even if that comes at the cost of classifying more clean comments as abusive (low precision), since moderators review comments marked as abusive.
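For reference, the relationship between precision, recall, and the F1-score can be made concrete with a small worked example. The labels below are invented for illustration and are not taken from our dataset; the function name `precision_recall_f1` is likewise hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive (abusive) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: 4 abusive comments, of which 3 are caught, plus 2 false alarms.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.6 0.75 0.667
```

Here the classifier trades precision for recall: it raises two false alarms that moderators must review, but it misses only one of the four abusive comments.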
Interestingly, the performance of the control-trained classifier also degrades over time. One reason for this decline may be that the comments in the control dataset skew towards the earlier periods of the dataset: only 32.7% of the control dataset comes from the evaluation period. The presence of old and outdated data in a training dataset can negatively impact classifier performance on new data, a phenomenon that runs counter to the usual assumption that more data leads to better performance [nobata_abusive_2016, gama_survey_2014]. Many concept drift adaptation algorithms implement forgetting mechanisms like the sliding window, in which old data outside of a particular time window is discarded [gama_survey_2014, silic_exploring_2012].
We also note that during the onset of the COVID-19 pandemic in the early months of 2020, the performance of the time-stratified-trained classifier drops sharply (cf. Fig. 3). This drop is likely due to the emergence of new vocabulary and abuse associated with the pandemic that was not present in the time-stratified dataset. Words that appear for the first time in the evaluation period (i.e., Nov. 2019 to June 2020) are "unseen" by a classifier trained on the time-stratified dataset but will likely have appeared in the control dataset. Table 4 shows a list of emerging words in the evaluation period. The list is dominated by frequently used words related to the COVID-19 pandemic. A classifier trained on a corpus without the pandemic-related vocabulary is almost guaranteed to run into trouble when faced with comments that, for example, decry lockdowns and vaccines, or blame certain groups for the emergence of the virus. As described in Sec. 2, changes in the language and topics of abuse lead to concept drift, which standard ML models are poorly equipped to address.
The words shown are preprocessed as described in Sec. 4.1, including lemmatization and lowercasing. Some words may also represent usernames that were frequently mentioned in comment threads.
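A list of emerging words of this kind can be derived by comparing the vocabularies of the two periods. The sketch below is a minimal illustration under the assumption that comments are already preprocessed and whitespace-tokenized; the function name `emerging_words` and the toy comments are hypothetical.

```python
from collections import Counter


def emerging_words(train_docs, eval_docs, top_k=10):
    """Words that occur in the evaluation period but never in the training period."""
    train_vocab = {w for doc in train_docs for w in doc.split()}
    eval_counts = Counter(w for doc in eval_docs for w in doc.split())
    unseen = Counter({w: c for w, c in eval_counts.items() if w not in train_vocab})
    return unseen.most_common(top_k)


# Toy, already-preprocessed comments (lowercased, whitespace-tokenized).
train_docs = ["regierung steuer wahl", "wahl politik"]
eval_docs = ["corona lockdown regierung", "corona quarantäne", "corona lockdown"]
print(emerging_words(train_docs, eval_docs, top_k=2))
# [('corona', 3), ('lockdown', 2)]
```

Words such as "regierung" that already occur in the training period are filtered out, so only vocabulary that is genuinely new to the classifier remains.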
5.2 Temporal degradation test
The results from the temporal degradation test show that, in general, as the time interval between the training and evaluation chunks increases, classifier performance decreases (Fig. 3(a)). Furthermore, the Spearman correlation between chunks (Fig. 3(b)) shows the same trend: as the time interval between two chunks increases, the corpus similarity decreases.
However, the temporal degradation effect seen in Fig. 3(a) is markedly stronger in the forward direction of time, where the training chunk precedes the evaluation chunk. We observe drops in the F1 score of up to 0.31 in the forward direction. In the backward direction of time, where the training chunk follows the evaluation chunk, the temporal degradation effect is small or non-existent. One reason that the effect may be weaker in the backward direction is that the news cycle has a "memory": older stories are built upon, and the respective vocabulary accumulates instead of being discarded.
The chunk from March 2020 to June 2020 is notable for both its low F1-scores in Fig. 3(a) and low Spearman correlation coefficients in Fig. 3(b). The time period of this chunk coincides with the emergence of hundreds of COVID-19 related news stories and emerging new words like those in Table 4. These results suggest that abrupt concept drift associated with the COVID-19 pandemic contributed to significant temporal degradation for classifiers trained on pre-pandemic data.
The prevalence of abusive language in online comment sections presents a significant challenge for newspapers. As a result, there has been growing interest in (semi-)automated comment moderation tools that use machine learning and natural language processing to avoid the high costs of manual moderation or of shutting down comment sections entirely. Unfortunately, much research on abusive language detection is modeled on an unrealistically static language environment in which the language and abuse topics remain unchanged beyond the training dataset.
This paper uses a time-stratified evaluation procedure to show that the typical random train-test splitting strategy tends to overestimate classifier performance on future data. Random train-test splits assume a complete overlap between the time periods of the training and testing data, an assumption that is broken as soon as the model is deployed into a real-world environment to make predictions on new data. We argue that a time-stratified evaluation procedure in which the training and testing data are selected from distinct time periods is better suited for modeling our real-world environment where language changes dynamically.
Our findings on temporal degradation suggest that a classifier’s performance can degrade significantly in as short a period as four months. Temporal degradation is especially pronounced during periods of abrupt concept drift like the COVID-19 pandemic. Our experiments consistently showed that the changes in vocabulary associated with the pandemic led to a sharp decline in performance among classifiers trained on data from before the pandemic. What is clear from these results, and from other concept drift literature, is that the naive application of standard ML techniques will result in worse performance as the model’s training corpus becomes increasingly outdated.
Practitioners of ALD systems have several avenues available to deal with the consequences of temporal degradation and concept-drifting data. On the low-tech end of the spectrum are sliding-window training schemes in which models are regularly re-trained with new data and old data is discarded (see, for example, Nobata et al. 2016 [nobata_abusive_2016]). Other, more complex adaptation strategies are covered in Gama et al. 2014 [gama_survey_2014], though many of these algorithms are not yet available in standard ML libraries. In almost all cases, save for unsupervised concept drift adaptation algorithms [gemaque_overview_2020], labeling some fraction of incoming data is required. This implies that manual moderation will continue to be necessary, even if the workload is greatly reduced. One promising direction to reduce the amount of new data that needs to be manually labeled is active learning [settles.tr09]. Active learning queries data instances that are particularly useful for training (e.g., those close to the decision boundary).
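One common query strategy in active learning is uncertainty sampling, which can be sketched as follows. This is an illustration of the general idea and not a method evaluated in this paper; the function name `uncertainty_sample`, the labeling budget, and the decision boundary at a predicted probability of 0.5 are assumptions.

```python
def uncertainty_sample(probabilities, budget):
    """Indices of the `budget` comments whose predicted abuse probability is closest to 0.5."""
    ranked = sorted(range(len(probabilities)), key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:budget]


# Hypothetical predicted probabilities of being abusive for six new comments.
probs = [0.02, 0.48, 0.97, 0.55, 0.10, 0.61]
print(uncertainty_sample(probs, budget=2))  # [1, 3]
```

Only the selected comments would be passed to moderators for labeling, while confidently classified comments would not consume any labeling budget.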
Online discourse will likely be a permanent and growing fixture of how society communicates. Regulating online speech raises complex but important legal, political, and technical issues that have broad implications for interacting with media and forming opinions. If newspapers opt to use (semi-)automated comment moderation systems, it is crucial that these systems perform well and are aligned with the criteria for comment censorship outlined by the platform. Our findings support the notion that abusive language detection is not a trivial task. Classifiers trained to detect abusive language will need to be regularly updated with new data or otherwise designed to adapt to changes in language and the incoming data. Failure to do so risks ineffective comment moderation systems at best and careless censorship at worst.