Toxic content on the internet prevents the constructive exchange of ideas, excludes sensitive individuals from online dialogue, and inflicts mental and physical health impacts on the recipients. Notable examples of toxic content include hate speech and profanity. Given the sheer scale of internet communications, manual filtering of such content is difficult, requiring methods of automated filtering.
Previous work in toxic content classification has so far focused on constructing classifiers that can flag toxic content with a high degree of accuracy on datasets curated from sources such as Twitter and Wikipedia. However, these datasets do not account for the possibility that malicious users will deliberately attempt to bypass these classifiers. In the presence of toxic content filters, these users could formulate adversarial attacks that aim to prevent the classifier from detecting their harmful content while retaining readability for the receiving user. For example, changing a single character to an asterisk, which requires minimal effort, may allow hurtful content to bypass the toxic content filter (e.g., “shut up” to “s*ut up”).
If such simple attacks are effective at fooling automated toxic content classifiers, the utility of these classifiers would diminish greatly: determined users could still easily produce toxic content at a large scale. Therefore, useful toxic content classifiers need to be robust to adversarial attacks by making the transmission of toxic content sufficiently difficult and discouraging users from posting this type of content.
In this paper, we investigate the robustness of state-of-the-art toxic content classifiers to realistic adversarial attacks as well as various defenses against them. We find that these classifiers are vulnerable to extremely simple, model-agnostic attacks, with the toxic comment recall rate dropping by nearly 50% in some cases.
To address these vulnerabilities, we explore two types of defenses. The first is adversarial training, which we find to be effective against adversarial text, yet degrades performance on clean data. We also propose the Contextual Denoising Autoencoder (CDAE), a novel method for learning robust representations. The CDAE uses character-level and contextual information to “denoise” obfuscated tokens. We find that our approach outperforms several strong baselines with respect to character-level obfuscations, but is still vulnerable to distractors (i.e., injected sequences of non-toxic tokens). We experimentally find that the two best-performing models (our proposed CDAE and BERT) have different robustness characteristics, but a model ensemble allows us to leverage both their advantages.
Task and Datasets
Toxic content detection attempts to identify content that can offend or harm its recipients, including hate speech [wang2018interpreting], racism [Waseem2016HatefulSO], and offensive language [wu2018decipherment]. Given the subjectivity of these categorizations, we do not limit the scope of our work to any specific type and address toxic content in general. We work with three datasets summarized in Table 1.
The Jigsaw 2018 dataset focuses on the general toxic content detection task and it is comprised of approximately 215,000 annotated comments from Wikipedia talk pages labeled by crowd workers. It provides both a general toxicity label and more fine-grained annotations such as severe toxicity, obscenity, threat, insult, and identity hate.
The Jigsaw Unintended Bias in Toxicity Classification dataset (Jigsaw 2019; publicly available at https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/overview) extends the Jigsaw 2018 dataset with 1.8 million comments, each annotated by up to 10 annotators for multiple labels. Jigsaw 2019 contains a field for toxicity which provides the fraction of annotators who labeled the comment as “toxic” (7.99%). We use the Jigsaw 2019 corpus as our background corpus for generating adversarial attacks.
The OffensEval 2019 dataset (https://competitions.codalab.org/competitions/20011) consists of 13,240 tweets annotated by crowdworkers. The data contains labels for whether the content is offensive and whether it is targeted, with 33% of the tweets being labeled as offensive.
Generating Realistic Adversarial Attacks
Previous work generating adversarial examples in text often assumes access to either the weights [Ebrahimi2017HotFlipWA, LiangLSBLS17] or the raw prediction scores of the classifier [DBLP:journals/corr/abs-1812-05271, Liang:2018:DTC:3304222.3304355, alzantot2018, abs-1801-04354, SamantaM17]. However, it is unlikely that users would have access to this information. Instead, users would most likely only have weak signals from what gets flagged, as well as access to public datasets with toxicity labels. To mimic this setup, we use a large background corpus (Jigsaw 2019) with labels indicating toxicity (we follow the Jigsaw 2019 guidelines for converting their continuous toxicity scores into binary labels). Our adversarial attack consists of two steps: (1) constructing a lexicon of toxic tokens and (2) using it to apply noise to the test set.
To identify “toxic” tokens, we train a logistic regression classifier on bag-of-words utterance representations from our background corpus. We use the coefficients of the logistic regression classifier as a signed measure of the association between the token and toxicity and select the 50,000 tokens with the strongest positive association with toxicity to be our toxic lexicon. We provide a list of the top 100 toxic lexicon tokens in the Supplemental Material. We treat any token that does not appear in our lexicon as non-toxic. Using this toxic lexicon, we generate noised versions of the corpora using two settings: token obfuscation and distractor injection. Figure 2 provides an illustration of all our proposed attacks.
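The lexicon construction can be illustrated with a minimal sketch. It assumes binary bag-of-words features, an unregularised logistic regression fitted by plain batch gradient descent, and a toy four-utterance corpus; the function names, the hyperparameters, and the corpus are illustrative assumptions, not the configuration used for the 50,000-token lexicon.

```python
import math
from collections import Counter


def train_logreg_bow(docs, labels, epochs=200, lr=0.5):
    """Fit logistic regression on binary bag-of-words features.

    Plain batch gradient descent, no regularisation (an assumption made
    for brevity). Returns (weights, bias), with weights keyed by token.
    """
    feats = [set(doc.split()) for doc in docs]
    weights = {tok: 0.0 for toks in feats for tok in toks}
    bias = 0.0
    n = len(docs)
    for _ in range(epochs):
        grad_w = Counter()
        grad_b = 0.0
        for toks, y in zip(feats, labels):
            z = bias + sum(weights[t] for t in toks)
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss w.r.t. the logit
            grad_b += err
            for t in toks:
                grad_w[t] += err
        bias -= lr * grad_b / n
        for t, g in grad_w.items():
            weights[t] -= lr * g / n
    return weights, bias


def toxic_lexicon(weights, k):
    """Return the k tokens with the largest positive coefficients."""
    positive = [(w, t) for t, w in weights.items() if w > 0]
    positive.sort(key=lambda wt: (-wt[0], wt[1]))
    return [t for _, t in positive[:k]]


# Toy background corpus (hypothetical); label 1 marks a toxic utterance.
corpus = [
    "you are an idiot",
    "shut up idiot",
    "thank you for the edit",
    "you are right about the edit",
]
labels = [1, 1, 0, 0]
weights, _ = train_logreg_bow(corpus, labels)
lexicon = toxic_lexicon(weights, k=2)
```

Tokens outside the returned list are treated as non-toxic, mirroring the convention stated above.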
We apply character-level perturbations to the tokens of the utterance that belong to our toxic lexicon. For each toxic token we randomly select one of the following three perturbing operations: character scrambling, homoglyph substitution, and dictionary-based near-neighbor replacement. Details of the perturbing operations are given below.
Character scrambling consists in randomly permuting the token’s characters without deletions and substitutions, as applied in other work [heigold-etal-2018-robust, belinkov2018synthetic, michel2019]. Prior research shows that humans can read sufficiently long scrambled words, albeit not without an effort, especially if starting and ending letters remain in place [rayner2006raeding]. Thus, for this operation, we ignore tokens with fewer than three characters and keep the first and the last character unchanged. The remaining characters are split into groups of three consecutive characters and each group is permuted randomly and independently.
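A minimal sketch of the scrambling operation follows, assuming the groups of three are formed left to right over the inner characters and that a shorter trailing group is shuffled as well. The function name and the `rng` parameter are illustrative.

```python
import random


def scramble_token(token, rng=random):
    """Shuffle inner characters in consecutive groups of three.

    Tokens with fewer than three characters are returned unchanged, and
    the first and last characters always stay in place.
    """
    if len(token) < 3:
        return token
    inner = list(token[1:-1])
    for start in range(0, len(inner), 3):
        group = inner[start:start + 3]
        rng.shuffle(group)  # permute this group independently of the others
        inner[start:start + 3] = group
    return token[0] + "".join(inner) + token[-1]
```

Passing a seeded `random.Random` instance as `rng` makes the perturbation reproducible.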
Homoglyph substitution consists of replacing one or more Latin letters with similar-looking international characters from a handcrafted confusion map (see Supplemental Material). If the homoglyph substitution operation is selected, each character of the toxic token is replaced with 20% probability. This type of obfuscation is common in social media [Rojas-Galeano:2017:OOO:3079924.3032963] and cybercrime [ginsberg2018rapid, elsayed2018large].
Dictionary-based near-neighbor replacement uses a base vocabulary to find the closest (but distinct) token in terms of Levenshtein distance. If the relative Levenshtein distance (i.e., Levenshtein distance divided by maximum word length) is greater than 0.75, we use this nearest neighbor as a replacement. We leave the original toxic token unchanged otherwise. This form of noise produces common misspellings. As such, it introduces deletions, insertions, and substitutions that are not overly artificial. This procedure is distinct from that used by [belinkov2018synthetic], who generate naturally occurring errors using corpora of error corrections.
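The homoglyph operation can be sketched as below. The confusion map here is a small hypothetical stand-in built from Cyrillic look-alikes, not the handcrafted map from the Supplemental Material; the function name and parameters are likewise illustrative.

```python
import random

# Hypothetical confusion map: Latin letter -> Cyrillic look-alike.
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small a
    "c": "\u0441",  # Cyrillic small es
    "e": "\u0435",  # Cyrillic small ie
    "i": "\u0456",  # Cyrillic small Byelorussian-Ukrainian i
    "o": "\u043e",  # Cyrillic small o
    "p": "\u0440",  # Cyrillic small er
}


def homoglyph_substitute(token, p=0.2, rng=random, table=HOMOGLYPHS):
    """Replace each mappable character independently with probability p."""
    return "".join(
        table[ch] if ch in table and rng.random() < p else ch
        for ch in token
    )
```

Characters without an entry in the map are always left as they are, so the token length never changes.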
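The quantities involved in near-neighbor replacement can be sketched as follows. The sketch computes the Levenshtein distance, its relative variant, and the closest distinct vocabulary entry; the acceptance test on the relative distance is deliberately left to the caller. The function names and the alphabetical tie-breaking are assumptions.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def relative_levenshtein(a, b):
    """Levenshtein distance divided by the length of the longer word."""
    return levenshtein(a, b) / max(len(a), len(b), 1)


def nearest_neighbor(token, vocabulary):
    """Closest vocabulary entry that differs from the token, or None.

    Ties are broken alphabetically (an assumption for determinism).
    """
    candidates = [w for w in vocabulary if w != token]
    if not candidates:
        return None
    return min(candidates, key=lambda w: (levenshtein(token, w), w))
```

For a realistic base vocabulary the linear scan in `nearest_neighbor` would be slow; an index such as a BK-tree is a common alternative.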
In this setting, we inject distractor tokens by repeating randomly selected sequences of non-toxic tokens. We split the utterance into two parts at a random position and find the maximum-length sequence of non-toxic words that starts in each of the parts. Search localization introduces variety in the identified distractor sequences, which helps to avoid the appearance of easily detectable vandalism. Once a suitable sequence is found, it is appended to the end of the utterance.
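One possible reading of the distractor procedure is sketched below. It assumes the utterance is already tokenized, that a run may extend past the split point as long as it starts in the given part, and that one of the two candidate runs is chosen at random; these choices and all names are assumptions rather than details given in the text.

```python
import random


def longest_nontoxic_run(tokens, lo, hi, toxic_lexicon):
    """Longest run of non-toxic tokens starting at an index in [lo, hi)."""
    best = []
    for start in range(lo, hi):
        end = start
        while end < len(tokens) and tokens[end] not in toxic_lexicon:
            end += 1
        if end - start > len(best):
            best = tokens[start:end]
    return best


def inject_distractor(tokens, toxic_lexicon, rng=random):
    """Append a repeated non-toxic sequence to the end of the utterance."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    split = rng.randrange(1, len(tokens))  # random split position
    runs = [
        longest_nontoxic_run(tokens, 0, split, toxic_lexicon),
        longest_nontoxic_run(tokens, split, len(tokens), toxic_lexicon),
    ]
    runs = [r for r in runs if r]
    if not runs:
        return tokens  # no suitable sequence found
    return tokens + rng.choice(runs)
```

Because the split position is random, repeated calls on the same utterance can append different sequences, which is the variety the search localization is meant to provide.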
Both token obfuscation and distractor injection are model-agnostic, simple, and easy to automate. Hence, toxic content classifiers that are vulnerable to these attacks can be easily and systematically exploited.
We emphasize that the noise we present here is different from “naturally” occurring noise (e.g., misspellings and slang) that does not deliberately attempt to hide toxic tokens. The datasets we use have not been constructed in the presence of a toxicity filter, implying that the users had no incentive to obfuscate toxicity of their comments. Hence, the synthetic noise we present here is not the noise that we observe frequently in these datasets.
Effect of Adversarial Noise
We ran an experiment to assess whether our perturbations retain the toxicity of the toxic comments for human readers. We randomly sampled 200 comments from the Jigsaw 2018 dataset, half of which were labeled as toxic by the original Jigsaw 2018 crowd-workers. For each comment, a native English speaker rated either the perturbed or the unperturbed version, and we took care not to show both versions to the same individual. Overall, the experiment involved 10 participants, each providing a toxicity rating for 80 comments. We thus obtained a total of 800 ratings, with each version of each comment receiving two independent ratings. We tested whether the toxicity rating of the unperturbed comment tended to be higher than that of the perturbed comment using the Wilcoxon signed rank test [wilcoxon1992individual] applied to pairs of unperturbed/perturbed toxicity scores averaged at the comment level. The original comment was perceived as more toxic (based on the average rating of two distinct users) 14% of the time, and we found no statistically significant difference at the 1% significance level. Thus, we conclude that it is unlikely that our perturbations remove the toxicity signals for human readers.
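For illustration, the signed rank statistics behind such a paired comparison can be computed as in the sketch below. It assumes zero differences are discarded and tied absolute differences receive average ranks; it returns only the rank sums and does not compute a p-value. The function name and the toy ratings are hypothetical and are not data from the experiment.

```python
def wilcoxon_signed_rank(x, y):
    """Return (W_plus, W_minus) for paired samples x and y.

    Zero differences are dropped; ties in |difference| get average ranks.
    """
    diffs = sorted((a - b for a, b in zip(x, y) if a != b), key=abs)
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(diffs):
        j = i
        while j < len(diffs) and abs(diffs[j]) == abs(diffs[i]):
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        i = j
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical comment-level average ratings (unperturbed vs. perturbed).
unperturbed = [3, 1, 4, 2]
perturbed = [1, 1, 2, 5]
w_plus, w_minus = wilcoxon_signed_rank(unperturbed, perturbed)
```

A large imbalance between the two rank sums would indicate that one version tends to be rated as more toxic than the other.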
We evaluate the effect of our adversarial noise on toxic content classifiers on the Jigsaw 2018 and OffensEval 2019 datasets. (Note that the Jigsaw 2018 and Jigsaw 2019 datasets are distinct, and we remove all examples in the Jigsaw 2019 dataset from the Jigsaw 2018 dataset to prevent leakage. We use the Jigsaw 2019 dataset as a background corpus and not to train the model for the Jigsaw 2018 dataset.) The general toxic content classifier architecture is straightforward. The tokens of an utterance are first embedded into a continuous vector space and then passed through an LSTM encoder, which produces a sequence of intermediate representations. These representations are then used to produce a single vector representation using mean- and max-pooling as well as attention, which is, in turn, put through an MLP and used to make a prediction of the toxicity of the utterance through a sigmoid function.
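One way to write this architecture down is sketched below. The notation is assumed rather than taken from the source: $x_t$ denotes the tokens, $h_t$ the LSTM states, $w$ a learned attention vector, and the three pooled vectors are assumed to be concatenated.

```latex
\begin{align}
  e_t &= \mathrm{Embed}(x_t), \qquad
  (h_1, \dots, h_T) = \mathrm{LSTM}(e_1, \dots, e_T) \\
  \alpha_t &= \frac{\exp(w^\top h_t)}{\sum_{s=1}^{T} \exp(w^\top h_s)} \\
  v &= \Big[\, \tfrac{1}{T} \sum_{t=1}^{T} h_t \;;\; \max_{t} h_t \;;\; \sum_{t=1}^{T} \alpha_t h_t \,\Big] \\
  \hat{y} &= \sigma\big(\mathrm{MLP}(v)\big)
\end{align}
```

Here $\max_t$ is taken element-wise over the sequence, and $\hat{y}$ is the predicted probability that the utterance is toxic.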
To demonstrate the effect of our adversarial attacks, we experiment with fastText [bojanowski-etal-2017-enriching] and ELMo [Peters:2018] embeddings, both of which are capable of handling out-of-vocabulary words. For ELMo, we follow the recommendations of [Peters:2018] and apply a 0.5 dropout to the representations and a weight decay of 1e-4 to the scalar weights of all layers. We only fine-tune the scalar weights and keep the language model weights fixed. We also experiment with BERT, applying a single affine layer to the embedding of the [CLS] token for classification and fine-tuning all weights. In addition, we report the performance of a simple logistic regression baseline.
All hyperparameters are tuned on the Jigsaw 2018 dataset and are listed in the Supplemental Material. Preprocessing steps include tokenization, lower-casing, removal of hyperlinks and removal of characters that are repeated more than three times in a row (e.g., “stupiiiiddddd” is converted to “stupid”, but “iidiot” remains unchanged). All punctuation is retained. For consistency across datasets, we evaluate models on the “toxic”/“offensive” labels that include all types of toxicity (obscenity, hate speech, targeted/untargeted offense, and others). To convert probabilistic outputs of the models to binary classes, we threshold the predictions to maximize the F1 score on the training set. We focus on the ability of various models to classify toxic content correctly since this is where adversarial attacks are most likely to take place (users that post non-toxic content are not motivated to have the system misclassify their content as toxic).
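Two of the steps above lend themselves to a short sketch: collapsing characters repeated more than three times in a row, and choosing the decision threshold that maximizes F1 on the training set. The function names are illustrative, and using the observed scores as candidate thresholds is an assumption.

```python
import re

# A character followed by three or more copies of itself (four or more in a row).
_REPEATED = re.compile(r"(.)\1{3,}")


def collapse_repeats(text):
    """Reduce any run of four or more identical characters to one."""
    return _REPEATED.sub(r"\1", text)


def f1_score(y_true, y_pred):
    """F1 over the positive (toxic) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(y_true, scores):
    """Return (threshold, F1) maximizing F1 over the observed scores."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(y_true, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

For example, `collapse_repeats` maps “stupiiiiddddd” to “stupid” and leaves “iidiot” unchanged, matching the behavior described above.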
|Noise||Jigsaw 2018: Recall||Jigsaw 2018: % Change||OffensEval 2019: Recall||OffensEval 2019: % Change|
The effects of our combined adversarial attacks are summarized in Table 2. The logistic regression classifier is effectively incapable of handling out-of-vocabulary words and performs the worst when noise is applied, with more than 50% recall lost. Despite this limitation, however, its performance does not drop to zero. This means that our obfuscation does not completely remove all words that the logistic regression classifier uses to detect toxicity. Indeed, we found that some tokens that are quite obviously toxic (e.g., “motherf*cker”) were not included in our toxic lexicon. Therefore, it is likely that improving the lexicon by finding a larger dataset or manually curating more toxic words could further enhance the effect of adversarial noise. Although neural models fare slightly better, recall on the adversarial test sets still drops significantly, with losses of over 30% in all cases.
We present randomly sampled examples of toxic sentences that were misclassified by the fastText model due to the adversarial noise in Table 3. Although not all of them retain grammatical correctness, it is our view that their toxicity is preserved and that they should be properly handled by any toxic content classifier.
Defenses Against Adversarial Attacks
Next we consider potential defenses against the aforementioned attacks: adversarial training and contextual denoising autoencoder. We note that our objective with these
One possible defense is adversarial training [Szegedy2014IntriguingPO, 43405], applying similar noise to the training dataset. Adversarial training has been applied successfully in tasks including machine translation [belinkov2018synthetic] and morphological tagging [heigold-etal-2018-robust]. One limitation to this approach is that one would need to know the details of the incoming attack, including the lexicon the adversary might use to generate noise. This is a major limitation, since adversaries can easily change their lexicon. Another limitation is that there is no guarantee that the adversarial noise will produce a reliable pattern that the model can generalize to. For example, for fastText embeddings, the same operation of swapping two characters would produce completely different changes in the subwords for different source words, resulting in different changes in embedding space. The model could also overfit to the adversarial noise, resulting in worsened performance on clean data.
Contextual Denoising Autoencoder
With token obfuscation, the underlying problem is that small character perturbations can cause large and unpredictable changes in embedding space. To resolve this problem, the underlying text representations themselves need to be robust against character-level perturbations. To learn such robust representations, we train a denoising autoencoder that receives noised tokens as input and predicts the denoised version of the token. When denoising tokens, the surrounding context can provide strong hints as to what the original token was. Some words like “duck” can be used both as obfuscations of profanity and as standard language, meaning context is crucial in effective denoising. Thus, we use a model that takes a sequence of potentially noised tokens as input and predicts the denoised tokens using contextual information. We call this model the Contextual Denoising Autoencoder (CDAE).
Due to its impressive performance across a wide range of tasks, we use a Transformer as the underlying architecture. For word representations, we employ the character convolutional neural network (CNN) encoder used in the ELMo model. We feed the outputs of the CNN encoder to the Transformer with learned positional embeddings, 6 layers, and 4 attention heads in each layer, where the outputs of each layer are of size 128. We show the overall scheme of the CDAE in Figure 1. Not using wordpieces leads to a massive vocabulary size, especially with corpora obtained from the web. We therefore use the CNN-softmax method combined with importance sampling loss to accelerate training. We apply noise to 70% of tokens according to the scheme in Section Generating Realistic Adversarial Attacks and mask all tokens uniformly with a probability of 10%. We train our denoising autoencoder on a random subset of the UMBC webbase corpus [UMBC] (a large-scale corpus constructed from web crawls) and the Jigsaw 2019 dataset, taking care to remove any examples from the Jigsaw 2018 dataset.
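The corruption applied to the autoencoder's input can be sketched as follows. It assumes noising and masking are drawn independently per token and that masking overrides noising when both fire; the mask symbol, the function names, and the `noise_fn` callback are hypothetical.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol


def corrupt_for_denoising(tokens, noise_fn, p_noise=0.7, p_mask=0.1, rng=random):
    """Build one (input, target) pair for the denoising objective.

    Each token is noised with probability p_noise, then replaced by the
    mask symbol with probability p_mask. The targets are the clean tokens.
    """
    inputs = []
    for tok in tokens:
        out = noise_fn(tok) if rng.random() < p_noise else tok
        if rng.random() < p_mask:
            out = MASK
        inputs.append(out)
    return inputs, list(tokens)
```

In practice `noise_fn` would pick one of the character-level perturbations at random; any callable that maps a token to a noised token fits the sketch.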
We note that this approach does require knowledge of what character-level perturbations will be applied. However, the space of possible character-level perturbations that retain readability of the original token is limited. Crucially, unlike adversarial training, the CDAE does not require knowledge of the adversary’s lexicon, making this approach more suitable for a wider range of attacks.
|Model||Train Noise||Test Noise||Jigsaw 2018||OffensEval 2019|
Effect of Defenses
In order to evaluate our proposed defenses, we measure AUC, F1 score, and recall over the toxic class for all models. The model architecture for CDAE is similar to the one we used for fastText and ELMo. For CDAE, we use the mean of the final 4 layers of the model and concatenate it with fastText embeddings, because we found that this leads to superior performance. (We hypothesize that this is because the fastText embeddings were trained on much more data and so captured some semantic aspects that the CDAE did not.) The detailed results of applying adversarial training as well as CDAE’s performance on the Jigsaw 2018 and OffensEval 2019 datasets are shown in Table 4. (The OffensEval challenge evaluates models with a macro-averaged F1 score over both classes, so our numbers are significantly lower than the numbers reported there. We achieve a 0.84 macro-averaged F1 score, beating the state-of-the-art.)
Overall, we find that BERT performs well in the absence of noise on both datasets (None–None setting). As expected, the addition of noise hurts its performance. CDAE, on the other hand, performs well in the noised test set without adversarial training (None–C+D setting), indicating that it indeed manages to at least partly denoise the adversarial utterances. When additional adversarial training is introduced (C+D–C+D setting), BERT and CDAE perform comparably, outperforming all other methods. For OffensEval, we found that BERT was more biased towards the non-toxic class compared to the CDAE, causing it to have much higher precision but slightly lower recall.
Adversarial training improves performance across the board, although performance does not recover to the clean-data standards. Interestingly, classifiers that were more vulnerable to attacks before adversarial training tend to perform poorly even with adversarial training. This implies that the representation of text needs to be inherently robust for adversarial training to be an effective defense. Despite using character CNNs, ELMo was more vulnerable to noise compared to our CDAE (cf. 36% vs. 33% degradation in recall on the Jigsaw 2018 dataset without adversarial training), showing that character CNNs need to be explicitly trained to handle noise/out-of-vocabulary words in order to exhibit robustness.
To better understand the robustness characteristics of our two best models (BERT and CDAE), we perform ablations under various noise settings (only character perturbations, only distractors, adversarial training with a clean test set). Results are shown in Table 5 and we summarize our findings below.
|Model||Train Noise||Test Noise||Jigsaw 2018||OffensEval 2019|
|C + D||None||0.968||0.653||0.904||0.875||0.685||0.739|
|C + D||None||0.968||0.651||0.912||0.858||0.655||0.721|
Character-level perturbation degrades performance more than distractors. For both datasets and models, character-level perturbations lead to significantly larger drops in performance across all metrics. This is reasonable, given that obfuscation directly removes the toxicity signal. The distractors, instead, simply dilute it.
Adversarial training reduces performance on clean data. Although adversarial training consistently improves robustness to noise, it also slightly reduces performance on clean data. This undesirable byproduct can probably be attributed to models overfitting to the training noise.
The CDAE is more resilient against character perturbations compared to BERT. We find that the performance of the CDAE drops less with character-level perturbations both before and after adversarial training. For example, the recall drops by 33% and 24% for BERT before and after adversarial training, whereas for the CDAE the recall drop is 28% and 21%, respectively. This reveals the advantage of the CDAE: it is explicitly trained to address character-level perturbations. BERT’s vulnerability to such noise cannot be easily remedied due to its reliance on a wordpiece tokenizer.
BERT performs better in the presence of distractors compared to the CDAE. In contrast to the CDAE, BERT is weak to character-level perturbations but strong against distractors. For both datasets, BERT achieves better final performance, aside from recall on OffensEval, where BERT was more inclined to predict the non-toxic class compared to the CDAE for all settings. For the Jigsaw dataset, BERT performance drops less in relative terms, although the opposite holds for OffensEval. For OffensEval, the distractors tended to be shorter compared to the Jigsaw dataset since the original text was also generally shorter. This difference in response to distractors may suggest that BERT and the CDAE have different robustness characteristics regarding distractors. A possible explanation might lie in the architecture: BERT is entirely self-attention-based while the CDAE features are fed into a recurrent LSTM. The effect of the different architectures on the robustness characteristics towards distractors remains an open question.
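|Model||Train Noise||Test Noise||Jigsaw 2018: AUC||Jigsaw 2018: F1||Jigsaw 2018: Recall||OffensEval 2019: AUC||OffensEval 2019: F1||OffensEval 2019: Recall|
|Ensemble||None||C + D||0.921||0.590||0.725||0.774||0.505||0.404|
|Ensemble||C + D||C + D||0.942||0.628||0.799||0.827||0.612||0.604|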
Based on our findings, we also examine the performance of an ensemble of BERT and the CDAE, in the hope that it will combine their advantages. The final prediction is made with the arithmetic mean of the two models’ predicted probabilities. Results are shown in Table 6. Indeed, the ensemble outperforms both the single CDAE and BERT models when tested on combined noise, which is consistent with it combining their different robustness characteristics. This suggests that although it may be difficult to train a single model to be robust to all possible attacks, specialized models can be trained to handle different attacks, and their ensemble may be a simple, cheap approach that will boost the robustness of the entire system.
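The ensembling rule is simple enough to state in a few lines. The sketch below assumes each model outputs one toxicity probability per utterance; the function names and the default threshold of 0.5 are illustrative (a tuned threshold could be used instead).

```python
def ensemble_probability(p_bert, p_cdae):
    """Arithmetic mean of the two models' predicted probabilities."""
    return [(a + b) / 2.0 for a, b in zip(p_bert, p_cdae)]


def ensemble_predict(p_bert, p_cdae, threshold=0.5):
    """Binary toxic/non-toxic decisions from the averaged probabilities."""
    return [1 if p >= threshold else 0
            for p in ensemble_probability(p_bert, p_cdae)]
```

Averaging probabilities requires no additional training, which is what makes this kind of ensemble cheap to deploy.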
Toxic Content Classification.
Since toxic content classification is a text classification task, traditional techniques ranging from bag of words models [georgakopoulos2018convolutional] to CNNs [georgakopoulos2018convolutional] and RNNs [van2018challenges, gunasekara2018review] have all been applied. Both [van2018challenges] and [gunasekara2018review] have shown that among the various approaches, bidirectional RNNs with attention using pretrained fastText embeddings [joulin2016bag] have strong performance, with [gunasekara2018review] achieving the best single-model performance on the Jigsaw 2018 dataset using a bidirectional LSTM with attention. [mishra-etal-2018-neural] developed an approach that uses a character-level model to mimic GloVe [pennington-etal-2014-glove] embeddings, thus inferring the embeddings for unseen words. Crucially, this method can only train the model on in-vocabulary words, meaning it is incapable of handling targeted character obfuscations that do not appear naturally in the GloVe vocabulary.
Noise and Adversarial Attacks in Text.
[belinkov2018synthetic] demonstrated the brittleness of neural machine translation (NMT) systems to both natural and synthetic noise. They showed that training on synthetically noised data improves robustness towards similar synthetic noise but not to naturally occurring noise (e.g., omissions). In contrast to their work, we focus on targeted adversarial attacks that deliberately attempt to fool a classifier. Multiple works have proposed white-box attacks (attacks assuming access to model gradients) in NLP for tasks such as NMT [ebrahimi2018] and text classification [Ebrahimi2017HotFlipWA]. [SamantaM17] constructed a lexicon of words to construct adversarial examples, which is similar to but crucially different from our approach in that they assume access to model gradients. Other work has explored black-box attacks [Liang:2018:DTC:3304222.3304355]. In particular, [HosseiniKZP17] generated adversarial attacks against the Google Perspective API, a public API for detecting toxic content, and showed the brittleness of this system. However, these methods rely on multiple queries to the underlying prediction scores of the model, which are not always exposed to a user and can be seen as a form of internal knowledge. [W16-5603] showed that the gender of posters on social media can be obfuscated by using a background corpus to identify words indicative of each gender and replacing those words with semantically similar words. Our method differs in that we replace words with similar-looking instead of similar-meaning character sequences, since our aim is to fool the system while maintaining readability.
Defenses. One straightforward, yet non-scalable approach to solving the problem of adversarial noise is to manually curate a lexicon of the most frequent obfuscations [Wang:2014:CET:2531602.2531734]. On the other hand, [Rojas-Galeano:2017:OOO:3079924.3032963] proposed a method to automatically match obfuscated words to their original forms using a custom edit distance function. Although their approach is more scalable, it still requires the manual construction of inflexible rules for measuring the distance caused by different transformations and thus can easily be circumvented by adversaries. [W17-3005] proposed to use the errors from a class-conditioned character-level language model to classify out-of-vocabulary words as toxic or non-toxic. [scrnn] proposed the semi-character level recurrent neural network (scRNN) as a method of generating robust word representations. Although their method showed strong performance in spell checking, it is unable to handle anagrams (e.g., “there” and “three”) and homoglyph substitutions, and ignores contextual information. One limitation of these approaches is that they do not consider the possibility of toxic words being mapped to in-vocabulary words. For instance, “suck my duck” is likely an obfuscation, but the word “duck” itself is common. These problems require the usage of context: for instance, “duck” in “the duck is swimming” is not toxic, but this can only be inferred based on the context. Moreover, none of these approaches consider distractor injection as a potential method of attack.
Conclusion and Future Work
In this paper, we show that we can easily degrade the performance of state-of-the-art toxic content classifiers in a model-agnostic manner by using a background corpus of toxicity and introducing character-level perturbations as well as distractors. We also explore defenses against these attacks, and we find that adversarial training improves robustness in general but decreases performance on clean data. We also propose the Contextual Denoising Autoencoder (CDAE), a method of learning robust representations, and show that these representations are more robust against character-level perturbations, whereas a BERT-based model performs strongly in the presence of distractors. An ensemble of BERT and the CDAE is the most robust approach towards combined noise.
|Learning rate schedule||slanted_triangular|