
Perturbations in the Wild: Leveraging Human-Written Text Perturbations for Realistic Adversarial Attack and Defense

We propose a novel algorithm, ANTHRO, that inductively extracts over 600K human-written text perturbations in the wild and leverages them for realistic adversarial attacks. Unlike existing character-based attacks, which often deductively hypothesize a set of manipulation strategies, our work is grounded in actual observations from real-world texts. We find that adversarial texts generated by ANTHRO achieve the best trade-off between (1) attack success rate, (2) semantic preservation of the original text, and (3) stealthiness, i.e., being indistinguishable from human writings and hence harder to flag as suspicious. Specifically, our attacks accomplish around 83% and 91% attack success rates on BERT and RoBERTa, respectively. Moreover, ANTHRO outperforms the TextBugger baseline by 50% in semantic preservation and 40% in stealthiness when evaluated by both lay and professional human workers. Via adversarial training, ANTHRO can further enhance a BERT classifier's ability to understand different variations of human-written toxic texts, compared to the Perspective API.



1 Introduction

Machine learning (ML) models trained to optimize only prediction performance are often vulnerable to adversarial attacks papernot2016limitations; wang2019towards. In the text domain especially, a character-based adversarial attacker aims to fool a target ML model by generating an adversarial text x' from an original text x by manipulating the characters of different words in x, such that some properties of x are preserved li2018textbugger; VIPER; gao2018black. We characterize strong and practical adversarial attacks by three criteria: (1) attack performance, as measured by the ability to flip a target model’s predictions, (2) semantic preservation, as measured by the ability to preserve the meaning of an original text, and (3) stealthiness, as measured by how unlikely the attack is to be detected as machine manipulation and removed by defense systems or human examiners (Figure 1). While the first two criteria are natural derivations from the adversarial literature papernot2016limitations, stealthiness is also important for a practical attack under a mass-manipulation scenario.

In fact, adversarial text generation remains a challenging task under practical settings.

Previously proposed character-based attacks follow a deductive approach, where the researchers hypothesize a set of text manipulation strategies that exploit some vulnerabilities of textual ML models (Figure 1). Although these deductively derived techniques can demonstrate superior attack performance, there is no guarantee that they also perform well in terms of semantic preservation and stealthiness. We first analyze why enforcing these properties is challenging, especially for character-based attacks.

Figure 1: Anthro (Bottom) extracts and uses human-written perturbations for adversarial attacks instead of proposing a specific set of manipulation rules (Top).

To preserve semantic meaning, an attacker can minimize the distance between the representation vectors of the two sentences learned from a large pre-trained model, e.g., the Universal Sentence Encoder cer2018universal. However, this is only applicable in word- or sentence-based attacks, not in character-based attacks. This is because character-manipulated tokens, e.g., morons→mor0ns, are more prone to become out-of-distribution relative to a typical training corpus, where correct use of English is often assumed. In fact, existing character-based attacks such as TextBugger li2018textbugger, VIPER VIPER and DeepWordBug gao2018black generally assume that the meaning of the original sentence is preserved, without further evaluation.

In addition, a robust ML pipeline is often equipped to detect and remove potential adversarial perturbations, either via automatic software neuspell; pruthi2019combating, trapdoors le2021sweet or humans-in-the-loop malcom. Such detection is feasible especially when the perturbed texts are curated using a set of fixed rules that can be easily re-purposed for defense. Attacks such as VIPER and DeepWordBug, which map each Latin-based character to either non-English accents (e.g., ė, ā, d̃) or homoglyphs (characters of similar shape), fall into this category and can be easily detected with simple normalization techniques (Sec. 4.1). TextBugger circumvents this weakness by utilizing a set of more general character-editing strategies, e.g., replacing and swapping nearby characters to synthesize human-written typos and misspellings. Although texts perturbed by such strategies become less likely to be detected, many of them may distort the meaning of the original text (e.g., “garbage”→“gabrage”, “dumb”→“dub”) and can be easily flagged as machine-generated by human examiners. Therefore, we argue that generating perturbations that both preserve the original meaning and are indistinguishable from human-written texts is a critically important yet challenging task.

To overcome these challenges, we introduce Anthro, a novel algorithm that inductively finds and extracts text perturbations in the wild. As shown in Figure 1, our method relies on human-written sentences on the Web in their raw form. We then use them to develop a character-based adversarial attack that is not only effective and realistic but also helpful in training ML models that are more robust against a wide variety of human-written perturbations. Distinguished from previous research, our work considers both spelling and phonetic features (how a word sounds) to characterize text perturbations. Furthermore, we conduct user studies to quantitatively evaluate the semantic preservation and stealthiness of adversarial texts. Our contributions are as follows.


  • Anthro extracts over 600K case-sensitive character-based “real" perturbations from human-written texts.

  • Anthro facilitates black-box adversarial attacks with an average of 82.7% and 90.7% attack success rates on BERT and RoBERTa, and drops the Perspective API’s precision to only 12%.

  • Anthro outperforms the TextBugger baseline by over 50% in semantic preservation and 40% in stealthiness in human subject studies.

  • Anthro combined with adversarial training also enables a BERT classifier to achieve a 3%–14% improvement in precision over the Perspective API in understanding human-written perturbations.

2 Perturbations in the Wild

2.1 Machine vs. Human Perturbations

Perturbations that neither look natural nor resemble human-written texts are more likely to be detected by defense systems (and thus do not constitute a practical attack from the adversary’s perspective). However, some existing character-based perturbation strategies, including TextBugger, VIPER and DeepWordBug, follow a deductive approach, and their generated texts often do not resemble human-written texts. Qualitatively, we find that humans express much more diverse and creative tagg2011wot perturbations (Figure B.1, Appendix) than those generated by such deductive approaches. For example, humans frequently (1) capitalize and change parts of a word to emphasize distorted meanings (e.g., “democrats”→“democRATs”, “republicans”→“republiCUNTs”), (2) hyphenate a word (e.g., “depression”→“de-pres-sion”), (3) use emoticons to emphasize meaning (e.g., “shit”→“sht”), (4) repeat particular characters (e.g., “dirty”→“diiirty”, “porn”→“pooorn”), or (5) insert phonetically similar characters (e.g., “nigger”→“nighger”). Human-written perturbations do not manifest any fixed rules and often require some understanding of context. Further, one can generate a new meaningful perturbation simply by repeating a character, e.g., “porn”→“pooorn”. Thus, it is challenging, if not impossible, to systematically generate all such perturbations. Moreover, it is very difficult for spell-checkers, which usually rely on a fixed set of common spelling mistakes and an edit-distance threshold, to detect and correct all human-written perturbations.

We later show that human examiners rely on personal exposure to Reddit or YouTube comments to decide whether a word choice looks natural (Sec. 4.2). Quantitatively, we discover that not all perturbations generated by deductive methods are observed on the Web (Table 1). To analyze this, we first use each attack to generate all possible perturbations of either (1) a list of over 3K unique offensive words or (2) a set of the top 5 offensive words (“c*nt”, “b*tch”, “m*therf***er”, “bast*rd”, “d*ck”). Then, we calculate how many of the perturbed words are present in a dataset of over 34M online news comments or are used by at least 50 unique commenters on Reddit, respectively. Even though TextBugger is well known for simulating human-written typos as adversarial texts, merely 51.6% and 7.1% of its perturbations are observed on Reddit and in online news comments, implying that TextBugger’s generated adversarial texts are “unnatural” and “easily detectable” by human-in-the-loop defense systems.

Attacker Reddit Comts. News Comts.
(#texts, #tokens) (>5B, N/A) (34M, 11M)
TextBugger 51.6% (126/244) 7.10% (11K/152K)
VIPER 3.2% (1/31) 0.13% (25/19K)
DeepWordBug 0% (0/31) 0.27% (51/19K)
ANTHRO 82.4% (266/323) 55.7% (16K/29K)
Table 1: Percentage of offensive perturbed words generated by different attacks that can be observed in real human-written comments on Reddit and online news.

2.2 The SMS Property: Similar Sound, Similar Meaning, Different Spelling

The existence of a non-arbitrary relationship between sounds and meanings has been established by a long line of research kohler1967gestalt; jared1991does; gough1972one. In fact, blasi2016sound analyzed over 6K languages and discovered a high correlation between a word’s sound and its meaning, both across and within cultures. aryani2020affective found that how a word sounds links to an individual’s emotions. This motivates our hypothesis that words that are spelled differently yet share the same meaning, such as text perturbations, will also have similar sounds.

Figure B.1 (Appendix) displays several perturbations found in real-life texts. Even though these perturbations are spelled differently from the original words, they all preserve similar meanings when perceived by humans. Such semantic preservation is feasible because humans perceive these variations as phonetically similar to the respective original words van1987rows. For example, both “republican” and “republikan” sound similar when read by humans. Therefore, given the surrounding context of a perturbed sentence, e.g., “President Trump is a republikan”, and the phonetic similarity of “republican” and “republikan”, end-users are more likely to interpret the perturbed sentence as “President Trump is a republican”. We call these characteristics of text perturbations the SMS property: “similar Sound, similar Meaning, different Spelling”. Noticeably, the SMS characterization subsumes a subset of the “visually similar” perturbations studied in previous adversarial attacks such as TextBugger (e.g., “hello” sounds similar to “he11o”), VIPER and DeepWordBug. However, two words that look very similar sometimes carry different meanings, e.g., “garbage”→“gabrage”. Moreover, our characterization is also distinguished from homophones (e.g., “to” and “two”), which are words with similar sounds yet different meanings.

3 A Realistic Adversarial Attack

Given the above analysis, we now derive our proposed Anthro adversarial attack. We first describe how to systematically encode the sound, i.e., the phonetic features, of any given word and use it to search for human-written perturbations that satisfy the SMS property. Then, we introduce an iterative algorithm that utilizes the extracted perturbations to attack textual ML models.

3.1 Mining Perturbations in the Wild

Sound Encoding with Soundex++. To capture the sound of a word, we adopt and extend the case-insensitive Soundex algorithm. Soundex helps index a word based on how it sounds rather than how it is spelled stephenson1980methodology. Given a word, Soundex first keeps the 1st character. Then, it removes all vowels and matches the remaining characters one by one to digits following a set of predefined rules, e.g., “B”, “F”→1 and “D”, “T”→3 stephenson1980methodology. For example, “Smith” and “Smyth” are both encoded as S530.
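A minimal sketch of this classic Soundex encoding (our own illustrative implementation, not the paper's code) might look like:

```python
# Illustrative American Soundex encoder (not the paper's exact implementation).
SOUNDEX_DIGITS = {c: d for cs, d in [
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
] for c in cs}

def soundex(word: str) -> str:
    """Encode a word as its 4-character Soundex code, e.g. 'Smith' -> 'S530'."""
    word = word.lower()
    digits = []
    prev = SOUNDEX_DIGITS.get(word[0], "")  # first letter's digit is skipped but blocks repeats
    for ch in word[1:]:
        code = SOUNDEX_DIGITS.get(ch, "")
        if code and code != prev:            # collapse runs of the same digit
            digits.append(code)
        if ch not in "hw":                   # 'h' and 'w' do not break a run
            prev = code
    return (word[0].upper() + "".join(digits) + "000")[:4]
```

As expected, `soundex("Smith")` and `soundex("Smyth")` both yield `S530`.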

As the Soundex system was designed mainly for encoding surnames, it does not necessarily work for texts in the wild. For example, it cannot recognize visually similar perturbations such as “l”→“1”, “a”→“@” and “O”→“0”. Moreover, it always fixes the 1st character as part of the final encoding. This rule is too rigid and can result in entirely different words being encoded the same (Table 2). To solve these issues, we propose a new Soundex++ algorithm. Soundex++ is equipped to both recognize visually similar characters and encode the sound of a word at different hierarchical levels (Table 2). Particularly, at level n=1, Soundex++ works similarly to Soundex by fixing the first character. At level n, Soundex++ instead fixes the first n characters and encodes the rest.

Word Soundex Soundex++ (Ours)
porn P650 P650 (n=1), PO650 (n=2)
p0rn P065 (✗) (same as above)
lesbian L215 L245 (n=1), LE245 (n=2)
lesbbi@n L21@ (✗) (same as above)
losbian L215 (✗) L245 (n=1), LO245 (n=2)
(✗): Incorrect encoding
Table 2: Soundex++ can capture visually similar characters and is more accurate in differentiating between desired (blue) and undesired (red) perturbations.
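To illustrate the two extensions, here is a toy Soundex++ sketch; the homoglyph map and padding rules below are our own simplifications of what the paper describes, not ANTHRO's actual tables:

```python
# Toy Soundex++: homoglyph-aware, with a variable-length prefix (phonetic level n).
# The homoglyph map is a small assumed subset, not ANTHRO's full table.
HOMOGLYPHS = {"0": "o", "1": "l", "@": "a", "$": "s"}
DIGITS = {c: d for cs, d in [
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
] for c in cs}

def soundex_pp(word: str, n: int = 1) -> str:
    """Encode `word` at phonetic level n: fix the first n characters, encode the rest."""
    w = "".join(HOMOGLYPHS.get(c, c) for c in word.lower())  # normalize look-alikes first
    prefix = w[:n].upper()
    digits, prev = [], DIGITS.get(w[n - 1], "")
    for ch in w[n:]:
        code = DIGITS.get(ch, "")
        if code and code != prev:   # collapse runs of the same digit
            digits.append(code)
        prev = code
    return (prefix + "".join(digits) + "000")[:n + 3]
```

With this sketch, `soundex_pp("p0rn", 1)` equals `soundex_pp("porn", 1)` (both `P650`), matching the intuition behind Table 2.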
Key TH000 DE5263 AR000 DI630 NO300
Value the democrats are dirty not
(Set) demokRATs arre dirrrty
Anthro(democrats, n, d) → {democrats, demokRATs}
Anthro(dirty, n, d) → {dirty, dirrrty}
Table 3: Example of a hash table curated from the sentences “the demokRATs are dirrrty" and “the democrats arre not dirty", and its utilization.
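The curation and lookup illustrated in Table 3 can be sketched as follows. The encoder here is a simplified stand-in for Soundex++ (so the hash keys will not exactly match the table's), and the retrieval mirrors the mining/lookup scheme formalized in the next subsection:

```python
from collections import defaultdict

DIGITS = {c: d for cs, d in [
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
] for c in cs}

def encode(word: str, n: int) -> str:
    """Simplified phonetic code: first n letters + collapsed digit skeleton of the rest."""
    w = word.lower()
    digits, prev = [], ""
    for ch in w[n:]:
        code = DIGITS.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code
    return w[:n].upper() + "".join(digits)

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        new = [i]
        for j, cb in enumerate(b, 1):
            new.append(min(row[j] + 1, new[-1] + 1, row[j - 1] + (ca != cb)))
        row = new
    return row[-1]

def build_tables(corpus, n):
    """Map each phonetic code to the set of case-sensitive tokens sharing it."""
    H = defaultdict(set)
    for sentence in corpus:
        for token in sentence.split():
            H[encode(token, n)].add(token)
    return H

def anthro(word, H, n, d):
    """Retrieve tokens with the same code as `word` within edit distance d."""
    return {t for t in H[encode(word, n)] if levenshtein(t, word) <= d}
```

Built over the two example sentences, `anthro("dirty", H, 2, 2)` returns `{"dirty", "dirrrty"}`, as in Table 3.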

Levenshtein Distance and Phonetic Level as a Semantic Preservation Proxy. Since Soundex++ is not designed to capture a word’s semantic meaning, we utilize both the phonetic-level parameter n and the Levenshtein distance d levenshtein1966binary as a heuristic approximation of the semantic preservation between two words. Intuitively, the higher the phonetic level n at which two words share the same Soundex++ code, and the smaller the Levenshtein distance needed to transform one word into the other, the more likely humans are to associate them with the same meaning. In other words, n and d are hyper-parameters that help control the trade-off between precision and recall when retrieving perturbations of a given word such that they satisfy the SMS property (Figure 2). We will later carry out a human study to evaluate how well our extracted perturbations preserve semantic meaning in practice.

Figure 2: Trade-off between precision and recall of extracted perturbations for the word “president" w.r.t. different n and d values. Higher n and lower d associate with better preservation of the original meaning.

Mining from the Wild. To mine human-written perturbations at scale, we first collect a large corpus D of over 18M sentences written by netizens from 9 different datasets (Table A.1 in Appendix). We select these datasets because they include offensive texts such as hate speech, sensitive search queries, etc., and hence are very likely to contain text perturbations. Next, for each phonetic level n, we curate a hash table H_n that maps each unique Soundex++ code to the set of unique case-sensitive tokens sharing that encoding:

H_n[c] = {t ∈ D : S(t, n) = c},  n = 1, …, N,   (1)

where S(t, n) returns the Soundex++ code of token t at phonetic level n and N is the largest phonetic level we want to encode. With H_n, n and d, we can now search for the set of perturbations of a specific target word w as follows:

Anthro(w, n, d) = {t ∈ H_n[S(w, n)] : L(t, w) ≤ d},   (2)

where L(t, w) returns the Levenshtein distance between t and w. Noticeably, we only extract H_n once from D via Eq. (1); we can then use Eq. (2) to retrieve all perturbations of a given word during deployment. We name this method of mining and retrieving human-written text perturbations in the wild Anthro, a.k.a. human-like perturbations:

1:  Input: H_n, n, d
2:  Input: target classifier F, original sentence x
3:  Output: perturbed sentence x'
4:  Initialize: x' ← x
5:  for each word w in x do:
6:   compute the importance score of w according to F
7:  for each w in x, in decreasing order of importance, do:
8:   C ← Anthro(w, n, d) // Eq. (2)
9:   replace w in x' with the best t ∈ C
10:   if F(x') ≠ F(x) then return x'
11:  return None
Algorithm 1 Anthro Attack Algorithm

Anthro Attack. To utilize Anthro for an adversarial attack on model F, we propose the Anthro attack algorithm (Alg. 1). We use the same iterative mechanism (Ln. 7–10) that is common among other black-box attacks. This process replaces the most vulnerable word in sentence x, as evaluated with the support of a word-importance scoring function (Ln. 5–6), with the perturbation that best drops the prediction probability on the correct label. Unlike the other methods, Anthro inclusively draws from perturbations extracted from human-written texts captured in H_n (Ln. 8). We adopt the word-importance scoring function from TextBugger.

4 Evaluation

We evaluate Anthro on three criteria: (1) attack performance, (2) semantic preservation, and (3) human-likeness, i.e., how likely an adversarial message is to be spotted as machine-generated by human examiners.

4.1 Attack Performance

Setup. We use BERT (case-insensitive) jin2019bert and RoBERTa (case-sensitive) liu2019roberta as target classifiers to attack. We evaluate on three public tasks, namely detecting toxic comments (TC dataset, Kaggle 2018), hate speech (HS dataset hateoffensive), and online cyberbullying texts (CB dataset cyberbullyingdata). We split each dataset into train, validation and test sets with an 8:1:1 ratio. Then, we use the train set to fine-tune BERT and RoBERTa for a maximum of 3 epochs and select the best checkpoint using the validation set. BERT and RoBERTa achieve around 0.85–0.97 in F1 score on the test sets (Table A.2 in Appendix). We evaluate with targeted attacks (changing the positive→negative label) since this is more practical. We randomly sample 200 examples from each test set and use them as the initial sentences to attack. We repeat the process 3 times with unique random seeds and report the averaged results. We use the attack success rate (Atk%) metric, i.e., the number of examples whose labels are flipped by an attacker over the total number of texts that are correctly predicted pre-attack. We use the 3rd-party open-source OpenAttack zeng2020openattack framework to run all evaluations.

Baselines. We compare Anthro with three baselines, namely TextBugger li2018textbugger, VIPER VIPER and DeepWordBug gao2018black. These attackers utilize different character-based manipulations to craft their adversarial texts, as described in Sec. 1. From the analysis in Sec. 3.1 and Figure 2, we set n and d so that Anthro achieves a balanced trade-off between precision and recall on the SMS property. We examine all attackers under several combinations of normalization layers: (1) Accent normalization (A) and (2) Homoglyph normalization (H), which convert non-English accents and homoglyphs to their corresponding ASCII characters, and (3) Perturbation normalization (P), which normalizes potential character-based perturbations using the SOTA misspelling-correction model NeuSpell neuspell. These normalizers are selected as countermeasures against the perturbation strategies employed by VIPER (non-English accents), DeepWordBug (homoglyphs), and TextBugger and Anthro (misspellings and typos), respectively.
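The accent and homoglyph normalizers can be approximated with Unicode decomposition plus a lookup table. A minimal sketch (the homoglyph map is an assumed small subset of what a real normalizer would use):

```python
import unicodedata

# Assumed subset of a homoglyph table; real normalizers use much larger maps.
HOMOGLYPHS = {"0": "o", "1": "l", "@": "a", "$": "s"}

def normalize_accents(text: str) -> str:
    """(A): decompose characters and drop combining marks, e.g. 'ė' -> 'e'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def normalize_homoglyphs(text: str) -> str:
    """(H): map look-alike characters back to ASCII, e.g. 'he11o' -> 'hello'."""
    return "".join(HOMOGLYPHS.get(c, c) for c in text)
```

For example, `normalize_accents("ėā")` yields `"ea"`, and `normalize_homoglyphs("he11o")` yields `"hello"`, which is why attacks relying purely on accents or homoglyphs are easy to undo.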

Attacker Normalizer BERT (case-insensitive) RoBERTa (case-sensitive)
  TC HS CB TC HS CB
TextBugger - 0.76±0.02 0.94±0.01 0.78±0.03 0.77±0.06 0.87±0.01 0.72±0.01
DeepWordBug - 0.56±0.04 0.68±0.01 0.50±0.02 0.52±0.01 0.42±0.04 0.38±0.04
VIPER - 0.08±0.03 0.01±0.01 0.13±0.02 1.00±0.00 1.00±0.00 0.99±0.01
Anthro - 0.72±0.02 0.82±0.01 0.71±0.02 0.84±0.00 0.93±0.01 0.78±0.01
TextBugger A - - - 0.72±0.02 0.92±0.00 0.74±0.02
DeepWordBug A - - - 0.43±0.02 0.59±0.03 0.43±0.01
VIPER A - - - 0.09±0.01 0.05±0.01 0.17±0.02
Anthro A - - - 0.77±0.02 0.94±0.02 0.84±0.02
TextBugger A+H 0.78±0.03 0.85±0.00 0.79±0.00 0.74±0.02 0.93±0.01 0.77±0.03
DeepWordBug A+H 0.04±0.00 0.06±0.02 0.01±0.01 0.03±0.01 0.01±0.01 0.06±0.02
VIPER A+H 0.07±0.00 0.01±0.01 0.10±0.00 0.13±0.02 0.07±0.01 0.17±0.01
Anthro A+H 0.76±0.02 0.77±0.03 0.73±0.05 0.82±0.02 0.97±0.00 0.82±0.02
TextBugger A+H+P 0.73±0.02 0.64±0.06 0.70±0.04 0.68±0.06 0.57±0.03 0.66±0.04
DeepWordBug A+H+P 0.02±0.01 0.04±0.02 0.01±0.01 0.02±0.01 0.01±0.01 0.02±0.01
VIPER A+H+P 0.12±0.01 0.04±0.01 0.17±0.03 0.11±0.02 0.05±0.01 0.18±0.01
Anthro A+H+P 0.65±0.04 0.64±0.01 0.60±0.05 0.80±0.02 0.91±0.03 0.82±0.02
(-): BERT already applies accent normalization (A) by default; (Red): poor performance (Atk% < 0.15)
Table 4: Averaged attack success rate (Atk%) of different attack methods

Results. Overall, both Anthro and TextBugger perform the best. Being case-sensitive, Anthro performs significantly better on RoBERTa and is competitive on BERT compared to TextBugger (Table 4). This happens because RoBERTa is case-sensitive (unlike the bert-base-uncased model we used) and Anthro is the only case-sensitive attack among all baselines. For example, “democrats”→“democRATs” is considered a perturbation by RoBERTa but not by case-insensitive models. This gives Anthro an advantage in practice, because many popular commercial API services (e.g., the popular Perspective API and the sentiment-analysis and text-categorization APIs from Google) are case-sensitive, i.e., they treat “democrats” and “democRATs” differently (see Table 4).

VIPER achieves a near perfect score on RoBERTa, yet it is ineffective on BERT because RoBERTa uses the accent Ġ as a part of its byte-level BPE encoding liu2019roberta while BERT by default removes all such accents. Since VIPER exclusively utilizes accents, its attacks can be easily corrected by the accents normalizer (Table 4). Similarly, DeepWordBug perturbs texts with homoglyph characters, most of which can also be normalized using a 3rd party homoglyph detector (Table 4).

In contrast, even under all normalizers (A+H+P), TextBugger and Anthro still achieve 66.3% and 73.7% average Atk% across all evaluations. Although NeuSpell neuspell drops TextBugger’s Atk% by 14.7% across all runs, it reduces Anthro’s Atk% by a mere 7.5% on average. This is because TextBugger and NeuSpell (like other dictionary-based typo correctors) rely on fixed deductive rules, e.g., characters swapped or replaced by neighboring letters, for attack and defense respectively. Anthro instead utilizes human-written perturbations, which vary greatly and hence are less likely to be systematically detected. We further discuss the limitations of misspelling correctors such as NeuSpell in Sec. 7.

Attacker Normalizer BERT (case-insensitive) RoBERTa (case-sensitive)
  Toxic Comments HateSpeech Cyberbullying Toxic Comments HateSpeech Cyberbullying
TextBugger - 0.76±0.02 0.94±0.01 0.78±0.03 0.77±0.06 0.87±0.01 0.72±0.01
Anthro+TextBugger - 0.82±0.01 0.97±0.01 0.88±0.04 0.91±0.02 0.97±0.01 0.89±0.02
TextBugger A+H+P 0.73±0.02 0.64±0.06 0.70±0.04 0.68±0.06 0.57±0.03 0.66±0.04
Anthro+TextBugger A+H+P 0.85±0.04 0.79±0.02 0.84±0.03 0.88±0.04 0.93±0.01 0.91±0.01
Table 5: Averaged attack success rate (Atk%) of Anthro+TextBugger and TextBugger

4.2 Human Evaluation

Since Anthro and TextBugger are the top two effective attacks, this section focuses on evaluating their semantic preservation and human-likeness. Given an original sentence x and an adversarial text x' generated by either of the attacks, we design a human study to directly compare Anthro with TextBugger. Specifically, the two alternative hypotheses for our validation are (1) H1: x' generated by Anthro preserves the original meaning of x better than x' generated by TextBugger, and (2) H2: x' generated by Anthro is more likely to be perceived as human-written (and not machine-generated) than x' generated by TextBugger.

Figure 3: Semantic preservation and human-likeness

Human Study Design. We use the two attackers to generate adversarial texts targeting the BERT model on 200 examples sampled from the TC dataset’s test set. We then gather the examples that are successfully attacked by both Anthro and TextBugger. Next, we present a pair of texts, one generated by Anthro and one by TextBugger, together with the original sentence, to human subjects. We then ask them to select (1) which text better preserves the meaning of the original sentence (Figure B.2 in Appendix) and (2) which text is more likely to be written by a human (Figure B.3 in Appendix). To reduce noise and bias, we also provide a “Cannot decide" option for when the quality of both texts is equally good or bad, and we present the two questions as two separate tasks. Since the definition of semantic preservation can be subjective, we recruit human subjects both as (1) Amazon Mechanical Turk (MTurk) workers and (2) professional data annotators at a company with extensive experience annotating texts in domains such as toxic and hate speech. Our human subject study with MTurk workers was IRB-approved. We refer readers to Sec. B.3 (Appendix) for more details on the MTurk setup and study designs.

Reason Favorable Unfavorable
 For Anthro For TextBugger
Genuine Typos stuupid, but, Faoggt sutpid, burt, Foggat
Intelligible faiilure faioure
Sound Preserv. shytty, crp shtty, crsp
Meaning Preserv. ga-y, ashole, dummb bay, alshose, dub
High Search Results sodmized, kiills Smdooized, klils
Personal Exposure ign0rant, gaarbage ignorajt, garage
Word Selection morons→mor0ns edited→ewited
Table 6: Top reasons for favoring Anthro’s perturbations as more likely to be written by a human.

Figure 4: Trade-off among evaluation metrics

Quantitative Results. Rejecting the null hypotheses of both H1 and H2 is statistically significant (p-value < 0.05) (Table A.3). Overall, adversarial texts generated with perturbations mined in the wild are much better at preserving the original semantics, and at resembling human-written texts, than those generated by TextBugger (Figure 3, Left).

Qualitative Analysis. Table 6 summarizes the top reasons why annotators favor Anthro over TextBugger in terms of human-likeness. Anthro’s perturbations are perceived as similar to genuine typos and as more intelligible. They also better preserve both meaning and sound. Moreover, some annotators also rely on personal exposure to Reddit or YouTube comments, or on the frequency of word use via Reddit's search function, to decide whether a word choice is human-written.

5 Anthro+TextBugger Attack

We examine whether perturbations inductively extracted from the wild can improve the deductive TextBugger attack. Hence, we introduce Anthro+TextBugger, which considers the perturbation candidates from both Anthro and TextBugger in Ln. 8 of Alg. 1. Alg. 1 still selects the perturbation that best flips the target model’s prediction.

Attack Performance. Even though Anthro comes second to TextBugger when attacking the BERT model, Table 5 shows that when combined with TextBugger, i.e., Anthro+TextBugger, it consistently achieves superior performance, with an average of 82.7% and 90.7% Atk% on BERT and RoBERTa even under all normalizers (A+H+P).

Semantic Preservation and Human-Likeness. Anthro+TextBugger improves TextBugger’s Atk%, semantic preservation and human-likeness scores by over 8%, 32% and 42% (from the 0.5 threshold) on average (Table 5; Figure 3, Right), respectively. The presence of only a few human-like perturbations generated by Anthro is sufficient to signal that the whole sentence is written by a human, while a single unreasonable perturbation generated by TextBugger can adversely affect its meaning. This explains the performance drop in semantic preservation, but not in human-likeness, when indirectly comparing Anthro+TextBugger with Anthro. Overall, Anthro+TextBugger also has the best trade-off between Atk% and human evaluation, positioning at the top-right corners of Figure 4, with a noticeably superior Atk%.

Model Anthro Anthro+TextBugger
 TC HS CB TC HS CB
BERT 0.72 0.82 0.71 0.82 0.97 0.88
BERT+A+H+P 0.65 0.65 0.60 0.85 0.79 0.84
Adv.Train 0.41 0.30 0.35 0.72 0.72 0.67
SoundCNN 0.14 0.02 0.15 0.86 0.84 0.92
Table 7: Averaged Atk% of Anthro and Anthro+TextBugger against different defense models.

6 Defend Anthro, Attack with Anthro+TextBugger

We suggest two countermeasures against the Anthro attack: (i) Sound-Invariant Model (SoundCNN): when the defender does not have access to the hash tables H_n learned by the attacker, the defender trains a generic model that encodes not the spelling but the phonetic features of a text for prediction. Here we train a CNN model kim-2014-convolutional on top of an embedding layer over the discrete Soundex++ encodings of each token in a sentence. (ii) Adversarial Training (Adv.Train): to overcome the lack of access to H_n, the defender extracts his or her own perturbations in the wild from a separate corpus D′ ≠ D and uses them to augment the training examples, i.e., via self-attack with a 1:1 ratio, to fine-tune a more robust BERT model. We use a corpus of 34M general comments from online news as D′. We compare the two defenses against BERT and against BERT combined with the 3 normalization layers A+H+P. BERT is selected because it is better than RoBERTa at defending against Anthro (Table 4).
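The input featurization for such a sound-invariant model could look like the following sketch, which maps each token to an integer id of its phonetic code rather than of its spelling (the encoder is a simplified stand-in for Soundex++, and the vocabulary class is our own illustrative construction):

```python
# Sketch of SoundCNN's input featurization: tokens -> phonetic codes -> ids.
DIGITS = {c: d for cs, d in [
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
] for c in cs}

def encode(word: str, n: int = 1) -> str:
    """Simplified phonetic code: first n letters + collapsed digit skeleton."""
    w = word.lower()
    digits, prev = [], ""
    for ch in w[n:]:
        code = DIGITS.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code
    return w[:n].upper() + "".join(digits)

class SoundVocab:
    """Maps phonetic codes to integer ids, to feed a downstream embedding layer + CNN."""
    def __init__(self):
        self.code2id = {"<pad>": 0, "<unk>": 1}

    def featurize(self, sentence: str, train: bool = True):
        ids = []
        for token in sentence.split():
            code = encode(token)
            if train and code not in self.code2id:
                self.code2id[code] = len(self.code2id)
            ids.append(self.code2id.get(code, 1))
        return ids
```

Because the encoding is invariant across many spelling perturbations, “dirty” and “dirrrty” map to the same id, so the downstream CNN sees identical inputs for the clean and perturbed sentences.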

Results. Table 7 shows that both SoundCNN and Adv.Train are robust against the Anthro attack, while Adv.Train performs best when defending against Anthro+TextBugger. Since SoundCNN is strictly based on phonetic features, it is vulnerable to Anthro+TextBugger whenever TextBugger’s perturbations are selected. Table 7 also underscores that Anthro+TextBugger is a strong and practical attack; defending against it is thus an important future direction.

7 Discussion and Analysis

Evaluation with Perspective API. We evaluate whether Anthro and Anthro+TextBugger can successfully attack the popular Perspective API, which has been adopted by various publishers (e.g., NYTimes) and platforms (e.g., Disqus, Reddit) to detect toxicity. We evaluate on 200 toxic texts randomly sampled from the TC dataset. Figure 5 (Left) shows that the API provides superior performance compared to a self fine-tuned BERT classifier, yet its precision deteriorates from 0.95 to only 0.9 and 0.82 when 25% and 50% of a sentence are randomly perturbed using human-written perturbations. In contrast, the Adv.Train model (Sec. 6) achieves fairly consistent precision in the same setting. This shows that Anthro is not only a powerful and realistic attack, but can also help develop more robust text classifiers in practice. The API is also vulnerable to both direct (Alg. 1) and transfer Anthro attacks through an intermediate BERT classifier, with its precision dropping to only 0.12 when evaluated against Anthro+TextBugger.

Figure 5: (Left) Precision on human-written perturbed texts synthesized by Anthro and (Right) Robustness evaluation of Perspective API under different attacks
Task Sentiment Analysis Categorization
Anthro 0.80 0.93
Anthro+TextBugger 0.86 1.00
Table 8: Averaged Atk% of Anthro and Anthro+TextBugger in fooling Google Cloud’s sentiment analysis API and text categorization API.

Generalization beyond Offensive Texts. Although Anthro extracts perturbations from abusive datasets, the majority of their texts are non-abusive. Thus, Anthro also learns perturbations of non-abusive English words, e.g., hilarious→Hi-Larious, shot→sh•t. We also make no assumption on the task domains that Anthro can attack. Evidently, Anthro and Anthro+TextBugger achieve 80% and 86% Atk% on sentiment analysis and 93% and 100% Atk% on text categorization when fooling the respective Google Cloud APIs (Table 8).

Computational Complexity. The one-time extraction of D via Eq. (1) costs O(N*L), where N is the # of tokens and L the length of the longest token in the corpus C (hash-map operations cost O(1)). Given a word w and D, Anthro retrieves a list of perturbation candidates via Eq. (2) in O(l*K), where l is the length of w and K is the size of the largest set of tokens sharing the same Soundex++ encoding in D. Since l is constant, the upper bound then becomes O(K).
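The two operations can be sketched with a hash map keyed by phonetic encoding. The sketch below uses a simplified classic Soundex as a stand-in for the paper's Soundex++ (an assumption for illustration only):

```python
from collections import defaultdict

def soundex(word):
    # Simplified classic Soundex, standing in for Soundex++.
    codes = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
             "l": "4", "mn": "5", "r": "6"}
    digits = []
    for ch in word.lower()[1:]:
        for letters, d in codes.items():
            if ch in letters:
                if not digits or digits[-1] != d:
                    digits.append(d)
                break
        else:
            digits.append("")  # vowels, digits, symbols carry no code
    digits = [d for d in digits if d]
    return (word[0].upper() + "".join(digits) + "000")[:4]

def build_index(tokens):
    # One-time extraction pass: O(N*L) overall, O(1) per hash-map insert.
    index = defaultdict(set)
    for tok in tokens:
        index[soundex(tok)].add(tok)
    return index

def candidates(index, word):
    # Retrieval: one O(l) encoding plus an O(1) bucket lookup,
    # then a scan over a bucket of size at most K.
    return index.get(soundex(word), set()) - {word}

idx = build_index(["shot", "sh0t", "shoot", "hello"])
assert candidates(idx, "shot") == {"sh0t", "shoot"}
```

Tokens that sound alike ("shot", "sh0t") land in the same bucket, which is exactly why the bucket size K dominates the retrieval cost.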

Limitation of Misspelling Correctors. Similar to other spell-checkers such as pyspellchecker and symspell, the SOTA NeuSpell depends on a fixed dictionary of common misspellings, or on synthetic misspellings generated by random permutations of characters neuspell. These checkers often assume perturbations lie within an edit-distance threshold of the original words. This makes them easy to evade, since one can generate new perturbations simply by repeating a specific character, e.g., "porn"->"pooorn". Also, due to the iterative attack mechanism (Alg. 1), where each token in a sentence is replaced by many candidates until the correct label's prediction probability drops, Anthro only needs a single good perturbation that is not detected by NeuSpell for a successful replacement. Thus, by characterizing perturbations not only by their spellings but also by their sounds, Anthro is able to mine perturbations that circumvent NeuSpell.
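The edit-distance assumption can be checked directly: a few repeated characters push a perturbation outside a typical search radius, even though it still reads and sounds like the original word. A minimal sketch:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# A corrector that only searches within edit distance <= 2 never sees a
# perturbation made by repeating one character a few extra times.
assert levenshtein("kitten", "sitting") == 3
assert levenshtein("porn", "pooooorn") == 4   # outside a radius of 2
```

Sound-based grouping sidesteps this, since extra repeated characters rarely change a word's phonetic encoding.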

Limitation of Anthro. The perturbation candidate retrieval operation (Eq. (2)) has a higher computational complexity than that of other methods, i.e., O(K) v.s. O(l), where l is the length of an input token (please refer to Sec. 8 in the Appendix for the detailed computational complexity analysis). This can prolong the running time, especially when attacking long documents. However, we can overcome this by pre-computing and storing all the perturbations (given the extracted dictionary D) of the most frequently used offensive and non-offensive English words. We can then expect the operation to have an average complexity close to O(1). The current Soundex++ algorithm is designed for English texts and might not be applicable to other languages. Thus, we plan to extend Anthro to a multilingual setting.
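The caching idea can be sketched as a plain dictionary keyed by the most frequent words; the entries below are illustrative, not mined data:

```python
# Hypothetical precomputed cache of perturbations for frequent words.
PRECOMPUTED = {
    "shot": {"sh0t", "sh*t"},
    "amazon": {"amaz0n"},
}

def cached_candidates(word, slow_lookup):
    # Dict hit: ~O(1) on average; otherwise fall back to the O(K)
    # Soundex-bucket scan (`slow_lookup` stands in for Eq. (2)).
    hit = PRECOMPUTED.get(word)
    return hit if hit is not None else slow_lookup(word)

assert cached_candidates("shot", lambda w: set()) == {"sh0t", "sh*t"}
assert cached_candidates("rare", lambda w: {w.upper()}) == {"RARE"}
```

Because token frequencies are heavily skewed, caching only the head of the distribution already covers most lookups in practice.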

8 Conclusion

We propose Anthro, a character-based attack algorithm that extracts human-written perturbations in the wild and then utilizes them for adversarial text generation. Our approach yields the best trade-off among attack performance, semantic preservation, and stealthiness in both empirical experiments and human studies. A BERT classifier trained on examples augmented by Anthro can also better understand human-written texts.

Broad Impact

To the best of our knowledge, Anthro is the first work that extracts noisy human-written texts, or text perturbations, online. We reiterate what reviewer pvcD observed: Anthro moves "away from deductively-derived attacks to data-driven inspired attacks". This novel direction benefits not only the adversarial NLP community but also other NLP tasks that require understanding realistic, noisy user-generated texts online. Specifically, Sec. 6 and Figure 5 show that our work enables the training of a BERT model that understands noisy human-written texts better than the popular Perspective API. By extending this to other NLP tasks such as Q&A and NLI, we hope to enable current NLP software to perform well in real-life settings, especially on social platforms where user-generated texts are not always in perfect English. Our work also opens a new direction in studying the use of language online and how netizens utilize different forms of perturbations to avoid censorship in this new age of AI.

Ethical Consideration

As with previous works in the adversarial NLP literature, there is a risk that our proposed approach may be utilized by malicious actors to attack textual ML systems. To mitigate this, we will not publicly release the full perturbation dictionary that we have extracted and reported in the paper. Instead, we will provide access to our private API on a case-by-case basis with proper security measures. Moreover, we also suggest and discuss two potential approaches that can defend against our proposed attacks (Sec. 6). We believe that the benefits of our work outweigh its potential risks. All public secondary datasets used in this paper were either open-sourced or released by the original authors.


This research was supported in part by NSF awards #1820609, #1915801, and #2114824.


Appendix A Supplementary Materials

a.1 Additional Results and Figures

Below is a list of the supplementary materials:

  • Table A.1: list of datasets we used to curate the corpus from which human-written perturbations are extracted (Sec. 3.1). All the datasets are publicly available, except for the two private datasets Sensitive Query and Hateful Comments.

  • Table A.2: list of datasets we used to evaluate the attack performance of all attackers (Sec. 4.1) and the prediction performance of BERT and RoBERTa on the respective test sets. All datasets are publicly available.

  • Table A.3: Statistical analysis of the human study results (Sec. 4.2).

  • Figure B.1: Word-clouds of extracted human-written perturbations by Anthro for some popular English words.

  • Figure B.2, B.3: Interfaces of the human study described in Sec. 4.2.

a.2 Infrastructure and Software

Dataset | #Texts | #Tokens
List of Bad Words | 1.9K | 1.9K
Rumours (Twitter) kochkina2018all | 99K | 159K
Twitter | 150K | 328K
Personal Atks (Wiki.) wulczyn_thain_dixon_2017 | 116K | 454K
Toxic Comments (Wiki.) (Kaggle, 2019) | 2M | 1.6M
Unknown | 313K | 857K
Hateful Comments (Reddit) (Kaggle, 2021) | 1.7M | 1M
Sensitive Query (Search Engine, Private) | 1.2M | 314K
Online News | 12.7M | 7M
Total texts used to extract Anthro | 18.3M | -
Table A.1: Real-life datasets that are used to extract adversarial texts in the wild, with the number of total examples (#Texts) and unique tokens (#Tokens, case-insensitive).
Dataset | #Total | BERT | RoBERTa
CB cyberbullyingdata | 449K | 0.84 | 0.84
TC (Kaggle, 2018) | 160K | 0.85 | 0.85
HS hateoffensive | 25K | 0.91 | 0.97
Table A.2: Evaluation datasets Cyberbullying (CB), Toxic Comments (TC) and Hate Speech (HS), and the prediction performance (F1 score) of BERT and RoBERTa on their test sets.
Alternative Hypothesis | Mean | t-stats | p-value | df
AMT Workers as Subjects:
Anthro > TB | 0.82 | 5.66 | 4.1e-7** | 48
Anthro+TB > TB | 0.64 | 1.95 | 2.9e-2* | 46
Anthro > TB | 0.71 | 3.14 | 1.5e-3** | 47
Anthro+TB > TB | 0.70 | 3.00 | 2.2e-3** | 46
Professional Annotators as Subjects:
Anthro > TB | 0.75 | 3.79 | 2.4e-4** | 44
Anthro+TB > TB | 0.68 | 2.49 | 8.6e-3** | 41
Anthro > TB | 0.70 | 3.06 | 1.82e-3** | 50
Anthro+TB > TB | 0.73 | 3.53 | 4.6e-4** | 48
Statistically significant: ** (p-value < 0.01), * (p-value < 0.05)
Table A.3: It is statistically significant (p-value < 0.01) that adversarial texts generated by Anthro are better than those generated by TextBugger (TB) both at preserving the semantics of the original sentences and at being perceived as human-written texts.

Appendix B Implementation Details

b.1 Attackers

We evaluate all the attack baselines using the open-source OpenAttack framework zeng2020openattack. We keep all the default parameters for all the attack methods.

b.2 Defenders

For (1) Accents normalization, we adopt the accents-removal code from the Hugging Face repository. For (2) Homoglyph normalization, we adopt a third-party Python homoglyph library. For (3) Perturbation normalization, we use the state-of-the-art misspelling-based perturbation correction model NeuSpell neuspell. For the Perspective API, we directly use the publicly available API provided by Jigsaw and Google.

b.3 Details of Human Study and Experiment Controls

To ensure high-quality responses, we recruit MTurk workers who are 18 years or older and reside in North America. Workers are recruited using the following qualifications provided by AMT: (1) recognized as "master" workers by the AMT system, (2) have completed at least 5K HITs, and (3) have a historical HIT approval rate of at least 98%. These qualifications are more conservative than those used in previous human studies in the literature. We pay each worker on average around $10 an hour or higher (the federal minimum wage was $7.25 in 2021, when we carried out our study). To limit abusive behaviors, we also impose a minimum attention span of 30 seconds for each question.

Figure B.1: Word-clouds of perturbations in the wild extracted by Anthro for the words "amazon", "republicans", "democrats" and "president".

Figure B.2: User-study design for the semantic preservation comparison between Anthro and TextBugger.

Figure B.3: User-study design for the human-likeness comparison between Anthro and TextBugger.