Contemporary state-of-the-art language models such as GPT-2 are rapidly improving, as they are trained on increasingly large datasets and defined using billions of parameters. Language models are currently able to generate coherent text that humans can identify as machine-written text (neural text) with only approximately 54% accuracy, close to random guessing. With this increasing power, language models provide bad actors with the potential to spread misinformation on an unprecedented scale and to undermine clear authorship.
To reduce the spread of misinformation via language models and give readers a better sense of which entity (machine or human) actually wrote a piece of text, multiple neural text detection methods have been proposed. Two automatic neural text detectors are considered in this work: RoBERTa [3, 4] and GROVER [5], which are 95% and 92% accurate, respectively, in discriminating neural text from human-written text. Another tool, GLTR [2], is designed to assist humans in detecting neural text, increasing humans' ability to correctly distinguish neural text from human-written text from 54% to 72%. Fundamentally, these detectors rely on the fact that neural text follows predictable patterns determined by its underlying language model generator.
Attacks on machine learning models, called adversarial attacks [6, 7, 8, 9], have been studied in depth and used both to expose security holes and to understand how machine learning models function by purposefully causing them to make mistakes.
Historically, homoglyph attacks (https://en.wikipedia.org/wiki/IDN_homograph_attack) have been used to direct victims to malicious websites by replacing characters in a trusted URL with similar-looking ones, called homoglyphs. Part of this work seeks to test whether homoglyph attacks can also be used to create effective black-box adversarial attacks on neural text detectors.
2 Threat Model and Proposed Attacks
In this paper, two classes of attacks on neural text detectors are proposed. Both of these attacks attempt to modify neural text in ways that are relatively visually imperceptible to humans, but will cause a neural text detector to misclassify the text as human-written. Specifically, these attacks change the underlying distribution of neural text so that it diverges from that of the language model which generated it.
The first class of attacks are non human-like attacks, which imperceptibly (according to humans) change neural text in a way that humans normally would not. This class of attack shifts the modified text’s distribution away from its original one. In this work, the non-human like attacks are realized by swapping selected characters with Unicode homoglyphs (e.g. changing English “a”s to Cyrillic “a”s throughout a neural text sample). Homoglyphs are chosen because they appear visually similar to their counterparts, but get tokenized differently by neural text detectors.
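A homoglyph swap of this kind can be sketched in a few lines. The mapping and helper below are hypothetical illustrations (not the authors' implementation), replacing Latin "a" and "e" with their Cyrillic look-alikes:

```python
# Minimal sketch of a homoglyph substitution. The mapping and function names
# are illustrative, not the authors' exact code.
HOMOGLYPHS = {
    "a": "\u0430",  # Latin 'a' -> Cyrillic 'a' (U+0430)
    "e": "\u0435",  # Latin 'e' -> Cyrillic 'e' (U+0435)
}

def homoglyph_attack(text: str, targets=("a", "e")) -> str:
    """Replace every occurrence of the target characters with homoglyphs."""
    return "".join(HOMOGLYPHS.get(ch, ch) if ch in targets else ch for ch in text)

modified = homoglyph_attack("a language model made this")
# The result looks identical on screen but the code points differ,
# so a detector's tokenizer sees different input.
print(modified == "a language model made this")  # False
```

Because the modified string contains Cyrillic code points, a subword tokenizer that has never seen them in English text will split the words very differently, which is exactly the distribution shift the attack exploits.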
The second class of attacks are human-like attacks, which imperceptibly (according to humans) change neural text in a way that humans normally would. In this paper, this class of attack is realized by randomly swapping correctly spelled words with common human misspellings throughout a neural text sample, which from here onward is referred to as a "misspelling attack." However, this is not the only way human-like attacks may be implemented. This class of attack may also target word choice, grammar, or punctuation. Misspelling attacks are simply a proof of concept for this larger umbrella of human-like attacks.
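A misspelling attack of this kind can be sketched as follows. The tiny dictionary below is a hypothetical stand-in for the full Wikipedia list of common misspellings used in the experiments:

```python
import random

# Illustrative sketch of a misspelling attack; MISSPELLINGS is a toy
# stand-in for the full list of common human misspellings.
MISSPELLINGS = {
    "definitely": "definately",
    "received": "recieved",
    "separate": "seperate",
}

def misspelling_attack(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly misspell roughly `rate` of the words in `text`."""
    rng = random.Random(seed)
    words = text.split()
    # Only words with a known human misspelling are candidates for swapping.
    candidates = [i for i, w in enumerate(words) if w.lower() in MISSPELLINGS]
    k = max(1, round(rate * len(words))) if candidates else 0
    for i in rng.sample(candidates, min(k, len(candidates))):
        words[i] = MISSPELLINGS[words[i].lower()]
    return " ".join(words)
```

Unlike the homoglyph attack, the output here contains only ordinary English characters; the perturbation is human-like because real writers make exactly these errors.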
A neural text dataset containing 5,000 text samples generated by GPT-2 1.5B using top-k 40 sampling was used to evaluate attacks in all experiments. This dataset was taken from a public GitHub repository (https://github.com/openai/gpt-2-output-dataset).
In all experiments except for the transferability tests, an open-source implementation of the automatic RoBERTa neural text detector (https://github.com/openai/gpt-2-output-dataset/tree/master/detector) was used. Before the attacks, RoBERTa's recall on neural text was 97.44%. In this paper, five experiments testing homoglyph attacks were conducted, and two were conducted for misspelling attacks.
The first homoglyph experiment in this paper was designed to test the effectiveness of different homoglyph pairs in lowering detector recall on neural text. In this experiment, all attacks were restricted to randomly replacing 1.5% of all the characters in a given neural text sample with homoglyphs. If there were not enough of the character(s) being replaced in a neural text sample to meet this 1.5% quota, the text sample was thrown out and the result of the attack not considered. Even so, every attack in experiments conducted under these conditions was run on at least 2,500 neural text samples.
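The 1.5% budget and the discard rule can be expressed as a small sketch (hypothetical helper names; the character mapping is the same illustrative Latin-to-Cyrillic pair discussed above):

```python
import random

# Illustrative Latin -> Cyrillic homoglyph mapping.
CYRILLIC = {"a": "\u0430", "e": "\u0435"}

def constrained_attack(text: str, budget: float = 0.015, seed: int = 0):
    """Randomly replace `budget` of all characters with homoglyphs.

    Returns None (i.e., the sample is thrown out) when the text does not
    contain enough target characters to meet the quota.
    """
    rng = random.Random(seed)
    quota = round(budget * len(text))
    positions = [i for i, ch in enumerate(text) if ch in CYRILLIC]
    if len(positions) < quota:
        return None  # not enough targets: discard this sample
    chars = list(text)
    for i in rng.sample(positions, quota):
        chars[i] = CYRILLIC[chars[i]]
    return "".join(chars)
```

Sampling replacement positions uniformly at random keeps the attack independent of any particular detector, which is what makes it black-box.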
The second homoglyph experiment took the most effective homoglyph pair found in the first experiment and tested the effectiveness of the homoglyph attack when it was allowed to replace every occurrence of the target character(s).
The third homoglyph experiment was designed to take the most effective homoglyph pair and test how varying frequencies of replacement may affect detector recall on neural text.
The fourth homoglyph experiment was designed to test the transferability of the homoglyph attacks to the GROVER (https://grover.allenai.org/detect) and GLTR (http://gltr.io/dist/index.html) online demos. In this experiment, 20 samples of neural text were randomly selected from the neural text dataset. Then, the most effective homoglyph attack (found in the first homoglyph experiment) was applied to the samples. GROVER's predictions on the original and modified neural text were then recorded. The online demo for GROVER outputs "We are quite sure this was written by a machine" (Machine++), "We think this was written by a machine (but we're not sure)" (Machine+), "We think this was written by a human (but we're not sure)" (Human+), or "We are quite sure this was written by a human" (Human++). A similar experiment was performed on the GLTR demo: the most successful homoglyph attack was applied to 10 samples of text taken randomly from the neural text dataset. (The GLTR interface does not allow many Unicode characters, including Cyrillic ones; thus, the homoglyph attack used for the GLTR experiments was the most successful GLTR-allowed homoglyph attack.) Screenshots of GLTR's graphical interface were then taken before and after the attack, and patterns were observed.
For the misspelling attack experiments, words were randomly misspelled throughout a text sample using a Wikipedia list of words commonly misspelled by humans in the English language (https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines). The attack was restricted to randomly misspelling 5% of the words in each neural text sample in the dataset. The same transferability experiments used for the homoglyph attacks were used for the misspelling attacks, except that instead of replacing characters with homoglyphs, a random 5% of the words in each neural text sample were misspelled.
Code to reproduce results found in this paper can be found at https://github.com/mwolff31/attacking_neural_text_detectors.
Table 1: Detector recall and average confidence after homoglyph attacks replacing 1.5% of characters.

| Original | Homoglyph | Detector Recall | Average Confidence |
| --- | --- | --- | --- |
| a (U+0061), e (U+0065) | a (U+0430), e (U+0435) | 13.57% | 81.61% |
| e (U+0065) | e (U+0435) | 16.11% | 79.43% |
| e (U+0065) | é (U+00E9) | 18.11% | 77.42% |
| a (U+0061), c (U+0063) | a (U+0430), c (U+0441) | 19.96% | 75.98% |
| c (U+0063) | c (U+0441) | 36.94% | 61.78% |
| p (U+0070) | p (U+0440) | 42.25% | 56.99% |
Results for the first homoglyph experiment can be seen in Table 1. Interestingly, replacing vowels with homoglyphs was much more effective than replacing consonants, even when the frequency of replacement was the same. Additionally, attacks using multiple homoglyph pairs were more effective than those using only one.
For the second homoglyph experiment, the most successful homoglyph pair according to Table 1 was used: English "e" and English "a" replaced with Cyrillic "Ye" and Cyrillic "a," respectively. When this homoglyph attack was allowed to replace all of the English "e"s and "a"s in the neural text dataset, RoBERTa's recall on neural text dropped to 0.26%.
The results of the third homoglyph experiment can be seen in Figure 1. The most successful single-character homoglyph attack was used. Neural text detector recall on neural text decreased as the number of characters the homoglyph attack was allowed to replace increased.
The results of the fourth homoglyph experiment indicate that the homoglyph attacks are transferable to other neural text detectors. Before the English “e” and English “a” to Cyrillic “Ye” and Cyrillic “a” attack was implemented, GROVER predicted Machine++ for 19 of the 20 samples, and predicted Human++ for 1 of the 20 samples. After the homoglyph attack, GROVER predicted Machine++ for 3 of the 20 samples, Machine+ for 1 of the 20 samples, Human+ for 1 of the 20 samples, and Human++ for the remaining 15 samples. In an experiment testing the transferability of the homoglyph attack to GLTR, replacing all English “e”s with Latin “é”s across 10 neural text samples consistently shifted histograms and the way GLTR colored the given text in the online demo towards GLTR behavior characteristic of human writing. Graphical results can be seen in Appendix A.
The results of the second misspelling experiment indicate that the misspelling attack is transferable to other neural text detectors as well. Before the misspelling attack was implemented, GROVER predicted Machine++ for 19 of the 20 samples, and predicted Human++ for 1 of the 20 samples. Note that these random samples were different from the ones used for the homoglyph transferability experiment. After the misspelling attack, GROVER predicted Machine++ for 8 of the 20 samples, Machine+ for 2 of the 20 samples, Human+ for 1 of the 20 samples, and Human++ for the remaining 9 samples. Similarly, the misspelling attack was also able to shift GLTR behavior towards that characteristic of humans across 10 neural text samples. An example of this can be seen in Appendix A.
It is interesting that the non human-like attacks were effective at all: the modified text is characteristic of neither human-written nor neural text, yet the neural text detectors predicted it was human-written simply because it was uncharacteristic of neural text. Evidently, automatic neural text detectors learn not to discriminate between neural text and human-written text, but rather to decide what is and is not characteristic of neural text. As the success of the homoglyph attacks presented in this paper shows, this creates a vulnerability in which an adversary can change neural text to be characteristic of neither language models nor humans (e.g., mixing the English and Cyrillic alphabets) and still have the modified neural text classified as human-written.
While homoglyph attacks may be defended against with tactics similar to those employed by modern web browsers and spell checkers, human-like attacks will ultimately be much more difficult to defend against, especially as they increase in complexity and employ methods that create not just spelling errors, but also grammatical errors, or use different sampling mechanisms to encourage different word choice. Such attacks will force neural text detectors to increasingly deepen their understanding not only of what constitutes neural text, but also of what constitutes human-written text.
This work defines two classes of attacks on neural text detectors: non human-like and human-like. Both proved to be very effective in disrupting neural text detectors’ ability to classify neural text accurately. Additionally, this paper sheds some light on what kinds of methods neural text detectors employ, and how these may be exploited. Future work should focus on making neural text detectors robust against the attacks presented in this work, and further explore the extent to which the attacks presented in this paper, particularly human-like attacks, may be deployed on neural text detectors.
-  A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” 2019.
-  S. Gehrmann, H. Strobelt, and A. Rush, “GLTR: Statistical detection and visualization of generated text,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 111–116, Association for Computational Linguistics, 2019.
-  I. Solaiman, M. Brundage, J. Clark, A. Askell, A. Herbert-Voss, J. Wu, A. Radford, G. Krueger, J. W. Kim, S. Kreps, M. McCain, A. Newhouse, J. Blazakis, K. McGuffie, and J. Wang, “Release strategies and the social impacts of language models,” 2019.
-  Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” CoRR, vol. abs/1907.11692, 2019.
-  R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, A. Farhadi, F. Roesner, and Y. Choi, “Defending against neural fake news,” arXiv preprint arXiv:1905.12616, 2019.
-  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in Proceedings of the 2nd International Conference on Learning Representations, 2014.
-  A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok, “Synthesizing robust adversarial examples,” in Proceedings of the 35th International Conference on Machine Learning, pp. 284–293, 2018.
-  M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter, “Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 1528–1540, 2016.
-  J. Ebrahimi, A. Rao, D. Lowd, and D. Dou, “HotFlip: White-box adversarial examples for text classification,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 31–36, Association for Computational Linguistics, 2018.
Appendix A Shifting Neural Text’s Distribution
This experiment, similar to the ones performed by the GLTR authors, was designed to quantify the extent to which a homoglyph attack could shift the distribution of neural text away from that of text produced by a language model. The GPT-2 117M language model (an open-source implementation from https://github.com/huggingface/transformers) was used to generate predictions for each token in a text sample. The token's position within GPT-2 117M's predictions, or rank, was then recorded. Lower ranks indicate alignment with GPT-2 117M's predictions. For human evaluation, 50 randomly chosen text samples were taken from the WebText dataset made available in the same GitHub repository that provided the neural text dataset (https://github.com/openai/gpt-2-output-dataset). Another 50 text samples were randomly chosen from the neural text dataset, to which the English "e" and English "a" to Cyrillic "Ye" and Cyrillic "a" homoglyph attack was applied with no maximum character replacement restriction. The results for this experiment are displayed in Table 2. Overall, the homoglyph attack was successful in shifting neural text's distribution away from that of a language model.
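The rank computation described above can be sketched independently of any particular model. The helper below assumes access to per-position logits over the vocabulary (here a toy array stands in for GPT-2 117M's output; the function name is illustrative):

```python
import numpy as np

def token_ranks(logits: np.ndarray, token_ids) -> list:
    """For each position, the rank of the actual token in the model's
    predicted distribution (rank 0 = the model's top prediction)."""
    ranks = []
    for step_logits, tok in zip(logits, token_ids):
        # Rank = number of vocabulary items the model scored above the
        # token that actually appears in the text.
        ranks.append(int((step_logits > step_logits[tok]).sum()))
    return ranks

# Toy example over a 4-token vocabulary; these logits are made up, not
# real GPT-2 output.
logits = np.array([[2.0, 1.0, 0.5, 0.1],
                   [0.1, 3.0, 0.2, 0.4]])
print(token_ranks(logits, [0, 3]))  # [0, 1]
```

Text that aligns with the model's predictions produces mostly low ranks; a successful homoglyph attack pushes the ranks of subsequent tokens upward, mimicking the rank distribution of human-written text.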