Shielding Google's language toxicity model against adversarial attacks

01/05/2018
by Nestor Rodriguez, et al.

Lack of moderation in online communities enables participants to engage in personal aggression, harassment or cyberbullying, issues that have been accentuated by extremist radicalisation in the contemporary post-truth politics scenario. This kind of hostility is usually expressed by means of toxic language, profanity or abusive statements. Recently, Google developed a machine-learning-based toxicity model to assess the hostility of a comment; unfortunately, it has been suggested that this model can be deceived by adversarial attacks that manipulate the text sequence of the comment. In this paper we first characterise such adversarial attacks as using obfuscation and polarity transformations. The former deceives by corrupting toxic trigger content with typographic edits, whereas the latter deceives by grammatical negation of the toxic content. We then propose a two-stage approach to counter these attacks, building upon a recently proposed text deobfuscation method and the toxicity scoring model. Lastly, we conducted an experiment with approximately 24,000 distorted comments, showing that it is feasible to restore the toxicity scores of the adversarial variants, at the cost of roughly a twofold increase in processing time. Even though new adversarial challenges will keep arising from the versatile nature of written language, we anticipate that techniques combining machine learning and text pattern recognition methods, each targeting different layers of linguistic features, will be needed to achieve robust detection of toxic language, thus fostering aggression-free digital interaction.
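For illustration, the two-stage idea of scoring both the raw comment and a deobfuscated variant can be sketched as below. This is a minimal sketch, not the authors' implementation: it assumes access to Google's Perspective API (the commentanalyzer endpoint with the TOXICITY attribute), and the deobfuscate helper is a hypothetical character-level normaliser standing in for the deobfuscation method referenced in the paper.

```python
import re
import requests

# Google's Perspective API endpoint (the public toxicity scoring service).
PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def toxicity_score(text: str, api_key: str) -> float:
    """Return the TOXICITY probability in [0, 1] reported by the Perspective API."""
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(PERSPECTIVE_URL, params={"key": api_key}, json=payload)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# Hypothetical stand-in for the deobfuscation stage: undo common typographic
# edits (leetspeak substitutions and punctuation inserted inside words).
LEET_MAP = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"}
)

def deobfuscate(text: str) -> str:
    """Normalise obfuscated text, e.g. "i.d.1.o.t" -> "idiot"."""
    cleaned = text.translate(LEET_MAP)
    # Strip separators inserted between word characters.
    return re.sub(r"(?<=\w)[.\-_*](?=\w)", "", cleaned)

def shielded_score(text: str, api_key: str) -> float:
    """Two-stage scoring: take the higher of the raw and deobfuscated scores,
    so typographic obfuscation cannot lower the toxicity estimate."""
    return max(toxicity_score(text, api_key),
               toxicity_score(deobfuscate(text), api_key))
```

Scoring each comment twice also illustrates where the roughly twofold increase in processing time comes from. Note that this sketch only addresses the obfuscation transformation; countering polarity attacks would additionally require handling the grammatical negation of toxic content.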


