White-Box Attacks on Hate-speech BERT Classifiers in German with Explicit and Implicit Character Level Defense

02/11/2022
by Shahrukh Khan, et al.

In this work, we evaluate the adversarial robustness of BERT models trained on German hate-speech datasets. We complement our evaluation with two novel white-box character- and word-level attacks, thereby extending the range of attacks available. Furthermore, we compare two novel character-level defense strategies and evaluate their robustness against one another.
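To make the notion of a character-level attack concrete, the snippet below sketches two common perturbation types such attacks apply to input words: swapping adjacent characters and substituting visually similar homoglyphs. This is an illustrative sketch, not the paper's method; the function names and the homoglyph table are hypothetical, and a full white-box attack would additionally rank candidate edits by model gradients.

```python
# Illustrative character-level perturbations (hypothetical helpers,
# not from the paper). A white-box attack would score these candidate
# edits with the victim model's gradients and keep the most damaging one.

# Latin -> Cyrillic look-alike substitutions (assumed example table)
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "i": "\u0456"}


def swap_adjacent(word: str, idx: int) -> str:
    """Swap the characters at positions idx and idx + 1."""
    chars = list(word)
    chars[idx], chars[idx + 1] = chars[idx + 1], chars[idx]
    return "".join(chars)


def substitute_homoglyph(word: str) -> str:
    """Replace the first character that has a visually similar look-alike."""
    for i, ch in enumerate(word):
        if ch in HOMOGLYPHS:
            return word[:i] + HOMOGLYPHS[ch] + word[i + 1:]
    return word
```

Character-level defenses of the kind compared in the paper aim to undo exactly such edits, either explicitly (normalizing the input back to canonical characters) or implicitly (making the classifier robust to them during training).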


Related research

06/08/2022
Adversarial Text Normalization
Text-based adversarial attacks are becoming more commonplace and accessi...

06/02/2021
BERT-Defense: A Probabilistic Model Based on BERT to Combat Cognitively Inspired Orthographic Adversarial Attacks
Adversarial attacks expose important blind spots of deep learning system...

12/14/2019
Towards Robust Toxic Content Classification
Toxic content detection aims to identify content that can offend or harm...

07/12/2021
A Closer Look at the Adversarial Robustness of Information Bottleneck Models
We study the adversarial robustness of information bottleneck models for...

05/08/2021
Certified Robustness to Text Adversarial Attacks by Randomized [MASK]
Recently, few certified defense methods have been developed to provably ...

10/31/2022
Character-level White-Box Adversarial Attacks against Transformers via Attachable Subwords Substitution
We propose the first character-level white-box adversarial attack method...

05/27/2019
Combating Adversarial Misspellings with Robust Word Recognition
To combat adversarial spelling mistakes, we propose placing a word recog...
