FireBERT: Hardening BERT-based classifiers against adversarial attack

08/10/2020
by Gunnar Mein, et al.

We present FireBERT, a set of three proof-of-concept NLP classifiers hardened against TextFooler-style word-perturbation attacks by producing diverse alternatives to the original samples. In one approach, we co-tune BERT against the training data and synthetic adversarial samples. In a second approach, we generate the synthetic samples at evaluation time through substitution of words and perturbation of embedding vectors; the diversified evaluation results are then combined by voting. A third approach replaces evaluation-time word substitution with perturbation of embedding vectors. We evaluate FireBERT on the MNLI and IMDB Movie Review datasets, both on the original samples and on adversarial examples generated by TextFooler. We also test whether TextFooler is less successful at creating new adversarial samples when attacking FireBERT than when attacking unhardened classifiers. We show that it is possible to improve the accuracy of BERT-based models in the face of adversarial attacks without significantly reducing accuracy on regular benchmark samples. We present co-tuning with a synthetic data generator as a highly effective method, protecting against 95% of pre-manufactured adversarial samples while maintaining 98% of benchmark performance. We also demonstrate evaluation-time perturbation as a promising direction for further research, restoring accuracy up to 75% of benchmark performance on pre-made adversarials, and up to 65% of 75 …
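To make the evaluation-time idea concrete, the following is a minimal sketch (not the authors' released code) of the third variant described above: the input embedding vectors are perturbed at evaluation time, each perturbed copy is classified, and the results are combined by voting. The backbone checkpoint, noise scale, and number of votes are illustrative assumptions; a real FireBERT model would load its fine-tuned weights instead.

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    MODEL_NAME = "bert-base-uncased"   # assumed backbone; fine-tuned weights would be loaded here
    tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
    model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
    model.eval()

    def vote_predict(text, n_votes=8, noise_std=0.05):
        """Classify `text` n_votes times with Gaussian-perturbed input
        embeddings and return the majority-vote label."""
        enc = tokenizer(text, return_tensors="pt", truncation=True)
        # Look up the (unperturbed) input embeddings once.
        embed_layer = model.get_input_embeddings()
        base_embeds = embed_layer(enc["input_ids"])
        votes = []
        with torch.no_grad():
            for _ in range(n_votes):
                # Perturb the embedding vectors to create a diversified copy of the sample.
                noisy = base_embeds + noise_std * torch.randn_like(base_embeds)
                logits = model(inputs_embeds=noisy,
                               attention_mask=enc["attention_mask"]).logits
                votes.append(int(logits.argmax(dim=-1)))
        # Combine the diversified evaluation results by majority vote.
        return max(set(votes), key=votes.count)

    print(vote_predict("The movie was surprisingly good."))

The same voting scaffold would apply to the second variant, with word-substitution alternatives generated alongside the embedding perturbations; the noise scale and vote count would need to be tuned so that clean-sample accuracy is not degraded.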


