Achieving Verified Robustness to Symbol Substitutions via Interval Bound Propagation

09/03/2019, by Po-Sen Huang, et al.

Neural networks are part of many contemporary NLP systems, yet their empirical successes come at the price of vulnerability to adversarial attacks. Previous work has used adversarial training and data augmentation to partially mitigate such brittleness, but these are unlikely to find worst-case adversaries due to the complexity of the search space arising from discrete text perturbations. In this work, we approach the problem from the opposite direction: we formally verify a system's robustness against a predefined class of adversarial attacks. We study text classification under synonym replacements and character-flip perturbations. We propose modeling these input perturbations as a simplex and then using Interval Bound Propagation -- a formal model verification method. We modify the conventional log-likelihood training objective to train models that can be efficiently verified, which would otherwise come with exponential search complexity. The resulting models differ little in nominal accuracy, but have much improved verified accuracy under perturbations and come with an efficiently computable formal guarantee on worst-case adversaries.
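The mechanism the abstract describes can be sketched in a few lines: take the embeddings of all allowed substitutions for each position (the vertices of the perturbation simplex), form elementwise lower/upper bounds, and propagate that interval through the network layer by layer. The sketch below is a minimal NumPy illustration of interval bound propagation through one affine + ReLU layer, not the paper's implementation; the function names and shapes are assumptions for the example.

```python
import numpy as np

def interval_from_perturbations(embeddings):
    """Elementwise lower/upper bounds over the perturbed-embedding vertices.

    embeddings: array of shape (num_perturbations, dim), one row per
    allowed substitution (a vertex of the perturbation simplex).
    """
    return embeddings.min(axis=0), embeddings.max(axis=0)

def ibp_affine(l, u, W, b):
    """Propagate an interval [l, u] through x -> W @ x + b.

    Uses the standard center/radius form: the output center is the affine
    image of the input center, and the radius grows by |W| @ r.
    """
    c, r = (l + u) / 2.0, (u - l) / 2.0
    c_out = W @ c + b
    r_out = np.abs(W) @ r
    return c_out - r_out, c_out + r_out

def ibp_relu(l, u):
    """ReLU is elementwise monotone, so bounds map through directly."""
    return np.maximum(l, 0.0), np.maximum(u, 0.0)
```

Soundness is easy to check empirically: the output of the layer on every individual perturbed embedding must lie inside the propagated interval. Verified training then minimizes a loss on the worst-case logits implied by these bounds rather than on a single nominal input.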


Related research

- Expressive Losses for Verified Robustness via Convex Combinations (05/23/2023)
  In order to train networks for verified adversarial robustness, previous...
- IBP Regularization for Verified Adversarial Robustness via Branch-and-Bound (06/29/2022)
  Recent works have tried to increase the verifiability of adversarially t...
- Certified Robustness to Adversarial Word Substitutions (09/03/2019)
  State-of-the-art NLP models can often be fooled by adversaries that appl...
- On the Effectiveness of Interval Bound Propagation for Training Verifiably Robust Models (10/30/2018)
  Recent works have shown that it is possible to train models that are ver...
- Learning to Discriminate Perturbations for Blocking Adversarial Attacks in Text Classification (09/06/2019)
  Adversarial attacks against machine learning models have threatened vari...
- Robust Encodings: A Framework for Combating Adversarial Typos (05/04/2020)
  Despite excellent performance on many tasks, NLP systems are easily fool...
- Differentially Private Adversarial Robustness Through Randomized Perturbations (09/27/2020)
  Deep Neural Networks, despite their great success in diverse domains, ar...
