Adversarial Training for High-Stakes Reliability

05/03/2022
by Daniel M. Ziegler, et al.

In the future, powerful AI systems may be deployed in high-stakes settings, where a single failure could be catastrophic. One technique for improving AI safety in high-stakes settings is adversarial training, which uses an adversary to generate examples to train on in order to achieve better worst-case performance. In this work, we used a language generation task as a testbed for achieving high reliability through adversarial training. We created a series of adversarial training techniques – including a tool that assists human adversaries – to find and eliminate failures in a classifier that filters text completions suggested by a generator. In our simple "avoid injuries" task, we determined that we can set very conservative classifier thresholds without significantly impacting the quality of the filtered outputs. With our chosen thresholds, filtering with our baseline classifier decreases the rate of unsafe completions from about 2.4% to a rate near the limit of our ability to measure. We found that adversarial training significantly increased robustness to the adversarial attacks that we trained on, without affecting in-distribution performance. We hope to see further work in the high-stakes reliability setting, including more powerful tools for enhancing human adversaries and better ways to measure high levels of reliability, until we can confidently rule out the possibility of catastrophic deployment-time failures of powerful models.


Related research

06/26/2021  Multi-stage Optimization based Adversarial Training
In the field of adversarial robustness, there is a common practice that ...

06/04/2022  Soft Adversarial Training Can Retain Natural Accuracy
Adversarial training for neural networks has been in the limelight in re...

01/12/2020  Fast is better than free: Revisiting adversarial training
Adversarial training, a method for learning robust deep networks, is typ...

07/27/2018  From Adversarial Training to Generative Adversarial Networks
In this paper, we are interested in two seemingly different concepts: ad...

04/18/2020  Single-step Adversarial training with Dropout Scheduling
Deep learning models have shown impressive performance across a spectrum...

08/26/2022  Lower Difficulty and Better Robustness: A Bregman Divergence Perspective for Adversarial Training
In this paper, we investigate on improving the adversarial robustness ob...

05/22/2023  Adversarial Nibbler: A Data-Centric Challenge for Improving the Safety of Text-to-Image Models
The generative AI revolution in recent years has been spurred by an expa...
