Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks

11/18/2022
by Stephen Casper, et al.

Deep neural networks (DNNs) are powerful, but they can make mistakes that pose significant risks. A model performing well on a test set does not imply safety in deployment, so it is important to have additional tools to understand its flaws. Adversarial examples can help reveal weaknesses, but they are often difficult for a human to interpret or to draw generalizable, actionable conclusions from. Some previous works have addressed this by studying human-interpretable attacks. We build on these with three contributions. First, we introduce Search for Natural Adversarial Features Using Embeddings (SNAFUE), a fully automated method for finding "copy/paste" attacks in which one natural image can be pasted into another to induce an unrelated misclassification. Second, we use SNAFUE to red team an ImageNet classifier and identify hundreds of easily describable sets of vulnerabilities. Third, we compare this approach with other interpretability tools by attempting to rediscover trojans. Our results suggest that SNAFUE can be useful for interpreting DNNs and generating adversarial data for them. Code is available at https://github.com/thestephencasper/snafue
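To make the notion of a copy/paste attack concrete, the sketch below evaluates a single candidate attack: a natural patch cropped from a source image is pasted into a target image, and the classifier's top-1 prediction is compared before and after the paste. The model choice, patch coordinates, and file paths are illustrative assumptions, not the authors' SNAFUE implementation.

```python
import torch
import torchvision.transforms as T
from torchvision import models
from PIL import Image

# Pretrained ImageNet classifier (any torchvision classifier would do here).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def predict(img: Image.Image) -> int:
    """Return the top-1 ImageNet class index for a PIL image."""
    with torch.no_grad():
        logits = model(preprocess(img).unsqueeze(0))
    return int(logits.argmax(dim=1))

def paste_patch(target: Image.Image, source: Image.Image,
                box=(0, 0, 64, 64), location=(10, 10)) -> Image.Image:
    """Crop a natural patch from `source` and paste it into a copy of `target`."""
    patched = target.copy()
    patched.paste(source.crop(box), location)
    return patched

# Placeholder file paths; any natural images of ImageNet-like content work.
target = Image.open("target.jpg").convert("RGB")
source = Image.open("source.jpg").convert("RGB")
before, after = predict(target), predict(paste_patch(target, source))
print(f"before: {before}, after paste: {after}, flipped: {before != after}")
```

A prediction flip to a class unrelated to the pasted content is the kind of vulnerability the paper's automated search is designed to surface at scale.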

