Adversarial alignment: Breaking the trade-off between the strength of an attack and its relevance to human perception

06/05/2023
by Drew Linsley, et al.

Deep neural networks (DNNs) are known to be fundamentally sensitive to adversarial attacks: perturbations of the input that are imperceptible to humans yet powerful enough to change a model's visual decision. Adversarial attacks have long been considered the "Achilles' heel" of deep learning, one that may eventually force a shift in modeling paradigms. Nevertheless, the formidable capabilities of modern large-scale DNNs have somewhat eclipsed these early concerns. Do adversarial attacks continue to pose a threat to DNNs? Here, we investigate how the robustness of DNNs to adversarial attacks has evolved as their accuracy on ImageNet has continued to improve. We measure adversarial robustness in two ways: first, the smallest adversarial attack needed to change a model's object categorization decision; second, how aligned successful attacks are with the features that humans find diagnostic for object recognition. We find that adversarial attacks induce larger and more easily detectable changes to image pixels as DNNs become more accurate on ImageNet, but that these attacks are also becoming less aligned with the features humans find diagnostic for recognition. To better understand the source of this trade-off, we turn to the neural harmonizer, a DNN training routine that encourages models to leverage the same features as humans to solve tasks. Harmonized DNNs achieve the best of both worlds: the attacks they experience are detectable and affect features that humans find diagnostic for recognition, meaning that such attacks are more likely to be rendered ineffective by inducing similar effects on human perception. Our findings suggest that the sensitivity of DNNs to adversarial attacks can be mitigated by DNN scale, data scale, and training routines that align models with biological intelligence.
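The two measurements described above can be sketched in code. The following is a minimal, illustrative PyTorch example, not the authors' implementation: it binary-searches for the smallest L2 PGD perturbation that flips a classifier's decision, then scores how well that perturbation aligns with a human feature-importance map via Spearman correlation. It assumes a classifier in eval mode and a single-image batch of shape (1, C, H, W); `pgd_l2`, `smallest_attack`, `human_alignment`, and the `human_map` input are hypothetical names introduced for illustration, and the paper's harmonized training and exact attack protocol are not reproduced here.

```python
import torch
import torch.nn.functional as F
from scipy.stats import spearmanr


def pgd_l2(model, x, label, eps, steps=20):
    """Projected gradient descent constrained to an L2 ball of radius eps."""
    step_size = 2.5 * eps / steps
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), label)
        grad, = torch.autograd.grad(loss, delta)
        # Normalized gradient ascent step on the classification loss.
        grad = grad / (grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12)
        delta = delta.detach() + step_size * grad
        # Project the perturbation back onto the L2 ball of radius eps.
        norms = delta.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
        delta = (delta * torch.clamp(eps / (norms + 1e-12), max=1.0)).requires_grad_(True)
    return delta.detach()


def smallest_attack(model, x, label, lo=0.0, hi=10.0, iters=12):
    """Binary search over the L2 budget for the smallest decision-flipping attack."""
    best = None
    for _ in range(iters):
        eps = 0.5 * (lo + hi)
        delta = pgd_l2(model, x, label, eps)
        if (model(x + delta).argmax(1) != label).item():
            best, hi = delta, eps   # attack succeeded: try a smaller budget
        else:
            lo = eps                # attack failed: allow a larger budget
    return best


def human_alignment(delta, human_map):
    """Spearman correlation between per-pixel attack magnitude and a human importance map."""
    attack_map = delta.abs().sum(1).flatten().cpu().numpy()
    rho, _ = spearmanr(attack_map, human_map.flatten())
    return rho
```

Binary search over the attack budget is one simple way to estimate the minimal perturbation; the correlation step mirrors the idea of comparing where an attack concentrates its changes with where humans find diagnostic features for recognition.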


Related research

08/01/2023 · Training on Foveated Images Improves Robustness to Adversarial Attacks
Deep neural networks (DNNs) have been shown to be vulnerable to adversar...

11/08/2022 · Harmonizing the object recognition strategies of deep neural networks with humans
The many successes of deep neural networks (DNNs) over the past decade h...

12/04/2022 · Recognizing Object by Components with Human Prior Knowledge Enhances Adversarial Robustness of Deep Neural Networks
Adversarial attacks can easily fool object recognition systems based on ...

01/04/2017 · Dense Associative Memory is Robust to Adversarial Inputs
Deep neural networks (DNN) trained in a supervised way suffer from two k...

08/07/2023 · Fixed Inter-Neuron Covariability Induces Adversarial Robustness
The vulnerability to adversarial perturbations is a major flaw of Deep N...

01/21/2020 · Massif: Interactive Interpretation of Adversarial Attacks on Deep Learning
Deep neural networks (DNNs) are increasingly powering high-stakes applic...

05/11/2018 · Breaking Transferability of Adversarial Samples with Randomness
We investigate the role of transferability of adversarial attacks in the...
