Understanding the Logit Distributions of Adversarially-Trained Deep Neural Networks

08/26/2021
by   Landan Seguin, et al.

Adversarial defenses train deep neural networks to be invariant to the input perturbations introduced by adversarial attacks. Almost all defense strategies achieve this invariance through adversarial training, i.e., training on inputs with adversarial perturbations. Although adversarial training is successful at mitigating adversarial attacks, the behavioral differences between adversarially-trained (AT) models and standard models are still poorly understood. Motivated by a recent study on learning robustness without input perturbations by distilling an AT model, we explore what is learned during adversarial training by analyzing the distribution of logits in AT models. We identify three logit characteristics essential to learning adversarial robustness. First, we provide a theoretical justification for the finding that adversarial training shrinks two important characteristics of the logit distribution: the max logit values and the "logit gaps" (the difference between the largest and second-largest logit) are on average lower for AT models. Second, we show that AT and standard models differ significantly in which samples they assign high or low confidence, and we illustrate clear qualitative differences by visualizing the samples with the largest confidence difference. Finally, we find that learning information about incorrect classes is essential to learning robustness: we manipulate the non-max logit information during distillation and measure the impact on the student's robustness. Our results indicate that learning some adversarial robustness without input perturbations requires a model to learn specific sample-wise confidences and incorrect class orderings that follow complex distributions.
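For concreteness, the two logit statistics the abstract refers to (the max logit and the "logit gap") can be computed directly from a model's output logits. The sketch below is a minimal PyTorch illustration, not the authors' code; the model names in the usage comments are hypothetical placeholders.

```python
import torch

def logit_statistics(logits: torch.Tensor):
    """Compute per-sample max logit and "logit gap" (largest minus
    second-largest logit) for a batch of logits of shape
    (batch_size, num_classes)."""
    # topk returns the two largest logits per row, in descending order.
    top2, _ = logits.topk(2, dim=1)
    max_logit = top2[:, 0]
    logit_gap = top2[:, 0] - top2[:, 1]
    return max_logit, logit_gap

# Hypothetical usage, assuming `at_model` and `std_model` are trained
# classifiers and `x` is a batch of inputs:
#   at_max, at_gap = logit_statistics(at_model(x))
#   std_max, std_gap = logit_statistics(std_model(x))
# Per the abstract, both statistics are on average lower for AT models.
```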


