Feature Purification: How Adversarial Training Performs Robust Deep Learning

by   Zeyuan Allen-Zhu, et al.

Despite the great empirical success of adversarial training to defend deep learning models against adversarial perturbations, so far, it still remains rather unclear what the principles are behind the existence of adversarial perturbations, and what adversarial training does to the neural network to remove them. In this paper, we present a principle that we call "feature purification", where we show the existence of adversarial examples are due to the accumulation of certain "dense mixtures" in the hidden weights during the training process of a neural network; and more importantly, one of the goals of adversarial training is to remove such mixtures to "purify" hidden weights. We present both experiments on the CIFAR-10 dataset to illustrate this principle, and a Theoretical Result proving that for certain natural classification tasks, training a two-layer neural network with ReLU activation using randomly initialized gradient descent indeed satisfies this principle. Technically, we give, to the best of our knowledge, the first result proving that the following two can hold simultaneously for training a neural network with ReLU activation. (1) Training over the original data is indeed non-robust to small adversarial perturbations of some radius. (2) Adversarial training, even with an empirical perturbation algorithm such as FGM, can in fact be provably robust against ANY perturbations of the same radius. Finally, we also prove a complexity lower bound, showing that low complexity models such as linear classifiers, low-degree polynomials, or even the neural tangent kernel for this network, CANNOT defend against perturbations of this same radius, no matter what algorithms are used to train them.


page 3

page 9

page 13

page 14

page 16

page 20

page 22

page 24


On the Convergence of Certified Robust Training with Interval Bound Propagation

Interval Bound Propagation (IBP) is so far the base of state-of-the-art ...

Most ReLU Networks Suffer from ℓ^2 Adversarial Perturbations

We consider ReLU networks with random weights, in which the dimension de...

Gradient Methods Provably Converge to Non-Robust Networks

Despite a great deal of research, it is still unclear why neural network...

Over-parameterized Adversarial Training: An Analysis Overcoming the Curse of Dimensionality

Adversarial training is a popular method to give neural nets robustness ...

Proximal Mapping for Deep Regularization

Underpinning the success of deep learning is effective regularizations t...

A Sublinear Adversarial Training Algorithm

Adversarial training is a widely used strategy for making neural network...

A Spectral View of Adversarially Robust Features

Given the apparent difficulty of learning models that are robust to adve...