On Frank-Wolfe Optimization for Adversarial Robustness and Interpretability
Deep neural networks are easily fooled by small perturbations known as adversarial attacks. Adversarial Training (AT) is a technique that approximately solves a robust optimization problem to minimize the worst-case loss and is widely regarded as the most effective defense against such attacks. While projected gradient descent (PGD) has received most attention for approximately solving the inner maximization of AT, Frank-Wolfe (FW) optimization is projection-free and can be adapted to any L^p norm. A Frank-Wolfe adversarial training approach is presented and is shown to provide as competitive level of robustness as PGD-AT without much tuning for a variety of architectures. We empirically show that robustness is strongly connected to the L^2 magnitude of the adversarial perturbation and that more locally linear loss landscapes tend to have larger L^2 distortions despite having the same L^∞ distortion. We provide theoretical guarantees on the magnitude of the distortion for FW that depend on local geometry which FW-AT exploits. It is empirically shown that FW-AT achieves strong robustness to white-box attacks and black-box attacks and offers improved resistance to gradient masking. Further, FW-AT allows networks to learn high-quality human-interpretable features which are then used to generate counterfactual explanations to model predictions by using dense and sparse adversarial perturbations.
READ FULL TEXT