Certifiably Robust Interpretation in Deep Learning

05/28/2019
by Alexander Levine, et al.

Although gradient-based saliency maps are popular methods for deep learning interpretation, they can be extremely vulnerable to adversarial attacks. This is especially worrisome given the lack of practical defenses for protecting deep learning interpretations against attacks. In this paper, we address this problem and provide two defense methods for deep learning interpretation. First, we show that a sparsified version of the popular SmoothGrad method, which computes the average saliency map over random perturbations of the input, is certifiably robust against adversarial perturbations. We obtain this result by extending recent bounds for certifiably robust smooth classifiers to the interpretation setting. Experiments on ImageNet samples validate our theory. Second, we introduce an adversarial training approach to further robustify deep learning interpretation by adding a regularization term that penalizes the inconsistency of saliency maps between normal and crafted adversarial samples. Empirically, we observe that this approach not only improves the robustness of deep learning interpretation to adversarial attacks, but also improves the quality of the gradient-based saliency maps.
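As a rough illustration of the first method, the sketch below computes a sparsified SmoothGrad map for a PyTorch classifier: input gradients are averaged over Gaussian-perturbed copies of the input, and only the largest-magnitude entries of the averaged map are kept. The function name, the use of gradient magnitudes, and the hyperparameters (sigma, n_samples, k) are illustrative assumptions, not the paper's exact construction or its certified-robustness analysis.

    # Illustrative sketch only: a sparsified SmoothGrad saliency map for a
    # PyTorch classifier. The top-k sparsification rule and all hyperparameters
    # are assumptions for demonstration, not the paper's certified construction.
    import torch

    def sparsified_smoothgrad(model, x, target_class, sigma=0.15, n_samples=64, k=1000):
        """Average input gradients over Gaussian perturbations of x (shape 1xCxHxW),
        then keep only the k largest-magnitude entries of the averaged map."""
        model.eval()
        grad_sum = torch.zeros_like(x)
        for _ in range(n_samples):
            # Perturb the input with Gaussian noise and track gradients w.r.t. it.
            noisy = (x.detach() + sigma * torch.randn_like(x)).requires_grad_(True)
            score = model(noisy)[0, target_class]
            grad_sum += torch.autograd.grad(score, noisy)[0].abs()

        saliency = grad_sum / n_samples  # SmoothGrad: mean saliency over noisy copies

        # Sparsification: zero out every entry below the k-th largest value.
        threshold = saliency.flatten().topk(k).values.min()
        return torch.where(saliency >= threshold, saliency, torch.zeros_like(saliency))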


Related research

08/22/2021
Robustness-via-Synthesis: Robust Training with Generative Adversarial Perturbations
Upon the discovery of adversarial attacks, robust models have become obl...

02/01/2019
Understanding Impacts of High-Order Loss Approximations and Features in Deep Learning Interpretation
Current methods to interpret deep learning models by generating saliency...

06/26/2020
Proper Network Interpretability Helps Adversarial Robustness in Classification
Recent works have empirically shown that there exist adversarial example...

06/08/2018
Noise-adding Methods of Saliency Map as Series of Higher Order Partial Derivative
SmoothGrad and VarGrad are techniques that enhance the empirical quality...

11/21/2020
Backdoor Attacks on the DNN Interpretation System
Interpretability is crucial to understand the inner workings of deep neu...

11/29/2022
Interpretations Cannot Be Trusted: Stealthy and Effective Adversarial Perturbations against Interpretable Deep Learning
Deep learning methods have gained increased attention in various applica...

05/14/2019
Robustification of deep net classifiers by key based diversified aggregation with pre-filtering
In this paper, we address a problem of machine learning system vulnerabi...
