Adversarial Attacks on the Interpretation of Neuron Activation Maximization

06/12/2023
by Geraldin Nanfack, et al.

The internal functional behavior of trained Deep Neural Networks is notoriously difficult to interpret. Activation-maximization approaches are one family of techniques used to interpret and analyze trained deep-learning models. They work by finding inputs that maximally activate a given neuron or feature map; these inputs can be selected from a data set or obtained by optimization. However, interpretability methods can themselves be deceived. In this work, we consider an adversary who manipulates a model for the purpose of deceiving its interpretation. We propose an optimization framework for performing this manipulation and demonstrate a number of ways in which popular activation-maximization interpretation techniques for CNNs can be manipulated to change the interpretations, shedding light on the reliability of these methods.
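
To make the idea concrete, below is a minimal PyTorch sketch of the kind of objective such an adversary could optimize: fine-tune the model so that a chosen unit's activation is boosted on an attacker-picked "decoy" image (changing what activation maximization would surface) while a behavior-preservation term keeps the model's predictions close to a frozen copy. This is an assumed instantiation for illustration only, not the paper's exact framework; the names `decoy`, `target_layer`, `unit_idx`, and `lambda_fool` are hypothetical choices.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Stand-in CNN to manipulate, plus a frozen copy as the behavior reference.
model = models.resnet18(weights=None)
frozen = models.resnet18(weights=None)
frozen.load_state_dict(model.state_dict())
frozen.eval()
for p in frozen.parameters():
    p.requires_grad_(False)

# Forward hook to read the activations of one layer of the manipulated model.
acts = {}
def hook(_, __, out):
    acts["feat"] = out

target_layer = model.layer3            # assumed layer containing the target unit
target_layer.register_forward_hook(hook)

unit_idx = 7                           # assumed feature map to manipulate
decoy = torch.rand(1, 3, 224, 224)     # image the adversary wants to "win"
lambda_fool = 0.1                      # assumed trade-off weight

opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def step(x_batch):
    """One fine-tuning step: preserve outputs, boost the unit on the decoy."""
    opt.zero_grad()
    # (1) Behavior preservation: match the frozen model's logits on real data.
    with torch.no_grad():
        ref_logits = frozen(x_batch)
    preserve = F.mse_loss(model(x_batch), ref_logits)
    # (2) Interpretation manipulation: raise the target unit's mean
    # activation on the decoy image (hook fires during this forward pass).
    model(decoy)
    fool = -acts["feat"][0, unit_idx].mean()
    loss = preserve + lambda_fool * fool
    loss.backward()
    opt.step()
    return loss.item()

# Usage: step(torch.rand(8, 3, 224, 224))  # batch drawn from the training data
```

After enough such steps, the decoy image (or optimized inputs resembling it) would dominate the unit's activation-maximization interpretation, even though the model's input-output behavior on the data distribution is largely unchanged.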
