Deep neural networks (DNNs) have recently transformed many areas, including visual perception, language understanding, and reinforcement learning. Although they have become the most representative intelligent systems, with dominant performance, DNNs are criticized for lacking transparency and interpretability. Better understanding the working mechanisms of machine learning systems has become a pressing demand, one that is not only beneficial to academic research but also significant to many critical industries with stringent safety requirements.
In this paper, we propose a simple and interpretable disentanglement form for deep neural networks, which not only reveals a network's functional behavior but also improves two applications: visual explanation and adversarial example detection. The main idea is to extract a class-specific subnetwork for each semantic category from a pre-trained full model while maintaining comparable prediction performance (Figure 1). To extract the subnetworks effectively, we utilize a knowledge distillation criterion and a model pruning strategy. We observe that the highly compressed subnetworks display an architectural resemblance that mirrors the semantic similarities of their corresponding categories.
Furthermore, the interpretable subnetworks extracted by the proposed method are useful in other tasks. One application is visual explanation, which provides salient regions or features relevant to the model's prediction. We propose a simple improvement to gradient-based visual explanation methods: replace the full model weights with the subnetwork of the requested explanation class. This leads to more accurate and concise salient regions. A similar technique applies to another task, adversarial sample detection, which identifies malicious samples that fool DNN classifiers. We propose to construct a confidence-score-based detector from features generated by the class-specific subnetworks, since these features are observed to be more separable from adversarial data than those generated by the full model.
Let $f$ denote a neural network with pre-trained weights $\theta$. For a test sample $x$, the output of $f$ for class $c$, usually termed the logit, can be expressed as $f_c(x; \theta)$, and $\mathcal{D} = \{x_i\}_{i=1}^{N}$ denotes a collection of samples.
To extract class-specific subnetworks, we consider our problem under the knowledge distillation framework of Hinton et al. Specifically, let $P(x; \theta)$ denote the original prediction made by the full model parametrized by $\theta$ for a single sample $x$. We want to extract a subnetwork, parametrized by control gates $\lambda_c$ for class $c$, whose prediction $P(x; \theta, \lambda_c)$ should be close to that of the full model under the KL divergence. The objective is therefore

$$\min_{\lambda_c} \; \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathrm{KL}\left( P(x; \theta) \,\|\, P(x; \theta, \lambda_c) \right) \right] + \gamma \mathcal{R}(\lambda_c), \tag{1}$$
where $\mathcal{R}(\lambda_c)$ is an extra regularization term encouraging $\lambda_c$ to be sparse and $\gamma$ balances the two terms. We adopt the $\ell_1$-norm as the regularization term, $\mathcal{R}(\lambda_c) = \|\lambda_c\|_1$.
We observe that the original full model already performs well on single-class binary classification. The probability of class $c$ can therefore be represented by transforming the output logit into $p_c(x; \theta) = \sigma(f_c(x; \theta))$, where $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid function. The objective in Equation (1) can then be rewritten as

$$\min_{\lambda_c} \; \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathrm{BCE}\left( p_c(x; \theta), \, p_c(x; \theta, \lambda_c) \right) \right] + \gamma \|\lambda_c\|_1, \tag{2}$$

where $\mathrm{BCE}(p, q) = -p \log q - (1 - p) \log (1 - q)$ is the binary cross-entropy function. Using a Monte Carlo approximation over $\mathcal{D}$, we obtain the final objective for learning the subnetwork $\lambda_c$:

$$\mathcal{L}(\lambda_c) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{BCE}\left( p_c(x_i; \theta), \, p_c(x_i; \theta, \lambda_c) \right) + \gamma \|\lambda_c\|_1. \tag{3}$$
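As a concrete sketch, the Monte Carlo objective above (mean binary cross-entropy between full-model and subnetwork class probabilities, plus a sparsity penalty on the gates) fits in a few lines of NumPy; the function names and the flat gate vector here are illustrative, not the paper's actual implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, q, eps=1e-12):
    # Binary cross-entropy between target probabilities p and predictions q.
    q = np.clip(q, eps, 1.0 - eps)
    return -(p * np.log(q) + (1.0 - p) * np.log(1.0 - q))

def subnetwork_loss(full_logits, sub_logits, gates, gamma=0.05):
    """Monte Carlo objective for one class: distill the full model's
    class probabilities into the gated subnetwork, with an l1 penalty
    pushing control gates toward zero (i.e. toward pruning)."""
    p_full = sigmoid(np.asarray(full_logits))  # "teacher" probabilities
    p_sub = sigmoid(np.asarray(sub_logits))    # subnetwork probabilities
    return bce(p_full, p_sub).mean() + gamma * np.abs(gates).sum()
```

The loss is minimized when the subnetwork reproduces the full model's probabilities with as few active gates as possible, which is exactly the trade-off the objective encodes.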
As for the parametrization of the subnetwork, we associate control gates $\lambda_c^{(l)}$ with the output channels of multiple layers in the network. The control gate values then modulate the output features of the $l$-th layer by channel-wise multiplication.
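A minimal sketch of the channel-wise gating, assuming NCHW feature maps (names are illustrative):

```python
import numpy as np

def apply_control_gates(features, gates):
    """Modulate a layer's output by its class-specific control gates.
    features: (batch, channels, height, width); gates: (channels,).
    A gate at 0 removes its channel from the subnetwork; NumPy
    broadcasting performs the channel-wise multiplication."""
    return features * gates.reshape(1, -1, 1, 1)
```

In a real network the same operation would be inserted after each gated layer (e.g. via a forward hook), so a near-zero gate effectively prunes that channel for the class in question.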
3 Adversarial Sample Detection
One school of adversarial sample detection methods is based on discriminating confidence scores. By evaluating a confidence score derived from density estimators fit to the training data in feature space, one can judge whether a sample lies on the true data manifold with high probability.
With the class-specific subnetworks, we can further increase the discriminability of confidence-score methods. Figure 2(a) displays clustering patterns, but the adversarial samples overlap heavily with the true data manifold. Figure 2(b) demonstrates that adversarial samples become more separable when using features generated by the class-specific subnetworks.
Table 1: Weakly supervised localization errors on the ImageNet validation set. “Normal” indicates the standard explanation practice of using the full model weights regardless of the requested explanation class. “Subnet” indicates the proposed practice of substituting the class-specific subnetworks, optimized on 5,000 held-out images from the ImageNet training set. “Err” indicates localization error (lower is better).
|Dataset||Method||Detection AUROC (%): FGSM / BIM / DeepFool / CW||Unknown Attack Detection AUROC (%): FGSM (seen) / BIM / DeepFool / CW|
|CIFAR-10||KD + PU||81.21 / 82.28 / 81.07 / 55.93||81.21 / 16.16 / 76.80 / 56.30|
Table 2: Adversarial sample detection AUROC (%) for different methods. Our method improves upon the Mahalanobis distance score by replacing the feature extractor with class-specific subnetworks when estimating the empirical means and covariance matrices. For unknown attack detection, FGSM samples (denoted “seen”) are used to train the logistic regression detector.
Detection based on Mahalanobis distance. In this section, we formally present the improved detection algorithm based on the Mahalanobis distance score proposed by Lee et al. Following their definition, the Mahalanobis confidence score for a test sample $x$ is computed by measuring the Mahalanobis distance between $f(x)$ and its closest class-conditional Gaussian distribution in feature space:

$$M(x) = \max_{c} \; -\left( f(x) - \hat{\mu}_c \right)^{\top} \hat{\Sigma}^{-1} \left( f(x) - \hat{\mu}_c \right), \tag{6}$$

where $\hat{\mu}_c$ and $\hat{\Sigma}$ are the empirical class mean and covariance of the training samples' features $f(x_i)$.
With the extracted subnetworks, we can modify the empirical mean and covariance estimation by using the class-specific features $f(x; \theta, \lambda_c)$ instead, where $\lambda_c$ parametrizes the resulting subnetwork for class $c$. Then $\hat{\mu}_c$ and $\hat{\Sigma}$ can be estimated by

$$\hat{\mu}_c = \frac{1}{N_c} \sum_{i: y_i = c} f(x_i; \theta, \lambda_c), \qquad \hat{\Sigma} = \frac{1}{N} \sum_{c} \sum_{i: y_i = c} \left( f(x_i; \theta, \lambda_c) - \hat{\mu}_c \right) \left( f(x_i; \theta, \lambda_c) - \hat{\mu}_c \right)^{\top}.$$
Similar to Lee et al., other low-level features in the neural network can also be combined to estimate confidence, and a logistic regression detector trained on held-out validation data weights the importance of each feature.
We extract class-specific subnetworks from three typical ImageNet pre-trained networks: AlexNet, VGG16, and ResNet50. When optimizing the subnetwork for class $c$, a balanced training set is sampled dynamically at each epoch by including all 1,000 images of class $c$ and an equal number of images chosen randomly from all the other classes. The final subnetwork is the checkpoint with minimum loss whose sparsity falls below the target level; the balance parameter $\gamma$ is held fixed across experiments. The mini-batch size is 64, and the learning rate for the Adam optimizer is 0.1 in all experiments.
Subnetwork Visualization. For each subnetwork, the associated control gates reflect how much each layer is utilized when predicting the corresponding class. Figure 3 displays the relationships between the class-specific subnetworks when their gate vectors are projected onto the 2D plane using the UMAP algorithm. We observe that subnetwork representations tend to be more similar when their corresponding labels are semantically closer.
Improving Visual Explanation. Visual explanation methods usually present highlighted salient regions of the input image as explanation results. Most visual explanation methods [8, 10, 11, 9, 7] generate visual saliency by following specific predefined “layerwise attribution backpropagation” rules. We propose a simple alternative procedure for these gradient-based explanation methods: use the class-specific subnetwork as the model weights when explaining the requested class. Figure 4 shows that with the extracted class-specific subnetworks, these methods generate clearer and more accurate salient regions focused on the main objects.
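To illustrate the substitution on a toy model, here is input-gradient saliency for a one-hidden-layer ReLU network with hand-written backpropagation; applying class-specific gates to the hidden layer before differentiating is the proposed change (the tiny architecture and all names are illustrative, not the paper's setup):

```python
import numpy as np

def gradient_saliency(x, W1, W2, gates, class_idx):
    """|d logit_c / d x| for logit_c = W2[c] . (gates * relu(W1 @ x)).
    With gates all 1 this is standard vanilla-gradient saliency;
    passing a class-specific subnetwork's control gates implements
    the proposed replacement of the full-model weights."""
    h_pre = W1 @ x
    grad_h = W2[class_idx] * gates     # d logit / d relu(h_pre)
    grad_pre = grad_h * (h_pre > 0)    # backprop through ReLU
    return np.abs(W1.T @ grad_pre)     # gradient magnitude w.r.t. input
```

Gated-out hidden units contribute nothing to the gradient, so the resulting saliency map reflects only the pathways the class-specific subnetwork actually uses.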
Weakly Supervised Object Localization. To evaluate the improvement in visual explanation more rigorously, we adopt the Weakly Supervised Object Localization (WSOL) evaluation protocol. Table 1 summarizes the results: the proposed method reduces localization errors across the different explanation methods, validating that the proposed practice helps improve gradient-based visual explanation.
Detecting Adversarial Samples. Following the experimental setup of Lee et al., we experiment with four attack methods: FGSM, BIM, DeepFool, and the $\ell_2$ version of the CW attack. We first extract class-specific subnetworks for each dataset, then calculate the Mahalanobis scores according to Equation (6). The subsequent logistic regression detector setup is the same as in Lee et al.
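For reference, FGSM (the attack later used as the “seen” attack when training the detector) perturbs the input one step along the sign of the loss gradient. The sketch below uses a logistic model purely to make the gradient self-contained; it is a generic FGSM illustration, not the paper's attack code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, w, y, eps=0.03):
    """Fast Gradient Sign Method against a logistic model p = sigmoid(w.x).
    For binary cross-entropy, d loss / d x = (p - y) * w; one step of
    size eps along its sign maximally increases the loss within an
    l-infinity ball of radius eps."""
    p = sigmoid(w @ x)
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)
```

BIM simply iterates this step with clipping, while DeepFool and CW solve for smaller, optimization-based perturbations.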
We compare against three state-of-the-art logistic regression detectors, based on 1) the combination of kernel density (KD) and predictive uncertainty (PU) scores (Feinman et al.), 2) local intrinsic dimensionality (LID) scores (Ma et al.), and 3) Mahalanobis distance scores (Lee et al.).
The middle columns of Table 2 summarize the detection results. Our method generally improves detection success rates over the baselines across the different attack methods. We also train the logistic regression detector on FGSM samples only and evaluate its detection performance on the other types of adversarial samples; the right columns of Table 2 summarize these results. Our method still outperforms the baselines in most cases, validating the power of class-specific subnetworks for detecting adversarial examples.
In this paper, we explore the possibility of understanding DNNs through disentangled subnetworks. We find that the extracted subnetworks display an architectural resemblance that mirrors the semantic similarity of their corresponding classes. Furthermore, the proposed techniques effectively improve the localization accuracy of visual explanation methods and the detection success rate of adversarial sample detection methods.
-  R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner. Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410, 2017.
-  G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
-  K. Lee, K. Lee, H. Lee, and J. Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pages 7167–7177, 2018.
-  Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pages 2736–2744, 2017.
-  X. Ma, B. Li, Y. Wang, S. M. Erfani, S. Wijewickrema, G. Schoenebeck, D. Song, M. E. Houle, and J. Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. arXiv preprint arXiv:1801.02613, 2018.
-  G. Montavon, W. Samek, and K.-R. Müller. Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 73:1–15, 2018.
-  R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
-  K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv:1312.6034, 2013.
-  D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
-  J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
-  M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In ICML, 2017.