Unsupervised Learning of Neural Networks to Explain Neural Networks (extended abstract)

01/21/2019 ∙ by Quanshi Zhang, et al. ∙ 22

This paper presents an unsupervised method to learn a neural network, namely an explainer, to interpret a pre-trained convolutional neural network (CNN), i.e., the explainer uses interpretable visual concepts to explain features in middle conv-layers of a CNN. Given feature maps of a conv-layer of the CNN, the explainer performs like an auto-encoder, which decomposes the feature maps into object-part features. The object-part features are learned to reconstruct CNN features without much loss of information. We can consider the disentangled representations of object parts a paraphrase of CNN features, which help people understand the knowledge encoded by the CNN. More crucially, we learn the explainer via knowledge distillation without using any annotations of object parts or textures for supervision. In experiments, our method was widely used to interpret features of different benchmark CNNs, and explainers significantly boosted the feature interpretability without hurting the discrimination power of the CNNs.





Figure 1: Explainer network. We use an explainer network (green) to disentangle the feature map of a certain conv-layer in a pre-trained performer network (gray). The explainer network disentangles input features into object-part feature maps to explain knowledge representations in the performer, i.e. making each filter represent a specific object part. The explainer network can also invert the disentangled object-part features to reconstruct features of the performer without much loss of information. We compare ordinary feature maps in the performer and the disentangled feature maps in the explainer on the right. The gray and green lines indicate the information-pass route during the inference process and that during the explanation process, respectively.

Quanshi Zhang is with the John Hopcroft Center and the MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University. Yu Yang and Ying Nian Wu are with the Center for Vision, Cognition, Learning, and Autonomy, University of California, Los Angeles.

Motivation, diagnosis of features inside CNNs:

In recent years, real applications have placed demands on deep learning that go beyond accuracy. A CNN needs to earn people's trust in safety-critical settings, because high accuracy on testing images does not always ensure that the CNN encodes correct features. Instead, the CNN sometimes bases its predictions on unreliable reasons.

Therefore, this study aims to provide a generic tool to examine middle-layer features of a CNN to ensure safety in critical applications. Unlike previous visualization [Zeiler and Fergus2014] and diagnosis [Bau et al.2017, Ribeiro, Singh, and Guestrin2016] of CNN representations, we focus on the following two new issues, which are of special value in feature diagnosis.

Disentanglement of interpretable and uninterpretable feature information is necessary for a rigorous and trustworthy examination of CNN features. Each filter of a conv-layer usually encodes a mixture of various semantics and noise (see Fig. 1). As discussed in [Bau et al.2017], filters in high conv-layers mainly represent “object parts” ([Zhang, Wu, and Zhu2018] considers both semantics of “objects” and “parts” as parts), and “material” and “color” information in high layers is not salient enough for trustworthy analysis. In particular, part features are usually more localized and thus are more helpful in feature diagnosis.

Therefore, in this paper, we propose to disentangle part features from other signals and noise. For example, we may quantitatively disentangle 90% of the information in CNN features as object parts and interpret the remaining 10% as textures and noise.

Semantic explanations: Given an input image, we aim to use clear visual concepts (here, object parts) to interpret chaotic CNN features. In comparison, network visualization and diagnosis mainly illustrate the appearance corresponding to a network output/filter, without physically modeling or quantitatively summarizing strict semantics. As shown in Fig. 4, our method identifies which parts are learned and used for the prediction, giving more fine-grained explanations of CNN features.

Table 1 columns: Disentangle interpretable signals | Semantically explain | Few restrictions on structures & losses | Not affect discriminability
Table 1 rows: CNN visualization | Interpretable nets | Our research
Table 1: Comparison between our research and other studies. Note that this table can only summarize mainstreams in different research directions considering the huge research diversity.

Tasks, learning networks to explain networks: In this paper, we propose a new explanation strategy to boost feature interpretability. That is, given a pre-trained CNN, we learn another neural network, namely an explainer network, to translate chaotic middle-layer features of the CNN into semantically meaningful object parts. More specifically, as shown in Fig. 1, the explainer decomposes middle-layer feature maps into elementary feature components of object parts. Accordingly, the pre-trained CNN is termed a performer network.

In the scenario of this study, the performer is well pre-trained for superior performance. We attach the explainer onto the performer without affecting the original discrimination power of the performer.

The explainer works like an auto-encoder. The encoder decomposes features in the performer into interpretable part features and other uninterpretable features. The encoder contains hundreds of specific filters, each being learned to represent features of a certain object part. The decoder inverts the disentangled part features to reconstruct features of upper layers of the performer.

As shown in Fig. 1, the feature map of each filter in the performer usually represents a chaotic mixture of object parts and textures, whereas the disentangled object-part features in the explainer can be treated as a paraphrase of performer features that provides an insightful understanding of the performer. For example, the explainer can tell us:
How much of the feature information in the performer (e.g. 90%) can be interpreted as object parts?
Which parts does the performer encode information about?
For each specific prediction, which object parts activate filters in the performer, and how much do they contribute to the prediction?

Explaining black-box networks vs. learning interpretable networks: In recent years, researchers have increasingly focused on the interpretability [Bau et al.2017] of middle-layer features of a neural network. Pioneering studies, such as the research on capsule nets [Sabour, Frosst, and Hinton2017] and interpretable CNNs [Zhang, Wu, and Zhu2018], have developed new algorithms to ensure that middle-layer features of a neural network are semantically meaningful.

In comparison, explaining a pre-trained performer is more flexible and more broadly applicable than learning new interpretable models. Table 1 summarizes the difference.
Model flexibility: Interpretable neural networks usually have specific requirements for structures [Sabour, Frosst, and Hinton2017] or losses [Zhang, Wu, and Zhu2018], which limit the model flexibility and applicability. In addition, most existing CNNs are learned in a black-box manner with low interpretability. To interpret such CNNs, an explainer is required.
Interpretability vs. discriminability: Using clear visual concepts to explain black-box networks can overcome a major issue with network interpretability, i.e. the dilemma between feature interpretability and discrimination power. High interpretability is not necessarily equivalent to, and sometimes conflicts with, high discrimination power [Bau et al.2017]. As discussed in [Zhang, Wu, and Zhu2018], increasing the interpretability of a neural network may affect its discrimination power. People usually have to trade off network interpretability against performance in real applications.

In contrast, our explanation strategy does not change feature representations in the pre-trained CNN performer, thereby physically protecting the CNN’s discrimination power.

Figure 2: The explainer network (left). Detailed structures within the interpretable track, the ordinary track, and the decoder are shown on the right. People can change the number of conv-layers and FC layers within the encoder and the decoder for their own applications.

Learning: We learn the explainer by distilling feature representations from the performer to the explainer without any additional supervision. No annotations of parts or textures are used to guide the feature disentanglement during the learning process. We add a loss to specific filters in the explainer (see Fig. 2). The filter loss encourages the filter to be exclusively triggered by a certain object part of a category. This filter is termed an interpretable filter.

Meanwhile, the disentangled object-part features are also required to reconstruct features of upper layers of the performer. Successful feature reconstruction ensures that no significant information is lost during the disentanglement of part features.

Contributions of this study are summarized as follows.

(i) We propose a new explanation strategy, i.e. learning an explainer network to mine and clarify potential feature components in middle layers of a pre-trained performer network. Decomposing chaotic middle-layer features into interpretable concepts will shed new light on explaining black-box models.

(ii) Another distinctive contribution of this study is that learning an explainer for interpretation avoids the typical dilemma between a model’s discriminability and interpretability. This is our essential difference from studies that directly learn the performer with disentangled/interpretable features.

Our method protects the discrimination power of the original network. Thus, it ensures a high flexibility and broad applicability in real applications.

(iii) Our method is able to learn the explainer without any annotations of object parts or textures for supervision. Experiments show that our approach has considerably improved the feature interpretability.


Network structure of the explainer

As shown in Fig. 2, the explainer network has two modules, i.e. an encoder and a decoder, which transform performer features into interpretable object-part features and invert object-part features back to features of the performer, respectively. We can roughly consider that object-part features in the explainer contain nearly the same information as features in the performer.

We applied the encoder and decoder with the following structures to all types of performers in all experiments. Nevertheless, people can change the number of layers of the explainer in their applications.

Encoder: In order to reduce the risk of over-interpreting textures or noise as parts, we design two tracks for the encoder, namely an interpretable track and an ordinary track, which model part features and other features, respectively. Although, as discussed in [Zhang, Wu, and Zhu2018], a high conv-layer mainly represents parts rather than textures, avoiding over-interpretation is still necessary for the explainer.

The interpretable track disentangles performer features into object parts. This track has two interpretable conv-layers (namely conv-interp-1 and conv-interp-2), each followed by a ReLU layer and a mask layer. The interpretable layer contains interpretable filters. Each interpretable filter is learned based on the filter loss, which makes the filter exclusively triggered by a specific object part (the learning of interpretable filters will be introduced later). The ordinary track contains a conv-layer (namely conv-ordin), a ReLU layer, and a pooling layer.

We sum up output features of the interpretable track, $x_{interp}$, and those of the ordinary track, $x_{ordin}$, as the final output of the encoder, i.e. $x_{enc} = p \cdot x_{interp} + (1 - p) \cdot x_{ordin}$, where a scalar weight $p$ measures the quantitative contribution from the interpretable track.

$p$ is parameterized as a softmax probability, $p = \mathrm{sigmoid}(w_p)$, $w_p \in \theta$, where $\theta$ is the set of parameters to be learned. Our method encourages a large $p$ so that most information in $x_{enc}$ comes from the interpretable track.

In particular, if $p = 0.9$, we can roughly consider that about 90% of the feature information from the performer can be represented as object parts, due to the use of norm-layers.
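
The weighted sum of the two tracks can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `w_p`, `x_interp`, and `x_ordin` are chosen here for illustration, feature maps are flattened to short lists, and the sigmoid (a two-way softmax) is assumed as the way the weight is kept inside (0, 1).

```python
import math

def sigmoid(w):
    """Map an unconstrained parameter to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-w))

def mix_tracks(x_interp, x_ordin, w_p):
    """Return p * x_interp + (1 - p) * x_ordin element-wise, together with p."""
    p = sigmoid(w_p)
    x_enc = [p * a + (1.0 - p) * b for a, b in zip(x_interp, x_ordin)]
    return x_enc, p

# Toy features (hypothetical values). Choosing w_p = ln(9) gives p = 0.9,
# i.e. the interpretable track supplies 90% of the mixing weight.
x_enc, p = mix_tracks([1.0, 2.0, 0.0], [0.0, 2.0, 4.0], math.log(9.0))
```

Because `p` is produced from an unconstrained parameter, a gradient-based learner can push it toward 1 without any explicit clipping.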

Decoder: The decoder inverts the encoder output to reconstruct performer features. The decoder has two FC layers, each followed by a ReLU layer. We use the two FC layers, namely fc-dec-1 and fc-dec-2, to reconstruct feature maps of two corresponding FC layers in the performer. The reconstruction loss will be introduced later. A better reconstruction of the FC features indicates that the explainer loses less information when computing the encoder output.


When we distill knowledge representations from the performer to the explainer, we consider the following three terms: 1) the quality of knowledge distillation, i.e. the explainer needs to reconstruct feature maps of upper layers in the performer well, thereby minimizing the information loss; 2) the interpretability of feature maps of the interpretable track, i.e. each filter in conv-interp-2 should exclusively represent a certain object part; 3) the relative contribution of the interpretable track w.r.t. the ordinary track, i.e. we want the interpretable track to contribute much more to the final CNN prediction than the ordinary track. Therefore, we minimize the following loss for each input image to learn the explainer.


$$Loss(\theta) = \sum_{l \in L} \lambda_{(l)} \, \| x_{(l)} - x^{*}_{(l)} \|^{2} \; - \; \eta \log p \; + \; \sum_{f} \lambda_{f} \cdot Loss_{f}(x_{f})$$

where $\theta$ denotes the set of parameters to be learned, including filter weights of conv-layers and FC layers in the explainer, $w_p$ for the interpretable-track weight $p$, and the parameters of the norm-layers. $\lambda_{(l)}$, $\lambda_{f}$, and $\eta$ are scalar weights.

The first term is the reconstruction loss, where $x_{(l)}$ denotes the feature of FC layer $l$ in the decoder, $l \in L = \{\text{fc-dec-1}, \text{fc-dec-2}\}$, and $x^{*}_{(l)}$ indicates the corresponding feature in the performer.

The second term encourages the interpretable track to make more contribution to the CNN prediction.

The third term is the loss of filter interpretability. Without annotations of object parts, the filter loss forces the feature map $x_{f}$ of each interpretable filter $f$ to be exclusively triggered by a specific object part of a certain category. The filter loss was formulated in [Zhang, Wu, and Zhu2018]. We can summarize the filter loss as the negative mutual information between the distribution of feature maps and that of potential part locations.
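
How the three terms combine can be sketched as follows. This is a hedged illustration rather than the authors' code: the function and argument names are hypothetical, the `-eta * log(p)` form of the second term and the single weight per term are assumptions, and the per-filter interpretability losses are taken as precomputed inputs because the filter loss itself is defined in [Zhang, Wu, and Zhu2018] and is not re-implemented here.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance between two equally long feature lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def explainer_loss(decoder_feats, performer_feats, p, filter_losses,
                   lam_rec=1.0, eta=1.0, lam_f=1.0):
    """Sum of reconstruction, track-contribution, and filter-interpretability terms.

    decoder_feats / performer_feats: dicts mapping an FC-layer name to its features.
    p: weight of the interpretable track, in (0, 1).
    filter_losses: precomputed per-filter losses (assumed given).
    """
    reconstruction = sum(
        lam_rec * squared_l2(decoder_feats[name], performer_feats[name])
        for name in decoder_feats
    )
    contribution = -eta * math.log(p)        # smaller when p is closer to 1
    interpretability = lam_f * sum(filter_losses)
    return reconstruction + contribution + interpretability

# Toy inputs (hypothetical values, not taken from the paper).
decoder = {"fc-dec-1": [1.0, 2.0], "fc-dec-2": [0.5]}
performer = {"fc-dec-1": [1.0, 1.0], "fc-dec-2": [0.0]}
loss = explainer_loss(decoder, performer, p=0.9, filter_losses=[-0.2, -0.1])
```

With everything else fixed, raising `p` lowers the second term, which is the sense in which the objective favors the interpretable track.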


In experiments, we learned explainers for performer networks with three types of structures to demonstrate the broad applicability of our method. Performer networks were pre-trained using object images in two different benchmark datasets for object classification. We visualized feature maps of interpretable filters in the explainer to illustrate semantic meanings of these filters. Experiments showed that interpretable filters in the explainer generated more semantically meaningful feature maps than conv-layers in the performer.

Figure 3: Visualization of interpretable filters in the explainer and ordinary filters in the performer. We compared filters in the top conv-layer of the performer and interpretable filters in the conv-interp-2 layer of the explainer.

We compared the object-part interpretability between feature maps of the explainer and those of the performer. To obtain a convincing evaluation, we both visualized filters (see Fig. 3) and used the objective metric of location instability [Zhang, Wu, and Zhu2018] to measure the fitness between a filter and the representation of an object part.
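
For readers unfamiliar with the metric, the sketch below shows one way location instability can be computed for a single filter. It is an assumption-laden illustration based on the description in [Zhang, Wu, and Zhu2018], not the evaluation code used here: the filter's inferred position in each image is taken as given, distances are normalized by the image diagonal, and the population standard deviation is assumed as the deviation measure. All names are hypothetical.

```python
import math
import statistics

def normalized_distance(peak, landmark, width, height):
    """Distance between a filter's peak and a landmark, divided by the image diagonal."""
    return math.dist(peak, landmark) / math.hypot(width, height)

def location_instability(peaks, landmarks, width, height):
    """Average, over landmarks, of the deviation of peak-to-landmark distances.

    peaks[i]: (x, y) position inferred by the filter in image i.
    landmarks[i][k]: (x, y) position of ground-truth landmark k in image i.
    """
    n_landmarks = len(landmarks[0])
    deviations = []
    for k in range(n_landmarks):
        dists = [
            normalized_distance(peaks[i], landmarks[i][k], width, height)
            for i in range(len(peaks))
        ]
        deviations.append(statistics.pstdev(dists))
    return sum(deviations) / n_landmarks

# A filter that always fires at the same offset from the landmark is perfectly stable.
stable = location_instability(
    peaks=[(10, 10), (20, 30)],
    landmarks=[[(13, 14)], [(23, 34)]],
    width=224, height=224,
)
# A filter whose offset changes from image to image is less stable.
unstable = location_instability(
    peaks=[(10, 10), (200, 30)],
    landmarks=[[(13, 14)], [(23, 34)]],
    width=224, height=224,
)
```

A filter that consistently represents the same part keeps a nearly constant distance to each landmark, so a lower value indicates a better fit between the filter and an object part.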

             Single-category                                     Multi
             bird   cat    cow    dog    horse  sheep  Avg.      Avg.
AlexNet      0.153  0.131  0.141  0.128  0.145  0.140  0.140     -
  Explainer  0.104  0.089  0.101  0.083  0.098  0.103  0.096     -
VGG-M        0.152  0.132  0.143  0.130  0.145  0.141  0.141     0.135
  Explainer  0.106  0.088  0.101  0.088  0.097  0.101  0.097     0.097
VGG-S        0.152  0.131  0.141  0.128  0.144  0.141  0.139     0.138
  Explainer  0.110  0.085  0.098  0.085  0.091  0.096  0.094     0.107
VGG-16       0.145  0.133  0.146  0.127  0.143  0.143  0.139     0.128
  Explainer  0.095  0.089  0.097  0.085  0.087  0.089  0.090     0.109
Table 2: Location instability of feature maps between performers and explainers that were trained using the Pascal-Part dataset. A low location instability indicates a high filter interpretability.
            AlexNet  VGG-M   VGG-S   VGG-16
Performer   0.1502   0.1476  0.1481  0.1373
Explainer   0.0906   0.0815  0.0704  0.0490
Table 3: Location instability of feature maps in performers and explainers that were trained using the CUB200-2011 dataset. A low location instability indicates a high filter interpretability.
Figure 4: Grad-CAM attention maps and quantitative analysis. We compute Grad-CAM attention maps of interpretable feature maps in the explainer and ordinary feature maps in the performer.

Tables 2 and 3 compare the interpretability between feature maps in the performer and feature maps in the explainer. Feature maps in our explainers were much more interpretable than feature maps in performers in all comparisons.
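
To put the size of the gap in perspective, the following sketch computes the relative reduction in location instability from the Table 3 (CUB200-2011) values. The relative-reduction measure is an illustrative choice made here and is not reported in the source.

```python
# Location instability values copied from Table 3 (CUB200-2011).
performer = {"AlexNet": 0.1502, "VGG-M": 0.1476, "VGG-S": 0.1481, "VGG-16": 0.1373}
explainer = {"AlexNet": 0.0906, "VGG-M": 0.0815, "VGG-S": 0.0704, "VGG-16": 0.0490}

def relative_reduction(before, after):
    """Fraction by which the metric drops when moving from performer to explainer."""
    return (before - after) / before

reductions = {
    name: relative_reduction(performer[name], explainer[name])
    for name in performer
}

for name, value in reductions.items():
    print(f"{name}: {value:.1%} lower location instability")
```

On these numbers the explainer lowers location instability by roughly 40% for AlexNet and by roughly 64% for VGG-16.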

Table 4 lists the $p$ values of explainers that were learned for different performers, where $p$ is the scalar weight that measures the quantitative contribution from the interpretable track. For example, the explainer for the VGG-16 network learned using the CUB200-2011 dataset has $p = 0.9579$, which means that about 95.79% of the feature information of the performer can be represented as object parts, and only about 4.21% comes from textures and noise.

          Pascal-Part dataset          CUB200 dataset
          Single-class   Multi-class
AlexNet   0.7137         -             0.5810
VGG-M     0.9012         0.8066        0.8611
VGG-S     0.9270         0.8996        0.9533
VGG-16    0.8593         0.8718        0.9579
Table 4: Average $p$ values of explainers. $p$ measures the quantitative contribution from the interpretable track. When we used an explainer to interpret feature maps of a VGG network, about 80%–96% of activation scores came from interpretable features.
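
The reading of Table 4 as percentages can be checked with a short sketch. The dictionary keys (`single`, `multi`, `cub200`) and the helper name are hypothetical labels introduced here; the numbers are the VGG entries copied from the table.

```python
# Interpretable-track weights for the VGG performers, copied from Table 4.
p_values = {
    "VGG-M":  {"single": 0.9012, "multi": 0.8066, "cub200": 0.8611},
    "VGG-S":  {"single": 0.9270, "multi": 0.8996, "cub200": 0.9533},
    "VGG-16": {"single": 0.8593, "multi": 0.8718, "cub200": 0.9579},
}

def split_share(p):
    """Return (percent attributed to object parts, percent attributed to the rest)."""
    return 100.0 * p, 100.0 * (1.0 - p)

# Share of feature information attributed to parts for VGG-16 on CUB200-2011.
parts_pct, rest_pct = split_share(p_values["VGG-16"]["cub200"])

# Range of the weight across all VGG settings.
all_p = [p for row in p_values.values() for p in row.values()]
lowest, highest = min(all_p), max(all_p)
```

The smallest and largest VGG weights, 0.8066 and 0.9579, bracket the approximate 80%–96% range quoted in the caption.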

Conclusion and discussions

In this paper, we have proposed a theoretical solution to a new explanation strategy, i.e. learning an explainer network to disentangle and explain feature maps of a pre-trained performer network. Learning an explainer alongside the performer does not decrease the discrimination power of the performer, which ensures broad applicability. We have developed a simple yet effective method to learn the explainer, which yields highly interpretable feature maps without using annotations of object parts or textures for supervision.

We divide the encoder of the explainer into an interpretable track and an ordinary track to reduce the risk of over-interpreting textures or noises as parts. Fortunately, experiments have shown that about 90% of signals in the performer can be explained as parts.


  • [Bau et al.2017] Bau, D.; Zhou, B.; Khosla, A.; Oliva, A.; and Torralba, A. 2017. Network dissection: Quantifying interpretability of deep visual representations. In CVPR.
  • [Ribeiro, Singh, and Guestrin2016] Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. “Why should I trust you?” Explaining the predictions of any classifier. In KDD.
  • [Sabour, Frosst, and Hinton2017] Sabour, S.; Frosst, N.; and Hinton, G. E. 2017. Dynamic routing between capsules. In NIPS.
  • [Zeiler and Fergus2014] Zeiler, M. D., and Fergus, R. 2014. Visualizing and understanding convolutional networks. In ECCV.
  • [Zhang, Wu, and Zhu2018] Zhang, Q.; Wu, Y. N.; and Zhu, S.-C. 2018. Interpretable convolutional neural networks. In CVPR.