Quanshi Zhang is with the John Hopcroft Center and the MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University. Yu Yang and Ying Nian Wu are with the Center for Vision, Cognition, Learning, and Autonomy, University of California, Los Angeles.
Motivation, diagnosis of features inside CNNs:
In recent years, real applications have placed demands on deep learning that go beyond accuracy. A CNN needs to earn people's trust in safety-critical settings, because high accuracy on testing images cannot always ensure that the CNN encodes correct features. Instead, the CNN sometimes uses unreliable reasons for prediction.
Therefore, this study aims to provide a generic tool to examine middle-layer features of a CNN to ensure safety in critical applications. Unlike previous visualization [Zeiler and Fergus2014] and diagnosis [Bau et al.2017, Ribeiro, Singh, and Guestrin2016] of CNN representations, we focus on the following two new issues, which are of special value in feature diagnosis.
Disentanglement of interpretable and uninterpretable feature information is necessary for a rigorous and trustworthy examination of CNN features. Each filter of a conv-layer usually encodes a mixture of various semantics and noises (see Fig. 1). As discussed in [Bau et al.2017], filters in high conv-layers mainly represent “object parts” ([Zhang, Wu, and Zhu2018] considers both “objects” and “parts” as parts), whereas “material” and “color” information in high layers is not salient enough for trustworthy analysis. In particular, part features are usually more localized and thus are more helpful in feature diagnosis.
Therefore, in this paper, we propose to disentangle part features from other signals and noises. For example, we may quantitatively disentangle 90% of the information in CNN features as object parts and interpret the remaining 10% as textures and noises.
Semantic explanations: Given an input image, we aim to use clear visual concepts (here, object parts) to interpret chaotic CNN features. In comparison, network visualization and diagnosis mainly illustrate the appearance corresponding to a network output/filter, without physically modeling or quantitatively summarizing strict semantics. As shown in Fig. 4, our method identifies which parts are learned and used for the prediction, which gives more fine-grained explanations for CNN features.
[Table 1. Columns: Disentangle interpretable signals; Semantically explain; Few restrictions on structures & losses; Not affect discriminability]
Tasks, learning networks to explain networks: In this paper, we propose a new explanation strategy to boost feature interpretability. That is, given a pre-trained CNN, we learn another neural network, namely an explainer network, to translate chaotic middle-layer features of the CNN into semantically meaningful object parts. More specifically, as shown in Fig. 1, the explainer decomposes middle-layer feature maps into elementary feature components of object parts. Accordingly, the pre-trained CNN is termed a performer network.
In the scenario of this study, the performer is well pre-trained for superior performance. We attach the explainer onto the performer without affecting the original discrimination power of the performer.
The explainer works like an auto-encoder. The encoder decomposes features in the performer into interpretable part features and other uninterpretable features. The encoder contains hundreds of specific filters, each being learned to represent features of a certain object part. The decoder inverts the disentangled part features to reconstruct features of upper layers of the performer.
As shown in Fig. 1, the feature map of each filter in the performer usually represents a chaotic mixture of object parts and textures, whereas the disentangled object-part features in the explainer can be treated as a paraphrase of performer features that provides an insightful understanding of the performer. For example, the explainer can tell us:
How much of the feature information in the performer (e.g. 90%) can be interpreted as object parts?
Which parts are encoded in the performer?
For each specific prediction, which object parts activate filters in the performer, and how much do they contribute to the prediction?
Explaining black-box networks vs. learning interpretable networks: In recent years, researchers have increasingly focused on the interpretability [Bau et al.2017] of middle-layer features of a neural network. Pioneering studies, such as the research on capsule nets [Sabour, Frosst, and Hinton2017] and interpretable CNNs [Zhang, Wu, and Zhu2018], have developed new algorithms to ensure that middle-layer features of a neural network are semantically meaningful.
In comparison, explaining a pre-trained performer offers higher flexibility and broader applicability than learning new interpretable models. Table 1 summarizes the differences.
Model flexibility: Interpretable neural networks usually have specific requirements for structures [Sabour, Frosst, and Hinton2017] or losses [Zhang, Wu, and Zhu2018], which limit the model flexibility and applicability. In addition, most existing CNNs are learned in a black-box manner with low interpretability. To interpret such CNNs, an explainer is required.
Interpretability vs. discriminability: Using clear visual concepts to explain black-box networks can overcome a major issue with network interpretability, i.e. the dilemma between the interpretability of features and their discrimination power. High interpretability is not necessarily equivalent to, and sometimes conflicts with, high discrimination power [Bau et al.2017]. As discussed in [Zhang, Wu, and Zhu2018], increasing the interpretability of a neural network may affect its discrimination power. People usually have to trade off network interpretability against performance in real applications.
In contrast, our explanation strategy does not change feature representations in the pre-trained CNN performer, thereby physically protecting the CNN’s discrimination power.
Learning: We learn the explainer by distilling feature representations from the performer to the explainer without any additional supervision. No annotations of parts or textures are used to guide the feature disentanglement during the learning process. We add a loss to specific filters in the explainer (see Fig. 2). The filter loss encourages the filter to be exclusively triggered by a certain object part of a category. This filter is termed an interpretable filter.
Meanwhile, the disentangled object-part features are also required to reconstruct features of upper layers of the performer. Successful feature reconstruction ensures that no significant information is lost during the disentanglement of part features.
Contributions of this study are summarized as follows.
(i) We tackle a new explanation strategy, i.e. learning an explainer network to mine and clarify potential feature components in middle layers of a pre-trained performer network. Decomposing chaotic middle-layer features into interpretable concepts will shed new light on explaining black-box models.
(ii) Another distinctive contribution of this study is that learning an explainer for interpretation avoids the typical dilemma between a model’s discriminability and interpretability. This is the essential difference between our work and studies that directly learn the performer with disentangled/interpretable features.
Our method protects the discrimination power of the original network. Thus, it ensures a high flexibility and broad applicability in real applications.
(iii) Our method is able to learn the explainer without any annotations of object parts or textures for supervision. Experiments show that our approach has considerably improved the feature interpretability.
Network structure of the explainer
As shown in Fig. 2, the explainer network has two modules, i.e. an encoder and a decoder, which transform performer features into interpretable object-part features and invert object-part features back to features of the performer, respectively. We can roughly consider that object-part features in the explainer contain nearly the same information as features in the performer.
We applied the encoder and decoder with the following structures to all types of performers in all experiments. Nevertheless, people can change the number of layers of the explainer in their applications.
Encoder: In order to reduce the risk of over-interpreting textures or noises as parts, we design two tracks for the encoder, namely an interpretable track and an ordinary track, which model part features and other features, respectively. Although, as discussed in [Zhang, Wu, and Zhu2018], a high conv-layer mainly represents parts rather than textures, avoiding over-interpretation is still necessary for the explainer.
The interpretable track disentangles performer features into object parts. This track has two interpretable conv-layers (namely conv-interp-1 and conv-interp-2), each followed by a ReLU layer and a mask layer. The interpretable layer contains interpretable filters. Each interpretable filter is learned based on the filter loss, which makes the filter exclusively triggered by a specific object part (the learning of interpretable filters will be introduced later). The ordinary track contains a conv-layer (namely conv-ordin), a ReLU layer, and a pooling layer.
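We sum up output features of the interpretable track and those of the ordinary track as the final output of the encoder, i.e. $x_{\text{ende}} = p \cdot x_{\text{interp}} + (1 - p) \cdot x_{\text{ordin}}$, where $x_{\text{interp}}$ and $x_{\text{ordin}}$ denote the outputs of the interpretable track and the ordinary track, respectively, and a scalar weight $p$ measures the quantitative contribution from the interpretable track.
$p$ is parameterized as a softmax probability, $p = \text{sigmoid}(w_p)$, $w_p \in \theta$, where $\theta$ is the set of parameters to be learned. Our method encourages a large $p$ so that most information in $x_{\text{ende}}$ comes from the interpretable track.
In particular, if $p = 0.9$, we can roughly consider that about 90% of the feature information from the performer can be represented as object parts due to the use of norm-layers.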
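The weighted combination of the two tracks can be illustrated with a minimal sketch. It assumes a logistic (two-way softmax) parameterization of the track weight and uses flat lists of floats as stand-ins for the normalized track outputs; the names `w_p`, `combine_tracks`, `x_interp`, and `x_ordin` are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(w):
    """Logistic function: maps an unconstrained scalar to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-w))


def combine_tracks(x_interp, x_ordin, w_p):
    """Blend the two encoder tracks with one learnable scalar weight.

    x_interp, x_ordin: equal-length lists of floats (assumed stand-ins for
        the outputs of the interpretable and ordinary tracks).
    w_p: unconstrained scalar parameter (hypothetical name); p = sigmoid(w_p).
    Returns (blended_features, p).
    """
    if len(x_interp) != len(x_ordin):
        raise ValueError("both tracks must produce features of the same size")
    p = sigmoid(w_p)
    blended = [p * a + (1.0 - p) * b for a, b in zip(x_interp, x_ordin)]
    return blended, p


# With w_p = log(9), the interpretable track receives a weight of 0.9.
w_p = math.log(9.0)
features, p = combine_tracks([1.0, 2.0], [3.0, 4.0], w_p)
print(round(p, 3), [round(v, 3) for v in features])
```

Because the weight is a single scalar in (0, 1), it can be read directly as the share of the encoder output attributed to the interpretable track.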
Decoder: The decoder inverts the output of the encoder to reconstruct performer features. The decoder has two FC layers, which are followed by two ReLU layers. We use the two FC layers, namely fc-dec-1 and fc-dec-2, to reconstruct feature maps of the two corresponding FC layers in the performer. The reconstruction loss will be introduced later. A better reconstruction of the FC features indicates that the explainer loses less information during the computation of the encoder output.
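As a rough illustration of the decoder described above, the sketch below stacks two fully connected layers, each followed by a ReLU, and returns both intermediate outputs so that each can be compared with the corresponding FC feature of the performer. The layer sizes, the parameter dictionary, and the function names are assumptions made for this toy example only.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def fc(v, weights, bias):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * a for w, a in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def decode(x_enc, params):
    """Map the encoder output to the two decoder features.

    Returns (h1, h2), playing the roles of fc-dec-1 and fc-dec-2 after
    their ReLU layers. `params` is a hypothetical dict of toy weights.
    """
    h1 = relu(fc(x_enc, params["w1"], params["b1"]))
    h2 = relu(fc(h1, params["w2"], params["b2"]))
    return h1, h2


# Toy parameters: a 2 -> 2 layer followed by a 2 -> 1 layer.
params = {
    "w1": [[1.0, 0.0], [0.0, 1.0]], "b1": [0.0, 0.0],
    "w2": [[2.0, 1.0]], "b2": [0.5],
}
h1, h2 = decode([1.0, -2.0], params)
print(h1, h2)
```

Returning both layers keeps the reconstruction targets explicit: each decoder feature has its own counterpart in the performer.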
When we distill knowledge representations from the performer to the explainer, we consider the following three terms: 1) the quality of knowledge distillation, i.e. the explainer needs to reconstruct feature maps of upper layers in the performer well, thereby minimizing the information loss; 2) the interpretability of feature maps of the interpretable track, i.e. each filter in conv-interp-2 should exclusively represent a certain object part; 3) the relative contribution of the interpretable track w.r.t. the ordinary track, i.e. we want the interpretable track to contribute much more to the final CNN prediction than the ordinary track. Therefore, we minimize the following loss for each input image to learn the explainer.
$$Loss(\theta) = \sum_{l \in L} \|x_{(l)} - x^{*}_{(l)}\|^2 - \eta \log p + \sum_{f} \lambda_f \cdot Loss_f(x_f)$$
where $\theta$ denotes the set of parameters to be learned, including filter weights of conv-layers and FC layers in the explainer, the parameter $w_p$ for the interpretable-track weight $p$, and the parameters of norm-layers. $\eta$ and $\lambda_f$ are scalar weights.
The first term is the reconstruction loss, where $x_{(l)}$ denotes the feature of the FC layer $l$ in the decoder, $L = \{\text{fc-dec-1}, \text{fc-dec-2}\}$. $x^{*}_{(l)}$ indicates the corresponding feature in the performer.
The second term encourages the interpretable track to make more contribution to the CNN prediction.
The third term is the loss of filter interpretability. Without annotations of object parts, the filter loss forces the feature map $x_f$ of each interpretable filter $f$ to be exclusively triggered by a specific object part of a certain category. The filter loss was formulated in [Zhang, Wu, and Zhu2018]. We can summarize the filter loss as the negative mutual information between the distribution of feature maps and that of potential part locations.
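The three terms can be combined as in the following sketch. It assumes a squared L2 reconstruction term, a negative log term on the track weight `p`, and filter losses that have already been computed elsewhere (the filter loss of [Zhang, Wu, and Zhu2018] is not reimplemented here). The function names, the shared weight `lam_f`, and all numeric values are hypothetical.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length feature lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def explainer_loss(recon, target, p, filter_losses, eta=1.0, lam_f=1.0):
    """Sum of the three terms for one input image.

    recon, target: dicts mapping a decoder layer name to its feature list
        and to the corresponding performer feature, respectively.
    p: weight of the interpretable track, in (0, 1).
    filter_losses: precomputed per-filter interpretability losses (assumed).
    eta, lam_f: scalar weights (illustrative defaults).
    """
    reconstruction = sum(squared_l2(recon[l], target[l]) for l in recon)
    track_term = -eta * math.log(p)          # smaller when p is larger
    interpretability = lam_f * sum(filter_losses)
    return reconstruction + track_term + interpretability


recon = {"fc-dec-1": [1.0, 2.0], "fc-dec-2": [0.0]}
target = {"fc-dec-1": [1.0, 1.0], "fc-dec-2": [0.5]}
filter_losses = [-0.2, -0.1]  # toy values standing in for the filter loss
loss_high_p = explainer_loss(recon, target, 0.9, filter_losses)
loss_low_p = explainer_loss(recon, target, 0.5, filter_losses)
print(loss_high_p, loss_low_p)
```

With everything else fixed, the loss is lower for the larger track weight, which is the sense in which the second term encourages the interpretable track to contribute more.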
In experiments, we learned explainers for performer networks with three types of structures to demonstrate the broad applicability of our method. Performer networks were pre-trained using object images in two different benchmark datasets for object classification. We visualized feature maps of interpretable filters in the explainer to illustrate semantic meanings of these filters. Experiments showed that interpretable filters in the explainer generated more semantically meaningful feature maps than conv-layers in the performer.
We compared the object-part interpretability between feature maps of the explainer and those of the performer. To obtain a convincing evaluation, we both visualized filters (see Fig. 3) and used the objective metric of location instability [Zhang, Wu, and Zhu2018] to measure the fitness between a filter and the representation of an object part.
Tables 2 and 3 compare the interpretability between feature maps in the performer and feature maps in the explainer. Feature maps in our explainers were much more interpretable than feature maps in performers in all comparisons.
Table 4 lists the interpretable-track weights of explainers that were learned for different performers. This weight measures the quantitative contribution from the interpretable track. For example, for the VGG-16 network learned using the CUB200-2011 dataset, the value reported in Table 4 indicates the fraction of the performer's feature information that can be represented as object parts; only the remainder comes from textures and noises.
Conclusion and discussions
In this paper, we have proposed a theoretical solution to a new explanation strategy, i.e. learning an explainer network to disentangle and explain feature maps of a pre-trained performer network. Learning an explainer alongside the performer does not decrease the discrimination power of the performer, which ensures broad applicability. We have developed a simple yet effective method to learn the explainer, which yields highly interpretable feature maps without using annotations of object parts or textures for supervision.
We divide the encoder of the explainer into an interpretable track and an ordinary track to reduce the risk of over-interpreting textures or noises as parts. Fortunately, experiments have shown that about 90% of signals in the performer can be explained as parts.
- [Bau et al.2017] Bau, D.; Zhou, B.; Khosla, A.; Oliva, A.; and Torralba, A. 2017. Network dissection: Quantifying interpretability of deep visual representations. In CVPR.
- [Ribeiro, Singh, and Guestrin2016] Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. “Why should I trust you?” Explaining the predictions of any classifier. In KDD.
- [Sabour, Frosst, and Hinton2017] Sabour, S.; Frosst, N.; and Hinton, G. E. 2017. Dynamic routing between capsules. In NIPS.
- [Zeiler and Fergus2014] Zeiler, M. D., and Fergus, R. 2014. Visualizing and understanding convolutional networks. In ECCV.
- [Zhang, Wu, and Zhu2018] Zhang, Q.; Wu, Y. N.; and Zhu, S.-C. 2018. Interpretable convolutional neural networks. In CVPR.