Interpretable Convolutional Neural Networks

10/02/2017 ∙ by Quanshi Zhang, et al.

This paper proposes a method to modify traditional convolutional neural networks (CNNs) into interpretable CNNs, in order to clarify knowledge representations in high conv-layers of CNNs. In an interpretable CNN, each filter in a high conv-layer represents a certain object part. We do not need any annotations of object parts or textures to supervise the learning process. Instead, the interpretable CNN automatically assigns each filter in a high conv-layer with an object part during the learning process. Our method can be applied to different types of CNNs with different structures. The clear knowledge representation in an interpretable CNN can help people understand the logics inside a CNN, i.e., based on which patterns the CNN makes the decision. Experiments showed that filters in an interpretable CNN were more semantically meaningful than those in traditional CNNs.




1 Introduction

Convolutional neural networks (CNNs) [15, 12, 7] have achieved superior performance in many visual tasks, such as object classification and detection. As discussed in Bau et al. [2], besides the discrimination power, model interpretability is another crucial issue for neural networks. However, the interpretability is always an Achilles’ heel of CNNs, and has presented considerable challenges for decades.

In this paper, we focus on a new problem: without any additional human supervision, can we modify a CNN to obtain interpretable knowledge representations in its conv-layers? We expect the CNN to have a certain introspection of its representations during the end-to-end learning process, so that the CNN can regularize its representations to ensure high interpretability. Our learning for high interpretability is different from conventional off-line visualization [34, 17, 24, 4, 5, 21] and diagnosis [2, 10, 14, 18] of pre-trained CNN representations.

Figure 1: Comparison of a filter’s feature maps in an interpretable CNN and those in a traditional CNN.

Bau et al. [2] defined six kinds of semantics in CNNs, i.e. objects, parts, scenes, textures, materials, and colors. In fact, we can roughly consider the first two semantics as object-part patterns with specific shapes, and summarize the last four semantics as texture patterns without clear contours. Moreover, filters in low conv-layers usually describe simple textures, whereas filters in high conv-layers are more likely to represent object parts.

Therefore, in this study, we aim to train each filter in a high conv-layer to represent an object part. Fig. 1 shows the difference between a traditional CNN and our interpretable CNN. In a traditional CNN, a high-layer filter may describe a mixture of patterns, i.e. the filter may be activated by both the head part and the leg part of a cat. Such complex representations in high conv-layers significantly decrease the network interpretability. In contrast, the filter in our interpretable CNN is activated by a certain part. In this way, we can explicitly identify which object parts are memorized in the CNN for classification without ambiguity. The goal of this study can be summarized as follows.

  • We propose to slightly revise a CNN to improve its interpretability, which can be broadly applied to CNNs with different structures.

  • We do not need any annotations of object parts or textures for supervision. Instead, our method automatically pushes the representation of each filter towards an object part.

  • The interpretable CNN does not change the loss function on the top layer and uses the same training samples as the original CNN.

  • As exploratory research, the design for interpretability may slightly decrease the discrimination power, but we hope to limit such a decrease to a small range.

Methods: Given a high conv-layer in a CNN, we propose a simple yet effective loss for each filter in the conv-layer to push the filter towards the representation of an object part. As shown in Fig. 2, we add a loss for the output feature map of each filter. The loss encourages a low entropy of inter-category activations and a low entropy of spatial distributions of neural activations. In other words, each filter must encode a distinct object part that is exclusively contained by a single object category, and the filter must be activated by a single part of the object, rather than appearing repetitively on different object regions. For example, the left eye and the right eye may be represented using two different part filters, because contexts of the two eyes are symmetric, but not the same. Here, we assume that repetitive shapes on various regions are more prone to describe low-level textures (e.g. colors and edges), instead of high-level parts.

The value of network interpretability: The clear semantics in high conv-layers is of great importance when we need human beings to trust a network’s prediction. In spite of the high accuracy of neural networks, human beings usually cannot fully trust a network, unless it can explain its logic for decisions, i.e. what patterns are memorized for prediction. Given an image, current studies for network diagnosis [5, 21, 18] localize image regions that contribute most to network predictions at the pixel level. In this study, we expect the CNN to explain its logic at the object-part level. Given an interpretable CNN, we can explicitly show the distribution of object parts that are memorized by the CNN for object classification.

Contributions: In this paper, we focus on a new task, i.e. end-to-end learning a CNN whose representations in high conv-layers are interpretable. We propose a simple yet effective method to modify different types of CNNs into interpretable CNNs without any additional annotations of object parts or textures for supervision. Experiments show that our approach has significantly improved the object-part interpretability of CNNs.

2 Related work

The interpretability and the discrimination power are two important properties of a model [2]. In recent years, different methods have been developed to explore the semantics hidden inside a CNN. Many statistical methods [28, 33, 1] have been proposed to analyze CNN features.

Network visualization: Visualization of filters in a CNN is the most direct way of exploring the pattern hidden inside a neural unit. [34, 17, 24] showed the appearance that maximized the score of a given unit. Up-convolutional nets [4] were used to invert CNN feature maps to images.

Pattern retrieval: Some studies go beyond passive visualization and actively retrieve certain units from CNNs for different applications. Like the extraction of mid-level features [26] from images, pattern retrieval mainly learns mid-level representations from conv-layers. Zhou et al. [38, 39] selected units from feature maps to describe “scenes”. Simon et al. discovered objects from feature maps of unlabeled images [22], and selected a certain filter to describe each semantic part in a supervised fashion [23]. [36] extracted certain neural units from a filter’s feature map to describe an object part in a weakly-supervised manner. [6] used a gradient-based method to interpret visual question-answering models. Studies of [11, 31, 29, 16] selected neural units with specific meanings from CNNs for various applications.

Model diagnosis: Many methods have been developed to diagnose representations of a black-box model. The LIME method proposed by Ribeiro et al. [18], influence functions [10] and gradient-based visualization methods [5, 21] and [13] extracted image regions that were responsible for each network output, in order to interpret network representations. These methods require people to manually check image regions accountable for the label prediction for each testing image. [9] extracted relationships between representations of various categories from a CNN. Lakkaraju et al. [14] and Zhang et al. [37] explored unknown knowledge of CNNs via active annotations and active question-answering. In contrast, given an interpretable CNN, people can directly identify object parts (filters) that are used for decisions during the inference procedure.

Learning a better representation: Unlike the diagnosis and/or visualization of pre-trained CNNs, some approaches are developed to learn more meaningful representations. [19] required people to label dimensions of the input that were related to each output, in order to learn a better model. Hu et al. [8] designed some logic rules for network outputs, and used these rules to regularize the learning process. Stone et al. [27] learned CNN representations with better object compositionality, but they did not obtain explicit part-level or texture-level semantics. Sabour et al. [20] proposed a capsule model, which used a dynamic routing mechanism to parse the entire object into a parsing tree of capsules, and each capsule may encode a specific meaning. In this study, we invent a generic loss to regularize the representation of a filter to improve its interpretability. We can analyze the interpretable CNN from the perspective of information bottleneck [32] as follows. 1) Our interpretable filters selectively model the most distinct parts of each category to minimize the conditional entropy of the final classification given feature maps of a conv-layer. 2) Each filter represents a single part of an object, which maximizes the mutual information between the input image and middle-layer feature maps (i.e. “forgetting” as much irrelevant information as possible).

3 Algorithm

Given a target conv-layer of a CNN, we expect each filter in the conv-layer to be activated by a certain object part of a certain category, and to remain inactive on images of other categories. Let $\mathbf{I}$ denote a set of training images, where $\mathbf{I}_c \subset \mathbf{I}$ represents the subset that belongs to category $c$ ($c = 1, 2, \ldots, C$). Theoretically, we can use different types of losses to learn CNNs for multi-class classification, single-class classification (i.e. $c=1$ for images of a category and $c=2$ for random images), and other tasks.

Figure 2: Structures of an ordinary conv-layer and an interpretable conv-layer. Green and red lines indicate the forward and backward propagations, respectively.

Fig. 2 shows the structure of our interpretable conv-layer. In the following paragraphs, we focus on the learning of a single filter $f$ in the target conv-layer. We add a loss to the feature map $x$ of the filter $f$ after the ReLU operation. The feature map $x$ is an $n \times n$ matrix, $x_{ij} \geq 0$. Because $f$'s corresponding object part may appear at different locations in different images, we design $n^2$ templates for $f$, $\{T_{\mu_1}, T_{\mu_2}, \ldots, T_{\mu_{n^2}}\}$. As shown in Fig. 3, each template $T_{\mu_i}$ is also an $n \times n$ matrix, and it describes the ideal distribution of activations for the feature map $x$ when the target part mainly triggers the $i$-th unit in $x$.

During the forward propagation, given each input image $I$, the CNN selects a specific template $T_{\hat{\mu}}$ from the $n^2$ template candidates as a mask to filter out noisy activations from $x$. I.e. we compute $\hat{\mu} = \arg\max_{\mu=[i,j]} x_{ij}$ and $x^{\text{masked}} = \max\{x \circ T_{\hat{\mu}}, 0\}$, where $\circ$ denotes the Hadamard (element-wise) product. $\mu = [i,j]$, $1 \leq i, j \leq n$, denotes the unit (or location) in $x$ potentially corresponding to the part.

The mask operation supports the gradient back-propagation for end-to-end learning. Note that the CNN may select different templates for different input images. Fig. 4 visualizes the masks chosen for different images, as well as the original and masked feature maps.

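During the back-propagation process, our loss pushes filter $f$ to represent a specific object part of the category $c$ and keep silent on images of other categories. Please see Section 3.1 for the determination of the category $c$ for filter $f$. Let $\mathbf{X} = \{x \mid x = f(I), I \in \mathbf{I}\}$ denote feature maps of $f$ after a ReLU operation, which are computed on different training images. Given an input image $I$, if $I \in \mathbf{I}_c$, we expect the feature map $x = f(I)$ to be exclusively activated at the target part's location; otherwise, the feature map should remain inactive. In other words, if $I \in \mathbf{I}_c$, the feature map $x$ is expected to fit the assigned template $T_{\hat{\mu}}$; if $I \notin \mathbf{I}_c$, we design a negative template $T^-$ and hope that the feature map $x$ matches $T^-$. Note that during the forward propagation, our method omits the negative template, and all feature maps, including those of other categories, select positive templates as masks.

Thus, each feature map is supposed to fit well to one of the $n^2 + 1$ template candidates $\mathbf{T} = \{T^-, T_{\mu_1}, T_{\mu_2}, \ldots, T_{\mu_{n^2}}\}$. We formulate the loss for $f$ as the negative mutual information between $\mathbf{X}$ and $\mathbf{T}$.

$$Loss_f = -MI(\mathbf{X}; \mathbf{T}) = -\sum_{T} p(T) \sum_{x} p(x|T) \log \frac{p(x|T)}{p(x)} \qquad (1)$$

The prior probability of a template is given as $p(T_\mu) = \frac{\alpha}{n^2}$, $p(T^-) = 1 - \alpha$, where $\alpha$ is a constant prior likelihood. The fitness between a feature map $x$ and a template $T$ is measured as the conditional likelihood $p(x|T)$.

$$\forall T \in \mathbf{T}, \quad p(x|T) = \frac{1}{Z_T} \exp\big[tr(x \cdot T)\big] \qquad (2)$$

where $Z_T = \sum_{x \in \mathbf{X}} \exp[tr(x \cdot T)]$. $x \cdot T$ indicates the multiplication between $x$ and $T$; $tr(\cdot)$ indicates the trace of a matrix, and $tr(x \cdot T) = \sum_{ij} x_{ij} t_{ij}$. $p(x) = \sum_T p(T)\, p(x|T)$.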

Figure 3: Templates of $T_{\mu_i}$. In fact, the algorithm also supports a round template based on the L-2 norm distance. Here, we use the L-1 norm distance instead to speed up the computation.

Part templates: As shown in Fig. 3, a negative template is given as $T^- = (t^-_{ij})$, $t^-_{ij} = -\tau < 0$, where $\tau$ is a positive constant. A positive template corresponding to $\mu$ is given as $T_\mu = (t^+_{ij})$, $t^+_{ij} = \tau \cdot \max(1 - \beta \frac{\|[i,j] - \mu\|_1}{n}, -1)$, where $\|\cdot\|_1$ denotes the L-1 norm distance; $\beta$ is a constant parameter.
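The template construction and the forward-pass masking can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes the positive template has the form $t^+_{ij} = \tau \cdot \max(1 - \beta \|[i,j]-\mu\|_1 / n, -1)$, and the function names and the values of `tau` and `beta` are hypothetical choices made for the example.

```python
import numpy as np

def positive_template(n, mu, tau, beta):
    """Positive template T_mu: peaks at mu, decays with L1 distance, floored at -tau."""
    i, j = np.indices((n, n))
    l1 = np.abs(i - mu[0]) + np.abs(j - mu[1])
    return tau * np.maximum(1.0 - beta * l1 / n, -1.0)

def negative_template(n, tau):
    """Negative template T^-: every entry equals -tau."""
    return np.full((n, n), -tau)

def mask_feature_map(x, tau, beta):
    """Pick the template centred on the strongest unit and use it as a mask."""
    n = x.shape[0]
    mu_hat = np.unravel_index(np.argmax(x), x.shape)
    masked = np.maximum(x * positive_template(n, mu_hat, tau, beta), 0.0)
    return masked, mu_hat

# Illustrative values only (assumed, not the paper's settings).
x = np.zeros((5, 5))
x[1, 3] = 2.0   # strongest activation: becomes the inferred part location
x[4, 0] = 1.5   # distant activation: falls in the negative region of the mask
masked, mu_hat = mask_feature_map(x, tau=1.0, beta=4.0)
```

In this toy case the distant activation at `[4, 0]` is multiplied by a negative template entry and clipped to zero, which is the noise-filtering effect the mask is meant to have.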

3.1 Learning

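We train the interpretable CNN in an end-to-end manner. During the forward-propagation process, each filter in the CNN passes its information in a bottom-up manner, just like traditional CNNs. During the back-propagation process, each filter in an interpretable conv-layer receives gradients w.r.t. its feature map $x$ from both the final task loss $L(\hat{y}_k, y^*_k)$ and the local filter loss $Loss_f$, as follows:

$$\frac{\partial Loss}{\partial x_{ij}} = \lambda \frac{\partial Loss_f}{\partial x_{ij}} + \frac{1}{N} \sum_{k=1}^{N} \frac{\partial L(\hat{y}_k, y^*_k)}{\partial x_{ij}} \qquad (3)$$

where $\lambda$ is a weight.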

Figure 4: Given an input image $I$, from left to right, we consecutively show the feature map $x$ of a filter after the ReLU layer, the assigned mask $T_{\hat{\mu}}$, the masked feature map $x^{\text{masked}}$, and the image-resolution RF of activations in $x^{\text{masked}}$ computed by [38].

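We compute gradients of $Loss_f$ w.r.t. each element $x_{ij}$ of feature map $x$ as follows (please see the proof in the Appendix).

$$\frac{\partial Loss_f}{\partial x_{ij}} = -\sum_{T} \frac{p(T)\, t_{ij}}{Z_T} e^{tr(x \cdot T)} \Big\{ tr(x \cdot T) - \log\big[Z_T\, p(x)\big] \Big\} \approx -\frac{p(\hat{T})\, \hat{t}_{ij}}{Z_{\hat{T}}} e^{tr(x \cdot \hat{T})} \Big\{ tr(x \cdot \hat{T}) - \log Z_{\hat{T}} - \log p(x) \Big\} \qquad (4)$$

where $t_{ij}$ is the $(i,j)$-th element of $T$ and $\hat{T}$ is the target template for feature map $x$. If the given image $I$ belongs to the target category of filter $f$, then $\hat{T} = T_{\hat{\mu}}$, where $\hat{\mu} = \arg\max_{\mu=[i,j]} x_{ij}$. If image $I$ belongs to other categories, then $\hat{T} = T^-$. Considering that $e^{tr(x \cdot \hat{T})} \gg e^{tr(x \cdot T)}$ for all $T \in \mathbf{T} \setminus \{\hat{T}\}$ after initial learning episodes, we make the above approximation to simplify the computation. Because $Z_T$ is computed using numerous feature maps, we can roughly treat $Z_T$ as a constant when computing gradients in the above equation. We gradually update the value of $Z_T$ during the training process. Similarly, we can also approximate $p(x)$ without huge computation. (We can use a subset of feature maps to approximate the value of $Z_T$, and continue to update $Z_T$ when we receive more feature maps during the training process. Similarly, we can approximate $p(x)$ using a subset of feature maps. We compute $p(x) = \sum_T p(T)\, p(x|T)$.)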
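The gradient of the filter loss can be checked numerically. The sketch below assumes the loss is the negative mutual information with $p(x|T) = \exp[tr(x \cdot T)] / Z_T$ and $p(x) = \sum_T p(T)\, p(x|T)$, and that $Z_T$ is held constant as the text suggests. Under those assumptions the part of the loss that depends on a single feature map $x$ has the analytic gradient $-\sum_T p(T)\, t_{ij}\, p(x|T)\, [\log p(x|T) - \log p(x)]$, which is compared against central finite differences. The random templates, the uniform prior, and the fixed `z` are stand-ins chosen for illustration, not values from the paper.

```python
import numpy as np

def likelihoods(x, templates, z):
    """p(x|T) = exp(sum_ij x_ij * t_ij) / Z_T for every template, Z_T held fixed."""
    return np.exp(np.einsum("ij,tij->t", x, templates)) / z

def loss_term(x, templates, p_t, z):
    """Contribution of one feature map x to -MI(X; T)."""
    lik = likelihoods(x, templates, z)
    p_x = np.dot(p_t, lik)
    return -np.sum(p_t * lik * np.log(lik / p_x))

def loss_grad(x, templates, p_t, z):
    """Analytic gradient of loss_term w.r.t. every element of x."""
    lik = likelihoods(x, templates, z)
    p_x = np.dot(p_t, lik)
    weights = p_t * lik * np.log(lik / p_x)
    return -np.einsum("t,tij->ij", weights, templates)

def numeric_grad(x, templates, p_t, z, eps=1e-6):
    """Central finite differences, used only to validate loss_grad."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        upper = loss_term(x + step, templates, p_t, z)
        lower = loss_term(x - step, templates, p_t, z)
        grad[idx] = (upper - lower) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
n = 3
templates = rng.uniform(-0.5, 0.5, size=(n * n + 1, n, n))  # stand-in templates
p_t = np.full(n * n + 1, 1.0 / (n * n + 1))                 # assumed uniform prior
z = np.full(n * n + 1, 20.0)                                # Z_T treated as constant
x = rng.random((n, n))

analytic = loss_grad(x, templates, p_t, z)
numeric = numeric_grad(x, templates, p_t, z)
```

The term arising from differentiating $\log p(x|T)$ cancels against the derivative of $\log p(x)$, which is why no additive constant appears inside the braces.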

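Determining the target category for each filter: We need to assign each filter $f$ with a target category $\hat{c}$ to approximate gradients in Eqn. (4). We simply assign the filter $f$ with the category $\hat{c}$ whose images activate $f$ most, i.e. $\hat{c} = \arg\max_c \text{mean}_{x = f(I): I \in \mathbf{I}_c} \sum_{ij} x_{ij}$.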
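The category assignment rule amounts to an argmax over per-category mean activations. The following sketch assumes the feature maps of one filter are stacked into an `(N, n, n)` array with integer category labels; the function name and the toy data are hypothetical.

```python
import numpy as np

def assign_target_category(feature_maps, labels, num_categories):
    """Return the category whose images give the largest mean summed activation."""
    feature_maps = np.asarray(feature_maps, dtype=float)
    labels = np.asarray(labels)
    totals = feature_maps.reshape(len(feature_maps), -1).sum(axis=1)
    means = np.full(num_categories, -np.inf)   # categories without images never win
    for c in range(num_categories):
        if np.any(labels == c):
            means[c] = totals[labels == c].mean()
    return int(np.argmax(means)), means

# Four toy 2x2 feature maps: two images of category 0, two of category 1.
maps = np.stack([np.full((2, 2), v) for v in (0.1, 0.2, 1.0, 0.8)])
labels = np.array([0, 0, 1, 1])
target, means = assign_target_category(maps, labels, num_categories=2)
```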

4 Understanding of the loss

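In fact, the loss in Eqn. (1) can be re-written as follows (please see the proof in the Appendix).

$$Loss_f = -H(\mathbf{T}) + H(\mathbf{T}' = \{T^-, \mathbf{T}^+\} \mid \mathbf{X}) + \sum_x p(\mathbf{T}^+, x)\, H(\mathbf{T}^+ \mid X = x) \qquad (5)$$

In the above equation, the first term $H(\mathbf{T}) = -\sum_{T \in \mathbf{T}} p(T) \log p(T)$ is a constant, which denotes the prior entropy of part templates.

Low inter-category entropy: The second term is computed as

$$H(\mathbf{T}' = \{T^-, \mathbf{T}^+\} \mid \mathbf{X}) = -\sum_x p(x) \sum_{T \in \{T^-, \mathbf{T}^+\}} p(T|x) \log p(T|x) \qquad (6)$$

where $\mathbf{T}^+ = \{T_{\mu_1}, \ldots, T_{\mu_{n^2}}\} \subset \mathbf{T}$, $p(\mathbf{T}^+|x) = \sum_\mu p(T_\mu|x)$. This term encourages a low conditional entropy of inter-category activations, i.e. a well-learned filter $f$ needs to be exclusively activated by a certain category $c$ and keep silent on other categories. We can use a feature map $x$ of $f$ to identify whether the input image belongs to category $c$ or not, i.e. $x$ fitting to either $T_{\hat{\mu}}$ or $T^-$, without great uncertainty. Here, we define the set of all positive templates $\mathbf{T}^+$ as a single label to represent category $c$. We use the negative template $T^-$ to denote other categories.

Low spatial entropy: The third term in Eqn. (5) is given as

$$H(\mathbf{T}^+ \mid X = x) = -\sum_\mu \tilde{p}(T_\mu|x) \log \tilde{p}(T_\mu|x) \qquad (7)$$

where $\tilde{p}(T_\mu|x) = \frac{p(T_\mu|x)}{p(\mathbf{T}^+|x)}$. This term encourages a low conditional entropy of spatial distribution of $x$'s activations. I.e. given an image $I \in \mathbf{I}_c$, a well-learned filter should only be activated by a single region $\hat{\mu}$ of the feature map $x$, instead of repetitively appearing at different locations.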
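The three-term reading of the loss can be verified numerically on toy data. The sketch below assumes the loss is $-MI(\mathbf{X}; \mathbf{T})$ with prior $p(T_\mu) = \alpha / n^2$, $p(T^-) = 1 - \alpha$, and likelihood $p(x|T) \propto \exp(\sum_{ij} x_{ij} t_{ij})$ normalised over the available feature maps. It then checks that the loss equals minus the prior entropy, plus the inter-category conditional entropy, plus the weighted spatial entropy. All constants and the random feature maps are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_maps = 4, 50
tau, beta, alpha = 0.5, 4.0, 0.8   # illustrative constants, not the paper's settings

# Templates: index 0 is the negative template, indices 1..n^2 are positive ones.
i, j = np.indices((n, n))
positive = [tau * np.maximum(1.0 - beta * (np.abs(i - a) + np.abs(j - b)) / n, -1.0)
            for a in range(n) for b in range(n)]
templates = np.stack([np.full((n, n), -tau)] + positive)
p_t = np.concatenate([[1.0 - alpha], np.full(n * n, alpha / n**2)])   # prior p(T)

x = rng.random((num_maps, n, n))                          # stand-in feature maps
score = np.einsum("xij,tij->xt", x, templates)            # sum_ij x_ij * t_ij
p_x_given_t = np.exp(score) / np.exp(score).sum(axis=0)   # normalised over x per template
joint = p_x_given_t * p_t                                 # p(x, T)
p_x = joint.sum(axis=1)
p_t_given_x = joint / p_x[:, None]

def entropy(p, axis=None):
    """Shannon entropy in nats; every probability here is strictly positive."""
    return -np.sum(p * np.log(p), axis=axis)

loss = -np.sum(joint * np.log(p_x_given_t / p_x[:, None]))   # -MI(X; T)

h_prior = entropy(p_t)                                       # H(T)
p_neg = p_t_given_x[:, 0]                                    # p(T^- | x)
p_pos = p_t_given_x[:, 1:].sum(axis=1)                       # p(T^+ | x)
h_inter = np.sum(p_x * -(p_neg * np.log(p_neg) + p_pos * np.log(p_pos)))
p_tilde = p_t_given_x[:, 1:] / p_pos[:, None]
h_spatial = np.sum(p_x * p_pos * entropy(p_tilde, axis=1))   # sum_x p(T^+, x) H(T^+ | x)

decomposed = -h_prior + h_inter + h_spatial
```

The identity follows from the chain rule for entropy: the conditional entropy over all templates splits into the uncertainty about positive versus negative, plus the uncertainty about which positive template applies.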

5 Experiments

In experiments, to demonstrate the broad applicability, we applied our method to CNNs with four types of structures. We used object images in three different benchmark datasets to learn interpretable CNNs for single-category classification and multi-category classification. We visualized feature maps of filters in interpretable conv-layers to illustrate semantic meanings of these filters. We used two types of metrics, i.e. the object-part interpretability and the location stability, to evaluate the clarity of the part semantics of a convolutional filter. Experiments showed that filters in our interpretable CNNs were much more semantically meaningful than those in ordinary CNNs.

Three benchmark datasets: Because we needed ground-truth annotations of object landmarks (parts) to evaluate the semantic clarity of each filter, we chose three benchmark datasets with landmark/part annotations for training and testing, including the ILSVRC 2013 DET Animal-Part dataset [36], the CUB200-2011 dataset [30], and the Pascal VOC Part dataset [3]. As discussed in [3, 36], non-rigid parts of animal categories usually present great challenges for part localization. Thus, we followed [3, 36] to select the 37 animal categories in the three datasets for evaluation. (To avoid ambiguity, a landmark refers to the central position of a semantic part, i.e. a part with an explicit name such as a head or a tail. In contrast, the part corresponding to a filter does not have an explicit name.)

All three datasets provide ground-truth bounding boxes of entire objects. For landmark annotations, the ILSVRC 2013 DET Animal-Part dataset [36] contains ground-truth bounding boxes of heads and legs of 30 animal categories. The CUB200-2011 dataset [30] contains a total of 11.8K bird images of 200 species, and the dataset provides center positions of 15 bird landmarks. The Pascal VOC Part dataset [3] contains ground-truth part segmentations of 107 object landmarks in six animal categories.

Four types of CNNs: To demonstrate the broad applicability of our method, we modified four typical CNNs, i.e. the AlexNet [12], the VGG-M [25], the VGG-S [25], and the VGG-16 [25], into interpretable CNNs. Considering that skip connections in residual networks [7] usually make a single feature map encode patterns of different filters, in this study, we did not test the performance on residual networks to simplify the story. Given a certain CNN structure, we modified all filters in the top conv-layer of the original network into interpretable ones. Then, we inserted a new conv-layer with $M$ filters above the original top conv-layer, where $M$ is the channel number of the input of the new conv-layer. We also set filters in the new conv-layer as interpretable ones. Each filter was a $3 \times 3 \times M$ tensor with a bias term. We added zero padding to input feature maps to ensure that output feature maps were of the same size as the input.

Implementation details: We set parameters as , , and . We updated weights of filter losses w.r.t. magnitudes of neural activations in an online manner, . We initialized parameters of fully-connected (FC) layers and the new conv-layer, and loaded parameters of other conv-layers from a traditional CNN that was pre-trained using 1.2M ImageNet images in [12, 25]. We then fine-tuned the interpretable CNN using training images in the dataset. To enable a fair comparison, traditional CNNs were also fine-tuned by initializing FC-layer parameters and loading conv-layer parameters.

5.1 Experiments

Single-category classification: We learned four types of interpretable CNNs based on the AlexNet, VGG-M, VGG-S, and VGG-16 structures to classify each category in the ILSVRC 2013 DET Animal-Part dataset [36], the CUB200-2011 dataset [30], and the Pascal VOC Part dataset [3]. In addition, we learned ordinary AlexNet, VGG-M, VGG-S, and VGG-16 networks using the same training data for comparison. We used the logistic log loss for single-category classification. Following experimental settings in [36, 37, 35], we cropped objects of the target category based on their bounding boxes as positive samples with ground-truth labels $y^* = +1$. We regarded images of other categories as negative samples with ground-truth labels $y^* = -1$.

Multi-category classification: We used the six animal categories in the Pascal VOC Part dataset [3] and the thirty categories in the ILSVRC 2013 DET Animal-Part dataset [36] respectively, to learn CNNs for multi-category classification. We learned interpretable CNNs based on the VGG-M, VGG-S, and VGG-16 structures. We tried two types of losses, i.e. the softmax log loss and the logistic log loss, for multi-class classification. (For the logistic log loss, we considered the output for each category to be independent of outputs for other categories, so that a CNN made multiple independent single-class classifications for each image. Table 7 reported the average accuracy of the multiple classification outputs of an image.)

5.2 Quantitative evaluation of part interpretability

As discussed in [2], filters in low conv-layers usually represent simple patterns or object details (e.g. edges, simple textures, and colors), whereas filters in high conv-layers are more likely to represent complex, large-scale parts. Therefore, in experiments, we evaluated the clarity of part semantics for the top conv-layer of a CNN. We used the following two metrics for evaluation.

5.2.1 Evaluation metric: part interpretability

We followed the metric proposed by Bau et al. [2] to measure the object-part interpretability of filters. We briefly introduce this evaluation metric as follows. For each filter $f$, we computed its feature maps $\mathbf{X}$ after ReLU/mask operations on different input images. Then, the distribution of activation scores in all positions of all feature maps was computed. [2] set an activation threshold $T_f$ such that $p(x_{ij} > T_f) = 0.005$, so as to select top activations from all spatial locations $[i,j]$ of all feature maps $x \in \mathbf{X}$ as valid map regions corresponding to $f$'s semantics. Then, [2] scaled up low-resolution valid map regions to the image resolution, thereby obtaining the receptive field (RF) of valid activations on each image. The RF on image $I$, denoted by $S_f^I$, described the part region of $f$.

Note that [38] accurately computes the RF when the filter represents an object part, and we used RFs computed by [38] for filter visualization in Fig. 5. However, when a filter in an ordinary CNN does not have consistent contours, it is difficult for [38] to align different images to compute an average RF. Thus, for ordinary CNNs, we simply used a round RF for each valid activation. We overlapped all activated RFs in a feature map to compute the final RF as mentioned in [2]. For a fair comparison, in this evaluation we uniformly applied these RFs to both interpretable CNNs and ordinary CNNs.

Network bird cat cow dog horse sheep Avg.
AlexNet 0.332 0.363 0.340 0.374 0.308 0.373 0.348
AlexNet, interpretable 0.770 0.565 0.618 0.571 0.729 0.669 0.654
VGG-16 0.519 0.458 0.479 0.534 0.440 0.542 0.495
VGG-16, interpretable 0.818 0.653 0.683 0.900 0.795 0.772 0.770
VGG-M 0.357 0.365 0.347 0.368 0.331 0.373 0.357
VGG-M, interpretable 0.821 0.632 0.634 0.669 0.736 0.756 0.708
VGG-S 0.251 0.269 0.235 0.275 0.223 0.287 0.257
VGG-S, interpretable 0.526 0.366 0.291 0.432 0.478 0.251 0.390
Table 1: Average part interpretability of filters in CNNs for single-category classification using the Pascal VOC Part dataset [3].

The compatibility between each filter $f$ and the $k$-th part on image $I$ was reported as an intersection-over-union score $IoU^I_{f,k} = \frac{\|S_f^I \cap S_k^I\|}{\|S_f^I \cup S_k^I\|}$, where $S_k^I$ denotes the ground-truth mask of the $k$-th part on image $I$. Given an image $I$, we associated filter $f$ with the $k$-th part if $IoU^I_{f,k} > 0.2$. Note that the criterion of $IoU^I_{f,k} > 0.2$ for part association is much stricter than the criterion of $IoU^I_{f,k} > 0.04$ that was used in [2]. This is because, compared to other CNN semantics discussed in [2] (such as colors and textures), object-part semantics requires a stricter criterion. We computed the probability of the $k$-th part being associated with the filter $f$ as $P_{f,k} = \text{mean}_{I:\text{ with } k\text{-th part}}\, \mathbf{1}(IoU^I_{f,k} > 0.2)$. Note that one filter might be associated with multiple object parts in an image. Among all parts, we reported the highest probability of part association as the interpretability of filter $f$, i.e. $P_f = \max_k P_{f,k}$.
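The part-interpretability score of one filter can be sketched as follows, assuming the receptive-field regions and the ground-truth part regions are available as boolean masks at image resolution. The data layout (one RF mask per image, one dictionary of part masks per image), the function names, and the default threshold of 0.2 are assumptions made for this illustration.

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection-over-union of two boolean masks (0.0 if both are empty)."""
    union = np.logical_or(mask_a, mask_b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(mask_a, mask_b).sum() / union)

def filter_interpretability(rf_masks, part_masks, threshold=0.2):
    """Highest probability, over parts, that the filter's RF matches the part.

    rf_masks:   list of (H, W) boolean arrays, one RF mask per image.
    part_masks: list of dicts {part_name: (H, W) boolean mask}, one per image;
                a part is listed only for images that contain it.
    """
    hits, counts = {}, {}
    for rf, parts in zip(rf_masks, part_masks):
        for name, gt in parts.items():
            counts[name] = counts.get(name, 0) + 1
            hits[name] = hits.get(name, 0) + int(iou(rf, gt) > threshold)
    probs = {name: hits[name] / counts[name] for name in counts}
    return max(probs.values()), probs

# Toy example on 4x4 images.
top_left = np.zeros((4, 4), dtype=bool); top_left[:2, :2] = True
bottom_right = np.zeros((4, 4), dtype=bool); bottom_right[2:, 2:] = True
top_rows = np.zeros((4, 4), dtype=bool); top_rows[:2, :] = True

rf_masks = [top_left, top_left, bottom_right]
part_masks = [
    {"head": top_left, "leg": bottom_right},   # head IoU 1.0, leg IoU 0.0
    {"head": top_rows},                        # head IoU 0.5
    {"head": top_left},                        # head IoU 0.0
]
score, probs = filter_interpretability(rf_masks, part_masks)
```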

For single-category classification, we used testing images of the target category for evaluation. In the Pascal VOC Part dataset [3], we used four parts for the bird category. We merged ground-truth regions of the head, beak, and l/r-eyes as the head part, merged regions of the torso, neck, and l/r-wings as the torso part, merged regions of l/r-legs/feet as the leg part, and used tail regions as the fourth part. We used five parts for the cat category. We merged regions of the head, l/r-eyes, l/r-ears, and nose as the head part, merged regions of the torso and neck as the torso part, merged regions of frontal l/r-legs/paws as the frontal legs, merged regions of back l/r-legs/paws as the back legs, and used the tail as the fifth part. We used four parts for the cow category, which were defined in a similar way to the cat category. We added l/r-horns to the head part and omitted the tail part. We applied five parts of the dog category in the same way as the cat category. We applied four parts of both the horse and sheep categories in the same way as the cow category. We computed the average part interpretability over all filters for evaluation.

For multi-category classification, we first assigned each filter $f$ with a target category $\hat{c}$, i.e. the category that activated the filter most, $\hat{c} = \arg\max_c \text{mean}_{x = f(I): I \in \mathbf{I}_c} \sum_{i,j} x_{ij}$. Then, we computed the object-part interpretability using images of category $\hat{c}$, as introduced above.

Network  Logistic log loss  Softmax log loss
VGG-16 0.710 0.723
VGG-16, interpretable 0.938 0.897
VGG-M 0.478 0.502
VGG-M, interpretable 0.770 0.734
VGG-S 0.479 0.435
VGG-S, interpretable 0.572 0.601
Table 2: Average part interpretability of filters in CNNs that are trained for multi-category classification. Filters in our interpretable CNNs exhibited significantly better part interpretability than other CNNs in all comparisons.

5.2.2 Evaluation metric: location stability

gold. bird frog turt. liza. koala lobs. dog fox cat lion tiger bear rabb. hams. squi.
AlexNet 0.161 0.167 0.152 0.153 0.175 0.128 0.123 0.144 0.143 0.148 0.137 0.142 0.144 0.148 0.128 0.149
AlexNet, interpretable 0.084 0.095 0.090 0.107 0.097 0.079 0.077 0.093 0.087 0.095 0.084 0.090 0.095 0.095 0.077 0.095
VGG-16 0.153 0.156 0.144 0.150 0.170 0.127 0.126 0.143 0.137 0.148 0.139 0.144 0.143 0.146 0.125 0.150
VGG-16, interpretable 0.076 0.099 0.086 0.115 0.113 0.070 0.084 0.077 0.069 0.086 0.067 0.097 0.081 0.079 0.066 0.065
VGG-M 0.161 0.166 0.151 0.153 0.176 0.128 0.125 0.145 0.145 0.150 0.140 0.145 0.144 0.150 0.128 0.150
VGG-M, interpretable 0.088 0.088 0.089 0.108 0.099 0.080 0.074 0.090 0.082 0.103 0.079 0.089 0.101 0.097 0.082 0.095
VGG-S 0.158 0.166 0.149 0.151 0.173 0.127 0.124 0.143 0.142 0.148 0.138 0.142 0.143 0.148 0.128 0.146
VGG-S, interpretable 0.087 0.101 0.093 0.107 0.096 0.084 0.078 0.091 0.082 0.101 0.082 0.089 0.097 0.091 0.076 0.098
horse zebra swine hippo catt. sheep ante. camel otter arma. monk. elep. red pa. Avg.
AlexNet 0.152 0.154 0.141 0.141 0.144 0.155 0.147 0.153 0.159 0.160 0.139 0.125 0.140 0.125 0.146
AlexNet, interpretable 0.098 0.084 0.091 0.089 0.097 0.101 0.085 0.102 0.104 0.095 0.090 0.085 0.084 0.073 0.091
VGG-16 0.150 0.153 0.141 0.140 0.140 0.150 0.144 0.149 0.154 0.163 0.136 0.129 0.143 0.125 0.144
VGG-16, interpretable 0.106 0.077 0.094 0.083 0.102 0.097 0.091 0.105 0.093 0.100 0.074 0.084 0.067 0.063 0.085
VGG-M 0.151 0.158 0.140 0.140 0.143 0.155 0.146 0.154 0.160 0.161 0.140 0.126 0.142 0.127 0.147
VGG-M, interpretable 0.095 0.080 0.095 0.084 0.092 0.094 0.077 0.104 0.102 0.093 0.086 0.087 0.089 0.068 0.090
VGG-S 0.149 0.155 0.139 0.140 0.141 0.155 0.143 0.154 0.158 0.157 0.140 0.125 0.139 0.125 0.145
VGG-S, interpretable 0.096 0.080 0.092 0.088 0.094 0.101 0.077 0.102 0.105 0.094 0.090 0.086 0.078 0.072 0.090
Table 3: Location instability of filters in CNNs that are trained for single-category classification using the ILSVRC 2013 DET Animal-Part dataset [36]. Filters in our interpretable CNNs exhibited significantly lower localization instability than ordinary CNNs in all comparisons. Please see supplementary materials for performance of other structural modifications of CNNs.

The second metric, which was proposed in [35], measures the stability of part locations. Given a feature map $x$ of filter $f$, we regarded the unit $\hat{\mu}$ with the highest activation as the location inference of $f$. We assumed that if $f$ consistently represented the same object part across different objects, then distances between the inferred part location $\hat{\mu}$ and some object landmarks should not change much among different objects. For example, if $f$ represented the shoulder, then the distance between the shoulder and the head should remain stable across different objects.

Therefore, [35] computed the deviation of the distance between the inferred position $\hat{\mu}$ and a specific ground-truth landmark among different images, and used the average deviation w.r.t. various landmarks to evaluate the location stability of $f$. A smaller deviation indicates a higher location stability. Let $d_I(p_k, \hat{\mu}) = \frac{\|p_k - p(\hat{\mu})\|}{\sqrt{w^2 + h^2}}$ denote the normalized distance between the inferred part and the $k$-th landmark $p_k$ on image $I$, where $p(\hat{\mu})$ denotes the center of the unit $\hat{\mu}$'s RF when we backward propagated the RF to the image plane. $\sqrt{w^2 + h^2}$ denotes the diagonal length of the input image. We computed $D_{f,k} = \sqrt{\text{var}_I[d_I(p_k, \hat{\mu})]}$ as the relative location deviation of filter $f$ w.r.t. the $k$-th landmark, where $\text{var}_I[d_I(p_k, \hat{\mu})]$ is referred to as the variation of the distance $d_I(p_k, \hat{\mu})$. Because each landmark could not appear in all testing images, for each filter $f$, we only used inference results with the top-100 highest activation scores on images containing the $k$-th landmark to compute $D_{f,k}$. Thus, we used the average of relative location deviations of all the filters in a conv-layer w.r.t. all landmarks, i.e. $\text{mean}_f\, \text{mean}_{k=1}^{K} D_{f,k}$, to measure the location instability of $f$, where $K$ denotes the number of landmarks.
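The location-instability computation can be sketched as below, assuming the inferred positions have already been mapped to image coordinates and restricted to the images used for a given landmark. The deviation is taken as the standard deviation of the diagonal-normalised distance across images; the function names and array layout are hypothetical.

```python
import numpy as np

def location_deviation(inferred_xy, landmark_xy, image_wh):
    """Relative location deviation of one filter w.r.t. one landmark.

    inferred_xy: (N, 2) image-plane centres of the filter's peak unit.
    landmark_xy: (N, 2) ground-truth landmark positions on the same images.
    image_wh:    (N, 2) image widths and heights, used for normalisation.
    """
    inferred_xy = np.asarray(inferred_xy, dtype=float)
    landmark_xy = np.asarray(landmark_xy, dtype=float)
    image_wh = np.asarray(image_wh, dtype=float)
    diagonal = np.hypot(image_wh[:, 0], image_wh[:, 1])
    distance = np.linalg.norm(landmark_xy - inferred_xy, axis=1) / diagonal
    return float(np.sqrt(np.var(distance)))

def location_instability(deviations):
    """Average deviation over all filters and landmarks, given an (F, K) array."""
    return float(np.mean(deviations))

# A filter that always fires at a fixed offset from the landmark is perfectly stable.
inferred = np.array([[10.0, 10.0], [20.0, 30.0], [5.0, 40.0]])
sizes = np.tile([30.0, 40.0], (3, 1))            # diagonal length 50
stable = location_deviation(inferred, inferred + [3.0, 4.0], sizes)

# Distances of 0 and 10 pixels on a 50-pixel diagonal give a deviation of 0.1.
unstable = location_deviation([[0.0, 0.0], [0.0, 0.0]],
                              [[0.0, 0.0], [6.0, 8.0]],
                              sizes[:2])
```

Note that a constant offset between the inferred location and the landmark yields zero deviation: the metric rewards consistency of the relative position, not proximity to the landmark.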

More specifically, object landmarks for each category were selected as follows. For the ILSVRC 2013 DET Animal-Part dataset [36], we used the head and frontal legs of each category as landmarks for evaluation. For the Pascal VOC Part dataset [3], we selected the head, neck, and torso of each category as the landmarks. For the CUB200-2011 dataset [30], we used ground-truth positions of the head, back, tail of birds as landmarks. It was because these landmarks appeared on testing images most frequently.

Network bird cat cow dog horse sheep Avg.
AlexNet 0.153 0.131 0.141 0.128 0.145 0.140 0.140
AlexNet, interpretable 0.090 0.089 0.090 0.088 0.087 0.088 0.088
VGG-16 0.145 0.133 0.146 0.127 0.143 0.143 0.139
VGG-16, interpretable 0.101 0.098 0.105 0.074 0.097 0.100 0.096
VGG-M 0.152 0.132 0.143 0.130 0.145 0.141 0.141
VGG-M, interpretable 0.086 0.094 0.090 0.087 0.084 0.084 0.088
VGG-S 0.152 0.131 0.141 0.128 0.144 0.141 0.139
VGG-S, interpretable 0.089 0.092 0.092 0.087 0.086 0.088 0.089
Table 4: Location instability of filters in CNNs that are trained for single-category classification using the Pascal VOC Part dataset [3]. Filters in our interpretable CNNs exhibited significantly lower localization instability than ordinary CNNs in all comparisons. Please see supplementary materials for performance of other structural modifications of CNNs.

    Network     Avg. location instability
AlexNet 0.150
AlexNet, interpretable 0.070
VGG-16 0.137
VGG-16, interpretable 0.076
VGG-M 0.148
VGG-M, interpretable 0.065
VGG-S 0.148
VGG-S, interpretable 0.073
Table 5: Location instability of filters in CNNs for single-category classification based on the CUB200-2011 dataset [30]. Please see supplementary materials for performance of other structural modifications on ordinary CNNs.

Dataset ILSVRC Part [36] Pascal VOC Part [3]
Network Logistic log loss Logistic log loss Softmax log loss
VGG-16 0.128 0.142
interpretable 0.073 0.075
VGG-M 0.167 0.135 0.137
interpretable 0.096 0.083 0.087
VGG-S 0.131 0.138 0.138
interpretable 0.083 0.078 0.082
Table 6: Location instability of filters () in CNNs that are trained for multi-category classification. Filters in our interpretable CNNs exhibited significantly lower localization instability than ordinary CNNs in all comparisons.

For multi-category classification, we needed to determine two terms for each filter $f$, i.e. 1) the category that $f$ mainly represented and 2) the relative location deviation $D_{f,k}$ w.r.t. landmarks in $f$’s target category. Because filters in ordinary CNNs did not exclusively represent a single category, we simply assigned filter $f$ to the category whose landmarks achieved the lowest location deviation, in order to simplify the computation. I.e. we used the average location deviation $\mathrm{mean}_f\,\min_c\,\mathrm{mean}_{k\in Part_c}D_{f,k}$ to evaluate the location stability, where $Part_c$ denotes the set of part indexes belonging to category $c$.
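As a rough illustration of this assignment rule, the sketch below takes, for each filter, the category whose landmarks give the lowest mean deviation, and then averages that minimum over filters. The function name and the dictionary-based data layout are hypothetical choices for this example, not part of the original method description.

```python
from statistics import mean


def multi_category_instability(deviation, parts_by_category):
    """Average over filters of the lowest per-category mean deviation.

    deviation (hypothetical layout): dict filter id -> dict landmark index -> deviation.
    parts_by_category: dict category name -> list of landmark indexes of that category.
    """
    per_filter = []
    for filter_deviation in deviation.values():
        # Mean deviation of this filter w.r.t. the landmarks of each category.
        category_means = [
            mean(filter_deviation[k] for k in parts)
            for parts in parts_by_category.values()
        ]
        # The filter is assigned to the category with the lowest mean deviation.
        per_filter.append(min(category_means))
    return mean(per_filter)
```

Taking the minimum over categories favors each filter, so for an ordinary CNN this gives an optimistic estimate of its location stability rather than a pessimistic one.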

Figure 5: Visualization of filters in top conv-layers. We used [38] to estimate the image-resolution receptive field of activations in a feature map to visualize a filter’s semantics. The top four rows visualize filters in interpretable CNNs, and the bottom two rows correspond to filters in ordinary CNNs. We found that interpretable CNNs usually encoded head patterns of animals in their top conv-layers for classification.

5.2.3 Experimental results and analysis

Tables 1 and 2 compare part interpretability of CNNs for single-category classification and that of CNNs for multi-category classification, respectively. Tables 3, 4, and 5 list average relative location deviations of CNNs for single-category classification. Table 6 compares average relative location deviations of CNNs for multi-category classification. Our interpretable CNNs exhibited much higher interpretability and much better location stability than ordinary CNNs in almost all comparisons. Table 7 compares classification accuracy of different CNNs. Ordinary CNNs performed better in single-category classification, whereas interpretable CNNs exhibited superior performance in multi-category classification. The good performance in multi-category classification may be because the clarification of filter semantics in early epochs reduced the difficulty of filter learning in later epochs.
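To put a number on "much better location stability", the relative reduction in location instability can be worked out directly from the reported averages. The short sketch below does this for the CUB200-2011 values in Table 5; the helper name `relative_reduction` is introduced here for illustration only.

```python
# (ordinary, interpretable) average location instability, copied from Table 5.
table5 = {
    "AlexNet": (0.150, 0.070),
    "VGG-16": (0.137, 0.076),
    "VGG-M": (0.148, 0.065),
    "VGG-S": (0.148, 0.073),
}


def relative_reduction(ordinary, interpretable):
    """Fraction by which the interpretable CNN lowers the instability."""
    return (ordinary - interpretable) / ordinary


for name, (ordinary, interpretable) in table5.items():
    print(f"{name}: {relative_reduction(ordinary, interpretable):.1%}")
```

With these values, the reduction falls between roughly 45% and 56% across the four networks.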

Figure 6: Heat maps for distributions of object parts that are encoded in interpretable filters. We use all filters in the top conv-layer to compute the heat map.

5.3 Visualization of filters

We followed the method proposed by Zhou et al. [38] to compute the RF of neural activations of an interpretable filter, which was scaled up to the image resolution. Fig. 5 shows RFs⁶ of filters in top conv-layers of CNNs, which were trained for single-category classification. Filters in interpretable CNNs were mainly activated by a certain object part, whereas filters in ordinary CNNs usually did not have explicit semantic meanings. Fig. 6 shows heat maps for distributions of object parts that were encoded in interpretable filters. Interpretable filters usually selectively modeled distinct object parts of a category and ignored other parts.

| Network | Multi-category, logistic⁵ | Multi-category, logistic⁵ | Multi-category, softmax | Single-category | Single-category | Single-category |
|---|---|---|---|---|---|---|
| AlexNet | - | - | - | 96.28 | 95.40 | 95.59 |
| AlexNet, interpretable | - | - | - | 95.38 | 93.93 | 95.35 |
| VGG-M | 96.73 | 93.88 | 81.93 | 97.34 | 96.82 | 97.34 |
| VGG-M, interpretable | 97.99 | 96.19 | 88.03 | 95.77 | 94.17 | 96.03 |
| VGG-S | 96.98 | 94.05 | 78.15 | 97.62 | 97.74 | 97.24 |
| VGG-S, interpretable | 98.72 | 96.78 | 86.13 | 95.64 | 95.47 | 95.82 |
| VGG-16 | - | 97.97 | 89.71 | 98.58 | 98.66 | 98.91 |
| VGG-16, interpretable | - | 98.50 | 91.60 | 96.67 | 95.39 | 96.51 |

Table 7: Classification accuracy based on different datasets. In single-category classification, ordinary CNNs performed better, while in multi-category classification, interpretable CNNs exhibited superior performance.

6 Conclusion and discussions

In this paper, we have proposed a general method to modify traditional CNNs to enhance their interpretability. As discussed in [2], besides the discrimination power, the interpretability is another crucial property of a network. We design a loss to push a filter in high conv-layers toward the representation of an object part without additional annotations for supervision. Experiments have shown that our interpretable CNNs encoded more semantically meaningful knowledge in high conv-layers than traditional CNNs.

In future work, we will design new filters to describe discriminative textures of a category and new filters for object parts that are shared by multiple categories, in order to achieve a higher model flexibility.



Proof of equations


Visualization of CNN filters

Figure 7: Visualization of filters in the top interpretable conv-layer. Each row corresponds to feature maps of a filter in a CNN that is learned to classify a certain category.
Figure 8: Visualization of filters in the top interpretable conv-layer. Each row corresponds to feature maps of a filter in a CNN that is learned to classify a certain category.
Figure 9: Visualization of filters in the top conv-layer of an ordinary CNN. Each row corresponds to feature maps of a filter in a CNN that is learned to classify a certain category.