Convolutional neural networks (CNNs) [LeCun et al.1998, Krizhevsky, Sutskever, and Hinton2012, He et al.2016, Li et al.2015] have achieved superior performance in object classification and detection. However, the end-to-end learning strategy makes the entire CNN a black box. When a CNN is trained for object classification, we believe that its conv-layers have encoded rich implicit patterns (e.g. patterns of object parts and patterns of textures). Therefore, in this research, we aim to provide a global view of how visual knowledge is organized in a pre-trained CNN, which presents considerable challenges. For example,
How many types of patterns are memorized by each convolutional filter of the CNN (here, a pattern may describe a specific object part or a certain texture)?
Which patterns are co-activated to describe an object part?
What is the spatial relationship between two patterns?
In this study, given a pre-trained CNN, we propose to mine mid-level object part patterns from conv-layers, and we organize these patterns in an explanatory graph in an unsupervised manner. As shown in Fig. 1, the explanatory graph explains the knowledge hierarchy hidden inside the CNN. The explanatory graph disentangles the mixture of part patterns in each filter’s feature map of a conv-layer, and uses each graph node to represent a part.
• Representing knowledge hierarchy: The explanatory graph has multiple layers, which correspond to different conv-layers of the CNN. Each graph layer has many nodes. We use these graph nodes to summarize the knowledge hidden in chaotic feature maps of the corresponding conv-layer. Because each filter in the conv-layer may potentially represent multiple parts of the object, we use graph nodes to represent patterns of all candidate parts. A graph edge connects two nodes in adjacent layers to encode co-activation logics and spatial relationships between them.
Note that we do not fix the location of each pattern (node) to a certain neural unit of a conv-layer’s output. Instead, given different input images, a part pattern may appear at various positions of a filter’s feature maps. For example, the horse face pattern and the horse ear pattern in Fig. 1 can appear at different positions in different images, as long as they are co-activated and keep certain spatial relationships.
• Disentangling object parts from a single filter: As shown in Fig. 1, each filter in a conv-layer may be activated by different object parts (e.g. the filter’s feature map may be activated by both the head and the neck of a horse). To clarify the knowledge representation, we hope to disentangle patterns of different object parts from the same filter in an unsupervised manner, which presents a big challenge for state-of-the-art algorithms.
In this study, we propose a simple yet effective method to automatically discover object parts from a filter’s feature maps without ground-truth part annotations. In this way, we can filter out noisy activations from feature maps, and we ensure that each graph node consistently represents the same object part among different images.
Given a testing image to the CNN, the explanatory graph can tell 1) whether a node (part) is triggered and 2) the location of the part on the feature map.
• Graph nodes with high transferability: Just like a dictionary, the explanatory graph provides off-the-shelf patterns for object parts, which makes it possible to transfer knowledge from conv-layers to other tasks. Considering that all filters in the CNN are learned using numerous images, we can regard each graph node as a detector that has been carefully learned to detect a part among thousands of images. Compared to chaotic feature maps of conv-layers, our explanatory graph is a more concise and meaningful representation of the CNN knowledge.
To prove the above assertions, we learn explanatory graphs for different CNNs (including the VGG-16, residual networks, and the encoder of a VAE-GAN) and analyze the graphs from different perspectives as follows.
Visualization & reconstruction: Patterns in graph nodes can be directly visualized in two ways. First, for each graph node, we list object parts that trigger strong node activations. Second, we use activation states of graph nodes to reconstruct image regions related to the nodes.
Examining part interpretability of graph nodes: [Bau et al.2017] defined different types of interpretability for a CNN. In this study, we evaluate the part-level interpretability of the graph nodes. I.e. given an explanatory graph, we check whether a node consistently represents the same part semantics among different objects. We follow ideas of [Bau et al.2017, Zhou et al.2015] to measure the part interpretability of each node.
Examining location instability of graph nodes: Besides the part interpretability, we also define a new metric, namely location instability, to evaluate the clarity of the semantic meaning of each node in the explanatory graph. We assume that if a graph node consistently represents the same object part, then the distance between the inferred part and some ground-truth semantic parts of the object should not change a lot among different images.
Testing transferability: We associate graph nodes with explicit part names for multi-shot part localization. The superior performance of our method shows the good transferability of our graph nodes.
In experiments, we demonstrate both the representation clarity and the high transferability of the explanatory graph.
Contributions of this paper are summarized as follows.
1) In this paper, we, for the first time, propose a simple yet effective method to clarify the chaotic knowledge hidden inside a pre-trained CNN and to summarize such a deep knowledge hierarchy using an explanatory graph. The graph disentangles part patterns from each filter of the CNN. Experiments show that each graph node consistently represents the same object part among different images.
2) Our method can be applied to different CNNs, e.g. VGGs, residual networks, and the encoder of a VAE-GAN.
3) The mined patterns have good transferability, especially in multi-shot part localization. Although our patterns were pre-trained without part annotations, our transfer-learning-based part localization outperformed approaches that learned part representations with part annotations.
Semantics in CNNs
The interpretability and the discrimination power are two crucial aspects of a CNN [Bau et al.2017]. In recent years, different methods have been developed to explore the semantics hidden inside a CNN. Many statistical methods [Szegedy et al.2014, Yosinski et al.2014, Aubry and Russell2015] have been proposed to analyze the characteristics of CNN features. In particular, [Zhang, Wang, and Zhu2018] demonstrated that, in spite of good classification performance, a CNN may encode biased knowledge representations due to dataset bias. In such cases, the CNN usually uses unreliable contexts for classification. For example, a CNN may extract features from hair as a context to identify the smiling attribute. Therefore, we need methods to visualize the knowledge hierarchy hidden inside a CNN.
Visualization & interpretability of CNN filters:
Visualization of filters in a CNN is the most direct way of exploring the pattern hidden inside a neural unit. Up-convolutional nets [Dosovitskiy and Brox2016] were developed to invert feature maps to images. Comparatively, gradient-based visualization [Zeiler and Fergus2014, Mahendran and Vedaldi2015, Simonyan, Vedaldi, and Zisserman2013] showed the appearance that maximized the score of a given unit, which is closer to the spirit of understanding CNN knowledge. Furthermore, Bau et al. [Bau et al.2017] defined and analyzed the interpretability of each filter.
Although these studies achieved clear visualization results, theoretically, gradient-based visualization methods visualize one of the local minima contained in a high-layer filter. That is, when a filter represents multiple patterns, these methods selectively illustrate one of the patterns; otherwise, the visualization result will be chaotic. Similarly, [Bau et al.2017] selectively analyzed the semantics among the highest 0.5% activations of each filter. In contrast, our method provides a solution for explaining both strong and weak activations of each filter and discovering all possible patterns from each filter.
Some studies go beyond passive visualization and actively retrieve units from CNNs for different applications. Like middle-level feature extraction [Singh, Gupta, and Efros2012], pattern retrieval mainly learns mid-level representations of CNN knowledge. Zhou et al. [Zhou et al.2015, Zhou et al.2016] selected units from feature maps to describe “scenes”. Simon et al. discovered objects from feature maps of unlabeled images [Simon and Rodner2015], and selected a filter to describe each part in a supervised fashion [Simon, Rodner, and Denzler2014]. However, most methods simply assumed that each filter mainly encoded a single visual concept, and ignored the case that a filter in high conv-layers encoded a mixture of patterns. [Zhang et al.2017a, Zhang et al.2017c, Zhang et al.2017b] extracted certain neurons from a filter’s feature map to describe an object part in a weakly-supervised manner (e.g. learning from active question answering and human interactions).
In this study, the explanatory graph disentangles patterns of different parts in the CNN without the need for part annotations. Compared to raw feature maps, patterns in graph nodes are more interpretable.
Weakly-supervised knowledge transferring
Knowledge transferring ideas have been widely used in deep learning. Typical research includes end-to-end fine-tuning and transferring CNN knowledge between different categories [Yosinski et al.2014] or different datasets [Ganin and Lempitsky2015]. In contrast, we believe that a transparent representation of part knowledge will create a new possibility of transferring part knowledge to other applications. Therefore, we build an explanatory graph to represent part patterns hidden inside a CNN, which enables transferring part patterns to other tasks. Experiments have demonstrated the efficiency of our method in multi-shot part localization.
Intuitive understanding of the pattern hierarchy
As shown in Fig. 2, the feature map of a filter can usually be activated by different object parts in various locations. Let us assume that a feature map is activated with peaks. Some peaks represent common parts of the object, and we call such activation peaks part patterns, whereas other peaks may correspond to background noise.
Our task is to discover activation peaks of part patterns out of noisy peaks from a filter’s feature map. We assume that if a peak corresponds to an object part, then some patterns of other filters must be activated in similar map positions, and vice versa. These patterns represent sub-regions of the same part and keep certain spatial relationships. Thus, in the explanatory graph, we connect each pattern in a low conv-layer to some patterns in the neighboring upper conv-layer. We mine part patterns layer by layer. Given patterns mined from the upper conv-layer, we select the activation peaks that keep stable spatial relationships with specific upper-layer patterns among different images as part patterns in the current conv-layer.
As shown in Fig. 2, patterns in high conv-layers usually represent large-scale object parts, whereas patterns in low conv-layers mainly describe relatively simple shapes, which are less distinguishable in semantics. We use high-layer patterns to filter out noise and disentangle low-layer patterns. From another perspective, we can regard low-layer patterns as components of high-layer patterns.
Notations: We are given a CNN pre-trained using its own set of training samples . Let denote the target explanatory graph. contains several layers, which correspond to conv-layers in the CNN. We disentangle the -th filter of the -th conv-layer into different part patterns, which are modeled as a set of nodes in the -th layer of , denoted by . denotes the node set for the -th filter. Parameters of these nodes in the -th layer are given as , which mainly encode spatial relationships between these nodes and the nodes in the -th layer.
Given a training image , the CNN generates a feature map of the -th conv-layer, denoted by . Then, for each node , we can use the explanatory graph to infer whether ’s part pattern appears on the -th channel of , as well as the position of the part pattern (if the pattern appears). We use to represent position inference results for all nodes in the -th layer.
Objective function: We build the explanatory graph in a top-down manner. Given all training samples , we first disentangle patterns from the top conv-layer of the CNN, and build the top graph layer. Then, we use inference results of the patterns/nodes on the top layer to help disentangle patterns from the neighboring lower conv-layer. In this way, the construction of is implemented layer by layer. Given inference results for the -th layer , we expect all patterns to simultaneously 1) be well fit to and 2) keep consistent spatial relationships with upper-layer patterns among different images. The objective of learning for the -th layer is given as
I.e. we learn node parameters that best fit feature maps of training images.
Let us focus on a conv-layer’s feature map of image . Without ambiguity, we ignore the superscript to simplify notations in the following paragraphs. We can regard as a distribution of “neural activation entities.” We consider the neural response of each unit as the number of “activation entities.” In other words, each unit is localized at the position of in the -th channel of . We use to denote the number of activation entities at the position , where is the normalized response value of ; is a constant. (Footnote 2: To make unit positions in different conv-layers comparable with each other (e.g. in Eq. 4), we project the position of unit to the image plane. We define the coordinate on the image plane, instead of on the feature-map plane.)
Therefore, just like a Gaussian mixture model, we use all patterns in as a mixture model to jointly explain the distribution of activation entities on the -th channel of :
where we consider each node as a hidden variable or an alternative component in the mixture model to describe activation entities.
is a constant prior probability. measures the compatibility of using node to describe an activation entity at . In addition, because noisy activations cannot be explained by any patterns, we add a dummy component to the mixture model for noisy activations. Thus, the compatibility between and is computed based on the spatial relationships between and other nodes in , which is roughly formulated as
In the above equations, node has a set of neighboring patterns in the upper layer, denoted by , which is determined during the learning process. The overall compatibility is divided into the spatial compatibility between node and each neighboring node , . , denotes the position inference result of , which has been provided. is a constant for normalization. is a constant to roughly ensure , which can be eliminated during the learning process.
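To make the role of the dummy component concrete, the following Python sketch computes the posterior probability of each mixture component for a single activation entity. It is only an illustration under stated assumptions: the function name `responsibilities`, its arguments, and the numeric values are hypothetical, and the compatibility values are taken as given inputs rather than derived from the model's equations.

```python
def responsibilities(compat, priors, noise_compat, noise_prior):
    """Posterior over mixture components for one activation entity.

    compat[i]   : compatibility of pattern node i with the entity's position
    priors[i]   : prior probability of node i (treated as a constant in the text)
    noise_compat, noise_prior : the same two quantities for the dummy component
    Returns (node_posteriors, noise_posterior); together they sum to 1.
    """
    weighted = [p * c for p, c in zip(priors, compat)]
    noise = noise_prior * noise_compat
    total = sum(weighted) + noise
    if total <= 0.0:
        raise ValueError("all components have zero weight")
    return [w / total for w in weighted], noise / total


# Hypothetical example: two pattern nodes plus the dummy component, equal priors.
node_post, noise_post = responsibilities(
    compat=[0.6, 0.1], priors=[1 / 3, 1 / 3], noise_compat=0.3, noise_prior=1 / 3
)
```

An entity that no pattern explains well receives most of its posterior mass from the dummy component, which is how noisy activations can be separated from part patterns in this kind of mixture.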
As shown in Fig. 3, an intuitive idea is that the relative displacement between and should not change a lot among different images. Let and denote the prior positions of and , respectively. Then, will approximate , if node can well fit activation entities at . Therefore, given and , we assume the spatial relationship between and follows a Gaussian distribution in Eqn. 4, where denotes the prior position of given . denotes the variation, which can be estimated from data. (Footnote 3: We can prove that for each , , where ; . Therefore, we can either directly use as , or compute the variation of w.r.t. different images to obtain .)
In this way, the core of learning is to determine an optimal set of neighboring patterns and estimate the prior position . Note that our method only models the relative displacement .
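As a rough illustration of a Gaussian spatial relationship that depends only on relative displacement, the sketch below evaluates an isotropic 2-D Gaussian density centered at the upper-layer pattern's inferred position plus a mean displacement. The function name, the argument names, and the isotropic form are assumptions made for this example, not the exact formulation of Eqn. 4.

```python
import math


def spatial_compatibility(p_x, p_upper, mean_disp, sigma):
    """Isotropic 2-D Gaussian density of p_x around its expected position.

    p_x       : (x, y) of the activation entity on the image plane
    p_upper   : (x, y) inferred position of the upper-layer neighbor
    mean_disp : (dx, dy) typical displacement from the upper node to this node
    sigma     : standard deviation of the displacement (assumed isotropic)
    """
    # Expected position of the lower-layer pattern given the upper-layer one.
    mu_x = p_upper[0] + mean_disp[0]
    mu_y = p_upper[1] + mean_disp[1]
    sq_dist = (p_x[0] - mu_x) ** 2 + (p_x[1] - mu_y) ** 2
    return math.exp(-sq_dist / (2.0 * sigma ** 2)) / (2.0 * math.pi * sigma ** 2)


# Shifting both positions by the same offset leaves the value unchanged,
# because only the relative displacement enters the density.
a = spatial_compatibility((13, 16), (4, 4), (6, 8), sigma=2.0)
b = spatial_compatibility((113, 66), (104, 54), (6, 8), sigma=2.0)
```

Here `a` and `b` are equal, which mirrors the statement that the method models only the relative displacement and not absolute positions.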
Inference of pattern positions: Given the -th filter’s feature map, we simply assign node with a certain unit on the feature map as the true inference of , where denotes the score of assigning to . Accordingly, represents the inferred position of . In particular, in Eqn. (1), we define .
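The inference step can be sketched as an argmax over the units of one feature map. The snippet assumes that the score of assigning a node to a unit is the product of the unit's activation and the node's compatibility at that unit's position; this scoring rule, the function name `infer_position`, and the tuple layout are assumptions for illustration only.

```python
def infer_position(units):
    """Pick the unit with the highest score for one node.

    units: list of (position, activation, compatibility) tuples. The score of a
    unit is taken to be activation * compatibility (an assumption).
    Returns (best_position, best_score).
    """
    if not units:
        raise ValueError("feature map has no units")
    best = max(units, key=lambda u: u[1] * u[2])
    return best[0], best[1] * best[2]


# Hypothetical units: the strongest activation is not selected, because it lies
# far from where the upper-layer patterns predict the part to be.
position, score = infer_position([
    ((8, 8), 0.9, 0.01),
    ((24, 8), 0.5, 0.20),
    ((8, 24), 0.2, 0.30),
])
```

In this toy case the selected unit is the one at `(24, 8)`, showing how spatial compatibility can override a stronger but badly placed activation.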
Top-down EM-based Learning:
For each node , we need to learn the parameter and a set of patterns in the upper layer that are related to , . We learn the model in a top-down manner. We first learn nodes in the top-layer of , and then learn for the neighboring lower layer. For the sub-graph in the -th layer, we iteratively estimate parameters of and for nodes in the sub-graph. We can use the Expectation-Maximization (EM) algorithm for learning. Please see Algorithm 1 for details.
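For intuition about the parameter-estimation half of such an EM loop, the sketch below shows the standard M-step update for a Gaussian mean: a responsibility-weighted average of observed displacements. It is not Algorithm 1. It omits the selection of neighboring upper-layer patterns entirely, and the function name and data layout are hypothetical.

```python
def update_mean_displacement(samples):
    """Responsibility-weighted mean of observed displacements (Gaussian M-step).

    samples: list of ((dx, dy), weight), where (dx, dy) is the displacement from
    an upper-layer pattern to an activation entity, and weight is the entity
    count times the node's posterior responsibility from the E-step.
    """
    total = sum(w for _, w in samples)
    if total <= 0.0:
        raise ValueError("no weight assigned to this node")
    mean_dx = sum(d[0] * w for d, w in samples) / total
    mean_dy = sum(d[1] * w for d, w in samples) / total
    return mean_dx, mean_dy


# Hypothetical displacements gathered over several images.
mean_disp = update_mean_displacement([((4, 0), 1.0), ((6, 2), 3.0)])
```

Alternating an update of this kind with a recomputation of the responsibilities is the usual EM pattern; entities that the node explains poorly contribute little to the new mean.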
Overview of experiments
Four types of CNNs: To demonstrate the broad applicability of our method, we applied our method to four types of CNNs, i.e. the VGG-16 [Simonyan and Zisserman2015], the 50-layer and 152-layer Residual Networks [He et al.2016], and the encoder of the VAE-GAN [Larsen, Sønderby, and Winther2016].
Three experiments and thirteen baselines: We designed three experiments to evaluate the explanatory graph. The first experiment is to visualize patterns in the graph. The second experiment is to evaluate the semantic interpretability of the part patterns, i.e. checking whether a pattern consistently represents the same object region among different images. We compared our patterns with three types of middle-level features and neural patterns. The third experiment is multi-shot learning for part localization, in order to test the transferability of patterns in the graph. In this experiment, we associated part patterns with explicit part names for part localization. We compared our method with ten baselines.
Three benchmark datasets: We built explanatory graphs to describe CNNs learned using a total of 37 animal categories in three datasets: the ILSVRC 2013 DET Animal-Part dataset [Zhang et al.2017a], the CUB200-2011 dataset [Wah et al.2011], and the Pascal VOC Part dataset [Chen et al.2014]. As discussed in [Chen et al.2014, Zhang et al.2017a], animals usually contain non-rigid parts, which presents a key challenge for part localization. Thus, we selected animal categories in the three datasets for testing.
We first trained/fine-tuned a CNN using object images of a category, which were cropped using object bounding boxes. Then, we learned an explanatory graph to represent patterns of the category hidden inside the CNN. We set parameters , , , and .
VGG-16: Given a VGG-16 that was pre-trained using the 1.3M images in the ImageNet dataset [Deng et al.2009], we fine-tuned all conv-layers of the VGG-16 using object images in a category. The loss for fine-tuning was that for classification between the target category and background images. In each VGG-16, there are thirteen conv-layers and three fully connected layers. We selected the ninth, tenth, twelfth, and thirteenth conv-layers of the VGG-16 as four valid conv-layers, and accordingly built a four-layer graph. We extracted patterns from the -th filter of the -th layer. We set and .
Residual Networks: We chose two residual networks, i.e. the 50-layer and 152-layer ones. The fine-tuning process for each network was exactly the same as that for VGG-16. We built a three-layer graph based on each residual network by selecting the last conv-layer with a feature output, the last conv-layer with a feature map, and the last conv-layer with a feature map as valid conv-layers. We set , , and .
VAE-GAN: For each category, we used the cropped object images in the category to train a VAE-GAN. We learned a three-layer graph based on the three conv-layers of the encoder of the VAE-GAN. We set , , and .
Experiment 1: pattern visualization
Given an explanatory graph for a VGG-16 network, we visualize its structure in Fig. 4. Part patterns in the graph are visualized in the following three ways.
Top-ranked patches: We performed pattern inference on all object images. For each image , we extracted an image patch at the position of with a fixed scale of to represent pattern . Fig. 5 shows a pattern’s image patches that had the highest inference scores. (Footnote 4: We projected the unit to the image to compute its position.)
Heat maps of patterns: Given a cropped object image , we used the explanatory graph to infer its patterns on image , and drew heat maps to show the spatial distribution of the inferred patterns. We drew a heat map for each layer of the graph. Given inference results of patterns in the -th layer, we drew each pattern as a weighted Gaussian distribution on the heat map, where . Please see Fig. 6 for heat maps of the top-50% patterns with the highest scores of .
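A heat map of this kind can be sketched as a sum of weighted Gaussian bumps, one per inferred pattern. In the snippet below the grid size, the bandwidth `sigma`, and the use of an inference score as the weight are hypothetical choices for illustration, since the exact Gaussian parameters are not recoverable from the text.

```python
import math


def pattern_heat_map(height, width, patterns, sigma):
    """Sum one weighted Gaussian bump per inferred pattern.

    patterns: list of ((row, col), weight) giving each pattern's inferred position
    on the image plane and its weight (for example, its inference score).
    """
    heat = [[0.0] * width for _ in range(height)]
    for (r0, c0), weight in patterns:
        for r in range(height):
            for c in range(width):
                sq_dist = (r - r0) ** 2 + (c - c0) ** 2
                heat[r][c] += weight * math.exp(-sq_dist / (2.0 * sigma ** 2))
    return heat


# Two hypothetical patterns on a 5x5 grid; the first has twice the weight.
heat = pattern_heat_map(5, 5, [((1, 1), 2.0), ((3, 3), 1.0)], sigma=1.0)
```

Restricting `patterns` to the highest-scoring half before calling the function would correspond to showing only the top-50% patterns.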
Pattern-based image synthesis: We used the up-convolutional network [Dosovitskiy and Brox2016] to visualize the learned patterns. Up-convolutional networks were originally trained for image reconstruction. In this study, given an image’s feature maps corresponding to the second graph layer, we estimated the appearance of the original image. Given an object image , we used the explanatory graph for pattern inference, i.e. assigning each pattern to a certain neural unit as its position inference. We considered the top-10% patterns with the highest scores of as valid ones. We filtered out all neural responses of units that were not assigned to valid patterns from feature maps (setting these responses to zero). We then used [Dosovitskiy and Brox2016] to synthesize the appearance corresponding to the modified feature maps. We regard the image synthesis in Fig. 7 as the visualization of the inferred patterns.
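The filtering step before synthesis can be sketched as follows. The snippet stores a feature map as a dictionary keyed by `(channel, row, col)`, which is a simplification chosen for readability; the function name and the data layout are hypothetical, and the up-convolutional network itself is not reproduced.

```python
def keep_valid_units(feature_map, assignments, keep_fraction=0.1):
    """Zero every response except the units assigned to top-scoring patterns.

    feature_map   : dict mapping (channel, row, col) -> response
    assignments   : list of (unit_key, score), one entry per pattern
    keep_fraction : fraction of patterns treated as valid (0.1 mirrors top-10%)
    """
    n_keep = max(1, int(len(assignments) * keep_fraction))
    ranked = sorted(assignments, key=lambda a: a[1], reverse=True)
    valid_units = {unit for unit, _ in ranked[:n_keep]}
    return {k: (v if k in valid_units else 0.0) for k, v in feature_map.items()}


# Hypothetical feature map with four units and four patterns; keep the top half.
fmap = {(0, 0, 0): 0.9, (0, 1, 2): 0.4, (1, 2, 2): 0.7, (1, 3, 1): 0.2}
assigned = [((0, 0, 0), 0.05), ((0, 1, 2), 0.60), ((1, 2, 2), 0.30), ((1, 3, 1), 0.90)]
masked = keep_valid_units(fmap, assigned, keep_fraction=0.5)
```

Note that a unit with a strong raw response, such as `(0, 0, 0)` here, is still zeroed when the pattern assigned to it has a low inference score.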
Experiment 2: semantic interpretability of patterns
In this experiment, we tested whether each pattern in an explanatory graph consistently represented the same object region among different images. We learned four explanatory graphs for a VGG-16 network, two residual networks, and a VAE-GAN that were trained/fine-tuned using the CUB200-2011 dataset [Wah et al.2011]. We used two methods to evaluate the semantic interpretability of patterns, as follows.
Part interpretability of patterns: We mainly extracted patterns from high conv-layers, and as discussed in [Bau et al.2017], high conv-layers contain large-scale part patterns. We were inspired by Zhou et al. [Zhou et al.2015] and measured the interpretability of part patterns. For the pattern of a given node , we asked human raters to manually evaluate the pattern’s interpretability. When we used to make inferences among all images, we regarded inference results with the top- inference scores among all images as valid representations of . We required the highest inference scores on images to take about 30% of the inference energy, i.e. (we use this equation to compute ). As shown in Fig. 8, we asked human raters how many inference results among the top described the same object part, in order to compute the purity of part semantics of pattern .
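The selection of valid inference results by an energy threshold can be sketched as below. The snippet assumes that the cut-off is the smallest number of top-scoring results whose scores sum to at least 30% of the total; this reading of the criterion, the function names, and the example values are assumptions.

```python
def top_k_for_energy(scores, energy_fraction=0.3):
    """Smallest K such that the K highest scores hold the requested share of the total."""
    ranked = sorted(scores, reverse=True)
    total = sum(ranked)
    if total <= 0.0:
        return 0
    running = 0.0
    for k, score in enumerate(ranked, start=1):
        running += score
        if running >= energy_fraction * total:
            return k
    return len(ranked)


def semantic_purity(num_consistent, k):
    """Fraction of the top-K inference results that raters assign to the same part."""
    return num_consistent / k if k > 0 else 0.0


# Hypothetical inference scores of one node over five images.
k = top_k_for_energy([2.0, 2.0, 2.0, 2.0, 2.0])
purity = semantic_purity(num_consistent=1, k=k)
```

With a peaked score distribution, K becomes small and only the most confident inferences are rated; with a flat distribution, more results are needed to reach the same share of energy.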
The table in Fig. 8 (top-left) shows the semantic purity of the patterns in the second layer of the graph. Let the second graph layer correspond to the -th conv-layer with filters. Like in [Zhou et al.2015], the raw filter maps baseline used activated neurons in the feature map of a filter to describe a part. The raw filter peaks baseline considered the highest peak on a filter’s feature map as a part detection. Like our method, the two baselines only visualized top- part inferences (the feature maps’ neural activations took 30% of activation energies among all images). We back-propagated the center of the receptive field of each neural activation to the image plane and simply used a fixed radius to draw the image region corresponding to each neural activation. Fig. 8 compares the image region corresponding to each node in the explanatory graph and image regions corresponding to feature maps of each filter. Our graph nodes encoded much more meaningful part representations than raw filters.
Because the baselines simply averaged the semantic purity among the filters, for a fair comparison, we also computed average semantic purities using the top- nodes, each node having the highest scores of .
|Raw filter [Zhou et al.2015]||0.1328||0.1346||0.1398||0.1944|
|[Singh, Gupta, and Efros2012]||0.1341|
|[Simon, Rodner, and Denzler2014]||0.2291|
Location instability of inference positions: We also defined the location instability of inference positions for each pattern as an alternative evaluation of pattern interpretability. We assumed that if a pattern was always triggered by the same object part through different images, then the distance between the pattern’s inference position and a ground-truth landmark of the object part should not change a lot among various images.
As shown in Fig. 9, for each testing image , we computed the distances between the inferred position of and ground-truth landmark positions of head, back, and tail parts, denoted by , , and . We normalized these distances by the diagonal length of input images. Then, we computed as the location instability of the node for evaluation, where denotes the variation of among different images.
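The location instability metric can be sketched as follows. The snippet assumes that the per-landmark "variation" is the standard deviation of the normalized distance across images and that the final value is the average over the landmarks; it also uses a single image diagonal for all images. These choices, and all names, are assumptions for illustration.

```python
import math


def location_instability(inferred, landmarks, diagonal):
    """Average over landmarks of the std. dev. of the normalized distance.

    inferred  : list of (x, y) inferred pattern positions, one per image
    landmarks : dict name -> list of (x, y) ground-truth positions, one per image
    diagonal  : diagonal length of the input images, used for normalization
    """
    deviations = []
    for positions in landmarks.values():
        dists = [math.dist(p, q) / diagonal for p, q in zip(inferred, positions)]
        mean = sum(dists) / len(dists)
        variance = sum((d - mean) ** 2 for d in dists) / len(dists)
        deviations.append(math.sqrt(variance))
    return sum(deviations) / len(deviations)


# A pattern that keeps a fixed offset from the head landmark in both images.
stable = location_instability(
    inferred=[(10, 10), (20, 30)],
    landmarks={"head": [(13, 14), (23, 34)]},
    diagonal=100.0,
)
```

A pattern that always fires at the same place relative to the landmarks yields a value of zero, whereas a pattern that jumps between parts yields a larger value.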
Given an explanatory graph, we compared its location instability with three baselines. In the first baseline, we treated each filter in a CNN as a detector of a certain pattern. Thus, given the feature map of a filter (after the ReLU operation), we used the method of [Zhou et al.2015] to localize the unit with the highest response value as the pattern position. The other two baselines were typical methods to extract middle-level features from images [Singh, Gupta, and Efros2012] and extract patterns from CNNs [Simon, Rodner, and Denzler2014], respectively. For each baseline, we chose the top-500 patterns (i.e. 500 nodes with top scores in our explanatory graph, 500 filters with strongest activations in the CNN, and the top-500 middle-level features). For each pattern, we selected position inferences on the top-20 images with highest scores to compute the instability of its inferred positions. Table 1 compares the location instability of the patterns learned by different baselines, and our method exhibited significantly lower location instability.
Experiment 3: multi-shot part localization
|SS-DPM-Part [Azizpour and Laptev2012]||N||0.3469|
|PL-DPM-Part [Li et al.2013]||N||0.3412|
|Part-Graph [Chen et al.2014]||N||0.4889|
|CNN-PDD [Simon, Rodner, and Denzler2014]||N||0.2333|
|CNN-PDD-ft [Simon, Rodner, and Denzler2014]||Y||0.3269|
|Fast-RCNN (1 ft) [Girshick2015]||N||0.4517|
|Fast-RCNN (2 fts) [Girshick2015]||Y||0.4131|
|SS-DPM-Part [Azizpour and Laptev2012]||N||0.356||0.270||0.264||0.242||0.262||0.286||0.280|
|PL-DPM-Part [Li et al.2013]||N||0.294||0.328||0.282||0.312||0.321||0.840||0.396|
|Part-Graph [Chen et al.2014]||N||0.360||0.208||0.263||0.205||0.386||0.500||0.320|
|CNN-PDD [Simon, Rodner, and Denzler2014]||N||0.301||0.246||0.220||0.248||0.292||0.254||0.260|
|CNN-PDD-ft [Simon, Rodner, and Denzler2014]||Y||0.358||0.268||0.220||0.200||0.302||0.269||0.269|
|Fast-RCNN (1 ft) [Girshick2015]||N||0.324||0.324||0.325||0.272||0.347||0.314||0.318|
|Fast-RCNN (2 fts) [Girshick2015]||Y||0.350||0.295||0.255||0.293||0.367||0.260||0.303|
|Fast-RCNN (1 ft)||N||0.261||0.365||0.265||0.310||0.353||0.365||0.289||0.363||0.255||0.319||0.251||0.260||0.317||0.255||0.255||0.169|
|Fast-RCNN (2 fts)||Y||0.340||0.351||0.388||0.327||0.411||0.119||0.330||0.368||0.206||0.170||0.144||0.160||0.230||0.230||0.178||0.205|
|Fast-RCNN (1 ft)||N||0.374||0.322||0.285||0.265||0.320||0.277||0.255||0.351||0.340||0.324||0.334||0.256||0.336||0.274||0.299|
|Fast-RCNN (2 fts)||Y||0.346||0.303||0.212||0.223||0.228||0.195||0.175||0.247||0.280||0.319||0.193||0.125||0.213||0.160||0.246|
And-Or graph for semantic parts
The explanatory graph makes it possible to transfer middle-layer patterns from CNNs to semantic object parts. In order to test the transferability of patterns, we built an additional And-Or graph (AOG) to associate certain implicit patterns with an explicit part name, in the scenario of multi-shot learning. We used the AOG to localize semantic parts of objects for evaluation. The structure of the AOG is inspired by [Zhang, Wu, and Zhu2017], and the learning of the AOG was originally proposed in [Zhang et al.2017a]. We briefly introduce the AOG in [Zhang et al.2017a] as follows.
As shown in Fig. 10, like the hierarchical model in [Li and Hua2015], the AOG encodes a four-layer hierarchy for each semantic part, i.e. the semantic part (OR node), part templates (AND node), latent patterns (OR nodes, those from the explanatory graph), and neural units (terminal nodes). In the AOG, each OR node (e.g. a semantic part or a latent pattern) contains a list of alternative appearance (or deformation) candidates. Each AND node (e.g. a part template) uses a number of latent patterns to describe its compositional regions.
1) The OR node of a semantic part contains a total of part templates to represent alternative appearance or pose candidates of the part. 2) Each part template (AND node) retrieves patterns from the explanatory graph as children. These patterns describe compositional regions of the part. 3) Each latent pattern (OR node) has all units in its corresponding filter’s feature map as children, which represent its deformation candidates on image .
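The four-layer hierarchy can be illustrated with a toy scoring routine that follows the common And-Or convention: an OR node takes the best of its children and an AND node sums its children. This is a simplified sketch, not the inference procedure of [Zhang et al.2017a]. The function name, the nested-list layout, and the absence of any deformation term are assumptions.

```python
def part_score(templates):
    """Score a semantic part with OR = max over children and AND = sum over children.

    templates: list of part templates; each template is a list of latent patterns;
    each latent pattern is a list of unit scores (its deformation candidates).
    """
    template_scores = []
    for patterns in templates:
        # Latent pattern (OR node): choose its best-scoring neural unit.
        pattern_scores = [max(units) for units in patterns]
        # Part template (AND node): combine all of its latent patterns.
        template_scores.append(sum(pattern_scores))
    # Semantic part (OR node): choose the best part template.
    return max(template_scores)


# Two hypothetical templates, each with two latent patterns.
score = part_score([
    [[0.1, 0.4], [0.3]],
    [[0.2], [0.2, 0.9]],
])
```

The second template wins in this example because its latent patterns find better-matching units, which corresponds to selecting one appearance or pose candidate for the part.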
Experimental settings of three-shot learning
We learned the explanatory graph based on a fine-tuned VGG-16 network [Simonyan and Zisserman2015] and built the AOG following the scenario of multi-shot learning introduced in [Zhang et al.2017a]. For each category, we set three templates for the head part (), and used a single part-box annotation for each template. We set to learn AOGs for categories in the ILSVRC Animal-Part and CUB200 datasets and set for Pascal VOC Part categories. Then, we used the AOGs to localize semantic parts on objects. Note that we used object images without part annotations to learn the explanatory graph and we used three part annotations provided by [Zhang et al.2017a] to build the AOG. All these training samples were equally provided to all baselines for learning (besides part annotations, all baselines also used object annotations contained in the datasets for learning).
We compared AOGs with a total of ten baselines in part localization. The baselines included 1) state-of-the-art algorithms for object detection (i.e. directly detecting target parts from objects), 2) graphical/part models for part localization, and 3) the methods selecting CNN patterns to describe object parts.
The first baseline was the standard fast-RCNN [Girshick2015], namely Fast-RCNN (1 ft), which directly fine-tuned a VGG-16 network based on part annotations. The second baseline, namely Fast-RCNN (2 fts), first used massive object-box annotations in the target category to fine-tune the VGG-16 network with the loss of object detection. Then, given part annotations, Fast-RCNN (2 fts) further fine-tuned the VGG-16 to detect object parts. We used [Simon, Rodner, and Denzler2014] as the third baseline, namely CNN-PDD. CNN-PDD selected certain filters of a CNN to localize the target part. In CNN-PDD, the CNN was pre-trained using the ImageNet dataset [Deng et al.2009]. Similar to Fast-RCNN (2 fts), we extended [Simon, Rodner, and Denzler2014] as the fourth baseline CNN-PDD-ft, which fine-tuned a VGG-16 network using object-box annotations before applying the technique of [Simon, Rodner, and Denzler2014]. The fifth and sixth baselines were DPM-related methods, i.e. the strongly supervised DPM (SS-DPM-Part) [Azizpour and Laptev2012] and the technique in [Li et al.2013] (PL-DPM-Part), respectively. The seventh baseline, namely Part-Graph, used a graphical model for part localization [Chen et al.2014]. For weakly supervised learning, “simple” methods are usually insensitive to model over-fitting. Thus, we designed two baselines as follows. First, we used object-box annotations in a category to fine-tune the VGG-16 network. Then, given a few well-cropped object images, we used the selective search [Uijlings et al.2013] to collect image patches, and used the VGG-16 network to extract fc7 features from these patches. The baseline fc7+linearSVM used a linear SVM to detect the target part. The other baseline fc7+sp+linearSVM combined both the fc7 feature and the spatial position () of each image patch as features for part detection. The last competing method was weakly supervised mining of part patterns from CNNs [Zhang et al.2017a], namely supervised-AOG. Unlike our method (unsupervised), supervised-AOG used part annotations to extract part patterns.
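As an illustration of the fc7+sp+linearSVM baseline, the following minimal sketch shows how a patch feature might be assembled and scored. The encoding of the spatial position (patch center normalized by the object box), the helper names, and the given weights are all assumptions made for illustration; the text does not specify them, and training of the linear SVM is omitted.

```python
from typing import List, Sequence, Tuple

# Boxes are (x_min, y_min, x_max, y_max); this layout is an assumption.
Box = Tuple[float, float, float, float]


def patch_position(patch: Box, obj: Box) -> Tuple[float, float]:
    """Center of the patch, normalized to [0, 1] inside the object box.

    Hypothetical encoding of the "spatial position" of a patch.
    """
    cx = (patch[0] + patch[2]) / 2.0
    cy = (patch[1] + patch[3]) / 2.0
    width = obj[2] - obj[0]
    height = obj[3] - obj[1]
    return ((cx - obj[0]) / width, (cy - obj[1]) / height)


def fc7_sp_feature(fc7: Sequence[float], patch: Box, obj: Box) -> List[float]:
    """Concatenate the fc7 feature with the 2-D spatial position."""
    return list(fc7) + list(patch_position(patch, obj))


def linear_score(weights: Sequence[float], bias: float,
                 feature: Sequence[float]) -> float:
    """Decision value w . x + b of an already trained linear classifier."""
    return sum(w * f for w, f in zip(weights, feature)) + bias


def detect_part(weights: Sequence[float], bias: float,
                fc7_list: Sequence[Sequence[float]],
                patches: Sequence[Box], obj: Box) -> int:
    """Return the index of the highest-scoring patch."""
    scores = [
        linear_score(weights, bias, fc7_sp_feature(f, p, obj))
        for f, p in zip(fc7_list, patches)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

Dropping the two position entries from the feature gives the corresponding sketch for fc7+linearSVM.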
| Dataset | ILSVRC DET Animal | Pascal VOC Part | CUB200-2011 |
To enable a fair comparison, we classify all baselines into three groups, i.e. no representation learning (no-RL), unsupervised representation learning (unsup-RL), and supervised representation learning (sup-RL). The no-RL group includes conventional methods without using deep features, such as SS-DPM-Part, PL-DPM-Part, and Part-Graph. Sup-RL methods are Fast-RCNN (1 ft), Fast-RCNN (2 fts), CNN-PDD, CNN-PDD-ft, supervised-AOG, fc7+linearSVM, and fc7+sp+linearSVM. Fast-RCNN methods used part annotations to learn features. Supervised-AOG used part annotations to select filters from CNNs to localize parts. Unsup-RL methods include CNN-PDD, CNN-PDD-ft, and our method. These methods did not use part annotations, and only used object boxes for learning/selection. (Representation learning in these methods only used object-box annotations, which are independent of part annotations. A few part annotations were used to select off-the-shelf pre-trained features.)
We use the normalized distance to evaluate localization accuracy, which has been used in [Zhang et al.2017a, Simon, Rodner, and Denzler2014] as a standard metric. Tables 2, 3, and 4 show part-localization results on the CUB200-2011 dataset [Wah et al.2011], the Pascal VOC Part dataset [Chen et al.2014], and the ILSVRC 2013 DET Animal-Part dataset [Zhang et al.2017a], respectively. Table 5 compares the unsupervised and supervised learning of neural patterns. In the experiment, the AOG outperformed all baselines, even methods that learned part features in a supervised manner.
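For concreteness, a minimal sketch of a normalized-distance computation is given below. It assumes that the metric is the Euclidean distance between the predicted and ground-truth part centers, divided by the diagonal length of the object bounding box. This normalization and the function names are assumptions made for illustration and are not taken from the text above.

```python
import math
from typing import Sequence, Tuple

# Boxes are (x_min, y_min, x_max, y_max); this layout is an assumption.
Box = Tuple[float, float, float, float]


def box_center(box: Box) -> Tuple[float, float]:
    """Center point of a box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def normalized_distance(pred_part: Box, gt_part: Box, obj: Box) -> float:
    """Center-to-center distance divided by the object-box diagonal.

    Assumed normalization; smaller values mean better localization.
    """
    px, py = box_center(pred_part)
    gx, gy = box_center(gt_part)
    diagonal = math.hypot(obj[2] - obj[0], obj[3] - obj[1])
    return math.hypot(px - gx, py - gy) / diagonal


def mean_normalized_distance(preds: Sequence[Box], gts: Sequence[Box],
                             objs: Sequence[Box]) -> float:
    """Average the per-image normalized distance over a test set."""
    values = [normalized_distance(p, g, o)
              for p, g, o in zip(preds, gts, objs)]
    return sum(values) / len(values)
```

Under this assumption, a prediction whose center is 5 pixels from the ground-truth center on an object box with a 50-pixel diagonal has a normalized distance of 0.1.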
Conclusion and discussions
In this paper, we proposed a simple yet effective method to learn an explanatory graph that reveals knowledge hierarchy inside conv-layers of a pre-trained CNN (e.g. a VGG-16, a residual network, or a VAE-GAN). We regard the graph as a concise and meaningful representation, which 1) filters out noisy activations, 2) disentangles reliable part patterns from each filter of the CNN, and 3) encodes co-activation logics and spatial relationships between patterns. Experiments showed that our patterns had significantly higher stability than baselines.
The explanatory graph’s transparent representation makes it feasible to transfer CNN patterns to object parts. Part-localization experiments demonstrated good transferability: our method even outperformed supervised learning of part representations. Nevertheless, the explanatory graph is still a rough representation of the CNN, rather than an accurate reconstruction of the CNN knowledge.
This work is supported by ONR MURI project N00014-16-1-2007, DARPA XAI Award N66001-17-2-4029, and NSF IIS 1423305.
- [Aubry and Russell2015] Aubry, M., and Russell, B. C. 2015. Understanding deep features with computer-generated imagery. In ICCV.
- [Azizpour and Laptev2012] Azizpour, H., and Laptev, I. 2012. Object detection using strongly-supervised deformable part models. In ECCV.
- [Bau et al.2017] Bau, D.; Zhou, B.; Khosla, A.; Oliva, A.; and Torralba, A. 2017. Network dissection: Quantifying interpretability of deep visual representations. In CVPR.
- [Chen et al.2014] Chen, X.; Mottaghi, R.; Liu, X.; Fidler, S.; Urtasun, R.; and Yuille, A. 2014. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR.
- [Deng et al.2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In CVPR.
- [Dosovitskiy and Brox2016] Dosovitskiy, A., and Brox, T. 2016. Inverting visual representations with convolutional networks. In CVPR.
- [Ganin and Lempitsky2015] Ganin, Y., and Lempitsky, V. 2015. Unsupervised domain adaptation by backpropagation. In ICML.
- [Girshick2015] Girshick, R. 2015. Fast r-cnn. In ICCV.
- [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
- [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. 2012. Imagenet classification with deep convolutional neural networks. In NIPS.
- [Larsen, Sønderby, and Winther2016] Larsen, A. B. L.; Sønderby, S. K.; and Winther, O. 2016. Autoencoding beyond pixels using a learned similarity metric. In ICML.
- [LeCun et al.1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. In Proceedings of the IEEE.
- [Li and Hua2015] Li, H., and Hua, G. 2015. Hierarchical-pep model for real-world face recognition. In CVPR.
- [Li et al.2013] Li, B.; Hu, W.; Wu, T.; and Zhu, S.-C. 2013. Modeling occlusion by discriminative and-or structures. In ICCV.
- [Li et al.2015] Li, H.; Lin, Z.; Brandt, J.; Shen, X.; and Hua, G. 2015. A convolutional neural network cascade for face detection. In CVPR.
- [Mahendran and Vedaldi2015] Mahendran, A., and Vedaldi, A. 2015. Understanding deep image representations by inverting them. In CVPR.
- [Simon and Rodner2015] Simon, M., and Rodner, E. 2015. Neural activation constellations: Unsupervised part model discovery with convolutional networks. In ICCV.
- [Simon, Rodner, and Denzler2014] Simon, M.; Rodner, E.; and Denzler, J. 2014. Part detector discovery in deep convolutional neural networks. In ACCV.
- [Simonyan and Zisserman2015] Simonyan, K., and Zisserman, A. 2015. Very deep convolutional networks for large-scale image recognition. In ICLR.
- [Simonyan, Vedaldi, and Zisserman2013] Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2013. Deep inside convolutional networks: visualising image classification models and saliency maps. In arXiv:1312.6034.
- [Singh, Gupta, and Efros2012] Singh, S.; Gupta, A.; and Efros, A. A. 2012. Unsupervised discovery of mid-level discriminative patches. In ECCV.
- [Szegedy et al.2014] Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2014. Intriguing properties of neural networks. In arXiv:1312.6199v4.
- [Uijlings et al.2013] Uijlings, J. R. R.; van de Sande, K. E. A.; Gevers, T.; and Smeulders, A. W. M. 2013. Selective search for object recognition. In IJCV 104(2):154–171.
- [Wah et al.2011] Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The caltech-ucsd birds-200-2011 dataset. Technical report, California Institute of Technology.
- [Yosinski et al.2014] Yosinski, J.; Clune, J.; Bengio, Y.; and Lipson, H. 2014. How transferable are features in deep neural networks? In NIPS.
- [Zeiler and Fergus2014] Zeiler, M. D., and Fergus, R. 2014. Visualizing and understanding convolutional networks. In ECCV.
- [Zhang et al.2017a] Zhang, Q.; Cao, R.; Wu, Y. N.; and Zhu, S.-C. 2017a. Growing interpretable part graphs on convnets via multi-shot learning. In AAAI.
- [Zhang et al.2017b] Zhang, Q.; Cao, R.; Zhang, S.; Edmonds, M.; Wu, Y.; and Zhu, S.-C. 2017b. Interactively transferring cnn patterns for part localization. In arXiv:1708.01783.
- [Zhang et al.2017c] Zhang, Q.; Cao, R.; Wu, Y. N.; and Zhu, S.-C. 2017c. Mining object parts from cnns via active question-answering. In CVPR.
- [Zhang, Wang, and Zhu2018] Zhang, Q.; Wang, W.; and Zhu, S.-C. 2018. Examining cnn representations with respect to dataset bias. In AAAI.
- [Zhang, Wu, and Zhu2017] Zhang, Q.; Wu, Y.; and Zhu, S.-C. 2017. A cost-sensitive visual question-answer framework for mining a deep and-or object semantics from web images. In arXiv:1708.03911.
- [Zhou et al.2015] Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; and Torralba, A. 2015. Object detectors emerge in deep scene cnns. In ICLR.
- [Zhou et al.2016] Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; and Torralba, A. 2016. Learning deep features for discriminative localization. In CVPR.