Interpreting CNN Knowledge via an Explanatory Graph
This paper introduces a graphical model, namely an explanatory graph, which reveals the knowledge hierarchy hidden inside conv-layers of a pre-trained CNN. Each filter in a conv-layer of a CNN for object classification usually represents a mixture of object parts. We develop a simple yet effective method to disentangle object-part pattern components from each filter. We construct an explanatory graph to organize the mined part patterns, where a node represents a part pattern, and each edge encodes co-activation relationships and spatial relationships between patterns. More crucially, given a pre-trained CNN, the explanatory graph is learned without a need of annotating object parts. Experiments show that each graph node consistently represented the same object part through different images, which boosted the transferability of CNN features. We transferred part patterns in the explanatory graph to the task of part localization, and our method significantly outperformed other approaches.
Convolutional neural networks (CNNs) [15, 12, 9] have exhibited superior performance in various visual tasks, for example, object classification and detection. In comparison, explaining features in middle conv-layers of a CNN has presented continuous challenges for decades. When a CNN is trained for object classification, its conv-layers have encoded rich implicit patterns of object parts and patterns of textures. Therefore, this research aims to provide a global analysis of how visual knowledge is organized in a pre-trained CNN:
How many patterns can activate a certain convolutional filter of the CNN? For example, the filter may be triggered by either a specific object-part pattern or a certain textural pattern.
Which patterns are co-activated to describe an object part?
What is the spatial relationship between two co-activated patterns?
Given a CNN pre-trained for object classification, in this paper, we propose a method (i) to mine object-part patterns from intermediate conv-layers and (ii) to organize these patterns in an explanatory graph.
As shown in Fig. 1, the explanatory graph encodes the knowledge hierarchy hidden inside the CNN, as follows.
The explanatory graph has multiple layers, which correspond to different conv-layers of the CNN.
Each graph layer has many nodes. We use graph nodes in a layer to represent all candidate part patterns that can activate the feature map of the corresponding conv-layer.
Because a filter in the conv-layer may be potentially triggered by multiple parts of the object, we disentangle different part patterns from the same filter, which are represented as different graph nodes.
A graph edge connects two nodes in adjacent layers to encode co-activation logics and spatial relationships between them.
We can regard the explanatory graph as a dictionary, which summarizes the part knowledge hidden inside hundreds of thousands of chaotic neural activations of a conv-layer into thousands of graph nodes.
During the inference process, given the feature maps of an input image, our method selects a small number of nodes from the explanatory graph and assigns them to certain neural activations in the feature maps. We consider these nodes activated, in the sense that they explain which part patterns are hidden behind the neural activations. Each graph node consistently corresponds to the same part over different input object images.
Note that the location of each pattern (node) is not fixed to a specific neural activation unit during the inference process. Instead, given different input images, a part pattern may appear at various locations on a filter’s feature maps. For example, the ear pattern and the face pattern of a horse in Fig. 3 can appear at different locations in different images, but they are co-activated and keep certain spatial relationships.
• Disentangling object parts from a single filter is the core technique of building an explanatory graph. As shown in Fig. 1, a filter in a conv-layer may be activated by different object parts (e.g. the filter’s feature map may be activated by both the head and the neck of a horse).
In this study, we hope to develop a simple yet effective method to automatically disentangle different part patterns from a single filter without using any annotations of object parts, which presents considerable challenges for state-of-the-art algorithms. In this way, the explanatory graph explains neural activations with clear meanings and ignores noisy activations and activations of textural patterns. Given a testing image to the CNN, the explanatory graph can infer (i) which nodes (parts) are responsible for neural activations of a filter and (ii) locations of the corresponding parts on the feature map.
• Graph nodes with high transferability: The explanatory graph contains off-the-shelf patterns for object parts. The explanatory graph summarizes chaotic feature maps of conv-layers into object parts, which can be considered as a more concise and meaningful representation of the CNN knowledge, just like a dictionary. The explanatory graph enables us to accurately transfer object-part patterns from conv-layers to other tasks. Because all filters in the CNN are learned using numerous training images, we can consider each graph node as a detector that has been thoroughly learned to detect a part across thousands of images.
To prove the above assertions, we learn explanatory graphs for different CNNs (including the VGG-16, residual networks, and the encoder of a VAE-GAN) and analyze the graphs from various perspectives as follows.
Visualization & reconstruction: We visualize part patterns encoded by graph nodes using the following two approaches. First, for each graph node, we select object parts that most strongly activate the node for visualization. Second, we learn another decoder network to invert activation states of graph nodes to reconstruct image regions of the nodes.
Examining part interpretability of graph nodes: We quantitatively evaluate the part-level interpretability of graph nodes. Given an explanatory graph, we measure whether a node consistently represents the same part on different objects.
Examining location instability of graph nodes:
Besides the part interpretability, we also use a new metric, namely location instability, to measure the semantic clarity of each graph node. It is assumed that if a graph node consistently represents the same object part, then the distance between the inferred part and some ground-truth landmarks of the object should not change a lot through different images. Thus, the evaluation metric uses the deviation of such relative distances over images to measure the instability of a part pattern.
Contributions of this paper are summarized as follows.
In this paper, we, for the first time, propose a simple yet effective method to extract and summarize part knowledge hidden inside chaotic feature maps of intermediate conv-layers of a CNN and organize the layerwise knowledge hierarchy using an explanatory graph. Experiments show that each graph node consistently represents the same object part through different input images.
As a generic method, we can learn explanatory graphs for different CNNs, e.g. VGGs, residual networks, and the encoder of a VAE-GAN.
Graph nodes (patterns) have good transferability, especially in the task of few-shot part localization. Although our graph nodes were learned without part annotations, our transfer-learning-based part localization still outperformed approaches that learned part representations using part annotations.
A preliminary version of this paper appeared in .
The interpretability and the discrimination power are two crucial aspects of a CNN . In recent years, different methods have been developed to explore the semantics hidden inside a CNN.
Visualization & interpretability of CNN filters:
Visualization of filters in a CNN is the most direct way of exploring the patterns hidden inside a neural unit. Many visualization methods have been used in the literature. Dosovitskiy et al.  proposed up-convolutional nets to invert feature maps of conv-layers to images. However, up-convolutional nets cannot mathematically ensure that the visualization result reflects actual neural representations. Comparatively, gradient-based visualization [39, 19, 27] shows the appearance that maximizes the score of a given unit, which is closer in spirit to understanding CNN knowledge. Furthermore, Bau et al.  defined and analyzed the interpretability of each filter. In recent years,  provided a reliable tool to visualize filters in different conv-layers of a CNN.
Although these studies achieved clear visualization results, theoretically, gradient-based visualization methods visualize one of the local minima contained in a high-layer filter. I.e., when a filter represents multiple patterns, these methods selectively illustrate one of the patterns; otherwise, the visualization result becomes chaotic. Similarly,  selectively analyzed the semantics among the highest 0.5% of activations of each filter. In contrast, our method provides a solution to explaining both strong and relatively weak activations of each filter, instead of exclusively extracting significant neural activations.
Active network diagnosis:
Going beyond “passive” visualization, some methods “actively” diagnose a pre-trained CNN to obtain an insightful understanding of CNN representations. Many statistical methods [31, 38, 1] have been proposed to analyze the characteristics of CNN features.  explored semantic meanings of convolutional filters.  evaluated the transferability of filters in intermediate conv-layers. [17, 1] computed feature distributions of different categories in the CNN feature space. Methods of [6, 24] propagated gradients of feature maps w.r.t. the CNN loss back to the image, in order to estimate the image regions that directly contribute to the network output. LIME and SHAP  proposed general methods to extract the input units of a neural network that are used for a specific prediction.
Zhang et al.  demonstrated that, in spite of good classification performance, a CNN may encode biased knowledge representations due to dataset bias. Instead, the CNN usually uses unreliable contexts for classification. For example, a CNN may extract features from hairs as a context to identify the smiling attribute.
Therefore, in order to ensure the correctness of feature representations, network-attack methods [30, 11, 31] diagnosed network representations by computing adversarial samples for a CNN. In particular, influence functions  were proposed to compute adversarial samples, providing plausible ways to create training samples to attack the learning of CNNs, fix the training set, and further debug the representations of a CNN.  discovered knowledge blind spots (unknown patterns) of a pre-trained CNN in a weakly-supervised manner. Some studies [36, 37, 35] mined the local, bottom-up, and top-down information components in a model to construct a hierarchical object representation. From this perspective, our method disentangles object-part patterns from a pre-trained CNN and builds a knowledge hierarchy to diagnose the knowledge inside the CNN.
Some studies retrieve units with specific meanings from CNNs for different applications. Like middle-level feature extraction, pattern retrieval mainly learns mid-level representations of CNN knowledge. Zhou et al. [48, 49] selected units from feature maps to describe “scenes”. In particular,  proposed a method to accurately compute the image-resolution receptive field of neural activations in a feature map. Theoretically, the actual receptive field of a neural activation is smaller than that computed using the filter size, and an accurate estimation of the receptive field is crucial to understanding a filter’s representations. Simon et al. discovered objects from feature maps of unlabeled images , and selected a filter to describe each part in a supervised fashion . However, most methods simply assumed that each filter mainly encoded a single visual concept, and ignored the case that a filter in high conv-layers encoded a mixture of patterns. [41, 42, 43] extracted certain neurons from a filter’s feature map to describe an object part in a weakly-supervised manner (e.g. learning from active question answering and human interactions).
In this study, the explanatory graph disentangles patterns of different parts in the CNN without any need for part annotations. Compared to raw feature maps, the patterns in graph nodes are more interpretable.
Compared to the diagnosis of CNN representations and the pattern retrieval, semanticization of CNN representations is closer to the spirit of building interpretable representations.
Hu et al.  designed logic rules for network outputs, and used these rules to regularize neural networks and learn meaningful representations. However, this study did not obtain semantic representations in intermediate layers.  distilled knowledge of a neural network into an additive model to explain the knowledge inside the network.  used a tree structure to summarize the inaccurate rationale of each CNN prediction into generic decision-making models for a number of samples. Capsule nets  and interpretable CNNs  used certain network structures and loss functions, respectively, to make the network automatically encode interpretable features in intermediate layers.
In comparison, we aim to explore the entire semantic hierarchy hidden inside conv-layers of a CNN. With clear semantic structures, the explanatory graph makes it easier to transfer CNN patterns to other part-based tasks.
Knowledge-transfer ideas have been widely used in deep learning. Typical research includes end-to-end fine-tuning and transferring CNN knowledge between different categories  or different datasets . In contrast, a transparent representation of part knowledge creates a new possibility of transferring part knowledge to other applications. Therefore, we build an explanatory graph to represent the part patterns hidden inside a CNN, which enables transferring part patterns to other tasks. Experiments have demonstrated the effectiveness of our method in few-shot part localization.
A single filter is usually activated by different parts of the object (see Fig. 2). Let us assume that, given an input image, a filter is activated by several parts, i.e. there are multiple activation peaks on the filter’s feature map. Some peaks represent common parts of the object, which are termed part patterns. Other activation peaks may correspond to background noise or textural patterns.
Our goal is to disentangle activation peaks corresponding to part patterns from chaotic feature maps of a filter. It is assumed that if an activation peak of a filter represents an object part, then the CNN usually also contains other filters to represent neighboring parts of the target part. I.e. some activation peaks (patterns) of these filters must keep certain spatial relationships with the target part. Thus, the explanatory graph connects each pattern (node) in a low layer to some patterns in the neighboring upper layer.
We mine part patterns layer by layer. Given patterns mined from the upper layer, we extract activation peaks that keep stable spatial relationships with specific upper-layer patterns through different images, as part patterns in the current layer.
Patterns in high layers usually represent large-scale object parts, while patterns in low layers mainly describe small and relatively simple shapes, which can be regarded as components of high-layer patterns. Patterns in high layers are usually discriminative, and the explanatory graph uses high-layer patterns to filter out noisy activations. Patterns in low layers are disentangled based on their spatial relationship with high-layer patterns.
We are given a CNN that is pre-trained using its own set of training samples. Let G denote the target explanatory graph. G contains several layers, which correspond to conv-layers in the CNN. Our method disentangles each filter of a conv-layer into several part patterns. These part patterns are modeled as a set of nodes in the corresponding layer of G. The parameters of the nodes in a layer mainly encode the spatial relationships between these nodes and the nodes in the upper layer.
Given an input image, each conv-layer of the CNN generates a feature map. Then, for each node, the explanatory graph infers whether or not the node’s part pattern appears on the corresponding channel of the feature map, as well as the part location (if the pattern appears). We collect the position inference results for all nodes in each layer.
Top-down iterative learning of explanatory graphs:
Given all training images, we expect that (i) all pattern nodes in the explanatory graph can be well fit to the feature maps of all images, and (ii) nodes in the lower layer always keep consistent with nodes in the upper layer for each input image. Therefore, the learning of an explanatory graph is conducted in a top-down manner as follows.
We first disentangle patterns from the top conv-layer of the CNN and construct the top graph layer. Then, we use inference results of the patterns/nodes on the top layer to help disentangle patterns from the neighboring lower conv-layer. In this way, we can ensure stable layerwise spatial relationships between patterns.
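The top-down, EM-style learning loop described above can be sketched as a toy implementation. All function and helper names here (`init_nodes`, `infer_node_positions`, `update_node_parameters`) are hypothetical, and the sketch omits the upper-layer spatial constraint and the full mixture model for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_nodes(layer_maps, n_parts):
    # One prior location (normalized to [0, 1]^2) per (filter, part) pair.
    n_filters = layer_maps[0].shape[0]
    return rng.uniform(0.0, 1.0, size=(n_filters, n_parts, 2))

def infer_node_positions(nodes, layer_maps):
    # E-step analogue: for each image, assign every node of a filter to the
    # activation peak of that filter's channel (the real method scores peaks
    # by spatial compatibility with connected upper-layer nodes).
    positions = []
    for fmap in layer_maps:              # fmap: (n_filters, h, w), one image
        n_filters, h, w = fmap.shape
        pos = np.zeros_like(nodes)
        for d in range(n_filters):
            i, j = np.unravel_index(np.argmax(fmap[d]), (h, w))
            pos[d, :, :] = np.array([i / h, j / w])
        positions.append(pos)
    return positions

def update_node_parameters(nodes, positions):
    # M-step analogue: move each prior location to the mean inferred position.
    return np.mean(positions, axis=0)

def learn_explanatory_graph(feature_maps_per_layer, n_parts=2, n_iters=5):
    # Learn layers top-down: top conv-layer first, then lower layers.
    graph = []
    for layer_maps in reversed(feature_maps_per_layer):
        nodes = init_nodes(layer_maps, n_parts)
        for _ in range(n_iters):
            positions = infer_node_positions(nodes, layer_maps)
            nodes = update_node_parameters(nodes, positions)
        graph.insert(0, nodes)           # keep bottom-to-top layer order
    return graph
```

In the actual method, the inference step also conditions on the already-inferred upper-layer nodes, which is what disentangles multiple part patterns per filter; this toy shares one peak per filter only to show the loop structure.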
When we learn a given layer, for each node, we need to learn the following two terms: (i) the node’s parameters, including its prior location, and (ii) the set of patterns in the upper layer that are connected to it. For each connected upper-layer node, the parameters record the prior displacement between the two nodes. The explanatory graph only uses this displacement to model the spatial relationships between nodes.
Just like an EM algorithm, we use the current explanatory graph to fit the feature maps of training images. Then, we use the matching results as feedback to modify the prior location and edge connections of each node in the current layer, in order to make the explanatory graph better fit the feature maps. We repeat this process iteratively to obtain the optimal prior locations and edge connections.
In other words, our method automatically extracts pairs of related patterns and learns the optimal spatial relationships between them during the iterative learning process, which best fit feature maps of training images.
Therefore, the objective of learning each layer is to maximize the probability of the feature maps of all training images under the mixture model defined below.
Let us focus on the feature map of one image. Without ambiguity, we ignore the image index to simplify notations in the following paragraphs. We can regard the feature map as a distribution of “neural activation entities.” The neural response of each unit can be considered as the number of “activation entities” localized at the unit’s position in the corresponding channel. (To make unit positions in different conv-layers comparable with each other, e.g. in Eq. 4, we project the position of each unit onto the image plane; coordinates are defined on the image plane, instead of the feature-map plane.) The number of activation entities at a location is given by the normalized response value of the unit multiplied by a constant.
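As a rough sketch of how one channel’s responses can be read as “activation entities,” the following hypothetical helper rectifies and normalizes the channel so that the entity counts sum to a constant `alpha` (a stand-in for the paper’s unspecified constant):

```python
import numpy as np

def activation_entities(channel, alpha=10.0):
    """Turn one channel of a feature map into a discrete distribution of
    'activation entities': rectified responses, normalized, scaled by alpha."""
    resp = np.maximum(channel, 0.0)      # keep only positive responses
    total = resp.sum()
    if total == 0:                       # no activation: no entities anywhere
        return np.zeros_like(resp)
    return alpha * resp / total          # entity count at each unit position
```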
Just like a Gaussian mixture model, all pattern nodes disentangled from a filter comprise a mixture model, which explains the distribution of activation entities on the corresponding channel of the feature map. Each node is treated as a hidden variable, i.e. an alternative component in the mixture model, to describe activation entities.
Each component has a constant prior probability, and a compatibility term measures how well a node describes an activation entity at a given position. In particular, we add a dummy component to the mixture model for noisy activations, which cannot be explained by any part patterns. The compatibility between a node and a position is based on the spatial relationship between the node and its connected upper-layer nodes, and is approximated as follows.
In the above formulation, each node has several related nodes in the upper layer, and the set of node connections is determined during the learning process. The overall compatibility is factorized into the spatial compatibility between the node and each related upper-layer node, whose position inference result has already been given. A constant is used for normalization, and another constant roughly ensures that the probabilities sum to one; both can be eliminated during the learning process.
As shown in Fig. 3, an intuitive idea is that the relative displacement between a node and its connected upper-layer node should not change a lot among different images; the observed displacement will then approximate the prior displacement if the node fits the activation well. We therefore assume the spatial relationship between the two nodes follows a Gaussian distribution (Eqn. 4), whose mean is the prior location of the node given the upper node’s position. The variance can be estimated from data: we can either set it directly, or compute the variation of the observed displacements over different images.
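Under the Gaussian assumption above, the spatial compatibility between a candidate position and a connected upper-layer node can be sketched as follows; the function name and the fixed `sigma` are assumptions for illustration:

```python
import numpy as np

def spatial_compatibility(pos, upper_pos, prior_disp, sigma=0.1):
    """Gaussian compatibility between a candidate position `pos` for a node
    and the inferred position `upper_pos` of a connected upper-layer node,
    given the learned prior displacement `prior_disp` between the two nodes."""
    disp = np.asarray(pos) - np.asarray(upper_pos)   # observed displacement
    diff = disp - np.asarray(prior_disp)             # deviation from the prior
    return float(np.exp(-np.dot(diff, diff) / (2 * sigma ** 2)))
```

The compatibility is maximal (1.0) when the observed displacement exactly matches the prior displacement, and decays smoothly as the deviation grows.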
We learn the explanatory graph in a top-down manner, and the learning process is summarized in Algorithm 1. We first learn the nodes in the top layer of the graph, and then learn those in the neighboring lower layer. For the sub-graph in each layer, we iteratively estimate the node parameters and connections for nodes in the sub-graph.
Inference of pattern locations: Given the feature maps of an input image, we can assign nodes in the explanatory graph to different activation peaks on the feature maps, in order to infer the semantic meanings (parts) represented by these neural activations. The explanatory graph simply assigns each node the feature-map unit with the highest assignment score; the position of that unit is the inferred location of the node.
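The assignment step can be sketched as a brute-force search over feature-map units; `score_fn` is a hypothetical stand-in for the paper’s assignment score, which combines the unit’s response with spatial compatibility:

```python
import numpy as np

def infer_node(channel, score_fn):
    """Assign a graph node to the feature-map unit with the highest score.
    `score_fn(position, response)` scores each unit; positions are
    normalized to [0, 1)^2."""
    h, w = channel.shape
    best, best_score = None, -np.inf
    for i in range(h):
        for j in range(w):
            s = score_fn((i / h, j / w), channel[i, j])
            if s > best_score:
                best, best_score = (i, j), s
    return best, best_score    # inferred unit index and its score
```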
To demonstrate the broad applicability of our method, we learned explanatory graphs to interpret four types of CNNs, i.e. the VGG-16 , the 50-layer and 152-layer Residual Networks , and the encoder of the VAE-GAN . These CNNs were learned using a total of 37 animal categories in three datasets, which included the ILSVRC 2013 DET Animal-Part dataset , the CUB200-2011 dataset , and the VOC Part dataset . As discussed in [4, 41], animals usually contain non-rigid parts, which presents a key challenge for part localization. Thus, we selected animal categories in the three datasets for testing.
We designed three experiments to evaluate the explanatory graph from different perspectives. In the first experiment, we visualized node patterns in the explanatory graph. The second experiment was designed to evaluate the interpretability of part patterns, i.e. checking whether or not a node pattern consistently represents the same object part among different images. We compared our patterns with three types of middle-level features and neural patterns. In the third experiment, we used our graph nodes for the task of few-shot part localization, in order to test the transferability of node patterns in the graph. We associated part patterns with explicit part names for part localization. We compared our part-localization performance with fourteen baselines.
We first trained/fine-tuned a CNN using object images of a category, which were cropped using object bounding boxes. Then, we set the model parameters (e.g. the number of patterns per filter) to learn an explanatory graph for the CNN.
The VGG-16 was first pre-trained using the 1.3M images in the ImageNet dataset. We then fine-tuned all conv-layers of the VGG-16 using object images in a category. The loss for fine-tuning was that for binary classification between the target category and background images. The VGG-16 has thirteen conv-layers and three fully connected layers. We selected the ninth, tenth, twelfth, and thirteenth conv-layers of the VGG-16 as four valid conv-layers and accordingly built a four-layer graph. We extracted a fixed number of patterns from each filter of each valid conv-layer.
The global structure of an explanatory graph for a VGG-16 network is visualized in Fig. 4. We visualized detailed part patterns of graph nodes from the following three perspectives.
Top-ranked patches: For each image, we performed the pattern inference on its feature maps. For a node, we extracted a patch at the inferred location (the unit was projected onto the image plane to compute its position) with a fixed scale to represent the node. Fig. 5 shows the image patches of a pattern that had the highest inference scores.
Heatmaps of patterns: Given inference results of patterns w.r.t. a cropped object image, we drew heatmaps to show the spatial distribution of the inferred patterns. We drew a heatmap for each layer of the graph, where each pattern was visualized as a weighted Gaussian distribution centered at its inferred position. Fig. 6 shows heatmaps of the top-50% patterns with the highest inference scores.
Pattern-based image synthesis: We used the up-convolutional network  to visualize part patterns of graph nodes. Given an object image, we used the explanatory graph for pattern inference, i.e. assigning each pattern a certain neural unit as its position inference. We considered the top-10% patterns with the highest inference scores as valid ones. We filtered out from the feature maps all neural responses of units that were not assigned to valid patterns (setting these responses to zero). We selected the filtered feature map corresponding to the second graph layer and used the up-convolutional network to invert it back to the image space. Fig. 7 shows image-synthesis results, which can be regarded as the visualization of the inferred patterns.
In this experiment, we evaluated whether or not each node pattern consistently represented the same object part through different images. Four explanatory graphs were built for a VGG-16 network, two residual networks, and a VAE-GAN. These networks were learned using the CUB200-2011 dataset . We used the following two metrics to measure the interpretability of node patterns.
Part interpretability of patterns: We mainly extracted patterns from high conv-layers, because as discussed in , high conv-layers contain large-scale part patterns. The evaluation metric was inspired by Zhou et al. . For the pattern of a given node, we made inferences among all images and regarded the inference results with the top-K inference scores as valid representations of the pattern. K was chosen such that the K highest inference scores took about 30% of the total inference energy. We asked human raters to count how many of the top-K inference results described the same object part, in order to compute the purity of part semantics of the pattern.
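The choice of K (the smallest K whose top scores cover about 30% of the total inference energy) can be sketched as follows; the function name is hypothetical:

```python
import numpy as np

def top_k_for_energy(scores, energy_ratio=0.3):
    """Return the smallest K such that the K highest inference scores
    account for at least `energy_ratio` of the total inference energy."""
    s = np.sort(np.asarray(scores, dtype=float))[::-1]  # descending scores
    target = energy_ratio * s.sum()
    cum = np.cumsum(s)
    return int(np.searchsorted(cum, target) + 1)
```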
The table in Fig. 8 (top-left) shows the semantic purity of the patterns in the second layer of the graph. Let the second graph layer correspond to a certain conv-layer of the CNN. The raw filter maps baseline used all neural activations in the feature map of a filter to describe a part. The raw filter peaks baseline considered the highest peak on a filter’s feature map as the part detection. Like our method, the two baselines also visualized top-K part inferences (the feature maps’ neural activations took 30% of activation energies over all images). We back-propagated the center of the receptive field of each neural activation to the image plane and drew the image region corresponding to each neural activation. Fig. 8 compares the image regions corresponding to each graph node with the image regions corresponding to feature maps of each filter. Our graph nodes represented explicit object parts, but raw filters encoded mixed semantics.
Because the baselines simply averaged the semantic purity over all filters, we also computed average semantic purities using only the top-ranked nodes with the highest scores, to enable a fair comparison.
(Table residue: scores of the raw filter baseline, 0.1328, 0.1346, 0.1398, and 0.1944; the column headers were lost in extraction.)
Location instability of inference positions: We defined the location instability for each pattern as another evaluation metric of pattern interpretability. We assumed that if a pattern was always triggered by the same object part through different images, then the distance between the pattern’s inference position and a ground-truth landmark of the object part should not change a lot among various images.
As shown in Fig. 9, given a testing image, we measured the distances between the inferred position of a pattern and the ground-truth landmark positions of the head, back, and tail parts, respectively. These distances were normalized by the diagonal length of the input images. The node’s location instability was then given as the average, over the three landmarks, of the deviation of the corresponding normalized distance across different images.
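Under the definitions above, the location-instability metric can be sketched as follows (positions are assumed to be 2-D image coordinates already matched across images; the function name is hypothetical):

```python
import numpy as np

def location_instability(inferred_pos, landmarks, diag_len):
    """Location instability of one pattern: for each ground-truth landmark,
    take the standard deviation of the normalized pattern-to-landmark
    distance over images, then average over landmarks (head, back, tail
    in the experiments)."""
    inferred = np.asarray(inferred_pos, dtype=float)   # (n_images, 2)
    devs = []
    for lm in landmarks:                               # each (n_images, 2)
        d = np.linalg.norm(inferred - np.asarray(lm), axis=1) / diag_len
        devs.append(d.std())
    return float(np.mean(devs))
```

A pattern that always sits at the same offset from a landmark yields zero instability, matching the intuition that it consistently represents the same part.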
We compared the location instability of an explanatory graph with three baselines. The first baseline treated each filter in a CNN as a detector of a certain pattern; given the feature map of a filter (after the ReLU operation), we localized the unit with the highest response value as the pattern position. The other two baselines were typical methods to extract middle-level features from images  and to extract patterns from CNNs , respectively. For each method, we chose the top-500 patterns, i.e. the 500 nodes with the top scores in the explanatory graph, the 500 filters with the strongest activations in the CNN, and the top-500 middle-level features. For each pattern, we selected position inferences on the top-20 images with the highest scores to compute the location instability. Table I compares the location instability of the different methods. Nodes in the explanatory graph had significantly lower location instability than the patterns of the baselines.
(Table residue: rows listing per-category part-localization results of the Fast-RCNN (1 ft) and Fast-RCNN (2 fts) baselines; the column headers were lost in extraction.)
The explanatory graph makes it possible to transfer middle-layer patterns from CNNs to semantic object parts. In order to test the transferability of patterns in the explanatory graph, we introduce a further extension of the explanatory graph, i.e. using a hybrid And-Or graph (AOG) to associate part patterns in the explanatory graph with explicit part names. The structure of the AOG is inspired by , and the learning of the AOG was originally proposed in . We briefly introduce the basic inference logic and settings of the AOG as follows.
As shown in Fig. 10, the AOG encodes a four-layer hierarchy for each semantic part, i.e. the semantic part (OR node), part templates (AND node), latent patterns (OR nodes, those from the explanatory graph), and neural units (terminal nodes).
|1||semantic part||OR node|
|2||part template||AND node|
|3||latent pattern||OR node|
|4||neural unit||Terminal node|
where latent patterns correspond to nodes from the explanatory graph.
In the AOG, each OR node (e.g. a semantic part or a latent pattern) contains a list of alternative appearance (or deformation) candidates. Each AND node (e.g. a part template) uses a number of latent patterns to describe its compositional regions.
The OR node of a semantic part contains a number of part templates as children, which represent alternative appearance or pose candidates of the part.
Each part template (AND node) retrieves latent patterns from the explanatory graph as children. These patterns describe compositional regions of the part.
Each latent pattern (OR node) has all units in its corresponding filter’s feature map as children, which represent its deformation candidates on an image.
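The four-layer hierarchy above can be sketched as a simple data structure. This is a minimal Python sketch with illustrative names, not the original implementation:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Unit:
    """Terminal node: one position in a filter's feature map."""
    position: Tuple[int, int]
    score: float = 0.0

@dataclass
class LatentPattern:
    """OR node: selects one unit as its deformation candidate."""
    units: List[Unit] = field(default_factory=list)

@dataclass
class PartTemplate:
    """AND node: composes several latent patterns (sub-regions of the part)."""
    patterns: List[LatentPattern] = field(default_factory=list)

@dataclass
class SemanticPart:
    """OR node: alternative part templates (appearance/pose candidates)."""
    templates: List[PartTemplate] = field(default_factory=list)
```

Each OR node keeps a list of alternatives to choose from at inference time, while the AND node keeps the patterns it composes, mirroring the four layers in the table above.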
Technical details: Based on the AOG, we use the extracted patterns to infer semantic parts in a bottom-up manner. We first compute inference scores of different units at the bottom layer w.r.t. different patterns, and then we propagate inference scores up to the layers of part templates and the semantic part for part localization.
The top OR node of the semantic part contains a number of part templates, which represent alternative appearance or pose candidates of the part. We manually define the composition of the part templates. During the part-inference process, given an image, the OR node selects the child with the highest inference score as the true part template.
Then, each part template uses a number of latent patterns to describe sub-regions of the part. In the scenario of one-shot learning, we annotate only one part sample belonging to the part template. We then retrieve patterns related to the annotated part from all nodes in the explanatory graph: given the inference score and inferred position of each latent pattern on the annotated image, we retrieve the latent patterns with the highest scores among those inferred within a constant spatial range of the annotated part position, and take them as children of the part template.
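The retrieval step can be sketched as follows, assuming each candidate pattern carries an id, an inference score, and an inferred position on the annotated image; the `radius` threshold stands in for the constant spatial range mentioned above (all names are hypothetical):

```python
def retrieve_patterns(patterns, part_pos, radius, k):
    """Pick the top-k latent patterns for a part template (one-shot setting).

    patterns: list of (pattern_id, score, (x, y)) inferred on the annotated image
    part_pos: annotated part center (x, y)
    radius:   constant spatial range around the annotation
    k:        number of latent patterns to keep as children

    Keep only patterns inferred within `radius` of the annotated position,
    then rank them by inference score.
    """
    near = [p for p in patterns
            if ((p[2][0] - part_pos[0]) ** 2
                + (p[2][1] - part_pos[1]) ** 2) ** 0.5 <= radius]
    near.sort(key=lambda p: p[1], reverse=True)
    return [p[0] for p in near[:k]]
```

The spatial filter discards patterns that fire far from the annotated part, so only patterns plausibly describing that part's sub-regions become children of the template.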
When we have extracted a set of latent patterns for a part template, given a new image, we can use the inference results of the latent patterns to localize the part template: each latent pattern votes for the part position through a constant displacement between the pattern and the part template.
Each latent pattern has all units in a channel of the feature map as children, which represent its deformation candidates on an image. During inference, the OR node of the latent pattern selects the child unit with the maximum inference score as its deformation configuration.
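The bottom-up inference above can be condensed into nested max/sum operations. The sketch below simplifies away the displacement voting and uses hypothetical names: each OR node takes the maximum over its children, and each AND node sums its children's scores.

```python
def infer_part(templates):
    """Bottom-up AOG inference (simplified sketch).

    templates: list of part templates; each template is a list of latent
               patterns; each pattern is a list of (unit_score, unit_position)
               deformation candidates.

    OR node (latent pattern): max over its units.
    AND node (part template): sum over its latent patterns.
    OR node (semantic part):  max over its templates.
    Returns the index of the best part template and its score.
    """
    best_idx, best_score = -1, float('-inf')
    for i, patterns in enumerate(templates):
        score = sum(max(s for s, _ in pattern) for pattern in patterns)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx, best_score
```

In the full model, the unit scores would also account for the constant displacements between patterns and the part template; here they are treated as precomputed.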
Please see  for details of the AOG.
Given a fine-tuned VGG-16 network, we learned an explanatory graph and built the AOG upon it following the scenario of few-shot learning in . For each category, we set three templates for the head part and used three part-box annotations, one for each template. Note that we used object images without part annotations to learn the explanatory graph, and we used the three part annotations provided for each part to build the AOG. All object-box annotations and part annotations were equally provided to all baselines to enable fair comparisons (besides part annotations, all baselines also used the object annotations contained in the datasets for learning). We used one parameter setting to learn AOGs for categories in the ILSVRC Animal-Part and CUB200 datasets and another for VOC Part categories. Then, we used the AOGs to localize semantic parts on objects.
We compared AOGs with a total of fourteen baselines for part localization. The baselines included (i) approaches for object detection (i.e. directly detecting target parts from objects), (ii) graphical/part models for part localization, and (iii) the methods selecting CNN patterns to describe object parts.
The first baseline was the standard Fast-RCNN, namely Fast-RCNN (1 ft), which directly fine-tuned a VGG-16 network based on part annotations. The second baseline, namely Fast-RCNN (2 fts), first used massive object-box annotations in the target category to fine-tune the VGG-16 network with the loss of object detection; then, given part annotations, Fast-RCNN (2 fts) further fine-tuned the VGG-16 to detect object parts. The third baseline, namely CNN-PDD, selected certain filters of a CNN to localize the target part; in CNN-PDD, the CNN was pre-trained on the ImageNet dataset. Just like Fast-RCNN (2 fts), we extended CNN-PDD as the fourth baseline, CNN-PDD-ft, which fine-tuned a VGG-16 network using object-box annotations before applying the CNN-PDD technique. The fifth and sixth baselines were DPM-related methods, i.e. the strongly supervised DPM (SS-DPM-Part) and PL-DPM-Part, respectively. The seventh baseline, namely Part-Graph, used a graphical model for part localization. Because "simple" methods are usually insensitive to model over-fitting in weakly supervised learning, we designed six further baselines as follows. First, we used object-box annotations in a category to fine-tune the VGG-16 network. Then, given a few well-cropped object images, we used selective search to collect image patches and used the VGG-16 network to extract fc7 features from these patches. The baselines fc7+linearSVM, fc7+RBF-SVM, and fc7+NN used a linear SVM, an RBF-SVM, and the nearest-neighbor method (selecting the patch closest to the annotated part), respectively, to detect the target part. The other three baselines, fc7+sp+linearSVM, fc7+sp+RBF-SVM, and fc7+sp+NN, combined both the fc7 feature and the spatial position of each image patch as features for part detection. The last baseline was the weakly supervised mining of part patterns from CNNs, namely supervised-AOG. Unlike our method (unsupervised), supervised-AOG used part annotations to extract part patterns.
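The fc7+NN baseline can be sketched in a few lines, assuming fc7 features have already been extracted for the single annotated part patch and for candidate patches of a new image (all names are hypothetical):

```python
import numpy as np

def nn_part_detector(annotated_fc7, patch_fc7):
    """fc7+NN baseline sketch: select the candidate patch whose fc7 feature
    is closest to the feature of the annotated part patch.

    annotated_fc7: (d,) fc7 feature of the annotated part patch
    patch_fc7:     (n_patches, d) fc7 features of candidate patches
    Returns the index of the detected patch.
    """
    dists = np.linalg.norm(patch_fc7 - annotated_fc7, axis=1)
    return int(dists.argmin())
```

The fc7+sp variants would simply concatenate the normalized patch position onto the feature vector before the distance computation, and the SVM variants would replace the nearest-neighbor rule with a trained classifier.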
We divided all baselines into three groups. The first group, namely not-learn parts, included traditional methods without deep features, such as SS-DPM-Part, PL-DPM-Part, and Part-Graph. These methods did not learn deep features (representation learning in these methods only used object-box annotations, which is independent of part annotations; a few part annotations were used to select off-the-shelf pre-trained features). The second group, termed super-learn parts, contained Fast-RCNN (1 ft), Fast-RCNN (2 fts), supervised-AOG, fc7+linearSVM, and fc7+sp+linearSVM. These methods learned deep features using part annotations, e.g. the Fast-RCNN methods used part annotations to learn features, and supervised-AOG used part annotations to select filters from CNNs to localize parts. The third group (unsuper-learn parts) included CNN-PDD, CNN-PDD-ft, and our method; these methods learned deep features using object-level annotations rather than part annotations.
Fig. 11 visualizes localization results based on AOGs, which were learned using three annotations of the head part of each category. We used the normalized distance (used in [41, 26]) and the traditional intersection-over-union (IoU) criterion to evaluate localization performance. Tables II, III, IV, V, and VI show part-localization results on the CUB200-2011 dataset, the VOC Part dataset, and the ILSVRC 2013 DET Animal-Part dataset. AOGs based on our graph nodes outperformed all baselines in few-shot learning. Note that our AOGs simply localized the center of an object part without sophisticatedly modeling the scale of the part; thus, detection-based methods, which also estimated the part scale, performed better in very few cases. Table VII compares the unsupervised and supervised learning of neural patterns. In this experiment, our method outperformed all baselines, even including approaches that learned part features using part annotations.
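The normalized-distance criterion can be sketched as follows; here we assume the common convention of dividing the localization error by the diagonal length of the object bounding box (the exact normalization follows [41, 26], and the names are hypothetical):

```python
import numpy as np

def normalized_distance(pred, gt, bbox_wh):
    """Part-localization error normalized by object scale.

    pred:    predicted part center (x, y)
    gt:      ground-truth part center (x, y)
    bbox_wh: (width, height) of the object bounding box
    Divides the Euclidean error by the bounding-box diagonal, making the
    metric comparable across objects of different sizes.
    """
    w, h = bbox_wh
    err = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float))
    return float(err / np.hypot(w, h))
```

Because the AOG predicts only the part center, this center-based metric complements the IoU criterion, which additionally penalizes scale errors.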
In this paper, we have developed a simple yet effective method to learn an explanatory graph that reveals knowledge hierarchy inside conv-layers of a pre-trained CNN. The explanatory graph can be regarded as a concise and meaningful summarization of CNN knowledge in intermediate layers, which filters out noisy activations, disentangles part patterns from each filter, and models co-activation relationships and spatial relationships between part patterns. Experiments showed that our patterns had significantly higher stability than baselines. More crucially, our method can be applied to different types of networks, including the VGG-16, residual networks, and the VAE-GAN, to explain their conv-layers.
The transparent representation of the explanatory graph boosts the transferability of CNN features. Part-localization experiments well demonstrated the good transferability of CNN patterns in graph nodes. Our method even outperformed the supervised learning of part representations. Nevertheless, the explanatory graph is just a rough representation of CNN knowledge. It is still difficult to well disentangle textural patterns from filters of the CNN.
This work is supported by ONR MURI project N00014-16-1-2007, DARPA XAI Award N66001-17-2-4029, and NSF IIS 1423305.
Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.
M. T. Ribeiro, S. Singh, and C. Guestrin. “Why should I trust you?” Explaining the predictions of any classifier. In KDD, 2016.
International Journal of Computer Vision, 93(2):226–252, 2011.