Interactively Transferring CNN Patterns for Part Localization

08/05/2017 ∙ by Quanshi Zhang, et al. ∙ 0

In the scenario of one/multi-shot learning, conventional end-to-end learning strategies without sufficient supervision are usually not powerful enough to learn correct patterns from noisy signals. Thus, given a CNN pre-trained for object classification, this paper proposes a method that first summarizes the knowledge hidden inside the CNN into a dictionary of latent activation patterns, and then builds a new model for part localization by manually assembling latent patterns related to the target part via human interactions. We use very few (e.g., three) annotations of a semantic object part to retrieve certain latent patterns from conv-layers to represent the target part. We then visualize these latent patterns and ask users to further remove incorrect patterns, in order to refine part representation. With the guidance of human interactions, our method exhibited superior performance of part localization in experiments.




1 Introduction

Convolutional neural networks (CNNs) [8, 7] have shown promise for classification tasks in computer vision. However, learning a model using small data, e.g. annotations on 1–3 examples, is still a great challenge for state-of-the-art algorithms. Without sufficient annotations, the conventional end-to-end learning usually has no mechanism to ensure that the CNN actually learns correct knowledge, rather than over-fitting to noisy signals.

Therefore, instead of letting the CNN “guess” new knowledge from small training data, we aim to mine certain patterns from a well pre-trained CNN to represent semantic parts of the object for part localization. We notice that when a CNN is well pre-trained/finetuned using a large number of object-box annotations for object classification, the CNN has encoded massive implicit patterns for local object shapes in its conv-layers. Each pattern may represent a local shape of an object.

In this paper, we propose to incorporate human interactions into the mining of part patterns, in order to ensure we use correct patterns to represent the target semantic part. As shown in Fig. 1, we are given a large number of images with object-box annotations to train a CNN for category classification, but very few (e.g. 1–3) objects of a category have annotations of a semantic part. We mine hundreds of latent patterns from the pre-trained CNN as candidates that potentially represent the target part. Then, we visualize the mined patterns and ask human users to manually select semantically correct latent patterns to build up the model for part localization, just like playing LEGO blocks.

Figure 1: Given a pre-trained CNN and very few (e.g. 1–3) part annotations, we retrieve certain patterns from conv-layers of the CNN to represent the target part. We use an AOG to encode the semantic hierarchy of the retrieved patterns. Each node in the AOG corresponds to a latent pattern in the CNN. We visualize patterns in the AOG and require people to select patterns related to the target part and remove unrelated patterns, in order to refine the AOG model.

More specifically, during the process of pattern mining, each latent pattern is expected 1) to frequently appear on objects of the category, 2) to be strongly activated by a compositional (or contextual) shape w.r.t the target part, and 3) to keep good spatial relationship with other latent patterns. Because all latent patterns are well pre-trained using massive data, these patterns represent common shapes of a category, instead of being over-fitted to a few part annotations.

For human interactions, we require human users to remove patterns corresponding to background noises and specificity of certain samples. Because we allow people to directly point out model flaws, this interactive-learning strategy is more effective than end-to-end “guessing” true knowledge of the object part from small training data.

And-Or graph representation: Before conducting human interactions, we need to represent latent patterns at the semantic level for pattern manipulation, rather than at the level of CNN neural units. Note that the same object part (e.g. the head) in different images may appear in different image positions, thereby activating neurons in different feature map positions. Thus, given a feature map of a filter in a conv-layer, we need to first remove noisy activations from the feature map, and then summarize the remaining neural activations on numerous neural units into a few latent patterns for human interactions. When we infer a latent pattern in different images, the pattern may appear in different feature map positions due to object deformation. Moreover, because a filter’s feature map may be activated by multiple object parts (e.g. being activated by both the head and the leg), we need to disentangle latent patterns of different object parts from the same feature map.

Therefore, we use an And-Or graph (AOG) to clearly represent the semantic hierarchy of the patterns that are mined from conv-layers. As shown in Fig. 2, we build a four-layer AOG to represent the semantic hierarchy ranging from semantic part, part templates, latent patterns, to CNN units. We use AND nodes in the AOG to encode compositional regions of a part, and use OR nodes to encode a list of alternative appearances/deformations for a local region.

Based on the AOG, we localize patterns on CNN feature maps and visualize these patterns. Then, users can remove irrelevant patterns by pruning certain AOG nodes.

Pattern visualization and interactions: We apply the up-convolutional neural network (up-conv-net) in [5] to visualize latent patterns in the AOG. We train different up-conv-nets to visualize latent patterns corresponding to different conv-layers of the CNN. According to visualization results, latent patterns in low conv-layers usually describe object details, and those in high conv-layers mainly correspond to large-scale parts or contexts. Therefore, we use low-layer patterns to represent details within the target part, and require users to remove low-layer patterns outside the part. We select patterns in high conv-layers to represent the contextual information of the target part.

Method generality: Our method is a general solution to interactively learning object parts based on pre-trained CNNs. First, the AOG representation of objects [11] can be compatible with different features. Second, our method has been tested on AOGs with two different types of neural patterns [19, 18].


Instead of end-to-end learning new information from training data, this study explores the possibility of directly selecting certain patterns from a pre-trained CNN to build a model for part localization. Our method mines middle-level latent patterns from the CNN and incorporates human interactions to manually select correct patterns. Our method exhibited superior performance in experiments, which demonstrates the effectiveness of human interactions in weakly-supervised learning.

2 Related work

CNN visualization, semanticization, and interactions: In recent years, many methods have been developed to explain the semantics hidden in the CNN. Studies of [17, 10, 14] passively visualized content of some given CNN units. [1] analyzed statistics of CNN features.

Unlike passive CNN visualization, we hope to actively semanticize CNNs by discovering patterns related to the target part, which is more challenging. Given CNN feature maps, Zhou et al. [20, 21] discovered latent “scene” semantics. Simon et al. discovered objects [12] from CNN activations in an unsupervised manner, and learned part concepts in a supervised fashion [13]. [19] mined CNN patterns for a part concept and transformed the pattern knowledge into an AOG model.

However, without sufficient supervision, previous studies cannot ensure that the extracted CNN patterns have correct part semantics. Thus, we visualize candidate patterns and require users to manually select correct ones, so as to create a better “white-box” explanation of the target part.

And-Or representation: In many studies, people used AOGs to represent the semantic hierarchy of objects or scenes [22, 11]. We use the AOG to associate the latent patterns with part semantics, which eases the visualization of CNN patterns and enables semantic-level interactions on CNN patterns.

Unsupervised learning of objects and parts: Unsupervised object discovery [12] was formulated as a problem of mining common foreground patterns from images, and many sophisticated methods were developed for this problem. In contrast, given a pre-trained object-level model, un-/weakly-supervised learning of part representations is a different problem.

Figure 2: Semantic And-Or graph grown on the pre-trained CNN. Red lines in the AOG indicate the parse tree, which associates certain CNN units with certain image regions.

3 Algorithm

In this section, we propose a general method to interactively learn object-part models. In fact, there are a number of techniques to mine neural patterns from CNNs to represent middle-level object shapes. Each of these patterns can be organized using a widely used AOG model. A clear AOG representation allows human users to refine the model by interactively modifying the AOG structure. The basic idea is to visualize latent patterns in the AOG and to let human users identify and remove latent patterns that are not closely related to the target part.

3.1 Preliminaries: AOG representation

The AOG has been a typical object representation for years [11, 22]. Here, we briefly introduce the AOG structure, which has been widely used to represent neural patterns of CNNs in studies of [19] and [18].

The AOG organizes latent patterns hidden in the CNN to explain the semantic hierarchy of an object part. We use the AOG to parse object parts from images. As shown in Fig. 2, we use a four-layer AOG hierarchy ranging from semantic part (OR node), part templates (AND nodes), latent patterns (OR nodes), to CNN units (terminals). In the AOG, an OR node encodes a list of alternative candidates as children, while an AND node uses each of its children to describe a certain compositional shape (or a contextual area) of the father node.
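The four-layer hierarchy described above can be pictured as a small tree data structure. The following is only an illustrative sketch: the class and field names (`CNNUnit`, `LatentPattern`, `PartTemplate`, `SemanticPart`, `count_nodes`) are hypothetical and are not taken from the paper or from [19, 18].

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CNNUnit:
    """Terminal node: one neural unit in a conv-slice (hypothetical fields)."""
    layer: int
    channel: int
    position: Tuple[int, int]  # (row, col) on the feature map


@dataclass
class LatentPattern:
    """OR node: alternative CNN units within a deformation range."""
    units: List[CNNUnit] = field(default_factory=list)


@dataclass
class PartTemplate:
    """AND node: its latent patterns jointly describe the part."""
    patterns: List[LatentPattern] = field(default_factory=list)


@dataclass
class SemanticPart:
    """Root OR node: alternative part templates."""
    templates: List[PartTemplate] = field(default_factory=list)


def count_nodes(part: SemanticPart) -> Tuple[int, int, int]:
    """Return (#part templates, #latent patterns, #CNN units) in the AOG."""
    n_tmp = len(part.templates)
    n_lat = sum(len(t.patterns) for t in part.templates)
    n_unt = sum(len(p.units) for t in part.templates for p in t.patterns)
    return n_tmp, n_lat, n_unt
```

With such a structure, removing a latent pattern during human interaction would amount to deleting one entry from a template's `patterns` list.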

Given an object image, we use the AOG for part parsing. I.e. we first use the CNN to compute the image’s feature maps of its conv-layers, and then determine a parse tree within the AOG to explain neural activations on the feature maps and simultaneously localize the target part. [Footnote 2: Considering the CNN’s superior performance in object detection, as in [3], object detection and part localization are considered two separate processes for evaluation. Thus, we crop the image to only contain the object and resize it for CNN inputs, to simplify the scenario of learning for part localization.]

As shown by the red lines in Fig. 2, during the parsing procedure, we 1) select a certain part template (AND node in the 2nd Layer) to explain the target part (root OR node in the AOG), 2) parse an image region for the part template (i.e. part localization), and 3) for each latent pattern (OR node in the 3rd Layer) under the part template, determine a CNN unit (terminal node) within a deformation range to localize the local shape of this latent pattern. To be precise, we achieve the part parsing in a bottom-up manner. I.e. in the beginning, we compute an inference score for each terminal node (CNN unit), and then propagate these scores up to nodes of latent patterns and part templates following certain And-Or rules for part parsing.

The top node of “semantic part” (OR node) encodes a number of alternative part templates as children. Each part template naturally corresponds to a type of part appearance observed from a certain perspective. Given parsing results of all part templates, the top node selects the part template with the highest inference score as the true parsing configuration (Eq. (1)).

In Eq. (1), the overall inference score on the image is the maximum, over all part templates, of the inference score obtained when we parse an image region for that part template.

Then, each part template (AND node) uses children latent patterns to represent its local compositional shapes or contextual area. Thus, we can formulate a part template’s inference score as the sum of the inference scores of its children latent patterns.
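One possible way to write the two rules above in symbols is sketched below. The notation ($I$ for the image, $V^{\mathrm{tmp}}$ and $V^{\mathrm{lat}}$ for a part template and a latent pattern, $\Lambda_V$ for the image region parsed for node $V$, and $\mathrm{Child}(\cdot)$ for a node's children set) is an assumption made for this sketch and may differ from the symbols used in the original equations.

```latex
\begin{align}
S_I &= \max_{V^{\mathrm{tmp}}}
       S_I\big(\Lambda_{V^{\mathrm{tmp}}} \mid V^{\mathrm{tmp}}\big)
    && \text{(OR node: select the best part template)} \\
S_I\big(\Lambda_{V^{\mathrm{tmp}}} \mid V^{\mathrm{tmp}}\big)
    &= \sum_{V^{\mathrm{lat}} \in \mathrm{Child}(V^{\mathrm{tmp}})}
       S_I\big(\Lambda_{V^{\mathrm{lat}}} \mid V^{\mathrm{lat}}\big)
    && \text{(AND node: sum over children latent patterns)}
\end{align}
```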

Finally, each latent pattern (OR node) naturally corresponds to a square deformation range within a certain conv-slice/channel of the CNN. All CNN units within the deformation range are regarded as deformation candidates of the latent pattern. [Footnote 3: We set a constant deformation range for each latent pattern, which covers a fixed portion of the feature map of the conv-slice. Deformation ranges of different patterns in the same conv-slice may overlap. The central position of the deformation range is a parameter to estimate.] Just following the OR-node logic in Eq. (1), the latent pattern also selects the best child unit as the true parsing configuration. The selected child propagates both its score and parsed region to its parent latent pattern. The image region for each CNN unit is fixed, i.e. we simply propagate the receptive field of the unit to the image plane and obtain its image region. Therefore, we can re-write the above And-Or logics as a DPM-like (deformable part model) model (Eq. (2)).


In this DPM-like model, the part template score comprises latent pattern scores, and each latent pattern selects its best CNN unit as the parsing configuration (i.e. computing the image region corresponding to the CNN unit). The unary term measures the local inference quality, and the pairwise term encodes the spatial relationship between the part template and the latent pattern. [Footnote 4: The unary term measures both the neural response of the CNN unit and the score for local deformation w.r.t. the latent pattern’s ideal position. The pairwise term depends on the displacement between the positions of the part template and the latent pattern, and its parameters are computed based on part annotations.]
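The bottom-up parsing described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names, the quadratic deformation penalty, its weight `lam`, and the `(channel, center)` encoding of a latent pattern are all hypothetical, and the pairwise term between part template and latent pattern is omitted for brevity.

```python
import numpy as np


def parse_latent_pattern(feature_map, center, half_range, lam=0.1):
    """OR node: pick the best CNN unit inside a square deformation range.

    A unit's score is its activation minus a quadratic penalty on the
    displacement from the pattern's ideal position `center`. The penalty
    form and `lam` are assumptions of this sketch.
    """
    h, w = feature_map.shape
    r0, r1 = max(0, center[0] - half_range), min(h, center[0] + half_range + 1)
    c0, c1 = max(0, center[1] - half_range), min(w, center[1] + half_range + 1)
    rows, cols = np.mgrid[r0:r1, c0:c1]
    penalty = lam * ((rows - center[0]) ** 2 + (cols - center[1]) ** 2)
    scores = feature_map[r0:r1, c0:c1] - penalty
    idx = np.unravel_index(np.argmax(scores), scores.shape)
    return float(scores[idx]), (int(rows[idx]), int(cols[idx]))


def parse_part(templates, feature_maps, half_range=1):
    """Root OR node over part templates; each template is an AND (sum).

    templates: list of templates; each template is a list of
        (channel, center) pairs, one per latent pattern.
    feature_maps: array of shape (channels, height, width).
    Returns (best score, index of the selected part template).
    """
    template_scores = []
    for patterns in templates:
        total = 0.0
        for channel, center in patterns:
            score, _ = parse_latent_pattern(feature_maps[channel], center, half_range)
            total += score
        template_scores.append(total)
    best = int(np.argmax(template_scores))
    return template_scores[best], best
```

The structure mirrors the And-Or logic: a maximum inside each latent pattern, a sum inside each part template, and a maximum at the root.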

Figure 3: Human interactions. Users remove latent-pattern nodes, which are not related to the target part, from the AOG.

3.2 Latent patterns

In order to demonstrate the generality of the proposed method, we use two different types of latent patterns mined from CNNs to construct different AOGs. Our method allows users to interactively select different latent patterns for learning. The first type of pattern was proposed in [19] as local middle-level features of objects, namely middle-level patterns. The method of [18] learns an explanatory graph to explain the knowledge hierarchy inside a pre-trained CNN. We use nodes in the explanatory graph as the second type of pattern, namely explanatory patterns. In experiments, we compared part-localization performance of AOGs based on different patterns.

The experimental settings for the mining of the above two patterns are the same, which can be summarized as follows. The input is a set of cropped object images of a category, where only a few objects have ground-truth annotations of the target part. Each part annotation on an image includes both the ground-truth part template and the true bounding box of the part. The CNN is pre-trained to classify object images of the category from random images. Given the part annotations, [19, 18] mine middle-level patterns and explanatory patterns from conv-layers in the CNN to represent neural activations that are highly related to the part annotations. We use the two patterns to build two types of AOGs.

3.3 AOG visualization

We learn up-conv-nets [5] to roughly visualize the content within AOG nodes. [Footnote 5: Up-conv-nets [5] cannot ensure a “strict” correspondence between each CNN unit and its visualized appearance.] Given a feature map of a CNN’s conv-layer as input, the up-conv-net was originally proposed to invert the CNN feature map and output the image corresponding to the feature map, namely, image reconstruction. The AOG encodes several part templates of a semantic part, and each part template consists of latent patterns in different conv-layers. Thus, in each step, we visualize and refine latent patterns of a certain part template. We extend the up-conv-net [5] to visualize latent patterns within the sub-AOG under a given part template.

Given a training object sample whose part is explained by a part template in the AOG, we use the template’s sub-graph to localize the target part based on Eq. (2). During the part-localization process, each latent pattern is assigned a certain CNN unit. Thus, we can use activation responses of the corresponding CNN units to reconstruct the patterns’ appearance. We implement the image reconstruction by simply filtering out all unrelated activation responses from the CNN feature map (i.e. setting activations to zero) and using the up-conv-net to invert the modified feature map.

Note that latent patterns of each part template are extracted from different conv-layers of the CNN, and they potentially represent local compositional shapes at different scales. As shown in Fig. 3, latent patterns in lower conv-layers usually represent small-scale details (e.g. edges and corners), but these patterns are usually not discriminative enough and more likely to be activated by background noises. In contrast, latent patterns in higher conv-layers mainly correspond to large-scale appearances/contexts without encoding much object details, but they usually consistently represent certain parts among different objects.

As a result, we train different up-conv-nets using feature maps of different conv-layers for image reconstruction, in order to avoid detailed object appearance being mixed with large-scale object patterns during image reconstruction. In this way, we apply different human-interaction strategies for latent patterns in different conv-layers (which will be introduced later).

Training and using up-conv-nets: Given a CNN that is finetuned for a category, we train an up-conv-net for each conv-layer of the CNN. Training samples are the cropped object images of the category. Given the object images, we use the CNN to extract feature maps of the target conv-layer as the input of the up-conv-net. The output of the up-conv-net is a pixel-level image-reconstruction result. The up-conv-net is trained based on the loss of the L-2 distance between the original image and the reconstructed image.
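As a minimal illustration of the reconstruction loss, the function below computes a squared L-2 distance between two images stored as arrays. Whether the original loss is squared, averaged over pixels, or otherwise normalized is not stated here, so the exact form (and the name `l2_reconstruction_loss`) is an assumption of this sketch.

```python
import numpy as np


def l2_reconstruction_loss(original, reconstructed):
    """Squared L-2 distance between an image and its reconstruction."""
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    if original.shape != reconstructed.shape:
        raise ValueError("images must have the same shape")
    return float(np.sum((original - reconstructed) ** 2))
```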


When we apply the up-conv-net to AOG visualization, we use the AOG to localize the target part on the given object. Let the part-parsing process localize a latent pattern at a certain position in a certain conv-slice of the target conv-layer. We consider that neural responses in a neighborhood of this position in this conv-slice are related to the latent pattern. [Footnote 6: For conv-layers of VGG-16, this neighborhood is selected according to the feature map size.] We set all neural responses unrelated to any latent patterns to zero in the feature map. Then, we use the up-conv-net to invert the modified feature map for image reconstruction.
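The masking step before inversion can be sketched as follows. The function name `mask_feature_map`, the `(channel, row, col)` encoding of a localized pattern, and the window half-size `radius` are hypothetical choices made for illustration; the actual neighborhood size used for each conv-layer is not reproduced here.

```python
import numpy as np


def mask_feature_map(feature_map, localized, radius=1):
    """Zero every activation that is not near a localized latent pattern.

    feature_map: array of shape (channels, height, width).
    localized: iterable of (channel, row, col) positions chosen by part parsing.
    radius: half-size of the window kept around each position (assumed value).
    """
    keep = np.zeros(feature_map.shape, dtype=bool)
    _, h, w = feature_map.shape
    for ch, r, c in localized:
        keep[ch,
             max(0, r - radius):min(h, r + radius + 1),
             max(0, c - radius):min(w, c + radius + 1)] = True
    return np.where(keep, feature_map, 0.0)
```

The masked array would then be fed to the up-conv-net in place of the full feature map.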

3.4 Human interactions

Based on visualization results of latent patterns, we further use human interactions to remove latent patterns that are not closely related to the target part. The flowchart of human interactions is designed as follows. Given a part template, our method sequentially produces different image-reconstruction results based on the template’s latent patterns extracted from different conv-layers. Then, given the reconstructed image w.r.t. each conv-layer, we require people to annotate image regions that do not contribute to the localization of the target part. A latent pattern under the part template is considered not related to the target part and removed from the AOG, if 1) the pattern’s parsed image region is localized within the union of the annotated image regions, and 2) it satisfies a gradient-based criterion defined over the annotated regions and the whole image area.

The criterion uses the gradient of the pattern’s score w.r.t. the value of each pixel in the image (see [13] for details of the gradient computation). Pixels with high gradients usually have high correspondence to the pattern. Compared to this gradient-based criterion, the up-conv-net can only show rough image regions of latent patterns.
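A rough sketch of the two-condition removal rule is given below. Because the exact form of the gradient criterion is not recoverable from the text, the version here is an assumption: it removes a pattern when its parsed region lies inside the annotated regions and when more than a fraction `tau` of the absolute gradient mass falls inside them. The function name, the threshold `tau`, and the box encoding are all hypothetical.

```python
import numpy as np


def should_remove(grad, annotated, region, tau=0.5):
    """Decide whether a latent pattern is unrelated to the target part.

    grad: (H, W) gradient of the pattern score w.r.t. each pixel.
    annotated: (H, W) boolean mask of user-annotated irrelevant regions.
    region: (r0, c0, r1, c1) image region parsed for the pattern,
        with exclusive end indices.
    tau: assumed threshold on the share of gradient mass inside `annotated`.
    """
    r0, c0, r1, c1 = region
    # Condition 1: the parsed region lies entirely in the annotated area.
    inside = bool(annotated[r0:r1, c0:c1].all())
    # Condition 2 (assumed form): most gradient mass is in the annotated area.
    mass = np.abs(grad)
    total = mass.sum()
    share = mass[annotated].sum() / total if total > 0 else 0.0
    return inside and bool(share > tau)
```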

Rules for human interactions: We believe that the localization of a certain semantic part mainly relies on two types of information: 1) the contextual information (e.g. the global pose of the entire object) for rough part localization from a global view, and 2) detailed part appearance for accurate localization. In general, patterns in low conv-layers describe object details, and those in high conv-layers represent contextual information. Therefore, as shown in Fig. 3, for latent patterns in high conv-layers, we mainly remove those localized on the background outside the “object.” For latent patterns in low conv-layers, we usually remove those outside “part bounding boxes.”

4 Experiments

4.1 Implementation details

We learned the AOG based on a 16-layer VGG network (VGG-16) [15], which was pre-trained as follows. The VGG-16 was first pre-trained using the 1.3M images in the ImageNet ILSVRC 2012 dataset [4] with a loss for 1000-category classification. Then, we further finetuned the VGG-16 using cropped object images in a category to classify target objects from background images. Just as in [19], we selected the last nine conv-layers in the VGG-16 as valid layers, and extracted latent patterns from these conv-layers to build the AOG.

Note that the feature maps of Conv-layers 5–7, Conv-layers 8–10, and Conv-layers 11–13 form three groups, and the three layers within each group share the same feature map size. To simplify the system, we merged the conv-slices in Conv-layers 5–7 that contained latent patterns into a new feature map, whose number of channels equals the number of valid conv-slices in Conv-layers 5–7. We learned an up-conv-net, namely Net-1, to visualize latent patterns in the new feature map. Similarly, we learned Net-2 and Net-3 for Conv-layers 8–10 and Conv-layers 11–13, respectively.
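Merging the valid conv-slices of several same-sized layers amounts to channel selection followed by concatenation. The helper below is a hypothetical sketch (the name `merge_valid_slices` and the list-of-index-lists input are assumptions), shown on small toy arrays rather than real VGG-16 feature maps.

```python
import numpy as np


def merge_valid_slices(layer_maps, valid_channels):
    """Stack the conv-slices that contain latent patterns into one map.

    layer_maps: list of arrays, each of shape (channels, height, width),
        all sharing the same height and width.
    valid_channels: list of channel-index lists, one per layer.
    Returns an array of shape (total valid slices, height, width).
    """
    picked = [fm[idx] for fm, idx in zip(layer_maps, valid_channels) if len(idx) > 0]
    if not picked:
        raise ValueError("no valid conv-slices to merge")
    return np.concatenate(picked, axis=0)
```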

4.2 Datasets

In order to make a comprehensive evaluation of the proposed method, we chose three benchmark datasets for testing, i.e. the Pascal VOC Part dataset [3], the CUB200-2011 dataset [16], and the ILSVRC 2013 DET Animal-Part dataset [19]. Note that as in many previous studies [3, 19], we chose animal categories in these three datasets to evaluate part-localization performance, because animals usually contain multiple non-rigid parts, which presents a key challenge for part localization.

Figure 4: Visualization of patterns for the head part before and after human interactions.

Method obj.-box finetune gold. bird frog turt. liza. koala lobs. dog fox cat lion tiger bear rabb. hams. squi.
SS-DPM-Part [2] N 0.297 0.280 0.257 0.255 0.317 0.222 0.207 0.239 0.305 0.308 0.238 0.144 0.260 0.272 0.178 0.261
PL-DPM-Part [9] N 0.273 0.256 0.271 0.321 0.327 0.242 0.194 0.238 0.619 0.215 0.239 0.136 0.323 0.228 0.186 0.281
Part-Graph [3] N 0.363 0.316 0.241 0.322 0.419 0.205 0.218 0.218 0.343 0.242 0.162 0.127 0.224 0.188 0.131 0.208
fc7+linearSVM Y 0.150 0.318 0.186 0.150 0.257 0.156 0.196 0.136 0.101 0.138 0.132 0.163 0.122 0.139 0.110 0.262
fc7+RBF-SVM Y 0.243 0.369 0.232 0.157 0.243 0.146 0.237 0.154 0.122 0.134 0.115 0.141 0.154 0.124 0.135 0.289
fc7+NN Y 0.298 0.387 0.307 0.259 0.300 0.169 0.287 0.242 0.170 0.159 0.112 0.135 0.263 0.219 0.152 0.346
fc7+sp+linearSVM Y 0.150 0.318 0.186 0.150 0.254 0.156 0.196 0.136 0.101 0.138 0.132 0.163 0.122 0.139 0.110 0.262
fc7+sp+RBF-SVM Y 0.243 0.371 0.235 0.156 0.252 0.145 0.237 0.154 0.122 0.134 0.115 0.140 0.156 0.121 0.136 0.302
fc7+sp+NN Y 0.298 0.387 0.307 0.259 0.300 0.169 0.287 0.242 0.170 0.158 0.112 0.135 0.263 0.219 0.152 0.345
CNN-PDD [13] N 0.316 0.289 0.229 0.260 0.335 0.163 0.190 0.220 0.212 0.196 0.174 0.160 0.223 0.266 0.156 0.291
CNN-PDD-ft [13] Y 0.302 0.236 0.261 0.231 0.350 0.168 0.170 0.177 0.264 0.270 0.206 0.256 0.178 0.167 0.286 0.237
Fast-RCNN (1 ft) [6] N 0.313 0.370 0.250 0.318 0.375 0.343 0.291 0.365 0.287 0.321 0.291 0.305 0.349 0.261 0.290 0.165
Fast-RCNN (2 fts) [6] Y 0.355 0.382 0.421 0.298 0.413 0.168 0.328 0.383 0.298 0.219 0.196 0.217 0.245 0.265 0.264 0.220
Mining-raw [19] Y 0.102 0.149 0.116 0.215 0.137 0.094 0.162 0.146 0.081 0.154 0.079 0.088 0.120 0.092 0.094 0.105
Ours Y 0.074 0.108 0.095 0.182 0.129 0.079 0.147 0.124 0.058 0.132 0.071 0.083 0.104 0.078 0.077 0.072
horse zebra swine hippo catt. sheep ante. camel otter arma. monk. elep. red pa. Avg.
SS-DPM-Part [2] N 0.246 0.206 0.240 0.234 0.246 0.205 0.224 0.277 0.253 0.283 0.206 0.219 0.256 0.129 0.242
PL-DPM-Part [9] N 0.322 0.267 0.297 0.273 0.271 0.413 0.337 0.261 0.286 0.295 0.187 0.264 0.204 0.505 0.284
Part-Graph [3] N 0.296 0.315 0.306 0.378 0.333 0.230 0.216 0.317 0.227 0.341 0.159 0.294 0.276 0.094 0.257
fc7+linearSVM Y 0.205 0.258 0.201 0.140 0.256 0.236 0.164 0.190 0.140 0.252 0.256 0.176 0.215 0.116 0.184
fc7+RBF-SVM Y 0.234 0.221 0.237 0.168 0.300 0.253 0.171 0.212 0.146 0.238 0.248 0.225 0.185 0.104 0.198
fc7+NN Y 0.293 0.235 0.297 0.335 0.330 0.262 0.305 0.263 0.125 0.262 0.304 0.277 0.214 0.102 0.247
fc7+sp+linearSVM Y 0.205 0.258 0.201 0.140 0.256 0.236 0.164 0.190 0.140 0.250 0.256 0.176 0.215 0.116 0.184
fc7+sp+RBF-SVM Y 0.234 0.221 0.237 0.165 0.290 0.246 0.181 0.211 0.146 0.238 0.250 0.224 0.184 0.103 0.198
fc7+sp+NN Y 0.293 0.235 0.299 0.335 0.330 0.262 0.305 0.262 0.125 0.260 0.304 0.276 0.214 0.102 0.247
CNN-PDD [13] N 0.261 0.266 0.189 0.192 0.201 0.244 0.208 0.193 0.174 0.299 0.236 0.214 0.222 0.179 0.225
CNN-PDD-ft [13] Y 0.310 0.321 0.216 0.257 0.220 0.179 0.229 0.253 0.198 0.308 0.273 0.189 0.208 0.275 0.240
Fast-RCNN (1 ft) [6] N 0.372 0.360 0.302 0.289 0.342 0.266 0.220 0.349 0.328 0.334 0.351 0.261 0.337 0.328 0.311
Fast-RCNN (2 fts) [6] Y 0.355 0.324 0.275 0.266 0.292 0.235 0.212 0.271 0.329 0.343 0.259 0.163 0.285 0.246 0.284
Mining-raw [19] Y 0.182 0.160 0.211 0.169 0.186 0.117 0.102 0.145 0.117 0.241 0.113 0.141 0.135 0.081 0.135
Ours Y 0.158 0.126 0.196 0.161 0.155 0.113 0.092 0.120 0.097 0.216 0.081 0.138 0.109 0.074 0.115
Table 1: Normalized distance of part localization on the ILSVRC 2013 DET Animal-Part dataset. The second column indicates whether the baseline used all object annotations in the category to pre-finetune a CNN before learning the part (in fact, object-box annotations are far more numerous than part annotations).

4.3 Baselines

In experiments, we used our interactive learning method to refine the AOG learned by [19], namely the Ours method. Then, we followed the algorithm of [19] to further build a new AOG, which used explanatory patterns in [18] as latent patterns. We conducted the interactive learning on the new AOG, and named it the Ours (explanatory patterns) method.

We compared our methods with a total of fourteen methods in part localization. The baselines included 1) state-of-the-art algorithms for object detection (i.e. directly detecting target parts from objects), 2) conventional graphical/part models for part localization (modeling both the part appearance and the relationships with parts and objects), and 3) the methods selecting CNN patterns to describe object parts.

The first baseline was the standard method of using AOGs for part localization without applying human interactions [19], namely Mining-raw. The comparison with Mining-raw demonstrated the effectiveness of human interactions in learning.

Then, two baselines were implemented based on the Fast-RCNN [6]. We finetuned the Fast-RCNN with a loss for detecting a single class/part from background, rather than for multi-class/part detection, to ensure a fair comparison. The first baseline was the standard Fast-RCNN, namely Fast-RCNN (1 ft), which directly finetuned the VGG-16 network to detect parts on objects. We annotated object parts on well cropped object images as training samples (objects had been cropped using their bounding boxes). Then, for the second baseline, namely Fast-RCNN (2 fts), we slightly modified the original algorithm of Fast-RCNN. Given the large number of object-box annotations in the target category, the Fast-RCNN (2 fts) first finetuned the VGG-16 network with the loss of “detecting objects from entire images.” Then, given a small number of part annotations, Fast-RCNN (2 fts) further finetuned the VGG-16 to detect parts from objects. This two-step finetuning made full use of the massive object-level annotations, which enabled a fair comparison.

Method obj.-box finetune Normalized distance
SS-DPM-Part [2] N 0.3469
PL-DPM-Part [9] N 0.3412
Part-Graph [3] N 0.4889
fc7+linearSVM Y 0.3120
fc7+RBF-SVM Y 0.3666
fc7+NN Y 0.4194
fc7+sp+linearSVM Y 0.3120
fc7+sp+RBF-SVM Y 0.3700
fc7+sp+NN Y 0.4195
CNN-PDD [13] N 0.2333
CNN-PDD-ft [13] Y 0.3269
Fast-RCNN (1 ft) [6] N 0.4517
Fast-RCNN (2 fts) [6] Y 0.4131
Mining-raw [19] Y 0.0915
Ours Y 0.0878
Ours (explanatory patterns) Y 0.0860
Table 2: Normalized distance of part localization on the CUB200-2011 dataset. The second column indicates whether the baseline used all object annotations in the category to pre-finetune a CNN before learning the part.
Method bird cat cow dog horse sheep Avg
SS-DPM-Part 0.356 0.270 0.264 0.242 0.262 0.286 0.280
PL-DPM-Part 0.294 0.328 0.282 0.312 0.321 0.840 0.396
Part-Graph 0.360 0.208 0.263 0.205 0.386 0.500 0.320
fc7+linearSVM 0.247 0.174 0.251 0.217 0.261 0.317 0.244
fc7+RBF-SVM 0.276 0.167 0.265 0.244 0.263 0.313 0.255
fc7+NN 0.344 0.188 0.316 0.288 0.351 0.313 0.300
fc7+sp+linearSVM 0.247 0.174 0.249 0.217 0.261 0.317 0.244
fc7+sp+RBF-SVM 0.276 0.167 0.273 0.242 0.264 0.309 0.255
fc7+sp+NN 0.344 0.188 0.316 0.288 0.351 0.313 0.300
CNN-PDD 0.301 0.246 0.220 0.248 0.292 0.254 0.260
CNN-PDD-ft 0.358 0.268 0.220 0.200 0.302 0.269 0.269
Fast-RCNN (1 ft) 0.324 0.324 0.325 0.272 0.347 0.314 0.318
Fast-RCNN (2 fts) 0.350 0.295 0.255 0.293 0.367 0.260 0.303
Mining-raw 0.187 0.132 0.212 0.175 0.168 0.186 0.177
Ours (explanatory patterns) 0.162 0.128 0.258 0.137 0.179 0.187 0.175
Ours 0.129 0.108 0.184 0.140 0.160 0.156 0.146
Table 3: Normalized distance of part localization on the Pascal VOC Part dataset.

Another typical baseline was CNN-PDD proposed in [13]. CNN-PDD selected certain conv-slices (channels) of a CNN to represent the target part for part localization. In CNN-PDD, the CNN was pre-trained using 1.3M images in the ImageNet ILSVRC 2012 dataset. Just like Fast-RCNN (2 fts), we also extended the method of [13] as a new baseline CNN-PDD-ft, in order to make full use of object-level annotations. CNN-PDD-ft pre-finetuned the VGG-16 network using object-box annotations of the target category before the selection of CNN channels.

Figure 5: Localization of the head part based on AOGs using the ILSVRC 2013 DET Animal-Part dataset [19]. Users are given only three object images for interactive learning.

We also compared our method with two DPM-related methods, i.e. the strongly supervised DPM (SS-DPM-Part) [2] and the technique in [9] (PL-DPM-Part). These two methods trained DPMs with part annotations for part localization. The next baseline, namely Part-Graph, used a graphical model for part localization [3].

In the scope of weakly-supervised learning, “simple” methods are usually insensitive to the over-fitting problem. Therefore, we selected six baselines, which extracted fc7 features for each image patch and used an SVM for part detection. The first three baselines were originally proposed in [19], namely fc7+linearSVM, fc7+RBF-SVM, and fc7+NN, which used a linear SVM, an RBF-SVM, and the nearest-neighbor strategy, respectively, for part detection. The other three baselines combined both the fc7 feature and the spatial position of each image patch as the final feature. We used fc7+sp+linearSVM, fc7+sp+RBF-SVM, and fc7+sp+NN to denote the last three baselines.

All the baselines were provided with object bounding boxes and used the same set of part annotations for training to ensure a fair comparison.

Method | bird beak | bird neck | bird wing | cat eye
Mining-raw [19] | 0.1357 | 0.1822 | 0.1622 | 0.1586
Ours | 0.1225 | 0.1570 | 0.1580 | 0.1331

Method | cow ear | dog nose | dog eye | horse ear
Mining-raw [19] | 0.1819 | 0.1902 | 0.1273 | 0.2123
Ours | 0.1725 | 0.1789 | 0.1032 | 0.1739

Method | horse eye | sheep neck | sheep muzzle | sheep eye
Mining-raw [19] | 0.2229 | 0.1447 | 0.2203 | 0.1624
Ours | 0.2046 | 0.1359 | 0.1786 | 0.1293
Table 4: Normalized distance for localization of different object parts on the Pascal VOC Part dataset. Localization results for the head are shown in Table 3.
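The per-part numbers in Table 4 can be summarized with a small script. The sketch below copies the values from the table, checks whether the normalized distance of Ours is lower than that of Mining-raw for every part, and computes the mean over the twelve parts. The variable names are illustrative, and the unweighted mean is an assumption of this sketch rather than a statistic reported in the paper.

```python
# Normalized distances copied from Table 4 (lower is better).
parts = [
    "bird beak", "bird neck", "bird wing", "cat eye",
    "cow ear", "dog nose", "dog eye", "horse ear",
    "horse eye", "sheep neck", "sheep muzzle", "sheep eye",
]
mining_raw = [
    0.1357, 0.1822, 0.1622, 0.1586,
    0.1819, 0.1902, 0.1273, 0.2123,
    0.2229, 0.1447, 0.2203, 0.1624,
]
ours = [
    0.1225, 0.1570, 0.1580, 0.1331,
    0.1725, 0.1789, 0.1032, 0.1739,
    0.2046, 0.1359, 0.1786, 0.1293,
]

# Parts on which Ours has a lower normalized distance than Mining-raw.
improved = [p for p, m, o in zip(parts, mining_raw, ours) if o < m]

# Unweighted means over the twelve parts (an illustrative summary only).
mean_mining_raw = sum(mining_raw) / len(mining_raw)
mean_ours = sum(ours) / len(ours)

for p, m, o in zip(parts, mining_raw, ours):
    print(f"{p:13s} Mining-raw={m:.4f} Ours={o:.4f} reduction={m - o:.4f}")
```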

4.4 Evaluation metric

As noted in [3], factors of object detection should be removed from the evaluation of part localization. Therefore, as in [19], original images were cropped using object bounding boxes for testing to make a fair comparison. All the baselines used the cropped images for part localization/detection. In addition, some baselines for part detection may predict more than one location for a part. As in [13, 3], we took the detected part with the highest confidence in each image as the localization result. We evaluated part-localization performance using the metric of normalized distance, which has been widely used [13, 19]. Given an object, the normalized distance is the Euclidean distance between the predicted part center and the ground-truth part center, divided by the diagonal length of the object bounding box.
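The metric described above can be sketched in a few lines. This is a minimal illustration rather than the evaluation code used in the experiments: the function names, the (x_min, y_min, x_max, y_max) box layout, and the (confidence, center) detection format are assumptions introduced here.

```python
import math


def top_detection(detections):
    """Return the center of the detection with the highest confidence.

    detections: list of (confidence, (x, y)) pairs for one image.
    """
    return max(detections, key=lambda d: d[0])[1]


def normalized_distance(pred_center, gt_center, object_box):
    """Euclidean distance between part centers divided by the object-box diagonal.

    object_box: (x_min, y_min, x_max, y_max), an assumed layout.
    """
    x_min, y_min, x_max, y_max = object_box
    diagonal = math.hypot(x_max - x_min, y_max - y_min)
    if diagonal == 0:
        raise ValueError("object bounding box has zero size")
    return math.dist(pred_center, gt_center) / diagonal


# Toy example with hypothetical values: a 30 x 40 object box has a diagonal of 50.
detections = [(0.4, (25.0, 30.0)), (0.9, (10.0, 10.0))]
pred = top_detection(detections)
score = normalized_distance(pred, (13.0, 14.0), (0.0, 0.0, 30.0, 40.0))
```

Dividing by the box diagonal makes the score comparable across objects of different sizes, so lower values indicate better localization regardless of image scale.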

4.5 Experimental results and analysis

In experiments, we learned AOGs for the head, neck, nose, muzzle, beak, eye, ear, and wing parts of the six animal categories in the Pascal VOC Part dataset. For the ILSVRC 2013 DET Animal-Part dataset and the CUB200-2011 dataset, we learned an AOG for the head part of each category (for birds in the CUB200-2011 dataset, this is the “forehead” part). Because the head is shared by all categories in the two datasets, we selected the head as the target part to enable a fair comparison.

For each of the 37 object categories in the above three benchmark datasets, the experimental setting was the same as in [19]. Each part concept of the category was defined to have three different part templates, and a single part box was annotated for each part template. That is, we used a total of three annotations to build the AOG for the part. All the baselines learned models using the same three part annotations. On average, labeling a bounding box took 3.4 seconds (i.e. 6.8 seconds for labeling both the object and the part), and the average time of human interactions per image was 12.3 seconds.

Fig. 4 shows patterns before and after human interactions. Fig. 5 shows part-localization results based on AOGs. Tables 1, 2 and 3 compare part-localization performance of different methods on the ILSVRC 2013 DET Animal-Part dataset, the CUB200-2011 dataset, and the Pascal VOC Part dataset, respectively. Table 4 lists localization results of various object parts in the Pascal VOC Part dataset before and after human interactions. Our method exhibited superior performance to the baselines.

5 Conclusions and discussion

In this paper, we attempted to use human interactions to manually correct the representation of a semantic part, in the scenario of weakly-supervised learning. We proposed to build a model for a semantic part by selecting related latent patterns from pre-trained CNNs and removing irrelevant patterns based on human subjective perception. We successfully mined and transferred latent patterns from a pre-trained CNN to a human-interpretable AOG model, which eased the visualization of latent patterns and allowed people to directly manipulate these patterns in the AOG. This learning process is analogous to building with LEGO blocks.

With the help of human interactions, our method ensured that the model used semantically correct patterns to represent an object part. This is different from conventional batch learning methods that usually let the computer “guess” the target knowledge representation from training samples. Our method exhibited superior performance in experiments, which demonstrated the effectiveness of incorporating human interactions in weakly-supervised learning.


References

  • [1] M. Aubry and B. C. Russell. Understanding deep features with computer-generated imagery. In ICCV, 2015.
  • [2] H. Azizpour and I. Laptev. Object detection using strongly-supervised deformable part models. In ECCV, 2012.
  • [3] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, 2014.
  • [4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • [5] A. Dosovitskiy and T. Brox. Inverting visual representations with convolutional networks. In CVPR, 2016.
  • [6] R. Girshick. Fast r-cnn. In ICCV, 2015.
  • [7] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • [8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 1998.
  • [9] B. Li, W. Hu, T. Wu, and S.-C. Zhu. Modeling occlusion by discriminative and-or structures. In ICCV, 2013.
  • [10] A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In CVPR, 2015.
  • [11] Z. Si and S.-C. Zhu. Learning and-or templates for object recognition and detection. In PAMI, 2013.
  • [12] M. Simon and E. Rodner. Neural activation constellations: Unsupervised part model discovery with convolutional networks. In ICCV, 2015.
  • [13] M. Simon, E. Rodner, and J. Denzler. Part detector discovery in deep convolutional neural networks. In ACCV, 2014.
  • [14] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In arXiv:1312.6034v2, 2013.
  • [15] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [16] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
  • [17] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
  • [18] Q. Zhang, R. Cao, Y. Wu, and S.-C. Zhu. Interpreting cnn knowledge via an explanatory graph. In arXiv:1708.01785, 2017.
  • [19] Q. Zhang, R. Cao, Y. N. Wu, and S.-C. Zhu. Growing interpretable part graphs on convnets via multi-shot learning. In AAAI, 2017.
  • [20] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene cnns. In ICLR, 2015.
  • [21] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, 2016.
  • [22] L. Zhu, Y. Chen, Y. Lu, C. Lin, and A. Yuille. Max-margin and/or graph learning for parsing the human body. In CVPR, 2008.


Figure 6: Localization results of the head part on animal categories in the Pascal VOC Part dataset [3]. Users are given only three object images for interactive learning.
Figure 7: Localization results of the neck, ear, eye, nose, muzzle, beak, and wing parts on animal categories in the Pascal VOC Part dataset [3]. Users are given only three object images for interactive learning.
Figure 8: Localization of the head part on the ILSVRC 2013 DET Animal-Part dataset [19]. Users are given only three object images for interactive learning.
Figure 9: Visualization of patterns for the head part before and after human interactions.
Figure 10: Visualization of patterns for the head part before and after human interactions.