Recently, Deep Convolutional Neural Networks (DCNNs) have attracted much research attention in visual recognition, largely due to their excellent performance. It has been discovered that the activations of a DCNN trained on a large dataset, such as ImageNet, can be employed as a universal image descriptor, and that applying this descriptor to many visual classification and retrieval problems delivers impressive performance [3, 4, 5]. This discovery quickly sparked significant interest and inspired many extensions, including [6, 7]. A fundamental issue with these kinds of methods is how to generate an image representation from a pre-trained DCNN. Most current solutions take the activations of a single DCNN layer, usually the fully-connected layer, as the image representation.
In this paper, we show that we can build a powerful image representation using the activations from two consecutive convolutional layers. We name our method cross-convolutional layer pooling (or cross-layer pooling for short). This new method relies on two crucial components: (1) we extract local features from one convolutional layer; (2) we pool the extracted local features by using activations from its successive convolutional layer as guidance.
The first component is motivated by recent work [6, 7, 8] which has shown that DCNN activations are not translation invariant and that it is beneficial to extract fully connected layer activations from a DCNN to describe local regions and create the image representation by pooling multiple regional DCNN activations. In this paper, we view those regional CNN activations as a newly added convolutional layer (named as the augmented convolutional layer as discussed in section 3.1). Inspired by this view, we also extract local features from the original convolutional layers of the pre-trained CNN.
The second component is motivated by the parts-based pooling method  which was originally proposed for fine-grained image classification. This method creates one pooling channel for each detected part region while the final image representation is obtained by concatenating pooling results from multiple channels. We generalize this idea to DCNNs and avoid the need for annotating predefined parts. More specifically, we deem the feature map of each filter in a convolutional layer as the detection score map of a part detector and apply the feature map to weight regional descriptors extracted from the previous convolutional layer in the pooling process. The final image representation is obtained by concatenating pooling results from multiple channels with each channel corresponding to one feature map. Note that in contrast to existing regional-DCNN based methods [6, 7], the proposed method does not require additional dictionary learning and encoding steps at either the training or testing stage once the convolutional layer activations become available. To further reduce the memory use in storing image representations, we also experiment with a coarse ‘feature sign quantization’ compression scheme and show that the discriminative power of the proposed representation can be largely maintained after compression.
Besides image classification, we explore the use of cross-layer pooling for image retrieval. To overcome the high computational cost of the direct implementation of cross-layer pooling, we propose to employ feature binarization and adaptive pooling channel selection schemes to reduce the computational cost.
We conduct extensive experiments on three popular visual classification datasets, and three popular image retrieval datasets. Experimental results suggest that the proposed method achieves significantly better performance than competitive methods in most cases. Further ablation studies provide insight into the importance of various components of our approach.
A preliminary version of this paper has been published in . In this paper, we have made a significant extension. The major differences are threefold.
We view the scheme of extracting fully connected CNN activations at densely sampled regions as a newly added convolutional layer and perform cross-layer pooling at that level. This extension makes our method more widely applicable.
We apply cross-layer pooling to image retrieval tasks and propose new schemes to reduce the computational cost.
We have conducted more experiments to validate the proposed method, including experiments with a better CNN model and new ablation studies.
Both networks used in this paper, the Alex net and the VGG very-deep net, are composed of a cascade of convolutional layers and fully connected layers. At each convolutional layer, multiple filters are applied, resulting in multiple feature maps, one for each filter. In this paper, we use the term ‘feature map’ to indicate the convolutional result (after applying the ReLU) of one filter and the term ‘convolutional layer activations’ to indicate the feature maps of all filters in a convolutional layer.
2 Related Work
In the literature, there are two primary methods for using a pre-trained DCNN to create an image representation: (1) directly feeding the whole image into a pre-trained DCNN and extracting its activations; (2) applying the pre-trained DCNN to subregions of the input image and aggregating activations from multiple regions as the image representation. Usually, the first method extracts the activations of the last few fully-connected layers as the image-level representation. Fine-tuning is sometimes applied to make the network better adapted to a given task. Also, to make this kind of method more robust to image transforms, averaging activations from several jittered versions of the original image, e.g., several slightly shifted versions of the input image, has been employed to obtain better classification performance.
DCNNs can also be applied to extract local features. It has been suggested that DCNN activations are not invariant to large translations and that performance degrades if input images are not well aligned. To handle this issue, it has been suggested to sample multiple regions from an input image and use one DCNN, called a regional-DCNN in this scenario, to describe each region. The final image representation is aggregated from the activations of those regional-DCNNs. In some of these methods, another layer of unsupervised encoding is employed to create the image-level representation [6, 7]. In others, discriminative patterns are mined from those regional activations for classification. It has been shown that for many visual tasks [6, 7] this approach leads to better performance than directly extracting DCNN activations as global features.
One common factor in the above methods is that they all use fully-connected layer activations as features. Some recent studies explore the use of convolutional layers to extract image representations. For example, one work applies Fisher vector pooling to the local features extracted from a convolutional layer to create image representations for texture classification. Another uses pooled convolutional activations for object detection, and a third demonstrates that pooled convolutional features are well suited to the image retrieval task. The work that is probably most relevant to ours is an approach for fine-grained image classification which, as its authors note, extends the method proposed in the preliminary version of this paper. It uses a strategy similar to ours to combine the convolutional feature activations from two DCNNs and jointly fine-tunes all of the parameters in an end-to-end fashion.
3 Proposed Method
3.1 Convolutional layers, fully connected layers and notations
The convolutional layer in a CNN is embedded with rich spatial information. Its activations can be formulated as a tensor of size $H \times W \times D$, where $H$ and $W$ denote the height and width of each feature map and $D$ denotes the number of feature maps. These activations can alternatively be viewed as an array of $D$-dimensional local features extracted at $H \times W$ locations. In this paper, we denote each of the $H \times W$ locations as a spatial unit, and the $D$-dimensional vector of feature map values at a location as the feature vector for that spatial unit.
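The two views of a convolutional layer described above can be illustrated with a minimal sketch. It uses plain Python lists so that it runs without any array library; the helper name `to_local_features` and the toy sizes are assumptions made for illustration and are not part of the original method.

```python
def to_local_features(activations):
    """Flatten an H x W x D activation tensor (nested lists) into H*W local
    feature vectors of length D, one per spatial unit, in row-major order."""
    return [list(unit) for row in activations for unit in row]


# Toy tensor with H=2, W=3, D=4; entry d of spatial unit (h, w) is 100*h + 10*w + d.
H, W, D = 2, 3, 4
acts = [[[100 * h + 10 * w + d for d in range(D)] for w in range(W)]
        for h in range(H)]

features = to_local_features(acts)  # list of H*W vectors, each of length D
```

Spatial unit (h, w) ends up at index h * W + w, which is the correspondence later needed to align two layers with the same spatial layout.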
The fully-connected layer can be seen as a convolutional layer whose receptive field is the whole image. In recent literature [16, 8, 6], the activations from a fully connected layer are often used as a descriptor for image regions rather than the whole image. As has been pointed out, if the regions are sampled from the input image over a dense grid, such a descriptor extraction process can also be viewed as applying a convolutional layer. For the sake of clarity, in this paper we refer to such a convolutional layer as the “augmented convolutional layer”, or the AConv layer for short, and to a convolutional layer of the original pre-trained CNN as the OConv layer.
3.2 Cross-convolutional-layer Pooling
In [6, 7], it has been shown that applying an additional pooling operation on the local features extracted from multiple image regions can significantly boost classification performance. Motivated by these methods, we can design a specific pooling layer and apply it to the local features extracted from a convolutional layer which can be either the OConv layer or the AConv layer.
Various pooling methods could be chosen to aggregate the local features, e.g., max-pooling, sum-pooling or Fisher vector based pooling. In this section, we propose an alternative pooling method which significantly improves classification performance.
The proposed method is inspired by the parts-based pooling strategy used in fine-grained image classification. In this strategy, multiple regions-of-interest (ROIs) are first detected, with each corresponding to one human-specified object part, e.g., the tails of birds. Local features falling into each ROI are then pooled together to obtain a pooled feature vector. Given $K$ object parts, this strategy creates $K$ different pooled feature vectors, and these vectors are concatenated to form the image representation. It has been shown that this simple strategy achieves significantly better performance than directly pooling all local features together. Formally, the pooled feature from the $k$th ROI of the $t$th layer, which we denote by $P_k^t$, can be calculated by the following equation (let’s consider sum-pooling in this case):

$$P_k^t = \sum_{i=1}^{N_t} x_i^t \, I_{i,k}, \tag{1}$$

where $x_i^t$ denotes the $i$th of the $N_t$ local features and $I_{i,k}$ is a binary indicator map indicating whether $x_i^t$ falls into the $k$th ROI. We can also generalize $I_{i,k}$ to real values, with its value indicating the ‘membership’ of a local feature to an ROI. Essentially, each indicator map defines a pooling channel and the image representation is the concatenation of pooling results from multiple channels.
However, in a general image classification task, there are no human-specified part annotations, and even for many fine-grained image classification tasks the annotation and detection of these parts are typically non-trivial. To handle this situation, in this paper, we propose to use the feature maps of the $(t+1)$th convolutional layer as indicator maps. By doing so, $D_{t+1}$ pooling channels are created for the local features extracted from the $t$th convolutional layer, where $D_{t+1}$ is the number of feature maps in the $(t+1)$th layer. We call this method cross-convolutional-layer pooling or cross-layer pooling for short.
The use of feature maps as indicator maps is motivated by the observation that a feature map of a deep convolutional layer is usually sparse and tends to be selective of higher-level visual concepts, as has also been observed in previous work. This observation is illustrated in Figure 3. In Figure 3, we choose three images taken from the Birds-200 dataset. We sample three feature maps from the 256 feature maps in conv5 and overlay them on the original images for better visualization. As can be seen from Figure 3, the activated regions of the sampled feature maps (highlighted in warm colors) are semantically meaningful. For example, the activated region in the first row tends to localize at the head region of a bird, while the feature map shown in the second row exhibits high values around the claw area. Thus, the filter of a convolutional layer works as a part detector, and its feature map plays a role similar to that of the part region indicator map. Certainly, compared with a part detector learned from human-specified part annotations, the filter of a convolutional layer is usually not directly task-relevant. However, the discriminative power of our image representation can benefit from combining a much larger number of indicator maps, e.g., 256 as opposed to 20-30 (the number of parts usually defined by humans).
Formally, the image representation extracted from cross-layer pooling can be expressed as follows:

$$P^t = \left[ P_1^t, P_2^t, \ldots, P_k^t, \ldots, P_{D_{t+1}}^t \right], \quad \text{where} \quad P_k^t = \sum_{i=1}^{N_t} x_i^t \, a_{i,k}^{t+1}, \tag{2}$$

where $x_i^t$ is the $i$th local feature (filter responses) in the $t$th convolutional layer, $N_t$ is the total number of local features at the $t$th layer, and $a_{i,k}^{t+1}$ is the activation value of the $i$th spatial unit and the $k$th filter at the $(t+1)$th convolutional layer. Thus $P_k^t$ represents a weighted sum-pooling of $x_i^t$ with the weight defined by $a_{i,k}^{t+1}$. In total, there are $D_{t+1}$ sets of weights since there are $D_{t+1}$ filters at the $(t+1)$th convolutional layer. The final image representation will be the concatenation of all $P_k^t$ and thus its dimensionality will be $D_t \times D_{t+1}$, where $D_t$ is the dimensionality of $x_i^t$. Note that here we assume that there is a correspondence between the $i$th local feature at the $t$th layer and the $i$th feature activations at the $(t+1)$th layer. This correspondence can be easily established if the consecutive convolutional layers have the same spatial unit layout. For example, the last two convolutional layers in the Alex net both have a 13 × 13 spatial unit layout, and we can deem feature maps at the same spatial unit across the two layers to be corresponding.
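The weighted sum-pooling described above can be sketched in a few lines. This is an illustrative pure-Python version under stated assumptions, not the authors' implementation: the function name `cross_layer_pool`, the list-of-lists layout and the toy values are hypothetical, and the two layers are assumed to list their spatial units in the same order.

```python
def cross_layer_pool(x, a):
    """Cross-layer pooling sketch.

    x: N local features of dimension D_t (layer t), one per spatial unit.
    a: N activation vectors of dimension D_{t+1} (layer t+1), same order as x.
    Returns the concatenation of D_{t+1} pooled vectors, where channel k is
    the sum over spatial units i of a[i][k] * x[i].
    """
    if len(x) != len(a):
        raise ValueError("both layers must share the same spatial unit layout")
    d_t, d_next = len(x[0]), len(a[0])
    pooled = [[0.0] * d_t for _ in range(d_next)]
    for x_i, a_i in zip(x, a):
        for k, weight in enumerate(a_i):
            if weight == 0.0:  # post-ReLU feature maps are sparse: skip inactive units
                continue
            channel = pooled[k]
            for j, value in enumerate(x_i):
                channel[j] += weight * value
    return [v for channel in pooled for v in channel]


# Toy example: N = 3 spatial units, D_t = 2, D_{t+1} = 3.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
a = [[1.0, 0.0, 0.5], [0.0, 2.0, 0.5], [0.0, 0.0, 1.0]]
rep = cross_layer_pool(x, a)  # length D_t * D_{t+1} = 6
```

In the toy example the first channel only sees the first spatial unit, the second channel only the second, and the third channel mixes all three, which mirrors how each feature map acts as a soft part indicator.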
Another way of interpreting Eq. (2) is that the image representation is a sum over outer products of corresponding features in two layers. This operation is similar to calculating Gram matrices, which have been applied in computer vision [21, 22, 23]. The difference is that our cross-layer pooling calculates the outer product across different layers and thus can be seen as an extension of the Gram matrix based representation.
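The outer-product view can be written compactly in matrix form. The stacked matrices $X_t$ and $A_{t+1}$ below are notation introduced here for illustration only: $X_t \in \mathbb{R}^{N_t \times D_t}$ holds the local features of the $t$th layer as rows and $A_{t+1} \in \mathbb{R}^{N_t \times D_{t+1}}$ holds the corresponding activations of the $(t+1)$th layer as rows.

```latex
\sum_{i=1}^{N_t} x_i^t \,\bigl(a_i^{t+1}\bigr)^{\top}
  = X_t^{\top} A_{t+1} \in \mathbb{R}^{D_t \times D_{t+1}},
\qquad
\bigl[\, X_t^{\top} A_{t+1} \,\bigr]_{:,k}
  = \sum_{i=1}^{N_t} a_{i,k}^{t+1}\, x_i^t .
```

Under this notation, each column of $X_t^{\top} A_{t+1}$ is the pooled vector of one channel, and replacing $A_{t+1}$ by $X_t$ gives the within-layer Gram matrix $X_t^{\top} X_t$ that the cross-layer representation extends.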
3.3 Augmented convolutional layers vs. original convolutional layers
To implement cross-layer pooling, one needs to specify two convolutional layers. In practice, these two convolutional layers can be chosen from either the AConv layers or the OConv layers. (It is also possible to choose one OConv layer and one AConv layer to perform cross-layer pooling. We discuss this possibility in section 4.2.1.) But which type of convolutional layer performs better? We show empirically below that the best performing layer type depends on the recognition task to which it will be applied.
For the AConv layer, its convolutional filters are the fully-connected layer parameters of the original CNN. So the AConv layer encodes higher-level visual concepts, e.g., object-category-level information. Thus, if the target problem involves identification of object-category-level patterns, e.g., classifying whether a “car” occurs in the image, then the AConv layer should be used, and its activations can be seen as being similar to object bank detectors. Note that even if the problem does not directly involve the identification of an object that appears in the network training task, e.g., the 1000 categories in the ImageNet subset, category-level pattern detection may still be beneficial. For example, for scene classification, the occurrence of an object, such as a bed, can be a strong indicator of the scene class “bedroom”.
Compared with the AConv layer, the OConv layer captures lower-level visual patterns. Thus for target applications which require strong discriminative power to identify lower-level visual patterns, e.g., specific textures in a fine-grained image classification problem, the OConv layer can transfer across domains more easily and lead to better performance than the AConv layer. It should be noted that most commonly used pre-trained CNNs are trained on image classification datasets such as ImageNet. Thus, if we use an AConv layer to describe low-level patterns, the input images seen by those pre-trained networks during training will be very different from those in the target domain. Figure 2 shows an example of this case. The top row of Figure 2 shows some images from the ImageNet dataset while the bottom row shows some regions corresponding to low-level visual patterns from images in the Birds-200 dataset. As can be seen, the appearance and the level of detail are quite different between the two rows. Thus, applying the AConv layer to describe lower-level visual patterns will introduce a significant input image style mismatch which could potentially undermine the discriminative power of DCNN activations.
3.4 Implementation details
PCA: In our implementation, we perform PCA on $x_i^t$ to reduce the dimensionality of $x_i^t$ to 2000 dimensions for the AConv layer. For the OConv layer, we still perform PCA to de-correlate the local features but without performing dimensionality reduction. We empirically find that this leads to slightly better performance than using the original, non-decorrelated OConv local features.
Normalization: Since the number of activated spatial units at the guidance convolutional layer can differ between pooling channels, the pooled vectors derived from different channels may have different energies. Thus, in our implementation we $\ell_2$-normalize the pooled coding vector of each channel. After that, we apply power normalization to $P^t$, that is, we use $\hat{P}^t = \mathrm{sign}(P^t)\,|P^t|^{0.5}$ as the image representation to further improve performance.
Feature sign quantization: Besides the aforementioned image representation, we also tried directly using $\mathrm{sign}(P^t)$ as an image representation, that is, we coarsely quantize $P^t$ into $\{-1, 0, 1\}$ according to the feature sign of $P^t$.
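The normalization and quantization steps above can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the per-channel normalization is assumed to be an L2 normalization, and the power-normalization exponent `alpha=0.5` is an assumed (commonly used) value.

```python
import math


def normalize_channels(pooled_channels):
    """L2-normalize each channel's pooled vector (assumed norm); all-zero
    channels are left unchanged to avoid dividing by zero."""
    out = []
    for channel in pooled_channels:
        norm = math.sqrt(sum(v * v for v in channel))
        out.append([v / norm for v in channel] if norm > 0 else list(channel))
    return out


def power_normalize(vec, alpha=0.5):
    """Signed power normalization: sign(v) * |v|**alpha (alpha is an assumption)."""
    return [math.copysign(abs(v) ** alpha, v) for v in vec]


def sign_quantize(vec):
    """Coarse quantization of each value to -1, 0 or 1 according to its sign."""
    return [(v > 0) - (v < 0) for v in vec]


channels = [[3.0, -4.0], [0.0, 0.0]]        # two pooling channels, D_t = 2
normed = normalize_channels(channels)
flat = [v for c in normed for v in c]       # concatenate the channels
powered = power_normalize(flat)             # real-valued representation
codes = sign_quantize(flat)                 # compressed representation
```

The quantized codes need only two bits per dimension, which is what makes the compressed representation attractive when the concatenated vector is very long.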
Adding a new convolutional layer for the AConv layer: One issue when using the AConv layer for cross-layer pooling is that we need to find two consecutive AConv layers. These two layers can be obtained by using two consecutive fully-connected layers in the original DCNN. However, since the fully-connected layers in most commonly used DCNN models have a very large number of output neurons, e.g., 4096 or 1000, directly performing cross-layer pooling on those AConv layers will result in a very high dimensional image representation. To solve this issue, we only utilize one fully connected layer from the original DCNN as one AConv layer, and stack another newly added convolutional layer on top of it with a much smaller number of filters, e.g., 100. Then we train the new convolutional layer on the target dataset. The network architecture of our implementation is as follows: a max-pooling layer is applied to pool the activations of the newly added convolutional layer, and the pooled result is fed into a logistic regression layer. The cross-entropy loss is then utilized to train the new convolutional layer.
3.5 Application to image retrieval
Recently, it has been discovered that pooled convolutional layer activations can form a good image representation for image retrieval. Inspired by the success of that work, we apply our cross-layer pooling method to image retrieval. Since cross-layer pooling creates multiple pooling channels with each pooling channel capturing one type of visual pattern, and the pooling result of each channel is normalized, an image representation created by cross-layer pooling can depict various aspects of the visual patterns within the image in a balanced way. In comparison, the representation generated by the direct pooling method may be dominated by patterns that occur more frequently within the image.
Directly applying cross-layer pooling to image retrieval incurs a high computational cost due to the high dimensionality of the generated image representations. For example, if there are $D_{t+1}$ pooling channels, the computational cost can be $D_{t+1}$ times higher than that of the direct pooling method. To handle this drawback, in this paper we propose two strategies. The first strategy is to binarize the cross-layer pooling feature. This is inspired by the observation (which will be experimentally demonstrated in Section 4.4) that keeping only the sign of the feature values does not significantly impact the discriminative power of cross-layer pooling. (As will be discussed in the second strategy, some channels are discarded during the feature similarity calculation, which is equivalent to setting the feature values within the discarded channels to 0. In this view, there are still three possible values, 1, -1 and 0, in the resulting image representation.) The second strategy is to adaptively select a small number, say $M$, of pooling channels for each query image and then only retain the features pooled from the selected channels in both the query and the reference image to perform the similarity comparison. Formally, for such a scheme the image similarity between a query image and a reference image is calculated via:

$$S(q, r) = \sum_{k \in \mathcal{C}_q} \left\langle P_k^{q}, P_k^{r} \right\rangle, \tag{3}$$

where $q$ and $r$ are the query image and a reference image. In the original cross-layer pooling, both the query image and the reference image are represented by $D_{t+1}$ subvectors, each pooled from one pooling channel. We use $P_k^{q}$ and $P_k^{r}$ to denote the $k$th subvectors of the query and a reference image respectively. For the original cross-layer pooling approach, the comparison should be made over all subvectors. Here in Eq. (3), only subvectors whose indices fall within a small subset ($\mathcal{C}_q$, with $|\mathcal{C}_q| = M$) are compared. Thus, the computational cost can be greatly reduced in comparison with the naive implementation of cross-layer pooling for image retrieval. In this paper, we construct the set $\mathcal{C}_q$ by selecting the channels (feature maps of a convolutional layer) with the top average activations. By applying this criterion, the convolutional feature maps with less significant activations are discarded. Thus, one additional benefit of this operation, besides a reduction in the computational cost, is that it might suppress noise patterns and therefore potentially improve retrieval performance. Note that besides selecting channels with top activation values, other criteria can be applied. For example, if the retrieval task is to find a specific object type such as sculpture or cloth, an importance weight for each pooling channel can be learned by using images which contain the object-of-interest as positive training samples and random images as negative training samples.
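The adaptive channel selection scheme can be sketched as below. This is a minimal illustration under stated assumptions: the function names, the parameter `m` (the number of retained channels) and the toy data are hypothetical, and the per-channel comparison is assumed to be an inner product between (binarized) subvectors, which the text does not spell out.

```python
def select_channels(guidance, m):
    """Indices of the m feature maps with the highest average activation.

    guidance: N activation vectors (one per spatial unit) from the guidance
    layer of the query image, each of length equal to the number of channels.
    """
    n_units, n_channels = len(guidance), len(guidance[0])
    means = [sum(a[k] for a in guidance) / n_units for k in range(n_channels)]
    ranked = sorted(range(n_channels), key=lambda k: means[k], reverse=True)
    return sorted(ranked[:m])


def similarity(query_sub, ref_sub, selected):
    """Sum of inner products over the selected channels only (assumed measure)."""
    return sum(
        sum(q * r for q, r in zip(query_sub[k], ref_sub[k])) for k in selected
    )


# Toy query with 3 spatial units and 3 channels; channel 1 is the most active.
guidance_q = [[0.0, 2.0, 0.1], [0.0, 1.0, 0.3], [0.5, 3.0, 0.0]]
selected = select_channels(guidance_q, m=2)

# Sign-quantized subvectors, one per channel, for the query and one reference.
query_sub = [[1, -1], [1, 1], [-1, 1]]
ref_sub = [[1, -1], [1, 1], [1, -1]]
score = similarity(query_sub, ref_sub, selected)
```

Because the selection depends only on the query, the reference images can keep their full binarized representation and the subset is applied at comparison time.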
4 Experiments
We have organized our experiments into three parts. The first evaluates the proposed cross-layer pooling method for the image classification application. In the second part, we conduct ablation studies and demonstrate the impact of the various components of our method. In the third part, we evaluate the proposed method on image retrieval tasks. The focus of the third part is to evaluate whether the proposed cross-layer pooling leads to better performance than the direct pooling method, which also uses a convolutional layer pooling strategy for image retrieval.
4.1 Image classification experiments
4.1.1 Experimental protocol
Table I: Results on MIT-67.

| Method | Accuracy | Comments |
|---|---|---|
| SCFV | 59.2% | AlexNet, OConv |
| CrossLayer (proposed) | 63.0% | AlexNet, OConv |
| SCFV | 68.2% | AlexNet, AConv |
| CrossLayer (proposed) | 68.2% | AlexNet, AConv |
| SCFV | 73.5% | VGGNet, OConv |
| CrossLayer (proposed) | 74.4% | VGGNet, OConv |
| SCFV | 76.4% | VGGNet, AConv |
| CrossLayer (proposed) | 78.2% | VGGNet, AConv |
| Fine-tuning | 69.8% | fine-tuning with the VGGNet |
| Fine-tuning | 66.0% | fine-tuning with the AlexNet |
| MOP-CNN | 68.9% | three scales |
| VLAD level2 | 65.5% | single scale |
| DeepTexture | 81.7% | 7 scales |
| Texture Synthesis | 75.0% | using the Gram matrix on fc18 layer (VGG net) for classification |
We evaluate the proposed method on three datasets: MIT indoor scene-67 (MIT-67 in short) , Caltech-UCSD Birds-200-2011  (Birds-200 in short) and PASCAL VOC 2007  (PASCAL-07 in short) for image classification. These three datasets cover several popular topics in image classification, that is, scene classification, fine-grained object classification and generic object classification.
We compare the proposed method against three baselines: (1) CNN-Global, which directly uses fully-connected layer activations for the whole image; (2) CNN-Jitter, which averages fully-connected layer activations from several transformed versions of an input image (following [3, 4], we transform the input image by cropping its four corners and middle region as well as by creating their mirrored versions); (3) SCFV, which applies the sparse coding based Fisher vector encoding method to the local features extracted from the convolutional layer (in both schemes). Since R-CNN SCFV has demonstrated superior performance to the MOP method, we do not include MOP in our comparison. To make a fair comparison, we reimplement all three baseline methods. We also apply PCA, normalization and power normalization to SCFV, and normalization to CNN-Global and CNN-Jitter (we find that using PCA and/or power normalization makes little difference to the results of CNN-Global and CNN-Jitter).
Two CNN models are adopted throughout our experiments: the Alex net and the VGG very-deep 19 layer network (the VGGVD net). Two different types of convolutional layers are used, the original convolutional layer (denoted as OConv) and the augmented convolutional layer (denoted as AConv). As discussed in section 3.1, the latter can be implemented by applying the DCNN to a set of image regions which are extracted on a dense grid. In our implementation, we first resize the input image to 512 × 512 pixels and then extract image regions of size 128 × 128 at a step size of 32 pixels. For OConv layers, we report the results obtained using the 4th and 5th convolutional layers for the Alex net and the conv5-3 and conv5-4 convolutional layers for the VGGVD net since those settings achieve the best performance. We also explore the use of other convolutional layers in the second part of our experiments. For the AConv layer, we extract local features from the fc6 layer in the Alex net and the first fully-connected layer in the VGGVD net respectively. Then we stack a new convolutional layer with 100 filters on top of them and train the newly added layer on the target dataset. This new layer is trained for 50 epochs with a learning rate of 0.01. No data augmentation is used in this training step.
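The dense sampling grid described above can be worked out with a short sketch. The function name `dense_region_corners` is hypothetical, and the sketch assumes that regions are taken without padding, so that every region lies fully inside the resized image.

```python
def dense_region_corners(image_size=512, region_size=128, stride=32):
    """Top-left corners (y, x) of square regions on a dense grid, plus the
    number of regions along each side. Assumes no padding at the borders."""
    starts = list(range(0, image_size - region_size + 1, stride))
    corners = [(y, x) for y in starts for x in starts]
    return corners, len(starts)


corners, per_side = dense_region_corners()
```

With the settings quoted in the text, this grid has (512 - 128) / 32 + 1 = 13 positions per side, i.e. 169 regions per image under the no-padding assumption.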
We use libsvm  as the SVM solver and use precomputed linear kernels as inputs. This is because the calculation of linear kernels/Gram matrices can be easily implemented in parallel. When feature dimensionality is high, the kernel matrix computation actually occupies most of the computational time. Thus it is appropriate to use parallel computing to accelerate this process.
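A precomputed linear kernel is simply the matrix of inner products between image representations. The sketch below shows the computation in plain Python; the function name `linear_kernel` and the toy vectors are illustrative assumptions, and a practical implementation would use an optimized matrix product.

```python
def linear_kernel(rows_a, rows_b):
    """Gram matrix K with K[i][j] = <rows_a[i], rows_b[j]>."""
    return [[sum(p * q for p, q in zip(u, v)) for v in rows_b] for u in rows_a]


train = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]   # two training representations
test = [[1.0, 1.0, 1.0]]                     # one test representation

k_train = linear_kernel(train, train)  # n_train x n_train, used for training
k_test = linear_kernel(test, train)    # n_test x n_train, used for prediction
```

Each row of the kernel matrix depends only on one image and the training set, so rows (or blocks of rows) can be computed independently, which is what makes the computation easy to parallelize.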
4.1.2 Classification Results
Scene classification: MIT-67. MIT-67 is a commonly used benchmark for evaluating scene classification algorithms. Under the standard setting, which we follow, it provides 6700 images from 67 indoor scene categories, with 80 images in each category for training and 20 images for testing. The results are shown in Table I. It can be seen that the proposed cross-layer pooling achieves the best performance in most settings. The best performance is achieved by using cross-layer pooling and the AConv layer: this setting produces 68.2% classification accuracy for the Alex net and 78.2% for the VGGVD net. Also, it is clear that extracting local features from the AConv layer, as is done in SCFV and CrossLayer, achieves a significant performance increase in comparison with global CNN features, i.e., CNN-Global and CNN-Jitter. Finally, the use of the VGGVD net further boosts the classification performance by a large margin.
Comparing with the performance reported in the literature, we can see that the proposed method surpasses most state-of-the-art methods. The only exception is the DeepTexture result. However, that method is very close to our SCFV (with OConv) baseline, and its good performance is largely due to its brute-force multiple-scale strategy (it uses 7 scales while we only use a single scale).
Fine-grained image classification: Birds-200. Birds-200 is the most popular dataset in fine-grained image classification research. It contains 11788 images with 200 different bird species. This dataset provides ground-truth annotations of bounding boxes and parts of birds such as the head and the tail, on both the training set and the test set. In this experiment, we only use the bounding box annotation. The results are shown in Table II. As can be seen, the cross-layer pooling achieves the best classification performance: 77.0% when the VGGVD net is used. Also, using the original convolutional layer achieves much better performance than the use of the AConv layers. For both the Alex net and the VGGVD net, the best performance is achieved by using features from the OConv layer. The underlying reason can be well explained by section 3.3, that is, the discriminative information of birds species usually lies at small regions and it will be more appropriate to extract features from original convolutional layers due to the image style mismatch issue discussed in section 3.3.
Our best performance is among the best reported for this dataset. The Bilinear CNN work reports higher classification accuracy than ours. However, it relies on a fine-tuning step on two different networks and adopts some different experimental settings: its convolutional layers have a different number of spatial units from ours (28 × 28 as opposed to 14 × 14 in our experiments), and it performs decision score calibration on the SVM while we just use the standard one-vs-the-rest SVM. (When the same spatial unit configuration is used, our cross-layer pooling achieves 80% classification accuracy, which is closer to that result. Note that we only use one network and do not apply fine-tuning or score calibration.)
Table II: Results on Birds-200.

| Method | Accuracy | Comments |
|---|---|---|
| CNN-Global | 59.2% | no parts, AlexNet |
| CNN-Jitter | 60.5% | no parts, AlexNet |
| SCFV | 64.2% | no parts, AlexNet, OConv |
| CrossLayer | 73.3% | no parts, AlexNet, OConv |
| SCFV | 66.4% | no parts, AlexNet, AConv |
| CrossLayer | 71.7% | no parts, AlexNet, AConv |
| CNN-Global | 62.5% | no parts, VGGNet |
| CNN-Jitter | 63.6% | no parts, VGGNet |
| SCFV | 73.7% | no parts, VGGNet, OConv |
| CrossLayer | 77.0% | no parts, VGGNet, OConv |
| SCFV | 66.2% | no parts, VGGNet, AConv |
| CrossLayer | 69.4% | no parts, VGGNet, AConv |
| Fine-tuning | 76.4% | no parts, fine-tuning, VGGNet |
| Fine-tuning | 66.4% | no parts, fine-tuning, AlexNet |
| Parts-RCNN-FT | 76.37% | use parts, fine-tuning |
| Parts-RCNN | 68.7% | use parts, no fine-tuning |
| CNN-SVM | 53.3% | CNN global |
| DPD+CNN | 65.0% | use parts |
| Bilinear CNN | 77.9% | two networks |
| Bilinear CNN | 81.9% | two networks, fine-tuning |
| Texture Synthesis | 67.3% | using the Gram matrix on conv5-4 layer (VGG net) for classification |
Object classification: PASCAL-2007. PASCAL VOC 2007 contains 9963 images with 20 object categories. The task is to predict the presence of each object in each image. Note that most object categories in PASCAL-2007 are also included in ImageNet, which is the training set of the Alex net and the VGGVD net. So ImageNet can be seen as a superset of PASCAL-2007. The results on this dataset are shown in Table III. From Table III, we can see that again the best performance is achieved by using cross-layer pooling and the VGGVD net. Not surprisingly, the AConv layer performs better than the OConv layer on this dataset because the training categories of the DCNN overlap with PASCAL-2007 and the AConv layer contains this category-level information. The per-class performance of the three best-performing methods, that is, CNN-Jitter with the VGGVD net, SCFV with the AConv layer from the VGGVD net and cross-layer pooling with the AConv layer from the VGGVD net, is shown in Table IV. As can be seen, the proposed cross-layer pooling achieves the best performance in most classes.
Table III: Results on PASCAL-07.

| Method | Accuracy | Comments |
|---|---|---|
| SCFV | 66.8% | AlexNet, OConv |
| SCFV | 76.9% | AlexNet, AConv |
| SCFV | 82.9% | VGGNet, OConv |
| SCFV | 85.1% | VGGNet, AConv |
| Fine-tuning | 90.1% | VGGNet fine-tuning |
| Fine-tuning | 82.4% | CNN-S fine-tuning |
| CNNaug-SVM | 77.2% | with augmented data |
| CNN-SVM | 73.9% | no augmented data |
| Texture Synthesis | 84.7% | using the Gram matrix on fc18 layer (VGG net) for classification |
|Global Jitter (VGGVD)||81.9||96.3||72.9||86.1||61.6||95.2||89.3||91.5||90.9||79.7|
|Global Jitter (VGGVD)||77.8||67.4||92.1||91.2||85.2||56.6||92.8||92.9||90.4||97.1|
4.2 Ablation study
From the above experiments, the advantage of using the proposed method has been clearly demonstrated. In this section, we further examine the effect of various components in our method.
4.2.1 Using different convolutional layers
First, we examine the performance of using convolutional layers other than the 4th and 5th convolutional layers in the Alex net and the conv5-3 and conv5-4 convolutional layers in the VGGVD net. Specifically, we investigate using the 3rd and 4th convolutional layers in the Alex net and the conv5-2 and conv5-3 convolutional layers in the VGGVD net. The results are shown in Table VI. From the results, we can see that using the 4th and 5th layers (conv5-3 and conv5-4 for the VGGVD net) achieves superior performance over using the 3rd and 4th layers (conv5-2 and conv5-3). This is consistent with the observation in  that the deeper the convolutional layer, the better its discriminative power.
As discussed in section 3.1, the process of extracting fully-connected layer activations from multiple local regions can be viewed as applying a special convolutional layer. Thus it is possible to perform cross-layer pooling on two fully-connected layers. For AConv layers, a new fully-connected layer is stacked and re-trained. It is also possible to directly use two fully-connected layers in an existing CNN without introducing new layers, but the computational cost can be higher due to the high dimensionality of the resulting representation, e.g., 4096 × 1000. It is likewise possible to perform cross-layer pooling on an original convolutional layer and a fully-connected layer in the above setting. In such cases, multiple spatial units in a convolutional layer correspond to one fully-connected layer output. To apply cross-layer pooling, we can either flatten the activations from multiple spatial units into a long vector or use the pooled activation from multiple spatial units. In this paper, we use the latter approach since it produces lower-dimensional image representations.
In this subsection we conduct an experimental evaluation of the above two approaches to performing cross-layer pooling. For the first approach, denoted as FC-FC cross-layer pooling, we use the activations from the first fully-connected layer (4096 dimensions) and the last fully-connected layer (1000 dimensions) in the VGG net to perform cross-layer pooling. PCA is applied to reduce the dimensionality of the first fully-connected layer activations to 2000. For the second approach, denoted as FC-Conv cross-layer pooling, we use the activations from conv5-4 and the first fully-connected layer in the VGG net and apply sum-pooling (followed by square-root post-processing) to pool the activations of conv5-4. Thus the dimensionality of the representation obtained from cross-layer pooling is 512 × 4096. The performance of these two methods is shown in Table V. From Table V we make the following two observations: (1) By cross-referencing the performance in Table I, Table II and Table III, we observe that FC-FC cross-layer pooling achieves similar, though marginally worse, performance compared with cross-layer pooling using the AConv layer. This may suggest that the good performance of AConv-based cross-layer pooling mainly comes from the cross-layer pooling strategy, although the re-trained new convolutional layer can further boost classification performance. (2) FC-Conv cross-layer pooling achieves better performance than FC-FC cross-layer pooling on Birds200 but is inferior to FC-FC cross-layer pooling on MIT-67 and PASCAL-07. Considering that the pooling operation of FC-Conv cross-layer pooling is performed on the OConv layer, this observation is consistent with the conclusion drawn from Table I and Table III, that is, cross-layer pooling on AConv layers leads to better performance than pooling with OConv layers for MIT-67 and PASCAL-07.
|Layers||MIT-67||Birds-200||PASCAL-07|
|FC-18, FC-20 (VGGNet)||77.0%||63.2%||85.9%|
|conv5-3, FC-18 (VGGNet)||74.1%||68.5%||84.5%|
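The FC-Conv variant described above can be illustrated with a minimal NumPy sketch. It assumes that cross-layer pooling computes, for each channel of the guiding layer, the activation-weighted sum of the local features from the lower layer and concatenates the results. The function names, the toy sizes (8 and 5 standing in for 512 and 4096) and the random inputs are hypothetical and serve only to show how the output dimensionality arises.

```python
import numpy as np


def cross_layer_pool(local_feats, guide_acts):
    """Cross-layer pooling of one image (illustrative sketch).

    local_feats: (N, D_t) array, one local feature per location of the lower layer.
    guide_acts:  (N, D_next) array, activations of the next layer at the same
                 N locations; column k acts as the weight map of channel k.
    Returns a vector of length D_next * D_t: for each channel k, the
    activation-weighted sum of the local features, concatenated over k.
    """
    if local_feats.shape[0] != guide_acts.shape[0]:
        raise ValueError("both layers must cover the same N locations")
    pooled = guide_acts.T @ local_feats  # shape (D_next, D_t)
    return pooled.reshape(-1)


def sum_sqrt_pool(acts):
    """Sum-pool non-negative activations over locations, then take the square root."""
    return np.sqrt(acts.sum(axis=0))


# Toy FC-Conv setting: 6 regions, each covering 9 conv spatial units.
rng = np.random.default_rng(0)
conv = rng.random((6, 9, 8))  # 8 channels stand in for the 512 of conv5-4
fc = rng.random((6, 5))       # 5 dims stand in for the 4096-dim FC activation
region_feats = np.stack([sum_sqrt_pool(region) for region in conv])  # (6, 8)
rep = cross_layer_pool(region_feats, fc)
print(rep.shape)  # (40,), i.e. 8 * 5; with real sizes this would be 512 * 4096
```

With real layer sizes, the same product gives the 512 × 4096 dimensionality quoted in the text, which is why pooling the spatial units is cheaper than flattening them.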
4.2.2 The impact of PCA and normalization
In our implementation, we apply three operations to obtain the final image representation: PCA on the local features, normalization of each pooled coding vector, and power normalization. In this section, we investigate the impact of these three operations. We conduct our experiment on MIT-67 with the AConv layer features and test the performance under various settings of the three operations. Table VII shows the results, from which we observe the following: (1) The three operations have a large impact on performance. If none of them is applied, performance drops significantly. (2) Applying either normalization or power normalization leads to a similar performance improvement. (3) Combining PCA with normalization, power normalization, or both leads to further performance improvement. (4) The best performance is obtained by applying all three operations together.
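The three operations can be sketched in NumPy as follows. This is a minimal illustration rather than the exact implementation: the type of per-channel normalization (unit L2 norm here), the power exponent (0.5 here), the order in which the two normalizations are applied, and all function names are assumptions.

```python
import numpy as np


def fit_pca(local_feats, n_components):
    """Learn (mean, projection) from an (M, D) matrix of local features."""
    mean = local_feats.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(local_feats - mean, full_matrices=False)
    return mean, vt[:n_components].T  # projection has shape (D, n_components)


def normalize_channels(pooled, eps=1e-12):
    """Scale each pooled channel vector (row) to unit L2 norm (assumed norm)."""
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.maximum(norms, eps)


def power_normalize(vec, alpha=0.5):
    """Signed power normalization sign(v) * |v|**alpha (alpha is assumed)."""
    return np.sign(vec) * np.abs(vec) ** alpha


def build_representation(local_feats, guide_acts, mean, proj):
    reduced = (local_feats - mean) @ proj       # 1) PCA on the local features
    pooled = guide_acts.T @ reduced             # cross-layer pooling, (K, d)
    pooled = normalize_channels(pooled)         # 2) per-channel normalization
    return power_normalize(pooled.reshape(-1))  # 3) power normalization
```

Each of the three steps can be switched off independently in such a pipeline, which is the kind of comparison reported in Table VII.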
4.2.3 Feature sign quantization
As discussed in Section 3.4, feature sign quantization is a promising strategy for reducing the memory cost of cross-layer pooling. Here we demonstrate its effect. Feature sign quantization quantizes a feature to 1 if it is positive, -1 if it is negative and 0 if it equals 0. In other words, we use 2 bits to represent each dimension of the pooled feature vector, which greatly reduces memory use. We only report the result for the best-performing setting on each dataset. The results are shown in Table IX. As can be seen, this coarse quantization scheme does not degrade performance much, and for two datasets, MIT-67 and Birds-200, it achieves almost the same performance as the original feature. Note that a similar quantization scheme has also been explored in ; however, it caused a significant performance drop when applied to convolutional layer features. For example, in Table 7 of , binarizing conv-5 reduces performance by around 5%. In contrast, our representation seems to be less sensitive to this coarse quantization.
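A small sketch makes the 2-bit claim concrete. The quantizer follows the rule stated above; the packing layout (four codes per byte, least-significant bits first) and the function names are assumptions chosen for illustration only.

```python
import numpy as np

_SHIFTS = np.array([0, 2, 4, 6], dtype=np.uint8)


def sign_quantize(vec):
    """Map each entry to +1, -1 or 0 according to its sign."""
    return np.sign(vec).astype(np.int8)


def pack_2bit(codes):
    """Pack ternary codes (-1, 0, +1) into bytes, four codes per byte."""
    symbols = (codes + 1).astype(np.uint8)  # -1, 0, +1 -> 0, 1, 2
    pad = (-len(symbols)) % 4
    # Pad with symbol 1, which decodes to 0, so the length is a multiple of 4.
    symbols = np.concatenate([symbols, np.ones(pad, dtype=np.uint8)])
    groups = symbols.reshape(-1, 4)
    return (groups << _SHIFTS).sum(axis=1).astype(np.uint8)


def unpack_2bit(packed, n):
    """Recover the first n ternary codes from the packed bytes."""
    symbols = (packed[:, None] >> _SHIFTS) & 3
    return symbols.reshape(-1)[:n].astype(np.int8) - 1


v = np.array([0.3, -1.2, 0.0, 2.0, -0.1])
q = sign_quantize(v)
packed = pack_2bit(q)
print(q)            # [ 1 -1  0  1 -1]
print(packed.size)  # 2 bytes for 5 dimensions
```

If the original features were stored as 32-bit floats (an assumption about the storage format), this packing would reduce memory by a factor of 16.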
4.3 Comparison with alternative pooling methods
Finally, we compare several alternative pooling strategies against cross-layer pooling. The baseline method that we compare against directly pools convolutional layers. There are many possible variants of such an approach, however, which we characterize according to three criteria:
Pooling methods. We consider both sum-pooling with square-root post-processing (which is better than direct sum-pooling) and max-pooling.
Spatial pyramids . Three spatial pyramid partitions, that is, level 0, level 1 and level 2, are considered.
Pooling layers. The conv5-3 layer, conv5-4 layer and the concatenation of conv5-3 and conv5-4 layers of the VGG net are considered.
The classification results on MIT67, Birds200 and PASCAL07 are reported in Table VIII. As can be seen, the best performance achieved by tuning those pooling strategies is 69.0% on MIT67, 64.1% on Birds200 and 77.4% on PASCAL07. In comparison, the cross-layer pooling counterpart achieves 74.4%, 77.0% and 84.1% on MIT67, Birds200 and PASCAL07 respectively, which is much better than the traditional pooling approaches.
Another possible variation of cross-layer pooling is to change the channel pooling method from sum-pooling to max-pooling. We also tried this setting and achieved 72.8% on MIT67, 71.6% on Birds200 and 85.2% on PASCAL07. As can be seen, its performance is worse than sum-pooling. Thus we suggest using sum-pooling as the default pooling method for cross-layer pooling.
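The baseline variants compared in this section (max-pooling or sum-pooling with square-root post-processing, combined with a spatial pyramid) can be sketched as follows. The grid layout is an assumption: level l is taken to split the feature map into a 2^l × 2^l grid, and the pooled vectors of all levels up to the chosen one are concatenated. The function names are hypothetical, and the feature map is assumed to be at least as large as the finest grid so that no cell is empty.

```python
import numpy as np


def pool_cell(cell, method):
    """Pool an (h, w, C) block of activations into a C-dim vector."""
    flat = cell.reshape(-1, cell.shape[-1])
    if method == "max":
        return flat.max(axis=0)
    if method == "sum-sqrt":
        # Assumes non-negative (post-ReLU) activations.
        return np.sqrt(flat.sum(axis=0))
    raise ValueError(f"unknown pooling method: {method}")


def spatial_pyramid_pool(fmap, max_level, method="sum-sqrt"):
    """Concatenate pooled vectors from all cells of levels 0..max_level.

    fmap: (H, W, C) convolutional feature map of one image.
    """
    height, width, _ = fmap.shape
    parts = []
    for level in range(max_level + 1):
        grid = 2 ** level  # assumed grid size for this level
        rows = np.linspace(0, height, grid + 1).astype(int)
        cols = np.linspace(0, width, grid + 1).astype(int)
        for i in range(grid):
            for j in range(grid):
                cell = fmap[rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
                parts.append(pool_cell(cell, method))
    return np.concatenate(parts)
```

Under these assumptions, the "Conv5-3 + Conv5-4" setting would correspond to concatenating the two feature maps along the channel axis before calling `spatial_pyramid_pool`.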
4.4 Experiments for image retrieval
For image retrieval, we evaluate cross-layer pooling on the Oxford5K , Holiday  and Sculpture6K  datasets. These three datasets represent several common scenarios in image retrieval. The objects of interest in Oxford5K are buildings, which have rich texture patterns. In the Holiday dataset, the images to be retrieved are more general scenes and objects. The Sculpture6K dataset focuses on sculptures, which have relatively smooth surfaces. We adopt a setting similar to  and evaluate performance by directly using the query image without the object bounding box. We re-implemented the baseline method of  by strictly following its experimental protocol. Specifically, we apply PCA whitening and calculate the PCA projection matrix from external datasets. For Oxford5K, we learn the PCA projection matrix from Paris6K; for Holiday, we learn it from 5K Flickr images (although these might not be the same images as those used in ); for Sculpture6K, we calculate it from the same 5K Flickr images. The VGG net is used in this experiment. The baseline in  is applied to the conv5-4 layer since this leads to the best performance, while our approach is applied to the conv5-3 and conv5-4 layers. We use binarized cross-layer pooling vectors and select the pooling channels with the largest average activation values on conv5-4. Different numbers of selected channels are evaluated, and the comparison of cross-layer pooling and the baseline method in  is shown in Figure 4. From Figure 4, it is clear that cross-layer pooling performs much better than the direct convolutional layer pooling baseline  once a sufficiently large number of channels is selected. Selecting is typically sufficient to achieve good performance. Considering that all the features are binarized, the computational cost is still reasonable despite the fact that the dimensionality is higher than that of the baseline method.
Note that directly binarizing the feature obtained from the convolutional layer pooling baseline leads to a significant performance drop, while binarizing our representation does not. It is also interesting that when the number of selected channels becomes too large, that is, when we are close to using all of the available pooling channels, the retrieval performance starts to drop. This is probably because including channels with small average activation values tends to introduce more noise during retrieval.
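The channel selection and binarization steps can be illustrated with the following sketch. Several details are assumptions rather than part of the described protocol: the average activation is computed over the spatial locations of a single image, the same selected channels are reused when describing database images, the local features are assumed to be already PCA-whitened (so that their signs are informative), and the similarity measure and function names are hypothetical.

```python
import numpy as np


def select_channels(guide_acts, k):
    """Indices of the k channels with the largest average activation.

    guide_acts: (N, K) activations of the guiding layer at N locations.
    """
    mean_act = guide_acts.mean(axis=0)
    return np.argsort(mean_act)[::-1][:k]


def binarized_descriptor(local_feats, guide_acts, channels):
    """Cross-layer pooling restricted to `channels`, followed by the sign."""
    pooled = guide_acts[:, channels].T @ local_feats  # shape (k, D)
    return np.sign(pooled).reshape(-1).astype(np.int8)


def sign_agreement(a, b):
    """Normalized inner product of two sign descriptors (assumed similarity)."""
    return float(a.astype(np.int32) @ b.astype(np.int32)) / len(a)
```

In this sketch the descriptor length grows linearly with the number of selected channels, which reflects the trade-off discussed above between dimensionality and the noise introduced by weakly activated channels.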
|Method||Conv5-3||Conv5-4||Conv5-3 + Conv5-4|
|max-pooling, SPM level0||57.3/67.7/75.5||60.7/69.0/80.5||60.6/55.8/77.8|
|max-pooling, SPM level1||66.3/67.6/76.8||67.7/70.2/81.6||66.0/59.1/78.9|
|max-pooling, SPM level2||68.0/67.6/76.8||69.3/69.9/81.9||67.9/61.5/79.8|
|sum-sqrt-pooling, SPM level0||66.4/66.4/77.7||68.2/69.3/81.7||69.9/65.8/80.9|
|sum-sqrt-pooling, SPM level1||68.0/64.1/77.5||70.0/69.8/81.7||70.7/63.6/80.6|
|sum-sqrt-pooling, SPM level2||69.0/64.1/77.4||70.3/69.8/81.9||70.9/64.1/80.7|
|Dataset||Feature sign quantization||Original|
|MIT-67 (VGG, AConv)||77.9%||78.2%|
|Birds-200 (VGG, OConv)||76.5%||77.0%|
|PASCAL07 (VGG, AConv)||85.1%||87.0%|
We have proposed a new method, termed cross-convolutional layer pooling, to create image representations from the activations of two consecutive convolutional layers of a pre-trained CNN. We realize this idea with two types of convolutional layer implementations and show that these two implementations are particularly well suited to different recognition tasks. We also propose a variation of the cross-convolutional layer pooling approach for the image retrieval task. By conducting experiments on popular image classification and image retrieval datasets, we show that the proposed method leads to superior performance over various existing methods of using pre-trained DCNNs to extract image representations.
This work was supported in part by the Data to Decisions Cooperative Research Centre, an Australian Research Council Future Fellowship (FT120100969), and Australian Research Council projects DP160103710 and LP130100156.
C. Shen is the corresponding author.
-  A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Advances in Neural Inf. Process. Syst., 2012, pp. 1106–1114.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2009.
-  A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: an astounding baseline for recognition,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn. Workshop, 2014.
-  H. Azizpour, A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson, “From generic to specific deep representations for visual recognition,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn. Workshop, 2015.
-  A. Babenko and V. S. Lempitsky, “Aggregating deep convolutional features for image retrieval,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
-  Y. Gong, L. Wang, R. Guo, and S. Lazebnik, “Multi-scale orderless pooling of deep convolutional activation features,” Proc. Eur. Conf. Comp. Vis., 2014.
-  L. Liu, C. Shen, L. Wang, A. van den Hengel, and C. Wang, “Encoding high dimensional local features by sparse coding based fisher vectors,” in Proc. Advances in Neural Inf. Process. Syst., 2014.
-  Y. Li, L. Liu, C. Shen, and A. van den Hengel, “Mid-level deep pattern mining,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
-  N. Zhang, R. Farrell, and T. Darrell, “Pose pooling kernels for sub-category recognition,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2012.
-  L. Liu, C. Shen, and A. van den Hengel, “The treasure beneath convolutional layers: Cross-convolutional-layer pooling for image classification,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learn. Representations, 2014.
-  Y. Li, L. Liu, C. Shen, and A. van den Hengel, “Mid-level deep pattern mining,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2014.
-  M. Cimpoi, S. Maji, and A. Vedaldi, “Deep filter banks for texture recognition and segmentation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
-  R. B. Girshick, “Fast R-CNN,” in Proc. IEEE Int. Conf. Comp. Vis., 2015.
-  T. Lin, A. RoyChowdhury, and S. Maji, “Bilinear CNN models for fine-grained visual recognition,” arXiv: Comp. Res. Repository, vol. abs/1504.07889, 2015.
-  D. Yoo, S. Park, J. Lee, and I. S. Kweon, “Fisher kernel for deep neural activations,” arXiv preprint, 2014.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
-  F. Perronnin, J. Sánchez, and T. Mensink, “Improving the fisher kernel for large-scale image classification,” in Proc. Eur. Conf. Comp. Vis., 2010.
-  M. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proc. Eur. Conf. Comp. Vis., 2014.
-  P. Welinder et al., “Caltech-UCSD Birds 200,” Technical Report, California Institute of Technology, 2010.
-  L. A. Gatys, A. S. Ecker, and M. Bethge, “Texture synthesis using convolutional neural networks,” in Proc. Advances in Neural Inf. Process. Syst., May 2015.
-  I. Ustyuzhaninov, W. Brendel, L. A. Gatys, and M. Bethge, “Texture synthesis using shallow convolutional networks with random filters,” arXiv preprint, 2016.
-  L. A. Gatys, A. S. Ecker, and M. Bethge, “A neural algorithm of artistic style,” Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
-  L.-J. Li, H. Su, E. P. Xing, and F.-F. Li, “Object bank: A high-level image representation for scene classification and semantic feature sparsification,” in Proc. Advances in Neural Inf. Process. Syst., 2010.
-  Y. Li, L. Liu, C. Shen, and A. van den Hengel, “Mining mid-level visual patterns with deep cnn activations,” Int. J. Comput. Vision, 2016.
-  C. Doersch, A. Gupta, and A. A. Efros, “Mid-level visual element discovery as discriminative mode seeking,” in Proc. Advances in Neural Inf. Process. Syst., 2013.
-  M. Pandey and S. Lazebnik, “Scene recognition and weakly supervised object localization with deformable part-based models,” in Proc. IEEE Int. Conf. Comp. Vis., 2011, pp. 1307–1314.
-  A. Quattoni and A. Torralba, “Recognizing indoor scenes,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2009, pp. 413–420.
-  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “PASCAL visual object classes challenge 2007 (VOC2007) results,” http://host.robots.ox.ac.uk/pascal/VOC/.
-  C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011, Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
-  N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part-based R-CNNs for fine-grained category detection,” in Proc. Eur. Conf. Comp. Vis., 2014.
-  J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “DeCAF: A deep convolutional activation feature for generic visual recognition,” in Proc. Int. Conf. Mach. Learn., 2014.
-  N. Zhang, R. Farrell, F. Iandola, and T. Darrell, “Deformable part descriptors for fine-grained recognition and attribute prediction,” in Proc. IEEE Int. Conf. Comp. Vis., December 2013.
-  R.-W. Zhao, J. Li, Y. Chen, J.-M. Liu, Y.-G. Jiang, and X. Xue, “Regional gating neural networks for multi-label image classification,” Proc. British Machine Vis. Conf., 2016.
-  K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” Proc. British Machine Vis. Conf., 2014.
-  Z. Song, Q. Chen, Z. Huang, Y. Hua, and S. Yan, “Contextualizing object detection and classification,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2011.
-  Q. Chen, Z. Song, Y. Hua, Z. Huang, and S. Yan, “Hierarchical matching with side information for image classification.,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2012, pp. 3426–3433.
-  J. Dong, W. Xia, Q. Chen, J. Feng, Z. Huang, and S. Yan, “Subcategory-aware object classification.,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2013, pp. 827–834.
-  P. Agrawal, R. Girshick, and J. Malik, “Analyzing the performance of multilayer neural networks for object recognition,” in Proc. Eur. Conf. Comp. Vis., 2014.
-  S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2006.
-  J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2007.
-  H. Jégou, M. Douze, and C. Schmid, “Hamming embedding and weak geometry consistency for large scale image search,” in Proc. Eur. Conf. Comp. Vis., 2008.
-  R. Arandjelović and A. Zisserman, “Smooth object retrieval using a bag of boundaries,” in Proc. IEEE Int. Conf. Comp. Vis., 2011.