In this paper we show that by carefully making good choices for various detailed but important factors in a visual recognition framework using deep learning features, one can achieve a simple, efficient, yet highly accurate image classification system. We first list 5 important factors, based on both existing research and ideas proposed in this paper. Key findings about these factors include: 1) ℓ_2 matrix normalization is more effective than unnormalized or ℓ_2 vector normalization, 2) the proposed natural deep spatial pyramid is very effective, and 3) a very small K in Fisher Vectors surprisingly achieves higher accuracy than normally used large K values. Along with other choices (convolutional activations and multiple scales), the proposed DSP framework is not only intuitive and efficient, but also achieves excellent classification accuracy on many benchmark datasets. For example, DSP's accuracy on SUN397 is 59.78%, compared with 53.86% for the previous state of the art.
Feature representation is among the most important topics (if not the most important one) in current state-of-the-art visual recognition tasks. Over the past decade, handcrafted features (e.g., SIFT and HOG) were very popular, and they were often encoded into a high-dimensional vector by the Bag-of-Visual-Words (BOVW) framework. The BOVW representation was further improved by the Vector of Locally Aggregated Descriptors (VLAD) and Fisher Vector (FV) methods, which add higher-order statistics. However, such handcrafted features are significantly outperformed in visual recognition by the recent deep features from convolutional neural networks (CNNs).
In spite of the impressive results achieved by deep features, many factors affect the performance of deep feature representations, and many details have a large impact on the recognition accuracy of CNN features. Those factors include, for example, how the deep net is trained. Zhou et al. evaluated the performance of deep features from the same network architecture learned on different training sets, and [1] studied other factors, including the architectures of deep nets and data augmentation.
After a deep net has been successfully trained, more factors and decisions await. In other words, how shall we use the deep features for image recognition? Studies have been carried out very recently, and some important details have been worked on. However, a systematic study of “what factors are out there?” and “what choices should be made?” is missing. In this paper, we present our study of these questions. Specifically, suppose we are given a pre-trained deep CNN model,
What are important factors in utilizing this model? Based on existing studies in the literature and our new proposals, we make a list of five important factors.
What decisions are the best concerning these factors? We carefully evaluate different choices and present our answers to this question. Some choices (e.g., the choice of K in FV) are quite different from previous practices in the community.
What effects do these factors have? We show that they are key to high recognition accuracy. By combining the best choices for the 5 factors we raise, we propose Deep Spatial Pyramid (DSP), a framework that properly utilizes deep CNN features. DSP has the following properties:
High efficiency and flexibility. DSP achieves high processing speed, with roughly 150 ms to process an image. DSP also processes images of any aspect ratio or resolution.
Small storage cost. The final DSP representation is memory-efficient, with around 12k dimensions. This is much shorter than existing combinations of CNN features and FV/VLAD, and is advantageous in large-scale problems.
We first present the framework, preliminaries, and the list of important factors in Sec. 2. The study of the best decisions for these factors is presented in Sec. 3. The study of the FV size K, however, is special enough to have its own section, Sec. 4. DSP is evaluated as a whole system in Sec. 5, where it is compared with state-of-the-art visual recognition methods. Sec. 6 concludes this paper.
Our study follows the framework illustrated in Fig. 1. In the first step, we feed an input image of arbitrary resolution into a pre-trained CNN model to extract deep activations. In the second step, a visual dictionary with K dictionary items is trained on the deep descriptors from the training images. The third step overlays a spatial pyramid partition on the deep activations of an image, dividing them into blocks over several pyramid levels. Each spatial block is represented as a vector by the improved Fisher Vector, so every block corresponds to one FV. In the fourth and final step, we concatenate the FVs of all blocks to form a single feature vector as the final image-level representation.
Our framework does not consider how the pre-trained CNN is obtained or how an image is classified after its representation is obtained. These can be viewed as preliminary factors, and we follow the commonly used decisions for them in the literature.
In practice, some CNN models (e.g., those of Krizhevsky et al. and Zeiler and Fergus) are widely used as deep feature extractors in image-related tasks. Recently, however, networks that are even deeper than these have been shown to further improve CNN performance. They are characterized by deeper and wider architectures and smaller convolutional filters than traditional CNNs such as [11, 28]. Examples of deeper nets include GoogLeNet and VGG Net-D. Our work is based on the released VGG Net-D architecture. This network consists of 13 layers of 3×3 convolutional kernels, with 5 max-pooling layers interspersed, and concludes with 3 fully connected layers. The width of this network starts from 64 in the first layer and increases by a factor of 2 after each max-pooling layer until it reaches 512. For classification, we use a linear SVM classifier.
In the rest of this paper, we use the following notation. The term “feature map” indicates the convolutional results (after applying the max-pooling) of one filter, the term “activations” indicates the feature maps of all filters in a convolutional layer, and the term “descriptor” indicates the d-dimensional component vector of the activations. “pool” refers to the activations of the max-pooled last convolutional layer, and “fc” refers to the activations of the last fully connected layer.
With these preliminaries and notations, we now discuss the important factors inside this framework.
Which activation to use? Deep features for an image can be extracted from either the convolutional layers or the fully connected layers of a pre-trained CNN. The original idea is to use the last fully connected layer directly for classification. Recently, activations from the convolutional layers have demonstrated their value [28, 13, 2, 25]. Which one shall we adopt?
How to normalize the deep features before feeding them into a classifier or the next level of processing? It is not yet a common practice to normalize CNN activations. What are the viable choices and which one is the best?
How many components in the FV representation? The GMM in FV consists of K Gaussian components. It is known that, in general, a large K (e.g., 256) leads to high accuracy for fully connected activations [7, 27], dense SIFT, and action features. However, a large K leads to a very long representation (hundreds of thousands of dimensions). Is a large K really necessary?
Shall we capture spatial information (and how)? A general CNN requires a fixed input image size. He et al. proposed SPP-Net, which uses Spatial Pyramid Pooling (SPP) to remove the fixed-size constraint. SPP-Net pools the deep activations of the last convolutional layer to generate fixed-length outputs, which are then fed into the fully connected layers. Is there a simpler and more natural way to capture spatial information?
Shall we use information from multiple scales? Yoo et al. replace the fully connected layers with equivalent convolutional layers to obtain a large number of dense deep descriptors. All the activations are then merged into a single vector by Multi-scale Pyramid Pooling (MPP), which utilizes the activations of multi-scale CNNs. MPP, however, is computationally expensive. Is there an efficient way to capture information from multiple scales?
These factors may seem too detailed to be important. However, existing methods made very different decisions on these questions, and these differences may well explain their performance differences. We summarize these differences in Table 1.
In Table 1, “DF” refers to deep features, where “F” and “C” represent the fully connected and convolutional layer, respectively. “Norm” refers to how the deep activations are normalized; “K” indicates the number of visual words or Gaussian components; “SP” refers to spatial pyramid; “Ms” refers to multiple scale. In addition, “-” means that a method does not involve the corresponding factor. Some methods also use PCA to reduce the dimensionality of deep activations.
From Table 1, it is clear that the proposed DSP is flexible (accepting images of any size), efficient (fully convolutional with a very small K), and makes full use of the image (spatial pyramid and multiple scales). We will explain how these decisions and choices are made in the next section.
Convolutional neural networks consist of alternately stacked convolutional layers and pooling layers, followed by one or more fully connected layers. The convolutional layers generate feature maps by applying linear convolutional filters followed by nonlinear activation functions such as rectified linear units, and the pooling layers then max-pool these outputs within local neighborhoods. Finally, the activations of the last convolutional layer are fed into the fully connected layers, followed by a soft-max classifier.
The feature maps of the top convolutional layers are known to contain mid- and high-level information, e.g., object parts or complete objects. In Fig. 2, we visualize the feature maps generated by the last convolutional layer for an input image. In this figure, the strongest responses of the 194th and 207th feature maps correspond to the person and the motorcycle in the input image, respectively. Thus, one major difference between convolutional and fully connected layer activations is that the former are directly embedded with rich semantic information about image patches, while the latter are not necessarily so.
Furthermore, the fully connected layers require a fixed image size (e.g., 224×224). In contrast, convolutional layers accept input images of arbitrary resolution and aspect ratio. The pool activations can be formulated as an order-3 tensor of size h×w×d, which contains h×w cells, each holding one d-dimensional deep descriptor. For example, we obtain 7×7×512 activations if the input image size is 224×224. Convolutional layer deep descriptors have been used successfully in [13, 2, 25].
These deep descriptors contain more spatial information than the activations of the fully connected layers, e.g., the top-left cell’s d-dimensional deep descriptor is generated using only the top-left part of the input image, ignoring other pixels. In addition, the fully connected layers have a large computational cost, because they contain roughly 90% of all the parameters of the whole CNN model.
Thus, in DSP we use a fully convolutional network by removing the fully connected layers.
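As a rough illustration of how the size of the pool activations depends on the input size, the following sketch computes the grid shape for a network with 5 max-pooling layers and 512 channels in its last convolutional layer, as described above. It assumes that each pooling layer halves the spatial size and rounds down; the rounding rule of a particular implementation may differ, and the function name is hypothetical.

```python
def pool_grid_shape(height, width, num_pools=5, channels=512):
    """Return (h, w, d) of the last pooled activations for a given input size."""
    h, w = height, width
    for _ in range(num_pools):
        # Assumption: 2x2 max-pooling with stride 2, rounding down.
        h, w = h // 2, w // 2
    return h, w, channels


print(pool_grid_shape(224, 224))  # (7, 7, 512)
print(pool_grid_shape(448, 320))  # (14, 10, 512)
```

Under this assumption, a 224×224 input yields 49 deep descriptors, and larger or non-square inputs simply yield a larger or non-square grid.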
Let X = [x_1, ..., x_T] (X ∈ R^{d×T}) be the matrix of d-dimensional deep descriptors extracted from an image via a pre-trained CNN model. X is usually processed by dimensionality reduction methods such as PCA before its columns are pooled into a single vector using VLAD or FV [7, 27]. PCA is usually applied to SIFT features or fully connected layer activations, since it has been shown empirically to improve the overall recognition performance. However, our experiments show that PCA significantly hurts recognition when applied to the fully convolutional activations. Thus, it is not applied to fully convolutional deep descriptors in this paper.
In addition, the deep descriptors inside X are not normalized in the current processing of deep visual descriptors. We first try to normalize X with the ℓ_2 vector normalization (i.e., x_t ← x_t / ‖x_t‖_2), which leads to better results than no normalization on most datasets, with Stanford40 as the exception, as shown in Table 2.
We also propose a novel ℓ_2 matrix normalization (i.e., x_t ← x_t / ‖X‖_2), where ‖X‖_2 is the matrix spectral norm, i.e., the largest singular value of X. A benefit of this normalization is that it normalizes x_t using information from the entire image X. It is a bit surprising to observe that it is more effective than the commonly used vector normalization, sometimes by a large margin. An intuitive interpretation is that the matrix normalization uses global information, making it more robust to changes such as illumination and scale.
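The two normalizations can be sketched in NumPy as follows: vector normalization divides each descriptor by its own ℓ_2 norm, while matrix normalization divides every descriptor by the largest singular value of the whole descriptor matrix. This is a minimal sketch under stated assumptions: descriptors are stored one per row (the transpose of the layout in the text, which leaves the spectral norm unchanged), and the function names and the small `eps` guard are illustrative, not from the paper.

```python
import numpy as np


def l2_vector_normalize(X, eps=1e-12):
    """Divide each descriptor (row of X) by its own l2 norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)


def l2_matrix_normalize(X, eps=1e-12):
    """Divide all descriptors by the spectral norm (largest singular value) of X."""
    spectral_norm = np.linalg.norm(X, 2)  # ord=2 on a 2-D array is the spectral norm
    return X / max(spectral_norm, eps)


# Toy example: 49 descriptors of 512 dimensions (e.g., a 7x7 grid of cells).
rng = np.random.default_rng(0)
X = rng.random((49, 512))
X_vec = l2_vector_normalize(X)
X_mat = l2_matrix_normalize(X)
```

After vector normalization every row has unit length, so relative descriptor magnitudes within the image are discarded. After matrix normalization only the overall scale of the image's descriptors is fixed, and the relative magnitudes are preserved.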
To evaluate the effect of these normalizations and of PCA on classification performance, we use 4 datasets. We use the original resolution of the input images without cropping or warping, and pool the activations using FV with K = 4 (i.e., the GMM has 4 Gaussian components). The experimental results are reported in Table 2. Matrix normalization before FV is found to be important for better performance.
The size of the pool activations varies from image to image because input images have arbitrary sizes. However, classifiers (e.g., SVM or soft-max) require fixed-length vectors. Thus, all the deep descriptors of an image must be pooled to form a single vector. We use the Fisher Vector (FV) to encode the deep descriptors.
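We denote the parameters of the GMM with K components by λ = {ω_k, μ_k, Σ_k; k = 1, ..., K}, where ω_k, μ_k and Σ_k are the mixture weight, mean vector and covariance matrix of the k-th Gaussian component, respectively. The covariance matrices are diagonal, and σ_k² are the variance vectors. Let γ_t(k) be the soft-assignment weight of the descriptor x_t with respect to the k-th Gaussian, and let T be the number of deep descriptors of the image. The FV representations corresponding to μ_k and σ_k are as follows:

$$f_{\mu_k}(X) = \frac{1}{T\sqrt{\omega_k}} \sum_{t=1}^{T} \gamma_t(k)\, \frac{x_t - \mu_k}{\sigma_k} \tag{1}$$

$$f_{\sigma_k}(X) = \frac{1}{T\sqrt{2\omega_k}} \sum_{t=1}^{T} \gamma_t(k) \left[ \frac{(x_t - \mu_k)^2}{\sigma_k^2} - 1 \right] \tag{2}$$

Here the division and squaring are element-wise. Note that f_{μ_k}(X) and f_{σ_k}(X) are both d-dimensional vectors. The final Fisher Vector is the concatenation of the gradients f_{μ_k}(X) and f_{σ_k}(X) for all K Gaussian components. Thus, FV represents the set of deep descriptors with a 2dK-dimensional vector. In addition, the Fisher Vector is improved by power normalization with a factor of 0.5, followed by ℓ_2 vector normalization.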
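As a minimal sketch of the FV encoding described above, the following NumPy function computes the standard improved Fisher Vector: gradients with respect to the means and variances of a diagonal-covariance GMM, followed by power normalization (factor 0.5) and ℓ_2 normalization. It assumes the GMM parameters have already been estimated elsewhere. The function name, the argument layout (one descriptor per row), and the toy parameters are illustrative and not taken from the paper.

```python
import numpy as np


def fisher_vector(X, weights, means, variances):
    """Improved Fisher Vector of descriptors X under a diagonal GMM.

    X: (T, d) descriptors; weights: (K,); means, variances: (K, d).
    Returns a vector of length 2 * d * K.
    """
    T, d = X.shape
    diff = X[:, None, :] - means[None, :, :]                     # (T, K, d)
    # Log-density of each descriptor under each diagonal Gaussian.
    log_prob = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    # Soft-assignment weights, computed stably in the log domain.
    log_post = np.log(weights) + log_prob
    log_post -= log_post.max(axis=1, keepdims=True)
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)                    # (T, K)

    z = diff / np.sqrt(variances)                                # (T, K, d)
    g_mu = np.einsum('tk,tkd->kd', gamma, z)
    g_mu /= T * np.sqrt(weights)[:, None]
    g_sigma = np.einsum('tk,tkd->kd', gamma, z ** 2 - 1)
    g_sigma /= T * np.sqrt(2 * weights)[:, None]

    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                       # power normalization
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv                         # l2 normalization


# Toy example with K = 2 components and d = 8 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(49, 8))
fv = fisher_vector(X,
                   weights=np.array([0.6, 0.4]),
                   means=rng.normal(size=(2, 8)),
                   variances=np.ones((2, 8)))
print(fv.shape)  # (32,)
```

The output length is 2dK regardless of the number of descriptors T, which is what allows images of different sizes to share one fixed-length representation.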
We will further study how to choose a proper K for FV in Sec. 4.
The proposed method is named DSP (Deep Spatial Pyramid), since adding spatial pyramid information is its key part. Adding spatial information through a spatial pyramid has been shown to significantly improve image recognition performance when dense SIFT features are used. How can we efficiently and effectively utilize spatial information with fully convolutional activations?
The SPP-net method  adds a spatial pyramid pooling layer to deep nets, which has improved recognition performance. However, since we are using FV to pool activations from a fully convolutional network, a more intuitive and natural way exists.
As previously discussed, one single cell (deep descriptor) in the last convolutional layer corresponds to one local image patch in the input image, and the set of all convolutional layer cells forms a regular grid of image patches in the input image. This is a direct analogy to the dense SIFT feature extraction framework. Instead of a regular grid of SIFT vectors extracted from local image patches, a grid of deep descriptors is extracted from larger image patches by a CNN.
Thus, we can easily form a natural deep spatial pyramid by partitioning an image into sub-regions and computing local features inside each sub-region. In practice, we just need to spatially partition the cells of activations in the last convolutional layer, and then pool deep descriptors in each region separately using FV. The operation of DSP is illustrated in Fig. 3.
Level 0 simply aggregates all cells using FV. Level 1, however, splits the cells into 5 regions according to their spatial locations: the 4 quadrants and 1 centerpiece. Then, 5 FVs are generated, one from the activations inside each spatial region. Note that the level 1 spatial pyramid we use is different from the classic one: we follow Wu and Rehg and use an additional spatial region in the center of the image. A DSP using two levels then concatenates all 6 FVs from level 0 and level 1 to form the final image representation.
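The two-level partition can be sketched as follows. The sketch splits an h×w×d activation tensor into the 6 regions described above and returns the descriptors of each region, ready to be pooled separately. The text does not specify the exact extent of the center region, so the half-size centered window used here is an assumption, as are the function name and the rounding of odd grid sizes.

```python
import numpy as np


def dsp_regions(activations):
    """Split (h, w, d) activations into 6 descriptor sets: level 0 + level 1."""
    h, w, d = activations.shape
    r_mid, c_mid = h // 2, w // 2
    # Assumption: the center region is a roughly half-size window in the middle.
    r0, c0 = h // 4, w // 4
    r1, c1 = r0 + max(1, h // 2), c0 + max(1, w // 2)
    blocks = [
        activations,                     # level 0: all cells
        activations[:r_mid, :c_mid],     # level 1: top-left quadrant
        activations[:r_mid, c_mid:],     # level 1: top-right quadrant
        activations[r_mid:, :c_mid],     # level 1: bottom-left quadrant
        activations[r_mid:, c_mid:],     # level 1: bottom-right quadrant
        activations[r0:r1, c0:c1],       # level 1: center region
    ]
    # Each block becomes an (n_cells, d) matrix of deep descriptors.
    return [block.reshape(-1, d) for block in blocks]


regions = dsp_regions(np.zeros((7, 7, 512)))
print([r.shape[0] for r in regions])  # [49, 9, 12, 12, 16, 9]
```

Each of the 6 descriptor sets would then be encoded by FV and the 6 resulting vectors concatenated. Note that for very small grids (e.g., a single row of cells) some quadrants would be empty under this splitting rule.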
This proposed DSP method is summarized in Algorithm 1.
To capture variations of the activations caused by variations of the objects in an image, we generate a multiple-scale pyramid, extracted from S differently rescaled versions of the original input image. We feed the images of all scales into a pre-trained CNN model and extract deep activations. At each scale, the corresponding rescaled image is encoded into a fixed-length vector by DSP. Therefore, we have S vectors of the same dimensionality, and they are merged into a single vector by average pooling, as

$$f_m = \frac{1}{S} \sum_{s=1}^{S} f_s \tag{3}$$

where f_s is the DSP representation extracted from scale level s. Finally, ℓ_2 normalization is applied to f_m. Note that each vector f_s is already ℓ_2 normalized, as shown in Algorithm 1.
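The merging step can be sketched in a few lines of NumPy: the per-scale vectors are averaged element-wise and the result is ℓ_2 normalized. The function name and the toy inputs are illustrative; the sketch assumes that the per-scale representations have already been computed and all have the same length.

```python
import numpy as np


def merge_scales(per_scale_vectors, eps=1e-12):
    """Average-pool same-length per-scale vectors, then l2-normalize the result."""
    stacked = np.stack(per_scale_vectors)   # (S, D)
    merged = stacked.mean(axis=0)           # average pooling over scales
    return merged / max(np.linalg.norm(merged), eps)


# Toy example: 5 scales, each giving an l2-normalized 12-dimensional vector.
rng = np.random.default_rng(0)
vectors = [v / np.linalg.norm(v) for v in rng.normal(size=(5, 12))]
f_m = merge_scales(vectors)
```

Because the merged vector has the same length as each per-scale vector, using more scales increases extraction time but not storage.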
The multi-scale DSP is related to the MPP proposed by Yoo et al. A key difference between our method and MPP is that DSP encodes spatial information while MPP does not.
In this section, we discuss one key characteristic of DSP, i.e., the number of GMM components K.
Our experiments show that DSP achieves satisfactory classification performance when the number of GMM components K is small (e.g., from 1 to 4). In fact, when different values of K are used, the highest recognition accuracy is usually achieved by setting K to 1 or 2.
This phenomenon is not consistent with common practice in image classification using local descriptors with FV encoding. When deep learning features are used together with FV, a large K value is also used. For example, Yoo et al. set K to 256 when they trained their visual vocabulary. More examples of large K values in previous work can be found in Table 1. A small K value is very beneficial in terms of CPU and storage costs. But why does DSP require only a small K?
We believe the answer is that DSP uses a small number of deep descriptors per image, i.e., the number of descriptors per image, T, is a small integer. We usually extract no more than 100 512-dimensional deep descriptors from the last convolutional layer for one image, while a previous method represented one image with 4,410 vectors of 4,096-dimensional dense CNN activations. If K is set to a large number (e.g., 128 or 256), the resulting FV representation will be problematic.
First, if a large K is used in DSP, there will not be enough deep descriptors to estimate an accurate GMM, because each training image contributes only a small number of deep descriptors. An inaccurate GMM will seriously harm the classification performance. Second, many FV components will contain only zeros, because there are more Gaussian components than CNN descriptors. We conjecture that this causes FV to lose accuracy.
We also study this phenomenon empirically. In Fig. 4, we plot the distribution of the priors of the GMM components (i.e., ω_k) in DSP. There are 14 plots for the 7 datasets used in our experiments. Two plots are shown for each dataset, corresponding to different numbers of GMM components (shown on the horizontal axis), i.e., 64 and 256. The vertical axis shows the value of ω_k for each Gaussian component.
It is easy to see that, for most datasets, one or two ω_k values are much larger than the rest. For example, in the SUN397 dataset, the two tall bars indicate that two ω_k values are above 0.3, and their sum is around 0.7. In other words, only 2 Gaussian components are responsible for about 70% of the variations of the distribution. The remaining 30% might be related to noisy or background image patches. Thus, K = 2 might be the best choice in this particular case. In most datasets, we observe the same phenomenon: one or two Gaussian components dominate the entire distribution. This observation might explain why DSP needs only a small number of Gaussian components. Since a small K in DSP results in a much lower computational cost, DSP can efficiently handle large-scale image classification tasks.
We further evaluate the impact of K in DSP and multiple-scale DSP (Ms-DSP). Fig. 5 shows the classification results as a function of the number of Gaussians K of the GMM, where K is increased by a factor of 2 at each step. A smaller K always obtains better classification performance for both DSP and Ms-DSP. As K increases, the discriminative ability of DSP and Ms-DSP drops. The DSP or Ms-DSP feature vector may become too sparse when K is increased, which is detrimental to classification. When K = 2, a DSP representation has only 12,288 dimensions. The entire DSP pipeline (from reading in an image to emitting a prediction) requires on average 0.15 seconds per image.
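The effect of K on the representation length is simple arithmetic: each FV has two d-dimensional gradient blocks per Gaussian component, and a two-level DSP concatenates 6 FVs. The short sketch below tabulates this for 512-dimensional descriptors; the function name and default values are illustrative, chosen to match the setting described in the text.

```python
def dsp_dimension(K, d=512, num_regions=6):
    """Length of a DSP vector: 2 * d * K per FV, one FV per spatial region."""
    return 2 * d * K * num_regions


for K in (1, 2, 4, 64, 256):
    print(K, dsp_dimension(K))
# 1 6144
# 2 12288
# 4 24576
# 64 393216
# 256 1572864
```

With K = 256, the same pyramid would need more than 1.5 million dimensions per image, over a hundred times the length obtained with K = 2.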
For a fixed K, Ms-DSP always significantly outperforms DSP. This is not surprising since, for a given K, Ms-DSP captures more information from the rescaled images, which DSP does not have access to.
The purpose of this section is to evaluate the performance of DSP as a complete pipeline. We report results on three object recognition datasets (Caltech-101, Caltech-256 and PASCAL VOC 2007), three scene recognition datasets (Scene15, MIT Indoor67 and SUN397), and one action recognition dataset (Stanford40). Except for PASCAL VOC 2007 and MIT Indoor67, which have fixed training and test splits, all experiments are repeated on three randomly sampled train/test splits and the average is reported.
Caltech-101 contains 9K labeled images of 101 object categories and a background category. Following previous work, we randomly select 30 images per category for training and test on up to 50 images per class in every split. Caltech-256, with 31K images and 257 classes, is an extension of Caltech-101. Each split contains 60 training images per class, and the rest are used for testing. For PASCAL VOC 2007, which contains 20 object classes, we use its standard protocol: we measure the average precision (AP) and report the mean AP (mAP) over the 20 categories.
Scene15 is composed of 15 different kinds of scenes, where each category has 200 to 400 images. Following previous work, we randomly select 100 images per class for training and use the rest for testing. MIT Indoor67 is a challenging indoor dataset compared with outdoor scene recognition. The dataset has 15,620 images in 67 indoor scene categories. The standard split for this dataset consists of 80 training and 20 test images per category. SUN397 is the largest dataset for scene recognition. It contains 397 categories, and each category has at least 100 images. The training and test splits are fixed and publicly available, where each split has 50 training and 50 test images per category. We use the first three of the 10 public splits in our experiments.
Stanford40 contains 40 diverse daily human actions, with 180 to 300 images for each category. In each split, we randomly select 100 images in each class for training and use the remaining images for testing.
In our experiments, average accuracy rate is used to evaluate the classification performances on Caltech-101, Caltech-256, MIT Indoor67, Scene15, SUN397, and Stanford40. For PASCAL VOC 2007, we employ mean average precision (mAP) to evaluate our proposed method and other approaches.
In our DSP, VGG Net-D is employed as the pre-trained CNN model to extract deep activations. For simplicity, the pre-trained CNN model weights are kept fixed without fine-tuning. Note that we employ VGG Net-D without its fully connected layers in our experiments, so it can accept input images of arbitrary sizes. Input images do not need to be resized to a fixed aspect ratio. However, for efficiency, an image is resized such that its smallest edge is not shorter than 224 pixels and its largest edge is not longer than 1120 pixels. In addition, each image is preprocessed by subtracting the per-pixel mean (computed on the ImageNet images and provided along with the CNN model).
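One possible reading of the resizing rule is sketched below: the aspect ratio is kept, the image is enlarged if its smallest edge is below 224 pixels, and it is shrunk if its largest edge exceeds 1120 pixels. The text does not say which constraint takes priority when both cannot hold (very elongated images), so letting the upper bound win is an assumption, and the function name is hypothetical.

```python
def resized_shape(height, width, min_edge=224, max_edge=1120):
    """Return an aspect-preserving (height, width) within the edge limits."""
    scale = 1.0
    if min(height, width) < min_edge:
        scale = min_edge / min(height, width)   # enlarge small images
    if max(height, width) * scale > max_edge:
        # Assumption: the upper bound takes priority if both cannot be met.
        scale = max_edge / max(height, width)
    return round(height * scale), round(width * scale)


print(resized_shape(100, 200))    # (224, 448)
print(resized_shape(2240, 1120))  # (1120, 560)
print(resized_shape(500, 600))    # (500, 600)
```

Images already within both limits are left untouched, which matches the statement that inputs are not forced to a fixed size or aspect ratio.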
We use K = 2 in FV in this section. An image is represented by the concatenation of the FVs from all the sub-blocks in a two-level deep spatial pyramid. For the multi-scale variant, five rescaled versions of the original input image are used, and the representations of all five scales are merged into a single vector by average pooling as in Eq. 3.
| Methods | Description | Caltech-101 | Caltech-256 | VOC 2007 | Scene15 | SUN397 | MIT Indoor67 | Stanford40 |
State-of-the-art and two baseline results are reported in Table 3. The first baseline is fc, extracted from the last fully connected layer. To extract the fc feature, we resize the image to a resolution of 224×224. ℓ_2 normalization is applied to the fc activations before training the SVM, as suggested in previous work. The other baseline is pool+FV, where the deep descriptors are aggregated into a single vector by orderless FV pooling. For a fair comparison, we use the same input image resolution as in our DSP.
On most datasets, fc already performs well. Pool produces quite good results even though the pool activations are computed using only about 10% of the parameters of the complete CNN model. This shows that fully convolutional features (with a small K in FV and matrix normalization) are powerful, especially on VOC 2007 (84.61% vs. 88.12%) and Stanford40 (71.53% vs. 73.96%).
DSP and multi-scale DSP significantly outperform the baselines and state-of-the-art methods. Compared to the baselines, DSP improves performance on all datasets by 1–5%, especially on SUN397 (53.90% → 59.27%) and Stanford40 (73.96% → 79.75%). This gain is mainly due to the fact that DSP captures spatial information on top of the pool activations. In addition, the fully convolutional network relaxes the constraint that input images must have the same fixed size, so the full image can be fed into a pre-trained CNN without changing its aspect ratio. Combining multiple scales with DSP (Ms-DSP) achieves the best recognition performance on all datasets. Since a fully convolutional network and a small K are used, Ms-DSP is still very efficient.
In terms of mean class recall, our DSP and Ms-DSP results on both Caltech-101 and Caltech-256 are significantly higher than those previously reported (92.7% for Caltech-101 and 86.2% for Caltech-256).
To obtain a powerful deep feature representation, the details have to be right. In other words, decisions on important factors must be carefully studied and made. In this paper, we selected a list of 5 important factors and provided our answers to them. The main findings of this paper form a complete pipeline, DSP (deep spatial pyramid), which integrates the following components: activations from the last convolutional layer, natural processing of input images of any size instead of a fixed size, dense deep features extracted at multiple scales, and most importantly, a natural way to build a spatial pyramid in deep learning. DSP, in spite of being simple and efficient, has excellent performance on many benchmark datasets.
In particular, we emphasize the following new findings.
Normalization: ℓ_2 matrix normalization is more effective than no normalization or ℓ_2 vector normalization.
DSP: DSP can effectively capture the spatial information in a natural and efficient manner.
K size in FV: Pooling deep descriptors needs only a small number of Gaussian components in the Fisher Vector, which leads to lower computational costs.
Other factors and details can be further considered in the DSP framework, which we will study in the future. For example, convolutional activations from multiple layers (cross-layer) might further improve classification accuracy. In addition, VLAD might be a better fit than FV for aggregating deep convolutional activations.
VLFeat: An open and portable library of computer vision algorithms, 2008. Software available at http://www.vlfeat.org/.