1 Introduction
In the recent decade, local patch-based, spatially pooled feature extraction pipelines have been shown to provide good image features for classification. Methods following such a pipeline usually start from densely extracted local image patches (either normalized raw pixel values or hand-crafted descriptors such as SIFT or HOG), and perform dictionary learning to obtain a dictionary of codes (filters). The patches are then encoded into an overcomplete representation using various algorithms, such as sparse coding [14, 17] or a simple inner product with a nonlinear post-processing [4, 10]. After encoding, spatial pooling with average or max operations is carried out to form a global image representation [19, 1]. The encoding and pooling pipeline can be stacked to produce a final feature vector, which is then used to predict the labels for the images, usually via a linear classifier.
There is an abundance of literature on single-layered networks for unsupervised feature encoding. Dictionary learning algorithms have been proposed to find a set of bases that reconstructs local image patches or descriptors well [13, 4], and several encoding methods have been proposed to map the original data to a high-dimensional space that emphasizes certain properties, such as sparsity [14, 19, 20] or locality [17].
A particularly interesting finding in recent papers [3, 15, 4, 16] is that very simple patch-based dictionary learning algorithms, such as K-means or random selection, combined with feed-forward encoding methods with a naive nonlinearity, produce state-of-the-art performance on various datasets. Explanations of this phenomenon often focus on local image patch statistics, such as the frequency selectivity of random samples [16].

A potential problem with such patch-based learning methods is that they may learn redundant features once we take the pooling stage into account: two codes that are uncorrelated may become highly correlated after pooling, due to the introduction of spatial invariance. While using a larger dictionary almost always alleviates this problem, in practice we often want the dictionary to have a limited number of codes, for various reasons. First, feature computation has become the dominant cost in state-of-the-art image classification pipelines, even with purely feed-forward methods (e.g., threshold encoding [4]) or speedup algorithms (e.g., LLC [17]). Second, a reasonably sized dictionary makes it easier to learn further tasks that depend on the encoded features; this is especially true when we have more than one coding-pooling stage, as in stacked deep networks, or when one applies more complex pooling stages such as second-order pooling [2], since a large encoding output immediately drives up the number of parameters in the next layer.
Thus, it would be beneficial to design a dictionary learning algorithm that takes pooling into consideration and learns a compact dictionary. Prior work addressing this problem often resorts to convolutional approaches [12, 22]. These methods are usually able to find dictionaries with a better level of spatial invariance than patch-based K-means, but they are often nontrivial to train when the dictionary size is large, since a convolution operation has to be carried out instead of simple inner products.
In this paper we present a new method that is analogous to patch-based K-means dictionary learning, but takes into account the redundancy that may be introduced in the pooling stage. We show how a K-centroids clustering method, applied to the covariance between candidate codes, can efficiently learn pooling-invariant representations. We also show how one can view dictionary learning as a matrix approximation problem that finds the best approximation to an "oracle" dictionary (which is often very large). It turns out that, under this perspective, the performance of various dictionary learning methods can be explained by recent findings on Nyström subsampling [11, 23].
We first review the feature extraction pipeline and the effect of pooling on the learned dictionary in Section 2, and then describe the proposed two-stage dictionary learning method in Section 3. We analyze why clustering works from a matrix approximation perspective in Section 4, and demonstrate the effectiveness of this simple dictionary learning algorithm on the standard CIFAR-10 and STL benchmark datasets, as well as on a fine-grained classification task, in which we show that feature learning plays an important role.
2 Background
We illustrate the feature extraction pipeline, composed of encoding dense local patches and pooling the encoded features, in Figure 1. Specifically, starting with an input image, we formally define the encoding and pooling stages as follows.
(1) Coding.
In the coding step, we extract local image patches (although we use the term "patches" throughout the paper, the pipeline works with local image descriptors, such as SIFT, as well), and encode each patch into K activation values based on a dictionary of K codes, learned via a separate dictionary learning step. These activations are typically binary (in the case of vector quantization) or continuous (in the case of, e.g., sparse coding), and it is generally believed that having an overcomplete dictionary (K larger than the dimensionality of the patches) while keeping the activations sparse helps classification, especially when linear classifiers are used in the later steps.
We mainly focus on what we call decoupled encoding methods, in which the activation of one code does not rely on the other codes. An example is threshold encoding [4], which computes the inner product between the patch x and each code c_k, with a fixed threshold parameter α: f_k(x) = max(0, c_k^⊤ x − α). Such methods have become increasingly popular, mainly for their efficiency over coupled encoding methods such as sparse coding, for which a joint optimization needs to be carried out. Their employment in several deep models (e.g., [10]) also suggests that such a simple nonlinearity may suffice to learn a good classifier in the later stages.
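As a concrete sketch (not the authors' code; the patch matrix, dictionary, and the threshold value 0.25 are illustrative assumptions), threshold encoding amounts to a single matrix product followed by a shifted rectification:

```python
import numpy as np

def threshold_encode(patches, dictionary, alpha=0.25):
    """One-sided threshold encoding: f_k(x) = max(0, c_k^T x - alpha).

    patches:    (n, d) array of whitened local patches x.
    dictionary: (K, d) array of codes c_k (rows assumed L2-normalized).
    alpha:      threshold; 0.25 here is an arbitrary illustrative value.
    Returns an (n, K) array of non-negative activations.
    """
    return np.maximum(0.0, patches @ dictionary.T - alpha)
```

Because the encoding is decoupled, each of the K activations depends only on its own code, so this computation parallelizes trivially across codes.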
Learning the dictionary: Recently, it has been found that relatively simple dictionary learning and encoding approaches lead to surprisingly good performance [3, 16]. For example, to learn a dictionary of K codes from randomly sampled patches, each reshaped as a vector of pixel values, one could simply adopt the K-means algorithm, which minimizes the squared distance between each patch x_i and its nearest code: min Σ_i min_k ‖x_i − c_k‖². We refer to [3] for a detailed comparison of different dictionary learning and encoding algorithms.
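A minimal Lloyd-style K-means for dictionary learning might look as follows (a sketch under the assumption of whitened, float-valued patches; the normalized K-means variant used in the paper additionally renormalizes the codes after each update):

```python
import numpy as np

def kmeans_dictionary(patches, K, iters=20, seed=0):
    """Learn a (K, d) codebook by plain K-means on local patches."""
    rng = np.random.default_rng(seed)
    codes = patches[rng.choice(len(patches), K, replace=False)]
    for _ in range(iters):
        # assignment step: each patch goes to its nearest code
        d2 = ((patches[:, None, :] - codes[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        # update step: each code becomes the mean of its assigned patches
        for k in range(K):
            members = patches[assign == k]
            if len(members):
                codes[k] = members.mean(axis=0)
            else:  # re-seed empty clusters with a random patch
                codes[k] = patches[rng.integers(len(patches))]
    return codes
```

For the dictionary sizes used in this paper one would shard the assignment step over patches, which is what makes K-means attractive at scale.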
(2) Pooling.
Since the coding result is highly overcomplete and highly redundant, the pooling layer aggregates the activations over a spatial region of the image to obtain a K-dimensional vector. Each dimension of the pooled feature is obtained by taking the activations of the corresponding code in the given spatial region (also called the receptive field in the literature) and applying a predefined operator (usually average or max) to the set of activations.
Figure 1 shows an example where average pooling is carried out over the whole image. In practice we may define multiple spatial regions per image (such as a regular grid), and the global representation for the image is then a vector of size K times the number of spatial regions.
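For instance, average pooling over the four quadrants of an activation map can be sketched as follows (shapes and names are illustrative, not from the paper):

```python
import numpy as np

def quadrant_average_pool(act_map):
    """Average-pool an (H, W, K) activation map over the four image quadrants.

    Returns a (4 * K,) vector: one K-dimensional pooled vector per quadrant.
    """
    H, W, K = act_map.shape
    h, w = H // 2, W // 2
    quads = [act_map[:h, :w], act_map[:h, w:], act_map[h:, :w], act_map[h:, w:]]
    return np.concatenate([q.mean(axis=(0, 1)) for q in quads])
```

Replacing `mean` with `max` gives max pooling; adding more regions (e.g., a finer grid) simply concatenates more K-dimensional blocks.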
Since feature extraction for image classification often involves a spatial pooling stage, patch-level dictionary learning may not find dictionaries that produce the most informative pooled outputs. In fact, one could reasonably argue that it does not: one immediate reason is that patch-based dictionary learning algorithms often yield similar Gabor filters with small translations. Such filters, when pooled over a certain spatial region, produce highly correlated responses and lead to redundancy in the feature representation. Figure 2 shows two such examples, where filters produce uncorrelated patch-based responses but highly correlated pooled responses.
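A toy 1-D experiment (entirely illustrative; the step-edge filter and white-noise signal model are our own assumptions) reproduces this effect: two translated copies of the same edge filter give weakly correlated rectified responses at any fixed patch location, yet almost perfectly correlated responses once averaged over the whole signal:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.array([-1.0, -1.0, 1.0, 1.0])   # a step-edge filter
n_imgs, length = 2000, 64

patch_pairs, pooled_pairs = [], []
for _ in range(n_imgs):
    x = rng.normal(size=length)        # toy 1-D "image"
    y = np.maximum(0, np.convolve(x, f[::-1], 'valid'))  # rectified responses
    # code 2 is code 1 translated by one pixel, so its response map is
    # simply y shifted by one position
    patch_pairs.append((y[10], y[11]))                  # one fixed patch
    pooled_pairs.append((y[:-1].mean(), y[1:].mean()))  # average-pooled

c_patch = np.corrcoef(np.array(patch_pairs).T)[0, 1]
c_pool = np.corrcoef(np.array(pooled_pairs).T)[0, 1]
# c_patch stays modest while c_pool approaches 1
```

The gap between the two correlations is exactly the redundancy that patch-based dictionary learning cannot see.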
Convolutional approaches [12, 22] are usually able to find dictionaries that are more spatially invariant than patch-based K-means, but the learning may not scale as well as simple clustering algorithms, especially with hundreds or thousands of codes. In addition, convolutional approaches may still not solve the problem of inter-code invariance: for example, the response of a colored edge filter might have a high correlation with that of a grayscale edge filter, and such correlation cannot be modeled by spatial invariance.
3 Pooling-Invariant Dictionary Learning
We are interested in designing a simple yet effective dictionary learning algorithm that takes into consideration the pooling stage of the feature extraction pipeline, and that models the general invariances among the pooled features. Observing the effectiveness of clustering methods in dictionary learning, we propose to learn a final dictionary of size K in two stages: first, we adopt the patch-based K-means algorithm to learn a more overcomplete starting dictionary of size M (M > K); we then perform encoding and pooling using this starting dictionary, and learn the final, smaller dictionary of size K from the statistics of the pooled features.
The motivation behind this idea is that K-means is a highly parallelizable algorithm that can be scaled up by simply sharding the data, giving us an efficient procedure for learning the starting dictionary. The large starting dictionary allows us to preserve most of the information at the patch level, and the second step prunes away the redundancy introduced by pooling. Note that the large dictionary is only used at feature learning time; after that, for each input image, we only need to encode local patches with the selected, relatively smaller dictionary, which is no more expensive than existing feature extraction methods.
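The two-stage procedure can be sketched end to end with off-the-shelf components (a simplification, not the authors' implementation: the pooled-feature sampler, the correlation similarity, and scikit-learn's affinity propagation are stand-ins, and the `preference` parameter, rather than an explicit K, controls how many exemplars survive):

```python
import numpy as np
from sklearn.cluster import KMeans, AffinityPropagation

def pdl_two_stage(patches, pooled_feature_fn, M, preference=None):
    """Two-stage pooling-invariant dictionary learning (sketch).

    patches:           (n, d) whitened training patches.
    pooled_feature_fn: given an (M, d) dictionary, returns an (m, M) matrix
                       of pooled features sampled from training images.
    M:                 starting dictionary size (larger than the final one).
    """
    # Stage 1: overcomplete starting dictionary via patch-based K-means.
    start = KMeans(n_clusters=M, n_init=1,
                   random_state=0).fit(patches).cluster_centers_
    # Stage 2: cluster the codes by the similarity of their *pooled* outputs.
    pooled = pooled_feature_fn(start)                   # (m, M)
    R = np.cov(pooled, rowvar=False)                    # pooled covariance
    sim = R / np.sqrt(np.outer(np.diag(R), np.diag(R))) # correlation
    ap = AffinityPropagation(affinity='precomputed', preference=preference,
                             random_state=0).fit(sim)
    return start[ap.cluster_centers_indices_]           # smaller dictionary
```

Lowering `preference` keeps fewer exemplars, which is how a dictionary budget would be met in practice.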
3.1 Feature Selection with Affinity Propagation
The first step of our algorithm is identical to patch-based K-means with a starting dictionary of size M. After this, we sample a set of image superpatches of the same size as the pooling regions, and obtain the M-dimensional pooled features from them. Randomly sampling a large number of pooled features in this way allows us to analyze the pairwise similarities between the codes in the starting dictionary in a post-pooling fashion. We would then like to find a K-dimensional subspace that best represents the pooled features. Specifically, we define the similarity between two pooled dimensions i and j (which correspond to two codes in the starting dictionary) as

s(i, j) = R_ij / sqrt(R_ii R_jj),   (1)

where R is the covariance matrix computed from the random sample of pooled features. We note that this is equivalent, up to a scale and an offset, to the negative squared Euclidean distance between the pooled output of code i and the pooled output of code j when the outputs are normalized to have zero mean and standard deviation 1.
We then use affinity propagation [9], which is a version of the K-centroids algorithm, to select centroids from the existing codes. Intuitively, codes that produce redundant pooled outputs (such as translated versions of the same code) have high similarity between them, and only one exemplar will be chosen by the algorithm.

Specifically, affinity propagation finds centroids from a set of candidates for which the pairwise similarity s(i, j) can be computed. It iteratively updates two terms, the "responsibility" r(i, k) and the "availability" a(i, k), via a message-passing method following these rules [9]:

r(i, k) ← s(i, k) − max_{k′ ≠ k} { a(i, k′) + s(i, k′) }   (2)

a(i, k) ← min{ 0, r(k, k) + Σ_{i′ ∉ {i, k}} max{ 0, r(i′, k) } }   for i ≠ k   (3)

a(k, k) ← Σ_{i′ ≠ k} max{ 0, r(i′, k) }   (4)

Upon convergence, the centroid that represents a candidate i is given by

c(i) = argmax_k ( a(i, k) + r(i, k) ),   (5)

and the set of selected centroids is the set of distinct values c(i). We refer to [9] for details on the nature of such message-passing algorithms.
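The update rules above can be implemented in a few lines of vectorized NumPy (a didactic sketch with simple damping; production implementations of [9] add convergence checks and tie-breaking noise):

```python
import numpy as np

def affinity_propagation(S, damping=0.5, iters=200):
    """Select exemplars from an (n, n) similarity matrix S.

    The diagonal S[k, k] holds the "preference" of each candidate;
    lower preferences yield fewer exemplars.
    Returns, for each candidate i, the index of its exemplar.
    """
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities  a(i, k)
    ix = np.arange(n)
    for _ in range(iters):
        # Eq. (2): r(i,k) = s(i,k) - max_{k' != k} (a(i,k') + s(i,k'))
        AS = A + S
        top = AS.argmax(axis=1)
        first = AS[ix, top]
        AS[ix, top] = -np.inf
        second = AS.max(axis=1)
        Rnew = S - first[:, None]
        Rnew[ix, top] = S[ix, top] - second
        R = damping * R + (1 - damping) * Rnew
        # Eqs. (3)-(4): availabilities from positive responsibilities
        Rp = np.maximum(R, 0)
        Rp[ix, ix] = R[ix, ix]
        Anew = Rp.sum(axis=0)[None, :] - Rp
        dA = Anew[ix, ix].copy()
        Anew = np.minimum(Anew, 0)
        Anew[ix, ix] = dA
        A = damping * A + (1 - damping) * Anew
    return (A + R).argmax(axis=1)  # Eq. (5)
```

Running this on the pooled-output similarity of Eq. (1), with the diagonal set to a suitable preference, returns one exemplar code per group of redundant codes.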
3.2 Visualization of Selected Filters
To visually show which codes are selected by affinity propagation, we applied our approach to the CIFAR-10 dataset by first training an overcomplete dictionary of 3200 codes, and then performing affinity propagation on the 3200-dimensional pooled features to obtain 256 centroids, which we visualize in Figure 3. Translational invariance appears to be the most dominant factor, as many clusters contain translated versions of the same Gabor-like code, especially among the grayscale codes. On the other hand, the clusters capture more than translation: clusters such as the one in column 5 focus on finding contrasting colors more than edges of exactly the same angle, and clusters such as the one in the last column find edges of varied color. We note that the selected codes are not necessarily centered (as would be the case for convolutional approaches), since the centroids are selected solely from the pooled response covariance statistics, which do not explicitly favor centered patches.
4 Why Does Clustering Work?
We briefly discuss why simple clustering algorithms work well in finding a good dictionary. Essentially, given a dictionary of size K and specifications of the encoding and pooling operations, the feature extraction pipeline can be viewed as a projection from the original raw image pixels to a K-dimensional pooled feature space. Note that this space does not degenerate, since the feature extraction is highly nonlinear; one could also take a kernel perspective and view the pipeline as defining a specific kernel between images. Then, one way to evaluate the information contained in this embedded feature space is to use the covariance matrix of the output features, which plays the dual role of the kernel matrix between the encoded patches.
Larger dictionaries almost always increase performance [20], and in the limit one could use all N patches as the dictionary, leading to an N-dimensional space and an N × N covariance matrix C. Since in practice we always assume a budget on the dictionary size, the goal is to find a dictionary D of size K, which yields a K-dimensional space and a K × K covariance matrix C_D. The approximation to the "oracle" encoded space using D can then be computed as

C ≈ C_ND C_D^+ C_ND^⊤,   (6)

where C_ND is the covariance matrix between the features extracted by the oracle dictionary and those extracted by D, and ^+ denotes the matrix pseudoinverse.
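The Nyström construction in Eq. (6) is easy to state in code (a generic sketch for any positive semi-definite matrix, not tied to the paper's features):

```python
import numpy as np

def nystrom(C, idx):
    """Approximate a PSD matrix C from the columns indexed by idx:
    C ~= C[:, idx] @ pinv(C[idx, idx]) @ C[idx, :]  (Eq. 6).
    """
    Cnd = C[:, idx]
    W = C[np.ix_(idx, idx)]
    return Cnd @ np.linalg.pinv(W) @ Cnd.T

# sanity check: a rank-r PSD matrix is recovered exactly once the sampled
# columns span its range
rng = np.random.default_rng(0)
B = rng.normal(size=(50, 5))
C = B @ B.T                          # rank-5 covariance
err = np.abs(C - nystrom(C, np.arange(10))).max()
```

Which columns to sample is precisely the question the dictionary learning methods below answer in different ways.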
Note that this explanation applies to both the before-pooling and the after-pooling case. Dictionary learning can then be thought of as finding a dictionary D that approximates C best. Interestingly, this can be seen as a form of the Nyström method, which subsamples subsets of the matrix columns for approximation. The Nyström method has been used to approximate large matrices for spectral clustering [8], and here it enables us to explain the mechanism of dictionary learning. Recent research in machine learning, notably [11], supports the recent empirical observations in vision: first, it is known that uniformly sampling the columns of the matrix already works well for reconstruction, which explains the good performance of random patches in feature learning [16]; second, theoretical results [11, 23] have shown that clustering algorithms work particularly well as a data-driven way of finding good subsets to approximate the original matrix, justifying the use of clustering in dictionary learning.

In patch-based dictionary learning, clustering can be applied directly to the set of patches (i.e., the largest dictionary in the limit). When we consider pooling, though, using all the patches becomes nontrivial, since we would need to compute pooled features using each patch as a code, which is computationally overwhelming. Thus, a reasonable approach is to first find a subset of all the patches using patch-based clustering (although this "subset" is still larger than our final dictionary size), and then perform clustering on the pooled outputs of this subset only, which leads to the proposed algorithm. The performance of the algorithm is then supported by the Nyström theory above.
Based on the matrix approximation explanation, we further reshape the selected features to match the distribution in the higher-dimensional space corresponding to the starting dictionary. After selecting the K centroids, denoted by the index subset S (with complement S̄), we can group the original covariance matrix R as

R = [ R_SS  R_SS̄ ; R_S̄S  R_S̄S̄ ],   (7)

where R_SS̄ denotes the covariance between the subset S and its complement S̄, and so on. We then approximate the original high-dimensional covariance matrix as

R ≈ [ R_SS ; R_S̄S ] R_SS^{-1} [ R_SS  R_SS̄ ].   (8)

More importantly, given the pooled output x_S using the selected filters, the high-dimensional feature can be approximated by

x ≈ [ R_SS ; R_S̄S ] R_SS^{-1} x_S.   (9)

Notice that the implicit dimensionality of the data is still no larger than the number K of selected codes, so we can apply SVD to the matrix [ R_SS ; R_S̄S ] R_SS^{-1} = U Λ V^⊤, where Λ is a diagonal matrix and U is a column-wise orthonormal matrix, and compute the low-dimensional feature (with a slight abuse of terminology) as

x′ = Λ V^⊤ x_S.   (10)
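In code, the whole post-selection transform reduces to a few linear-algebra calls (a sketch on synthetic data; the covariance R and the index set S stand in for the pooled covariance and the affinity propagation output):

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 20, 5
Z = rng.normal(size=(500, M))
R = np.cov(Z, rowvar=False)                   # stand-in for pooled covariance
S = np.arange(K)                              # stand-in for selected codes

C = R[:, S] @ np.linalg.inv(R[np.ix_(S, S)])  # [R_SS; R_S~S] R_SS^{-1}, (M, K)
U, lam, Vt = np.linalg.svd(C, full_matrices=False)
W = np.diag(lam) @ Vt                         # precomputable (K, K) transform

x_S = Z[0, S]                                 # pooled output of selected codes
x_full = C @ x_S                              # Eq. (9): high-dim approximation
x_low = W @ x_S                               # Eq. (10): low-dim feature
# U has orthonormal columns, so x_low preserves the geometry of x_full
```

Since x_full = U (W x_S) and U is column-orthonormal, working with the K-dimensional x_low is equivalent, up to rotation, to working with the reconstructed M-dimensional feature.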
The transform matrix W = Λ V^⊤ can be precomputed during feature selection, and imposes minimal overhead during the actual feature extraction. This transform changes neither the dimensionality nor the rank of the data; it only changes the shape of the underlying data distribution. In practice, we found that the transformed data yields slightly better performance than the untransformed data when combined with a linear SVM with L2 regularization.

If we simply wanted to find a low-dimensional representation of the M-dimensional pooled features, we would naturally choose PCA to find the most significant projections:
R = U Λ U^⊤,   (11)

where the M × M matrix U contains the eigenvectors and the diagonal matrix Λ contains the eigenvalues. The low-dimensional features are then computed as x′ = U_K^⊤ x, where U_K contains the top K eigenvectors. We note that while this guarantees the best K-dimensional approximation, it does not help our task, since the number of filters is not reduced: PCA almost always yields nonzero coefficients for all dimensions, so all M codes still have to be evaluated. Linearly combining the codes does not work either, due to the nonlinear nature of the encoding algorithm. However, as we show in Section 5.3, results with PCA demonstrate that a larger starting dictionary almost always helps performance even when the features are subsequently reduced to a lower-dimensional space of the same size as that of a smaller dictionary, which justifies using matrix approximation to explain dictionary learning behavior.
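For comparison, the PCA route of Eq. (11) is a single eigendecomposition; note in the sketch below (synthetic data, illustrative sizes) that the projection matrix is dense, so all M filters must still be evaluated before the reduction:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 20, 5
X = rng.normal(size=(1000, M)) @ rng.normal(size=(M, M))  # correlated features
R = np.cov(X, rowvar=False)
lam, U = np.linalg.eigh(R)     # eigenvalues in ascending order
U_K = U[:, -K:]                # top-K eigenvectors
X_low = X @ U_K                # Eq. (11) projection to K dimensions
# U_K is dense: essentially every one of the M codes contributes to each
# projection, so feature extraction cost is not reduced
```

This is what makes PCA a fair upper bound on the selection methods' accuracy, but not a substitute for them at extraction time.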
5 Experiments
We apply our pooling-invariant dictionary learning (PDL) algorithm to several benchmark tasks, including the CIFAR-10 and STL datasets, on which performance can be systematically analyzed, and the fine-grained task of classifying bird species, on which we show that feature learning provides a significant performance gain over conventional methods.
5.1 CIFAR10 and STL
The CIFAR-10 (http://www.cs.toronto.edu/~kriz/cifar.html) and STL (http://www.stanford.edu/~acoates/stl10/; see [3]) datasets are extensively used to analyze the behavior of feature extraction pipelines. CIFAR-10 contains a large amount of training and testing data, while STL contains a very small amount of training data and a large amount of unlabeled images. As our algorithm works with any encoding and pooling operations, we adopted the standard setting usually used on these datasets: extracting local patches with the mean subtracted and contrast normalized, whitening the patches with ZCA, and then training the dictionary with normalized K-means. The features are then encoded using one-sided threshold encoding with a fixed threshold α, and average-pooled over the four quadrants (a 2 × 2 grid) of the image. For STL, we followed [5] and resized the images to 32 × 32. For the PDL algorithm, instead of learning a different set of codes for each pooling quadrant, we learn a single set of codes shared across the quadrants for pooling the coded outputs. For all experiments, we carry out 5 independent runs and report the mean accuracy.
5.2 Statistics for Feature Selection
We first verify whether the learned codes capture the pooling invariance as claimed in the previous section. To this end, we start from the selected features in Section 3.2 and randomly sample three types of filter responses: (a) pairwise filter responses before pooling between codes in the same cluster, (b) pairwise filter responses after pooling between codes in the same cluster, and (c) pairwise filter responses after pooling between the selected centroids. The distributions of these responses are plotted in Figure 4. The result verifies our conjecture well: first, codes that produce uncorrelated responses before pooling may become correlated after the pooling stage (compare 4(a) and 4(b)), and these can effectively be identified by the affinity propagation algorithm; second, by explicitly taking the pooling behavior into consideration, we are able to select a subset of features whose responses are weakly correlated (compare 4(b) and 4(c)), which helps preserve more information with a fixed number of codes.
Figure 4(d) shows the eigenvalues of the original covariance matrix and those of the approximated covariance matrix, using the same setting as in the previous subsection (note that since we only select 256 features, the approximation has at most 256 nonzero eigenvalues). The approximation captures the largest eigenvalues of the original covariance matrix well, while decaying faster for the smaller eigenvalues.
5.3 Classification Performances
Figure 5 shows the relative improvement obtained on CIFAR-10 when we use a budgeted dictionary of size 200 but perform feature selection from a larger dictionary, as indicated on the X axis. The PCA performance is also included in the figure as a loose upper bound on feature selection performance. Learning the dictionary with our feature selection method consistently increases performance as the size of the original dictionary increases, and obtains about 70% of the gain achieved by PCA (again, note that PCA still requires all the codes to be evaluated, and thus does not save feature extraction time).
The detailed performance of our algorithm on the two datasets, using different starting and final dictionary sizes, is visualized in Figure 7. Table 1 summarizes the accuracy values of two particular cases: final dictionary sizes of 200 and 1600, respectively. Note that our goal is not to obtain the best overall performance, as performance always goes up when we use more codes. Rather, we focus on how much gain pooling-aware dictionary learning obtains given a fixed dictionary size as the budget. Figure 7 also shows the performance of the various settings when PCA is used to reduce the dimensionality instead of PDL; as stated in the previous section, this serves as an upper bound for the feature selection algorithms.
Overall, considering the pooled feature statistics always helps us find better dictionaries, especially when only a small dictionary is allowed for classification. Choosing from a larger starting dictionary increases performance, although the effect saturates when the starting dictionary is much larger than the target size. For the STL dataset, a very large starting dictionary may lessen the performance gain (Figure 7(b)). We conjecture that feature selection is more prone to local optima, and that the small amount of training data in STL makes performance sensitive to suboptimal codebook choices. In general, however, the codebook learned by PDL is consistently better than its patch-based counterpart.
Task                   Learning Method   Accuracy (%)
CIFAR-10, 200 codes    K-means           69.02
                       2x PDL            70.54 (+1.52)
                       4x PDL            71.18 (+2.16)
                       8x PDL            71.49 (+2.47)
CIFAR-10, 1600 codes   K-means           77.97
                       2x PDL            78.71 (+0.74)
STL, 200 codes         K-means           53.22
                       2x PDL            54.57 (+1.35)
                       4x PDL            55.53 (+2.31)
                       8x PDL            55.52 (+2.30)
STL, 1600 codes        K-means           58.16
                       2x PDL            58.28 (+0.12)
Finally, we note that due to the heavy-tailed nature of the encoded and pooled features (see the eigendecomposition in Figure 4), one can infer that representations obtained under a budget will have a correspondingly bounded performance when combined with linear SVMs. In this paper we have focused on analyzing unsupervised approaches. Incorporating weakly supervised information to guide feature learning and selection, or learning multiple layers of feature extraction, would be particularly interesting future directions.
5.4 Finegrained Classification
To show the performance of the feature learning algorithm on a real-world image classification task, we tested our algorithm on fine-grained classification using the 2011 Caltech-UCSD Birds dataset [18] (http://www.vision.caltech.edu/visipedia/CUB2002011.html). Classification of fine-grained categories poses a significant challenge for contemporary vision algorithms, as such tasks usually require identifying localized appearances, such as "birds with yellow spotted feathers on the belly", that are hard to capture with manually designed features. Recent work on fine-grained classification usually focuses on the localization of parts [7, 24, 6] and still uses manually designed features. Yao et al. [21] proposed a template-based approach for more powerful features, but it may be difficult to scale as the number of templates grows (due to computational complexity, some early work such as [7, 21] does not scale well, and only reports performance on subsets of the whole data [personal communication]). In addition, due to the lack of training data in fine-grained classification tasks, it is still unclear whether supervised feature learning is useful.
We performed classification on all 200 bird species provided. For image preprocessing we followed the same setting as [24, 21], cropping each image to be centered on the bird using 1.5× the size of the provided bounding box. We then resized each cropped image to a fixed size, to avoid artifacts that may be introduced by a varying number of local features. The training data were augmented by simply mirroring each training image. We then extracted local whitened patches as we did for CIFAR-10, encoded them using threshold encoding, and performed max pooling, since the bird images are larger than those of CIFAR and STL. The dictionary was learned using our two-stage pipeline, starting from a larger set of patch-based clustering centers.
Our classification results, together with previous state-of-the-art baselines from [24], are reported in Table 2. It is somewhat surprising to observe that feature learning provides a significant performance boost in this case, indicating that in addition to part localization (which has been the focus of fine-grained classification), learning appropriate features and descriptors to represent local appearance may be a major factor in fine-grained classification, possibly due to the subtle appearance changes in such tasks. As we have shown here, even simple and fully unsupervised feature learning algorithms such as K-means and PDL can lead to significant accuracy improvements, and we hope this will inspire further advances in fine-grained classification research.
6 Conclusion
We have proposed a novel algorithm that efficiently takes into account the invariance of learned features after the spatial pooling stage. The algorithm is empirically shown to identify redundancy between codes learned in a patch-based way, and yields dictionaries that produce better classification accuracy than simple patch-based approaches. To explain the performance gain, we took a matrix approximation view of dictionary learning and showed a close connection between the proposed method and the Nyström method. The proposed method introduces no overhead at classification time, and can easily be "plugged in" to existing image classification pipelines.
References
[1] Y Boureau, F Bach, Y LeCun, and J Ponce. Learning mid-level features for recognition. In CVPR, 2010.
[2] J Carreira, R Caseiro, J Batista, and C Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012.
[3] A Coates, H Lee, and A Ng. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, 2011.
[4] A Coates and A Ng. The importance of encoding versus training with sparse coding and vector quantization. In ICML, 2011.
[5] A Coates and AY Ng. Selecting receptive fields in deep networks. In NIPS, 2011.
[6] K Duan, D Parikh, D Crandall, and K Grauman. Discovering localized attributes for fine-grained recognition. In CVPR, 2012.
[7] R Farrell, O Oza, N Zhang, VI Morariu, T Darrell, and LS Davis. Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance. In CVPR, 2011.
[8] C Fowlkes, S Belongie, F Chung, and J Malik. Spectral grouping using the Nyström method. IEEE TPAMI, 26(2):214–225, 2004.
[9] BJ Frey and D Dueck. Clustering by passing messages between data points. Science, 315(5814):972–976, 2007.
[10] A Krizhevsky, I Sutskever, and GE Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[11] S Kumar, M Mohri, and A Talwalkar. Sampling methods for the Nyström method. JMLR, 13(Apr):981–1006, 2012.
[12] H Lee, R Grosse, R Ranganath, and AY Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.
[13] J Mairal, F Bach, J Ponce, and G Sapiro. Online learning for matrix factorization and sparse coding. JMLR, 11:19–60, 2010.
[14] B Olshausen and DJ Field. Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
[15] R Rigamonti, MA Brown, and V Lepetit. Are sparse representations really relevant for image classification? In CVPR, 2011.
[16] A Saxe, PW Koh, Z Chen, M Bhand, B Suresh, and A Ng. On random weights and unsupervised feature learning. In ICML, 2011.
[17] J Wang, J Yang, K Yu, F Lv, T Huang, and Y Gong. Locality-constrained linear coding for image classification. In CVPR, 2010.
[18] P Welinder, S Branson, T Mita, C Wah, F Schroff, S Belongie, and P Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, Caltech, 2010.
[19] J Yang, K Yu, and Y Gong. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.
[20] J Yang, K Yu, and T Huang. Efficient highly over-complete sparse coding using a mixture model. In ECCV, 2010.
[21] B Yao, G Bradski, and L Fei-Fei. A codebook-free and annotation-free approach for fine-grained image categorization. In CVPR, 2012.
[22] MD Zeiler, D Krishnan, GW Taylor, and R Fergus. Deconvolutional networks. In CVPR, 2010.
[23] K Zhang, IW Tsang, and JT Kwok. Improved Nyström low-rank approximation and error analysis. In ICML, 2008.
[24] N Zhang, R Farrell, and T Darrell. Pose pooling kernels for subcategory recognition. In CVPR, 2012.