Unsupervised Feature Learning with C-SVDDNet

12/23/2014 ∙ by Dong Wang, et al. ∙ Nanjing University of Aeronautics and Astronautics 0

In this paper, we investigate the problem of learning feature representation from unlabeled data using a single-layer K-means network. A K-means network maps the input data into a feature representation by finding the nearest centroid for each input point, which has attracted researchers' great attention recently due to its simplicity, effectiveness, and scalability. However, one drawback of this feature mapping is that it tends to be unreliable when the training data contains noise. To address this issue, we propose a SVDD based feature learning algorithm that describes the density and distribution of each cluster from K-means with an SVDD ball for more robust feature representation. For this purpose, we present a new SVDD algorithm called C-SVDD that centers the SVDD ball towards the mode of local density of each cluster, and we show that the objective of C-SVDD can be solved very efficiently as a linear programming problem. Additionally, traditional unsupervised feature learning methods usually take an average or sum of local representations to obtain global representation which ignore spatial relationship among them. To use spatial information we propose a global representation with a variant of SIFT descriptor. The architecture is also extended with multiple receptive field scales and multiple pooling sizes. Extensive experiments on several popular object recognition benchmarks, such as STL-10, MINST, Holiday and Copydays shows that the proposed C-SVDDNet method yields comparable or better performance than that of the previous state of the art methods.



There are no comments yet.


page 6

page 9

page 10

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Learning good feature representation from unlabeled data is the key to make progress in recognition and classification tasks, and has attracted great attention and interest from both academia and industry recently. A representative method for this is the deep learning (DL) approach


with its goal to learn multiple layers of abstract representations from data. Among others, one typical DL method is the so called convolutional neural network (ConvNet), which consists of multiple trainable stages stacked on top of each other, followed by a supervised classifier

[2] [3]. Many variations of ConvNet network have been proposed as well for different vision tasks [4][5][6][7] [8] with great success.

In these methods layers of representation are usually obtained by greedily training one layer at a time on the lower level [5] [9] [3]

, using an unsupervised learning algorithm. Hence the performance of single-layer learning has a big effect on the final representation. Neural network based single-layer methods, such as autoencoder


and RBM (Restricted Boltzmann Machine,

[11]), are widely used for this but they usually have many parameters to adjust, which is very time-consuming in practice.

That motivates more simple and more efficient methods for single-layer feature learning. Among others K-means clustering algorithm is a commonly used unsupervised learning method, which maps the input data into a feature representation simply by associating each data point to its nearest cluster center. There is only one parameter involved in the K-means based method, i.e., the number of clusters, hence the model is very easy to use in practice. Coates et al. [12]

shows that the K-means based feature learning network is capable to achieve superior performance compared to sparse autoencoder, sparse RBM and GMM (Guassian Mixture Model). However, the K-means based feature representation may be too terse, and does not take the non-uniform distribution of cluster size into account - Intuitively, clusters containing more data are likely to be part of the features with higher influential power, compared to the smaller ones.

In this paper, we proposed a SVDD (Support Vector Data Description,

[13], [14], [15]

) based method to address these issues. The key idea of our method is to use SVDD to measure the density of each cluster resulted from K-means clustering, based on which more robust feature representation is built. Actually the K-means algorithm lacks a robust definition of the size of its clusters, since the nearest center principle is not robust against the noise or outliers commonly encountered in real world applications. We advocate that the SVDD could be a good way to address this issue. Actually SVDD is a widely used tool to find a minimal closed spherical boundary to describe the data belonging to the target class and therefore, given a cluster of data, we are expecting SVDD to generate a ball containing the normal data except outliers. Performing this procedure on all the clusters of K-means, we will finally obtain

SVDD balls on which our representation can be built. In addition, to take the cluster size into account, we use the distance from the data to each ball’s surface instead of the center as the feature.

One possible problem of this method, however, may come from the instability of SVDD’s center, due to the fact that its position is mainly determined by the support vectors on the boundary and the noise in the data may deviate the center far from the mode (c.f., Fig. 3 (left)). Hence the resulting SVDD ball may not be consistent with the data’s distribution when used for feature representation. To address this issue, we add a new constraint to the original SVDD objective function to make the model align better with the data. In addition, we show that our modified SVDD can be solved very efficiently as a linear programming problem, instead of as a quadratic one. Usually we need to compute hundreds of clusters, and a linear programming solution can thus save us large amounts of time. The proposed method is further extended by adopting a set of receptive fields with different sizes to capture multi-scale information ranging from detailed edge-like features to part-level features. A preliminary version of this work appeared in [16]

, and the feasibility and effectiveness of the proposed C-SVDD-based method (called C-SVDDNet) is verified extensively on several object recognition and image retrieval benchmarks with competitive performance.

The remaining parts of this paper are organized as follows: In Section II, preliminaries are provided regarding unsupervised feature learning representation, then we detail our improved feature learning method in Section III. In Section IV, we investigate the performance of our method empirically over several popular datasets. We conclude this paper in Section V.

Ii Unsupervised Feature Learning

The goal of unsupervised feature learning is to automatically discover useful hidden patterns/features in large datasets without relying on a supervisory signal, and those learnt patterns can be utilized to create representations that facilitate subsequent supervised learning (e.g., object classification). Compared to supervised learning, unsupervised learning has its unique characteristics and advantages. Among others, one of the most important one is that it can be used to learn consistent patterns from unlabelled data, which are often free and easy to obtain. Such patterns distinguish from noise since by definition noise can be thought of as random variations presented in the data. This implies many potential applications of unsupervised learning, e.g., to transfer knowledge from one domain to another related domain, to regularize the behavior of a supervised algorithm, and to represent the data in a compact but effective manner. Due to these reasons, unsupervised learning are regarded as the future of deep learning


There are many kinds of unsupervised learning methods in computer vision, such as Bag of Words (BoW)

[18], Vector of Linearly Aggregated Descriptors (VLAD) [19], Fisher vector (FV) [20], and so on. A typical pipeline for unsupervised feature learning includes three steps. The first step is to train a set of local filters from the unlabeled training data. This is usually done by running K-means (for BoW, VLAD) or GMM (for FV) on lots of local patches sampled from the dataset and then using the centers of clusters as filter bank. The second step is to partition a given image into patches and encode them into a set of feature vectors using the learnt filter bank. These feature vectors are finally combined and normalized as the feature representation for the input image. In what follows we give a brief review on these methods.

Bag of Words and Its Variants The simple and basic unsupervised feature learning method is the BoW model. In this model local filters are usually the centers of clusters from K-means. These filters are looked as bins, which serves to pool the local patches nearest to them. This can be regarded as a ”hard voting” method:


where is the value that a patch was encoded as with the k-th filter . In BoW, we simply count the number of patches in each bin to get a histogram representation. Thus it is a very coarse way to encode the information of an input image.

Alternatively, VLAD [19] and FV [20] encode each data point with a vector instead of a simple count number as in BoW, which effectively improve the richness and robustness of the feature representation. Particularly, FV captures the first and second order difference between an input and the centres of a GMM, denoted as ,


Then the FV coding for an local patch is a vector of . VLAD [19] is a simplified version of FV, with the difference signal between a patch and a filter defined as . As in FV, these difference signals are concatenated into a -dim vector for feature representation. Obviously, both FV and VLAD encode much richer information than that of BoW, hence being more discriminative in subsequent tasks such as object classification.

Coates et al.’s Method To the best of our knowledge, the work of [12] is the first “deep” unsupervised learning method that are based on the K-means method, hence having close connection with the aforementioned BoW, VLAD and FV methods. Particularly, after learning a filter bank, instead of using it as basin of attraction like in BoW or as references for calculating difference vectors, it is utilized to generate a series of feature maps, one for each filter. This has at least two potential advantages: 1) compared to VLAD and FV, the encoded information is even more rich; 2) the feature maps preserve the spatial information well and hence the whole procedure could be repeated, leading to a deep unsupervised learning architecture.

Furthermore, to deal with the problem of “hard coding” in K-means, the following “triangle” encoding is proposed [12]:


where , and is the mean of the elements of

. This activation function outputs

for the feature that has an above average distance to the centroid . This model leads to a less sparse representation (roughly half of the features could be set to be

). Note that this “triangle” encoding strategy essentially allows us to learn a distributed representation using the simple K-means method instead of more complicated network-based methods (e.g., autoencoder and RBM), hence saving much time in training. Coastes et al.

[12] shows that this strategy actually leads to comparable performance to, if not better than, those based on network methods.

However, this method does not take the characteristics of each cluster into consideration. Actually, the number of data point in each cluster is usually different, so is the distribution of data points in each cluster. We believe that these differences would make a difference in feature representation as well. Unfortunately the aforementioned K-means feature mapping scheme completely ignores these and only uses the position of center for feature encoding. As shown in Fig. 1, although the data point has the same distance to the centers and of two clusters, it should be assigned a different score on than on since the former cluster is much bigger than the latter. In practice such unequal clusters are not uncommon and the K-means method by itself can not reliably grasp the size of its clusters due to the existence of outliers. To this end, we propose a SVDD based method to describe the density and distribution of each cluster and use this for more robust feature representation.

Fig. 1: Illustration of the unequal cluster effect, where the distances from a test point to two cluster centers and are equal but the size of two clusters are different.

Iii The Proposed Method

In this section, after presenting an overview of the proposed method, we give the details of our Centered-SVDD method for feature encoding, and compare it with the K-means “triangle” encoding method. Then we describe our SIFT-based post-pooling layer and discuss how to extend the method to extract multi-scale information.

Iii-a Overview of the Proposed Method

A typical single-layer network contains several components: an input image is first mapped into a set of feature maps using filter banks (or dictionary), which are then subjected to a pooling/subsampling operation to condense the information contained in the feature maps. Finally, the pooled feature maps are concatenated to a feature vector, which serves as the representation for the subsequent classification/cluster tasks. There are several design options in this procedure, where the size of filter bank and that of the pooling grids are the major tradeoff one has to make.

Generally speaking, bigger filter banks help each sample find its nearby representative points more accurately but at the cost of yielding a high-dimensional representation, hence a crude pooling/subsampling is needed to reduce the dimensionality. Overall this type of architecture emphasizes more on the global aspects of the samples than on the local ones (e.g., local texture, local shape, etc.). Actually, Coates et al. show that this kind of network is able to yield state of the art results on several challenging datasets [12]. On the other hand, other works use smaller filter banks but highlight the importance of detailed local information in constructing the representation, usually based on some complicated feature encoding strategy, as done in PCANet [21] or Fisher Vector [22].

In this work, we follow the second design choice, based on the consideration that the learned representation should preserve enough local spatial information for the subsequent processing. Compared to [12], we use an improved feature encoding method named C-SVDD (detailed in the next section) and adopt the architecture of relatively small dictionary. Different to [21] or [22], we learn filter banks for feature encoding but add a SIFT-based post-pooling processing procedure onto the network, which essentially projects the responses of a pooling operation into a more compact and robust representation space.

Iii-B Using SVDD Ball to Cover Unequal Clusters

Assume that a dataset contains data objects, {, and a ball is described by its center and the radius . The goal of SVDD (Support Vector Data Description, [13]) is to find a closed spherical boundary around the given data points. In order to avoid the influence of outliers, SVDD actually faces the tradeoff between two conflicting goals, i.e., minimizing the radius while covering as many data points as possible. This can be formulated as the following objective,


where the slack variable represents the penalty related with the deviation of the -th training data point outside the ball, and is a user defined parameter controlling the degree of regularization imposed on the objective. With the KKT conditions, we have , i.e., the center of the ball is a linear combination of the data . The dual function of Eq.( 4) is


where and are Lagrangian multipliers. By solving the quadratic programming problem we can get the center and the radius .

The SVDD method can be understood as a type of one-class SVM and its boundary is solely determined by support vectors points. SVDD allows us to summarize a group of data points in a nice and robust way. Hence it is natural to use SVDD ball to model each cluster from K-means, thereby combining the strength of both models. In particular, for a given data point we first compute its distance to the surface of each SVDD ball , and then use the following modified “triangle” encoding method for feature representation (c.f., E.q.( 3)),


where is the distance from the point to the surface of the -th SVDD ball, while is the average of the values .

Shown in Fig. 2 for a data point , respectively are the centroids of two SVDD balls with being the radius. Since the distances from to and are equal, will be assigned the same scores on the two ball with the K-means scheme (c.f., E.q.( 3)). However, if we take the density and size of the clusters into accounts, the score from should be higher in our method.

Fig. 2: Using the SVDD ball to cover the clusters of K-means, where two SVDD balls cover two clusters with different sizes, respectively. For a test point , we encode its feature using its distance to the surface of an SVDD ball. This can be calculated by subtracting the length of the radius of the ball from the distance between to the ball center . Hence for two SVDD balls with different size, the encoded features for the same point would be different.

Iii-C The C-SVDD Model

Fig. 3: Illustration of the difference between SVDD and C-SVDD. Note that after centering the SVDD ball (left), the center of C-SVDD ball (right) aligns better with the high density region of the data points.

Although SVDD ball provides a robust way to describe the cluster of data, one unwelcome property of the ball is that it may not align well with the distribution of data points in that cluster. As illustrated in Fig. 3 (left), although the SVDD ball covers the cluster

well, its center is biased to the region with low density. This should be avoided since it actually gives suboptimal estimates on the distribution of the cluster of data.

To address this issue, inspired by the observation that the centers of K-means are always located at the corresponding mode of their local density, we propose to shift the SVDD ball to the centroid of the data such that it may fit better with the distribution of the data in a cluster. Our new objective function is then formulated as, 111We choose the squared L2 norm distance as a convenient for optimization. There are also other robust distance such as non-squared L2 norm distance [23].


and its Lagrange function is as follows,


where and are the corresponding Lagrange multipliers. According to KKT Conditions, we have,


Taking Eq.(9) and Eq.(10) into the Lagrange function (8) we get that

Recalling that , one has the following dual function,


This can be reformulated as


where . This objective function is linear to , and thus can be solved efficiently with a linear programming algorithm.

Since the model is centered towards the mode of the distribution of the data points in a cluster, we named our method as C-SVDD (centered-SVDD). Fig. 3 shows the difference between SVDD and C-SVDD, where the left is from SVDD and the right from C-SVDD. We can see that our new model aligns better with the density of the data points, as expected. It is also worth mentioning that the normalization parameter plays an important role in our model - a larger value would allow more noise to enter the ball, while , the C-SVDD model actually reduces to the naive single-cluster K-means. More discussions on setting this value empirically will be given in Section IV.

After the model is trained, we use the modified “triangle” encoding (E.q. 6) for feature encoding, with almost the same computational complexity with its K-means counterpart.

Iii-D K-means Encoding vs. C-SVDD Encoding

Fig. 4: Illustration of feature maps of five face images (a) using K-means (c) and C-SVDD (d) respectively, based on five local dictionary atoms (b), where maps in each row are corresponding to one atom next to it while each column corresponding to one face. For the response values in a feature map, the darker the lower.
Fig. 5: Distribution of the number of patches attracted by each atom (a), the radius of the corresponding SVDD ball (b), and the entropy (c) over the five atoms shown in Fig.4 (leftmost)

To this end, it will be useful to take a brief discussion on the difference of two kinds of feature maps, i.e., K-means-based “triangle” encoding (E.q. 3) and our C-SVDD-based one 222Hereinafter we will call them respectively “K-means encoding” and “C-SVDD encoding” for short without confusion.. For this a pilot experiment is conducted. Particularly, we learn a very small dictionary containing only five atoms using five face images, by clustering ZCA-whitened patches randomly sampled from the faces, and then take these for feature encoding. Fig.4 illustrates the face images used for dictionary learning (top) and the five learnt atoms (leftmost). The feature maps of face images encoded by the K-means encoding method and those by the C-SVDD encoding method are respectively shown in Fig.4 (a) and Fig.4 (b), where each row is corresponding to one dictionary atom next to it and each column corresponding to one face.

By comparing the feature maps shown in Fig.4 (a) and Fig.4 (b), one can see that the C-SVDD-based ones contain more detailed information than the K-means feature maps for the first three atoms, while the responses of the last two atoms are largely suppressed by our method (c.f., last two rows of Fig.4 (b)). To further understand this phenomenon, we plot the entropy of each atom (by treating them as a small image patch) in Fig.5 (c). The figure shows that the entropy of the last two atoms is much smaller than that of the first three ones, which indicates that the local appearance patterns captured by these last two atoms are much simpler than those by the first three. Hence these two atoms will tend to be widely used by many faces, resulting in reduced discriminative capability in distinguishing different subjects. In this sense, it will be useful to suppress their responses (c.f., the last two rows of Fig.4 (b)).

It is also useful to inspect the distribution of local facial patches attracted by these atoms. Fig.5 (a) gives the results. It can be seen that this distribution is not uniform and the number of local patches attracted by the fourth atom is significantly larger than those by other atoms. As a result, for K-means encoding method, the feature maps yielded by this atom show much more rich details than others (see the fourth row of Fig.4 (a)), potentially indicating that it could play more important roles than others in the subsequent classification task. However, as explained above, since this atom actually contains much less information than the first three atoms (low entropy and being a “common word”), it is really not good to over-emphasize its importance in feature encoding.

This drawback of K-means feature mapping is largely bypassed by our C-SVDD-based scheme. As shown Fig.5 (b), the fourth atom actually represents a very small cluster. In fact, the radius of C-SVDD ball corresponding to the more informative atom tends to be large, and one major advantage of our C-SVDD-based strategy is that it is capable to exploit this characteristic of dictionary atoms for more effective feature encoding, as shown in the first three rows of Fig.4 (b). This partially explains the superior performance of the proposed C-SVDD method compared to its K-means counterpart (c.f., experimental results in Section IV).

Fig. 6: Features of different scales learnt from face images. The size of original face images is in pixels. (a) size 5 (b) size 10 (c) size 20

Iii-E Encoding Feature Maps with SIFT Representation

Traditional unsupervised methods like Bag of Words (BoW) model [18] usually generate a global feature representation by simply histogramming over local codings, while ignoring spatial relationship between local patches.

One problem of preserving spatial information in feature representation is due to the huge dimension of feature maps. Suppose that the size of a receptive field is , and the size of an input image is . After densely extracting patches and encoding them, we would obtain feature maps, one for each filter, with each of size (). Particularly, for small images with , and a small dictionary with size and with size of its filter , the resulting dimension of feature maps is nearly

, which is too large for many applications. One can use such methods as average pooling or max pooling to reduce the size of feature maps. For

sized pooling blocks, the size of a feature map can be reduced to . In the above example if we set , the dimension of each map becomes , which is still too big when concatenating maps. However, if we choose a bigger pooling window, the more spatial information will be lost.

In this paper we proposed a variant of SIFT-representation to address the above issues. SIFT is a widely used descriptor in computer vision and is helpful to suppress the noise and improve the invariant properties of the final feature representation. The way we get SIFT representation is not in general way which extracts a 128-bit SIFT-descriptors densely. This will also cause to a very high dimensionality. For example, if we extract 128 dimensional SIFT-descriptors densely in feature maps with the size of in pixel, the dimension of the obtained representation vector will be as high as over M (). To address this issue, we first divide each feature map into blocks and then only extract an 8-bit gradient histogram from each block in the same way as SIFT does. This results in a feature representation with dimension of for each map (Such as if m=3, then the dim is only 72bit). In this way we significantly reduce the dimensionality while preserving rich information for the subsequent task.

Iii-F Multi-scale Receptive Field Voting

Next we extend our method to exploit multi-scale information for better feature learning. A multi-scale method is a way to describe the objects of interest in different sizes of context. This would be useful since patches of a fixed size can seldom characterize an object well - actually they can only capture local appearance information limited in that size. For example, if the size is very small, information about edges could be captured but the information on how to combine these into more meaningful patterns such as motifs, parts, poselets, and object, is lost, while information about these entities at different levels is valuable in that they are not only discriminative by itself but complementary to each other as well. Most popular manually designed feature descriptors, such as SIFT or HoG, address this problem to some extend by pooling image gradients into edglets-like features, but it is still unclear, for example, how to assemble edglets into motifs using these methods. Convolutional neural network provides a simple and comprehensive solution to this issue by automatically learn hierarchies of features ranging from edglets to objects. However, during this procedure, information on where those high-level patterns are found becomes more and more ambiguous.

Because our C-SVDDNet is a single-layer network, it is difficult to learn multi-scale information in a hierarchical way. Instead, we take a naive way to obtain multi-scale information by using receptive fields of different sizes. In particular, we fetch patches with squares in size from training images and use these to train dictionary atoms with corresponding size through K-means. Fig.6 shows some examples of atoms we learnt on a face dataset. One can see that these feature extractors are similar to those learnt using a typical ConvNet. Specifically, with the increasing window size, the learnt features become more understandable - for example, as shown in Fig.6 (c), using a receptive field with size of on face images of , we successfully learned facial parts such as the eyes, the mouth, and so on, while a smaller receptive field gives us some oriented filters, as shown in Fig.6 (a). At each scale we train several networks with different pooling window. One advantage of this method is that it is very efficient to learn and is effective in capture salient features in a multi-scale context. However, it will not tell us how the bigger patterns are explained by smaller ones - such information would be useful from a generative angle.

To use the learnt multi-scale information for classification, we train a separate classifier on the output layer of the corresponding network (view) according to different receptive sizes and different pooling sizes, then combine them under a boosting framework. Particularly, assume that the total number of categories is , and we have scales (with different number of pooling sizes for each scale), then we have to learn output nodes. These nodes are corresponding to multi-class classifiers. Let us denote the parameter of the th classifier ( is the dimension of feature representation) as , where is the weight vector for the -th category. We first train these parameters using a series of one-versus-rest -SVM classifiers, and then normalize the outputs of each classifier using a soft max function,


Finally, the normalized predictions are combined to make the final decision,


where is the output vector of the -th classifier, and the corresponding combination coefficients are trained using the following objective,


This is the same type of one-versus-rest -SVM mentioned before.

Iv Experiments AND Analysis

To evaluate the performance of the proposed C-SVDDNet, we conduct extensive experiments on four datasets including two object classification datasets(STL-10 [12], MINST [2]) and two image retrieval datasets(Holiday [24], INRIA Copydays [25]).

Iv-a Experiment Settings

All the images undergo whitening preprocessing before feeding them into the network. The whitening operation linearly transforms the data such that their covariance matrix becomes unit sphere, hence justifying the Euclidean distance we use in the K-means clustering procedure.

Unless otherwise noted, the parameter settings listed in Table.I apply to all experiments. The influence of some important parameters, such as the number of filters, will be investigated in more detail in the subsequent sections. For single scale network the receptive field is set to be by default across all the datasets, as recommended in [12], while in multi-scale version, we use receptive fields in three scales, as shown in Table.I.

For C-SVDD ball there is a regularization parameter to set. This parameter allows us to control the amount of noise we are willing to tolerant to. As can be seen from E.q.1, a small value encourages a tight ball. We set by default for most datasets except for those with too noisy background are set to 0.005. Furthermore, the centers in C-SVDD are set as the same as those in k-means, so that we can safely ignore the effect of the initialization of k-means.

Throughout the experiments, we use Coates’ K-means “triangle” encoding method [12] (c.f., Section LABEL:subsec_softcoding) as baseline (denoted as ‘K-means’), while its direct counterpart method by simply replacing “triangle” encoding with C-SVDD encoding is denoted as ‘C-SVDD’. Furthermore, we denote the proposed single layer network as ‘C-SVDDNet’, and its multi-scale version as ‘MSRV + C-SVDDNet’. In addition, we re-evaluate the baseline method [12] within the proposed network by replacing its component of C-SVDD with the K-means-based encoding, denoted as ‘K-meansNet’.

Parameter value
#clusters 500
Size of receptive field
size of average pooling
*-default setting for the non-multi-scale network.

Iv-B Analysis of the Proposed Method

First of all we conduct extensive experiments on the STL-10 dataset to investigate the behavior of the proposed method. The STL-10 is a large image dataset popularly used to evaluate algorithms of unsupervised feature learning or self-taught learning. Besides 100,000 unlabeled images, it contains 13,000 labeled images from 10 object classes, among which 5,000 images are partitioned for training while the remaining 8,000 images for testing. All the images are color images with pixels in size. There are 10 pre-defined overlapped folds of training images, with 1000 images in each fold. In each fold, a classifier is trained on a set of 1000 training images, and tested on all 8000 testing images. In consistence with [12], we report the average accuracy across 10 folds. For unsupervised feature learning we randomly select 20,000 unlabeled data. The size of spatial pooling is , hence the size of feature maps fed for SIFT representation is . For multi-scale receptive voting we use 2 scale ( and ), on each of which we perform spatial pooling in 5 sizes ranging from to .

Do we really need a large number of local features?

Fig. 7: The effect of different number of features on the performance with different methods on the STL-10 dataset. All parameters here (such as pooling size) are set as the same as in Fig.8.

By the number of features, we mean the number of filters

used for feature extraction, which is equal to the number of dictionary atoms. One of the major conclusions of Coates et al.’s series of controlled experiments on single layer unsupervised feature learning network

[12] is that compared to the choice of particular learning algorithm, the parameters that define the feature extraction pipeline, especially the number of features, have much more deep impact on the performance. Using a K-means network with 4000 features, for example, they are able to achieve surprisingly good performance on several benchmark datasets - even better than those with much deeper architectures such as Deep Boltzmann Machine [26] and Sparse Auto-encoder [12].

However, one drawback accompanying this large dictionary is that a very crude pooling size has to be adopted (e.g., over feature maps) to condense the resulting feature maps, otherwise the dimensionality of the final feature representation could be prohibitively high. For example, a pooling over 4000 feature maps with in size would lead to a total number of features over 3.8M. Hence the first question we investigate is that whether such a large number of features are really needed all the time?

Fig.7 gives the performance curves according to varying number of features with different methods on the STL-10 dataset. Besides the aforementioned methods, in this figure we also give the results of random dictionary (i.e, local dictionary atoms are obtained randomly without being fine tuned by k-means, denoted as “Random” ) and of the combination of random dictionary and SIFT representation (denoted as ’RandomNet’).

It can be seen that with the increasing number of features, the performance of both K-means and C-SVDD methods rises, which is consistent with the results by Coates et al. [12]

. One possible explanation is that since both K-means encoding and C-SVDD encoding use the learnt dictionary to extract non-linear features, more dictionary atoms help to disentangle factors of variations in images. In our opinion the capability to learn a large number of atoms at relatively low computational cost is one of the major advantages of K-means based methods for unsupervised feature learning over other algorithms such as Gaussian Mixture Model (GMM), sparse coding, and RBM. For example, it is difficult for a GMM to learn a dictionary with over 800 atoms


On the other hand, a too large dictionary can increase the redundancy and decrease the efficiency. Hence it is desirable to reduce the number of features while not hurting the performance too much. Fig.7 shows that our C-SVDD encoding method consistently works better than the K-means encoding at different number of features, and combining C-SVDD encoding and SIFT-based representation dramatically reduces the needs for large dictionary without scarifying the performance. Actually, Table.II and fig.7 show that using our C-SVDD encoding and the SIFT feature representation, the dictionary size reduces by 10 times (from 4,800 [27] to 500) while the performance improves by 12% (from 53.80% [27] to 65.92%).

As for the random dictionary (denoted as ’Random’ and ’RandomNet’ in Fig.7), it is interesting to see that when the number of atoms is small, random atoms perform much worse than those finetuned by k-means. But as the size of dictionary increases, the performance difference between the random dictionary and K-means dictionary begins to reduce. For example, at 500 features, using random atoms gives a performance of 54.77%, slightly worse than that of k-means (56.63%), and the performance of RandomNet (62.45%) is also close to that of K-meansNet (63.07%). However the performance of both random methods is all much lower than that of the C-SVDD based methods.

Effect of the pooling size

Fig. 8: The effect of different pooling sizes on the performance with the proposed method on the STL-10 dataset.

To investigate the effect of different pooling sizes on the performance using the proposed method, we conduct a series of experiments on the STL-10 dataset. Particularly, for a 96 96 original image, we use a receptive field of 5 5 in pixel for feature extraction and obtain a layer of feature maps with 92 92. The pooling blocks are set to be such that the size of final feature maps after pooling is . We vary from 1 1 to 31 31 and record the yielded accuracy. Fig.8 gives the results under different settings. We can see from the figure that generally for the one layer K-means-based network we need bigger block sizes for improved translation invariance, but adding a robust SIFT encoding layer after pooling effectively reduces the needs for large pooling size while obtaining better performance. One possible reason is that this tends to characterize more detailed information of the objects to be represented.

Effect of the multi-scale receptive field voting

Fig. 9: Detailed performance of 10 different representations and their ensemble on the STL-10 dataset. These representations are obtained by combining different receptive field size (rf) and pooling size (pl), where rf indicates a receptive field of , and pl denotes a pooling block of in pixel.

Fig.9 gives the detailed accuracy of 10 representations using 2 sizes of receptive fields and 5 sizes of pooling blocks. One can see that different representation leads to different prediction accuracy but combining them leads to better performance. This shows that the representations captured with different receptive fields and pooling sizes are complementary to each other.

Contribution of components

Fig. 10: The contribution of the three major components of the proposed method to the performance.

To illustrate the contributions of the individual stages of the proposed method (i.e., C-SVDD-based encoding, SIFT-representation and multi-scale voting), we conduct a series of experiments on the STL-10 dataset by removing each of the three main stages in turn while leaving the remaining stages in place (the comparison is thus against our full method). Fig.10 gives the results. In general each stage is beneficial and (not shown) the results are cumulative over the stages, but the SIFT stage seems to contribute most to the performance improvement. This suggests that taking spatial information into global representation is of importance.

Iv-C Object Classification

Algorithm Accuracy()

Selective Receptive Fields (3 Layers) [27] (2011)
60.10 1.0
Trans. Invariant RBM (TIRBM) [28] (2012) 58.70
Simulated visual fixation ConvNet [29] (2012) 61.00
Discriminative Sum-Prod. Net (DSPN) [30] (2012) 62.30 1.0
Hierarchical Matching Pursuit (HMP) [31] (2013) 64.50 1.0
Deep Feedforward Networks [32] (2014) 68.00 0.55
BoW(K = 4800 D = 4800) 51.50 0.6
VLAD(K = 512 D = 40960) 57.60 0.6
FV (K = 256 D = 40960) 59.10 0.8
K-means (K = 4800 D = 19200) [27] (2011) 53.80 1.6
C-SVDD (K = 4800 D = 19200) 54.60 1.5
K-meansNet (K = 500 D = 36000) 63.07 0.6
C-SVDDNet (K = 500 D = 36000) 65.92 0.6
MSRV+K-meansNet 64.96 0.4
MSRV+C-SVDDNet 68.23 0.5
TABLE II: Comparative performance (%) on the STL-10 dataset.
Algorithm Error()
Deep Boltzmann Machines [26] (2009) 0.95

Convolutional Deep Belief Networks

[33] (2009)
Multi-column deep neural networks [34] (2012) 0.23
Network in Network [35] (2013) 0.47
Maxout Networks [36] (2013) 0.45
Regularization of neural networks [37] (2013) 0.21
PCANet [21] (2014) 0.62
Deeply-Supervised Nets [38] (2014) 0.39
K-means (1600 features) 1.01
C-SVDD (1600 features) 0.99
K-meansNet (400 features) 0.45
C-SVDDNet (400 features) 0.43
MSRV+K-meansNet 0.36
TABLE III: Comparative performance (%) on the MINST dataset.

STL-10 dataset Table.II gives our results on the STL-10 dataset. The major challenges of this dataset lie in that its images are captured in the wild with cluttered background, objects in various scales and poses. As before, we compared our method with several feature learning methods with state of the art performance. One can see that our one scale C-SVDD network obtains 65.92% accuracy, using a filtering dictionary of 500 atoms, outperforms several other feature encoding methods, such as Bag of Words (BoW), Vector of Linearly Agregated Descriptors (VLAD), Fisher vector (FV) and other unsupervised deep learning methods (e.g, Trans. Invariant RBM (TIRBM) [28], Selective Receptive Fields (SRF) [27], and Discriminative Sum-Product Networks (DSPN) [30]). This also indicates that spatial information preserving using SIFT is indeed useful in unsupervised feature learning. Also note that replacing the proposed C-SVDD encoding with K-means encoding leads to nearly 3.0% performance loss, while fusing the multi-scale information gives us about 2.3% improvement in accuracy, exceeding the current best performer [32] on this challenging dataset.

MINST dataset

The MNIST is one of the most popular datasets in pattern recognition. It consists of grey valued images of handwritten digits between 0 and 9. It has a training set of 60,000 examples, and a test set of 10,000 examples, all of which have been size-normalized and centered in a fixed-size image with 28

28 in pixel. In training we use a dictionary with 400 atoms for feature mapping, and after pooling/subsampling we break each feature map into 9 blocks to extract SIFT features. For multi-scale receptive voting, we use 3 types of receptive fields: , and . Combined these with two settings for pooling sizes (i.e., and , respectively), 6 different views/representations can be obtained for each image in this dataset.

Fig. 11: All misclassified 35 handwritten digits among 10,000 test examples by our method. The small digit in each white square is the ground truth label of the corresponding image, and the one in the green square is the prediction made by our method.

Table.III gives our experimental results on the MINST dataset. It is well-known that deep learning has achieved great success on this task of digit recognition. For example, only 95 among 10,000 test digits are misclassified by the Deep Boltzmann Machines [26], while Convolutional Deep Belief Networks [33] and Maxout Networks [36] respectively reduce this number to 82 and 45. Our simple single layer network (MSRV+C-SVDDnet) achieves an error as low as 0.35, which is highly competitive to other complex methods using deep architecture. Fig.11 shows all the 35 misclassified digits by our method, and one can see that these misclassified digits are very confusing even for human beings. Compared to the original K-means network [12], the proposed method reduces the error rate by 65%, with much smaller number of filters. This reveals that at least on this dataset with clean background, it is very beneficial to focus more on the representation of the details of the image, rather than emphasizing too much on its global aspects using a large number of filters and a large pooling size.

Iv-D Image Retrieval

Algorithm K D Holidays (mAP %)
best D’ = 2048 D’ = 512 D’ = 128 D’ = 64 D’ = 32
BoW [19] (2012) 20000 20000 45.2 41.80 44.9 45.2 44.4 41.8
FV [19] (2012) 256 16384 62.6 62.6 57.0 53.8 50.6 48.6
VLAD [19] (2012) 256 16384 62.1 62.1 56.7 54.2 51.3 48.1
VLAD+adapt+innorm [39] (2013) 256 16384 64.6 62.5
K-means 3200 12800 55.2 54.5 54.9 51.6 48.3 44.5
C-SVDD 3200 12800 57.4 56.8 57.0 53.1 50.5 46.8
K-meansNet 256 8192 62.5 59.8 62.5 61.3 55.5 49.5
C-SVDDNet 256 8192 66.0 63.7 65.8 64.7 59.3 52.1
MSRV+K-meansNet 256 81928 66.5 65.0 66.3 65.3 58.3 51.5
MSRV+C-SVDDNet 256 81928 70.2 68.6 69.8 68.5 62.5 53.8
TABLE IV: Comparative performance (mAP %) on the Holiday dataset.
Algorithm K D crop 50% transformations
best D’ = 128 best D’ = 128
BoW [19] (2012) 20k 20k 100.0 100.0 54.3 29.6
FV [19] (2012) 64 4096 98.7 92.7 59.6 41.2
VLAD [19] (2012) 64 4096 97.7 94.2 59.2 42.7
K-means 3200 12800 95.2 91.5 47.6 32.80
C-SVDD 3200 12800 97.4 93.8 52.2 36.60
K-meansNet 256 8192 96.8 94.3 55.4 37.80
C-SVDDNet 256 8192 100.0 98.1 62.2 52.0
MSRV+K-meansNet 256 81926 99.7 97.9 58.2 41.8
MSRV+C-SVDDNet 256 81926 100.0 100.0 65.6 55.3
TABLE V: Comparative performance (mAP %) on the Copydays dataset.

Holiday dataset INRIA Holiday dataset consists of 1491 images from personal holiday photos. There are 500 queries, most of which have 1-2 ground truth images. mAP (mean average precision) is employed to measure the retrieval accuracy. We resize all the images to 96 96. In training we use a dictionary with 256 atoms for feature mapping, and after pooling/subsampling we break each feature map into 4 blocks to extract SIFT features. Thus the dimension of final representation is 8196. And we also run PCA for dimensionality reduction as [19]. For multi-scale receptive voting, we use 2 types of receptive fields: and . Combined these with four settings for pooling sizes (i.e., , , and , respectively), 8 different views/representations can be obtained for each image in this dataset. Note that in image retrieval task, we can not train classifiers so that we just concatenate all the views’ representations to combine multi-scale information. In retrieval stage we use Euclidean distance in nearest neighbor searching as in [19] and [39], facilitating a fair comparison between various feature representation methods on this task.

Table.IV gives our experimental results on this dataset. We compare our method with BOW, VLAD, FV under different dimension ( reduced through PCA). BoW takes a 20k sized filter bank but has the lowest mAP (45.2%). Replacing BoW with K-means triangle encoding improves mAP by 10% (55.2%), but still needs a large filter bank of 3.2K.

Previous state-of-art unsupervised feature learning methods, i.e., VLAD and FV [19], can achieve a high mAP of 62.1% 62.6% respectively. And both of them only take a small set of filters of size 256. In [39] Arandjelovic combines VLAD with adaptive filter bank and a new normalization to achieve an accuracy of 64.6%. Our proposed C-SVDDNet can get a mAP of 66.0% with 256 filters as well. It outperforms VLAD by 3.9% and VLAD+adapt+innorm by 1.4%. Even if we reduce its dimension to smaller sizes with PCA, it consistently achieves the best performance among the compared ones.

Also note that replacing K-means encoding with C-SVDD encoding results in significant improvement (from K-means 55.2% to C-SVDD 57.4%, and from K-meansNet 62.5% to C-SVDDNet 66.0%). When concatenating multi-scale representation from 8 views, we are able to achieve the highest mAP of 70.2%, without using any supervision information.

Copydays dataset INRIA Copydays dataset was designed to evaluate near-duplicate detection [19]. The dataset contains 157 original images. To obtain query images relevant in a copy detection scenario, each image of the dataset has been transformed with three different types of transformation: image resizing, cropping (Here we use only the queries with the cropping parameter fixed to 50%), strong transformations (print and scan, occlusion, change in contrast, perspective effect, blur, etc). There is in total 229 transformed images, each of which has only a single matching image in the database. All images are resized to 75 75. We use 2 types of receptive fields and , together with three pooling sizes (i.e., , and respectively), which result 6 different views. To challenge ourself in this experiments we also merge the database with 10k web images as [19] does.

It is a large scale retrieval task. Table.V gives our experimental results on this dataset. We can see that in the 50% cropped circumstance, our C-SVDDNet with only 256 filters can be robust enough to achieve a mAP of 100% as BoW with 20k filters. When reducing its dimension to 128 bits, it still performs the second best. In the strong transformation setting, our C-SVDDNet achieves a mAP of 62.2% which outperforms VLAD (59.2%) and FV (59.6%) by nearly 3%.

Furthermore, one can see that C-SVDD encoding allows our C-SVDDNet improve upon K-meansNet by 6.8% in terms of mAP. When reduced to 128 bits, our C-SVDDNet achieves a mAP of 52% under the difficult cases of strong transformation, which outperforms other compared methods by more than 10%, while our multi-scale version improves the mAP by 3%.

V Conclusion

In this paper, we propose a simple one-layer neural network termed C-SVDDNet for unsupervised feature learning. One of the major advantages of the proposed method is that it allows effective feature representation for many applications, such as object classification and image retrieval, by exploiting unlabeled data which are often cheap and readily available. We show that when properly combined with the SIFT descriptors, such representation could be made even more efficient and discriminant. Extensive experiments on several challenging object classification datasets and image retrieval datasts demonstrate that the proposed method significantly outperforms previous state of the art unsupervised feature learning methods such as Bag of Word, VLAD [19], and FV [20].

Additionally, we show that for feature representation, a very big dictionary is not necessary, as one could accumulate rich information in each feature map and preserve them with compact encoding (e.g, using the proposed method). This significantly reduces the computational cost. Last but not least, we show that one can use multi-scale information to further improve the performance without training many layers of networks - after all training several shallow networks is much easier than training a deep one.


The work was financed by the National Science Foundation of China (61073112), the National Science Foundation of Jiangsu Province (BK2012793), and the Doctoral Fund of Ministry of Education of China (20123218110033).


  • [1] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” arXiv preprint arXiv:1206.5538, 2012.
  • [2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [3] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the best multi-stage architecture for object recognition?” in Computer Vision, 2009 IEEE 12th International Conference on.   IEEE, 2009, pp. 2146–2153.
  • [4] J. Bruna and S. Mallat, “Invariant scattering convolution networks,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 8, pp. 1872–1886, 2013.
  • [5] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng, “Building high-level features using large scale unsupervised learning,” arXiv preprint arXiv:1112.6209, 2011.
  • [6] G. Carneiro, J. C. Nascimento, and A. Freitas, “The segmentation of the left ventricle of the heart from ultrasound data using deep learning architectures and derivative-based search methods,” Image Processing, IEEE Transactions on, vol. 21, no. 3, pp. 968–982, 2012.
  • [7] P. P. San, S. H. Ling, Nuryani, and H. Nguyen, “Evolvable rough-block-based neural network and its biomedical application to hypoglycemia detection system.” Cybernetics IEEE Transactions on, vol. 44, no. 8, pp. 1338 – 1349, 2014.
  • [8] L. Shuai and Y. Li, “Nonlinearly activated neural network for solving time-varying complex sylvester equation,” Cybernetics IEEE Transactions on, vol. 44, no. 8, pp. 1397 – 1407, 2014.
  • [9] A. Agarwal and B. Triggs, “Hyperfeatures–multilevel local coding for visual recognition,” in Proc. Ninth European Conf. Computer Vision, 2006.   Springer, 2006, pp. 30–43.
  • [10] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
  • [11] M. A. Cueto, J. Morton, and B. Sturmfels, “Geometry of the restricted boltzmann machine,”

    Algebraic Methods in Statistics and Probability,(eds. M. Viana and H. Wynn), AMS, Contemporary Mathematics

    , vol. 516, pp. 135–153, 2010.
  • [12] A. Coates, H. Lee, and A. Y. Ng, “An analysis of single-layer networks in unsupervised feature learning,” Ann Arbor, vol. 1001, p. 48109, 2010.
  • [13] D. M. Tax and R. P. Duin, “Support vector data description,” Machine learning, vol. 54, no. 1, pp. 45–66, 2004.
  • [14] J. Xu, J. Yao, and L. Ni, “Fault detection based on svdd and cluster algorithm,” in Electronics, Communications and Control (ICECC), 2011 International Conference on.   IEEE, 2011, pp. 2050–2052.
  • [15]

    A. Banerjee and P. Burlina, “Efficient particle filtering via sparse kernel density estimation,”

    Image Processing, IEEE Transactions on, vol. 19, no. 9, pp. 2480–2490, 2010.
  • [16] D. Wang and X. Tan, “Centering svdd for unsupervised feature representation in object classification,” in Neural Information Processing.   Springer, 2013, pp. 376–383.
  • [17] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [18] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, “Visual categorization with bags of keypoints,” in Workshop on statistical learning in computer vision, ECCV, vol. 1, no. 1-22.   Prague, 2004, pp. 1–2.
  • [19] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid, “Aggregating local image descriptors into compact codes,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 34, no. 9, pp. 1704–1716, 2012.
  • [20] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, “Image classification with the fisher vector: Theory and practice,” International journal of computer vision, vol. 105, no. 3, pp. 222–245, 2013.
  • [21] T.-H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma, “Pcanet: A simple deep learning baseline for image classification?” arXiv preprint arXiv:1404.3606, 2014.
  • [22] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman, “The devil is in the details: an evaluation of recent feature encoding methods,” 2011.
  • [23]

    F. Nie, J. Yuan, and H. Huang, “Optimal mean robust principal component analysis,” in

    Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1062–1070.
  • [24] H. Jégou, M. Douze, and C. Schmid, “Hamming embedding and weak geometry consistency for large scale image search-extended version,” 2008.
  • [25] M. Douze, H. Jégou, H. Sandhawalia, L. Amsaleg, and C. Schmid, “Evaluation of gist descriptors for web-scale image search,” in Proceedings of the ACM International Conference on Image and Video Retrieval.   ACM, 2009, p. 19.
  • [26] R. Salakhutdinov and G. E. Hinton, “Deep boltzmann machines,” in

    International Conference on Artificial Intelligence and Statistics

    , 2009, pp. 448–455.
  • [27] A. Coates and A. Y. Ng, “Selecting receptive fields in deep networks.” in NIPS, vol. 5, 2011, p. 8.
  • [28] K. Sohn and H. Lee, “Learning invariant representations with local transformations,” arXiv preprint arXiv:1206.6418, 2012.
  • [29] W. Y. Zou, A. Y. Ng, S. Zhu, and K. Yu, “Deep learning of invariant features via simulated fixations in video.” in NIPS, 2012, pp. 3212–3220.
  • [30] R. Gens and P. Domingos, “Discriminative learning of sum-product networks.” in NIPS, 2012, pp. 3248–3256.
  • [31] L. Bo, X. Ren, and D. Fox, “Unsupervised feature learning for rgb-d based object recognition,” in Experimental Robotics.   Springer, 2013, pp. 387–402.
  • [32] B. Miclut, “Committees of deep feedforward networks trained with few data,” in Pattern Recognition.   Springer, 2014, pp. 736–742.
  • [33] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in Proceedings of the 26th Annual International Conference on Machine Learning.   ACM, 2009, pp. 609–616.
  • [34] D. Ciresan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification,” in Proc. IEEE Computer Vision and Pattern Recognition (CVPR), 2012, pp. 3642–3649.
  • [35] M. Lin, Q. Chen, and S. Yan, “Network in network,” CoRR, vol. abs/1312.4400, 2013. [Online]. Available: http://arxiv.org/abs/1312.4400
  • [36] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” arXiv preprint arXiv:1302.4389, 2013.
  • [37] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, “Regularization of neural networks using dropconnect,” in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 1058–1066.
  • [38] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets,” arXiv preprint arXiv:1409.5185, 2014.
  • [39] R. Arandjelovic and A. Zisserman, “All about vlad,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on.   IEEE, 2013, pp. 1578–1585.