Deep convolutional neural networks trained via backpropagation have recently been shown to perform well on image classification tasks containing millions of images and thousands of categories [17, 24]. While deep convolutional neural networks have been known to yield good results on supervised image classification tasks such as MNIST for a long time, the recent successes are made possible through optimized implementations, efficient model averaging and data augmentation techniques. The feature representation learned by these networks achieves state-of-the-art performance not only on the classification task the network is trained for, but also on various other computer vision tasks, for example: classification on Caltech-101 [24, 7], Caltech-256, the Caltech-UCSD birds dataset and the SUN-397 scene recognition database; detection on the PASCAL VOC dataset. This capability to generalize to new datasets indicates that supervised discriminative learning is currently the best known algorithm for visual feature learning. The downside of this approach is the need for expensive labeling, as the amount of required labels grows quickly the larger the model gets. For this reason unsupervised learning, although currently underperforming, remains an appealing paradigm, since it can make use of raw unlabeled images and videos which are readily available in virtually infinite amounts.
In this work we aim to combine the power of discriminative supervised learning with the simplicity of unsupervised data acquisition. The main novelty of our approach is the way we obtain training data for a convolutional network in an unsupervised manner. In the standard supervised setting there exists a large set of labeled images, which may be further augmented by small translations, rotations or color variations to generate even more (and more diverse) training data.
In contrast, our method does not require any labeled data at all: we use the augmentation step alone to create surrogate training data from a set of unlabeled images. We start with trivial surrogate classes consisting of one random image patch each, and then augment the data by applying a random set of transformations to each patch. After that we train a convolutional neural network to classify these surrogate classes. The feature representation learned by the network is, by construction, discriminative and at the same time invariant to typical data transformations. Nevertheless, it is not immediately clear whether the feature representation learned from this surrogate task would perform well on general image classification problems. Our experiments show that, indeed, this simple unsupervised feature learning algorithm achieves competitive or state-of-the-art results on several benchmarks.
By performing image augmentation we provide prior knowledge about natural image distribution to the training algorithm. More precisely, by assigning the same label to all transformed versions of an image patch we force the learned feature representation to be invariant to the transformations applied. This can be seen as an indirect form of supervision: our algorithm needs some expert knowledge about which transformations the features should be invariant to. However, similar expert knowledge is used in most other unsupervised feature learning algorithms. Features are usually learned from small image patches, which assumes translational invariance. Turning images to grayscale assumes invariance to color changes. Whitening or contrast normalization assumes invariance to contrast changes and, largely, color variations.
1.1 Related work
Our approach is related to a large body of work on unsupervised learning and convolutional neural networks. In contrast to our method, most unsupervised learning approaches, e.g. [13, 14, 23, 6, 25], rely on modeling the input distribution explicitly – often via a reconstruction error term – rather than training a discriminative model and thus cannot be used to jointly train multiple layers of a deep neural network in a straightforward manner. Among these unsupervised methods, most similar to our approach are several studies on learning invariant representations from transformed input samples, for example [22, 25, 15].
Our proposed method can be related to work on metric learning, for example [10, 12]. However, instead of enforcing a metric on the feature representation directly, as in , we only implicitly force the representation of transformed images to be mapped close together through the introduced surrogate labels. This enables us to use discriminative training for learning a feature representation which performs well in classification tasks.
Learning invariant features with a discriminative objective was previously considered in early work on tangent propagation , which aims to learn features invariant to small predefined transformations by directly penalizing the derivative of the network output with respect to the parameters of the transformation. In contrast to their work, our algorithm does not rely on labeled data and is less dependent on a small magnitude of the applied transformations. Tangent propagation has been successfully combined with an unsupervised feature learning algorithm in  to build a classifier exploiting information about the manifold structure of the learned representation. This, however, again comes with the disadvantages of reconstruction-based training.
Loosely related to our work is research on using unlabeled data for regularizing supervised algorithms, for example self-training  or entropy regularization [11, 19]. In contrast to these semi-supervised methods, our training procedure, as mentioned before, does not make any use of labeled data. Finally, the idea of creating a pseudo-task to improve the performance of a supervised algorithm is used in .
2 Learning algorithm
Here we describe in detail our feature learning pipeline. The two main stages of our approach are generating the surrogate training data and training a convolutional neural network using this data.
2.1 Data acquisition
The input to our algorithm is a set of unlabeled images, which come from roughly the same distribution as the images we later aim to classify. We randomly sample random patches of size pixels from different images, at varying positions and scales. We only sample from regions with considerable gradient energy to avoid getting uniformly colored patches. Then we apply random transformations to each of the sampled patches. Each of these random transformations is a composition of four random ’elementary’ transformations from the following list:
Translation: translate the patch by a distance within of the patch size vertically and horizontally.
Scale: multiply the scale of the patch by a factor between and .
Color: multiply the projection of each patch pixel onto the principal components of the set of all pixels by a factor between and (factors are independent for each principal component and the same for all pixels within a patch).
Contrast: raise saturation and value (S and V components of the HSV color representation) of all pixels to a power between and (same for all pixels within a patch).
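The sampling of one composite transformation from the list above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the ranges in `RANGES` are placeholder values picked for the example, all function names are hypothetical, translation and scale are expressed only as a crop box in the source image, and the PCA-based color transformation is omitted. The contrast step uses the standard-library `colorsys` module.

```python
import colorsys
import random

# Placeholder ranges, assumed for illustration only (not the paper's values).
RANGES = {
    "translation": (-0.2, 0.2),  # fraction of the patch size, per axis
    "scale": (0.7, 1.4),         # multiplicative scale factor
    "contrast": (0.5, 2.0),      # exponent applied to S and V
}


def sample_transformation(rng):
    """Draw the parameters of one composite transformation."""
    lo, hi = RANGES["translation"]
    return {
        "dx": rng.uniform(lo, hi),
        "dy": rng.uniform(lo, hi),
        "scale": rng.uniform(*RANGES["scale"]),
        "contrast": rng.uniform(*RANGES["contrast"]),
    }


def crop_box(cx, cy, size, params):
    """Source-image box (left, top, right, bottom) of the shifted, rescaled patch."""
    cx += params["dx"] * size
    cy += params["dy"] * size
    half = size * params["scale"] / 2.0
    return (cx - half, cy - half, cx + half, cy + half)


def apply_contrast(pixels, power):
    """Raise S and V of every RGB pixel (floats in [0, 1]) to the same power."""
    out = []
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        out.append(colorsys.hsv_to_rgb(h, s ** power, v ** power))
    return out


def make_surrogate_class(pixels, center, size, n_samples, seed=0):
    """Return n_samples (crop box, contrast-adjusted pixels) pairs for one seed patch."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        p = sample_transformation(rng)
        box = crop_box(center[0], center[1], size, p)
        samples.append((box, apply_contrast(pixels, p["contrast"])))
    return samples
```

In a full pipeline the crop box would be resampled from the source image to the fixed patch size before the color and contrast steps; the sketch keeps these pieces separate so that each elementary transformation is easy to inspect.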
We do not apply any preprocessing to the obtained patches other than subtracting the mean of each pixel over the whole training dataset. Examples of patches sampled from the STL-10 unlabeled dataset are shown in Fig. 2. Examples of transformed versions of one patch are shown in Fig. 2.
As a result of the procedure described above, to each patch $x_i \in X$ from the set of initially sampled patches $X = \{x_1, \ldots, x_N\}$ we apply a set of transformations $\mathcal{T}_i = \{T_i^1, \ldots, T_i^K\}$ and get a set of its transformed versions $S_{x_i} = \mathcal{T}_i x_i = \{T x_i \mid T \in \mathcal{T}_i\}$. We then declare each of these sets to be a class by assigning label $i$ to the class $S_{x_i}$ and train a convolutional neural network to discriminate between these surrogate classes. Formally, we minimize the following loss function:
$$L(X) = \sum_{x_i \in X} \sum_{T \in \mathcal{T}_i} l(i, T x_i),$$
where $l(i, T x_i)$ is the loss on the sample $T x_i$ with (surrogate) true label $i$. We use a convolutional neural network with cross entropy loss on top of the softmax output layer of the network, hence in our case
$$l(i, T x_i) = CE(e_i, f(T x_i)), \qquad CE(y, f) = -\sum_k y_k \log f_k,$$
where $f$ denotes the function computing the values of the output layer of the neural network given the input data, and $e_i$ is the $i$-th standard basis vector.
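The loss described above, that is, cross entropy between the softmax output and the standard basis vector of the surrogate label, summed over all transformed patches, can be written out in a few lines of plain Python. This is only an illustrative sketch: the raw scores (`logits`) stand in for hypothetical network outputs before the softmax, class indices are 0-based, and the function names are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs):
    """CE(y, f) = -sum_k y_k * log(f_k); terms with y_k == 0 are skipped."""
    return -sum(y * math.log(p) for y, p in zip(target, probs) if y > 0)


def basis_vector(i, n):
    """i-th standard basis vector of length n (0-based index)."""
    return [1.0 if k == i else 0.0 for k in range(n)]


def surrogate_loss(logits_per_sample, labels):
    """Sum of per-sample losses over all transformed patches."""
    n_classes = len(logits_per_sample[0])
    return sum(
        cross_entropy(basis_vector(i, n_classes), softmax(z))
        for z, i in zip(logits_per_sample, labels)
    )
```

Because the target is a basis vector, each per-sample term reduces to the negative log-probability that the network assigns to the surrogate class of the patch.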
For training the network we use an implementation based on the fast convolutional neural network code from , modified to support dropout. We use a fixed network architecture in all experiments: convolutional layers with filters of size each followed by fully connected layer of neurons with dropout and a softmax layer on top. We perform max-pooling after convolutional layers and do not perform any contrast normalization between layers. We start with a learning rate of and gradually decrease the learning rate during training. That is, we train until there is no improvement in validation error, then decrease the learning rate by a factor of , and repeat this procedure several times until there is no more significant improvement in validation error.
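The learning rate schedule described above (train until the validation error stops improving, shrink the learning rate, repeat) can be sketched as a small driver loop. The sketch makes several assumptions that are not in the text: `train_epoch` and `validation_error` are hypothetical callbacks supplied by the caller, `factor` is a multiplier smaller than one, the number of reductions `n_drops` is fixed in advance, and "no improvement" is taken to mean `patience` consecutive epochs without a new best validation error.

```python
def train_with_lr_drops(train_epoch, validation_error, lr, factor, n_drops, patience=1):
    """Train until the validation error stalls, then shrink the learning rate; repeat.

    train_epoch(lr) runs one epoch; validation_error() returns the current error.
    Returns a list of (learning rate, best error so far) pairs, one per stage.
    """
    history = []
    best = validation_error()
    for _ in range(n_drops + 1):
        stalled = 0
        while stalled < patience:
            train_epoch(lr)
            err = validation_error()
            if err < best:
                best, stalled = err, 0
            else:
                stalled += 1
        history.append((lr, best))
        lr *= factor  # reduce the step size for the next stage
    return history
```

A stopping rule based on the size of the improvement between stages, rather than a fixed number of reductions, would be closer to "until there is no more significant improvement", at the cost of one more threshold to choose.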
In some of our experiments, in which the number of surrogate classes is large relative to the number of training samples per surrogate class, we observed that during the training process the training error does not significantly decrease compared to initial chance level. To alleviate this problem, before training the network on the whole surrogate dataset we pre-train it on a subset with fewer surrogate classes, typically . We stop the pre-training as soon as the training error starts falling, indicating that the optimization found a direction towards a good local minimum. We then use the weights learned by this pre-training phase as an initialization for training on the whole surrogate dataset.
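A rough sketch of this pre-training step is given below. It is an illustration under assumptions rather than the procedure actually used: `train_step` and `training_error` are hypothetical caller-supplied callbacks, the classes are assumed to be balanced so that chance-level error is 1 - 1/n, and "the training error starts falling" is approximated by the error dropping below chance level by an assumed `margin`.

```python
import random


def pick_pretraining_classes(n_classes, n_subset, seed=0):
    """Choose a random subset of surrogate class indices for pre-training."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_classes), n_subset))


def pretrain_until_error_falls(train_step, training_error, n_subset,
                               margin=0.05, max_steps=1000):
    """Run training steps until the error drops clearly below chance level.

    Chance-level error for n_subset balanced classes is 1 - 1/n_subset.
    Returns the number of steps taken (max_steps if the error never fell).
    """
    chance = 1.0 - 1.0 / n_subset
    for step in range(1, max_steps + 1):
        train_step()
        if training_error() < chance - margin:
            return step
    return max_steps
```

The weights reached at the returned step would then serve as the initialization for training on the whole surrogate dataset.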
When the training procedure is finished, we apply the learned feature representation to classification tasks on ’real’ datasets, consisting of images which may differ in size from the surrogate training images. To extract features from these new images, we convolutionally compute the responses of all the network layers except the top softmax and form a 3-layer spatial pyramid of them. We then train a linear support vector machine (SVM) on these features. We select the hyperparameters of the SVM via cross-validation.
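The spatial pyramid step turns feature maps of arbitrary spatial size into a fixed-length vector, which is what allows a single linear classifier to handle images of different sizes. The sketch below assumes max-pooling over 1×1, 2×2 and 4×4 grids; both the pooling operator and the grid sizes are assumptions made for illustration, and the function names are hypothetical.

```python
def pool_region(fmap, r0, r1, c0, c1):
    """Max over a rectangular region of a 2-D feature map (list of rows)."""
    return max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1))


def spatial_pyramid(fmaps, levels=(1, 2, 4)):
    """Concatenate max-pooled responses of every channel over each grid level.

    fmaps: list of channels, each a 2-D list with at least max(levels)
    rows and columns. The output length is len(fmaps) * sum(g * g for g in levels),
    independent of the spatial size of the input.
    """
    features = []
    for fmap in fmaps:
        h, w = len(fmap), len(fmap[0])
        for g in levels:
            for i in range(g):
                for j in range(g):
                    r0, r1 = i * h // g, (i + 1) * h // g
                    c0, c1 = j * w // g, (j + 1) * w // g
                    features.append(pool_region(fmap, r0, r1, c0, c1))
    return features
```

The resulting vectors from all layers would be concatenated and passed to any standard linear SVM implementation, which is not shown here.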
3 Experiments
We report our classification results on the STL-10, CIFAR-10 and Caltech-101 datasets, approaching or exceeding state of the art for unsupervised algorithms on each of them. We also evaluate the effects of the number of surrogate classes and the number of training samples per surrogate class in the training data. For training the network in all our experiments we generate a surrogate dataset using patches extracted from the STL-10 unlabeled dataset.
For STL-10 we use the usual testing protocol of averaging the results over 10 pre-defined folds of training data and report the mean and the standard deviation. For CIFAR-10 we report two results: ’CIFAR-10’ means training on the whole CIFAR-10 training set and ’CIFAR-10-reduced’ means the average over 10 random selections of 400 training samples per class. For Caltech-101 we follow the usual protocol of selecting 30 random samples per class for training and not more than 50 samples per class for testing, repeated 10 times.
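The repeated random-subset protocols above (for example, 10 random selections of 400 training samples per class) can be sketched as follows. The `evaluate` callback, which would train a classifier on the selected samples and return its test accuracy, is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is reported.

```python
import random
import statistics


def sample_per_class(labels, n_per_class, rng):
    """Indices of n_per_class randomly chosen samples for every class label."""
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    chosen = []
    for y in sorted(by_class):
        chosen.extend(rng.sample(by_class[y], n_per_class))
    return chosen


def repeated_evaluation(labels, n_per_class, n_repeats, evaluate, seed=0):
    """Mean and standard deviation of accuracy over random training subsets.

    evaluate(train_indices) must return the test accuracy obtained when
    training on the given samples.
    """
    rng = random.Random(seed)
    scores = [
        evaluate(sample_per_class(labels, n_per_class, rng))
        for _ in range(n_repeats)
    ]
    return statistics.mean(scores), statistics.stdev(scores)
```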
3.1 Classification results
In Table 1 we compare our classification results to other recent work. Our network is trained on a surrogate dataset with surrogate classes containing samples each. Recall that for extracting features at test time we use the first layers of the network with , and filters respectively. The feature representation is hence considerably more compact than in most competing approaches. We do not list the results of supervised methods on CIFAR-10 (the best of which currently exceed accuracy), since those are not directly comparable to our unsupervised feature learning method.
As can be seen in the table, our results are comparable to state of the art on CIFAR-10 and exceed the performance of many unsupervised algorithms on Caltech-101. On STL-10, for which the image distribution of the test dataset is closest to the surrogate samples, our algorithm reaches accuracy outperforming all other approaches by a large margin.
3.2 Influence of the data acquisition on classification performance
Our pipeline lets us easily vary the number of surrogate classes in the training data and the number of training samples per surrogate class. We use this to measure the effect of these factors on the quality of the resulting features. We vary the number of surrogate classes between and and the number of training samples per surrogate class between and . The results are shown in Fig. 4 and 4. In Fig. 4 we also show, as a baseline, the classification performance of random filters (all weights are sampled from a normal distribution with standard deviation, all biases are set to zero). Initializing the random filters does not require any training data and can hence be seen as using samples per surrogate class. Error bars in Fig. 4 show the standard deviations computed when testing on folds of the STL-10 dataset.
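The random-filter baseline amounts to initializing the network and skipping training. A minimal sketch of such an initialization for one convolutional layer is shown below; the zero mean of the Gaussian, the nested-list layout and the function name are assumptions for illustration, and the standard deviation is left as a parameter.

```python
import random


def random_filter_bank(n_filters, n_channels, size, std, seed=0):
    """Filters with i.i.d. Gaussian weights (mean 0, given std) and zero biases.

    Returns (weights, biases), where weights is indexed as
    weights[filter][channel][row][column].
    """
    rng = random.Random(seed)
    weights = [
        [[[rng.gauss(0.0, std) for _ in range(size)] for _ in range(size)]
         for _ in range(n_channels)]
        for _ in range(n_filters)
    ]
    biases = [0.0] * n_filters
    return weights, biases
```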
An apparent trend in Fig. 4 is that increasing the number of surrogate classes results in an increase in classification accuracy until it reaches an optimum at around surrogate classes. When the number of surrogate classes is further increased the classification results do not change or slightly decrease. One explanation for this behavior is that the larger the number of surrogate classes becomes, the more these classes overlap. As a result of this overlap the classification problem becomes more difficult and adapting the network to the surrogate task no longer succeeds. To check the validity of this explanation we also plot in Fig. 4 the classification error on the validation set (taken from the surrogate data) computed after training the network. It rapidly grows as the number of surrogate classes increases, supporting the claim that the task quickly becomes more difficult as the number of surrogate classes increases.
Fig. 4 shows that classification accuracy increases with increasing number of samples per surrogate class and saturates around samples. It can also be seen that when training with small numbers of samples per surrogate class, there is no clear indication that having more classes leads to better performance. We hypothesize that the reason may be that with few training samples per class the surrogate classification problem is too simple and hence the network can severely overfit, which results in poor and unstable generalization to real classification tasks. However, starting from around samples per surrogate class, the surrogate task gets sufficiently complicated and the networks with more diverse training data (more surrogate classes) perform consistently better.
We proposed a simple unsupervised feature learning approach based on data augmentation that shows good results on a variety of classification tasks. While our approach sets the state of the art on STL-10 it remains to be seen whether this success can be translated into consistently better performance on other datasets.
The performance of our method saturates when the number of surrogate classes increases. One probable reason for this is that the surrogate task we use is relatively simple and does not allow the network to learn complex invariances such as 3D viewpoint invariance or inter-instance invariance. We hypothesize that our unsupervised feature learning method could learn more powerful higher-level features if the surrogate data were more similar to real-world labeled datasets. This could be achieved by using extra weak supervision provided for example by video data or a small number of labeled samples. Another possible way of obtaining richer surrogate training data would be (unsupervised) merging of similar surrogate classes. We see these as interesting directions for future work.
We acknowledge funding by the ERC Starting Grant VideoLearn (279401).
References
[1] A. Ahmed, K. Yu, W. Xu, Y. Gong, and E. Xing. Training hierarchical feed-forward visual recognition models using transfer learning from pseudo-tasks. In ECCV (3), pages 69–82, 2008.
[2] M.-R. Amini and P. Gallinari. Semi supervised logistic regression. In ECAI, pages 390–394, 2002.
[3] L. Bo, X. Ren, and D. Fox. Unsupervised Feature Learning for RGB-D Based Object Recognition. In ISER, June 2012.
[4] L. Bo, X. Ren, and D. Fox. Multipath sparse coding using hierarchical matching pursuit. In CVPR, pages 660–667, 2013.
[5] Y. Boureau, N. Le Roux, F. Bach, J. Ponce, and Y. LeCun. Ask the locals: multi-way local pooling for image recognition. In Proc. International Conference on Computer Vision (ICCV’11). IEEE, 2011.
[6] A. Coates and A. Y. Ng. Selecting receptive fields in deep networks. In NIPS, pages 2528–2536, 2011.
[7] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. 2013. pre-print, arXiv:1310.1531v1 [cs.CV].
[8] R. Gens and P. Domingos. Discriminative learning of sum-product networks. In NIPS, pages 3248–3256, 2012.
[9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. 2013. pre-print, arXiv:1311.2524v1 [cs.CV].
[10] J. Goldberger, S. T. Roweis, G. E. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In NIPS, 2004.
[11] Y. Grandvalet and Y. Bengio. Entropy regularization. In O. Chapelle, B. Schölkopf, and A. Zien, editors, Semi-Supervised Learning, pages 151–168. MIT Press, 2006.
[12] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In Proc. Computer Vision and Pattern Recognition Conference (CVPR’06), 2006.
[13] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Comput., 18(7):1527–1554, July 2006.
[14] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006.
[15] K. Y. Hui. Direct modeling of complex invariances for visual object features. In S. Dasgupta and D. Mcallester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 352–360. JMLR Workshop and Conference Proceedings, May 2013.
[16] Y. Jia, C. Huang, and T. Darrell. Beyond spatial pyramids: Receptive field learning for pooled image features. In CVPR, pages 3370–3377. IEEE, 2012.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
[19] D.-H. Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, 2013.
[20] S. Rifai, Y. N. Dauphin, P. Vincent, Y. Bengio, and X. Muller. The manifold tangent classifier. In Advances in Neural Information Processing Systems 24 (NIPS), 2011.
[21] P. Simard, B. Victorri, Y. LeCun, and J. S. Denker. Tangent prop - a formalism for specifying selected invariances in an adaptive network. In Advances in Neural Information Processing Systems 4 (NIPS), 1992.
[22] K. Sohn and H. Lee. Learning invariant representations with local transformations. In ICML, 2012.
[23] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pages 1096–1103, New York, NY, USA, 2008. ACM.
[24] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. 2013. pre-print, arXiv:1311.2901v3 [cs.CV].
[25] W. Y. Zou, A. Y. Ng, S. Zhu, and K. Yu. Deep learning of invariant features via simulated fixations in video. In NIPS, pages 3212–3220, 2012.