Convolutional Neural Networks Applied to House Numbers Digit Classification

04/18/2012 ∙ by Pierre Sermanet, et al.

We classify digits of real-world house numbers using convolutional neural networks (ConvNets). ConvNets are hierarchical feature learning neural networks whose structure is biologically inspired. Unlike many popular vision approaches that are hand-designed, ConvNets can automatically learn a unique set of features optimized for a given task. We augment the traditional ConvNet architecture by learning multi-stage features and by using Lp pooling, and establish a new state of the art of 94.85% accuracy on the SVHN dataset (a 45.2% error improvement over the previous best of 90.6%). Furthermore, we analyze the benefits of different pooling methods and multi-stage features in ConvNets. The source code and a tutorial are available at




1 Architecture

The ConvNet architecture is composed of repeatedly stacked feature stages. Each stage contains a convolution module, followed by a pooling/subsampling module and a normalization module. While traditional pooling modules in ConvNets use either average or max pooling, we use Lp pooling here. The normalization module is subtractive only, as opposed to subtractive and divisive: the mean value of each neighborhood is subtracted from the output of each stage, but the result is not divided by the standard deviation, because doing so decreases performance on this dataset. Finally, multi-stage features are used instead of single-stage features.
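The subtractive normalization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the uniform 3x3 window, the edge-replicated borders and the function name `subtractive_normalization` are all assumptions made for the example.

```python
import numpy as np

def subtractive_normalization(fmap, size=3):
    """Subtract the mean of each size x size neighborhood from a 2-D feature map.

    Assumptions (not from the paper): a uniform window and edge-replicated
    borders. No division by the local standard deviation is performed.
    """
    pad = size // 2
    padded = np.pad(fmap, pad, mode="edge")
    local_mean = np.zeros(fmap.shape, dtype=float)
    # Sum the shifted copies of the map that make up each neighborhood.
    for dy in range(size):
        for dx in range(size):
            local_mean += padded[dy:dy + fmap.shape[0], dx:dx + fmap.shape[1]]
    local_mean /= size * size
    return fmap - local_mean
```

A constant feature map is mapped to zeros, since every neighborhood mean equals the value it is subtracted from.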

1.1 Lp-Pooling

Figure 2: L2-pooling applied to a 9x9 feature map with a 3x3 Gaussian kernel and 2x2 stride.

Lp pooling is a biologically inspired pooling layer modelled on complex cells [12, 5], whose operation can be summarized in equation (1), where $G$ is a Gaussian kernel, $I$ is the input feature map and $O$ is the output feature map. It can be imagined as giving an increased weight to stronger features and suppressing weaker features. Two special cases of Lp pooling are notable: $P = 1$ corresponds to a simple Gaussian averaging, whereas $P = \infty$ corresponds to max-pooling (i.e. only the strongest signal is activated). Lp-pooling has been used previously in [6, 15], and a theoretical analysis of this method is described in [1].

$$O = \left( \sum \sum I(i,j)^{P} \times G(i,j) \right)^{1/P} \qquad (1)$$

Figure 2 demonstrates a simple example of L2-pooling.
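As a rough sketch of the setting in Figure 2, the code below applies Lp pooling with a 3x3 Gaussian kernel and a 2x2 stride, taking the pooled value of each window to be $(\sum G \cdot |I|^{P})^{1/P}$. The kernel width `sigma`, the normalization of the kernel to sum to one, the absolute value, and the function names are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def gaussian_kernel(size=3, sigma=1.0):
    """Normalized 2-D Gaussian kernel (sigma is an assumed value)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()

def lp_pool(fmap, p=2.0, size=3, stride=2, sigma=1.0):
    """Gaussian-weighted Lp pooling of a 2-D feature map (no padding)."""
    kernel = gaussian_kernel(size, sigma)
    out_h = (fmap.shape[0] - size) // stride + 1
    out_w = (fmap.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            window = fmap[y * stride:y * stride + size,
                          x * stride:x * stride + size]
            if np.isinf(p):
                # Limit case: only the strongest response survives.
                out[y, x] = np.abs(window).max()
            else:
                out[y, x] = np.sum(kernel * np.abs(window) ** p) ** (1.0 / p)
    return out

feature_map = np.random.default_rng(0).random((9, 9))
pooled = lp_pool(feature_map, p=2.0)
print(pooled.shape)  # (4, 4)
```

With a normalized kernel, `p=1.0` reduces to Gaussian averaging and larger values of `p` move the output towards the window maximum.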

Task | Single-Stage features | Multi-Stage features | Improvement %
Pedestrian detection (INRIA) [9] | 14.26% | 9.85% | 31%
Traffic sign classification (GTSRB) [11] | 1.80% | 0.83% | 54%
House number classification (SVHN) | 5.72% | 5.67% | 0.9%
Table 1: Error rate improvements of multi-stage features over single-stage features for different object detection and classification tasks. Improvements are significant for multi-scale and textured objects such as traffic signs and pedestrians but minimal for house numbers.
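The improvement column of Table 1 is consistent with a relative error reduction, (SS error - MS error) / SS error. The short check below recomputes it from the error rates in the table; the function name `relative_improvement` is introduced here only for illustration.

```python
def relative_improvement(single_stage_error, multi_stage_error):
    """Relative error reduction, in percent, of MS over SS features."""
    return 100.0 * (single_stage_error - multi_stage_error) / single_stage_error

# Error rates (in percent) copied from Table 1.
table1 = {
    "Pedestrian detection (INRIA)": (14.26, 9.85),
    "Traffic sign classification (GTSRB)": (1.80, 0.83),
    "House number classification (SVHN)": (5.72, 5.67),
}
for task, (ss, ms) in table1.items():
    print(f"{task}: {relative_improvement(ss, ms):.1f}%")
```

The three values round to the 31%, 54% and 0.9% reported in the table.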
Figure 3: A 2-stage ConvNet architecture where Multi-Stage features (MS) are fed to a 2-layer classifier. The 1st stage features are branched out, subsampled again and then concatenated to 2nd stage features.

1.2 Multi-Stage Features

Multi-Stage features (MS) are obtained by branching out the outputs of all stages into the classifier (Figure 3). They provide richer representations than Single-Stage features (SS) by adding complementary information, such as local textures and fine details, that is lost at higher levels. MS features have consistently improved performance in other work [4, 11, 9] and in this work as well (Figure 4). However, we observe minimal gains on this dataset compared to other types of objects such as pedestrians and traffic signs (Table 1). The likely explanation is that gains are correlated with the amount of texture and the multi-scale characteristics of the objects of interest.
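The branching in Figure 3 can be sketched as below: first-stage feature maps are subsampled once more and concatenated with the second-stage maps before reaching the classifier. The use of average pooling for the extra subsampling, the factor of 2, and the array shapes in the usage lines are assumptions chosen for illustration; they are not stated in the text.

```python
import numpy as np

def subsample(fmaps, factor):
    """Average-pool a (channels, height, width) array by an integer factor."""
    c, h, w = fmaps.shape
    h2, w2 = h // factor, w // factor
    trimmed = fmaps[:, :h2 * factor, :w2 * factor]
    return trimmed.reshape(c, h2, factor, w2, factor).mean(axis=(2, 4))

def multi_stage_features(stage1, stage2, factor=2):
    """Concatenate subsampled 1st-stage features with 2nd-stage features."""
    branch = subsample(stage1, factor)
    return np.concatenate([branch.ravel(), stage2.ravel()])

# Hypothetical shapes: 16 first-stage maps and 512 second-stage maps.
stage1 = np.zeros((16, 14, 14))
stage2 = np.ones((512, 4, 4))
features = multi_stage_features(stage1, stage2)
print(features.shape)  # (8976,)
```

A single-stage classifier would see only `stage2.ravel()`; the multi-stage variant appends the lower-level responses to the same input vector.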


Data Preparation. The SVHN classification dataset [8] contains 32x32 images with 3 color channels. The dataset is divided into three subsets: train set, extra set and test set. The extra set is a large set of easy samples and the train set is a smaller set of more difficult samples. Since we are given no information about how these images were sampled, we assume a random order when constructing our validation set. We compose our validation set with 2/3 from training samples (400 per class) and 1/3 from extra samples (200 per class), yielding a total of 6000 samples. This distribution allows us to measure success on easy samples but puts more emphasis on difficult ones.
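A per-class draw of this kind might look like the following sketch, which picks 400 indices per class from the train labels and 200 per class from the extra labels. The label arrays here are synthetic stand-ins, and the helper name `sample_per_class` and the random seed are assumptions; how the authors actually selected the samples is not described.

```python
import numpy as np

def sample_per_class(labels, per_class, rng):
    """Return indices of `per_class` randomly chosen samples for each label."""
    chosen = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        chosen.append(rng.choice(idx, size=per_class, replace=False))
    return np.concatenate(chosen)

rng = np.random.default_rng(0)
# Synthetic stand-ins for the real label arrays: 10 classes, 2000 samples each.
train_labels = np.repeat(np.arange(10), 2000)
extra_labels = np.repeat(np.arange(10), 2000)

val_from_train = sample_per_class(train_labels, 400, rng)
val_from_extra = sample_per_class(extra_labels, 200, rng)
print(len(val_from_train) + len(val_from_extra))  # 6000
```

The selected indices would then be removed from the train and extra sets so that validation samples are never used for training.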

Samples are pre-processed with a local contrast normalization (with a 7x7 kernel) on the Y channel of the YUV space followed by a global contrast normalization over each channel. No sample distortions were used to improve invariance.
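One possible reading of this pipeline is sketched below: convert RGB to YUV, apply a 7x7 local contrast normalization to the Y channel, then standardize each channel globally. The BT.601 conversion matrix, the uniform (rather than weighted) 7x7 window, the particular subtractive-then-divisive form of the local normalization, and the `eps` constants are all assumptions; the text does not specify them.

```python
import numpy as np

# Standard BT.601 RGB -> YUV matrix (assumed; the text only names the YUV space).
RGB_TO_YUV = np.array([[0.299, 0.587, 0.114],
                       [-0.14713, -0.28886, 0.436],
                       [0.615, -0.51499, -0.10001]])

def box_filter(img, size):
    """Mean over each size x size neighborhood, with edge-replicated borders."""
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (size * size)

def local_contrast_normalize(channel, size=7, eps=1e-4):
    """Subtract the local mean, then divide by the (floored) local deviation."""
    centered = channel - box_filter(channel, size)
    local_std = np.sqrt(box_filter(centered ** 2, size))
    return centered / np.maximum(local_std, max(local_std.mean(), eps))

def global_contrast_normalize(channel, eps=1e-8):
    """Zero mean and unit standard deviation over the whole channel."""
    return (channel - channel.mean()) / (channel.std() + eps)

def preprocess(rgb):
    """rgb: float array of shape (height, width, 3)."""
    yuv = rgb @ RGB_TO_YUV.T
    yuv[..., 0] = local_contrast_normalize(yuv[..., 0])
    for c in range(3):
        yuv[..., c] = global_contrast_normalize(yuv[..., c])
    return yuv
```

Calling `preprocess` on a 32x32x3 float image returns an array of the same shape whose channels each have approximately zero mean and unit variance.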

Figure 4: Improvement of Multi-Stage features (MS) over Single-Stage features (SS) in error rate on the validation set. MS features provide a slight error improvement over SS features.

1.3 Architecture Details

The ConvNet has 2 stages of feature extraction and a two-layer non-linear classifier. The first convolution layer produces 16 features with 5x5 convolution filters while the second convolution layer outputs 512 features with 7x7 filters. The output to the classifier also includes inputs from the first layer, which provides local features/motifs to reinforce the global features. The classifier is a 2-layer non-linear classifier with 20 hidden units. Hyper-parameters such as learning rate, regularization constant and learning rate decay were tuned on the validation set. We use stochastic gradient descent as our optimization method and shuffle our dataset after each training iteration.
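The optimization loop described above (stochastic gradient descent, reshuffling after every pass over the data, a decaying learning rate and a regularization constant) can be illustrated on a toy linear model. Everything specific here is an assumption for the sake of a runnable example: the least-squares loss, the `lr / (1 + lr_decay * step)` schedule, the L2 penalty and the default hyper-parameter values are not the paper's settings.

```python
import numpy as np

def sgd_train(x, y, epochs=50, lr=0.05, lr_decay=0.01, weight_decay=1e-4, seed=0):
    """Fit a linear least-squares model with plain SGD on single samples."""
    rng = np.random.default_rng(seed)
    w = np.zeros(x.shape[1])
    step = 0
    for _ in range(epochs):
        # Visit the samples in a fresh random order on every pass.
        for i in rng.permutation(len(x)):
            rate = lr / (1.0 + lr_decay * step)      # assumed decay schedule
            grad = (x[i] @ w - y[i]) * x[i] + weight_decay * w
            w -= rate * grad
            step += 1
    return w

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
w = sgd_train(x, x @ true_w)
print(np.round(w, 2))
```

In practice `lr`, `lr_decay` and `weight_decay` would be the quantities tuned on the validation set, with the ConvNet's parameters and loss in place of the toy model.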

Figure 5: Error rate of Lp-pooling on the validation set for several values of $P$. These validation errors are reported after 1000 training epochs.

For the pooling layers, we compare Lp-pooling for different values of $P$ on the validation set and use the best performing pooling for the final testing. The performance of different pooling methods on the validation set can be seen in Figure 5. Insights from [1] tell us that the optimal value of $P$ varies for different input spaces and that there is no single globally optimal value for $P$. The best-performing values for our validation data, along with the validation error rate of max-pooling ($P = \infty$), are shown in Figure 5.

Algorithm | SVHN-Test Accuracy
Binary Features (WDCH) | 63.3%
HOG | 85.0%
Stacked Sparse Auto-Encoders | 89.7%
K-Means | 90.6%
ConvNet / MS / Average | 90.75%
ConvNet / MS / L2 / Smaller training | 91.55%
ConvNet / SS / L2 | 94.28%
ConvNet / MS / L2 | 94.33%
ConvNet / MS / L12 | 94.76%
ConvNet / MS / L4 | 94.85%
Human Performance | 98.0%
Table 2: Performance reported by [8], with the addition of the supervised ConvNet results, which reach a state-of-the-art accuracy of 94.85%.

2 Results & Future Work

Our experiments demonstrate a clear advantage of Lp pooling on this dataset in validation (Figure 5) and test (average pooling is 3.58 points inferior to L2 pooling in Table 2). With L4 pooling, we obtain state-of-the-art performance on the test set with an accuracy of 94.85%, compared to the previous best of 90.6% (Table 2). We also show that using multi-stage features gives only a slight increase in performance compared to the increase seen in other vision applications.
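For reference, the two comparisons in this paragraph follow directly from the accuracies in Table 2. The worked arithmetic below is only a restatement of those table entries; expressing the gain as a relative error reduction is an interpretation, in the same spirit as the improvement column of Table 1.

```latex
% Average pooling vs. L2 pooling (multi-stage ConvNets), in accuracy points:
94.33 - 90.75 = 3.58

% Relative error reduction of L4 pooling over the previous best (K-Means):
\frac{(100 - 90.6) - (100 - 94.85)}{100 - 90.6}
  = \frac{9.40 - 5.15}{9.40}
  \approx 45.2\%
```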

Additionally, it is important to note that our approach is trained in a fully supervised manner only, whereas the best previous methods are unsupervised learning methods (k-means, auto-encoders). In the future, we plan to run experiments with unsupervised learning to measure the accuracy improvement that can be attributed to supervision. Figure 6 shows the validation samples with the highest energy. Many of these seem to exhibit large scale variations; future work could address this problem by introducing artificial scale deformations during training.

Figure 6: Preprocessed Y channel of validation samples with highest energy (i.e. highest error) with the 94.33% accuracy L2-pool based multi-stage ConvNet.


  • [1] Y. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in vision algorithms. In Proc. International Conference on Machine learning, 2010.
  • [2] D. C. Ciresan, U. Meier, J. Masci, and J. Schmidhuber. A committee of neural networks for traffic sign classification. In International Joint Conference on Neural Networks, pages 1918–1921, 2011.
  • [3] T. E. de Campos, B. R. Babu, and M. Varma. Character recognition in natural images. In Proceedings of the International Conference on Computer Vision Theory and Applications, Lisbon, Portugal, February 2009.
  • [4] J. Fan, W. Xu, Y. Wu, and Y. Gong. Human tracking using convolutional neural networks. Neural Networks, IEEE Transactions on, 21(10):1610–1623, 2010.
  • [5] A. Hyvärinen and U. Köster. Complex cell pooling and the statistics of natural images. In Computation in Neural Systems, 2005.
  • [6] K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In Proc. International Conference on Computer Vision and Pattern Recognition. IEEE, 2009.
  • [7] Y. LeCun and C. Cortes. The MNIST database of handwritten digits.
  • [8] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
  • [9] P. Sermanet, K. Kavukcuoglu, and Y. LeCun. Traffic signs and pedestrians vision with multi-scale convolutional networks. In Snowbird Machine Learning Workshop, 2011.
  • [10] P. Sermanet, K. Kavukcuoglu, and Y. LeCun. Eblearn: Open-source energy-based learning in C++. In Proc. International Conference on Tools with Artificial Intelligence. IEEE, 2009.
  • [11] P. Sermanet and Y. LeCun. Traffic sign recognition with multi-scale convolutional networks. In Proceedings of International Joint Conference on Neural Networks, 2011.
  • [12] E. P. Simoncelli and D. J. Heeger. A model of neuronal responses in visual area MT, 1997.
  • [13] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. The German Traffic Sign Recognition Benchmark: A multi-class classification competition. In IEEE International Joint Conference on Neural Networks, pages 1453–1460, 2011.
  • [14] T. Yamaguchi, Y. Nakano, M. Maruyama, H. Miyao, and T. Hananoi. Digit classification on signboards for telephone number recognition. In ICDAR, pages 359–363, 2003.
  • [15] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.