1 Introduction
Supervised training of deep convolution networks (LeCun et al., 1998)
is clearly highly effective for image classification, as shown by results on ImageNet
(Krizhevsky et al., 2012). The first layers of the networks trained on ImageNet also perform very well to classify images in very different databases, which indicates that these layers capture generic image information
(Zeiler & Fergus, 2013; Girshick et al., 2013; Donahue et al., 2013). This paper shows that such generic properties can be captured by a scattering transform, which has the capability to build invariants to affine transformations. Scattering transforms compute hierarchical invariants along groups of transformations by cascading wavelet convolutions and modulus nonlinearities, along the group variables (Mallat, 2012).Invariant scattering transforms to translations (Bruna & Mallat, 2013) and rotationtranslations (Sifre & Mallat, 2012) have previously been applied to digit recognition and texture discrimination. The main source of variability of these images are due to deformations and stationary stochastic variability. This paper applies a scattering transform to CalTech101 and CalTech256 datasets, which include much more complex structural variability of objects and clutter, with good classification results. The scattering transform is adapted to the important source of variability of these images, by applying wavelet transforms along spatial, rotation, and scaling variables, with separable convolutions which is computationally efficient and by considering YUV color channels independently.
It differs from most deep networks in two respects: no ad hoc renormalization is added within the network, and no max pooling is performed. All poolings are average poolings, which guarantees the mathematical stability of the representation. When applied to wavelet coefficients, average pooling does not degrade the quality of results, even in cluttered environments. This study concentrates on two layers because adding a third layer of wavelet coefficients did not reduce classification errors. Beyond the first two layers, which take care of translation, rotation, and scaling variability, it seems necessary to learn the filters of the third and subsequent layers to improve classification performance.
2 Scattering along Translations, Rotations and Scales
A two-layer scattering transform is computed by cascading wavelet transforms and modulus nonlinearities. The first wavelet transform filters the image x with a low-pass filter and with complex wavelets which are scaled and rotated. The low-pass filter outputs an averaged image S_0 x, and the modulus of each complex wavelet coefficient defines the first scattering layer U_1 x. A second wavelet transform applied to U_1 x computes an average S_1 x and the next layer U_2 x. A final averaging computes second-order scattering coefficients S_2 x, as illustrated in Figure 1. Higher-order scattering coefficients are not computed.
The first wavelet transform is defined from a mother wavelet ψ, which is a complex Morlet function (Bruna & Mallat, 2013) well localized in the image plane, and a Gaussian low-pass filter φ. The wavelet is scaled by 2^j, where j is an integer or half-integer, and rotated by r_θ for θ = kπ/K with 0 ≤ k < K:

    ψ_{j,θ}(u) = 2^{-2j} ψ(2^{-j} r_{-θ} u),

where r_{-θ} is the rotation of angle -θ. The averaging filter is φ_J(u) = 2^{-2J} φ(2^{-J} u).
This wavelet transform first computes the average S_0 x = x ⋆ φ_J, and we compute the modulus of the complex wavelet coefficients:

    U_1 x(u, j, θ) = |x ⋆ ψ_{j,θ}(u)|.

The spatial variable u is subsampled by 2^j. We write v = (u, j, θ) the aggregated variable which indexes these first-layer coefficients.
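As a concrete sketch, the first layer can be implemented with FFT-based convolutions against discretized Morlet filters. The envelope width sigma and central frequency xi below are illustrative choices, not values specified in this paper.

```python
import numpy as np

def morlet_2d(N, sigma, xi, theta):
    """Complex Morlet wavelet on an N x N grid, oriented at angle theta.

    sigma (Gaussian envelope width) and xi (central frequency) are
    illustrative constants; the paper does not specify their values.
    """
    x, y = np.meshgrid(np.arange(N) - N // 2, np.arange(N) - N // 2,
                       indexing="ij")
    # Rotate coordinates so the oscillation lies along direction theta.
    xr = np.cos(theta) * x + np.sin(theta) * y
    env = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    wave = np.exp(1j * xi * xr)
    # Subtract the DC component so the wavelet has exactly zero mean.
    beta = (env * wave).sum() / env.sum()
    psi = env * (wave - beta)
    return psi / np.abs(psi).sum()

def first_layer(x, J=3, L=8):
    """U_1 x(u, j, theta) = |x * psi_{j,theta}(u)|, subsampled by 2^j."""
    N = x.shape[0]
    X = np.fft.fft2(x)
    coeffs = {}
    for j in range(J):
        for l in range(L):
            theta = np.pi * l / L
            psi = morlet_2d(N, sigma=0.8 * 2**j,
                            xi=3 * np.pi / (4 * 2**j), theta=theta)
            Psi = np.fft.fft2(np.fft.ifftshift(psi))
            u1 = np.abs(np.fft.ifft2(X * Psi))      # modulus nonlinearity
            coeffs[(j, l)] = u1[::2**j, ::2**j]     # subsample by 2^j
    return coeffs
```

The FFT implements the convolutions as periodic products in the Fourier domain, which is the standard trick for filter banks of this size.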
The next layer is computed with a second wavelet transform which convolves U_1 x with separable wavelets along the spatial, rotation, and scale variables:

    ψ_q(u, θ, j) = ψ_{j2,θ2}(u) ψ̄_{k2}(θ) ψ̄_{l2}(j).

The index q = (j2, θ2, k2, l2) specifies the angle of rotation θ2 and the scales j2, k2, and l2 of these wavelets. We choose this wavelet family so that it defines a tight frame, and hence an invertible linear operator which preserves the norm. The wavelet family in angle and scale also includes the necessary averaging filters. The next layer of coefficients is defined for each v and q by

    U_2 x(v, q) = |U_1 x ⋆ ψ_q(v)|.
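Because the wavelets are separable, the second wavelet transform can be applied one variable at a time. The sketch below shows only the factor acting along the rotation variable, using a hypothetical Fourier-mode angular wavelet rather than the paper's exact filter; the spatial and scale factors would be applied in the same separable fashion.

```python
import numpy as np

def angular_wavelet(L, k):
    """1-D complex filter on the L rotation angles (periodic), frequency
    index k. A hypothetical Fourier-mode construction for illustration."""
    t = np.arange(L)
    return np.exp(2j * np.pi * k * t / L) / L

def separable_second_layer(u1, k):
    """Convolve U_1 x along its angle axis only: one separable factor of
    the second wavelet transform (spatial and scale factors omitted).

    u1: array of shape (L, H, W) holding |x * psi_{j,theta}| for fixed j.
    """
    h = angular_wavelet(u1.shape[0], k)
    # Circular convolution along theta via the FFT, since the rotation
    # variable is periodic; modulus gives the next-layer coefficients.
    U1 = np.fft.fft(u1, axis=0)
    H = np.fft.fft(h)
    return np.abs(np.fft.ifft(U1 * H[:, None, None], axis=0))
```

For k = 0 the filter reduces to the averaging along angles mentioned above, which is why the family also contains the necessary averaging filters.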
First-order scattering coefficients are given by S_1 x = U_1 x ⋆ φ_J, and the second-order scattering coefficients S_2 x = U_2 x ⋆ φ_J are computed with only a low-pass filtering of the second layer. As opposed to almost all deep networks (Krizhevsky et al., 2012), we found that no nonlinear normalization was needed to obtain good classification results with a scattering transform. As usual, at the classification stage, all coefficients are standardized by setting their variance to 1. A locally translation-invariant scattering representation is obtained by concatenating the scattering coefficients of different orders:

    S x = (S_0 x, S_1 x, S_2 x).

Each spatial averaging by φ_J is subsampled at intervals 2^J.
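The spatial averaging followed by subsampling at intervals 2^J amounts to an average pooling over non-overlapping windows. A minimal sketch, using a box average as a stand-in for the Gaussian φ_J:

```python
import numpy as np

def average_pool(u, J):
    """Average pooling at scale 2^J, subsampled at intervals 2^J.

    A box average is used here as a stand-in for the Gaussian phi_J;
    the subsampling pattern is the same.
    """
    s = 2**J
    H, W = u.shape[0] // s, u.shape[1] // s
    # Crop to a multiple of the window size, then average each block.
    return u[:H * s, :W * s].reshape(H, s, W, s).mean(axis=(1, 3))
```

This operator is applied to every scattering output (S_0 x, S_1 x, and S_2 x) before classification.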
3 Numerical Classification Results
The classification performance of this two-layer wavelet scattering representation is evaluated on the Caltech databases. All images are first renormalized to a fixed square size by a linear rescaling. Each YUV channel of each image is processed separately and their scattering coefficients are concatenated. The first wavelet transform is computed with Morlet wavelets (Bruna & Mallat, 2013) over J octaves, with angles θ = kπ/K. The second wavelet transform is also computed with Morlet wavelets, over a range of spatial scales and angles. We also use a Morlet wavelet over the angle variable θ, calculated over several octaves. In this implementation, we did not use a wavelet along the scale variable j.
The final scattering coefficients are computed with a spatial average pooling at the scale 2^J, as opposed to the maxima selection used in most convolutional networks. These coefficients are renormalized by a standardization which subtracts their mean and sets their variance to 1. The mean and variance are computed on the training databases. Standardized scattering coefficients are then provided to a linear SVM classifier.
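The standardization step can be sketched as follows, with mean and variance estimated on the training set only; the standardized features would then be fed to any linear SVM implementation.

```python
import numpy as np

def standardize(train_feats, test_feats, eps=1e-8):
    """Subtract the mean and set unit variance per scattering coefficient.

    Statistics are computed on the training set only, as in the paper;
    eps guards against constant (zero-variance) features.
    """
    mu = train_feats.mean(axis=0)
    sd = train_feats.std(axis=0) + eps
    return (train_feats - mu) / sd, (test_feats - mu) / sd
```

Standardizing with training-set statistics avoids any leakage of test-set information into the representation.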
Table 1: Classification accuracy (%) on Caltech101 and Caltech256.

    Dataset         Layers   Calt.101      Calt.256
    Scattering        1      51.2 ± 0.8    19.3 ± 0.2
    ImageNet CNN      1      44.8 ± 0.7    24.6 ± 0.4
    Scattering        2      68.8 ± 0.5    34.6 ± 0.2
    ImageNet CNN      2      66.2 ± 0.5    39.6 ± 0.3
    ImageNet CNN      3      72.3 ± 0.4    46.0 ± 0.3
    ImageNet CNN      7      85.5 ± 0.4    72.6 ± 0.2
Nearly state-of-the-art classification results are obtained on Caltech101 and Caltech256 with a ConvNet (Zeiler & Fergus, 2013) pretrained on ImageNet. Table 1 shows that with 7 layers it reaches 85.5% accuracy on Caltech101 and 72.6% accuracy on Caltech256, using respectively 30 and 60 training images. The classification is performed with a linear SVM. In this work, we concentrate on the first two layers. With only two layers, the ConvNet performance drops to 66.2% on Caltech101 and 39.6% on Caltech256, and progressively increases as the number of layers increases. A scattering transform has similar performance to a ConvNet when restricted to 1 and 2 layers, as shown by Table 1. This indicates that major sources of classification improvement over these first two layers can be obtained with wavelet convolutions over spatial variables on the first layer, and over joint spatial and rotation variables on the second layer. Color improves our results on Caltech101 by 1.5%, but it has not yet been tried on Caltech256, whose results are given with gray-level images. Further improvements can potentially be obtained by adjusting the wavelet filtering along scale variables.
Locality-constrained Linear Coding (LLC) (Wang et al., 2010) is an example of a different two-layer architecture, with a first layer computing SIFT feature vectors and a second layer which uses an unsupervised dictionary optimization. This algorithm performs a max-pooling followed by an SVM. LLC yields accuracies up to 73.4% on Caltech101 and 47.7% on Caltech256 (Wang et al., 2010). These results are better than the ones obtained with fixed architectures such as the one presented in this paper, because the second layer performs an unsupervised optimization adapted to the dataset.

In this work we observed that average pooling can provide competitive and even better results than max-pooling. According to some analysis and numerical experiments performed on sparse features (Boureau et al., 2010), max-pooling should perform better than average pooling. Caltech images are piecewise regular, so we do have sparse wavelet coefficients. It however seems that using the modulus of complex wavelet coefficients improves the results obtained with average pooling compared to max-pooling. When using real wavelets and an absolute value nonlinearity, average and max pooling achieve similar performance. The max-pooling was implemented with and without overlapping windows. Using square windows, max-pooling obtained 60.5% accuracy on Caltech101 without overlapping windows and 62.0% with overlapping windows. This accuracy is well below the average pooling results presented in Table 1. With real Haar wavelets, we observed that max-pooling and average pooling perform similarly, with respectively 66.9% and 67.0% accuracy on Caltech101. These different behaviors with real and complex wavelets are still not well understood.
4 Conclusion
We showed that a two-layer scattering convolution network, which involves no learning, provides accuracy on the Caltech databases similar to that of a two-layer neural network pretrained on ImageNet. This scattering transform linearizes variability relative to translations and rotations, and provides invariance to translations with an average pooling. It involves no inner renormalization, only a standardization of output coefficients.

Many further improvements can still be brought to these first scattering layers, in particular by optimizing the scaling invariance. These preliminary experiments indicate that wavelet scattering transforms provide a good approach to understanding the first two layers of convolutional networks for complex image classification, and an efficient initialization of these two layers.
References

Boureau, Y-L., Ponce, J., and LeCun, Y. A theoretical analysis of feature pooling in visual recognition. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 111–118, 2010.

Bruna, J. and Mallat, S. Invariant scattering convolution networks. IEEE Trans. on PAMI, 35(8):1872–1886, 2013.

Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.

Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv preprint arXiv:1311.2524, 2013.

Krizhevsky, A., Sutskever, I., and Hinton, G. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1106–1114, 2012.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Mallat, S. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.

Sifre, L. and Mallat, S. Combined scattering for rotation invariant texture analysis. In European Symposium on Artificial Neural Networks, 2012.

Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., and Gong, Y. Locality-constrained linear coding for image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3360–3367, 2010.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional neural networks. arXiv preprint arXiv:1311.2901, 2013.