1 Introduction
The scattering transform, proposed in [8], is a cascade of wavelet convolutions and complex modulus nonlinearities, and can be seen as a convolutional neural network (CNN) with fixed, predetermined filters. This construction can be used to build representations with geometric invariants and has been shown to be stable to deformations. It has yielded impressive results on problems involving highly structured signals [4], outperforming a number of other classic signal processing techniques. Since scattering transforms can be seen as CNN instantiations, they have been studied as a mathematical model for understanding the impressive success of CNNs in image classification [4, 9]. As discussed in [4], first-order scattering coefficients are similar to SIFT descriptors [7], and higher-order scattering can provide insight into the information added with depth. Moreover, theoretical and empirical study of the information encoded in scattering networks indicates they often promote linear separability, which in turn leads to effective representations for downstream classification tasks [4, 10, 1, 6].
Scattering-based models have been shown useful in several applications involving scarcely annotated or limited labelled data [6, 13, 12]. Indeed, most breakthroughs in deep learning in general, and CNNs in particular, involve significant effort in collecting massive amounts of well-annotated data for training deep overparameterized networks. However, while big data have become increasingly prevalent, there are numerous applications (e.g., in biomedical and healthcare domains) where annotating more than a small number of samples is infeasible, which has given rise to increasing interest in small-sample learning tasks and deep learning approaches towards them
[3]. Recent work has shown that in image classification, state-of-the-art results can be achieved by hybrid networks that harness the scattering transform as their early layers, followed by learned layers based on a wide residual network architecture [12]. Here, we further advance this research avenue by proposing to use the scattering paradigm not only as fixed preprocessing layers in a concatenated architecture, but also as a parametric prior on filters learned in a CNN.

To formulate parametric scattering priors, we recall that the scattering construction is based on complex wavelets, generated from a mother wavelet via several operations, such as dilations and rotations, aimed to cover the frequency plane while having the capacity to encode informative variability in input signals [4]. Further, the discrete parametrization and indexing of these operations (i.e., by dilation scale or rotation angle) have traditionally been carefully constructed to ensure the resulting filter bank forms an efficient tight frame with well-established energy preservation properties. Here, we relax these constructions to allow data-driven learning (i.e., via backpropagation) of the wavelet parameters used in the scattering layers of hybrid architectures (code available at https://github.com/bentherien/ParametricScatteringNetworks), as discussed in Sec. 2. To our knowledge, this is the first work that aims to learn the wavelet filters of scattering networks, and more generally to pose strict parametric priors on filters learned in the early layers of convolutional networks, which have often been observed to resemble wavelets but are not explicitly parameterized as such. Our empirical study, described in Secs. 3-4, evaluates our parametric scattering approach on three datasets and demonstrates its advantages in limited labelled data settings.
2 Parametric Scattering Networks
To introduce parametric scattering networks, we first revisit the formulation of traditional scattering convolution networks in Sec. 2.1. Then, in Sec. 2.2, we introduce our parametric scattering transform and describe how to differentiate through it in order to learn its parameters with gradient descent approaches. Finally, Sec. 2.3 discusses scattering parameter initialization.
2.1 Scattering Networks
For simplicity, we focus here on 2D scattering networks up to their 2nd order. Subsequent orders can be computed by following the same iterative scheme, but have been shown to carry negligible energy [4]. Given a signal $x(u)$, where $u$ is the spatial position index, we compute the scattering coefficients $S_0x$, $S_1x$, $S_2x$ of order 0, 1, and 2, respectively. For an integer $J$, corresponding to the spatial scale of the scattering transform, and assuming an $n \times n$ signal input with one channel, the resulting feature maps are of size $n/2^J \times n/2^J$, with channel sizes varying with the scattering coefficients' order.
To calculate zeroth-order coefficients, we consider a low-pass filter $\phi_J$ with a spatial window of scale $2^J$, such as a Gaussian smoothing function. We then convolve this filter with the signal and downsample by a factor of $2^J$ to obtain $S_0x(u) = x \ast \phi_J(2^J u)$. Due to the low-pass filtering, high-frequency information is discarded here; it is recovered in higher-order coefficients via wavelets introduced in a filter bank.
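As an illustration, the zeroth-order computation can be sketched in a few lines of NumPy (a minimal sketch, not the paper's released code; the FFT-based convolution, the function name, and the window constant are our own implementation choices):

```python
import numpy as np

def zeroth_order(x, J, sigma0=0.8):
    """S0: smooth an n x n signal x with a Gaussian low-pass of scale
    2**J (applied in the Fourier domain), then subsample by 2**J."""
    n = x.shape[0]
    sigma = sigma0 * 2**J                      # spatial window scale
    k = 2 * np.pi * np.fft.fftfreq(n)          # discrete frequencies
    kx, ky = np.meshgrid(k, k, indexing="ij")
    phi_hat = np.exp(-(kx**2 + ky**2) * sigma**2 / 2)  # Gaussian transfer function
    smoothed = np.real(np.fft.ifft2(np.fft.fft2(x) * phi_hat))
    return smoothed[::2**J, ::2**J]            # downsample by 2**J
```

A constant image passes through unchanged (the low-pass has unit DC gain), while high-frequency content is strongly attenuated, illustrating why the higher-order coefficients are needed.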
Morlet wavelets are a typical example of filters used in conjunction with the scattering transform, and are defined as
$$\psi_{\sigma,\theta,\xi,\gamma}(u) = \left(e^{i\xi u_\theta} - \beta\right) e^{-\left(u_\theta^2 + \gamma^2 (u_\theta^\perp)^2\right)/(2\sigma^2)}, \qquad (1)$$
where $\beta$ is a normalization constant to ensure wavelets integrate to 0 over the spatial domain, $u_\theta = u_1\cos\theta + u_2\sin\theta$, and $u_\theta^\perp = -u_1\sin\theta + u_2\cos\theta$. The four parameters $(\sigma, \theta, \xi, \gamma)$ can be adjusted and are presented in Table 1. From one wavelet, a tight frame is obtained by dilating it by factors $2^j$, $0 \le j < J$, and rotating by $L$ angles $\theta$ equally spaced over the circle, to get $\psi_{j,\theta}$, which is then completed with the low-pass $\phi_J$. This can be written in terms of the parameters in Table 1 as $\sigma = 0.8 \cdot 2^j$, $\xi = \tfrac{3\pi}{4} \cdot 2^{-j}$, and $\gamma = 4/L$. By slight abuse of notation, we use $\psi_{j,\theta}$ here to denote such wavelets indexed by scale $j$ and angle $\theta$.
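To make Eq. (1) concrete, the following NumPy sketch samples a Morlet wavelet on a discrete grid (function names are hypothetical and ours; the zero-mean constant $\beta$ is computed numerically on the sampled grid rather than in closed form):

```python
import numpy as np

def morlet_2d(M, sigma, theta, xi, gamma):
    """Sample the Morlet wavelet of Eq. (1) on an M x M grid centered at 0."""
    u1, u2 = np.mgrid[0:M, 0:M] - M // 2
    u_par = u1 * np.cos(theta) + u2 * np.sin(theta)    # u_theta
    u_perp = -u1 * np.sin(theta) + u2 * np.cos(theta)  # u_theta^perp
    envelope = np.exp(-(u_par**2 + (gamma * u_perp)**2) / (2 * sigma**2))
    carrier = np.exp(1j * xi * u_par)
    beta = (carrier * envelope).sum() / envelope.sum()  # enforce zero mean
    return (carrier - beta) * envelope

def tight_frame_filters(M, J, L):
    """Morlet filter bank with the tight-frame parameters of Sec. 2.3."""
    return [morlet_2d(M, 0.8 * 2**j, np.pi * ell / L, 3 * np.pi / 4 / 2**j, 4 / L)
            for j in range(J) for ell in range(L)]
```

By construction, each sampled filter sums to (numerically) zero, and the bank contains $J \cdot L$ wavelets.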
First-order scattering coefficients are calculated by first convolving the input signal with one of the generated complex wavelets $\psi_{j_1,\theta_1}$ (i.e., indexed by the parameters in Table 1) and downsampling the resulting filtered signal by the scale factor $2^{j_1}$ of the chosen wavelet. Then, a pointwise complex modulus is used to add nonlinearity, and the resulting real signal is smoothed via the low-pass filter $\phi_J$. Finally, another downsampling step is applied, this time by a factor of $2^{J-j_1}$, to obtain an optimally compressed output size. Mathematically, we have $S_1 x(u, j_1, \theta_1) = \left(|x \ast \psi_{j_1,\theta_1}| \ast \phi_J\right)(2^J u)$.
The resulting feature map has $JL$ channels, based on the number of wavelets in the generated family.
Second-order coefficients are generated in a similar way, with the addition of another cascade of wavelet transform and modulus operator before the low-pass smoothing, i.e., $S_2 x(u, j_1, \theta_1, j_2, \theta_2) = \left(\left| |x \ast \psi_{j_1,\theta_1}| \ast \psi_{j_2,\theta_2} \right| \ast \phi_J\right)(2^J u)$.
Due to the interaction between the bandwidths and frequency supports of the first and second order, only coefficients with $j_1 < j_2$ have significant energy. Hence, the second order yields a feature map with $\tfrac{J(J-1)}{2} L^2$ channels.
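The output shape implied by the channel counts above can be summarized in a small helper (a hypothetical name, for illustration only):

```python
def scattering_shape(n, J, L):
    """Output shape (channels, height, width) of a 2nd-order scattering
    transform of an n x n single-channel input: 1 zeroth-order channel,
    J*L first-order channels, and J*(J-1)/2 * L**2 second-order channels
    (keeping only j1 < j2), each of spatial size n / 2**J."""
    channels = 1 + J * L + (J * (J - 1) // 2) * L**2
    return (channels, n // 2**J, n // 2**J)
```

For instance, with the common choice $J = 2$, $L = 8$ on 32x32 images, this gives 81 channels of spatial size 8x8.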
2.2 The Parametric Scattering Transform
While traditionally the wavelet filters are fixed to approximate a tight frame, we let the network learn the optimal parameters of each filter. In other words, we constrain our filters to always be Morlet wavelets by optimizing only the parameters in Table 1. To provide such data-driven optimization of scattering parameters, we show here that it is possible to backpropagate through this construction. Namely, we verify the differentiability of the construction by explicitly computing its partial derivatives with respect to these parameters.
First, for a complex value $z = z_r + i z_i$, the modulus $|z| = \sqrt{z_r^2 + z_i^2}$ is differentiable away from the origin, with $\partial|z|/\partial z_r = z_r/|z|$ and $\partial|z|/\partial z_i = z_i/|z|$. Next, we show the differentiation of the convolution with wavelets with respect to their parameters. For simplicity, we focus here on differentiation of the Gabor portion of the filter construction from Eq. 1 (it is not difficult to extend this derivation to Morlet wavelets, but the resulting expressions are rather cumbersome and left out for brevity), written as:
$$G(u) = e^{i\xi u_\theta}\, e^{-\left(u_\theta^2 + \gamma^2 (u_\theta^\perp)^2\right)/(2\sigma^2)}.$$
Its derivatives with respect to the parameters are
$$\frac{\partial G}{\partial \sigma} = \frac{u_\theta^2 + \gamma^2 (u_\theta^\perp)^2}{\sigma^3}\, G, \qquad \frac{\partial G}{\partial \xi} = i\, u_\theta\, G,$$
$$\frac{\partial G}{\partial \gamma} = -\frac{\gamma\, (u_\theta^\perp)^2}{\sigma^2}\, G, \qquad \frac{\partial G}{\partial \theta} = \left(i\xi\, u_\theta^\perp + \frac{(\gamma^2 - 1)\, u_\theta\, u_\theta^\perp}{\sigma^2}\right) G.$$
Finally, the derivative of the convolution with such filters is given by $\partial_\eta (x \ast G) = x \ast \partial_\eta G$, where $\eta$ is any of the filter parameters from Table 1. It is easy to verify that these derivations can be chained together to backpropagate through the scattering cascades defined in Sec. 2.1. We can now learn these parameters jointly with all others in an end-to-end differentiable architecture.
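As a sanity check, the $\theta$-derivative above can be verified numerically against a central finite difference (a small self-contained sketch; the function names are ours):

```python
import numpy as np

def gabor(u1, u2, sigma, theta, xi, gamma):
    """Gabor portion of Eq. (1) evaluated at the point (u1, u2)."""
    u_par = u1 * np.cos(theta) + u2 * np.sin(theta)
    u_perp = -u1 * np.sin(theta) + u2 * np.cos(theta)
    return np.exp(1j * xi * u_par - (u_par**2 + (gamma * u_perp)**2) / (2 * sigma**2))

def dgabor_dtheta(u1, u2, sigma, theta, xi, gamma):
    """Analytic derivative of the Gabor filter with respect to theta."""
    u_par = u1 * np.cos(theta) + u2 * np.sin(theta)
    u_perp = -u1 * np.sin(theta) + u2 * np.cos(theta)
    G = gabor(u1, u2, sigma, theta, xi, gamma)
    return (1j * xi * u_perp + (gamma**2 - 1) * u_par * u_perp / sigma**2) * G

# Central finite difference at an arbitrary point and parameter setting.
eps = 1e-6
u1, u2, sigma, theta, xi, gamma = 1.3, -0.7, 2.0, 0.5, 1.2, 1.5
fd = (gabor(u1, u2, sigma, theta + eps, xi, gamma)
      - gabor(u1, u2, sigma, theta - eps, xi, gamma)) / (2 * eps)
```

The same check applies to the $\sigma$, $\xi$, and $\gamma$ derivatives; in an automatic-differentiation framework these gradients are obtained for free once the filter construction is expressed in differentiable operations.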
2.3 Initialization
To evaluate the ability of data-driven learning to tune the scattering parameters, we consider two initializations and study their impact on resulting performance in both learned and non-learned settings. First, a tight-frame initialization follows common implementations of the scattering transform by setting $\sigma = 0.8 \cdot 2^j$, $\xi = \tfrac{3\pi}{4} \cdot 2^{-j}$, and $\gamma = 4/L$ for $j \in \{0, \ldots, J-1\}$, while for each $j$, we set the $L$ orientations $\theta$ to be equally spaced on $[0, \pi)$. Second, as an alternative, we consider a random initialization. That is, orientations $\theta$ are selected uniformly at random on the circle, the filter width $\sigma$ is selected using an exponential distribution across the available scales, and the spatial frequency $\xi$ is chosen to lie in the center of the feasible range between aliasing ($\xi = \pi$) and the fundamental frequency of the signal size ($2\pi/n$, where $n$ is the number of pixels). Finally, we select the slant variable $\gamma$ governing the aspect ratio to vary around the isotropic setting $\gamma = 1$, with a bias towards stronger orientation selectivity ($\gamma > 1$) compared to lesser orientation selectivity ($\gamma < 1$).

3 Experimental setup
To evaluate our approach, we consider an architecture inspired by [12], where scattering is combined with a Wide Residual Network [17] of depth 16 and width 8, henceforth denoted WRN; and a simpler one, denoted LL, where scattering is followed by a linear model, a commonly used evaluation for learned and designed representations [11, 12]. In both cases, we compare learned parametric scattering networks (LS) to fixed ones (S), yielding four combinations: LS+WRN, S+WRN, LS+LL, and S+LL. The WRN configurations provide a setting comparable to previous work on hybrid scattering (i.e., the S+WRN setting is equivalent to [12]) and give an estimate of the peak performance of parametric scattering networks. The LL configurations are used to evaluate the linear separability of the obtained scattering representations and have the added benefit of providing a more interpretable model.
When using learned and fixed scattering, we consider both random and tight-frame (TF) initializations, as discussed in Sec. 2.3. The fixed scattering models determined by the TF construction are equivalent to traditional scattering transforms. Together with the four aforementioned architectures, this yields eight scattering configurations. Finally, we also compare our approach to a fully learned WRN (with no scattering priors) applied directly to the input data, giving a total of nine reported models in Tables 2-4.
In all configurations, a batch normalization layer with learnable affine parameters is added after all scattering layers. The classification is done via a softmax layer yielding the final output. All models are trained using cross-entropy loss, minimized by stochastic gradient descent with momentum of 0.9. Weight decay is applied to the linear model and to the WRN. The learning rate is scheduled according to the one-cycle policy [14], which improves convergence during optimization, especially in the small data regime, due to its so-called super-convergence (see [14] for further details); the scheduler's div factor is always set to 25, while the maximum learning rate is tuned.

4 Results
Our empirical evaluations are based on three image datasets, illustrated in Fig. 2. Following the evaluation protocol of [12], we subsample each dataset at various sample sizes in order to showcase the performance of scattering-based architectures in the small data regime. CIFAR-10 and KTH-TIPS2 are natural image and texture recognition datasets, respectively. They are often used as general-purpose benchmarks in similar image analysis settings [2, 13]. COVIDx CRX-2 is a dataset of X-ray scans for COVID-19 diagnosis; its use here demonstrates the viability of our parametric scattering approach in practice, e.g., in medical imaging applications.
To obtain comparable and reproducible results, we control for deterministic GPU behaviour and ensure that each model is initialized the same way for the same seed. Furthermore, we use the same set of seeds for models evaluated on the same number of samples. For instance, the TF-initialized learnable hybrid with a linear model is evaluated on the same ten seeds as the fixed tight-frame hybrid with a linear model when trained on 100 samples of CIFAR-10. Some variability is inevitable when subsampling datasets; hence, all our figures include averages and standard error calculated over the different seeds.
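The per-seed aggregation just described can be sketched with a trivial helper (ours, mirroring how a mean ± standard error entry is computed from one configuration's runs):

```python
import math

def mean_and_stderr(accuracies):
    """Mean and standard error over per-seed accuracies of one model
    configuration, as reported in the tables and figures."""
    n = len(accuracies)
    mean = sum(accuracies) / n
    var = sum((a - mean) ** 2 for a in accuracies) / (n - 1)  # sample variance
    return mean, math.sqrt(var / n)  # standard error = std / sqrt(n)
```

For example, the ten seeds of a 100-sample CIFAR-10 run would be summarized by a single mean ± standard error entry.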
4.1 CIFAR-10
CIFAR-10 is a popular benchmark in computer vision, consisting of 60,000 images of size 32×32 from ten classes. The train set contains 50,000 class-balanced samples, while the test set contains the remaining images. In our experiments, we train on a small random subset of the training data, but always test on the entire test set, as per [12]. The training set is augmented with horizontal flipping, random cropping, and the pre-specified AutoAugment [5] policy for CIFAR-10; we used [5] to showcase the best possible small-sample results.

Table 2 reports the evaluation of our learnable scattering approach on CIFAR-10 with training sample sizes of 100, 500, 1K, and 50K. The hybrid linear models were trained using a max learning rate of 0.06 and a div factor of 25 for all parameters, on 5K, 1K, 500, and 500 epochs for 100, 500, 1K, and 50K samples, respectively. The hybrid WRN models were trained using a max learning rate of 0.1 and a div factor of 25, on 3K, 2K, 1K, and 200 epochs for 100, 500, 1K, and 50K samples, respectively. As shown in Table
2, TF-initialized models outperform others in limited sample settings. Moreover, randomly-initialized fixed scattering performs the worst, while randomly-initialized learned models yield performance gains relative to TF ones as the sample size increases.

Among the linear models, our TF-initialized learnable scattering one significantly outperforms all others in few-sample settings. This demonstrates that learnable scattering networks obtain a more linearly separable representation than their fixed counterparts, perhaps by building greater dataset-specific intra-class invariance. Moreover, in the very small sample regime of 100 training samples, the performance of the linear model on learned scattering rivals that of a highly nonlinear learned CNN. Interestingly, when comparing TF learnable to fixed, we observe that the relative performance gain of learning increases from 100 samples to 500 samples, yet remains relatively constant across the 500, 1K, and 50K sample settings, indicating that perhaps 500 samples are sufficient to tune a scattering representation when starting from the TF initialization. In contrast, randomly-initialized learnable scattering only improves beyond the fixed tight frame after 500 samples, and only achieves performance similar to TF learnable when trained on the whole dataset. These results suggest that the TF initialization, derived from rigorous signal processing principles, is empirically beneficial as a starting point in the very-few-sample regime, but can be improved upon by learning. Figure 1 shows the real part of the wavelet filters before and after optimization on the entire training set. The scales and slants of TF-initialized filters change substantially, but their orientations remain approximately the same. For random initialization, all filters undergo major transformations in orientation, scale, and slant.
Among the WRN hybrid models, the TF-initialized learnable scattering network performs best. However, the lone WRN outperforms our model in the 1,000-sample and 50,000-sample settings. The relative performance gains of TF learnable over TF fixed are much smaller here when compared to the linear hybrid model. We interpret this as an indication that some performance was gained from the ability of learnable scattering to linearly separate data when no other source of nonlinear processing is present in the model. Nonetheless, TF learnable still improves over TF fixed when paired with a WRN, indicating that some loss of information in the fixed scattering representation is mitigated by data-driven tuning. Note that S+WRN corresponds to the setting of [12]; our higher results in this setting are due to the use of more data augmentation. We confirmed that when using the latter's data augmentation, the performance gains for LS+WRN are similar; for example, in the 1,000-sample case with the augmentation of [12], learnable scattering retains a comparable advantage over fixed scattering, averaged over 8 runs.
4.2 COVIDx CRX-2
COVIDx CRX-2 is a two-class (positive and negative) dataset of chest X-ray images of COVID-19 patients [16]. The train set contains 15,951 unbalanced images, while the test set is class-balanced between positive and negative scans. In our experiments, we always train on a class-balanced subset of the training set. We resize the images and train our network on random crops; the only data augmentation we use is random horizontal flipping. All scattering networks use a spatial scale of J = 3.
Table 3 reports our evaluation on COVIDx CRX-2, with training on sample sizes of 100, 500, and 1K images, using the same protocol as for CIFAR-10. All linear models were trained for 400 epochs using a max learning rate of 0.06 for the linear layer parameters and 0.1 for the scattering parameters. The hybrid WRN models were trained for 200 epochs using a max learning rate of 0.1. For the linear model, we observe that the fixed tight-frame initialization performs slightly better than the learned configuration, although the error bars overlap in most settings. This suggests that the tight frame is already near-optimal for this data. We also observe that even random initialization can give reasonable results, exceeding the performance of CNNs on this data in the small-sample regime. When combined with a CNN, scattering yields the best performance for 100 and 1,000 samples. Similar to CIFAR-10, this benchmark exhibits the advantages of the TF initialization: as a whole, TF configurations exhibit higher accuracy than randomly-initialized ones, while in many cases they can still be improved upon by learning. Finally, we observe that the WRN alone performs worse than the other architectures, demonstrating the effectiveness of the scattering prior (with learnable or non-learnable parameters) in the small data regime.
4.3 KTH-TIPS2
KTH-TIPS2 is a texture dataset containing 4,752 images from 11 material classes. Each class is divided into four samples (108 images each) of different scales. Using the standard protocol, we train the model on one sample, while the rest are used for testing [15]. In total, each training set contains 1,188 images. We resize these images and train our network on random crops. The training data is augmented with random horizontal flips and random rotations. All scattering networks use a spatial scale of J = 3. We set the maximum learning rate of the scattering parameters to 0.1, while it is set to 0.001 for all other parameters. All hybrid models are trained with a mini-batch size of 128; the hybrid linear models are trained for 250 epochs, while the hybrid WRN models are trained for 150 epochs. We evaluate each model by training it with 4 different seeds on each sample, amounting to 16 total runs for each of the 9 models we evaluate.
Table 4 reports classification accuracies for KTH-TIPS2. Among the linear hybrid models, we observe that optimizing the scattering parameters improves performance. Indeed, the TF learnable model achieves the highest accuracy, while the random learnable model is a close second. Learning the filters' parameters brings a significant improvement over fixing them: 2.92% and 3.71% for TF and random initialization, respectively. We also see that the fixed randomly-initialized model performs the worst, showing that even poorly initialized filters can effectively be optimized. Altogether, these results further corroborate our previous findings: all filters can effectively be optimized, and beyond being valuable for fixed scattering networks, the tight-frame initialization also helps the learning process.
Out of all the WRN hybrid models, the random learnable model achieves the highest accuracy and is the only one to improve over its linear counterpart. This is contrary to our findings for CIFAR-10 and COVIDx CRX-2, namely that the tight frame is best and that the WRN always improves over the linear layer. We hypothesize that this is due to two factors: residual networks perform poorly at texture discrimination, and it may be difficult to escape the local minimum induced by the TF initialization. Indeed, while its relative performance in the 1,000-sample CIFAR-10 and X-ray settings (similar to the 1,188 samples used here) is very competitive, the lone WRN (i.e., without parametric scattering priors) performs extremely poorly relative to hybrid models on KTH-TIPS2, supporting our first hypothesis. Furthermore, while the TF initialization aids optimization on the other datasets, it fails to do so for textures. This may be because texture discrimination requires different invariance properties [4] than the other datasets, leading the TF-initialized networks to remain stuck at local minima, while the random learnable models are freer to roam the optimization landscape. As a final remark, we note that while the WRN increases performance in some cases, it also significantly increases the total number of parameters, exhibiting a trade-off between performance and model complexity.
5 Conclusion
This work demonstrates the competitive results obtained by learning Morlet wavelet filter parameters through differentiation of scattering networks. When trained on subsamples of CIFAR-10, learnable scattering with TF initialization improves performance for all architectures. On X-ray scans, the improvements are less significant, though our method is still the most accurate with the LS+WRN configuration and TF initialization, demonstrating the real-world viability of our approach. On texture data, learning the parameters yields significant improvement with both linear and WRN downstream layers. Overall, we show that, in small-sample settings, learned versions of the scattering transform yield significant performance gains over non-learned ones. Moreover, the standard tight frames are not always necessary to extract effective representations: even with randomly-initialized filters, our approach yields comparable or better results.
References
 [1] (2015) Joint time-frequency scattering for audio classification. In Proc. of MLSP, pp. 1–6. Cited by: §1.
 [2] (2021) Generative latent implicit conditional optimization when learning from small sample. In Proc. of ICPR, pp. 8584–8591. Cited by: §4.
 [3] (2020) Learning from few samples: a survey. Note: arXiv:2007.15484 Cited by: §1.
 [4] (2013) Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell. 35 (8), pp. 1872–1886. Cited by: §1, §1, §1, §2.1, §4.3.
 [5] (2019) AutoAugment: learning augmentation policies from data. In Proc. of CVPR, pp. 113–123. Cited by: §4.1.
 [6] (2018) Solid harmonic wavelet scattering for predictions of molecule properties. J. Chem. Phys. 148 (24), pp. 241732. Cited by: §1, §1.
 [7] (2004) Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60 (2), pp. 91–110. Cited by: §1.
 [8] (2012) Group invariant scattering. Commun. Pure Appl. Math 65 (10), pp. 1331–1398. Cited by: §1.
 [9] (2016) Understanding deep convolutional networks. Phil. Trans. of the Royal Society A 374 (2065), pp. 20150203. Cited by: §1.
 [10] (2017) Scaling the scattering transform: deep hybrid networks. In Proc. of ICCV, pp. 5618–5627. Cited by: §1.
 [11] (2015) Deep roto-translation scattering for object classification. In Proc. of CVPR, pp. 2865–2873. Cited by: §3.
 [12] (2018) Scattering networks for hybrid representation learning. IEEE Trans. Pattern Anal. Mach. Intell. 41 (9), pp. 2208–2221. Cited by: §1, §3, §4.1, §4.1, Table 2, §4.
 [13] (2013) Rotation, scaling and deformation invariant scattering for texture discrimination. In Proc. of CVPR, pp. 1233–1240. Cited by: §1, §4.
 [14] (2019) Super-convergence: very fast training of neural networks using large learning rates. In AI & ML for Multi-Domain Oper. App. Cited by: §3.
 [15] (2017) Locally-transferred Fisher vectors for texture classification. In Proc. of ICCV, pp. 4912–4920. Cited by: §4.3.
 [16] (2020) COVID-Net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images. Scientific Reports 10, pp. 19549. Cited by: §4.2.
 [17] (2016) Wide residual networks. In Proc. of BMVC, pp. 87.1–87.12. Cited by: §3.