The rise of Convolutional Neural Networks (CNNs) to success on large datasets such as ImageNet has prompted a myriad of follow-up work, much of which builds on their key depth-preserved transformation-equivariance property to achieve better classifiers [3, 4, 5]. Equivariance to transformations has thus been recognized as an important prerequisite for any classifier, and CNNs, which are by definition translation equivariant, have been recognized as a first important step in this direction.
An underlying requirement for a transformation-equivariant representation is the construction of transformed copies of filters; when the transformation is a translation, this operation becomes a convolution. A natural extension of this idea to general transformation groups led to Group-equivariant CNNs, where transformed copies of the filter weights are generated in the first layer. Subsequently, the application of group convolution ensures that the network stays equivariant to that transformation throughout.
However, there are certain issues pertaining to the application of any (spatial) transformation on a filter:
There is no prior on the spatial complexity of a convolutional filter within a CNN, which means that a considerable part of the filter space may contain filters that are not sensitive to the desired spatial transformation. Examples include rotation-symmetric filters and high-frequency filters.
To alleviate these issues, the use of a steerable filter basis for filter construction and learning was proposed. Steerable filters have the unique property that they can be transformed simply through linear combinations of an appropriate steerable filter basis. Importantly, the choice of the steerable basis allows one to control the transformation sensitivity of the final computed filter. For a circular harmonic basis in particular, filters of order $m$ are only sensitive to rotation shifts in the range $[0, 2\pi/m)$. In this case, higher-order filter responses show less sensitivity to input rotations, while simultaneously being of higher spatial frequency and complexity. Using a small basis of the first few filter orders enabled the authors to achieve state-of-the-art results on MNIST-Rot classification (with a small training-data size).
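The rotational steering property described above can be checked numerically. The sketch below is illustrative, not the implementation from the cited works: the Gaussian radial profile and the parameter values (`r0`, `sigma`) are arbitrary choices, and `circular_harmonic` is a hypothetical helper name. It verifies that rotating the coordinate grid of an order-$m$ circular harmonic by $\theta$ is equivalent to multiplying the filter values by $e^{-im\theta}$.

```python
import numpy as np

def circular_harmonic(x, y, m, r0=2.0, sigma=0.6):
    """Circular harmonic W_m(r, phi) = R(r) * exp(i*m*phi), with an
    (illustrative) Gaussian radial profile centred on radius r0."""
    r = np.hypot(x, y)
    phi = np.arctan2(y, x)
    radial = np.exp(-(r - r0) ** 2 / (2 * sigma ** 2))
    return radial * np.exp(1j * m * phi)

# Steering check: rotating the input coordinates by theta equals
# multiplying the filter values by exp(-i*m*theta).
rng = np.random.default_rng(0)
x, y = rng.normal(size=50), rng.normal(size=50)
m, theta = 2, 0.7
c, s = np.cos(theta), np.sin(theta)
xr, yr = c * x + s * y, -s * x + c * y   # coordinates rotated by -theta
rotated = circular_harmonic(xr, yr, m)
steered = circular_harmonic(x, y, m) * np.exp(-1j * m * theta)
print(np.allclose(rotated, steered))      # True
```

Because the steering factor $e^{-im\theta}$ is independent of position, no resampling or interpolation of the filter is needed, which is the property the steerable-basis approaches exploit.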
2 Contributions of this Work
Log-Radial Harmonics: A scale steerable basis
In this paper, we define filters which are steerable in their spatial scale, using a complex filter basis we denote as log-radial harmonics. Each kernel of the CNN is represented as the real part of a linear combination of the proposed basis filters, which contains filters of various orders, analogous to circular harmonics. Furthermore, the scale-steerable property permits exact scaling of a filter simply through a linear combination of learnt complex coefficients on the log-radial harmonics. The filter form is conjugate to that of the circular harmonics, with the choice of filter order having a direct impact on the scale sensitivity of the resulting filters.
Scale-Steered CNN (SS-CNN)
Using the log-radial harmonics as a complex steerable basis, we construct a locally scale-invariant CNN, where the filters in each convolution layer are a linear combination of the basis filters. To obtain filter responses across scales, each filter is simultaneously steered in its scale and size, and the resulting responses are max-pooled. We demonstrate accuracy improvements with the scale-steered CNN on datasets containing global (MNIST-Scale and FMNIST-Scale) and local (MNIST-Scale-Local; synthesized here) scale variations. Importantly, we find that on MNIST-Scale, the proposed SS-CNN achieves accuracy competitive with the Spatial Transformer Network, which, due to its global affine re-sampling, has a natural advantage in this task.
3 Related Work
Previous work with Local Scale Invariant/Equivariant CNNs
Scale-transformed weights were proposed in prior work, where they were observed to improve performance over a baseline CNN on MNIST-Scale. On the same dataset (with a 10k, 2k and 50k split), better performance was later observed by forwarding, in addition to the maximum filter response over a range of scales, the actual scale at which that response was obtained. In both works, weight scaling was only indirectly emulated: the input was scaled instead, and the convolution response was resized back to a fixed size for max-pooling across scales.
4 Background: Steerable Filters for Rotation
Rotation-steerable filters, in the form of circular harmonics, are of the form $W_m(r, \phi) = R(r)\,\Phi_m(\phi)$, expressed in polar co-ordinates. For circular harmonics, $R(r)$ is usually considered to be a Gaussian function centred on a particular radius, and $\Phi_m(\phi)$ is a complex function of unit norm, $\Phi_m(\phi) = e^{im\phi}$. Such a choice of $\Phi_m$ allows one to rotationally steer the filter by any angle $\theta$ just by a complex multiplication, $W_m(r, \phi - \theta) = e^{-im\theta}\, W_m(r, \phi)$. Furthermore, control over the rotational order $m$ allows one to directly control the rotational sensitivity of the resulting filter (whose magnitude is invariant to the filter rotation), and also simultaneously the spatial complexity of the filter.
5.1 Scale-steerable filters: Log-Radial Harmonics
Similar to the rotation-steerable circular harmonics, we can analogously construct a set of filters of the form $S_k(r, \phi) = R_k(r)\,\Phi(\phi)$. Since we now wish to steer the scale of the filter, it is the angular part $\Phi(\phi)$ that is of Gaussian form, whereas the radial phase is complex valued with unit norm, i.e. $e^{ik \log r}$. The proposed mathematical form of a scale-steerable filter of order $k$, centred on a particular angle $\phi_0$, is

$$S_k(r, \phi) \,=\, \frac{1}{r^{2}}\; e^{-\frac{d(\phi,\, \phi_0)^{2}}{2\sigma_\phi^{2}}}\; e^{i\left(k \log r \,+\, \psi\right)}, \qquad (1)$$
where $d(\phi, \phi_0) = \min(|\phi - \phi_0|,\, 2\pi - |\phi - \phi_0|)$ is the distance between the two angles $\phi$ and $\phi_0$, and $\psi$ is a phase offset. Example filters constructed using equation 1 are shown in Figure 1. When steering the above filter in scale, we find that a complex multiplication by $e^{-ik \log s}$ suffices, where $s$ is the scale factor change. This we prove in the following theorem.
Theorem 1. Given a circular input patch $I$ within a larger image, defined within the range $r \in [0, r_0]$, let $I_s$ denote the same patch when the image is scaled around the centre of the patch by a factor of $s$. We then have

$$I_s \star S_k \,=\, e^{ik \log s} \left( I \star S_k \right),$$

where $\star$ is the cross-correlation operator (in the continuous domain), and the correlation on the right-hand side is evaluated over the reduced radius $r_0/s$.
The proof of theorem 1 is shown in the appendix.
An immediate consequence of the above theorem is that for filters of unbounded radial support ($r_0 \to \infty$), the theorem assumes the simpler form $I_s \star S_k = e^{ik \log s}\,(I \star S_k)$, with no change of integration radius.
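The phase relationship underlying this steering can be checked numerically. The sketch below assumes only that the radial phase of an order-$k$ log-radial harmonic is $e^{ik \log r}$ (as in the definition above, with the amplitude and angular terms factored out): evaluating the phase at $r/s$ equals multiplying the phase at $r$ by $e^{-ik \log s}$.

```python
import numpy as np

# Minimal numeric check of the log-radial steering identity, assuming the
# radial phase of an order-k log-radial harmonic is exp(i*k*log r):
# evaluating it at r/s equals multiplying by exp(-i*k*log s).
k, s = 3, 1.5
r = np.linspace(0.1, 4.0, 100)
scaled  = np.exp(1j * k * np.log(r / s))
steered = np.exp(1j * k * np.log(r)) * np.exp(-1j * k * np.log(s))
print(np.allclose(scaled, steered))  # True
```

This is just $\log(r/s) = \log r - \log s$ pushed through the complex exponential, which is what makes scale shifts act as pure phase factors on the basis.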
A useful consequence of steerability is that any filter expressed as a linear combination (with complex coefficients) of the steerable basis is also steerable. Consider a filter of radius $r_0$ constructed in a similar fashion using the proposed scale-steerable basis $\{S_k\}$, s.t. $W = \sum_k w_k S_k$, where $w_k \in \mathbb{C}$. The same filter can be steered in its scale by a scale factor of $s$, giving

$$W_s \,=\, \sum_k w_k\, e^{-ik \log s}\, S_k. \qquad (2)$$
However, we want the filters to be real valued, and hence we only take the real part of $W$. Note that the equality in equation (2) holds for both the real and the imaginary parts on the two sides of the equation, and thus working with the real part of the filters does not change steerability. The result in Theorem 1 includes an additional change of radius from $r_0$ to $r_0/s$. This indicates that the pixel values of the steered filter are sampled across a circular region of radius $r_0/s$, which depends on the scale factor $s$. Finally, as noted in [10, 7], steerability and sampling are interchangeable; therefore the sampled versions of the scaled basis filters are the same as the scaled versions of the sampled filters.
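The claim that steering commutes with taking linear combinations can also be sketched numerically. The snippet below is a toy one-dimensional radial check, not the paper's code: the basis is reduced to assumed log-radial phases $e^{ik \log r}$, and the coefficients `w` are random stand-ins for the learnt complex coefficients. Absorbing the per-order phase factors into the coefficients gives the same real-valued filter as evaluating every basis function at the scaled radius $r/s$.

```python
import numpy as np

rng = np.random.default_rng(1)
r = np.linspace(0.1, 4.0, 64)
orders = np.array([1, 2, 3])
basis = np.exp(1j * orders[:, None] * np.log(r)[None, :])  # assumed log-radial phases
w = rng.normal(size=3) + 1j * rng.normal(size=3)           # stand-in complex coefficients

s = 1.4
steer = np.exp(-1j * orders * np.log(s))                   # per-order phase factors

# Steer by absorbing the phases into the coefficients...
f_steered = np.real((w * steer) @ basis)
# ...versus evaluating each basis function at the scaled radius r/s.
basis_scaled = np.exp(1j * orders[:, None] * np.log(r / s)[None, :])
f_direct = np.real(w @ basis_scaled)

print(np.allclose(f_steered, f_direct))  # True
```

This is why the network can learn only the coefficients $w_k$ and still produce exactly scaled filters at every desired scale.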
5.2 Scale-Invariant CNNs with Scale Steered Weights
Here we describe the Scale-Steered CNN (SS-CNN), which employs a scale-steerable filter basis in the computation of its filters. Figure 2 shows the proposed scale-invariant layer. Each filter within the scale-invariant layers is computed as a linear combination of the assigned scale-steerable basis $\{S_k\}$. The network directly learns only the complex coefficients $w_k$. At each scale-invariant layer, the scaled and resized versions of the filters are computed directly from the complex coefficients using equation 3. Only the maximum responses across all scales are channeled to the next layer, by max-pooling the responses across scales.
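The layer structure just described — correlate the input with steered copies of a filter at several scales, then max-pool over scales — can be sketched as follows. This is a minimal illustrative sketch, not the released SS-CNN code: `correlate2d_same` is a naive helper, and the Gaussian `blob` stands in for a filter produced by scale steering (equation 3).

```python
import numpy as np

def correlate2d_same(img, filt):
    """Naive 'same'-size 2D cross-correlation (illustrative helper)."""
    kh, kw = filt.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * filt)
    return out

def scale_invariant_response(img, filter_at_scale, scales):
    """Correlate with the filter steered to each scale, then max-pool
    the responses elementwise across scales."""
    responses = [correlate2d_same(img, filter_at_scale(s)) for s in scales]
    return np.max(np.stack(responses), axis=0)

# Toy usage with a hypothetical Gaussian blob standing in for a
# scale-steered filter:
def blob(s, size=9):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    return np.exp(-(xx ** 2 + yy ** 2) / (2 * (s * 1.5) ** 2))

img = np.zeros((16, 16))
img[8, 8] = 1.0
resp = scale_invariant_response(img, blob, scales=[0.8, 1.0, 1.25])
print(resp.shape)  # (16, 16)
```

Because only the per-scale maximum is forwarded, the layer's output is (approximately) unchanged when the input pattern appears at any of the covered scales, which is the local scale-invariance property the SS-CNN targets.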
First, to validate the proposed approach, datasets such as MNIST-Scale and FMNIST-Scale were chosen, which contain global scale variations. In addition, a dataset containing local scale variations was synthesized from MNIST. Subsequently, the filters and the activation maps within the SS-CNN are visualized. All experiments were run on an NVIDIA Titan GPU. The code has been released at https://github.com/rghosh92/SS-CNN.
6.1 Classification with SS-CNN
6.1.1 MNIST and FMNIST
The data partitioning protocol for MNIST-Scale is a 10k, 2k, and 50k split of the scaled version of original MNIST into training, validation, and testing data respectively. (A small training-data size is chosen so as to better evaluate the generalization abilities of the trained classifiers.) We use the same split ratio for creating FMNIST-Scale, with the same range of spatial scaling. No additional data augmentation was performed for any of the networks.
Global scale variations: MNIST and FMNIST
The SS-CNN is compared against competing approaches, including scale-equivariant vector fields and spatial transformer networks. (For the spatial transformer network, we use the network configuration that performs best on the validation data.) For a fair comparison, all networks used have a total of 3 convolutional layers and 2 fully connected layers, and the number of trainable parameters for all four networks was kept approximately the same. Means and standard deviations of accuracies are reported over 6 splits. (Note that although the input sizes for MNIST and FMNIST are similar, they contain very different kinds of data: MNIST is mainly white strokes on a black background, whereas FMNIST includes both shape and texture information in grayscale.)
Generalization to Distortions
Here we test and compare performance on MNIST-Scale with added elastic distortions. The networks are all trained on the undistorted MNIST-Scale, but tested on MNIST-Scale with added elastic deformations. Results are shown in Table 2. We record the performance of only the single best-performing network for each method.
Synthesized data: Local scale variations
We synthesize a variation of MNIST, namely MNIST-scale-local-2, with scale variations that are more local than those of MNIST-Scale. Pairs of MNIST examples were each scaled with a random scale factor and arranged side by side in a larger image, a small proportion of which contains overlapping examples. We choose only 10 possible combinations of digits, resulting in a total of 10 categories for the network. Means and standard deviations of accuracies are reported over 6 splits. Results are reported in Table 3. The results demonstrate the superior performance of local scale-invariance based methods over global transformation-estimation architectures such as spatial transformers, in a scenario where the data contains local scale variations.
[Table: accuracies at 1%, 10%, and 100% of the training data.]
6.2 Visualization Experiments
In this section we visualize the network filters and feature-map activations for two scale-invariant networks: our proposed SS-CNN and the LocScaleInv-CNN. Both networks were trained on MNIST-Scale. Figure 3 (a) shows a visual comparison of the layer-1 filters of the two networks. Notice that the scale-steered filters show considerably more structure and centrality, with interesting filter forms, some of them resembling oriented bars. Figure 3 (b) compares the average feature-map activation of layer 1 in response to different inputs. Notice that spatial structure is far better preserved in the SS-CNN responses (bottom row), with the digit outlines clearly distinguishable. This is partly due to the ingrained centrality of the scale-steered basis (the decaying radial amplitude term), which generates responses that are more structure preserving.
Building on the SS-CNN framework proposed in this work, we underline some important issues and considerations moving forward, and provide detailed explanations for some of the design choices used in this work.
Input Resizing vs Filter Scaling: For locally scale-invariant CNNs, the input is usually reshaped to a range of sizes, both smaller and larger than the original [1, 9]. Feature maps are obtained by convolving each resized input with an unchanged filter. Lastly, all the feature maps are reshaped back to a common size, beyond which only the maximum response across scales is channeled. This approach uses two rounds of reshaping, and is thus clearly prone to interpolation artifacts, especially if the filters are not smooth enough. The method proposed in this work only steers the filters in their scale and size, without having to rely on any interpolation operations. Note that a change of filter size just requires computing the filter values at the new locations using equations 1 and 3.
Filter Centrality: If the filters are not central, i.e. centred near their centre of mass (centre of mass here holds the same definition as in physics, with the "mass" element taken as the absolute value of the filter at a given location), then they pose the risk of entangling scale and translation information. This happens when the response of a filter to the input at a certain scale and location is the same as the response of the same filter at a different scale and a different location, which can be quite common for filters that have most of their "mass" away from their centre. Such entanglement can often lead to feature maps with distorted and over-smoothed spatial structure, as observed in Figure 3 (b) (top). This issue can be tackled to a certain extent by using filters which show centrality (Figure 3 (a)). As seen in equation 1, one can control the centrality of the steerable basis filters through the decaying radial amplitude term, and radially symmetric filters can be ensured through the choice of the angular term. Figure 1 shows the central nature of the steerable basis. Filter centrality is preserved in the subsequently generated filters, as seen in Figure 3 (a) (left), which shows the generated filters after training.
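The centre-of-mass notion used above is easy to compute for a sampled filter. The sketch below uses the definition stated in the text (|value| as the mass element), with a hypothetical helper name and illustrative Gaussian test filters; a centred, radially symmetric filter has its centre of mass at the geometric centre, while an off-centre one does not.

```python
import numpy as np

def centre_of_mass(filt):
    """Centre of mass of a filter, with |value| as the 'mass' element."""
    mass = np.abs(filt)
    ys, xs = np.mgrid[0:filt.shape[0], 0:filt.shape[1]]
    total = mass.sum()
    return (ys * mass).sum() / total, (xs * mass).sum() / total

# A radially symmetric filter is centred on its geometric centre (4, 4),
# while a shifted copy of it is not:
ax = np.arange(9) - 4
xx, yy = np.meshgrid(ax, ax)
central = np.exp(-(xx ** 2 + yy ** 2) / 4.0)
offset = np.roll(central, 2, axis=1)   # shift the blob 2 pixels right
print(centre_of_mass(central))  # ≈ (4.0, 4.0)
print(centre_of_mass(offset))   # x-coordinate shifted right of centre
```

A diagnostic like this could be used to monitor whether learnt filters remain central during training.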
Transformation Sensitivity: As stated in Section 1, an important yet partly overlooked aspect of using a steerable basis from the family of circular harmonics (or log-radial harmonics) is the ability to control the transformation sensitivity of the filters. For instance, circular harmonics beyond a certain order have a much smaller sensitivity to changes in input rotation. This is simply because each circular harmonic filter is invariant to discrete rotations of $2\pi/m$, $m$ being the filter order. Similarly, it is easily seen that each log-radial harmonic filter of order $k$ is invariant to being scaled by a scale factor of $e^{2\pi/k}$. Therefore, higher-order filters show considerably less transformation sensitivity. It is perhaps noteworthy that the 2D Fourier transform (or 2D DCT) basis functions can also be used as a steerable basis. In that case, higher-frequency (analogous to higher-order) filters are less sensitive to input translations than low-frequency filters. Therefore, in a certain sense, the circular harmonic and log-radial harmonic filter bases are natural extensions of the Fourier basis (translations) to other transformations (rotation and scale).
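Both discrete invariances mentioned above follow from the harmonic phases and can be verified in a couple of lines. The check below assumes the phase forms $e^{im\phi}$ and $e^{ik \log r}$ (as in the basis definitions): rotating by $2\pi/m$ adds a full $2\pi$ to the circular phase, and scaling by $e^{2\pi/k}$ adds a full $2\pi$ to the log-radial phase.

```python
import numpy as np

# Assuming harmonic phases exp(i*m*phi) and exp(i*k*log r): an order-m
# circular harmonic is unchanged by a rotation of 2*pi/m, and an order-k
# log-radial harmonic is unchanged by a scaling of exp(2*pi/k).
m, k = 4, 3
phi = np.linspace(-np.pi, np.pi, 50)
r = np.linspace(0.1, 4.0, 50)
rot_invariant = np.allclose(np.exp(1j * m * (phi + 2 * np.pi / m)),
                            np.exp(1j * m * phi))
scale_invariant = np.allclose(np.exp(1j * k * np.log(r * np.exp(2 * np.pi / k))),
                              np.exp(1j * k * np.log(r)))
print(rot_invariant, scale_invariant)  # True True
```

The higher the order, the smaller these invariance intervals become, which is precisely why high-order filters are less sensitive to the corresponding transformation.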
8 Conclusions and Future Work
A scale-steerable filter basis is proposed, which, along with the popular rotation-steerable circular harmonics, can help augment CNNs with a much higher degree of transformational weight-sharing. Experiments on multiple datasets showcasing global and local scale variations demonstrated the performance benefits of using scale-steered filters in a scale-invariant framework. Scale-steered filters are found to show heightened centrality and structure. A natural trajectory for this approach is to incorporate the scale-steering paradigm into equivariant architectures such as G-CNNs.
This research was supported by DSO National Laboratories, Singapore (grant no. R-719-000-029-592). We thank Dr. Loo Nin Teow and Dr. How Khee Yin for helpful discussions. We also thank Dr. Diego Marcos for sharing the code for scale-vector fields, and for clarifying a number of other queries related to the task.
-  A. Kanazawa, A. Sharma, and D. W. Jacobs, “Locally scale-invariant convolutional neural networks,” Deep Learning and Representation Learning Workshop: Conference on Neural Information Processing Systems (NIPS), vol. abs/1412.5104, 2014.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25 (NIPS), pp. 1097–1105, Curran Associates, Inc., 2012.
-  T. S. Cohen and M. Welling, “Group equivariant convolutional networks,” International Conference on Machine Learning (ICML), vol. abs/1602.07576, 2016.
-  M. Weiler, F. A. Hamprecht, and M. Storath, “Learning steerable filters for rotation equivariant cnns,” Conference on Computer Vision and Pattern Recognition (CVPR), vol. abs/1711.07289, 2018.
-  D. Marcos, M. Volpi, N. Komodakis, and D. Tuia, “Rotation equivariant vector field networks,” International Conference on Computer Vision (ICCV), vol. abs/1612.09346, 2016.
-  T. S. Cohen and M. Welling, “Steerable cnns,” International Conference on Learning Representations (ICLR), vol. abs/1612.08498, 2016.
-  D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow, “Harmonic networks: Deep translation and rotation equivariance,” Conference on Computer Vision and Pattern Recognition (CVPR), vol. abs/1612.04642, 2016.
-  M. Jaderberg, K. Simonyan, A. Zisserman, and k. kavukcuoglu, “Spatial transformer networks,” in Advances in Neural Information Processing Systems 28 (NIPS), pp. 2017–2025, Curran Associates, Inc., 2015.
-  D. Marcos, B. Kellenberger, S. Lobry, and D. Tuia, “Scale equivariance in cnns with vector fields,” International Conference on Machine Learning (ICML) Workshop: Towards learning with limited labels: Equivariance, Invariance and Beyond, vol. abs/1807.11783, 2018.
-  W. T. Freeman and E. H. Adelson, “The design and use of steerable filters,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, pp. 891–906, Sep. 1991.
-  O. Rippel, J. Snoek, and R. P. Adams, “Spectral representations for convolutional neural networks,” in Advances in Neural Information Processing Systems 28 (NIPS) (C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds.), pp. 2449–2457, Curran Associates, Inc., 2015.
Appendix A Proof of Theorem 1
First, note that in the log-polar domain, $dx\,dy = r^{2}\, d(\log r)\, d\phi$. Using this fact, the cross-correlation can be expressed in log-polar co-ordinates as the integration

$$I_s \star S_k \,=\, \int_{-\pi}^{\pi} \int_{\log r \,\le\, \log r_0} I_s(r, \phi)\, S_k(r, \phi)\; r^{2}\, d(\log r)\, d\phi,$$

where $I_s(r, \phi) = I(r/s, \phi)$. A change of variables from $\log r$ to $\log u$, where $u = r/s$, yields

$$I_s \star S_k \,=\, \int_{-\pi}^{\pi} \int_{\log u \,\le\, \log (r_0/s)} I(u, \phi)\, S_k(su, \phi)\; s^{2} u^{2}\, d(\log u)\, d\phi.$$

From the definition of the steerable filter basis (equation 1), we have that $S_k(su, \phi) = s^{-2}\, e^{ik\log s}\, S_k(u, \phi)$. Thus the integration can be further simplified as

$$I_s \star S_k \,=\, e^{ik\log s} \int_{-\pi}^{\pi} \int_{\log u \,\le\, \log (r_0/s)} I(u, \phi)\, S_k(u, \phi)\; u^{2}\, d(\log u)\, d\phi \,=\, e^{ik\log s}\, \big( I \star S_k \big)\Big|_{r_0/s}.$$
This completes the proof. ∎
Appendix B Steerable Basis Parameters
The definition of each log-radial harmonic filter includes a total of four parameters: the phase ($\psi$), the filter order ($k$), the filter orientation ($\phi_0$), and the orientation spread ($\sigma_\phi$). For all networks trained in this work using scale-steered filters, we keep these parameter sets fixed. This configuration of the steerable basis space leads to a total of 24 log-radial harmonics as the steerable basis. Thus, each scale-steerable filter has $2 \times 24 = 48$ trainable parameters (due to the real and imaginary components of each coefficient). One additional aspect of note is the $\log r$ term in the complex exponential in the filter definition. Since the filter is undefined at $r = 0$, we enforce $S_k(r, \phi) = 0$ for $r = 0$.
Appendix C Network Configuration Used
In each layer of the SS-CNN, the filter scale factors lie within a fixed range, with the filter sizes increasing accordingly (only odd filter sizes are chosen, to ensure a well-defined centre pixel). For such large filter sizes, an additional upsampling by a factor of 2 was applied to the data. Note that upsampling ensures more precise convolutions, especially with scale-steered filters of higher orders. Although upsampling adds a slight improvement for the SS-CNN (on MNIST-Scale), we found that it does not improve the performance of the other networks compared in this paper. For all experiments, the number of feature maps within the three layers was (30, 60, 90), for all networks. A total of 3 max-pooling layers were used, after the first, second, and third convolution layers, with pooling sizes differing between the SS-CNN and the other networks. For FMNIST-Scale and MNIST-Scale-local, it was ensured that all networks had approximately the same number of trainable parameters. All networks were trained for a maximum of 300 epochs, after which the best-performing model on the validation data was used for testing. No data augmentation was used in any experiment.