1 Introduction
The success of Convolutional Neural Networks on large datasets such as ImageNet [2] has prompted a myriad of follow-up work, much of which builds on their key property, depth-preserved translation equivariance, to obtain better classifiers [3, 4, 5]. Equivariance to transformations has thus been recognized as an important prerequisite for a classifier, and CNNs, which are by construction translation equivariant, are a first important step in this direction. An underlying requirement for a transformation-equivariant representation is the construction of transformed copies of filters; when the transformation is a translation, the operation becomes a convolution. A natural extension of this idea to general transformation groups led to Group-equivariant CNNs [3], where transformed copies of the filter weights are generated in the first layer. Subsequently, the application of group convolution ensures that the network stays equivariant to that transformation throughout.
However, there are certain issues pertaining to the application of any (spatial) transformation on a filter:

There is no prior on the spatial complexity of a convolutional filter within a CNN, which means that a considerable part of the filter space may contain filters that are not sensitive to the desired spatial transformation. Examples include rotation-symmetric filters, high-frequency filters, etc.

As noted in [4], most transformations are continuous in nature, necessitating interpolation to obtain filter values at new locations. This usually leads to interpolation artifacts, whose disruptive effect is greater when the filters are small, as they usually are.
Steerable Filters
To alleviate these issues, the use of a steerable filter basis for filter construction and learning was proposed in [6]. Steerable filters have the unique property that they can be transformed simply through linear combinations of an appropriate steerable filter basis. Importantly, the choice of the steerable basis allows one to control the transformation sensitivity of the final computed filter. In particular, for a circular harmonic basis [7], filters of order $m$ are only sensitive to rotation shifts in the range $[0, 2\pi/m)$. Higher-order filter responses thus show less sensitivity to input rotations, while simultaneously being of higher spatial frequency and complexity. Using a small basis of the first few filter orders enabled the authors of [4] to achieve state-of-the-art results on MNIST-Rot classification (with a small training data size).
2 Contributions of this Work
Log-Radial Harmonics: A scale-steerable basis
In this paper, we define filters that are steerable in their spatial scale, using a complex filter basis we denote as log-radial harmonics. Each kernel of a CNN is represented as the real part of a linear combination of the proposed basis filters, which contains filters of various orders, analogous to circular harmonics. The scale-steerable property permits exact scaling of the filters simply through a linear combination of learnt complex coefficients on the log-radial harmonics. The filter form is conjugate to that of the circular harmonics, with the choice of filter order having a direct impact on the scale sensitivity of the resulting filters.
Scale-Steered CNN (SS-CNN)
Using the log-radial harmonics as a complex steerable basis, we construct a locally scale-invariant CNN, where the filters in each convolution layer are a linear combination of the basis filters. To obtain filter responses across scales, each filter is simultaneously steered in its scale and size, and the filter responses are then max-pooled across scales. We demonstrate accuracy improvements with the scale-steered CNN on datasets containing global (MNIST-Scale and FMNIST-Scale) and local (MNIST-Scale-Local; synthesized here) scale variations. Importantly, we find that on MNIST-Scale, the proposed SS-CNN achieves accuracy competitive with the Spatial Transformer Network [8], which, due to its global affine resampling, has a natural advantage on this task.
3 Related Work
Previous work with Local Scale Invariant/Equivariant CNNs
Scale-transformed weights were proposed in [1], where they were observed to improve performance over a baseline CNN on MNIST-Scale. On the same dataset (with a 10k, 2k, and 50k split), better performance was observed in [9], where, in addition to forwarding the maximum filter response over a range of scales, the actual scale at which the maximum was obtained was also forwarded. In both works, weight scaling was only indirectly emulated: the input was scaled instead, and the convolution responses were resized back to a fixed size for max-pooling across scales.
4 Background: Steerable Filters for Rotation
Rotation-steerable filters, in the form of circular harmonics, take the form $W_m(r,\phi) = R(r)\,\Phi_m(\phi)$, expressed in polar coordinates. For circular harmonics, $R(r)$ is usually a Gaussian function centered on a particular radius, and $\Phi_m(\phi) = e^{im\phi}$ is a complex function of unit norm. Such a choice of $\Phi_m$ allows one to rotationally steer the filter by any angle $\theta$ just by a complex multiplication, $W_m^{\theta}(r,\phi) = e^{-im\theta}\,W_m(r,\phi)$. Furthermore, control over the rotational order $m$ allows one to directly control the rotational sensitivity of the resulting filter (whose response magnitude is invariant to the filter rotation), and simultaneously the spatial complexity of the filter.
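The rotational steering identity above is easy to check numerically. Below is a minimal numpy sketch (the function name, the radial Gaussian parameters `r0` and `sigma`, and the sampling grid are illustrative choices, not from the paper): evaluating the filter at rotated coordinates agrees exactly, pointwise, with multiplying by the unit complex number $e^{-im\theta}$.

```python
import numpy as np

def circular_harmonic(r, phi, m, r0=2.0, sigma=0.5):
    # Radial Gaussian centred on radius r0, complex angular term of unit norm.
    return np.exp(-(r - r0)**2 / (2 * sigma**2)) * np.exp(1j * m * phi)

m, theta = 2, 0.7
r = np.linspace(0.1, 4.0, 50)
phi = np.linspace(0, 2 * np.pi, 50, endpoint=False)
R, P = np.meshgrid(r, phi)

# Rotating the filter by theta == multiplying by a unit complex number.
rotated = circular_harmonic(R, P - theta, m)                  # filter rotated by theta
steered = np.exp(-1j * m * theta) * circular_harmonic(R, P, m)
assert np.allclose(rotated, steered)
```

No interpolation is involved: the identity holds exactly at every sample point, which is precisely the appeal of steerability.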
5 Methods
5.1 Scale-steerable filters: Log-Radial Harmonics
Similar to the rotation-steerable circular harmonics, we analogously construct a set of filters of the form $S(r,\phi) = R(r)\,\Phi(\phi)$. Since we wish to steer the scale of the filter, $\Phi(\phi)$ is now of Gaussian form, whereas $R(r)$ is complex valued with a unit-norm modulation, i.e. $R(r) \propto e^{ik\log r}$. The proposed mathematical form of a scale-steerable filter of order $k$, centered on a particular angle $\phi_j$, is

$S_{k,\phi_j}(r,\phi) = \frac{1}{r}\, e^{-\frac{d(\phi,\phi_j)^2}{2\sigma^2}}\, e^{ik\log r}$,  (1)

where $d(\phi,\phi_j)$ is the distance between the two angles $\phi$ and $\phi_j$. Example filters constructed using equation 1 are shown in Figure 1. When steering the above filter in scale by a scale factor $s$, we find that a complex multiplication suffices: $S_{k,\phi_j}(r/s,\phi) = s\,e^{-ik\log s}\,S_{k,\phi_j}(r,\phi)$. The corresponding behaviour of the filter response under input scaling is established in the following theorem.
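The scale-steering property can be verified pointwise with a short numpy sketch. We assume a concrete form of equation 1 with a $1/r$ radial decay, an angular Gaussian, and a complex phase $e^{ik\log r}$; the parameter values and function names are illustrative.

```python
import numpy as np

def log_radial_harmonic(r, phi, k, phi_j=0.0, sigma=0.5):
    # Eq. (1): 1/r radial decay, angular Gaussian centred on phi_j,
    # unit-norm complex log-radial term.
    d = np.angle(np.exp(1j * (phi - phi_j)))       # wrapped angular distance
    return (1.0 / r) * np.exp(-d**2 / (2 * sigma**2)) * np.exp(1j * k * np.log(r))

k, s = 2, 1.5
r = np.linspace(0.2, 3.0, 40)
phi = np.linspace(-np.pi, np.pi, 40, endpoint=False)
R, P = np.meshgrid(r, phi)

# Scaling the filter by s == multiplying by the complex number s * e^{-ik log s}.
scaled = log_radial_harmonic(R / s, P, k)
steered = s * np.exp(-1j * k * np.log(s)) * log_radial_harmonic(R, P, k)
assert np.allclose(scaled, steered)
```

As with rotational steering, the identity is exact at every sample point, so no resampling or interpolation is needed to change a filter's scale.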
Theorem 1.
Consider a circular input patch $I$ within a larger image, defined within the radial range $r \in (0, r_0]$. Let $I_s$ denote the same patch when the image is scaled around the centre of the patch by a factor of $s$. We then have

$[I_s \star S_{k,\phi_j}]_{r_0} = s\, e^{ik\log s}\, [I \star S_{k,\phi_j}]_{r_0/s}$,  (2)

where $\star$ is the cross-correlation operator (in the continuous domain), used in the same context as in [7], and the subscript denotes the radius of the circular region over which the correlation is evaluated.
The proof of theorem 1 is shown in the appendix.
An immediate consequence of the above theorem is that for $k = 0$ the theorem assumes a simpler form, $[I_s \star S_{0,\phi_j}]_{r_0} = s\,[I \star S_{0,\phi_j}]_{r_0/s}$.
Scale steerability
A useful consequence of steerability is that any filter expressed as a linear combination (with complex coefficients) of the steerable basis is also steerable. Consider a filter $F$ of radius $r_0$, constructed using the proposed scale-steerable basis $\{S_{k,\phi_j}\}$, s.t. $F = \sum_{k,j} \alpha_{kj}\, S_{k,\phi_j}$, where $\alpha_{kj} \in \mathbb{C}$. The same filter can be steered in its scale by a scale factor of $s$, giving

$F^{s} = \sum_{k,j} \alpha_{kj}\, s\, e^{-ik\log s}\, S_{k,\phi_j}$.  (3)
However, we want the filters to be real valued, and hence we only take the real part of $F^s$. Note that the equality in equation (2) holds separately for the real and imaginary parts of both sides, so working with the real part of the filters does not affect steerability. The result in Theorem 1 includes an additional change of radius from $r_0$ to $r_0/s$. This indicates that the pixel values of $F^s$ are sampled across a circular region whose radius depends on the scale factor $s$. Finally, as noted in [10, 7], steerability and sampling are interchangeable; therefore the sampled versions of the scaled basis filters are the same as the scaled versions of the sampled filter.
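Equation (3) says that a learned filter is steered by rescaling its coefficients alone. The numpy sketch below (random coefficients; 3 orders and 8 orientations, i.e. 24 basis filters, matching the basis size used in Appendix B; all other parameters illustrative) compares coefficient steering against directly evaluating the filter at scaled coordinates.

```python
import numpy as np

def basis(r, phi, k, phi_j, sigma=0.5):
    # Assumed concrete form of the log-radial harmonic of eq. (1).
    d = np.angle(np.exp(1j * (phi - phi_j)))
    return (1.0 / r) * np.exp(-d**2 / (2 * sigma**2)) * np.exp(1j * k * np.log(r))

rng = np.random.default_rng(0)
r = np.linspace(0.2, 3.0, 32)
phi = np.linspace(-np.pi, np.pi, 32, endpoint=False)
R, P = np.meshgrid(r, phi)

orders = [0, 1, 2]
angles = np.linspace(-np.pi, np.pi, 8, endpoint=False)
alpha = (rng.normal(size=(3, 8)) + 1j * rng.normal(size=(3, 8)))  # learnt coefficients

F = sum(alpha[i, j] * basis(R, P, k, pj)
        for i, k in enumerate(orders) for j, pj in enumerate(angles))

s = 2.0
# Eq. (3): steer by rescaling each coefficient -- no resampling of the basis.
F_steered = sum(alpha[i, j] * s * np.exp(-1j * k * np.log(s)) * basis(R, P, k, pj)
                for i, k in enumerate(orders) for j, pj in enumerate(angles))
F_direct = sum(alpha[i, j] * basis(R / s, P, k, pj)
               for i, k in enumerate(orders) for j, pj in enumerate(angles))
assert np.allclose(F_steered, F_direct)
```

Since the identity holds for the complex filters, it holds in particular for their real parts, which are the kernels actually used in the network.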
5.2 Scale-Invariant CNNs with Scale-Steered Weights
Here we describe the Scale-Steered CNN (SS-CNN), which employs a scale-steerable filter basis in the computation of its filters. Figure 2 shows the proposed scale-invariant layer. Each filter within the scale-invariant layers is computed as a linear combination of the assigned scale-steerable basis $\{S_{k,\phi_j}\}$. The network directly learns only the complex coefficients $\alpha_{kj}$. At each scale-invariant layer, the scaled and resized versions of the filters are computed directly from the complex coefficients using equation 3. Only the maximum responses across all scales are channeled to the next layer, by max-pooling the responses across scales.
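The forward pass of one scale-invariant layer can be sketched as follows. This is a deliberate simplification of the layer just described: we steer only the filter values on a fixed grid (the actual layer also grows the filter support with scale), use a naive valid-mode correlation, and choose all sizes and parameters purely for illustration.

```python
import numpy as np

def conv2d_valid(img, kernel):
    # Naive 'valid' cross-correlation; enough for a sketch.
    H, W = img.shape; h, w = kernel.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            out[i, j] = np.sum(img[i:i+h, j:j+w] * kernel)
    return out

def lrh(k, phi_j, size=9, sigma=0.6):
    # Assumed log-radial harmonic of eq. (1), sampled on a size x size grid.
    ax = np.arange(size) - size // 2
    X, Y = np.meshgrid(ax, ax)
    r = np.hypot(X, Y)
    r[size // 2, size // 2] = 1.0          # avoid log(0); centre value forced to 0 below
    phi = np.arctan2(Y, X)
    d = np.angle(np.exp(1j * (phi - phi_j)))
    f = (1.0 / r) * np.exp(-d**2 / (2 * sigma**2)) * np.exp(1j * k * np.log(r))
    f[size // 2, size // 2] = 0.0          # filter is undefined at r = 0
    return f

rng = np.random.default_rng(1)
img = rng.normal(size=(20, 20))
scales = [1.0, 1.26, 1.59]
k, phi_j = 1, 0.0
responses = []
for s in scales:
    # Steer the (single-basis-filter) kernel in scale via eq. (3), keep the real part.
    w = (s * np.exp(-1j * k * np.log(s)) * lrh(k, phi_j)).real
    responses.append(conv2d_valid(img, w))
out = np.max(np.stack(responses), axis=0)  # max-pool the responses across scales
```

In the full layer each kernel is a 24-term linear combination of basis filters, and `out` would be forwarded to the next layer.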
6 Experiments
First, to validate the proposed approach, we chose datasets containing global scale variations: MNIST-Scale and FMNIST-Scale. In addition, a dataset containing local scale variations was synthesized from MNIST. Subsequently, the filters and activation maps within the SS-CNN are visualized. All experiments were run on an NVIDIA Titan GPU. The code has been released at https://github.com/rghosh92/SSCNN.
6.1 Classification with SS-CNN
6.1.1 MNIST and FMNIST
The data partitioning protocol for MNIST-Scale is a 10k, 2k, and 50k split of the scaled version of the original MNIST into training, validation, and testing data respectively. (A small training data size is chosen so as to better evaluate the generalization abilities of the trained classifiers.) We use the same split ratio for creating FMNIST-Scale, with the same range of spatial scaling. No additional data augmentation was performed for any of the networks.
Global scale variations: MNIST and FMNIST
The results on MNIST-Scale and FMNIST-Scale are shown in Table 1 (marked entries are from our implementation). The proposed method is compared with three other CNN variants: the locally scale-invariant CNN [1], scale-equivariant vector fields [9], and spatial transformer networks [8] (for the spatial transformer network, we use the configuration that performs best on the validation data). For a fair comparison, all networks used have a total of 3 convolutional layers and 2 fully connected layers, and the number of trainable parameters was kept approximately the same across all four networks. Means and standard deviations of accuracies are reported over 6 splits.
(Note that although the input sizes of MNIST and FMNIST are similar, they contain very different kinds of data: MNIST is mainly white strokes on a black background, whereas FMNIST includes both shape and texture information in grayscale.)
Generalization to Distortions
Here we test and compare performance on MNIST-Scale with added elastic distortions. The networks are all trained on the undistorted MNIST-Scale, but tested on MNIST-Scale with added elastic deformations. Results are shown in Table 2. We report the performance of a single (best-performing) network for each method.
Table 2: Error (%) on MNIST-Scale with elastic distortions of increasing strength.

Distortion strength  | 0    | 10   | 20   | 30   | 40
ScaleInv Net         | 3.2  | 5.92 | 9.6  | 16.2 | 27
Spatial Transformer  | 1.87 | 3.4  | 5.12 | 9.2  | 16.2
SS-CNN (ours)        | 1.87 | 3.7  | 5.6  | 9.82 | 16.83
Synthesized data: Local scale variations
We synthesize a variant of MNIST, namely MNIST-Scale-Local, with scale variations that are more local than those of MNIST-Scale. Pairs of MNIST examples were each scaled with an independent random scale factor and arranged side by side in a larger image, a small proportion of which contains overlapping examples. We choose only 10 possible combinations of digits, resulting in a total of 10 categories for the network. Means and standard deviations over 6 splits are reported in Table 3. The results demonstrate the superior performance of local scale-invariance based methods over global transformation estimation architectures such as spatial transformers, in a scenario where the data contains local scale variations.
Method               | 1% data     | 10% data    | 100% data
Spatial Transformer  | 4.76 ± 0.38 | 0.73 ± 0.05 | 0.23 ± 0.02
SS-CNN (ours)        | 4.27 ± 1.14 | 0.40 ± 0.02 | 0.09 ± 0.01
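The MNIST-Scale-Local synthesis described above can be sketched roughly as below. This is an assumed reconstruction, not the released generator: the canvas size, scale range, nearest-neighbour resizing, and non-overlapping placement are illustrative simplifications (the actual dataset also contains a small proportion of overlapping pairs).

```python
import numpy as np

def resize_nn(img, out_hw):
    # Nearest-neighbour resize; a stand-in for whatever interpolation was used.
    H, W = img.shape; h, w = out_hw
    return img[np.ix_(np.arange(h) * H // h, np.arange(w) * W // w)]

def make_pair(d1, d2, rng, canvas_hw=(56, 112), scale_range=(0.5, 1.0)):
    # Each digit gets its own random scale -> local, not global, scale variation.
    H, W = canvas_hw
    out = np.zeros(canvas_hw)
    for idx, d in enumerate((d1, d2)):
        s = rng.uniform(*scale_range)
        h = max(1, int(d.shape[0] * s)); w = max(1, int(d.shape[1] * s))
        small = resize_nn(d, (h, w))
        y = rng.integers(0, H - h + 1)                      # random vertical offset
        x = rng.integers(0, W // 2 - w + 1) + idx * (W // 2)  # left or right half
        out[y:y+h, x:x+w] = np.maximum(out[y:y+h, x:x+w], small)
    return out

rng = np.random.default_rng(0)
img = make_pair(np.ones((28, 28)), np.ones((28, 28)), rng)
```

The label of each synthesized image is the identity of the digit pair, giving the 10 categories described above.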
6.2 Visualization Experiments
In this section we visualize the network filters and feature map activations for two scale-invariant networks: our proposed SS-CNN and the LocScaleInv-CNN of [1]. Both networks were trained on MNIST-Scale. Figure 3 (a) shows a visual comparison of the layer-1 filters of these networks. Notice that the scale-steered filters show considerably more structure and centrality, and interesting filter forms, some of them resembling oriented bars. Figure 3 (b) compares the average feature map activations of layer 1 in response to different inputs. Notice that spatial structure is far better preserved in the SS-CNN responses (bottom row), with the digit outlines clearly distinguishable. This is partly due to the ingrained centrality of the scale-steered basis (the $1/r$ term in equation 1), which generates more structure-preserving responses.
7 Discussions
Based on the SS-CNN framework proposed in this work, we underline some important issues and considerations moving forward, and provide detailed explanations for some of the design choices used in this work.

Input Resizing vs Filter Scaling: For locally scale-invariant CNNs, the input is usually reshaped to a range of sizes, both smaller and larger than the original [1, 9]. Feature maps are obtained by convolving each resized input with an unchanged filter. Lastly, all the feature maps are reshaped back to a common size, beyond which only the maximum responses across scales are channeled. This approach uses two rounds of reshaping, and is thus clearly prone to interpolation artifacts, especially if the filters are not smooth enough. The method proposed in this work only steers the filters in their scale and size, without relying on any interpolation operations: a change of filter size just requires computing the filter values at the new locations using equations 1 and 3.
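For contrast, the input-resizing pipeline of [1, 9] described above can be sketched as: resize the input to several sizes, convolve each with the same filter, resize the responses back to a common size, and max-pool across scales. In this minimal sketch, nearest-neighbour resizing stands in for the interpolation step, and all sizes and filter values are illustrative.

```python
import numpy as np

def resize_nn(img, out_hw):
    # Nearest-neighbour resize; the rounds of resizing are where artifacts enter.
    H, W = img.shape; h, w = out_hw
    return img[np.ix_(np.arange(h) * H // h, np.arange(w) * W // w)]

def conv2d_same(img, k):
    # Naive zero-padded 'same' cross-correlation; enough for a sketch.
    h, w = k.shape
    pad = np.pad(img, ((h // 2, h // 2), (w // 2, w // 2)))
    out = np.empty_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(pad[i:i+h, j:j+w] * k)
    return out

rng = np.random.default_rng(0)
img = rng.normal(size=(16, 16))
kern = rng.normal(size=(3, 3))
sizes = [12, 16, 20]   # input resized below / at / above its native size
maps = [resize_nn(conv2d_same(resize_nn(img, (n, n)), kern), (16, 16)) for n in sizes]
out = np.max(np.stack(maps), axis=0)   # max response across scales
```

Every feature map here passes through two resizing steps; the filter-steering approach of this paper replaces both with an exact evaluation of the filter at its new scale.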

Filter Centrality: If the filters are not central, i.e. centered near their centre of mass (centre of mass here holds the same definition as in physics, with the "mass" element taken as the absolute value of the filter at each location), then they pose the risk of entangling scale and translation information. This happens when the response of a filter to the input at a certain scale and location is the same as the response of the same filter at a different scale and a different location, which can be quite common for filters that have most of their "mass" away from their center. Such entanglement can lead to feature maps with distorted and over-smoothed spatial structure, as observed in Figure 3 (b) (top). This issue can be tackled to a certain extent by using filters which show centrality (Figure 3 (a)). As seen in equation 1, one can control the centrality of the steerable basis filters through the radial term $1/r$, together with the Gaussian angular term centered on $\phi_j$. Figure 1 shows the central nature of the steerable basis. Filter centrality is preserved in the subsequently generated filters, as seen in Figure 3 (a) (left), which shows the generated filters after training.
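Centrality in the above sense can be quantified directly. The small numpy sketch below (grid size and filter profiles are illustrative) computes the centre of mass, with mass taken as the absolute filter value, and shows that a $1/r$-profile filter is centred while a filter with its mass away from the centre is not.

```python
import numpy as np

def centre_of_mass(f):
    # "Mass" = absolute value of the filter, as in the physics definition.
    m = np.abs(f)
    ys, xs = np.indices(f.shape)
    return (np.sum(ys * m) / np.sum(m), np.sum(xs * m) / np.sum(m))

size = 9
ax = np.arange(size) - size // 2
X, Y = np.meshgrid(ax, ax)
r = np.hypot(X, Y)
r[size // 2, size // 2] = 1.0                  # avoid division by zero at the centre

central = 1.0 / r                              # 1/r radial profile: mass at the centre
offset = np.exp(-((X - 3)**2 + Y**2))          # most mass away from the centre

cy, cx = centre_of_mass(central)
assert abs(cy - size // 2) < 0.5 and abs(cx - size // 2) < 0.5
assert centre_of_mass(offset)[1] > 6           # centre of mass pushed to the right
```

A filter whose centre of mass sits at its geometric centre responds most strongly where the pattern actually is, which is why the $1/r$ term helps disentangle scale from translation.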

Transformation Sensitivity: As reiterated in section 1, an important yet partly overlooked aspect of using a steerable basis from the family of circular harmonics (or log-radial harmonics) is the ability to control the transformation sensitivity of the filters. For instance, circular harmonics beyond a certain order have much smaller sensitivity to changes in input rotation, simply because each circular harmonic filter of order $m$ is invariant to discrete rotations of $2\pi/m$. Similarly, it is easily seen that each log-radial harmonic filter of order $k$ is invariant (up to amplitude) to being scaled by a scale factor of $e^{2\pi/k}$. Therefore, higher-order filters show considerably less transformation sensitivity. It is perhaps noteworthy that the 2D Fourier transform (or 2D DCT) basis functions can also be used as a steerable basis (e.g. [11]). In that case, higher-frequency filters (frequency being analogous to filter order) are less sensitive to input translations than low-frequency filters. In a certain sense, then, the circular harmonic and log-radial harmonic filter bases are a natural extension of the Fourier basis (translations) to other transformations (rotation and scale).
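The claimed invariance is easy to confirm numerically: for a log-radial harmonic of order $k$ (assuming the $e^{ik\log r}$ phase form used above), the steering phase $e^{-ik\log s}$ returns to 1 when $s = e^{2\pi/k}$, so only the amplitude of the filter changes at that scale factor.

```python
import numpy as np

for k in (1, 2, 3):
    s = np.exp(2 * np.pi / k)             # the scale factor the order-k filter cannot see
    phase = np.exp(-1j * k * np.log(s))   # complex steering phase for scale factor s
    assert np.allclose(phase, 1.0)        # phase-invariant: k * log(s) = 2*pi
```

Higher $k$ means this blind spot occurs at a smaller scale factor, which is the precise sense in which higher-order filters are less scale-sensitive.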
8 Conclusions and Future Work
A scale-steerable filter basis is proposed which, along with the popular rotation-steerable circular harmonics, can help augment CNNs with a much higher degree of transformational weight-sharing. Experiments on multiple datasets with global and local scale variations demonstrated the performance benefits of using scale-steered filters in a scale-invariant framework. Scale-steered filters were found to exhibit heightened centrality and structure. A natural trajectory for this approach is to incorporate the scale-steering paradigm into equivariant architectures such as G-CNNs.
Acknowledgments
This research was supported by DSO National Laboratories, Singapore (grant no. R719000029592). We thank Dr. Loo Nin Teow and Dr. How Khee Yin for helpful discussions. We also thank Dr. Diego Marcos for sharing the code for scale vector fields [9], and for clarifying a number of other queries related to the task.
References
 [1] A. Kanazawa, A. Sharma, and D. W. Jacobs, “Locally scale-invariant convolutional neural networks,” Deep Learning and Representation Learning Workshop: Conference on Neural Information Processing Systems (NIPS), vol. abs/1412.5104, 2014.
 [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25 (NIPS), pp. 1097–1105, Curran Associates, Inc., 2012.
 [3] T. S. Cohen and M. Welling, “Group equivariant convolutional networks,” International Conference on Machine Learning (ICML), vol. abs/1602.07576, 2016.
 [4] M. Weiler, F. A. Hamprecht, and M. Storath, “Learning steerable filters for rotation equivariant cnns,” Conference on Computer Vision and Pattern Recognition (CVPR), vol. abs/1711.07289, 2018.
 [5] D. Marcos, M. Volpi, N. Komodakis, and D. Tuia, “Rotation equivariant vector field networks,” International Conference on Computer Vision (ICCV), vol. abs/1612.09346, 2016.
 [6] T. S. Cohen and M. Welling, “Steerable cnns,” International Conference on Learning Representations (ICLR), vol. abs/1612.08498, 2016.
 [7] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow, “Harmonic networks: Deep translation and rotation equivariance,” Conference on Computer Vision and Pattern Recognition (CVPR), vol. abs/1612.04642, 2016.
 [8] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Advances in Neural Information Processing Systems 28 (NIPS), pp. 2017–2025, Curran Associates, Inc., 2015.
 [9] D. Marcos, B. Kellenberger, S. Lobry, and D. Tuia, “Scale equivariance in cnns with vector fields,” International Conference on Machine Learning (ICML) Workshop: Towards learning with limited labels: Equivariance, Invariance and Beyond, vol. abs/1807.11783, 2018.
 [10] W. T. Freeman and E. H. Adelson, “The design and use of steerable filters,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, pp. 891–906, Sep. 1991.
 [11] O. Rippel, J. Snoek, and R. P. Adams, “Spectral representations for convolutional neural networks,” in Advances in Neural Information Processing Systems 28 (NIPS) (C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds.), pp. 2449–2457, Curran Associates, Inc., 2015.
Appendix A Proof of Theorem 1
Proof.
First, note that in the log-polar domain, with $\rho = \log r$, we have $I_s(\rho,\phi) = I(\rho - \log s, \phi)$. Using this fact, the cross-correlation can be expressed in log-polar coordinates as the integration

$[I_s \star S_{k,\phi_j}]_{r_0} = \int_0^{2\pi}\int_{-\infty}^{\log r_0} I(\rho - \log s, \phi)\, S_{k,\phi_j}(e^{\rho}, \phi)\, e^{2\rho}\, d\rho\, d\phi$,  (4)

where $e^{2\rho}\, d\rho\, d\phi$ is the polar area element $r\, dr\, d\phi$. A change of integration variable from $\rho$ to $\rho'$, where $\rho' = \rho - \log s$, yields

$[I_s \star S_{k,\phi_j}]_{r_0} = s^2 \int_0^{2\pi}\int_{-\infty}^{\log (r_0/s)} I(\rho', \phi)\, S_{k,\phi_j}(e^{\rho' + \log s}, \phi)\, e^{2\rho'}\, d\rho'\, d\phi$.  (5)

From the definition of the steerable filter basis (equation 1), we have that $S_{k,\phi_j}(e^{\rho' + \log s}, \phi) = \frac{1}{s}\, e^{ik\log s}\, S_{k,\phi_j}(e^{\rho'}, \phi)$. Thus the integration can be further simplified as

$[I_s \star S_{k,\phi_j}]_{r_0} = s\, e^{ik\log s} \int_0^{2\pi}\int_{-\infty}^{\log (r_0/s)} I(\rho', \phi)\, S_{k,\phi_j}(e^{\rho'}, \phi)\, e^{2\rho'}\, d\rho'\, d\phi$  (6)

$= s\, e^{ik\log s}\, [I \star S_{k,\phi_j}]_{r_0/s}$.  (7)
This completes the proof. ∎
Appendix B Steerable Basis Parameters
The definition of each log-radial harmonic filter includes a total of four parameters: phase, filter order ($k$), filter orientation ($\phi_j$), and orientation spread ($\sigma$). For all networks trained in this work using scale-steered filters, the same fixed configuration of the steerable basis space was used, leading to a total of 24 log-radial harmonics as the steerable basis. Thus, each scale-steerable filter has $24 \times 2 = 48$ trainable parameters (due to the real and imaginary components of each coefficient). One additional aspect of note is the $\log r$ term in the complex exponential in the filter definition. Since the filter is undefined at $r = 0$, we enforce the filter value to be zero at $r = 0$.
Appendix C Network Configuration Used
In each layer of the SS-CNN, the filter scale factors span a fixed range, with the size of the filters increasing accordingly (only odd filter sizes are chosen, so as to have a well-defined centre pixel). For such large filter sizes, an additional upsampling by a factor of 2 was applied to the data. Note that upsampling ensures more precise convolutions, especially with scale-steered filters of higher orders. Although upsampling adds a slight improvement for the SS-CNN on MNIST-Scale, we found that it does not improve the performance of the other networks compared in this paper. For all experiments, the number of feature maps within each layer was (30, 60, 90), for all networks. A total of 3 max-pooling layers were used, after the first, second, and third convolution layers. For FMNIST-Scale and MNIST-Scale-Local, it was ensured that all networks had approximately the same number of trainable parameters. All networks were trained for a maximum of 300 epochs, after which the best-performing model on the validation data was used for testing. No data augmentation was used in any experiment.