Implementation of the [Locally Scale-Invariant Convolutional Neural Network](http://www.umiacs.umd.edu/~kanazawa/papers/sicnn_workshop2014.pdf)
Convolutional Neural Networks (ConvNets) have shown excellent results on many visual classification tasks. With the exception of ImageNet, these datasets are carefully crafted such that objects are well-aligned at similar scales. Naturally, the feature learning problem gets more challenging as the amount of variation in the data increases, as the models have to learn to be invariant to certain changes in appearance. Recent results on the ImageNet dataset show that, given enough data, ConvNets can learn such invariances, producing very discriminative features. But could we do more (use fewer parameters, use less data, learn more discriminative features) if certain invariances were built into the learning process? In this paper we present a simple model that allows ConvNets to learn features in a locally scale-invariant manner without increasing the number of model parameters. We show on a modified MNIST dataset that when faced with scale variation, building in scale-invariance allows ConvNets to learn more discriminative features with reduced chances of over-fitting.
Convolutional Neural Networks (ConvNets) have achieved excellent results on visual classification tasks like handwritten digits, toys, traffic signs, and recently 1000-category ImageNet classification. ConvNets' success comes from their ability to learn complex patterns by building increasingly abstract representations layer by layer, much like other deep neural networks. However, ConvNets differ in that they exploit the two-dimensional structure of images, where objects and patterns appear at arbitrary locations. ConvNets apply local filters at every position in the image, allowing the network to detect and learn patterns regardless of their location.
In reality, the world has a three-dimensional structure, and objects at different distances will appear in an image at different scales as well as locations. ConvNets do not have a mechanism to take advantage of scale explicitly, so to detect a single pattern at multiple scales, they must learn separate filters for each scale. Unfortunately this has several major shortcomings. Capturing patterns at multiple scales uses up resources that could be used to learn a wider variety of feature detectors. This requires an increase in the number of feature detectors, which means that the network is harder to train, takes longer to train, and is more likely to overfit. To learn multiple scales and prevent overfitting, you would need a lot of training data, and even then the network will only respond to the scales seen during training. Also, detectors that capture the same pattern at different scales are learned independently, without sharing training samples. Finally, having multiple filters of a single pattern at different scales burdens the next layer by increasing the number of configurations that indicate the presence or absence of that pattern.
In this paper, we present scale-invariant convolutional networks (SI-ConvNets), which apply filters at multiple scales in each layer so that a single filter can detect and learn patterns at multiple scales. We max-pool responses over scales to obtain representations that are locally scale-invariant yet have the same dimensionality as a traditional ConvNet layer output. The proposed architecture differs from other multi-scale approaches explored with ConvNets in that scale-invariance is built in at the layer level rather than at the network level. We also achieve a locally scale-invariant representation without any increase in the number of parameters to be learned. We show in our experiments that by sharing information from multiple scales, the proposed model can achieve better classification performance than ConvNets while simultaneously requiring less training data. We evaluate the proposed model on a variation of the MNIST dataset where digits appear at multiple scales, and demonstrate that SI-ConvNets are more robust than ConvNets to scale variations in training data and to unfamiliar scales in test data. Our model is complementary to other ConvNet architectures and can easily be incorporated into existing variants, and we will make the source code available online for the research community.
Several recent works have addressed the problem of explicitly incorporating transformation invariance in deep learning models. In the unsupervised feature learning domain, Sohn et al. and a related work introduced transformation-invariant Restricted Boltzmann Machines (RBMs), where linear transformations of a filter are applied to each input to infer its highest activation. In these two models, the transformed filters were only applied at the center of the largest receptive field size. Our model uses an inverse transformation stage so that transformed filters can be applied densely but still retain correspondence. Our work is inspired by the success of Sohn et al., and our goal is to incorporate scale-invariant feature learning into the extremely successful ConvNet models.
Tiled convolutional neural networks learn invariances implicitly by square-root pooling hidden units that are computed by partially un-tied weights. In comparison, our approach explicitly encodes scale invariance, and does not require the increase in the number of learned parameters that un-tying of weights requires. Another approach fuses outputs of multiple ConvNets applied over multiple scales for semantic segmentation, but each ConvNet is learned independently without weight-sharing. In contrast, we jointly learn a set of feature detectors that are shared over multiple scales. A further approach proposes a multi-scale ConvNet where outputs of all convolutional layers are fed to the classifier. This enables it to capture information from different levels of the hierarchy, but there is no scale invariance in the features learned at a layer because each layer is only applied to the original scale.
Another influential work is that of Farabet et al., who train ConvNets over the Laplacian pyramid of images with tied weights for scene parsing problems. In their model, the entire multi-layer forward propagation is applied end to end at each scale disjointly; then, right before the fully-connected layers, the responses of all scales are aligned by up-sampling and concatenated. Keeping responses from each scale allows them to capture scale-level dependencies, but at the expense of increasing the number of parameters required in the final layer. This further restricts the number of scales that can be applied; in contrast, our model is more compact and free of such restrictions. Further, their approach is motivated by the need to have large receptive field sizes to capture long-range, contextual interactions that are necessary for scene understanding. In contrast, we are interested in capturing locally scale-invariant features that are useful for image classification; hence, unlike their approach, we pool the responses over all scales in each spatial location in each layer.
Pooling responses over scales in each layer, as opposed to concatenating all scales at the very end, has subtle but different effects in the middle layers. For example, suppose there is an image of two circles of different sizes, and a circle filter that can detect one of the circles in the image. In the architecture of Farabet et al., each circle will be detected in a different scale, but they will never be recognized together until the layer where all scales are concatenated. In our architecture, both circles can be detected together and the second layer can immediately make use of the fact that there are two circles in the image. Of course, by having two circle filters of different sizes, Farabet et al.'s network can also detect two circles in one scale, but at the expense of learning redundant filters.
While their work demonstrates the advantages of applying ConvNets over multiple scales in scene parsing, we further investigate its effectiveness in an image classification domain with a more modular model that explicitly incorporates scale-invariance in each layer.
Convolutional Neural Networks (ConvNets) are a supervised feed-forward multi-layer architecture where each layer learns feature detectors of increasing complexity. The final layer is a classifier or regressor with a cost function such that the network can be trained in a supervised manner. The entire network is optimized jointly via stochastic gradient descent with gradients computed by back-propagation. A single layer in a ConvNet is usually composed of feature extraction and nonlinear activation stages, optionally followed by spatial pooling and feature normalization.
The hallmark of ConvNets is the idea of convolving local feature detectors across the image. This idea is motivated by the fact that similar patterns can appear anywhere in the 2D layout of pixels and that nearby values present strong dependencies in natural images. The local feature detectors are trainable filters (kernels), and their spatial extent is called the receptive field of the network, as it represents how much of the image the network gets to "see". The convolution operation effectively ties the learned weights at multiple locations, which radically reduces the number of trainable parameters compared to having different weights at each location. The output of convolving one kernel is called a feature map $h$, which is sent through a nonlinear activation function $\sigma$. A feature map at a layer is computed as

$$h = \sigma(W * x + b)$$

where $*$ is the convolution operator, $x$ is the input feature map from the previous layer, and $W$ and $b$ are the trainable weights and bias respectively. By having multiple feature maps, the network can represent multiple concepts in a single layer. The model may further summarize each sub-region of the feature map via max or average pooling, providing invariance to small amounts of image translation.
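The feature-map computation described above (convolve, add a bias, apply a nonlinearity) can be sketched in a few lines of plain Python. This is only an illustration: the function names are hypothetical, ReLU is assumed as the nonlinearity, square inputs and kernels are assumed, and "convolution" is implemented as unflipped correlation, as is common in ConvNet code.

```python
def conv2d_valid(x, w):
    """'Valid' 2D correlation of a square image x with a square kernel w."""
    k = len(w)
    out = len(x) - k + 1
    return [[sum(w[a][b] * x[i + a][j + b] for a in range(k) for b in range(k))
             for j in range(out)] for i in range(out)]


def feature_map(x, w, b):
    """Feature map: nonlinearity applied to (w * x + b); ReLU is an assumption."""
    return [[max(0.0, v + b) for v in row] for row in conv2d_valid(x, w)]


# A 2x2 blob detector applied to a 4x4 image containing a 2x2 blob.
image = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
kernel = [[1, 1],
          [1, 1]]
h = feature_map(image, kernel, b=-1.0)
# The strongest response is at the center, where the kernel covers the blob.
```

The same weights are reused at every location, which is the weight tying the paragraph refers to.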
Feature detectors in ConvNets have the ability to detect features regardless of their spatial locations in the image, but the same cannot be said for features at different scales. In this section we describe a scale-invariant ConvNet (SI-ConvNet). Our formulation also allows the output of ConvNets to be locally scale-invariant, where the representation of the same patterns at different scales will be similar (given that the patterns share the same center; when the centers of the patterns are shifted, the output will be similar but at different locations, i.e. a shift-equivariant representation). Figure 1 shows a side-by-side comparison of the overall structure of a standard convolution layer and a scale-invariant convolution layer.
Our goal is to let one feature detector respond to patterns at multiple scales. To do so, we convolve the filters over multiple resolutions in a pyramid. At each scale the exact same filters are used to convolve the image (weight-tying). Since we spatially transform the image, the outputs of convolution come in different spatial sizes. In order to align the feature maps, we apply an inverse transformation to each feature map. (When the stride is equal to the size of the kernel, applying the inverse transformation gives direct correspondence between convolution outputs. When it is not, after applying the inverse transformation, the output has to be either cropped or padded with 0s to be properly aligned.) Finally we max-pool the responses over all the scales at each spatial location. Pooling responses over multiple scales serves two purposes. First, it allows us to obtain a locally scale-invariant representation. Second, it summarizes the responses in a concise way that allows us to maintain the same output size as a standard convolution layer.
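Specifically, let $\mathcal{T}$ be a linear image transform operator that applies some spatial transformation to an input $x$. Then, for a set of transformation operators $\{\mathcal{T}_i\}_{i=1}^{n}$, a feature map $h$ is computed as:

$$\hat{z}_i = W * \mathcal{T}_i(x) + b$$
$$z_i = \mathcal{T}_i^{-1}(\hat{z}_i)$$
$$h = \sigma\left(\max_{i \in [1, n]} z_i\right)$$

Note that $\hat{z}_i \in \mathbb{R}^{N_i \times N_i \times m}$, where $N_i$ is the size of the output of convolution in the $i$-th transformed input, and $z_i \in \mathbb{R}^{N \times N \times m}$ for all $i$, where $N$ is the canonical output size when all responses are aligned and $m$ is the number of feature maps used. $\mathcal{T}_1$ is always the identity transformation, so when $n = 1$ the framework is equivalent to traditional ConvNets.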
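As a rough illustration of the layer just described (transform the input to several scales, convolve each with the same filter, align the responses, and max-pool over scales), here is a minimal single-channel sketch in plain Python. It is not the authors' Caffe implementation. The helper names are hypothetical, ReLU is assumed as the nonlinearity, and the inverse transformation is approximated by bilinearly resizing each response to the canonical output size rather than by cropping or padding.

```python
def conv2d_valid(x, w):
    """'Valid' 2D correlation of a square image x with a square kernel w."""
    k = len(w)
    out = len(x) - k + 1
    return [[sum(w[a][b] * x[i + a][j + b] for a in range(k) for b in range(k))
             for j in range(out)] for i in range(out)]


def resize_bilinear(img, out_size):
    """Bilinearly resize a square image to out_size x out_size (corners aligned)."""
    n = len(img)

    def src(o):
        return 0.0 if out_size == 1 else o * (n - 1) / (out_size - 1)

    out = []
    for i in range(out_size):
        y = src(i)
        y0 = min(int(y), n - 1)
        y1 = min(y0 + 1, n - 1)
        fy = y - y0
        row = []
        for j in range(out_size):
            x = src(j)
            x0 = min(int(x), n - 1)
            x1 = min(x0 + 1, n - 1)
            fx = x - x0
            top = (1 - fx) * img[y0][x0] + fx * img[y0][x1]
            bot = (1 - fx) * img[y1][x0] + fx * img[y1][x1]
            row.append((1 - fy) * top + fy * bot)
        out.append(row)
    return out


def si_conv_forward(x, w, b, scales):
    """Scale-invariant convolution sketch: one filter, max-pooled over scales."""
    n, k = len(x), len(w)
    canonical = n - k + 1  # output size of the identity (scale 1) path
    aligned = []
    for s in scales:
        size = max(k, round(s * n))                  # size of the scaled input
        z_hat = conv2d_valid(resize_bilinear(x, size), w)
        z_hat = [[v + b for v in row] for row in z_hat]
        aligned.append(resize_bilinear(z_hat, canonical))  # assumed inverse transform
    pooled = [[max(z[i][j] for z in aligned) for j in range(canonical)]
              for i in range(canonical)]
    return [[max(0.0, v) for v in row] for row in pooled]  # ReLU (assumption)
```

With `scales=[1.0]` the function reduces to an ordinary convolution layer, and adding scales never changes the output size or the number of weights.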
Figure 2 illustrates this idea with two sample inputs, $x_1$ and $x_2$, that have the same “V” pattern but at different scales. $W$ is a “V” detector that this ConvNet has learned. With this $W$, a standard convolution layer will only activate on the input whose “V” pattern matches the size of the “V” in $W$. However, in a scale-invariant convolution layer, $x_1$ and $x_2$ undergo scale transformations so that $W$ can be matched in one of the scales, allowing $W$ to detect the pattern on both $x_1$ and $x_2$. The responses are aligned via inverse transformation and the final output is the maximum activation at each spatial location. The path of the winning scale is shown in bold lines.
Since convolving a pyramid of images with a single filter is analogous to convolving a single image with filters of different sizes, using $n$ scales is analogous to increasing the number of feature maps by $n$ times without actually paying the price of having more parameters. This allows us to train a more expressive model without increasing the chances of over-fitting.
The image transform operator $\mathcal{T}$ is parametrized by a single scale factor $s$, where each 2D location in the transformed image is computed by a linear interpolation of the original image around the corresponding source location. We use bilinear interpolation to compute the coefficients. Note that while we focus on scale invariance in this paper, the framework is applicable to other linear transformations. The transformation coefficients can be precomputed, so applying the transformation is efficient. However, there is an increase in the number of convolution operations required, since each scaled input must be convolved. The increase is dominated by the largest scale factor and the step size between scales. Please refer to the supplementary material for the explicit expression and its derivation.
Since the scale-invariant convolution layer consists of just linear and max operations, its gradients can be computed by a simple modification of the back-propagation algorithm. Backprop for max-pooling over scale is implemented by using the argmax indices, analogous to how backprop is done for spatial max-pooling. For scale transformations, the error signal is propagated through the bilinear coefficients used to compute the transformation. Please refer to the supplementary materials for the detailed derivations.
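The argmax-based routing described above can be sketched as follows. This is a minimal illustration on flattened response vectors; the function names are hypothetical and not taken from the authors' code.

```python
def max_over_scales(aligned):
    """Forward pass: elementwise max over equal-length response vectors.

    Returns the pooled values and, per position, the index of the winning scale.
    """
    pooled, winners = [], []
    for values in zip(*aligned):
        best = max(range(len(values)), key=lambda i: values[i])
        winners.append(best)
        pooled.append(values[best])
    return pooled, winners


def max_over_scales_backward(grad_pooled, winners, num_scales):
    """Backward pass: send each gradient entry only to the scale that won."""
    grads = [[0.0] * len(grad_pooled) for _ in range(num_scales)]
    for pos, (g, i) in enumerate(zip(grad_pooled, winners)):
        grads[i][pos] = g
    return grads


responses = [[1.0, 5.0, 2.0],   # aligned responses at scale 0
             [3.0, 4.0, 2.5]]   # aligned responses at scale 1
pooled, winners = max_over_scales(responses)
grads = max_over_scales_backward([0.1, 0.2, 0.3], winners, num_scales=2)
```

Scales that lose the max at a position receive zero gradient there, exactly as in spatial max-pooling.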
We first compare the performance of the proposed method, referred to as “SI-ConvNet”, against other baseline methods including traditional ConvNets. For all the experiments we carried out, all networks share the exact same hyper-parameters and architecture, except that the convolution layers are replaced by scale-invariant convolution layers. We implement our method using the open-source Caffe framework, and our code will be available online.
In order to evaluate the effectiveness of SI-ConvNets we must experiment with a dataset where objects come at a variety of scales, since there is not much to be gained from learning in a scale-invariant manner when there is no scale variation in the data. Unfortunately, most of the benchmark datasets for evaluating ConvNets do not fit this category. Therefore, we experiment with a modified MNIST handwritten digit classification dataset, introduced in earlier work and called MNIST-scale. It consists of 28x28 gray-scale images, where each digit is randomly rescaled without truncating any of the foreground pixels.
Unless otherwise noted, the architectures used in this experiment consist of two convolutional layers with 36 and 64 feature maps of 7x7 and 5x5 kernels respectively, a fully connected layer with 150 hidden nodes, and a soft-max logistic regression layer. The network architecture is modeled after ConvNets that achieve state-of-the-art results on the original MNIST dataset, and we use the same pre-processing method and hyper-parameters unless otherwise noted. We don't use techniques such as data augmentation, dropout or model averaging, in order to simplify the comparison between a convolution layer and the proposed scale-invariant convolution layer (note that in that prior work, a ConvNet without dropout achieves state-of-the-art performance along with a ConvNet with DropConnect). Only the kernel size of the first convolution layer and the weight decay parameter were re-tuned for the MNIST-scale dataset, using a subset of training data on a ConvNet, and they are fixed for all networks. The networks are trained for 700 epochs and the test error after 700 epochs is reported. All networks share the same random seed. The scale-invariant convolution layer uses six scales from 0.6 to 2 at a scale step of $2^{1/3} \approx 1.26$, i.e. scales at approximately 0.63, 0.79, 1, 1.26, 1.59 and 2. The details of the parameters used in each experiment are provided in the supplementary materials.
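A geometric set of scale factors like the one used here can be generated as below. The step of `2 ** (1 / 3)` is an assumption: it is the step that yields six scales spanning roughly 0.6 to 2 while including the identity scale 1, which the framework always contains. The exponent range `-2..3` is chosen for the same reason.

```python
# Assumed step size: the cube root of two.
step = 2 ** (1 / 3)

# Six scales: two below the identity, the identity itself, and three above.
scales = [step ** j for j in range(-2, 4)]
rounded = [round(s, 2) for s in scales]
print(rounded)
```

Because the scales form a geometric progression, each scaled input is a fixed factor larger than the previous one.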
In this experiment we compare our proposed network with ConvNets, the hierarchical ConvNets of Farabet et al., the Restricted Boltzmann Machine (RBM), and the scale-invariant RBM of Sohn et al.
Following the experimental protocol of the prior work on this dataset, we train and test each network on 10,000 and 50,000 images respectively. We evaluate our models on 6 train/test folds and report the average test error and the standard deviation. We use the same architecture and scale parameters for hierarchical ConvNets. The results are shown in Table 1. SI-ConvNet reduces the test error of both ConvNet and hierarchical ConvNet by at least 10% in relative terms. Hierarchical ConvNet slightly underperforms ConvNet, possibly due to overfitting, as it has 6 times more parameters than ConvNet/SI-ConvNet. In the scene classification context for which hierarchical ConvNet was introduced, pixel-level labels exist, providing much more training data than the image classification setting. This result emphasizes the strength of SI-ConvNet, which achieves scale-invariant learning while keeping the number of parameters fixed. There is a large gap between the RBM models and the ConvNets because RBMs are unsupervised models and their architecture is shallow, with only one feature extraction layer. However, the relative error reductions between the original and the scale-invariant versions are comparable, at 9.8% for SI-RBM and 10% for SI-ConvNet. This shows that SI-ConvNet obtains a similar improvement from being scale-invariant and that it is a good supervised counterpart of scale-invariant models.
| Method | Test error (%) on 6 train/test folds |
|---|---|
| Restricted Boltzmann Machine (RBM) | 6.1 |
| Scale-invariant RBM | 5.5 |
| Convolutional Neural Network | 3.48 ± 0.23 |
| Hierarchical ConvNets | 3.58 ± 0.17 |
| Scale-Invariant ConvNet (this paper) | 3.13 ± 0.19 |
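The relative improvements quoted in the text follow directly from the mean test errors in the table. The short check below recomputes them; the function name is hypothetical, and only the mean errors (not the standard deviations) are used.

```python
def relative_reduction(baseline, improved):
    """Relative error reduction in percent, going from baseline to improved."""
    return 100.0 * (baseline - improved) / baseline


si_vs_convnet = relative_reduction(3.48, 3.13)       # about 10.1%
si_vs_hierarchical = relative_reduction(3.58, 3.13)  # about 12.6%
si_rbm_vs_rbm = relative_reduction(6.1, 5.5)         # about 9.8%
```

The SI-ConvNet and SI-RBM gains over their non-invariant counterparts are therefore of similar relative size, even though the absolute error levels differ widely.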
We investigate the scale-invariance achieved by our model using the invariance measure proposed in Goodfellow et al. In this method, a neuron $h_i(x)$ is said to fire when $h_i(x) > t_i$. Each threshold $t_i$ is chosen so that the recall, $G(i)$, i.e. the fraction of the $N$ inputs on which the neuron fires, is greater than 0.01. A set of transformations $T$ is applied to the images that most activate the hidden unit, and the number of times the neuron fires in response to the transformed inputs is recorded. The proportion of transformed inputs that a neuron fires to is called the local firing rate, $L(i)$, which measures the robustness of the neuron to $T$. A high $L(i)$ value indicates invariance to $T$, unless the neuron is easily fired by arbitrary inputs. Therefore, the invariance score of a neuron is computed as the ratio of its invariance and selectivity, i.e. $L(i)/G(i)$. We report the average over the top 20% highest scoring neurons ($p = 0.2$). Please see Goodfellow et al. for more details.
Here $T$ consists of scaling the images over a range of scale factors with step size 0.1. Figure 3 shows the invariance score of ConvNet and SI-ConvNet measured at the end of each layer. We can see that by max-pooling responses over multiple scales, SI-ConvNets produce features that are more scale-invariant than those from ConvNets.
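The invariance score can be sketched as a ratio of two firing rates. The code below is a simplified illustration, not the evaluation code used in the paper: the function names are hypothetical, the threshold is taken as given rather than tuned, and activations are plain lists of numbers.

```python
def firing_rate(activations, threshold):
    """Fraction of activations that exceed the neuron's firing threshold."""
    return sum(a > threshold for a in activations) / len(activations)


def invariance_score(global_acts, transformed_acts, threshold):
    """Local firing rate (on transformed top inputs) over global firing rate."""
    global_rate = firing_rate(global_acts, threshold)
    local_rate = firing_rate(transformed_acts, threshold)
    return local_rate / global_rate


def mean_of_top_fraction(scores, p=0.2):
    """Average of the top fraction p of scores (at least one score)."""
    k = max(1, int(len(scores) * p))
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)


# A neuron that fires on 2 of 100 arbitrary inputs but on 8 of 10 rescaled
# versions of its preferred inputs is both selective and scale-robust.
global_acts = [1.0] * 2 + [0.0] * 98
transformed_acts = [1.0] * 8 + [0.0] * 2
score = invariance_score(global_acts, transformed_acts, threshold=0.5)
```

Dividing by the global rate penalizes neurons that fire on almost anything, which would otherwise look invariant.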
We further evaluate SI-ConvNet by varying the number of training samples and feature maps in the first two layers. For these experiments we report the test error on 10,000 images.
As discussed in subsection 3.1, using $n$ scales in a scale-invariant convolution layer that has $m$ kernels resembles a network that has $nm$ kernels, without actually having to increase the number of parameters by $n$ times. One of the biggest disadvantages of ConvNets is that they require more training data as the number of parameters increases. By keeping the number of parameters fixed, SI-ConvNets can train an $n$ times wider, and thus more powerful, network with a less demanding amount of training data. Being able to share information between the same patterns at multiple scales further allows SI-ConvNets to learn better features with less data. In contrast, ConvNets learn multiple filters for the same pattern independently, and learning those filters well requires many examples of that pattern at each scale.
Figure (a) plots test error as the number of feature maps is varied, where SI-ConvNets consistently outperform ConvNets. Since, given enough feature maps, ConvNets can learn a feature detector for each scale, we observe that the gap decreases as the number of feature maps increases. Figure (b) plots test error as the amount of training data changes. Again, SI-ConvNet consistently achieves lower error than ConvNet, and the gap decreases as training data increases. This shows that SI-ConvNets can learn a better model with less training data.
In the following two experiments, we increase the image sizes of the MNIST-scale dataset from 28x28 to 40x40 so that we can experiment with a wider range of scale variation. In order to account for the larger scale range, we change the scales used in SI-ConvNet from 5 scales in [0.6-2] to 5 scales in [0.5 - 2.7] for these two experiments. We train and test on 10,000 images.
First we evaluate the ability to correctly classify images that are less common in the training data. Here the training data is scaled by factors sampled from a Gaussian rather than a uniform distribution. The digits in the test data are scaled to one particular scale factor, and we vary the test scale factor between [0.4, 1.6], the ends of which are about two standard deviations away from the mean. The further the test scale factor is from the mean, the more challenging the problem gets, since few training samples have been observed at that scale during training. We expect ConvNet and SI-ConvNet to do similarly around the mean, but ConvNet to get progressively worse as the scale moves away from the mean, as it cannot reuse the filters it learned for inputs of different scales. As shown in Figure (a), our results verify this trend, with SI-ConvNet outperforming ConvNet even at the mean. The average reduction in relative error is 25%, and at the two ends of the scale range, 0.4 and 1.6, the relative error reduction is 20% and 47% respectively. The lack of symmetry around the mean is possibly due to the fact that digit classification becomes extremely difficult even for humans when digits are very small. (For example, at the scale of 0.4, the actual digit sizes are around 8x8.)
Next we evaluate robustness to scale variation by increasing the range of scales present in the training and test data while keeping the number of parameters and training samples fixed. The scale factors are sampled from a uniform distribution whose range is progressively widened. The results in Figure (b) show that SI-ConvNets consistently outperform ConvNets, and that the error of SI-ConvNets increases at a lower rate than that of ConvNets as the scale variation increases. This shows the weakness of ConvNets, which have to learn redundant filters for digits that come at a wide variety of scales, and that SI-ConvNets make more efficient use of their resources in terms of the number of parameters and training data.
(a) Test error vs scale of digits at test time, where digits in the training data are scaled by a factor sampled from a Gaussian distribution centered at 1. The two extreme ends are about 2 standard deviations away from the mean. Classification is more challenging the further the scales are from the mean, since fewer training samples were available. The large gap between SI-ConvNets and ConvNets at the ends shows that SI-ConvNets are more robust to images at unfamiliar scales. (b) Test error vs range of the uniform distribution used to scale training data. The lower growth of error for SI-ConvNets shows that they can learn better features given the same resources when the data complexity increases.
We introduced an architecture that allows locally scale-invariant feature learning and representation in convolutional neural nets. By sharing the weights across multiple scales and locations, a single feature detector can capture that feature at arbitrary scales and locations. We achieve locally scale-invariant feature representation by pooling detector responses over multiple scales. Our architecture is different from previous approaches in that scale-invariance is built into each convolution layer independently. Because we maintain the same number of parameters as traditional ConvNets while incorporating the scale prior, we can learn features more efficiently with reduced chances of over-fitting. Our experiments show that SI-ConvNets outperform ConvNets in various aspects.
In order to align the notation of back-propagation to that of an ordinary multi-layer neural network, we re-write each step in the forward propagation of a scale-invariant convolution layer as a matrix-vector multiplication.
Let $x^l$ be a vectorized input at layer $l$ of length $d$ ($d = N^2 m$ for an $N$ by $N$ by $m$ image). The spatial transformation can be written as a matrix-vector multiplication by a $\hat{d}$ by $d$ matrix $T$ that encodes the interpolation coefficients, where $d$ and $\hat{d}$ are the dimensionality of the original and the transformed input respectively. With bilinear interpolation, each row of $T$ has 4 non-zero coefficients.
The convolution operation can be written as a matrix-vector multiplication by encoding $W$ as a Toeplitz matrix. Then, the forward propagation at layer $l$ is

$$\hat{z}_i = W T_i x^l + b, \qquad z_i = \hat{T}_i \hat{z}_i, \qquad x^{l+1} = \sigma\left(\max_{i} z_i\right),$$

where $T_i$ is the matrix encoding the $i$-th transformation, where images are scaled to different sizes, and $\hat{T}_i$ is the matrix encoding the $i$-th inverse transformation used to align the responses of convolution on each scale. $W$ is the kernel matrix encoded as a Toeplitz matrix.
Then, the error signal $\delta^{l+1}$ arriving from layer $l+1$ can be propagated by the equations

$$\delta_{z_i} = \mathbb{1}\left[i = \arg\max_{j} z_j\right] \odot \sigma'\left(\max_{j} z_j\right) \odot \delta^{l+1} \tag{3}$$
$$\delta_{\hat{x}_i} = W^\top \hat{T}_i^\top \delta_{z_i} \tag{4}$$
$$\delta^{l} = \sum_{i=1}^{n} T_i^\top \delta_{\hat{x}_i} \tag{5}$$

where $\odot$ is the element-wise multiplication and $\delta_{\hat{x}_i}$ is the error with respect to the $i$-th transformed input $T_i x^l$. Equation (3) applies the derivative of the activation function to the error signal and distributes the error into $n$ separate errors using the argmax indices to un-pool the max-pooling stage. Equation (4) propagates the error based on the linear weights $W$ and $\hat{T}_i$, similar to the way the error is propagated in a traditional ConvNet. Then in Equation (5), the error is propagated through the initial spatial transformation $T_i$ and is accumulated to complete the propagation of this layer.
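To make the matrix view concrete, the sketch below builds a bilinear interpolation matrix with at most four non-zero coefficients per row and applies both it and its transpose, which is what the backward pass through a spatial transformation needs. The sparse dictionary representation, the corner-aligned sampling, and all function names are assumptions made for illustration.

```python
def bilinear_matrix(n_in, n_out):
    """Sparse rows of the (n_out*n_out) x (n_in*n_in) interpolation matrix.

    Each row is a {column: coefficient} dict with at most 4 entries.
    """
    def src(o):
        return 0.0 if n_out == 1 else o * (n_in - 1) / (n_out - 1)

    rows = []
    for i in range(n_out):
        y = src(i)
        y0 = min(int(y), n_in - 1)
        y1 = min(y0 + 1, n_in - 1)
        fy = y - y0
        for j in range(n_out):
            x = src(j)
            x0 = min(int(x), n_in - 1)
            x1 = min(x0 + 1, n_in - 1)
            fx = x - x0
            row = {}
            for r, c, wgt in [(y0, x0, (1 - fy) * (1 - fx)), (y0, x1, (1 - fy) * fx),
                              (y1, x0, fy * (1 - fx)), (y1, x1, fy * fx)]:
                if wgt != 0.0:
                    row[r * n_in + c] = row.get(r * n_in + c, 0.0) + wgt
            rows.append(row)
    return rows


def matvec(rows, x):
    """Forward: transformed input = T x."""
    return [sum(w * x[c] for c, w in row.items()) for row in rows]


def rmatvec(rows, delta, n_cols):
    """Backward: error w.r.t. the original input = T^T delta."""
    out = [0.0] * n_cols
    for d, row in zip(delta, rows):
        for c, w in row.items():
            out[c] += w * d
    return out


T = bilinear_matrix(2, 3)                       # upscale a 2x2 image to 3x3
scaled = matvec(T, [1.0, 2.0, 3.0, 4.0])        # forward transformation
back = rmatvec(T, [1.0] * 9, n_cols=4)          # error routed back to 4 pixels
```

Since the coefficients depend only on the scale factor and the image size, the rows can be computed once and reused for every forward and backward pass.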
We discuss the increase in the number of convolution operations required in a scale-invariant convolution layer compared to a traditional convolution layer.
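Given an $N \times N$ input image and a kernel of size $k \times k$, a traditional convolution layer computes the linear combination of the kernel and a local region $(N - k + 1)^2$ times (using “valid” convolution at the borders). For a scale-invariant convolution layer that uses $n$ scales at a step size of $s$ whose largest scale factor is $s^{q}$, the input image is scaled to $n$ different images of size $s^{j} N \times s^{j} N$, for $j = q - n + 1, \dots, q$.
So the number of linear combinations to be computed on all of the scaled inputs is

$$\sum_{j=q-n+1}^{q} \left(s^{j} N - k + 1\right)^2 = N^2 \sum_{j=q-n+1}^{q} s^{2j} - 2N(k-1) \sum_{j=q-n+1}^{q} s^{j} + n(k-1)^2.$$

The summation of the quadratic terms is a geometric series,

$$\sum_{j=q-n+1}^{q} s^{2j}.$$

Since $s > 1$, the series sums to

$$\frac{s^{2(q+1)} - s^{2(q-n+1)}}{s^2 - 1}.$$

Therefore, the number of linear combinations computed in a single scale-invariant convolution layer is $O\left(\frac{s^{2(q+1)}}{s^2 - 1} N^2\right)$.
For example, using the values $s = 2^{1/3}$, $q = 3$ and $n = 6$ that are used in our experiments, the series sums to approximately 10.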
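The geometric-series argument can be checked numerically. The sketch below sums the squared scale factors directly and compares the result against the standard closed form for a geometric series. The parameter values (a step of `2 ** (1 / 3)`, six scales, largest scale factor 2) are assumptions chosen to match six scales between roughly 0.6 and 2, and the function names are hypothetical.

```python
def squared_scale_sum(step, largest_exp, num_scales):
    """Sum of squared scale factors step**j over the num_scales largest exponents."""
    exponents = range(largest_exp - num_scales + 1, largest_exp + 1)
    return sum(step ** (2 * j) for j in exponents)


def squared_scale_sum_closed_form(step, largest_exp, num_scales):
    """Closed form of the same geometric series with ratio step**2."""
    r = step ** 2
    first = largest_exp - num_scales + 1
    return (r ** (largest_exp + 1) - r ** first) / (r - 1)


step = 2 ** (1 / 3)   # assumed scale step
direct = squared_scale_sum(step, largest_exp=3, num_scales=6)
closed = squared_scale_sum_closed_form(step, largest_exp=3, num_scales=6)
print(round(direct, 2), round(closed, 2))
```

Under these assumptions, the scaled inputs together contain about ten times as many pixels as the original image, which is roughly the factor by which the convolution cost grows for large images.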
Here we list the details of networks used in the experiments.
All inputs are pre-processed by subtracting the training mean and the pixel values are scaled to the [0, 1] range.
The base setup is a three layer network. All experiments have this architecture unless specified otherwise. The first layer is a (SI-)convolution layer with a 7x7 kernel at a stride of one and 36 feature maps with ReLU activation, followed by max-pooling of stride two. The second layer is another (SI-)convolution layer with a 5x5 kernel and 64 feature maps with ReLU and max-pooling of stride three. The third layer is a fully connected layer with 150 hidden variables with ReLU, and this final output is sent to the logistic regression layer of size 10. The network is optimized by stochastic gradient descent with a mini-batch size of 128 and a fixed learning rate of 0.01. Momentum of 0.9 and weight decay of 0.0001 are used as regularization. Networks are trained for 700 epochs.
We tuned the kernel size of the first layer and the weight decay on a 10k validation set. After the parameters were set, all training data were used to obtain the final network. For the experiments that are run on images, we found the best kernel size of the first layer to be . The configuration files that contain the hyper-parameters and architectures for each experiment will be available with the source code.
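The optimizer settings listed above (learning rate 0.01, momentum 0.9, weight decay 0.0001) can be illustrated with a single update step. The update rule below is the common formulation in which weight decay is added to the gradient before the momentum update. It is an assumption about the convention, not a transcription of the training code, and the function name is hypothetical.

```python
def sgd_momentum_step(weights, grads, velocity,
                      lr=0.01, momentum=0.9, weight_decay=0.0001):
    """One SGD step with momentum and L2 weight decay on flat parameter lists."""
    new_velocity = [momentum * v - lr * (g + weight_decay * w)
                    for w, g, v in zip(weights, grads, velocity)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# One step from rest: the weight moves against its (decayed) gradient.
w, v = sgd_momentum_step([1.0], [0.5], [0.0])
# A second step with the same gradient moves further because of momentum.
w2, v2 = sgd_momentum_step(w, [0.5], v)
```

In a real training loop the gradients would be averaged over each mini-batch of 128 images before this step is applied.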