Inverse Compositional Spatial Transformer Networks (CVPR 2017)
In this paper, we establish a theoretical connection between the classical Lucas & Kanade (LK) algorithm and the emerging topic of Spatial Transformer Networks (STNs). STNs are of interest to the vision and learning communities due to their natural ability to combine alignment and classification within the same theoretical framework. Inspired by the Inverse Compositional (IC) variant of the LK algorithm, we present Inverse Compositional Spatial Transformer Networks (IC-STNs). We demonstrate that IC-STNs can achieve better performance than conventional STNs with less model capacity; in particular, we show superior performance in pure image alignment tasks as well as joint alignment/classification problems on real-world data.
Recent rapid advances in deep learning are allowing for the learning of complex functions through convolutional neural networks (CNNs), which have achieved state-of-the-art performance in a plethora of computer vision tasks [9, 17, 4]. Most networks learn to tolerate spatial variations through (a) spatial pooling layers and/or (b) data augmentation techniques; however, these approaches come with several drawbacks. Data augmentation (i.e. the synthetic generation of new training samples through geometric distortion according to a known noise model) is probably the oldest and best known strategy for increasing spatial tolerance within a visual learning system. This is problematic as it can often require an exponential increase in the number of training samples and thus in the capacity of the model to be learned. Spatial pooling operations can partially alleviate this problem as they naturally encode spatial invariance within the network architecture and use sub-sampling to reduce the capacity of the model. However, the range of geometric variation they can tolerate is intrinsically limited; furthermore, such pooling operations destroy spatial details within the images that could be crucial to the performance of subsequent tasks.
Instead of designing a network to solely give tolerance to spatial variation, another option is to have the network solve for some of the geometric misalignment in the input images [12, 6]. Such a strategy only makes sense, however, if it has lower capacity and computational cost as well as better performance than traditional spatially invariant CNNs. Spatial Transformer Networks (STNs) are one of the first notable attempts to integrate low-capacity and computationally efficient strategies for resolving, instead of tolerating, misalignment with classical CNNs. Jaderberg et al. presented a novel strategy for integrating image warping within a neural network and showed that such operations are (sub-)differentiable, allowing for the application of canonical backpropagation to an image warping framework.
The problem of learning a low-capacity relationship between image appearance and geometric distortion is not new in computer vision. Over three and a half decades ago, Lucas & Kanade (LK) proposed the seminal algorithm for gradient descent image alignment. The LK algorithm can be interpreted as a feed-forward network of multiple alignment modules; specifically, each alignment module contains a low-capacity predictor (typically linear) for predicting geometric distortion from relative image appearance, followed by an image resampling/warp operation. The LK algorithm differs fundamentally from STNs, however, in its application: image/object alignment instead of classification.
Putting applications to one side, the LK and STN frameworks share quite similar characteristics, with one critical exception. In an STN with multiple feed-forward alignment modules, the output image of the previous alignment module is directly fed into the next. As we will demonstrate in this paper, this is problematic as it can create unwanted boundary effects as the number of geometric prediction layers increases. The LK algorithm does not suffer from such problems; instead, it feeds the warp parameters through the network (instead of the warped image) such that each subsequent alignment module in the network resamples the original input source image. Furthermore, the Inverse Compositional (IC) variant of the LK algorithm has been demonstrated to achieve equivalently effective alignment by reusing the same geometric predictor in a compositional update form.
Inspired by the IC-LK algorithm, we advocate an improved extension to the STN framework that (a) propagates warp parameters, rather than image intensities, through the network, and (b) employs a single geometric predictor that is reapplied in all alignment modules. We propose Inverse Compositional Spatial Transformer Networks (IC-STNs) and show their superior performance over the original STNs across a myriad of tasks, including pure image alignment and joint alignment/classification problems.
We organize the paper as follows: we give a general review of efficient image/object alignment in Sec. 2 and an overview of Spatial Transformer Networks in Sec. 3. We describe our proposed IC-STNs in detail in Sec. 4 and show experimental results for different applications in Sec. 5. Finally, we draw our conclusions in Sec. 6.
In this section, we give a review of nominal approaches to efficient and low-capacity image/object alignment.
The Lucas & Kanade (LK) algorithm has been a popular approach for tackling dense alignment problems for images and objects. For a given geometric warp function parameterized by the warp parameters $\mathbf{p}$, one can express the LK algorithm as minimizing the sum of squared differences (SSD) objective in the image space,

$$\min_{\Delta\mathbf{p}} \left\| \mathcal{I}(\mathbf{p} + \Delta\mathbf{p}) - \mathcal{T}(\mathbf{0}) \right\|_2^2 , \tag{1}$$

where $\mathcal{I}$ is the source image, $\mathcal{T}$ is the template image to align against, and $\Delta\mathbf{p}$ is the warp update being estimated. Here, we denote $\mathcal{I}(\mathbf{p})$ as the image $\mathcal{I}$ warped with the parameters $\mathbf{p}$. The LK algorithm assumes an approximately linear relationship between appearance and geometric displacements; specifically, it linearizes (1) by taking the first-order Taylor approximation as

$$\min_{\Delta\mathbf{p}} \left\| \mathcal{I}(\mathbf{p}) + \frac{\partial \mathcal{I}(\mathbf{p})}{\partial \mathbf{p}} \Delta\mathbf{p} - \mathcal{T}(\mathbf{0}) \right\|_2^2 . \tag{2}$$

The warp parameters are thus additively updated through $\mathbf{p} \leftarrow \mathbf{p} + \Delta\mathbf{p}$, which can be regarded as a quasi-Newton update. The term $\frac{\partial \mathcal{I}(\mathbf{p})}{\partial \mathbf{p}}$, known as the steepest descent image, is the composition of image gradients and the predefined warp Jacobian, where the image gradients are typically estimated through finite differences. As the true relationship between appearance and geometry is seldom linear, the warp update must be iteratively estimated and applied until convergence is reached.
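A fundamental problem with the canonical LK formulation, which employs additive updates of the warp parameters, is that $\frac{\partial \mathcal{I}(\mathbf{p})}{\partial \mathbf{p}}$ must be recomputed on the rewarped images for each iteration, greatly impacting computational efficiency. Baker and Matthews devised a computationally efficient variant of the LK algorithm, which they referred to as the Inverse Compositional (IC) algorithm. The IC-LK algorithm reformulates (1) to predict the warp update to the template image instead, written as

$$\min_{\Delta\mathbf{p}} \left\| \mathcal{I}(\mathbf{p}) - \mathcal{T}(\Delta\mathbf{p}) \right\|_2^2 , \tag{3}$$

and the linearized least-squares objective is thus formed as

$$\min_{\Delta\mathbf{p}} \left\| \mathcal{I}(\mathbf{p}) - \mathcal{T}(\mathbf{0}) - \frac{\partial \mathcal{T}(\mathbf{0})}{\partial \mathbf{p}} \Delta\mathbf{p} \right\|_2^2 . \tag{4}$$

The least-squares solution is given by

$$\Delta\mathbf{p} = \left( \frac{\partial \mathcal{T}(\mathbf{0})}{\partial \mathbf{p}} \right)^{\dagger} \left( \mathcal{I}(\mathbf{p}) - \mathcal{T}(\mathbf{0}) \right) , \tag{5}$$

where the superscript $\dagger$ denotes the Moore-Penrose pseudo-inverse operator. This is followed by the inverse compositional update $\mathbf{p} \leftarrow \mathbf{p} \circ (\Delta\mathbf{p})^{-1}$, where we abbreviate the notation $\circ$ to be the composition of warp functions parameterized by $\mathbf{p}$, and $(\Delta\mathbf{p})^{-1}$ denotes the parameters of the inverse warp function parameterized by $\Delta\mathbf{p}$.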
The least-squares solutions of LK and IC-LK (5) are in the form of linear regression, which can be more generically expressed as

$$\Delta\mathbf{p} = \mathbf{R} \cdot \mathcal{I}(\mathbf{p}) + \mathbf{b} , \tag{6}$$

where $\mathbf{R}$ is a linear regressor establishing the linear relationship between appearance and geometry, and $\mathbf{b}$ is the bias term. Therefore, LK and IC-LK can be interpreted as belonging to the category of cascaded linear regression approaches for image alignment.
It has been shown that the IC form of LK is effectively equivalent to the original form; the advantage of the IC form lies in the efficiency of computing the fixed steepest descent image $\frac{\partial \mathcal{T}(\mathbf{0})}{\partial \mathbf{p}}$ in the least-squares objective. Specifically, it is evaluated on the static template image at the identity warp and remains constant across iterations, and thus so does the resulting linear regressor $\mathbf{R}$. This gives an important theoretical proof of concept that a fixed predictor of geometric updates can be successfully employed within an iterative image/object alignment strategy, further reducing unnecessary model capacity.
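The fixed-regressor idea can be illustrated with a minimal NumPy sketch of IC-LK for the simplest possible warp, a 1-D translation. Everything here is a toy assumption made for illustration (the Gaussian-bump template, the names `template_fn` and `ic_lk_translation`, and the use of an analytic source signal instead of interpolated pixels); it is not the implementation used in the paper.

```python
import numpy as np

def template_fn(x, width=3.0):
    """Smooth 1-D 'image': a Gaussian bump (an illustrative choice)."""
    return np.exp(-0.5 * (x / width) ** 2)

def ic_lk_translation(source_fn, x, num_iters=10):
    """Estimate a shift p such that source_fn(x + p) matches template_fn(x)."""
    template = template_fn(x)
    # Steepest descent image: for a pure shift it is just the template gradient,
    # estimated here with finite differences.
    steepest_descent = np.gradient(template, x)[:, None]    # shape (N, 1)
    # Linear regressor (pseudo-inverse), computed ONCE and reused every iteration.
    regressor = np.linalg.pinv(steepest_descent)             # shape (1, N)
    p = 0.0
    for _ in range(num_iters):
        warped = source_fn(x + p)                            # source warped by p
        delta_p = (regressor @ (warped - template)).item()   # least-squares update
        # Inverse compositional update: for shifts, composing p with the
        # inverse of delta_p reduces to a subtraction.
        p = p - delta_p
    return p

x = np.linspace(-20.0, 20.0, 401)
true_shift = 1.5
source_fn = lambda t: template_fn(t - true_shift)   # template moved by true_shift
estimate = ic_lk_translation(source_fn, x)
print(f"estimated shift: {estimate:.4f} (true shift: {true_shift})")
```

Note that the regressor never touches the source image: it depends only on the template, which is why it can be precomputed and shared across iterations.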
More generally, cascaded regression approaches for alignment can be learned from data given that the distribution of warp displacements is known a priori. A notable example of this kind of approach is the Supervised Descent Method (SDM), which aims to learn the series of linear geometric predictors from data. The formulation of SDM's learning objective is

$$\min_{\mathbf{R}, \mathbf{b}} \sum_{n=1}^{N} \sum_{j=1}^{M} \left\| \delta\mathbf{p}_{n,j} - \mathbf{R} \cdot \mathcal{I}_n(\mathbf{p} \circ \delta\mathbf{p}_{n,j}) - \mathbf{b} \right\|_2^2 , \tag{7}$$

where $\delta\mathbf{p}$ is the geometric displacement drawn from a known generating distribution using Monte Carlo sampling, $n$ indexes the $N$ training images, and $M$ is the number of synthetically created examples for each image. Here, the image appearance $\mathcal{I}$ is often replaced with a predefined feature extraction function (e.g. SIFT or HOG) of the image. This least-squares objective is typically solved with added regularization (e.g. ridge regression) to ensure the matrix is well conditioned.
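As a rough illustration of how one such linear predictor could be learned, the sketch below fits a regressor and bias by ridge regression on synthetically displaced samples. The toy 1-D appearance model, the function names, the sample count, and the ridge weight are all assumptions chosen for the example; SDM in practice operates on image features and learns a cascade of such models.

```python
import numpy as np

def appearance(x, shift, width=3.0):
    """Toy appearance vector: a Gaussian bump observed under displacement `shift`."""
    return np.exp(-0.5 * ((x + shift) / width) ** 2)

def fit_linear_predictor(features, displacements, ridge=1e-3):
    """Closed-form ridge solution of min_{R,b} sum ||dp - R f - b||^2 + ridge * ||R||^2."""
    num_samples, dim = features.shape
    design = np.hstack([features, np.ones((num_samples, 1))])   # extra column for the bias
    penalty = ridge * np.eye(dim + 1)
    penalty[-1, -1] = 0.0                                       # leave the bias unregularized
    weights = np.linalg.solve(design.T @ design + penalty, design.T @ displacements)
    return weights[:-1].T, weights[-1]                          # R: (K, D), b: (K,)

rng = np.random.default_rng(0)
x = np.linspace(-10.0, 10.0, 81)
# Monte Carlo sampling of displacements from a known generating distribution.
shifts = rng.normal(0.0, 0.5, size=(200, 1))
features = np.stack([appearance(x, s) for s in shifts[:, 0]])

R, b = fit_linear_predictor(features, shifts)
predicted = R @ appearance(x, 0.3) + b
print(f"predicted displacement for a true displacement of 0.3: {predicted[0]:.3f}")
```

In a full cascade, the training samples for the next regressor would first be passed through the regressors already learned, as described in the following paragraph.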
SDM is learned in a sequential manner, i.e. the training data for learning the next linear model is drawn from the same generating distribution and applied through the previously learned regressors. This has been a popular approach for its simplicity and effectiveness across various alignment tasks, leading to a large number of variants [15, 1, 11] of similar frameworks. Like the LK and IC-LK algorithms, SDM is another example of employing multiple low-capacity models to establish the nonlinear relationship between appearance and geometry. We draw the readers’ attention to  for a more formally established link between LK and SDM.
It is widely agreed that computer vision problems can be solved much more efficiently if misalignment among data is eliminated. Although SDM learns alignment from data and guarantees optimal solutions after each applied linear model, it is not clear whether such alignment learned in a greedy fashion is optimal for the subsequent tasks at hand, e.g. classification. In order to optimize in terms of the final objective, it would be more favorable to parameterize the model as a deep neural network and optimize the entire model using backpropagation.
In the rapidly emerging field of deep learning, along with the explosion of available collected data, deep neural networks have enjoyed huge success in various vision problems. Nevertheless, there had not been a principled way of resolving geometric variations in the given data. The recently proposed Spatial Transformer Networks perform spatial transformations on images or feature maps with a (sub-)differentiable module. This has the effect of reducing geometric variations inside the data and has attracted great attention in the deep learning community.
In the feed-forward sense, a Spatial Transformer warps an image conditioned on the input. This can be mathematically written as

$$\mathcal{I}_{out}(\mathbf{0}) = \mathcal{I}_{in}(\mathbf{p}), \quad \text{where } \mathbf{p} = f(\mathcal{I}_{in}(\mathbf{0})) . \tag{8}$$

Here, the nonlinear function $f$ is parametrized as a learnable geometric predictor (termed the localization network in the original paper), which predicts the warp parameters from the input image. We note that the "grid generator" and the "sampler" from the original paper can be combined into a single warp function. We can see that for the special case where the geometric predictor consists of a single linear layer, $f$ would consist of a linear regressor $\mathbf{R}$ as well as a bias term $\mathbf{b}$, resulting in a geometric predictor of the same form as (6). This insight elegantly links the STN and LK/SDM frameworks together.
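The feed-forward pass can be sketched in NumPy as below, with the grid generator and bilinear sampler merged into one `warp_image` function and the geometric predictor reduced to a single linear layer. The affine parameterization (zero parameters meaning the identity warp), the normalized [-1, 1] coordinates, zero fill outside the source, and the placeholder weights are assumptions of this sketch rather than details taken from the paper.

```python
import numpy as np

def affine_matrix(p):
    """Affine warp matrix for p = (p1..p6); p = 0 gives the identity (assumed convention)."""
    p1, p2, p3, p4, p5, p6 = p
    return np.array([[1.0 + p1, p2, p3],
                     [p4, 1.0 + p5, p6],
                     [0.0, 0.0, 1.0]])

def warp_image(image, p):
    """Grid generator and bilinear sampler in one step; zeros outside the source image."""
    height, width = image.shape
    ys, xs = np.meshgrid(np.linspace(-1.0, 1.0, height),
                         np.linspace(-1.0, 1.0, width), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel(), np.ones(height * width)])
    src = affine_matrix(p) @ grid                    # sampling location of each output pixel
    sx = (src[0] + 1.0) * (width - 1) / 2.0          # normalized -> pixel coordinates
    sy = (src[1] + 1.0) * (height - 1) / 2.0
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    fx, fy = sx - x0, sy - y0
    out = np.zeros(height * width)
    neighbors = [(0, 0, (1 - fx) * (1 - fy)), (0, 1, fx * (1 - fy)),
                 (1, 0, (1 - fx) * fy), (1, 1, fx * fy)]
    for dy, dx, weight in neighbors:
        xi, yi = x0 + dx, y0 + dy
        valid = (xi >= 0) & (xi < width) & (yi >= 0) & (yi < height)
        out[valid] += weight[valid] * image[yi[valid], xi[valid]]
    return out.reshape(height, width)

def spatial_transformer(image, regressor, bias):
    """Predict p from the input with one linear layer, then warp the input by p."""
    p = regressor @ image.ravel() + bias
    return warp_image(image, p), p

rng = np.random.default_rng(0)
image = rng.random((8, 8))
# Untrained placeholder weights: zero regressor, bias encoding a 2x zoom-out.
regressor = np.zeros((6, image.size))
bias = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
warped, p = spatial_transformer(image, regressor, bias)
print("predicted warp parameters:", p)
```

With the zoom-out bias, the corners of `warped` sample from outside the source and come back as zeros, which is the same loss of information that the boundary-effect discussion later in the paper is concerned with.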
Fig. 1 shows the basic architecture of STNs. STNs are of great interest in that transformation predictions can be learned while also showing that grid sampling functions can be (sub-)differentiable, allowing for backpropagation within an end-to-end learning framework.
Despite the similarities STNs have with classic alignment algorithms, there exist some fundamental drawbacks in comparison to LK/SDM. For one, an STN attempts to directly predict the optimal geometric transformation with a single geometric predictor and does not take advantage of multiple lower-capacity models to achieve more efficient alignment before classification. Although it has been demonstrated that multiple Spatial Transformers can be inserted between feature maps, how much this improves performance is not well understood. In addition, we can observe from (8) that no information about the geometric warp is preserved after the output image; this leads to a boundary effect when resampling outside the input source image. A detailed treatment of this issue is provided in Sec. 4.1.
In this work, we aim to improve upon STNs by theoretically connecting them to the LK algorithm. We show that employing multiple low-capacity models as in LK/SDM for learning spatial transformation within a deep network yields substantial improvement on the subsequent task at hand. We further demonstrate the effectiveness of learning a single geometric predictor for recurrent transformation and propose the Inverse Compositional Spatial Transformer Networks (IC-STNs), which exhibit significant improvements over the original STN on various problems.
One of the major drawbacks of the original Spatial Transformer architecture (Fig. 1) is that the output image samples only from the cropped input image; pixel information outside the cropped region is discarded, introducing a boundary effect. Fig. 2 illustrates the phenomenon.
We can see from Fig. 2(d) that this effect is visible for STNs in zoom-out transformations where pixel information outside the bounding box is required. This is due to the fact that geometric information is not preserved after the spatial transformations. In the scenario of iterative alignment, boundary effects accumulate with each zoom-out transformation. Although this is less of an issue for images with clean backgrounds, it is problematic for real images.
A series of spatial transformations, however, can be composed and described with exact expressions. Fig. 3 illustrates an improved alignment module, which we refer to as compositional STNs (c-STNs). Here, the geometric transformation is also predicted from a geometric predictor, but the warp parameters are kept track of, composed, and passed through the network instead of the warped images. It is important to note that if one were to incorporate a cascade of multiple Spatial Transformers, the geometric transformations would be implicitly composed through multiple resamplings of the images. We advocate that these transformations can and should be explicitly defined and composed. Unlike the Spatial Transformer module in Fig. 1, the geometry is preserved in the warp parameters $\mathbf{p}$ instead of being absorbed into the output image. Furthermore, c-STNs allow repeated concatenation, illustrated in Fig. 4, where updates to the warp can be iteratively predicted. This eliminates the boundary effect because pixel information outside the cropped image is also preserved until the final transformation.
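The difference between resampling images in a cascade and composing warps before resampling once can be seen in a small 1-D toy example. The signal, the two scale factors, and the zero fill outside the domain are assumptions made for illustration; pure scalings are used so that the order of composition does not matter here.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)
signal = np.cos(3.0 * x) + 2.0            # 1-D "image", nonzero everywhere

def resample(values, scale):
    """Sample `values` at scale * x; anything outside [-1, 1] is lost (zero fill)."""
    return np.interp(scale * x, x, values, left=0.0, right=0.0)

zoom_in, zoom_out = 0.5, 2.0

# STN-style cascade: each module resamples the previous module's OUTPUT image,
# so pixels cropped away by the zoom-in cannot be recovered by the zoom-out.
cascade = resample(resample(signal, zoom_in), zoom_out)

# c-STN-style: compose the warps first (0.5 * 2.0 = 1.0), then resample the
# ORIGINAL signal a single time with the composed warp.
composed = resample(signal, zoom_in * zoom_out)

outer = np.abs(x) > 0.51
print("max error, composed warp:", np.max(np.abs(composed - signal)))
print("cascade values in the outer region are all zero:", bool(np.all(cascade[outer] == 0.0)))
```

The cascade reproduces the signal only in the central region that survived the first crop, while the composed warp reproduces it everywhere.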
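The derivative of warp compositions can also be expressed in closed form. Consider the input and output warp parameters $\mathbf{p}_{in}$ and $\mathbf{p}_{out}$ in Fig. 3. Taking the case of affine warps for example, the parameters $\mathbf{p} = [p_1\ p_2\ p_3\ p_4\ p_5\ p_6]^\top$ are relatable to transformation matrices in homogeneous coordinates as

$$\mathbf{M}(\mathbf{p}) = \begin{bmatrix} 1 + p_1 & p_2 & p_3 \\ p_4 & 1 + p_5 & p_6 \\ 0 & 0 & 1 \end{bmatrix} . \tag{9}$$

From the definition of warp composition, the warp parameters are related to the transformation matrices through

$$\mathbf{M}(\mathbf{p}_{out}) = \mathbf{M}(\Delta\mathbf{p}) \cdot \mathbf{M}(\mathbf{p}_{in}) . \tag{10}$$

We can thus derive the derivative to be

$$\frac{\partial \mathbf{p}_{out}}{\partial \mathbf{p}_{in}} = \mathbf{I} + \begin{bmatrix} \Delta p_1 & 0 & 0 & \Delta p_2 & 0 & 0 \\ 0 & \Delta p_1 & 0 & 0 & \Delta p_2 & 0 \\ 0 & 0 & \Delta p_1 & 0 & 0 & \Delta p_2 \\ \Delta p_4 & 0 & 0 & \Delta p_5 & 0 & 0 \\ 0 & \Delta p_4 & 0 & 0 & \Delta p_5 & 0 \\ 0 & 0 & \Delta p_4 & 0 & 0 & \Delta p_5 \end{bmatrix} , \qquad \frac{\partial \mathbf{p}_{out}}{\partial \Delta\mathbf{p}} = \mathbf{I} + \begin{bmatrix} p_{in,1} & p_{in,4} & 0 & 0 & 0 & 0 \\ p_{in,2} & p_{in,5} & 0 & 0 & 0 & 0 \\ p_{in,3} & p_{in,6} & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & p_{in,1} & p_{in,4} & 0 \\ 0 & 0 & 0 & p_{in,2} & p_{in,5} & 0 \\ 0 & 0 & 0 & p_{in,3} & p_{in,6} & 0 \end{bmatrix} , \tag{11}$$

where $\mathbf{I}$ is the identity matrix. This allows the gradients to backpropagate into the geometric predictor.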
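The closed-form derivatives of an affine warp composition can be checked numerically. The sketch below assumes the parameterization in which $\mathbf{M}(\mathbf{p})$ has first row $(1+p_1, p_2, p_3)$ and second row $(p_4, 1+p_5, p_6)$, and the composition $\mathbf{M}(\mathbf{p}_{out}) = \mathbf{M}(\Delta\mathbf{p}) \cdot \mathbf{M}(\mathbf{p}_{in})$; both are stated here as assumptions of the example, and the helper names are hypothetical.

```python
import numpy as np

def to_matrix(p):
    """Affine parameters (p1..p6) -> 3x3 homogeneous matrix (assumed convention)."""
    return np.array([[1.0 + p[0], p[1], p[2]],
                     [p[3], 1.0 + p[4], p[5]],
                     [0.0, 0.0, 1.0]])

def to_params(m):
    """Inverse of to_matrix."""
    return np.array([m[0, 0] - 1.0, m[0, 1], m[0, 2], m[1, 0], m[1, 1] - 1.0, m[1, 2]])

def compose(p_in, delta_p):
    """Warp composition: M(p_out) = M(delta_p) @ M(p_in)."""
    return to_params(to_matrix(delta_p) @ to_matrix(p_in))

def jacobian_wrt_p_in(delta_p):
    """d p_out / d p_in: identity plus a block matrix built from delta_p."""
    block = np.array([[delta_p[0], delta_p[1]], [delta_p[3], delta_p[4]]])
    return np.eye(6) + np.kron(block, np.eye(3))

def jacobian_wrt_delta_p(p_in):
    """d p_out / d delta_p: identity plus a block-diagonal matrix built from p_in."""
    block = np.array([[p_in[0], p_in[3], 0.0],
                      [p_in[1], p_in[4], 0.0],
                      [p_in[2], p_in[5], 0.0]])
    return np.eye(6) + np.kron(np.eye(2), block)

def numeric_jacobian(fn, point, eps=1e-6):
    """Central finite differences, one column per input dimension."""
    columns = []
    for i in range(point.size):
        step = np.zeros_like(point)
        step[i] = eps
        columns.append((fn(point + step) - fn(point - step)) / (2.0 * eps))
    return np.stack(columns, axis=1)

rng = np.random.default_rng(0)
p_in = 0.3 * rng.standard_normal(6)
delta_p = 0.1 * rng.standard_normal(6)

num_in = numeric_jacobian(lambda q: compose(q, delta_p), p_in)
num_dp = numeric_jacobian(lambda q: compose(p_in, q), delta_p)
print("max |analytic - numeric| w.r.t. p_in:   ", np.max(np.abs(num_in - jacobian_wrt_p_in(delta_p))))
print("max |analytic - numeric| w.r.t. delta_p:", np.max(np.abs(num_dp - jacobian_wrt_delta_p(p_in))))
```

Because the composition is bilinear in the two parameter vectors, the finite-difference estimate agrees with the analytic form up to rounding error.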
It is interesting to note that the derivative $\frac{\partial \mathbf{p}_{out}}{\partial \mathbf{p}_{in}}$ derived in Sec. 4.1 has a form very similar to that in Residual Networks [4, 5], where the gradients contain the identity matrix and "residual components". This suggests that the warp parameters from c-STNs are generally insensitive to the vanishing gradient phenomenon provided that the predicted warp update $\Delta\mathbf{p}$ is small, and that it is possible to repeat the warp/composition operation a large number of times.
We also note that c-STNs are highly analogous to classic alignment algorithms. If each geometric predictor consists of a single linear layer, i.e. the appearance-geometry relationship is assumed to be linearly approximated, then it performs operations equivalent to the compositional LK algorithm. It is also related to SDM, where heuristic features such as SIFT are extracted before each regression layer. Therefore, c-STNs can be regarded as a generalization of LK and SDM, differing in that the features for predicting the warp updates can be learned from data and incorporated into an end-to-end learning framework.
Of all variants of the LK algorithm, the IC form has a very special property in that the linear regressor remains constant across iterations. The steepest descent image in (5) is independent of the input image and the current estimate of $\mathbf{p}$; therefore, it only needs to be computed once. In terms of model capacity, IC-LK further reduces the necessary learnable parameters compared to canonical LK, since the same regressor can be applied repeatedly and converges provided a good initialization. The main difference between canonical LK and IC-LK is that the warp update should be compositionally applied in the inverse form. We refer the readers to the original work for a full treatment of IC-LK, which is out of the scope of this paper.
This inspires us to propose the Inverse Compositional Spatial Transformer Network (IC-STN). Fig. 5 illustrates the recurrent module of IC-STN: the warp parameters $\mathbf{p}$ are iteratively updated by $\Delta\mathbf{p}$, which is predicted from the current warped image with the same geometric predictor. This allows one to recurrently predict spatial transformations on the input image. It is possible due to the close spatial proximity of pixel intensities within natural images: there exists high correlation between pixels that are close to each other.
In the IC-LK algorithm, the predicted warp parameters are inversely composed. Since the IC-STN geometric predictor is optimized in an end-to-end learning framework, we can absorb the inversion operation into the geometric predictor without explicitly defining it; in other words, IC-STNs are able to directly predict the inverse parameters. In our experiments, we find that explicitly performing an additional inverse operation on the predicted forward parameters makes a negligible difference, and that implicitly predicting the inverse parameters fits more elegantly in an end-to-end learning framework using backpropagation. We nevertheless name our proposed method Inverse Compositional, as IC-LK is where our inspiration is drawn from.
In practice, IC-STNs can be trained by unfolding the architecture in Fig. 5 multiple times into the form of c-STNs (Fig. 4), sharing the learnable parameters across all geometric predictors, and backpropagating the gradients as described in Sec. 4.1. This results in a single effective geometric predictor that can be applied multiple times before performing the final warp operation that suits subsequent tasks such as classification.
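A minimal sketch of the unfolded forward pass is given below, using a 1-D signal and a two-parameter warp (scale and shift) to keep the code short. The warp model, the composition order, the helper names, and the untrained random linear predictor are all assumptions for illustration; the point is only the control flow: one shared predictor, warp parameters composed at each step, and every resampling done on the original input.

```python
import numpy as np

def to_matrix(p):
    """1-D affine warp x -> (1 + a) * x + t as a 2x2 homogeneous matrix (toy stand-in)."""
    a, t = p
    return np.array([[1.0 + a, t], [0.0, 1.0]])

def compose(p, delta_p):
    """Compose the current warp with a predicted update and return the new parameters."""
    m = to_matrix(delta_p) @ to_matrix(p)
    return np.array([m[0, 0] - 1.0, m[0, 1]])

def warp(signal, x, p):
    """Resample the signal under warp p (zero fill outside the domain)."""
    a, t = p
    return np.interp((1.0 + a) * x + t, x, signal, left=0.0, right=0.0)

def ic_stn_forward(signal, x, predictor, num_warps):
    """Unfolded recurrent module: the same predictor is applied num_warps times."""
    p = np.zeros(2)                        # start from the identity warp
    for _ in range(num_warps):
        warped = warp(signal, x, p)        # always resample the ORIGINAL input
        delta_p = predictor(warped)        # shared weights across all steps
        p = compose(p, delta_p)            # propagate warp parameters, not pixels
    return warp(signal, x, p), p           # final warp feeds the subsequent task

x = np.linspace(-1.0, 1.0, 101)
signal = np.exp(-0.5 * ((x - 0.2) / 0.3) ** 2)
rng = np.random.default_rng(0)
weights = 0.001 * rng.standard_normal((2, x.size))   # untrained, hypothetical weights
predictor = lambda warped: weights @ warped
aligned, p_final = ic_stn_forward(signal, x, predictor, num_warps=4)
print("final warp parameters:", p_final)
```

In training, gradients would flow back through each `compose` call into the shared predictor; that part is omitted here.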
To start with, we explore the efficacy of IC-STN for planar alignment of a single image. We took an example image from the Caffe library and generated perturbed images with affine warps around the hand-labeled ground truth, shown in Fig. 6. We used image samples of size 50 × 50 pixels. The perturbed boxes are generated by adding i.i.d. Gaussian noise of standard deviation $\sigma$ (in pixels) to the four corners of the ground-truth box plus an additional translational noise from the same Gaussian distribution, and finally fitting the box to the initial warp parameters.
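One way such perturbations could be generated is sketched below: jitter the four corners, add a shared translation, and fit an affine map to the result by least squares. The corner coordinates, the direction of the fitted map (clean box to perturbed box), and the function names are assumptions of this sketch.

```python
import numpy as np

def fit_affine(src_pts, dst_pts):
    """Least-squares 2x3 affine map A such that dst ~ A @ [x, y, 1]."""
    design = np.hstack([src_pts, np.ones((len(src_pts), 1))])        # (4, 3)
    solution, *_ = np.linalg.lstsq(design, dst_pts, rcond=None)       # (3, 2)
    return solution.T                                                 # (2, 3)

def perturbed_affine(corners, sigma, rng):
    """Perturb each corner with i.i.d. Gaussian noise plus a shared translation, then fit."""
    corner_noise = rng.normal(0.0, sigma, size=corners.shape)
    translation = rng.normal(0.0, sigma, size=(1, 2))
    return fit_affine(corners, corners + corner_noise + translation)

# Corners of a 50 x 50 pixel box, listed clockwise from the top-left.
corners = np.array([[0.0, 0.0], [49.0, 0.0], [49.0, 49.0], [0.0, 49.0]])
rng = np.random.default_rng(0)
warp = perturbed_affine(corners, sigma=5.0, rng=rng)
print("fitted affine warp:\n", warp)
```

Since an affine map has six degrees of freedom and four corners give eight constraints, the fit is a least-squares compromise rather than an exact match to the perturbed corners.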
To demonstrate the effectiveness of iterative alignment under different amounts of noise, we consider IC-STNs that consist of a single learnable linear layer with different numbers of learned recurrent transformations. We optimize all networks in terms of the $\ell_2$ error between warp parameters with stochastic gradient descent and a batch size of 100 perturbed training samples generated on the fly.
The test error is listed in Table 1. We see from c-STN-1 (which is equivalent to IC-STN-1 with only one warp operation unfolded) that a single geometric warp predictor has limited ability to directly predict the optimal geometric transformation. Reusing the geometric predictor to incorporate multiple spatial transformations yields better alignment performance given the same model capacity.
Fig. 7 shows the test error over the number of warp operations applied to the learned alignment module. We can see that even when the recurrent spatial transformation is applied more times than it was trained with, the error continues to decrease until a point of saturation, which typically does not hold true for classical recurrent neural networks. This implies that IC-STN is able to capture the correlation between appearance and geometry to perform gradient descent on a learned cost surface for successful alignment.
In this section, we demonstrate how IC-STNs can be utilized in joint alignment/classification tasks. We choose the MNIST handwritten digit dataset, and we use a homography warp noise model to perturb the four corners of the image and translate them with Gaussian noise, both with a standard deviation of 3.5 pixels. We train all networks for 200K iterations with a batch size of 100 perturbed samples generated on the fly. We choose a constant learning rate of 0.01 for the classification subnetworks and 0.0001 for the geometric predictors, as we find the geometric predictor sensitive to large changes. We evaluate the classification accuracy on the test set using the same warp noise model.
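A homography noise model of this kind can be sketched by perturbing the four image corners and solving for the homography that maps the clean corners to the perturbed ones. The corner coordinates of a 28 × 28 image, the direct linear solve with the bottom-right matrix entry fixed to 1, and the function names are assumptions of the example.

```python
import numpy as np

def homography_from_corners(src, dst):
    """Solve the 8 unknowns of H (with H[2, 2] = 1) from four point correspondences."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    h = np.linalg.solve(np.array(rows), np.array(rhs))
    return np.append(h, 1.0).reshape(3, 3)

def apply_homography(H, pts):
    """Map 2-D points through H using homogeneous coordinates."""
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

rng = np.random.default_rng(0)
sigma = 3.5                                               # pixels, as in the noise model above
corners = np.array([[0.0, 0.0], [27.0, 0.0], [27.0, 27.0], [0.0, 27.0]])
corner_noise = rng.normal(0.0, sigma, size=(4, 2))        # independent per-corner noise
translation = rng.normal(0.0, sigma, size=(1, 2))         # shared translational noise
targets = corners + corner_noise + translation

H = homography_from_corners(corners, targets)
print("perturbation homography:\n", H)
```

Unlike the affine case, four correspondences determine a homography exactly, so the perturbed corners are reproduced without residual.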
Table 2. Test error, capacity (number of learnable parameters), and architecture of each network on the perturbed MNIST test set. P denotes a 2×2 max-pooling operation.

| Model | Test error | Capacity | Architecture |
|---|---|---|---|
| CNN(a) | 6.597 % | 39079 | conv(3×3, 3)-conv(3×3, 6)-P-conv(3×3, 9)-conv(3×3, 12)-FC(48)-FC(10) |
| STN(a) | 4.944 % | 39048 | [ conv(7×7, 4)-conv(7×7, 8)-P-FC(48)-FC(8) ]×1 → conv(9×9, 3)-FC(10) |
| c-STN-1(a) | 3.687 % | 39048 | [ conv(7×7, 4)-conv(7×7, 8)-P-FC(48)-FC(8) ]×1 → conv(9×9, 3)-FC(10) |
| c-STN-2(a) | 2.060 % | 38528 | [ conv(9×9, 4)-FC(8) ]×2 → conv(9×9, 3)-FC(10) |
| c-STN-4(a) | 1.476 % | 37376 | [ FC(8) ]×4 → conv(9×9, 3)-FC(10) |
| IC-STN-2(a) | 1.905 % | 39048 | [ conv(7×7, 4)-conv(7×7, 8)-P-FC(48)-FC(8) ]×2 → conv(9×9, 3)-FC(10) |
| IC-STN-4(a) | 1.230 % | 39048 | [ conv(7×7, 4)-conv(7×7, 8)-P-FC(48)-FC(8) ]×4 → conv(9×9, 3)-FC(10) |
| CNN(b) | 19.065 % | 19610 | conv(9×9, 2)-conv(9×9, 4)-FC(32)-FC(10) |
| STN(b) | 9.325 % | 18536 | [ FC(8) ]×1 → conv(9×9, 3)-FC(10) |
| c-STN-1(b) | 8.545 % | 18536 | [ FC(8) ]×1 → conv(9×9, 3)-FC(10) |
| IC-STN-2(b) | 3.717 % | 18536 | [ FC(8) ]×2 → conv(9×9, 3)-FC(10) |
| IC-STN-4(b) | 1.703 % | 18536 | [ FC(8) ]×4 → conv(9×9, 3)-FC(10) |
We compare IC-STN to several network architectures, including a baseline CNN with no spatial transformations, the original STN from Jaderberg et al., and c-STNs. All networks with spatial transformations employ the same classification network. The results as well as the architectural details are listed in Table 2. We can see that classical CNNs do not handle large spatial variations efficiently with data augmentation. In the case where the digits may be occluded, however, trading off capacity for a single deep predictor of geometric transformation also results in poor performance. Incorporating multiple transformers leads to a significant improvement in classification accuracy; further comparing c-STN-4(a) and IC-STN-4(b), we see that IC-STNs are able to trade a little accuracy for a large reduction in capacity compared to their non-recurrent counterparts.
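The capacity trade-off between c-STN-4(a) and IC-STN-4(b) can be made concrete by counting learnable parameters. The sketch below assumes 28 × 28 single-channel inputs, unpadded stride-1 convolutions, and one bias per output unit or channel; under these assumptions the counts match the capacities listed for the two networks (37376 and 18536). The helper names are hypothetical.

```python
def conv_params(kernel, channels_in, channels_out):
    """Weights plus one bias per output channel of a square convolution."""
    return kernel * kernel * channels_in * channels_out + channels_out

def fc_params(units_in, units_out):
    """Weights plus one bias per output unit of a fully connected layer."""
    return units_in * units_out + units_out

IMAGE_SIDE = 28                                   # MNIST image size (assumed input)

# Geometric predictor FC(8): the full image mapped to 8 homography parameters.
predictor = fc_params(IMAGE_SIDE * IMAGE_SIDE, 8)

# Classifier conv(9x9, 3)-FC(10), with an unpadded convolution: 28 -> 20 per side.
conv_side = IMAGE_SIDE - 9 + 1
classifier = conv_params(9, 1, 3) + fc_params(conv_side * conv_side * 3, 10)

# c-STN-4 learns four separate predictors; IC-STN-4 shares one across four warps.
c_stn_4a = 4 * predictor + classifier
ic_stn_4b = 1 * predictor + classifier

print("c-STN-4(a) capacity: ", c_stn_4a)
print("IC-STN-4(b) capacity:", ic_stn_4b)
```

Sharing the predictor removes three of the four predictor copies, which is where the roughly twofold reduction in capacity comes from.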
Fig. 8 shows how IC-STNs learn alignment for classification. In many cases where the handwritten digits are occluded, IC-STN is able to automatically warp the image and reveal the occluded information from the original image. There also exist smooth transitions during the alignment, which is consistent with the recurrent spatial transformation concept that IC-STN learns. Furthermore, we see that the output of the original STN consists of cropped digits due to the boundary effect described in Sec. 4.1.
We also visualize the overall final alignment performance by taking the mean and variance of the test set appearance before classification, shown in Fig. 9. The mean/variance results of the original STN become a down-scaled version of the original digits, reducing the information available for classification. From c-STN-1, we see that a single geometric predictor is poor at directly predicting geometric transformations. The variance among all aligned samples decreases dramatically when more warp operations are introduced in IC-STN. These results support the view that eliminating spatial variations within data is crucial to boosting the performance of subsequent tasks.
Here, we show how IC-STNs can be applied to real-world classification problems such as traffic sign recognition. We evaluate our proposed method on the German Traffic Sign Recognition Benchmark, which consists of 39,209 training and 12,630 test images from 43 classes taken under various conditions. We consider this a challenging task since many of the images are taken with motion blur and/or at resolutions as low as 15×15 pixels. We rescale all images and generate perturbed samples of size 36×36 pixels with the same homography warp noise model described in Sec. 5.2. The learning rate is set to 0.001 for the classification subnetworks and 0.00001 for the geometric predictors.
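Table 3. Classification error on the perturbed GTSRB test set, with the capacity (number of learnable parameters) and architecture of each network. P denotes a max-pooling operation.

| Model | Test error | Capacity | Architecture |
|---|---|---|---|
| CNN | 8.287 % | 200207 | conv(7×7, 6)-conv(7×7, 12)-P-conv(7×7, 24)-FC(200)-FC(43) |
| STN | 6.495 % | 197343 | [ conv(7×7, 6)-conv(7×7, 24)-FC(8) ]×1 → conv(7×7, 6)-conv(7×7, 12)-P-FC(43) |
| c-STN-1 | 5.011 % | 197343 | [ conv(7×7, 6)-conv(7×7, 24)-FC(8) ]×1 → conv(7×7, 6)-conv(7×7, 12)-P-FC(43) |
| IC-STN-2 | 4.122 % | 197343 | [ conv(7×7, 6)-conv(7×7, 24)-FC(8) ]×2 → conv(7×7, 6)-conv(7×7, 12)-P-FC(43) |
| IC-STN-4 | 3.184 % | 197343 | [ conv(7×7, 6)-conv(7×7, 24)-FC(8) ]×4 → conv(7×7, 6)-conv(7×7, 12)-P-FC(43) |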
We set the controlled model capacities to around 200K learnable parameters and perform similar comparisons to the MNIST experiment. Table 3 shows the classification error on the perturbed GTSRB test set. Once again, we see a considerable amount of classification improvement of IC-STN from learning to reuse the same geometric predictor.
Fig. 10 compares the aligned images from IC-STN and the original STN before the classification networks. Again, IC-STNs are able to recover occluded appearances from the input image. Although STN still attempts to center the perturbed images, the missing information from occlusion degrades its subsequent classification performance.
We also visualize the aligned mean appearances from each network in Fig. 11, and it can be observed that the mean appearance of IC-STN becomes sharper as the number of warp operations increases, once again indicating that good alignment is crucial to the subsequent target tasks. It is also interesting to note that not all traffic signs are aligned to fit exactly inside the bounding boxes, e.g. the networks find the optimal alignment for stop signs to be zoomed-in images that exclude the background information outside the octagonal shapes. This suggests that in certain cases, only the pixel information inside the sign shapes is necessary to achieve good alignment for classification.
In this paper, we theoretically connect the core idea of the Lucas & Kanade algorithm with Spatial Transformer Networks. We show that geometric variations within data can be eliminated more efficiently through multiple spatial transformations within an alignment framework. We propose Inverse Compositional Spatial Transformer Networks for predicting recurrent spatial transformations and demonstrate superior alignment and classification results compared to baseline CNNs and the original STN.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1859–1866, 2014.
The mnist database of handwritten digits, 1998.