Image alignment is one of the most fundamental tasks in computer vision, with applications such as object tracking [1], video stabilization [2], and visual odometry [3]. In many applications the images to be aligned were taken several milliseconds apart, as in the case of visual odometry or object tracking. In other applications, the images were taken several hours or days apart. This can be the case in robotic localization, where a robot compares its surroundings to a recently created map of the environment [4]. Sometimes, applications require alignment across temporal differences on the scale of months, years, or decades. This is the case in the remote sensing community, where algorithms have been developed to automatically align satellite images taken at different times of the year, or across years [5, 6]. This is also the case for 3D reconstruction algorithms which utilize large amounts of imagery taken across months or years [7].
Alignment algorithms must overcome changes in brightness, illumination, exposure, and geometric warping. In the case of images of outdoor environments, algorithms must also contend with occlusions, seasonal changes, time of day, and the addition or removal of objects such as buildings and cars in parts of the image. All of these issues are exacerbated by larger temporal differences between image captures.
These issues often do not pose a problem for humans when aligning two images. Take, for instance, the two images shown in Figure 1. One image is taken in summer, and the other in winter. Manually aligning these images with extreme accuracy is not difficult for us because we understand how scenes change during seasons, and we can use local and global cues in the image. However, to the best of our knowledge, vision algorithms have struggled to solve this task effectively due to the lack of texture and the stylistic differences.
Modern alignment techniques follow two approaches, sometimes combining both. The first approach is to extract keypoints, for example with SIFT or HOG [8, 9]. Keypoints work well with high amounts of texture, can be robust to stylistic or photometric differences, and can align large geometric warps. However, keypoint systems struggle with low-texture images. Recent attempts have been made to use keypoint-like methods for matching UAV imagery to pre-captured satellite images [10, 11]. However, these methods fail to show robustness in situations with low texture (non-urban environments) or across large temporal variation.
In the cases of low-texture imagery, it is often beneficial to use dense image alignment techniques. The most basic of such dense techniques is to perform the Lucas-Kanade (LK) optical flow algorithm on the raw pixels of the images. Advanced dense techniques extract a dense descriptor of the pixels before using the LK formulation [12, 13]. Dense descriptors provide invariance to photometric differences, and can be accurate on low- and high-texture scenes. However, they are limited in the magnitude of the warps that they can determine.
Verdie et al. have proposed a method which is able to extract keypoints which are temporally invariant across time of day and weather [14]. However, their testing is limited to highly-textured imagery of urban environments where there is rich information for keypoint extraction. We show in the Experiments section that our method can perform well on their dataset, as well as on low-texture satellite imagery. Yang et al. [6] present a state-of-the-art algorithm for automatic image registration of satellite imagery. Their approach combines several features for registration, including standard Euclidean distance of pixel intensity, the Shape Context feature, and SIFT distance. These feature representations all rely on ample texture and features in the images to be aligned. Yang et al. do not show that their method can perform well on low-texture imagery, or for cases of large temporal differences.
Recently, we have seen the introduction of a technique for image alignment which uses fully-convolutional networks to extract dense features of an image and template, before using the LK formulation on the extracted features [15, 16]. We henceforth refer to these methods as deep alignment methods. The benefit of these methods is that the features relevant to alignment can be learned based on the image domain.
In this paper, we make three main contributions:
We propose the use of deep alignment methods for aligning images of outdoor scenes that were taken across large temporal gaps, and show the viability of this approach.
We explore the generalization ability of deep alignment for aligning outdoor scenes, by training on outdoor images from one domain (satellite images) and testing on images from another domain (time-lapse images taken near ground level).
We propose a better training strategy than has yet been proposed for deep alignment methods. Specifically, we show that using dynamic LK iterations during training provides dramatic performance benefits over single-iteration methods (see Section 3.2 for more information).
The approach is to use a fully-convolutional network followed by a differentiable, iterative Lucas-Kanade implementation. The input to the Lucas-Kanade implementation is the multi-channel feature representation that is extracted from the fully-convolutional network. The entire network, from the input image to the output of the Lucas-Kanade layer, is differentiable, and thus we can perform backpropagation to determine the weights of the convolutional layers. In this way, we can optimize the convolutional weights for the image alignment task. We review the selection of the convolutional network in these deep alignment architectures. We also review the fundamentals of the Inverse-Compositional Lucas-Kanade layer. For more in-depth treatment of these topics, we refer the reader to prior deep alignment work.
2.1 Fully Convolutional Neural Network
For the convolutional part of the network, the approaches vary between DeepLK [15] and CLKN [16]. For CLKN, the authors have a large enough dataset that it becomes feasible to train the convolutional weights from scratch for the image alignment task. For DeepLK, given limited training data for the object tracking task, the authors choose to fine-tune the conv5 layer of AlexNet. For our approach, due to a similar lack of data in the problem domain, we choose to fine-tune the conv3 layer of the VGG16 network [17]. We use the same convolutional weights to extract features from both the image and the template (coupled weights).
2.2 Parameterization of the Warp Function
The goal of image alignment is, given an image and a template, to find the parameters of a geometric warping function which describe the warping from template to image. In this paper, we study the projective warp function (homography). For a pixel located at coordinates (x, y) in the template image, the homography warp function W(x; p) is commonly written using 8 parameters p = (p1, ..., p8):
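As an illustration of an 8-parameter homography warp, the sketch below applies one to a single pixel. The parameter ordering follows the convention of Baker and Matthews [18], which is an assumption here and may differ from the ordering intended in Equation 1; the function name `warp_point` is hypothetical.

```python
def warp_point(x, y, p):
    """Apply an 8-parameter homography to the pixel (x, y).

    Assumed ordering (Baker and Matthews convention, not confirmed by the text):
        x' = ((1 + p1) * x + p3 * y + p5) / (1 + p7 * x + p8 * y)
        y' = (p2 * x + (1 + p4) * y + p6) / (1 + p7 * x + p8 * y)
    With this ordering, p = 0 is the identity warp.
    """
    p1, p2, p3, p4, p5, p6, p7, p8 = p
    denom = 1.0 + p7 * x + p8 * y
    return (((1.0 + p1) * x + p3 * y + p5) / denom,
            (p2 * x + (1.0 + p4) * y + p6) / denom)


identity = [0.0] * 8
shift = [0.0, 0.0, 0.0, 0.0, 2.0, -3.0, 0.0, 0.0]  # pure translation by (2, -3)

print(warp_point(10.0, 20.0, identity))  # (10.0, 20.0)
print(warp_point(10.0, 20.0, shift))     # (12.0, 17.0)
```

Writing the affine terms as offsets from the identity is what makes p = 0 a no-op, which is convenient when the Jacobian is later evaluated at (x; 0).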
2.3 Inverse-Compositional Lucas-Kanade
The Lucas-Kanade algorithm seeks to minimize the pixel-wise squared difference between a template and a warped version of the image. Note that the template and image can have an arbitrary number of channels; in our case, the input image and template are feature maps with 256 channels that are output by the fully convolutional layers.
In the Inverse Compositional formulation, the roles of the template and the image are reversed, which greatly improves efficiency. The formulation of the Inverse Compositional Lucas-Kanade (ICLK) minimization for a template T and an image I is thus:
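In the notation of Baker and Matthews [18], the inverse-compositional objective is commonly written as follows. This is the standard textbook form, given here as an assumption about the intended expression, with W(x; p) the warp and the sum taken over template pixels x:

```latex
\min_{\Delta p} \; \sum_{x} \left\| \, T\big(W(x; \Delta p)\big) - I\big(W(x; p)\big) \, \right\|_2^2
```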
Here, Δp indicates the change in warp parameters for a single iteration of ICLK, and p is the result of the composition of all updates over all iterations so far. To solve Equation 2, we perform a first-order Taylor expansion and solve the resulting least-squares problem. The resulting expression of the warp update is:
Where H is the Hessian matrix:
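For reference, the standard Gauss-Newton solution of the inverse-compositional objective, as derived by Baker and Matthews [18], has the form below. It is offered as the usual textbook expression rather than a transcription of Equations 3 and 4. The Jacobian of the warp is evaluated at (x; 0), and for multi-channel feature maps the gradient of T stacks the spatial gradients of every channel:

```latex
\Delta p = H^{-1} \sum_{x}
  \left[ \nabla T \, \frac{\partial W}{\partial p} \right]^{\top}
  \left[ I\big(W(x; p)\big) - T(x) \right],
\qquad
H = \sum_{x}
  \left[ \nabla T \, \frac{\partial W}{\partial p} \right]^{\top}
  \left[ \nabla T \, \frac{\partial W}{\partial p} \right]
```

Because both the steepest-descent term and H depend only on the template, they can be computed once and reused across iterations.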
To compute Δp using Equation 3, we must compute the warp Jacobian at (x; 0). For homography, the warp Jacobian can be written as:
Once the solution for Δp has been obtained via Equation 3, it can be converted to a homography matrix and applied via inverse composition to the current warp parameters. Specifically, given the homography matrix with parameters Δp and the homography matrix with parameters p, the updated parameters can be extracted from a new homography calculated by:
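A minimal sketch of the homography warp Jacobian at the identity, checked against central finite differences. It assumes the Baker and Matthews parameter ordering (an assumption, since the ordering used in the text may differ); `warp_point`, `warp_jacobian_at_identity`, and `numerical_jacobian` are hypothetical names.

```python
def warp_point(x, y, p):
    """8-parameter homography in the assumed Baker and Matthews ordering."""
    p1, p2, p3, p4, p5, p6, p7, p8 = p
    denom = 1.0 + p7 * x + p8 * y
    return (((1.0 + p1) * x + p3 * y + p5) / denom,
            (p2 * x + (1.0 + p4) * y + p6) / denom)


def warp_jacobian_at_identity(x, y):
    """Analytic 2x8 Jacobian dW/dp evaluated at p = 0."""
    return [
        [x, 0.0, y, 0.0, 1.0, 0.0, -x * x, -x * y],
        [0.0, x, 0.0, y, 0.0, 1.0, -x * y, -y * y],
    ]


def numerical_jacobian(x, y, eps=1e-6):
    """Central finite differences of warp_point around p = 0."""
    rows = [[0.0] * 8, [0.0] * 8]
    for k in range(8):
        plus = [0.0] * 8
        minus = [0.0] * 8
        plus[k] = eps
        minus[k] = -eps
        xp, yp = warp_point(x, y, plus)
        xm, ym = warp_point(x, y, minus)
        rows[0][k] = (xp - xm) / (2 * eps)
        rows[1][k] = (yp - ym) / (2 * eps)
    return rows


analytic = warp_jacobian_at_identity(3.0, -2.0)
numeric = numerical_jacobian(3.0, -2.0)
max_gap = max(abs(a - n)
              for row_a, row_n in zip(analytic, numeric)
              for a, n in zip(row_a, row_n))
print(max_gap < 1e-6)
```

The last two columns grow quadratically with the pixel coordinates, which is one way to see why the projective parameters have a much larger visual effect than the translational ones.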
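The inverse-compositional update can be sketched with plain 3x3 matrices. The sketch assumes the Baker and Matthews parameter ordering and the usual update rule in which the current homography is right-multiplied by the inverse of the incremental one; all function names are hypothetical, and this is an illustration rather than the implementation used in the experiments.

```python
def params_to_matrix(p):
    """Build the 3x3 homography for parameters p (assumed ordering)."""
    p1, p2, p3, p4, p5, p6, p7, p8 = p
    return [[1.0 + p1, p3, p5],
            [p2, 1.0 + p4, p6],
            [p7, p8, 1.0]]


def matrix_to_params(m):
    """Normalize so the bottom-right entry is 1, then read the parameters back."""
    s = m[2][2]
    n = [[v / s for v in row] for row in m]
    return [n[0][0] - 1.0, n[1][0], n[0][1], n[1][1] - 1.0,
            n[0][2], n[1][2], n[2][0], n[2][1]]


def matmul3(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def inverse3(m):
    """Invert a 3x3 matrix via its adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("homography is singular")
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[v / det for v in row] for row in adj]


def inverse_compose(p, dp):
    """Parameters of H(p) @ inverse(H(dp)), i.e. W(x; p) composed with W(x; dp)^-1."""
    return matrix_to_params(
        matmul3(params_to_matrix(p), inverse3(params_to_matrix(dp))))


p = [0.1, -0.05, 0.02, 0.08, 5.0, -3.0, 1e-4, -2e-4]
unchanged = inverse_compose(p, [0.0] * 8)  # a zero update leaves p as it was
cancelled = inverse_compose(p, p)          # composing with its own inverse gives the identity
print(unchanged)
print(cancelled)
```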
2.4 Iterations of Lucas-Kanade
The Lucas-Kanade method solves a non-linear least squares problem by first-order Taylor expansion and successive iterations. Therefore, each unique image and template pair requires a variable number of iterations before the algorithm converges. The criterion for convergence is often a heuristic threshold on the magnitude of the change in warp parameters, Δp. Some prior work uses this criterion, while other work instead uses the magnitude of the average error residual at each iteration as the stopping criterion. The error residual is calculated as in Equation 3. We theorize that the choice of stopping criterion is highly dependent on the problem domain and the type of imagery. We experimented with both the average residual criterion and the threshold on the magnitude of Δp. We found that a heuristic threshold on the magnitude of Δp provides the best trade-off between accuracy of the final alignment and number of iterations.
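To make the stopping rule concrete, the toy sketch below runs inverse-compositional Lucas-Kanade on a 1D signal with a translation-only warp, iterating until the magnitude of the update falls below a threshold. The Gaussian signal, the shift of 3 samples, and the threshold value are illustrative assumptions, not settings from the experiments.

```python
import math

SIGMA = 5.0
TRUE_SHIFT = 3.0  # hypothetical ground-truth translation


def template(x):
    return math.exp(-x * x / (2.0 * SIGMA ** 2))


def template_grad(x):
    return -x / SIGMA ** 2 * template(x)


def image(x):
    # The image is the template shifted by TRUE_SHIFT, so
    # image(x + p) matches template(x) exactly when p == TRUE_SHIFT.
    return template(x - TRUE_SHIFT)


def iclk_translation(threshold=1e-6, max_iters=50):
    """Estimate the shift p of W(x; p) = x + p; returns (p, iterations used)."""
    xs = [float(x) for x in range(-20, 21)]
    # In the inverse-compositional form, the gradient and Hessian depend
    # only on the template, so they are computed once outside the loop.
    grads = [template_grad(x) for x in xs]
    hessian = sum(g * g for g in grads)
    p = 0.0
    for iteration in range(1, max_iters + 1):
        residuals = [image(x + p) - template(x) for x in xs]
        dp = sum(g * r for g, r in zip(grads, residuals)) / hessian
        p -= dp  # inverse composition for a pure translation
        if abs(dp) < threshold:  # stop when the update is small enough
            return p, iteration
    return p, max_iters


estimate, iterations = iclk_translation()
print(estimate, iterations)
```

The number of iterations is not fixed in advance: it depends on how far the initial guess is from the solution, which is the behaviour that a dynamic-iteration scheme has to accommodate.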
The steps for a single training iteration are as follows:
Generate an image and template, and ground-truth warp parameters which relate the image to the template
Extract feature maps of image and template using fully convolutional layers
Input both image and template feature maps into Inverse-Compositional Lucas-Kanade algorithm
Iterate the ICLK algorithm until convergence, indicated by the magnitude of Δp falling below a threshold
Compute a final loss between the estimated warp parameters and the ground truth warp parameters
Apply stochastic gradient updates to the convolutional filter weights in the fully convolutional layers
3.1 Loss Function
This raises the question of which loss function is appropriate for estimating homography parameters. Prior work uses the Conditional Loss, which is a Huber loss computed on the difference between the ground truth warp parameters and the estimated parameters:
Here, the loss is accumulated over the set of all training data, with the Huber loss as the penalty. The problem with this approach is that the 8 parameters within p do not equally affect the magnitude of the geometric warp. For instance, the two projective parameters carry much more weight in terms of the visual effect of the warp than do the two translational terms. Therefore, a loss function is needed which captures the visual correctness of the regressed parameters p.
We adopt the previously proposed Corner Loss, which is a much better measure of visual correctness than the Conditional Loss. The Corner Loss computes the squared distance between the four corners of a ground truth un-warped image and the prediction of the un-warped image I(W(x;p)). Taking the 4 corners of the warped image I, the Corner Loss is defined as:
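The sketch below illustrates a corner-based loss and why it weights parameters by visual effect. It assumes the Baker and Matthews parameter ordering, a patch with its origin at the top-left corner, and a loss equal to the sum of squared corner distances; these are assumptions made for illustration and may differ in detail from Equation 8.

```python
def warp_point(x, y, p):
    """8-parameter homography in the assumed Baker and Matthews ordering."""
    p1, p2, p3, p4, p5, p6, p7, p8 = p
    denom = 1.0 + p7 * x + p8 * y
    return (((1.0 + p1) * x + p3 * y + p5) / denom,
            (p2 * x + (1.0 + p4) * y + p6) / denom)


def corner_loss(p_pred, p_true, width, height):
    """Sum of squared distances between the four patch corners under each warp."""
    corners = [(0.0, 0.0), (width, 0.0), (width, height), (0.0, height)]
    loss = 0.0
    for cx, cy in corners:
        px, py = warp_point(cx, cy, p_pred)
        tx, ty = warp_point(cx, cy, p_true)
        loss += (px - tx) ** 2 + (py - ty) ** 2
    return loss


truth = [0.0] * 8
# The same numeric error of 0.001 placed in a translational or a projective slot.
translation_error = [0.0, 0.0, 0.0, 0.0, 0.001, 0.0, 0.0, 0.0]
projective_error = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.001, 0.0]

loss_translation = corner_loss(translation_error, truth, 200.0, 200.0)
loss_projective = corner_loss(projective_error, truth, 200.0, 200.0)
print(loss_translation, loss_projective)
```

A loss computed directly on the parameters would treat these two errors as equal, whereas the corner-based loss separates them by many orders of magnitude on a 200-pixel patch.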
3.2 ICLK Iterations During Training
In both [15] and [16], the authors allow the ICLK layer of the network to iterate only a single time during the training stage, although during testing the ICLK layer is able to iterate to convergence. We believe that they take this approach due to limitations of the implementation framework, which make it very difficult to perform back-propagation through multiple ICLK iterations during training. Therefore, the authors design loss functions which suit the single-iteration training regime, and augment their dataset to mimic the intermediate output of LK iterations. However, we use the more effective strategy of iterating a dynamic number of times in the ICLK layer during training. Iterating dynamically during training also allows us to utilize the simple formulation of Corner Loss in Equation 8 as the loss function of our network output.
We have seen that unfolding dynamic LK iterations during training gives a dramatic performance improvement at test time over a single LK iteration during training for our task. Further information can be found in Section 5.4 and Figure 10, where we compare single-iteration Corner Loss minimization and dynamic-iteration Corner Loss minimization.
Training and testing our algorithm requires gathering a large number of aligned images of a variety of outdoor environments, across many different periods of time which showcase seasonal changes and other temporal differences. One source of such data is aligned orthographic satellite imagery. Using such data has been common in approaches to remote sensing applications and GPS-denied UAV localization. Another source of data is the Archive of Many Outdoor Scenes (AMOS) dataset, which is also used by [14]. AMOS provides data generated via webcams positioned at many outdoor scenes across the world, effectively generating large amounts of time-lapse data of outdoor scenes.
4.1 Satellite Imagery Dataset
We obtained satellite imagery data from the United States Geological Survey Earth Explorer (earthexplorer.usgs.gov) website. For training and testing, we use images from a suburban area of New Jersey, USA. We chose the location for its abundance of data, and its even mix between high and low texture. Some example images from this dataset are shown in Figure 2. We obtain aligned images taken during spring, summer, and fall, taken in 2006, 2008, 2010, 2013, 2015, and 2017 (10 images total). The images are each 7582×5946 pixels, at a resolution of 1 meter per pixel, meaning we train on about 50 sq. km. of geographical area. We withhold 20% of the dataset for testing.
We dynamically create data pairs from the satellite imagery data during training and testing. That is, during the training loop, we randomly choose two of the 7582×5946 images from the New Jersey dataset, and randomly choose a location in the image and a patch size to sample. Keeping one of the patches static (the template T), we apply a random warping to the other patch (the image I). The parameters for random patch size, and the random warp parameters, can be found in the Implementation Details section.
4.2 AMOS Time-lapse Dataset
In [14], the authors have constructed a representative dataset from some of the higher quality time-lapse webcams in the AMOS dataset. Some examples from this dataset are shown in Figure 3. The AMOS dataset is characterized by high-texture urban environments that are very rich with features. This is very different from the satellite imagery dataset, which is characterized by lower texture and lower resolution images that may not have salient features which can be easily extracted with sparse keypoint methods. The satellite imagery also has a larger number of examples of natural scenes, such as foliage, forests, and rivers, which are almost always more challenging for the temporally-invariant alignment problem. Similarly to the satellite imagery dataset, we create data pairs dynamically from the AMOS dataset during testing.
Our goal is to show that it is possible to learn temporal invariance for outdoor scenes in general. Thus, we propose to train our network using the satellite imagery dataset only, and test performance on both satellite imagery and the AMOS dataset. We hypothesize that by training on satellite imagery, we can learn invariances for both low and high texture, across the paradigms of most outdoor scenes (urban, suburban, rural), and across the variations that occur from large temporal differences in the images. We show in the following experiments that this is indeed possible.
We compare our alignment strategy against two other dense alignment strategies, and one sparse keypoint strategy. These strategies are:
Normal ICLK (raw pixels): We perform the ICLK algorithm on the raw image pixels, without applying any dense descriptors. We found that normalizing the pixels to have zero mean and unit variance per channel improved performance for normal ICLK.
ICLK on untrained VGG16 conv3 features: We perform the ICLK algorithm on feature maps extracted from the stock VGG16 conv3.
SIFT+RANSAC: We extract sparse SIFT keypoints and perform RANSAC to estimate the homography for alignment. We use the implementation included in OpenCV. Notably, prior work reports SIFT+RANSAC as the second-best alignment method, behind only the authors' own implementation.
We employ a Corner Error metric for showing performance of our algorithm, similar to that used in prior work. The Corner Error is related to the Corner Loss, except that it reports the average Euclidean distance between the four corners of the ground truth un-warped image and the prediction of the un-warped image I(W(x;p)). It is measured in pixels, and is equal to:
Since we test with variable sizes of square image patches, we instead report the Corner Error as a percentage of the image width so that we can fairly compare warps for different image patch sizes. We provide a visualization of Corner Error in Figure 4.
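As a small worked example of the normalized metric, the sketch below averages the Euclidean distances between predicted and ground-truth corners and expresses the result as a percentage of the patch width. The function name and the corner coordinates are hypothetical values chosen for illustration.

```python
import math


def corner_error_percent(pred_corners, true_corners, width):
    """Mean Euclidean corner distance, as a percentage of the patch width."""
    dists = [math.hypot(px - tx, py - ty)
             for (px, py), (tx, ty) in zip(pred_corners, true_corners)]
    return 100.0 * (sum(dists) / len(dists)) / width


true_corners = [(0.0, 0.0), (200.0, 0.0), (200.0, 200.0), (0.0, 200.0)]
pred_corners = [(3.0, 4.0), (200.0, 0.0), (200.0, 205.0), (0.0, 200.0)]

# Corner distances are 5, 0, 5, 0 pixels, so the mean is 2.5 pixels.
print(corner_error_percent(pred_corners, true_corners, 200.0))  # 1.25
```

Dividing by the width means a 2.5-pixel error on a 200-pixel patch and a 3.75-pixel error on a 300-pixel patch are reported as the same 1.25%.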
5.1 Implementation Details
For generating image pairs, we randomly select two aligned images from a given dataset. We extract square patches in the images which range from 175 pixels to 300 pixels wide for the satellite image dataset, and square patches between 150 pixels and 220 pixels wide from the AMOS dataset. We extract a padded version of the image I, so that when it is warped, there are no cut-off regions around the edges. We warp the image I, choosing projective warp parameters from uniform random distributions. We choose warp parameters such that if the algorithm were to predict the identity warp for every test example (no-op), the maximum Corner Error would be about 30% of the image width.
We transfer the conv3 layer of the VGG16 network for our convolutional pipeline, and fine-tune only conv3. We implement the algorithm using the open-source PyTorch framework, on an NVIDIA GeForce Titan X GPU. We trained on 15,000 dynamically generated training pairs from the New Jersey satellite image dataset. We implemented a mini-batch training approach, calculating the Corner Loss on a mini-batch of 5 training pairs before applying the stochastic gradient update. We generate a validation batch of 20 image pairs randomly at train time from the source data (New Jersey satellite data), and keep the model which generates the lowest validation loss during training.
5.2 Evaluation on Satellite Dataset
We test on 5000 data pairs from the New Jersey satellite image dataset which are unseen during training. The results of this experiment are shown in Figure 5. We report the results in terms of Corner Error (Equation 9) as a percent of image width, versus the ratio of training data. The results indicate the superior performance of our method for aligning satellite imagery, in the face of large temporal and seasonal variations. Please see the description in Figure 5 for more information on the performance metrics. In Figure 6, we provide some notable alignment results achieved by our method.
5.3 Evaluation on AMOS Dataset
For the AMOS Dataset, we first test on 2000 data pairs from a webcam located in St. Louis, Missouri, USA. Some representative examples of this dataset are shown in Figure 3. The alignment results for this dataset are shown in Figure 8. Notably, we find that our alignment method, which has been trained only on satellite data, has learned invariances to outdoor scenes which allow it to be effective at alignment on the AMOS dataset. Some alignment examples are captured in Figure 7.
We also test on 2000 data pairs from a webcam located in Courbevoie, France. Some representative examples of this dataset are shown in Figure 3, with the alignment results shown in Figure 9. Again, we find that the network trained only on satellite images is able to generalize for this alignment task. Specifically, we find that our network can align 80% of image pairs in the Courbevoie dataset to within 5% corner error. The SIFT+RANSAC method, and ICLK on untrained VGG16 conv3 features, both align about 60% of this dataset to within 5% corner error.
5.4 Single-Iteration vs. Dynamic-Iteration Training
There is a choice of how many iterations to run in the ICLK layer of the deep alignment methods. Previous works have designed the system around a single-iteration training scheme due to limitations of the implementation framework. However, as shown in Figure 10, dynamic iterations provide a dramatic performance boost for our task.
In this paper we present a new perspective on recent deep alignment techniques, applying them to the problem of temporally-invariant image alignment for outdoor scenes. We have shown a surprising real-world application of these deep alignment methods, and have introduced a better training strategy for learning temporal invariance. We have shown that, with deep alignment methods, we can learn invariances in one domain and transfer that invariance to a separate and unseen domain. We propose that this method can be used in future work in the remote sensing community, in the localization of UAVs in GPS-denied environments, 3D reconstruction algorithms, SLAM, and in other applications where high levels of invariance to imaging conditions are required.
[1] Wu, Y., Lim, J., Yang, M.H.: Online object tracking: A benchmark. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition. (June 2013) 2411–2418
[2] Battiato, S., Gallo, G., Puglisi, G., Scellato, S.: Sift features tracking for video stabilization. In: 14th International Conference on Image Analysis and Processing (ICIAP 2007). (Sept 2007) 825–830
[3] Engel, J., Koltun, V., Cremers, D.: Direct sparse odometry. CoRR abs/1607.02565 (2016)
[4] Murillo, A.C., Guerrero, J.J., Sagues, C.: Surf features for efficient robot localization with omnidirectional images. In: Proceedings 2007 IEEE International Conference on Robotics and Automation. (April 2007) 3901–3907
[5] Bentoutou, Y., Taleb, N., Kpalma, K., Ronsin, J.: An automatic image registration for applications in remote sensing. IEEE Transactions on Geoscience and Remote Sensing 43(9) (Sept 2005) 2127–2137
[6] Yang, K., Pan, A., Yang, Y., Zhang, S., Ong, S.H., Tang, H.: Remote sensing image registration using multiple image features. Remote Sensing 9(6) (2017)
[7] Agarwal, S., Furukawa, Y., Snavely, N., Simon, I., Curless, B., Seitz, S.M., Szeliski, R.: Building rome in a day. Commun. ACM 54(10) (October 2011) 105–112
[8] Lindeberg, T.: Scale Invariant Feature Transform. Scholarpedia 7(5) (2012) 10491 revision #153939.
[9] Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05). Volume 1. (June 2005) 886–893 vol. 1
[10] Shan, M., Wang, F., Lin, F., Gao, Z., Tang, Y.Z., Chen, B.M.: Google map aided visual navigation for uavs in gps-denied environment. CoRR abs/1703.10125 (2017)
[11] Yol, A., Delabarre, B., Dame, A., Dartois, J. ., Marchand, E.: Vision-based absolute localization for unmanned aerial vehicles. In: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems. (Sept 2014) 3429–3434
[12] Alismail, H., Browning, B., Lucey, S.: Robust tracking in low light and sudden illumination changes. In: 2016 Fourth International Conference on 3D Vision (3DV). (Oct 2016) 389–398
[13] Antonakos, E., i Medina, J.A., Tzimiropoulos, G., Zafeiriou, S.P.: Feature-based lucas–kanade and active appearance models. IEEE Transactions on Image Processing 24(9) (Sept 2015) 2617–2632
[14] Verdie, Y., Yi, K.M., Fua, P., Lepetit, V.: Tilde: A temporally invariant learned detector. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (June 2015) 5279–5288
[15] Wang, C., Galoogahi, H.K., Lin, C., Lucey, S.: Deep-lk for efficient adaptive object tracking. In: 2018 IEEE Conference on Robotics and Automation (ICRA). (2017)
[16] Chang, C.H., Chou, C.N., Chang, E.Y.: Clkn: Cascaded lucas-kanade networks for image alignment. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (July 2017) 3777–3785
[17] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
[18] Baker, S., Matthews, I.: Lucas-kanade 20 years on: A unifying framework. International Journal of Computer Vision 56(3) (Feb 2004) 221–255