Unsupervised Deep Homography: A Fast and Robust Homography Estimation Model

09/12/2017 ∙ by Ty Nguyen, et al. ∙ University of Pennsylvania 0

This paper develops an unsupervised learning algorithm that trains a Deep Convolutional Neural Network to estimate planar homographies. The traditional feature-based approaches to estimating can fail when good features cannot be identified, and can be slow during the feature identification and matching process. We demonstrate that our unsupervised algorithm outperforms these traditional approaches in accuracy, speed, and robustness to noise on both synthetic images and real-world images captured from a UAV. In addition, we demonstrate that our unsupervised method has superior performance compared to the corresponding supervised method of training the network architecture using ground truth homography labels. Our approach is a general motion estimation framework that can be extended beyond estimating planar homographies to more general motions such as optical flow.



There are no comments yet.


page 1

page 7

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

A homography is a mapping between two images of a planar surface from different perspectives. They play an essential role in robotics and computer vision applications such as image mosaicing 

[1], monocular SLAM [2], 3D camera pose reconstruction [3] and virtual touring [4, 5]. For example, homographies are applicable in scenes viewed at a far distance by an arbitrary moving camera [6], which are the situations encountered in UAV imagery. However, to work well in the aerial multi-robot setting, the homography estimation algorithm needs to be reliable and fast.

The two traditional approaches for homography estimation are direct methods and feature-based methods [7]. Direct methods, such as the seminal Lucas-Kanade algorithm [8], use pixel-to-pixel matching by shifting or warping the images relative to each other and comparing the pixel intensity values using an error metric such as the sum of squared differences (SSD). They initialize a guess for the homography parameters and use a search or optimization technique such as gradient descent to minimize the error function [9]. The robustness of direct methods can be improved by using different performance criterion such as the enhanced correlation coefficient (ECC) [10], integrating feature-based methods with direct methods [11], or by representing the images in the Fourier domain [12]. In addition, the speed of direct methods can be increased by using efficient compositional image alignment schemes [13].

The second approach are feature-based methods. These methods first extract keypoints in each image using local invariant features (e.g. Scale Invariant Feature Transform (SIFT) [14]). They then establish a correspondence between the two sets of keypoints using feature matching, and use RANSAC [15] to find the best homography estimate. While these methods have better performance than direct methods, they can be inaccurate when they fail to detect sufficient keypoints, or produce incorrect keypoint correspondences due to illumination and large viewpoint differences between the images [16]. In addition, these methods are significantly faster than direct methods but can still be slow due to the computation of the features, leading to the development of other feature types such as Oriented FAST and Rotated BRIEF (ORB) [17] which are more computationally efficient than SIFT, but have worse performance.

Fig. 1: Above: Synthetic data; Below: Real data; Homography estimation results from the unsupervised neural network. Red represents the ground truth correspondences, and yellow represents the estimated correspondences. These images depict an example of large levels of displacement and illumination shifts in which feature-based, direct and/or supervised learning methods fail.

Inspired by the success of data-driven Deep Convolutional Neural Networks (CNN) in computer vision, there has been an emergence of CNN approaches to estimating optical flow [18, 19, 20], dense matching [21, 22], depth estimation [23], and homography estimation [24]. Most of these works, including the most relevant work on homography estimation, treat the estimation problem as a supervised learning task. These supervised approaches use ground truth labels, and as a result are limited to synthetic datasets where the ground truth can be generated for free, or require costly labeling of real-world data sets.

Our work develops an unsupervised, end-to-end, deep learning algorithm to estimate homographies. It improves upon these prior traditional and supervised learning methods by minimizing a pixel-wise intensity error metric that does not need ground truth data. Unlike the hand-crafted feature-based approaches, or the supervised approach that needs costly labels, our model is adaptive and can easily learn good features specific to different data sets. Furthermore, our framework has fast inference times since it is highly parallel. These adaptive and speed properties make our unsupervised networks especially suitable for real world robotic tasks, such as stitching UAV images.

We demonstrate that our unsupervised homography estimation algorithm has comparable or better accuracy, and better inference speed, than feature-based, direct, and supervised deep learning methods on synthetic and real-world UAV data sets. In addition, we demonstrate that it can handle large displacements ( image overlap) with large illumination variation. Fig. 1 illustrates qualitative results on these data sets, where our unsupervised method is able to estimate the homography whereas the other approaches cannot.

Our unsupervised algorithm is a hybrid approach that combines the strengths of deep learning with the strengths of both traditional direct methods and feature-based methods. It is similar to feature-based methods because it also relies on features to compute the homography estimates, but it differs in that it learns the features rather than defining them. It is also similar to the direct methods because the error signal used to drive the network training is a pixel-wise error. However, rather than performing an online optimization process, it transfers the computation offline and ”caches” the results through these learned features. Similar unsupervised deep learning approaches have been successful in computer vision tasks such as monocular depth and camera motion estimation [25], indicating that our framework can be scaled to tackle general nonlinear motions such as those encountered in optical flow.

Ii Problem Formulation

We assume that images are obtained by a perspective pin-hole camera and present points by homogeneous coordinates, so that a point is represented as and a point is equivalent to the point . Suppose that and are two points. A planar projective transformation or homography that maps

is a linear transformation represented by a non-singular

matrix such that:


Since can be multiplied by an arbitrary non-zero scale factor without altering the projective transformation, only the ratio of the matrix elements is significant, leaving

eight independent ratios corresponding to eight degrees of freedom. This mapping equation can also represented by two equations:


The problem of finding the homography induced by two images and is to find a homography such that Eqn. (1) holds for all points in the overlapping of the two images.

Iii Supervised Deep Homography Model

Fig. 2: Overview of homography estimation methods; (a) Benchmark supervised deep learning approach; (b) Feature-based methods; and (c) Our unsupervised method. DLT: direct linear transform; PSGG: parameterized sampling grid generator; DS: differentiable sampling.

The deep learning approach most similar to our work is the Deep Image Homography Estimation [24]. In this work, DeTone et al. use supervised learning to train a deep neural network on a synthetic data set. They use the 4-point homography parameterization  [26] rather than the conventional parameterization . Suppose that and for are 4 fixed points in image and respectively, such that Let , . Then is the matrix of points . Both parameterizations are equivalent since there is a one-to-one correspondence between them.

In a deep learning framework though, this parameterization is more suitable than the parameterization because mixes the rotation, translation, scale, and shear components of the homography transformation. The rotation and shear components tend to have a much smaller magnitude than the translation component, and as a result although an error in their values can greatly impact , it will have a small effect on the loss function of the elements of

, which is detrimental for training the neural network. In addition, the high variance in the magnitude of the elements of the

homography matrix makes it difficult to enforce to be non-singular. The -point parameterization does not suffer from these problems.

The network architecture is based on VGGNet [27], and is depicted in Fig. 2(a). The network input is a batch of image patch pairs. The patch pairs are generated by taking a full-sized image, cropping a square patch at a random position , perturbing the four corners of by a random value within to generate a homography , applying to the full-sized image, and then cropping a square patch of the same size and at the same location as the patch from the warped image. These image patches are used to avoid strange border effects near the edges during the synthetic data generation process, and to standardize the network input size. The applied homography is saved in the point parameterization format, . The network outputs a point parameterization estimate .

The error signal used for gradient backpropagation is the Euclidean

norm, denoted as , of the estimated -point homography versus the ground truth :


Iv Unsupervised Deep Homography Model

While the supervised deep learning method has promising results, it is limited in real world applications since it requires ground truth labels. Drawing inspiration from traditional direct methods for homography estimation, we can define an analogous loss function. Given an image pair and with discrete pixel locations represented by homogeneous coordinates , we want our network to output that minimizes the average pixel-wise photometric loss


where defines the homography transformation . We chose the error versus the error because previous work has observed that it is more suitable for image alignment problems [28], and empirically we found the network to be easier to train with the error. This loss function is unsupervised since there is no ground truth label. Similar to the supervised case, we choose the -point parameterization which is more suitable than the parameterization.

In order to compare our unsupervised deep learning algorithm with the supervised algorithm, we use the same VGGNet architecture to output the . Fig. 2(c) depicts our unsupervised learning model. The regression module represents the VGGNet architecture and is shared by both the supervised and unsupervised methods. Although we do not investigate other possible architectures, different regression models such as SqueezeNet [29] may yield better performance due to advantages in size and computation requirements. The second half of Fig. 2(c) represents the main contribution of this work, which consists of the differentiable layers that allow the network to be successfully trained with the loss function (4).

Using the pixel-wise photometric loss function yields additional training challenges. First, every operation, including the warping operation , must remain differentiable to allow the network to be trained via backpropagation. Second, since the error signal depends on differences in image intensity values rather than the differences in the homography parameters, training the deep network is not necessarily as easy or stable. Another implication of using a pixel-wise photometric loss function is the implied assumption that lighting and contrast between the input images remains consistent. In traditional direct methods such as ECC, this appearance variation problem is addressed by modifying the loss function or preprocessing the images. In our unsupervised algorithm, we standardize our images by the mean and variance of the intensities of all pixels in our training dataset, perform data augmentation by injecting random illumination shifts, and use the standard photometric loss. We found that even without modifying the loss function, our deep neural network is still able to learn to be invariant to illumination changes.

Iv-a Model Inputs

The input to our model consists of three parts. The first part is a 2-channel image of size which is the stack of and - two patches cropped from the two images and . The second part is the four corners in , denoted as . Image is also part of the input as it is necessary for warping.

Iv-B Tensor Direct Linear Transform

We develop a Tensor Direct Linear Transform (Tensor DLT) layer to compute a differentiable mapping from the 4-point parameterization

to , the parameterization of homography. This layer essentially applies the DLT algorithm [30] to tensors, while remaining differentiable to allow backpropagation during training. As shown in Fig.  2(c), the input to this layer are the corresponding corners in the image pairs and , and the output is the estimate of the homography parameterization .

The DLT algorithm is used to solve for the homography matrix given a set of four point correspondences [30]. Let be the homography induced by a set of four 2D to 2D correspondences, . According to the definition of a homography given in Eqn. (1), . This relation can also be expressed as .

Let be the -th row of , then:



is the column vector representation of


Let , then:


This equation can be rewritten as:


which has the form for each correspondence pair, where is a matrix, and is a vector with elements consisting of the entries of . Since the last row in is dependent on the other rows, we are left with two linear equations where is the first 2 rows of .

Given a set of 4 correspondences, we can create a system of equations to solve for and thus . For each , we can stack to form . Solving for results in finding a vector in the null space of

. One popular approach is singular value decomposition (SVD) 

[31], which is a differentiable operation. However, taking the gradients in SVD has high time complexity and has practical implementation issues [32]. An alternative solution to this problem is to make the assumption that the last element of , which is is equal to 1 [33].

With this assumption and the fact that , we can rewrite Eqn. (7) in the form for each correspondence points where is the matrix representing the first columns of ,

is a vector with 2 elements representing the last column of subtracted from both sides of the equation,

and is a vector consisting of the first 8 elements of (with omitted).

By stacking these equations, we get:


Eqn. (8) has a desirable form because , and thus , can be solved for using , the pseudo-inverse of . This operation is simple and differentiable with respect to the coordinates of and . In addition, the gradients are easier to calculate than for SVD.

This approach may still fail if the correspondence points are collinear: if three of the correspondence points are on the same line, then solving for is undetermined. We alleviate this problem by first making the initial guess of to be zero, implying that . We then set a small learning rate such that after each training iteration, does not move too far away from .

Iv-C Spatial Transformation Layer

The next layer applies the homography estimate output by the Tensor DLT to the pixel coordinates of image in order to get warped coordinates . These warped coordinates are necessary in computing the photometric loss function in Eqn. (4) that will train our neural network. In addition to warping the coordinates, this layer must also be differentiable so that the error gradients can flow through via backpropagation. We thus extend the Spatial Transformer Layer introduced in [34] by applying it to homography transformations.

This layer performs an inverse warping in order to avoid holes in the warped image. This process consists of 3 steps: (1) Normalized inverse computation of the homography estimate; (2) Parameterized Sampling Grid Generator (PSGG); and (3) Differentiable Sampling (DS).

The first step, computing a normalized inverse, involves normalizing the height and width coordinates of images and into a range such that and . Thus given a homography estimate , the inverse used for warping is computed as follows:

with and are the width and height of the .

The second step (PSGG) creates a grid of the same size as the second image . Each grid element corresponds to pixels of the second image . Applying the inverse homography to these grid coordinates provides a grid of pixels in the first image .


Based on the sampling points computed from PSGG, the last step (DS) produces a sampled warped image of size with channels, where .

The sampling kernel is applied to the grid and the resulting image is defined as


where are the height and width of the input image , and are the parameters of

defining the image interpolation.

is the value at location in channel of the input image, and is the value of the output pixel at location in channel . Here, we use bilinear interpolation such that the Eqn. (10) becomes


To allow backpropagation of the loss function, gradients with respect to and for bilinear interpolation are defined as


This allows backpropagation of the loss gradients using the chain rule because

and can be easily derived from Eqn. 2.

V Evaluation Results

Fig. 3: Synthetic 4pt-Homography RMSE (lower is better). Unsupervised has comparable performance with the supervised method and performs better than the other approaches especially when the displacement is large.
Fig. 4: Speed Versus Performance Tradeoff. Lower left is better. Suffixes GPU and CPU reflect the computational resource. All the feature-based methods are run on the CPU. The unsupervised network run on the GPU dominates all the other methods by having both the highest throughput and best performance.
(b) Unsupervised
(c) SIFT
(d) SIFT
(e) ECC
(f) ECC
(b) Unsupervised
(c) SIFT
(d) SIFT
(e) ECC
(f) ECC
Fig. 5: Qualitative visualization of estimation methods on aerial dataset. Left: hard case, right: moderate case. ECC performs better than SIFT in the case of small displacement, but performs worse than SIFT in case of large displacement. Unsupervised network outperforms both SIFT and ECC approaches. Supervised network is omitted due to limited space and its poor performance on this dataset.
(a) Unsupervised

The intended use case for our algorithm is in estimating homographies for aerial multi-robot systems applications such as image mosaicing and collision avoidance. As a result, we demonstrate our unsupervised algorithm’s accuracy, inference speed, and robustness to illumination variation relative to SIFT, ORB, ECC and the supervised deep learning method. We evaluate these methods on a synthetic dataset similar to the dataset used in [24], and on a real-world aerial image dataset. Since ORB’s performance is inferior to that of SIFT, we only report ORB’s performance in Fig. 4 and omit it in the remaining figures.

Both the supervised and unsupervised approaches use the VGGNet architecture to generate homography estimates . The deep learning approaches are implemented in Tensorflow [35]

using stochastic gradient descent with a batch size of 128, and an Adam Optimizer 

[36] with, , and . We empirically chose the initial learning rate for the supervised algorithm and unsupervised algorithm to be and respectively.

The ECC direct method is a standard Python OpenCV implementation while the feature-based approaches are Python OpenCV implementations of SIFT RANSAC and ORB RANSAC. We found that in our synthetic dataset, using all detected features gives better performance, while in our aerial dataset, choosing the 50 best features is superior. These feature pairs are then used to calculate the homography using RANSAC with a threshold of

pixels. For the ECC method, we use identity matrix as the initialization and set 1000 as the maximum number of iterations.

V-a Synthetic Data Results

This section analyzes the performance profile of the Unsupervised, Supervised, SIFT, and ECC methods on our synthetic dataset. We want to test how well our approach performs under illumination variation and large image displacement.

To account for illumination variation, we globally standardize our images based on the mean and variance of pixel intensities of all images in our training dataset. We additionally inject random color, brightness and gamma shifts during the training. We do not utilize any further preprocessing and use the photometric loss function. To highlight the effect of displacement amount on each method, we break down the accuracy performance in terms of: image overlap (small displacement), image overlap (moderate displacement), and image overlap (large displacement). We follow the synthetic data generation process on the MS-COCO dataset used in [24]. The amount of image overlap is controlled by the point perturbation parameter

. The evaluation metric is the

pt-Homography RMSE from Eqn. (3) comparing the estimated homography to the ground truth homography.

We train the deep networks from scratch for iterations over hours, using two GPUs. This long training procedure only needs to be performed once, as the resulting model can be used as an initial pre-trained model for other data sets. We observed that the supervised model started overfitting after iterations so stopped training early. SIFT, ORB and ECC estimated homographies using the full images, while the deep learning methods are only given access to the small patches ( pixels). This disadvantages our methods, and would result in better performance for the traditional methods, at the expense of slower running times.

Fig. 6: 4pt-homography RMSE on aerial images (lower is better). Unsupervised outperforms other approaches significantly.

Fig. 3 displays the results of each method broken down by overlap and performance percentile. We break down the results by performance percentile to illustrate the various performance profiles of each method. Specifically, SIFT tends to do very well of the time, but in the worst of the time it performs very poorly, sometimes completely failing to detect enough features to estimate a homography. On the other hand, the deep learning methods tends to have much more consistent performance, which can be more desireable in practical applications such as using homographies for collision avoidance for aerial multi-robot systems. Both the learning methods and the feature-based methods outperform the direct method (ECC).

Interestingly, whereas direct method ECC has problems with illumination variation and large displacement, our unsupervised method is able to handle these scenarios even though it uses photometric loss functions. One potential hypothesis is that our method can be viewed as a hybrid between direct methods and feature based methods. The large receptive field of neural networks may allow it to handle large image displacement better than a direct method. In addition, whereas image gradients are used to update homography parameters in direct methods, with neural networks, these gradients are used to update network parameters which correspond to improving learned features. Finally, direct methods are an online optimization process that use gradients from a single pair of images, whereas training a deep network is an offline optimization process that averages gradients across multiple images. Injecting noise into this training process can further improve robustness to different appearance variations. Understanding the relationship between the neural network and photometric loss functions is an important direction for future work.

V-B Aerial Dataset Results

This section analyzes the performance profile of each method on a representative dataset of aerial imagery captured by a UAV. In addition to accuracy performance, an equally important consideration for real world application is inference speed. As a result, we also discuss the performance to speed tradeoffs of each method.

Our aerial dataset contains image pairs resized to , captured by a DJI Phantom Pro platform in Yardley, Pensylvania, USA in . We divided it into train and test samples. We did not label the train set, but for evaluation purposes, we manually labeled the ground truth by picking 4 pairs of correspondences for each test sample. We also randomly inject illumination noise in both the training and testing sets. The evaluation metrics are the same for the synthetic data. To reduce training time, we finetune the neural networks on the aerial image data. Our unsupervised algorithm can directly use the aerial dataset image pairs. However, since we do not have ground truth homography labels, we have to perform a similar synthetic data generation process as in the synthetic dataset in order to finetune the supervised neural network. We fine tune both models over iterations for roughly hours with data augmentation.

Fig. 6 displays the performance profile for the Unsupervised, Supervised, SIFT, and ECC methods. Fig 4 displays the speed and performance tradeoff for these methods, and additionally the featured based method ORB. The feature-based methods are tested on a 16-core Intel Xeon CPU, and the deep learning methods are tested on the same CPU and an NVIDIA Titan X GPU. The closer to the lower left hand corner, the better the performance and faster the runtime.

Both Figs. 6 and 4 demonstrate that our unsupervised algorithm has the best performance of all methods. In addition, Fig. 4 also shows that our unsupervised method on the GPU has both the best performance and the fastest inference times. SIFT has the second best performance after our unsupervised algorithm, but has a much slower runtime (approximately times slower). ORB has a faster runtime than SIFT, but at the expense of poorer performance. The ECC direct method approach has the worst performance and runtime of all the methods. A qualitative example where both SIFT and ECC fail to deliver a good result while our method succeeds is illustrated in Fig. 5.

One of the most interesting results is that while the supervised and unsupervised approaches performed comparably on the synthetic data, the supervised approach had drastically poorer performance on the aerial image dataset. This shift is due to the fact that ground truth labels are not available for our aerial dataset. The generalization gap from synthetic (train) to real (test) data is an important problem in machine learning. The best practical approach is to additionally fine-tune the model on the new distribution of data. In a robotic field experiment, this can be achieved by flying the UAV to collect a few sample images and fine-tuning on those images. However, this fine-tuning is only possible with our unsupervised algorithm. Our aerial dataset results highlight the fact that even though synthetic data can be generated from real images, a pair of synthetic images is still very different from a pair of real images. These results demonstrate that the independence of our unsupervised algorithm from expensive ground truth labels has large practical implications for real-world performance.

Vi Conclusions

We have introduced an unsupervised algorithm that trains a deep neural network to estimate planar homographies. Our approach outperforms the corresponding supervised network on both synthetic and real-world datasets, demonstrating the superiority of unsupervised learning in image warping problems. Our approach achieves faster inference speed, while maintaining comparable or better accuracy than feature-based and direct methods. We demonstrate that the unsupervised approach is able to handle large displacements and large illumination variations that are typically challenging for direct approaches that use the same photometric loss function. The speed and adaptive nature of our algorithm makes it especially useful in aerial multi-robot applications that can exploit parallel computation.

In this work, we do not investigate robustness against occlusion, leaving it as future work. However, as suggested in [24], we could potentially address this issue by using data augmentation techniques such as artificially inserting random occluding shapes into the training images. Another direction for future work is investigating different improvements to achieve sub-pixel accuracy in the top performance percentile.

Finally, our approach is easily scalable to more general warping motions. Our findings provide additional evidence for applying deep learning methods, specifically unsupervised learning, to various robotic perception problems such as stereo depth estimation, or visual odometry. Our insights on estimating homographies with unsupervised deep neural network approaches provide an initial step in a structured progression of applying these methods to larger problems.

Vii Acknowledgements

We gratefully acknowledge the support of ARL grants W911NF-08-2-0004 and W911NF-10-2-0016, ARO grant W911NF-13-1-0350, N00014-14-1-0510, N00014-09-1-1051, N00014-11-1-0725, N00014-15-1-2115 and N00014-09-1-103, DARPA grants HR001151626/HR0011516850 USDA grant 2015-67021-23857 NSF grants IIS-1138847, IIS-1426840 CNS-1446592 CNS-1521617 and IIS-1328805, Qualcomm Research, United Technologies, and TerraSwarm, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA. We would also like to thank Aerial Applications for the UAV data set.