1 Introduction
Finding correspondences between images is a key task in many computer vision applications, including image alignment [28, 29], visual localization [31, 34, 35], image retrieval [2, 11], structure-from-motion [30], semantic correspondence [12, 16], optical flow [14, 15, 26, 33] and relative camera pose estimation [23, 36]. In general, there are two ways to establish a pixel-wise correspondence field between images. The first group of methods applies feature descriptors to an image pair and uses a nearest-neighbor criterion to match keypoints globally. However, these approaches do not produce dense correspondences explicitly and rely on interpolation or local affine transformations [18] to turn a sparse set of matches into pixel-wise correspondences. Another possible direction for finding dense correspondences is to compare image patches in feature space. Neural networks have been widely used to learn discriminative and robust descriptors [3, 21]. Those descriptors are then compared pairwise by thresholding the Euclidean distance between them [6, 9, 22] or by predicting a binary label [37, 38]. In contrast, the proposed approach processes the image as a whole; thus, it can handle a broader set of geometric changes in images and directly predict dense correspondences without any post-processing steps.
Recent optical flow methods [14, 33] have demonstrated great success at estimating dense sub-pixel correspondences. However, the main limitation of these methods is a spatially constrained correlation layer that predicts the matches in a small vicinity around the center pixel of each image patch. Thus, the transformations that can be captured are very restricted. To some extent this restriction can be alleviated with a pyramid structure [33], but not completely.
In this paper we propose a convolutional neural network (CNN) architecture, called DGC-Net, for learning dense pixel correspondences between a pair of images with strong geometric transformations. Following more recent optical flow methods [14, 26, 33] and the concept introduced by Lucas-Kanade [20], we exploit a coarse-to-fine image warping idea by creating a hierarchical network structure. Rather than considering only affine and thin-plate spline (TPS) transformations [28], we train our system on synthetic data in an end-to-end manner, handling the diverse geometric transformations present in the real world. We demonstrate that the proposed approach substantially outperforms CNN-based optical flow and image matching methods on the challenging HPatches [4] and DTU [1] datasets.
The main contributions of this paper are: 1) We propose an end-to-end CNN-based method, DGC-Net, to establish dense pixel correspondences between images with strong geometric transformations; 2) We demonstrate that even if DGC-Net is trained only on synthetic transformations, it can generalize well to real data; 3) We apply the proposed approach to the problem of relative camera pose estimation and demonstrate that our method outperforms strong baseline approaches by a large margin. In addition, we modify the original structure of DGC-Net and seamlessly integrate a matchability decoder that can significantly improve the computational efficiency of the relative camera pose estimation pipeline by removing tentative correspondences with low confidence scores.
2 Related Work
The traditional image matching pipeline begins with the detection of interest points and the computation of descriptors. However, many of the most widely used descriptors [5, 19] are based on handcrafted features and have limited ability to cope with adverse factors, such as strong illumination changes and large variation in viewpoint. In contrast, more recent methods based on view synthesis [24, 25] have demonstrated state-of-the-art results in image matching by handling large viewing angle differences and appearance changes. However, they do not produce dense per-pixel correspondences and do not perform any learning.
Machine learning techniques have proven very effective for optical flow estimation [14, 26, 33], which is closely related to the task of finding pixel correspondences. Recently proposed methods, PWC-Net [33] and FlowNet2 [14], utilize a correlation layer to predict image similarities in some neighborhood around the center pixel in a coarse-to-fine manner. While such a spatially constrained correlation layer leads to state-of-the-art results in optical flow, it performs poorly for the very strong geometric transformations that we consider in this work. Rocco et al. [28] proposed a CNN-based approach for determining correspondences between two images and applied it to instance-level and category-level tasks. In contrast to optical flow methods [14, 33], it comprises a matching layer that calculates the correlation between target and reference feature maps without any spatial constraint. The method casts the task of finding pixel correspondences as a regression problem and consists of two independent Siamese CNNs that are trained separately and directly predict affine and TPS geometric transformations parametrized by 6-element and 18-element vectors, respectively. In contrast, we propose a more general approach that handles more diverse transformations and operates in an end-to-end fashion.
3 Method
Our goal is to determine correspondences between two input images. The most straightforward way to solve this task is to predict the parameters of the relative transformation matrix, parametrized for a particular class of geometric transformations such as a homography [8], an affine or a TPS [28] transformation. However, realistic scenes usually contain more complex geometric transformations that can hardly be described by such a parametrization. Inspired by recent work in image compositing [17] and optical flow estimation, we propose to predict a dense pixel correspondence map in a coarse-to-fine manner.
3.1 Network Architecture
In this section, we first present the structure of the proposed network and the general principles behind it, then formulate the view correspondence objective function to predict geometric transformations between image pairs.
A schematic representation of the proposed approach is shown in Fig. 1. A pair of input images is fed into a module consisting of two pre-trained CNN branches which construct a feature pyramid. The correlation layer takes the feature maps of the source and target images from the coarse (top) level of the pyramid and estimates the pairwise similarity between them. Then, the correspondence map decoder takes the output of the correlation layer and directly predicts pixel correspondences for this particular level of the pyramid. The estimates are then refined in an iterative manner.
Feature pyramid creator. In order to create a representation of an input image pair in feature space, we construct a Siamese neural network with two branches with shared weights. The branches use the VGG-16 architecture [32] trained on ImageNet [7] and truncated at the last pooling layer, followed by normalization [28]. We extract features at different parts of each branch to create a 5-layer feature pyramid whose levels, ordered from top to bottom, are encoded with different colors in Fig. 1. The weights of the CNN branches are then fixed throughout the rest of the network training procedure.
Correlation layer. In order to estimate a similarity score between two images, we follow an idea proposed in [28] and calculate the correlation volume between the normalized feature maps of the source and target images. In contrast to optical flow approaches [14, 33], where the correlation volume is computed for the raw features in a restricted area around the center pixel, we compute a global correlation and apply normalization before and after the correlation layer to strongly down-weight ambiguous matches (Fig. 1). Specifically, the correlation layer computes the scalar product between each feature vector of the source and all vectors of the target feature maps and can be defined in the following way:
(1) 
Here, each entry of the correlation volume is the scalar product of a source and a target feature vector, and the resulting correlation volume is L2-normalized. Since the third dimension of the correlation volume is the product of its two spatial dimensions, it is not feasible to calculate such volumes in the bottom layers of the pyramid, where the spatial resolution of the feature maps is large. Thus, at the bottom feature pyramid layers, we concatenate descriptors channel-wise.
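As an illustration of the global correlation described above, the following is a minimal pure-Python sketch, not the authors' implementation. The function names, the nested-list feature layout and the choice to L2-normalize the output along its third dimension are assumptions made for this example.

```python
import math


def l2_normalize(vec, eps=1e-8):
    """Scale a vector to (approximately) unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def global_correlation(source, target):
    """Correlate every source descriptor with every target descriptor.

    `source` and `target` are h x w grids (nested lists) of feature vectors.
    Returns an h x w x (h*w) volume: entry [i][j][k] is the scalar product of
    the normalized source descriptor at (i, j) and the k-th target descriptor
    (row-major order), normalized again along k (assumed axis).
    """
    h, w = len(source), len(source[0])
    # Normalization before the correlation.
    src = [[l2_normalize(source[i][j]) for j in range(w)] for i in range(h)]
    tgt = [l2_normalize(target[i][j]) for i in range(h) for j in range(w)]
    volume = []
    for i in range(h):
        row = []
        for j in range(w):
            scores = [sum(a * b for a, b in zip(src[i][j], t)) for t in tgt]
            # Normalization after the correlation.
            row.append(l2_normalize(scores))
        volume.append(row)
    return volume
```

The third dimension has h*w entries per location, so memory grows with the square of the number of pixels, which is why such a volume is only practical at the coarse levels of the pyramid.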
Correspondence map decoder. The output of the correlation layer is then fed into a correspondence map decoder consisting of 5 convolutional blocks (Conv-BN-ReLU) to estimate a 2D dense correspondence field at a particular level of the feature pyramid. The estimates are parameterized such that each predicted pixel location in the map is expressed in width- and height-normalized image coordinates. The field is refined from coarse to fine: we upsample the correspondence field predicted at one level and use it to warp the feature maps of the source image at the next level toward the target features. Finally, the upsampled field, the warped source features and the target features are concatenated along the channel dimension and provided as input to the correspondence map decoder at that next level.
Each convolution layer in the decoder is padded to keep the spatial resolution of the feature maps intact. Moreover, in order to capture more spatial context at the bottom layers of the pyramid, different dilation factors have been added to the convolution blocks at those layers to increase the receptive field. The feature pyramid creator, the correlation layer and the hierarchical chain of correspondence map decoders together form a CNN architecture that we will refer to as DGC-Net in the following.
Given an image pair and the ground-truth pixel correspondence map, we can define a hierarchical objective loss function as follows:
(2) 
The loss measures the distance between the estimated and the ground-truth dense correspondence maps. A ground-truth binary mask (the matchability mask) indicates whether each pixel in the source image has a correspondence in the target, and the loss is accumulated only over the valid pixel locations according to this mask at each level of the feature pyramid. In order to adjust the weight of the different pyramid layers, we introduce a vector of scalar weight coefficients.
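The structure of this hierarchical loss can be sketched as follows. This is only an illustration under stated assumptions: the L1 distance, the averaging over valid pixels and all function names are choices made for the example, since the text above specifies only a masked distance weighted per pyramid level.

```python
def masked_l1(pred, gt, mask):
    """Mean L1 distance over the pixels where mask is 1.

    `pred` and `gt` are H x W grids of (x, y) locations; `mask` is an
    H x W grid of 0/1 values (the ground-truth matchability mask).
    """
    total, count = 0.0, 0
    for pred_row, gt_row, mask_row in zip(pred, gt, mask):
        for (px, py), (gx, gy), valid in zip(pred_row, gt_row, mask_row):
            if valid:
                total += abs(px - gx) + abs(py - gy)
                count += 1
    return total / count if count else 0.0


def hierarchical_loss(preds, gts, masks, weights):
    """Weighted sum of the masked distances, one term per pyramid level."""
    return sum(
        weight * masked_l1(pred, gt, mask)
        for pred, gt, mask, weight in zip(preds, gts, masks, weights)
    )
```

Each list argument of `hierarchical_loss` holds one entry per pyramid level, so the ground-truth map and mask are expected to be resampled to the resolution of each level beforehand.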
Matchability decoder. According to recent advances in optical flow [15, 33], it is still very challenging to estimate correct correspondences for ill-posed cases, such as occluded regions of an image pair. Thus, in addition to the pixel correspondence map produced by DGC-Net, we would like to directly predict a measure of confidence for each correspondence. Specifically, we modify the DGC-Net structure by adding a matchability branch. It contains four convolutional layers outputting a probability map (parametrized as a sigmoid) indicating a confidence score for each pixel location in the predicted correspondence map. We will refer to this architecture as DGC+M-Net. Since we consider this problem as a pixel classification task, we optimize a binary cross entropy (BCE) with logits loss that is defined as:
(3) 
where the loss compares the ground-truth and the estimated matchability masks, and the element-wise sigmoid function is applied to the estimate. The total loss for the DGC+M-Net model is the sum of the correspondence loss and the matchability loss scaled by a weighting coefficient:
(4)
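For reference, a BCE-with-logits loss and the weighted total loss can be sketched in a few lines of Python. The function names and the averaging over pixels are assumptions of this example; the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)) is the standard way to evaluate the cross entropy of sigmoid(z) against a label y.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross entropy between sigmoid(logits) and 0/1 targets."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Stable rewrite of -(y*log(s) + (1-y)*log(1-s)) with s = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def total_loss(correspondence_loss, matchability_loss, weight):
    """Correspondence loss plus the weighted matchability loss."""
    return correspondence_loss + weight * matchability_loss
```

Working on logits rather than probabilities avoids taking the logarithm of values that round to zero for confident predictions.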
We provide detailed information about the hyperparameters used in training, as well as the exact definitions of all network components, in the supplementary material.
4 Experiments
We discuss the experimental settings and evaluate the proposed method on two closely related tasks, finding correspondences between images and relative camera pose estimation.
4.1 Baselines
In this work we compare our approach with several strong baselines.
Image alignment. Rocco et al. [28] propose a CNN-based method to estimate geometric transformations between two images, achieving state-of-the-art results in a semantic correspondence task. The transformations are parameterized as an 18-element vector and directly regressed by the network. We apply the estimates to a regular grid of the size of the input images to produce a dense pixel correspondence map.
Optical flow estimation requires finding correspondences between two input images. Therefore, we consider three CNN-based optical flow approaches, SPyNet [26], FlowNet2 [14] and the recently proposed PWC-Net [33], as baseline methods. In detail, PWC-Net is based on a coarse-to-fine paradigm and predicts optical flow at different scales of feature maps produced by a Siamese CNN. The coarse estimates are then used to refine the flow. For optical flow methods we use the pre-trained models from the original authors.
DeepMatching [27] is a matching algorithm that aims at finding semi-dense image correspondences. Specifically, it relies on a multi-scale image pyramid architecture without any trainable parts and can cope with very challenging scenes, such as repetitive textures and non-rigid image transformations.
4.2 Datasets
We compare the proposed approach with different baseline methods on two evaluation datasets.
HPatches [4] consists of several sequences of real images with varying photometric and geometric changes. Each image sequence contains a reference (target) image and 5 source images taken under a different viewpoint. For all images the estimated ground-truth homography is provided; thus, dense correspondence maps can be obtained for each test image pair. There are 59 image sequences with challenging geometric transformations in total.
DTU. The pixel correspondences produced by our method can also be used for relative camera pose estimation. Thus, in order to measure the performance of the proposed approach on this task, we utilize the DTU image dataset [1], consisting of 124 scenes with very accurate absolute camera poses collected by a precisely positioned robot. We create a list of camera pairs which have overlapping fields of view and then randomly choose a subset of image pairs covering all the scenes.
Training datasets. We use the training and validation splits proposed by [28] to compare both approaches fairly. Specifically, Rocco et al. [28] generate synthetic affine (aff) and thin-plate spline (TPS) transformations and apply them to images from the Pascal VOC 2011 and Tokyo Time Machine datasets. Each synthetic dataset has 20k training and validation image pairs, respectively. However, those transformations are not very diverse. To be able to estimate the correspondences for HPatches scenes accurately, we therefore generate 20k labeled training examples [8] by applying random homography transformations to the dataset. All training datasets mentioned above represent only synthetic geometric transformations between images. However, it is hard to artificially generate transformations as diverse as those present in the real 3D world. Therefore, in addition to synthetic data, we utilize the Citywall dataset used for 3D reconstruction and provided by [10]. Based on camera poses and depth maps estimated with the Multi-View Reconstruction Environment [10], we create a list of 10k image pairs and ground-truth correspondence maps. We use this data to fine-tune the proposed model. We emphasize that the objective of this experiment is to demonstrate that fine-tuning on realistic data leads to further improvement of the results.
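To make the idea of synthetic homography supervision concrete, the sketch below derives a dense ground-truth correspondence map and a matchability mask from a homography. It is an illustration only: the sampling scheme in `sample_homography`, the [-1, 1] coordinate normalization and all names are assumptions of this example and not necessarily the procedure used in [8] or in this paper.

```python
import random


def sample_homography(rng, max_shift=0.2):
    """Hypothetical sampler: identity plus a small random perturbation."""
    h_mat = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for r in range(3):
        for c in range(3):
            if (r, c) != (2, 2):  # keep the scale entry fixed at 1
                h_mat[r][c] += rng.uniform(-max_shift, max_shift)
    return h_mat


def correspondence_map(h_mat, height, width):
    """Apply a homography to a regular grid of normalized coordinates.

    Returns (grid, mask): grid[i][j] is the mapped (x, y) location of pixel
    (i, j), and mask[i][j] is 1 if that location stays inside the image,
    i.e. inside [-1, 1] x [-1, 1]. Requires height >= 2 and width >= 2.
    """
    grid, mask = [], []
    for i in range(height):
        y = -1.0 + 2.0 * i / (height - 1)
        grid_row, mask_row = [], []
        for j in range(width):
            x = -1.0 + 2.0 * j / (width - 1)
            d = h_mat[2][0] * x + h_mat[2][1] * y + h_mat[2][2]
            u = (h_mat[0][0] * x + h_mat[0][1] * y + h_mat[0][2]) / d
            v = (h_mat[1][0] * x + h_mat[1][1] * y + h_mat[1][2]) / d
            grid_row.append((u, v))
            mask_row.append(1 if -1.0 <= u <= 1.0 and -1.0 <= v <= 1.0 else 0)
        grid.append(grid_row)
        mask.append(mask_row)
    return grid, mask
```

For example, `correspondence_map(sample_homography(random.Random(0)), 240, 240)` yields one synthetic label pair; with `max_shift=0.2` the projective denominator stays at or above 0.6 on the normalized grid, so the division is safe.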
4.3 Metrics
As predicting a dense correspondence grid is closely related to optical flow estimation, we follow the standard evaluation metric used in this task, the average endpoint error (AEPE). AEPE is defined as the average Euclidean distance between the estimated and the ground-truth correspondence maps. In addition to AEPE, we also use the Percentage of Correct Keypoints (PCK) as an evaluation metric. PCK is the percentage of estimated points that are correctly matched, i.e. that lie within a certain threshold (in pixels) of the corresponding ground-truth points.
In order to estimate the accuracy of the matchability mask predictions, we report the normalized Jaccard index (Intersection over Union, IoU) for the ground-truth and estimated masks. This metric is interpreted as a similarity measure between two finite sample sets and is widely used in semantic segmentation [13].
4.4 Results
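The evaluation metrics of Sec. 4.3 (AEPE, PCK and the Jaccard index) can be sketched as follows. The function names, the flat list-of-points layout and the inclusive threshold in `pck` are assumptions of this example.

```python
import math


def aepe(pred, gt):
    """Average Euclidean distance between predicted and true locations."""
    errors = [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(errors) / len(errors)


def pck(pred, gt, threshold):
    """Fraction of predictions within `threshold` pixels of the ground truth."""
    hits = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return hits / len(pred)


def jaccard(mask_a, mask_b):
    """Intersection over union of two flat binary masks."""
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    # Two empty masks are treated as identical (a convention of this sketch).
    return intersection / union if union else 1.0
```

In practice the point lists passed to `aepe` and `pck` would contain only the valid pixel locations selected by the ground-truth mask.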
Synthetic Datasets. First, we experimentally compare the proposed DGC-Net and DGC+M-Net models with [28] by calculating AEPE. All the models have been trained on the data provided by [28]. More specifically, *aff methods utilize only synthetic affine transformations during training, whereas *aff+tps methods are additionally trained on TPS transformations. AEPE is measured only for valid pixel locations of the (P) and (T) test data by applying the ground-truth mask. For the DGC+M-Net* models we also report the normalized Jaccard index. Tab. 1 shows that DGC-Net significantly outperforms all baseline methods on both evaluation datasets. Although the DGC+M-Net model is marginally worse than DGC-Net when the transformation between images can be described by an affine transformation, it is a more universal approach, as it additionally predicts a matchability map which is quite accurate according to the Jaccard similarity score. It is worth noting that the proposed models generalize well to unseen data, since the AEPE metric varies only slightly between the (P) and (T) evaluation datasets. This shows that the model has learned the geometric transformations and is not overfitting to the visual representation of the images.
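Table 1. AEPE on the synthetic test sets (for DGC+M-Net: AEPE / normalized Jaccard index). Training sets: Pascal VOC 2011 (P) and Tokyo Time Machine (T).
Method                Train (P)     Train (P)     Train (T)     Train (T)
                      Test (P) aff  Test (T) aff  Test (T) aff  Test (P) aff
Rocco [28] aff        3.82          3.93          4.10          4.45
DGC+M-Net aff         0.92/0.847    0.97/0.847    1.03/0.848    1.14/0.848
DGC-Net aff           0.95          0.99          0.90          1.03
Rocco [28] aff+tps    3.28          3.30          4.83          4.97
DGC+M-Net aff+tps     0.82/0.849    0.96/0.849    0.83/0.853    0.92/0.853
DGC-Net aff+tps       0.57          0.69          0.54          0.61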
Realistic Datasets. To demonstrate the performance on more realistic data, we evaluate all baseline methods and our approach on the HPatches dataset. That is, we calculate AEPE over all image sequences belonging to the same viewpoint ID and report the numbers in Tab. 2. Compared to *aff models, fine-tuning on TPS transformations leads to a significant improvement in performance, reducing the overall AEPE by 20% for Viewpoint II and by 9% for Viewpoint V. The performance is improved further by fine-tuning the model on synthetic homography data. To prevent large errors caused by interpolation, we calculate the AEPE metric directly on the semi-dense DeepMatching [27] estimates (hence [27] has an unfair advantage in terms of AEPE). The Jaccard index for the DGC+M-Net* models is provided in Tab. 3.
In addition, we report the number of correctly matched pixels between two images by calculating the PCK metric with different thresholds. The comparison with [28] is especially interesting, as the coarse level of our pipeline is based on its matching strategy. As shown in Fig. 2, the proposed method correctly matches around 85% of the pixels when the geometric transformations are quite small (Viewpoint I). It significantly outperforms [28] trained on the same data without any external synthetic datasets and can be further improved by utilizing more diverse transformations during training. Compared to FlowNet2 and PWC-Net, our method (DGC-Net) can handle scenarios exhibiting drastic changes between views (Viewpoint IV and V), achieving a PCK of 59% with a 1-pixel threshold for the most challenging case.
Table 2. AEPE on HPatches for each Viewpoint ID.
Method                        I      II     III    IV     V
SPyNet [26]                   36.94  50.92  54.29  62.60  72.57
DeepMatching* [27]            5.84   4.63   12.43  12.17  22.55
FlowNet2 [14]                 5.99   15.55  17.09  22.13  30.68
PWC-Net [33]                  4.43   11.44  15.47  20.17  28.30
Rocco [28] aff (P)            14.85  29.09  31.04  39.35  45.92
DGC+M-Net aff (P)             5.96   12.85  16.23  20.64  27.63
DGC-Net aff (P)               6.80   13.82  17.15  22.62  28.39
Rocco [28] aff (T)            15.02  28.23  29.27  36.57  43.68
DGC+M-Net aff (T)             6.22   14.46  17.21  22.92  29.65
DGC-Net aff (T)               5.12   13.01  15.08  20.14  26.47
Rocco [28] aff+tps (P)        9.50   22.47  24.73  34.20  41.46
DGC+M-Net aff+tps (P)         4.35   11.17  14.09  18.66  25.04
DGC-Net aff+tps (P)           4.20   10.78  14.34  18.48  25.00
Rocco [28] aff+tps (T)        9.59   18.55  21.15  27.83  35.19
DGC+M-Net aff+tps (T)         4.40   8.92   11.94  16.33  22.01
DGC-Net aff+tps (T)           3.10   8.18   10.97  16.29  22.29
DGC+M-Net aff+tps+homo (T)    2.97   6.85   9.95   12.87  19.13
DGC-Net aff+tps+homo (T)      1.55   5.53   8.98   11.66  16.70
Table 3. Jaccard index of the DGC+M-Net models on HPatches for each Viewpoint ID.
Transformation    I      II     III    IV     V
aff               0.682  0.617  0.562  0.523  0.445
aff+tps           0.700  0.650  0.603  0.573  0.496
aff+tps+homo      0.730  0.687  0.629  0.590  0.525
Relative camera pose. In this section, we demonstrate the application of the proposed method to predicting relative camera pose. Given a list of correspondences and the matrix of intrinsic camera parameters, we estimate the essential matrix by applying RANSAC. To decrease the randomness of RANSAC, for each image pair we run a 1000-iteration loop 5 times and choose the estimated essential matrix with the maximum inlier count. Once the essential matrix is estimated, the relative pose can be recovered from it. Similarly to [23], we use the relative orientation error and the relative translation error as metrics for evaluating the performance. Both metrics compute the angle between the estimated orientation/translation and the ground truth. Fig. 2(a) and 2(b) show a set of normalized cumulative histograms of relative orientation and translation errors for each baseline model evaluated on all scenes of the DTU dataset (Sec. 4.2). As before, DGC-Net and DGC+M-Net have been trained only on synthetic transformations (aff+tps+homo). For a fair comparison, we resize the images to the same size for all baseline methods and change the internal camera parameters accordingly. Interestingly, both PWC-Net [33] and FlowNet2 [14] estimate relative orientation quite well in terms of median error. The proposed approach outperforms all CNN-based baselines, improving the median relative orientation and translation errors by 18% and 40%, respectively, compared to PWC-Net.
We also evaluate the DGC+M-Net model, which additionally predicts a matchability mask. This mask can be used as a filter to remove tentative correspondences with small confidence scores from the relative pose estimation pipeline. According to Fig. 3, DGC+M-Net falls slightly behind DGC-Net in estimating relative pose, but it achieves a significant advantage in computational efficiency, decreasing the elapsed time for estimating relative camera pose for all test image pairs from 312 sec. to 162 sec.
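The angular error metrics and the best-of-several-runs selection described above can be sketched as follows. This is a hedged illustration: the function names are invented for the example, `run_ransac` is a hypothetical callable standing in for one RANSAC run that returns a `(model, inlier_count)` pair, and reporting the angles in degrees is an assumption.

```python
import math


def rotation_angle_error(r_est, r_gt):
    """Angle (degrees) of the residual rotation R_gt^T R_est (3x3 lists)."""
    # trace(R_gt^T R_est) equals the element-wise inner product of the two.
    trace = sum(r_gt[k][i] * r_est[k][i] for k in range(3) for i in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_angle_error(t_est, t_gt):
    """Angle (degrees) between estimated and true translation directions."""
    dot = sum(a * b for a, b in zip(t_est, t_gt))
    norms = math.sqrt(sum(a * a for a in t_est)) * math.sqrt(sum(b * b for b in t_gt))
    cos_angle = max(-1.0, min(1.0, dot / norms))
    return math.degrees(math.acos(cos_angle))


def best_of_runs(run_ransac, runs=5):
    """Repeat a randomized estimator and keep the result with most inliers."""
    return max((run_ransac() for _ in range(runs)), key=lambda result: result[1])
```

Comparing translation directions rather than vectors reflects the fact that the translation recovered from an essential matrix is defined only up to scale.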
To experiment with more realistic transformations, we fine-tune the DGC-Net model on the Citywall dataset (Sec. 4.2), illustrated in the supplementary material. We refer to this model as DGC-Net-Citywall. The ground-truth transformation maps are incomplete, leading to multiple missing regions in the warped reference images (see the supplementary material). However, using external data with more diverse transformations helps to improve the performance of the method remarkably, decreasing the median relative translation error by 17% according to Fig. 2(b).
In addition, we calculate the epipolar error for the matches produced by our method, PWC-Net and FlowNet2. The error is defined in terms of the squared distances between the points and the corresponding epipolar lines as follows:
(5) 
The quantities involved are the pairs of matching points in the two images, the ground-truth fundamental matrix between the two views, and the number of image pixels (image resolution). The normalized cumulative histogram of the error is presented in Fig. 2(c). Quantitatively, the proposed method provides quite accurate pixel correspondences between two views, achieving a median error of less than 4 pixels across the whole test dataset.
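One common way to compute such an error is the symmetric squared point-to-epipolar-line distance, sketched below. The symmetric form, the averaging over matches and the function name are assumptions of this example, since Eq. (5) is not reproduced here; the convention used is that a point x in the first image and its match x' in the second satisfy x'^T F x = 0.

```python
def epipolar_error(points_a, points_b, f_mat):
    """Mean symmetric squared distance between points and epipolar lines.

    `points_a` and `points_b` are lists of matching (x, y) pixel locations
    in the first and second image; `f_mat` is a 3x3 fundamental matrix.
    Assumes no epipolar line is degenerate (zero normal vector).
    """
    total = 0.0
    for (xa, ya), (xb, yb) in zip(points_a, points_b):
        pa = (xa, ya, 1.0)
        pb = (xb, yb, 1.0)
        # Epipolar line of pa in the second image: F * pa.
        line_b = [sum(f_mat[r][c] * pa[c] for c in range(3)) for r in range(3)]
        # Epipolar line of pb in the first image: F^T * pb.
        line_a = [sum(f_mat[r][c] * pb[r] for r in range(3)) for c in range(3)]
        residual = sum(pb[r] * line_b[r] for r in range(3))  # pb^T F pa
        total += residual ** 2 / (line_b[0] ** 2 + line_b[1] ** 2)
        total += residual ** 2 / (line_a[0] ** 2 + line_a[1] ** 2)
    return total / len(points_a)
```

For a purely horizontal camera translation the epipolar lines are image rows, so the error reduces to twice the squared vertical offset between matched points.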
4.5 Ablation Study
In this section, we analyze some design decisions of the proposed approach. More specifically, our goal is to investigate the benefits of using a global correlation layer compared to the one utilized in recent optical flow methods [14, 33]. In addition, we experiment with another type of parametrization of the ground-truth data by representing a correspondence map as flow. Furthermore, we demonstrate the importance of normalizing the correlation map. The results are presented in Tab. 4.



Global correlation layer: In contrast to the proposed approach, the PWC-Net architecture comprises a local correlation layer that computes similarities between two feature maps in a restricted area around the center pixel at each level of the feature pyramid. However, it is very hard to compare DGC-Net and the off-the-shelf PWC-Net approach fairly due to the significant difference in network structures (see Tab. 3(c)). Therefore, we construct a new coarse-to-fine CNN model by keeping all the blocks of DGC-Net except the correlation layer. More specifically, each feature pyramid level is complemented by a local correlation layer as used in the PWC-Net structure. We dub this model PWCm-Net. As shown in Tab. 3(a), the global correlation layer achieves a significant improvement, in terms of error in pixels, over the case with a set of spatially constrained correlation layers. All results have been obtained with only affine transformations in the training data.
L2 normalization: As explained in Sec. 3.1, we normalize the output of the correlation layer to down-weight the putative matches. In Tab. 3(a) we compare the original DGC-Net model and its modified version without the correlation layer normalization step (DGC-Net no L2-norm). According to the results, the normalization improves the error by about 15% for all test cases, demonstrating the importance of this step.
Different parametrization: Given two images, the proposed approach predicts a dense pixel correspondence map representing the absolute location of each image pixel. In contrast, all optical flow methods estimate pixel displacements between images. To check whether this choice of parametrization matters, we train the DGC-Net model on the same synthetic data as before but with ground-truth labels recalculated in an optical flow manner. We call this model DGC-Net-flow and provide the results in Tab. 3(a) and Tab. 3(b). Interestingly, while the DGC-Net-flow model performs marginally better on synthetic data, DGC-Net produces more accurate results in the case of large geometric transformations (Tab. 3(b)), demonstrating the benefit of the original parametrization.
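The two parametrizations compared above carry the same information and differ only by the regular pixel grid, as the following sketch shows. The function names and the use of unnormalized pixel coordinates, with x along columns and y along rows, are assumptions of this example.

```python
def flow_from_correspondence(corr):
    """Convert absolute locations to displacements: flow = corr - grid.

    `corr[i][j]` is the (x, y) location matched to the pixel in row i,
    column j.
    """
    return [
        [(x - j, y - i) for j, (x, y) in enumerate(row)]
        for i, row in enumerate(corr)
    ]


def correspondence_from_flow(flow):
    """Convert displacements back to absolute locations: corr = grid + flow."""
    return [
        [(j + dx, i + dy) for j, (dx, dy) in enumerate(row)]
        for i, row in enumerate(flow)
    ]
```

Although the conversion is lossless, the regression targets have different ranges: displacements grow with the strength of the transformation, whereas absolute locations always stay within the image extent.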
5 Conclusion
Our paper addressed the challenging problem of finding dense pixel correspondences. We have proposed a coarse-to-fine network architecture that efficiently handles diverse transformations between two views. We have shown that our contributions were crucial to outperforming strong baselines on challenging realistic datasets. Additionally, we have applied the proposed method to the relative camera pose estimation problem, demonstrating very promising results. We hope this paper inspires more research into applying deep learning to accurate and reliable dense pixel correspondence estimation.
References
 [1] H. Aanæs, R. R. Jensen, G. Vogiatzis, E. Tola, and A. B. Dahl. Large-Scale Data for Multiple-View Stereopsis. IJCV, pages 1–16, 2016.
 [2] A. Babenko, A. Slesarev, A. Chigorin, and V. S. Lempitsky. Neural Codes for Image Retrieval. In Proc. ECCV, 2014.
 [3] V. Balntas, E. Johns, L. Tang, and K. Mikolajczyk. PN-Net: Conjoined Triple Deep Network for Learning Local Image Descriptors. CoRR, abs/1601.05030, 2016.
 [4] V. Balntas, K. Lenc, A. Vedaldi, and K. Mikolajczyk. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proc. CVPR, 2017.
 [5] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-Up Robust Features (SURF). CVIU, 2008.
 [6] C. Choy, J. Gwak, S. Savarese, and M. Chandraker. Universal Correspondence Network. In Proc. NIPS, 2016.
 [7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li. ImageNet: A large-scale hierarchical image database. In Proc. CVPR, 2009.
 [8] D. DeTone, T. Malisiewicz, and A. Rabinovich. Deep Image Homography Estimation. In Proc. RSS Workshop on Limits and Potentials of Deep Learning in Robotics, 2016.
 [9] M. E. Fathy, Q.-H. Tran, M. Zeeshan Z., P. Vernaza, and M. Chandraker. Hierarchical Metric Learning and Matching for 2D and 3D Geometric Correspondences. In Proc. ECCV, 2018.
 [10] S. Fuhrmann, F. Langguth, and M. Goesele. MVE: A Multi-View Reconstruction Environment. In Proc. of the Eurographics Workshop on Graphics and Cultural Heritage, pages 11–18, 2014.
 [11] A. Gordo, J. Almazán, J. Revaud, and D. Larlus. Deep Image Retrieval: Learning global representations for image search. In Proc. ECCV, 2016.
 [12] K. Han, R. S. Rezende, B. Ham, K. K. Wong, M. Cho, C. Schmid, and J. Ponce. SCNet: Learning Semantic Correspondence. In Proc. ICCV, 2017.
 [13] V. Iglovikov and A. Shvets. TernausNet: U-Net with VGG11 encoder pre-trained on ImageNet for image segmentation. CoRR, abs/1801.05746, 2018.
 [14] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proc. CVPR, 2017.
 [15] J. Janai, F. Güney, A. Ranjan, M. J. Black, and A. Geiger. Unsupervised learning of multi-frame optical flow with occlusions. In Proc. ECCV, 2018.
 [16] Z. Laskar and J. Kannala. Semi-supervised semantic matching. In Proc. ECCVW, 2018.
 [17] C. Lin, E. Yumer, O. Wang, E. Shechtman, and S. Lucey. ST-GAN: Spatial Transformer Generative Adversarial Networks for Image Compositing. In Proc. CVPR, 2018.
 [18] W.-Y. Lin, M.-M. Cheng, J. Lu, H. Yang, M. N. Do, and P. Torr. Bilateral Functions for Global Motion Modeling. In Proc. ECCV, 2014.
 [19] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 2004.
 [20] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proc. IJCAI, 1981.
 [21] W. Luo, A. G. Schwing, and R. Urtasun. Efficient Deep Learning for Stereo Matching. In Proc. CVPR, 2016.
 [22] I. Melekhov, J. Kannala, and E. Rahtu. Image Patch Matching using Convolutional Descriptors with Euclidean Distance. In Proc. ACCVW, 2016.
 [23] I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu. Relative Camera Pose Estimation Using Convolutional Neural Networks. In Proc. ACIVS, 2017.
 [24] D. Mishkin, J. Matas, and M. Perdoch. MODS: Fast and robust method for two-view matching. CVIU, 2015.
 [25] J.-M. Morel and G. Yu. ASIFT: A New Framework for Fully Affine Invariant Image Comparison. SIAM J. Img. Sci., 2009.
 [26] A. Ranjan and M. J. Black. Optical Flow Estimation using a Spatial Pyramid Network. In Proc. CVPR, 2017.
 [27] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. DeepMatching: Hierarchical Deformable Dense Matching. IJCV, 120(3):300–323, 2016.
 [28] I. Rocco, R. Arandjelović, and J. Sivic. Convolutional neural network architecture for geometric matching. In Proc. CVPR, 2017.
 [29] I. Rocco, R. Arandjelović, and J. Sivic. End-to-end weakly-supervised semantic alignment. In Proc. CVPR, 2018.
 [30] J. L. Schönberger and J.-M. Frahm. Structure-from-Motion Revisited. In Proc. CVPR, 2016.
 [31] J. L. Schönberger, M. Pollefeys, A. Geiger, and T. Sattler. Semantic Visual Localization. In Proc. CVPR, 2018.
 [32] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR, abs/1409.1556, 2014.
 [33] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proc. CVPR, 2018.
 [34] H. Taira, M. Okutomi, T. Sattler, M. Cimpoi, M. Pollefeys, J. Sivic, T. Pajdla, and A. Torii. InLoc: Indoor Visual Localization with Dense Matching and View Synthesis. In Proc. CVPR, 2018.
 [35] C. Toft, E. Stenborg, L. Hammarstrand, L. Brynte, M. Pollefeys, T. Sattler, and F. Kahl. Semantic Match Consistency for Long-Term Visual Localization. In Proc. ECCV, 2018.
 [36] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. DeMoN: Depth and Motion Network for Learning Monocular Stereo. In Proc. CVPR, 2017.
 [37] S. Zagoruyko and N. Komodakis. Learning to Compare Image Patches via Convolutional Neural Networks. In Proc. CVPR, 2015.
 [38] J. Zbontar and Y. LeCun. Computing the Stereo Matching Cost with a Convolutional Neural Network. In Proc. CVPR, 2015.