# Warp Consistency for Unsupervised Learning of Dense Correspondences

The key challenge in learning dense correspondences lies in the lack of ground-truth matches for real image pairs. While photometric consistency losses provide unsupervised alternatives, they struggle with large appearance changes, which are ubiquitous in geometric and semantic matching tasks. Moreover, methods relying on synthetic training pairs often suffer from poor generalisation to real data. We propose Warp Consistency, an unsupervised learning objective for dense correspondence regression. Our objective is effective even in settings with large appearance and view-point changes. Given a pair of real images, we first construct an image triplet by applying a randomly sampled warp to one of the original images. We derive and analyze all flow-consistency constraints arising between the triplet. From our observations and empirical results, we design a general unsupervised objective employing two of the derived constraints. We validate our warp consistency loss by training three recent dense correspondence networks for the geometric and semantic matching tasks. Our approach sets a new state-of-the-art on several challenging benchmarks, including MegaDepth, RobotCar and TSS. Code and models will be released at https://github.com/PruneTruong/DenseMatching.

## 1 Introduction

Finding dense correspondences between images continues to be a fundamental vision problem, with many applications in video analysis [SimonyanZ14], image registration [GLAMpoint, shrivastava-sa11], image manipulation [HaCohenSGL11, LiuYT11], and style transfer [Kim2019, Liao2017]. While supervised deep learning methods have achieved impressive results, they are limited by the availability of ground-truth annotations. In fact, collecting dense ground-truth correspondence data of real scenes is extremely challenging and costly, if not impossible. Current approaches therefore resort to artificially rendered datasets [Dosovitskiy2015, Ilg2017a, Sun2018, Hui2018], sparsely computed matches [DusmanuRPPSTS19, D2D], or sparse manual annotations [ArbiconNet, MinLPC20, SCNet]. These strategies lack realism, accuracy, or scalability. In contrast, there is a virtually endless source of unlabelled image and video data, which calls for the design of effective unsupervised learning approaches.

Photometric objectives, relying on the brightness constancy assumption, have prevailed in the context of unsupervised optical flow [RenYNLYZ17, BackToBasics, Meister2017]. However, in the more general case of geometric matching, the images often stem from radically different views, captured at different occasions, and under different conditions. This leads to large appearance transformations between the frames, which significantly undermine the brightness constancy assumption. It is further invalidated in the semantic matching task [LiuYT11], where the images depict different instances of the same object class. As a prominent alternative to photometric objectives, warp-supervision [GLUNet, GOCor, Rocco2017a, Melekhov2019], also known as self-supervised learning [Rocco2018a, SeoLJHC18, MinLPC20], trains the network on synthetically warped versions of an image. While benefiting from direct supervision, the lack of real image pairs often leads to poor generalization to real data.

We introduce Warp Consistency, an unsupervised learning objective for dense correspondence regression. Our loss leverages real image pairs without invoking the photometric consistency assumption. Unlike previous approaches, it is capable of handling large appearance and view-point changes, while also generalizing to unseen real data. From a real image pair $(I, J)$, we construct a third image $I'$ by warping $I$ with a known flow field $W$, which is created by randomly sampling transformations, e.g. homographies, from a specified distribution. We then consider the consistency graph arising from the resulting image triplet $(I, I', J)$, visualized in Fig. 1. It is used to derive a family of new flow-consistency constraints. By carefully analyzing their properties, we propose an unsupervised loss based on predicting the flow $W$ by the composition $I' \to J \to I$ via image $J$ (Fig. 1). Our final warp consistency objective is then obtained by combining it with the warp-supervision constraint, also derived from our consistency graph by the direct path $I' \to I$.

We perform a comprehensive empirical analysis of the objectives derived from our warp consistency graph and compare them to existing unsupervised alternatives. In particular, our warp consistency loss outperforms approaches based on photometric consistency and warp-supervision on multiple geometric matching datasets. We further perform extensive experiments for two tasks by integrating our approach into three recent dense matching architectures, namely GLU-Net [GLUNet] and RANSAC-Flow [RANSAC-flow] for geometric matching, and SemanticGLU-Net [GLUNet] for semantic matching. Our unsupervised learning approach brings substantial gains in PCK-5 on MegaDepth [megadepth] for GLU-Net and in PCK-5 for RANSAC-Flow on RobotCar [RobotCar, RobotCarDatasetIJRR], as well as in PCK-0.05 on PF-Pascal [PFPascal] and TSS [Taniai2016] for SemanticGLU-Net. This leads to a new state-of-the-art on all four datasets. Example predictions are shown in Fig. 2.

## 2 Related work

Unsupervised optical flow:  While supervised optical flow networks need carefully designed synthetic datasets for their training [Dosovitskiy2015, Mayer2016ALD], unsupervised approaches do not require ground-truth annotations. Inspired by classical optimization-based methods [Horn1981], they instead learn deep models based on brightness constancy and spatial smoothness losses [RenYNLYZ17, BackToBasics]. The predominant technique mainly relies on photometric losses, e.g. the Charbonnier penalty [BackToBasics], census loss [Meister2017], or SSIM [WangBSS04, UnOS]. Such losses are often combined with forward-backward consistency [Meister2017] and edge-aware smoothness regularization [OccAwareFlow]. Occlusion estimation techniques [MFOccFlow, Meister2017, OccAwareFlow] are also employed to mask out occluded or outlier regions from the objective. Recently, several works [DDFlow, SefFlow, ARFlow] use a data distillation approach to improve the flow predictions in occluded regions. However, all aforementioned approaches rely on the assumption of limited appearance changes between two consecutive frames. While this assumption holds to a large degree in optical flow data, it is challenged by the drastic appearance changes encountered in geometric or semantic matching applications, as visualised in Fig. 2.

Unsupervised geometric matching:  Geometric matching focuses on the more general case where the geometric transformations and appearance changes between two frames may be substantial. Methods either estimate a dense flow field [Melekhov2019, GLUNet, GOCor, RANSAC-flow] or output a cost volume [Rocco2018b, D2D], which can be further refined to increase accuracy [RoccoAS20, LiHLP20, abs-2012-09842]. The latter approaches train the feature embedding, which is then used to compute dense similarity scores. Recent works further leverage the temporal consistency in videos to learn a suitable representation for feature matching [DwibediATSZ19, JabriOE20, WangJE19]. Our work focuses on the first class of methods, which directly learn to regress a dense flow field. Recently, Shen et al. [RANSAC-flow] use classical photometric and forward-backward consistency losses to train RANSAC-Flow. They partially alleviate the sensitivity of photometric losses to large appearance changes by pre-aligning the images with RANSAC. Several methods [Melekhov2019, GLUNet, GOCor] instead use a warp-supervision loss. By tasking the network with regressing a randomly sampled warp during training, a direct supervisory signal is obtained, but at the cost of poorer generalization to real data.

Semantic correspondences:  Semantic matching poses additional challenges due to intra-class appearance and shape variations. Manual annotations in this context are ill-defined and ambiguous, making it crucial to develop unsupervised objectives. Methods rely on warp-supervision strategies [Rocco2017a, Rocco2018a, ArbiconNet, SeoLJHC18, GLUNet], use proxy losses on the cost volume [DCCNet, Rocco2018b, Rocco2018a, MinLPC20], identify correct matches from forward-backward consistency of the cost volumes [Jeon], or jointly learn semantic correspondence with attribute transfer [Kim2019] or segmentation [SFNet]. Most related to our work are [Zhou2016, abs-2004-09061, ZhouLYE15]. Zhou et al. [Zhou2016] learn to align multiple images using 3D-guided cycle-consistency by leveraging the ground-truth matches between multiple CAD models. However, the need for 3D CAD models greatly limits its applicability in practice. In FlowWeb [ZhouLYE15], the authors optimize pre-existing pair-wise correspondences online using the cycle consistency of flows between images in a collection. Unlike these approaches, we require only pairs of images as supervision and propose a general loss formulation, learning to regress dense correspondences directly.

## 3 Method

### 3.1 Problem formulation and notation

We address the problem of finding pixel-wise correspondences between two images $I$ and $J$. Our goal is to estimate a dense displacement field $F_{I \to J}$, often referred to as flow, relating pixels in $I$ to $J$. The flow field $F_{I \to J}$ represents the pixel-wise 2D motion vectors in the coordinate system of image $I$. It is directly related to the mapping $M_{I \to J}$, which encodes the absolute location $M_{I \to J}(x)$ in $J$ corresponding to the pixel location $x$ in image $I$. It is thus related to the flow through $M_{I \to J}(x) = x + F_{I \to J}(x)$. It is important to note that the flow and mapping representations are asymmetric. $M_{I \to J}$ parametrizes a mapping from each pixel in image $I$, which is not necessarily bijective.

With a slight abuse of notation, we interchangeably view $F_{I \to J}$ and $M_{I \to J}$ as either elements of $\mathbb{R}^{h \times w \times 2}$ or as functions $\mathbb{R}^2 \to \mathbb{R}^2$. The latter is generally obtained by a bilinear interpolation of the former, and the interpretation will be clear from context when important. We define the warping $\Phi_F(T)$ of a function $T$ by the flow $F$ as $\Phi_F(T)(x) = T(x + F(x))$. This is more compactly expressed as $\Phi_F(T) = T \circ M_F$, where $M_F$ is the mapping defined by $F$ and $\circ$ denotes function composition. Lastly, we let $\mathbb{1}$ be the identity map $\mathbb{1}(x) = x$.
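The warping operation can be made concrete with a minimal NumPy sketch. It assumes that a flow is stored as an `(h, w, 2)` array holding `(dx, dy)` per pixel and that the warped function is an `(h, w, c)` array of the same spatial size; the function name `warp` and the returned validity mask are illustrative choices, not part of the original method description.

```python
import numpy as np

def warp(T, F):
    """Sample T at x + F(x) with bilinear interpolation.

    T: (h, w, c) array to be warped (an image or another flow field).
    F: (h, w, 2) flow, with F[..., 0] = dx and F[..., 1] = dy.
    Returns the warped array and a boolean mask that is True where the
    sampling location x + F(x) falls inside the image.
    """
    h, w = F.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    x = xs + F[..., 0]
    y = ys + F[..., 1]
    valid = (x >= 0) & (x <= w - 1) & (y >= 0) & (y <= h - 1)
    # Clamp so that out-of-range samples read the nearest border pixel.
    x = np.clip(x, 0, w - 1)
    y = np.clip(y, 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    ax = (x - x0)[..., None]  # horizontal interpolation weight
    ay = (y - y0)[..., None]  # vertical interpolation weight
    top = (1 - ax) * T[y0, x0] + ax * T[y0, x1]
    bottom = (1 - ax) * T[y1, x0] + ax * T[y1, x1]
    return (1 - ay) * top + ay * bottom, valid

# Usage: warp a horizontal ramp by a constant flow of 1.5 pixels in x.
ramp = np.tile(np.arange(6.0)[None, :, None], (4, 1, 1))
flow = np.zeros((4, 6, 2))
flow[..., 0] = 1.5
warped, valid = warp(ramp, flow)
```

The validity mask matters in practice because every warp shrinks the region in which a consistency relation can be evaluated.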

The goal of this work is to learn a neural network $f_\theta$, with parameters $\theta$, that predicts an estimated flow $\hat{F}_{I \to J} = f_\theta(I, J)$ relating $I$ to $J$. We will consistently use the hat to denote an estimated or predicted quantity. The straightforward approach to learn $f_\theta$ is to minimize the discrepancy between the estimated flow $\hat{F}_{I \to J}$ and the ground-truth flow $F_{I \to J}$ over a collection of real training image pairs $(I, J)$. However, such supervised training requires large quantities of densely annotated data, which is extremely difficult to acquire for real scenes. This motivates the exploration of unsupervised alternatives for learning dense correspondences.

### 3.2 Unsupervised data losses

To develop our approach, we first briefly review relevant existing alternatives for unsupervised learning of flow. While there is no general agreement in the literature, we adopt a practical definition of unsupervised learning in our context. We call a learning formulation ‘unsupervised’ if it does not require any information (i.e. supervision) other than pairs of images depicting the same scene or object. Specifically, unsupervised methods do not require any annotations made by humans or other matching algorithms.

Photometric losses:  Most unsupervised approaches train the network using a photometric loss [BackToBasics, Meister2017, OccAwareFlow, RANSAC-flow]. Under the photometric consistency assumption, it minimizes the difference between image $I$ and image $J$ warped according to the estimated flow field $\hat{F}_{I \to J}$ as,

$$L_{\text{photo}} = \rho\big(I, \Phi_{\hat{F}_{I \to J}}(J)\big). \tag{1}$$

Here, $\rho$ is a function measuring the difference between two images, e.g. the Charbonnier penalty [BackToBasics], SSIM [WangBSS04], or census [Meister2017].

Forward-backward consistency:  By constraining the backward flow $\hat{F}_{J \to I}$ to yield the reverse displacement of its forward counterpart $\hat{F}_{I \to J}$, we obtain the forward-backward consistency loss [Meister2017],

$$L_{\text{fb}} = \big\| \hat{F}_{I \to J} + \Phi_{\hat{F}_{I \to J}}(\hat{F}_{J \to I}) \big\|. \tag{2}$$

Here, $\|\cdot\|$ denotes a suitable norm. While well motivated, (2) is satisfied by the trivial degenerate solution of always predicting zero flow, $\hat{F}_{I \to J} = \hat{F}_{J \to I} = 0$. It therefore bears the risk of degrading performance by biasing the prediction towards zero, even if combined with a photometric loss (1). Both aforementioned losses are most often used together with a visibility mask that filters out the influence of occluded regions from the objective.
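The degenerate solution of the forward-backward loss can be illustrated with a small NumPy sketch. It makes several simplifying assumptions that are not taken from the source: flows are `(h, w, 2)` arrays of `(dx, dy)`, the norm is a mean L1 norm, and the warp uses nearest-neighbour sampling with border clamping instead of bilinear interpolation. The helper names are hypothetical.

```python
import numpy as np

def warp_nn(G, F):
    """Sample G at x + F(x) using nearest-neighbour lookup (simplification)."""
    h, w = F.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    x = np.clip(np.rint(xs + F[..., 0]).astype(int), 0, w - 1)
    y = np.clip(np.rint(ys + F[..., 1]).astype(int), 0, h - 1)
    return G[y, x]

def fb_loss(F_ij, F_ji):
    """Mean L1 norm of F_ij plus F_ji warped by F_ij, following (2)."""
    residual = F_ij + warp_nn(F_ji, F_ij)
    return float(np.abs(residual).sum(axis=-1).mean())

h, w = 8, 8
shift = np.zeros((h, w, 2))
shift[..., 0] = 2.0  # toy ground truth: J is I moved by 2 pixels in x

loss_consistent = fb_loss(shift, -shift)   # correct forward and backward flows
loss_degenerate = fb_loss(np.zeros((h, w, 2)), np.zeros((h, w, 2)))  # zero flow
```

Both calls return zero, so the loss alone cannot distinguish the correct prediction from the always-zero prediction.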

Warp-supervision:  Another approach relies on synthetically generated training pairs, where the ground-truth flow is obtained by construction [GLUNet, Rocco2017a, Melekhov2019]. Given only a single image $I$, a training pair $(I', I)$ is created by applying a randomly sampled transformation $M_W$, e.g. a homography, to $I$ as $I' = I \circ M_W$. Here, $W$ is the synthetic flow field, which serves as direct supervision through a regression loss,

$$L_{\text{warp}} = \big\| \hat{F}_{I' \to I} - W \big\|. \tag{3}$$

While this results in a strong and direct training signal, warp-supervision methods struggle to generalize to real image pairs $(I, J)$. This can lead to over-smooth predictions and instabilities in the presence of unseen appearance changes.

### 3.3 Warp consistency graph

We set out to find a new unsupervised objective suitable for scenarios with large appearance and view-point changes, where photometric based losses struggle. While the photometric consistency assumption is avoided in the forward-backward consistency (Fig. 3a) and warp-supervision (Fig. 3b) objectives, these methods suffer from severe drawbacks in terms of degenerate solutions and lack of realism, respectively. To address these issues, we consider all possible consistency relations obtained from the three images involved in both aforementioned objectives. Using this generalization, we not only retrieve forward-backward and warp-supervision as special cases, but also derive a family of new consistency relations.

From an image pair $(I, J)$, we first construct an image triplet $(I, I', J)$ by warping $I$ with a known flow field $W$ in order to generate the new image $I'$. We now consider the full consistency graph, visualized in Fig. 3c, encompassing all flow-consistency constraints derived from the triplet of images $(I, I', J)$. Crucially, we exploit the fact that the transformation $W$ is known. The goal is to find consistency relations that translate to suitable learning objectives. Particularly, we wish to improve the network prediction between the real image pair $(I, J)$. We therefore first explore the possible consistency constraints that can be derived from the graph shown in Fig. 3c. For simplicity, we do not explicitly denote visible or valid regions of the stated consistency relations. They should be interpreted as an equality constraint for all pixel locations where both sides represent a valid, non-occluded mapping or flow.

Pair-wise constraints:  We first consider the consistency constraints recovered from pairs of images, as visualized in Fig. 3e. From the pair $(I, J)$, and analogously $(I', J)$, we recover the standard forward-backward consistency constraint $F_{I \to J} + \Phi_{F_{I \to J}}(F_{J \to I}) = 0$, from which we derive (2). Furthermore, from the pair $(I', I)$ we can derive the warp-supervision constraint $F_{I' \to I} = W$, which gives (3). (Footnote: other pair-wise constraints on $(I', I)$ are also possible, but they offer no advantage over standard warp-supervision.)

Bipath constraints:  The novel consistency relations stem from constraints that involve all three images in the triplet $(I, I', J)$. These appear in two distinct types, here termed bipath and cycle constraints, respectively. We first consider the former, which have the form $M_{1 \to 2} = M_{3 \to 2} \circ M_{1 \to 3}$. That is, we obtain the same mapping by either proceeding directly from image 1 to 2 or by taking the detour through image 3. We thus compute the same mapping by two different paths, $1 \to 2$ and $1 \to 3 \to 2$, from which we derive the name of the constraint. The images 1, 2, and 3 represent any enumeration of the triplet $(I, I', J)$ that respects the direction $I' \to I$, specified by the known warp $W$. There thus exist three different bipath constraints, detailed in Sec. 3.4.

Cycle constraints:  The last category of constraints is formulated by starting from any of the three images in Fig. 3d and composing the mappings in a full cycle. Since we return to the starting image, the resulting composition is equal to the identity map. This is expressed in a general form as $M_{3 \to 1} \circ M_{2 \to 3} \circ M_{1 \to 2} = \mathbb{1}$, where we have proceeded in the cycle $1 \to 2 \to 3 \to 1$. Again constraining the direction $I' \to I$, we obtain three different constraints, as visualized in Fig. 3d. Compared to the bipath constraints, the cycle variants require two consecutive warping operations, stemming from the additional mapping composition. Each warp reduces the valid region and introduces interpolation noise and artifacts in practice. Constraints involving fewer warping operations are thus desirable, which is an advantage of the class of bipath constraints. In the next parts, we therefore focus on the latter class to find a suitable unsupervised objective for dense correspondence estimation.

### 3.4 Bipath constraints

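As mentioned in the previous section, there exist three different bipath constraints that preserve the direction of the known warp $W$. These are stated in terms of mappings as,

$$\begin{aligned}
M_{I' \to J} &= M_{I \to J} \circ M_W && \text{(4a)} \\
M_{J \to I} &= M_W \circ M_{J \to I'} && \text{(4b)} \\
M_W &= M_{J \to I} \circ M_{I' \to J}. && \text{(4c)}
\end{aligned}$$

From (4), we can derive the equivalent flow constraints as,

$$\begin{aligned}
F_{I' \to J} &= W + \Phi_W(F_{I \to J}) && \text{(5a)} \\
F_{J \to I} &= F_{J \to I'} + \Phi_{F_{J \to I'}}(W) && \text{(5b)} \\
W &= F_{I' \to J} + \Phi_{F_{I' \to J}}(F_{J \to I}). && \text{(5c)}
\end{aligned}$$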
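As a worked step, the flow constraint (5a) follows from the mapping constraint (4a) by expanding both sides with the relation between a mapping and its flow, $M(x) = x + F(x)$. The sketch below writes $\Phi_W(F) = F \circ M_W$ for the warping operator and $M_W(x) = x + W(x)$ for the mapping associated with the known warp; this notation is assumed for the illustration.

```latex
\begin{aligned}
M_{I' \to J}(x) &= x + F_{I' \to J}(x), \\
(M_{I \to J} \circ M_W)(x) &= M_W(x) + F_{I \to J}\big(M_W(x)\big) \\
  &= x + W(x) + F_{I \to J}\big(x + W(x)\big) \\
  &= x + W(x) + \Phi_W(F_{I \to J})(x).
\end{aligned}
```

Equating the two expressions and subtracting $x$ gives (5a). The same expansion applied to the other two mapping constraints yields (5b) and (5c).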

Each constraint is visualized in Fig. 4a, b and c respectively. At first glance, any one of the constraints in (5) could be used as an unsupervised loss by minimizing the error between the left and right hand side. However, by separately analyzing each constraint in (4)-(5), we will find them to have radically different properties which impact their suitability as an unsupervised learning objective.

$I' \to J$-bipath:  The constraint (4a), (5a) is derived from the two possible paths from $I'$ to $J$ (Fig. 4a). While not obvious from (5a), it can be directly verified from (4a) that this constraint has a degenerate trivial solution. In fact, (4a) is satisfied for any $W$ by simply mapping all inputs to a constant pixel location $c$, as $M_{I' \to J}(x) = M_{I \to J}(x) = c$. In order to satisfy this constraint, the network can thus learn to predict the same flow for any input image pair.

$J \to I$-bipath:  From the paths in Fig. 4b, we obtain the constraint (4b), (5b). The resulting unsupervised loss is formulated as

$$L_{J \to I} = \big\| \hat{F}_{J \to I'} + \Phi_{\hat{F}_{J \to I'}}(W) - \hat{F}_{J \to I} \big\|. \tag{6}$$

Unfortunately, this objective suffers from another theoretical disadvantage. Due to the cancellation effect between the estimated flow terms $\hat{F}_{J \to I'}$ and $\hat{F}_{J \to I}$, the objective (6) is insensitive to a constant bias in the prediction. Specifically, if a small constant bias $b$ is added to all flow predictions in (6), it can be shown that the increase in the loss is approximately bounded by $\|D_W b\|$. Here, the bias error $b$ is scaled with the Jacobian $D_W$ of the warp $W$. Since a smooth and invertible warp implies a generally small Jacobian $D_W$, the change in the loss will be negligible. The resulting insensitivity of (6) to a prediction bias is further confirmed empirically by our experiments. We provide derivations in the suppl. A. To further understand and compare the bipath constraints (5), it is also useful to consider the limiting case of reducing the magnitude of the warps $W$. By setting $W = 0$, it can be observed that (6) becomes zero, i.e. no learning signal remains.
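The insensitivity to a constant bias can be sketched with a first-order expansion. This is an informal illustration, not the derivation from the supplementary material. It assumes a constant bias $b$ added to both predicted flows, writes $D_W$ for the Jacobian of the warp $W$, and uses $r$ and $r_b$ (names introduced here) for the residual inside the norm of (6) without and with the bias.

```latex
\begin{aligned}
r(x)   &= \hat{F}_{J \to I'}(x) + W\big(x + \hat{F}_{J \to I'}(x)\big) - \hat{F}_{J \to I}(x), \\
r_b(x) &= \big(\hat{F}_{J \to I'}(x) + b\big) + W\big(x + \hat{F}_{J \to I'}(x) + b\big)
          - \big(\hat{F}_{J \to I}(x) + b\big) \\
       &\approx r(x) + D_W\big(x + \hat{F}_{J \to I'}(x)\big)\, b, \\
\big|\, \|r_b(x)\| - \|r(x)\| \,\big| &\lesssim \|D_W\, b\|.
\end{aligned}
```

The two copies of $b$ outside the warp cancel exactly, so the bias only enters through the argument of $W$, where it is scaled by the Jacobian. The last line follows from the reverse triangle inequality.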

$W$-bipath:  The third bipath constraint (4c), (5c) is derived from the paths $I' \to J \to I$ and $I' \to I$, the latter being determined by $W$ (Fig. 4c). It leads to the $W$-bipath consistency loss,

$$L_W = \big\| \hat{F}_{I' \to J} + \Phi_{\hat{F}_{I' \to J}}(\hat{F}_{J \to I}) - W \big\|. \tag{7}$$

We first analyze the limiting case by setting $W = 0$, which leads to standard forward-backward consistency (2) since $I' = I$. The $W$-bipath is thus a direct generalization of the latter constraint. Importantly, by randomly sampling non-zero warps $W$, degenerate solutions are avoided, effectively solving the one fatal issue of forward-backward consistency objectives. In addition to avoiding degenerate solutions, the $W$-bipath does not experience cancellation of prediction bias, as in (6). Furthermore, compared to warp-supervision (3), it makes it possible to directly learn the flow prediction between the real pair $(I, J)$. In the next section, we therefore develop our final unsupervised objective based on the $W$-bipath consistency.
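The following NumPy sketch evaluates the loss (7) on toy data to show that a non-zero warp removes the zero-flow degenerate solution. It is an illustration under simplifying assumptions that are not from the source: all flows are constant translations stored as `(h, w, 2)` arrays of `(dx, dy)`, the norm is a mean L1 norm, and warping uses nearest-neighbour sampling. All function names are hypothetical.

```python
import numpy as np

def warp_nn(G, F):
    """Sample G at x + F(x) using nearest-neighbour lookup (simplification)."""
    h, w = F.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    x = np.clip(np.rint(xs + F[..., 0]).astype(int), 0, w - 1)
    y = np.clip(np.rint(ys + F[..., 1]).astype(int), 0, h - 1)
    return G[y, x]

def w_bipath_loss(F_ipj, F_ji, W):
    """Mean L1 norm of F_{I'->J} + warp(F_{J->I}) - W, following (7)."""
    residual = F_ipj + warp_nn(F_ji, F_ipj) - W
    return float(np.abs(residual).sum(axis=-1).mean())

def const_flow(dx, dy, h=8, w=8):
    """Constant translation flow field."""
    F = np.zeros((h, w, 2))
    F[..., 0] = dx
    F[..., 1] = dy
    return F

W = const_flow(3, 0)      # known synthetic warp, flow from I' to I
F_ij = const_flow(2, 1)   # toy ground-truth flow from I to J
F_ji = -F_ij              # toy ground-truth flow from J to I
F_ipj = W + F_ij          # flow from I' to J; (5a) for constant flows

loss_true = w_bipath_loss(F_ipj, F_ji, W)
loss_zero = w_bipath_loss(const_flow(0, 0), const_flow(0, 0), W)
loss_w0 = w_bipath_loss(F_ij, F_ji, const_flow(0, 0))
```

Here `loss_true` is zero, while the always-zero prediction is penalized by the magnitude of the warp (`loss_zero` is 3). With a zero warp, `I'` coincides with `I` and the expression reduces to the forward-backward residual, as `loss_w0` shows.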

### 3.5 Warp consistency loss

In this section, we develop our warp consistency loss, an unsupervised learning objective for dense correspondence estimation, using the consistency constraints derived in Sec. 3.3 and 3.4. Specifically, following the observations in Sec. 3.4, we base our loss on the $W$-bipath constraint.

$W$-bipath consistency term:  To formulate an objective based on the $W$-bipath consistency constraint (5c), we further integrate a visibility mask $V$. The mask takes a value $V(x) = 1$ for any pixel $x$ where both sides of (4c), (5c) represent a valid, non-occluded mapping, and $V(x) = 0$ otherwise. The loss (7) is then extended as,

$$L_{W\text{-vis}} = \Big\| V \cdot \big( \hat{F}_{I' \to J} + \Phi_{\hat{F}_{I' \to J}}(\hat{F}_{J \to I}) - W \big) \Big\|. \tag{8}$$

Since we do not know the true visibility $V$, we replace it with an estimate $\hat{V}$. While there are different techniques for estimating visibility masks in the literature [MFOccFlow, Meister2017, OccAwareFlow], we base our strategy on the approach used in [Meister2017]. Specifically, we compute our visibility mask as,

$$\hat{V} = \mathbb{1}\Big[ \big\| \hat{F}_{I' \to J} + \Phi_{\hat{F}_{I' \to J}}(\hat{F}_{J \to I}) - W \big\|_2^2 < \alpha_2 + \dots \Big]. \tag{9}$$

Here, $\mathbb{1}[\cdot]$ takes the value 1 or 0 if the input statement is true or false, respectively. The scalars $\alpha_1$ and $\alpha_2$ are hyperparameters controlling the sensitivity of the mask estimation. For the warp operation $\Phi_{\hat{F}_{I' \to J}}(\hat{F}_{J \to I})$, we generally found it beneficial not to back-propagate gradients through the flow $\hat{F}_{I' \to J}$ used for warping. We believe that this better encourages the network to directly adjust the flow $\hat{F}_{J \to I}$, rather than ‘move’ the flow vectors using the warp $\hat{F}_{I' \to J}$.
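A visibility estimate of this kind can be sketched in NumPy as follows. The right-hand side of the threshold test is an assumption: it follows the general pattern of the forward-backward check in [Meister2017], a constant plus a term proportional to the squared flow magnitudes, and is not claimed to match (9) exactly. The default values of `alpha1` and `alpha2` and the function name are illustrative only.

```python
import numpy as np

def visibility_mask(F_ipj, F_ji_warped, W, alpha1=0.01, alpha2=0.5):
    """Estimate a binary visibility mask from the W-bipath residual.

    F_ipj:       (h, w, 2) predicted flow from I' to J.
    F_ji_warped: (h, w, 2) predicted flow from J to I, already warped by F_ipj.
    W:           (h, w, 2) known synthetic warp.
    The threshold alpha2 + alpha1 * (sum of squared magnitudes) is ASSUMED.
    """
    def sq_norm(A):
        return (A ** 2).sum(axis=-1)

    residual = sq_norm(F_ipj + F_ji_warped - W)
    threshold = alpha2 + alpha1 * (sq_norm(F_ipj) + sq_norm(F_ji_warped) + sq_norm(W))
    return (residual < threshold).astype(np.float64)

h, w = 4, 4
W = np.zeros((h, w, 2))
W[..., 0] = 3.0
F_ipj = np.zeros((h, w, 2))
F_ipj[..., 0] = 5.0
F_ji_warped = np.zeros((h, w, 2))
F_ji_warped[..., 0] = -2.0
F_ji_warped[0, 0, 0] = 6.0  # one inconsistent pixel, e.g. an occlusion

V_hat = visibility_mask(F_ipj, F_ji_warped, W)
```

In this toy case every pixel is kept except `(0, 0)`, where the residual is large relative to the flow magnitudes. In an automatic differentiation framework, the mask would be treated as a constant when computing gradients.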

Warp-supervision term:  In addition to our $W$-bipath objective (8), we use the warp-supervision (3), found as a pair-wise constraint in our consistency graph (Fig. 3e). Benefiting from the strong and direct supervision provided by the synthetic flow $W$, the warp-supervision term increases convergence speed and helps in driving the network towards higher accuracy. Further, by the direct regression loss against the flow $W$, which is smooth by construction, it also acts as a smoothness constraint. On the other hand, through the $W$-bipath loss (8), the network learns the realistic motion patterns and appearance changes present between real images $(I, J)$. As a result, both loss terms are mutually beneficial. From a practical perspective, the warp-supervision loss can be integrated at a low computational and memory cost, since the backbone feature extraction for the three images $(I, I', J)$ can be shared between the two loss terms.

Adaptive loss balancing:  Our final unsupervised objective combines the losses (8) and (3) as $L = L_{W\text{-vis}} + \lambda L_{\text{warp}}$. This raises the question of how to set the trade-off $\lambda$. Instead of resorting to manual tuning, we eliminate this hyper-parameter by automatically balancing the weights over each training batch as $\lambda = L_{W\text{-vis}} / L_{\text{warp}}$. Since $\lambda$ is a weighting factor, we do not backpropagate gradients through it.
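The balancing step can be sketched in plain Python. The sketch assumes that the weight is the per-batch ratio of the two loss values and that both losses are available as scalars; the function name and the `eps` guard against division by zero are additions made for this illustration.

```python
def balanced_objective(loss_w_vis, loss_warp, eps=1e-12):
    """Combine the two loss values with an adaptive weight.

    The weight lam is computed from the current batch losses and is meant to
    be treated as a constant. With an automatic differentiation framework, it
    would be detached from the graph so that no gradient flows through it.
    """
    lam = loss_w_vis / (loss_warp + eps)
    total = loss_w_vis + lam * loss_warp
    return total, lam

total, lam = balanced_objective(4.0, 0.5)
```

With this choice, both terms contribute the same value to the objective, so the total equals roughly twice the first loss. The gradients still differ from simply doubling the first term, because the weight is held constant and the gradient of the second term comes from the warp-supervision loss.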

### 3.6 Sampling warps W

The key element of our warp consistency objective is the sampled warp $W$. During training, we randomly sample it from a distribution $W \sim p_W$, which we need to design. As discussed in Sec. 3.4, the $W$-bipath loss (8) approaches the forward-backward consistency loss (2) when the magnitude of the warps decreases, $\|W\| \to 0$. Exclusively sampling warps that are too small therefore risks biasing the prediction towards zero. On the other hand, warps that are too large would render the estimation of $F_{I' \to J}$ challenging and introduce unnecessary invalid image regions. As a rough guide, the distribution $p_W$ should yield warps of similar magnitude as the real transformations $F_{J \to I}$, thus giving similar impact to all three terms in (8). Fortunately, as analyzed in the supplementary Sec. G, our approach is not sensitive to these settings as long as they are within reasonable bounds.

We construct $p_W$ by randomly sampling homography, Thin-plate Spline (TPS) and affine-TPS transformations, following a procedure similar to previous approaches using warp-supervision [Rocco2017a]. (i) Homographies are constructed by randomly translating the four image corner locations. The magnitudes of the translations are chosen independently through Gaussian or uniform sampling, with a prescribed standard deviation or range. (ii) For TPS, we randomly jitter a grid of control points by independently translating each point. We use the same standard deviation or range as for our homographies. (iii) To generate larger scale and rotation changes, we also compose affine and TPS transformations. We first sample affine transformations by selecting scale, rotation, translation and shearing parameters according to a Gaussian or uniform sampling. The TPS transform is then sampled as explained above and the final synthetic flow $W$ is a composition of both flows.

To make the warps harder, we optionally also compose the flow obtained from (i), (ii) and (iii) with randomly sampled elastic transforms. Specifically, we generate an elastic deformation motion field, as described in [Simard2003] and apply it in multiple regions selected randomly. Elastic deformations drive the network to be more accurate to small details. Detailed settings are provided in the suppl. G.
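The homography construction in step (i) can be sketched in NumPy as follows. The sketch assumes Gaussian corner jitter with a user-chosen standard deviation `sigma` and represents the resulting warp as an `(h, w, 2)` flow of `(dx, dy)`. The function names and the direct linear solve for the homography are illustrative choices, not the authors' implementation.

```python
import numpy as np

def homography_from_corners(src, dst):
    """Solve the 3x3 homography H (with H[2, 2] = 1) mapping src to dst points."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    params = np.linalg.solve(np.array(A, dtype=float), np.array(b, dtype=float))
    return np.append(params, 1.0).reshape(3, 3)

def sample_homography_flow(h, w, sigma, rng):
    """Sample a flow field from a homography defined by jittered image corners.

    Returns the (h, w, 2) flow and the (4, 2) corner displacements used.
    """
    corners = np.array([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]], dtype=float)
    displacement = rng.normal(0.0, sigma, size=corners.shape)
    H = homography_from_corners(corners, corners + displacement)
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    # Apply H to every pixel in homogeneous coordinates, then dehomogenize.
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=-1) @ H.T
    mapped = pts[..., :2] / pts[..., 2:3]
    return mapped - np.stack([xs, ys], axis=-1), displacement

rng = np.random.default_rng(0)
W, corner_shift = sample_homography_flow(32, 48, sigma=3.0, rng=rng)
```

By construction, the flow at each image corner equals the sampled corner displacement, which gives a quick sanity check. The warped image is then obtained by sampling the original image at each pixel location plus its flow vector.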

## 4 Experiments

We evaluate our unsupervised learning approach for three dense matching networks and two tasks, namely GLU-Net [GLUNet] and RANSAC-Flow [RANSAC-flow] for geometric matching, and SemanticGLU-Net [GLUNet] for semantic matching. We extensively analyze our method and compare it to earlier unsupervised objectives, defining a new state-of-the-art on multiple datasets. Further results, analysis, visualizations and implementation details are provided in the supplementary.

### 4.1 Method analysis

We first perform a comprehensive analysis of our approach. We adopt GLU-Net [GLUNet] as our base architecture. It is a 4-level pyramidal network operating at two image resolutions to estimate dense flow fields.

Experimental set-up for GLU-Net:  We slightly simplify the GLU-Net [GLUNet] architecture by replacing the dense decoder connections with standard residual blocks, which drastically reduces the number of network parameters with negligible impact on performance. As in [GLUNet], the feature extraction network is set to a VGG-16 [Chatfield14] with ImageNet pre-trained weights. We train the rest of the architecture from scratch in two stages. We first train GLU-Net using our unsupervised objective, described in Sec. 3.5, but without the visibility mask $\hat{V}$. As a second stage, we add the visibility mask and employ stronger warps $W$, with elastic transforms. For both stages, we use the training split of the MegaDepth dataset [megadepth], which comprises diverse internet images of 196 different world monuments.

Datasets and metrics:  We evaluate on standard datasets with sparse ground-truth, namely RobotCar [RobotCarDatasetIJRR, RobotCar] and MegaDepth [megadepth]. For the latter, we use the test split of [RANSAC-flow], which consists of 19 scenes not seen during training. Images in RobotCar depict outdoor road scenes and are particularly challenging due to their many textureless regions. MegaDepth images show extreme view-point and appearance variations. In line with [RANSAC-flow], we use the Percentage of Correct Keypoints at a given pixel threshold $T$ (PCK-$T$) as the evaluation metric (in %). We also employ the 59 sequences of the homography dataset HPatches [Lenc]. We evaluate with the Average End-Point-Error (AEPE) and PCK.
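Both metrics are straightforward to compute from predicted and ground-truth flow vectors at the annotated keypoints. The NumPy sketch below assumes the flows are given as `(n, 2)` arrays in pixels and counts a keypoint as correct when its end-point error is at most the threshold; whether the comparison is strict is a convention that varies between benchmarks.

```python
import numpy as np

def aepe(flow_pred, flow_gt):
    """Average end-point error: mean Euclidean distance between flow vectors."""
    return float(np.linalg.norm(flow_pred - flow_gt, axis=-1).mean())

def pck(flow_pred, flow_gt, threshold):
    """Percentage of keypoints with end-point error of at most `threshold` pixels."""
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    return float(100.0 * (err <= threshold).mean())

# Four keypoints with end-point errors of 0.5, 3, 5 and 12 pixels.
gt = np.zeros((4, 2))
pred = np.array([[0.5, 0.0], [3.0, 0.0], [3.0, 4.0], [0.0, 12.0]])

pck_1 = pck(pred, gt, 1.0)
pck_5 = pck(pred, gt, 5.0)
mean_error = aepe(pred, gt)
```

The single large error of 12 pixels dominates the AEPE while leaving PCK-1 unchanged, which is why the two metrics are reported together: PCK at small thresholds reflects accuracy, whereas AEPE and PCK at larger thresholds reflect robustness.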

Warp consistency graph losses:  In Tab. 1 we empirically compare the constraints extracted from our warp consistency graph (Sec. 3.3). All networks are trained with only the first stage, on the same synthetic transformations $W$. Since we observed it to give a general improvement, we stop gradients through the flow used for warping (but not the flow that is warped). The $I' \to J$-bipath (II) and $J \to I$-bipath (III) losses lead to a degenerate solution and a large predicted bias respectively, which explains the very poor performance of the networks. The cycle loss (V) obtains much better results but does not reach the performance of the $W$-bipath constraint (IV). We only show one cycle loss here (V), since it performs best among all cycle losses (see suppl. A). While the warp-supervision loss (I) results in better accuracy on all datasets (PCK-1 and PCK-5 for HPatches), it is significantly less robust to large view-point changes than the $W$-bipath objective (IV), as evidenced by results in PCK-10 and AEPE. These two losses have complementary behaviors and combining them (VIII) leads to a significant gain in both accuracy and robustness. Combining the warp-supervision loss (I) with the $I' \to J$-bipath (II) in (VI) or with the $J \to I$-bipath (III) in (VII) instead results in drastically lower performance than (VIII). Combining the cycle loss (V) with the warp-supervision (I) in (IX) is also slightly worse.

Ablation study:  In Tab. 2 we analyze the key components of our approach. We first show the importance of not back-propagating gradients in the warp operation. Adding the warp-supervision objective with a constant weight increases both the network’s accuracy and robustness for all datasets. Further using adaptive loss balancing (Sec. 3.5) provides a significant improvement in accuracy (PCK-1) for MegaDepth with only minor loss on other thresholds. Including our visibility mask in the second training stage drastically improves all metrics for all datasets. Finally, further sampling harder transformations results in better accuracy, particularly for PCK-1 on MegaDepth. We therefore use this as our standard setting in the following experiments, where we denote it as WarpC.

Comparison to alternative losses:  Finally, in Tab. 3 we compare and combine our proposed objective with alternative losses. The census loss [Meister2017] (I), popular in optical flow, does not have sufficient invariance to appearance changes and thus leads to poor results on geometric matching datasets. The SSIM loss [WangBSS04] (II) is more robust to the large appearance variations present in MegaDepth. Further combining SSIM with the forward-backward consistency loss (III) leads to a small improvement. Compared to SSIM (III) on MegaDepth, our WarpC approach (VI) achieves superior PCK-5 (+7.8%) and PCK-10 (+10.2%) at the cost of a slight reduction in sub-pixel accuracy. Furthermore, our approach demonstrates superior generalization capabilities by outperforming all other alternatives on the RobotCar and HPatches datasets. For completeness, we also evaluate the combination (VII) of our loss with the photometric SSIM loss. This leads to improved PCK-1 on MegaDepth but degrades other metrics compared to WarpC (VI). Nevertheless, adding WarpC significantly improves upon SSIM (II) for all thresholds and datasets. Moreover, combining the warp-supervision (IV) with the forward-backward loss in (V) leads to an improvement compared to (IV). It is however significantly worse than combining the warp-supervision with our -bipath loss in (VI), which can be seen as a generalization of the forward-backward loss. Finally, we compare with using the sparse ground-truth supervision provided by SfM reconstruction of the MegaDepth training images. Interestingly, training the dense prediction network from scratch with solely sparse annotations (VIII) leads to inferior performance compared to our unsupervised objective (VI). Lastly, we fine-tune (IX) our proposed network (VI) with sparse annotations. While this leads to a moderate gain on MegaDepth, it comes at the cost of worse generalization properties on RobotCar and HPatches.

### 4.2 Geometric matching

Here, we train the recent GLU-Net [GLUNet] and RANSAC-Flow [RANSAC-flow] architectures with our unsupervised learning approach and compare them against state-of-the-art dense geometric matching methods.

Experimental set-up for GLU-Net:  We follow the training procedure explained in Sec. 4.1 and refer to the resulting model as WarpC-GLU-Net. The original GLU-Net [GLUNet] is trained using solely the warp-supervision (3) on a different training set. For fair comparison, we also report results of our altered GLU-Net architecture when trained on MegaDepth with our warp distribution. This corresponds to setting (IV) in Tab. 3, which we here call GLU-Net*.

Experimental set-up for RANSAC-Flow:  We additionally use our unsupervised strategy to train RANSAC-Flow [RANSAC-flow]. In the original work [RANSAC-flow], the network is trained on MegaDepth [megadepth] image pairs that are coarsely pre-aligned using feature matching and RANSAC. Training is separated into three stages. First, the network is trained using the SSIM loss (1), which is further combined with the forward-backward consistency loss (2) in the second stage. In the last stage, a matchability mask is also trained, by weighting the previous losses with the predicted mask and including a mask regularization term. For our WarpC-RANSAC-Flow, we also follow a three-stage training using the same training pairs. As for the WarpC-GLU-Net training, we add our visibility mask in the second training stage. In the third stage, we train the matchability mask by simply replacing the visibility mask in (8) with the predicted mask, and adding the same mask regularizer as in RANSAC-Flow.

Results:  In Tab. 4, we report results on MegaDepth and RobotCar. Note that we only compare to methods that do not finetune on the test set. Our approach WarpC-GLU-Net outperforms the original GLU-Net and baseline GLU-Net* by a large margin at all PCK thresholds. Our proposed unsupervised objective enables the network to handle the large and complex 3D motions present in real image pairs, as evidenced in Fig. 5, top. Our unsupervised approach WarpC-RANSAC-Flow also achieves a substantial improvement compared to RANSAC-Flow. Importantly, WarpC-RANSAC-Flow shows much better generalization capabilities on RobotCar. The poorer generalization of photometric-based objectives, such as SSIM [WangBSS04] here, further supports our findings in Sec. 4.1. Interestingly, training the matchability branch of RANSAC-Flow with our objective results in drastically more accurate mask predictions. This is visualized in Fig. 5, middle, where our approach WarpC-RANSAC-Flow effectively identifies unreliable matching regions such as the sky (in red), whereas RANSAC-Flow, trained with the SSIM loss, is incapable of discarding the sky and field as unreliable.

### 4.3 Semantic matching

Finally, we evaluate our approach for the task of semantic matching by training SemanticGLU-Net [GLUNet], a version of GLU-Net specifically designed for semantic images, which includes multi-resolution features and NC-Net [Rocco2018b].

Experimental set-up:  Following [Rocco2018a, ArbiconNet], we only fine-tune a pre-trained network on semantic correspondence data. Specifically, we start from the SemanticGLU-Net weights provided by the authors, which are trained with warp-supervision without using any correspondences from flow annotations. We finetune this network on the PF-PASCAL training set [PFPascal], which consists of 20 object categories, using our unsupervised loss (Sec. 3.5).

Datasets and metrics:  We first evaluate on the test set of PF-Pascal [PFPascal]. In line with [SCNet], we report the PCK with a pixel threshold equal to , where and are the dimensions of the query image and . To demonstrate generalization capabilities, we also validate our trained model on the TSS dataset [Taniai2016], which provides dense flow field annotations for the foreground object in each pair. Following [Taniai2016], we report the PCK with respect to query image size and for .
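The PCK metric used above can be sketched as follows. This is a minimal illustration, not the evaluation code of the paper: the function names `pck` and `relative_threshold` are hypothetical, and the relative threshold is assumed to follow the common convention of a fraction `alpha` of the larger image dimension.

```python
import math


def pck(pred_flow, gt_flow, threshold):
    """Fraction of annotated points whose end-point error is <= threshold (pixels).

    pred_flow, gt_flow: lists of (dx, dy) displacements at the annotated points.
    """
    if len(pred_flow) != len(gt_flow) or not gt_flow:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(
        math.hypot(px - gx, py - gy) <= threshold
        for (px, py), (gx, gy) in zip(pred_flow, gt_flow)
    )
    return correct / len(gt_flow)


def relative_threshold(height, width, alpha):
    """Assumed convention: threshold = alpha * max(height, width)."""
    return alpha * max(height, width)


# Toy example with three annotated points (errors of 0, 5 and 10 pixels).
predicted = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
ground_truth = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
print(pck(predicted, ground_truth, threshold=5.0))
```

With a fixed pixel threshold this corresponds to the PCK-1, PCK-5 and PCK-10 figures quoted for the geometric matching datasets; with `relative_threshold` it corresponds to an image-size-dependent threshold of the kind used for semantic matching.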

Results:  Results are reported in Tab. 5. Our approach WarpC-SemanticGLU-Net sets a new state-of-the-art on TSS by obtaining a remarkable improvement compared to previous works. On the PF-Pascal dataset, our method ranks first for the small threshold, with a substantial increase compared to the second-best method. It obtains marginally lower PCK than DCCNet [DCCNet] for the larger threshold, but the latter approach employs a much deeper feature backbone, which is beneficial on semantic images. Nevertheless, our unsupervised fine-tuning provides 16% and 11.1% gain, for each threshold respectively, over the baseline, demonstrating that our objective effectively copes with the radical appearance changes encountered in the semantic matching task. A visual example applied to an image pair of PF-PASCAL [PFPascal] is shown in Fig. 5, bottom.

## 5 Conclusion

We propose an unsupervised learning objective for dense correspondences, particularly suitable for scenarios with large changes in appearance and geometry. From a real image pair, we construct an image triplet and design a regression loss based on the flow-constraints existing between the triplet. When integrated into three recent dense correspondence networks, our approach outperforms state-of-the-art for multiple geometric and semantic matching datasets.

Acknowledgements:  This work was supported by the ETH Zürich Fund (OK), a Huawei Gift, Huawei Technologies Oy (Finland), Amazon AWS, and an Nvidia GPU grant.

## Appendix A Warp consistency graph regression losses

In this section, we provide additional details about the possible flow-constraints derived from our warp consistency graph (Sec. 3.3 of the main paper). We also show qualitative and quantitative comparisons between the trained networks using each possible regression loss.

### A.1 Details about the JI-bipath constraint

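We here provide the detailed derivation of the bias insensitivity of the $JI$-bipath loss, which is given (eq. (6) in the main paper) as,

$$
\mathcal{L}_{J \to I} = \left\| \hat{F}_{J \to I'} + \Phi_{\hat{F}_{J \to I'}}(W) - \hat{F}_{J \to I} \right\| . \tag{10}
$$

Here, $\Phi_F(T)$ denotes the warping of $T$ by the flow $F$, i.e. $\Phi_F(T)(x) = T(x + F(x))$. We derive an upper bound for the change $\Delta \mathcal{L}_{J \to I}$ in the loss when a constant bias $b$ is added to all flow predictions $\hat{F}$. We have,

$$
\begin{aligned}
\Delta \mathcal{L}_{J \to I}
&= \left\| \hat{F}_{J \to I'} + b + \Phi_{\hat{F}_{J \to I'} + b}(W) - \left(\hat{F}_{J \to I} + b\right) \right\| - \left\| \hat{F}_{J \to I'} + \Phi_{\hat{F}_{J \to I'}}(W) - \hat{F}_{J \to I} \right\| \\
&= \left\| \hat{F}_{J \to I'} + \Phi_{\hat{F}_{J \to I'}}(W) - \hat{F}_{J \to I} + \Phi_{\hat{F}_{J \to I'} + b}(W) - \Phi_{\hat{F}_{J \to I'}}(W) \right\| - \left\| \hat{F}_{J \to I'} + \Phi_{\hat{F}_{J \to I'}}(W) - \hat{F}_{J \to I} \right\| \\
&\leq \left\| \hat{F}_{J \to I'} + \Phi_{\hat{F}_{J \to I'}}(W) - \hat{F}_{J \to I} \right\| + \left\| \Phi_{\hat{F}_{J \to I'} + b}(W) - \Phi_{\hat{F}_{J \to I'}}(W) \right\| - \left\| \hat{F}_{J \to I'} + \Phi_{\hat{F}_{J \to I'}}(W) - \hat{F}_{J \to I} \right\| \\
&= \left\| \Phi_{\hat{F}_{J \to I'} + b}(W) - \Phi_{\hat{F}_{J \to I'}}(W) \right\| .
\end{aligned}
\tag{11}
$$

Here we have used the triangle inequality. From the bound above, we can already see that $\Delta \mathcal{L}_{J \to I}$ will be small if $W$ is changing slowly. We can see this more clearly by assuming the bias $b$ to be small, and doing a first-order Taylor expansion,

$$
\begin{aligned}
\Phi_{\hat{F}_{J \to I'} + b}(W)(x)
&= W\!\left(x + \hat{F}_{J \to I'}(x) + b\right) \\
&\approx W\!\left(x + \hat{F}_{J \to I'}(x)\right) + D_W\!\left(x + \hat{F}_{J \to I'}(x)\right) b \\
&= \Phi_{\hat{F}_{J \to I'}}(W)(x) + \Phi_{\hat{F}_{J \to I'}}(D_W b)(x) .
\end{aligned}
\tag{12}
$$

Here, $D_W(x)$ is the Jacobian of $W$ at location $x$. Thus, $D_W b$ denotes the function obtained from the matrix-vector product between the Jacobian $D_W(x)$ and the bias $b$ at every location $x$. Inserting (12) into (11) gives an approximate bound valid for small $b$,

$$
\Delta \mathcal{L}_{J \to I} \lessapprox \left\| \Phi_{\hat{F}_{J \to I'}}(D_W b) \right\| . \tag{13}
$$

A smooth and invertible warp $W$ implies a generally small Jacobian $D_W$. Since the bias $b$ is scaled with $D_W$, the resulting change in the loss will also be small. As a special case, it is immediately seen from (11) that the change in the loss is always zero if $W$ is a pure translation. The bias insensitivity of the $JI$-bipath constraint largely explains its poor performance. As visualized in Fig. 6, the predictions of a network trained with solely the $JI$-bipath loss (6) suffer from a large translation bias.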
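The bias insensitivity derived above can be checked numerically on a toy problem. The following is a minimal 1-D sketch, not the training code: flows are represented as plain Python functions, the warping operator is assumed to be $x \mapsto T(x + F(x))$, and all names (`warp`, `ji_bipath_loss`, `loss_change`) and toy flows are hypothetical.

```python
def warp(flow, target):
    """Warping operator: returns the function x -> target(x + flow(x))."""
    return lambda x: target(x + flow(x))


def ji_bipath_loss(f_j_to_iprime, f_j_to_i, w, xs):
    """1-D version of the bipath loss in eq. (10), averaged over locations xs."""
    warped_w = warp(f_j_to_iprime, w)
    residuals = [abs(f_j_to_iprime(x) + warped_w(x) - f_j_to_i(x)) for x in xs]
    return sum(residuals) / len(residuals)


def add_bias(flow, b):
    """Adds the constant bias b to a flow."""
    return lambda x: flow(x) + b


def loss_change(w, b, xs):
    """Change in the loss when the bias b is added to both predicted flows."""
    def f_j_to_iprime(x):      # arbitrary toy prediction (assumption)
        return 0.1 * x

    def f_j_to_i(x):           # chosen so that the unbiased loss is exactly zero
        return f_j_to_iprime(x) + w(x + f_j_to_iprime(x))

    before = ji_bipath_loss(f_j_to_iprime, f_j_to_i, w, xs)
    after = ji_bipath_loss(add_bias(f_j_to_iprime, b), add_bias(f_j_to_i, b), w, xs)
    return after - before


xs = [i / 10 for i in range(-10, 11)]
translation_change = loss_change(lambda x: 0.3, 0.5, xs)        # W is a pure translation
affine_change = loss_change(lambda x: 0.2 * x + 0.1, 0.5, xs)   # Jacobian D_W = 0.2
print(translation_change, affine_change)
```

For the pure translation the loss change vanishes (up to rounding), and for the affine warp it equals $|D_W b| = 0.2 \times 0.5 = 0.1$, which is much smaller than the bias itself. Since this toy warp is linear, the first-order expansion is exact here.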

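### A.2 Cycle constraints

Here, we provide additional details about the cycle constraints, extracted from our warp consistency graph. As explained in Sec. 3.3 of the main paper, because of the fixed direction of the known flow $W$, which corresponds to $I' \to I$, three cycle constraints are possible, starting from either image $I$, $I'$ or $J$ and composing mappings so that the resulting composition is equal to the identity map $\mathbb{I}$. They are respectively formulated as follows,

$$
\begin{aligned}
\mathbb{I} &= M_W \circ M_{J \to I'} \circ M_{I \to J} && \text{(14a)} \\
\mathbb{I} &= M_{J \to I'} \circ M_{I \to J} \circ M_W && \text{(14b)} \\
\mathbb{I} &= M_{I \to J} \circ M_W \circ M_{J \to I'} && \text{(14c)}
\end{aligned}
$$

The corresponding regression losses are obtained by converting the mapping constraints (14) to flow constraints and considering only the flow $W$ as known. We provide the expression for each of the three cycle losses in the following, where $\Phi_F(T)(x) = T(x + F(x))$ denotes the warping of $T$ by the flow $F$.

Cycle from $I$:  By starting from image $I$ and performing a full cycle, the resulting regression loss is expressed as,

$$
\mathcal{L}_{\text{cycle-}I} = \left\| \hat{F}_{I \to J} + \Phi_{\hat{F}_{I \to J}}(\hat{F}_{J \to I'}) + \Phi_{\hat{F}_{I \to J} + \Phi_{\hat{F}_{I \to J}}(\hat{F}_{J \to I'})}(W) \right\| \tag{15}
$$

Cycle from $I'$:  Starting from image $I'$ instead leads to the following regression loss,

$$
\mathcal{L}_{\text{cycle-}I'} = \left\| W + \Phi_{W}(\hat{F}_{I \to J}) + \Phi_{W + \Phi_{W}(\hat{F}_{I \to J})}(\hat{F}_{J \to I'}) \right\| \tag{16}
$$

Cycle from $J$:  Finally, using image $J$ as starting point for the cycle constraint results in this regression loss,

$$
\mathcal{L}_{\text{cycle-}J} = \left\| \hat{F}_{J \to I'} + \Phi_{\hat{F}_{J \to I'}}(W) + \Phi_{\hat{F}_{J \to I'} + \Phi_{\hat{F}_{J \to I'}}(W)}(\hat{F}_{I \to J}) \right\| \tag{17}
$$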
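The three cycle losses share one structure: a flow, a second flow warped by the first, and a third flow warped by the sum of the first two. A minimal 1-D sketch of this structure follows, with flows represented as Python functions. The helper names and the toy affine flows are hypothetical and only serve to check that consistent flows give a zero cycle loss.

```python
def warp(flow, target):
    """Warping operator: returns the function x -> target(x + flow(x))."""
    return lambda x: target(x + flow(x))


def add(f, g):
    """Point-wise sum of two flows."""
    return lambda x: f(x) + g(x)


def cycle_loss(first, second, third, xs):
    """Mean magnitude of the flow obtained by chaining three flows into a cycle."""
    two_steps = add(first, warp(first, second))
    full_cycle = add(two_steps, warp(two_steps, third))
    return sum(abs(full_cycle(x)) for x in xs) / len(xs)


def cycle_from_i(f_i_to_j, f_j_to_iprime, w, xs):        # eq. (15)
    return cycle_loss(f_i_to_j, f_j_to_iprime, w, xs)


def cycle_from_iprime(f_i_to_j, f_j_to_iprime, w, xs):   # eq. (16)
    return cycle_loss(w, f_i_to_j, f_j_to_iprime, xs)


def cycle_from_j(f_i_to_j, f_j_to_iprime, w, xs):        # eq. (17)
    return cycle_loss(f_j_to_iprime, w, f_i_to_j, xs)


# Toy consistent flows: mappings x -> 2x + 1 (I to J) and x -> 0.5x (J to I'),
# so the mapping I' to I must be x -> x - 0.5 to close the cycle.
def flow_i_to_j(x):
    return x + 1.0


def flow_j_to_iprime(x):
    return -0.5 * x


def flow_w(x):
    return -0.5


xs = [i / 10 for i in range(-10, 11)]
print(cycle_from_i(flow_i_to_j, flow_j_to_iprime, flow_w, xs))
```

All three losses vanish for these consistent flows, and become non-zero as soon as one of the predicted flows is perturbed.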

### A.3 Quantitative and qualitative analysis

Extension of quantitative analysis:  We first extend Tab. 1 of the main paper, by analysing the remaining warp consistency graph losses. Results on MegaDepth, RobotCar and HPatches are presented in Tab. 6. As in Tab. 1 of the main paper, all networks are trained following the first training stage of WarpC-GLU-Net (See Sec. 4.1 of main paper or Sec. C).

We first provide evaluation results of networks trained using the cycle losses, starting from images and . The cycle loss starting from obtains very poor results. The cycle starting from instead achieves better performance, but still lower than the cycle loss from . The -bipath constraint obtains the best results overall.

We then compare combinations of the derived losses with the warp-supervision objective (eq. (3) of the main paper). Between the cycle losses, the combination of the warp-supervision with the cycle loss from achieves the best results compared to the combinations with the cycle losses from and . The combination of the warp-supervision and forward-backward losses (eqs. (3) and (2) of the main paper, respectively), which are both retrieved as pair-wise constraints from the warp consistency graph (Sec. 3.3 and Fig. 4e of the main paper), leads to lower generalisation abilities on the HPatches dataset than our warp consistency loss. It also achieves substantially lower PCK-1 on MegaDepth. Moreover, because the forward-backward consistency loss leads to a degenerate trivial solution when used alone, manual tuning of a weighting hyper-parameter is required to balance the warp-supervision and the forward-backward loss terms. If the weight is too high, the forward-backward term gains too much importance and drives the network predictions to zero. If it is too small instead, its contribution becomes insignificant. In contrast, our proposed unsupervised learning objective (Sec. 3.5 of the main paper) does not require expensive manual tuning of such hyper-parameters.

Qualitative comparison:  In Fig. 6, we visually compare the estimated flows by GLU-Net networks trained using each of the flow-consistency losses retrieved from the warp consistency graph. Training using the warp-supervision loss alone results in an unstable estimated flow, and corresponding warped query. It can directly be seen that the -bipath loss results in the network learning a degenerate trivial solution, in the form of a constant predicted mapping independently of the input images. Training with the -bipath objective instead makes the network insensitive to an additional predicted bias. Indeed, in Fig. 6, third row, it is easily seen that the warped query is shifted towards the right and bottom, compared to the reference image. This is due to a constant predicted bias by the network. The -bipath objective leads to a drastically better warped query. Also note that the estimated flow leads to a more accurate warped query than when trained with the -cycle loss. Training with the cycle loss from leads to very poor results instead. Finally, the cycle loss derived by starting from image results in a reasonable warped query, but it has more out-of-regions artifacts compared to the prediction of the network trained with the -bipath loss.

## Appendix B Triplet creation and sampling of warps W

### B.1 Triplet creation

Our unsupervised learning approach requires constructing an image triplet, consisting of $I$, $I'$ and $J$, from an original image pair $(I, J)$, where all three images must have the same training dimensions. We construct the triplet as follows. The original training image pairs are first resized to a fixed size, larger than the desired training image size. We then sample a dense flow $W$ of the same dimension as the resized images, and create $I'$ by warping image $I$ with $W$. Each image of the resulting triplet is then centrally cropped to the fixed training image size. The central cropping is necessary to remove most of the black areas in $I'$ introduced by the warping operation with large sampled flows, as well as possible warping artifacts arising at the image borders. We then additionally apply appearance transformations to the created image $I'$, such as brightness and contrast changes. This procedure is similar to [Rocco2017a], which employs solely the warp-supervision objective on the synthetic pair $(I', I)$.
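The warp-then-crop steps above can be sketched as follows. This is a dependency-free illustration under several assumptions: images are nested lists already resized to the larger working size, the warp is taken to be $I'(x) = I(x + W(x))$ with nearest-neighbour sampling, out-of-image samples are filled with black, and the flow is cropped together with the images so that it can still serve as supervision. The names `warp_image`, `center_crop` and `make_triplet` are hypothetical, and the appearance transformations are omitted.

```python
def warp_image(image, flow, fill=0):
    """Backward warp: out[y][x] = image sampled at (x + dx, y + dy), nearest neighbour."""
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            src_x, src_y = round(x + dx), round(y + dy)
            if 0 <= src_x < width and 0 <= src_y < height:
                out[y][x] = image[src_y][src_x]   # otherwise stays black
    return out


def center_crop(grid, crop_h, crop_w):
    """Central crop of a 2-D grid (image or flow field)."""
    height, width = len(grid), len(grid[0])
    top, left = (height - crop_h) // 2, (width - crop_w) // 2
    return [row[left:left + crop_w] for row in grid[top:top + crop_h]]


def make_triplet(image_i, image_j, flow_w, crop_h, crop_w):
    """Returns cropped (I', I, J) and the cropped flow W relating I' to I."""
    image_iprime = warp_image(image_i, flow_w)
    return (
        center_crop(image_iprime, crop_h, crop_w),
        center_crop(image_i, crop_h, crop_w),
        center_crop(image_j, crop_h, crop_w),
        center_crop(flow_w, crop_h, crop_w),
    )


# Toy 6x6 example with a constant flow of one pixel to the right.
toy_i = [[10 * y + x for x in range(6)] for y in range(6)]
toy_j = [[0] * 6 for _ in range(6)]
toy_w = [[(1.0, 0.0)] * 6 for _ in range(6)]
toy_iprime, toy_i_crop, toy_j_crop, toy_w_crop = make_triplet(toy_i, toy_j, toy_w, 4, 4)
print(toy_iprime[0], toy_i_crop[0])
```

In the toy example the black column created by the warp lies at the right image border and is removed by the central crop, which is the purpose of cropping described above.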

### B.2 Sampling of warps W

As mentioned in Sec. 3.6 of the main paper, a key question raised by our proposed loss formulation is how to sample the synthetic flows $W$. The analysis of the properties of the proposed bipath loss brought some insight into what magnitude of warps to sample during training. If the generated warps are too small, there is still a risk of biasing the prediction towards zero. Instead, using warps of roughly the same order of magnitude as the underlying transformations gives equal impact to all three terms in eq. (8) of the main paper. During training, we randomly sample $W$ from a distribution, which we need to design.

Base transformation sampling:  We construct the base flow by sampling homography, Thin-plate Spline (TPS), or affine-TPS transformations with equal probability. The transformation parameters are then converted to dense flows.

Specifically, for homographies and TPS, the four image corners and a grid of control points respectively, are randomly translated in both horizontal and vertical directions, according to a desired sampling scheme. The translated and original points are then used to compute the corresponding homography and TPS parameters. Finally, the transformation parameters are converted to dense flows. For both transformation types, the magnitudes of the translations are sampled according to a uniform or Gaussian distribution with a range or standard deviation respectively. Note that for the uniform distribution, the sampling range is actually , when it is centered at zero, or similarly if centered at 1 for example. Importantly, the image points coordinates are previously normalized to be in the interval . Therefore should be within .

For the affine transformations, all parameters, i.e. scale, translations, shearing and rotation angles, are sampled from a uniform or Gaussian distribution with range or standard deviation equal to , , and respectively. For the affine scale parameter, the corresponding Gaussian sampling is centered at one whereas for all other parameters, it is centered at zero. Similarly, for a uniform sampling instead, the affine scale parameter is sampled within with center at 1, while for all other parameters, the sampling interval is centered at zero.
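The affine case can be sketched as below, using uniform sampling with the scale centered at one and all other parameters centered at zero. This is an illustration under stated assumptions, not the sampling code of the paper: the coordinate grid is assumed to be normalized to $[-1, 1]$, the order in which scale, rotation and shear are composed is an arbitrary choice, and the function names and the ranges in the usage lines are hypothetical.

```python
import math
import random


def sample_affine_params(rng, scale_range, translation_range, angle_range):
    """Uniform sampling: scale centred at 1, all other parameters centred at 0."""
    def centred(half_range):
        return rng.uniform(-half_range, half_range)

    return {
        "scale": 1.0 + centred(scale_range),
        "tx": centred(translation_range),
        "ty": centred(translation_range),
        "rotation": centred(angle_range),   # radians
        "shear": centred(angle_range),      # radians, same range as the rotation
    }


def affine_to_flow(params, height, width):
    """Dense flow W(x) = A x + t - x, returned in pixels as rows of (dx, dy)."""
    cos_r, sin_r = math.cos(params["rotation"]), math.sin(params["rotation"])
    shear = math.tan(params["shear"])
    # A = scale * Rotation * [[1, shear], [0, 1]]   (assumed composition order)
    a11, a12 = params["scale"] * cos_r, params["scale"] * (cos_r * shear - sin_r)
    a21, a22 = params["scale"] * sin_r, params["scale"] * (sin_r * shear + cos_r)
    flow = []
    for row in range(height):
        y = 2.0 * row / (height - 1) - 1.0          # normalised to [-1, 1]
        flow_row = []
        for col in range(width):
            x = 2.0 * col / (width - 1) - 1.0
            new_x = a11 * x + a12 * y + params["tx"]
            new_y = a21 * x + a22 * y + params["ty"]
            # Convert the normalised displacement back to pixels.
            flow_row.append(((new_x - x) * (width - 1) / 2.0,
                             (new_y - y) * (height - 1) / 2.0))
        flow.append(flow_row)
    return flow


rng = random.Random(0)
params = sample_affine_params(rng, scale_range=0.1, translation_range=0.2, angle_range=0.3)
flow = affine_to_flow(params, height=8, width=8)
```

Because the parameters are sampled in normalized coordinates, the same ranges produce proportionally larger pixel displacements on larger images.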

Elastic transformations:  To make the synthetic flow harder for the network to estimate, we also optionally compose the base flow, resulting from sampling homography, TPS and Affine-TPS transformations, with a dense elastic deformation grid. We generate the corresponding elastic residual flow , by adding small local perturbations . More specifically, we create the residual flow by first generating an elastic deformation motion field on a dense grid of dimension , as described in [Simard2003]. Since we only want to include elastic perturbations in multiple small regions, we generate binary masks , each delimiting the area on which to apply one local perturbation . The final elastic residual flow thus takes the form of , where . The final synthetic warp is obtained by composing the base flow with the elastic residual flow .

In practice, for the elastic deformation field , we use the implementation of [info11020125]. The masks should be between 0 and 1 and offer a smooth transition between the two, so that the perturbations appear smoothly. To create each mask , we thus generate a 2D Gaussian centered at a random location and with a random standard deviation (up to a certain value) on a dense grid of size . It is then scaled to 2.0 and clipped to 1.0, to obtain smooth regions equal to 1.0 where the perturbations will be applied, and transition regions on all sides from 1.0 to 0.0.
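The mask construction described above (a 2D Gaussian scaled to 2.0 and clipped to 1.0) can be sketched as follows. The function names, the lower bound of 1.0 on the standard deviation and the toy grid size are assumptions for illustration only.

```python
import math
import random


def gaussian_mask(height, width, center, sigma):
    """2-D Gaussian bump, scaled to a peak of 2.0 and clipped to 1.0."""
    center_y, center_x = center
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            squared_dist = (x - center_x) ** 2 + (y - center_y) ** 2
            gaussian = math.exp(-squared_dist / (2.0 * sigma ** 2))
            row.append(min(2.0 * gaussian, 1.0))
        mask.append(row)
    return mask


def random_mask(rng, height, width, max_sigma):
    """Mask at a random location with a random standard deviation up to max_sigma."""
    center = (rng.uniform(0, height - 1), rng.uniform(0, width - 1))
    sigma = rng.uniform(1.0, max_sigma)   # lower bound of 1.0 is an assumption
    return gaussian_mask(height, width, center, sigma)


mask = gaussian_mask(21, 21, center=(10, 10), sigma=3.0)
sampled = random_mask(random.Random(0), 21, 21, max_sigma=5.0)
print(mask[10][10], mask[10][14], mask[0][0])
```

Scaling before clipping creates a plateau equal to 1.0 around the center, where the perturbation is fully applied, surrounded by a smooth fall-off towards 0.0.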

### B.3 Hyper-parameters

In summary, to construct our image triplet , the hyper-parameters are the following:

(i) , the resizing image size, on which is applied to obtain before cropping.

(ii) , the training image size, which corresponds to the size of the training images after cropping.

(iii) , the range or standard deviation used for sampling the homography and TPS transformations.

(iv) , the range or standard deviation used for sampling the scaling parameter of the affine transformations.

(v) , the range or standard deviation used for sampling the translation parameter of the affine transformations.

(vi) , the range or standard deviation used for sampling the rotation angle of the affine transformations. It is also used as shearing angle.

(vii) , the range or standard deviation used for sampling the TPS transformations, used for the Affine-TPS compositions.

For simplicity, in all experiments including elastic deformations, we use the same elastic transformations hyper-parameters. Moreover, for all experiments and networks, we apply the same appearance transformations to image . Specifically, we use color transformations, by adjusting contrast, saturation, brightness, and hue. With probability 0.2, we additionally use a Gaussian blur with a kernel between 3 and 7, and a standard deviation sampled within .

## Appendix C Training details for WarpC-GLU-Net

We first provide details about the original GLU-Net architecture and the modifications we made for this work. We also briefly review the training strategy of the original work. We then extensively explain our training approach and the corresponding implementation details.

Architecture:  We use GLU-Net as our base architecture. It is a 4-level pyramidal network, using a VGG-16 feature backbone [Chatfield14], initialized with pre-trained weights on ImageNet. It is composed of two sub-networks, L-Net and H-Net, which act at two different resolutions. The L-Net takes as input images rescaled to a fixed resolution and processes them with a global feature correlation layer followed by a local feature correlation layer. The resulting flow is then upsampled to the lowest resolution of the H-Net to serve as initial flow, by warping the query features according to the estimated flow. The H-Net takes as input the original images at unconstrained resolution, and refines the estimated flow with two local feature correlation layers. We adopt the GLU-Net architecture and simply replace the DenseNet connections [Huang2017] of the flow decoders by residual connections. We also include residual blocks in the mapping decoder. This drastically reduces the number of weights while having limited impact on performance.

Training strategy in original work:  In the original GLU-Net [GLU-Net], the network is trained with the warp-supervision loss (referred to as a type of self-supervised training strategy in the original publication), which corresponds to equation (3) of the main paper. As for the synthetic sampled transformations , Truong et al. [GLU-Net] use the same 40k synthetic transformations (affine, thin-plate and homographies) as in DGC-Net [Melekhov2019], but apply them to images collected from the DPED [Ignatov2017]

### C.2 WarpC-GLU-Net: our training strategy

We here explain the different steps of our training strategy in more depth.

Training stages:  In the first training stage, we train GLU-Net using our warp consistency loss (Sec. 3.5 of the main paper) without the visibility mask. This is because the estimated flow field needs to reach a reasonable performance in order to compute the visibility mask (eq. 9 of the main paper). In the second training stage, we further introduce the visibility mask in the bipath loss term (eq. 8 of the main paper). In order to increase the difficulty in the second stage, we increase the transformation strengths and include additional elastic transformations for the sampled warps $W$. Note that the feature backbone is initialized to the ImageNet weights and not further trained.

Training dataset:  For training, we use the MegaDepth dataset, consisting of 196 different scenes reconstructed from 1,070,468 internet photos using COLMAP [COLMAP]. Specifically, we use 150 scenes of the dataset and sample up to 500 random images per scene. It results in around 58k training pairs. Note that we use the same set of training pairs at each training epoch. For the validation dataset, we sample up to 100 image pairs from 25 different scenes, leading to approximately 1800 image pairs. Importantly, while we can get the corresponding sparse ground-truth correspondences from the SfM reconstructions, we do not use them during training in this work and only retrieve the image pairs.

Warps sampling:  We resize the image pairs to , sample a dense flow of the same dimension and create . Each of the images of the resulting image triplet is then centrally cropped to . In the following, we give the parameters used for the sampling of the flow in both training stages.

In the first stage, the flows are created by sampling homographies, TPS and Affine-TPS transformations with equal probability. For homographies and TPS, we use a uniform sampling scheme with a range equal to , where , which corresponds to a displacement of up to 250 pixels for the image size . For the affine transformations, we also sample all parameters, i.e. scale, translation, shear and rotation angles, from uniform distributions with ranges respectively equal to , , and for both angles. We compose the affine transformations with TPS transformations, for which we sample the translation magnitudes uniformly with a range , thus corresponding to a displacement of up to 60 pixels. We chose a smaller range for the TPS compositions because we found empirically that large ranges led to very drastic dense Affine-TPS flows, which were not necessarily beneficial in the first training stage.

In the second stage, we also sample homographies, TPS and Affine-TPS transformations, but increase their strength. Specifically, for homography and TPS transformations, we use a range (displacements up to 300 pixels). The affine parameters are sampled as in the first training stage, but we increase the range of the uniform sampling for the TPS transformations to (displacements up to 200px). To make the flows even harder to estimate, we additionally include elastic transformations, sampled as explained in Sec. B.

Baseline comparison:  For fair comparison, we retrain GLU-Net using the original training strategy, which corresponds to the warp-supervision training loss, on the same MegaDepth training images. We also use the same altered GLU-Net architecture as for WarpC-GLU-Net. Moreover, we make use of the same synthetic transformations as in our first and second training stages. We call this version GLU-Net*.

### C.3 Implementation details

Since GLU-Net is a pyramidal architecture with $K$ levels, we employ a multi-scale training loss, in which the losses at the different pyramid levels are weighted differently,

$$
\mathcal{L}(\theta) = \sum_{l=1}^{K} \gamma_l L_l + \eta \left\| \theta \right\|, \tag{18}
$$

where $\gamma_l$ are the weights applied to each pyramid level and $L_l$ is the corresponding loss computed at each level, which refers to the warp-supervision loss (eq. 3 of the main paper) for baseline GLU-Net* and our proposed warp consistency loss (Sec. 3.5 of the main paper) for WarpC-GLU-Net. The second term of the loss (18) regularizes the weights $\theta$ of the network. The hyper-parameters used in the estimation of our visibility mask (eq. 9 of the main paper) are set to and . During training, we down-sample and scale the sampled $W$ from the original resolution to the L-Net input resolution in order to obtain the flow field for L-Net. For the loss computation, we down-sample the known flow field from the base resolution to the different pyramid resolutions without further scaling, so as to obtain the supervision signals at the different levels.
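The multi-scale objective (18) amounts to a weighted sum of per-level losses plus a weight regularizer. A minimal sketch follows, assuming an unsquared L2 norm for the regularizer as the equation is written. The function name, the per-level loss values and the weights in the usage lines are placeholders, not the values used in the paper.

```python
import math


def multiscale_loss(level_losses, level_weights, network_params, reg_weight):
    """Weighted sum of per-level losses plus reg_weight * ||params|| (L2 norm assumed)."""
    if len(level_losses) != len(level_weights):
        raise ValueError("one weight is needed per pyramid level")
    data_term = sum(w * loss for w, loss in zip(level_weights, level_losses))
    reg_term = reg_weight * math.sqrt(sum(p * p for p in network_params))
    return data_term + reg_term


# Placeholder values for a 4-level pyramid (not the settings of the paper).
total = multiscale_loss(
    level_losses=[1.0, 2.0, 3.0, 4.0],
    level_weights=[0.5, 0.5, 1.0, 1.0],
    network_params=[3.0, 4.0],
    reg_weight=0.1,
)
print(total)
```

In practice the regularization term is often delegated to the optimizer through its weight-decay option rather than added to the loss explicitly.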

For training, we use similar training parameters as in [GLUNet]. Specifically, as a preprocessing step, the training images are mean-centered and normalized using mean and standard deviation of the ImageNet dataset [Hinton2012]. For all local correlation layers, we employ a search radius .

For our network WarpC-GLU-Net and the baseline GLU-Net*, the weights in the training loss (18) are set to be . During the first training stage, both networks are trained with a batch size of 6 for 400k iterations. The learning rate is initially equal to , and halved after 250k and 325k iterations. For the second training stage, we train for 225k iterations with an initial learning rate of , which is halved after 100k, 150k and 200k iterations. The networks are trained using the Adam optimizer [adam] with weight decay of .
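The step schedule described above (learning rate halved at fixed iteration milestones) can be written as a small function. This is a sketch with a hypothetical name, and the base learning rate is left as a parameter, expressed here relative to 1.0.

```python
def learning_rate(iteration, base_lr, milestones, factor=0.5):
    """Learning rate after multiplying by `factor` at every milestone already reached."""
    milestones_reached = sum(iteration >= m for m in milestones)
    return base_lr * factor ** milestones_reached


FIRST_STAGE = [250_000, 325_000]             # 400k iterations in total
SECOND_STAGE = [100_000, 150_000, 200_000]   # 225k iterations in total

for it in (0, 250_000, 325_000, 399_999):
    print(it, learning_rate(it, base_lr=1.0, milestones=FIRST_STAGE))
```

In PyTorch, the same behaviour is commonly obtained with `torch.optim.lr_scheduler.MultiStepLR` using `gamma=0.5` and stepping the scheduler once per iteration.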

## Appendix D Training details for WarpC-RANSAC-Flow

In this section, we first review the RANSAC-Flow architecture as well as their original training strategy. We then explain in more depth the different steps of our training, leading to WarpC-RANSAC-Flow.

Architecture:  RANSAC-Flow inference is divided into two steps. First, the image pairs are pre-aligned by computing the homography relating them, using multi-scale feature matching based on off-the-shelf MOCO features [MOCO] and RANSAC. As a second step, the pre-aligned image pairs are input to the trained RANSAC-Flow model, which predicts the flow and matchability mask relating them. The final flow field relating the original images is computed as a composition of the flow corresponding to the homography computed in the pre-alignment step, and the predicted flow field. RANSAC-Flow is a shallow architecture that takes image pairs as input and regresses the dense flow field and matchability mask relating one image to the other. It relies on a single local feature correlation layer computed at one eighth of the input image resolution. The local feature correlation layer is computed with a small search radius of 3. The flow decoder and matchability branch are both fully convolutional with three convolution blocks, while the feature backbone is a modified version of ResNet-18 [HeZRS15].

Training dataset:  As training dataset, RANSAC-Flow uses images of the MegaDepth dataset [megadepth], from which they selected a subset of image pairs. They pre-aligned the image pairs using their pre-processing multi-scale strategy with off-the-shelf MOCO feature [MOCO] matching and homography estimation with RANSAC. The resulting training dataset comprises 20k pre-aligned image pairs, for which the remaining geometric transformation between the frames is relatively small.

Training strategy in original work:  In the original work [RANSAC-flow], the training is separated into three stages. First, the network is trained using the SSIM loss [WangBSS04], which is further combined with the forward-backward cyclic consistency loss (eq. (2) of the main paper) in the second stage. During the first two stages, only the feature backbone and the flow decoder are trained, while the matchability branch remains unchanged and unused. In the last stage, the matchability branch is also trained by weighting the previous losses with the predicted mask and including a regularization matchability loss. A disadvantage of this approach is that all losses need to be scaled with a hyper-parameter, requiring expensive manual tuning.

### D.2 WarpC-RANSAC-Flow: our training strategy

Training stages:  In the first training stage, we apply our proposed loss (Sec. 3.5 of the main paper) without the visibility mask, as in the first stage of WarpC-GLU-Net. The visibility mask (eq. 8 of the main paper) is introduced in the second stage of training. As in the original RANSAC-Flow, the first two stages only train the feature backbone and the flow decoder while keeping the matchability branch fixed (and unused). In the third stage, we jointly train the feature backbone, flow decoder and the matchability branch. As training loss, we use the original matchability regularization loss and further replace our visibility mask in the bipath loss (eq. 8 of the main paper) with the predicted mask output by the matchability branch.

Warps sampling:  We resize original images to . Following original RANSAC-Flow, the final training images have dimension . Because RANSAC-Flow uses a single local correlation layer with a search radius of 3 computed at one eighth of the original image resolution, the network can theoretically only estimate geometric transformations up to $3 \times 8 = 24$ pixels in all directions. This is very limited compared to GLU-Net or other matching networks. It makes the RANSAC-Flow architecture very sensitive to the magnitude of the geometric transformations and limited in the range of displacements that it can actually estimate. It also implies that the RANSAC-Flow pre-alignment stage (with off-the-shelf feature matching and RANSAC) is crucial for the success of the matching process in general. We thus need to sample transformations within the range of the network capabilities. As a result, we construct the warps by sampling only homographies and TPS transformations from a Gaussian distribution. This is because the Affine-TPS transformations lead to larger geometric transformations and are more difficult to parametrize for a network very sensitive to the strength of geometric transformations. The Gaussian sampling gives more importance to transformations of small magnitudes, as opposed to the uniform sampling used for WarpC-GLU-Net.

The homography and TPS transforms are sampled from a Gaussian distribution with standard deviation , which corresponds to a displacement of 24 pixels in an image size . We further integrate additional elastic transformations, which were shown beneficial to boost the network accuracy. We use the above sampling scheme for all three training stages.

### D.3 Implementation details

RANSAC-Flow only estimates the flow at one eighth of the original image resolution. In the original work, the loss is computed at image resolution, i.e., after upsampling the estimated flow field. Following the original work, we also compute our training losses at the image resolution. The hyper-parameters used in the estimation of our visibility mask (eq. 9 of the main paper) are set to and .

For training, we use similar training parameters as in [RANSAC-flow]. As pre-processing, we scale the input network images to . During the first training stage, WarpC-RANSAC-Flow is trained with a batch size of 10 for 300k iterations. The learning rate is initially equal to , and halved after 200k iterations. For the second training stage, we train for 140k iterations with a constant learning rate of . Finally, the third training stage also uses an initial learning rate of halved after 200k iterations, and comprises a total of 300k iterations. To weight the matchability regularization loss with respect to our warp consistency loss in the third stage, we use a constant factor of applied to the matchability loss.

## Appendix E Training details for WarpC-SemanticGLU-Net

Here, we first review the SemanticGLU-Net architecture as well as their original training strategy. We then provide additional details about our training strategy, resulting in WarpC-SemanticGLU-Net.

Architecture:  SemanticGLU-Net is derived from GLU-Net [GLUNet], with two architectural modifications, making it more suitable for semantic data. Specifically, the global feature correlation layer is followed by a consensus network [Rocco2018b]. The features from the different levels in the L-Net are also concatenated, similarly to [Jeon].

Training strategy in original work:  SemanticGLU-Net was originally trained using the same procedure as GLU-Net [GLUNet]. It is explained in Sec. C.

### e.2 WarpC-SemanticGLU-Net: our training strategy

Training procedure:  We only finetune on semantic data, from the original pretrained SemanticGLU-Net model, initialized with the weights provided by the authors. The VGG-16 feature backbone is initialized to the ImageNet weights and not further finetuned. We use our warp consistency loss (Sec. 3.5 of the main paper), where the visibility mask is directly included. Note that since SemanticGLU-Net is trained using solely the warp-supervision objective, the overall training of WarpC-SemanticGLU-Net does not use any flow annotations.

Training dataset:  We use the PF-Pascal [PFPascal] images as training dataset. Following the dataset split in [SCNet], we partition the total 1351 image pairs into a training set of 735 pairs, a validation set of 308 pairs and a test set of 308 pairs. The 735 training image pairs are augmented by mirroring, random cropping and exchanging the images in the pair. This leads to a total of 2940 training image pairs.

Warps sampling:  We resize the image pairs to , sample a dense flow of the same dimension and create . Each of the images of the resulting image triplet is then centrally cropped to . The flows are created by sampling homographies, TPS and Affine-TPS transformations with equal probability. For homographies and TPS, we use a uniform sampling scheme with a range equal to , where , which corresponds to a displacement of px, in image size . For the affine transformations, we also sample all parameters, i.e. scale, translation, shear and rotation angles, from uniform distributions with ranges respectively equal to , , and for both angles. We compose the affine transformations with TPS transformations, for which we sample the translation magnitudes uniformly with a range , thus corresponding to a displacement of 100px.

Implementation details:  For our network WarpC-SemanticGLU-Net, the weights in the training loss (18) are set to . We train with a batch size of 5, for a total of 7k iterations. The learning rate is initially equal to , and halved after 4k, 5k and 6k iterations. The network is trained using the Adam optimizer [adam] with weight decay of .
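The step schedule described above can be sketched as a small function. This is only an illustrative sketch: the function name `step_halving_lr` is hypothetical, the initial learning rate is passed in as `base_lr` because its value is not recoverable from this text, and treating "after 4k iterations" as "from iteration 4000 onwards" is an assumption.

```python
def step_halving_lr(iteration, base_lr, milestones=(4000, 5000, 6000)):
    """Return the learning rate at `iteration`, halved once per milestone reached.

    `base_lr` is a placeholder for the initial learning rate.
    Assumption: a milestone takes effect from that iteration onwards.
    """
    num_halvings = sum(1 for m in milestones if iteration >= m)
    return base_lr * (0.5 ** num_halvings)


if __name__ == "__main__":
    # Relative learning rate (base_lr = 1.0) at a few points of the 7k-iteration run.
    for it in (0, 3999, 4000, 5000, 6000, 6999):
        print(it, step_halving_lr(it, base_lr=1.0))
```

With these milestones, the learning rate over the last thousand iterations is one eighth of its initial value. A multi-step scheduler with a decay factor of 0.5 in a deep learning framework would behave similarly.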

## Appendix F Training details for method analysis

For the method analysis corresponding to Sec. 4.1 of the main paper, we use GLU-Net [GLUNet] as the base network. The architecture and implementation details are explained in Sec. C. In this section, for completeness, we provide additional details about the training procedure used for each of the compared networks, when necessary.

Warp consistency graph analysis:  All networks are trained following the first WarpC-GLU-Net training stage, i.e. without including the visibility mask in the bipath or cycle losses. We employ the same warps for all networks, which correspond to the sampling distribution used in the first training stage, as detailed in Sec. C.

Ablation study:  Networks in the ablation study are trained according to the stages described in Sec. C.

Comparison to alternative losses:  We provide implementation details for networks trained with alternative losses. For all unsupervised learning objectives, we train the network in two stages. First, we use solely the evaluated loss, without visibility or occlusion mask. In the second stage, we further finetune the resulting model, extending the evaluated loss with the visibility mask, estimated as in [Meister2017]. For the objectives including our warp consistency loss (WarpC) or the warp-supervision loss, we use the same synthetic warp distribution as introduced in Sec. C. In the following, we give details about each training using an alternative loss.

Warp-supervision + forward-backward:

Selecting a hyper-parameter is necessary to weight the forward-backward loss with respect to the warp-supervision objective. After manual tuning, we weight the forward-backward term with a constant factor equal to . It ensures that the forward-backward term accounts for about half of the magnitude of the warp-supervision loss. For further implementation details, refer to Sec. C.

Census:

The implementation details are the same as explained in Sec. C. In particular, we found that downsampling the images to the flow resolution at each level for loss computation gave better results than upsampling the estimated flows to image resolution.

SSIM:

To compute the loss, we upsample the estimated flow from each level to image resolution, i.e. for the H-Net and for the L-Net. This strategy led to significantly better results than downsampling the images instead. As a result, because GLU-Net is a multi-scale architecture and the loss is computed using the flow from each resolution, the weights of the final training loss (18) are set to . This gives equal contribution to all levels, since estimated flows at levels of L-Net and H-Net are upsampled to respectively and . SSIM is computed with a window size of 11 pixels, following RANSAC-Flow [RANSAC-flow].

SSIM + forward-backward:

The model trained using the SSIM loss is further finetuned with the combination of photometric SSIM and forward-backward consistency losses (eq. 2 of the main paper). Both loss terms are balanced with a constant factor equal to , applied to the forward-backward consistency term. This ensures that the forward-backward term accounts for about half of the magnitude of the SSIM loss. Implementation details are the same as when training with the SSIM loss only.

SSIM + WarpC:

For the WarpC loss, we follow the training procedure and implementation details provided in Sec. C, i.e. we compute the loss at estimated flow resolution. For the SSIM loss term, we instead follow the training strategy explained above, i.e. we compute the loss at image resolution. For the WarpC term, the level weights of the final training loss (18) are set to , while for the SSIM loss term they are set to . Each loss term, i.e. WarpC and SSIM, is computed independently and the final loss is the sum of both.

Sparse ground-truth data:

Since the ground-truth is sparse, it is inconvenient to down-sample the ground-truth to different resolutions. We thus instead up-sample the estimated flow fields to the ground-truth resolution and compute the loss at this resolution. As for SSIM, we therefore use for the level weights of the final training loss (18).

## Appendix G Analysis of transformations W

In this section, we analyse the impact of the sampled transformations’ strength on the performance of the corresponding trained WarpC networks. As explained in Sec. B, the strength of the warps is mostly controlled by the standard-deviation or range , used to sample the base homography and TPS transformations. We thus analyse the effect of the sampling range on the evaluation results of the corresponding WarpC networks, particularly WarpC-GLU-Net and WarpC-SemanticGLU-Net. We do not provide such an analysis for WarpC-RANSAC-Flow because, as mentioned in Sec. D, the RANSAC-Flow architecture can only estimate a small range of displacements, which also limits the range over which we can sample the warps .

While we choose a specific distribution to sample the transformation parameters used to construct the flow , our experiments show that the performance of the networks trained with our proposed warp consistency loss (Sec. 3.5 of the main paper) is relatively insensitive to the strength of the transformations , as long as they remain within reasonable bounds. We present these experiments below.

WarpC-GLU-Net:  Specifically, we analyze the PCK curves obtained by GLU-Net based models, trained following our first training stage (Sec. C), for varying ranges used to sample the TPS and homography transformations. Note that for all networks, the sampling distributions of the Affine-TPS transformations are the same. We plot in Fig. 9 the resulting curves, computed on the MegaDepth and RobotCar datasets. For completeness, we additionally plot the PCK values for fixed pixel thresholds in versus the sampling range in Fig. 10. On MegaDepth, increasing the sampling range from to leads to an improvement of the resulting network’s robustness to large geometric transformations, i.e. an increase in PCK-3, 5 and 10. Further increasing up to leads to a decrease in these PCK values. For PCK-1 however, networks trained with sampling ranges within obtain similar accuracy. The accuracy starts dropping for larger sampling ranges. We select because it obtains the best PCK-1 and good PCK-3, 5 and 10. Nevertheless, note that networks trained using sampling ranges within lead to relatively similar PCK metrics, within 2-3 %. Moreover, on RobotCar, all networks obtain similar PCK metrics, independently of the sampling range .

WarpC-SemanticGLU-Net:  As for WarpC-GLU-Net, we show that the performance of WarpC-SemanticGLU-Net is relatively insensitive to the strength of the transformations , as long as they remain within reasonable bounds. Specifically, we analyze the PCK curves obtained by WarpC-SemanticGLU-Net based models, for varying ranges used to sample the TPS and homography transformations of during training. Note that for all networks, the sampling distributions of the Affine-TPS transformations are the same. We plot in Fig. 7 the resulting curves evaluated on the test set of PF-Pascal and in Fig. 8 the results for specific PCK values. For sampling ranges within , the results of the corresponding trained WarpC-SemanticGLU-Net are all very similar overall. In particular, the gap between all networks for is very small, within 1 %. For , differences amount to . We selected because it led to a slightly better PCK for the low threshold .

## Appendix H Experimental setup and datasets

In this section, we first provide details about the evaluation datasets and metrics. We then explain the experimental set-up in more depth.

### h.1 Evaluation metrics

AEPE:  The Average End-Point Error (AEPE) is defined as the Euclidean distance between estimated and ground truth flow fields, averaged over all valid pixels of the reference image.

PCK:  The Percentage of Correct Keypoints (PCK) is computed as the percentage of correspondences with a Euclidean distance error , w.r.t. the ground truth , that is smaller than a threshold .
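The two metrics above can be illustrated with a minimal, dependency-free sketch. The function names and the data layout (flat lists of `(u, v)` flow vectors with a boolean validity flag per pixel) are assumptions made for illustration, not the evaluation code used for the reported results.

```python
import math


def endpoint_errors(flow_est, flow_gt, valid):
    """Per-pixel Euclidean distance between estimated and ground-truth flow.

    flow_est, flow_gt: sequences of (u, v) flow vectors, one per pixel.
    valid: sequence of booleans; invalid pixels are ignored.
    """
    return [
        math.hypot(u - u_gt, v - v_gt)
        for (u, v), (u_gt, v_gt), ok in zip(flow_est, flow_gt, valid)
        if ok
    ]


def aepe(flow_est, flow_gt, valid):
    """Average end-point error over all valid pixels."""
    errors = endpoint_errors(flow_est, flow_gt, valid)
    if not errors:
        raise ValueError("no valid pixels")
    return sum(errors) / len(errors)


def pck(flow_est, flow_gt, valid, threshold):
    """Percentage of valid correspondences with error strictly below `threshold`."""
    errors = endpoint_errors(flow_est, flow_gt, valid)
    if not errors:
        raise ValueError("no valid pixels")
    return 100.0 * sum(e < threshold for e in errors) / len(errors)


if __name__ == "__main__":
    est = [(3.0, 4.0), (0.0, 0.0), (1.0, 0.0)]
    gt = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
    valid = [True, True, False]  # the last pixel has no ground truth
    print(aepe(est, gt, valid))      # errors are 5.0 and 0.0
    print(pck(est, gt, valid, 1.0))  # one of two valid pixels is below 1 px
```

In this toy example the two valid pixels have errors of 5 and 0 pixels, so the AEPE is 2.5 and PCK at a 1-pixel threshold is 50 %.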

### h.2 Evaluation datasets and set-up

HPatches:  The HPatches dataset [Lenc] is a benchmark for geometric matching correspondence estimation. It depicts planar scenes, with transformations restricted to homographies. As in DGC-Net [Melekhov2019], we only employ the 59 sequences labelled with v_X, which have viewpoint changes, thus excluding the ones labelled i_X, which only have illumination changes. Each image sequence contains a query image and 5 reference images taken under increasingly large viewpoint changes, with sizes ranging from to .

MegaDepth:  The MegaDepth dataset [megadepth] depicts real scenes with extreme viewpoint changes. No real ground-truth correspondences are available, so we use the result of SfM reconstructions to obtain sparse ground-truth correspondences. We follow the same procedure and use the same test images as [RANSAC-flow], spanning 19 scenes. More precisely, 1600 pairs of images that shared more than 30 points were randomly sampled. The test pairs are from different scenes than the ones we used for training and validation. Correspondences were obtained by using 3D points from SfM reconstructions and projecting them onto the pairs of matching images. This results in approximately 367K correspondences. During evaluation, following [RANSAC-flow], all the images are resized to have a minimum dimension of 480 pixels.

RobotCar:  Images in RobotCar depict outdoor road scenes, taken under different weather and lighting conditions. While the image pairs show similar view-points, they are particularly challenging due to their many textureless regions. For evaluation, we use the correspondences originally introduced by [RobotCarDatasetIJRR]. Following [RANSAC-flow], all the images are resized to have a minimum dimension of 480 pixels.
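The resizing used for MegaDepth and RobotCar can be sketched as follows. The helper name `resize_to_min_dimension` is hypothetical, and the sketch assumes that the aspect ratio is preserved, which the text does not state explicitly. Under that assumption, the same scale factor would also be applied to the ground-truth keypoint coordinates.

```python
def resize_to_min_dimension(height, width, min_dim=480):
    """Compute the target size such that the shorter side equals `min_dim`.

    Assumption: the aspect ratio is preserved. Returns (new_height, new_width,
    scale), where `scale` can also be used to rescale keypoint coordinates.
    """
    if height <= 0 or width <= 0:
        raise ValueError("image dimensions must be positive")
    scale = min_dim / min(height, width)
    return round(height * scale), round(width * scale), scale


if __name__ == "__main__":
    print(resize_to_min_dimension(960, 1280))  # landscape image, shorter side is the height
    print(resize_to_min_dimension(600, 480))   # portrait image already at the target size
```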

TSS:  The TSS dataset [Taniai2016] contains 400 image pairs, divided into three groups: FG3DCAR, JODS, and PASCAL, according to the origins of the images. Dense flow field annotations for the foreground object in each pair are provided along with a segmentation mask. Evaluation is done on 800 pairs, by also exchanging query and reference images.

PF-Pascal:  The PF-PASCAL [PFPascal] benchmark is built from the PASCAL 2011 keypoint annotation dataset [BourdevM09]. It consists of 20 diverse object categories, ranging from chairs to sheep. Sparse manual annotations are provided for 300 image pairs. Evaluation is done by computing PCK for a pixel threshold computed with respect to query image size.

PF-Willow:  The PF-WILLOW dataset consists of 900 image pairs selected from a total of 100 images [PFWillow]. It spans four object categories. Sparse annotations are provided for all pairs. For evaluation, we report the PCK scores with multiple thresholds ( = 0.05, 0.10, 0.15) with respect to bounding box size in order to compare with prior methods.
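For the semantic benchmarks, the PCK threshold is relative to a reference size rather than a fixed number of pixels. A minimal sketch of this conversion is given below, assuming the common convention that the pixel threshold is the relative threshold times the larger side of the reference (the query image for PF-Pascal, the object bounding box for PF-Willow). Both the function name and the use of the larger side are assumptions, not details stated in the text.

```python
def pck_pixel_threshold(alpha, ref_height, ref_width):
    """Convert a relative PCK threshold into pixels.

    Assumption: the reference size is the larger side of the query image
    (PF-Pascal) or of the object bounding box (PF-Willow).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return alpha * max(ref_height, ref_width)


if __name__ == "__main__":
    # Pixel thresholds for a hypothetical 200 x 300 bounding box.
    for alpha in (0.05, 0.10, 0.15):
        print(alpha, pck_pixel_threshold(alpha, 200, 300))
```

A predicted keypoint would then count as correct when its distance to the annotated keypoint is below this pixel threshold.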

## Appendix I Qualitative results

Finally, we provide extensive qualitative visual examples of the performance of our WarpC models. We first qualitatively compare baseline GLU-Net* and our approach WarpC-GLU-Net on images of MegaDepth and RobotCar in Fig. 12, 13 and Fig. 11, respectively. WarpC-GLU-Net is significantly more accurate than GLU-Net*. It can also handle very drastic scale and view-point changes, where GLU-Net* often completely fails. This is thanks to our -bipath objective, which provides supervision on the network predictions between the real image pairs, as opposed to the warp-supervision objective. Also note that in the dense estimation setting, the network must predict a match for every pixel in the reference, even in obviously occluded regions. Nevertheless, only correspondences found in overlapping regions are relevant. Moreover, occluded regions can be filtered out using e.g. a forward-backward consistency mask [Meister2017], or by letting the network predict a visibility mask as in [RANSAC-flow, Melekhov2019]. This is particularly important for MegaDepth images, in which some image pairs have overlapping ratios below . On RobotCar images in Fig. 11, our approach WarpC-GLU-Net better handles large appearance variations, such as seasonal or day-night changes.

We then show the performance of WarpC-SemanticGLU-Net compared to SemanticGLU-Net on images of TSS in Fig. 14, 15 and 16. Our unsupervised finetuning brings visible robustness to the large appearance changes and shape variations inherent to the semantic matching task. Finally, we also qualitatively compare both networks on images of the PF-Pascal dataset in Fig. 17, 18 and 19. The PF-Pascal dataset shows more diverse object categories than TSS images. WarpC-SemanticGLU-Net manages to accurately align challenging image pairs, such as the chair examples which are particularly cluttered.