Probabilistic Warp Consistency for Weakly-Supervised Semantic Correspondences

03/08/2022
by   Prune Truong, et al.
ETH Zurich

We propose Probabilistic Warp Consistency, a weakly-supervised learning objective for semantic matching. Our approach directly supervises the dense matching scores predicted by the network, encoded as a conditional probability distribution. We first construct an image triplet by applying a known warp to one of the images in a pair depicting different instances of the same object class. Our probabilistic learning objectives are then derived using the constraints arising from the resulting image triplet. We further account for occlusion and background clutter present in real image pairs by extending our probabilistic output space with a learnable unmatched state. To supervise it, we design an objective between image pairs depicting different object classes. We validate our method by applying it to four recent semantic matching architectures. Our weakly-supervised approach sets a new state-of-the-art on four challenging semantic matching benchmarks. Lastly, we demonstrate that our objective also brings substantial improvements in the strongly-supervised regime, when combined with keypoint annotations.



1 Introduction

The semantic matching problem entails finding pixel-wise correspondences between images depicting instances of the same semantic category of object or scene, such as ‘cat’ or ‘bird’. It has received growing interest due to its applications in, e.g., semantic segmentation [Taniai2016, RubinsteinJKL13] and image editing [DaleJSMP09, BarnesSFG09, HaCohenSGL11, LeeKLKCC20]. The task nevertheless remains extremely challenging due to the large intra-class appearance and shape variations, view-point changes, and background clutter. These issues are further complicated by the inherent difficulty of obtaining ground-truth annotations.

Figure 1: From a real image pair $(I, J)$ representing the same object class, we generate a new image $I'$ by warping $I$ according to a randomly sampled transformation $W$. We further extend the image triplet with an additional image $N$, which depicts a different object class. For each pixel in $I'$, we introduce two consistency objectives by enforcing the conditional probability distributions obtained either from the composition $I' \to J \to I$, or directly through $I' \to I$, to be equal to the known warping distribution derived from $W$. We further model occlusion and unmatched regions by introducing a learnable unmatched state. It is trained by enforcing the predicted distribution between the non-matching images $(I, N)$ to be mapped to the unmatched state for all pixels.

While a few current datasets [PFPascal, PFWillow, spair] provide manually annotated keypoint matches, these are often ill-defined, ambiguous and scarce. Strongly-supervised approaches relying on such annotations therefore struggle to generalize across datasets, as demonstrated in recent works [MinLPC20, CATs]. As a prominent alternative, unsupervised approaches [GLUNet, GOCor, Rocco2017a, Melekhov2019, Rocco2018a, SeoLJHC18, MinLPC20] often train the network with synthetically generated dense ground-truth and image data. While benefiting from direct supervision, the lack of real image pairs often leads to poor generalization to real data. Weakly-supervised methods [MinLPC20, Rocco2018a, Rocco2018b, Jeon, DCCNet] thus appear as an attractive paradigm, leveraging supervision from real image pairs by only exploiting image-level class labels, which are inexpensive compared to keypoint annotations.

Previous weakly-supervised alternatives introduce objectives on the predicted dense correspondence volume, which encapsulates the matching confidences for all pairwise matches between the image pair. The most common strategy is to maximize the maximum scores [Rocco2018b, DCCNet] or negative entropy [MinLPC20] of the correspondence volume computed between images of the same class, while minimizing the same quantity for images of different classes. However, these strategies only provide very limited supervision due to their weak and indirect learning signal. While these approaches act directly on the predicted dense correspondence volume, Truong et al. [warpc] recently introduced Warp Consistency, a weakly-supervised learning objective for dense flow regression. The objective is derived from flow constraints obtained when introducing a third image, constructed by randomly warping one of the images in the original pair. While it achieves impressive results, the warp consistency objective is limited to the learning of flow regression. Since such an approach predicts a single match for each pixel without any confidence measure, it struggles to handle occlusions and background clutter, which are prominent in the semantic matching task.

We propose Probabilistic Warp Consistency, a weakly-supervised learning objective for semantic matching. Following [Rocco2018b, DCCNet, CATs], we first employ a probabilistic mapping representation of the predicted dense correspondences, encoding the transitional probabilities from every pixel in one image to every pixel in the other. Starting from a real image pair $(I, J)$, we consider the image triplet $(I, I', J)$ introduced in [warpc], where the synthetic image $I'$ is related to $I$ by a randomly sampled warp $W$ (Fig. 1). We derive our probabilistic consistency objective based on predicting the known probabilistic mapping relating $I'$ to $I$ with the composition through the image $J$. The composition is obtained by marginalizing over all the intermediate paths that link pixels in image $I'$ to pixels in $I$ through image $J$.

Since the constraints employed to derive our objective are only valid in mutually visible object regions, we further tackle the problem of identifying pixels that can be matched. This is particularly challenging in the presence of background clutter and occlusions, common in semantic matching. We explicitly model occlusion and unmatched regions, by introducing a learnable unmatched state into our probabilistic mapping formulation. To train the model to detect unmatched regions, we design an additional probabilistic loss that is applied on pairs of images depicting different object classes, as illustrated in Fig. 1. Further, we also employ a visibility mask, which constrains our introduced consistency loss to visible object regions.

We extensively evaluate and analyze our approach by applying it to four recent semantic matching architectures, across four benchmark datasets. In particular, we train SF-Net [SFNet] and NC-Net [Rocco2018b] with our weakly-supervised Probabilistic Warp Consistency objective. Our approach brings substantial relative gains on PF-Pascal [PFPascal] and PF-Willow [PFWillow] for SF-Net, and on SPair-71K [spair] and TSS [Taniai2016] for NC-Net. This leads to a new state-of-the-art on all four datasets. Finally, we extend our approach to the strongly-supervised regime, by combining our probabilistic objectives with keypoint supervision. When integrated in SF-Net, NC-Net, DHPF [MinLPC20] and CATs [CATs], it leads to substantially better generalization properties across datasets, setting a new state-of-the-art on three benchmarks. Code is available at github.com/PruneTruong/DenseMatching.

2 Related Work

Semantic matching architectures:

Most semantic matching pipelines include three main steps, namely feature extraction, cost volume construction, and displacement estimation. Multiple works focus on the latter, through either predicting the global geometric transformation parameters [Rocco2017a, ArbiconNet, SeoLJHC18, Kim2018, Rocco2018a, Jeon], or directly regressing the flow field [GLUNet, GOCor, pdcnet, warpc, Kim2019] relating an image pair. Nevertheless, most methods instead predict a cost volume as the final network output, which is further converted to point-to-point correspondences with argmax or soft-argmax [SFNet] operations. Recent methods thus focus on improving the cost volume aggregation stage, by formulating the semantic matching task as an optimal transport problem [SCOT] or leveraging multi-resolution features and cost volumes [HPF, MinLPC20, SFNet, MMNet, CATs]. Another line of work deals with refining the cost volume, with 4D [Rocco2018b, ANCNet, DCCNet, PMNC] or 6D [CHM] convolutions, an online optimization-based module [GOCor], an encoder-decoder style architecture [GSF] or a Transformer module [CATs].

Unsupervised and weakly-supervised semantic matching:

A common technique for unsupervised learning of semantic correspondences is to rely on synthetically warped versions of images [Rocco2017a, ArbiconNet, SeoLJHC18, GLUNet, GSF]. It nevertheless comes at the cost of poorer generalization abilities to real data. Some methods instead use real image pairs, by leveraging additional annotations in the form of 3D CAD models [Zhou2016, abs-2004-09061], segmentation masks [SFNet, ChenL0H21], or by jointly learning semantic matching with attribute transfer [Kim2019]. Most related to our work are approaches that use proxy losses on the cost volume constructed between real image pairs, with image labels as the only supervision [Jeon, Rocco2018a, Rocco2018b, DCCNet, Kim2018]. Jeon et al. [Jeon] identify correct matches from forward-backward consistency. NC-Net [Rocco2018b] and DCC-Net [DCCNet] are trained by maximizing the mean matching scores over all hard-assigned matches from the cost volume. Min et al. [MinLPC20] instead encourage low and high correlation entropy for image pairs depicting the same or different classes, respectively. In this work, we instead construct an image triplet by warping one of the original images with a known warp, from which we derive our probabilistic losses.

Unsupervised learning from videos: Our approach is also related to [randomwalk], which proposes a self-supervised approach for learning features, by casting matches as predictions of links in a space-time graph constructed from videos. Recent works [DwibediATSZ19, JabriOE20, WangJE19] further leverage the temporal consistency in videos to learn a representation for feature matching.

3 Background: Warp Consistency

We derive our approach based on the warp consistency constraints introduced by [warpc]. They propose a weakly-supervised loss, termed Warp Consistency, for learning correspondence regression networks. We therefore first review relevant background and introduce the notation that we use.

We define the mapping $M_{A \to B}$, which for each pixel location $x$ in image $A$ encodes the absolute location $M_{A \to B}(x)$ in image $B$ corresponding to $x$. We consistently use the hat $\hat{\cdot}$ to denote an estimated or predicted quantity.

Warp Consistency graph: Truong et al. [warpc] first build an image triplet, which is used to derive the constraints. From a real image pair $(I, J)$, an image triplet $(I, I', J)$ is constructed by creating $I'$ through warping of $I$ with a randomly sampled mapping $W$, as $I' = I \circ W$. Here, $\circ$ denotes function composition. The resulting triplet gives rise to a warp consistency graph (Fig. 2a), from which a family of mapping-consistency constraints is derived.

Mapping-consistency constraints: Truong et al. [warpc] analyse the possible mapping-consistency constraints arising from the triplet and identify two of them as most suitable when designing a weakly-supervised learning objective for dense correspondence regression. Particularly, the proposed objective is based on the W-bipath constraint, where the known mapping $W$ is computed through the composition via image $J$, formulated as,

$\hat{M}_{J \to I} \circ \hat{M}_{I' \to J} = W \,.$    (1)

It is further combined with the warp-supervision constraint,

$\hat{M}_{I' \to I} = W \,,$    (2)

derived from the graph by the direct path $I' \to I$.

In [warpc], these constraints were used to derive a weakly-supervised objective for correspondence regression. However, regressing a single mapping vector for each position only retrieves the position of the match, without any information on its uncertainty or multiple hypotheses. We instead aim at predicting a matching conditional probability distribution for each position. The distribution encapsulates richer information about the matching ability of this location, such as confidence, uniqueness, and existence of the correspondence. In this work, we thus generalize the mapping constraints (1)-(2) extracted from the warp consistency graph to conditional probability distributions.

4 Method

We address the problem of estimating the pixel-wise correspondences relating an image pair $(I, J)$ depicting semantically similar objects. The dense matches are encapsulated in the form of a conditional probability matrix, referred to as a probabilistic mapping. The goal of this work is to design a weakly-supervised learning objective for probabilistic mappings, applied to the semantic matching task.

(a) Warp Consistency graph [warpc].   (b) Our probabilistic PW-bipath (6) and PWarp-supervision constraints, with corresponding losses (7)-(8).

Figure 2: Mapping and probabilistic mapping constraints derived from the warp consistency graph between the images $(I, I', J)$. $I'$ is generated by warping $I$ according to a randomly sampled mapping $W$ (black arrow). (a) The W-bipath (1) and warp-supervision (2) mapping constraints [warpc] predict $W$ by the composition $\hat{M}_{J \to I} \circ \hat{M}_{I' \to J}$, and directly by $\hat{M}_{I' \to I}$, respectively. (b) Our probabilistic PW-bipath and PWarp-supervision constraints are derived by enforcing the composition of the predicted distributions $\hat{P}_{J \to I} \hat{P}_{I' \to J}$, and the direct prediction $\hat{P}_{I' \to I}$, respectively, to be equal to the known warping distribution $P^W$.

4.1 Probabilistic Formulation

In this section, we first introduce our probabilistic representation and define a typical base predictive architecture. We let $x$ denote a 2D pixel location in a grid of dimension $h \times w$, corresponding to image $I$. We refer to $i$ as the index corresponding to $x$ when the spatial dimensions are vectorized into one dimension of size $hw$. Following [Rocco2018b, MinLPC20, CATs], we aim at predicting the probabilistic mapping $P_{I \to J}$ relating $I$ to $J$. Given a position $i$ in frame $I$, $P_{I \to J}(j \mid i)$ gives the probability that $i$ is mapped to location $j$ in image $J$. $P_{I \to J}(\cdot \mid i)$ thus encodes the entire discrete conditional probability distribution of where $i$ is mapped in image $J$. We can see $P_{I \to J}$ as an $hw \times hw$ matrix, where each column at index $i$ encapsulates the distribution $P_{I \to J}(\cdot \mid i)$. Also note that the probabilistic mapping is asymmetric: $P_{I \to J}$ and $P_{J \to I}$ generally differ.

Probabilistic mapping prediction: We here describe a standard architecture predicting the probabilistic mapping relating an image pair. We let $F^I$ and $F^J$ denote the $d$-channel feature maps extracted from the images $I$ and $J$, respectively. A cost volume $C \in \mathbb{R}^{hw \times hw}$ is then constructed, which encodes the pairwise deep feature similarities between all locations in the two feature maps, as,

$C_{ji} = \langle F^J_j, \, F^I_i \rangle \,,$    (3)

where $F^I_i \in \mathbb{R}^d$ denotes the feature vector at location $i$. The cost volume is finally converted to a probabilistic mapping by simply applying the SoftMax operation over the first dimension,

$\hat{P}_{I \to J}(j \mid i) = \frac{\exp(C_{ji}/T)}{\sum_{k} \exp(C_{ki}/T)} \,,$    (4)

where $T$ denotes a temperature parameter.

Note that extensions of this basic approach can also be considered, e.g. by adding post-processing convolutional layers [Rocco2018b, DCCNet] or a Transformer module [CATs]. The goal of this work is to design a weakly-supervised learning objective to train a neural network, with parameters $\theta$, that predicts the probabilistic mapping $\hat{P}_{I \to J}$ relating $I$ to $J$.
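For illustration, the following PyTorch sketch converts a pair of feature maps into a probabilistic mapping following (3)-(4). It is a minimal sketch rather than the released implementation; the feature normalization and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def probabilistic_mapping(feat_src, feat_trg, temperature=0.02):
    """Convert two feature maps into a probabilistic mapping of shape (B, hw_trg, hw_src).

    feat_src, feat_trg: (B, d, h, w) feature maps of the source and target images.
    Column i of the output is the conditional distribution over target locations
    for source pixel i, as in Eqs. (3)-(4). The temperature is an assumed value.
    """
    # L2-normalize the features so the cost volume contains cosine similarities (assumption)
    src = F.normalize(feat_src.flatten(2), dim=1)        # (B, d, hw_src)
    trg = F.normalize(feat_trg.flatten(2), dim=1)        # (B, d, hw_trg)
    cost = torch.einsum('bdj,bdi->bji', trg, src)        # (B, hw_trg, hw_src), Eq. (3)
    # SoftMax over the first (target) dimension turns each column into a distribution, Eq. (4)
    return torch.softmax(cost / temperature, dim=1)
```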

4.2 Probabilistic Warp Consistency Constraints

We set out to design a weakly-supervised loss for probabilistic mappings. To this end, we consider the consistency graph introduced in [warpc] and generalize the mapping constraints (1)- (2) to their corresponding probabilistic form.

Probabilistic W-bipath constraint: We start from the W-bipath constraint (1) extracted from the Warp Consistency graph in Fig. 2a and extend it to its probabilistic matrix counterpart, which we denote as PW-bipath. It states that we obtain the same conditional probability distribution by proceeding through the path $I' \to I$, which is determined by the randomly sampled warp $W$, or by taking the detour through image $J$. In the latter case, the resulting probability distribution is derived by marginalizing over the intermediate paths that link pixels in $I'$ to pixels in $I$ through $J$ as,

$P^W(j \mid i) = \sum_{k} \hat{P}_{J \to I}(j \mid k) \, \hat{P}_{I' \to J}(k \mid i) \,.$    (5)

Here, $P^W$ denotes the known probabilistic mapping derived from $W$, relating $I'$ to $I$. The above equality is expressed in matrix form as,

$P^W = \hat{P}_{J \to I} \, \hat{P}_{I' \to J} \,,$    (6)

where the product denotes matrix multiplication. This constraint is schematically represented in Fig. 2b.

PW-bipath training objective: We aim at formulating an objective based on the PW-bipath constraint (6). Crucially, in our setting, the mapping $W$ is known by construction, from which we can derive the ground-truth probabilistic mapping $P^W$. To measure the distance between the right and the left side of (6), the KL divergence appears as a natural choice. Since $P^W$ is a constant, it simplifies to the familiar cross-entropy,

$L_{\text{PW-bipath}} = \mathrm{CE}\big(P^W, \, \hat{P}_{J \to I} \hat{P}_{I' \to J}\big) = -\sum_{i} \sum_{j} P^W(j \mid i) \, \log \big(\hat{P}_{J \to I} \hat{P}_{I' \to J}\big)_{ji} \,.$    (7)

Here, $\mathrm{CE}$ is the cross-entropy loss. To simplify notations, we sometimes refer to the marginalization $\hat{P}_{J \to I} \hat{P}_{I' \to J}$ as $\hat{P}_{I' \to J \to I}$. Supervising with the label $P^W$ provides an implicit learning signal for the predicted intermediate distributions $\hat{P}_{I' \to J}$ and $\hat{P}_{J \to I}$.
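The composition in (6) and the cross-entropy in (7) reduce to a batched matrix multiplication followed by a log-likelihood against the known distribution. The sketch below illustrates this under the tensor layout of the previous sketch, where columns index source pixels; the PWarp-supervision loss (8) is obtained analogously by replacing the composed distribution with the directly predicted $\hat{P}_{I' \to I}$.

```python
import torch

def pw_bipath_loss(P_J_to_I, P_Iprime_to_J, P_W, eps=1e-8):
    """Probabilistic W-bipath loss, a sketch of Eqs. (6)-(7).

    P_J_to_I:      (B, hw_I, hw_J)   predicted probabilistic mapping from J to I
    P_Iprime_to_J: (B, hw_J, hw_Ip)  predicted probabilistic mapping from I' to J
    P_W:           (B, hw_I, hw_Ip)  ground-truth probabilistic mapping derived from W
    """
    # Marginalize over the intermediate locations in J (matrix multiplication, Eq. (6))
    P_composed = torch.bmm(P_J_to_I, P_Iprime_to_J)           # (B, hw_I, hw_Ip)
    # Cross-entropy between the known distribution and the composition (Eq. (7))
    ce = -(P_W * torch.log(P_composed + eps)).sum(dim=1)      # sum over target locations
    return ce.mean()
```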

PWarp-supervision constraint and objective: Similarly, we generalize the warp-supervision constraint (2) to its probabilistic matrix form, as $\hat{P}_{I' \to I} = P^W$. As previously, by exploiting the fact that $W$ is known, we derive the corresponding training objective,

$L_{\text{PWarp-sup}} = \mathrm{CE}\big(P^W, \, \hat{P}_{I' \to I}\big) \,.$    (8)

The PW-bipath constraint (6) and its loss (7) assume that all pixels of image $I'$ have a match in both $I$ and $J$. However, due to the occlusions introduced by the triplet creation and the non-matching backgrounds of the images in the semantic matching task, this assumption is partly invalidated.

4.3 Modelling Unmatched Regions

The semantic matching task aims to estimate correspondences between different image instances of the same object class. However, even in that case, the backgrounds of the two images do not match. As a result, the commonly visible regions only represent a fraction of the images (see the birds in Fig. 2). Nevertheless, the distribution $\hat{P}_{I \to J}(\cdot \mid i)$ is unable to model the no-match case for a pixel $i$.

Moreover, the construction of our image triplet introduces occluded areas, for which the constraint (6) is undefined. In fact, it is only valid in non-occluded object regions. However, in our setting, the locations of the objects in the real image pairs are unknown. In this section, we derive our visibility-aware learning objective. We additionally introduce explicit modelling of occlusion and unmatchable regions into our probabilistic formulation.

Visibility-aware training objective: In general, the PW-bipath constraint (6) is only valid in regions of $I'$ that are visible in both images $I$ and $J$. That is, only in non-occluded object regions, as illustrated in Fig. 3. Applying the loss (7) in non-matching regions, such as in background areas or in occluded object regions (blue area in Fig. 3), bears the risk of confusing the network by enforcing matches in non-matching areas. As a result, we extend the introduced loss (7) by further integrating a visibility mask $V$. The mask takes the value $V(i) = 1$ for any pixel $i$ belonging to the non-occluded common object (roughly the orange area in Fig. 3) and $V(i) = 0$ otherwise. The loss (7) is then extended as,

$L_{\text{PW-bipath}}^{\text{vis}} = -\sum_{i} V(i) \sum_{j} P^W(j \mid i) \, \log \big(\hat{P}_{J \to I} \hat{P}_{I' \to J}\big)_{ji} \,.$    (9)

Since we do not know the true $V$, we aim to find an estimate $\hat{V}$, also visualized in Fig. 3. We consider the predicted probability $\hat{P}_{I' \to J \to I}(W(i) \mid i)$ of a pixel $i$ of $I'$ being mapped to the position $W(i)$ in $I$, according to the known mapping $W$. We assume that this value should be higher in matching regions, i.e. the object, than in non-matching regions, i.e. the background, where the constraint (6) does not hold. We therefore compute our visibility mask by keeping the highest $\kappa$ percent of these values over all of $I'$. The scalar $\kappa$ is a hyperparameter controlling the sensitivity of the mask estimation. While we do not know the actual coverage of the object in the image, which might vary across training images, we found that taking a high estimate for $\kappa$ is sufficient in practice, as it simply removes the obvious non-matching regions. Moreover, while we could instead have computed $\hat{V}$ by thresholding the probabilities, our approach avoids tedious continuous tuning of the threshold during training, which would be necessary to follow the evolution of the predicted probabilities. While valid as it is, the accuracy of the estimate $\hat{V}$ can further be improved through explicit occlusion modelling.
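A possible implementation of this mask estimate, assuming a one-hot $P^W$ and the tensor layout of the earlier sketches ($\kappa$ expressed here as a keep ratio), is:

```python
import torch

def visibility_mask(P_composed, P_W, keep_ratio=0.75):
    """Estimate the visibility mask used in Eq. (9); keep_ratio is an assumed value of kappa.

    P_composed: (B, hw_I, hw_Ip) composed distribution P_{J->I} @ P_{I'->J}
    P_W:        (B, hw_I, hw_Ip) one-hot ground-truth probabilistic mapping
    Returns a float mask of shape (B, hw_Ip) that is 1 for the pixels of I' whose
    predicted probability at the known match W(i) is among the highest keep_ratio fraction.
    """
    # probability assigned to the ground-truth location W(i), for every pixel i of I'
    p_at_gt = (P_W * P_composed).sum(dim=1)                        # (B, hw_Ip)
    n = p_at_gt.shape[1]
    k = max(1, int(keep_ratio * n))
    # threshold = (n - k + 1)-th smallest value, so that the top k values are kept
    thresh = p_at_gt.kthvalue(n - k + 1, dim=1, keepdim=True).values
    return (p_at_gt >= thresh).float()
```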

Figure 3: Triplet of images $(I, I', J)$ used for training, and the estimated visibility mask $\hat{V}$ (yellow indicates $\hat{V} = 1$). The shaded blue region on $I'$ represents object pixels visible in both $I'$ and $I$, but occluded in $J$, for which our PW-bipath loss (7) is not valid. It is only valid in object regions visible in all three images, i.e. the orange shaded region. Explicitly modelling occlusions further helps to identify them.

Occlusion modelling: In order to explicitly model occlusion and non-matching regions in our probabilistic mapping $\hat{P}_{I \to J}$, we predict the probability of a pixel being occluded or unmatched in one image, given that it is visible in the other. This can, for example, be achieved by augmenting the cost volume in (3) with an unmatched bin [superpoint, SarlinDMR20], as $C_{\emptyset i} = \phi$, where $\phi$ is a single learnable parameter. After converting the cost volume into a probabilistic mapping through (4), $\hat{P}_{I \to J}(\emptyset \mid i)$ encodes the probability of pixel $i$ of image $I$ mapping to the unmatched or occluded state $\emptyset$, i.e. having no match in image $J$. We further specify the matching distribution given an unmatched state, to always be mapped to the unmatched state. Specifically, we augment $\hat{P}_{I \to J}$ with a fixed column, forcing the distribution given the unmatched state to be $\hat{P}_{I \to J}(\emptyset \mid \emptyset) = 1$ and $\hat{P}_{I \to J}(j \mid \emptyset) = 0$ for all $j \neq \emptyset$.
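The sketch below illustrates one way of augmenting the cost volume with such a learnable unmatched bin and the fixed unmatched-to-unmatched column; the initialization and temperature values are placeholders, not the values used in our experiments.

```python
import torch

class UnmatchedBin(torch.nn.Module):
    """Augment a cost volume with a learnable unmatched state (a sketch of Sec. 4.3).

    The extra row holds the similarity of every source pixel to the unmatched state;
    the extra fixed column forces the unmatched state to map to itself.
    """
    def __init__(self, init_value=0.0):                       # init_value is a placeholder
        super().__init__()
        self.bin_score = torch.nn.Parameter(torch.tensor(init_value))

    def forward(self, cost, temperature=0.02):                # temperature is assumed
        B, n_trg, n_src = cost.shape
        # append the learnable row before the SoftMax of Eq. (4)
        bin_row = self.bin_score.expand(B, 1, n_src)
        cost = torch.cat([cost, bin_row], dim=1)              # (B, n_trg + 1, n_src)
        P = torch.softmax(cost / temperature, dim=1)
        # append a fixed column: the unmatched state always maps to the unmatched state
        bin_col = torch.zeros(B, n_trg + 1, 1, device=cost.device)
        bin_col[:, -1, :] = 1.0
        return torch.cat([P, bin_col], dim=2)                 # (B, n_trg + 1, n_src + 1)
```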

Occlusion-aware PW-bipath: Our modelling of the unmatched state given the unmatched state, as $\hat{P}(\emptyset \mid \emptyset) = 1$, naturally ensures that the following scheme is respected. If a pixel $i$ in image $I'$ is predicted as unmatched in image $J$, i.e. $\hat{P}_{I' \to J}(\emptyset \mid i)$ is high, it will also be predicted as unmatched in image $I$, i.e. $\hat{P}_{I' \to J \to I}(\emptyset \mid i)$ is high. This prevents enforcing (9) for pixels of image $I'$ which are visible in $I$, but occluded in image $J$ (blue area in Fig. 3). Moreover, predicting a high probability for the occluded state makes it possible to identify occluded and non-matching areas in $I'$. It further ensures that these regions are not selected in $\hat{V}$, and therefore not supervised with (9).

Supervision of the unmatched state: Our introduced objectives (8)-(9) do not impact the unmatched state $\emptyset$. We thus propose an additional loss to supervise it. Particularly, we aim at encouraging background and occluded object regions in images depicting the same object class to be predicted as unmatchable. Nevertheless, since the locations of the objects are unknown during training, we cannot obtain direct supervision. To overcome this, we introduce an image $N$, depicting a different semantic content than the triplet. We then supervise the unmatched state by guiding the mode of the distribution between $I$ and $N$ to be in the unmatched state for all pixels of the images. The corresponding learning objective on the non-matching image pair $(I, N)$ is defined as follows, and illustrated in Fig. 4,

$L_{\text{neg}} = \frac{1}{hw} \sum_{i} \mathrm{BCE}\big(\hat{P}_{I \to N}(\emptyset \mid i), \, 1\big) \,.$    (10)

Here, $\mathrm{BCE}$ denotes the binary cross-entropy, and the target for the predicted unmatched probability is set to 1.
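Assuming the augmented probabilistic mapping of the previous sketch, the loss (10) on a non-matching pair can be written as:

```python
import torch
import torch.nn.functional as F

def pneg_loss(P_neg_with_bin):
    """Sketch of the loss on a non-matching image pair, Eq. (10).

    P_neg_with_bin: (B, n_trg + 1, n_src) probabilistic mapping between the
    non-matching images, whose last row is the unmatched state.
    Every pixel of the source image should be assigned to the unmatched state.
    """
    p_unmatched = P_neg_with_bin[:, -1, :]                      # (B, n_src)
    target = torch.ones_like(p_unmatched)
    return F.binary_cross_entropy(p_unmatched, target)
```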

Figure 4: Learning objective on the non-matching image pair $(I, N)$.

4.4 Final Training Objectives

We now introduce our final weakly-supervised objective, Probabilistic Warp Consistency, as a combination of our previously introduced PW-bipath (9), PWarp-supervision (8) and PNeg (10) objectives. We additionally propose a strongly-supervised approach, benefiting from our losses while also leveraging keypoint annotations.

Weak supervision: In this setup, we assume that only image-level class labels are given, such that each image pair is either positive, i.e. depicting the same object class, or negative, i.e. representing different classes, following [Rocco2018b, DCCNet, MinLPC20]. We obtain our final weakly-supervised objective by combining the PW-bipath (9) and PWarp-supervision (8) losses applied to positive image pairs, with our negative probabilistic objective (10) on negative image pairs,

$L_{\text{weak}} = L_{\text{PW-bipath}}^{\text{vis}} + \lambda_{\text{warp}} \, L_{\text{PWarp-sup}} + \lambda_{\text{neg}} \, L_{\text{neg}} \,.$    (11)

Here, $\lambda_{\text{warp}}$ and $\lambda_{\text{neg}}$ are weighting factors.

Strong supervision: We extend our approach to the strongly-supervised regime, where keypoint match annotations are given for each training image pair. Previous approaches [CATs, CHM, ANCNet] leverage these annotations by training semantic networks with a keypoint objective $L_{\text{kp}}$. Our final strongly-supervised objective is defined as the combination of the keypoint loss with our PW-bipath (9) and PWarp-supervision (8) objectives. Note that we do not include our explicit occlusion modelling, i.e. the unmatched state and its corresponding loss (10) on negative image pairs. This is to ensure fair comparison to previous strongly-supervised approaches, which solely rely on keypoint annotations, and not on the image-level labels required for our loss (10).

$L_{\text{strong}} = L_{\text{kp}} + \mu_{\text{bi}} \, L_{\text{PW-bipath}}^{\text{vis}} + \mu_{\text{warp}} \, L_{\text{PWarp-sup}} \,.$    (12)

Here, $\mu_{\text{bi}}$ and $\mu_{\text{warp}}$ are also weighting factors.

5 Experimental Results

We evaluate our weakly-supervised learning approach for two semantic networks. The benefits brought by the combination of our probabilistic losses with keypoint annotations are also demonstrated for four recent networks. We extensively analyze our method and compare it to previous approaches, setting a new state-of-the-art on multiple challenging datasets.

5.1 Networks and Implementation Details

For weak supervision, we integrate our approach (11) into the baselines SF-Net [SFNet] and NC-Net [Rocco2018b], leading to our weakly-supervised PWarpC-SF-Net and PWarpC-NC-Net respectively. We also apply our strongly-supervised loss (12) to the baselines SF-Net, NC-Net, DHPF [MinLPC20] and CATs [CATs], resulting in PWarpC-SF-Net*, PWarpC-NC-Net*, PWarpC-DHPF and PWarpC-CATs respectively. For fair comparison, we additionally train a strongly-supervised baseline for both SF-Net and NC-Net, referred to as SF-Net* and NC-Net*. Note that for all methods, the strongly-supervised baseline is trained with only the keypoint loss $L_{\text{kp}}$, which is defined as the cross-entropy loss for SF-Net*, NC-Net* and DHPF, and as the End-Point-Error objective after applying soft-argmax [SFNet] for CATs. To convert the predicted probabilistic mapping to point-to-point matches for evaluation, all networks trained with our PWarpC objectives employ the argmax operation, except for PWarpC-CATs, where we adopt the same soft-argmax as in the baseline CATs [CATs]. Additional details on the integration of our objectives for each architecture are provided in the appendix, Sec. A-F. We train all networks on PF-Pascal [PFPascal], using the splits of [SCNet]. The results when training on SPair-71K are further presented in the appendix, Sec. G.1.

Methods | Reso | PF-Pascal PCK@0.05 / 0.10 / 0.15 | PF-Willow PCK@0.05 / 0.10 / 0.15 | SPair-71K PCK@0.05 / 0.10 | TSS PCK@0.05 (FG3DCar / JODS / Pascal / Avg.)
S UCN [ucn] - - 75.1 - - - - - 17.7 - - - -
SCNet [SCNet] - 36.2 72.2 82.0 - - - - - - - - -
HPF [HPF] max 60.1 84.8 92.7 45.9 74.4 85.6 - - 93.6 79.7 57.3 76.9
SCOT [SCOT] max 300 63.1 85.4 92.7 47.8 76.0 87.1 - - 95.3 81.3 57.7 78.1
ANC-Net [ANCNet] - - 86.1 - - - - - 28.7 - - - -
CHM [CHM] 80.1 91.6 - - - - - - - - - -
PMD [PMD] - - 90.7 - - 75.6 - - - - - - -
PMNC [PMNC] - 82.4 90.6 - - - - - 28.8 - - - -
MMNet [MMNet] 77.7 89.1 94.3 - - - - - - - - -
DHPF [MinLPC20] 75.7 90.7 95.0 41.4 67.4 81.8 15.4 27.4 - - - -
CATs [CATs] 67.5 89.1 94.9 37.4 65.8 79.7 10.9 22.4 - - - -
CATs-ft-features [CATs] 75.4 92.6 96.4 40.9 69.5 83.2 13.6 27.0 - - - -
  


CATs [CATs] ori 67.3 88.6 94.6 41.6 68.9 81.9 10.8 22.1 89.5 76.0 58.8 74.8
PWarpC-CATs ori 67.1 88.5 93.8 44.2 71.2 83.5 12.2 23.3 93.2 83.4 70.7 82.4
  


CATs-ft-features [CATs] ori 76.8 92.7 96.5 45.2 73.2 85.2 13.7 26.8 92.1 78.9 64.2 78.4
PWarpC-CATs-ft-features ori 79.8 92.6 96.4 48.1 75.1 86.6 15.4 27.9 95.5 85.0 85.5 88.7
  


DHPF [MinLPC20] ori 77.3 91.7 95.5 44.8 70.6 83.2 15.3 27.5 88.2 71.9 56.6 72.2
PWarpC-DHPF ori 79.1 91.3 96.1 48.5 74.4 85.4 16.4 28.6 89.1 74.1 59.7 74.3


NC-Net* ori 78.6 91.7 95.3 43.0 70.9 83.9 17.3 32.4 92.3 76.9 57.1 75.3
PWarpC-NC-Net* ori 79.2 92.1 95.6 48.0 76.2 86.8 21.5 37.1 97.5 87.8 88.4 91.2
  


SF-Net* ori 78.7 92.9 96.0 43.2 72.5 85.9 13.3 27.9 88.0 75.1 58.4 73.8
PWarpC-SF-Net* ori 78.3 92.2 96.2 47.5 77.7 88.8 17.3 32.5 94.9 83.4 74.3 84.2
U CNNGeo [Rocco2017a] ori 41.0 69.5 80.4 36.9 69.2 77.8 - 18.1 90.1 76.4 56.3 74.4
PARN [Jeon] ori - - - - - - - - 89.5 75.9 71.2 78.8
GLU-Net [GLUNet] ori 42.2 69.1 83.1 30.4 57.7 72.9 - - 93.2 73.3 71.1 79.2
Semantic-GLU-Net [GLUNet, warpc] ori 48.3 72.5 85.1 39.7 67.5 82.1 7.6 16.5 95.3 82.2 78.2 85.2
A2Net [SeoLJHC18] - 42.8 70.8 83.3 36.3 68.8 84.4 - 20.1 - - - -
PMD [PMD] - - 80.5 - - 73.4 - - - - - - -
M SF-Net [SFNet] / ori 53.6 81.9 90.6 46.3 74.0 84.2 - - - - - -
SF-Net [SFNet] ori 59.0 84.0 92.0 46.3 74.0 84.2 11.2 24.0 90.8 78.6 58.0 75.8
W PWarpC-SF-Net ori 65.7 87.6 93.1 47.5 78.3 89.0 17.6 33.5 95.1 84.7 76.8 85.5
  


WeakAlign [Rocco2018a] ori/ ori / - 49.0 75.8 84.0 38.2 71.2 85.8 - 21.1 90.3 76.4 56.5 74.4
RTNs [Kim2018] - 55.2 75.9 85.2 41.3 71.9 86.2 - - 90.1 78.2 63.3 77.2
DCCNet [DCCNet] / ori / - 55.6 82.3 90.5 43.6 73.8 86.5 - 26.7 93.5 82.6 57.6 77.9
SAM-Net [Kim2019] - 60.1 80.2 86.9 - - - - - 96.1 82.2 67.2 81.8
DHPF [MinLPC20] 56.1 82.1 91.1 40.5 70.6 83.8 14.7 28.5 - - - -
DHPF [MinLPC20] ori 61.2 84.1 92.4 45.1 73.6 85.0 14.7 27.8 - - - -
GSF [GSF] - 62.8 84.5 93.7 47.0 75.8 88.9 - 33.5 - - - -
PMD [PMD] - - 81.2 - - 74.7 - - 26.5 - - - -
WarpC-SemGLU-Net [warpc] ori 62.1 81.7 89.7 49.0 75.1 86.9 13.4 23.8 97.1 84.7 79.7 87.2
NC-Net [Rocco2018b] / ori / - 54.3 78.9 86.0 44.0 72.7 85.4 - 26.4 - - - -
  


NC-Net [Rocco2018b] ori 60.5 82.3 87.9 44.0 72.7 85.4 13.9 28.8 94.5 81.4 57.1 77.7
PWarpC-NC-Net ori 64.2 84.4 90.5 45.0 75.9 87.9 18.2 35.3 95.9 88.8 82.9 89.2
Table 1: PCK [%] obtained by different state-of-the-art methods on the PF-Pascal [PFPascal], PF-Willow [PFWillow], SPair-71K [spair] and TSS [Taniai2016] datasets. All approaches are trained on the training set of PF-Pascal, except for [GLUNet]. S denotes strong supervision using keypoint match annotations, M refers to using ground-truth object segmentation masks, U is fully unsupervised, requiring only single images, and W refers to weak supervision with image-level class labels. Each method evaluates with ground-truth annotations resized to a specific resolution. However, using different ground-truth resolutions leads to slightly different results. We therefore use the standard setting of evaluating at the original resolution (ori) and gray the results computed with ground-truth annotations at a different size. When needed, we re-compute the metrics of baselines using the provided pre-trained weights. For each of our PWarpC networks, we compare to its corresponding baseline within the dashed lines. Best and second-best results are in red and blue, respectively.

5.2 Experimental Settings

We evaluate our networks on four standard benchmark datasets for semantic matching, namely PF-Pascal [PFPascal], PF-Willow [PFWillow], SPair-71K [spair] and TSS [Taniai2016]. Results on Caltech-101 [caltech] are further shown in appendix H.6.

Datasets: PF-Pascal, PF-Willow and SPair-71K are keypoint datasets, which respectively contain 1341, 900 and 70958 image pairs from 20, 4 and 18 categories. TSS is the only dataset providing dense flow-field annotations for the foreground object in each pair. It contains 400 image pairs, divided into three groups, FG3DCar, JODS and PASCAL, according to the origins of the images.

Metrics: We adopt the standard metric, the Percentage of Correct Keypoints (PCK). A predicted keypoint is considered correct if it lies within a pixel threshold of $\alpha \cdot \max(h, w)$ of the ground-truth annotation, where $h$ and $w$ are either the dimensions of the source image or the dimensions of the object bounding box in the source image.
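For reference, a minimal NumPy sketch of the PCK computation for a single image pair is given below; whether the reference dimensions come from the image or the bounding box depends on the dataset protocol.

```python
import numpy as np

def pck(pred_kps, gt_kps, ref_size, alpha=0.10):
    """Percentage of Correct Keypoints for one image pair (a minimal sketch).

    pred_kps, gt_kps: (N, 2) arrays of predicted / ground-truth keypoints (x, y).
    ref_size: (h, w) of either the source image or the object bounding box.
    A keypoint is correct if its error is below alpha * max(h, w).
    """
    thresh = alpha * max(ref_size)
    dist = np.linalg.norm(pred_kps - gt_kps, axis=1)
    return float((dist <= thresh).mean())
```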

[NC-Net [Rocco2018b]]  [PWarpC-NC-Net (Ours) ]

[SF-Net [SFNet]]  [PWarpC-SF-Net (Ours) ]

Figure 5: Example predictions of the baselines NC-Net [Rocco2018b] and SF-Net [SFNet], compared to our weakly-supervised PWarpC-NC-Net and PWarpC-SF-Net. Green and red lines denote correct and wrong predictions, respectively, with respect to the ground truth.

5.3 Results

We present results on PF-Pascal, PF-Willow, SPair-71K and TSS in Tab. 1. A few previous approaches compute the PCK metrics after resizing the annotations to a different resolution than the original. Nevertheless, we found that in practice, the annotation resolution can lead to notable variations in results, as evidenced for DHPF or CATs in Tab. 1. For fair comparison, we thus compute the metrics in the standard setting, i.e. at the original image size, and re-compute the PCK in this setting for baseline works when necessary. We also indicate the annotation size used, whenever reported by the authors or provided in their public implementation.

Weak supervision (W): In the bottom part of Tab. 1, we compare approaches trained with weak supervision in the form of image labels. In this setting, our PWarpC networks are trained with the objective (11). While bringing improvements on the PF-Pascal dataset itself, our approach PWarpC-NC-Net most notably achieves much better generalization properties, with large relative and absolute gains compared to the baseline NC-Net on PF-Willow, SPair-71K and TSS. Our PWarpC-NC-Net thus sets a new state-of-the-art on SPair-71K and TSS among weakly-supervised methods trained on PF-Pascal.

Even though it utilizes a lower degree of supervision, our approach PWarpC-SF-Net also significantly outperforms the baseline SF-Net, which is trained with mask supervision (M), on all datasets, with clear relative and absolute gains on PF-Pascal, PF-Willow, SPair-71K and TSS. This makes our PWarpC-SF-Net the new state-of-the-art across all unsupervised (U), weakly-supervised (W) and mask-supervised (M) approaches on PF-Pascal and PF-Willow. Example predictions are shown in Fig. 5.

Strong supervision (S): In the top part of Tab. 1, we evaluate networks trained with strong supervision, in the form of keypoint annotations. Our strongly-supervised PWarpC approaches are trained with our objective (12). For all networks, while the results are on par with the baselines on PF-Pascal, the PWarpC networks show drastically better performance on PF-Willow, SPair-71K and TSS compared to their respective baselines. PWarpC-SF-Net* and PWarpC-NC-Net* thus set a new state-of-the-art on PF-Willow, and on the SPair-71K and TSS datasets, respectively, across all strongly-supervised approaches trained on PF-Pascal. Finally, while most works focus on designing novel semantic matching architectures, we here show that the right training strategy bridges the gap between architectures.

Methods | PF-Pascal PCK@0.05 / 0.10 | PF-Willow PCK@0.05 / 0.10 | SPair-71K PCK@0.10 | TSS (avg.) PCK@0.05
I SF-Net baseline 59.0 84.0 46.3 74.0 24.0 75.8
II PW-bipath (7) 59.1 82.3 44.9 74.3 28.0 83.4
III + visibility mask (9) 61.2 83.7 46.1 75.8 28.5 78.4
IV + PWarp-supervision (8) 63.0 84.9 47.0 76.9 30.7 83.5
V + PNeg (10) (PWarpC-SF-Net) 65.7 87.6 47.5 78.3 33.5 85.5
V PWarpC-SF-Net (Ours) 65.7 87.6 47.5 78.3 33.5 85.5
VI Mapping Warp Consistency [warpc] 64.9 86.1 46.9 76.6 26.6 82.2
VII PWarp-supervision only (8) 52.9 74.3 38.0 66.6 27.9 79.4
VIII Max-score [Rocco2018b] 52.4 76.7 31.2 59.5 24.6 74.8
IX Min-entropy [MinLPC20] 44.7 74.4 25.4 57.8 20.6 69.6
Table 2: Ablation study (top part) and comparison to alternative objectives (bottom part) for PWarpC-SF-Net.

(a) Training with mapping-based Warp Consistency [warpc]

(b) Training with Probabilistic Warp Consistency (Ours)

Figure 6: In (a), SF-Net is trained using the mapping-based Warp Consistency approach [warpc], after converting the cost volume to a mapping through soft-argmax [SFNet]. It predicts ambiguous matching scores, struggling to differentiate between the car wheels. Our probabilistic approach (b) instead directly predicts a Dirac-like distribution, whose mode is correct. Here, we also show that our approach identifies most of the background areas as unmatched.

5.4 Method Analysis

We here perform a comprehensive analysis of our approach in Tab. 2. We adopt SF-Net as the base architecture.

Ablation study: In the top part of Tab. 2, we analyze key components of our approach. The version denoted as (II) is trained using our PW-bipath objective (7), without the visibility mask. Further introducing our visibility mask (9) in (III) significantly boosts the results, since it restricts the supervision to the commonly visible regions. Note that this version (III) already outperforms the baseline SF-Net (I), while using weaker annotation (class labels instead of masks). In (IV), we add our probabilistic warp-supervision (8), leading to a small improvement for all thresholds and on all datasets. From (IV) to (V), we further introduce our explicit occlusion modelling associated with our negative loss (10), which results in drastically better performance. This version corresponds to our final weakly-supervised PWarpC-SF-Net, trained with (11). An example of the regions identified as unmatched by PWarpC-SF-Net is shown in Fig. 6b.

Comparison to other losses: In Tab. 2, bottom part, we first compare our probabilistic approach (11) corresponding to (V) with the mapping-based warp consistency objective [warpc], denoted as (VI). Our approach (V) leads to better performance than warp consistency (VI), with a particularly impressive absolute gain on the challenging SPair-71K dataset. We further illustrate the benefit of our approach on an example in Fig. 6. Moreover, using only the PWarp-supervision loss (8) in (VII) results in much worse performance than our Probabilistic Warp Consistency (V). Finally, we compare our approach (V) to previous losses applied on cost volumes. The versions (VIII) and (IX) are trained with respectively maximizing the max scores [Rocco2018b], and minimizing the cost volume entropy [MinLPC20]. Both approaches lead to poor results, likely caused by the very indirect supervision signal that these objectives provide.

6 Conclusion

We propose Probabilistic Warp Consistency, a weakly-supervised learning objective for semantic matching. We introduce multiple probabilistic losses derived both from a triplet of images generated based on a real image pair, and from a pair of non-matching images. When integrated into four recent semantic networks, our approach sets a new state-of-the-art on four challenging benchmarks.

Limitations: Since our approach acts on cost volumes, which are memory expensive, it is limited to relatively coarse resolution. This might in turn impact its accuracy.

References

Appendix A General implementation details

In this section, we provide implementation details, which apply to all our PWarpC networks.

Creation of the ground-truth probabilistic mapping $P^W$: Here, we describe how we obtain the ground-truth probabilistic mapping $P^W$ from the known mapping $W$. We first rescale the mapping $W$ to the same resolution as the predicted probabilistic mapping. We then convert the mapping into a ground-truth probabilistic mapping $P^W$, following this scheme. For each pixel position $x$ in $I'$, we construct the ground-truth 2D conditional probability distribution by assigning either a one-hot or a smooth representation of $W(x)$. In the latter case, following [ANCNet], we pick the four nearest neighbours of $W(x)$ and set their probability according to distance. We then apply 2D Gaussian smoothing of size 3 on that probability map. We finally vectorize the two spatial dimensions, leading to our final known probabilistic mapping $P^W$. We specify which representation is used, either one-hot or smooth, for each loss and each network.
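A minimal sketch of the one-hot variant is given below; the handling of out-of-view pixels is a simplification, and the smooth (four-neighbour, Gaussian-blurred) variant is omitted.

```python
import torch

def one_hot_prob_mapping(mapping, h, w):
    """Convert a known dense mapping W into a one-hot probabilistic mapping (a sketch).

    mapping: (B, 2, h, w) holding, for every pixel of I', the (x, y) location of its match in I.
    Returns P_W of shape (B, h*w, h*w): column i is a one-hot distribution at the index of W(i).
    Out-of-view matches get an all-zero column (they are excluded by the visibility mask).
    """
    B = mapping.shape[0]
    x = mapping[:, 0].round().long().flatten(1)          # (B, hw)
    y = mapping[:, 1].round().long().flatten(1)
    valid = (x >= 0) & (x < w) & (y >= 0) & (y < h)
    idx = y.clamp(0, h - 1) * w + x.clamp(0, w - 1)      # vectorized target index
    P_W = torch.zeros(B, h * w, h * w)
    src = torch.arange(h * w).expand(B, -1)
    P_W[torch.arange(B).unsqueeze(1), idx, src] = valid.float()
    return P_W
```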

Conversion of $\hat{P}_{I \to J}$ to a correspondence set: The output of the model is a probabilistic mapping, encoding the matching probabilities for all pairwise matches relating an image pair. However, for various applications, such as image alignment or geometric transformation estimation, it is desirable to obtain a set of point-to-point image correspondences between the two images. This can be achieved by performing either a hard or a soft assignment. In the former case, the hard assignment in one direction is done by simply taking the most likely match, i.e. the mode of the distribution,

$\hat{m}(i) = \arg\max_{j} \hat{P}_{I \to J}(j \mid i) \,.$    (13)

In the latter case, the soft assignment corresponds to soft-argmax. It computes correspondences for individual locations $i$ of image $I$, as the expected position in $J$ according to the conditional distribution $\hat{P}_{I \to J}(\cdot \mid i)$,

$\hat{m}(i) = \sum_{j} x_j \, \hat{P}_{I \to J}(j \mid i) \,,$    (14)

where $x_j$ denotes the 2D coordinate of index $j$.
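Both assignments are straightforward to implement; the sketch below assumes the (B, hw_trg, hw_src) layout used in the earlier sketches and row-major vectorization of pixel indices.

```python
import torch

def hard_assignment(P, h, w):
    """Mode of each conditional distribution (Eq. (13)). P: (B, hw_trg, hw_src)."""
    idx = P.argmax(dim=1)                                          # (B, hw_src)
    x = idx % w
    y = torch.div(idx, w, rounding_mode='floor')
    return torch.stack([x, y], dim=-1)                             # (B, hw_src, 2), (x, y)

def soft_assignment(P, h, w):
    """Expected position under each conditional distribution (Eq. (14))."""
    ys, xs = torch.meshgrid(torch.arange(h, dtype=P.dtype),
                            torch.arange(w, dtype=P.dtype), indexing='ij')
    grid = torch.stack([xs.flatten(), ys.flatten()], dim=-1)       # (hw_trg, 2)
    return torch.einsum('bji,jc->bic', P, grid)                    # (B, hw_src, 2)
```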

Training details:

All networks are trained with PyTorch, on a single NVIDIA TITAN RTX GPU with 24 GiB of memory, within 48 hours, depending on the architecture.

Appendix B Triplet creation and sampling of warps

b.1 Triplet creation

Our introduced learning approach requires constructing an image triplet $(I, I', J)$ from an original image pair $(I, J)$, where all three images must have the same training dimensions. We follow a procedure similar to that in [warpc], further described here. The original training image pairs are first resized to a fixed size, larger than the desired training image size. We then sample a dense mapping $W$ of the same dimension, and create $I'$ by warping image $I$ with $W$, as $I' = I \circ W$. Each of the images of the resulting image triplet is then centrally cropped to the fixed training image size. The central cropping is necessary to remove most of the black areas in $I'$, introduced by the warping operation with large sampled mappings, as well as possible warping artifacts arising at the image borders. We then additionally apply appearance transformations to all images of the triplet, such as brightness and contrast changes.
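The warping step itself can be implemented with a single grid_sample call; the sketch below assumes the mapping stores absolute (x, y) coordinates.

```python
import torch
import torch.nn.functional as F

def warp_image(I, mapping):
    """Create I' by warping I with a dense mapping W, i.e. I'(x) = I(W(x)) (a sketch).

    I:       (B, 3, H, W) image tensor.
    mapping: (B, 2, H, W) absolute (x, y) locations in I for every pixel of I'.
    """
    B, _, H, W = I.shape
    # grid_sample expects a sampling grid normalized to [-1, 1]
    grid_x = 2.0 * mapping[:, 0] / (W - 1) - 1.0
    grid_y = 2.0 * mapping[:, 1] / (H - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)          # (B, H, W, 2)
    return F.grid_sample(I, grid, align_corners=True)
```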

b.2 Sampling of warps

A question raised by our proposed loss formulations (8)-(9) is how to sample the synthetic warps $W$. During training, we randomly sample them from a distribution, which we need to design. Here, we also follow a procedure similar to that in [warpc].

In particular, we construct $W$ by sampling homography, thin-plate spline (TPS), or affine-TPS transformations with equal probability. The transformation parameters are then converted to dense mappings of the training dimension. We then optionally apply horizontal flipping to each dense mapping with a given probability.

Specifically, for homographies and TPS, the four image corners and a grid of control points, respectively, are randomly translated in both horizontal and vertical directions, according to a desired sampling scheme. The translated and original points are then used to compute the corresponding homography and TPS parameters. Finally, the transformation parameters are converted to dense mappings. For both transformation types, the magnitudes of the translations are sampled according to a uniform distribution with a given range. Note that for the uniform distribution, the effective sampling interval is half the range on either side of the center, whether centered at zero or, for example, at 1. Importantly, the image point coordinates are first normalized, so the sampling range must remain within the corresponding normalized interval.

For the affine transformations, all parameters, i.e. scale, translation, shearing and rotation angles, are sampled from uniform distributions. The affine scale parameters are sampled from an interval centered at 1, while for all other parameters, the sampling interval is centered at zero.
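As an illustration of the homography case, the sketch below jitters the four image corners and converts the resulting transformation to a dense mapping; the perturbation range and its scaling are assumptions rather than the values used in our experiments.

```python
import numpy as np
import cv2

def sample_homography_mapping(h, w, sigma=0.25):
    """Sample a random homography by jittering the four image corners (a sketch;
    sigma is an assumed perturbation range expressed in normalized coordinates)."""
    corners = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
    # translate each corner by up to sigma/2 of the normalized image extent
    jitter = np.random.uniform(-sigma / 2, sigma / 2, size=(4, 2)).astype(np.float32)
    jitter *= np.float32([w - 1, h - 1]) / 2.0            # convert to pixels
    H_mat = cv2.getPerspectiveTransform(corners, corners + jitter)
    # convert the transformation parameters to a dense mapping of size (h, w)
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)], axis=0)
    warped = H_mat @ pts
    warped = warped[:2] / warped[2:]
    return warped.reshape(2, h, w).astype(np.float32)     # absolute (x, y) coordinates
```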

b.3 List of Hyper-parameters

In summary, to construct our image triplet $(I, I', J)$, the hyper-parameters are the following:

(i) the resizing image size, to which the original images are resized and on which $W$ is applied to obtain $I'$ before cropping;

(ii) the training image size, which corresponds to the size of the training images after cropping;

(iii) the range used for sampling the homography and TPS transformations;

(iv) the range used for sampling the scaling parameter of the affine transformations;

(v) the range used for sampling the translation parameter of the affine transformations;

(vi) the range used for sampling the rotation angle of the affine transformations, which is also used as the shearing angle;

(vii) the range used for sampling the TPS transformations used in the affine-TPS compositions;

(viii) the probability of horizontal flipping.

b.4 Hyper-parameters settings

Geometric transformations: For all our PWarpC networks, the mappings $W$ are created by sampling homographies, TPS and affine-TPS transformations with equal probability. For simplicity, we also use the same sampling range for all three types of transformations, with a uniform sampling scheme. For the affine transformations, all parameters, i.e. scale, translation, shear and rotation angles, are likewise sampled from uniform distributions. We use the same settings when training on either PF-Pascal [PFPascal] or SPair-71K [spair].

Probability of horizontal flipping: When training on PF-Pascal, we use the same flipping probability for all our PWarpC networks, except for PWarpC-NC-Net and PWarpC-NC-Net*, for which we use 30% (see Sec. D.2). When training on SPair-71K, we increase this probability for all our PWarpC networks, except for PWarpC-NC-Net and PWarpC-NC-Net*, for which we keep the same value.

Appearance transformations: For all experiments and networks, we apply the same appearance transformations to the image triplet $(I, I', J)$. Specifically, we convert each image to gray-scale with a probability of 0.2. We then apply color transformations, by adjusting contrast, saturation, brightness, and hue. The color transformations are stronger for the synthetic image $I'$ than for the real images $I$ and $J$. For the synthetic image $I'$, we additionally randomly invert the RGB channels. Finally, on all images of the triplet, we further apply a Gaussian blur with a kernel size between 3 and 7 and a randomly sampled standard deviation, with a probability of 0.2.
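A rough torchvision equivalent of these augmentations is sketched below; the jitter magnitudes are assumed values, and the RGB channel inversion applied to the synthetic image is omitted.

```python
import torchvision.transforms as T

# Sketch of the appearance augmentations described above; stronger color jitter
# is applied to the synthetic image I' than to the real images I and J.
appearance_real = T.Compose([
    T.RandomGrayscale(p=0.2),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),   # assumed magnitudes
    T.RandomApply([T.GaussianBlur(kernel_size=5, sigma=(0.2, 2.0))], p=0.2),
])
appearance_synthetic = T.Compose([
    T.RandomGrayscale(p=0.2),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),    # assumed magnitudes
    T.RandomApply([T.GaussianBlur(kernel_size=5, sigma=(0.2, 2.0))], p=0.2),
])
```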

Appendix C PWarpC-SF-Net and PWarpC-SF-Net*

We first provide details about the SF-Net [SFNet] architecture. We also briefly review the training strategy of the original work. We then extensively explain our training approach and the corresponding implementation details, for both our weakly and strongly-supervised approaches, PWarpC-SF-Net and PWarpC-SF-Net* respectively. Finally, we provide additional method analysis for this architecture.

c.1 Details about SF-Net

Architecture: SF-Net is based on a pre-trained ResNet-101 feature backbone, on which convolutional adaptation layers are added at two levels. The predicted feature maps are then used to construct two cost volumes, at two resolutions. After upsampling the coarsest one to the same resolution, the two cost volumes are combined with a point-wise multiplication. While the resulting cost volume is the actual output of the network, it is converted to a flow field through a kernel soft-argmax operation. Specifically, a fixed Gaussian kernel is applied on the raw cost volume scores to post-process them, before applying SoftMax to convert the cost volume into a probabilistic mapping. From there, the soft-argmax operation converts it to a mapping.

For our PWarpC approaches, we do not use the Gaussian kernel to post-process the predicted matching scores. We simply convert the predicted cost volume into a probabilistic mapping through a SoftMax operation, following eq. (4) of the main paper. Also note that only the adaptation layers are trained.

Training strategy in original work: The original work employs ground-truth foreground object masks as supervision. From single images associated with their segmentation masks, they create image pairs by applying random transformations to both the original images and segmentation masks. Subsequently, they train the network with a combination of multiple losses. In particular, they enforce the forward-backward consistency of the predicted flow, associated with a smoothness objective acting directly on the predicted flow. These losses are further combined with an objective enforcing the consistency of the warped foreground mask of one image with the ground-truth segmentation mask of the other image.

c.2 PWarpC-SF-Net and PWarpC-SF-Net*: our training strategy

Warps sampling: For the weakly-supervised version, we resize the image pairs to a fixed size, sample a dense mapping $W$ of the same dimension and create $I' = I \circ W$. Each image of the resulting triplet is then centrally cropped to the training size.

For the strongly-supervised version, we instead apply the transformations on images already resized to the crop size. This is to avoid cropping out keypoint annotations.

When training on PF-Pascal, we apply horizontal flipping with a fixed probability when sampling the random mappings $W$, and increase this probability when training on SPair-71K.

Weighting and details on the losses: We found it beneficial to define the known probabilistic mapping $P^W$ with a one-hot representation for our PW-bipath loss (9), while using a smooth representation for the PWarp-supervision loss (8) and the keypoint objective in (12). Each representation is described in Sec. A.

For the weakly-supervised version PWarpC-SF-Net, we use fixed weights $\lambda_{\text{warp}}$ and $\lambda_{\text{neg}}$ in (11). For the strongly-supervised version, PWarpC-SF-Net*, we use the same PWarp-supervision weight, and choose the remaining weights such that our probabilistic losses amount to the same overall magnitude as the keypoint loss $L_{\text{kp}}$. Moreover, the keypoint loss is set as the cross-entropy loss, for both PWarpC-SF-Net* and its baseline SF-Net*.

Implementation details: For our weakly-supervised PWarpC-SF-Net, the learnable parameter $\phi$ corresponding to the unmatched state in our occlusion modelling is initialized to a fixed value.

For both the weakly and strongly-supervised approaches, the SoftMax temperature, corresponding to equation (4) of the main paper, is set to the same value as originally used in the baseline for soft-argmax. The hyper-parameter $\kappa$ used in the estimation of our visibility mask (eq. (9) of the main paper) is set to a lower value when training on SPair-71K than when training on PF-Pascal. This is because in SPair-71K, the objects are generally much smaller than in PF-Pascal.

For training, we use similar training parameters as in the baseline SF-Net [SFNet]. We train with a batch size of 16 for a maximum of 100 epochs. The learning rate is halved after 50 epochs. We optionally finetune the networks on SPair-71K for an additional 20 epochs, with a lower initial learning rate that is halved after 10 epochs. The networks are trained using the Adam optimizer [adam] with the weight decay set to zero.

c.3 Additional analysis

Here, we first analyse the effect of the kernel applied in the original SF-Net baseline [SFNet] before converting the predicted cost volume to a probabilistic mapping representation. We also provide the ablation study of our strongly-supervised PWarpC-SF-Net*. Note that the ablation study of the weakly-supervised SF-Net is provided in Tab. 2 of the main paper. Finally, we show the impact of different losses on negative image pairs, i.e. depicting different object classes.

(a) Training with mapping-based Warp Consistency [warpc]

(b) Training with Probabilistic Warp Consistency (Ours)

Figure 7: In (a), SF-Net is trained using the mapping-based Warp Consistency approach [warpc], after converting the cost volume to a mapping through soft-argmax [SFNet]. It predicts ambiguous matching scores, struggling to differentiate between the car wheels. After applying the kernel, the mode of the distribution corresponds to the wrong wheel. Also note that the kernel is extremely important in that case to post-process the multi-hypothesis distribution. Our probabilistic approach (b) instead directly predicts a Dirac-like distribution, whose mode is correct.
Methods | PF-Pascal PCK@0.05 / 0.10 | PF-Willow PCK@0.05 / 0.10 | SPair-71K PCK@0.10 | TSS (avg.) PCK@0.05
SF-Net* 78.7 92.9 43.2 72.5 27.9 73.8
+ Vis-aware PW-bipath (9) 77.1 91.6 47.8 77.9 31.1 80.3
+ PWarp-supervision (8) 78.3 92.2 47.5 77.7 32.5 84.2
Table 3: Ablation study for strongly-supervised PWarpC-SF-Net*. We incrementally add each component.

Effect of the kernel: Baseline SF-Net relies on a kernel soft-argmax strategy to convert the predicted cost volume to a mapping. In particular, the kernel is applied on the cost volume before applying SoftMax (eq. (4)), which converts it to a probabilistic mapping. From there, soft-argmax is used to obtain a mapping. Nevertheless, we observe that this kernel is extremely important in order to post-process the matching scores. This is shown in Fig. 7. In contrast, our Probabilistic Warp Consistency approach, which directly acts on the predicted dense matching scores, produces clean, Dirac-like conditional distributions, without relying on any post-processing operations.

Ablation study for strongly-supervised PWarpC-SF-Net*: In Tab. 3, we analyse key components of our approach PWarpC-SF-Net*. From the strongly-supervised baseline SF-Net*, adding our probabilistic PW-bipath objective leads to a significant improvement on the PF-Willow, SPair-71K and TSS datasets. Further including our PWarp-supervision objective results in additional gains on SPair-71K and TSS.

Comparisons to alternative negative losses: In Tab. 4, we compare combining our PW-bipath and PWarp-supervision objectives on image pairs of the same label with different losses on image pairs showing different object classes, i.e. on negative image pairs. In the version denoted as (IV), we introduce our explicit occlusion modelling (Sec. 4.3 of the main paper), trained with our probabilistic negative loss (10). In (V) and (VI), we instead combine our probabilistic objectives on the positive image pairs, corresponding to (III), with an additional objective minimizing the max scores or the negative entropy of the cost volume, respectively. While this brings a small improvement with respect to version (III), the resulting network performance in (V) and (VI) is far lower than when training with our final combination (11), which corresponds to version (IV).

Methods | PF-Pascal PCK@0.05 / 0.10 | PF-Willow PCK@0.05 / 0.10 | SPair-71K PCK@0.10 | TSS (avg.) PCK@0.05
I SF-Net baseline (soft-argmax) 59.0 84.0 46.3 74.0 24.0 75.8
II SF-Net baseline (argmax) 60.3 81.3 43.7 71.0 26.9 74.1
III Vis-PW-bipath + PWarp-sup 63.0 84.9 47.0 76.9 30.7 83.5
IV (III) + PNeg (10) (PWarpC-SF-Net) 65.6 87.9 47.3 78.2 33.8 84.1
V (III) + Max-score [Rocco2018b] 63.7 81.2 44.6 71.6 31.8 77.3
VI (III) + Min-entropy [MinLPC20] 59.4 76.7 41.8 67.9 28.8 73.2
Table 4: Comparison of different losses applied on negative image pairs, i.e. depicting different object classes, when associated with our introduced PW-bipath and PWarp-supervised losses on positive image pairs. We use SF-Net as baseline network. The evaluation results are computed using the annotations at original resolution.

Appendix D PWarpC-NC-Net and PWarpC-NC-Net*

In this section, we first provide details about the NC-Net architecture. We also briefly review the training strategy of the original work. We then extensively explain our training approach and the corresponding implementation details, for both our weakly and strongly-supervised approaches, PWarpC-NC-Net and PWarpC-NC-Net* respectively. Finally, we extensively ablate our approach for this architecture.

d.1 Details about NC-Net

Architecture: In [Rocco2018b], Rocco et al. introduce a learnable consensus network, applied on the 4D cost volume constructed between a pair of feature maps. Specifically, they process the cost volume with multiple 4D convolutional layers, to establish a strong locality prior on the relationships between the matches. The cost volume before and after applying the 4D convolutions is also processed with a soft mutual nearest neighbor filtering.

Training strategy in original work: The baseline NC-Net is trained with a weakly-supervised strategy, using image-level class labels as the only supervision. Their proposed objective maximizes the mean matching scores over all hard-assigned matches from the predicted cost volume constructed between image pairs of the same class, while minimizing the same quantity for image pairs of different classes. By retraining the NC-Net architecture with this strategy, we nevertheless found the training process to be quite unstable, with multiple training runs leading to substantially different performance.

d.2 PWarpC-NC-Net and PWarpC-NC-Net*: our training strategy

Warps sampling: For the weakly-supervised version, we resize the image pairs to a fixed size, sample a dense mapping $W$ of the same dimension and create $I' = I \circ W$. Each image of the resulting triplet is then centrally cropped to the training size.

For the strongly-supervised version, we apply the transformations on images of the same size as the crop, i.e. . This is to avoid cropping out keypoint annotations.

As for the random mapping , we apply horizontal flipping with a probability of 30%. Compared to the other networks, we found that increasing the probability of horizontal flipping is beneficial for our PWarpC-NC-Net and PWarpC-NC-Net* networks, as it helps stabilize the learning.

Weighting and details on the losses: For all losses, we use a smooth representation for the known probabilistic mapping (see Sec. A).

In general, we found the PWarp-supervision objective (8) to be slightly harmful for the PWarpC-NC-Net networks, and therefore did not include it in our final weakly and strongly-supervised formulations. This is particularly the case when finetuning the features, which is the setting we used for our final PWarpC-NC-Net and PWarpC-NC-Net*. This is likely due to the network ’overfitting’ to the synthetic image pairs and transformations involved in the PWarp-supervision loss, at the expense of the real images considered in the PW-bipath (9) and PNeg (10) objectives.

As a result, for the weakly-supervised version PWarpC-NC-Net, the weights in (11) are set to and . For the strongly-supervised version, PWarpC-NC-Net*, we use the same weight . We additionally set , which ensures that our probabilistic losses account for the same weight as the keypoint loss . Moreover, the keypoint loss is set to the cross-entropy loss, for both PWarpC-NC-Net* and its baseline NC-Net*.

     Methods                                  | PF-Pascal    | PF-Willow    | SPair-71K | TSS
I    NC-Net baseline (Max-score) [Rocco2018b] | 60.5   82.3  | 44.0   72.7  | 28.8      | 77.7
II   PW-bipath                                | diverged
III  + Visibility mask                        | 64.7   83.8  | 45.4   75.9  | 32.8      | 82.7
IV   + PWarp-supervision                      | 61.7   79.2  | 45.1   73.8  | 35.6      | 85.4
----------------------------------------------------------------------------------------------
III  PW-bipath + Visibility mask              | 64.7   83.8  | 45.4   75.9  | 32.8      | 82.7
V    + PNeg                                   | 62.0   82.2  | 45.4   76.2  | 33.2      | 87.9
VI   + ft features (PWarpC-NC-Net)            | 64.2   84.4  | 45.0   75.9  | 35.3      | 89.2
----------------------------------------------------------------------------------------------
VII  ft features from scratch                 | 63.7   82.9  | 44.9   76.1  | 35.7      | 87.4
VI   PWarpC-NC-Net                            | 64.2   84.4  | 45.0   75.9  | 35.3      | 89.2
I    Max-score (NC-Net baseline)              | 60.5   82.3  | 44.0   72.7  | 28.8      | 77.7
VIII Min-entropy [MinLPC20]                   | 55.6   79.2  | 42.0   72.3  | 25.4      | 78.4
IX   Warp Consistency [warpc]                 | 59.1   75.0  | 44.6   70.1  | 35.0      | 87.0
III  PW-bipath + Visibility mask              | 64.7   83.8  | 45.4   75.9  | 32.8      | 82.7
V    (III) + PNeg (Ours)                      | 62.0   82.2  | 45.4   76.2  | 33.2      | 87.9
X    (III) + Max-score                        | 62.9   82.1  | 45.4   74.2  | 31.3      | 79.0
XI   (III) + Min-entropy                      | 60.8   78.5  | 44.8   71.4  | 31.5      | 78.6
Table 5: In the top part, we conduct an ablation study for PWarpC-NC-Net. There, we incrementally add each component. In the middle part, we then compare our Probabilistic Warp Consistency objective to alternative weakly-supervised losses. In the bottom part, we compare the impact of combining different losses on non-matching pairs with our PW-bipath objective, applied on image pairs of the same class. We measure the PCK on the PF-Pascal [PFPascal], PF-Willow [PFWillow], SPair-71K [spair] and TSS [Taniai2016] datasets. The evaluation results are computed using ground-truth annotations at original resolution.

Implementation details: For PWarpC-NC-Net, we set the initial value of the learnable parameter corresponding to the unmatched state in our occlusion modeling to . This ensures that it lies in the same range as the cost volume at initialization.
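For illustration, the sketch below shows one plausible way the learnable unmatched state can be appended to the cost volume before the SoftMax, together with the probabilistic negative loss that pushes all pixels of a different-class pair towards this state; the tensor layout, the position of the unmatched bin, the initial value and the temperature are assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical learnable scalar for the unmatched state; the actual initial
# value and temperature are specified in the paper, not here.
phi = torch.nn.Parameter(torch.tensor(0.5))

def probabilistic_mapping_with_unmatched(cost, phi, temperature=0.02):
    """Append the unmatched bin and convert scores to a distribution.

    cost: (B, Ht*Wt, Hs, Ws) matching scores of every target location for
    each source pixel. Returns a distribution over Ht*Wt + 1 states, where
    the last state is 'unmatched'.
    """
    b, n, h, w = cost.shape
    unmatched = phi.expand(b, 1, h, w)
    scores = torch.cat([cost, unmatched], dim=1)
    return F.softmax(scores / temperature, dim=1)

def probabilistic_negative_loss(prob):
    """Cross-entropy pushing all pixels of a different-class pair to the
    unmatched state (the last bin)."""
    return -torch.log(prob[:, -1] + 1e-8).mean()
```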

The SoftMax temperature, corresponding to equation (4) of the main paper, is set to , the same as originally used in the baseline loss. The hyper-parameter used in the estimation of our visibility mask (eq. 9 of the main paper) is set to . For NC-Net, we found that using a more restrictive threshold than for the other networks, which use (when training on PF-Pascal), is beneficial to stabilize the training. It offers a better guarantee that the PW-bipath loss (9) is only applied in object regions commonly visible across the triplet.

Similarly to the baseline NC-Net [Rocco2018b], we train in two stages. In the first stage, we only train the neighborhood consensus network while keeping the ResNet-101 feature backbone fixed. In a second stage, we further finetune the last layer of the feature backbone together with the neighborhood consensus network. These two stages are used to train our final PWarpC-NC-Net and PWarpC-NC-Net* approaches, as well as the strongly-supervised baseline NC-Net*, on PF-Pascal [PFPascal].

For training, we use similar training parameters as in the baseline NC-Net. We train with a batch size of 16, which is reduced to 8 when the last layer of the feature backbone is finetuned. During the first training stage on PF-Pascal, we train for a maximum of 30 epochs with a constant learning rate of . During the second training stage on PF-Pascal, the learning rate is reduced to and the network is trained for an additional 30 epochs.

We optionally further finetune the networks on SPair-71K [spair] for 10 epochs, with the same learning rate of . Note that in this setting, the last layer of the feature backbone is also finetuned. The networks are trained using the Adam optimizer [adam] with the weight decay set to zero.
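For clarity, the sketch below outlines this two-stage schedule with a PyTorch optimizer; the module names (feature_backbone, layer4, consensus_network) are hypothetical placeholders for the actual NC-Net modules, and the learning rates are passed in by the caller.

```python
import torch

def build_stage_optimizer(model, stage, lr):
    """Two-stage schedule: stage 1 trains only the consensus module with a
    frozen backbone; stage 2 additionally finetunes the last backbone layer.

    'feature_backbone', 'layer4' and 'consensus_network' are hypothetical
    attribute names standing in for the actual modules.
    """
    for p in model.feature_backbone.parameters():
        p.requires_grad = False
    params = [{"params": model.consensus_network.parameters()}]
    if stage == 2:
        for p in model.feature_backbone.layer4.parameters():
            p.requires_grad = True
        params.append({"params": model.feature_backbone.layer4.parameters()})
    # Adam with zero weight decay, as stated in the text.
    return torch.optim.Adam(params, lr=lr, weight_decay=0.0)
```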

D.3 PWarpC-NC-Net: ablation study and comparison to previous works

Similarly to Sec. 5.4 of the main paper for PWarpC-SF-Net, we here provide a detailed analysis of our weakly-supervised approach PWarpC-NC-Net.

Ablation study: In the top part of Tab. 5, we analyze key components of our weakly-supervised approach. The version denoted as (II) is trained using our PW-bipath objective (7), without the visibility mask. NC-Net trained with this loss diverged. With the NC-Net architecture, we found it crucial to extend our loss with our visibility mask (9), resulting in version (III). We believe applying our PW-bipath loss on all pixels (II) confuses the NC-Net network, by enforcing matching even in e.g. non-matching background regions. Note that version (III) trained with our visibility aware PW-bipath objective (9) already outperforms the baseline (I) on all datasets and for all thresholds. Further adding the PWarp-supervision loss (8) in (IV) leads to worse results than (III) on the PF-Pascal and PF-Willow datasets, despite bringing an improvement on SPair-71K and TSS. To obtain a final network achieving competitive results on all four datasets, we therefore do not include the PWarp-supervision objective (8) in our final formulation.

From (III), including our occlusion modeling, i.e. the unmatched state and its corresponding probabilistic negative loss (10) in (V) leads to notable gains on the PF-Willow, SPair-71K and TSS datasets. In (VI), we further finetune the last layer of the feature backbone with the neighborhood consensus network in a second training stage. It leads to substantial improvements on all datasets, except for PF-Willow, where results remain almost unchanged.

From (VI) to (VII), we compare finetuning the feature backbone in a second training stage (VI), or directly in a single training stage (VII). The former leads to better performance on the PF-Pascal dataset. As a result, version (VI) corresponds to our final weakly-supervised PWarpC-NC-Net, trained with two stages on PF-Pascal.

Comparison to other losses: In the middle part of Tab. 5, we compare our Probabilistic Warp Consistency approach to previous weakly-supervised alternatives. The baseline NC-Net, corresponding to version (I), is trained by maximizing the max scores of the cost volumes predicted for matching images. It leads to significantly worse results than our approach (VI) on all datasets and thresholds. The same conclusion applies to version (VIII), trained by minimizing the cost volume entropy for matching images. Finally, we compare our probabilistic approach (VI) to the mapping-based Warp Consistency method, corresponding to (IX). While Warp Consistency (IX) achieves good performance on the SPair-71K and TSS datasets, it leads to poor results on the PF-Pascal and PF-Willow datasets.

Comparison of objectives on negative image pairs: Finally, in the bottom part of Tab. 5, we compare multiple alternative losses applied on image pairs depicting different object classes. In particular, we combine our visibility-aware PW-bipath loss (III) with either our probabilistic negative loss (10), the minimization of the maximum scores [Rocco2018b], or the maximization of the cost volume entropy [MinLPC20], corresponding to (V), (X) and (XI) respectively. Our probabilistic negative loss (10) leads to significantly better results on the PF-Willow, SPair-71K and TSS datasets. We believe this is because it explicitly models occlusions and unmatched regions through our extended probabilistic formulation, which includes the unmatched state.

Appendix E PWarpC-CATs

In this section, we first briefly review the CATs architecture and the original training strategy. We then provide details about the integration of our probabilistic approach into this architecture. Finally, we analyse the key components of our resulting strongly-supervised networks PWarpC-CATs and PWarpC-CATs-ft-features.

E.1 Details about CATs

Architecture: CATs [CATs] finds matches which are globally consistent by leveraging a Transformer architecture applied to slices of correlation maps constructed from multi-level features. The Transformer module alternates self-attention layers across points of the same correlation map, with inter-correlation self-attention across multi-level dimensions.

Training strategy in original work: While the final output of the CATs architecture is a cost volume, the latter is converted to a dense mapping by first transforming it into a probabilistic mapping with SoftMax, and then applying soft-argmax. The network is then trained with the End-Point Error (EPE) objective, leveraging the keypoint match annotations.
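A minimal sketch of this SoftMax + soft-argmax conversion is shown below; the tensor layout, the temperature value and the assumption that source and target grids share the same resolution are ours, not taken from the CATs implementation.

```python
import torch
import torch.nn.functional as F

def soft_argmax_mapping(cost, temperature=0.02):
    """Convert a cost volume to a dense mapping via SoftMax + soft-argmax.

    cost: (B, Ht*Wt, Hs, Ws) scores of every target location for each
    source pixel. Returns (B, 2, Hs, Ws), the expected (x, y) target
    coordinate of each source pixel under the predicted distribution.
    """
    b, n, h, w = cost.shape
    ht, wt = h, w  # assume the target grid has the same resolution as the source
    prob = F.softmax(cost / temperature, dim=1)  # probabilistic mapping
    ys, xs = torch.meshgrid(
        torch.arange(ht, dtype=cost.dtype, device=cost.device),
        torch.arange(wt, dtype=cost.dtype, device=cost.device),
        indexing="ij",
    )
    grid = torch.stack([xs.reshape(-1), ys.reshape(-1)], dim=0)  # (2, Ht*Wt)
    # Expectation of the target coordinates under the predicted distribution.
    return torch.einsum("kn,bnhw->bkhw", grid, prob)
```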

E.2 PWarpC-CATs: our training strategy

Warps sampling: We apply the transformations on images with dimensions . We do not apply a further central crop, in order to avoid cropping out keypoint annotations.

When training on PF-Pascal, we apply of horizontal flipping to sample the random mappings , while it is increased to when training on SPair-71K.

Weighting and details on the losses: We define the known probabilistic mapping with a one-hot representation for our PW-bipath and PWarp-supervision losses (9)-(8) (see Sec. A).

To obtain PWarpC-CATs, we set the weights in (12) as and , which ensures that our probabilistic losses account for the same weight as the keypoint loss .

To obtain PWarpC-CATs-ft-features, where the ResNet-101 feature backbone is additionally finetuned, we found the PWarp-supervision objective (8) to be slightly harmful, and therefore did not include it in this case. This is consistent with the findings for PWarpC-NC-Net and PWarpC-NC-Net*, for which the PWarp-supervision objective was also found harmful when the feature backbone is finetuned. This is likely due to the network 'overfitting' to the synthetic image pairs and transformations involved in the PWarp-supervision loss, at the expense of the real images considered in the PW-bipath objective (9). As a result, for the PWarpC-CATs-ft-features version, we set the weights in (12) as and .

Moreover, to be consistent with the baseline CATs, the keypoint loss is set to the End-Point Error (EPE) loss, after converting the probabilistic mapping to a mapping through soft-argmax.

Implementation details: The SoftMax temperature, corresponding to equation (4) of the main paper, is set to , the same as originally used in the baseline. The hyper-parameter used in the estimation of our visibility mask (eq. 9 of the main paper) is set to and to when training on PF-Pascal or SPair-71K respectively. This is because in SPair-71K, the objects are generally much smaller than in PF-Pascal.

For training, we use similar training parameters as in the baseline CATs. We train with a batch size of 16 when the feature backbone is frozen, and reduce it to 7 when finetuning the backbone. The initial learning rate is set to for the feature backbone, and to for the rest of the architecture. It is halved after 80, 100 and 120 epochs, and we train for a maximum of 150 epochs. We use the same training parameters when training on either PF-Pascal or SPair-71K. The networks are trained using the AdamW optimizer [AdamW] with the weight decay set to .

Methods                                          | PF-Pascal    | PF-Willow    | SPair-71K | TSS
CATs baseline (EPE)                              | 67.3   88.6  | 41.6   68.9  | 22.1      | 74.8
+ Vis-aware-PW-bipath                            | 68.1   88.5  | 44.0   70.6  | 21.4      | 76.3
+ PWarp-supervision (PWarpC-CATs)                | 67.1   88.5  | 44.2   71.2  | 23.3      | 82.4
------------------------------------------------------------------------------------------------
CATs-ft-features (EPE)                           | 79.8   92.7  | 45.2   73.2  | 26.8      | 78.4
+ Vis-aware-PW-bipath (PWarpC-CATs-ft-features)  | 79.8   92.6  | 48.1   75.1  | 27.9      | 88.7
+ PWarp-supervision                              | 79.6   92.4  | 46.7   74.4  | 26.0      | 88.7
Table 6: Ablation study for PWarpC-CATs and PWarpC-CATs-ft-features. We incrementally add each component. We measure the PCK on the PF-Pascal [PFPascal], PF-Willow [PFWillow], SPair-71K [spair] and TSS [Taniai2016] datasets. The evaluation results are computed using ground-truth annotations at original resolution.

E.3 Ablation study

In Tab. 6, we analyse the key components of our strongly-supervised approaches PWarpC-CATs (top part) and PWarpC-CATs-ft-features (bottom part). From the CATs baseline, which is trained with the End-Point Error (EPE) objective while keeping the feature backbone frozen, adding our visibility-aware PW-bipath loss (9) leads to a substantial gain on the PF-Willow and TSS datasets. Further including our PWarp-supervision objective results in improved performance on PF-Willow, SPair-71K and TSS. For the versions where the feature backbone is finetuned (bottom part of Tab. 6), our visibility-aware PW-bipath objective brings major gains on PF-Willow, SPair-71K and TSS. However, further adding the PWarp-supervision leads to a small drop in performance on all datasets. For this reason, we use the combination of the EPE loss with our visibility-aware PW-bipath objective to train our final PWarpC-CATs-ft-features.

Appendix F PWarpC-DHPF

As in previous sections, we first review the DHPF [MinLPC20] architecture and its original training strategy. We then provide training details for our strongly-supervised PWarpC-DHPF. Finally, we provide an ablation study for our approach applied to this architecture.

F.1 Details about DHPF

Architecture: DHPF learns to compose hypercolumn features, i.e. aggregations of different layers, on the fly by selecting a small number of relevant layers from a deep convolutional neural network. In particular, it proposes a gating mechanism to choose which layers to include in the hypercolumn. The hypercolumn features are then correlated, leading to the final output cost volume.

Training strategy in original work: The original work proposes both a weakly- and a strongly-supervised approach. The weakly-supervised approach is trained by minimizing the cost volume entropy computed between image pairs depicting the same class, while maximizing it for pairs depicting different semantic content.

The strongly-supervised approach is instead trained with the cross-entropy loss, after converting the keypoint match annotations to probability distributions. In both cases, the authors also include a layer selection loss. It is a soft constraint to encourage the network to select each layer of the feature backbone at a certain rate.
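For illustration, a minimal sketch of the weakly-supervised entropy objective described above is given below; the tensor layout and the temperature value are assumptions, not the exact DHPF implementation.

```python
import torch
import torch.nn.functional as F

def matching_entropy(cost, temperature=0.05):
    """Mean entropy of the per-pixel matching distributions of a cost volume.

    cost: (B, Ht*Wt, Hs, Ws). Lower entropy means more confident matches.
    """
    prob = F.softmax(cost / temperature, dim=1)
    ent = -(prob * torch.log(prob + 1e-8)).sum(dim=1)  # (B, Hs, Ws)
    return ent.mean()

def entropy_weak_loss(cost_pos, cost_neg):
    """Minimize entropy for same-class pairs, maximize it for different-class pairs."""
    return matching_entropy(cost_pos) - matching_entropy(cost_neg)
```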

F.2 PWarpC-DHPF: our training strategy

Warps sampling: We apply the transformations on images with dimensions . Similarly to PWarpC-CATs, we do not apply a further central crop, in order to avoid cropping out keypoint annotations.

When training on PF-Pascal, we apply of horizontal flipping to sample the random mappings , while it is increased to when training on SPair-71K.

Weighting and details on the losses: We define the known probabilistic mapping with a smooth representation for our PW-bipath and PWarp-supervision losses (9)-(8) (see Sec. A).

To obtain PWarpC-DHPF, we set the weights in (12) as and , which ensures that our probabilistic losses account for the same weight as the keypoint loss .

Moreover, the strongly-supervised baseline DHPF is trained with a keypoint loss corresponding to the cross-entropy with the ground-truth keypoint matches converted to one-hot probabilistic mapping representations. We nevertheless found that the baseline is slightly improved when the ground-truth keypoint matches are instead converted to smooth probability distributions. We denote this version as DHPF* and compare it to our final PWarpC-DHPF in Tab. 7. As a result, for our PWarpC-DHPF, we set the keypoint loss in (12) to the cross-entropy with a smooth representation of the ground-truth keypoint match distributions. Finally, for a fair comparison, we add the layer selection loss used in the baseline DHPF to our strongly-supervised loss (12).
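To illustrate the difference between the one-hot and smooth representations, the sketch below converts a ground-truth keypoint match into a Gaussian-smoothed distribution over the target grid and supervises it with cross-entropy; the Gaussian kernel and the value of sigma are assumptions, not the exact choice made in the paper.

```python
import torch
import torch.nn.functional as F

def smooth_keypoint_distribution(kp_xy, ht, wt, sigma=1.0):
    """Gaussian-smoothed target distribution over the Ht x Wt grid.

    kp_xy: (2,) ground-truth target coordinate (x, y) in grid units.
    Returns a (Ht*Wt,) probability vector centred on the keypoint,
    instead of a one-hot vector.
    """
    ys, xs = torch.meshgrid(
        torch.arange(ht, dtype=torch.float32),
        torch.arange(wt, dtype=torch.float32),
        indexing="ij",
    )
    d2 = (xs - kp_xy[0]) ** 2 + (ys - kp_xy[1]) ** 2
    logits = -d2 / (2 * sigma ** 2)
    return F.softmax(logits.reshape(-1), dim=0)

def keypoint_cross_entropy(pred_scores, target_dist):
    """Cross-entropy between the predicted matching scores of one source
    keypoint (Ht*Wt,) and the smooth ground-truth distribution."""
    log_prob = F.log_softmax(pred_scores, dim=0)
    return -(target_dist * log_prob).sum()
```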

Implementation details: The SoftMax temperature, corresponding to equation (4) of the main paper, is set to , as in the baseline loss. Note that following the baseline DHPF, we apply Gaussian normalization on the cost volume before applying the SoftMax operation (4) to convert it to a probabilistic mapping. The hyper-parameter used in the estimation of our visibility mask (eq. 9 of the main paper) is set to and to when training on PF-Pascal or SPair-71K respectively. This is because in SPair-71K, the objects are generally much smaller than in PF-Pascal.
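One possible reading of this Gaussian normalization is sketched below, standardizing the scores of each source pixel before the SoftMax; the normalization axis and the temperature value are assumptions on our side.

```python
import torch
import torch.nn.functional as F

def gaussian_normalized_softmax(cost, temperature=0.05, eps=1e-6):
    """Standardize the matching scores of each source pixel before SoftMax.

    cost: (B, Ht*Wt, Hs, Ws). Each per-pixel score vector is centred and
    scaled to unit variance, then converted to a probabilistic mapping.
    """
    mean = cost.mean(dim=1, keepdim=True)
    std = cost.std(dim=1, keepdim=True)
    normalized = (cost - mean) / (std + eps)
    return F.softmax(normalized / temperature, dim=1)
```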

For training, we use similar training parameters as in the baseline DHPF. We train on PF-Pascal with a batch size of 6 for a maximum of 100 epochs. The initial learning rate is set to and halved after 50 epochs. We optionally further finetune the network on SPair-71K, for an additional 10 epochs with a constant learning rate of . The networks are trained using the SGD optimizer [ruder2016overview].

Methods                                  | PF-Pascal    | PF-Willow    | SPair-71K | TSS
DHPF baseline (CE with one-hot)          | 77.3   91.7  | 44.8   70.6  | 27.5      | 72.2
DHPF* (CE with smooth)                   | 78.1   90.7  | 44.7   70.1  | 27.9      | 74.02
+ Vis-aware-PW-bipath                    | 76.3   90.7  | 47.3   73.6  | 28.0      | 73.7
+ PWarp-supervision (PWarpC-DHPF)        | 77.7   91.7  | 47.7   74.3  | 28.6      | 74.3
Table 7: Ablation study for PWarpC-DHPF. We incrementally add each component. We measure the PCK on the PF-Pascal [PFPascal], PF-Willow [PFWillow], SPair-71K [spair] and TSS [Taniai2016] datasets. The evaluation results are computed using ground-truth annotations at original resolution.

F.3 Ablation study

In Tab. 7, we conduct ablative experiments on PWarpC-DHPF. Training with the cross-entropy loss using a smooth representation of the ground-truth in DHPF* leads to slightly better results than DHPF on PF-Pascal and SPair-71K. For this reason, we use it as baseline. Further including our visibility-aware PW-bipath loss and PWarp-supervision leads to incremental gains on PF-Willow and SPair-71K.

Appendix G Analysis of transformations W

In this section, we analyse the impact of the sampled transformations’ strength on the performance of the corresponding trained PWarpC networks. As explained in Sec. B, the strength of the warps is mostly controlled by the range , used to sample the base homography, TPS and Affine-TPS transformations. The probability of horizontal flipping also has a large impact. We thus analyse the effect of the sampling range and the probability of horizontal flipping on the evaluation results of the corresponding PWarpC networks. In particular, we provide the analysis for our weakly-supervised PWarpC-SF-Net. The trend is the same for the other PWarpC networks.

While we choose a specific distribution to sample the transformation parameters used to construct the mapping , our experiments show that the performance of networks trained with our proposed Probabilistic Warp Consistency loss is relatively insensitive to the strength of the transformations , as long as they remain within reasonable bounds. We present these experiments in Fig. 8.

In Fig. 8 (A), we analyse the impact of the sampling range on the performance of PWarpC-SF-Net. Any range within leads to similar performance, for and for . Only for on PF-Pascal, increasing the range up to leads to better results, with a drop for . We select in our final setting.

We then look at the impact of the probability of horizontal flipping in Fig. 8 (B). On PF-Pascal, increasing the probability of flipping up to leads to an increase in performance. Increasing it further nevertheless results in a gradual drop in performance. The trend is the same on SPair-71K, except that the best results are achieved for . We therefore set for our final PWarpC networks.
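As an illustration of how the horizontal flips enter the sampling of the random warps, the sketch below returns either an identity or a horizontal-flip mapping with probability p_flip, which would then be composed with the sampled homography/TPS warp; the grid convention and the default probability (30%, the value stated for PWarpC-NC-Net) are assumptions for other networks.

```python
import random
import torch
import torch.nn.functional as F

def sample_flip_mapping(height, width, p_flip=0.3):
    """With probability p_flip, return a horizontal-flip mapping; otherwise
    the identity mapping. The returned grid (H, W, 2) follows the [-1, 1]
    convention of F.grid_sample and would be composed with the randomly
    sampled homography/TPS transformation.
    """
    theta = torch.tensor([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]])
    if random.random() < p_flip:
        theta[0, 0] = -1.0  # mirror the x axis
    grid = F.affine_grid(theta.unsqueeze(0), size=(1, 1, height, width), align_corners=False)
    return grid[0]  # (H, W, 2)
```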

(A) Impact of the sampling range; (B) Impact of the probability of horizontal flipping. Each panel shows results on PF-Pascal and SPair-71K.

Figure 8: Impact of the strength of the transformations on the performance of the weakly-supervised PWarpC-SF-Net network. We report the PCK for thresholds in , obtained on the PF-Pascal [PFPascal] and SPair-71K [spair] datasets, for different sampling ranges and probabilities of horizontal flipping used to create the synthetic transformations during training.
   Methods                                        | Reso    | PF-Pascal (PCK @)   | PF-Willow (PCK @)   | SPair-71K (PCK @) | TSS (PCK @): FG3DCar  JODS  Pascal  Avg.
S  HPF [HPF]                                      | max     | -     -     -       | -     -     -       | -     28.2        | -     -     -     -
   SCOT [SCOT]                                    | max 300 | -     -     -       | -     -     -       | -     35.6        | -     -     -     -
   CHM [CHM]                                      | -       | -     -     -       | -     -     -       | -     46.3        | -     -     -     -
   PMD [PMD]                                      | -       | -     -     -       | -     -     -       | -     37.4        | -     -     -     -
   PMNC [PMNC]                                    | -       | -     -     -       | -     -     -       | -     50.4        | -     -     -     -
   MMNet [MMNet]                                  | -       | -     -     -       | -     -     -       | -     40.9        | -     -     -     -
   DHPF [MinLPC20]                                | -       | 52.6  75.4  84.8    | 37.4  63.9  77.0    | 20.7  37.3        | -     -     -     -
   CATs [CATs]                                    | -       | 45.3  67.7  77.0    | 31.8  56.8  69.1    | 21.9  42.4        | -     -     -     -
   CATs-ft-features [CATs]                        | -       | 54.4  74.1  81.9    | 39.7  66.3  78.3    | 27.9  49.9        | -     -     -     -
   ----------------------------------------------------------------------------------------------------------------------------------------------
   CATs-ft-features [CATs]                        | ori     | 57.7  75.2  82.9    | 43.5  69.1  80.8    | 27.1  48.8        | 88.9  73.9  57.1  73.3
   PWarpC-CATs-ft-features                        | ori     | 58.8  77.4  84.6    | 46.4  73.6  85.0    | 28.2  48.4        | 91.1  85.8  69.1  82.0
   ----------------------------------------------------------------------------------------------------------------------------------------------
   DHPF [MinLPC20]                                | ori     | 56.9  77.2  86.3    | 40.9  66.8  79.9    | 20.6  36.3        | 83.8  69.7  57.3  70.3
   PWarpC-DHPF                                    | ori     | 65.8  85.5  92.3    | 47.6  72.9  84.5    | 23.3  38.7        | 87.5  73.7  60.3  73.8
   ----------------------------------------------------------------------------------------------------------------------------------------------
   NC-Net*                                        | ori     | 59.8  75.6  82.1    | 38.9  62.6  74.7    | 29.1  50.7        | 81.1  66.7  45.4  64.4
   PWarpC-NC-Net*                                 | ori     | 67.8  82.3  86.9    | 46.1  72.6  82.7    | 31.6  52.0        | 93.0  84.6  70.6  82.7
   ----------------------------------------------------------------------------------------------------------------------------------------------
   SF-Net*                                        | ori     | 66.5  85.0  90.8    | 43.5  70.4  82.9    | 26.2  50.0        | 88.3  75.3  57.2  73.6
   PWarpC-SF-Net*                                 | ori     | 72.1  89.6  93.5    | 46.3  75.2  87.0    | 27.0  48.8        | 92.5  81.1  66.2  79.9
U  CNNGeo [Rocco2017a] (results from [spair])     | -       | -     -     -       | -     -     -       | -     20.6        | -     -     -     -
   A2Net [SeoLJHC18] (results from [spair])       | -       | -     -     -       | -     -     -       | -     22.3        | -     -     -     -
M  SF-Net [SFNet] (results from [PMNC])           | -       | -     -     -       | -     -     -       | -     26.3        | -     -     -     -
W  PWarpC-SF-Net                                  | ori     | 64.5  86.9  92.6    | 47.1  78.1  89.9    | 18.6  37.1        | 91.0  81.6  67.4  80.0
   ----------------------------------------------------------------------------------------------------------------------------------------------
   WeakAlign [Rocco2018a] (results from [spair])  | -       | -     -     -       | -     -     -       | -     20.9        |
   DHPF [MinLPC20]                                | -       | 46.1  78.1  88.4    | 34.9  66.2  82.5    | 12.4  27.7        | -     -     -     -
   DHPF [MinLPC20]                                | ori     | 53.3  81.3  90.3    | 40.9  70.1  84.6    | 12.7  27.2        |
   PMD [PMD]                                      | -       | -     -     -       | -     -     -       | -     26.5        | -     -     -     -
   WarpC-SemGLU-Net [warpc]                       | ori     | 57.0  78.7  88.7    | 46.1  72.8  84.9    | 12.8  23.5        | 96.3  84.2  80.2  86.9
   ----------------------------------------------------------------------------------------------------------------------------------------------
   NC-Net [Rocco2018b] (results from [spair])     | -       | -     -     -       | -     -     -       | -     20.1        | -     -     -     -
   PWarpC-NC-Net                                  | ori     | 61.7  82.6  88.5    | 43.6  74.6  86.9    | 18.5  38.0        | 95.4  88.9  85.6  90.0
Table 8: PCK [%] obtained by different state-of-the-art methods on the PF-Pascal [PFPascal], PF-Willow [PFWillow], SPair-71K [spair] and TSS [Taniai2016] datasets. All approaches are trained or finetuned on the training set of Spair-71K. S denotes strong supervision using key-point annotation, M refers to using ground-truth object segmentation mask, U is fully unsupervised requiring only single images, and W refers to weakly supervised with image class labels. Each method evaluates with images and ground-truth annotations resized to a specific resolution. However, using different ground-truth resolution leads to slightly different results. We therefore use the standard setting of evaluating on the original resolution (ori) and gray the results computed at a different resolution. When needed, we re-compute metrics of baselines using the provided pre-trained weights, indicated by .

G.1 Results when training on SPair

To better understand the performance of our training approach under complex conditions, we report results according to different variation factors with various difficulty levels. In particular, the SPair-71K dataset contains diverse variations in view-point, scale, truncation and occlusion. In addition to the keypoint match annotations, the dataset also provides specific annotations for each of the variation factors, with different levels of difficulty. We are particularly interested in the occlusion setting.

Weakly-supervised: In the bottom part of Tab. 8, we compare approaches trained in a weakly-supervised manner. Our PWarpC-SF-Net and PWarpC-NC-Net trained on PF-Pascal were further finetuned on SPair-71K with our Probabilistic Warp Consistency objective (11). Note that the baselines SF-Net and NC-Net were obtained by finetuning on SPair-71K the original models trained on PF-Pascal, with their respective original training strategies. Our weakly-supervised approaches PWarpC-SF-Net and PWarpC-NC-Net bring particularly impressive improvements over their respective baselines, with gains of +10.8 and +17.9 respectively. As a result, PWarpC-SF-Net sets a new state-of-the-art on the PF-Willow and PF-Pascal datasets, and PWarpC-NC-Net on the SPair-71K and TSS datasets, across all unsupervised (U), weakly-supervised (W) and mask-supervised (M) approaches trained on SPair-71K.

Strongly-supervised: In the top part of Tab. 8, we report results of models trained in a strongly-supervised manner, leveraging keypoint match annotations. While training on SPair-71K with our approach leads to results similar to the baselines on SPair-71K itself, our PWarpC networks show drastically better generalization to PF-Pascal, PF-Willow and TSS. Our strongly-supervised PWarpC-NC-Net* sets a new state-of-the-art on SPair-71K and TSS, across all strongly-supervised approaches trained on SPair-71K. Our PWarpC-SF-Net* also obtains state-of-the-art results on the PF-Pascal and PF-Willow datasets.

   Methods                       | Reso | View-point: easy  medi  hard | Scale: easy  medi  hard | Truncation: none  src  trg  both | Occlusion: none  src  trg  both | All
U  CNNGeo (from [spair])         | -    | 25.2  10.7  5.9              | 22.3  16.1  8.5         | 21.1  12.7  15.6  13.9           | 20.0  14.9  14.3  12.4          | 18.1
   A2Net (from [spair])          | -    | 27.5  12.4  6.9              | 24.1  18.5  10.3        | 22.9  15.2  17.6  15.7           | 22.3  16.5  15.2  14.5          | 20.1
M  SF-Net                        | ori  | 32.0  15.5  10.0             | 28.4  22.0  13.2        | 27.0  20.1  20.0  18.7           | 26.6  18.5  18.9  18.0          | 24.0
W  PWarpC-SF-Net                 | ori  | 41.9  24.2  20.7             | 39.1  31.8  18.8        | 36.3  29.7  30.4  28.4           | 36.5  27.7  27.9  24.7          | 33.5
   --------------------------------------------------------------------------------------------------------------------------------------------------------------------
   WeakAlign (from [spair])      | -    | 29.4  12.2  6.9              | 25.4  19.4  10.3        | 24.1  16.0  18.5  15.7           | 23.4  16.7  16.7  14.8          | 21.1
   NC-Net (from [spair])         | -    | 34.0  18.6  12.8             | 31.7  23.8  14.2        | 29.1  22.9  23.4  21.0           | 29.0  21.1  21.8  19.6          | 26.4
   --------------------------------------------------------------------------------------------------------------------------------------------------------------------
NC-Net ori 37.6 19.4 13.8 34.7 26.0