1 Introduction
The semantic matching problem entails finding pixel-wise correspondences between images depicting instances of the same semantic category of object or scene, such as ‘cat’ or ‘bird’. It has received growing interest due to its applications in e.g. semantic segmentation [Taniai2016, RubinsteinJKL13] and image editing [DaleJSMP09, BarnesSFG09, HaCohenSGL11, LeeKLKCC20]. The task nevertheless remains extremely challenging due to the large intra-class appearance and shape variations, viewpoint changes, and background clutter. These issues are further complicated by the inherent difficulty of obtaining ground-truth annotations.
While a few current datasets [PFPascal, PFWillow, spair] provide manually annotated keypoint matches, these are often ill-defined, ambiguous and scarce. Strongly-supervised approaches relying on such annotations therefore struggle to generalize across datasets, as demonstrated in recent works [MinLPC20, CATs]. As a prominent alternative, unsupervised approaches [GLUNet, GOCor, Rocco2017a, Melekhov2019, Rocco2018a, SeoLJHC18, MinLPC20] often train the network with synthetically generated image data and dense ground-truth. While benefiting from direct supervision, the lack of real image pairs often leads to poor generalization to real data. Weakly-supervised methods [MinLPC20, Rocco2018a, Rocco2018b, Jeon, DCCNet] thus appear as an attractive paradigm, leveraging supervision from real image pairs by only exploiting image-level class labels, which are inexpensive compared to keypoint annotations.
Previous weakly-supervised alternatives introduce objectives on the predicted dense correspondence volume, which encapsulates the matching confidences of all pairwise matches between the image pair. The most common strategy is to maximize the maximum scores [Rocco2018b, DCCNet] or the negative entropy [MinLPC20] of the correspondence volume computed between images of the same class, while minimizing the same quantity for images of different classes. However, these strategies only provide very limited supervision, due to their weak and indirect learning signal. While these approaches act directly on the predicted dense correspondence volume, Truong et al. [warpc] recently introduced Warp Consistency, a weakly-supervised learning objective for dense flow regression. The objective is derived from flow constraints obtained when introducing a third image, constructed by randomly warping one of the images in the original pair. While it achieves impressive results, the warp consistency objective is limited to the learning of flow regression. As such an approach predicts a single match for each pixel without any confidence measure, it struggles to handle occlusions and background clutter, which are prominent in the semantic matching task.
We propose Probabilistic Warp Consistency, a weakly-supervised learning objective for semantic matching. Following [Rocco2018b, DCCNet, CATs], we first employ a probabilistic mapping representation of the predicted dense correspondences, encoding the transition probabilities from every pixel in one image to every pixel in the other. Starting from a real image pair $(I, J)$, we consider the image triplet $(I, J, J')$ introduced in [warpc], where the synthetic image $J'$ is related to $J$ by a randomly sampled warp $W$ (Fig. 1). We derive our probabilistic consistency objective based on predicting the known probabilistic mapping relating $J'$ to $J$ with the composition through the image $I$. The composition is obtained by marginalizing over all the intermediate paths that link pixels in image $J'$ to pixels in $J$ through image $I$.
Since the constraints employed to derive our objective are only valid in mutually visible object regions, we further tackle the problem of identifying pixels that can be matched. This is particularly challenging in the presence of background clutter and occlusions, common in semantic matching. We explicitly model occluded and unmatched regions by introducing a learnable unmatched state into our probabilistic mapping formulation. To train the model to detect unmatched regions, we design an additional probabilistic loss that is applied on pairs of images depicting different object classes, as illustrated in Fig. 1. Further, we employ a visibility mask, which constrains our consistency loss to visible object regions.
We extensively evaluate and analyze our approach by applying it to four recent semantic matching architectures, across four benchmark datasets. In particular, we train SF-Net [SFNet] and NC-Net [Rocco2018b] with our weakly-supervised Probabilistic Warp Consistency objective. Our approach brings substantial relative gains over the baselines: for SF-Net on PF-Pascal [PFPascal] and PF-Willow [PFWillow], and for NC-Net on SPair-71K [spair] and TSS [Taniai2016]. This leads to a new state-of-the-art on all four datasets. Finally, we extend our approach to the strongly-supervised regime, by combining our probabilistic objectives with keypoint supervision. When integrated into SF-Net, NC-Net, DHPF [MinLPC20] and CATs [CATs], it leads to substantially better generalization properties across datasets, setting a new state-of-the-art on three benchmarks. Code is available at github.com/PruneTruong/DenseMatching
2 Related Work
Semantic matching architectures:
Most semantic matching pipelines include three main steps, namely feature extraction, cost volume construction, and displacement estimation. Multiple works focus on the latter, through either predicting the parameters of a global geometric transformation [Rocco2017a, ArbiconNet, SeoLJHC18, Kim2018, Rocco2018a, Jeon], or directly regressing the flow field [GLUNet, GOCor, pdcnet, warpc, Kim2019] relating an image pair. Nevertheless, most methods instead predict a cost volume as the final network output, which is further converted to point-to-point correspondences with argmax or soft-argmax [SFNet] operations. Recent methods thus focus on improving the cost volume aggregation stage, through formulating the semantic matching task as an optimal transport problem [SCOT] or leveraging multi-resolution features and cost volumes [HPF, MinLPC20, SFNet, MMNet, CATs]. Another line of work deals with refining the cost volume, with 4D [Rocco2018b, ANCNet, DCCNet, PMNC] or 6D [CHM] convolutions, an online optimization-based module [GOCor], an encoder-decoder style architecture [GSF] or a Transformer module [CATs].
Unsupervised and weakly-supervised semantic matching:
A common technique for unsupervised learning of semantic correspondences is to rely on synthetically warped versions of images [Rocco2017a, ArbiconNet, SeoLJHC18, GLUNet, GSF]. It nevertheless comes at the cost of poorer generalization abilities to real data. Some methods instead use real image pairs, by leveraging additional annotations in the form of 3D CAD models [Zhou2016, abs200409061] or segmentation masks [SFNet, ChenL0H21], or by jointly learning semantic matching with attribute transfer [Kim2019]. Most related to our work are approaches that use proxy losses on the cost volume constructed between real image pairs, with image labels as the only supervision [Jeon, Rocco2018a, Rocco2018b, DCCNet, Kim2018]. Jeon et al. [Jeon] identify correct matches from forward-backward consistency. NC-Net [Rocco2018b] and DCC-Net [DCCNet] are trained by maximizing the mean matching scores over all hard-assigned matches from the cost volume. Min et al. [MinLPC20] instead encourage low and high correlation entropy for image pairs depicting the same or different classes, respectively. In this work, we instead construct an image triplet by warping one of the original images with a known warp, from which we derive our probabilistic losses.
Unsupervised learning from videos: Our approach is also related to [randomwalk], which proposes a self-supervised approach for learning features, by casting matches as predictions of links in a space-time graph constructed from videos. Recent works [DwibediATSZ19, JabriOE20, WangJE19] further leverage the temporal consistency in videos to learn a representation for feature matching.
3 Background: Warp Consistency
We derive our approach based on the warp consistency constraints introduced in [warpc], which proposes a weakly-supervised loss, termed Warp Consistency, for learning correspondence regression networks. We therefore first review the relevant background and introduce the notation that we use.
We define the mapping $M_{A \to B}$, which encodes the absolute location $M_{A \to B}(x) \in \mathbb{R}^2$ in image $B$ corresponding to the pixel location $x$ in image $A$. We consistently use the hat $\hat{(\cdot)}$ to denote an estimated or predicted quantity.
Warp Consistency graph: Truong et al. [warpc] first build an image triplet, which is used to derive the constraints. From a real image pair $(I, J)$, an image triplet $(I, J, J')$ is constructed, by creating $J'$ through warping of $J$ with a randomly sampled mapping $W$, as $J' = J \circ W$. Here, $\circ$ denotes function composition. The resulting triplet gives rise to a warp consistency graph (Fig. 2a), from which a family of mapping-consistency constraints is derived.
Mapping-consistency constraints: Truong et al. [warpc] analyse the possible mapping-consistency constraints arising from the triplet and identify two of them as most suitable when designing a weakly-supervised learning objective for dense correspondence regression. Particularly, the proposed objective is based on the W-bipath constraint, where the mapping $M_{J' \to J}$ is computed through the composition via image $I$, formulated as,
$M_{J' \to J} = M_{I \to J} \circ M_{J' \to I} \,. \quad (1)$
It is further combined with the warp-supervision constraint,
$M_{J' \to J} = W \,, \quad (2)$
derived from the graph by the direct path $J' \to J$.
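To make the composition in (1) concrete, below is a minimal PyTorch sketch of composing two dense mappings with bilinear sampling. The function name and the normalized-coordinate convention are our own illustrative choices, not the implementation of [warpc].

```python
import torch
import torch.nn.functional as F

def compose_mappings(m_I_to_J, m_Jp_to_I):
    """Compose dense mappings: (M_{I->J} o M_{J'->I})(x) = M_{I->J}(M_{J'->I}(x)).

    Both mappings are (B, 2, H, W) tensors of absolute coordinates,
    normalized to [-1, 1] as expected by grid_sample.
    """
    # Where each pixel of J' lands in I, reshaped as a sampling grid
    grid = m_Jp_to_I.permute(0, 2, 3, 1)  # (B, H, W, 2)
    # Read off M_{I->J} at those locations, yielding an estimate of M_{J'->J}
    return F.grid_sample(m_I_to_J, grid, mode='bilinear', align_corners=True)
```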
In [warpc], these constraints were used to derive a weakly-supervised objective for correspondence regression. However, regressing a mapping vector $\hat{M}_{A \to B}(x) \in \mathbb{R}^2$ for each position $x$ only retrieves the position of the match, without any information on its uncertainty or multiple hypotheses. We instead aim at predicting a matching conditional probability distribution for each position $x$. The distribution encapsulates richer information about the matching ability of this location $x$, such as the confidence, uniqueness, and existence of the correspondence. In this work, we thus generalize the mapping constraints (1)-(2) extracted from the warp consistency graph to conditional probability distributions.
4 Method
We address the problem of estimating the pixel-wise correspondences relating an image pair $(I, J)$, depicting semantically similar objects. The dense matches are encapsulated in the form of a conditional probability matrix, referred to as a probabilistic mapping. The goal of this work is to design a weakly-supervised learning objective for probabilistic mappings, applied to the semantic matching task.
4.1 Probabilistic Formulation
In this section, we first introduce our probabilistic representation and define a typical base predictive architecture. We let $x$ denote a 2D pixel location in a grid of dimension $h \times w$, corresponding to image $I$. We refer to $i$ as the index corresponding to $x$ when the spatial dimensions are vectorized into one dimension of size $hw$. Following [Rocco2018b, MinLPC20, CATs], we aim at predicting the probabilistic mapping $P_{I \to J}$ relating $I$ to $J$. Given a position $i$ in frame $I$, $P_{I \to J}(j \mid i)$ gives the probability that $i$ is mapped to location $j$ in image $J$. $P_{I \to J}(\cdot \mid i)$ thus encodes the entire discrete conditional probability distribution of where $i$ is mapped in image $J$. We can see $P_{I \to J}$ as a $hw \times hw$ matrix, where each column at index $i$ encapsulates the distribution $P_{I \to J}(\cdot \mid i)$. Also note that the probabilistic mapping is asymmetric, i.e. $P_{I \to J}$ and $P_{J \to I}$ are in general not transposes of each other.
Probabilistic mapping prediction: We here describe a standard architecture predicting the probabilistic mapping relating an image pair. We let $F_I, F_J \in \mathbb{R}^{h \times w \times d}$ denote the $d$-channel feature maps extracted from the images $I$ and $J$, respectively. A cost volume $C \in \mathbb{R}^{hw \times hw}$ is then constructed, which encodes the pairwise deep feature similarities between all locations in the two feature maps, as,
$C(j, i) = F_J(j)^{T} F_I(i) \,. \quad (3)$
The cost volume is finally converted to a probabilistic mapping by simply applying the SoftMax operation over the first dimension,
$\hat{P}_{I \to J}(j \mid i) = \frac{\exp\left(C(j, i)/T\right)}{\sum_{j'} \exp\left(C(j', i)/T\right)} \,, \quad (4)$
where $T$ is a temperature parameter.
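As a sketch, eqs. (3)-(4) reduce to a few lines of PyTorch; variable names are illustrative.

```python
import torch

def probabilistic_mapping(f_I, f_J, T=1.0):
    """Cost volume (3) followed by column-wise SoftMax (4).

    f_I, f_J: (d, h, w) feature maps of images I and J.
    Returns P of shape (hw, hw), where column i holds P_{I->J}(. | i).
    """
    d, h, w = f_I.shape
    FI = f_I.reshape(d, h * w)          # columns indexed by i
    FJ = f_J.reshape(d, h * w)          # rows of C indexed by j
    C = FJ.t() @ FI                     # C[j, i] = <F_J(j), F_I(i)>   (3)
    return torch.softmax(C / T, dim=0)  # SoftMax over first dimension (4)
```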
Note that extensions of this basic approach can also be considered, by e.g. adding post-processing convolutional layers [Rocco2018b, DCCNet] or a Transformer module [CATs]. The goal of this work is to design a weakly-supervised learning objective to train a neural network $f_\theta$, with parameters $\theta$, that predicts the probabilistic mapping $\hat{P}_{I \to J}$ relating $I$ to $J$.
4.2 Probabilistic Warp Consistency Constraints
We set out to design a weakly-supervised loss for probabilistic mappings. To this end, we consider the consistency graph introduced in [warpc] and generalize the mapping constraints (1)-(2) to their corresponding probabilistic form.
Probabilistic W-bipath constraint: We start from the W-bipath constraint (1) extracted from the Warp Consistency graph in Fig. 2a and extend it to its probabilistic matrix counterpart, which we denote PW-bipath. It states that we obtain the same conditional probability distribution by proceeding through the path $J' \to J$, which is determined by the randomly sampled warp $W$, or by taking the detour through image $I$. In the latter case, the resulting probability distribution is derived by marginalizing over the intermediate paths that link pixels in $J'$ to pixels in $J$ through $I$ as,
$P_{J' \to J}(j \mid k) = \sum_{i=1}^{hw} P_{I \to J}(j \mid i) \, P_{J' \to I}(i \mid k) \,, \quad (5)$
for every pixel index $k$ of $J'$ and $j$ of $J$. The above equality is expressed in matrix form as,
$P_{J' \to J} = P_{I \to J} \cdot P_{J' \to I} \,, \quad (6)$
where $\cdot$ represents matrix multiplication. This constraint is schematically represented in Fig. 2b.
PW-bipath training objective: We aim at formulating an objective based on the PW-bipath constraint (6). Crucially, in our setting, the mapping $W$ is known by construction, from which we can derive the ground-truth probabilistic mapping $P_W$ relating $J'$ to $J$. To measure the distance between the right and the left side of (6), the KL divergence appears as a natural choice. Since $P_W$ is a constant, it simplifies to the familiar cross-entropy,
$L_{PW\text{-}bipath} = \frac{1}{hw} \sum_{k=1}^{hw} H\big( P_W(\cdot \mid k), \, (\hat{P}_{I \to J} \hat{P}_{J' \to I})(\cdot \mid k) \big) \,. \quad (7)$
Here, $H(p, q) = -\sum_j p(j) \log q(j)$ is the cross-entropy loss, applied column-wise. To simplify notation, we sometimes refer to the marginalization $\hat{P}_{I \to J} \hat{P}_{J' \to I}$ as $\hat{P}_{J' \to I \to J}$. Supervising with the label $P_W$ provides an implicit learning signal for the predicted intermediate distributions $\hat{P}_{J' \to I}$ and $\hat{P}_{I \to J}$.
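A minimal sketch of the PW-bipath objective (6)-(7), assuming all probabilistic mappings are stored as $hw \times hw$ matrices whose columns are conditional distributions (names are illustrative):

```python
import torch

def pw_bipath_loss(P_I_to_J, P_Jp_to_I, P_W, eps=1e-8):
    """Compose via matrix product (6), then column-wise cross-entropy (7)."""
    P_comp = P_I_to_J @ P_Jp_to_I                     # marginalization over I
    ce = -(P_W * torch.log(P_comp + eps)).sum(dim=0)  # H(P_W(.|k), P_comp(.|k))
    return ce.mean()                                  # average over pixels k of J'
```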
PWarp-supervision constraint and objective: Similarly, we generalize the warp-supervision constraint (2) to its probabilistic matrix form, as $P_{J' \to J} = P_W$. As previously, by exploiting the fact that $W$ is known, we derive the corresponding training objective,
$L_{PW\text{-}supervision} = \frac{1}{hw} \sum_{k=1}^{hw} H\big( P_W(\cdot \mid k), \, \hat{P}_{J' \to J}(\cdot \mid k) \big) \,. \quad (8)$
4.3 Modelling Unmatched Regions
The semantic matching task aims to estimate correspondences between different image instances of the same object class. However, even in that case, the backgrounds of the two images do not match. As a result, the common visible regions only represent a fraction of the images (see the birds in Fig. 2). Nevertheless, the distribution $\hat{P}_{I \to J}(\cdot \mid i)$ is unable to model the no-match case for pixel $i$.
Moreover, the construction of our image triplet introduces occluded areas, for which the constraint (6) is undefined. In fact, it is only valid in non-occluded object regions. However, in our setting, the locations of the objects in the real image pairs are unknown. In this section, we derive our visibility-aware learning objective. We additionally introduce explicit modelling of occluded and unmatchable regions into our probabilistic formulation.
Visibility-aware training objective: In general, the PW-bipath constraint (6) is only valid in regions of $J'$ that are visible in both images $I$ and $J$. That is, only in non-occluded object regions, as illustrated in Fig. 3. Applying the loss (7) in non-matching regions, such as background areas, or in occluded object regions (blue area in Fig. 3), bears the risk of confusing the network by enforcing matches in non-matching areas. As a result, we extend the introduced loss (7) by further integrating a visibility mask $V$. The mask takes the value $V(k) = 1$ for any pixel $k$ belonging to the non-occluded common object (roughly the orange area in Fig. 3) and $V(k) = 0$ otherwise. The loss (7) is then extended as,
$L_{vis\text{-}PW\text{-}bipath} = \sum_{k=1}^{hw} V(k) \, H\big( P_W(\cdot \mid k), \, \hat{P}_{J' \to I \to J}(\cdot \mid k) \big) \,. \quad (9)$
Since we do not know the true $V$, we aim to find an estimate $\hat{V}$, also visualized in Fig. 3. We consider the predicted probability $\hat{P}_{J' \to I \to J}(j_W \mid k)$ that a pixel $k$ of $J'$ is mapped to its ground-truth position $j_W$ in $J$, according to the known mapping $W$. We assume that this value should be higher in matching regions, i.e. on the object, than in non-matching regions, i.e. the background, where the constraint (6) does not hold. We therefore compute our visibility mask $\hat{V}$ by selecting the highest $\kappa$ percent of these probabilities over all of $J'$. The scalar $\kappa$ is a hyper-parameter controlling the sensitivity of the mask estimation. While we do not know the actual coverage of the object in the image, which might vary across training images, we found that taking a high estimate for $\kappa$ is sufficient in practice, as it simply removes the obvious non-matching regions. Moreover, while we could instead have computed $\hat{V}$ by thresholding the probabilities as $\hat{V}(k) = \mathbb{1}\big[\hat{P}_{J' \to I \to J}(j_W \mid k) > \gamma\big]$, our approach avoids the tedious continuous tuning of the threshold $\gamma$ during training, which would be necessary to follow the evolution of the probabilities. While valid as it is, the accuracy of the estimate $\hat{V}$ can further be improved through explicit occlusion modelling.
Occlusion modelling: In order to explicitly model occluded and non-matching regions in our probabilistic mapping $\hat{P}_{I \to J}$, we predict the probability of a pixel to be occluded or unmatched in one image, given that it is visible in the other. This can, for example, be achieved by augmenting the cost volume in (3) with an unmatched bin [superpoint, SarlinDMR20], such that $C$ gains an extra row whose entries are a single learnable parameter $b$. After converting the cost volume into a probabilistic mapping through (4), $\hat{P}_{I \to J}(\phi \mid i)$ encodes the probability of pixel $i$ of image $I$ to map to the unmatched or occluded state $\phi$, i.e. to have no match in image $J$. We further specify the matching distribution given an unmatched state to always be mapped to the unmatched state. Specifically, we augment $\hat{P}_{I \to J}$ with a fixed column, forcing the distribution given the unmatched state to be $\hat{P}_{I \to J}(j \mid \phi) = 1$ if $j = \phi$, and $0$ otherwise.
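A possible realization of the unmatched bin is sketched below. Appending the fixed column makes the augmented matrix $(hw{+}1) \times (hw{+}1)$, so the unmatched state is carried through the matrix product in (6). The function name and exact tensor layout are our own assumptions.

```python
import torch

def augment_with_unmatched_bin(C, b, T=1.0):
    """Augment the cost volume (3) with a learnable unmatched bin phi.

    C: (hw, hw) cost volume; b: scalar learnable parameter (0-dim tensor).
    Returns P of shape (hw+1, hw+1); index hw is the unmatched state phi.
    """
    hw = C.shape[1]
    C_aug = torch.cat([C, b.expand(1, hw)], dim=0)  # unmatched bin as extra row
    P = torch.softmax(C_aug / T, dim=0)             # (hw+1, hw), as in eq. (4)
    phi_col = torch.zeros(hw + 1, 1)                # fixed column: P(j|phi) = [j == phi]
    phi_col[-1] = 1.0
    return torch.cat([P, phi_col], dim=1)
```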
Occlusion-aware PW-bipath: Our modelling of the unmatched state given the unmatched state, as $\hat{P}(\phi \mid \phi) = 1$, naturally ensures that the following scheme is respected. If a pixel $k$ in image $J'$ is predicted as unmatched in image $I$, i.e. $\hat{P}_{J' \to I}(\phi \mid k) \approx 1$, it will also be predicted unmatched in image $J$, i.e. $\hat{P}_{J' \to I \to J}(\phi \mid k) \approx 1$. This prevents enforcing (9) on pixels of image $J'$ which are visible in $I$, but occluded in image $J$ (blue area in Fig. 3). Moreover, predicting a high probability for the occluded state allows identifying occluded and non-matching areas in $\hat{P}_{J' \to I \to J}$. It further ensures that these regions are not selected in $\hat{V}$, and therefore not supervised with (9).
Supervision of the unmatched state: Our introduced objectives (8)-(9) do not impact the unmatched state $\phi$. We thus propose an additional loss to supervise it. Particularly, we aim at encouraging background and occluded object regions in images depicting the same object class to be predicted as unmatchable. Nevertheless, since the locations of the objects in the image pair are unknown during training, we cannot obtain direct supervision. To overcome this, we introduce an image $N$, depicting a different semantic content than the triplet. We then supervise the unmatched state by guiding the mode of the distribution between $I$ and $N$ to be in the unmatched state, for all pixels of the images. The corresponding learning objective on the non-matching image pair is defined as follows, and illustrated in Fig. 4,
$L_{PNeg} = \frac{1}{hw} \sum_{i=1}^{hw} B\big( \hat{P}_{I \to N}(\phi \mid i), \, y \big) \,. \quad (10)$
Here, $B(p, y) = -y \log p - (1-y) \log(1-p)$ denotes the binary cross-entropy, and we set the target to $y = 1$.
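A sketch of (10) under the same matrix layout, where the last row of the augmented probabilistic mapping holds $\hat{P}(\phi \mid \cdot)$; the helper name is illustrative:

```python
import torch
import torch.nn.functional as F

def pneg_loss(P_I_to_N):
    """Binary cross-entropy pushing every pixel of I to the unmatched state phi,
    for a negative image N depicting a different object class.

    P_I_to_N: (hw+1, hw) or (hw+1, hw+1) matrix; the last row holds P(phi | i).
    """
    p_phi = P_I_to_N[-1, :].clamp(1e-8, 1.0 - 1e-8)
    return F.binary_cross_entropy(p_phi, torch.ones_like(p_phi))  # target y = 1
```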
4.4 Final Training Objectives
Finally, we introduce our final weakly-supervised objective, the Probabilistic Warp Consistency, as a combination of our previously introduced PW-bipath (9), PWarp-supervision (8) and PNeg (10) objectives. We additionally propose a strongly-supervised approach, benefiting from our losses while also leveraging keypoint annotations.
Weak supervision: In this setup, we assume that only image-level class labels are given, such that each image pair is either positive, i.e. depicting the same object class, or negative, i.e. representing different classes, following [Rocco2018b, DCCNet, MinLPC20]. We obtain our final weakly-supervised objective by combining the PW-bipath (9) and PWarp-supervision (8) losses applied to positive image pairs, with our negative probabilistic objective (10) on negative image pairs,
$L_{weak} = L_{vis\text{-}PW\text{-}bipath} + \lambda_{ws} L_{PW\text{-}supervision} + \lambda_{neg} L_{PNeg} \,. \quad (11)$
Here, $\lambda_{ws}$ and $\lambda_{neg}$ are weighting factors.
Strong supervision: We extend our approach to the strongly-supervised regime, where keypoint match annotations are given for each training image pair. Previous approaches [CATs, CHM, ANCNet] leverage these annotations by training semantic networks with a keypoint objective $L_{kp}$. Our final strongly-supervised objective is defined as the combination of the keypoint loss with our PW-bipath (9) and PWarp-supervision (8) objectives. Note that we do not include our explicit occlusion modelling, i.e. the unmatched state and its corresponding loss (10) on negative image pairs. This is to ensure a fair comparison to previous strongly-supervised approaches, which solely rely on keypoint annotations, and not on the image-level labels required for our loss (10),
$L_{strong} = L_{kp} + \lambda_{bi} L_{vis\text{-}PW\text{-}bipath} + \lambda_{ws} L_{PW\text{-}supervision} \,. \quad (12)$
Here, $\lambda_{bi}$ and $\lambda_{ws}$ are also weighting factors.
5 Experimental Results
We evaluate our weakly-supervised learning approach for two semantic networks. The benefits brought by the combination of our probabilistic losses with keypoint annotations are also demonstrated for four recent networks. We extensively analyze our method and compare it to previous approaches, setting a new state-of-the-art on multiple challenging datasets.
5.1 Networks and Implementation Details
For weak supervision, we integrate our approach (11) into the baselines SF-Net [SFNet] and NC-Net [Rocco2018b], leading to our weakly-supervised PWarpC-SF-Net and PWarpC-NC-Net respectively. We also apply our strongly-supervised loss (12) to the baselines SF-Net, NC-Net, DHPF [MinLPC20] and CATs [CATs], resulting in PWarpC-SF-Net*, PWarpC-NC-Net*, PWarpC-DHPF and PWarpC-CATs respectively. For fair comparison, we additionally train a strongly-supervised baseline for both SF-Net and NC-Net, referred to as SF-Net* and NC-Net*. Note that for all methods, the strongly-supervised baseline is trained with only $L_{kp}$, which is defined as the cross-entropy loss for SF-Net*, NC-Net* and DHPF, and as the End-Point-Error objective after applying soft-argmax [SFNet] for CATs. To convert the predicted probabilistic mapping to point-to-point matches for evaluation, all networks trained with our PWarpC objectives employ the argmax operation, except for PWarpC-CATs, where we adopt the same soft-argmax as in the baseline CATs [CATs]. Additional details on the integration of our objectives for each architecture are provided in the appendix, Sec. A-F. We train all networks on PF-Pascal [PFPascal], using the splits of [SCNet]. The results when trained on SPair-71K are further presented in the appendix, Sec. G.1.
PF-Pascal  PF-Willow  SPair-71K  TSS
PCK @ 0.05 / 0.10 / 0.15  PCK @ 0.05 / 0.10 / 0.15  PCK @ 0.05 / 0.10  PCK @ 0.05
Methods  Reso  FG3DCar  JODS  Pascal  Avg.
S  UCN [ucn]      75.1            17.7         
SCNet [SCNet]    36.2  72.2  82.0                    
HPF [HPF]  max  60.1  84.8  92.7  45.9  74.4  85.6      93.6  79.7  57.3  76.9  
SCOT [SCOT]  max 300  63.1  85.4  92.7  47.8  76.0  87.1      95.3  81.3  57.7  78.1  
ANCNet [ANCNet]      86.1            28.7          
CHM [CHM]  80.1  91.6                      
PMD [PMD]      90.7      75.6                
PMNC [PMNC]    82.4  90.6            28.8          
MMNet [MMNet]  77.7  89.1  94.3                    
DHPF [MinLPC20]  75.7  90.7  95.0  41.4  67.4  81.8  15.4  27.4          
CATs [CATs]  67.5  89.1  94.9  37.4  65.8  79.7  10.9  22.4          
CATsftfeatures [CATs]  75.4  92.6  96.4  40.9  69.5  83.2  13.6  27.0          
CATs [CATs]  ori  67.3  88.6  94.6  41.6  68.9  81.9  10.8  22.1  89.5  76.0  58.8  74.8  
PWarpCCATs  ori  67.1  88.5  93.8  44.2  71.2  83.5  12.2  23.3  93.2  83.4  70.7  82.4  
CATsftfeatures [CATs]  ori  76.8  92.7  96.5  45.2  73.2  85.2  13.7  26.8  92.1  78.9  64.2  78.4  
PWarpCCATsftfeatures  ori  79.8  92.6  96.4  48.1  75.1  86.6  15.4  27.9  95.5  85.0  85.5  88.7  
DHPF [MinLPC20]  ori  77.3  91.7  95.5  44.8  70.6  83.2  15.3  27.5  88.2  71.9  56.6  72.2  
PWarpCDHPF  ori  79.1  91.3  96.1  48.5  74.4  85.4  16.4  28.6  89.1  74.1  59.7  74.3  
NCNet*  ori  78.6  91.7  95.3  43.0  70.9  83.9  17.3  32.4  92.3  76.9  57.1  75.3  
PWarpCNCNet*  ori  79.2  92.1  95.6  48.0  76.2  86.8  21.5  37.1  97.5  87.8  88.4  91.2  
SFNet*  ori  78.7  92.9  96.0  43.2  72.5  85.9  13.3  27.9  88.0  75.1  58.4  73.8  
PWarpCSFNet*  ori  78.3  92.2  96.2  47.5  77.7  88.8  17.3  32.5  94.9  83.4  74.3  84.2  
U  CNNGeo [Rocco2017a]  ori  41.0  69.5  80.4  36.9  69.2  77.8    18.1  90.1  76.4  56.3  74.4 
PARN [Jeon]  ori                  89.5  75.9  71.2  78.8  
GLUNet [GLUNet]  ori  42.2  69.1  83.1  30.4  57.7  72.9      93.2  73.3  71.1  79.2  
SemanticGLUNet [GLUNet, warpc]  ori  48.3  72.5  85.1  39.7  67.5  82.1  7.6  16.5  95.3  82.2  78.2  85.2  
A2Net [SeoLJHC18]    42.8  70.8  83.3  36.3  68.8  84.4    20.1          
PMD [PMD]      80.5      73.4                
M  SFNet [SFNet]  / ori  53.6  81.9  90.6  46.3  74.0  84.2             
SFNet [SFNet]  ori  59.0  84.0  92.0  46.3  74.0  84.2  11.2  24.0  90.8  78.6  58.0  75.8  
W  PWarpCSFNet  ori  65.7  87.6  93.1  47.5  78.3  89.0  17.6  33.5  95.1  84.7  76.8  85.5 
WeakAlign [Rocco2018a]  ori/ ori /   49.0  75.8  84.0  38.2  71.2  85.8    21.1  90.3  76.4  56.5  74.4  
RTNs [Kim2018]    55.2  75.9  85.2  41.3  71.9  86.2      90.1  78.2  63.3  77.2  
DCCNet [DCCNet]  / ori /   55.6  82.3  90.5  43.6  73.8  86.5    26.7  93.5  82.6  57.6  77.9  
SAMNet [Kim2019]    60.1  80.2  86.9            96.1  82.2  67.2  81.8  
DHPF [MinLPC20]  56.1  82.1  91.1  40.5  70.6  83.8  14.7  28.5          
DHPF [MinLPC20]  ori  61.2  84.1  92.4  45.1  73.6  85.0  14.7  27.8          
GSF [GSF]    62.8  84.5  93.7  47.0  75.8  88.9    33.5          
PMD [PMD]      81.2      74.7      26.5          
WarpCSemGLUNet [warpc]  ori  62.1  81.7  89.7  49.0  75.1  86.9  13.4  23.8  97.1  84.7  79.7  87.2  
NCNet [Rocco2018b]  / ori /   54.3  78.9  86.0  44.0  72.7  85.4    26.4          
NCNet [Rocco2018b]  ori  60.5  82.3  87.9  44.0  72.7  85.4  13.9  28.8  94.5  81.4  57.1  77.7  
PWarpCNCNet  ori  64.2  84.4  90.5  45.0  75.9  87.9  18.2  35.3  95.9  88.8  82.9  89.2 
5.2 Experimental Settings
We evaluate our networks on four standard benchmark datasets for semantic matching, namely PF-Pascal [PFPascal], PF-Willow [PFWillow], SPair-71K [spair] and TSS [Taniai2016]. Results on Caltech-101 [caltech] are further shown in appendix H.6.
Datasets: PF-Pascal, PF-Willow and SPair-71K are keypoint datasets, which respectively contain 1341, 900 and 70958 image pairs from 20, 4 and 18 categories. Images have varying dimensions. TSS is the only dataset providing dense flow field annotations for the foreground object in each pair. It contains 400 image pairs, divided into three groups: FG3DCar, JODS, and Pascal, according to the origins of the images.
Metrics: We adopt the standard metric, the Percentage of Correct Keypoints (PCK). A predicted keypoint is considered correct if it lies within a pixel threshold of $\alpha \cdot \max(H, W)$ of the ground truth. Here, $H$ and $W$ are either the dimensions of the source image ($\alpha_{img}$) or the dimensions of the object bounding box in the source image ($\alpha_{bbox}$).
5.3 Results
We present results on PF-Pascal, PF-Willow, SPair-71K and TSS in Tab. 1. A few previous approaches compute the PCK metrics after resizing the annotations to a different resolution than the original. Nevertheless, we found that in practice, the annotation resolution can lead to notable variations in results, as evidenced for DHPF or CATs in Tab. 1. For fair comparison, we thus compute the metrics in the standard setting, i.e. at the original image size, and recompute the PCK in this setting for baseline works if necessary. We also indicate the annotation size used, whenever reported by the authors or provided in their public implementation.
Weak supervision (W): In the bottom part of Tab. 1, we compare approaches trained with weak supervision in the form of image labels. In this setting, our PWarpC networks are trained with $L_{weak}$ in (11). While bringing improvements on the PF-Pascal dataset itself, our approach PWarpC-NC-Net most notably achieves far better generalization properties, with large relative and absolute gains compared to the baseline NC-Net on PF-Willow, SPair-71K and TSS. Our PWarpC-NC-Net thus sets a new state-of-the-art on SPair-71K and TSS among weakly-supervised methods trained on PF-Pascal.
Even though it utilizes a lower degree of supervision, our approach PWarpC-SF-Net also significantly outperforms the baseline SF-Net, which is trained with mask supervision (M), on all datasets, with substantial relative and absolute gains on PF-Pascal, PF-Willow, SPair-71K and TSS. This makes our PWarpC-SF-Net the new state-of-the-art across all unsupervised (U), weakly-supervised (W) and mask-supervised (M) approaches on PF-Pascal and PF-Willow. Example predictions are shown in Fig. 5.
Strong supervision (S): In the top part of Tab. 1, we evaluate networks trained with strong supervision, in the form of keypoint annotations. Our strongly-supervised PWarpC approaches are trained with our $L_{strong}$ (12). For all networks, while the results are on par with the baselines on PF-Pascal, the PWarpC networks show drastically better performance on PF-Willow, SPair-71K and TSS compared to their respective baselines. PWarpC-SF-Net* and PWarpC-NC-Net* thus set a new state-of-the-art on respectively PF-Willow, and the SPair-71K and TSS datasets, across all strongly-supervised approaches trained on PF-Pascal. Finally, while most works focus on designing novel semantic architectures, we here show that the right training strategy bridges the gap between architectures.
PF-Pascal  PF-Willow  SPair-71K  TSS
Methods  PCK @ 0.05 / 0.10  PCK @ 0.05 / 0.10  PCK @ 0.10  PCK @ 0.05 (Avg.)
I  SF-Net baseline  59.0  84.0  46.3  74.0  24.0  75.8
II  PW-bipath (7)  59.1  82.3  44.9  74.3  28.0  83.4
III  + visibility mask (9)  61.2  83.7  46.1  75.8  28.5  78.4
IV  + PWarp-supervision (8)  63.0  84.9  47.0  76.9  30.7  83.5
V  + PNeg (10) (PWarpC-SF-Net)  65.7  87.6  47.5  78.3  33.5  85.5
V  PWarpC-SF-Net (Ours)  65.7  87.6  47.5  78.3  33.5  85.5
VI  Mapping Warp Consistency [warpc]  64.9  86.1  46.9  76.6  26.6  82.2
VII  PWarp-supervision only (8)  52.9  74.3  38.0  66.6  27.9  79.4
VIII  Max-score [Rocco2018b]  52.4  76.7  31.2  59.5  24.6  74.8
IX  Min-entropy [MinLPC20]  44.7  74.4  25.4  57.8  20.6  69.6
5.4 Method Analysis
We here perform a comprehensive analysis of our approach in Tab. 2, adopting SF-Net as the base architecture.
Ablation study: In the top part of Tab. 2, we analyze the key components of our approach. The version denoted as (II) is trained using our PW-bipath objective (7), without the visibility mask. Further introducing our visibility mask (9) in (III) significantly boosts the results, since it enables supervising only the common visible regions. Note that this version (III) already outperforms the baseline SF-Net (I), while using less annotation (class labels instead of masks). In (IV), we add our probabilistic warp-supervision (8), leading to a small improvement for all thresholds and on all datasets. From (IV) to (V), we further introduce our explicit occlusion modelling associated with our negative loss (10), which results in drastically better performance. This version corresponds to our final weakly-supervised PWarpC-SF-Net, trained with (11). An example of the regions identified as unmatched by PWarpC-SF-Net is shown in Fig. 6b.
Comparison to other losses: In the bottom part of Tab. 2, we first compare our probabilistic approach (11), corresponding to (V), with the mapping-based warp consistency objective [warpc], denoted as (VI). Our approach (V) leads to better performance than warp consistency (VI), with a particularly impressive absolute gain on the challenging SPair-71K dataset. We further illustrate the benefit of our approach on an example in Fig. 6. Moreover, using only the PWarp-supervision loss (8) in (VII) results in much worse performance than our Probabilistic Warp Consistency (V). Finally, we compare our approach (V) to previous losses applied on cost volumes. The versions (VIII) and (IX) are trained by respectively maximizing the max scores [Rocco2018b] and minimizing the cost volume entropy [MinLPC20]. Both approaches lead to poor results, likely caused by the very indirect supervision signal that these objectives provide.
6 Conclusion
We propose Probabilistic Warp Consistency, a weakly-supervised learning objective for semantic matching. We introduce multiple probabilistic losses, derived both from a triplet of images generated from a real image pair, and from a pair of non-matching images. When integrated into four recent semantic networks, our approach sets a new state-of-the-art on four challenging benchmarks.
Limitations: Since our approach acts on cost volumes, which are memory-expensive, it is limited to relatively coarse resolutions. This might in turn impact its accuracy.
References
Appendix A General implementation details
In this section, we provide implementation details, which apply to all our PWarpC networks.
Creation of the ground-truth probabilistic mapping $P_W$: Here, we describe how we obtain the ground-truth probabilistic mapping $P_W$ from the known mapping $W$. We first rescale the mapping $W$ to the same resolution $h \times w$ as the predicted probabilistic mapping. We then convert the mapping into a ground-truth probabilistic mapping $P_W$, following this scheme. For each pixel position $k$ in $J'$, we construct the ground-truth 2D conditional probability distribution by assigning a one-hot or a smooth representation of the target location $W(k)$. In the latter case, following [ANCNet], we pick the four nearest neighbours of $W(k)$ and set their probabilities according to distance. Then we apply 2D Gaussian smoothing with a kernel of size 3 on that probability map. We then vectorize the two spatial dimensions, leading to our final known probabilistic mapping $P_W$. We specify below which representation, one-hot or smooth, is used for each loss and each network.
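A sketch of the one-hot variant follows (the smooth variant instead spreads the mass over the four nearest neighbours of $W(k)$ before Gaussian smoothing); names and layout are illustrative:

```python
import torch

def prob_mapping_from_warp(W_map, h, w):
    """One-hot ground-truth probabilistic mapping P_W from a dense mapping.

    W_map: (2, h, w) absolute target (x, y) coordinates for every pixel of J'.
    Returns P_W of shape (hw, hw), column k being a one-hot distribution.
    """
    xs = W_map[0].round().long().clamp(0, w - 1)
    ys = W_map[1].round().long().clamp(0, h - 1)
    j = (ys * w + xs).reshape(-1)              # vectorized target index per pixel k
    P_W = torch.zeros(h * w, h * w)
    P_W[j, torch.arange(h * w)] = 1.0          # one-hot column for every k
    return P_W
```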
Conversion of $\hat{P}_{I \to J}$ to a correspondence set: The output of the model is a probabilistic mapping, encoding the matching probabilities for all pairwise matches relating an image pair. However, for various applications, such as image alignment or geometric transformation estimation, it is desirable to obtain a set of point-to-point image correspondences between the two images. This can be achieved by performing either a hard or a soft assignment. In the former case, the hard assignment in one direction is done by simply taking the most likely match, i.e. the mode of the distribution, as,
$\hat{M}_{I \to J}(i) = \arg\max_{j} \hat{P}_{I \to J}(j \mid i) \,. \quad (13)$
In the latter case, the soft assignment corresponds to soft-argmax. It computes correspondences for individual locations $i$ of image $I$, as the expected position in $J$ according to the conditional distribution $\hat{P}_{I \to J}(\cdot \mid i)$,
$\hat{M}_{I \to J}(i) = \sum_{j} x_j \, \hat{P}_{I \to J}(j \mid i) \,, \quad (14)$
where $x_j$ denotes the 2D coordinate corresponding to index $j$.
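Both assignments in (13)-(14) reduce to a few tensor operations; a minimal sketch, assuming a plain $hw \times hw$ probabilistic mapping without the unmatched bin:

```python
import torch

def hard_assignment(P, h, w):
    """Eq. (13): mode of each conditional distribution, returned as (x, y)."""
    j = P.argmax(dim=0)                                 # most likely index per column i
    return torch.stack([j % w, j // w], dim=1).float()  # (hw, 2) coordinates

def soft_assignment(P, h, w):
    """Eq. (14): expected position under each conditional distribution."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    coords = torch.stack([xs, ys], dim=0).reshape(2, -1).float()  # (2, hw)
    return (coords @ P).t()                             # (hw, 2) expected (x, y)
```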
Training details:
All networks are trained with PyTorch, on a single NVIDIA TITAN RTX GPU with 24 GiB of memory, within 48 hours, depending on the architecture.
Appendix B Triplet creation and sampling of warps
b.1 Triplet creation
Our introduced learning approach requires constructing an image triplet $(I, J, J')$ from an original image pair, where all three images must have the training dimensions $h \times w$. We follow a procedure similar to that in [warpc], further described here. The original training image pairs are first resized to a fixed size, larger than the desired training image size $h \times w$. We then sample a dense mapping $W$ of this same dimension, and create $J'$ by warping image $J$ with $W$, as $J' = J \circ W$. Each of the images of the resulting image triplet is then centrally cropped to the fixed training image size $h \times w$. The central cropping is necessary to remove most of the black areas in $J'$, introduced by the warping operation with large sampled mappings, as well as possible warping artifacts arising at the image borders. We then additionally apply appearance transformations to all images of the triplet, such as brightness and contrast changes.
b.2 Sampling of warps
A question raised by our proposed loss formulations (8)-(9) is how to sample the synthetic warps $W$. During training, we randomly sample them from a distribution $W \sim p_W$, which we need to design. Here, we also follow a procedure similar to that in [warpc].
In particular, we construct $W$ by sampling homography, Thin-plate Spline (TPS), or affine-TPS transformations with equal probability. The transformation parameters are then converted to dense mappings of the resizing dimension. Then, we optionally apply horizontal flipping to each dense mapping with a probability $p_{flip}$.
Specifically, for homographies and TPS, the four image corners and a grid of control points, respectively, are randomly translated in both horizontal and vertical directions, according to a desired sampling scheme. The translated and original points are then used to compute the corresponding homography and TPS parameters. Finally, the transformation parameters are converted to dense mappings. For both transformation types, the magnitudes of the translations are sampled according to a uniform distribution with a range $\sigma$. Note that for the uniform distribution, the sampling range is actually $[-\sigma, \sigma]$ when it is centered at zero, or similarly $[1-\sigma, 1+\sigma]$ if centered at 1, for example. Importantly, the image point coordinates are previously normalized to lie in the interval $[-1, 1]$. Therefore, $\sigma$ should be within $[0, 1]$.
For the affine transformations, all parameters, i.e. the scale, translations, shearing and rotation angles, are sampled from uniform distributions, with ranges $\sigma_{scale}$, $\sigma_{trans}$, $\sigma_{shear}$ and $\sigma_{rot}$ respectively. The affine scale parameters are sampled within a range centered at 1, while for all other parameters, the sampling interval is centered at zero.
b.3 List of Hyperparameters
In summary, to construct our image triplet $(I, J, J')$, the hyper-parameters are the following:
(i) The resizing image size, on which $W$ is applied to obtain $J'$ before cropping.
(ii) The training image size $h \times w$, which corresponds to the size of the training images after cropping.
(iii) $\sigma$, the range used for sampling the homography and TPS transformations.
(iv) $\sigma_{scale}$, the range used for sampling the scaling parameter of the affine transformations.
(v) $\sigma_{trans}$, the range used for sampling the translation parameter of the affine transformations.
(vi) $\sigma_{rot}$, the range used for sampling the rotation angle of the affine transformations. It is also used as the shearing angle.
(vii) $\sigma_{tps}$, the range used for sampling the TPS transformations employed in the affine-TPS compositions.
(viii) The probability of horizontal flipping $p_{flip}$.
b.4 Hyperparameters settings
Geometric transformations $W$: For all our PWarpC networks, the mappings $W$ are created by sampling homographies, TPS and affine-TPS transformations with equal probability. For simplicity, we also use the same sampling range for all three types of transformations, following a uniform sampling scheme. For the affine transformations, we likewise sample all parameters, i.e. the scale, translation, shear and rotation angles, from uniform distributions. We use these parameters when training on either PF-Pascal [PFPascal] or SPair-71K [spair].
Probability of horizontal flipping: When training on PF-Pascal, we use the same probability of horizontal flipping for all our PWarpC networks, except for PWarpC-NC-Net and PWarpC-NC-Net*, for which we use 30% (see Sec. D.2). For training on SPair-71K, we increase this value for all our PWarpC networks, except for PWarpC-NC-Net and PWarpC-NC-Net*, for which we keep it unchanged.
Appearance transformations: For all experiments and networks, we apply the same appearance transformations to the image triplet $(I, J, J')$. Specifically, we convert each image to grayscale with a probability of 0.2. We then apply color transformations, by adjusting contrast, saturation, brightness, and hue. The color transformations are stronger for the synthetic image $J'$ than for the real images $(I, J)$. For the synthetic image $J'$, we additionally randomly invert the RGB channels. Finally, on all images of the triplet, we further use a Gaussian blur with a kernel size between 3 and 7 and a randomly sampled standard deviation, applied with a probability of 0.2.
Appendix C PWarpC-SF-Net and PWarpC-SF-Net*
We first provide details about the SF-Net [SFNet] architecture and briefly review the training strategy of the original work. We then extensively explain our training approach and the corresponding implementation details, for both our weakly- and strongly-supervised approaches, PWarpC-SF-Net and PWarpC-SF-Net* respectively. Finally, we provide additional method analysis for this architecture.
c.1 Details about SF-Net
Architecture: SF-Net is based on a pre-trained ResNet-101 feature backbone, on which convolutional adaptation layers are added at two levels. The predicted feature maps are then used to construct two cost volumes, at two resolutions. After upsampling the coarser one to the same resolution, the two cost volumes are combined with a point-wise multiplication. While the resulting cost volume is the actual output of the network, it is converted to a flow field through a kernel soft-argmax operation. Specifically, a fixed Gaussian kernel is applied on the raw cost volume scores to post-process them, before applying SoftMax to convert the cost volume to a probabilistic mapping. From there, the soft-argmax operation converts it to a mapping.
For our PWarpC approaches, we do not use the Gaussian kernel to post-process the predicted matching scores. We simply convert the predicted cost volume into a probabilistic mapping through a SoftMax operation, following eq. (4) of the main paper. Also note that only the adaptation layers are trained.
Training strategy in original work: The original work employs ground-truth foreground object masks as supervision. From single images associated with their segmentation masks, image pairs are created by applying random transformations to both the original images and the segmentation masks. Subsequently, the network is trained with a combination of multiple losses. In particular, they enforce the forward-backward consistency of the predicted flow, associated with a smoothness objective acting directly on the predicted flow. These losses are further combined with an objective enforcing the consistency of the warped foreground mask of one image with the ground-truth segmentation mask of the other image.
c.2 PWarpC-SF-Net and PWarpC-SF-Net*: our training strategy
Warps sampling: For the weakly-supervised version, we resize the image pairs to a fixed size, sample a dense mapping $W$ of the same dimension, and create $J'$. Each of the images of the resulting image triplet is then centrally cropped to the training size.
For the strongly-supervised version, we apply the transformations on images of the same size as the crop. This is to avoid cropping out keypoint annotations.
When training on PF-Pascal, we apply horizontal flipping with a fixed probability when sampling the random mappings $W$, and increase this probability when training on SPair-71K.
Weighting and details on the losses: We found it beneficial to define the known probabilistic mapping $P_W$ with a one-hot representation for our PW-bipath loss (9), while using a smooth representation instead for the PWarp-supervision loss (8) and the keypoint objective $L_{kp}$ in (12). Each representation is described in Sec. A.
For the weakly-supervised version PWarpC-SF-Net, the weights $\lambda_{ws}$ and $\lambda_{neg}$ in (11) are set to fixed constants.
For the strongly-supervised version PWarpC-SF-Net*, we use the same weight $\lambda_{ws}$. We additionally set $\lambda_{bi}$ such that our probabilistic losses carry the same weight as the keypoint loss $L_{kp}$. Moreover, the keypoint loss is set as the cross-entropy loss, for both PWarpC-SF-Net* and its baseline SF-Net*.
Implementation details: For our weakly-supervised PWarpC-SF-Net, we set the learnable parameter $b$, corresponding to the unmatched state in our occlusion modeling, to a fixed initial value.
For both the weakly- and strongly-supervised approaches, the SoftMax temperature $T$, corresponding to equation (4) of the main paper, is set to the same value as originally used in the baseline for soft-argmax. The hyper-parameter $\kappa$ used in the estimation of our visibility mask (eq. 9 of the main paper) is set to a larger value when training on PF-Pascal than on SPair-71K. This is because in SPair-71K, the objects are generally much smaller than in PF-Pascal.
For training, we use similar training parameters as in the baseline SF-Net [SFNet]. We train with a batch size of 16 for a maximum of 100 epochs. The learning rate is halved after 50 epochs. We optionally finetune the networks on SPair-71K for an additional 20 epochs, with the initial learning rate halved after 10 epochs. The networks are trained using the Adam optimizer [adam] with the weight decay set to zero.
c.3 Additional analysis
Here, we first analyse the effect of the Gaussian kernel applied in the original SF-Net baseline [SFNet] before converting the predicted cost volume to a probabilistic mapping representation. We also provide the ablation study of our strongly-supervised PWarpC-SF-Net*. Note that the ablation study of the weakly-supervised SF-Net is provided in Tab. 2 of the main paper. Finally, we show the impact of different losses on negative image pairs, i.e. pairs depicting different object classes.
PF-Pascal  PF-Willow  SPair-71K  TSS
Methods  PCK @ 0.05 / 0.10  PCK @ 0.05 / 0.10  PCK @ 0.10  PCK @ 0.05 (Avg.)
SF-Net*  78.7  92.9  43.2  72.5  27.9  73.8
+ Vis-aware PW-bipath (9)  77.1  91.6  47.8  77.9  31.1  80.3
+ PWarp-supervision (8)  78.3  92.2  47.5  77.7  32.5  84.2
Effect of kernel: The baseline SF-Net relies on a kernel soft-argmax strategy to convert the predicted cost volume to a mapping. In particular, a Gaussian kernel is applied on the cost volume before the SoftMax (eq. (4)), which converts it to a probabilistic mapping. From there, soft-argmax is used to obtain a mapping. Nevertheless, we observe that this kernel is extremely important in order to post-process the matching scores. This is shown in Fig. 7. In contrast, our approach, Probabilistic Warp Consistency, which directly acts on the predicted dense matching scores, produces clean, Dirac-like conditional distributions, without relying on any post-processing operations.
Ablation study for strongly-supervised PWarpC-SF-Net*: In Tab. 3, we analyse the key components of our approach PWarpC-SF-Net*. Starting from the strongly-supervised baseline SF-Net*, adding our probabilistic PW-bipath objective leads to a significant improvement on the PF-Willow, SPair-71K and TSS datasets. Further including our PWarp-supervision objective results in additional gains on SPair-71K and TSS.
Comparisons to alternative negative losses: In Tab. 4, we compare combining our PW-bipath and PWarp-supervision objectives on image pairs of the same label, with different losses on image pairs showing different object classes, i.e. on negative image pairs. In the version denoted as (IV), we introduce our explicit occlusion modeling (Sec. 4.3 of the main paper), trained with our probabilistic negative loss (10). In (V) and (VI), we instead combine our probabilistic objectives on the positive image pairs (III) with an additional objective, minimizing the max scores or the negative entropy of the cost volume respectively. While this brings a small improvement with respect to version (III), the resulting network performances in (V) and (VI) are far lower than when trained with our final combination (11), which corresponds to version (IV).
PF-Pascal  PF-Willow  SPair-71K  TSS
Methods  PCK @ 0.05 / 0.10  PCK @ 0.05 / 0.10  PCK @ 0.10  PCK @ 0.05 (Avg.)
I  SF-Net baseline (soft-argmax)  59.0  84.0  46.3  74.0  24.0  75.8
II  SF-Net baseline (argmax)  60.3  81.3  43.7  71.0  26.9  74.1
III  Vis-PW-bipath + PWarp-sup  63.0  84.9  47.0  76.9  30.7  83.5
IV  (III) + PNeg (10) (PWarpC-SF-Net)  65.6  87.9  47.3  78.2  33.8  84.1
V  (III) + Max-score [Rocco2018b]  63.7  81.2  44.6  71.6  31.8  77.3
VI  (III) + Min-entropy [MinLPC20]  59.4  76.7  41.8  67.9  28.8  73.2
Appendix D PWarpC-NC-Net and PWarpC-NC-Net*
In this section, we first provide details about the NC-Net architecture and briefly review the training strategy of the original work. We then extensively explain our training approach and the corresponding implementation details, for both our weakly- and strongly-supervised approaches, PWarpC-NC-Net and PWarpC-NC-Net* respectively. Finally, we extensively ablate our approach for this architecture.
d.1 Details about NC-Net
Architecture: In [Rocco2018b], Rocco et al. introduce a learnable consensus network, applied on the 4D cost volume constructed between a pair of feature maps. Specifically, they process the cost volume with multiple 4D convolutional layers, to establish a strong locality prior on the relationships between the matches. The cost volume, before and after applying the 4D convolutions, is also processed with a soft mutual nearest-neighbor filtering.
Training strategy in original work: The baseline NC-Net is trained with a weakly-supervised strategy, using image-level class labels as the only supervision. The proposed objective maximizes the mean matching scores over all hard-assigned matches from the predicted cost volume constructed between image pairs of the same class, while minimizing the same quantity for image pairs of different classes. By retraining the NC-Net architecture with this strategy, we nevertheless found the training process to be quite unstable, with multiple training runs leading to substantially different performance.
d.2 PWarpC-NC-Net and PWarpC-NC-Net*: our training strategy
Warps sampling: For the weakly-supervised version, we resize the image pairs to a fixed size, sample a dense mapping $W$ of the same dimension, and create $J'$. Each of the images of the resulting image triplet is then centrally cropped to the training size.
For the strongly-supervised version, we apply the transformations on images of the same size as the crop. This is to avoid cropping out keypoint annotations.
As for the random mapping $W$, we apply horizontal flipping with a probability of 30%. We found that increasing the percentage of horizontal flipping for our PWarpC-NC-Net and PWarpC-NC-Net* networks, compared to the other networks, is beneficial in order to help stabilize the learning.
Weighting and details on the losses: For all losses, we use a smooth representation for the known probabilistic mapping $P_W$ (see Sec. A).
In general, we found the PWarp-supervision objective (8) to be slightly harmful for the PWarpC-NC-Net networks, and therefore did not include it in our final weakly- and strongly-supervised formulations. This is particularly the case when finetuning the features, which is the setting we used for our final PWarpC-NC-Net and PWarpC-NC-Net*. This is likely due to the network 'overfitting' to the synthetic image pairs and transformations involved in the PWarp-supervision loss, at the expense of the real images considered in the PW-bipath (9) and PNeg (10) objectives.
As a result, for the weakly-supervised version PWarpC-NC-Net, the PWarp-supervision term in (11) is dropped ($\lambda_{ws} = 0$), and $\lambda_{neg}$ is set to a fixed constant. For the strongly-supervised version PWarpC-NC-Net*, we likewise set $\lambda_{ws} = 0$, and set $\lambda_{bi}$ such that our probabilistic losses carry the same weight as the keypoint loss $L_{kp}$. Moreover, the keypoint loss is set as the cross-entropy loss, for both PWarpC-NC-Net* and its baseline NC-Net*.
PF-Pascal  PF-Willow  SPair-71K  TSS
Methods  PCK @ 0.05 / 0.10  PCK @ 0.05 / 0.10  PCK @ 0.10  PCK @ 0.05 (Avg.)
I  NC-Net baseline (Max-score) [Rocco2018b]  60.5  82.3  44.0  72.7  28.8  77.7
II  PW-bipath  diverged
III  + Visibility mask  64.7  83.8  45.4  75.9  32.8  82.7
IV  + PWarp-supervision  61.7  79.2  45.1  73.8  35.6  85.4
III  PW-bipath + Visibility mask  64.7  83.8  45.4  75.9  32.8  82.7
V  + PNeg  62.0  82.2  45.4  76.2  33.2  87.9
VI  + ft features (PWarpC-NC-Net)  64.2  84.4  45.0  75.9  35.3  89.2
VII  ft features from scratch  63.7  82.9  44.9  76.1  35.7  87.4
VI  PWarpC-NC-Net  64.2  84.4  45.0  75.9  35.3  89.2
I  Max-score (NC-Net baseline)  60.5  82.3  44.0  72.7  28.8  77.7
VIII  Min-entropy [MinLPC20]  55.6  79.2  42.0  72.3  25.4  78.4
IX  Warp Consistency [warpc]  59.1  75.0  44.6  70.1  35.0  87.0
III  PW-bipath + Visibility mask  64.7  83.8  45.4  75.9  32.8  82.7
V  (III) + PNeg (Ours)  62.0  82.2  45.4  76.2  33.2  87.9
X  (III) + Max-score  62.9  82.1  45.4  74.2  31.3  79.0
XI  (III) + Min-entropy  60.8  78.5  44.8  71.4  31.5  78.6
Implementation details: For PWarpC-NC-Net, we set the initial value of the learnable parameter $b$, corresponding to the unmatched state in our occlusion modeling, such that it lies in the same range as the cost volume values at initialization.
The SoftMax temperature $T$, corresponding to equation (4) of the main paper, is set to the same value as originally used in the baseline loss. The hyper-parameter $\kappa$ used in the estimation of our visibility mask (eq. 9 of the main paper) is set to a more restrictive value than for the other networks (when training on PF-Pascal), which we found beneficial to stabilize the training. It offers a better guarantee that the PW-bipath loss (9) is only applied in common visible object regions within the triplet.
Similarly to the baseline NC-Net [Rocco2018b], we train in two stages. In the first stage, we only train the consensus neighborhood network while keeping the ResNet-101 feature backbone fixed. We further finetune the last layer of the feature backbone, together with the consensus neighborhood network, in a second stage. These two stages are used when training on PF-Pascal [PFPascal] our final PWarpC-NC-Net and PWarpC-NC-Net* approaches, as well as the strongly-supervised baseline NC-Net*.
For training, we use similar training parameters as in the baseline NC-Net. We train with a batch size of 16, which is reduced to 8 when the last layer of the backbone is finetuned. During the first training stage on PF-Pascal, we train for a maximum of 30 epochs with a constant learning rate. During the second training stage on PF-Pascal, the learning rate is reduced and the network is trained for an additional 30 epochs.
We optionally further finetune the networks on SPair-71K [spair] for 10 epochs, with the same learning rate. Note that in this setting, the last layer of the feature backbone is also finetuned. The networks are trained using the Adam optimizer [adam] with the weight decay set to zero.
d.3 PWarpC-NC-Net: ablation study and comparison to previous works
Similarly to Sec. 5.4 of the main paper for PWarpC-SF-Net, we here provide a detailed analysis of our weakly-supervised approach PWarpC-NC-Net.
Ablation study: In the top part of Tab. 5, we analyze the key components of our weakly-supervised approach. The version denoted as (II) is trained using our PW-bipath objective (7), without the visibility mask. NC-Net trained with this loss diverged. With the NC-Net architecture, we found it crucial to extend our loss with our visibility mask (9), resulting in version (III). We believe that applying our PW-bipath loss on all pixels (II) confuses the NC-Net network, by enforcing matches even in e.g. non-matching background regions. Note that version (III), trained with our visibility-aware PW-bipath objective (9), already outperforms the baseline (I) on all datasets and for all thresholds. Further adding the PWarp-supervision loss (8) in (IV) leads to worse results than (III) on the PF-Pascal and PF-Willow datasets, despite bringing an improvement on SPair-71K and TSS. To obtain a final network achieving competitive results on all four datasets, we therefore do not include the PWarp-supervision objective (8) in our final formulation.
From (III), including our occlusion modeling, i.e. the unmatched state and its corresponding probabilistic negative loss (10), in (V) leads to notable gains on the PF-Willow, SPair-71K and TSS datasets. In (VI), we further finetune the last layer of the feature backbone with the neighborhood consensus network in a second training stage. It leads to substantial improvements on all datasets, except for PF-Willow, where the results remain almost unchanged.
From (VI) to (VII), we compare finetuning the feature backbone in a second training stage (VI) against doing so directly in a single training stage (VII). The former leads to better performance on the PF-Pascal dataset. As a result, version (VI) corresponds to our final weakly-supervised PWarpC-NC-Net, trained in two stages on PF-Pascal.
Comparison to other losses: In the middle part of Tab. 5, we compare our Probabilistic Warp Consistency approach to previous weakly-supervised alternatives. The baseline NC-Net, corresponding to version (I), is trained by maximizing the max scores of the predicted cost volumes for matching images. It leads to significantly worse results than our approach (VI) on all datasets and thresholds. The same conclusion applies to version (VIII), trained by minimizing the cost volume entropy for matching images. Finally, we compare our probabilistic approach (VI) to the mapping-based Warp Consistency method, corresponding to (IX). While Warp Consistency (IX) achieves good performance on the SPair-71K and TSS datasets, it leads to poor results on the PF-Pascal and PF-Willow datasets.
Comparison of objectives on negative image pairs: Finally, in the bottom part of Tab. 5, we compare multiple alternative losses applied on image pairs depicting different object classes. In particular, we combine our visibility-aware PW-bipath loss (III) with either our introduced probabilistic negative loss (10), minimizing the maximum scores [Rocco2018b], or maximizing the cost volume entropy [MinLPC20], in respectively (V), (X) and (XI). Our probabilistic negative loss (10) leads to significantly better results on the PF-Willow, SPair-71K and TSS datasets. We believe this is because it enables explicitly modelling occlusions and unmatched regions through our extended probabilistic formulation, including the unmatched state $\phi$.
Appendix E PWarpC-CATs
In this section, we first briefly review the CATs architecture and its original training strategy. We then provide details about the integration of our probabilistic approach into this architecture. Finally, we analyse the key components of our resulting strongly-supervised networks PWarpC-CATs and PWarpC-CATs-ft-features.
E.1 Details about CATs
Architecture: CATs [CATs] finds matches that are globally consistent by leveraging a Transformer architecture applied to slices of correlation maps constructed from multi-level features. The Transformer module alternates self-attention layers across points of the same correlation map with inter-correlation self-attention across the multi-level dimension.
Training strategy in original work: While the final output of the CATs architecture is a cost volume, the latter is converted to a dense mapping by first transforming it into a probabilistic mapping with SoftMax, and then applying soft-argmax. The network is then trained with the End-Point Error objective, leveraging the keypoint match annotations.
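For concreteness, a minimal sketch of this conversion and loss could look as follows, under assumed shapes and an illustrative temperature parameter; it is not the exact CATs implementation.

```python
import torch

def soft_argmax_epe(cost, gt_map, temperature=1.0):
    # cost:   (B, H*W, H, W)  score of each source pixel for every target pixel
    # gt_map: (B, 2, H, W)    ground-truth (x, y) source coordinates
    B, N, H, W = cost.shape
    prob = torch.softmax(cost / temperature, dim=1)          # match distributions
    # Pixel-coordinate grid of the source image, flattened to (N, 2).
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float().reshape(N, 2)
    # Expected source coordinate for every target pixel: soft-argmax.
    mapping = torch.einsum("bnhw,nc->bchw", prob, grid)      # (B, 2, H, W)
    # End-Point Error, in practice evaluated only at annotated keypoints.
    return (mapping - gt_map).norm(dim=1).mean()
```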
E.2 PWarpC-CATs: our training strategy
Warp sampling: We apply the transformations on images of fixed dimensions. We do not further crop the central images, to avoid cropping out keypoint annotations.
When training on PF-Pascal, we apply horizontal flipping with a fixed probability when sampling the random mappings; this probability is increased when training on SPair-71K.
Weighting and details of the losses: We define the known probabilistic mapping with a one-hot representation for our PW-bipath (9) and PWarp-supervision (8) losses (see Sec. A).
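As an illustration, such a one-hot target could be constructed from a dense synthetic warp as sketched below; the function name and the integer-coordinate assumption are ours, for exposition only.

```python
import torch

def one_hot_mapping(warp_xy, H, W):
    # warp_xy: (H*W, 2) integer (x, y) location that each source pixel is
    # mapped to under the known synthetic warp.
    N = H * W
    idx = warp_xy[:, 1].long() * W + warp_xy[:, 0].long()    # flatten (x, y)
    M = torch.zeros(N, N)
    M[torch.arange(N), idx.clamp(0, N - 1)] = 1.0            # one 1 per row
    return M
```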
To obtain PWarpC-CATs, we set the weights in (12) such that our probabilistic losses contribute the same magnitude as the keypoint loss.
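The resulting objective (12) then amounts to a weighted sum, sketched here with hypothetical weight names:

```python
def strongly_supervised_loss(L_kp, L_pw_bipath, L_pwarp_sup, w_bipath, w_sup):
    # Hypothetical form of the total objective (12): the keypoint loss plus
    # the probabilistic terms, with weights balancing their magnitudes.
    return L_kp + w_bipath * L_pw_bipath + w_sup * L_pwarp_sup
```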
To obtain PWarpC-CATs-ft-features, where the ResNet-101 feature backbone is additionally fine-tuned, we found the PWarp-supervision objective (8) to be slightly harmful, and therefore did not include it in this case. This is consistent with the findings for PWarpC-NCNet and PWarpC-NCNet*, for which the PWarp-supervision objective was also found harmful when the feature backbone is fine-tuned. This is likely due to the network 'overfitting' to the synthetic image pairs and transformations involved in the PWarp-supervision loss, at the expense of the real images considered in the PW-bipath objective (9). As a result, for the PWarpC-CATs-ft-features version, we set the weights in (12) accordingly.
Moreover, to be consistent with the baseline CATs, the keypoint loss is set to the End-Point Error loss, after converting the probabilistic mapping to a dense mapping through soft-argmax.
Implementation details: The softmax temperature, corresponding to equation (4) of the main paper, is set to the same value as originally used in the baseline. The hyperparameter used in the estimation of our visibility mask (eq. 9 of the main paper) is set to different values when training on PF-Pascal and on SPair-71K. This is because in SPair-71K, the objects are generally much smaller than in PF-Pascal.
For training, we use similar training parameters as in the baseline CATs. We train with a batch size of 16 when the feature backbone is frozen, and reduce it to 7 when fine-tuning the backbone. Separate initial learning rates are used for the feature backbone and for the rest of the architecture. They are halved after 80, 100 and 120 epochs, and we train for a maximum of 150 epochs. We use the same training parameters when training on either PF-Pascal or SPair-71K. The networks are trained using the AdamW optimizer [AdamW].
Tab. 6: Ablation study for PWarpC-CATs (top) and PWarpC-CATs-ft-features (bottom).

Methods | PF-Pascal | PF-Willow | SPair-71K | TSS
CATs baseline (EPE) | 67.3 / 88.6 | 41.6 / 68.9 | 22.1 | 74.8
+ Vis-aware PW-bipath | 68.1 / 88.5 | 44.0 / 70.6 | 21.4 | 76.3
+ PWarp-supervision (PWarpC-CATs) | 67.1 / 88.5 | 44.2 / 71.2 | 23.3 | 82.4
CATs-ft-features (EPE) | 79.8 / 92.7 | 45.2 / 73.2 | 26.8 | 78.4
+ Vis-aware PW-bipath (PWarpC-CATs-ft-features) | 79.8 / 92.6 | 48.1 / 75.1 | 27.9 | 88.7
+ PWarp-supervision | 79.6 / 92.4 | 46.7 / 74.4 | 26.0 | 88.7
E.3 Ablation study
In Tab. 6, we analyse the key components of our strongly-supervised approaches PWarpC-CATs (top part) and PWarpC-CATs-ft-features (bottom part). From the CATs baseline, which is trained with the End-Point Error (EPE) objective while keeping the feature backbone frozen, adding our visibility-aware PW-bipath loss (9) leads to a substantial gain on the PF-Willow and TSS datasets. Further including our PWarp-supervision objective results in improved performance on PF-Willow, SPair-71K and TSS. For the versions that fine-tune the feature backbone (bottom part of Tab. 6), our visibility-aware PW-bipath objective brings major gains on PF-Willow, SPair-71K and TSS. However, further adding the PWarp-supervision loss leads to a small drop in performance on all datasets. For this reason, we use the combination of the EPE loss with our visibility-aware PW-bipath objective to train our final PWarpC-CATs-ft-features.
Appendix F PWarpC-DHPF
As in previous sections, we first review the DHPF [MinLPC20] architecture and its original training strategy. We then provide training details for our strongly-supervised PWarpC-DHPF. Finally, we provide an ablation study for our approach applied to this architecture.
F.1 Details about DHPF
Architecture: DHPF learns to compose hypercolumn features, i.e. aggregations of different layers, on the fly, by selecting a small number of relevant layers from a deep convolutional neural network. In particular, it proposes a gating mechanism to choose which layers to include in the hypercolumn. The hypercolumn features are then correlated, leading to the final output cost volume.
Training strategy in original work: The original work proposes both a weakly- and a strongly-supervised approach. The weakly-supervised approach is trained by minimizing the cost-volume entropy computed between image pairs depicting the same class, while maximizing it for pairs depicting different semantic content.
The strongly-supervised approach is instead trained with the cross-entropy loss, after converting the keypoint match annotations to probability distributions. In both cases, the authors also include a layer-selection loss, a soft constraint encouraging the network to select each layer of the feature backbone at a certain rate.
F.2 PWarpC-DHPF: our training strategy
Warp sampling: We apply the transformations on images of fixed dimensions. Similarly to PWarpC-CATs, we do not further crop the central images, to avoid cropping out keypoint annotations.
When training on PF-Pascal, we apply horizontal flipping with a fixed probability when sampling the random mappings; this probability is increased when training on SPair-71K.
Weighting and details of the losses: We define the known probabilistic mapping with a smooth representation for our PW-bipath (9) and PWarp-supervision (8) losses (see Sec. A).
To obtain PWarpC-DHPF, we set the weights in (12) such that our probabilistic losses contribute the same magnitude as the keypoint loss.
Moreover, the strongly-supervised baseline DHPF is trained with a keypoint loss corresponding to the cross-entropy with the ground-truth keypoint matches converted to one-hot probabilistic mapping representations. We nevertheless found that the baseline is slightly improved when the ground-truth keypoint matches are instead converted to smooth probability distributions. We denote this version as DHPF* and compare it to our final PWarpC-DHPF in Tab. 7. As a result, for our PWarpC-DHPF, we set the keypoint loss in (12) to the cross-entropy with a smooth representation of the ground-truth keypoint match distributions. Finally, for a fair comparison, we add the layer-selection loss used in the baseline DHPF to our strongly-supervised loss (12).
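To illustrate the smooth representation, the sketch below converts a single ground-truth keypoint match into a normalized Gaussian distribution over the pixel grid, instead of a one-hot target; the function name and default sigma are assumptions.

```python
import torch

def smooth_keypoint_target(kp_xy, H, W, sigma=2.0):
    # kp_xy: (2,) float (x, y) ground-truth match location in the other image.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    d2 = (xs - kp_xy[0]) ** 2 + (ys - kp_xy[1]) ** 2         # squared distance
    g = torch.exp(-d2.float() / (2 * sigma ** 2))            # unnormalized Gaussian
    return (g / g.sum()).reshape(-1)                         # (H*W,), sums to 1
```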
Implementation details: The softmax temperature, corresponding to equation (4) of the main paper, is set as in the baseline loss. Note that, following the baseline DHPF, we apply Gaussian normalization on the cost volume before applying the SoftMax operation (4) to convert it to a probabilistic mapping. The hyperparameter used in the estimation of our visibility mask (eq. 9 of the main paper) is set to different values when training on PF-Pascal and on SPair-71K. This is because in SPair-71K, the objects are generally much smaller than in PF-Pascal.
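Assuming 'Gaussian normalization' amounts to standardizing the cost volume to zero mean and unit variance, a minimal sketch of this step would be:

```python
import torch

def normalized_softmax(cost, temperature):
    # Standardize the cost volume before the SoftMax of eq. (4),
    # as done in the DHPF baseline (assumed interpretation).
    cost = (cost - cost.mean()) / cost.std().clamp_min(1e-6)
    return torch.softmax(cost / temperature, dim=-1)
```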
For training, we use similar training parameters as in the baseline DHPF. We train on PF-Pascal with a batch size of 6 for a maximum of 100 epochs. The initial learning rate is halved after 50 epochs. We optionally further fine-tune the network on SPair-71K for an additional 10 epochs with a constant learning rate. The networks are trained using the SGD optimizer [ruder2016overview].
Tab. 7: Ablation study for PWarpC-DHPF.

Methods | PF-Pascal | PF-Willow | SPair-71K | TSS
DHPF baseline (CE with one-hot) | 77.3 / 91.7 | 44.8 / 70.6 | 27.5 | 72.2
DHPF* (CE with smooth) | 78.1 / 90.7 | 44.7 / 70.1 | 27.9 | 74.02
+ Vis-aware PW-bipath | 76.3 / 90.7 | 47.3 / 73.6 | 28.0 | 73.7
+ PWarp-supervision (PWarpC-DHPF) | 77.7 / 91.7 | 47.7 / 74.3 | 28.6 | 74.3
F.3 Ablation study
In Tab. 7, we conduct ablative experiments on PWarpC-DHPF. Training with the cross-entropy loss using a smooth representation of the ground-truth in DHPF* leads to slightly better results than DHPF on PF-Pascal and SPair-71K. For this reason, we use it as our baseline. Further including our visibility-aware PW-bipath loss and PWarp-supervision leads to incremental gains on PF-Willow and SPair-71K.
Appendix G Analysis of transformations W
In this section, we analyse the impact of the strength of the sampled transformations on the performance of the corresponding trained PWarpC networks. As explained in Sec. B, the strength of the warps is mostly controlled by the range used to sample the base homography, TPS and Affine-TPS transformations. The probability of horizontal flipping also has a large impact. We thus analyse the effect of the sampling range and of the probability of horizontal flipping on the evaluation results of the corresponding PWarpC networks. In particular, we provide the analysis for our weakly-supervised PWarpC-SFNet; the trend is the same for the other PWarpC networks.
While we choose a specific distribution to sample the transformation parameters used to construct the mapping, our experiments show that the performance of networks trained with our proposed Probabilistic Warp Consistency loss is relatively insensitive to the strength of the transformations, provided they remain within reasonable bounds. We present these experiments in Fig. 8.
In Fig. 8 (A), we analyse the impact of the sampling range on the performance of PWarpC-SFNet. Any range within the considered interval leads to similar performance across datasets and thresholds. Only on PF-Pascal does increasing the range improve results up to a certain value, beyond which performance drops. We select this best-performing range in our final setting.
We then look at the impact of the probability of horizontal flipping in Fig. 8 (B). On PF-Pascal, increasing the probability of flipping leads to an increase in performance up to a certain value; increasing it further nevertheless results in a gradual drop. The trend is the same on SPair-71K, except that the best results are achieved for a higher flipping probability. We set these values for our final PWarpC networks.
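To make the two analysed knobs concrete, the following sketch shows one plausible way of sampling such warp parameters, with a single range controlling the homography and TPS perturbations and a separate flip probability; all names, shapes and default values are illustrative, not the exact scheme of Sec. B.

```python
import random
import torch

def sample_warp(range_scale=0.4, p_flip=0.5):
    # Corner offsets for the base homography, drawn uniformly from
    # [-range_scale, range_scale] in normalized image coordinates.
    homography = (torch.rand(4, 2) * 2 - 1) * range_scale
    # Offsets for a 3x3 grid of TPS control points, from the same range.
    tps = (torch.rand(3, 3, 2) * 2 - 1) * range_scale
    # Horizontal flip applied with probability p_flip.
    flip = random.random() < p_flip
    return homography, tps, flip
```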
Tab. 8: Results when training on SPair-71K. S, U, M and W denote strongly-supervised, unsupervised, mask-supervised and weakly-supervised approaches respectively. 'Reso' is the image resolution used; '-' marks unreported entries. PF-Pascal and PF-Willow report three PCK thresholds, SPair-71K two; TSS reports FG3DCar / JODS / Pascal / Avg.

Strongly-supervised (S):
HPF [HPF] | max | - | - | - / 28.2 | -
SCOT [SCOT] | max 300 | - | - | - / 35.6 | -
CHM [CHM] | - | - | - | - / 46.3 | -
PMD [PMD] | - | - | - | - / 37.4 | -
PMNC [PMNC] | - | - | - | - / 50.4 | -
MMNet [MMNet] | - | - | - | - / 40.9 | -
DHPF [MinLPC20] | - | 52.6 / 75.4 / 84.8 | 37.4 / 63.9 / 77.0 | 20.7 / 37.3 | -
CATs [CATs] | - | 45.3 / 67.7 / 77.0 | 31.8 / 56.8 / 69.1 | 21.9 / 42.4 | -
CATs-ft-features [CATs] | - | 54.4 / 74.1 / 81.9 | 39.7 / 66.3 / 78.3 | 27.9 / 49.9 | -
CATs-ft-features [CATs] | ori | 57.7 / 75.2 / 82.9 | 43.5 / 69.1 / 80.8 | 27.1 / 48.8 | 88.9 / 73.9 / 57.1 / 73.3
PWarpC-CATs-ft-features | ori | 58.8 / 77.4 / 84.6 | 46.4 / 73.6 / 85.0 | 28.2 / 48.4 | 91.1 / 85.8 / 69.1 / 82.0
DHPF [MinLPC20] | ori | 56.9 / 77.2 / 86.3 | 40.9 / 66.8 / 79.9 | 20.6 / 36.3 | 83.8 / 69.7 / 57.3 / 70.3
PWarpC-DHPF | ori | 65.8 / 85.5 / 92.3 | 47.6 / 72.9 / 84.5 | 23.3 / 38.7 | 87.5 / 73.7 / 60.3 / 73.8
NCNet* | ori | 59.8 / 75.6 / 82.1 | 38.9 / 62.6 / 74.7 | 29.1 / 50.7 | 81.1 / 66.7 / 45.4 / 64.4
PWarpC-NCNet* | ori | 67.8 / 82.3 / 86.9 | 46.1 / 72.6 / 82.7 | 31.6 / 52.0 | 93.0 / 84.6 / 70.6 / 82.7
SFNet* | ori | 66.5 / 85.0 / 90.8 | 43.5 / 70.4 / 82.9 | 26.2 / 50.0 | 88.3 / 75.3 / 57.2 / 73.6
PWarpC-SFNet* | ori | 72.1 / 89.6 / 93.5 | 46.3 / 75.2 / 87.0 | 27.0 / 48.8 | 92.5 / 81.1 / 66.2 / 79.9

Unsupervised (U):
CNNGeo [Rocco2017a] (results from [spair]) | - | - | - | - / 20.6 | -
A2Net [SeoLJHC18] (results from [spair]) | - | - | - | - / 22.3 | -

Mask-supervised (M):
SFNet [SFNet] (results from [PMNC]) | - | - | - | - / 26.3 | -

Weakly-supervised (W):
PWarpC-SFNet | ori | 64.5 / 86.9 / 92.6 | 47.1 / 78.1 / 89.9 | 18.6 / 37.1 | 91.0 / 81.6 / 67.4 / 80.0
WeakAlign [Rocco2018a] (results from [spair]) | - | - | - | - / 20.9 | -
DHPF [MinLPC20] | - | 46.1 / 78.1 / 88.4 | 34.9 / 66.2 / 82.5 | 12.4 / 27.7 | -
DHPF [MinLPC20] | ori | 53.3 / 81.3 / 90.3 | 40.9 / 70.1 / 84.6 | 12.7 / 27.2 | -
PMD [PMD] | - | - | - | - / 26.5 | -
WarpC-SemGLUNet [warpc] | ori | 57.0 / 78.7 / 88.7 | 46.1 / 72.8 / 84.9 | 12.8 / 23.5 | 96.3 / 84.2 / 80.2 / 86.9
NCNet [Rocco2018b] (results from [spair]) | - | - | - | - / 20.1 | -
PWarpC-NCNet | ori | 61.7 / 82.6 / 88.5 | 43.6 / 74.6 / 86.9 | 18.5 / 38.0 | 95.4 / 88.9 / 85.6 / 90.0
G.1 Results when training on SPair-71K
To better understand the performance of our training approach under complex conditions, we report results according to different variation factors with various difficulty levels. In particular, the SPair-71K dataset contains diverse variations in viewpoint, scale, truncation and occlusion. In addition to the keypoint match annotations, the dataset also provides specific annotations for each of the variation factors, with different levels of difficulty. We are particularly interested in the occlusion setting.
Weakly-supervised: In the bottom part of Tab. 8, we compare approaches trained in a weakly-supervised manner. Our PWarpC-SFNet and PWarpC-NCNet trained on PF-Pascal were further fine-tuned on SPair-71K with our Probabilistic Warp Consistency objective (11). Note that the baselines SFNet and NCNet were obtained by fine-tuning on SPair-71K the original models trained on PF-Pascal, with their respective original training strategies. Our weakly-supervised PWarpC-SFNet and PWarpC-NCNet lead to particularly impressive improvements compared to their respective baselines, with absolute gains of +10.8 and +17.9 in PCK on SPair-71K. As a result, PWarpC-SFNet sets a new state of the art on the PF-Willow and PF-Pascal datasets, and PWarpC-NCNet on the SPair-71K and TSS datasets, across all unsupervised (U), weakly-supervised (W) and mask-supervised (M) approaches trained on SPair-71K.
Strongly-supervised: In the top part of Tab. 8, we report results of models trained in a strongly-supervised manner, leveraging keypoint match annotations. While training on SPair-71K with our approach leads to results similar to the baselines on SPair-71K, our PWarpC networks show drastically better generalization to PF-Pascal, PF-Willow and TSS. Our strongly-supervised PWarpC-NCNet* sets a new state of the art on SPair-71K and TSS, across all strongly-supervised approaches trained on SPair-71K. Our PWarpC-SFNet* also obtains state-of-the-art results on the PF-Pascal and PF-Willow datasets.
Results on SPair-71K per variation factor and difficulty level (PCK). U, M and W denote unsupervised, mask-supervised and weakly-supervised approaches respectively.

Methods | Reso | Viewpoint (easy / medi / hard) | Scale (easy / medi / hard) | Truncation (none / src / trg / both) | Occlusion (none / src / trg / both) | All

Unsupervised (U):
CNNGeo (from [spair]) | - | 25.2 / 10.7 / 5.9 | 22.3 / 16.1 / 8.5 | 21.1 / 12.7 / 15.6 / 13.9 | 20.0 / 14.9 / 14.3 / 12.4 | 18.1
A2Net (from [spair]) | - | 27.5 / 12.4 / 6.9 | 24.1 / 18.5 / 10.3 | 22.9 / 15.2 / 17.6 / 15.7 | 22.3 / 16.5 / 15.2 / 14.5 | 20.1

Mask-supervised (M):
SFNet | ori | 32.0 / 15.5 / 10.0 | 28.4 / 22.0 / 13.2 | 27.0 / 20.1 / 20.0 / 18.7 | 26.6 / 18.5 / 18.9 / 18.0 | 24.0

Weakly-supervised (W):
PWarpC-SFNet | ori | 41.9 / 24.2 / 20.7 | 39.1 / 31.8 / 18.8 | 36.3 / 29.7 / 30.4 / 28.4 | 36.5 / 27.7 / 27.9 / 24.7 | 33.5
WeakAlign (from [spair]) | - | 29.4 / 12.2 / 6.9 | 25.4 / 19.4 / 10.3 | 24.1 / 16.0 / 18.5 / 15.7 | 23.4 / 16.7 / 16.7 / 14.8 | 21.1
NCNet (from [spair]) | - | 34.0 / 18.6 / 12.8 | 31.7 / 23.8 / 14.2 | 29.1 / 22.9 / 23.4 / 21.0 | 29.0 / 21.1 / 21.8 / 19.6 | 26.4
NCNet | ori | 37.6 / 19.4 / 13.8 | 34.7 / 26.0