
Joint Learning of Feature Extraction and Cost Aggregation for Semantic Correspondence

by Jiwon Kim, et al.

Establishing dense correspondences across semantically similar images is a challenging task due to significant intra-class variations and background clutter. To address this, numerous methods have been proposed that focus on learning the feature extractor or the cost aggregation independently, which yields sub-optimal performance. In this paper, we propose a novel framework for jointly learning feature extraction and cost aggregation for semantic correspondence. By exploiting the pseudo labels from each module, the networks consisting of feature extraction and cost aggregation modules are simultaneously learned in a boosting fashion. Moreover, to ignore unreliable pseudo labels, we present a confidence-aware contrastive loss function for learning the networks in a weakly-supervised manner. We demonstrate competitive results on standard benchmarks for semantic correspondence.




1 Introduction

Semantic correspondence is an essential task for various computer vision applications [23, 25, 1], which generally aims to establish pixel-wise, locally-consistent correspondences across semantically similar images. It is extremely challenging because finding semantic correspondences can easily be distracted by non-rigid deformations and large appearance variations within the same class [10].

Recent methods [21, 1] solved the task by designing deep Convolutional Neural Networks (CNNs). These networks often consist of feature extraction and cost aggregation steps. For the feature extraction step, instead of relying on hand-crafted descriptors as in conventional methods [14], there has been increasing interest in leveraging the representation power of CNNs [2, 8, 10]. However, such features can struggle to determine the correct matches from the cost volume, because repetitive patterns and occlusions often generate ambiguous matching pairs. For the cost aggregation step, methods [18, 19, 21, 16, 20, 12, 9, 1] attempt to identify the correct matches among a great majority of dense, non-distinctive matching pairs. Unlike previous hand-designed strategies [14, 11], recent methods incorporate trainable matching cost aggregation into the overall network [18, 19, 21, 16, 20, 12, 9, 1].

Learning semantic correspondence networks consisting of feature extraction and cost aggregation modules in a supervised manner requires large-scale ground truth, which is notoriously hard to build. To alleviate this, several methods leverage pseudo-labels extracted from the network's own predictions by Winner-Take-All (WTA), and train the networks in an unsupervised¹ manner [19, 21]. Although these are appealing alternatives, they are sensitive to uncertain pseudo labels. Furthermore, jointly using the pseudo labels from the feature extraction and cost aggregation modules may boost performance, but this approach has not been studied.

¹ They are often also called weakly-supervised methods since they require image pairs.

Some recent self-supervised methods [17, 26] use a dense contrastive loss for pixel-wise prediction tasks instead of an image-level contrastive loss. However, they do not account for unconfident matches caused by repetitive patterns, occlusions, and semantic appearance variations.

(a) Learning feature extraction only
(b) Learning cost aggregation only
(c) Ours
Figure 1: Intuition of our method: (a) methods that solely train a feature extraction network [2, 8, 10], (b) methods that solely train a cost aggregation network for filtering the ambiguous matching cost [18, 19, 21, 16, 9], and (c) ours, which jointly trains the feature extraction and cost aggregation networks with the proposed loss function.

In this work, we present a novel framework that jointly learns feature extraction and cost aggregation for semantic correspondence. Motivated by the observation that the pseudo labels from the feature extraction and cost aggregation steps can complement each other, we train the two modules jointly by exploiting the pseudo labels each produces for the other. In addition, to filter out unreliable pseudo labels, we present a confidence-aware contrastive loss function that exploits forward-backward consistency to remove incorrect matches. Experiments on several benchmarks [4, 24] demonstrate the robustness of the proposed method compared to the latest methods for semantic correspondence.

2 Methodology

2.1 Preliminaries

Given a pair of images, i.e., source $I^s$ and target $I^t$, which represent semantically similar images, the objective of semantic correspondence is to predict a dense correspondence field between the two images at each pixel. To achieve this, most predominant methods consist of two steps, namely feature extraction and cost aggregation [8, 21]. Specifically, the first step applies feature extraction networks to obtain 3D tensors $F^s, F^t \in \mathbb{R}^{h \times w \times c}$, where $h \times w$ is the spatial resolution of the feature maps and $c$ denotes the number of channels. To estimate correspondences, the similarities between the feature maps $F^s$ and $F^t$ from the source and target images, respectively, are measured, which outputs a 4D cost volume $\mathcal{C} \in \mathbb{R}^{h \times w \times h \times w}$, computed through a cosine similarity score such that $\mathcal{C}(i, j) = F^s(i) \cdot F^t(j) / (\|F^s(i)\| \|F^t(j)\|)$ for source and target positions $i$ and $j$. Estimating the correspondence with sole reliance on these matching similarities is sensitive to matching outliers, and thus a cost aggregation step is used to refine the initial matching similarities into an aggregated cost $\mathcal{A}$ through cost aggregation networks.
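The matching-cost construction described above can be sketched in a few lines. The following is a minimal PyTorch sketch (function name and tensor layout are our own, not the paper's) of the cosine-similarity cost volume $\mathcal{C}$:

```python
import torch
import torch.nn.functional as F

def cost_volume(feat_s, feat_t):
    """Build a 4D cost volume of cosine similarities.

    feat_s, feat_t: (c, h, w) feature maps from the source / target images.
    Returns a (h, w, h, w) tensor C where C[iy, ix, jy, jx] is the cosine
    similarity between source location i and target location j.
    """
    c, h, w = feat_s.shape
    fs = F.normalize(feat_s.reshape(c, -1), dim=0)  # (c, h*w), unit-norm columns
    ft = F.normalize(feat_t.reshape(c, -1), dim=0)
    # Dot products of unit vectors are exactly the cosine similarities.
    return (fs.t() @ ft).reshape(h, w, h, w)
```

Normalizing the features first means the single matrix product yields all $h \times w \times h \times w$ pairwise scores at once, which is the standard way such volumes are computed in practice.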

Learning such networks, i.e., feature extraction and cost aggregation modules, in a supervised manner requires manually annotated ground-truth correspondences, which are extremely labor-intensive and subjective to build [24, 4, 21]. To overcome this challenge, an alternative is to leverage the Winner-Take-All (WTA) matching point, i.e., the most likely match obtained by an argmax over $\mathcal{C}$ or $\mathcal{A}$, as a pseudo correspondence label $l^*$. For instance, NCNet [21] (and its variants [12, 20, 9]) and DenseCL [26] utilized such correspondences to learn the cost aggregation networks and feature extraction networks, respectively, in an unsupervised fashion, as exemplified in Fig. 1. Although these are certainly appealing alternatives, such frameworks are highly sensitive to uncertain pseudo labels. Moreover, no prior study jointly trains the feature extraction and cost aggregation networks in a complementary, mutually boosting manner.
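WTA pseudo-label extraction amounts to a per-target-pixel argmax over the cost volume. A minimal sketch (helper name and coordinate convention are our own):

```python
import torch

def wta_pseudo_labels(cost):
    """Winner-Take-All pseudo correspondences from a (h, w, h, w) cost volume.

    For every target location j, pick the source location i with the highest
    matching score, i.e. l(j) = argmax_i C(i, j).
    Returns a (h, w, 2) tensor of (y, x) source coordinates per target pixel.
    """
    h, w = cost.shape[:2]
    flat = cost.reshape(h * w, h * w)   # rows: source index i, cols: target index j
    idx = flat.argmax(dim=0)            # best source index for each target pixel
    return torch.stack((idx // w, idx % w), dim=-1).reshape(h, w, 2)
```

The same helper applies unchanged to the raw cost $\mathcal{C}$ or the aggregated cost $\mathcal{A}$, since both share the 4D layout.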

Figure 2: Comparison on PF-Pascal [4], applying masking and warping by the predicted correspondence map. At a later iteration (1,000), the pseudo labels from the aggregated cost yield more confident correspondences than those from the raw matching cost, while at an earlier iteration (500) the opposite holds.

2.2 Confidence-aware Contrastive Learning

In this section, we first study how to achieve better pseudo labels for dense correspondence, and then present a confidence-aware contrastive learning.

We start from the classic uncertainty measurement based on forward-backward consistency checking proposed in [22, 15, 13], where an argmax operator is applied twice, once each for the forward and backward directions. Specifically, the pseudo matching map $P^{t \to s}$ from the matching cost $\mathcal{C}$, warping the target toward the source, is defined as follows:

$$P^{t \to s}(j) = \underset{i}{\operatorname{argmax}}\ \mathcal{C}(i, j),$$

where $P^{t \to s}(j)$ is defined for all points $j$ in the target. Similarly, $P^{s \to t}$ can be computed, warping the source toward the target. In a non-occluded region, the backward flow vector $w^b$ points in the inverse direction of the forward flow vector $w^f$. If this consistency constraint is not satisfied, the points in the target are occluded at the matched locations in the source, or the estimated flow vector is incorrect. This constraint can be defined such that

$$\left| w^f(j) + w^b\!\left(j + w^f(j)\right) \right|^2 < \alpha_1 \left( \left| w^f(j) \right|^2 + \left| w^b\!\left(j + w^f(j)\right) \right|^2 \right) + \alpha_2.$$

Since there may be some estimation errors in the flows, we grant a tolerance interval by setting the hyper-parameters $\alpha_1$ and $\alpha_2$. A binary mask $M$ is then obtained by this forward-backward consistency checking, representing a non-occluded region as 1 and an occluded region as 0.
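The consistency check can be implemented directly on the argmax matches. Below is a sketch under our own naming, with the tolerance hyper-parameters called alpha1 and alpha2 (placeholder values, not the paper's):

```python
import torch

def consistency_mask(fwd, bwd, alpha1=0.01, alpha2=0.5):
    """Forward-backward consistency mask: 1 = reliable match, 0 = occluded/incorrect.

    fwd: (h, w, 2) integer argmax matches, target -> source.
    bwd: (h, w, 2) integer argmax matches, source -> target.
    A target pixel j is kept when following its match into the source and
    back again lands close to j, within the tolerance interval.
    """
    h, w = fwd.shape[:2]
    grid = torch.stack(torch.meshgrid(torch.arange(h), torch.arange(w),
                                      indexing="ij"), dim=-1)   # coords of j
    roundtrip = bwd[fwd[..., 0], fwd[..., 1]]                   # P_bwd(P_fwd(j))
    fwd_vec = (fwd - grid).float()        # forward flow w_f at j
    bwd_vec = (roundtrip - fwd).float()   # backward flow w_b at the matched point
    err = (fwd_vec + bwd_vec).pow(2).sum(-1)          # cycle error |w_f + w_b|^2
    tol = alpha1 * (fwd_vec.pow(2).sum(-1) + bwd_vec.pow(2).sum(-1)) + alpha2
    return (err < tol).float()
```

Note that the cycle error collapses to the distance between j and its round-trip position, so perfectly cycle-consistent matches always pass the check.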

Based on the estimated mask $M$, we present a confidence-aware contrastive loss function, aiming to maximize the similarities at the reliably matched points while minimizing them at the others, defined such that

$$\mathcal{L}_{\mathcal{C}} = -\frac{1}{N} \sum_{j} M(j) \log \frac{\exp\left(\mathcal{C}(l^*(j), j)/\tau\right)}{\sum_{i} \exp\left(\mathcal{C}(i, j)/\tau\right)},$$

where $N$ is the number of non-occluded pixels and $\tau$ is a temperature hyper-parameter. Our loss function rejects ambiguous matches through the thresholded mask while accepting the confident matches. While $\mathcal{L}_{\mathcal{C}}$ is formulated to train the feature extraction network itself, the same loss can also be defined on the aggregated cost $\mathcal{A}$ as $\mathcal{L}_{\mathcal{A}}$, which can be used to train the cost aggregation networks.
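This is an InfoNCE-style objective restricted to confident pixels. A minimal sketch (our own naming; the temperature default is a placeholder, and the paper's exact formulation may differ in detail):

```python
import torch

def confidence_aware_contrastive_loss(cost, pseudo, mask, tau=0.07):
    """Contrastive loss over reliable matches only.

    cost:   (h, w, h, w) similarity volume (source x target).
    pseudo: (h, w, 2) pseudo-label source coordinates per target pixel.
    mask:   (h, w) binary confidence mask from forward-backward checking.
    For each confident target pixel j, the positive is its pseudo match
    l*(j); every other source location acts as a negative.
    """
    h, w = mask.shape
    logits = cost.reshape(h * w, h * w).t() / tau            # rows: target j, cols: source i
    pos = (pseudo[..., 0] * w + pseudo[..., 1]).reshape(-1)  # flat index of each positive
    log_prob = torch.log_softmax(logits, dim=1)
    nll = -log_prob.gather(1, pos.unsqueeze(1)).squeeze(1)   # -log p(positive | j)
    m = mask.reshape(-1)
    return (nll * m).sum() / m.sum().clamp(min=1)            # mean over confident pixels
```

Masked pixels contribute zero to the numerator and are excluded from the normalizer, matching the $1/N$ averaging over non-occluded pixels in the loss above.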

Figure 3: Overall framework.

2.3 Joint Learning

The proposed confidence-aware contrastive loss function can be used independently for learning the feature extraction and cost aggregation networks through $\mathcal{L}_{\mathcal{C}}$ and $\mathcal{L}_{\mathcal{A}}$, respectively. However, since the two pseudo labels from the feature extraction and cost aggregation networks may carry complementary information, using them jointly can further boost performance. For instance, at early stages of training, the pseudo labels produced by the pre-trained feature extractor provide more reliable cues than those from the randomly-initialized cost aggregation networks, which helps the cost aggregation networks converge much faster, as exemplified in Fig. 2. In addition, as training progresses, the well-trained cost aggregation networks produce correspondences superior to the pseudo labels from the feature extractor, as also exemplified in Fig. 2.


Method | Base | PF-PASCAL (α = 0.1) | PF-Willow (α = 0.1) | TSS FG3D (α = 0.05) | TSS JODS | TSS PASCAL | TSS Avg.
WTA | ResNet-101 | 53.3 | 46.9 | - | - | - | -
WeakAlign [19] | ResNet-101 | 75.8 | 84.3 | 90.3 | 76.4 | 56.5 | 74.4
RTNs [7] | ResNet-101 | 75.9 | 71.9 | 90.1 | 78.2 | 63.3 | 72.2
NCNet [21] | ResNet-101 | 78.9 | 84.3 | 94.5 | 81.4 | 57.1 | 77.7
DCCNet [6] | ResNet-101 | 82.3 | 73.8 | 93.5 | 82.6 | 57.6 | 77.9
ANCNet [12] | ResNet-101 | 86.1 | - | - | - | - | -
PMNC [9] | ResNet-101 | 90.6 | - | - | - | - | -
CATs [1] | ResNet-101 | 87.3 | 76.9 | 85.3 | 73.7 | 55.4 | 73.6
Ours w/NCNet | ResNet-101 | 80.0 | 86.6 | 95.0 | 82.3 | 55.8 | 78.4
Ours w/CATs | ResNet-101 | 92.5 | 79.8 | 91.7 | 81.2 | 60.9 | 80.0


Table 1: Comparison with state-of-the-art methods on standard benchmarks [4, 24].

To leverage complementary information during training, we use the pseudo-label output of each module, namely the feature extraction and cost aggregation modules, defined such that

$$l^*_{\mathcal{C}}(j) = \underset{i}{\operatorname{argmax}}\ \mathcal{C}(i, j),$$

where $j$ ranges over all points in the target; $l^*_{\mathcal{A}}$ is similarly defined from the aggregated cost $\mathcal{A}$. Our final loss function is thus defined such that

$$\mathcal{L} = \lambda_{\mathcal{C}} \mathcal{L}_{\mathcal{C}} + \lambda_{\mathcal{A}} \mathcal{L}_{\mathcal{A}},$$

where $\lambda_{\mathcal{C}}$ and $\lambda_{\mathcal{A}}$ represent weighting hyper-parameters. Fig. 3 illustrates the overall architecture of the proposed method.
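One plausible reading of the joint scheme, where each module is supervised with the pseudo labels and confidence mask produced by the other, can be sketched as follows (function names and the exact cross-wiring are our own assumptions about the paper's description, not a definitive implementation):

```python
def joint_losses(cost_feat, cost_agg, labels_of, mask_of, loss_fn):
    """Sketch of the joint objective with cross pseudo-labels.

    cost_feat: raw matching cost from the feature extraction network.
    cost_agg:  aggregated cost from the cost aggregation network.
    labels_of / mask_of: callables returning the WTA pseudo labels and the
        forward-backward confidence mask of a given cost volume (assumed helpers).
    loss_fn: a confidence-aware contrastive loss taking (cost, labels, mask).
    Returns the two loss terms to be combined with the weights lambda_C, lambda_A.
    """
    l_feat, m_feat = labels_of(cost_feat), mask_of(cost_feat)
    l_agg, m_agg = labels_of(cost_agg), mask_of(cost_agg)
    # Each module is trained with the other module's (more reliable) labels,
    # so good cues flow in both directions as training progresses.
    loss_feat = loss_fn(cost_feat, l_agg, m_agg)
    loss_agg = loss_fn(cost_agg, l_feat, m_feat)
    return loss_feat, loss_agg
```

With this wiring, the early-training advantage of the pre-trained feature extractor and the late-training advantage of the aggregation network (Fig. 2) both feed the weaker module at the right time.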

3 Experiments

3.1 Implementation Details

In our framework, we use a ResNet-101 [5] backbone pretrained on the ImageNet [3] benchmark, followed by additional layers that transform the features to be highly discriminative w.r.t. both appearance and spatial context [10]. In addition, we use two types of cost aggregation modules: a 4D CNN as in [21], denoted Ours w/NCNet, and a transformer-based architecture [1], denoted Ours w/CATs. We use a 256x256 input image and a 16x16 feature map. The learning rates start from 3e-5 and 3e-6 for the feature extraction and cost aggregation networks, respectively, and are adjusted using the AdamW optimizer. We set the hyper-parameters $\alpha_1$, $\alpha_2$, $\tau$, $\lambda_{\mathcal{C}}$, and $\lambda_{\mathcal{A}}$ to fixed values for all experiments.

3.2 Experimental Settings

In this section, we demonstrate that our method is effective through comparison with others; WeakAlign [19], RTNs [7], NCNet [21], DCCNet [6], ANCNet [12], PMNC [9], and CATs [1]. We also conduct an analysis of each component in our framework in the ablation study. To evaluate semantic matching, Proposal Flow [4] and TSS [24] benchmarks were used. The Proposal Flow benchmark contains PF-Pascal and PF-Willow [4]. TSS dataset is split into three subgroups: FG3D, JODS and PASCAL [24]. A percentage of correct keypoints (PCK) is employed for evaluation.
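The PCK evaluation metric counts a predicted keypoint as correct when it falls within a threshold proportional to a reference size. A minimal sketch of the common image-size-normalized variant (the exact normalization per benchmark, e.g. bounding box vs. image, may differ):

```python
import torch

def pck(pred_kps, gt_kps, ref_size, alpha=0.1):
    """Percentage of Correct Keypoints.

    pred_kps, gt_kps: (n, 2) keypoint coordinates.
    ref_size: (h, w) of the reference region (image or bounding box).
    A prediction is correct if it lies within alpha * max(h, w) of the
    ground-truth keypoint.
    """
    thresh = alpha * max(ref_size)
    dist = (pred_kps.float() - gt_kps.float()).norm(dim=-1)  # Euclidean error
    return (dist <= thresh).float().mean().item()
```

Smaller alpha (e.g. 0.05 on TSS vs. 0.1 on PF-Pascal) makes the criterion stricter, which is why scores are not directly comparable across the alpha columns of Table 1.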

3.3 Experimental Results

Table 1 shows the quantitative results. We conduct experiments on various benchmarks: PF-Pascal [4], PF-Willow [4], and TSS [24]. Our method achieves higher accuracy than both baselines, NCNet and CATs, by 1.1 and 5.2 points on PF-Pascal and by 2.3 and 2.9 points on PF-Willow, respectively. In particular, Ours w/CATs outperforms the other methods on PF-Pascal and in average PCK on TSS, with the biggest improvement of about 2 points on PF-Pascal over 90.6, the previous state-of-the-art result [9] among similar network architectures and algorithms. Ours w/NCNet shows the highest FG3D performance at 95.0 and records an average PCK of 78.4 on the TSS benchmark. Qualitative semantic matching results on the PF dataset are shown in Fig. 4, where (c), (d), and (e) are images warped from (a) to (b) by WTA, NCNet [21], and ours, respectively. From (c), we observe that matching errors produced by the feature extraction network propagate to the final output. Compared to (d), (e) shows accurate matching results even in difficult examples with occlusion, background clutter, and repetitive textures.


Components Accuracy
(a) Ours 80.0
(b) (-) Joint learning 78.4
(c) (-) Confidence-aware loss 77.7


Table 2: Ablation study on our modules.


Loss components | PCK (α = 0.1)
(a) - - - 70.0
(b) - - 71.7
(c) - - 78.4
(d) 80.0


Table 3: Ablation study of our loss formulation.

3.4 Ablation Study

In this section, we analyze the main components of our method, the confidence-aware contrastive loss and joint learning, with the NCNet baseline [21] on PF-Pascal. First, Table 2 validates the effectiveness of joint learning and the confidence-aware contrastive loss through the lower performance of (b) and (c) compared to (a), which includes both components. This confirms that training the two networks in a complementary manner boosts performance and that the confidence-aware contrastive loss filters out unreliable pseudo labels during training. We also verify the effectiveness of each loss component through the combinations displayed in Table 3. Compared to (a) and (b), both of which train the two networks only with the loss from the aggregation module, (c) and (d) achieve better PCK by using separate losses from feature extraction and cost aggregation, respectively. This confirms that the direct loss from each module provides a sufficient supervision signal that is free from gradient vanishing. The improvements from (a) to (b) and from (c) to (d) further confirm that reliable samples from one module help train the other by supporting the formation of better loss signals.

(c) WTA
(d) NCNet [21]
(e) Ours
Figure 4: Qualitative results on PF-Pascal dataset [4].

4 Conclusion

We address the limitations of the existing methods by jointly training feature extraction networks and aggregation networks in an end-to-end manner with the proposed confidence-aware contrastive loss. By jointly learning the networks with a novel loss function, our model outperforms the baseline and shows competitive results on standard benchmarks.

Acknowledgements. This research was supported by the National Research Foundation of Korea (NRF-2021R1C1C1006897).


  • [1] S. Cho, S. Hong, S. Jeon, Y. Lee, K. Sohn, and S. Kim (2021) Semantic correspondence with transformers. arXiv:2106.02520.
  • [2] C. B. Choy, J. Gwak, S. Savarese, and M. Chandraker (2016) Universal correspondence network. arXiv:1606.03558.
  • [3] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
  • [4] B. Ham, M. Cho, C. Schmid, and J. Ponce (2017) Proposal flow: semantic correspondences from object proposals. In TPAMI.
  • [5] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR.
  • [6] S. Huang, Q. Wang, S. Zhang, S. Yan, and X. He (2019) Dynamic context correspondence network for semantic alignment. In ICCV.
  • [7] S. Kim, S. Lin, S. R. Jeon, D. Min, and K. Sohn (2018) Recurrent transformer networks for semantic correspondence. In NeurIPS.
  • [8] S. Kim, D. Min, B. Ham, S. Jeon, S. Lin, and K. Sohn (2017) FCSS: fully convolutional self-similarity for dense semantic correspondence. In CVPR.
  • [9] J. Y. Lee, J. DeGol, V. Fragoso, and S. N. Sinha (2021) PatchMatch-based neighborhood consensus for semantic correspondence. In CVPR.
  • [10] J. Lee, D. Kim, J. Ponce, and B. Ham (2019) SFNet: learning object-aware semantic correspondence. In CVPR.
  • [11] B. Leibe, A. Leonardis, and B. Schiele (2008) Robust object detection with interleaved categorization and segmentation. In IJCV.
  • [12] S. Li, K. Han, T. W. Costain, H. Howard-Jenkins, and V. Prisacariu (2020) Correspondence networks with adaptive neighbourhood consensus. In CVPR.
  • [13] P. Liu, I. King, M. R. Lyu, and J. Xu (2019) DDFlow: learning optical flow with unlabeled data distillation. In AAAI.
  • [14] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. In IJCV.
  • [15] S. Meister, J. Hur, and S. Roth (2018) UnFlow: unsupervised learning of optical flow with a bidirectional census loss. In AAAI.
  • [16] I. Melekhov, A. Tiulpin, T. Sattler, M. Pollefeys, E. Rahtu, and J. Kannala (2019) DGC-Net: dense geometric correspondence network. In WACV.
  • [17] T. Park, A. A. Efros, R. Zhang, and J. Zhu (2020) Contrastive learning for unpaired image-to-image translation. In ECCV.
  • [18] I. Rocco, R. Arandjelović, and J. Sivic (2017) Convolutional neural network architecture for geometric matching. In CVPR.
  • [19] I. Rocco, R. Arandjelović, and J. Sivic (2018) End-to-end weakly-supervised semantic alignment. In CVPR.
  • [20] I. Rocco, R. Arandjelović, and J. Sivic (2020) Efficient neighbourhood consensus networks via submanifold sparse convolutions. In ECCV.
  • [21] I. Rocco, M. Cimpoi, R. Arandjelović, A. Torii, T. Pajdla, and J. Sivic (2018) Neighbourhood consensus networks. arXiv:1810.10510.
  • [22] N. Sundaram, T. Brox, and K. Keutzer (2010) Dense point trajectories by GPU-accelerated large displacement optical flow. In ECCV.
  • [23] H. Taira, I. Rocco, J. Sedlar, M. Okutomi, J. Sivic, T. Pajdla, T. Sattler, and A. Torii (2019) Is this the right place? Geometric-semantic pose verification for indoor visual localization. In ICCV.
  • [24] T. Taniai, S. N. Sinha, and Y. Sato (2016) Joint recovery of dense correspondence and cosegmentation in two images. In CVPR.
  • [25] Q. Wang, X. Zhou, B. Hariharan, and N. Snavely (2020) Learning feature descriptors using camera pose supervision. In ECCV.
  • [26] X. Wang, R. Zhang, C. Shen, T. Kong, and L. Li (2020) Dense contrastive learning for self-supervised visual pre-training. arXiv:2011.09157.