Contrastive Transformation for Self-supervised Correspondence Learning

12/09/2020 ∙ by Ning Wang, et al. ∙ USTC

In this paper, we focus on the self-supervised learning of visual correspondence using unlabeled videos in the wild. Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation. The intra-video learning transforms the image contents across frames within a single video via the frame pair-wise affinity. To obtain the discriminative representation for instance-level separation, we go beyond the intra-video analysis and construct the inter-video affinity to facilitate the contrastive transformation across different videos. By forcing the transformation consistency between intra- and inter-video levels, the fine-grained correspondence associations are well preserved and the instance-level feature discrimination is effectively reinforced. Our simple framework outperforms the recent self-supervised correspondence methods on a range of visual tasks including video object tracking (VOT), video object segmentation (VOS), pose keypoint tracking, etc. It is worth mentioning that our method also surpasses the fully-supervised affinity representation (e.g., ResNet) and performs competitively against the recent fully-supervised algorithms designed for the specific tasks (e.g., VOT and VOS).




1 Introduction

Learning representations for visual correspondence is a long-standing problem in computer vision, which is closely related to many vision tasks including video object tracking, keypoint tracking, and optical flow estimation. This task is challenging due to factors such as viewpoint change, distractors, and background clutter.

Correspondence estimation generally requires human annotations for model training. Collecting dense annotations, especially for large-scale datasets, requires costly human effort. To leverage the large volume of raw videos in the wild, recent advances focus on self-supervised correspondence learning by exploring the inherent relationships within unlabeled videos. In Wang et al. (2019b), temporal cycle-consistency is utilized to self-supervise the feature representation learning. To be specific, the correct patch-level or pixel-wise associations between two successive frames should match bi-directionally in both forward and backward tracking trajectories. The bi-directional matching is realized via a frame-level affinity matrix, which represents the pixel pair-wise similarity between two frames. In Vondrick et al. (2018); Li et al. (2019), this affinity is also utilized to achieve the content transformation between two frames for self-supervision. A straightforward choice of content to transform within videos is the color (RGB) information. More specifically, the pixel colors in a target frame can be “copied” (or transformed) from the pixels in a reference frame. By minimizing the differences between the transformed and the true colors of the target frame, the backbone network is forced to learn robust feature embeddings for identifying correspondence across frames in a self-supervised manner.

In spite of the impressive performance, existing unsupervised correspondence algorithms put all the emphasis on the intra-video analysis. Since the scene in one video is generally stable, establishing the correspondence within the same video is less challenging, which inevitably limits the discrimination potential of the learned feature embeddings. In this work, we go beyond the intra-video correspondence learning by further considering the inter-video level embedding separation of different instance objects. Our method is largely inspired by the recent success of contrastive learning He et al. (2020); Chen et al. (2020), which aims at maximizing the agreement between different augmented versions of the same image via a contrastive loss Hadsell et al. (2006). Nevertheless, there are two obvious gaps between contrastive learning and correspondence learning. First, classic contrastive learning relies on augmented still images, and how to adapt it to the video-level correspondence scenario is rarely explored. Second, their optimization goals are somewhat conflicting. Contrastive learning targets positive concentration and negative separation, ignoring the pixel-to-pixel relevance among the positive embeddings. In contrast, correspondence learning aims at identifying fine-grained matching.

In this work, we aim to narrow the above domain gaps by absorbing the core contrastive ideas for correspondence estimation. To transfer contrastive learning from the image domain to the video domain, we leverage patch-level tracking to acquire matched image pairs in unlabeled videos. Consequently, our method captures the real target appearance changes that reside in the video sequences without augmenting still images using empirical rules (e.g., scaling and rotation). Furthermore, we propose the inter-video transformation, which is consistent with correspondence learning in terms of the optimization goal while preserving the contrastive characteristic among different instance embeddings. In our framework, similar to previous works Vondrick et al. (2018); Li et al. (2019), the image pixels should match their counterpart pixels in the current video to satisfy the self-supervision. Besides, these pixels are also forced to mismatch the pixels in other videos to reinforce the instance-level discrimination, which is formulated as the contrastive transformation across a batch of videos, as shown in Figure 1. By virtue of the intra-inter transformation consistency as well as the sparsity constraint for the inter-video affinity, our framework encourages contrastive embedding learning within the correspondence framework.

In summary, the main contribution of this work lies in the contrastive framework for self-supervised correspondence learning. 1) By jointly performing unsupervised tracking and contrastive transformation, our approach extends the classic contrastive idea to the temporal domain. 2) To bridge the domain gap between the two tasks, we propose the intra-inter transformation consistency, which differs from contrastive learning but absorbs its core motivation for correspondence tasks. 3) Last but not least, we verify the proposed approach in a series of correspondence-related tasks including video object segmentation, pose tracking, object tracking, etc. Our approach consistently outperforms previous state-of-the-art self-supervised approaches and is even comparable with some task-specific fully-supervised algorithms.

2 Related Work

In this section, we briefly review the related methods including unsupervised representation learning, self-supervised correspondence learning, and contrastive learning.

Unsupervised Representation Learning. Learning representations from unlabeled images or videos has been widely studied. Unsupervised approaches explore the inherent information inside images or videos as the supervisory signals from different perspectives, such as frame sorting Lee et al. (2017), image content recovering Pathak et al. (2016), deep clustering Caron et al. (2018), affinity diffusion Huang et al. (2020), motion modeling Pathak et al. (2017); Tung et al. (2017), and bi-directional flow estimation Meister et al. (2018). These methods learn an unsupervised feature extractor, which can be generalized to different tasks by further fine-tuning using a small set of labeled samples. In this work, we focus on a sub-area in the unsupervised family, i.e., learning features for fine-grained pixel matching without task-specific fine-tuning. Our framework shares partial insight with Wang and Gupta (2015), which utilizes off-the-shelf visual trackers for data pre-processing. Differently, we jointly track and spread feature embeddings in an end-to-end manner for complementary learning. Our method is also motivated by contrastive learning Den Oord et al. (2018), another popular framework in the unsupervised learning family. In the following, we discuss correspondence learning and contrastive learning in detail.

Self-supervised Correspondence Learning. Learning temporal correspondence is widely explored in the visual object tracking (VOT), video object segmentation (VOS), and flow estimation Dosovitskiy et al. (2015) tasks. VOT aims to locate the target box in each frame based on the initial target box, while VOS propagates the initial target mask. To avoid expensive manual annotations, self-supervised approaches have attracted increasing attention. In Vondrick et al. (2018), based on the frame-wise affinity, the pixel colors from the reference frame are transferred to the target frame as self-supervisory signals. Wang et al. (2019b) conduct the forward-backward tracking in unlabeled videos and leverage the inconsistency between the start and end points to optimize the feature representation. The UDT algorithm Wang et al. (2019a) leverages a similar bi-directional tracking idea and incorporates the correlation filter for unsupervised tracker training. In Yang et al. (2019), an unsupervised tracker is trained via incremental learning using a single movie. Recently, Li et al. (2019) combine the object-level and fine-grained correspondence in a coarse-to-fine fashion and show notable performance improvements. In Jabri et al. (2020), space-time correspondence learning is formulated as a contrastive random walk and shows impressive results. Despite the success of the above methods, they put the main emphasis on the intra-video self-supervision. Our approach takes a step further by simultaneously exploiting the intra-video and inter-video consistency to learn more discriminative feature embeddings. Therefore, previous intra-video based approaches can be regarded as one part of our framework.

Figure 2: An overview of the proposed framework. Given a batch of videos, we first do patch-level tracking to generate image pairs. Then, intra- and inter-video transformations are conducted for each video in the mini-batch. Finally, in addition to the intra-video self-supervision, we introduce the intra-inter consistency and sparsity constraint to reinforce the embedding discrimination.

Contrastive Learning. Contrastive learning is a popular unsupervised learning paradigm, which aims to enlarge the embedding disagreements of different instances for representation learning Den Oord et al. (2018); Ye et al. (2019); Hjelm et al. (2019). Based on the contrastive framework, the recent SimCLR method Chen et al. (2020) significantly narrows the performance gap between supervised and unsupervised models. He et al. (2020) propose the MoCo algorithm to fully exploit the negative samples in the memory bank. Inspired by the recent success of contrastive learning, we also involve plentiful negative samples for discriminative feature learning. Compared with existing contrastive methods, one major difference is that our method jointly tracks and spreads feature embeddings in the video domain. Therefore, our method captures temporal appearance variations instead of manually augmenting still images. Besides, instead of using a standard contrastive loss Hadsell et al. (2006), we incorporate the contrastive idea into the correspondence task by a conceptually simple yet effective contrastive transformation mechanism to narrow the domain gap.

3 Methodology

An overview of our framework is shown in Figure 2. Given a batch of videos, we first crop the adjacent image patches via patch-level tracking, which ensures the image pairs have similar contents and facilitates the later transformations. For each image pair, we consider the intra-video bi-directional transformation. Furthermore, we introduce irrelevant images from other videos to conduct the inter-video transformation for contrastive embedding learning. The final training objectives include the intra-video self-supervision, intra-inter transformation consistency, and sparsity regularization for the batch-level affinity.

3.1 Revisiting Affinity-based Transformation

Given a pair of video frames, the pixel colors (e.g., RGB values) in one frame can be copied from the pixels in another frame. This is based on the assumption that the contents in two successive video frames are coherent. The above frame reconstruction (pixel copy) operation can be expressed via a linear transformation with the affinity matrix $A_{r \to t}$, which describes the copy process from a reference frame to a target frame Vondrick et al. (2018); Liu et al. (2018).

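A general option for the similarity measurement in the affinity matrix is the dot product between feature embeddings. In this work, we follow previous works Vondrick et al. (2018); Wang et al. (2019b); Li et al. (2019) to construct the following affinity matrix:

$$A_{r \to t}(i, j) = \frac{\exp\left(f_t(i)^{\top} f_r(j)\right)}{\sum_{j} \exp\left(f_t(i)^{\top} f_r(j)\right)}, \tag{1}$$

where $f_t \in \mathbb{R}^{C \times N_1}$ and $f_r \in \mathbb{R}^{C \times N_2}$ denote flattened feature maps with $C$ channels of target and reference frames, respectively. With the spatial index $i \in [1, N_1]$ and $j \in [1, N_2]$, $A_{r \to t} \in \mathbb{R}^{N_1 \times N_2}$ is normalized by the softmax over the spatial dimension of $f_r$.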

Leveraging the above affinity, we can freely transform various information from the reference frame to the target frame by $L_t = A_{r \to t} L_r$, where $L_r$ can be any associated labels of the reference frame (e.g., semantic mask, pixel color, and pixel location). Since we naturally know the color information of the target frame, one free self-supervisory signal is color Vondrick et al. (2018). The goal of such an affinity-based transformation framework is to train a good feature extractor for affinity computation.
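The affinity-based label copy described above can be illustrated with a minimal pure-Python sketch. It assumes embeddings are stored one per row (the transpose of the channels-first layout used in the text); the helper names `softmax`, `affinity`, and `transform` and the toy values are illustrative only and are not taken from the authors' code.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def affinity(f_t, f_r):
    """Entry (i, j) is the softmax, over reference positions j, of the dot
    product between target embedding i and reference embedding j."""
    return [softmax([sum(a * b for a, b in zip(ft, fr)) for fr in f_r])
            for ft in f_t]

def transform(A, labels_r):
    """Copy reference labels to the target frame: L_t = A @ L_r."""
    n_channels = len(labels_r[0])
    return [[sum(w * lab[c] for w, lab in zip(row, labels_r))
             for c in range(n_channels)]
            for row in A]

# Toy example: 3 reference pixels, 2 target pixels, 2-channel embeddings.
f_r = [[5.0, 0.0], [0.0, 5.0], [-5.0, 0.0]]
f_t = [[0.0, 5.0], [5.0, 0.0]]
# One RGB color per reference pixel (red, green, blue).
colors_r = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

A = affinity(f_t, f_r)             # 2 x 3, each row sums to 1
colors_t = transform(A, colors_r)  # target colors copied from the reference
```

In this toy case the first target pixel matches the second reference pixel, so it receives an almost pure green color, while the second target pixel receives red. Any other per-pixel label (mask, coordinates) can be passed in place of `colors_r`.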

3.2 Contrastive Pair Generation

A vital step in contrastive frameworks is building positive image pairs via data augmentation. We remove this requirement by exploiting the temporal content consistency that resides in the videos. To this end, for each video, we first utilize patch-level tracking to acquire a pair of high-quality image patches with similar content. Based on the matched pairs, we then conduct the contrastive transformation.

Given a randomly cropped patch in the reference frame, we aim to localize the best matched patch in the target frame, as shown in Figure 2. Similar to Eq. 1, we compute a patch-to-frame affinity between the features of a random patch in the reference frame and the features of the whole target frame. Based on this affinity, we identify the target-frame pixels most similar to the reference pixels, and average their coordinates as the tracked target center. We also estimate the patch scale variation following the UVC approach Li et al. (2019). Then we crop this patch and combine it with the reference patch to form an image pair.
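The center estimation step can be sketched as follows. The text does not specify how many target pixels are selected or how a per-pixel matching score is derived from the patch-to-frame affinity, so the flat `scores` input, the parameter `k`, and the function name `tracked_center` are assumptions made for illustration.

```python
def tracked_center(scores, width, k):
    """Average the (x, y) coordinates of the k highest-scoring target pixels.

    scores: one matching score per target-frame pixel in row-major order
            (hypothetical input, e.g. the affinity accumulated over all
            reference-patch pixels).
    width:  number of columns in the target feature map.
    """
    top = sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)[:k]
    xs = [idx % width for idx in top]
    ys = [idx // width for idx in top]
    return sum(xs) / k, sum(ys) / k

# 4x4 target feature map whose best matches form a 2x2 blob
# covering columns 2-3 and rows 1-2.
scores = [0.0] * 16
for idx in (6, 7, 10, 11):
    scores[idx] = 1.0

center = tracked_center(scores, width=4, k=4)  # (x, y) in feature-map units
```

The returned center is expressed in feature-map coordinates; multiplying by the network stride would map it back to image pixels before cropping.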

3.3 Intra- and Inter-video Transformations


After obtaining a pair of matched feature maps via patch-level tracking, we compute their fine-grained affinity according to Eq. 1. Based on this intra-video affinity, we can easily transform the image contents from the reference patch to the target patch within a single video clip.

The success of the aforementioned affinity-based transformation hinges on the embedding discrimination among plentiful pixels, which is needed for an accurate label copy. Nevertheless, within a pair of small patch regions, the image contents are highly correlated and may even cover only a subregion of a large object, so they can hardly contain diverse visual patterns. The scarcity of negative pixels from other instance objects heavily hinders the embedding learning.

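In the following, we improve the existing framework by introducing another inter-video transformation to achieve the contrastive embedding learning. The inter-video affinity is defined as follows:

$$A_{R \to t}(i, j) = \frac{\exp\left(f_t(i)^{\top} f_R(j)\right)}{\sum_{j} \exp\left(f_t(i)^{\top} f_R(j)\right)}, \tag{2}$$

where $f_R$ is the concatenation of the reference features from different videos in the spatial dimension, i.e., $f_R = \mathrm{Concat}(f_r^{1}, \cdots, f_r^{n})$. For a mini-batch with $n$ videos, the spatial index $i \in [1, N_1]$ and $j \in [1, nN_2]$.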

Rationale Analysis.

Inter-video transformation is an extension of intra-video transformation. By decomposing the reference feature embeddings into positive and negative, $f_R$ can be expressed as $f_R = \mathrm{Concat}(f_r^{+}, f_r^{-})$, where $f_r^{+}$ denotes the only positive reference feature related to the target frame feature $f_t$ while $f_r^{-}$ is the concatenation of negative ones from unrelated videos in the mini-batch. As a result, the computed affinity $A_{R \to t}$ can be regarded as an ensemble of multiple sub-affinities, as shown in Figure 3. Our goal is to build such a batch-level affinity for discriminative representation learning.

To facilitate the later descriptions, we also divide the inter-video affinity into a combination of positive and negative sub-affinities:

$$A_{R \to t} = \mathrm{Concat}\left(A_{r^{+} \to t},\, A_{r^{-} \to t}\right), \tag{3}$$

where $A_{r^{+} \to t} \in \mathbb{R}^{N_1 \times N_2}$ and $A_{r^{-} \to t} \in \mathbb{R}^{N_1 \times (n-1)N_2}$ are the positive and negative sub-affinities, respectively. Ideally, sub-affinity $A_{r^{+} \to t}$ should be close to the intra-video affinity $A_{r \to t}$ and $A_{r^{-} \to t}$ is expected to be a zero-like matrix. Nevertheless, with the inclusion of noisy reference features $f_r^{-}$, the positive sub-affinity $A_{r^{+} \to t}$ inevitably degenerates in comparison with the intra-video affinity $A_{r \to t}$, as shown in Figure 3. In the following, we present the intra-inter transformation consistency to encourage contrastive embedding learning within the correspondence learning task.
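The degeneration of the positive sub-affinity follows directly from the shared softmax: appending negative references enlarges the normalizer, so each row of the positive block is the intra-video row scaled by the probability mass that stays within the same video. The sketch below checks this on toy embeddings; it assumes row-wise embeddings, and all names and values are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def affinity(f_t, f_refs):
    """Softmax-normalized dot-product affinity from references to targets."""
    return [softmax([sum(a * b for a, b in zip(ft, fr)) for fr in f_refs])
            for ft in f_t]

f_t = [[2.0, 0.0], [0.0, 2.0]]                   # target embeddings
f_pos = [[2.0, 0.0], [0.0, 2.0]]                 # references from the same video
f_neg = [[1.0, 1.0], [-2.0, 0.0], [0.0, -2.0]]   # references from other videos

A_intra = affinity(f_t, f_pos)
A_inter = affinity(f_t, f_pos + f_neg)           # softmax over concatenated references

n_pos = len(f_pos)
A_pos = [row[:n_pos] for row in A_inter]         # positive sub-affinity
A_neg = [row[n_pos:] for row in A_inter]         # negative sub-affinity

# Probability mass each target pixel still assigns to its own video.
pos_mass = [sum(row) for row in A_pos]
# Rescaling the positive block by that mass recovers the intra-video affinity.
A_pos_renorm = [[v / m for v in row] for row, m in zip(A_pos, pos_mass)]
```

When the negative sub-affinity approaches zero, `pos_mass` approaches one and the positive block coincides with the intra-video affinity, which is the situation the consistency and sparsity terms are designed to encourage.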

Figure 3: Comparison between intra-video affinity (top) and inter-video affinity (bottom). Best view in zoom in.

3.4 Training Objectives

To achieve the high-quality frame reconstruction, following Li et al. (2019), we pre-train an encoder and a decoder using still images on the COCO dataset Lin et al. (2014) to perform the feature-level transformation. The pre-trained encoder and decoder networks are frozen without further optimization in our framework. The goal is to train the backbone network for correspondence estimation (i.e., affinity computation). In the following, the encoded features of the reference image are denoted as $E_r$.

Intra-video Self-supervision.

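Leveraging the intra-video affinity $A_{r \to t}$ as well as the encoded reference feature $E_r$, the transformed target image $\hat{I}_{r \to t}$ can be computed by decoding the transformed feature $A_{r \to t} E_r$. Ideally, the transformed target frame should be consistent with the original target frame $I_t$. As a consequence, the intra-video self-supervisory loss is defined as follows:

$$\mathcal{L}_{\text{self}} = \left\| \hat{I}_{r \to t} - I_t \right\|_1. \tag{4}$$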
Intra-inter Consistency.

Leveraging the inter-video affinity $A_{R \to t}$ and the encoded reference features from a batch of videos, i.e., $E_R = \mathrm{Concat}(E_r^{1}, \cdots, E_r^{n})$, the corresponding transformed target image $\hat{I}_{R \to t}$ can be computed by decoding $A_{R \to t} E_R$. This inter-video transformation is shown in Figure 4. The reference features from other videos are considered as negative embeddings. The learned inter-video affinity is expected to exclude unrelated embeddings for transformation fidelity. Therefore, the transformed images via intra-video affinity and inter-video affinity should be consistent:

$$\mathcal{L}_{\text{intra-inter}} = \left\| \hat{I}_{r \to t} - \hat{I}_{R \to t} \right\|_1. \tag{5}$$
The above loss encourages both positive feature invariance and negative embedding separation.

Sparsity Constraint.

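To further enlarge the disagreements among different video features, we force the sub-affinity $A_{r^{-} \to t}$ in the inter-video affinity to be sparse via

$$\mathcal{L}_{\text{sparse}} = \left\| A_{r^{-} \to t} \right\|_1, \tag{6}$$

where $A_{r^{-} \to t}$ is the negative sub-affinity in Eq. 3.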

Other Regularizations.

Following previous works Li et al. (2019); Wang et al. (2019b), we also utilize the cycle-consistency (bi-directional matching) between two frames, which equals forcing the affinity matrix to be orthogonal, i.e., $A_{r \to t}^{-1} = A_{r \to t}^{\top}$. Besides, the concentration regularization proposed in Li et al. (2019) is also added. These two regularizations are combined and denoted as $\mathcal{L}_{\text{others}}$.

Final Objective.

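The final training objective is the combination of the above loss functions:

$$\mathcal{L}_{\text{final}} = \mathcal{L}_{\text{self}} + \mathcal{L}_{\text{intra-inter}} + \mathcal{L}_{\text{sparse}} + \mathcal{L}_{\text{others}}. \tag{7}$$

Our designed losses $\mathcal{L}_{\text{intra-inter}}$ and $\mathcal{L}_{\text{sparse}}$ are equally incorporated with the basic objective $\mathcal{L}_{\text{self}}$. An overview of the training process is shown in Algorithm 1.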
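How the terms combine with equal weights can be sketched on flattened toy data. Several choices here are assumptions rather than details confirmed by the text: distances are measured with a mean absolute difference, the sparsity term is the mean absolute value of the negative sub-affinity, the decoder is omitted (reconstructions are passed in directly), and the remaining regularizers are represented by a single placeholder scalar `other_reg`.

```python
def l1(x, y):
    """Mean absolute difference between two equally sized flat lists."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

def total_loss(target, recon_intra, recon_inter, A_neg, other_reg=0.0):
    """Equal-weight sum of the training terms (illustrative sketch).

    target:      flattened original target image
    recon_intra: target reconstructed through the intra-video affinity
    recon_inter: target reconstructed through the inter-video affinity
    A_neg:       negative sub-affinity as a list of rows
    other_reg:   placeholder for cycle-consistency and concentration terms
    """
    loss_self = l1(recon_intra, target)                 # intra-video self-supervision
    loss_consistency = l1(recon_intra, recon_inter)     # intra-inter consistency
    n_entries = sum(len(row) for row in A_neg)
    loss_sparse = sum(abs(v) for row in A_neg for v in row) / n_entries
    return loss_self + loss_consistency + loss_sparse + other_reg

target = [1.0, 0.0, 0.5, 0.25]
recon_intra = [0.9, 0.1, 0.5, 0.25]
recon_inter = [0.8, 0.1, 0.4, 0.25]
A_neg = [[0.1, 0.0], [0.05, 0.05]]

loss = total_loss(target, recon_intra, recon_inter, A_neg)
```

Note that the consistency term compares the two reconstructions with each other rather than with the ground truth, so it penalizes any influence of other videos on the transformed image even where the intra-video reconstruction itself is imperfect.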

Figure 4: Illustration of the inter-video transformation.

3.5 Online Inference

After offline training, the pretrained backbone model is fixed during the inference stage and is utilized to compute the affinity matrix for label transformation (e.g., segmentation mask). Note that the contrastive transformation is merely utilized for offline training, and the inference process is similar to the intra-video transformation. To acquire more reliable correspondence, we further design a mutually correlated affinity to exclude noisy matching, which incorporates a mutual correlation weight between two frames. Ideally, we prefer the one-to-one matching, i.e., one pixel in the reference frame should be highly correlated with some pixel in the target frame and vice versa. The weight can be regarded as the affinity normalization across both reference and target spatial dimensions. Given the above affinity between two frames, the target frame label can be transformed from the reference frame label in the same way as in Section 3.1.

4 Experiments

We verify the effectiveness of our method on a variety of vision tasks including video object segmentation, visual object tracking, pose keypoint tracking, and human parts segmentation propagation. The source code and pretrained model will be made available.

Algorithm 1 Offline Training Process
Input: Unlabeled video sequences.
Output: Trained weights for the backbone network.
for each mini-batch do
    Extract deep features of the video frames;
    Patch-level tracking to obtain matched feature pairs;
    for each video in the mini-batch do
        // Intra- and inter-video transformations
        Compute intra-video affinity $A_{r \to t}$ (Eq. 1);
        Compute inter-video affinity $A_{R \to t}$ (Eq. 2);
        Conduct intra- and inter-video transformations;
        // Loss computation
        Compute intra-video self-supervision $\mathcal{L}_{\text{self}}$;
        Compute intra-inter consistency $\mathcal{L}_{\text{intra-inter}}$;
        Compute regularization terms $\mathcal{L}_{\text{sparse}}$ and $\mathcal{L}_{\text{others}}$;
    end for
    Back-propagate all the losses in this mini-batch;
end for

4.1 Experimental Details

Training Details.

In our method, the patch-level tracking and frame transformations share a ResNet-18 backbone network He et al. (2016) with the first 4 blocks for feature extraction. The training dataset is TrackingNet Müller et al. (2018) with about 30k videos. Note that previous works Wang et al. (2019b); Li et al. (2019) use the Kinetics dataset Zisserman et al. (2017), which is much larger in scale than TrackingNet. Our framework randomly crops and tracks the patches of 256×256 pixels (i.e., patch-level tracking), and further yields a 32×32 intra-video affinity (i.e., the network stride is 8). The batch size is 16. Therefore, each positive embedding contrasts with 15 × (32 × 32 × 2) = 30720 negative embeddings. Since our method considers pixel-level features, even a small batch size involves abundant contrastive samples. We first train the intra-video transformation (warm-up stage) for the first 100 epochs and then train the whole framework in an end-to-end manner for another 100 epochs. The learning rate of both stages is reduced by half every 40 epochs. The training stage takes about one day on 4 Nvidia 1080Ti GPUs.
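The arithmetic behind the negative-sample count and the step schedule can be checked with a few lines. The factor of 2 is taken directly from the count stated above; reading it as two patches per video is an assumption. The base learning rate is not reproduced here, so the schedule is expressed relative to a hypothetical base value of 1.0, and the function names are illustrative.

```python
def negatives_per_embedding(batch_size, side, per_video_factor=2):
    """Embeddings contributed by the other videos in the mini-batch."""
    return (batch_size - 1) * side * side * per_video_factor

def lr_at_epoch(base_lr, epoch, step=40, factor=0.5):
    """Step schedule: multiply the learning rate by `factor` every `step` epochs."""
    return base_lr * factor ** (epoch // step)

patch_size, stride = 256, 8
side = patch_size // stride                  # feature positions per side
n_neg = negatives_per_embedding(16, side)    # 15 * (32 * 32 * 2)

# Relative learning-rate multipliers within one 100-epoch stage.
multipliers = [lr_at_epoch(1.0, e) for e in (0, 39, 40, 80, 99)]
```

Within a 100-epoch stage the rate is therefore halved twice, at epochs 40 and 80. Whether the schedule restarts at the beginning of the second stage is not stated in the text.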

Inference Details. For a fair comparison, we use the same testing protocols as previous works Wang et al. (2019b); Li et al. (2019) in all tasks.

4.2 Framework Effectiveness Study

In Table 1, we show ablative experiments of our method on the DAVIS-2017 validation dataset Ponttuset et al. (2017). The evaluation metrics are Jaccard index $\mathcal{J}$ and contour-based accuracy $\mathcal{F}$. As shown in Table 1, without the intra-video guidance, inter-video transformation alone for self-supervision yields unsatisfactory results due to overwhelming noisy/negative samples. With only intra-video transformation, our framework is similar to the previous approach Li et al. (2019). By jointly employing both of these two transformations under an intra-inter consistency constraint, our method obtains obvious performance improvements of 3.2% in $\mathcal{J}$ and 3.4% in $\mathcal{F}$. The sparsity term of inter-video affinity encourages the embedding separation and further improves the results.

In Figure 5, we further visualize the comparison results of our method with and without contrastive transformation. As shown in the last row of Figure 5, intra-video self-supervision alone fails to effectively handle the challenging scenarios with distracting objects and partial occlusion. By involving the contrastive transformation, the learned feature embeddings exhibit superior discrimination capability for instance-level separation.

Intra-video      Inter-video      Sparsity     Mutual        J (Mean)   F (Mean)
Transformation   Transformation   Constraint   Correlation
Yes              No               No           No            55.8       60.3
Yes              Yes              No           No            59.0       63.7
Yes              Yes              Yes          No            59.2       64.0
Yes              Yes              Yes          Yes           60.5       65.5
Table 1: Analysis of each component of our method on the DAVIS-2017 validation dataset.
Figure 5: (a) Ground-truth results. (b) Results of the model with both intra- and inter-video transformations. (c) Results of the model without inter-video contrastive transformation, where the failures are highlighted by white circles.
Figure 6: Experimental results of our method. (a) Video object segmentation on the DAVIS-2017. (b) Visual object tracking on the OTB-2015. (c) Pose keypoint tracking on the J-HMDB. (d) Parts segmentation propagation on the VIP.
Model                                        Supervised   J (Mean)   F (Mean)
Transitive Inv. Wang et al. (2017b)          No           32.0       26.8
DeepCluster Caron et al. (2018)              No           37.5       33.2
Video Colorization Vondrick et al. (2018)    No           34.6       32.7
Time-Cycle Wang et al. (2019b)               No           41.9       39.4
CorrFlow Lai and Xie (2019)                  No           48.4       52.2
UVC (480p) Li et al. (2019)                  No           56.3       59.2
UVC (560p) Li et al. (2019)                  No           56.7       60.7
MAST Lai et al. (2020)                       No           63.3       67.6
ContrastCorr (Ours)                          No           60.5       65.5
ResNet-18 He et al. (2016)                   Yes          49.4       55.1
OSVOS Caelles et al. (2017)                  Yes          56.6       63.9
FEEVOS Voigtlaender et al. (2019)            Yes          69.1       74.0
Table 2: Evaluation on video object segmentation on the DAVIS-2017 validation dataset. The evaluation metrics are region similarity J and contour-based accuracy F.
Model                                        Supervised   DP@20pixel   AUC
KCF (HOG feature) Henriques et al. (2015)    No           69.6         48.5
UL-DCFNet Yang et al. (2019)                 No           75.5         58.4
UDT Wang et al. (2019a)                      No           76.0         59.4
UVC Li et al. (2019)                         No           -            59.2
LUDT Wang et al. (2020)                      No           76.9         60.2
ContrastCorr (Ours)                          No           77.2         61.1
ResNet-18 + DCF He et al. (2016)             Yes          49.4         55.6
SiamFC Bertinetto et al. (2016)              Yes          77.1         58.2
DiMP-18 Bhat et al. (2019)                   Yes          87.1         66.2
Table 3: Evaluation on video object tracking on the OTB-2015 dataset. The evaluation metrics are distance precision (DP) and area-under-curve (AUC) score of the success plot.
Model                                        Supervised   PCK@.1   PCK@.2
SIFT Flow Liu et al. (2011)                  No           49.0     68.6
Transitive Inv. Wang et al. (2017b)          No           43.9     67.0
DeepCluster Caron et al. (2018)              No           43.2     66.9
Video Colorization Vondrick et al. (2018)    No           45.2     69.6
Time-Cycle Wang et al. (2019b)               No           57.3     78.1
CorrFlow Lai and Xie (2019)                  No           58.5     78.8
UVC Li et al. (2019)                         No           58.6     79.8
ContrastCorr (Ours)                          No           61.1     80.8
ResNet-18 He et al. (2016)                   Yes          53.8     74.6
Thin-Slicing Network Song et al. (2017)      Yes          68.7     92.1
Table 4: Keypoints propagation on J-HMDB. The evaluation metric is PCK at different thresholds.
Model                                        Supervised   mIoU   AP (Mean)
SIFT Flow Liu et al. (2011)                  No           21.3   10.5
Transitive Inv. Wang et al. (2017b)          No           19.4   5.0
DeepCluster Caron et al. (2018)              No           21.8   8.1
Time-Cycle Wang et al. (2019b)               No           28.9   15.6
UVC Li et al. (2019)                         No           34.1   17.7
ContrastCorr (Ours)                          No           37.4   21.6
ResNet-18 He et al. (2016)                   Yes          31.8   12.6
FGFA Zhu et al. (2017)                       Yes          37.5   23.0
ATEN Zhou et al. (2018)                      Yes          37.9   24.1
Table 5: Evaluation on propagating human part labels in Video Instance-level Parsing (VIP) dataset. The evaluation metrics are semantic propagation with mIoU and part instance propagation in mean AP.

4.3 Comparison with State-of-the-art Methods

Video Object Segmentation on the DAVIS-2017. DAVIS Ponttuset et al. (2017) is a video object segmentation (VOS) benchmark. We evaluate our method on the DAVIS-2017 validation set using the Jaccard index $\mathcal{J}$ (IoU) and contour-based accuracy $\mathcal{F}$. Table 2 lists quantitative results. Our model performs favorably against the state-of-the-art self-supervised methods including Time-Cycle Wang et al. (2019b), CorrFlow Lai and Xie (2019), and UVC Li et al. (2019). Specifically, with the same experimental settings (e.g., frame input size and recurrent reference strategy), our model surpasses the recent top-performing UVC approach by 3.8% in $\mathcal{J}$ and 4.8% in $\mathcal{F}$. The recent MAST approach Lai et al. (2020) obtains impressive results by leveraging a memory mechanism, which can be added to our framework for further performance improvement. From Figure 6 (first row), we can observe that our method is robust in handling distracting objects and partial occlusion.

Compared with the fully-supervised ResNet-18 network trained on ImageNet with classification labels, our method exhibits much better performance. It is also worth noting that our method even surpasses the recent fully-supervised methods such as OSVOS.
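For reference, the region similarity used above is the intersection-over-union between a predicted and a ground-truth mask. The sketch below assumes binary masks stored as nested lists of 0/1; the function name and the convention of returning 1.0 when both masks are empty are choices made for this illustration.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection-over-union of two binary masks
    given as equally sized nested lists of 0/1."""
    inter = union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Convention chosen here: two empty masks count as a perfect match.
    return inter / union if union else 1.0

pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]

j = jaccard(pred, gt)   # 2 shared pixels out of 4 covered pixels
```

The reported J (Mean) averages this per-object, per-frame score over a sequence and then over the validation set.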

Video Object Tracking on the OTB-2015. OTB-2015 Wu et al. (2015) is a visual tracking benchmark with 100 challenging videos. We evaluate our method on OTB-2015 under distance precision (DP) and area-under-curve (AUC) metrics. Our model learns robust feature representations for fine-grained matching, which can be combined with the correlation filter Henriques et al. (2015); Danelljan et al. (2014) for robust tracking. Without online fine-tuning, we integrate our model into a classic tracking framework based on the correlation filter, i.e., DCFNet Wang et al. (2017a). The comparison results are shown in Table 3. Note that UDT Wang et al. (2019a) is the recently proposed unsupervised tracker trained with the correlation filter in an end-to-end manner. Without end-to-end optimization, our model is still robust enough to achieve superior performance in comparison with UDT. Our method also outperforms the classic fully-supervised trackers such as SiamFC. As shown in Figure 6 (second row), our model can handle motion blur, deformation, and similar distractors well.
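The two tracking metrics can be sketched as follows, as a simplified illustration rather than the benchmark toolkit itself. Boxes are assumed to be `(x, y, w, h)` tuples, the success curve is sampled at 21 evenly spaced overlap thresholds in [0, 1], and a frame counts as successful when its overlap strictly exceeds the threshold; the function names are hypothetical.

```python
import math

def center_error(a, b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ax, ay = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx, by = b[0] + b[2] / 2, b[1] + b[3] / 2
    return math.hypot(ax - bx, ay - by)

def box_iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    inter = max(iw, 0) * max(ih, 0)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def distance_precision(preds, gts, threshold=20.0):
    """Fraction of frames whose center error is within `threshold` pixels."""
    return sum(center_error(p, g) <= threshold for p, g in zip(preds, gts)) / len(gts)

def success_auc(preds, gts, n_thresholds=21):
    """Area under the success plot, approximated by the mean success rate."""
    ious = [box_iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (n_thresholds - 1) for i in range(n_thresholds)]
    rates = [sum(v > t for v in ious) / len(ious) for t in thresholds]
    return sum(rates) / len(rates)

gts = [(0, 0, 10, 10), (0, 0, 10, 10)]
preds = [(0, 0, 10, 10), (30, 0, 10, 10)]   # second frame drifts by 30 pixels

dp = distance_precision(preds, gts)
auc = success_auc(preds, gts)
```

DP only looks at the box center, while the AUC score also penalizes errors in the estimated box size, which is why the two numbers can rank trackers differently.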

Pose Keypoint Propagation on the J-HMDB. We evaluate our model on the pose keypoint propagation task on the validation set of J-HMDB Jhuang et al. (2013). Pose keypoint tracking requires precise fine-grained matching, which is more challenging than the box-level or mask-level propagation in the VOT/VOS tasks. Given the initial frame with 15 annotated human keypoints, we propagate them in the successive frames. The evaluation metric is the probability of correct keypoint (PCK), which measures the percentage of keypoints close to the ground-truth under different thresholds. We show comparison results against the state-of-the-art methods in Table 4 and qualitative results in Figure 6 (third row). Our method outperforms all previous self-supervised methods such as Time-Cycle, CorrFlow, and UVC (Table 4). Furthermore, our approach significantly outperforms pre-trained ResNet-18 with ImageNet supervision.
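A minimal sketch of the PCK computation is given below. A keypoint counts as correct when its distance to the ground truth is at most a fraction `alpha` of a reference length; which reference length the benchmark protocol uses is not stated in the text, so `ref_length` is a hypothetical argument (for instance the larger side of the person bounding box).

```python
import math

def pck(pred, gt, ref_length, alpha):
    """Fraction of keypoints within alpha * ref_length of the ground truth.

    pred, gt:   lists of (x, y) keypoint coordinates in the same order
    ref_length: normalization length (assumed, e.g. person box size)
    alpha:      threshold ratio, e.g. 0.1 for PCK@.1
    """
    limit = alpha * ref_length
    correct = sum(math.hypot(p[0] - g[0], p[1] - g[1]) <= limit
                  for p, g in zip(pred, gt))
    return correct / len(gt)

gt = [(10, 10), (20, 20), (30, 30)]
pred = [(10, 12), (20, 29), (45, 30)]   # errors of 2, 9 and 15 pixels

pck_01 = pck(pred, gt, ref_length=50, alpha=0.1)   # 5-pixel tolerance
pck_02 = pck(pred, gt, ref_length=50, alpha=0.2)   # 10-pixel tolerance
```

Loosening the threshold from 0.1 to 0.2 admits the second keypoint, which mirrors why PCK@.2 values in Table 4 are uniformly higher than PCK@.1.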

Semantic and Instance Propagation on the VIP. Finally, we evaluate our method on the Video Instance-level Parsing (VIP) dataset Zhou et al. (2018), which includes dense human parts segmentation masks on both the semantic and instance levels. We conduct two tasks in this benchmark: semantic propagation and human part propagation with instance identity. For the semantic mask propagation, we propagate the semantic segmentation maps of human parts (e.g., heads, arms, and legs) and evaluate performance via the mean IoU metric. For the part instance propagation task, we propagate the instance-level segmentation of human parts (e.g., different arms of different persons) and evaluate performance via the instance-level human parsing metric: mean Average Precision (AP). Table 5 shows that our method performs favorably against previous self-supervised methods. For example, our approach outperforms the previous best self-supervised method UVC by 3.3% mIoU in semantic propagation and 3.9% in human part propagation. Besides, our model notably surpasses the ResNet-18 model trained on ImageNet with classification labels. Finally, our method is comparable with the fully-supervised ATEN algorithm Zhou et al. (2018) designed for this dataset.

5 Conclusion

In this work, we focus on the correspondence learning using unlabeled videos. Based on the well-studied intra-video self-supervision, we go one step further by introducing the inter-video transformation to achieve contrastive embedding learning. The proposed contrastive transformation encourages embedding discrimination while preserving the fine-grained matching characteristic among positive embeddings. Without task-specific fine-tuning, our unsupervised model shows satisfactory generalization on a variety of temporal correspondence tasks. Our approach consistently outperforms previous self-supervised methods and is even comparable with the recent fully-supervised algorithms.


The work of Wengang Zhou was supported in part by the National Natural Science Foundation of China under Contract 61822208, Contract U20A20183, and Contract 61632019; and in part by the Youth Innovation Promotion Association CAS under Grant 2018497. The work of Houqiang Li was supported by NSFC under Contract 61836011.


  • L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr (2016) Fully-convolutional siamese networks for object tracking. In ECCV Workshop, Cited by: Table 3.
  • G. Bhat, M. Danelljan, L. Van Gool, and R. Timofte (2019) Learning discriminative model prediction for tracking. In ICCV, Cited by: Table 3.
  • S. Caelles, K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool (2017) One-shot video object segmentation. In CVPR, Cited by: Table 2.
  • M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In ECCV, Cited by: §2, Table 2, Table 4, Table 5.
  • T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv: 2002.05709. Cited by: §1, §2.
  • M. Danelljan, G. Häger, F. Khan, and M. Felsberg (2014) Accurate scale estimation for robust visual tracking. In BMVC, Cited by: §4.3.
  • A. V. Den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv: 1807.03748. Cited by: §2, §2.
  • A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. V. Der Smagt, D. Cremers, and T. Brox (2015) Flownet: learning optical flow with convolutional networks. In ICCV, Cited by: §2.
  • R. Hadsell, S. Chopra, and Y. Lecun (2006) Dimensionality reduction by learning an invariant mapping. In CVPR, Cited by: §1, §2.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In CVPR, Cited by: §1, §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.1, Table 2, Table 3, Table 4, Table 5.
  • J. F. Henriques, R. Caseiro, P. Martins, and J. Batista (2015) High-speed tracking with kernelized correlation filters. TPAMI 37 (3), pp. 583–596. Cited by: §4.3, Table 3.
  • R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2019) Learning deep representations by mutual information estimation and maximization. In ICLR, Cited by: §2.
  • J. Huang, Q. Dong, S. Gong, and X. Zhu (2020) Unsupervised deep learning via affinity diffusion. In AAAI, Cited by: §2.
  • A. Jabri, A. Owens, and A. Efros (2020) Space-time correspondence as a contrastive random walk. In NeurIPS, Cited by: §2.
  • H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black (2013) Towards understanding action recognition. In ICCV, Cited by: §4.3.
  • Z. Lai, E. Lu, and W. Xie (2020) MAST: a memory-augmented self-supervised tracker. In CVPR, Cited by: §4.3, Table 2.
  • Z. Lai and W. Xie (2019) Self-supervised learning for video correspondence flow. In BMVC, Cited by: §4.3, Table 2, Table 4.
  • H. Lee, J. Huang, M. Singh, and M. Yang (2017) Unsupervised representation learning by sorting sequences. In ICCV, Cited by: §2.
  • X. Li, S. Liu, S. De Mello, X. Wang, J. Kautz, and M. Yang (2019) Joint-task self-supervised learning for temporal correspondence. In NeurIPS, Cited by: §1, §1, §2, §3.1, §3.2, §3.4, §3.4, §4.1, §4.1, §4.2, §4.3, Table 2, Table 3, Table 4, Table 5.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In ECCV, Cited by: §3.4.
  • C. Liu, J. Yuen, and A. Torralba (2011) Sift flow: dense correspondence across scenes and its applications. TPAMI 33 (5), pp. 978–994. Cited by: Table 4, Table 5.
  • S. Liu, G. Zhong, S. De Mello, J. Gu, V. Jampani, M. Yang, and J. Kautz (2018) Switchable temporal propagation network. In ECCV, Cited by: §3.1.
  • S. Meister, J. Hur, and S. Roth (2018) Unflow: unsupervised learning of optical flow with a bidirectional census loss. In AAAI, Cited by: §2.
  • M. Müller, A. Bibi, S. Giancola, S. Al-Subaihi, and B. Ghanem (2018) Trackingnet: a large-scale dataset and benchmark for object tracking in the wild. In ECCV, Cited by: §4.1.
  • D. Pathak, R. Girshick, P. Dollar, T. Darrell, and B. Hariharan (2017) Learning features by watching objects move. In CVPR, Cited by: §2.
  • D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In CVPR, Cited by: §2.
  • J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbelaez, A. Sorkine-Hornung, and L. Van Gool (2017) The 2017 davis challenge on video object segmentation. arXiv: 1704.00675. Cited by: §4.2, §4.3.
  • J. Song, L. Wang, L. Van Gool, and O. Hilliges (2017) Thin-slicing network: a deep structured model for pose estimation in videos. In CVPR, Cited by: Table 4.
  • H. F. Tung, H. Tung, E. Yumer, and K. Fragkiadaki (2017) Self-supervised learning of motion capture. In NeurIPS, Cited by: §2.
  • P. Voigtlaender, Y. Chai, F. Schroff, H. Adam, B. Leibe, and L. Chen (2019) Feelvos: fast end-to-end embedding learning for video object segmentation. In CVPR, Cited by: Table 2.
  • C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy (2018) Tracking emerges by colorizing videos. In ECCV, Cited by: §1, §1, §2, §3.1, §3.1, §3.1, Table 2, Table 4.
  • N. Wang, Y. Song, C. Ma, W. Zhou, W. Liu, and H. Li (2019a) Unsupervised deep tracking. In CVPR, Cited by: §2, §4.3, Table 3.
  • N. Wang, W. Zhou, Y. Song, C. Ma, W. Liu, and H. Li (2020) Unsupervised deep representation learning for real-time tracking. IJCV, pp. 1–19. Cited by: Table 3.
  • Q. Wang, J. Gao, J. Xing, M. Zhang, and W. Hu (2017a) Dcfnet: discriminant correlation filters network for visual tracking. arXiv: 1704.04057. Cited by: §4.3.
  • X. Wang and A. Gupta (2015) Unsupervised learning of visual representations using videos. In ICCV, Cited by: §2.
  • X. Wang, K. He, and A. Gupta (2017b) Transitive invariance for self-supervised visual representation learning. In ICCV, Cited by: Table 2, Table 4, Table 5.
  • X. Wang, A. Jabri, and A. A. Efros (2019b) Learning correspondence from the cycle-consistency of time. In CVPR, Cited by: §1, §2, §3.1, §3.4, §4.1, §4.1, §4.3, Table 2, Table 4, Table 5.
  • Y. Wu, J. Lim, and M. Yang (2015) Object tracking benchmark. TPAMI 37 (9), pp. 1834–1848. Cited by: §4.3.
  • L. Yang, D. Zhang, and L. Zhang (2019) Learning a visual tracker from a single movie without annotation. In AAAI, Cited by: §2, Table 3.
  • M. Ye, X. Zhang, P. C. Yuen, and S. Chang (2019) Unsupervised embedding learning via invariant and spreading instance feature. In CVPR, Cited by: §2.
  • Q. Zhou, X. Liang, K. Gong, and L. Lin (2018) Adaptive temporal encoding network for video instance-level human parsing. In ACM MM, Cited by: §4.3, Table 5.
  • X. Zhu, Y. Wang, J. Dai, L. Yuan, and Y. Wei (2017) Flow-guided feature aggregation for video object detection. In CVPR, Cited by: Table 5.
  • A. Zisserman, J. Carreira, K. Simonyan, W. Kay, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. F. G. Green, T. Back, et al. (2017) The kinetics human action video dataset. arXiv: 1705.06950. Cited by: §4.1.

Appendix A Inference Details

In the inference stage, we leverage the computed affinity matrix to transform different types of inputs, e.g., segmentation masks and pose keypoints. Following Time-Cycle and UVC, we adopt the same recurrent inference strategy: the ground-truth result from the first frame, as well as the predicted results from the preceding frames, are propagated onto the target frame. We average all predictions to obtain the final propagated map. Following previous works, the number of preceding frames is set to 1 for the VIP dataset and 7 for all other benchmarks. For fair comparison with Time-Cycle and UVC, we also use the k-NN propagation scheme and set k = 5 for all tasks. More details can be found in the source code.
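The propagation step described above can be sketched as follows. This is a minimal NumPy illustration, not the released implementation. The function names `knn_propagate` and `propagate_from_references`, the dot-product affinity, the `temperature` parameter, and the softmax restricted to the top-k matches are all assumptions made for this example. Only k = 5 and the averaging over reference frames come from the text.

```python
import numpy as np

def knn_propagate(ref_feats, ref_labels, tgt_feats, k=5, temperature=1.0):
    """Propagate labels from one reference frame to the target frame.

    ref_feats:  (N_ref, C) embeddings of reference locations.
    ref_labels: (N_ref, L) per-location label scores (e.g. one-hot masks).
    tgt_feats:  (N_tgt, C) embeddings of target locations.
    Returns:    (N_tgt, L) propagated label scores.
    """
    # Assumed affinity: scaled dot product between target and reference embeddings.
    affinity = tgt_feats @ ref_feats.T / temperature          # (N_tgt, N_ref)
    k = min(k, affinity.shape[1])
    # Indices of the k most similar reference locations for each target location.
    topk_idx = np.argpartition(-affinity, k - 1, axis=1)[:, :k]
    topk_val = np.take_along_axis(affinity, topk_idx, axis=1)
    # Numerically stable softmax over the k retained neighbours only.
    topk_val = topk_val - topk_val.max(axis=1, keepdims=True)
    weights = np.exp(topk_val)
    weights /= weights.sum(axis=1, keepdims=True)
    # Weighted sum of the neighbours' labels.
    return np.einsum("tk,tkl->tl", weights, ref_labels[topk_idx])

def propagate_from_references(references, tgt_feats, k=5):
    """Average the predictions obtained from several reference frames.

    references: list of (ref_feats, ref_labels) pairs, e.g. the first frame
    with its ground truth plus the preceding frames with their predictions.
    """
    preds = [knn_propagate(f, l, tgt_feats, k=k) for f, l in references]
    return np.mean(preds, axis=0)

# Toy check: three orthogonal embeddings, each carrying its own label.
ref_feats = np.eye(3)
ref_labels = np.eye(3)
tgt_feats = 10.0 * np.eye(3)
nearest = knn_propagate(ref_feats, ref_labels, tgt_feats, k=1)
averaged = propagate_from_references(
    [(ref_feats, ref_labels), (ref_feats, ref_labels)], tgt_feats, k=2
)
```

With k = 1 every target location simply copies the label of its best match. Larger k blends the labels of several neighbours, which tends to smooth the propagated map.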

Appendix B Transformation Results

In Figure 7, we show some examples of our tracked image pairs. In our framework, we first randomly crop a reference patch in the reference frame and then conduct patch-level tracking to form a pair of matched images. As shown in Figure 7, the image pairs have similar contents, which facilitates the subsequent intra- and inter-video transformations. Thanks to the patch-level tracking, our image pairs contain real target appearance changes (e.g., changes in person view or pose). This differs from conventional contrastive methods, which form image pairs using manually designed rules (e.g., flip and rotation).

In Figure 7, we also show the inter-video transformation results of our approach. The transformed images yield almost identical contents in comparison with the target patch, which affirms that our affinity matrix achieves reliable correspondence matching.

Figure 7: Examples of our tracked image pairs and transformed patches.
Figure 8: More results on the DAVIS-2017 validation dataset.
Figure 9: (a) Ground-truth segmentation results. (b) Results of UVC, which represents the current state-of-the-art performance of self-supervised correspondence methods. (c) Our results. By virtue of contrastive transformation, our approach shows superior results in comparison with previous intra-video based methods.

Appendix C Additional VOS Results

In Figure 8, we show more results of our approach on the DAVIS-2017 validation dataset. These results show that our method accurately propagates the segmentation masks in challenging scenarios.

The UVC algorithm represents the current state-of-the-art self-supervised correspondence approach based on the intra-video transformation paradigm. In contrast, our method further exploits the inter-video transformation to reinforce instance-level embedding discrimination. As shown in Figure 9, our approach handles challenging scenarios such as occlusion, deformation, and similar distractors better than UVC.