Joint-task Self-supervised Learning for Temporal Correspondence

by   Xueting Li, et al.

This paper proposes to learn reliable dense correspondence from videos in a self-supervised manner. Our learning process integrates two highly related tasks: tracking large image regions and establishing fine-grained pixel-level associations between consecutive video frames. We exploit the synergy between both tasks through a shared inter-frame affinity matrix, which simultaneously models transitions between video frames at both the region and pixel levels. Region-level localization helps reduce ambiguities in fine-grained matching by narrowing down search regions, while fine-grained matching provides bottom-up features to facilitate region-level localization. Our method outperforms the state-of-the-art self-supervised methods on a variety of visual correspondence tasks, including video-object and part-segmentation propagation, keypoint tracking, and object tracking. Our self-supervised method even surpasses the fully-supervised affinity feature representation obtained from a ResNet-18 pre-trained on ImageNet.





1 Introduction

Learning representations for visual correspondence is a fundamental problem that is closely related to a variety of vision tasks: correspondences between multi-view images relate 2D and 3D representations, and those between frames link static images to dynamic scenes. To learn correspondences across frames in a video, numerous methods have been developed from two perspectives: (a) learning region/object-level correspondences, via object tracking bertinetto2016fully; tao2016sint; valmadre2017end; wang17dcfnet; wang2018learning or (b) learning pixel-level correspondences between multi-view images or frames, e.g., via stereo matching okutomi1993multiple or optical flow estimation liu2011sift; Sun2018PWC-Net; IMKDB17; lucas1981iterative.

However, most methods address one or the other problem, and significantly less effort has been made to solve both of them together. The main reason is that methods designed to address either of them optimize different goals. Object tracking focuses on learning object representations that are invariant to viewpoint and deformation changes, while learning pixel-level correspondence focuses on modeling detailed changes within an object over time. Consequently, the existing supervised methods for these two problems often use different annotations. For example, bounding boxes are annotated in real videos for object tracking otb2015, and pixel-wise associations are generated from synthesized data for optical flow estimation Butler:ECCV:2012; DFIB15. Datasets with annotations for both tasks are scarce, so supervision is a further bottleneck that prevents connecting the two tasks.

In this paper, we demonstrate that these two tasks inherently require the same operation of learning an inter-frame transformation that associates the contents of two images. We show that the two tasks benefit greatly from being modeled jointly via a single transformation operation that can simultaneously match regions and pixels. To overcome the lack of data with annotations for both tasks, we exploit self-supervision via the signals of (a) Temporal Coherency, which states that objects or scenes move smoothly and gradually over time; (b) Cycle Consistency, which requires that correct correspondences match pixels or regions bi-directionally; and (c) Energy Preservation, which preserves the energy of feature representations during transformations. Since all these supervisory signals naturally exist in videos and are task-agnostic, the transformation that we learn through them can generalize well to any video without restriction on domain or object category.

Our key idea is to learn a single affinity matrix for modeling all inter-frame transformations through a network that learns appropriate feature representations that model the affinity. We show that region localization and fine-grained matching can be carried out by sharing the affinity in a fully differentiable manner: the region localization module finds a pair of patches with matching parts in the two frames (Figure 1, mid-top), and the fine-grained module reconstructs the color feature by transforming it between the patches (Figure 1, mid-bottom), all through the same affinity matrix. These two tasks symbiotically facilitate each other: the fine-grained matching module learns better feature representations that lead to an improved affinity matrix, which in turn generates better localization that reduces the search space and ambiguities for fine-grained matching (Figure 1, right).

The contributions of this work are summarized as follows: (a) A joint-task self-supervision network is introduced to find accurate correspondences at different levels across video frames. (b) A general inter-frame transformation is proposed to support both tasks and to satisfy various video constraints: coherency, cycle, and energy consistency. (c) Our method outperforms state-of-the-art methods on a variety of visual correspondence tasks, e.g., video instance and part segmentation, keypoint tracking, and object tracking. Our self-supervised method even surpasses the fully-supervised affinity feature representation obtained from a ResNet-18 pre-trained on ImageNet deng2009imagenet.

Figure 1: Our method (c) compared against (a) region-level matching (e.g., object tracking), and (b) pixel-level matching, e.g., matching by colorization vondrick2018tracking. We propose a joint-task framework that conducts region-level and fine-grained matching simultaneously, both supported by a single inter-frame affinity matrix. During training, the two tasks improve each other progressively. To illustrate this, we unroll two training iterations and mark the improvement with the red box and arrow.

2 Related Work

Learning correspondence in time is widely explored in visual tracking bertinetto2016fully; tao2016sint; valmadre2017end; wang17dcfnet; wang2018learning and optical flow estimation Sun2018PWC-Net; liu2011sift; IMKDB17. Existing models are mainly trained on large annotated datasets, which require significant annotation effort. To overcome the limit of annotations, numerous methods have been developed to learn correspondences in a self-supervised manner Wang_2019_Unsupervised; CVPR2019_CycleTime; vondrick2018tracking. Our work builds on learning correspondence with self-supervision, and we discuss the most related methods here.

Object-level correspondence.

The goal of visual tracking is to determine a bounding box in each frame based on an annotated box in the reference image. Most methods belong to one of the two categories that use: (a) the tracking-by-detection framework andriluka2008people; kalal2011tracking; wang2013learning; li2018high, which models tracking as detection applied independently to individual frames; or (b) the tracking-by-matching framework that models cross-frame relations and includes several early attempts, e.g., mean-shift trackers meanshift; yang2005efficient, kernelized correlation filters (KCF) henriques2014high; li2014scale, and several works that model correlation filters as differentiable blocks ma2015hierarchical; ma2015long; choi2017attentional; wang2017dcfnet. Most of these methods use annotated bounding boxes otb2015 in every frame of the videos to learn feature representations for tracking. Our work can be viewed as exploiting the tracking-by-matching framework in a self-supervised manner.

Fine-grained correspondence.

Dense correspondence between video frames has been widely applied for optical flow and motion estimation lucas1981iterative; Sun2018PWC-Net; liu2011sift; IMKDB17, where the goal is to track individual pixels. Most deep neural networks IMKDB17; Sun2018PWC-Net are trained with the objective of regressing the ground-truth optical flow produced by synthetic datasets Butler:ECCV:2012; DFIB15. In contrast to many classic methods lucas1981iterative; liu2011sift that model dense correspondence as a matching problem, direct regression of pixel offsets has limited capability for frames containing dramatic appearance changes brox2010large; longrange, and suffers from problems related to domain shift when applied to real-world scenarios.

Self-supervised learning.

Recently, numerous approaches have been developed for correspondence learning via various self-supervised signals, including image Lee19 or color transformation vondrick2018tracking and cycle-consistency CVPR2019_CycleTime; Wang_2019_Unsupervised. Self-supervised learning of correspondence in videos has been explored along two different directions: for region-level localization CVPR2019_CycleTime; Wang_2019_Unsupervised and for fine-grained pixel-level matching vondrick2018tracking; kong2019multigrid. In Wang_2019_Unsupervised, a correlation filter is learned to track regions via a cycle-consistency constraint, and no pixel-level correspondence is determined. CVPR2019_CycleTime develops patch-level tracking by modeling the similarity transformation of pixels within a fixed rectangular region. Conversely, several methods learn a matching network by transforming color/RGB information between adjacent frames vondrick2018tracking; lai2019self; kong2019multigrid. As no region-level regularization is exploited, these approaches are less effective when color features are less distinctive (see Figure 1(b)). In contrast, our method learns object-level and pixel-level correspondence jointly across video frames in a self-supervised manner.

3 Approach

Figure 2: Main steps of the proposed method. Blue grids represent the feature maps of the reference patch and the target frame, which are shared by the region-level localization (left box) and fine-grained matching (right box) modules. One affinity is computed between the reference patch and the target frame, and a second between the reference patch and the target patch, where the target patch is a differentiable crop from the target frame. The location maps are the coordinates of pixels on a regular grid. All modules are differentiable, and the gradient flow is visualized via the red dashed arrows.

Video frames are temporally coherent in nature. For a pair of adjacent frames, pixels in a later frame can be considered as being copied from some locations of an earlier one with slight appearance changes conforming to object motion. This “copy” operator can be expressed via a linear transformation with a matrix $A$, in which $A_{ij} = 1$ denotes that the pixel $j$ in the second frame is copied from pixel $i$ in the first one. An approximation of $A$ is the inter-frame affinity matrix valmadre2017end; liu2018switchable; CVPR2019_CycleTime:

$$A_{ij} = \kappa(f_{1i}, f_{2j}), \tag{1}$$

where $\kappa$ denotes some similarity function. Each entry $A_{ij}$ represents the similarity of subspace pixels $i$ and $j$ in the two frames $f_1$ and $f_2$, where $f \in \mathbb{R}^{C \times N}$ is a vectorized feature map with $C$ channels and $N$ pixels. In this work, our goal is to learn the feature embedding $f$ that optimally associates the contents of the two frames.

One free supervisory signal that we can utilize is color. To learn the inter-frame transformation in a self-supervised manner, we can slightly modify (1) to generate the affinity via features learned only from gray-scale images. The learned affinity is then utilized to map the color channels from one frame to another vondrick2018tracking; liu2018switchable, while using the ground-truth color as the self-supervisory signal.

One strict assumption of this formulation is that the paired frames need to have the same contents: no new object or scene pixel should emerge over time. Hence, the existing methods vondrick2018tracking; liu2018switchable sample pairs of frames either uniformly, or randomly within a specified interval, e.g., 50 frames. However, it is difficult to determine a “perfect” interval as video contents may change sporadically. When transforming color from a reference frame to a target one, the objects/scene pixels in the target frame may not exist in the reference frame, thereby leading to wrong matches and an adverse effect on feature learning. Another issue is that a large portion of the video frames are “static”, in which case the sampled pair of frames are almost the same and cause the learned affinity to be an identity matrix.

We show that the above problems can be addressed by incorporating a region-level localization module. Given a pair of reference and target frames, we first randomly sample a patch in the reference frame and localize this patch in the target frame (see Figure 2). The inter-frame color transformation is then estimated between the paired patches. Both localization and color transformation are supported by a single affinity derived from a convolutional neural network (CNN), because, as discussed in this section, the affinity matrix can simultaneously track locations and transform features.

3.1 Transforming Feature and Location via Affinity

We sample a pair of frames and denote the first frame as the reference and the second one as the target. The CNN can be any effective model, e.g., a ResNet-18 he2016deep with the first 4 blocks that takes a gray-scale image as input. We compute the affinity and conduct the feature transformation and localization on the top layer of the CNN, with features that are one-eighth the size of the input image. This keeps the affinity matrix memory efficient and ensures that each pixel in the feature space contains considerable local contextual information.

Transforming feature representations.

We adopt the dot product for $\kappa$ in (1) to compute the affinity, where each column can be interpreted as the similarity score between a point in the target frame and all points in the reference frame. For dense correspondence, the inter-frame affinity needs to be sparse to ensure one-to-one mapping. However, it is challenging to model a sparse matrix in a deep neural network. We relax this constraint and encourage the affinity matrix to be sparse by normalizing each column with the softmax function, so that the similarity score distribution can be peaky and only a few pixels with high similarity in the reference frame are matched to each point in the target frame:

$$A_{ij} = \kappa(f_{1i}, f_{2j}) = \frac{\exp\left(f_{1i}^{\top} f_{2j}\right)}{\sum_{k} \exp\left(f_{1k}^{\top} f_{2j}\right)}, \tag{2}$$

where the variable definitions follow (1). The transformation is carried out as $\hat{c}_2 = c_1 A$, where $A \in \mathbb{R}^{N_1 \times N_2}$, and $c_1$ has the same number of entries as $f_1$ and can be features of the reference frame or any associated label, e.g., color, segmentation mask or keypoint heatmap.
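The column-wise softmax and the label transformation described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `column_softmax_affinity` and `transform`, the toy features, and the toy colors are assumptions made for illustration only.

```python
import numpy as np


def column_softmax_affinity(f_ref, f_tgt):
    """Affinity of shape (N1, N2) from features of shape (C, N1) and (C, N2).

    Each column is a softmax over reference pixels, so it sums to 1.
    """
    logits = f_ref.T @ f_tgt                               # dot-product similarity
    logits = logits - logits.max(axis=0, keepdims=True)    # numerical stability only
    weights = np.exp(logits)
    return weights / weights.sum(axis=0, keepdims=True)


def transform(c_ref, affinity):
    """Carry reference labels/features (K, N1) over to the target frame (K, N2)."""
    return c_ref @ affinity


# Toy example (assumed data): four pixels with well-separated features,
# and a target frame that is a reordering of the reference frame.
f_ref = 4.0 * np.eye(4)
order = np.array([2, 0, 3, 1])
f_tgt = f_ref[:, order]
colors_ref = np.array([[0.1, 0.4, 0.7, 0.9],
                       [0.2, 0.5, 0.6, 0.3]])

A = column_softmax_affinity(f_ref, f_tgt)
colors_tgt = transform(colors_ref, A)   # close to colors_ref[:, order]
```

Because the toy features are well separated, every column of the affinity is nearly one-hot and the colors are copied almost exactly; with less distinctive features the columns become flatter and the transferred colors are blends of several reference pixels.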

Tracing pixel locations.

We denote $l \in \mathbb{R}^{2 \times N}$ as the vectorized location map for an image/feature with $N$ pixels. Given a sparse affinity matrix, the location of an individual pixel can be traced from a reference frame to an adjacent target frame:

$$l^{12}_{j} = \sum_{k=1}^{N_1} l^{11}_{k} A_{kj}, \quad \forall j \in [1, N_2], \tag{3}$$

where $l^{12}_{j}$ represents the coordinate in frame 1 that transits to the $j$-th pixel in frame 2. Note that $l^{ii}$ (e.g., $l^{11}$ in (3)) usually represents a canonical grid as shown in Figure 3.
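As a minimal sketch of location tracing, the canonical grid can be stored as a 2 x N array of (x, y) coordinates and multiplied by the affinity. The helper names `canonical_grid` and `trace_locations` and the toy affinities below are assumptions for illustration, not part of the original method description.

```python
import numpy as np


def canonical_grid(height, width):
    """(2, H*W) array of (x, y) pixel coordinates in row-major order."""
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    return np.stack([xs.ravel(), ys.ravel()]).astype(float)


def trace_locations(grid_ref, affinity):
    """For every target pixel, the expected reference coordinate it came from."""
    return grid_ref @ affinity


grid = canonical_grid(2, 3)                  # 6 pixels
source = np.array([1, 2, 2, 4, 5, 5])        # assumed: target pixel j copies pixel source[j]
A = np.zeros((6, 6))
A[source, np.arange(6)] = 1.0                # hard (one-hot) columns
traced = trace_locations(grid, A)            # equals grid[:, source]

A_soft = A.copy()
A_soft[:, 0] = 0.0
A_soft[[0, 1], 0] = 0.5                      # column 0 split between pixels 0 and 1
traced_soft = trace_locations(grid, A_soft)  # column 0 is the midpoint (0.5, 0.0)
```

With one-hot columns the product is a pure lookup, while soft columns give a weighted average of reference coordinates, which is what keeps the operation differentiable.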

3.2 Region-level Localization

In the target frame, region-level localization aims to localize a patch randomly selected from the reference frame by predicting a bounding box (denoted as “bbox”) on a region that shares matching parts with the selected patch. In other words, it is a differentiable region of interest (ROI) with learnable center and scale. We compute an affinity according to (2) between the feature representations of the patch in the reference frame and those of the whole target frame (see Figure 2(a)).

Locating the center.

To track the center position of the reference patch in the target frame, we first localize each individual pixel of the reference patch in the target frame according to (3). This yields a set of coordinates, one per pixel of the reference patch, that collects the locations of the most similar pixels in the target frame. We then compute the average coordinate of all these points as the estimated new position of the reference patch.

Scale modeling.

For region-level tracking, the reference patch may undergo significant scale changes. Scale estimation in object tracking is challenging, and existing methods mainly enumerate possible scales bertinetto2016fully; Wang_2019_Unsupervised and select the optimal one. In contrast, our proposed model estimates the scale directly. We assume that the transformed locations are still distributed uniformly in a local rectangular region. Denoting the width of the new bounding box as $\hat{w}$, the scale is estimated by:


where $x_i$ is the x-coordinate of the $i$-th entry in the transformed location map. We note that (4) can be justified using the analogous continuous space. Suppose there is a rectangle with scale $w$ whose center is located at the origin of a 2D coordinate plane. By integrating over the points inside it, we have:


This represents the average absolute distances w.r.t. the center when transforming to the discrete space. The estimation of height is conducted in the same manner.
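The center and scale estimates can be sketched as follows, under an explicitly assumed convention. If the x-coordinates are uniform over a box of full width $w$ centered at $c$, then $\mathbb{E}|x - c| = \frac{1}{w}\int_{-w/2}^{w/2} |x|\,dx = \frac{w}{4}$, so the width is four times the mean absolute deviation. If $w$ instead denotes the half-extent, the factor is 2. The function `estimate_box` and its `scale_factor` parameter are hypothetical names, and the constant used in the original equation may differ from the default chosen here.

```python
import numpy as np


def estimate_box(traced_xy, scale_factor=4.0):
    """Center, width and height of a box from traced coordinates of shape (2, N).

    scale_factor=4 assumes points spread uniformly over the full box extent
    (an assumption of this sketch, not a value taken from the paper).
    """
    center = traced_xy.mean(axis=1)
    mean_abs_dev = np.abs(traced_xy - center[:, None]).mean(axis=1)
    width, height = scale_factor * mean_abs_dev
    return center, float(width), float(height)


# Assumed toy data: a dense uniform grid filling a 10 x 4 box centered at (30, 12).
xs, ys = np.meshgrid(np.linspace(25, 35, 201), np.linspace(10, 14, 201))
points = np.stack([xs.ravel(), ys.ravel()])
center, width, height = estimate_box(points)
```

On this toy grid the estimate recovers the box center and comes within about one percent of the true width and height; the small bias comes from sampling the box on a finite grid.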

Moving as a unit.

An important assumption in the aforementioned ROI estimation in the target frame is that the pixels from the reference patch should move in unison. This holds in most videos, as an object or its parts typically move as one unit at the region level. We enforce this constraint with a concentration regularization zhang2018unsupervised; hung2019scops term on the transformed pixels, with a truncated loss that penalizes points for moving too far away from the center:


This formulation encourages all the tracked pixels, originally from a patch, to be concentrated (see Figure 3) rather than dispersed to other objects. Such dispersion is likely to happen for methods based on pixel-wise matching only: for example, when matching by color reconstruction, pixels of different objects with similar colors may match each other, as shown in Figure 1(b).
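One possible form of a truncated concentration penalty is sketched below. The exact truncation used in (6) is not recoverable from the text, so the hinge on the distance to the center, the `radius` parameter, and the function name `truncated_concentration_loss` are all assumptions of this illustration.

```python
import numpy as np


def truncated_concentration_loss(traced_xy, radius):
    """Penalize traced points (2, N) only when they lie farther than `radius`
    from their mean position (hypothetical truncation rule)."""
    center = traced_xy.mean(axis=1, keepdims=True)
    dist = np.linalg.norm(traced_xy - center, axis=0)
    excess = np.maximum(dist - radius, 0.0)       # zero inside the allowed radius
    return float(np.mean(excess ** 2))


# Assumed toy data: a compact patch, and the same patch with one stray point.
compact = np.array([[0.0, 1.0, 0.0, 1.0],
                    [0.0, 0.0, 1.0, 1.0]])
stray = compact.copy()
stray[:, 3] = [9.0, 9.0]

loss_compact = truncated_concentration_loss(compact, radius=1.0)   # no point is penalized
loss_stray = truncated_concentration_loss(stray, radius=1.0)       # the outlier is penalized
```

The truncation leaves ordinary deformations inside the region unpenalized, so only points that drift away from the patch contribute gradients.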

Figure 3: Concentration (left) and orthogonal (right) regularization. The dots denote pixels in feature space. The orange arrows show how they push the pixels.

3.3 Fine-grained Matching

Fine-grained matching aims to reconstruct the color information of the located patch in the target frame, given the reference patch (see Figure 1). We re-use the inter-frame affinity by extracting a sub-affinity matrix containing the columns corresponding to the located pixels in the target frame, and by using it for the color transformation described in the formulations in Section 3.1. To make the color feature compatible with the affinity matrix, we train an auto-encoder that learns to reconstruct an image in the Lab space faithfully (see the encoder and the decoder in Figure 2). This network also encodes global contextual information from color channels. We show that using the color feature instead of pixels significantly reduces the errors caused by reconstructing color directly in the image space vondrick2018tracking (see Table 1, ours vs. vondrick2018tracking). In the following, we introduce self-supervisory signals as regularization for fine-grained matching. For brevity, we denote as the sub-affinity, and as the vectorized coordinate and feature map, respectively, for the paired patches.

Orthogonal regularization.

Another important constraint for the transformation of both location CVPR2019_CycleTime and feature liu2018switchable is cycle-consistency, which we impose through orthogonal regularization. For a pair of patches, we encourage every pixel to fall into the same location after one cycle of forward and backward tracking, as shown in Figure 3 (middle and right):


Here we specifically add to denote affinity transforming from the frame to , i.e., . Similarly, the cycle-consistency can be applied to the feature space:


We show that enforcing cycle-consistency is equivalent to regularizing to be orthogonal: With (7) and (8), it is easy to show that the optimal solution is achieved when . Inspired by recent style transfer methods Gatys_2016_CVPR; liu2018switchable, the color energy represented by the Gram-matrix should be consistent such that , which derives that is the goal to reconstruct the color information. Thus, it is easy to show that regularizing as orthogonal automatically satisfies the cycle constraint. In practice, we switch the role of reference and target to perform the transformation, as described in (7) and (8). We use the MSE loss between both and , and , and specifically replace with in Eq. (8) to enforce the regularization. Namely, the orthogonal regularization provides a concise mathematical formulation for many recent works CVPR2019_CycleTime; Wang_2019_Unsupervised that exploit cycle-consistency in videos.
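The link between orthogonality and cycle-consistency can be checked numerically with a small sketch. It assumes that the backward affinity is applied after the forward one by matrix multiplication and measures the mean squared error after one cycle; the helper `cycle_errors` and the toy matrices are illustrative assumptions, not the paper's training code.

```python
import numpy as np


def cycle_errors(loc_ref, feat_ref, a_fwd, a_bwd):
    """Mean squared error after one forward/backward cycle through the affinities."""
    cycle = a_fwd @ a_bwd                              # (N1, N1), ideally the identity
    loc_err = np.mean((loc_ref @ cycle - loc_ref) ** 2)
    feat_err = np.mean((feat_ref @ cycle - feat_ref) ** 2)
    return float(loc_err), float(feat_err)


rng = np.random.default_rng(0)
loc = np.array([[0.0, 1.0, 0.0, 1.0],
                [0.0, 0.0, 1.0, 1.0]])
feat = rng.normal(size=(3, 4))

P = np.eye(4)[:, [2, 0, 3, 1]]            # permutation matrix: orthogonal, P @ P.T = I
sharp = cycle_errors(loc, feat, P, P.T)   # both errors are 0

U = np.full((4, 4), 0.25)                 # maximally blurry affinity, far from orthogonal
blurry = cycle_errors(loc, feat, U, U.T)  # both errors are positive
```

For the orthogonal matrix, the Gram matrix of the transformed features also equals that of the original features, which is the energy-preservation property mentioned in the text.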

Concentration regularization.

We additionally apply the concentration loss (i.e., Eq. (6) without the truncation) in local, non-overlapping grids of a feature map, to encourage local context or object parts to move as an entity over time. Unlike CVPR2019_CycleTime; Rocco18, where local patches are regularized by similarity transformation via a spatial transformer network jaderberg2015spatial, this local concentration loss is more flexible by allowing arbitrary deformations within each local grid.

4 Experiments

We compare with state-of-the-art algorithms vondrick2018tracking; Wang_2019_Unsupervised; CVPR2019_CycleTime on several tasks: instance mask propagation, pose keypoints tracking, human parts segmentation propagation and visual tracking.

4.1 Network Architecture

As shown in Figure 2, our model consists of a region-level localization module and a fine-grained matching module that share a feature representation network. We use the ResNet-18 he2016deep as the network for fair comparisons with vondrick2018tracking; CVPR2019_CycleTime. The patch randomly cropped from the reference frame is of pixels.


We first train the auto-encoder in the matching module (the encoder “E” and decoder “D” in Figure 2) to reconstruct images in the Lab space using the MSCOCO lin2014microsoft dataset. We then fix it and train the feature representation network using the Kinetics dataset kay2017kinetics. For all experiments, we train our model from scratch without any level of pre-training or human annotations. The objectives include: (a) concentration loss (Sections 3.2 and 3.3), (b) color reconstruction loss and (c) orthogonal regularization (Section 3.3). Involving the localization module from the beginning of the training process prevents the network from converging, because poor localization makes matching impossible. Thus we first train our network using patches cropped at the same location with the same size in the reference and target frames, respectively. Fine-grained matching is conducted between the two patches for 10 epochs. We then jointly train the localization and matching modules for another 10 epochs.


In the inference stage, we directly apply the learned affinity, which was trained to transform color feature representations, to different types of inputs, e.g., segmentation masks and keypoint maps. We use the same testing protocol as Wang et al. CVPR2019_CycleTime for all tasks. Similar to CVPR2019_CycleTime, we adopt a recurrent inference strategy by propagating the ground truth segmentation mask or keypoint heatmap from the first frame, as well as the predicted results from the preceding $k$ frames, onto the target frame. We average all predictions to obtain the final propagated map ($k$ is 1 for the VIP dataset, and 7 for all the other tasks). To compare with the ResNet-18 trained on ImageNet with classification labels, we replace our learned network weights with its weights and leave the other settings unchanged for fair comparisons.
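The recurrent inference strategy can be sketched as a loop that, for each target frame, propagates labels from the first frame and from up to `k` preceding frames and averages the results. This is a simplified illustration: the functions `affinity` and `propagate_video`, the equal-weight averaging, and the toy features are assumptions, and real inputs would be CNN feature maps rather than one-hot vectors.

```python
import numpy as np


def affinity(f_ref, f_tgt):
    """Column-wise softmax of dot-product similarities, shape (N_ref, N_tgt)."""
    logits = f_ref.T @ f_tgt
    logits -= logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=0, keepdims=True)


def propagate_video(features, first_labels, k=7):
    """Propagate labels (K, N) of frame 0 through a list of per-frame features (C, N)."""
    outputs = [first_labels]
    for t in range(1, len(features)):
        # References: the annotated first frame plus up to k preceding predictions.
        ref_ids = [0] + list(range(max(1, t - k), t))
        preds = [outputs[i] @ affinity(features[i], features[t]) for i in ref_ids]
        outputs.append(np.mean(preds, axis=0))
    return outputs


# Assumed toy video: the same four pixels, cyclically shifted in every frame.
base = 4.0 * np.eye(4)
shifts = [np.roll(np.arange(4), t) for t in range(4)]
features = [base[:, s] for s in shifts]
labels0 = np.array([[1.0, 0.0, 0.0, 1.0],
                    [0.0, 1.0, 1.0, 0.0]])
outputs = propagate_video(features, labels0, k=2)   # outputs[t] ~ labels0[:, shifts[t]]
```

Keeping the first frame among the references anchors the prediction to the ground truth, while the recent frames help when appearance has drifted away from the first frame.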

Figure 4: Visualization of the propagation results. (a) Instance mask propagation on the DAVIS-2017 pont20172017 dataset. (b) Pose keypoints propagation on the J-HMDB jhuang2013towards dataset. (c) Parts segmentation propagation on the VIP zhou2018adaptive dataset. (d) Visual tracking on the OTB2015 otb2015 dataset.

4.2 Instance Segmentation Mask Propagation on the DAVIS-2017 dataset

Figure 5: Qualitative comparison with other methods. (a) Reference frame with instance masks. (b) Results by the ResNet-18 trained on ImageNet. (c) Results by Wang et al. CVPR2019_CycleTime. (d) Ours (global matching). (e) Ours with localization during inference. (f) Target frame with ground truth instance masks.

Figure 4 (a) and Figure 5 show the propagated instance masks, and Table 1 lists quantitative results of all evaluated methods based on the Jaccard index $\mathcal{J}$ (IoU) and contour-based accuracy $\mathcal{F}$. Our model performs favorably against the self-supervised state-of-the-art methods. Specifically, our model outperforms Wang et al. CVPR2019_CycleTime by 13.3% in mean $\mathcal{J}$ and 16.6% in mean $\mathcal{F}$, and is even 6.9% better in mean $\mathcal{J}$ and 4.1% better in mean $\mathcal{F}$ than the ResNet-18 model he2016deep trained on ImageNet deng2009imagenet with classification labels.
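For reference, the Jaccard index of a pair of binary masks and the mean/recall summaries can be computed as in the sketch below. The convention for two empty masks and the 0.5 recall threshold are assumptions of this sketch (the latter follows the usual DAVIS practice as far as we are aware), and this is not the official evaluation code.

```python
import numpy as np


def jaccard_index(pred_mask, gt_mask):
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                      # assumed convention: two empty masks agree
    return float(np.logical_and(pred, gt).sum() / union)


def mean_and_recall(scores, threshold=0.5):
    """Mean score and the fraction of scores above `threshold`."""
    scores = np.asarray(scores, dtype=float)
    return float(scores.mean()), float((scores > threshold).mean())


pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [1, 0]])
iou = jaccard_index(pred, gt)                       # intersection 1, union 3
mean_j, recall_j = mean_and_recall([0.2, 0.6, 0.7, 0.9])
```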

Furthermore, we demonstrate that by including the localization module during inference, our model can exclude noise from background pixels. Given the instance masks in the first frame, we obtain the bounding box w.r.t. the instance mask and first locate it in the target frame by our localization module. Then, we propagate the instance masks within the bounding box in the reference frame to the localized bounding box in the target frame using our matching module. Since the propagation is carried out within two bounding boxes instead of the entire frames, we can minimize noise introduced by background pixels, as shown in Figure 5 (d) and (e). Quantitatively, this improved model outperforms the model that does not include the localization module during inference (see “Ours-track” vs. “Ours” in Table 1).

| Model | Supervised | Dataset | $\mathcal{J}$ (Mean) | $\mathcal{J}$ (Recall) | $\mathcal{F}$ (Mean) | $\mathcal{F}$ (Recall) |
|---|---|---|---|---|---|---|
| SIFT Flow liu2011sift | no | - | 33.0 | - | 35.0 | - |
| DeepCluster caron2018deep | no | YFCC100M thomee2015yfcc100m | 37.5 | - | 33.2 | - |
| Transitive Inv Wang_UnsupICCV2017 | no | - | 32.0 | - | 26.8 | - |
| Vondrick et al. vondrick2018tracking | no | Kinetics kay2017kinetics | 34.6 | 34.1 | 32.7 | 26.8 |
| Wang et al. CVPR2019_CycleTime | no | VLOG Fouhey18 | 43.0 | 43.7 | 42.6 | 41.3 |
| Ours | no | Kinetics kay2017kinetics | 56.3 | 65.0 | 59.2 | 64.1 |
| Ours-track | no | Kinetics kay2017kinetics | 57.7 | 68.3 | 61.3 | 69.8 |
| ResNet-18 (3 blocks) | yes | ImageNet deng2009imagenet | 49.4 | 52.9 | 55.1 | 56.6 |
| ResNet-18 (4 blocks) | yes | ImageNet deng2009imagenet | 40.2 | 36.1 | 42.5 | 36.6 |
| OSVOS Cae+17 | yes | ImageNet, DAVIS pont20172017 | 56.6 | 63.8 | 63.9 | 73.8 |

Table 1: Evaluation of instance segmentation propagation on the DAVIS-2017 dataset pont20172017. A more comprehensive comparison can be found in the supplementary.

Figure 6: Visualization of the ablation studies. Given a set of points in the reference frame (a), we visualize the results of propagating these points on to the target frame (b). “L”, “C”, “O” and “all” correspond to the localization module, concentration regularization, orthogonal regularization, and all of them, respectively (d-g).

4.3 Ablation Studies on the DAVIS-2017 Dataset

We carry out ablation studies to see the contributions of each term, as shown in Figure 6 and Table 2. Note that inference is conducted between a pair of full-size frames without localization.

| Metric | (a) Ours-track | (b) Ours | (c) -L | (d) -O | (e) -C | (f) -O&C | (g) -all |
|---|---|---|---|---|---|---|---|
| $\mathcal{J}$ (Mean) | 57.7 | 56.3 | 53.8 | 55.2 | 48.3 | 44.3 | 45.7 |
| $\mathcal{F}$ (Mean) | 61.3 | 59.2 | 58.3 | 58.7 | 52.4 | 49.6 | 52.3 |

Table 2: Ablation studies. The minus sign “-” indicates training without the specific module or regularization. “L”, “O” and “C” mean the localization module, orthogonal and concentration regularization, respectively. The last column (“(g) -all”) shows results of a baseline model trained without any of “L”, “O” or “C”.

Region-level Localization.

Our model trained with the region-level localization module is able to place the individual points all within a reasonable local region (Figure 6 (c)). We show that the model can accurately capture both region-level shifts (e.g., person moving forward), and subtle deformations (e.g., movement of body parts), while preserving the correct spatial relations among all the points. In contrast, the model trained without the localization module tends to model global matching, leading to less accurate preservation of the local spatial relationships among points, e.g., the red points in Figure 6 (d) tend to cluster together as shown in the cyan circle. Consistent quantitative results can also be found in Table 2 (c), where the $\mathcal{J}$ and $\mathcal{F}$ measures drop 2.5% and 0.9%, respectively, when trained without the localization module. We also find that the localization module should always be trained together with the concentration loss to satisfy the assumption in Section 3.2 (Table 2 (f) and (g)).

Concentration regularization.

The concentration regularization encourages locality during the transformation process, i.e., points within a neighbourhood in the reference frame stay together in the target frame. The model trained without it tends to introduce outliers, as shown in the cyan circle of Figure 6(e). Table 2 (b) and (e) demonstrate the contribution of this concentration regularization term: compared to (b), the mean $\mathcal{J}$ in (e) decreases by 8% without this regularization term.

Orthogonal regularization.

The orthogonal regularization term encourages points to match back to themselves after a cycle of forward and backward transformation. As shown in Figure 6 (f), the model trained without the orthogonal regularization term is less effective in preserving local structures. The effectiveness of the orthogonal regularization is also validated quantitatively in Table 2 (e) and (f).

| Model | Supervised | AUC score (%) |
|---|---|---|
| UDT Wang_2019_Unsupervised | no | 59.4 |
| Ours | no | 59.2 |
| ResNet-18 | yes | 55.6 |
| Fully Supervised bertinetto2016fully | yes | 58.2 |

Table 3: Tracking results on OTB2015 otb2015.

4.4 Tracking Pose Keypoint Propagation on the J-HMDB Dataset

We demonstrate that our model learns accurate correspondence by evaluating it on the J-HMDB dataset jhuang2013towards, which requires precise matching of points compared to the coarser propagation of masks. Given the 15 ground truth human pose keypoints in the first frame, we propagate them to the remaining frames. We quantitatively evaluate performance using the probability of correct keypoint (PCK) metric yang2013articulated, which measures the ratio of joints that fall within a threshold distance from the ground truth joint locations. We show quantitative evaluations against the state-of-the-art methods in Table 5 and qualitative propagation results in Figure 4(b). Our model performs well versus all self-supervised methods CVPR2019_CycleTime; vondrick2018tracking and notably achieves better results than ResNet-18 he2016deep trained with classification labels deng2009imagenet.
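A minimal sketch of the PCK computation is given below. The text only states that a joint counts as correct when it falls within a threshold distance of the ground truth; expressing the threshold as `alpha` times a reference size (for example, the person bounding-box size) is an assumption of this sketch, as are the function name `pck` and the toy coordinates.

```python
import numpy as np


def pck(pred, gt, ref_size, alpha):
    """Fraction of keypoints whose error is at most alpha * ref_size.

    pred, gt: arrays of shape (num_keypoints, 2) holding (x, y) coordinates.
    """
    dist = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=-1)
    return float(np.mean(dist <= alpha * ref_size))


# Assumed toy data: four joints with horizontal errors of 0, 1, 3 and 5 pixels.
gt = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
pred = gt + np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [5.0, 0.0]])

pck_01 = pck(pred, gt, ref_size=20.0, alpha=0.1)   # threshold 2 px: 2 of 4 joints
pck_02 = pck(pred, gt, ref_size=20.0, alpha=0.2)   # threshold 4 px: 3 of 4 joints
```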

4.5 Visual Tracking on the OTB Dataset

Beyond tasks that require dense matching, e.g., segmentation or keypoint propagation, the features learned by our model can be applied to object matching tasks such as visual tracking, because of their capability of localizing an object or a relatively global region. Without any fine-tuning, we directly integrate our network trained via self-supervision into a classic tracking framework Wang_2019_Unsupervised; wang17dcfnet based on correlation filters, by replacing the Siamese network in Wang_2019_Unsupervised; wang17dcfnet with our model, while keeping other parts in the tracking framework unchanged. Even without training with a correlation filter, our features are general and robust enough to achieve performance on the OTB2015 dataset otb2015 comparable to methods trained with this filter Wang_2019_Unsupervised, as shown in Table 3. Figure 4(d) shows that our learned features are robust against occlusion (left), object scale, as well as illumination changes (right) and can track objects through a long sequence (hundreds of frames in the OTB2015 dataset).

Table 4: Segmentation propagation on VIP zhou2018adaptive.

| Model | Supervised | mIoU (semantic) | Mean AP (instance) |
|---|---|---|---|
| DeepCluster caron2018deep | | 21.8 | 8.1 |
| Wang et al. CVPR2019_CycleTime | | 28.9 | 15.6 |
| Ours | | 34.1 | 17.7 |
| ResNet-18 | ✓ | 31.8 | 12.6 |
| Fully Supervised zhou2018 | ✓ | 37.9 | 24.1 |

Table 5: Keypoint propagation on J-HMDB jhuang2013towards.

| Model | Supervised | PCK@.1 | PCK@.2 |
|---|---|---|---|
| Vondrick et al. vondrick2018tracking | | 45.2 | 69.6 |
| Wang et al. CVPR2019_CycleTime | | 57.3 | 78.1 |
| Ours | | 58.6 | 79.8 |
| ResNet-18 | ✓ | 53.8 | 74.6 |
| Fully Supervised yang2018efficient | ✓ | 68.7 | 92.1 |

4.6 Semantic and Instance Propagation on the VIP Dataset

We evaluate our method on the VIP dataset zhou2018adaptive, which includes dense human part segmentation masks on both the semantic and instance levels. We use the same settings as Wang et al. CVPR2019_CycleTime and resize the input frames to . For the semantic propagation task, we propagate the semantic segmentation maps of human parts (e.g., arms and legs) and evaluate performance via the mean IoU metric. For the part instance propagation task, we propagate the instance-level segmentation of human parts (e.g., arms of the first person or legs of the second person) and evaluate performance via the mean average precision of the instance-level human parsing metric li2017holistic. Table 4 shows that, for both tasks, our method performs favourably against all self-supervised methods and, notably, against the ResNet-18 model trained on ImageNet with classification labels. Figure 4(c) shows sample semantic segmentation propagation results. Interestingly, in the second example our model correctly propagates each part mask onto an unseen instance (the woman, who does not appear in the first frame).
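For reference, the mean IoU used for the semantic propagation task can be sketched as follows. This is a generic formulation and not the official VIP evaluation script; in particular, skipping classes that are absent from both maps and averaging over classes within a single label map are assumptions made for this example.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union across classes for one pair of label maps."""
    pred = np.asarray(pred)
    gt = np.asarray(gt)
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both maps: skipped (assumption)
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")
```

Here `pred` would be the propagated part-label map for a frame and `gt` the annotated one.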

5 Conclusions

In this work, we propose to learn correspondences across video frames in a self-supervised manner. Our method jointly tackles region-level and pixel-level correspondence learning and allows them to facilitate each other through a shared inter-frame affinity matrix. Experimental results demonstrate the effectiveness of our approach versus the state-of-the-art self-supervised video correspondence learning methods, as well as supervised models such as the ResNet-18 trained on ImageNet with classification labels.



Appendix A Implementation

We train our model using Adam kingma2014adam as the optimizer with a learning rate of for the warm-up and for the joint training of the localization and matching modules. We set the temperature in the softmax layer applied to the top-layer CNN features to 1. For fair comparison, we use the same k-NN propagation scheme as Wang et al. CVPR2019_CycleTime and set for all tasks.
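The k-NN propagation with a softmax temperature can be sketched as below. This is an illustrative reading of the scheme, not the authors' code: the function name, the dot-product affinity, and restricting the softmax to the top-k neighbours are assumptions, and `k` is left as a required argument because its value is not recoverable from the text above.

```python
import numpy as np

def knn_propagate(ref_feat, tgt_feat, ref_labels, k, temperature=1.0):
    """Propagate labels from reference pixels to target pixels.

    ref_feat:   (N, C) features of reference-frame pixels
    tgt_feat:   (M, C) features of target-frame pixels
    ref_labels: (N, L) one-hot or soft labels of reference pixels
    Returns an (M, L) array of propagated soft labels.
    """
    affinity = tgt_feat @ ref_feat.T / temperature            # (M, N)
    k = min(k, affinity.shape[1])
    # Indices of the k most similar reference pixels for each target pixel.
    topk = np.argpartition(-affinity, k - 1, axis=1)[:, :k]   # (M, k)
    top_aff = np.take_along_axis(affinity, topk, axis=1)
    # Numerically stable softmax over the k selected affinities.
    top_aff = top_aff - top_aff.max(axis=1, keepdims=True)
    weights = np.exp(top_aff)
    weights /= weights.sum(axis=1, keepdims=True)
    # Weighted average of the neighbours' labels.
    return np.einsum("mk,mkl->ml", weights, ref_labels[topk])

# Usage with toy features; k=2 is an arbitrary value for the example.
ref_feat = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]])
ref_labels = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
tgt_feat = np.array([[1.0, 0.0], [0.0, 1.0]])
propagated = knn_propagate(ref_feat, tgt_feat, ref_labels, k=2)
```

Each row of `propagated` is a convex combination of neighbour labels, so taking the arg-max per row yields a hard label for each target pixel.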

Appendix B Texture Propagation

Figure 7: Texture Propagation.

In Figure 7, we show results of texture propagation. Following Wang et al. CVPR2019_CycleTime, we overlay a texture map on the object in the first video frame and then propagate this texture map across the rest of the video frames. As shown in Figure 7, our model preserves the texture well during propagation, which indicates that it is able to find precise correspondences between video frames.

Appendix C Instance Segmentation Propagation on DAVIS-2017

In Figure 8, we show more instance mask propagation results on the DAVIS-2017 dataset pont20172017. Our model is resilient to rapid changes in object shape and scale, e.g., the horse, the motorbike and the cart in Figure 8. In Table 6, we provide further comparisons with state-of-the-art methods. We use the full 480p images during inference for our model. For fair comparison, we test the model by Wang et al. CVPR2019_CycleTime at 480p resolution, in addition to the result reported using images.

In Figure 9, we visualize the process of including the localization module during inference. Given the instance mask of the first frame, we first propagate each point (marked as green) from the reference frame to the target frame by localizing a bbox on it before matching. Instead of directly applying the center as described in Section 3.2 in the paper, we refine the center at inference by applying the mean-shift algorithm, i.e.,


where is the coordinate of the pixel, the is the center of all at the iteration, and . Scale is estimated via Eq.(4) as well, see the bboxes in Figure 9. The green points in Figure 9 illustrate the individually propagated points and the red bounding box indicates the estimated bounding box of an object in the target frame. We then propagate the instance segmentation mask within the bounding box in the reference frame to the bounding box in the target frame.
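The mean-shift refinement of the center can be illustrated with a generic flat-kernel version, shown below. This is only a sketch under stated assumptions: the kernel, the bandwidth, and the stopping rule used in the paper are not recoverable from the text above, so `bandwidth`, `num_iters`, and the function name are hypothetical.

```python
import numpy as np

def mean_shift_center(points, init_center, bandwidth, num_iters=10):
    """Refine a center estimate by iteratively averaging nearby points.

    points:      (N, 2) coordinates of the individually propagated points
    init_center: (2,) initial center, e.g. the mean of all points
    bandwidth:   radius of the flat kernel (assumption)
    """
    points = np.asarray(points, dtype=float)
    center = np.asarray(init_center, dtype=float)
    for _ in range(num_iters):
        dist = np.linalg.norm(points - center, axis=1)
        inside = dist <= bandwidth
        if not inside.any():
            break  # no points within the window; keep the current center
        new_center = points[inside].mean(axis=0)
        if np.allclose(new_center, center):
            break  # converged
        center = new_center
    return center

# Usage: a single outlier pulls the plain mean away; mean-shift recovers the cluster.
pts = np.array([[9, 10], [11, 10], [10, 9], [10, 11], [50, 50]], dtype=float)
refined = mean_shift_center(pts, pts.mean(axis=0), bandwidth=15.0)
```

The point of the refinement is robustness: a few badly propagated points shift the plain average, while the windowed average moves toward the dense cluster of points.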

| Model | Supervised | Dataset | J (Mean) | J (Recall) | F (Mean) | F (Recall) |
|---|---|---|---|---|---|---|
| SIFT Flow liu2011sift | | - | 33.0 | - | 35.0 | - |
| DeepCluster caron2018deep | | YFCC100M thomee2015yfcc100m | 37.5 | - | 33.2 | - |
| Transitive Inv Wang_UnsupICCV2017 | | - | 32.0 | - | 26.8 | - |
| Vondrick et al. vondrick2018tracking | | Kinetics kay2017kinetics | 34.6 | 34.1 | 32.7 | 26.8 |
| Wang et al. CVPR2019_CycleTime () | | VLOG Fouhey18 | 43.0 | 43.7 | 42.6 | 41.3 |
| Wang et al. CVPR2019_CycleTime (480p) | | VLOG Fouhey18 | 46.4 | 50.1 | 50.0 | 48.0 |
| mgPFF kong2019multigrid | | - | 42.2 | 41.8 | 46.9 | 44.4 |
| Lai et al. lai2019self | | Kinetics kay2017kinetics | 47.7 | - | 51.3 | - |
| ours | | Kinetics kay2017kinetics | 56.8 | 65.7 | 59.5 | 65.1 |
| ours-track | | Kinetics kay2017kinetics | 57.7 | 67.1 | 60.0 | 65.7 |
| ResNet-18 (3 blocks) | ✓ | ImageNet deng2009imagenet | 49.4 | 52.9 | 55.1 | 56.6 |
| ResNet-18 (4 blocks) | ✓ | ImageNet deng2009imagenet | 40.2 | 36.1 | 42.5 | 36.6 |
| SiamMask Wang2019SiamMask | ✓ | YouTube-VOS xu2018youtube | 54.3 | 62.8 | 58.5 | 67.5 |
| OSVOS Cae+17 | ✓ | ImageNet, DAVIS pont20172017 | 56.6 | 63.8 | 63.9 | 73.8 |

Table 6: Evaluation of instance segmentation propagation on the DAVIS-2017 dataset pont20172017.
Figure 8: Instance mask propagation results.
Figure 9: Visualization of the process of including the localization module during inference.