We propose an unsupervised visual tracking method in this paper. Unlike existing approaches that use extensive annotated data for supervised learning, our CNN model is trained on large-scale unlabeled videos in an unsupervised manner. Our motivation is that a robust tracker should be effective in both forward and backward prediction (i.e., the tracker can localize the target object forward in successive frames and backtrace to its initial position in the first frame). We build our framework on a Siamese correlation filter network, which is trained using unlabeled raw videos. Meanwhile, we propose a multiple-frame validation method and a cost-sensitive loss to facilitate unsupervised learning. Without bells and whistles, the proposed unsupervised tracker achieves the baseline accuracy of fully supervised trackers, which require complete and accurate labels during training. Furthermore, the unsupervised framework shows potential in leveraging unlabeled or weakly labeled data to further improve the tracking accuracy.
Visual tracking is a fundamental task in computer vision, which aims to localize the target object in a video given a bounding box annotation in the first frame. The state-of-the-art deep tracking methods [1, 46, 15, 55, 27, 60, 58, 54, 4, 19, 33, 34] typically use pretrained CNN models for feature extraction. These models are trained in a supervised manner, requiring a large quantity of annotated ground-truth labels. Manual annotations are expensive and time-consuming, whereas extensive unlabeled videos are readily available on the Internet. It is therefore worth investigating how to exploit unlabeled video sequences for visual tracking.
In this paper, we propose to learn a visual tracking model from scratch via unsupervised learning. Our intuition rests on the observation that visual tracking can be performed in both the forward and backward directions. Initially, given the target object annotated in the first frame, we can track the target object forward in the subsequent frames. When tracking backward, we use the predicted location in the last frame as the initial target annotation and track it backward towards the first frame. The estimated target location in the first frame via backward tracking is expected to be identical to the initial annotation. After measuring the difference between the forward and backward target trajectories, our network is trained in an unsupervised manner by considering the trajectory consistency, as shown in Fig. 1. (In this paper, we do not distinguish between the terms unsupervised and self-supervised, as both refer to learning without ground-truth annotations.) By exploiting consecutive frames in unlabeled videos, our model learns to locate targets by repeatedly performing forward tracking and backward verification.
The proposed unsupervised learning scheme aims to acquire a generic feature representation, while not being strictly required to track a complete object. For a video sequence, we randomly initialize a bounding box in the first frame, which may not cover an entire object. Then, the proposed model learns to track the bounding box region in the following frames. This tracking strategy is similar to the part-based or edge-based tracking methods that focus on tracking subregions of the target objects. As the visual object tracker is not expected to concentrate only on complete objects, we use randomly cropped bounding boxes for tracking initialization during training.
We integrate the proposed unsupervised learning into the Siamese correlation filter framework. The training process of the proposed network consists of two steps: forward tracking and backward verification. We notice that the backward verification is not always effective, since the tracker may successfully return to the initial target location from a deflected or false position. In addition, challenges such as heavy occlusion in unlabeled videos further degrade the network's representation capability. To tackle these issues, we propose multiple-frame validation and a cost-sensitive loss to benefit the unsupervised training. The multiple-frame validation increases the discrepancy between the forward and backward trajectories to reduce verification failures. Meanwhile, the cost-sensitive loss mitigates the interference from noisy samples during training.
The proposed unsupervised tracker is shown to be effective on the benchmark datasets. Extensive experimental results indicate that, without bells and whistles, the proposed unsupervised tracker achieves performance comparable to the baseline fully supervised trackers [1, 49, 54]. When integrated with additional improvements such as the adaptive online model update [9, 7], the proposed tracker exhibits state-of-the-art performance. It is worth mentioning that the unsupervised framework shows potential in exploiting unlabeled Internet videos to learn good feature representations for tracking scenarios. Given limited or noisy labels, the unsupervised method produces results comparable to the corresponding supervised framework. In addition, we further improve the tracking accuracy by using more unlabeled data. Sec. 4.2 gives a complete analysis of the different training configurations.
In summary, the contributions of our work are three-fold:
We propose an unsupervised tracking method based on the Siamese correlation filter backbone, which is learned via forward and backward tracking.
We propose a multiple-frame validation method and a cost-sensitive loss to improve the unsupervised learning performance.
The extensive experiments on the standard benchmarks show the favorable performance of the proposed method and reveal the potential of unsupervised learning in visual tracking.
In this section, we perform a literature review on the deep tracking methods, forward-backward trajectory analysis, and unsupervised representation learning.
Deep Visual Tracking. Existing deep tracking methods either learn a specific CNN model offline for online tracking or simply utilize off-the-shelf deep models (e.g., VGG [43, 3]) for feature extraction. The Siamese trackers [1, 46, 49, 54, 55, 15, 27, 60, 58] formulate the tracking task as a similarity matching process. They typically learn a tracking network offline and do not fine-tune the model online. On the other hand, some trackers adopt off-the-shelf CNN models as the feature extraction backbone. They incrementally train binary classification layers [37, 45, 39] or regression layers [44, 31] based on the initial frame. These methods typically achieve high accuracy at a huge computational cost. The Discriminative Correlation Filter (DCF) based trackers [2, 16, 8, 30, 5, 52, 18] have also been combined with off-the-shelf deep features (e.g., [35, 40, 53, 7]). The main distinction is that deep DCF trackers merely utilize off-the-shelf models for feature extraction and do not train additional layers or fine-tune the CNN models online. Different from the above deep trackers using off-the-shelf models or supervised learning, the proposed method trains a network from scratch using unlabeled data in the wild.
Forward-Backward Analysis. The forward-backward trajectory analysis has been widely explored in the literature. The tracking-learning-detection (TLD) method uses the Kanade-Lucas-Tomasi (KLT) tracker to perform forward-backward matching to detect tracking failures. Lee et al. proposed to select the reliable base tracker by comparing the geometric similarity, cyclic weight, and appearance consistency between a pair of forward-backward trajectories. However, these methods rely on empirical metrics to identify the target trajectories. In addition, repeatedly performing forward and backward tracking incurs a heavy computational cost for online tracking. Differently, in TrackingNet, forward-backward tracking is used for data annotation and tracker evaluation. In this work, we revisit this scheme to train a deep visual tracker in an unsupervised manner.
Unsupervised Representation Learning. Our framework relates to unsupervised representation learning. One approach learns the feature representation by sorting sequences. The multi-layer auto-encoder trained on large-scale unlabeled data has also been explored. Vondrick et al. proposed to anticipate the visual representation of frames in the future. Wang and Gupta used the KCF tracker to pre-process the raw videos, and then selected a pair of tracked images together with another random patch for learning CNNs using a ranking loss. Our method differs from that of Wang and Gupta in two aspects. First, we integrate the tracking algorithm into unsupervised training instead of merely utilizing an off-the-shelf tracker as the data pre-processing tool. Second, our unsupervised framework is coupled with a tracking objective function, so the learned feature representation is effective in representing generic target objects. In the visual tracking community, unsupervised learning has rarely been explored. To the best of our knowledge, the only related but different approach is the auto-encoder based method. However, the encoder-decoder is a general unsupervised framework, whereas our unsupervised method is specially designed for tracking tasks.
Fig. 2(a) shows an example from the Butterfly sequence to illustrate forward and backward tracking. In practice, we randomly draw bounding boxes in unlabeled videos to perform forward and backward tracking. Given a randomly initialized bounding box label, we first track forward to predict its location in the subsequent frames. Then, we reverse the sequence and take the predicted bounding box in the last frame as the pseudo label to track backward. The bounding box predicted via backward tracking is expected to be identical to the original bounding box in the first frame. We measure the difference between the forward and backward trajectories using the consistency loss for network training. An overview of the proposed unsupervised Siamese correlation filter network is shown in Fig. 2(b). In the following, we first revisit the correlation filter based tracking framework and then illustrate the details of our unsupervised deep tracking approach.
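The Discriminative Correlation Filters (DCFs) [2, 16] regress the input features of a search patch to a Gaussian response map for target localization. When training a DCF, we select a template patch $\mathbf{X}$ with the ground-truth label $\mathbf{Y}$. The filter $\mathbf{W}$ can be learned by solving the ridge regression problem as follows:
$$\min_{\mathbf{W}} \|\mathbf{W} \star \mathbf{X} - \mathbf{Y}\|_2^2 + \lambda \|\mathbf{W}\|_2^2, \tag{1}$$
where $\lambda$ is a regularization parameter and $\star$ denotes circular correlation. In the Fourier domain, the solution has the closed form
$$\mathbf{W} = \mathcal{F}^{-1}\left(\frac{\mathcal{F}(\mathbf{X}) \odot \mathcal{F}^{\ast}(\mathbf{Y})}{\mathcal{F}^{\ast}(\mathbf{X}) \odot \mathcal{F}(\mathbf{X}) + \lambda}\right), \tag{2}$$
where $\odot$ is the element-wise product, $\mathcal{F}(\cdot)$ is the Discrete Fourier Transform (DFT), $\mathcal{F}^{-1}(\cdot)$ is the inverse DFT, and $\ast$ denotes the complex-conjugate operation. In each subsequent frame, given a search patch $\mathbf{Z}$, the corresponding response map $\mathbf{R}$ can be computed in the Fourier domain:
$$\mathbf{R} = \mathbf{W} \star \mathbf{Z} = \mathcal{F}^{-1}\left(\mathcal{F}^{\ast}(\mathbf{W}) \odot \mathcal{F}(\mathbf{Z})\right). \tag{3}$$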
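The closed-form filter and the Fourier-domain response described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it works on a single-channel array instead of CNN features, and the helper names (`gaussian_label`, `train_dcf`, `respond`, `peak`), the label width `sigma`, and the regularizer `lam` are assumptions chosen for illustration.

```python
import numpy as np


def gaussian_label(shape, center, sigma=2.0):
    """Gaussian response map peaked at `center` (row, col), wrapped circularly."""
    rows = np.arange(shape[0])[:, None]
    cols = np.arange(shape[1])[None, :]
    # Circular distances keep the label consistent with FFT-based correlation.
    dr = np.minimum(np.abs(rows - center[0]), shape[0] - np.abs(rows - center[0]))
    dc = np.minimum(np.abs(cols - center[1]), shape[1] - np.abs(cols - center[1]))
    return np.exp(-(dr ** 2 + dc ** 2) / (2.0 * sigma ** 2))


def train_dcf(x, y, lam=1e-4):
    """Closed-form ridge regression; returns the filter in the Fourier domain."""
    x_hat, y_hat = np.fft.fft2(x), np.fft.fft2(y)
    return x_hat * np.conj(y_hat) / (np.conj(x_hat) * x_hat + lam)


def respond(w_hat, z):
    """Response map of search patch `z` for a Fourier-domain filter `w_hat`."""
    return np.real(np.fft.ifft2(np.conj(w_hat) * np.fft.fft2(z)))


def peak(response):
    """(row, col) of the maximum of a response map."""
    idx = np.unravel_index(np.argmax(response), response.shape)
    return tuple(int(i) for i in idx)


# Toy check: the search patch is the template shifted circularly by (3, -5),
# so the response peak should move away from the label center by that shift.
rng = np.random.default_rng(0)
template = rng.standard_normal((32, 32))
label = gaussian_label(template.shape, center=(16, 16))
search = np.roll(template, shift=(3, -5), axis=(0, 1))
print(peak(respond(train_dcf(template, label), search)))
```

Keeping the filter in the Fourier domain avoids an inverse and a forward transform between training and detection; this is a convenience of the sketch rather than something the text prescribes.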
The above DCF framework starts from learning a target template using the template patch and then convolves with a search patch to generate the response. Recently, the Siamese correlation filter network [49, 54] embeds the DCF in a Siamese framework and constructs two shared-weight branches as shown in Fig. 2(b). The first one is the template branch which takes a template patch as input and extracts its features to further generate a target template via DCF. The second one is the search branch which takes a search patch as input for feature extraction. The target template is then convolved with the CNN features of the search patch to generate the response map. The advantage of the Siamese DCF network is that both the feature extraction CNN and correlation filter are formulated into an end-to-end framework, so that the learned features are more related to the visual tracking scenarios.
Given two consecutive frames $P_1$ and $P_2$, we crop the template and search patches from them, respectively. By conducting forward tracking and backward verification, the proposed framework does not require ground-truth labeling for supervised training. The difference between the initial bounding box and the predicted bounding box in $P_1$ forms a consistency loss for network learning.
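We follow prior work to build a Siamese correlation filter network to track the initial bounding box region in frame $P_1$. After cropping the template patch $\mathbf{T}$ from the first frame $P_1$, the corresponding target template $\mathbf{W}_T$ can be computed as:
$$\mathbf{W}_T = \mathcal{F}^{-1}\left(\frac{\mathcal{F}(\varphi_\theta(\mathbf{T})) \odot \mathcal{F}^{\ast}(\mathbf{Y}_T)}{\mathcal{F}^{\ast}(\varphi_\theta(\mathbf{T})) \odot \mathcal{F}(\varphi_\theta(\mathbf{T})) + \lambda}\right), \tag{4}$$
where $\varphi_\theta(\cdot)$ denotes the CNN feature extraction operation with trainable network parameters $\theta$, and $\mathbf{Y}_T$ is the label of the template patch $\mathbf{T}$ (the regularizer $\lambda$, the element-wise product $\odot$, the DFT $\mathcal{F}$, and the complex conjugate $\ast$ are as in the DCF formulation above). This label is a Gaussian response centered at the initial bounding box center. Once we obtain the learned target template $\mathbf{W}_T$, the response map of a search patch $\mathbf{S}$ from frame $P_2$ can be computed by
$$\mathbf{R}_S = \mathcal{F}^{-1}\left(\mathcal{F}^{\ast}(\mathbf{W}_T) \odot \mathcal{F}(\varphi_\theta(\mathbf{S}))\right). \tag{5}$$
If the ground-truth Gaussian label of patch $\mathbf{S}$ is available, the network $\varphi_\theta(\cdot)$ can be trained by computing the distance between $\mathbf{R}_S$ and the ground-truth. In the following, we show how to train the network without labels by exploiting backward trajectory verification.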
After generating the response map $\mathbf{R}_S$ for frame $P_2$, we create a pseudo Gaussian label centered at its maximum value, which is denoted by $\mathbf{Y}_S$. In backward tracking, we switch the roles of the search patch and the template patch. By treating $\mathbf{S}$ as the template patch, we generate a target template $\mathbf{W}_S$ using the pseudo label $\mathbf{Y}_S$. The target template $\mathbf{W}_S$ can be learned using Eq. (4) by replacing $\mathbf{T}$ with $\mathbf{S}$ and replacing $\mathbf{Y}_T$ with $\mathbf{Y}_S$. Then, we generate the response map $\mathbf{R}_T$ through Eq. (5) by replacing $\mathbf{W}_T$ with $\mathbf{W}_S$ and replacing $\mathbf{S}$ with $\mathbf{T}$. Note that we only use one Siamese correlation filter network to track forward and backward. The network parameters $\theta$ are fixed during the tracking steps.
Consistency Loss Computation.
After forward and backward tracking, we obtain the response map $\mathbf{R}_T$. Ideally, $\mathbf{R}_T$ should be a Gaussian label with the peak located at the initial target position. In other words, $\mathbf{R}_T$ should be as similar as possible to the originally given label $\mathbf{Y}_T$. Therefore, the representation network $\varphi_\theta(\cdot)$ can be trained in an unsupervised manner by minimizing the reconstruction error as follows:
$$\mathcal{L}_{un} = \|\mathbf{R}_T - \mathbf{Y}_T\|_2^2. \tag{6}$$
The proposed unsupervised learning method constructs the objective function based on the consistency between the backward response map $\mathbf{R}_T$ and the initial label $\mathbf{Y}_T$. In practice, the tracker may deviate from the target in the forward tracking but still return to the original position during the backward process. However, the proposed loss function does not penalize this deviation because the predictions are consistent. Meanwhile, the raw videos may contain uninformative or even corrupted training samples with occlusion that deteriorate the unsupervised learning process. We propose multiple-frame validation and a cost-sensitive loss to tackle these limitations.
We propose a multiple frames validation approach to alleviate the inaccurate localization issue that is not penalized by Eq. (6). Our intuition is to involve more frames during forward and backward tracking to reduce the verification failures. The reconstruction error in Eq. (6) tends to be amplified and the computed loss will facilitate the training process.
During unsupervised learning, we involve another frame $P_3$, which is the subsequent frame after $P_2$. We crop a search patch $\mathbf{S}_1$ from $P_2$ and another search patch $\mathbf{S}_2$ from $P_3$. If the generated response map $\mathbf{R}_{S_1}$ is different from its corresponding ground-truth response, this error tends to become larger in the next frame $P_3$. As a result, the consistency is more likely to be broken in the backward tracking, and the generated response map $\mathbf{R}_T$ is more likely to deviate from $\mathbf{Y}_T$. By simply involving more search patches during forward and backward tracking, the proposed consistency loss penalizes inaccurate localizations more effectively, as shown in Fig. 3. In practice, we use three frames for validation, and the improved consistency loss is written as:
$$\mathcal{L}_{un} = \|\widetilde{\mathbf{R}}_T - \mathbf{Y}_T\|_2^2, \tag{8}$$
where $\widetilde{\mathbf{R}}_T$ is the response map generated by an additional frame during the backward tracking step.
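The forward tracking, pseudo-labeling, and backward verification steps described above, with one or two search patches, can be sketched as follows. This is a toy under stated assumptions, not the authors' training code: the feature extractor defaults to the identity on a single-channel patch (the paper uses a CNN), no gradients are computed, and the names `track`, `consistency_loss`, `sigma`, and `lam` are hypothetical.

```python
import numpy as np


def gaussian_label(shape, center, sigma=2.0):
    """Gaussian response map peaked at `center` (row, col), wrapped circularly."""
    rows = np.arange(shape[0])[:, None]
    cols = np.arange(shape[1])[None, :]
    dr = np.minimum(np.abs(rows - center[0]), shape[0] - np.abs(rows - center[0]))
    dc = np.minimum(np.abs(cols - center[1]), shape[1] - np.abs(cols - center[1]))
    return np.exp(-(dr ** 2 + dc ** 2) / (2.0 * sigma ** 2))


def track(src, label, dst, lam=1e-4):
    """Learn a DCF on (src, label) and return its response map on dst."""
    s_hat, y_hat = np.fft.fft2(src), np.fft.fft2(label)
    w_hat = s_hat * np.conj(y_hat) / (np.conj(s_hat) * s_hat + lam)
    return np.real(np.fft.ifft2(np.conj(w_hat) * np.fft.fft2(dst)))


def peak(response):
    """(row, col) of the maximum of a response map."""
    idx = np.unravel_index(np.argmax(response), response.shape)
    return tuple(int(i) for i in idx)


def consistency_loss(template, searches, center, features=lambda p: p):
    """Track template -> searches[0] -> ... -> searches[-1] -> template.

    Returns the squared error between the backward response on the template
    and the initial Gaussian label, plus the backward peak location.
    """
    y_t = gaussian_label(template.shape, center)
    src, label = features(template), y_t
    for patch in searches:  # forward tracking through the search patches
        dst = features(patch)
        response = track(src, label, dst)
        label = gaussian_label(response.shape, peak(response))  # pseudo label
        src = dst
    # Backward verification: from the last search patch back to the template.
    r_t = track(src, label, features(template))
    return float(np.sum((r_t - y_t) ** 2)), peak(r_t)


rng = np.random.default_rng(0)
t = rng.standard_normal((32, 32))
s1 = np.roll(t, (2, 3), axis=(0, 1))   # target moved by (2, 3)
s2 = np.roll(t, (4, 1), axis=(0, 1))   # target moved by (4, 1)
print(consistency_loss(t, [s1], (16, 16)))       # two-frame loss
print(consistency_loss(t, [s1, s2], (16, 16)))   # three-frame validation
```

With a single search patch the function corresponds to the two-frame consistency loss; passing two search patches gives the three-frame variant, in which a forward localization error has one more step to accumulate before the backward check.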
We randomly initialize a bounding box region in the first frame for forward tracking. This bounding box region may contain noisy background context (e.g., occluded targets). Fig. 5 shows an overview of these regions. To alleviate the background interference, we propose a cost-sensitive loss to exclude noisy samples for network training.
During unsupervised learning, we construct multiple training pairs from the training sequences. Each training pair consists of one initial template patch $\mathbf{T}$ in frame $P_1$ and two search patches $\mathbf{S}_1$ and $\mathbf{S}_2$ from the subsequent frames $P_2$ and $P_3$, respectively. These training pairs form a training batch to train the Siamese network. In practice, we find that a few training pairs with extremely high losses prevent the network training from converging. To reduce the contributions of noisy pairs, we exclude the 10% of training pairs with the highest loss values. Their losses can be computed using Eq. (8). To this end, we assign a binary weight $\mathbf{A}_{drop}^{i}$ to each training pair, and all the weight elements form the weight vector $\mathbf{A}_{drop}$. Ten percent of its elements are 0 and the others are 1.
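In addition to the noisy training pairs, the raw videos include many uninformative image patches which only contain the background or still targets. For these patches, the objects (e.g., sky, grass, or tree) hardly move. Intuitively, a target with a large motion contributes more to the network training. Therefore, we assign a motion weight vector $\mathbf{A}_{motion}$ to all the training pairs. Each element $\mathbf{A}_{motion}^{i}$ can be computed by
$$\mathbf{A}_{motion}^{i} = \|\mathbf{R}_{S_1}^{i} - \mathbf{Y}_{T}^{i}\|_2^2 + \|\mathbf{R}_{S_2}^{i} - \mathbf{Y}_{S_1}^{i}\|_2^2, \tag{9}$$
where $\mathbf{R}_{S_1}^{i}$ and $\mathbf{R}_{S_2}^{i}$ are the response maps in the $i$-th training pair, and $\mathbf{Y}_{T}^{i}$ and $\mathbf{Y}_{S_1}^{i}$ are the corresponding initial and pseudo labels, respectively. Eq. (9) calculates the target motion difference from frame $P_1$ to $P_2$ and from $P_2$ to $P_3$. A larger value of $\mathbf{A}_{motion}^{i}$ indicates that the target undergoes a larger movement in this continuous trajectory. On the other hand, a large value of $\mathbf{A}_{motion}^{i}$ can be interpreted as a hard training pair to which the network should pay more attention. We normalize the motion weight and the binary weight as follows,
$$\mathbf{A}_{norm}^{i} = \frac{\mathbf{A}_{drop}^{i} \cdot \mathbf{A}_{motion}^{i}}{\sum_{i=1}^{n} \mathbf{A}_{drop}^{i} \cdot \mathbf{A}_{motion}^{i}}, \tag{10}$$
where $n$ is the number of training pairs in a mini-batch. The final unsupervised loss in a mini-batch is computed as:
$$\mathcal{L}_{un} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{A}_{norm}^{i} \cdot \|\widetilde{\mathbf{R}}_{T}^{i} - \mathbf{Y}_{T}^{i}\|_2^2. \tag{11}$$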
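The weighting scheme above (drop the highest-loss pairs, weight the rest by motion, normalize, then average) can be sketched in plain Python. The function name `cost_sensitive_loss` is hypothetical, the per-pair losses and motion values are assumed to be precomputed scalars, and the exact normalization follows the description in the text as read here rather than released code.

```python
def cost_sensitive_loss(pair_losses, motions, drop_ratio=0.1):
    """Combine per-pair consistency losses into one mini-batch loss.

    pair_losses: per-pair reconstruction errors (one float per training pair).
    motions:     per-pair motion weights (larger = target moved more).
    drop_ratio:  fraction of highest-loss pairs whose binary weight is 0.
    """
    n = len(pair_losses)
    if n == 0 or n != len(motions):
        raise ValueError("pair_losses and motions must be non-empty and equal in length")

    # Binary weights: 0 for the pairs with the highest losses, 1 otherwise.
    n_drop = int(n * drop_ratio)
    by_loss = sorted(range(n), key=lambda i: pair_losses[i], reverse=True)
    dropped = set(by_loss[:n_drop])
    a_drop = [0.0 if i in dropped else 1.0 for i in range(n)]

    # Normalized weights: binary weight times motion weight, summing to one.
    weighted = [d * m for d, m in zip(a_drop, motions)]
    total = sum(weighted)
    if total <= 0.0:
        raise ValueError("all remaining pairs have zero motion weight")
    a_norm = [w / total for w in weighted]

    # Final loss: weighted sum of per-pair losses, averaged over the batch size.
    loss = sum(a * l for a, l in zip(a_norm, pair_losses)) / n
    return loss, a_norm


losses = [0.1] * 9 + [5.0]      # one outlier pair with a very high loss
motion = [1.0] * 10             # equal motion for simplicity
batch_loss, weights = cost_sensitive_loss(losses, motion)
print(batch_loss, weights[-1])  # the outlier pair receives zero weight
```

In this toy batch the single outlier is excluded, so the batch loss depends only on the nine well-behaved pairs.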
We follow the DCFNet  to use a shallow Siamese network with only two convolutional layers. The filter sizes of these convolutional layers are and , respectively. Besides, a local response normalization (LRN) layer is employed at the end of convolutional layers. This lightweight structure enables extremely efficient online tracking.
We choose the widely used ILSVRC 2015  as our training data to fairly compare with existing supervised trackers. In the data pre-processing step, existing supervised approaches [1, 49, 54] require ground-truth labels for every frame. Meanwhile, they usually discard the frames where the target is occluded, or the target is partially out of view, or the target infrequently appears in tracking scenarios (e.g., snake). This requires a time-consuming human interaction to preprocess the training data.
In contrast, we do not preprocess any data and simply crop the center patch in each frame. The patch size is the half of the whole image and further resized to as the network input as shown in Fig. 4. We randomly choose three cropped patches from the continuous 10 frames in a video. We set one of the three patches as the template and the remaining as search patches. This is based on the assumption that the center located target objects are unlikely to move out of the cropped region in a short period. We track the objects appearing in the center of the cropped regions, while not specifying their categories. Some examples of the cropped regions are exhibited in Fig. 5.
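The sampling procedure above can be sketched as follows. Several details are assumptions for illustration only: "half of the whole image" is read as half the width and half the height, the earliest of the three sampled frames is used as the template, and the function names are hypothetical. Resizing to the network input size is omitted.

```python
import random


def center_crop_box(width, height):
    """(left, top, right, bottom) of a centered crop with half the width and height."""
    crop_w, crop_h = width // 2, height // 2
    left, top = (width - crop_w) // 2, (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


def sample_training_triplet(num_frames, window=10, rng=random):
    """Pick three distinct frame indices from `window` consecutive frames.

    Returns (template, search_1, search_2) in temporal order. Raises
    ValueError if the video has fewer than three frames.
    """
    start = rng.randrange(max(1, num_frames - window + 1))
    candidates = range(start, min(start + window, num_frames))
    template, search_1, search_2 = sorted(rng.sample(candidates, 3))
    return template, search_1, search_2


print(center_crop_box(640, 480))
print(sample_training_triplet(100, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance makes the sampling reproducible, which is convenient when debugging a data pipeline.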
After offline unsupervised learning, we track the target object online following the forward tracking illustrated in Sec. 3.2. To adapt to object appearance variations, we update the DCF parameters online as follows:
$$\mathbf{W}_t = (1 - \alpha_t)\,\mathbf{W}_{t-1} + \alpha_t\,\mathbf{W}, \tag{12}$$
where $\alpha_t \in [0, 1]$ is the linear interpolation coefficient. The target scale is estimated through a patch pyramid over a set of scale factors. We denote the proposed Unsupervised Deep Tracker as UDT, which merely uses standard incremental model update and scale estimation. Furthermore, we use an advanced model update that adaptively changes $\alpha_t$, as well as a better DCF formulation, following prior work. The improved tracker is denoted as UDT+.
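The incremental model update above is a linear interpolation between the previous filter and the filter estimated from the current frame. A minimal sketch, assuming the filter is stored as a flat list of coefficients and using an arbitrary coefficient of 0.25 (the value used by the tracker is not stated here):

```python
def update_filter(prev_filter, new_filter, alpha):
    """Linear interpolation: (1 - alpha) * previous + alpha * new, element-wise."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(prev_filter) != len(new_filter):
        raise ValueError("filters must have the same number of coefficients")
    return [(1.0 - alpha) * p + alpha * q for p, q in zip(prev_filter, new_filter)]


# alpha = 0 keeps the old model, alpha = 1 replaces it with the new estimate.
print(update_filter([1.0, 2.0], [3.0, 6.0], alpha=0.25))
```

A small coefficient makes the model change slowly and resist occlusions, while a large one adapts quickly to appearance changes; an adaptive update varies the coefficient per frame.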
In this section, we first analyze the effectiveness of our unsupervised learning framework. Then, we compare with state-of-the-art trackers on the standard benchmarks including OTB-2015, Temple-Color, and VOT-2016.
In our experiments, we use stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.005 to train our model. Our unsupervised network is trained for 50 epochs with an exponentially decaying learning rate and a mini-batch size of 32. All the experiments are executed on a computer with a 4.00GHz Intel Core I7-4790K CPU and an NVIDIA GTX 1080Ti GPU.
|AUC score (%)||58.2||58.0||56.8||59.4||60.5||62.9||60.1||65.7||63.7||64.2||63.3||59.2||65.0||62.6||62.1||63.2|
Table 1: Comparison on the OTB-2015 benchmark. The evaluation metric is AUC score. Our unsupervised UDT tracker performs favorably against baseline methods shown on the left, while our UDT+ tracker achieves comparable results with the recent state-of-the-art supervised trackers shown on the right.
Unsupervised and supervised learning. We use the same training data to train our network via fully supervised learning. Fig. 6 shows the evaluation results, where the fully supervised training configuration improves UDT by 3% in AUC score.
Stable training. We analyze the effectiveness of our stable training by using different configurations. Fig. 6 shows the evaluation results of multiple learned trackers. UDT-StandardLoss indicates the results from the tracker learned without hard sample reweighting (i.e., without the motion weight $\mathbf{A}_{motion}$ in Eq. (9)). UDT-SingleTrajectory denotes the results from the tracker learned using only the prototype framework in Sec. 3.2. The results show that multiple-frame validation and the cost-sensitive loss improve the accuracy.
Using high-quality training data. We analyze the performance variations when using high-quality training data. In ILSVRC 2015, instead of randomly cropping patches, we add offsets in the range [-20, +20] pixels to the ground-truth bounding boxes to collect training samples. These patches contain more meaningful objects than the randomly cropped ones. The results in Fig. 6 show that our tracker learned using weakly labeled samples (i.e., UDT-Weakly) produces results comparable to the supervised configuration. Note that the target location predicted by existing object detectors or optical flow estimators is normally within a 20-pixel offset of the ground-truth. These results indicate that UDT achieves performance comparable to the supervised configuration when using less accurate labels produced by existing detection or flow estimation methods.
Few-shot domain adaptation. We collect the first 5 frames from the videos in OTB-2015 with only the ground-truth bounding box available in the first frame. Using these limited samples, we fine-tune our network for 100 iterations using the forward-backward pipeline. This training process takes around 6 minutes. The results (i.e., UDT-Finetune) show that the performance is further enhanced. Our offline unsupervised training learns a general feature representation, which can be transferred to a specific domain (e.g., OTB) using few-shot adaptation. This domain adaptation is similar to MDNet, but our initial parameters are learned offline in an unsupervised manner.
Adopting more unlabeled data. Finally, we utilize more unlabeled videos for network training. These additional raw videos are from the OxUvA benchmark (337 videos in total), which is a subset of Youtube-BB. In Fig. 6, our UDT-MoreData tracker gains a performance improvement (0.9% DP and 0.7% AUC), which illustrates that unlabeled data can advance the unsupervised training. Nevertheless, in the following we continue to use the UDT and UDT+ trackers, which are only trained on ILSVRC 2015, for fair comparisons.
OTB-2015 Dataset. We evaluate the proposed UDT and UDT+ trackers against state-of-the-art real-time trackers including ACT, ACFN, CFNet, SiamFC, SCT, CSR-DCF, DSST, and KCF using the precision and success plot metrics. Fig. 7 and Table 1 show that the proposed unsupervised tracker UDT is comparable with the baseline supervised methods (i.e., SiamFC and CFNet). Meanwhile, the proposed UDT tracker exceeds the DSST algorithm by a large margin. As DSST is a DCF based tracker with accurate scale estimation, the performance improvement indicates that our unsupervised feature representation is more effective than empirical features. In Fig. 7 and Table 1, we do not compare with some remarkable non-realtime trackers. For example, MDNet and ECO can yield 67.8% and 69.4% AUC on the OTB-2015 dataset, but they are far from real-time.
In Table 1, we also compare with more recently proposed supervised trackers. These latest approaches are mainly based on the Siamese network and trained using ILSVRC. Some trackers (e.g., SA-Siam and RT-MDNet) adopt pre-trained CNN models (e.g., AlexNet and VGG-M) for network initialization. SiamRPN additionally uses more labeled training videos from the Youtube-BB dataset. Compared with existing methods, the proposed UDT+ tracker does not require data labels or off-the-shelf deep models while still achieving comparable performance and efficiency.
|Trackers||Accuracy (↑)||Failures (↓)||EAO (↑)||FPS (↑)|
Temple-Color Dataset. Temple-Color is a more challenging benchmark with 128 color videos. We compare our method with the state-of-the-art trackers illustrated in Sec. 4.3. The proposed UDT tracker performs favorably against SiamFC and CFNet, as shown in Fig. 8.
VOT2016 Dataset. Furthermore, we report the evaluation results on the VOT2016 benchmark. The expected average overlap (EAO) is the final metric for tracker ranking according to the VOT report. As shown in Table 2, the performance of our UDT tracker is comparable with the baseline trackers (e.g., SiamFC). The improved UDT+ tracker performs favorably against state-of-the-art fully-supervised trackers including SA-Siam, StructSiam, and MemTrack.
Attribute Analysis. On the OTB-2015 benchmark, we further analyze the performance variations over different challenges, as shown in Fig. 9. In the majority of challenging scenarios, the proposed UDT tracker outperforms the SiamFC and CFNet trackers. Compared with the fully-supervised UDT tracker, the unsupervised UDT does not achieve similar tracking accuracy under the illumination variation (IV), occlusion (OCC), and fast motion (FM) scenarios. This is because the target appearance variations are significant in these video sequences. Without strong supervision, the proposed tracker is less effective at learning a feature representation that is robust to these variations.
Qualitative Evaluation. We visually compare the proposed UDT tracker to some supervised trackers (e.g., ACFN, SiamFC, and CFNet) and a baseline DCF tracker (DSST) on eight challenging video sequences. Although the proposed UDT tracker does not employ online improvements, we still observe that UDT effectively tracks the target, especially on the challenging Ironman and Diving video sequences, as shown in Fig. 10. It is worth mentioning that such a robust tracker is learned using unlabeled videos without ground-truth supervision.
Limitation. (1) As discussed in the Attribute Analysis, our unsupervised feature representation may lack the objectness information to cope with complex scenarios. (2) Since our approach involves both forward and backward tracking, the computational load is another potential drawback.
In this paper, we proposed a method to train a visual tracker using unlabeled video sequences in the wild, which has rarely been investigated in visual tracking. By designing an unsupervised Siamese correlation filter network, we verified the feasibility and effectiveness of our forward-backward based unsupervised training pipeline. To further facilitate the unsupervised training, we extended our framework to consider multiple frames and employ a cost-sensitive loss. Extensive experiments show that the proposed unsupervised tracker, without bells and whistles, serves as a solid baseline and achieves results comparable to the classic fully-supervised trackers. Finally, the unsupervised framework shows attractive potential in visual tracking, such as utilizing more unlabeled data or weakly labeled data to further improve the tracking accuracy.
Acknowledgements. This work was supported in part to Dr. Houqiang Li by the 973 Program under Contract No. 2015CB351803 and NSFC under contract No. 61836011, and in part to Dr. Wengang Zhou by NSFC under contract No. 61822208 and 61632019, Young Elite Scientists Sponsorship Program By CAST (2016QNRC001), and the Fundamental Research Funds for the Central Universities. This work was supported in part by National Key Research and Development Program of China (2016YFB1001003), STCSM (18DZ1112300).
End-to-end active object tracking and its real-world deployment via reinforcement learning. TPAMI, 2019.