Tracking the Untrackable

07/17/2020 ∙ by Fangyi Zhang, et al. ∙ Institute of Computing Technology, Chinese Academy of Sciences

Although short-term full occlusion is rare in visual object tracking, most trackers fail when it happens. Humans, however, can still keep up with the target by anticipating its trajectory even when the target is invisible. Recent work in psychology has also shown that humans build a mental image of the future. Inspired by this, we present the HAllucinating Features to Track (HAFT) model, which forecasts the visual feature embedding of future frames. The anticipated future frame features focus on the movement of the target while hallucinating its occluded part. Jointly tracking on the hallucinated features and the real features improves the robustness of the tracker even when the target is heavily occluded. Through extensive experimental evaluation, we achieve promising results on multiple datasets: OTB100, VOT2018, LaSOT, TrackingNet, and UAV123.




1 Introduction

Visual object tracking aims to estimate the position and size of an arbitrary object in a video sequence. Given the unknown target in the first frame, we need to track the object in subsequent frames. Most recent approaches tackle the tracking problem by relying solely on appearance features [PAMI15KCF; ECCV16SiamFC; CVPR18SiamRPN]. However, visual features alone fail when the target is under severe occlusion: the tracker easily drifts to a distractor or stays at the previously estimated position.

Recent research [icpr16deepmotion; cvpr18flowtrack] has noted that the lack of motion information causes trackers to fail under severe occlusion. These methods add motion information, such as optical flow, to the tracker. They first extract optical flow either with the time-consuming TV-L1 algorithm [perez2013tv] or with FlowNet [cvpr15flownet]. With the flow field, motion information can be easily integrated into the tracking model. Tracking is then performed by combining the motion cues and the appearance cues.

However, optical flow requires the visual appearance to be consistent [perez2013tv]. When the target undergoes severe deformation or occlusion, the estimated flow field contains much noise, and the region of the flow field corresponding to the object no longer reflects the movement of the object. In addition, optical flow computes similarity within a given grid, which further limits its use when the object moves a large distance. Another approach [wang2019prediction] incorporates motion information from a different angle: it uses a Kalman filter to estimate the target trajectory. The key limitation of this method is that it cannot learn object motion from large video datasets, which restricts further improvement.

Our goal is to learn object motion from large-scale video datasets while hallucinating the occluded part of the target. Recent psychological theories have shown that the human visual system is predictive in nature [greve2015role; summerfield2014expectation]. Greve [greve2015role] has shown that humans build a mental image of the future before initiating muscle movements or motor control. These representations capture both visual and temporal information about the expected future. Olson et al. [olson2004neuronal] found that humans maintain a representation of a moving target even when it is not visible because it is hidden behind an occluder. Further, Ekman et al. [summerfield2014expectation] designed moving-dot sequence experiments showing that perception is guided by the anticipation of future events.

Mimicking this biological process, we propose a HAllucinating Features to Track method, named HAFT, to address occlusion in visual object tracking. By predicting the future movement of the target and hallucinating its occluded part, our method learns to anticipate target representations even under severe occlusion. Specifically, we use Generative Adversarial Networks (GANs) [nips14gan] to anticipate the movement and representation of the target. On the one hand, GANs make the predicted features more realistic. On the other hand, they can also learn to hallucinate the occluded part of the target. The generator in our GAN framework is a GRU [gru], which is used to predict the future frame embedding. To learn spatio-temporal features, we adopt ConvGRU [ICLR15ConvGRU]. Note that we forecast the future frame at the feature level instead of directly predicting it at the pixel level. In this way, we avoid the redundant computation of first generating a future frame and then extracting its features. The final feature used to represent the target at the current time is the fusion of the hallucinated feature and the real feature. Our model achieves promising results on VOT2018 [vot18], OTB100 [OTB100], LaSOT [lasot], TrackingNet [trackingnet], and UAV123 [uav].

Figure 1: The structure of HAFT. The target localization and size estimation parts are omitted. Our model uses a GAN loss and an MSE loss to make the hallucinated feature similar to the real feature. Because occlusion is rare and hard to label, we use a random mask to simulate the target being occluded.

2 Approach

Our future frame anticipation model is designed to predict future frames during online tracking. The model aims to generate embeddings for future frames and to recover a complete target even when the target is fully occluded. Fig. 1 illustrates the architecture of our model. The proposed network uses fully convolutional layers (conv1-4) to construct the feature map. Given a video sequence with a bounding box for each frame, the network computes a feature map of the input image in a single forward pass. The feature maps are then sent to a ConvGRU [ICLR15ConvGRU] network to learn a temporal representation. Occlusion is rare in video sequences; for example, only 1.86% of the frames in the GOT-10k [got10k] dataset are marked as heavily occluded. Therefore, when training our model, we add a random mask to the target to simulate occlusion. A generative adversarial loss and an MSE loss are used to make the hallucinated features more realistic.

2.1 HAllucinating Features to Track (HAFT)

Temporal prediction. Recurrent Neural Networks (RNNs) are commonly used for temporal prediction. However, RNNs cannot capture long-term temporal dependencies. The GRU [gru] was proposed to address this issue. The traditional GRU can only handle flattened features, which makes it unsuitable for video representation learning. ConvGRU replaces the fully connected layers with convolutional layers, which makes it more effective at learning spatial dependencies. We use ConvGRU to predict the future frame embedding.
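For reference, a ConvGRU cell in its commonly used form replaces the matrix products of a GRU with convolutions. The sketch below uses generic symbols that are illustrative and not taken from this paper: x_t is the input feature map, h_t the hidden state, W and U are convolution kernels, * denotes convolution, and \odot the element-wise product. The exact variant used in HAFT may differ.

```latex
\begin{aligned}
z_t &= \sigma(W_z * x_t + U_z * h_{t-1}) \\
r_t &= \sigma(W_r * x_t + U_r * h_{t-1}) \\
\tilde{h}_t &= \tanh\big(W * x_t + U * (r_t \odot h_{t-1})\big) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

Because every gate is itself a feature map, the hidden state keeps its spatial layout, which is what allows the cell to be applied to convolutional features without flattening them.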

The input to our model consists of RGB frames only and contains no additional motion information. The entire input is the ordered sequence of frames in the video clip. We use a feature extractor to extract features frame by frame. The features are then sent to the ConvGRU model to capture the temporal structure of the target. The output of the ConvGRU is the anticipated feature of the next frame. To make the predicted features more realistic, we use GANs to train our model.
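The prediction pipeline described in this paragraph can be written compactly as follows. The notation is an assumption introduced for illustration and is not necessarily that of the original equations: the clip has T frames I_1, ..., I_T, \phi is the feature extractor, x_t is the feature of frame t, and \hat{x}_{T+1} is the anticipated feature of the next frame.

```latex
\begin{aligned}
V &= \{I_1, I_2, \dots, I_T\} \\
x_t &= \phi(I_t), \quad t = 1, \dots, T \\
\hat{x}_{T+1} &= \mathrm{ConvGRU}(x_1, x_2, \dots, x_T)
\end{aligned}
```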

Visual and Temporal GANs. GANs have recently been applied to tracking [cvpr2018sint++; cvpr18vital]. These works use GANs to augment positive data. In contrast, our method uses a GAN to generate future frame embeddings. The original GANs synthesize images without any other constraint, whereas our work needs to synthesize future frame embeddings conditioned on the previous frames. The most similar work is Conditional GAN [cgan], which encodes the object label into the generator. In our algorithm, the ConvGRU model plays the role of the generator in the GAN framework.

We take the real future video clip and use the same feature extractor to obtain the real future features. These real features are used in GAN training, in which the generator learns to confuse the discriminator. Our generator does not predict the future frame at the pixel level but at the feature level, which speeds up the tracking process. The GAN model is trained with an adversarial loss on these features.


Because our discriminator only needs to judge whether a predicted feature is real or fake, we use only three convolutional layers, each followed by BatchNorm and LeakyReLU.

To stabilize GAN training, we use an MSE regularization term that compares the anticipated features with the real features.
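One plausible form of the two training signals described above is sketched here, assuming the standard adversarial objective of [nips14gan] applied to feature maps and a mean-squared-error regularizer. The symbols are illustrative assumptions: G is the ConvGRU generator, D the discriminator, x_{1:T} the real features of the observed frames, x_{T+1} the real feature of the future frame, and N the number of feature elements. The authors' exact formulation may differ.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{GAN}} &= \mathbb{E}\big[\log D(x_{T+1})\big]
  + \mathbb{E}\big[\log\big(1 - D(G(x_{1:T}))\big)\big] \\
\mathcal{L}_{\mathrm{MSE}} &= \frac{1}{N}\,\big\lVert G(x_{1:T}) - x_{T+1} \big\rVert_2^2
\end{aligned}
```

In this form the discriminator is trained to maximize the adversarial term while the generator is trained to minimize it together with the MSE term.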


Position Localization. For target position localization, we need to build a robust model from few samples. DiMP [iccv19dimp] uses background information, a discriminative loss function, and a highly efficient optimizer to achieve this goal, and we adopt its localization loss. The loss is computed from the convolution of the filter with the input features and a Gaussian label function whose peak is at the target center; the detailed definition can be found in [iccv19dimp]. The input features are a fusion of the real features and the predicted features, where a fusing parameter balances the two.
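As a minimal sketch of the fusion step, the snippet below blends real and predicted features with a single weight. It assumes a convex combination in which the fusing parameter (called `lam` here, a hypothetical name) weights the predicted features and `1 - lam` weights the real ones; the text does not state which side the parameter applies to, so this direction is an assumption. Features are flat Python lists for simplicity.

```python
def fuse_features(real, predicted, lam=0.2):
    """Blend two equally sized feature vectors element-wise.

    Assumption: lam weights the predicted features and (1 - lam) the
    real ones. The default 0.2 is the value selected in Section 3.2.
    """
    if len(real) != len(predicted):
        raise ValueError("feature sizes must match")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [(1.0 - lam) * r + lam * p for r, p in zip(real, predicted)]


# A response that vanished from the real features (0.0) is partly
# restored by the predicted features (1.0).
fused = fuse_features([0.0, 0.5, 1.0], [1.0, 0.5, 0.0], lam=0.2)
```

With a small weight, the real features dominate whenever the target is visible, while the prediction still contributes a response when the real one disappears.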

Size Estimation. The size estimation module used in our model is the same as in ATOM [CVPR19ATOM], which incorporates IoU-Net [jiang2018acquisition]. The IoU score is estimated from the dot product between the template feature and the search-region feature extracted by PrRoI Pooling, followed by a fully connected layer. Here the template feature is computed from the template frame, and the search-region feature is the fused feature defined in Eq. 10. The size is then estimated by back-propagating to the PrRoI Pooling layer.

The final loss function is a weighted combination of the losses above, with hyper-parameters that control the contribution of each loss. Fig. 2 shows the anticipated features of our model. Even when the target is occluded, our model can still predict the position of the target.

Figure 2: Visualization of the predicted features. In the first row, the target is not occluded, and the response marked by the red circle is visible in both the real features and the predicted features. In the second row, the target is occluded by a tree; the response is not visible in the real features but can still be observed in the predicted features.

3 Experiments

3.1 Implementation Details

Offline Training. Our training procedure is the same as that of DiMP [iccv19dimp]. First, we randomly sample consecutive video clips of 30 frames each, which are then cropped according to the ground-truth bounding boxes. To make the training process close to the tracking process, the bounding box of the previous frame is used to crop the current frame. Because the bounding boxes predicted by the tracker are not accurate, we add random scaling and a small displacement to the bounding boxes before creating the search region. We also add a random mask to the target to simulate occlusion.
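The random-mask augmentation can be sketched as follows. This is an illustrative version under stated assumptions, not the authors' implementation: the frame is a 2D list of pixel values, the target box `(x, y, w, h)` lies inside the frame, and the mask is a rectangle whose sides are a random fraction of the box sides. The function name and the parameters `max_fraction` and `fill` are hypothetical.

```python
import random


def apply_random_mask(frame, box, rng, max_fraction=0.5, fill=0):
    """Return a copy of frame with a random rectangle inside box set to fill.

    frame: 2D list (rows of pixel values); box: (x, y, w, h) of the target.
    max_fraction is a hypothetical cap on the mask side relative to the box.
    """
    x, y, w, h = box
    mask_w = max(1, int(w * rng.uniform(0.1, max_fraction)))
    mask_h = max(1, int(h * rng.uniform(0.1, max_fraction)))
    # Top-left corner chosen so that the mask stays inside the target box.
    mx = rng.randint(x, x + w - mask_w)
    my = rng.randint(y, y + h - mask_h)
    masked = [row[:] for row in frame]
    for r in range(my, my + mask_h):
        for c in range(mx, mx + mask_w):
            masked[r][c] = fill
    return masked, (mx, my, mask_w, mask_h)


rng = random.Random(0)
frame = [[1] * 16 for _ in range(16)]
masked, mask_box = apply_random_mask(frame, box=(4, 4, 8, 8), rng=rng)
```

The unmasked frame is left untouched, so it can still be passed through the feature extractor to provide the real features that the hallucinated features are compared against.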

We initialize the backbone network from DiMP; the feature extractor is ResNet18 [he2016deep]. We use the training sets of TrackingNet [trackingnet], LaSOT [lasot], and GOT-10k [got10k] to train our tracker. We sample 20,000 videos per epoch and train for 50 epochs with the Adam optimizer. The learning rate decays by a factor of 0.2 every 15 epochs. Our tracker is trained on one Nvidia TITAN 2080TI GPU, and the total training time is 12 hours.
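The learning-rate schedule can be illustrated with a short sketch, assuming that "decays 0.2 every 15 epochs" means the rate is multiplied by 0.2 at epochs 15, 30, and 45. The base learning rate is not given in the text, so `BASE_LR` below is a hypothetical placeholder.

```python
def learning_rate(epoch, base_lr, decay=0.2, step=15):
    """Step schedule: multiply base_lr by decay once every `step` epochs (0-based epoch)."""
    return base_lr * decay ** (epoch // step)


BASE_LR = 1e-3  # hypothetical value; the text does not state the base rate
schedule = [learning_rate(e, BASE_LR) for e in range(50)]
```

Over the 50 training epochs this gives four plateaus: the base rate for epochs 0 to 14, then 0.2, 0.04, and 0.008 times the base rate for the following three blocks.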

Online Tracking. Given the target in the first frame, we construct 15 samples with a data augmentation strategy and then learn a target model as in DiMP. In addition, we initialize the ConvGRU model to predict the features of the next frame. During subsequent tracking, we fuse the real features and the anticipated features with Eq. 10, then localize the target and estimate the object size [CVPR19ATOM].

3.2 Ablation Analysis

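| Loss | AUC |
|---|---|
| Baseline (DiMP-18, no additional loss) | 0.660 |
| GAN loss only | 0.551 |
| MSE loss only | 0.652 |
| MSE loss + GAN loss | 0.679 |

Figure 3: OTB100 AUC vs. the fusing parameter.
Figure 4: Ablation analysis on OTB100.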

Different Loss. To verify the effectiveness of each loss, we conduct an ablation analysis on OTB100. The results are shown in Fig. 4. Our baseline method is DiMP-18 [iccv19dimp], whose AUC score on OTB100 is 0.660. Directly using a GAN to generate future frame features makes training unstable and severely degrades tracking performance: the AUC drops dramatically to 0.551. We also found that applying only the MSE loss to minimize the difference between the predicted features and the real future frame features makes the features unrealistic, and the tracking performance also drops by 1.4%. Combining both losses increases the AUC on OTB100 to 0.679.

Fusing Parameter. We also use OTB100 to select the best fusing parameter for combining the predicted features and the real features. As shown in Fig. 3, the best value, which we use in the subsequent experiments, is 0.2.

3.3 State-of-the-art Comparison

We compare our method with state-of-the-art methods on several tracking benchmarks, including VOT2018 [vot18], OTB100 [OTB100], LaSOT [lasot], TrackingNet [trackingnet], and UAV123 [uav].

| Tracker | Accuracy (↑) | Robustness (↓) | EAO (↑) |
|---|---|---|---|
| DRT [cvpr18drt] | 0.519 | 0.201 | 0.356 |
| RCO [vot18] | 0.507 | 0.155 | 0.376 |
| DaSiamRPN [eccv2018dasiamrpn] | 0.586 | 0.276 | 0.383 |
| MFT [vot18] | 0.505 | 0.140 | 0.385 |
| ATOM [CVPR19ATOM] | 0.590 | 0.204 | 0.401 |
| SiamRPN++ [CVPR19SiamRPN++] | 0.600 | 0.234 | 0.411 |
| DiMP-18 [iccv19dimp] | 0.594 | 0.182 | 0.402 |
| HAFT | 0.587 | 0.155 | 0.432 |

Figure 5: Evaluation results of different trackers on VOT2018. The top 3 results are marked in red, blue, and green.

VOT2018 [vot18]. The VOT2018 dataset consists of 60 challenging videos. Each sequence is annotated per frame with five visual attributes, and the bounding box is generated from a pixel-wise segmentation of the tracked object. In Tab. 5 we compare our tracker with the top-ranked trackers on the VOT2018 benchmark in terms of Expected Average Overlap (EAO), Accuracy, and Robustness. The proposed tracker achieves the top-ranked performance with respect to EAO. Compared with the baseline tracker DiMP-18, we achieve a 7.3% relative gain in EAO.

OTB100 [OTB100]. OTB100 provides a fair comparison of accuracy and robustness with precision plots and success plots. We compare our tracker with 9 state-of-the-art trackers, including ECO-HC [CVPR17ECO], DaSiamRPN [eccv2018dasiamrpn], ATOM [CVPR19ATOM], DiMP-18 [iccv19dimp], SiamRPN [CVPR18SiamRPN], and SiamFC [ECCV16SiamFC]. The precision plots and success plots are shown in Fig. 6.

Figure 6: Evaluation results of different trackers on OTB100.

LaSOT [lasot]. The LaSOT dataset provides large-scale, high-quality dense annotations with 1,400 videos in total. We follow Protocol II, which uses 280 testing videos, and evaluate our tracker with Normalized Precision Plots and Success Plots. Fig. 7 reports the overall performance of our tracker. We compare our tracker with 7 top-performing approaches: MDNet [CVPR16MDNet], DSiam [iccv17dsiam], STRCF [cvpr2018strcf], DaSiamRPN [eccv2018dasiamrpn], SiamRPN++ [CVPR19SiamRPN++], ATOM [CVPR19ATOM], and DiMP-18 [iccv19dimp]. Our tracker achieves top-ranked performance on these metrics. Compared with the baseline tracker DiMP-18, HAFT achieves a 5.6% relative gain on Normalized Precision Plots and a 3.8% relative gain on Success Plots.

Further, we analyze our tracker with respect to 8 different attributes: aspect ratio change, scale variation, partial occlusion, deformation, full occlusion, motion blur, viewpoint change, and illumination variation. As shown in Fig. 8, our tracker handles full occlusion better than the baseline. Compared with DiMP-18, our tracker achieves a 2.2% absolute improvement on the full occlusion attribute and a 1.6% absolute improvement on the partial occlusion attribute. In addition, our tracker generalizes well to other attributes and achieves an absolute improvement of about 2% on aspect ratio change, scale variation, deformation, viewpoint change, and background clutter.

Figure 7: Evaluation results of different trackers on LaSOT.
Figure 8: The success plots of eight attributes on LaSOT dataset.

TrackingNet [trackingnet]. TrackingNet provides a large amount of data to assess trackers in the wild. We evaluate our tracker on its test set of 511 videos. Following [trackingnet], we use three metrics for evaluation: Precision Plots (PRE), Normalized Precision Plots (NPRE), and Success Plots (AUC). As shown in Tab. 9, our tracker achieves 0.661 on Precision Plots, 0.779 on Normalized Precision Plots, and 0.720 on Success Plots.

| Tracker | PRE (↑) | NPRE (↑) | AUC (↑) |
|---|---|---|---|
| ECO [CVPR17ECO] | 0.492 | 0.618 | 0.554 |
| SiamFC [ECCV16SiamFC] | 0.533 | 0.666 | 0.571 |
| MDNet [CVPR16MDNet] | 0.565 | 0.705 | 0.606 |
| DaSiamRPN [eccv2018dasiamrpn] | 0.591 | 0.733 | 0.638 |
| ATOM [CVPR19ATOM] | 0.648 | 0.771 | 0.703 |
| SiamRPN++ [CVPR19SiamRPN++] | 0.694 | 0.800 | 0.733 |
| DiMP-18 [iccv19dimp] | 0.666 | 0.785 | 0.723 |
| HAFT | 0.661 | 0.779 | 0.720 |

Figure 9: Evaluation results of different trackers on TrackingNet. The top 3 results are marked in red, blue, and green.

UAV123 [uav]. The UAV123 dataset focuses on low-altitude tracking from drones. The dataset consists of 123 videos captured from a top-down viewpoint. We compare our method with 7 top-ranked trackers on Success Plots (AUC) and Precision Plots (PRE). As shown in Tab. 10, our tracker ranks first on Success Plots with an AUC of 0.637.

| Tracker | PRE (↑) | AUC (↑) |
|---|---|---|
| ECO-HC [CVPR17ECO] | 0.725 | 0.506 |
| SiamRPN++ [CVPR19SiamRPN++] | 0.807 | 0.613 |
| ECO [CVPR17ECO] | 0.741 | 0.525 |
| DaSiamRPN [eccv2018dasiamrpn] | 0.796 | 0.586 |
| SiamRPN [CVPR18SiamRPN] | 0.748 | 0.527 |
| ATOM [CVPR19ATOM] | 0.843 | 0.631 |
| DiMP-18 [iccv19dimp] | 0.836 | 0.632 |
| HAFT | 0.835 | 0.637 |

Figure 10: Evaluation results of different trackers on UAV123. The top 3 results are marked in red, blue, and green.
Figure 11: Visualization results. The videos in the first, second, third row are basketball, soccer1, and handball2 in VOT2018, respectively. The bounding boxes are not drawn if the tracker lost the target (except the first frame).

3.4 Qualitative Results

Qualitative results are shown in Fig. 11. The first row in the figure shows the first frame of the video, in which the tracking target is enclosed by the green bounding box. In the subsequent frames, the pink bounding box is the result of our tracking algorithm. Compared with DiMP-18, our tracker handles full occlusion better. In particular, on the soccer1 video, HAFT loses the target fewer times than the compared tracker.

4 Related Work

Recent years have seen large improvements in visual object tracking, thanks to powerful baselines such as correlation filter (CF) based algorithms [CVPR17ECO; PAMI15KCF], Siamese network based algorithms [ECCV16SiamFC; CVPR18SiamRPN], and approaches that unify online learning and offline training [CVPR19ATOM; iccv19dimp].

CF based methods are dedicated to fast online learning and to incorporating different features. Henriques et al. [PAMI15KCF] introduced circulant matrices to speed up the online learning of CF methods. Bertinetto et al. [CVPR16Staple] added color histograms to compensate for the color information lost by the HOG feature. Danelljan et al. [ECCV16CCOT] converted different kinds of features to the continuous domain to combine features of inconsistent resolutions. The drawback of CF based methods is that they cannot learn features suited to object tracking. Siamese approaches started to draw much attention at this time. Bertinetto et al. [ECCV16SiamFC] brought cross-correlation into a fully convolutional network that learns the tracking feature representation. Li et al. [CVPR18SiamRPN] introduced the region proposal network [ren2015faster] into the Siamese network, further increasing tracking accuracy and speed. Li et al. [CVPR19SiamRPN++] made the Siamese network deeper with a spatial-aware sampling strategy. Danelljan et al. [iccv19dimp] unified offline training of the tracking representation, as in Siamese networks, with online learning by steepest gradient descent.

A key limitation of the above methods is that they lack motion information and therefore easily fail under occlusion. Gladh et al. [icpr16deepmotion] first proposed fusing appearance cues with deep motion cues derived from optical flow. The limitation is that extracting optical flow and then sending it to a deep action recognition network is time-consuming. Zhu et al. [cvpr18flowtrack], motivated by DFF [zhu2017deep], use FlowNet [cvpr15flownet] to propagate deep features of previous frames to subsequent frames. However, when the object is occluded, the extracted flow field is inaccurate. Yang et al. [wang2019prediction] incorporated a Kalman filter to estimate the object position, which is simple but effective. The drawback is that it cannot use large video datasets to learn a dynamic temporal model, which limits further improvement.

In our work, we use a conditional GAN [cgan] to generate deep future representations. A few GAN approaches exist for visual object tracking [cvpr18vital; cvpr2018sint++]. Wang et al. [cvpr2018sint++] directly used a generative network to sample massive hard positive samples, with deep reinforcement learning deciding the mask position. Song et al. [cvpr18vital] augmented positive samples with a generative network that randomly chooses among predefined masks. Instead of using a GAN to generate more hard positive samples, our approach uses the GAN to generate future frame representations.

5 Conclusion

In this paper, we propose a model that mimics a human biological process. Our future frame anticipation method learns to anticipate future scene representations while predicting the future movement of the target. The generative adversarial loss together with the MSE loss makes the forecasted future frame embedding close to the real one. The resulting tracker benefits from the robust hallucinated features. Our qualitative and quantitative results demonstrate strong performance even under severe occlusion. The tracker achieves promising results on VOT2018, LaSOT, OTB100, TrackingNet, and UAV123.