AFAT: Adaptive Failure-Aware Tracker for Robust Visual Object Tracking

by   Tianyang Xu, et al.
University of Surrey

Siamese approaches have achieved promising performance in visual object tracking recently. The key to the success of Siamese trackers is to learn appearance-invariant feature embedding functions via pair-wise offline training on large-scale video datasets. However, the Siamese paradigm uses one-shot learning to model the online tracking task, which impedes online adaptation in the tracking process. Additionally, the uncertainty of an online tracking response is not measured, leading to the problem of ignoring potential failures. In this paper, we advocate online adaptation in the tracking stage. To this end, we propose a failure-aware system, realised by a Quality Prediction Network (QPN), based on convolutional and LSTM modules in the decision stage, enabling online reporting of potential tracking failures. Specifically, sequential response maps from previous successive frames as well as the current frame are collected to predict the tracking confidence, realising spatio-temporal fusion at the decision level. In addition, we further provide an Adaptive Failure-Aware Tracker (AFAT) by combining state-of-the-art Siamese trackers with our system. The experimental results obtained on standard benchmarking datasets demonstrate the effectiveness of the proposed failure-aware system and the merits of our AFAT tracker, with outstanding and balanced performance in both accuracy and speed.


1 Introduction

Visual object tracking is a fundamental computer vision topic with the aim of automatically predicting the state of a target that is initialised with the ground truth only in the first frame. With the exploration of mathematical modelling techniques, especially deep neural networks, recent advanced tracking approaches focus on employing large labelled video datasets to train end-to-end networks offline. The performance of these end-to-end deep Siamese neural networks, e.g., SINT [38], SiameseFC [3], SiamRPN [27] and Siam R-CNN [40], has witnessed a continuous improvement with the help of more effective network architectures and more publicly available labelled video datasets. One typical property of the Siamese paradigm is that it considers an online tracking task as a pair-wise matching problem, for which one-shot learning provides a suitable mechanistic explanation. However, in the tracking stage, Siamese approaches fix the initial template model for the entire video sequence without online adaptation, leading to potential tracking drift or failures.

In general, pair-wise offline training of Siamese networks can achieve effective feature embedding. However, it overemphasises robust appearance feature extraction for the specific target categories in the training dataset [26]. Therefore, the learned model achieves high tracking accuracy on test videos in which the static and dynamic appearance variations are similar to those found in the training stage. However, videos from practical surveillance and mobile devices are highly diverse, with countless object categories, not to mention the corresponding variations. These videos, exhibiting unpredictable appearance content, raise challenges for one-shot learning approaches.

To mitigate the above issues, we argue that it is essential to incorporate an online alarm mechanism (failure detection) in the tracking process. We propose to integrate an online quality prediction of the tracking results at the decision level, as shown in Fig. 1. The response maps from recent successive frames are used as input to assess the uncertainty of the tracking result in the current frame. It should be noted that the use of response maps has been widely studied in visual tracking for a variety of goals, ranging from basic target localisation to additional tracking bonuses, e.g., quality assessment [41, 30, 5], re-detection conditions [31, 50, 48] and update requirements [7, 4]. However, existing approaches only use hand-crafted functions or a pre-defined threshold to tap the power of response maps, neglecting their diverse variations in complex tracking situations. In contrast, we employ convolutional and Long Short-Term Memory (LSTM) modules to extract spatio-temporally fused features from the online generated response maps to perform failure flagging. Specifically, the response maps are considered as high-level representations, reflecting the probability of each position being the target centre. We show that extracting implicit decision information from multi-frame response maps via convolutional and LSTM modules enables a reliable and robust prediction of potential tracking failures from a spatio-temporal perspective.

In the data collection stage, we perform online tracking on random video segments from video datasets (TrackingNet [34] in our experiment). The response maps and the corresponding quality states are collected and stored as training samples for our Quality Prediction Network (QPN). We assign the quality state of each frame based on the Intersection Over Union (IOU) between the predicted bounding box and the corresponding ground truth. In the offline training phase, the labelled response map sequences are fed into QPN with a cross-entropy loss. As the temporal order and 2D response scores are jointly considered in network training, adaptive prediction of the tracking quality can be achieved by our QPN. A forward pass of QPN takes only 1.2–1.5 ms, so it can be easily integrated with existing methods. Furthermore, we combine QPN with state-of-the-art Siamese trackers for robust online visual object tracking, providing effective failure detection and a further performance boost.

The constructed Adaptive Failure-Aware Tracker (AFAT) enables an adaptive model selection in the online tracking process, comprehensively predicting the current tracking difficulty and assigning the tracker with a suitable model capacity.

Figure 1: Illustration of the proposed adaptive failure-aware tracker. AFAT achieves online quality prediction based on spatio-temporal modelling at the decision level, enabling failure alarms and further correction. AFAT detects 5 failures in the above sequence (Bird2), i.e., #11, #48, #63, #66 and #69, corresponding to severe appearance changes with a lower IOU between the tracking result and the ground truth. #11 and #48 are true positive alarms, with the overlap falling rapidly; #63, #66 and #69 are false positive alarms, as they are close to the previous valley point #61.

The main contributions of the proposed AFAT method include:

  • A new quality prediction network (QPN) for generating online tracking alarm signals and predicting potential failures in the tracking process, achieving spatio-temporal response fusion at the decision level. The temporal order and spatial appearance of multi-frame response maps are jointly considered to predict the reliability of the tracking results, providing additional functionality to existing one-shot learning algorithms.

  • A new adaptive failure-aware tracker equipped with advanced Siamese trackers and QPN, enabling synchronous adaptive model selection in online object tracking, with a perceptual model configuration and tracking correction.

  • A comprehensive evaluation of our AFAT method on a number of well-known benchmarking datasets, including OTB2015 [44], UAV123 [32], LaSOT [12], VOT2016 [21], VOT2018 [22] and VOT2019 [24]. The results demonstrate the superior performance of the proposed AFAT method over the state-of-the-art trackers.

The rest of this paper is organised as follows. In Section 2, we provide the motivation for introducing a failure-aware mechanism into tracking and discuss related studies in recent developments. The detailed construction and training scheme of the proposed quality prediction network are presented in Section 3, followed by an efficient adaptive failure-aware tracker in Section 4. The experimental results and the corresponding analysis are reported in Section 5. Finally, conclusions are drawn in Section 6.

2 Background

2.1 Advanced tracking models

Recent studies in visual object tracking focus on modelling the task either as online discriminative learning or as an offline pair-wise matching network, both achieving steadily improving performance on standard benchmarks [44, 29, 32, 12, 34] and in competitions [22, 24].

Online discriminative learning was designed to distinguish the target region from its surroundings by online training and updating a classifier (or regressor). To alleviate the computational burden of online learning, Discriminative Correlation Filter (DCF) based approaches have been extensively studied in recent years, with improved efficiency and accuracy contributed by the circulant structure [17, 18, 45]. Meanwhile, DCF-based methods address the sample insufficiency issue in online discriminative learning via augmented circulant samples. To optimise the estimated filters with more intuitive requirements and desired constraints, further attempts have been made in the DCF paradigm, e.g., spatial regularisation for the boundary effect [9, 30, 20, 49], feature selection for input redundancy [47, 46], temporal sampling spaces for filter degeneration [10, 11, 8], and multi-response fusion for extended representation [2, 5].

The offline pair-wise matching network benefits from the design of the Siamese network architecture. Specifically, a Siamese network maps the image pairs, i.e., template and instance, into a latent feature space in which the target regions retain salient similarity against practical appearance variations. After feature embedding, SINT [38] compares the scores of candidates for final decision making, and GOTURN [16] performs bounding box localisation using FC layers. To avoid multi-candidate calculation and FC layer learning, cross correlation is directly used for response calculation in SiameseFC [3], realising a fully convolutional structure. Besides the effective capability in centre localisation, the accuracy of the final bounding box is also essential for advanced tracking methods. To this end, recent efforts have been made on fusing bounding box refinement techniques in the offline training stage. SiamRPN utilises the Region Proposal Network (RPN) for joint high-quality foreground-background classification and bounding box regression [27], while ATOM performs IOU maximisation for fine-grained overlap precision [7]. In addition, to better exploit diverse multi-layer features and Siamese architectures, SiamRPN++ [26] and SiamDW [52] have been proposed, demonstrating promising performance for visual object tracking.
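As a concrete illustration of the cross-correlation step used by SiameseFC, the following sketch slides a template feature map over a search feature map with a naive loop (standing in for the batched convolution used in practice; the feature maps here are toy arrays, not real embeddings):

```python
import numpy as np

def xcorr2d(search: np.ndarray, template: np.ndarray) -> np.ndarray:
    """Naive 'valid' 2D cross-correlation: slide the template over the
    search feature map and take inner products at every offset."""
    H, W = search.shape
    h, w = template.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(search[i:i + h, j:j + w] * template)
    return out

# The response peaks where the search window best matches the template.
search = np.zeros((8, 8))
search[3:5, 4:6] = 1.0          # "target" pattern placed at offset (3, 4)
template = np.ones((2, 2))      # template of the same pattern
response = xcorr2d(search, template)
peak = np.unravel_index(response.argmax(), response.shape)
```

The argmax of the resulting response map directly gives the estimated target offset, which is why the response map is such an informative signal for quality prediction later in the paper.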

Despite the success in designing powerful tracking models, the significance of online tracking quality prediction and the corresponding countermeasures is underestimated by the community. In other words, the pattern of the tracking calculation process is predefined by a specified mathematical model, which always assumes specific distributions of the target appearance or motion trajectory. We argue for the necessity of predicting the tracking quality online and performing effective intervention and correction.

2.2 Online tracking adaptation

Though continuous improvements have been achieved by developing diverse mathematical formulations for visual tracking, a practical tracking task cannot be modelled exactly by existing methodologies. To mitigate this issue, online control strategies are necessary to balance the simplified assumptions underlying the theoretical tracking models against their validity in challenging video sequences with unpredictable appearance variations.

To achieve the above goal, many tracking approaches propose to analyse the reliability of the current result for potential update and re-detection. To alleviate tracking drift caused by aggressive updates, ATOM [7] and DiMP [4] adjust the learning rate based on the response score. TLD [19] utilises an independent classifier to re-detect the target with a distance threshold-based failure detector. Similar threshold-based quality measurements are widely used in recent online discriminative learning and Siamese trackers, e.g., LCT [31], SPLT [50] and Siamese-LT [26], performing re-detection when the response score falls below a threshold. Besides, quality prediction has also received wide attention for online adaptation in visual tracking. LMCF [41] utilises the response map for multi-modal target detection to reduce the impact of distractors with similar appearance; a quality function is then designed for high-confidence updates, suppressing contaminated samples. Other attempts at quality prediction focus on adaptive multi-response fusion: CSR-DCF [30] assigns more reliability to uni-modal response channels, and UPDT [5] simultaneously constrains the fused response to be narrow and uni-modally distributed.

However, the above techniques only use a simple threshold or a hand-crafted function to measure the quality of the current tracking result, which is unable to cope with practical tracking challenges. Therefore, we propose to employ a neural network to predict the tracking quality. To better extract the latent information from response maps, we construct a spatio-temporal quality prediction network (QPN) with convolutional and LSTM modules. Different from existing measurements, we collect the response maps from multiple successive frames to exploit their sequential order. By training on the generated labelled data, QPN enables adaptive classification between successful and failed results at the decision level.

2.3 Meta-learning for tracking

From the perspective of the learning framework, Siamese trackers belong to the one-shot learning category, using a meta-learning formulation to learn the best embedding network that highlights the target region while suppressing the background. In this paradigm, Learnet [1] is the seminal work using meta-learning to generate parameters for visual tracking. Besides predicting the tracking parameters, Meta-tracker [35] proposed to jointly learn the learning rate, while UpdateNet [51] proposed to replace the hand-crafted update function with a method that learns to update. In addition, MLT [6] utilises a meta-learner network to provide the Siamese network with new appearances of the target. To explore the potential power of meta-learning, we propose to train a quality prediction network for a given tracker.

3 Quality prediction network

Figure 2: The proposed quality prediction network (QPN).

In this section, we present the proposed Quality Prediction Network (QPN). As shown in Fig. 2, the proposed QPN consists of convolutional and Long Short-Term Memory (LSTM) modules. Sequential response maps and the corresponding quality labels are used to train QPN in an end-to-end fashion. QPN utilises the current and previous temporal information to provide quality feedback, i.e., whether the current tracking result is successful or not. The forward pass of QPN can be formulated as:

q_t = QPN(R_{t-T+1}, R_{t-T+2}, ..., R_t),

where R_t is the response map of the t-th frame and q_t the predicted quality state. We select SiamRPN++ [26] (with the mobilev2 backbone) as our base tracker, which produces a response map of size A x H x W in each frame, where A and H x W correspond to the anchor number and the spatial resolution, respectively. T is the sequential length for our spatio-temporal inference; we set T = 20 in our experiment. We describe the construction of the QPN in more detail as follows.

3.1 Spatial feature extraction

Different from raw image patches, we use response maps as the input data for our QPN. Response maps are output by an online tracker, reflecting the decision-making ability of the model in the current tracking situation. Considering their image-like spatial distribution, we propose to use a CNN for basic spatial feature extraction from the response maps. Based on existing studies and early analysis, we make the following observations: (1) an ideal response map should be uni-modal with a narrow peak; and (2) a tracking failure usually coincides with clutter in the response map. To make the convolutional network suitable for our response input, we circularly shift each response map and scatter the maximal peak into the four corners, enhancing the context information. Then three convolutional layers and one FC layer are employed to extract a spatial representation for each frame in the response map sequence.
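The circular-shift pre-processing can be realised, for instance, with np.roll. The exact shift used in the paper is not specified, so the sketch below adopts one plausible choice: shifting so that the maximal peak moves to position (0, 0), whereupon wrap-around scatters the peak's neighbourhood into the four corners of the map:

```python
import numpy as np

def shift_peak_to_corners(resp: np.ndarray) -> np.ndarray:
    """Circularly shift a response map so that its maximal peak lands at
    (0, 0); with wrap-around, the peak's neighbourhood is scattered into
    the four corners of the map."""
    py, px = np.unravel_index(resp.argmax(), resp.shape)
    return np.roll(resp, shift=(-py, -px), axis=(0, 1))

# A synthetic uni-modal response with its peak at position (5, 3).
resp = np.exp(-0.5 * ((np.arange(9)[:, None] - 5) ** 2
                      + (np.arange(9)[None, :] - 3) ** 2))
shifted = shift_peak_to_corners(resp)
```

After the shift, the centre of the map carries the former "background" context, which gives the subsequent convolutional layers a consistent layout regardless of where the target was localised.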

3.2 Temporal feature fusion

Existing quality measurements in advanced tracking approaches only consider the response map of the current frame, neglecting previous temporal information. In general, a basic assumption in visual tracking is that the target changes smoothly in appearance and position between adjacent frames. Therefore, we argue for the significance of exploiting temporally ordered data for tracking quality prediction. To this end, we propose to fuse the sequential response maps from recent successive frames for quality prediction using two LSTM networks [13]. The extracted spatial features are fed into the two LSTM modules to predict the quality q_t of the tracking result in frame t.
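A minimal sketch of the QPN forward pass may help fix ideas: a per-frame spatial feature is extracted from each response map, an LSTM accumulates the features over the sequence, and the final hidden state is classified into success/lost. All dimensions and weights below are illustrative placeholders; the sketch uses a single untrained LSTM cell and a linear map in place of the paper's trained convolutional and two-LSTM modules:

```python
import numpy as np

rng = np.random.default_rng(0)
T, A, S = 20, 5, 25          # sequence length, anchors, spatial bins (5x5)
D, H = 16, 8                 # spatial feature dim, LSTM hidden size

# Spatial extractor stub: a single linear map standing in for the
# conv + FC layers of Section 3.1 (weights are random placeholders).
W_feat = rng.standard_normal((A * S, D)) * 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One LSTM cell (the paper stacks two LSTM modules; one suffices here).
Wx = rng.standard_normal((D, 4 * H)) * 0.1
Wh = rng.standard_normal((H, 4 * H)) * 0.1
b = np.zeros(4 * H)
W_out = rng.standard_normal((H, 2)) * 0.1   # success / lost logits

def qpn_forward(responses: np.ndarray) -> np.ndarray:
    """responses: (T, A, S) sequence of flattened response maps.
    Returns softmax scores over {success, lost} for the last frame."""
    h, c = np.zeros(H), np.zeros(H)      # states zero-initialised per pass
    for t in range(responses.shape[0]):
        x = responses[t].reshape(-1) @ W_feat       # spatial feature
        gates = x @ Wx + h @ Wh + b
        i, f, g, o = np.split(gates, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    logits = h @ W_out
    e = np.exp(logits - logits.max())
    return e / e.sum()

scores = qpn_forward(rng.standard_normal((T, A, S)))
```

Because the recurrence carries state across all T frames, a sudden change in the response statistics of the latest frame is judged relative to the recent history rather than in isolation, which is precisely the motivation for temporal fusion here.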

3.3 End-to-end training of QPN

To generate sequential response maps for training our QPN, we run visual tracking using our base tracker (SiamRPN++, with the mobilev2 backbone) on TrackingNet [34]. There are 30132 labelled video sequences in TrackingNet; we select the first 20000 videos to generate the training data and the remaining 10132 videos for validation. Specifically, for each piece of data, we collect the response maps by running the base tracker on 20 successive frames from a randomly selected video. In addition, we label the response maps based on the IOU between the predicted bounding box and the corresponding ground truth in each frame, assigning each frame to one of the three quality states, i.e., success, lost or other. The detailed numbers of the generated training samples are listed in Table 1.
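The IOU-based labelling can be sketched as follows; the two thresholds separating success, other and lost are hypothetical, since the text only names the three categories:

```python
import numpy as np

def iou(a, b):
    """IOU of two boxes given in (x, y, w, h) format."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def quality_label(pred, gt, t_success=0.5, t_lost=0.1):
    """Map a frame to a quality state. The two thresholds are
    hypothetical placeholders, not values from the paper."""
    v = iou(pred, gt)
    if v >= t_success:
        return "success"
    if v <= t_lost:
        return "lost"
    return "other"
```

Applied frame by frame over each 20-frame segment, this yields the (response sequence, label) pairs summarised in Table 1.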

                sequences  total frames  labelled frames
                                         success  lost   other
Training set    50000      1000000       937881   14494  47625
Validation set  20000      400000        375604   5429   18967
Table 1: Data generation for QPN
Figure 3: Illustration of loss and accuracy in the QPN training phase.

During the training stage of QPN, sequential response maps of length 20 are fed into the network. We train QPN from scratch using the Stochastic Gradient Descent (SGD) optimiser, with the cross-entropy loss function for network optimisation. As shown in Table 1, the training data is imbalanced between the success and lost categories. To balance the volume between these two types of training data, we set the loss weights to 0.002 and 1 for success and lost, respectively. The hidden and cell states of the two LSTM networks are initialised with zeros for each forward pass. We train the network for 25 epochs in total; the learning rate is decreased from 0.01 to 0.001 from the 16th epoch. We report the detailed loss and accuracy of each epoch in Fig. 3. During the training stage, QPN converges on the validation set for both the success and lost categories (the accuracies are reported in Fig. 3). This demonstrates the effectiveness of the proposed QPN in failure detection for an online object tracker. In addition, a forward pass of QPN takes only 1.2–1.5 ms, so it can be easily integrated with existing methods.
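The class-weighted cross-entropy can be written out explicitly. The sketch below follows the common convention of normalising by the sum of per-sample weights (as, e.g., PyTorch's weighted CrossEntropyLoss does); whether the paper normalises identically is an assumption:

```python
import numpy as np

# Per-class loss weights from the text: success = 0.002, lost = 1.
CLASS_WEIGHT = np.array([0.002, 1.0])   # index 0: success, 1: lost

def weighted_ce(probs: np.ndarray, labels: np.ndarray) -> float:
    """Mean class-weighted cross-entropy over a batch.
    probs: (N, 2) softmax outputs; labels: (N,) integer classes."""
    n = np.arange(labels.size)
    w = CLASS_WEIGHT[labels]
    return float(np.sum(-w * np.log(probs[n, labels])) / np.sum(w))

# A misclassified 'lost' frame costs far more than a well-classified
# 'success' frame, counteracting the heavy class imbalance of Table 1.
probs = np.array([[0.9, 0.1],    # confident success
                  [0.9, 0.1]])   # lost frame misread as success
labels = np.array([0, 1])
loss = weighted_ce(probs, labels)
```

With these weights, the gradient signal is dominated by errors on the rare lost class, preventing the network from collapsing to the trivial "always success" prediction.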

4 Adaptive failure-aware tracker

We propose the Adaptive Failure-Aware Tracker (AFAT) based on the SiamRPN++ tracker [26] and the QPN proposed in Section 3. The tracking framework is summarised in Algorithm 1. We directly use two backbones, i.e., mobilev2 and resnet50, in our AFAT, employing SiamRPN++ (mobilev2) as the base tracker M_b and SiamRPN++ (resnet50) as the correction tracker M_c. Without bells and whistles, the default parameters of the original SiamRPN++ are used in the proposed AFAT method.

  Initialisation: Initialise the base model M_b and the correction model M_c with the ground truth in the initial frame.
  Input: Image frame x_t, models M_b and M_c, response list I, target centre coordinate and scale size from frame t-1. Tracking:
  1. Extract the search window from x_t at the previous target centre;
  2. Calculate the response maps f using M_b;
  3. Add f to I; if the length of I exceeds 20, remove the earliest observations;
  4. Record the current tracking result from model M_b;
  5. Forward pass QPN to obtain the quality prediction q;
  6. If q = lost, replace the current tracking result with the output of model M_c and empty I;
  Output: Target bounding box (centre coordinate and current scale size).
Algorithm 1 AFAT tracking algorithm.
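Algorithm 1 can be condensed into a few lines of Python with stub components standing in for the two trackers and the QPN; all stubs below are hypothetical, serving only to illustrate the control flow:

```python
from collections import deque

T = 20  # response buffer length, as in the data-generation stage

def afat_step(frame, base, corr, qpn, responses: deque):
    """One iteration of the AFAT loop with stub components:
    `base`/`corr` map a frame to (box, response), and `qpn` maps the
    response buffer to True when a failure is flagged."""
    box, resp = base(frame)
    responses.append(resp)
    if len(responses) > T:          # keep only the recent observations
        responses.popleft()
    if qpn(responses):              # failure alarm: fall back to the
        box, _ = corr(frame)        # correction tracker and empty I
        responses.clear()
    return box

# Toy stand-ins: the base tracker drifts from frame 5 onwards, the QPN
# flags any buffer whose latest response score is low, and the
# correction tracker recovers the target.
base = lambda f: (("drift" if f >= 5 else "ok"), 0.1 if f >= 5 else 0.9)
corr = lambda f: ("corrected", 0.8)
qpn = lambda buf: buf[-1] < 0.5
buf: deque = deque()
boxes = [afat_step(f, base, corr, qpn, buf) for f in range(8)]
```

Emptying the buffer after a correction means the QPN restarts its temporal evidence from scratch, so one alarm does not immediately trigger a cascade of follow-up alarms from stale responses.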
                        |       VOT2016        |       VOT2018        |       VOT2019
                        | EAO    A      FPS    | EAO    A      FPS    | EAO    A      FPS
SiamRPN++ (mobilev2)    | 0.455  0.624  96.4   | 0.410  0.586  95.8   | 0.291  0.579  93.0
SiamRPN++ (resnet50)    | 0.464  0.642  38.7   | 0.414  0.600  37.9   | 0.287  0.594  38.2
AFAT                    | 0.482  0.642  74.0   | 0.419  0.605  70.5   | 0.295  0.599  70.9
Failure detections      | 6632                 | 7007                 | 6814
Total frames            | 21455                | 21356                | 19935
Failure detection rate  | 30.9%                | 32.8%                | 34.2%

                        |       OTB2015        |       UAV123         |       LaSOT
                        | AUC    DP     FPS    | AUC    DP     FPS    | AUC    NP     FPS
SiamRPN++ (mobilev2)    | 0.658  0.864  93.2   | 0.602  0.802  95.1   | 0.450  0.525  95.2
SiamRPN++ (resnet50)    | 0.663  0.875  38.5   | 0.611  0.804  37.9   | 0.496  0.571  38.9
AFAT                    | 0.663  0.874  86.8   | 0.612  0.811  80.4   | 0.492  0.574  77.7
Failure detections      | 2881                 | 4513                 | 57192
Total frames            | 59035                | 112578               | 685360
Failure detection rate  | 4.88%                | 4.01%                | 8.34%
Table 2: Ablation study of the proposed AFAT on VOT2016, VOT2018, VOT2019, OTB2015, UAV123 and LaSOT.
Figure 4: Illustration of the detected failure alarms during online tracking by our AFAT in sequences DragonBoy, Couple, Bolt2, Human9, Gym, Jogging2, MotorRolling, and Man [44]. Red ovals denote the detected failure alarms by the proposed QPN.

5 Evaluation

5.1 Implementation details and experimental settings

We implement our AFAT using PyTorch 1.0.1 on a platform with an Intel(R) Xeon(R) Gold 6134 CPU and three NVIDIA GeForce RTX 2080Ti GPU cards. For the speed test, we only use a single GPU card. The code will be made publicly available. We set the sequence length to 20 for temporal sequential data collection. We use the same parameters as SiamRPN++ [26] for the backbone networks, mobilev2 and resnet50, released by the original authors for all the experiments (no additionally optimised parameters are used). We compare the proposed AFAT on several benchmarks, including LaSOT [12], UAV123 [32], OTB2015 [44], VOT2016 [21], VOT2018 [22] and VOT2019 [24], with a number of state-of-the-art trackers, i.e., ATOM [7], Meta-Tracker [35], SiamRPN++ [26], DaSiam [53], SiamRPN [27], LADCF [47], CREST [36], BACF [20], CFNet [39], CACF [33], CSR-DCF [30], C-COT [11], Staple [2], SiameseFC [3], SRDCF [9] and other top-ranking trackers in the VOT challenges.

To measure the tracking performance, we follow the corresponding protocols [44, 23, 25]. We use the precision plot and success plot [43] for OTB2015, UAV123 and LaSOT. Specifically, the precision plot measures the percentage of frames in which the distance between the tracking result and the ground truth is less than a certain number of pixels. The success plot measures the proportion of successful frames as the threshold ranges from 0 to 1 (a result is considered successful if the overlap of the predicted and ground-truth bounding boxes exceeds the threshold). Three numerical values, i.e., distance precision (DP), normalised precision (NP) and area under curve (AUC), are further employed to measure the performance. DP is the precision plot value (reported in the legend of the precision plot) when the threshold is set to 20 pixels. AUC is the expected success rate (reported in the legend of the success plot) in terms of overlap evaluation. For all the VOT datasets (2016, 2018 and 2019), we employ the expected average overlap (EAO), accuracy (A) and robustness (R) to evaluate the performance [23].
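The DP and AUC measures described above can be computed directly from per-frame centre errors and overlaps. The sketch below uses the common conventions (a 20-pixel DP threshold and uniformly sampled overlap thresholds) with synthetic per-frame values:

```python
import numpy as np

def precision_at(dists: np.ndarray, thr: float = 20.0) -> float:
    """DP: fraction of frames whose centre error is within `thr` pixels."""
    return float(np.mean(dists <= thr))

def success_auc(overlaps: np.ndarray, n_thr: int = 101) -> float:
    """AUC of the success plot: the success rate averaged over overlap
    thresholds sampled uniformly in [0, 1]."""
    thresholds = np.linspace(0.0, 1.0, n_thr)
    rates = [(overlaps > t).mean() for t in thresholds]
    return float(np.mean(rates))

dists = np.array([5.0, 12.0, 30.0, 8.0])     # centre errors in pixels
overlaps = np.array([0.9, 0.6, 0.1, 0.7])    # per-frame IOU
dp = precision_at(dists)       # 3 of 4 frames fall within 20 px
auc = success_auc(overlaps)
```

With dense threshold sampling, the AUC approaches the mean per-frame overlap, which is why success-plot AUC and average overlap are often used interchangeably in the tracking literature.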

      [8]    [14]   [15]   [37]   [5]    [27]   [47]
EAO   0.280  0.286  0.303  0.323  0.378  0.383  0.389
A     0.483  0.509  0.484  0.493  0.536  0.586  0.503
R     0.276  0.281  0.267  0.218  0.184  0.276  0.159

      [28]   [53]   [46]   [7]    SiamRPN++ (mobilev2)  SiamRPN++ (resnet50)  AFAT
EAO   0.247  0.326  0.397  0.401  0.410                 0.414                 0.419
A     0.507  0.56   0.511  0.590  0.586                 0.600                 0.605
R     0.375  0.34   0.143  0.204  0.229                 0.234                 0.239
Table 3: Tracking results on VOT2018. (The best three results are highlighted by red, blue and brown.)
Figure 5: The experimental results on UAV123. The precision plots with DP reported in the figure legend (left) and the success plots with AUC reported in the figure legend (right) are presented.

5.2 Ablation study

To validate the effectiveness of the proposed AFAT, we perform ablation studies on recent public datasets, including VOT2016, VOT2018, VOT2019, OTB2015, UAV123 and LaSOT. As reported in Table 2, the proposed AFAT achieves improved performance compared to the base tracker, SiamRPN++ (mobilev2), with a slight sacrifice in speed. Specifically, though SiamRPN++ (resnet50) is employed as the correction tracker in our design, AFAT outperforms it in EAO and A on all three VOT datasets. Thanks to the failure-aware system, some challenging frames are recognised by our QPN with follow-up corrections, reducing the chance of being judged as failures under the supervised VOT evaluation. The overall performance of SiamRPN++ (resnet50) is ahead of SiamRPN++ (mobilev2) on VOT2016 and VOT2018, while the opposite holds on VOT2019. However, the detailed accuracy of these two trackers varies across individual sequences, supporting the conclusion that the correction tracker is not necessarily stronger than the base tracker on every sequence. In addition, the AUC values of AFAT on OTB2015 and UAV123 are also better than those of the base tracker, demonstrating the advantage of the proposed quality prediction network under unsupervised evaluation. However, AFAT cannot outperform SiamRPN++ (resnet50) on LaSOT in the AUC metric, as LaSOT has a longer average sequence length than the other datasets, placing more demands on long-term robustness.

The failure detection numbers for each dataset are also reported in Table 2. Interestingly, our AFAT raises considerably more failure detections on the VOT datasets, with the detection rate ranging from 30.9% to 34.2%, while our QPN detects mostly successful results on OTB2015, UAV123 and LaSOT. This is consistent with the insight that the VOT datasets present more difficulties for short-term visual tracking. In addition, we illustrate the detected failure alarms during online tracking in Fig. 4. Though there are false alarms during the tracking procedure, many positive alarms are detected by our QPN, owing to the conservative detection strategy adopted in the training stage. With further correction, a number of potential failures can be rectified to a certain degree. The above results and analysis demonstrate the effectiveness of AFAT for failure detection.

      [27]   [42]   [46]   [7]    SiamRPN++ (mobilev2)  SiamRPN++ (resnet50)  AFAT
EAO   0.285  0.287  0.291  0.292  0.292                 0.287                 0.295
A     0.599  0.594  0.513  0.603  0.580                 0.595                 0.599
R     0.482  0.461  0.453  0.411  0.446                 0.467                 0.450
Table 4: Tracking results on VOT2019. (The best three results are highlighted by red, blue and brown.)
Figure 6: The experimental results on LaSOT. The normalised precision plots with NP reported in the figure legend (left) and the success plots with AUC reported in the figure legend (right) are presented.

5.3 Comparison with the state-of-the-art

The VOT2018 Dataset: We first evaluate our AFAT on the recent VOT2018 dataset [22] and compare the tracking performance with 13 state-of-the-art trackers. The VOT2018 public dataset contains 60 challenging video sequences for single object tracking research. Following the standard protocol [25], we report the detailed results in Table 3. AFAT achieves the best EAO score, 0.419, and Accuracy, 0.605, outperforming recent advanced trackers, e.g., LADCF, ATOM and SiamRPN++. In addition, the reported speed of AFAT on VOT2018 is 70.5 FPS (Table 3), demonstrating the effectiveness and efficiency of the proposed adaptive failure-aware system.

The VOT2019 Dataset: We also evaluate our AFAT on the VOT2019 dataset [24] and compare the tracking performance with 6 state-of-the-art trackers. The VOT2019 public dataset contains 60 challenging video sequences for single object tracking research, introducing more challenging factors than VOT2018. Following the standard protocol [25], we report the detailed results in Table 4. According to the table, the proposed AFAT method achieves the best EAO score, 0.295, outperforming recent advanced trackers, e.g., ATOM (0.292) and SiamRPN++ (0.292). In addition, the reported Accuracy (0.599) and Robustness (0.450) scores of AFAT are also within the top three, demonstrating the effectiveness of the proposed adaptive failure detection and correction framework.

The UAV123 Dataset: UAV123 dataset contains 123 video sequences, recorded by a camera embedded in a UAV system [32]. The average sequence length of UAV123 is 915 frames. We report the experimental results in Fig. 5 with precision and success plots. Our AFAT outperforms recent DCF and Siamese trackers: ECO, CSRDCF, DaSiam and SiamRPN++. Specifically, the proposed AFAT method achieves improved performance in terms of AUC (0.612), DP (0.811) and speed (80.4 FPS), against SiamRPN++ with AUC (0.611), DP (0.804) and speed (37.9 FPS).

The LaSOT Dataset: LaSOT is a recent dataset including 1400 sequences in 70 object categories, with an average sequence length of more than 2500 frames. We evaluate the tracking performance on the designated LaSOT test set, which contains 280 video sequences with 4 videos of each category [12]. Fig. 6 reports the normalised precision plots and success plots of the proposed AFAT and recent state-of-the-art trackers, e.g., LADCF, MDNet, VITAL and SiamRPN++. AFAT achieves 0.574 in the NP metric, outperforming SiamRPN++ (resnet50) (0.571), but SiamRPN++ (resnet50) achieves a better AUC than AFAT.

To summarise, our AFAT method achieves advanced performance compared with the state-of-the-art trackers. In addition, the speed of our AFAT ranges from 70 FPS to 87 FPS, depending on the number of detected failures. It should be highlighted that a forward pass of our QPN takes only 1.2–1.5 ms, demonstrating its feasibility for real-time visual object tracking.

6 Conclusion

We proposed an effective online Quality Prediction Network (QPN) that learns from sequential response maps at the decision level, enabling adaptive quality perception for an online visual object tracker. With spatial feature extraction and temporal feature fusion, the quality of the current tracking result can be predicted from sequential response maps in a supervised manner, achieving online failure detection. Furthermore, we combined our QPN with the recent SiamRPN++ method to construct our AFAT tracker. The significance of realising online failure detection and correction was examined on datasets with diverse challenges. The extensive experimental results obtained on a number of benchmarking datasets demonstrate the effectiveness and robustness of the proposed method, as compared with the state-of-the-art trackers.


  • [1] L. Bertinetto, J. F. Henriques, J. Valmadre, P. Torr, and A. Vedaldi (2016) Learning feed-forward one-shot learners. In NIPS, pp. 523–531. Cited by: §2.3.
  • [2] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. Torr (2016) Staple: complementary learners for real-time tracking. In CVPR, pp. 1401–1409. Cited by: §2.1, §5.1.
  • [3] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr (2016) Fully-convolutional siamese networks for object tracking. In ECCV, pp. 850–865. Cited by: §1, §2.1, §5.1.
  • [4] G. Bhat, M. Danelljan, L. V. Gool, and R. Timofte (2019) Learning discriminative model prediction for tracking. In ICCV, pp. 6182–6191. Cited by: §1, §2.2.
  • [5] G. Bhat, J. Johnander, M. Danelljan, F. Shahbaz Khan, and M. Felsberg (2018) Unveiling the power of deep tracking. In ECCV, pp. 483–498. Cited by: §1, §2.1, §2.2, Table 3.
  • [6] J. Choi, J. Kwon, and K. M. Lee (2019) Deep meta learning for real-time target-aware visual tracking. In ICCV, pp. 911–920. Cited by: §2.3.
  • [7] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg (2019) Atom: accurate tracking by overlap maximization. In CVPR, pp. 4660–4669. Cited by: §1, §2.1, §2.2, §5.1, Table 3, Table 4.
  • [8] M. Danelljan, G. Bhat, F. Shahbaz Khan, and M. Felsberg (2017) ECO: efficient convolution operators for tracking. In CVPR, pp. 6638–6646. Cited by: §2.1, Table 3.
  • [9] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg (2015) Learning spatially regularized correlation filters for visual tracking. In ICCV, pp. 4310–4318. Cited by: §2.1, §5.1.
  • [10] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg (2016) Adaptive decontamination of the training set: a unified formulation for discriminative visual tracking. In CVPR, pp. 1430–1438. Cited by: §2.1.
  • [11] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg (2016) Beyond correlation filters: learning continuous convolution operators for visual tracking. In ECCV, pp. 472–488. Cited by: §2.1, §5.1.
  • [12] H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and H. Ling (2019) Lasot: a high-quality benchmark for large-scale single object tracking. In CVPR, pp. 5374–5383. Cited by: 3rd item, §2.1, §5.1, §5.3.
  • [13] F. A. Gers, J. Schmidhuber, and F. Cummins (1999) Learning to forget: continual prediction with LSTM. In ICANN. Cited by: §3.2.
  • [14] E. Gundogdu and A. A. Alatan (2018) Good features to correlate for visual tracking. IEEE TIP 27 (5), pp. 2526–2540. Cited by: Table 3.
  • [15] Z. He, Y. Fan, J. Zhuang, Y. Dong, and H. Bai (2017) Correlation filters with weighted convolution responses. In ICCV, pp. 1992–2000. Cited by: Table 3.
  • [16] D. Held, S. Thrun, and S. Savarese (2016) Learning to track at 100 fps with deep regression networks. In ECCV, pp. 749–765. Cited by: §2.1.
  • [17] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista (2012) Exploiting the circulant structure of tracking-by-detection with kernels. In ECCV, pp. 702–715. Cited by: §2.1.
  • [18] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista (2014) High-speed tracking with kernelized correlation filters. IEEE TPAMI 37 (3), pp. 583–596. Cited by: §2.1.
  • [19] Z. Kalal, K. Mikolajczyk, and J. Matas (2011) Tracking-learning-detection. IEEE TPAMI 34 (7), pp. 1409–1422. Cited by: §2.2.
  • [20] H. Kiani Galoogahi, A. Fagg, and S. Lucey (2017) Learning background-aware correlation filters for visual tracking. In ICCV, pp. 1135–1143. Cited by: §2.1, §5.1.
  • [21] M. Kristan, A. Leonardis, J. Matas, et al. (2016) The visual object tracking vot2016 challenge results. Springer. Cited by: 3rd item, §5.1.
  • [22] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Cehovin Zajc, T. Vojir, G. Bhat, A. Lukezic, A. Eldesokey, et al. (2018) The sixth visual object tracking vot2018 challenge results. In ECCV, pp. 1–52. Cited by: 3rd item, §2.1, §5.1, §5.3.
  • [23] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernandez, T. Vojir, G. Hager, G. Nebehay, and R. Pflugfelder (2015) The visual object tracking vot2015 challenge results. In ICCVW, pp. 1–23. Cited by: §5.1.
  • [24] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, R. Pflugfelder, J. Kamarainen, L. Cehovin Zajc, O. Drbohlav, A. Lukezic, A. Berg, et al. (2019) The seventh visual object tracking vot2019 challenge results. In ICCVW, pp. 1–36. Cited by: 3rd item, §2.1, §5.1, §5.3.
  • [25] M. Kristan, J. Matas, A. Leonardis, et al. (2016) A novel performance evaluation methodology for single-target trackers. IEEE TPAMI 38 (11), pp. 2137–2155. Cited by: §5.1, §5.3, §5.3.
  • [26] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan (2019) Siamrpn++: evolution of siamese visual tracking with very deep networks. In CVPR, pp. 4282–4291. Cited by: §1, §2.1, §2.2, §3, §4, §5.1.
  • [27] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu (2018) High performance visual tracking with siamese region proposal network. In CVPR, pp. 8971–8980. Cited by: §1, §2.1, §5.1, Table 3, Table 4.
  • [28] P. Li, B. Chen, W. Ouyang, D. Wang, X. Yang, and H. Lu (2019) Gradnet: gradient-guided network for visual object tracking. In ICCV, pp. 6162–6171. Cited by: Table 3.
  • [29] P. Liang, E. Blasch, and H. Ling (2015) Encoding color information for visual tracking: algorithms and benchmark. IEEE TIP 24 (12), pp. 5630–5644. Cited by: §2.1.
  • [30] A. Lukezic, T. Vojir, L. Cehovin Zajc, J. Matas, and M. Kristan (2017) Discriminative correlation filter with channel and spatial reliability. In CVPR, pp. 6309–6318. Cited by: §1, §2.1, §2.2, §5.1.
  • [31] C. Ma, X. Yang, C. Zhang, and M. Yang (2015) Long-term correlation tracking. In CVPR, pp. 5388–5396. Cited by: §1, §2.2.
  • [32] M. Mueller, N. Smith, and B. Ghanem (2016) A benchmark and simulator for uav tracking. In ECCV, pp. 445–461. Cited by: 3rd item, §2.1, §5.1, §5.3.
  • [33] M. Mueller, N. Smith, and B. Ghanem (2017) Context-aware correlation filter tracking. In CVPR, pp. 1396–1404. Cited by: §5.1.
  • [34] M. Muller, A. Bibi, S. Giancola, S. Alsubaihi, and B. Ghanem (2018) Trackingnet: a large-scale dataset and benchmark for object tracking in the wild. In ECCV, pp. 300–317. Cited by: §1, §2.1, §3.3.
  • [35] E. Park and A. C. Berg (2018) Meta-tracker: fast and robust online adaptation for visual object trackers. In ECCV, pp. 569–585. Cited by: §2.3, §5.1.
  • [36] Y. Song, C. Ma, L. Gong, J. Zhang, R. W. Lau, and M. Yang (2017) Crest: convolutional residual learning for visual tracking. In ICCV, pp. 2555–2564. Cited by: §5.1.
  • [37] C. Sun, D. Wang, H. Lu, and M. Yang (2018) Learning spatial-aware regressions for visual tracking. In CVPR, pp. 8962–8970. Cited by: Table 3.
  • [38] R. Tao, E. Gavves, and A. W. Smeulders (2016) Siamese instance search for tracking. In CVPR, pp. 1420–1429. Cited by: §1, §2.1.
  • [39] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. Torr (2017) End-to-end representation learning for correlation filter based tracking. In CVPR, pp. 2805–2813. Cited by: §5.1.
  • [40] P. Voigtlaender, J. Luiten, P. H. Torr, and B. Leibe (2019) Siam r-cnn: visual tracking by re-detection. arXiv preprint arXiv:1911.12836. Cited by: §1.
  • [41] M. Wang, Y. Liu, and Z. Huang (2017) Large margin object tracking with circulant feature maps. In CVPR, pp. 4021–4029. Cited by: §1, §2.2.
  • [42] Q. Wang, L. Zhang, L. Bertinetto, W. Hu, and P. H. Torr (2019) Fast online object tracking and segmentation: a unifying approach. In CVPR, pp. 1328–1338. Cited by: Table 4.
  • [43] Y. Wu, J. Lim, and M. Yang (2013) Online object tracking: a benchmark. In CVPR, pp. 2411–2418. Cited by: §5.1.
  • [44] Y. Wu, J. Lim, and M. Yang (2015) Object tracking benchmark. IEEE TPAMI 37 (9), pp. 1834–1848. Cited by: 3rd item, §2.1, Figure 4, §5.1, §5.1.
  • [45] T. Xu, Z. Feng, X. Wu, and J. Kittler (2019) An accelerated correlation filter tracker. Pattern Recognition, pp. 107172. Cited by: §2.1.
  • [46] T. Xu, Z. Feng, X. Wu, and J. Kittler (2019) Joint group feature selection and discriminative filter learning for robust visual object tracking. In ICCV, pp. 7950–7960. Cited by: §2.1, Table 3, Table 4.
  • [47] T. Xu, Z. Feng, X. Wu, and J. Kittler (2019) Learning adaptive discriminative correlation filters via temporal consistency preserving spatial feature selection for robust visual object tracking. IEEE TIP 28 (11), pp. 5596–5609. Cited by: §2.1, §5.1, Table 3.
  • [48] T. Xu, Z. Feng, X. Wu, and J. Kittler (2019) Learning low-rank and sparse discriminative correlation filters for coarse-to-fine visual object tracking. IEEE TCSVT. Cited by: §1.
  • [49] T. Xu, X. Wu, and J. Kittler (2018) Non-negative subspace representation learning scheme for correlation filter based tracking. In ICPR, pp. 1888–1893. Cited by: §2.1.
  • [50] B. Yan, H. Zhao, D. Wang, H. Lu, and X. Yang (2019) ’Skimming-perusal’ tracking: a framework for real-time and robust long-term tracking. In ICCV, pp. 2385–2393. Cited by: §1, §2.2.
  • [51] L. Zhang, A. Gonzalez-Garcia, J. v. d. Weijer, M. Danelljan, and F. S. Khan (2019) Learning the model update for siamese trackers. In ICCV, pp. 4010–4019. Cited by: §2.3.
  • [52] Z. Zhang and H. Peng (2019) Deeper and wider siamese networks for real-time visual tracking. In CVPR, pp. 4591–4600. Cited by: §2.1.
  • [53] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu (2018) Distractor-aware siamese networks for visual object tracking. In ECCV, pp. 101–117. Cited by: §5.1, Table 3.