SPSTracker: Sub-Peak Suppression of Response Map for Robust Object Tracking

12/02/2019 · Qintao Hu et al. · University of California-Davis

Modern visual trackers usually construct online learning models under the assumption that the feature response has a Gaussian distribution with a target-centered peak. Nevertheless, such an assumption is implausible when there is progressive interference from other targets and/or background noise, which produces sub-peaks on the tracking response map and causes model drift. In this paper, we propose a rectified online learning approach for sub-peak response suppression and peak response enforcement, targeting the handling of progressive interference in a systematic way. Our approach, referred to as SPSTracker, applies simple-yet-efficient Peak Response Pooling (PRP) to aggregate and align discriminative features, and leverages Boundary Response Truncation (BRT) to reduce the variance of the feature response. By fusing multi-scale features, SPSTracker aggregates the response distribution of multiple sub-peaks into a single maximum peak, which enforces the discriminative capability of features for robust object tracking. Experiments on the OTB, NFS and VOT2018 benchmarks demonstrate that SPSTracker outperforms state-of-the-art real-time trackers by significant margins.

Introduction

In the past few years, deep convolutional neural networks (CNNs) have significantly improved the performance of visual object tracking, by providing frameworks for end-to-end representation learning [25, 6], online correlation filter learning [9, 7], and discriminative classifier learning [3, 28, 27, 26, 12]. However, CNN-based trackers suffer performance degradation caused by multi-target occlusion, appearance variation and/or background noise. In particular, during the tracking procedure, neighboring targets and background noise can introduce progressive interference and result in serious model drift, as shown in Fig. 1 (top), particularly when objects have scale variations and complex motions.

To mitigate the interference, the Siamese network structure was introduced to improve the discriminative capacity of trackers by extensively training the network [29, 36]. The early-stopping tracker incorporates object representation and decision-making policies with reinforcement learning [17]. Nevertheless, these approaches usually rely on additional data for offline training and lack the capability to adapt trackers to complex conditions.

To address this issue, the dynamic Siamese network [11] uses a fast transformation learning method to model target appearance variation and handle background suppression from previous frames. The ATOM tracker [4] combines offline pre-training with online learning in a multi-task learning framework by incorporating the objectives of target localization and target-background classification. While incorporating high-level knowledge into the target estimation through extensive offline learning, these methods still overlook the progressive interference from the context area and do not handle it in a systematic manner. How to directly model the interference and regularize the tracking response distribution remains an open problem.

In this paper, we propose a simple-yet-effective approach, referred to as SPSTracker, for robust object tracking. Our motivation is based on the observation that most tracking failures are caused by interference around the target. Such interference produces a multi-peak tracking response, and a sub-peak may progressively “grow” and eventually cause model drift. Therefore, we propose suppressing the sub-peaks and aggregating a single-peak response, with the aim of preventing model drift from the perspective of tracking response regularization.

Specifically, we introduce a Peak Response Pooling (PRP) module, which concentrates the maximum values of tracking response into the geometric centers of targets, as shown in Fig. 2. The pooling procedure is implemented by an efficient maximization and substitution operation on the tracking response maps. During the network forward procedure, PRP aggregates multiple sub-peaks into a single centered peak for target tracking. During the backward propagation procedure, the response map with a single peak guides the online learning (fine-tuning) to explore discriminative features.

Based on PRP, we further propose the Boundary Response Truncation (BRT) operation to clip the response map, by simply setting the values of pixels far away from the peak response to zero. The operation reduces the variance of the feature response map while further aggregating the single-peak response. If the response map is approximated as a Gaussian distribution, PRP concentrates the mean while BRT reduces the variance. PRP together with BRT facilitates the model in learning an enforced response map for robust object tracking.

SPSTracker is built upon the CNN framework with a target classification branch and a target localization branch atop the convolutional layers. The classification network, equipped with the PRP and BRT modules, identifies the coarse locations (bounding boxes). These coarse locations are further fed to the target localization branch to estimate the precise target location.

The main contributions of this work can be summarized as follows:

  • A Sub-Peak Suppression tracker (SPSTracker) is presented to reduce the risk of model drift by suppressing potential sub-peaks online while aggregating the maximum peak response.

  • A simple-yet-efficient Peak Response Pooling (PRP) module is proposed to aggregate and align discriminative features, and a Boundary Response Truncation (BRT) module is designed to reduce the variance of feature response.

  • Our proposed tracker achieves leading performance on six benchmarks, including OTB2013, OTB2015, OTB50, VOT2016, VOT2018 and NFS. In particular, we improve upon the state of the art on the VOT2016 and VOT2018 benchmarks by significant margins.

Related Work

Research on visual object tracking has a long history. Modern object trackers are usually constructed with three kinds of methods: correlation filtering, online classification, and metric learning. With the rise of deep neural networks, these methods have been integrated with feature learning in an end-to-end manner.

Correlation Filters. The filtering procedure refers to matching templates with a Gaussian distribution to track targets under appearance variation. The key to their success is the ability to efficiently exploit available negative data by including all shifted versions of a training sample. By introducing CNNs, the representative capacity of correlation filters has been greatly improved. DeepSRDCF [5] fed features from a pre-trained CNN to a correlation filter and introduced spatial regularization on the basis of KCF [15], mitigating the boundary effect. CCOT [7] and ECO [3] proposed an implicit interpolation model to pose the learning problem in the continuous spatial domain, leading to efficient integration of multi-resolution deep features.

Online Classification. Tracking can also be formulated as an online classification problem. DeepTrack [24] leveraged a sample selection mechanism and a lazy updating scheme to learn online classifiers. FCNT [31] utilized hierarchical convolutional features to construct a network that handles various kinds of interference. CNN-SVM [16] used a pre-trained deep convolutional network to extract features of the target, and then used an SVM to perform online target-background classification. These methods fully utilized the representation capability of deep features and the discriminative capacity of online classifiers. However, they often overlook the problem of accurate target state estimation.

Metric Learning. To facilitate state estimation, the tracking problem was also formulated in the metric (similarity) learning framework. Classification and state estimation were integrated into a Siamese network [2] that measures the similarity between the target and candidates for tracking. Semantic and appearance branches were constructed in the dual Siamese network [13], and saliency mechanisms were introduced in the attention-based Siamese network [32]. SiamRPN [23, 22] combined the Siamese network with a region proposal network (RPN) to allow trackers to estimate the target extent while positioning accurately. SiamMask [33] involved a unified framework for visual object tracking (VOT) and video object segmentation (VOS): a tracker is trained offline and relies on the position specified in the first frame for semi-supervised learning, achieving both target tracking and mask estimation.

Figure 2: Illustration of the Boundary Response Truncation (BRT) and Peak Response Pooling (PRP) modules. First, BRT clips the feature response map while aggregating the single-peak response. Then, PRP sums the horizontal and vertical pooling maps to aggregate multiple sub-peaks (the small surrounding dots in the pooling maps) into a single centered peak (the large dot) for target tracking. (Best viewed in color)

Despite their efficiency, Siamese trackers are less robust to interference from the background, as they rely on offline training and perform little online adaptation. ATOM [4] addresses this issue by training its classification branch with a large number of samples. Nevertheless, with multiple sampled features, the target response map can have multiple sub-peaks, which aggravates the risk of model drift, particularly when there is interference from target appearance variation and/or background noise.

Methodology

We propose the Sub-Peak Response Suppression tracker (SPSTracker), with Peak Response Pooling (PRP) and Boundary Response Truncation (BRT) modules, to aggregate the multiple sub-peaks on a tracking response map into a single enforced peak, as shown in Fig. 2. SPSTracker is built upon the ATOM tracker [4], with a target classification branch and a target localization branch. The classification branch converts the feature map into a response map and provides the coarse locations of the target. The localization branch uses the bounding-box regression to localize targets. Upon the classification branch, the PRP and BRT modules are applied in a plug-and-play manner, as shown in Fig. 3.

Tracking Response Prediction

The classification branch is an online-learned CNN, which is trained by minimizing the discrepancy between the tracking response and a Gaussian prior $y$. Denote the feature map of the current video frame (the test image) from the CNN as $x$. The classification branch is a 2-layer fully convolutional network parameterized with $w = \{w_1, w_2\}$, which predicts the tracking response map $m$, as

$$m = f(x; w) = \phi_2\big(w_2 * \phi_1(w_1 * x)\big), \qquad (1)$$

where $w_1$ and $w_2$ denote the parameters of the first and second convolutional layers, respectively, the symbol $*$ denotes standard multi-channel convolution, and $\phi_1$ and $\phi_2$ are the activation functions.
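To make Eq. (1) concrete, the following is a minimal PyTorch sketch of such a 2-layer fully convolutional head. The channel sizes, kernel sizes and the choice of ReLU for $\phi_1$ (with $\phi_2$ taken as identity) are illustrative assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

class ClassificationHead(nn.Module):
    """2-layer fully convolutional head: m = f(x; w) = phi2(w2 * phi1(w1 * x)).

    Channel/kernel sizes and activations are assumptions for illustration.
    """

    def __init__(self, in_channels=256, hidden_channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, hidden_channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(hidden_channels, 1, kernel_size=3, padding=1)
        self.phi1 = nn.ReLU()  # phi_1; phi_2 is taken as identity in this sketch

    def forward(self, x):
        # x: backbone feature map of the test frame, shape (B, C, H, W)
        h = self.phi1(self.conv1(x))  # w1 * x followed by phi_1
        return self.conv2(h)          # tracking response map m, shape (B, 1, H, W)
```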

During object tracking, the parameters of the classification branch are updated by minimizing the following objective function:

$$L(w) = \sum_{j=1}^{n} \gamma_j \left\| f(x_j; w) - y_j \right\|^2 + \sum_{k} \lambda_k \left\| w_k \right\|^2, \qquad (2)$$

where $j$ denotes the index of training samples, $x_j$ denotes the features from the $j$-th sample, $y_j$ is set to a sampled Gaussian prior at the target location [4], as shown in Fig. 2 (Gaussian prior), $\gamma_j$ denotes the weight of the corresponding training sample, and $\lambda_k$ is a parameter to trade off the contributions of the two terms.
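As a sketch of Eq. (2), the weighted least-squares objective can be written as below. The per-sample lists and the dictionary of per-layer regularization weights (`lambdas`) are hypothetical interfaces used for illustration; the paper optimizes this objective with conjugate gradient rather than autograd-based descent.

```python
def classification_loss(model, samples, labels, weights, lambdas):
    """Sketch of the objective in Eq. (2).

    samples: feature maps x_j; labels: Gaussian priors y_j centered at the
    target; weights: per-sample gamma_j; lambdas: per-parameter
    regularization weights keyed by parameter name (an assumed interface).
    """
    loss = 0.0
    for x_j, y_j, gamma_j in zip(samples, labels, weights):
        m = model(x_j)                                # f(x_j; w)
        loss = loss + gamma_j * ((m - y_j) ** 2).sum()
    for name, w_k in model.named_parameters():        # sum_k lambda_k ||w_k||^2
        loss = loss + lambdas[name] * (w_k ** 2).sum()
    return loss
```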

By optimizing Eq. (2) with a conjugate gradient descent method, the model predicts the target response map, as shown in Fig. 2. Because the response map is a weighted sum of response maps from multi-scale samples, it can exhibit a multi-peak distribution. This makes the maximum response inconsistent with the target geometric center, increasing the classification error and the risk of model drift.

Sub-Peak Response Suppression

To address the issue that sub-peak responses cause model drift, we propose the Sub-Peak Response Suppression method, which prevents a sub-peak from “growing” into the main peak. Specifically, we directly operate on the target response map predicted by Eq. (1) and reformulate Eq. (2) as

$$L(w) = \sum_{j=1}^{n} \hat{\gamma}_j \left\| \hat{f}(\hat{x}_j; w) - y_j \right\|^2 + \sum_{k} \lambda_k \left\| w_k \right\|^2, \qquad (3)$$

$$\hat{f}(\hat{x}_j; w) = P\big(f(\hat{x}_j; w)\big), \qquad (4)$$

where $P(\cdot)$ denotes the Peak Response Pooling applied on each sampled response map, $\hat{x}_j$ denotes the features after the Boundary Response Truncation (BRT) operation, which decreases the variance of the target response and reduces the boundary effect, and $\hat{\gamma}_j$ denotes the weight for the $j$-th sample. By using PRP and BRT, we can apply feature fusion to aggregate the response maps from multiple samples into the target response map, while guaranteeing that the response map has a single peak centered at the target.

Figure 3: The flowchart of the proposed SPSTracker. It has a target classification branch and a target localization branch. The classification branch converts the feature map into a response map and provides the coarse locations of the target. The localization branch uses bounding-box regression to localize targets. Upon the classification branch, the PRP and BRT modules are applied in a plug-and-play manner.

By minimizing the objective function of Eq. (2), we force the response map to approximate the Gaussian prior $y$. However, for targets under partial occlusion or background noise, the response could be off-center and unlikely to follow a Gaussian distribution. The PRP and BRT operations in Eq. (4) make $\hat{f}(\hat{x}_j; w)$ close to the Gaussian prior via Eq. (3), and eventually facilitate the online learning procedure.

Peak Response Pooling (PRP)

We propose a Peak Response Pooling (PRP) module, which concentrates the maximum values on the tracking response map at the target geometric center. On the target response map output by the classification branch, horizontal PRP is first performed to concentrate the response map into a horizontal pooling map. This is done by finding the maximum response in each row of the response map and assigning all pixels of that row the maximum value. In a similar way, vertical PRP is performed on each column of the response map to obtain the vertical pooling map. As a result, the element value of the response map after the PRP operation is calculated as

$$\hat{m}_{i,j} = \max_{c} m_{i,c} + \max_{r} m_{r,j}, \qquad (5)$$

where $m_{i,j}$ denotes the original response value at the $i$-th row and $j$-th column. The horizontal and vertical pooling maps are summed to obtain the rectified response map, which tends to aggregate large response values at the target geometric center. After multiple learning iterations, the target response is concentrated to approximate a 2D Gaussian distribution, which fits the Gaussian prior for robust object tracking, as shown in Fig. 2.
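Eq. (5) amounts to broadcasting the row-wise and column-wise maxima and summing them, as in this PyTorch sketch (the tensor layout (B, 1, H, W) is an assumption):

```python
import torch

def peak_response_pooling(m):
    """PRP sketch: sum of horizontal and vertical pooling maps, Eq. (5)."""
    row_max = m.max(dim=3, keepdim=True).values  # horizontal PRP: (B, 1, H, 1)
    col_max = m.max(dim=2, keepdim=True).values  # vertical PRP:   (B, 1, 1, W)
    # broadcasting assigns each pixel its row maximum plus its column maximum
    return row_max + col_max
```

Note that the global maximum of the pooled map lies at the intersection of the strongest row and the strongest column, which is what drives the response toward a single centered peak.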

Peak Response Pooling (PRP) is inspired by the center/corner pooling [8, 21] developed for object detection. However, PRP differs from center/corner pooling in two aspects: 1) PRP aims to aggregate the response map into a single-peak distribution so that the Gaussian prior distribution can be well fitted, whereas center/corner pooling aligns features to handle the appearance variation of objects; 2) PRP leverages more efficient row- and column-wise maximization operations to aggregate large responses at target centers, while center/corner pooling uses comparison and substitution operations.

Boundary Response Truncation (BRT)

When recognizing objects and determining object boundaries, the human visual system does not align objects with fixed data points but uses the fovea in the eyeball, which concentrates the peak response on central regions for object localization [18]. This concentration procedure inspires us to develop the BRT module for object tracking.

During tracking, pixels within the extent of the target but far away from the target center could have ambiguous features (either background or foreground). The PRP module concentrates the target response at the target center but does not consider the variance of the target response. In complex scenes, the response map could have large variance due to significant responses at the target boundary, which is called the boundary effect. Considering that a single-peak response map with small variance could alleviate the boundary effect and improve tracking robustness, we further introduce the Boundary Response Truncation (BRT) operation.

As shown in Fig. 2, BRT is a simple clip operation, which sets the pixels far away from the peak response to zero. This operation discards the response at the target boundary and reduces the variance of the response map. With BRT, we may miss some informative target response. However, it is experimentally validated that when clipping the response map by 10%, we lose 4% of the foreground information and 12% of the background information, i.e., BRT removes more ambiguous response than informative response, enhancing the classification ability of the tracker.
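A minimal sketch of BRT under one plausible reading of the clip operation: since the response map is roughly centered on the target, zeroing a fixed fraction of the map at each boundary removes the pixels far from the peak. The 10% ratio follows the setting validated above; the exact clipping geometry is an assumption.

```python
import torch

def boundary_response_truncation(m, ratio=0.1):
    """BRT sketch: zero the response within `ratio` of the map boundary."""
    B, C, H, W = m.shape
    dh, dw = int(H * ratio), int(W * ratio)
    out = torch.zeros_like(m)
    out[:, :, dh:H - dh, dw:W - dw] = m[:, :, dh:H - dh, dw:W - dw]
    return out
```

Following the order illustrated in Fig. 2, the two modules compose as `peak_response_pooling(boundary_response_truncation(m))`.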

Object Tracking

SPSTracker is built upon the state-of-the-art ATOM tracker [4], with a target classification branch and a target localization branch. The classification branch produces coarse region proposals by evaluating the target response map. The target localization branch fine-tunes the network parameters to fit the reference target box with multiple region proposals [4]. Upon the classification branch, the PRP and BRT modules are applied in a plug-and-play manner, as shown in Fig. 3.

Note that the ATOM tracker uses only the last convolutional block of ResNet-18 (Block4) as the feature representation. Shallow convolutional features are more important for extracting low-level information such as color and edges, while deep convolutional features are rich in high-level semantics. The fusion of multi-scale (shallow and deep) features strengthens the representation capability, but it produces sub-peaks on response maps and can deteriorate the tracking performance.

By introducing the PRP and BRT modules, the multi-scale features can be well integrated for target representation and tracking. As shown in Fig. 4, the multiple sub-peaks produced by multi-scale features are concentrated into a single maximum peak, which bridges the gap between the predicted response map and the Gaussian prior $y$ and facilitates robust tracking.
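A sketch of how the Block3 and Block4 response maps might be fused; the bilinear upsampling and the equal weighting are assumptions, as the paper does not spell out the fusion operator.

```python
import torch.nn.functional as F

def fuse_multiscale_responses(m_block3, m_block4, weight=0.5):
    """Fuse shallow (Block3) and deep (Block4) response maps.

    The deep map is resized to the shallow map's resolution and the two
    are combined by a weighted sum (the weights are an assumption).
    """
    m4_up = F.interpolate(m_block4, size=m_block3.shape[-2:],
                          mode='bilinear', align_corners=False)
    return weight * m_block3 + (1.0 - weight) * m4_up
```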

BRT PRP MF   EAO   Accuracy Robustness FPS
 -   -   -   0.401 0.590    0.204      40
 ✓   -   -   0.414 0.601    0.189      40
 -   ✓   -   0.420 0.609    0.191      39
 -   -   ✓   0.411 0.605    0.184      35
 ✓   ✓   -   0.424 0.610    0.179      39
 ✓   -   ✓   0.419 0.604    0.187      35
 -   ✓   ✓   0.424 0.612    0.174      35
 ✓   ✓   ✓   0.434 0.612    0.169      35
Table 1: Ablation study of SPSTracker on the VOT-2018 benchmark. “BRT” denotes Boundary Response Truncation, “PRP” denotes Peak Response Pooling, and “MF” denotes multi-scale feature fusion; ✓ marks the enabled modules. The first row (no modules) is the baseline performance reported by the state-of-the-art ATOM tracker.

Experiment

In this section, we first describe the implementation details of SPSTracker. We then present an ablation study to validate the PRP and BRT modules proposed in this paper. Finally, we evaluate SPSTracker on commonly used benchmarks and compare it with state-of-the-art trackers. All experiments are carried out with PyTorch on an Intel i5-8600K 3.4GHz CPU and a single Nvidia GTX 1080 Ti GPU.

Implementation details

SPSTracker is implemented upon the ATOM architecture [4], using ResNet-18 [14] pre-trained on ImageNet as the backbone network. The Block3 and Block4 features extracted from the test image are first passed through two Conv layers. Regions defined by the input bounding boxes are then pooled to a fixed size using pooling layers. The pooled features are modulated by channel-wise multiplication with the coefficient vector returned by the reference branch, and then passed through fully-connected layers to predict the Intersection over Union (IoU). All Conv and FC layers are followed by BatchNorm and ReLU. The target response map is obtained by fusing the responses from ResNet-18’s Block3 and Block4.
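The modulation-based IoU prediction described above can be sketched as follows. The pooled size, channel counts and FC widths are assumptions, and BatchNorm is omitted for brevity.

```python
import torch.nn as nn

class IoUPredictor(nn.Module):
    """Sketch of the IoU prediction head (layer sizes are assumptions)."""

    def __init__(self, channels=256, pooled_size=5):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels * pooled_size * pooled_size, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, pooled_feat, modulation):
        # pooled_feat: (B, C, s, s) region features; modulation: (B, C)
        # channel-wise multiplication with the reference-branch coefficients
        x = pooled_feat * modulation[:, :, None, None]
        return self.fc(x.flatten(1))  # predicted IoU per candidate box
```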

Figure 4: Comparison of the target response maps of the ATOM tracker (top) and SPSTracker (bottom).

Ablation Study

For the proposed PRP and BRT modules, we perform ablation analysis to investigate their impact on the tracking performance. We also analyze the impact of the multi-scale features in SPSTracker. All the ablation studies are carried out on the VOT2018 [20] benchmark.

Peak Response Pooling (PRP). From the results in Table 1, we can see that introducing the PRP module to the classification branch significantly improves the tracking performance. Specifically, it improves the expected average overlap (EAO) by 0.019 (from 0.401 to 0.420), which is a significant margin considering the strong ATOM baseline. It also improves the tracking accuracy and robustness, as indicated by the accuracy and robustness columns of Table 1.

Figure 5: The precision plots and success plots on OTB-2015, OTB-2013 and OTB-50 benchmarks.
Figure 6: EAO ranking of the tested trackers on VOT2016.
Figure 7: EAO ranking of the tested trackers on VOT2018.

Boundary Response Truncation (BRT). The BRT module improves the EAO by 0.013 (from 0.401 to 0.414), as reported in Table 1, which is also a significant margin. This validates that the truncation operation is able to reduce response variance and benefit online classifier learning by filtering out ambiguous samples.

We test different truncation sizes and validate that the best performance is obtained when clipping 10% of the width/height of the response map. We therefore clip 10% of the height/width of the response map in all experiments.

Multi-scale Feature Fusion. By using multi-scale feature fusion alone, we improve the EAO by 0.010 (from 0.401 to 0.411). Combining feature fusion with the PRP and BRT modules improves the EAO by 0.033 (0.434 vs. 0.401), as reported in Table 1. The significant performance gain demonstrates that the proposed PRP and BRT modules facilitate the fusion of multi-scale features and reduce the negative effect brought by the multiple sub-peaks that the fusion introduces.

Tracker EAO Accuracy Robustness
SPSTracker 0.459 0.625 0.158
SiamMask 0.433 0.639 0.214
DWSiam 0.370 0.580 0.240
CCOT 0.331 0.539 0.238
TCNN 0.325 0.554 0.268
SSAT 0.321 0.577 0.291
MLDF 0.311 0.490 0.233
Staple 0.295 0.544 0.378
DDC 0.293 0.541 0.345
EBT 0.291 0.465 0.252
SRBT 0.290 0.496 0.350
Table 2: Performance comparison on VOT2016.
Tracker EAO Accuracy Robustness
SPSTracker 0.434 0.612 0.169
SiamRPN++ 0.414 0.600 0.234
ATOM 0.401 0.590 0.204
SiamMask 0.380 0.609 0.276
LADCF 0.389 0.503 0.159
MFT 0.385 0.505 0.140
DaSiamRPN 0.383 0.544 0.276
UPDT 0.378 0.536 0.184
RCO 0.376 0.507 0.155
DRT 0.356 0.519 0.201
DeepSTRC 0.345 0.523 0.215
Table 3: Performance comparison on VOT-2018.

Sub-peak suppression. In Fig. 4, we compare the target response maps of the ATOM tracker and SPSTracker. It can be seen that SPSTracker suppresses the multiple sub-peaks and produces a response map with a single peak centered at the target. The peak response fits a Gaussian prior well.

Tracking speed. With a single GPU, the proposed SPSTracker achieves a tracking speed of 35 fps. Compared with the 40 fps of the baseline ATOM, SPSTracker achieves significant performance gains with a negligible computational overhead.

OTB. The object tracking benchmarks (OTB) [34, 35] consist of three datasets, namely OTB-2013 [34], OTB-50 and OTB-100 (also referred to as OTB-2015), which contain 51, 50 and 100 fully annotated videos, respectively. OTB-100 includes the sequences of OTB-2013 and OTB-50. All sequences are annotated with 11 typical tracking interference attributes.

Ours ATOM UPDT CCOT ECO MDNet HDT DaSiamRPN FCNT SRDCF BACF
AUC 60.0 59.0 54.2 49.2 47.0 42.5 40.0 39.5 39.3 35.3 34.2
Table 4: Performance comparison on the NFS dataset.

Figure 8: Qualitative results of state-of-the-art trackers on the sequences Box, Matrix, ClifBar, Ironman, Deer and CarScale. SPSTracker localizes objects under interference from either the foreground or the background, while the other compared methods have failure cases. (Best viewed in color with zoom-in)

Two evaluation metrics, success rate and precision, are used on OTB. The precision plot shows the percentage of frames whose tracking results are within a given center-distance threshold. The success plot shows the ratio of successful frames as the overlap threshold varies from 0 to 1, where a successful frame is one whose overlap exceeds the given threshold. The area under the curve (AUC) of each success plot is used to rank the tracking methods.
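For reference, the two OTB metrics can be computed as in this sketch. Boxes are (x, y, w, h) NumPy arrays; the 20-pixel precision threshold and the 21-point threshold grid are common conventions, not values taken from this paper.

```python
import numpy as np

def otb_metrics(pred_boxes, gt_boxes, dist_thresh=20):
    """OTB-style precision (at a center-distance threshold) and success AUC."""
    # center-distance precision
    pc = pred_boxes[:, :2] + pred_boxes[:, 2:] / 2
    gc = gt_boxes[:, :2] + gt_boxes[:, 2:] / 2
    dist = np.linalg.norm(pc - gc, axis=1)
    precision = (dist <= dist_thresh).mean()

    # intersection-over-union per frame
    x1 = np.maximum(pred_boxes[:, 0], gt_boxes[:, 0])
    y1 = np.maximum(pred_boxes[:, 1], gt_boxes[:, 1])
    x2 = np.minimum(pred_boxes[:, 0] + pred_boxes[:, 2], gt_boxes[:, 0] + gt_boxes[:, 2])
    y2 = np.minimum(pred_boxes[:, 1] + pred_boxes[:, 3], gt_boxes[:, 1] + gt_boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred_boxes[:, 2] * pred_boxes[:, 3] + gt_boxes[:, 2] * gt_boxes[:, 3] - inter
    iou = inter / np.maximum(union, 1e-12)

    # success rate averaged over overlap thresholds in [0, 1]
    thresholds = np.linspace(0, 1, 21)
    success = np.array([(iou > t).mean() for t in thresholds])
    return precision, success.mean()
```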

Using the success rate and precision plots in one-pass evaluation (OPE) as the evaluation metrics, we compare SPSTracker with state-of-the-art trackers including ATOM [4], DaSiamRPN [39], ECO-HC [3], SiamRPN [23], CF2 [30], CNN-SVM [16], SRDCF [5] and Staple [1]. As shown in Fig. 5, the proposed SPSTracker achieves the best AUC scores on all three benchmarks, OTB-2015, OTB-2013 and OTB-50, improving upon ATOM [4] on each of them.

VOT2016 and VOT2018. From the visual object tracking (VOT) benchmarks, we select VOT2016 [19] and VOT2018 [20] to evaluate the trackers. VOT2016 contains 60 challenging videos, while VOT2018 replaces 10 of them with more challenging sequences. Whenever the tracking bounding box drifts away from the ground truth, the tracker is re-initialized after five frames. The trackers are evaluated by the EAO metric, which is the inner product of the empirically estimated average overlap and the typical sequence-length distribution. In addition, accuracy (average overlap) and robustness (average number of failures) are used for evaluation.

SPSTracker is compared with 10 state-of-the-art trackers on VOT-2016, as shown in Fig. 6. SPSTracker achieves the leading performance and significantly outperforms the other trackers. Table 2 reports the details of the comparison with SiamMask [33], DWSiam [37], CCOT [7], TCNN [26], SSAT [19], MLDF [19], Staple [1], DDC [19], EBT [38] and SRBT [19]. The EAO score of the proposed SPSTracker is 0.459, which is significantly higher than those of the peer trackers.

SPSTracker is also compared with 10 state-of-the-art trackers on VOT-2018. As shown in Fig. 7, SPSTracker again obtains the best performance. Table 3 shows the details of the comparison.

SPSTracker achieves an EAO score of 0.434, which is significantly better than those of SiamRPN++ [22], ATOM and the other state-of-the-art trackers. In particular, it outperforms the state-of-the-art SiamRPN++ by 0.020, ATOM by 0.033 and SiamMask by 0.054, which are significant margins on this challenging benchmark.

NFS. The Need for Speed (NFS) [10] dataset consists of 100 videos (380K frames). All frames are annotated with axis-aligned bounding boxes, and all sequences are manually labeled with nine visual attributes, including occlusion, fast motion and background clutter. We evaluate the trackers on the 30 FPS version of the NFS dataset. Table 4 reports the AUC scores of the compared trackers. SPSTracker slightly outperforms the baseline ATOM tracker, while significantly outperforming the other state-of-the-art tracking methods.

Fig. 8 shows tracking examples on the OTB benchmark, from which we can see that SPSTracker correctly localizes the targets under serious interference from the foreground and background. In contrast, the other trackers have failure cases.

Conclusions

Visual tracking has been extensively investigated in the past few years. Nevertheless, the problem of how to model interference from multiple targets, appearance variation and/or background noise remains unsolved. In this paper, we proposed modeling the interference from the perspective of the peak distribution and designed a rectified online learning approach for sub-peak response suppression and peak response enforcement. We proposed the plug-and-play Peak Response Pooling (PRP) to aggregate and align discriminative features, and designed Boundary Response Truncation (BRT) to reduce the variance of the feature response. Based on PRP and BRT, we integrated multi-scale features in SPSTracker to learn discriminative features for robust object tracking. SPSTracker achieved new state-of-the-art performance on six widely used benchmarks, which verifies the effectiveness of the proposed peak response modeling approach.

References

  • [1] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. S. Torr (2016) Staple: complementary learners for real-time tracking. In CVPR.
  • [2] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr (2016) Fully-convolutional siamese networks for object tracking. In ECCV.
  • [3] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg (2017) ECO: efficient convolution operators for tracking. In CVPR.
  • [4] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg (2019) ATOM: accurate tracking by overlap maximization. In CVPR.
  • [5] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg (2015) Learning spatially regularized correlation filters for visual tracking. In ICCV.
  • [6] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg (2016) Convolutional features for correlation filter based visual tracking. In ICCV Workshops.
  • [7] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg (2016) Beyond correlation filters: learning continuous convolution operators for visual tracking. In ECCV.
  • [8] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian (2019) CenterNet: keypoint triplets for object detection. In CVPR.
  • [9] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista (2015) High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(3), pp. 583–596.
  • [10] H. K. Galoogahi, A. Fagg, C. Huang, D. Ramanan, and S. Lucey (2017) Need for speed: a benchmark for higher frame rate object tracking. In ICCV.
  • [11] Q. Guo, F. Wei, C. Zhou, H. Rui, and W. Song (2017) Learning dynamic siamese network for visual object tracking. In ICCV.
  • [12] B. Han, J. Sim, and H. Adam (2017) BranchOut: regularization for online ensemble tracking with convolutional neural networks. In CVPR.
  • [13] A. He, L. Chong, X. Tian, and W. Zeng (2018) A twofold siamese network for real-time object tracking. In CVPR.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR.
  • [15] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista (2015) High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(3), pp. 583–596.
  • [16] S. Hong, T. You, S. Kwak, and B. Han (2015) Online tracking by learning discriminative saliency map with convolutional neural network. In ICML.
  • [17] C. Huang, S. Lucey, and D. Ramanan (2017) Learning policies for adaptive tracking with deep feature cascades. In ICCV.
  • [18] T. Kong, F. Sun, H. Liu, Y. Jiang, and J. Shi (2019) FoveaBox: beyond anchor-based object detector. In CVPR.
  • [19] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Cehovin, T. Vojir, G. Häger, A. Lukezic, and G. Fernandez (2016) The visual object tracking VOT2016 challenge results. In ECCV Workshops, Vol. 8926, pp. 191–217.
  • [20] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. P. Pflugfelder, L. C. Zajc, T. Vojir, G. Bhat, A. Lukežic, A. Eldesokey, et al. (2018) The sixth visual object tracking VOT2018 challenge results. In ECCV Workshops, pp. 3–53.
  • [21] H. Law and J. Deng (2018) CornerNet: detecting objects as paired keypoints. International Journal of Computer Vision, pp. 1–15.
  • [22] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan (2019) SiamRPN++: evolution of siamese visual tracking with very deep networks. In CVPR.
  • [23] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu (2018) High performance visual tracking with siamese region proposal network. In CVPR.
  • [24] H. Li, Y. Li, and F. Porikli (2015) DeepTrack: learning discriminative feature representations online for robust visual tracking. IEEE Transactions on Image Processing 25(4), pp. 1834–1848.
  • [25] C. Ma, J. Huang, X. Yang, and M. Yang (2015) Hierarchical convolutional features for visual tracking. In ICCV.
  • [26] H. Nam, M. Baek, and B. Han (2016) Modeling and propagating CNNs in a tree structure for visual tracking. In CVPR.
  • [27] H. Nam and B. Han (2016) Learning multi-domain convolutional neural networks for visual tracking. In CVPR.
  • [28] Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, and M. Yang (2016) Hedged deep tracking. In CVPR.
  • [29] R. Tao, E. Gavves, and A. W. M. Smeulders (2016) Siamese instance search for tracking. In CVPR.
  • [30] J. Valmadre, L. Bertinetto, J. F. Henriques, A. Vedaldi, and P. H. S. Torr (2017) End-to-end representation learning for correlation filter based tracking. In CVPR.
  • [31] L. Wang, W. Ouyang, X. Wang, and H. Lu (2015) Visual tracking with fully convolutional networks. In ICCV.
  • [32] Q. Wang, Z. Teng, J. Xing, J. Gao, and S. Maybank (2018) Learning attentions: residual attentional siamese network for high performance online visual tracking. In CVPR.
  • [33] Q. Wang, L. Zhang, L. Bertinetto, W. Hu, and P. H. S. Torr (2019) Fast online object tracking and segmentation: a unifying approach. In CVPR.
  • [34] Y. Wu, J. Lim, and M. Yang (2013) Online object tracking: a benchmark. In CVPR.
  • [35] Y. Wu, J. Lim, and M. Yang (2015) Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(9), pp. 1834–1848.
  • [36] H. Xu, Y. Gao, F. Yu, and T. Darrell (2017) End-to-end learning of driving models from large-scale video datasets. In CVPR.
  • [37] Z. Zhang, H. Peng, and Q. Wang (2019) Deeper and wider siamese networks for real-time visual tracking. In CVPR.
  • [38] G. Zhu, F. Porikli, and H. Li (2016) Beyond local search: tracking objects everywhere with instance-specific proposals. In CVPR.
  • [39] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu (2018) Distractor-aware siamese networks for visual object tracking. In ECCV.