SPSTracker: Sub-Peak Suppression of Response Map for Robust Object Tracking (AAAI 2020)
Modern visual trackers usually construct online learning models under the assumption that the feature response has a Gaussian distribution with a target-centered peak. Nevertheless, this assumption is implausible when there is progressive interference from other targets and/or background noise, which produces sub-peaks on the tracking response map and causes model drift. In this paper, we propose a rectified online learning approach for sub-peak response suppression and peak response enforcement, targeting the progressive interference in a systematic way. Our approach, referred to as SPSTracker, applies a simple-yet-efficient Peak Response Pooling (PRP) to aggregate and align discriminative features, and leverages a Boundary Response Truncation (BRT) to reduce the variance of the feature response. By fusing multi-scale features, SPSTracker aggregates the response distribution of multiple sub-peaks into a single maximum peak, which enforces the discriminative capability of features for robust object tracking. Experiments on the OTB, NFS and VOT2018 benchmarks demonstrate that SPSTracker outperforms state-of-the-art real-time trackers by significant margins.
In the past few years, deep convolutional neural networks (CNNs) have significantly improved the performance of visual object tracking by providing frameworks for end-to-end representation learning [25, 6], online correlation filter learning [9, 7], and discriminative classifier learning [3, 28, 27, 26, 12]. However, CNN-based trackers suffer performance degradation caused by multi-target occlusion, appearance variation and/or background noise. In particular, during the tracking procedure, neighboring targets and background noise can introduce progressive interference and result in model drift, as shown in Fig. 1 (top), particularly when objects have scale variations and complex motions.
The Early-Stopping Tracker incorporates object representation and decision-making policies via reinforcement learning. Nevertheless, these approaches usually rely on additional data for offline training and lack the adaptability to track in complex conditions.
To address this issue, the dynamic Siamese network uses a fast transformation learning method to model target appearance variation and handles background suppression from previous frames. The ATOM tracker combines offline pre-training with online learning in a multi-task learning framework by incorporating the objectives of target localization and target-background classification. While incorporating high-level knowledge into target estimation through extensive offline learning, these methods still overlook the progressive interference from the context area and do not handle it in a systematic manner. How to directly model the interference and regularize the tracking response distribution remains an open problem.
In this paper, we propose a simple-yet-effective approach, referred to as SPSTracker, for robust object tracking. Our motivation is based on the observation that most tracking failures are caused by interference around the target. Such interference produces a multi-peak tracking response, and a sub-peak may progressively “grow” and eventually cause model drift. Therefore, we propose suppressing the sub-peaks to aggregate a single-peak response, with the aim of preventing model drift from the perspective of tracking response regularization.
Specifically, we introduce a Peak Response Pooling (PRP) module, which concentrates the maximum values of tracking response into the geometric centers of targets, as shown in Fig. 2. The pooling procedure is implemented by an efficient maximization and substitution operation on the tracking response maps. During the network forward procedure, PRP aggregates multiple sub-peaks into a single centered peak for target tracking. During the backward propagation procedure, the response map with a single peak guides the online learning (fine-tuning) to explore discriminative features.
Based on PRP, we further propose the Boundary Response Truncation (BRT) operation to clip the response map, simply setting the values of pixels far away from the peak response to zero. This operation reduces the variance of the feature response map while further aggregating the single-peak response. If the response map is approximated as a Gaussian distribution, PRP aims to aggregate the mean while BRT reduces the variance. PRP together with BRT facilitates the model to learn an enforced single-peak response map for robust object tracking.
SPSTracker is built upon the CNN framework with a target classification branch and a target localization branch atop the convolutional layers. The classification network, equipped with the PRP and BRT modules, identifies the coarse locations (bounding boxes). These coarse locations are further fed to the target localization branch to estimate the precise target location.
The main contributions of this work can be summarized as follows:
A Sub-Peak Suppression tracker (SPSTracker) is presented to reduce the risk of model drift by suppressing potential sub-peaks online while aggregating the maximum peak response.
A simple-yet-efficient Peak Response Pooling (PRP) module is proposed to aggregate and align discriminative features, and a Boundary Response Truncation (BRT) module is designed to reduce the variance of feature response.
Our proposed tracker achieves leading performance on six benchmarks, including OTB2013, OTB2015, OTB50, VOT2016, VOT2018 and NFS. In particular, we improve on the state of the art on the VOT2016 and VOT2018 benchmarks by significant margins.
The research on visual object tracking has a long history. Modern object trackers have usually been constructed on three kinds of methods: correlation filtering, online classification, and metric learning. With the rise of deep neural networks, these methods have been integrated with feature learning in an end-to-end manner.
Correlation Filters. The filtering procedure matches templates with a Gaussian-shaped output to track targets under various appearance variations. The key to their success is the ability to efficiently exploit available negative data by including all shifted versions of a training sample. By introducing CNNs, the representative capacity of correlation filters has been greatly improved. DeepSRDCF fed features from a pre-trained CNN to a correlation filter and introduced spatial regularization on the basis of KCF, mitigating the boundary effect. CCOT and ECO further improved accuracy and efficiency with continuous-domain convolution operators and factorized convolutions, respectively.
Online Classification. Tracking can also be formulated as an online classification problem. DeepTrack  leveraged a sample selection mechanism and a lazy updating scheme to learn online classifiers. FCNT  utilized hierarchical convolutional features to construct a network which handles various interference. CNN-SVM 
used the pre-trained deep convolutional network to extract features of the target, and then used SVM to perform online target-background classification. These methods fully utilized the representation capabilities of deep learning features and the discriminative capacity of online classifiers. However, they often overlook the problem of accurate target state estimation.
Metric Learning. To facilitate state estimation, the tracking problem was also formulated in a metric (similarity) learning framework. Classification and state estimation were integrated into a Siamese network that measures the similarity between the target and candidates for tracking. Semantic and appearance branches were constructed in the dual Siamese network, and saliency mechanisms were introduced in the attention-based Siamese network. SiamRPN [23, 22] combined the Siamese network with a region proposal network (RPN) to allow trackers to estimate the target extent accurately. SiamMask involved a unified framework for visual object tracking (VOT) and video object segmentation (VOS): a tracker is trained offline and relies on the position specified in the first frame for semi-supervised learning, achieving both target tracking and mask estimation.
Despite their efficiency, Siamese trackers are less robust to interference from the background due to the lack of online adaptation. ATOM alleviated this issue by learning a classifier online over a large number of samples. Nevertheless, with multiple sampled features, the target response map could have multiple sub-peaks, which increases the risk of model drift, particularly when there is interference from target appearance variation and/or background noise.
We propose the Sub-Peak Response Suppression tracker (SPSTracker), with Peak Response Pooling (PRP) and Boundary Response Truncation (BRT) modules, to aggregate the multiple sub-peaks on a tracking response map into a single enforced peak, as shown in Fig. 2. SPSTracker is built upon the ATOM tracker , with a target classification branch and a target localization branch. The classification branch converts the feature map into a response map and provides the coarse locations of the target. The localization branch uses the bounding-box regression to localize targets. Upon the classification branch, the PRP and BRT modules are applied in a plug-and-play manner, as shown in Fig. 3.
The classification branch is an online CNN structure, which learns by minimizing the discrepancy between the tracking response and a Gaussian prior. Denote the feature map of the current video frame (the test image) extracted by the CNN as $x$. The classification branch is a 2-layer fully convolutional network parameterized with $w = \{w_1, w_2\}$, which predicts the tracking response map $m$ as

$m = f(x; w) = \phi_2(w_2 \ast \phi_1(w_1 \ast x)), \quad (1)$

where $w_1$ and $w_2$ denote the parameters of the first and second convolutional layers, respectively, $\ast$ denotes standard multi-channel convolution, and $\phi_1$ and $\phi_2$ are the activation functions.
During object tracking, the parameters of the classification branch are updated by minimizing the following objective function:

$L(w) = \sum_{j} \gamma_j \, \| f(x_j; w) - y_j \|^2 + \lambda \|w\|^2, \quad (2)$

where $j$ denotes the index of training samples, $x_j$ denotes the features of the $j$-th sample, $y_j$ is set to a sampled Gaussian prior centered at the target location, as shown in Fig. 2 (Gaussian prior), $\gamma_j$ denotes the weight of the corresponding training sample, and $\lambda$ is a parameter to trade off the contributions of the two terms.
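As a concrete illustration of this objective, the sketch below builds a Gaussian label map and evaluates the weighted squared-error loss plus the L2 penalty in pure Python. The function names and the 2D-list representation are ours for illustration; the actual tracker optimizes this objective over convolutional weights with conjugate gradient.

```python
import math

def gaussian_label(h, w, cy, cx, sigma=2.0):
    """Gaussian prior y centered at the target location (cy, cx)."""
    return [[math.exp(-((i - cy) ** 2 + (j - cx) ** 2) / (2 * sigma ** 2))
             for j in range(w)] for i in range(h)]

def classification_loss(responses, labels, weights, w_norm_sq, lam=0.01):
    """L(w) = sum_j gamma_j * ||m_j - y_j||^2 + lambda * ||w||^2."""
    loss = lam * w_norm_sq  # regularization term lambda * ||w||^2
    for m, y, g in zip(responses, labels, weights):
        # weighted squared error between response map and Gaussian label
        loss += g * sum((m[i][j] - y[i][j]) ** 2
                        for i in range(len(m)) for j in range(len(m[0])))
    return loss
```

A perfectly predicted response (response equals its label) leaves only the regularization term, which is the behavior the online solver drives toward.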
By optimizing Eq. (2) with a conjugate gradient descent method, the model predicts the target response map, as shown in Fig. 2. Because the response map is a weighted sum of response maps from multi-scale samples, it tends to exhibit a multi-peak distribution. This makes the maximum response inconsistent with the target geometric center, increasing the classification error and the risk of model drift.
To prevent a sub-peak from “growing” into the main peak and causing model drift, we propose the Sub-Peak Response Suppression method. Specifically, we directly operate on the target response map predicted by $f(x; w)$ and reformulate Eq. (2) as

$L(w) = \sum_{j} \gamma_j \, \| P(f(\hat{x}_j; w)) - y_j \|^2 + \lambda \|w\|^2, \quad (3)$

where $P(\cdot)$ denotes the Peak Response Pooling applied on each sampled response map, $\hat{x}_j$ denotes the feature after the Boundary Response Truncation (BRT) operation, which decreases the variance of the target response and reduces the boundary effect, and $\gamma_j$ denotes the weight for the $j$-th sample. By using PRP and BRT, we can apply feature fusion to aggregate the response maps from multiple samples into the target response map, while guaranteeing that the response map has a single peak centered at the target.
By minimizing the objective function of Eq. (2), we force the response map to approximate the Gaussian prior $y$. However, for targets under partial occlusion or background noise, the response could be acentric and unlikely to follow a Gaussian distribution. The PRP and BRT operations in Eq. (3) make the response close to the Gaussian prior, and eventually facilitate the online learning procedure.
We propose a Peak Response Pooling (PRP) module, which concentrates the maximum values of the tracking response map at the target geometric center. On the target response map output by the classification branch, horizontal PRP is first performed to produce a horizontal pooling map: the maximum response in each row is found and assigned to all pixels in that row. In a similar way, vertical PRP is performed on each column of the response map to obtain the vertical pooling map. The element value of the response map after the PRP operation is calculated as

$\tilde{m}(i, j) = \max_{1 \le q \le W} m(i, q) + \max_{1 \le p \le H} m(p, j),$

where $m(i, j)$ denotes the original response value at the $i$-th row and $j$-th column, and $H$ and $W$ are the height and width of the response map. The horizontal and vertical pooling maps are summed to obtain the rectified response map, which tends to aggregate large response values at the target geometric center. After multiple learning iterations, the target response is concentrated to approximate a 2D Gaussian distribution, which fits the Gaussian prior for robust object tracking, as shown in Fig. 2.
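The row- and column-wise maximization can be sketched in a few lines of pure Python; this is an illustrative version with names of our choosing, not the authors' released implementation.

```python
def peak_response_pooling(response):
    """Sum of horizontal and vertical pooling maps.

    Each row contributes its row-maximum (horizontal PRP) and each
    column its column-maximum (vertical PRP); summing the two maps
    pushes the largest value toward the intersection of the strongest
    row and the strongest column.
    """
    h, w = len(response), len(response[0])
    row_max = [max(row) for row in response]                  # horizontal PRP
    col_max = [max(response[i][j] for i in range(h))          # vertical PRP
               for j in range(w)]
    return [[row_max[i] + col_max[j] for j in range(w)]
            for i in range(h)]

# Toy map with a main peak (0.8) and a sub-peak (0.6): after PRP the
# single largest value sits at the main peak's row/column intersection.
m = [[0.1, 0.2, 0.1],
     [0.2, 0.8, 0.3],
     [0.1, 0.6, 0.2]]
pooled = peak_response_pooling(m)
```

Note how the sub-peak at position (2, 1) is dominated after pooling, since its row maximum (0.6) is smaller than the main peak's (0.8).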
The Peak Response Pooling (PRP) is inspired by the center/corner pooling [8, 21] developed for object detection. However, PRP differs from center/corner pooling in two aspects: 1) PRP aims to aggregate the response map into a single-peak distribution so that the Gaussian prior can be well fitted, whereas center/corner pooling aligns features to handle the appearance variation of objects; 2) PRP leverages more efficient row- and column-wise maximization operations to aggregate large responses at target centers, while center/corner pooling uses comparison and substitution operations.
When recognizing objects and determining their boundaries, the human visual system does not align objects with fixed data points but uses the fovea in the eyeball, which concentrates the peak response on central regions for object localization. This concentration procedure inspires us to develop the BRT module for object tracking.
During tracking, pixels within the target extent but far away from the target center could have ambiguous features (either background or foreground). The PRP module concentrates the target response at the target center but does not consider the variance of the target response. In complex scenes, the response map could have large variance due to significant responses at the target boundary, which is called the boundary effect. Considering that a single-peak response map with small variance could alleviate the boundary effect and improve tracking robustness, we further introduce the Boundary Response Truncation (BRT) operation.
As shown in Fig. 2, BRT is a simple clip operation, which sets the pixels far away from the peak response to zero. This operation discards the response at the target boundary and reduces the variance of the response map. With BRT, we may miss some informative target response. However, it is experimentally validated that clipping 10% of the response map loses only 4% of the foreground information but 12% of the background information, i.e., BRT removes more ambiguous response while enhancing the classification ability of the tracker.
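A minimal sketch of the clip operation, assuming the peak lies near the map center so that truncation reduces to zeroing a fixed border margin; the 10% ratio follows the clipping ratio discussed above, and the function name is ours.

```python
def boundary_response_truncation(response, ratio=0.1):
    """Zero out response values within `ratio` of the map border.

    Keeps only the central region of the response map, discarding
    boundary responses and thereby reducing the map's variance.
    """
    h, w = len(response), len(response[0])
    mh, mw = int(h * ratio), int(w * ratio)  # border margin in pixels
    out = [[0.0] * w for _ in range(h)]
    for i in range(mh, h - mh):
        for j in range(mw, w - mw):
            out[i][j] = response[i][j]
    return out
```

On a 10x10 map with `ratio=0.1`, a one-pixel border is discarded and the central 8x8 region is kept.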
SPSTracker is built upon the state-of-the-art ATOM tracker, with a target classification branch and a target localization branch. The classification branch produces coarse region proposals by evaluating the target response map. The target localization branch fine-tunes the network parameters to fit the reference target box with multiple region proposals. Upon the classification branch, the PRP and BRT modules are applied in a plug-and-play manner, as shown in Fig. 3.
Note that the ATOM tracker uses only the last convolutional layer of ResNet-18 (Block4) as the feature representation. Shallow convolutional features are more important for extracting low-level information such as color and edges, while deep convolutional features are rich in high-level semantics. The fusion of multi-scale (shallow and deep) features enhances the representation capability, but it produces sub-peaks on response maps and deteriorates the tracking performance.
By introducing the PRP and BRT modules, the multi-scale features can be well integrated for target representation and tracking. As shown in Fig. 4, the multiple sub-peaks produced by multi-scale features can be concentrated into a maximum peak, which bridges the gap between the predicted response and the Gaussian prior and facilitates robust tracking.
In this section, we first describe the implementation details of SPSTracker. We then present an ablation study to validate the PRP and BRT modules proposed in this paper. Finally, we evaluate SPSTracker on commonly used benchmarks and compare it with state-of-the-art trackers. All experiments are carried out with PyTorch on an Intel i5-8600K 3.4GHz CPU and a single Nvidia GTX 1080 Ti GPU with 11GB memory.
We use ResNet-18 pre-trained on ImageNet as the backbone network. The Block3 and Block4 features extracted from the test image are first passed through two Conv layers. Regions defined by the input bounding boxes are then pooled to a fixed size using pooling layers. The pooled features are modulated by channel-wise multiplication with the coefficient vector returned by the reference branch. The features are then passed through fully-connected layers to predict the Intersection over Union (IoU). All Conv and FC layers are followed by BatchNorm and ReLU. The target response map is obtained by fusing the responses obtained from ResNet's Block3 and Block4.
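The channel-wise modulation step can be sketched as follows: pooled features from the test branch are multiplied channel-by-channel with a coefficient vector computed by the reference branch. The nested-list representation and the function name are illustrative, not taken from the released code.

```python
def modulate(pooled, coeff):
    """Channel-wise modulation of a pooled feature map.

    pooled: feature tensor as nested lists with shape [C][H][W].
    coeff:  length-C coefficient vector from the reference branch.
    Each channel c is scaled elementwise by coeff[c].
    """
    return [[[coeff[c] * v for v in row] for row in pooled[c]]
            for c in range(len(pooled))]

# Two channels (C=2), each 1x2: the reference coefficients amplify
# channel 0 and attenuate channel 1.
features = [[[1.0, 2.0]], [[3.0, 4.0]]]
modulated = modulate(features, [2.0, 0.5])
```

This kind of modulation lets the reference (first-frame) appearance steer which channels of the test-frame features contribute to the IoU prediction.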
For the proposed PRP and BRT modules, we perform ablation analysis to investigate their impact on the tracking performance. We also analyze the impact of the multi-scale features in SPSTracker. All the ablation studies are carried out on the VOT2018  benchmark.
Peak Response Pooling (PRP). From the results in Table 1, we can see that introducing the PRP module into the classification branch significantly improves the tracking performance. Specifically, it improves the expected average overlap (EAO) by 0.019 (from 0.401 to 0.420), which is a significant margin considering the strong ATOM baseline. It also improves the tracking accuracy and robustness, as indicated by the last two rows of Table 1.
Boundary Response Truncation (BRT). The BRT module improves the EAO value by ( to ), as reported in Table 1, which is also a significant margin. This validates that the truncation operation is able to reduce response variance and benefit online classifier learning by filtering out ambiguous samples.
We test the truncation size and validate that the best performance is obtained when clipping width/height of the response map. For all the experiments, we clip a height/width of the response map.
Multi-scale Feature Fusion. By using multi-scale feature fusion, we improve the EAO value by . Combining feature fusion with the PRP and BRT modules, we improve EAO by ( vs. ), as reported in Table 1. The significant performance gain demonstrates that the proposed PRP and BRT modules facilitate the fusion of multi-scale features and reduce the negative effect of the multiple sub-peaks brought by feature fusion.
Sub-peak suppression. In Fig. 4, we compare the target response maps of the ATOM tracker and SPSTracker. It can be seen that SPSTracker suppresses multiple sub-peaks and produces a response map with a single peak centered at the target. The peak response fits the Gaussian prior distribution well.
Tracking speed. With a single GPU, the proposed SPSTracker achieves a tracking speed of fps. Compared with the speed ( fps) of the baseline ATOM, SPSTracker achieves significant performance gains with negligible computational overhead.
OTB. The object tracking benchmarks (OTB) [34, 35] consist of three datasets, namely OTB-2013, OTB-50 and OTB-100, which contain 51, 50 and 100 fully annotated videos, respectively. OTB-100 includes OTB-2013 and OTB-50. All sequences are annotated with 11 typical tracking interference attributes.
Two evaluation metrics, success rate and precision, are used on OTB. The precision plot shows the percentage of frames whose tracking result lies within a given center-distance threshold of the ground truth. The success plot shows the ratio of successful frames as the overlap threshold varies from 0 to 1, where a frame is successful if its overlap exceeds the threshold. The area under the curve (AUC) of each success plot is used to rank the tracking methods.
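The two OTB metrics can be sketched as below, assuming per-frame IoU overlaps and center-location errors (in pixels) are already computed; the 21-point threshold sampling and the 20-pixel precision threshold follow common OTB convention, and the function names are ours.

```python
def success_rate(overlaps, threshold):
    """Fraction of frames whose overlap exceeds the threshold."""
    return sum(o > threshold for o in overlaps) / len(overlaps)

def success_auc(overlaps, steps=21):
    """Area under the success plot, sampled at `steps` thresholds
    evenly spaced in [0, 1]."""
    thresholds = [i / (steps - 1) for i in range(steps)]
    return sum(success_rate(overlaps, t) for t in thresholds) / steps

def precision_at(errors, threshold=20.0):
    """Fraction of frames whose center-location error is within
    `threshold` pixels of the ground truth."""
    return sum(e <= threshold for e in errors) / len(errors)
```

For example, a tracker with overlaps [0.6, 0.4] has a success rate of 0.5 at the 0.5 overlap threshold.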
By using the success rate and precision plots in the one-pass evaluation (OPE) as the evaluation metrics, we compare SPSTracker with state-of-the-art trackers including ATOM, DaSiamRPN, ECO-HC, SiamRPN, CF2, CNN-SVM, SRDCF and Staple. As shown in Fig. 5, the proposed SPSTracker achieves the best performance on the three benchmarks, obtaining , and AUC scores on OTB-2015, OTB-2013 and OTB-50, respectively. Compared with ATOM, SPSTracker improves by , and , respectively.
VOT2016 and VOT2018. From the visual object tracking (VOT) benchmarks, we select VOT2016 and VOT2018 to evaluate the trackers. VOT2016 contains 60 challenging videos, while VOT2018 includes 10 more challenging sequences. Whenever the tracking bounding box drifts away from the ground truth, the tracker is re-initialized after five frames. Trackers are evaluated by the EAO metric, which combines the empirically estimated average overlap with the typical sequence length distribution. In addition, accuracy (average overlap) and robustness (average number of failures) are used for evaluation as well.
SPSTracker is compared with 10 state-of-the-art trackers on VOT2016, as shown in Fig. 6. SPSTracker achieves the leading performance and significantly outperforms the other trackers. Table 2 reports the details of the comparison with SiamMask, DWSiam, CCOT, TCNN, SSAT, MLDF, Staple, DDC, EBT and SRBT. The EAO score of the proposed SPSTracker is , which is significantly higher than those of the peer trackers.
SPSTracker achieves an EAO score of , which is significantly better than that of SiamRPN++ , ATOM and other state-of-the-art trackers. Particularly, it outperforms the state-of-the-art SiamRPN++ by , ATOM by and SiamMask by , which are significant margins for object tracking on the challenging benchmark.
NFS. The Need for Speed (NFS) dataset consists of 100 videos (380K frames). All frames are annotated with axis-aligned bounding boxes, and all sequences are manually labeled with nine visual attributes, such as occlusion, fast motion and background clutter. We evaluate the trackers on the 30 FPS version of the NFS dataset. Table 4 reports the AUC scores of the compared trackers. SPSTracker slightly outperforms the baseline ATOM tracker, while significantly outperforming the other state-of-the-art tracking methods.
Fig. 8 shows tracking examples on the OTB benchmark, from which we can see that SPSTracker correctly localizes the targets under serious interference from the foreground and background. In contrast, the other trackers exhibit failure cases.
Visual tracking has been extensively investigated in the past few years. Nevertheless, the problem of how to model interference from multiple targets, appearance variation and/or background noise remains unsolved. In this paper, we proposed modeling the interference from the perspective of peak distribution and designed a rectified online learning approach for sub-peak response suppression and peak response enforcement. We proposed the plug-and-play Peak Response Pooling (PRP) to aggregate and align discriminative features, and designed the Boundary Response Truncation (BRT) to reduce the variance of the feature response. Based on PRP and BRT, we integrated multi-scale features in SPSTracker to learn discriminative features for robust object tracking. SPSTracker achieved new state-of-the-art performance on six widely used benchmarks, which verifies the effectiveness of the proposed peak response modeling approach.