RPT++: Customized Feature Representation for Siamese Visual Tracking

10/23/2021
by   Ziang Ma, et al.

While recent years have witnessed remarkable progress in the feature representation of visual tracking, the problem of feature misalignment between the classification and regression tasks is largely overlooked. Most advanced trackers extract features for these two tasks in exactly the same way. We argue that this limits the performance gain of visual tracking, since features extracted from the salient area provide more recognizable visual patterns for classification, while those around the boundaries contribute to accurately estimating the target state. We address this problem by proposing two customized feature extractors, named polar pooling and extreme pooling, to capture task-specific visual patterns. Polar pooling enriches the information collected from the semantic keypoints for stronger classification, while extreme pooling provides explicit visual patterns of the object boundary for accurate target state estimation. We demonstrate the effectiveness of the task-specific feature representation by integrating it into the recent and advanced tracker RPT. Extensive experiments on several benchmarks show that our Customized Features based RPT (RPT++) achieves new state-of-the-art performance on OTB-100, VOT2018, VOT2019, GOT-10k, TrackingNet and LaSOT.


Introduction

Generic object tracking is a challenging task in computer vision. It aims at predicting the target state over a video sequence. The tracking task can be formulated as a classification problem and a regression problem: target classification aims at differentiating the target object from the background, while target state estimation aims at accurately describing the target, typically with a bounding box.

Figure 1: Visualization of the semantic and extreme-enhanced feature maps. The estimated semantic keypoints (yellow circles) help to capture recognizable visual patterns for classification, while the estimated extreme keypoints (green circles) help to encode explicit spatial extent for regression.

Recently, Siamese networks have significantly advanced the state-of-the-art in the field of visual tracking Bertinetto et al. (2016); Li et al. (2018, 2019); Xu et al. (2020); Chen et al. (2020); Yu et al. (2020); Zhang and Peng (2019). These Siamese trackers produce a similarity map from the cross-correlation of the target template and the search region. Locations on the similarity map and a set of predefined anchors are taken as the target candidates for anchor-free Xu et al. (2020); Chen et al. (2020); Zhang and Peng (2019) and anchor-based methods Li et al. (2018, 2019); Yu et al. (2020), respectively. For convenience, both kinds of methods represent each target candidate with the point feature sampled at the corresponding location of the similarity map. However, the performance of such a point-based feature representation is restricted by its limited ability to convey explicit semantic and border information Qiu et al. (2020).

Meanwhile, features extracted from the target bounding box have attracted much attention for their strong ability to represent the whole object. These methods improve the feature representation of visual tracking by feeding the backbone features through a PrRoI Pooling Jiang et al. (2018) or RoI Align He et al. (2017) module. However, the rectangular bounding box may contain redundant background pixels, and lacks the capability to model the object's geometric structure.

To address these issues, RPT Ma et al. (2020) proposes to estimate the target state as a representative point set, so that features on semantically significant and boundary areas are automatically learned via deformable convolution Dai et al. (2017). This keypoint-based feature representation contains more recognizable information than single point features and contributes to truly understanding the visual patterns of objects. However, RPT often suffers from incorrect keypoints, mainly caused by salient areas in the background, which considerably restricts the effectiveness of the extracted keypoint features. Besides, the features utilized for object classification and target state estimation are extracted from the same point set, ignoring the misalignment between these two tasks Song et al. (2020). As illustrated in Fig. 1, features extracted from the semantic keypoints provide more recognizable visual patterns for object classification, while those around the boundaries encode explicit prior knowledge about the spatial extent, which contributes to accurately estimating the target state. This task misalignment in feature representation greatly limits the performance gain of visual tracking.

In this paper, we propose to equip the baseline RPT with two customized feature extractors to obtain explicit salient and extreme information from the corresponding keypoints. The first feature extractor, named polar pooling, helps to capture more accurate visual patterns within the target region by taking the maximum response along the radial direction from the center to each semantic keypoint. The second feature extractor, named extreme pooling, helps to eliminate ambiguities in the extreme point estimation by predicting the localization confidence along with the point location via an additional uncertainty branch He et al. (2019); Lee et al. (2020). The enhanced features extracted from the semantic and extreme keypoints are then fed into the classification and regression tasks, respectively.

Our main contributions are summarized as follows:

1) We build our tracking framework upon a keypoints based feature representation, and reveal the feature misalignment between the classification and regression tasks.

2) We propose two customized feature extractors, named polar pooling and extreme pooling to capture accurate and task-aware visual patterns. By explicitly extracting the semantic and extreme information from the corresponding keypoints, we attain a significant performance gain in both classification and regression.

3) We validate the effectiveness of our method on various challenging datasets: OTB-100 Wu et al. (2015), VOT2018 Kristan et al. (2018)& VOT2019 Kristan et al. (2019), GOT-10k Huang et al. (2019), TrackingNet Müller et al. (2018) and LaSOT Fan et al. (2019). Compared with the most recent and advanced trackers, our method consistently attains new state-of-the-art performances.

Figure 2: The RPT++ framework with customized feature representation. We first predict a coarse keypoint set for each location on the correlation feature map. The coarse extreme keypoints are then fed into an extreme pooling module facilitating explicit border information for accurate box regression, while the coarse semantic keypoints are fed into a polar pooling module obtaining more accurate visual patterns for object classification.

Related Works

Feature Representation of Visual Tracking. Siamese trackers have attracted much attention from the visual tracking community due to their balanced accuracy and efficiency Bertinetto et al. (2016); Li et al. (2018, 2019); Xu et al. (2020); Chen et al. (2020); Yu et al. (2020); Zhang and Peng (2019). Features extracted at each location on the correlation feature map are directly utilized for both object classification and localization. However, it is difficult for such point features to convey sufficient visual patterns for the whole target region. To obtain a more powerful feature representation, some works attempt to capture richer information from the target bounding box or a keypoint set. ATOM Danelljan et al. (2019), DiMP Bhat et al. (2019) and PrDiMP Danelljan et al. (2020) exploit a PrRoI Pooling module Jiang et al. (2018) to extract more representative features from the bounding box estimates. SATIN Gao et al. (2020) formulates visual tracking as a keypoint detection task: it detects any target object as a triplet of keypoints consisting of a centroid point and two corner points, and employs center pooling and cascade corner pooling Duan et al. (2019) to enrich the center and corner information, respectively. RPT Ma et al. (2020) proposes a finer representation of the tracking target as a point set, and extracts the semantic and extreme keypoint features using deformable convolution. However, these whole-object based features are easily affected by salient areas in the background and often suffer from incorrect keypoints. In this work, we enhance the keypoint features with two customized feature extractors, attaining more accurate and explicit visual patterns from the estimated keypoints.

Misalignment between Classification and Localization. Most advanced detectors and trackers utilize a shared head for classification and regression, ignoring the misalignment between the two tasks. IoU-Net Jiang et al. (2018) first reveals the conflict between the classification score and the localization accuracy, and learns to predict the localization confidence for the detected bounding box. Double-Head R-CNN Wu et al. (2020) further disentangles the shared head by elaborately designing a fully connected head focusing on classification and a convolution head for regression. TSD Song et al. (2020) verifies that the spatial misalignment between the two tasks greatly hinders the performance gain of object detection, and resolves it via task-aware proposal estimation. In this work, we reveal the feature misalignment between these two tasks: features extracted from salient areas are more suitable for object classification, while those around the boundaries are more suitable for the regression task.

Our Approach

Baseline

We build our framework upon a keypoint-based tracker named RPT Ma et al. (2020). Features extracted from a target region and a search region are cross-correlated in a depth-wise manner. Each location on the correlation feature map is directly viewed as a target candidate, which is formulated as a set of adaptive representative points:

\mathcal{R} = \{ (x_k, y_k) \}_{k=1}^{n},    (1)

where n denotes the number of sample points and is set to 9 by default. RPT refines the distribution of sample points via two successive steps:

\mathcal{R}_c = \{ (x_k + \Delta x_k,\ y_k + \Delta y_k) \}_{k=1}^{n}, \qquad \mathcal{R}_r = \{ (x_k^c + \Delta x_k^r,\ y_k^c + \Delta y_k^r) \}_{k=1}^{n}.    (2)

Here the coarse representative point set \mathcal{R}_c is first obtained by regressing offsets \{(\Delta x_k, \Delta y_k)\} over the original center position, and is then refined with an extra set of offsets \{(\Delta x_k^r, \Delta y_k^r)\}, where (x_k^c, y_k^c) \in \mathcal{R}_c, to represent the final target state \mathcal{R}_r.

RPT obtains a more powerful feature representation via deformable convolutions, which are capable of modeling the geometric transformations of objects. The extracted features are then utilized for both object classification and target state estimation. During training, both the kernels of the deformable convolutions and the predicted offsets are learned via a multi-task loss for simultaneous classification and regression. Driven by these two supervision sources, the extreme and semantic keypoints of the tracking target are learned automatically. Besides, RPT exploits an online-trained classifier to improve the discriminative capability when handling distractors. Features extracted from hierarchical convolution layers are further employed for multi-level prediction.
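For illustration, the following is a minimal PyTorch sketch (not the released RPT code) of how per-location point offsets can drive keypoint feature extraction through a deformable convolution; it assumes a 3x3 kernel so that the n = 9 sample points map onto the 18 offset channels expected by torchvision.ops.deform_conv2d, and the module and tensor names are ours.

```python
# Sketch: keypoint-conditioned feature extraction via deformable convolution.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class KeypointFeatureExtractor(nn.Module):
    def __init__(self, channels=256, n_points=9):
        super().__init__()
        assert n_points == 9, "a 3x3 deformable kernel assumes 9 sample points"
        # Predicts 2D offsets (dx, dy) for each of the 9 sample points per location.
        self.offset_head = nn.Conv2d(channels, 2 * n_points, 3, padding=1)
        # Weights of the deformable convolution that aggregates keypoint features.
        self.weight = nn.Parameter(torch.randn(channels, channels, 3, 3) * 0.01)

    def forward(self, corr_feat):
        # corr_feat: (B, C, H, W) depth-wise correlation feature map.
        offsets = self.offset_head(corr_feat)                 # (B, 18, H, W)
        feat = deform_conv2d(corr_feat, offsets, self.weight, padding=1)
        return feat, offsets                                  # keypoint features + coarse offsets

# Usage: feat, offsets = KeypointFeatureExtractor()(torch.randn(1, 256, 25, 25))
```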

In contrast to the point-based and region-based feature representations, RPT extracts keypoint features that convey semantically prominent information and the object's geometric structure. However, the features utilized for object classification and target state estimation are extracted from the same point set, ignoring the feature misalignment between them, which considerably hinders the performance gain of visual tracking. To tackle this issue, we propose a new approach coined RPT++, which extracts customized and efficient features for stronger classification and more accurate regression, respectively.

Framework of RPT++

Fig. 2 shows the overall network architecture. We first predict a coarse keypoint set for every location on the correlation feature map. It consists of four extreme keypoints (topmost, leftmost, bottommost and rightmost) and five semantic keypoints. In order to utilize the bounding box annotations for supervision, we convert the extreme keypoints to a coarse pseudo box as

B = (x_0 - d_l,\ y_0 - d_t,\ x_0 + d_r,\ y_0 + d_b),    (3)

where (x_0, y_0) represents the original point of regression, usually assigned the corresponding location on the correlation feature map, and (d_l, d_t, d_r, d_b) denote the pseudo box offsets corresponding to the leftmost, topmost, rightmost and bottommost extreme keypoints, respectively. The regression loss is designed to indicate the overlap between the induced pseudo box and the annotation box Yu et al. (2016).
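A small sketch of Eq. (3) is given below; it converts per-location offsets into corner-form pseudo boxes, with tensor names chosen for illustration rather than taken from the original implementation.

```python
# Sketch of Eq. (3): pseudo box from the original point and four extreme offsets.
import torch

def offsets_to_pseudo_box(points, offsets):
    """points:  (N, 2) original regression points (x0, y0).
    offsets: (N, 4) predicted distances (d_l, d_t, d_r, d_b) to the leftmost,
             topmost, rightmost and bottommost extreme keypoints."""
    x0, y0 = points[:, 0], points[:, 1]
    d_l, d_t, d_r, d_b = offsets.unbind(dim=1)
    # Corner representation (x1, y1, x2, y2) of the coarse pseudo box.
    return torch.stack([x0 - d_l, y0 - d_t, x0 + d_r, y0 + d_b], dim=1)
```

The resulting corner-form boxes can be fed directly into an IoU-style regression loss against the annotation box.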

Different from RPT, we propose to estimate the localization uncertainties along with the pseudo box offsets. Inspired by He et al. (2019); Lee et al. (2020), an uncertainty branch is added to predict a Gaussian distribution instead of only the relative offset as

P(d) = \mathcal{N}(d;\ \mu, \sigma^2),    (4)

where the mean value \mu \in \{\mu_l, \mu_t, \mu_r, \mu_b\} and the standard deviation \sigma represent the predicted offset and its uncertainty, respectively. For each location on the correlation feature map, the regression branch outputs n 2D offsets to refine the distribution of sample points in a point-wise manner, while the uncertainty branch outputs four uncertainties of the pseudo box offsets.

Based on the observations in Fig. 1, the coarse extreme and semantic keypoints encode different visual patterns of the target object. In this work, we propose two customized feature extractors, named polar pooling and extreme pooling, to enrich the semantic and extreme information, respectively. Polar pooling conveys more accurate visual patterns for object classification via an additional look inside the target region. Extreme pooling takes the uncertainties of the coarse extreme keypoints into consideration, which contributes to accurately locating the boundaries of the tracking target. Features extracted from these two modules are exploited to predict the classification score and the object's spatial extent, respectively. The detailed processes of these two customized feature extractors are discussed in the next section.

Customized Feature Extraction

Extreme Pooling. The extreme points of the tracking target are inherently ambiguous in some cases (e.g., any point along the top border of a vehicle might be viewed as an extreme point). These ambiguities make it hard to extract efficient extreme features, which directly limits the upper bound of localization accuracy. We thus propose a customized feature extractor, named extreme pooling, to enhance the feature representation for accurate target state estimation.

Taking the estimated extreme keypoint (x_e, y_e) and its uncertainty \sigma as input, extreme pooling first crops an uncertainty region from the correlation feature maps with C channels. Let (x_1, y_1) and (x_2, y_2) denote the coordinates of the left-top and right-bottom corners of the cropped region, which satisfy the following relationship:

(x_1, y_1) = (x_e - \gamma\sigma,\ y_e - \gamma\sigma), \qquad (x_2, y_2) = (x_e + \gamma\sigma,\ y_e + \gamma\sigma),    (5)

where \gamma is a scaling factor. Features cropped from the uncertainty region are of size 2\gamma\sigma \times 2\gamma\sigma \times C, and are then fed into a RoI Align module He et al. (2017) followed by a max-pooling layer. The resulting features of RoI Align are of size K \times K \times C, where K denotes the pooling resolution and can be formulated as

K = \lceil 2\gamma\sigma \rceil,    (6)

where \lceil \cdot \rceil denotes the ceil operation. The feature spatial size is further normalized to 1 \times 1 via max-pooling.

The extreme-enhanced features are thus attained by orderly concatenating the output feature value of each extreme keypoint, as illustrated in Fig. 3. The channel dimension is 5C, in which the first 4C channels correspond to the four extreme keypoints, while the remaining C channels correspond to the original point features. It is worth noting that, in theory, the input uncertainty is defined along the direction of the relative offset. In this work, we instead form an uncertainty region centered at the estimated extreme keypoint for the convenience of feature pooling.
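The following is a minimal sketch of extreme pooling, assuming a square uncertainty region of half-size \gamma\sigma around each extreme keypoint; for simplicity it uses a fixed RoI Align resolution rather than the adaptive K of Eq. (6), and all names are illustrative rather than the authors' code.

```python
# Sketch of extreme pooling: crop, RoI-Align, max-pool to 1x1, concatenate.
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def extreme_pooling(feat, center_feat, extremes, sigmas, gamma=10.0, out_size=7):
    """feat:        (1, C, H, W) correlation feature map (one candidate shown).
    center_feat: (C,) feature at the original point.
    extremes:    (4, 2) coordinates of the 4 extreme keypoints (feature-map units).
    sigmas:      (4,) predicted uncertainty of each pseudo box offset."""
    half = gamma * sigmas.unsqueeze(1)                        # (4, 1) region half-size
    boxes = torch.cat([extremes - half, extremes + half], 1)  # (4, 4) as x1, y1, x2, y2
    batch_idx = torch.zeros(4, 1, device=feat.device)
    rois = torch.cat([batch_idx, boxes], dim=1)               # (4, 5) RoI format
    region = roi_align(feat, rois, output_size=out_size, aligned=True)  # (4, C, K, K)
    pooled = F.adaptive_max_pool2d(region, 1).flatten(1)      # (4, C) one value per channel
    return torch.cat([pooled.flatten(), center_feat], dim=0)  # (5C,) extreme-enhanced feature
```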

Figure 3: Illustration of extreme pooling. Our method predicts an uncertainty region (dotted bounding box) for each coarse extreme keypoint (green circle), in which the maximum values (luminous points) are taken and concatenated to form the extreme-enhanced feature maps.
Figure 4: Illustration of polar pooling. The maximum values (luminous points) in radial directions from the original point (orange point) to the coarse semantic keypoints (yellow points) are taken and concatenated to form the semantic-enhanced feature maps.

Polar Pooling. The coarse keypoint set is constrained inside a pseudo box. The bounding box lacks the capability of modeling the object's spatial extent and tends to include distracting background pixels. The estimated semantic keypoints may thus fall outside the target and convey inaccurate visual patterns. To address this issue, we propose a novel feature extractor named polar pooling.

As shown in Fig. 4, polar pooling takes the coarse semantic keypoints as input. We then evenly subdivide the radial line from the original point to each semantic keypoint into N points, and aggregate the feature values of these N points via a max-pooling operator as

f_s = \max_{i=1,\dots,N} F\big(x_0 + \tfrac{i}{N}(x_s - x_0),\ y_0 + \tfrac{i}{N}(y_s - y_0)\big).    (7)

Here (x_0, y_0) and (x_s, y_s) represent the coordinates of the original point and the estimated semantic keypoint, respectively. To eliminate the quantization error, we calculate the feature value F(\cdot, \cdot) at a non-integral location on the input feature map using bilinear interpolation. The output feature value f_s of each semantic keypoint is then orderly concatenated to construct the semantic-enhanced features with 6C channels. The first 5C channels correspond to the five semantic keypoints, while the remaining C channels correspond to the original point features.
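A minimal sketch of Eq. (7) follows, assuming bilinear sampling via grid_sample; the function and variable names are ours and the code is illustrative rather than the official implementation.

```python
# Sketch of polar pooling: max over N bilinear samples along each ray.
import torch
import torch.nn.functional as F

def polar_pooling(feat, center, keypoints, n_samples=8):
    """feat:      (1, C, H, W) correlation feature map.
    center:    (2,) original point (x0, y0) in feature-map coordinates.
    keypoints: (M, 2) coarse semantic keypoints (x_s, y_s)."""
    _, C, H, W = feat.shape
    steps = torch.arange(1, n_samples + 1, device=feat.device).float() / n_samples
    # (M, N, 2) points evenly spaced along each ray from the center to a keypoint.
    ray = center + steps[None, :, None] * (keypoints[:, None, :] - center)
    # Normalize to [-1, 1] for grid_sample (x over W, y over H).
    norm = torch.stack([ray[..., 0] / (W - 1), ray[..., 1] / (H - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(feat, norm.unsqueeze(0), mode='bilinear',
                            align_corners=True)              # (1, C, M, N)
    pooled = sampled.max(dim=3).values.squeeze(0).t()        # (M, C) max along each ray
    # Concatenate with the original point feature to obtain (M + 1) * C channels.
    center_feat = feat[0, :, center[1].long(), center[0].long()]
    return torch.cat([pooled.flatten(), center_feat], dim=0)
```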

Training and Inference

Target Assignment. For estimating the coarse keypoint set, RPT directly views the center of the annotated bounding box as positive and all other locations as negative, ignoring the target scale and aspect ratio. In this work, we instead build a positive region in which every location would generate a bounding box with at least t IoU (intersection over union) with the annotation box; the threshold t is set to 0.5 by default. The pseudo box converted from the coarse extreme keypoints is then assigned a positive label if its IoU with the ground-truth box is larger than 0.5, a negative label if the IoU is smaller than 0.4, and is otherwise ignored.
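The IoU-based labeling of the induced pseudo boxes can be sketched as below; the helper name and the construction of the positive region are left out, and this is only an illustration of the thresholds described above.

```python
# Sketch: label pseudo boxes as positive (1), negative (0) or ignored (-1).
import torch
from torchvision.ops import box_iou

def assign_labels(pseudo_boxes, gt_box, pos_thr=0.5, neg_thr=0.4):
    """pseudo_boxes: (N, 4) pseudo boxes induced from the coarse extreme keypoints.
    gt_box:       (1, 4) annotated bounding box."""
    iou = box_iou(pseudo_boxes, gt_box).squeeze(1)           # (N,) IoU with the annotation
    labels = torch.full_like(iou, -1, dtype=torch.long)      # default: ignored
    labels[iou > pos_thr] = 1                                # positive candidates
    labels[iou < neg_thr] = 0                                # negative candidates
    return labels
```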

Figure 5: Success and precision plots on OTB-100 Wu et al. (2015).
Tracker | A | R | EAO
ATOM Danelljan et al. (2019) | 0.590 | 0.204 | 0.401
SiamR-CNN Voigtlaender et al. (2020) | 0.609 | 0.220 | 0.408
SiamFC++ Xu et al. (2020) | 0.587 | 0.183 | 0.426
DiMP Bhat et al. (2019) | 0.597 | 0.153 | 0.440
PG-Net Liao et al. (2020) | 0.618 | 0.192 | 0.440
SiamBAN Chen et al. (2020) | 0.597 | 0.178 | 0.452
KYS Bhat et al. (2020) | 0.609 | 0.143 | 0.462
Ocean Zhang et al. (2020) | 0.592 | 0.117 | 0.489
RPT Ma et al. (2020) | 0.629 | 0.103 | 0.510
RPT++ (ours) | 0.630 | 0.094 | 0.525
Table 1: Comparison with state-of-the-art trackers on VOT2018 Kristan et al. (2018).
Tracker | A | R | EAO
SiamRPN++ Li et al. (2019) | 0.599 | 0.482 | 0.285
ATOM Danelljan et al. (2019) | 0.603 | 0.411 | 0.292
SiamBAN Chen et al. (2020) | 0.602 | 0.396 | 0.327
Ocean Zhang et al. (2020) | 0.594 | 0.316 | 0.350
DiMP Bhat et al. (2019) | 0.594 | 0.278 | 0.379
DRNet Kristan et al. (2019) | 0.605 | 0.261 | 0.395
RPT Ma et al. (2020) | 0.623 | 0.186 | 0.417
RPT++ (ours) | 0.635 | 0.181 | 0.427
Table 2: Comparison with state-of-the-art trackers on VOT2019 Kristan et al. (2019).

Loss Function. The proposed RPT++ is trained on YBB Real et al. (2017), COCO Lin et al. (2014) and ImageNet VID Russakovsky et al. (2015) using a multi-task loss

L = \lambda_1 L_{coarse} + \lambda_2 L_{cls} + \lambda_3 L_{reg}.    (8)

Here the learning of the coarse keypoint set is driven by L_{coarse}, which is formulated as the KL loss He et al. (2019):

L_{coarse} = \sum_{d \in \{l, t, r, b\}} \Big( \frac{(\mu_d - d^{*})^2}{2\sigma_d^2} + \frac{1}{2} \log \sigma_d^2 \Big),    (9)

where d^{*} \in \{l^{*}, t^{*}, r^{*}, b^{*}\} is the ground-truth offset for the leftmost, topmost, rightmost and bottommost extreme keypoints, respectively. L_{cls} and L_{reg} denote the focal loss Lin et al. (2017) and the IoU loss Yu et al. (2016), which are employed in the training process to obtain a refined keypoint set. The weights \lambda_1, \lambda_2 and \lambda_3 are set to 0.5, 1 and 1, respectively.
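A minimal sketch of this objective is given below, assuming a Gaussian negative log-likelihood form of the KL loss (following He et al., 2019) and leaving the focal and IoU terms as values computed elsewhere; constants are dropped and the function names are ours.

```python
# Sketch of Eq. (9) and the multi-task weighting of Eq. (8).
import torch

def coarse_kl_loss(mu, log_var, gt_offsets):
    """mu, log_var, gt_offsets: (N, 4) predicted mean, log-variance and
    ground-truth values of the pseudo box offsets (l, t, r, b)."""
    # Gaussian negative log-likelihood per offset, summed over l, t, r, b.
    return (0.5 * torch.exp(-log_var) * (mu - gt_offsets) ** 2
            + 0.5 * log_var).sum(dim=1).mean()

def total_loss(l_coarse, l_cls, l_reg, w=(0.5, 1.0, 1.0)):
    # Weighted sum of the coarse keypoint, classification and refined-regression losses.
    return w[0] * l_coarse + w[1] * l_cls + w[2] * l_reg
```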

Inference. Our method outputs a classification score and a set of refined keypoints for every target candidate. The classification score is penalized according to the target movement and the scale/ratio change between adjacent frames, and is then combined with the output of the online-trained classifier to obtain the final classification score. The point set with the highest score is used to estimate the final target state.
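As an illustration only, the snippet below shows one common way such penalties are implemented in Siamese trackers (a size/ratio penalty plus a cosine window, in the spirit of SiamRPN); the exact penalty terms and constants used by RPT++ may differ.

```python
# Sketch: penalize scores by scale/ratio change and blend with a cosine window.
import numpy as np

def penalize_scores(scores, boxes, prev_size, window, k=0.04, w_infl=0.4):
    """scores:    (N,) raw classification scores.
    boxes:     (N, 4) candidate boxes (x1, y1, x2, y2).
    prev_size: (w, h) of the previous frame's estimate.
    window:    (N,) cosine window centered on the previous position."""
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    s_c = np.maximum((w * h) / (prev_size[0] * prev_size[1]),
                     (prev_size[0] * prev_size[1]) / (w * h))        # scale change
    r_c = np.maximum((w / h) / (prev_size[0] / prev_size[1]),
                     (prev_size[0] / prev_size[1]) / (w / h))        # ratio change
    penalty = np.exp(-k * (s_c * r_c - 1))
    pscores = scores * penalty
    # Blend with the cosine window to suppress large displacements.
    return pscores * (1 - w_infl) + window * w_infl
```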

Experimental Results

Implementation Details

The modified ResNet-50 Zhang and Peng (2019) is adopted as the backbone network and initialized with weights pretrained on ImageNet Russakovsky et al. (2015). To preserve finer spatial information, we remove the down-sampling operations in the last two blocks. The outputs of the last three convolution blocks thus share the same feature resolution and are gathered together for multi-level prediction.

Our model is trained for 20 epochs with stochastic gradient descent (SGD). The learning rate is linearly increased from 0.001 to 0.005 over the first 5 epochs, and then exponentially decayed from 0.005 to 0.0005 for the rest of the training process. Meanwhile, we only train the head architecture for the first 10 epochs, and unfreeze the backbone network for the remaining epochs. Following SiamFC Bertinetto et al. (2016), we adopt a template patch of size 127x127 and a search patch of size 255x255. The proposed RPT++ is implemented with PyTorch on a GeForce GTX 1080Ti.
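For concreteness, a small sketch of the described schedule is given below: a linear warm-up from 0.001 to 0.005 over the first 5 epochs followed by an exponential decay to 0.0005; the optimizer setup and the per-epoch stepping are assumptions for illustration.

```python
# Sketch: linear warm-up then exponential decay of the learning rate.
import torch

model = torch.nn.Linear(10, 1)                       # stand-in for the tracker
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

warmup_epochs, total_epochs = 5, 20
decay_epochs = total_epochs - warmup_epochs
gamma = (0.0005 / 0.005) ** (1 / (decay_epochs - 1))  # per-epoch decay factor

def lr_at(epoch):
    if epoch < warmup_epochs:
        # Linear ramp from 0.001 to 0.005 over the first 5 epochs.
        return 0.001 + (0.005 - 0.001) * epoch / (warmup_epochs - 1)
    return 0.005 * gamma ** (epoch - warmup_epochs)

for epoch in range(total_epochs):
    for group in optimizer.param_groups:
        group['lr'] = lr_at(epoch)
    # ... one training epoch ...
```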

The code and models will be made available to facilitate further research.

Comparison with SOTA

In this section, we conduct extensive experiments on six benchmark datasets: OTB-100 Wu et al. (2015), VOT2018 Kristan et al. (2018), VOT2019 Kristan et al. (2019), GOT-10k Huang et al. (2019), TrackingNet Müller et al. (2018) and LaSOT Fan et al. (2019).

OTB-100. OTB-100 Wu et al. (2015) has played an important role in the development of visual object tracking. We compare RPT++ with the baseline method Ma et al. (2020) as well as various state-of-the-art trackers, evaluating the precision and success plots of OPE. Our method outperforms all other trackers on both metrics, including SiamRPN++ Li et al. (2019), ATOM Danelljan et al. (2019), DiMP Bhat et al. (2019), PrDiMP Danelljan et al. (2020), SiamFC++ Xu et al. (2020), SiamR-CNN Voigtlaender et al. (2020), SiamBAN Chen et al. (2020), Ocean Zhang et al. (2020), SiamCAR Guo et al. (2020b) and SiamGAT Guo et al. (2020a). Although the OTB-100 dataset has become highly saturated in recent years, our method obtains a further performance gain over the baseline approach.

VOT2018. The VOT2018 dataset Kristan et al. (2018) contains 60 videos with numerous challenging factors. The evaluation system restarts the tracker upon failure. The overall performance is measured with robustness (failure rate, abbreviated as R), accuracy (average overlap during successful tracking, abbreviated as A) and EAO (expected average overlap). Comparisons with recent and advanced trackers on VOT2018 are shown in Table 1. The RPT framework outperforms all previous methods in the main EAO metric, and our tracker attains a further gain of 1.5% in EAO over the baseline.

VOT2019. VOT2019 Kristan et al. (2019) partially inherits from the previous version and includes more challenging factors. Comparisons in terms of accuracy, robustness and EAO are presented in Table 2. DRNet Kristan et al. (2019) attained the leading performance in that challenge. Our tracker outperforms the champion method of VOT2019 by a significant 3.2% in EAO.

GOT-10k. GOT-10k Huang et al. (2019) is a large, high-diversity benchmark with over 10 thousand video segments of real-world moving objects for training and 180 segments for testing. The dataset follows a one-shot rule: the object classes in the training and test sets do not overlap. A fair comparison is ensured by the protocol that all methods are trained with the same data. The results in terms of SR_T (success rate at overlap threshold T) and AO (average overlap) are reported in Table 3. Our method outperforms the previous best tracker, SiamR-CNN Voigtlaender et al. (2020), by a significant margin of 1.5% on the main AO metric.

Tracker | SR_0.50 | SR_0.75 | AO
ATOM Danelljan et al. (2019) | 0.634 | 0.402 | 0.556
SiamFC++ Xu et al. (2020) | 0.695 | 0.479 | 0.595
Ocean Zhang et al. (2020) | 0.721 | 0.473 | 0.611
SiamGAT Guo et al. (2020a) | 0.743 | 0.488 | 0.627
PrDiMP Danelljan et al. (2020) | 0.738 | 0.543 | 0.634
KYS Bhat et al. (2020) | 0.751 | 0.515 | 0.636
DCFST Zheng et al. (2020) | 0.753 | 0.498 | 0.638
SiamR-CNN Voigtlaender et al. (2020) | 0.728 | 0.597 | 0.649
RPT Ma et al. (2020) | 0.730 | 0.504 | 0.624
RPT++ (ours) | 0.777 | 0.564 | 0.664
Table 3: Comparison with state-of-the-art trackers on GOT-10k Huang et al. (2019).
Tracker | Precision | Success | Norm. Prec.
ATOM Danelljan et al. (2019) | 64.8 | 70.3 | 77.1
SiamRPN++ Li et al. (2019) | 69.4 | 73.3 | 80.0
SiamFC++ Xu et al. (2020) | 70.5 | 75.4 | 80.0
DCFST Zheng et al. (2020) | 70.0 | 75.2 | 80.9
PrDiMP Danelljan et al. (2020) | 70.4 | 75.8 | 81.6
SiamAttn Yu et al. (2020) | 71.5 | 75.2 | 81.7
RPT Ma et al. (2020) | 70.2 | 75.2 | 81.2
RPT++ (ours) | 72.2 | 75.3 | 82.5
Table 4: Comparison with state-of-the-art trackers on TrackingNet Müller et al. (2018). All values in %.

TrackingNet. TrackingNet Müller et al. (2018) contains 30 thousand sequences with more than 14 million dense bounding box annotations. Our tracker is evaluated on the test set of 511 sequences against ATOM Danelljan et al. (2019), SiamRPN++ Li et al. (2019), SiamFC++ Xu et al. (2020), SiamAttn Yu et al. (2020), PrDiMP Danelljan et al. (2020) and DCFST Zheng et al. (2020). As shown in Table 4, our method reaches a precision score of 72.2% and a normalized precision score of 82.5%, surpassing SiamAttn Yu et al. (2020) and PrDiMP Danelljan et al. (2020). We also achieve a competitive result in terms of success score.

Figure 6: Comparison with state-of-the-art trackers on LaSOT Fan et al. (2019) in terms of the normalized precision, precision and success plots of OPE.

LaSOT. LaSOT Fan et al. (2019) is a large-scale dataset with an average sequence length of 2500 frames for long-term tracking. We evaluate our tracker on the test set of 280 videos against ATOM Danelljan et al. (2019), SiamRPN++ Li et al. (2019), DiMP Bhat et al. (2019), Ocean Zhang et al. (2020), SiamBAN Chen et al. (2020), SiamCAR Guo et al. (2020b) and SiamGAT Guo et al. (2020a). As shown in Figure 6, our RPT++ achieves the best overall performance. Compared with the baseline method, our customized-features based tracker improves the normalized precision, precision and success rate by 6.0%, 5.0% and 5.5%, respectively. The results indicate that RPT++ is more competitive for long-term tracking tasks.

Ablation Study

Component-wise analysis. To explore the impact of individual components in RPT++, we conduct an extensive ablation study on the GOT-10k dataset Huang et al. (2019), as presented in Table 5. RPT is adopted as the baseline model. We first apply the uncertainty branch to the target state estimation module, which improves the AO score from 0.624 to 0.628. Similar to Qiu et al. (2020), the uncertainties corresponding to the pseudo box offsets are averaged and utilized to infer the box confidence. Then the extreme pooling and polar pooling modules are added in turn, further improving the AO score by 1.3% and 2.3%, respectively. The results demonstrate that our customized feature representation contributes significantly to the tracking performance.

Components | SR_0.50 | SR_0.75 | AO
baseline | 0.730 | 0.504 | 0.624
+ uncertainty branch | 0.732 | 0.508 | 0.628
+ extreme pooling | 0.736 | 0.528 | 0.641
+ polar pooling | 0.777 | 0.564 | 0.664
Table 5: Component-wise analysis of RPT++ on GOT-10k Huang et al. (2019).

Pooling Size. In the polar pooling module, each radial line is evenly subdivided into N points, which are aggregated as the semantic-enhanced features. Likewise, extreme pooling crops an uncertainty region and pools over each region to extract the extreme-enhanced features. Thus two hyper-parameters, the pooling size N and the scaling factor \gamma, are introduced during the procedure of customized feature extraction. Studies on N and \gamma are conducted, with results shown in Table 6 and Table 7. Considering the tradeoff between tracking speed and accuracy, the values of N and \gamma are set to 8 and 10, respectively.

Besides, we further set the uncertainty region in extreme pooling to a series of fixed sizes. As shown in Table 8, our method outperforms these fixed-size versions by a margin of at least 1.1% in AO, demonstrating that estimating an uncertainty region is better than using a fixed size.

N | 2 | 4 | 6 | 8 | 16 | 32
AO | 0.642 | 0.654 | 0.658 | 0.664 | 0.662 | 0.664
FPS | 15.4 | 15.0 | 14.8 | 14.7 | 14.2 | 13.2
Table 6: Ablation studies on the pooling size N for polar pooling on GOT-10k Huang et al. (2019).

Comparison with other feature representations. We further compare the proposed customized features with other feature representations, and report the results on GOT-10k Huang et al. (2019) in Table 9. The structure that predicts the coarse keypoints with localization uncertainties is retained. The baseline setting utilizes the single point features for classification and regression. We then complement it with the border features extracted from the coarse extreme keypoints Qiu et al. (2020), the region features pooled from the pseudo box, the keypoint features extracted from the coarse keypoint set via deformable convolutions Ma et al. (2020), and our customized features, respectively. The results reveal that our customized features outperform all other feature representations by at least 2.8% in AO.

Scaling Factor \gamma | 5 | 8 | 10 | 12 | 15
AO | 0.656 | 0.662 | 0.664 | 0.660 | 0.648
FPS | 15.2 | 14.9 | 14.7 | 14.4 | 13.9
Table 7: Ablation studies on the scaling factor \gamma for extreme pooling on GOT-10k Huang et al. (2019).
Region Size | Fixed | Fixed | Fixed | Uncertainty-based
AO | 0.645 | 0.653 | 0.639 | 0.664
Table 8: Analysis of modeling the uncertainty region in extreme pooling versus fixed region sizes on GOT-10k Huang et al. (2019).
Method | SR_0.50 | SR_0.75 | AO
Single point features | 0.706 | 0.480 | 0.604
w/ Border features | 0.730 | 0.504 | 0.624
w/ Region features | 0.740 | 0.538 | 0.636
w/ Keypoint features | 0.732 | 0.508 | 0.628
w/ Customized features | 0.777 | 0.564 | 0.664
Table 9: Comparison of different feature representations on GOT-10k Huang et al. (2019).

Conclusion

In this paper, we propose RPT++, which extracts accurate semantic features for stronger classification and explicit extreme features for accurate target state estimation. These task-specific features are obtained by two customized operations named polar pooling and extreme pooling. By eliminating the task misalignment in feature representation, our method achieves a significant performance gain for both classification and regression. Extensive experiments on various benchmark datasets indicate that our method obtains new state-of-the-art results and runs at 15fps.

References

  • L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr (2016) Fully-convolutional siamese networks for object tracking. In Computer Vision - ECCV 2016 Workshops - Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part II, G. Hua and H. Jégou (Eds.), Lecture Notes in Computer Science, Vol. 9914, pp. 850–865. Cited by: Introduction, Related Works, Implementation Details.
  • G. Bhat, M. Danelljan, L. V. Gool, and R. Timofte (2019) Learning discriminative model prediction for tracking. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp. 6181–6190. Cited by: Related Works, Table 1, Table 2, Comparison with SOTA, Comparison with SOTA.
  • G. Bhat, M. Danelljan, L. V. Gool, and R. Timofte (2020) Know your surroundings: exploiting scene information for object tracking. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXIII, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Lecture Notes in Computer Science, Vol. 12368, pp. 205–221. Cited by: Table 1, Table 3.
  • Z. Chen, B. Zhong, G. Li, S. Zhang, and R. Ji (2020) Siamese box adaptive network for visual tracking. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 6667–6676. Cited by: Introduction, Related Works, Table 1, Table 2, Comparison with SOTA, Comparison with SOTA.
  • J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 764–773. Cited by: Introduction.
  • M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg (2019) ATOM: accurate tracking by overlap maximization. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 4660–4669. Cited by: Related Works, Table 1, Table 2, Comparison with SOTA, Comparison with SOTA, Comparison with SOTA, Table 3, Table 4.
  • M. Danelljan, L. V. Gool, and R. Timofte (2020) Probabilistic regression for visual tracking. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 7181–7190. Cited by: Related Works, Comparison with SOTA, Comparison with SOTA, Table 3, Table 4.
  • K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian (2019) CenterNet: keypoint triplets for object detection. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp. 6568–6577. Cited by: Related Works.
  • H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and H. Ling (2019) LaSOT: A high-quality benchmark for large-scale single object tracking. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 5374–5383. Cited by: Introduction, Figure 6, Comparison with SOTA, Comparison with SOTA.
  • P. Gao, R. Yuan, F. Wang, L. Xiao, H. Fujita, and Y. Zhang (2020) Siamese attentional keypoint network for high performance visual tracking. Knowl. Based Syst. 193, pp. 105448. Cited by: Related Works.
  • D. Guo, Y. Shao, Y. Cui, Z. Wang, L. Zhang, and C. Shen (2020a) Graph attention tracking. CoRR abs/2011.11204. External Links: Link, 2011.11204 Cited by: Comparison with SOTA, Comparison with SOTA, Table 3.
  • D. Guo, J. Wang, Y. Cui, Z. Wang, and S. Chen (2020b) SiamCAR: siamese fully convolutional classification and regression for visual tracking. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 6268–6276. Cited by: Comparison with SOTA, Comparison with SOTA.
  • K. He, G. Gkioxari, P. Dollár, and R. B. Girshick (2017) Mask R-CNN. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 2980–2988. Cited by: Introduction, Customized Feature Extraction.
  • Y. He, C. Zhu, J. Wang, M. Savvides, and X. Zhang (2019) Bounding box regression with uncertainty for accurate object detection. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 2888–2897. Cited by: Introduction, Framework of RPT++, Training and Inference.
  • L. Huang, X. Zhao, and K. Huang (2019) GOT-10k: a large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1. External Links: ISSN 1939-3539 Cited by: Introduction, Comparison with SOTA, Comparison with SOTA, Ablation Study, Ablation Study, Table 3, Table 5, Table 6, Table 7, Table 8, Table 9.
  • B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang (2018) Acquisition of localization confidence for accurate object detection. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIV, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss (Eds.), Lecture Notes in Computer Science, Vol. 11218, pp. 816–832. Cited by: Introduction, Related Works, Related Works.
  • M. Kristan, A. Berg, L. Zheng, L. Rout, L. V. Gool, L. Bertinetto, M. Danelljan, M. Dunnhofer, M. Ni, M. Y. Kim, M. Tang, M. Yang, and et al. (2019) The seventh visual object tracking VOT2019 challenge results. In 2019 IEEE/CVF International Conference on Computer Vision Workshops, ICCV Workshops 2019, Seoul, Korea (South), October 27-28, 2019, pp. 2206–2241. Cited by: Introduction, Table 2, Comparison with SOTA, Comparison with SOTA.
  • M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. P. Pflugfelder, L. C. Zajc, T. Vojír, G. Bhat, A. Lukezic, A. Eldesokey, G. Fernández, and et al. (2018) The sixth visual object tracking VOT2018 challenge results. In Computer Vision - ECCV 2018 Workshops - Munich, Germany, September 8-14, 2018, Proceedings, Part I, L. Leal-Taixé and S. Roth (Eds.), Lecture Notes in Computer Science, Vol. 11129, pp. 3–53. Cited by: Introduction, Table 1, Comparison with SOTA, Comparison with SOTA.
  • Y. Lee, J. Hwang, H. Kim, K. Yun, and J. Park (2020) Localization uncertainty estimation for anchor-free object detection. CoRR abs/2006.15607. External Links: Link, 2006.15607 Cited by: Introduction, Framework of RPT++.
  • B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan (2019) SiamRPN++: evolution of siamese visual tracking with very deep networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 4282–4291. Cited by: Introduction, Related Works, Table 2, Comparison with SOTA, Comparison with SOTA, Comparison with SOTA, Table 4.
  • B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu (2018) High performance visual tracking with siamese region proposal network. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 8971–8980. Cited by: Introduction, Related Works.
  • B. Liao, C. Wang, Y. Wang, Y. Wang, and J. Yin (2020) PG-net: pixel to global matching network for visual tracking. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXII, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Lecture Notes in Computer Science, Vol. 12367, pp. 429–444. Cited by: Table 1.
  • T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 2999–3007. Cited by: Training and Inference.
  • T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, D. J. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.), Lecture Notes in Computer Science, Vol. 8693, pp. 740–755. Cited by: Training and Inference.
  • Z. Ma, L. Wang, H. Zhang, W. Lu, and J. Yin (2020) RPT: learning point set representation for siamese visual tracking. In Computer Vision - ECCV 2020 Workshops - Glasgow, UK, August 23-28, 2020, Proceedings, Part V, A. Bartoli and A. Fusiello (Eds.), Lecture Notes in Computer Science, Vol. 12539, pp. 653–665. Cited by: Introduction, Related Works, Baseline, Table 1, Table 2, Comparison with SOTA, Ablation Study, Table 3, Table 4.
  • M. Müller, A. Bibi, S. Giancola, S. Al-Subaihi, and B. Ghanem (2018) TrackingNet: A large-scale dataset and benchmark for object tracking in the wild. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part I, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss (Eds.), Lecture Notes in Computer Science, Vol. 11205, pp. 310–327. Cited by: Introduction, Comparison with SOTA, Comparison with SOTA, Table 4.
  • H. Qiu, Y. Ma, Z. Li, S. Liu, and J. Sun (2020) BorderDet: border feature for dense object detection. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part I, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Lecture Notes in Computer Science, Vol. 12346, pp. 549–564. Cited by: Introduction, Ablation Study, Ablation Study.
  • E. Real, J. Shlens, S. Mazzocchi, X. Pan, and V. Vanhoucke (2017) YouTube-boundingboxes: A large high-precision human-annotated data set for object detection in video. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 7464–7473. Cited by: Training and Inference.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, et al. (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: Training and Inference, Implementation Details.
  • G. Song, Y. Liu, and X. Wang (2020) Revisiting the sibling head in object detector. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 11560–11569. Cited by: Introduction, Related Works.
  • P. Voigtlaender, J. Luiten, P. H. S. Torr, and B. Leibe (2020) Siam R-CNN: visual tracking by re-detection. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 6577–6587. Cited by: Table 1, Comparison with SOTA, Comparison with SOTA, Table 3.
  • Y. Wu, J. Lim, and M. Yang (2015) Object tracking benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 37 (9), pp. 1834–1848. Cited by: Introduction, Figure 5, Comparison with SOTA, Comparison with SOTA.
  • Y. Wu, Y. Chen, L. Yuan, Z. Liu, L. Wang, H. Li, and Y. Fu (2020) Rethinking classification and localization for object detection. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 10183–10192. Cited by: Related Works.
  • Y. Xu, Z. Wang, Z. Li, Y. Yuan, and G. Yu (2020) SiamFC++: towards robust and accurate visual tracking with target estimation guidelines. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 12549–12556. Cited by: Introduction, Related Works, Table 1, Comparison with SOTA, Comparison with SOTA, Table 3, Table 4.
  • J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. S. Huang (2016) UnitBox: an advanced object detection network. In Proceedings of the 2016 ACM Conference on Multimedia Conference, MM 2016, Amsterdam, The Netherlands, October 15-19, 2016, A. Hanjalic, C. Snoek, M. Worring, D. C. A. Bulterman, B. Huet, A. Kelliher, Y. Kompatsiaris, and J. Li (Eds.), pp. 516–520. Cited by: Framework of RPT++, Training and Inference.
  • Y. Yu, Y. Xiong, W. Huang, and M. R. Scott (2020) Deformable siamese attention networks for visual object tracking. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 6727–6736. Cited by: Introduction, Related Works, Comparison with SOTA, Table 4.
  • Z. Zhang, H. Peng, J. Fu, B. Li, and W. Hu (2020) Ocean: object-aware anchor-free tracking. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXI, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Lecture Notes in Computer Science, Vol. 12366, pp. 771–787. Cited by: Table 1, Table 2, Comparison with SOTA, Comparison with SOTA, Table 3.
  • Z. Zhang and H. Peng (2019) Deeper and wider siamese networks for real-time visual tracking. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 4591–4600. Cited by: Introduction, Related Works, Implementation Details.
  • L. Zheng, M. Tang, Y. Chen, J. Wang, and H. Lu (2020) Learning feature embeddings for discriminant model based tracking. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XV, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Lecture Notes in Computer Science, Vol. 12360, pp. 759–775. Cited by: Table 3, Table 4.