1 Introduction
Precise scale estimation is essential to a successful tracker. Early trackers usually address this problem by multi-scale search [2, 6, 39, 4] or a sampling-then-regression strategy [42, 32], both of which are imprecise and have greatly limited trackers' accuracy. In recent years, many high-performance scale estimation methods have been developed and have significantly improved trackers' performance [23, 47, 7, 3]. To obtain more robust and precise tracking results, many state-of-the-art trackers [45, 10, 7, 3] adopt a multiple-stage tracking strategy: they first locate the target coarsely and then refine the result using a refinement module. However, existing refinement modules suffer from limited transferability and precision.
In this work, we propose a novel, general and accurate refinement module. This module is trained separately and can be directly applied to any existing trackers, elevating the quality of their predicted bounding boxes. The proposed module utilizes an accurate pixel-wise correlation layer and a spatial-aware non-local layer for high-quality scale estimation. Moreover, our module predicts bounding box, corners and mask simultaneously. Thus, this module can quickly adapt to complex scenarios. We also develop a novel branch selector module to choose the most adequate output wisely.
We choose five well-known base trackers, DiMP [3], ATOM [7], SiamRPN++ [23], RTMDNet [17] and ECO [6], and perform comprehensive experiments on three tracking benchmarks, namely, LaSOT [9], TrackingNet [31] and VOT2018 [20]. Experimental results show that the proposed refinement module improves the base trackers' performance significantly and surpasses its competitors (i.e., IoU-Net [7, 3] and SiamMask [47]) by a large margin.
2 Related Work
Early Scale Estimation.
Before the rise of deep learning, scale estimation methods fell into two categories: multiple-scale search and sampling-then-regression strategies. Most correlation-filter-based trackers [14, 6, 39] and SiamFC [2] adopt the former strategy. Specifically, these trackers construct search regions of different sizes, compute the correlation with the template, and take the scale at which the highest response occurs as the target's size. Multiple-scale search is coarse and time-consuming due to its fixed-aspect-ratio prediction and heavy image-pyramid operations. The other type of method first generates a large number of bounding box samples, selects the best one, and finally applies regression to it to obtain a more accurate result. SINT [42], MDNet [32] and RTMDNet [17] are three representative trackers that exploit this approach.

Modern Scale Estimation.
As deep learning techniques have matured, several high-performance scale estimation approaches have been developed; they can be categorized into RPN-based [24, 58, 23], mask-based [47, 28], IoU-based [7, 3] and anchor-free [52] methods. RPN-based methods learn a region proposal network [35], which simultaneously determines whether the current anchor contains the target and refines it. SiameseRPN-series trackers [24, 58, 23] use it as their core component and have achieved great success in recent years. Mask representation is more precise, and the ability to predict a mask is quite beneficial to accurate scale estimation; SiamMask [47] and D3S [28] belong to this class and obtain higher precision than Siamese trackers that can only predict boxes. IoU-based approaches learn a network to predict the overlap between candidate boxes and the groundtruth; during inference, candidate boxes are optimized by gradient ascent to obtain more precise results. ATOM [7] and DiMP [3] fully exploit this method and thus surpass traditional correlation-filter trackers by a large margin. Recently, the anchor-free philosophy has become popular in the object detection field [22, 56, 19, 43]; it eliminates anchors, changing the label-assignment rules and the learning targets. SiamFC++ [52] introduces this design into object tracking and achieves state-of-the-art performance.
Refinement Modules.
Many state-of-the-art trackers [45, 10, 7, 3, 21] apply a multiple-stage tracking strategy to obtain accurate and robust results: the target is first located coarsely and then refined by a refinement module. SPM [45] and Siamese Cascaded RPN [10] adopt a light-weight relation network [41] and stacked RPNs [35], respectively, as refinement modules to further increase the tracker's discriminative power and precision. However, these two refinement modules have to be trained together with their preceding Siamese tracker in an end-to-end manner, which limits their flexibility in combining with other base trackers. ATOM [7] and DiMP [3] first use an online classification module to locate the target, then draw random samples around it, and finally deploy a modified IoU-Net [16] to maximize the overlap between these samples and the groundtruth, obtaining more precise bounding boxes. This modified IoU-Net can be trained separately from the base tracker, so it transfers well, but its precision still leaves substantial room for improvement. Notably, the winners of the VOT2019 main and long-term challenges utilize SiamMask [47] as a refinement module [21]. Similar to the IoU-Net [16] mentioned before, SiamMask [47] can be combined with any base tracker; however, SiamMask is designed as a tracker rather than a refinement module, so it is not ideally suited to this role. Considering previous refinement modules' weak transferability and limited accuracy, we develop a new, general and precise refinement method called Alpha-Refine.
3 Alpha-Refine
The single object tracking task can be decomposed into target localization and scale estimation. In this work, the base tracker is responsible for localizing the target, and Alpha-Refine is designed specifically for precise scale estimation. To be specific, after the base tracker produces a coarse tracking result, a small search region whose size is twice that of this result is cropped and sent to Alpha-Refine, which then outputs a more precise bounding box as the final tracking result. For the next frame, the base tracker crops its search region based on the refined result from the last frame.
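The interaction between a base tracker and Alpha-Refine can be sketched as follows (a minimal illustration in Python; `base_tracker`, `alpha_refine` and `crop_search_region` are hypothetical interfaces used only to show the data flow, not the released API):

```python
def track_with_refinement(frames, init_box, base_tracker, alpha_refine, crop_search_region):
    """Per-frame loop: coarse localization by the base tracker, then refinement.

    `base_tracker`, `alpha_refine`, and `crop_search_region` are hypothetical
    interfaces that illustrate the data flow described in Section 3.
    """
    box = init_box                      # (x, y, w, h) groundtruth from the first frame
    results = [box]
    for frame in frames[1:]:
        coarse_box = base_tracker.track(frame, prev_box=box)        # coarse localization
        # Crop a search region whose side is twice that of the coarse result.
        patch, crop_info = crop_search_region(frame, coarse_box, scale=2.0)
        box = alpha_refine.refine(patch, crop_info)                 # precise scale estimation
        results.append(box)             # the refined box is fed back to the base tracker
    return results
```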
3.1 Network Architecture
Alpha-Refine has two input branches, namely, the reference branch and the test branch, which take the search regions cropped from the first frame and the current frame, respectively, as input. The two branches use a parameter-shared ResNet-50 [13]
network as the backbone. After symmetric feature extraction, a PrRoI Pooling layer
[16] is used to obtain target features of the reference frame. Different from existing trackers that fuse features by naive correlation or depth-wise correlation, Alpha-Refine innovatively introduces an accurate pixel-wise correlation layer and a spatial-aware non-local layer for feature aggregation, to obtain fine reference-guided target features. Moreover, we deploy three complementary prediction heads to output the bounding box, corners, and mask, respectively. Three branches provide stronger supervision and more diverse results during training and testing phases. To wisely choose the most adequate one as the final result, a novel and efficient branch selector is proposed. The overall architecture of Alpha-Refine is shown in Figure 1.
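A condensed view of this forward pass is sketched below (module names, constructor arguments and wiring are illustrative assumptions; the released implementation may differ):

```python
import torch.nn as nn

class AlphaRefineSketch(nn.Module):
    """Two-branch refinement network: shared backbone, fine feature aggregation,
    three prediction heads, and a branch selector (all submodules are placeholders)."""
    def __init__(self, backbone, prroi_pool, pw_corr, non_local, heads, selector):
        super().__init__()
        self.backbone = backbone          # shared ResNet-50 feature extractor
        self.prroi_pool = prroi_pool      # pools target features from the reference frame
        self.pw_corr = pw_corr            # pixel-wise correlation + channel attention
        self.non_local = non_local        # spatial-aware non-local fusion
        self.heads = nn.ModuleDict(heads) # {'box': ..., 'corner': ..., 'mask': ...}
        self.selector = selector          # scores the three heads

    def forward(self, ref_patch, ref_box, test_patch):
        f_ref = self.backbone(ref_patch)             # reference-branch features
        f_test = self.backbone(test_patch)           # test-branch features
        kernel = self.prroi_pool(f_ref, ref_box)     # target features of the reference frame
        fused = self.non_local(self.pw_corr(kernel, f_test))
        branch = self.selector(fused).argmax(dim=1)  # which head to trust per sample
        # At test time only the selected head is run; all three are shown for clarity.
        return {name: head(fused) for name, head in self.heads.items()}, branch
```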
3.2 Fine Feature Aggregation
Accurate Pixel-wise Correlation.
In recent years, Siamese architecture has been widely used in the tracking field, leading to a large number of successful trackers. However, most of these methods aggregate features of the template and the search region using coarse naive correlation [2, 24, 58] or depth-wise correlation [23, 47]. Both of these methods take the whole target as the kernel to produce degraded correlation maps, which cannot accurately reflect the size of the target in the current search region.
In this work, we replace these methods with pixel-wise correlation [49] for high-quality feature representation. We denote $f_z \in \mathbb{R}^{C \times H_z \times W_z}$ and $f_x \in \mathbb{R}^{C \times H_x \times W_x}$ as the features of the template and the search region. Pixel-wise correlation first decomposes $f_z$ into $H_z \times W_z$ sub-kernels $k_j \in \mathbb{R}^{C \times 1 \times 1}$, and then uses them to compute correlation with $f_x$ separately to obtain correlation maps $C_j$. The process can be described as

$$C_j = k_j \star f_x, \quad j = 1, 2, \dots, H_z \times W_z, \qquad (1)$$

where $\star$ denotes naive correlation.
Unlike naive or depth-wise correlation, pixel-wise correlation takes each part of the target as a kernel, ensuring that each correlation map encodes information of a local region on the target. With a smaller kernel size and a more diverse target representation, the correlation feature maps better retain the target's boundary and scale information, which benefits the subsequent prediction. Figure 2(a) shows the computation process of the three correlation methods. Because rectangle annotations are imprecise, some sub-kernels encode less discriminative background pixels; we therefore add a channel-wise attention operation after the pixel-wise correlation layer to enhance the features of the most discriminative regions for scale estimation.
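The following PyTorch sketch illustrates pixel-wise correlation followed by a channel-wise attention gate (feature shapes are illustrative, and the squeeze-and-excitation-style attention block is an assumption rather than the exact module used):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pixel_wise_corr(kernel, search):
    """kernel: (B, C, Hz, Wz) target features; search: (B, C, Hx, Wx) search features.
    Each of the Hz*Wz positions of `kernel` acts as a 1x1xC sub-kernel and is
    correlated with `search`, giving (B, Hz*Wz, Hx, Wx) correlation maps."""
    b, c, hz, wz = kernel.shape
    _, _, hx, wx = search.shape
    k = kernel.view(b, c, hz * wz)                 # (B, C, Nz)
    x = search.view(b, c, hx * wx)                 # (B, C, Nx)
    corr = torch.einsum('bcn,bcm->bnm', k, x)      # dot product for every position pair
    return corr.view(b, hz * wz, hx, wx)

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style gate that re-weights the Hz*Wz correlation channels,
    emphasizing sub-kernels lying on the target rather than on background pixels."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, corr):
        w = self.fc(F.adaptive_avg_pool2d(corr, 1).flatten(1))   # (B, Nz) channel weights
        return corr * w.unsqueeze(-1).unsqueeze(-1)
```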
Spatial-aware Non-local Fusion.
To precisely determine the boundary of the target, it is important to utilize global contextual information. Non-local module [48] is a good choice for this goal, due to its ability to capture long-range dependencies. The non-local operation can be described as
$$y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j), \qquad (2)$$

where $x$ denotes the input feature maps, $y$ denotes the output feature maps, and $g$ is a learned embedding of the input. $f(x_i, x_j)$ computes the relationship between different locations on the feature maps and returns a scalar. $\mathcal{C}(x)$ represents a normalization factor, which is $\mathcal{C}(x) = \sum_{\forall j} f(x_i, x_j)$. We take $f$ in an embedded Gaussian form:

$$f(x_i, x_j) = e^{\theta(x_i)^{\top} \phi(x_j)}, \qquad (3)$$

where $\theta(x_i) = W_{\theta} x_i$ and $\phi(x_j) = W_{\phi} x_j$. The non-local operation in this form can be easily implemented with the softmax function, namely, $y = \mathrm{softmax}(x^{\top} W_{\theta}^{\top} W_{\phi} x)\, g(x)$. The non-local module in the embedded Gaussian form is shown in Figure 2(b).
Figure 2: (a) Visualization of various correlations. (b) The non-local module.
In this work, we plug in one non-local block after the channel-attention operation to guarantee that the features not only encode local information but also sense global cues.
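For concreteness, a standard embedded-Gaussian non-local block corresponding to Eqs. (2)-(3) can be sketched as follows (the channel sizes and the residual connection follow the common formulation of [48] and are assumptions here, not the paper's exact code):

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local block; channel sizes are illustrative."""
    def __init__(self, in_channels, inter_channels=None):
        super().__init__()
        inter_channels = inter_channels or in_channels // 2
        self.theta = nn.Conv2d(in_channels, inter_channels, 1)
        self.phi = nn.Conv2d(in_channels, inter_channels, 1)
        self.g = nn.Conv2d(in_channels, inter_channels, 1)
        self.out = nn.Conv2d(inter_channels, in_channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)      # (B, HW, C')
        phi = self.phi(x).flatten(2)                          # (B, C', HW)
        g = self.g(x).flatten(2).transpose(1, 2)              # (B, HW, C')
        attn = torch.softmax(theta @ phi, dim=-1)             # normalized pairwise affinities
        y = (attn @ g).transpose(1, 2).reshape(b, -1, h, w)   # aggregate global context
        return x + self.out(y)                                # residual connection
```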
3.3 Complementary Prediction Branches
Alpha-Refine has three output branches that predict one bounding box, two corners (the top-left and the bottom-right corner) and one mask, respectively. After some post-processing, the corners and the mask can also be transformed into bounding boxes, producing diverse candidates. All prediction heads take the non-local features as input. Detailed descriptions are given in the following parts.
Bounding Box Head.
This module learns a coordinate transformation relative to a predefined anchor box. The base tracker usually has already coarsely located the target; thus, a piece of useful prior information for the refinement module is that the target lies near the center of the search region. With such prior information, we can greatly reduce the number and complexity of anchors. In this work, only a single anchor box is used; since the search region is cropped at twice the size of the previous result, this anchor corresponds to the central region of the search patch, i.e., normalized $(x_1, y_1, x_2, y_2)$ coordinates of $(0.25, 0.25, 0.75, 0.75)$. The bounding box head contains two stacked Conv-BN-ReLU layers, followed by a global average pooling layer and a fully-connected layer, which predicts four coordinate transformation factors. During the training stage, the GIoU loss
[36] is exploited to maximize the overlap between the predicted boxes and the groundtruth:

$$GIoU = IoU - \frac{|A_c \setminus U|}{|A_c|}, \qquad (4)$$

$$L_{GIoU} = 1 - GIoU, \qquad (5)$$

where $A_c$ and $U$ denote the area of the smallest enclosing box and of the union, respectively, of the predicted box and the groundtruth.
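A standard implementation of this loss (assuming boxes in $(x_1, y_1, x_2, y_2)$ format; this is a generic sketch, not the paper's exact code) looks like:

```python
import torch

def giou_loss(pred, target):
    """GIoU loss (Eqs. 4-5) for boxes of shape (N, 4) in (x1, y1, x2, y2) format."""
    # Intersection
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    # Union
    area_p = (pred[:, 2:] - pred[:, :2]).clamp(min=0).prod(dim=1)
    area_t = (target[:, 2:] - target[:, :2]).clamp(min=0).prod(dim=1)
    union = area_p + area_t - inter
    iou = inter / union.clamp(min=1e-7)
    # Smallest enclosing box A_c
    lt_c = torch.min(pred[:, :2], target[:, :2])
    rb_c = torch.max(pred[:, 2:], target[:, 2:])
    area_c = (rb_c - lt_c).clamp(min=0).prod(dim=1)
    giou = iou - (area_c - union) / area_c.clamp(min=1e-7)
    return (1.0 - giou).mean()
```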
Corner Head.
Recently, keypoint detection techniques have become popular in the object detection field, giving rise to a number of state-of-the-art detectors [22, 56, 8, 57]. Some works, such as SATIN [12], have introduced corner detection into the tracking field, but their performance is still limited. In this work, we attempt to bridge this gap by unveiling the power of corner detection for tracking.
Different from SATIN [12], which predicts low-resolution heatmaps for corners and learns offsets to refine coordinates, our corner head progressively increases the feature resolution through repeated Conv-BN-ReLU-Bilinear modules (Bilinear denotes bilinear upsampling), predicting high-resolution heatmaps of the same size as the search region. After obtaining the heatmaps of the two corners, we apply the soft-argmax function [30] to them to derive the expected values of the corners' coordinates. During the training phase, the mean squared error loss is used to optimize the parameters. Compared with SATIN [12], our approach has two advantages: (1) the predicted heatmaps are of high resolution, so there is no quantization error; (2) our regression method does not suffer from the imbalance problem faced when using Gaussian labels.
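The soft-argmax step can be sketched as follows (whether a softmax normalization is applied to the heatmap before taking the expectation is an implementation detail assumed here):

```python
import torch

def soft_argmax_2d(heatmap):
    """Derive expected corner coordinates from a (B, H, W) heatmap (soft-argmax).
    The heatmap resolution is assumed to match the search-region resolution."""
    b, h, w = heatmap.shape
    prob = torch.softmax(heatmap.flatten(1), dim=1).view(b, h, w)   # normalize to a distribution
    ys = torch.arange(h, dtype=prob.dtype, device=prob.device)
    xs = torch.arange(w, dtype=prob.dtype, device=prob.device)
    exp_y = (prob.sum(dim=2) * ys).sum(dim=1)    # expected row coordinate
    exp_x = (prob.sum(dim=1) * xs).sum(dim=1)    # expected column coordinate
    return torch.stack([exp_x, exp_y], dim=1)    # (B, 2) sub-pixel coordinates
```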
Mask Head.
As demonstrated by previous works [47, 28], the ability to predict a precise mask is quite beneficial for tracking performance, especially on benchmarks (e.g., VOT [20, 21]) that adopt rotated bounding box labels. We also aggregate low-level features from the backbone network through an FPN [26] structure to recover details and generate high-quality masks. Different from SiamMask [47], which restricts the predicted mask to a region as large as the template, our mask branch predicts a mask of the same size as the search region for higher-quality output. The mask branch is trained with the binary cross-entropy loss. During the tracking stage, the predicted mask is first binarized and then transformed into a bounding box. We use the same transformation method as SiamMask
[47] for fair comparison.
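A sketch of such a mask-to-box conversion is shown below (the binarization threshold and the choice between an axis-aligned box and a rotated minimum-area rectangle are illustrative options, not the exact settings of SiamMask or Alpha-Refine):

```python
import cv2
import numpy as np

def mask_to_box(mask_prob, threshold=0.5, rotated=False):
    """Binarize a predicted mask and convert it to a box (OpenCV 4.x API)."""
    mask = (mask_prob > threshold).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    if not contours:
        return None
    contour = max(contours, key=cv2.contourArea)    # keep the largest connected region
    if rotated:
        # Rotated minimum-area rectangle, returned as its four corner points.
        return cv2.boxPoints(cv2.minAreaRect(contour))
    x, y, w, h = cv2.boundingRect(contour)          # axis-aligned bounding box
    return np.array([x, y, w, h], dtype=np.float32)
```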
3.4 Farsighted and Efficient Branch Selector
As shown in Figure 3, the three prediction heads provide diverse bounding box results. A novel and efficient branch selector module is designed to choose the most suitable result as the output. This module takes the non-local features as input and predicts three scores evaluating the quality of the outputs of the three branches. We exploit a few Conv-BN-ReLU layers and a max-pooling layer to reduce the number of channels and the spatial resolution, followed by two fully-connected layers that predict the scores for the three branches. The detailed architecture of the branch selector is presented in Table 1. The branch selector can forecast which branch will produce the most accurate result in the current situation before any branch is run. Thus, only a single branch needs to be run per refinement, which keeps our refinement module efficient and accurate.

| Layer name | conv1 | conv2 | max pool | flatten | fc1 | fc2 |
| --- | --- | --- | --- | --- | --- | --- |
| Output size | 32x16x16 | 16x16x16 | 16x8x8 | 1024 | 512 | 3 |
| Structure | 3x3, 32, BatchNorm, ReLU | 3x3, 16, BatchNorm, ReLU | 2x2, stride=2 | - | 512, BatchNorm, ReLU | 3 |
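A sketch of this module following Table 1 is given below (the number of input channels of the non-local features and the exact placement of BatchNorm/ReLU after fc1 are assumptions):

```python
import torch.nn as nn

class BranchSelector(nn.Module):
    """Branch selector following Table 1; the input channel count (64) is assumed."""
    def __init__(self, in_channels=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.Conv2d(32, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, stride=2),     # 16x16x16 -> 16x8x8
            nn.Flatten(),                  # -> 1024
        )
        self.classifier = nn.Sequential(
            nn.Linear(1024, 512), nn.BatchNorm1d(512), nn.ReLU(inplace=True),
            nn.Linear(512, 3),             # scores for the box / corner / mask heads
        )

    def forward(self, x):                  # x: (B, in_channels, 16, 16) non-local features
        return self.classifier(self.features(x))
```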
3.5 Training
Training Set Construction.
We use the training splits of LaSOT [9], GOT-10K [15], VID, DET [37], COCO [27], Youtube-VOS [50] and several saliency datasets [53, 46, 38] to train the complete Alpha-Refine. We also develop two lite versions that use fewer datasets for fair comparison with IoU-Net [7, 3] and SiamMask [47]. Given a video sequence, a reference frame and a test frame within a fixed maximum frame interval are first randomly selected. Then, their groundtruths are processed by random translation and scaling to generate the reference and test boxes.
The random translation and scaling are controlled by two scalar jitter factors, one for the scale and one for the center of the box; the perturbations are drawn from a standard normal distribution and a uniform distribution, with different jitter magnitudes used for the reference frame and the test frame. After obtaining these boxes, we crop search regions whose area is a fixed multiple of the bounding boxes' area, and finally resize them to the fixed input resolution of Alpha-Refine.
Training Approach.

The whole training stage is divided into two phases. In the first phase, we train Alpha-Refine without the branch selector, because the prediction quality of the three branches changes dynamically during this phase and therefore cannot provide solid labels for the branch selector. The losses of the bounding box, corner, and mask heads are denoted as $L_{box}$, $L_{corner}$, and $L_{mask}$, respectively. The total loss is the weighted sum of these three losses:
$$L_{total} = \lambda_{box} L_{box} + \lambda_{corner} L_{corner} + \lambda_{mask} L_{mask}, \qquad (9)$$

where $\lambda_{box}$, $\lambda_{corner}$, and $\lambda_{mask}$ are scalar weights balancing the three heads. In this phase, we train Alpha-Refine for multiple epochs, each of which consists of 4000 iterations with a batch size of 32. Given the abundance of the training data, we do not freeze any parameters of the backbone.
After the backbone and the three prediction heads have been adequately trained, the second phase begins. In this phase, we train only the branch selector and keep all other parameters frozen. As in the first phase, reference and test images are passed through Alpha-Refine to obtain three different bounding box predictions. The IoUs between these predictions and the groundtruth, $IoU_{box}$, $IoU_{corner}$ and $IoU_{mask}$, are computed, and the label for the branch selector is obtained with the argmax function:

$$\text{label} = \arg\max_{i \in \{box,\, corner,\, mask\}} IoU_i. \qquad (10)$$
We train the branch selector using the cross-entropy loss; each epoch contains 200 iterations. In both training phases, the Adam optimizer [18] is applied and the learning rate is halved at regular epoch intervals. The complete models and source code will be released.
4 Experiments
In this work, we implement our algorithm with the PyTorch [33] deep learning library. The hardware platform is a PC with an Intel i9 CPU (64GB memory) and two NVIDIA RTX-2080Ti GPUs (11GB memory each). We first perform comprehensive experiments on three popular tracking benchmarks, TrackingNet [31], LaSOT [9] and VOT2018 [20], together with five state-of-the-art base trackers, DiMP50 [3], ATOM [7], SiamRPN++ [23], RTMDNet [17] and ECO [6], to demonstrate Alpha-Refine's ability to boost trackers' performance. We denote SiamRPN++ as SiamRPNpp in the experiment section for brevity. Then, we compare Alpha-Refine with two existing refinement modules: IoU-Net [7, 3] and SiamMask [47]. Finally, an ablation study is conducted to verify the effectiveness of the pixel-wise correlation, the non-local layer and the branch selector.

4.1 Comparison with the state-of-the-art
TrackingNet.
TrackingNet [31] is a popular large-scale short-term tracking benchmark. We evaluate various methods on its test set, which contains 511 sequences. For the test set, only the groundtruth of the first frame is given, and participants need to submit their tracking results to the evaluation server. The quantitative results are shown in Table 2. As the results show, every “Base Tracker+AR” outperforms its base tracker by a large margin. Especially for ECO and RTMDNet, which lack precise scale estimation, Alpha-Refine improves the AUC by more than 10 points. In addition, “DiMP50+AR” achieves an AUC of 77.5%, setting a new state-of-the-art record.
|  | Staple [1] | CSRDCF [29] | SiamFC [2] | CFNet [44] | MDNet [32] | UPDT [4] | Dsiam [58] | Dsiam-Update [54] | GFS-DCF [51] | C-RPN [10] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| P(%) | 47.0 | 48.0 | 53.3 | 53.3 | 56.5 | 55.7 | 59.1 | 62.5 | 56.6 | 61.9 |
| Norm. P(%) | 60.3 | 62.2 | 66.6 | 65.4 | 70.5 | 70.2 | 73.3 | 75.2 | 71.8 | 74.6 |
| AUC(%) | 52.8 | 53.4 | 57.1 | 57.8 | 60.6 | 61.1 | 63.8 | 67.7 | 60.9 | 66.9 |

|  | ECO | ECO+AR | RTMDNet | RTMDNet+AR | SiamRPNpp | SiamRPNpp+AR | ATOM | ATOM+AR | DiMP50 | DiMP50+AR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| P(%) | 55.9 | 69.2 | 53.3 | 69.4 | 69.4 | 73.3 | 64.8 | 72.5 | 68.7 | 74.4 |
| Norm. P(%) | 71.0 | 78.4 | 69.4 | 78.7 | 80.0 | 81.5 | 77.1 | 80.9 | 80.1 | 82.5 |
| AUC(%) | 61.2 | 73.2 | 58.4 | 73.1 | 73.3 | 76.2 | 70.3 | 75.9 | 74.0 | 77.5 |
LaSOT.
LaSOT [9] is a large-scale long-term tracking benchmark, which consists of 1400 videos, 280 of which form the test set. LaSOT ranks trackers using Success and Norm Precision. The quantitative results are shown in Figure 4. It can be seen that Alpha-Refine significantly elevates all base trackers' performance. Specifically, the success curve of each “Base Tracker+AR” is clearly higher than that of the corresponding base tracker when the overlap threshold is above 0.5. For RTMDNet, the improvement in AUC is particularly large. Moreover, DiMP50+AR breaks the previous state-of-the-art record in terms of AUC.
Figure 4: Success and normalized precision plots on the LaSOT test set.
VOT2018.
The VOT2018 benchmark [20] includes 60 challenging videos annotated with rotated bounding boxes. VOT2018 has three performance measures: accuracy, robustness and EAO. Accuracy denotes the mean overlap of successfully tracked frames, robustness represents the failure rate, and the final ranking measure is EAO (Expected Average Overlap), which considers accuracy and robustness simultaneously. The results on the VOT2018 benchmark are shown in Table 3. It can be seen that most “Base Tracker+AR” combinations significantly outperform their base trackers in terms of EAO. When combined with DiMP50, Alpha-Refine improves the EAO to 0.476, achieving new state-of-the-art performance. Although SiamRPNpp+AR's EAO is slightly lower than that of SiamRPNpp, SiamRPNpp+AR achieves obviously higher accuracy, which implies that Alpha-Refine does produce more precise bounding boxes. A likely reason for the EAO drop is that the hyperparameters of SiamRPNpp have been carefully tuned on the VOT benchmark; if SiamRPNpp+AR were tuned in the same way, its performance could be further improved.
|  | SiamDW [55] | GCT [11] | ASRCF [5] | C-RPN [10] | SPM [45] | RPCF [40] | DSiam-Update [54] | GradNet [25] | GFS-DCF [51] | SiamMask [47] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EAO(↑) | 0.301 | 0.274 | 0.328 | 0.289 | 0.338 | 0.316 | 0.393 | 0.247 | 0.397 | 0.347 |
| Accuracy(↑) | 0.520 | - | 0.494 | - | 0.580 | 0.500 | - | 0.507 | 0.511 | 0.602 |
| Robustness(↓) | 0.410 | - | 0.234 | - | 0.300 | 0.234 | - | 0.375 | 0.143 | 0.288 |

|  | ECO | ECO+AR | RTMDNet | RTMDNet+AR | SiamRPNpp | SiamRPNpp+AR | ATOM | ATOM+AR | DiMP50 | DiMP50+AR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EAO(↑) | 0.350 | 0.393 | 0.253 | 0.286 | 0.414 | 0.400 | 0.401 | 0.421 | 0.440 | 0.476 |
| Accuracy(↑) | 0.554 | 0.602 | 0.540 | 0.596 | 0.600 | 0.624 | 0.590 | 0.611 | 0.597 | 0.633 |
| Robustness(↓) | 0.243 | 0.234 | 0.407 | 0.393 | 0.234 | 0.272 | 0.204 | 0.197 | 0.153 | 0.136 |
4.2 Comparison with IoU-Net and SiamMask
Due to their precise scale estimation and good transferability, IoU-Net [7, 3] and SiamMask [47] have been successfully used as refinement modules [21]. In our experiments, they serve as our competitors; for fair comparison, we utilize the same backbone and train our model using the same (or fewer) datasets. Alpha-Refine takes ResNet-50 as the backbone, which is the same as IoU-Net (in our experiments, IoU-Net denotes the scale estimation module of DiMP50 [3]) and SiamMask. Moreover, since IoU-Net and SiamMask are not trained with the same datasets, we additionally develop two lite versions of Alpha-Refine, AR(IoU) and AR(Mask) for short, to make the comparison fair. To be specific, IoU-Net [3] is trained on the training splits of TrackingNet [31], LaSOT [9], GOT10K [15] and COCO [27]; correspondingly, AR(IoU) utilizes only these datasets except for TrackingNet. Due to the lack of mask-level labels, AR(IoU) does not have a mask head. SiamMask is trained using Youtube-BBox [34], Youtube-VOS [50], VID, DET [37] and COCO [27]; accordingly, AR(Mask) exploits only these datasets except for Youtube-BBox. The detailed experimental results are shown in Table 4. They illustrate that every “Base Tracker+AR(IoU)” achieves better performance than “Base Tracker+IoU” and every “Base Tracker+AR(Mask)” outperforms “Base Tracker+Mask”, even though our methods use less training data.
Tracker | Precision(%) | Norm Precision(%) | Success(%) |
---|---|---|---|
DiMP50+AR | 74.4 | 82.5 | 77.5 |
DiMP50+AR(IoU) | 73.0 | 81.9 | 77.1 |
DiMP50+IoU | 68.2 | 79.0 | 73.5 |
DiMP50+AR(Mask) | 73.7 | 82.2 | 77.0 |
DiMP50+Mask | 69.5 | 79.8 | 75.1 |
DiMP50 | 68.7 | 80.1 | 74.0 |
ATOM+AR | 72.5 | 80.9 | 75.9 |
ATOM+AR(IoU) | 70.8 | 80.2 | 75.4 |
ATOM+IoU | 66.8 | 77.9 | 72.3 |
ATOM+AR(Mask) | 71.3 | 80.1 | 74.9 |
ATOM+Mask | 68.7 | 79.4 | 74.3 |
ATOM | 64.8 | 77.1 | 70.3 |
SiamRPNpp+AR | 73.3 | 81.5 | 76.2 |
SiamRPNpp+AR(IoU) | 72.3 | 81.3 | 76.3 |
SiamRPNpp+IoU | 68.0 | 78.6 | 73.2 |
SiamRPNpp+AR(Mask) | 72.1 | 80.6 | 75.2 |
SiamRPNpp+Mask | 66.7 | 76.9 | 72.7 |
SiamRPNpp | 69.4 | 80.0 | 73.3 |
RTMDNet+AR | 69.4 | 78.7 | 73.1 |
RTMDNet+AR(IoU) | 67.5 | 78.1 | 72.7 |
RTMDNet+IoU | 65.3 | 77.0 | 70.5 |
RTMDNet+AR(Mask) | 68.8 | 78.5 | 72.7 |
RTMDNet+Mask | 67.5 | 78.4 | 72.5 |
RTMDNet | 53.3 | 69.4 | 58.4 |
ECO+AR | 69.2 | 78.4 | 73.2 |
ECO+AR(IoU) | 66.3 | 76.7 | 71.8 |
ECO+IoU | 62.1 | 74.5 | 68.0 |
ECO+AR(Mask) | 68.0 | 77.7 | 72.4 |
ECO+Mask | 67.4 | 78.2 | 73.1 |
ECO | 55.9 | 71.0 | 61.2 |
4.3 Ablation Study.
In the ablation study, the effectiveness of the pixel-wise correlation, the non-local module, and the branch selector is examined in turn.
Pixel-wise correlation vs Naive and Depth-wise correlation.
To demonstrate the superiority of pixel-wise correlation over other correlation methods, we implement two variants, AR(Naive) and AR(Depth), which fuse features by naive correlation and depth-wise correlation, respectively. As shown in Table 5, pixel-wise correlation brings better performance than both, confirming its advantage in fine feature fusion.
The Non-local Module.
To show the effectiveness of the non-local layer, we implement a variant without the non-local operation, denoted as “Tracker+AR(woNL)”. As Table 5 shows, “Tracker+AR(woNL)” obtains worse performance than “Tracker+AR”, which verifies the effectiveness of the non-local module.
The branch selector.
To validate the advantage of the proposed branch selector, we implement the following variants: AR(BBox), AR(Corner), AR(Mask), and AR(Average). The first three always use the output of the bounding box branch, the corner branch, or the mask branch, respectively, as the final result. “Tracker+AR(Average)” first obtains predictions from all three branches and then takes their average as the final result. As shown in Table 5, refinement with our branch selector (“Tracker+AR”) obtains the best results.
Tracker | EAO(↑) | Accuracy(↑) | Robustness(↓)
---|---|---|---|
DiMP50+AR | 0.476 | 0.633 | 0.136 |
DiMP50+AR(woNL) | 0.458 | 0.622 | 0.150 |
DiMP50+AR(Naive) | 0.439 | 0.628 | 0.169 |
DiMP50+AR(Depth) | 0.435 | 0.624 | 0.159 |
DiMP50+AR(Average) | 0.438 | 0.629 | 0.155 |
DiMP50+AR(BBox) | 0.375 | 0.570 | 0.187 |
DiMP50+AR(Corner) | 0.441 | 0.624 | 0.145 |
DiMP50+AR(Mask) | 0.446 | 0.642 | 0.192 |
DiMP50 | 0.440 | 0.597 | 0.153 |
4.4 Further Analysis.
Speed.
Alpha-Refine not only boosts the tracking performance of base trackers but also runs at a remarkable speed. The speed is tested on an NVIDIA RTX-2080Ti. When only a single branch (backbone + feature aggregation + one prediction head) is used, AR(BBox), AR(Corner) and AR(Mask) run at 150 FPS, 130 FPS and 75 FPS, respectively. The speeds when combined with base trackers are summarized in Table 6. It can be seen that, after being combined with our refinement module, these base trackers still run in real time.
|  | DiMP50 | ATOM | SiamRPNpp | RTMDNet | ECO |
| --- | --- | --- | --- | --- | --- |
| Speed (FPS) | 49 | 51 | 70 | 40 | 37 |
| Speed with AR (FPS) | 35 | 39 | 46 | 32 | 31 |
Multiple branches.
In this part, the behavior of the three branches is further discussed. The IoU curves between the three heads' predictions and the groundtruth on the training set are shown in Figure 5. It can also be observed that the result from the mask head is poor at the start of training, but it improves quickly and surpasses the bounding box head by the end of the first epoch. After sufficient training, the corner head and the mask head produce higher-quality results (IoU above 0.8) than the bounding box head (IoU = 0.7). This indicates that the corner head has a stronger ability to produce precise results than the bounding box head, even though the bounding box branch is given a larger loss weight and is directly optimized with an IoU-based loss. In addition, more qualitative results are provided in Figure 3. Although the mean IoU of the box branch is lower than that of the other two branches, the box branch can still produce the most accurate result in some cases, as shown in the first row of Figure 3. Thus, with the help of the branch selector, all three branches make their own irreplaceable contribution.
Figure 5: (a) and (b): IoU curves between the three heads' predictions and the groundtruth during training.
5 Conclusion.
In this work, we propose a novel and precise refinement module called Alpha-Refine. Our contributions can be summarized as follows. First, this work is the first to design a universal refinement module: Alpha-Refine can be seamlessly combined with existing trackers without joint training or fine-tuning. Second, this work proposes several effective principles for designing a high-performance refinement module: (1) finely aggregating features and capturing global information brings better results; (2) multiple prediction heads produce more diverse and reliable results; (3) a branch selector helps choose the optimal result. Third, we apply Alpha-Refine to five well-known, top-performing trackers and conduct extensive evaluations on three popular benchmarks. The experimental results show that Alpha-Refine consistently improves tracking performance with little computational overhead.
Appendix A Qualitative Results
Alpha-Refine has three parallel prediction heads, which predict the bounding box, the corners and the mask respectively. In this section, a large number of qualitative results are shown to illustrate Alpha-Refine’s ability to produce precise results.
A.1 Quality of the predicted corners
Figure 6 demonstrates the corners (and the corresponding boxes) predicted by Alpha-Refine. It can be seen that Alpha-Refine produces reliable corners even when motion blur, distractors and occlusion occur. This illustrates that the corner branch is robust to challenging factors in the tracking process.
Figure 6: Corners and the corresponding boxes predicted by Alpha-Refine under challenging conditions such as motion blur, distractors and occlusion.
A.2 Quality of the predicted masks
Figure 7 shows a comparison between the masks predicted by Alpha-Refine and SiamMask. Zooming in on Figure 7, it can be seen that our masks are sharper and more robust to cluttered background than those of SiamMask. For example, in the videos bird1 and shaking, the masks from SiamMask are rough and broken, respectively, whereas Alpha-Refine accurately segments the targets' contours and predicts complete masks. Besides, in the videos bolt1 and dinosaur, SiamMask cannot distinguish the target from the background, leading to inferior mask predictions, while Alpha-Refine still produces high-quality masks that contain only the tracked targets.
Figure 7: Comparison between the masks predicted by Alpha-Refine and SiamMask.
A.3 Video Demos
To further illustrate the influence of Alpha-Refine on the tracking results, we provide some video demos in the videos folder. Figure 8 is a typical example. It can be seen from the videos that Alpha-Refine provides more accurate bounding boxes than DiMP50, boosting its tracking performance significantly.
Figure 8: A typical example from the video demos, comparing the bounding boxes of DiMP50 with and without Alpha-Refine.
References
- [1] (2016) Staple: Complementary learners for real-time tracking. In CVPR, Cited by: Table 2.
- [2] (2016) Fully-convolutional siamese networks for object tracking. In ECCV Workshop, Cited by: §1, §2, §3.2, Table 2.
- [3] (2019) Learning discriminative model prediction for tracking. In ICCV, Cited by: §1, §1, §2, §2, §3.5, §4.2, §4, footnote 3.
- [4] (2018) Unveiling the power of deep tracking. In ECCV, Cited by: §1, Table 2.
- [5] (2019) Visual tracking via adaptive spatially-regularized correlation filters. In CVPR, Cited by: Table 3.
- [6] (2017) ECO: efficient convolution operators for tracking. In CVPR, Cited by: §1, §1, §2, §4.
- [7] (2019) ATOM: Accurate tracking by overlap maximization. In CVPR, Cited by: §1, §1, §2, §2, §3.5, §4.2, §4.
- [8] (2019) Centernet: keypoint triplets for object detection. In ICCV, Cited by: §3.3.
- [9] (2019) LaSOT: a high-quality benchmark for large-scale single object tracking. In CVPR, Cited by: §1, §3.5, §4.1, §4.2, §4.
- [10] (2019) Siamese cascaded region proposal networks for real-time visual tracking. In CVPR, Cited by: §1, §2, Table 2, Table 3.
- [11] (2019) Graph convolutional tracking. In CVPR, Cited by: Table 3.
- [12] (2019) Siamese attentional keypoint network for high performance visual tracking. Knowledge-Based Systems, pp. 105448. Cited by: §3.3, §3.3.
- [13] (2016) Deep residual learning for image recognition. In CVPR, Cited by: §3.1.
- [14] (2008) High-speed tracking with kernelized correlation filters.. In ICVS, Cited by: §2.
- [15] (2019) Got-10k: a large high-diversity benchmark for generic object tracking in the wild. TPAMI. Cited by: §3.5, §4.2.
- [16] (2018) Acquisition of localization confidence for accurate object detection. In ECCV, Cited by: §2, §3.1.
- [17] (2018) Real-time MDNet. In ECCV, Cited by: §1, §2, §4.
- [18] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.5.
- [19] (2019) Foveabox: beyond anchor-based object detector. arXiv preprint arXiv:1904.03797. Cited by: §2.
- [20] (2018) The sixth visual object tracking vot2018 challenge results. In ECCV, Cited by: §1, §3.3, §4.1, §4.
- [21] (2019) The seventh visual object tracking vot2019 challenge results. In ICCV Workshops, Cited by: §2, §3.3, §4.2.
- [22] (2018) Cornernet: detecting objects as paired keypoints. In ECCV, Cited by: §2, §3.3.
- [23] (2019) SiamRPN++: Evolution of siamese visual tracking with very deep networks. In CVPR, Cited by: §1, §1, §2, §3.2, §4.
- [24] (2018) High performance visual tracking with siamese region proposal network. In CVPR, Cited by: §2, §3.2.
- [25] (2019) GradNet: gradient-guided network for visual object tracking. In ICCV, Cited by: Table 3.
- [26] (2017) Feature pyramid networks for object detection. In CVPR, Cited by: §3.3.
- [27] (2014) Microsoft COCO: common objects in context. In ECCV, Cited by: §3.5, §4.2.
- [28] (2019) D3S–a discriminative single shot segmentation tracker. arXiv preprint arXiv:1911.08862. Cited by: §2, §3.3.
- [29] (2017) Discriminative correlation filter with channel and spatial reliability. In CVPR, Cited by: Table 2.
- [30] (2019) Human pose regression by combining indirect part detection and contextual information. Computers & Graphics 85, pp. 15–22. Cited by: §3.3.
- [31] (2018) TrackingNet: a large-scale dataset and benchmark for object tracking in the wild. In ECCV, Cited by: §1, §4.1, §4.2, §4.
- [32] (2016) Learning multi-domain convolutional neural networks for visual tracking. In CVPR, Cited by: §1, §2, Table 2.
- [33] (2017) Automatic differentiation in PyTorch. Cited by: §4.
- [34] (2017) YouTube-boundingboxes: a large high-precision human-annotated data set for object detection in video. In CVPR, Cited by: §4.2.
- [35] (2015) Faster R–CNN: towards real-time object detection with region proposal networks. In NIPS, Cited by: §2, §2.
- [36] (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In CVPR, Cited by: §3.3.
- [37] (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §3.5, §4.2.
- [38] (2015) Hierarchical image saliency detection on extended CSSD. TPAMI 38 (4). Cited by: §3.5.
- [39] (2018) Correlation tracking via joint discrimination and reliability learning. In CVPR, Cited by: §1, §2.
- [40] (2019) Roi pooled correlation filters for visual tracking. In CVPR, Cited by: Table 3.
- [41] (2018) Learning to compare: relation network for few-shot learning. In CVPR, Cited by: §2.
- [42] (2016) Siamese instance search for tracking. In CVPR, Cited by: §1, §2.
- [43] (2019) FCOS: fully convolutional one-stage object detection. In ICCV, Cited by: §2.
- [44] (2017) End-to-end representation learning for correlation filter based tracking. In CVPR, Cited by: Table 2.
- [45] (2019) SPM-tracker: series-parallel matching for real-time visual object tracking. In CVPR, Cited by: §1, §2, Table 3.
- [46] (2017) Learning to detect salient objects with image-level supervision. In CVPR, Cited by: §3.5.
- [47] (2019) Fast online object tracking and segmentation: A unifying approach. In CVPR, Cited by: Figure 7, §1, §1, §2, §2, §3.2, §3.3, §3.5, §4.2, Table 3, §4.
- [48] (2018) Non-local neural networks. In CVPR, Cited by: §3.2.
- [49] (2019) Ranet: ranking attention network for fast video object segmentation. In ICCV, Cited by: §3.2.
- [50] (2018) Youtube-vos: sequence-to-sequence video object segmentation. In ECCV, pp. 585–601. Cited by: §3.5, §4.2.
- [51] (2019) Joint group feature selection and discriminative filter learning for robust visual object tracking. In ICCV, Cited by: Table 2, Table 3.
- [52] (2019) SiamFC++: towards robust and accurate visual tracking with target estimation guidelines. arXiv preprint arXiv:1911.06188. Cited by: §2.
- [53] (2013) Saliency detection via graph-based manifold ranking. In CVPR, Cited by: §3.5.
- [54] (2019) Learning the model update for siamese trackers. In ICCV, Cited by: Table 2, Table 3.
- [55] (2019) Deeper and wider siamese networks for real-time visual tracking. In CVPR, Cited by: Table 3.
- [56] (2019) Objects as points. arXiv preprint arXiv:1904.07850. Cited by: §2, §3.3.
- [57] (2019) Bottom-up object detection by grouping extreme and center points. In CVPR, Cited by: §3.3.
- [58] (2018) Distractor-aware siamese networks for visual object tracking. In ECCV, Cited by: §2, §3.2, Table 2.