
Alpha-Refine: Boosting Tracking Performance by Precise Bounding Box Estimation

by   Bin Yan, et al.
Dalian University of Technology

In recent years, the multiple-stage strategy has become a popular trend for visual tracking. This strategy first utilizes a base tracker to coarsely locate the target and then exploits a refinement module to obtain more accurate results. However, existing refinement modules suffer from limited transferability and precision. In this work, we propose a novel, flexible and accurate refinement module called Alpha-Refine, which exploits a precise pixel-wise correlation layer together with a spatial-aware non-local layer to fuse features and can predict three complementary outputs: bounding box, corners and mask. To choose the most adequate output, we also design a light-weight branch selector module. We apply the proposed Alpha-Refine module to five well-known, state-of-the-art base trackers: DiMP, ATOM, SiamRPN++, RTMDNet and ECO. Comprehensive experiments on the TrackingNet, LaSOT and VOT2018 benchmarks demonstrate that our approach significantly improves the tracking performance in comparison with other existing refinement methods. The source codes will be available at





1 Introduction

Precise scale estimation is essential to a successful tracker. Early trackers usually solve this problem by multi-scale search [2, 6, 39, 4] or a sampling-then-regression strategy [42, 32], both of which are inaccurate and have greatly limited the precision of trackers. In recent years, many high-performance scale estimation methods have been developed and have significantly improved trackers’ performance [23, 47, 7, 3]. To obtain more robust and precise tracking results, many state-of-the-art trackers [45, 10, 7, 3] adopt a multiple-stage tracking strategy. These trackers first locate the target coarsely and then refine their results using a refinement module. However, existing refinement modules suffer from limited transferability and precision.

In this work, we propose a novel, general and accurate refinement module. This module is trained separately and can be directly applied to any existing trackers, elevating the quality of their predicted bounding boxes. The proposed module utilizes an accurate pixel-wise correlation layer and a spatial-aware non-local layer for high-quality scale estimation. Moreover, our module predicts bounding box, corners and mask simultaneously. Thus, this module can quickly adapt to complex scenarios. We also develop a novel branch selector module to choose the most adequate output wisely.

We choose five famous base trackers: DiMP [3], ATOM [7], SiamRPN++ [23], RTMDNet [17] and ECO [6], to perform comprehensive experiments on three tracking benchmarks, namely, LaSOT [9], TrackingNet [31] and VOT2018 [20]. Experimental results show that our proposed refinement module improves base trackers’ performances significantly, and its performance surpasses those of its competitors (i.e., IoU-Net [7, 3] and SiamMask [47]) by a large margin.

2 Related Work

Early Scale Estimation.

Before the rise of deep learning, early scale estimation methods fell into two categories: multiple-scale search and sampling-then-regression strategies. Most correlation-filter-based trackers [14, 6, 39] and SiamFC [2] adopt the former strategy. Specifically, these trackers construct search regions with different sizes, then compute correlation with the template, and finally determine the size of the target as the size level where the highest response is located. Multiple-scale search is coarse and time-consuming due to its fixed-aspect-ratio prediction and heavy image pyramid operation. The other type of method first generates a large number of bounding box samples, then selects the best one, and finally applies regression to it to obtain more accurate results. SINT [42], MDNet [32] and RTMDNet [17] are three representative trackers that exploit this approach.

Modern Scale Estimation.

As deep learning techniques become mature, several high-performance scale estimation approaches are developed and can be categorized into the following classes, namely, RPN-based [24, 58, 23], Mask-based [47, 28], IoU-based [7, 3] and Anchor-free-based [52]. RPN-based methods learn a region proposal network [35], which determines whether the current anchor contains the target and makes refinement to it simultaneously. SiameseRPN-series trackers [24, 58, 23] utilize it as the core component, and thus achieve great success in recent years. Mask representation is more accurate, and the ability to predict mask is quite beneficial to precise scale estimation. SiamMask [47] and D3S [28] belong to this class, and obtain higher precision than Siamese trackers that can only predict boxes. IoU-based approaches learn a network to predict overlap between candidate boxes and groundtruth. During the inference phase, this strategy optimizes candidate boxes by gradient-ascent, and therefore obtains more precise results. ATOM [7] and DiMP [3] fully exploit this method, and thus surpass traditional correlation-filter trackers by a large margin. In the past year, anchor-free philosophy has become quite popular in the object detection field [22, 56, 19, 43]. It eliminates anchors to change the label-assignment rules and the learning targets. SiamFC++ [52] introduces this structure into object tracking field, and therefore achieves state-of-the-art performance.

Refinement Modules.

Many state-of-the-art trackers [45, 10, 7, 3, 21] apply a multiple-stage tracking strategy to obtain accurate and robust results. This approach first locates the target coarsely and then utilizes a refinement module to refine results from the previous stage. SPM [45] and Siamese Cascaded RPN [10] adopt a light-weight relation network [41] and stacked RPNs [35], respectively, as the refinement module to further increase trackers’ discriminative power and precision. However, the two refinement modules have to be trained together with their preceding Siamese tracker in an end-to-end manner; this procedure limits their flexibility in combining with other base trackers. ATOM [7] and DiMP [3] first use an online classification module to locate the target, then draw some random samples around it, and finally deploy a modified IoU-Net [16] to maximize the overlap between these samples and the groundtruth for obtaining more precise bounding boxes. This modified IoU-Net can be trained separately from the base tracker. Thus, it has good transferability, but its precision still leaves considerable room for improvement. Notably, the winners of the VOT2019 [21] main and long-term challenges utilize SiamMask [47] as a refinement module [21]. Similar to the IoU-Net [16] mentioned before, SiamMask [47] can be combined with any base tracker. However, SiamMask is designed as a tracker rather than a refinement module, so it is not ideal for this role. Considering previous refinement modules’ weak transferability and limited accuracy, a new, general and precise refinement method called Alpha-Refine is developed.

3 Alpha-Refine

The single object tracking task can be decomposed into target localization and scale estimation. In this work, base trackers are responsible for localizing the target, and Alpha-Refine is designed specifically for precise scale estimation. To be specific, after the base tracker produces a coarse tracking result, a small search region whose size is double the size of the tracking result is cropped and sent to Alpha-Refine. Alpha-Refine then outputs a more precise bounding box as the final tracking result. For the next frame, the base tracker crops its search region based on the refined result from the previous frame.
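The interplay between the base tracker and the refinement module can be sketched as follows. This is a minimal illustration rather than the released implementation: `base_tracker` and `refiner` are hypothetical callables, boxes are assumed to be in `(x, y, w, h)` format, and "double the size" is assumed to mean that each side of the search region is twice the corresponding side of the coarse box.

```python
def search_region(box, factor=2.0):
    """Return an (x, y, w, h) region centered on `box` with sides scaled by `factor`.

    Assumption: "double the size" is read as doubling each side length.
    """
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0          # center of the coarse box
    sw, sh = w * factor, h * factor            # enlarged side lengths
    return (cx - sw / 2.0, cy - sh / 2.0, sw, sh)


def track_sequence(frames, init_box, base_tracker, refiner):
    """Run coarse localization followed by refinement on every frame.

    base_tracker(frame, prev_box) -> coarse box       (hypothetical interface)
    refiner(frame, region, coarse_box) -> refined box (hypothetical interface)
    """
    box = init_box
    results = [box]
    for frame in frames[1:]:
        coarse = base_tracker(frame, box)      # stage 1: coarse localization
        region = search_region(coarse)         # small region around the coarse result
        box = refiner(frame, region, coarse)   # stage 2: precise box estimation
        results.append(box)                    # the refined box seeds the next frame
    return results
```

Because the refined box is fed back to the base tracker, any improvement in scale estimation also affects where the next search region is placed.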

3.1 Network Architecture

Alpha-Refine has two input branches, namely, the reference branch and the test branch. These two branches take the small search region from the 1st and the current frame respectively as the input. The two branches use a parameter-shared ResNet-50 [13] network as the backbone. After symmetric feature extraction, a PrRoI Pooling layer [16] is used to obtain target features of the reference frame. Different from existing trackers that fuse features by naive correlation or depth-wise correlation, Alpha-Refine innovatively introduces an accurate pixel-wise correlation layer and a spatial-aware non-local layer for feature aggregation, to obtain fine reference-guided target features. Moreover, we deploy three complementary prediction heads to output the bounding box, corners, and mask, respectively. Three branches provide stronger supervision and more diverse results during training and testing phases. To wisely choose the most adequate one as the final result, a novel and efficient branch selector is proposed. The overall architecture of Alpha-Refine is shown in Figure 1.

Figure 1: The overall architecture of Alpha-Refine. Better viewed in color with zoom-in.

3.2 Fine Feature Aggregation

Accurate Pixel-wise Correlation.

In recent years, Siamese architecture has been widely used in the tracking field, leading to a large number of successful trackers. However, most of these methods aggregate features of the template and the search region using coarse naive correlation [2, 24, 58] or depth-wise correlation [23, 47]. Both of these methods take the whole target as the kernel to produce degraded correlation maps, which cannot accurately reflect the size of the target in the current search region.

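In this work, we replace these methods with pixel-wise correlation [49] for high-quality feature representation. We denote $K \in \mathbb{R}^{C \times H_0 \times W_0}$ and $S \in \mathbb{R}^{C \times H \times W}$ as features of the template and the search region. Pixel-wise correlation first decomposes $K$ into $H_0 W_0$ sub-kernels $K_j \in \mathbb{R}^{C \times 1 \times 1}$, and then uses them to compute correlation with $S$ separately for obtaining correlation maps $M \in \mathbb{R}^{H_0 W_0 \times H \times W}$. The process can be described as

$$M = \{ M_j \mid M_j = K_j \star S \}_{j \in \{1, \dots, H_0 W_0\}}$$

where $\star$ denotes naive correlation.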

Unlike naive or depth-wise correlation, pixel-wise correlation takes each part of the target as a kernel to ensure that each correlation map encodes information of a local region on the target. With smaller kernel size and more diverse target representation, the correlation feature maps better retain the target’s boundary and scale information, which is beneficial to subsequent prediction. Figure 2(a) shows the computation process of three correlation methods. Due to the inaccuracy of rectangle annotations, some sub-kernels encode less discriminative background pixels. So we also add a channel-wise attention operation after the pixel-wise correlation layer to enhance features of the most discriminative regions for scale estimation.
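As an illustration of the idea, pixel-wise correlation can be written in a few lines of NumPy. This is a sketch under the assumption that each sub-kernel is a single template pixel (a C-dimensional vector), so that correlating it with the search features reduces to a channel-wise dot product at every search location; the function name and tensor layout are illustrative and are not taken from the released code.

```python
import numpy as np


def pixelwise_correlation(template, search):
    """Correlate every template pixel with the whole search feature map.

    template: array of shape (C, H0, W0)
    search:   array of shape (C, H, W)
    returns:  array of shape (H0 * W0, H, W), one correlation map per template pixel
    """
    c, h0, w0 = template.shape
    # Each column of `kernels` is one 1x1 sub-kernel (a C-dimensional vector).
    kernels = template.reshape(c, h0 * w0)
    # A 1x1 correlation is a dot product over channels at each search location.
    return np.einsum("cj,chw->jhw", kernels, search)
```

The number of output channels equals the number of template pixels, which is why each map carries information about one local part of the target rather than about the target as a whole.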

Spatial-aware Non-local Fusion.

To precisely determine the boundary of the target, it is important to utilize global contextual information. The non-local module [48] is a good choice for this goal, due to its ability to capture long-range dependencies. The non-local operation can be described as

$$y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$$

where $x$ denotes the input feature maps and $y$ denotes the output feature maps. $f$ computes the relationship between different locations on the feature maps to return a scalar. $\mathcal{C}(x)$ represents a normalization factor, which is $\sum_{\forall j} f(x_i, x_j)$. We take $f$ as an embedded gaussian form:

$$f(x_i, x_j) = e^{\theta(x_i)^T \phi(x_j)}$$

where $\theta(x_i) = W_\theta x_i$ and $\phi(x_j) = W_\phi x_j$. Non-local operation in this form can be easily implemented by the softmax function, namely, $y = \mathrm{softmax}(x^T W_\theta^T W_\phi x)\, g(x)$. The non-local module in the embedded gaussian form is shown in Figure 2(b).

(a) Visualization of various correlations. (b) The non-local module.
Figure 2: Comparison among different correlation methods and the illustration of the non-local module. (a) From left to right, naive, depth-wise and pixel-wise correlations are demonstrated, respectively. The black-edged cubes or squares represent sliding kernels. The red-edged ones represent corresponding correlation maps. (b) $\otimes$ represents matrix multiplication. $\oplus$ represents the element-wise sum. Better viewed in color with zoom-in.

In this work, we plug in one non-local block after the channel-attention operation to guarantee that the features not only encode local information but also sense global cues.
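The embedded-gaussian non-local operation can be sketched in NumPy as a softmax attention over all spatial positions. The sketch below assumes the feature map has been flattened to `N = H * W` positions and that the embeddings are plain linear maps; the weight names (`w_theta`, `w_phi`, `w_g`, `w_out`) are illustrative, and the output projection with a residual sum follows the usual non-local block design rather than a detail stated here.

```python
import numpy as np


def softmax(a, axis=-1):
    """Numerically stable softmax along `axis`."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def nonlocal_block(x, w_theta, w_phi, w_g, w_out):
    """Embedded-gaussian non-local block on flattened features.

    x:                    (N, C) features, one row per spatial position
    w_theta, w_phi, w_g:  (C, D) embedding weights (illustrative names)
    w_out:                (D, C) output projection back to C channels
    returns:              (N, C) features after the residual sum
    """
    theta = x @ w_theta
    phi = x @ w_phi
    g = x @ w_g
    # Row i holds f(x_i, x_j) / C(x) for all j, i.e. a softmax over positions.
    attention = softmax(theta @ phi.T, axis=1)
    response = attention @ g
    # Element-wise sum with the input keeps the local information.
    return x + response @ w_out
```

Since every output position is a weighted sum over all input positions, a single block already gives each location access to the whole search region.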

3.3 Complementary Prediction Branches

Alpha-Refine has three output branches that predict one bounding box, two corners (the top-left corner and the bottom-right corner) and one mask, respectively. After some post-processing, corners and the mask can also be transformed into bounding boxes to produce diverse candidates. All prediction heads take the non-local features as input. Detailed descriptions are given in the following parts.

Bounding Box Head.

This module learns a coordinate transformation relative to a predefined anchor box. The base trackers usually have already coarsely located the target. Thus, a useful piece of prior information for the refinement module is that the target lies near the center of the search region. With this prior, we can greatly reduce the number and complexity of anchors. In this work, only a single anchor box with fixed normalized coordinates is used. The bounding box head contains two stacked Conv-BN-ReLU layers, followed by a global average pooling layer and a fully-connected layer, which predicts four coordinate transformation factors. During the training stage, the GIoU loss [36] is exploited to maximize the overlap between the predicted boxes and the groundtruth:

$$L_{GIoU} = 1 - IoU + \frac{A^c - U}{A^c}$$

where $A^c$ and $U$ denote the area of the smallest enclosing box and union, respectively, between the predicted box and the groundtruth.
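For concreteness, the GIoU loss between two axis-aligned boxes can be computed as in the following sketch. It follows the standard definition from the GIoU paper [36] (one minus IoU, plus the fraction of the enclosing box not covered by the union); the `(x1, y1, x2, y2)` box format and the function name are assumptions made for the example.

```python
def giou_loss(pred, gt):
    """GIoU loss for two boxes given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    # Intersection is zero when the boxes do not overlap.
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = area_p + area_g - inter
    # Area of the smallest box enclosing both boxes.
    enclose = (max(px2, gx2) - min(px1, gx1)) * (max(py2, gy2) - min(py1, gy1))
    iou = inter / union
    giou = iou - (enclose - union) / enclose
    return 1.0 - giou
```

Unlike a plain IoU loss, the enclosing-box term still changes with the distance between two non-overlapping boxes, so the loss remains informative in that case.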

Corner Head.

Recently, keypoint detection techniques have become popular in the object detection field, leading to a large number of state-of-the-art methods [22, 56, 8, 57]. Some works, such as SATIN [12], have introduced corner detection into the tracking field. However, their performance is still limited. In this work, we attempt to bridge this performance gap by unveiling the power of corner detection in the tracking field.

Different from SATIN [12], which predicts low-resolution heatmaps for corners and learns offsets to refine coordinates, our corner head progressively increases the feature resolution by repeated Conv-BN-ReLU-Bilinear modules (Bilinear denotes bilinear upsampling), predicting high-resolution heatmaps with the same size as the search region. After obtaining heatmaps of the two corners, we apply the soft-argmax function [30] to them to derive the expected value of the corners’ coordinates. During the training phase, the mean squared error loss is used to optimize the parameters. Compared with SATIN [12], our approach has two advantages: (1) The resolution of the predicted heatmaps is high, which avoids quantization error. (2) Our regression method does not suffer from the imbalance problem that gaussian labels face.
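The soft-argmax step used by the corner head can be sketched as follows: the heatmap is turned into a probability distribution with a softmax, and the corner coordinate is the expectation of the pixel coordinates under that distribution. This is a generic NumPy illustration that assumes integer pixel coordinates starting at zero; it is not the released implementation.

```python
import numpy as np


def soft_argmax(heatmap):
    """Return the expected (x, y) location of a 2-D score map.

    heatmap: array of shape (H, W) holding raw (unnormalized) scores
    """
    h, w = heatmap.shape
    flat = heatmap.reshape(-1).astype(float)
    prob = np.exp(flat - flat.max())           # stable softmax over all pixels
    prob = (prob / prob.sum()).reshape(h, w)
    xs = np.arange(w)
    ys = np.arange(h)
    x = (prob.sum(axis=0) * xs).sum()          # expectation over the column marginal
    y = (prob.sum(axis=1) * ys).sum()          # expectation over the row marginal
    return float(x), float(y)
```

Because the output is an expectation rather than a hard maximum, it is differentiable and can take sub-pixel values, which is what allows direct regression with a mean squared error loss.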

Mask Head.

As demonstrated by previous works [47, 28], the ability to predict precise mask is quite beneficial to improving the tracking performance, especially on benchmarks (e.g., VOT [20, 21]) that adopt rotated bounding box labels. We also aggregate low-level features from the backbone network by FPN [26] structure for recovering details to generate high-quality masks. Different from SiamMask [47] that restricts the predicted mask in a region as large as the template, our mask branch predicts mask that has the same size as the search region for a higher-quality mask output. The mask branch is trained with the binary cross-entropy loss. During the tracking stage, the predicted mask is first binarized and then transformed into a bounding box. We use the same transformation method as SiamMask [47] for fair comparison.
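The binarize-then-convert step can be illustrated with a simplified sketch. Note the assumptions: the code below produces an axis-aligned box from the extent of the foreground pixels, whereas the transformation borrowed from SiamMask can also produce rotated boxes and is not reproduced here; the threshold value of 0.5 and the `(x, y, w, h)` output format are likewise chosen only for the example.

```python
import numpy as np


def mask_to_box(mask_prob, threshold=0.5):
    """Binarize a probability mask and return its axis-aligned box as (x, y, w, h).

    Returns None when no pixel exceeds the threshold.
    Simplification: a rotated-box fit is not attempted.
    """
    binary = mask_prob > threshold
    if not binary.any():
        return None
    ys, xs = np.nonzero(binary)                # row and column indices of foreground
    x1, y1 = xs.min(), ys.min()
    x2, y2 = xs.max() + 1, ys.max() + 1        # exclusive bottom-right corner
    return (int(x1), int(y1), int(x2 - x1), int(y2 - y1))
```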

Figure 3: Qualitative results from three prediction branches. Results from the bounding box, corner, and mask heads are shown in red, green, and blue, respectively. In addition, the groundtruth is shown in yellow. From top to bottom, three rows show the cases in which the bounding box, corner and mask heads perform the best. Better viewed in color with zoom-in.

3.4 Farsighted and Efficient Branch Selector

As shown in Figure 3, three prediction heads provide diverse bounding box results. A novel and efficient branch selector module is designed to wisely choose the most suitable result as the output. This module takes non-local features as the input and predicts three scores for evaluating the quality of outputs from three branches. We exploit a few Conv-BN-ReLU layers and a Max-Pooling layer to decrease the number of channels and spatial resolution. We also use two fully-connected layers to predict scores for three branches. The detailed architecture of the branch selector is presented in Table 1. The branch selector module can forecast which branch produces the most accurate result in the current situation before any branch is run. Thus, we only need to run a single branch in one refinement, and this minimal requirement makes our refinement module efficient and accurate.

Layer name Output size Structure
conv1 32x16x16 3x3, 32, BatchNorm
conv2 16x16x16 3x3, 16, BatchNorm
max pool 16x8x8 2x2, stride=2
flatten 1024 -
fc1 512 512, BatchNorm
fc2 3 3
Table 1: Detailed architecture of the branch selector.
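The output sizes in Table 1 can be checked with simple shape arithmetic. The sketch below assumes a 16x16 input feature map and 3x3 convolutions with stride 1 and padding 1 (neither is stated explicitly, but both are consistent with the listed sizes); the helper names are illustrative.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def selector_shapes(in_size=16):
    """Reproduce the output sizes of Table 1 under the stated assumptions."""
    shapes = {}
    s = conv_out(in_size, kernel=3, padding=1)   # assumed padding=1 keeps 16x16
    shapes["conv1"] = (32, s, s)
    s = conv_out(s, kernel=3, padding=1)
    shapes["conv2"] = (16, s, s)
    s = conv_out(s, kernel=2, stride=2)          # 2x2 max pooling halves the size
    shapes["max pool"] = (16, s, s)
    shapes["flatten"] = (16 * s * s,)            # 16 * 8 * 8 = 1024
    shapes["fc1"] = (512,)
    shapes["fc2"] = (3,)                         # one score per prediction branch
    return shapes
```

The flattened vector has 1024 elements, so the two fully-connected layers are small and add little cost on top of the shared backbone.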

3.5 Training

Training Set Construction.

We use the training splits of LaSOT [9] and GOT-10K [15], VID, DET [37], COCO [27], Youtube-VOS [50] and some saliency datasets [53, 46, 38] to train the complete Alpha-Refine. We also develop two lite versions that use fewer datasets for fair comparison with IoU-Net [7, 3] and SiamMask [47]. Given a video sequence, two random frames separated by a bounded frame interval are first selected. Then, their groundtruths are processed by random translation and scaling to generate the reference and test boxes.


The perturbation is controlled by two scalar factors corresponding to scale and center, respectively, and draws its random terms from a standard normal distribution and a uniform distribution. Different factor values are used for the reference frame and for the test frame. After obtaining these boxes, we crop search regions whose areas are a fixed multiple of the bounding boxes’ areas. Finally, we resize them to a fixed resolution as the inputs of Alpha-Refine.

Training Approach.

The whole training stage is divided into two phases. In the first phase, we train Alpha-Refine without introducing the branch selector because the prediction quality of the three branches is still changing dynamically and therefore cannot provide stable labels for the branch selector. The losses of the bounding box, corner, and mask heads are denoted as $L_{box}$, $L_{corner}$, and $L_{mask}$, respectively. The total loss is the weighted sum of these three losses:

$$L = \lambda_{box} L_{box} + \lambda_{corner} L_{corner} + \lambda_{mask} L_{mask}$$

where $\lambda_{box}$, $\lambda_{corner}$ and $\lambda_{mask}$ are fixed scalar weights. In this phase, Alpha-Refine is trained in epochs of 4000 iterations each, with a batch size of 32. Given the abundance of the training data, we do not freeze any parameters of the backbone.

After the backbone and three prediction heads have been adequately trained, the second phase begins. In this phase, we only train the branch selector and leave all other parameters frozen. As in the first phase, reference and test images are passed through Alpha-Refine to obtain three different bounding box predictions. The IoUs between the predictions $B_{box}$, $B_{corner}$, $B_{mask}$ and the groundtruth $B_{gt}$ are computed, and the $\arg\max$ function is used to obtain the label for the branch selector:

$$label = \arg\max \big( IoU(B_{box}, B_{gt}),\ IoU(B_{corner}, B_{gt}),\ IoU(B_{mask}, B_{gt}) \big)$$

We train the branch selector using the cross-entropy loss in epochs of 200 iterations each. In both training phases, the Adam optimizer [18] is applied and the learning rate is halved at a fixed epoch interval. The complete models and source codes will be released.
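The labeling rule for the branch selector can be sketched as follows: the class label is the index of the branch whose box has the highest IoU with the groundtruth. The `(x1, y1, x2, y2)` box format, the branch order (box, corner, mask) and the tie-breaking toward the first branch are assumptions made for this example.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def selector_label(pred_boxes, gt_box):
    """Index of the prediction with the highest IoU against the groundtruth.

    pred_boxes: boxes from the (box, corner, mask) branches, in that assumed order
    Ties are resolved in favor of the earliest branch.
    """
    ious = [iou(p, gt_box) for p in pred_boxes]
    return max(range(len(ious)), key=ious.__getitem__)
```

With labels of this form, the selector is an ordinary three-way classifier, which matches the use of the cross-entropy loss.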

4 Experiments

In this work, we implement our algorithm with the PyTorch [33] deep learning library. The hardware platform is a PC with an Intel i9 CPU (64GB memory) and two NVIDIA RTX-2080Ti GPUs (11GB memory). We first perform comprehensive experiments on three popular tracking benchmarks: TrackingNet [31], LaSOT [9] and VOT2018 [20], together with five state-of-the-art base trackers: DiMP50 [3], ATOM [7], SiamRPN++ [23], RTMDNet [17] and ECO [6] to demonstrate Alpha-Refine’s ability to boost trackers’ performance. We denote SiamRPN++ as SiamRPNpp for conciseness in the experiment section. Then, we compare Alpha-Refine with other existing refinement modules: IoU-Net [7, 3] and SiamMask [47]. Finally, an ablation study is conducted to verify the effectiveness of the pixel-wise correlation, the non-local layer and the branch selector.

4.1 Comparison with the state-of-the-art


TrackingNet [31] is a popular large-scale short-term tracking benchmark. We evaluate various methods on its test set, which contains 511 sequences. For the test set, only the groundtruth of the first frame is given, and participants need to submit their tracking results to the evaluation server. The quantitative results are shown in Table 2. As the results show, every “Base Tracker+AR” combination outperforms its base tracker by a large margin. Especially for ECO and RTMDNet, which lack precise scale estimation, Alpha-Refine improves AUC by 12.0 and 14.7 percentage points, respectively. In addition, “DiMP50+AR” achieves an AUC of 77.5%, setting a new state-of-the-art record.

Tracker P(%) P_norm(%) AUC(%)
Staple [1] 47.0 60.3 52.8
CSRDCF [29] 48.0 62.2 53.4
SiamFC [2] 53.3 66.6 57.1
CFNet [44] 53.3 65.4 57.8
MDNet [32] 56.5 70.5 60.6
UPDT [4] 55.7 70.2 61.1
Dsiam [58] 59.1 73.3 63.8
Dsiam-Update [54] 62.5 75.2 67.7
GFS-DCF [51] 56.6 71.8 60.9
C-RPN [10] 61.9 74.6 66.9
ECO 55.9 71.0 61.2
ECO+AR 69.2 78.4 73.2
RTMDNet 53.3 69.4 58.4
RTMDNet+AR 69.4 78.7 73.1
SiamRPNpp 69.4 80.0 73.3
SiamRPNpp+AR 73.3 81.5 76.2
ATOM 64.8 77.1 70.3
ATOM+AR 72.5 80.9 75.9
DiMP50 68.7 80.1 74.0
DiMP50+AR 74.4 82.5 77.5
Table 2: State-of-the-art comparison on the TrackingNet test set. “Base Tracker+AR” represents base tracker + Alpha-Refine. The best three results are marked in red, green and blue bold fonts respectively. Better viewed in color with zoom-in.


LaSOT [9] is a large-scale long-term tracking benchmark, which consists of 1400 videos, 280 of which form the test set. LaSOT ranks trackers using Success and Norm Precision. The quantitative results are shown in Figure 4. It can be seen that Alpha-Refine significantly elevates all base trackers’ performance. Specifically, the success curve of “Base Tracker+AR” is significantly higher than that of the base tracker when the overlap threshold is higher than 0.5. The AUC improvement is particularly large for RTMDNet, and DiMP50+AR breaks the previous state-of-the-art AUC record (values are reported in Figure 4).

Figure 4: Evaluation results on the LaSOT benchmark. The best three results are marked in red, green and blue bold fonts respectively. Better viewed in color with zoom-in.


The VOT2018 benchmark [20] includes 60 challenging videos, whose annotations are rotated bounding boxes. VOT2018 has three performance measures: accuracy, robustness and EAO. Accuracy denotes the mean overlap of successfully tracked frames. Robustness represents the failure rate. The final ranking measure is EAO (Expected Average Overlap), which simultaneously considers trackers’ accuracy and robustness. The results on the VOT2018 benchmark are shown in Table 3. It can be seen that most “Base Tracker+AR” combinations significantly outperform their base trackers in terms of EAO. When combined with Alpha-Refine, DiMP50 improves its EAO to 0.476, achieving new state-of-the-art performance. Although SiamRPNpp+AR’s EAO is slightly lower than that of SiamRPNpp, SiamRPNpp+AR achieves clearly higher accuracy than SiamRPNpp, which implies that Alpha-Refine does produce more precise bounding boxes. A potential reason for the EAO drop is that the hyperparameters of SiamRPNpp on the VOT benchmark have been carefully tuned. If SiamRPNpp+AR were tuned in the same way, its performance could be further improved.

Tracker EAO(↑) Accuracy(↑) Robustness(↓)
[55] 0.301 0.520 0.410
[11] 0.274 - -
[5] 0.328 0.494 0.234
[10] 0.289 - -
[45] 0.338 0.580 0.300
[40] 0.316 0.500 0.234
[54] 0.393 - -
[25] 0.247 0.507 0.375
[51] 0.397 0.511 0.143
[47] 0.347 0.602 0.288
ECO 0.350 0.554 0.243
ECO+AR 0.393 0.602 0.234
RTMDNet 0.253 0.540 0.407
RTMDNet+AR 0.286 0.596 0.393
SiamRPNpp 0.414 0.600 0.234
SiamRPNpp+AR 0.400 0.624 0.272
ATOM 0.401 0.590 0.204
ATOM+AR 0.421 0.611 0.197
DiMP50 0.440 0.597 0.153
DiMP50+AR 0.476 0.633 0.136
Table 3: State-of-the-art comparison on the VOT2018 benchmark. The best three results are marked in red, green and blue bold fonts respectively.

4.2 Comparison with IoU-Net and SiamMask

Due to their precise scale estimation and good transferability, IoU-Net [7, 3] and SiamMask [47] have been successfully used as refinement modules [21]. In our experiments, they serve as our competitors. For fair comparison, we utilize the same backbone and train our model using the same (or fewer) datasets. Alpha-Refine takes ResNet-50 as the backbone, which is the same as IoU-Net and SiamMask (in our experiments, IoU-Net denotes the scale estimation module of DiMP50 [3]). Moreover, since IoU-Net and SiamMask are not trained with the same datasets, we additionally develop two lite versions of Alpha-Refine, AR(IoU) and AR(Mask) for short, to make the comparison fair. To be specific, IoU-Net [3] is trained on the training splits of TrackingNet [31], LaSOT [9], GOT10K [15] and COCO [27]. Correspondingly, AR(IoU) utilizes only these datasets except for TrackingNet. Due to the lack of mask-level labels, AR(IoU) does not have a mask head. SiamMask is trained using Youtube-BBox [34], Youtube-VOS [50], VID, DET [37] and COCO [27]. Accordingly, AR(Mask) exploits only these datasets except for Youtube-BBox. The detailed experimental results are shown in Table 4. They show that all “Base Tracker+AR(IoU)” combinations achieve better performance than “Base Tracker+IoU”, and that “Base Tracker+AR(Mask)” outperforms “Base Tracker+Mask” for every base tracker except ECO, where ECO+Mask is slightly ahead in Norm Precision and Success, even though our methods use less training data.

Tracker Precision(%) Norm Precision(%) Success(%)
DiMP50+AR 74.4 82.5 77.5
DiMP50+AR(IoU) 73.0 81.9 77.1
DiMP50+IoU 68.2 79.0 73.5
DiMP50+AR(Mask) 73.7 82.2 77.0
DiMP50+Mask 69.5 79.8 75.1
DiMP50 68.7 80.1 74.0
ATOM+AR 72.5 80.9 75.9
ATOM+AR(IoU) 70.8 80.2 75.4
ATOM+IoU 66.8 77.9 72.3
ATOM+AR(Mask) 71.3 80.1 74.9
ATOM+Mask 68.7 79.4 74.3
ATOM 64.8 77.1 70.3
SiamRPNpp+AR 73.3 81.5 76.2
SiamRPNpp+AR(IoU) 72.3 81.3 76.3
SiamRPNpp+IoU 68.0 78.6 73.2
SiamRPNpp+AR(Mask) 72.1 80.6 75.2
SiamRPNpp+Mask 66.7 76.9 72.7
SiamRPNpp 69.4 80.0 73.3
RTMDNet+AR 69.4 78.7 73.1
RTMDNet+AR(IoU) 67.5 78.1 72.7
RTMDNet+IoU 65.3 77.0 70.5
RTMDNet+AR(Mask) 68.8 78.5 72.7
RTMDNet+Mask 67.5 78.4 72.5
RTMDNet 53.3 69.4 58.4
ECO+AR 69.2 78.4 73.2
ECO+AR(IoU) 66.3 76.7 71.8
ECO+IoU 62.1 74.5 68.0
ECO+AR(Mask) 68.0 77.7 72.4
ECO+Mask 67.4 78.2 73.1
ECO 55.9 71.0 61.2
Table 4: Comparison with other refinement modules on the test-set of TrackingNet. “Base Tracker+IoU” and “Base Tracker+Mask” represent the base tracker + IoU-Net and the base tracker + SiamMask respectively. The results of base trackers and “Base Tracker+AR” are annotated with blue and red respectively.

4.3 Ablation Study

In the ablation study, the effectiveness of the pixel-wise correlation, the non-local module and the branch selector is examined in turn.

Pixel-wise correlation vs Naive and Depth-wise correlation.

To demonstrate the superiority of pixel-wise correlation to other kinds of correlation methods, we also implement two variants: AR(Naive) and AR(Depth), which fuse features by naive correlation and depth-wise correlation respectively. As shown in Table 5, pixel-wise correlation brings better performance than naive correlation and depth-wise correlation, thus proving the superiority of pixel-wise correlation in fine feature fusion.

The Non-local Module.

To show the effectiveness of the non-local layer, we implement a variant without the non-local operation, whose results are denoted as “Tracker+AR(woNL)”. As Table 5 shows, “Tracker+AR(woNL)” performs worse than “Tracker+AR”, which verifies the effectiveness of the non-local module.

The branch selector.

To validate the advantage of the proposed branch selector, we implement the following variants: AR(BBox), AR(Corner), AR(Mask), and AR(Average). The first three always use the output of the bounding box branch, the corner branch, or the mask branch, respectively, as the final result. “Tracker+AR(Average)” denotes that the refinement module first gets predictions from all three branches and then takes their average as the final result. As shown in Table 5, the refinement with our branch selector (“Tracker+AR”) obtains the best EAO and robustness.

Tracker EAO(↑) Accuracy(↑) Robustness(↓)
DiMP50+AR 0.476 0.633 0.136
DiMP50+AR(woNL) 0.458 0.622 0.150
DiMP50+AR(Naive) 0.439 0.628 0.169
DiMP50+AR(Depth) 0.435 0.624 0.159
DiMP50+AR(Average) 0.438 0.629 0.155
DiMP50+AR(BBox) 0.375 0.570 0.187
DiMP50+AR(Corner) 0.441 0.624 0.145
DiMP50+AR(Mask) 0.446 0.642 0.192
DiMP50 0.440 0.597 0.153
Table 5: Ablation study on the VOT2018 benchmark. The results of base trackers and “Base Tracker+AR” are annotated with blue and red respectively.

4.4 Further Analysis


Alpha-Refine not only boosts the tracking performance of base trackers but also runs at a remarkable speed. The speed is tested on an NVIDIA RTX-2080Ti. When only a single branch (backbone + feature aggregation + one branch) is used, AR(BBox), AR(Corner) and AR(Mask) run at 150 FPS, 130 FPS and 75 FPS, respectively. When combined with base trackers, the speeds are summarized in Table 6. It can be seen that, after being combined with our refinement module, these base trackers can still run in real time.

Speed(FPS) 49 51 70 40 37
Speed(AR)(FPS) 35 39 46 32 31
Table 6: Speeds of various trackers. The first row represents the original speeds of base trackers. The second row represents the speeds of “Base Tracker + Alpha-Refine”.

Multiple branches.

In this part, the behavior of the three branches is further discussed. The IoU curves between multiple predictions and groundtruths on the training set are shown in Figure 5. It can be observed that the result from the mask head is poor at the start of training, but it improves quickly and surpasses the bounding box head by the end of the first epoch. After abundant training, the corner head and the mask head produce higher-quality results (IoU around 0.8) than the bounding box head (IoU=0.7). This also indicates that the corner head is better able to produce precise results than the bounding box head, even though the bounding box branch is given a larger weight and is directly optimized with IoU. In addition, more qualitative results are provided in Figure 3. Although the mean IoU of the box branch is lower than that of the other two branches, the box branch can still produce the most accurate result in some cases, as shown in the first row of Figure 3. So with the help of the branch selector, all three branches can make their own unique contribution.

Figure 5: Illustration of mean IoUs between multiple predictions and groundtruth on the training set. The red, green and blue curves represent the bounding box branch, the corner branch and the mask branch respectively. To observe the growth of the red curve, it is additionally shown in subfigure (b). Better viewed in color with zoom-in.

5 Conclusion

In this work, we propose a novel and precise refinement module called Alpha-Refine. Our contributions can be summarized as follows. First, this work is the first to design a universal refinement module. Specifically, Alpha-Refine can be seamlessly combined with existing trackers without the need for joint training or fine-tuning. Second, this work proposes a few effective principles for designing a high-performance refinement module: (1) finely aggregating features and capturing global information brings better results; (2) multiple prediction heads produce more diverse and reliable results; (3) the branch selector helps to choose the optimal result. Third, we apply the Alpha-Refine module to five well-known, top-performing trackers and conduct extensive evaluations on three popular benchmarks. The experimental results show that Alpha-Refine consistently improves tracking performance with little computational overhead.

Appendix A Qualitative Results

Alpha-Refine has three parallel prediction heads, which predict the bounding box, the corners and the mask respectively. In this section, a large number of qualitative results are shown to illustrate Alpha-Refine’s ability to produce precise results.

A.1 Quality of the predicted corners

Figure 6 shows the corners (and the corresponding boxes) predicted by Alpha-Refine. Alpha-Refine still produces reliable corners even when motion blur, distractors and occlusion occur. This illustrates that the corner branch is robust to challenging factors in the tracking process.

Figure 6: Predicted corners of Alpha-Refine. From top to bottom, the three rows correspond to three challenging situations: motion blur, distractors and occlusion. Each small picture is a search region with two corner heatmaps overlaid on it. The green bounding boxes are derived from the predicted corners. Better viewed in color with zoom-in.
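
As an illustration of how two corner heatmaps can be turned into a box, the sketch below takes the peak of each heatmap. It assumes the two heatmaps correspond to the top-left and bottom-right corners and uses a hard argmax for simplicity; the decoding actually used by the corner branch may differ, and the function names are hypothetical.

```python
def argmax_2d(heatmap):
    """Return (x, y) of the largest value in a 2D list-of-lists heatmap."""
    best_val, best_xy = float("-inf"), (0, 0)
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_val:
                best_val, best_xy = value, (x, y)
    return best_xy


def corners_to_box(top_left_heatmap, bottom_right_heatmap):
    """Build (x1, y1, x2, y2) from the peaks of two corner heatmaps.

    Assumption: one heatmap per corner, both defined on the search region grid.
    """
    x1, y1 = argmax_2d(top_left_heatmap)
    x2, y2 = argmax_2d(bottom_right_heatmap)
    return (x1, y1, x2, y2)


# Toy 4x4 heatmaps (invented values) with clear peaks.
tl = [
    [0.1, 0.9, 0.0, 0.0],
    [0.0, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
br = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.3, 0.8],
    [0.0, 0.0, 0.0, 0.2],
]

box = corners_to_box(tl, br)
print(box)
```

The resulting coordinates are in heatmap grid units; mapping them back to image coordinates would require the search-region crop and scale, which are not covered here.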

A.2 Quality of the predicted masks

Figure 7 compares the masks predicted by Alpha-Refine and SiamMask. After zooming in on Figure 7, it can be seen that our masks are sharper and more robust to cluttered backgrounds than those of SiamMask. For example, in the videos bird1 and shaking, the masks from SiamMask are rough and broken respectively, whereas Alpha-Refine accurately segments the targets' contours and predicts complete masks. Besides, in the videos bolt1 and dinosaur, SiamMask cannot distinguish the target from the background, leading to inferior mask predictions, while Alpha-Refine still produces high-quality masks that contain only the tracked targets.

Figure 7: Predicted masks of Alpha-Refine and SiamMask [47]. The golden and blue masks are from Alpha-Refine and SiamMask respectively. The rotated boxes are generated from the masks using cv2.minAreaRect(). Better viewed in color with zoom-in.
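
The rotated boxes in Figure 7 are obtained with cv2.minAreaRect(). As a simpler, dependency-free illustration of turning a mask into a box, the sketch below computes the tightest axis-aligned box around the foreground pixels. This is a deliberately simplified stand-in, not the rotated-box procedure used for the figure; the threshold of 0.5 and the function name are assumptions.

```python
def mask_to_box(mask, threshold=0.5):
    """Tightest axis-aligned box (x1, y1, x2, y2) around pixels above threshold.

    The mask is a 2D list of scores; x2 and y2 are exclusive.
    Returns None when no pixel exceeds the threshold.
    """
    xs, ys = [], []
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value > threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


# Toy 4x5 mask (invented values): foreground occupies columns 1..3, rows 1..2.
mask = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.6, 0.0],
    [0.0, 0.1, 0.0, 0.0, 0.0],
]

print(mask_to_box(mask))
```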

A.3 Video Demos

To further illustrate the influence of Alpha-Refine on the tracking results, we provide some video demos in the videos folder. Figure 8 is a typical example. It can be seen from the videos that Alpha-Refine provides more accurate bounding boxes than DiMP50, boosting its tracking performance significantly.

Figure 8: Comparison between results of DiMP50+AR and DiMP50. The green, red and blue boxes correspond to the groundtruths, the results of DiMP50+AR and the ones of DiMP50 respectively. Also, the IoU is shown in the top-left corner of the image. Better viewed in color with zoom-in.


  • [1] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. S. Torr (2016) Staple: Complementary learners for real-time tracking. In CVPR, Cited by: Table 2.
  • [2] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr (2016) Fully-convolutional siamese networks for object tracking. In ECCV Workshop, Cited by: §1, §2, §3.2, Table 2.
  • [3] G. Bhat, M. Danelljan, L. V. Gool, and R. Timofte (2019) Learning discriminative model prediction for tracking. In ICCV, Cited by: §1, §1, §2, §2, §3.5, §4.2, §4, footnote 3.
  • [4] G. Bhat, J. Johnander, M. Danelljan, F. Shahbaz Khan, and M. Felsberg (2018) Unveiling the power of deep tracking. In ECCV, Cited by: §1, Table 2.
  • [5] K. Dai, D. Wang, H. Lu, C. Sun, and J. Li (2019) Visual tracking via adaptive spatially-regularized correlation filters. In CVPR, Cited by: Table 3.
  • [6] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg (2017) ECO: efficient convolution operators for tracking. In CVPR, Cited by: §1, §1, §2, §4.
  • [7] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg (2019) ATOM: Accurate tracking by overlap maximization. In CVPR, Cited by: §1, §1, §2, §2, §3.5, §4.2, §4.
  • [8] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian (2019) Centernet: keypoint triplets for object detection. In ICCV, Cited by: §3.3.
  • [9] H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and H. Ling (2019) LaSOT: a high-quality benchmark for large-scale single object tracking. In CVPR, Cited by: §1, §3.5, §4.1, §4.2, §4.
  • [10] H. Fan and H. Ling (2019) Siamese cascaded region proposal networks for real-time visual tracking. In CVPR, Cited by: §1, §2, Table 2, Table 3.
  • [11] J. Gao, T. Zhang, and C. Xu (2019) Graph convolutional tracking. In CVPR, Cited by: Table 3.
  • [12] P. Gao, R. Yuan, F. Wang, L. Xiao, H. Fujita, and Y. Zhang (2019) Siamese attentional keypoint network for high performance visual tracking. Knowledge-Based Systems, pp. 105448. Cited by: §3.3, §3.3.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §3.1.
  • [14] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista (2008) High-speed tracking with kernelized correlation filters. In ICVS, Cited by: §2.
  • [15] L. Huang, X. Zhao, and K. Huang (2019) Got-10k: a large high-diversity benchmark for generic object tracking in the wild. TPAMI. Cited by: §3.5, §4.2.
  • [16] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang (2018) Acquisition of localization confidence for accurate object detection. In ECCV, Cited by: §2, §3.1.
  • [17] I. Jung, J. Son, M. Baek, and B. Han (2018) Real-time MDNet. In ECCV, Cited by: §1, §2, §4.
  • [18] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.5.
  • [19] T. Kong, F. Sun, H. Liu, Y. Jiang, and J. Shi (2019) Foveabox: beyond anchor-based object detector. arXiv preprint arXiv:1904.03797. Cited by: §2.
  • [20] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Čehovin Zajc, T. Vojir, G. Bhat, A. Lukezic, A. Eldesokey, et al. (2018) The sixth visual object tracking vot2018 challenge results. In ECCV, Cited by: §1, §3.3, §4.1, §4.
  • [21] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, R. Pflugfelder, J. Kamarainen, L. Cehovin Zajc, O. Drbohlav, A. Lukezic, A. Berg, et al. (2019) The seventh visual object tracking vot2019 challenge results. In ICCV Workshops, Cited by: §2, §3.3, §4.2.
  • [22] H. Law and J. Deng (2018) Cornernet: detecting objects as paired keypoints. In ECCV, Cited by: §2, §3.3.
  • [23] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan (2019) SiamRPN++: Evolution of siamese visual tracking with very deep networks. In CVPR, Cited by: §1, §1, §2, §3.2, §4.
  • [24] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu (2018) High performance visual tracking with siamese region proposal network. In CVPR, Cited by: §2, §3.2.
  • [25] P. Li, B. Chen, W. Ouyang, D. Wang, X. Yang, and H. Lu (2019) GradNet: gradient-guided network for visual object tracking. In ICCV, Cited by: Table 3.
  • [26] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR, Cited by: §3.3.
  • [27] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In ECCV, Cited by: §3.5, §4.2.
  • [28] A. Lukežič, J. Matas, and M. Kristan (2019) D3S–a discriminative single shot segmentation tracker. arXiv preprint arXiv:1911.08862. Cited by: §2, §3.3.
  • [29] A. Lukezic, T. Vojir, L. Čehovin Zajc, J. Matas, and M. Kristan (2017) Discriminative correlation filter with channel and spatial reliability. In CVPR, Cited by: Table 2.
  • [30] D. C. Luvizon, H. Tabia, and D. Picard (2019) Human pose regression by combining indirect part detection and contextual information. Computers & Graphics 85, pp. 15–22. Cited by: §3.3.
  • [31] M. Muller, A. Bibi, S. Giancola, S. Alsubaihi, and B. Ghanem (2018) TrackingNet: a large-scale dataset and benchmark for object tracking in the wild. In ECCV, Cited by: §1, §4.1, §4.2, §4.
  • [32] H. Nam and B. Han (2016) Learning multi-domain convolutional neural networks for visual tracking. In CVPR, Cited by: §1, §2, Table 2.
  • [33] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.
  • [34] E. Real, J. Shlens, S. Mazzocchi, X. Pan, and V. Vanhoucke (2017) YouTube-boundingboxes: a large high-precision human-annotated data set for object detection in video. In CVPR, Cited by: §4.2.
  • [35] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, Cited by: §2, §2.
  • [36] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In CVPR, Cited by: §3.3.
  • [37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, and M. Bernstein (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §3.5, §4.2.
  • [38] J. Shi, Q. Yan, L. Xu, and J. Jia (2015) Hierarchical image saliency detection on extended cssd. TPAMI 38 (4). Cited by: §3.5.
  • [39] C. Sun, D. Wang, H. Lu, and M. Yang (2018) Correlation tracking via joint discrimination and reliability learning. In CVPR, Cited by: §1, §2.
  • [40] Y. Sun, C. Sun, D. Wang, Y. He, and H. Lu (2019) Roi pooled correlation filters for visual tracking. In CVPR, Cited by: Table 3.
  • [41] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In CVPR, Cited by: §2.
  • [42] R. Tao, E. Gavves, and A. W. M. Smeulders (2016) Siamese instance search for tracking. In CVPR, Cited by: §1, §2.
  • [43] Z. Tian, C. Shen, H. Chen, and T. He (2019) FCOS: fully convolutional one-stage object detection. In ICCV, Cited by: §2.
  • [44] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. Torr (2017) End-to-end representation learning for correlation filter based tracking. In CVPR, Cited by: Table 2.
  • [45] G. Wang, C. Luo, Z. Xiong, and W. Zeng (2019) SPM-tracker: series-parallel matching for real-time visual object tracking. In CVPR, Cited by: §1, §2, Table 3.
  • [46] L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, and X. Ruan (2017) Learning to detect salient objects with image-level supervision. In CVPR, Cited by: §3.5.
  • [47] Q. Wang, L. Zhang, L. Bertinetto, W. Hu, and P. H. S. Torr (2019) Fast online object tracking and segmentation: A unifying approach. In CVPR, Cited by: Figure 7, §1, §1, §2, §2, §3.2, §3.3, §3.5, §4.2, Table 3, §4.
  • [48] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR, Cited by: §3.2.
  • [49] Z. Wang, J. Xu, L. Liu, F. Zhu, and L. Shao (2019) Ranet: ranking attention network for fast video object segmentation. In ICCV, Cited by: §3.2.
  • [50] N. Xu, L. Yang, Y. Fan, J. Yang, D. Yue, Y. Liang, B. Price, S. Cohen, and T. Huang (2018) Youtube-vos: sequence-to-sequence video object segmentation. In ECCV, pp. 585–601. Cited by: §3.5, §4.2.
  • [51] T. Xu, Z. Feng, X. Wu, and J. Kittler (2019) Joint group feature selection and discriminative filter learning for robust visual object tracking. In ICCV, Cited by: Table 2, Table 3.
  • [52] Y. Xu, Z. Wang, Z. Li, Y. Ye, and G. Yu (2019) SiamFC++: towards robust and accurate visual tracking with target estimation guidelines. arXiv preprint arXiv:1911.06188. Cited by: §2.
  • [53] C. Yang, L. Zhang, H. Lu, X. Ruan, and M. Yang (2013) Saliency detection via graph-based manifold ranking. In CVPR, Cited by: §3.5.
  • [54] L. Zhang, A. Gonzalez-Garcia, J. v. d. Weijer, M. Danelljan, and F. S. Khan (2019) Learning the model update for siamese trackers. In ICCV, Cited by: Table 2, Table 3.
  • [55] Z. Zhang and H. Peng (2019) Deeper and wider siamese networks for real-time visual tracking. In CVPR, Cited by: Table 3.
  • [56] X. Zhou, D. Wang, and P. Krähenbühl (2019) Objects as points. arXiv preprint arXiv:1904.07850. Cited by: §2, §3.3.
  • [57] X. Zhou, J. Zhuo, and P. Krahenbuhl (2019) Bottom-up object detection by grouping extreme and center points. In CVPR, Cited by: §3.3.
  • [58] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu (2018) Distractor-aware siamese networks for visual object tracking. In ECCV, Cited by: §2, §3.2, Table 2.