[Figure 1 legend: ground-truth box, predicted box, anchor. RPN++: SiamRPN++; FC++: SiamFC++ (ours).]
Generic visual tracking aims at locating a moving object sequentially in a video, given very limited information, often only the annotation of the first frame. As a fundamental building block in various areas of computer vision, the task has a variety of applications such as UAV-based monitoring [mueller2016benchmark] and surveillance systems [kokkeby2015methods]. One unique characteristic of generic object tracking is that no prior knowledge (e.g., the object class) about the object or its surrounding environment is allowed [huang2018got].
The tracking problem can be treated as the combination of a classification task and an estimation task [danelljan2019atom]. The first task aims at providing a robust coarse location of the target via classification. The second task is then to estimate an accurate target state, often represented by a bounding box. While modern trackers have achieved significant progress, their methods for the second task (i.e., target state estimation) differ surprisingly widely. Based on this aspect, previous methods can be roughly divided into three categories. The first category, including the Discriminative Correlation Filter (DCF) [henriques2014high-speed, bolme2010visual] and SiamFC [bertinetto2016fully], employs a brute-force multi-scale test, which is inaccurate [danelljan2019atom] and inefficient [li2018high]. Also, the prior assumption that the target scale/ratio changes at a fixed rate between adjacent frames often does not hold in reality. In the second category, ATOM [danelljan2019atom] iteratively refines multiple initial bounding boxes via gradient ascent to estimate the target bounding box [jiang2018acquisition], which yields a significant improvement in accuracy. However, this target estimation method brings not only a heavy computational burden but also many additional hyper-parameters (e.g., the number and distribution of initial boxes) that require careful tuning. The third category is the SiamRPN tracker family [li2018high, zhu2018distractor, li2019siamrpn++], which performs accurate and efficient target state estimation by introducing the Region Proposal Network (RPN) [ren2015faster]. However, the pre-defined anchor settings not only introduce ambiguous similarity scoring that severely hinders robustness (see Section 4) but also need access to prior information about the data distribution, which is clearly against the spirit of generic object tracking [huang2018got].
Motivated by the aforementioned analysis, we propose a set of guidelines for high-performance generic object tracker design:
G1: decomposition of classification and state estimation
The tracker should perform two sub-tasks: classification and state estimation. Without a powerful classifier, the tracker cannot discriminate the target from background or distractors, which severely hinders its robustness [zhu2018distractor]. Without an accurate estimation result, the accuracy of the tracker is fundamentally limited [danelljan2019atom]. Brute-force multi-scale test approaches largely ignore the latter task and therefore suffer from inefficiency and low accuracy.
G2: non-ambiguous scoring. The classification score should directly represent the confidence that the target exists in the "field of view", i.e., the sub-window of the corresponding pixel, rather than in pre-defined settings such as anchor boxes. As a negative example, matching between objects and anchors (e.g., the anchor-based RPN branch) is prone to deliver false positive results, leading to tracking failure (see Section 4 for more details).
G3: prior knowledge-free. Tracking approaches should be free of prior knowledge such as the scale/ratio distribution, in keeping with the spirit of generic object tracking [huang2018got]. Dependency on prior knowledge of the data distribution is widespread in existing methods and hinders their generalization ability.
G4: estimation quality assessment. As shown in previous research [jiang2018acquisition, tian2019fcos], directly using classification confidence for bounding box selection degrades performance. An estimation quality score independent of classification should be used, as in many previous works on both object detection and tracking [jiang2018acquisition, tian2019fcos, danelljan2019atom]. The astonishing accuracy of the second branch (e.g., ATOM and DiMP) largely comes from this guideline, while the other branches still overlook it, leaving room for further improvement in estimation accuracy.
Following the guidelines above, we design our SiamFC++ method based on fully-convolutional siamese trackers [bertinetto2016fully], where each pixel of the feature map directly corresponds to a translated sub-window on the search image due to the fully convolutional nature of the network. We add a regression head for accurate target estimation, in parallel with the classification head (G1). Since the pre-defined anchor settings are removed, the matching ambiguity (G2) and the prior knowledge (G3) about the target scale/ratio distribution are also removed. Finally, following G4, an estimation quality assessment branch is added to favor bounding boxes of high quality.
Our contributions are threefold:
By identifying the unique characteristics of tracking, we devise a set of practical guidelines for target state estimation in modern tracker design.
We design a simple but powerful SiamFC++ tracker with the application of our proposed guidelines. Extensive experiments and comprehensive analyses demonstrate the effectiveness of our proposed guidelines.
Our approach achieves state-of-the-art results on five challenging benchmarks. To the best of our knowledge, our SiamFC++ is the first tracker that achieves an AUC score of 75.4 on the large-scale TrackingNet dataset [muller2018trackingnet] while running at over 90 FPS.
[Figure 2 legend: feature extractor, classification branch, regression branch, quality assessment. Operations: cross-correlation, element-wise product, argmax (taking left w.r.t. right).]
2 Related Works
Modern trackers can be roughly divided into three branches by their way of target state estimation.
Some of them, including DCF [henriques2014high-speed, bolme2010visual] and SiamFC [bertinetto2016fully], use a multi-scale test to estimate the target scale. Concretely, by rescaling the search patch into multiple scales and assembling a mini-batch of scaled images, the algorithm picks the scale corresponding to the highest classification score as the predicted target scale in the current frame. This strategy is fundamentally limited since bounding box estimation is inherently a challenging task, requiring a high-level understanding of the pose of objects [danelljan2019atom].
Inspired by DCF and IoU-Net [jiang2018acquisition], ATOM [danelljan2019atom] tracks the target by sequential classification and estimation. The coarse initial location of the target obtained by classification is iteratively refined for accurate box estimation. This approach yields a significant improvement in accuracy, but the multiple random initializations of bounding boxes in each frame and the multiple back-propagation passes in iterative refinement bring a heavy computational burden and greatly slow down ATOM. Moreover, ATOM introduces many additional hyper-parameters that require careful tuning.
Another branch, SiamRPN and its succeeding works [li2018high, zhu2018distractor, li2019siamrpn++], appends a Region Proposal Network after a siamese network, achieving previously unseen accuracy. The RPN regresses the location shift and size difference between pre-defined anchor boxes and the target location. However, the RPN structure is much better suited to object detection, in which a high recall rate is required, whereas in visual tracking one and only one object should be tracked. Also, the ambiguous matching between anchor box and object severely hinders the robustness of the tracker (see Section 4). Finally, the anchor setting does not comply with the spirit of generic object tracking, since it requires pre-defined hyper-parameters describing the anchor shapes.
Despite its many unique characteristics, the visual tracking task still has a lot in common with object detection, which makes it possible for each task to benefit from the other. For example, the RPN structure first devised in Faster-RCNN [ren2015faster] achieves astonishing accuracy in SiamRPN [li2018high]. Inheriting from Faster-RCNN [ren2015faster], most state-of-the-art modern detectors, called anchor-based detectors, have adopted the RPN structure and the anchor box setting [ren2015faster, liu2016ssd, Li_2018_ECCV]. Anchor-based detectors classify pre-defined proposals called anchors as positive or negative patches, with an extra offset regression to refine the predicted bounding box locations. However, the hyper-parameters introduced by anchor boxes (e.g., the scale/ratio of anchor boxes) have been shown to have a great impact on the final accuracy and require heuristic tuning [cai2018cascade, tian2019fcos]. Researchers have tried various ways to design anchor-free detectors, such as predicting bounding boxes at points near the center of objects [redmon2016you, huang2015densebox], or detecting and grouping a pair of corners of a bounding box [law2018cornernet]. In this paper, we show that a simple pipeline based on carefully designed guidelines for target state estimation, inspired by [huang2015densebox, yu2016unitbox, tian2019fcos], can achieve state-of-the-art tracking performance.
3 SiamFC++: Fully Convolutional Siamese Tracker for Object Tracking
In this section, we describe our Fully Convolutional Siamese tracker++ framework in detail. Our SiamFC++ is based on SiamFC and progressively refined according to the proposed guidelines. As shown in Figure 2, the SiamFC++ framework consists of a siamese subnetwork for feature extraction and a region proposal subnetwork for both classification and regression.
Siamese-based Feature Extraction and Matching
The object tracking task can be viewed as a similarity learning problem [li2018high]. Concretely, a siamese network is trained offline and evaluated online to locate a template image within a larger search image. A siamese network consists of two branches. The template branch takes the target patch in the first frame as input (denoted as $z$), while the search branch takes the current frame as input (denoted as $x$). The siamese backbone, which shares parameters between the two branches, performs the same transform on the inputs $z$ and $x$ to embed them into a common feature space for subsequent tasks. A cross-correlation between the template patch and the search patch is performed in the embedding space:

$$f_i(z, x) = \psi_i(\phi(z)) \star \psi_i(\phi(x)), \quad i \in \{\text{cls}, \text{reg}\}$$

where $\star$ denotes the cross-correlation operation, $\phi(\cdot)$ denotes the siamese backbone for common feature extraction, $\psi_i(\cdot)$ denotes the task-specific layer, and $i$ denotes the sub-task type ("cls" for classification and "reg" for regression). In our implementation, we use two convolution layers for both $\psi_{\text{cls}}$ and $\psi_{\text{reg}}$ after common feature extraction to adjust the common features into the task-specific feature space. Note that the extracted features of $\psi_{\text{cls}}$ and $\psi_{\text{reg}}$ are of the same size.
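The matching step can be illustrated with a minimal NumPy sketch. It assumes a SiamFC-style cross-correlation that sums over channels and produces a single-channel response map; the function name `cross_correlate` and the toy feature shapes are illustrative assumptions, and the actual implementation may use a different correlation variant.

```python
import numpy as np

def cross_correlate(template_feat, search_feat):
    """Slide the template feature over the search feature ("valid" mode).

    template_feat: (C, hz, wz); search_feat: (C, hx, wx).
    Returns a response map of shape (hx - hz + 1, wx - wz + 1).
    Assumption: channels are summed, as in SiamFC-style matching.
    """
    c, hz, wz = template_feat.shape
    c2, hx, wx = search_feat.shape
    assert c == c2, "both branches must share the channel dimension"
    out = np.empty((hx - hz + 1, wx - wz + 1), dtype=np.float64)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = search_feat[:, i:i + hz, j:j + wz]
            out[i, j] = np.sum(window * template_feat)
    return out

# Toy example: plant the template inside an otherwise empty search feature.
z_feat = np.ones((2, 3, 3))
x_feat = np.zeros((2, 8, 8))
x_feat[:, 2:5, 4:7] = z_feat  # template placed at row 2, column 4
response = cross_correlate(z_feat, x_feat)
peak = np.unravel_index(np.argmax(response), response.shape)
```

Each entry of `response` scores one translated sub-window of the search feature, which is why a per-pixel prediction on this map corresponds to a per-sub-window prediction on the search image.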
Application of Design Guidelines in Head Network
Based on SiamFC, we progressively refine each part of our tracker following our guidelines.
Following G1, we design both a classification head and a regression head after the cross-correlation in the embedding space. For each pixel in the feature maps, the classification head takes the classification-specific correlation feature as input and classifies the corresponding image patch as either a positive or a negative patch, while the regression head takes the regression-specific correlation feature as input and outputs an additional offset regression to refine the predicted bounding box location. The structure of the heads is shown after the cross-correlation operation in Figure 2.
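Specifically, for classification, a location $(x, y)$ on the feature map is considered a positive sample if its corresponding location $\left(\lfloor s/2 \rfloor + xs, \lfloor s/2 \rfloor + ys\right)$ on the input image falls into the ground-truth bounding box. Otherwise, it is a negative sample. Here $s$ is the total stride of the backbone. For the regression target of each positive location $(x, y)$ on the feature map, the final layer predicts the distances from the corresponding location $\left(\lfloor s/2 \rfloor + xs, \lfloor s/2 \rfloor + ys\right)$ to the four sides of the ground-truth bounding box, denoted as a 4D vector $\mathbf{t}^* = (l^*, t^*, r^*, b^*)$. Hence, the regression targets for location $(x, y)$ can be formulated as

$$
\begin{aligned}
l^* &= \left(\lfloor s/2 \rfloor + xs\right) - x_0, & t^* &= \left(\lfloor s/2 \rfloor + ys\right) - y_0, \\
r^* &= x_1 - \left(\lfloor s/2 \rfloor + xs\right), & b^* &= y_1 - \left(\lfloor s/2 \rfloor + ys\right),
\end{aligned}
$$

where $(x_0, y_0)$ and $(x_1, y_1)$ denote the left-top and right-bottom corners of the ground-truth bounding box associated with point $(x, y)$.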
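The label assignment and regression targets described above can be sketched as follows. This is a minimal illustration under stated assumptions: each feature-map location maps to the input image with a half-stride offset (the FCOS-style convention of [tian2019fcos]), a location counts as positive only if it lies strictly inside the box, and the score-map size, stride value, and box coordinates are made-up example values. Any extra offset introduced by cropping or padding in a real pipeline is ignored.

```python
import numpy as np

def make_targets(score_size, stride, gt_box):
    """Build per-location labels and (l, t, r, b) regression targets.

    gt_box = (x0, y0, x1, y1) in input-image pixels.
    Returns cls of shape (H, W) and reg of shape (H, W, 4), indexed [y, x].
    """
    x0, y0, x1, y1 = gt_box
    ys, xs = np.meshgrid(np.arange(score_size), np.arange(score_size),
                         indexing="ij")
    # Assumed mapping from feature-map location to input-image location.
    px = stride // 2 + xs * stride
    py = stride // 2 + ys * stride
    reg = np.stack([px - x0, py - y0, x1 - px, y1 - py],
                   axis=-1).astype(np.float64)
    # Positive if all four distances are positive (point inside the box).
    cls = (reg.min(axis=-1) > 0).astype(np.int64)
    return cls, reg

# Hypothetical example: 5x5 score map, stride 8, box from (10, 10) to (30, 30).
cls_label, reg_target = make_targets(5, 8, (10, 10, 30, 30))
```

Because every location carries exactly one label and one target vector, no anchor shapes or anchor-to-object matching rules are needed.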
Each location on the feature maps of both the classification and regression heads corresponds to an image patch on the input image centered at the mapped input-image location. Following G2, we directly classify the corresponding image patch and regress the target bounding box at that location, as in many previous trackers [henriques2014high-speed, bolme2010visual, bertinetto2016fully]. In other words, our SiamFC++ directly views locations as training samples. In contrast, the anchor-based counterparts [li2018high, zhu2018distractor, li2019siamrpn++], which consider each location on the input image as the center of multiple anchor boxes, output multiple classification scores at the same location and regress the target bounding box with respect to these anchor boxes, leading to ambiguous matching between anchor and object. Although [li2018high, zhu2018distractor, li2019siamrpn++] have shown superior performance over [henriques2014high-speed, bolme2010visual, bertinetto2016fully] on various benchmarks, we empirically show that the ambiguous matching can result in serious issues (see Section 4 for more details). In our per-pixel prediction fashion, only one prediction is made at each pixel of the final feature map. Hence each classification score directly gives the confidence that the target is in the sub-window of the corresponding pixel, and our design is free of ambiguity to this extent.
Since SiamFC++ performs classification and regression w.r.t. the location, it is free of pre-defined anchor boxes and hence free of prior knowledge about the target data distribution (e.g., scale/ratio), which complies with G3.
In the sections above, we have not yet taken the target state estimation quality into consideration and directly use the classification score to select the final box. That can degrade localization accuracy, as [jiang2018acquisition] shows that classification confidence is not well correlated with localization accuracy. According to the analysis in [luo2016understanding], input pixels around the center of a sub-window have larger importance for the corresponding output feature pixel than the rest. Thus we hypothesize that feature pixels around the center of objects have a better estimation quality than others. Following G4, we add a simple yet effective quality assessment branch similar to [tian2019fcos, jiang2018acquisition] by appending a convolution layer in parallel with the convolution classification head, as shown in the right part of Figure 2. The output is supposed to estimate the Prior Spatial Score (PSS), which is defined as follows:

$$\text{PSS}^* = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}}$$

where $l^*$, $t^*$, $r^*$, and $b^*$ are the distances from the location to the left, top, right, and bottom sides of the ground-truth bounding box.
Note that PSS is not the only choice for quality assessment. As a variant, we can also predict the IoU score between predicted boxes and ground-truth boxes, similar to [jiang2018acquisition]:

$$\text{IoU}^* = \frac{\text{Intersection}(B, B^*)}{\text{Union}(B, B^*)}$$

where $B$ is the predicted bounding box and $B^*$ is its corresponding ground-truth bounding box.
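Both quality targets can be computed in a few lines. The sketch below assumes that PSS takes the center-ness form of [tian2019fcos], computed from the four distances between a positive location and the sides of the ground-truth box; the function names `prior_spatial_score` and `box_iou` are illustrative, not taken from the paper's code.

```python
import numpy as np

def prior_spatial_score(l, t, r, b):
    """Center-ness style score in (0, 1]; equals 1 at the box center.

    Assumes all four distances are positive (a positive location).
    """
    horizontal = np.minimum(l, r) / np.maximum(l, r)
    vertical = np.minimum(t, b) / np.maximum(t, b)
    return np.sqrt(horizontal * vertical)

def box_iou(box_a, box_b):
    """IoU of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

center_score = prior_spatial_score(10.0, 10.0, 10.0, 10.0)  # box center
corner_score = prior_spatial_score(18.0, 2.0, 2.0, 18.0)    # near a corner
overlap = box_iou((0, 0, 10, 10), (5, 0, 15, 10))
```

Note the difference between the two: PSS depends only on where the location sits inside the ground-truth box, whereas the IoU variant depends on the box actually predicted at that location.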
During inference, the score used for final box selection is computed by multiplying the PSS with the corresponding predicted classification score. In this way, bounding boxes far from the center of objects are heavily down-weighted, which improves the tracking accuracy.
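The selection rule can be sketched as below, assuming the classification and quality outputs are already probabilities on the same grid. The function name `select_box` and the toy values are illustrative, and the scale/ratio and motion penalties applied at inference time are deliberately left out.

```python
import numpy as np

def select_box(cls_score, quality_score, boxes):
    """Pick the box whose classification * quality score is highest.

    cls_score, quality_score: (H, W); boxes: (H, W, 4).
    """
    final = cls_score * quality_score
    idx = np.unravel_index(np.argmax(final), final.shape)
    return boxes[idx], float(final[idx]), idx

cls_score = np.array([[0.9, 0.8],
                      [0.2, 0.1]])
quality_score = np.array([[0.3, 0.9],
                          [0.5, 0.5]])
boxes = np.arange(16, dtype=np.float64).reshape(2, 2, 4)
best_box, best_score, best_idx = select_box(cls_score, quality_score, boxes)
```

In this toy case the location with the highest classification score (0.9) loses to its neighbor because its quality score is low, which is exactly the down-weighting effect described above.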
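We optimize a training objective as follows:

$$
L\left(\{p_{x,y}\}, q_{x,y}, \{\mathbf{t}_{x,y}\}\right) = \frac{1}{N_{\text{pos}}} \sum_{x,y} L_{\text{cls}}\left(p_{x,y}, c^*_{x,y}\right) + \frac{\lambda}{N_{\text{pos}}} \sum_{x,y} \mathbf{1}_{\{c^*_{x,y} > 0\}} L_{\text{quality}}\left(q_{x,y}, q^*_{x,y}\right) + \frac{\lambda}{N_{\text{pos}}} \sum_{x,y} \mathbf{1}_{\{c^*_{x,y} > 0\}} L_{\text{reg}}\left(\mathbf{t}_{x,y}, \mathbf{t}^*_{x,y}\right)
$$

where $\mathbf{1}_{\{\cdot\}}$ is the indicator function that takes 1 if the condition in the subscript holds and 0 otherwise, $L_{\text{cls}}$ denotes the focal loss [lin2017focal] for the classification result, $L_{\text{quality}}$ denotes the binary cross entropy (BCE) loss for quality assessment, and $L_{\text{reg}}$ denotes the IoU loss [yu2016unitbox] for the bounding box result. Here $N_{\text{pos}}$ is the number of positive samples and $\lambda$ is a balancing weight. We assign 1 to $c^*_{x,y}$ if $(x, y)$ is considered a positive sample, and 0 if it is a negative sample.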
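A compact NumPy sketch of such an objective is given below. Several choices are assumptions rather than details from the text: the focal-loss parameters `alpha` and `gamma` take the common defaults of [lin2017focal], the IoU loss uses the negative-log form of [yu2016unitbox], all terms are normalized by the number of positive samples, and the weight `lam` is a hypothetical value.

```python
import numpy as np

EPS = 1e-7

def focal_loss(p, c, alpha=0.25, gamma=2.0):
    """Per-location focal loss for binary labels c in {0, 1}."""
    p = np.clip(p, EPS, 1 - EPS)
    pt = np.where(c == 1, p, 1 - p)
    at = np.where(c == 1, alpha, 1 - alpha)
    return -at * (1 - pt) ** gamma * np.log(pt)

def bce_loss(q, q_star):
    """Per-location binary cross entropy for the quality branch."""
    q = np.clip(q, EPS, 1 - EPS)
    return -(q_star * np.log(q) + (1 - q_star) * np.log(1 - q))

def iou_loss(t, t_star):
    """-log(IoU) for boxes encoded as positive (l, t, r, b) distances."""
    area_p = (t[..., 0] + t[..., 2]) * (t[..., 1] + t[..., 3])
    area_g = (t_star[..., 0] + t_star[..., 2]) * (t_star[..., 1] + t_star[..., 3])
    inter_w = np.minimum(t[..., 0], t_star[..., 0]) + np.minimum(t[..., 2], t_star[..., 2])
    inter_h = np.minimum(t[..., 1], t_star[..., 1]) + np.minimum(t[..., 3], t_star[..., 3])
    inter = inter_w * inter_h
    union = area_p + area_g - inter
    return -np.log((inter + EPS) / (union + EPS))

def total_loss(p, q, t, c_star, q_star, t_star, lam=1.0):
    """Classification on all locations; quality and regression on positives."""
    pos = c_star > 0
    n_pos = max(int(pos.sum()), 1)
    l_cls = focal_loss(p, c_star).sum()
    l_quality = bce_loss(q, q_star)[pos].sum()
    l_reg = iou_loss(t[pos], t_star[pos]).sum()
    return (l_cls + lam * l_quality + lam * l_reg) / n_pos

# One positive and one negative location (toy values).
c_star = np.array([[1, 0]])
q_star = np.array([[1.0, 0.0]])
t_star = np.array([[[10.0, 10.0, 10.0, 10.0], [0.0, 0.0, 0.0, 0.0]]])
good = total_loss(np.array([[0.99, 0.01]]), np.array([[0.99, 0.5]]),
                  t_star.copy(), c_star, q_star, t_star)
bad = total_loss(np.array([[0.3, 0.7]]), np.array([[0.2, 0.5]]),
                 np.array([[[2.0, 2.0, 2.0, 2.0], [0.0, 0.0, 0.0, 0.0]]]),
                 c_star, q_star, t_star)
```

The indicator function appears here as the boolean mask `pos`: negative locations contribute to the classification term only, so their quality and regression outputs are left unconstrained.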
| No. | VID | Youtube | COCO & Det & LaSOT & GOT | Backbone | Head type | Head structure | Quality assessment | A | R | EAO | EAO |
In this work, we implement two versions of our tracker with different backbone architectures: one that adopts the modified version of AlexNet used in previous literature [bertinetto2016fully], denoted as SiamFC++-AlexNet, and another that uses GoogLeNet [szegedy2015going], denoted as SiamFC++-GoogLeNet. With lower computation cost, the latter achieves the same or even better performance on tracking benchmarks (see Table 4) than some previous methods using ResNet-50 [he2016deep]. Both networks are pretrained on ImageNet [krizhevsky2012imagenet], which has been proven practical for the tracking task [li2018high, zhu2018distractor]. We will release the code to facilitate further research.
We adopt ILSVRC-VID/DET [russakovsky2015imagenet], COCO [lin2014microsoft], Youtube-BB [real2017youtube], LaSOT [fan2019lasot] and GOT-10k [huang2018got] as our basic training set. Exceptions w.r.t. specific benchmarks are detailed in the following subsections. For video datasets, we extract image pairs from VID, LaSOT, and GOT-10k by choosing frame pairs within an interval of less than 100 (5 for Youtube-BB). For image datasets (COCO/ImageNet-DET), we generate training samples that include negative pairs [zhu2018distractor] to enhance the capacity of our model to distinguish distractors. We perform random shifting and scaling following a uniform distribution on the search image as data augmentation.
For the AlexNet version, we freeze the parameters from conv1 to conv3 and fine-tune conv4 and conv5. For layers without pretraining, we adopt a zero-centered Gaussian distribution with a standard deviation of 0.01 for initialization. We first train our model for 5 warm-up epochs with a linearly increasing learning rate, then use a cosine annealing learning rate schedule for the remaining 45 epochs, with 600k image pairs for each epoch. We choose stochastic gradient descent (SGD) with a momentum of 0.9 as the optimizer.
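The schedule described above (linear warm-up followed by cosine annealing) can be sketched as a plain function of training progress. The learning-rate values `lr_start`, `lr_peak`, and `lr_end` below are hypothetical placeholders chosen only for illustration; only the epoch counts (5 warm-up epochs, 45 annealing epochs) come from the text.

```python
import math

def learning_rate(epoch_progress, warmup_epochs=5, total_epochs=50,
                  lr_start=1e-6, lr_peak=1e-2, lr_end=0.0):
    """Learning rate at a (possibly fractional) epoch.

    lr_start, lr_peak and lr_end are assumed values, not the paper's.
    """
    if epoch_progress < warmup_epochs:
        # Linear warm-up from lr_start to lr_peak.
        frac = epoch_progress / warmup_epochs
        return lr_start + frac * (lr_peak - lr_start)
    # Cosine annealing from lr_peak down to lr_end.
    frac = (epoch_progress - warmup_epochs) / (total_epochs - warmup_epochs)
    frac = min(frac, 1.0)
    return lr_end + 0.5 * (lr_peak - lr_end) * (1.0 + math.cos(math.pi * frac))

schedule = [learning_rate(e) for e in range(51)]
```

Evaluating the function per iteration instead of per epoch (by passing a fractional `epoch_progress`) gives a smooth curve rather than a step-wise one.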
For the version implemented with GoogLeNet, we freeze stages 1 and 2, fine-tune stages 3 and 4, increase the base learning rate, and multiply the learning rate of the backbone parameters by 0.1 w.r.t. the global learning rate. We also reduce the number of image pairs per epoch to 300k, reduce the total number of epochs to 20 (5 for warm-up and 15 for training), and unfreeze the backbone parameters at the 10th epoch to avoid overfitting. For the experiment on the LaSOT benchmark [fan2019lasot] (Protocol II), we freeze the parameters in the backbone and further reduce the number of image pairs per epoch to 150k so that training with the relatively smaller amount of training data is stabilized.
The proposed tracker with the AlexNet backbone runs at 160 FPS on the VOT2018 short-term benchmark, while the one with the GoogLeNet backbone runs at about 90 FPS on the same benchmark, both evaluated on an NVIDIA RTX 2080Ti GPU.
The output of our model is a set of bounding boxes with their corresponding confidence scores. Scores are penalized based on the scale/ratio change of the corresponding boxes and their distance from the target position predicted in the last frame. The box with the highest penalized score is then chosen and used to update the target state.
From SiamFC towards SiamFC++
Although both employ a per-pixel prediction fashion, there is a significant performance gap between SiamFC and our SiamFC++. In this subsection we perform an ablation study on the VOT2018 dataset, with SiamFC as the baseline, aiming at identifying the key components for the improvement of tracking performance.
Results are shown in Table 1. Concretely, in the SiamFC baseline, the tracker only performs the classification task in its network, and target state estimation is done with a multi-scale test. We gradually update the SiamFC tracker by using extra training data (Line 1/1), applying a better head structure (Line 1), and adding the regression branch for accurate estimation to yield our proposed SiamFC++ tracker (Line 1). We further replace the AlexNet backbone with GoogLeNet, which is a more powerful visual feature extractor (Line 1).
The key components for tracking performance can be listed in descending order as follows: the regression branch (0.094), data source diversity (0.063/0.010), a stronger backbone (0.026), and a better head structure (0.020), where the EAO gain brought by each part is noted in parentheses. Note that these are the extra components of SiamRPN++ over SiamFC. After adding all the extra components to SiamFC, our SiamFC++ achieves superior performance with a smaller computation budget. Two further points are worth mentioning: 1) the robustness (R) of Line 1 surpasses that of the SiamRPN tracker [li2018high]; 2) the R of Line 1 is at the same level as that of DaSiamRPN [zhu2018distractor] while using less data (without COCO and DET) than the latter. These results indicate that, while the introduction of the RPN module and the anchor box setting undoubtedly gives better accuracy, robustness is not improved and is even hindered. We attribute this to the violation of our proposed guidelines.
Quality Assessment Choice
On the GOT-10k val subset, we obtain an AO of 77.8 for the tracker predicting PSS and an AO of 78.0 for the tracker predicting IoU. These experiments were conducted with SiamFC++-GoogLeNet. We finally choose PSS as the implementation of our approach in this paper because of its stability, which we observed empirically across datasets during our experiments.
Results on Several Benchmarks
We test our tracker on several benchmarks and results are gathered in Table 2.
Results on OTB2015 Benchmark
As one of the most classical benchmarks for the object tracking task, the OTB benchmark [wu2013online] provides a fair test for all families of trackers. We conduct experiments on OTB2015 [wu2013online] which contains 100 videos for tracker performance evaluation. With a success score of 0.682, our tracker reaches the state-of-the-art level w.r.t. other trackers in comparison.
Results on VOT Benchmark
VOT2018 [kristan2018sixth] contains 60 video sequences with several challenging attributes including fast motion and occlusion. We test our tracker on this benchmark and present the results in Table 2. Both versions of our tracker reach scores comparable to current state-of-the-art trackers: the tracker with the AlexNet backbone outperforms other trackers running at the same tracking speed, while the tracker with the GoogLeNet backbone yields a comparable score. Besides, our tracker has a significant advantage in robustness among the trackers in comparison. To the best of our knowledge, this is the first tracker that achieves an EAO of 0.400 on the VOT2018 [kristan2018sixth] benchmark while running at a speed of over 100 FPS, which demonstrates its potential for real production use.
Results on LaSOT Benchmark
With a large number of video sequences (1400 sequences under Protocol I and 280 under Protocol II), the LaSOT [fan2019lasot] (Large-scale Single Object Tracking) benchmark makes it impossible for trackers to overfit the benchmark, which serves the purpose of testing the real performance of object tracking. Following Protocol II, under which trackers are trained on the LaSOT train subset and evaluated on the LaSOT test subset, the proposed SiamFC++ tracker achieves better performance, even compared with trackers that outperform ours on the VOT2018 benchmark. This reveals that the scale of the benchmark influences the ranking of trackers.
Results on GOT-10k Benchmark
To test generalization across target classes, we train and test our SiamFC++ model on the GOT-10k [huang2018got] (Generic Object Tracking-10k) benchmark. Besides being a large-scale dataset (10,000 videos in the train subset and 180 in both the val and test subsets), it is also challenging because it requires generic object trackers to be category-agnostic: there is no class intersection between the train and test subsets. We follow the GOT-10k protocol and train our tracker only on the train subset. Our tracker with the AlexNet backbone reaches an AO of 53.5, surpassing SiamRPN++ by 1.7, while our tracker with the GoogLeNet backbone yields 59.5, which is even superior to ATOM, which uses an online updating method. This result shows the ability of our tracker to generalize even when the target classes are unseen during the training phase, which matches the demands of generic tracking.
Results on TrackingNet Benchmark
We evaluate our approach on the 511 videos provided in the test split of TrackingNet [muller2018trackingnet]. We exclude the Youtube-BB dataset from our training data in order to avoid data leakage. As described in [muller2018trackingnet], the evaluation server calculates the following three metrics based on the tracking results: success rate, precision, and normalized precision. Our SiamFC++-GoogLeNet outperforms the current state-of-the-art methods (including online-update methods such as [danelljan2019atom]) in both precision and success rate, while our lightweight version of SiamFC++ strikes a balance between performance and speed. This result is achieved even without Youtube-BB, which contains a large portion of the training data, showing the potential of our approach to be independent of large offline training data.
Comparison with Trackers that Do not Apply Our Guidelines
The SiamRPN family [li2018high, zhu2018distractor, li2019siamrpn++] has achieved great success in visual tracking in recent years and drawn much attention from the tracking community. Here we use the state-of-the-art SiamRPN++ tracker as an example. Despite the recent successes of the SiamRPN family, we have found that the SiamRPN tracker and its family do not follow our proposed guidelines entirely:
(G2) the classification score of SiamRPN represents the similarity between anchor and object, rather than template object and objects in search image, which may cause matching ambiguity;
(G3) the design of pre-set anchor boxes needs prior knowledge of the distribution of size and ratio of target;
(G4) the choice of target state estimation does not take estimation quality into consideration.
Note that the SiamRPN family adopts proposal refinement by the regression branch instead of the multi-scale test, and thus achieves astonishing tracking accuracy, which complies with our guideline G1.
As a consequence of the violation of guideline G2, we empirically find that the SiamRPN family is prone to deliver false-positive results. In other words, SiamRPN produces an unreasonably high score for nearby objects or background under large appearance variation of the target object. As shown in Figure 1, SiamRPN++ fails to track the target object by giving very high scores to nearby objects (i.e., a rock or a face) under challenging scenarios such as out-of-plane rotation and deformation. We hypothesize that SiamRPN matches objects and anchors rather than the object itself, which may cause drift and thus hinders its robustness. In contrast, our proposed SiamFC++, which directly matches template objects against objects in the search image, gives accurate score predictions and successfully tracks the target.
To verify our hypothesis, we record the maximum score produced by SiamRPN++ and by our proposed SiamFC++ on the VOT2018 dataset. We then split the scores according to the tracking result, i.e., successful or failed. On VOT2018, a tracking result is considered failed if its overlap with the ground-truth box is zero. Otherwise, it is considered successful. The result is visualized in the first row of Figure 3. Comparing the SiamRPN++ and SiamFC++ scores, we can see that the classification scores of SiamRPN++ follow similar and highly overlapping distributions whether tracking succeeds or not, while the classification scores of our SiamFC++ in the failure state exhibit a very different pattern from those in the successful state.
Another factor contributing to the ambiguity in SiamRPN++ is that the feature matching process is done with patches of fixed aspect ratio (multiple patches with different ratios will bring non-negligible computation cost), while each pixel of the feature after matching is assigned anchors whose aspect ratio varies.
As for the violation of G3, the performance of SiamRPN varies as the scales and ratios of anchors vary. As is shown in Table 3 from [li2018high], three different ratio settings are tried and the performance of SiamRPN varies when using different anchor settings. Thus the best performance is achieved only by accessing prior knowledge of data distribution, which is against the spirit of generic object tracking [huang2018got].
Besides, in the second row of Figure 3, we also plot the histogram of SiamRPN++ statistics of IoU between output bounding box and ground truth, and that between anchor and ground truth, in both success and failure state. As is shown from the IoU distribution, the prior knowledge given by anchor settings (violation of G3) leads to a bias in target state estimation. Concretely, the predicted box of SiamRPN++ tends to overlap more with the anchor box than with the ground truth box which can lead to performance degradation.
As for the violation of G4, we can see that the SR.5 and SR.75 of SiamRPN++ on GOT-10k benchmark are 7.7 and 15.4 points lower than those of SiamFC++, respectively. In GOT-10k, the Success Rate (SR) measures the percentage of successfully tracked frames where the overlaps exceed a pre-defined threshold (i.e., 0.5 or 0.75). The higher the threshold, the more accurate the tracking result. Hence SR is a solid indicator for estimation quality. The SR.75 of SiamRPN++ is much lower than that of SiamFC++, indicating the lower estimation quality of SiamRPN++ caused by the violation of guideline G4.
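The success-rate metric described above is straightforward to compute from per-frame overlaps. The sketch below uses made-up IoU values purely for illustration, and the function name `success_rate` is an assumption rather than part of any benchmark toolkit.

```python
import numpy as np

def success_rate(ious, threshold):
    """Fraction of frames whose overlap exceeds the given threshold."""
    ious = np.asarray(ious, dtype=np.float64)
    return float(np.mean(ious > threshold))

# Hypothetical per-frame overlaps between predicted and ground-truth boxes.
frame_ious = [0.9, 0.8, 0.6, 0.4, 0.0]
sr_50 = success_rate(frame_ious, 0.5)
sr_75 = success_rate(frame_ious, 0.75)
```

Raising the threshold can only keep or lower the score, so a large gap between the two thresholds points to boxes that roughly cover the target but are not tightly estimated.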
In this paper, we propose a set of guidelines for target state estimation in tracker design by analyzing the unique characteristics of visual tracking tasks and the flaws of former trackers. Following these guidelines, we propose an approach that provides effective methods for both classification and target state estimation (G1), gives classification scores without ambiguity (G2), tracks without prior knowledge (G3), and is aware of estimation quality (G4). We verify the effectiveness of the proposed guidelines through an extensive ablation study, and we show that our tracker based on these guidelines reaches state-of-the-art performance on five challenging benchmarks while still running at 90 FPS.