Visual object tracking has received considerable attention due to its wide range of applications, such as intelligent surveillance, human-machine interaction and unmanned vehicles. Rapid progress has been made on visual tracking. However, it remains a challenging task, especially for real-world applications, as objects in unconstrained recording conditions often suffer from large illumination variation, scale variation, background clutter and heavy occlusion. Moreover, the appearance of non-rigid objects may change significantly due to extreme pose variation.
The current popular visual tracking methods [21, 3, 1, 18, 4, 42, 12, 38] revolve around Siamese network based architectures. The Siamese network formulates the visual tracking task as a target matching problem and aims to learn a general similarity map between the target template and the search region. Since a single similarity map typically contains limited spatial information, a common strategy is to perform matching on multiple scales of the search region to determine the object scale variation [21, 1, 18], which explains why these trackers are time-consuming and labor-intensive. SiamRPN attaches a region proposal subnetwork (RPN) to the Siamese network. By jointly training a classification branch and a regression branch for visual tracking, SiamRPN avoids the time-consuming step of extracting multi-scale feature maps to handle object scale variation. It achieves state-of-the-art results on many benchmarks. Later works such as DaSiam, CSiam and SiamRPN++ improve on SiamRPN. However, since anchors are introduced for region proposal, these trackers are sensitive to the number, sizes and aspect ratios of the anchor boxes, and expertise in hyper-parameter tuning is crucial for successful tracking.
In this paper, we show that an anchor-free Siamese network based tracker can perform better than the state-of-the-art RPN based trackers. Essentially, we decompose tracking into two subproblems: a classification problem and a regression problem. The classification branch predicts a label for each spatial location, while the regression branch regresses a relative bounding box for each location. With this decomposition, the tracking task can be solved in a per-pixel prediction manner. We then craft a simple yet effective Siamese based classification and regression network (SiamCAR) to learn the classification and regression models simultaneously in an end-to-end manner.
Previous work leverages object semantic information to improve bounding box regression. Inspired by this, SiamCAR is designed to extract response maps that include rich category information and semantic information. Unlike RPN models [3, 42, 4], which use two response maps for region proposal detection and regression respectively, SiamCAR uses a single response map to predict the object location and bounding box directly.
SiamCAR is trained offline and tracks without online model updating, and it uses no data augmentation during training. Our main contributions are:
-  We propose the Siamese classification and regression framework (SiamCAR) for visual tracking. The framework is very simple in construction but powerful in performance.
-  The proposed tracker is both anchor free and proposal free. The number of hyper-parameters is significantly reduced, which spares the tracker from complicated parameter tuning and makes it significantly simpler, especially in training.
-  Without bells and whistles, the proposed tracker achieves state-of-the-art tracking performance in terms of both accuracy and time cost.
2 Related Works
We mainly review the family of Siamese RPN trackers, since they have dominated tracking performance in recent years. Trackers have been improved from several aspects, such as feature extraction, template updating, classifier design and bounding box regression. Early feature extraction mainly used color features, texture features or other hand-crafted ones. Benefiting from the development of deep learning, deep convolutional features are now widely adopted. Template updating can improve model adaptability, but online tracking is very inefficient. Besides, the tracking drift problem caused by template updating still needs to be solved. The introduction of correlation filter methods [7, 28, 17, 9, 36, 34] brought tracking to an unprecedented level in both efficiency and accuracy. Current research shows that Siamese based methods with deep neural networks, which are trained offline and track without online updating, achieve the best balance between accuracy and efficiency [3, 4].
As one of the pioneering works, SiamFC constructs a fully convolutional Siamese network to train a tracker. Encouraged by its success, many researchers have followed this work and proposed updated models [37, 18, 1, 31, 3, 4]. CFNet introduces a correlation filter layer to the SiamFC framework and performs online tracking to improve the accuracy. By modifying the Siamese branches with two online transformations, DSiam learns a dynamic Siamese network, which achieves improved accuracy with an acceptable loss of speed. SA-Siam builds a twofold Siamese network with a semantic branch and an appearance branch. The two branches are trained separately to keep the heterogeneity of the features, but are combined at test time to improve the tracking accuracy. To deal with scale variation, these Siamese networks need to perform a multi-scale search, which is time-consuming.
Inspired by the region proposal network for object detection, the SiamRPN tracker performs region proposal extraction on the outputs of the Siamese network. By jointly training a classification branch and a regression branch for region proposal, SiamRPN avoids the time-consuming step of extracting multi-scale feature maps to handle object scale variation, and it runs very efficiently. However, it has difficulty dealing with distractors whose appearance is similar to that of the object. Based on SiamRPN, DaSiamRPN increases the amount of hard negative training data during the training phase. Through data augmentation, it improves the discrimination of the tracker and obtains a much more robust result. The tracker is further extended to long-term visual tracking. By this point the framework had been modified considerably from SiamFC, but the performance could not be improved further with deeper networks, since AlexNet was still used as the backbone. To address this problem, SiamRPN++ optimizes the network architecture by using ResNet as the backbone. At the same time, it randomly shifts the location of the training object in the search region during model training to eliminate the center bias. After these modifications, better tracking accuracy can be achieved with a very deep network architecture instead of shallow neural networks.
Anchors are adopted in these RPN based trackers for region proposal. Besides, anchor boxes can make use of the deep feature maps and avoid repeated computation, which can significantly speed up the tracking process. The state-of-the-art trackers SPM and SiamRPN both run at very high speed. Although SiamRPN++ adopts a very deep neural network, it still runs at real-time speed. The accuracy and speed of state-of-the-art anchor-free trackers like ECO still lag behind these anchor-based trackers [11, 4] on challenging benchmarks like GOT-10K. However, the tracking performance is very sensitive to the hyper-parameters of the anchors, which need to be carefully tuned, and empirical tricks are involved in achieving ideal performance. Moreover, since the sizes and aspect ratios of the anchor boxes are fixed, even with heuristically tuned parameters these trackers still have difficulty processing objects with large shape deformation and pose variation. In this paper, we show that these problems can be greatly alleviated with our proposed SiamCAR. Moreover, we demonstrate that a tracker with a much simpler construction can achieve even better performance than state-of-the-art ones.
3 Proposed Method
We now introduce our SiamCAR network in detail. As mentioned, we decompose the tracking task into two subproblems, classification and regression, and then solve them in a per-pixel manner. As shown in Figure 2, the framework mainly consists of two simple subnetworks: a Siamese network for feature extraction and a classification and regression network for bounding box prediction.
3.1 Feature Extraction with Siamese Subnetwork
Here we take advantage of a fully convolutional network without padding to construct the Siamese subnetwork for visual feature extraction. The subnetwork consists of two branches: a target branch, which takes the tracking template patch as input, and a search branch, which takes the search region as input. The two branches share the same CNN architecture as their backbone and output two feature maps, one for the template and one for the search region. To embed the information of the two branches, a response map can be obtained by performing cross-correlation on the search feature map with the template feature map as the kernel. Since the response map has to be decoded in the subsequent prediction subnetwork to obtain the location and scale of the target, it should retain abundant information. However, the cross-correlation layer can only generate a single-channel compressed response map, which lacks useful features and important information for tracking, since different feature channels typically carry distinct semantic information. Inspired by previous work, we instead use a depth-wise correlation layer, which correlates the two feature maps channel by channel, to produce multiple semantic similarity maps. The generated response map has the same number of channels as the backbone feature maps, and it contains rich information for classification and regression.
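The channel-by-channel correlation described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name `depthwise_xcorr`, the nested-list layout `[channels][height][width]` and the toy values are assumptions chosen for illustration, and a real tracker would run this on GPU tensors.

```python
def depthwise_xcorr(search, template):
    """Correlate each channel of `search` with the same channel of `template`.

    Both inputs are nested lists shaped [channels][height][width].
    The correlation uses no padding and yields one output map per channel.
    """
    if len(search) != len(template):
        raise ValueError("search and template need the same number of channels")
    response = []
    for s_ch, t_ch in zip(search, template):
        sh, sw = len(s_ch), len(s_ch[0])
        th, tw = len(t_ch), len(t_ch[0])
        out = []
        for i in range(sh - th + 1):
            row = []
            for j in range(sw - tw + 1):
                # Slide the template channel over the search channel.
                acc = 0.0
                for u in range(th):
                    for v in range(tw):
                        acc += s_ch[i + u][j + v] * t_ch[u][v]
                row.append(acc)
            out.append(row)
        response.append(out)
    return response


# Toy example: 2 channels, 3x3 search features, 2x2 template features.
search = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
template = [
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
]
response = depthwise_xcorr(search, template)
```

The output keeps one similarity map per input channel, which is what distinguishes this layer from plain cross-correlation, where all channels are summed into a single map.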
Low-level features such as edges, corners, colors and shapes represent visual attributes well and are indispensable for localization, while high-level features better represent semantic attributes and are more crucial for discrimination. Many methods fuse low-level and high-level features to improve tracking accuracy [6, 4]. Here we also aggregate multi-layer deep features for tracking. We use the modified ResNet-50, as in previous work, as our backbone network. To achieve better inference for recognition and discrimination, we combine the features extracted from the last three residual blocks of the backbone: the three outputs are concatenated along the channel dimension into a single feature map. Each of the three outputs has the same number of channels, so the concatenated map has three times as many.
Depth-wise cross-correlation is performed between the search feature map and the template feature map to obtain a multi-channel response map. The response map is then convolved with a kernel that reduces its channel dimension. This dimension reduction significantly reduces the number of parameters and thus speeds up the subsequent computation. The final dimension-reduced response map is used as the input to the classification-regression subnetwork.
3.2 Bounding Box Prediction with Classification and Regression Subnetwork
Each location in the response map can be mapped back to a corresponding location in the input search region. The RPN-based trackers treat this location in the search region as the center of multi-scale anchor boxes and regress the target bounding box with these anchor boxes as references. In contrast, our network directly classifies each location and regresses the target bounding box at it. The associated training can be accomplished with fully convolutional operations in an end-to-end fashion, which avoids tricky parameter tuning and reduces human intervention.
The tracking task is decomposed into two subtasks: a classification branch that predicts the category of each location, and a regression branch that computes the target bounding box at this location (see Figure 2 for an illustration of the subnetwork). For a response map extracted by the Siamese subnetwork, the classification branch outputs a classification feature map and the regression branch outputs a regression feature map, both with the width and height of the extracted feature maps. As shown in Figure 2, each point in the classification feature map contains a 2D vector, which represents the foreground and background scores of the corresponding location in the input search region. Similarly, each point in the regression feature map contains a 4D vector, which represents the distances from the corresponding location to the four sides of the bounding box in the input search region.
Since the ratio of the areas occupied by the target and the background in the input search region is not very large, sample imbalance is not a problem. Therefore, we simply adopt the cross-entropy loss for classification and the IOU loss for regression. Given the top-left and bottom-right corners of the ground-truth bounding box and the search-region location that corresponds to a point of the regression feature map, the regression targets at that point are the distances from the location to the four sides of the ground-truth bounding box.
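As a hedged illustration of these regression targets, the sketch below computes the four distances for one location. The function name `regression_targets`, the `(x0, y0, x1, y1)` box layout and the left, top, right, bottom ordering are assumptions made here for illustration, not notation taken from the paper.

```python
def regression_targets(x, y, box):
    """Distances from location (x, y) to the four sides of a ground-truth box.

    `box` is (x0, y0, x1, y1): top-left and bottom-right corners in
    search-region coordinates. Returns (left, top, right, bottom).
    """
    x0, y0, x1, y1 = box
    return (x - x0, y - y0, x1 - x, y1 - y)


# A location inside the box yields four positive distances.
inside = regression_targets(30, 40, (10, 20, 50, 80))
# A location to the left of the box yields a negative left distance.
outside = regression_targets(5, 40, (10, 20, 50, 80))
```

All four values are positive exactly when the location lies strictly inside the ground-truth box, which makes the sign of the targets a convenient test for foreground locations.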
With the regression targets, the IOU between the ground-truth bounding box and the predicted bounding box can be computed. The regression loss is then the IOU loss, as in previous work, averaged over the locations selected by an indicator function.
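The following sketch shows one way such a regression loss could be computed. Several points are assumptions rather than statements from the paper: the IOU loss is taken in the negative-log form used by UnitBox, the indicator is assumed to select locations whose four target distances are all positive (locations inside the ground-truth box), and every function name is hypothetical.

```python
import math


def iou_from_distances(pred, target):
    """IoU of two boxes given as (left, top, right, bottom) distances
    measured from the same location. Distances are assumed positive."""
    pl, pt, pr, pb = pred
    gl, gt, gr, gb = target
    pred_area = (pl + pr) * (pt + pb)
    target_area = (gl + gr) * (gt + gb)
    inter_w = min(pl, gl) + min(pr, gr)
    inter_h = min(pt, gt) + min(pb, gb)
    inter = inter_w * inter_h
    union = pred_area + target_area - inter
    return inter / union


def iou_loss(pred, target, eps=1e-7):
    """Assumed UnitBox-style loss: negative log of the IoU."""
    return -math.log(max(iou_from_distances(pred, target), eps))


def is_foreground(target):
    """Assumed indicator: 1 when the location lies inside the ground-truth box."""
    return all(d > 0 for d in target)


def regression_loss(preds, targets):
    """Average IOU loss over the locations selected by the indicator."""
    selected = [(p, t) for p, t in zip(preds, targets) if is_foreground(t)]
    if not selected:
        return 0.0
    return sum(iou_loss(p, t) for p, t in selected) / len(selected)
```

Because both boxes are expressed relative to the same location, the intersection can be computed directly from the element-wise minima of the distances, without converting back to corner coordinates.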
One observation is that locations far away from the object center tend to produce low-quality predicted bounding boxes, which reduces the performance of the tracking system. Following previous work, we add a center-ness branch in parallel with the classification branch to remove the outliers. As shown in Figure 2, the branch outputs a center-ness feature map, where each point value gives the center-ness score of the corresponding location. The target score decreases as the distance between the corresponding location and the object center in the search region grows. If the location lies in the background, the score is set to 0. A center-ness loss is then computed between the predicted and the target scores.
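A minimal sketch of a center-ness score with these properties is given below. It assumes the definition used by FCOS (the square root of the product of the min/max ratios of opposite distances) and a binary cross-entropy loss; both are assumptions, since the exact formulas are not recoverable from the text, and the function names are hypothetical.

```python
import math


def centerness(left, top, right, bottom):
    """Assumed FCOS-style center-ness from the four target distances.

    Returns 1.0 at the box center, decays towards the box sides,
    and is 0.0 for background locations (any non-positive distance).
    """
    if min(left, top, right, bottom) <= 0:
        return 0.0
    horizontal = min(left, right) / max(left, right)
    vertical = min(top, bottom) / max(top, bottom)
    return math.sqrt(horizontal * vertical)


def centerness_bce(pred, target, eps=1e-7):
    """Assumed loss: binary cross-entropy between predicted and target scores."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))
```

For example, `centerness(2, 2, 2, 2)` gives the maximum score at the exact center, while an off-center location such as `centerness(1, 2, 3, 2)` receives a lower score, so boxes predicted far from the center are down-weighted.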
The overall loss function is the sum of the cross-entropy loss for classification, the center-ness loss and the regression loss, where the latter two are weighted by constants. The values of the two constants are set empirically during model training.
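One form of the overall loss that is consistent with this description can be written as follows. The symbol names are assumptions introduced here for illustration, and the values of the two weights are left open because they are not recoverable from the text.

```latex
\mathcal{L} = \mathcal{L}_{cls} + \lambda_1 \mathcal{L}_{cen} + \lambda_2 \mathcal{L}_{reg}
```

Here $\mathcal{L}_{cls}$ stands for the cross-entropy classification loss, $\mathcal{L}_{cen}$ for the center-ness loss and $\mathcal{L}_{reg}$ for the IOU regression loss, with $\lambda_1$ and $\lambda_2$ weighting the center-ness and regression terms respectively.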
3.3 The Tracking Phase
Tracking aims at predicting a bounding box for the target in the current frame. For each location, the proposed framework produces a 6D vector that consists of the foreground score of classification, the center-ness score and the four predicted distances to the sides of the bounding box, from which the predicted width and height of the target in the current frame follow. During tracking, the size and aspect ratio of the bounding box typically change only slightly across consecutive frames. To supervise the prediction with this spatial-temporal consistency, we adopt a scale change penalty, as introduced in previous work, to re-rank the classification score, which yields an updated 6D vector. The tracking phase can then be formulated as selecting the location that maximizes the re-ranked score combined with a cosine window through a balance weight. The output is the queried location with the highest score of being a target pixel.
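The selection step can be sketched as below. The exact scoring formula is not recoverable from the text, so this sketch assumes the convex blend commonly used by Siamese trackers, `(1 - weight) * reranked + weight * window`, and assumes that the center-ness score enters multiplicatively. The function names, the argument layout and the Hann form of the cosine window are all assumptions.

```python
import math


def hann(n):
    """1-D cosine (Hann) window of length n."""
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def select_location(cls, cen, penalty, weight):
    """Return ((row, col), score) of the best location on the response grid.

    `cls`, `cen` and `penalty` are 2-D lists of equal size holding the
    foreground score, center-ness score and scale change penalty.
    `weight` balances the re-ranked score against the cosine window.
    """
    h, w = len(cls), len(cls[0])
    win_y, win_x = hann(h), hann(w)
    best, best_score = None, float("-inf")
    for i in range(h):
        for j in range(w):
            reranked = cls[i][j] * cen[i][j] * penalty[i][j]
            score = (1.0 - weight) * reranked + weight * win_y[i] * win_x[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score
```

With `weight` close to 0 the choice follows the network scores alone, while a larger `weight` favors locations near the center of the search region, which suppresses large jumps between frames.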
Since our model solves object tracking in a per-pixel prediction manner, each location is associated with a predicted bounding box. In the real tracking process, the result jitters between adjacent frames if only the bounding box of the queried location is used as the target box. We observed that the pixels located around the queried location are also likely to be target pixels. Hence we choose the top-k points from the neighborhood of the queried location according to their scores. The final prediction is the weighted average of the selected regression boxes. Empirically, we found values of the neighborhood size and of k that deliver stable tracking results (see Figure 3 for a comparison of using different values).
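The smoothing step can be sketched as follows. The shape of the neighborhood (a square window of a given radius that includes the queried location itself), the use of the scores as averaging weights and the name `smooth_box` are assumptions made for illustration; the text only states that the top-k neighbors are averaged with weights.

```python
def smooth_box(scores, boxes, center, radius, k):
    """Weighted average of the k best-scoring boxes around `center`.

    `scores` is a 2-D list of per-location scores, `boxes` a 2-D list of
    (x0, y0, x1, y1) tuples, `center` the queried (row, col) location.
    """
    ci, cj = center
    h, w = len(scores), len(scores[0])
    candidates = []
    for i in range(max(0, ci - radius), min(h, ci + radius + 1)):
        for j in range(max(0, cj - radius), min(w, cj + radius + 1)):
            candidates.append((scores[i][j], boxes[i][j]))
    # Keep the k highest-scoring neighbors.
    candidates.sort(key=lambda c: c[0], reverse=True)
    top = candidates[:k]
    total = sum(s for s, _ in top)
    if total <= 0:
        return boxes[ci][cj]
    return tuple(sum(s * b[d] for s, b in top) / total for d in range(4))


# Toy 3x3 grid: the center and its right neighbor dominate the average.
scores = [[0, 0, 0], [0, 3, 1], [0, 0, 0]]
boxes = [[(0, 0, 10, 10)] * 3 for _ in range(3)]
boxes[1][2] = (4, 4, 14, 14)
smoothed = smooth_box(scores, boxes, center=(1, 1), radius=1, k=2)
```

Averaging several nearby predictions trades a little responsiveness for stability, since a single noisy box can no longer move the output on its own.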
4 Experiments
4.1 Implementation details
The proposed SiamCAR is implemented in Python with PyTorch on 4 RTX 2080ti GPUs. For easy comparison, the input sizes of the template patch and the search region are set the same as in previous work. The modified ResNet-50, as in previous work, is adopted as the backbone of the Siamese subnetwork. The network is pretrained on ImageNet, and the pretrained parameters are used as the initialization to retrain our model.
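| Tracker | AO | SR (lower threshold) | SR (higher threshold) | FPS | Hardware | Language |
|---|---|---|---|---|---|---|
| CFNet | 0.293 | 0.265 | 0.087 | 35.62 | Titan X | Matlab |
| MDNet | 0.299 | 0.303 | 0.099 | 1.52 | Titan X | Python |
| SiamFC | 0.374 | 0.404 | 0.144 | 25.81 | Titan X | Matlab |
| SPM | 0.513 | 0.593 | 0.359 | 72.30 | Titan Xp | Python |
| SiamRPN++ | 0.517 | 0.616 | 0.325 | 49.83 | RTX 2080ti | Python |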
Training details. The model is trained with stochastic gradient descent (SGD). During the first epochs, the parameters of the Siamese subnetwork are frozen while the classification and regression subnetwork is trained. During the last epochs, the last blocks of ResNet-50 are unfrozen and trained jointly. We train SiamCAR with data from COCO, ImageNet DET, ImageNet VID and YouTube-BB for the experiments on UAV123 and OTB. Note that for the experiments on GOT-10K and LaSOT, SiamCAR is trained only with the specified training set provided by the official website, for fair comparison.
Testing details. During testing, we track without online model updating. Only the object in the initial frame of a sequence is used as the template patch. Consequently, the target branch of the Siamese subnetwork can be pre-computed and kept fixed during the whole tracking period. The search region in the current frame is used as the input of the search branch. Figure 4 shows a whole tracking process. With the outputs of the classification-regression subnetwork, a location is queried through Equation (9). To achieve a more stable and smoother prediction between adjacent frames, a weighted average of the regression boxes corresponding to the top-3 neighbors of the queried location is computed as the final tracking result.
4.2 Results on GOT-10K
GOT-10K is a recently released large high-diversity benchmark for generic object tracking in the wild. It contains video segments of real-world moving objects. A fair comparison of deep trackers is ensured by the protocol that all approaches use the same training data provided by the dataset. The classes in the training dataset and the testing dataset have zero overlap. Authors need to train their models on the given training dataset and test them on the given testing dataset. After the tracking results are uploaded, the analysis is performed automatically by the official website. The provided evaluation indicators include success plots, average overlap (AO) and success rate (SR). The AO represents the average overlap between all the estimated bounding boxes and the ground-truth boxes. The SR represents the rate of successfully tracked frames whose overlap exceeds a given threshold, and it is reported at two thresholds.
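The two indicators can be illustrated for a single sequence with the sketch below. The box layout `(x0, y0, x1, y1)`, the function names and the threshold values used in the example are assumptions for illustration; the official GOT-10K toolkit defines the exact evaluation protocol.

```python
def box_iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, inter_w) * max(0.0, inter_h)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_overlap(pred_boxes, gt_boxes):
    """Mean IoU between estimated and ground-truth boxes over all frames."""
    overlaps = [box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(overlaps) / len(overlaps)


def success_rate(pred_boxes, gt_boxes, threshold):
    """Fraction of frames whose overlap exceeds `threshold`."""
    overlaps = [box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(1 for o in overlaps if o > threshold) / len(overlaps)


# Three frames: a perfect hit, a half overlap and a complete miss.
preds = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (0, 0, 10, 5), (20, 20, 30, 30)]
ao = average_overlap(preds, gts)
sr_loose = success_rate(preds, gts, 0.4)
sr_strict = success_rate(preds, gts, 0.5)
```

A stricter threshold can only lower the success rate, which is why the second SR column of a results table is never larger than the first.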
We evaluate SiamCAR on GOT-10K and compare it with state-of-the-art trackers including SiamRPN++, SiamRPN, SiamFC, ECO, CFNet and other baselines or state-of-the-art approaches. All the results are provided by the official website of GOT-10K. Figure 1 shows that SiamCAR outperforms all the other trackers on GOT-10K, and Table 1 lists the detailed comparison for the different indicators. As shown in Table 1, our tracker ranks 1st in terms of all the indicators, improving on SiamRPN++ for AO and for SR at both thresholds.
Since all trackers use the same training data and the ground-truth boxes of the testing dataset are not visible to the trackers, the tracking results on GOT-10K are more credible and convincing than those on other benchmarks.
4.3 Results on LaSOT
LaSOT is a recently released benchmark for single object tracking. The dataset consists of manually annotated videos organized into object classes, each with its own tracking sequences. Such a large test dataset poses a great challenge to tracking algorithms. The official website of LaSOT provides a set of algorithms as baselines. Normalized precision plots, precision plots and success plots in one-pass evaluation (OPE) are used as the indicators.
We compare SiamCAR with the top-19 trackers, including SiamRPN++, MDNet, DSiam, ECO and other baselines. The results of SiamRPN++ are provided on the website of its authors, while the other results are provided by the official website of LaSOT. As shown in Figure 6, SiamCAR achieves the best performance, improving on SiamRPN++ for all three indicators. Notably, compared with the provided baselines, SiamCAR makes great progress on all three indicators.
The leading results on such a large dataset demonstrate that our proposed network generalizes well for visual object tracking.
4.4 Results on OTB50
OTB-50 contains challenging videos with substantial variations. The test sequences are manually tagged with attributes that represent the challenging aspects, including illumination variation, scale variation, occlusion, deformation, motion blur, fast motion, in-plane rotation, out-of-plane rotation, out-of-view, background clutter and low resolution. We compare our network with 9 state-of-the-art approaches, including SiamRPN++, SiamRPN, SiamFC and ECO. The success plots and precision plots in OPE are evaluated for each tracker. As shown in Figure 7, the proposed SiamCAR ranks 1st in terms of both indicators across those challenging aspects. In particular, SiamCAR significantly improves the tracking accuracy for the aspects of low resolution, out-of-plane rotation and background clutter. The results demonstrate that SiamCAR can better deal with challenging distractors and large pose variation, which benefits from the implicitly decoded semantic information of our classification-regression subnetwork.
4.5 Results on UAV123
The UAV123 dataset consists of video sequences that are all fully annotated with upright bounding boxes. The objects in the dataset mainly suffer from fast motion, large scale variation, large illumination variation and occlusion, which makes tracking challenging.
We compare SiamCAR with 9 state-of-the-art approaches, including SiamRPN++, SiamRPN, SiamFC and ECO, on this dataset. The success plot and precision plot of OPE are used as indicators to evaluate the overall performance. As shown in Figure 8, SiamCAR outperforms all other trackers on both indicators. Compared with state-of-the-art RPN trackers [4, 42, 3], SiamCAR obtains competitive results with a much simpler network and without heuristically tuned parameters.
4.6 Run-time Evaluation
In the FPS column of Table 1, we show the evaluation on GOT-10K with respect to frames per second (FPS). The speed of SiamCAR is evaluated on a machine with one RTX 2080ti, while the others are provided by the GOT-10K official results. As shown in the table, SiamCAR achieves the best performance at real-time speed. In addition, our network is much simpler than the others, and no specially designed parameters are needed for training.
5 Conclusions
In this paper, we present a Siamese classification and regression framework, called SiamCAR, to train a deep Siamese network end-to-end for visual tracking. We show that the tracking task can be solved in a per-pixel manner with a neat fully convolutional framework. The proposed framework is very simple in structure but achieves state-of-the-art results without bells and whistles on GOT-10K and many other challenging benchmarks. It also achieves state-of-the-art results on large datasets such as LaSOT, which demonstrates the generalizability of SiamCAR. Since the present framework is simple and neat, it can easily be extended with specific modules to make further improvements in the future.
-  A.F.He, C.Luo, X.M.Tian, and W.J.Zeng. A twofold siamese network for real-time object tracking. In CVPR, 2018.
-  B.Goutam. Accurate tracking by overlap maximization, 2019.
-  B.Li, J.J.Yan, W.Wu, Z.Zhu, and X.L.Hu. High performance visual tracking with siamese region proposal network. In CVPR, 2018.
-  B.Li, W.Wu, Q.Wang, F.Y.Zhang, J.L.Xing, and J.J.Yan. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In CVPR, 2019.
-  B.Luca, V.Jack, G.Stuart, M.Ondrej, and T.Philip HS. Staple: Complementary learners for real-time tracking. In CVPR, 2016.
-  C.Ma, J.B.Huang, X.K.Yang, and M.H.Yang. Robust visual tracking via hierarchical convolutional features. TPAMI, 2018.
-  D.Bolme, J.Beveridge, B.Draper, and Y.Lui. Visual object tracking using adaptive correlation filters. In CVPR, 2010.
-  E.Real, J.Shlens, S.Mazzocchi, X.Pan, and V.Vanhoucke. Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video. In CVPR, 2017.
-  F.Li, Y.J.Yao, P.H.Li, D.Zhang, W.M.Zuo, and M.H.Yang. Integrating boundary and center correlation filters for visual tracking with aspect ratio variation. In ICCV, 2017.
-  G.Kiani, Hamed, F.Ashton, and L.Simon. Learning background-aware correlation filters for visual tracking. In ICCV, 2017.
-  G.T.Wang, C.Luo, Z.W.Xiong, and W.J.Zeng. Spm-tracker: Series-parallel matching for real-time visual object tracking. arXiv preprint arXiv:1904.04452, 2019.
-  H.Fan and H.B.Ling. Siamese cascaded region proposal networks for real-time visual tracking. In CVPR, 2019.
-  H.Fan, L.T.Lin, F.Yang, P.Chu, G.Deng, S.J.Yu, H.X.Bai, Y.Xu, C.Y.Liao, and H.B.Ling. Lasot: A high-quality benchmark for large-scale single object tracking. In CVPR, 2019.
-  H.Nam and B.Han. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, 2016.
-  J.H.Yu, Y.N.Jiang, Z.Y.Wang, Z.M.Cao, and T.Huang. Unitbox: An advanced object detection network. In ACM, 2016.
-  J.M.Zhang, S.G.Ma, and S.Stan. Meem: robust tracking via multiple experts using entropy minimization. In ECCV, 2014.
-  Joao.F.Henriques, R.Caseiro, M.Pedro, and B.Jorge. High-speed tracking with kernelized correlation filters. TPAMI, 2014.
-  J.Valmadre, L.Bertinetto, J.F.Henriques, A.Vedaldi, and P.H.Torr. End-to-end representation learning for correlation filter based tracking. In CVPR, 2017.
-  J.Y.Gao, T.Z.Zhang, and C.S.Xu. Graph convolutional tracking. In CVPR, 2019.
-  K.He, X.Zhang, S.Ren, and J.Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  L.Bertinetto, J.Valmadre, J.F.Henriques, A.Vedaldi, and P.H.Torr. Fully-convolutional siamese networks for object tracking. In ECCV, 2016.
-  L.H.Huang, X.Zhao, and K.Q.Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. arXiv preprint arXiv:1810.11981, 2018.
-  L.Zhang, V.Jagannadan, N.S.Ponnuthurai, A.Narendra, and M.Pierre. Robust visual tracking using oblique random forests. In CVPR, 2017.
-  M.Danelljan, A.Robinson, F.S.Khan, and M.Felsberg. Beyond correlation filters:learning continuous convolution operators for visual tracking. In ECCV, 2016.
-  M.Danelljan, G.Bhat, F.S.Khan, and M.Felsberg. Eco: Efficient convolution operators for tracking. In CVPR, 2017.
-  M.Danelljan, G.Hager, and F.Khan. Accurate scale estimation for robust visual tracking. In BMVA, 2014.
-  M.Danelljan, G.Hager, K.S.Fahad, and M.Felsberg. Learning spatially regularized correlation filters for visual tracking. In ICCV, 2015.
-  M.Danelljan, G.Hager, K.S.Fahad, and M.Felsberg. Discriminative scale space tracking. TPAMI, 2016.
-  O.Russakovsky, J.Deng, H.Su, J.Krause, S.Satheesh, S.Ma, Z.Huang, A.Karpathy, A.Khosla, and M.Bernstein. Imagenet large scale visual recognition challenge. IJCV, 2015.
-  P.Horst, M.Thomas, and B.Horst. In defense of color-based model-free tracking. In CVPR, 2015.
-  Q.Guo, W.Feng, C.Zhou, R.Huang, L.Wan, and S.Wang. Learning dynamic siamese network for visual object tracking. In ICCV, 2017.
-  S.Pu, Y.Song, and C.Ma. Deep attentive tracking via reciprocative learning. In NIPS, 2018.
-  S.Q.Ren, K.M.He, R.Girshick, and J.Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
-  T.Liu, G.Wang, Q.X.Yang, and L.Wang. Part-based tracking via discriminative correlation filters. TCSVT, 2016.
-  T.Y.Lin, M.Michael, S.Belongie, J.Hays, P.Perona, D.Ramanan, P.Dollar, and C.L.Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
-  T.Z.Zhang, C.S.Xu, and M.H.Yang. Multi-task correlation particle filter for robust object tracking. In CVPR, 2017.
-  X.P.Dong and J.B.Shen. Triplet loss in siamese network for object tracking. In ECCV, 2018.
-  Y.B.Song, C.Ma, L.J.Gong, J.W.Zhang, R.W.Lau, and M.H.Yang. Crest: Convolutional residual learning for visual tracking. In ICCV, 2017.
-  Y.Li and J.Zhu. A scale adaptive kernel correlation filter tracker with feature integration. In ECCV, 2014.
-  Y.Wu, J.Lim, and M.H.Yang. Online object tracking: A benchmark. In CVPR, 2013.
-  Z.Tian, C.H.Shen, H.Chen, and T.He. Fcos: Fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355, 2019.
-  Z.Zhu, Q.Wang, B.Li, W.Wu, J.J.Yan, and W.M.Hu. Distractor-aware siamese networks for visual object tracking. In ECCV, 2018.