Fully Convolutional Online Tracking
Discriminative training has proven effective for robust tracking. However, while online learning can be readily applied to the classification branch, adapting it to the regression branch remains challenging due to the complex design of that branch. In this paper, we present the first fully convolutional online tracking framework (FCOT), with a focus on enabling online learning for both the classification and regression branches. Our key contribution is an anchor-free box regression branch, which unifies the whole tracking pipeline into a simpler fully convolutional network. This unified framework greatly decreases the complexity of the tracking system and allows for more efficient training and inference. In addition, thanks to its simplicity, we are able to design a regression model generator (RMG) that performs online optimization of the regression branch, making the whole tracking pipeline more effective in handling target deformation during tracking. The proposed FCOT sets new state-of-the-art results on five benchmarks, namely GOT-10k, LaSOT, TrackingNet, UAV123 and NFS, and performs on par with the state-of-the-art trackers on OTB100, with a high running speed of 53 FPS. The code and models will be made available at https://github.com/MCG-NJU/FCOT.
Visual object tracking is a fundamental computer vision task, which aims at estimating the state of an arbitrary target in every frame of a video, given its bounding box in the first frame. It has a variety of applications such as human-computer interaction and visual surveillance. However, tracking remains highly challenging due to factors such as illumination changes, occlusion, and background clutter. In addition, variation of the target appearance over time makes robust tracking even more difficult.
In general, object tracking comprises a classification branch, which locates the target coarsely by discriminating it from the background, and a regression branch, which generates an accurate bounding box of the target. For the classification task, current approaches can be roughly divided into generative trackers (e.g., SiamFC) and discriminative trackers (e.g., DiMP). Generative trackers typically employ a fixed target template without modeling background clutter, while discriminative trackers learn an adaptive filter by maximizing the response gap between target and background. It is well established that this discriminative training increases the robustness of tracking. For the regression task, existing methods usually depend on hand-crafted designs, such as anchor box placement [20, 19, 37], or box sampling and selection. Due to this complex design, the regression branch cannot be easily optimized with online learning for each tracked target in the way the classification branch of a discriminative tracker is updated. Therefore, a natural question arises: can we also design a simple box regression branch, analogous to the classification branch, that can be easily updated with online learning and efficiently deployed in practice?
Based on the above analysis, we present a Fully Convolutional Online Tracker, termed FCOT, to yield a conceptually simple, relatively efficient, and more precise tracking framework. FCOT not only allows the whole method to run in a principled fully convolutional manner without any hand-crafted design, but also enables both the classification and regression branches to be optimized online for more robust and precise tracking. The key part of our FCOT framework is an online anchor-free box regression branch that directly regresses the bounding box size of the target in each frame. This anchor-free box branch can be deployed in a fully convolutional manner during tracking. It thus enables the whole tracking system to be efficiently optimized during training and easily deployed for inference. In addition, thanks to the simplicity of this design, we propose an online optimization algorithm that adaptively tunes the parameters of the box regression branch, allowing it to deal effectively with object deformation over time and thus yield more precise tracking results, as shown in Figure 1.
Specifically, our FCOT framework starts with a new encoder-decoder architecture for high-resolution feature extraction. Introducing upsampling layers into the tracking system turns out to be crucial for improving the accuracy of the tracking results. FCOT is then composed of a classification branch for roughly localizing the object center and a regression branch for regressing the bounding box size. Both branches are implemented with deformable convolutions because of their good performance in handling deformation. To discriminate the tracked target from the background and tackle the deformation of object shape during training, we propose to adaptively tune the parameters of both branches by online learning. Inspired by DiMP, we design a novel online regression model generator (RMG), composed of a model initializer and an online model optimizer. The effectiveness of our FCOT framework is demonstrated on the common tracking datasets, and the results show that our online box regressor consistently improves tracking performance, in particular for higher IoU criteria. In summary, our main contributions are three-fold:
We propose a Fully Convolutional Online Tracker (FCOT) with a simple architecture that implements target classification and regression directly, which improves tracking accuracy while maintaining efficiency.
We design a Regression Model Generator (RMG) to optimize the regression model online, which can estimate a precise target box in the face of target appearance variations and deformations.
Modern tracking methods can be categorised as generative trackers and discriminative trackers. The former are based on template matching, typically using Siamese networks [1, 19, 9, 35, 11] to perform similarity learning. Bertinetto et al. first employ a Siamese network to measure the similarity between the target and the search area, at a tracking speed of over 100 fps. SiamRPN formulates visual tracking as a local one-shot detection task at inference by introducing a Region Proposal Network into the Siamese network. SiamRPN++ improves SiamRPN by substituting ResNet-50 for the modified AlexNet, which enables the backbone to extract richer features.
trackers and classifier-based trackers are typical methods that update the classification model online so as to distinguish the target from the background. However, these approaches rely on complicated online learning procedures that cannot be easily formulated in an end-to-end learning architecture. Bhat et al. and Park et al. further learn to learn during tracking based on the meta-learning framework. DiMP introduces a target model predictor that optimizes the target model online, guided by a discriminative loss, and achieves leading performance on various benchmarks. In our work, we employ the target model predictor to perform online classification.
Previous trackers can be divided into three categories based on how they perform target regression. DCF and SiamFC employ brute-force multi-scale testing to estimate the target scale roughly. RPN-based trackers [20, 19] regress the location shift and size difference between pre-defined anchor boxes and the target location. ATOM and DiMP employ IoUNet to iteratively refine multiple initial boxes. In this work, we take inspiration from FCOS and regress the distance from the estimated target center to the sides of the bounding box, which is similar to SiamFC++. However, our FCOT differs in several important aspects. First, FCOT is essentially a discriminative tracker with a focus on enabling online optimization for both the classification and regression branches, while SiamFC++ is a generative tracker with fixed kernels for both branches. In addition, to fully unleash the power of FCOT, we resort to higher-resolution feature maps to produce the classification and regression results.
In order to obtain a simple, efficient, and precise tracking method, we design a fully convolutional online tracker (FCOT), guided by the following principles.
A simple and unified architecture. We want feature extraction, the classification branch, and the regression branch to be implemented in a single, unified network architecture. A fully convolutional network is employed to locate the target center and directly regress the offsets from the target sides to the center, which avoids hand-crafted box size estimation heads such as the hyper-parameters and IoU prediction in DiMP or the anchor box placement and design in SiamRPN. The unified fully convolutional scheme also makes FCOT efficient for both training and inference.
Accurate regression and classification. First, compared with previous trackers, FCOT generates a larger score map and larger box offset maps, ensuring more precise target center localization and bounding box regression. In addition, thanks to the simplicity of FCOT, the regression model is optimized online for the first time, by means of our proposed Regression Model Generator with the steepest descent methodology. In this way, FCOT can update the regression model online and thereby regress the bounding box accurately even when the target appearance changes in subsequent frames. For classification, we utilize the online target model generator introduced by DiMP to distinguish the target from the background.
As shown in Fig. 2, our FCOT comprises a ResNet-50 backbone to extract general features, classification and regression heads to generate task-specific features, online model generators that predict the online models for the two tasks, a classification convolutional layer that locates the target center, and a regression convolutional layer that estimates the offsets from the four sides to the target center. Thus, our discriminative tracker can integrate target-specific information that is updated online into both classification and regression, and predict bounding boxes accurately with a simple FCN-based architecture.
In general, current trackers can be divided into two categories, generative and discriminative trackers. By taking background information into consideration, discriminative trackers such as ATOM and DiMP achieve leading performance on various benchmarks. However, these approaches are two-stage and perform a complicated target regression procedure. Therefore, we introduce a simple fully convolutional network for both classification and regression to overcome these issues.
We denote the training set as , which is composed of a set of training frames of length with its annotated bounding box . Specifically, the training images are selected from the tracked frames with the predicted bounding boxes in the online tracking phase.
Similarly, the test frame can be represented as , whose expected bounding box . The goals of the network are locating the target center of the test frame and estimating the distance from the predicted center to the four sides of the bounding box .
With the aforementioned inputs, the general features are extracted by an encoder-decoder backbone. The encoder covers Layer-1 to Layer-4 of ResNet-50, and the decoder contains a convolution and 2 simple up-sample layers. The spatial down-sampling ratio of the general feature maps is 4. The classification heads and the regression heads then extract task-specific features for the classification and regression tasks separately. The structure of the classification heads and the regression heads can be seen in Fig. 3. Specifically, the classification heads employed in the training and test branches share the same network with common weights, while the regression heads in the two branches are different, a choice that our experiments show performs well.
The Regression Head-1 outputs 1024 feature maps, from which the four regression filters are generated, while the Regression Head-2 outputs 256 feature maps, which are convolved with the four filters.
In our FCOT, we formulate tracking as a per-pixel prediction problem. We predict a target center confidence map and four offset maps via the classification and regression branches, which are defined as:
The parameter is the feature extractor of classification branch, is the feature extractor of regression branch, and represent the filters generated by the corresponding model generators and denotes a convolution operation.
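The two predictions described here can be sketched in LaTeX as follows. All symbol names are assumptions introduced for illustration, not necessarily the authors' notation: $x$ stands for the general features of the test frame, $\phi_{cls}$ and $\phi_{reg}$ for the feature extractors of the two branches, $f_{cls}$ and $f_{reg}$ for the filters produced by the corresponding model generators, and $*$ for convolution.

```latex
% Assumed notation: M_cls is the center confidence map,
% M_reg stacks the four offset maps.
M_{cls} = \phi_{cls}(x) * f_{cls}, \qquad
M_{reg} = \phi_{reg}(x) * f_{reg}
```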
For each location on the final feature maps, we can map it back onto the input image as , where is the stride of the feature extractor (in this work, 4). For classification, denotes the confidence score of the pixel being a target center. During training, the classification target is a Gaussian function map centered at . For regression, is expected to be a 4D vector representing the distance from to the sides of the bounding box in the final feature maps. Hence, the regression targets of position can be formulated as follows:
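A plausible form of these targets, written in the FCOS style that the regression branch is inspired by, is sketched below. The notation is assumed for illustration: $(x, y)$ is a location on the final feature maps, $(x_0, y_0, x_1, y_1)$ are the top-left and bottom-right corners of the ground-truth box in input-image pixels, and $s$ is the stride.

```latex
% Distances from location (x, y) to the left, top, right and bottom
% sides of the box, measured in feature-map units (assumed notation).
l^{*} = x - \frac{x_0}{s}, \quad
t^{*} = y - \frac{y_0}{s}, \quad
r^{*} = \frac{x_1}{s} - x, \quad
b^{*} = \frac{y_1}{s} - y
```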
We regress for positions in the vicinity of the target center (the area with a radius of 2 in this work) rather than for the only pixel . Experimental results demonstrate its effectiveness in section 4.2.3.
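To make the target construction concrete, here is a minimal Python sketch of how such regression targets could be encoded for every position near the target center, and how a box could be decoded back from one position and its four offsets. The function names, the square (2·radius+1)² window, and the rounding of the center are assumptions made for illustration; only the stride of 4 and the radius of 2 come from the text.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in image pixels

STRIDE = 4  # down-sampling ratio of the feature maps, as stated in the text
RADIUS = 2  # radius of the regression area around the center, as stated


def regression_targets(box: Box, stride: int = STRIDE,
                       radius: int = RADIUS) -> Dict[Tuple[int, int], Box]:
    """Return {(x, y): (l, t, r, b)} for feature-map positions near the center.

    Assumption: the "area with a radius of 2" is a square window of
    (2 * radius + 1) ** 2 positions around the rounded box center.
    """
    # Express the box in feature-map units.
    x0, y0, x1, y1 = (v / stride for v in box)
    cx, cy = round((x0 + x1) / 2), round((y0 + y1) / 2)
    targets = {}
    for y in range(cy - radius, cy + radius + 1):
        for x in range(cx - radius, cx + radius + 1):
            # Distances from (x, y) to the left, top, right, bottom sides.
            targets[(x, y)] = (x - x0, y - y0, x1 - x, y1 - y)
    return targets


def decode_box(pos: Tuple[int, int], offsets: Box, stride: int = STRIDE) -> Box:
    """Invert the encoding: rebuild the image-space box from one position."""
    x, y = pos
    l, t, r, b = offsets
    return ((x - l) * stride, (y - t) * stride,
            (x + r) * stride, (y + b) * stride)
```

Because every position in the window carries offsets to the same four sides, decoding from any of them recovers the same box, which illustrates why regressing around the center can help when the detected center is slightly off.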
Inspired by DiMP, we present a regression model generator that, for the first time, optimizes the target regression model online, which alleviates the impact of target appearance changes on box regression. As shown in Fig. 4, the regression model generator contains a model initializer and a model optimizer. The model initializer takes the regression features and the bounding box of the first frame as input and generates the initial model, which is a regression convolution filter. The features of the training set and their corresponding bounding boxes are then fed into the model optimizer to update the model iteratively. With this design, our tracker can not only update the filter online to fit the changing target appearance through the optimizer, but also reduce the number of optimization steps to improve the tracking speed.
The model initializer is a single ROI-pooling layer with the size of . To improve efficiency, the initializer performs ROI pooling only on the features of the first frame, thus generating a rough model. The model optimizer is derived from the online regression training loss:
The parameter is the length of the online training set, which is composed of the tracked frames with high classification scores, is the features extracted by the Regression Head-1, denotes the 4D distance vector of center position in the regression map as described in 3.1.2, is a portion of with an area of (the same as the regression model size) centered at , is the regression convolution filter, denotes convolution and is a regularization factor. The objective is to optimize the regression convolution filter . Since plain gradient descent converges slowly, we instead use the steepest descent methodology, which computes a step length to update the model as follows:
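The loss and update described in this paragraph can be sketched in LaTeX as follows. The symbols are assumptions chosen to match the verbal definitions: $N$ is the size of the online training set, $X^{(j)}_{c}$ the patch of Regression Head-1 features centered at the target center $c$ in frame $j$, $M^{(j)}_{reg}(c)$ the 4D distance vector at $c$, $f$ the regression filter, and $\lambda$ the regularization factor. The $1/N$ normalization and the step-length expression follow the DiMP formulation and are likewise assumptions here.

```latex
% Regularized least-squares loss over the online training set (assumed form).
L(f) = \frac{1}{N} \sum_{j=1}^{N}
       \left\| X^{(j)}_{c} * f - M^{(j)}_{reg}(c) \right\|^{2}
       + \left\| \lambda f \right\|^{2}

% Steepest descent update with step length alpha.
f^{(i+1)} = f^{(i)} - \alpha \, \nabla L\!\left(f^{(i)}\right), \qquad
\alpha = \frac{\nabla L(f^{(i)})^{\top} \nabla L(f^{(i)})}
              {\nabla L(f^{(i)})^{\top} Q \, \nabla L(f^{(i)})}

% Q is a positive-definite (Gauss-Newton style) approximation of the
% Hessian of L, so alpha minimizes the quadratic model along -grad L.
```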
The parameter denotes the number of optimization iterations. We compute and following expressions similar to those in DiMP.
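The idea behind steepest descent with a computed step length can be illustrated on a toy problem. The sketch below is not the FCOT optimizer: it assumes a scalar-output linear model in which a plain dot product stands in for the convolution at the target center, and all function names are hypothetical. For such a quadratic loss, the step length is the exact minimizer along the negative gradient, which is why far fewer iterations are needed than with a fixed learning rate.

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


def loss(f: Vector, xs: List[Vector], ys: List[float], lam: float) -> float:
    """L(f) = (1/N) * sum_j (x_j . f - y_j)^2 + lam^2 * ||f||^2."""
    data = sum((dot(x, f) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return data + lam ** 2 * dot(f, f)


def gradient(f: Vector, xs: List[Vector], ys: List[float], lam: float) -> Vector:
    """Gradient of the loss above with respect to f."""
    g = [2.0 * lam ** 2 * w for w in f]
    for x, y in zip(xs, ys):
        residual = dot(x, f) - y
        for k, xk in enumerate(x):
            g[k] += 2.0 * residual * xk / len(xs)
    return g


def hessian_vector(v: Vector, xs: List[Vector], lam: float) -> Vector:
    """Product of the (constant) Hessian of the loss with a vector v."""
    hv = [2.0 * lam ** 2 * vk for vk in v]
    for x in xs:
        proj = dot(x, v)
        for k, xk in enumerate(x):
            hv[k] += 2.0 * proj * xk / len(xs)
    return hv


def steepest_descent(f0: Vector, xs: List[Vector], ys: List[float],
                     lam: float, num_iter: int) -> Vector:
    """Run num_iter steepest descent steps with an exact step length."""
    f = list(f0)
    for _ in range(num_iter):
        g = gradient(f, xs, ys, lam)
        curvature = dot(g, hessian_vector(g, xs, lam))
        if curvature <= 1e-30:  # gradient has vanished, nothing left to do
            break
        alpha = dot(g, g) / curvature  # optimal step along -g
        f = [w - alpha * gk for w, gk in zip(f, g)]
    return f
```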
The parts to be trained offline include the encoder and decoder backbone, the classification and regression heads, the parameter in the regression model generator, and some parameters in the classification model generator as in DiMP. Our offline training is performed in two stages: it first trains the entire network except for the regression optimizer in the model generator, and then updates the regression optimizer with the rest of the network frozen. In this way, the training time is reduced substantially, since optimizing the model online is time-consuming and there is just one parameter in the regression model generator to be trained. The total loss for offline training can be formulated as , where is 100 and is 0.1 in this work. For the classification branch, we use the same loss and training strategies as DiMP. We use an IoU loss to train the regression branch and perform regression for the points in the vicinity of the target center with a radius of 2. This strategy can improve the accuracy of target box regression during tracking, especially when the detected target center deviates from the ground-truth center.
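For boxes parameterized by four distances from a shared point, an IoU loss has a particularly simple form. The sketch below uses the negative-log variant, -ln(IoU); which variant FCOT uses is not stated in the text, so this choice, the function names and the `eps` constant are assumptions for illustration.

```python
import math
from typing import Tuple

Offsets = Tuple[float, float, float, float]  # (left, top, right, bottom)


def ltrb_iou(pred: Offsets, target: Offsets) -> float:
    """IoU of two boxes given as non-negative distances from the same point."""
    pl, pt, pr, pb = pred
    tl, tt, tr, tb = target
    area_pred = (pl + pr) * (pt + pb)
    area_target = (tl + tr) * (tt + tb)
    # Both boxes contain the shared point, so the intersection extends
    # by the smaller distance on each of the four sides.
    inter_w = min(pl, tl) + min(pr, tr)
    inter_h = min(pt, tt) + min(pb, tb)
    inter = max(inter_w, 0.0) * max(inter_h, 0.0)
    union = area_pred + area_target - inter
    return inter / union if union > 0 else 0.0


def iou_loss(pred: Offsets, target: Offsets, eps: float = 1e-7) -> float:
    """Negative-log IoU loss; eps keeps the logarithm finite when IoU is 0."""
    return -math.log(ltrb_iou(pred, target) + eps)
```

Unlike an L1 or L2 loss on the individual distances, this loss treats the four offsets jointly and is insensitive to the absolute scale of the box.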
Given the first frame with its ground-truth bounding box, we construct a training set of size 15 by performing augmentation on that frame. The initial regression and classification models (convolutional filters) are generated by the model initializers. The initial models are then optimized using the augmented training set. We present two simple strategies to update the classification and regression models online. First, every 25 frames we add the frame with the highest classification score to the online training set, so as to guarantee the quality of the training samples. Second, we merge the latest model with the model optimized on the augmented training set of the first frame, so the current model can be formulated as:
This merging turns out to boost the performance of the tracker.
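The two update strategies can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the window-based reading of "every 25 frames", the convex-combination form of the merge, and the `weight` parameter are all hypothetical, and filters are represented as flat lists of floats for simplicity.

```python
from typing import List


def select_training_frames(scores: List[float], window: int = 25) -> List[int]:
    """Pick the index of the highest-scoring frame in each full window.

    Assumption: "every 25 frames" means consecutive, non-overlapping
    windows of 25 tracked frames.
    """
    selected = []
    for start in range(0, len(scores) - window + 1, window):
        chunk = scores[start:start + window]
        best = max(range(window), key=lambda k: chunk[k])
        selected.append(start + best)
    return selected


def merge_filters(first_frame_filter: List[float],
                  latest_filter: List[float],
                  weight: float) -> List[float]:
    """Blend the first-frame model with the latest online model.

    `weight` (hypothetical) is the share given to the first-frame model;
    the remaining share goes to the latest model.
    """
    return [weight * a + (1.0 - weight) * b
            for a, b in zip(first_frame_filter, latest_filter)]
```

Anchoring the current model to the first-frame model is a common way to limit drift, since the first frame is the only one with a ground-truth box.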
Our tracker FCOT is implemented with PyTorch, based on the PyTracking project. We use Adam with a learning rate decay of 0.2 every 25 epochs. We train FCOT for 100 epochs in the first stage and 5 epochs in the second stage. Offline training of the whole model takes 50 hours on 8 RTX 2080 Ti GPUs. For inference, the average tracking speed is 53 fps on a single RTX 2080 Ti GPU. The encoder in the backbone is initialized with ImageNet weights. The training set used to train FCOT includes the training splits of TrackingNet, LaSOT, GOT-10k and COCO. The regression model is a convolution filter with size of and the classification model of .
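As a small worked example of the schedule, the function below computes the learning rate at a given epoch, assuming that "decay of 0.2 every 25 epochs" means multiplying the rate by 0.2 at each 25-epoch boundary (the behavior of a step scheduler). The initial learning rate is not given in the text, so `base_lr` is a hypothetical parameter.

```python
def lr_at_epoch(base_lr: float, epoch: int,
                decay: float = 0.2, step: int = 25) -> float:
    """Step schedule: multiply base_lr by `decay` once per `step` epochs."""
    return base_lr * decay ** (epoch // step)
```

Under this reading, the 100 epochs of the first stage pass through four rate levels, ending at 0.2³ = 0.008 times the initial rate.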
We evaluate the impact of using different feature blocks from the decoder (Table 1). For regression, we can compare the feature layer settings of No.1, No.2 and No.4. Using features from only performs better than only , which demonstrates that features with higher resolution lead to more accurate tracking results. Fusing features from both blocks leads to a significant improvement, giving scores of 75.3%, 51.7% and 62.7% for the three metrics respectively. For classification, we can compare the feature layer settings of No.3 and No.4. The setting that uses both and is difficult to train and does not improve over using only the , so we do not consider it in this paper. It can be seen that using features from performs better than , with gains of 2.1%, 3.6% and 2.3% in SR0.5, SR0.75 and AO respectively.
| No. | w/ Opt | w/ Opt | SR0.5 (%) | SR0.75 (%) | AO (%) |
Here, we analyse the effect of our online model generators (Table 2). Optimization using the augmented training set of the first frame: comparing the optimization settings of No.1, No.2, No.3 and No.4 shows that optimizing the classification model and optimizing the regression model in the first frame each improve the tracking accuracy and success rate. Online optimization: comparing the optimization settings of No.4, No.5, No.6 and No.7, we find that online optimization of each of the two models boosts the performance substantially. Compared with setting No.1, setting No.7 achieves large gains of 4.4%, 3.0% and 3.4% in terms of the three metrics respectively, which demonstrates the effectiveness of the regression model generator and the classification model generator.
| Merge filters? | Select the best samples? | SR0.5 (%) | SR0.75 (%) | AO (%) |
Here, we analyze the impact of the regression area around the target center point, that is, the set of points for which regression is performed during offline training. As shown in Table 3, we find that regressing for the points in the vicinity of the target center improves the robustness of FCOT. The best size of the area is (the radius is 2).
We test the proposed FCOT on six tracking benchmarks and compare our results with the state-of-the-art trackers.
GOT-10k is a large-scale dataset with over 10000 video segments, 180 of which form the test set. It covers generic classes of moving objects and motion patterns, and the object classes in the train and test sets have zero overlap. This property makes the dataset focus on the one-shot tracking capability of a generic tracker.
We show the state-of-the-art comparison in Table 5. DiMP achieves an average overlap (AO) score of 61.1%. Compared with DiMP, our FCOT improves AO by 1.6%. Impressively, our FCOT achieves gains of 3.6% and 2.5% in success rate at thresholds 0.5 and 0.75 respectively, which demonstrates that FCOT is able to generate accurate bounding boxes.
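For reference, the three GOT-10k style metrics used throughout this section can be computed as in the following sketch: AO is the mean IoU between predicted and ground-truth boxes, and the success rate at a threshold is the fraction of frames whose IoU exceeds it. The function names and the (x0, y0, x1, y1) box format are assumptions for illustration, and the official toolkit may differ in details such as how the first frame is handled.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def average_overlap(preds: List[Box], gts: List[Box]) -> float:
    """AO: mean IoU over all frames."""
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(gts)


def success_rate(preds: List[Box], gts: List[Box], threshold: float) -> float:
    """SR: fraction of frames whose IoU exceeds the threshold."""
    hits = sum(1 for p, g in zip(preds, gts) if iou(p, g) > threshold)
    return hits / len(gts)
```

A gain at the 0.75 threshold requires tightly fitting boxes, which is why the success rate at high thresholds is the more telling indicator of regression quality.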
UAV123 is a large dataset captured from low-altitude UAVs. Compared with other benchmarks, the targets in UAV123 therefore tend to be farther from the camera. This dataset has a total of over 110K frames and 123 video sequences. We compare our FCOT with previous approaches on this dataset. As shown in Table 6, DiMP achieves an AUC score of 64.3% and a precision score of 84.9%. Our FCOT outperforms the previous approaches, reaching 65.7% in AUC score and 87.6% in precision score. This demonstrates the capability of our method in tracking distant and small objects.
Need For Speed (NFS) contains a total of 380K frames in 100 videos captured with high-frame-rate cameras in real-world scenarios. We evaluate our FCOT on the 30 FPS version of this dataset and compare it with recent approaches. The results are shown in Table 7. Specifically, FCOT obtains an AUC score of 62.2% and a precision score of 74.5%.
TrackingNet provides over 30K videos with more than 14 million dense bounding box annotations. We validate FCOT on its test set, which consists of 511 videos with an average of 441 frames per sequence. The results generated by our tracker are submitted to the evaluation server, which computes and returns three metrics: success, precision and normalized precision. As shown in Table 8, our FCOT reaches a precision score of 71.4% and a normalized precision score of 81.7%, surpassing DiMP and SiamFC++. We also achieve a success score of 74.5%, which is competitive with the state-of-the-art method.
| Norm. Prec. (%) | 61.8 | 70.5 | 77.1 | 80.0 | 80.1 | 80.0 | 81.7 |
LaSOT has 280 videos in its test set. With an average of 2500 frames, the sequences of LaSOT are longer than those of other datasets, which poses great challenges to trackers. We evaluate our FCOT on the test set to validate its long-term capability. The results are shown in Fig. 5. FCOT achieves a success score of 55.4%, which is better than all other methods except DiMP. In normalized precision, we reach the best performance of 65.7%. Our method outperforms DiMP when the overlap threshold is above 0.5, which means our online regression branch predicts more accurate bounding boxes. The overall gap between DiMP and FCOT comes from coarse-grained tracking. As also confirmed by the normalized precision plots, our FCOT does better in accurate localization, while DiMP finds more coarse-grained predictions.
In this paper, we have presented a fully convolutional online tracker (FCOT), which unifies the components of feature extraction, classification head, and regression head into an encoder-decoder architecture. Our key contribution is an online anchor-free box size regression branch that directly estimates bounding box sizes. This new design enables the whole tracking framework to run in a simple fully convolutional manner, and also allows the regression branch to be optimized online to handle target deformation well. Extensive experiments on several benchmarks demonstrate the high precision of our proposed anchor-free and online regression branch. Our FCOT outperforms the state-of-the-art trackers on most benchmarks at a high speed of 53 FPS.
We provide visualization examples generated by FCOT and DiMP on the OTB100 dataset in the attached folder named visualization, which is available at https://github.com/MCG-NJU/FCOT/tree/master/visualization. These sequences exhibit challenges including occlusion, scale variation, deformation, motion blur and so on. It can be seen that our tracker performs well on these sequences. In particular, the bounding boxes are more precise than those of DiMP once the objects have been roughly located by the classification branch.
Furthermore, we find that the labels of the OTB100 dataset are not accurate enough. It can be seen from Fig. 7 that the ground-truth bounding boxes (colored in green) are not accurate and the definitions of the objects are ambiguous, which degrades the measured performance of our FCOT.
In this section, we compare the classification score maps generated by DiMP and FCOT. Fig. 8 shows that the score map of FCOT is more precise than that of DiMP, which contributes to the tracking accuracy of our tracker. This demonstrates the effectiveness of our encoder-decoder architecture.
ImageNet classification with deep convolutional neural networks. Commun. ACM 60 (6), pp. 84–90.