
Discriminative and Robust Online Learning for Siamese Visual Tracking

by Jinghao Zhou, et al.

The problem of visual object tracking has traditionally been handled by two distinct tracking paradigms, either learning a model of the object's appearance exclusively online or matching the object with the target in an offline-trained embedding space. Despite recent success, each paradigm suffers from its own intrinsic constraints. The online-only approaches suffer from a lack of generalization of the model they learn and are thus inferior in target regression, while the offline-only approaches (e.g., conventional siamese trackers) lack video-specific context information and are thus not discriminative enough to handle distractors. Therefore, we propose a parallel framework that integrates offline-trained siamese networks with a lightweight online module to enhance the discriminative capability. We further apply a simple yet robust template update strategy for siamese networks in order to handle object deformation. Robustness is validated by consistent improvements over three siamese baselines: SiamFC, SiamRPN++, and SiamMask. Beyond that, our model based on SiamRPN++ obtains the best results on six popular tracking benchmarks. Though equipped with an online module while tracking proceeds, our approach inherits the high efficiency of its siamese baseline and can operate beyond real-time.





Code Repositories


Discriminative and Robust Online Learning for Siamese Visual Tracking [AAAI2020]


1 Introduction

Figure 1: Qualitative demonstration and comparison. ATOM, an online-trained discriminative tracker, still struggles with accurate bounding box regression despite being equipped with an IoUNet. SiamRPN++, an offline-trained generative tracker, lacks background and video-specific information and thus struggles in situations with distractors and deformation. Our method, DROL, combines the unique strengths of each and achieves fully discriminative and generative visual object tracking.

Visual object tracking is a fundamental topic in various computer vision tasks, such as vehicle navigation, automatic driving, visual surveillance, and video analytics. Briefly speaking, given the position of an arbitrary target of interest in the first frame of a sequence, visual object tracking aims to estimate its location in all subsequent frames.

In the context of visual object tracking, almost all state-of-the-art trackers fall into two categories: discriminative and generative trackers. Discriminative trackers train a classifier to distinguish the target from the background, while generative trackers find the object that best matches the target by computing the joint probability density between targets and search candidates.

Recently, the siamese framework [Bertinetto et al.2016a], as a generative tracker, has received increasing attention because its performance has taken the lead on various benchmarks while running at real-time speed. Generally, these approaches obtain a similarity embedding space between subsequent frames and directly regress the real state of the target [Li et al.2018, Wang et al.2019]. Despite its recent success, the siamese paradigm still suffers from severe limitations. Foremost, siamese approaches ignore background information in the course of tracking, leading to inferior discriminative capability in the face of distractors. Furthermore, since most siamese-based trackers solely utilize the first frame as the template, or merely update it by averaging the subsequent frames [Zhu et al.2018], their performance degrades under large deformation, rotation, and motion blur, which results in poor generalization for accurate object regression.

Comparatively speaking, discriminative trackers can tackle the aforementioned restrictions in a graceful manner through a discriminative classifier and a highly efficient online updating strategy. However, accurate target regression has long eluded such approaches, due to their low sensitivity to target state changes under brute-force searching, and their susceptibility to noisy distraction that easily results in target state drift. Also, these methods, using deep features and complicated online updating procedures, are often extremely slow.

In light of above-mentioned works, we intend to complement the generative siamese baseline with a module for discriminative and robust online learning. We propose two parallel subnets, one primarily accounting for object regression using an offline-trained siamese network, the other to deal with distractors through an online-trained classifier. A template updater is also introduced to handle the drift of object appearance. With the siamese regressor, our method is powerful and generative in terms of robust target regression. With online learning of the discriminative classifier, our approach can fully exploit background information and update itself with new data in the course of tracking.

To wrap up, we develop a highly effective visual tracking framework and establish a new state-of-the-art in terms of robustness and accuracy. The main contributions of this work are threefold:

  • We propose to combine a lightweight classifier, trained and updated online, with siamese networks to explicitly account for background information and further deal with distractors and occlusion.

  • We propose template update for the siamese networks based on the discriminative classification score, to improve their robustness in handling deformation, rotation, illumination change, etc.

  • The proposed parallel online-offline framework can be integrated with a wide range of siamese trackers without re-training them. Our method consistently sets new state-of-the-art performance on the popular benchmarks OTB100, VOT2018, VOT2018-LT, UAV123, TrackingNet, and LaSOT.

2 Related Work

Siamese visual tracking. Recently, tracking methods based on siamese networks [Bertinetto et al.2016a] have greatly outperformed other trackers in the literature due to heavy offline training, which largely enriches the depiction of the target. Follow-up works include higher-fidelity object representation [Wang et al.2019, Li et al.2018], deeper backbones for feature extraction [Li et al.2019a, Zhang and Peng2019], and multi-stage or multi-branch network design [Fan and Ling2019, He et al.2018]. Beyond that, another popular optimization is template update for robust tracking under large target appearance changes, either using fused frames [Choi, Kwon, and Lee2017a] or separately selected frames [Yang and Chan2018]. However, the lack of background information degrades performance on challenges like distractors, partial occlusion, and large deformation. In fact, such rigidity is intrinsic, given that the object classification score from the siamese framework tends to be noisy and thus requires a centered Gaussian window to ensure stability. While many works intend to construct a target-specific space, via distractor-aware offline training [Zhu et al.2018], residual attentive learning [Wang et al.2018], or gradient-based learning [Choi, Kwon, and Lee2017b, Li et al.2019b], this embedding space obtained through cross-correlation has yet to explicitly account for distractors and appearance variation.

Online learning approach. Online learning is the dominant trait that lets discriminative trackers succeed, achieved by training an online component to distinguish the target object from the background. In particular, correlation-filter-based trackers [Henriques et al.2014, Danelljan et al.2017] and classifier-based trackers [Kalal, Mikolajczyk, and Matas2011, Nam and Han2016] are among the most representative and powerful methods. These approaches learn an online model of the object's appearance using hand-crafted features or deep features pre-trained for object classification. Given the recent prevalence of the meta-learning framework, [Bhat et al.2019, Park and Berg2018] further learn to learn during tracking. Comparatively speaking, online learning for siamese-network-based trackers has received less attention. While previous approaches directly insert a correlation filter as a layer in the forward pass and update it online [Bertinetto et al.2016b, Guo et al.2017], our work focuses on combining the siamese network with an online classifier in a parallel manner, with no need to retrain either.

Target regression and classification. Visual object tracking simultaneously needs target regression and classification, which can be regarded as two different but related subtasks. While object classification has been successfully addressed by discriminative methods, object regression has recently been upgraded by the introduction of the Region Proposal Network [Li et al.2018] in siamese networks, owing to the intrinsic strength of generative trackers to embed rich representations. Nevertheless, most trackers excel at one subtask while degrading at the other. To design a high-performance tracker, many works address these two needs in two separate parts. MBMD [Zhang et al.2018] uses a regression network to propose candidates and classifies them online. ATOM [Danelljan et al.2019] combines a simple yet effective online discriminative classifier and a high-efficiency optimization strategy with an IoUNet through overlap maximization, while its regression relies on a sparse sampling strategy, compared with the dense one via cross-correlation in siamese networks.

3 Proposed Method

Figure 2: Overview of the network architecture for visual object tracking. It consists of a siamese matching subnet, mainly accounting for bounding box regression, and an online classification subnet generating the classification score. The dashed line denotes reference scores passed from the classifier to the updater to select the short-term template.

The visual object tracking task can be formulated as a learning problem whose primary task is to find the optimal target position by minimizing the following objective:

L(w) = Σ_j ‖r(f(x_j; w), y_j)‖² + Σ_k λ_k ‖w_k‖²    (1)

where r computes the residual at each spatial location and y_j is the annotated label. The impact of regularization on each w_k is set by λ_k. In this section, we detail how the two subnets of our proposed method, shown in Figure 2, are unified under this formula and complement each other to form a high-performance tracker.
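As a concrete illustration, the objective above can be sketched for a simple linear model in plain NumPy. The function and symbol names below are ours, not the paper's, and a single scalar regularizer stands in for the per-block weights λ_k:

```python
import numpy as np

def tracking_objective(w, X, Y, lam):
    """Sum of squared residuals between model responses X @ w and labels Y,
    plus an L2 regularization term weighted by lam."""
    residual = X @ w - Y                    # r(f(x_j; w), y_j) for a linear f
    data_term = float(np.sum(residual ** 2))
    reg_term = lam * float(np.sum(w ** 2))  # regularization on the weights
    return data_term + reg_term
```

Minimizing this over w trades off fitting the labels against keeping the model small, exactly the tension the regularization weight controls.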

3.1 Siamese Matching Subnet

Before moving to our approach, we first review the siamese network baseline. Siamese-network-based tracking algorithms treat (1) as a template-matching task by formulating it as a cross-correlation problem, learning an embedding space that computes where in the search region the template matches best:

f(z, x) = φ(z) ⋆ φ(x) + b·𝟙    (2)

where one branch learns the feature representation φ(z) of the target z, and the other learns that of the search area x.

While the brute-force scale search used to obtain target regression is insufficient in accuracy, siamese network extensions explicitly complement themselves with a Region Proposal Network (RPN) head or a mask head, separately regressing the bounding box and the mask of the target by encoding subspaces φ_box for the box and φ_mask for the mask. They each output the offsets for all prior anchors or a pixel-wise binary mask. These three variants of target regression can be formulated as:

A_score = φ(z) ⋆ φ(x),  A_box = φ_box(z) ⋆ φ_box(x),  A_mask = φ_mask(z) ⋆ φ_mask(x)    (3)

which are separately adopted by our chosen baselines SiamFC, SiamRPN++, and SiamMask. For SiamFC, z is the scaled template with maximal response value and A_score is a vector directly giving the target center at the pixel with maximal score. A_box is a vector standing for the offsets of the center point location and scale of the anchors relative to the ground truth. A_mask is a vector encoding the spatial distribution of the object mask.

The eventual predicted target state is obtained using the same strategy as in [Bertinetto et al.2016a, Li et al.2018, Li et al.2019a, Wang et al.2019]. Additionally, the siamese network commonly takes cross-entropy loss as the classification residual, smooth L1 loss for bounding box regression, and logistic regression loss for mask regression.
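The cross-correlation at the heart of (2) can be sketched naively in NumPy. Real trackers run it as a batched convolution over deep multi-channel features on the GPU; this toy version operates on single-channel maps to make the sliding inner product explicit:

```python
import numpy as np

def xcorr2d(template, search):
    """Dense (valid-mode) cross-correlation: slide the template over the
    search region and record the inner product at every offset."""
    th, tw = template.shape
    oh = search.shape[0] - th + 1
    ow = search.shape[1] - tw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(template * search[i:i + th, j:j + tw])
    return out
```

The location of the peak in the output map is the offset at which the search region best matches the template, which is exactly what the siamese score map encodes.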

3.2 Discriminative Online Classification

Figure 3: Network architecture of the online classification subnet. It consists of three modules: a compression module to refine the feature map from the backbone so that it suits classification, an attention module resolving data imbalance, and a filter module. The attention module consists of spatial attention (FC layers after global average pooling (GAP)) and channel-wise attention (a softmax after channel averaging). The grey area denotes modules kept fixed during tracking.
Input: Video frames of length T;
Initial target state b_1;
Output: Target state b_t in subsequent frames;
Initialize the train set for the online classification subnet with the augmented searches of the first frame;
Set the template for the siamese matching subnet;
for t ← 2 to T do
      Obtain the search region based on the previous predicted target state b_{t-1};
      // Track
      Obtain the online classification score via the classification subnet;
      Obtain the siamese score using (2);
      Obtain the current target state b_t based on the fused score using (6);
      if the classification score exceeds the quality threshold then
            Train the classification subnet with the updated train set;
      end if
      if t is a multiple of the update interval then
            Update the short-term template using (7);
      end if
end for
Algorithm 1: Tracking algorithm
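The control flow of Algorithm 1 can be sketched as a generic Python loop. All callables and parameter names here are placeholders we invent for illustration, not the paper's API:

```python
def run_tracker(frames, init_state, predict, train_step, update_template,
                update_interval=10, score_threshold=0.5):
    """Skeleton of the online/offline tracking loop: predict each frame's
    state, grow the online training set when the score is confident, and
    refresh the short-term template at a fixed interval."""
    state = init_state
    train_set = [(frames[0], init_state)]
    for t in range(1, len(frames)):
        # fused siamese + online classification gives the new state and score
        state, score = predict(frames[t], state)
        if score > score_threshold:          # only learn from confident frames
            train_set.append((frames[t], state))
            train_step(train_set)
        if t % update_interval == 0:
            update_template(train_set)
    return state
```

The quality gate before `train_step` mirrors the algorithm's conditional training, which keeps occluded or blurred frames out of the online model.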

Comparatively speaking, discriminative trackers, like classifier-based and correlation-filter-based trackers, treat (1) as a candidate-scoring task based on an L2 loss:

L(w) = Σ_j ‖f(x_j; w) − y_j‖² + Σ_k λ_k ‖w_k‖²    (4)

given its desirable convexity, closed-form solution, and fast convergence. These trackers return a confidence for each sampled candidate or, as in our classification subnet C, a heatmap representing a distribution of labels.

However, in light of related research [Li et al.2019b, Yang et al.2019], fitting confidence scores for all negative background pixels dominates online learning when raw features are used directly. Moreover, only a few convolutional filters play an important role in constructing each feature pattern or object category.

The aforementioned data imbalance, in both the spatial and channel-wise dimensions, degrades the model's discriminative capability. To accommodate this issue, we consider a classifier with an attention module, trained exclusively online, to discriminate the target from any distractors in the search region. Given the speed requirement of online learning, it is set to a lightweight 3-layer convolutional neural network, as shown in Figure 3.


The gradient of (1) can be computed as

∇L(w) = 2 J_wᵀ r(w)    (5)

Instead of using brute-force stochastic gradient descent (SGD) as the optimization strategy, following [Danelljan et al.2019] we use Gauss-Newton descent, which is better suited to online learning in terms of convergence speed. Generally, it reformulates (1) as the squared norm of the residual vector, L(w) = ‖r(w)‖², where J_w is the Jacobian of r with respect to w, so that a positive definite quadratic problem can be formed. This algorithm allows an adaptive search direction and learning rate during backpropagation.
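For a linear residual, the Gauss-Newton step reduces to solving a positive definite normal system; a minimal sketch under that assumption (names and the scalar regularizer are ours):

```python
import numpy as np

def gauss_newton_step(w, X, y, lam=0.0):
    """One Gauss-Newton step for 0.5*||X @ w - y||^2 + 0.5*lam*||w||^2.
    For a linear residual the Jacobian is X itself, so the step solves
    the regularized normal equations exactly."""
    r = X @ w - y                            # residual vector r(w)
    J = X                                    # Jacobian of the residual w.r.t. w
    H = J.T @ J + lam * np.eye(len(w))       # positive definite quadratic model
    g = J.T @ r + lam * w                    # gradient of the objective
    return w - np.linalg.solve(H, g)
```

Because the quadratic model is exact in the linear case, a single step lands on the minimizer; for the nonlinear network in the paper, the same step is applied iteratively.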

After obtaining the classification confidence score, we resize it with cubic interpolation to the same spatial size as the classification score of the siamese matching network, and fuse the two through a weighted sum, yielding an adaptive classification score:

c = λ_c · c_siam + (1 − λ_c) · c_online    (6)

where λ_c sets the impact of the two confidence scores. Different types of regression strategy can then be applied to locate the target.
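Once both maps share a spatial size, the weighted fusion is a one-liner; `alpha` below is our placeholder name for the fusion weight:

```python
import numpy as np

def fuse_scores(siam_score, online_score, alpha=0.6):
    """Adaptive classification score: a convex combination of the siamese
    score map and the (already resized) online classifier score map."""
    assert siam_score.shape == online_score.shape
    return alpha * siam_score + (1.0 - alpha) * online_score
```

Keeping the fusion a convex combination preserves the score range of the two inputs, so downstream regression heads need no rescaling.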

3.3 Template Update

Figure 4: Pipeline of the template update in the siamese network. We showcase the updating details on the sequence Basketball of OTB2015. The frame with the highest score among the most recent frames is used to update the short-term template.

A crucial aspect of siamese trackers is the selection of the template, as changes in target appearance will result in poor-quality bounding boxes, or even target drift, given a bad template. In light of previous works, we design an extra template branch for siamese networks to preserve the target information from the most recent frames. The template for this branch represents a short-term memory, as opposed to retaining the first frame as long-term memory.

It is worth noting that the key to good performance rests largely on the quality of the identified templates, so the selection of template candidates needs to be designed carefully. Fortunately, the score from the online classification subnet provides a powerful yardstick to select high-quality templates without extra computation. We thus select the short-term template by

z_s = ψ(x_{t*}),  t* = argmax_t c_t  s.t.  c_{t*} > τ    (7)

where ψ is the cropping operation to the feature size of the template and τ is a filtering threshold suppressing low-quality templates under occlusion, motion blur, and distractors.

When shifting to the short-term template, the template in (2) is replaced by whichever of the two templates yields the higher maximal score, and the location with the maximal score gives the corresponding bounding box. This avoids bounding box drift and inconsistency when using the short-term template. The full pipeline of our method is detailed in Algorithm 1.
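The selection rule in (7) amounts to an argmax over recent classifier scores, gated by a quality threshold; a sketch with invented names:

```python
def select_short_term_template(recent_scores, tau):
    """Return the index of the highest-scoring recent frame if it clears the
    quality threshold tau, else None (keep the current template)."""
    if not recent_scores:
        return None
    best = max(range(len(recent_scores)), key=lambda i: recent_scores[i])
    return best if recent_scores[best] > tau else None
```

Returning None when no frame clears the threshold is what suppresses low-quality templates during occlusion or motion blur: the tracker simply keeps the template it already has.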

4 Experiments

4.1 Implementation Details

Our method is implemented in Python with PyTorch, and the complete code and video demo will be made available at

Architecture. Corresponding to the regression variants described in (3), we apply our method to three siamese baselines, SiamFC, SiamRPN++, and SiamMask, yielding DROL-FC, DROL-RPN, and DROL-Mask respectively. In DROL-FC, we use the modified AlexNet as the backbone and up-channel correlation. In DROL-RPN, we use the modified AlexNet and layers of ResNet50 with depth-wise correlation. In DROL-Mask, we use layers of ResNet50 with depth-wise correlation. For the classification subnet, the first layer is a convolutional layer with ReLU activation, which reduces the feature dimensionality. The last layer employs a kernel with a single output channel.

Online learning. The classification subnet is trained exclusively online by maintaining a training set with a fixed maximal batch size. During tracking, we add each incoming frame to the set by replacing the least similar one. Each sample is labeled with a Gaussian function centered at the target position. In particular, hard negative mining is performed whenever distractors are detected, by doubling the learning rate.
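The Gaussian label map centered at the target position can be generated as follows; the map size and `sigma` here are illustrative choices, not the paper's settings:

```python
import numpy as np

def gaussian_label(size, center, sigma=2.0):
    """Label map for online classification: 1.0 at the target center,
    decaying with a Gaussian falloff over the search region."""
    ys, xs = np.mgrid[0:size, 0:size]
    cy, cx = center
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2   # squared distance to the center
    return np.exp(-d2 / (2.0 * sigma ** 2))
```

Soft labels of this kind penalize near-center predictions less than far-off ones, which stabilizes the regression-style classifier training.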

To initialize the classifier, we use a cropped region to pre-train the whole classification subnet. Similar to [Danelljan et al.2019], we perform data augmentation on the first frame (translation, rotation, and blurring), yielding the initial training samples. For subsequent frames, only the last layers are updated, while the first two layers are kept fixed to retain target-specific information.

Hyperparameters setting. For classification score fusion, we set the fusion weight separately for DROL-FC and for DROL-RPN and DROL-Mask. For template update, the short-term template uses the target in the most recent frames. To strike a balance between stability and dynamicity as tracking proceeds, we update the short-term template at a fixed interval of frames, with the remaining thresholds set accordingly.

4.2 Comparison with state-of-the-art

We first showcase the consistent improvement of DROL over each of its baselines on two popular benchmarks, OTB2015 [Wu, Lim, and Yang2015] and VOT2018 [Kristan et al.2018], as shown in Table 1. We further verify our best model, DROL-RPN with a ResNet50 backbone, on four other benchmarks: VOT2018-LT, UAV123 [Mueller, Smith, and Ghanem2016], TrackingNet [Muller et al.2018], and LaSOT [Fan et al.2019].

Method      Backbone   AUC     Pr      EAO     A       FPS
CCOT        VGGNet     0.682   0.903   0.267   0.494   0.3
ECO         VGGNet     0.694   0.910   0.280   0.484   8
UPDT        ResNet50   0.702   -       0.378   0.536   -
LADCF       VGGNet     0.696   0.906   0.389   0.503   10
MDNet       VGGNet     0.678   0.909   -       -       1
ATOM        ResNet18   0.669   0.882   0.401   0.590   30
DiMP        ResNet50   0.684   -       0.440   0.597   40
SiamFC      AlexNet    0.582   0.771   0.206   0.527   120
DROL-FC     AlexNet    0.619   0.848   0.256   0.502   60
SiamRPN     AlexNet    0.637   0.851   0.326   0.569   200
DaSiamRPN   AlexNet    0.658   0.875   0.383   0.586   160
C-RPN       AlexNet    0.663   0.882   -       -       32
SPM         AlexNet    0.687   0.899   -       -       120
SiamRPN++   AlexNet    0.666   0.876   0.352   0.576   180
SiamRPN++   ResNet50   0.696   0.915   0.414   0.600   35
DROL-RPN    AlexNet    0.689   0.914   0.376   0.583   80
DROL-RPN    ResNet50   0.715   0.937   0.481   0.616   30
SiamMask    ResNet50   -       -       0.347   0.602   56
DROL-Mask   ResNet50   -       -       0.434   0.614   40

Table 1: state-of-the-art comparison on two popular tracking benchmarks, OTB2015 (AUC, Pr) and VOT2018 (EAO, A), with running speed. AUC: area under curve; Pr: precision; EAO: expected average overlap; A: accuracy; FPS: frames per second. Performance is evaluated using the best results over all settings proposed in the original papers. Speed is tested on an Nvidia GTX 1080Ti GPU.

OTB2015. We validate our proposed tracker on the OTB2015 dataset, which consists of 100 videos. According to previous research, siamese-based trackers used to underperform correlation-filter-based trackers; not until the recent work SiamRPN++ unveiled the power of deep networks did the siamese framework take the lead on the OTB benchmark. Nevertheless, as shown in Figure 1, even the best siamese trackers still suffer from distractors, full occlusion, and deformation in sequences like Basketball, Liquor, and Skating1. Comparatively speaking, our approach can robustly discriminate under the aforementioned challenges and thus obtains a desirable gain over the siamese baseline through online learning.

Figure 5 and Table 1 show that DROL-RPN achieves the top-ranked result in both AUC and Pr. Compared with the previous best tracker, UPDT, we improve the overlap score; compared with ATOM and SiamRPN++, our tracker achieves performance gains in both overlap and precision.

Figure 5: state-of-the-art comparison on the OTB2015 dataset in terms of success rate (left) and precision (right).

VOT2018. Compared with OTB2015, the VOT2018 benchmark includes more challenging factors and may thus be deemed a more holistic testbed of both accuracy and robustness. We validate our proposed tracker on the VOT2018 dataset, whose sequences are evaluated under several conditions: Overall, Camera motion, Illumination change, Motion change, Size change, Occlusion, and Unassigned. Our approach achieves the best results in most conditions.

Figure 6: state-of-the-art comparison on the VOT2018 Baseline (left) and Long-term (right) datasets. In the left plot, the maximal and minimal values for each attribute are shown in pairs, and the rest are permuted proportionally. In the right plot, the average precision-recall curves are shown, each with its maximal F-score marked at the corresponding point.

Pr 0.566 0.574 0.627 0.642 0.649 0.687
Re 0.510 0.521 0.588 0.583 0.612 0.650
Table 2: state-of-the-art comparison on the VOT2018-LT dataset in terms of precision (Pr) and recall (Re).

As shown in Figure 6(a), our tracker DROL-RPN improves over ATOM and SiamRPN++ by an absolute gain in terms of EAO, and outperforms the leading tracker DiMP in both overlap and accuracy. Additionally, DROL-RPN reaches a substantial gain in robustness over the VOT2018 challenge winner MFT after adopting the template update strategy. Though featuring desirable accuracy, our tracker inherits the high efficiency of the siamese framework and operates at a frame rate beyond real-time.

VOT2018-LT. The VOT2018 Long-term dataset provides a challenging yardstick for robustness, as the target may leave the field of view or encounter full occlusion for a long period. We validate our method on VOT2018-LT. The performance measures are precision, recall, and F-score during tracking; we report these metrics against state-of-the-art methods in Figure 6(b) and Table 2.

We directly adopt the same long-term strategy as SiamRPN++ [Li et al.2019a] in our siamese subnet. When target absence is detected and the search region is enlarged, we do not add the searches from these moments into the training set of the classifier. Compared with our baseline SiamRPN++, our approach achieves a maximum performance gain in precision, recall, and F-score. Results show that even under huge deformation and long-term target absence, the online classification subnet still performs desirably.

UAV123. We evaluate our proposed tracker on the UAV123 dataset. The dataset contains 123 video sequences, totaling more than 110,000 frames, captured from low-altitude UAVs.

Table 3 showcases the overlap and precision of the compared trackers. We compare our tracker with previous state-of-the-art methods, and results show that our approach outperforms the others in AUC score. Our tracker also outperforms the SiamRPN++ baseline by a large margin in both AUC score and precision.

AUC 0.525 0.527 0.613 0.644 0.654 0.652
Pr 0.741 0.748 0.807 - - 0.855
Table 3: state-of-the-art comparison on the UAV123 dataset in terms of success (AUC), and precision (Pr).

TrackingNet. We validate our proposed tracker on the TrackingNet test set. The dataset provides a large number of videos with millions of dense bounding box annotations, covering a wide selection of object classes in a broad and diverse context. Its test subset allows us to fully compare all trackers on large-volume data including common objects in the wild.

AUC 0.614 0.703 0.712 0.733 0.740 0.746
Pr 0.555 0.648 0.661 0.694 0.687 0.708
NPr 0.710 0.771 0.778 0.800 0.801 0.817
Table 4: state-of-the-art comparison on the TrackingNet test dataset in terms of success (AUC), precision (Pr) and normalized precision (NPr).
Figure 7: state-of-the-art comparison on LaSOT dataset in terms of success rate (left) and normalized precision (right).

As illustrated in Table 4, our approach DROL-RPN outperforms all previous methods and improves the state-of-the-art performance in overlap, precision, and normalized precision.

LaSOT. We validate our proposed tracker on the LaSOT dataset to further verify its discriminative and generative capability. The large-scale sequences in the test set provide a comprehensive yardstick for evaluating trackers' overall performance. The LaSOT benchmark has a long average sequence length, which makes the model's discriminative and generative capability to deal with distractors and huge deformation crucially significant.

Figure 7 reports the overall performance of DROL-RPN on the LaSOT test set. Our tracker achieves AUC and NPr scores that outperform ATOM and SiamRPN++ without bells and whistles.

4.3 Ablation Study

In this part, we perform an extensive ablation analysis to demonstrate the impact of each component and fully illustrate the superiority of our proposed method. For brevity, we only showcase DROL-RPN with backbone ResNet50.

Figure 8: Visualization of the final classification output on the sequence Bolt of OTB2015: (a) search region, (b) without the classification subnet C, (c) without the attention module, (d) with the attention module.
Figure 9: IoU curve and expected average IoU on sequence MotorRolling (left) and Coke (right) of OTB2015.
Table 5: Ablation study of the attention module in the online classification subnet and of the update parameters in the siamese matching subnet. Cmp and Att represent the compression module and attention module respectively. The remaining columns denote whether the first-frame and short-term templates are used, and the update interval for the short-term template.