1 Introduction
Recent years have witnessed various Siamese-network-based approaches to visual tracking, owing to their balance between accuracy and speed. The pioneering work SiamFC [4] proposed a simple yet effective tracking framework: a Siamese network is trained offline to learn a metric function, converting tracking into template matching with the learned metric. This framework is an ideal baseline for real-time tracking, since its simple architecture is easy to combine with other techniques, and its high speed of nearly 86 frames per second (FPS) leaves room to add such techniques to improve accuracy while maintaining real-time speed (30 FPS). Since then, many real-time trackers [38, 17, 20, 52, 42, 18, 14, 43, 13, 45, 49, 24, 51] have been proposed to improve its accuracy through a variety of techniques. Along this line, the recent SiamRPN tracker [24, 22] (the champion of the VOT2018 [22] real-time challenge) achieved a significant improvement in accuracy at high speed (nearly 90 FPS) by applying a Region Proposal Network (RPN) to directly regress the position and scale of objects. This method will likely become the next baseline for real-time tracking, due to its high speed and impressive accuracy.
Despite being actively studied with remarkable progress, Siamese-network-based visual trackers generally face a conflict between their high memory cost and the strict memory budgets of real-world applications, especially SiamRPN [24, 22], whose model size is up to 361.8 MB. This high memory cost makes them undesirable for practical mobile visual tracking applications, such as accurate trackers running in real time on a drone, smartphone, or sensor node. Decreasing the memory cost of Siamese trackers without a notable loss of tracking accuracy is one of the keys to bridging the gap between academic algorithms and practical applications. On the other hand, reducing model size directly decreases the computational cost, producing a faster tracker. If the faster tracker achieves accuracy similar to the larger one, like SiamFC or SiamRPN, it becomes another attractive baseline for real-time tracking.
To address the above points, we propose a novel Distilled Siamese Trackers (DST) framework built upon a Teacher-Students Knowledge Distillation (TSsKD) model, which is specially designed for learning a small, fast yet accurate Siamese tracker through knowledge distillation (KD) techniques. TSsKD explores a one-teacher vs. multi-students learning mechanism inspired by common teaching practice in schools: multiple students learn from one teacher and help each other to improve the learning outcome. In particular, TSsKD models two styles of knowledge distillation. First, knowledge transfer from teacher to students, achieved by a tracking-specific distillation strategy. Second, mutual learning between students, working in a student-student knowledge sharing manner.
More specifically, to enable more efficient and tracking-specific knowledge distillation within the same domain (without additional data or labels), the teacher-student knowledge transfer is equipped with a set of carefully designed losses, i.e., a teacher soft loss, an adaptive hard loss, and a Siamese attention transfer loss. The first two allow the student to mimic the high-level semantic information of the teacher and the ground truth while reducing overfitting, and the last, tightly coupled with the Siamese structure, is applied to learn middle-level semantic hints. To further enhance the performance of the student tracker, we introduce a knowledge sharing strategy with a conditional sharing loss that encourages sharing reliable knowledge between students. This provides extra guidance that helps small-size trackers (the "dull" students) establish a more comprehensive understanding of the tracking knowledge and thus achieve higher accuracy.
In summary, our key contributions include:


A novel framework of Distilled Siamese Trackers (DST) is proposed to compress Siamese-based deep trackers for high-performance visual tracking. To the best of our knowledge, this is the first work to introduce knowledge distillation for visual tracking.

Our framework is built on a novel teacher-students knowledge distillation (TSsKD) model proposed for better knowledge distillation by simulating the teaching mechanism among one teacher and multiple students, including teacher-student knowledge transfer and student-student knowledge sharing. In addition, a theoretical analysis is conducted to prove its effectiveness.

For the knowledge transfer model, we design a set of losses that tightly couple with the Siamese structure and reduce overfitting during training for better tracking performance. For the knowledge sharing mechanism, a conditional sharing loss is proposed to transfer reliable knowledge between students and further enhance the "dull" students.
Extensive empirical evaluations of the well-known SiamFC [4] and SiamRPN [24, 22] trackers on several tracking benchmarks clearly demonstrate the generality and impressive performance of the proposed framework. The distilled trackers achieve compression rates of more than 13×–18× and speedups of nearly 2×–3×, respectively, while maintaining the same or even slightly improved tracking accuracy. The distilled SiamRPN also obtains state-of-the-art performance (as shown in Fig. 1) at an extremely high speed of 265 FPS.
2 Related Work
Trackers with Siamese Networks: Tao et al. [36] utilized a Siamese network with both convolutional and fully-connected layers for training and achieved favorable accuracy, though at a low speed of 2 FPS. To improve the speed, Bertinetto et al. [4] proposed "SiamFC", applying only an end-to-end Siamese network with 5 fully-convolutional layers for offline training. Because of its high speed of nearly 86 FPS on a GPU, favorable accuracy, and simple online tracking mechanism, there has been a surge of interest around SiamFC, and various improved methods have been proposed [38, 17, 20, 52, 42, 18, 14, 43, 13, 45, 49, 24, 51]. For instance, Li et al. [24] proposed the SiamRPN tracker by combining a Siamese network with an RPN [32], which directly obtains the location and scale of objects by regression, avoiding the multiple forward passes for scale estimation in common Siamese trackers. Thus, it runs at 160 FPS with better tracking accuracy. Subsequently, Zhu et al. [51] proposed distractor-aware training to exploit more datasets and applied distractor-aware incremental learning to improve online tracking. In the recent VOT2018 [22], a variant of SiamRPN with a larger model size won the real-time challenge in the EAO metric and ranked 3rd in the main challenge.
Knowledge Distillation for Compression: In network compression, the goal of KD is to generate a student network that, by transferring knowledge from the teacher network, obtains better performance than one trained directly. In an early work, Bucilua et al. [5] compressed the key information of an ensemble of networks into a single neural network. Recently, Ba et al. [2] demonstrated an approach to improve the performance of shallow neural networks by having them mimic deep networks during training. Romero et al. [33] approximated the mappings between student and teacher hidden layers to compress networks, training relatively narrower students with linear projection layers. Subsequently, Hinton et al. [19] proposed extracting dark knowledge from the teacher network by matching the full soft output distributions of the student and teacher networks during training. Following this work, KD has attracted growing interest in the community and a variety of methods have been built on it [35, 37, 47, 8, 6, 50, 16]. For example, Zagoruyko et al. [47] employed attention maps for KD, training the student network to match the attention maps of the teacher at the end of each residual stage. In most existing works on KD, the architecture of the student network is manually designed. The Network-to-Network (N2N) [1] method focuses on automatically generating an optimal reduced architecture for KD; we use it to obtain the "dull" student network with a reduced architecture. We then adapt KD [19] and attention transfer [47] to the Siamese tracking network, and propose a novel teacher-students learning mechanism to further improve performance.
3 Revisiting SiamFC and SiamRPN
Since we adopt SiamFC [4] and SiamRPN [24] as the base trackers for our distilled tracking framework, we first revisit their basic network structures and training losses.
SiamFC adopts a two-stream fully convolutional network architecture, which takes target patches (denoted as $z$) and current search regions (denoted as $x$) as inputs. After a no-padding feature extraction network $\varphi$ modified from AlexNet [23], a cross-correlation operation is conducted on the two extracted feature maps:

$$f(z, x) = \varphi(z) \star \varphi(x) + b \qquad (1)$$
The location of the target in the current frame is then inferred from the peak value of the correlation response map $f(z, x)$. The logistic loss, i.e., a standard binary classification loss, is used to train SiamFC:

$$\ell(y, v) = \log\big(1 + \exp(-yv)\big) \qquad (2)$$

where $v$ is a real-valued score in the response map and $y \in \{+1, -1\}$ is a ground-truth label.
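As a concrete illustration, the per-score logistic loss of Eq. (2) and its mean over a response map can be sketched in a few lines of Python (the function names here are our own, not from the SiamFC code):

```python
import math

def logistic_loss(v, y):
    """Pointwise logistic loss l(y, v) = log(1 + exp(-y * v)), with y in {+1, -1}."""
    return math.log(1.0 + math.exp(-y * v))

def siamfc_loss(response, labels):
    """Mean logistic loss over a flattened response map."""
    return sum(logistic_loss(v, y) for v, y in zip(response, labels)) / len(response)
```

For a confident positive score (e.g. $v=2$, $y=+1$) the loss is small, while $v=0$ gives $\log 2$, the maximal-uncertainty value.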
SiamRPN, as an extension of SiamFC, has a Siamese feature extraction subnetwork (same as SiamFC) and an additional RPN subnetwork [32]. After feature extraction, the features are fed into the RPN subnetwork. The final outputs are foreground-background classification score maps and regression vectors of predefined anchors. By applying a single convolution and cross-correlation operations in the RPN to the two feature maps, the outputs are obtained by:

$$A^{cls}_{w \times h \times 2k} = [\varphi(x)]_{cls} \star [\varphi(z)]_{cls} \qquad (3)$$

$$A^{reg}_{w \times h \times 4k} = [\varphi(x)]_{reg} \star [\varphi(z)]_{reg} \qquad (4)$$

where $k$ is the predefined anchor number. The template feature maps $[\varphi(z)]_{cls}$ and $[\varphi(z)]_{reg}$ are used as kernels in the cross-correlation operation to obtain the final classification and regression outputs, with sizes $w \times h \times 2k$ and $w \times h \times 4k$.
During training, the following multi-task loss function is optimized:

$$L = L_{cls}(A^{cls}, Y^{cls}) + \lambda L_{reg}(A^{reg}, Y^{reg}) \qquad (5)$$

where $Y^{cls}$ and $Y^{reg}$ are the ground truths of the classification and regression outputs, $L_{cls}$ is a cross-entropy loss for classification, and $L_{reg}$ is a smooth $L_1$ loss with normalized coordinates for regression.
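A minimal numeric sketch of this multi-task structure for a single anchor, i.e. cross-entropy for classification plus a smooth-L1 regression term weighted by $\lambda$ (a toy scalar version with hypothetical names, not the actual SiamRPN implementation):

```python
import math

def cross_entropy(p_fg, y):
    """Binary cross-entropy for one anchor; y in {0, 1}, p_fg = predicted P(foreground)."""
    p = p_fg if y == 1 else 1.0 - p_fg
    return -math.log(max(p, 1e-12))

def smooth_l1(x):
    """Smooth L1 (Huber) penalty on a normalized coordinate offset."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def multitask_loss(p_fg, y_cls, reg_pred, reg_gt, lam=1.0):
    """L = L_cls + lambda * L_reg for a single anchor."""
    l_reg = sum(smooth_l1(p - g) for p, g in zip(reg_pred, reg_gt))
    return cross_entropy(p_fg, y_cls) + lam * l_reg
```

The smooth-L1 term is quadratic near zero and linear for large offsets, which keeps regression gradients bounded.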
4 Distilled Siamese Trackers
In this section, we detail the proposed framework of Distilled Siamese Trackers (DST) for high-performance tracking. As shown in Fig. 2, the proposed framework consists of two essential stages. First, in §4.1, for a given teacher network, such as SiamRPN, we obtain a "dull" student with a reduced network architecture via Deep Reinforcement Learning (DRL). Second, the "dull" student network is further trained simultaneously with an "intelligent" student via the proposed distillation model facilitated by a teacher-students learning mechanism (see §4.2).

4.1 "Dull" Student Selection
Inspired by N2N [1] for compressing classification networks, we cast the selection of a student tracker with a reduced network architecture as learning an agent with an optimal compression strategy (policy) via DRL. Unlike N2N, we only conduct layer shrinkage, because of the Siamese trackers' shallow network architectures: layer removal would cause a sharp decline in accuracy and divergence of the policy network.
In our task, the agent for selecting a small and reasonable network is learned from a sequential decision-making process by policy-gradient DRL. The whole decision process can be modeled as a Markov Decision Process (MDP), defined as the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \gamma, R)$. The state space $\mathcal{S}$ is the set of all possible reduced network architectures derived from the teacher network. $\mathcal{A}$ is the set of all actions transforming one network into another compressed one; here, we use layer shrinkage [1] actions that change the configurations of each layer, such as kernel size, padding, and number of output filters. $\mathcal{T}$ is the state transition function. $\gamma$ is the discount factor in the MDP; to give each reward an equal contribution, we set $\gamma$ to 1. $R$ is the reward function. The reward of the final state, as in [1], balances tracking accuracy and compression rate and is defined as follows:

$$R = C(2 - C) \cdot \frac{Acc_s}{Acc_t} \qquad (6)$$
where $C = 1 - \frac{S_s}{S_t}$ is the relative compression rate of a student network of size $S_s$ compared to a teacher of size $S_t$, and $Acc_s$ and $Acc_t$ are the validation accuracies of the student and teacher networks. We define a new accuracy metric for tracking by selecting the top $n$ proposals with the highest confidence and calculating their overlaps with the ground-truth boxes over the $m$ image pairs in the validation set:

$$Acc = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} IoU(p_{ij}, g_i) \qquad (7)$$
where $p_{ij}$ denotes the $j$-th proposal of the $i$-th image pair, $g_i$ is the corresponding ground truth, and $IoU(\cdot, \cdot)$ is the overlap function. At each step, the policy network outputs $N$ actions, and the reward is defined as the average reward of the $N$ generated students:

$$R = \frac{1}{N} \sum_{i=1}^{N} R_i \qquad (8)$$
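The reward computation of Eqs. (6)–(8) can be sketched as follows; `iou` implements the overlap function and `compression_reward` the accuracy/compression trade-off (function names and box format are our own):

```python
def iou(a, b):
    """Overlap (intersection-over-union) of two boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def compression_reward(acc_s, acc_t, size_s, size_t):
    """Reward of Eq. (6): C(2 - C) * Acc_s / Acc_t, with C the relative compression."""
    c = 1.0 - size_s / size_t          # relative compression rate
    return c * (2.0 - c) * (acc_s / acc_t)
```

Per Eq. (8), the per-step reward is then simply the mean of `compression_reward` over the $N$ generated students.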
Given a policy network and the predefined MDP, we use the REINFORCE method [44] to optimize the policy and finally obtain the optimal policy and reduced student network. Considering the time cost, all the training in this section is conducted on a small dataset sampled from the full dataset.
4.2 TeacherStudents Knowledge Distillation
After the network selection, we obtain a "dull" student network with poor comprehension due to its small model size. To pursue more intensive knowledge distillation and promising tracking performance, we propose a Teacher-Students Knowledge Distillation (TSsKD) model. It encourages teacher-student knowledge transfer as well as mutual learning between students, which serves as more flexible and appropriate guidance. In §4.2.1, we elaborate the teacher-student knowledge transfer (distillation) model. Then, in §4.2.2, we describe the student-student knowledge sharing strategy. Finally, in §4.2.3, we provide a theoretical analysis to prove the effectiveness of our TSsKD model.
4.2.1 TeacherStudent Knowledge Transfer
In the teacher-student knowledge transfer model, we propose a novel transfer loss to capture the knowledge in teacher networks. It contains three components: a Teacher Soft (TS) loss, an Adaptive Hard (AH) loss, and a Siamese Attention Transfer (SAT) loss. The first two allow the student to mimic the outputs of the teacher network, such as the logits [19] in the classification model. These two losses can be seen as variants of KD methods [19, 6], which are used to extract dark knowledge from teacher networks. The last loss operates on the intermediate feature maps and leads students to pay attention to the same regions of interest as the teacher, providing middle-level semantic hints to the student. Our knowledge transfer loss includes both classification and regression parts, and can be incorporated into other networks by deleting the corresponding part.

Teacher Soft (TS) Loss: Let $P_s$ and $R_s$ be the student's classification and bounding-box regression outputs, respectively, and $P_t$ and $R_t$ the teacher's. In order to incorporate the dark knowledge that regularizes students by placing emphasis on the relationships learned by the teacher network across all the outputs, we need to 'soften' the classification output. We set $\tau_s = \mathrm{softmax}(P_s / temp)$, where $temp$ is a temperature parameter used to obtain a soft distribution [19]; similarly, $\tau_t = \mathrm{softmax}(P_t / temp)$. Then, the TS loss for knowledge distillation is:

$$L_{TS} = KL(\tau_s, \tau_t) + L_{reg}(R_s, R_t) \qquad (9)$$

where $KL$ is a Kullback-Leibler (KL) divergence loss on the soft outputs of the teacher and student, and $L_{reg}$ is the original regression loss of the tracking network.
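The temperature-softened KL term of the TS loss can be sketched as below (a pure-Python toy with hypothetical names; real implementations operate on logit tensors):

```python
import math

def softmax(logits, temp=1.0):
    """Temperature-scaled softmax; a larger temp gives a softer distribution."""
    m = max(logits)
    exps = [math.exp((z - m) / temp) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_div(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ts_cls_loss(teacher_logits, student_logits, temp=2.0):
    """Classification part of the TS loss: KL between softened outputs."""
    return kl_div(softmax(teacher_logits, temp), softmax(student_logits, temp))
```

The loss vanishes when the student reproduces the teacher's soft distribution and grows as the two distributions diverge.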
Adaptive Hard (AH) Loss: To make full use of the ground truth $R_{gt}$, we combine the outputs of the teacher network with the original hard loss of the student network. For the regression loss, we employ a modified teacher-bounded regression loss [6], defined as:

$$L_{b}(R_s, R_t, R_{gt}) = \begin{cases} L_{reg}(R_s, R_{gt}), & \text{if } L_{reg}(R_s, R_{gt}) - L_{reg}(R_t, R_{gt}) > m \\ 0, & \text{otherwise} \end{cases} \qquad (10)$$

where $L_{reg}(R_s, R_{gt}) - L_{reg}(R_t, R_{gt})$ is the gap between the student's and the teacher's loss (here $L_{reg}$, the regression loss of the tracking network) with respect to the ground truth, and $m$ is a margin.
This loss keeps the student's regression vector close to the ground truth when its quality is worse than the teacher's. However, once the student outperforms the teacher network, we stop providing this loss to avoid overfitting. Together with the student's original classification loss, our AH loss is defined as follows:

$$L_{AH} = L_{cls}(P_s, Y_{gt}) + L_{b}(R_s, R_t, R_{gt}) \qquad (11)$$
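The gating behaviour of the teacher-bounded regression term reduces to a simple scalar rule; a sketch under our notation, with the per-sample regression losses assumed precomputed:

```python
def bounded_regression_loss(l_student, l_teacher, margin=0.0):
    """Teacher-bounded regression: penalize the student only while it is
    worse than the teacher (by more than a margin) w.r.t. the ground truth."""
    return l_student if l_student - l_teacher > margin else 0.0
```

Once the student's ground-truth regression loss drops to the teacher's level, the term switches off, so the hard target stops fighting the soft target.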
Siamese Attention Transfer (SAT) Loss: To lead a student to concentrate on the same regions of interest as a teacher during tracking, we introduce attention transfer [47] into our framework. Based on the assumption that the activation of a hidden neuron can indicate its importance for a specific input, we can transfer the semantic features of a teacher onto a student by forcing it to mimic the teacher's attention maps. We use activation-based maps computed by a mapping function $F$, which outputs a 2D activation map given a 3D feature map:

$$F(A) = \sum_{c=1}^{C} |A_c| \qquad (12)$$

where $A_c$ is the spatial feature map of the $c$-th channel and $|\cdot|$ denotes the element-wise absolute value of a matrix.
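Eq. (12)'s mapping collapses a C×H×W feature map into an H×W attention map by summing absolute activations over channels; a minimal sketch on nested lists:

```python
def attention_map(feature):
    """F(A) = sum_c |A_c|; feature is a C x H x W nested list."""
    C, H, W = len(feature), len(feature[0]), len(feature[0][0])
    return [[sum(abs(feature[c][i][j]) for c in range(C))
             for j in range(W)] for i in range(H)]
```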
A Siamese network has two weight-sharing branches with different inputs: a target patch and a larger search region. The feature network needs to learn the attention in the feature maps of both branches. We found that surrounding noise in the search region's features disturbs the attention transfer of the target's features, due to the existence of distractors. To address this, we place a weight on the search region's feature maps and define the following multi-layer Siamese AT loss:
$$Q^{x,l} = F(A^{x,l}), \qquad Q^{z,l} = F(A^{z,l}) \qquad (13)$$

$$w^{l} = \mathrm{softmax}\big(Q^{z,l} \star Q^{x,l}\big) \qquad (14)$$

$$L_{SAT} = \sum_{l \in \mathcal{I}} \left( \left\| \frac{w_t^{l} \odot Q_t^{x,l}}{\|w_t^{l} \odot Q_t^{x,l}\|_2} - \frac{w_s^{l} \odot Q_s^{x,l}}{\|w_s^{l} \odot Q_s^{x,l}\|_2} \right\|_2 + \left\| \frac{Q_t^{z,l}}{\|Q_t^{z,l}\|_2} - \frac{Q_s^{z,l}}{\|Q_s^{z,l}\|_2} \right\|_2 \right) \qquad (15)$$

where $\mathcal{I}$ is the set of attention transfer layers' indices, $Q_t^{x,l}$ and $Q_t^{z,l}$ denote the teacher's attention maps of layer $l$ on the search and target branches, respectively, and $w_t^{l}$ is the weight on the teacher's $l$-th feature map; the student's variables are defined in the same way. By introducing this weight, which re-weights the importance of different patches in the search region according to their similarities with the target, more attention is paid to the target. This keeps the attention maps of the two branches consistent and enhances the effect of attention transfer. An example of our multi-layer Siamese attention transfer is shown in Fig. 3. The comparison of activation maps with and without weights shows that the surrounding noise is suppressed effectively.
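The per-layer attention-transfer distance used above compares L2-normalized attention maps; sketched here on flattened maps, omitting the search-region weighting for brevity:

```python
import math

def l2_normalize(flat):
    """Scale a flattened attention map to unit L2 norm."""
    n = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / n for v in flat]

def at_distance(q_teacher, q_student):
    """|| Q_t/||Q_t||_2 - Q_s/||Q_s||_2 ||_2 on flattened attention maps."""
    t, s = l2_normalize(q_teacher), l2_normalize(q_student)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t, s)))
```

Normalizing first makes the distance depend only on where the activation mass lies, not on its absolute scale, so teacher and student can differ in magnitude without penalty.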
By combining the above three losses, the overall loss for transferring knowledge from a teacher to a student is defined as follows:

$$L_{Tr} = L_{TS} + L_{AH} + L_{SAT} \qquad (16)$$
4.2.2 StudentStudent Knowledge Sharing
Based on our teacher-student distillation model, we propose a student-student knowledge sharing mechanism to further narrow the gap between the teacher and the "dull" student. As an "intelligent" student with a larger model size usually learns and performs better (due to its better comprehension), sharing its knowledge can inspire the "dull" one to develop a more in-depth understanding. On the other hand, the "dull" one can do better in some cases and provide helpful knowledge too.
We take two students as an example and denote them as a "dull" student s1 and an "intelligent" student s2. For a proposal in the Siamese trackers, assume that the probabilities of being the target predicted by s1 and s2 are $p_1$ and $p_2$, respectively, and that the predicted bounding-box regression values are $r_1$ and $r_2$. To improve the learning effect of s1, we obtain the knowledge shared from s2 by using its prediction as prior knowledge. The KL divergence is used to quantify the consistency of the proposals' classification probabilities:

$$L_{kl} = p_2 \log \frac{p_2}{p_1} + \bar{p}_2 \log \frac{\bar{p}_2}{\bar{p}_1} \qquad (17)$$

where $\bar{p}_1 = 1 - p_1$ and $\bar{p}_2 = 1 - p_2$ are the probabilities of background. For regression, we use a smooth $L_1$ loss:

$$L_{r} = \mathrm{smooth}_{L_1}(r_1 - r_2) \qquad (18)$$
The knowledge sharing loss for s1 can then be defined as:

$$L_{KS} = L_{kl} + L_{r} \qquad (19)$$
Combined with the knowledge transfer loss, our final objective functions for s1 and s2 are as follows:

$$L_{s1} = L_{Tr}^{s1} + \lambda L_{KS}^{2 \to 1} \qquad (20)$$

$$L_{s2} = L_{Tr}^{s2} + \delta \lambda L_{KS}^{1 \to 2} \qquad (21)$$
where $\delta$ is a discount factor accounting for the two students' different reliability, and $\lambda$ denotes the weight of knowledge sharing. Considering the "dull" student's worse performance, we set $\delta < 1$. To filter the reliable knowledge for sharing, for $L_{s1}$ we set a condition on $\lambda$:

$$\lambda = \begin{cases} \lambda_0, & \text{if } L_{s1}^{gt} - L_{t}^{gt} \le c(e) \\ 0, & \text{otherwise} \end{cases} \qquad (22)$$

Here, $L_{s1}^{gt}$ and $L_{t}^{gt}$ are the losses of s1 and the teacher with respect to the ground truth, and $c(e)$ is their gap constraint, a function decreasing geometrically with the current epoch $e$. For $L_{s2}$, a similar condition exists. To train the two students simultaneously, the final loss for our TSsKD is:

$$L_{TSsKD} = L_{s1} + L_{s2} \qquad (23)$$
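The conditional gate on the sharing weight can be sketched as a scalar rule with a geometrically shrinking gap (the initial gap `c0` and decay rate here are hypothetical values, not the paper's settings):

```python
def sharing_weight(l_student, l_teacher, epoch, lam0=1.0, c0=0.5, decay=0.5):
    """Return lam0 while the student's ground-truth loss stays within
    c(epoch) = c0 * decay**epoch of the teacher's, else 0 (no sharing)."""
    gap = c0 * (decay ** epoch)
    return lam0 if l_student - l_teacher <= gap else 0.0
```

Early in training the gap is loose and knowledge flows freely; as training proceeds the constraint tightens, so only increasingly reliable knowledge is shared.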
4.2.3 Why Does TSsKD Work?
According to the VC theory [39, 27], the learning process can be regarded as a statistical procedure. Given $n$ data points, a student function $f_s$ belonging to a function class $\mathcal{F}_s$, and a real (ground-truth) target function $f$, the task (classification or regression) error of learning from scratch without KD (NOKD) can be decomposed as:

$$R(f_s) - R(f) \le O\left(\frac{|\mathcal{F}_s|_C}{n^{\alpha_{sr}}}\right) + \varepsilon_{sr} \qquad (24)$$

where $R(f_s) - R(f)$, the $O(\cdot)$ term, and $\varepsilon_{sr}$ are the expected, estimation, and approximation error, respectively. $|\mathcal{F}_s|_C$ denotes an appropriate capacity measure of the function class. $\alpha_{sr}$ is the learning rate measuring the difficulty of a learning problem: small values indicate difficult cases while large values indicate easy problems. Setting $f_t$ as the teacher function, the error of $f_s$ with KD is bounded as:

$$R(f_s) - R(f) = \big(R(f_s) - R(f_t)\big) + \big(R(f_t) - R(f)\big) \qquad (25)$$

$$\le O\left(\frac{|\mathcal{F}_s|_C}{n^{\alpha_{st}}}\right) + \varepsilon_{st} + R(f_t) - R(f) \qquad (26)$$

$$\le O\left(\frac{|\mathcal{F}_s|_C}{n^{\alpha_{st}}}\right) + \varepsilon_{st} + O\left(\frac{|\mathcal{F}_t|_C}{n^{\alpha_{tr}}}\right) + \varepsilon_{tr} \qquad (27)$$

where the step from (25) to (26) holds because $R(f_s) - R(f_t) \le O(|\mathcal{F}_s|_C \, n^{-\alpha_{st}}) + \varepsilon_{st}$, which is an assumption in [27]. In addition, Lopez-Paz et al. [27] made further reasonable assumptions under which (27) $\le$ (24), proving that KD outperforms NOKD; for instance, $\varepsilon_{st}$ is small and $\alpha_{st} \ge \alpha_{sr}$.
To analyze our TSsKD model, we first focus on the "dull" student, denoted $f_{s1}$. Assume $f_{s1}$ also belongs to $\mathcal{F}_s$ (the same network as $f_s$) but is selected (trained) by a different method than $f_s$. Then, the error upper bound of $f_{s1}$ with our TSsKD is:

$$R(f_{s1}) - R(f) \le O\left(\frac{|\mathcal{F}_s|_C}{n^{\alpha'_{st}}}\right) + \varepsilon'_{st} + O\left(\frac{|\mathcal{F}_t|_C}{n^{\alpha_{tr}}}\right) + \varepsilon_{tr} \qquad (28)$$

To prove that our TSsKD outperforms KD, (28) $\le$ (27) should hold. Thus, we make two reasonable assumptions: $\alpha'_{st} \ge \alpha_{st}$ and $\varepsilon'_{st} \le \varepsilon_{st}$. Recalling Eq. (20), the objective function of one student, the first term is a KD loss offering the same information as plain KD, while the second term provides additional information, with a condition function filtering out noisy information so that only reliable information is captured. We believe $\alpha'_{st} \ge \alpha_{st}$ is the general situation, since more reliable information should allow faster learning. In addition, more information also enhances the generalization of the network, decreasing the approximation error $\varepsilon'_{st}$. The above analysis also applies to the other students, since even the "dull" student can do better in some cases. Thus, our TSsKD can improve the performance of all students.
5 Experiments
To demonstrate the effectiveness of our framework, we apply it to two representative Siamese trackers: SiamFC [4] and SiamRPN [24] (VOT version). We evaluate the distilled trackers on several benchmarks, including OTB100 [46], DTB [25], VOT2016 [21], LaSOT [15], and TrackingNet [28], and present an ablation study. All experiments are implemented in PyTorch and run with an Intel i7 CPU and an Nvidia GTX 1080ti GPU.
5.1 Implementation Details
N2N Setting.
In the “dull” student selection experiment, an LSTM is employed as the policy network. A small representative dataset (about 10,000 image pairs) is created by selecting images uniformly for several classes in the whole dataset of the corresponding tracker. The policy network is updated for 50 steps. In each step, three reduced networks are generated and trained from scratch for 10 epochs on the small dataset. We observe heuristically that this is sufficient to compare performance. Both SiamRPN and SiamFC use the same settings.
Training Datasets. For SiamRPN, following the teacher in [22], we preprocess four datasets (ImageNet VID [34], YouTube-BoundingBoxes [31], COCO [26], and ImageNet Detection [34]) to generate about two million image pairs, with 127×127 pixels for target patches and 271×271 pixels for search regions. SiamFC is trained on ImageNet VID [34] with 127×127 and 255×255 pixels for the two inputs, respectively, consistent with [4].

Table 1: Losses used for the SiamFC and SiamRPN students (smooth-L1 and KL denote the smooth L1 loss and Kullback-Leibler divergence loss, respectively).

            SiamFC     SiamRPN
  AH loss:  logistic   cross-entropy + bounded
  TS loss:  KL         KL + smooth-L1
  SAT loss: MSE        MSE
  KS loss:  KL         KL + smooth-L1
Optimization. During the teacher-students knowledge distillation, the "intelligent" students are generated by halving the convolutional channels of their teachers. SiamRPN's student networks are warmed up by training with the ground truth for 10 epochs and then trained for 50 epochs with an exponentially decreasing learning rate. As with the teacher, SiamFC's student networks are trained for 30 epochs with a constant learning rate. All the losses used in the experiments are reported in Table 1; the other hyper-parameters (loss weights, margin, and temperature) are kept fixed across all experiments. To make the networks robust to gray videos, 25% of the training image pairs for SiamFC [4] and SiamRPN [24] are converted to grayscale. Moreover, to train SiamRPN, a translation within 12 pixels and a resize operation varying from 0.85 to 1.15 are applied to each training sample to increase diversity. Note that the outputs of the SiamFC students are first passed through a sigmoid function to obtain two score maps, for target and background, which are then used to compute the KL divergence in knowledge sharing.
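The SiamRPN augmentation described above (translation within 12 pixels, resize between 0.85 and 1.15) can be sketched as follows; the (cx, cy, w, h) box format and function name are our own:

```python
import random

def augment_box(box, max_shift=12.0, scale_lo=0.85, scale_hi=1.15, rng=None):
    """Randomly translate and rescale a training box (cx, cy, w, h)."""
    rng = rng or random.Random()
    dx = rng.uniform(-max_shift, max_shift)   # translation within +/- 12 px
    dy = rng.uniform(-max_shift, max_shift)
    s = rng.uniform(scale_lo, scale_hi)       # resize factor in [0.85, 1.15]
    cx, cy, w, h = box
    return (cx + dx, cy + dy, w * s, h * s)
```

Passing an explicit seeded `random.Random` makes the augmentation reproducible across runs.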
DSTrpn  DSTfc  SiamRPN  SiamFC  HP  DAT  HOGLR  SODLT  DSST  MDNet  MEEM  SRDCF  
DP  0.7965  0.7486  0.7602  0.7226  0.6959  0.4237  0.4638  0.5488  0.4037  0.6916  0.5828  0.4969 
OP  0.6927  0.5741  0.6827  0.5681  0.5775  0.2650  0.3057  0.4038  0.2706  0.5328  0.3357  0.3723 
AUC  0.5557  0.4909  0.5502  0.4797  0.4721  0.2652  0.3084  0.3640  0.2644  0.4559  0.3649  0.3390 
5.2 Evaluations of “Dull” Student Selection
In Fig. 8, many inappropriate SiamRPN-like networks are generated, causing very unstable accuracies and rewards over the first 30 iterations. Afterwards, the policy network gradually converges and finally achieves a high compression rate. On the other hand, the policy network converges quickly after several iterations on SiamFC, due to its simple architecture (see Fig. 8). The compression results show that our method can generate an optimal architecture regardless of the teacher's complexity. Finally, two reduced models of 19.7 MB and 0.7 MB are generated for SiamRPN (361.8 MB) and SiamFC (9.4 MB), respectively.
5.3 Benchmark Results for Visual Object Tracking
Results on OTB100. On the OTB100 benchmark [46], we compare our DSTrpn (SiamRPN as teacher) and DSTfc (SiamFC as teacher) trackers with various recent fast trackers (more than 50 FPS), including the teacher networks SiamRPN [24] and SiamFC [4], as well as Siamtri [13], TRACA [7], HP [14], Cfnet2 [38], and fDSST [10]. The evaluation metrics include both precision and success plots in one-pass evaluation (OPE) [46], where trackers are ranked by precision score at a center-error threshold of 20 pixels and by Area-Under-the-Curve (AUC), respectively. In Fig. 5, our DSTrpn outperforms all the other trackers in terms of both precision and success plots. We also report the running speed of all trackers for comparison. Notice that our DSTrpn runs at an extremely high speed of 265 FPS, which is nearly 3× faster than the teacher network SiamRPN (90 FPS), while obtaining the same (even slightly better) precision and AUC scores. Our DSTfc runs more than 2× faster than SiamFC with comparable performance.
Results on DTB. We benchmark our method on the Drone Tracking Benchmark (DTB) [25] including 70 videos captured by drone cameras. We compare SiamRPN, recent Siamese works such as HP [14], and the trackers evaluated in DTB, including DAT [30], HOGLR [41], SODLT [40], DSST [9], MDNet [29], MEEM [48], and SRDCF [11]. The evaluation metrics include Distance Precision (DP) at a threshold of 20 pixels, Overlap Precision (OP) at an overlap threshold of 0.5, and the AreaUndertheCurve (AUC).
As shown in Table 2, DSTrpn achieves the best performance in terms of DP and OP. For AUC, DSTrpn ranks first (0.5557) and significantly outperforms SiamFC (0.4797). Compared with the teacher SiamRPN, our student network surpasses it in terms of both AUC and OP, and even achieves an improvement on DP, although our model is only a small fraction of the teacher SiamRPN's size. DSTfc also outperforms SiamFC in terms of all three criteria.
Results on VOT2016. The VOT2016 benchmark [21] differs from OTB100 in that a tracker is evaluated using a reset-based mechanism: whenever a tracker loses the object, it is re-initialized five frames later. The major evaluation metric is the Expected Average Overlap (EAO), which combines accuracy and robustness. We evaluate the distilled trackers against various SOTA trackers, including CCOT [12], Staple [3], and the other trackers in VOT2016. Fig. 6 shows that our tracker achieves the second-highest EAO among all trackers. Notice that our model is much smaller than the first-ranked SiamRPN while losing only a little EAO (0.023).
DSTrpn  SiamRPN  DSTfc  SiamFC  DaSiamRPN  MDNet  
LaSOT (AUC)  0.434  0.457  0.340  0.343  0.415  0.397 
TrackingNet (AUC)  0.649  0.675  0.562  0.573  0.638  0.606 
Size (MB)  19.7  361.8  0.7  9.4  90.5  17.7 
Results on LaSOT and TrackingNet. We also conduct extensive experiments on the large-scale datasets LaSOT [15] and TrackingNet [28] to evaluate the generalization of our method. We compare against DaSiamRPN [51], MDNet [29], and our baselines: SiamRPN [22] and SiamFC [4]. As shown in Table 3, the model size of our DSTrpn (or DSTfc) is far smaller than that of its teacher SiamRPN (or SiamFC), while the AUC scores on the two large-scale datasets are very close to the teacher's. Notice that our DSTrpn achieves better performance than DaSiamRPN and MDNet on both datasets.
5.4 Ablation Study
GT  AH  TS  SAT  Precision  AUC  
SiamRPN  Student1  ✓  0.638  0.429  
✓  0.796  0.586  
✓  ✓  0.795  0.579  
✓  ✓  0.800  0.591  
✓  ✓  0.811  0.608  
✓  ✓  ✓  0.812  0.606  
✓  ✓  ✓  0.825  0.624  
Teacher  /  /  /  /  0.853  0.643  
SiamFC  Student1  ✓  0.707  0.523  
✓  0.711  0.535  
✓  ✓  0.710  0.531  
✓  ✓  0.742  0.548  
✓  ✓  ✓  0.741  0.557  
Teacher  /  /  /  /  0.772  0.581 
Knowledge Transfer Components. The teacherstudent knowledge transfer consists of three major components: (i) Adaptive Hard (AH) loss, (ii) Teacher Soft (TS) loss, and (iii) Siamese Attention Transfer (SAT) loss. We conduct an extensive ablation study by implementing a number of variants using different combinations, including (1) GT: simply using the groundtruth without any of the three losses, (2) TS, (3) GT+TS, (4) AH+TS, (5) TS+SAT, (6) GT+TS+SAT, and (7) AH+TS+SAT (the full knowledge transfer method). Table 4 shows our results on SiamFC and SiamRPN.
For SiamRPN, we can see that GT without any proposed loss degrades dramatically compared with the teacher network, due to the large model-size gap. When using the Teacher Soft (TS) loss to distill knowledge from the teacher network, we observe a significant improvement in both precision and AUC. However, directly combining GT and TS (GT+TS) can be suboptimal due to overfitting. By replacing GT with AH, AH+TS further boosts the performance on both metrics. Finally, by adding the Siamese Attention Transfer (SAT) loss, the model (AH+TS+SAT) is able to close the gap between the teacher and student networks, outperforming the other variants (TS+SAT or GT+TS+SAT). SiamFC only employs a classification loss, so GT is equal to AH and we use GT here. The gaps are narrower than for SiamRPN, but performance improvements can still be seen. These results clearly demonstrate the effectiveness of each component.
Different Learning Mechanisms. To evaluate our TSsKD model, we also conduct an ablation study on different learning mechanisms: (i) NOKD: training from scratch, (ii) TSKD: our tracking-specific teacher-student knowledge distillation (transfer), and (iii) TSsKD. "Student1" and "Student2" represent the "dull" and "intelligent" students, respectively. The students are trained under the different paradigms, and the results on SiamRPN and SiamFC are shown in Table 5. With knowledge distillation, all students improve. Moreover, with the knowledge sharing in our TSsKD, the "dull" SiamRPN student obtains a further performance improvement in terms of AUC, as does the "dull" SiamFC student. Meanwhile, the "intelligent" SiamRPN and SiamFC students also obtain slight improvements. Fusing the knowledge from the teacher, the ground truth, and the "intelligent" student, the "dull" SiamRPN student obtains the best performance.
Loss Comparison. We also compare the losses of the different student networks in our experiments. As shown in Fig. 7, the "intelligent" students have a lower loss than the "dull" ones throughout training and validation, and maintain a better understanding of the training dataset. They provide additional reliable knowledge to the "dull" students, which inspires more intensive knowledge distillation and better tracking performance.
Table 5: AUC under different learning mechanisms, together with model size and speed.

Tracker   Model     NOKD   TSKD   TSsKD  Size    Speed
SiamRPN   Student1  0.429  0.624  0.646  19.7M   265 FPS
          Student2  0.630  0.641  0.644  90.6M   160 FPS
          Teacher   0.642  /      /      361.8M  90 FPS
SiamFC    Student1  0.523  0.557  0.573  0.7M    230 FPS
          Student2  0.566  0.576  0.579  2.4M    165 FPS
          Teacher   0.581  /      /      9.4M    110 FPS
6 Conclusion
This paper proposed a new framework of Distilled Siamese Trackers (DST) to learn small, fast yet accurate trackers from larger Siamese trackers. The framework is built upon a teacher-students knowledge distillation model comprising two styles of knowledge transfer: 1) knowledge transfer from teacher to students via a tracking-specific distillation strategy; 2) mutual learning between students in a knowledge-sharing manner. Theoretical analysis and extensive empirical evaluations on two Siamese trackers clearly demonstrate the generality and effectiveness of the proposed DST. Specifically, for the state-of-the-art (SOTA) SiamRPN, the distilled tracker achieved a high compression rate, ran at an extremely high speed, and obtained performance similar to the teacher. We therefore believe such a distillation method can be used to adapt many SOTA deep trackers to practical tracking tasks.
7 Appendix
7.1 Details of DRL
In the “dull” student selection stage, we use a policy gradient algorithm to optimize our policy network step by step. Denoting the parameters of the policy network as $\theta$, our objective function is the expected reward over all action sequences $a_{1:T}$:

$$J(\theta) = \mathbb{E}_{a_{1:T} \sim \pi_\theta}\left[\sum_{t=1}^{T} R_t\right] \quad (29)$$

To calculate the gradient of our policy network, we use REINFORCE [44] in our experiment. Given the hidden state $h_t$, the gradient is formulated as:

$$\nabla_\theta J(\theta) = \sum_{t=1}^{T} \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a_t \mid h_t)\, R_t\right] \quad (30)$$

where $\pi_\theta(a_t \mid h_t)$ is the probability of action $a_t$ under the current policy network with hidden state $h_t$, and $R_t$ is the reward of the student model produced at step $t$. Furthermore, in order to reduce the high variance of the estimated gradients, a state-independent baseline $b$ is introduced:

$$b_t = \gamma\, b_{t-1} + (1-\gamma)\, R_t \quad (31)$$

i.e., an exponential moving average of previous rewards, with decay factor $\gamma$. Finally, our policy gradient is calculated as:

$$\nabla_\theta J(\theta) = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid h_t)\,(R_t - b) \quad (32)$$
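The update above can be sketched in a few lines of plain Python. The following toy example assumes a softmax policy over a small discrete action set; the function name, learning rate, and decay value are illustrative assumptions, and the real policy network is recurrent over hidden states $h_t$ rather than a flat logit vector.

```python
import math

def reinforce_step(theta, actions_rewards, baseline, lr=0.1, decay=0.9):
    """One REINFORCE update with an exponential-moving-average baseline.

    theta: logits of a categorical policy over actions.
    actions_rewards: list of (action, reward) pairs sampled from the policy.
    baseline: running EMA of past rewards (the state-independent b).
    """
    # Softmax policy: pi(a) = exp(theta_a) / sum_k exp(theta_k)
    z = sum(math.exp(t) for t in theta)
    probs = [math.exp(t) / z for t in theta]

    grad = [0.0] * len(theta)
    for a, r in actions_rewards:
        advantage = r - baseline  # variance reduction: R_t - b
        for k in range(len(theta)):
            # d/d theta_k of log pi(a) = 1{k == a} - pi(k)
            grad[k] += ((1.0 if k == a else 0.0) - probs[k]) * advantage
        baseline = decay * baseline + (1.0 - decay) * r  # EMA of rewards

    theta = [t + lr * g / len(actions_rewards) for t, g in zip(theta, grad)]
    return theta, baseline
```

Run on a toy bandit where one action is consistently rewarded, the logit of that action grows and the policy concentrates on it, which mirrors how the controller learns to propose better-performing student architectures.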
7.2 Extension to More Students
Our TSsKD model can be naturally extended to more students. Given students $s_1, s_2, \ldots, s_n$, the objective function for $s_i$ is as follows:

$$\mathcal{L}_{s_i} = \mathcal{L}_{\mathrm{TSKD}}(s_i) + \sum_{j \neq i} \lambda_{ij}\, \mathcal{L}_{\mathrm{KS}}(s_i, s_j) \quad (33)$$

Here $\lambda_{ij}$ is the discount factor between $s_i$ and $s_j$, accounting for their different reliability. For example, in our two-student case in the paper, one factor is set to 1 and the other to 0.5. We conduct an experiment with different numbers of students and report the results in Fig. 8. Students are generated by scaling the number of convolutional channels by factors of 0.4, 0.45, 0.5, and 0.55. In our case, since the "dull" student already achieves performance close to the teacher with a single "intelligent" student, adding more students does not bring significant improvements.
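The knowledge-sharing term of Eq. (33) can be sketched as a reliability-weighted KL divergence from each peer, as below. This is a minimal sketch under our own assumptions: `knowledge_sharing_loss` and the `lambdas` matrix are hypothetical names, peers are treated as fixed targets via `detach()`, and the temperature-scaled KL is one common choice for the per-pair sharing loss, not necessarily the paper's exact form.

```python
import torch
import torch.nn.functional as F

def knowledge_sharing_loss(student_logits, i, lambdas, T=2.0):
    """Mutual-learning term for student i: discounted KL toward each peer j != i.

    student_logits: list of (B, C) logit tensors, one per student.
    lambdas: lambdas[i][j] is the discount factor of peer j for student i.
    """
    log_p_i = F.log_softmax(student_logits[i] / T, dim=1)
    loss = torch.zeros(())
    for j, logits_j in enumerate(student_logits):
        if j == i:
            continue
        # Peers serve as fixed soft targets; no gradient flows into them here.
        p_j = F.softmax(logits_j.detach() / T, dim=1)
        loss = loss + lambdas[i][j] * F.kl_div(
            log_p_i, p_j, reduction="batchmean") * (T * T)
    return loss
```

In the two-student case this reduces to a single weighted KL term per student, with the "dull" student assigning full weight to the "intelligent" peer and the reverse direction discounted.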
References
 [1] A. Ashok, N. Rhinehart, F. Beainy, and K. M. Kitani. N2N learning: Network to network compression via policy gradient reinforcement learning. In ICLR, 2018.
 [2] J. Ba and R. Caruana. Do deep nets really need to be deep? In NIPS, 2014.
 [3] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. Torr. Staple: Complementary learners for real-time tracking. In CVPR, 2016.
 [4] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. Fully-convolutional siamese networks for object tracking. In ECCV Workshop, 2016.
 [5] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil. Model compression. In SIGKDD, 2006.
 [6] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker. Learning efficient object detection models with knowledge distillation. In NIPS, 2017.
 [7] J. Choi, H. J. Chang, T. Fischer, S. Yun, K. Lee, J. Jeong, Y. Demiris, and J. Y. Choi. Context-aware deep feature compression for high-speed visual tracking. In CVPR, 2018.
 [8] W. M. Czarnecki, S. Osindero, M. Jaderberg, G. Swirszcz, and R. Pascanu. Sobolev training for neural networks. In NIPS, 2017.
 [9] M. Danelljan, G. Häger, F. Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In BMVC, 2014.
 [10] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg. Discriminative scale space tracking. IEEE TPAMI, 39(8):1561–1575, 2017.
 [11] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg. Learning spatially regularized correlation filters for visual tracking. In ICCV, 2015.
 [12] M. Danelljan, A. Robinson, F. Shahbaz Khan, and M. Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In ECCV, 2016.
 [13] X. Dong and J. Shen. Triplet loss in siamese network for object tracking. In ECCV, 2018.
 [14] X. Dong, J. Shen, W. Wang, Y. Liu, L. Shao, and F. Porikli. Hyperparameter optimization for tracking with continuous deep Q-learning. In CVPR, 2018.
 [15] H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and H. Ling. LaSOT: A high-quality benchmark for large-scale single object tracking. In CVPR, 2019.
 [16] T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar. Born again neural networks. In ICML, 2018.
 [17] Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, and S. Wang. Learning dynamic siamese network for visual object tracking. In ICCV, 2017.
 [18] A. He, C. Luo, X. Tian, and W. Zeng. A twofold siamese network for real-time object tracking. In CVPR, 2018.
 [19] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Workshop, 2014.
 [20] C. Huang, S. Lucey, and D. Ramanan. Learning policies for adaptive tracking with deep feature cascades. In ICCV, 2017.
 [21] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Čehovin, T. Vojir, G. Häger, A. Lukežič, and G. Fernandez. The visual object tracking vot2016 challenge results. In ECCV workshop, 2016.
 [22] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pfugfelder, L. C. Zajc, T. Vojir, G. Bhat, A. Lukezic, A. Eldesokey, G. Fernandez, and et al. The sixth visual object tracking vot2018 challenge results. In ECCV workshop, 2018.
 [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
 [24] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu. High performance visual tracking with siamese region proposal network. In CVPR, 2018.
 [25] S. Li and D.-Y. Yeung. Visual object tracking for unmanned aerial vehicles: A benchmark and new motion models. In AAAI, 2017.
 [26] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
 [27] D. LopezPaz, L. Bottou, B. Schölkopf, and V. Vapnik. Unifying distillation and privileged information. In ICLR, 2016.
 [28] M. Muller, A. Bibi, S. Giancola, S. Alsubaihi, and B. Ghanem. TrackingNet: A large-scale dataset and benchmark for object tracking in the wild. In ECCV, 2018.
 [29] H. Nam and B. Han. Learning multidomain convolutional neural networks for visual tracking. In CVPR, 2016.
 [30] H. Possegger, T. Mauthner, and H. Bischof. In defense of colorbased modelfree tracking. In CVPR, 2015.
 [31] E. Real, J. Shlens, S. Mazzocchi, X. Pan, and V. Vanhoucke. YouTube-BoundingBoxes: A large high-precision human-annotated data set for object detection in video. In CVPR, 2017.
 [32] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
 [33] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
 [34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. FeiFei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.
 [35] P. Sadowski, J. Collado, D. Whiteson, and P. Baldi. Deep learning, dark knowledge, and dark matter. In NIPS Workshop, 2015.
 [36] R. Tao, E. Gavves, and A. W. Smeulders. Siamese instance search for tracking. In CVPR, 2016.
 [37] G. Urban, K. J. Geras, S. E. Kahou, O. Aslan, S. Wang, R. Caruana, A. Mohamed, M. Philipose, and M. Richardson. Do deep convolutional nets really need to be deep and convolutional? In ICLR, 2017.
 [38] J. Valmadre, L. Bertinetto, J. F. Henriques, A. Vedaldi, and P. H. Torr. End-to-end representation learning for correlation filter based tracking. In CVPR, 2017.
 [39] V. Vapnik. Statistical learning theory, 1998.
 [40] N. Wang, S. Li, A. Gupta, and D.-Y. Yeung. Transferring rich feature hierarchies for robust visual tracking. arXiv preprint arXiv:1501.04587, 2015.
 [41] N. Wang, J. Shi, D.Y. Yeung, and J. Jia. Understanding and diagnosing visual tracking systems. In ICCV, 2015.
 [42] Q. Wang, Z. Teng, J. Xing, J. Gao, W. Hu, and S. Maybank. Learning attentions: residual attentional siamese network for high performance online visual tracking. In CVPR, 2018.
 [43] X. Wang, C. Li, B. Luo, and J. Tang. Sint++: Robust visual tracking via adversarial positive instance generation. In CVPR, 2018.
 [44] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
 [45] T. Yang and A. B. Chan. Learning Dynamic Memory Networks for Object Tracking. In ECCV, 2018.
 [46] Y. Wu, J. Lim, and M.-H. Yang. Object tracking benchmark. IEEE TPAMI, 37(9):1834–1848, 2015.
 [47] S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2016.
 [48] J. Zhang, S. Ma, and S. Sclaroff. MEEM: Robust tracking via multiple experts using entropy minimization. In ECCV, 2014.
 [49] Y. Zhang, L. Wang, J. Qi, D. Wang, M. Feng, and H. Lu. Structured siamese network for realtime visual tracking. In ECCV, 2018.
 [50] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu. Deep mutual learning. In CVPR, 2018.
 [51] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu. Distractor-aware siamese networks for visual object tracking. In ECCV, 2018.
 [52] Z. Zhu, W. Wu, W. Zou, and J. Yan. End-to-end flow correlation tracking with spatial-temporal attention. In CVPR, 2018.