1 Introduction
Generic visual object tracking is the task of estimating the bounding box of a target in a video sequence given only its initial position. Typically, the preliminary model learned from the first frame needs to be updated continuously to adapt to the target’s appearance variations caused by rotation, illumination, occlusion, deformation, etc. However, it is challenging to optimize the initial learned model efficiently and effectively as tracking proceeds. Training samples for model updating are usually collected based on estimated bounding boxes, which could be inaccurate. Those small errors will accumulate over time, gradually resulting in model degradation.
To avoid model updating, which may introduce unreliable training samples that ruin the model, several approaches [4, 42] investigate tracking by only comparing the first frame with the subsequent frames, using a similarity function based on a learned discriminative and invariant deep Siamese feature embedding. However, training such a deep representation is difficult due to the drastic appearance variations that commonly emerge in long-term tracking. Other methods either update the model via an exponential moving average of templates [15, 43], which only marginally improves performance, or optimize the model with hand-designed SGD methods [32, 40], which need numerous iterations to converge, thus preventing real-time speed. Limiting the number of SGD iterations can allow near real-time speeds, but at the expense of poor-quality model updates due to the loss function not being optimized sufficiently.
In recent years, much effort has been devoted to localizing the object using a robust online learned classifier, while little attention has been paid to designing accurate bounding box estimation. Most trackers simply resort to multi-scale search by assuming that the object aspect ratio does not change during tracking, an assumption that is often violated in the real world. Recently, SiamRPN
[22] borrows the idea of region proposal networks [36] from object detection to decompose the tracking task into two branches: classifying the target from the background, and regressing the accurate bounding box with reference to anchor boxes mounted at different positions. It achieves higher precision on bounding box estimation but suffers lower robustness compared with state-of-the-art methods because it performs no online model updating.

In this paper, we address the aforementioned problems by proposing a tracking framework composed of two modules: response generation and bounding box regression, where the first component produces a response map indicating the probability that anchor boxes mounted at sliding-window positions cover the object, and the second part predicts bounding box shifts relative to the anchors to obtain refined rectangles. Instead of enumerating different scales and aspect ratios of anchors as in object detection, we propose to use only one anchor size for each position and adapt it to shape changes by resizing its corresponding convolutional filter using bilinear interpolation, which saves model parameters and computing time. To effectively adapt the tracking model to appearance changes during tracking, we propose a recurrent model optimization method to learn a more effective gradient descent that converges the model update in 1-2 steps and generalizes better to future frames. The key idea is to train a neural optimizer that can converge the tracking model to a good solution in a few gradient steps. During the training phase, the tracking model is first updated using the neural optimizer, and then it is applied on future frames to obtain an error signal for minimization. Under this particular setting, the resulting optimizer converges the tracking classifier significantly faster than SGD-based optimizers, especially for learning the initial tracking model. In summary, our contributions are:

We propose a tracking model consisting of a resizable response generator and a bounding box regressor, where only one anchor size is used at each spatial position and its corresponding convolutional filter can be adapted to shape variations by bilinear interpolation.

We propose a recurrent neural optimizer, trained in a meta-learning setting, that recurrently updates the tracking model with faster convergence.

We conduct comprehensive experiments on large-scale datasets including OTB, VOT, LaSOT, GOT10k and TrackingNet, and our trackers achieve favorable performance compared with the state-of-the-art.
2 Related Work
Tracking by Response Generation. Correlation filter based trackers [15, 8, 3, 4] formulate response generation as an element-wise multiplication in the Fourier domain to improve computational efficiency, which is essentially a convolution operation on cyclically shifted samples. Instead of leveraging the periodic assumption used in correlation filters, which may introduce unwanted boundary effects [12], SiamFC [4] proposes to convolve the search feature map with an object template in the spatial domain to generate the heat map. Recent works improve SiamFC [4] by introducing various model updating strategies, including recurrent generation of target template filters through a convolutional LSTM [46], a dynamic memory network [47], where object information is written into and read from an addressable external memory, and distractor-aware incremental learning [50], which makes use of hard-negative templates around the target to suppress distractors. It should be noted that all these algorithms essentially achieve model updating by linearly interpolating old target templates with the newly generated one; the major difference is how to control the weights when combining them. This is far from optimal compared with optimization methods using gradient descent, which minimize the tracking loss directly to adapt to new target appearances. What's more, all the aforementioned trackers estimate the bounding box via a simple multi-scale search mechanism, which is not able to handle aspect ratio changes. Recently, SiamRPN [22] and its extensions [21, 10, 49] have shown promising results on estimating accurate bounding boxes via offline training on large-scale datasets. Different from these algorithms, which enumerate a set of predefined anchors with different aspect ratios at each spatial position, we adopt a resizable anchor to adapt to the shape variations of the object, which saves model parameters and computing time.
Instead of using a Siamese network to build the convolutional filter, other methods [40, 25, 33] generate the filter by performing gradient descent on the first frame, and continuously optimize it during subsequent frames. Specifically, [33] proposes to train the initial tracking model in a meta-learning setting, which shows promising results. However, it still uses traditional SGD to optimize the tracking model during the subsequent frames, which is neither effective in adapting to new appearances nor fast in updating the model. In contrast to these trackers, our offline-learned recurrent neural optimizer can update the model in only one or two gradient steps, resulting in much faster runtime speed and better accuracy.
Learning to learn. Learning to learn, or meta-learning, has a long history [37, 2, 31]. With the recent successes of applying meta-learning to few-shot classification [35, 30, 11, 38], it has regained attention. The pioneering work [1] designs an offline-learned optimizer using gradient descent and shows promising performance compared with traditional optimization methods. However, it does not generalize well for large numbers of descent steps. To mitigate this problem, [27] proposes several training techniques, including parameter scaling and combination with convex functions, to coordinate the learning process of the optimizer. [44] also addresses this issue by designing a hierarchical RNN architecture with dynamically adapted input and output scaling. In contrast to these works, which output an increment for each parameter update and are thus prone to overfitting due to different gradient scales, we instead associate an adaptive learning rate produced by a recurrent neural network with the computed gradient for fast convergence of the model update.
3 Proposed Algorithm
Our tracker consists of two main modules: 1) a tracking model that is resizable to adapt to shape changes; and 2) a neural optimizer that is in charge of model updating. The tracking model contains two branches, where the response generation branch determines the presence of the target by predicting a confidence score map, and the bounding box regression branch estimates the precise box of the target by regressing coordinate shifts relative to box anchors mounted at sliding-window positions. The offline-learned neural optimizer is responsible for updating the tracking model online in order to adapt to appearance variations. Note that both response generation and bounding box regression are built on the feature map computed from the backbone CNN network. The whole framework is briefly illustrated in Fig. 1.
3.1 Resizable Tracking Model
Mimicking the correlation filter [15], we use convolutional filters with the same shape as the object for both response generation and bounding box regression. This means the number of model parameters may vary among different sequences, and even among different frames in the same video when the shape of the object changes. However, we aim to meta-train a fixed-size initialization of the tracking model that is adaptable to different videos and can be continuously updated over subsequent frames. To address this problem, we warp the predefined-shape convolutional filter to the specific size using bilinear interpolation as in [33] before convolving it with the feature map, and are thus able to keep optimizing the fixed-shape model recurrently during subsequent frames. Specifically, the tracking model contains two parts, i.e., a correlation filter $\theta_{cf}$ and a bounding box regression filter $\theta_{reg}$, which are warped to the specific size to adapt to the shape variations of the target:

$\theta = \{\theta_{cf}, \theta_{reg}\}$  (1)
$\hat{\theta}_{cf} = \mathcal{W}(\theta_{cf}, s_t)$  (2)
$\hat{\theta}_{reg} = \mathcal{W}(\theta_{reg}, s_t)$  (3)

where $\mathcal{W}(\cdot, s_t)$ means resizing the convolutional filter to the specific filter size $s_t = (w^f_t, h^f_t)$ using bilinear interpolation. The filter size is computed by,
$w^f_t = \rho\, w_t / c$  (4)
$h^f_t = \rho\, h_t / c$  (5)

where $\rho$ is the scale factor that enlarges the filter size to cover some context information, $c$ is the stride of the feature map, and $(w_t, h_t)$ are the width and height of the object on the image patch at time step $t$. Thanks to the resizable filters, there is no need to enumerate different aspect ratios and scales of anchor boxes when doing bounding box regression. We only use one anchor size at each spatial location, corresponding to the shape of the regression filter:

$(w^a_t, h^a_t) = c \cdot (w^f_t, h^f_t)$  (6)
This saves regression filter parameters and achieves faster speed. Note that we update the filter size and its corresponding anchor box at the model updating interval, i.e., just before every model update, during both the offline training and testing/tracking phases. Through this modification, we are able to initialize the tracking model with a fixed shape and recurrently optimize it in subsequent frames without worrying about the shape changes of the object.
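As a minimal sketch of the filter-size and anchor computation above (the rounding rule and the default values of the scale factor and stride are illustrative assumptions, not the paper's exact settings):

```python
def filter_and_anchor_size(obj_w, obj_h, scale=1.5, stride=8):
    """Compute the filter size from the object size on the image patch.

    `scale` enlarges the filter to cover some context and `stride` is the
    feature-map stride of the backbone.
    """
    fw = max(1, round(scale * obj_w / stride))
    fh = max(1, round(scale * obj_h / stride))
    # One anchor per spatial position, whose size mirrors the regression
    # filter size mapped back to image coordinates.
    anchor = (fw * stride, fh * stride)
    return (fw, fh), anchor
```

Updating these sizes only just before each model update keeps the filter shape fixed between updates, so the same parameters can be optimized recurrently.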
3.2 Recurrent Model Optimization
Traditional optimization methods suffer from slow convergence due to small learning rates and limited training samples, while simply increasing the learning rate risks the training loss wandering wildly. Instead, we design a recurrent neural optimizer, which is trained to converge the model to a good solution in a few gradient steps (we only use one gradient step in our experiments, while considering multiple steps is straightforward), to update the tracking model. Our key idea is based on the assumption that the best optimizer should be able to update the model to generalize well on future frames. During the offline training stage, we perform a one-step gradient update on the tracking model using our recurrent neural optimizer and then minimize its loss on the next few frames. Once the offline learning stage is finished, we use this learned neural optimizer to recurrently update the tracking model to adapt to appearance variations. An optimizer trained in this way is able to quickly converge the model update so that it generalizes well for future frame tracking.
We denote the response generation network as $f(x; \theta_{cf})$ and the bounding box regression network as $g(x; \theta_{reg})$, where $x$ is the feature map input and $\theta = \{\theta_{cf}, \theta_{reg}\}$ are the parameters. The tracking loss consists of two parts: a response loss and a regression loss,

$L(\theta) = \| f(x; \theta_{cf}) - y \|^2 + \ell_1^{smooth}\big( g(x; \theta_{reg}) - b^* \big)$  (7)

where the first part is an L2 loss and the second part is a smooth L1 loss [36]. $b^*$ represents the ground-truth box; note that we adopt the parameterization of bounding box coordinates as in [36]. $y$ is the corresponding label map, which is built using a 2D Gaussian function

$y(i, j) = \exp\big( -\frac{(i - i_0)^2 + (j - j_0)^2}{2\sigma^2} \big)$  (8)

where $(i_0, j_0)$ is the target center and $\sigma$ controls the shape of the response map. A typical tracking process updates the tracking model using historical training examples and then tests this updated model on the following frames until the next update. We simulate this scenario in a meta-learning paradigm by recurrently optimizing the response regression network, and then testing it on a future frame.
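The 2D Gaussian label map can be sketched as follows (grid size, center and sigma in the test are illustrative):

```python
import math

def gaussian_label_map(height, width, cy, cx, sigma):
    """Build a label map that peaks at the target center (cy, cx);
    sigma controls the shape of the response map."""
    return [[math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]
```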
Specifically, the tracking network is updated by

$\theta_{t+1} = \theta_t - \lambda_t \odot \nabla L_t(\theta_t)$  (9)

where $\lambda_t$ is a fully element-wise learning rate that has the same dimension as the network parameters $\theta_t$, and $\odot$ is element-wise multiplication. The learning rate is recurrently generated as

$o_t, h_t = \mathrm{LSTM}(I_t, h_{t-1}; \omega)$  (10)
$\lambda_t = \sigma(o_t)$  (11)

where $\mathrm{LSTM}(\cdot; \omega)$ is a coordinate-wise LSTM [1] parameterized by $\omega$, which shares its parameters across all dimensions of the input, and $\sigma$ is the sigmoid function used to bound the predicted learning rate. The LSTM input $I_t$ is an element-wise stack of the previous learning rate $\lambda_{t-1}$, the current parameters $\theta_t$, the current update loss $L_t$ and its gradient $\nabla L_t(\theta_t)$, along a new axis (we therefore get an $N \times 4$ matrix, where $N$ is the number of parameters in $\theta_t$; note that the current update loss is broadcast to a shape compatible with the other vectors). To better understand this process, we can treat the input of the LSTM as a mini-batch of vectors, where $N$ is the batch size and 4 is the dimension of the input vector. The current update loss is computed from a mini-batch of updating samples,

$L_t = \frac{1}{n} \sum_{k=1}^{n} L(\theta_t; x_k, y_k, b^*_k)$  (12)
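A toy version of this update rule, treating the parameters, gradients and raw optimizer outputs as flat lists and abstracting the LSTM away (the function names are hypothetical):

```python
import math

def sigmoid(v):
    """Bound a raw optimizer output to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def roam_step(theta, grads, lstm_out):
    """One model update: an element-wise learning rate, the sigmoid of the
    optimizer's raw output, scales each gradient entry separately."""
    assert len(theta) == len(grads) == len(lstm_out)
    return [t - sigmoid(o) * g for t, g, o in zip(theta, grads, lstm_out)]
```

Because each parameter gets its own bounded learning rate, differently scaled gradient entries can still receive appropriately sized steps.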
where the $n$ updating samples $\{(x_k, y_k, b^*_k)\}$ are collected from the previous frames spanning one frame interval between model updates during online tracking. Finally, we test the newly updated model on a randomly selected future frame (we found that using more than one future frame does not improve performance but costs more time during the offline training phase) and obtain the meta loss,

$\ell_t = L(\theta_{t+1}; x_{t'}, y_{t'}, b^*_{t'})$  (13)

where the future frame $t'$ is randomly selected within the interval before the next model update.
During the offline training stage, we perform the aforementioned procedure on a mini-batch of videos and obtain an averaged meta loss for optimizing the neural optimizer,

$\mathcal{L}_{meta} = \frac{1}{BT} \sum_{b=1}^{B} \sum_{t=1}^{T} \ell^{(b)}_t$  (14)

where $B$ is the batch size, $T$ is the number of model updates, and $b$ indexes a video clip sampled from the training episodes. It should be noted that the initial model parameters $\theta_0$ and the initial learning rate $\lambda_0$ are also trainable variables, which are jointly learned with the neural optimizer parameters $\omega$. By minimizing the averaged meta loss $\mathcal{L}_{meta}$, we aim to train a neural optimizer that can update the tracking model to generalize well on subsequent frames, as well as to learn a beneficial initialization of the tracking model that is broadly applicable to different tasks (i.e., videos). The overall training process is detailed in Algorithm 1.
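The per-clip training loop described above can be sketched as a toy simulation, where `update_model` and `frame_loss` stand in for the real neural-optimizer step and tracking loss (all names and the interval value are placeholders, not the paper's implementation):

```python
def meta_train_clip(theta, frames, update_model, frame_loss, interval=5):
    """Toy offline-training loop over one video clip: update the model at a
    fixed interval using the previous frames, then accumulate the loss of
    the freshly updated model on a future frame, and finally average."""
    meta_loss, n_updates = 0.0, 0
    for t in range(interval, len(frames), interval):
        theta = update_model(theta, frames[t - interval:t])  # one update step
        future = min(t + interval - 1, len(frames) - 1)      # a future frame
        meta_loss += frame_loss(theta, frames[future])       # meta loss term
        n_updates += 1
    return theta, meta_loss / max(n_updates, 1)
```

In the real framework this average, taken over a mini-batch of clips, is the quantity minimized with respect to the optimizer parameters, the initial model and the initial learning rate.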
3.3 Gradient Flow
Note that our recurrent model optimization process involves a gradient update step (see Eq. 9), which may require computing second derivatives when minimizing the meta loss. Specifically, we show the computation graph of the offline training process of our framework in Fig. 2, where the gradients flow backwards along the solid lines and are stopped along the dashed lines. For the initial frame (Fig. 2 left), we aim to learn a conducive tracking network and learning rates that can quickly adapt to different videos with one gradient step, which is closely correlated with the tracking model. Therefore, we backpropagate through both the gradient update step and the gradient of the update loss (i.e., computing the second derivative). For the subsequent frames, our goal is to learn the online neural optimizer, which is independent of the particular tracking model being optimized. Hence, we choose to ignore the gradient paths along the update loss gradient and along the neural optimizer inputs, which are composed of historical context vectors computed from the tracking model. The effect of this simplification is to focus the training more on improving the neural optimizer to reduce the loss, rather than on secondary terms such as the LSTM input or the loss gradient.
3.4 Random Filter Scaling
The neural optimizer has difficulty generalizing well to new tasks due to overfitting, as discussed in [27, 44, 1]. By analyzing the learned behavior of the neural optimizer, we found that our preliminary trained optimizer predicts similar learning rates regardless of the input (see Fig. 3: Left). We attribute this to overfitting to specifically scaled network inputs.
Let's first use a simple example to illustrate the overfitting problem. Suppose the objective function that the neural optimizer minimizes is $f(\theta) = \|x\theta - y\|^2$ (the loss function we actually optimize (see Eq. 7) includes an L2 loss and a smooth L1 loss; here we consider a simple linear model with an L2 loss for simplicity). Obviously, the optimal learning rate is $1/(2x^2)$, since we can then achieve the lowest loss in one gradient descent step. If the learned neural optimizer perfectly follows this rule, it will not generalize well for a differently scaled input $x' \neq x$, causing overfitting. To address this problem, we multiply the tracking model with a randomly sampled vector $s$ which has the same dimension as $\theta$.
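The toy objective can be checked numerically; this is a hypothetical scalar case, assuming f(theta) = (x*theta - y)^2:

```python
def one_step_optimum(theta, x, y):
    """With f(theta) = (x*theta - y)**2, the gradient is 2*x*(x*theta - y),
    and the learning rate 1/(2*x*x) reaches the minimum in a single step.
    An optimizer that memorizes this rate fails for differently scaled x."""
    lr = 1.0 / (2.0 * x * x)
    grad = 2.0 * x * (x * theta - y)
    return theta - lr * grad
```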
$\tilde{\theta} = s \odot \theta$  (15)
$s_i \sim U(1/\gamma, \gamma)$  (16)

where $U$ represents a uniform distribution and $\gamma$ is the range factor that controls the scale extent. Then, the objective function becomes $f(\theta) = \|x (s \odot \theta) - y\|^2$. In this way, the network input is indirectly scaled without modifying the training samples in practice. Thus, the learned neural optimizer is forced to predict adaptive learning rates (see Fig. 3: Right) for differently scaled inputs, rather than producing similar learning rates for similarly scaled inputs, which improves its generalization ability.

4 Online Tracking via Proposed Framework
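Random filter scaling can be sketched as below; the uniform range U(1/gamma, gamma) follows the description above, while the exact sampling granularity (per parameter) is an assumption:

```python
import random

def random_filter_scaling(theta, gamma):
    """Multiply each model parameter by an independent random scale drawn
    from U(1/gamma, gamma), forcing the learned optimizer to cope with
    differently scaled inputs instead of memorizing one learning rate."""
    return [t * random.uniform(1.0 / gamma, gamma) for t in theta]
```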
We use the offline-trained neural optimizer, initial model and initial learning rate to perform online tracking. The overall process is similar to offline training, except that we do not compute the meta loss or its gradient.
Model Initialization. Given the first frame, we first crop an image patch centered on the provided bounding box and compute its feature map. Then we use this example, together with the offline-learned initial model and learning rate, to perform a one-step gradient update to build the initial model as in Eq. (9).
Bounding Box Estimation. We estimate the object bounding box by first finding the presence of the target through response generation and then predicting an accurate box via bounding box regression. We employ the penalty strategy used in [22] on the generated response to suppress anchors with large changes in scale and aspect ratio. In addition, we multiply the response map by a Gaussian-like motion map to suppress large movements. The bounding box computed from the anchor that corresponds to the maximum score of the response map is the final prediction. To smooth the results, we linearly interpolate this estimated object size with the previous one. Besides, we also design a variant of our tracker that uses multi-scale search to estimate the object box, which we term ROAM. We name the variant with bounding box regression ROAM++.
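The Gaussian-like motion map can be applied as a simple multiplicative prior on the response map (the window shape and sigma here are illustrative):

```python
import math

def apply_motion_map(response, cy, cx, sigma):
    """Down-weight response scores far from the previous target position
    (cy, cx) by multiplying with a Gaussian-like motion map."""
    out = []
    for y, row in enumerate(response):
        out.append([s * math.exp(-((y - cy) ** 2 + (x - cx) ** 2)
                                 / (2 * sigma ** 2))
                    for x, s in enumerate(row)])
    return out
```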
Model Updating. We update the model at a fixed frame interval. Even though offline training uses the previous frames to perform a one-step gradient update of the model, in practice we find that using more than one step can further improve the performance during tracking (see Sec. 6.2). Therefore, we adopt a two-step gradient update using the previous frames in our experiments.
5 Implementation Details
Patch Cropping. Given a bounding box of the object, the ROI of the image patch shares the same center and takes a larger size, enlarged by a scale factor to cover some background. Then, the ROI is resized to a fixed size for batch processing, which is computed based on a predefined object size that we denote as the Base Object Size (BOS).
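One plausible reading of the cropping rule can be sketched as follows; the enlargement per side, the geometric-mean object size, and all default constants are assumptions for illustration, not the paper's exact formulas:

```python
def crop_and_resize_params(cx, cy, w, h, scale=2.0, base_object_size=64):
    """ROI keeps the target center, enlarges each side by `scale`, and the
    output resolution is chosen so the object maps to roughly the Base
    Object Size (BOS)."""
    roi_w, roi_h = scale * w, scale * h
    # Resize the ROI so the (geometric mean) object size becomes the BOS.
    resize = base_object_size / max((w * h) ** 0.5, 1e-6)
    out_w, out_h = round(roi_w * resize), round(roi_h * resize)
    return (cx, cy, roi_w, roi_h), (out_w, out_h)
```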
Network Structure. We use the first 12 convolution layers of the pretrained VGG-16 [39] as the feature extractor. The top max-pooling layers are removed to increase the spatial resolution of the feature map. Both the response generation network and the bounding box regression network consist of two convolutional layers: a dimension-reduction layer as the first layer, and a correlation layer or a regression layer as the second layer, respectively. We use two stacked LSTM layers with 20 hidden units for the neural optimizer.

Hyperparameters. We fix the updating interval, the batch size of updating samples, the ROI scale factor, the BOS, the Gaussian shape parameter used for response generation, the feature stride of the CNN feature extractor, and the scale and aspect ratio factors used for initial image patch augmentation.
Training Details. We use ADAM [17] optimization with a mini-batch of 16 video clips of length 31 on 4 GPUs (4 videos per GPU) to train our framework. The datasets we use for training include the video datasets ImageNet VID [20], TrackingNet [29], LaSOT [9] and GOT10k [16], and the image datasets ImageNet DET [20] and COCO [24]. During training, we randomly extract a continuous sequence clip from the video datasets, and repeat the same still image to form a video clip for the image datasets. Note that we randomly augment all frames in a training clip by slightly stretching and scaling the images. We use a learning rate of 1e-6 for the initial model parameters and the initial learning rate. For the recurrent neural optimizer, we use a learning rate of 1e-3. Both learning rates are multiplied by 0.5 every 5 epochs. We implement our tracker in Python with the PyTorch toolbox [34], and conduct the experiments on a computer with an NVIDIA Tesla P40 GPU and an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.

6 Experiments
We evaluate our trackers on six benchmarks: OTB2015 [45], VOT2016 [19], VOT2017 [18], LaSOT [9], GOT10k [16] and TrackingNet [29].
6.1 Comparison Results with Stateoftheart
We compare our ROAM and ROAM++ with recent response regression based trackers, including MetaTracker [33], DSLT [25], MemTrack [47], CREST [40], SiamFC [4], CCOT [8], ECO [7] and Staple [3], as well as recent state-of-the-art trackers including SiamRPN [22], DaSiamRPN [50], SiamRPN+ [49] and C-RPN [10], on both the OTB and VOT datasets. For the methods using SGD updates, the number of SGD steps follows their original implementations.
             VOT2016              VOT2017
             EAO(↑) A(↑)   R(↓)   EAO(↑) A(↑)   R(↓)
ROAM++       0.441  0.599  0.174  0.380  0.543  0.195
ROAM         0.384  0.556  0.183  0.331  0.505  0.226
MetaTracker  0.317  0.519  -      -      -      -
DaSiamRPN    0.411  0.61   0.22   0.326  0.56   0.34
SiamRPN+     0.37   0.58   0.24   0.30   0.52   0.41
C-RPN        0.363  0.594  -      0.289  -      -
SiamRPN      0.344  0.56   0.26   0.244  0.49   0.46
ECO          0.375  0.55   0.20   0.280  0.48   0.27
DSLT         0.343  0.545  0.219  -      -      -
CCOT         0.331  0.54   0.24   0.267  0.49   0.32
Staple       0.295  0.54   0.38   0.169  0.52   0.69
CREST        0.283  0.51   0.25   -      -      -
MemTrack     0.272  0.531  0.373  0.248  0.524  0.357
SiamFC       0.235  0.532  0.461  0.188  0.502  0.585

Table 1: Results on VOT2016 and VOT2017. The evaluation metrics include expected average overlap (EAO), accuracy (A) and robustness (R).
OTB. Fig. 4 presents the experimental results on the OTB2015 dataset, which contains 100 sequences with 11 annotated video attributes. Both our ROAM and ROAM++ achieve similar AUC to the top performer ECO and outperform all other trackers. Specifically, our ROAM and ROAM++ surpass MetaTracker [33], the baseline meta-learning tracker that uses a traditional optimization method for model updating, by 6.7% and 6.9% on the success plot, respectively, demonstrating the effectiveness of the proposed recurrent model optimization algorithm and resizable bounding box regression.
VOT. Table 1 shows the comparison on the VOT2016 and VOT2017 datasets. Our ROAM++ achieves the best EAO on both VOT2016 and VOT2017. Specifically, both our ROAM++ and ROAM show superior robustness compared with RPN-based trackers, which perform no model updating, demonstrating the effectiveness of our recurrent model optimization scheme. In addition, our ROAM++ and ROAM outperform the baseline MetaTracker [33] by 39.2% and 21.1% on VOT2016 EAO, respectively.
LaSOT. LaSOT [9] is a recently proposed large-scale tracking dataset. We evaluate our trackers against the top 10 performing trackers of the benchmark, including MDNet [32], VITAL [41], SiamFC [4], StructSiam [48], DSiam [13], SINT [13], ECO [7], STRCF [23], ECO_HC [7] and CFNet [43], on the testing split, which consists of 280 videos. Fig. 5 presents the precision plots and success plots on the LaSOT test set. Our ROAM++ achieves the best result among state-of-the-art trackers on the benchmark, outperforming the second best, MDNet, by 19.3% and 12.6% on the precision plot and success plot, respectively.
GOT10k. GOT10k [16] is a recently proposed large-scale, highly diverse dataset for object tracking. There is no overlap in object categories between the training split and the test split, which follows the one-shot learning setting [11]. Therefore, using external training data is strictly forbidden when testing trackers on their online server. Following their protocol, we train our ROAM using only the training split of this dataset. Table 2 shows the detailed comparison results on the GOT10k test set. Both our ROAM++ and ROAM surpass the other trackers by a large margin. Specifically, our ROAM++ obtains an AO of 0.465, an SR_{0.5} of 0.532 and an SR_{0.75} of 0.236, outperforming SiamFCv2 by 24.3%, 31.7% and 63.9%, respectively.
                                                            SiamFCv2  ROAM   ROAM++
AO (↑)         0.299  0.315  0.316  0.325  0.342  0.348  0.374  0.436  0.465
SR_{0.5} (↑)   0.303  0.297  0.309  0.328  0.372  0.353  0.404  0.466  0.532
SR_{0.75} (↑)  0.099  0.088  0.111  0.107  0.124  0.098  0.144  0.164  0.236

Table 2: Results on the GOT10k test set.
TrackingNet. TrackingNet [29] provides more than 30K videos with around 14M dense bounding box annotations, built by filtering shorter video clips from YouTube-BB [5]. Note that we only use the training split of this dataset to train our tracker. Table 3 presents the detailed comparison results on the TrackingNet test set. Our ROAM++ surpasses the other state-of-the-art tracking algorithms on all three evaluation metrics. In detail, our ROAM++ obtains an improvement of 10.6%, 10.3% and 6.9% on AUC, precision and normalized precision, respectively, compared with MDNet, the top performing compared tracker on the benchmark.
                                                               MDNet  ROAM   ROAM++
AUC (↑)          0.528  0.534  0.541  0.554  0.571  0.578  0.606  0.620  0.670
Prec. (↑)        0.470  0.480  0.478  0.492  0.533  0.533  0.565  0.547  0.623
Norm. Prec. (↑)  0.603  0.622  0.608  0.618  0.663  0.654  0.705  0.695  0.754

Table 3: Results on the TrackingNet test set.
6.2 Ablation Study
To analyze the proposed tracking algorithms in depth, we study our trackers from various aspects. Note that all these ablation models are trained only on the ImageNet VID dataset for simplicity.
Impact of Different Modules. To verify the effectiveness of different modules, we design four variants of our framework: 1) SGD: replacing the recurrent neural optimizer with traditional SGD for model updating (using the same number of gradient steps as ROAM); 2) ROAM-w/o-RFS: training a recurrent neural optimizer without RFS; 3) SGD-Oracle: using the ground-truth bounding boxes to build updating samples for SGD during the testing phase; 4) ROAM-Oracle: using the ground-truth bounding boxes to build updating samples for ROAM during the testing phase. The results are presented in Fig. 6 (left). ROAM gains about 6% improvement on AUC compared with the baseline SGD, demonstrating the effectiveness of our recurrent model optimization method for model updating. Without RFS during offline training, the tracking performance drops substantially due to overfitting. ROAM-Oracle performs better than SGD-Oracle, showing that our offline-learned neural optimizer is more effective than the traditional SGD method given the same updating samples. In addition, these two oracles achieve higher AUC scores than their normal versions, indicating that tracking accuracy could be further boosted by improving the quality of the updating samples.
Architecture of Neural Optimizer. To investigate more architectures for the neural optimizer, we present three variants of our method: 1) ROAM-GRU: using two stacked Gated Recurrent Units (GRU) [6] as the neural optimizer; 2) ROAM-FC: using two linear fully-connected layers followed by a tanh activation function as the neural optimizer; 3) ROAM-ConstLR: using a learned constant element-wise learning rate for model optimization instead of the adaptively generated one. Fig. 6 (right) presents the results. ROAM-GRU decreases the AUC a little, while ROAM-FC gets a much lower AUC compared with ROAM, showing the importance of our recurrent structure. Moreover, the performance drop of ROAM-ConstLR verifies the necessity of using an adaptive learning rate for model updating.

More Steps in Updating. During offline training, we only perform one-step gradient descent to optimize the updating loss. We investigate the effect of using more than one gradient step on tracking performance during the test phase for both ROAM and SGD (see Fig. 7). Our method can be further improved with more than one step, but performance gradually decreases when using too many steps. This is because our framework is not trained to perform so many steps during the offline stage. We also use two fixed learning rates for SGD, where the larger one is 7e-6 (the learning rate used by MetaTracker) and the smaller one is 7e-7. With the larger learning rate, SGD reaches its best performance much faster than with the smaller one, while both have similar best AUC. Our ROAM consistently outperforms SGD(7e-6), showing the superiority of adaptive element-wise learning rates. Furthermore, ROAM with 1-2 gradient steps outperforms SGD(7e-7) using a large number of steps, which shows the improved generalization of ROAM.
Update Loss Comparison. To show the effectiveness of our neural optimizer in minimizing the update loss, we compare the tracking loss over time between ROAM and SGD after two gradient steps for a few videos of OTB2015 in Fig. 8 (see the supplementary material for more examples). Given the same number of gradient updates, our neural optimizer obtains a lower loss than traditional SGD, demonstrating its faster convergence for model optimization.
Why does ROAM work? As discussed in [33], directly using the learned initial learning rate for model optimization in subsequent frames could lead to divergence. This is because the learning rates for model initialization are relatively larger than the ones needed for subsequent frames, which causes unstable model optimization. In particular, the initial model is offline-trained to be broadly applicable to different videos, and therefore needs a relatively large gradient step to adapt to a specific task, leading to relatively large initial learning rates. For the subsequent frames, the appearance variations may sometimes be small and sometimes large, and thus the model optimization process needs an adaptive learning rate to handle different situations. Fig. 9 presents the histograms of initial learning rates and updating learning rates on OTB2015. Most of the updating learning rates are relatively small because there are usually only minor appearance variations between updates. As shown in Fig. 4, our ROAM, which performs model updating on subsequent frames with adaptive learning rates, obtains a substantial performance gain compared with MetaTracker [33], which uses traditional SGD with a constant learning rate for model updating.
Impact of Training Data. For a fair comparison, we also train variants of our ROAM++ and ROAM with only the ImageNet VID training data, denoted VID-ROAM++ and VID-ROAM, respectively. Table 4 shows the comparison results on the VOT datasets. With only the VID dataset for training, our ROAM++ still outperforms our preliminary tracker ROAM. Moreover, both VID-ROAM++ and VID-ROAM outperform the baseline Meta-Tracker [33] by a large margin.
Tracker      |        VOT2016         |        VOT2017
             |  EAO     A       R     |  EAO     A       R
ROAM++       | 0.4414  0.5987  0.1743 | 0.3803  0.5433  0.1945
VID-ROAM++   | 0.4147  0.5886  0.1749 | 0.3330  0.5513  0.2366
ROAM         | 0.3842  0.5558  0.1833 | 0.3313  0.5048  0.2263
VID-ROAM     | 0.3568  0.5468  0.2091 | 0.2971  0.5098  0.2491
Meta-Tracker | 0.317   0.519   -      | -       -       -
7 Conclusion
In this paper, we propose a novel tracking model consisting of a resizable response generator and a bounding box regressor, where the first part generates a score map indicating the presence of the target at different spatial locations, and the second module regresses bounding box coordinate shifts with respect to anchors densely mounted on sliding-window positions. To effectively update the tracking model for appearance adaptation, we propose a recurrent model optimization method within a meta-learning setting that converges in only a few gradient steps. Instead of producing an increment for parameter updating, which is prone to overfitting due to different gradient scales, we recurrently generate adaptive learning rates for tracking model optimization using an LSTM. Moreover, we propose a simple yet effective training trick that randomly scales the convolutional filters used in the tracking model to prevent overfitting, which improves performance significantly. Extensive experiments on the OTB, VOT, LaSOT, GOT-10k and TrackingNet datasets demonstrate superior performance compared with state-of-the-art tracking algorithms.
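As a rough illustration of the resizable component summarized above, the sketch below resizes a 2-D filter with bilinear interpolation. This is a pure-NumPy stand-in under simplifying assumptions: the paper applies this idea to the convolutional filters of the tracking model, and details such as channel handling are omitted here.

```python
import numpy as np

def bilinear_resize_filter(f, out_h, out_w):
    """Resize a 2-D convolutional filter to (out_h, out_w) via bilinear
    interpolation, so one fixed-size anchor's filter can be adapted to the
    target's current scale and aspect ratio."""
    in_h, in_w = f.shape
    # Sample coordinates in the input filter's grid
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :]  # horizontal interpolation weights
    # Blend the four neighbouring taps for every output position
    top = f[np.ix_(y0, x0)] * (1 - wx) + f[np.ix_(y0, x1)] * wx
    bot = f[np.ix_(y1, x0)] * (1 - wx) + f[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

# Example: stretch a 3x3 filter to 5x7 as the target's aspect ratio changes
f = np.arange(9, dtype=float).reshape(3, 3)
resized = bilinear_resize_filter(f, 5, 7)
```

Since the resized filter is a differentiable function of the original weights, the same filter bank can serve targets of varying shapes without enumerating multiple anchor sizes.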
References
 [1] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas. Learning to Learn by Gradient Descent by Gradient Descent. In NeurIPS, 2016.
 [2] S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei. On the Optimization of a Synaptic Learning Rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, 1992.
 [3] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. Torr. Staple: Complementary Learners for Real-Time Tracking. In CVPR, 2016.
 [4] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr. Fully-Convolutional Siamese Networks for Object Tracking. In ECCV Workshop on VOT, 2016.
 [5] E. Real, J. Shlens, S. Mazzocchi, X. Pan, and V. Vanhoucke. YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video. In CVPR, 2017.
 [6] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv preprint arXiv:1406.1078, 2014.
 [7] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg. ECO: Efficient Convolution Operators for Tracking. In CVPR, 2017.
 [8] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In ECCV, 2016.
 [9] H. Fan, L. Lin, F. Yang, and P. Chu. LaSOT: A High-Quality Benchmark for Large-Scale Single Object Tracking. In CVPR, 2019.
 [10] H. Fan and H. Ling. Siamese Cascaded Region Proposal Networks for Real-Time Visual Tracking. In CVPR, 2019.
 [11] C. Finn, P. Abbeel, and S. Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In ICML, 2017.
 [12] H. K. Galoogahi, T. Sim, and S. Lucey. Correlation Filters with Limited Boundaries. In CVPR, 2015.
 [13] Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, and S. Wang. Learning Dynamic Siamese Network for Visual Object Tracking. In ICCV, 2017.
 [14] D. Held, S. Thrun, and S. Savarese. Learning to Track at 100 FPS with Deep Regression Networks. In ECCV, 2016.
 [15] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-Speed Tracking with Kernelized Correlation Filters. TPAMI, 2015.
 [16] L. Huang, X. Zhao, and K. Huang. GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild. arXiv, 2019.
 [17] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.
 [18] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Č. Zajc, T. Vojír, G. Bhat, A. Lukežič, A. Eldesokey, G. Fernández, and Others. The Sixth Visual Object Tracking VOT2018 Challenge Results. In ECCV Workshop on VOT, 2018.
 [19] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Č. Zajc, T. Vojír, G. Häger, A. Lukežič, A. Eldesokey, G. Fernández, and Others. The Visual Object Tracking VOT2017 Challenge Results. In ICCV Workshop on VOT, 2017.
 [20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NeurIPS, 2012.
 [21] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan. SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks. In CVPR, 2019.
 [22] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu. High Performance Visual Tracking with Siamese Region Proposal Network. In CVPR, 2018.
 [23] F. Li, C. Tian, W. Zuo, L. Zhang, and M.-H. Yang. Learning Spatial-Temporal Regularized Correlation Filters for Visual Tracking. In CVPR, 2018.
 [24] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, 2014.
 [25] X. Lu, C. Ma, B. Ni, X. Yang, I. Reid, and M.-H. Yang. Deep Regression Tracking with Shrinkage Loss. In ECCV, 2018.
 [26] A. Lukežič, T. Vojíř, L. Čehovin, J. Matas, and M. Kristan. Discriminative Correlation Filter with Channel and Spatial Reliability. In CVPR, 2017.
 [27] K. Lv, S. Jiang, and J. Li. Learning Gradient Descent: Better Generalization and Longer Horizons. In ICML, 2017.
 [28] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang. Hierarchical Convolutional Features for Visual Tracking. In ICCV, 2015.
 [29] M. Müller, A. Bibi, S. Giancola, S. AlSubaihi, and B. Ghanem. TrackingNet: A LargeScale Dataset and Benchmark for Object Tracking in the Wild. In ECCV, 2018.
 [30] T. Munkhdalai and H. Yu. Meta Networks. In ICML, 2017.
 [31] D. K. Naik and R. Mammone. Meta-Neural Networks that Learn by Learning. In IJCNN, 1992.
 [32] H. Nam and B. Han. Learning Multi-Domain Convolutional Neural Networks for Visual Tracking. In CVPR, 2016.
 [33] E. Park and A. C. Berg. Meta-Tracker: Fast and Robust Online Adaptation for Visual Object Trackers. In ECCV, 2018.
 [34] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic Differentiation in PyTorch. In NeurIPS Workshop, 2017.
 [35] S. Ravi and H. Larochelle. Optimization as a Model for Few-Shot Learning. In ICLR, 2017.
 [36] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NeurIPS, 2015.
 [37] J. Schmidhuber. Evolutionary Principles in Self-Referential Learning: On Learning How to Learn: The Meta-Meta-…-Hook. PhD thesis, 1987.
 [38] J. Schulman, J. Ho, X. Chen, and P. Abbeel. Meta Learning Shared Hierarchies. arXiv, 2018.
 [39] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015.
 [40] Y. Song, C. Ma, L. Gong, J. Zhang, R. Lau, and M.H. Yang. CREST: Convolutional Residual Learning for Visual Tracking. In ICCV, 2017.
 [41] Y. Song, C. Ma, X. Wu, L. Gong, L. Bao, W. Zuo, C. Shen, R. Lau, and M.H. Yang. VITAL: VIsual Tracking via Adversarial Learning. In CVPR, 2018.
 [42] R. Tao, E. Gavves, and A. W. M. Smeulders. Siamese Instance Search for Tracking. In CVPR, 2016.
 [43] J. Valmadre, L. Bertinetto, F. Henriques, A. Vedaldi, and P. H. S. Torr. Endtoend representation learning for Correlation Filter based tracking. In CVPR, 2017.
 [44] O. Wichrowska, N. Maheswaranathan, M. W. Hoffman, S. G. Colmenarejo, M. Denil, N. de Freitas, and J. SohlDickstein. Learned Optimizers that Scale and Generalize. In ICML, 2017.
 [45] Y. Wu, J. Lim, and M.-H. Yang. Object Tracking Benchmark. TPAMI, 2015.
 [46] T. Yang and A. B. Chan. Recurrent Filter Learning for Visual Tracking. In ICCV Workshop on VOT, 2017.
 [47] T. Yang and A. B. Chan. Learning Dynamic Memory Networks for Object Tracking. In ECCV, 2018.
 [48] Y. Zhang, L. Wang, D. Wang, and H. Lu. Structured Siamese Network for RealTime Visual Tracking. In ECCV, 2018.
 [49] Z. Zhang, H. Peng, and Q. Wang. Deeper and Wider Siamese Networks for RealTime Visual Tracking. In CVPR, 2019.
 [50] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu. Distractor-aware Siamese Networks for Visual Object Tracking. In ECCV, 2018.