ROAM: Recurrently Optimizing Tracking Model

07/28/2019 ∙ by Tianyu Yang, et al. ∙ City University of Hong Kong ∙ Beijing Didi Infinity Technology and Development Co., Ltd.

Updating a tracking model online to adapt to object appearance variations is challenging. For SGD-based model optimization, a large learning rate may help the model converge faster, but risks letting the loss wander wildly. Traditional optimization methods therefore usually choose a relatively small learning rate and iterate for more steps to converge the model, which is time-consuming. In this paper, we propose to train a recurrent neural optimizer offline, in a meta-learning setting, to predict an adaptive learning rate for model updating, so that the model converges in a few gradient steps. This substantially improves the convergence speed of updating the tracking model while achieving better performance. Moreover, we propose a simple yet effective training trick called Random Filter Scaling to prevent overfitting, which boosts performance greatly. Finally, we extensively evaluate our tracker, ROAM, on the OTB, VOT, GOT-10K, TrackingNet and LaSOT benchmarks, and our method performs favorably against state-of-the-art algorithms.




1 Introduction

Generic visual object tracking is the task of estimating the bounding box of a target in a video sequence given only its initial position. Typically, the preliminary model learned from the first frame needs to be updated continuously to adapt to the target’s appearance variations caused by rotation, illumination, occlusion, deformation, etc. However, it is challenging to optimize the initial learned model efficiently and effectively as tracking proceeds. Training samples for model updating are usually collected based on estimated bounding boxes, which could be inaccurate. Those small errors will accumulate over time, gradually resulting in model degradation.

To avoid model updating, which may introduce unreliable training samples that ruin the model, several approaches [4, 42] investigate tracking by only comparing the first frame with subsequent frames, using a similarity function based on a learned discriminative and invariant deep Siamese feature embedding. However, training such a deep representation is difficult due to the drastic appearance variations that commonly emerge in long-term tracking. Other methods either update the model via an exponential moving average of templates [15, 43], which only marginally improves the performance, or optimize the model with hand-designed SGD methods [32, 40], which need numerous iterations to converge, thus preventing real-time speed. Limiting the number of SGD iterations can allow near real-time speeds, but at the expense of poor-quality model updates because the loss function is not optimized sufficiently.

In recent years, much effort has been devoted to localizing the object using a robust online-learned classifier, while little attention has been paid to designing accurate bounding box estimation. Most trackers simply resort to multi-scale search, assuming that the object aspect ratio does not change during tracking, an assumption that is often violated in the real world. Recently, SiamRPN [22] borrowed the idea of region proposal networks [36] from object detection to decompose the tracking task into two branches: classifying the target from the background, and regressing an accurate bounding box with reference to anchor boxes mounted on different positions. It achieves higher precision on bounding box estimation but suffers from lower robustness than state-of-the-art methods because it performs no online model updating.

In this paper, we address the aforementioned problems by proposing a tracking framework composed of two modules: response generation and bounding box regression. The first module produces a response map that indicates the possibility of covering the object for anchor boxes mounted on sliding-window positions, and the second predicts bounding box shifts relative to the anchors to obtain refined rectangles. Instead of enumerating different scales and aspect ratios of anchors as in object detection, we propose to use only one sized anchor for each position and adapt it to shape changes by resizing its corresponding convolutional filter using bilinear interpolation, which saves model parameters and computing time. To effectively adapt the tracking model to appearance changes during tracking, we propose a recurrent model optimization method that learns a more effective gradient descent, one that converges the model update in 1-2 steps and generalizes better to future frames. The key idea is to train a neural optimizer that can converge the tracking model to a good solution in a few gradient steps. During the training phase, the tracking model is first updated using the neural optimizer, and then it is applied to future frames to obtain an error signal for minimization. Under this setting, the resulting optimizer converges the tracking classifier significantly faster than SGD-based optimizers, especially when learning the initial tracking model. In summary, our contributions are:

  • We propose a tracking model consisting of a resizable response generator and a bounding box regressor, where only one sized anchor is used at each spatial position and its corresponding convolutional filter can be adapted to shape variations by bilinear interpolation.

  • We propose a recurrent neural optimizer, which is trained in a meta-learning setting, that recurrently updates the tracking model with faster convergence.

  • We conduct comprehensive experiments on large scale datasets including OTB, VOT, LaSOT, GOT10k and TrackingNet, and our trackers achieve favorable performance compared with the state-of-the-art.

2 Related Work

Tracking by Response Generation. Correlation filter based trackers [15, 8, 3, 4] formulate response generation as an element-wise multiplication in the Fourier domain to improve computational efficiency, which is essentially a convolution operation on cyclically shifted samples. Instead of relying on the periodic assumption used in correlation filters, which may introduce unwanted boundary effects [12], SiamFC [4] proposes to convolve the search feature map with an object template in the spatial domain to generate the heat map. Recent works improve SiamFC [4] by introducing various model updating strategies, including recurrent generation of target template filters through a convolutional LSTM [46], a dynamic memory network [47], where object information is written into and read from an addressable external memory, and distractor-aware incremental learning [50], which makes use of hard-negative templates around the target to suppress distractors. It should be noted that all these algorithms essentially achieve model updating by linearly interpolating old target templates with the newly generated one; the major difference among them is how the weights are controlled when combining them. This is far from optimal compared with optimization methods using gradient descent, which minimize the tracking loss directly to adapt to new target appearances. Moreover, all the aforementioned trackers estimate the bounding box via a simple multi-scale search mechanism, which cannot handle aspect ratio changes. Recently, SiamRPN [22] and its extensions [21, 10, 49] have shown promising results on estimating accurate bounding boxes via offline training on large-scale datasets. Unlike these algorithms, which enumerate a set of predefined anchors with different aspect ratios at each spatial position, we adopt a resizable anchor to adapt to the shape variation of the object, which saves model parameters and computing time.

Instead of using a Siamese network to build the convolutional filter, other methods [40, 25, 33] generate the filter by performing gradient descent on the first frame, and the filter can then be continuously optimized during subsequent frames. In particular, [33] proposes to train the initial tracking model in a meta-learning setting, which shows promising results. However, it still uses traditional SGD to optimize the tracking model during the subsequent frames, which is slow in updating the model and not effective at adapting to new appearances. In contrast to these trackers, our offline-learned recurrent neural optimizer can update the model in only one or two gradient steps, resulting in much faster runtime speed and better accuracy.

Learning to learn. Learning to learn, or meta-learning, has a long history [37, 2, 31]. With the recent successes of applying meta-learning to few-shot classification [35, 30] and reinforcement learning [11, 38], it has regained attention. The pioneering work [1] designs an offline-learned optimizer using gradient descent and shows promising performance compared with traditional optimization methods. However, it does not generalize well to large numbers of descent steps. To mitigate this problem, [27] proposes several training techniques, including parameter scaling and combination with convex functions to coordinate the learning process of the optimizer. [44] also addresses this issue by designing a hierarchical RNN architecture with dynamically adapted input and output scaling. In contrast to other works that output an increment for each parameter update, which is prone to overfitting due to different gradient scales, we instead associate an adaptive learning rate produced by a recurrent neural network with the computed gradient for fast convergence of the model update.

Figure 1: Pipeline of ROAM++. Given a mini-batch of training patches, deep features are extracted by the Feature Net. The response map and bounding boxes are predicted for each sample using the Tracking Model, from which the update loss with respect to the ground truth and its gradient are computed. Next, the element-wise stack of the previous learning rates, current parameters, current update loss and its gradient is input into a coordinate-wise LSTM to generate the adaptive learning rate as in (11). The model is then updated using one gradient descent step as in (9). Finally, we apply the updated model to a randomly selected future frame to get a meta loss for minimization as in (13).

3 Proposed Algorithm

Our tracker consists of two main modules: 1) a tracking model that is resizable to adapt to shape changes; and 2) a neural optimizer that is in charge of model updating. The tracking model contains two branches: the response generation branch determines the presence of the target by predicting a confidence score map, and the bounding box regression branch estimates the precise box of the target by regressing coordinate shifts relative to box anchors mounted on sliding-window positions. The offline-learned neural optimizer is responsible for updating the tracking model online in order to adapt to appearance variations. Note that both response generation and bounding box regression are built on the feature map computed by the backbone CNN. The whole framework is briefly illustrated in Fig. 1.

3.1 Resizable Tracking Model

Mimicking correlation filters [15], we use convolutional filters with the same shape as the object for both response generation and bounding box regression. This means that the number of model parameters may vary among different sequences, and even among different frames of the same video when the shape of the object changes. However, we aim to meta-train a fixed-size initialization of the tracking model that is adaptable to different videos and can be continuously updated for subsequent frames. To address this problem, we warp the convolutional filter of predefined shape to a specific size using bilinear interpolation as in [33] before convolving it with the feature map, and are thus able to keep optimizing the fixed-shape model recurrently during subsequent frames. Specifically, the tracking model $\theta$ contains two parts, i.e. a correlation filter $\theta_{cf}$ and a bounding box regression filter $\theta_{reg}$. Both are warped to a specific size to adapt to the shape variation of the target:

$$\tilde{\theta}_{cf} = \mathcal{W}(\theta_{cf}, \phi), \qquad \tilde{\theta}_{reg} = \mathcal{W}(\theta_{reg}, \phi),$$

where $\mathcal{W}$ means resizing the convolutional filter to the specific filter size $\phi$ using bilinear interpolation. The filter size $\phi$ is computed from the width and height of the object on the image patch at time step $t$, a scale factor that enlarges the filter to cover some context information, and the stride of the feature map. Thanks to the resizable filters, there is no need to enumerate different aspect ratios and scales of anchor boxes when doing bounding box regression. We only use one sized anchor at each spatial location, whose size corresponds to the size of the regression filter.

This saves regression filter parameters and achieves faster speed. Note that we update the filter size and its corresponding anchor box at every model-update interval, i.e. just before every model update, during both the offline training and the testing/tracking phases. Through this modification, we are able to initialize the tracking model with a fixed-shape initialization and recurrently optimize it in subsequent frames without worrying about shape changes of the object.
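The resizing step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the helper names `filter_size` and `resize_bilinear`, the rounding of the filter size up to an odd number, the corner-aligned sampling, and the example values of the scale factor and stride are all hypothetical.

```python
import math


def filter_size(obj_w, obj_h, scale=1.5, stride=4):
    """Hypothetical filter size (rows, cols) in feature-map cells.

    Assumption: enlarge the object by `scale`, divide by the feature
    stride, round up, and force an odd size so the filter has a center.
    """
    def odd(value):
        n = math.ceil(value)
        return n if n % 2 == 1 else n + 1

    return odd(scale * obj_h / stride), odd(scale * obj_w / stride)


def resize_bilinear(kernel, out_h, out_w):
    """Resize a 2D kernel (list of rows) with bilinear interpolation.

    Corners of the input are mapped onto corners of the output
    (corner-aligned sampling, an assumption of this sketch).
    """
    in_h, in_w = len(kernel), len(kernel[0])
    out = []
    for i in range(out_h):
        # Fractional source row in [0, in_h - 1].
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(math.floor(y))
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            # Fractional source column in [0, in_w - 1].
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(math.floor(x))
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            top = (1 - fx) * kernel[y0][x0] + fx * kernel[y0][x1]
            bottom = (1 - fx) * kernel[y1][x0] + fx * kernel[y1][x1]
            row.append((1 - fy) * top + fy * bottom)
        out.append(row)
    return out


base_filter = [[0.0, 1.0], [2.0, 3.0]]       # fixed-shape filter (toy 2x2)
rows, cols = filter_size(obj_w=24, obj_h=8)  # wide, flat object
warped = resize_bilinear(base_filter, rows, cols)
```

The same fixed-shape parameters are kept and optimized throughout; only the warped copy changes shape with the object, which is what allows a single anchor per location.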

3.2 Recurrent Model Optimization

Traditional optimization methods suffer from slow convergence due to small learning rates and limited training samples, while simply increasing the learning rate risks the training loss wandering wildly. Instead, we design a recurrent neural optimizer, which is trained to converge the model to a good solution in a few gradient steps (we only use one gradient step in our experiment, while considering multiple steps is straightforward), to update the tracking model. Our key idea is based on the assumption that the best optimizer should be able to update the model so that it generalizes well to future frames. During the offline training stage, we perform a one-step gradient update on the tracking model using our recurrent neural optimizer and then minimize its loss on the next few frames. Once the offline learning stage is finished, we use this learned neural optimizer to recurrently update the tracking model to adapt to appearance variations. An optimizer trained in this way is able to quickly converge the model update so that it generalizes well for tracking in future frames.

We denote the response generation network as $\mathcal{F}(x; \theta_{cf})$ and the bounding box regression network as $\mathcal{R}(x; \theta_{reg})$, where $x$ is the feature map input and $\theta = [\theta_{cf}, \theta_{reg}]$ are the parameters. The tracking loss consists of two parts, a response loss and a regression loss:

$$L(x, M, B; \theta) = \lVert \mathcal{F}(x; \theta_{cf}) - M \rVert^2 + \mathrm{smooth}_{L_1}\big(\mathcal{R}(x; \theta_{reg}) - B\big), \tag{7}$$

where the first part is an L2 loss and the second part is a smooth L1 loss [36]. $B$ represents the ground-truth box. Note that we adopt the parameterization of bounding box coordinates used in [36]. $M$ is the corresponding label map, which is built using a 2D Gaussian function with a parameter that controls the shape of the response map. A typical tracking process updates the tracking model using historical training examples and then tests this updated model on the following frames until the next update. We simulate this scenario in a meta-learning paradigm by recurrently optimizing the response regression network, and then testing it on a future frame.
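As a rough illustration of the two loss terms, the sketch below builds a Gaussian label map and evaluates an L2 response loss plus a smooth L1 regression loss on plain Python lists. The exact Gaussian form, the value of `alpha`, the unit threshold in the smooth L1 term, and the absence of any weighting between the two terms are assumptions of this sketch, not details confirmed by the text.

```python
import math


def gaussian_label_map(height, width, alpha=0.5):
    """Label map peaking at the map center (assumed Gaussian form)."""
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    return [[math.exp(-alpha * ((r - cy) ** 2 + (c - cx) ** 2))
             for c in range(width)] for r in range(height)]


def l2_loss(pred_map, label_map):
    """Sum of squared differences between two 2D maps."""
    return sum((p - t) ** 2
               for p_row, t_row in zip(pred_map, label_map)
               for p, t in zip(p_row, t_row))


def smooth_l1(pred, target):
    """Smooth L1: quadratic below a unit threshold, linear above it."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def tracking_loss(response, label_map, box_pred, box_gt):
    """Response loss plus regression loss (unweighted sum, an assumption)."""
    return l2_loss(response, label_map) + smooth_l1(box_pred, box_gt)


label = gaussian_label_map(3, 3)
# A perfect response map leaves only the regression term.
loss_value = tracking_loss(label, label, box_pred=[0.5, 2.0], box_gt=[0.0, 0.0])
```

Here the regression term contributes 0.5 * 0.5^2 for the small error and 2.0 - 0.5 for the large one, which shows how the smooth L1 loss limits the influence of large box errors.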

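Specifically, the tracking network is updated by

$$\theta^{(t)} = \theta^{(t-1)} - \lambda^{(t-1)} \odot \nabla_{\theta^{(t-1)}} \ell^{(t-1)}, \tag{9}$$

where $\lambda^{(t-1)}$ is a fully element-wise learning rate that has the same dimension as the regression network parameters $\theta^{(t-1)}$, and $\odot$ is element-wise multiplication. The learning rate is recurrently generated as

$$\lambda^{(t-1)} = \sigma\big(\mathcal{O}(I^{(t-1)}; \omega)\big), \tag{11}$$

where $\mathcal{O}(\cdot; \omega)$ is a coordinate-wise LSTM [1] parameterized by $\omega$, which shares its parameters across all dimensions of the input, and $\sigma$ is the sigmoid function that bounds the predicted learning rate. The LSTM input $I^{(t-1)}$ is an element-wise stack of the previous learning rate $\lambda^{(t-2)}$, the current parameters $\theta^{(t-1)}$, the current update loss $\ell^{(t-1)}$ and its gradient $\nabla_{\theta^{(t-1)}} \ell^{(t-1)}$, along a new axis. (Footnote 2: we therefore get a $|\theta| \times 4$ matrix, where $|\theta|$ is the number of parameters in $\theta$. Note that the current update loss $\ell^{(t-1)}$ is broadcast to have a shape compatible with the other vectors. To better understand this process, we can treat the input of the LSTM as a mini-batch of vectors, where $|\theta|$ is the batch size and 4 is the dimension of the input vector.) The current update loss $\ell^{(t-1)}$ is computed from a mini-batch of $n$ updating samples,

$$\ell^{(t-1)} = \frac{1}{n} \sum_{j=1}^{n} L(x_j, M_j, B_j; \theta^{(t-1)}),$$

where $L$ is the tracking loss and the updating samples $(x_j, M_j, B_j)$ are collected from the previous $\tau$ frames, where $\tau$ is the frame interval between model updates during online tracking. Finally, we test the newly updated model $\theta^{(t)}$ on a randomly selected future frame (we found that using more than one future frame does not improve performance but costs more time during the offline training phase) and obtain the meta loss,

$$\mathcal{L}^{(t)} = L\big(x^{(t)}_{\delta}, M^{(t)}_{\delta}, B^{(t)}_{\delta}; \theta^{(t)}\big), \tag{13}$$

where $(x^{(t)}_{\delta}, M^{(t)}_{\delta}, B^{(t)}_{\delta})$ is the sample from the future frame and the offset $\delta$ is randomly selected within the interval before the next model update.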
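The update in (9) and the construction of the optimizer input can be sketched on plain Python lists. The function `toy_optimizer` below is a hypothetical stand-in for the learned coordinate-wise LSTM: it only demonstrates that the same function is applied to every coordinate and that a sigmoid keeps each learning rate in (0, 1). All names and numeric values are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def build_optimizer_input(prev_lr, params, loss, grad):
    """Stack [previous lr, parameter, loss, gradient] for each coordinate.

    The scalar loss is broadcast, so the result has one 4-vector per
    parameter, i.e. a (number of parameters) x 4 matrix.
    """
    return [[l, p, loss, g] for l, p, g in zip(prev_lr, params, grad)]


def toy_optimizer(features):
    """Hypothetical stand-in for the learned coordinate-wise LSTM.

    It shares one rule across coordinates and bounds the output with a
    sigmoid; a trained optimizer would replace this rule.
    """
    return [sigmoid(-abs(g)) for _, _, _, g in features]


def update_step(params, lr, grad):
    """theta <- theta - lr * grad, element-wise."""
    return [p - l * g for p, l, g in zip(params, lr, grad)]


theta = [0.5, -1.0, 2.0]
grad = [0.2, -0.4, 0.0]
prev_lr = [0.1, 0.1, 0.1]

features = build_optimizer_input(prev_lr, theta, loss=0.3, grad=grad)
lr = toy_optimizer(features)
theta_new = update_step(theta, lr, grad)
```

Because the learning rate is a vector with one entry per parameter, each coordinate can take a step of a different size within the same update.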

Algorithm 1 Offline training of our framework
Input: $p(V)$: distribution over training videos.
Output: $\theta^{(0)}, \lambda^{(0)}$: initial tracking model and learning rates. $\omega$: recurrent neural optimizer.
Initialize all network parameters.
while not done do
    Draw a mini-batch of videos: $V_i \sim p(V)$
    for all $V_i$ do
        Compute $\theta^{(1)}$ in (9).
        Compute meta loss $\mathcal{L}^{(1)}$ using (13).
        for each subsequent model update $t$ do
            Compute adaptive learning rate $\lambda^{(t)}$ using neural optimizer $\mathcal{O}$ as in (11).
            Compute updated model $\theta^{(t+1)}$ using (9).
            Compute meta loss $\mathcal{L}^{(t+1)}$ using (13).
        end for
    end for
    Compute averaged meta loss $\bar{\mathcal{L}}$ using (14).
    Update $\theta^{(0)}, \lambda^{(0)}, \omega$ by computing gradient of $\bar{\mathcal{L}}$.
end while

During the offline training stage, we perform the aforementioned procedure on a mini-batch of videos and get an averaged meta loss for optimizing the neural optimizer,

$$\bar{\mathcal{L}} = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \mathcal{L}^{(t)}_{V_i}, \tag{14}$$

where $N$ is the batch size, $T$ is the number of model updates, and $V_i$ is a video clip sampled from the training episodes. It should be noted that the initial regression parameters $\theta^{(0)}$ and the initial learning rate $\lambda^{(0)}$ are also trainable variables, which are jointly learned with the neural optimizer $\mathcal{O}$. By minimizing the averaged meta loss $\bar{\mathcal{L}}$, we aim to train a neural optimizer that can update the tracking model to generalize well on subsequent frames, as well as to learn a beneficial initialization of the tracking model, which is broadly applicable to different tasks (i.e. videos). The overall training process is detailed in Algorithm 1.
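To make the meta-objective concrete, the toy sketch below trains a single scalar initial learning rate so that one update on a "current" sample gives a low loss on a "future" sample, averaged over a mini-batch of two synthetic clips. Everything here is a simplifying assumption: the model is a scalar linear model with an L2 loss, the clips are made-up numbers, only the learning rate is trained (the framework described above also learns the initial model and the optimizer), and the meta-gradient is taken by central finite differences rather than backpropagation.

```python
def update_loss_grad(theta, x, y):
    """Gradient of the toy update loss (x * theta - y)^2 w.r.t. theta."""
    return 2.0 * x * (x * theta - y)


def meta_loss(lam, theta0, clip):
    """One gradient step on the current sample, then loss on the future one."""
    (x, y), (x_future, y_future) = clip
    theta1 = theta0 - lam * update_loss_grad(theta0, x, y)
    return (x_future * theta1 - y_future) ** 2


def averaged_meta_loss(lam, theta0, batch):
    return sum(meta_loss(lam, theta0, clip) for clip in batch) / len(batch)


# Each clip: ((current input, current target), (future input, future target)).
batch = [((2.0, 4.0), (2.0, 4.2)),
         ((2.0, -2.0), (2.0, -1.8))]

theta0 = 0.0      # fixed initial model in this sketch
lam = 0.01        # trainable initial learning rate
meta_lr = 1e-3    # step size of the outer (meta) optimization
h = 1e-6          # finite-difference half-width

initial_loss = averaged_meta_loss(lam, theta0, batch)
for _ in range(200):
    meta_grad = (averaged_meta_loss(lam + h, theta0, batch)
                 - averaged_meta_loss(lam - h, theta0, batch)) / (2.0 * h)
    lam -= meta_lr * meta_grad
final_loss = averaged_meta_loss(lam, theta0, batch)
```

For these two clips the averaged meta loss is a quadratic in the learning rate with its minimum at 0.1275, so the outer loop settles there; the learned value trades off the two clips rather than fitting either one exactly.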

Figure 2: Computation graph of the recurrent neural optimizer. Dashed lines are ignored during backpropagation of the gradient.

3.3 Gradient Flow

Note that our recurrent model optimization process involves a gradient update step (see Eq. 9), so minimizing the meta loss may require computing second derivatives. Specifically, we show the computation graph of the offline training process of our framework in Fig. 2, where the gradients flow backwards along the solid lines and ignore the dashed lines. For the initial frame (Fig. 2 left), we aim to learn a conducive tracking network $\theta^{(0)}$ and learning rates $\lambda^{(0)}$ that can quickly adapt to different videos with one gradient step, which is closely correlated with the tracking model $\theta$. Therefore, we back-propagate through both the gradient update step and the gradient of the update loss (i.e. computing the second derivative). For the subsequent frames, our goal is to learn the online neural optimizer $\mathcal{O}$, which is independent of the tracking model being optimized. Hence, we choose to ignore the gradient paths along the update loss gradient $\nabla_{\theta} \ell$ and the neural optimizer inputs $I$, which are composed of historical context vectors computed from the tracking model. The effect of this simplification is to focus the training more on improving the neural optimizer to reduce the loss, rather than on secondary items such as the LSTM input or the loss gradient.

3.4 Random Filter Scaling

The neural optimizer has difficulty generalizing to new tasks due to overfitting, as discussed in [27, 44, 1]. By analyzing the learned behavior of the neural optimizer, we found that our preliminarily trained optimizer predicts similar learning rates (see Fig. 3: Left). We attribute this to overfitting to network inputs of a specific scale.

Let us first use a simple example to illustrate the overfitting problem. Suppose the objective function that the neural optimizer minimizes is $g(\theta) = \lVert x\theta - y \rVert^2$. (The loss function we use for optimization, see Eq. 7, includes an L2 loss and a smooth L1 loss; here, we consider a simple linear model with an L2 loss as the objective function for simplicity.) Obviously, the optimal learning rate is $\frac{1}{2x^2}$, since we can achieve the lowest loss in one gradient descent step, $\theta - \frac{1}{2x^2} \nabla g(\theta) = \frac{y}{x}$. If the learned neural optimizer perfectly follows this rule, it will not generalize well to differently scaled inputs, e.g. $kx$, causing overfitting. To address this problem, we multiply the tracking model $\theta$ with a randomly sampled vector $\epsilon$, which has the same dimension as $\theta$:

$$\theta_{\epsilon} = \epsilon \odot \theta,$$

where the entries of $\epsilon$ are sampled using a uniform distribution $\mathcal{U}$, and a range factor $\kappa$ controls the scale extent. Then, the objective function becomes $g_{\epsilon}(\theta) = \lVert \frac{x}{\epsilon} \theta_{\epsilon} - y \rVert^2$. In this way, the network input is indirectly scaled without modifying the training samples in practice. Thus, the learned neural optimizer is forced to predict adaptive learning rates (see Fig. 3: Right) for differently scaled inputs, rather than to produce similar learning rates for similarly scaled inputs, which improves its generalization ability.
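The scalar example can be checked numerically. The sketch below verifies that the learning rate 1/(2x^2) reaches the optimum in one step, that the same rate fails once the input is scaled by k, and that scaling the filter by a random factor leaves the objective value unchanged while changing the optimal learning rate. The log-uniform sampling of the scale factor and the value of `kappa` are assumptions of this sketch, as are all variable names.

```python
import math
import random


def loss(theta, x, y):
    return (x * theta - y) ** 2


def one_step(theta, x, y, lr):
    """One gradient descent step on (x * theta - y)^2."""
    grad = 2.0 * x * (x * theta - y)
    return theta - lr * grad


x, y, theta = 2.0, 6.0, 0.5

# The learning rate 1 / (2 x^2) reaches the minimizer y / x in one step.
lr_opt = 1.0 / (2.0 * x * x)
theta_star = one_step(theta, x, y, lr_opt)

# Reusing that learning rate on an input scaled by k overshoots badly.
k = 3.0
theta_bad = one_step(theta, k * x, y, lr_opt)

# Random Filter Scaling on a scalar filter (assumed log-uniform sampling).
kappa = 1.6
rng = random.Random(0)
eps = math.exp(rng.uniform(-kappa, kappa))
theta_eps = eps * theta          # scaled filter
x_eff = x / eps                  # input is scaled indirectly
lr_eps = 1.0 / (2.0 * x_eff * x_eff)
theta_eps_star = one_step(theta_eps, x_eff, y, lr_eps)
```

Since the optimal rate for the scaled filter is eps^2 / (2 x^2), an optimizer trained with random scales cannot succeed by memorizing one learning rate and has to react to the scale of its input.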

Figure 3: Histogram of predicted learning rates during offline training

4 Online Tracking via Proposed Framework

We use the offline-trained neural optimizer, the initial regression model $\theta^{(0)}$ and the initial learning rate $\lambda^{(0)}$ to perform online tracking. The overall process is similar to offline training, except that we do not compute the meta loss or its gradient.

Model Initialization. Given the first frame, we first crop an image patch centered on the provided bounding box and compute its feature map. Then we use this example, together with the offline-learned $\theta^{(0)}$ and $\lambda^{(0)}$, to perform a one-step gradient update that builds an initial model as in (9).

Bounding Box Estimation. We estimate the object bounding box by first finding the presence of the target through response generation and then predicting an accurate box by bounding box regression. We employ the penalty strategy used in [22] on the generated response to suppress anchors with large changes in scale and aspect ratio. In addition, we multiply the response map by a Gaussian-like motion map to suppress large movements. The bounding box computed from the anchor that corresponds to the maximum score of the response map is the final prediction. To smooth the results, we linearly interpolate this estimated object size with the previous one. We also design a variant of our tracker that uses multi-scale search to estimate the object box, which we term ROAM. We name the version of ROAM with bounding box regression ROAM++.
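The motion map and the size smoothing can be illustrated with a small sketch. The Gaussian form of the window, its `sigma`, the interpolation rate and the toy response values are assumptions chosen for illustration; the scale and aspect-ratio penalty from [22] is not included.

```python
import math


def motion_window(height, width, sigma):
    """Gaussian-like window centered on the previous target position."""
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    return [[math.exp(-((r - cy) ** 2 + (c - cx) ** 2) / (2.0 * sigma ** 2))
             for c in range(width)] for r in range(height)]


def apply_window(response, window):
    """Element-wise product of the response map and the motion window."""
    return [[v * w for v, w in zip(r_row, w_row)]
            for r_row, w_row in zip(response, window)]


def argmax2d(score_map):
    """Row and column of the highest score."""
    _, row, col = max((v, r, c)
                      for r, line in enumerate(score_map)
                      for c, v in enumerate(line))
    return row, col


def smooth_size(prev_size, new_size, rate):
    """Linear interpolation between the previous and the new object size."""
    return tuple((1.0 - rate) * p + rate * n
                 for p, n in zip(prev_size, new_size))


# A distractor in the corner scores slightly higher than the true target.
response = [[0.9, 0.1, 0.1],
            [0.1, 0.8, 0.1],
            [0.1, 0.1, 0.1]]
weighted = apply_window(response, motion_window(3, 3, sigma=1.0))

peak_raw = argmax2d(response)   # picks the distant distractor
peak = argmax2d(weighted)       # picks the position close to the center
size = smooth_size(prev_size=(40.0, 20.0), new_size=(50.0, 30.0), rate=0.3)
```

In this toy case the window down-weights the corner response enough that the peak moves back to the center, which is the intended effect of suppressing large movements.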

Model Updating. We update the model every $\tau$ frames, where $\tau$ is the updating interval. Even though offline training uses the previous $\tau$ frames to perform a one-step gradient update of the model, in practice we find that using more than one step can further improve performance during tracking (see Sec. 6.2). Therefore, we adopt a two-step gradient update using samples from previous frames in our experiments.

5 Implementation Details

Patch Cropping. Given a bounding box of the object, the ROI of the image patch has the same center and takes a larger size, enlarged by an ROI scale factor, to cover some background. Then, the ROI is resized to a fixed size for batch processing, where the fixed size is computed based on a predefined object size, which we denote as the Base Object Size (BOS).
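One way such a cropping rule could look is sketched below. The concrete formulas are assumptions, since the exact expressions are not recoverable here: the ROI is taken to be a square whose side is the scale factor times the geometric mean of the object's width and height, and the fixed patch side is taken to be the scale factor times the BOS. The names `crop_roi` and `patch_side` and all numeric values are hypothetical.

```python
import math


def crop_roi(box, gamma):
    """Square ROI (left, top, side, side) sharing the object's center.

    Assumption: side = gamma * sqrt(w * h).
    """
    cx, cy, w, h = box
    side = gamma * math.sqrt(w * h)
    return (cx - side / 2.0, cy - side / 2.0, side, side)


def patch_side(bos, gamma):
    """Fixed side of the resized patch (assumption: gamma * BOS)."""
    return int(round(gamma * bos))


box = (100.0, 80.0, 40.0, 10.0)           # center x, center y, width, height
roi = crop_roi(box, gamma=5.0)
side = patch_side(bos=40, gamma=5.0)
resize_ratio = side / roi[2]              # factor applied to the cropped ROI
object_size_after = resize_ratio * math.sqrt(box[2] * box[3])
```

Under these assumptions every object ends up with roughly the same size in the resized patch, which is what makes fixed-size batches possible.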

Network Structure. We use the first 12 convolution layers of the pretrained VGG-16 [39] as the feature extractor. The top max-pooling layers are removed to increase the spatial resolution of the feature map. Both the response generation network and the bounding box regression network consist of two convolutional layers: a dimension reduction layer as the first layer, followed by a correlation layer and a regression layer, respectively, as the second layer. We use two stacked LSTM layers with 20 hidden units for the neural optimizer $\mathcal{O}$.

Hyper Parameters. We set the updating interval , and the batch size of updating samples . The ROI scale factor is and the BOS is . The response generation uses and the feature stride of the CNN feature extractor is . The scale and aspect ratio factors used for initial image patch augmentation is .

Training Details. We use ADAM [17] optimization with a mini-batch of 16 video clips of length 31 on 4 GPUs (4 videos per GPU) to train our framework. The datasets we use for training include the video datasets ImageNet VID [20], TrackingNet [29], LaSOT [9] and GOT10k [16], and the image datasets ImageNet DET [20] and COCO [24]. During training, we randomly extract a continuous sequence clip from the video datasets, and repeat the same still image to form a video clip for the image datasets. Note that we randomly augment all frames in a training clip by slightly stretching and scaling the images. We use a learning rate of 1e-6 for the initial regression parameters $\theta^{(0)}$ and the initial learning rate $\lambda^{(0)}$. For the recurrent neural optimizer $\mathcal{O}$, we use a learning rate of 1e-3. Both learning rates are multiplied by 0.5 every 5 epochs. We implement our tracker in Python with the PyTorch toolbox [34], and conduct the experiments on a computer with an NVIDIA Tesla P40 GPU and an Intel(R) Xeon(R) E5-2630 v4 @ 2.20GHz CPU.
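The stated decay, multiplying by 0.5 every 5 epochs, corresponds to a step schedule. The helper below is a small sketch with a hypothetical name, `stepped_lr`, and it assumes epochs are counted from zero.

```python
def stepped_lr(base_lr, epoch, factor=0.5, step=5):
    """Learning rate after decaying by `factor` every `step` epochs."""
    return base_lr * factor ** (epoch // step)


optimizer_lr = [stepped_lr(1e-3, e) for e in (0, 4, 5, 12)]
init_lr = [stepped_lr(1e-6, e) for e in (0, 4, 5, 12)]
```

In PyTorch, a schedule of this shape is available as `torch.optim.lr_scheduler.StepLR` with `step_size=5` and `gamma=0.5`; whether the authors used that class is not stated.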

6 Experiments

We evaluate our trackers on six benchmarks: OTB-2015 [45], VOT-2016 [19], VOT-2017 [18], LaSOT [9], GOT-10k [16] and TrackingNet [29].

6.1 Comparison Results with State-of-the-art

We compare our ROAM and ROAM++ with recent response regression based trackers, including MetaTracker [33], DSLT [25], MemTrack [47], CREST [40], SiamFC [4], CCOT [8], ECO [7] and Staple [3], as well as recent state-of-the-art trackers, including SiamRPN [22], DaSiamRPN [50], SiamRPN+ [49] and C-RPN [10], on both the OTB and VOT datasets. For the methods using SGD updates, the number of SGD steps follows their implementations.

Figure 4: Precision and success plot on OTB-2015.
                      VOT-2016                      VOT-2017
Tracker        EAO(↑)   A(↑)     R(↓)       EAO(↑)   A(↑)     R(↓)
ROAM++         0.441¹   0.599²   0.174¹     0.380¹   0.543²   0.195¹
ROAM           0.384³   0.556    0.183²     0.331²   0.505    0.226²
MetaTracker    0.317    0.519    -          -        -        -
DaSiamRPN      0.411²   0.61¹    0.22³      0.326³   0.56¹    0.34
SiamRPN+       0.37     0.58     0.24       0.30     0.52³    0.41
C-RPN          0.363    0.594³   -          0.289    -        -
SiamRPN        0.344    0.56     0.26       0.244    0.49     0.46
ECO            0.375    0.55     0.20       0.280    0.48     0.27³
DSLT           0.343    0.545    0.219      -        -        -
CCOT           0.331    0.54     0.24       0.267    0.49     0.32
Staple         0.295    0.54     0.38       0.169    0.52     0.69
CREST          0.283    0.51     0.25       -        -        -
MemTrack       0.272    0.531    0.373      0.248    0.524    0.357
SiamFC         0.235    0.532    0.461      0.188    0.502    0.585

Table 1: Results on VOT-2016 and VOT-2017. The evaluation metrics include expected average overlap (EAO), accuracy value (A) and robustness value (R). The top three performing trackers are marked with ¹, ² and ³ (red, green and blue), respectively.

OTB. Fig. 4 presents the experimental results on the OTB-2015 dataset, which contains 100 sequences with 11 annotated video attributes. Both our ROAM and ROAM++ achieve an AUC similar to that of the top performer ECO and outperform all other trackers. Specifically, our ROAM and ROAM++ surpass MetaTracker [33], the baseline among meta-learning trackers, which uses a traditional optimization method for model updating, by 6.7% and 6.9% on the success plot respectively, demonstrating the effectiveness of the proposed recurrent model optimization algorithm and resizable bounding box regression.

VOT. Table 1 shows the comparison on the VOT-2016 and VOT-2017 datasets. Our ROAM++ achieves the best EAO on both VOT-2016 and VOT-2017. In particular, both our ROAM++ and ROAM show superior robustness compared with RPN-based trackers, which have no model updating, demonstrating the effectiveness of our recurrent model optimization scheme. In addition, our ROAM++ and ROAM outperform the baseline MetaTracker [33] by 39.2% and 21.1% on the EAO of VOT-2016, respectively.

LaSOT. LaSOT [9] is a recently proposed large-scale tracking dataset. We compare our ROAM with the top 10 performing trackers of the benchmark, including MDNet [32], VITAL [41], SiamFC [4], StructSiam [48], DSiam [13], SINT [13], ECO [7], STRCF [23], ECO_HC [7] and CFNet [43], on the testing split, which consists of 280 videos. Fig. 5 presents the precision plot and success plot on the LaSOT test set. Our ROAM++ achieves the best result compared with state-of-the-art trackers on the benchmark, outperforming the second best, MDNet, with an improvement of 19.3% and 12.6% on the precision plot and success plot respectively.

Figure 5: Precision and success plot on LaSOT test dataset

GOT-10k. GOT-10k [16] is a recently proposed large, highly diverse dataset for object tracking. There is no overlap in object categories between the training split and the test split, which follows the one-shot learning setting [11]. Therefore, using external training data is strictly forbidden when testing trackers on their online server. Following their protocol, we train our ROAM using only the training split of this dataset. Table 2 shows the detailed comparison on the GOT-10k test dataset. Both our ROAM++ and ROAM surpass other trackers by a large margin. In particular, our ROAM++ obtains an AO of 0.465, an SR0.5 of 0.532 and an SR0.75 of 0.236, outperforming SiamFCv2 with an improvement of 24.3%, 31.7% and 63.9% respectively.

AO(↑)       0.299   0.315   0.316   0.325   0.342   0.348   0.374³   0.436²   0.465¹
SR0.5(↑)    0.303   0.297   0.309   0.328   0.372   0.353   0.404³   0.466²   0.532¹
SR0.75(↑)   0.099   0.088   0.111   0.107   0.124   0.098   0.144³   0.164²   0.236¹
Table 2: Results on GOT-10k. The evaluation metrics include average overlap (AO), success rate at 0.5 overlap threshold (SR0.5) and success rate at 0.75 overlap threshold (SR0.75). The top three performing trackers are marked with ¹, ² and ³ (red, green and blue), respectively.
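The relative improvements quoted for GOT-10k follow from the table values as (ours - baseline) / baseline. The short check below recomputes them from the ROAM++ and SiamFCv2 numbers given in the text; the helper name `relative_gain` is only illustrative.

```python
def relative_gain(ours, baseline):
    """Relative improvement over a baseline, in percent."""
    return 100.0 * (ours - baseline) / baseline


# (ROAM++, SiamFCv2) pairs taken from the GOT-10k results.
got10k = {
    "AO": (0.465, 0.374),
    "SR0.5": (0.532, 0.404),
    "SR0.75": (0.236, 0.144),
}
gains = {name: round(relative_gain(ours, base), 1)
         for name, (ours, base) in got10k.items()}
```

The largest relative gain is on the stricter 0.75 overlap threshold, where the baseline value is small.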

TrackingNet. TrackingNet [29] provides more than 30K videos with around 14M dense bounding box annotations by filtering shorter video clips from Youtube-BB [5]. Note that we only use the training split of this dataset to train our tracker. Table 3 presents the detailed comparison on the TrackingNet test dataset. Our ROAM++ surpasses other state-of-the-art tracking algorithms on all three evaluation metrics. In detail, our ROAM++ obtains an improvement of 10.6%, 10.3% and 6.9% on AUC, precision and normalized precision respectively, compared with the top performing tracker on the benchmark, MDNet.

AUC(↑)           0.528   0.534   0.541   0.554   0.571   0.578   0.606³   0.620²   0.670¹
Prec.(↑)         0.470   0.480   0.478   0.492   0.533   0.533   0.565²   0.547³   0.623¹
Norm. Prec.(↑)   0.603   0.622   0.608   0.618   0.663   0.654   0.705²   0.695³   0.754¹
Table 3: Results on TrackingNet. The evaluation metrics include area under curve (AUC) of the success plot, precision and normalized precision. The top three performing trackers are marked with ¹, ² and ³ (red, green and blue), respectively.

6.2 Ablation Study

To analyze the proposed tracking algorithms in depth, we study our trackers from various aspects. Note that all these ablations are trained only on the ImageNet VID dataset for simplicity.

Figure 6: Ablation Study on OTB-2015 with different variants of ROAM.

Impact of Different Modules. To verify the effectiveness of different modules, we design four variants of our framework: 1) SGD: replacing the recurrent neural optimizer with traditional SGD for model updating (using the same number of gradient steps as ROAM); 2) ROAM-w/o RFS: training a recurrent neural optimizer without RFS; 3) SGD-Oracle: using the ground-truth bounding boxes to build updating samples for SGD during the testing phase; 4) ROAM-Oracle: using the ground-truth bounding boxes to build updating samples for ROAM during the testing phase. The results are presented in Fig. 6 (left). ROAM gains about 6% in AUC compared with the baseline SGD, demonstrating the effectiveness of our recurrent model optimization method for model updating. Without RFS during offline training, the tracking performance drops substantially due to overfitting. ROAM-Oracle performs better than SGD-Oracle, showing that our offline-learned neural optimizer is more effective than the traditional SGD method when using the same updating samples. In addition, the two oracles (SGD-Oracle and ROAM-Oracle) achieve higher AUC scores than their normal versions, indicating that the tracking accuracy could be boosted by improving the quality of the updating samples.

Architecture of Neural Optimizer. To investigate more architectures of the neural optimizer, we present three variants of our method: 1) ROAM-GRU: using two stacked Gated Recurrent Units (GRU) [6] as our neural optimizer; 2) ROAM-FC: using two linear fully-connected layers followed by a tanh activation function as the neural optimizer; 3) ROAM-ConstLR: using a learned constant element-wise learning rate for model optimization instead of the adaptively generated one. Fig. 6 (right) presents the results. ROAM-GRU decreases the AUC a little, while ROAM-FC gets a much lower AUC than ROAM, showing the importance of our recurrent structure. Moreover, the performance drop of ROAM-ConstLR verifies the necessity of using an adaptable learning rate for model updating.
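
The difference between a scalar learning rate (SGD), a constant element-wise learning rate (as in ROAM-ConstLR) and an element-wise learning rate regenerated at every update can be sketched as follows. This is only an illustration: `toy_adaptive_lrs` is a hypothetical, hand-written stand-in and not the learned recurrent optimizer, and all numbers are made up.

```python
import math


def sgd_step(theta, grad, lr):
    """Plain SGD: one scalar learning rate shared by all parameters."""
    return [t - lr * g for t, g in zip(theta, grad)]


def elementwise_step(theta, grad, lrs):
    """Element-wise update: each parameter is scaled by its own learning rate."""
    return [t - lr_i * g for t, g, lr_i in zip(theta, grad, lrs)]


def toy_adaptive_lrs(grad, max_lr=0.5):
    """Hypothetical rule (NOT the paper's optimizer): squash |g| through a
    sigmoid so each rate lies between max_lr / 2 and max_lr and grows with
    the gradient magnitude."""
    return [max_lr / (1.0 + math.exp(-abs(g))) for g in grad]


theta = [1.0, -2.0]   # toy parameters
grad = [0.5, -4.0]    # toy gradient of the update loss

print(sgd_step(theta, grad, lr=0.1))                          # scalar rate
print(elementwise_step(theta, grad, [0.1, 0.01]))             # constant per-element rates
print(elementwise_step(theta, grad, toy_adaptive_lrs(grad)))  # rates recomputed from the gradient
```

In the constant variant the per-element rates stay fixed across all updates, whereas in the adaptive variant they are recomputed for every update, which is the property the ROAM-ConstLR ablation isolates.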

Figure 7: AUC vs. gradient steps on OTB-2015.

More Steps in Updating. During offline training, we only perform one-step gradient descent to optimize the updating loss. We investigate the effect of using more than one gradient step on tracking performance during the test phase for both ROAM and SGD (see Fig. 7). Our method can be further improved with more than one step, but its performance gradually decreases when too many steps are used. This is because our framework is not trained to perform so many steps during the offline stage. We also use two fixed learning rates for SGD, where the larger one is 7e-6 (the learning rate used by MetaTracker) and the smaller one is 7e-7. With the larger learning rate, SGD reaches its best performance much faster than with the smaller one, while both have a similar best AUC. Our ROAM consistently outperforms SGD(7e-6), showing the superiority of adaptive element-wise learning rates. Furthermore, ROAM with 1-2 gradient steps outperforms SGD(7e-7) using a large number of steps, which shows the improved generalization of ROAM.
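
The trade-off between learning rate and number of gradient steps can be seen on a toy problem. The sketch below does not use the tracking loss: it assumes a one-dimensional quadratic with a hypothetical curvature of 1e5, chosen only so that the two learning rates from the text give clearly different contraction factors per step.

```python
def steps_to_converge(lr, curvature=1e5, x0=1.0, tol=1e-3, max_steps=10_000):
    """Count gradient steps on f(x) = 0.5 * curvature * x**2 until |x| < tol.

    Returns None if the tolerance is not reached within max_steps.
    """
    x = x0
    for step in range(1, max_steps + 1):
        x -= lr * curvature * x  # the gradient of f at x is curvature * x
        if abs(x) < tol:
            return step
    return None


fast = steps_to_converge(lr=7e-6)  # each step shrinks x by a factor of about 0.3
slow = steps_to_converge(lr=7e-7)  # each step shrinks x by a factor of about 0.93
print(fast, slow)
```

Both runs end at the same minimum, but on this toy objective the smaller learning rate needs over ten times as many steps, which mirrors the behaviour described for the two SGD settings.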

Figure 8: Comparison of update loss between ROAM and SGD using two gradient steps for each update.

Update Loss Comparison. To show the effectiveness of our neural optimizer in minimizing the update loss, we compare the update loss over time between ROAM and SGD after two gradient steps for a few videos of OTB-2015 in Fig. 8 (see supplementary for more examples). Under the same number of gradient updates, our neural optimizer obtains a lower loss than traditional SGD, demonstrating its faster convergence for model optimization.

Why does ROAM work? As discussed in [33], directly using the learned initial learning rate for model optimization in subsequent frames could lead to divergence. This is because the learning rates for model initialization are relatively larger than the ones needed for subsequent frames, which causes unstable model optimization. In particular, the initial regression model is trained offline to be broadly applicable to different videos, and therefore needs a relatively larger gradient step to adapt to a specific task, leading to a relatively larger initial learning rate. For the subsequent frames, the appearance variations can be sometimes small and sometimes large, and thus the model optimization process needs an adaptive learning rate to handle different situations. Fig. 9 presents the histograms of initial learning rates and updating learning rates on OTB-2015. Most of the updating learning rates are relatively small because usually there are only minor appearance variations between updates. As shown in Fig. 4, our ROAM, which performs model updating for subsequent frames with adaptive learning rates, obtains a substantial performance gain compared with MetaTracker [33], which uses traditional SGD with a constant learning rate for model updating.

Figure 9: Histogram of initial learning rates and updating learning rates on OTB-2015.

Impact of Training Data. For a fair comparison, we also train variants of our ROAM++ and ROAM with only the ImageNet VID training data, which are named VID-ROAM++ and VID-ROAM, respectively. Table 4 shows the comparison results on the VOT datasets. With only the VID dataset for training, our ROAM++ still outperforms our preliminary tracker ROAM. Moreover, both VID-ROAM++ and VID-ROAM outperform the baseline MetaTracker [33] by a large margin.

VOT-2016 VOT-2017
ROAM++ 0.4414 0.5987 0.1743 0.3803 0.5433 0.1945
VID-ROAM++ 0.4147 0.5886 0.1749 0.3330 0.5513 0.2366
ROAM 0.3842 0.5558 0.1833 0.3313 0.5048 0.2263
VID-ROAM 0.3568 0.5468 0.2091 0.2971 0.5098 0.2491
MetaTracker 0.317 0.519 - - - -
Table 4: Comparison results on VOT-2016 and VOT-2017 datasets with different training datasets

7 Conclusion

In this paper, we propose a novel tracking model consisting of a resizable response generator and a bounding box regressor, where the first module generates a score map to indicate the presence of the target at different spatial locations and the second module regresses bounding box coordinate shifts relative to the anchors densely mounted on sliding window positions. To effectively update the tracking model for appearance adaptation, we propose a recurrent model optimization method within a meta-learning setting that converges the model in a few gradient steps. Instead of producing an increment for parameter updating, which is prone to overfitting due to different gradient scales, we recurrently generate adaptive learning rates for tracking model optimization using an LSTM. Moreover, we propose a simple yet effective training trick that randomly scales the convolutional filters used in the tracking model to prevent overfitting, which improves the performance significantly. Extensive experiments on the OTB, VOT, LaSOT, GOT-10k and TrackingNet datasets demonstrate superior performance compared with state-of-the-art tracking algorithms.


  • [1] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas. Learning to Learn by Gradient Descent by Gradient Descent. In NeurIPS, 2016.
  • [2] S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei. On the Optimization of a Synaptic Learning Rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, 1992.
  • [3] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. Torr. Staple: Complementary Learners for Real-Time Tracking. In CVPR, 2016.
  • [4] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr. Fully-Convolutional Siamese Networks for Object Tracking. In ECCV Workshop on VOT, 2016.
  • [5] E. Real, J. Shlens, S. Mazzocchi, X. Pan, and V. Vanhoucke. YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video. In CVPR, 2017.
  • [6] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning Phrase Representations using RNN Encoder-decoder for Statistical Machine Translation. arXiv preprint arXiv:1406.1078, 2014.
  • [7] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg. ECO: Efficient Convolution Operators for Tracking. In CVPR, 2017.
  • [8] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In ECCV, 2016.
  • [9] H. Fan, L. Lin, F. Yang, and P. Chu. LaSOT: A High-quality Benchmark for Large-scale Single Object Tracking. In CVPR, 2019.
  • [10] H. Fan and H. Ling. Siamese Cascaded Region Proposal Networks for Real-Time Visual Tracking. In CVPR, 2019.
  • [11] C. Finn, P. Abbeel, and S. Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In ICML, 2017.
  • [12] H. K. Galoogahi, T. Sim, and S. Lucey. Correlation Filters with Limited Boundaries. In CVPR, 2015.
  • [13] Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, and S. Wang. Learning Dynamic Siamese Network for Visual Object Tracking. In ICCV, 2017.
  • [14] D. Held, S. Thrun, and S. Savarese. Learning to Track at 100 FPS with Deep Regression Networks. In ECCV, 2016.
  • [15] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. TPAMI, 2015.
  • [16] L. Huang, X. Zhao, and K. Huang. GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild. arXiv, 2019.
  • [17] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [18] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Č. Zajc, T. Vojíř, G. Bhat, A. Lukežič, A. Eldesokey, G. Fernández, and Others. The sixth Visual Object Tracking VOT2018 challenge results. In ECCV Workshop on VOT, 2018.
  • [19] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Č. Zajc, T. Vojíř, G. Häger, A. Lukežič, A. Eldesokey, G. Fernández, and Others. The Visual Object Tracking VOT2017 Challenge Results. In ICCV Workshop on VOT, 2017.
  • [20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NeurIPS, 2012.
  • [21] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan. SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks. In CVPR, 2019.
  • [22] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu. High Performance Visual Tracking with Siamese Region Proposal Network. In CVPR, 2018.
  • [23] F. Li, C. Tian, W. Zuo, L. Zhang, and M.-H. Yang. Learning Spatial-Temporal Regularized Correlation Filters for Visual Tracking. In CVPR, pages 4904–4913, 2018.
  • [24] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, pages 740–755, 2014.
  • [25] X. Lu, C. Ma, B. Ni, X. Yang, I. Reid, and M.-h. Yang. Deep Regression Tracking with Shrinkage Loss. In ECCV, 2018.
  • [26] A. Lukežič, T. Vojíř, L. Čehovin, J. Matas, and M. Kristan. Discriminative Correlation Filter with Channel and Spatial Reliability. In CVPR, 2017.
  • [27] K. Lv, S. Jiang, and J. Li. Learning Gradient Descent: Better Generalization and Longer Horizons. In ICML, 2017.
  • [28] C. Ma, J.-b. Huang, X. Yang, and M.-h. Yang. Hierarchical Convolutional Features for Visual Tracking. In ICCV, 2015.
  • [29] M. Müller, A. Bibi, S. Giancola, S. Al-Subaihi, and B. Ghanem. TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild. In ECCV, 2018.
  • [30] T. Munkhdalai and H. Yu. Meta Networks. In ICML, 2017.
  • [31] D. K. Naik and R. Mammone. Meta-Neural Networks that Learn by Learning. In Neural Networks, 1992. IJCNN., International Joint Conference on, 1992.
  • [32] H. Nam and B. Han. Learning Multi-Domain Convolutional Neural Networks for Visual Tracking. In CVPR, 2016.
  • [33] E. Park and A. C. Berg. Meta-Tracker: Fast and Robust Online Adaptation for Visual Object Trackers. In ECCV, 2018.
  • [34] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic Differentiation in PyTorch. In NeurIPS Workshop, 2017.
  • [35] S. Ravi and H. Larochelle. Optimization As a Model for Few-Shot Learning. In ICLR, 2017.
  • [36] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NeurIPS, 2015.
  • [37] J. Schmidhuber. Evolutionary Principles in Self-referential Learning: On Learning how to Learn: the Meta-meta-meta…-hook. PhD thesis, 1987.
  • [38] J. Schulman, J. Ho, X. Chen, and P. Abbeel. Meta Learning Shared Hierarchies. arXiv, 2018.
  • [39] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-scale Image Recognition. In ICLR, 2015.
  • [40] Y. Song, C. Ma, L. Gong, J. Zhang, R. Lau, and M.-H. Yang. CREST: Convolutional Residual Learning for Visual Tracking. In ICCV, 2017.
  • [41] Y. Song, C. Ma, X. Wu, L. Gong, L. Bao, W. Zuo, C. Shen, R. Lau, and M.-H. Yang. VITAL: VIsual Tracking via Adversarial Learning. In CVPR, 2018.
  • [42] R. Tao, E. Gavves, and A. W. M. Smeulders. Siamese Instance Search for Tracking. In CVPR, 2016.
  • [43] J. Valmadre, L. Bertinetto, J. F. Henriques, A. Vedaldi, and P. H. S. Torr. End-to-end representation learning for Correlation Filter based tracking. In CVPR, 2017.
  • [44] O. Wichrowska, N. Maheswaranathan, M. W. Hoffman, S. G. Colmenarejo, M. Denil, N. de Freitas, and J. Sohl-Dickstein. Learned Optimizers that Scale and Generalize. In ICML, 2017.
  • [45] Y. Wu, J. Lim, and M.-H. Yang. Object Tracking Benchmark. TPAMI, 2015.
  • [46] T. Yang and A. B. Chan. Recurrent Filter Learning for Visual Tracking. In ICCV Workshop on VOT, 2017.
  • [47] T. Yang and A. B. Chan. Learning Dynamic Memory Networks for Object Tracking. In ECCV, 2018.
  • [48] Y. Zhang, L. Wang, D. Wang, and H. Lu. Structured Siamese Network for Real-Time Visual Tracking. In ECCV, 2018.
  • [49] Z. Zhang, H. Peng, and Q. Wang. Deeper and Wider Siamese Networks for Real-Time Visual Tracking. In CVPR, 2019.
  • [50] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu. Distractor-aware Siamese Networks for Visual Object Tracking. In ECCV, 2018.