FEAR: Fast, Efficient, Accurate and Robust Visual Tracker

We present FEAR, a novel, fast, efficient, accurate, and robust Siamese visual tracker. We introduce an architecture block for object model adaptation, called dual-template representation, and a pixel-wise fusion block that give the model extra flexibility and efficiency. The dual-template module incorporates temporal information with only a single learnable parameter, while the pixel-wise fusion block encodes more discriminative features with fewer parameters than standard correlation modules. By plugging these novel modules into sophisticated backbones, the FEAR-M and FEAR-L trackers surpass most Siamese trackers on several academic benchmarks in both accuracy and efficiency. Equipped with a lightweight backbone, the optimized FEAR-XS version tracks more than 10 times faster than current Siamese trackers while maintaining near state-of-the-art results. The FEAR-XS tracker is 2.4x smaller and 4.3x faster than LightTrack [62] with superior accuracy. In addition, we expand the definition of model efficiency by introducing a benchmark on energy consumption and execution speed. Source code, pre-trained models, and the evaluation protocol will be made available upon request.


1 Introduction


Figure 1: The EAO-Latency trade-off plot. Compared to other state-of-the-art approaches (shown in blue), FEAR-XS tracker (in red) achieves superior or comparable quality (EAO) while attaining outstanding real-time performance on mobile devices.

Visual object tracking is a highly active research area of computer vision with many applications such as autonomous driving [autonomous], surveillance [surveillance], augmented reality [AR], and robotics [robotics]. Building a general system for tracking an arbitrary object in the wild using only the location of the object in the first frame is non-trivial due to occlusions, deformations, lighting changes, background clutter, reappearance, etc. [OTB]. Real-world scenarios often require models to be deployed on edge devices with hardware and power limitations, adding further complexity. Thus, developing a robust tracking algorithm remains a challenge.

The recent adoption of deep neural networks, specifically Siamese networks [Siam2015], has led to significant progress in visual object tracking [SiamFC, SiamRPN, SiamFC++, SiamRPN++, DaSiamRPN, SiamDW, Ocean]. One of the main advantages of Siamese trackers is the possibility of end-to-end offline learning. In contrast, methods that incorporate online learning [ATOM, DiMP, MDNet] increase computational complexity to an extent unacceptable for real-world scenarios [survey2].

Current state-of-the-art approaches for visual object tracking achieve high scores on several benchmarks [VOT, VOT2020] at the cost of a heavy computational load. Top-tier visual trackers such as SiamRPN++ [SiamRPN++] and Ocean [Ocean] exploit complex feature extraction and cross-correlation modules, resulting in 54M parameters and 49 GFLOPs, and 26M parameters and 20 GFLOPs, respectively. Recently, STARK [STARK] introduced a transformer-based encoder-decoder architecture for visual tracking with 23.3M parameters and 10.5 GFLOPs. Such a large memory footprint cannot satisfy the strict performance requirements of real-world applications. Employing a mobile-friendly backbone in a Siamese tracker architecture does not lead to a significant reduction in inference time, as most memory- and time-consuming operations are in the decoder or bounding box prediction modules (see Table 1). Therefore, designing a lightweight visual object tracking algorithm that is efficient across a wide range of hardware remains a challenging problem. Moreover, it is essential to incorporate temporal information into the algorithm to make a tracker robust to pose, lighting, and other object appearance changes. This usually means adding either dedicated branches to the model [STARK] or online learning modules [DiMP]. Either approach results in extra FLOPs that negatively impact run-time performance.

We introduce a novel lightweight tracking framework, the FEAR tracker, that efficiently addresses the abovementioned problems. We develop a single-parameter dual-template module that learns changes in the object's appearance on the fly without any increase in model complexity, mitigating the memory bottleneck of recently proposed online learning modules [ATOM, DiMP, KYS, prDIMP]. This module predicts the likelihood that the target object is close to the center of the search image, which allows selecting candidates for the template image update. Furthermore, we interpolate the feature map of the online selected dynamic template image with the feature map of the original static template image in a learnable way. This allows the model to capture object appearance changes during inference. We optimize the neural network architecture to run more than 10 times faster than most current Siamese trackers. Additionally, we design an extra-lightweight FEAR-XS network that achieves real-time performance on mobile devices while still surpassing or matching the accuracy of state-of-the-art deep learning methods.


The main contributions of the paper are:

  • Introduction of the dual-template representation for object model adaptation. The first template, static, anchors the original visual appearance and thus prevents drift and, consequently, an adaptation-induced failure. The other is dynamic; its state reflects the current acquisition conditions and object appearance. We show that a learned convex combination of the two templates is effective for tracking on multiple benchmarks.

  • A novel pixel-wise fusion block for bounding box regression which exploits both the extracted similarity features and the visual and positional information of the original image. The proposed fusion block can be integrated into most existing Siamese trackers.

  • A lightweight architecture that combines a compact feature extraction network, the dual-template representation, and pixel-wise fusion blocks. The resulting FEAR-XS tracker runs at 205 fps on iPhone 11, 4.2x faster than LightTrack [LightTrack] and 26.6x faster than Ocean [Ocean], with high accuracy on multiple benchmarks. Besides, the algorithm is highly energy-efficient.

  • We introduce a new tracker efficiency benchmark and protocol. Efficiency is defined in terms of both energy consumption and execution speed. This aspect of vision algorithms, important in real-world use, has not been benchmarked before.

2 Related Work

Visual Object Tracking. Conventional tracking benchmarks such as the annual VOT challenges [VOT] and the Online Tracking Benchmark [OTB] have historically been dominated by solutions based on hand-crafted features [Matas, KCF, Staple]. With the rise of deep learning, these lost popularity, constituting only 14% of VOT-ST2020 [VOT2020] participant models. Lately, the short-term visual object tracking task [VOT2020] has mostly been addressed using either discriminative correlation filters [ECO, DiMP, ATOM, DCFST, LADCF, AFOD] or Siamese neural networks [Ocean, SiamRPN, SiamRPN++, DaSiamRPN, SiamFC++, SiamDW, GOTURN], as well as combinations of both [RPT, OceanPlus, AlphaRef]. Moreover, the success of visual transformer networks for image classification [ViT] has resulted in new high-scoring tracking models [STARK, TransT, TrMeetsTr].

Figure 2: The FEAR network architecture consists of four main components: the feature extraction network, the pixel-wise fusion blocks, the bounding box and classification heads. The CNN backbone extracts feature representations from the template and search images. The pixel-wise fusion block effectively combines template and search image features. The bounding box and classification heads make the final predictions for the box location and its presence, respectively.

Siamese trackers. Trackers based on Siamese correlation networks perform tracking based on offline learning of a matching function. This function acts as a similarity metric between the features of the template image and the cropped region of the candidate search area. Siamese trackers initially became popular due to their impressive trade-off between accuracy and efficiency [SINT, SiamFC, RASNet, SiamRPN, DaSiamRPN]; however, they could not keep up with the accuracy of online learning methods [ATOM, DiMP]. With recent modeling improvements, Siamese-based trackers [STARK, Ocean] hold winning positions on the most popular benchmarks [VOT, GOT10k, LaSOT].

One of the state-of-the-art methods, Ocean [Ocean], incorporates the FCOS [FCOS] anchor-free object detection paradigm into tracking, directly regressing the distance from a point in the classification map to the sides of the bounding box. Another state-of-the-art approach, STARK [STARK], introduces a transformer-based encoder-decoder in a Siamese fashion: flattened and concatenated search and template feature maps serve as the input to the transformer network.

Neither of the aforementioned state-of-the-art architectures explicitly addresses the task of fast, high-quality visual object tracking across a wide variety of GPU architectures.

Recently, LightTrack [LightTrack] made a considerable step towards performant tracking on mobile, optimizing for FLOPs as well as model size via NAS [nas, detnas]. Still, FLOP count does not always reflect actual inference time [FBNet].

Efficient Neural Networks. Designing efficient and lightweight neural networks optimized for inference on mobile devices has attracted much attention in the past few years due to many practical applications. SqueezeNet [SqueezeNet] was one of the first works focusing on reducing the size of a neural network: it introduced an efficient downsampling strategy, extensive use of 1x1 Convolutional blocks, and a few smaller modules to decrease the network size significantly. Furthermore, SqueezeNext [SqueezeNext] and ShiftNet [shift] achieve extra size reduction without any significant drop in accuracy. Recent works focus not only on size but also on speed, optimizing FLOP count directly. MobileNets introduce new architecture components: MobileNet [MobileNet] uses depth-wise separable convolutions as a lightweight alternative to spatial convolutions, and MobileNet-v2 [mobilenetv2] adds memory-efficient inverted residual layers. ShuffleNet [shufflenet] utilizes group convolutions and shuffle operations to reduce the FLOP count further. More recently, FBNet [FBNet] also takes the hardware design into account, creating a family of mobile-optimized CNNs via neural architecture search.

For the FEAR tracker, we follow best practices for designing an efficient and flexible neural network architecture. For the extremely lightweight version, we use depth-wise separable convolutions instead of regular ones where possible and design the network layers so that the Conv-BN-ReLU blocks can be fused at the export step, as sketched below.
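As an illustration of this design practice, the minimal PyTorch sketch below shows a depth-wise separable Conv-BN-ReLU block; the block name, channel arguments, and layer ordering are our own assumptions for illustration rather than the exact FEAR-XS layers, but they keep Conv, BatchNorm, and ReLU in an order that export tools can fuse.

```python
import torch
import torch.nn as nn


class DWSeparableConvBNReLU(nn.Module):
    """Depth-wise separable Conv-BN-ReLU block (illustrative sketch). Keeping the
    Conv -> BN -> ReLU order lets deployment tools fuse the three ops at export."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # depth-wise 3x3 convolution: one filter per input channel
        self.dw = nn.Conv2d(in_ch, in_ch, 3, stride, padding=1, groups=in_ch, bias=False)
        self.dw_bn = nn.BatchNorm2d(in_ch)
        # point-wise 1x1 convolution mixes channels
        self.pw = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.pw_bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.dw_bn(self.dw(x)))
        return self.act(self.pw_bn(self.pw(x)))
```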

3 The Method

The FEAR tracker is a single, unified model composed of a feature extraction network, pixel-wise fusion blocks, and task-specific subnetworks for bounding box regression and classification. Given a static template image $T_s$, a search image crop $S$, and a dynamic template image $T_d$, the feature extraction network yields feature maps over these inputs. The template feature representation is then computed as a linear interpolation between the static and dynamic template image features. Next, it is fused with the search image features in the pixel-wise fusion blocks and passed to the classification and regression subnetworks. Every stage is described in detail further on, and an overview of the FEAR network architecture is illustrated in Figure 2.

3.1 FEAR Network Architecture

Feature Extraction Network.

An efficient tracking pipeline requires a flexible, lightweight, and accurate feature extraction network. Moreover, the outputs of such a backbone network should have sufficiently high spatial resolution for accurate object localization [SiamRPN++] without increasing the computations in the consecutive layers. Most current Siamese trackers [Ocean, SiamRPN++] increase the spatial resolution of the last feature map, which significantly degrades the performance of successive layers. We observe that keeping the original spatial resolution significantly reduces the computational cost of both the backbone and the prediction heads, as shown in Table 1.

We use the first four stages of a neural network pretrained on ImageNet [ImageNet] as the feature extraction module. The FEAR-M tracker adopts the vanilla ResNet-50 [ResNet] backbone, and the FEAR-L tracker incorporates the RegNet [xu2021regnet] backbone to pursue state-of-the-art tracking quality while remaining efficient.

The output of the backbone network is a feature map of stride 16 for the template and search images. To map the depth of the output feature map to a constant number of channels, we use a simple AdjustLayer, a combination of Convolutional and Batch Normalization [BatchNorm] layers.
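A minimal sketch of such an adjust layer is given below; the 1x1 kernel size and the 256-channel output width are assumptions for illustration, not values taken from the paper.

```python
import torch.nn as nn


class AdjustLayer(nn.Module):
    """Maps a backbone feature map to a fixed channel width with Conv + BN."""

    def __init__(self, in_channels: int, out_channels: int = 256):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        return self.bn(self.conv(x))
```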

For the mobile version of our tracker, FEAR-XS, we utilize the FBNet [FBNet] family of models designed via NAS to further improve efficiency during inference on mobile devices.

Model architecture | Backbone GigaFLOPs | Prediction heads GigaFLOPs
FEAR-XS tracker | 0.318 | 0.160
FEAR-XS tracker* | 0.840 | 0.746
OceanNet | 4.106 | 1.178
OceanNet (original)* | 14.137 | 11.843
Table 1: GigaFLOPs per frame of the FEAR tracker and OceanNet [Ocean] architectures; * indicates the increased spatial resolution of the backbone. We show in Section 4.4 that upscaling has a negligible effect on accuracy while increasing FLOPs significantly.

Table 1 and Figure 5 demonstrate that even a lightweight encoder does not improve the model efficiency of modern trackers due to the complex prediction heads. Thus, designing a lightweight and accurate decoder is still a challenge.

Pixel-wise fusion block.

The cross-correlation module is the core operation for combining template and search image features. Most existing Siamese trackers use either a simple cross-correlation operation [SiamFC, SiamFC++, SiamRPN] or the more lightweight depth-wise cross-correlation [SiamRPN++]. The correlation outputs are then passed to subsequent networks, such as the bounding box regressor. In contrast, we introduce the pixel-wise fusion block to enhance the similarity information obtained via pixel-wise correlation with the position and appearance information extracted from the search image (see Table 4).

We pass the search image feature map through one 3x3 Conv-BN-ReLU block, and calculate the point-wise cross-correlation between these features and template image features. Then, we concatenate the computed correlation feature map with the search image features and pass the result through one 1x1 Conv-BN-ReLU block to aggregate them. With this approach, learned features are more discriminative and can efficiently encode object position and appearance: see Section 4.4 and Table 4 for more details on the ablation study.

The overall architecture of a pixel-wise fusion block is visualized in Figure 3.

Figure 3: The pixel-wise fusion block. The search and template features are combined using the point-wise cross-correlation module and enriched with search features via concatenation. The output is then forwarded to regression heads.
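The following PyTorch sketch illustrates the fusion block described above and shown in Figure 3; channel widths and the template feature map size (e.g., 8x8) are assumptions, and the point-wise cross-correlation is implemented as an einsum over template pixels.

```python
import torch
import torch.nn as nn


def conv_bn_relu(in_ch: int, out_ch: int, kernel: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel, padding=kernel // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class PixelWiseFusion(nn.Module):
    """Sketch of the pixel-wise fusion block: 3x3 Conv-BN-ReLU on search features,
    point-wise cross-correlation with template features, concatenation with the
    search features, and 1x1 Conv-BN-ReLU aggregation."""

    def __init__(self, channels: int, template_pixels: int, out_channels: int):
        super().__init__()
        self.search_conv = conv_bn_relu(channels, channels, 3)
        self.aggregate = conv_bn_relu(template_pixels + channels, out_channels, 1)

    def forward(self, search_feat: torch.Tensor, template_feat: torch.Tensor) -> torch.Tensor:
        s = self.search_conv(search_feat)               # (B, C, Hs, Ws)
        t = template_feat.flatten(2)                    # (B, C, Ht*Wt)
        corr = torch.einsum("bck,bchw->bkhw", t, s)     # point-wise cross-correlation
        fused = torch.cat([corr, s], dim=1)             # enrich with search appearance/position
        return self.aggregate(fused)


# Hypothetical instantiation: 256-channel features, 8x8 template feature map.
fusion = PixelWiseFusion(channels=256, template_pixels=64, out_channels=256)
```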

Classification and Bounding Box Regression Heads.

The core idea of the bounding box regression head is to estimate the distance from each pixel within the target object's bounding box to the ground truth bounding box sides [Ocean, FCOS]. Such bounding box regression takes into account all of the pixels in the ground truth box during training, so it can accurately predict the extent of target objects even when only a tiny portion of the scene is designated as foreground.

The bounding box regression network is a stack of two simple 3x3 Conv-BN-ReLU blocks. We use just two such blocks instead of the four proposed in Ocean [Ocean] to reduce computational complexity.

The classification head has the same structure as the bounding box regression head. The only difference is that we use one filter instead of four in the last Convolutional block. This head predicts a 16x16 score map, where each pixel represents the confidence that the object appears in the corresponding region of the search crop.
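A sketch of the two prediction heads is shown below; the hidden channel width is hypothetical, and using a plain convolution (without BN-ReLU) for the final output layer is our assumption rather than a detail stated in the paper.

```python
import torch.nn as nn


def conv_bn_relu(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


def make_head(in_ch: int, mid_ch: int, out_ch: int) -> nn.Sequential:
    # Two 3x3 blocks; the number of filters in the last block distinguishes the
    # box head (4: distances to the box sides) from the classification head (1).
    return nn.Sequential(conv_bn_relu(in_ch, mid_ch), nn.Conv2d(mid_ch, out_ch, 3, padding=1))


bbox_head = make_head(256, 128, 4)   # hypothetical channel widths
cls_head = make_head(256, 128, 1)
```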

Dynamic Template Update.

We propose a dual-template representation that allows the model to capture appearance changes of the object during inference without performing optimization on the fly. In addition to the main static template $T_s$ and the search image $S$, we randomly sample a dynamic template image $T_d$ from the video sequence during training to capture the object under various appearances. We pass $T_d$ through the feature extraction network, and the resulting feature map $F_{T_d}$ is linearly interpolated with the main template feature map $F_{T_s}$ via a learnable parameter $w$:

$F_T = (1 - w) \cdot F_{T_s} + w \cdot F_{T_d}$   (1)
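A minimal PyTorch sketch of Eq. (1) is given below; the sigmoid reparameterization that keeps the single learnable weight in (0, 1), making the combination convex, is our assumption.

```python
import torch
import torch.nn as nn


class DualTemplate(nn.Module):
    """Sketch of the dual-template interpolation in Eq. (1)."""

    def __init__(self):
        super().__init__()
        self.logit_w = nn.Parameter(torch.zeros(1))  # single learnable parameter

    def forward(self, static_feat: torch.Tensor, dynamic_feat: torch.Tensor) -> torch.Tensor:
        w = torch.sigmoid(self.logit_w)              # convex combination weight in (0, 1)
        return (1.0 - w) * static_feat + w * dynamic_feat
```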

We further pass $F_T$ and the search image features $F_S$ to the Similarity Module, which computes the cosine similarity between the dual-template and search image embeddings. The search image embedding is obtained via Weighted Average Pooling (WAP) [WAP] of $F_S$ weighted by the classification confidence scores; the dual-template embedding is computed as Average Pooling [AvgPool] of $F_T$. During inference, every $N$ frames we choose the image crop with the highest cosine similarity to the dual-template representation and update the dynamic template with the predicted bounding box at this frame. The general scheme of the Dynamic Template Update algorithm is shown in Figure 4. In addition, for every training pair we sample a negative crop from a frame that does not contain the target object. We pass it through the feature extraction network and extract the negative crop embedding in the same way as for the search image, via WAP. We then compute a Triplet Loss [TripletLoss] on the dual-template, search image, and negative crop embeddings, respectively. This training scheme provides a signal for the dynamic template scoring while also biasing the model toward more general representations.

Figure 4: Dynamic Template Image update. We compare the average-pooled dual-template representation with the search image embedding using cosine similarity, and dynamically update the template representation when object appearance changes dramatically.
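The sketch below illustrates the Similarity Module used for the update decision, assuming the classification score map provides non-negative confidences; the function names are hypothetical.

```python
import torch
import torch.nn.functional as F


def weighted_avg_pool(feat: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    """WAP: pool a (B, C, H, W) feature map with a (B, 1, H, W) confidence map."""
    w = scores / (scores.sum(dim=(2, 3), keepdim=True) + 1e-6)
    return (feat * w).sum(dim=(2, 3))                    # (B, C)


def template_similarity(template_feat: torch.Tensor,
                        search_feat: torch.Tensor,
                        cls_scores: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between the average-pooled dual-template embedding
    and the confidence-weighted search embedding."""
    t_emb = template_feat.mean(dim=(2, 3))               # average pooling of F_T
    s_emb = weighted_avg_pool(search_feat, cls_scores)   # WAP of F_S
    return F.cosine_similarity(t_emb, s_emb, dim=1)      # (B,)

# At inference, every N frames the crop with the highest similarity is used to
# refresh the dynamic template (N = 30 in Section 4.1).
```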

In Section 4, we demonstrate the efficiency of our method on a large variety of academic benchmarks and challenging cases. The dual-template representation module allows the model to efficiently encode temporal information as well as changes in object appearance and scale. The increase in model parameters and FLOPs is negligible, making it an almost cost-free temporal module.

3.2 Overall Loss Function

Training a Siamese tracking model requires a multi-component objective function to simultaneously optimize the classification and regression tasks. As shown in previous approaches [Ocean, STARK], an IoU loss [IoUloss] and a classification loss are used to jointly train the regression and classification networks efficiently. In addition, to train the FEAR tracker, we supplement these training objectives with a triplet loss, which enables the Dynamic Template Update. As the ablation study shows, it improves tracking quality with only a single additional trainable parameter and marginal inference cost (see Table 4). To our knowledge, this is a novel approach to training object trackers.

The triplet loss term is computed from the template ($e_T$), search ($e_S$), and negative crop ($e_N$) embeddings obtained from the corresponding feature maps:

$\mathcal{L}_{triplet} = \max\big( d(e_T, e_S) - d(e_T, e_N) + m,\; 0 \big)$   (2)

where $m$ is the margin and $d$ is the distance between embeddings,

$d(x, y) = \lVert x - y \rVert_2$.   (3)

The regression loss term is computed as:

$\mathcal{L}_{reg} = \frac{1}{N} \sum_{i=1}^{N} \big( 1 - \mathrm{IoU}(b_i, \hat{b}_i) \big)$   (4)

where $b_i$ denotes the target bounding box, $\hat{b}_i$ the predicted bounding box, and $i$ indexes the training samples.

For the classification loss term, we use Focal Loss [FocalLoss]:

$\mathcal{L}_{cls} = -\frac{1}{N} \sum_{i=1}^{N} (1 - p_{t,i})^{\gamma} \log(p_{t,i})$   (5)

where

$p_{t,i} = \begin{cases} p_i & \text{if } y_i = 1, \\ 1 - p_i & \text{otherwise.} \end{cases}$

In the above, $y_i$ is the ground-truth class and $p_i$ is the predicted probability of the positive class.

The overall loss function is a linear combination of the three components:

$\mathcal{L} = \lambda_1 \mathcal{L}_{triplet} + \lambda_2 \mathcal{L}_{reg} + \lambda_3 \mathcal{L}_{cls}$   (6)

In practice, we use 0.5, 1.0, and 1.0 as $\lambda_1$, $\lambda_2$, and $\lambda_3$, respectively.
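For illustration, a hedged sketch of Eq. (6) using standard PyTorch/torchvision building blocks is shown below; the triplet margin and the focal-loss parameters are assumptions, and boxes are assumed to be in (x1, y1, x2, y2) format.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import box_iou, sigmoid_focal_loss


def fear_loss(t_emb, s_emb, n_emb, pred_boxes, target_boxes, cls_logits, cls_targets,
              weights=(0.5, 1.0, 1.0)):
    """Sketch of the overall objective in Eq. (6)."""
    # triplet loss on the dual-template (anchor), search (positive), negative embeddings
    l_triplet = F.triplet_margin_loss(t_emb, s_emb, n_emb, margin=1.0)
    # IoU loss over matched pairs of predicted and target boxes
    ious = box_iou(pred_boxes, target_boxes).diagonal()
    l_reg = (1.0 - ious).mean()
    # focal loss on the 16x16 classification score map
    l_cls = sigmoid_focal_loss(cls_logits, cls_targets, reduction="mean")
    w_t, w_r, w_c = weights
    return w_t * l_triplet + w_r * l_reg + w_c * l_cls
```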

4 Experimental Evaluation

4.1 Implementation Details

Training:

We implement all of our models in PyTorch [pytorch]. The backbone network is initialized with weights pretrained on ImageNet. All models are trained on 4 RTX A6000 GPUs with a total batch size of 512. We use the ADAM [ADAM] optimizer with a learning rate = and a plateau learning rate reducer with a reduction factor of 0.5 every 10 epochs when the target metric (mean IoU) stops increasing. Each epoch contains image pairs. The training takes 5 days to converge.

For each epoch, we randomly sample 20,000 images from the LaSOT dataset [LaSOT], 120,000 from COCO [COCO], 400,000 from YouTube-BB [YouTubeBB], 320,000 from GOT-10k [GOT10k], and 310,000 from ImageNet [ImageNet].

From each video sequence in a dataset, we randomly sample a template frame and a search frame such that the distance between them is at most a fixed number of frames. Starting from the 15th epoch, we increase this maximum distance by 2 every epoch. This allows the network to learn the correlation between objects on easier samples initially and gradually increases the difficulty as training proceeds. The dynamic template image is sampled from the video sequence between the static template frame and the search image frame, as sketched below. For the negative crop, where possible, we sample it from the same frame as the dynamic template but without overlap with the template crop; otherwise, we sample the negative crop from another video sequence.
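The sketch below illustrates this curriculum frame sampling; the initial maximum gap (`base_gap`) is a hypothetical value, since the paper does not state it here.

```python
import random


def sample_training_pair(num_frames: int, epoch: int, base_gap: int = 30):
    """Sample template, search, and dynamic template frame indices with a
    maximum template-search distance that grows by 2 per epoch after epoch 15."""
    max_gap = base_gap + max(0, epoch - 15) * 2
    template_idx = random.randrange(num_frames)
    lo = max(0, template_idx - max_gap)
    hi = min(num_frames - 1, template_idx + max_gap)
    search_idx = random.randint(lo, hi)
    # the dynamic template frame lies between the static template and the search frame
    d_lo, d_hi = sorted((template_idx, search_idx))
    dynamic_idx = random.randint(d_lo, d_hi)
    return template_idx, search_idx, dynamic_idx
```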

Preprocessing: We extract template image crops with an additional offset around the bounding box. Then, we apply a light shift (up to 8px) and random scale change (on both sides) augmentations, pad the image to square size with the mean RGB value of the crop, and resize it to 128x128 pixels. We apply the same augmentations with a more severe shift (up to 48px) and scale change for the search and negative images. Next, the search image is resized to 256x256 pixels with the same padding strategy as for the template image.

Finally, we apply random photometric augmentations to both the search and template images to increase model generalization and robustness under different lighting and color conditions [albumentations]; an example of such a pipeline is sketched below.
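As an example of the photometric stage, a minimal Albumentations pipeline is sketched below; the specific transforms and probabilities are assumptions, not the exact set used for FEAR.

```python
import numpy as np
import albumentations as A

# Hypothetical photometric pipeline applied to both template and search crops.
photometric = A.Compose([
    A.RandomBrightnessContrast(p=0.5),
    A.HueSaturationValue(p=0.5),
    A.MotionBlur(blur_limit=3, p=0.2),
])

search_crop = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)  # stand-in for a real crop
augmented = photometric(image=search_crop)["image"]
```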

Testing: During inference, tracking follows the same protocol as in [SiamFC, SiamRPN]. The static template features of the target object are computed once at the first frame. The dynamic template features are updated every 30 frames and interpolated with the static template features. These features are combined with the search image features in the correlation modules, regression, and classification heads to produce the final output. The overall test-time loop is sketched below.
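The following pseudocode-style sketch summarizes the test-time loop; `tracker` is a hypothetical object exposing the methods used here, and the per-interval candidate selection is simplified.

```python
def track(frames, tracker, update_every: int = 30):
    """Hypothetical test-time loop: the static template is computed once,
    the dynamic template is refreshed every `update_every` frames."""
    tracker.set_static_template(frames[0])               # first frame with the given box
    boxes = []
    for i, frame in enumerate(frames[1:], start=1):
        box = tracker.predict(frame)                     # fuse templates, run fusion blocks and heads
        boxes.append(box)
        if i % update_every == 0:
            # the candidate with the highest dual-template similarity becomes the new dynamic template
            tracker.update_dynamic_template(frame, box)
    return boxes
```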

4.2 Tracker efficiency benchmark

Figure 5: Online Efficiency Benchmark on iPhone 8: battery consumption, device thermal state, and inference speed degradation over time. FEAR-XS tracker does not change the thermal state of the device and has a negligible impact on the battery level. Transformer-based trackers have a battery level drop comparable to the Siamese trackers, reaching a high thermal state in less than 10 minutes of online processing.

Setup:

Mobile devices have a limited amount of both computing power and energy available to execute a program. Most current benchmarks measure only runtime speed without taking into account the energy efficiency of the algorithm, which is equally important in a real-world scenario. Thus, we introduce the FEAR Benchmark to estimate the effect of tracking algorithms on mobile device battery and thermal state and their impact on processing speed over time. It measures the energy efficiency of trackers with online and offline evaluation protocols: the former estimates the energy consumption of real-time input stream processing, and the latter measures the processing speed on a fixed number of inputs.

The online evaluation collects energy consumption data by simulating a real-time (30 FPS) camera input to the neural network for 30 minutes. The tracker cannot process more frames than the specified FPS even if its inference speed is faster, and it skips inputs that cannot be processed on time due to slower processing speed. We collect the battery level, the device's thermal state, and the neural network inference speed throughout the whole experiment. The offline protocol measures the inference speed of trackers by simulating a constant number of inputs for processing, similar to processing a media file from disk. All frames are processed one by one without any inference time restrictions. Additionally, we perform a model warmup before the experiment, as the first model executions are usually slower. We set the number of warmup iterations and the number of inputs to 20 and 100, respectively.
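A sketch of the offline timing protocol (warmup followed by timed processing) is given below; `run_inference` and `make_input` are placeholders for the tracker call and the frame source, not names from the paper.

```python
import time


def offline_benchmark(run_inference, make_input, warmup: int = 20, iters: int = 100) -> float:
    """Warm the model up, then time a fixed number of inputs and return mean FPS."""
    for _ in range(warmup):
        run_inference(make_input())
    start = time.perf_counter()
    for _ in range(iters):
        run_inference(make_input())
    elapsed = time.perf_counter() - start
    return iters / elapsed
```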

In this work, we evaluate trackers on iPhone 7, iPhone 8 Plus, iPhone 11, and Pixel 4. All devices are fully charged before the experiment, no background tasks are running, and the display is set to the lowest brightness to reduce the energy consumption of hardware that is not involved in computations.

We observe that algorithms that reach a high system thermal state suffer a significant drop in processing speed due to the smaller number of processing units available. The results show that tracking speed depends on energy efficiency, and both should be taken into account.

Online efficiency benchmark: Figure 5 summarizes the online benchmark results on iPhone 8. The upper part of the plot shows the degradation of inference speed over time. We observe that the FEAR-XS tracker and STARK-Lightning [stark_lightning] do not change inference speed over time, while LightTrack [LightTrack] and OceanNet [Ocean] start to process inputs more slowly. Also, the transformer network STARK-S50 degrades significantly and becomes 20% slower after 30 minutes of runtime. The lower part of the figure demonstrates the energy efficiency of the FEAR-XS tracker compared to the other trackers and its negligible impact on the device's thermal state. STARK-S50 and Ocean overheat the device after 10 minutes of execution, LightTrack slightly elevates the temperature after 24 minutes, and STARK-Lightning overheats the device after 27 minutes, while the FEAR-XS tracker keeps the device at a low temperature. Moreover, Ocean with the lightweight FBNet backbone [FBNet] still consumes a lot of energy and produces heat due to its complex and inefficient decoder.

Figure 6: Offline Efficiency Benchmark: mean FPS on a range of mobile GPU architectures. FEAR-XS tracker has superior processing speed on all devices while being an order of magnitude faster on a modern GPU – Apple A13 Bionic.

Offline efficiency benchmark: We summarize the results of the offline benchmark in Figure 6. We observe that the FEAR-XS tracker achieves approximately 1.6 times higher FPS than LightTrack [LightTrack] on iPhone 7 (A10 Fusion with PowerVR Series7XT GPU), iPhone 8 (A11 Bionic with a 3-core GPU), and Google Pixel 4 (Snapdragon 855 with Adreno 640 GPU). Furthermore, the FEAR-XS tracker is more than 4 times faster than LightTrack on iPhone 11 (A13 Bionic with a 4-core GPU). The FEAR-XS tracker achieves more than 10 times faster inference than OceanNet [Ocean] and STARK [STARK] on all the aforementioned mobile devices. Such low inference time makes the FEAR tracker a very cost-efficient candidate for resource-constrained applications.

4.3 Comparison with the state-of-the-art

FPS Success Score Precision Score Success Rate
30 0.618 0.753 0.780
240 0.655 0.816 0.835
Table 2: Extremely High FPS Tracking Matters. The metrics were computed from the same set of frames on the 30 and 240 fps NFS benchmark [nfs]. FEAR-XS, tracking at over 200 fps, outperforms trackers limited to 30 fps by incorporating additional temporal information from intermediate frames.
SiamFC++ SiamRPN++ SiamRPN++ ATOM KYS Ocean STARK STARK LightTrack FEAR-XS FEAR-M FEAR-L
(GoogleNet) (MobileNet-V2) (ResNet-50) (offline) (S50) (lightning)
[SiamFC++] [SiamRPN++] [SiamRPN++] [ATOM] [KYS] [Ocean] [STARK] [stark_lightning] [LightTrack]
EAO ↑ 0.227 0.235 0.239 0.258 0.274 0.290 0.270 0.226 0.240 0.270 0.278 0.303
Accuracy ↑ 0.418 0.432 0.438 0.457 0.453 0.479 0.464 0.433 0.417 0.471 0.476 0.501
Robustness ↑ 0.667 0.656 0.668 0.691 0.736 0.732 0.719 0.627 0.684 0.708 0.728 0.755
iPhone 11 FPS ↑ 7.11 6.86 3.49 - - 7.72 11.2 87.41 49.13 205.12 56.2 38.3
Parameters (M) ↓ 12.706 11.150 53.951 - - 25.869 23.342 2.280 1.969 1.370 9.672 33.653
(a) VOT-ST2021 [VOT]
SiamRPN++ ATOM KYS Ocean STARK LightTrack FEAR-XS FEAR-M FEAR-L
(ResNet-50) (offline) (S50)
[SiamRPN++] [ATOM] [KYS] [Ocean] [STARK] [LightTrack]
Average Overlap ↑ 0.518 0.556 0.636 0.592 0.672 0.611 0.619 0.623 0.645
Success Rate ↑ 0.618 0.634 0.751 0.695 0.761 0.710 0.722 0.730 0.746
(b) GOT-10K [GOT10k]
SiamRPN++ ATOM KYS Ocean STARK STARK LightTrack FEAR-XS FEAR-M FEAR-L
(ResNet-50) (offline) (S50) (lightning)
[SiamRPN++] [ATOM] [KYS] [Ocean] [STARK] [stark_lightning] [LightTrack]
Success Score ↑ 0.503 0.491 0.541 0.505 0.668 0.586 0.523 0.535 0.546 0.579
Precision Score ↑ 0.496 0.483 0.539 0.517 0.701 0.579 0.515 0.545 0.556 0.609
Success Rate ↑ 0.593 0.566 0.640 0.594 0.778 0.690 0.596 0.641 0.638 0.686
(c) LaSOT [LaSOT]
SiamRPN++ ATOM KYS Ocean STARK STARK LightTrack FEAR-XS FEAR-M FEAR-L
(ResNet-50) (offline) (S50) (lightning)
[SiamRPN++] [ATOM] [KYS] [Ocean] [STARK] [stark_lightning] [LightTrack]
Success Score ↑ 0.596 0.592 0.634 0.573 0.681 0.628 0.591 0.614 0.622 0.658
Precision Score ↑ 0.720 0.711 0.766 0.706 0.825 0.754 0.730 0.768 0.745 0.814
Success Rate ↑ 0.748 0.737 0.795 0.728 0.860 0.796 0.743 0.788 0.788 0.834
(d) NFS [nfs]
Table 3: Comparison of FEAR and state-of-the-art real-time trackers on common benchmarks: VOT-ST2021 [VOT], GOT-10K [GOT10k], LaSOT [LaSOT], and NFS [nfs]. The FEAR tracker uses far fewer parameters and achieves higher FPS; its accuracy and robustness are on par with the best. Red, green, and blue font colors indicate the top-3 trackers.

We compare the FEAR tracker to existing state-of-the-art Siamese [Ocean, LightTrack, SiamFC++, SiamRPN++] and DCF [ATOM, DiMP, KYS] trackers in terms of model accuracy, robustness, and speed. We evaluate performance on two short-term tracking benchmarks, VOT-ST2021 [VOT] and GOT-10k [GOT10k], and two long-term tracking benchmarks, LaSOT [LaSOT] and NFS [nfs]. We provide three versions of the FEAR tracker: FEAR-XS, FEAR-M, and FEAR-L. The first is a lightweight network optimized for on-device inference, while the latter two are heavier and provide more accurate results.

VOT-ST2021 Benchmark: This benchmark consists of 60 short video sequences with challenging scenarios (similar objects, partial occlusions, scale and appearance change) designed to evaluate short-term, causal, model-free trackers. Table 3(a) reports results on VOT-ST2021. The Expected Average Overlap (EAO) metric [VOT], which takes both Accuracy (A) and Robustness (R) into account, is used to evaluate overall performance. The FEAR-L tracker demonstrates 1.3% higher EAO than Ocean [Ocean] and outperforms trackers with online updates, such as ATOM [ATOM] and KYS [KYS], by 3% EAO. The FEAR-XS tracker shows near state-of-the-art performance, outperforming LightTrack [LightTrack] and STARK-Lightning [stark_lightning] by 3% and 4.4% EAO, respectively, while running at higher FPS. It is only 2% behind Ocean while having more than 18 times fewer parameters and being 26 times faster at inference on iPhone 11.

GOT-10K Benchmark: GOT-10K [GOT10k] is a benchmark covering a wide range of objects, deformations, and occlusions. We evaluate our solution using the official GOT-10K submission page. The FEAR-XS tracker achieves better results than LightTrack [LightTrack] and Ocean [Ocean] while using 1.4 and 19 times fewer parameters, respectively. More details are given in Table 3(b).

LaSOT Benchmark: LaSOT [LaSOT] contains 280 video segments for long-term tracking evaluation. Each sequence is longer than 80 seconds on average, making it the largest densely annotated long-term tracking benchmark. We report the Success Score as well as the Precision Score and Success Rate. As presented in Table 3(c), the Precision Score of the FEAR-XS tracker is 3% and 2.8% higher than that of LightTrack [LightTrack] and Ocean [Ocean], respectively. Moreover, the larger FEAR-M and FEAR-L trackers further improve the Success Score, outperforming KYS [KYS] by 0.5% and 3.8%, respectively.

NFS Benchmark: The NFS [nfs] dataset is a long-term benchmark with 100 videos (380K frames) captured with now commonly available high-frame-rate (240 FPS) cameras in real-world scenarios. Table 3(d) shows that the FEAR-XS tracker achieves a better Success Score (61.4%), 2.3% and 4.1% higher than LightTrack [LightTrack] and Ocean [Ocean], respectively. Moreover, the FEAR-L tracker outperforms KYS [KYS] by 2.4% Success Score and 4.8% Precision Score. Additionally, Table 2 reports the impact of extremely high-fps video processing on accuracy, underscoring the importance of developing a fast tracker capable of processing videos at higher FPS.

4.4 Ablation Study

To verify the efficiency of the proposed method, we evaluate the effect of its different components on the VOT-ST2021 [VOT] benchmark, as presented in Table 4. The baseline model (#1) consists of the FBNet backbone with an increased spatial resolution of the final stage, followed by a plain pixel-wise cross-correlation operation and a bounding box prediction network. The baseline achieves 0.236 EAO and 0.672 Robustness. In #2, we set the spatial resolution of the last stage to its original value and observe a negligible degradation of EAO while increasing fps (see Table 1). Adding our pixel-wise fusion blocks (#3) brings a 3% EAO improvement, indicating that combining search image features and correlation feature maps enhances feature representation and improves tracking accuracy. Furthermore, the proposed dynamic template update module (#4) brings an additional improvement of 0.6% EAO and 2.5% Robustness, showing the effectiveness of this module.

# Component EAO Robustness
1 baseline 0.236 0.672
2 + lower spatial resolution 0.234 0.668
3 + pixel-wise fusion block 0.264 0.683
4 + dynamic template update 0.270 0.708
Table 4: FEAR tracker – Ablation study on VOT-ST2021 [VOT].

5 Conclusions

In this paper, we introduce the FEAR tracker, an efficient and powerful new Siamese tracking framework that benefits from novel architectural blocks. We validate the FEAR tracker's performance on several popular academic benchmarks and show that the model matches or exceeds existing solutions while dramatically reducing the computational cost of inference. We demonstrate that the FEAR-XS model attains real-time performance on embedded devices with high energy efficiency.

Acknowledgements:

The authors thank Eugene Khvedchenia, Lukas Schulte and Mandela Patrick for an early review and in-depth discussions. Vasyl Borsuk is supported by Ampersand Foundation.

References

Appendix A. Training Datasets

The YouTube-BoundingBoxes dataset [YouTubeBB] is a large-scale dataset of videos. It consists of approximately 380,000 video segments of 15-20 seconds each, with a recording quality often akin to that of a hand-held cell phone camera.

The LaSOT dataset [LaSOT] consists of 1,400 sequences with more than 3.5M frames in total. Each sequence contains 2,500 frames on average, and the dataset covers 70 different object categories.

The GOT-10k dataset [GOT10k] is built upon the backbone of the WordNet structure [WordNet] and covers the majority of its over 560 classes of moving objects and 87 motion patterns. It contains more than 10,000 short video sequences with more than 1.5M manually labeled bounding boxes, annotated at 30 frames per second, enabling unified training and stable evaluation of deep trackers.

The ImageNet-VID benchmark [ImageNet] was created for the video object detection task and contains 30 object categories. Overall, the benchmark consists of nearly 2M annotations and over 4,000 video sequences.

In addition, similar to other tracking models [SiamFC, DaSiamRPN, Ocean], we use a part of the COCO [COCO] dataset for object detection with 80 different object categories to diversify the training dataset for visual object tracking. In our setup, we set to let the network efficiently predict the object’s location in a larger context.

Appendix B. Technical details

B.1. Pixel-wise correlation implementation

Classical cross-correlation cannot be executed by most mobile neural network inference engines, such as CoreML [coreml], because convolution with dynamic weights taken from the template features is unsupported. Thus, we reformulate the pixel-wise cross-correlation operation as a matrix multiplication, which is better supported on mobile devices.

Given search image features $F_S \in \mathbb{R}^{C \times H_S W_S}$ and template image features $F_T \in \mathbb{R}^{C \times H_T W_T}$, flattened along the spatial dimensions, we compute the pixel-wise cross-correlation features as:

$F_{corr} = F_T^{\top} F_S$   (7)

The resulting $F_{corr}$ is a tensor of shape $H_T W_T \times H_S W_S$.
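A PyTorch sketch of this matrix-multiplication formulation, with the batch dimension made explicit, is shown below; the exact tensor layout is our assumption.

```python
import torch


def pixelwise_correlation(search_feat: torch.Tensor, template_feat: torch.Tensor) -> torch.Tensor:
    """Pixel-wise cross-correlation expressed as a batched matrix multiplication (Eq. (7)).

    search_feat:   (B, C, Hs, Ws)
    template_feat: (B, C, Ht, Wt)
    returns:       (B, Ht*Wt, Hs, Ws)
    """
    b, c, hs, ws = search_feat.shape
    s = search_feat.flatten(2)                 # (B, C, Hs*Ws)
    t = template_feat.flatten(2)               # (B, C, Ht*Wt)
    corr = torch.bmm(t.transpose(1, 2), s)     # (B, Ht*Wt, Hs*Ws)
    return corr.view(b, -1, hs, ws)
```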

B.2. Smartphone-based Implementation

The models are trained offline using PyTorch [pytorch], and an optimal model snapshot is then ported to mobile devices for inference. All models are executed in float16 mode for faster execution compared to float32 computations. The precision loss of float16 computations is negligible; we observe that the results differ only marginally depending on the experiment.

We use the Core ML [coreml] framework to run the FEAR tracker on iPhone devices. Core ML is a machine learning API from Apple that optimizes on-device neural network inference by leveraging the CPU, GPU, and Neural Engine.

For Android devices, we employ TensorFlow Lite [tensorflow2015-whitepaper], an open-source deep learning framework from Google for on-device inference that supports execution on the CPU, GPU, and DSP.

Appendix C. Qualitative comparison

A comparison of the FEAR tracker with state-of-the-art methods is presented in Figure 7. We display tracking results every 200 frames (0 - 1000) on challenging cases from the LaSOT benchmark where the object appearance and scale change throughout the video.


Figure 7: Qualitative comparison of FEAR tracker with state-of-the-art methods on challenging cases of variations in tracked object appearance from LaSOT benchmark [LaSOT]. Green: Ground Truth, Red: FEAR-L, Yellow: STARK Lightning, Blue: Ocean, Purple: Stark-ST50.