COMET: Context-Aware IoU-Guided Network for Small Object Tracking

06/04/2020 · by Seyed Mojtaba Marvasti-Zadeh et al. (Sharif University of Technology, University of Alberta, Yazd University)

Tracking an unknown target captured from a medium- or high-aerial view is challenging, especially in scenarios involving small objects, large viewpoint change, drastic camera motion, and high density. This paper introduces a context-aware IoU-guided tracker that exploits an offline reference proposal generation strategy and a multitask two-stream network. The proposed strategy uses an efficient sampling scheme to generalize the network over the target and its parts without imposing extra computational complexity during online tracking. It considerably helps the proposed tracker, COMET, to handle occlusion and viewpoint change, where only some parts of the target are visible. Extensive experimental evaluations on a broad range of small object benchmarks (UAVDT, VisDrone-2019, and Small-90) demonstrate the effectiveness of our approach for small object tracking.







1 Introduction

Visual tracking has various applications from biology [56], surgery [4], and ocean exploration [41] to self-driving cars [5], autonomous robots [47], surveillance [46], and sports [42]. As one of the most active research areas in computer vision, visual tracking aims to accurately localize a model-free target while robustly estimating a bounding box fitted to the target. Tracking a target in real-world scenarios involves challenging situations such as heavy illumination variation, partial and full occlusion, aspect-ratio change, fast target motion, and deformation. Recently, tracking targets in videos captured by drones or flying robots has introduced extra challenges (e.g., small objects, dense cluttered backgrounds, weather conditions, aerial views, fast camera/object motion, camera rotation, and viewpoint change) [66, 14, 13, 63].

Realistic videos from an aerial view usually contain small objects with complex backgrounds. Detecting and tracking such targets is difficult even for humans, owing to the low resolution and limited pixels of small objects. Besides, the pooling layers and large sampling strides of network layers may lead to information loss. To address these issues, various approaches have been proposed [55]. For instance, SSD [40] uses low-level features for small object detection and high-level ones for larger objects. To incorporate context information for small object detection, DSSD [18] increases the resolution of feature maps using deconvolution layers. MDSSD [7] utilizes several multi-scale deconvolution fusion modules to enhance small object detection performance. Also, [37] utilizes multi-scale feature concatenation and attention mechanisms to enhance small object detection using context information. Finally, SCRDet [62] introduces SF-Net for feature fusion and MDA-Net for highlighting object information using attention modules. Although object detection methods have made considerable progress on handling small objects from aerial views (e.g., [58, 50, 33, 1]), developing dedicated visual trackers for this application is still in progress.

Although modern deep learning-based visual tracking methods have achieved significant success in tracking generic objects, they cannot provide satisfactory performance for drone vision systems. These visual trackers often assume that the initial target bounding box (BB) provides sufficient information for target modeling. Moreover, they are developed on conventional visual tracking datasets (e.g., OTB [60], TC-128 [36], VOT [31], NfS [19]) consisting of videos recorded by traditional surveillance cameras, which do not include challenging videos from high-altitude views. Motivated by these problems, this paper proposes a Context-aware iOu-guided network for sMall objEct Tracking (COMET). It exploits a multitask two-stream network to process context information at various scales and to focus on salient areas via attention modules. Given a rough estimate of the target location from an online classification network [8], the proposed network simultaneously predicts the intersection-over-union (IoU) and the normalized center location error (CLE) between the estimated bounding boxes (BBs) and the ground-truth ones. Moreover, an effective proposal generation strategy is proposed, which helps the network to learn contextual information and to generalize better to occlusion. Unlike the ATOM tracker [8], which is also an IoU-guided visual tracking method, the proposed method develops an architecture that learns to exploit the representations of the target and its parts more effectively for small object tracking from aerial views.

The contributions of the paper are twofold:

1) Offline Proposal Generation Strategy: The proposed method uses different approaches for the offline training and online tracking phases. In offline training, it generates a limited number of high-quality proposals from the reference frame and passes them through the network. Unlike other proposal generation strategies, which only generate test proposals to find the best-fitted target BB, these reference proposals include context information and help the network to learn the target and its parts. Therefore, the proposed strategy can effectively handle large occlusions and viewpoint changes in challenging scenarios. In online tracking, the proposed tracker uses only the initial target BB for its reference branch, which imposes no extra computational burden.

2) Multitask Two-Stream Network:

The proposed method develops a multitask two-stream network to deal with recent challenges in small object tracking. First, the network fuses aggregated multi-scale spatial features with semantic ones to provide sufficient information, exploit both low-level and high-level target representations, and construct a higher-quality target model. Second, it utilizes lightweight spatial and channel attention modules to focus on the information most relevant to small object tracking. Third, the network optimizes a proposed multitask loss function to minimize the errors in both the IoU prediction and the center location error of a target.

Extensive experimental analyses are performed to compare the proposed tracker with state-of-the-art methods on recent and related small object tracking benchmarks, namely, UAVDT [13], VisDrone-2019 [14], and Small-90 [39].

The rest of the paper is organized as follows. In Section 2, related work is briefly reviewed. In Sections 3 and 4, the proposed method and the experimental analyses are presented, respectively. Finally, the paper is concluded in Section 5.

2 Related Work

Generally, visual tracking methods are classified into shallow trackers (e.g., discriminative correlation filters (DCF) [25, 11, 9]) and deep learning (DL) based visual trackers, which have attracted most of the attention of the visual tracking community. DL-based methods either use pre-trained networks (e.g., AlexNet [32], VGG-Net [6, 49], and/or ResNet [23]) for feature extraction or train a deep neural network for visual tracking purposes. The network architectures are designed as convolutional neural networks (CNN), Siamese neural networks (SNN), recurrent neural networks (RNN), or custom networks. While CNN-based trackers were the first attractive architecture, most recent trackers are based on two-stream networks such as SNNs, which provide competitive tracking performance and acceptable speed simultaneously. In this section, recent visual trackers based on two-stream neural networks are briefly described.

Interest in two-stream networks for visual tracking began with generic object tracking using regression networks (GOTURN) [24], which relies solely on offline training of a network without any online fine-tuning during tracking. This idea was continued by fully-convolutional Siamese networks (SiamFC) [2], which formulated visual tracking as a general similarity-learning problem to address the limited availability of labeled data. This method uses an offline-trained fully-convolutional network to compute the cross-correlation between exemplar and search regions. Also, the Siamese instance search tracker (SINT) [54] utilizes an offline-learned matching function without online adaptation or updating. To exploit both the efficiency of the correlation filter (CF) and CNN features, CFNet [57] provides a closed-form solution for end-to-end training of the CF layer. This asymmetric shallow SNN-based method requires few operations and limited memory, making it particularly well suited for embedded devices. Moreover, the work in [12] applies a triplet loss over exemplar, positive-instance, and negative-instance patches to strengthen the back-propagated feedback and learn more powerful features. These methods could not achieve competitive accuracy and robustness against well-known DCF methods since they are prone to drift under target appearance variations; however, they run at beyond real-time speed.

The dynamic Siamese network (DSiam) [21] proposes a general model for fast transformation learning, which also includes feature fusion, online adaptation, and background suppression. As the baseline of many successful state-of-the-art tracking methods (e.g., [67, 65, 17, 59, 34]), the Siamese region proposal network (SiamRPN) [35] formulates generic object tracking as local one-shot learning with bounding box refinement. Its network includes classification and regression heads to distinguish the target from its background and to regress the coordinates of target proposals. The distractor-aware Siamese RPN (DaSiamRPN) [67] exploits semantic backgrounds and distractor suppression to learn more robust features. It also partially addresses full occlusion and out-of-view challenges with a local-to-global search region strategy. To design deeper and wider SNN-based methods, SiamDW [65] has investigated various units and backbone networks (e.g., ResNet, ResNeXt [61], and GoogLeNet [52]) to take full advantage of state-of-the-art architectures. The Siamese cascaded RPN (CRPN) [17] consists of multiple RPNs that perform stage-by-stage classification and localization. It aims to improve the discriminative power of the classifiers and to address the imbalanced distribution of training samples via hard negative sampling. Moreover, SiamRPN++ [34] proposes a ResNet-driven Siamese tracker that not only exploits layer-wise and depth-wise aggregation but also uses a spatial-aware sampling strategy to successfully train a deeper network. The SiamMask tracker [59] benefits from bounding box refinement and class-agnostic binary segmentation to improve the target region estimated by similarity measurement.

Although SNN-based methods provide both desirable performance and computational efficiency, they usually do not consider background information and suffer from poor generalization due to the lack of online training and update strategies. The ATOM tracker [8] performs classification and target estimation with the aid of an online classifier and an offline IoU-predictor, respectively. This method first discriminates the target from its background; then the generated proposals around the estimated location are refined by the IoU-predictor. Based on a model prediction network, the DiMP tracker [3] learns a robust target model by employing a discriminative loss function and an iterative optimization strategy with few steps. The performance of these trackers degrades dramatically on videos captured from medium- and high-altitude aerial views since they have no strategy to deal with the resulting challenges. For example, the limited information of small targets, the dense distribution of distractors, large viewpoint changes, and low resolution lead to tracking failures of conventional methods.

3 Our Approach

This section presents an overview of the proposed method, its motivation, and a detailed description of the main contributions. A graphical overview of the proposed offline training and online tracking is shown in Fig. 1. The proposed framework mainly consists of an offline proposal generation strategy and a two-stream multitask network built from lightweight modules for small object tracking. The proposed proposal generation strategy helps the network to learn a generalized target model and to handle occlusion and viewpoint change with the aid of context information. This strategy is applied only during offline training of the network to avoid extra computational burden in online tracking.

Figure 1: Overview of the proposed method in the offline training and online tracking phases.

3.1 Motivation

Tracking small objects, which mostly arises in videos taken from an aerial view, has three major difficulties: 1) a lack of appearance information to distinguish targets from the background or distractors, 2) a much larger space of possible locations (i.e., a stringent localization requirement), and 3) limited knowledge and experience from previous efforts. Also, the challenges of small object tracking usually originate from UAV videos captured from medium (30-70 m) and high (>70 m) altitudes. While medium altitudes involve more view angles, high altitudes cause less clarity, tinier objects, and more complex scenarios [63].

A key motivation is to address these issues by adapting small object detection schemes into the network architecture for tracking purposes. The aim is to track small objects from medium or high altitudes in complex scenarios. Thus, the proposed method employs an offline proposal generation strategy and a two-stream multitask network, described in the following subsections.

3.2 Offline Proposal Generation Strategy

The eventual goal of proposal generation strategies is to provide a set of candidate detection regions, which are possible locations of objects. There are various category-dependent strategies for proposal generation [20, 40, 28]. For instance, IoU-Net [28] jitters the ground-truth boxes instead of using region proposal networks (RPNs) to provide better performance and robustness. Also, ATOM [8] uses the IoU-Net proposal generation strategy with a modulation vector to integrate target-specific information into its network.

Motivated by IoU-Net [28] and ATOM [8], an offline proposal generation strategy is proposed to extract context information of the target from the reference frame. The ATOM tracker generates target proposals from the test frame, given the target location in that frame. The target proposals are produced by jittering the ground-truth location in offline training; in online tracking, the rough location is obtained by an online target classification network. A network is then trained to predict the IoU values between the test proposals and the object, given the BB of the target in the reference frame. Finally, the network in ATOM is trained by minimizing the mean squared error between the predicted and ground-truth IoUs.

In this work, the proposed strategy additionally provides target patches with background support regions from the reference frame to address the challenging problems of small object tracking. These reference proposals are exploited only in offline training. Exploiting context features and target parts assists the proposed network (Sec. 3.3) in handling occlusion and viewpoint change for small objects. For simplicity, we describe the proposed offline proposal generation strategy in terms of the IoU-prediction process; however, the proposed network simultaneously predicts both the IoU and the center location error (CLE) of the test proposals. An overview of the offline proposal generation process for IoU-prediction is shown in Algorithm 1.

Notations: Bounding box (B_t for a test frame, B_r for a reference frame), IoU threshold (T_t for a test frame, T_r for a reference frame), number of proposals (N_t for a test frame, N_r for a reference frame), iteration number (i), maximum number of iterations (i_max), a Gaussian distribution with zero mean and a randomly selected variance (N(0, Σ)), bounding box proposals generated by Gaussian jittering (P_t for a test frame, P_r for a reference frame)
Input: B_t, B_r, T_t, T_r, N_t, N_r, i_max
for each of the N_t (or N_r) proposals do
      repeat P ← B + N(0, Σ), i ← i + 1 while IoU(P, B) < T and i < i_max
end for
Algorithm 1 Offline Proposal Generation
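The jittering loop of this strategy can be sketched as follows; the IoU threshold, jitter variance, and iteration cap below are illustrative stand-ins, not the paper's (unreported) values:

```python
import random

def iou(a, b):
    """Axis-aligned IoU of two (x, y, w, h) boxes."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def generate_proposals(gt, n_proposals, iou_thresh, max_iters=50, sigma=0.2):
    """Gaussian-jitter the ground-truth box; resample each candidate until its
    IoU with the ground truth reaches iou_thresh or max_iters is exhausted."""
    x, y, w, h = gt
    proposals = []
    for _ in range(n_proposals):
        for _ in range(max_iters):
            cand = (x + random.gauss(0.0, sigma) * w,
                    y + random.gauss(0.0, sigma) * h,
                    w * (1.0 + random.gauss(0.0, sigma)),
                    h * (1.0 + random.gauss(0.0, sigma)))
            if cand[2] > 0 and cand[3] > 0 and iou(cand, gt) >= iou_thresh:
                proposals.append(cand)
                break
        else:  # no acceptable jitter found: fall back to the ground truth
            proposals.append(gt)
    return proposals
```

Rejecting low-overlap jitters is what keeps the proposals anchored to the target rather than to nearby distractors.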

The proposed strategy generates target proposals from the reference frame by jittering the reference BB, keeping only proposals whose overlap with the ground truth exceeds an IoU threshold to prevent drift toward visual distractors. The proposed tracker exploits this information (especially in challenging scenarios involving occlusion and viewpoint change) to avoid confusion during target tracking. The reference BB and its proposals are passed through the reference branch of the proposed network simultaneously (Sec. 3.3). In this work, an extended modulation vector is introduced to feed the representations of the target and its parts into the network; it is a set of modulation vectors, each encoding the information of one reference proposal. To compute the IoU-prediction, the features of the test patch are modulated by the features of the target and its parts. That is, the IoU of each test proposal is predicted once per reference proposal, giving N_r × N_t predictions in total, where N_r and N_t denote the numbers of reference and test proposals. Instead of running N_r separate prediction passes, the extended modulation vector allows the N_r groups of IoU-predictions to be computed at once. The proposed network is then trained by minimizing the mean squared error of each group with respect to the ground-truth IoUs. During online tracking, the proposed method does not generate reference proposals and uses only the initial target BB to predict a single group of IoU-predictions for the generated test proposals. Therefore, the proposed strategy can effectively alleviate occlusion and viewpoint change problems (see supplementary material) with no extra computational complexity in online tracking.
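The "all groups at once" computation amounts to ordinary broadcasting: one channel-coefficient vector per reference proposal rescales the shared test-branch features. A NumPy sketch with illustrative shapes (the real coefficients come from the network's PrRoI-pooled reference features):

```python
import numpy as np

def modulate(test_feat, ref_coeffs):
    """Modulate test-branch features (C, H, W) by per-proposal reference
    coefficients (N_r, C): a single broadcasted multiply yields one modulated
    feature map per reference proposal, output shape (N_r, C, H, W)."""
    return ref_coeffs[:, :, None, None] * test_feat[None, :, :, :]
```

Each slice of the output is the test feature map reweighted by one reference proposal's coefficients, so all N_r prediction groups share a single pass over the test features.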

3.3 Multitask Two-Stream Network

Tracking small objects from an aerial view involves extra difficulties, such as unclear target appearance, fast viewpoint change, and drastic rotations, in addition to the usual tracking challenges. The aim of this part is to design an architecture that handles the challenges of small object tracking by considering recent advances in small object detection. Inspired by [28, 8, 62, 38, 45], a two-stream network is proposed (see Fig. 2), which comprises multi-scale processing and aggregation of features, fusion of hierarchical information, a spatial attention module, and a channel attention module. The proposed network seeks to maximize the IoU between the estimated BBs and the object while minimizing their location distance. Hence, it exploits a multitask loss function, optimized to consider both the accuracy and the robustness of the estimated target BBs. In the following, the proposed architecture and the roles of its main components are described.

The proposed network adopts ResNet-50 [23] to provide backbone features for the reference and test branches. Following small object detection methods [55, 62], only the features from block3 and block4 of ResNet-50 are extracted, to exploit both spatial and semantic features while controlling the number of parameters. The proposed network then employs a multi-scale aggregation and fusion module (MSAF) to capture hierarchical information of the target from different layers, and fuses the spatially rich and semantically rich features before applying the attention modules. The MSAF processes spatial information via an InceptionV3 module [53], which performs factorized asymmetric convolutions on target regions. This low-cost multi-scale processing helps the network to approximate optimal filters appropriate for small object tracking. Semantic features are passed through convolution and deconvolution layers to be refined and resized for feature fusion. The resulting hierarchical information is fused by element-wise addition of the spatial and semantic feature maps. After feature fusion, the number of channels is reduced by convolution layers to limit the network parameters. Although exploiting multi-scale features does not considerably affect the tracking of large targets, it effectively helps for small objects, which may occupy only a tiny fraction of the frame's pixels.

Next, the proposed network utilizes a modified version of the bottleneck attention module (BAM) [45], which has a simple and lightweight architecture. It emphasizes target-related spatial and channel information and suppresses distractors and redundant information, which are common in aerial images [62]. The BAM comprises a channel attention branch, a spatial attention branch, and an identity shortcut connection. In this work, SENet [26] is employed as the channel attention branch, which uses global average pooling (GAP) and a multi-layer perceptron to find the optimal combination of channels. The spatial attention branch utilizes dilated convolutions to increase the receptive field, which helps the network consider context information for small object tracking. In effect, the spatial and channel attention modules answer "where" the critical features are located and "what" the relevant features are, respectively. Lastly, the identity shortcut connection helps with gradient flow.
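The SENet-style channel branch reduces to three steps: squeeze (GAP), excite (bottleneck MLP), and rescale. A minimal NumPy sketch, with hypothetical weight matrices w1 (reduction) and w2 (expansion) standing in for the learned MLP:

```python
import numpy as np

def se_channel_attention(feat, w1, w2):
    """SENet-style channel attention on a (C, H, W) feature map:
    global average pooling -> bottleneck MLP (ReLU) -> sigmoid channel gates."""
    squeeze = feat.mean(axis=(1, 2))                # (C,)  per-channel summary
    hidden = np.maximum(0.0, w1 @ squeeze)          # (C//r,) reduced bottleneck
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # (C,)  per-channel importance
    return feat * gates[:, None, None]              # reweight each channel
```

Because the gates lie in (0, 1), the module can only attenuate channels, never amplify them; the identity shortcut in the BAM restores the un-attenuated signal path.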

Figure 2: Overview of the proposed network (COMET). MSAF denotes the multi-scale aggregation and fusion module. The fully-connected block, global average pooling block, and linear layer are denoted as FC, GAP, and Linear, respectively.

After that, the proposed method generates proposals from the test frame and, in offline training, uses the proposed proposal generation strategy to extract BBs for the target and its parts from the reference frame. These generated BBs are applied to the resulting feature maps and fed into a precise region-of-interest (PrRoI) pooling layer [28], which is differentiable w.r.t. the BB coordinates. The proposed network uses a convolutional layer to convert the PrRoI output into target appearance coefficients. To merge the information of the target and its parts into the test branch, the target coefficients are expanded and multiplied with the features of the test patch; that is, target-specific information is applied to the test branch via the proposed extended modulation vector. The test proposals are then applied to the features of the test branch and fed to a PrRoI pooling layer. Finally, the proposed network simultaneously predicts the IoU and CLE of the test proposals by optimizing a multitask loss function

L_total = L_IoU + λ L_CLE,   (1)

where L_IoU, L_CLE, and λ represent the loss function of the IoU-prediction head, the loss function of the CLE-prediction head, and a balancing hyper-parameter, respectively. These loss functions are defined as the mean squared errors

L_IoU = (1/N_t) Σ_j (IoU_pred(j) − IoU_gt(j))²,   L_CLE = (1/N_t) Σ_j (CLE_pred(j) − CLE_gt(j))²,   (2)

where N_t is the number of test proposals, IoU_gt(j) is the ground-truth overlap of the j-th test proposal, and CLE_gt(j) is the normalized distance between its center and that of the ground-truth box. Also, CLE_pred denotes the predicted CLE between the BB estimations and the target, given an initial BB in the reference frame. In offline training, the proposed network optimizes the loss function (1) to learn to predict the target BB from the pattern of generated proposals. The online optimization process is different. In online tracking, the target BB from the first frame (similar to [35, 59, 34, 8]) and the target proposals in the test frame are passed through the network. As a result, there is just one group of CLE-predictions and one group of IoU-predictions, avoiding extra computational complexity. In this phase, the aim is to maximize the IoU-prediction of the test proposals via gradient ascent while minimizing their CLE-prediction via gradient descent. Algorithm 2 describes the online tracking process in detail, showing how the proposed method passes inputs through the network and updates the BB coordinates based on scaled back-propagated gradients. While the IoU-gradients are scaled up by the BB sizes to optimize in a log-scaled domain as in [28], only the x and y coordinates of the test BBs are scaled for the CLE-gradients. This experimentally achieved better results compared to applying the same scaling used for the IoU-gradients. The intuitive reason is that the network has learned the normalized location differences between the BB estimations and the target BB; that is, the CLE-prediction is responsible for accurate localization, whereas the IoU-prediction robustly determines the BB aspect ratio. After refining the test proposals for a fixed number of iterations, the proposed method selects the best BBs according to their IoU-scores and uses the average of these predictions as the final target BB.
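The multitask objective is simply two mean-squared errors balanced by λ; a sketch, where the default λ = 1.0 is an illustrative choice rather than the paper's (unreported) value:

```python
import numpy as np

def multitask_loss(iou_pred, iou_gt, cle_pred, cle_gt, lam=1.0):
    """L_total = L_IoU + lam * L_CLE, each term an MSE over the test proposals."""
    l_iou = float(np.mean((np.asarray(iou_pred) - np.asarray(iou_gt)) ** 2))
    l_cle = float(np.mean((np.asarray(cle_pred) - np.asarray(cle_gt)) ** 2))
    return l_iou + lam * l_cle
```

Raising λ shifts the training emphasis from overlap accuracy toward center localization.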

Notations: Input sequence (S), sequence length (L), initial frame and target bounding box (f_1, B_1), current frame (f_k), rough bounding box estimate (B_est), generated test proposals (P_t), concatenated bounding boxes (B), bounding box prediction (B*), step size (β), number of refinement iterations (K), online classification network (Φ), scale and center jittering with random factors (Jitter), network outputs (IoU and CLE), decay rate (α)
Input: S, L
Output: B* for each frame
for k = 1 to L do
      B_est ← Φ(f_k)
      P_t ← Jitter(B_est), B ← concat(B_est, P_t)
      for i = 1 to K do
            IoU, CLE = FeedForward(f_1, B_1, f_k, B)
            B ← B + β · ∇_B IoU − β · ∇_B CLE
            β ← α · β
      end for
      Select the best BBs w.r.t. IoU-scores and average them to obtain B*
end for
Algorithm 2 Online Tracking
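The inner refinement loop is essentially hill-climbing on the network's predicted score. The sketch below substitutes a numerical central-difference gradient of a generic score_fn for the back-propagated IoU-gradients, and scales the steps by box size in the spirit of IoU-Net; the step count, step size, and decay rate are illustrative assumptions:

```python
def refine_box(box, score_fn, steps=5, lr=0.1, decay=0.8, eps=1e-3):
    """Refine an (x, y, w, h) box by gradient ascent on score_fn(box),
    with size-scaled updates and a decaying step size."""
    box = list(box)
    for _ in range(steps):
        grad = []
        for i in range(4):                        # central-difference gradient
            hi, lo = list(box), list(box)
            hi[i] += eps
            lo[i] -= eps
            grad.append((score_fn(hi) - score_fn(lo)) / (2 * eps))
        scale = [box[2], box[3], box[2], box[3]]  # relative (size-scaled) steps
        box = [b + lr * g * s for b, g, s in zip(box, grad, scale)]
        lr *= decay                               # decay the step size
    return tuple(box)
```

With a smooth score peaked at the true target, a few such steps move an offset box measurably closer to the optimum.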

4 Empirical Evaluation

In this section, the proposed method is evaluated on three state-of-the-art benchmarks for small object tracking from aerial views: VisDrone-2019 (35 videos) [14], UAVDT (50 videos) [13], and Small-90 (90 videos) [39]. The video sequences in the VisDrone-2019, UAVDT, and Small-90 benchmarks are annotated with twelve, eight, and eleven attributes, respectively. The proposed method is compared with state-of-the-art visual trackers in terms of the precision and success metrics [60]. Note that the Small-90 dataset includes the challenging videos of the UAV-123 dataset [43] that contain small objects. Besides, UAV-123 is a low-altitude UAV dataset [63, 43] that lacks variety in small objects, camera motion, and real scenes [63]. Furthermore, generic tracking datasets do not contain challenges such as tiny objects, significant viewpoint changes, camera motion, and high density. For instance, these datasets (e.g., OTB [60], VOT [31, 30]) provide videos captured by fixed or car-mounted cameras with limited viewing angles. For these reasons, and given our focus on small object tracking in videos captured by UAVs from medium and high altitudes, the proposed tracker is evaluated on three related benchmarks to demonstrate the motivation and effectiveness of COMET for small object tracking. In the following, implementation details, ablation analyses, and state-of-the-art comparisons on small object tracking benchmarks are presented.
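The two metrics [60] can be computed per frame as below; the 20-pixel center-error threshold and the 0.5 IoU threshold are the conventional reporting points, used here as assumptions (the full success plot sweeps the IoU threshold and reports the area under the curve):

```python
def precision_success(pred_boxes, gt_boxes, center_thresh=20.0, iou_thresh=0.5):
    """One-pass evaluation: precision = fraction of frames whose center error
    is within center_thresh pixels; success = fraction whose IoU exceeds
    iou_thresh (a single point on the success curve)."""
    def center(b):
        return (b[0] + b[2] / 2.0, b[1] + b[3] / 2.0)

    def iou(a, b):
        ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union > 0 else 0.0

    n = len(pred_boxes)
    prec = sum(((center(p)[0] - center(g)[0]) ** 2 +
                (center(p)[1] - center(g)[1]) ** 2) ** 0.5 <= center_thresh
               for p, g in zip(pred_boxes, gt_boxes)) / n
    succ = sum(iou(p, g) >= iou_thresh
               for p, g in zip(pred_boxes, gt_boxes)) / n
    return prec, succ
```

For example, a sequence where the tracker is exact on one frame and lost on the other scores 0.5 on both metrics.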

4.1 Implementation Details

The proposed method uses block3 and block4 of ResNet-50 pre-trained on ImageNet [48] to extract backbone features. For offline proposal generation, eight reference proposals are used (seven generated proposals plus the reference ground-truth), and image sample pairs are randomly selected from videos within a maximum frame gap. From the reference image, a square patch centered on the target is extracted with an area several times that of the target region. Flipping and color jittering are used for data augmentation of the reference patch. To extract the search area, a patch (with an area proportional to the test target scale) with some perturbation in position and scale is sampled from the test image. These cropped regions are then resized to a fixed input size, and the IoU and CLE values are normalized.

The maximum numbers of jittering iterations are set separately for the reference and test proposals. The weights of the backbone network are frozen, and the other weights are initialized using [22]. The training splits are taken from the official training set (protocol II) of LaSOT [16], the training set of GOT-10K [27], NfS [19], and the training set of VisDrone-2019 [14]. Moreover, the validation splits of the VisDrone-2019 and GOT-10K datasets have been used in the training phase. The network is trained end-to-end with the ADAM optimizer [29], with the learning rate decayed periodically over the training epochs. The proposed tracker is implemented in PyTorch and runs on an Nvidia Tesla V100 GPU. Finally, the parameters of the online classification network are set as in the ATOM tracker [8]. All experimental results are publicly available.

4.2 Ablation Analysis

A systematic ablation study on the individual components of the proposed tracker has been conducted on the challenging UAVDT dataset [13] (see Fig. 7). It includes three reduced versions of the proposed network, obtained by removing 1) the CLE-head, 2) the CLE-head and reference proposal generation, and 3) the CLE-head, reference proposal generation, and attention module, referred to as A1, A2, and A3, respectively. Moreover, two alternative feature fusion operations have been investigated, namely feature multiplication and feature concatenation, compared against COMET's element-wise addition of feature maps.

Based on these experiments, A1, A2, and A3 each achieve lower precision and success rates than the full tracker, demonstrating the effect of each component on tracking performance. According to these results, the attention module, the reference proposal generation strategy, and the CLE-head each improve both the success and precision rates of the proposed tracker. Besides, comparing the feature fusion operations demonstrates that element-wise addition provides the best fusion method, which has also been observed by other methods such as [8].

Figure 7: Ablation analysis of COMET based on different components (top row) and feature fusion operations (bottom row) on UAVDT dataset.

4.3 State-of-the-art Comparison

For a quantitative comparison, the proposed method (COMET) is compared with state-of-the-art visual tracking methods, including ECO [9], ATOM [8], DiMP [3], SiamRPN++ [34], SiamMask [59], C-COT [11], CREST [51], MDNet [44], GOTURN [24], PTAV [15], MCPF [64], SRDCF [10], and CFNet [57], on three state-of-the-art datasets for small object tracking. Fig. 8 shows the results on the three recent small object datasets in terms of precision and success plots [60]. Based on Fig. 8, the proposed method outperforms recent top-performing visual trackers on these challenging small object datasets. The Small-90 dataset is collected from other visual tracking datasets such as UAV-123 [43], OTB-2015 [60], and TC-128 [36]; this heterogeneity leads to smaller differences between the performance of the various trackers. The proposed method shows better performance than the IoU-guided ATOM tracker. Moreover, Table 1 presents an attribute-based comparison of the visual tracking methods on the UAVDT dataset, which consists of 80K video frames and focuses on complex scenarios. Table 1 compares the trackers according to eight attributes: background clutter (BC), illumination variation (IV), scale variation (SV), camera motion (CM), object motion (OM), small object (SO), object blur (OB), and large occlusion (LO). These results demonstrate the advantage of the proposed method over the state-of-the-art visual tracking methods. In particular, the table shows that the proposed method successfully handles the occlusion problem and is well suited to small object tracking. Qualitative comparisons are shown in Fig. 9.

Figure 8: Overall precision and success comparisons of the proposed method (COMET) with state-of-the-art tracking methods on UAVDT, VisDrone-2019, and Small-90 datasets (top to bottom row, respectively).
Metric Tracker BC IV SV CM OM SO OB LO
Precision SiamRPN [34] 74.9 89.7 80.1 75.9 80.4 83.5 89.4 66.6
SiamMask [59] 71.6 86.4 77.3 76.7 77.8 86.7 86.0 60.1
ATOM [8] 70.1 80.8 73.0 77.2 73.4 80.6 74.9 66.0
DiMP-50 [3] 71.1 84.3 76.1 80.3 75.8 81.4 79.0 68.6
C-COT [11] 55.7 72.0 55.9 62.3 56.1 79.2 66.2 46.0
ECO [9] 61.1 76.9 63.2 64.4 62.7 79.1 71.0 50.8
MDNet [44] 63.6 76.4 68.5 69.6 66.8 78.4 72.4 54.7
CREST [51] 56.2 69.0 56.7 62.1 55.8 74.2 65.6 49.7
GOTURN [24] 61.1 76.9 63.2 64.4 62.7 79.1 71.0 50.8
PTAV [15] 57.2 69.6 56.5 63.9 56.4 79.1 66.2 50.3
SRDCF [10] 58.2 74.7 59.6 64.2 60.0 76.4 70.6 46.0
CFNet [57] 56.7 72.7 61.1 64.3 59.9 77.5 71.7 44.7
MCPF [64] 51.2 73.1 55.1 59.2 55.3 74.5 73.0 42.5
COMET 82.9 88.3 89.3 86.0 89.8 91.9 87.7 78.3
Success SiamRPN [34] 68.6 83.5 74.6 71.3 74.7 75.9 83.6 61.1
SiamMask [59] 65.8 79.8 72.3 72.3 71.0 77.9 78.0 57.0
ATOM [8] 60.2 70.3 67.3 70.7 63.9 68.7 65.5 59.7
DiMP-50 [3] 62.1 75.6 46.5 73.0 66.7 70.2 71.3 63.9
C-COT [11] 34.6 44.1 39.4 41.8 30.9 48.4 38.8 34.8
ECO [9] 41.0 52.2 45.2 46.7 37.6 51.8 46.4 36.4
MDNet [44] 41.2 53.3 46.5 44.2 41.6 47.5 52.1 35.5
CREST [51] 34.0 43.5 34.3 40.4 33.8 38.0 37.3 35.1
GOTURN [24] 41.0 52.2 45.2 46.7 37.6 51.8 46.4 36.4
PTAV [15] 30.1 38.3 31.6 34.3 26.7 36.4 36.7 31.8
SRDCF [10] 35.4 48.8 40.6 39.9 34.9 45.2 45.6 30.7
CFNet [57] 38.5 48.9 45.4 43.0 35.8 49.7 47.8 36.0
MCPF [64] 32.3 45.5 36.6 39.8 32.2 44.8 46.1 28.1
COMET 75.1 79.7 84.5 79.7 81.8 79.1 80.2 75.2
Table 1: Attribute-based comparison of state-of-the-art visual tracking methods on the UAVDT dataset.

Although the proposed network uses the online classification network of the ATOM tracker [8] to obtain an initial estimate of the target location, the proposed method achieves considerably better accuracy and robustness. Both SiamRPN [34] and SiamMask [59] are built on the SiamRPN [35] tracker: they comprise a Siamese sub-network and a region proposal sub-network to exploit deep features and target proposals, and both employ ResNet-50 as the backbone network. While SiamRPN exploits classification and regression branches, SiamMask adds a segmentation branch to produce a binary target mask. Despite their desirable performance on conventional visual tracking datasets, these trackers are not designed to deal with the challenging scenarios of videos captured from an aerial view. Therefore, the proposed COMET tracker uses a different network architecture that exploits the proposed proposal generation strategy, which provides context information about the target and its parts. The network also utilizes dedicated modules for multi-scale feature aggregation and feature fusion.

Figure 9: Qualitative comparison of proposed COMET tracker with state-of-the-art tracking methods on S1202, S0602, and S0801 video sequences from UAVDT dataset (top to bottom row, respectively).

Moreover, it exploits attention modules to focus on essential channels and spatial features. Finally, the proposed method jointly considers IoU-predictions and CLE-predictions to achieve an effective BB refinement strategy: the fitted target BB is obtained by maximizing the IoU-prediction and minimizing the CLE-prediction. To preserve computational efficiency, no additional reference proposals are generated during online tracking.
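The idea of refining a box by maximizing a predicted overlap score (as in ATOM-style IoU maximization [8, 28]) can be sketched with a toy example. Note the hedge: the actual trackers backpropagate through a learned IoU-prediction head, whereas this illustration ascends the true IoU of a known target box using finite-difference gradients; the boxes and step sizes are hypothetical.

```python
import numpy as np

def iou_xywh(a, b):
    """IoU of two (x, y, w, h) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(x2 - x1, 0.0) * max(y2 - y1, 0.0)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def refine_box(box, score_fn, steps=50, lr=1.0, eps=1e-3):
    """Gradient-ascent refinement of a box under a scalar score.

    `score_fn` stands in for the network's IoU-prediction head; gradients
    are approximated here by central finite differences for illustration.
    """
    box = np.array(box, dtype=float)  # copy so the caller's box is untouched
    for _ in range(steps):
        grad = np.zeros(4)
        for i in range(4):
            hi, lo = box.copy(), box.copy()
            hi[i] += eps
            lo[i] -= eps
            grad[i] = (score_fn(hi) - score_fn(lo)) / (2 * eps)
        box += lr * grad  # step uphill on the predicted score
    return box

target = np.array([50.0, 40.0, 30.0, 20.0])  # hypothetical ground-truth box
init = np.array([56.0, 44.0, 26.0, 24.0])    # rough initial estimate
refined = refine_box(init, lambda b: iou_xywh(b, target))
assert iou_xywh(refined, target) > iou_xywh(init, target)
```

Each ascent step nudges the box toward higher overlap, which is the same mechanism the learned IoU head enables end-to-end.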

5 Conclusion

A context-aware IoU-guided tracker was proposed that includes an offline proposal generation strategy and a two-stream multitask network. It aims to track small objects in videos captured from medium- and high-altitude aerial views. First, the proposed proposal generation strategy provides context information so that the network learns the target and its parts. This strategy effectively helps the network to handle occlusion and viewpoint change in high-density videos with a broad view angle, in which only some parts of the target are visible. Moreover, the proposed network exploits multi-scale feature aggregation and attention modules to learn multi-scale features and suppress visual distractors. Finally, the proposed multitask loss function accurately estimates the target region by considering both its IoU- and CLE-predictions. The remarkable performance of the proposed tracker on three medium- and high-altitude aerial view tracking datasets demonstrates the motivation and effectiveness of the proposed components for small object tracking.


  • [1] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem (2018) SOD-MTGAN: small object detection via multi-task generative adversarial network. In Proc. ECCV, Cited by: §1.
  • [2] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H.S. Torr (2016) Fully-convolutional Siamese networks for object tracking. In Proc. ECCV, pp. 850–865. Cited by: §2.
  • [3] G. Bhat, M. Danelljan, L. V. Gool, and R. Timofte (2019) Learning discriminative model prediction for tracking. External Links: Link, arXiv:1904.07220 Cited by: §2, §4.3, Table 1.
  • [4] D. Bouget, M. Allan, D. Stoyanov, and P. Jannin (2017) Vision-based and marker-less surgical tool detection and tracking: A review of the literature. Medical Image Analysis 35, pp. 633–654. Cited by: §1.
  • [5] M. Chang, J. Lambert, P. Sangkloy, J. Singh, B. Sławomir, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, and J. Hays (2019) Argoverse: 3D tracking and forecasting with rich maps. In Proc. IEEE CVPR, pp. 8748–8757. Cited by: §1.
  • [6] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Return of the devil in the details: Delving deep into convolutional nets. In Proc. BMVC, pp. 1–11. Cited by: §2.
  • [7] L. Cui, R. Ma, P. Lv, X. Jiang, Z. Gao, B. Zhou, and M. Xu (2018) MDSSD: multi-scale deconvolutional single shot detector for small objects. External Links: Link, arXiv:1805.07009v3 Cited by: §1.
  • [8] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg (2018) ATOM: accurate tracking by overlap maximization. External Links: Link, arXiv:1811.07628 Cited by: §1, §2, §3.2, §3.2, §3.3, §3.3, §4.1, §4.2, §4.3, §4.3, Table 1.
  • [9] M. Danelljan, G. Bhat, F. Shahbaz Khan, and M. Felsberg (2017) ECO: Efficient convolution operators for tracking. In Proc. IEEE CVPR, pp. 6931–6939. Cited by: §2, §4.3, Table 1.
  • [10] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg (2015) Learning spatially regularized correlation filters for visual tracking. In Proc. IEEE ICCV, pp. 4310–4318. Cited by: §4.3, Table 1.
  • [11] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg (2016) Beyond correlation filters: Learning continuous convolution operators for visual tracking. In Proc. ECCV, Vol. 9909 LNCS, pp. 472–488. Cited by: §2, §4.3, Table 1.
  • [12] X. Dong and J. Shen (2018) Triplet loss in Siamese network for object tracking. In Proc. ECCV, pp. 472–488. Cited by: §2.
  • [13] D. Du, Y. Qi, H. Yu, Y. Yang, K. Duan, G. Li, W. Zhang, Q. Huang, and Q. Tian (2018) The unmanned aerial vehicle benchmark: object detection and tracking. In Proc. ECCV, pp. 375–391. Cited by: §1, §1, §4.
  • [14] D. Du, P. Zhu, L. Wen, X. Bian, H. Ling, and et al. (2019) VisDrone-sot2019: the vision meets drone single object tracking challenge results. In Proc. ICCVW, Cited by: §1, §1, §4.1, §4.
  • [15] H. Fan and H. Ling (2019) Parallel tracking and verifying. IEEE Trans. Image Process. 28 (8), pp. 4130–4144. Cited by: §4.3, Table 1.
  • [16] H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and H. Ling (2018) LaSOT: a high-quality benchmark for large-scale single object tracking. External Links: Link, arXiv:1809.07845 Cited by: §4.1.
  • [17] H. Fan and H. Ling (2018) Siamese cascaded region proposal networks for real-time visual tracking. External Links: Link, arXiv:1812.06148 Cited by: §2.
  • [18] C. Fu, W. Liu, A. Ranga, A. Tyagi, and A.C. Berg (2017) DSSD: deconvolutional single shot detector. External Links: Link, arXiv:1701.06659 Cited by: §1.
  • [19] H. K. Galoogahi, A. Fagg, C. Huang, D. Ramanan, and S. Lucey (2017) Need for speed: A benchmark for higher frame rate object tracking. In Proc. IEEE ICCV, pp. 1134–1143. Cited by: §1, §4.1.
  • [20] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. IEEE CVPR, pp. 580–587. Cited by: §3.2.
  • [21] Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, and S. Wang (2017) Learning dynamic Siamese network for visual object tracking. In Proc. IEEE ICCV, pp. 1781–1789. Cited by: §2.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proc. ICCV, pp. 1026–1034. Cited by: §4.1.
  • [23] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. IEEE CVPR, pp. 770–778. Cited by: §2, §3.3.
  • [24] D. Held, S. Thrun, and S. Savarese (2016) Learning to track at 100 FPS with deep regression networks. In Proc. ECCV, pp. 749–765. Cited by: §2, §4.3, Table 1.
  • [25] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista (2015) High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 37 (3), pp. 583–596. Cited by: §2.
  • [26] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu (2019) Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. External Links: Document Cited by: §3.3.
  • [27] L. Huang, X. Zhao, and K. Huang (2018) GOT-10k: a large high-diversity benchmark for generic object tracking in the wild. External Links: Link, arXiv:1810.11981 Cited by: §4.1.
  • [28] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang (2018) Acquisition of localization confidence for accurate object detection. In Proc. IEEE ECCV, pp. 816–832. Cited by: §3.2, §3.2, §3.3, §3.3.
  • [29] D. P. Kingma and J. Ba (2014) ADAM: a method for stochastic optimization. In Proc. ICLR, Cited by: §4.1.
  • [30] M. Kristan and et al. (2019) The seventh visual object tracking VOT2019 challenge results. In Proc. ICCVW, Cited by: §4.
  • [31] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, and et al. (2019) The sixth visual object tracking VOT2018 challenge results. In Proc. ECCVW, pp. 3–53. Cited by: §1, §4.
  • [32] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Proc. NIPS, Vol. 2, pp. 1097–1105. Cited by: §2.
  • [33] R. LaLonde, D. Zhang, and M. Shah (2018) ClusterNet: detecting small objects in large scenes by exploiting spatio-temporal information. In Proc. CVPR, Cited by: §1.
  • [34] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan (2018) SiamRPN++: evolution of Siamese visual tracking with very deep networks. External Links: Link, arXiv:1812.11703 Cited by: §2, §3.3, §4.3, §4.3, Table 1.
  • [35] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu (2018) High performance visual tracking with Siamese region proposal network. In Proc. IEEE CVPR, pp. 8971–8980. Cited by: §2, §3.3, §4.3.
  • [36] P. Liang, E. Blasch, and H. Ling (2015) Encoding color information for visual tracking: Algorithms and benchmark. IEEE Trans. Image Process. 24 (12), pp. 5630–5644. Cited by: §1, §4.3.
  • [37] J. Lim, M. Astrid, H. Yoon, and S. Lee (2019) Small object detection using context and attention. External Links: Link, arXiv:1912.06319v2 Cited by: §1.
  • [38] J. Lim, M. Astrid, H. Yoon, and S. Lee (2019) Small object detection using context and attention. External Links: Link, arXiv:1912.06319v2 Cited by: §3.3.
  • [39] C. Liu, W. Ding, J. Yang, et al. (2020) Aggregation signature for small object tracking. IEEE Trans. Image Process. 29, pp. 1738–1747. Cited by: §1, §4.
  • [40] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In Proc. ECCV, pp. 21–37. Cited by: §1, §3.2.
  • [41] J. Luo, Y. Han, and L. Fan (2018) Underwater acoustic target tracking: A review. Sensors 18 (1), pp. 112. Cited by: §1.
  • [42] M. Manafifard, H. Ebadi, and H. Abrishami Moghaddam (2017) A survey on player tracking in soccer videos. Comput. Vis. Image Und. 159, pp. 19–46. Cited by: §1.
  • [43] M. Mueller, N. Smith, and B. Ghanem (2016) A benchmark and simulator for UAV tracking. In Proc. ECCV, pp. 445–461. Cited by: §4.3, §4.
  • [44] H. Nam and B. Han (2016) Learning multi-domain convolutional neural networks for visual tracking. In Proc. IEEE CVPR, pp. 4293–4302. Cited by: §4.3, Table 1.
  • [45] J. Park, S. Woo, J. Lee, and I. S. Kweon (2018) BAM: bottleneck attention module. In Proc. BMVC, pp. 147–161. Cited by: §3.3, §3.3.
  • [46] B. Risse, M. Mangan, B. Webb, and L. Del Pero (2018) Visual tracking of small animals in cluttered natural environments using a freely moving camera. In Proc. IEEE ICCVW, pp. 2840–2849. Cited by: §1.
  • [47] C. Robin and S. Lacroix (2016) Multi-robot target detection and tracking: Taxonomy and survey. Autonomous Robots 40 (4), pp. 729–760. Cited by: §1.
  • [48] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet large scale visual recognition challenge. IJCV 115 (3), pp. 211–252. Cited by: §4.1.
  • [49] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. In Proc. ICLR, pp. 1–14. Cited by: §2.
  • [50] B. Singh and L. S. Davis (2018) An analysis of scale invariance in object detection – SNIP. In Proc. CVPR, Cited by: §1.
  • [51] Y. Song, C. Ma, L. Gong, J. Zhang, R. W.H. Lau, and M. H. Yang (2017) CREST: Convolutional residual learning for visual tracking. In Proc. ICCV, pp. 2574–2583. Cited by: §4.3, Table 1.
  • [52] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proc. IEEE CVPR, pp. 1–9. Cited by: §2.
  • [53] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proc. CVPR, pp. 2818–2826. Cited by: §3.3.
  • [54] R. Tao, E. Gavves, and A. W.M. Smeulders (2016) Siamese instance search for tracking. In Proc. IEEE CVPR, pp. 1420–1429. Cited by: §2.
  • [55] K. Tong, Y. Wu, and F. Zhou (2020) Recent advances in small object detection based on deep learning: A review. Image and Vision Computing 97. Cited by: §1, §3.3.
  • [56] V. Ulman, M. Maška, and et al. (2017) An objective comparison of cell-tracking algorithms. Nature Methods 14 (12), pp. 1141–1152. Cited by: §1.
  • [57] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H.S. Torr (2017) End-to-end representation learning for correlation filter based tracking. In Proc. IEEE CVPR, pp. 5000–5008. Cited by: §2, §4.3, Table 1.
  • [58] W. Xiang, D. Zhang, H. Yu, and V. Athitsos (2018) Context-aware single-shot detector. In Proc. WACV, pp. 1784–1793. Cited by: §1.
  • [59] Q. Wang, L. Zhang, L. Bertinetto, W. Hu, and P. H. S. Torr (2018) Fast online object tracking and segmentation: a unifying approach. External Links: Link, arXiv:1812.05050 Cited by: §2, §3.3, §4.3, §4.3, Table 1.
  • [60] Y. Wu, J. Lim, and M. Yang (2015) Object tracking benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 37 (9), pp. 1834–1848. Cited by: §1, §4.3, §4.
  • [61] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proc. IEEE CVPR, pp. 5987–5995. Cited by: §2.
  • [62] X. Yang, J. Yang, J. Yan, Y. Zhang, T. Zhang, Z. Guo, X. Sun, and K. Fu (2019) SCRDet: towards more robust detection for small, cluttered and rotated objects. In Proc. IEEE ICCV, Cited by: §1, §3.3, §3.3, §3.3.
  • [63] H. Yu, G. Li, W. Zhang, et al. (2019) The unmanned aerial vehicle benchmark: Object detection, tracking and baseline. Int. J. Comput. Vis. Cited by: §1, §3.1, §4.2, §4.
  • [64] T. Zhang, C. Xu, and M. H. Yang (2017) Multi-task correlation particle filter for robust object tracking. In Proc. IEEE CVPR, pp. 4819–4827. Cited by: §4.3, Table 1.
  • [65] Z. Zhang and H. Peng (2019) Deeper and wider Siamese networks for real-time visual tracking. External Links: Link, arXiv:1901.01660 Cited by: §2.
  • [66] P. Zhu, L. Wen, D. Du, and et al. (2018) VisDrone-vdt2018: the vision meets drone video detection and tracking challenge results. In Proc. ECCVW, pp. 496–518. Cited by: §1.
  • [67] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu (2018) Distractor-aware Siamese networks for visual object tracking. In Proc. ECCV, pp. 103–119. Cited by: §2.