Aggregation Signature for Small Object Tracking

by Chunlei Liu, et al.

Small object tracking is an increasingly important task that has nevertheless remained largely unexplored in computer vision. The main challenges are that 1) small objects exhibit extremely vague and variable appearances, and 2) they are lost more easily than normal-sized objects because of lens shake. In this paper, we propose a novel aggregation signature suited to small object tracking, aimed in particular at the challenge of sudden and large drift. Our contributions are threefold. First, technically, we propose a new saliency-based descriptor, named aggregation signature, that represents highly distinctive features of small objects. Second, theoretically, we prove that the proposed signature matches the foreground object more accurately with high probability. Third, experimentally, the aggregation signature achieves high performance on multiple datasets, outperforming state-of-the-art methods by large margins. Moreover, we contribute two newly collected benchmark datasets, small90 and small112, for visually small object tracking. The datasets will be available in



I Introduction

While several tracking methods have been developed over the past decade [34, 15, 18, 8, 14, 42, 9] and have proven successful in many applications, such as robotics and video surveillance, tracking small objects in videos remains a challenging problem, in particular when complex scenarios and real-time constraints are considered. In this paper, a small object is a target that occupies less than 1% of the whole image. The challenge of small object tracking is rooted in two facts: first, the visual features of small objects are extremely unstable, which makes feature representation difficult; second, compared to normal-sized objects, small objects frequently undergo sudden and large drift because of lens shake. We say that a sudden and large drift occurs when the displacement of the target between two adjacent frames, in the image coordinate system, is more than twice the target size.
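The two working definitions above can be written as simple predicates. The following is a minimal sketch, assuming that "size" refers to the bounding-box area relative to the image area and that the target size used in the drift test is a single scalar (for example, the square root of the box area); both interpretations, as well as the function names, are assumptions made for illustration.

```python
import math


def is_small_object(box_w, box_h, img_w, img_h, max_area_ratio=0.01):
    """Return True if the box covers less than 1% of the image area (assumed area-based)."""
    return (box_w * box_h) / (img_w * img_h) < max_area_ratio


def is_sudden_large_drift(prev_center, curr_center, target_size, factor=2.0):
    """Return True if the displacement between adjacent frames exceeds factor * target_size."""
    dx = curr_center[0] - prev_center[0]
    dy = curr_center[1] - prev_center[1]
    return math.hypot(dx, dy) > factor * target_size
```

For example, a 20x20 target in a 640x480 frame covers about 0.13% of the image and therefore counts as small under this reading.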

For a long time, researchers reported tracking results only on common benchmarks with reasonably sized targets and paid less attention to the small object tracking problem. Only a few existing algorithms address small object tracking, and those were designed to enhance the visual features of such targets, in the hope that tracked objects would no longer be lost if robust features were exploited. For instance, the methods in [2, 13] integrate both spatial and frequency domain features in order to localize the targets more accurately. Alternatively, the method in [1] enhances the robustness of a tracker by strengthening the feature representations (e.g., target attributes) of small targets. Recently, Rozumnyi et al. [31] proposed to deal with the fast motion and motion blur of objects, but the performance is unsatisfactory due to low resolution and complex background clutter. Although deep learning methods [23][22] have been developed, we believe that high-level features are not effective for small objects. Moreover, we doubt that continuous tracking of small objects can be guaranteed even if robust visual features are exploited, considering that small targets are easily confused with the noise and clutter in real scenes. In other words, it might be more realistic to allow small objects to get lost during tracking, while investigating a better solution to re-detect them.

The intuition here comes from the question of how human beings recognize a small target once it is lost in a cluttered background. Most likely, humans first look at the salient objects or regions popping up in the scene, and then verify whether one of the salient objects is the target of interest [5]. A few works mimic this behavior and involve saliency information in object tracking. For example, the method in [27] integrates saliency for the representation of context, while [39, 11, 25] incorporate saliency into appearance models in various ways in order to improve the robustness of the tracker. However, as they mostly focus on the target appearance in the image domain, their performance is not satisfactory, since the appearance of small objects is inherently weak. Therefore, they might only be reliably applied to tracking normal-sized objects. In this paper, we propose a new online saliency learning framework, termed aggregation signature, and focus on small object tracking. To the best of our knowledge, no saliency-based method has yet utilized all context information, including intensity, saturation, saliency and motion information, for small object tracking.

Fig. 1: The aggregation signature results are shown at different iterations, reflecting that the tracked target becomes more salient in the learning process. Our aggregation signature constitutes the first attempt to incorporate the tracked target information into the quaternion discrete cosine transform (QDCT) image signature, whose aggregation capacity is proved in theoretical terms.

Unlike handcrafted image signatures, which are simple yet powerful tools to spatially match the sparse foreground objects in an image [17, 33], our aggregation signature exploits a learning mechanism to build an adaptive target signature. As a result, it can quickly detect salient objects even when they are very small, which further improves the (re-)localization performance of the trackers. We open up a new direction for tracking small objects by mimicking the human attention mechanism. In particular, our theoretical analysis shows that the resulting foreground saliency map from our aggregation signature becomes more consistent with the target appearance over the iterations, as shown in Fig. 1. Moreover, the aggregation signature is generic enough to be integrated into other trackers. In summary, the contributions of this paper include:

(i) The proposed aggregation signature is proved, in theoretical terms, to be more efficient for sparse foreground detection, making the tracked target more salient than the background.

(ii) The aggregation signature improves the capacity of accumulating information about the target based on a learning mechanism, whereas conventional image signatures are handcrafted and more likely to fail to adapt to the target.

(iii) New challenging datasets – small90 and small112 – are collected for small object tracking evaluation. The datasets are publicly available for further research development.

II Aggregation Signature

The image signature is a simple yet powerful tool to spatially match the sparse foreground of an image [17]. By using the sign function of the DCT, the resulting handcrafted descriptor can approximately and efficiently detect salient image regions. Rather than separating a color image into three channel images and computing an image signature for each, QDCT [33] can discriminate the relative importance of four components by introducing a quaternion component. In general, both DCT and QDCT based image signatures are handcrafted methods that involve no learning process. In contrast, the proposed aggregation signature improves the discriminative capability of the QDCT signature by learning multi-cue information, in particular the target prior information.
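To make the handcrafted baseline concrete, the following is a minimal sketch of the DCT image signature saliency map of [17] for a single channel: take the sign of the 2-D DCT, invert it, square the result and smooth it with a Gaussian kernel. The function name, the SciPy-based implementation, the smoothing width `sigma` and the toy image are illustrative assumptions. This is not the aggregation signature itself, which additionally learns from multi-cue and target prior information.

```python
import numpy as np
from scipy.fft import dctn, idctn
from scipy.ndimage import gaussian_filter


def image_signature_saliency(channel, sigma=3.0):
    """Saliency map from the DCT image signature of one 2-D channel."""
    signature = np.sign(dctn(channel, norm="ortho"))    # entrywise sign of the DCT
    reconstructed = idctn(signature, norm="ortho")      # back to the image domain
    # Squaring (Hadamard product with itself) and smoothing gives the saliency map.
    return gaussian_filter(reconstructed * reconstructed, sigma=sigma)


# Toy example (assumption): a small bright 4x4 target on a weak noise background.
rng = np.random.default_rng(0)
image = 0.01 * rng.standard_normal((64, 64))
image[20:24, 40:44] += 1.0
saliency = image_signature_saliency(image)
```

In this toy setting the saliency inside the 4x4 target region is, on average, higher than over the whole image, which is the sparse foreground behaviour the signature relies on.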

II-A Definition of Aggregation Signature

We begin by considering an image which exhibits the following structure:

$x = f + b$

where $f$ represents the foreground and $b$ represents the background. Please refer to Table I for the definitions used throughout the rest of this section. Formally, the aggregation signature (AS) is defined as:


where is the entrywise sign operator, represents the iteration and represents the 4 channels in use. Then, the reconstructed image can be defined as:




where 111If is the image signature based on DCT, we have ., represents the reconstructed result in the iteration with as its conjugate form, and represents the element wise product. , , represent three different channels such as any one channel of RGB, image intensity and image saturation (or motion in tracking). is a two-dimensional prior related to the tracked target, which will be elaborated in Section IV.

II-B Foreground Aggregation Signature Properties

In this section, we provide evidence that, for an image which adheres to a certain mathematical structure, the background can be suppressed by the aggregation signature.

Proposition: The image reconstructed from the aggregation signature matches the foreground object more accurately in the learning process with a high probability as follows:


where stands for probability, is a small positive value, N represents total image pixel number, represents the norm, denotes the inner-product. denotes expectation, which reveals about the similarity between the foreground and the object saliency information obtained by aggregation signature.

Terms Notation
The entrywise sign operator.
The conjugate form of .
, the reconstructed image of DCT.
, the reconstructed image of QDCT.
The expectation of random variable .
norm of vector . (p=2 if omitted).
The inner-product of and .
The Hadamard (entrywise) product operator.
Support set of .
TABLE I: Notation and terms used in this paper.

Proof: We know the transform between QDCT and DCT is


For ease of explanation, we only focus on one channel, that is to say and the result can be easily generalized for the quaternion case in a straightforward way, then we have


where and represents the points of the corresponding support set. We note that the proof is applicable to channels in Equ. (6), so we take the channel for example. Then, we have


Since the results obtained by DCT are independent of each other, we assume


where is very small, since the probability that the DCT output is equal to a certain value is very small. Then we have the following statement:


which means that in a high probability we have , considering that is very small.

Similarly, we have


Since , if , then we have


Combining (11) and (12), we have


Based on the image signature proposed by Hou [17], we have


where represents the support set of . Given the bound [17], we have


And then it becomes


For a spatially sparse foreground, we have the following statement:


Together with Equ. (10), we have


which proves the proposition.

Remark: Here, is very small as in Equ. (9), e.g., , and the probability mentioned above is % when . In other words, the background is, with high probability, suppressed more as the aggregation signature is learned. We also performed a statistical analysis of in Equ. (9) on the MSRA-B dataset [24], which indicates that is very small, less than .

III Aggregation Signature Tracker

We exploit the aggregation signature to enhance the re-detection process for small object tracking; the resulting method is called the aggregation signature tracker (AST). More specifically, when a thresholding method finds that the target is drifting, a saliency detection with the tracked target as prior is triggered, which enables the online aggregation signature to suppress the background data. Together with the context information provided by the different channels, we re-detect the objects to relocate the tracked target. The whole tracking procedure is illustrated in Fig. 2(a) and Algorithm 1, and we elaborate on each key component in the following.

Fig. 2: Scheme of the aggregation signature tracker, which includes the base tracker and re-detection stages, particularly for small objects. The part of aggregation signature calculation illustrates the saliency map calculation in the re-detection procedure. Once a drifting is detected, we choose the search region around the center of the previous target location to calculate the saliency map via aggregation signature. The blue box is the search region. In the learning process, the target prior () and the context information in the blue box are used to learn the saliency map that helps to find a new initial position, where the base tracker will be performed again for re-detection.

Drifting detection: As shown by the output constraint transfer (OCT) tracking method [40], a simple distribution is necessary and significant to achieve high efficiency. OCT builds upon the reasonable assumption that the response to the target image follows a Gaussian distribution, so we trigger the re-detection process based on a thresholding method as:


where represents the mean response using all previous frames, represents the maximum response of the current frame, and is the threshold. The target is supposed to be lost if the response of the current frame is far from the average response. Once the target is occluded or out of view, this mechanism helps us search continuously in the following frames.
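The thresholding rule described above can be sketched as follows. This is a minimal illustration, assuming the criterion is an absolute deviation of the current maximum response from the running mean; the exact form of the criterion used in the paper (for example, whether it is normalized by a variance term) is not reproduced here. The class name is hypothetical, and the choice to exclude frames flagged as drifting from the running mean is also an assumption.

```python
class DriftDetector:
    """Flags a frame as drifting when its peak response is far from the running mean."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.count = 0      # number of responses folded into the mean so far
        self.mean = 0.0     # running mean of the maximum responses

    def update(self, max_response):
        """Return True if drift is suspected; otherwise update the running mean."""
        if self.count > 0 and abs(max_response - self.mean) > self.threshold:
            return True     # assumption: drifting frames do not update the mean
        self.count += 1
        self.mean += (max_response - self.mean) / self.count
        return False


detector = DriftDetector(threshold=0.3)
flags = [detector.update(r) for r in [0.80, 0.82, 0.78, 0.20, 0.81]]
```

In this toy run only the fourth response (0.20) deviates from the running mean by more than the threshold, so only that frame would trigger re-detection.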

Saliency map calculation: The aggregation signature is used to obtain the saliency map and to further coarsely re-localize the target. Through iterations, we gradually smooth the aggregation signature by a Gaussian kernel [17] to obtain the saliency map. The salient regions are regarded as the coarse candidate positions of the target, on which a re-detection process is performed still based on the selected base trackers. It should be mentioned that involving the targeted object, as a prior in saliency detection, does not occur in the conventional methods. Two key components are elaborated as follows:

Initial target bounding box
Initial , ,
if the frame t ≤ 3 then
     repeat
         Crop out the search window according to , and extract features
         Compute the maximal response according to the base tracker
         Obtain the position according to the maximal response
         Update the essential parameters of the base tracker
     until t == 3
end if
Compute the mean of the response using all previous frames
if the frame t > 3 then
     repeat
         Crop out the search window according to , and extract features
         Compute the maximal response according to the base tracker
         if the drifting criterion of the thresholding method is met then
              Crop out the target search regions
              Obtain channels according to channels design in section 4
              Calculate the aggregation signature saliency map as illustrated in section 3.1
              Obtain the coarse target location based on the saliency map
              Compute the new target location according to the base tracker
         end if
     until the end of the video
     Update the parameters of the base tracker
     Update
end if
Algorithm 1 - Aggregation signature tracker
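The control flow of Algorithm 1 can be sketched in a few lines. This is only an outline under stated assumptions: the `BaseTracker` interface (`locate`, `update`), the `coarse_relocate` callback standing in for the saliency-based coarse localization, the three warm-up frames and the absolute-deviation drift test are all hypothetical simplifications, and real base trackers such as KCF expose different APIs.

```python
from typing import Callable, Iterable, List, Protocol, Tuple

Position = Tuple[float, float]


class BaseTracker(Protocol):
    """Hypothetical base tracker interface used only for this sketch."""

    def locate(self, frame, position: Position) -> Tuple[Position, float]: ...
    def update(self, frame, position: Position) -> None: ...


def track_with_redetection(frames: Iterable, tracker: BaseTracker,
                           init_position: Position,
                           coarse_relocate: Callable[[object, Position], Position],
                           threshold: float, warmup: int = 3) -> List[Position]:
    position = init_position
    responses: List[float] = []
    trajectory: List[Position] = []
    for index, frame in enumerate(frames, start=1):
        candidate, response = tracker.locate(frame, position)
        if index > warmup:
            mean_response = sum(responses) / len(responses)
            if abs(response - mean_response) > threshold:
                # Drift suspected: restart the base tracker from the coarse location.
                coarse = coarse_relocate(frame, position)
                candidate, response = tracker.locate(frame, coarse)
        position = candidate
        tracker.update(frame, position)
        responses.append(response)
        trajectory.append(position)
    return trajectory


class ToyTracker:
    """Toy stand-in: finds the target only if it lies within `reach` of the start."""

    def __init__(self, reach):
        self.reach = reach

    def locate(self, frame, position):
        tx, ty = frame  # in this toy example a frame is just the true target position
        if abs(tx - position[0]) <= self.reach and abs(ty - position[1]) <= self.reach:
            return (tx, ty), 1.0
        return position, 0.1

    def update(self, frame, position):
        pass


frames = [(10, 10), (11, 10), (12, 11), (13, 11), (60, 50), (61, 50)]
trajectory = track_with_redetection(frames, ToyTracker(reach=5), (10, 10),
                                    coarse_relocate=lambda frame, pos: frame,
                                    threshold=0.5)
```

In the toy sequence the target jumps far outside the tracker's reach at the fifth frame; the low response triggers the coarse relocation and the trajectory recovers.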

1) Channels design: We denote the input image captured at frame t as , where , , and are the red, green and blue channels of . Then, we obtain three channels used in our aggregation signature representing as: intensity , saturation and movement , respectively, where is a constant. We deploy image signature [17] to calculate the initial saliency map as the first channel .

2) Target Prior: As shown in Fig. 2 (b), we select salient regions similar to the target in the last frame in size. Next, we assign each candidate a weight indicating the similarity to a target prior information, which is measured simply by the Euclidean distance as:


where denotes the weight of the region for the candidate saliency map at the frame, is a constant. , where represents the histogram of the candidate saliency map, while denotes the target histogram for the frame calculated by


where is 0.5 in this paper. We note that the weights are set to for the regions outside the selected salient areas.
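The two components above can be sketched as follows. Since the exact formulas are not reproduced in this text, everything here is an assumption made for illustration: intensity is taken as the mean of the RGB channels, saturation as the HSV-style ratio (max - min) / max, motion as the absolute intensity difference to an earlier frame, the region weight as an exponentially decaying function of the Euclidean distance between histograms, and the target histogram update as a linear blend with rate 0.5. All function and parameter names are hypothetical.

```python
import numpy as np


def context_channels(frame_rgb, earlier_rgb):
    """Intensity, saturation and motion channels from two RGB frames with values in [0, 1]."""
    intensity = frame_rgb.mean(axis=2)
    max_c = frame_rgb.max(axis=2)
    min_c = frame_rgb.min(axis=2)
    # Guard the division so that black pixels get zero saturation.
    saturation = np.where(max_c > 0, (max_c - min_c) / np.maximum(max_c, 1e-12), 0.0)
    motion = np.abs(intensity - earlier_rgb.mean(axis=2))
    return intensity, saturation, motion


def region_weight(candidate_hist, target_hist, scale=1.0):
    """Weight of a candidate region; decays with the Euclidean histogram distance."""
    distance = np.linalg.norm(candidate_hist - target_hist)
    return float(np.exp(-distance / scale))


def update_target_hist(previous_hist, current_hist, rate=0.5):
    """Blend the previous target histogram with the current one."""
    return (1.0 - rate) * previous_hist + rate * current_hist
```

With this choice a candidate whose histogram equals the target histogram receives weight 1, and the weight falls towards 0 as the histograms diverge.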

IV Experiments

In this section, we evaluate the aggregation signature on our small90 dataset and on the visual saliency benchmark MSRA-B [24]. We further test the performance of our aggregation signature based tracker on small90, small112, UAV123_10fps [28] and UAV20L [28] according to the object tracking benchmark [38]. The test platform is an Intel i7 2.7 GHz (4 cores) CPU with 8 GB RAM and an NVIDIA GeForce GTX 1070 GPU.

Fig. 3: Attribute distribution across small90.
Fig. 4: The first frames of selected sequences from small90. The red bounding box indicates the ground truth.

IV-A Datasets

Few datasets are available for the small object tracking task. We establish a comprehensive database, termed the small90 benchmark, consisting of 90 annotated sequences of small-sized objects, which encompass several additional challenges, such as target drifting and low resolution. We add 22 more challenging sequences to small90 and obtain another new dataset, termed small112. For better analysis of the tracking approaches, each sequence is annotated with 11 attributes: illumination variations (IV), scale variations (SV), occlusions (OCC), deformations (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), background clutters (BC) and low resolution (LR). The attribute distribution in our dataset is plotted in Fig. 3, which shows that some attributes, e.g., LR, occur more frequently than others. We note that one sequence is often annotated with multiple attributes. Examples of first frames from our datasets are illustrated in Fig. 4.

IV-B Aggregation Signature on Image

We first evaluate how the aggregation signature can enhance the performance of saliency detection, based on commonly used metrics, including the location-based metrics normalized scanpath saliency (NSS) [3] and mean absolute error (MAE) [35], and the distribution-based metric similarity (SIM) [20]. The DCT image signature (IS) and the QDCT image signature (QIS) are computed for comparison to validate the effectiveness of our aggregation signature (AS) method on both the MSRA-B [24] and small90 databases. MSRA-B is a large-scale image database of 5000 images for the quantitative evaluation of visual attention algorithms. From the results in Table II, we observe that our method achieves better overall performance than IS and QIS in terms of the MAE, NSS and SIM measures, thus leading to a better estimation of the visual distance between the predicted saliency map and the ground truth. Fig. 5 provides the saliency maps of the different methods and the ground truth on images from small90, which shows that the background is suppressed more by the aggregation signature than by the other methods. In terms of running speed, the aggregation signature module achieves 32 frames per second (FPS) in our experiments.
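For reference, the three metrics can be computed as in the following minimal sketch, which uses their commonly cited definitions: MAE is the mean absolute difference between the saliency map and the ground truth, NSS is the mean of the standardized saliency values at ground-truth locations, and SIM is the histogram intersection of the two maps after each is normalized to sum to one. Whether the evaluation in this paper uses exactly these normalizations is an assumption, and the function names are illustrative.

```python
import numpy as np


def mae(saliency, ground_truth):
    """Mean absolute error between two maps with values in [0, 1]; lower is better."""
    return float(np.mean(np.abs(saliency - ground_truth)))


def nss(saliency, fixation_mask):
    """Normalized scanpath saliency: mean z-scored saliency at ground-truth locations."""
    normalized = (saliency - saliency.mean()) / saliency.std()
    return float(normalized[fixation_mask.astype(bool)].mean())


def sim(saliency, ground_truth):
    """Similarity: sum of the pointwise minimum of the two maps normalized to sum to 1."""
    p = saliency / saliency.sum()
    q = ground_truth / ground_truth.sum()
    return float(np.minimum(p, q).sum())
```

Under these definitions a prediction identical to the ground truth yields MAE 0 and SIM 1, while a prediction whose mass lies entirely off the target yields SIM 0 and a negative NSS.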

        MSRA-B                    small90
Method  MAE     NSS     SIM       MAE     NSS     SIM
IS      0.2659  0.9710  0.3574    0.1513  3.6968  0.0163
QIS     0.2607  0.9808  0.3630    0.1260  4.4293  0.0288
AS      0.2559  0.9844  0.3695    0.0660  7.2658  0.0611
TABLE II: Experiments on metrics (MAE, NSS, SIM) among IS, QIS and AS on MSRA-B and small90. Bold fonts highlight the best performance.
Fig. 5: Representative results of different signature methods. For each group from left to right, they are the original image, ground truth, IS, QIS and AS, respectively. Comparing with the other methods, AS yields the best background suppression performance.

IV-C Aggregation Signature on Tracking

We empirically set the iteration number equal to 4, the saliency patches as 6. For other parameters, we follow the previous work [40] and set , , in all experiments for fair comparisons.

We then test the performance of the aggregation signature in tracking (AST) on small90 by comparing it with the DCT image signature and the QDCT image signature, each incorporated into KCF. The results in Fig. 6 reveal that the aggregation signature clearly outperforms the other signatures in small object tracking. We use one-pass evaluation (OPE) [38] to evaluate our results throughout the experiments section. Furthermore, we compare KCF_AST with other saliency-based trackers, including the saliency prior context model (SPC) [27] and the structuralist cognitive tracker (SCT) [4], in the same figure. In terms of precision, KCF_AST (76.6%) is about 22% higher than SPC (54.9%) and 9% higher than SCT (67.7%), while in terms of the average success rate, KCF_AST (46.6%) is about 16% higher than SPC (30.9%) and 5% higher than SCT (42.1%).

We also compare our trackers with OCT, which exploits a similar failure detection scheme to improve KCF. The performance of KCF_AST is higher than that of OCT by 13.2% and 7.8% in terms of precision and success rate, respectively.
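The precision and success scores reported in this section follow the one-pass evaluation protocol of [38]. As a minimal sketch, assuming the conventions commonly used with that benchmark (precision as the fraction of frames whose center location error is within 20 pixels, and success as the area under the curve of the fraction of frames whose overlap exceeds a threshold swept over [0, 1]), the two scores can be computed as follows. The threshold values and function names are assumptions, since they are not stated in this text.

```python
import numpy as np


def center_errors(pred, gt):
    """Euclidean distance between box centers; boxes are rows of (x, y, w, h)."""
    pred_centers = pred[:, :2] + pred[:, 2:] / 2.0
    gt_centers = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pred_centers - gt_centers, axis=1)


def overlaps(pred, gt):
    """Intersection over union of corresponding boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union


def precision_rate(pred, gt, pixel_threshold=20.0):
    """Fraction of frames whose center error is within the pixel threshold."""
    return float(np.mean(center_errors(pred, gt) <= pixel_threshold))


def success_rate(pred, gt, thresholds=np.linspace(0.0, 1.0, 21)):
    """Mean over thresholds of the fraction of frames whose overlap exceeds the threshold."""
    iou = overlaps(pred, gt)
    return float(np.mean([np.mean(iou > t) for t in thresholds]))
```

For very small targets a fixed 20-pixel threshold is lenient relative to the target size, which is one reason to read the precision and success scores together.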

Fig. 6: Precision and success rate plots for AST performance on small90, in which SPC and SCT are the other two saliency-based trackers, while KCF_QDCT and KCF_DCT are employed to compare the aggregation signature performance in tracking.

The small90 benchmark: In Fig. 7, we further show the precision and success plots of 30 state-of-the-art trackers, including SiamRPN [45] [19], LDES [21], SAT [12], TLD [18], LCT [26], OCT [40], CSK [32], CT [44], STC [43], KCF [16], ECO [6], MDNet [29], LCCF [41], SRDCF [7] and CPF [30], generated by the benchmark toolbox. While several baseline algorithms, e.g., LDES, DaSiamRPN and ECO, have shown promising potential in tracking small objects, our AST still helps achieve precision rates of 84.9% (LDES_AST), 83.1% (DaSiamRPN_AST) and 83.2% (ECO_AST), which improve on the corresponding base trackers by 1.6%, 0.9% and 1.7%, respectively. Meanwhile, these three trackers with our AST achieve success rates of 68.6%, 69.7% and 64.3%, outperforming the base trackers by 1.7%, 0.4% and 0.9%, respectively. Besides, our MDNet_AST achieves a precision rate of 86.6% and a success rate of 65.9%, outperforming MDNet by 7.1% and 4.0%, respectively. This again confirms that our aggregation signature can consistently improve the performance of base trackers. Likewise, LCCF_AST shows a significant improvement over the base tracker LCCF. When compared with the state-of-the-art re-detection trackers, our LCCF_AST (54.8%) significantly outperforms its base tracker LCCF (46.4%), and also outperforms TLD (52.7%), LCT (46.7%) and OCT (54.2%) by 2.1%, 8.3% and 0.7% in terms of the success rate on small90, respectively. The superior tracking performance confirms that our method is more effective than state-of-the-art re-detection trackers such as TLD, LCT and OCT.

Fig. 7: Precision and success rate plots on small90.

We illustrate some examples for KCF_AST in Fig. 8 to show how our aggregation signature helps to improve the tracking performance. In the sequences selected from small90, the tracked objects are subject to severe image quality deterioration during the tracking process. In particular: 1) the background of the scene is cluttered and many objects are similar to the target in appearance, and 2) severe drifting or a long time out of view causes the target to drift far away. In addition, we adopt MDNet, LCCF (deep features) and KCF as base trackers in our framework for comparison in the visual tracking experiments. Results are shown in Fig. 11; our main goal here is to show how our method helps to drastically reduce tracking failures.

From the results in Fig. 8 and Fig. 11, we conclude that the aggregation signature can effectively improve the performance of base trackers, especially for small object tracking, and that both saliency detection and tracking are enhanced by incorporating our signature. Finally, we note that the proposed method is able to relocate the target when drifting occurs, and performs very well on the small target sequences.

Fig. 8: Representative tracking results on four challenging sequences (fastcar, wakeboard, truck and blackcar). For each subfigure, the current frame, the saliency maps obtained by Aggregation Signature (AS) and the corresponding tracking results are shown in the first, middle and right column, respectively. We can see that our AST tracker could tackle the drifting, deformation, background clutter and out-of-view challenge due to the usage of aggregation signature.
Fig. 9: Precision and success rate plots on small112.
Fig. 10: Precision and success rate plots on UAV123_10fps.

The small112 benchmark: We further collect a new benchmark dataset with 112 fully annotated sequences to facilitate the performance evaluation. It extends small90 with 22 more difficult sequences. As shown in Fig. 9, KCF_AST, LCCF_AST and ECO_AST improve the performance of KCF, LCCF and ECO from 58.0%, 64.7% and 77.9% to 71.0%, 77.1% and 81.9% in precision rate, and from 41.6%, 44.5% and 62.9% to 49.2%, 50.8% and 66.0% in success rate, which demonstrates that AST improves these base trackers significantly on complex small object tracking sequences. Although baseline trackers such as SiamRPN and LDES already perform very well, AST still obtains improvements of 0.1% and 0.4% in precision and of 0.5% and 0.5% in success rate, which validates the effectiveness of AST. In the experimental results, all the trackers endowed with the aggregation signature module perform consistently better than their base trackers, which further validates the effectiveness of the proposed approach. The results also show that better base trackers gain smaller performance improvements. The reason might be that the aggregation signature is less useful when drifting is not pronounced, which is the case with a better tracker.

The UAV123_10fps benchmark: We test ASTs on UAV123_10fps [28], which contains 123 sequences posing many challenges, as shown in Fig. 10. Compared to the base tracker MDNet, the aggregation signature (MDNet_AST) significantly improves the performance from 50.2% to 54.2% in precision rate and from 42.2% to 47.5% in success rate, which further validates the effectiveness of the proposed method. KCF_AST is about 6% higher than KCF in precision and about 8% higher in success rate. As for more recent state-of-the-art trackers such as LDES, DaSiamRPN and ECO, their corresponding ASTs still achieve better results than the base trackers.

Precision
IV  0.396 0.619 0.715 0.551 0.491 0.707 0.538 0.765 0.707 0.798 0.719 0.747
SV  0.495 0.710 0.723 0.618 0.574 0.805 0.706 0.805 0.775 0.794 0.768 0.809
OCC 0.619 0.678 0.751 0.726 0.673 0.772 0.692 0.732 0.799 0.803 0.757 0.758
DEF 0.542 0.706 0.767 0.676 0.599 0.757 0.671 0.805 0.807 0.844 0.777 0.793
MB  0.303 0.491 0.631 0.421 0.353 0.582 0.390 0.684 0.516 0.717 0.696 0.726
FM  0.353 0.573 0.746 0.500 0.412 0.645 0.452 0.789 0.573 0.809 0.770 0.803
IPR 0.438 0.672 0.811 0.604 0.522 0.752 0.623 0.844 0.787 0.877 0.779 0.805
OPR 0.464 0.704 0.838 0.625 0.551 0.782 0.650 0.869 0.831 0.921 0.808 0.833
OC  0.237 0.374 0.880 0.327 0.293 0.611 0.431 0.795 0.721 0.855 0.494 0.664
BC  0.533 0.696 0.770 0.655 0.599 0.769 0.653 0.815 0.786 0.855 0.789 0.804
LR  0.578 0.717 0.816 0.666 0.625 0.783 0.697 0.858 0.805 0.900 0.845 0.863
Success rate
IV  0.209 0.379 0.423 0.328 0.291 0.422 0.322 0.445 0.430 0.487 0.451 0.464
SV  0.264 0.459 0.396 0.361 0.324 0.416 0.393 0.439 0.511 0.519 0.504 0.524
OCC 0.343 0.435 0.465 0.460 0.439 0.469 0.446 0.461 0.502 0.507 0.480 0.479
DEF 0.305 0.454 0.456 0.425 0.378 0.460 0.411 0.477 0.524 0.540 0.508 0.514
MB  0.150 0.299 0.396 0.260 0.201 0.368 0.230 0.402 0.324 0.464 0.453 0.474
FM  0.185 0.367 0.473 0.317 0.246 0.421 0.282 0.481 0.374 0.537 0.514 0.538
IPR 0.251 0.412 0.470 0.367 0.316 0.446 0.374 0.483 0.486 0.541 0.480 0.491
OPR 0.262 0.433 0.481 0.376 0.329 0.460 0.386 0.495 0.514 0.570 0.500 0.510
OC  0.150 0.263 0.408 0.209 0.181 0.382 0.242 0.436 0.425 0.512 0.329 0.427
BC  0.305 0.451 0.471 0.416 0.376 0.476 0.407 0.493 0.511 0.552 0.526 0.532
LR  0.334 0.469 0.499 0.414 0.382 0.475 0.417 0.507 0.527 0.587 0.561 0.571
TABLE III: Precision and Success rate for the 11 attributes in Small90. Bold fonts highlight the best performance

The UAV20L benchmark: We also test ASTs on the well-known benchmark UAV20L [28], as shown in Fig. 11, where some of the tracked objects are very small. The state-of-the-art SRDCF is chosen as the base tracker, leading to our SRDCF_AST, which performs better than the state-of-the-art trackers. Compared to the base tracker SRDCF, the aggregation signature (SRDCF_AST) improves the precision rate from 50.7% to 53.1%, which further validates the effectiveness of the proposed method. In terms of precision, LCCF_AST is about 7% higher than LCCF, while KCF_AST is about 3% higher than KCF. In addition, LCCF_AST and KCF_AST, though showing no outstanding performance in terms of success rate, still achieve better results than their respective base trackers. Furthermore, for the more recent state-of-the-art trackers LDES and DaSiamRPN, we also show that LDES_AST and DaSiamRPN_AST improve on their base trackers by a clear margin.

Fig. 11: Precision and success rate plots on UAV20L.

Quantitative Attribute Evaluation of Benchmarks: The per-attribute results generated by the benchmark toolbox for small90 are shown in Table III. From the results, we conclude that AST trackers achieve much better performance in most cases for small-sized objects, especially for motion blur and fast motion, where all AST trackers improve dramatically, since saliency-based AST trackers can be more robust than base trackers to these variations. To conclude, AST consistently improves the results of base trackers in most cases, and AST trackers achieve new state-of-the-art results.

Speed analysis: In terms of tracking speed on small90, KCF_AST has a processing rate of 120.88 frames per second (FPS), while LCCF_AST, which is based on deep features, runs at 16.52 FPS. This shows that our proposed trackers not only achieve state-of-the-art results but also perform in real time. Although the frame rate of the proposed tracking framework drops compared to the original base tracker, the tracking performance is significantly improved on small90, e.g., an 8.2% improvement over LCCF in terms of success rate.

V Conclusions

A new aggregation signature has been proposed to improve small target tracking performance. The aggregation signature uses the target as a prior to adaptively locate the salient object, and is deployed to re-detect the tracked objects when drifting occurs. It is generic and can be used in conjunction with other trackers. We evaluated our tracking framework with KCF, SRDCF, LCCF, ECO, SAT, LDES, DaSiamRPN and MDNet. To validate the resulting aggregation signature tracker, we have also collected new video datasets, named small90 and small112, which contain fully annotated video sequences for small target tracking. The experimental results demonstrate that our method improves performance in challenging situations such as severe drifting, deformation and out-of-view targets. In the future, our approach will be extended to other applications, such as large-scale retrieval [36][37] and classification [10].

VI Acknowledgment

The work was supported in part by the National Key Research and Development Program of China (Grant No. 2016YFB0502602), in part by the National Natural Science Foundation of China under Grant 61672079, and in part by the Shenzhen Science and Technology Program (No. KQTD2016112515134654).


  • [1] K. Ahmadi and E. Salari (2015) Small dim object tracking using a multi objective particle swarm optimisation technique. IET Image Processing 9 (9), pp. 820–826. Cited by: §I.
  • [2] K. Ahmadi and E. Salari (2016) Small dim object tracking using frequency and spatial domain information. Pattern Recognition 58, pp. 227–234. Cited by: §I.
  • [3] A. Borji, D. N. Sihite, and L. Itti (2013) Quantitative analysis of human-model agreement in visual saliency modeling: a comparative study. IEEE Transactions on Image Processing 22 (1), pp. 55. Cited by: §IV-B.
  • [4] J. Choi, H. J. Chang, J. Jeong, Y. Demiris, and Y. C. Jin (2016) Visual tracking using attention-modulated disintegration and integration. In Computer Vision and Pattern Recognition, pp. 4321–4330. Cited by: §IV-C.
  • [5] J. Choi, H. J. Chang, S. Yun, T. Fischer, Y. Demiris, and Y. C. Jin (2017) Attentional correlation filter network for adaptive visual tracking. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4828–4837. Cited by: §I.
  • [6] M. Danelljan, G. Bhat, F. Shahbaz Khan, and M. Felsberg (2017) ECO: efficient convolution operators for tracking. In Computer Vision and Pattern Recognition, pp. 6931–6939. Cited by: §IV-C.
  • [7] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg (2015) Learning spatially regularized correlation filters for visual tracking. In IEEE International Conference on Computer Vision, pp. 4310–4318. Cited by: §IV-C.
  • [8] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg (2017) Discriminative scale space tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (8), pp. 1561–1575. Cited by: §I.
  • [9] C. Deng, Y. Han, and B. Zhao (2018) High performance visual tracking with extreme learning machine framework. IEEE Trans. on Cyber. Cited by: §I.
  • [10] G. Ding, Y. Guo, K. Chen, C. Chu, J. Han, and Q. Dai (2019) DECODE: deep confidence network for robust image classification. IEEE Transactions on Image Processing 28 (8), pp. 3752–3765. Cited by: §V.
  • [11] J. Fan, Y. Wu, and S. Dai (2010) Discriminative spatial attention for robust tracking. In European Conference on Computer Vision, pp. 480–493. Cited by: §I.
  • [12] Y. Han, C. Deng, B. Zhao, and D. Tao (2019) State-aware anti-drift object tracking. IEEE Transactions on Image Processing 28 (8), pp. 4075–4086. Cited by: §IV-C.
  • [13] Y. Han, C. Deng, B. Zhao, and B. Zhao (2019) Spatial-temporal context-aware tracking. IEEE Signal Process. Lett. 26 (3), pp. 500–504. Cited by: §I.
  • [14] S. Hare, A. Saffari, and P. H. S. Torr (2016) Struck: structured output tracking with kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (10), pp. 2096–2109. Cited by: §I.
  • [15] M. Heber, M. Godec, M. Rüther, P. M. Roth, and H. Bischof (2013) Segmentation-based tracking by support fusion. Computer Vision and Image Understanding 117 (6), pp. 573–586. Cited by: §I.
  • [16] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista (2015) High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (3), pp. 583–596. Cited by: §IV-C.
  • [17] X. Hou, J. Harel, and C. Koch (2012) Image signature: highlighting sparse salient regions. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (1), pp. 194. Cited by: §I, §II-B, §II, §III, §III.
  • [18] Z. Kalal, K. Mikolajczyk, and J. Matas (2012) Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (7), pp. 1409–1422. Cited by: §I, §IV-C.
  • [19] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu (2018) High performance visual tracking with siamese region proposal network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §IV-C.
  • [20] J. Li, C. Xia, Y. Song, S. Fang, and X. Chen (2015) A data-driven metric for comprehensive evaluation of saliency models. In IEEE International Conference on Computer Vision, pp. 190–198. Cited by: §IV-B.
  • [21] Y. Li, J. Zhu, W. Song, and Z. Wang (2019) Robust estimation of similarity transformation for visual object tracking. In Association for the Advance of Artificial Intelligence, Cited by: §IV-C.
  • [22] C. Liu, W. Ding, X. Xia, Y. Hu, B. Zhang, J. Liu, B. Zhuang, and G. Guo (2019) RBCN: rectified binary convolutional networks for enhancing the performance of 1-bit dcnns. Cited by: §I.
  • [23] C. Liu, W. Ding, X. Xia, B. Zhang, J. Gu, J. Liu, R. Ji, and D. Doermann (2019) Circulant binary convolutional networks: enhancing the performance of 1-bit dcnns with circulant back propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2691–2699. Cited by: §I.
  • [24] T. Liu, J. Sun, N. N. Zheng, X. Tang, and H. Y. Shum (2007) Learning to detect a salient object. In Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §II-B, §IV-B, §IV.
  • [25] Y. Luo and J. Yuan (2013) Salient object detection in videos by optimal spatio-temporal path discovery. In Acm International Conference on Multimedia, pp. 509–512. Cited by: §I.
  • [26] C. Ma, X. Yang, C. Zhang, and M. H. Yang (2015) Long-term correlation tracking. In Computer Vision and Pattern Recognition, pp. 5388–5396. Cited by: §IV-C.
  • [27] C. Ma, Z. Miao, X. P. Zhang, and M. Li (2017) A saliency prior context model for real-time object tracking. IEEE Transactions on Multimedia PP (99), pp. 1–1. Cited by: §I, §IV-C.
  • [28] M. Mueller, N. Smith, and B. Ghanem (2016) A benchmark and simulator for uav tracking. In European Conference on Computer Vision, pp. 445–461. Cited by: §IV-C, §IV-C, §IV.
  • [29] H. Nam and B. Han (2016) Learning multi-domain convolutional neural networks for visual tracking. In Computer Vision and Pattern Recognition, pp. 4293–4302. Cited by: §IV-C.
  • [30] P. Perez, C. Hue, J. Vermaak, and M. Gangnet (2002) Color-based probabilistic tracking. European Conference on Computer Vision I, pp. 661–675. Cited by: §IV-C.
  • [31] D. Rozumnyi, J. Kotera, F. Sroubek, L. Novotny, and J. Matas (2017) The world of fast moving objects. In Computer Vision and Pattern Recognition, pp. 4838–4846. Cited by: §I.
  • [32] C. Rui, P. Martins, and J. Batista (2012) Exploiting the circulant structure of tracking-by-detection with kernels. In European Conference on Computer Vision, pp. 702–715. Cited by: §IV-C.
  • [33] B. Schauerte and R. Stiefelhagen (2012) Quaternion-based spectral saliency detection for eye fixation prediction. In European Conference on Computer Vision, pp. 116–129. Cited by: §I, §II.
  • [34] S. Stalder, H. Grabner, and L. V. Gool (2010) Cascaded confidence filtering for improved tracking-by-detection. In European Conference on Computer Vision, pp. 369–382. Cited by: §I.
  • [35] C. J. Willmott and K. Matsuura (2005) Advantages of the mean absolute error (mae) over the root mean square error (rmse) in assessing average model performance. Climate Research 30 (1), pp. 79. Cited by: §IV-B.
  • [36] G. Wu, J. Han, Y. Guo, L. Liu, and L. Shao (2018) Unsupervised deep video hashing via balanced code for large-scale video retrieval. IEEE Transactions on Image Processing 28 (4), pp. 1993–2007. Cited by: §V.
  • [37] G. Wu, J. Han, Z. Lin, G. Ding, B. Zhang, and Q. Ni (2019) Joint image-text hashing for fast large-scale cross-media retrieval using self-supervised deep learning. IEEE Transactions on Industrial Electronics 66 (12), pp. 9868–9877. Cited by: §V.
  • [38] Y. Wu, J. Lim, and M. H. Yang (2015) Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9), pp. 1834–1848. Cited by: §IV-C, §IV.
  • [39] K. M. Yi, H. Jeong, B. Heo, H. J. Chang, and Y. C. Jin (2014) Initialization-insensitive visual tracking through voting with salient local features. In IEEE International Conference on Computer Vision, pp. 2912–2919. Cited by: §I.
  • [40] B. Zhang, Z. Li, X. Cao, Q. Ye, C. Chen, L. Shen, A. Perina, and R. Jill (2017) Output constraint transfer for kernelized correlation filter in tracking. IEEE Transactions on Systems Man and Cybernetics Systems 47 (4), pp. 693–703. Cited by: §III, §IV-C, §IV-C.
  • [41] B. Zhang, S. Luan, C. Chen, J. Han, W. Wang, A. Perina, and L. Shao (2018) Latent constrained correlation filter. IEEE Transactions on Image Processing PP (99), pp. 1–1. Cited by: §IV-C.
  • [42] B. Zhang, A. Perina, Z. Li, V. Murino, J. Liu, and R. Ji (2016) Bounding multiple gaussians uncertainty with application to object tracking. International Journal of Computer Vision 118 (3), pp. 364–379. Cited by: §I.
  • [43] K. Zhang, L. Zhang, Q. Liu, D. Zhang, and M. H. Yang (2014) Fast visual tracking via dense spatio-temporal context learning. In European Conference on Computer Vision, pp. 127–141. Cited by: §IV-C.
  • [44] K. Zhang, L. Zhang, and M. H. Yang (2012) Real-time compressive tracking. In European Conference on Computer Vision, pp. 864–877. Cited by: §IV-C.
  • [45] Z. Zhu, Q. Wang, L. Bo, W. Wu, J. Yan, and W. Hu (2018) Distractor-aware siamese networks for visual object tracking. In European Conference on Computer Vision, Cited by: §IV-C.