Aerial object tracking is a popular topic due to its wide range of applications in security, traffic surveillance, autonomous driving, and UAV monitoring. Tracking from aerial platforms can be performed with a number of data modalities including, but not limited to, grayscale, thermal, color, and most recently, hyperspectral imagery [4, 5, 6]. Each modality has been exploited for a unique application and comes with its own set of advantages and disadvantages. In this paper, we focus on two specific sensor modalities: (1) Wide Area Motion Imagery (WAMI), and (2) Adaptive Hyperspectral Imagery.
The WAMI platform can scan up to a km km area at 2 frames per second (fps), with vehicles occupying roughly - pixels. This large field of view makes persistent tracking possible. At its low spatial resolution, WAMI supports vehicle tracking well under certain circumstances but offers little help in handling background clutter, occlusions, and low-contrast objects. To counter this, multiple appearance-based features such as textures, color histograms, and histograms of gradients are combined with motion cues, leading to multiple heat maps for the search area. Such combinations can become too computationally expensive for real-time tracking, so there is a need to balance model complexity against real-time operation. The datasets recorded by WAMI are (1) WPAFB 2009 and (2) CLIF, each of which includes a single video with fewer than 100 frames. This is detrimental to training standalone machine learning and deep learning based architectures, since the amount of available data is too low for training purposes. Because data collection from an aerial platform is lengthy and costly, deep learning architectures can instead be used as feature encoders rather than as an end-to-end tracking system. However, WAMI is not a practical platform for this purpose either, since it provides single-channel imagery whereas the deep learning architectures are trained on ImageNet, which consists of RGB images.
The unique challenges posed by aerial platforms can be better addressed by smarter multi-modal data acquisition. In this direction, the Rochester Institute of Technology Multi-object Spectrometer (RITMOS) concept is utilized by a number of trackers [10, 11, 5, 6] as an example that can collect a small, targeted amount of hyperspectral data. The RITMOS captures data in two different modalities : (1) a full frame single channel image, and (2) limited hyperspectral data from the desired pixel locations. It can acquire a full-frame single channel image in about sec and scan a row of pixels hyperspectrally in ms. Such an adaptive and multi-modal data concept provides more freedom to address aerial tracking challenges. Driven by this freedom and specifications, we design a discriminative tracker to operate on this platform as shown in Fig. 1. We refer the readers to [12, 5] for more information on the workings of RITMOS.
Synthetic Imagery concept
The Digital Imaging and Remote Sensing (DIRSIG) software has been used before to generate spectral scenarios for varied applications that use conventional computer vision techniques and deep learning based models [13, 6, 14, 15]. Since flying spectral sensors on an aerial platform is still an ongoing area of development due to the high costs involved, we evaluate our tracker on synthetic scenarios generated using DIRSIG by [16, 17, 5]. In particular, we focus on two scenarios: (1) with trees and (2) without trees. We track 43 vehicles in both scenarios as shown in Fig. 2. In addition, we generate a synthetic single-channel aerial dataset for training a CNN and use it to perform vehicle classification on the real WAMI platform (Sect. IV).
The rich sensory information from hyperspectral imagery has been utilized by generative trackers [5, 6]. Discriminative and deep learning driven trackers, on the other hand, have recently improved traditional object tracking dramatically. The main challenges behind the application of discriminative and deep learning trackers in aerial hyperspectral images are:
Well-established discriminative algorithms such as Efficient Convolutional Operators (ECO), Kernelized Correlation Filters (KCF), Struck, and Tracking-Learning-Detection (TLD) are mostly associated with close-angle color image single/multi-object tracking at high video frame rates and thus consider a small region of interest (ROI).
This leads to very optimistic results during off-line tracking due to minimal variance in the training dataset, which results in poor performance during online tracking.
II Related Work
Tracking-by-detection algorithms exploited low-level features such as Histogram of Oriented Gradients (HoG) and Color-naming features [23, 24, 25] to perform discriminative tracking until the emergence of deep CNN architectures in the computer vision field. The first correlation filter tracker was the Minimum Output Sum of Squared Error (MOSSE) filter [27]. Later, the Scale Adaptive with Multiple Features (SAMF) tracker was proposed to concatenate multi-channel HoG and color-naming features. Finally, a kernelized version of the correlation filter (KCF) using multi-channel features was proposed to further improve tracking without drastically increasing the computational complexity [28, 18].
The first studies utilizing CNN architectures in object tracking focused on employing the features learned in architectures such as AlexNet and VGGNet trained on the ImageNet dataset. One such study extracted low-level features from VGGNet to learn a more discriminative correlation filter. Specifically, they encoded objects with the activations of the first several convolutional layers from VGGNet. This setting provided them with a dimensional low-level feature set that can be interpreted as a more advanced version of HoG features. They reported a slight improvement in the Visual Object Tracking Challenge 2015 (VOT2015) with deep CNN features over the HoG features. Going deeper is a major key to achieving the state of the art in most computer vision challenges; however, the nature of deep CNN architectures prohibits us from applying high-level features in tracking-by-detection algorithms. This is mainly due to the increasing translation invariance in deeper layers that results from spatial pooling operations.
Siamese Networks have gained the reputation of being the most efficient and effective architecture in tracking. A typical Siamese Network uses two branches consisting of the same architecture layers. The bottom branch is provided with the ground truth of an object of interest in an ROI, whereas the top branch is assigned the task of estimating the position of the object given the new ROI. Late fusion of the branches is performed and the new position is regressed. Siamese Networks have surpassed all the other deep tracking-by-detection algorithms in the VOT2015 challenge. Due to the scarcity of annotated datasets for aerial tracking, it is difficult to develop an end-to-end deep learning tracker for aerial platforms.
There is a scarcity of annotated datasets for aerial tracking on which deep learning or traditional trackers can be trained and evaluated. UAV123, recently released by Mueller et al., has a ground sampling distance (GSD) that is significantly lower than that of high-altitude aerial platforms, thus resulting in objects occupying more than - pixels. The dataset has sequences at 30 fps, drastically higher than standard WAMI and spectral sequences, which are generally in the 1.42 fps - 2 fps range. Flying RITMOS on an aerial platform is still an ongoing area of development and, due to the lack of any other real dataset in this area, we use synthetically generated Rochester Institute of Technology Multi-object Spectrometer (RITMOS)-like data to evaluate the performance of our proposed tracker. By using deep learning models as feature encoders in our tracker, we avoid the overfitting that training and testing on the same dataset would likely cause.
This study addresses the unique challenges posed by the application of discriminative trackers to aerial platforms. A novel method that employs a discriminative tracker is proposed to tackle low temporal (around 1.42 fps) and spatial (0.3 m) resolutions. Primarily, we design a method to enlarge the area considered by the tracker to handle the low temporal resolution. Given the rich hyperspectral imagery, we utilize pre-trained deep convolutional networks as feature encoders to boost tracking performance. To accommodate deep features in a near real-time tracking system, we design a region-of-interest (ROI) mapping strategy that only forward-passes the large ROI and projects the individual ROIs to the large ROI feature maps (Fig. 1). Finally, the proposed tracker is evaluated on a synthetic hyperspectral video generated by the Digital Imaging and Remote Sensing (DIRSIG) software. To demonstrate the high fidelity of this video, a large single-channel aerial dataset is synthesized using DIRSIG and a deep learning framework is trained on it to classify images from the real dataset (WAMI). We refer the readers to the following link to access our synthetic vehicle classification dataset (https://buzkent86.github.io/datasets/).
To the best of our knowledge, this is the first time an adaptive hyperspectral sensor-inspired discriminative tracker (DeepHKCF) has been proposed to perform robust single target tracking in spectral aerial imagery that can be generalized to the WAMI platform.
III Proposed Tracker
As discussed in the previous section, the tracking platform has a low frame rate, which makes removing the global camera motion necessary for consistent tracking. To this end, we register the input frame to the canonical frame, where the tracking is initialized, using standard computer vision techniques. First, keypoints in the images are extracted using the Scale Invariant Feature Transform (SIFT) and described with gradient orientation histograms. Next, the homography matrix between two images is estimated with the Random Sample Consensus (RANSAC) algorithm. Finally, the input image is warped to the canonical image using the homography matrix accumulated over time.
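The accumulation step can be illustrated with a short sketch. It assumes that the pairwise homographies have already been estimated elsewhere (for example with SIFT matches and RANSAC in a vision library) and that each one maps frame t into frame t-1; the function names `accumulate_homographies` and `warp_point` are hypothetical and not part of the tracker described here.

```python
import numpy as np

def accumulate_homographies(pairwise):
    """Chain frame-to-previous-frame homographies into frame-to-canonical ones.

    pairwise[t-1] is assumed to map points of frame t into frame t-1.
    """
    h_acc = np.eye(3)
    accumulated = []
    for h in pairwise:
        # h_acc maps frame t-1 into frame 0, so h_acc @ h maps frame t into frame 0.
        h_acc = h_acc @ h
        h_acc = h_acc / h_acc[2, 2]  # keep the usual normalization H[2, 2] = 1
        accumulated.append(h_acc.copy())
    return accumulated

def warp_point(h, xy):
    """Apply a homography to a single (x, y) point."""
    x, y, w = h @ np.array([xy[0], xy[1], 1.0])
    return (x / w, y / w)
```

In practice the whole image would be warped with the accumulated matrix rather than single points; the point version is only meant to make the composition order explicit.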
The core of the proposed tracker is built upon the work of Henriques et al. [28, 18] with Kernelized Correlation Filters (KCF). The KCF has emerged as a high-accuracy tracker that can operate at hundreds of frames per second under specific conditions. Its computational efficiency is derived from the correlation filter framework, which represents training examples using a circulant matrix. The fact that a circulant matrix can be diagonalized by the Discrete Fourier Transform (DFT) is the key to reducing the complexity of any tracking method based on the correlation filter. The off-diagonal elements become zero whereas the diagonal elements represent the eigenvalues of the circulant matrix. The KCF applies a kernel to transform the feature channels to a more discriminative domain.
Essentially, the KCF solves the problem in the form of ridge regression:
where represents the desired continuous response, represents the correlation filter and represents template for the given channel. The parameter enables one to integrate features in multiple channel space: an earlier version based on this formulation employed grayscale feature () to learn the solution vector . Later, multi-channel features such as Color, HoG and a concatenation of them showed improved accuracy [18, 27, 39, 40, 41]. To reduce the complexity of the closed-form solution for Eqn. 1, an element-wise multiplication in the frequency domain was proposed for :
where and denote the parameter in Fourier domain and conjugate of a complex number whereas and are the element-wise multiplication, and a regularization term to prevent divisions by zero.
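For reference, the single-channel ridge regression and its Fourier-domain solution, as they appear in the KCF literature of Henriques et al., can be written as below. The notation is assumed for this illustration and may differ from the symbols used in this paper: $\mathbf{x}_i$ are the training samples (cyclic shifts of the template $\mathbf{x}$), $y_i$ the desired responses, $\mathbf{w}$ the filter, $\lambda$ the regularization term, a hat the DFT, $^{*}$ complex conjugation, and $\odot$ element-wise multiplication (the division is element-wise as well).

```latex
\min_{\mathbf{w}} \sum_i \left( \mathbf{w}^{\top} \mathbf{x}_i - y_i \right)^2
  + \lambda \lVert \mathbf{w} \rVert^2 ,
\qquad
\hat{\mathbf{w}} =
  \frac{\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{y}}}
       {\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}} + \lambda}
```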
The solution to the kernelized version of ridge regression is given by  as follows:
where is the kernel matrix and is the vector of coefficients , that represent the solution in the kernel-transformed dual space. The diagonalized Fourier domain dual form solution (non-linear version) is then expressed as
where is the first row of the kernel matrix and is the kernel’s autocorrelation.
For multiple channel cases, we obtain , which represents the first row of the kernel matrix in the frequency domain, also known as gram matrix. It can be formulated as:
where concatenates the individual vectors for channels: . In training step, the arbitrary vector is replaced by , and in test step, it is replaced by .
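In the same assumed notation (following Henriques et al. rather than reproducing the exact symbols of this paper), the dual solution and the multi-channel Gaussian kernel correlation can be written as below. Here $K$ is the kernel matrix, $\boldsymbol{\alpha}$ the dual coefficients, $\mathbf{k}^{\mathbf{x}\mathbf{x}}$ the first row of $K$, $\sigma$ the Gaussian kernel bandwidth, $c$ the channel index, and $\mathcal{F}^{-1}$ the inverse DFT.

```latex
\boldsymbol{\alpha} = \left( K + \lambda I \right)^{-1} \mathbf{y},
\qquad
\hat{\boldsymbol{\alpha}} =
  \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{x}\mathbf{x}} + \lambda},
\qquad
\mathbf{k}^{\mathbf{x}\mathbf{x}'} =
  \exp\!\left( -\frac{1}{\sigma^{2}} \left(
    \lVert \mathbf{x} \rVert^{2} + \lVert \mathbf{x}' \rVert^{2}
    - 2\, \mathcal{F}^{-1}\!\Big( \sum_{c} \hat{\mathbf{x}}_{c}^{*} \odot \hat{\mathbf{x}}'_{c} \Big)
  \right) \right)
```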
To detect the object of interest, we typically wish to evaluate the regression function on several locations in the image, i.e. several candidate patches, which can be modeled by cyclic shifts.
Since is a vector containing the output for all cyclic shifts of , we can diagonalize it to obtain a more efficient computation in the Fourier domain:
where is the kernel correlation of and .
Eqn. 7 then translates into the following equation in time domain:
where denotes the correlation response at all cyclic shifts of the first row of the kernel matrix.
The temporal information can be further integrated into the tracker by updating the filter and target template at every frame as follows:
where is the learning rate. This correlation filter framework only estimates the translation of the object whereas the scale of the object can be updated by running a correlation filter on different size ROIs with same centroids . By correlating a filter with different ROIs, we can get multiple response maps and choose the one with highest confidence to estimate the new scale of the target. In this study, we do not estimate the scale of the target as the scenarios are captured from a fixed altitude platform.
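The training, detection, and update steps described above can be illustrated with a short NumPy sketch. It follows the standard Gaussian-kernel KCF recipe rather than reproducing the exact implementation of this work; the function names, the default values of `sigma` and `lam`, and the per-element normalization of the kernel distance are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_labels(height, width, spread):
    """Desired response: a Gaussian peaked at shift (0, 0), wrapped cyclically."""
    rows = np.minimum(np.arange(height), height - np.arange(height))
    cols = np.minimum(np.arange(width), width - np.arange(width))
    return np.exp(-0.5 * (rows[:, None] ** 2 + cols[None, :] ** 2) / spread ** 2)

def gaussian_correlation(x, z, sigma):
    """Gaussian kernel between x and every cyclic shift of z (arrays are H x W x C)."""
    xf = np.fft.fft2(x, axes=(0, 1))
    zf = np.fft.fft2(z, axes=(0, 1))
    # Cross-correlation for all shifts, summed over the feature channels.
    cross = np.real(np.fft.ifft2(np.sum(np.conj(xf) * zf, axis=2)))
    dist = (np.sum(x ** 2) + np.sum(z ** 2) - 2.0 * cross) / x.size
    return np.exp(-np.maximum(dist, 0.0) / sigma ** 2)

def train(x, y, sigma=0.5, lam=1e-4):
    """Dual coefficients in the Fourier domain for template x and labels y."""
    k = gaussian_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def detect(alpha_f, x_model, z, sigma=0.5):
    """Response map over all cyclic shifts of the candidate patch z."""
    k = gaussian_correlation(x_model, z, sigma)
    return np.real(np.fft.ifft2(alpha_f * np.fft.fft2(k)))

def update(old, new, rate):
    """Linear interpolation used for both the filter and the template."""
    return (1.0 - rate) * old + rate * new
```

The location of the maximum of the map returned by `detect` gives the estimated translation of the target relative to the template; `update` would be applied to both the Fourier-domain coefficients and the stored template after each frame.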
III-A Single KCF-Multiple ROIs Approach
Discriminative trackers like KCF learn to function in an online manner by collecting positive and negative samples and then detecting the target of interest in a ROI to update the classifier. The standard form KCF requires small ROIs as the appearance-based features deteriorate with larger background context. Unfortunately, these features are hard to collect from aerial imaging platforms due to their low spatial resolution. Moreover, there are two other limitations: (1) Increasing the context size leads to background dominated features resulting in confusion between different objects and (2) The platform we consider has lower temporal resolution ( fps) leading to large displacement of objects in successive frames. Adding the platform motion into this picture makes the application of vanilla-form KCF in aerial platforms extremely difficult.
To handle these challenges, we propose a single KCF-in-multiple ROIs approach (Fig. 3). Our approach applies the same KCF to different ROIs overlapping each other to minimize the likelihood of target loss. It is essential to have reasonable overlap between the ROIs (Sect. V-G) as we filter each ROI with a Hanning window to avoid distortion at boundaries due to FFT operation. This approach can be formulated by modifying Eqn. 8 as:
where and represent the indexes for different ROIs. A simple way to estimate the new position of the target in this framework would be using the peak-to-side-lobe ratio (PSR) values in ROIs and finding the position of the pixel with maximum confidence in all ROIs as:
where represents the number of ROIs in full ROI. The PSR, on the other hand, denotes the margin between the peak value in the response map and the mean of the sidelobe corresponding to the area excluding the x pixels around the peak. The result is normalized by the standard deviation of the sidelobe as follows.
This position estimation approach can be softened by considering all the ROIs with PSR values larger than a pre-determined threshold, . In this case, the Eqn. 12 can be reorganized as follows.
By softening our decision, we perform low-pass filtering and avoid jumps to other objects that have a high PSR value in only one ROI.
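A minimal sketch of the PSR computation and of the hard and softened position estimates is given below. Several details are assumptions made for illustration only: the half-width `exclude` of the window removed around the peak, the names `psr` and `locate`, and the specific softening rule, which sums the response maps of all ROIs above the threshold in full-ROI coordinates before taking the maximum.

```python
import numpy as np

def psr(response, exclude=5):
    """Peak-to-sidelobe ratio and peak position of one response map."""
    r0, c0 = np.unravel_index(np.argmax(response), response.shape)
    mask = np.ones(response.shape, dtype=bool)
    # The sidelobe is everything outside a small window around the peak.
    mask[max(r0 - exclude, 0):r0 + exclude + 1,
         max(c0 - exclude, 0):c0 + exclude + 1] = False
    sidelobe = response[mask]
    value = (response[r0, c0] - sidelobe.mean()) / (sidelobe.std() + 1e-12)
    return float(value), (int(r0), int(c0))

def locate(responses, offsets, full_shape, threshold=None):
    """Target position in full-ROI coordinates.

    threshold=None: hard decision, using only the ROI with the highest PSR.
    Otherwise: all ROIs whose PSR exceeds the threshold contribute (assumed rule).
    """
    scores = [psr(r)[0] for r in responses]
    keep = []
    if threshold is not None:
        keep = [i for i, s in enumerate(scores) if s > threshold]
    if not keep:
        keep = [int(np.argmax(scores))]
    canvas = np.zeros(full_shape)
    covered = np.zeros(full_shape, dtype=bool)
    for i in keep:
        r, c = offsets[i]
        h, w = responses[i].shape
        canvas[r:r + h, c:c + w] += responses[i]
        covered[r:r + h, c:c + w] = True
    canvas[~covered] = -np.inf  # never pick a location no selected ROI looked at
    row, col = np.unravel_index(np.argmax(canvas), canvas.shape)
    return int(row), int(col)
```

With the softened rule, a peak that appears consistently in several overlapping ROIs is reinforced, whereas an isolated high response in a single ROI has less influence on the final estimate.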
As mentioned earlier, the single KCF-multiple ROIs approach can better handle the low temporal resolution than the traditional KCF. On the other hand, it increases the complexity linearly from to , where represents the number of ROIs in the full ROI. The low temporal frame rate of the scenario helps us accommodate this approach in the DeepHKCF tracker. It is possible to further increase the frame rate by running the KCF on the multiple ROIs in parallel as the ROI operations are independent.
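To make the linear growth in cost concrete, the layout of overlapping ROIs inside the full ROI can be sketched as follows. The helper `roi_offsets` is hypothetical, and the overlap fraction of 0.5 in the usage note is an assumed value chosen only for illustration.

```python
def roi_offsets(full, roi, overlap):
    """Top-left (row, col) offsets of square ROIs of side `roi` inside a square of side `full`.

    `overlap` is the fraction of an ROI shared with its neighbor along each dimension.
    """
    stride = max(1, int(round(roi * (1.0 - overlap))))
    starts = list(range(0, full - roi + 1, stride))
    if starts[-1] != full - roi:
        starts.append(full - roi)  # make sure the far border is covered
    return [(r, c) for r in starts for c in starts]
```

For example, `roi_offsets(96, 48, 0.5)` yields a 3 x 3 grid of nine ROIs, so the correlation filter is evaluated nine times per frame instead of once; each evaluation is independent and could be run in parallel.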
III-B Traditional Low-level Features
In this study, we follow the KCF tracker and concatenate multiple features as in the SAMF tracker. More specifically, we concatenate Felzenszwalb's HoG (fHoG) feature channels and pure hyperspectral channels and apply the Gaussian kernel operation to learn a more discriminative model as follows:
where and represent the hyperspectral and fHoG feature channels. The number of hyperspectral and fHoG feature channels in this study are 61 and 31 respectively. Additionally, in the results section (Sect. V), we experiment with fHoG features and hyperspectral features alone to observe how well they perform individually.
III-C Deep Convolutional Features
In this study, we follow an approach similar to previous work to learn a discriminative model for the KCF. The low spatial resolution scenario enables us to pursue a slightly higher level of abstraction of objects. In particular, we apply the activations of the fifth convolutional layer learned in VGGNet trained over ImageNet. Additionally, we experiment with different levels of object abstraction in the experiments section (Sect. V).
DIRSIG imagery provides us with a full-frame grayscale image as well as a narrow field of view hyperspectral image at 1.42 fps. Unlike other aerial platforms, it provides hyperspectral data in the visible wavelength range, enabling the use of deep CNN architectures trained on ImageNet consisting of RGB images. One can pick the central red, green and blue channels and forward-pass them through the layers of interest. Another approach could be computing the average of red, green and blue channels in their respective range to come up with the representative red, green and blue channel images to feed the CNN. Our experiments favor the first approach as the latter approach introduces undesired noise due to the averaging operation.
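The first approach, picking central red, green and blue channels, can be sketched as a simple index computation. The sketch assumes 61 uniformly spaced bands between 400 nm and 1000 nm and uses 650, 550, and 450 nm as nominal red, green, and blue centers; these center wavelengths and the helper names are assumptions for illustration, not values taken from this work.

```python
# Nominal band centers in nanometers (assumed for illustration).
RGB_CENTERS_NM = {"red": 650.0, "green": 550.0, "blue": 450.0}

def band_index(wavelength_nm, start_nm=400.0, stop_nm=1000.0, n_bands=61):
    """Index of the band whose center is closest to the requested wavelength."""
    step = (stop_nm - start_nm) / (n_bands - 1)
    index = int(round((wavelength_nm - start_nm) / step))
    return min(max(index, 0), n_bands - 1)

def rgb_band_indices():
    """Band indices to stack as the (R, G, B) input of an ImageNet-trained CNN."""
    return tuple(band_index(RGB_CENTERS_NM[name]) for name in ("red", "green", "blue"))
```

Under these assumptions the three selected bands are 25, 15, and 5. The second approach described above would instead average all bands falling inside each color range before stacking them.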
Fast Convolutional Features with ROI Mapping
The single KCF-multiple ROIs approach treats each ROI independently to compute the filter response. This requires forward-passing individual ROIs through the CNN architecture. Such an inefficient approach leads to a slower tracker. To increase the run-time performance and perform near real-time tracking at the platform frame rate, we use the ROI mapping strategy commonly used in convolutional object detectors such as Fast R-CNN, Faster R-CNN, and R-FCN. This way, we only forward-pass the full ROI and project the individual ROIs to the feature maps extracted from the full ROI as shown in Fig. 4.
where represents the full detection ROI used to get the convolutional features . The individual detection ROIs, , are then projected to the feature map, , through the projection function .
Once we estimate the translation of the target, the filter is updated using the Fourier domain solution as in Eqn. 4. First, the px neighborhood around the target is considered and forward-passed through the convolutional network as shown in Fig. (a)a. To match the detection ROI size (Fig. (b)b), we then project the central px area to the feature maps and reformulate the solution as:
where and represent the full training ROI and actual training ROI mapped to feature maps of the using the function . On the other hand, we can avoid forward-passing the training ROI if the actual training ROI, , is a subset of the detection ROI, . In this case, the Fourier domain solution can be reformulated as
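The projection of an individual ROI onto the feature maps of the full ROI can be sketched in the style of Fast R-CNN as below. The function name `project_roi` is hypothetical, and the default stride of 16 assumes a VGG-style network read out before its fifth pooling layer with an input that has not been resized; both are assumptions for illustration.

```python
def project_roi(roi, full_origin, stride=16):
    """Map an image-space ROI to feature-map cell coordinates.

    roi: (x0, y0, x1, y1) in image pixels, with x1 and y1 exclusive.
    full_origin: (x, y) of the top-left corner of the full ROI in the image.
    Returns (fx0, fy0, fx1, fy1) on the feature map, with fx1 and fy1 exclusive.
    """
    ox, oy = full_origin
    x0, y0, x1, y1 = roi
    fx0 = (x0 - ox) // stride
    fy0 = (y0 - oy) // stride
    fx1 = -(-(x1 - ox) // stride)  # ceiling division keeps partially covered cells
    fy1 = -(-(y1 - oy) // stride)
    return fx0, fy0, fx1, fy1
```

The features of each individual ROI are then obtained by cropping the full-ROI feature map with the returned indices, so the network is run once per frame regardless of the number of ROIs.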
IV Vehicle Classification in WAMI Platform by using a Synthetic Dataset
As discussed in Section I, detecting cars with high accuracy is a major problem for tracking algorithms utilizing the WAMI platform. This is due to two major reasons: (1) the lack of a large dataset captured from the WAMI platform and (2) the lack of color channels, which prevents smooth transfer learning from networks trained on ImageNet.
In this study, we build a synthetic single-channel vehicle classification dataset using DIRSIG and fine-tune a CNN to perform vehicle classification on the real platform (WAMI). To build this dataset, we generate full-frame hyperspectral images captured from the Mega-Scene I scene available in DIRSIG with different settings. The simulation setting is designed as a function of time, so the brightness in the scene varies with sunlight, which in turn leads to a more general dataset. In particular, nine simulations from different months of the year are generated to find representative samples of varying conditions. We keep the other parameters similar to the simulation used to generate the RITMOS-like scenario. Overall, nine different simulations are generated from the same scene with the same vehicular traffic and platform motion as the tracking video.
IV-A Temporal Data Augmentation
It is possible to generate just one frame per simulation. However, to increase the number of simulations, we add more temporal-variance and change the initial platform location. Changing the platform location in a large number of simulations can be a tedious task, and to avoid that, we perform temporal data augmentation by generating low frame rate videos on a moving platform. More specifically, the frame rate for each simulation is set to fps, resulting in images per simulation. This way, we can capture cars from different angles with different backgrounds.
IV-B Hyperspectral Data Augmentation
Data augmentation is highly important in our case, as we mimic the WAMI platform with a dataset consisting of fully synthetic images. In particular, it is difficult to approximate the spectral sensitivity curve of a real platform synthetically. We therefore augment the same car samples at different wavelengths to better approximate the internal mechanics of the WAMI platform. We stick with 61 channels in the visible (400 nm) to near-infrared (1000 nm) wavelength range. In a single-band image setting with 0.2 fps, we produce about 180 images, leading to small spectral variance in the dataset. By using all 61 channels, we generate over 1000 images over the 9 simulations, considering time and spectral depth. This approach has the potential downside of generating a dataset dominated by highly similar images. To address this, we sample 6 bands from 6 uniform distributions covering the 61 channels as shown in Fig. 5. This increases the spectral variance while ensuring a reasonably large gap between the augmented images at different wavelengths.
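One way to realize this band sampling is sketched below. It assumes that the 6 uniform distributions are defined over 6 contiguous, non-overlapping index intervals that together cover the 61 channels; the exact interval layout is an assumption, and the function name `sample_bands` is hypothetical.

```python
import random

def sample_bands(n_channels=61, n_bands=6, rng=None):
    """Draw one band index uniformly from each of n_bands contiguous intervals."""
    rng = rng or random.Random()
    # Interval edges that split the channel indices into nearly equal parts.
    edges = [round(i * n_channels / n_bands) for i in range(n_bands + 1)]
    return [rng.randrange(edges[i], edges[i + 1]) for i in range(n_bands)]
```

Because each interval contributes exactly one band, the sampled wavelengths are spread across the whole range and two samples can never come from the same interval.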
IV-C Positive and Negative Samples Collection
The procedure described above produces 27613 vehicle chips ( px), with the vehicle located at the center of each positive chip. Similar to the WAMI platform, a vehicle is represented by 20 × 10 pixels on average in the generated scenarios. Adding context to positive samples seems to improve the learned weights in a CNN. To collect negative samples, we perform hard-negative mining by considering areas surrounding the positive samples. A negative sample is randomly captured from an area whose center is T = - pixels away from the center of the positive sample.
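The negative sampling step can be sketched as drawing a chip center at a random direction and at a distance T from the positive center. The distance bounds `t_min` and `t_max` are left as parameters because no specific values are assumed here, and the function name `negative_center` is hypothetical.

```python
import math
import random

def negative_center(positive_center, t_min, t_max, rng):
    """Random chip center whose distance to the positive center lies in [t_min, t_max]."""
    distance = rng.uniform(t_min, t_max)
    angle = rng.uniform(0.0, 2.0 * math.pi)
    cx, cy = positive_center
    return (cx + distance * math.cos(angle), cy + distance * math.sin(angle))
```

A full pipeline would additionally reject centers that fall outside the image or on top of another annotated vehicle; that check is omitted in this sketch.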
Our final dataset consists of 55226 chips captured from different positions of Mega-Scene I at different times. To validate the performance on the WAMI platform, we annotate 600 positive and negative chips from the CLIFF06 and CLIFF07 videos captured from the WAMI platform. Some of the positive samples from the training and validation dataset can be visualized in Fig. 6. Finally, we train a well-known CNN architecture to perform vehicle classification on the WAMI platform.
IV-D Training the models
The architecture used in this study is ZFNet, an optimized version of AlexNet. We adopt two different training strategies: (1) training from scratch and (2) fine-tuning the weights learned on ImageNet using the synthetic aerial vehicle detection dataset. In the latter approach, the learning rate is set to 0.0001 for all layers except the classification layer, which is assigned a learning rate of 0.0005. In the former approach, we tune the learning rate to 0.1. The ZFNet trained from scratch runs for 4200 iterations with a batch size of 64, whereas the one pre-trained on ImageNet is trained for 400 iterations with the same batch size. In these two experiments, the networks are validated on the 600 samples from WAMI (CLIFF06 and CLIFF07). To integrate further context information into the learned weights, we introduce dilated convolutions with hole size and in the 1st and 2nd convolutional layers. Finally, we follow a two-stage training strategy that uses the WAMI samples to further update the weights of the fine-tuned ZFNet. The model is validated on the remaining WAMI samples. This, as expected, boosts the classification accuracy on the WAMI platform. Training is performed on an NVIDIA Tesla K80 GPU in the Caffe framework.
As seen in Table I, over accuracy is achieved by only using our synthetic dataset to train the ZFNet. This supports the high fidelity of the hyperspectral scenario used to evaluate the deep hyperspectral kernelized correlation filter tracker. To further improve the accuracy, a small number of WAMI samples are used to train a more WAMI domain-specific model. This improves the accuracy up to reaching the state of the art in vehicle classification on the WAMI platform. Finally, with the availability of this dataset, the need to collect a large number of training samples from the WAMI platform is removed. To support studies related to aerial vehicle detection, we plan on releasing the full images with ground truth locations of the vehicles. This will give more freedom to train detection-domain architectures such as Faster R-CNN, R-FCN, YOLO9000, and SSD.
V Tracking Experiments
DIRSIG is a very useful program for generating remote sensing images with high fidelity. This was demonstrated in the previous section, where we generated a large synthetic dataset representative of the WAMI platform and trained a convolutional network to classify real images from the WAMI platform. A hyperspectral tracking video representative of the RITMOS sensor was generated in previous studies [4, 6, 13, 10]. The hyperspectral tracking scenario has two different videos: (1) without trees and (2) with dense trees ( full occlusion by trees). Both videos run at 1.42 fps and contain 157 frames with the same vehicular traffic and platform position. For our study, we used both videos to evaluate the performance of the proposed DeepHKCF tracker, its variants, and other hyperspectral state-of-the-art trackers.
V-A Hyperparameter Tuning
This section discusses the hyperparameters that need to be tuned to perform optimal tracking considering both run-time performance and accuracy. The KCF has a number of hyperparameters including padding size, desired Gaussian response width, learning rate, and Gaussian kernel bandwidth. Our approach has three main hyperparameters: (1) full ROI size, (2) overlap between ROIs, and (3) PSR threshold to remove the contribution of noisy response maps. We set the size of a single ROI to 48 × 48 pixels since each vehicle occupies about 20 × 10 pixels and hence reasonable context is captured. This removes the need for a padding size hyperparameter in the DeepHKCF. The other KCF hyperparameters are tuned to values similar to those in the original KCF paper. The overlap between ROIs is set to in each dimension (Sect. V-G) whereas the full ROI size and PSR threshold are set to 96 × 96 pixels (Sect. V-H) and respectively.
V-B Tracking Performance Metrics
For analyzing the performance of our tracker and its variants, we use two metrics: (1) Central Location Error and (2) Precision, which are defined as follows:
Central Location Error
The central location error (CLE) for a dataset can be calculated in three effective steps: (1) The central location error is defined as the average Euclidean distance between the predicted center location of the target and the ground truth of a frame. (2) The average center location error over all the frames of one sequence is used to then summarize the overall performance value for that sequence. (3) Lastly, the average central location error of a dataset is calculated by averaging the central location error across all the sequences in the dataset. Ideally, it is preferred to have a low central location error.
Precision is obtained by thresholding the Euclidean distance between the predicted and ground truth centroids. In this paper, the final Precision scores are obtained by (1) dividing the number of successful frames by the total number of frames in a sequence to get the Precision score at the respective threshold, and (2) performing the same operation on all the sequences and averaging to compute the final Precision score on a dataset. Pr 20 px and Pr 50 px represent the precision at Euclidean distance thresholds of 20 and 50 pixels. The threshold is swept between to pixels in 1-pixel intervals to draw the Precision figures. Ideally, it is preferred to have a high Precision value.
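Both metrics can be computed with a few lines of code, sketched below. The function names are hypothetical, and counting a frame as successful when its error is less than or equal to the threshold (rather than strictly less) is an assumption.

```python
import math

def center_errors(predicted, ground_truth):
    """Per-frame Euclidean distance between predicted and true target centers."""
    return [math.hypot(px - gx, py - gy)
            for (px, py), (gx, gy) in zip(predicted, ground_truth)]

def sequence_cle(predicted, ground_truth):
    """Central location error of one sequence: mean per-frame distance."""
    errors = center_errors(predicted, ground_truth)
    return sum(errors) / len(errors)

def sequence_precision(predicted, ground_truth, threshold):
    """Fraction of frames whose center error is within the threshold."""
    errors = center_errors(predicted, ground_truth)
    return sum(1 for e in errors if e <= threshold) / len(errors)

def dataset_average(per_sequence_values):
    """Dataset-level score: mean of the per-sequence scores."""
    return sum(per_sequence_values) / len(per_sequence_values)
```

A Precision curve is obtained by calling `sequence_precision` for each threshold in the swept range and averaging the results over all sequences with `dataset_average`.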
V-C Results on the No-trees Scenario
After tuning the hyperparameters of DeepHKCF, we test it on the 43 vehicles in the no-trees scenario. We compare the DeepHKCF to a number of variants of the proposed single KCF multiple ROIs approach as seen in Table II. Furthermore, we perform experiments on the original KCF algorithm (single KCF-single ROI approach) with the same ROI size ( px). Additionally, we compare the proposed tracker to Efficient Convolutional Operator tracker (ECO)  that is ranked first in the VOT16 tracking benchmark. For fair comparison, we increase the learning rate of the ECO tracker to match our scenarios and use the same features. Similar to DeepHKCF and HKCF, we use the activations of the convolutional layer of the VGGNet and fHoG features. To determine the ROI area, ECO considers the padding size of 3.5, larger than the optimal padding size (2.0) of the KCF. We keep this hyper-parameter same as including further background deteriorates the features. For pixel vehicle, the ECO and vanilla-form KCF trackers have a search area of , and pixels whereas it is for the proposed DeepHKCF. Fig. (b)b and Table II show the performances of the proposed DeepHKCF, its variants and the baseline KCF and ECO trackers.
The DeepHKCF performs exceptionally well in the no-trees scenario, achieving precision at the 20 px threshold and outperforming all the baseline methods by a large margin. Meanwhile, the proposed HKCF with fHoG features performs substantially worse than the one with deep features. However, it outperforms the original KCF with fHoG features by a large margin at px precision as shown in Table II, demonstrating the contribution of the proposed single KCF-multiple ROIs approach to low frame rate tracking. Concatenating hyperspectral features with fHoG slightly degrades the precision, whereas hyperspectral feature channels alone perform worse than both. This suggests that the NIR channels do not contribute to tracking in the KCF framework. The ECO tracker, on the other hand, delivers - lower accuracy than the DeepHKCF trackers at 20 px precision and about worse in terms of the precision at 50 px and central location error. All in all, the DeepHKCF tracker with ROI mapping (FastDeepHKCF) offers the best trade-off, given its reasonably high tracking accuracy and the highest operation rate among the DeepHKCF trackers.
V-D Results on the Dense Trees Scenario
In addition to the no-trees scenario, we run the DeepHKCF tracker and its variants on the same scenario with dense trees. This is an extremely challenging scene dominated by large trees and their shadows, as shown in Fig. 2. On average, a vehicle is fully occluded in 1 out of 4 frames. Severe occlusions combined with the low frame rate make this a more challenging scene. As in the no-trees scenario, the DeepHKCF trackers outperform the baseline methods other than the ECO tracker by a large margin (Fig. (b)). At 20 px precision, the DeepHKCF tracker achieves about accuracy whereas others perform - worse. Among the DeepHKCF trackers, the FastDeepHKCF delivers similar precision at 50 px and a higher frame rate. Similar to the no-trees scenario, combining hyperspectral feature channels with fHoG degrades the performance relative to the fHoG-only features. We believe that this could be due to more frequent switching, through occlusions, to non-vehicle objects whose hyperspectral features resemble those of the target of interest. With fHoG-only features, the tracker is less likely to switch to an object that does not look like a vehicle.
The dramatic drop in precision rates between the no-trees and dense trees scenarios is clearly visible in Fig. 7. This is likely due to three major reasons: (1) the high frequency of severe occlusions, (2) the low video frame rate, and (3) the relatively small search area considered by our single KCF-multiple ROIs approach. The combination of the first two reasons leads to dramatically large displacements of objects between the frames where they are visible. As a result, the targets fall outside the search area of the DeepHKCF tracker. There are two solutions to address the challenge of tracking through severe occlusions. The first and less practical solution is to increase the full ROI size in each dimension. This increases the likelihood of keeping the target in our search area as it travels through severe occlusions. However, it also reduces the run-time performance. A more practical solution could be delivered by leveraging a Bayes Filter. For instance, [6, 4] use a Bayes Filter in a Multi-dimensional Assignment algorithm to update the measurements in light of the later measurements in the same scenario. This way, we can filter out the unlikely jumps that occur during severe occlusions. On the other hand, we believe that increasing the search area in a practical manner might be the key to achieving state-of-the-art performance in scenarios dominated by trees (see Sect. V-H).
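The interaction between occlusion, frame rate, and search area can be made concrete with a small calculation. The sketch below is a one-dimensional simplification; the vehicle speed and full ROI size are hypothetical values chosen for illustration, and only the 1.42 fps frame rate comes from the experiments.

```python
def displacement_px(speed_px_per_s, fps, occluded_frames):
    """Pixels travelled between two frames in which the target is visible."""
    elapsed_s = (occluded_frames + 1) / fps
    return speed_px_per_s * elapsed_s

def stays_in_search_area(displacement, full_roi_px):
    """True if a target starting at the ROI center remains inside the ROI (1-D view)."""
    return displacement <= full_roi_px / 2

FPS = 1.42          # frame rate of the evaluated videos
SPEED = 71.0        # hypothetical vehicle speed in pixels per second
FULL_ROI = 150      # hypothetical full ROI side length in pixels

visible_step = displacement_px(SPEED, FPS, occluded_frames=0)    # about 50 px
occluded_step = displacement_px(SPEED, FPS, occluded_frames=1)   # about 100 px
```

With these assumed numbers, the target stays within the search area between consecutive visible frames but leaves it after a single fully occluded frame, which is the failure mode described above.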
V-E Experiments on Temporally Down-sampled Video
In the previous sections, we evaluated the proposed trackers and the baseline methods on the 1.42 fps videos with dense trees and without trees. The ECO tracker performs only slightly worse than the DeepHKCF trackers at 20 px precision. We believe that this might be due to slowly moving or stopped vehicles. To further test their performance under more drastic target motion, we temporally down-sample the video without trees by a factor of two, resulting in a 0.7 fps video. All the hyperparameters of the FastDeepHKCF and ECO are kept the same as before for a fair comparison. As seen in Table IV, the DeepHKCF trackers outperform the ECO tracker by a large margin in terms of precision and central location error, showing their robustness to extreme target displacement between successive frames. The ECO tracker misses more targets due to the smaller ROI considered in its detection operation.
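Temporal down-sampling by two amounts to keeping every second frame, which halves the frame rate and roughly doubles the per-frame displacement of a moving target. A minimal sketch, with a hypothetical helper name and integer placeholders standing in for the frames:

```python
def downsample(frames, fps, factor):
    """Keep every `factor`-th frame and return the kept frames with the new frame rate."""
    return frames[::factor], fps / factor

frames = list(range(10))                  # placeholders for 10 video frames
kept, new_fps = downsample(frames, 1.42, 2)
```

Here `kept` holds frames 0, 2, 4, 6, and 8, and `new_fps` is 0.71, which is reported as 0.7 fps in the text.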
V-F Comparison with Hyperspectral Trackers
In the previous section, the proposed tracker was compared to the state-of-the-art discriminative trackers. In this section, we compare the proposed DeepHKCF tracker to the generative hyperspectral trackers [5, 6]. These trackers are designed for the DIRSIG scenarios and make extensive use of the multi-dimensional assignment algorithm (MDA). HFT relies on off-line trained road and car classifiers to optimize the search space. The hyperspectral histograms are then computed in a sliding window to assign similarity scores. The resulting heat map is thresholded and post-processed to find the blobs, which are assigned to the target using MDA. The HLT, on the other hand, learns a generative target model using hyperspectral likelihood maps rather than off-line trained classifiers. The blobs are extracted from the final thresholded map, and track statistics are updated from the past N frames using the MDA. Here, we compare DeepHKCF only with HLT in Table V, since HFT relies on car and asphalt classifiers trained on samples from the same scene.
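The thresholding and blob extraction step shared by both hyperspectral trackers can be illustrated as follows. This is a generic sketch and not the HFT or HLT implementation: the 4-connectivity, the use of blob centroids as measurements, and the toy heat map are assumptions, and the subsequent MDA assignment is omitted.

```python
from collections import deque

def extract_blobs(heat_map, threshold):
    """Threshold a 2-D heat map and return the (row, col) centroid of each 4-connected blob."""
    rows, cols = len(heat_map), len(heat_map[0])
    seen = [[False] * cols for _ in range(rows)]
    centroids = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or heat_map[r][c] < threshold:
                continue
            # Breadth-first flood fill collects all pixels of one blob.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and not seen[ny][nx] and heat_map[ny][nx] >= threshold:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            mean_row = sum(p[0] for p in pixels) / len(pixels)
            mean_col = sum(p[1] for p in pixels) / len(pixels)
            centroids.append((mean_row, mean_col))
    return centroids

# Toy similarity heat map (hypothetical values) containing two blobs.
heat_map = [
    [0.9, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.1],
    [0.0, 0.0, 0.9, 0.0],
]
blobs = extract_blobs(heat_map, threshold=0.5)
```

Each centroid would then serve as a candidate measurement for the data association stage.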
As seen in Table V, the use of a Bayes Filter and the multi-dimensional assignment algorithm (MDA) is crucial in a scenario largely dominated by occlusions. The effect of shortening the time window in MDA is evident: HLT’s performance drops drastically when the width is reduced from -D to -D, especially for the scenario with trees. The proposed DeepHKCF trackers outperform HLT in the no-trees scenario by about in terms of central location error, establishing their advantage in a scenario without occlusion. Finally, the FastDeepHKCF delivers the best trade-off between tracking accuracy and run-time performance.
V-G Effect of Overlap Ratio
We evaluate the DeepHKCF tracker with ROI mapping (FastDeepHKCF) as a function of the overlap ratio between the adjacent ROIs. As mentioned before, overlap between the adjacent ROIs is necessary because the Hanning window is applied to the features before the FFT operation. The Hanning window filters the noise at the boundaries resulting from the FFT operation. At the same time, increasing the overlap ratio leads to increased complexity () due to a larger number of ROIs () in the full ROI ( px). Fig. (a) shows the precision rates of the FastDeepHKCF tracker with different overlap ratios between the adjacent ROIs.
The and overlap ratios lead to drastically better results than the lower ones (Fig. (a)). Considering the accuracy/speed trade-off, overlap (m = ) is used as the optimal setting.
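The relationship between overlap ratio, ROI count, and the Hanning window can be sketched as follows. This is an illustration under stated assumptions, not the tiling used in the tracker: the stride rule (ROI size times one minus the overlap ratio), the border handling, and the 300 px and 100 px sizes are hypothetical.

```python
import math

def hann(n):
    """Hann (Hanning) window of length n; the weights fall to zero at both ends."""
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def roi_offsets(full_size, roi_size, overlap):
    """Top-left offsets of ROIs along one axis for an overlap ratio in [0, 1)."""
    stride = max(1, round(roi_size * (1.0 - overlap)))
    offsets = list(range(0, full_size - roi_size + 1, stride))
    # Add a final ROI if the regular grid does not reach the border of the full ROI.
    if offsets[-1] != full_size - roi_size:
        offsets.append(full_size - roi_size)
    return offsets

def roi_grid(full_size, roi_size, overlap):
    """(x, y) offsets of all ROIs tiling a square full ROI."""
    offsets = roi_offsets(full_size, roi_size, overlap)
    return [(x, y) for y in offsets for x in offsets]

FULL, ROI = 300, 100   # hypothetical full ROI and single ROI side lengths in pixels
counts = {ov: len(roi_grid(FULL, ROI, ov)) for ov in (0.0, 0.5, 0.75)}
window = hann(5)
```

Under these assumptions the ROI count grows from 9 with no overlap to 25 at 50% overlap and 81 at 75% overlap, so the number of correlation filter evaluations rises quickly with the overlap ratio. The window weights are zero at the ROI borders, which is why a target lying near the edge of one ROI needs to fall closer to the center of a neighboring one.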
V-H Effect of ROI Size
The overlap between the adjacent ROIs is an essential part of the DeepHKCF tracker, as it ensures that every single point in the full ROI is considered by the correlation filter. Another key parameter in this direction is the full ROI size, since we have a low temporal resolution and an occlusion-dominated scene. We enlarge the ROI size of the optimal DeepHKCF tracker and observe the performance in the scene with trees. The results are shown in Fig. (b). In this experiment, the run-time performance of the tracker is ignored, as the goal is to measure the contribution of the full ROI size.
As shown in Fig. (b), a larger ROI size with the same overlap ratio does not necessarily lead to better performance, while it quadratically increases the computational cost. This could be due to growing confusion, as the larger ROIs contain a higher number of similar objects. Additionally, these results demonstrate the need to couple tracking-by-detection algorithms with a Multi-dimensional Assignment algorithm in a Bayes Filter framework in occlusion-dominated scenes [6, 5, 4].
Adaptive multi-modal sensors are becoming increasingly important in the aerial tracking domain due to the unique challenges posed by these platforms. In this study, we propose a tracking-by-detection tracker inspired by a multi-modal sensor and deep features. This approach replaces the traditional template-matching based hyperspectral trackers with a state-of-the-art tracker that is becoming increasingly popular in traditional visual object tracking. More specifically, we deliver a new framework, called the single KCF-multiple ROIs approach, to handle the low temporal resolution of aerial platforms in the KCF tracker. To further boost the tracking accuracy, we replace the traditional features with deep CNN features. Finally, an ROI mapping approach is proposed to speed up feature extraction in the single KCF-multiple ROIs approach. The proposed DeepHKCF tracker is evaluated on synthetic scenarios generated by the DIRSIG software. In the no-trees scenario, the DeepHKCF tracker performs exceptionally well with precision at 50 px, outperforming other trackers. In the same scenario but dominated by occlusions, it is outperformed by trackers employing a multi-dimensional assignment algorithm and Bayes Filter. To demonstrate the high fidelity of the DIRSIG-generated scenarios, we build a synthetic aerial vehicle classification dataset to perform classification on the real platform (WAMI). Our dataset, consisting of samples, was used to train CNNs to perform binary classification. We achieve about on the WAMI samples by only training on the synthetic dataset. This dataset can be highly beneficial in aerial detection and tracking due to the limited amount of publicly available data in those domains.
In future work, we plan on supporting the DeepHKCF tracker by integrating a multi-dimensional assignment algorithm and Bayes Filter to better handle severe occlusions.
This work has been supported by the Dynamic Data Driven Applications Systems Program, Air Force Office of Scientific Research under Grant FA9550-11-1-0348.
-  R. Pelapur, S. Candemir, F. Bunyak, M. Poostchi, G. Seetharaman, and K. Palaniappan, “Persistent target tracking using likelihood fusion in wide-area and full motion video sequences,” in Information Fusion (FUSION), 2012 15th International Conference on. IEEE, 2012, pp. 2420–2427.
-  J. Portmann, S. Lynen, M. Chli, and R. Siegwart, “People detection and tracking from aerial thermal views,” in Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014, pp. 1794–1800.
-  M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, “Eco: Efficient convolution operators for tracking,” arXiv preprint arXiv:1611.09224, 2016.
-  B. Uzkent, M. J. Hoffman, and A. Vodacek, “Efficient integration of spectral features for vehicle tracking utilizing an adaptive sensor,” in IS&T/SPIE Electronic Imaging, 2015, pp. 940 707–940 707.
-  B. Uzkent, Real-time Aerial Vehicle Detection and Tracking using a Multi-modal Optical Sensor. Rochester Institute of Technology, 2016.
-  B. Uzkent, A. Rangnekar, and M. J. Hoffman, “Aerial vehicle tracking by adaptive fusion of hyperspectral likelihood maps,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on. IEEE, 2017, pp. 233–242.
-  AFRL, “Wright-Patterson Air Force Base (WPAFB) dataset,” https://www.sdms.afrl.af.mil/index.php?collection=wpafb2009, 2009.
-  ——, “Wami columbus large image format (clif) dataset,” https://www.sdms.afrl.af.mil/index.php?collection=clif2007, 2007.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
-  B. Uzkent, M. J. Hoffman, A. Vodacek, J. P. Kerekes, and B. Chen, “Feature matching and adaptive prediction models in an object tracking dddas,” Procedia Computer Science, vol. 18, pp. 1939–1948, 2013.
-  B. Uzkent, M. J. Hoffman, A. Vodacek, and B. Chen, “Feature matching with an adaptive optical sensor in a ground target tracking system,” IEEE Sensors Journal, vol. 15, no. 1, pp. 510–519, 2015.
-  R. D. Meyer, K. J. Kearney, Z. Ninkov, C. T. Cotton, P. Hammond, and B. D. Statt, “RITMOS: a micromirror-based multi-object spectrometer,” in SPIE Astronomical Telescopes+ Instrumentation. International Society for Optics and Photonics, 2004, pp. 200–219.
-  B. Uzkent, M. J. Hoffman, and A. Vodacek, “Integrating Hyperspectral Likelihoods in a Multidimensional Assignment Algorithm for Aerial Vehicle Tracking,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 9, no. 9, pp. 4325–4333, 2016.
-  S. Han, A. Fafard, J. Kerekes, M. Gartley, E. Ientilucci, A. Savakis, C. Law, J. Parhan, M. Turek, K. Fieldhouse et al., “Efficient generation of image chips for training deep learning algorithms,” in Automatic Target Recognition XXVII, vol. 10202. International Society for Optics and Photonics, 2017, p. 1020203.
-  S. Han and J. P. Kerekes, “Overview of passive optical multispectral and hyperspectral image simulation techniques,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2017.
-  B. Uzkent, M. J. Hoffman, and A. Vodacek, “Spectral validation of measurements in a vehicle tracking dddas,” Procedia Computer Science, vol. 51, pp. 2493–2502, 2015.
-  B. Uzkent, Real-time Aerial Vehicle Detection and Tracking using a Multi-modal Optical Sensor. Rochester Institute of Technology, 2016.
-  J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 583–596, 2015.
-  S. Hare, S. Golodetz, A. Saffari, V. Vineet, M.-M. Cheng, S. L. Hicks, and P. H. Torr, “Struck: Structured output tracking with kernels,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 10, pp. 2096–2109, 2016.
-  Z. Kalal, K. Mikolajczyk, and J. Matas, “Tracking-learning-detection,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 7, pp. 1409–1422, 2012.
-  R. LaLonde, D. Zhang, and M. Shah, “Fully convolutional deep neural networks for persistent multi-frame multi-object detection in wide area aerial videos,” arXiv preprint arXiv:1704.02694, 2017.
-  M. Yi, F. Yang, E. Blasch, C. Sheaff, K. Liu, G. Chen, and H. Ling, “Vehicle classification in wami imagery using deep network,” in SPIE Defense+ Security. International Society for Optics and Photonics, 2016, pp. 98 380E–98 380E.
-  N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 886–893.
-  Y. Li and J. Zhu, “A scale adaptive kernel correlation filter tracker with feature integration,” in European Conference on Computer Vision. Springer, 2014, pp. 254–265.
-  P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE transactions on pattern analysis and machine intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
-  D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, “Visual object tracking using adaptive correlation filters,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 2544–2550.
-  H. K. Galoogahi, T. Sim, and S. Lucey, “Multi-channel correlation filters,” in Proceedings of International Conference on Computer Vision, 2013.
-  J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “Exploiting the circulant structure of tracking-by-detection with kernels,” in Proceedings on European Conference on Computer Vision, 2012, pp. 702–715.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg, “Convolutional features for correlation filter based visual tracking,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015, pp. 58–66.
-  D. Held, S. Thrun, and S. Savarese, “Learning to track at 100 fps with deep regression networks,” in European Conference on Computer Vision. Springer, 2016, pp. 749–765.
-  L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr, “Fully-convolutional siamese networks for object tracking,” in European Conference on Computer Vision. Springer, 2016, pp. 850–865.
-  L. Leal-Taixé, C. Canton-Ferrer, and K. Schindler, “Learning by tracking: Siamese cnn for robust target association,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 33–40.
-  M. Mueller, N. Smith, and B. Ghanem, “A benchmark and simulator for uav tracking,” in Proceedings of European Conference on Computer Vision, 2016, pp. 445–461.
-  E. J. Ientilucci and S. D. Brown, “Advances in wide-area hyperspectral image simulation,” in AeroSense 2003. International Society for Optics and Photonics, 2003, pp. 110–121.
-  D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
-  M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” in Readings in computer vision. Elsevier, 1987, pp. 726–740.
-  M. Tang and J. Feng, “Multi-kernel correlation filter for visual tracking,” in Proceedings of International Conference on Computer Vision, 2015, pp. 3038–3046.
-  C. Ma, X. Yang, C. Zhang, and M.-H. Yang, “Long-term correlation tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5388–5396.
-  A. Bibi and B. Ghanem, “Multi-template scale-adaptive kernelized correlation filters,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015, pp. 50–57.
-  R. Rifkin, G. Yeo, T. Poggio et al., “Regularized least-squares classification,” Nato Science Series Sub Series III Computer and Systems Sciences, vol. 190, pp. 131–154, 2003.
-  R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
-  J. Dai, Y. Li, K. He, and J. Sun, “R-fcn: Object detection via region-based fully convolutional networks,” in Advances in neural information processing systems, 2016, pp. 379–387.
-  F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
-  J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” arXiv preprint arXiv:1612.08242, 2016.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
-  A. B. Poore, “Multidimensional assignment formulation of data association problems arising from multitarget and multisensor tracking,” Computational Optimization and Applications, vol. 3, no. 1, pp. 27–57, 1994.