Spectral-spatial features for material based object tracking in hyperspectral video

12/11/2018, by Fengchao Xiong et al.

Traditional color images only depict color intensities in red, green and blue channels, often making object trackers fail when a target shares a similar color or texture with its surrounding environment. Alternatively, the material information of targets contained in the many bands of hyperspectral images (HSIs) is more robust to these challenging conditions. In this paper, we conduct a comprehensive study on how HSIs can be utilized to boost object tracking from three aspects: benchmark dataset, material feature representation and material based tracking. In terms of benchmark, we construct a dataset of fully-annotated videos which contain both hyperspectral and color sequences of the same scene. We extract two types of material features from these videos. We first introduce a novel 3D spectral-spatial histogram of gradients to describe the local spectral-spatial structure in an HSI. Then an HSI is decomposed into its detailed constituent materials and associated abundances, i.e., proportions of materials at each location, to encode the underlying information on material distribution. These two types of features are embedded into correlation filters, yielding material based tracking. Experimental results on the collected benchmark dataset show the potential and advantages of material based object tracking.







I Introduction

Object tracking is one of the fundamental tasks in computer vision, particularly for video surveillance and robotics. The task, estimating the size and location of an object in each video frame given a bounding box of the object in the first frame, remains challenging [1]. Many efforts have been made to obtain informative visual cues, such as color intensities [2, 3], color names [4], texture [5, 6] and more for object tracking. However, tracking in traditional color videos has its inherent limitations. Trackers tend to fail in challenging scenarios in which a target has a similar color or texture to its surrounding environment, as exemplified by the camouflage of wildlife [7] or the recent failure of autonomous cars (https://www.tesla.com/blog/tragic-loss). In these cases, however, the underlying material information of foreground and background is distinct. Because color images cannot fully depict the physical properties of surface reflectance, the spectral differences between materials are lost, making foreground and background indistinguishable in practice [8].

(a) RGB
(b) HSI
(c) Spectrum (1, 2)
(d) Spectrum (3, 4)
(e) Spectrum (5, 6)
Fig. 1: An example of HSIs for material identification. (a) shows plastic (left) and cotton (right) toys. (b) shows their corresponding false-color image generated from a hyperspectral image. (c)-(e) demonstrate the spectral reflectances at several pixels. Though these pixels are similar in color, their spectra are different.

In contrast, hyperspectral images (HSIs) record continuous spectral information at each pixel instead of monochrome or color intensities. The spectral information provides details on the material constitution of the scene contents and increases the inter-object discrimination capability [9]. Fig. 1 shows a sample HSI for material identification from two toy minions built from different materials. Though pixels 1 and 2 are both black in color, their recorded spectral responses are different, especially from 571nm to 642nm. The same observation can be made from Figs. 1(d) and 1(e). Benefiting from this capability for material identification along the spectral dimension, HSIs have enabled many unique applications in remote sensing [10] and computer vision [7, 9, 11].

Using HSIs for visual object tracking faces many challenges. On the one hand, there is a lack of benchmark datasets with high diversity to support hyperspectral object tracking. One reason is that, with the limitations of most existing hyperspectral sensors, it is difficult to collect real-time hyperspectral videos at a high frame rate with a high signal-to-noise ratio and high spatial and spectral resolutions. On the other hand, due to the high-dimensional nature of HSIs caused by their large number of spectral bands, traditional feature extractors developed for monochrome or color images may not provide highly discriminative information for hyperspectral data, because they ignore valuable spectral information. Therefore, there is a need to develop effective spectral-spatial feature extraction methods that simultaneously consider the spatial information, spectral information and joint spectral-spatial information in an HSI.

Several previous works utilized HSIs for object tracking in specific applications, for example, chemical gas plume tracking [12] and aerial vehicle tracking [13]. In most cases, pixel-wise spectral reflectance is adopted as the feature for object tracking [14, 15, 16, 13]. Alternatively, Uzkent et al. [17] proposed a deep kernelized correlation filter based method (DeepHKCF) for aerial object tracking, in which an HSI is converted to a false-color image before being passed to a deep convolutional neural network. In [18], Qian et al. selected a set of three-dimensional patches as convolutional kernels to extract features. The problem with all these works is that the spectral-spatial information of targets is not fully explored, so the learned models are not sufficiently discriminative to achieve robust tracking. Furthermore, most of these tracking methods are not developed for the computer vision setting, in which objects are captured at high frame rates in close-range scenarios.

In this paper, we present a comprehensive study on how HSIs can be used for object tracking. Our original and novel contributions lie in four aspects: 1) a new hyperspectral tracking benchmark dataset, 2) a material based hyperspectral tracking (MHT) framework, 3) a novel spectral-spatial feature extraction method, and 4) estimation of material distribution. We introduce a fully annotated dataset with 35 hyperspectral videos captured by a commercial high-speed hyperspectral camera (dataset: https://bearshng.github.io/mht/). Moreover, we tackle the object tracking problem from a new perspective in which material properties are considered. To explore material information, we develop two feature extractors, a spectral-spatial histogram of gradients (SSHOG) and the spatial distribution of materials, to capture the material properties in an HSI. SSHOG captures spectral-spatial texture information: 3D spectral-spatial cubes, rather than 2D spatial patches, are used to extract spatial and spectral gradient orientations. The distributions of the underlying constituent materials in the scene are encoded by abundances, obtained by hyperspectral unmixing, which decomposes an HSI into constituent spectra (or endmembers) and their corresponding fractions (or abundances). These spectral-spatial features are general in nature and can facilitate various subsequent computer vision tasks, for example, hyperspectral super-resolution [19, 20] and salient object detection [7]. Moreover, we develop an online learning approach to adaptively adjust their relative importance in object state estimation. Extensive experiments with detailed analysis are carried out to build a better understanding of how HSIs can be used to promote object tracking. Given the increasing adoption of HSIs in computer vision tasks, we expect our work will clear some pivotal obstacles in both theoretical research and the practical usage of HSIs.

TABLE I: Sample sequences in our benchmark dataset. We show the ground truth bounding box in the first frame of both RGB video (left) and hyperspectral video (right).

II Related Work

The discriminative correlation filter (DCF) is widely used in object tracking due to its competitive performance and the computational efficiency enabled by the fast Fourier transform (FFT). DCF produces filters by minimizing the output sum of squared error (MOSSE) [2] over all circular shifts of a training sample. Some efforts have been made to address several limitations of MOSSE. For example, Henriques et al. embedded kernel methods into the correlation filter to achieve a non-linear decision boundary [5] without sacrificing computational efficiency. Improvements have also been made in feature representation to learn more discriminative filters, for example, by extracting HOG [21], color names [4, 22] and deep features learned by convolutional neural networks (CNNs) [23, 24, 25, 26, 27].

Some studies focus on spatial regularization to suppress unwanted boundary effects caused by the periodic assumption on training samples [6, 28, 22, 29]. Danelljan et al. introduced spatial regularization to penalise correlation filter coefficients according to their spatial location [6]. The background-aware correlation filter (BACF) considers all background patches as negative samples by using a rectangular mask covering the central part of the circular samples [28]. Benefiting from the alternating direction method of multipliers (ADMM) and the FFT, BACF is also computationally efficient.
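The DCF principle above can be illustrated with a minimal single-channel MOSSE-style sketch. The function names and the regularizer value are ours; real trackers such as KCF or BACF add kernels, multi-channel features and masking on top of this closed-form Fourier-domain solution:

```python
import numpy as np

def train_mosse(patch, desired):
    """Learn a single-channel correlation filter in the Fourier domain by
    minimizing the squared error between the filter response and a desired
    (typically Gaussian-shaped) output, as in MOSSE."""
    F = np.fft.fft2(patch)
    G = np.fft.fft2(desired)
    lam = 1e-4  # small regularizer to avoid division by zero
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def detect(H, patch):
    """Correlate the filter with a candidate patch; the response peak gives
    the estimated target position."""
    response = np.real(np.fft.ifft2(np.fft.fft2(patch) * H))
    return np.unravel_index(np.argmax(response), response.shape)
```

Training on a patch with a Gaussian desired output centered on the target makes the response map of a new frame peak at the target's new location.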

However, tracking methods developed for color videos may not adapt well to hyperspectral videos. This is due to the high dimensionality of HSIs caused by the extra spectral dimension, which makes generic features designed for color images unsuitable for HSIs. Therefore, more effective feature extraction methods should be developed to simultaneously exploit the spatial and spectral information.

III Hyperspectral Tracking Benchmark Dataset

Hyperspectral object tracking research requires a high-quality dataset in a computer vision setting, which, however, is unavailable to our knowledge. Recently, Uzkent et al. introduced an aerial object tracking dataset. This dataset is synthetic, generated by the Digital Imaging and Remote Sensing (DIRSIG) software [17], making it too idealized to approximate the complexity of real computer vision problems.

Recent progress in sensors makes it possible to collect hyperspectral sequences at video rate. In our research, a snapshot mosaic hyperspectral camera (https://www.imec-int.com/en/hyperspectral-imaging) was used to collect videos. This camera can acquire up to 180 hyperspectral cubes per second, each containing 16 bands in the wavelength range from 470nm to 620nm. In our data collection, we captured videos at 25 frames per second (fps), where a frame refers to a 3D hyperspectral cube with two dimensions indexing the spatial location and the third indexing the spectral band. For fair comparison with color-based tracking methods, RGB videos were also acquired at the same frame rate from a very close viewpoint. These RGB videos were registered to the hyperspectral videos to ensure they contain almost identical scenes at similar spatial resolutions. Moreover, we employed the CIE color matching functions (http://cvrl.ioo.ucl.ac.uk/cmfs.htm) to convert all hyperspectral frames to 3-channel color images. The color images were then transformed to match the color intensity of the RGB videos, generating false-color videos. The data collection produced three video sequences for each tracking task, i.e., hyperspectral, false-color and color videos, with 35 tracking tasks in total.
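The band-weighting idea behind the hyperspectral-to-color conversion can be sketched as follows. The Gaussian matching functions here are illustrative placeholders, not the tabulated CIE curves used in the paper:

```python
import numpy as np

def cube_to_rgb(cube, wavelengths, cmf):
    """Project a hyperspectral cube (H, W, B) to a 3-channel image by
    weighting each band with color matching functions sampled at the band
    wavelengths. `cmf` maps a wavelength to (x, y, z) weights."""
    weights = np.array([cmf(w) for w in wavelengths])  # (B, 3)
    rgb = cube @ weights                                # (H, W, 3)
    return rgb / max(rgb.max(), 1e-12)                  # normalize to [0, 1]

def toy_cmf(w):
    """Placeholder matching functions: Gaussians roughly centered on red,
    green and blue wavelengths (the real CIE 1931 curves are tabulated)."""
    centers = (600.0, 550.0, 470.0)
    return tuple(np.exp(-0.5 * ((w - c) / 30.0) ** 2) for c in centers)
```

In the actual pipeline, the tabulated CIE functions replace `toy_cmf`, and a further color transform matches the result to the RGB camera's intensities.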

Table I shows the first frame, with its ground-truth bounding box, of some sequences in the collected dataset. Following the challenging factors listed in [30], we carefully collected videos to include multiple target categories, diverse scenarios, rich activities and diverse content, guaranteeing the generality and complexity of the dataset. For example, the tracked objects are highly diverse, including vehicles, faces, people and generic objects, and were captured in both indoor and outdoor scenes. This benchmark not only addresses the critical need for investigating HSIs for object tracking, but also promotes the potential usability of HSIs in other computer vision problems, for example, semantic segmentation and object detection.

IV Material Based Tracking

In this section, we give the details of the material based tracking method, including feature representation and feature reliability learning.

IV-A Spectral-spatial Histogram of Gradients

Fig. 2: Framework of the proposed SSHOG. We first calculate the gradients of an HSI in both the spatial and spectral directions, which are represented in spherical coordinates. Then each point in a cell votes over spatial and spectral orientations, yielding a spectral-spatial histogram. Finally, the spectral-spatial histogram is normalized within a local block.

Previous works show that 3D spectral-spatial features are effective for HSI processing [31, 11]. Instead of 2D patches, these features are constructed in a local 3D spectral-spatial neighborhood, so spectral information, spatial information and joint spectral-spatial information are simultaneously taken into consideration. To this end, we build a novel spectral-spatial histogram of gradients (SSHOG) descriptor for an HSI, whose framework is shown in Fig. 2. Given an HSI, SSHOG is constructed as follows.

Spectral-spatial gradient computation: An HSI I(x, y, λ) is a three-dimensional cube containing two-dimensional spatial information and one-dimensional spectral information. Therefore, its gradients can be calculated in three directions:

G_x = I(x+1, y, λ) − I(x−1, y, λ),
G_y = I(x, y+1, λ) − I(x, y−1, λ),
G_λ = I(x, y, λ+1) − I(x, y, λ−1).
Spectral-spatial orientation binning: Transferring the gradients above to a spherical coordinate system, they can be identically represented by (r, θ, φ), where r represents the spectral-spatial gradient magnitude, θ indicates the spatial gradient orientation, and φ denotes the spectral gradient orientation:

r = √(G_x² + G_y² + G_λ²),
θ = arctan(G_y / G_x),
φ = arctan(G_λ / √(G_x² + G_y²)).
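As a rough illustration, the spectral-spatial gradients and their spherical-coordinate representation can be computed as below. Note that `np.gradient` uses central differences; the paper's exact finite-difference scheme may differ:

```python
import numpy as np

def spectral_spatial_gradients(cube):
    """Compute gradients of an HSI cube (H, W, B) along the two spatial
    axes and the spectral axis, then express them in spherical coordinates:
    magnitude r, spatial orientation theta, spectral orientation phi."""
    gy, gx, gl = np.gradient(cube.astype(float))    # y, x and spectral axes
    r = np.sqrt(gx**2 + gy**2 + gl**2)              # spectral-spatial magnitude
    theta = np.arctan2(gy, gx)                      # spatial orientation
    phi = np.arctan2(gl, np.sqrt(gx**2 + gy**2))    # spectral orientation
    return r, theta, phi
```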
After the gradient calculation, an orientation-based histogram is created within a local 3D cell to characterize the underlying local spectral-spatial shape information. The points in each cell are quantized by their gradient orientations and weighted by the corresponding gradient magnitudes. In our method, 9 sensitive orientations are used in the spatial domain, covering 0-360 degrees, and 4 insensitive orientations are used in the spectral domain, covering 0-180 degrees. Moreover, trilinear interpolation is used to avoid aliasing effects. Finally, the orientations are grouped separately and concatenated, yielding a 36-dimensional (9 × 4) feature per cell.
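The voting step can be sketched as below, with trilinear interpolation omitted for brevity (the paper uses it to avoid aliasing), so this is a simplified nearest-bin version:

```python
import numpy as np

def cell_histogram(r, theta, phi, n_spatial=9, n_spectral=4):
    """Vote the points of one 3D cell into a spectral-spatial histogram.
    Spatial orientations are signed (0-360 deg, 9 bins) and spectral
    orientations unsigned (0-180 deg, 4 bins); each vote is weighted by
    the gradient magnitude r."""
    sp_bin = ((theta + np.pi) / (2 * np.pi) * n_spatial).astype(int) % n_spatial
    phi_u = np.mod(phi, np.pi)  # fold into [0, pi): insensitive orientations
    sl_bin = (phi_u / np.pi * n_spectral).astype(int) % n_spectral
    hist = np.zeros((n_spatial, n_spectral))
    np.add.at(hist, (sp_bin.ravel(), sl_bin.ravel()), r.ravel())
    return hist.ravel()  # 9 x 4 = 36-dimensional cell feature
```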

Block normalization: This step reduces the sensitivity of the feature descriptor to illumination and foreground-background contrast. Specifically, the feature vector in each cell is locally grouped with the feature vectors of spatially neighbouring cells to form the description of a larger connected block, as in [21]. After that, the gradient strengths are normalized to unit length by dividing by their norm. To account for illumination changes, we limit the maximum value in the feature vector to 0.2 and then re-normalize, yielding one fixed-length feature per block. Finally, the block descriptors are concatenated along the spectral direction, yielding the proposed SSHOG.
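The clip-and-renormalize step is the familiar L2-Hys scheme from HOG; a minimal sketch:

```python
import numpy as np

def l2_hys_normalize(block_vec, clip=0.2, eps=1e-12):
    """Normalize a block descriptor to unit length, clip each entry at
    0.2, then renormalize, as in HOG-style block normalization."""
    v = block_vec / (np.linalg.norm(block_vec) + eps)
    v = np.minimum(v, clip)
    return v / (np.linalg.norm(v) + eps)
```

Clipping prevents any single dominant gradient (e.g. from a strong illumination edge) from drowning out the rest of the block.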

(a) RGB
(b) Plastic
(c) Background
(d) Real
Fig. 3: Unmixing results of a hyperspectral scene. (a) shows the color image of two lemons. The left lemon is plastic and the right one is real. (b)-(d) give the abundances generated by hyperspectral unmixing.

IV-B Material Distribution Learning

The low spatial resolution of the sensor and the long distance from the camera to the imaged targets cause the spectral responses of different neighboring surface materials to overlap, leading to "mixed" pixels. Hyperspectral unmixing decomposes these pixels into a collection of spectral signatures, or endmembers, and associated proportions, or abundances. Fig. 3 provides an example of hyperspectral unmixing. In this scene, the abundances produced by hyperspectral unmixing clearly demonstrate the underlying distributions of three materials, namely, plastic lemon, background, and real lemon.
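Under the linear mixing model that unmixing commonly assumes, each pixel spectrum is a convex combination of endmember spectra. A toy numerical example (the spectra here are made up for illustration):

```python
import numpy as np

# y = E @ a with a >= 0 and sum(a) = 1: the observed pixel spectrum y is
# a convex combination of the endmember spectra in the columns of E.
E = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.2, 0.7],
              [0.1, 0.9]])     # 4 bands, 2 materials (toy spectra)
a = np.array([0.6, 0.4])       # abundances: 60% material 1, 40% material 2
y = E @ a                      # the observed "mixed" pixel spectrum
```

Unmixing is the inverse problem: given y and (candidate) endmembers E, recover the abundances a.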

Modern unmixing methods can be broadly categorised into three groups: supervised, unsupervised and semi-supervised. Supervised unmixing breaks the problem into two tasks, endmember extraction [32, 33] and abundance estimation [34, 35, 36]. Despite their speed, supervised methods fail in complicated situations where pure endmembers are unavailable. Alternatively, unsupervised methods integrate both tasks into a single problem to simultaneously estimate endmembers and abundances [37, 38], but their performance is limited by time-consuming optimization steps. Recently, semi-supervised unmixing has been introduced, which assumes the endmembers are part of a predefined spectral library [39, 40]. Since the endmembers are selected from the library, the unmixing results are more accurate and closer to the spectral signatures of real-world materials.

Considering the requirements on tracking speed and endmember accuracy, we adopt a semi-supervised method to select the endmembers from an offline library built from the initial frame. These endmembers are used for abundance estimation in all subsequent frames. The spectral library is constructed by using K-means to cluster the endmembers extracted by vertex component analysis (VCA) [32] from a collection of HSIs into a set of clusters; the spectral reflectance of each cluster center is taken as an atom in the library. For endmember selection, let Y be the target region in the bounding box; the endmembers are selected using CLSUnSAL [39] by enforcing group sparsity on the abundance matrix X:

min_X (1/2)‖DX − Y‖²_F + λ‖X‖_{2,1}  subject to  X ≥ 0,

where D is the predefined spectral library. When CLSUnSAL has converged, the sum of each row of X is calculated, yielding a vector c whose elements can be regarded as the total contribution of a given material in Y. The spectral signatures corresponding to the top elements of c, with the number of selected signatures determined by HySime [41], constitute the endmembers of the scene. Once the endmembers are available, a simplex-projection unmixing (SPU) [36] method is used for abundance estimation because of its computational efficiency and superior performance.
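The abundance estimation stage can be approximated by projected gradient descent onto the probability simplex. This is a simplified stand-in for the SPU solver referenced above, not the authors' implementation:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    (non-negative entries summing to one), Duchi et al.-style."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    tau = (1 - css[rho]) / (rho + 1)
    return np.maximum(v + tau, 0)

def estimate_abundances(E, y, n_iter=500):
    """Projected gradient descent for min ||E a - y||^2 subject to the
    abundance non-negativity and sum-to-one constraints."""
    lr = 1.0 / np.linalg.norm(E, 2) ** 2   # step from the spectral norm
    a = np.full(E.shape[1], 1.0 / E.shape[1])
    for _ in range(n_iter):
        a = project_simplex(a - lr * E.T @ (E @ a - y))
    return a
```

On a noiseless mixed pixel with known endmembers, this recovers the true abundances.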

Fig. 4: Framework of the proposed MHT tracker. SSHOG and abundances are first calculated, then weighted according to their reliability in order to learn correlation filters. These filters are convolved with the candidate patches, yielding multi-channel responses. The responses are summed to obtain the final response map, whose maximum indicates the predicted location of the object. Finally, the reliability of each feature is updated according to the responses, and the filters are updated by BACF.

IV-C Material Based Object Tracking Method

Considering the high dimensionality of an HSI and the computational efficiency of BACF [28], we adopt BACF as our base tracking method. Under the framework of BACF, our material based tracking learns the filters h by minimizing the following objective function:

E(h) = (1/2)‖ y − Σ_{k=1}^{K} w_k (Pᵀh_k) * x_k ‖² + (λ/2) Σ_{k=1}^{K} ‖h_k‖²,

where x is the concatenation of SSHOG and abundances, w contains their weights for determining the location of the target, P is a binary matrix which crops the central patch of the samples, * denotes the spatial convolution operator, λ is the regularization parameter, and K and T represent the feature dimensionality and the number of pixels in the desired response y, respectively.

Instead of estimating channel-wise reliability as in [22], we use group-wise reliability to represent the importance of features. Each group of features jointly represents one physical meaning; for example, all channels of the abundances embody the distribution of underlying materials. Moreover, group-wise reliability differentiates the discriminative power of individual features. This enables the tracker to adaptively suppress the effect of less reliable features and enhance learning from more reliable ones.

We evaluate the reliability of each feature using its expressiveness in object detection, via three measures: an overlap reliability score, a distance reliability score and a self reliability score. The first two measure the contribution to the final object localization. For the overlap reliability score, we compute the overlap ratio Φ_k between the bounding box B_k predicted from an individual feature and the final bounding box B:

Φ_k = |B_k ∩ B| / |B_k ∪ B|.
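The overlap ratio is the usual intersection-over-union; a minimal sketch for axis-aligned boxes:

```python
def overlap_ratio(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))  # overlap width
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))  # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```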
Then, the overlap reliability score is determined by:


For the distance reliability score, we compare the difference between the central location predicted by one particular feature and the final position, formulated as:


In addition, the self reliability score measures the trajectory smoothness of a feature. It records the shift between the previous bounding box and the current bounding box, which is given by:


where c_t^k is the central location determined by the k-th feature at the t-th frame, and w and h represent the width and height of the bounding box, respectively. With the above reliability measures, the final reliability weights of the different features are given by:


The framework of the proposed material based object tracking is shown in Fig. 4. The feature reliability is initialized to the same value for all features at the initial frame. After each frame, the feature reliability is computed by Eq. (9) and then updated by an autoregressive model with learning rate η, formulated as w_t = (1 − η) w_{t−1} + η ŵ_t, where ŵ_t is the reliability computed at frame t. The model optimization, target detection and filter update procedures are the same as those in BACF [28].
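The online reliability update can be sketched as simple linear interpolation between the previous weights and the newly measured ones; this is a plausible reading of the autoregressive model, with the function name ours:

```python
def update_reliability(prev_w, new_w, eta=0.0033):
    """Autoregressive update of per-feature reliability weights: blend the
    previous weights with the newly measured ones at learning rate eta."""
    return [(1 - eta) * p + eta * n for p, n in zip(prev_w, new_w)]
```

The small learning rate means reliability changes slowly, so a single bad frame cannot abruptly demote an otherwise trustworthy feature.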

V Experiments

In this section, we first investigate our material based tracking method from a feature representation perspective, then compare it with state-of-the-art methods, including hand-crafted feature based trackers, deep feature based trackers and hyperspectral trackers. Moreover, attribute-based and qualitative comparisons are also presented.

V-A Experimental Setting

In our experiments, the cell size was set to 4. The learning rate η was set to 0.0033. All other parameters were the same as in BACF [28]. All methods were tested on a Windows machine equipped with an Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz and 128GB RAM.

Three evaluation protocols were adopted to evaluate all trackers: the precision plot, the success plot and the area under the curve (AUC). The precision plot records the fraction of frames whose estimated location is within a given distance threshold of the ground truth; the average distance precision rate is reported at a threshold of 20 pixels. The success plot shows the percentage of successful frames, i.e., those whose overlap ratio between the predicted and ground-truth bounding boxes is larger than a threshold varied from 0 to 1. The AUC of each success plot is used as an overall measure to rank the trackers.
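These three metrics reduce to a few lines given per-frame center errors and overlap ratios; a minimal sketch with our own function names:

```python
import numpy as np

def precision_at(center_errors, threshold=20):
    """Fraction of frames whose predicted center is within `threshold`
    pixels of the ground truth (the precision-plot value at 20 px)."""
    return float(np.mean(np.asarray(center_errors) <= threshold))

def success_auc(overlaps, thresholds=np.linspace(0, 1, 101)):
    """Success rate at each overlap threshold, plus its mean, which
    serves as the AUC used to rank trackers."""
    o = np.asarray(overlaps)
    success = np.array([np.mean(o > t) for t in thresholds])
    return success, float(success.mean())
```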

(a) Precision plot
(b) Success plot
Fig. 5: Precision plot and success plot of different features on the collected hyperspectral video dataset. The average distance precision scores at a threshold of 20 pixels and AUCs of the trackers are specified in the legends.

V-B Effectiveness of the Proposed Feature

In this experiment, we compare the effectiveness of five feature extractors: spectrum, histogram of gradients (HOG), abundances, SSHOG, and SSHOG combined with material abundances (abbreviated as MHT). For the spectrum, the raw spectral response at each pixel was employed as the tracking feature. HOG was calculated following [42], with the number of orientations set to 9. In this experiment, except for MHT, in which the reliability weights were assigned within BACF, all other tracking steps were based on the original BACF framework.

Fig. 5 compares object tracking performance using the different features. The spectral feature gives the worst accuracy of all the compared methods, as the raw spectrum is sensitive to illumination changes. Abundances exploit the underlying material distribution, giving better performance. HOG considers local spatial structure, which is crucial for object tracking, and therefore produces more favorable results. SSHOG depicts local spectral-spatial structure, yielding a clear gain in AUC over HOG. Notably, replacing SSHOG with the combination of abundances and SSHOG allows MHT to outperform both SSHOG and HOG by a relatively large margin, owing to the hybrid benefit of SSHOG and abundances. From this experiment, we can infer that generic descriptors designed for color images may not adapt well to robust feature representation of HSIs. In contrast, spectral-spatial features provide more discriminative information for target tracking.

V-C Quantitative Comparison with Hand-crafted Feature-based Trackers.

In this experiment, we compare the proposed MHT tracker with ten state-of-the-art color trackers with hand-crafted features, including KCF [5], fDSST [43], SRDCF [6], MUSTer [44], SAMF [45], Struck [46], CNT [47], BACF [28], CSR-DCF [22], and MCCT [48]. The MHT tracker was tested on hyperspectral videos. Since all the alternative trackers were developed for color videos, they were run on both color videos and false-color videos generated from hyperspectral videos.

(a) Precision plot
(b) Success plot
Fig. 6: Comparisons with hand-crafted feature-based trackers on RGB videos. MHT outperforms all the other trackers.
(a) Precision plot
(b) Success plot
Fig. 7: Comparisons with hand-crafted feature-based trackers on hyperspectral videos or corresponding false-color videos. MHT achieves the best accuracy with an AUC of 0.606.

Fig. 6 and Fig. 7 show the tracking performance of all methods. The results show that KCF and Struck give unsatisfactory success scores because of their limited scale estimation. MUSTer also performs poorly because it fails to detect key points when the object shares a similar appearance with the background, for example in the coin and card sequences. CSR-DCF, BACF and SRDCF integrate background information to learn more discriminative filters, resulting in much better tracking performance. It is worth noting that the proposed MHT tracker achieves a noticeable improvement in AUC over the original BACF run on either the color or the false-color videos. In addition, compared with the other trackers, our approach ranks top over a range of thresholds, achieving an AUC of 0.606, followed by MCCT on the color dataset. This implies that MHT well represents the image content using the constituent material distribution and local spectral-spatial information contained in an HSI; such information helps a tracker distinguish the target from the background. Furthermore, thanks to adaptive feature reliability learning, our tracker makes full use of the more reliable features when learning correlation filters.

V-D Quantitative Comparison with Deep Feature-based Trackers

In this section, we select several state-of-the-art deep feature-based trackers for comparison, including ECO [24], CF2 [49], TRACA [50], CFNet [51], HDT [52], DSiam [53], DeepSRDCF [54], and C-COT [55]. As reported in Table II, the proposed method achieves performance competitive with C-COT on color videos and significantly outperforms the other trackers on both color and false-color videos. The reason for the lower performance of the other methods is the challenging nature of the dataset, which contains objects with color or texture similar to the background. In addition, most deep trackers yield lower AUCs on the false-color videos, indicating that useful spectral information contained in the hyperspectral videos is lost in the false-color conversion.

Video MHT C-COT [55] ECO [24] CF2 [49] CFNet [51] HDT [52] DSiam [53] DeepSRDCF [54] TRACA [50]
Color n/a 0.617 0.575 0.483 0.580 0.453 0.564 0.571 0.570
Hyperspectral/False-color 0.606 0.561 0.563 0.503 0.528 0.476 0.483 0.564 0.553
TABLE II: Performance comparison with deep trackers in terms of AUC. The top two values are highlighted by red and blue.
Attributes MHT BACF [28] DeepSRDCF [54] ECO [24] C-COT [55] TRACA [50] SRDCF [6] MCCT [48] CFNet [51] CSR-DCF [22]
Scale variation (SV) 0.606 0.576 0.550 0.560 0.577 0.524 0.519 0.492 0.529 0.481
Motion blur (MB) 0.633 0.637 0.622 0.629 0.592 0.596 0.623 0.490 0.468 0.583
Occlusion (OCC) 0.509 0.469 0.489 0.489 0.464 0.421 0.453 0.450 0.411 0.380
Fast motion (FM) 0.573 0.566 0.571 0.587 0.577 0.498 0.570 0.580 0.548 0.368
Low resolution (LR) 0.474 0.371 0.384 0.448 0.450 0.439 0.309 0.329 0.448 0.370
In-plane rotation (IPR) 0.701 0.666 0.637 0.654 0.681 0.621 0.639 0.638 0.631 0.607
Out-of-plane rotation (OPR) 0.705 0.678 0.650 0.665 0.683 0.610 0.628 0.592 0.631 0.607
Deformation (DEF) 0.691 0.666 0.617 0.640 0.617 0.672 0.638 0.558 0.622 0.589
Background clutters (BC) 0.636 0.584 0.552 0.545 0.566 0.543 0.529 0.519 0.535 0.553
Illumination variation (IV) 0.479 0.507 0.508 0.498 0.475 0.345 0.416 0.410 0.408 0.296
Out-of-view (OV) 0.758 0.766 0.745 0.767 0.750 0.534 0.519 0.513 0.623 0.759
TABLE III: Attribute-based comparison on hyperspectral/false-color videos. The best two results are shown in red and blue fonts. Our tracker ranks the first on 7 out of 11 attributes: SV, OCC, LR, IPR, OPR, DEF, and BC.

V-E Quantitative Comparison with Hyperspectral Trackers.

We also compared our method with two alternative hyperspectral trackers, CNHT [18] and DeepHKCF [17]. Both are based on KCF but use different features. In CNHT, normalized three-dimensional patches selected from the target region in the initial frame serve as fixed convolution kernels for feature extraction in succeeding frames. In DeepHKCF, an HSI is converted into a false-color image from which deep features are learned by VGGNet.

Fig. 8 presents the tracking results of all competing hyperspectral trackers. The results show that CNHT gives inferior accuracy because it only considers fixed positive samples when learning convolutional filters. Without negative patches from the surrounding background, the features produced by the fixed convolutional filters are not discriminative enough to learn a robust model, significantly deteriorating the prediction of the object location. In contrast, VGGNet uses both positive and negative samples to learn a discriminative feature representation, so DeepHKCF performs more competitively than CNHT. However, since HSIs are converted into three-channel false-color images before being passed through VGGNet, the complete spectral-spatial structural information of an HSI is not fully exploited, and DeepHKCF fails to outperform the proposed MHT tracker. Combining local spectral-spatial texture information and detailed material information, the proposed MHT achieves the best performance. This experiment again suggests that material information facilitates object tracking.

(a) Precision plot
(b) Success plot
Fig. 8: Comparison with hyperspectral trackers.
Fig. 9: Qualitative evaluation on 3 video sequences (i.e., drive, worker, forest).

V-F Attribute-based Evaluation

In this experiment, we report tracking effectiveness with respect to different video attributes. For simplicity, we only present the performance of the top 9 color trackers on false-color videos and MHT on hyperspectral videos. Table III reports the AUCs of all trackers. MHT ranks first on 7 of the 11 attributes: SV, OCC, LR, IPR, OPR, DEF and BC. In LR situations, it is difficult to extract robust features from the target in a color image; in contrast, the material information captured by an HSI and represented by the proposed SSHOG and abundance features helps our tracker discriminate the object from the background. In videos with the IPR, OPR, DEF, BC and OCC attributes, the target is partly or fully degraded, which makes spatial structure information unreliable. Compared with spatial information, the underlying material information is more robust. Thanks to the strengths of the proposed SSHOG and abundances in representing spectral-spatial structure and exploiting underlying material information, the MHT tracker is better able to separate the target from its surrounding environment.

V-G Qualitative Comparisons

Here we provide qualitative evaluations of the competing trackers on sample hyperspectral or false-color videos, as shown in Fig. 9. Owing to its ensemble of trackers, MCCT performs better under rotation (drive) and clutter (forest). BACF drifts away when the target is at low resolution (worker) or shares a similar color with the background (forest). Thanks to the stability of material properties, MHT tracks the objects in these scenarios more robustly.

VI Conclusion

In this paper, we introduce a benchmark dataset for object tracking in hyperspectral videos and propose a material based tracking method to study how HSIs can be exploited to advance object tracking. The material information is embodied in the proposed SSHOG and abundance features. SSHOG encodes local spectral-spatial texture information by summarising the occurrences of spectral-spatial gradient orientations in local regions of an HSI. Abundances describe the underlying constituent material distribution obtained by a hyperspectral unmixing approach. Extensive experiments on the proposed hyperspectral benchmark dataset demonstrate that material properties contribute to moving object tracking, confirming that HSIs have great potential for this task. In future work, we will develop a material based convolutional neural network to investigate deep spectral-spatial structural information for hyperspectral tracking.

Appendix A Supplementary Material

In this supplementary material, we show the first frame of all the sequences in the collected dataset in Table IV. The whole dataset contains 35 color videos and 35 hyperspectral videos, with an average of 500 frames per sequence. The hyperspectral videos are further converted to false-color videos. The full dataset and benchmark, including the sequences, annotations and associated code, will be made available online after the review process. Some preprocessing steps are listed as follows:

Spectral Calibration: Our calibration process involves two steps: dark calibration and spectral correction. Dark calibration aims to remove the noise produced by the camera sensor. We performed it by subtracting a dark frame, captured with the lens covered by a cap, from each acquired image. The goal of spectral correction is to suppress the contributions of unwanted second-order responses, for example, responses to wavelengths leaking into the filters. By applying the sensor-specific spectral correction matrix to the acquired reflectance, the resulting spectrum is more consistent with the expected spectrum. An example is given in Fig. 10, where the green curve is the acquired spectrum without correction and the red curve is the corrected spectrum. As can be seen, the spectrum is smoother after spectral correction.

Fig. 10: Effect of spectral correction.
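The two calibration steps above can be sketched as follows. The function name, the array shapes, and the (L, L) form of the correction matrix are assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def calibrate_frame(raw, dark_frame, correction):
    """Two-step spectral calibration (sketch).

    raw:         (H, W, L) hyperspectral frame from the sensor
    dark_frame:  (H, W, L) frame captured with the lens cap on
    correction:  (L, L) sensor-specific spectral correction matrix,
                 assumed here to mix bands linearly
    """
    # Step 1: dark calibration -- subtract the sensor noise offset.
    dark_corrected = np.clip(raw.astype(np.float64) - dark_frame, 0, None)
    # Step 2: spectral correction -- reweight the bands so that the
    # spectrum matches the expected response.
    h, w, bands = dark_corrected.shape
    corrected = dark_corrected.reshape(-1, bands) @ correction.T
    return corrected.reshape(h, w, bands)
```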

Color to Hyperspectral Image Registration: We registered the hyperspectral sequences and color sequences so that they describe almost the same scene. Specifically, we manually selected a set of points in the initial frame of both the hyperspectral and color videos as key points. These points were matched and then used to estimate a geometric transformation matrix. Finally, this matrix was used to transform the color image in all subsequent frames so that it aligns with the corresponding hyperspectral frame.
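A minimal sketch of the key-point based alignment, assuming an affine model estimated by least squares (the paper does not state which transformation model was used, and the helper names are illustrative):

```python
import numpy as np

def estimate_affine(src_pts, dst_pts):
    """Least-squares affine transform mapping src_pts to dst_pts.

    src_pts, dst_pts: (N, 2) arrays of manually matched key points,
    N >= 3. Returns a 2x3 matrix A with [x', y']^T = A @ [x, y, 1]^T.
    """
    src = np.asarray(src_pts, float)
    dst = np.asarray(dst_pts, float)
    # Homogeneous coordinates: append a column of ones.
    X = np.hstack([src, np.ones((src.shape[0], 1))])   # (N, 3)
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)        # (3, 2)
    return A.T                                         # (2, 3)

def warp_points(points, A):
    """Apply the estimated transform to (N, 2) points."""
    pts = np.hstack([np.asarray(points, float),
                     np.ones((len(points), 1))])
    return pts @ A.T
```

In practice the same matrix would be applied to every pixel of each subsequent color frame (e.g. via an image-warping routine) rather than to isolated points.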

Hyperspectral to Color Image Conversion: We converted hyperspectral videos to false-color videos using the CIE color matching functions (CMFs) available at http://cvrl.ioo.ucl.ac.uk/cmfs.htm. The CMFs give the weight of each wavelength in a hyperspectral image (HSI) when generating the red, green, and blue channels. Given an HSI $\mathbf{X} \in \mathbb{R}^{N \times L}$ with $N$ pixels and $L$ bands, this step converts the HSI to a CIE XYZ image $\mathbf{I} \in \mathbb{R}^{N \times 3}$, formulated as

$\mathbf{I} = \mathbf{X}\mathbf{W}$,

where $\mathbf{W} \in \mathbb{R}^{L \times 3}$ contains the CMFs. Here we used the 10-deg XYZ CMFs transformed from the CIE (2006) functions with a step size of 0.1 nm. After that, $\mathbf{I}$ is converted to the default RGB colour space, sRGB. Furthermore, the color transformation method in [56] was applied to make the color intensity of the converted images close to that of the corresponding collected color frames.
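The conversion can be sketched as below. The CMF weighting follows the text; the XYZ-to-sRGB matrix and gamma curve are the standard IEC 61966-2-1 values, while the simple max-normalization and the function name are assumptions for illustration:

```python
import numpy as np

def hsi_to_srgb(hsi, cmfs):
    """Convert an (H, W, L) hyperspectral image to a false-color sRGB
    image via CIE color matching functions (cmfs: (L, 3) weights)."""
    h, w, bands = hsi.shape
    # Weighted sum over wavelengths: I = X W  (per pixel).
    xyz = hsi.reshape(-1, bands) @ cmfs            # (H*W, 3) CIE XYZ
    xyz = xyz / max(xyz.max(), 1e-12)              # normalise to [0, 1]
    # Linear XYZ -> linear sRGB (IEC 61966-2-1 matrix).
    m = np.array([[ 3.2406, -1.5372, -0.4986],
                  [-0.9689,  1.8758,  0.0415],
                  [ 0.0557, -0.2040,  1.0570]])
    rgb = np.clip(xyz @ m.T, 0.0, 1.0)
    # sRGB gamma encoding.
    srgb = np.where(rgb <= 0.0031308,
                    12.92 * rgb,
                    1.055 * rgb ** (1 / 2.4) - 0.055)
    return srgb.reshape(h, w, 3)
```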

TABLE IV: Sequences in our benchmark. We show the ground truth bounding box in the first frame of both the RGB video (left) and the hyperspectral video (right). Each video is also labeled with its challenging factors according to the 11 attributes listed in [57]: illumination variation (IV), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), background clutters (BC), and low resolution (LR). Sequences: Ball, Basketball, Board1, Board2, Book, Bus, Bus2, Campus, Car, Car2, Car3, Coin, Coke, Card, Drive, Excavator, Face, Face2, Forest, Forest2, Fruit, Hand, Kangaroo, Kangaroo2, Player, Playground, Paper, Pedestrian, Pedestrian2, Rubik, Student, Toy1, Toy2, Toy3, Worker.


  • [1] Z. Chen, Z. Hong, and D. Tao, “An experimental survey on correlation filter-based tracking,” arXiv preprint arXiv:1509.05520, 2015.
  • [2] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, “Visual object tracking using adaptive correlation filters,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2010, pp. 2544–2550.
  • [3] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “Exploiting the circulant structure of tracking-by-detection with kernels,” in Proc. European conference on computer vision (ECCV), 2012.
  • [4] M. Danelljan, F. S. Khan, M. Felsberg, and J. v. d. Weijer, “Adaptive color attributes for real-time visual tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), June 2014, pp. 1090–1097.
  • [5] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 583–596, 2015.
  • [6] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg, “Learning spatially regularized correlation filters for visual tracking,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec 2015, pp. 4310–4318.
  • [7] J. Liang, J. Zhou, L. Tong, X. Bai, and B. Wang, “Material based salient object detection from hyperspectral images,” Pattern Recognit., vol. 76, pp. 476–490, 2018.
  • [8] S. W. Oh, M. S. Brown, M. Pollefeys, and S. J. Kim, “Do it yourself hyperspectral imaging with everyday digital cameras,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
  • [9] M. Uzair, A. Mahmood, and A. Mian, “Hyperspectral face recognition with spatiospectral information fusion and PLS regression,” IEEE Trans. Image Process., vol. 24, no. 3, pp. 1127–1137, 2015.
  • [10] M. Ye, Y. Qian, J. Zhou, and Y. Y. Tang, “Dictionary learning-based feature-level domain adaptation for cross-scene hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 3, pp. 1544–1562, 2017.
  • [11] S. L. Al-khafaji, J. Zhou, A. Zia, and A. W. Liew, “Spectral-spatial scale invariant feature transform for hyperspectral images,” IEEE Trans. Image Process., vol. 27, no. 2, pp. 837–850, 2018.
  • [12] G. Tochon, J. Chanussot, M. D. Mura, and A. L. Bertozzi, “Object tracking by hierarchical decomposition of hyperspectral video sequences: Application to chemical gas plume tracking,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 8, pp. 4567–4585, 2017.
  • [13] B. Uzkent, M. J. Hoffman, and A. Vodacek, “Integrating hyperspectral likelihoods in a multidimensional assignment algorithm for aerial vehicle tracking,” IEEE J. Select. Topics Appl. Earth Observ. Remote Sens., vol. 9, no. 9, pp. 4325–4333, Sept 2016.
  • [14] T. Wang, Z. Zhu, and E. Blasch, “Bio-inspired adaptive hyperspectral imaging for real-time target tracking,” IEEE Sens. J., vol. 10, no. 3, pp. 647–654, March 2010.
  • [15] A. Banerjee, P. Burlina, and J. Broadwater, “Hyperspectral video for illumination-invariant tracking,” in Proc. First Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), Aug 2009, pp. 1–4.
  • [16] H. V. Nguyen, A. Banerjee, and R. Chellappa, “Tracking via object reflectance using a hyperspectral video camera,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), 2010.
  • [17] B. Uzkent, A. Rangnekar, and M. J. Hoffman, “Tracking in aerial hyperspectral videos using deep kernelized correlation filters,” IEEE Trans. Geosci. Remote Sens., pp. 1–13, 2018.
  • [18] K. Qian, J. Zhou, F. Xiong, and H. Zhou, “Object tracking in hyperspectral videos with convolutional features and kernelized correlation filter,” arXiv preprint arXiv:1810.11819, 2018.
  • [19] R. Kawakami, Y. Matsushita, J. Wright, M. Ben-Ezra, Y. Tai, and K. Ikeuchi, “High-resolution hyperspectral imaging via matrix factorization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2011.
  • [20] C. Lanaras, E. Baltsavias, and K. Schindler, “Hyperspectral super-resolution by coupled spectral unmixing,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2015.
  • [21] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2005.
  • [22] A. Lukežic, T. Vojír, L. C. Zajc, J. Matas, and M. Kristan, “Discriminative correlation filter with channel and spatial reliability,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), July 2017, pp. 4847–4856.
  • [23] N. Wang and D.-Y. Yeung, “Learning a deep compact image representation for visual tracking,” in Proc. Advances in neural information processing systems (NIPS), 2013, pp. 809–817.
  • [24] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, “ECO: Efficient convolution operators for tracking.” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017.
  • [25] L. Wang, W. Ouyang, X. Wang, and H. Lu, “Visual tracking with fully convolutional networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2015.
  • [26] Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, and S. Wang, “Learning dynamic Siamese network for visual object tracking,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017.
  • [27] C. Sun, D. Wang, H. Lu, and M.-H. Yang, “Learning spatial-aware regressions for visual tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 8962–8970.
  • [28] H. K. Galoogahi, A. Fagg, and S. Lucey, “Learning background-aware correlation filters for visual tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 21–26.
  • [29] M. Mueller, N. Smith, and B. Ghanem, “Context-aware correlation filter tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 2, no. 3, 2017, p. 6.
  • [30] Y. Wu, J. Lim, and M. Yang, “Object tracking benchmark,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1834–1848, 2015.
  • [31] S. Jia, J. Hu, J. Zhu, X. Jia, and Q. Li, “Three-dimensional local binary patterns for hyperspectral imagery classification,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 4, pp. 2399–2413, April 2017.
  • [32] J. M. P. Nascimento and J. M. B. Dias, “Vertex component analysis: a fast algorithm to unmix hyperspectral data,” IEEE Trans. Geosci. Remote Sens., vol. 43, no. 4, pp. 898–910, 2005.
  • [33] C. Chang, C. Wu, C. Lo, and M. Chang, “Real-time simplex growing algorithms for hyperspectral endmember extraction,” IEEE Trans. Geosci. Remote Sens., vol. 48, no. 4, pp. 1834–1850, 2010.
  • [34] D. C. Heinz and C.-I Chang, “Fully constrained least squares linear spectral mixture analysis method for material quantification in hyperspectral imagery,” IEEE Trans. Geosci. Remote Sens., vol. 39, no. 3, pp. 529–545, 2001.
  • [35] E. Chouzenoux, M. Legendre, S. Moussaoui, and J. Idier, “Fast constrained least squares spectral unmixing using primal-dual interior-point optimization,” IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., vol. 7, no. 1, pp. 59–69, 2014.
  • [36] R. Heylen, D. Burazerovic, and P. Scheunders, “Fully constrained least squares spectral unmixing by simplex projection,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 11, pp. 4112–4122, 2011.
  • [37] Y. Wang, C. Pan, S. Xiang, and F. Zhu, “Robust hyperspectral unmixing with correntropy-based metric,” IEEE Trans. Image Process., vol. 24, no. 11, pp. 4027–4040, 2015.
  • [38] Y. Qian, F. Xiong, S. Zeng, J. Zhou, and Y. Y. Tang, “Matrix-vector nonnegative tensor factorization for blind unmixing of hyperspectral imagery,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 3, pp. 1776–1792, 2017.
  • [39] M. Iordache, J. M. Bioucas-Dias, and A. Plaza, “Collaborative sparse regression for hyperspectral unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 1, pp. 341–354, 2014.
  • [40] S. Zhang, J. Li, H. Li, C. Deng, and A. Plaza, “Spectral-spatial weighted sparse regression for hyperspectral image unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 6, pp. 3265–3276, 2018.
  • [41] J. M. Bioucas-Dias and J. M. P. Nascimento, “Hyperspectral subspace identification,” IEEE Trans. Geosci. Remote Sens., vol. 46, no. 8, pp. 2435–2445, 2008.
  • [42] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627–1645, 2010.
  • [43] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg, “Discriminative scale space tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 8, pp. 1561–1575, 2017.
  • [44] Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, and D. Tao, “Multi-store tracker (MUSTer): A cognitive psychology inspired approach to object tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2015, pp. 749–758.
  • [45] Y. Li and J. Zhu, “A scale adaptive kernel correlation filter tracker with feature integration,” in Proc. European conference on computer vision Workshop (ECCVW), 2014.
  • [46] S. Hare, S. Golodetz, A. Saffari, V. Vineet, M. Cheng, S. L. Hicks, and P. H. S. Torr, “Struck: Structured output tracking with kernels,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 10, pp. 2096–2109, 2016.
  • [47] K. Zhang, Q. Liu, Y. Wu, and M. Yang, “Robust visual tracking via convolutional networks without training,” IEEE Trans. Image Process., vol. 25, no. 4, pp. 1779–1792, 2016.
  • [48] N. Wang, W. Zhou, Q. Tian, R. Hong, M. Wang, and H. Li, “Multi-cue correlation filters for robust visual tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 4844–4853.
  • [49] C. Ma, J. Huang, X. Yang, and M. Yang, “Hierarchical convolutional features for visual tracking,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec 2015, pp. 3074–3082.
  • [50] J. Choi, H. J. Chang, T. Fischer, S. Yun, K. Lee, J. Jeong, Y. Demiris, and J. Y. Choi, “Context-aware deep feature compression for high-speed visual tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 479–488.
  • [51] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. S. Torr, “End-to-end representation learning for correlation filter based tracking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 5000–5008.
  • [52] Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, and M. Yang, “Hedged deep tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), June 2016, pp. 4303–4311.
  • [53] Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, and S. Wang, “Learning dynamic siamese network for visual object tracking,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct 2017, pp. 1781–1789.
  • [54] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg, “Convolutional features for correlation filter based visual tracking,” in Proc. IEEE Int. Conf. Comput. Vis. Workshop (ICCVW), 2015.
  • [55] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg, “Beyond correlation filters: Learning continuous convolution operators for visual tracking,” in Proc. European Conference on Computer Vision (ECCV).   Springer, 2016, pp. 472–488.
  • [56] F. Pitie and A. Kokaram, “The linear Monge-Kantorovitch linear colour mapping for example-based colour transfer,” in Proc. 4th European Conference on Visual Media Production, Nov 2007, pp. 1–9.
  • [57] Y. Wu, J. Lim, and M. Yang, “Object tracking benchmark,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1834–1848, 2015.