1 Introduction
Hyperspectral imaging plays an important role in remote sensing as it provides hundreds of contiguous, narrow spectral bands [1]. With the advantage of rich spectral information, hyperspectral images (HSIs) have been widely used in applications involving image classification [2] and segmentation [3], such as land cover detection and mining. However, to the best of our knowledge, very little work has focused on hyperspectral video processing. The main reason is that it is difficult to capture hyperspectral videos with slow imaging devices. Only in the last couple of years have low-cost hyperspectral video cameras become available, making it possible to collect hyperspectral videos at a high frame rate.
In this paper, we introduce one of the very first works on object tracking in hyperspectral videos. Object tracking is an important research topic in computer vision and multimedia. Most tracking methods [4, 5, 6, 7, 8] were developed on grayscale or RGB videos. The discriminative correlation filter (DCF) based framework [4, 5, 6, 9] explores supervised visual object tracking. The DCF trains the filters very efficiently in the frequency domain via the fast discrete Fourier transform (DFT). It learns a correlation filter to localize the object in consecutive frames: the learned filter is applied to a new frame, and the target location is estimated at the maximum response. Bolme et al. introduced the minimum output sum of squared error (MOSSE) tracker [9], which uses grayscale features and achieves an impressive tracking speed. Later extensions include the incorporation of kernels and histogram of oriented gradients (HOG) features [6], the addition of color name features [4], adaptive scale [10], and the integration of deep learning features [11]. The kernelized correlation filter (KCF) method [6] circularly shifts the training samples and exploits multichannel HOG features with the kernel trick. Zhang et al. proposed the spatio-temporal context (STC) tracker [7], which interprets the correlation filter in terms of probability theory and uses dense sampling to track the object of interest.
In recent years, deep learning methods have shown success in object tracking [12, 13, 14]. Several works [15, 16] have combined deep learning with the correlation filter based framework. Instead of using handcrafted features such as HOG, these DCF trackers use features automatically learnt by a convolutional neural network (CNN), which significantly improves the robustness of the tracking. Zhang et al. proposed the lightweight convolutional network based tracker (CNT) [17], which has a simple architecture and yet effectively constructs a robust representation. This tracker demonstrates that a two-layer CNN without pooling or a training process can obtain competitive results on a benchmark dataset with 50 challenging videos, and outperforms the first deep learning based tracker (DLT) [8] by a large margin.
In this paper, we propose a novel convolutional feature based tracker for hyperspectral video processing. The videos were captured by a hyperspectral camera with 14 bands in the range of 470–620 nm. We first define convolution filters from a set of normalized three-dimensional cubes surrounding a target. The convolutional operations generate a set of feature maps that are combined to form a three-dimensional representation of an object, which is used in the tracking process. In the tracking step, KCF is adopted to distinguish targets from the neighboring environment. We extend the KCF method so that it can cope with hyperspectral data.
The remainder of this paper is organized as follows. In Section 2, we first present the convolutional feature for hyperspectral images. Then, we briefly describe the KCF tracker and how it can be extended to multichannel convolutional features for hyperspectral tracking. In Section 3, experimental results are presented to verify the performance of the proposed method on hyperspectral video sequences. Our method is also compared with state-of-the-art methods on grayscale and RGB videos of the same scene. Finally, conclusions are drawn in Section 4.
2 The Proposed Tracking Algorithm
In this section, we describe the details of the proposed method. We first introduce convolutional features in the 3D spectral-spatial domain. Then we describe the KCF method and its extension to hyperspectral data.
2.1 Convolutional Features for Hyperspectral Target
Motivated by the success of convolutional networks in visual tracking [17], we use this approach to extract local hyperspectral information. Given a target template, the proposed hierarchical representation architecture contains two steps. First, local features that contain spectral information are extracted by convolving a bank of three-dimensional filters with the input image at each position. Then, these features are stacked together to form a three-dimensional representation. This feature extraction process is shown in Fig. 1 and Fig. 2.
The image patches in Fig. 1 are generated from a local hyperspectral image cube $P \in \mathbb{R}^{n \times n \times b}$, where $n$ and $b$ denote the patch size and the number of spectral bands, respectively. A set of overlapping local image patches $Y = \{Y_1, \dots, Y_l\}$, centered at each pixel position, is densely sampled inside the image patch through a sliding window of size $w \times w \times b$. In the first frame of the video, several filters $G_1, \dots, G_d$ are selected randomly from $Y$. The responses on the image patch $P$ are denoted by feature maps $S_i$, which can be expressed as
$$S_i = G_i \otimes P, \quad i = 1, \dots, d, \tag{1}$$
where $\otimes$ is the convolution operator.
Fig. 1 shows that the localized 3D filter can extract local structural features from the hyperspectral cubes. Furthermore, the convolutional results of the three target templates (at the bottom of Fig. 1) are similar in geometric layout, which demonstrates that the local filter is effective in extracting the target features despite their appearance variation. For the negative template (see the third image patch in the bottom row of Fig. 1), the convolutional result is very different from those of the target templates. As shown in Fig. 2, the 3D features generated by 10 filters have similar properties. Therefore, the convolution results and the generated features represent the inner structure of the tracking target.
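The feature extraction step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the function names `sample_filters` and `feature_maps` are hypothetical, the normalization (zero mean, unit norm) is one plausible reading of "normalized cubes", and the response is computed as a sliding inner product without flipping the filter.

```python
import numpy as np

def sample_filters(cube, w, num_filters, rng):
    """Pick `num_filters` random w x w x b sub-cubes of `cube` as filters."""
    n_rows, n_cols, _ = cube.shape
    filters = []
    for _ in range(num_filters):
        r = rng.integers(0, n_rows - w + 1)
        c = rng.integers(0, n_cols - w + 1)
        patch = cube[r:r + w, c:c + w, :].astype(float)
        # Assumed normalization: zero mean and unit l2 norm.
        patch = patch - patch.mean()
        norm = np.linalg.norm(patch)
        filters.append(patch / norm if norm > 0 else patch)
    return np.stack(filters)

def feature_maps(cube, filters):
    """Slide every filter over the cube and stack the responses channel-wise."""
    # Shape (rows - w + 1, cols - w + 1, 1, w, w, b): one block per position.
    windows = np.lib.stride_tricks.sliding_window_view(cube, filters.shape[1:])
    windows = windows[:, :, 0]
    # Inner product of each window with each filter -> (rows', cols', num_filters).
    return np.einsum("ijxyz,kxyz->ijk", windows, filters)

rng = np.random.default_rng(0)
cube = rng.random((32, 30, 14))           # synthetic stand-in for a target cube
filters = sample_filters(cube, w=6, num_filters=10, rng=rng)
features = feature_maps(cube, filters)    # 3D stacked representation
```

With a 6×6×14 window and 10 filters, the settings reported in Section 3.2, each target cube yields a stack of 10 feature maps, which is the multichannel input used by the tracker.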
2.2 Tracking Framework
The outputs of the filtering operation are stacked to form a three-dimensional representation. This can be considered as the multichannel feature that is required as the input to the kernelized correlation filter (KCF) tracker. In this section, we first briefly describe the KCF tracker [6], and then extend it to use the features introduced in the previous section.
2.2.1 The KCF Tracker
Our approach is built on the KCF tracker, which has achieved impressive results on the Visual Tracker Benchmark [18]. The key of the KCF algorithm is to train a classifier through a ridge regression model, whose objective function is
$$\min_{w} \sum_{i} \left( f(x_i) - y_i \right)^2 + \lambda \lVert w \rVert^2, \tag{2}$$
where $f(x_i) = w^{\top} x_i$ is the response on sample $x_i$, and $y_i$, $\lambda$ and $w$ represent the regression value, regularization parameter and regression coefficient, respectively.
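For a linear model, the ridge regression objective has the well-known closed-form minimizer $w = (X^{\top}X + \lambda I)^{-1} X^{\top} y$, where the rows of $X$ are the training samples. The short NumPy sketch below illustrates this on synthetic data; the function name `ridge_fit` and the data are illustrative assumptions only.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: solve (X^T X + lam * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))              # 50 synthetic samples, 4 features
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true                            # noise-free regression targets

w_small = ridge_fit(X, y, lam=1e-8)       # nearly unregularized solution
w_reg = ridge_fit(X, y, lam=1.0)          # regularized solution
# At the minimizer the gradient X^T (X w - y) + lam * w vanishes.
grad = X.T @ (X @ w_reg - y) + 1.0 * w_reg
```

Solving the linear system directly costs cubic time in the number of features, which is why the circulant structure exploited by KCF matters for dense sampling.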
The KCF approach densely samples a circulant sample matrix $X = C(x)$, where $C(\cdot)$ denotes the circulant operation based on the first row $x$ (i.e., the base sample). This matrix can be decomposed into
$$X = F \, \mathrm{diag}(\hat{x}) \, F^{H}, \tag{3}$$
where $F$ and $\mathrm{diag}(\cdot)$ denote the DFT matrix and the diagonalization function, respectively, and $\hat{x}$ is the DFT of $x$. $F^{H}$ is the Hermitian transpose of $F$, which is a constant.
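The circulant decomposition can be checked numerically. The sketch below assumes the convention used in the KCF paper [6]: each row of the circulant matrix is the base sample shifted cyclically by one more position, the DFT matrix is unitary, and the diagonal holds the unnormalized DFT of the base sample. The helper names are hypothetical.

```python
import numpy as np

def circulant(x):
    """Row i is the base sample x cyclically shifted right by i positions."""
    return np.stack([np.roll(x, i) for i in range(len(x))])

def dft_matrix(n):
    """Unitary DFT matrix F, so that F @ F.conj().T is the identity."""
    return np.fft.fft(np.eye(n)) / np.sqrt(n)

def reconstruct_from_dft(x):
    """Rebuild the circulant matrix as F diag(x_hat) F^H."""
    F = dft_matrix(len(x))
    return F @ np.diag(np.fft.fft(x)) @ F.conj().T

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = circulant(x)
X_rec = reconstruct_from_dft(x)
```

Because all data matrices built from cyclic shifts share the same eigenvectors, products and inverses of such matrices reduce to element-wise operations on their DFTs.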
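Improved via the kernel trick, the coefficient $w$ is mapped to a high-dimensional feature space, i.e., $w = \sum_i \alpha_i \varphi(x_i)$, where $\varphi(\cdot)$ is the mapping function and $\alpha$ is a new coefficient. Then, the coefficient $\alpha$ can be formulated as
$$\alpha = (K + \lambda I)^{-1} y, \tag{4}$$
$$\hat{\alpha} = \frac{\hat{y}}{\hat{k}^{xx} + \lambda}, \tag{5}$$
where the kernel matrix $K$ is also a circulant matrix with $k^{xx}$ denoting its first row, and the division in Eq. (5) is element-wise. In the current frame, $y$ represents a prior and can be modeled as $y = b \exp\left(-(d/\rho)^{\beta}\right)$, where $\exp(\cdot)$ denotes the exponential function and $b$ is a normalization constant. $d$ denotes the Euclidean distance between the target and a pixel in the neighborhood. $\rho$ and $\beta$ represent a scale parameter and a shape parameter, respectively.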
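In Eq. (5), $k^{xx}$ can be computed based on the Gaussian function. For two samples $x$ and $x'$,
$$k^{xx'} = \exp\left( -\frac{1}{\sigma^2} \left( \lVert x \rVert^2 + \lVert x' \rVert^2 - 2 \, \mathcal{F}^{-1}\left( \hat{x}^{*} \odot \hat{x}' \right) \right) \right), \tag{6}$$
where $\sigma$ is the kernel bandwidth, $\mathcal{F}^{-1}$ is the inverse DFT, $^{*}$ denotes complex conjugation, and $\odot$ is the element-wise product. Setting $x' = x$ gives $k^{xx}$.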
Subsequently, the object tracking task is transformed into a detection problem. The image patch $z$ of the current frame at the same target location is treated as the testing base sample; therefore, the response map is expressed as
$$f(z) = \mathcal{F}^{-1}\left( \hat{k}^{xz} \odot \hat{\alpha} \right), \tag{7}$$
where $x$ and $\hat{\alpha}$ are learnt before the current frame. Intuitively, the response is a linear combination of the neighboring kernel values $k^{xz}$ weighted by the coefficients $\alpha$.
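The training and detection steps of a single-channel KCF can be sketched as follows. This is a simplified illustration and not the tracker evaluated in this paper: it omits the cosine window, model interpolation, and search-window padding, and it uses a plain Gaussian regression target peaked at the origin instead of the prior described above. The division of the squared distance by the number of elements is a normalization assumed here, and all function names are hypothetical.

```python
import numpy as np

def gaussian_labels(shape, sigma):
    """Regression target y: a Gaussian peaked at index (0, 0), wrapped circularly."""
    rows = np.fft.fftfreq(shape[0]) * shape[0]   # signed offsets 0, 1, ..., -1
    cols = np.fft.fftfreq(shape[1]) * shape[1]
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    return np.exp(-(rr ** 2 + cc ** 2) / (2.0 * sigma ** 2))

def gaussian_correlation(xf, zf, sigma):
    """DFT of the Gaussian kernel between x and every cyclic shift of z."""
    n = xf.size
    xx = np.real(np.vdot(xf, xf)) / n            # ||x||^2 by Parseval's theorem
    zz = np.real(np.vdot(zf, zf)) / n
    cross = np.real(np.fft.ifft2(np.conj(xf) * zf))
    # Assumed normalization by n, so sigma does not depend on the patch size.
    dist = np.maximum(0.0, xx + zz - 2.0 * cross) / n
    return np.fft.fft2(np.exp(-dist / sigma ** 2))

def train(x, y, sigma=0.5, lam=1e-4):
    """Dual coefficients in the Fourier domain: y_hat / (k_hat^{xx} + lambda)."""
    xf = np.fft.fft2(x)
    kf = gaussian_correlation(xf, xf, sigma)
    return np.fft.fft2(y) / (kf + lam)

def detect(alphaf, x, z, sigma=0.5):
    """Response map over all cyclic shifts of the test patch z."""
    kzf = gaussian_correlation(np.fft.fft2(x), np.fft.fft2(z), sigma)
    return np.real(np.fft.ifft2(kzf * alphaf))

rng = np.random.default_rng(0)
x = rng.random((32, 32))                          # synthetic base sample
alphaf = train(x, gaussian_labels(x.shape, sigma=2.0))
z = np.roll(x, shift=(3, 5), axis=(0, 1))         # target moved by (3, 5) pixels
response = detect(alphaf, x, z)
peak = np.unravel_index(np.argmax(response), response.shape)
```

The location of the response peak gives the estimated displacement of the target between frames; indices past half the patch size correspond to negative shifts because of the circular wrap-around.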
2.2.2 Multichannel Convolutional Features of HSI
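Suppose the multichannel representation $x$ (which has been reshaped to a one-row matrix, i.e., a vector) in the current frame is composed of $x = [x_1, x_2, \dots, x_C]$, where $x_c$ denotes the $c$-th target representation. Since the kernels are based on the dot product, which can be computed by summing the individual dot products for each channel, Eq. (6) can be rewritten with the multichannel representation $x'$ in the next frame as
$$k^{xx'} = \exp\left( -\frac{1}{\sigma^2} \left( \lVert x \rVert^2 + \lVert x' \rVert^2 - 2 \, \mathcal{F}^{-1}\left( \sum_{c=1}^{C} \hat{x}_c^{*} \odot \hat{x}'_c \right) \right) \right). \tag{8}$$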
Therefore, the 3D stacked convolutional features can be seen as multichannel features referring to a pixel of the target object in KCF.
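The multichannel kernel differs from the single-channel case only in that the per-channel cross-correlations are summed before the exponential is applied. A minimal NumPy sketch, assuming features stored as a height × width × channels array and the same size normalization as is common in KCF implementations (the function name is hypothetical):

```python
import numpy as np

def multichannel_gaussian_kernel(x, z, sigma):
    """Gaussian kernel between x and all cyclic shifts of z, summed over channels.

    x and z have shape (height, width, channels).
    """
    xf = np.fft.fft2(x, axes=(0, 1))
    zf = np.fft.fft2(z, axes=(0, 1))
    # Sum the per-channel spectra products, then do one inverse 2D DFT.
    cross = np.real(np.fft.ifft2(np.sum(np.conj(xf) * zf, axis=2)))
    # Assumed normalization by the number of elements.
    dist = np.maximum(0.0, np.sum(x ** 2) + np.sum(z ** 2) - 2.0 * cross) / x.size
    return np.exp(-dist / sigma ** 2)

rng = np.random.default_rng(1)
x = rng.random((16, 16, 10))                     # e.g. 10 stacked feature maps
k_self = multichannel_gaussian_kernel(x, x, sigma=0.5)
z = np.roll(x, shift=(2, 4), axis=(0, 1))        # same features, shifted target
k_shift = multichannel_gaussian_kernel(x, z, sigma=0.5)
peak = np.unravel_index(np.argmax(k_shift), k_shift.shape)
```

Since only one inverse transform is needed regardless of the number of channels, the added cost of using many feature maps is mostly in the forward DFTs.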
3 Experimental Results and Analysis
In this section, we introduce the dataset used for the experiments, and provide details on the experimental setting, results, and comparison with alternatives.
3.1 Experimental Dataset
We performed experiments on nine image sequences, listed in Table 1. The sequences cover three scenes (apple, deer, and people), and each scene has three videos in grayscale, color, and hyperspectral format, respectively. The color and hyperspectral videos show the same scene; they were captured with a Nikon D600 camera and a Photonfocus or Ximea hyperspectral camera, placed side by side during capture. The hyperspectral cameras captured frames of 16 bands with an active range of 460–630 nm at 30 frames per second. After spectral calibration, each HSI frame is transformed into a three-dimensional data cube with 14 channels for the Photonfocus camera or 11 channels for the Ximea camera. The grayscale video is formed from the 490 nm band image of the HSI sequences. Therefore, the grayscale sequences are the same as the HSI sequences in frame size, number of frames, video content, and target location.
Sequence  No. of Frames  Image size  Target size  No. of Bands  Description
Apple (grayscale)  182  512×272  32×30  1  OCC, FM and BC
Apple (RGB)  422  1980×1080  133×123  3  OCC and BC
Apple (HSI)  182  512×272  32×30  14  OCC, FM and BC
Deer (grayscale)  114  512×272  50×65  1  OCC, OV and FM
Deer (RGB)  230  1980×1080  170×240  3  OCC and OV
Deer (HSI)  114  512×272  50×65  14  OCC, OV and FM
People (grayscale)  641  512×272  45×70  1  OCC, BC, IPR and DEF
People (RGB)  676  1980×1080  180×280  3  OCC, BC, IPR and DEF
People (HSI)  641  512×272  45×70  14  OCC, BC, IPR and DEF
3.2 Experimental Setup
To better analyze the strengths and weaknesses of the tracking algorithms, we considered six attributes [18] based on different challenging factors: background clutter (BC), out of view (OV), in-plane rotation (IPR), fast motion (FM), deformation (DEF), and occlusion (OCC). These are summarized in Table 1.
The proposed convolutional network based hyperspectral tracking (CNHT) method was implemented in MATLAB and ran at 1 frame per second on a PC with an Intel i7-7700HQ CPU (2.8 GHz) and 32 GB RAM. To validate the performance of the CNHT approach, we compared it with several state-of-the-art algorithms, including the deep network based trackers DLT [8] and CNT [17], and the correlation filter based trackers STC [7] and KCF [6]. The experimental results of the comparison are shown in Figs. 3–5. For convenience, we display the results on the grayscale and hyperspectral sequences in the same images, as their scenes are identical.
For the four comparative methods, we only changed the search scope parameter (e.g., in STC, the search scope is fixed at 6 times the target size) in order to adapt to the fast motion of the object. In our tracker, the state of the target (i.e., size and location) in the first frame was given by the ground truth, which was carefully labelled by hand. The size of the filter was set to 6×6×14 (spatial size 6, 14 bands), and the number of filters was set to a small value of 10 for high-speed tracking. The size of the base sample was set as 0.23 times of the initial target size, in order to handle fast motion. The other parameters of the KCF method remain unchanged from the original paper.
3.3 Qualitative Comparison






3.3.1 Background Clutters
Fig. 3 and Fig. 5 show some screenshots of the tracking results in sequences where the background and the target have similar colors in the RGB images. In the apple sequence, both the apple and its neighbourhood are red. The CNT method undergoes large drift over the entire sequence. The DLT, STC and KCF trackers follow the target well at the beginning of the sequence (e.g., frame 18), but lose it in the final stage (e.g., frame 372). The tracking result of the KCF method on the grayscale video is more accurate, as shown in Fig. 3(b). By utilizing the spectral information, our tracker is the only one that performs well on the entire sequence. The target person in the people sequence is wearing a green jumper, which is similar in color to the plants. The CNT tracks the object stably, even in the sequence. The DLT tracker drifts away from the target from the beginning to the end. The STC and KCF approaches lock onto parts of the background when the person walks in front of the tree (e.g., frame 176). Furthermore, the KCF faces the same problem in the grayscale image. However, the convolutional network based KCF approach handles color similarity well because it exploits the characteristics of hyperspectral features.
3.3.2 Partial Occlusion
The targets in all sequences undergo partial occlusion. In Fig. 3, the apple is frequently occluded by the fingers. In the grayscale sequences, the KCF and CNT methods are able to redetect the object when the target reappears on the screen. In Fig. 4, the deer is partially occluded by the camera (e.g., frame 62 of the grayscale sequence and frame 148 of the RGB sequence). All trackers achieve favorable results because the targets of interest are large compared with the size of the frames and differ in appearance from the background. However, the target moves out of the screen at frame 72 of the RGB sequence and frame 30 of the grayscale sequence, where the DLT method drifts to the background. In Fig. 5, the location estimate of the person may be disturbed by the thick bush (e.g., frame 176 of either the grayscale or the RGB video). The KCF method does not perform well (e.g., frame 483 of the grayscale video or frame 498 of the RGB video). Nevertheless, the proposed method tracks the target stably on the hyperspectral video, with much better accuracy than the alternatives on the grayscale and RGB videos.
3.4 Quantitative Comparison
Algorithm  Video Type  Mean Precision (20px)  Mean FPS 

DLT  Gray  80.4  0.8/20 (CPU/GPU) 
STC  Gray  82.8  365 
CNT  Gray  92.7  0.5 
KCF  Gray  53.6  278 
DLT  RGB  21.5  0.6/10 (CPU/GPU) 
STC  RGB  34.6  13 
CNT  RGB  53.1  0.5 
KCF  RGB  45.8  65 
CNHT  HSI  98.2  1 
Fig. 6 shows the performance of all tracking algorithms in terms of precision, which is defined as the ratio of frames whose tracking output is within a given threshold (in pixels) of the ground truth, measured by the center distance between bounding boxes. The precision of the proposed algorithm on the HSI sequences ranks the highest (0.982), followed by the CNT (0.927) on the grayscale sequences. The CNHT method takes advantage of the KCF method, hyperspectral information, and convolutional features. Thus, it outperforms the KCF method, which uses only one band of the hyperspectral video, by 83% in relative terms (98.2 vs. 53.6). The precision of STC and CNT on the apple sequences is much lower than on the deer and people sequences. This is because the apple is small and moves fast. More importantly, it has a similar color to the background. As shown in Table 2, the proposed CNHT method runs at 1 frame per second, which is acceptable considering the multiple bands in hyperspectral videos. The efficiency of the algorithm can be improved in the future by running on GPUs, which increase the speed of the DLT method by at least 16 times.
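The precision measure defined above can be computed as in the following sketch. It assumes bounding boxes given as (x, y, width, height) tuples with (x, y) the top-left corner and counts a frame as successful when the center distance is at most the threshold; both conventions, like the function names, are assumptions made for illustration.

```python
import math

def box_center(box):
    """Center of a box given as (x, y, width, height)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def precision(pred_boxes, gt_boxes, threshold=20.0):
    """Fraction of frames whose center location error is within `threshold` pixels."""
    if len(pred_boxes) != len(gt_boxes) or not gt_boxes:
        raise ValueError("need two non-empty box lists of equal length")
    hits = 0
    for pred, gt in zip(pred_boxes, gt_boxes):
        (px, py), (gx, gy) = box_center(pred), box_center(gt)
        if math.hypot(px - gx, py - gy) <= threshold:
            hits += 1
    return hits / len(gt_boxes)

# Toy example: center errors of 0, 10, 30 and 20 pixels.
gt = [(0, 0, 10, 10)] * 4
pred = [(0, 0, 10, 10), (10, 0, 10, 10), (30, 0, 10, 10), (0, 20, 10, 10)]
score = precision(pred, gt, threshold=20.0)
```

Sweeping the threshold over a range of pixel values and plotting the resulting scores gives a precision curve of the kind reported in tracking benchmarks [18].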
4 Conclusions
In this paper, we introduce a convolutional feature based kernelized correlation filter approach for hyperspectral video tracking. The hyperspectral features are extracted via a two-layer convolutional network. They provide discriminative information and can be used as multichannel features for the KCF tracking framework. The experimental results demonstrate that the presented method performs well on a hyperspectral dataset. This lays the foundation for developing future hyperspectral tracking methods.
References

[1] X. Bai, J. Zhou, A. Kelly: Pattern recognition for high performance. Pattern Recognition 82, 38–39 (2018).
[2] Y. Li, W. Xie and H. Li: Hyperspectral image reconstruction by deep convolutional neural network for classification. Pattern Recognition 63, 371–383 (2017).
[3] F. Alam, J. Zhou, A. Alan, X. Jia, J. Chanussot and Y. Gao: Conditional random field and deep feature learning for hyperspectral image segmentation. arXiv preprint arXiv:1711.04483 (2017).
[4] M. Danelljan, F. Khan, M. Felsberg and J. Weijer: Adaptive color attributes for real-time visual tracking. IEEE Conference on Computer Vision and Pattern Recognition, 1090–1097 (2014).
[5] J. Henriques, R. Caseiro, P. Martins and J. Batista: Exploiting the circulant structure of tracking-by-detection with kernels. European Conference on Computer Vision, 702–715 (2012).
[6] J. Henriques, R. Caseiro, P. Martins and J. Batista: High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(3), 583–596 (2015).
[7] K. Zhang, L. Zhang and Q. Liu: Fast visual tracking via dense spatio-temporal context learning. European Conference on Computer Vision, 127–141 (2014).
[8] N. Wang and D. Yeung: Learning a deep compact image representation for visual tracking. Advances in Neural Information Processing Systems, 809–817 (2013).
[9] D. Bolme, J. Beveridge, B. Draper and Y. Lui: Visual object tracking using adaptive correlation filters. IEEE Conference on Computer Vision and Pattern Recognition, 2544–2550 (2010).
[10] Y. Li and J. Zhu: A scale adaptive kernel correlation filter tracker with feature integration. European Conference on Computer Vision, 254–265 (2014).
[11] H. Nam and B. Han: Learning multi-domain convolutional neural networks for visual tracking. IEEE Conference on Computer Vision and Pattern Recognition, 4293–4302 (2016).
[12] J.W. Choi, H. Chang, T. Fischer, S. Yun, K. Lee, J. Jeong, Y. Demiris and J. Choi: Context-aware deep feature compression for high-speed visual tracking. IEEE Conference on Computer Vision and Pattern Recognition (2018).
[13] H. Nam and B. Han: Learning multi-domain convolutional neural networks for visual tracking. IEEE Conference on Computer Vision and Pattern Recognition, 4293–4302 (2016).
[14] L. Wang, W. Ouyang, X. Wang and H. Lu: Visual tracking with fully convolutional networks. IEEE International Conference on Computer Vision, 3119–3127 (2015).
[15] C. Ma, J. Huang, X. Yang and M. Yang: Hierarchical convolutional features for visual tracking. IEEE International Conference on Computer Vision, 3074–3082 (2015).
[16] M. Danelljan, G. Hager, F. Khan and M. Felsberg: Convolutional features for correlation filter based visual tracking. IEEE International Conference on Computer Vision Workshops, 58–66 (2015).
[17] K. Zhang, Q. Liu, Y. Wu and M. Yang: Robust visual tracking via convolutional networks without training. IEEE Transactions on Image Processing 25(4), 1779–1792 (2016).
[18] Y. Wu, J. Lim and M. Yang: Online object tracking: A benchmark. IEEE Conference on Computer Vision and Pattern Recognition, 2411–2418 (2013).