Object Tracking in Hyperspectral Videos with Convolutional Features and Kernelized Correlation Filter

10/28/2018 ∙ by Kun Qian, et al. ∙ 0

Target tracking in hyperspectral videos is a new research topic. In this paper, a novel method based on convolutional network and Kernelized Correlation Filter (KCF) framework is presented for tracking objects of interest in hyperspectral videos. We extract a set of normalized three-dimensional cubes from the target region as fixed convolution filters which contain spectral information surrounding a target. The feature maps generated by convolutional operations are combined to form a three-dimensional representation of an object, thereby providing effective encoding of local spectral-spatial information. We show that a simple two-layer convolutional networks is sufficient to learn robust representations without the need of offline training with a large dataset. In the tracking step, KCF is adopted to distinguish targets from neighboring environment. Experimental results demonstrate that the proposed method performs well on sample hyperspectral videos, and outperforms several state-of-the-art methods tested on grayscale and color videos in the same scene.



There are no comments yet.


page 3

page 4

page 8

page 9

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Hyperspectral imaging plays an important role in remote sensing as it provides hundreds of contiguous, narrow spectral bands [1]. With the advantage of rich spectral information, hyperspectral images (HSIs) have been widely used in many applications involving image classification [2] and segmentation [3], such as land cover detection and mining. However, to the best of our knowledge, there is very little work focusing on hyperspectral video processing. The main reason is that it is difficult to capture hyperspectral videos with low speed imaging devices. It is not until the last a couple of years that low cost hyperspectral video cameras become available, making it possible to collect hyperspectral videos at a high frame rate.

In this paper, we introduce one of the very first work on object tracking in hyperspectral videos. Object tracking is an important research topic in computer vision and multimedia. Most tracking methods 

[4, 5, 6, 7, 8] were developed on grayscale or RGB videos. Discriminative correlation filter (DCF) [4, 5, 6, 9]

based framework explores supervised visual object tracking. The DCF trains the filters very efficiently in the frequency domain via fast Discrete Fourier transform (DFT). It learns a correlation filter to localize the object in consecutive frames. The learned filter is applied to estimate the target location by calculating the maximum response. Bolme et al. introduced the minimum output sum of squared error filter (MOSSE) tracker 

[9] which utilizes grayscale features and achieves an impressive speed in tracking application. Other features used for tracking include the incorporation of kernels and histogram of gradient (HOG) features [6], the addition of color name features [4], adaptive scale [10]

, and the integration of deep learning features 

[11]. The kernelized correlation filter (KCF) method [6] circularly shifts the training samples and exploits the advantage of multichannel HOG features with the kernel trick. Zhang et al. proposed the Spatio-Temporal Context (STC) [7]

tracker, which explores the correlation filter in terms of the probability theory, and utilized the dense sampling to track the object of interest.

In recent years, deep learning methods have shown success in object tracking [12, 13, 14]. Several works [15, 16]

have combined deep learning with the correlation filter based framework. Instead of using hand-crafted features such as HOG, the DCF trackers use features automatically learnt by convolutional neural network (CNN). This significantly improves the robustness of the tracking. Zhang et al. proposed the lightweight convolutional network based tracker (CNT) 

[17] which has a simple architecture and yet effectively constructs a robust representation. This tracker demonstrates that a two-layer CNN without pooling and training process can obtain competitive results on a benchmark dataset with 50 challenging videos, and outperforms the first deep learning based tracker (DLT) [8] by a large margin.

In this paper, we propose a novel convolutional feature based tracker for hyperspectral video processing. The videos were captured by a hyperspectral camera of 14 bands in the range of 470-620nm. We first defined convolution filters from a set of normalized three-dimensional cubes surrounding a target. The convolutional operations generate a set of feature maps that are combined to form a three-dimensional representation of an object, which is used in the tracking process. In the tracking step, KCF is adopted to distinguish targets from neighboring environment. We extend the KCF method so it can cope with hyperspectral data.

The remainder of this paper is organized as follows. In Section 2, we first present the convolutional feature for hyperspectral images. Then, we briefly describe the KCF tracker, and how it can be extended for multichannel convolutional features for hyperspectral tracking. In Section 3, experimental results are presented to verify the performance of the proposed method on hyperspectral video sequences. Our method is also compared with the state-of-art methods on grayscale and RGB videos of the same scene. Finally, conclusions are drawn in Section 4.

2 The Proposed Tracking Algorithm

In this section, we describe the details of the proposed method. We first introduce convolutional features in the 3D spectral-spatial domain. Then we describe the KCF method and its extension to hyperspectral data.

2.1 Convolutional Features for Hyperspectral Target

Motivated by the success of convolution network on visual tracking [17]

, we utilize this method to extract the local hyperspectral information. Given a target template, the proposed hierarchical representation architecture contains two steps. First, local features which contain spectral information are extracted from a bank of three-dimensional filters convolving with the input image at each position. Then, these features are stacked together to form a three-dimensional representation. This feature extraction process is shown in Fig. 

1 and Fig. 2.

Figure 1: 3D convolution process.

The image patchs in Fig. 1 are generated from local hyperspectral image cube , where and denote patch size and the number of spectral bands, respectively. A set of overlapping local image patchs centered at each pixel position is densely sampled inside the image patch through a sliding window of size , where . In the first frame of the video, several filters are selected randomly from , the responses on the image patch are denoted with feature maps , which can be expressed as


where is the convolution operator.

Figure 2: Stacked features.

Fig. 1 shows that the 3D filter, which is localized, can extract local structural features for the hyperspectral cubes. Furthermore, convolutional results of three target templates (at the bottom of Fig. 1) are similar in geometric layout, which demonstrates that the local filter is effective in extracting the target features despite their appearance variation. For negative templates (see the third image patch in the bottom row of Fig. 1), its convolutional result are very different from the target templates. As shown in Fig. 2, 3D features generated by 10 filters have similar properties. Therefore, the convolution results and the generated features represent the inner structure of the tracking target.

2.2 Tracking Framework

The output of the filtering operation are stacked to form a three-dimensional representation. This can be considered as the multichannel feature which is required as the input to the kernelized correlation filter (KCF) tracker. In this section, we firstly briefly describe the KCF tracker [6], and then extend it to use the effective features introduced in the above section.

2.2.1 The KCF Tracker

Our approach is built on the KCF tracker which has achieved impressive results on Visual Tracker Benchmarks [18]

. The key of the KCF algorithm is to train a classifier through a ridge regression model, whose objective function is represented as


where , and represent the regression value, regularization parameter and regression coefficient, respectively.

The KCF approach densely samples a circulant sample matrix , where denotes the circulant operation based on the first row (i.e., base sample). This matrix can be decomposed into


where and denote the DFT matrix and the diagonalization function, respectively. is the Hermitian transpose of , which is a constant.

Improved via the kernel trick, coefficient is mapped to a high-dimensional feature space, i.e., , where means the mapping function and is a new coefficient. Then, the coefficient can be formulated as


where kernel matrix is also a circulant matrix with denoting the first row. In the current frame, represents a prior and can be modeled as , where denotes the exponential function, and is a normalization constant. denotes the Euclidean distance between the target and a pixel in the neighborhood. and represent a scale parameter and a shape parameter, respectively.

In Eq. (5), can be computed based on the Gaussian function, i.e.,


Subsequently, the object tracking task is transformed to a detection problem. The image patch of the current frame at the same target location is treated as the testing base sample, therefore, the reponse map is expressed as:


where and are learnt before the current frame. An intuitive description is that the reponse is a linear combination of the neighboring kernel value with the weighted coefficient .

2.2.2 Multichannel Convolutional Features of HSI

Suppose the multichannel representation

(which has been reshaped to one row matrix, i.e., vector) in the current frame is composed of

, where denotes the -th target representation. Since the kernels are based on the dot-product, which can be computed by summing the individual dot-products for each channel, Eq. (6) can be rewritten with the multichannel representation in the next frame as


Therefore, the 3D stacked convolutional features can be seen as multichannel features referring to a pixel of the target object in KCF.

3 Experimental Results and Analysis

In this section, we introduce the dataset used for the experiments, and provide details on the experimental setting, results, and comparison with alternatives.

3.1 Experimental Dataset

We performed experiments on nine image sequences. They are named as , , , , , , , , , respectively. The sequences contain three scenes and each scene has three videos corresponding to grayscale, color, and hyperspectral format, respectively. The color scene and hyperspectral scene are the same, which were captured using a Nikon D600 camera and a Photonfocus or an Ximea hyperspectral camera. These two types of cameras were put side by side when capturing the videos. The hyperspectral cameras captured frames of 16 bands with active range of 460-630nm at 30 frames per second. After spectral calibration, the HSI is transformed into a three-dimensional data cube with 14 channels for the Photofocus camera or 11 channels for the Ximea camera. The grayscale video is formed by band image at 490nm of the HSI sequences. Therefore, the grayscale sequences are the same as the HSI sequences in the size and number of the frames, the video content, and the target location.

Sequence No. of Frames Image size Target size No. of Bands Description
182 512272 3230 1 OCC, FM and BC
422 19801080 133123 3 OCC and BC
182 512272 3230 14 OCC, FM and BC
114 512272 5065 1 OCC, OV and FM
230 19801080 170240 3 OCC and OV
114 512272 5065 14 OCC, OV and FM
641 512272 4570 1 OCC, BC, IPR and DEF
676 19801080 180280 3 OCC, BC, IPR and DEF
641 512272 4570 14 OCC, BC, IPR and DEF
Table 1: Summary of video sequences.

3.2 Experimental Setup

To better analyze the strength and weakness of the tracking algorithm, we considered 6 attributes [18] based on different challenging factors including background clutters (BC), out of view (OV), in-plane rotation (IPR), fast motion (FM), Deformation (DEF), and Occlusion (OCC), which are summarized in Table 1.

The proposed convolutional network based hyperspectral tracking (CNHT) method was implemented in MATLAB and ran at 1 frame per second on a PC with Intel i7-7700HQ (2.8 GHz) and 32 GB RAM. To validate the performance of the CNHT approach, we compared it with some state-of-the-art algorithms, including deep network based trackers DLT [8] and CNT [17], and correlation filter based trackers STC [7] and KCF [6]. The experimental results of the comparison are shown in Figs. 3-5. For convenience, we display the results on grayscale and hyperspectral sequences in the same images as their scene are identical.

For four comparative methods, we only changed the parameter on the search scope (e.g., in STC, the search scope is fixed at 6 times of the target size), in order to adapt to fast motion of the object. In our tracker, the state of the target (i.e., size and location) in the first frame was given by the ground truth, which is carefully manually labelled. The size of the filter was set to 6614 (=6, =14), the number of filters was set to a small number of 10 for high speed tracking. The size of the base sample was set as 0.2-3 times of the initial target size, in order to handle fast motion. The other parameters with respect to the KCF method remain unchanged as in the original paper.

3.3 Qualitative Comparison

(a) Sampled tracking results on RGB sequence.
(b) Sampled tracking results on grayscale and HSI sequences.
Figure 3: Qualitative results on the apple sequence.
(a) Sampled tracking results on RGB sequence.
(b) Sampled tracking results on grayscale and HSI sequences.
Figure 4: Qualitative results of the deer sequence.
(a) Sampled tracking results on RGB sequence.
(b) Sampled tracking results on grayscale and HSI sequences.
Figure 5: Qualitative results of the people sequence.

3.3.1 Background Clutters

Fig. 3 and Fig. 5 show some screenshots of the tracking results in sequences where the background and the target have similar color in the RGB images. In the sequence, the color of the apple and its neighbourhood are red. The CNT method undergoes large drift in the entire sequence. The DLT, STC and KCF track the target well at the beginning of the sequence (e.g. 18), but lose the target at the final stage (e.g. 372). The tracking result of the KCF method on the grayscale video is more accurate, which is shown in Fig. 3(b). Utilizing the spectral information, our tracker is the only one that performs well on the entire sequence. The target people in the sequence is wearing a green jumper which is similar to the color of the plants. The CNT tracks the object stably, even in the sequence. The DLT tracker drifts away from the target from the beginning to the end. The STC and KCF approaches lock on parts of background when the people walks in front of the tree (e.g. 176). Furthermore, the KCF faces the same problem as in the grayscale image. However, the convolutional network based KCF approach handles color similarity well thanks to the fact that it exploits the characteristic of hyperspectral features.

3.3.2 Partial Occlusion

The targets in all sequences contain partial occlusion. In Fig. 3, the apple is frequently occluded by the fingers. In the gray sequences, the KCF and CNT methods are able to re-detect the object when the target re-appears in the screen (e.g., ). In Fig. 4, the deer is partial occluded by the camera (e.g., 62 of the grayscale sequence and 148 of the RGB sequence). All trackers achieve favorable results because targets of interest are large compared with the size of frames and have different appearance from the background. However, the target moves out of the screen at 72 of the RGB sequence or 30 of the grayscale sequence, in which frame the DLT method drifts to the background. In Fig. 5, the location estimation of the people is possibly disturbed by the thick bush (e.g., 176 of either grayscale or RGB video). The KCF method does not performs well (e.g., 483 of either grayscale or 498 of RGB video). Nevertheless, the proposed method obtains a stable tracking target on the hyperspectral video with much better accuracy than the alternatives on the grayscale video and the RGB video.

3.4 Quantitative Comparison

(a) Precision plot for all sequences.
(b) Precision plot for apple sequences.
(c) Precision plot for deer sequences.
(d) Precision plot for people sequences.
Figure 6: precision plot.
Algorithm Video Type Mean Precision (20px) Mean FPS
DLT Gray 80.4 0.8/20 (CPU/GPU)
STC Gray 82.8 365
CNT Gray 92.7 0.5
KCF Gray 53.6 278
DLT RGB 21.5 0.6/10 (CPU/GPU)
STC RGB 34.6 13
CNT RGB 53.1 0.5
KCF RGB 45.8 65
CNHT HSI 98.2 1
Table 2: Precision and FPS

Fig. 6 shows the performance of all tracking algorithms in terms of precision which is defined as the ratio of successful frames whose tracking output is within the given threshold (in pixels) from the ground-truth, measured by the center distance between bounding boxes. The precision of the proposed algorithm on the HSI sequence ranks the highest (0.982), which is followed by the CNT (0.927) on grayscale sequences. The CNHT method takes advantages of the KCF method, hyperspectral information, and convolution method. Thus, it outperforms the KCF method, which uses only one band of the hyperspectral video, by 83. The precision of STC and CNT on apple sequences are much lower than those over the deer and people sequences. This is because the apple is small and moves fast. More importantly, it has similar color as the background. As shown in Table 2, the proposed CNHT method runs at 1 frame per second, which is acceptable in consideration of the multiple bands in hyperspectral videos. The algorithm efficiency can be improved in the future via running on GPUs which increase the speed of DLT method by at least 16 times.

4 Conclusions

In this paper, we introduce a convolutional feature based kernerlized correlation filter approach for hyperspectral video tracking. The hyperspectral features are extracted via two-layer convolutional network. They provide discriminative information and can be used as multichannel features for the KCF tracking framework. The experimental results demonstrate that the presented method performs well in a hyperspectral dataset. This lays the foundation for developing future hyperspectral tracking methods.


  • [1]

    X. Bai, J. Zhou, A. Kelly: Pattern recognition for high performance. Pattern Recognition.

    82, 38–39 (2018)
  • [2] Y. Li, W. Xie, and H. Li: Hyperspectral image reconstruction by deep convolutional neural network for classification. Pattern Recognition 63, 371–383 (2017).
  • [3] F. Alam, J. Zhou, A. Alan, X. Jia, J. Chanussot and Y. Gao: Conditional Random Field and Deep Feature Learning for Hyperspectral Image Segmentation. arXiv preprint arXiv:1711.04483, (2017).
  • [4] M. Danelljan, F. Khan, M. Felsberg and J. Weijer: Adaptive color attributes for real-time visual tracking. IEEE Conference on Computer Vision and Pattern Recognition, 1090–1097 (2014).
  • [5] J. Henriques, R. Caseiro, P. Martins and J. Batista: Exploiting the circulant structure of tracking-by-detection with kernels. European conference on computer vision, 702–715 (2012).
  • [6] J. Henriques, R. Caseiro, P. Martins and J. Batista: High-Speed Tracking with Kernelized Correlation Filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(3), 583–596 (2015).
  • [7] K. Zhang, Q. Liu, Y. Wu and M. Yang: Robust visual tracking via convolutional networks without training. IEEE Transactions on Image Processing, 25(4), 1779–1792 (2016).
  • [8] K. Zhang, L. Zhang, and Q. Liu: Fast visual tracking via dense spatio-temporal context learning. European Conference on Computer Vision, 127–141 (2014).
  • [9] D. Bolme, J. Beveridge, B. Draper and Y. Lui: Visual object tracking using adaptive correlation filters. IEEE Conference on Computer Vision and Pattern Recognition, 2544–2550 (2010).
  • [10] Y. Li and J. Zhu: A scale adaptive kernel correlation filter tracker with feature integration. European Conference on Computer Vision, 254–265 (2014).
  • [11] H. Nam and B. Han: Learning multi-domain convolutional neural networks for visual tracking. IEEE Conference on Computer Vision and Pattern Recognition, 4293–4302 (2016).
  • [12]

    J.W. Choi, H. Chang, T. Fischer, S. Yun, K. Lee, J. Jeong, Y. Demiris and J. Choi: Context-aware Deep Feature Compression for High-speed Visual Tracking. IEEE Conference on Computer Vision and Pattern Recognition, (2018).

  • [13] H. Nam and B. Han: Learning multi-domain convolutional neural networks for visual tracking. IEEE Conference on Computer Vision and Pattern Recognition, 4293–4302 (2016).
  • [14] L. Wang, W. Ouyang, X. Wang and H. Lu: Visual tracking with fully convolutional networks. IEEE International Conference on Computer Vision, 3119–3127 (2015).
  • [15] C. Ma, J. Huang, X. Yang and M. Yang: Hierarchical convolutional features for visual tracking. IEEE International Conference on Computer Vision, 3074–3082 (2015).
  • [16] M. Danelljan, G. Hager, F. Khan and M. Felsberg: Convolutional features for correlation filter based visual tracking. IEEE International Conference on Computer Vision Workshops, 58–66 (2015).
  • [17] N. Wang and D. Yeung: Learning a deep compact image representation for visual tracking. Advances in neural information processing systems, 809–817 (2013).
  • [18] Y. Wu, J. Lim and M. Yang: Online object tracking: A benchmark. IEEE Conference on Computer Vision and Pattern Recognition, 2411–2418 (2013).