Saliency Guided Hierarchical Robust Visual Tracking

12/21/2018 · by Fangwen Tu, et al.

A saliency guided hierarchical visual tracking (SHT) algorithm containing global and local search phases is proposed in this paper. In the global search, a top-down saliency model is newly developed to handle abrupt motion and appearance variation. Nineteen feature maps are first extracted and combined with online-learnt weights to produce the final saliency map and estimated target locations. After evaluation by an integration mechanism, the optimal candidate patch is passed to the local search. In the local search, superpixel based HSV histogram matching is performed jointly with an L2-RLS tracker, so that both the color distribution and the holistic appearance of the object are taken into consideration. Furthermore, a linear refinement search with a fast iterative solver is implemented to attenuate the possible negative influence of dominant particles. Both qualitative and quantitative experiments are conducted on a series of challenging image sequences. Comparative studies demonstrate the superior performance of the proposed method over other state-of-the-art algorithms.




I Introduction

Visual tracking, which aims at estimating the location of a specified target in video sequences, is one of the most important research branches in computer vision. It has wide applications including security surveillance, robotics, motion analysis and military patrol. Although much attention has been attracted to this topic and many breakthroughs have been made during the past few years, it is still very challenging to develop a robust algorithm because of scene variations such as partial or full occlusion, abrupt motion, background clutter, illumination variation and non-rigid deformation.

A popular trend for tackling the visual tracking problem is to involve sparse representation, first proposed in [2], where motion and observation models are embedded into a particle filter framework with an updated appearance dictionary. Follow-up works further enhance the speed of the L1 tracker [3] [4] by reducing the number of samples to be decomposed and proposing an accelerated proximal gradient (APG) solver. To consider holistic and part-based information simultaneously, a sparse discriminative classifier (SDC) and a sparse generative model (SGM) are developed jointly for visual tracking in [5]. This improves the discriminative power of the tracker at the expense of more time consumption. To alleviate the computational burden, a fast tracking algorithm using a non-adaptive random-projection appearance model is proposed in [6]; its superior speed benefits greatly from its coarse-to-fine framework. To ensure sparsity of the coefficients, most state-of-the-art algorithms apply an L1 regularization constraint. An L2-RLS tracker showing competitive performance is proposed in [7]. Compared with the traditional L1 tracker, L2-RLS reduces the computational complexity of solving the cost function by only recalculating a projection matrix after template updating. However, all of these algorithms depend only on the greyscale of the image and abandon the color cues of the target. In addition, they usually assume that the target location in the current frame is near that in the last frame; therefore, it is very difficult for them to handle cases with abrupt motion.
In this paper, we perform a saliency guided global search ahead of the random-walk sampling in the particle filter to provide a rough location of the target. Saliency with a bottom-up structure was originally used to predict eye movements in a scene using low-level cues (e.g., oriented filter responses and color) [8]. However, to make it suitable for visual tracking, in which we wish to emphasize a certain target, a top-down structure with prior information needs to be developed. Previous studies like [9] [10] have verified the feasibility of incorporating top-down saliency into visual tracking. In [9], the saliency map is built using a modified frequency-tuned method with pre-calculated weights in the same way as VOCUS [11]; the target is then tracked with local and global search processes. In [10], the saliency map is generated by assigning weights to three conspicuity maps in Itti and Koch's saliency model [12]. However, in these works, the weights are computed in the first frame and remain unchanged during tracking, which makes the algorithms fail to adapt to appearance variations of the target. Moreover, the prior information from the last frame also plays a crucial role in guiding the location prediction. Thus, to overcome these problems, a novel weight updating mechanism as well as a comprehensive saliency map generation method are proposed in this paper.
To further take advantage of color information, superpixel based HSV histogram matching is newly introduced in this paper. Superpixels can be utilized to reduce the sample set that needs to be considered, because each superpixel patch contains pixels with similar color features that can be merged together; they have therefore been widely employed in visual tracking [13] [14] [15]. In [16], a superpixel based discriminative appearance model is constructed to help the tracker distinguish the target from the background. Although the result is promising, it misses the structural information of the object, which conventional template matching approaches can capture. To cope with this, a joint local search scheme is proposed by combining a simplified superpixel matching with the L2-RLS tracker. Superpixel matching equips the algorithm with the capability to identify the target through its color distribution, while the L2-RLS tracker investigates the candidate patches from a holistic appearance perspective.
Finally, the hierarchical structure is completed with a linear refinement search. This search is developed to balance the contributions of particles with high confidence and to avoid the domination of individual ones caused by the traditional maximum a posteriori (MAP) operation. The idea is inspired by [17] and [15], which extend the state space of particle observations from discrete to continuous by local linear coding. Different from them, the proposed method is applied only to the selected promising candidates, and hence a customized optimization function is required. In addition, a fast iterative solver with an analytical solution is established, and the improved performance is demonstrated by experimental study.
The main contributions of this work are three-fold and can be summarized as follows:

  • A novel saliency guided global search algorithm is proposed, considering prior information and adaptivity to object appearance variation. Nineteen features providing a comprehensive description of the target are combined with online-learnt weights to produce a top-down saliency map for candidate patch selection. A corresponding integration mechanism is developed to filter out false targets. The global search provides a rough target location prediction that allows the tracker to handle abrupt motion and appearance variation.

  • Superpixel based HSV histogram matching is incorporated into an L2-RLS tracker to involve the color distribution of the object and achieve a joint observation likelihood. This method not only boosts the discriminative power of an intensity-template based tracker by introducing an additional color cue, but also considers structural information by using superpixels.

  • A linear refinement search is designed before the final estimate is obtained, to further rectify the bounding box and lower the drifting risk of a single dominant particle caused by the MAP operation, by sharing the risk among the several most promising candidate patches. A novel cost function is proposed and a fast iterative solver with an analytical solution in each loop is developed. The capability of this search to improve accuracy is validated by a case study.

The organization of this paper is as follows. Section II briefly introduces preliminary knowledge of the particle filter. Section III presents the details of the saliency guided global search. The integration mechanism between global search and local search is elaborated in Section IV. Section V states the hierarchical local search containing superpixel matching and refinement search. Both qualitative and quantitative experiments are presented in Section VI. Finally, Section VII gives some concluding remarks.

II Preliminaries on Particle Filter

A Bayesian inference framework is usually applied to the visual tracking problem. It estimates the posterior distribution of the state variables characterizing a dynamic system by

p(x_t | y_{1:t}) ∝ p(y_t | x_t) ∫ p(x_t | x_{t−1}) p(x_{t−1} | y_{1:t−1}) dx_{t−1},     (1)

where y_{1:t} = {y_1, …, y_t} denotes the observation vector up to the t-th frame, x_t is the state variable of the target in frame t, p(x_t | x_{t−1}) represents the motion model predicting the state in the current frame from the immediately previous state, and p(y_t | x_t) indicates the observation model, which is a likelihood function in essence. Conventionally, the estimated state can be obtained by the MAP operation

x̂_t = argmax_{x_t^i} p(x_t^i | y_{1:t}),     (2)

where x_t^i indicates the state of the i-th sample. In this paper, six affine parameters compose the state x_t = (t_x, t_y, θ, s, α, φ), which represent the translation along the horizontal and vertical directions, the rotation angle, scale, aspect ratio and skew respectively. A random walk is employed as the transition model, i.e. p(x_t | x_{t−1}) = N(x_t; x_{t−1}, Ψ), where N(·) denotes a Gaussian distribution and Ψ stands for the diagonal covariance matrix containing the standard deviations of the six affine parameters.
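As a minimal sketch of the machinery above (function names, array shapes and the particular standard deviations are ours, not the paper's), the random-walk propagation and MAP selection can be written as:

```python
import numpy as np

def propagate(prev_state, sigmas, n_particles, rng):
    """Random-walk transition p(x_t | x_{t-1}) = N(x_t; x_{t-1}, Psi):
    perturb the six affine parameters with independent Gaussian noise
    whose standard deviations form the diagonal of Psi."""
    return prev_state + rng.normal(0.0, sigmas, size=(n_particles, 6))

def map_estimate(particles, likelihoods):
    """MAP operation (2): keep the particle with maximum likelihood."""
    return particles[np.argmax(likelihoods)]

rng = np.random.default_rng(0)
x_prev = np.array([120.0, 80.0, 0.0, 1.0, 1.0, 0.0])  # tx, ty, theta, s, alpha, phi
samples = propagate(x_prev, np.array([4, 4, 0.02, 0.01, 0.005, 0.001]), 600, rng)
```

The 600 samples match the per-frame particle count reported in the experiments section.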


III Saliency Guided Global Search

Traditional bottom-up attention models are not suitable for visual tracking, since they fail to produce saliency maps of the objects we are interested in. To handle this, a novel top-down visual attention model is proposed to learn a saliency map that emphasizes the tracking target. The final saliency map is constructed by combining nineteen low-level features with different weights. The weights are updated based on the tracking result of the current frame. The outputs of this global search (candidate particles) are generated by computing connected areas on the saliency map. They act as samples of potential target positions and are passed into the integration mechanism. The overall pipeline of the global search is described in Figure 1. The saliency guided search efficiently copes with challenging scenarios such as abrupt motion, motion blur and out-of-plane rotation.

Fig. 1: Pipeline of saliency guided global search. (a) Linearly combine 19 feature maps with pre-determined weights. (b) Penalize the combined map with the estimated target position in the last frame and generate a binary saliency map. (c) Find the maximum connected area and determine the geometric center for candidate patch generation. (d) Update the weights with the tracking result in the current frame.

III-A Attention Map Building and Candidate Particle Generation

In saliency detection, features at different levels are usually utilized to construct a saliency map. In [18], 33 features distributed across low, mid and high levels are computed as the input to an SVM classifier. However, the high-level features, which detect a person or a face, are not applicable here, since they need off-line training and tracking targets are not restricted to humans. A similar situation holds for the mid-level feature, which assumes salient objects mostly lie near the horizon. Thus, in this paper, we take advantage of the low-level features that depict the fundamental and general characteristics of the object.
By considering the tradeoff between validity and speed, nineteen low-level features are extracted first from a newly arrived frame. Similar to [18], steerable pyramid subbands [19] in four orientations and three scales form the first 13 feature maps. We also incorporate four broadly tuned color channels (R, G, B, Y) as well as an intensity channel I [20], which are created by

R = r − (g + b)/2,  G = g − (r + b)/2,  B = b − (r + g)/2,
Y = (r + g)/2 − |r − g|/2 − b,  I = (r + g + b)/3,     (3)

where r, g and b are the red, green and blue channels of the image. Considering that the tracker is very likely to track humans, a skin color channel is involved as the nineteenth feature map. The concatenation of the 19 feature maps constructs the candidate feature map set. To relieve the computational burden, similar to [18], the original image is warped to a lower resolution for feature map computation. Define a weight vector w indicating the correlation degree between each individual feature map and the tracking target; the method to determine and update w will be introduced in the next subsection. A top-down saliency map is created through the weighted sum of the low-level feature maps as

S = Σ_{i=1}^{19} w_i F_i,     (4)

where w_i is the i-th element of w and F_i the i-th feature map. To incorporate the position information of the last frame into the saliency map, a revised center prior penalization is introduced in this work. Different from the traditional center prior [18], which assumes humans naturally tend to frame the object of interest near the center of the image, we penalize the saliency value in S according to the distance to the center of the target in the last frame as S ∘ P, where ∘ is the Hadamard (element-wise) product. P represents the penalty matrix, whose entry at a point p decays with the Euclidean distance d(c, p) between p and the target center c in the last frame, modulated by a tunable scalar σ set as 2 in this paper.
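A sketch of the map construction and distance penalization follows; the exponential-decay form of the penalty (with the distance normalised to [0, 1]) is our assumption, as the extracted text only states that the penalty grows with Euclidean distance, with σ = 2:

```python
import numpy as np

def penalised_saliency(feature_maps, w, last_center, sigma=2.0):
    """Weighted sum S = sum_i w_i F_i (4), then Hadamard product with a
    penalty matrix that decays with normalised Euclidean distance from
    the last frame's target centre."""
    s = np.tensordot(w, feature_maps, axes=1)        # (19,H,W) x (19,) -> (H,W)
    h, wd = s.shape
    ys, xs = np.mgrid[0:h, 0:wd]
    dist = np.hypot(ys - last_center[0], xs - last_center[1])
    penalty = np.exp(-sigma * dist / dist.max())     # 1 at the centre, decaying outwards
    return s * penalty
```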

Remark 1

Proper selection of σ is important for eliminating false salient regions while preserving the true target for abrupt motion handling. Too small a value is not adequate to suppress disturbance. On the contrary, too large a value causes the points far from the last target center to vanish severely, which may also block the true target when abrupt motion occurs.

For the purpose of noise attenuation and ease of operation, the distance-penalized saliency map is binarized with a threshold to produce the final saliency map. The procedure introduced so far is described in Figure 1 (a) (b).
In this paper, we assume the tracking target, or part of it, maps to a connected area on the binary saliency map. The intuition behind this assumption is that the parts of an object possessing highly distinctive features are usually coherent, for example a face with skin color against an entire head with black hair and a red mouth. With this assumption, a "run-relabel" algorithm [21] is employed to derive the set of connected areas, with a predefined threshold controlling the minimum area that can be selected into the set. Subsequently, we return the center location of each area. We apply the same size and orientation of the bounding box in the last frame to every center to crop out the candidate particles, as shown in Figure 1 (c). The particles are then passed into a judgement system for further evaluation, which will be elaborated in the "Integration Mechanism" section.
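The connected-area step can be sketched with a standard labeling routine in place of the paper's "run-relabel" implementation [21]:

```python
import numpy as np
from scipy.ndimage import label

def candidate_centers(binary_map, min_area):
    """Return the geometric centre of every connected region on the
    binary saliency map whose area exceeds the threshold min_area."""
    labels, n = label(binary_map)
    centers = []
    for k in range(1, n + 1):
        ys, xs = np.nonzero(labels == k)
        if ys.size > min_area:
            centers.append((float(ys.mean()), float(xs.mean())))
    return centers
```

Each returned centre is then given the previous frame's bounding-box size and orientation to crop a candidate particle.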

III-B Online Weight Updating

In order to adapt to appearance changes of the object and enhance the discriminative power, the weight vector w needs to be updated dynamically. The update utilizes the prior information of the current frame after the target is estimated. Firstly, a binary groundtruth map g is built as follows:

g(p) = 1 if p ∈ Ω,  g(p) = 0 otherwise,     (6)

where Ω is the pixel set indicating the estimated bounding box in the current frame. Together with g, the weights are updated through an optimization problem:

min_w ‖F w − g‖² + λ‖w‖²,     (7)

where F is the matrix whose columns are the vectorized candidate feature maps, g is the vectorized groundtruth map, and λ is a penalty parameter. The optimal solution to (7) is computed as

w* = (FᵀF + λI)⁻¹ Fᵀ g.     (8)

In order to avoid error accumulation, the weight update is only conducted when a certain evaluation criterion is met, instead of for every frame; the details can be found in the following section. For further clarification, a case study on the Girl dataset is carried out, as Figure 2 shows. The original images of frames 65 and 210 are extracted and the corresponding binary saliency maps are presented in the upper row. The relevance of two representative feature maps (intensity, skin color) to the final saliency map is investigated and plotted in the lower row. Since the relevance increases as the corresponding weight approaches 1, the relevance degree of an individual feature can be computed directly from its weight.
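Assuming the ridge-regularized reading of (7)–(8) above, the closed-form update is a single linear solve:

```python
import numpy as np

def update_weights(F, g, lam=0.05):
    """w* = (F^T F + lam I)^{-1} F^T g  (8): F stacks the 19 vectorised
    feature maps as columns, g is the vectorised binary groundtruth map,
    lam = 0.05 as reported in the experiments section."""
    k = F.shape[1]
    return np.linalg.solve(F.T @ F + lam * np.eye(k), F.T @ g)
```

Since only a 19×19 system is solved, the update adds negligible cost per frame.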

Fig. 2: Upper row: Images and generated saliency maps on Girl at frames 65 and 210. Lower row: Relevance degree variation of the intensity and skin color features.

From the study, it can be observed that the binary saliency map efficiently predicts the rough location of the target even when severe appearance changes occur. Combined with the relevance degree curves, we find that the skin color feature dominates the others before the girl turns her back. When she shows her hair towards the camera, the intensity feature becomes the main descriptor for saliency map construction. This switching is efficiently handled by the online weight update. It should be noted that although the saliency map cannot provide an accurate silhouette of the object, it is adequate as a coarse search result indicating the rough location. A fine search procedure is required to achieve robust and accurate tracking, as introduced in the local search section.

IV Integration Mechanism

The integration mechanism introduced in this paper serves as an evaluation process for the candidate particles generated by the global search. Due to possibly cluttered backgrounds or near-identical fake targets in the video sequence, the particles output by the global search may contain false target patches. Thus, an algorithm with strong local discriminative power is required to rank the particles. Considering the balance between accuracy and speed, the L2-RLS [7] tracker with PCA and square templates is employed with the following likelihood function:

p(y | x) ∝ exp(−γ ‖y − U c‖²),     (10)

where c denotes the coefficient of the observation y over the PCA dictionary U and γ is a tunable scalar. The confidence value of each candidate particle is calculated with (10); the particle with the maximum confidence is regarded as the most promising candidate patch enclosing the target.
To achieve the evaluation, we set two thresholds on the maximum confidence and consider three cases. (i) If the maximum confidence exceeds the higher threshold, the global search has already provided a sufficiently accurate location prediction; the local search is skipped and the corresponding patch is regarded as the target. (ii) If it falls between the two thresholds, the output is acceptable but not accurate, so a further local search is performed centered at the geometric center of the best patch. This case may occur because the saliency map, after binarization and connected area thresholding, may present only a certain salient part of the target, so there may be a displacement between the geometric center of the extracted connected area and the actual target center. (iii) If the maximum confidence falls below the lower threshold, the output of the global search is very likely to have drifted away due to a cluttered background or illumination variation. In this case, the local search is performed as usual, centered at the target location in the last frame.
Additionally, an extra threshold is needed as the judgement criterion for weight updating. If the maximum confidence is larger than this threshold, the current image condition is suitable for weight updating. Otherwise, the image is undergoing severe external interference, which may lead to drift of the entire tracker. To equip the proposed tracker with re-initialization power after drifting, the weight update is suspended in this case. When the external interference vanishes, the global search will be able to guide the tracker to capture the target again.
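The three-case decision and the update criterion can be summarized as follows (the threshold names tau_low, tau_high and tau_update are ours; the original symbols were lost in extraction):

```python
def integrate(max_conf, tau_low, tau_high, patch_center, last_center):
    """Decide where (or whether) local search runs, given the maximum
    confidence of the global-search candidates (tau_low < tau_high)."""
    if max_conf > tau_high:
        return None                # (i) accept the global result directly
    if max_conf > tau_low:
        return patch_center        # (ii) search around the salient patch
    return last_center             # (iii) fall back to last frame's centre

def may_update_weights(max_conf, tau_update):
    """Suspend weight updating under severe interference so the tracker
    keeps its re-initialization power."""
    return max_conf > tau_update
```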

V Local Search with Hierarchical Structure

V-A Superpixel Matching via HSV Histogram

Most current appearance template matching algorithms depend only on the target intensity for template building. In this way, color information is neglected and accuracy degrades, since color characteristics provide extra discriminative power to distinguish the target from the background. Compared with low-level pixel-wise cues, superpixels, as a mid-level cue, efficiently describe color while presenting structural information about the details of an object. Thus, in this paper, a novel superpixel matching is proposed as an enhancement to sole intensity template matching by investigating the color distribution of the target object. This method is summarized in Figure 3.

Fig. 3: Flowchart of superpixel HSV matching

The superpixel matching is still embedded into the L2 tracker mentioned in the last section to achieve collaborative tracking. The patches with the highest confidence values are passed to the superpixel matching process. There are two main reasons for employing this structure: (i) the generation of superpixel segments is time-consuming, so we accelerate the algorithm by reducing the search set; and (ii) it is challenging for the proposed superpixel matching method to handle severe occlusion and illumination changes. On the contrary, with the square template, the L2 tracker has excellent capability in coping with these problems. Thus, incorporating the L2 tracker compensates for the drawbacks of the superpixel matching approach.
Each image patch from the selected particles is first segmented into superpixels using SLIC [22]. The superpixel segments are then converted to a normalized HSV color space, and a five-bin histogram is created for each channel, with bins evenly allocated over the full scale of 0–1. Each superpixel's channel value contributes to a bin according to its distance to the bin center, modulated by a scalar chosen as 10, so that the histogram defined in (11) reflects the color distribution of the patch. The three channel histograms are subsequently concatenated into a single descriptor, as the figure shows. The generated HSV histogram is compared with a template histogram h_T, which is initialized from the target patch in the first frame and updated with the update law

h_T ← λ h_T + (1 − λ) h*,     (12)

where λ is a learning rate tuned as 0.95 in this paper.
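A sketch of the per-channel soft histogram and the template update follows; the Gaussian soft-assignment form is our assumption, as (11) only specifies five evenly spaced bins and a distance scalar of 10:

```python
import numpy as np

def soft_histogram(channel_values, n_bins=5, sigma=10.0):
    """Each superpixel's channel value (in [0, 1]) votes for every bin
    with a weight decaying in its distance to the bin centre."""
    centres = (np.arange(n_bins) + 0.5) / n_bins
    votes = np.exp(-sigma * (channel_values[:, None] - centres[None, :]) ** 2)
    return votes.sum(axis=0)

def update_template(template, best_hist, lam=0.95):
    """Update law (12): exponential forgetting with learning rate 0.95,
    where best_hist maximises the joint observation likelihood."""
    return lam * template + (1.0 - lam) * best_hist
```

The three 5-bin channel histograms are concatenated into a 15-dimensional descriptor before matching.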

Here h* denotes the optimal histogram, i.e. the one possessing the maximum joint observation likelihood, which will be introduced below. We then introduce a cosine similarity to measure the similarity between a candidate histogram h and the template histogram h_T:

sim(h, h_T) = (hᵀ h_T) / (‖h‖ ‖h_T‖).     (13)

The similarity is then transformed into an error pattern through a positive parameter. Following the idea of multi-objective optimization [23], the reconstruction error from the L2 tracker and the HSV histogram matching error are fused to produce the confidence of each particle: the two errors of each candidate are concatenated into an error vector, the ideal point is formed from the per-dimension minima over all candidates, and the joint observation likelihood of a state is computed from the deviation of its error vector from the ideal point, with two weighting parameters adjusting the ratio between the two cues. The top particles with maximum observation likelihood are reserved and proceed to the linear refinement search.
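One plausible reading of the ideal-point fusion is sketched below; the exact functional form of the fused likelihood is not recoverable from the extracted text, so a weighted deviation from the per-cue minima is assumed:

```python
import numpy as np

def joint_likelihood(recon_err, hist_err, alpha=1.0, beta=1.0):
    """Fuse L2-tracker reconstruction errors and HSV histogram matching
    errors: measure each candidate's weighted deviation from the ideal
    point (the per-dimension minimum over all candidates)."""
    ideal = np.array([recon_err.min(), hist_err.min()])
    dev = alpha * (recon_err - ideal[0]) + beta * (hist_err - ideal[1])
    return np.exp(-dev)

def top_candidates(likelihood, k):
    """Indices of the k particles with maximum joint likelihood, which
    proceed to the linear refinement search."""
    return np.argsort(likelihood)[::-1][:k]
```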

V-B Linear Refinement Search

Due to the inevitable inaccuracy of the appearance template, the imperfect dictionary update scheme and the drawbacks of superpixel histogram matching, the result computed by traditional MAP may not be reliable, since only the candidate with the maximum confidence value is selected. It is quite possible that the most accurate candidate has a lower confidence and thus cannot contribute to the final result. To ameliorate this, we propose a novel linear refinement approach that balances the result by also incorporating the other high-confidence candidates into the final coding.
In the refinement search phase, the particles with the highest confidence values are extracted to achieve a linear combination for the final estimation. These particles form a new candidate set Y. Then we solve the following optimization problem:

min_{e, c} ‖Y e − U c‖² + λ‖c‖²  s.t.  1ᵀ e = 1,     (15)

where e and c are the coefficients for the candidate set Y and the PCA dictionary U respectively. The intuition behind this approach is that we use a linear combination of the candidate patches, instead of a single one, to reconstruct the target and upgrade the accuracy. This idea is similar to [15]. The main differences are: (i) our method only considers the particles with high confidence values instead of all of them, in order to reduce the possibility of the final result being interfered with by "bad" particles; and (ii) since in practical implementation the number of selected particles is usually not larger than 10, the sparsity constraint on e can be eliminated in (15). By still adopting an ℓ2 norm constraint on c, an iterative analytical solution to the optimization problem can be found, hugely relieving the computational burden.

The coefficients e and c can be solved iteratively by fixing one and updating the other. Given a fixed e^(i), where i denotes the iteration index, solving for c becomes a ridge regression problem, which gives

c^(i+1) = (Uᵀ U + λ I)⁻¹ Uᵀ Y e^(i).     (16)

Then we fix c^(i+1), and the original problem becomes a linearly-constrained minimum Euclidean norm problem, which has an analytical solution [24]

e^(i+1) = G⁻¹ (Yᵀ U c^(i+1) − μ 1),  μ = (1ᵀ G⁻¹ Yᵀ U c^(i+1) − 1) / (1ᵀ G⁻¹ 1),

where G = Yᵀ Y and 1 denotes a vector with all ones. The iterative program terminates when a stopping criterion is satisfied and outputs e. The optimal coefficient is derived by setting the negative entries of e to zero, followed by a normalization. Finally, the estimated target is computed by linearly combining the particles based on the optimal coefficient.
The summary of the solver can be found in Algorithm 1.

Remark 2

The iterative operation for solving (15) is efficient, since both phases — the ridge step (16) and the constrained minimum-norm step — have analytical solutions. Moreover, both steps can be rewritten in terms of matrices that depend only on the candidate set Y and the dictionary U. Throughout the solving process, Y and U are fixed, which means these matrices remain unchanged and can be computed before the calculation loop. Therefore, the computational complexity can be dramatically reduced.

1:  Initialization: Initialize the coefficient vector with the corresponding confidence values of the selected particles and normalize it so that its entries sum to one. Precompute the fixed matrices described in Remark 2 using the candidate set (the particles with the highest confidence values) and the appearance template.
2:  while neither convergence nor the maximum iteration number is reached do
7:     i = i + 1
8:  end while
9:  Set the negative entries of the coefficient vector to zero and normalize to produce the optimal coefficient
Algorithm 1 Iterative solver for (15)
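Under one reading of (15)–(16) and the constrained e-step (sum-to-one constraint on e, ridge penalty on c — our reconstruction, since the extracted equations are garbled), Algorithm 1 can be sketched as:

```python
import numpy as np

def refine(Y, U, e_init, lam=0.005, max_iter=50, tol=1e-8):
    """Alternating solver sketch for (15).  Y (d x K): the most promising
    particles as columns; U (d x m): PCA dictionary; e_init: confidences
    normalised to sum to one.  lam = 0.005 as in the experiments section."""
    K = Y.shape[1]
    ones = np.ones(K)
    # Matrices fixed throughout the loop (cf. Remark 2): precompute once.
    A = np.linalg.solve(U.T @ U + lam * np.eye(U.shape[1]), U.T)
    G = Y.T @ Y
    e = e_init.copy()
    for _ in range(max_iter):
        c = A @ (Y @ e)                        # ridge step (16)
        b = np.linalg.solve(G, Y.T @ (U @ c))  # G^{-1} Y^T U c
        g1 = np.linalg.solve(G, ones)          # G^{-1} 1
        mu = (ones @ b - 1.0) / (ones @ g1)    # Lagrange multiplier
        e_new = b - mu * g1                    # constrained minimum-norm step
        if np.linalg.norm(e_new - e) < tol:
            e = e_new
            break
        e = e_new
    e = np.maximum(e, 0.0)                     # vanish negative entries
    return e / e.sum()                         # renormalise
```

The final estimate is then the e-weighted combination of the candidate states.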

A case study on the DragonBaby video sequence is conducted to evaluate the performance of the refinement approach. The result is shown in Figure 4. From (a), we can see that the bounding box of the refined result (red box) shows a higher overlap rate with the real target than the one with the highest confidence (blue box). The quantitative study also supports this observation (0.6964 vs 0.6744). The results of all five candidates and the corresponding overlap rates are presented in (b). It should be noted that the candidate with the second highest confidence value possesses the largest overlap rate. Although the overlap rate of the refined result is not competitive with the best candidates, it improves upon the candidate that would be regarded as the outcome under MAP. Besides, the weight distribution of the five candidates is shown in (c): the candidates with high overlap rates also receive high weights, indicating that they contribute much to the final result and lead to improved accuracy. However, due to the imperfect template and other disturbances, the candidate with the lowest overlap rate is also assigned a relatively high weight. This phenomenon is difficult to avoid, since we lack sufficiently accurate prior knowledge to judge a candidate before the refinement and get rid of these "bad" candidates. But by involving the refinement approach, the output is achieved in a balanced way and the effect of "bad" candidates is attenuated. A study measuring the improvement of the proposed refinement approach over traditional MAP is carried out on the entire sequence, and the result is plotted in (d). We ignore the frames that skip local search according to the integration mechanism. The figure plots the overlap rate of the refined result minus that of the MAP result. It can be observed that although the refinement approach degrades the accuracy for a minority of frames, for most frames it levels up the performance, and the maximum improvement can be as large as 0.15.

Fig. 4: Case study on frame 11 of DragonBaby. (a) The bounding boxes of the refined result and of the candidate with the highest confidence. (b) Bounding boxes of all the candidates. (c) Coefficients of the five candidates. (d) Overlap rate improvement of the refined result over MAP throughout the sequence.

The proposed algorithm in this paper can be summarized by Algorithm 2.

1:  Initialization: Initialize the weight vector, the PCA and square templates, the HSV histogram template and the other coefficients.
2:  Input: Current frame
3:  Output: Estimated target location
4:  Start: Resize the image and generate the 19 low-level feature maps.
5:  Produce the saliency map through (4), target location penalization and binarization.
6:  Find the centers of connected regions whose areas are larger than the threshold.
7:  Crop the candidate particles with the same size and orientation as the bounding box in the last frame.
8:  Calculate the corresponding confidence of each candidate using (10) and pick the maximum one.
9:  if the maximum confidence exceeds the higher threshold then
10:     Skip local search and take the corresponding patch as the estimated target
11:  else if the maximum confidence exceeds the lower threshold then
12:     Perform local search by first sampling at the geometric center of the best patch.
13:     Apply the tracker described in [7] and extract the particles with the highest confidence values.
14:     Generate the HSV histogram for each patch using (11) and concatenate the channels.
15:     Compute the cosine similarity (13) and the corresponding matching error.
16:     Calculate the joint observation likelihood and proceed to linear refinement search with the patches that have the highest confidence values.
17:     Solve the optimization problem (15) with Algorithm 1 to derive the combination coefficients.
18:     Obtain the refined result by linearly combining the candidates with the derived coefficients
19:  else
20:     Perform the local search using particles sampled at the center of the estimated target location in the last frame.
21:  end if
22:  Update the dictionary as depicted in [7] and update the HSV template using (12).
23:  if the weight update criterion is met then
24:     Update the weight vector with the groundtruth map using (8).
25:  end if
Algorithm 2 The summary of the proposed SHT tracker

VI Experiments

The proposed SHT algorithm is implemented in MATLAB and run on an Intel Core i7-4710HQ 2.5 GHz PC with 16 GB memory. The running speed is around 1.5 FPS without any code optimization. The number of particles passed to superpixel matching is 70. The regularization parameters in the update law (7) and the refinement search (15) are selected as 0.05 and 0.005 respectively. The two thresholds governing the availability of saliency-guided candidate patches need to be tuned for each dataset; empirically, they lie in the ranges [0.2, 0.45] and [0.4, 0.8]. The template size in the tracker remains unchanged, and the number of PCA templates is 16. 600 particles are sampled for each frame. The test is conducted both qualitatively and quantitatively on ten challenging video sequences, and the results are compared with eleven state-of-the-art trackers: L1APG [4], IVT [25], Frag [26], TLD [27], CSK [28], L2-RLS [7], ASLA [29], SCM [5], CXT [30], LOT [14] and DFT [31].

VI-A Component Validation

This section demonstrates the effectiveness of the three main components of SHT: saliency guided global search, superpixel matching, and linear refinement search. Figure 5 shows a case study on the DragonBaby dataset for the global search. Four frames with severely abrupt motion and an incomplete target are shown in the first row. The second row presents the final binary saliency map, which roughly captures the location of the target. Although a false saliency area remains after binarization and connected-area thresholding on the second map, the integration mechanism helps discard it and guides the tracker to the correct target, as shown in Figure 8(b). The third row shows the weight allocation of the 19 feature maps in the four frames. Note that the last feature, which describes skin color, is always assigned the highest weight. This observation is consistent with the fact that a human head is being tracked. The dominant role of this feature is further evidenced by the histogram of accumulated feature weights in Figure 6, which represents the normalized summation of the absolute value of each weight. The yellow-color feature map receives the second-highest accumulated weight. One reason is straightforward: yellow is the color closest to human skin. However, the weight-allocation plot in Figure 5 shows that the sign of the yellow-color weight is mostly negative (first three frames), which counteracts false saliency regions of similar yellow color, as shown in the last row of the figure. These false regions are caused by background objects with a similar color channel, such as the leaves. In this way, together with binarization and connected-area thresholding, a pure saliency map can be achieved. Finally, the weight-updating curves in Figure 6 show that the weights are updated every frame in this case, because no severe occlusion occurs and every frame is suitable for updating.
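As a rough sketch of this global-search step, the weighted combination of feature maps followed by binarization and connected-area thresholding might look as follows. This is a minimal illustration in NumPy/SciPy; `bin_thresh` and `min_area` are illustrative parameters, not the paper's tuned values, and the actual online weight learning is omitted:

```python
import numpy as np
from scipy import ndimage

def saliency_map(feature_maps, weights, bin_thresh=0.5, min_area=50):
    """Combine per-pixel feature maps with signed weights into a binary
    saliency map, then drop small connected components.

    feature_maps: (K, H, W) array of K feature maps scaled to [0, 1].
    weights: (K,) learned weights; negative weights suppress regions
             (e.g. background objects sharing a color channel).
    """
    combined = np.tensordot(weights, feature_maps, axes=1)   # (H, W)
    # Normalize to [0, 1] before thresholding.
    combined = (combined - combined.min()) / (np.ptp(combined) + 1e-12)
    binary = combined > bin_thresh
    # Connected-area thresholding: keep only sufficiently large blobs.
    labels, n = ndimage.label(binary)
    for i in range(1, n + 1):
        if (labels == i).sum() < min_area:
            binary[labels == i] = False
    return binary
```

The candidate target locations would then be the centroids of the surviving connected regions.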

Fig. 5: Case study on DragonBaby for saliency guided search. First row shows the original images of frames 42, 44, 46, and 51. Second row presents the binary saliency maps. Third row shows the corresponding weights of the 19 feature maps. Last row shows the feature map of the skin-color channel.
Fig. 6: Histogram of accumulated feature weights and the weight updating of three most salient ones.
Fig. 7: Tracking results of individual component. Blue and yellow bins denote the overlap rate and center error for each data set. L2-RLS, NSGS, NSM, NLRS and SHT mean the sole appearance matching in [7], No Saliency Guided Search, No Superpixel Matching, No Linear Refinement Search and the overall proposed tracker.

For a thorough study, additional investigation is performed on five datasets with one component disabled at a time. The results are shown in Figure 7. In the NSGS study, the saliency module and the integration mechanism are blocked, and particle sampling is performed centered at the target location from the last frame. For NLRS, the target is predicted with the traditional MAP estimate that maximizes the observation likelihood (V-A). Overall, the proposed SHT algorithm outperforms the sole L2 tracker in challenging scenarios such as out-of-plane rotation (Bird2, Girl), deformation (Dog, MountainBike), and abrupt motion (DragonBaby). Thanks to the global search ability of the saliency guided approach, an obvious performance gain is achieved between NSGS and SHT on datasets where the target appearance is discriminative from the background (Bird2, DragonBaby). Although, owing to the weight-updating design, a pure and usable saliency map can be obtained for Girl, the accuracy does not improve much there: the face motion of the girl is very slow, so a proper standard deviation in particle sampling can sufficiently cover the range of the motion. As for superpixel matching, it is especially effective for cases with a consistent color distribution. For example, the target in Dog undergoes severe deformation when the dog runs toward the woman, but its color composition (black and white) remains unchanged, which suits superpixel matching very well. The same holds for Bird2 and MountainBike. Moreover, even if the color composition of the target changes, the template-updating mechanism can still guarantee the robustness of the tracker, as demonstrated in Girl. From NLRS, we can see that the refinement approach improves accuracy only slightly, since essentially it does not introduce any new tracking cue. But given its ability to attenuate the influence of "bad" candidates and its low computational cost, including the refinement as an auxiliary procedure is beneficial for tracking.
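The superpixel matching step compares HSV color distributions of candidate patches. A minimal sketch is shown below, assuming a joint HSV histogram per superpixel and Bhattacharyya similarity; the paper's exact binning and similarity measure may differ:

```python
import numpy as np

def hsv_histogram(pixels_hsv, bins=(8, 8, 4)):
    """Normalized joint HSV histogram of a superpixel's pixels.
    pixels_hsv: (N, 3) array with H, S, V each scaled to [0, 1]."""
    hist, _ = np.histogramdd(pixels_hsv, bins=bins,
                             range=((0, 1), (0, 1), (0, 1)))
    return hist.ravel() / max(hist.sum(), 1)

def bhattacharyya(h1, h2):
    """Similarity in [0, 1] between two normalized histograms."""
    return float(np.sum(np.sqrt(h1 * h2)))
```

In practice the pixels of each SLIC superpixel would be gathered, histogrammed, and matched against the template's superpixel histograms; a high similarity indicates a consistent color composition even under deformation.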

Cost time (s) 0.15 0.40 0.001 0.12 0.67
SGS: Saliency Guided Search. SM: Superpixel Matching. LRS: Linear Refinement Search.
TABLE I: Running time of individual component

The time consumption of each component is listed in Table I. As the table shows, superpixel matching is the most time-consuming part of SHT because of the SLIC process. The linear refinement costs almost no time, since an analytical solution can be found for each iteration of the solver. Finally, the feature extraction in SGS is not fast but acceptable.
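The paper's fast solver for the refinement search (15) is not reproduced here, but the general pattern, an iterative scheme in which every iteration reduces to an L2-regularized least squares problem with a closed-form solution, can be sketched as follows. The residual-based reweighting below is illustrative, not the paper's exact update rule:

```python
import numpy as np

def ridge_step(A, y, lam):
    """Closed-form solution of min_w ||A w - y||^2 + lam * ||w||^2."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)

def refine(A, y, lam=0.005, iters=5):
    """Toy iteratively reweighted scheme: down-weight rows with large
    residuals ("bad" candidates), then re-solve the ridge problem
    analytically each iteration -- so no inner numerical optimizer is
    needed and the cost per iteration is one small linear solve."""
    w = ridge_step(A, y, lam)
    for _ in range(iters):
        r = np.abs(y - A @ w)
        s = 1.0 / (1.0 + r)            # weights shrink for large residuals
        w = ridge_step(A * s[:, None], y * s, lam)
    return w
```

Because each iteration is a single small linear solve, the cost of such a refinement stage is negligible next to SLIC, which is consistent with the timings in Table I.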

VI-B Qualitative Evaluation

(a) Selected frames of tracking results for Sequences Birds, Boy, Caviar and Dog.
(b) Selected frames of tracking results for Sequences DroganBaby, Girl, Jogging, MountainBike, Singer1 and Walking2.
Fig. 8: Tracking result screenshots of seven trackers. Sequences Caviar, Girl, Jogging, and Walking2 contain heavy occlusions; Boy and DragonBaby contain fast motion and motion blur; MountainBike and Singer1 contain background clutter and drastic illumination variation; Bird2, DragonBaby, Girl, and MountainBike contain in-plane or out-of-plane rotation; Dog, Jogging, and Bird2 contain non-rigid object deformation.

Qualitative investigation is conducted on ten challenging video sequences against six state-of-the-art trackers. The results are shown in Figure 8 and analyzed below.
Heavy occlusion: Four representative datasets with partial or full occlusion are evaluated. The occlusion-handling capability of SHT is mainly inherited from the square template embedded in the L2 tracker. The ratio-combination observation likelihood (V-A) ensures that possible drift caused by the superpixel matching will not affect performance under occlusion. In Girl, when the girl's face is partially blocked by the man in the last figure, only the proposed tracker, L2-RLS with its square template, and SCM and ASLA with part-based templates can cope. Although L1APG also employs trivial templates and an update scheme based on occlusion detection, its raw-pixel dictionary makes the tracker susceptible to cases where the target is blocked by similar objects, as shown in the Caviar sequence.
Abrupt motion and motion blur: It is usually very challenging for a particle-filter-based tracker to deal with fast-moving targets because of the trade-off between the number of samples and the sampling radius. The situation is further complicated because motion blur typically accompanies fast motion. The saliency guided global search handles this problem favorably because it focuses only on salient regions of the whole image and provides a predicted location. Moreover, although motion blur challenges template matching, the target's saliency features remain the same, so the saliency search is very robust to motion blur as well. The tracking results on DragonBaby are compared in Figure 8(b). The target undergoes fast motion in the first three figures, corresponding to frames 42, 44, and 46; it can be observed that only SHT is able to complete the tracking. In the last figure, the baby turns his head back, but its color characteristics keep it discriminative from the background, and the saliency search successfully guides the tracking. In Boy, together with the refinement procedure, the proposed method provides more accurate results than the sole L2 tracker, as shown in the second and fourth figures. Apart from SHT, CXT also outperforms the other trackers due to its nature as a PN tracker equipped with a re-initialization mechanism; its use of context information also contributes to its strong performance.
Background clutter and illumination variation: These two situations are considered together because both degrade the contrast between foreground and background. In MountainBike, the biker rides across a gap with varying postures, and the rocks and bushes inside the gap form a cluttered background that is very challenging for the saliency guided search, since many false saliency regions may arise. Thanks to the integration mechanism, the false targets are always filtered out and the tracker captures the biker throughout the sequence. In contrast, the L2 tracker drifts to the rocks after the biker lands, because its greyscale holistic appearance template is not sufficient to discriminate the target from the background. The superpixel matching in SHT handles this by examining the color distribution of each particle. In Singer1, the target undergoes drastic illumination changes under the stage light, as shown in Figure 8(b). All the sparse-representation-based trackers (L1APG, SCM, ASLA, L2, SHT) perform favorably due to their inherent robustness against illumination variation. CXT, which depends on exploring context elements, is very sensitive to illumination variation and cannot maintain the tracking when the lighting returns to normal.
Rotation: Four video sequences (Bird2, DragonBaby, Girl, MountainBike) with in-plane and out-of-plane rotation are tested. In Bird2, most trackers cannot handle the mirror rotation when the bird walks back, especially since the bird deforms at the same time. As analyzed in the last subsection, the color composition of the bird remains the same under this symmetric rotation, so the superpixel matching can facilitate the task and rectify the incorrect output of the L2 tracker. In Girl, the target is subject to complicated rotation, appearance changes, and scale variation; only SHT can adapt to all these challenges simultaneously.
Non-rigid object deformation: Non-rigid deformation occurs when humans or animals move and their appearance varies. In the Dog dataset, superpixel matching helps improve performance, as mentioned in the component-validation section. Note that the shadow created by the woman is an interference factor, especially when the dog shakes its tail under her; some trackers tend to enlarge the bounding box to enclose the shadow region due to a lack of robustness. In Jogging, the deformation of the woman happens gently, so the proposed tracker has no problem handling it. Most trackers lose the target when it is blocked by the pole at the beginning; only L1APG and SHT complete the tracking.

VI-C Quantitative Comparison

Bird2 0.20 0.49 0.43 0.27 0.58 0.49 0.50 0.46 0.24 0.09 0.61 0.72
Boy 0.51 0.25 0.45 0.62 0.66 0.78 0.37 0.38 0.80 0.54 0.40 0.80
Caviar 0.38 0.15 0.26 0.20 0.14 0.86 0.15 0.87 0.13 0.21 0.10 0.86
Dog 0.38 0.14 0.34 0.32 0.36 0.45 0.42 0.56 0.57 0.47 0.32 0.63
DragB 0.24 0.25 0.40 0.07 0.20 0.49 0.23 0.19 0.37 0.52 0.14 0.66
Girl 0.42 0.17 0.47 0.49 0.38 0.47 0.65 0.27 0.61 0.43 0.29 0.68
Jogging 0.83 0.18 0.46 0.72 0.15 0.44 0.14 0.14 0.13 0.65 0.22 0.78
MounB 0.62 0.74 0.13 0.21 0.71 0.60 0.74 0.62 0.23 0.58 0.30 0.66
Singer1 0.82 0.57 0.25 0.75 0.36 0.79 0.80 0.87 0.45 0.19 0.36 0.83
Walking2 0.82 0.74 0.28 0.42 0.47 0.76 0.35 0.78 0.37 0.34 0.41 0.79
Avg 0.52 0.37 0.35 0.41 0.40 0.61 0.44 0.51 0.39 0.40 0.32 0.74
Note: The best three results are highlighted in red, blue and green.
TABLE II: Average overlap rate
Bird2 80.4 28.4 28.1 221.6 17.9 29.7 19.6 29.4 43.5 109.0 46.4 10.0
Boy 17.8 91.8 33.9 4.0 20.1 2.9 89.6 53.1 2.0 64.7 106.3 2.7
Caviar 21.7 64.1 27.2 43.6 71.4 2.7 62.1 2.5 73.6 43.1 105.1 2.4
Dog 11.3 93.7 12.1 51.5 6.9 11.6 9.5 6.1 5.9 10.1 15.8 5.6
DragB 129.7 92.8 46.3 213.1 87.9 32.2 65.7 62.8 89.1 26.3 75.5 8.6
Girl 22.1 20.8 20.1 10.9 19.3 19.4 4.2 63.9 6.2 22.9 23.9 4.4
Jogging 2.9 89.3 27.5 7.7 164.7 37.0 175.0 141.5 120.4 14.3 33.4 3.5
MounB 24.8 7.3 208.7 208.5 6.5 18.4 7.8 11.7 178.7 24.9 155.0 10.9
Singer1 3.2 11.7 77.0 8.3 14.0 4.1 3.2 2.9 11.9 141.1 18.7 3.5
Walking2 3.3 3.1 57.3 24.0 17.9 2.6 38.1 1.9 33.0 64.6 29.0 2.6
Avg 31.7 50.3 53.8 79.3 42.7 16.1 47.5 37.6 56.4 52.1 60.9 5.5
Note: The best three results are highlighted in red, blue and green.
TABLE III: Average tracking error
Fig. 9: Success plots of OPE. Seven trackers are tested on the ten datasets.

For a thorough evaluation, the proposed SHT tracker is compared with eleven other algorithms in terms of overlap rate and average center error, as presented in Tables II and III, where the top three results are highlighted in red, blue, and green. The template size of L1APG is changed from the default to for better performance, at the cost of longer computation time. It can be observed that SHT performs superiorly to the other state-of-the-art algorithms. Specifically, an obvious performance improvement is achieved over the L2-RLS tracker: although L2-RLS already performs favorably on datasets such as Boy and Caviar, as stated in its paper, the proposed tracker further enhances the accuracy, and the same holds for other video sequences that the L2 algorithm tracks successfully. The performance could be improved further if an embedded tracker with better robustness and accuracy were employed in place of L2, though the balance between accuracy and speed should also be considered. Moreover, unlike the saliency guided global search and the superpixel HSV matching, which need all three (RGB) channels of the image, the linear refinement can be applied to any particle-filter-based tracker because it relies only on the greyscale image.
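For reference, the two evaluation metrics used in Tables II and III are standard; with bounding boxes given as (x, y, w, h), they can be computed as:

```python
import numpy as np

def overlap_rate(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def center_error(a, b):
    """Euclidean distance between box centers, in pixels."""
    ca = (a[0] + a[2] / 2.0, a[1] + a[3] / 2.0)
    cb = (b[0] + b[2] / 2.0, b[1] + b[3] / 2.0)
    return float(np.hypot(ca[0] - cb[0], ca[1] - cb[1]))
```

Averaging these per-frame values over a sequence yields the entries of Tables II and III, respectively.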
We also run the One-Pass Evaluation (OPE) of the success rate for seven trackers. Following [24], the success rate can be defined as

$$\mathrm{SR}(\tau) = \frac{\#\{t \mid o_t > \tau\}}{T},$$

where $\tau$ is a threshold ranging from 0 to 1 with an interval of 0.1, $\#(\cdot)$ returns the number of elements in a set, $o_t$ denotes the overlap rate of the $t$-th frame, and $T$ is the total number of frames. The success-rate curves are plotted in Figure 9. The results reveal that the SHT algorithm always maintains a remarkable success rate across the threshold values.
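The success-rate curve of a success plot is straightforward to compute from per-frame overlap rates; a minimal sketch:

```python
import numpy as np

def success_rate(overlaps, thresholds=np.arange(0.0, 1.01, 0.1)):
    """Fraction of frames whose overlap exceeds each threshold,
    matching the success-plot definition used for OPE."""
    overlaps = np.asarray(overlaps, dtype=float)
    return np.array([(overlaps > t).mean() for t in thresholds])
```

Plotting the returned values against the thresholds reproduces one curve of Figure 9; the area under this curve is often used to rank trackers.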

VII Conclusion

In this paper, a saliency guided robust visual tracking algorithm with a hierarchical structure has been proposed. In the global search, nineteen feature maps are combined with pre-trained weights to construct the saliency map, and possible target locations are determined by searching the connected regions of this map. A novel integration mechanism filters out false targets and passes the estimated result to the local search. A superpixel based HSV histogram matching scheme is incorporated into an L2 tracker so that color-distribution matching is involved in the local search. Finally, a linear refinement approach with a customized fast solver is introduced to further rectify the result using selected promising candidates. In the experiments, qualitative and quantitative comparisons with eleven state-of-the-art algorithms on ten challenging video sequences demonstrate the superiority of the proposed tracker.


  • [1] K. Lu, Z. Ding, and S. Ge, “Locally connected graph for visual tracking,” Neurocomputing, vol. 120, pp. 45–53, 2013.
  • [2] X. Mei and H. Ling, “Robust visual tracking using ℓ1 minimization,” in 2009 IEEE 12th International Conference on Computer Vision, pp. 1436–1443, IEEE, 2009.
  • [3] X. Mei, H. Ling, Y. Wu, E. Blasch, and L. Bai, “Minimum error bounded efficient l1 tracker with occlusion detection (preprint),” tech. rep., DTIC Document, 2011.
  • [4] C. Bao, Y. Wu, H. Ling, and H. Ji, “Real time robust l1 tracker using accelerated proximal gradient approach,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 1830–1837, IEEE, 2012.
  • [5] W. Zhong, H. Lu, and M.-H. Yang, “Robust object tracking via sparse collaborative appearance model,” IEEE Transactions on Image Processing, vol. 23, no. 5, pp. 2356–2368, 2014.
  • [6] K. Zhang, L. Zhang, and M.-H. Yang, “Fast compressive tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 10, pp. 2002–2015, 2014.
  • [7] Z. Xiao, H. Lu, and D. Wang, “L2-rls-based object tracking,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 24, no. 8, pp. 1301–1309, 2014.
  • [8] J. Yang and M.-H. Yang, “Top-down visual saliency via joint crf and dictionary learning,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 2296–2303, IEEE, 2012.
  • [9] W. Li, P. Wang, and H. Qiao, “Top–down visual attention integrated particle filter for robust object tracking,” Signal Processing: Image Communication, vol. 43, pp. 28–41, 2016.
  • [10] Y. Su, Q. Zhao, L. Zhao, and D. Gu, “Abrupt motion tracking using a visual saliency embedded particle filter,” Pattern Recognition, vol. 47, no. 5, pp. 1826–1834, 2014.
  • [11] S. Frintrop, VOCUS: A visual attention system for object detection and goal-directed search, vol. 3899. Springer, 2006.
  • [12] L. Itti and C. Koch, “A saliency-based search mechanism for overt and covert shifts of visual attention,” Vision research, vol. 40, no. 10, pp. 1489–1506, 2000.
  • [13] X. Li, Z. Han, L. Wang, and H. Lu, “Visual tracking via random walks on graph model,” 2015.
  • [14] S. Oron, A. Bar-Hillel, D. Levi, and S. Avidan, “Locally orderless tracking,” International Journal of Computer Vision, vol. 111, no. 2, pp. 213–228, 2015.
  • [15] G. Wang, X. Qin, F. Zhong, Y. Liu, H. Li, Q. Peng, and M.-H. Yang, “Visual tracking via sparse and local linear coding,” Image Processing, IEEE Transactions on, vol. 24, no. 11, pp. 3796–3809, 2015.
  • [16] F. Yang, H. Lu, and M.-H. Yang, “Robust superpixel tracking,” IEEE Transactions on Image Processing, vol. 23, no. 4, pp. 1639–1651, 2014.
  • [17] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality-constrained linear coding for image classification,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 3360–3367, IEEE, 2010.
  • [18] T. Judd, K. Ehinger, F. Durand, and A. Torralba, “Learning to predict where humans look,” in Computer Vision, 2009 IEEE 12th international conference on, pp. 2106–2113, IEEE, 2009.
  • [19] E. P. Simoncelli and W. T. Freeman, “The steerable pyramid: A flexible architecture for multi-scale derivative computation,” in Proceedings of the IEEE International Conference on Image Processing (ICIP), vol. 3, pp. 444–447, IEEE, 1995.
  • [20] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 11, pp. 1254–1259, 1998.
  • [21] R. M. Haralick and L. G. Shapiro, Computer and Robot Vision. Addison-Wesley Longman Publishing Co., Inc., 1991.
  • [22] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk, “Slic superpixels compared to state-of-the-art superpixel methods,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 34, no. 11, pp. 2274–2282, 2012.
  • [23] K. Miettinen, Nonlinear multiobjective optimization, vol. 12. Springer Science & Business Media, 2012.
  • [24] H. Liu, M. Yuan, F. Sun, and J. Zhang, “Spatial neighborhood-constrained linear coding for visual object tracking,” Industrial Informatics, IEEE Transactions on, vol. 10, no. 1, pp. 469–480, 2014.
  • [25] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, “Incremental learning for robust visual tracking,” International Journal of Computer Vision, vol. 77, no. 1-3, pp. 125–141, 2008.
  • [26] A. Adam, E. Rivlin, and I. Shimshoni, “Robust fragments-based tracking using the integral histogram,” in Computer vision and pattern recognition, 2006 IEEE Computer Society Conference on, vol. 1, pp. 798–805, IEEE, 2006.
  • [27] Z. Kalal, J. Matas, and K. Mikolajczyk, “Pn learning: Bootstrapping binary classifiers by structural constraints,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 49–56, IEEE, 2010.
  • [28] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “Exploiting the circulant structure of tracking-by-detection with kernels,” in Computer Vision–ECCV 2012, pp. 702–715, Springer, 2012.
  • [29] X. Jia, H. Lu, and M.-H. Yang, “Visual tracking via adaptive structural local sparse appearance model,” in Computer vision and pattern recognition (CVPR), 2012 IEEE Conference on, pp. 1822–1829, IEEE, 2012.
  • [30] T. B. Dinh, N. Vo, and G. Medioni, “Context tracker: Exploring supporters and distracters in unconstrained environments,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1177–1184, IEEE, 2011.
  • [31] L. Sevilla-Lara and E. Learned-Miller, “Distribution fields for tracking,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 1910–1917, IEEE, 2012.