Visual tracking is a fundamental and active research topic in computer vision due to its wide range of applications, such as activity analysis, visual surveillance and self-driving systems. Although significant progress has been made in recent years, it remains a challenging problem, partly due to the difficulty of constructing robust object representations that cope with various factors including camera motion, partial occlusion, background clutter and illumination change.
Numerous recent visual tracking methods adopt the tracking-by-detection paradigm, i.e., separating the foreground object from the background over time using a classifier. These methods usually localize the object using a bounding box, and draw positive (negative) samples from inside (outside) the bounding box for the classifier update. Since the ground-truth object labeling is only available in the initial frame, incrementally updating the object classifier in subsequent frames often results in model drift due to the inclusion of outlier samples.
Significant efforts have been made to alleviate the effects of outlier samples in visual tracking [1, 2, 3, 4, 5, 6, 7, 8, 9]. Several methods [1, 3, 4] update the object classifiers by considering the distances of samples to the bounding box center, e.g., samples close to the center receive higher weights. Other methods [10, 11] segment foreground objects from the background during tracking to exclude background clutter. However, these methods are limited in dealing with cluttered backgrounds (e.g., the segmented object masks can be unreliable). To improve the robustness, Kim et al. define an image-patch-based 8-neighbor graph to represent the tracked object, in which two nodes are connected by an edge if they are 8-neighbors and the edge weight is computed based on the low-level feature distance. There are two main issues with this approach: i) it only considers the spatial neighbors and does not capture the intrinsic relationship between patches; ii) it uses low-level features, which are less effective in the presence of clutter and noise.
To handle these issues, we learn a robust object representation for visual tracking. Given a bounding box of the target object, we partition it into non-overlapping local patches, which are described by color and gradient histograms. Instead of using the static structures of existing methods [12, 7], we learn a dynamic graph with patches as nodes (i.e., with an adaptive structure and node weights for each frame) to represent the target object, where the weight of each node describes how likely it is to belong to the target object, and the edge weight indicates the appearance compatibility of two neighboring patches. Existing methods usually compute node weights in two steps, i.e., first constructing the graph with a static structure and low-level features, and then computing node weights with a semi-supervised method [12, 7]. In this work, we propose a novel representation model that jointly learns the graph structure, edge weights and node weights.
With the advances of compressed sensing, numerous methods exploiting the relationship of data representations have been proposed [14, 15, 16, 17]. The representations are generally utilized to define [14, 15, 16] or learn the affinity matrix of a graph. Motivated by these methods, we represent each patch descriptor as a linear combination of other patch descriptors, and develop a model to jointly optimize the graph structure, edge weights and node weights while suppressing the effects of noise from clutter and low-level features.
It is worth mentioning that our model has the following three distinctive properties: 1) it is capable of collaboratively optimizing the graph structure, edge weights and node weights according to the underlying intrinsic relationship, which provides a flexible solution for visual tracking and other vision problems such as saliency detection and semi-supervised object segmentation; 2) it effectively suppresses the effects of noise from pixels and low-level features in computing the affinity matrix of the graph; 3) it is generic, and can incorporate other constraints (e.g., low-rank and sparse constraints) to further improve the robustness of graph learning.
To improve the tracking efficiency, we develop an alternating direction method of multipliers (ADMM) algorithm to seek the solution of the proposed model. In particular, the alternating direction method is used to linearize the quadratic penalty term, which avoids an auxiliary variable and some matrix inversions so that each subproblem can be solved efficiently in closed form. We construct the robust object representations by combining patch features with the optimized weights, and then apply the structured support vector machine (SVM) for object tracking and model update.
In each frame, the proposed algorithm is carried out in several steps. First, the graph is initialized with binary weights according to the ground truth (first frame) or the predicted bounding box (subsequent frames). Second, the graph is optimized by a linearized ADMM algorithm. Third, the object feature representation is updated by imposing the patch weights on the extracted image features. Finally, the object location is predicted by the structured SVM.
We make three major contributions to visual tracking and related applications in this work:
We propose an effective approach to alleviate the effects of background clutter in visual tracking. Extensive experiments show that the proposed method outperforms most state-of-the-art trackers on four benchmark datasets.
We present a novel representation model to learn a dynamic graph according to the intrinsic relationship among image patches. The proposed model jointly optimizes the graph structure, edge weights and node weights while suppressing the effects of patch noise and/or corruption. It also provides a general solution for visual tracking and other vision problems such as saliency detection [12, 20] and interactive object segmentation [18, 21].
We develop an ADMM algorithm to efficiently solve the associated optimization problem. Empirically, the proposed optimization algorithm exhibits stable convergence behavior on real image data.
This paper provides a more complete understanding of the early results, with more background, insights, analysis, and evaluation. In particular, our approach advances the early work in several aspects. First, we utilize the data representations to learn a more meaningful graph affinity, instead of directly using the data representations. Second, we generalize the graph learning algorithm to incorporate different constraints (or priors), such as the low-rank, sparse and spatial smoothness constraints. We further discuss their merits for graph learning, and instantiate the sparse constraints in our framework. Third, scale estimation is considered in this work to improve visual tracking. Finally, we carry out extensive experiments on large-scale benchmark datasets to demonstrate the effectiveness of the proposed algorithm, including quantitative comparisons with the state-of-the-art trackers and ablation studies.
2 Literature Review
A plethora of visual tracking methods have been proposed in the literature [23, 24], and we discuss the most related work in this section. We discuss the advances of visual tracking in two aspects: constructing robust appearance models to alleviate the effect of background clutter, and learning data affinity for model construction.
2.1 Appearance Models
Various tracking methods have been proposed to improve the robustness to nuisance factors including label ambiguity, background clutter, corruption and occlusion. Grabner et al. present a tracking approach that adapts to drastic appearance changes and limits the drifting problem. The knowledge from labeled data is used to construct a static prior for the online classifier, while unlabeled samples are explored in a principled manner during tracking. Babenko et al. use a bag of multiple samples, instead of a single sample, to update the classifier reliably. To avoid label ambiguity, Hare et al. exploit structured samples instead of binary-labeled samples when training the classifier in the structured SVM framework.
To alleviate the effects of background clutter, one representative approach is to assign weights to different pixels or patches in the bounding box. Comaniciu et al. develop a kernel-based method that assigns smaller weights to boundary pixels for histogram matching. He et al. also assume that pixels far from the box center should be less important. These methods do not perform well when the target shape cannot be well described by a rectangle or when the target is occluded. Some methods [10, 11] integrate segmentation results into visual tracking to alleviate the effects of background. These algorithms, however, rely heavily on the quality of the segmentation results. Kim et al. develop a random walk with restart algorithm on an 8-neighbor graph to compute patch weights within the target object bounding box. Nevertheless, the constructed graph does not capture the relationship between patches well.
2.2 Data Affinity
In vision and learning problems, we often have a set of data drawn from a union of subspaces , where is the feature dimension and is the number of data vectors. To characterize the relation between the data in , the key is to construct an effective affinity matrix , in which reflects the similarity between data points and . While computing Euclidean distances on the raw data is the most intuitive way to construct the data affinity matrix, such metric usually does not reveal the global subspace structure of data well.
where and denote the representation matrix and the residual matrix, respectively. In (1), is the regularizer on . is the model of , which can take different forms depending on the data characteristics; and as well as are the weight parameters.
Numerous methods have been developed to extract compact information from image data, including sparse representation (SR) and low-rank representation (LRR).
Different from traditional methods, SR schemes can be used to exploit higher-order relationships among more data points effectively, and hence provide more compact and discriminative models. The main drawback of SR methods is that each data point is processed individually without taking the inherent global structure into account. On the other hand, low-rank representation models use low-rank constraints on data representations to capture the global structure of the whole data. It has been shown that, under mild conditions, LRR methods can preserve the membership of samples that belong to the same subspace well. Recently, Zhuang et al. harness both sparsity and low-rankness of data to learn more informative representations.
In general, after solving the problem (1), the representation is used to define the affinity matrix of an undirected graph with for and . However, the metric implies the affinity is already not the same as the original definition. This is because the affinity defines an approximation to the pairwise distances between data samples while the representation is the reconstruction coefficients of one sample from others. As such, Guo  proposes a method to simultaneously learn data representations and affinity matrix. Experimental results on the synthetic and real datasets demonstrate the effectiveness of learning model representation and affinity matrix jointly.
3 Patch-based Graph Learning
Given a bounding box that encloses the target object, we partition it into non-overlapping patches and assign each one a weight that reflects its importance in describing the target object, in order to alleviate the effects of background clutter. We concatenate these weighted patch descriptors into a feature vector and use the Struck method for object tracking. In this section, we first describe a sparse low-rank model based on local patches, and then an efficient ADMM algorithm to compute the weights.
Each bounding box of the target object is partitioned into non-overlapping patches, and a set of low-level appearance features are extracted and combined into one single -dimensional feature vector for characterizing the -th patch. Using these patches as graph nodes, each bounding box can be represented with a graph, in which the weight of each node describes how likely it belongs to the target object and the edge weight between two neighboring patches indicates appearance compatibility.
For visual tracking, some patches in a target bounding box may belong to background due to irregular shape, scale variation and partial occlusion of the target object, as shown in Figure 1. Thus, we assign a weight for each graph node to alleviate the effects of background pixels on object tracking and model update. On the other hand, instead of constructing spatially ordered graphs [12, 7], the edges are dynamically learned for capturing the intrinsic relationship of data. In this work, we propose a novel graph learning approach to infer the edges and node weights jointly which performs well against the state-of-the-art alternatives for visual tracking.
All the feature vectors of patches in one bounding box form the data matrix . Each patch descriptor can be represented as a linear combination of remaining patch descriptors, and the representation of all patch vectors can then be formulated by , where is the representation coefficient matrix. Since the patch feature matrix often contains noise, the representation can be obtained by solving the objective function (1).
The optimal representation coefficient matrix in (1) is often utilized to define the affinity matrix of an undirected graph in the way of for the feature vector and . As and are the reconstruction coefficients, this encoded information is not the same as the original definition, which defines an approximation to the pairwise distances between and . Therefore, we also learn the affinity matrix by assuming that the patch features should have larger probabilities to be in the same cluster if their representations have smaller distance, and impose the following constraints,
where is the desired affinity matrix, whose element reflects the probability of the patch features and from the same cluster based on the distance between their representations and . The constraints and guarantee the probability property of each column of . With some simple algebra, we integrate these constraints into (1), and have
where is the Laplacian matrix of , and is the degree matrix of , a diagonal matrix whose -th diagonal element is . In (5), and are weight parameters. In addition, the last term is used to avoid overfitting. Note that minimizing the term could exclude the trivial solution , where indicates the identity matrix. The trivial solution is also not achieved, as it would mean that the data are clean, which is an "ideal" case that does not occur in real-world applications.
To alleviate the effects of background clutter, we assign a weight for each patch using a semi-supervised formulation. Let be an initial weight vector, in which if is a target object patch, and indicating a background patch. In this work, is computed by the initial ground truth (for first frame) or the previous tracking result (for subsequent frames). For -th patch, if it belongs to the shrunk region of the bounding box then is , and if it belongs to the expanded region of the bounding box then is . Figure 1 shows an example of how the weights are assigned. Although this initialization strategy is simple, we demonstrate empirically that it performs well, and show its robustness to clutter and noise in Figure 2.
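The seed initialization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the patch grid is assumed to cover the expanded box, the outermost ring of patches is taken as background seeds (weight 0), the inner core as foreground seeds (weight 1), and everything in between is left undetermined. The names `init_seeds`, `bg_ring` and `fg_margin` and their default values are hypothetical.

```python
from typing import List, Tuple


def init_seeds(grid: int = 8, bg_ring: int = 1, fg_margin: int = 2
               ) -> Tuple[List[List[float]], List[List[int]]]:
    """Return (initial_weights, determined) for a grid x grid patch layout.

    initial_weights[r][c] is 1.0 for foreground seeds and 0.0 otherwise.
    determined[r][c] is 1 for seed patches (foreground or background)
    and 0 for undetermined patches whose weight must be inferred.
    The ring widths are illustrative assumptions.
    """
    weights = [[0.0] * grid for _ in range(grid)]
    determined = [[0] * grid for _ in range(grid)]
    for r in range(grid):
        for c in range(grid):
            # Distance (in patches) from this patch to the nearest grid edge.
            depth = min(r, c, grid - 1 - r, grid - 1 - c)
            if depth < bg_ring:
                determined[r][c] = 1      # background seed, weight stays 0
            elif depth >= fg_margin:
                weights[r][c] = 1.0       # foreground seed
                determined[r][c] = 1
    return weights, determined
```

With the default arguments, 28 border patches become background seeds, 16 central patches become foreground seeds, and the remaining 20 patches are undetermined.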
The remaining patches are non-determined, and are diffused by other patches. To this end, we define an indicator vector that indicates the -th patch is foreground or background patch, and denotes the -th patch is non-determined patch. We integrate the patch weights into (5), and obtain
where indicates the element-wise product. , , and are weight parameters. The third and fourth terms are the smoothness and fitting constraints. Since the indicator vector removes the fitting constraint on the weights of non-determined patches, we introduce the last term to avoid overfitting. Specifically, the smoothness term of constrains and to be similar to each other when is non-zero, and the fitting term of forces its elements to be close to 0 or 1. However, the fitting constraint is only partial, and we thus introduce to prevent the amplitudes of its elements from becoming too large.
As discussed in Section 2, the regularizer is usually based on sparse or low-rank priors, e.g., sparse representation (SR) and low-rank representation (LRR). The SR methods exploit higher-order relationships among more data points and hence are more discriminative [14, 28]. The LRR approaches employ low-rank constraints on data representations to capture the global structure of data points, and thus are robust to noise and corruption [16, 15]. However, the LRR methods require singular value decomposition (SVD) operations at each iteration, which is computationally demanding. Therefore, we impose the sparse constraints (i.e., the ℓ1-norm, a convex surrogate for the ℓ0-norm) on in this work for computational efficiency.
In (6), the model can be in different forms based on the characteristic of data. For visual tracking, as some image patches are corrupted (e.g., occluded by background or other objects), we employ -norm (a convex surrogate for -norm) on . Putting the data terms and prior together, we have:
where and denote -norm and -norm of a matrix, respectively. (7) is reasonable, as two patches should prefer to be sparsely represented by the same set of patches if they are similar. In particular, for optimizing , we exploit higher-order relationships among patches by minimizing norm on , and also penalize the inconsistency between and when patch features and are similar (i.e., large ) by minimizing .
It is worth noting that although is a non-symmetric affinity matrix, as shown in the next section, the solutions of the variables that rely on (i.e., and ) are based on a symmetric affinity matrix, i.e., .
Although (7) is not jointly convex on , , and , it is convex with respect to each of them when the others are fixed. The ADMM (Alternating Direction Method of Multipliers) algorithm has been shown to be an efficient and effective solver for problems similar to (7). To apply ADMM to the above problem, we need to make the objective function separable. Therefore, we introduce an auxiliary variable to replace in (7):
The augmented Lagrangian function of (8) is
where is the penalty parameter, and . In the above equation, and are the Lagrangian multipliers. ADMM alternately updates one variable at a time by minimizing while fixing the other variables. In addition to the Lagrangian multipliers, there are 5 variables, including , , , and , to be solved. The solutions of these subproblems are discussed below.
With other variables in (9) fixed, the -subproblem can be written as:
To avoid using an auxiliary variable and matrix inversions, we use the linearized ADMM method to minimize the -subproblem of (9). The quadratic term is replaced by its first-order approximation at the previous iteration plus a proximal term. Thus, can be updated by:
where is the shorthand of . In (11), is the partial derivative of with respect to , and . With some manipulation, we have: .
Generally, the solution of is obtained by the soft-threshold (or shrinkage) method :
where , and is the soft-threshold operator on with parameter .
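The soft-threshold (shrinkage) operator mentioned above has a standard element-wise form: it moves each entry towards zero by the threshold and sets entries smaller than the threshold to zero. A minimal sketch follows; the function names are illustrative and not taken from the source.

```python
from typing import List


def soft_threshold(x: float, eps: float) -> float:
    """Scalar shrinkage: sign(x) * max(|x| - eps, 0)."""
    if x > eps:
        return x - eps
    if x < -eps:
        return x + eps
    return 0.0


def soft_threshold_matrix(m: List[List[float]], eps: float) -> List[List[float]]:
    """Apply the shrinkage operator to every entry of a matrix."""
    return [[soft_threshold(v, eps) for v in row] for row in m]
```

Because the operator acts on each entry independently, the corresponding subproblem has a closed-form solution and needs no matrix inversion.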
By fixing other variables in (9), the -subproblem can be formulated as:
To compute , we take the derivative of with respect to , and set it to be 0. With some manipulation, we have:
where is the identity matrix.
When other variables in (9) are fixed, the -subproblem can be formulated as:
where is a vector whose -th element is . Notice that the parameter is introduced to control the number of nearest neighbors of (or ) that may be connected by edges to (or ).
By fixing other variables in (9), the -subproblem can be formulated as:
Similar to the solution for updating , we take the derivative of with respect to , and set it to be 0. With some manipulation, the closed-form solution of this subproblem can be computed by:
where is the degree matrix of that , and , and .
The procedure of solving (9) terminates when the maximum element changes of , , , and between two consecutive iterations are less than a threshold (e.g., in this work) or the maximum number of iterations reaches a pre-defined number (e.g., in this work).
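The stopping rule above can be sketched as a generic alternating-update loop. This is only an illustration: `step` stands for one full pass over all subproblems and is a hypothetical callable, and the default values of `tol` and `max_iter` are placeholders rather than the settings used in this work.

```python
from typing import Callable, Dict, List, Tuple

Matrix = List[List[float]]
State = Dict[str, Matrix]


def max_abs_change(a: Matrix, b: Matrix) -> float:
    """Largest absolute element-wise difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def run_until_converged(step: Callable[[State], State], state: State,
                        tol: float = 1e-4, max_iter: int = 50
                        ) -> Tuple[State, int]:
    """Iterate `step` until every variable changes by less than `tol`
    (maximum element change) or `max_iter` iterations are reached."""
    for it in range(1, max_iter + 1):
        new_state = step(state)
        change = max(max_abs_change(state[k], new_state[k]) for k in state)
        state = new_state
        if change < tol:
            return state, it
    return state, max_iter
```

For example, with a toy `step` that halves every entry, a state starting at 1.0 stops after 14 iterations under the placeholder tolerance.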
4 Structured SVM Tracking
In this section, we incorporate the optimized weights of patches into the conventional tracking-by-detection algorithm, Struck , for visual tracking. Although we use the Struck method in this work, the optimized patch weights can also be incorporated into other tracking-by-detection algorithms. The Struck method selects the optimal target bounding box in the -th frame by maximizing a classification score:
where is the normal vector of a decision plane of the -th frame, and denotes the descriptor representing a bounding box in the -th frame. Instead of using binary-labeled samples, the Struck method employs structured samples that consist of a target bounding box and nearby boxes in the same frame to alleviate the labeling ambiguity in training the classifier. Specifically, it enforces that the confidence score of a target bounding box is larger than that of a nearby box by a margin determined by the overlap ratio between the two boxes. As such, the Struck method can reduce the adverse effects of labeling ambiguity.
For robust tracking, we decompose the problem of target state estimation into the two subproblems of translation estimation and scale estimation [31, 32, 33]. Motivated by Bayesian filtering algorithms [34, 35], we propose a simple yet effective random strategy for target state refinement.
4.1 Translation Estimation
We incorporate the optimized patch weights into the Struck method, which improves the robustness to drastic appearance changes and unreliable tracking results of a target object. Given the bounding box of the target object in the previous frame , we first set a search window in the current frame . For -th candidate bounding box within the search window, we weight its patch feature descriptor by the weight , and concatenate them into a vector as the feature representation:
where we normalize as so that all elements of sum to 1, and the parameter is fixed to a pre-defined number (e.g., 37 in this work). Here, we use the sigmoid function to map the normalized weights into the range from 0 to 1; it has a parameter that controls the steepness of the mapping. The optimal bounding box can be selected to update the object location by maximizing the classification score:
where is learned in the initial frame, which can alleviate the issue of learning drastic appearance changes, and is a weight parameter.
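The weighting and scoring steps of translation estimation can be sketched as follows. This is a hedged illustration, not the authors' exact formulation: the sigmoid is assumed to take the form 1 / (1 + exp(-steepness * normalized weight)), the value 37 is assumed to be its steepness parameter, and the score is assumed to be the current classifier's response plus a weighted response of the initial-frame classifier. The names `weighted_descriptor`, `combined_score`, `steepness` and `lam` are hypothetical.

```python
import math
from typing import List, Sequence


def weighted_descriptor(patch_features: Sequence[Sequence[float]],
                        patch_weights: Sequence[float],
                        steepness: float = 37.0) -> List[float]:
    """Scale each patch descriptor by a sigmoid of its normalized weight
    and concatenate the results into one feature vector."""
    total = sum(patch_weights)
    if total <= 0:
        raise ValueError("patch weights must have a positive sum")
    descriptor: List[float] = []
    for feat, w in zip(patch_features, patch_weights):
        gate = 1.0 / (1.0 + math.exp(-steepness * (w / total)))
        descriptor.extend(gate * v for v in feat)
    return descriptor


def combined_score(w_current: Sequence[float], w_initial: Sequence[float],
                   descriptor: Sequence[float], lam: float) -> float:
    """Assumed score: current classifier plus lam times initial classifier."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return dot(w_current, descriptor) + lam * dot(w_initial, descriptor)
```

The candidate box with the highest `combined_score` inside the search window would then be chosen as the new object location.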
4.2 Scale Estimation
Given the estimated location , we sample a set of bounding boxes from the Gaussian distribution centered at , in which the elements of the covariance are the variations of the affine parameters, and its setting depends on the motion variations of the target object. To simultaneously estimate scales and refine locations, we utilize four independent affine parameters to draw samples, including the scale factor, aspect ratio and translation. For example, we empirically set these parameters (scale factor, aspect ratio, and translation) to 0.05, 0.01, 1 and 1, respectively, in this paper. Since object translation has already been estimated, we use 100 samples in this paper to compute the scale while slightly adjusting the translation, as a trade-off between efficiency and accuracy. The bounding box is updated by the one with the highest score,
where and are classifiers trained in scale spaces at time and 0, respectively.
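The sampling step of scale estimation can be sketched as below. It is a minimal illustration under assumptions: the four values 0.05, 0.01, 1 and 1 are treated as standard deviations of independent zero-mean Gaussian perturbations of the scale factor, aspect ratio, and horizontal and vertical translation, and the way the perturbations are mapped onto the box is one plausible choice. The box layout `(cx, cy, w, h)` and the function names are hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (center x, center y, width, height)


def sample_boxes(box: Box, n: int = 100,
                 sigmas: Sequence[float] = (0.05, 0.01, 1.0, 1.0),
                 seed: Optional[int] = None) -> List[Box]:
    """Draw n candidate boxes around `box` by perturbing scale, aspect
    ratio and translation with independent Gaussian noise."""
    rng = random.Random(seed)
    cx, cy, w, h = box
    samples: List[Box] = []
    for _ in range(n):
        # Clamp so that a very unlikely draw cannot produce a degenerate box.
        scale = max(1e-3, 1.0 + rng.gauss(0.0, sigmas[0]))
        aspect = max(1e-3, 1.0 + rng.gauss(0.0, sigmas[1]))
        dx = rng.gauss(0.0, sigmas[2])
        dy = rng.gauss(0.0, sigmas[3])
        samples.append((cx + dx, cy + dy, w * scale * aspect, h * scale))
    return samples


def best_box(samples: Sequence[Box], score: Callable[[Box], float]) -> Box:
    """Return the sampled box with the highest classifier score."""
    return max(samples, key=score)
```

Here `score` stands in for the scale-space classifier response; any callable that maps a box to a number can be plugged in.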
To update the classifier , we use a method similar to that of translation estimation. Given the optimal estimate , we extract bounding boxes around at different scales and the corresponding feature representations for scale factors excluding the positive sample with the scale factor 1. We then find the optimal by
where , and indicates the Intersection-over-Union operation. To optimize (25), we use the stochastic variance reduced gradient scheme. To reduce the sensitivity of the scale update to noise, we carry out scale estimation at an interval of 3 frames.
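For reference, the Intersection-over-Union operation used above can be computed for two axis-aligned boxes as follows. The `(x, y, w, h)` layout with a top-left corner is an assumption made for this sketch.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (left, top, width, height)


def iou(a: Rect, b: Rect) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0
```

Two 2×2 boxes offset by one pixel in each direction overlap in a 1×1 region, giving an IoU of 1/7.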
4.3 Model Update
To alleviate the issue of model drift, we update the classifier and patch weights only when the confidence score of the tracking result is larger than a threshold . In this paper, the confidence score of the tracking result in the -th frame is defined as the average similarity between the weighted descriptor of the tracked bounding box and the positive support vectors
where is the set of the positive support vectors at time . Algorithm 2 shows the main steps of the proposed tracking method.
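The gated model update can be sketched as follows. The similarity measure is not specified in this passage, so cosine similarity is used purely as an assumption, and the threshold is left as an argument. The function names are hypothetical.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def confidence(descriptor: Sequence[float],
               positive_svs: Sequence[Sequence[float]]) -> float:
    """Average similarity between the tracked box's weighted descriptor
    and the positive support vectors (assumed similarity: cosine)."""
    if not positive_svs:
        return 0.0
    return sum(cosine(descriptor, s) for s in positive_svs) / len(positive_svs)


def should_update(descriptor: Sequence[float],
                  positive_svs: Sequence[Sequence[float]],
                  threshold: float) -> bool:
    """Update the classifier and patch weights only for confident results."""
    return confidence(descriptor, positive_svs) > threshold
```

Skipping the update on low-confidence frames keeps unreliable samples, such as those from an occluded target, out of the model.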
It should be noted that the proposed tracking algorithm is significantly different from the recently proposed approaches that use sparse representation for object tracking [38, 39, 40, 34] in which reconstruction errors or representation coefficients are used to compute the confidence of candidates in the Bayesian filtering framework. While we employ the sparse representation to learn a dynamic graph for representing a target object, the node weights are used to suppress the effects of background clutter in the tracking-by-detection framework.
In addition, our approach is also significantly different from the SOWP  method in several aspects. First, the proposed algorithm learns a dynamic graph to represent a target object that better captures the intrinsic relationship among image patches. Second, our method optimizes the edge and node weights jointly while the SOWP method first computes the edge weights and then the node weights. Third, the proposed tracker considers the initial foreground and background clutter in a unified model, while the SOWP method requires two passes to compute the final patch weights, one for foreground and another for background.
5 Performance Evaluation
The proposed tracker, based on the weighted patch-based graph (WPG) representation, is implemented in C++. All experiments are carried out on a machine with an Intel i7 4.0 GHz CPU and 32 GB RAM. We test the runtime of WPG on the OTB100 dataset, and scale each frame such that the minimum side length of a bounding box is 32 pixels for efficiency. The proposed algorithm is able to track a target object at 5 frames per second, where the optimization method converges within 50 iterations. We use the benchmark datasets and protocols [23, 24, 41] to evaluate the proposed approach. In addition, we evaluate several variants of the proposed method to demonstrate the contribution of the main modules.
5.1 Experimental Setup
For fair comparisons, we fix all parameters and other settings on all datasets in our experiments. We partition each bounding box into 64 non-overlapping patches, and extract color and gradient histograms for each patch, where the dimension of the gradient histogram and of each color channel is set to 8. We evaluate different numbers of patches from , and empirically determine that the proposed method performs best with 64 patches as a trade-off between accuracy and complexity. Note that we fix the patch number to a square number so that the patch size adapts to the size of the object bounding box, which gives the patches a shape consistent with the target object. Otherwise, it is hard to find a unified partition method for all sequences. To improve efficiency, each frame is scaled such that the minimum side length of a bounding box is 32 pixels. A bounding box is described by with width and height of and pixels. The side length of a search window is initially set to a small range, , to reduce false positives, and then set to a large range, , to handle abrupt motions if the center distance of the object box between two consecutive frames is above a predefined threshold (e.g., 5 pixels in this work).
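The patch partition and the search-range switch described above can be sketched as follows. This is an illustration under assumptions: the 64 patches are taken as an 8×8 grid, and the small and large search ranges are passed in as arguments because their values are not restated here. The function names are hypothetical.

```python
import math
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (left, top, width, height)


def partition(box: Rect, grid: int = 8) -> List[Rect]:
    """Split a box into grid x grid equal, non-overlapping patches
    (row-major order), so patches keep the aspect ratio of the box."""
    x, y, w, h = box
    pw, ph = w / grid, h / grid
    return [(x + c * pw, y + r * ph, pw, ph)
            for r in range(grid) for c in range(grid)]


def choose_search_range(prev_center: Tuple[float, float],
                        cur_center: Tuple[float, float],
                        small: float, large: float,
                        jump: float = 5.0) -> float:
    """Use the large range only when the box center moved more than
    `jump` pixels between two consecutive frames."""
    moved = math.hypot(cur_center[0] - prev_center[0],
                       cur_center[1] - prev_center[1])
    return large if moved > jump else small
```

For a 32×32 box, `partition` yields 64 patches of 4×4 pixels each.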
For seed selection, we shrink and expand the tracked bounding box as and , respectively, where denotes the upper-left coordinate of the tracked bounding box, and and indicate the patch width and height, respectively. The proposed model (6) involves several parameters, which are set as follows. On the one hand, similar to , we simplify the settings as and . Following , we set . Although is 2 orders of magnitude higher than and , we find, by inspecting the value of each term after optimization, that these terms are well balanced. On the other hand, and control the balance between the smoothness and fitness of . According to the settings of similar models [12, 22], we set . For the Struck method, we empirically set .
5.1.2 OTB100 Dataset
We evaluate the proposed tracking method on the OTB100 benchmark dataset . The OTB100 dataset contains image sequences with ground-truth object locations and attributes for performance analysis. We use precision rate (PR) and success rate (SR) with the threshold of pixels for quantitative performance.
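The two metrics can be sketched as follows. The precision rate is the fraction of frames whose center location error is within a pixel threshold, which is left as an argument here. The success score is computed as the average success rate over evenly spaced overlap thresholds from 0 to 1; the 21-threshold grid is an assumption modeled on common OTB practice. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def precision_rate(pred_centers: Sequence[Point],
                   gt_centers: Sequence[Point],
                   threshold: float) -> float:
    """Fraction of frames whose center error is at most `threshold` pixels."""
    hits = sum(1 for p, g in zip(pred_centers, gt_centers)
               if math.hypot(p[0] - g[0], p[1] - g[1]) <= threshold)
    return hits / len(gt_centers)


def success_auc(overlaps: Sequence[float], steps: int = 21) -> float:
    """Mean over thresholds t in [0, 1] of the fraction of frames with
    overlap greater than t (area under the success plot)."""
    rates = []
    for i in range(steps):
        t = i / (steps - 1)
        rates.append(sum(1 for o in overlaps if o > t) / len(overlaps))
    return sum(rates) / steps
```

A tracker with perfect overlap in every frame scores 20/21 under this grid, because no overlap can exceed the final threshold of 1.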
5.1.3 Temple Color Dataset
To further validate the effectiveness of the proposed approach, we also compare with other tracking approaches on the Temple Color dataset. This database contains 128 challenging image sequences of humans, animals and rigid objects. In addition to the tracking ground truth, each sequence in the dataset is also annotated with its challenging factors as defined in . The evaluation metrics are also defined in .
5.1.4 NUS PRO Dataset
We also compare the proposed algorithm with other tracking approaches on the NUS-PRO dataset . This dataset contains challenging image sequences of pedestrians and rigid objects, mainly captured from moving cameras. Aside from target locations, each sequence is annotated with occlusion level for evaluation. We employ the threshold-response relationship (TRR) with three criteria for occlusion computation  on the entire dataset to evaluate the proposed tracking method.
5.1.5 VOT Challenge Dataset
For a more comprehensive evaluation, we also run the proposed tracker on the VOT2014 challenge dataset, which contains more deformations and whose aligned bounding boxes contain more noise. Accuracy (ACC) and robustness (ROB) are used to assess the performance of a tracker. The accuracy measures the overlap ratio between an estimated bounding box and the ground truth. The robustness indicates the number of tracking failures, i.e., the number of frames in which the overlap ratio is zero.
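Following the definitions in the paragraph above, the two measures can be sketched from a list of per-frame overlap ratios. Averaging the accuracy only over frames with non-zero overlap is an assumption of this sketch, and the official VOT toolkit additionally re-initializes the tracker after a failure, which is not modeled here. The function name is hypothetical.

```python
from typing import Sequence, Tuple


def accuracy_and_failures(overlaps: Sequence[float]) -> Tuple[float, int]:
    """Return (mean overlap over tracked frames, number of failure frames).

    A frame counts as a failure when its overlap ratio is zero.
    """
    tracked = [o for o in overlaps if o > 0.0]
    failures = len(overlaps) - len(tracked)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures
```

For per-frame overlaps of 0.5, 0.0 and 0.7, this gives an accuracy of 0.6 and one failure.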
5.2 Evaluation on the OTB100 Dataset
We first evaluate the proposed algorithm on the OTB100 dataset against other tracking methods. Next, we analyze the performance of the evaluated methods based on the attributes of the image sequences.
5.2.1 Tracking Methods Without Deep Features
We evaluate the proposed algorithm against the state-of-the-art tracking methods without using deep features, e.g., Struck, DSST, MEEM, MUSTer and SOWP. Figure 3 shows the OPE plots on the OTB100 dataset, and Figure 4 presents some qualitative results. Overall, the proposed algorithm performs favorably against the state-of-the-art methods, e.g., 9.1% over SOWP in the precision score and 5.5% over MUSTer in the success score. Figure 4 shows that the proposed approach effectively handles scenes with illumination variation (Basketball and Ironman), background clutter (Diving, Ironman and Box), deformation (Basketball, Bolt2 and Diving) and partial occlusion (Ironman, Box and Human3).
The strong performance of WPG suggests that the proposed tracker is able to mitigate outlier effects by integrating local patch weights into the feature representations, which brings the largest performance gain. In addition, the following components also improve tracking performance. First, local patch representations are robust to object deformation and partial occlusion. Second, the classification and update schemes are used to avoid model contamination caused by drastic appearance changes and unreliable tracking results of a target object. Finally, the scale handling strategy is employed to adapt to scale variations and also refine the object translation.
5.2.2 Tracking Methods Based on Deep Features
We evaluate the proposed algorithm against the state-of-the-art tracking methods using deep features, including DLT, HCF, C-COT and MDNet. Figure 3, Table I and Table II show the evaluation results. Overall, the proposed tracker performs favorably against the DLT and HCF methods in all aspects. It performs on par with the C-COT and MDNet schemes in terms of precision and slightly worse in terms of success rate. Furthermore, the proposed algorithm differs from the C-COT and MDNet methods in several aspects.
The proposed tracking method does not require laborious pre-training or a large training set. In addition, it does not need to save a large pre-trained deep model. We initialize the proposed model using the ground truth bounding box in the first frame, and update it in subsequent frames.
It is easy to implement as each subproblem of the proposed model has a closed-form solution.
It performs more robustly than the MDNet and C-COT methods in some situations. In particular, it outperforms the C-COT method on sequences with background clutter in terms of precision and success rate, which suggests the effectiveness of our approach in suppressing background effects during tracking.
5.2.3 Attribute-based Evaluation
We present the precision plots with 11 different attributes in Table I and Table II. The attributes include background clutter (BC), deformation (DEF), fast motion (FM), illumination variation (IV), in-plane rotation (IPR), low resolution (LR), motion blur (MB), occlusion (OCC), out-of-plane rotation (OPR), out of view (OV) and scale variation (SV).
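Attribute-based evaluation amounts to averaging a per-sequence score over the subset of sequences annotated with each attribute. A minimal sketch, assuming per-sequence scores and attribute annotations are available as plain dictionaries (the function name, sequence names and values below are hypothetical and are not results from the paper):

```python
from collections import defaultdict


def scores_by_attribute(seq_scores, seq_attributes):
    """Average a per-sequence score over the sequences tagged with each attribute.

    seq_scores:     {sequence_name: score}
    seq_attributes: {sequence_name: iterable of attribute codes, e.g. "BC", "DEF"}
    """
    grouped = defaultdict(list)
    for name, score in seq_scores.items():
        # A sequence may carry several attributes and contributes to each of them.
        for attr in seq_attributes.get(name, ()):
            grouped[attr].append(score)
    return {attr: sum(vals) / len(vals) for attr, vals in grouped.items()}


# Hypothetical toy input, for illustration only.
example_scores = {"seq_a": 0.8, "seq_b": 0.6, "seq_c": 0.4}
example_attrs = {"seq_a": ["BC", "DEF"], "seq_b": ["BC"], "seq_c": ["FM"]}
per_attribute = scores_by_attribute(example_scores, example_attrs)
```

Because one sequence can carry several attributes, the attribute subsets overlap and their averages are not independent of each other.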
The comparison shows that our tracker significantly outperforms the other non-DL-based tracking methods and achieves performance comparable to the DL-based ones on the attribute-based subsets (e.g., BC and DEF), which validates the effectiveness of introducing optimized weights into the object representation to suppress background clutter and noise. The performance of our tracker against others on OCC and OV suggests that the adopted classification and update schemes can re-track objects after tracking failures, e.g., total occlusion and re-entering the field of view, and alleviate incorrect updates caused by noisy samples. The weaker performance of our tracker on FM and LR suggests limitations of the features used (color and gradient) in representing the target object and of the search strategy, and we will address these issues in future work.
In particular, we compare WPG with the SOWP method, which is the most closely related to ours. For the PR score, WPG outperforms the SOWP method significantly, especially on sequences with deformation, out of view, background clutter and low resolution. This demonstrates the advantages of WPG over SOWP in learning robust object feature representations under background inclusion and limited information, and also in re-tracking objects after they come back into view. For the SR score, WPG also outperforms SOWP by a large margin, especially on the challenges of scale variation, occlusion and low resolution, which verifies the effectiveness of scale handling, background suppression and reliability highlighting in WPG, whereas SOWP does not handle scale variations and is also limited by its weight computation scheme.
5.3 Evaluation on the Temple Color Dataset
We evaluate the proposed algorithm on the Temple Color dataset . Figure 5 shows the evaluation results against 9 state-of-the-art tracking approaches, including DGT , Staple , MEEM , SRDCF , Struck , KCF , ASLA , MIL , and VTD . Overall, the proposed algorithm performs favorably against the other trackers, e.g., DGT (Our previous version) (PR/SR: 4.8%/3.2%), Staple (8.4%/3.7%) and SRDCF (10.0%/5.4%).
5.4 Evaluation on the NUS-PRO Dataset
We evaluate the proposed algorithm against the state-of-the-art trackers on the NUS-PRO  dataset.
5.4.1 Overall Performance
We present the evaluation results of our method against 20 conventional trackers on the NUS-PRO dataset  in Figure 6. Overall, the proposed tracker performs favorably against other trackers on the NUS-PRO dataset. The results of the top 4 performing methods (CPF , ASLA , SCM  and LOT ) show that the combination of local feature representations and particle filter search models can achieve the state-of-the-art performance. Although adopting only the local feature representation, the proposed tracking algorithm performs well on the NUS-PRO dataset.
5.4.2 Category-based Evaluation
We present how the proposed tracker performs on 4 object categories in the NUS-PRO database. The AUC plots of TRR curves in Figure 7 show that the proposed method performs well in the rigid object, sportsman and face sequences, and comparably with the SCM scheme in the pedestrian sequences. The sportsman category is the most challenging among 4 object types in the NUS-PRO database, followed by the classes of pedestrians, rigid objects and faces.
[Table III. Columns: ACC, ROB, ACC w/o (Baseline); ACC, ROB, ACC w/o (Region noise).]
5.4.3 Attribute-based Evaluation
We present the AUC plots of TRR curves of the evaluated tracking algorithms based on 12 attributes, including shadow change (SC), flash (FL), dim light (DL), camera shaking (CS), scale change (SC), rotation (RO), shape deformation (SD), partial occlusion (PO), full occlusion (FO), clutter background (CB), similar objects (SO) and fast background change (FBC). The proposed tracker performs well on scenes with most attributes, including SC, DL, CS, SC, RO, SD, PO, FO, CB and SO. The evaluation results are consistent with the findings on the OTB100 dataset, except that the FL, DL, CS and FBC attributes are not reported on the OTB100 dataset and the proposed method performs slightly worse than the others on scenes with the DL and FBC attributes. The performance on sequences with the DL attribute may be explained by the features adopted by the proposed method (color and gradient), which are less effective for representing target objects under low illumination conditions; this can be improved by integrating more features. On the other hand, the performance of the proposed algorithm on sequences with the FBC attribute can be explained by the search strategy, and can be further improved by using robust motion or search models to leverage more temporal and spatial information.
5.5 Evaluation on the VOT Challenge Dataset
Finally, we report the evaluation results of WPG against SOWP and the top three trackers (i.e., DSST, SAMF and KCF) on the VOT2014 challenge dataset, as shown in Table III. In the Baseline evaluation, a tracker is initialized with the ground truth. In the Region noise evaluation, a tracker is initialized with a perturbed ground truth.
From Table III, we can see that WPG obtains lower ACC scores but achieves the best ROB results in both evaluations. In the VOT challenge, a re-initialization step using a new ground truth is triggered whenever a tracking failure is detected. The compared trackers fail more frequently than WPG and are therefore re-initialized more often, which yields higher overlap ratios. To mitigate the effect of re-initialization, we remove the re-initialization step from the evaluations and denote the resulting overlap ratios as ACC w/o. The results show that WPG yields the best ACC scores without re-initialization.
It is worth noting that the sequences of the Region noise evaluation contain more clutter and noise, yet WPG performs better on the Region noise evaluation than on the Baseline. This suggests that WPG can handle region noise more effectively than the other trackers.
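The effect of re-initialization on the accuracy score can be illustrated with a toy supervised evaluation loop. The sketch below is not the VOT toolkit: the `StaticTracker` class, the `run_supervised` function and the 5-frame gap before re-initialization are assumptions made for illustration. A failure is declared when the overlap drops to zero, and frames skipped after a failure are excluded from the accuracy average.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


class StaticTracker:
    """Hypothetical toy tracker that always reports the box it was initialized with."""

    def init(self, box):
        self.box = box

    def update(self):
        return self.box


def run_supervised(tracker, ground_truth, reinit=True, skip=5):
    """Return (mean overlap, number of failures) with or without re-initialization."""
    overlaps, failures = [], 0
    tracker.init(ground_truth[0])
    t = 1
    while t < len(ground_truth):
        o = iou(tracker.update(), ground_truth[t])
        if o == 0.0:
            failures += 1
            if reinit:
                # Skip a few frames, restart from the ground truth, and do not
                # count the failed or skipped frames toward the accuracy.
                t += skip
                if t < len(ground_truth):
                    tracker.init(ground_truth[t])
                t += 1
                continue
        overlaps.append(o)
        t += 1
    acc = sum(overlaps) / len(overlaps) if overlaps else 0.0
    return acc, failures


# Toy sequence: a 10x10 target moving 2 pixels to the right per frame.
toy_gt = [(2 * t, 0, 10, 10) for t in range(30)]
acc_reset, fails_reset = run_supervised(StaticTracker(), toy_gt, reinit=True)
acc_free, fails_free = run_supervised(StaticTracker(), toy_gt, reinit=False)
```

In this toy setting the drifting tracker receives a much higher mean overlap when it is repeatedly restarted from the ground truth than when it runs freely, which mirrors why ACC and ACC w/o can rank trackers differently.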
To demonstrate the effectiveness of the main components, we present empirical results using 4 variants of the proposed algorithm on the OTB100 dataset . These variants are: 1) WPG: We remove the patch weights in our tracking algorithm, 2) WPG: We remove the affinity learning and directly utilize the representation coefficients to diffuse patch weights. The objective function is:
We also use the ADMM algorithm to solve (27). 3) WPG: We remove the sparse constraints on , but enforce minimizing to avoid the trivial solution. Thus, can be updated with the closed-form solution:
4) WPG: We remove the sample-specific sparse constraints on but enforce minimizing to avoid the trivial solution. Thus, can be updated with a closed-form solution:
To rule out implementation flaws or optimization differences, we also set the parameters , and to a very small value (e.g., in this work) to render the contribution of each term negligible, and denote the resulting variants as WPG, WPG and WPG, respectively.
Table IV shows the evaluation results against the SOWP method. The performance gains achieved by the proposed algorithm over the SOWP method demonstrate the significance of the main components. In particular, the results show that: 1) Introducing patch weights into the object representation helps suppress the effects of background clutter in visual tracking, as seen by comparing the performance of WPG against the other schemes. WPG is a spatial reliability learning method, an approach that has been shown to be effective in mitigating outlier effects and thus brings large performance gains toward state-of-the-art tracking performance [56, 57]. 2) The WPG, WPG, WPG and WPG methods perform well against the SOWP scheme, which suggests that the dynamic graph facilitates optimizing the patch weights by capturing the intrinsic relationship among image patches. Compared with the restriction to spatial neighbors in SOWP, the variants of WPG can explore long-range relationships among patches and also mitigate the noise effects of low-level features. Hence, the patch weights optimized by the WPG variants are more accurate and robust. 3) The WPG algorithm performs better than the WPG, WPG and WPG schemes, thereby justifying the effectiveness of learning graph affinity matrix , sparse constraints on , and sample-specific sparse constraints on , respectively. First, a sparse representation based graph [58, 28] can automatically select the most informative neighbors for each patch and explore higher-order relationships among patches, and is hence more powerful and discriminative. Second, the learned graph can suppress corrupted and noisy image patches by modelling noise in the sparse representation.
The performances of WPG, WPG and WPG against WPG further support the above observations and conclusions.
In this paper, we propose an effective algorithm for visual tracking that suppresses the effects of background clutter. A patch-based graph is learned dynamically by capturing the intrinsic relationship among patches. To reduce the computational complexity, we develop an efficient algorithm for the proposed model by solving several convex subproblems. Finally, the optimized patch weights are incorporated into the structured SVM framework to carry out the tracking task. Extensive experimental results on three benchmark datasets demonstrate the effectiveness of the proposed algorithm over the state-of-the-art methods. Our future work will focus on: 1) learning dynamic spatio-temporal graphs to explore more relations among image patches, 2) developing robust motion or search models for addressing fast object or background motions, and 3) replacing the hand-crafted features with hierarchical appearance models for more effective representations.
-  D. Comaniciu, V. Ramesh, and P. Meer, “Kernel-based object tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003.
-  A. Adam, E. Rivlin, and I. Shimshoni, “Robust fragments-based tracking using the integral histogram,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2006.
-  S. Hare, A. Saffari, and P. H. S. Torr, “Struck: Structured output tracking with kernels,” in Proceedings of IEEE International Conference on Computer Vision, 2011.
-  S. He, Q. Yang, R. Lau, J. Wang, and M.-H. Yang, “Visual tracking via locality sensitive histograms,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2013.
-  J. Zhang, S. Ma, and S. Sclaroff, “MEEM: robust tracking via multiple experts using entropy minimization,” in Proceedings of European Conference on Computer Vision, 2014.
-  T. Zhang, K. Jia, C. Xu, Y. Ma, and N. Ahuja, “Partial occlusion handling for visual tracking via robust part matching,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2014.
-  H.-U. Kim, D.-Y. Lee, J.-Y. Sim, and C.-S. Kim, “Sowp: Spatially ordered and weighted patch descriptor for visual tracking,” in Proceedings of IEEE International Conference on Computer Vision, 2015.
-  Y. Li, J. Zhu, and S. C. Hoi, “Reliable patch trackers: Robust visual tracking by exploiting reliable patches,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015.
-  R. Timofte, J. Kwon, and L. V. Gool, “Picaso: pixel correspondences and soft match selection for real-time tracking,” Computer Vision and Image Understanding, vol. 153, pp. 151–162, 2016.
-  S. Duffner and C. Garcia, “Pixeltrack: A fast adaptive algorithm for tracking non-rigid objects,” in Proceedings of IEEE International Conference on Computer Vision, 2013.
-  F. Yang, H. Lu, and M.-H. Yang, “Robust superpixel tracking,” IEEE Transactions on Image Processing, vol. 23, no. 4, pp. 1639–1651, 2014.
-  C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, “Saliency detection via graph-based manifold ranking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2013.
-  D. L. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
-  E. Elhamifar and R. Vidal, “Sparse subspace clustering,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009.
-  G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust recovery of subspace structures by low-rank representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 171–184, 2013.
-  L. Zhuang, H. Gao, Z. Lin, Y. Ma, X. Zhang, and N. Yu, “Non-negative low rank and sparse graph for semi-supervised learning,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2012.
-  X. Guo, “Robust subspace segmentation by simultaneously learning data representations and their affinity matrix,” in Proceedings of International Joint Conference on Artificial Intelligence, 2015.
-  C. Li, L. Lin, W. Zuo, W. Wang, and J. Tang, “An approach to streaming video segmentation with sub-optimal low-rank decomposition,” IEEE Transactions on Image Processing, vol. 25, no. 5, pp. 1947–1960, 2016.
-  Z. Lin, R. Liu, and Z. Su, “Linearized alternating direction method with adaptive penalty for low rank representation,” in Proceedings of Annual Conference on Neural Information Processing Systems, 2011.
-  K. Wang, L. Lin, J. Lu, C. Li, and K. Shi, “Pisa: Pixelwise image saliency by aggregating complementary appearance contrast measures with edge-preserving coherence,” IEEE Transactions on Image Processing, vol. 24, no. 10, pp. 3019–3033, 2015.
-  C. Li, L. Lin, W. Zuo, S. Yan, and J. Tang, “Sold: Sub-optimal low-rank decomposition for efficient video segmentation,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015.
-  C. Li, L. Lin, W. Zuo, and J. Tang, “Learning patch-based dynamic graph for visual tracking,” in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.
-  Y. Wu, J. Lim, and M.-H. Yang, “Object tracking benchmark,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
-  A. Li, M. Li, Y. Wu, M.-H. Yang, and S. Yan, “Nus-pro: A new visual tracking challenge,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 335–349, 2016.
-  H. Grabner, M. Grabner, and H. Bischof, “Semi-supervised on-line boosting for robust tracking,” in Proceedings of European Conference on Computer Vision, 2008.
-  B. Babenko, M.-H. Yang, and S. Belongie, “Robust object tracking with online multiple instance learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 7, pp. 1619–1632, 2011.
-  I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, “Large margin methods for structured and interdependent output variables,” Journal of Machine Learning Research, vol. 6, pp. 1453–1484, 2005.
-  J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. S. Huang, and S. Yan, “Sparse representation for computer vision and pattern recognition,” Proceedings of IEEE, vol. 98, no. 6, pp. 1031–1044, 2010.
-  Z. Lin, A. Ganesh, J. Wright, M. Chen, L. Wu, and Y. Ma, “Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix,” UIUC Technical Report UILU-ENG-09-2214, 2009.
-  Y. Xu and W. Yin, “A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion,” SIAM Journal on Imaging Sciences, vol. 6, no. 3, pp. 1758–1789, 2013.
-  C. Ma, X. Yang, C. Zhang, and M.-H. Yang, “Long-term correlation tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015.
-  Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, and D. Tao, “MUlti-Store Tracker (MUSTer): a cognitive psychology inspired approach to object tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015.
-  M. Danelljan, G. Hager, F. Khan, and M. Felsberg, “Accurate scale estimation for robust visual tracking,” in Proceedings of British Machine Vision Conference, 2014.
-  C. Li, H. Cheng, S. Hu, X. Liu, J. Tang, and L. Lin, “Learning collaborative sparse representation for grayscale-thermal tracking,” IEEE Transactions on Image Processing, vol. 25, no. 12, pp. 5743–5756, 2016.
-  T. Zhang, C. Xu, and M.-H. Yang, “Multi-task correlation particle filter for robust object tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  H.-U. Kim and C.-S. Kim, “Locator-checker-scaler object tracking using spatially ordered and weighted patch descriptor,” IEEE Transactions on Image Processing, vol. 26, no. 8, pp. 3817–3830, 2017.
-  R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” in Proceedings of Annual Conference on Neural Information Processing Systems, 2013.
-  X. Mei and H. Ling, “Robust visual tracking and vehicle classification via sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 11, pp. 2259–2272, 2011.
-  W. Zhong, H. Lu, and M.-H. Yang, “Robust object tracking via sparsity-based collaborative model,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2012.
-  C. Bao, Y. Wu, H. Ling, and H. Ji, “Real time robust tracking using accelerated proximal gradient approach,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2012.
-  M. Kristan, R. Pflugfelder, et al., “The visual object tracking VOT2014 challenge results,” in Proceedings of European Conference on Computer Vision Workshop, 2014.
-  P. Liang, E. Blasch, and H. Ling, “Encoding color information for visual tracking: Algorithms and benchmark,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5630–5644, 2015.
-  N. Wang and D.-Y. Yeung, “Learning a deep compact image representation for visual tracking,” in Proceedings of Annual Conference on Neural Information Processing Systems, 2013.
-  C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang, “Hierarchical convolutional features for visual tracking,” in Proceedings of IEEE International Conference on Computer Vision, 2015.
-  M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg, “Beyond correlation filters: Learning continuous convolution operators for visual tracking,” in Proceedings of European Conference on Computer Vision, 2016.
-  H. Nam and B. Han, “Learning multi-domain convolutional neural networks for visual tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. Torr, “Staple: Complementary learners for real-time tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg, “Learning spatially regularized correlation filters for visual tracking,” in Proceedings of the IEEE International Conference on Computer Vision, 2015.
-  S. Hare, S. Golodetz, A. Saffari, V. Vineet, M.-M. Cheng, S. L. Hicks, and P. H. Torr, “Struck: Structured output tracking with kernels,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 2096–2109, 2016.
-  J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
-  X. Jia, H. Lu, and M.-H. Yang, “Visual tracking via adaptive structural local sparse appearance model,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2012.
-  J. Kwon and K. M. Lee, “Visual tracking decomposition,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010.
-  P. Perez, C. Hue, J. Vermaak, and M. Gangnet, “Color-based probabilistic tracking,” in Proceedings of European Conference on Computer Vision, 2002.
-  S. Oron, A. Bar-Hillel, D. Levi, and S. Avidan, “Locally orderless tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2012.
-  Y. Li and J. Zhu, “A scale adaptive kernel correlation filter tracker with feature integration,” in Proceedings of European Conference on Computer Vision Workshop, 2014.
-  A. Lukezic, T. Vojir, L. C. Zajc, J. Matas, and M. Kristan, “Discriminative correlation filter with channel and spatial reliability,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  C. Sun, D. Wang, H. Lu, and M.-H. Yang, “Correlation tracking via joint discrimination and reliability learning,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.