Generic short-term object tracking has been a popular research topic in computer vision for the last few years, and trackers based on the Discriminative Correlation Filter (DCF) have been particularly successful in applications with time constraints [24, 11, 7, 15]. However, in RGB tracking there are fundamental difficulties that can be solved with the help of depth (D) information, occlusion handling being the most obvious. Additionally, RGBD sensors are popular in robotics, where 3D object tracking also has many important applications, e.g., object manipulation and grasping.
There has been a surprisingly small number of works on RGBD tracking since the introduction of the first dedicated benchmark for RGBD tracking, the Princeton RGBD Tracking dataset . The dataset authors also proposed baseline trackers for 2D (depth as an additional cue) and 3D tracking (3D bounding box). More recently, particle filter based methods have been proposed by Meshgi et al.  and Bibi et al. , but both are too slow for real-time applications. Instead, we adopt the Discriminative Correlation Filter (DCF) approach, since it has proven fast and successful in RGB tracking. The two other DCF based RGBD works, to the best of our knowledge, are Hannuna et al.  and An et al. , which are both among the top performers on the Princeton dataset.
In the aforementioned RGBD tracking methods, depth has been used as a mere additional cue for tracking, but intuitively the most important role of depth information is in accurate and robust occlusion handling. In our work, instead of separating occlusion handling and tracking, we unite the two processes by integrating occlusion handling with correlation filters, in the sense that the spatial filter supports are dynamically altered using depth based segmentation. In this work we make the following novel contributions:
- We propose a novel occlusion handling mechanism based on the object's depth distribution and the DCF's response history. By detecting strong occlusions, our tracker avoids corrupting the object model.
- We propose depth masking for DCF, which avoids using the occluded regions in matching and therefore provides more reliable tracking scores.
- We experimentally validate the Depth Masked Discriminative Correlation Filter (DM-DCF) tracker on the Princeton RGBD Tracking benchmark, where it ranks among the state-of-the-art algorithms, ranking first in multiple categories while being faster than the other top performing methods in the benchmark.
- Our code will be made publicly available to facilitate fair comparisons.
II Related Work
Object Tracking – Existing trackers for generic short-term object tracking on RGB videos can be grouped into two main categories, generative (matching with an updated target model) and discriminative (classifier based) methods, depending on how the tracked object (target region) is modelled and localized. A generative tracker represents the target as a collection of features from previously tracked frames and matches them to the search region in the current frame. A few prominent examples of this family of trackers are Incremental Visual Tracking (IVT) , Structural Sparse Tracking (SST)  and kernel-based object tracking (mean shift) . Generative trackers build a model from positive examples, e.g. tracked regions, but false matches occur if the background or other objects have similar texture properties. This issue is addressed in discriminative trackers by continuously learning and updating a classifier that distinguishes the target from the background. In recent years, the discriminative approach has been more popular and has produced many well-performing trackers such as Tracking-Learning-Detection (TLD) , Continuous Convolutional Operators Tracking (C-COT) , Multi-Domain Convolutional Neural Network (MDNet) , ECO  and CSR-DCF . Due to its mathematical simplicity, efficiency and superior performance, we adopted the DCF based approach as our baseline, allowing us to achieve a fast throughput rate while retaining accuracy comparable to more complicated algorithms, as proven in VOT 2017 .
Object Tracking with Depth – The number of research papers on generic RGBD (color + depth) object tracking is surprisingly limited, despite the fact that depth sensors are ubiquitous and the problem has apparent applications in robotics. One of the reasons is the lack of suitable datasets until recent years. In 2013, Song et al.  introduced the Princeton Dataset for RGBD tracking and nine variations of a baseline tracker. They presented two main approaches: depth as an additional cue and point cloud tracking. In the first case, depth is added as an additional dimension to the Histogram of Oriented Gradients (HOG) feature space; in the second case, tracking is based on 3D point clouds and also produces a 3D bounding box. Their best method performs well, but it contains heuristic processing stages and its speed (0.26 FPS) makes the algorithm unsuitable for real-time applications.
Another RGBD method was recently proposed by Meshgi et al. , who tackled the problem with an occlusion-aware particle filter framework employing a probabilistic model. Although their algorithm is among the top performers on the Princeton Benchmark, the complexity of their model, the number of parameters to be tuned and the slowness of the algorithm (0.9 FPS) make it impractical for many applications.
Bibi et al.  suggested a part-based sparse tracker within a particle filter framework. They represented particles as 3D cuboids that consist of sparse linear combinations of dictionary elements which are not updated over time. In the case of no occlusion, their method first finds a rough estimate of the target location using 2D optical flow. Following this, particles are sampled in rotation and translation space. To detect occlusions, they adopted a heuristic which states that the number of points in the 3D cuboid of the target should decrease below a threshold. Their method currently sets the state of the art on the Princeton benchmark; however, computation times are not mentioned in the original work.
To the authors’ best knowledge there are only two RGBD trackers based on DCF, which is also used in our method. The first one was proposed by Hannuna et al. , who use depth for occlusion handling, scale and shape analysis. To this end, they first apply a clustering scheme to the depth histogram, followed by single-Gaussian depth modelling where they assume that the cluster with the smallest mean must be the object (a similar heuristic is used by Song et al. ). Another shortcoming of their algorithm is that their occlusion handling allows tracking of the occluding region, which introduces the same problems as in .
The second method was proposed by An et al. . They used depth based segmentation in parallel with Kernelized Correlation Filter (KCF)  detection and then interpreted the results to locate the target and determine the occlusion state with a heuristic approach.
The rest of the paper is organized as follows: Section III provides an overview of the proposed tracking method; Section IV reports the results of our tracker and its comparison against the state-of-the-art RGBD trackers, and also discusses ablation studies evaluating the impact of our design choices. Finally, Section V sums up our proposed method with remarks on future work. An overview of the proposed method can be seen in Fig. 1 and Alg. 1.
III Depth Masked DCF
Our approach is inspired by the recent work of Lukezic et al. , who robustified the standard DCF by introducing filter masks and won the real-time track of the VOT 2017 challenge. However, our RGBD tracker differs from their work significantly in the following respects. Firstly, their method is an RGB tracker without an occlusion handling mechanism. Secondly, we update the correlation filter mask using depth cues instead of spatial 2D priors and color segmentation. As we show in Section IV-D, our approach is clearly superior, which can be attributed to the power of the depth cue.
An overview of our DM-DCF algorithm is given below: Section III-A provides an introduction to DCF based tracking, Section III-B reports how the depth based DCF masks are created, Section III-C discusses the optimization of correlation filters with spatial constraints, and Section III-D introduces our occlusion handling mechanism.
III-A Correlation Filters
The problem to be solved in correlation filter based tracking is to find a suitable filter $h$ at discrete time point $t$ that provides the desired output for a given input image $f$ which includes the target object. The desired output $g$ can be constructed as a small, dense, 2D Gaussian at the centre of the tracked object image $f$. Optimization of the filter can be formulated as a ridge regression problem:

$$\min_{h} \; \| f \star h - g \|^2 + \lambda \| h \|^2, \qquad (1)$$

where $\lambda$ is the regularization parameter that is used to avoid overfitting to the current object appearance. A widely used technique is to assume circular repetitions of each input as $C(f)$, where $C(f)$ represents all circular shifts of $f$. This assumption leads to a fast filter optimization in the Fourier domain :

$$\hat{h} = \frac{\hat{g} \odot \bar{\hat{f}}}{\hat{f} \odot \bar{\hat{f}} + \lambda}, \qquad (2)$$

where $\hat{h}$, $\hat{f}$ and $\hat{g}$ are the Fourier transforms of the correlation filter, the input image and the desired output, respectively, $\odot$ is the element-wise Hadamard product and $\bar{(\cdot)}$ denotes the Hermitian (complex) conjugate. Computation in the Fourier domain reduces the complexity from $\mathcal{O}(N D^3)$ in the spatial domain to $\mathcal{O}(N D \log D)$ for images of $D$ pixels and $N$ examples, as reported in . However, this also enforces a special form of (1):

$$\min_{h} \; \sum_{i} \| C_i(f) \star h - g_i \|^2 + \lambda \| h \|^2, \qquad (3)$$

where $i$ runs over all circular shifts of each input $f$. Henriques et al.  also extended the above to include kernel functions that can make the tracker even more effective without any loss in computation speed.
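The closed-form filter training and the tracking response reduce to a few FFTs. Below is a minimal single-channel NumPy sketch of the unkernelized form of (2); all function and variable names are our own, not the paper's implementation.

```python
import numpy as np

def gaussian_peak(h, w, sigma=2.0):
    """Desired output g: a small 2D Gaussian centred on the target."""
    y, x = np.mgrid[0:h, 0:w]
    cy, cx = h // 2, w // 2
    return np.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))

def train_dcf(f, g, lam=1e-2):
    """Closed-form correlation filter (Fourier domain), Eq. (2) style:
    hat{h} = (hat{g} * conj(hat{f})) / (hat{f} * conj(hat{f}) + lambda)."""
    F, G = np.fft.fft2(f), np.fft.fft2(g)
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def track_response(H, f):
    """Correlate the filter with a new patch; the response peak gives
    the estimated target translation."""
    return np.real(np.fft.ifft2(H * np.fft.fft2(f)))
```

In a tracking loop one would train on the first patch, evaluate `track_response` on each new search region and move the bounding box to the response maximum.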
Another important part of correlation filter based tracking is time averaging for online adaptation, where previous appearances are retained in “tracker memory” via an exponential update $\hat{h}_t = (1-\eta)\,\hat{h}_{t-1} + \eta\,\hat{h}$, with learning rate $\eta$.
III-B Depth Masking
The masking approach in our work, which selects the active pixels (i.e. pixels that are used for DCF updates) for the DCF tracker, was inspired by the work of Lukezic et al. , who constructed an RGB cue driven mask by forming a pixel graph in which the foreground was segmented by a graph cut approach using color histograms and spatial relationships. However, their method is bound to fail in cases where background and foreground are of similar color, and it cannot detect occlusions, as can be observed in Fig. 2 and Fig. 3. In our method, on the other hand, the depth cue turns out to be clearly better for mask generation and also provides an intuitive way to detect occlusions and to switch from the tracking to the detection mode, which yields superior performance under long-term occlusions, as in Fig. 2. In the case of foreground masking of a tracked object, the correlation filter is changed to a masked correlation filter, $m \odot h$, which replaces $h$ in (1) or (3). The mask $m$ has value 1 to mark the visible (active) region of the tracked object and value 0 to inactivate pixels in the background. Another advantage of masked correlation is the fact that the border effects in cyclic correlation can be removed if the mask is made larger than the current bounding box  (up to the size of the whole input frame).
We construct our mask using probabilistic representations $p_{fg}(d)$ and $p_{bg}(d)$ of the foreground object and the background (note that background in our case means scene elements both closer and further away from the object), respectively. In its simplest form, the mask can be generated from foreground probability ratios:

$$m(\mathbf{x}) = \left[ \frac{p_{fg}(d(\mathbf{x}))}{p_{bg}(d(\mathbf{x}))} > \tau \right]. \qquad (4)$$

However, we found another approach based on adaptive thresholding to work better, since it avoids setting the threshold $\tau$. We assign each mask pixel its probability ratio value, which produces a “foreground probability image”, and this probability image is thresholded into a binary foreground mask by the adaptive method of Otsu .
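The masking step can be sketched as follows. This sketch assumes single-Gaussian depth likelihoods and uses the normalized ratio $p_{fg}/(p_{fg}+p_{bg})$ so that the probability image is bounded in $[0,1]$ before Otsu thresholding; that normalization and all names are choices of this sketch, not necessarily the paper's exact formulation.

```python
import numpy as np

def otsu_threshold(img, nbins=64):
    """Otsu's adaptive threshold: pick the split that maximizes the
    between-class variance of the histogram."""
    hist, edges = np.histogram(img, bins=nbins)
    hist = hist.astype(float) / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    best_t, best_var = centers[0], -1.0
    for i in range(1, nbins):
        w0, w1 = hist[:i].sum(), hist[i:].sum()
        if w0 == 0 or w1 == 0:
            continue
        m0 = (hist[:i] * centers[:i]).sum() / w0
        m1 = (hist[i:] * centers[i:]).sum() / w1
        var = w0 * w1 * (m0 - m1) ** 2  # between-class variance
        if var > best_var:
            best_var, best_t = var, centers[i]
    return best_t

def depth_mask(depth, fg, bg, eps=1e-12):
    """Binary foreground mask from foreground/background depth likelihoods.
    fg, bg: (mean, std) of the two single-Gaussian depth models."""
    def gauss(d, mu, sigma):
        return np.exp(-0.5 * ((d - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    p_fg, p_bg = gauss(depth, *fg), gauss(depth, *bg)
    ratio = p_fg / (p_fg + p_bg + eps)  # "foreground probability image"
    return ratio > otsu_threshold(ratio)
```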
For the probability distribution estimation we tested both single Gaussian and Gaussian Mixture Models, but found single Gaussians to perform better; an additional benefit is their fast online update rules. Our distributions are $p_{fg}(d) = \mathcal{N}(\mu_{fg}, \sigma_{fg}^2)$ and $p_{bg}(d) = \mathcal{N}(\mu_{bg}, \sigma_{bg}^2)$, whose parameters are updated by the following rules:

$$\mu_{fg} \leftarrow (1-\alpha_{fg})\,\mu_{fg} + \alpha_{fg}\,\tilde{\mu}_{fg}, \qquad \sigma_{fg} \leftarrow (1-\alpha_{fg})\,\sigma_{fg} + \alpha_{fg}\,\tilde{\sigma}_{fg}, \qquad (5)$$
$$\mu_{bg} \leftarrow (1-\alpha_{bg})\,\mu_{bg} + \alpha_{bg}\,\tilde{\mu}_{bg}, \qquad \sigma_{bg} \leftarrow (1-\alpha_{bg})\,\sigma_{bg} + \alpha_{bg}\,\tilde{\sigma}_{bg}, \qquad (6)$$

where $\alpha_{fg}$ and $\alpha_{bg}$ are fixed update rates and $\tilde{\mu}$, $\tilde{\sigma}$ denote the sample statistics of the current frame.
To construct the new distribution for the foreground, the depth values inside the current mask are picked. In the first frame, however, the ground truth bounding box provided by the dataset is used to create the initial distributions.
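The online update of a depth model can be sketched as below; this is a minimal illustration in which `DepthModel`, its default rate and the use of simple per-frame mean/std as the sample statistics are our own assumptions.

```python
import numpy as np

class DepthModel:
    """Running single-Gaussian depth model with a fixed update rate alpha."""

    def __init__(self, depths, alpha=0.05):
        self.mu = float(np.mean(depths))
        self.sigma = float(np.std(depths) + 1e-6)  # avoid degenerate sigma
        self.alpha = alpha

    def update(self, depths):
        """Blend the statistics of the newly masked depth pixels into the
        running mean/std with fixed rate alpha (exponential forgetting)."""
        self.mu = (1 - self.alpha) * self.mu + self.alpha * float(np.mean(depths))
        self.sigma = (1 - self.alpha) * self.sigma + self.alpha * float(np.std(depths) + 1e-6)
```

Two instances (foreground and background) would be kept, each updated only from its own pixel set.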
III-C Filter Optimization
Masking means substituting the masked filter $m \odot h_m$ for $h$ and the circular function (3) into (1), which leads to the constrained objective

$$\arg\min_{h_c, h_m} \; \| \hat{f} \odot \bar{\hat{h}}_c - \hat{g} \|^2 + \frac{\lambda}{2} \| h_m \|^2 \quad \text{s.t.} \quad h_c = m \odot h_m,$$

that can be written in the Fourier domain as

$$\arg\min_{\hat{h}_c, h_m} \; \| \hat{f} \odot \bar{\hat{h}}_c - \hat{g} \|^2 + \frac{\lambda}{2} \| h_m \|^2 \quad \text{s.t.} \quad \hat{h}_c = \sqrt{D}\, F (m \odot h_m), \qquad (9)$$

where $m$ and $h_m$ are defined in the spatial domain and $F$ is the orthonormal Fourier operator matrix scaled by the square root of the number of elements $D$ in the signal, i.e. $\hat{x} = \sqrt{D}\, F x$. A direct solution is very inefficient in the Fourier domain ($\mathcal{O}(D^3)$) and therefore the primal solution in the spatial domain is faster. However, (9) is in a form where a Lagrange multiplier can be introduced and the Alternating Direction Method of Multipliers (ADMM) adopted . The augmented Lagrangian method uses the following unconstrained objective

$$\mathcal{L}(\hat{h}_c, h_m, \hat{l}) = \| \hat{f} \odot \bar{\hat{h}}_c - \hat{g} \|^2 + \frac{\lambda}{2} \| h_m \|^2 + \left[ \hat{l}^{H} (\hat{h}_c - \hat{h}_m) + \overline{\hat{l}^{H} (\hat{h}_c - \hat{h}_m)} \right] + \mu \| \hat{h}_c - \hat{h}_m \|^2,$$

where $\hat{h}_m = \sqrt{D}\, F (m \odot h_m)$, $\mu$ is the penalty term affecting the convergence and $\hat{l}$ is the Lagrange multiplier updated on each iteration. Optimization iteratively updates the estimates $\hat{h}_c$ and $h_m$ and the multiplier using the rule

$$\hat{l} \leftarrow \hat{l} + \mu \left( \hat{h}_c - \hat{h}_m \right).$$
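The alternating updates can be sketched as follows. This is a simplified single-channel sketch with a fixed penalty $\mu$ and a fixed iteration count (schemes of this kind often schedule $\mu$ and use multiple feature channels); all names and defaults are our own assumptions, not the paper's implementation.

```python
import numpy as np

def masked_filter_admm(f, g, mask, lam=1e-2, mu=5.0, iters=4):
    """ADMM-style optimization of a mask-constrained correlation filter.
    f: input patch, g: desired response, mask: binary spatial support."""
    F, G = np.fft.fft2(f), np.fft.fft2(g)
    D = f.size
    # initialise the masked spatial filter from the closed-form solution
    h_m = mask * np.real(np.fft.ifft2(G * np.conj(F) / (F * np.conj(F) + lam)))
    L = np.zeros_like(F)  # Lagrange multiplier (Fourier domain)
    for _ in range(iters):
        Hm = np.fft.fft2(h_m)
        # closed-form update of the unconstrained Fourier-domain filter
        Hc = (G * np.conj(F) + mu * Hm - L) / (F * np.conj(F) + mu)
        # masked spatial update: keep only pixels inside the depth mask
        h_m = mask * np.real(np.fft.ifft2(L + mu * Hc)) / (lam / (2 * D) + mu)
        # multiplier update drives the two estimates to agree
        L = L + mu * (Hc - np.fft.fft2(h_m))
    return h_m
```

By construction the returned filter has zero support outside the mask, so occluded or background pixels never contribute to matching.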
III-D Occlusion Detection
Detecting heavy occlusions and consequently stopping model updates is a vital part of occlusion-aware tracking. This process allows the DCF to avoid model pollution, which eventually leads to drifting. In RGB-based DCF, the main source of information for occlusion handling is the correlation filter response at the maximum location, since a rapid decrease can be considered evidence of an occlusion [1, 10]. To include tracker based occlusion detection, we calculate a running mean of the maximum responses, where the maximum of the current frame is added iteratively:

$$\bar{r}_t = (1-\gamma)\,\bar{r}_{t-1} + \gamma\, r_t,$$

where $r_t$ is the maximum filter response of the current frame and $\gamma$ is a fixed update rate. The main drawback of the above tracker response based occlusion detection is the implicit assumption that occlusions occur faster than model appearance changes. This does not have to be true and might cause false occlusion detections.
To compensate for the weakness of filter response based occlusion handling, we introduce a depth cue based occlusion detection which is simple and efficient. Intuitively, all pixels that pass our probability based mask generation in Section III-B represent depth values at which the target object appears. We can easily define the amount of occlusion to be allowed by enforcing a threshold on the number of visible pixels in $m$ (kept fixed in all our experiments). This depth based occlusion detection comes without any extra cost, since the information is already available from the masking stage.
Our final occlusion detection combines the filter response based and the depth based detectors: an occlusion is declared only if both detectors are triggered, i.e. the filter response falls below the given fraction of its moving average and the number of pixels supporting the object depth likewise falls below the given fraction of the bounding box region. If an occlusion is detected, the filter update is stopped and the system switches into full image detection mode (occlusion recovery). The occlusion recovery mode makes no assumptions about the object’s reappearance probability and runs as long as the target object is absent from the scene.
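The combined decision logic can be sketched as below; `OcclusionDetector`, the fractions `resp_frac`/`area_frac` and the rate `gamma` are placeholder names and values of this sketch and deliberately do not restate the paper's thresholds.

```python
import numpy as np

class OcclusionDetector:
    """Declare occlusion only when BOTH cues fire: a drop of the peak DCF
    response below a fraction of its running mean, and a drop of the
    depth-supported pixel count below a fraction of the bounding box area."""

    def __init__(self, gamma=0.1, resp_frac=0.5, area_frac=0.25):
        self.gamma = gamma          # update rate of the response running mean
        self.mean_resp = None
        self.resp_frac = resp_frac
        self.area_frac = area_frac

    def step(self, max_resp, mask, bbox_area):
        """max_resp: peak filter response; mask: binary depth mask of the
        search region; bbox_area: area of the current bounding box."""
        if self.mean_resp is None:
            self.mean_resp = max_resp
        resp_drop = max_resp < self.resp_frac * self.mean_resp
        depth_drop = mask.sum() < self.area_frac * bbox_area
        occluded = bool(resp_drop and depth_drop)
        if not occluded:
            # freeze the running mean during occlusion to avoid pollution
            self.mean_resp = (1 - self.gamma) * self.mean_resp + self.gamma * max_resp
        return occluded
```

When `step` returns `True`, the tracker would stop model updates and switch to full-image detection until the target is re-found.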
IV Experimental Results

In this section, we present an extensive evaluation of the proposed method. Section IV-A provides implementation details, Section IV-B overviews the dataset and the metrics used for the evaluation, Section IV-C discusses the results, and Section IV-D compares different variants of the proposed method in an ablation study.
IV-A Implementation Details
To make our results directly comparable to the state of the art, we selected the same three RGB features as in  (CSR-DCF): HOG , Color Names  and gray level pixel values. We also adopt the same parameter values as in the original CSR-DCF, except that the DCF filter update rate ($\eta$) is set to 0.03. The update rates ($\alpha_{fg}$, $\alpha_{bg}$) for the probability distributions $p_{fg}$ and $p_{bg}$ are likewise fixed. All parameters were kept constant in our experiments, which were run as non-optimized Matlab code on an Intel Core i7 3.6GHz laptop with Ubuntu 16.04. Our processing speed is calculated on an average sequence in which occluded frames make up 25% of all frames.
IV-B Dataset and Evaluation Metrics
The Princeton RGBD dataset  consists of 100 sequences from 11 categories, and the authors provide ground truth for only five videos. Methods are evaluated by uploading the results to a dedicated evaluation server. Results for the other methods were taken from the online leaderboard at the Princeton website, with the exception of An et al. , who have not registered their method; their numbers were therefore taken directly from the respective paper.
Bibi et al.  and Hannuna et al.  reported that a number of the Princeton RGBD dataset videos have synchronization errors between the RGB and depth frames. In addition, some of the sequences require bounding box re-alignment, as the pixel correspondences between RGB and depth frames were erroneous. In their experiments, Hannuna et al. and Bibi et al. used rectified versions of the dataset, and we therefore found it fair to use their corrected sequences in our evaluations.
The evaluation uses the widely adopted Intersection over Union (IOU) metric proposed by the authors of the Princeton dataset, similar to the one used in the VOT RGB dataset .
| Alg. | Avg Rank | Human | Animal | Rigid | Large | Small | Slow | Fast | Occl. | No Occl. | Pass. Motion | Act. Motion | FPS |
|------|----------|-------|--------|-------|-------|-------|------|------|-------|----------|--------------|-------------|-----|
| 3D-T  | 2.81 | 0.81 (1) | 0.64 (4) | 0.73 (5) | 0.80 (1) | 0.71 (3) | 0.75 (5) | 0.75 (1) | 0.73 (1) | 0.78 (5) | 0.79 (3) | 0.73 (2) | N/A |
| RGBDOcc+OF  | 3.27 | 0.74 (4) | 0.63 (5) | 0.78 (1) | 0.78 (3) | 0.70 (4) | 0.76 (2) | 0.72 (3) | 0.72 (2) | 0.75 (6) | 0.82 (2) | 0.70 (4) | 0.26 |
| OAPF  | 3.45 | 0.64 (6) | 0.85 (1) | 0.77 (3) | 0.73 (5) | 0.73 (2) | 0.85 (1) | 0.68 (6) | 0.64 (6) | 0.85 (1) | 0.78 (4) | 0.71 (3) | 0.9 |
| Our | 3.63 | 0.76 (3) | 0.58 (6) | 0.77 (2) | 0.72 (6) | 0.73 (1) | 0.75 (4) | 0.72 (4) | 0.69 (3) | 0.78 (4) | 0.82 (1) | 0.69 (6) | 8.3 |
| DLST  | 3.63 | 0.77 (2) | 0.69 (3) | 0.73 (6) | 0.80 (2) | 0.70 (6) | 0.73 (6) | 0.74 (2) | 0.66 (4) | 0.85 (2) | 0.72 (6) | 0.75 (1) | 4.6 |
| DS-KCF-Shape  | 4.18 | 0.71 (5) | 0.71 (2) | 0.74 (4) | 0.74 (4) | 0.70 (5) | 0.76 (3) | 0.70 (5) | 0.65 (5) | 0.81 (3) | 0.77 (5) | 0.70 (5) | 35.4 |
| CSR-DCF  | 10.55 | 0.53 (9) | 0.56 (11) | 0.68 (12) | 0.55 (12) | 0.62 (9) | 0.66 (12) | 0.56 (10) | 0.45 (14) | 0.79 (6) | 0.67 (12) | 0.56 (9) | 13.6 |
IV-C Results
The results of our tracker and the best performing other trackers on the Princeton RGBD dataset are given in Table I. As can be seen, our method performs on par with the top performing RGBD trackers (OAPF, RGBDOcc+OF and 3D-T) while being an order of magnitude faster. Among the fast trackers (ours, DLST, DS-KCF and CSR-DCF), ours and DLST are the best with equal average rank, but our method is faster. In addition, our method wins in two categories: Small and Passive Motion. These results indicate that our depth masked DCF is a suitable tracker for applications where the balance between performance and speed is important.
The advantages of using the depth channel to complement 2D information are evident, as our DM-DCF outperforms its RGB competitor, CSR-DCF, in almost all categories with a clear margin. The only category in which CSR-DCF performs better is the “no occlusion” category, where the benefits of the depth cue are understandably unnecessary. Compared to the other DCF based methods, DS-KCF-Shape  and DLST , DM-DCF performs considerably better in the Occlusion category. This shows that our occlusion handling mechanism is more powerful, as we use the maximum DCF response score history in conjunction with foreground segmentation based on two separate probability distributions, instead of a single frame response score and a single distribution.
IV-D Ablation Study
We conducted a set of ablation studies to support our design choices. Moreover, we also evaluated our algorithm on the original Princeton Dataset, which has a considerable amount of registration and synchronization errors. We report the accuracy for the following variants:
- CSR-DCF – The state-of-the-art RGB tracker by Lukezic et al. 
- DM-DCF – Our method with all the proposed components switched off.
- +occ – Depth based occlusion handling switched on.
- +mask – Depth based masking added, i.e. the full DM-DCF.
As can be seen in Fig. 4, adding depth based masking and occlusion handling improves the results substantially.
An important finding for future work is that, in general, different algorithms favor certain categories and motion types. Comparing the results of all methods in Table I, most methods favor rigid motion over non-rigid motion (rigid vs. animal categories). This can be explained by the fact that the parameters are kept constant for all 95 test sequences and the adopted parameters favor rigid object tracking; shape changes and adaptation speeds are different for non-rigid objects such as animals.
A similar trade-off can be seen in occlusion vs. no occlusion: an improvement in the occlusion sequences means a slight degradation of tracker performance in the no-occlusion cases. These observations suggest adopting adaptive parameters in future work, so that the tracker adjusts its parameters on the fly according to the target object.
V Conclusions

In this paper, we proposed the Depth Masked Discriminative Correlation Filter (DM-DCF) RGBD tracker, which uses the depth cue to detect occlusions (enabling switching from the tracking to the detection mode) and to construct a spatial mask that improves DCF tracking. To this end, we are the first to use depth based segmentation masks inherently in the DCF formulation, extracting target regions for filter updates. Comparison and ablation studies on the publicly available Princeton RGBD Benchmark dataset verified that our tracker is on par with the state of the art while providing a clearly better frame rate than the top performers.
-  N. An, X.-G. Zhao, and Z.-G. Hou. Online rgb-d tracking via detection-learning-segmentation. In ICPR, 2016.
-  A. Bibi, T. Zhang, and B. Ghanem. 3d part-based sparse tracker with automatic synchronization and registration. In CVPR, 2016.
-  D. Bolme, J. Beveridge, B. Draper, and Y. Lui. Visual object tracking using adaptive correlation filters. In CVPR, 2010.
-  S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3, 2010.
-  D. Comaniciu, V. Ramesh, and P. Meer. Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 25, pages 564–577, 2003.
-  N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
-  M. Danelljan, G. Bhat, F. Shahbaz Khan, and M. Felsberg. Eco: Efficient convolution operators for tracking. In CVPR, 2017.
-  M. Danelljan, A. Robinson, F. Shahbaz Khan, and M. Felsberg. Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking. In ECCV, 2016.
-  H. Galoogahi, T. Sim, and S. Lucey. Correlation filters with limited boundaries. In CVPR, 2015.
-  S. Hannuna, M. Camplani, J. Hall, M. Mirmehdi, D. Damen, T. Burghardt, A. Paiement, and L. Tao. Ds-kcf: a real-time tracker for rgb-d data. In Journal of Real-Time Image Processing, 2016.
-  J. Henriques, R. Caseiro, P. Martins, and J. Batista. High-Speed Tracking with Kernelized Correlation Filters. In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 37, pages 1–14, 2014.
-  C. Hester and D. Casasent. Multivariant technique for multiclass pattern recognition. Applied Optics, volume 19, pages 1758–1761, 1980.
-  Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-Learning-Detection. In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 34, pages 1409–1422, 2011.
-  M. Kristan, J. Matas, G. Nebehay, F. Porikli, and L. Cehovin. A Novel Performance Evaluation Methodology for Single-Target Trackers. In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 38, pages 2137–2155, 2016.
-  A. Lukezic, T. Vojir, L. Cehovin, J. Matas, and M. Kristan. Discriminative correlation filter with channel and spatial reliability. In CVPR, 2017.
-  K. Meshgi, S. I. Maeda, S. Oba, H. Skibbe, Y. Z. Li, and S. Ishii. An occlusion-aware particle filter tracker to handle complex and persistent occlusions. Computer Vision and Image Understanding, 150:81–94, 2016.
-  H. Nam and B. Han. Learning Multi-Domain Convolutional Neural Networks for Visual Tracking. In CVPR, 2016.
-  N. Otsu. A threshold selection method from gray-level histograms. IEEE Trans. Sys., Man., Cyber., 1979.
-  D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang. Incremental visual tracking. International Journal of Computer Vision (IJCV), 77:125–141, 2008.
-  A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 36, pages 1442–1468, 2014.
-  S. Song and J. Xiao. Tracking revisited using rgbd camera: Unified benchmark and baselines. In ICCV, 2013.
-  J. van de Weijer, C. Schmid, J. Verbeek, and D. Larlus. Learning color names for real-world applications. IEEE Transactions on Image Processing, volume 18, pages 1512–1523, 2009.
-  T. Zhang, S. Liu, C. Xu, S. Yan, B. Ghanem, N. Ahuja, and M.-H. Yang. Structural sparse tracking. In CVPR, 2015.
-  M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Cehovin Zajc, T. Vojir, G. Hager, A. Lukezic, A. Eldesokey, and G. Fernandez. The Visual Object Tracking VOT2017 Challenge Results. In ICCV, 2017.