The recent emergence of cost-effective depth sensors has attracted significant attention from researchers in the computer vision and robotics communities. Depth sensors have advantages over visible-light cameras: depth images are insensitive to variations in texture and illumination, and they carry 3D structural information that reflects shape cues and geometry.
There is a growing body of research on topics such as human detection and human tracking using 3D information. Ikemura et al. proposed a method for human detection that extracts relational depth similarity features and constructs a classifier, which is then used for detecting humans. Xia et al. proposed a human detection method for indoor environments using depth information. The method first identifies candidate regions that may contain humans, which are then verified using a 3D head model. Zhang et al. explored the Kinect for detecting falls in elderly people. The method adopts a known background subtraction method for detecting people and analyzes the trajectory to detect a fall.
Instead of using only depth maps, some researchers have exploited both RGB and depth channels for human detection and tracking. Han et al. proposed a method for detecting and tracking humans using both RGB and depth channels. The method locates the object in the depth map and then extracts the corresponding visual features from the RGB data. These visual features are then used for tracking objects in successive frames of RGB data. Spinello et al. extended the idea of Histogram of Oriented Gradients (HOG) and proposed a conceptually similar Histogram of Oriented Depths (HOD) method for human detection. Enzweiler et al. developed a stereo system that combines intensity images, stereo disparity maps, and optical flow for detecting and tracking people. Song et al. developed an RGB-D dataset for tracking objects. They implemented and tested different algorithms, such as HOG with optical flow, on RGB-D data for tracking objects.
Graph-based object tracking methods have been proposed in the past for RGB videos. Gomila and Meyer proposed a graph-based object tracking method whereby each frame is represented as a region adjacency graph, so that tracking becomes a graph-matching problem. Wang and Nevatia proposed a tracking method based on superpixels. A constellation appearance model is constructed from visual features extracted from superpixels, and tracking is then performed using a Dynamic Bayesian Network. Recently, Yang et al. proposed a tracking method using superpixels. They used mid-level cues for distinguishing between the target and the background. Tracking is then achieved by formulating a map between target and background.
In this paper, we apply a region-graph-based approach to object detection and tracking in depth videos. The proposed approach shows an improvement of about 9% over existing methods. Further, depth maps are very noisy and contain high-magnitude noise that is difficult to remove using spatial filters. We propose a method to remove such high-magnitude noise from depth maps and observe an improvement of about 23% in tracking results over those obtained without noise suppression.
2 Proposed Method
In this section, the region-graph-based approach for region of interest (ROI) detection and tracking is discussed in detail.
2.1 Noise Suppression
Noise in depth videos can be classified into three categories: (i) noise from sensors, (ii) noise from the boundaries of objects, and (iii) holes in the depth map (caused by fast movements, random surfaces, etc.). Noise arising from sensor variation is evenly distributed throughout the depth map and is of low magnitude. This type of noise can be easily removed using spatial filters such as a Gaussian filter. On the other hand, noise arising from object boundaries and holes in the depth map is not evenly distributed, and its magnitude is much higher than that of sensor noise. This type of noise is difficult to remove using spatial filters. Based on temporal learning, Xia et al. proposed a filtering algorithm for removing such high-magnitude noise. Figure 1(c) shows a snapshot of the depth map filtered using Xia et al.'s method. The method is able to remove some, but not all, of the high-magnitude noise. In this section, a method is proposed to remove high-magnitude noise arising from object boundaries and holes in the depth map.
The proposed method first applies a 2D Gaussian smoothing filter to remove noise arising from sensors. It then segments the depth map into regions using the morphological watershed segmentation (MWS) proposed by Meyer et al. (any other image segmentation algorithm can be used). Let us assume that closed regions and background regions are detected in the spatially filtered depth map using MWS (we assume that the depth map has at least one region). Background regions are the ones for which at least boundary points lie on the border of the depth map. Some of the closed regions might enclose other closed regions. We represent enclosed regions as , enclosing regions as , and the remaining closed regions as independent regions such that . Figure 2 shows the different possible regions. Region and can be merged, and the merging decision depends solely on the application. For instance, region and cannot be merged for human part-based tracking, while for complete human tracking, and can be merged. In the proposed method, we are interested in complete object tracking rather than part-based tracking and hence, we merge regions and .
Now, the proposed method computes the area of each closed region as: .
Noise arising from object boundaries and holes in the depth map will also create regions. Though magnitude of and will be higher in comparison to , it has lower magnitude in contrast to actual object size. Figure 3 shows a plot of the areas of the closed regions contained in Figure 1(b). In Figure 3, we can see that the area of the regions corresponding to noise is much smaller (number of pixels lies between and ) than that of the regions corresponding to objects (number of pixels lies between and ). A threshold-based approach is then applied to remove noise from the spatially filtered depth map as:
where is the threshold for eliminating the noise arising from boundary of objects and holes in depth map. is a function of region area and can be computed as , .
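The area-based thresholding step can be illustrated with a minimal sketch. It assumes the segmentation step has already produced a label map; the names `suppress_small_regions`, `min_area`, and `fill_value` are hypothetical, and a fixed area threshold stands in for the area-dependent threshold described above. Every region is treated alike here, whereas the method described above computes areas for closed regions only.

```python
from collections import Counter


def suppress_small_regions(depth, labels, min_area, fill_value=0):
    """Blank out every labelled region whose pixel count is below min_area.

    depth and labels are equally sized 2D lists; labels[r][c] is the region
    id produced by any segmentation step (watershed in the paper).
    min_area and fill_value are hypothetical parameters of this sketch.
    """
    # Area of a region = number of pixels carrying its label.
    areas = Counter(label for row in labels for label in row)
    noisy = {label for label, area in areas.items() if area < min_area}
    cleaned = [
        [fill_value if labels[r][c] in noisy else depth[r][c]
         for c in range(len(depth[r]))]
        for r in range(len(depth))
    ]
    return cleaned, areas


# Toy 3x4 depth map: region 1 is an object, regions 2 and 3 are small blobs.
depth = [
    [5, 5, 5, 9],
    [5, 5, 5, 5],
    [7, 7, 5, 5],
]
labels = [
    [1, 1, 1, 2],
    [1, 1, 1, 1],
    [3, 3, 1, 1],
]
cleaned, areas = suppress_small_regions(depth, labels, min_area=3)
```

In this toy example, regions 2 and 3 (areas 1 and 2) fall below the threshold and are blanked, while region 1 (area 9) is kept.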
2.2 Good features to track
An object traverses a very limited space from time to time . We use this important clue to map regions between two consecutive frames and identify the ROI. Let and be the undirected graphs for the frames at time and . The displacement between two arbitrary nodes, and , with region and respectively is defined as: . This measure tells us how far a region has traversed from time to . Furthermore, the region mapping process is sped up by including the threshold . If , then we map the regions and (the empirical value of in our experiments is 80).
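A minimal sketch of the region-mapping idea follows. It assumes that the displacement is the Euclidean distance between region centroids and that two regions are mapped when this displacement is below the threshold; both are assumptions of the sketch, as are the names `centroid`, `map_regions`, and `max_disp`. Only the value 80 comes from the text.

```python
import math


def centroid(pixels):
    """Mean (row, col) of a region given as a list of (row, col) pixels."""
    n = len(pixels)
    return (sum(p[0] for p in pixels) / n, sum(p[1] for p in pixels) / n)


def map_regions(regions_prev, regions_next, max_disp=80.0):
    """Pair each region at time t with the nearest region at time t+1.

    A pair is kept only if the centroid displacement is below max_disp.
    Using centroid distance as the displacement is an assumption.
    """
    mapping = {}
    for rid_prev, pix_prev in regions_prev.items():
        c_prev = centroid(pix_prev)
        best = None
        for rid_next, pix_next in regions_next.items():
            d = math.dist(c_prev, centroid(pix_next))
            if d < max_disp and (best is None or d < best[1]):
                best = (rid_next, d)
        if best is not None:
            mapping[rid_prev] = best
    return mapping


regions_prev = {"a": [(0, 0), (0, 2)], "b": [(100, 100)]}
regions_next = {"x": [(3, 4), (3, 6)], "y": [(300, 300)]}
mapping = map_regions(regions_prev, regions_next)
```

Region `a` is mapped to `x` because their centroids are 5 pixels apart, while `b` has no region within the threshold and stays unmapped.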
Besides region mapping, the proposed method simultaneously performs temporal tracking of the "mapped regions" in the first depth maps to determine the ROI. An ROI can be a growing region (a region whose area changes, as in the case of hand waving), a shrinking region (a region whose area changes, as in the case of zooming out), a moving region (a region whose position changes, as in the case of a rolling ball), or a combination of growing, shrinking, and movement (as in the case of box lifting). The proposed method computes from the mapped regions of the first frames. If is greater than an empirically chosen threshold value, then we mark the region as an ROI.
As the proposed method is region-based, the direction of object movement is an important clue and can help optimize performance while searching for the ROI in the next depth frame. We use a cardinal system to determine the direction of ROI movement. The cardinal system can be 4-point, 8-point, or 16-point. In our experiments, we used the 4-point cardinal system shown in Figure 4. To determine the direction of the ROI, the proposed method computes the difference between the mapped regions of two consecutive depth maps and then decides the direction of region movement using the 4-point cardinal system. Note that the direction is computed with respect to the center of the region of the depth frame at time i.e. . Figure 5 shows snapshots of two consecutive depth maps in which a person is waving a hand, together with the difference between these depth maps. From Figure 5, we can see that the direction of hand movement is North-East. Given this information, we can predict that the hand movement in the next depth frame will be towards either the East or the North, and we do not need to search in the South and West directions.
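The direction test can be sketched as follows. This is a simplified illustration that derives the direction from the shift of the region center between two frames rather than from the difference image; `cardinal_directions` and `min_shift` are hypothetical names, and image coordinates are assumed, so that rows grow towards the South.

```python
def cardinal_directions(prev_center, next_center, min_shift=1.0):
    """Return the set of 4-point cardinal directions of a centre shift.

    Centres are (row, col) in image coordinates, so rows grow towards South.
    min_shift is a hypothetical dead zone that ignores sub-pixel jitter.
    """
    d_row = next_center[0] - prev_center[0]
    d_col = next_center[1] - prev_center[1]
    directions = set()
    if d_row <= -min_shift:
        directions.add("N")
    elif d_row >= min_shift:
        directions.add("S")
    if d_col >= min_shift:
        directions.add("E")
    elif d_col <= -min_shift:
        directions.add("W")
    return directions


# The centre moves up and to the right, i.e. North-East.
moved = cardinal_directions((120.0, 80.0), (110.0, 95.0))
```

A North-East shift yields the two components North and East, which are the only directions that need to be searched in the next frame.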
The proposed method starts tracking the ROI from depth map ( depth maps are required for ROI determination). In order to track an ROI, the proposed method creates a weighted graph for each ROI. Let and denote the number of nodes and edges in graph of frame, where each node corresponds to a region in the depth map (the set of elements to be tracked) and edges connect neighbouring regions (we draw the boundary between objects using the distance transform). Let us assume that node represents the ROI. A node table is then constructed for node . The node table contains the shortest distance between the ROI and the remaining regions. The weights are then assigned to edge as:
Figure 6 shows an example of constructing a node table and assigning weights to the edges of a graph . To illustrate, let us assume that is the ROI (represented in green). The proposed method computes the distance between node and the remaining nodes of graph and constructs a node table, shown in Figure 6. Based on the values of the node table and Eq. 2, weights are assigned to the edges (see Figure 6). When an object moves, the attributes (such as area and perimeter) of adjoining regions are affected. In other words, when any movement occurs in the region corresponding to node , attributes of the edges with weight will be affected most. Movement in node will have an impact on the attributes of nodes , , , , and most. Since the attributes of nodes and are least affected by any movement in , we do not have to consider these nodes for tracking. This ultimately helps in reducing the search area for object tracking. In order to further reduce the search area, the proposed method utilizes the predicted direction of the ROI. If the ROI is moving towards the West, for instance, then regions with edges and weight in the West direction are the most probable regions for tracking. In our example, nodes and are to the West of and hence, we need to search in regions corresponding to these nodes.
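The node table can be sketched with a breadth-first search over the region adjacency graph. The sketch assumes that "shortest distance" means the number of edges between the ROI node and another node, and it restricts the search to nodes within a hypothetical `max_hops`; it does not reproduce the edge-weight rule of Eq. 2. The graph and node names are illustrative only.

```python
from collections import deque


def node_table(adjacency, roi):
    """Shortest hop distance from the ROI node to every reachable node (BFS)."""
    dist = {roi: 0}
    queue = deque([roi])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def candidate_nodes(adjacency, roi, max_hops=1):
    """Nodes close enough to the ROI to be searched in the next frame."""
    table = node_table(adjacency, roi)
    return sorted(n for n, d in table.items() if 0 < d <= max_hops)


# Hypothetical region adjacency graph with node "A" as the ROI.
adjacency = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
table = node_table(adjacency, "A")
near = candidate_nodes(adjacency, "A")
```

Only the regions adjacent to the ROI (`B` and `C` here) are searched; more distant nodes such as `D` and `E` are skipped, which is the source of the reduced search area.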
Objects can appear or disappear in a scene. The proposed method runs a background thread at a constant frame interval to check whether any ROI is appearing in or disappearing from the scene. This background thread executes the ROI detection method discussed in Section 2.2 and, if there is any change in the number of ROIs, it updates the object tracking thread. If there is more than one ROI, the proposed method starts a separate foreground thread to track each ROI. The constant frame interval in our experiments is equal to the number of frames required to detect an ROI i.e. .
2.4 Multi-object Tracking and Occlusion Detection
Multiple objects or ROIs in a scene can occlude one another if they are moving towards each other. Occlusion can be detected based on the sizes of the ROIs and the distance between them. Let and be two detected ROIs and be the Euclidean distance between and . If , then the two ROIs are about to occlude. If the area of is changing drastically while the change in area of is either zero or slight, then we mark ROI as the occludee.
Let us say that there are ROI’s with level sets . The Euclidean distance between ROI and ROI can be obtained as: . Now, we compute change in the area of ROI as: , where denotes the area of ROI at time .
The proposed method then computes the occlusion detection parameter between ROI and ROI as: . If , then we say that occlusion occurs where is the threshold parameter for occlusion detection. We conducted experiments on samples with two or more humans to adjust the value of . We varied the value of from 0 to 1 in our experiments. When we set , we were able to detect the occlusion as soon as it occurs. However, when we set , we were able to detect the occlusion but only when at least 25% of the region of two humans were overlapping. Hence, we set in our experiments.
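The two cues used for occlusion detection, the distance between ROIs and the change in their areas, can be illustrated with a rough sketch. It does not reproduce the occlusion detection parameter defined above: `dist_thresh` and `change_thresh` are hypothetical thresholds, the ROI centers stand in for the level-set-based distance, and treating the ROI whose area changes drastically as the occludee is an assumption of the sketch.

```python
import math


def relative_area_change(area_prev, area_now):
    """Fractional change in area between two consecutive frames."""
    return abs(area_now - area_prev) / area_prev


def detect_occlusion(roi_a, roi_b, dist_thresh, change_thresh):
    """Return (about_to_occlude, occludee) for two ROIs.

    Each ROI is a dict with 'center' (row, col), 'area_prev' and 'area_now'.
    dist_thresh and change_thresh are hypothetical; the paper's own
    occlusion parameter is not reproduced here.
    """
    close = math.dist(roi_a["center"], roi_b["center"]) < dist_thresh
    if not close:
        return False, None
    change_a = relative_area_change(roi_a["area_prev"], roi_a["area_now"])
    change_b = relative_area_change(roi_b["area_prev"], roi_b["area_now"])
    # Assumption: the ROI whose area changes drastically is being hidden.
    if change_a > change_thresh >= change_b:
        return True, "a"
    if change_b > change_thresh >= change_a:
        return True, "b"
    return True, None


roi_a = {"center": (50, 40), "area_prev": 1200, "area_now": 700}
roi_b = {"center": (52, 60), "area_prev": 1000, "area_now": 1010}
result = detect_occlusion(roi_a, roi_b, dist_thresh=30.0, change_thresh=0.25)
```

Here the two ROIs are about 20 pixels apart and only the first one shrinks noticeably, so the sketch reports an imminent occlusion with `a` as the occludee.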
Figure 7 shows an occlusion example where two humans occlude each other (in this sequence, both humans are in motion). Before the occlusion, the boundaries of the humans are complete. During occlusion, the two humans start overlapping each other (Figure 7(b)) and hence, the boundaries are broken. At this point, the distance between the two humans is very small ().
3 Experimental Results
We evaluated the performance of the proposed method on videos taken from standard datasets (MSR, UT Austin, Princeton University, IROS, and RWTH). Videos in these datasets differ in terms of pose, background, actions, etc. We conducted experiments on a computer with an Intel i5-2410M 2.30 GHz processor and 6 GB of DDR3 RAM.
Since the proposed method uses depth maps for detecting ROI, it should be considered as an independent variable to conduct a fair study. We conducted experiments to adjust the value of and results are shown in Figure 8. From Figure 8, we can see that the performance plateaus after . Hence, we used in our experiments.
3.1 Metrics for evaluation
We quantify the performance of the proposed method using two metrics:
$F_1$ score for object detection: We used this metric for determining the accuracy of the proposed object detection method.
The $F_1$ score combines both precision and recall and is defined as $F_1 = \frac{2PR}{P + R}$, where $P = \frac{TP}{TP + FP}$ and $R = \frac{TP}{TP + FN}$. Here, $P$ represents precision, $R$ represents recall, $FP$ represents false positive samples, $TP$ represents true positive samples, and $FN$ represents false negative samples.
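As a small worked example of the detection metric, the sketch below computes precision, recall, and their harmonic mean, assuming the score in question is the standard $F_1$ score. The counts are illustrative only and are not taken from the experiments.

```python
def f1_score(tp, fp, fn):
    """Return (F1, precision, recall) from raw detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0, precision, recall
    f1 = 2 * precision * recall / (precision + recall)
    return f1, precision, recall


# Illustrative counts: 90 correct detections, 10 false alarms, 5 misses.
score, p, r = f1_score(tp=90, fp=10, fn=5)
```

With these counts, precision is 0.90, recall is about 0.947, and the $F_1$ score is about 0.923.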
Success rate (SR) for tracking: For quantitative evaluation of our tracking algorithm, we employed the criterion used in PASCAL VOC challenge . If , then frame is tracked successfully. Here, denotes the overlap ratio, denotes the tracked region, and denotes the ground truth.
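The tracking criterion can be sketched as an intersection-over-union test on axis-aligned bounding boxes, which is how the PASCAL VOC overlap is usually computed. The box format `(x1, y1, x2, y2)`, the function names, and the default threshold of 0.5 are assumptions of this sketch; the threshold is treated as a variable in the experiments.

```python
def overlap_ratio(box_t, box_g):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_t[0], box_g[0]), max(box_t[1], box_g[1])
    ix2, iy2 = min(box_t[2], box_g[2]), min(box_t[3], box_g[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_t = (box_t[2] - box_t[0]) * (box_t[3] - box_t[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    union = area_t + area_g - inter
    return inter / union if union else 0.0


def success_rate(tracked, truth, threshold=0.5):
    """Fraction of frames whose overlap ratio exceeds the threshold."""
    hits = sum(overlap_ratio(t, g) > threshold for t, g in zip(tracked, truth))
    return hits / len(truth)


tracked = [(0, 0, 10, 10), (0, 0, 10, 10)]
truth = [(0, 0, 10, 10), (5, 5, 15, 15)]
sr = success_rate(tracked, truth)
```

The first frame overlaps perfectly and the second overlaps by only 25/175, so one of the two frames counts as successfully tracked.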
To compare the results with previous methods, we either take the reported best results or carefully select the parameters with the provided source code.
3.2 Object Detection Results
We tested the proposed method on all datasets. Figure 9 shows snapshots of the object detection, while Table 1 summarizes the accuracy of the proposed method on the different datasets. From Table 1, we can see that the average accuracy of the proposed method is . This indicates that the proposed method is able to detect objects (humans, boxes, etc.) in most cases. The performance of the proposed method is lower on the IROS, Princeton, and RWTH datasets. Since the proposed method detects and tracks an object by analysing its motion, it is not able to detect stationary objects, which results in lower accuracy.
Further, we compare the performance of the proposed object detection method with Xia et al.’s  noise suppression method in Table 2. From Table 2, we can see that the proposed noise suppression method results in higher accuracy than Xia et al.’s  method.
|Noise Suppression Method||None||Xia ||Ours|
Comparison with related work: Table 3 contains the quantitative comparison between the proposed method and the related work. From Table 3, we can see that the proposed method outperforms the methods proposed by Ikemura et al. and Xia et al., while its performance is comparable with that of Spinello et al.'s method. Ikemura et al.'s method is window-based and is able to detect only people who are well centered in the frame, resulting in many false negatives and low accuracy. Xia et al.'s method detects a person by identifying the head contour. In case of occlusions, the method is able to detect only the head contour of the foreground object, resulting in false negatives and low accuracy. Further, we compared the accuracy of the proposed method with the Combo-HOD detector proposed by Spinello et al. The accuracy of the proposed method is around 93%, while the accuracy of the Combo-HOD detector is around 97%. Since the proposed method detects objects based on movement, it is not able to detect stationary humans, resulting in slightly lower accuracy than Spinello et al.'s method.
3.3 Object Tracking Results
For object tracking, we tested the proposed method on all datasets. Figure 10 shows a snapshot of object tracking on one of the RGB-D videos in the Princeton dataset. The video is captured inside a room where 4 people are present (3 adults and 1 child). As we can see in Figure 10, the proposed method is able to track only 3 of the 4 people inside the room. One of the adults present in the scene is stationary. The proposed method tracks an object based on its motion and hence fails to detect this adult.
Figure 11 shows a snapshot of tracking a bear. As we can see in Figure 11, the proposed method is able to detect and track the bear as well as the box (with and without occlusion). However, the proposed method is not able to track the person. This is because the proposed method detects and tracks objects that correspond to closed regions and not to background regions (see Section 2 for details). The person in this video belongs to a background region and hence, the proposed method fails to detect and track the person.
For quantitative experiments, we computed overlapping ratio between tracked object and ground truth data (discussed in Section 3.1). Since RGB-D datasets are collected using different camera settings and under different conditions (such as indoor, outdoor, etc.), should be considered as a variable to conduct a fair study. We conducted experiments to adjust the value of and results are shown in Figure 12. From Figure 12, we can see that the average tracking SR of the proposed method is around 83% across all datasets at .
Further, we compare the performance of the proposed object tracking method with Xia et al.’s  noise suppression method in Table 4. From Table 4, we can see that the proposed noise suppression method results in higher accuracy than Xia et al.’s  method.
|Noise Suppression Method||None||Xia ||Ours|
Comparison with related work: We compare the performance of the proposed tracking method against different methods in Table 5. From Table 5, we can see that the proposed method is more robust than the other methods. The higher accuracy of the proposed method against other methods is mainly because of the noise suppression. For instance, RGBOcc + OF  method detects false features due to presence of noise, resulting in low accuracy. When we applied RGBOcc + OF  method after suppressing the noise using the proposed method, we saw that the success rate increased from 75% to 83%.
|Method||Success rate|
|RGBD HOG + Optical Flow (on depth data)||75%|
|RGB HOG + Optical Flow (on RGB data) ||52%|
|Structured output tracking ||46%|
|Visual tracking ||42%|
|Compressive tracking ||40%|
|Proposed method (on depth data)||83%|
It is worth noting that the accuracy reported for object detection in Table 1 differs from that reported for object tracking in Table 5. This is because the proposed method is able to track an object only while it is in motion. For instance, a person might walk for some time, then stand still for some time, and then start moving again. In such a case, the proposed method tracks the person only while he/she is moving. An example is shown in Figure 13, where two persons enter the scene (University Hall). One of them keeps walking while the other goes to an ATM and waits there for some time to complete his work. As we can see in Figure 13, the proposed method is able to track both persons as long as they are moving. The proposed method subsequently loses track of one of them as there is no movement (see Figure 13). Though we are able to detect both persons in this scene, we are able to track them only while they are walking. Such scenarios resulted in a lower SR.
3.4 Impact on execution time
The proposed method uses the direction of object movement and a weighted graph to reduce the search area required for ROI detection and tracking. With optimization, we use the direction of object movement and the weighted graph for ROI detection and tracking; without optimization, we search the entire frame to detect and track the ROI. The average execution times required per frame for noise reduction, ROI detection, and ROI tracking on the different datasets without and with optimization of the search area are 888.6 and 540 milliseconds, respectively, i.e. the optimized method is about 1.65 times faster than the unoptimized method (888.6 / 540 ≈ 1.65). Though the proposed optimization significantly reduces the execution time, we have not seen any difference in the accuracy of the proposed method with and without optimization.
In this paper, we proposed a region graph based method for noise suppression, object detection, and object tracking using depth cameras. The experimental results show that the proposed method is able to detect and track the objects with and without occlusions. The proposed approach can be applied in different applications such as human activity recognition.
The advantages of the proposed method can be summarized as follows. First, the proposed method does not require any training for ROI detection and tracking. Second, the proposed method uses the direction of object movement and a weighted graph for ROI detection and tracking, which helps reduce the search area for detection and tracking. The limitations of the proposed method are: (i) it is not able to re-identify objects and (ii) object detection and tracking do not run in real time. RGB-based tracking systems seem to be more robust for object re-identification, though their overall tracking accuracy is lower. In the future, we will try to address these limitations by: (i) complementing the proposed method with RGB data for person re-identification and (ii) utilizing GPUs to speed up the proposed method.
This material is based upon work supported partially by the National Science Foundation under Grant No. 1012975. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
-  M. Enzweiler, A. Eigenstetter, B. Schiele, and D. Gavrila. Multi-cue pedestrian classification with partial occlusion handling. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 990–997, June 2010.
-  M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
-  C. Gomila and F. Meyer. Graph-based object tracking. In Image Processing, 2003. ICIP 2003. Proceedings. 2003 International Conference on, volume 2, pages II–41–4 vol.3, Sept 2003.
-  J. Han, E. Pauwels, P. de Zeeuw, and P. de With. Employing a rgb-d sensor for real-time tracking of humans across multiple re-entries in a smart environment. Consumer Electronics, IEEE Transactions on, 58(2):255–263, May 2012.
-  S. Hare, A. Saffari, and P. Torr. Struck: Structured output tracking with kernels. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 263–270, Nov 2011.
-  S. Ikemura and H. Fujiyoshi. Real-time human detection using relational depth similarity features. In R. Kimmel, R. Klette, and A. Sugimoto, editors, Computer Vision – ACCV 2010, volume 6495 of Lecture Notes in Computer Science, pages 25–38. Springer Berlin Heidelberg, 2011.
-  G. Kootstra and D. Kragic. Fast and bottom-up object detection, segmentation, and evaluation using gestalt principles. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 3423–3428, May 2011.
-  J. Kwon and K. M. Lee. Visual tracking decomposition. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, 2010.
-  W. Li, Z. Zhang, and Z. Liu. Action recognition based on a bag of 3d points. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 9–14, June 2010.
-  F. Meyer and S. Beucher. Morphological segmentation. Journal of Visual Communication and Image Representation, 1(1):21 – 46, 1990.
-  U. Rafi, J. Gall, and B. Leibe. A semantic occlusion model for human pose estimation from a single depth image. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2015 IEEE Conference on, pages 67–74, June 2015.
-  S. Song and J. Xiao. Tracking revisited using rgbd camera: Unified benchmark and baselines. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 233–240, Dec 2013.
-  L. Spinello and K. Arras. People detection in rgb-d data. In Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, pages 3838–3843, Sept 2011.
-  J. Wang, Z. Liu, Y. Wu, and J. Yuan. Mining actionlet ensemble for action recognition with depth cameras. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, June 2012.
-  W. Wang and R. Nevatia. Robust object tracking using constellation model with superpixel. In K. Lee, Y. Matsushita, J. Rehg, and Z. Hu, editors, Computer Vision ACCV 2012, volume 7726 of Lecture Notes in Computer Science, pages 191–204. Springer Berlin Heidelberg, 2013.
-  L. Xia and J. Aggarwal. Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, June 2013.
-  L. Xia, C.-C. Chen, and J. Aggarwal. Human detection using depth information by kinect. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on, 2011.
-  F. Yang, H. Lu, and M.-H. Yang. Robust superpixel tracking. Image Processing, IEEE Transactions on, 23(4):1639–1651, April 2014.
-  K. Zhang, L. Zhang, and M.-H. Yang. Real-time compressive tracking. In A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, editors, Computer Vision ECCV 2012, volume 7574 of Lecture Notes in Computer Science, pages 864–877. Springer Berlin Heidelberg, 2012.
-  Z. Zhang, W. Liu, V. Metsis, and V. Athitsos. A viewpoint-independent statistical method for fall detection. In Pattern Recognition (ICPR), 2012 21st International Conference on, 2012.