Hand Action Detection from Ego-centric Depth Sequences with Error-correcting Hough Transform

Detecting hand actions from ego-centric depth sequences is a practically challenging problem, owing mostly to the complex and dexterous nature of hand articulations as well as non-stationary camera motion. We address this problem via a Hough transform based approach coupled with a discriminatively learned error-correcting component to tackle the well known issue of incorrect votes from the Hough transform. In this framework, local parts vote collectively for the start & end positions of each action over time. We also construct an in-house annotated dataset of 300 long videos, containing 3,177 single-action subsequences over 16 action classes collected from 26 individuals. Our system is empirically evaluated on this real-life dataset for both the action recognition and detection tasks, and is shown to produce satisfactory results. To facilitate reproduction, the new dataset and our implementation are also provided online.





I Introduction

Recent developments in ego-centric vision systems provide rich opportunities as well as new challenges. Besides the well-known Google Glass [1], more recent systems such as Metaview Spaceglasses [2] and Oculus Rift [3] have started to incorporate depth cameras for ego-centric 3D vision. A commonality shared by these ego-centric cameras is that they are mobile. Moreover, the interpretation of hand actions in such scenarios is known to be a critical problem [4, 5, 6]. Meanwhile, facilitated by emerging commodity-level depth cameras [7, 8], noticeable progress has been made in hand pose estimation and tracking [9, 10, 11, 12, 13, 14, 15]. The problem of hand action detection from mobile depth sequences, however, remains unaddressed.

As illustrated in Fig. 1, in this paper we address this problem in the context of an ego-centric vision system. Due to the diversity in hand shapes and sizes and the variations in hand actions, it can be difficult to differentiate actions from other dynamic motions in the background. The difficulty of the problem is further compounded in the presence of a non-stationary camera as considered here. Our contribution in this paper is three-fold. (1) To our knowledge this is the first academic effort to provide an effective and close to real-time solution for hand action detection from mobile ego-centric depth sequences. (2) We propose a novel error-correcting mechanism to tackle the bottleneck issue of incorrect votes from the Hough transform, which has been shown to degrade prediction performance [16, 17, 18]. This follows from our observation that voting errors frequently exhibit patterns that can be exploited to gain more knowledge. (3) We make available our comprehensive, in-house annotated ego-centric hand action dataset (the dataset and our code can be found at the dedicated project website http://web.bii.a-star.edu.sg/~xuchi/handaction.htm), on which the proposed method is thoroughly evaluated. The error-correcting module is also examined with a series of tasks on a synthetic dataset. The empirical evaluations demonstrate that the proposed method is highly competitive, and validate that our approach is able to pick out subtleties such as fine finger movements as well as coarse hand motions.

Fig. 1: An illustration of the ego-centric hand action detection problem. Key frames (colorized depth images) from a few exemplar actions are shown here. The actions vary from coarse hand motions to fine finger motions. Action recognition/detection tasks in this scenario are challenging due to (i) illumination artefacts, (ii) variations across subjects in the way actions are performed, (iii) variations in hand shapes and sizes, and (iv) non-stationary camera positions due to head motion.

II Related Works

II-A Action Recognition and Detection

The problem of action recognition and detection is a classic topic in vision. Traditionally the focus has been more on full-body human activities such as "skipping" or "jumping" [19]: For example, the problem of action detection is addressed in [20] using context-free grammars. It has also been observed in [21] that very short sequences (also referred to as snippets, usually of 5-7 frames in length) are usually sufficient to identify the action type of the entire action duration. Single key-frames and local space-time interest point features are also utilized in [22] to detect the drinking action in realistic movie scenarios. Yuan et al. [23] focus on improving search efficiency, while [24] resorts to contextual cues and convolutional neural networks. The work of Yao et al. [25] is perhaps the most related, in which a probabilistic Hough forest framework is proposed for action recognition. An interesting method is proposed in [26] to use human full-body action detection to help with pose analysis. Meanwhile, daily activity datasets of first-person color camera views have been established and studied by several groups [27, 28] for applications such as life-logging and tele-rehabilitation. There are also recent works on action recognition using recurrent neural network approaches [29, 30]. Very recently, more research efforts have focused on action recognition from depth sequences [31]: For instance, an action-let ensemble model is proposed in [32] to characterize individual action classes and intra-class variances; the problem of multi-label action detection is considered in [33] with a structural SVM model. The work of [34] is among the few efforts to further explore head-mounted RGB-D cameras for action recognition. Nevertheless, hand action detection still lacks thorough investigation, especially for depth cameras in the context of mobile ego-centric vision.

Related works on hand action recognition and detection are relatively scarce. Among the early works on hand gesture recognition, Lee and Kim [35] study the usage of a hidden Markov model (HMM) based on hand tracking trajectories. The emergence of consumer depth cameras significantly revolutionized the landscape of this field, where upper- or full-body skeleton estimation [36] is shown to be a powerful feature for the related problem of sign language recognition [37]. Recently, multi-sensor hand gesture recognition systems [38, 39] have been proposed in a car-driving environment with stationary rear cameras aimed at the driver. We note in passing that in-car cameras are usually stationary, while in this paper ego-centric vision refers to the general case of non-stationary cameras, and in particular we look at head-mounted cameras. Therefore, adjacent image frames are no longer aligned, and usual techniques such as background subtraction are not applicable. Furthermore, the Kinect human body skeleton estimation becomes inapplicable as the camera is head-mounted.

II-B Hough Transform

As probably one of the most widely used computer vision techniques, the Hough transform was first introduced by [40], initially as a line extraction method. It was subsequently reformulated by [41] to its current form and extended to detect circles. Furthermore, Ballard [42] developed the Generalized Hough transform to detect arbitrary shapes. In a typical Hough transform procedure, a set of local parts is captured to sufficiently represent the object of interest. Then, each of the local parts votes for a particular position [42]. Finally, in the voting space, computed as score functions of the vote counts over all parts, an object is detected by mode-seeking the locations receiving the most significant scores.
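As a concrete illustration of this vote-accumulate-and-mode-seek loop, here is a minimal 1-D sketch in Python; the function name, toy offsets, and accumulator layout are our own illustrative choices, not any implementation from the paper:

```python
import numpy as np

def hough_detect(part_positions, part_offsets, space_size):
    """Accumulate votes cast by local parts and mode-seek the peak.

    Each part at position t casts a vote at t + offset for every offset
    it predicts; a toy 1-D analogue of the generalized procedure above.
    """
    accumulator = np.zeros(space_size)
    for t, offsets in zip(part_positions, part_offsets):
        for off in offsets:
            target = t + off
            if 0 <= target < space_size:
                accumulator[target] += 1.0
    # Mode-seeking reduces to an argmax in this 1-D toy voting space.
    return int(np.argmax(accumulator)), accumulator

# Three parts whose votes agree that the object center is at position 10.
peak, acc = hough_detect([8, 10, 12], [[2], [0], [-2]], space_size=20)
```

In higher-dimensional voting spaces the argmax is replaced by a smoothed mode-seeking step (e.g. mean-shift), but the accumulation logic is the same.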

An important and more recent development due to [43] is its extension to a probabilistic formulation to incorporate a popular part-based representation, the visual Bag-of-Words (BoW). This is commonly referred to as the Implicit Shape Model (ISM), as follows. Formally, consider an object of interest O, and its label c ∈ C, where C is the set of feasible object categories. The position of the object in voting space is characterized by e.g. its nominal central location and scale in images or videos, and is collectively denoted as y. In our context, the object is captured by a number of (say J) local parts, and the number J may vary from one object to another. Denote by f_j the feature descriptor of a local part which is observed at a relative location l_j (the relative location is usually measured with respect to the object's nominal center). As a result, O = {(f_j, l_j)} for j = 1, ..., J. We consider a visual Bag-of-Words representation, and denote by p(C_k | f_j) the weight assigned to the k-th codebook entry C_k for a local part f_j. The score function now becomes

S(c, y) = Σ_j Σ_k p(c, y | C_k, l_j) p(C_k | f_j),     (1)

which recovers Eq.(5) of [43]. The voting space is thus obtained by computing the score function over every position y. The seminal work of ISM [43] has greatly influenced the recent development of Hough transform-like methods (e.g. [44, 45]), where the main emphases are on improving voting and integration with learning methods. Specifically, the large-margin formulation of the Hough transform [44] is closely related. Moreover, the generality of the Hough transform in how each of the local parts may vote enables myriad part-to-vote conversion strategies. Instead of the BoW strategy of [43, 44], Hough forests [45] are proposed, which can be regarded as generating discriminatively learned codebooks.

There are also related works examining Hough vote consistency in the context of 2D object detection and recognition. A latent Hough transform model is studied in [16] to enforce the consistency of votes. In [46], the grouping, correspondences, and transformation of 2D parts are initialized by a bottom-up grouping strategy, and then iteratively optimized until convergence. In [47], the hypothesis and its correspondences to the 2D parts are updated by greedy inference in a probabilistic framework. Our work is significantly different from these efforts: (1) Instead of exploring only the consistency of correct votes, we also seek to rectify the impact of incorrect votes, such that both correct and incorrect votes can contribute to improving the performance. (2) Rather than iteratively optimizing complex objectives with many unknown variables [46, 47], which are often computationally expensive and prone to local minima, our algorithm is simple and efficient, as the process involves only linear operations. (3) The above-mentioned algorithms are designed for 2D object detection, and cannot be directly applied in the temporal context considered in this paper.

II-C Related Work on Error Correcting Output Codes

The concept of error correcting or error control has long been established in the information theory and coding theory communities [48], and has recently been employed for multiclass classification [49]. It is shown in [50] that there is an interesting analogy between the standard Hough transform method and error correcting codes in the context of curve fitting.

Fig. 2: An illustration of the preprocessing step on hand normalization.

III Our Approach

Inspired by [21], we employ snippets as our basic building blocks, where a temporal sliding window is used to densely extract snippets from video clips. Each snippet thus corresponds to such a temporal window and is subsequently used to place a vote for both the action type and its start/end positions under the Hough transform framework. This dense voting strategy nevertheless leads to many uninformative and even incorrect local votes, where either the action type or its start/end locations could be wrong. In what follows we propose an error correcting map to explicitly characterize these local votes, where the key assumption is that for a particular action type, the patterns of accumulated local votes are relatively stable. The patterns refer to the spatial and categorical distributional information of the collection of local votes obtained from the snippets in the training set, which include the correct local votes as well as the incorrect votes.

Fig. 3: An illustration of the proposed error correcting Hough transform. (a) Each snippet votes on the action type and its center position. The vote encodes the amount of deviation, namely the difference between the snippet position and the true center position. Note the center position is shown here only for illustration purposes; in practice, the start or end positions are used instead. (b) The learned weight vector of an action type, re-organized into the error correcting parameter cubic space.
Fig. 4: Exemplar hand action detection results. In this and the following figures, blue lines denote the ground-truths, green lines denote correct detections, and red lines denote incorrect detections.

III-A Preprocessing Step: Hand Normalization

To facilitate the follow-up feature generation, as a preprocessing step we normalize the hand position, in-plane orientation, and size. Assume that a hand is present in each depth image. During this step the hand location, orientation, and size are estimated using a Hough forest [13]. This is achieved by jointly voting for the 3D location, in-plane orientation, and size of the hand on each pixel; the local maxima in the accumulated voting space are then located by mean-shift [13]. Based on this triplet of information, the hand image patches are further normalized such that in what follows we work with a canonical upright hand with a fixed patch size, as exemplified in Fig. 2.
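A rough sketch of such a normalization step, assuming the triplet (location, in-plane angle, size) has already been estimated; the canonical patch size, nearest-neighbour sampling, and all names here are illustrative assumptions rather than the paper's actual implementation:

```python
import numpy as np

def normalize_hand_patch(depth, cx, cy, angle_deg, size, out=64):
    """Crop a hand-centered window, rotate it upright, and resample it
    to a canonical out x out patch (nearest-neighbour for depth values).
    The 64-pixel canonical size is an assumption for illustration.
    """
    half = out // 2
    theta = np.deg2rad(angle_deg)
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    scale = size / out  # maps canonical pixels back to source pixels
    patch = np.zeros((out, out), dtype=depth.dtype)
    for v in range(out):
        for u in range(out):
            # Inverse rotation: canonical (u, v) -> source coordinates.
            du, dv = (u - half) * scale, (v - half) * scale
            sx = cx + cos_t * du - sin_t * dv
            sy = cy + sin_t * du + cos_t * dv
            xi, yi = int(round(sx)), int(round(sy))
            if 0 <= yi < depth.shape[0] and 0 <= xi < depth.shape[1]:
                patch[v, u] = depth[yi, xi]
    return patch
```

With zero rotation and size equal to the output size, this reduces to a plain centered crop, which is a convenient sanity check.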

III-B Local Votes from Snippets

After the aforementioned preprocessing step, a sliding window of fixed length is used to densely extract (with step size 1) snippets from the temporal zones in video clips where hand actions take place. Each snippet thus corresponds to a temporal subsequence of this fixed length, and at each frame it contains the normalized hand patch as well as the estimated hand location and orientation. At test time, a snippet is subsequently used to place a vote for the action type as well as its start and end positions.
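The dense extraction scheme can be sketched as a toy helper, under the assumption that a snippet is simply a contiguous run of per-frame records:

```python
def extract_snippets(frames, snippet_len):
    """Densely extract overlapping snippets with step size 1, mirroring
    the sliding-window scheme above; `frames` is any per-frame sequence
    (e.g. normalized patches plus estimated hand location/orientation).
    """
    return [frames[i:i + snippet_len]
            for i in range(len(frames) - snippet_len + 1)]

# Six frames with snippets of length 3 yield four overlapping snippets.
snippets = extract_snippets(list(range(6)), snippet_len=3)
```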

To simplify matters, for the target position to be voted, the center position of the current sliding window is nominally used here only for illustration purposes, as in Fig. 3(a). In practice, the start or end positions are used instead; in other words, the nominal center position is replaced in practice by either the start or the end positions separately. Now, consider a relatively larger sliding temporal window that contains multiple snippets. We further denote its nominal center location y, and assume that this temporal window overlaps with one of the action subsequences. This sliding window will be the cornerstone of our paper in producing the Hough voting score. As illustrated in Fig. 3(a), given a snippet s located at time t in such a temporal window, let d denote the quantized temporal deviation from the snippet location to a position it might vote for, and let c be a possible action type. Denote by d0 the quantized deviation from the snippet location t to the center position y of the current sliding window. The quantization of d and d0 is necessary here since the temporal frames are already quantized. Now we would like to learn the probability distribution of the temporal offset d and class label c given a snippet s, which can be factorized as

p(d, c | s) = p(d | c, s) p(c | s).     (2)

Two random forests are trained for this purpose: The first forest, a classification random forest, models the classification probability p(c | s), while the second one is a conditional regression forest that represents the conditional probability p(d | c, s). The term p(d | c, s) explicitly displays the quantization process: d is a random variable indexing over the bins after quantization. Then p(d | c, s) = 1 for the one bin under which the deviation falls exactly, and p(d | c, s) = 0 when d refers to any other bin. For both forests, two sets of features are used for binary tests in the split nodes: The first is the commonly-used set of features that measures the absolute difference of two spatio-temporal 3D offsets [11] in the normalized hand patches. It is complemented by the second set of features, which considers the 6D parameters of the estimated hand location and orientation from hand localization. In addition, the standard Shannon entropy of the multivariate Gaussian distribution [13] is used to compute information gains at the split nodes.
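To make the factorization of the joint vote distribution into a class term and a conditional offset term concrete, the following sketch combines off-the-shelf scikit-learn random forests on synthetic data. Using one per-class classifier over offset bins to approximate the conditional distribution is our simplification, not the paper's exact forest construction, and all names and sizes are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 5-D snippet descriptors, each with an action
# label and a quantized temporal-offset bin (all names assumed).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y_class = rng.integers(0, 3, size=200)   # action type c
y_bin = rng.integers(0, 4, size=200)     # quantized offset bin d
N_BINS = 4

# Classification forest: approximates p(c | snippet).
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y_class)

# One per-class forest approximates the conditional p(d | c, snippet);
# this stands in for the conditional regression forest.
cond = {c: RandomForestClassifier(n_estimators=20, random_state=0)
           .fit(X[y_class == c], y_bin[y_class == c])
        for c in np.unique(y_class)}

def vote_distribution(x):
    """Return the joint vote p(d, c | snippet) as an (N_BINS, n_classes) array."""
    x = x.reshape(1, -1)
    p_c = clf.predict_proba(x)[0]
    joint = np.zeros((N_BINS, len(p_c)))
    for ci, c in enumerate(clf.classes_):
        p_d = np.zeros(N_BINS)
        p_d[cond[c].classes_] = cond[c].predict_proba(x)[0]
        joint[:, ci] = p_d * p_c[ci]     # p(d | c, s) * p(c | s)
    return joint

joint = vote_distribution(X[0])
```

By construction the joint table sums to one over all (offset bin, class) pairs, which is a useful invariant to assert when wiring such a voting stage together.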

Votes from these local snippets, however, contain a fair amount of local errors. These voting errors can be categorized into two types: large temporal deviation (in the start/end positions) and inter-class confusion. The large temporal deviation arises for various reasons, including

  • Temporal rate variation: The scale and speed of actions may vary notably in practice across subjects and at different times;

  • Repetitive pattern: For example, in action “asl-blue” that will be formally introduced later, the hand is flipped twice, and it is difficult to predict whether it should be the first flip or the second, when one only focuses on a short action snippet at a time;

  • Manual labeling inconsistency: The annotation of start and end positions may be inconsistent in training data.

Meanwhile, inter-class confusion refers to scenarios where a snippet is mistakenly categorized into an incorrect action type. For example, since a snippet extracted from action “asl-milk” could be similar to one from action “ui-doubleclick”, they may be confused when placing the local vote. An important observation here is that both temporal deviations and inter-class confusions often exhibit specific patterns that can potentially be exploited. These observations lead us to propose in what follows a novel mechanism to cope with, and even benefit from, these local errors.

III-C Our Error Correcting Hough Transform (ECHT)

The central piece of the Hough transform lies in the score function defined for the voting space, as elaborated in the previous section. In this paper, we consider a linear additive score function of an object with label c at position y,

S(c, y) = ⟨w, Φ⟩ = Σ_j ⟨w, φ_j⟩.     (3)

The score function is additive in terms of local votes, with each local contribution defined as ⟨w, φ_j⟩; the feature Φ is therefore decomposed into local snippet-based features φ_j, with Φ = Σ_j φ_j. Consider for example the large-margin formulation of the Hough transform in [44], which amounts to a relaxation of (1) to non-probabilistic function forms: by reducing the probabilistic vote terms to codebook activations and denoting the weight vector as w, the linear form of (3) is obtained. This is exactly the large-margin formulation of ISM described in Eq.(12) of [44].

In our context, since the local votes from snippets contain a noticeable amount of errors, it is necessary to consider an error-control mechanism. This motivates us to consider, instead of the BoW approach as studied in e.g. [43, 44], a new linear map that explicitly characterizes errors from local votes:

S(c, y) = ⟨w_c, Φ⟩ = Σ_j ⟨w_c, φ_j⟩.     (4)
Fig. 5: A systematic temporal error analysis using the synthetic dataset. The x-axis denotes the standard deviation of the temporal additive zero-mean Gaussian noise, while the y-axis presents the average temporal deviations from the ground-truths.

As illustrated in Fig. 3(b), given an action type c, the learned weights form a column vector w_c obtained by concatenating the parameters over the range of the three-dimensional error correcting space spanned by the snippet offset within the window, the voted temporal deviation, and the voted action type, with each element denoting a particular parameter. In a similar way, each element of the local feature φ_j encodes the local vote obtained from the snippet using the random forests as in Eq. (2). The prior over positions is uniformly distributed and is thus ignored as a constant factor.
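The accumulation of local votes into the error-correcting cube and the linear scoring can be sketched as below; the cube dimensions, their ordering, and the flattening scheme are illustrative assumptions rather than the paper's exact layout:

```python
import numpy as np

# Illustrative quantization sizes for the error-correcting cube
# (snippet offset within window, voted offset bin, voted class).
N_OFFSETS, N_BINS, N_CLASSES = 5, 4, 3

def echt_feature(snippet_votes):
    """Accumulate per-snippet vote distributions into the flattened cube.

    `snippet_votes` is a list of (offset_index, joint) pairs, where joint
    is an (N_BINS, N_CLASSES) vote distribution for one snippet.
    """
    phi = np.zeros((N_OFFSETS, N_BINS, N_CLASSES))
    for offset_idx, joint in snippet_votes:
        phi[offset_idx] += joint
    return phi.ravel()

def echt_score(w, snippet_votes):
    """Linear additive score: inner product of the learned parameter
    vector w with the accumulated vote features."""
    return float(np.dot(w, echt_feature(snippet_votes)))

# Two snippets with uniform 0.5 vote mass; an all-ones w just sums it.
votes = [(0, np.full((N_BINS, N_CLASSES), 0.5)),
         (2, np.full((N_BINS, N_CLASSES), 0.5))]
score = echt_score(np.ones(N_OFFSETS * N_BINS * N_CLASSES), votes)
```

Because the map is linear, incorrect but systematic vote mass simply lands in cube cells whose learned weights can compensate for it, which is the intuition behind the error-correcting behaviour.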

III-D Training & Testing Phases of ECHT

Since the Hough voting space is characterized by a linear map, we can learn the parameter vector w during the training phase as follows. For each of the action subsequences in the video clips, we randomly sample subsequences around it. For each such sampled subsequence, its ground-truth voting score g is defined as the intersection over union of the sampled subsequence and the action subsequence. The training objective amounts to estimating the parameter w that minimizes the discrepancy between the ground-truth and the computed scores over all training subsequences, plus a regularization term on w:

min_w Σ_i L(g_i, Σ_{j∈i} ⟨w, φ_j⟩) + λ ‖w‖²,     (5)

where ‖·‖ denotes the vector norm, λ is a trade-off constant, i indexes over all the training subsequences, and j indexes over all the snippets within the current subsequence i. In this paper, we consider the ε-insensitive loss [51] for L, and the problem is solved by linear support vector regression (SVR) [51].
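A minimal sketch of this training phase with scikit-learn's LinearSVR, which optimizes an ε-insensitive loss with an L2 penalty; the synthetic features, the linear supervision, and all parameter values are placeholders, not the paper's settings:

```python
import numpy as np
from sklearn.svm import LinearSVR

# Synthetic stand-ins: each row of Phi is an accumulated vote-feature
# vector; g holds the IoU-style supervision (here generated linearly
# so that a perfect linear fit exists).
rng = np.random.default_rng(0)
Phi = rng.random((300, 60))
w_true = rng.normal(size=60)
g = Phi @ w_true

# LinearSVR minimizes the epsilon-insensitive loss plus an L2 penalty,
# mirroring the regularized regression objective sketched above.
svr = LinearSVR(epsilon=0.1, C=1.0, max_iter=10000, random_state=0).fit(Phi, g)
pred = svr.predict(Phi)
```

At test time the learned coefficient vector plays the role of w, and scoring a candidate subsequence is a single dot product with its accumulated features.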

At test time, given a test example consisting of a set of snippets, the action detection problem boils down to finding the positions whose voting scores exceed a threshold, with the help of non-maximal suppression.
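A toy version of this thresholding-plus-suppression step, using a greedy 1-D non-maximal suppression; the parameter names and values are illustrative:

```python
def detect_actions(scores, threshold, min_separation):
    """Keep score peaks above `threshold`, greedily suppressing any
    candidate within `min_separation` frames of a stronger one."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if scores[i] < threshold:
            break  # all remaining candidates score even lower
        if all(abs(i - j) >= min_separation for j in kept):
            kept.append(i)
    return sorted(kept)

# Position 2 (score 0.8) is suppressed by its stronger neighbour at 1.
hits = detect_actions([0.1, 0.9, 0.8, 0.2, 0.95, 0.3], threshold=0.5,
                      min_separation=2)
```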

IV Experiments

An in-house dataset has been constructed for hand action detection as illustrated in Fig. 1, where the videos are collected from a head-mounted depth camera (the time-of-flight depth camera Softkinetic DS325). The spatial resolution and the horizontal and vertical fields-of-view of this depth sensor follow the DS325 specifications. The video clips are acquired at a frame rate of 15 frames per second (FPS). We consider 16 hand action classes, with ten classes from ASL (American Sign Language) and the remaining six from UI (user interface applications), as follows: asl-bathroom, asl-blue, asl-green, asl-j, asl-milk, asl-scissors, asl-where, asl-yellow, asl-you, asl-z, ui-circle, ui-click, ui-doubleclick, ui-keyTap, ui-screenTap, ui-swipe. During data acquisition, the distance of the hand to the camera varies from 210mm to 600mm, with an average distance of 415mm.

The following methods are considered during empirical evaluations:

  • ECHT: The proposed full-fledged approach with an error correcting map for both temporal error and inter-class confusion.

  • ECHT-T: A degraded variant of ECHT with only temporal error correction.

  • ECHT-C: A degraded variant of ECHT with only inter-class confusion correction.

  • Standard HT: The standard Hough forest method using a Dirac function for class prediction, and a Gaussian smoothing function over the estimations of temporal start/end positions, which can be regarded as an adaptation of the state-of-the-art method [25] to our problem.

  • HMM1 & HMM2: Two variants of the standard HMM for action recognition tasks, following e.g. [35] to train an HMM for each action class. For HMM1 we use the normalized 3D hand movement feature and HoG features from the normalized hand patches, while for HMM2 we use only the HoG feature.

Throughout our experiments, for both Hough forests, the number of trees is set to 20, and the tree depth is set to 20 by default. The voted position refers to either the start or the end position of the current action subsequence, as these two boundary positions are voted for independently. The error correcting cubic space of snippet offset, voted temporal deviation, and voted action type is quantized into sub-cubes. Fixed values are used for the ε and trade-off constants of the SVR problem [51].

In terms of evaluation criteria, a correct prediction is defined as an image subsequence with an intersection-over-union ratio greater than 0.5 when compared to the ground-truth action subsequence, and with a correctly predicted action label. This naturally leads to the consideration of precision, recall, and F1 score as the performance evaluation criteria. Following the standard definition, the F1 score is the harmonic mean of precision and recall:

F1 = 2 × (precision × recall) / (precision + recall).
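These criteria can be sketched as follows, assuming inclusive [start, end] frame intervals and a greedy one-to-one matching between detections and ground truths (the matching scheme is our assumption for illustration):

```python
def temporal_iou(a, b):
    """IoU of two inclusive [start, end] frame intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

def prf1(detections, ground_truths, iou_thresh=0.5):
    """Greedily match each detection (interval, label) to an unused
    ground truth with the same label and IoU above the threshold, then
    report precision, recall, and F1 as defined above."""
    used, tp = set(), 0
    for seg, label in detections:
        for gi, (gseg, glabel) in enumerate(ground_truths):
            if gi not in used and label == glabel \
                    and temporal_iou(seg, gseg) > iou_thresh:
                used.add(gi)
                tp += 1
                break
    prec = tp / len(detections) if detections else 0.0
    rec = tp / len(ground_truths) if ground_truths else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```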
Fig. 6: A systematic inter-class error analysis using the synthetic dataset. The x-axis denotes the standard deviation of the additive zero-mean Gaussian noise, while the y-axis presents the average F1 score.

IV-A Synthetic Experiments

To facilitate in-depth analysis of the proposed ECHT approach under a controlled environment, we carry out a series of experiments on a synthetic dataset simulating simplified scenarios. During these experiments, a training set of about 2,500 short temporal sequences with 16 action classes and a test set of 5 long sequences with assorted actions are generated. In practice, errors in the local votes could be due to a systematic bias (a simple version being that all votes are shifted by a fixed constant), randomness (i.e. additive Gaussian noise), or a combination of both. Large-magnitude random errors would be hard for any mechanism to correct. On the other hand, we hypothesize that errors of a systematically biased nature can be fully accounted for and corrected by our ECHT approach, while this type of error remains non-correctable for the standard HT. To demonstrate this on temporal and inter-class errors, we performed two sets of experiments.

Experiments on Temporal Error Analysis

Here each local vote for the start point of its action sequence is purposely left-shifted by a fixed value b, along with additional Gaussian noise with standard deviation σ. Over a single train-test pass, the value of b is kept constant. No errors are introduced in the votes for the action type. Figures 5(a) and (b) show the average deviation of the predicted start location of actions from the ground truth for various b and σ. For the trivial case of b = 0 and σ = 0, both ECHT and the standard HT deliver exact predictions. For σ = 0 but with increasing b, there is a corresponding decrease in the performance of the standard HT, which relies purely on the local votes. In contrast, ECHT manages to account for this systematic shift and produces exact predictions. With increasing σ there is a degradation in the performance of both methods, which is consistent with the fact that random noise is hard to control for. Note that ECHT-T and ECHT-C are not shown, as the former performs exactly the same as ECHT while the latter performs the same as the standard HT. Besides, to produce this and the next figure, ten experimental repeats are performed independently for each parameter set.

Fig. 7: Incorrect detections. Left: The temporal overlap is insufficient. Right: Action class is predicted wrongly.

Experiments on Inter-class Error Analysis

In this set of experiments, the local votes for the start and end positions of actions are untouched. Instead, all the votes for the action type are perturbed. Initially each class vote is a binary probability distribution over all the action types. First we cyclically rotate this distribution by k positions. Then Gaussian noise with standard deviation σ is added, and the vote is assigned a new class id sampled according to this modified distribution. Figures 6(a) and (b) present the average F1-score for various levels of k and σ. For σ = 0, ECHT produces 100% accurate predictions for all values of k, as the error correcting map is able to control for the systematic swap in the class ids. On the other hand, for the standard HT the performance drops to 0% for non-zero values of k. Complete reliance on the local votes explains this result, as every vote predicts a class id other than the true class. For increasing values of σ, the performances of both methods degrade.

Fig. 8: Hand action detection results for the variants of the proposed error correcting approach and the standard Hough transform method.
Fig. 9: Confusion matrix of our ECHT on the action recognition task.

IV-B Real-life Experiments

Real-life experiments are conducted using our in-house dataset, which contains ego-centric depth videos collected from 26 subjects of different ethnic groups and ages. Some of the acquired hand data come with various accessories such as wrist watches, cords, rings, etc. For the training data, we collect from 15 subjects 240 long video sequences, which contain 2,518 single-action subsequences. For the testing data, we collect from another 11 subjects 60 long video sequences, which contain 659 single-action subsequences. The lengths of the long video sequences vary from 500 to 2,500 frames, while the single-action subsequences vary in length from 7 up to 48 frames.

Action Recognition

The action recognition task is based on single-action video clips. As stated previously, our in-house dataset contains 2,518 training instances and 659 testing instances of 16 action classes. The average F1 accuracies and standard deviations of the comparison methods are as follows: our approach and its variants achieve ECHT 96.18% ± 3.27%, ECHT-T 93.74% ± 7.86%, and ECHT-C 88.40% ± 18.74%; the standard HT achieves 87.89% ± 18.66%; and HMM1 and HMM2 achieve 63.64% ± 23.57% and 57.16% ± 27.70%, respectively. Not surprisingly, ECHT consistently overtakes the rest by a noticeable margin, followed by ECHT-T, ECHT-C, and the standard HT, respectively, while HMM1 & HMM2 produce the least favorable results with a much larger spread. The confusion matrix of our ECHT is further presented in Fig. 9.

Action Detection

Our training data contain 240 long video sequences (with 2,518 foreground single-action subsequences), and the test data contain 60 long video sequences (with 659 foreground single-action subsequences), as mentioned previously. These sequences also contain various background daily actions, such as using a keyboard, mouse & telephone, reading, writing, drinking, looking around, etc. Performances of the comparison methods are shown in Fig. 8, while Fig. 4 presents exemplar hand action detection visual results of our approach. The experimental results suggest that temporal error correction is the primary factor accounting for the performance gains, while class error correction plays a relatively minor role. Moreover, although ECHT-C alone provides a comparably small improvement, the combined usage of temporal and class error correction always leads to a notable performance gain over the baseline method of standard HT. Our ECHT action detector is also demonstrated to be capable of robustly detecting the target actions from backgrounds. Fig. 7 shows several incorrect detection results, while more visual results are provided in the supplementary video files.

Fig. 10: Robustness evaluation of the internal parameters. (a) F1 score as a function of the size of the error correcting map. (b) F1 score as a function of the number of trees, the same value is used for both and .

Size of error correcting map

As our approach contains several internal parameters, it is of interest to examine the performance robustness with respect to them. We start by looking at the error correcting space, which is in our context a cubic space of snippet offset, voted temporal deviation, and voted action type, quantized into sub-cubes. The quantization of the action-type dimension is fixed by the number of classes, so we vary the number of sub-cubes along the two temporal dimensions. As shown in Fig. 10(a), the hand action detection performance is very stable with respect to the size of the error correcting map.

Fig. 11: F1 score as a function of synthesized perturbations on hand location results. (a) F1 scores vs. the standard deviation of the Gaussian perturbation noise on hand locations. (b) F1 scores vs. the standard deviation of the Gaussian perturbation noise on hand orientations.

Number of Trees

We further investigate the performance variations with respect to the number of trees in both forests. As displayed in Fig. 10(b), empirically the F1 score increases as the number of trees grows to 12, after which the F1 score remains largely unchanged as the number of trees continues to increase.

Perturbations of Hand Localization Results

As our approach includes a preprocessing step to localize the hand position and orientation, the result of this step inevitably influences the overall performance. To study its effect, we add random Gaussian noise as a disturbance to the estimated hand location. As displayed in Fig. 11, when the perturbation noise on position is lower than 10 mm, the performance (F1 score) remains relatively stable. When the noise reaches 15 mm and beyond, a very noticeable performance drop sets in. This is mainly because the estimated hand position frequently falls outside the hand region onto the background when the noise is larger than 15 mm. Nevertheless, in practice our hand location estimator can reliably locate the target in the hand region. It is also empirically observed that our approach is relatively more stable with respect to orientation disturbances than to position disturbances.

Running Time

Real-life experiments are performed on a desktop with an Intel Core i7 CPU at 3.20 GHz and 24 GB of memory. Note that our code is not optimized, and only one core is used during the experiments. Table I shows the timing profile of our algorithm on a test video sequence.

Step           Computation Time
Preprocess     8.642 ms/frame
Obtain Votes   0.289 ms/frame
Apply ECHT     0.016 ms/frame
Total          8.947 ms/frame
TABLE I: Computation time of an exemplar test run.
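A per-frame timing breakdown like Table I can be gathered with a small helper that times one pipeline stage over a sequence of frames. This is a generic sketch, not the authors' profiling code; the stage functions are assumed.

```python
import time

def profile_stage(stage_fn, frames):
    """Apply stage_fn to each frame and return the
    average per-frame wall-clock time in milliseconds."""
    start = time.perf_counter()
    for frame in frames:
        stage_fn(frame)
    elapsed = time.perf_counter() - start
    return elapsed / len(frames) * 1e3

# Hypothetical usage over the three stages of Table I:
# for name, fn in [("Preprocess", preprocess),
#                  ("Obtain Votes", obtain_votes),
#                  ("Apply ECHT", apply_echt)]:
#     print(name, profile_stage(fn, frames), "ms/frame")
```

With single-threaded, unoptimized code, the total per-frame cost is simply the sum of the stage averages, as in the table.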

V Conclusion and Outlook

This paper describes an error-correcting Hough forest approach to tackle the novel and challenging problem of hand action detection from mobile ego-centric depth sequences. Empirical evaluations demonstrate the applicability of the proposed approach. For future work, we plan to extend our approach to action detection scenarios beyond hand actions.


  • [1] “Google Glass,” www.google.com/glass, 2013.
  • [2] “Metaview Spaceglasses,” www.getameta.com, 2015.
  • [3] “Oculus Rift,” www.oculus.com, 2016.
  • [4] A. Fathi, A. Farhadi, and J. Rehg, “Understanding egocentric activities,” in ICCV, 2011.
  • [5] Y. Li, A. Fathi, and J. Rehg, “Learning to predict gaze in egocentric video,” in ICCV, 2013.
  • [6] D.-A. Huang, M. Ma, W. Ma, and K. Kitani, “How do we use our hands? discovering a diverse set of common grasps,” in CVPR, 2015.
  • [7] “Kinect,” www.xbox.com/en-US/kinect, 2011.
  • [8] “Softkinetic,” www.softkinetic.com, 2012.
  • [9] C. Li and K. M. Kitani, “Pixel-level hand detection in ego-centric videos,” in CVPR, 2013.
  • [10] I. Oikonomidis, M. Lourakis, and A. Argyros, “Evolutionary quasi-random search for hand articulations tracking,” in CVPR, 2014.
  • [11] D. Tang, A. Tejani, H. Chang, and T. Kim, “Latent regression forest: Structured estimation of 3D articulated hand posture,” in CVPR, 2014.
  • [12] C. Qian, X. Sun, Y. Wei, X. Tang, and J. Sun, “Realtime and robust hand tracking from depth,” in CVPR, 2014.
  • [13] C. Xu, A. Nanjappa, X. Zhang, and L. Cheng, “Estimate hand poses efficiently from single depth images,” IJCV, pp. 1–25, 2015.
  • [14] G. Rogez, J. Supancic, and D. Ramanan, “First-person pose recognition using egocentric workspaces,” in CVPR, 2015.
  • [15] ——, “Understanding everyday hands in action from rgb-d images,” in ICCV, 2015.
  • [16] N. Razavi, J. Gall, P. Kohli, and L. van Gool, “Latent Hough transform for object detection,” in ECCV, 2012.
  • [17] P. Wohlhart, S. Schulter, M. Kostinger, P. Roth, and H. Bischof, “Discriminative Hough forests for object detection,” in BMVC, 2012.
  • [18] O. Woodford, M. Pham, A. Maki, F. Perbet, and B. Stenger, “Demisting the Hough transform for 3D shape recognition and registration,” IJCV, 2013.
  • [19] C. Schuldt, I. Laptev, and B. Caputo, “Recognizing human actions: A local SVM approach,” in ICPR, 2004.
  • [20] H. Pirsiavash and D. Ramanan, “Parsing videos of actions with segmental grammars,” in CVPR, 2014.
  • [21] K. Schindler and L. van Gool, “Action snippets: how many frames does human action recognition require?” in CVPR, 2008.
  • [22] I. Laptev and P. Perez, “Retrieving actions in movies,” in ICCV, 2007.
  • [23] J. Yuan, Z. Liu, and Y. Wu, “Discriminative 3D subvolume search for efficient action detection,” CVPR, 2009.
  • [24] G. Gkioxari, R. Girshick, and J. Malik, “Contextual action recognition with R*CNN,” in ICCV, 2015.
  • [25] A. Yao, J. Gall, and L. V. Gool, “A Hough transform-based voting framework for action recognition,” in CVPR, 2010.
  • [26] T. Yu, T. Kim, and R. Cipolla, “Unconstrained monocular 3D human pose estimation by action detection and cross-modality regression forest,” in CVPR, 2013.
  • [27] H. Pirsiavash and D. Ramanan, “Detecting activities of daily living in first-person camera views,” in CVPR, 2012.
  • [28] M. Ryoo and L. Matthies, “First-person activity recognition: What are they doing to me?” in CVPR, 2013.
  • [29] Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton based action recognition,” in CVPR, 2015.
  • [30] J. Donahue, L. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in CVPR, 2015.
  • [31] M. Ye, Q. Zhang, L. Wang, J. Zhu, R. Yang, and J. Gall, Time-of-Flight and Depth Imaging: Sensors, Algorithms, and Applications, ser. LNCS 8200.    Springer, 2013, ch. A Survey on Human Motion Analysis from Depth Data, pp. 149–87.
  • [32] J. Wang, Z. Liu, Y. Wu, and J. Yuan, “Learning actionlet ensemble for 3D human action recognition,” IEEE TPAMI, vol. 36, no. 5, pp. 914–27, 2014.
  • [33] P. Wei, N. Zheng, Y. Zhao, and S. Zhu, “Concurrent action detection with structural prediction,” in ICCV, 2013.
  • [34] M. Moghimi, P. Azagra, L. Montesano, A. Murillo, and S. Belongie, “Experiments on an RGB-D wearable vision system for egocentric activity recognition,” in CVPR Workshop on Egocentric (First-person) Vision, 2014.
  • [35] H.-K. Lee and J. H. Kim, “An HMM-based threshold model approach for gesture recognition,” IEEE TPAMI, vol. 21, no. 10, pp. 961–73, 1999.
  • [36] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore, “Real-time human pose recognition in parts from single depth images,” Comm. ACM, vol. 56, no. 1, pp. 116–24, 2013.
  • [37] S. Lang, M. Block, and R. Rojas, “Sign language recognition using kinect,” in Artificial Intelligence and Soft Computing.    Springer, 2012, pp. 394–402.
  • [38] P. Molchanov, S. Gupta, K. Kim, and K. Pulli, “Multi-sensor system for driver’s hand-gesture recognition,” in IEEE Conference on Automatic Face and Gesture Recognition, 2015.
  • [39] E. Ohn-Bar and M. M. Trivedi, “Hand gesture recognition in real time for automotive interfaces: A multimodal vision-based approach and evaluations,” IEEE Trans on Intelligent Transportation Systems, vol. 15, no. 6, pp. 2368–77, 2014.
  • [40] P. Hough, “Machine analysis of bubble chamber pictures,” in Proc. Int. Conf. High Energy Accelerators and Instrumentation, 1959.
  • [41] R. Duda and P. Hart, “Use of the Hough transformation to detect lines and curves in pictures,” Commun. ACM, vol. 15, pp. 11–5, 1972.
  • [42] D. Ballard, “Generalizing the Hough transform to detect arbitrary shapes,” Pattern Recognition, vol. 13, no. 2, pp. 111–22, 1981.
  • [43] B. Leibe, A. Leonardis, and B. Schiele, “Combined object categorization and segmentation with an implicit shape model,” in ECCV Workshop statistical learning in CV, 2004.
  • [44] S. Maji and J. Malik, “Object detection using a max-margin Hough transform,” in CVPR, 2009.
  • [45] J. Gall, A. Yao, N. Razavi, L. V. Gool, and V. Lempitsky, “Hough forests for object detection, tracking, and action recognition,” IEEE TPAMI, vol. 33, no. 11, pp. 2188–202, 2011.
  • [46] P. Yarlagadda, A. Monroy, and B. Ommer, “Voting by grouping dependent parts,” in ECCV, 2010.
  • [47] O. Barinova, V. Lempitsky, and P. Kholi, “On detection of multiple object instances using Hough transforms,” IEEE TPAMI, vol. 34, no. 9, pp. 1773–84, 2012.
  • [48] W. Huffman and V. Pless, Fundamentals of error-correcting codes.    Cambridge Univ. Press, 2003.
  • [49] T. Dietterich and G. Bakiri, “Solving multiclass learning problems via error-correcting output codes,” JAIR, vol. 2, pp. 263–86, 1995.
  • [50] S. B. Guruprasad, “On the breakdown point of the Hough transform,” in International Conference on Advances in Pattern Recognition, 2009.
  • [51] C. Chang and C. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intel. Sys. Tech., vol. 2, pp. 1–27, 2011.