Left/Right Hand Segmentation in Egocentric Videos

07/21/2016 ∙ by Alejandro Betancourt, et al. ∙ TU Eindhoven ∙ Università di Genova

Wearable cameras allow people to record their daily activities from a user-centered (First Person Vision) perspective. Due to their favorable location, wearable cameras frequently capture the hands of the user, and may thus represent a promising user-machine interaction tool for different applications. Existing First Person Vision methods handle hand segmentation as a background-foreground problem, ignoring two important facts: i) hands are not a single "skin-like" moving element, but a pair of interacting cooperative entities; ii) close hand interactions may lead to hand-to-hand occlusions and, as a consequence, create a single hand-like segment. These facts complicate a proper understanding of hand movements and interactions. Our approach extends traditional background-foreground strategies by including a hand-identification step (left-right) based on a Maxwell distribution of angle and position. Hand-to-hand occlusions are addressed by exploiting temporal superpixels. The experimental results show that, in addition to a reliable left/right hand-segmentation, our approach considerably improves the traditional background-foreground hand-segmentation.




1 Introduction

The recent widespread availability of wearable devices has quickly attracted the interest of researchers, computer scientists and high-tech companies Starner2013 . The 90’s idea of a body-worn device that is always ready to be used is nowadays possible, and its potential applicability to real problems is evident. In general, the wearable sensor that most attracted researchers’ attention is the video camera: while enjoying a unique position to record what the user is seeing, it suffers from important issues and technical challenges Betancourt2014 . Images and videos recorded from this perspective are commonly referred to as First-Person Vision (FPV) or Egocentric videos Betancourt2014 .

Figure 1: The difference between hand-segmentation and hand-identification

One of the more promising aspects of this video perspective is the tight link between the camera location and the user’s point of view, which makes it possible to frequently capture the user’s hands. A proper understanding of hand movements enables important applications such as activity recognition Fathi2011 , user-machine interaction Baraldi2015 , gaze estimation Fathi2012a ; Buso2015 , and hand-posture recognition Cai2015 ; Yang2015 , among others. The authors of Betancourt2015b propose a hierarchical structure to develop hand-based methods and highlight several fields that might benefit from robust and efficient hand understanding techniques.

Hand-based studies are not restricted to wearable cameras and computer vision. In fact, biologists and neuroscientists have been deeply exploring hand usage in daily activities Renteria2012 , even before the emergence of modern wearable devices. A considerable number of studies and theories investigate hand-dominance in humans and its relationship with upper limb motor skills. It is estimated that the large majority of individuals are right-handed and, as a consequence, their upper limb skills are asymmetric with respect to speed, control and strength. Interestingly, these findings are similar across different geographic locations and cultures Mcmanus2009 .

Current FPV hand-based literature consistently approaches inference over hands as a background/foreground segmentation problem, where hand-like pixels represent the foreground while the remaining pixels define the background Li2013b ; Fathi2011a . Even if this approach provides a broad range of algorithmic approaches and evaluation criteria based on machine learning and computer vision, it oversimplifies the biological perspective, ignoring hand-dominance and limiting the capabilities of wearable cameras to understand hand interactions and asymmetric upper limb motor skills. Figure 1 shows the difference between the traditional binary hand-segmentation and a left/right (L/R) hand-segmenter as proposed in this work. The second row shows an example in which both hands interact closely, thus causing a hand-to-hand occlusion. The standard binary hand-segmenter fails to explain these occlusions and creates a single hand-like segment, while the L/R hand-segmenter can detect and split the two hands correctly.

Concerning hand-identification, some authors have suggested a strong relation between hand identity and the position of hand-like segments Philipose2009 . Intuitively, segments located on the left or right side of the frame belong to the left or right hand, respectively. Figure 2 shows some examples of frames where this approach does not work. In summary, the location-based strategy performs well if the symmetry between the hands is almost perfect and there are no hand-to-hand occlusions.

(a) Asymmetry in the position of the hands.
(b) Manipulating objects in the borders of the frame.
(c) Hand-Segmentation with Occlusion problem.
Figure 2: Hand-Segmentation examples.

From our experience, the capability of wearable cameras to distinguish one hand from the other is critical. This is particularly true when it comes to understanding bi-manual tasks Swinnen2004 ; Vincze2009 , e.g. in medical rehabilitation of upper-limb stroke Turolla2013 and cerebral palsy Speth2013a . Another field that can benefit from this independent understanding is neuroscience, where hand-dominance is commonly associated with several neurological factors Goble2008 . For example, the authors of Knaus2016 ; Cook2013 found significant differences in the hand-dominance level of children with Autism Spectrum Disorder (ASD). A wearable camera that is able to differentiate between the left and the right hand is not only in line with the biological and neuroscientific perspective, but also opens the door to understanding hands as two interacting entities, centrally coordinated to achieve a particular goal.

To the best of our knowledge, this is the first work exploring in detail an L/R hand-segmentation, considering realistic scenarios with hand-to-hand occlusions and asymmetric hand positions. The contribution of this paper is threefold: i) It proposes a theoretical hand-identification model based on Maxwell distributions to decide whether a hand-like segment corresponds to a left or a right hand; ii) It addresses the hand-to-hand occlusion problem by exploiting temporal superpixels under a dynamic procedure; iii) It significantly improves a state-of-the-art binary hand-segmenter by using a multi-model classification strategy and assuming that at most one left and one right hand appear in front of the user. The last assumption is valid, especially in the aforementioned medical applications, where no human-to-human interaction is required Turolla2013 .

The remainder of this paper is organized as follows: Section 2 summarizes the state of the art in hand-based methods in FPV. In Section 3 our approach is presented, and it is subsequently evaluated in Section 4. The evaluation is performed sequentially, by first analyzing each component independently. Finally, Section 5 concludes and provides future research lines based on the obtained results.

2 State of the Art

Recent literature Betancourt2014 highlights the significant role of hands in FPV. Several promising hand-based applications are frequently mentioned, such as activity recognition Fathi2011 and user-machine interaction Baraldi2015 , among others. Different authors have also sketched other advanced and more realistic applications. However, real applicability is still restricted by the limited capabilities of current methods to work under realistic conditions, such as illumination changes or complex hand interactions Betancourt2015b .

In Betancourt2015b the authors propose a unified structure for hand-based methods, which highlights the importance of understanding hands at different hierarchical levels (i.e. hand-detection, hand-segmentation, hand-identification and hand-tracking). The first level of the structure is hand-detection, which answers the yes-no question about hand presence in a frame. The objective of this level is to optimize computational resources and to reduce the false positive rate when hands are not being recorded by the camera Betancourt2014a ; Betancourt2015a . Hand-detection is commonly treated as a frame-by-frame classification problem Betancourt2015 , and is frequently taken for granted when studying controlled tasks where the user is always manipulating objects in front of the camera, for example in the Kitchen Fathi2012a and the EDSH Li2013a datasets.

Once hands are detected, the following step, and most studied level, is hand-segmentation. The goal of hand-segmenters is to find the set of pixels of a particular frame belonging to the hands of the user. Recent hand-segmenters can be considered advanced implementations of the color-based seminal work of Jones1999 . Remarkable results are obtained in the work of Li2013a , where a Random Forest classifier is trained to discriminate positive and negative pixels. That work also explores the use of texture and the fusion of multiple hand-segmenters to deal with changing light conditions Li2013b . This strategy is further improved in Zhu2014 by preserving the shape of the hands using a shape-aware classifier, and also in Baraldi2015 by using superpixels. The authors in Zhu2015 use the segmented hands to divide the hand-like segments into fingers, palm, and arm.

According to the framework proposed in Betancourt2015b , hand-identification is more than an incremental step towards the solution of the problem, since it opens a range of new possibilities and applications. It makes it technically possible to shift the paradigm towards viewing the hands as two interacting entities working jointly to achieve a particular goal. Literature on hand-identification is limited: the problem is usually regarded as a post-processing step performed after segmentation. Ren et al. Philipose2009 , for instance, use the side of the frame where the skin-like segments are located to label them as the left or right hand. We can identify three common cases in which this approach does not work correctly: i) The symmetry of the skin-like segments is affected by changes in the attention point or the camera position, Figure 2(a). ii) The user is manipulating an object close to the frame borders, Figure 2(b). iii) The hands are close enough to be segmented as a single skin-like segment (hand-to-hand occlusion), Figure 2(c).
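The failure case ii) can be illustrated with a minimal sketch of a purely location-based labeller. The function name `label_by_side` and the centroid-based rule are illustrative assumptions, not the exact procedure of Philipose2009.

```python
def label_by_side(centroid_x, frame_width):
    """Label a hand-like segment by the half of the frame its centroid falls in."""
    return "left" if centroid_x < frame_width / 2 else "right"


# Two segments in the right half of a 640-pixel-wide frame, as can happen
# when the user manipulates an object near the right border.
labels = [label_by_side(x, 640) for x in (400, 560)]
print(labels)  # both segments receive the same label
```

Because each segment is labelled in isolation, nothing prevents two segments from being assigned to the same hand, which motivates a model that also uses orientation and a competition between segments.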

To address these cases, Fathi et al. Fathi2011a use a Support Vector Machine that is able to classify each frame into four categories, i.e. single left hand, single right hand, two different hands, two interacting hands. These categories oversimplify the hand-identification problem since they do not provide an L/R hand-segmentation. In the same line of research, Buso2015 extends the approach of Fathi2011a by using the relative positions of the segmented hands and the active objects to build a goal-oriented model of attention. Also denoted as hand-identification but targeting a different purpose, the authors of Lee2014 propose a Bayesian method to identify whether a hand-like segment belongs to the user or to somebody in front of the user. This problem clearly provides an alternative definition of hand-identification, which is particularly important when the user is interacting with other people. The authors illustrate the importance of this approach by using a dataset recorded for medical experiments with children.

The primary goal of the present work is to address the hand-identification problem following the definition proposed in the independent studies Betancourt2014 and Philipose2009 . Our approach relies on a multi-model implementation of the binary hand-segmenter proposed by Li2013a ; Li2013b , but can be easily extended to future improvements of any hand-segmentation algorithm. Compared to the state of the art literature, we highlight three important novelties:

  • The proposed L/R hand-segmenter significantly improves segmentation score of the state-of-the-art by fusing multiple Random Forests to capture light changes and by exploiting the fact that the user has at most one left and one right hand. This assumption is particularly useful when studying bi-manual tasks in controlled environments. The experimental results show that, in addition to the reliable left and right information, our final segmentation improves the state-of-the-art binary hand-segmenter Zhu2014 of around score points in some videos of the kitchen dataset Fathi2011a .

  • In contrast to Fathi2011a , our approach relies on simple set algebra to detect occlusions, which is computationally more efficient and achieves a detection level of (as shown in section 4). Moreover, our method not only provides a category label but splits the occluded binary hand-segment by using superpixels.

  • Given a previous occlusion detection and split, we propose a probabilistic L/R hand-identification model using a max-likelihood ratio test of two Maxwell distributions based on position and angle. Our approach is robust to asymmetries in the hand positions and can be tuned for different camera locations and lenses. The experimental results show that our method accurately identifies more than 99% of the manual masks of the Kitchen dataset Fathi2011a (see Table 1).

3 Our approach

Our final goal is to extract an accurate L/R hand-segmentation that is robust to hand-to-hand occlusions, asymmetric hand configurations and object manipulations in the borders of the frame. Figure 3 summarizes the proposed work-flow; at the top of the diagram there is the input frame, while the resulting L/R hand-segmentation is shown at the bottom. The intermediate steps are the hand-segmentation (Section 3.1), the hand-to-hand occlusion detection and disambiguation (Section 3.2), and the hand-identification (Section 3.3). These steps are in line with the unified structure proposed in Betancourt2015b . It is important to note that the three levels (i.e. Hand-Segmentation, Occlusion-Detection, Hand-Identification) are mutually independent, which makes it possible to improve them separately or to use more complex sensors instead. As an example, the occlusion detector can be applied to hand-segments coming from an RGB-D camera, and the hand-identification method can be used on top of a faster occlusion detector, or even directly on the hand-segmenter if the occlusions are not relevant for a particular application.

Figure 3: General description of our approach

From the diagram it can be noticed that, at each time instant the procedure exploits the previous L/R segmentation to detect and split the hand-to-hand occlusions. This temporal dependency requires a reliable previous L/R segmentation and no hand-to-hand occlusion in the initial frame. Intuitively, the higher the sampling rate, the more reliable the occlusion detection; however, as will be shown in 4, even using sampling rates of the final segmentation accuracy of the left and right hand is still around . The main goals of this work are the algorithmic performance and the segmentation capabilities. To reach real-time performance, the algorithm must be optimized by balancing the compression width, the sampling rate, or by developing parallel versions of the Random Forest and/or the superpixel algorithm. Our current work points out that, using GPU implementation and an image resampler (outputting pixels width images and preserving aspect-ratio), it is possible to achieve a throughput of .

3.1 Binary Hand-Segmentation

At this level there is no difference between the left and the right hand. The objective is to discriminate the pixels of the frame that look like hand skin based on color. This level is based on a multi-model version of the pixel-by-pixel binary hand-segmenter proposed by Li2013b . Figure 4 summarizes the general idea of the multi-model approach. The gray blocks correspond to training, while the white blocks correspond to testing.

Figure 4: Multi-model binary hand-segmenter

The first column of the figure contains the manual masks and their corresponding raw frames. The masks were extracted using the graph cut manual segmenter provided by Li2013b . Let us denote as $N$ the number of manual masks available in the dataset, and as $n$ the number of training pairs selected to build a multi-model binary hand-segmenter. For each training pair $i$, a trained binary random forest ($RF_i$) and its global features ($g_i$) are obtained and stored in order to construct a pool of illumination models (second column of the figure). Each $RF_i$ is trained using as features the color values of each pixel in the frame and as class their corresponding values in the binary masks. As global feature ($g_i$) we use the flattened HSV histogram. The choice of the color spaces is based on the results reported by Li2013b and Morerio2013 . Once the illumination models are trained, a K-Nearest-Neighbors structure, denoted as $K_{RF}$, is estimated using as input the global features $g_i$.

In the testing phase, the K-Nearest-Neighbors structure is used as a recommender system which maps the global features of a frame to the indexes of the $k$ nearest illumination models. These are used to obtain $k$ possible segmentations, which are eventually fused to obtain the final hand-segmentation. This procedure is illustrated in the third column of Figure 4. Formally, let us denote the testing frame as $f_t$ and its HSV histogram as $g_t$, the indexes of the $k$ closest illumination models ordered by increasing Euclidean distance as in equation (1), their corresponding random forests as in equation (2), and their pixel-by-pixel segmentations applied to $f_t$ as in equation (3).

The binary hand-segmentation of the frame is the normalized weighted average of the individual segmentations, which is formally given by equation (4). The weights are controlled by a decaying weight factor $\lambda$, whose value is chosen based on the results of Li2013b . With this in mind, the hand-segmentation has two parameters to be defined, namely the number of illumination models ($n$) and the number of closest random forests to average ($k$). These parameters are defined in Section 4 following an exhaustive evaluation on the Kitchen dataset.
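The recommend-then-fuse procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the random forests are replaced by fixed per-pixel probability lists, the weights follow a geometric decay `decay**j` (the exact weighting of equation (4) is not recoverable here), and the names `nearest_models` and `fuse` are hypothetical.

```python
import math


def nearest_models(query_hist, model_hists, k):
    """Indexes of the k illumination models closest to the query histogram."""
    order = sorted(range(len(model_hists)),
                   key=lambda i: math.dist(query_hist, model_hists[i]))
    return order[:k]


def fuse(segmentations, decay):
    """Normalized weighted average of per-pixel hand probabilities.

    `segmentations` is ordered from the closest to the farthest model; the
    j-th one receives weight decay**j (an assumed geometric decay).
    """
    weights = [decay ** j for j in range(len(segmentations))]
    total = sum(weights)
    n_pixels = len(segmentations[0])
    return [sum(w * seg[p] for w, seg in zip(weights, segmentations)) / total
            for p in range(n_pixels)]


# Toy data: three illumination models described by 2-bin global histograms,
# and stand-in "forests" that return a fixed probability map for 3 pixels.
model_hists = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
model_outputs = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.0]]

idx = nearest_models([0.8, 0.2], model_hists, k=2)
fused = fuse([model_outputs[i] for i in idx], decay=0.5)
mask = [p >= 0.5 for p in fused]
print(idx, mask)
```

In a real implementation the stand-in outputs would be replaced by the per-pixel predictions of the selected random forests on the test frame.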


At this point, there is a set of hand-like segments, some of them matching the hands of the user (true positives) and some of them resulting from pixels in the image with a color similar to the user’s skin (false positives). If a fixed camera location (e.g. head, chest, shoulder) is known, then it is possible to define a set of post-processing rules to remove some of the false positives. The post-processing has four steps: i) Find the contours (polygons) containing the hand-like pixels; ii) Remove the contours that are far from the left, bottom or right margin; iii) Remove the contours smaller than a minimum size defined in terms of the frame width; iv) Keep the largest remaining contours. We perform an extra filtering stage after the hand-identification step to keep only the best left and best right contour.
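The post-processing rules can be sketched on contours summarized by their bounding box and area. The thresholds (`margin`, `min_area`, `keep`) and the dictionary layout are hypothetical choices for illustration; the values used by the authors are not available here.

```python
def filter_contours(contours, frame_w, frame_h, margin, min_area, keep):
    """Apply rules ii) to iv) to contours given as dicts with
    'bbox' = (x0, y0, x1, y1) and 'area' in pixels."""
    def near_border(bbox):
        x0, _, x1, y1 = bbox
        # Close to the left, right or bottom margin of the frame
        return x0 <= margin or x1 >= frame_w - margin or y1 >= frame_h - margin

    kept = [c for c in contours
            if near_border(c["bbox"]) and c["area"] >= min_area]
    # Keep only the largest remaining contours
    return sorted(kept, key=lambda c: c["area"], reverse=True)[:keep]


contours = [
    {"name": "A", "bbox": (0, 300, 200, 480), "area": 20000},    # touches left/bottom
    {"name": "B", "bbox": (300, 100, 340, 140), "area": 1600},   # floating in the middle
    {"name": "C", "bbox": (500, 400, 640, 480), "area": 9000},   # touches right/bottom
    {"name": "D", "bbox": (600, 470, 640, 480), "area": 300},    # too small
]
result = filter_contours(contours, 640, 480, margin=20, min_area=500, keep=2)
print([c["name"] for c in result])
```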

3.2 Hand-to-hand Occlusions

The proposed L/R hand-identification model assumes that the hand-like segments are not occluded or have been split before. If hand-to-hand occlusions were ignored the L/R hand-identification model would process a larger hand-like segment and would assign it completely as left or right. Moreover, ignoring the occlusions would make the tracking of the hands more complex due to frequent flickering in the hand-identification. To avoid these cases we perform an occlusion detection step (section 3.2.1) followed by a segmentation split (section 3.2.2). Figure 5 shows some examples of occlusion (first column) and the split (second column). The third column shows the result of the L/R hand-segmentation if occlusions are properly handled.

Figure 5: Examples of Hand-to-hand occlusion split.

3.2.1 Hand-to-hand Occlusion Detection

The main goal of this step is to decide whether the hand-like segments of a particular frame come from a hand-to-hand occlusion. Given a reasonable sampling rate, it is possible to assume that a hand-to-hand occlusion requires the presence of both hands in the previous frame. Let us then define the previously detected L/R hand-segments as $L_{t-1}$ and $R_{t-1}$, respectively, and the largest binary hand-segment of the current frame as $S_t$. Two important assumptions must be verified here: i) There are no hand-to-hand occlusions in the first frame of the video segments containing both hands; ii) The detection and split reliability is high. The first assumption generally holds for videos recording realistic hand interactions, like the Kitchen dataset. The second assumption is evaluated in Sections 4.1 and 4.2.2.

In case of occlusion, and given a small sampling rate, the current segment $S_t$ must intersect simultaneously with the previous left and right segments $L_{t-1}$ and $R_{t-1}$. If this happens, it can be assumed that the hands are close enough, or connected by noisy pixels, to be considered a hand-to-hand occlusion. Algorithm 1 formally defines the hand-to-hand occlusion detection. For the sake of compactness of notation, we use a bar over segments to refer to their area (e.g. $\bar{S}_t$ refers to the area of segment $S_t$). As can be seen in the pseudocode, $S_t$ is defined as a hand-to-hand occlusion if the area of its intersection with $L_{t-1}$ and $R_{t-1}$, relative to the total area of the L/R hand-segments, lies between a lower threshold $\tau_{\min}$ and an upper threshold $\tau_{\max}$.

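procedure IsOcclusion($S_t$, $L_{t-1}$, $R_{t-1}$)
      if $S_t \cap L_{t-1} \neq \emptyset$ and $S_t \cap R_{t-1} \neq \emptyset$ then
            $I \leftarrow (S_t \cap L_{t-1}) \cup (S_t \cap R_{t-1})$
            $\rho \leftarrow \bar{I} / (\bar{L}_{t-1} + \bar{R}_{t-1})$
            if $\tau_{\min} \leq \rho \leq \tau_{\max}$ then
                  return True
            else
                  return False
            end if
      else
            return False
      end if
end procedure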
Algorithm 1 Hand-to-hand occlusion detection.
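The set-algebra test described above can be sketched with segments represented as Python sets of (row, col) pixels. The threshold values `low` and `high` are hypothetical placeholders, since the bounds used by the authors are not available here.

```python
def is_occlusion(current, prev_left, prev_right, low, high):
    """Set-algebra occlusion test on segments given as sets of (row, col) pixels."""
    if not prev_left or not prev_right:
        return False  # an occlusion needs both hands in the previous frame
    inter_left = current & prev_left
    inter_right = current & prev_right
    if not inter_left or not inter_right:
        return False  # the current segment must touch both previous hands
    # Intersection area relative to the total area of the previous L/R segments
    ratio = (len(inter_left) + len(inter_right)) / (len(prev_left) + len(prev_right))
    return low <= ratio <= high


def block(cols):
    """Helper: a two-row rectangle covering the given columns."""
    return {(r, c) for r in range(2) for c in cols}


prev_left, prev_right = block(range(0, 4)), block(range(6, 10))
merged = block(range(2, 8))      # one blob overlapping both previous hands
only_left = block(range(0, 4))   # a blob overlapping the left hand only

print(is_occlusion(merged, prev_left, prev_right, low=0.25, high=1.0))
print(is_occlusion(only_left, prev_left, prev_right, low=0.25, high=1.0))
```

On binary masks stored as arrays, the same test reduces to element-wise logical AND followed by pixel counts.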

3.2.2 Occlusion splitting

In the case of a hand-to-hand occlusion, the next step is to split the affected segment in two parts by exploiting its inner edges and the previous L/R hand-segments. Following the notation of Section 3.2.1, let us define $SP_t$ as a superpixel representation of the frame at time $t$. Algorithm 2 summarizes the steps to split hand-to-hand occlusions. Our approach initially relies on the intersection with the previous L/R segments and subsequently, if necessary, on the superpixels of the previous frame.

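1:procedure SplitOcclusion($S_t$, $L_{t-1}$, $R_{t-1}$, $SP_{t-1}$)
2:       $L_t \leftarrow \emptyset$
3:       $R_t \leftarrow \emptyset$
4:       for $p \in S_t$ do
5:             if $p \in L_{t-1}$ then
6:                    $L_t \leftarrow L_t \cup \{p\}$
7:             else if $p \in R_{t-1}$ then
8:                    $R_t \leftarrow R_t \cup \{p\}$
9:             else
10:                    ▷ $p$ is not covered by the previous L/R segments
11:                    $s^{*} \leftarrow$ closest superpixel to $p$ in $SP_{t-1}$
12:                    if $s^{*} \subseteq L_{t-1}$ then
13:                           $L_t \leftarrow L_t \cup \{p\}$
14:                    else
15:                           $R_t \leftarrow R_t \cup \{p\}$
16:                    end if
17:             end if
18:       end for
19:       return $L_t$, $R_t$
20:end procedure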
Algorithm 2 Splitting occluded hand segments.

Intuitively, the intersection of the current hand-segments with the previous L/R segments provides reliable decisions for small sampling rates. However, due to fast camera and hand movements, not all the pixels inside the occluded hand-segment can be resolved in this way. For these pixels, we rely on the closest previous superpixel. The higher the sampling rate, the more relevant the superpixel criterion.
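The two-stage decision (previous L/R segments first, closest previous superpixel as fallback) can be sketched as follows. This is a simplification under stated assumptions: superpixels are summarized by a centroid and a left/right label, and closeness is measured with the spatial distance only, whereas the paper combines color and space.

```python
import math


def split_occlusion(current, prev_left, prev_right, prev_superpixels):
    """Split an occluded segment (a set of (row, col) pixels) into left/right.

    `prev_superpixels` is a list of (centroid, label) pairs describing the
    superpixels of the previous frame, with label in {"left", "right"}.
    """
    left, right = set(), set()
    for pixel in current:
        if pixel in prev_left:
            left.add(pixel)
        elif pixel in prev_right:
            right.add(pixel)
        else:
            # Fallback: closest previous superpixel (spatial distance only here)
            _, label = min(prev_superpixels,
                           key=lambda sp: math.dist(pixel, sp[0]))
            (left if label == "left" else right).add(pixel)
    return left, right


prev_left = {(0, 0), (0, 1)}
prev_right = {(0, 4), (0, 5)}
occluded = {(0, c) for c in range(6)}
superpixels = [((0, 0.5), "left"), ((0, 4.5), "right")]

left, right = split_occlusion(occluded, prev_left, prev_right, superpixels)
print(sorted(left), sorted(right))
```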

In practice, we use the original SLIC algorithm as the superpixel method with the metric $D$ defined by equation (5). The same metric is used to find the closest previous superpixel in Algorithm 2, line 11, where $d_c$ is the color metric given by equation (6), $d_s$ is the space metric given by equation (7), and $m$ is the spatial weight. Our experimental results use LAB as color space for two reasons: i) It has been pointed out as the best performing feature for hand-segmentation in egocentric videos; ii) It is the feature used in the original SLIC algorithm. (Footnote 1: See Alata2009 for a detailed comparison of different color spaces and their discriminative power.)
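For reference, the distance commonly given for the original SLIC algorithm combines a color term and a grid-normalized spatial term. The sketch below follows that standard formulation; it is an assumption that the metric in equations (5) to (7) has exactly this form, and the parameter names `grid_step` and `compactness` are illustrative.

```python
import math


def slic_distance(lab_a, xy_a, lab_b, xy_b, grid_step, compactness):
    """Standard SLIC-style distance between two pixels or cluster centers."""
    d_color = math.dist(lab_a, lab_b)  # Euclidean distance in LAB
    d_space = math.dist(xy_a, xy_b)    # Euclidean distance in the image plane
    # The spatial term is normalized by the grid step and scaled by the
    # compactness, which plays the role of the spatial weight.
    return math.sqrt(d_color ** 2 + (d_space / grid_step) ** 2 * compactness ** 2)


d = slic_distance((50, 10, 10), (0, 0), (50, 13, 14), (6, 8),
                  grid_step=10, compactness=12)
print(d)
```

A larger spatial weight favours compact, grid-like superpixels, while a smaller one lets superpixels follow color boundaries more closely.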


3.3 Hand Identification

Assuming that the current frame is not occluded, or has been previously split, the next step is to decide if the detections are left or right hands. As explained before, using only the horizontal position of the hands in the frame is not always reliable. It can be hypothesized that, by extending the horizontal position with the hand orientation, it is possible to solve the difficult situations. To confirm this, an exhaustive analysis was performed on the Kitchen hand-masks extended with labels about the hand identity. These masks are subsequently used to define a probabilistic L/R hand-identification model based on the best-fitting ellipses (Section 3.3.1). The ellipses are fitted with the algorithm proposed by Fitzgibbon1995 . Finally, the already mentioned Maxwell model is used in a likelihood ratio test to exploit the fact that at most one left and one right hand can be present (Section 3.3.2). It is noteworthy that the proposed identifier does not need initialization: it is independent of the sampling rate, and can be applied to frames with left, right, or both hands. Additionally, its parametric nature opens the door to further integration with higher inference levels, as proposed in Betancourt2015b .

3.3.1 Building the L/R hand-identification model

A quick analysis of egocentric videos of daily activities easily points to the angle of the hands with respect to the lower frame border ($\theta$), and the normalized horizontal distance to the left border ($x$) as two discriminative variables to build our L/R hand-identification model. Figure 6 illustrates these variables. For the remaining part of this section, $x$ is the normalized value of the horizontal distance with respect to the frame width $w$.

(a) Geometric problem of the left hand-segment
(b) Geometric problem of the right hand-segment
Figure 6: Input variables for the L/R hand-identification model.

The upper half of Figure 7 shows the observed empirical distribution of the distance and angle variables for the left and the right hands of the Kitchen dataset extended masks. The horizontal axis shows the relative distance to the left border for the left hand-like segments, and the relative distance to the right border for the right hand-like segments. The angular dimension is the anti-clockwise angle with respect to the horizontal border of the frame. Interestingly, there is a small asymmetry between the left and right distributions, meaning that one of the two hands is used for a wider variety of movements than the other. We point this out as an interesting finding that could lead to further device personalization depending on the dominant hand of the user, or be used to analyze hand usage in daily activities.

Figure 7: Empirical (top) and theoretical (bottom) hand distribution functions given the relative distance to the sides of the image. For the left (right) hand, the relative distance to the left (right) side is used.

Based on the empirical distributions, a mathematical formulation to fit the observed distribution is proposed, which, interestingly, can be easily approximated by two independent Maxwell distributions. The reasons behind the choice of the Maxwell distribution are the following: i) It is defined only for positive values; ii) It allows an asymmetry factor to be included in our formulation. The mathematical formulation for the left hand and the right hand is given by equations (8) and (9), respectively, in terms of Maxwell distributions over the two input variables, each of which is defined on a bounded interval. In general, each Maxwell distribution has one parameter that controls its displacement (with respect to the origin) and one that controls its amplitude.


In total, our formulation contains parameters summarized in equation (11). As notation, the subscript of refers to the left () or right () parameters, and the superscript refers to the horizontal distance () or the anti-clockwise angle (). The parameters of the model are selected by fitting the empirical distribution and the final values are given by equation (12). The second row of Figure 7 shows the theoretical distribution.
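A possible reading of this model is sketched below: each hand likelihood is the product of two displaced Maxwell densities, one over the relative distance and one over the angle. The parameter values are hypothetical and are not the fitted values of equation (12); treating the right-hand model as the mirror image of the left one (distance to the right border and mirrored angle) is also an assumption made for the example.

```python
import math


def maxwell_pdf(value, shift, scale):
    """Maxwell density displaced by `shift`, with spread set by `scale`."""
    z = value - shift
    if z <= 0:
        return 0.0
    return (math.sqrt(2 / math.pi) * z ** 2
            * math.exp(-z ** 2 / (2 * scale ** 2)) / scale ** 3)


# Hypothetical (shift, scale) pairs for illustration only; they are NOT the
# fitted parameters reported in the paper.
PARAMS = {"distance": (0.0, 0.2), "angle": (0.0, 0.7)}


def left_likelihood(x, theta):
    """x: normalized distance to the left border, theta: angle in radians."""
    return (maxwell_pdf(x, *PARAMS["distance"])
            * maxwell_pdf(theta, *PARAMS["angle"]))


def right_likelihood(x, theta):
    """Assumed mirror image: distance to the right border, mirrored angle."""
    return left_likelihood(1.0 - x, math.pi - theta)


def likelihood_ratio(x, theta, eps=1e-12):
    """Values above 1 favour the left hand, values below 1 the right hand."""
    return (left_likelihood(x, theta) + eps) / (right_likelihood(x, theta) + eps)


print(likelihood_ratio(0.2, 1.0))            # segment close to the left border
print(likelihood_ratio(0.8, math.pi - 1.0))  # its mirror image
```

Using separate parameter sets for the two hands, instead of a shared one, would capture the small left/right asymmetry observed in the empirical distributions.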


3.3.2 Using the L/R hand-identification models

To compare the fitting performances of the L/R hand-identification models given by equation (8) and (9), a likelihood ratio test on the post-processed hand-like segments is performed. The likelihood ratio test is given by equation (13).


Relying only on the likelihood ratio could lead to cases where two hand-like segments are assigned the same label (left or right). To avoid these cases, and given that a frame cannot have two left or two right hands, we follow a competitive rule in the following way. Let us assume two hand-like segments in the frame described by $(x_1, \theta_1)$ and $(x_2, \theta_2)$, as explained in Figure 6, and their respective likelihood ratios given by $\Lambda_1$ and $\Lambda_2$. The competitive ids are assigned by equation (14).
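One plausible form of the competitive rule is sketched below. It assumes that each ratio is the left likelihood divided by the right likelihood, so that the segment with the larger ratio takes the left label; the exact rule of equation (14) is not recoverable here, and the function names are illustrative.

```python
def independent_ids(ratio_a, ratio_b):
    """Label each segment on its own: ratios above 1 favour the left hand."""
    return tuple("left" if r > 1.0 else "right" for r in (ratio_a, ratio_b))


def competitive_ids(ratio_a, ratio_b):
    """Give the left label to the segment with the larger left/right ratio."""
    if ratio_a >= ratio_b:
        return "left", "right"
    return "right", "left"


# Both segments look "left" when judged independently ...
print(independent_ids(3.0, 1.5))
# ... but the competition guarantees one left and one right label.
print(competitive_ids(3.0, 1.5))
```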


4 Results

This section evaluates our approach in two steps. Section 4.1 uses the Kitchen manual masks as a perfect hand-segmenter to assess the L/R hand-identification models and the occlusion detector. In Section 4.2 the multi-model hand-segmenter is tuned, evaluated, and used for a realistic performance analysis of the overall system.

4.1 Assuming a perfect hand-segmenter

In this section, the extended L/R manual masks of the Kitchen dataset are used as a perfect hand-segmenter. Each hand-segment is endowed with its best-fitting ellipse and used as input for the L/R hand-identification model presented in Section 3.3. Table 1 shows the results of the L/R identification without and with likelihood ratio competition.

          No competition       With competition
          Left      Right      Left      Right
Left      0.994     0.006      0.997     0.003
Right     0.012     0.988      0.000     1.000
Table 1: Left and right hand identification at contour level

The comparison without likelihood ratio competition, on the left side of the table, refers to the hand-identification based only on the best model. However, it is intuitive to assume that, in the presence of two relevant hand-like segments, they cannot both be left or both be right. This restriction is included by using the likelihood ratio test introduced in Section 3.3.2, and its results are presented in the right half of the table. This scheme allows us to identify almost perfectly all the masks in the dataset (i.e. 99.7% of the left hands and 100% of the right hands). The values reported in the table refer to the identification problem (left/right) and do not constitute a hand-segmentation.

4.2 Without perfect segmentation

The assumption of a perfect hand-segmenter is not realistic. Furthermore, hand segmentation is considered one of the most challenging objectives of FPV video analysis. To perform a more realistic evaluation of our approach, we initially tune and evaluate the proposed multi-model hand-segmenter (Section 4.2.1). Subsequently, in Section 4.2.2 the occlusion detector is assessed, to conclude with an overall evaluation of the system including each of its components.

4.2.1 Hand-Segmentation

As presented in Figure 4, the proposed hand-segmenter is intended to alleviate the illumination problems and consequently improve the quality of the segmentation. However, some important aspects of this approach must be defined first:

  1. How many illumination models must be considered?

  2. How many models must be provided by the KNN recommender component?

  3. What is the effect of these parameters on the quality of the segmentation?







Training \ Testing   Coffee   CofHoney   Hotdog   Tea      Pealate   Overall
Coffee               0.937    0.921      0.920    0.941    0.892     0.920
CofHoney             0.933    0.925      0.917    0.931    0.864     0.914
Hotdog               0.923    0.910      0.930    0.925    0.883     0.912
Tea                  0.925    0.909      0.910    0.935    0.899     0.911
Pealate              0.918    0.906      0.902    0.922    0.904     0.913
Table 2: F1 Score when using different training videos.

In order to answer these questions, a computational experiment is designed to tune the proposed multi-model approach and compare it with state-of-the-art hand-segmentation methods. For the experiment, one subject of the Kitchen dataset is used. We train a multi-model hand-segmenter using each video for training and the remaining ones for testing. As explained in Section 3.1, each illumination model is a random forest, which introduces a random component to the hand-segmenter. To alleviate the randomness in the evaluation, the training-testing procedure is executed several times with different random seeds. With this in mind, for each combination of the two parameters, one training and one testing score are obtained per training video and per random seed.

Figure 8 shows the average training (top plot) and testing (bottom plot) scores while changing the number of illumination models (). The colors of the lines (legend of the figure) refer to the use of the closest random forests in the fusion part. The image shows a quick improvement in the performance when the number of illumination models increases. As reference, the testing error changes from to when using illumination models instead of a single one. For the remaining part of this paper, the number of illumination models is set to . Regarding the number of illumination models to fuse , the performance quickly converges on ; concluding that, for the kitchen dataset, the fusion of more than illumination models does not provide additional improvements to the segmentation quality. In total the fusion of illumination models contributes two units in the score compared with the use of only . In the remainder of this paper a value of is used.

Figure 8: Hand-segmentation score when changing the number of illumination models () and the number of closest models () to fuse. The first and second plots are the training and testing scores, respectively. The number of illumination models is plotted on the horizontal axis, while the score is on the vertical axis; the colors represent the number of models to fuse .
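The idea of fusing only the closest illumination models can be illustrated with a short sketch. It assumes, purely for illustration, that each model is stored with a global feature vector describing the frame it was trained on, that closeness is measured with the Euclidean distance, and that fusion is an unweighted average of the per-pixel hand probabilities; the function name `fuse_closest` and the toy models are hypothetical and are not taken from the paper.

```python
import math


def fuse_closest(frame_feature, pixels, models, m):
    """Average the per-pixel hand probabilities of the m closest models.

    `models` is a list of (feature_vector, predict_fn) pairs; predict_fn maps
    a list of pixels to a list of hand probabilities (assumed interface).
    """
    ranked = sorted(models, key=lambda model: math.dist(frame_feature, model[0]))
    closest = ranked[:m]
    per_model = [predict(pixels) for _, predict in closest]
    # Unweighted mean across the selected models, pixel by pixel.
    return [sum(probs) / len(closest) for probs in zip(*per_model)]


# Toy models returning constant probabilities (stand-ins for random forests).
models = [
    ((0.0, 0.0), lambda pixels: [0.9 for _ in pixels]),
    ((1.0, 0.0), lambda pixels: [0.7 for _ in pixels]),
    ((5.0, 5.0), lambda pixels: [0.1 for _ in pixels]),
]
pixels = [(120, 80, 60), (30, 30, 30)]
fused = fuse_closest((0.2, 0.1), pixels, models, m=2)
```

In this toy case the third model is far from the query frame and is ignored, which mirrors the observation that adding more distant models to the fusion eventually stops improving the score.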

Regarding the training video selection, Table 2 shows a detailed comparison of the binary segmentation performance when trained with different videos. The table shows the mean F1 score and its standard deviation. The diagonal of the table contains the training scores, while the remaining values are the testing scores. The overall testing score is in the final column. The results allow us to conclude that the choice of the training video does not have a substantial effect on the overall performance, provided that the light conditions of the videos are similar. In the remainder of the paper, we use the “Coffee” video sequence ( frames - masks - occluded masks) as the training sequence, and the remaining “CofHoney”, “Hotdog”, “Tea” and “Pealate” sequences for testing. The testing sequences contain in total frames, Left/Right masks and occluded masks. Please refer to Fathi2012a for extra details about the Kitchen dataset.
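For reference, the F1 score of a binary hand mask against its ground truth can be computed as in the following sketch. It assumes masks given as nested lists of 0/1 values and a single score per mask; how the scores are aggregated over frames in the experiments is not specified here, and `f1_score` is a hypothetical helper name.

```python
def f1_score(predicted, truth):
    """F1 score of a binary mask: harmonic mean of precision and recall."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(predicted, truth):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1      # hand pixel correctly detected
            elif p and not t:
                fp += 1      # background labelled as hand
            elif t and not p:
                fn += 1      # hand pixel missed
    if tp == 0:
        return 0.0           # convention chosen here for empty overlaps
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


predicted = [[1, 1, 0],
             [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
score = f1_score(predicted, truth)
```

In the toy example there are two true positives, one false positive and one false negative, so precision and recall are both 2/3 and so is the F1 score.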

Finally, Table 3 compares the multi-model hand-segmenter with previous works. Compared with the single pixel-by-pixel classifier of Li2013a , our approach achieves improvements between and score points. After the post-processing, our method achieves a total improvement of , and points on the “Coffee”, “Tea” and “Peanut” video sequences, respectively. In comparison to the shape-aware hand-segmenter proposed by Zhu2014 , our implementation performs better in all the video sequences. In particular, the “Tea” video sequence is improved by points.

Method                                                         Coffee   Tea    Peanut
1999 - Single pixel color Jones1999                            0.83     0.80   0.73
2011 - stabilization + gPb + superpixel + CRF Fathi2011a       0.71     0.82   0.72
2013 - Li window Li2013a                                       0.85     0.82   0.74
2013 - Li window Li2013a                                       0.88     0.88   0.76
2014 - Shape Aware Forest (post-process) Zhu2014               0.90     0.84   0.84
2016 - Ours (k=20, m=50)                                       0.88     0.87   0.77
2016 - Ours (k=20, m=50) + Hand-Id Post Process                0.94     0.94   0.88
Table 3: Hand-segmenter state-of-the-art comparison. The performances reported for the state of the art are taken from Zhu2014.

4.2.2 Occlusions and overall performance

The extended masks can be used to identify evaluation cases for the occlusion detector and the splitting method. To evaluate the occlusion detector, we first select the masks with hand-to-hand occlusions and check whether the occlusion detector finds them. In total, the “Kitchen” dataset (subject ) contains hand-to-hand occluded frames, and Algorithm 1 identifies .

When automatically segmented, the silhouette of the hands is affected by false positives and false negatives and, as a consequence, some extra frames could be mis-detected as hand-to-hand occlusions (i.e., two noisy hand-like segments close enough to be considered occluded). This is not a problem, since the algorithm will split these cases as if they were real occlusions, and only some extra computational time is needed.

           No-Hand   Left    Right
No-Hand    0.984     0.007   0.009
Left       0.058     0.934   0.009
Right      0.080     0.006   0.914
Table 4: Evaluation of the hand segmentation only when split is required.

           Without split               With split
           No-Hand   Left    Right     No-Hand   Left    Right
No-Hand    0.992     0.004   0.004     0.992     0.004   0.004
Left       0.073     0.821   0.106     0.073     0.923   0.004
Right      0.096     0.066   0.838     0.096     0.001   0.903
Table 5: Effect of the occlusion detection and disambiguation on the overall performance.

To evaluate the hand-to-hand occlusion split, the extended L/R masks were used as ground truth to perform a three-class pixel-by-pixel classification analysis (i.e., background, left hand, right hand). First, for a better evaluation of the split procedure, only the frames detected as occluded are used. Table 4 shows the confusion matrix of the three-class pixel-by-pixel segmentation for all the frames detected as occlusions. The table shows that, in the case of occlusion, the split leads to a proper classification of 93.4% and 91.4% of the left-hand and right-hand pixels, respectively (diagonal of Table 4). It is important to note, as shown previously, that the main cause of the misclassified left/right pixels is not the split procedure, but the noisy segmentation.
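A three-class confusion matrix of this kind can be computed from flattened label maps as sketched below. The sketch assumes that rows correspond to the ground-truth class and columns to the predicted class, and that each row is normalised to sum to one, which is consistent with the rows of Table 4 but is not stated explicitly; the label encoding (0 = No-Hand, 1 = Left, 2 = Right) and the function name are hypothetical.

```python
LABELS = ("No-Hand", "Left", "Right")  # assumed encoding: 0, 1, 2


def confusion_matrix(truth, predicted, n_classes=3):
    """Row-normalised confusion matrix (rows: ground truth, columns: prediction)."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(truth, predicted):
        counts[t][p] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Leave an all-zero row if a class never appears in the ground truth.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Flattened toy label maps: one entry per pixel.
truth = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
predicted = [0, 0, 0, 1, 1, 1, 1, 0, 2, 1]
matrix = confusion_matrix(truth, predicted)
```

With this layout, the diagonal entries give the fraction of background, left-hand and right-hand pixels that are labelled correctly, and the off-diagonal Left/Right entries measure the confusion between the two hands.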

                  CofHoney                 Hotdog                   Tea                      Pealate                  Total
                  No-hands  Left   Right   No-hands  Left   Right   No-hands  Left   Right   No-hands  Left   Right   No-hands  Left   Right
60 FPS  No-hands  0.990     0.003  0.007   0.989     0.005  0.006   0.996     0.002  0.002   0.991     0.006  0.003   0.992     0.004  0.004
        Left      0.064     0.932  0.004   0.040     0.958  0.002   0.056     0.943  0.001   0.120     0.871  0.009   0.073     0.923  0.004
        Right     0.092     0.002  0.906   0.136     0.001  0.864   0.082     0.000  0.918   0.112     0.002  0.886   0.096     0.001  0.903
30 FPS  No-hands  0.990     0.003  0.007   0.989     0.006  0.005   0.996     0.002  0.002   0.991     0.006  0.003   0.992     0.004  0.004
        Left      0.064     0.930  0.006   0.039     0.958  0.002   0.057     0.932  0.011   0.119     0.874  0.007   0.073     0.921  0.007
        Right     0.093     0.009  0.898   0.133     0.002  0.865   0.082     0.000  0.918   0.109     0.003  0.888   0.095     0.004  0.900
15 FPS  No-hands  0.990     0.003  0.007   0.990     0.006  0.005   0.996     0.002  0.002   0.991     0.006  0.003   0.993     0.004  0.004
        Left      0.063     0.919  0.017   0.040     0.914  0.047   0.056     0.940  0.004   0.118     0.865  0.017   0.072     0.907  0.021
        Right     0.092     0.008  0.900   0.140     0.057  0.803   0.081     0.000  0.919   0.109     0.036  0.855   0.096     0.015  0.889
Table 6: L/R hand-segmentation confusion matrix. This table uses the “Coffee” video sequence for training.

To conclude, Table 5 shows the benefit of the occlusion detection and the split for the overall hand-identification. The first vertical group ignores the occlusion problem, while the second is obtained using the proposed approach. Both confusion matrices are identical in the background performance, since the hand-segmenter is the same for the two experiments. The L/R hand-segmentation gains almost ten percentage points when occlusions are considered. Finally, Table 6 provides the detailed results for each testing video. For comparative purposes, the table provides the performance obtained by using 60, 30 and 15 frames per second. It can be noticed that the overall performance is not considerably affected by a sampling rate of 30 frames per second. When using 15 frames per second, the segmentation quality suffers a small reduction, but the throughput of the system is considerably improved. All the results reported in this paper use a latency of (bold digits).
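The size of the gain can be checked directly from the Left and Right diagonal entries of Table 5. The short sketch below only restates the table values and computes the difference in percentage points; the variable names are illustrative.

```python
# Diagonal entries (correctly classified pixels) taken from Table 5.
without_split = {"Left": 0.821, "Right": 0.838}
with_split = {"Left": 0.923, "Right": 0.903}

# Gain in percentage points, rounded to one decimal place.
gain_points = {
    hand: round((with_split[hand] - without_split[hand]) * 100, 1)
    for hand in without_split
}
```

The left hand gains about 10.2 points and the right hand about 6.5 points, so the figure of almost ten points is driven mainly by the left hand.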

5 Conclusions and future research

This work presented a hierarchical strategy to segment and identify the left and right hands of the user in egocentric videos. The proposed method provides valuable information about hand usage and opens the door to the use of wearable cameras in applications involving bi-manual tasks, for example driving applications or medical therapy for upper limb mobility problems.

The first level of the proposed method is a multi-model structure that delineates the hand-like pixels on each egocentric frame. Experimental results show that the proposed multi-model implementation, jointly with the hand-identification post-processing, achieves scores of around , which constitutes a significant improvement over the shape-aware classifier proposed in Zhu2014 .

The second level, executed if required, is the hand-to-hand occlusion identification and disambiguation. The experimental section shows the importance of this step to understand the hands of the user as two cooperative entities working jointly to accomplish a particular task. Our results indicate that, by handling hand-to-hand occlusions, it is possible to obtain improvements around in L/R hand-segmentation.

The final level, the hand-identification, relies on a Maxwell function of angle and horizontal position to decide whether a hand-like segment is left or right. Experimental results show that our L/R identification model identifies with certainty if a hand is left or right. We highlight this as a considerable improvement in efficiency and accuracy over the state of the art, where an SVM is used to understand the state of the hands as: i) only left, ii) only right, iii) both hands.

As a future research line, we highlight the use of the identified hands as cooperative entities to understand how the user is performing a particular task. The results obtained with our method can be used as the measurement model in the framework of tracking interacting objects to get reliable hand trajectories and augmented states. These trajectories could lead to a proper understanding of the user’s hand movements, which constitutes a starting point for using wearable cameras in medical therapy. Based on our current research, the hand-tracking level requires considerable development in the definition of the dynamic models ruling the non-linear movement of the hands. Additional issues must be solved when noisy measurements are detected or in the presence of complex hand interactions.

Finally, some of the methods presented in this paper, such as the multi-model classification algorithm, could be applied in more general scenarios. The objective of this paper is to exploit the advantageous location of the camera to extract additional information about the hands of the user. The use of the segmentation model in other video perspectives or applications is left as interesting future work.

6 Acknowledgement

This work was partially supported by the Erasmus Mundus joint Doctorate in Interactive and Cognitive Environments, which is funded by the EACEA, Agency of the European Commission under EMJD ICE.

The authors thank the Cyberinfrastructure Service for High Performance Computing, “Apolo”, at EAFIT University, for allowing us to run our computational experiments in their computing centre.



  • (1) K. van Laerhoven, D. Roggen, D. Gatica-Perez, M. Fukumoto, and T. Starner, “Wearable computing,” the 17Th Annual International Symposium, vol. 12, no. 2, pp. 125, 2013.
  • (2) A. Betancourt, P. Morerio, C. Regazzoni, and M. Rauterberg, “The Evolution of First Person Vision Methods: A Survey,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 5, pp. 744–760, 2015.
  • (3) A. Fathi, A. Farhadi, and J. M. Rehg, “Understanding egocentric activities,” in Proceedings of the IEEE International Conference on Computer Vision. nov 2011, pp. 407–414, IEEE.
  • (4) L. Baraldi, F. Paci, G. Serra, L. Benini, and R. Cucchiara, “Gesture Recognition using Wearable Vision Sensors to Enhance Visitors’ Museum Experiences,” IEEE Sensors Journal, vol. 15, no. 5, pp. 1–1, 2015.
  • (5) A. Fathi, Y. Li, and J. M. Rehg, “Learning to recognize daily actions using gaze,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Florence, Italy, 2012, vol. 7572 LNCS, pp. 314–327, Georgia Institute of Technology.
  • (6) V. Buso, I. González-Díaz, and J. Benois-Pineau, “Goal-oriented top-down probabilistic visual attention model for recognition of manipulated objects in egocentric videos,” Signal Processing: Image Communication, no. June, 2015.
  • (7) M. Cai, K. Kitani, and Y. Sato, “A Scalable Approach for Understanding the Visual Structures of Hand Grasps,” IEEE International Conference on Robotics and Automation, pp. 1360–1366, 2015.
  • (8) Y. Yang, C. Fermuller, Y. Li, and Y. Aloimonos, “Grasp type revisited: A modern perspective on a classical feature for vision,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). jun 2015, pp. 400–408, IEEE.
  • (9) A. Betancourt, P. Morerio, L. Marcenaro, E. Barakova, M. Rauterberg, and C. Regazzoni, “Towards a Unified Framework for Hand-based Methods in First Person Vision,” in IEEE International Conference on Multimedia and Expo (Workshops), Turin, 2015, IEEE.
  • (10) M. E. Rentería, “Cerebral Asymmetry: A Quantitative, Multifactorial, and Plastic Brain Phenotype,” Twin Research and Human Genetics, vol. 15, no. 03, pp. 401–413, 2012.
  • (11) I. C. McManus, “The history and geography of human handedness,” Language Lateralization and Psychosis, pp. 37–57, 2009.
  • (12) C. Li and K. Kitani, “Model Recommendation with Virtual Probes for Egocentric Hand Detection,” in 2013 IEEE International Conference on Computer Vision, Sydney, 2013, pp. 2624–2631, IEEE Computer Society.
  • (13) A. Fathi, X. Ren, and J. M. Rehg, “Learning to recognize objects in egocentric activities,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Providence, RI, jun 2011, pp. 3281–3288, IEEE.
  • (14) X. Ren and M. Philipose, “Egocentric recognition of handled objects: Benchmark and analysis,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, Miami, FL, jun 2009, pp. 49–56, IEEE.
  • (15) S. P. Swinnen and N. Wenderoth, “Two hands, one brain: Cognitive neuroscience of bimanual skill,” Trends in Cognitive Sciences, vol. 8, no. 1, pp. 18–25, 2004.
  • (16) M. Vincze, M. Zillich, W. Ponweiser, V. Hlavac, J. Matas, S. Obdrzalek, H. Buxton, J. Howell, K. Sage, A. Argyros, C. Eberst, and G. Umgeher, “Integrated vision system for the semantic interpretation of activities where a person handles objects,” Computer Vision and Image Understanding, vol. 113, no. 6, pp. 682–692, 2009.
  • (17) A. Turolla, M. Dam, L. Ventura, P. Tonin, M. Agostini, C. Zucconi, P. Kiper, A. Cagnin, and L. Piron, “Virtual reality for the rehabilitation of the upper limb motor function after stroke: a prospective controlled trial.,” Journal of neuroengineering and rehabilitation, vol. 10, pp. 85, 2013.
  • (18) L. Speth, Y. Janssen-Potten, P. Leffers, E. Rameckers, A. Defesche, R. Geers, R. Smeets, and H. Vles, “Observational skills assessment score: reliability in measuring amount and quality of use of the affected hand in unilateral cerebral palsy.,” BMC neurology, vol. 13, pp. 152, 2013.
  • (19) D. J. Goble and S. H. Brown, “The biological and behavioral basis of upper limb asymmetries in sensorimotor performance,” 2008.
  • (20) T. A. Knaus, J. Kamps, and A. L. Foundas, “Handedness in Children with Autism Spectrum Disorder,” Perceptual and Motor Skills, 2016.
  • (21) J. L. Cook, S. J. Blakemore, and C. Press, “Atypical basic movement kinematics in autism spectrum conditions,” Brain, vol. 136, no. 9, pp. 2816–2824, 2013.
  • (22) A. Betancourt, M. Lopez, C. Regazzoni, and M. Rauterberg, “A Sequential Classifier for Hand Detection in the Framework of Egocentric Vision,” in Conference on Computer Vision and Pattern Recognition, Columbus, Ohio, jun 2014, vol. 1, pp. 600–605, IEEE.
  • (23) A. Betancourt, P. Morerio, E. Barakova, L. Marcenaro, M. Rauterberg, and C. Regazzoni, “A Dynamic Approach and a New Dataset for Hand-Detection in First Person Vision.,” in International Conference on Computer Analysis of Images and Patterns, Malta, 2015.
  • (24) A. Betancourt, P. Morerio, L. Marcenaro, M. Rauterberg, and C. Regazzoni, “Filtering SVM frame-by-frame binary classification in a detection framework,” in International Conference on Image Processing, Quebec, Canada, 2015, IEEE.
  • (25) C. Li and K. Kitani, “Pixel-Level Hand Detection in Ego-centric Videos,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition. jun 2013, pp. 3570–3577, IEEE.
  • (26) M. J. Jones and J. M. Rehg, “Statistical color models with application to skin detection,” in International Journal of Computer Vision, Fort Collins, CO, 2002, vol. 46, pp. 81–96, IEEE Computer Society.
  • (27) X. Zhu, X. Jia, and K.-y. K. Wong, “Pixel-Level Hand Detection with Shape-aware Structured Forests,” in Asian Conference on Computer Vision, Singapore, 2014, pp. 1–15.
  • (28) X. Zhu, X. Jia, and K. Y. K. Wong, “Structured forests for pixel-level hand detection and hand part labelling,” Computer Vision and Image Understanding, vol. 141, pp. 95–107, 2015.
  • (29) S. Lee, S. Bambach, D. Crandall, J. Franchak, and C. Yu, “This Hand Is My Hand: A Probabilistic Approach to Hand Disambiguation in Egocentric Video,” in Computer Vision and Pattern Recognition (CVPR), Columbus, Ohio, 2014, pp. 1–8, IEEE Computer Society.
  • (30) P. Morerio, L. Marcenaro, and C. Regazzoni, “Hand Detection in First Person Vision,” in Fusion, Istanbul, 2013, University of Genoa, pp. 0–6.
  • (31) O. Alata and L. Quintard, “Is there a best color space for color image characterization or representation based on Multivariate Gaussian Mixture Model?,” Computer Vision and Image Understanding, vol. 113, no. 8, pp. 867–877, 2009.
  • (32) A. W. Fitzgibbon and R. B. Fisher, “A Buyer’s Guide to Conic Fitting,” British Machine Vision Conference, pp. 513–522, 1995.