Analysis of the hands in egocentric vision: A survey

by Andrea Bandini, et al.
University Health Network

Egocentric vision (a.k.a. first-person vision - FPV) applications have thrived over the past few years, thanks to the availability of affordable wearable cameras and large annotated datasets. The position of the wearable camera (usually mounted on the head) allows recording exactly what the camera wearers have in front of them, in particular hands and manipulated objects. This intrinsic advantage enables the study of the hands from multiple perspectives: localizing hands and their parts within the images; understanding what actions and activities the hands are involved in; and developing human-computer interfaces that rely on hand gestures. In this survey, we review the literature that focuses on the hands using egocentric vision, categorizing the existing approaches into: localization (where are the hands or parts of them?); interpretation (what are the hands doing?); and application (e.g., systems that use egocentric hand cues to solve a specific problem). Moreover, a list of the most prominent datasets with hand-based annotations is provided.




1 Introduction

The hands are of primary importance for human beings, as they allow us to interact with objects and environments, communicate with other people, and perform activities of daily living (ADLs) such as eating, bathing, and dressing. It is not surprising that in individuals with impaired or reduced hand functionality (e.g., after a stroke or cervical spinal cord injury – cSCI) the top recovery priority is to regain the function of the hands [125]. Given their importance, computer vision researchers have tried to analyze the hands from multiple perspectives: localizing them in the images [75], inferring the types of actions they are involved in [42], as well as enabling interactions with computers and robots [109, 145]. Wearable cameras (e.g., cameras mounted on the head or chest) have allowed studying the hands from a point of view (POV) that provides a first-person perspective of the world. This field of research in computer vision is known as egocentric or first-person vision (FPV). Although some studies were published as early as the 1990s [98], FPV gained more importance after 2012 with the emergence of smart glasses (e.g., Google Glass) and action cameras (e.g., GoPro). For an overview of the evolution of FPV methods, the reader is referred to the survey published by Betancourt et al. [16].

Egocentric vision presents many advantages when compared with third person vision, where the camera position is usually stable and disjointed from the user. For example: the device is recording exactly what the users have in front of them; camera movement is driven by the camera-wearer’s activity and attention; hands and manipulated objects tend to appear at the center of the image and hand occlusions are minimized [104]. These advantages made the development of novel approaches for studying the hands very appealing. However, when working in FPV, researchers must also face an important issue: the camera is not stable, but is moving with the human body. This causes fast movements and sudden illumination changes that can significantly reduce the quality of the video recordings and make it more difficult to separate the hand and objects of interest from the background.

Betancourt et al. [14] clearly summarized the typical processing steps of hand-based methods in FPV. The authors proposed a unified and hierarchical framework where the lowest levels of the hierarchy concern the detection and segmentation of the hands, whereas the highest levels are related to interaction and activity recognition. Each level is devoted to a specific task and provides the results to higher levels (e.g., hand identification builds upon hand segmentation and hand detection, activity recognition builds upon the identification of interactions, etc.). Although clear and concise, this framework could not cover some of the recent developments in this field, made possible thanks to the availability of large amounts of annotated data and to the advent of deep learning. Other good surveys closely related to the topics discussed in our paper were published in the past few years [109, 104, 33, 77, 40, 18]. The reader should refer to the work of Del Molino et al. [40] for an introduction to video summarization in FPV, to the survey of Nguyen et al. [104] for the recognition of ADLs from egocentric vision, and to the work of Bolaños et al. [18] for a review on visual lifelogging. Hand pose estimation and hand gesture recognition methods are analyzed in [77] and [33], respectively.

In this survey we define a comprehensive taxonomy of hand-based methods in FPV, expanding the categorization proposed in [14] and classifying the existing literature into three macro-areas: localization, interpretation, and application. For each macro-area we identify the main sub-areas of research, presenting the most prominent approaches published in the past 10 years (until December 2019) and discussing the advantages and disadvantages of each method. Moreover, we summarize the available datasets published in this field. Our focus in defining a comprehensive taxonomy and comparing different approaches is to propose an updated and general framework of hand-based methods in FPV, highlighting the current trends and summarizing the main findings, in order to provide guidelines to researchers who want to improve and expand this field of research. The remainder of the paper is organized as follows: Section 2 presents a taxonomy of hand-based methods in FPV following a novel categorization that divides these approaches into three macro-areas: localization, interpretation, and application; Section 3 describes the approaches developed for solving the localization problem; Section 4 summarizes the work focused on interpretation; Section 5 summarizes the most important applications of hand-based methods in FPV; Section 6 reviews the available datasets published so far; and, finally, Section 7 concludes with a discussion of the current trends in this field.

2 Hand-Based Methods in FPV – An Updated Framework

Starting from the raw frames, the first processing step is dedicated to the localization of the hands within the image. This allows restricting the processing to small regions of interest (ROIs) thus excluding unnecessary information from the background. Once the position of the hands (or parts of them) has been determined, higher-level information can be inferred to understand what the hands are doing (e.g. gesture and posture recognition, action and activity recognition). This information can be used for building applications such as human-computer interaction (HCI) and human-robot interaction (HRI) [109, 145]. Therefore, we categorize the existing studies that made use of hand-based methods in FPV into three macro-areas:

  • Localization – approaches that answer the question: where are the hands (or parts of them)?

  • Interpretation – approaches that answer the question: what are the hands doing?

  • Application – approaches that use methods from the above areas to build real-world applications.

For each area we define sub-areas according to the aims and nature of the proposed methods.

2.1 Localization – Where are the hands (or part of them)?

The localization area encloses all the approaches that aim at localizing hands (or part of them) within the images. The sub-areas are:

  • Hand segmentation – detecting the hand regions with pixel-level detail.

  • Hand detection – defined both as a binary classification problem (does the image contain a hand?) and as an object localization problem (is there a hand? where is it located?). The generalization of hand detection over time is hand tracking.

  • Hand identification – classification between left and right hand (and other hands present in the scene).

  • Hand pose estimation – estimation of hand joint positions. A simplified version of the hand pose estimation problem is fingertip detection, where only the fingertips of one or more fingers are identified.

From the above sub-areas it is possible to highlight two dimensions in the localization problem. The first one is the detail of the information extracted with a method. For example, hand detection results in low-detail information (binary label or coordinates of a bounding box), whereas hand segmentation produces high-detail information (pixel-level silhouette). The second dimension is the meaning of the obtained information (semantic content [104, 30]). Hand detection and segmentation, although producing different levels of detail, have the same semantic content, namely the global position of the hand. By contrast, hand pose estimation has higher semantic content than hand detection, as the position of the fingers and hand joints adds more information to the global hand location (Figure 1).

Fig. 1: Hand-based approaches in FPV categorized by detail of the information and semantic content.

2.2 Interpretation – What are the hands doing?

The interpretation area includes those approaches that, starting from lower level information (i.e., detection, segmentation, pose estimation, etc.), try to infer information with higher semantic content. The main sub-areas are:

  • Hand grasp analysis – Detection of the dominant hand postures during hand-object interactions.

  • Hand gesture recognition – Classification of hand gestures, usually as input for virtual reality (VR) and augmented reality (AR) systems (Section 5).

  • Action/Interaction recognition – Predicting what type of action or interaction the hands are involved in. Following the taxonomy of Tekin et al. [128], an action is defined as a verb (e.g. “pour”), whereas an interaction as a verb-noun pair (e.g. “pour water”). This task is called interaction detection if the problem is reduced to a binary classification task (i.e., predicting whether or not the hands are interacting).

  • Activity recognition – Identification of the activities, defined as sets of temporally consistent actions [42]. For example, preparing a meal is an activity composed of several actions and interactions, such as cutting vegetables, pouring water, opening jars, etc.

We can qualitatively compare these sub-areas according to the two dimensions described above (detail of information and semantic content). Hand grasp analysis and gesture recognition have lower semantic content than action/interaction recognition which, in turn, has lower semantic content than activity recognition. Activity recognition, although having higher semantic content than action recognition, produces results with lower detail. This is because the information is summarized towards the upper end of the semantic content dimension. Following these considerations, we represent the localization and interpretation areas of this framework on a two-dimensional plot whose axes are the detail of information and the semantic content (Figure 1).

2.3 Application

The application area includes all the FPV approaches and systems that make use of hand-based methods for achieving certain objectives. The main applications are:

  • Healthcare applications, for example the remote assessment of hand function and the development of ambient assisted living (AAL) systems.

  • HCI and HRI, for example VR and AR applications, or HRI systems that rely on the recognition of hand gestures.

Some egocentric vision applications were already covered by other surveys [104, 40, 18, 109]. Thus, we will summarize novel aspects related to hand-based methods in FPV not covered in those articles.

3 Localization

The localization of hands (or parts of them) is the first and most important processing step of many hand-based methods in FPV. A good hand localization algorithm allows estimating the accurate position of the hands within the image, boosting the performance of higher-level inference [5]. For this reason, hand localization has been the main focus of researchers in egocentric vision. Although many hand detection, pose estimation, and segmentation algorithms were developed in third person vision [77], the egocentric POV presents notable challenges that do not allow a direct translation of these methods. Rogez et al. [114] demonstrated that hand detection is considerably harder in FPV, and methods developed specifically for the third person POV may fail when applied to egocentric videos.

Hand segmentation and detection are certainly the two most extensively studied sub-areas. They are often used in combination, for example to classify as “hand” or “not hand” previously segmented regions [153, 29], or to segment regions of interest (ROI) previously obtained with a hand detector [5]. However, considering the extensive research behind these two sub-areas, we summarize them separately.

Fig. 2: Diagram of hand localization tasks in egocentric vision. *: Hand identification is now typically incorporated within the hand detection step.

3.1 Hand segmentation

Hand segmentation is the process of identifying the hand regions at pixel level (Figure 2). This step allows extracting the silhouette of the hands and has been extensively used as a pre-processing step for hand pose estimation, hand gesture recognition, action/interaction recognition, and activity recognition. One of the most straightforward approaches is to use color as a discriminative feature to identify skin-like pixels [65]. Although very simple and fast, color-based segmentation fails whenever background objects have a similar skin color (e.g., wooden objects), and it is robust only if the user wears colored gloves or patches that simplify the processing [83, 88]. However, this might not be feasible in real-world applications, where the hand segmentation algorithm is supposed to work without external cues, thus mimicking human vision. Illumination changes due to different environments also negatively affect the segmentation performance. Moreover, the availability of large datasets with pixel-level ground truth annotations is another issue when working on hand segmentation. This type of annotation requires a lot of manual work, and the size of these datasets is much smaller than those with less detailed annotations (e.g., bounding boxes). Thus, several approaches were proposed to face the above issues.
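To make the color-based baseline concrete, the sketch below (our illustration, not code from any cited paper) thresholds pixels in the YCrCb color space, where skin tones cluster in a compact chroma region regardless of brightness. The threshold values are common rule-of-thumb numbers, not tuned to any dataset, and would fail in exactly the situations described above (wooden objects, strong illumination shifts):

```python
import numpy as np

def skin_mask_ycrcb(rgb):
    """Classify pixels as skin-like by thresholding in YCrCb space.

    `rgb` is an (H, W, 3) uint8 array. The Cr/Cb ranges below are
    rule-of-thumb values for illustration only.
    """
    rgb = rgb.astype(np.float32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # Standard ITU-R BT.601 RGB -> YCrCb conversion.
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = (r - y) * 0.713 + 128.0
    cb = (b - y) * 0.564 + 128.0
    # Skin tends to fall in a compact Cr/Cb region independent of luminance.
    return (cr > 135) & (cr < 180) & (cb > 85) & (cb < 135)
```

Decoupling chroma (Cr/Cb) from luminance (Y) is what gives such thresholds their partial robustness to brightness changes; the approaches reviewed below replace the fixed thresholds with learned classifiers.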

3.1.1 Discriminating hands from objects and background

Traditional hand segmentation approaches (i.e., not based on deep learning) rely on the extraction of features from an image patch, classifying the central pixel or the entire patch as skin or non-skin using a binary classifier or regression model. The vast majority of approaches combined color with gradient and/or texture features, whereas random forest has been the most popular classification algorithm [20]. The use of texture and gradient features allows capturing salient patterns and contours of the hands that, combined with the color features, help discriminate them from background and objects with similar color.

Pixel-based classification. Li and Kitani [75] tested different combinations of color (HSV, RGB, and LAB color spaces) and local appearance features (Gabor filters [139], HOG [38], SIFT [95], BRIEF [27], and ORB [119] descriptors) to capture local contours and gradients of the hand regions. Each pixel was classified as skin or no-skin using a random forest regression. When using color features alone, the LAB color space provided the best performance, whereas gradient and texture features, such as HOG and BRIEF, improved the segmentation performance when combined with the color information [75]. Zariffa and Popovic [146] used a mixture of Gaussian skin model with morphological operators (dilation followed by erosion) to detect a coarse estimate of the hand regions. The initial region was refined by removing small isolated blobs with texture different from the skin, by computing the Laplacian of the image within each blob. Lastly, pixel-level segmentation was achieved by backprojecting using an adaptively selected region in the color space. In [86], the coarse segmentation obtained with a mixture of Gaussian skin model [65, 146] was refined by using a structured forest edge detection [41], specifically trained on available datasets [29, 10].
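The feature-extraction side of pixel-based classification can be sketched as follows. This is a simplified stand-in for the pipelines above: raw color channels plus finite-difference gradient magnitudes of the luminance, in place of full HOG/Gabor banks. A per-pixel regressor (e.g., the random forest regression of Li and Kitani [75]) would then be trained on these rows against skin/no-skin labels:

```python
import numpy as np

def pixel_features(rgb):
    """Build a per-pixel feature matrix combining color and local gradients.

    Returns an (H*W, 5) array: the three color channels plus the absolute
    x/y gradients of the luminance. A random-forest regressor would be
    trained on these rows with per-pixel skin labels (simplified sketch;
    real systems use richer descriptors such as HOG or Gabor responses).
    """
    rgb = rgb.astype(np.float32)
    lum = rgb.mean(axis=2)
    gy, gx = np.gradient(lum)  # simple finite-difference gradients
    feats = np.stack(
        [rgb[..., 0], rgb[..., 1], rgb[..., 2], np.abs(gx), np.abs(gy)],
        axis=-1,
    )
    return feats.reshape(-1, 5)
```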

Patch-based classification. Other authors classified image patches instead of single pixels, in order to produce segmentation masks more robust to pixel-level noise [122, 124, 130, 151, 152]. Serra et al. [122] classified clusters of pixels (i.e., super-pixels) obtained with the simple linear iterative clustering (SLIC) algorithm [1]. For each super-pixel, they used a combination of color (HSV and LAB color spaces) and gradient features (Gabor filters and histogram of gradients) to train a random forest classifier. The segmentation was refined by assuming temporal coherence between consecutive frames and spatial consistency among groups of super-pixels. Similarly, Singh et al. [124] computed the hand binary mask by extracting texture and color features (Gabor filters with RGB, HSV, and LAB color features) from the super-pixels, whereas Urabe et al. [130] used the same features in conjunction with the centroid location of each super-pixel to train a support vector machine (SVM) for segmenting the skin regions. Instead of classifying the whole patch from which color, gradient, and texture features are extracted, Zhu et al. [151, 152] learned the segmentation mask within the image patch, by using a random forest framework (shape-aware structured forest).
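The SLIC idea underlying these patch-based methods is essentially k-means clustering in a joint (color, position) space, so segments stay both color-coherent and spatially compact. The toy version below captures that idea; real SLIC additionally restricts each centroid's search window for efficiency, which we omit for brevity:

```python
import numpy as np

def slic_like_superpixels(rgb, n_segments=4, n_iter=5, compactness=0.5):
    """Minimal SLIC-style super-pixels: k-means over (color, xy) features.

    Illustrative sketch only: plain k-means without SLIC's windowed
    search. `compactness` weights spatial proximity against color.
    Returns an (H, W) integer label map.
    """
    h, w, _ = rgb.shape
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.concatenate(
        [rgb.reshape(-1, 3).astype(np.float32) / 255.0,
         compactness * np.stack([ys, xs], axis=-1).reshape(-1, 2) / max(h, w)],
        axis=1,
    )
    # Initialize centroids evenly through the raster-ordered pixels.
    idx = np.linspace(0, len(feats) - 1, n_segments).astype(int)
    centers = feats[idx]
    for _ in range(n_iter):
        d = ((feats[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(n_segments):
            if (labels == k).any():
                centers[k] = feats[labels == k].mean(0)
    return labels.reshape(h, w)
```

Per-segment color/texture features would then be fed to a classifier (random forest in [122, 124], SVM in [130]) to label each super-pixel as hand or background.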

Deep learning may help solve hand segmentation problems in FPV. However, its use is still hampered by the lack of large datasets with pixel-level annotations. Some deep learning approaches for hand segmentation [150, 80] tackled this issue by using the available annotations in combination with other image segmentation techniques (e.g., super-pixels or GrabCut [1, 82, 46, 118]) to generate new hand segmentation masks for expanding the dataset and fine-tuning pre-trained networks (see Section 3.1.3 for more details). The availability of pre-trained CNNs for semantic object segmentation [94, 89] was exploited in [131, 127]. Wang et al. [136, 137] tackled the hand segmentation problem in a recurrent manner by using a recurrent U-Net architecture [117]. The rationale behind this strategy is to imitate the saccadic movements of the eyes that allow refining the perception of a scene. The computational cost can be another issue in CNN-based hand segmentation. To reduce this cost while achieving good segmentation accuracy, Li et al. [76] implemented deep feature flow (DFF) [154] with an extra branch to make the approach more robust against occlusions and distortion caused by DFF.

3.1.2 Robustness to illumination changes

The problem of variable illumination can be partially alleviated by choosing the right color space for feature extraction (e.g., LAB [75]) and increasing the size of the training set. However, the latter strategy may reduce the separability of the color space and increase the number of misclassified examples [11]. Thus, a popular solution has been to use a collection of segmentation models, adaptively selecting the most appropriate one for the current test conditions [75, 122, 124, 11, 74]. Li and Kitani [75] proposed an adaptive approach that selects the nearest segmentation model, namely the one trained in a similar environment. To learn different global appearance models, they clustered the HSV histogram of the training images using k-means and learned a separate random tree regressor for each cluster. They further extended this concept in [74], where they formulated hand segmentation as a model recommendation task. For each test image, the system was able to propose the best hand segmentation model given the color and structure (HSV histogram and HOG features) of the observed scene and the relative performance between two segmentation models. Similarly, Betancourt et al. [11] trained binary random forests to classify each pixel as skin or not skin using the LAB values. For each frame, they trained a separate segmentation model, storing it along with the HSV histogram as a proxy to summarize the illumination condition of that frame. K-nearest neighbors (k-NN) classification was performed on the global features to select the k most suitable segmentation models. These models were applied to the test frame and their segmentation results were combined to obtain the final hand mask.
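The model-selection step shared by these approaches reduces to a nearest-neighbor lookup over global appearance descriptors. The sketch below (our simplification of the k-NN strategy in [11, 75], with Euclidean distance standing in for whatever metric each paper used) picks the k stored segmentation models whose training frames most resemble the current one:

```python
import numpy as np

def select_models(test_hist, train_hists, k=3):
    """Pick the k segmentation models trained under conditions most
    similar to the current frame (k-NN over global color histograms).

    `train_hists` is an (N, B) array with one histogram per stored model;
    `test_hist` is the (B,) histogram of the test frame. Returns the
    indices of the k nearest models; their pixel-level outputs would
    then be combined into the final hand mask.
    """
    d = np.linalg.norm(train_hists - test_hist[None], axis=1)
    return np.argsort(d)[:k]
```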

3.1.3 Lack of pixel-level annotations

Annotating images at pixel level is laborious and costly work that deters many authors from publishing large annotated datasets. Thus, the ideal solution to the hand segmentation problem would be a self-supervised approach able to learn the appearance of the hands on-the-fly, or a weakly supervised method that relies on the available training data to produce new hand masks.

Usually, methods for online hand segmentation made assumptions on the hand motion [57, 56, 148, 149] and/or required the user to perform a calibration with pre-defined hand movements [71]. In this way, the combination of color and motion features facilitates the detection of hand pixels, in order to train segmentation models online. Kumar et al. [71] proposed an on-the-fly hand segmentation, where the user calibrated the system by waving the hands in front of the camera. The combination of color and motion segmentation (Horn–Schunck optical flow [54]) and region growing allowed locating the hand regions for training a GMM-based hand segmentation model. Region growing was also used by Huang et al. [57, 56]. The authors segmented the frames into super-pixels [1] and extracted ORB descriptors [119] from each super-pixel to find correspondences between regions of consecutive frames, which reflect the motion between two frames. Hand-related matches were distinguished from camera-related matches based on the assumption that camera-related matches play a dominant role in the video. These matches were estimated using RANSAC [47] and, after they were removed, those left were assumed to belong to the hands and used to locate the seed point for region growing. Zhao et al. [148, 149] based their approach on the typical motion pattern during actions involving the hands: a preparatory phase (i.e., the hands move from the lower part of the frame to the image center) and an interaction phase. During the preparatory phase they used a motion-based segmentation, computing the TV-L1 optical flow [106]. As the preparatory phase ends, the motion decreases and the appearance becomes more important. A super-pixel segmentation [1] was then performed, and a super-pixel classifier, based on the initial motion mask, was trained using color and gradient features.
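The region-growing primitive used by [71, 57, 56] can be sketched as a breadth-first flood fill from a seed pixel: neighbors are absorbed while their value (e.g., a skin probability or color distance) stays close to the seed's. This is a generic illustration, not the specific implementation of any cited paper:

```python
from collections import deque

import numpy as np

def region_grow(values, seed, tol=10.0):
    """Grow a region from `seed` over 4-connected neighbours whose value
    lies within `tol` of the seed value. `values` is a 2-D array such as
    a skin-probability or color-distance map; returns a boolean mask.
    """
    h, w = values.shape
    mask = np.zeros((h, w), dtype=bool)
    ref = values[seed]
    q = deque([seed])
    mask[seed] = True
    while q:
        y, x = q.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx] \
                    and abs(values[ny, nx] - ref) <= tol:
                mask[ny, nx] = True
                q.append((ny, nx))
    return mask
```

In the online setting, the seed would come from the motion cues described above (e.g., the hand-related feature matches left after RANSAC removes camera motion).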

Transfer learning has also been used to deal with the paucity of pixel-level annotations. The idea is to exploit the available pixel-level annotations in combination with other image segmentation techniques (e.g., super-pixels or GrabCut [1, 82, 118]) to generate new hand segmentation masks and fine-tune pre-trained networks. Zhou et al. [150] trained a hand segmentation network using a large amount of bounding box annotations and a small amount of hand segmentation maps [43]. They adopted a DeconvNet architecture [105] made up of two mirrored VGG-16 networks [123] initialized with 1,500 pixel-level annotated frames from [43]. Their approach iteratively selected and added good segmentation proposals to gradually refine the hand map. The hand segmentation proposals were augmented by applying super-pixel segmentation [1] and GrabCut [118] to generate the hand probability map within the ground truth bounding boxes. DeconvNet was trained in an Expectation-Maximization manner: 1) keeping the network parameters fixed, they generated a set of hand masks and selected the best segmentation proposals (i.e., those with the largest match with the ground truth mask); 2) they updated the network weights by using the best segmentation hypotheses. Similarly, Li et al. [80] relied on the few available pixel-level annotations to train Deeplab-VGG16 [32]. Their training procedure was composed of multiple steps: 1) Pre-segmentation – the CNN, pre-trained using the available pixel-level annotations, was applied to the target images to generate pre-segmentation maps; 2) Noisy mask generation – the pre-segmentation map was combined with a super-pixel segmentation [82]; and 3) Model retraining – the new masks were used as ground truth to fine-tune the pre-trained Deeplab-VGG16.
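The proposal-selection step of such EM-style schemes can be illustrated with plain mask arithmetic. The sketch below is our toy version of the "E-step" in [150]: proposals that agree with the current estimate (IoU above a threshold) are kept and merged by majority vote, and the merged mask would then serve as the training target for the next network update ("M-step"):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def refine_mask(current, proposals, thresh=0.5):
    """Keep proposal masks consistent with the current estimate and
    merge them by per-pixel majority vote (illustrative sketch of one
    proposal-selection round, not the authors' exact procedure).
    """
    kept = [p for p in proposals if iou(current, p) >= thresh]
    if not kept:
        return current
    return np.mean(kept, axis=0) >= 0.5
```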

3.1.4 3D segmentation

The use of depth sensors or stereo cameras helps alleviate some of the aforementioned issues, in particular the robustness to illumination changes and the lack of training data. However, the high power consumption and large amount of video data streamed by these devices have limited their use in FPV to research studies.

Some authors used the depth information to perform a background/foreground segmentation followed by hand/object segmentation within the foreground region by using appearance information [133, 116, 142]. Wan et al. [133] used a time-of-flight (ToF) camera to capture the scene during hand-object interactions. They observed that the foreground (i.e., arm, hands, and manipulated objects) is usually close to the camera and well distinguishable, in terms of distance, from the background. Thus, after thresholding the histogram of depth values to isolate the foreground, hand pixels were detected by combining color (RGB thresholds) and texture (Gabor filters) features. The same ToF camera (Creative Senz3D™) was used by Rogez et al. [116]. The authors trained a multi-class classifier on synthetic depth images of 1,500 different hand poses, in order to recognize one of these poses in the test depth images, thus producing a coarse segmentation mask. This mask was then processed in a probabilistic manner to find the binary map corresponding to the hand pixels. Color cues were also used by computing RGB-based super-pixels on the test image. Yamazaki et al. [142] reconstructed the colored point cloud of the scene recorded with a Microsoft Kinect™ v2. The foreground was isolated by detecting and removing large plane structures (likely belonging to the background) using RANSAC [47]. Afterwards, color segmentation was performed using a person-specific skin color model calibrated on the user’s skin. Ren et al. [112] used a stereo camera to reconstruct the depth map of the scene. Specifically, the depth map was reconstructed using scanline-based stereo matching and the hand was segmented using depth information only.
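The depth-thresholding step that opens these pipelines can be sketched in a few lines. This is a simplified stand-in for the histogram analysis in [133]: when no threshold is supplied, it places one at the largest gap in the sorted depth values, exploiting the observation that the arm and hands sit much closer to the camera than the background:

```python
import numpy as np

def foreground_from_depth(depth, max_dist=None):
    """Isolate the near foreground (arm, hands, held objects) from a
    depth map by thresholding. Without `max_dist`, the threshold is
    placed just before the largest gap in the sorted depth values,
    a crude stand-in for analysing the depth histogram.
    """
    if max_dist is None:
        vals = np.sort(np.unique(depth.ravel()))
        if len(vals) < 2:
            return np.ones_like(depth, dtype=bool)
        gaps = np.diff(vals)
        max_dist = vals[np.argmax(gaps)]  # last depth before the gap
    return depth <= max_dist
```

Appearance-based hand/object separation (color thresholds, Gabor textures, skin models) would then run only inside the returned foreground mask.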

3.1.5 Remarks on hand segmentation

Because of the high level of detail required, hand segmentation is the hardest task among hand-based methods in FPV. The pixel- or super-pixel-level accuracy it demands, combined with the intrinsic problems of egocentric vision, has made this sub-area the most challenging and debated in this field of research. The effort of many researchers in finding novel and powerful approaches is justified by the possibility of improving not only the localization accuracy, but also the performance of higher-level inference. In fact, it was demonstrated that a good hand segmentation mask can be sufficient for recognizing actions and activities involving the hands with high accuracy [5, 4]. For this reason, pixel-level segmentation has often been used as the basis of higher-level inference methods.

3D segmentation can certainly improve and simplify the hand segmentation task. However, these methods are a minority with respect to their 2D counterparts, since no depth/3D cameras have been developed specifically for egocentric applications. With the recent miniaturization of depth sensors (e.g., in the iPhone X and 11), 3D segmentation remains an area worth exploring and expanding in the next few years.

Many authors considered detection and segmentation as two steps of the same task. We preferred to split these two sub-areas given the large amount of work produced in the past few years. However, as will be illustrated in the next section, many hand detection approaches, especially those using region-based CNNs, used the segmentation mask for generating region proposals. With the possibility of re-training powerful object detectors, this process has become inefficient: instead of performing “detection over segmentation”, it may be more convenient to perform “segmentation over detection”, unless the specific problem calls for a pixel-level segmentation of the entire frame.

3.2 Hand detection and tracking

Hand detection is the process of localizing the global position of the hands at frame level. This task is usually performed by fitting a bounding box around the area where the hand has been detected (Figure 2). Hand detection extracts coarser information than hand segmentation, although this lower detail is counterbalanced by higher robustness to noise. If the application does not require very detailed information, it is the most popular choice as the basis for higher-level hand-based inference. In the literature we can distinguish two main approaches: hand detection as an image classification task, and hand detection as an object detection task. Furthermore, hand detection generalized over time is defined as hand tracking.

3.2.1 Hand detection as image classification

Pixel-level segmentation of hand regions, if performed on the entire image, may be prone to a high occurrence of false positives [14, 10]. In these cases, a pre-filtering step that avoids processing frames without any hands is necessary. This approach determines whether an image contains hands and is usually followed by a hand segmentation step responsible for locating the hand region [14, 146, 10, 148, 149].

In [146], the authors back-projected the frame using a histogram obtained from a mixture-of-Gaussians skin model [65], predicting the presence of hands within the image by thresholding the back-projected values. Betancourt et al. [10] proposed an approach based on HOG features and an SVM classifier to predict the presence of hands at frame level, reducing the number of false positives. However, this frame-by-frame filtering increased the risk of removing frames with barely visible hands, thus increasing the false negatives [10]. To solve this issue, the authors proposed a dynamic Bayesian network (DBN) to smooth the classification results of the SVM and improve the prediction performance [15]. Zhao et al. [148, 149] detected the presence of hands within each frame by exploiting the typical interaction cycle of the hands (i.e., preparatory phase - interaction - hands out of the frame). Based on this observation, they defined an ego-saliency metric related to the probability of having hands within a frame. This metric was derived from the optical flow map calculated using [99] and was composed of two terms: a spatial cue, which restricted the hand motion to the lower part of the image; and a temporal cue, which constrained the hand motion to increase over consecutive frames.
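The two cues above can be sketched as a simple per-frame score. The following is a hypothetical re-implementation (the row weighting and threshold are illustrative assumptions, not the values used in [148, 149]) that favors strong optical flow in the lower part of the frame and requires the score to be non-decreasing over time.

```python
import numpy as np

def ego_saliency(flow, lower_weight=1.0, upper_weight=0.2):
    """Toy ego-saliency score from a dense optical flow map.

    flow: (H, W, 2) array of per-pixel displacements.
    Spatial cue: motion in the lower half of the frame is weighted
    more, since the user's hands usually enter from the bottom.
    """
    h, _, _ = flow.shape
    magnitude = np.linalg.norm(flow, axis=2)   # (H, W) motion strength
    weights = np.full(h, upper_weight)
    weights[h // 2:] = lower_weight            # emphasize lower rows
    return float((magnitude * weights[:, None]).mean())

def hands_likely(flows, threshold=0.5):
    """Temporal cue (simplified): flag a frame only if its saliency is
    above threshold and non-decreasing w.r.t. the previous frame."""
    scores = [ego_saliency(f) for f in flows]
    return [s >= threshold and (i == 0 or s >= scores[i - 1])
            for i, s in enumerate(scores)]
```

Frames flagged `False` would then be skipped by the downstream segmentation step.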

3.2.2 Hand detection as object detection

Hand detection performed within an object localization framework presents notable challenges. Given the flexibility of the hand, the continuous variation of poses, and the high number of degrees of freedom, hand appearance is highly variable, and classical object detection algorithms (e.g., Haar-like features with AdaBoost classification) may work only in constrained situations, such as the detection of hands in a specific pose [135]. For these reasons, and thanks to the availability of large datasets annotated with bounding boxes (Section 6), this is the area that has benefited most from the advent of deep learning.

Region-based approaches. Many authors proposed region-based CNNs to detect the hands, exploiting segmentation approaches (Section 3.1) to generate region proposals. Bambach et al. [5, 4] proposed a probabilistic approach for region proposal generation that combined spatial biases (e.g., reasoning on the position and shape of the hands from training data) and appearance models (e.g., non-parametric modeling of skin color in the YUV color space). To guarantee high coverage, they generated 2,500 regions for each frame, which were classified using CaffeNet [64]. Afterwards, they obtained the hand segmentation mask within the bounding box by applying GrabCut [118]. Zhu et al. [153] used a structured random forest to propose pixel-level hand probability maps. These proposals were passed to a multitask CNN to locate the hand bounding box, the shape of the hand within the bounding box, and the position of the wrist and palm. In [29], the authors generated region proposals by segmenting skin regions with [75] and determining whether the set of segmented blobs corresponded to one or two arms. This estimation was performed by thresholding the fitting error of a straight line and applying k-means clustering (with k = 2 if two arms were detected) to split the blobs into two separate structures. The hand proposals, selected as the top part of a rectangular bounding box fitted to the arm regions, were passed to CaffeNet for the final prediction. To generate hand region proposals, Cruz et al. [36] used a deformable part model (DPM) to make the approach robust to different gestures. The DPM learns the hand shape by considering the whole structure and its parts (i.e., the fingers) using HOG features. CaffeNet [64] was used for classifying the proposals. Faster R-CNN was used in [86, 103, 102]. In particular, Likitlersuang et al. [86] fine-tuned the network on videos of individuals with cSCI performing ADLs. False positives were removed based on the arm angle information, computed by applying a Haar-like feature rotated 360 degrees around the bounding box centroid. The resulting histogram was classified with a random forest to determine whether the bounding box actually included a hand. Furthermore, they combined color and edge segmentation to re-center the bounding box, in order to maximize the coverage of the hand while excluding parts of the forearm.

Regression-based approaches were also used for detecting the hands. Mueller et al. [101] proposed a depth-sensor-based (Intel RealSense SR300) approach for hand detection. A Hand Localization Network (HALNet – an architecture derived from ResNet50 [51] and trained on synthesized data) was used to regress the position of the center of the hand. The ROI was then cropped around this point based on its distance from the camera (i.e., the greater the depth, the smaller the bounding box). Recently, the You Only Look Once (YOLO) detector [110] was applied to localize hands in FPV [132, 35, 67], demonstrating a better trade-off between computational cost and localization accuracy than Faster R-CNN and the single-shot detector (SSD) [132, 35, 92].
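The depth-dependent cropping can be illustrated with a small helper. This is a hypothetical sketch (the reference size and clamping values are assumptions, not those of [101]) in which the crop side length scales inversely with the hand's distance from the camera.

```python
def crop_roi(center, depth_mm, frame_w, frame_h,
             ref_size_px=300.0, ref_depth_mm=400.0,
             min_size=64, max_size=480):
    """Return an (x0, y0, x1, y1) crop around the hand center.

    The side length shrinks as the hand moves away from the camera:
    size = ref_size_px * (ref_depth_mm / depth_mm), clamped to
    [min_size, max_size] and to the frame borders.
    """
    size = ref_size_px * (ref_depth_mm / max(depth_mm, 1e-6))
    size = int(max(min_size, min(max_size, size)))
    cx, cy = center
    x0 = max(0, cx - size // 2)
    y0 = max(0, cy - size // 2)
    x1 = min(frame_w, x0 + size)
    y1 = min(frame_h, y0 + size)
    return x0, y0, x1, y1
```

For example, a hand at 800 mm gets a crop with half the side length of one at 400 mm.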

3.2.3 Hand tracking

Hand tracking estimates the position of the hands across multiple frames, reconstructing their trajectories over time. In principle, every hand detection and segmentation approach seen above (with the exception of the binary classification algorithms – Section 3.2.1) can also be used as a tracker, by performing a frame-by-frame detection. This is the most widely used choice for tracking the hand position over time. However, some authors combined the localization results with temporal models to predict future hand positions. This strategy has several advantages, such as decreasing the computational cost by avoiding running the hand detector on every frame [132], disambiguating overlapping hands by exploiting their previous locations [73, 25, 67], and refining the hand location [93].

Lee et al. [73] studied child-parent social interaction from the child's POV, using a graphical model to localize the body parts (i.e., the hands of child and parent, and the head of the parent). The model was composed of inter-frame links to enforce temporal smoothness of the hand positions over time, shift links to model the global shifts in the field of view caused by camera motion, and intra-frame constraints based on the spatial configuration of the body parts. Skin color segmentation in the YUV color space was exploited to locate the hands and define intra-frame constraints on their position. This formulation forced the body parts to remain in the neighborhood of the same position between two consecutive frames, while allowing for large displacements due to global motion (caused by head movements) if the displacement was consistent across all parts. Liu et al. [93] demonstrated that hand detection is more accurate in the central part of the image due to a center bias (i.e., a higher number of training examples with hands in the center of the frame). To correct this bias and obtain homogeneous detection accuracy over the whole frame, they proposed an attention-based hand tracker (AHT). For each frame, they estimated the target location of the hand by exploiting the result at the previous frame. Then, the estimated hand region was translated to the image center, where a CNN fine-tuned on frames with centralized hands was applied. After segmenting the hand regions using [75], Cai et al. [25] used the temporal tracking method [3] to discriminate them in case of overlap.

Regression-based CNNs in conjunction with object tracking algorithms were used in [67, 132]. Kapidis et al. [67] fine-tuned YOLOv3 [111] on multiple datasets to perform hand detection, discriminating the right- and left-hand trajectories over time using simple online and real-time tracking (SORT) [17]. For each detected bounding box, this algorithm predicted its next position, assigning it to an existing track or to a new one. Visée et al. [132] combined hand detection and tracking to design an approach for fast and reliable hand localization in FPV. Motivated by the slow detection speed of YOLOv2 without a GPU, they proposed combining YOLOv2 with the Kernelized Correlation Filter (KCF) [53] as a trade-off between speed and accuracy. The authors used the detector to automatically initialize the tracker and to reset it in case of failure or after a pre-defined number of frames.
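The detector-plus-tracker scheduling of [132] can be sketched as a simple control loop. The detector and tracker below are stand-in callables (hypothetical interfaces, not the actual YOLOv2/KCF implementations); the point is the reset logic, not the models.

```python
def detect_and_track(frames, detector, tracker, reset_every=30):
    """Run a cheap tracker between expensive detections.

    detector(frame) -> bounding box or None
    tracker.init(frame, box); tracker.update(frame) -> box or None
    The detector re-initializes the tracker on tracking failure or
    every `reset_every` frames, trading accuracy for speed.
    """
    boxes, since_reset, initialized = [], 0, False
    for frame in frames:
        box = tracker.update(frame) if initialized else None
        if box is None or since_reset >= reset_every:
            box = detector(frame)            # expensive, but reliable
            if box is not None:
                tracker.init(frame, box)
                initialized, since_reset = True, 0
            else:
                initialized = False
        else:
            since_reset += 1
        boxes.append(box)
    return boxes
```

With a fast tracker, this runs the detector only on a small fraction of the frames, which is what makes CPU-only real-time localization feasible.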

3.2.4 Remarks on hand detection and tracking

Hand detection and segmentation are two closely related tasks that can be combined. If hand detection is performed with a region-based approach (e.g., Faster R-CNN), hand segmentation can be seen as a pre-processing step of the localization pipeline, whereas with regression-based CNNs (e.g., YOLO) hand segmentation may follow the bounding box detection. The higher performance of regression-based methods with respect to region-based CNNs [35, 132] makes the latter pipeline (segmentation after detection) more appealing when optimizing hand localization. If there is no need to segment the hands at the pixel level, segmentation can simply be skipped, whereas in problems where detailed hand silhouettes are needed, hand segmentation can be applied only within the detected ROI, avoiding unnecessary computation.

The combination of detection and tracking algorithms may help speed up localization, with the possibility of translating these approaches into real-world applications where low-resource hardware is the only available option [132]. Moreover, as we will show in Section 4, hand tracking is an important step for the characterization and recognition of dynamic hand gestures [52, 100].

3.3 Hand identification

Hand identification is the process of disambiguating the left and right hands of the camera wearer, as well as the hands of other persons in the scene. The egocentric POV has intrinsic advantages that allow discriminating the hands using simple spatial and geometrical constraints [86, 73, 85]. Usually, the user's hands appear in the lower part of the image, with the right hand to the right of the user's left hand, and vice versa. By contrast, other people's hands tend to appear in the upper part of the frame [73]. The orientation of the arm regions was used in [86, 85] to distinguish the user's left hand from the right one. To estimate the angle, the authors rotated a Haar-like feature around the segmented hand region, making this approach robust to the presence of sleeves and different skin colors, since it did not require any calibration [85]. To identify the hands, they split the frame into four quadrants. The quadrant with the highest sum of the Haar-like feature vector determined the hand type: "user's right" if the lower-right quadrant; "user's left" if the lower-left quadrant; "other hands" if the upper quadrants [86]. The angle of the forearm/hand regions was also used by Betancourt et al. [11, 13]. The authors fitted an ellipse around the segmented region, calculating the angle between the arm and the lower frame border and the normalized distance of the ellipse center from the left border. The final left/right prediction was the result of a likelihood ratio test between two Maxwell distributions. Although simple and effective, spatial and geometric constraints may fail in the case of overlapping hands. In this case, temporal information helps disambiguate the hands [25, 67]. Cai et al. [25] were interested in studying the grasp of the right hand. After segmenting the hand regions [75], they implemented the temporal tracking method proposed in [3] to handle the case of overlapping hands, thus tracking the right hand. Kapidis et al. [67] used the SORT tracking algorithm [17]. This approach combines a Kalman filter, to predict the future position of the hand, and the Hungarian algorithm, to assign the next detection to existing tracks (i.e., left/right) or new ones.
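The quadrant rule can be sketched as a trivial classifier. The minimal re-implementation below is an assumption for illustration: it takes the decision from the bounding box center alone, whereas [86] used the sum of a Haar-like feature vector per quadrant.

```python
def identify_hand(box, frame_w, frame_h):
    """Label a hand bounding box as the user's left/right hand or
    another person's hand, using only its position in the frame.

    box: (x0, y0, x1, y1) in pixel coordinates (y grows downward).
    """
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    if cy < frame_h / 2.0:
        return "other"            # upper quadrants: other people's hands
    return "user_right" if cx >= frame_w / 2.0 else "user_left"
```

This kind of rule is cheap and calibration-free, which is why spatial constraints remained popular before CNN detectors subsumed the identification step.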

With the availability of powerful and accurate CNN-based detectors, the hand identification as separated processing step is deprecated, being incorporated within hand detection (Section 3.2.2) [5, 103, 132, 35]. To this end, both region-based (e.g., Faster R-CNN) and regression-based methods (e.g., YOLO and SSD) have been used. These models were trained or fine-tuned to recognize two or more classes of hands (Figure 2), predicting the bounding box coordinates along with its label (i.e., left, right, and other hands) [132, 35].

3.4 Hand pose estimation and fingertip detection

Hand pose estimation consists of localizing the hand parts (e.g., the hand joints) to reconstruct the articulated hand pose from the images (Figure 2). The possibility of obtaining the position of fingers, palm, and wrist simplifies higher inference tasks (e.g., grasp analysis and hand gesture recognition), since the dimensionality of the problem is reduced while keeping high-detail information. An important difficulty in hand pose estimation lies in object occlusions and self-occlusions, which make it hard to localize hidden joints/parts of the hand. Some authors proposed the use of 3D cameras or depth sensors in conjunction with sensor-based techniques to train hand pose estimators that are more robust to self-occlusions [114, 115, 101, 142, 48]. However, as discussed above, the use of 3D imaging techniques might not be easily translated to FPV. Thus, several attempts have also been made to estimate the hand pose using only color images [83, 152, 130, 8, 128]. In this section, we summarize the previous work, distinguishing between hand pose estimation approaches with 3D/depth sensors and hand pose estimation using monocular color images. Moreover, we summarize approaches for fingertip detection, which can be seen as an intermediate step between hand detection and hand pose estimation.

3.4.1 Hand pose estimation using 3D/depth sensors

One of the advantages of using depth images for extracting the hand pose is the possibility of synthesizing large training sets of realistic depth maps using computer graphics [114, 115]. In [114], the authors tackled hand pose estimation as a multiclass classification problem using a hierarchical cascade architecture. The classifier was trained on synthesized depth images using HOG features and tested on depth images obtained with a ToF sensor. Instead of estimating the joint coordinates independently, they predicted the hand pose as a whole, in order to make the system robust to self-occlusions. Similarly, in [115], the authors predicted the upper limb pose (arm + hand) simultaneously, using a multiclass linear SVM to recognize K poses from depth data. However, instead of classifying scanning windows on the depth maps, they classified the whole egocentric work-space, defined as the 3D volume seen from the egocentric POV. Mueller et al. [101] proposed a CNN architecture (Joint Regression Net – JORNet) to regress the 3D locations of the hand joints within cropped colored depth maps captured with a structured light sensor (Intel RealSense SR300). Afterwards, a kinematic skeleton was fitted to the regressed joints, in order to refine the hand pose. Yamazaki et al. [142] estimated the hand pose from hand point clouds captured with the Kinect v2 sensor. The authors built a dataset by collecting pairs of hand point clouds and ground truth joint positions obtained with a motion capture system. The pose estimation was performed by aligning the test point cloud to the training examples and predicting its pose as the one that minimized the alignment error. The sample consensus initial alignment [121] and iterative closest point [120] algorithms were used for aligning the point clouds. Garcia-Hernando et al. [48] evaluated a CNN-based hand pose estimator [144] for regressing the 3D hand joints from RGB-D images recorded with the Intel RealSense SR300 camera. The authors demonstrated that state-of-the-art hand pose estimation performance can be reached by training the algorithms on datasets that include hand-object interactions, in order to improve robustness to self-occlusions and hand-object occlusions.

3.4.2 Hand pose estimation from monocular color images

In general, hand pose estimation from monocular color images locates the parts of the hands either in the form of 2D joints or semantic sub-regions (e.g., fingers, palm, etc.). This estimation is performed within previously detected ROIs, obtained by either a hand detection or segmentation algorithm. Liang et al. [83] used a conditional regression forest (CRF) to estimate the hand pose from hand binary masks. Specifically, they trained a set of pose estimators separately, each conditioned on a different distance from the camera, since the hand appearance can change dramatically with this distance. Thus, they synthesized a dataset in which the images were sampled at discretized distance intervals. The authors also proposed an intermediate step to improve joint localization, segmenting the binary silhouette into twelve semantic parts corresponding to different hand regions. The semantic part segmentation was performed with a random forest for pixel-level classification, exploiting binary context descriptors. Similarly, Zhu et al. [152] built a structured forest to segment the hand region into four semantic sub-regions: thumb, fingers, palm, and forearm. This semantic part segmentation was performed by extending the structured regression forest framework already used for hand segmentation (Section 3.1) to a multiclass problem [83].

Other studies adapted CNN architectures developed for human pose estimation (e.g., OpenPose [138, 28]) to the hand pose estimation problem [130, 8], localizing 21 hand joints (Figure 2). Tekin et al. [128] used a fully convolutional network (FCN) architecture to simultaneously estimate the 3D hand and object pose from RGB images. For each frame, the FCN produced a 3D discretized grid. The 3D location of the key points in the camera coordinate system was then estimated by combining the predicted location within the 3D grid and the camera intrinsic matrix.

3.4.3 Fingertip detection

Fingertip detection can be seen as an intermediate step between hand detection and hand pose estimation. Unlike pose estimation, only the fingertips of one or multiple fingers are detected. These key-points alone do not allow reconstructing the articulated hand pose, but can be used as input to HCI/HRI systems such as [126, 19, 31] (Section 5). If the objective is to estimate the key-points of a single finger, the most common solution is to regress the coordinates of these points (usually the tip and knuckle of the index finger) from a previously detected hand ROI. This approach has been exploited in [93, 58]. The cropped images, after being resized, were passed to a CNN to regress the location of the key-points [58]. However, since the fingertip often lies at the border of the hand bounding box, the hand detection plays a significant role, and inaccurate detections greatly affect the fingertip localization result [93]. Wu et al. [140] extended the fingertip detection problem to the localization of the 5 fingertips of a hand. They proposed a heatmap-based FCN that, given the detected hand area, produced a 5-channel image containing the estimated likelihood of each fingertip at each pixel location. The maximum of each channel was used to predict the position of the fingertips.
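Decoding fingertip locations from a heatmap output reduces to a per-channel argmax. Below is a minimal numpy sketch: the 5-channel layout mirrors [140], but the decoding and the visibility threshold are generic assumptions, not the original post-processing.

```python
import numpy as np

def decode_fingertips(heatmaps, min_score=0.1):
    """Extract one (x, y) location per fingertip channel.

    heatmaps: (5, H, W) array of per-pixel fingertip likelihoods.
    Returns a list of (x, y) tuples, or None for channels whose peak
    is below `min_score` (fingertip likely occluded or out of frame).
    """
    points = []
    for ch in heatmaps:
        y, x = np.unravel_index(np.argmax(ch), ch.shape)
        points.append((int(x), int(y)) if ch[y, x] >= min_score else None)
    return points
```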

3.4.4 Remarks on hand pose estimation

Among the hand localization tasks, hand pose estimation provides high-detail information with high semantic content at the same time (Figure 1). This task, if performed correctly, can greatly simplify higher inference steps (e.g., hand gesture recognition and grasp analysis), but may be less robust to partial hand occlusions.

Compared to other localization tasks, hand pose estimation presents a higher proportion of approaches that use 3D and depth sensors. This choice has several advantages: 1) the possibility of using motion capture methods to automatically obtain the ground truth joint positions [142, 144]; 2) the availability of multiple streams (color and depth) that can be combined to refine the estimates [101, 19]; and 3) the possibility of synthesizing large datasets of realistic depth maps [114, 115]. In the past few years, human pose estimation approaches [138, 28] have been successfully adapted to the egocentric POV, in order to estimate the hand and arm pose from monocular color images [130, 8]. This opens new possibilities to streamline and improve the performance of localization and hand-based higher inference tasks, such as grasp analysis. To further facilitate the adaptation of existing pose estimation approaches, large annotated datasets with hand joint information are needed. To this end, a combination of 2D and 3D information may be beneficial, in order to obtain accurate and extensive 3D ground truth annotations that will help solve the occlusion problem even when using color images alone.

4 Interpretation

After the hands have been localized within the images, higher-level inference can be conducted in the regions of interest (ROIs). This processing is usually devoted to the interpretation of gestures and actions of the hands that, in turn, can be used as cues for hand-based applications such as HCI and HRI (Section 5). Based on the literature published so far, hand-based interpretation approaches in FPV can be divided into hand grasp analysis, hand gesture recognition, action/interaction recognition, and activity recognition (Figures 1 and 3).

Fig. 3: Diagram of the hand interpretation areas in egocentric vision. Grasp analysis and gesture recognition focus directly on describing the hand. In action/interaction and activity recognition, the hand is instrumental in describing the user’s behaviour.

4.1 Hand grasp analysis and gesture recognition

According to Feix et al. [45], "A grasp is every static hand posture with which an object can be held securely with one hand, irrespective of the hand orientation". The recognition of grasp types allows determining the different ways in which humans use their hands to interact with objects [55]. The common grasp modes can be used to describe hand-object manipulations, reducing the complexity of the problem, since the set of possible grasps is typically smaller than the set of possible hand shapes [45]. Moreover, the identification of the most recurrent grasp types has important applications in robotics, biomechanics, upper limb rehabilitation, and HCI. Thus, several taxonomies have been proposed in the past decades [45, 44, 66, 37, 84, 22, 91]. For a comprehensive comparison among these taxonomies, the reader is referred to [45]. The analysis of hand grasps conducted via manual annotation is a lengthy and costly process. Thus, the intrinsic characteristics of egocentric vision have allowed the development of automated methods to study and recognize different grasp types, saving a huge amount of manual labor. Although in most cases hand grasp analysis has been addressed in a supervised manner (grasp recognition – Section 4.1.1) [116, 25, 23, 24, 90, 9], some authors proposed to tackle this problem with clustering approaches, in order to discover dominant modes of hand-object interaction and identify high-level relationships among clusters (grasp clustering and abstraction – Section 4.1.2) [25, 55, 23, 81].

Similar to grasp analysis, hand gesture recognition aims at recognizing the semantics of the hand posture and is usually performed to provide input to HCI/HRI systems. However, two main differences exist between these two topics: 1) grasp analysis looks at the hand posture during hand-object manipulations, whereas hand gesture recognition is usually performed on hands free of any manipulation; 2) grasp analysis aims at recognizing only static hand postures [45], whereas hand gesture recognition can also be generalized to dynamic gestures. According to the literature, hand gestures can be static or dynamic [112]: static hand gesture recognition (Section 4.1.3) aims at recognizing gestures that do not depend on the motion of the hands, thus relying on appearance and hand posture information only [122, 112, 126, 129, 61, 63, 147]; dynamic hand gesture recognition (Section 4.1.4) is performed using temporal information (e.g., hand tracking), in order to capture the motion cues that generate specific gestures [61, 147, 6, 7, 52, 100].

4.1.1 Hand grasp recognition

Supervised approaches for grasp recognition are based on the extraction of features from previously segmented hand regions [75] and their multiclass classification following one of the taxonomies proposed in the literature [45].

Cai et al. [23] used HOG features to represent the shape of the hand and a combination of HOG and SIFT to capture the object context during the manipulation. These features were classified with a multiclass SVM (one-vs-all) using a subset of grasp types from Feix's taxonomy [45, 44]. The authors extended their approach in [25, 24] by introducing CNN-based features extracted from the middle layers of [69] and features derived from the dense hand trajectory (DHT) [134], such as the displacement, gradient histograms, histograms of optical flow, and motion boundary histograms. The superior performance of CNN- and DHT-based features and their robustness across different tasks and users [25] suggested that high-level feature representations, together with motion and appearance information in the space-time volume, may be important cues for discriminating different hand configurations. In [9], the authors used a graph-based approach to discriminate 8 grasp types. Specifically, the binary hand mask was used to produce a graph structure of the hand with an instantaneous topological map neural network. The eigenvalues of the graph's Laplacian were used as features to represent the hand configurations, which were recognized using an SVM.
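The Laplacian-eigenvalue feature used in [9] can be sketched in a few lines of numpy. The graph construction here is a placeholder: an adjacency matrix is assumed as input, rather than the topological-map network of the original.

```python
import numpy as np

def laplacian_spectrum(adjacency):
    """Return the sorted eigenvalues of the graph Laplacian L = D - A.

    adjacency: (N, N) symmetric 0/1 (or weighted) adjacency matrix of
    the hand graph. The spectrum is invariant to node relabeling,
    which makes it a convenient descriptor of the hand topology.
    """
    a = np.asarray(adjacency, dtype=float)
    degree = np.diag(a.sum(axis=1))
    laplacian = degree - a
    return np.sort(np.linalg.eigvalsh(laplacian))
```

The resulting vector (truncated or padded to a fixed length) would then be fed to the SVM classifier.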

The use of depth sensors was explored by Rogez et al. [116]. The authors recognized 71 grasp types [91] from RGB-D data, by training a multiclass SVM with deep-learned features [123] extracted from both real and synthetic data. Moreover, the grasp recognition results were refined by returning the closest synthetic training example, namely the one that minimized the distance to the depth map of the detected hand region.

4.1.2 Hand grasp clustering and abstraction

The first attempt to discover hand grasps in FPV was [55]. HOG features were extracted from previously segmented hand regions and grouped by means of a two-stage clustering approach. First, a set of candidate cluster centers was generated through the fast determinantal point process (DPP) algorithm [70]. This step generated a wide diversity of clusters to cover many possible hand configurations. Second, each segmented region was assigned to the nearest cluster center. The DPP algorithm was shown to outperform other clustering approaches, such as k-means, and to be more appropriate in situations, like grasp analysis, where certain clusters are more recurrent than others. A hierarchical structure of the grasp types was learned using the same DPP-based clustering approach [55]. A hierarchical clustering approach was also used in [81] to find the relationships between different hand configurations, based on a similarity measure between pairs of grasp types. Similarly, in [25, 23], the authors used a correlation index to measure the visual similarity between grasp types: grasp types with high correlation were clustered at the lower nodes, whereas low-correlated types were clustered higher in the hierarchy. The above approaches [25, 55, 23, 81] were used to build tree-like structures of the grasp types. These structures can be exploited to define new taxonomies, depending on the trade-off between detail and robustness of grasp classification, as well as to discover new grasp types not included in previous categorizations [37].
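Hierarchies like those in [25, 81] can be built with standard agglomerative clustering. Below is a minimal sketch (average-linkage over a generic pairwise similarity matrix, which is an assumption for illustration, not the exact procedure of those papers) that records the merge order: highly similar grasp types merge first, i.e., lower in the tree.

```python
def agglomerate(similarity):
    """Greedy average-linkage clustering over pairwise similarities.

    similarity: dict mapping frozenset({i, j}) -> similarity score for
    all pairs of the initial items 0..n-1. Returns the list of merges,
    most-similar pairs first - i.e., the tree from leaves to root.
    """
    n = max(max(pair) for pair in similarity) + 1
    clusters = [frozenset([i]) for i in range(n)]

    def link(a, b):  # average pairwise similarity between two clusters
        pairs = [(i, j) for i in a for j in b]
        return sum(similarity[frozenset((i, j))] for i, j in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        a, b = max(((x, y) for x in clusters for y in clusters if x != y),
                   key=lambda p: link(*p))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((set(a), set(b)))
    return merges
```

Cutting the resulting tree at different depths yields taxonomies of different granularity, which is the trade-off discussed above.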

4.1.3 Static hand gesture recognition

The recognition of static hand gestures is usually performed in a supervised manner, similar to hand grasp recognition (Section 4.1.1). A common strategy is to exploit features extracted from previously segmented hand regions, classifying them into multiple gestures, often using SVM classifiers [122, 112].

Serra et al. [122] classified the binary segmentation masks into multiple hand configurations using an ensemble of exemplar-SVMs [97]. This approach was proven to be robust in the case of unbalanced classes, as in hand gesture recognition applications where most of the frames contain negative examples. Contour features were used in [112] to recognize 14 gestures. The authors described the silhouette of the hand shape using time curvature analysis and fed the extracted features to an SVM classifier. The use of CNNs has also been investigated for the recognition of static hand gestures [9, 63]. Ji et al. [63] used a hybrid CNN-SVM approach, where the CNN was implemented as a feature extractor and the SVM as the gesture recognizer. In [126], the authors proposed a CNN architecture to directly classify the binary hand masks into multiple gestures.

Depth images were used in [129, 61, 147]. In [129], the authors used depth context descriptors and random forest classification, whereas Jang et al. [61] implemented static-dynamic voxel features, which capture the amount of the point cloud falling within each voxel, in order to describe the static posture of the hands and fingers. Moreover, depth-based gesture recognition was demonstrated to be more discriminative than color-based recognition [147]. However, in addition to the drawbacks of wearable depth sensors already discussed in the previous sections, performance was significantly lower in outdoor environments due to the deterioration of the depth map [147].
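Voxel-occupancy descriptors like those in [61] amount to a 3D histogram of the point cloud. The sketch below is a simplified assumption (grid size and bounds are illustrative, not the parameters of the original static-dynamic features).

```python
import numpy as np

def voxel_occupancy(points, bounds, grid=(8, 8, 8)):
    """Count points per voxel inside an axis-aligned bounding volume.

    points: (N, 3) array of 3D points (e.g., a depth-sensor hand cloud).
    bounds: ((xmin, xmax), (ymin, ymax), (zmin, zmax)).
    Returns a flattened occupancy vector of length prod(grid), usable
    as a fixed-size posture descriptor for a downstream classifier.
    """
    hist, _ = np.histogramdd(points, bins=grid, range=bounds)
    return hist.ravel()
```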

4.1.4 Dynamic hand gesture recognition

One of the most common choices for dynamic hand gesture recognition is to use optical flow descriptors computed from the segmented hand regions, in order to recognize the motion patterns of the gestures to be classified [6, 7, 52, 100].

Baraldi et al. [6, 7] developed an egocentric hand gesture classification system able to recognize the user's interactions with artworks in a museum. After removing camera motion, they computed and tracked feature points at different spatial scales within the hand ROI and extracted multiple descriptors (e.g., HOG, HOF, and MBH) from the resulting spatio-temporal volume. A linear SVM was used to recognize multiple gestures from these descriptors, using Bag of Words (BoW) and power normalization to avoid feature sparsity. In [52, 100], the flow vectors were calculated over the entire duration of a gesture and, based on their resultant direction, different swipe movements (left, right, up, and down) were classified using fixed thresholds on the movement orientation.
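The fixed-threshold swipe rule of [52, 100] can be sketched directly; the 90-degree sector boundaries below are an assumption for illustration, not the thresholds used in those papers.

```python
import math

def classify_swipe(flow_vectors):
    """Classify a swipe from the resultant of per-frame flow vectors.

    flow_vectors: iterable of (dx, dy) displacements over the gesture
    (image coordinates: y grows downward). The resultant direction is
    quantized into four 90-degree sectors.
    """
    dx = sum(v[0] for v in flow_vectors)
    dy = sum(v[1] for v in flow_vectors)
    angle = math.degrees(math.atan2(-dy, dx)) % 360  # 0 deg = right, CCW
    if angle < 45 or angle >= 315:
        return "right"
    if angle < 135:
        return "up"
    if angle < 225:
        return "left"
    return "down"
```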

Other approaches recognized dynamic gestures as a generalization of the static gesture recognition problem [61, 147]. In [61], the authors proposed a hierarchical approach for estimating hand gestures using a static-dynamic forest to produce hierarchical predictions of the hand gesture type. Static gesture recognition was performed at the top level of the hierarchy, in order to select a virtual object corresponding to the detected hand configuration (e.g., holding a stylus pen). Afterwards, the recognition of dynamic gestures, conditioned on the previously detected static gesture, was performed (e.g., pressing or releasing the button on the pen). Zhang et al. [147] compared engineered features and deep learning approaches (2DCNN, 3DCNN, and recurrent models), demonstrating that 3DCNNs are more suitable for hand gesture recognition and that the combination of color and depth information can produce better results than either modality alone.

4.1.5 Remarks on hand grasp analysis and gesture recognition

Many similarities can be found between grasp recognition and hand gesture recognition. As mentioned above, the main difference is the context in which the two problems are addressed: grasp recognition is performed during hand-object manipulations, whereas hand gesture recognition is performed without any physical object being manipulated. This difference links these two sub-areas to different higher-level tasks and FPV applications. In fact, hand gesture recognition approaches have mainly been used for AR/VR applications [52, 129, 61, 100], whereas grasp analysis can be exploited for action/interaction recognition and activity recognition [24, 34].

Hand grasp analysis and gesture recognition are the only interpretation sub-areas where the analysis of the hands is still the main target of the approaches. In fact, moving higher along the semantic content dimension (Figure 1), every sub-area (i.e., action/interaction detection and recognition, activity recognition) may use hand information in combination with other cues (e.g., object recognition) to perform higher-level inference. It should be noted, though, that not all higher-level interpretation approaches utilized hand-based processing in FPV. Thus, in the following sections, we will discuss only those methods that explicitly used hand information for predicting actions and activities, omitting other papers and referring the readers to other surveys or research articles.

4.2 Action/interaction and activity recognition

According to Tekin et al. [128], an action is a verb (e.g., “cut”), whereas an interaction is a verb-noun pair (e.g., “cut the bread”). Both definitions refer to short-term events that usually last a few seconds [124]. By contrast, activities are longer temporal events (minutes or hours) with higher semantic content, typically composed of temporally-consistent actions and interactions [104] (Figure 3).

In this section, we summarize FPV approaches that relied on hand information to recognize actions, interactions, and activities from sequences of frames. Regarding actions and interactions, two main types of approaches can be found in the literature: those that used hands as the only cue for the prediction (Section 4.2.1) and those that used a combination of object and hand cues (Section 4.2.2). Although the second type of approach might seem more suitable for interaction recognition (i.e., verb+noun prediction), some authors used it for predicting action verbs, exploiting the object information to prune the space of possible actions (i.e., removing unlikely verbs for a given object) [42]. Likewise, other authors tried to use only hand cues to recognize interactions [127], in order to produce robust predictions without relying on object features or object recognition algorithms. Either way, the boundary between action and interaction recognition is not well defined and often depends on the nature of the dataset on which a particular approach has been tested.

4.2.1 Action/interaction recognition using hand cues

These approaches inferred the camera wearer’s actions by exploiting the information provided by hand localization methods (Section 3). The hypothesis is that actions and interactions can be recognized using only hand cues, for instance features related to the posture and motion of the hands. Existing studies can be divided into feature-based approaches [71, 60, 26] and deep learning-based approaches [124, 130, 127].

Feature-based approaches combined motion and shape features of the hands to represent two complementary aspects of the action: movements of the hand’s parts and grasp types. This representation allowed discriminating actions with similar hand motion but different hand posture. Typical choices of motion features were dense trajectories [134], whereas the hand shape was usually represented with HOG [60] or shape descriptors computed on the segmented hand mask [26]. All these features were then combined and used to recognize actions/interactions via SVM classifiers. Ishihara et al. [60] used dense local motion features to track keypoints from which HOG, MBH, and HOF were extracted [134]. Global hand shape was represented using HOG features within the segmented hand region. The authors used Fisher vectors and principal component analysis (PCA) to encode features extracted from time windows of fixed duration, followed by a multiclass linear SVM for the recognition. Dense trajectory features were also used by Kumar et al. [71], who proposed a feature sampling scheme that preserved dense trajectories closer to the hand centroid while removing trajectories from the background (likely due to head motion). A BoW representation was used and the recognition was performed using a kernel SVM. Cai et al. [26] combined hand shape, hand position, and hand motion features for recognizing the user’s desktop actions (e.g., browse, note, read, type, and write). Histograms of the hand shape computed on the hand mask were used as shape features. Hand position was represented by the point within the hand region where a manipulation is most likely to happen (e.g., the left tip of the right hand region). Motion descriptors relied on the computation of the large displacement optical flow (LDOF) [21] between consecutive frames. The spatio-temporal distribution of hand motion (i.e., DFT coefficients of the average LDOF extracted from hand sub-regions over consecutive frames) was demonstrated to outperform temporal and spatial distributions alone, suggesting that spatial and temporal information should be considered together when recognizing hand actions.
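The hand-centered trajectory sampling of Kumar et al. [71] can be illustrated with a simple distance filter: trajectories whose mean position lies far from the hand centroid are discarded as background or head motion. The radius threshold below is a free parameter chosen only for illustration.

```python
# Illustrative sketch of hand-centered dense-trajectory pruning:
# keep trajectories near the hand centroid, drop the rest.
import numpy as np

def prune_trajectories(trajectories, hand_centroid, radius):
    """trajectories: list of (T, 2) arrays of tracked point positions."""
    kept = []
    for traj in trajectories:
        mean_pos = np.asarray(traj).mean(axis=0)
        if np.linalg.norm(mean_pos - hand_centroid) <= radius:
            kept.append(traj)
    return kept

hand = np.array([320.0, 240.0])
trajs = [
    np.array([[318.0, 241.0], [322.0, 239.0]]),  # on the hand -> kept
    np.array([[20.0, 30.0], [22.0, 31.0]]),      # background -> removed
]
print(len(prune_trajectories(trajs, hand, radius=50.0)))  # 1
```

The surviving trajectories would then be encoded (e.g., with BoW) before classification, as described above.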

The combination of temporal and spatial information was also exploited in deep learning approaches, usually by means of multi-stream architectures. Singh et al. [124] proposed a CNN-based approach to recognize the camera wearer’s actions using the following inputs: the pixel-level hand segmentation mask; head motion, computed as the frame-to-frame homography estimated with RANSAC on optical flow correspondences (excluding the hand regions); and a saliency map, computed as the flow map obtained after applying the homography. This information was passed to a two-stream architecture composed of a 2DCNN and a 3DCNN. The deep-learned features from both streams were combined and actions were predicted using SVM. Urabe et al. [130] used the region around the hands to recognize cooking actions. Appearance and motion maps (created using the segmented hand mask) were passed to a 2DCNN and a 3DCNN, respectively. Afterwards, class-score fusion was performed by multiplying the outputs of both streams. The authors demonstrated that the multi-stream approach yielded better results than either stream alone. Tang et al. [127] used the hand information as an auxiliary stream within an end-to-end multi-stream deep neural network (MDNN) that used RGB, optical flow, and depth frames as input. The hand stream was a CNN with the hand mask as input, and its output was combined with the MDNN via weighted fusion to predict the action label. The addition of the hand stream improved the recognition performance.
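The multiplicative class-score fusion used by Urabe et al. [130] is easy to sketch: per-class scores from the appearance and motion streams are multiplied, so a class is favored only when both streams agree. The scores below are made-up softmax outputs, not results from the paper.

```python
# Minimal sketch of multiplicative class-score fusion between an
# appearance (2DCNN) stream and a motion (3DCNN) stream.
import numpy as np

def fuse_scores(appearance_scores, motion_scores):
    fused = appearance_scores * motion_scores
    return fused / fused.sum()  # renormalize to a distribution

appearance = np.array([0.6, 0.3, 0.1])  # e.g., "cut", "stir", "pour"
motion     = np.array([0.2, 0.7, 0.1])

fused = fuse_scores(appearance, motion)
print(int(np.argmax(fused)))  # 1 -> the class both streams support
```

Note how the appearance stream alone would have predicted class 0, while the product favors the class with consistent support from both modalities.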

4.2.2 Action/interaction recognition combining hand and object cues

Many authors demonstrated that the combination of object and hand cues can improve the recognition performance [42, 79, 133, 24, 67]. This is quite intuitive, since during an interaction the grasp type and hand movements strictly depend on the characteristics of the object being manipulated (e.g., dimensions, shape, functionality). Thus, grasp type or hand pose/shape along with object cues can be used to recognize actions and interactions [128, 42, 86, 48, 24, 34].

In [24], the authors predicted the attributes of the manipulated object (i.e., object shape and rigidity) and the type of grasp to recognize hand actions. They proposed a hierarchical two-stage approach where the lower layer (visual recognition) recognized the grasp type and the object attributes and passed this information to the upper layer (action modeling), responsible for the action classification via linear SVM. Coskun et al. [34] implemented a recurrent neural network (RNN) to exploit the temporal dependencies of consecutive frames using a set of deep-learned features related to grasp, optical flow, object-object, and hand-object interactions, as well as the trajectories of the hands over the past few frames. Other authors [128, 42, 86, 48] used hand cues with lower semantic content than hand grasp, such as shape and pose. Fathi et al. [42] extracted a set of object and hand descriptors (including object and hand labels, optical flow, location, shape, and size) at the super-pixel level and performed a two-stage interaction recognition: first, they recognized actions using AdaBoost; second, they refined the object recognition in a probabilistic manner by exploiting the predicted verb label and the object classification scores. Likitlersuang et al. [86] detected the interactions between the camera wearer’s hands and manipulated objects (a binary classification task). This was accomplished by combining the hand shape, represented with HOG descriptors, with color and motion descriptors (color histogram and optical flow) for the hand, the background, and the object (regions around the hands). A random forest was used for classification. The articulated hand pose was used in [128, 48]. Garcia-Hernando et al. [48] passed the hand and object key-points to an LSTM that predicted the interactions over the video frames. This approach was extended in [128], where hand-object interactions were first modeled using a multi-layer perceptron and then used as input to the LSTM.
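As a toy illustration of the kind of input such recurrent models consume, per-frame hand and object key-points can be flattened into one feature vector per frame before being fed to the sequence model. The shapes and random values below are made up; they only show the data layout, not the actual features of [48, 128].

```python
# Hypothetical per-frame feature construction for an LSTM-style model:
# hand and object key-points flattened and concatenated per frame.
import numpy as np

rng = np.random.default_rng(0)
T = 30                               # frames in the clip
hand_kpts = rng.random((T, 21, 3))   # 21 3-D hand joints per frame (assumed)
obj_kpts = rng.random((T, 8, 3))     # 8 3-D object corners per frame (assumed)

# One feature vector per frame: concatenated, flattened key-points.
features = np.concatenate(
    [hand_kpts.reshape(T, -1), obj_kpts.reshape(T, -1)], axis=1
)
print(features.shape)  # (30, 87)
```

A recurrent model would then consume this (frames x features) sequence to predict one interaction label per clip or per frame.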

Other approaches, instead of explicitly using the hand information for predicting actions and interactions, exploited the results of hand localization algorithms to guide the feature extraction within a neighborhood of the manipulation region [79, 96]. This strategy was motivated by the fact that the most important cues (motion, object, etc.) during an action are likely to be found in the proximity of the hands and the manipulated object. Li et al. [79] used a combination of local descriptors for motion and object cues in conjunction with a set of egocentric cues. The former were extracted from the dense trajectories to represent the motion of the action (i.e., shape of the trajectories, MBH, HoF) and the object appearance (e.g., HOG, LAB color histogram, and LBP along the trajectories). The latter were used to approximate the gaze information, by combining camera motion removal and hand segmentation, in order to focus the attention on the area where the manipulation is happening. Ma et al. [96] used a multi-stream deep learning approach composed of an appearance stream to recognize the object and a motion stream to predict the action verb. The object recognition network predicted the object label using the hand mask and the object ROI as input, whereas the action recognition network used the optical flow map to infer the verb. A fusion layer combined verb and object labels and predicted the interactions. Zhou et al. [150] used the hand segmentation mask with object features (mid-layer features extracted from AlexNet [69]) and optical flow to localize and recognize the active object (using VGG-16 [123]). Afterwards, object features were represented in a temporal pyramid manner and combined with motion characteristics extracted from improved dense trajectories, in order to recognize interactions using a non-linear SVM.
Although the above approaches differ in the type of features and algorithms used to predict actions and interactions, most of them demonstrated that the combination of object and hand cues can provide better recognition performance than single-modality recognition [79, 133].

4.2.3 Activity recognition

As we climb the semantic content dimension (Figure 1), the strong dependency on hand cues fades away. Other information comes into play and can be used in conjunction with the hands to predict the activities. This diversification becomes clear when we look at the review published by Nguyen et al. [104], which categorized egocentric activity recognition approaches as: 1) combination of actions; 2) combination of active objects; 3) combination of active objects and locations; 4) combination of active objects and hand movements; and 5) combination of other information (e.g., gaze, motion, etc.). The description of all these approaches goes beyond the scope of this work, since we are interested in characterizing how hands can be used in activity recognition methods. For a more comprehensive description of activity recognition in FPV, the reader is referred to [104]. The boundary between the recognition of short and long temporal events (i.e., actions/interactions and activities, respectively) is not always well defined and, similar to action/interaction recognition, it may depend on the dataset used for training and testing a particular approach. In fact, some of the methods described in the previous subsections were also tested within an activity recognition framework [133, 96]. Generally, we can identify two types of approaches: activity recognition based on actions and interactions [42, 102], and approaches that used hand localization results to directly predict the activities [5, 131, 4].

Approaches that relied on actions and interactions learned a classifier that recognized the activities using the detected actions or hand-object interactions as features. This can be performed, for example, by computing the histogram of action frequencies in the video sequence and classifying it using AdaBoost [42]. Nguyen et al. [102] used a Bag of Visual Words representation to model the interactions between hands and objects, since these cues play a key role in the recognition of activities. Dynamic time warping was then used to compare a new sequence of features with the key training sequences.
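The dynamic time warping step used in [102] can be sketched with the classic quadratic-time recursion. Sequences are 1-D here for brevity; in practice they would be sequences of BoW feature vectors, with a vector distance replacing the absolute difference.

```python
# Plain dynamic time warping (DTW) sketch with absolute-difference cost.
# The toy "templates" below are illustrative, not real activity features.

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) DTW alignment cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

query = [1, 2, 3, 3, 2]
template_tea   = [1, 2, 3, 2]   # similar shape -> small distance
template_phone = [5, 5, 1, 0]   # different activity -> large distance
print(dtw_distance(query, template_tea) < dtw_distance(query, template_phone))  # True
```

DTW is attractive here because activities of the same class can unfold at different speeds, and the warping absorbs that temporal variability.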

Other authors [5, 131, 4] investigated how well the hand segmentation map can predict a small set of social activities (i.e., four interactions between two individuals). The authors used a CNN-based approach with the binary hand segmentation maps as input. The prediction was performed on a frame-by-frame basis and with temporal integration (a voting strategy over a sequence of frames), with the latter approach providing better results (up to 73% recognition accuracy) [5]. This result confirms what was already shown for actions and interactions, namely that temporal information becomes essential when performing higher-level inference, especially when modeling relatively long events like activities. However, this approach was tested only on a small set of social activities. To the best of our knowledge, no experiments using hand cues alone have been conducted for predicting other types of activities, such as ADLs.
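The temporal-integration step of [5] amounts to majority voting over a window of frame-level predictions, which smooths out isolated per-frame errors. A minimal sketch (the activity labels are illustrative):

```python
# Majority voting over a sliding window of frame-level predictions.
from collections import Counter

def vote(frame_predictions):
    """Return the most frequent label in a window of frame predictions."""
    return Counter(frame_predictions).most_common(1)[0][0]

window = ["puzzle", "puzzle", "cards", "puzzle", "puzzle", "chess"]
print(vote(window))  # puzzle
```

Even this trivial integration illustrates why sequence-level prediction outperformed frame-by-frame prediction in [5]: a single misclassified frame no longer flips the output.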

4.2.4 Remarks on action/interaction and activity recognition

Many authors demonstrated that action/interaction recognition performance can be improved by combining different cues, such as hand, object, and motion information. This was proven regardless of the actual method. In fact, both feature-based and deep learning-based methods implemented this strategy by combining multiple features or using multi-stream DNN architectures. Another important aspect on which one should focus when developing novel approaches for action/interaction recognition is the temporal information. This was exploited by using 3DCNNs and RNNs or, in the case of feature-based approaches, by encoding it in the motion information. The same conclusion can be drawn for activity recognition where, considering the longer duration of the events, the temporal information becomes even more important [5].

Sometimes the literature is not consistent in the choice of taxonomy to describe these sub-areas. Some of the approaches summarized above, even though not explicitly referred to as action/interaction recognition, actually recognized short actions or interactions. We preferred to be consistent with the definition proposed by Tekin et al. [128], as we believe that a consistent taxonomy may help authors compare different approaches and unify their efforts towards solving a specific problem. Moreover, the term “action” has often been used interchangeably with “activity”, which indicates a longer event with higher semantic content. Actions and interactions can rather be seen as the building blocks of activities. This allowed some authors to exploit this mutual dependency to infer activities in a hierarchical manner, using the methods described above [42, 102].

The number of egocentric activity recognition approaches based on hand information is lower than the number of action and interaction recognition approaches. This difference is due to the fact that, higher in the semantic content dimension, authors have a wider choice of cues and features for recognizing a temporal event. In particular, over the past few years, more and more end-to-end approaches for activity recognition have been proposed, following approaches similar to video recognition [113].

5 Application

The approaches summarized so far can be implemented to design real-world FPV applications. The applications that have most commonly relied on hand-based methods are in healthcare and HCI/HRI.

5.1 Healthcare applications

Egocentric vision has demonstrated the potential to have an important impact on healthcare. The possibility of automatically analyzing the articulated hand pose and recognizing actions and ADLs has made these methods appealing for upper limb clinical assessment [146, 88, 87, 86, 132] and ambient assisted living (AAL) systems [68, 104, 103]. The assessment of upper limb function is an important phase in rehabilitation after stroke or cSCI that allows clinicians to plan the optimal treatment strategy for each patient. However, the upper limb assessment performed during face-to-face visits with clinicians may not always reflect actual hand use in natural settings. Egocentric vision has inspired researchers to develop video-based approaches for automatically studying hand function at home [146, 88, 86, 132]. Studies have been conducted in individuals with cSCI, tackling the problem of hand function assessment from two perspectives: localization [146, 85, 132] and interpretation [88, 86]. Fine-tuning object detection algorithms to localize and recognize hands in people with SCI allowed developing hand localization approaches that are robust to impaired hand poses and uncontrolled situations [132]. Moreover, strategies for improving the computational performance of hand detection algorithms have been adopted (e.g., combining hand detection and tracking), making this application suitable for use at home. The automatic detection of hand-object manipulations allowed extracting novel measures reflective of hand use at home, such as the number of interactions per hour, the duration of interactions, and the percentage of interaction over time [86]. These measures, once validated against clinical scores, will help clinicians better understand how individuals with cSCI use their hands at home.
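Given per-frame binary hand-object interaction detections, the hand-use measures mentioned above (interactions per hour, mean interaction duration, percentage of time in interaction) reduce to simple bookkeeping. The frame rate and detection sequence below are made up for illustration; this is not the processing pipeline of [86].

```python
# Illustrative computation of hand-use measures from per-frame
# binary manipulation detections (0 = no interaction, 1 = interaction).

def hand_use_metrics(interaction_flags, fps):
    """interaction_flags: list of 0/1, one per video frame."""
    total_s = len(interaction_flags) / fps
    # Group consecutive 1s into interaction bouts.
    bouts, run = [], 0
    for f in interaction_flags + [0]:  # sentinel closes a trailing bout
        if f:
            run += 1
        elif run:
            bouts.append(run / fps)
            run = 0
    active_s = sum(bouts)
    return {
        "interactions_per_hour": len(bouts) / (total_s / 3600.0),
        "mean_duration_s": active_s / len(bouts) if bouts else 0.0,
        "percent_interaction": 100.0 * active_s / total_s,
    }

flags = [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]  # two bouts in a 10-frame clip
m = hand_use_metrics(flags, fps=1.0)
print(m["percent_interaction"])  # 50.0
```

In a real deployment the flags would come from the interaction detector, and the metrics would be aggregated over hours of home recording before comparison with clinical scores.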

Another healthcare application is the development of AAL systems. The ageing population is posing serious social and financial challenges in many countries. These challenges have stimulated interest in developing technological solutions to help and support older adults (with and without cognitive impairment) in their daily life [108]. Some of these applications used egocentric vision to help and support older adults during ADLs at home [68, 104, 103]. Egocentric vision AAL builds upon the action and activity recognition approaches illustrated in Section 4.2. In particular, approaches have been proposed to automatically recognize how older adults perform ADLs at home, for example to detect early signs of dementia [68] or to support people in conducting the activities [103].

Regardless of the specific application, the use of egocentric vision presents important advantages with respect to other solutions (e.g., sensor-based and third person vision):

  • FPV can provide high-quality videos of how people manipulate objects. This is important when the aim is the recognition of hand-object manipulations and ADLs, since hand occlusions tend to be minimized.

  • Egocentric vision provides more details of hand-object interactions than sensor-based technology. Sensor-based solutions such as sensor gloves, although providing highly accurate hand information, may limit movement and sensation, which are already reduced in individuals with upper limb impairment [86, 87].

5.2 HCI/HRI

Most hand-based applications in FPV fall within HCI/HRI. This area includes AR and VR systems as well as technologies for robot control and learning. Most of the applications in this category rely on hand localization and gesture recognition algorithms.

Among the approaches proposed for AR and VR, many systems used the hand information to allow manipulating virtual objects [50, 62, 61, 129]. Depth sensors were usually implemented to capture the scene, whereas head-worn displays (HWD) allowed projecting the virtual objects into the AR/VR scenario. The basic hand-based processing steps were hand localization – in particular, hand detection, segmentation, and pose estimation – and hand gesture recognition. The recognition of specific hand gestures allowed providing inputs and commands to the system, in order to produce a specific action (e.g., selection of a virtual object by recognizing the clicking gesture [62]). The use of depth sensors has usually been preferred, since the localization of hands and objects can be made more robust to illumination changes. Some authors implemented multiple depth sensors [50]: one specific for short distances (i.e., up to 1 m), to capture more accurate hand information, and a long-range depth camera to reproduce the correct proportions between the physical and virtual environments. To improve the robustness of hand localization, other systems combined multiple hand localization approaches, for example hand pose estimation in conjunction with fingertip detection [62]. This approach can be helpful when the objective is to localize the fingertips in situations with frequent self-occlusions. Other AR/VR applications relied on dynamic hand gestures (e.g., swipe movements) recorded with a smartphone camera and frugal VR devices (e.g., Google Cardboard), in order to enable interactions in the virtual environment [52, 100]. AR/VR applications were also implemented for telesupport and coexistence reality [143, 49], in order to allow multiple users to collaborate remotely. Specific fields of application were remote co-surgery [143] and expert tele-assistance and support [49].

Within HCI, hand-based methods were also used for cultural heritage applications. Thanks to hand localization and gesture recognition approaches, several authors developed systems for immersive museum and touristic experiences [122, 6, 19]. Users can experience an entertaining way of accessing museum knowledge, for example by taking pictures of and providing feedback on the artworks with simple hand gestures [7]. Other authors [19] proposed a smart glasses-based system that allowed users to access touristic information while visiting a city by using pointing gestures of the hand. Hand-based information retrieved from FPV was also exploited for the recognition of hand-written characters [31, 59]. This application was usually performed in four steps: 1) hand localization (e.g., hand detection and tracking); 2) hand gesture recognition, to recognize a specific hand posture that triggers the writing module; 3) fingertip detection, to identify the point to track, whose trajectory defines the characters; and 4) character recognition, based on the trajectories of the detected fingertip [31, 59].
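The four-step air-writing pipeline can be sketched as a small state machine: a detected "writing" posture (step 2) gates fingertip accumulation (step 3), and the collected trajectory is what a character recognizer (step 4) would consume. All gesture labels and frame data below are hypothetical stubs.

```python
# Hypothetical sketch of the gesture-gated fingertip trajectory
# collection used in air-writing systems; detection and character
# recognition are stubbed out.

def collect_stroke(frames):
    """frames: list of dicts with 'gesture' and 'fingertip' (x, y) keys."""
    stroke = []
    for frame in frames:
        if frame["gesture"] == "write":        # step 2: trigger posture
            stroke.append(frame["fingertip"])  # step 3: track the fingertip
        elif stroke:
            break                              # posture released: stroke done
    return stroke                              # step 4 would classify this

frames = [
    {"gesture": "idle",  "fingertip": (0, 0)},
    {"gesture": "write", "fingertip": (10, 10)},
    {"gesture": "write", "fingertip": (12, 18)},
    {"gesture": "idle",  "fingertip": (0, 0)},
]
print(len(collect_stroke(frames)))  # 2
```

Gating the trajectory on a trigger posture is what separates intentional writing from ordinary hand movement in front of the camera.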

In the HRI field, FPV hand-based approaches have mainly been used for two purposes: robot learning and robot control. Approaches for robot learning recognized movements and/or actions performed by the user’s hands, in order to train a robot to perform the same set of movements autonomously [2, 72]. Aksoy et al. [2] decomposed each manipulation into shorter chunks and encoded each manipulation into a semantic event chain (SEC), which encodes the spatial relationships between objects and hands in the scene. Each temporal transition in the SEC (e.g., a change of state in the scene configuration) was considered a movement primitive for robot imitation. In [72], the robot used the tracked hand locations of a human to learn the hands’ future positions and predict their trajectories when a particular action has to be executed. By contrast, robot control approaches mainly relied on hand gesture recognition to give specific real-time commands to robots [126, 63]. The hand gestures are seen as a means of communication between the human and the robot and can encode specific commands, such as the action to be performed by a robot arm [126] or the direction to be taken by a reconnaissance robot [63].

6 FPV Datasets with hand annotation

| Dataset | Year | Mode | Device | Loc. | Frames | Videos | Duration | Subj. | Resolution | Annot. |
| GTEA [43] | 2011 | C | GoPro | H | 31k | 28 | 34 min | 4 | 1280x720 | - |
| ADL [107] | 2012 | C | GoPro | H | >1M | 20 | 10 h | 20 | 1280x960 | - |
| EDSH [75] | 2013 | C | - | H | 20k | 3 | 10 min | - | 1280x720 | msk |
| Museum [6] | 2014 | C | - | H | - | 700 | - | 5 | 800x450 | - |
| EgoHands [5] | 2015 | C | Google Glass | H | 130k | 48 | 72 min | 8 | 1280x720 | msk |
| Maramotti [7] | 2015 | C | - | H | - | 700 | - | 5 | 800x450 | - |
| Hands [12] | 2015 | C | - | H | 150k | - | 98 min | - | 1280x720 | det |
| GUN-71 [116] | 2015 | CD | - | - | - | - | - | 8 | - | grs |
| Action [133] | 2015 | CD | - | H | - | - | - | 20 | - | - |
| in mid-air [31] | 2016 | CD | - | H | 8k | - | - | - | - | - |
| Ego-Finger [59] | 2016 | C | - | H | 93k | 24 | - | 24 | 640x480 | - |
| bodied [88] | 2016 | C | Looxcie 2 | - | - | - | 44 min | 4 | 640x480 | int |
| UT Grasp [25] | 2017 | C | - | H | - | 50 | 4 h | 5 | 960x540 | grs |
| GestureAR [100] | 2017 | C | Nexus 6 and Moto G3 | H | - | 100 | - | 8 | 1280x720 | gst |
| EgoGesture [140] | 2017 | C | - | - | 59k | - | - | - | - | - |
| hand-action [141] | 2017 | D | - | H | 154k | 300 | - | 26 | 320x240 | gst |
| BigHand2.2M [144] | 2017 | D | - | - | 290k | - | - | - | 640x480 | pos |
| Action [26] | 2018 | C | - | H | 324k | 60 | 3 h | 6 | 1920x1080 | - |
| Kitchens [39] | 2018 | C | GoPro | H | 11.5M | - | 55 h | 32 | 1920x1080 | act |
| FPHA [48] | 2018 | CD | - | - | - | 1,175 | - | 6 | - | - |
| EYTH [131] | 2018 | C | - | - | - | 3 | - | - | - | msk |
| EGTEA+ [78] | 2018 | C | SMI wearable | H | >3M | 86 | 28 h | 32 | 1280x960 | - |
| THU-READ [127] | 2018 | CD | - | H | 343k | 1,920 | - | 8 | 640x480 | - |
| EgoGesture [147] | 2018 | CD | - | H | 3M | 24,161 | - | 50 | 640x480 | gst |
| EgoDaily [35] | 2019 | C | - | - | 50k | 50 | - | 10 | 1920x1080 | - |
| ANS SCI [86] | 2019 | C | - | H | - | - | - | 17 | 1920x1080 | - |
| KBH [137] | 2019 | C | HTC Vive | H | - | 161 | - | 50 | 230x306 | msk |
TABLE I: List of available datasets with hand-based annotations in FPV. Image modality (Mode): Color (C); Depth (D); Color+Depth (CD). Camera location (Loc.): Head (H); Chest (C); Shoulder (S). Annotation (Annot.): actions/activities (act); hand presence and location (det); fingertip positions (ftp); gaze (gaz); grasp types (grs); hand gestures (gst); hand disambiguation (hid); hand-object interactions (int); hand segmentation masks (msk); object classes (obj); hand pose (pos). A dash (-) marks information not reported in the source table.

The importance that this field of research has gained in recent years is clear when we look at the number of datasets published since 2015 (Table I). Although the type of information and ground truth annotations made available by the authors is heterogeneous, it is possible to identify some sub-areas that are more recurrent than others. The vast majority of datasets provided hand segmentation masks [75, 6, 7, 5, 88, 26, 131, 78, 127, 137], reflecting the high number of approaches proposed in this area (Section 3). However, the high number of datasets is counterbalanced by a relatively low number of annotated frames, usually in the order of a few hundreds or thousands of images. To expedite the lengthy pixel-level annotation process and build larger datasets for hand segmentation, some authors proposed semi-automated techniques, for example based on GrabCut [26, 118]. Actions/activities [107, 133, 26, 48, 78, 127] and hand gestures [6, 7, 31, 100, 140, 141, 147] are other common types of information captured and annotated in many datasets. This large amount of data has been used by researchers for developing robust HCI applications that rely on hand gestures. Compared to hand segmentation masks, action/activity and hand gesture datasets are usually larger, since the annotation process is easier and faster than pixel-level segmentation.

The vast majority of datasets included color information recorded from head-mounted cameras. The head position is usually preferred over the chest or shoulders, since it is easier to focus on hand actions and manipulations whenever the camera wearer is performing a specific activity. GoPro cameras were the most widely used devices for recording the videos, since they are specifically designed for an egocentric point of view and are readily available on the market. A few datasets, usually designed for hand pose estimation [48, 144, 141], hand gesture recognition [31, 147], and action/activity recognition [133, 48, 127], include depth or color+depth information. In most cases, videos were collected using Creative Senz3D or Intel RealSense SR300 depth sensors, as these devices were small and lightweight. Moreover, these cameras were preferred over other depth sensors (e.g., Microsoft Kinect) because they were originally developed for natural user interfaces, which makes them more suitable for studying hand movements in the short range (up to 1 m of distance from the camera).

Although FPV is gaining a lot of interest for developing healthcare applications (Section 5.1), only one dataset (the ANS-SCI dataset [86]) included videos from people with clinical conditions (i.e., cSCI). This lack of available data is mainly due to ethical constraints that make it harder to share videos and images collected from people with diseases or clinical conditions. In the next few years, researchers should try – within the ethical and privacy constraints – to build and share datasets for healthcare applications that include videos collected from patients. This will benefit the robustness of hand-based approaches in FPV against the inter- and within-group variability that can be encountered in many clinical conditions.

7 Conclusion

In this paper we showed how hand-related information can be retrieved and used in egocentric vision. We summarized the existing literature into three macro-areas, identifying the most prominent approaches for hand localization (e.g., hand detection, segmentation, pose estimation, etc.), interpretation (grasp analysis, gesture recognition, action and activity recognition), as well as the FPV applications for building real-world solutions. We believe that a comprehensive taxonomy and an updated framework of hand-based methods in FPV may serve as guidelines for the novel approaches proposed in this field by helping to identify gaps and standardize terminology.

One of the main factors that promoted the development of FPV approaches for studying the hands is the availability of wearable action cameras and AR/VR systems. However, we also showed how the use of depth sensors, although not specifically developed for wearable applications, has been exploited by many authors, in order to improve the robustness of hand localization. We believe that the possibility to develop miniaturized wearable depth sensors may further boost the research in this area and the development of novel solutions, since a combination of color and depth information can improve the performance of several hand-based methods in FPV.

From this survey it is clear that the hand localization step plays a vital role in any processing pipeline, as a good localization is a necessary condition for higher-level hand-based inference, such as gesture or action recognition. This importance has motivated the extensive research conducted in the past 10 years, especially in sub-areas like hand detection and segmentation. The importance of hand localization methods may also be seen in those approaches where the hands play an auxiliary role, such as activity recognition. In fact, the position of the hands can be used to build attention-based classifiers, where more weight is given to the manipulation area.

Like other computer vision fields, this area has been greatly impacted by the advent of deep learning, which boosted the performance of several localization and interpretation approaches and reduced the number of steps required to pursue a certain objective (see the hand identification example – Section 3.3). Hand detection is the localization sub-area that has seen the largest improvements, especially thanks to the availability of object detection networks retrained on large datasets. Other sub-areas, such as hand segmentation and pose estimation, will perhaps see larger improvements in the next few years, especially if the amount of available annotated data grows. Recurrent models, 3DCNNs, and the availability of large datasets (e.g., Epic Kitchens, EGTEA+, etc.) have helped push the state of the art of action and activity recognition, considering that the combination of temporal and appearance information was demonstrated to be crucial for these tasks. In the near future, efforts should be made to improve methods for the recognition of larger classes of unscripted ADLs, which would benefit the development of applications such as AAL.

As this field of research is still growing, we will see novel applications and improvements of existing ones. The impact of hand-based methods in egocentric vision is clear from the development of applications for HCI, HRI, and healthcare. The importance of the hands as our primary means of interaction with the world around us is already exploited by VR and AR systems, and the position of the wearable camera offers tremendous advantages for assessing upper limb function remotely and supporting older adults in ADLs. This will translate into rich information captured in natural environments, with the possibility of improving assessment and diagnosis, providing new interaction modalities, and enabling personalized feedback on tasks and behaviours.


This work was supported in part by the Craig H. Neilsen Foundation (542675).


  • [1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk (2012) SLIC superpixels compared to state-of-the-art superpixel methods. IEEE transactions on pattern analysis and machine intelligence 34 (11), pp. 2274–2282. Cited by: §3.1.1, §3.1.1, §3.1.3, §3.1.3.
  • [2] E. E. Aksoy, M. J. Aein, M. Tamosiunaite, and F. Wörgötter (2015) Semantic parsing of human manipulation activities using on-line learned models for robot imitation. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2875–2882. Cited by: §5.2.
  • [3] A. A. Argyros and M. I. Lourakis (2004) Real-time tracking of multiple skin-colored objects with a possibly moving camera. In European Conference on Computer Vision, pp. 368–379. Cited by: §3.2.3, §3.3.
  • [4] S. Bambach, D. J. Crandall, and C. Yu (2015) Viewpoint integration for hand-based recognition of social interactions from a first-person view. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pp. 351–354. Cited by: §3.1.5, §3.2.2, §4.2.3, §4.2.3.
  • [5] S. Bambach, S. Lee, D. J. Crandall, and C. Yu (2015) Lending a hand: detecting hands and recognizing activities in complex egocentric interactions. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1949–1957. Cited by: §3.1.5, §3.2.2, §3.3, §3, §3, §4.2.3, §4.2.3, §4.2.4, TABLE I, §6.
  • [6] L. Baraldi, F. Paci, G. Serra, L. Benini, and R. Cucchiara (2014) Gesture recognition in ego-centric videos using dense trajectories and hand segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 688–693. Cited by: §4.1.4, §4.1.4, §4.1, §5.2, TABLE I, §6.
  • [7] L. Baraldi, F. Paci, G. Serra, L. Benini, and R. Cucchiara (2015) Gesture recognition using wearable vision sensors to enhance visitors’ museum experiences. IEEE Sensors Journal 15 (5), pp. 2705–2714. Cited by: §4.1.4, §4.1.4, §4.1, §5.2, TABLE I, §6.
  • [8] G. Baulig, T. Gulde, and C. Curio (2018) Adapting egocentric visual hand pose estimation towards a robot-controlled exoskeleton. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 0–0. Cited by: §3.4.2, §3.4.4, §3.4.
  • [9] M. Baydoun, A. Betancourt, P. Morerio, L. Marcenaro, M. Rauterberg, and C. Regazzoni (2017) Hand pose recognition in first person vision through graph spectral analysis. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1872–1876. Cited by: §4.1.1, §4.1.3, §4.1.
  • [10] A. Betancourt, M. M. López, C. S. Regazzoni, and M. Rauterberg (2014) A sequential classifier for hand detection in the framework of egocentric vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 586–591. Cited by: §3.1.1, §3.2.1, §3.2.1.
  • [11] A. Betancourt, L. Marcenaro, E. Barakova, M. Rauterberg, and C. Regazzoni (2016) GPU accelerated left/right hand-segmentation in first person vision. In European Conference on Computer Vision, pp. 504–517. Cited by: §3.1.2, §3.3.
  • [12] A. Betancourt, P. Morerio, E. I. Barakova, L. Marcenaro, M. Rauterberg, and C. S. Regazzoni (2015) A dynamic approach and a new dataset for hand-detection in first person vision. In International conference on Computer Analysis of Images and Patterns, pp. 274–287. Cited by: TABLE I.
  • [13] A. Betancourt, P. Morerio, E. Barakova, L. Marcenaro, M. Rauterberg, and C. Regazzoni (2017) Left/right hand segmentation in egocentric videos. Computer Vision and Image Understanding 154, pp. 73–81. Cited by: §3.3.
  • [14] A. Betancourt, P. Morerio, L. Marcenaro, E. Barakova, M. Rauterberg, and C. Regazzoni (2015) Towards a unified framework for hand-based methods in first person vision. In 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 1–6. Cited by: §1, §1, §3.2.1.
  • [15] A. Betancourt, P. Morerio, L. Marcenaro, M. Rauterberg, and C. Regazzoni (2015) Filtering svm frame-by-frame binary classification in a detection framework. In 2015 IEEE International Conference on Image Processing (ICIP), pp. 2552–2556. Cited by: §3.2.1.
  • [16] A. Betancourt, P. Morerio, C. S. Regazzoni, and M. Rauterberg (2015) The evolution of first person vision methods: a survey. IEEE Transactions on Circuits and Systems for Video Technology 25 (5), pp. 744–760. Cited by: §1.
  • [17] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft (2016) Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), pp. 3464–3468. Cited by: §3.2.3, §3.3.
  • [18] M. Bolanos, M. Dimiccoli, and P. Radeva (2016) Toward storytelling from visual lifelogging: an overview. IEEE Transactions on Human-Machine Systems 47 (1), pp. 77–90. Cited by: §1, §2.3.
  • [19] N. Brancati, G. Caggianese, M. Frucci, L. Gallo, and P. Neroni (2015) Robust fingertip detection in egocentric vision under varying illumination conditions. In 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 1–6. Cited by: §3.4.3, §3.4.4, §5.2.
  • [20] L. Breiman (2001) Random forests. Machine learning 45 (1), pp. 5–32. Cited by: §3.1.1.
  • [21] T. Brox and J. Malik (2010) Large displacement optical flow: descriptor matching in variational motion estimation. IEEE transactions on pattern analysis and machine intelligence 33 (3), pp. 500–513. Cited by: §4.2.1.
  • [22] I. M. Bullock, J. Z. Zheng, S. De La Rosa, C. Guertler, and A. M. Dollar (2013) Grasp frequency and usage in daily household and machine shop tasks. IEEE transactions on haptics 6 (3), pp. 296–308. Cited by: §4.1.
  • [23] M. Cai, K. M. Kitani, and Y. Sato (2015) A scalable approach for understanding the visual structures of hand grasps. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 1360–1366. Cited by: §4.1.1, §4.1.2, §4.1.
  • [24] M. Cai, K. M. Kitani, and Y. Sato (2016) Understanding hand-object manipulation with grasp types and object attributes.. In Robotics: Science and Systems, Vol. 3. Cited by: §4.1.1, §4.1.5, §4.1, §4.2.2, §4.2.2.
  • [25] M. Cai, K. M. Kitani, and Y. Sato (2017) An ego-vision system for hand grasp analysis. IEEE Transactions on Human-Machine Systems 47 (4), pp. 524–535. Cited by: §3.2.3, §3.2.3, §3.3, §4.1.1, §4.1.2, §4.1, TABLE I.
  • [26] M. Cai, F. Lu, and Y. Gao (2018) Desktop action recognition from first-person point-of-view. IEEE transactions on cybernetics 49 (5), pp. 1616–1628. Cited by: §4.2.1, §4.2.1, TABLE I, §6.
  • [27] M. Calonder, V. Lepetit, C. Strecha, and P. Fua (2010) Brief: binary robust independent elementary features. In European conference on computer vision, pp. 778–792. Cited by: §3.1.1.
  • [28] Z. Cao, T. Simon, S. Wei, and Y. Sheikh (2017) Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299. Cited by: §3.4.2, §3.4.4.
  • [29] A. Cartas, M. Dimiccoli, and P. Radeva (2017) Detecting hands in egocentric videos: towards action recognition. In International Conference on Computer Aided Systems Theory, pp. 330–338. Cited by: §3.1.1, §3.2.2, §3.
  • [30] A. A. Chaaraoui, P. Climent-Pérez, and F. Flórez-Revuelta (2012) A review on vision techniques applied to human behaviour analysis for ambient-assisted living. Expert Systems with Applications 39 (12), pp. 10873–10888. Cited by: §2.1.
  • [31] H. J. Chang, G. Garcia-Hernando, D. Tang, and T. Kim (2016) Spatio-temporal hough forest for efficient detection–localisation–recognition of fingerwriting in egocentric camera. Computer Vision and Image Understanding 148, pp. 87–96. Cited by: §3.4.3, §5.2, TABLE I, §6, §6.
  • [32] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §3.1.3.
  • [33] H. Cheng, L. Yang, and Z. Liu (2015) Survey on 3d hand gesture recognition. IEEE transactions on circuits and systems for video technology 26 (9), pp. 1659–1673. Cited by: §1.
  • [34] H. Coskun, Z. Zia, B. Tekin, F. Bogo, N. Navab, F. Tombari, and H. Sawhney (2019) Domain-specific priors and meta learning for low-shot first-person action recognition. arXiv preprint arXiv:1907.09382. Cited by: §4.1.5, §4.2.2, §4.2.2.
  • [35] S. Cruz and A. Chan (2019) Is that my hand? an egocentric dataset for hand disambiguation. Image and Vision Computing 89, pp. 131–143. Cited by: §3.2.2, §3.2.4, §3.3, TABLE I.
  • [36] S. R. Cruz and A. B. Chan (2018) Hand detection using deformable part models on an egocentric perspective. In 2018 Digital Image Computing: Techniques and Applications (DICTA), pp. 1–7. Cited by: §3.2.2.
  • [37] M. R. Cutkosky et al. (1989) On grasp choice, grasp models, and the design of hands for manufacturing tasks.. IEEE Transactions on robotics and automation 5 (3), pp. 269–279. Cited by: §4.1.2, §4.1.
  • [38] N. Dalal and B. Triggs (2005) Histograms of oriented gradients for human detection. In 2005 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 886–893. Cited by: §3.1.1.
  • [39] D. Damen, H. Doughty, G. Maria Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al. (2018) Scaling egocentric vision: the epic-kitchens dataset. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 720–736. Cited by: TABLE I.
  • [40] A. G. del Molino, C. Tan, J. Lim, and A. Tan (2016) Summarization of egocentric videos: a comprehensive survey. IEEE Transactions on Human-Machine Systems 47 (1), pp. 65–76. Cited by: §1, §2.3.
  • [41] P. Dollár and C. L. Zitnick (2013) Structured forests for fast edge detection. In Proceedings of the IEEE international conference on computer vision, pp. 1841–1848. Cited by: §3.1.1.
  • [42] A. Fathi, A. Farhadi, and J. M. Rehg (2011) Understanding egocentric activities. In 2011 International Conference on Computer Vision, pp. 407–414. Cited by: §1, 4th item, §4.2.2, §4.2.2, §4.2.3, §4.2.3, §4.2.4, §4.2.
  • [43] A. Fathi, X. Ren, and J. M. Rehg (2011) Learning to recognize objects in egocentric activities. In CVPR 2011, pp. 3281–3288. Cited by: §3.1.3, TABLE I.
  • [44] T. Feix, R. Pawlik, H. Schmiedmayer, J. Romero, and D. Kragic (2009) A comprehensive grasp taxonomy. In Robotics, science and systems: workshop on understanding the human hand for advancing robotic manipulation, Vol. 2, pp. 2–3. Cited by: §4.1.1, §4.1.
  • [45] T. Feix, J. Romero, H. Schmiedmayer, A. M. Dollar, and D. Kragic (2015) The grasp taxonomy of human grasp types. IEEE Transactions on Human-Machine Systems 46 (1), pp. 66–77. Cited by: §4.1.1, §4.1.1, §4.1, §4.1.
  • [46] P. F. Felzenszwalb and D. P. Huttenlocher (2004) Efficient graph-based image segmentation. International journal of computer vision 59 (2), pp. 167–181. Cited by: §3.1.1.
  • [47] M. A. Fischler and R. C. Bolles (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395. Cited by: §3.1.3, §3.1.4.
  • [48] G. Garcia-Hernando, S. Yuan, S. Baek, and T. Kim (2018) First-person hand action benchmark with rgb-d videos and 3d hand pose annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 409–419. Cited by: §3.4.1, §3.4, §4.2.2, §4.2.2, TABLE I, §6, §6.
  • [49] A. Gupta, S. Mohatta, J. Maurya, R. Perla, R. Hebbalaguppe, and E. Hassan (2017) Hand gesture based region marking for tele-support using wearables. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 69–75. Cited by: §5.2.
  • [50] T. Ha, S. Feiner, and W. Woo (2014) WeARHand: head-worn, rgb-d camera-based, bare-hand user interface with visually enhanced depth perception. In 2014 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 219–228. Cited by: §5.2.
  • [51] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.2.2.
  • [52] S. Hegde, R. Perla, R. Hebbalaguppe, and E. Hassan (2016) Gestar: real time gesture interaction for ar with egocentric view. In 2016 IEEE International Symposium on Mixed and Augmented Reality (ISMAR-Adjunct), pp. 262–267. Cited by: §3.2.4, §4.1.4, §4.1.4, §4.1.5, §4.1, §5.2.
  • [53] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista (2012) Exploiting the circulant structure of tracking-by-detection with kernels. In European conference on computer vision, pp. 702–715. Cited by: §3.2.3.
  • [54] B. K. Horn and B. G. Schunck (1981) Determining optical flow. Artificial intelligence 17 (1-3), pp. 185–203. Cited by: §3.1.3.
  • [55] D. Huang, M. Ma, W. Ma, and K. M. Kitani (2015) How do we use our hands? discovering a diverse set of common grasps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 666–675. Cited by: §4.1.2, §4.1.
  • [56] S. Huang, W. Wang, S. He, and R. W. Lau (2018) Egocentric hand detection via dynamic region growing. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14 (1), pp. 10. Cited by: §3.1.3.
  • [57] S. Huang, W. Wang, and K. Lu (2016) Egocentric hand detection via region growth. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 639–644. Cited by: §3.1.3.
  • [58] Y. Huang, X. Liu, L. Jin, and X. Zhang (2015) Deepfinger: a cascade convolutional neuron network approach to finger key point detection in egocentric vision with mobile camera. In 2015 IEEE International Conference on Systems, Man, and Cybernetics, pp. 2944–2949. Cited by: §3.4.3.
  • [59] Y. Huang, X. Liu, X. Zhang, and L. Jin (2016) A pointing gesture based egocentric interaction system: dataset, approach and application. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 16–23. Cited by: §5.2, TABLE I.
  • [60] T. Ishihara, K. M. Kitani, W. Ma, H. Takagi, and C. Asakawa (2015) Recognizing hand-object interactions in wearable camera videos. In 2015 IEEE International Conference on Image Processing (ICIP), pp. 1349–1353. Cited by: §4.2.1, §4.2.1.
  • [61] Y. Jang, I. Jeon, T. Kim, and W. Woo (2016) Metaphoric hand gestures for orientation-aware vr object manipulation with an egocentric viewpoint. IEEE Transactions on Human-Machine Systems 47 (1), pp. 113–127. Cited by: §4.1.3, §4.1.4, §4.1.5, §4.1, §5.2.
  • [62] Y. Jang, S. Noh, H. J. Chang, T. Kim, and W. Woo (2015) 3d finger cape: clicking action and position estimation under self-occlusions in egocentric viewpoint. IEEE Transactions on Visualization and Computer Graphics 21 (4), pp. 501–510. Cited by: §5.2.
  • [63] P. Ji, A. Song, P. Xiong, P. Yi, X. Xu, and H. Li (2017) Egocentric-vision based hand posture control system for reconnaissance robots. Journal of Intelligent & Robotic Systems 87 (3-4), pp. 583–599. Cited by: §4.1.3, §4.1, §5.2.
  • [64] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell (2014) Caffe: convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 675–678. Cited by: §3.2.2.
  • [65] M. J. Jones and J. M. Rehg (2002) Statistical color models with application to skin detection. International Journal of Computer Vision 46 (1), pp. 81–96. Cited by: §3.1.1, §3.1, §3.2.1.
  • [66] N. Kamakura, M. Matsuo, H. Ishii, F. Mitsuboshi, and Y. Miura (1980) Patterns of static prehension in normal hands. American Journal of Occupational Therapy 34 (7), pp. 437–445. Cited by: §4.1.
  • [67] G. Kapidis, R. Poppe, E. van Dam, L. P. Noldus, and R. C. Veltkamp (2019) Egocentric hand track and object-based human action recognition. arXiv preprint arXiv:1905.00742. Cited by: §3.2.2, §3.2.3, §3.2.3, §3.3, §4.2.2.
  • [68] S. Karaman, J. Benois-Pineau, V. Dovgalecs, R. Mégret, J. Pinquier, R. André-Obrecht, Y. Gaëstel, and J. Dartigues (2014) Hierarchical hidden markov model in detecting activities of daily living in wearable videos for studies of dementia. Multimedia tools and applications 69 (3), pp. 743–771. Cited by: §5.1, §5.1.
  • [69] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §4.1.1, §4.2.2.
  • [70] A. Kulesza, B. Taskar, et al. (2012) Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning 5 (2–3), pp. 123–286. Cited by: §4.1.2.
  • [71] J. Kumar, Q. Li, S. Kyal, E. A. Bernal, and R. Bala (2015) On-the-fly hand detection training with application in egocentric action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 18–27. Cited by: §3.1.3, §4.2.1, §4.2.1.
  • [72] J. Lee and M. S. Ryoo (2017) Learning robot activities from first-person human videos using convolutional future regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–2. Cited by: §5.2.
  • [73] S. Lee, S. Bambach, D. J. Crandall, J. M. Franchak, and C. Yu (2014) This hand is my hand: a probabilistic approach to hand disambiguation in egocentric video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 543–550. Cited by: §3.2.3, §3.2.3, §3.3.
  • [74] C. Li and K. M. Kitani (2013) Model recommendation with virtual probes for egocentric hand detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2624–2631. Cited by: §3.1.2.
  • [75] C. Li and K. M. Kitani (2013) Pixel-level hand detection in ego-centric videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3570–3577. Cited by: §1, §3.1.1, §3.1.2, §3.2.2, §3.2.3, §3.3, §4.1.1, TABLE I, §6.
  • [76] M. Li, L. Sun, and Q. Huo (2019) Flow-guided feature propagation with occlusion aware detail enhancement for hand segmentation in egocentric videos. Computer Vision and Image Understanding 187, pp. 102785. Cited by: §3.1.1.
  • [77] R. Li, Z. Liu, and J. Tan (2019) A survey on 3d hand pose estimation: cameras, methods, and datasets. Pattern Recognition 93, pp. 251–272. Cited by: §1, §3.
  • [78] Y. Li, M. Liu, and J. M. Rehg (2018) In the eye of beholder: joint learning of gaze and actions in first person video. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 619–635. Cited by: TABLE I, §6.
  • [79] Y. Li, Z. Ye, and J. M. Rehg (2015) Delving into egocentric actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 287–295. Cited by: §4.2.2, §4.2.2.
  • [80] Y. Li, L. Jia, Z. Wang, Y. Qian, and H. Qiao (2019) Un-supervised and semi-supervised hand segmentation in egocentric images with noisy label learning. Neurocomputing 334, pp. 11–24. Cited by: §3.1.1, §3.1.3.
  • [81] Y. Li, Y. Zhang, H. Qiao, K. Chen, and X. Xi (2016) Grasp type understanding—classification, localization and clustering. In 2016 12th World Congress on Intelligent Control and Automation (WCICA), pp. 1240–1245. Cited by: §4.1.2, §4.1.
  • [82] Z. Li and J. Chen (2015) Superpixel segmentation using linear spectral clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1356–1363. Cited by: §3.1.1, §3.1.3.
  • [83] H. Liang, J. Yuan, and D. Thalman (2015) Egocentric hand pose estimation and distance recovery in a single rgb image. In 2015 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. Cited by: §3.1, §3.4.2, §3.4.
  • [84] C. Light, P. Chappell, P. Kyberd, and B. Ellis (1999) A critical review of functionality assessment in natural and prosthetic hands. British Journal of Occupational Therapy 62 (1), pp. 7–12. Cited by: §4.1.
  • [85] J. Likitlersuang and J. Zariffa (2015) Arm angle detection in egocentric video of upper extremity tasks. In World Congress on Medical Physics and Biomedical Engineering, June 7-12, 2015, Toronto, Canada, pp. 1124–1127. Cited by: §3.3, §5.1.
  • [86] J. Likitlersuang, E. R. Sumitro, T. Cao, R. J. Visée, S. Kalsi-Ryan, and J. Zariffa (2019) Egocentric video: a new tool for capturing hand use of individuals with spinal cord injury at home. Journal of neuroengineering and rehabilitation 16 (1), pp. 83. Cited by: §3.1.1, §3.2.2, §3.3, §4.2.2, §4.2.2, 2nd item, §5.1, TABLE I, §6.
  • [87] J. Likitlersuang, E. R. Sumitro, P. Theventhiran, S. Kalsi-Ryan, and J. Zariffa (2017) Views of individuals with spinal cord injury on the use of wearable cameras to monitor upper limb function in the home and community. The journal of spinal cord medicine 40 (6), pp. 706–714. Cited by: 2nd item, §5.1.
  • [88] J. Likitlersuang and J. Zariffa (2016) Interaction detection in egocentric video: toward a novel outcome measure for upper extremity function. IEEE journal of biomedical and health informatics 22 (2), pp. 561–569. Cited by: §3.1, §5.1, TABLE I, §6.
  • [89] G. Lin, A. Milan, C. Shen, and I. Reid (2017) Refinenet: multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1925–1934. Cited by: §3.1.1.
  • [90] Y. Lin, G. Hua, and P. Mordohai (2014) Egocentric object recognition leveraging the 3d shape of the grasping hand. In European Conference on Computer Vision, pp. 746–762. Cited by: §4.1.
  • [91] J. Liu, F. Feng, Y. C. Nakamura, and N. S. Pollard (2014) A taxonomy of everyday grasps in action. In 2014 IEEE-RAS International Conference on Humanoid Robots, pp. 573–580. Cited by: §4.1.1, §4.1.
  • [92] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §3.2.2.
  • [93] X. Liu, Y. Huang, X. Zhang, and L. Jin (2016) Fingertip in the eye: an attention-based method for real-time hand tracking and fingertip detection in egocentric videos. In Chinese Conference on Pattern Recognition, pp. 145–154. Cited by: §3.2.3, §3.2.3, §3.4.3.
  • [94] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §3.1.1.
  • [95] D. G. Lowe et al. (1999) Object recognition from local scale-invariant features. In ICCV, Vol. 99, pp. 1150–1157. Cited by: §3.1.1.
  • [96] M. Ma, H. Fan, and K. M. Kitani (2016) Going deeper into first-person activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1894–1903. Cited by: §4.2.2, §4.2.3.
  • [97] T. Malisiewicz, A. Gupta, and A. Efros (2011) Ensemble of exemplar-svms for object detection and beyond. In 2011 International Conference on Computer Vision, pp. 89–96. Cited by: §4.1.3.
  • [98] S. Mann (1998) ’WearCam’(the wearable camera): personal imaging systems for long-term use in wearable tetherless computer-mediated reality and personal photo/videographic memory prosthesis. In Digest of Papers. Second International Symposium on Wearable Computers (Cat. No. 98EX215), pp. 124–131. Cited by: §1.
  • [99] R. Margolin, A. Tal, and L. Zelnik-Manor (2013) What makes a patch distinct?. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1139–1146. Cited by: §3.2.1.
  • [100] S. Mohatta, R. Perla, G. Gupta, E. Hassan, and R. Hebbalaguppe (2017) Robust hand gestural interaction for smartphone based ar/vr applications. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 330–335. Cited by: §3.2.4, §4.1.4, §4.1.4, §4.1.5, §4.1, §5.2, TABLE I, §6.
  • [101] F. Mueller, D. Mehta, O. Sotnychenko, S. Sridhar, D. Casas, and C. Theobalt (2017) Real-time hand tracking under occlusion from an egocentric rgb-d sensor. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1284–1293. Cited by: §3.2.2, §3.4.1, §3.4.4, §3.4.
  • [102] J. Nebel, F. Florez-Revuelta, et al. (2018) Recognition of activities of daily living from egocentric videos using hands detected by a deep convolutional network. In International Conference Image Analysis and Recognition, pp. 390–398. Cited by: §3.2.2, §4.2.3, §4.2.3, §4.2.4.
  • [103] T. H. C. Nguyen, J. Nebel, G. Hunter, and F. Florez-Revuelta (2018) Automated detection of hands and objects in egocentric videos, for ambient assisted living applications. In 2018 14th International Conference on Intelligent Environments (IE), pp. 91–94. Cited by: §3.2.2, §3.3, §5.1, §5.1.
  • [104] T. Nguyen, J. Nebel, F. Florez-Revuelta, et al. (2016) Recognition of activities of daily living with egocentric vision: a review. Sensors 16 (1), pp. 72. Cited by: §1, §1, §2.1, §2.3, §4.2.3, §4.2, §5.1, §5.1.
  • [105] H. Noh, S. Hong, and B. Han (2015) Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pp. 1520–1528. Cited by: §3.1.3.
  • [106] J. S. Pérez, E. Meinhardt-Llopis, and G. Facciolo (2013) TV-l1 optical flow estimation. Image Processing On Line 2013, pp. 137–150. Cited by: §3.1.3.
  • [107] H. Pirsiavash and D. Ramanan (2012) Detecting activities of daily living in first-person camera views. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2847–2854. Cited by: TABLE I, §6.
  • [108] P. Rashidi and A. Mihailidis (2012) A survey on ambient-assisted living tools for older adults. IEEE journal of biomedical and health informatics 17 (3), pp. 579–590. Cited by: §5.1.
  • [109] S. S. Rautaray and A. Agrawal (2015) Vision based hand gesture recognition for human computer interaction: a survey. Artificial intelligence review 43 (1), pp. 1–54. Cited by: §1, §1, §2.3, §2.
  • [110] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §3.2.2.
  • [111] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §3.2.3.
  • [112] Y. Ren, X. Xie, G. Li, and Z. Wang (2016) Hand gesture recognition with multiscale weighted histogram of contour direction normalization for wearable applications. IEEE Transactions on Circuits and Systems for Video Technology 28 (2), pp. 364–377. Cited by: §3.1.4, §4.1.3, §4.1.3, §4.1.
  • [113] I. Rodríguez-Moreno, J. M. Martínez-Otzeta, B. Sierra, I. Rodriguez, and E. Jauregi (2019) Video activity recognition: state-of-the-art. Sensors 19 (14), pp. 3160. Cited by: §4.2.4.
  • [114] G. Rogez, M. Khademi, J. Supančič III, J. M. M. Montiel, and D. Ramanan (2014) 3d hand pose detection in egocentric rgb-d images. In European Conference on Computer Vision, pp. 356–371. Cited by: §3.4.1, §3.4.4, §3.4, §3.
  • [115] G. Rogez, J. S. Supancic, and D. Ramanan (2015) First-person pose recognition using egocentric workspaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4325–4333. Cited by: §3.4.1, §3.4.4, §3.4.
  • [116] G. Rogez, J. S. Supancic, and D. Ramanan (2015) Understanding everyday hands in action from rgb-d images. In Proceedings of the IEEE international conference on computer vision, pp. 3889–3897. Cited by: §3.1.4, §4.1.1, §4.1, TABLE I.
  • [117] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §3.1.1.
  • [118] C. Rother, V. Kolmogorov, and A. Blake (2004) Grabcut: interactive foreground extraction using iterated graph cuts. In ACM transactions on graphics (TOG), Vol. 23, pp. 309–314. Cited by: §3.1.1, §3.1.3, §3.2.2, §6.
  • [119] E. Rublee, V. Rabaud, K. Konolige, and G. R. Bradski (2011) ORB: an efficient alternative to sift or surf.. In ICCV, Vol. 11, pp. 2. Cited by: §3.1.1, §3.1.3.
  • [120] S. Rusinkiewicz and M. Levoy (2001) Efficient variants of the icp algorithm.. In 3dim, Vol. 1, pp. 145–152. Cited by: §3.4.1.
  • [121] R. B. Rusu, N. Blodow, and M. Beetz (2009) Fast point feature histograms (fpfh) for 3d registration. In 2009 IEEE International Conference on Robotics and Automation, pp. 3212–3217. Cited by: §3.4.1.
  • [122] G. Serra, M. Camurri, L. Baraldi, M. Benedetti, and R. Cucchiara (2013) Hand segmentation for gesture recognition in ego-vision. In Proceedings of the 3rd ACM international workshop on Interactive multimedia on mobile & portable devices, pp. 31–36. Cited by: §3.1.1, §3.1.2, §4.1.3, §4.1.3, §4.1, §5.2.
  • [123] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.1.3, §4.1.1, §4.2.2.
  • [124] S. Singh, C. Arora, and C. Jawahar (2016) First person action recognition using deep learned descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2620–2628. Cited by: §3.1.1, §3.1.2, §4.2.1, §4.2.1, §4.2.
  • [125] G. J. Snoek, M. J. IJzerman, H. J. Hermens, D. Maxwell, and F. Biering-Sorensen (2004) Survey of the needs of patients with spinal cord injury: impact and priority for improvement in hand function in tetraplegics. Spinal cord 42 (9), pp. 526. Cited by: §1.
  • [126] H. Song, W. Feng, N. Guan, X. Huang, and Z. Luo (2016) Towards robust ego-centric hand gesture analysis for robot control. In 2016 IEEE International Conference on Signal and Image Processing (ICSIP), pp. 661–666. Cited by: §3.4.3, §4.1.3, §4.1, §5.2.
  • [127] Y. Tang, Z. Wang, J. Lu, J. Feng, and J. Zhou (2018) Multi-stream deep neural networks for rgb-d egocentric action recognition. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §3.1.1, §4.2.1, §4.2.1, §4.2, TABLE I, §6, §6.
  • [128] B. Tekin, F. Bogo, and M. Pollefeys (2019) H+ o: unified egocentric recognition of 3d hand-object poses and interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4511–4520. Cited by: 3rd item, §3.4.2, §3.4, §4.2.2, §4.2.2, §4.2.4, §4.2.
  • [129] D. Thalmann, H. Liang, and J. Yuan (2015) First-person palm pose tracking and gesture recognition in augmented reality. In International Joint Conference on Computer Vision, Imaging and Computer Graphics, pp. 3–15. Cited by: §4.1.3, §4.1.5, §4.1, §5.2.
  • [130] S. Urabe, K. Inoue, and M. Yoshioka (2018) Cooking activities recognition in egocentric videos using combining 2dcnn and 3dcnn. In Proceedings of the Joint Workshop on Multimedia for Cooking and Eating Activities and Multimedia Assisted Dietary Management, pp. 1–8. Cited by: §3.1.1, §3.4.2, §3.4.4, §3.4, §4.2.1, §4.2.1.
  • [131] A. Urooj and A. Borji (2018) Analysis of hand segmentation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4710–4719. Cited by: §3.1.1, §4.2.3, §4.2.3, TABLE I, §6.
  • [132] R. J. Visée, J. Likitlersuang, and J. Zariffa (2019) An effective and efficient method for detecting hands in egocentric videos for rehabilitation applications. arXiv preprint arXiv:1908.10406. Cited by: §3.2.2, §3.2.3, §3.2.3, §3.2.4, §3.2.4, §3.3, §5.1.
  • [133] S. Wan and J. Aggarwal (2015) Mining discriminative states of hands and objects to recognize egocentric actions with a wearable RGBD camera. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 36–43. Cited by: §3.1.4, §4.2.2, §4.2.2, §4.2.3, TABLE I, §6, §6.
  • [134] H. Wang, A. Kläser, C. Schmid, and L. Cheng-Lin (2011) Action recognition by dense trajectories. Cited by: §4.1.1, §4.2.1.
  • [135] J. Wang and C. Yu (2014) Finger-fist detection in first-person view based on monocular vision using Haar-like features. In Proceedings of the 33rd Chinese Control Conference, pp. 4920–4923. Cited by: §3.2.2.
  • [135] J. Wang and C. Yu (2014) Finger-fist detection in first-person view based on monocular vision using haar-like features. In Proceedings of the 33rd Chinese Control Conference, pp. 4920–4923. Cited by: §3.2.2.
  • [136] W. Wang, K. Yu, J. Hugonot, P. Fua, and M. Salzmann (2018) Beyond one glance: gated recurrent architecture for hand segmentation. arXiv preprint arXiv:1811.10914. Cited by: §3.1.1.
  • [137] W. Wang, K. Yu, J. Hugonot, P. Fua, and M. Salzmann (2019) Recurrent U-Net for resource-constrained segmentation. arXiv preprint arXiv:1906.04913. Cited by: §3.1.1, TABLE I, §6.
  • [138] S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh (2016) Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4724–4732. Cited by: §3.4.2, §3.4.4.
  • [139] T. P. Weldon, W. E. Higgins, and D. F. Dunn (1996) Efficient Gabor filter design for texture segmentation. Pattern Recognition 29 (12), pp. 2005–2015. Cited by: §3.1.1.
  • [140] W. Wu, C. Li, Z. Cheng, X. Zhang, and L. Jin (2017) YOLSE: egocentric fingertip detection from single RGB images. In Proceedings of the IEEE International Conference on Computer Vision, pp. 623–630. Cited by: §3.4.3, TABLE I, §6.
  • [141] C. Xu, L. N. Govindarajan, and L. Cheng (2017) Hand action detection from ego-centric depth sequences with error-correcting Hough transform. Pattern Recognition 72, pp. 494–503. Cited by: TABLE I, §6, §6.
  • [142] W. Yamazaki, M. Ding, J. Takamatsu, and T. Ogasawara (2017) Hand pose estimation and motion recognition using egocentric RGB-D video. In 2017 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 147–152. Cited by: §3.1.4, §3.4.1, §3.4.4, §3.4.
  • [143] J. Yu, S. Noh, Y. Jang, G. Park, and W. Woo (2015) A hand-based collaboration framework in egocentric coexistence reality. In 2015 12th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI), pp. 545–548. Cited by: §5.2.
  • [144] S. Yuan, Q. Ye, B. Stenger, S. Jain, and T. Kim (2017) BigHand2.2M benchmark: hand pose dataset and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4866–4874. Cited by: §3.4.1, §3.4.4, TABLE I, §6.
  • [145] X. Zabulis, H. Baltzakis, and A. A. Argyros (2009) Vision-based hand gesture recognition for human-computer interaction. The Universal Access Handbook 34, pp. 30. Cited by: §1, §2.
  • [146] J. Zariffa and M. R. Popovic (2013) Hand contour detection in wearable camera video using an adaptive histogram region of interest. Journal of NeuroEngineering and Rehabilitation 10 (1), pp. 114. Cited by: §3.1.1, §3.2.1, §3.2.1, §5.1.
  • [147] Y. Zhang, C. Cao, J. Cheng, and H. Lu (2018) EgoGesture: a new dataset and benchmark for egocentric hand gesture recognition. IEEE Transactions on Multimedia 20 (5), pp. 1038–1050. Cited by: §4.1.3, §4.1.4, §4.1, TABLE I, §6, §6.
  • [148] Y. Zhao, Z. Luo, and C. Quan (2017) Unsupervised online learning for fine-grained hand segmentation in egocentric video. In 2017 14th Conference on Computer and Robot Vision (CRV), pp. 248–255. Cited by: §3.1.3, §3.2.1, §3.2.1.
  • [149] Y. Zhao, Z. Luo, and C. Quan (2018) Coarse-to-fine online learning for hand segmentation in egocentric video. EURASIP Journal on Image and Video Processing 2018 (1), pp. 20. Cited by: §3.1.3, §3.2.1, §3.2.1.
  • [150] Y. Zhou, B. Ni, R. Hong, X. Yang, and Q. Tian (2016) Cascaded interactional targeting network for egocentric video analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1904–1913. Cited by: §3.1.1, §3.1.3, §4.2.2.
  • [151] X. Zhu, X. Jia, and K. K. Wong (2014) Pixel-level hand detection with shape-aware structured forests. In Asian Conference on Computer Vision, pp. 64–78. Cited by: §3.1.1.
  • [152] X. Zhu, X. Jia, and K. K. Wong (2015) Structured forests for pixel-level hand detection and hand part labelling. Computer Vision and Image Understanding 141, pp. 95–107. Cited by: §3.1.1, §3.4.2, §3.4.
  • [153] X. Zhu, W. Liu, X. Jia, and K. K. Wong (2016) A two-stage detector for hand detection in ego-centric videos. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–8. Cited by: §3.2.2, §3.
  • [154] X. Zhu, Y. Xiong, J. Dai, L. Yuan, and Y. Wei (2017) Deep feature flow for video recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2349–2358. Cited by: §3.1.1.