Analysis of Hand Segmentation in the Wild
A large number of works in egocentric vision have concentrated on action and object recognition. Detection and segmentation of hands in first-person videos, however, have been less explored. For many applications in this domain, it is necessary to accurately segment not only the hands of the camera wearer but also the hands of others with whom the wearer is interacting. Here, we take an in-depth look at the hand segmentation problem. First, we evaluate the performance of state-of-the-art hand segmentation methods, both off the shelf and fine-tuned, on existing datasets. Second, we fine-tune RefineNet, a leading semantic segmentation method, for hand segmentation and find that it does much better than the best contenders. Third, we contribute two new datasets: a) the EgoYouTubeHands dataset, which includes egocentric videos containing hands in the wild, and b) the HandOverFace dataset, which we use to analyze the performance of our models in the presence of similar-appearance occlusions. Fourth, we investigate whether conditional random fields can help refine the hand segmentations produced by our model. Fifth, we train a CNN for hand-based activity recognition and achieve higher activity recognition accuracy when the CNN uses hand maps produced by the fine-tuned RefineNet model. Finally, we annotate a subset of the EgoHands dataset for fine-level activity recognition and show that, just by looking at a single hand pose, we can achieve 58.6% recognition accuracy where the chance level is 12.5%.
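As a reading aid for the final result, the short sketch below shows how a chance level of 12.5 relates to the number of activity classes. It assumes that "chance" means uniform random guessing over equally frequent classes, which the abstract does not state; under that assumption, 12.5% corresponds to 8 classes. The function names are hypothetical and are not taken from the paper.

```python
def uniform_chance_level(num_classes: int) -> float:
    """Accuracy in percent expected from uniform guessing.

    Assumption: classes are balanced and guesses are uniform.
    """
    if num_classes <= 0:
        raise ValueError("num_classes must be positive")
    return 100.0 / num_classes


def implied_num_classes(chance_percent: float) -> int:
    """Number of balanced classes implied by a chance level in percent."""
    if chance_percent <= 0:
        raise ValueError("chance_percent must be positive")
    return round(100.0 / chance_percent)


if __name__ == "__main__":
    reported_accuracy = 58.6  # figure quoted in the abstract
    chance = 12.5             # figure quoted in the abstract
    print("implied classes:", implied_num_classes(chance))
    print("accuracy / chance:", reported_accuracy / chance)
```

Under this reading, the reported accuracy is roughly 4.7 times the chance level.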