1 Related Work
Recent advances in deep learning techniques, general-purpose GPU computing and the availability of large amounts of annotated data have led to tremendous improvements in various computer vision problems such as object detection [11, 20, 25], tracking [8, 18] and object localisation. To deal with the temporal component of activity and gesture recognition using deep learning, there are two main approaches. The first uses optical flow and a 3D CNN to extract spatio-temporal features and passes them to an SVM or a cross entropy layer for classification [22, 21, 24, 19, 14]. The second uses the outputs of CNNs as inputs to an RNN, as this kind of network is designed to inherently manage the temporal dimension [2, 4].
Neurological research suggests a two-branch approach to action recognition: a ventral branch for object recognition and a dorsal branch for motion recognition. This inspired the two-stream approach for action recognition using deep learning, which was extended to egocentric action recognition by adding a third stream consisting of manually generated ego-hand segmentation information. Taking inspiration from the above work, instead of manually adding hand masks as input, we encode ego-hand masks automatically and use these encoded features in a second branch, which consists of an RNN, to deal with the challenge of recognising egocentric gestures from image sequences.
Zhang et al. recently published a database of egocentric gestures specific to interactions with wearable devices. This dataset contains 83 different gestures performed by 50 users in various settings. However, we posit that gestures are continuously evolving and that there is a need for adding new gestures. For a new gesture to be recognised in the framework proposed by [26, 2], 50 different users have to perform the gesture in 8 different scenarios. This makes adding a new gesture for recognition a cumbersome task, which is one of the challenges we address in our work.
, and provides this pose information to a Long Short Term Memory network (LSTM) for recognising gestures. However, this method has two limitations: it cannot deal with gestures performed with both hands, and all parts of the hand need to be visible in order to obtain proper pose information. Our network has neither limitation.
2 Ego-Centric Gesture Database
Gestures are usually coupled with a specific task and are rarely the same across different applications. A database with a large number of ego-gestures is important for evaluating different recognition algorithms. However, defining a new gesture is a cumbersome task, since it needs to be performed by many subjects in many different scenarios. To alleviate this problem, we use a data augmentation technique that reduces the amount of data that needs to be collected and pre-processed. Table 1 shows the number of different backgrounds needed for our database in comparison to existing ego-hand gesture databases.
We defined a set of 10 basic gestures that use the left hand, the right hand or both hands (see Figure 3). We collected training gestures from 22 users, each of whom repeated every gesture 3 times. Users performed the gestures in front of a green screen while wearing a HoloLens, without any restriction on the duration of each gesture, which allowed them to express the gestures naturally. This resulted in a large variance in duration per gesture and per user, as shown in Figure 4. In addition to the RGB images captured by the egocentric camera, we also collected the 6DOF camera/head pose information provided by the HoloLens.
|Database||# of Subjects||# of Backgrounds|
The captured images were processed with the green screen segmentation algorithm of a typical video editing software to generate hand masks automatically, eliminating the need to create a hand mask manually for each image. Figure 2 illustrates the process of database generation. The hand masks, along with their corresponding images and per-frame labels, were stored and will be made publicly available.
Unlike the training gestures, we collected testing gestures in natural settings. These gestures were captured in real office environments with various backgrounds. We collected testing gestures from 6 users, with each gesture repeated twice. After inspecting each video, we removed gestures that were performed outside the camera's field of view, which left us with 7 to 9 samples per gesture.
To reflect real world situations, our dataset is generated from users with varying skin tones, under different lighting conditions, some users wearing full sleeves, and some wearing watches or bracelets. In comparison, in previously captured datasets ([26, 10]), the gestures are more clinical in the sense that each gesture has the same movement of hands or restricted duration. We showed users a video of each gesture at the beginning of the capture and then let them express the gesture naturally.
3 Network Architecture
The idea of the new architecture is to find spatial feature maps specific to ego-hand gestures and to use these feature maps in an RNN to learn temporal discrimination for recognising ego-hand gestures, while keeping the network small. We achieve this by designing a two-stage network architecture. The first stage is the Ego-Hand Mask Encoder Network (EHME Net). EHME Net has an hourglass structure (Figure 1), with a series of convolution filters of increasing depth and decreasing height and width, until the feature maps are sufficiently small. We then append deconvolution filters with decreasing depth and increasing height and width until the feature maps reach the size of the mask. We give a detailed description of EHME Net in Section 3.1. Finally, we use the feature maps near the neck of the hourglass as input to an LSTM network that classifies the sequence of encoded features.
The number of parameters that need to be estimated approximates the complexity and size of a network. Our network is much smaller than AirGest (Table 2 lists the total number of estimated parameters for both networks). We believe this could eventually lead to a portable implementation on mobile devices.
|Network||# of Parameters|
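As a rough illustration of how such a parameter count is obtained, the sketch below tallies the learnable values of the five deconvolution layers listed in Table 5. It assumes that every layer carries a bias term and that the depth entering deconv1 is 64; neither is stated explicitly, and the function name is ours.

```python
def conv_params(in_depth, out_depth, kernel_h, kernel_w, bias=True):
    """Number of learnable values in one (transposed) convolution layer."""
    weights = in_depth * out_depth * kernel_h * kernel_w
    return weights + (out_depth if bias else 0)


# Depths of the decoder feature maps taken from Table 5 (64 -> 32 -> ... -> 2).
DECODER_DEPTHS = [64, 32, 16, 8, 4, 2]

# All deconvolution layers in Table 5 use 4x4 kernels.
decoder_total = sum(
    conv_params(d_in, d_out, 4, 4)
    for d_in, d_out in zip(DECODER_DEPTHS, DECODER_DEPTHS[1:])
)
print(decoder_total)
```

Under these assumptions the decoder accounts for only a few tens of thousands of values, so most of the parameters sit in the ResNet layers and the LSTM.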
In Subsections 3.1 and 3.2 we elaborate on the module that encodes ego-hand features and on the module that recognises gestures from a sequence of encoded ego-hand features, respectively.
3.1 EHME Net
The first stage of EHME consists of two layers of Resnet 18, followed by convolution and pooling layers. The input size of an image is fixed at 224x126 (see Section 4.1). The series of Resnet, convolution and pooling layers gradually reduces the feature map size at the neck of EHME to 7x4 (see Table 5). The depth of this layer can be varied depending on the gesture dataset. For experiments on our dataset, we fixed the depth to 64. Table 5 shows all the parameters that define the shape of the EHME used in testing on our gesture dataset. At the end of the neck, we append deconvolution layers that gradually upsample the width and height to the size of the mask, while the depth of the feature maps decreases from 64 to 2. At this stage we use a 2D CrossEntropyLoss layer for training, which assigns to each pixel a probability of belonging to an egocentric hand or not.
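The layer sizes in Table 5 can be checked with the usual output-size formulas for convolutions and transposed convolutions. The sketch below is only such a consistency check, not the network itself. It assumes that sizes in Table 5 are written as width x height x depth and that the padding (2,1) of deconv5 is given as (height, width); the function names are ours.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride, padding):
    """Output length of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * padding + kernel


# (kernel, stride, (pad_width, pad_height)) for deconv1..deconv5 from Table 5.
# Assumption: the listed padding (2,1) of deconv5 means height 2, width 1.
DECONVS = [(4, 2, (1, 1))] * 4 + [(4, 2, (1, 2))]


def decoder_shapes(width=7, height=4):
    """Trace the feature map size from the neck up to the mask resolution."""
    shapes = [(width, height)]
    for kernel, stride, (pad_w, pad_h) in DECONVS:
        width = deconv_out(width, kernel, stride, pad_w)
        height = deconv_out(height, kernel, stride, pad_h)
        shapes.append((width, height))
    return shapes


print(decoder_shapes())
```

With these settings the first four layers double both dimensions, and the larger height padding of the last layer yields 126 rather than 128 rows, matching the mask size.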
Inspired by  we add an extension to the neck in order to simultaneously generate an ego-hand mask and recognise the gesture to which the encoded features belong (see Figure 1). To achieve this, we reduce the encoded features to size 1x1x64 using an average pooling layer (see Table 5) and connect the result to a fully connected layer whose size equals the number of gestures used. For training, we also include a 1D CrossEntropyLoss layer at the end. At this point we obtain a per-frame gesture recognition (frame level).
3.2 Sequence Recognition Net
Frame level recognition can be inaccurate and noisy for several reasons. Individual images from different gesture sequences can be very similar, and in our natural scenario there are also large individual variations within the same gesture. Exploiting the temporal dimension and coherence in the data improves the results significantly, as we will show in our results (sequence level). Generally, RNNs are known to encode such time-related information well. However, traditional RNNs suffer from the problem of vanishing or exploding gradients. LSTMs are considered an improvement over traditional RNNs in this respect, as they can also forget information that is no longer relevant. They not only converge better but can also be trained on, and used with, sequences of arbitrary length. This property is crucial for the recognition of natural gestures, considering how much the time needed to express the same gesture varies between people (see Figure 4). Table 5 shows the parameters of the LSTM used for gesture sequence classification. The hidden layer from the last image in the sequence is connected to a fully connected layer that classifies the gesture sequence.
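For reference, the commonly used LSTM formulation with a forget gate is reproduced below. The notation is ours and is not taken from the network definition: $x_t$ stands for the encoded feature vector of frame $t$, $h_t$ for the hidden state, $c_t$ for the cell state, $\sigma$ for the logistic sigmoid and $\odot$ for element-wise multiplication.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The forget gate $f_t$ scales the previous cell state, which is what allows irrelevant information to be dropped over time. Because the same weights are applied at every step, a sequence of any length $T$ can be processed, and only $h_T$ is passed on to the classifier.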
We trained our network using the augmented training dataset. The dataset augmentation process is described in Section 4.1. The trained network is evaluated using our testing dataset that was collected in natural environments without green screen in the background. This ensures that our network works well on unseen data. The hyper-parameters used for training EHME and Sequence Net are presented in Table 3. In Section 4.2 we discuss the strategy used for training our network.
4.1 Data Preprocessing
We apply the mask obtained by the green screen removal process described in Section 2 to the corresponding egocentric image and composite the result onto background images. Figure 5 shows the mask applied to one of the images. This creates several images from one captured image, all sharing the same mask. As backgrounds we chose random images from a set of 40,000 images from the COCO Test dataset. For our training we chose the number of generated images per captured image such that the size of the dataset increases fivefold. In addition, we randomly apply one of the following noise types: 'none', 'poisson', 'gaussian' or 'salt&pepper', to reduce the risk of over-fitting. We then store each of these images separately, along with the gesture id and the corresponding mask. We scale all images and masks down to a resolution of 224x126 and normalise them.
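The compositing and noise steps can be sketched as follows. This is a simplified illustration rather than the pipeline used for the dataset: it works on small greyscale images stored as nested lists with values in [0, 1], omits the Poisson noise for brevity, and all function names and noise strengths are our own assumptions.

```python
import random


def composite(image, mask, background):
    """Keep hand pixels (mask == 1) and take all other pixels from the background."""
    return [
        [fg if m else bg for fg, m, bg in zip(img_row, mask_row, bg_row)]
        for img_row, mask_row, bg_row in zip(image, mask, background)
    ]


def add_noise(image, kind, rng):
    """Apply one noise model to a greyscale image with values in [0, 1]."""
    if kind == "gaussian":
        return [[min(1.0, max(0.0, p + rng.gauss(0.0, 0.05))) for p in row]
                for row in image]
    if kind == "salt&pepper":
        def flip(p):
            r = rng.random()
            return 0.0 if r < 0.02 else 1.0 if r > 0.98 else p
        return [[flip(p) for p in row] for row in image]
    return [list(row) for row in image]  # 'none': plain copy


def augment(image, mask, backgrounds, copies, rng):
    """Create `copies` training samples that all share the same hand mask."""
    samples = []
    for _ in range(copies):
        background = rng.choice(backgrounds)
        noise = rng.choice(["none", "gaussian", "salt&pepper"])
        noisy = add_noise(composite(image, mask, background), noise, rng)
        samples.append((noisy, mask))
    return samples
```

Because the mask is reused unchanged for every copy, the segmentation ground truth comes for free with each augmented image.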
4.2 Training Procedure
The training data is split 90/10 into training and validation sets throughout the training process. We train the network in 3 phases; the network parameters from each phase are transferred to the next one. This process is described in Figure 6. In the first phase, we train the EHME defined in Section 3.1 with a single loss function, appended to the deconvolution layers, for learning the ego-hand masks. We use the 2D cross entropy loss and an ADAM optimizer with the learning rate given in Table 3. The data is shuffled and we train for 5 epochs with a batch size of 50. This first phase in principle trains a hand segmentation network that can be used on its own for this purpose, as we show in our experiments with the AirGest dataset described in Section 5.2.
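A minimal sketch of the 90/10 split and the batching used in this phase is given below. It assumes that the split is drawn at random over individual samples, which the text does not specify; the function names and the fixed seed are ours.

```python
import random


def split_train_val(samples, val_fraction=0.1, seed=0):
    """Shuffle the samples and hold out a fraction of them for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def batches(samples, batch_size=50):
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


train, val = split_train_val(range(1000))
print(len(train), len(val), sum(1 for _ in batches(train)))
```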
In the second phase, we append an average pooling layer followed by a fully connected layer, whose output size equals the number of gestures, to the neck of the network. A 1D cross entropy layer is added to perform per-frame gesture recognition. The parameters obtained from Phase 1 are transferred to Phase 2. The network is then trained on a combined loss function that adds the 2D cross entropy loss from Phase 1 and the 1D cross entropy loss from this phase with equal weights. We use the ADAM optimiser with the learning rate given in Table 3 and train for 18 epochs with a batch size of 50. At this point we obtain a frame level gesture recognition framework, which can be inaccurate, as explained before and as evaluated in our experiments below.
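The combined Phase 2 objective can be written out in a few lines. The sketch below is a plain-Python illustration under the assumption that the 2D cross entropy is averaged over all pixels (the default reduction in common frameworks); the function names are ours.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[target]


def mask_loss(pixel_logits, pixel_targets):
    """2D cross entropy: mean per-pixel loss over a (hand / not hand) mask."""
    losses = [
        cross_entropy(logits, target)
        for logit_row, target_row in zip(pixel_logits, pixel_targets)
        for logits, target in zip(logit_row, target_row)
    ]
    return sum(losses) / len(losses)


def phase2_loss(pixel_logits, pixel_targets, gesture_logits, gesture_target):
    """Equal-weight sum of the mask loss and the frame-level gesture loss."""
    return (mask_loss(pixel_logits, pixel_targets)
            + cross_entropy(gesture_logits, gesture_target))
```

With equal weights, neither the segmentation nor the classification objective is favoured explicitly; their relative influence then depends only on the magnitude of each loss.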
For the final phase, we modify our data augmentation approach. Instead of using a random background and noise for every image, we now use the same augmentation for the whole gesture sequence, such that all frames of a gesture sequence share the same background and noise. This is needed in the final phase of training to simulate real conditions. We send a given sequence through the EHME net in a single batch and save the encoded features as a sequence. Once all the sequences are saved, we use them as input to the LSTM. The hidden layer from the last image in the sequence is connected to a fully connected layer, whose output size equals the number of gestures, and then to a 1D cross entropy layer for sequence recognition. All the parameters from the earlier phases are used to initialise the weights of the EHME + LSTM networks, and we perform end-to-end training combining the three loss functions. For optimisation, we use Stochastic Gradient Descent with the learning rate and momentum given in Table 3. We train for 60 epochs. All the hyper-parameters used for training are summarised in Table 3. After this final phase we obtain our full network for sequence gesture recognition.
5 Experiments and Results
|Network||# of Gestures Classified Correctly||# of Gestures Classified Wrongly||Accuracy %|
|EHME|| || ||Encoded Features|| || ||Sequence Net|| || |
|Node Type||Output Size||Node Parameters||Node Type||Output Size||Node Parameters||Node Type||Output Size||Node Parameters|
|Resnet18 3||28x16x128||Parameters from || || || || || || |
|conv2||7x4x64||3x3, stride 2, padding 1||Average Pool||1x1x64||7x4||LSTM||64||input 64, hidden 128, layers 3|
|deconv1||14x8x32||4x4, stride 2, padding 1||Fully Connected||1x11||64x11||Fully Connected||10||64x10|
|deconv2||28x16x16||4x4, stride 2, padding 1|| || || || || || |
|deconv3||56x32x8||4x4, stride 2, padding 1|| || || || || || |
|deconv4||112x64x4||4x4, stride 2, padding 1|| || || || || || |
|deconv5||224x126x2||4x4, stride 2, padding (2,1)|| || || || || || |
We tested our network architecture on our testing database and on a dataset from AirGest. Our testing dataset contains 10 natural gestures, in which fingers are clipped, frames show strong motion blur, and there is variation within a gesture (see Figure 7 for examples). The AirGest dataset contains 4 gestures (click, bloom, zoom out and zoom in) that are performed in a clinical manner, i.e. the gestures have a clear separation between phases, are performed slowly without motion blur, and are fully contained in the centre of the video.
In Section 5.1 we present and discuss the results from the various phases of our network on our dataset, and in Section 5.2 we present comparative results of our network on the AirGest dataset. Our network architecture is implemented in PyTorch, and we used a PC with an Intel Core i7 CPU and an NVidia Titan Xp GPU for both training and testing.
5.1 Recognition on our dataset
As mentioned in Section 4.2, we used 3 phases for training. In our testing we report results for the networks from Phase 2 and Phase 3. The network obtained from Phase 2 of training performs recognition per frame. The input to this network is a sequence of RGB images of a gesture, and we obtain a gesture classification for each frame. We then follow a simple voting strategy: each gesture receives one vote for every frame that is predicted to be that gesture, and the sequence is assigned the gesture with the maximum number of votes. We call this frame level recognition.
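The voting strategy amounts to a majority vote over the per-frame predictions, as sketched below. How ties are resolved is not specified above; this sketch assumes the label that appears first in the sequence wins, and the function name is ours.

```python
from collections import Counter


def vote(frame_predictions):
    """Return the gesture label predicted for the largest number of frames."""
    if not frame_predictions:
        raise ValueError("a sequence needs at least one frame prediction")
    # most_common orders equal counts by first occurrence, so ties go to the
    # label that was seen first in the sequence (an assumption of this sketch).
    label, _ = Counter(frame_predictions).most_common(1)[0]
    return label


print(vote(["Left Shoot", "Left Block", "Left Shoot", "Teleport"]))
```

Since every frame votes independently, the order of the frames has no influence on the result beyond tie-breaking, which is exactly the limitation that sequence level recognition addresses.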
For sequence level recognition we input a sequence of images to the network from Phase 3. As we can observe from the results in Table 4 sequence level recognition performs much better than frame level recognition. The same hand pose can be part of multiple different gestures, but during different stages of performing the gesture. Since frame level recognition does not consider time, it could easily misclassify a gesture. Adding a temporal recognition component like an LSTM solves this issue as is evident from the results in Table 4.
To analyse the recognition performance on each gesture, we present a normalised confusion matrix for the sequence level recognition results in Figure 8. Mislabelled gestures stay within the same hand (a left-handed gesture is labelled as another left-handed gesture, but not as a right-handed one). The recognition rate of gesture 7 (Left Block) is especially low, and it is confused with the Left Shoot and Teleport gestures. Looking closely at the testing videos, one observation that could explain this confusion is a large head movement that creates relative motion inside the frame similar to the motion in Shoot (up-left motion) and Teleport (circular motion in the left direction). To improve the accuracy in these situations, we are planning to utilise the head pose transformation that can be obtained from the HoloLens.
Also, in the Teleport gesture, some users used the whole arm to create circular motion, where others used only one finger. This small finger-motion is especially challenging to distinguish from an egocentric view as it can be occluded by the arm or hand, and can be easily confused with Shoot or Punch gestures. This is something that should be taken into consideration when designing gestures for egocentric view recognition.
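A normalised confusion matrix of the kind used in this analysis can be computed as sketched below. The sketch assumes row normalisation, so that each row (true gesture) sums to one, and integer class indices; the function name is ours.

```python
def normalised_confusion(true_labels, predicted_labels, num_classes):
    """Row-normalised confusion matrix: entry [t][p] is the share of
    sequences with true class t that were predicted as class p."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Classes without any test sample get an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


print(normalised_confusion([0, 0, 1, 1, 1, 1], [0, 1, 1, 1, 1, 0], 2))
```

Row normalisation makes gestures with different numbers of test samples directly comparable, which matters here because between 7 and 9 samples remain per gesture.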
The hand pose network that was used in AirGestAR could not recover poses for many of the ego-hand images in our dataset, as it was not designed to handle complex scenarios such as motion blur and clipped fingers, which occur frequently in our dataset. Our network, in contrast, can handle such situations. Because the AirGestAR approach fails on these images, we could not perform a comparative study using their network on our dataset.
5.2 Recognition on AirGest Dataset
To compare our network with previous work, we used a dataset from AirGest. This dataset does not provide hand masks as ground truth, which are needed to perform Phase 1 training in our network. To avoid manual mask extraction, we used our Phase 1 network, trained on our training dataset, to generate these masks automatically. After visual verification, we used these masks as ground truth, in addition to frame level labels, in Phase 2 training. Finally, we followed the procedure described in Section 4.2 for Phase 3 training.
To provide comparative results we used the same training and testing data as described in . The confusion matrix is presented in Figure 9 and the overall accuracy in Table 6. Our final network’s performance is able to match the AirGest network’s despite being smaller in size.
|Network||# of Gestures Classified Correctly||# of Gestures Classified Wrongly||Unclassified||Probability Threshold ()||Accuracy %|
6 Conclusion & Future Work
We propose a novel deep learning network architecture which simultaneously encodes ego-hands in a sequence of images and recognises the egocentric gesture. We introduce a novel data augmentation technique using green screening to decrease the burden of collecting large amounts of data from a large number of users. The network architecture, in conjunction with the data augmentation technique, makes it easier to add a new egocentric gesture for recognition. In addition, we publish our training and testing dataset with 10 gestures performed in a less clinical and less constrained manner. We evaluate our network, which is trained on the augmented dataset, on a natural dataset (i.e. gestures not performed in front of a green screen). As presented in the results, our network can deal with variations in the length and style of gestures as well as with motion blur.
Recognising gestures on continuous video is also essential for making natural interactions possible on head-mounted AR devices. Handling this challenging task is one of the directions we want to explore in the future. Another direction we want to seek is to use head pose information and other modalities provided by the AR devices to deal with sudden and extreme head motion, paving the way for recognition of more complicated gestures in difficult scenarios and natural activities (e.g. GTEA Gaze+ ).
Acknowledgements. This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under the Grant Number 15/RP/2776. Sincere thanks to Sahitya Parvathaneni for doing the major part of illustrations.
References
[1] L. Baraldi, F. Paci, G. Serra, L. Benini, and R. Cucchiara. Gesture recognition in ego-centric videos using dense trajectories and hand segmentation. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 702–707, 2014.
[2] C. Cao, Y. Zhang, Y. Wu, H. Lu, and J. Cheng. Egocentric Gesture Recognition Using Recurrent 3D Convolutional Neural Networks with Spatiotemporal Transformer Modules. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3783–3791, 2017.
[3] R. Cutler and M. Turk. View-based interpretation of real-time optical flow for gesture recognition. Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition, pp. 416–421, 1998.
[4] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 39(4):677–691, 2015.
[5] R. Girshick. Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, 2015.
[6] M. A. Goodale and A. D. Milner. Separate visual pathways for perception and action. Trends in Neurosciences, 15(1):20–25, 1992.
[7] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. Proceedings of the IEEE International Conference on Computer Vision, 2017.
[8] D. Held, S. Thrun, and S. Savarese. GoTurn: Learning to Track at 100 FPS with Deep Regression Networks. European Conference on Computer Vision (ECCV), 2016.
[9] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
[10] V. Jain, R. Perla, and R. Hebbalaguppe. AirGestAR: Leveraging Deep Learning for Complex Hand Gestural Interaction with Frugal AR Devices. Adjunct Proceedings of the 2017 IEEE International Symposium on Mixed and Augmented Reality, ISMAR-Adjunct 2017, pp. 235–239, 2017.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, pp. 1–9, 2012.
[12] Y. Li, Z. Ye, and J. M. Rehg. Delving into egocentric actions. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 287–295, 2015.
[13] T.-Y. Lin, C. L. Zitnick, and P. Dollár. Microsoft COCO: Common Objects in Context. arXiv, pp. 1–15, 2015.
[14] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz. Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4207–4215, 2016.
[15] M. A. Moni and A. B. M. Shawkat Ali. HMM based hand gesture recognition: A review on techniques and approaches. Proceedings - 2009 2nd IEEE International Conference on Computer Science and Information Technology, pp. 433–437, 2009.
[16] Mu-Chun Su. A fuzzy rule-based approach to spatio-temporal hand gesture recognition. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews), 30(2):276–281, 2000.
[17] Natron. Natron. www.natron.fr, 2018.
[18] I. Posner and P. Ondruska. Deep Tracking: Seeing Beyond Seeing Using Recurrent Neural Networks. Proceedings of the 30th Conference on Artificial Intelligence (AAAI 2016), pp. 3361–3367, 2016.
[19] K. Simonyan and A. Zisserman. Two-Stream Convolutional Networks for Action Recognition in Videos. Advances in Neural Information Processing Systems, pp. 1–11, 2014.
[20] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR, abs/1409.1:1–14, 2014.
[21] S. Singh, C. Arora, and C. V. Jawahar. First Person Action Recognition Using Deep Learned Descriptors. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2620–2628, 2016.
[22] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497, 2015.
[23] H. Wang and C. Schmid. Action Recognition with Improved Trajectories. Proceedings of the IEEE International Conference on Computer Vision, pp. 3551–3558, 2013.
[24] P. Wang, W. Li, S. Liu, Z. Gao, C. Tang, and P. Ogunbona. Large-scale Isolated Gesture Recognition Using Convolutional Neural Networks. IEEE International Conference on Pattern Recognition, pp. 19–24, 2017.
[25] S. Wu, S. Zhong, and Y. Liu. Deep Residual Learning for Image Recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–17, 2016.
[26] Y. Zhang, C. Cao, J. Cheng, and H. Lu. EgoGesture: A New Dataset and Benchmark for Egocentric Hand Gesture Recognition. IEEE Transactions on Multimedia, 9210(c):1–1, 2018.
[27] C. Zimmermann and T. Brox. Learning to Estimate 3D Hand Pose from Single RGB Images. Proceedings of the IEEE International Conference on Computer Vision, 2017.