Egocentric Human Segmentation for Mixed Reality

by Andrija Gajic, et al.

The objective of this work is to segment human body parts from egocentric video using semantic segmentation networks. Our contribution is two-fold: i) we create a semi-synthetic dataset composed of more than 15,000 realistic images and associated pixel-wise labels of egocentric human body parts, such as arms or legs, covering different demographic factors; ii) building upon the ThunderNet architecture, we implement a deep learning semantic segmentation algorithm that is able to perform beyond real-time requirements (16 ms for 720×720 images). We believe this method will enhance the sense of presence in Virtual Environments and constitute a more realistic alternative to standard virtual avatars.






1 Introduction

In recent years, the release of video see-through cameras, such as the Intel RealSense or ZED Mini attached to headsets, or built-in ones such as those in the HTC Vive Pro, has led Mixed Reality (MR) researchers to see the opportunities of egocentric vision. Augmented Virtuality (AV) is an MR subcategory that aims to augment a virtual environment (VE) with the reality surrounding the user, captured with an egocentric camera. Among its main benefits, AV allows physical interaction with real objects and increases awareness of the real world while the user is immersed in a VE.

Users’ sense of presence (SoP) within a VE is an important construct that affects their propensity to experience the VE as if it were real. One effective way to increase this SoP is by providing a user representation, which helps to shift from being a mere observer to really experiencing the VE [13]. Self-avatars, virtual models with a human-like appearance mostly focused on hand representations, are the mainstream solution followed in the Virtual Reality (VR) community. Apart from increasing the SoP, self-avatars also increase the sense of embodiment (SoE) [1] (as defined by Kilteni et al., the sense of embodiment refers to the feeling of being inside, controlling, and having a virtual body) and distance estimation accuracy [8]. Among their limitations, they suffer from misalignment between the virtual and the real body [9].

One promising application of egocentric vision for MR is the use of video self-avatars: rather than using virtual hand models, adopting the user’s real ones by segmenting the egocentric view. The VR community has explored this idea for some time. For instance, color-based approaches [4, 2, 10, 7] can be deployed in real time but tend to work well only under controlled conditions; they fail to deal with different skin colors or with long-sleeve clothes [4]. Alternatively, depth-based solutions that incorporate all real objects within a given distance from the camera have been proposed [12]. Although effective in some situations, they still lack the precision needed for a generic, realistic and real-time immersive experience. In our previous work [6], we explored deep semantic segmentation networks to perform arm segmentation using the EgoArm semi-synthetic dataset. The results proved their increased robustness in uncontrolled scenarios with respect to color-based or depth-based approaches, but the particular architecture explored was not light enough for real-time performance. Furthermore, the immersive experience and the associated SoP and SoE can be further enhanced by providing video self-avatars not only of the hands but of the entire human body. Therefore, this paper investigates how to integrate the user’s body, extracted from the egocentric capture, into the VE in real time.

The rest of this article is structured as follows. Section 2 discusses relevant related works. In Section 3 we present the Egocentric Human Segmentation dataset created for this task. Section 4 describes the deep learning architecture designed to target real-time egocentric segmentation. Finally, Section 5 reports some preliminary results and concludes the paper with open problems and future work.

2 Related Works

The literature contains several works that introduce the user’s whole body into the VE. For instance, Bruder et al. proposed a skin segmentation algorithm to incorporate the user’s hands while handling different skin colors [2]. In addition, a floor subtraction approach was developed to incorporate the user’s legs and feet in the VE: assuming that the floor appearance was uniform, the body was retained by simply filtering out all pixels not belonging to the floor [2].
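To make the color-based family of approaches concrete, the sketch below applies a classic rule-based RGB skin heuristic with NumPy. The specific thresholds are a common textbook heuristic assumed here for illustration, not the rule used in [2]:

```python
import numpy as np

def skin_mask(rgb):
    """Classify pixels as skin using a classic rule-based RGB heuristic.

    rgb: uint8 array of shape (H, W, 3). Returns a boolean mask.
    The thresholds are an illustrative heuristic, not the rule of [2].
    """
    rgb = rgb.astype(np.int16)  # avoid uint8 overflow in the differences
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return ((r > 95) & (g > 40) & (b > 20)
            & (rgb.max(-1) - rgb.min(-1) > 15)   # enough chromatic spread
            & (np.abs(r - g) > 15) & (r > g) & (r > b))
```

Such fixed rules illustrate the limitation discussed above: skin tones outside the calibrated range are missed, and sleeved arms are never recovered.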

Later, Chen et al., in the context of 360° video cinematic experiences, explored depth keying techniques to incorporate all objects below a predefined distance threshold [3]. This threshold could be changed dynamically to control how much of the real world was shown in the VE, and the user could also control the transitions between the VE and the real world through head shaking and hand gestures. Among the limitations the authors pointed out was the limited field of view of the depth sensor.
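The depth-keying idea reduces to a per-pixel threshold on the depth map. A minimal sketch (our illustration of the general technique, not the implementation in [3]), assuming invalid sensor readings are encoded as zero:

```python
import numpy as np

def depth_key(depth_m, threshold_m, invalid=0.0):
    """Keep pixels closer than a distance threshold (in meters).

    depth_m: float array of per-pixel depth from a depth sensor; pixels
    equal to `invalid` carry no measurement and are discarded.
    """
    return (depth_m != invalid) & (depth_m < threshold_m)
```

The threshold can be adjusted at run time to reveal more or less of the real world, which is how [3] lets the user control the real/virtual mix.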

Pigny and Dominjon [11] were the first to propose a deep learning algorithm to segment egocentric bodies. Their architecture was based on U-Net and trained using a hybrid dataset composed of person images from the COCO dataset and a custom dataset created following the automatic labelling procedure reported in [6]. They reported their inference time per image, and also observed false positives that degrade the AV experience. In this work we extend our previous work [6] and explore this segmentation problem further, targeting real-time segmentation at higher resolutions, which is needed to achieve a realistic immersive experience.

3 Egocentric Body Dataset

Figure 1: Samples from the Egocentric Human Segmentation Dataset. From left to right: standing position looking to the front or slightly leaning, standing position looking to the floor, and sitting position looking to the floor.

Due to the lack of egocentric human datasets with pixel-wise labelling, we decided to create our own, following the same procedure as reported in [6] (as far as we know, the dataset created in [11] is not publicly available). This procedure creates semi-synthetic images by first capturing human body parts in front of a chroma-key backdrop and then merging them with realistic backgrounds, sparing us the extremely time-consuming and error-prone task of manual pixel-wise labelling. For this new task, we have extended our previously published EgoArm dataset with images of the egocentric lower-body parts, as follows:

Figure 2: Deep Learning Architecture proposed for Egocentric Human Segmentation.
  • Egocentric body capture: we asked a group of users to walk freely in front of the chroma-key backdrop while being recorded. A second round was performed with the users sitting in a chair also covered by the chroma-key. The recording was done using an Android app installed on a Samsung S8 smartphone placed in a Samsung Gear headset. Users repeated the experiments with different outfits, including short and long sleeves, and the group also varies in terms of gender and skin color. Images were then sampled from the videos so that all users were equally represented.

  • Egocentric background capture: in the second stage, videos of realistic backgrounds were acquired using the same app in three different positions: standing looking to the front, standing looking to the floor, and sitting looking to the floor. Several background videos were acquired for each position, encompassing different indoor scenarios including offices, houses, restaurants, and halls. Frames were sampled from each video, and the frames pertaining to the floor-facing videos were further augmented using rotations.

  • HSV filtering and alpha-channel-based combination: first, binary foreground masks of the recorded chroma-key videos were estimated using the process described in [6]. Then, the binarized foreground masks were smoothed by applying the Shared Sampling alpha matting algorithm [5]. Finally, the alpha masks were used to realistically merge the segmented body parts with randomly chosen background counterparts. Examples of the resulting images can be seen in Fig. 1.

The lower-body images created in this way were joined with the EgoArm dataset. Before joining both datasets, EgoArm was downsampled so that both datasets were equally represented. Furthermore, the MIT Scene Parsing backgrounds used originally in the EgoArm dataset were replaced with the more realistic ones acquired in this new setup. The result is the Ego Human Segmentation dataset (it will be made publicly available for research purposes with the camera-ready version).

4 Egocentric Deep Segmentation

Fig. 2 depicts the general architecture designed to segment two classes: background and human egocentric body parts. It is inspired by the ThunderNet architecture [14], which consists of three main parts: an encoding subnetwork, a pyramid pooling module (PPM), and a decoding subnetwork. The encoding subnetwork is based on the first three ResNet-18 blocks [14], and its output feeds the PPM. Unlike the original network [14], and due to the larger size of our training images, we use larger pooling factors. The decoding subnetwork is similar to the one proposed in the original architecture, made up of two deconvolutional blocks. Apart from the skip connections included within the encoding and decoding blocks, we add three more long skip connections between the encoding and decoding subnetworks to refine object boundaries.
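To make the PPM concrete, the sketch below average-pools the encoder feature map to several coarse grids, upsamples each grid back, and concatenates everything along the channel axis. The pooling factors shown are hypothetical placeholders, not the values used in the paper:

```python
import numpy as np

def pyramid_pooling(feat, factors=(2, 4, 8)):
    """Pyramid pooling module sketch: pool a feature map to several
    k x k context grids, upsample each back with nearest neighbour,
    and concatenate along channels.

    feat: (H, W, C) with H and W divisible by every factor. The factors
    here are illustrative; the actual pooling factors differ.
    """
    H, W, C = feat.shape
    outs = [feat]
    for k in factors:
        # block-average down to a k x k grid of context vectors
        pooled = feat.reshape(k, H // k, k, W // k, C).mean(axis=(1, 3))
        # nearest-neighbour upsampling back to (H, W, C)
        up = np.repeat(np.repeat(pooled, H // k, axis=0), W // k, axis=1)
        outs.append(up)
    return np.concatenate(outs, axis=-1)  # (H, W, C * (1 + len(factors)))
```

Coarser grids summarize global context while finer grids keep mid-level context, which is why larger input images call for larger pooling factors.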

4.1 Training and experimental protocol

This new ThunderNet-style architecture has been developed and trained using the Keras framework. The weights of the three ResNet-18 blocks inside the encoder are inherited from a model pre-trained on the ImageNet dataset; afterwards, the whole architecture is fine-tuned end to end. From the entire Ego Human dataset, a subset of images was selected for validation, ensuring that the users in the two subsets were disjoint; the remaining images were used for training. Several strategies were applied. First, the full set of backgrounds without foreground was included in the training set. Chromatic and cropping augmentation techniques were also applied to the training images. The loss function used was weighted cross entropy, whose weights were estimated according to the overall frequency of foreground and background pixels in the training set. After extensive experiments, the best performance was obtained using the Adam optimizer with a small batch size (due to the large size of the training images), together with tuned epochs, learning rate, and weight decay.
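The frequency-based weighted cross entropy can be sketched as follows. Inverse-frequency weighting is one common choice and is assumed here for illustration; the exact weights used in the paper may differ:

```python
import numpy as np

def inverse_freq_weights(pixel_freqs):
    """Class weights from per-class pixel frequencies (summing to 1).
    Rare classes (the human body) get larger weights than the background."""
    w = 1.0 / np.asarray(pixel_freqs, dtype=float)
    return w / w.sum()

def weighted_cross_entropy(probs, labels, weights, eps=1e-7):
    """Mean weighted cross entropy over flattened pixels.

    probs: (N, n_classes) softmax outputs; labels: (N,) int class ids.
    """
    p = np.clip(probs[np.arange(len(labels)), labels], eps, 1.0)
    return float(np.mean(weights[labels] * -np.log(p)))
```

Weighting by inverse class frequency penalizes mistakes on the (rarer) human pixels more heavily, matching the goal stated in Section 5 of focusing the learning on body parts rather than on the background.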

5 Results and Conclusions

Table 1: Empirical comparison with the results reported in our previous work [6], in terms of Intersection over Union, on the GTEA, EDSH2, EDSHK, EgoHands, EgoGesture, THU-READ, TEgO vanilla, and TEgO wild datasets.
Figure 3: Qualitative results from real egocentric frames. Above, results achieved with our previous method [6]; below, results achieved with the new and shallower ThunderNet architecture.

Table 1 reports the Intersection over Union on different egocentric arm/hand datasets. As can be seen, in most cases the new network achieves comparable results using a shallower architecture (note that, due to some discrepancies in the ground truth associated with these datasets, some results are underestimated; see [6] for clarification). Another benefit of this new network is its ability to segment whole egocentric bodies. Due to the lack of datasets with labelled egocentric human bodies, Fig. 3 presents some qualitative results from real egocentric frames. The most interesting benefit is that, unlike previous approaches [11], our method is able to perform real-time segmentation on high-resolution images (inference measured on a PC with an Intel Xeon E5-2620 v4 CPU and a GTX-1080 Ti GPU).
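For reference, the metric of Table 1 reduces to a few lines of NumPy on binary masks:

```python
import numpy as np

def iou(pred, gt):
    """Intersection over Union between two binary segmentation masks.
    Returns 1.0 when both masks are empty (nothing to segment)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)
```

Because IoU divides by the union, spurious background false positives of the kind discussed below directly lower the score even when the body itself is well segmented.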

We found that without the smooth blending between human body parts and realistic backgrounds, the network tends to focus its learning mainly on the edges and cannot provide good segmentation. Second, the use of chromatic and cropping augmentation helps the network to be less dependent on illumination and spatial position, respectively. The weighted cross entropy loss helps to focus the learning on the human body parts rather than on the backgrounds, whose unlimited variability cannot be completely captured by the network. Despite the good segmentation accuracy and real-time performance, we observed that there are still some false positives in the background. For future work, we plan to tackle this problem in more depth by studying more sophisticated loss functions, or by approaching the training in two stages, first training the architecture on a bigger dataset with more background classes.


  • [1] F. Argelaguet, L. Hoyet, M. Trico, and A. Lécuyer (2016) The role of interaction in virtual embodiment: effects of the virtual hand representation. In Proc. IEEE VR, pp. 3–10. Cited by: §1.
  • [2] G. Bruder, F. Steinicke, K. Rothaus, and K. Hinrichs (2009) Enhancing presence in head-mounted display environments by visual body feedback using head-mounted cameras. In Proc. Int. Conf. on CW, pp. 43–50. Cited by: §1, §2.
  • [3] G. Chen, M. Billinghurst, R. W. Lindeman, and C. Bartneck (2017) The effect of user embodiment in AV cinematic experience. In Proc. of ICAT-EGVE, Cited by: §2.
  • [4] L. P. Fiore and V. Interrante (2012) Towards achieving robust video self-avatars under flexible environment conditions. International Journal of VR 11 (3), pp. 33–41. Cited by: §1.
  • [5] E. Gastal and M. M. Oliveira (2010) Shared sampling for real-time alpha matting. In Computer Graphics Forum, Vol. 29, pp. 575–584. Cited by: 3rd item.
  • [6] E. Gonzalez-Sosa, P. Perez, R. Tolosana, R. Kachach, and A. Villegas (2020) Enhanced self-perception in mixed reality: egocentric arm segmentation and database with automatic labelling. arXiv preprint arXiv:2003.12352. Cited by: §1, §2, 3rd item, §3, Figure 3, Table 1, §5.
  • [7] T. Günther, I. S. Franke, and R. Groh (2015) Aughanded virtuality-the hands in the virtual environment. In Proc. of IEEE 3DUI, pp. 157–158. Cited by: §1.
  • [8] E. McManus, B. Bodenheimer, S. Streuber, S. De La Rosa, H. H. Bülthoff, and B. J. Mohler (2011) The influence of avatar (self and character) animations on distance estimation, object interaction and locomotion in immersive virtual environments. In Proc. of the ACM SIGGRAPH SAP, pp. 37–44. Cited by: §1.
  • [9] N. Ogawa, T. Narumi, and M. Hirose (2020) Effect of avatar appearance on detection thresholds for remapped hand movements. IEEE TRANS. on VCG. Cited by: §1.
  • [10] P. Perez, E. Gonzalez-Sosa, R. Kachach, J. Ruiz, F. Pereira, and A. Villegas (2019) Immersive gastronomic experience with distributed reality. In Proc. of IEEE WEVR, pp. 1–4. Cited by: §1.
  • [11] P. Pigny and L. Dominjon (2020) Using cnns for users segmentation in video see-through augmented virtuality. arXiv preprint arXiv:2001.00487. Cited by: §2, §5, footnote 2.
  • [12] M. Rauter, C. Abseher, and M. Safar (2019) Augmenting virtual reality with near real world objects. In Proc. IEEE VR, pp. 1134–1135. Cited by: §1.
  • [13] M. Slater and M. Usoh (1993) The influence of a virtual body on presence in immersive virtual environments. In Proc. of VR, pp. 34–42. Cited by: §1.
  • [14] W. Xiang, H. Mao, and V. Athitsos (2019) ThunderNet: a turbo unified network for real-time semantic segmentation. In Proc. of IEEE WACV, pp. 1789–1796. Cited by: §4.