Learning from Synthetic Humans, CVPR 2017
Estimating human pose, shape, and motion from images and videos is a fundamental challenge with many applications. Recent advances in 2D human pose estimation use large amounts of manually-labeled training data for learning convolutional neural networks (CNNs). Such data is time-consuming to acquire and difficult to extend. Moreover, manual labeling of 3D pose, depth and motion is impractical. In this work we present SURREAL (Synthetic hUmans foR REAL tasks): a new large-scale dataset with synthetically-generated but realistic images of people rendered from 3D sequences of human motion capture data. We generate more than 6 million frames together with ground truth pose, depth maps, and segmentation masks. We show that CNNs trained on our synthetic dataset allow for accurate human depth estimation and human part segmentation in real RGB images. Our results and the new dataset open up new possibilities for advancing person analysis using cheap and large-scale synthetic data.
Convolutional Neural Networks provide significant gains to problems with large amounts of training data. In the field of human analysis, recent datasets [4, 37] now gather a sufficient number of annotated images to train networks for 2D human pose estimation [23, 41]. Other tasks such as accurate estimation of human motion, depth and body-part segmentation are lagging behind as manual supervision for such problems at large scale is prohibitively expensive.
Images of people have rich variation in poses, clothing, hair styles, body shapes, occlusions, viewpoints, motion blur and other factors. Many of these variations, however, can be synthesized using existing 3D motion capture (MoCap) data [3, 18] and modern tools for realistic rendering. Provided sufficient realism, such an approach would be highly useful for many tasks as it can generate rich ground truth in terms of depth, motion, body-part segmentation and occlusions.
Although synthetic data has been used for many years, realism has been limited. In this work we present SURREAL: a new large-scale dataset with synthetically-generated but realistic images of people. Images are rendered from 3D sequences of MoCap data. To ensure realism, the synthetic bodies are created using the SMPL body model, whose parameters are fit by the MoSh method given raw 3D MoCap marker data. We randomly sample a large variety of viewpoints, clothing and lighting. SURREAL contains more than 6 million frames together with ground truth pose, depth maps, and segmentation masks. We show that CNNs trained on synthetic data allow for accurate human depth estimation and human part segmentation in real RGB images, see Figure 1. Here, we demonstrate that our dataset, while being synthetic, reaches the level of realism necessary to support training for multiple complex tasks. This opens up opportunities for training deep networks using currently available graphics techniques. The SURREAL dataset is publicly available together with the code to generate synthetic data and to train models for body part segmentation and depth estimation.
The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents our approach for generating realistic synthetic videos of people. In Section 4 we describe our CNN architecture for human body part segmentation and depth estimation. Section 5 reports experiments. We conclude in Section 6.
Knowledge transfer from synthetic to real images has been recently studied with deep neural networks. Dosovitskiy et al. learn a CNN for optical flow estimation using synthetically generated images of rendered 3D moving chairs. Peng et al. study the effect of different visual cues, such as object/background texture and color, when rendering synthetic 3D objects for the object detection task. Similarly, other work explores rendering 3D objects to perform viewpoint estimation. Fanello et al. render synthetic infrared images of hands and faces to predict depth and parts. Recently, Gaidon et al. have released the Virtual KITTI dataset with synthetically generated videos of cars to study multi-object tracking.
Several works have focused on creating synthetic images of human bodies for learning 2D pose estimation [27, 30, 36], 3D pose estimation [7, 9, 14, 24, 35, 39, 44], pedestrian detection [22, 27, 28], and action recognition [31, 32]. Pishchulin et al. generate synthetic images with a game engine; in related work, they deform 2D images with a 3D model. More recently, Rogez and Schmid use an image-based synthesis engine to augment existing real images. Ghezelghieh et al. render synthetic images with 10 simple body models with an emphasis on upright people; however, the main challenge when using existing MoCap data for training is to generalize to poses that are not upright. The Human3.6M dataset presents realistic renderings of people in mixed-reality settings; however, the approach to create these is expensive.
A similar direction has been explored in [31, 32, 33, 38]. One of these works addresses action recognition with synthetic human trajectories from MoCap data. [32, 38] train CNNs with synthetic depth images. EgoCap creates a dataset by augmenting egocentric sequences with background.
The closest work to this paper renders large-scale synthetic images for predicting 3D pose with CNNs. Our dataset differs by having richer, per-pixel ground truth, thus allowing training for pixel-wise predictions and multi-task scenarios. In addition, we argue that the realism in our synthetic images is better (see our sample videos), resulting in a smaller gap between features learned from synthetic and real images. That method also relies heavily on real images as input in its training with domain adaptation; this is not the case for our synthetic training. Moreover, we render video sequences which can be used for temporal modeling.
Our dataset presents several differences with existing synthetic datasets. It is the first large-scale person dataset providing depth, part segmentation and flow ground truth for synthetic RGB frames. Existing datasets are used either for taking an RGB image as input and training only for 2D/3D pose, or for taking depth/infrared images as input and training for depth/part segmentation. In this paper, we show that photo-realistic renderings of people under large variations in shape, texture, viewpoint and pose can help solve pixel-wise human labeling tasks.
This section presents our SURREAL (Synthetic hUmans foR REAL tasks) dataset and describes key steps for its generation (Section 3.1). We also describe how we obtain ground truth data for real MoCap sequences (Section 3.2).
Our pipeline for generating synthetic data is illustrated in Figure 2. A human body with a random 3D pose, random shape and random texture is rendered from a random viewpoint under random lighting and on a random background image. Below we define what “random” means in all these cases. Since the data is synthetic, we also generate ground truth depth maps, optical flow, surface normals, human part segmentations and joint locations (both 2D and 3D). As a result, we obtain 6.5 million frames grouped into continuous image sequences. See Table 1 for more statistics, Section 5.2 for the description of the synthetic train/test split, and Figure 3 for samples from the SURREAL dataset.
Synthetic bodies are created using the SMPL body model. SMPL is a realistic articulated model of the body created from thousands of high-quality 3D scans, which decomposes body deformations into pose (kinematic deformations due to skeletal posture) and shape (body deformations intrinsic to a particular person that make them different from others). SMPL is compatible with most animation packages like Blender. SMPL deformations are modeled as a combination of linear blend skinning and linear blendshapes defined by principal components of body shape variation. SMPL pose and shape parameters are converted to a triangulated mesh using Blender, which then applies texture and shading and adds a background to generate the final RGB output.
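As a rough illustration (not the actual SMPL implementation, which additionally includes pose-dependent blendshapes and a kinematic tree of joint transforms), the two ingredients named above, shape blendshapes and linear blend skinning, can be sketched in a few lines of NumPy. All array shapes and names here are hypothetical:

```python
import numpy as np

def apply_shape(template, shape_dirs, betas):
    """Linear shape blendshapes: vertex offsets are a linear
    combination of principal shape directions weighted by betas."""
    # template: (V, 3), shape_dirs: (V, 3, B), betas: (B,)
    return template + shape_dirs @ betas

def linear_blend_skinning(vertices, weights, rotations, translations):
    """Pose the mesh: each vertex is a convex combination of the
    rigid transforms of the joints that influence it."""
    # vertices: (V, 3), weights: (V, J) rows summing to 1,
    # rotations: (J, 3, 3), translations: (J, 3)
    posed = np.einsum('jab,vb->vja', rotations, vertices) + translations
    return np.einsum('vj,vja->va', weights, posed)
```

With identity rotations, zero translations and zero betas, both functions reduce to the identity on the template mesh, which is a useful sanity check.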
In order to render varied, but realistic, body shapes we make use of the CAESAR dataset, which was used to train SMPL. To create a body shape, we select one of the CAESAR subjects at random and approximate their shape with the first 10 SMPL shape principal components. Ten shape components explain most of the shape variance in CAESAR (at the resolution of our mesh) and produce quite realistic body shapes.
To generate images of people in realistic poses, we take motion capture data from the CMU MoCap database. CMU MoCap contains more than 2000 sequences of 23 high-level action categories, resulting in more than 10 hours of recorded 3D locations of body markers.
It is often challenging to realistically and automatically retarget MoCap skeleton data to a new model. For this reason we do not use the skeleton data but rather use MoSh to fit the SMPL parameters that best explain raw 3D MoCap marker locations. This gives both the 3D shape of the subject and the articulated pose parameters of SMPL. To increase the diversity, we replace the estimated 3D body shape with a set of randomly sampled body shapes.
We render each CMU MoCap sequence three times using different random parameters. Moreover, we divide the sequences into clips of 100 frames with 30%, 50% and 70% overlaps for these three renderings. Every pose of the sequence is rendered with consistent parameters (i.e. body shape, clothing, light, background etc.) within each clip.
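The clip extraction just described can be sketched as follows; `split_into_clips` and its exact boundary handling are illustrative, not the paper's code:

```python
def split_into_clips(n_frames, clip_len=100, overlap=0.3):
    """Return (start, end) frame indices of fixed-length clips whose
    consecutive windows overlap by the given fraction (0.3, 0.5 or 0.7
    in the three renderings)."""
    stride = max(1, int(round(clip_len * (1.0 - overlap))))
    starts = list(range(0, max(n_frames - clip_len, 0) + 1, stride))
    if not starts:
        # a sequence shorter than clip_len yields one truncated clip
        starts = [0]
    return [(s, min(s + clip_len, n_frames)) for s in starts]
```

For example, a 300-frame sequence with 30% overlap yields clips starting 70 frames apart, and a sequence shorter than 100 frames yields a single truncated clip, consistent with the note in Section 5.2 that a few sequences have fewer than 100 frames.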
We use two types of real scans for the texture of body models. First, we extract SMPL texture maps from CAESAR scans, which come with a color texture per 3D point. These maps vary in skin color and person identity; however, their quality is often low due to the low resolution, uniform tight-fitting clothing, and visible markers placed on the face and the body. Anthropometric markers are automatically removed from the texture images and inpainted. To provide more variety, we extract a second set of textures obtained from 3D scans of subjects with normal clothing. These scans are registered with 4Cap as in prior work. The texture of real clothing substantially increases the realism of generated images, even though SMPL does not model 3D deformations of clothes.
Part of our data is rendered with the first set (randomly sampled CAESAR textures), and the rest with the second set (clothed textures). To preserve the anonymity of subjects, we replace all faces in the texture maps by the average CAESAR face. The skin color of this average face is corrected to fit the face skin color of the original texture map. This corrected average face is blended smoothly with the original map, resulting in a realistic and anonymized body texture.
The body is illuminated using Spherical Harmonics. The coefficients are randomly sampled from a uniform distribution, apart from the ambient illumination coefficient (which is clamped to a positive minimum value) and the vertical illumination component, which is biased to encourage illumination from above. Since Blender does not provide Spherical Harmonics illumination, a spherical harmonic shader for the body material was implemented in Open Shading Language.
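A minimal sketch of such coefficient sampling is given below; the number of coefficients, the uniform range, the ambient minimum and the vertical bias are all placeholder values, since the paper's exact numbers are not reproduced here:

```python
import numpy as np

def sample_sh_coeffs(rng, lo=-0.7, hi=0.7, ambient_min=0.5, vertical_bias=-0.7):
    """Sample spherical-harmonics lighting coefficients.
    All numeric values are illustrative, not the paper's settings."""
    coeffs = rng.uniform(lo, hi, size=9)  # 9 coeffs = SH up to 2nd order
    # ambient term (index 0) is clamped to a positive minimum
    coeffs[0] = max(coeffs[0], ambient_min)
    # vertical component (index 1) is biased so light tends to come from above
    coeffs[1] = coeffs[1] + vertical_bias
    return coeffs
```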
The projective camera has a fixed resolution, focal length and sensor size. To generate images of the body in a wide range of positions, we take 100-frame MoCap sub-sequences and, in the first frame, render the body so that the center of the viewport points to the pelvis of the body, at a random distance (sampled from a normal distribution) with a random yaw angle. The remainder of the sequence then effectively produces bodies in a range of locations relative to the static camera.
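The camera sampling can be sketched as follows; the mean and deviation of the distance distribution, and the circular placement around the pelvis, are illustrative assumptions rather than the paper's exact setup:

```python
import numpy as np

def sample_camera(rng, mean_dist=8.0, std_dist=1.0):
    """Place the camera at a random distance and yaw around the pelvis.
    mean_dist/std_dist are placeholders for the paper's omitted values."""
    dist = rng.normal(mean_dist, std_dist)
    yaw = rng.uniform(0.0, 2.0 * np.pi)
    # camera on a horizontal circle around the body, aimed at the pelvis
    position = np.array([dist * np.cos(yaw), dist * np.sin(yaw), 0.0])
    return position, yaw
```

The camera then stays static for the remaining 99 frames, so body motion alone produces the variety of relative positions.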
We render the person on top of a static background image. To ensure that the backgrounds are reasonably realistic and do not include other people, we sample from a subset of the LSUN dataset that includes a total of 400K images from the categories kitchen, living room, bedroom and dining room.
We perform multiple rendering passes in Blender to generate different types of per-pixel ground truth. The material pass generates pixel-wise segmentation of rendered body parts, given different material indices assigned to different parts of our body model. The velocity pass, typically used to simulate motion blur, provides us with a render simulating optical flow. The depth and normal passes, used for emulating effects like fog, bokeh or for performing shading, produce per-pixel depth maps and normal maps. The final texture rendering pass overlays the shaded, textured body over the random background. Together with this data we save camera and lighting parameters as well as the 2D/3D positions of body joints.
The Human3.6M dataset [17, 18] provides ground truth for 2D and 3D human poses. Additionally, a subset of the dataset (H80K) has segmentation annotation, but the definition of parts is different from the SMPL body parts used for our training. We complement this ground truth and generate predicted SMPL body-part segmentation and depth maps for people in Human3.6M for all frames. Here again we use MoSh to fit the SMPL body shape and pose to the raw MoCap marker data. This provides a good fit of the model to the shape and the pose of real bodies. Given the provided camera calibration, we project models to images. We then render the ground truth segmentation, depth, and 2D/3D joints as above, while ensuring correspondence with real pixel values in the dataset. The depth is different from the time-of-flight (depth) data provided by the official dataset. These MoSh fits provide a form of approximate “ground truth”. See Figures 6 and 7 for generated examples. We use this for evaluation on the test set as well as for the baseline where we train only on real data, and also for fine-tuning our models pre-trained on synthetic data. In the rest of the paper, all frames from the synthetic training set are used for synthetic pre-training.
In this section, we present our approach for human body part segmentation [5, 25] and human depth estimation [10, 11, 19], which we train with synthetic and/or real data, see Section 5 for the evaluation.
Our approach builds on the stacked hourglass network architecture introduced originally for the 2D pose estimation problem. This network involves several repetitions of contraction followed by expansion layers, with skip connections that implicitly model spatial relations across resolutions and allow bottom-up and top-down structured prediction. Convolutional layers with residual connections and 8 ‘hourglass’ modules are stacked on top of each other, each successive stack taking the previous stack’s prediction as input. We refer the reader to the original paper for more details. A variant of this network has been used for scene depth estimation. We choose this architecture because it can infer pixel-wise output while taking into account human body structure.
Our network input is a 3-channel RGB image cropped and scaled to fit a human bounding box using the ground truth. For each stack, the network outputs one channel per class: 15 channels in the case of segmentation (14 classes plus the background) and 20 for depth (19 depth classes plus the background). We use a cross-entropy loss defined on all pixels for both segmentation and depth. The final loss of the network is the sum over the 8 stacks. For synthetic pre-training we train for 50K iterations using the RMSprop algorithm with mini-batches of size 6. Our data augmentation during training includes random rotations, scaling and color jittering.
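The training loss just described, pixel-wise cross-entropy summed over the stacks, can be sketched in NumPy (a toy forward computation, not the training code; real training would use a deep learning framework's loss):

```python
import numpy as np

def stacked_pixel_ce(logits_per_stack, labels):
    """Sum over stacks of the mean per-pixel cross-entropy.
    logits_per_stack: list of (C, H, W) arrays; labels: (H, W) ints."""
    total = 0.0
    for logits in logits_per_stack:
        # numerically stabilized log-softmax over the class channel
        z = logits - logits.max(axis=0, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
        h, w = labels.shape
        # pick the log-probability of each pixel's true class
        total += -log_prob[labels, np.arange(h)[:, None], np.arange(w)].mean()
    return total
```

With uniform logits and C classes, each stack contributes log(C), so the summed loss over 8 stacks starts near 8·log(C) at initialization.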
We formulate the problem as a pixel-wise classification task for both segmentation and depth. For segmentation, each pixel is assigned to one of the 14 pre-defined human parts, namely head, torso, upper legs, lower legs, upper arms, lower arms, hands, feet (separately for right and left), or to the background class. For depth, we align the ground-truth depth maps on the z-axis by the depth of the pelvis joint, and then quantize depth values into 19 bins (9 behind and 9 in front of the pelvis, plus a central bin at the pelvis). We set the quantization constant to 45mm to roughly cover the depth extent of common human poses. The network is trained to classify each pixel into one of the 19 depth bins or background. At test time, we first upsample the feature maps of each class with bilinear interpolation by a factor of 4 to recover the original resolution. Then, each pixel is assigned to the class for which the corresponding channel has the maximum activation.
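A minimal sketch of this pelvis-centred depth quantization (bin indices only; the separate background class is handled elsewhere) could look like:

```python
import numpy as np

def quantize_depth(depth, pelvis_z, bin_size=0.045, n_bins=19):
    """Quantize metric depth (meters) into n_bins classes centred on
    the pelvis: the middle bin sits at the pelvis depth, with
    (n_bins // 2) bins behind it and the same number in front."""
    half = n_bins // 2                       # 9 behind, 9 in front
    rel = depth - pelvis_z                   # align on z by pelvis depth
    idx = np.round(rel / bin_size).astype(int)
    return np.clip(idx, -half, half) + half  # map to 0..n_bins-1
```

With the 45mm constant, a pixel at the pelvis depth falls in bin 9, one 45mm further from the camera falls in bin 10, and anything beyond roughly ±40cm saturates at bins 0 or 18.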
We test our approach on several datasets. First, we evaluate the segmentation and depth estimation on the test set of our synthetic SURREAL dataset. Second, we test the performance of segmentation on real images from the Freiburg Sitting People dataset . Next, we evaluate segmentation and depth estimation on real videos from the Human3.6M dataset [17, 18] with available 3D information. Then, we qualitatively evaluate our approach on the more challenging MPII Human Pose dataset . Finally, we experiment and discuss design choices of the SURREAL dataset.
We use intersection over union (IOU) and pixel accuracy measures for evaluating the segmentation approach. The final measure is the average over the 14 human parts. Depth estimation is formulated as a classification problem, but we take continuity into account in the evaluation. We compute the root-mean-squared error (RMSE) between the predicted quantized depth value (class) and the ground truth quantized depth on the human pixels. To interpret the error in real-world coordinates, we multiply it by the quantization constant (45mm). We also report a scale- and translation-invariant RMSE (st-RMSE) by solving for the best translation and scaling along the z-axis to fit the prediction to the ground truth. Since inferring depth from RGB is ambiguous, this is a common technique in evaluations.
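The st-RMSE described above amounts to a one-dimensional least-squares fit (one scale, one translation along z) before computing the error. A sketch, with hypothetical inputs as flat arrays of per-pixel depths:

```python
import numpy as np

def rmse(pred, gt):
    """Plain root-mean-squared error."""
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def st_rmse(pred, gt):
    """Scale-and-translation invariant RMSE: solve least squares for
    a*pred + b that best fits gt, then measure the residual error."""
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, gt, rcond=None)
    return rmse(a * pred + b, gt)
```

If the prediction differs from the ground truth only by a global scale and offset in z, st-RMSE is zero while plain RMSE is not, which is exactly the ambiguity this measure is meant to discount.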
To evaluate our methods on synthetic images, we hold out a subset of the synthetic frames as the test set and train all our networks on the remaining training set. The split is constructed such that a given CMU MoCap subject is assigned either to train or to test. Some subjects have a large number of instances, some have unique actions, and some actions are very common (walk, run, jump). Overall, 30 subjects out of 145 are assigned to test: 28 test subjects cover all common actions, and 2 have unique actions. The remaining subjects are used for training. Although our synthetic images have different body shapes and appearance than the subject in the originating MoCap sequence, we still found it appropriate to split by subjects. We also separate a subset of our body shapes, clothing and background images for the test set. This ensures that our tests are unbiased with regard to appearance, yet are still representative of all actions. Table 1 summarizes the number of frames, clips and MoCap sequences in each split. Clips are the continuous 100-frame sequences where we have the same random body shape, background, clothing, camera and lighting; a new random set is picked for every clip. Note that a few sequences have fewer than 100 frames.
The evaluation is performed on the middle frame of each 100-frame clip in the aforementioned held-out synthetic test set, totaling 12,528 images. For segmentation, the IOU and pixel accuracy are 69.13% and 80.61%, respectively. Evaluation of depth estimation gives 72.9mm and 56.3mm RMSE and st-RMSE errors, respectively. Figure 4 shows sample predictions. For both tasks, the results are mostly accurate on synthetic test images. However, a few challenging poses (e.g. crawling), test samples with extreme close-up views, and fine details of the hands cause errors. In the following sections, we investigate whether similar conclusions hold for real images.
The Freiburg Sitting People (FSitting) dataset is composed of 200 high-resolution (300x300 pixels) front-view images of 6 subjects sitting on a wheelchair. There are 14 human part annotations available. See Figure 5 for sample test images and corresponding ground truth (GT) annotation. We use the same train/test split as the original work: 2 subjects for training and 4 subjects for testing. This amount of data is limited for training deep networks. We show that our network pre-trained only on synthetic images is already able to segment human body parts. This shows that the human renderings in the synthetic dataset are representative of the real images, such that networks trained exclusively on synthetic data can generalize quite well to real data.
Table 2 summarizes segmentation results on FSitting. We carry out several experiments to understand the gain from synthetic pre-training. For the ‘Real’ baseline, we train a network from scratch using the 2 training subjects. This network overfits, as there are few subjects to learn from, and its performance is quite low. Our ‘Synth’ result is obtained using the network pre-trained on synthetic images without fine-tuning. We get 51.88% pixel accuracy and 40.1% IOU with this method, clearly outperforming training on real images alone. Furthermore, fine-tuning (Synth+Real) with the 2 training subjects helps significantly. See Figure 5 for qualitative results. Given the small amount of training data in FSitting, the fine-tuning converges after 200 iterations.
In previous work on FSitting, the authors introduce a network that outputs a high-resolution segmentation after several layers of upconvolutions. For a fair comparison, we modify our network to output full resolution by adding one bilinear upsampling layer followed by a nonlinearity (ReLU) and a convolutional layer that outputs full-resolution predictions instead of the lower-resolution ones described in Section 4. If we fine-tune this network (Synth+Real+up) on FSitting, we improve performance and outperform the previous approach by a large margin. Note that this approach trains on the same FSitting training images, but adds around 2,800 Pascal images; hence it uses significantly more manual annotation than our method.
To evaluate our approach, we need sufficient real data with ground truth annotations. Such data is expensive to obtain and currently not available. For this reason, we generate nearly perfect ground truth for images recorded with a calibrated camera, given their MoCap data. Human3.6M [17, 18] is currently the largest dataset where such information is available. There are 3.6 million frames from 4 cameras. We use subjects S1, S5, S6, S7, S8 for training, S9 for validation and S11 for testing as in [35, 42], but from all 4 cameras. Note that this is different from the official train/test split. Each subject performs each of the 15 actions twice. We use all frames from one of the two instances of each action for training, and every 64th frame from all instances for testing. Given the high resolution of the frames, we assume a cropped human bounding box is given to reduce computational complexity. We evaluate the performance of both segmentation and depth, and compare with the baseline for which we train a network on real images only.
Table 3 summarizes the part segmentation results on Human3.6M. Note that these are not comparable to previously reported results, both because those assume the background segment is given and because our ground truth segmentation data is not part of the official release (see Section 3.2). We report both the mean over 14 human parts (fg) and the mean together with the background class (fg+bg). Training on real images instead of synthetic images increases IOU by 3.4% and pixel accuracy by 2.14%. This is expected because the training distribution matches the test distribution in terms of background, camera position and action categories (i.e. poses). Furthermore, the amount of real data is sufficient for CNN training. However, since there are very few subjects available, the network does not generalize to different clothing. In Figure 6, the ‘Real’ baseline places the border between shoulders and upper arms exactly on the T-shirt boundaries. This reveals that the network learns about skin color rather than actual body parts. Our pre-trained network (Synth) performs reasonably well, even though the pose distribution in our MoCap data is quite different from that of Human3.6M. When we fine-tune the network with real images from Human3.6M (Synth+Real), the model predicts very accurate segmentations and outperforms the ‘Real’ baseline by a large margin. Moreover, our model is capable of distinguishing left and right most of the time on all 4 views, since it has been trained with randomly sampled views.
Depth estimation results on Human3.6M for various poses and viewpoints are illustrated in Figure 7. Here, the pre-trained network fails on the most challenging poses, although it still captures partly correct estimates (first row). Fine-tuning on real data compensates for these errors and refines the estimates. In Table 4, we report the RMSE error measured on foreground pixels, together with the scale-translation-invariant version (see Section 5.1). We also report the error only on known 2D joints (PoseRMSE) to get an idea of how well a 3D pose estimation model would work based on the depth predictions. One would need to handle occluded joints to infer the 3D locations of all joints; this is beyond the scope of the current paper.
FSitting and Human3.6M are relatively simple datasets, with limited background clutter, few subjects, a single person per image, and the full body visible. In this section, we test the generalization of our model on more challenging images. MPII Human Pose is one of the largest datasets with diverse viewpoints and clutter. However, this dataset has no ground truth for part segmentation or depth; therefore, we qualitatively show our predictions. Figure 8 illustrates several success and failure cases. Our model generalizes reasonably well, except when there are multiple people close to each other or extreme viewpoints, which have not appeared during training. It is interesting to note that although lower-body occlusions and cloth shapes are not present in synthetic training, the models perform accurately in such cases, see Figure 8 caption.
We carried out several experiments to answer questions such as: ‘How much data should we synthesize?’, ‘Is CMU MoCap enough?’, and ‘What is the effect of clothing variation?’.
We plot the performance as a function of training data size. We train with random subsets of increasing size drawn from the 55K training clips, using all frames of the selected clips; e.g., a 1% subset corresponds to 550 clips with a total of 55K frames. Figure 9 (left) shows the increase in performance for both segmentation and depth as we increase the amount of training data. Results are plotted on the synthetic and Human3.6M test sets, with and without fine-tuning. The performance gain is highest at the beginning of all curves. The curves then saturate, more evidently on Human3.6M, where training with 55K frames is already sufficient. We explain this by the lack of diversity in the Human3.6M test set and the redundancy of MoCap poses.
Similarly, we study what happens when we add more clothing. We train with subsets of 100 clips containing only 1, 10 or 100 different clothing textures (out of a total of 930). Since the dataset has at most 100 clips for a given clothing and we want to use the same number of training clips, this means 1 clothing with 100 clips, 10 clothings with 10 clips each, and 100 clothings with 1 clip each. Figure 9 (right) shows the increase in performance for both tasks as we increase the clothing variation. In the case of fine-tuning, the impact is less prominent because the training and test images of Human3.6M are recorded in the same room. Moreover, there is only one subject in our test set; ideally, such an experiment would be evaluated on more diverse data.
Pose distribution depends on the MoCap source. To experiment with the effect of having similar poses in training as in test, we rendered synthetic data using Human3.6M MoCap. Segmentation and depth networks pre-trained on this data (IOU: 48.11%, RMSE: 2.44) outperform the ones pre-trained on CMU MoCap (42.82%, 2.57) when tested on real Human3.6M. It is important to have diverse MoCap and to match the target distribution. Note that we exclude the Human3.6M synthetic data in Section 5.4 to address the more generic case where there is no dataset specific MoCap data available.
In this study, we have shown successful large-scale training of CNNs from synthetically generated images of people. We have addressed two tasks, namely human body part segmentation and depth estimation, for which large-scale manual annotation is infeasible. Our generated synthetic dataset comes with rich pixel-wise ground truth information and can potentially be used for tasks beyond those considered here. Unlike many existing synthetic datasets, the focus of SURREAL is on the realistic rendering of people, which is a challenging task. In future work, we plan to integrate the person into the background in a more realistic way by taking into account the lighting and the 3D scene layout. We also plan to augment the data with more challenging scenarios such as occlusions and multiple people.
This work was supported in part by the Alexander von Humboldt Foundation, ERC grants ACTIVIA and ALLEGRO, the MSR-Inria joint lab, and Google and Facebook Research Awards. We acknowledge the Human3.6M dataset owners for providing the MoCap marker data.