1 Introduction
With the advent of deep learning, the problem of predicting the geometry of the human body from single images has experienced a tremendous boost. The combination of Convolutional Neural Networks with large MoCap datasets [42, 19] resulted in a substantial number of works that robustly predict the 3D position of the body joints [27, 28, 30, 34, 38, 46, 49, 52, 59].
In order to estimate the full body shape, a standard practice adopted in [11, 13, 17, 22, 50, 61] is to regress the parameters of low-rank parametric models [9, 26]. Nevertheless, while these parametric models describe the geometry of the naked body very accurately, they are not appropriate for capturing the shape of clothed humans.
Current trends focus on proposing alternative representations to the low-rank models. Varol et al. [51] advocate a direct inference of volumetric body shape, although still without accounting for the clothing geometry. Very recently, [33] used 2D silhouettes and the visual hull algorithm to recover the shape and texture of clothed human bodies. Despite very promising results, this approach still requires frontal-view input images of the person with no background, and under relatively simple body poses.
In this paper, we introduce a general pipeline to estimate the geometry of dressed humans which is able to cope with a wide spectrum of clothing outfits and textures, complex body poses and shapes, and changing backgrounds and camera viewpoints. For this purpose, we contribute to three key areas of the problem, namely, the data collection, the shape representation and the image-to-shape inference.
Concretely, we first present 3DPeople, a new large-scale dataset with 2.5 million photorealistic synthetic images of people under varying clothes and apparel. The dataset comprises 40 male and 40 female subjects with different body shapes and skin tones, performing 70 distinct actions (see Fig. 1). It contains the 3D geometry of both the naked and dressed body, and additional annotations including skeletons, depth and normal maps, optical flow and semantic segmentation masks. These additional annotations are indeed very similar to those of SURREAL [52], which was built for similar purposes. The key difference between SURREAL and 3DPeople is that in SURREAL the clothing is directly mapped as a texture on top of the naked body, while in 3DPeople the clothing has its own geometry.
As essential as gathering a rich dataset is the question of which geometry representation is most appropriate for a deep network. In this paper we consider the “geometry image” originally proposed in [16] and recently used to encode rigid objects in [7, 43]. The construction of the geometry image involves two steps: first, a mapping of a genus-0 surface onto a spherical domain, and then a mapping of the sphere onto a 2D grid resembling an image. Our contribution here is on the spherical mapping. We found that existing algorithms [12, 7] were not accurate, especially for the elongated parts of the body. To address this issue we devise a novel spherical area-preserving parameterization algorithm that combines and extends the FLASH method [12] and the optimal mass transportation method [31].
Our final contribution consists of designing a generative network to map an input RGB image of a dressed human into his/her corresponding geometry image. Since we consider geometry images, learning such a mapping is highly complex. We ease the learning process through a coarse-to-fine strategy, combined with a series of geometry-aware losses. The full network is trained in an end-to-end manner, and the results are very promising on a variety of input data, including both synthetic and real images.
2 Related work
3D Human shape estimation. While the problem of localizing the 3D position of the joints from a single image has been extensively studied [27, 28, 30, 34, 38, 46, 49, 52, 59, 62], the estimation of the 3D body shape has received relatively little attention. This is presumably because the well-established datasets [42, 19] are annotated only with skeleton joints.
The inherent ambiguity of estimating human shape from a single view is typically addressed using shape embeddings learned from body scan repositories like SCAPE [9] and SMPL [26]. The body geometry is described by a reduced number of pose and shape parameters, which are optimized to match image characteristics [10, 11, 25]. Dibra et al. [13] were the first to use a CNN fed with silhouette images to estimate shape parameters. In [47, 50] SMPL body parameters are predicted by incorporating differentiable renderers into the deep network to directly estimate and minimize the error of image features. On top of this, [22] introduces an adversarial loss that penalizes non-realistic body shapes. All these approaches, however, build upon low-dimensional parametric models which are only suitable to model the geometry of the naked body.
Non-parametric representations for 3D objects. What is the most appropriate 3D object representation to train a deep network remains an open question, especially for non-rigid bodies. Standard non-parametric representations for rigid objects include voxels [14, 58], octrees [48, 54, 55] and point clouds [45]. [7, 43] use 2D embeddings computed with geometry images [16] to represent rigid objects. Interestingly, very promising results for the reconstruction of non-rigid hands were also reported. DeformNet [36] proposes the first deep model to reconstruct the 3D shape of non-rigid surfaces from a single image. BodyNet [51] explores a network that predicts voxelized human body shape. Very recently, [33] introduced a pipeline that, given a single image of a person in frontal position, predicts the body silhouette as seen from different views, and then uses a visual hull algorithm to estimate 3D shape.
Generative Adversarial Networks. Originally introduced by [15], GANs have been previously used to model human body distributions and generate novel images of a person under arbitrary poses [37]. Kanazawa et al. [22] explicitly learned the distribution of real SMPL parameters. DVP [23], paGAN [32] and GANimation [35] presented models for continuous face animation and manipulation. GANs have also been applied to edit [18, 44, 53] and generate [5] talking faces.
Datasets for body shape analysis. Datasets are fundamental in the deep-learning era. While obtaining annotations is quite straightforward for 2D poses [41, 8, 21], it requires using sophisticated MoCap systems for the 3D case. Additionally, the datasets acquired this way [42, 19] are mostly indoors. Even more complex is the task of obtaining 3D body shape, which requires expensive setups with multiple cameras or 3D scanners. To overcome this situation, datasets with synthetic but photorealistic images have emerged as a tool to generate massive amounts of training data. SURREAL [52] is the largest and most complete dataset so far, with more than 6 million frames generated by projecting synthetic textures of clothes onto random SMPL body shapes. The dataset is further annotated with body masks, optical flow and depth. However, since clothes are projected onto the naked SMPL shapes just as textures, their geometry cannot be explicitly modeled. To fill this gap, we present the 3DPeople dataset of 3D dressed humans in motion.
3 3DPeople dataset
To facilitate future research, we introduce 3DPeople, the first dataset of dressed humans with a specific geometry representation for the clothes. The dataset in numbers can be summarized as follows: it contains 2.5 million photorealistic images of 40 male and 40 female subjects performing 70 actions. Every subject-action sequence is captured from 4 camera views, and for each view the texture of the clothes, the lighting direction and the background image are randomly changed. Each frame is annotated with (see Fig. 1): 3D textured mesh of the naked and dressed body; 3D skeleton; body part and cloth segmentation masks; depth map; optical flow; and camera parameters. We next briefly describe the generation process:
Body models: We have generated fully textured triangular meshes for 80 human characters using Adobe Fuse [1] and MakeHuman [2]. The distribution of the subjects' physical characteristics covers a broad spectrum of body shapes, skin tones and hair geometry (see Fig. 1).
Clothing models: Each subject is dressed with a different outfit including a variety of garments, combining tight and loose clothes. Additional apparel like sunglasses, hats and caps is also included. The final rigged meshes of the body and clothes contain approximately 20K vertices.
MoCap sequences: We gather 70 realistic motion sequences from Mixamo [3]. These include human movements of different complexity, from drinking and typing actions that produce small body motions to actions like breakdance or backflip that involve very complex patterns. The mean length of the sequences is 110 frames. While these are relatively short sequences, they are highly expressive, which we believe makes 3DPeople also appropriate for exploring action recognition tasks.
Textures, camera, lights and background: We then use Blender [4] to apply the 70 MoCap animation sequences to each character. Every sequence is rendered from 4 camera views, yielding a total of 22,400 clips. We use a projective camera with an 800 mm focal length and pixel resolution. The 4 viewpoints correspond approximately to orthogonal directions aligned with the ground. The distance to the subject changes for every sequence to ensure a full view of the body in all frames. The textures of the clothes are randomly changed for every sequence (see again Fig. 1). The illumination is composed of ambient lighting plus a light source at infinity, whose direction is changed per sequence. As in [52], we render the person on top of a static background image, randomly taken from the LSUN dataset [60].
Semantic labels: For every rendered image, we provide segmentation labels of the clothes (8 classes) and body (14 parts). Observe in Fig. 1 (top right) that the former are aligned with the dressed human, while the body parts are aligned with the naked body.
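The dataset figures quoted above are mutually consistent, which can be checked with a few lines of arithmetic. The sketch below only uses the counts stated in the text; treating every clip as having the mean length of 110 frames is a simplifying assumption, so the frame total is an approximation.

```python
# Counts stated in the text.
num_subjects = 40 + 40   # 40 male + 40 female characters
num_actions = 70         # MoCap sequences taken from Mixamo
num_views = 4            # camera views per subject-action sequence
mean_frames = 110        # mean sequence length (assumed for every clip here)

num_clips = num_subjects * num_actions * num_views
approx_frames = num_clips * mean_frames

print(num_clips)       # 22400 clips, as reported
print(approx_frames)   # 2464000, i.e. roughly 2.5 million images
```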
4 Problem formulation
Given a single image of a person wearing an arbitrary outfit, we aim at designing a model capable of directly estimating the 3D shape of the clothed body. We represent the body shape through the mesh associated with a geometry image, in which every vertex is given by its 3D coordinates, expressed in the camera coordinate system and centered on the root joint. This representation is a key ingredient of our design, as it maps the 3D mesh to a regular 2D grid structure that preserves the neighborhood relations, thus fulfilling the locality assumption required in CNN architectures. Furthermore, the geometry image representation allows uniformly reducing or increasing the mesh resolution by simply downsampling or upsampling the image. This will play an important role in our strategy of designing a coarse-to-fine shape estimation approach.
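To make the grid representation concrete, the following minimal sketch stores a geometry image as a square grid of 3D points, subsamples it, and turns the result into a triangle mesh. The function names are hypothetical, the toy coordinates are placeholders, and the stitching of the image boundary that a closed surface requires is deliberately left out.

```python
def make_geometry_image(n):
    """Toy n x n geometry image: every pixel stores one (x, y, z) vertex."""
    return [[(float(r), float(c), 0.0) for c in range(n)] for r in range(n)]


def downsample(gim, factor):
    """Uniformly subsample the grid; every kept pixel is still a mesh vertex."""
    return [row[::factor] for row in gim[::factor]]


def grid_to_mesh(gim):
    """Flatten pixels into vertices and split each grid cell into two triangles."""
    n_rows, n_cols = len(gim), len(gim[0])
    vertices = [vertex for row in gim for vertex in row]
    faces = []
    for r in range(n_rows - 1):
        for c in range(n_cols - 1):
            i = r * n_cols + c  # index of the top-left corner of the cell
            faces.append((i, i + 1, i + n_cols))
            faces.append((i + 1, i + n_cols + 1, i + n_cols))
    return vertices, faces


gim = make_geometry_image(128)        # 128 * 128 = 16,384 vertices
coarse = downsample(gim, 4)           # 32 x 32 grid
vertices, faces = grid_to_mesh(coarse)
print(len(vertices), len(faces))
```

Because neighbouring pixels are neighbouring vertices, the coarse mesh is obtained without any remeshing step, which is what makes a coarse-to-fine regressor natural for this representation.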
We next describe the two main steps of our pipeline: 1) the process of constructing the geometry images, and 2) the deep generative model we propose for predicting 3D shape.
5 Geometry image for dressed humans
The deep network we describe later will be trained using pairs of images and their corresponding geometry images. For creating the geometry images we consider two different cases, one for a reference mesh in a T-pose configuration, and another for any other mesh of the dataset.
5.1 Geometry image for a reference mesh
One of the subjects of our dataset in a T-pose configuration is chosen as a reference mesh. The process for mapping this mesh into a planar regular grid is illustrated in Fig. 2. It involves the following steps:
Repairing the mesh. We start from the reference mesh in a T-pose configuration (Fig. 2a). We assume this mesh to be a manifold mesh and to be genus-0. Most of the meshes in our dataset, however, do not fulfill these conditions. In order to fix the mesh we follow the heuristic described in [7], which consists of a voxelization, a selection of the largest connected region of the shape, and subsequent hole filling using a medial axis approach. The output of this step is the repaired mesh.
Spherical parameterization. Given the repaired genus-0 mesh, we next compute the spherical parameterization that maps every vertex of the repaired mesh onto the unit sphere (Fig. 2b). Details of the algorithm we use are explained below.
Unfolding the sphere. The sphere is mapped onto an octahedron and then cut along edges to output a flat geometry image. Together, these steps define the mapping from the reference mesh to the geometry image. The unfolding process is shown in Fig. 2(c,d,e). Colored lines in the geometry image correspond to the same edge of the octahedron, which is split by the unfolding operation. We will later enforce this symmetry constraint when predicting geometry images.
5.2 Spherical area-preserving parameterization
Although there exist several spherical parameterization schemes [12, 7], we found that they tend to shrink the elongated parts of full body models, such as the arms and legs, making the geometry images incomplete (see Fig. 3). In this work, we develop a spherical area-preserving parameterization algorithm for genus-0 full body models by combining and extending the FLASH method [12] and the optimal mass transportation method [31]. Our algorithm is particularly advantageous for handling models with elongated parts. The key idea is to begin with an initial parameterization onto a planar triangular domain, with a suitable rescaling that corrects its size. The area distortion of the initial parameterization is then reduced using quasi-conformal composition. Finally, the spherical area-preserving parameterization is produced using optimal mass transportation followed by the inverse stereographic projection.
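The last step of the pipeline above, the inverse stereographic projection, has a simple closed form. The sketch below uses the common convention of projecting from the north pole (0, 0, 1) onto the plane z = 0; this convention and the function names are assumptions made for illustration, and the parameterization code used for the dataset may adopt a different one.

```python
import math


def inverse_stereographic(x, y):
    """Map a point (x, y) of the plane onto the unit sphere."""
    r2 = x * x + y * y
    return (2.0 * x / (1.0 + r2), 2.0 * y / (1.0 + r2), (r2 - 1.0) / (1.0 + r2))


def stereographic(px, py, pz):
    """Map a point of the unit sphere (north pole excluded) back to the plane."""
    return (px / (1.0 - pz), py / (1.0 - pz))


planar_points = [(0.0, 0.0), (1.0, 0.0), (0.3, -2.0), (5.0, 7.0)]
sphere_points = [inverse_stereographic(x, y) for x, y in planar_points]
norms = [math.sqrt(px * px + py * py + pz * pz) for px, py, pz in sphere_points]
print(norms)
```

Points far from the origin of the plane land close to the north pole, which is why the planar area distortion has to be corrected before this projection is applied.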
5.3 Geometry image for arbitrary meshes
The approach for creating the geometry image described in the previous subsection is quite computationally demanding (it can take up to 15 minutes for meshes with complex topology). To compute the geometry image for several thousand training meshes we have devised an alternative approach. Consider the mesh of any subject of the dataset under an arbitrary pose (Fig. 4a), together with its T-pose configuration (Fig. 4b). We assume there is a 1-to-1 vertex correspondence between both meshes, given by a known bijective function. This is guaranteed in our dataset, with all meshes of the same subject having the same number of vertices.
We then compute dense correspondences between the subject's T-pose mesh and the reference T-pose mesh, using a non-rigid ICP algorithm [6] (see Fig. 4c). We can then finally compute the geometry image for the input mesh by concatenating mappings:
(1) 
where the remaining mapping is the one from the reference mesh to the geometry image domain estimated in Sec. 5.1. It is worth pointing out that the non-rigid ICP between the pairs of T-poses is also highly computationally demanding, but it only needs to be computed once for every subject of the dataset. Once this is done, the geometry image for a new input mesh can be created in a few seconds.
An important consequence of this procedure is that all geometry images of the dataset will be semantically aligned, that is, every entry in a geometry image will correspond to (approximately) the same semantic part of the model. This will significantly ease the learning task of the deep network.
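The concatenation of mappings used in this subsection can be illustrated with plain index lookups. In the sketch below all names and the toy data are hypothetical: `tpose_to_posed` stands for the known bijection between the posed mesh and its T-pose, `ref_to_tpose` for the correspondences returned by non-rigid ICP, and `pixel_to_ref` for the reference geometry image of Sec. 5.1. A real pipeline would typically interpolate inside triangles rather than pick single vertices.

```python
def compose_geometry_image(posed_vertices, tpose_to_posed, ref_to_tpose, pixel_to_ref):
    """Fill every pixel with the posed 3D vertex reached by chaining the maps."""
    gim = []
    for row in pixel_to_ref:
        gim.append([
            posed_vertices[tpose_to_posed[ref_to_tpose[ref_idx]]]
            for ref_idx in row
        ])
    return gim


# Toy data: 4 vertices per mesh and a 2 x 2 geometry image.
posed_vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
tpose_to_posed = [0, 1, 2, 3]    # same subject, identical vertex ordering
ref_to_tpose = [2, 0, 3, 1]      # stand-in for the non-rigid ICP result
pixel_to_ref = [[0, 1], [2, 3]]  # stand-in for the reference geometry image

gim = compose_geometry_image(posed_vertices, tpose_to_posed, ref_to_tpose, pixel_to_ref)
print(gim)
```

Since `pixel_to_ref` is shared by all subjects, a given pixel always refers to the same reference vertex, which is the source of the semantic alignment noted above.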
6 GimNet
We next introduce GimNet, our deep generative network to estimate geometry images (and thus 3D shape) of dressed humans from a single image. An overview of the model is shown in Fig. 5. Given the input image, we first extract the 2D joint locations represented as heatmaps [57, 36], which are then fed into a mesh regressor trained to reconstruct the shape of the person in the image using a geometry-image-based representation. Due to the high complexity of the mapping (the input image and the geometry image have the same size), the regressor operates in a coarse-to-fine manner, progressively reconstructing meshes at higher resolution. To further constrain the reconstruction to lie on the manifold of anthropomorphic shapes, an adversarial scheme with two discriminators is applied.
6.1 Model architecture
Mesh regressor. Given the input image and the estimated 2D body joints, the mesh regressor aims to predict the geometry image. Instead of directly learning this complex mapping, we break the process into a sequence of more manageable steps. The regressor initially estimates a low-resolution mesh, and then progressively increases its resolution (see Fig. 5). This coarse-to-fine approach allows the regressor to first focus on the basic shape configuration and then shift attention to finer details, while also providing more stability compared to a network that learns the direct mapping.
As shown in Fig. 2e, the geometry images have symmetry properties derived from unfolding the octahedron into a square. Specifically, each side of the geometry image is symmetric with respect to its midpoint. We enforce this property using a differentiable layer that operates linearly over the edges of the estimated geometry images.
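One linear operation that satisfies the symmetry constraint is to average every boundary pixel with its mirror image about the midpoint of the same side. The sketch below is only an illustration of that idea under stated assumptions: the function names are hypothetical, the actual layer used in GimNet is not specified beyond being linear, and setting the four corners to a common value assumes that they correspond to the same octahedron vertex.

```python
def average(a, b):
    """Component-wise mean of two 3D points."""
    return tuple((p + q) / 2.0 for p, q in zip(a, b))


def enforce_boundary_symmetry(gim):
    """Make each side of a square geometry image symmetric about its midpoint."""
    n = len(gim)
    last = n - 1
    out = [list(row) for row in gim]  # interior pixels are copied unchanged
    for i in range(n):
        j = last - i  # mirror index about the midpoint of the side
        out[0][i] = average(gim[0][i], gim[0][j])            # top side
        out[last][i] = average(gim[last][i], gim[last][j])   # bottom side
        out[i][0] = average(gim[i][0], gim[j][0])            # left side
        out[i][last] = average(gim[i][last], gim[j][last])   # right side
    # Assumption: all four corners are the same surface point after unfolding.
    corners = [gim[0][0], gim[0][last], gim[last][0], gim[last][last]]
    corner = tuple(sum(c[k] for c in corners) / 4.0 for k in range(3))
    out[0][0] = out[0][last] = out[last][0] = out[last][last] = corner
    return out


n = 5
gim = [[(float(r * n + c), float(r - c), float(r * c)) for c in range(n)] for r in range(n)]
sym = enforce_boundary_symmetry(gim)
print(sym[0])
```

Averaging is a fixed linear map of the boundary values, so in a network it would pass gradients to both mirrored pixels.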
Multi-scale discriminator. Evaluating high-resolution meshes poses a significant challenge for a discriminator, as it needs to simultaneously guarantee local and global mesh consistency on very high dimensional data. We therefore use two discriminators with the same architecture, but that operate at different geometry image scales: (i) a discriminator with a large receptive field that evaluates the shape coherence as a whole; and (ii) a local discriminator that focuses on small patches and enforces the local consistency of the surface triangle faces.
6.2 Learning the model
3D reconstruction error. We first define a supervised multi-level L1 loss for 3D reconstruction as:
(2) 
being and the real and generated data distribution of clothed human geometry images respectively, the number of scales, the ground-truth reconstruction at scale and the estimated reconstruction. The error at each scale is weighted by where is the ratio between and sizes. In initial experiments, the L1 loss yielded better reconstructions than the mean squared error.
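A multi-level L1 loss of this kind can be sketched as a weighted sum of per-scale mean absolute errors. In the sketch below the function names are hypothetical and the per-scale weights are passed in explicitly as placeholder values; they are not the weights used to train GimNet.

```python
def flatten(gim):
    """List all coordinates of a geometry image (nested lists of 3D points)."""
    return [value for row in gim for vertex in row for value in vertex]


def l1_error(pred, target):
    """Mean absolute difference between two geometry images of equal size."""
    a, b = flatten(pred), flatten(target)
    return sum(abs(p - t) for p, t in zip(a, b)) / len(a)


def multiscale_l1(preds, targets, weights):
    """Weighted sum of the L1 errors over all scales."""
    return sum(w * l1_error(p, t) for p, t, w in zip(preds, targets, weights))


def downsample(gim, factor):
    return [row[::factor] for row in gim[::factor]]


target_fine = [[(float(r), float(c), 1.0) for c in range(4)] for r in range(4)]
pred_fine = [[(x + 0.5, y, z) for (x, y, z) in row] for row in target_fine]

targets = [downsample(target_fine, 2), target_fine]  # coarse to fine
preds = [downsample(pred_fine, 2), pred_fine]
weights = [0.5, 1.0]  # hypothetical per-scale weights

loss = multiscale_l1(preds, targets, weights)
print(loss)
```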
2D Projection Error. To encourage the mesh to correctly project onto the input image we penalize, at every scale , its projection error computed as:
where is the differentiable projection equation and is calculated as above.
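As an illustration of a projection error, the sketch below projects estimated and ground-truth vertices with a pinhole model and averages the absolute pixel differences. The pinhole model, the intrinsics `focal`, `cx`, `cy` and the function names are assumptions made for this example; vertices are taken to be in camera coordinates with positive depth.

```python
def project(vertex, focal, cx, cy):
    """Pinhole projection of a 3D point given in camera coordinates."""
    x, y, z = vertex
    return (focal * x / z + cx, focal * y / z + cy)


def projection_error(pred_vertices, gt_vertices, focal, cx, cy):
    """Mean absolute difference between projected vertex coordinates."""
    total = 0.0
    for p, g in zip(pred_vertices, gt_vertices):
        pu, pv = project(p, focal, cx, cy)
        gu, gv = project(g, focal, cx, cy)
        total += abs(pu - gu) + abs(pv - gv)
    return total / (2 * len(pred_vertices))


gt = [(0.0, 0.0, 2.0), (1.0, 1.0, 2.0)]
pred = [(0.0, 0.0, 2.0), (1.0, 1.0, 4.0)]  # second vertex pushed away in depth
err = projection_error(pred, gt, focal=100.0, cx=64.0, cy=64.0)
print(err)
```

Every operation in `project` is differentiable for positive depth, so the same expression can be used as a training loss in an automatic differentiation framework.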
Adversarial loss. In order to further encourage the mesh regressor to generate anthropomorphic shapes, we play a min-max strategy game [15] between the regressor and two discriminators operating at different scales. It is well known that non-overlapping support between the true data distribution and the model distribution can cause severe training instabilities. As proven by [40, 29], this can be addressed by penalizing the discriminator when it deviates from the Nash equilibrium, ensuring that its gradients are non-zero orthogonal to the data manifold. Formally, for each of the discriminators, the loss is defined as:
(3) 
where is a penalty regularization for discriminator gradients, only considered on the true data distribution.
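For reference, the gradient penalty of [29] that is evaluated only on the true data distribution is commonly written as below. The symbols are notation chosen here for illustration: D denotes one of the discriminators, p_data the real distribution of geometry images and γ a penalty weight. The exact weighting used in GimNet may differ.

```latex
R_1 \;=\; \frac{\gamma}{2}\; \mathbb{E}_{x \sim p_{\mathrm{data}}}
\left[ \left\lVert \nabla_{x} D(x) \right\rVert_2^2 \right]
```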
Feature matching loss. To improve training stability we penalize differences in higher-level features of the discriminators [56]. Similar to a perceptual loss, the estimated geometry image is compared with the ground truth at multiple feature levels of the discriminators. Considering every layer of every discriminator, the loss is defined as:
(4) 
where the weight regularizer denotes the number of elements in the corresponding layer of the corresponding discriminator.
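A feature matching loss in the spirit of [56] can be sketched as follows. Activations are represented as flat lists of numbers, `real_feats[k][i]` being the activation of layer i of discriminator k for the ground truth and `fake_feats[k][i]` the one for the estimate. These names and the toy values are hypothetical, and the normalization by the number of elements follows the description above.

```python
def feature_matching_loss(real_feats, fake_feats):
    """Sum over discriminators and layers of the normalized L1 feature distance."""
    loss = 0.0
    for real_layers, fake_layers in zip(real_feats, fake_feats):
        for real, fake in zip(real_layers, fake_layers):
            # Divide by the number of elements of the layer.
            loss += sum(abs(r - f) for r, f in zip(real, fake)) / len(real)
    return loss


# Two discriminators: the first with two layers, the second with one.
real_feats = [[[1.0, 2.0], [0.0, 0.0, 3.0, 1.0]], [[2.0, 2.0]]]
fake_feats = [[[1.0, 0.0], [0.0, 0.0, 3.0, 5.0]], [[2.0, 2.0]]]

fm = feature_matching_loss(real_feats, fake_feats)
print(fm)
```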
Total loss. Finally, we solve the min-max problem:
(5) 
where , and are the hyperparameters that control the relative importance of every loss term.
6.3 Implementation details
For the mesh regressor we build upon the U-Net architecture [39], consisting of an encoder-decoder structure with skip connections between features at the same resolution, extended to estimate geometry images at multiple scales. Both discriminator networks operate at different mesh resolutions [56] but have the same PatchGAN [20] architecture, mapping from the geometry image to a matrix in which each entry represents the probability of the corresponding patch being close to the real geometry image distribution. The global discriminator evaluates the mesh at its final resolution, and the local discriminator the downsampled mesh at a coarser scale.
The model is trained with 170,000 synthetic images of cropped clothed people, resized to a fixed pixel resolution, and with geometry images whose meshes have 16,384 vertices, for 60 epochs. As for the optimizer, we use Adam [24] with learning rate of , beta1 , beta2 and batch size . Every epochs we decay the learning rate by a factor of . The weight coefficients for the loss terms are set to , , and .
7 Experimental evaluation
We next present quantitative and qualitative results on synthetic images of our dataset and on images in the wild.
Synthetic results. We evaluate our approach on 25,000 test images randomly chosen for 8 subjects (4 male, 4 female) of the test split. For each test sample we feed GimNet with the RGB image and the ground truth 2D pose, corrupted by Gaussian noise with a standard deviation of 2 pixels. For a given test sample, the estimated mesh results from a direct reshaping of its estimated geometry image. The ground truth mesh need not have the same number of vertices as the estimated mesh, nor necessarily the same topology. Since there is no direct 1-to-1 mapping between the vertices of the two meshes, we propose using the following metric:
(6) 
where each term represents the average Euclidean distance from all vertices of one mesh to their nearest neighbors in the other mesh. Note that such a one-directional term is not a true distance measure because it is not symmetric. This is why we compute it bidirectionally.
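The bidirectional nearest-neighbor error can be sketched with a brute-force search. The function names are hypothetical, and averaging the two directions with equal weights is an assumption of this sketch. For meshes with thousands of vertices a spatial index such as a k-d tree would replace the inner loop.

```python
import math


def mean_nn_distance(src, dst):
    """Average distance from every vertex of src to its nearest vertex in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def bidirectional_distance(mesh_a, mesh_b):
    """Symmetric error obtained by averaging both directions."""
    return 0.5 * (mean_nn_distance(mesh_a, mesh_b) + mean_nn_distance(mesh_b, mesh_a))


mesh_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
mesh_b = [(0.0, 0.0, 0.0)]

print(mean_nn_distance(mesh_a, mesh_b))        # 0.5
print(mean_nn_distance(mesh_b, mesh_a))        # 0.0
print(bidirectional_distance(mesh_a, mesh_b))  # 0.25
```

The two one-directional values differ in this toy case, which illustrates why a single direction is not symmetric and the metric is computed both ways.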
The quantitative results are summarized in Fig. 6. We report the average error (in mm) of GimNet for 30 actions (the 15 with the highest and the 15 with the lowest error). Note that the error of GimNet lies between 15 and 35 mm. Recall, however, that we do not consider outlier 2D detections in our experiments, but just 2D noise. We also evaluate the error of the ground truth geometry image, as it is an approximation of the actual ground truth mesh. This error is below 5 mm, indicating that the geometry image representation does indeed capture the true shape very accurately. Finally, we also provide the error of the recent parametric approach of [22], which fits SMPL parameters to the input images. Nevertheless, these results are just indicative, and cannot be directly compared with our approach, as we did not retrain [22]. We add them here just to demonstrate the challenge posed by the new 3DPeople dataset. Indeed, the distance error of [22] was computed after performing a rigid ICP of the estimated mesh with the ground truth mesh (there was no need for this with GimNet).
Qualitative results. We finally show in Fig. 7 qualitative results on synthetic images from 3DPeople and real fashion images downloaded from the Internet. Remarkably, our approach is able to reconstruct long dresses (top row images), known to be a major challenge [33]. Note also that some of the reconstructed meshes have some spikes. This is one of the limitations of non-parametric models: the reconstructions tend to be less smooth than when using parametric fittings. However, non-parametric models also have the advantage that, if properly trained, they can span a much larger configuration space.
8 Conclusions
In this paper we have made three contributions to the problem of reconstructing the shape of dressed humans: 1) we have presented the first large-scale dataset of 3D humans in action in which cloth geometry is explicitly modelled; 2) we have proposed a new algorithm to perform spherical parameterizations of elongated body parts, to later model rigged meshes of human bodies as geometry images; and 3) we have introduced an end-to-end network to estimate human body and clothing shape from single images, without relying on parametric models. While the results we have obtained are very promising, there are still several avenues to explore. For instance, extending the problem to video, exploring new regularization schemes on the geometry images, or combining segmentation and 3D reconstruction are all open problems that can benefit from the proposed 3DPeople dataset.
Acknowledgments
This work is supported in part by an Amazon Research Award, the Spanish Ministry of Science and Innovation under projects HuMoUR TIN201790086R, ColRobTransp DPI201678957 and María de Maeztu Seal of Excellence MDM20160656; and by the EU project AEROARMS ICT20141644271. We also thank NVidia for hardware donation under the GPU Grant Program.
References
 [1] https://www.adobe.com/es/products/fuse.html.
 [2] http://www.makehumancommunity.org/.
 [3] https://www.mixamo.com/.
 [4] Blender  a 3d modelling and rendering package. https://www.blender.org/.
 [5] Wav2pix: Speechconditioned face generation using generative adversarial networks. In ICASSP.
 [6] Nonrigid ICP, MATLAB Central File Exchange. https://www.mathworks.com/matlabcentral/fileexchange/41396nonrigidicp/, 2019.
 [7] A. Sinha, J. Bai, and K. Ramani. Deep learning 3D Shape Surfaces using Geometry Images. In ECCV, 2016.
 [8] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D Human Pose Estimation: New Benchmark and State of the Art Analysis. In CVPR, June 2014.
 [9] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis. SCAPE: Shape Completion and Animation of People. ACM Transactions on Graphics, 24(3):408–416, July 2005.
 [10] A. O. Balan, L. Sigal, M. J. Black, J. E. Davis, and H. W. Haussecker. Detailed human shape and pose from images. In CVPR, 2007.
 [11] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image. In ECCV, 2016.
 [12] P. T. Choi, K. C. Lam, and L. M. Lui. FLASH: Fast landmark aligned spherical harmonic parameterization for genus-0 closed brain surfaces. SIAM J. Imaging Sciences, 8(1):67–94, 2015.
 [13] E. Dibra, H. Jain, C. Oztireli, R. Ziegler, and M. Gross. Human Shape from Silhouettes using Generative HKS Descriptors and Cross-modal Neural Networks. In CVPR, 2017.
 [14] R. Girdhar, D. Fouhey, M. Rodriguez, and A. Gupta. Learning a Predictable and Generative Vector Representation for Objects. In ECCV, 2016.
 [15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Nets. In NIPS, 2014.
 [16] X. Gu, S. J. Gortler, and H. Hoppe. Geometry Images. ACM Transactions on Graphics, 21(3):355–361, July 2002.
 [17] P. Guan, A. Weiss, A. O. Balan, and M. J. Black. Estimating Human Shape and Pose from a Single Image. In ICCV, 2009.
 [18] H. Zhou, Y. Liu, Z. Liu, P. Luo, and X. Wang. Talking face generation by adversarially disentangled audio-visual representation. In AAAI, 2019.
 [19] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. PAMI, 36(7):1325–1339, 2014.
 [20] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image Translation with Conditional Adversarial Networks. In CVPR, 2017.
 [21] S. Johnson and M. Everingham. Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation. In BMVC, 2010.
 [22] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik. End-to-end Recovery of Human Shape and Pose. In CVPR, 2018.
 [23] H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, M. Nießner, P. Pérez, C. Richardt, M. Zollhöfer, and C. Theobalt. Deep video portraits. ACM Transactions on Graphics (TOG), 2018.
 [24] D. Kingma and J. Ba. ADAM: A method for stochastic optimization. In ICLR, 2015.
 [25] C. Lassner, J. Romero, M. Kiefel, F. Bogo, M. J. Black, and P. V. Gehler. Unite the People: Closing the Loop between 3D and 2D human representations. In CVPR, 2017.
 [26] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics, 34(6):248:1–248:16, Oct. 2015.
 [27] J. Martinez, R. Hossain, J. Romero, and J. Little. A simple yet effective baseline for 3d human pose estimation. In ICCV, 2017.
 [28] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel, W. Xu, D. Casas, and C. Theobalt. VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera. ACM Transactions on Graphics, 36(4), 2017.
 [29] L. Mescheder, A. Geiger, and S. Nowozin. Which Training Methods for GANs do actually Converge? In ICML, 2018.
 [30] F. Moreno-Noguer. 3D Human Pose Estimation from a Single Image via Distance Matrix Regression. In CVPR, 2017.
 [31] S. Nadeem, Z. Su, W. Zeng, A. E. Kaufman, and X. Gu. Spherical Parameterization Balancing Angle and Area Distortions. IEEE Trans. Vis. Comput. Graph., 23(6):1663–1676, 2017.
 [32] K. Nagano, J. Seo, J. Xing, L. Wei, Z. Li, S. Saito, A. Agarwal, J. Fursund, and H. Li. paGAN: real-time avatars using dynamic textures. In SIGGRAPH Asia 2018 Technical Papers, page 258. ACM, 2018.
 [33] R. Natsume, S. Saito, Z. Huang, W. Chen, C. Ma, H. Li, and S. Morishima. SiCloPe: Silhouette-Based Clothed People. In CVPR, 2019.
 [34] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-to-fine volumetric prediction for single-image 3D human pose. In CVPR, 2017.

 [35] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer. GANimation: Anatomically-aware facial animation from a single image. In Proceedings of the European Conference on Computer Vision (ECCV), pages 818–833, 2018.
 [36] A. Pumarola, A. Agudo, L. Porzi, A. Sanfeliu, V. Lepetit, and F. Moreno-Noguer. Geometry-aware network for non-rigid shape prediction from a single view. In CVPR, 2018.

 [37] A. Pumarola, A. Agudo, A. Sanfeliu, and F. Moreno-Noguer. Unsupervised person image synthesis in arbitrary poses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8620–8628, 2018.
 [38] G. Rogez and C. Schmid. MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild. In NIPS, 2016.
 [39] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
 [40] K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann. Stabilizing training of generative adversarial networks through regularization. In NIPS. Curran Associates Inc., 2017.
 [41] B. Sapp and B. Taskar. Modec: Multimodal Decomposable Models for Human Pose Estimation. In CVPR, 2013.
 [42] L. Sigal, A. O. Balan, and M. J. Black. HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion. IJCV, 87(1–2), 2010.
 [43] A. Sinha, A. Unmesh, Q. Huang, and K. Ramani. SurfNet: Generating 3D shape surfaces using deep residual networks. In CVPR, 2017.
 [44] Y. Song, J. Zhu, X. Wang, and H. Qi. Talking face generation by conditional recurrent adversarial network. arXiv preprint arXiv:1804.04786, 2018.
 [45] H. Su, H. Fan, and L. Guibas. A Point Set Generation Network for 3D Object Reconstruction from a Single Image. In CVPR, 2017.
 [46] X. Sun, B. Xiao, S. Liang, and Y. Wei. Integral Human Pose Regression. In ECCV, 2018.
 [47] J. Tan, I. Budvytis, and R. Cipolla. Indirect Deep structured Learning for 3D Human Body Shape and Pose Prediction. In BMVC, 2017.
 [48] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree Generating Networks: Efficient Convolutional Architectures for High-resolution 3D Outputs. In ICCV, 2017.
 [49] D. Tome, C. Russell, and L. Agapito. Lifting from the deep: Convolutional 3D pose estimation from a single image. In CVPR, 2017.

 [50] H.-Y. Tung, H.-W. Tung, E. Yumer, and K. Fragkiadaki. Self-supervised learning of motion capture. In NIPS, 2017.
 [51] G. Varol, D. Ceylan, B. Russell, J. Yang, E. Yumer, I. Laptev, and C. Schmid. BodyNet: Volumetric inference of 3D human body shapes. In ECCV, 2018.
 [52] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid. Learning from synthetic humans. In CVPR, 2017.
 [53] K. Vougioukas, S. Petridis, and M. Pantic. End-to-end speech-driven facial animation with temporal GANs. In BMVC, 2018.
 [54] P.-S. Wang, Y. Liu, Y.-X. Guo, C.-Y. Sun, and X. Tong. O-CNN: Octree-based Convolutional Neural Networks for 3D Shape Analysis. ACM Transactions on Graphics, 36(4), 2017.
 [55] P.-S. Wang, C.-Y. Sun, Y. Liu, and X. Tong. Adaptive O-CNN: A Patch-based Deep Representation of 3D Shapes. ACM Transactions on Graphics, 37(6), 2018.
 [56] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
 [57] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional Pose Machines. In CVPR, 2016.
 [58] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective Transformer Nets: Learning Single-view 3D object Reconstruction without 3D Supervision. In NIPS, 2016.
 [59] W. Yang, W. Ouyang, X. Wang, J. Ren, H. Li, and X. Wang. 3D human pose estimation in the wild by adversarial learning. In CVPR, 2018.
 [60] F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao. LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop. arXiv:1506.03365, 2015.
 [61] A. Zanfir, E. Marinoiu, M. Zanfir, A.-I. Popa, and C. Sminchisescu. Deep network for the integrated 3D sensing of multiple people in natural images. In Advances in Neural Information Processing Systems, pages 8420–8429, 2018.
 [62] X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei. Towards 3D human pose estimation in the wild: A weakly-supervised approach. In ICCV, 2017.