Appearance Consensus Driven Self-Supervised Human Mesh Recovery

08/04/2020 · Jogendra Nath Kundu et al.

We present a self-supervised human mesh recovery framework to infer human pose and shape from monocular images in the absence of any paired supervision. Recent advances have shifted the interest towards directly regressing parameters of a parametric human model by supervising them on large-scale datasets with 2D landmark annotations. This limits the generalizability of such approaches to operate on images from unlabeled wild environments. Acknowledging this, we propose a novel appearance consensus driven self-supervised objective. To effectively disentangle the foreground (FG) human, we rely on image pairs depicting the same person (consistent FG) in varied pose and background (BG), which are obtained from unlabeled wild videos. The proposed FG appearance consistency objective makes use of a novel, differentiable Color-recovery module to obtain vertex colors without the need for any appearance network, via an efficient realization of color-picking and reflectional symmetry. We achieve state-of-the-art results on the standard model-based 3D pose estimation benchmarks at comparable supervision levels. Furthermore, the resulting colored mesh prediction opens up the usage of our framework for a variety of appearance-related tasks beyond pose and shape estimation, thus establishing our superior generalizability.


1 Introduction

Inferring highly deformable 3D human pose and shape from in-the-wild monocular images has been a longstanding goal in the vision community [hogg1983model]. This is considered a key step for a wide range of downstream applications such as robot interaction, rehabilitation guidance, and the animation industry. Being one of the important subtasks, human pose estimation has gained considerable performance improvements in recent years [sun2018integral, martinez2017simple, rogez2017lcr], but in a fully-supervised setting. Such approaches heavily rely on large-scale 2D or 3D pose annotations. Following this, parametric models of the human body, such as SCAPE [anguelov2005scape], SMPL [loper2015smpl], and SMPL(-X) [pavlakos2019expressive, romero2017embodied], lead the way toward full 3D pose and shape estimation. Additionally, to suppress the inherent 2D-to-3D ambiguity, researchers have also utilized auxiliary cues of supervision such as temporal consistency [arnab2019exploiting, sun2019human], multi-view image pairs [rhodin2016general, joo2018total, huang2017towards], or even alternate sensor data from Kinect [weiss2011home] or IMUs [von2017sparse]. However, estimating 3D human pose and shape from a single RGB image without relying on any direct supervision remains a very challenging problem.

Early approaches [bogo2016keep, Guan2009estimating, LassnerClosing] adopt iterative optimization techniques to fit a parametric human model (e.g. SMPL) to a given image observation. These works attempt to iteratively estimate the body pose and shape that best describe the available 2D observation, which is most often a set of 2D landmark annotations. Though such approaches usually obtain good body fits, they are slow and rely heavily on 2D landmark annotations [andriluka20142d, johnson2010clustered, kundu2018ispa] or on predictions of an off-the-shelf, fully-supervised image-to-2D-pose network. However, recent advances in deep learning have shifted the interest towards data-driven, regression-based methods [kanazawa2018end, tung2017self], where a deep network directly regresses parameters of the human model for a given input image [omran2018neural, pavlakos2018learning, zanfir2018deep] in a single-shot computation. This is a promising direction, as the network can utilize the full image information instead of just the sparse landmarks to estimate human body shape and pose. In the absence of datasets having images with 3D pose and shape ground-truth (GT), several recent works leverage a variety of available paired 2D annotations [texturepose, tan2018indirect], such as 2D landmarks or silhouettes [pavlakos2018learning], alongside unpaired 3D pose samples to instill 3D pose priors [kanazawa2018end] (i.e. to assure recovery of valid 3D poses). The strong reliance on paired 2D keypoint ground-truth limits the generalization of such approaches when applied to images from an unseen wild environment. Given the transient nature of human fashion, the visual appearance of human attire keeps evolving; this demands that such approaches periodically update their 2D pose dataset in order to retain their functionality.

Figure 1: Our framework disentangles the co-salient FG human from input image pairs. The resulting colored mesh prediction opens up its usage for a variety of tasks.

In this work, the overarching objective is to move away from any kind of paired pose-related supervision for superior generalizability. Our aim is to explore a form of self-supervised objective which can learn both pose and shape from monocular images without accessing any paired GT annotations. We draw motivation from works [mathieu2016disentangling, rifai2012disentangling, ma2018disentangled, kundu2020self] that aim to disentangle the fundamental factors of variation from a given image. For human-centric images [kundu2020kinematic], these factors could be: a) pose, b) foreground (FG) appearance, and c) background (BG) appearance. Here, we take full advantage of incorporating a parametric human model in our framework. Note that this parametric model not only encapsulates the pose but also segregates the FG region from the BG, which is enabled by projecting the 3D mesh onto the image plane. Thus, the problem boils down to a faithful registration of the 3D mesh onto the image plane, or, in other words, disentanglement of FG from BG. To achieve this disentanglement, we rely on image pairs depicting consistent FG appearance but varied 3D poses. Such image pairs can be obtained from videos depicting actions of a single person, which are abundantly available on the internet. Our idea stems from the concept of co-saliency detection [zhang2016co, hsu2018unsupervised], where the objective is to segment out the common, salient FG from a set of two or more images. Surprisingly, this idea works best for image pairs sampled from wild videos as compared to videos captured in a constrained in-studio setup (static homogeneous background). This is because, in wild scenarios, the common FG is distinctly salient against relatively diverse BGs as a result of substantial camera movements (see Fig. 1B). Thus, in contrast to prior self-supervised approaches that either rely on videos with static BG [rhodin2018unsupervised] or operate under the assumption of BG commonness between temporally close frames [jakab2018unsupervised], our approach is better suited to learning from wild videos and hence generalizes better.

| Model-based methods | 2D keypoint supervision | Temporal supervision | Colored mesh prediction |
|---|---|---|---|
| [kanazawa2018end, kolotouros2019learning, kolotouros2019nal, pavlakos2018learning, omran2018neural] | Yes | No | No |
| [sun2019human, arnab2019exploiting, kanazawa2019learning] | Yes | Yes | No |
| Ours (self-sup.) | No | No | Yes |

Table 1: Characteristic comparison against prior arts.

In the proposed framework, we first employ a CNN regressor to obtain the parameters (both pose and shape) of the SMPL model for a given input image. The human mesh model uses these parameters to output the mesh vertex locations. In contrast to the general trend [alp2018densepose, kanazawa2018learning], we propose a novel way of inferring mesh texture where the network's burden of regressing vertex colors or any sort of appearance representation (such as a UV map) is entirely taken away. This is realized via a differentiable Color-recovery module which assigns color to the mesh vertices via spatial registration of the mesh over the image plane, while effectively accounting for the challenges of mesh-vertex visibility, such as self- and inter-part occlusions. To obtain a fully-colored mesh, we use predefined 4-way symmetry groups (front-back and left-right) to propagate color from the camera-visible vertices to the non-visible ones in a fully differentiable fashion.

We pass a given image pair through two parallel pathways of our colored mesh prediction framework (see Fig. 1A). The commonness of FG appearance allows us to impose an appearance consistency loss between the predicted mesh representations. In the absence of any paired supervision, this appearance consistency not only helps us segregate the common FG human from their respective wild BGs but also discovers the required pose deformation in a fully self-supervised manner. The proposed reflectional symmetry module brings a substantial advantage to our self-supervised framework by allowing us to impose appearance consistency even between body parts which are "commonly invisible" in both images. Recognizing that consistency of raw color intensities can easily be violated by illumination changes, we also propose a part-prototype consistency objective. This aims to match a higher-level appearance representation beyond raw color intensities, enabled by operating the Color-recovery module on convolutional feature maps instead of the raw image. Additionally, to regularize the self-supervised framework, we also impose a shape consistency loss alongside a 3D pose prior learned from a set of unpaired MoCap samples. Note that, at test time, we perform single-image inference to estimate 3D human pose and shape.

In summary, we make the following main contributions:

  • We propose a self-supervised learning technique to perform simultaneous pose and shape estimation which uses image pairs sampled from in-the-wild videos in the absence of any paired supervision.

  • The proposed Color-recovery module completely eliminates the network’s burden to regress any appearance-related representation via efficient realization of color-picking and reflectional symmetry. This best suits our self-supervised framework which relies on FG appearance consistency.

  • We demonstrate the generalizability of our framework on unseen wild datasets. We achieve state-of-the-art results against prior model-based pose estimation approaches when tested at comparable supervision levels.

2 Related Work

Vertex-color reconstruction. In the literature, there are different ways to infer a textured 3D mesh from a monocular RGB image. Certain approaches [l2019differ, song2017semantic] train a deep network to directly regress 3D features (RGB colors) for individual vertices. In the second kind, a fully convolutional deep network is trained to map the location of each pixel to the corresponding continuous UV-map coordinate parameterization [alp2018densepose]. In the third kind, the deep model is trained to directly regress the UV image [kanazawa2018learning]. Note that the spatial structure of the UV image differs considerably from that of the input image, which prevents the use of a fully-convolutional network for this task. The recently proposed Soft Rasterizer [liu2019soft] uses a color-selection and color-sampling network whose outputs are processed to obtain the final vertex colors. All of the above approaches obtain the mesh color in a learnable way (i.e. as a neural network output). In such cases, the deep network requires substantial training iterations to instill the knowledge of pre-defined UV mapping conventions. We believe this is an additional burden for the network, especially in the absence of any auxiliary paired supervision.

Model-based human mesh estimation. Recently, parametric human models [anguelov2005scape, loper2015smpl] have been used as the output target for the simultaneous pose and shape estimation task. Such a well-defined mesh model with ordered vertices provides a direct mapping to the corresponding 3D pose and part segments. Both optimization [bogo2016keep, LassnerClosing, zanfir2018monocular] and regression [kanazawa2018end, omran2018neural, pavlakos2018learning, zanfir2018deep] based approaches estimate the body pose and shape that best describe the available 2D observations, such as 2D keypoints [kanazawa2018end], silhouettes [pavlakos2018learning], or body/part segmentation [omran2018neural]. Due to the lack of datasets having wild images with 3D pose and shape GT, most of the above approaches fully rely on the availability of 2D keypoint annotations [andriluka20142d, lin2014microsoft], followed by different variants of a 2D reprojection loss [tan2018indirect, tung2017self] (see Table 1).

Use of auxiliary supervision. In the absence of any shape supervision, certain prior works also leverage full mesh supervision available from synthetically rendered human images [varol2017learning] or images with fairly successful body fits [LassnerClosing]. Furthermore, multi-view image pairs have also been used for 3D pose [rhodin2018unsupervised] and shape estimation [hofmann2009multi, liang2019shape] via enforcing consistency of the canonical 3D pose across multiple views. Liang et al. [liang2019shape] use a multi-stage regressor for multi-view images to further reduce the projection ambiguity, in order to obtain better estimates of the 3D human body under clothing. To instill a strong 3D pose prior, Zhou et al. [zhou2017towards] make use of a left-right symmetric bone-length constraint for the skeleton-based 3D pose estimation task. Further, to assure recovery of valid 3D poses for the model-based pose estimation task, Kanazawa et al. [kanazawa2018end] enforce a learned human pose and shape prior via adversarial networks using unpaired samples of plausible 3D poses and shapes. With the advent of differentiable renderers [henderson19ijcv, kato2018renderer], certain methods supervise 3D shape and pose estimation through a textured mesh prediction network to encourage matching of the rendered texture image with the image FG [kanazawa2018learning], alongside the 2D keypoint supervision [texturepose].

3 Approach

We aim to discover the 3D human pose and shape from unlabeled image pairs of consistent FG appearance. During training, we assume access to a parametric human mesh model to aid our self-supervised paradigm. The mesh model provides a low dimensional parametric representation of variations in human shape and pose deformations. However, by design, this model is unaware of the plausibility restrictions of human pose and shape. Thus, it is prone to implausible poses and self-penetrations specifically in the absence of paired 3D supervision [kanazawa2018end]. Therefore, to constrain the pose predictions, we assume access to a pool of human 3D pose samples to learn a 3D pose prior.

Fig. 2 shows an overview of our training approach. For a given image pair, two parallel pathways of shared CNN regressors predict the human shape and pose parameters alongside the required camera settings to segregate the co-salient FG human. Moreover, to realize a colored mesh representation, we develop a differentiable Color-recovery module which infers mesh vertex colors directly from the given image without employing any explicit appearance extraction network.

3.1 Representation and notations

Human mesh model. We employ the widely used SMPL body model [loper2015smpl], which parameterizes a triangulated human mesh of N = 6890 vertices. This model factorizes the mesh deformations into shape β and pose θ defined over the underlying skeleton joints [kanazawa2018end]. We use the first 10 PCA coefficients of the shape space as a compact shape representation, β ∈ R^10, in line with [kanazawa2018end], and the pose θ is parameterized as parent-relative joint rotations in the axis-angle representation. The differentiable SMPL function outputs mesh vertex locations in a canonical 3D space, represented as V ∈ R^{N×3}. The corresponding 3D pose J (i.e. the 3D locations of the skeleton joints) is obtained from V using a pre-trained linear regressor, i.e. a fixed weight matrix applied to the vertices. The RGB colors of the mesh vertices are denoted as C = CRM(V, I), where CRM is the Color-recovery module operating on the input image I; for each vertex index n, C[n] stores the corresponding RGB color intensities. As shown in Fig. 2, we use subscripts a and b to associate these terms with the respective input images I_a and I_b.
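To make this parameterization concrete, the snippet below is a minimal sketch (not the authors' code) that obtains the canonical vertices V and regressed 3D joints using the third-party smplx package; the model path is a placeholder and the zero-pose inputs are purely illustrative.

```python
# Minimal sketch of querying the SMPL model via the third-party `smplx` package;
# "SMPL_MODEL_DIR" is a placeholder path, not part of the authors' release.
import torch
import smplx

smpl = smplx.create(model_path="SMPL_MODEL_DIR", model_type="smpl")

B = 2
betas = torch.zeros(B, 10)           # first 10 PCA shape coefficients (beta)
body_pose = torch.zeros(B, 69)       # 23 parent-relative joint rotations in axis-angle (theta)
global_orient = torch.zeros(B, 3)    # root orientation in axis-angle

out = smpl(betas=betas, body_pose=body_pose, global_orient=global_orient)
V = out.vertices                     # (B, 6890, 3) canonical mesh vertices V
J = out.joints                       # 3D joints regressed from V by a fixed linear regressor
```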

Camera model. We define a weak-perspective camera model using a global orientation R in axis-angle representation (3 angle parameters), a translation t ∈ R^2, and a scale s. Given these parameters, the 2D camera-space coordinate of the 3D mesh vertex with index n is obtained as u_n = sΠ(R V[n]) + t, where Π denotes orthographic projection and u_n lies in the space of image coordinates. Similarly, the camera-projected 2D joint locations (the 2D pose) are expressed as x = sΠ(R J) + t.
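A minimal sketch of this weak-perspective projection is given below; R is assumed here to already be the 3×3 rotation matrix recovered from the regressed axis-angle orientation.

```python
# Sketch of the weak-perspective camera described above (batched PyTorch tensors).
import torch

def weak_perspective_project(V, R, s, t):
    """V: (B, N, 3) canonical vertices, R: (B, 3, 3), s: (B, 1) scale, t: (B, 2) translation."""
    V_cam = torch.einsum('bij,bnj->bni', R, V)     # rotate vertices into the camera frame
    u = V_cam[..., :2]                             # orthographic projection Pi: drop the Z axis
    return s[:, None, :] * u + t[:, None, :]       # u_n = s * Pi(R V[n]) + t, in image coords
```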

3.2 Mesh estimation architecture

For a given monocular image I as input, we first employ a CNN regressor to predict the SMPL parameters (i.e. θ and β) alongside the camera parameters (R, t, s). This is followed by the Color-recovery module. The prime functionality of this module is to assign color to the 3D mesh vertices V based on the corresponding image-space coordinates u obtained via camera projection. However, a reliable color assignment requires us to segregate the vertices based on the following two important criteria.

a) Non-camera-facing vertices: First, the camera-facing vertices are separated from the non-camera-facing ones using the mesh vertex normals. Here, a vertex normal is computed as the normalized average of the surface normals of the faces connected to the given vertex. We first transform these normals from the default canonical system to the camera coordinate system. Following this, the Z-components of the camera-space normals, denoted n_z, are used to segregate the non-camera-facing vertices via a sigmoid operation, as shown in Fig. 2.

b) Camera-facing, self-occluded vertices: Note that n_z alone cannot be used to select all the camera-visible vertices in the presence of inter-part occlusions (see Fig. 2). In such a scenario, there exist mesh vertices which face the camera but are obscured by other camera-facing vertices that are closer to the camera in 3D. This calls for modeling the relative depth of the mesh vertices as the second criterion, so as to reliably select the vertices which are closest to the camera among all the camera-facing vertices projected to a certain spatial region. To realize this, we utilize the camera-space depths d, which store the Z-component (or depth) of each vertex location in the camera-transformed space.

3.2.1 Color-recovery module.

In the absence of any appearance-related features, we realize a spatial depth map using a fast differentiable renderer [henderson19ijcv], where the camera-space depths d of the mesh vertices are treated as the color intensities for the rendering pipeline. The resultant depth map is represented as D(u), where u spans the space of spatial indices. The general idea is to use this depth map as a margin. More concretely, for effective color assignment, one must select the spatially modulated mesh vertices which have the least absolute depth difference with respect to the above-defined depth margin. To realize this, we compute a depth difference Δd_n = d_n − D(u_n), where D(u_n) is computed by performing bilinear sampling on D at the projected vertex coordinates u_n. In accordance with the above discussion, we formulate a visibility-aware weighting w_n which takes into account both of the above criteria required for an effective mesh-vertex selection. Here, the first term performs a soft selection by assigning a higher weight value (close to 1) to mesh vertices whose camera-space depth is in agreement with the depth margin D(u_n), and vice versa. The second term, a sigmoid function with a higher steepness applied to n_z, rejects the non-camera-facing mesh vertices by attributing a low (close to 0) weighting value. Refer to Fig. 2 for a visual illustration.
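The listing below is one plausible realization of this visibility-aware weighting, written as our own sketch rather than the authors' exact formulation: the hyper-parameters tau and steepness, the exponential depth term, and the sign convention for n_z are all assumptions.

```python
# Sketch of a visibility-aware weighting w_n combining depth agreement and camera-facing checks.
import torch
import torch.nn.functional as F

def visibility_weights(n_z, d, depth_map, uv_norm, steepness=50.0, tau=0.05):
    """n_z, d: (B, N) camera-space normal Z-component and depth per vertex;
    depth_map: (B, 1, H, W) rendered depth; uv_norm: (B, N, 2) vertex coords in [-1, 1]."""
    # Depth margin D(u_n): bilinearly sample the rendered depth map at each projected vertex.
    D = F.grid_sample(depth_map, uv_norm[:, :, None, :], align_corners=False)[:, 0, :, 0]
    depth_term = torch.exp(-torch.abs(d - D) / tau)   # close to 1 when d_n agrees with D(u_n)
    facing_term = torch.sigmoid(-steepness * n_z)     # steep sigmoid rejects non-camera-facing vertices
    return depth_term * facing_term                   # (B, N) soft visibility weights w
```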

Figure 2: The proposed self-supervised framework makes use of a differentiable Color-recovery module to recover fully colored mesh vertices. Yellow circle: the camera-facing criterion alone does not account for inter-part occlusion. Green circle: the visibility-aware weighting accounts for inter-part occlusion. Blue circle: fully colored mesh vertices via reflectional symmetry.

Intermediate vertex color assignment. The above-defined visibility-aware weighting is employed to realize a primary vertex color assignment. We denote C̃ as the intermediate vertex colors, where C̃[n] stores the RGB color intensities acquired from the given input image I for vertex n. The primary vertex colors are obtained as C̃[n] = (2w_n − 1) · I(u_n), where I(u_n) stores the RGB color intensities at the spatial coordinates u_n, realized via bilinear sampling on the input RGB image I. The scaled weighting 2w_n − 1 assigns a negative weight to vertices having low visibility. This assigns negative color intensities to the corresponding vertices, thereby allowing a distinction between less-bright (near-black) colors and unassigned vertices.
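The color-picking step can be sketched in a few lines; the (2w − 1) rescaling below follows our reading of the description above and is not a verified implementation detail.

```python
# Sketch of the intermediate color assignment: sample image colors at projected
# vertices and scale them by the visibility weights mapped from [0, 1] to [-1, 1].
import torch
import torch.nn.functional as F

def pick_vertex_colors(image, uv_norm, w):
    """image: (B, 3, H, W) RGB in [0, 1]; uv_norm: (B, N, 2) in [-1, 1]; w: (B, N)."""
    sampled = F.grid_sample(image, uv_norm[:, :, None, :], align_corners=False)  # (B, 3, N, 1)
    sampled = sampled[..., 0].permute(0, 2, 1)        # (B, N, 3) image colors at u_n
    return (2.0 * w - 1.0)[..., None] * sampled       # negative values mark low-visibility vertices
```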

3.2.2 Vertex color assignment via reflectional symmetry.

Here, the prime objective is to propagate the reliable color intensities from the assigned vertices to the unreliable/unassigned ones. The idea is to use reflectional symmetry as prior knowledge by accessing a predefined set of reflectional groups. For each group id g, a set of 4 vertices is identified according to left-right and front-back symmetry which would share the same color property (except for the vertices belonging to the head, where only left-right symmetry is used). This symmetry knowledge is stored as a multi-hot encoding m_g ∈ {0, 1}^N, which contains four ones indicating the vertex members of symmetry group g. All the symmetry groups are combined in a symmetry-encoding matrix M whose rows are the vectors m_g. This multi-hot symmetry-group representation helps us perform a fully differentiable vertex color assignment for all the vertices, including the occluded and non-camera-facing ones.

To realize the final vertex colors C, we first estimate a group color c_g for each group g, computed as c_g = m_g · (w ⊙ C̃) / (m_g · w), where · denotes the dot product between the N-dimensional vectors and ⊙ denotes element-wise weighting of the intermediate vertex colors by their visibility weights w. The group color can thus be interpreted as a combination of the intermediate vertex colors weighted by their visibility. This effectively handles the cases when only one or more of the vertices in a group are initially colored (visible): when visibility is active for only a single vertex among the four vertices in a symmetry set, c_g reduces to that vertex's intermediate color; when visibility is active for all 4 vertices, c_g averages over them; the intermediate cases interpolate between the two. Finally, the group colors are directly propagated to all the mesh vertices using a matrix multiplication, i.e. C = M^T c, where c stacks the group colors c_g (see Suppl for more details).
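A compact sketch of this propagation step is given below; it assumes each vertex belongs to exactly one reflectional group encoded in the multi-hot matrix M, which is a simplification for illustration.

```python
# Sketch of symmetry-based color propagation: visibility-weighted group colors scattered back.
import torch

def propagate_symmetry_colors(C_int, w, M, eps=1e-6):
    """C_int: (B, N, 3) intermediate colors; w: (B, N) visibility; M: (G, N) multi-hot groups."""
    weighted = w[..., None] * C_int                                  # visibility-weighted colors
    group_color = torch.einsum('gn,bnc->bgc', M, weighted)           # sum over each group's members
    group_weight = torch.einsum('gn,bn->bg', M, w)[..., None] + eps  # total visibility per group
    group_color = group_color / group_weight                         # c_g: weighted average color
    return torch.einsum('ng,bgc->bnc', M.t(), group_color)           # C = M^T c, for all vertices
```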

3.3 Self-supervised learning objectives

For a given image pair, denoted as I_a and I_b (depicting the same person in diverse poses and BGs), we forward both images through two parallel pathways of our colored mesh estimation architecture (see Fig. 2). The commonness of FG appearance allows us to impose an appearance consistency loss between the predicted fully colored mesh representations.

a) Color consistency. First, we impose the following consistency loss:

L_color = λ_cv L_cv + λ_full L_full, where L_cv = ||(w_a ⊙ w_b) ⊙ (C̃_a − C̃_b)|| and L_full = ||C_a − C_b||.

Here, ⊙ denotes element-wise multiplication. Note that L_cv enforces vertex-color consistency on the co-visible mesh vertices (weighted by w_a ⊙ w_b), i.e. the vertices which are visible in both the mesh representations obtained from the image pair (I_a, I_b), whereas L_full enforces full vertex-color consistency. L_color combines both losses, providing a higher weightage to the co-visible vertex colors than to the approximate full color representation, considering the approximate nature of the symmetry assumption.
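The sketch below illustrates this combined objective; the L1 distance and the particular weight values lam_cv > lam_full are assumptions for illustration, not the authors' reported choices.

```python
# Sketch of the color-consistency objective over the two pathways.
import torch

def color_consistency_loss(Ct_a, Ct_b, C_a, C_b, w_a, w_b, lam_cv=1.0, lam_full=0.5):
    """Ct_*: (B, N, 3) intermediate colors; C_*: (B, N, 3) symmetry-completed colors; w_*: (B, N)."""
    co_vis = (w_a * w_b)[..., None]                        # co-visibility weights
    loss_cv = (co_vis * (Ct_a - Ct_b).abs()).mean()        # consistency on co-visible vertices
    loss_full = (C_a - C_b).abs().mean()                   # consistency on the full colored mesh
    return lam_cv * loss_cv + lam_full * loss_full
```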

b) Part-prototype consistency. The proposed Color-recovery module can also be applied to convolutional feature maps. For a given vertex n and a convolutional feature map F, we sample the feature F(u_n) at the projected vertex location. We further define a fixed vertex-to-part-segmentation mapping which stores a set of vertex indices P_j for each body part j. Using the vertex visibility weighting w, one can then obtain a prototype appearance feature for each body part j, computed as φ_j = Σ_{n∈P_j} w_n F(u_n) / Σ_{n∈P_j} w_n. Following this, we enforce a prototype consistency loss between the image pairs as L_pp = Σ_j ||φ_j^a − φ_j^b||. Note that the prototype feature computation is inherently aware of inter-part occlusions as a result of incorporating the visibility weighting w. As compared to enforcing vertex-color consistency (i.e. on the raw color intensities), the part-prototype consistency aims to match a higher-level semantic abstraction of the part appearances extracted from the image pair (e.g. checkered regular patterns versus just plain individual colors). This also helps us overcome the unreliability of raw vertex colors which could arise due to illumination differences. Motivated by the perceptual loss idea [johnson2016perceptual], we obtain F_a and F_b as the Conv2-1 features corresponding to I_a and I_b from an ImageNet-trained (frozen) VGG-16 network [simonyan2014very].
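A sketch of the prototype computation follows; the torchvision layer index assumed to correspond to conv2_1 and the L1 prototype distance are our assumptions for illustration.

```python
# Sketch of visibility-aware part prototypes phi_j over frozen VGG-16 features.
import torch
import torch.nn.functional as F
import torchvision

vgg_conv2_1 = torchvision.models.vgg16(weights="IMAGENET1K_V1").features[:7].eval()
for p in vgg_conv2_1.parameters():
    p.requires_grad_(False)                                # frozen perceptual feature extractor

def part_prototypes(feat, uv_norm, w, part_ids, num_parts):
    """feat: (B, C, H, W); uv_norm: (B, N, 2) in [-1, 1]; w: (B, N); part_ids: (N,) long tensor."""
    f = F.grid_sample(feat, uv_norm[:, :, None, :], align_corners=False)[..., 0]  # (B, C, N)
    f = f.permute(0, 2, 1)                                 # (B, N, C) per-vertex features F(u_n)
    P = F.one_hot(part_ids, num_parts).float()             # (N, J) fixed vertex-to-part mapping
    num = torch.einsum('nj,bnc->bjc', P, w[..., None] * f) # visibility-weighted sum per part
    den = torch.einsum('nj,bn->bj', P, w)[..., None] + 1e-6
    return num / den                                       # (B, J, C) part prototypes phi_j

# Prototype consistency between the two pathways (L1 distance assumed):
# loss_pp = (part_prototypes(vgg_conv2_1(img_a), uv_a, w_a, part_ids, num_parts)
#            - part_prototypes(vgg_conv2_1(img_b), uv_b, w_b, part_ids, num_parts)).abs().mean()
```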

c) Shape consistency. We also enforce a shape consistency loss between the shape parameters obtained from the image pair, i.e. L_shape = ||β_a − β_b||. Almost all prior works [kanazawa2018end, pavlakos2018learning, texturepose] utilize an unpaired human shape dataset to enforce plausibility of the shape predictions via an adversarial prior. However, in the proposed self-supervised framework, we do not access any human shape dataset. To regularize the shape parameters during the initial training iterations, we enforce a loss on the shape predictions with respect to a fixed mean shape. After gaining a decent mesh estimation performance, we gradually reduce the weightage of this loss, allowing shape variations beyond the mean shape driven by the proposed appearance and shape consistency objectives.

d) Enforcing validity of pose predictions. Additionally, to assure validity of the predicted pose parameters, we train an adversarial auto-encoder [makhzani2015adversarial] to realize a continuous human pose manifold [kundu2019bihmp, kundu2019unsupervised] mapped from a latent pose representation z. This is trained using an unpaired 3D human pose dataset. The frozen pose decoder obtained from this generative framework is directly employed as a module with an instilled human 3D pose prior. More concretely, a tanh non-linearity on the pose-prediction head of the CNN regressor (in line with the bounded latent pose z), followed by the frozen pose decoder, prevents implausible pose predictions during our self-supervised training. In contrast to enforcing an adversarial pose prior objective [kanazawa2018end, texturepose], the proposed setup greatly simplifies our training procedure (devoid of discriminator training).
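The plumbing of this pose prior can be sketched as below; the latent dimensionality, hidden width, and placeholder decoder are assumptions standing in for the pre-trained adversarial auto-encoder decoder.

```python
# Sketch of the pose head: tanh-bounded latent z followed by a frozen pose decoder.
import torch
import torch.nn as nn

class PosePriorHead(nn.Module):
    def __init__(self, feat_dim=2048, z_dim=32, pose_dim=69, frozen_decoder=None):
        super().__init__()
        self.to_z = nn.Linear(feat_dim, z_dim)
        # In practice this is the frozen decoder of the adversarial auto-encoder trained
        # on unpaired MoCap poses; here a placeholder MLP stands in for it.
        self.decoder = frozen_decoder or nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, pose_dim))
        for p in self.decoder.parameters():
            p.requires_grad_(False)              # the learned pose prior stays fixed

    def forward(self, feat):
        z = torch.tanh(self.to_z(feat))          # bounded latent keeps z on the pose manifold
        return self.decoder(z)                   # axis-angle body pose constrained by the prior
```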

In the absence of paired supervision, the parameters of the shared CNN regressor are trained by directly enforcing the above consistency losses, i.e. L_color, L_pp, and L_shape.

4 Experiments

We perform thorough experimental analysis to demonstrate the generalizability of our framework across several datasets on a variety of tasks.

Implementation details. We use a ResNet-50 [he2016identity] initialized from ImageNet as the base CNN network. The average-pooled last-layer features are forwarded through a series of fully-connected layers to regress the pose (the latent pose encoding z), shape, and camera parameters. Note that the series of differentiable operations following the CNN regressor does not include any trainable parameters, even for estimating the vertex colors. During training, we optimize the individual loss terms at alternate training iterations using the Adam optimizer [kingma2014adam]. We enforce prediction of the mean shape for the initial 100k training iterations. We also impose a silhouette loss on the predicted human mesh with respect to a pseudo silhouette ground-truth obtained either by using an unsupervised saliency detection method [zhu2014saliency] or by using a background estimate, as favourable for static-camera scenarios [rhodin2018unsupervised].
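The alternating optimization of loss terms can be sketched as the loop below; regressor, pair_loader, compute_losses, and the loss keys are placeholders for this illustration, not the authors' actual interfaces.

```python
# Sketch of alternating one loss term per iteration with Adam.
import itertools
import torch

optimizer = torch.optim.Adam(regressor.parameters(), lr=1e-4)          # placeholder model
loss_cycle = itertools.cycle(["color", "part_prototype", "shape", "silhouette"])

for step, (img_a, img_b) in enumerate(pair_loader):                    # placeholder data loader
    losses = compute_losses(regressor, img_a, img_b)                   # dict of individual loss terms
    active = next(loss_cycle)                                          # alternate the enforced objective
    optimizer.zero_grad()
    losses[active].backward()
    optimizer.step()
```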

Figure 3: Qualitative results. In each panel, the 1st column depicts the input image, the 2nd column our colored mesh prediction, and the 3rd column the model-based part segments. Our model fails (in magenta) in the presence of complex inter-part occlusions.

Datasets. We sample image pairs with diverse BGs (pairs with large L2 distance) from the following standard datasets: Human3.6M [ionescu2013human3], MPII [andriluka20142d], MPI-INF-3DHP [mehta2017monocular], and an in-house collection of wild YouTube videos. In contrast to the in-studio datasets with hardly any camera movement, implying a static BG [ionescu2013human3], the videos collected from YouTube exhibit diverse camera movements (e.g. parkour and free-running videos). We prune the raw video samples using a person detector [ren2015faster] to obtain reliable human-centric crops as required for the mesh estimation pipeline (see Suppl). The unpaired 3D pose dataset required to train the 3D pose prior is obtained from CMU-MoCap (also used in MoSh [loper2014mosh]).

a) Human3.6M This is a widely used dataset consisting of images paired with 3D pose annotations of actors imitating various day-to-day tasks in a controlled in-studio environment. Adhering to well-established standards [kanazawa2018end], we consider subjects S1, S6, S7, S8 for training, S5 for validation, and S9, S11 for evaluation, in both Protocol-1 [rhodin2018unsupervised, rhodin2018learning] and Protocol-2 [kanazawa2018end].

b) LSP A standard 2D pose dataset consisting of wild athletic actions. We access the LSP test set with silhouette and part-segment annotations as given by Lassner et al. [LassnerClosing]. In the absence of any standard shape evaluation dataset, segmentation results are considered a proxy for shape-fitting performance [kanazawa2018end, kolotouros2019nal].

c) 3DPW We also evaluate on the 3D Poses in the Wild dataset [von2018recovering]. We do not train on 3DPW and use it only to evaluate our cross-dataset generalizability [kundu2020unsupervised]. We compute the mean per joint position error (MPJPE) [ionescu2013human3], both before and after rigid alignment. Rigid alignment is done via Procrustes Analysis [gower1975generalized]. MPJPE computed post Procrustes alignment is denoted by PA-MPJPE.
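For completeness, the listing below sketches PA-MPJPE for a single frame, i.e. MPJPE computed after a similarity Procrustes alignment of the predicted joints to the ground truth; the implementation is our own illustration of the standard procedure.

```python
# Sketch of MPJPE after Procrustes (similarity) alignment for one frame of (J, 3) joints.
import numpy as np

def pa_mpjpe(pred, gt):
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    Xp, Xg = pred - mu_p, gt - mu_g                   # center both point sets
    U, S, Vt = np.linalg.svd(Xp.T @ Xg)               # SVD of the cross-covariance
    D = np.eye(3)
    D[2, 2] = np.sign(np.linalg.det(U @ Vt))          # guard against reflections
    R = U @ D @ Vt                                    # optimal rotation
    scale = (S * np.diag(D)).sum() / (Xp ** 2).sum()  # optimal similarity scale
    aligned = scale * Xp @ R + mu_g                   # rigidly aligned prediction
    return np.linalg.norm(aligned - gt, axis=1).mean()
```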

Figure 4: A. Qualitative results on single-image colored human mesh recovery. The model fails in the presence of complex inter-limb occlusions (magenta box). B. Qualitative analysis demonstrating the importance of incorporating the part-prototype consistency L_pp to extract relevant part semantics.

4.1 Ablative study

To analyze the effectiveness of the individual self-supervised consistency objectives, we perform ablations by removing certain losses, as shown in Table 2. First, we train Baseline-1 by enforcing L_full and L_shape. Following this, in Baseline-2 we enforce L_color by incorporating L_cv, which further penalizes color inconsistency between the vertices that are commonly visible in both mesh representations. This results in a marginal improvement in performance. Moving forward, we recognize a clear limitation in our assumption of FG color consistency (raw RGB intensities), which can easily be violated by illumination differences. Further, the assumption of left-right and front-back symmetry in apparel color can also be violated, specifically for asymmetric upper-body apparel. As a solution, the proposed part-prototype consistency objective L_pp tries to match a higher-level appearance representation beyond just raw color intensities (see Fig. 4B), resulting in a significant performance gain (Ours(unsup) in Table 2). Note that L_pp is possible as a consequence of the proposed differentiable Color-recovery module.

Table 2: Ablative study (on Human3.6M) analyzing the importance of the self-supervised objectives (first 3 rows), and results at varied degrees of paired supervision (last 3 rows). P1 and P2 denote MPJPE and PA-MPJPE in Protocol-1 and Protocol-2, respectively.

| Methods | P1 (↓) | P2 (↓) |
|---|---|---|
| Baseline-1 (L_full + L_shape) | 127.1 | 101.2 |
| Baseline-2 (L_color + L_shape) | 119.6 | 97.4 |
| Ours(unsup.) (L_color + L_shape + L_pp) | 110.8 | 90.5 |
| Ours(multi-view-sup) | 102.1 | 74.1 |
| Ours(weakly-sup) | 86.4 | 58.2 |
| Ours(semi-sup) | 73.8 | 48.1 |
Table 3: Evaluation on the wild 3DPW dataset in a fully unseen setting. Note that, in contrast to Temporal-HMR [kanazawa2019learning], we do not use any temporal supervision. Methods in the first 5 rows use equivalent 2D and 3D pose supervision and are thus directly comparable.

| Methods | MPJPE (↓) | PA-MPJPE (↓) |
|---|---|---|
| Martinez et al. [martinez2017simple] | - | 157.0 |
| SMPLify [bogo2016keep] | 199.2 | 106.1 |
| TP-Net [Dabral_2018_ECCV] | 163.7 | 92.3 |
| Temporal-HMR [kanazawa2019learning] | 127.1 | 80.1 |
| Ours(semi-sup) | 125.8 | 78.2 |
| Ours(weakly-sup) | 153.4 | 89.8 |
| Ours(unsup) | 187.1 | 102.7 |

Further, to maintain a fair comparison against the prior weakly supervised approaches, we train 3 variants of the proposed framework by utilizing increasing levels of paired supervision alongside our self-supervised objectives.

a) Ours(multi-view-sup) Under multi-view supervision, we impose an additional consistency loss on the canonically aligned (view-invariant) 3D mesh vertices V and the 3D pose J for time-synchronized multi-view pairs. In line with Rhodin et al. [rhodin2018unsupervised], we also use full 3D pose supervision only for subject S1 while evaluating on the standard Human3.6M dataset. We outperform Rhodin et al. [rhodin2018unsupervised] by a significant margin, as reported in Table 4. This goes beyond the usual trend of model-based parametric approaches performing weaker than non-parametric ones. Thus, we attribute this performance gain to the proposed appearance consensus driven self-supervised objectives.

b) Ours(weakly-sup) In this setting, we access image datasets with paired 2D landmark annotations, in line with the supervision setting of prior model-based approaches [kanazawa2018end]. Alongside the proposed self-supervised objectives, we impose a direct 2D landmark supervision loss on the camera-projected 2D joints x with respect to the corresponding ground-truths, but only on samples from specific datasets, namely LSP, LSP-extended [johnsonclustered], and MPII [andriluka20142d]. Certain prior arts, such as HMR [kanazawa2018end], use even more images with paired 2D landmark annotations from COCO [lin2014microsoft].

c) Ours(semi-sup) In this variant, we access paired 3D pose supervision on the widely used in-studio Human3.6M [ionescu2013human3] dataset alongside the 2D landmark supervision used in Ours(weakly-sup). Note that a better performance on Human3.6M (with limited BG and FG diversity as a result of its in-studio data collection setup) does not translate to the same on wild images, owing to the significant domain gap. As we impose the above supervisions alongside the proposed self-supervised objective on unlabeled wild images, such training is expected to deliver improved performance by successfully overcoming the domain-shift issue. We evaluate this on the wild 3DPW dataset.

4.2 Comparison with the state-of-the-art

Evaluation on Human3.6M. Table 4 shows a comparison of different variants of the proposed framework against the prior arts, which are grouped based on their respective supervision levels. We clearly outperform in all three groups, i.e. while accessing comparable a) 3D pose supervision, b) 2D landmark supervision, and c) multi-view supervision. Except for Rhodin et al. [rhodin2018unsupervised], all the prior works mentioned in Table 4 use a parametric human model for the human mesh estimation task. Note the significant performance gain specifically in the absence of any 3D pose supervision, i.e. for Ours(weakly-sup) and Ours(multi-view-sup) against the relevant counterparts, as reported in the last 4 rows.

Evaluation on 3DPW. Table 3 reports a comparison of different variants of the proposed framework against the prior arts which use pose supervision comparable to Ours(semi-sup) (except certain methods, such as HMR [kanazawa2018end], which use additional 3D pose supervision from the MPI-INF-3DHP [mehta2017monocular] dataset). It is worth noting that none of our model variants is trained on samples from the 3DPW dataset (not even in the self-supervised paradigm). A better performance in such an unseen setting highlights our superior cross-dataset generalizability.

Table 4: Evaluation on Human3.6M (Protocol-2). Methods in the first 9 rows use equivalent 2D and 3D pose supervision and hence are directly comparable. The same analogy applies to rows 10-11 and 12-13.

| No. | Methods | PA-MPJPE (↓) |
|---|---|---|
| 1. | Lassner et al. [LassnerClosing] | 93.9 |
| 2. | Pavlakos et al. [pavlakos2018learning] | 75.9 |
| 3. | Omran et al. [omran2018neural] | 59.9 |
| 4. | HMR [kanazawa2018end] | 56.8 |
| 5. | Temporal HMR [kanazawa2019learning] | 56.9 |
| 6. | Arnab et al. [arnab2019exploiting] | 54.3 |
| 7. | Kolotouros et al. [kolotouros2019nal] | 50.1 |
| 8. | TexturePose [texturepose] | 49.7 |
| 9. | Ours(semi-sup) | 48.1 |
| 10. | HMR unpaired [kanazawa2018end] | 66.5 |
| 11. | Ours(weakly-sup) | 58.2 |
| 12. | Rhodin et al. [rhodin2018unsupervised] | 98.2 |
| 13. | Ours(multi-view-sup) | 74.1 |
Table 5: Evaluation of FG-BG and 6-part segmentation on the LSP test set, reporting accuracy (Acc.) and F1 scores of our method against the prior arts. First group: iterative, optimization-based approaches. Last 3 groups: regression-based methods, grouped based on comparable supervision levels.

| Methods | FG-BG Seg. Acc. (↑) | FG-BG Seg. F1 (↑) | Part Seg. Acc. (↑) | Part Seg. F1 (↑) |
|---|---|---|---|---|
| SMPLify oracle [bogo2016keep] | 92.17 | 0.88 | 88.82 | 0.67 |
| SMPLify [bogo2016keep] | 91.89 | 0.88 | 87.71 | 0.64 |
| SMPLify on [pavlakos2018learning] | 92.17 | 0.88 | 88.24 | 0.64 |
| Bodynet [varol2018bodynet] | 92.75 | 0.84 | - | - |
| HMR [kanazawa2018end] | 91.67 | 0.87 | 87.12 | 0.60 |
| Kolotouros et al. [kolotouros2019nal] | 91.46 | 0.87 | 88.69 | 0.66 |
| TexturePose [texturepose] | 91.82 | 0.87 | 89.00 | 0.67 |
| Ours(semi-sup) | 91.84 | 0.87 | 89.08 | 0.67 |
| HMR unpaired [kanazawa2018end] | 91.30 | 0.86 | 87.00 | 0.59 |
| Ours(weakly-sup) | 91.70 | 0.87 | 87.12 | 0.60 |
| Ours(unsup) | 91.46 | 0.86 | 87.26 | 0.64 |

Evaluation of part segmentation. We also evaluate our performance on the FG-BG segmentation and body part-segmentation tasks, which are considered a proxy to quantify the shape-fitting performance. In the presence of 2D landmark annotations, iterative model-fitting approaches have a clear advantage over single-shot regressor-based approaches, as shown in Table 5. At comparable supervision, Ours(semi-sup) not only outperforms the relevant regression-based prior arts but also performs competitively with the iterative model-fitting approaches, with a significant advantage in inference time (1 min vs 0.04 sec). Note that Ours(unsup) performs competitively with the prior supervised regression-based approaches, thus establishing the importance of FG appearance consistency for accurate shape recovery.

4.3 Qualitative results

The proposed mesh recovery model not only infers pose and shape but also outputs a colored mesh representation as a result of the proposed reflectional-symmetry procedure. To evaluate the effectiveness of the recovered part appearance, we perform 2 different tasks: a) part-conditioned appearance transfer, and b) full-body appearance transfer, as shown in Fig. 5. On the top, we show the target images whose (network-predicted) pose and shape are combined with the part appearances recovered from the source image (only for the highlighted parts) shown on the left, to realize a novel synthesized image. Note that, in the case of part-conditioned appearance transfer, the appearance of the non-highlighted parts is taken from the target image shown on the top. For instance, in the first row, the synthesized image depicts the upper-body apparel of the person in the source image combined with the lower-body apparel from the target (and in the target image's pose). Qualitative results of the Ours(semi-sup) model on the other primary tasks are shown in Fig. 3 and Fig. 4, with failure scenarios highlighted (see Suppl).

Figure 5: Qualitative results on A. part-conditioned and B. full-body appearance transfer. This is enabled as a result of our ability to infer the colored mesh representation.

5 Conclusion

We introduce a self-supervised framework for model-based human pose and shape recovery. The proposed appearance consistency not only helps us segregate the common FG human from the respective wild BGs but also discovers the required pose deformation in a fully self-supervised manner. However, extending such a framework to human-centric images with occlusions by external objects or truncated human visibility remains to be explored in future work.

Acknowledgements. We thank the Qualcomm Innovation Fellowship India 2020.


References