Deep Patch-based Human Segmentation

07/11/2020 · by Dongbo Zhang et al.

3D human segmentation has seen noticeable progress in recent years. It, however, still remains a challenge to date. In this paper, we introduce a deep patch-based method for 3D human segmentation. We first extract a local surface patch for each vertex and then parameterize it into a 2D grid (or image). We then embed identified shape descriptors into the 2D grids, which are further fed into a powerful 2D Convolutional Neural Network to regress the corresponding semantic labels (e.g., head, torso). Experiments demonstrate that our method is effective for human segmentation and achieves state-of-the-art accuracy.




1 Introduction

3D human segmentation is a fundamental problem in human-centered computing. It can serve many other applications such as skeleton extraction, editing, and interaction. Given that traditional optimization methods yield limited segmentation outcomes, deep learning techniques have been put forward to achieve better results.

Recently, a variety of human segmentation methods based upon deep learning have emerged [13, 14, 22, 23]. The main challenges are twofold: firstly, the "parameterization" scheme, and secondly, the feature information used as input. Regarding the parameterization scheme, some methods convert 3D geometry data into a 2D image style by brute force [13]. Methods such as [22] convert the whole human model into an image-style 2D domain using geometric parameterization. However, this usually requires certain prior knowledge, like the selection of different groups of triplet points. Some methods like [23] simply perform a geodesic polar map. Nevertheless, such methods often need augmentation to mitigate origin ambiguity and sometimes generate poor patches for non-rigid humans. Regarding the input feature information, one simple solution is using 3D coordinates for learning, which relies heavily on data augmentation [14]. Other methods [13, 22] employ shape descriptors like WKS [3] as their input.

In this paper, we propose a novel deep learning approach for 3D human segmentation. In particular, we first cast the 3D-2D mapping as a geometric parameterization problem. We then convert each local patch into a 2D grid. We do this so as to embed both global features and local features into the channels of the 2D grids which are taken as input for powerful image-based deep convolutional neural networks like VGG [30]. In the testing phase, we first parameterize a new 3D human shape in the same way as training, and then feed the generated 2D grids into the trained model to output the labels.

We conduct experiments to validate our method and compare it with state-of-the-art human segmentation methods. Experimental results demonstrate that it achieves highly competitive accuracy for 3D human segmentation. We also conduct further ablation studies on different features and different neural networks.

2 Related Work

2.1 Surface Mapping

Surface mapping approaches solve a mapping or parameterization problem, ranging from local patch-like surfaces to global shapes. The Exponential Map is often used to parameterize a local region around a central point; it defines a bijection in the local region and preserves distances with low distortion. The Geodesic Polar Map (GPM) expresses the Exponential Map in polar coordinates. [9, 19, 27, 24] implemented GPM on triangular meshes based on approximate geodesics. Exact discrete geodesic algorithms such as [31, 35] feature relatively accurate tracing of geodesic paths and hence of polar angles. The common problem with GPM is that it easily fails to generate a one-to-one map, due to the poor approximation of geodesic distances and the miscalculation of polar angles. To overcome this problem, one needs to find the inward ray of geodesics mentioned in [21]. However, sometimes the local region does not form a topological disk, and tracing the isocurve among the triangles is very difficult. To guarantee a one-to-one mapping in a local patch, one intuitive way is to adopt harmonic maps or angle-preserving conformal maps; a survey [7] reviews the properties of these mappings. Harmonic maps minimize deformation and the algorithm is easy to implement on complex surfaces. However, as shown in [6, 8], in the discrete context (i.e., a triangle mesh), the mapping can flip over if there are many obtuse triangles. [12, 15, 26, 28, 29] solved the harmonic maps on closed surfaces of genus zero, which was further extended to arbitrary genus by [12, 20]. These global shapes are mapped to simple surfaces with the same genus. If the domains are not homeomorphic, one needs to cut or merge pieces into another topology [32, 10]. These methods are globally injective and maintain harmonicity, but produce greater distortion around the cutting points.

2.2 Deep Learning on Human Segmentation

Inspired by current deep learning techniques, a number of approaches have attempted to extend these methods to the 3D human segmentation task. Limited by the irregular domain of 3D surfaces, successful network architectures cannot be applied straightforwardly. By leveraging Convolutional Neural Networks (CNNs), Guo et al. [13] first handled 3D mesh labeling/segmentation in a learning-based way. To use CNNs on 3D meshes, they reshape per-triangle hand-crafted features (e.g., curvatures, PCA, spin images) into a regular grid where CNNs are well defined. This approach is simple and flexible for applying CNNs to 3D meshes. However, as the method only considers per-triangle information, it fails to aggregate information among nearby triangles, which is crucial for human segmentation. At the same time, Masci et al. [23] designed a network architecture, named GCNN (Geodesic Convolutional Neural Networks), to deal with non-Euclidean manifolds. The convolution is based on a local system of geodesic polar coordinates that parameterizes a local surface patch. This convolution is required to be insensitive to the origin of the angular coordinate, which means it disregards patch orientation. Following [23], anisotropic heat kernels were introduced in [5] to learn local descriptors that incorporate patch orientation. To use CNNs in the surface setting, Maron et al. [22] introduced a deep learning method for 3D mesh models by parameterizing a surface to a canonical (2D) domain where successful CNNs can be applied directly. However, their parameterization relies on the choice of three points on the surface, which can introduce significant angle and scale distortion. Later, an improved parameterization was employed in [14] to produce a low-distortion coverage of the image domain. Recently, Hanocka et al. [16] designed a method specific to triangle meshes by modifying traditional CNNs to operate on mesh edges.

3 Method

3.1 Overview

In this work, we address 3D human segmentation by assigning a semantic label to each vertex with the aid of its local structure (patch). Due to the intrinsic irregularity of surfaces, traditional 2D CNNs cannot be applied to this task directly. To this end, we map a surface patch into a 2D grid (or image), on which we are able to leverage successful network architectures (e.g., ResNet [17], VGG [30]).

As shown in Fig. 1, for each vertex on a 3D human model, a local patch is built under geodesic measurement. We then convert each local patch into a 2D grid (or image) via a 3D-2D mapping step, to suit the powerful 2D CNNs. To preserve geometric information both locally and globally, we embed local and global shape descriptors into the 2D grid as input features. Finally, we establish the relation between the per-vertex (or per-patch) feature tensor and its corresponding semantic label in a supervised learning manner. We first introduce the surface mapping step for converting a local patch into a 2D grid in Section 3.2, and then explain the neural network and implementation details in Section 3.3.

Figure 1: Overview of our method. For each vertex, we first build a local patch on the surface and then parameterize it into a 2D grid (or image). We embed the global and local features (WKS, curvatures, AGD) into the 2D grid, which is finally fed into VGG16 [30] to regress the corresponding semantic label.

3.2 Surface Mapping

Patch extraction. Given a triangular mesh M, we compute the local patch P(v) for each vertex v based on the discrete geodesic distance d_g, requiring d_g(x, v) ≤ r for all x ∈ P(v). Here r is an empirically fixed radius for all patches, set in proportion to the total surface area of M. The geodesic distance is computed locally using the ICH algorithm [35] due to its efficiency and effectiveness.
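The patch extraction step can be sketched as follows. The paper computes exact geodesics with the ICH algorithm; as a simplified stand-in, this sketch approximates geodesic distance by Dijkstra over mesh edges, collecting every vertex within the radius. The data layout (`verts`, `adj`) is an assumption for illustration.

```python
import heapq
import math

def extract_patch(verts, adj, source, radius):
    """Vertices within geodesic `radius` of `source`, with their distances,
    using Dijkstra over mesh edges (a rough stand-in for exact ICH geodesics)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    patch = {}
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        patch[u] = d  # every popped vertex is within the radius by construction
        for v in adj[u]:
            nd = d + math.dist(verts[u], verts[v])
            if nd <= radius and nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return patch

# Toy chain mesh: vertices on a line with unit spacing.
verts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
patch = extract_patch(verts, adj, source=0, radius=1.5)
```

With radius 1.5 only the source and its unit-distance neighbor survive; in the paper's setting the exact geodesic tracing would replace the edge-path approximation.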

Parameterization. There are two cases for parameterization in our context: either P(v) is a topological disk, or it is not. For the former case, we denote the 2D planar unit disk by D and compute the harmonic map f : P(v) → D by solving the Laplace equations

    Σ_{v_j ∈ N(v_i)} w_ij (f(v_j) − f(v_i)) = 0,    (1)

with the Dirichlet boundary condition

    f(b_k) = (cos(2πk/m), sin(2πk/m)),    (2)

where v_i is an interior vertex of P(v) (Eq. (1)) and w_ij is the cotangent weight on edge (v_i, v_j). In Eq. (2), b_k (k = 0, …, m−1) belongs to the boundary vertex set of P(v). The boundary vertex set contains m vertices, which are sorted in clockwise order according to their positions on the boundary of P(v). Suppose (v_i, v_j) is an interior edge and T_α, T_β are its two adjacent triangles; w_ij is calculated as

    w_ij = cot α_ij + cot β_ij,    (3)

where α_ij and β_ij are the angles opposite edge (v_i, v_j) in T_α and T_β, respectively.
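A minimal sketch of this harmonic-map solve, under the equations as reconstructed above: boundary vertices are pinned on the unit circle, interior vertices satisfy the cotangent-weighted Laplace equation, and the linear system is solved densely for clarity. The mesh layout (`verts`, `tris`, cyclic `boundary` list) is an assumption for illustration.

```python
import math
import numpy as np

def opposite_angle(p, q, r):
    """Angle at vertex r of triangle (p, q, r), i.e. the angle opposite edge (p, q)."""
    u, v = p - r, q - r
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return math.acos(max(-1.0, min(1.0, c)))

def harmonic_disk(verts, tris, boundary):
    """Map a disk-like patch to the unit disk: boundary vertices go to the
    circle (Eq. 2), interior vertices solve the cotangent-weighted
    Laplace equation (Eq. 1) with weights as in Eq. (3)."""
    n = len(verts)
    W = np.zeros((n, n))
    for i, j, k in tris:
        for a, b, c in ((i, j, k), (j, k, i), (k, i, j)):
            # cot of the angle opposite edge (a, b); the two adjacent
            # triangles of an interior edge accumulate cot(alpha) + cot(beta)
            w = 1.0 / math.tan(opposite_angle(verts[a], verts[b], verts[c]))
            W[a, b] += w
            W[b, a] += w
    uv = np.zeros((n, 2))
    m = len(boundary)
    for k, b in enumerate(boundary):  # boundary vertices in cyclic order
        uv[b] = (math.cos(2 * math.pi * k / m), math.sin(2 * math.pi * k / m))
    L = np.diag(W.sum(axis=1)) - W  # graph Laplacian with cotan weights
    interior = [i for i in range(n) if i not in boundary]
    A = L[np.ix_(interior, interior)]
    rhs = -L[np.ix_(interior, boundary)] @ uv[boundary]
    uv[interior] = np.linalg.solve(A, rhs)
    return uv

# Symmetric fan: one interior vertex surrounded by 4 boundary vertices.
verts = np.array([[0., 0.], [1., 0.], [0., 1.], [-1., 0.], [0., -1.]])
tris = [(0, 1, 2), (0, 2, 3), (0, 3, 4), (0, 4, 1)]
uv = harmonic_disk(verts, tris, boundary=[1, 2, 3, 4])
```

By symmetry, the interior vertex of this fan lands at the disk center while the boundary lands on the unit circle.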

There are cases where the local patch P(v) is not a topological disk and the harmonic map cannot be computed. In this case, we trace the geodesic path from v to each x ∈ P(v), reusing the routing information stored by the ICH algorithm when computing d_g. See Fig. 2 for an illustration of the parameterization. Similar to [23], we then obtain a surface charting represented by polar coordinates on D. We next perform an alignment and a grid discretization on D.

Figure 2: Left: the flow vector field rendered in red on a human model. Top right: close-up view of a local patch around a vertex v; the projected flow vector emanating from v is shown in orange and the base edge in blue. Bottom left: surface mapping. The red double-arrow line indicates the polar axis of the local polar coordinate system; green dots represent the parameterized vertices from P(v). Bottom right: alignment of polar angles and grid discretization. The polar axis is rotated to overlap the reference direction; the angle of rotation is indicated in the yellow box, giving the new polar angle for each vertex. After the alignment, the grid cells are embedded into the unit disk D.

Alignment and grid discretization. The orientation of D is ambiguous in the context of the local vertex indexing. We remove the ambiguity by aligning each patch with a flow vector field defined on the mesh. For each vertex v and its associated patch P(v), the flow vector at v serves as the reference direction when mapping P(v) to D; Fig. 2 illustrates the reference direction with an example. The flow field is defined as flowing from a set of pre-determined sources to the sinks. We first solve a scalar function u on the mesh using a Laplace equation with the sources and sinks as boundary constraints, and the flow vector field is its gradient ∇u. We further calibrate the polar angles of D with the flow field. Considering the first adjacent edge around a source vertex as a base edge, BaseToRef is the angle between the projected flow vector and the base edge, where the flow vector is projected onto a random adjacent face of the base edge. From the harmonic map, we easily obtain the polar angles of the local, randomly-oriented polar coordinate system: AxisToV for each parameterized vertex and AxisToBase for the base edge. To align the local polar axis to the reference direction, the calibrated polar angle is calculated as AxisToV − AxisToBase − BaseToRef (modulo 2π).
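The angle calibration reduces to a single modular subtraction. This sketch assumes the sign convention reconstructed above (angles accumulate axis → base edge → reference direction), which may differ from the authors' exact convention.

```python
import math

TWO_PI = 2.0 * math.pi

def calibrate_angle(axis_to_v, axis_to_base, base_to_ref):
    """Re-measure a vertex's polar angle from the reference (flow) direction
    instead of the arbitrary local polar axis. The sign convention is an
    assumption: AxisToV - AxisToBase - BaseToRef, wrapped into [0, 2*pi)."""
    return (axis_to_v - axis_to_base - base_to_ref) % TWO_PI

# A vertex lying exactly along the reference direction calibrates to angle 0.
theta = calibrate_angle(1.0, 0.6, 0.4)
```

After calibration, patches extracted at different vertices share a consistent zero direction, which is what allows the downstream CNN to exploit patch orientation.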

The grid with n × n cells is embedded inside the calibrated D such that D is the circumcircle of the grid. We build a Cartesian coordinate system in D whose origin is the pole of the polar system; the x-axis and the y-axis overlap the polar axis and its perpendicular, respectively. The vertices and triangles on P(v) are converted to this Cartesian coordinate system. A cell belongs to a triangle if its center lies in that triangle. We compute the barycentric coordinates of each involved cell (center) with respect to the three vertices of that triangle. The barycentric coordinates will be used later for calculating cell features from vertex features.

Shape descriptors. After generating the grids (or images), we embed shape descriptors as features into them. The features of each cell are calculated by linear interpolation using the barycentric coordinates computed above. The descriptors include the Wave Kernel Signature (WKS) [3], curvatures (minimal, maximal, mean, Gaussian) and the average geodesic distance (AGD) [18]. We normalize each kind of descriptor on a global basis, that is, the maximum and minimum values are selected from the whole descriptor matrix, rather than from a single row or column of the matrix.
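The global normalization amounts to a min-max scaling over the whole descriptor matrix; a one-liner sketch (the toy matrix is illustrative):

```python
import numpy as np

def normalize_global(D):
    """Min-max normalize a descriptor matrix (vertices x dimensions) using
    the extrema of the WHOLE matrix, not of a single row or column."""
    lo, hi = D.min(), D.max()
    return (D - lo) / (hi - lo)

wks_like = np.array([[0.0, 2.0],
                     [4.0, 8.0]])
normed = normalize_global(wks_like)
```

Normalizing globally (rather than per-row or per-column) preserves the relative magnitudes between descriptor dimensions, which per-dimension scaling would destroy.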

3.3 Neural Network and Implementation

Neural network. As a powerful and successful network, we adopt the VGG architecture with 16 layers (see Fig. 1) as our backbone in this work. The cross-entropy loss is employed as the loss function for the VGG16 net. It is worth noting, however, that the surface parameterization presented in this work is quite general in nature and is applicable to many other CNNs in the literature.

Implementation details. We implement the VGG16 network in PyTorch on a desktop PC with an Intel Core i7-9800X CPU (3.80 GHz, 24GB memory). We train for a fixed number of epochs with a fixed mini-batch size. SGD is used as our optimizer, and the learning rate is decreased as the epochs increase. To balance the distribution of each label, in the training stage we randomly sample an equal number of samples per label in each epoch. Training takes several hours on a GeForce RTX 2080 Ti GPU (11GB memory, CUDA 9.0).
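The per-epoch label balancing can be sketched without any deep-learning dependency; the exact per-label sample count is unspecified in the text, so `per_label` here is a hypothetical parameter, and drawing rare labels with replacement is an assumption.

```python
import random
from collections import Counter

def balanced_epoch(labels, per_label, seed=0):
    """Indices for one epoch with the same count for every semantic label;
    labels with too few samples are drawn with replacement (an assumption)."""
    rng = random.Random(seed)
    by_label = {}
    for idx, lab in enumerate(labels):
        by_label.setdefault(lab, []).append(idx)
    epoch = []
    for lab in sorted(by_label):
        pool = by_label[lab]
        if len(pool) >= per_label:
            epoch.extend(rng.sample(pool, per_label))   # without replacement
        else:
            epoch.extend(rng.choices(pool, k=per_label))  # with replacement
    rng.shuffle(epoch)
    return epoch

labels = ["head"] * 10 + ["torso"] * 3  # imbalanced toy label set
epoch = balanced_epoch(labels, per_label=5)
counts = Counter("head" if i < 10 else "torso" for i in epoch)
```

Feeding such balanced index lists to the data loader keeps minority labels (e.g., Feet) from being drowned out by large regions like Torso.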

Once the model is trained, we can infer semantic labels of a human shape in a vertex-wise way. Given a human shape, we first compute the involved shape descriptors for each vertex. For each vertex, we build a local surface patch and parameterize it into a 2D grid (or image) as described in Section 3.2. We embed all the shape descriptors into a 2D grid and feed it into our trained model for prediction.

4 Experimental Results

In this section, we first introduce the dataset used in our experiments, and then explain the evaluation metric. We then show the visual and the quantitative results. We also perform ablation studies for the input features and different neural networks.

4.1 Dataset Configuration

In this work, we use the dataset from [22], whose training human models come from SCAPE [2], FAUST [4], MIT [33] and Adobe Fuse [1], and whose test human models come from SHREC07 [11]. Some examples of our training dataset are shown in Fig. 3. Each human model is annotated with semantic labels (e.g., Head, Arm, Torso, Limb, Feet), as shown in Fig. 1. To represent the geometric information of a human model both globally and locally, we concatenate a set of shape descriptors as input features: WKS features [3], curvature features (minimal, maximal, mean, Gaussian) and AGD [18].

Figure 3: Examples from the training set.

4.2 Evaluation Metric

To provide a fair comparison, we also evaluate our segmentation results in an area-aware manner [22]. For each segmentation result, the accuracy is computed as the area of correctly labeled triangles over the total triangle area. The overall accuracy on all involved human shapes is therefore defined as

    ACC = (1/N) Σ_{i=1}^{N} (1/A_i) Σ_{j ∈ S_i} a_i^j,

where N denotes the number of test human models and A_i is the total triangle area of the i-th human model. S_i is the set of indices of correctly labeled triangles of the i-th human model, and a_i^j represents the area of the j-th triangle of the i-th human model. Since we address the human segmentation task in a vertex-wise manner, the per-vertex labels need to be transferred to per-face labels for the quantitative evaluation. The face label is estimated by a voting strategy among its three vertex labels: if two or three vertices agree, we directly assign that label to the face; if the three vertex labels are all different, we randomly select one vertex label as the face label.
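The voting rule and the per-model area-weighted accuracy can be sketched as follows; the overall ACC formula is as reconstructed above (a mean over per-model accuracies), which is an assumption about the exact aggregation.

```python
import random
import numpy as np

def face_label(v0, v1, v2, rng=random):
    """Vote a face label from its three vertex labels: a label shared by two
    or three vertices wins; a three-way tie picks one vertex label at random."""
    if v0 == v1 or v0 == v2:
        return v0
    if v1 == v2:
        return v1
    return rng.choice([v0, v1, v2])

def area_weighted_accuracy(pred, gt, areas):
    """Per-model accuracy: area of correctly labeled faces over total area."""
    pred, gt, areas = map(np.asarray, (pred, gt, areas))
    return areas[pred == gt].sum() / areas.sum()

# Toy model: 3 faces, the large middle face is labeled correctly.
acc = area_weighted_accuracy(pred=[0, 1, 1], gt=[0, 1, 2], areas=[1.0, 2.0, 1.0])
# Overall ACC would then average such per-model accuracies over the N test models.
```

Weighting by area prevents many tiny, correctly labeled triangles from inflating the score relative to a few large, mislabeled ones.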

Figure 4: Some visual results of our method on the test set. The top row and the bottom row respectively show the results of our method and the corresponding ground truth models.

4.3 Visual and Quantitative Results

In this section, we show the visual and quantitative results. As shown in Fig. 4, the top row lists several of our results on the test set, and the bottom row displays the corresponding ground-truth models. To further evaluate our method for 3D human segmentation, a quantitative comparison with recent human segmentation techniques is summarized in Table 1. As we can see from Table 1, our method achieves an accuracy of 89.89%, ranking second among all methods. Our approach is slightly inferior to the best method [14], which benefits from its data augmentation strategy.

Method #Features ACC
DynGCNN [34] 64 86.40%
Toric CNN [22] 26 88.00%
MDGCNN [25] 64 89.47%
SNGC [14] 3 91.03%
GCNN [23] 64 86.40%
Our Method 31 89.89%
Table 1: Comparisons with recent methods for 3D human segmentation.

4.4 Ablation Study

Besides the above results, we also evaluate different choices of input features. Table 2 shows that the input features comprising WKS, curvatures and AGD obtain the best performance in terms of accuracy. Moreover, we evaluate the performance of two different neural networks on 3D human segmentation, as shown in Table 3. VGG16 obtains a better accuracy than ResNet50, and we thus employ VGG16 as the backbone in this work.

Features Used #Features ACC
SWCA 50 89.25%
SWA 46 89.81%
WCA (Our) 31 89.89%
Table 2: Comparisons for different input features. For simplicity, S, W, C and A are respectively short for SI-HKS, WKS, Curvatures (Cmin, Cmax, Cmean, Cgauss) and AGD.
Network Features ACC
ResNet50 WKS, Curvatures, AGD 87.60%
VGG16 WKS, Curvatures, AGD 89.89%
Table 3: Comparisons for two different network architectures.

5 Conclusion

We have presented a deep learning method for 3D human segmentation. Given a 3D human mesh as input, we first parameterize each local patch in the shape into 2D image style, and feed it into the trained model for automatically predicting the label of each patch (i.e., vertex). Experiments demonstrate the effectiveness of our approach, and show that it can achieve state-of-the-art accuracy in 3D human segmentation. In the future, we would like to explore and design more powerful features for learning the complex relationship between the non-rigid 3D shapes and the semantic labels.


  • [1] Adobe Fuse 3D characters. Cited by: §4.1.
  • [2] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis (2005) SCAPE: shape completion and animation of people. In ACM SIGGRAPH 2005 Papers, pp. 408–416. Cited by: §4.1.
  • [3] M. Aubry, U. Schlickewei, and D. Cremers (2011) The wave kernel signature: a quantum mechanical approach to shape analysis. In 2011 IEEE International Conference on Computer Vision Workshops, pp. 1626–1633. Cited by: §1, §3.2, §4.1.
  • [4] F. Bogo, J. Romero, M. Loper, and M. J. Black (2014) FAUST: dataset and evaluation for 3d mesh registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3794–3801. Cited by: §4.1.
  • [5] D. Boscaini, J. Masci, E. Rodolà, and M. Bronstein (2016) Learning shape correspondence with anisotropic convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 3189–3197. Cited by: §2.2.
  • [6] T. Duchamp, A. Certain, A. DeRose, and W. Stuetzle (1997) Hierarchical computation of pl harmonic embeddings. preprint. Cited by: §2.1.
  • [7] M. S. Floater and K. Hormann (2005) Surface parameterization: a tutorial and survey. In Advances in Multiresolution for Geometric Modelling, pp. 157–186. Cited by: §2.1.
  • [8] M. S. Floater (1998) Parametric tilings and scattered data approximation. International Journal of Shape Modeling 4 (03n04), pp. 165–182. Cited by: §2.1.
  • [9] M. S. Floater (2003) Mean value coordinates. Computer Aided Geometric Design 20 (1), pp. 19–27. Cited by: §2.1.
  • [10] M. Floater (2003) One-to-one piecewise linear mappings over triangulations. Mathematics of Computation 72 (242), pp. 685–696. Cited by: §2.1.
  • [11] D. Giorgi, S. Biasotti, and L. Paraboschi (2007) Shape retrieval contest 2007: watertight models track. SHREC competition 8 (7). Cited by: §4.1.
  • [12] X. Gu and S. Yau (2003) Global conformal surface parameterization. In Proceedings of the 2003 Eurographics/ACM SIGGRAPH Symposium on Geometry Processing, pp. 127–137. Cited by: §2.1.
  • [13] K. Guo, D. Zou, and X. Chen (2015) 3d mesh labeling via deep convolutional neural networks. ACM Transactions on Graphics 35 (1), pp. 1–12. Cited by: §1, §2.2.
  • [14] N. Haim, N. Segol, H. Ben-Hamu, H. Maron, and Y. Lipman (2019) Surface networks via general covers. In Proceedings of the IEEE International Conference on Computer Vision, pp. 632–641. Cited by: §1, §2.2, §4.3, Table 1.
  • [15] S. Haker, S. Angenent, A. Tannenbaum, R. Kikinis, G. Sapiro, and M. Halle (2000) Conformal surface parameterization for texture mapping. IEEE Transactions on Visualization and Computer Graphics 6 (2), pp. 181–189. Cited by: §2.1.
  • [16] R. Hanocka, A. Hertz, N. Fish, R. Giryes, S. Fleishman, and D. Cohen-Or (2019) MeshCNN: a network with an edge. ACM Transactions on Graphics 38 (4), pp. 1–12. Cited by: §2.2.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §3.1.
  • [18] M. Hilaga, Y. Shinagawa, T. Kohmura, and T. L. Kunii (2001) Topology matching for fully automatic similarity estimation of 3d shapes. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pp. 203–212. Cited by: §3.2, §4.1.
  • [19] T. Ju, S. Schaefer, and J. Warren (2005) Mean value coordinates for closed triangular meshes. In ACM SIGGRAPH 2005 Papers, pp. 561–566. Cited by: §2.1.
  • [20] A. Khodakovsky, N. Litke, and P. Schröder (2003) Globally smooth parameterizations with low distortion. ACM Transactions on Graphics 22 (3), pp. 350–357. Cited by: §2.1.
  • [21] I. Kokkinos, M. M. Bronstein, R. Litman, and A. M. Bronstein (2012) Intrinsic shape context descriptors for deformable shapes. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 159–166. Cited by: §2.1.
  • [22] H. Maron, M. Galun, N. Aigerman, M. Trope, N. Dym, E. Yumer, V. G. Kim, and Y. Lipman (2017) Convolutional neural networks on surfaces via seamless toric covers. ACM Transactions on Graphics 36 (4), pp. 71–1. Cited by: §1, §2.2, §4.1, §4.2, Table 1.
  • [23] J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst (2015) Geodesic convolutional neural networks on riemannian manifolds. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 37–45. Cited by: §1, §2.2, §3.2, Table 1.
  • [24] E. L. Melvær and M. Reimers (2012) Geodesic polar coordinates on polygonal meshes. In Computer Graphics Forum, Vol. 31, pp. 2423–2435. Cited by: §2.1.
  • [25] A. Poulenard and M. Ovsjanikov (2018) Multi-directional geodesic neural networks via equivariant convolution. ACM Transactions on Graphics 37 (6), pp. 1–14. Cited by: Table 1.
  • [26] E. Praun and H. Hoppe (2003) Spherical parametrization and remeshing. ACM Transactions on Graphics 22 (3), pp. 340–349. Cited by: §2.1.
  • [27] R. Schmidt, C. Grimm, and B. Wyvill (2006) Interactive decal compositing with discrete exponential maps. In ACM SIGGRAPH 2006 Papers, pp. 605–613. Cited by: §2.1.
  • [28] A. Sheffer and E. de Sturler (2001) Parameterization of faceted surfaces for meshing using angle-based flattening. Engineering with Computers 17 (3), pp. 326–337. Cited by: §2.1.
  • [29] A. Sheffer, C. Gotsman, and N. Dyn (2004) Robust spherical parameterization of triangular meshes. Computing 72 (1-2), pp. 185–193. Cited by: §2.1.
  • [30] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, Figure 1, §3.1.
  • [31] V. Surazhsky, T. Surazhsky, D. Kirsanov, S. J. Gortler, and H. Hoppe (2005) Fast exact and approximate geodesics on meshes. ACM Transactions on Graphics 24 (3), pp. 553–560. Cited by: §2.1.
  • [32] W. T. Tutte (1963) How to draw a graph. Proceedings of the London Mathematical Society 3 (1), pp. 743–767. Cited by: §2.1.
  • [33] D. Vlasic, I. Baran, W. Matusik, and J. Popović (2008) Articulated mesh animation from multi-view silhouettes. In ACM SIGGRAPH 2008 Papers, pp. 1–9. Cited by: §4.1.
  • [34] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics 38 (5), pp. 1–12. Cited by: Table 1.
  • [35] S. Xin and G. Wang (2009) Improving chen and han’s algorithm on the discrete geodesic problem. ACM Transactions on Graphics 28 (4), pp. 1–8. Cited by: §2.1, §3.2.