Statistical shape modeling (SSM) Heimann and Meinzer (2009); Dryden and Mardia (1998) has become a powerful tool of learning the shape variations from given population, which has been widely used in a variety of applications including image segmentation Heimann and Meinzer (2009), 3D animation Hasler et al. (2009), parametric shape modeling Baek and Lee (2012); Chu et al. (2010), and shape matching Wang and Qian (2016)1998) to extract the mean shape and eigen shapes which span a linear shape space. The above assumption is accurate when the population does not contain large pose variations. Otherwise in application like motion animation Hasler et al. (2009), where relative rotations of the body parts are involved, the underlying shape space is nonlinear and the linear interpolation of any two shapes will not produce a valid shape.
To provide better shape modeling, the Shape Completion and Animation for PEople (SCAPE) method Anguelov et al. (2005) separates pose variations from unstructured shape variations using articulated skeleton model. The pose variations are modeled by the rigid motions (rotations and translations) of the articulated segments of the skeleton. The skeleton is obtained from the articulated object model in Anguelov et al. (2004), which is automatically recovered by the Markov Network algorithm given a set of registered Amberg et al. (2007) shapes in different poses represented by triangle meshes. The Markov Network algorithm finds the segmentation of the shapes and rigid transformations of the segmented parts that maximize the posterior probability of the input shapes conditioned on the template shape. To guarantee that the segmentation is spatially coherent, soft and hard contiguity constraints are imposed, which makes it difficult to fine tune the weight that balances the data probability and the contiguity constraints. The Markov Network algorithm also doesn’t work well on the objects that have large non-rigid deformations (compared to articulated objects) and annealing is exploited in such situation Anguelov et al. (2004).
In this study we find that clustering and dimensionality reduction are the essences of recovering articulated object model from registered triangle meshes of the training shapes. Assigning each vertex a label indicating the corresponding articulated part is clustering. Finding the rotations and translations that map the points on the same corresponding positions of the training shapes to a single point on the reference shape is dimensionality reduction. Mixtures of factor analyzers (MFA) Ghahramani et al. (1996); Mclachlan et al. (2003) conducting clustering and dimensionality reduction simultaneously whose convergence is proved Mclachlan et al. (2003); Meng (1993); Meng and Van Dyk (1997). Different from the Markov Network algorithm in Anguelov et al. (2004), MFA maximizes the complete data log-probability and the latent variable is assumed to be Gaussian distributed which automatically penalizes the spatially-disjoint segmentation results thus guarantees the spatial contiguity. The only hyper-parameter in MFA is the number of clusters.
This paper formulates the problem of learning the pose variations from given shape population as the problem of mixtures of factor analyzers. Assume we have number of training shapes and each training shape is sampled by number of points. The inputs to the MFA algorithm are
number of data vectors each is obtained by concatenating the points on the same corresponding positions of the
training shapes. The outputs are the factor analyzers composed by the mixture proportions, the factor loading matrices, and the mixture variances. Each factor analyzer corresponds to an articulated part of the shapes. The vertices’ labels are calculated from the component posterior probability of the data vectors. To avoid arbitrary factor loading matrices, constraints are imposed and the corresponding closed form optimal solution is derived. Thought inBaek et al. (2010); Tang et al. (2012); Mclachlan et al. (2003)
different constraints are imposed on the factor loading matrices, the goal is to reduce unnecessary degrees of freedom so to avoid bad local minimums. In this paper the loading matrices are explicitly constrained to be composed by rotation matrices. Based on the proposed method, the pose variations are automatically learned from the given shape populations. The method is applied in motion animation where new poses are generated by interpolating the existing poses in the training set. The obtained results are smooth and realistic. The shape data used in this paper is fromSumner and Popović (2004).
This paper is organized as follows, in Section 2 the problem of learning pose variations from given shape population is formulated as problem of mixtures of factor analyzers. In Section 3 the hierarchical optimization approach that automatically refines the initial MFA result is presented. In Section 4 the algorithm of pose interpolation is demonstrated. Section 5 shows the experimental results on different shapes. This paper is concluded in Section 6. The derivation of the closed form optimal solution is elaborated in the Appendix.
2 Related Work
In Schaefer and Yuksel (2007)
an example based skeleton extraction approach is proposed. Given a set of example meshes that share the same mesh connectivity, the approach uses hierarchical face clustering to group adjacent mesh elements into different mesh regions each corresponds to a rigid part of the mesh body. In the hierarchical clustering, the adjacent mesh elements are merged by the error of rigid transformations from the corresponding mesh elements of the example meshes to the reference mesh. Though spatial coherency is automatically imposed inSchaefer and Yuksel (2007) by merging adjacent mesh regions, updating the rigid transformation error in each iteration is a costly operation and the hierarchical clustering is sensitive to local nonrigid transformations.
In De Aguiar et al. (2008)
a spectral clustering approach is proposed to segment the mesh examples into different rigid parts. The inputs to the spectral clustering are the motion trajectories of the seed vertices among the example meshes and the outputs areclusters of vertices corresponding to rigid parts. The time complexity of spectral clustering is
due to eigen-analysis of the affinity matrix, whereis the number of vertices. That’s why instead of the full mesh vertices, seed vertices are used which are obtained by curvature based segmentation approach.
In James and Twigg (2005) mean shift clustering of rotation sequences of the mesh elements is used to segment the example shapes into corresponding rigid parts. Each rotation sequence is composed by the rotation matrices between the triangle elements of the example shapes and the corresponding element of the reference shape. However, the segmentation obtained is lack of spatial coherency and shows broken patches in the areas of non-rigid deformation (e.g. shoulders).
In Tierny et al. (2008) a Reeb graph based approach is proposed for kinematic skeleton extraction of 3D dynamic mesh sequences. The contours of the Reeb graph that locally maximizes the local length deviation metric are identified as motion boundaries by which the kinematic skeleton is extracted. The local length deviation metric measures the extent of local length preservation across the mesh sequences. Though simple and efficient, the Reeb graph based approach generally suffers from robustness issues due to the discrete nature of mesh.
In Le and Deng (2012)K-means clustering is used to cluster the vertices of example meshes into different rigid groups. In the assignment step of the K-means clustering, each vertex is associated to the group that has the smallest rigid transformation error. The rigid transformation from each vertex group of an example shape to the reference shape is calculated by the Kabsch algorithm Kabsch (1978). Similarly, the drawback of the above approach is lacking of spatial coherency.
Recently, deep auto-encoders Hinton and Salakhutdinov (2006) has been applied in learning shape variations from a population Shu et al. (2018); Nash and Williams (2017); Li et al. (2017). In Shu et al. (2018)
deforming autoencoders for images are introduced to disentangle shape from appearance. However, the learned field deformations is not capable of describing the articulated motions. InNash and Williams (2017), given a set of part-segmented objects with dense point correspondences, the shape variational auto-encoder (ShapeVAE) is capable of synthesizing novel, realistic shapes. However, the method requires pre-segmented training shapes. In Li et al. (2017) a recursive neural net (RvNN) based autoencoder is proposed for encoding and synthesis of 3D shapes, which are represented by voxels. The main limitations of the above approaches are: 1) only take data with an underlying Euclidean structure (e.g. 1D sequences, 2D or 3D images) thus can not capture articulated motions of the pose variations; 2) require large number of training data of the same class of shapes (e.g. 3701/1501 in Nash and Williams (2017)) which are not always available due to the tedious process of obtaining neat 3D shape models from either 3D scanning (e.g. scanning point clouds boundary triangulation) or images (e.g. image segmentation boundary triangulation), while the proposed MFA based approach is capable of preciously learning the pose variations from just tens of shape models.
In the review paper Bronstein et al. (2017)
geometric deep learning techniques that generalize (structured) deep neural models to non-Euclidean domains such as graphs and manifolds are introduced. InLitany et al. (2017) a deep residual network is proposed that takes dense descriptor fields defined on two shapes as input, and outputs a soft map between the two given objects. In Boscaini et al. (2016)
an Anisotropic Convolutional Neural Network (ACNN) is proposed to learn the dense correspondences between deformable shapes where classical convolutions are replaced by projections over a set of oriented anisotropic diffusion kernels. InYi et al. (2017) a Synchronized Spectral CNN is proposed for 3D Shape Segmentation. However, until now I have seen any literature in geometric deep learning that aims articulated shape segmentation and learns pose variations from underlying shape population.
3 Learning Pose Variations within Shape Population
The goal of this work is to learn the pose variations within the given shape population. To be more specific, it learns: 1) the vertex labels that segment the shape into a set of rigid parts whose shape is approximately invariant across the different poses; 2) the rotations and translations associated with the rigid parts across the different poses; 3) muscle movements of the rigid parts of the pose variations.
As shown in Figure 1 we have a set of training shapes in triangle meshes that are sampled by the same number of points in correspondence:
where is vertex on the th shape.
Assume that each training shape is composed by rigid parts, and assume is the label of vertex indicating the part Id, we have:
where is the th vertex on the reference shape in the latent space, and are the rotation and translation associated with rigid part of shape , and is the residual caused by the muscle movement of the pose variation. It should be noted that the rotations and translations are the same for the vertices belong to the same rigid part of the same shape, otherwise different.
Concatenate the rigid motions of the vertices on the same corresponding positions of the shapes in (2) we have:
From Equation (3) we see that the high-dimensional vector has its image in the low-dimensional space. Our goal is to learn the vertex labels , the rotations and translations , and the muscle movements , which is essentially a clustering problem with dimensionality reduction, and the mixtures of factor analyzers is a natural choice to address the problem. Differently from general mixtures of factor analyzers, as shown by equation (3), the factor loadings are composed by rotation matrices, which forms our constraints.
3.1 Mixtures of Factor Analyzers
We formulate the problem as mixtures of factor analyzers:
is the random variable in the high dimensional space whose samplings are,
is the latent variable which is normally distributed with zero mean and unit diagonal covariance matrix,is the factor loading matrix of the th mixture, is the diagonal matrix that scales the dimensions of , is the corresponding mean vector, and is the residual vector of the th mixture which is Gaussian distributed with zero mean and diagonal covariance matrix . It should be noted that could be viewed as samplings of in the latent space.
In the following is used to represent the parameters of the th mixture, where is the probability that an arbitrary data belongs to the th mixture, and is used to represent the full parameter sets of the mixtures.
Since and are Gaussian distributed, from Equation (4) we have that the joint probability of and under the th mixture is also Gaussian distributed:
where stands for Gaussian distribution with the first parameter its mean and the second parameter its covariance. From Equation 5 we have the marginal distribution of under the th mixture:
the conditional distribution of given :
and the conditional distribution of given :
3.2 Alternating Expectation–Conditional Maximization
Optimization formula (10
) is in the form of mixtures of Gaussian, which does not have closed form solution but can be efficiently maximized by the algorithm of Expectation Maximization (EM) or Alternating Expectation–Conditional Maximization (AECM) algorithm. AECM replaces the M-step of the EM algorithm by computationally simpler conditional maximization steps (CM-step). Here we choose AECM algorithm due to its computational simplicity and faster convergence.
are estimations of the parameters in theth iteration, by Expectation Maximization we have:
where the last line is obtained by factoring out and dropping which is irrelevant to the optimization parameters. Note that (abbreviated as when there is no ambiguity) means expectation with the conditional probability , where
is the posterior probability that belongs to mixture given the parameter estimations . It is called the responsibility in Mclachlan et al. (2003).
Formula (11) is maximized with the Alternating Expectation–Conditional Maximization (AECM) Algorithm with two cycles Mclachlan et al. (2003), each cycle contains one E-step and one CM-Step. The first cycle is the same with Mclachlan et al. (2003), the second cycle has taken into consideration the rotational constraints and derived the corresponding optimal closed form solutions. In the following we note for simplicity.
Alternating Expectation–Conditional Maximization:
Initial estimates of the parameters as in Mclachlan et al. (2003).
E Step: compute the responsibilities by (12) using the current estimations .
CM Step: compute and using the current responsibilities , and the parameters and :
E Step: update the responsibilities using the parameters and .
CM Step 2: compute and using the current responsibilities and :
The covariance matrix of the th mixture (rigid part) of the th shape in the th iteration is:
is the diagonal matrix of the eigenvalues of, then is estimated as:
Assume , and
is the singular value decomposition of, we have:
For the derivations of , , and and the way to ensure the right handedness of please refer to the section of Appendix.
Repeat (ii) and (iii) until converge.
4 Hierarchical Optimization Algorithm
As noted in Tang et al. (2012) when the number of mixtures is large, MFA could converge to bad local minimums. In order to avoid bad local minimums we designed a hierarchical optimization algorithm. In the following we use the parameter set to denote the th factor analyzer.
As shown in Algorithm 1, The inputs are the training shapes and the initial number of mixtures , which is usually small. In the upper level the coarse factor analyzers are obtained by the AECM algorithm AECM(, m) with the input data and the number of mixtures . Then for each factor analyzer (each coarse rigid part), refinement is conducted until the average variances is less that or the variances cease to decrease as increases. Note that is the average squared error between the real shapes and corresponding images in the latent space. In this study, all the shapes are scaled to be within the unit box. After the refinements, the vertices labels are obtained from the refined factor analyzers and are used to initialize the final optimization AECM(, ).
Given the factor analyzers , each vertex label is obtained by associating the vertex to the mixture that has the maximum posterior probability:
As shown in Figure 2 are the results of hierarchical optimization on the horse training data. Figure 2 shows the results of coarse factor analyzers, in which the horse is segmented into 11 rigid parts. The gaps between the different rigid parts in the figures are obtained by not showing the triangles on the common boundaries. Figure 2 shows the results of refinements. Figure 2 shows the 23 mixtures (rigid parts) obtained in the final stage, from which we could see that all the joints on the legs are separated, the head, neck, and body are also separated since the poses of training shapes include movements of the head and neck. The tail is also segmented into 3 parts, which swings in the running.
The images of the vertices in the latent space are obtained by:
As shown in Figure 3 are images of the rigid parts in the latent space of the horse training data. From the images we can see that the horse is precisely segmented by the hierarchical optimization algorithm. From the rigid parts in the latent space the horse shapes can be reconstructed by the learned rotations and translations, for example, the vertices of the th shape () are reconstructed as:
As shown in Figure 4 are the 11 horse shapes reconstructed from the latent space by the learned rotations and translations. It can be seen that the learned rigid pose variations (rotations and translations) have captured the major variations in the shape population. The remaining small discrepancies between the reconstructed shape (red points) and original shape (blue points) are caused by the muscle movements corresponding to the different poses.
5 Pose Interpolation
Through the hierarchical AECM optimization, we have the mixtures of factor analyzers , from which we have the vertices labels that indicate which rigid part each vertex belongs to, the images of the rigid parts in the latent space , and the rotations and translations of the mixtures. Based on the above information, given any two shapes we can interpolate their poses.
As shown in Algorithm 2, given two shapes , , and the interpolation parameter , the ”root part” is chosen to have the smallest rotation from to . Starting from root part , a broad first search is conducted to interpolate one rigid part by one rigid part. For the rigid part , the rotation is interpolated using the spherical linear interpolation of quaternion Lengyel (2001): Slerp(, , ); the translation is linearly interpolated if is the root part, otherwise it is calculated by:
where is the parent part of , , is the point of contacting between parts and of the th training shape, which is decided by the mass center of the triangles that are in between parts and , is the image of in the latent space of mixture , similar for . Equation (23) means that after the interpolation and should still contact each other when mapped to the physical space. is the abbreviation of in Algorithm 2, similar for . The final interpolation is generated by:
where the residual (the muscle movement) is blended in the latent space and is then mapped to the physical space.
As shown in Figure 5 are the results of pose interpolation between the horse shape and horse shape . Figure 6 are the results of pose interpolation between the horse shape and horse shape . From the results it can be seen that the interpolated poses are smooth and lifelike.
In this section experimental results of the examples of horse, flamingo, camel, and cat are presented. As can be shown from the examples, the segmentation of horses, flamingos, and camels are very neat, all the joints are nicely segmented. The segmentation of cats are not as good:
1) The segmentation results in some regions (eg. on the neck and right shoulder of the cat example) are fragmented. This is because more muscle movements are involved in the pose variations of the specific regions of the cat and lion example when compared with the horses, flamingos and camels, so a simple rotation and translation are not enough to describe the movements in these regions (we need many independent rotations).
2) The segmentation is very coarse in some regions (eg. the joints are not separated on left front leg of the cat model, and the joints are not separated on the right back leg of the elephant model). This is caused by the limited number of poses we have in the data set. For example, in Figure 13 the bending of the right front leg of the cat is observed (the 10th subfigure), but similar movements is missing for the left front leg, and that’s why the segmentation of the left front leg is not as fine as the right front leg, since there is not enough movements to learn from. In the elephant example, the two front legs have much more movements than the two back legs, thus the segmentation of the two front legs is much finer than the segmentation of the two back legs.
It should be noted that the first defect (fragmented segmentation) will harm the results of interpolation, while the second defect will not harm the interpolation since such movements is not presented in the data.
In this paper, an method based on mixtures of factor analyzers is proposed to learn the pose variations from the given shape population. The inputs are the training shapes and the initial number of mixtures. The outputs are the automatically refined mixtures from which we have the vertices labels and the rotations and translations associated with each rigid part of the training shapes. The spatially coherent articulated segmentation is guaranteed by the fact that the latent space and the physical space are homotopic. In all the examples the MFA algorithm converges in less than 20 iterations. The results have demonstrated the effectiveness of the proposed approach. The contributions of this work include: 1) formulating the problem of learning pose variation as the problem of mixtures of factor analyzers; 2) a hierarchical optimization algorithm to automatically refine the learned factor analyzers; 3) derivation of the closed form solution of the optimal factor loading matrices under the constraints that it is composed by rotation matrices; 4) Lemma 1 to gaurantee that the obtained rotation matrices are right handed; 5) a breadth first search algorithm for pose interpolation based on the obtained factor analyzers. Future work includes extending the method to larger data set of 3D shapes such as the CAESAR project Robinette et al. (2002).
In this section we derive the optimal values of the parameters , , , , and in the th iteration of the AECM algorithm. Though the optimal values of and are known results in Mclachlan et al. (2003), we still present their derivation for the self-containing of this paper.
The parameters , , , , are derived following their order in the AECM algorithm, since the former, for example and , will be used in the derivation of their later parameters.
where is the Lagrange that ensures the sum of unity of .
The optimal values of is estimated as in (18), since by (18) the image of the th mixture in the latent space is scaled to have the same variances as the th mixture (rigid part) of the number of shapes, then it is possible to align the image and the rigid parts by rotations and translations.
Assume the variances is isotropic in the direction, we have , where
is identity matrix if size. Since , Equation (26) can be further factored out:
Assume , When , we have the maximum value of :
However, the formulation in (30) doesn’t guarantee the right handedness of the obtained rotation, and in this study it is observed that in a few cases is left handed, which means the actual shape is mirrored. To ensure that we always obtain right handed rotation matrix, in the case that is left handed, we have:
where . The proof is in Lemma 1.
Lemma 1: Assume is the singular decomposition of the matrix , the singular values are ordered from large to small in , and . When is left handed, is the rotation that maximizes among the all right handed rotations and the maximum value is .
Proof: ” is left handed” means that either or is left handed, let’s assume is left handed first, we have:
It is obvious that is right handed rotation matrix. Assume the corresponding rotation angle and the corresponding rotation axis, note , we have