Learning a good 3D human pose representation is important for human pose related tasks, e.g. human 3D pose estimation and action recognition. Within all these problems, preserving the intrinsic pose information and adapting to view variations are two critical issues. In this work, we propose a novel Siamese denoising autoencoder to learn a 3D pose representation by disentangling the pose-dependent and view-dependent feature from the human skeleton data, in a fully unsupervised manner. These two disentangled features are utilized together as the representation of the 3D pose. To consider both the kinematic and geometric dependencies, a sequential bidirectional recursive network (SeBiReNet) is further proposed to model the human skeleton data. Extensive experiments demonstrate that the learned representation 1) preserves the intrinsic information of human pose, 2) shows good transferability across datasets and tasks. Notably, our approach achieves state-of-the-art performance on two inherently different tasks: pose denoising and unsupervised action recognition. Code and models are available at: <https://github.com/NIEQiang001/unsupervised-human-pose.git>
Human action recognition and human behavior analysis have extensive applications in human-robot interaction (HRI) systems, such as health care, entertainment, education, security and many other intelligent surveillance scenarios, which has also made 3D human pose estimation a hot research topic for decades. Learning a good human 3D pose representation is of great significance to research on both human action recognition and human pose estimation.
However, understanding the human pose is a challenging task, because it requires the computer to learn the dependencies between joints of the human skeleton robustly under different viewpoints. These dependencies include kinematic relationships between joints and geometric features of the human body. The kinematic relationship describes the motion transmission process between joints and the role of each joint in an action. The geometric feature refers to specific appearance characteristics of the human body, such as fixed bone lengths and the symmetry between left and right limbs. Many existing works have utilized the geometric features of the human body [17, 13, 22, 36, 18], but few are able to model the kinematic relationships between human body joints. Kinematics is a physical process and is hard to model with regular CNN, RNN or MLP neural networks. Hence, we propose a sequential bidirectional recursive network (SeBiReNet) to model the dependencies of the human skeleton.
Besides the dependencies between joints, the human 3D pose presents an unlimited number of modalities when recorded or observed from different viewpoints, which makes processing the human 3D pose difficult for an intelligent system. Increasing the size of the training dataset with different views may be effective. However, it is impossible to record data from all possible viewpoints. To tackle the view variation, some previous works applied preprocessing to eliminate it [12, 3]. These methods are dataset dependent because the preprocessing is specifically designed for each dataset. Many other methods [32, 16, 32, 6, 23, 29] extracted hand-crafted view-invariant features as pose descriptors based on human prior knowledge. Although these hand-crafted features are view-invariant, information is lost in extracting them, as only a few explanatory factors are considered. Some methods [3, 9, 35] try to learn discriminative pose representations with deep learning. However, the transferability of the representations learned by existing approaches across datasets and tasks is limited.
Human poses result from the rich interaction of many factors, such as the subject, the action, and the viewpoint. Learning view-invariant features means extracting features that are insensitive to view variation, which also means that some features that are sensitive to the variation but still informative are discarded. As Bengio et al. mentioned, a better way to overcome these challenges is to leverage the data itself, …, to disentangle as many factors as possible while discarding as little information about the data as is practical.
Motivated by the aforementioned issues, we propose an unsupervised method for learning a latent representation of the human 3D pose by disentangling the pose-dependent and view-dependent features from human skeleton data. We introduce a novel SeBiReNet to model the human skeleton data. A Siamese denoising autoencoder based on the SeBiReNet is designed to learn the latent human pose representation. The ability to denoise corrupted skeletons from an unseen dataset shows that the learned representation preserves the intrinsic information of the human pose, including both the kinematic and geometric dependencies. Disentangling the pose-dependent (view-invariant) and view-dependent (view-variant) features from skeleton data, rather than extracting only the view-invariant feature, enables us to transfer the viewpoint of a human pose in the latent space, which is used as a strengthened regularization in our training process.
We summarize our contributions as follows:
We propose a novel SeBiReNet to model the kinematic dependencies between body joints in the human skeleton data.
Based on SeBiReNet, a Siamese denoising autoencoder is proposed for learning a 3D human pose representation with feature disentanglement. The pose representation, learned without supervision, 1) preserves the intrinsic information of human pose, 2) shows good transferability across datasets and tasks.
Extensive experiments demonstrate that state-of-the-art performance can be achieved when applying the learned representation on two inherently different tasks: pose denoising and unsupervised action recognition.
To understand the human 3D pose, the most important step is to figure out the dependencies between body joints, which should include both the kinematic and the geometric dependency. Compared to the kinematic dependency between joints, the geometric characteristics are much easier to model. Ramakrishna et al. used normalized limb lengths as an anthropometric regularity to reconstruct the 3D human pose from 2D image landmarks. Sun et al. proposed to use the summation of bone lengths as a supervision loss. The summation covers all bones on the path between every pair of joints; in essence, it is a pairwise geodesic distance. Their work showed that the accuracy of human pose regression can be improved based on the summations of bone lengths. As the ratios between bone lengths remain relatively fixed in a human skeleton, Zhou et al. utilized the length ratios of bones as weak supervision for reconstructing 3D poses from in-the-wild images without 3D annotations. Though the human skeleton is similar to a tree structure, few works have applied recursive networks to human pose modeling. Wei et al. introduced a vanilla tree network for skeleton-based action recognition. However, only the output from the single tree root node is utilized, which is inherently different from the structure of our SeBiReNet proposed to model the human 3D pose.
Demisse et al. proposed a denoising autoencoder for unsupervised skeleton-based action recognition using MLP layers. However, to evaluate the extracted features in cross-view action recognition, a preprocessing step is applied to estimate the view variation. Li et al. proposed a method to learn a pose representation from sequential RGB data by adding a view discriminator that decides which view the learned feature comes from. Using a view classifier indicates that their views depend on the training data and that view labels were given. In our method, by contrast, the poses are randomly rotated and no view label is given. Zheng et al. presented an adversarial training strategy to learn representations of skeleton sequences for action recognition. Compared to these methods, the proposed method is able to learn a view-invariant pose-dependent feature from a single pose without any additional label or auxiliary network. Requiring no temporal information makes our representation applicable to both time-related and time-independent tasks, as verified in our experiments. It is interesting to find that Aberman et al. applied a similar feature decomposition and re-composition process in their work on retargeting video-captured motion between different human performers. Our method differs from theirs in two aspects: 1) we embed a denoising process in the learning, which helps the network capture the intrinsic feature of the skeleton pose; and 2) our disentangled features have more interaction, by sharing some weights in the decomposition and by multiplying with each other in the re-composition process.
Given a human 3D pose, a latent representation can be learned by an encoding function with learnable parameters. To make sure the learned representation contains useful information about the original data, the representation is required to be able to recover the original pose through another, decoding function with its own parameters, which is the basic idea of the autoencoder in representation learning. Vincent et al. have shown that using a denoising autoencoder to reconstruct the clean data from its corrupted version helps to avoid trivial solutions and improves the performance of the learned latent representations. Therefore, learning the human 3D pose representation can basically be formulated as the denoising reconstruction problem of eq. 1.
In eq. 1, the learned latent representation of the human 3D pose is computed from the corrupted pose, and the reconstruction is compared with the corresponding clean pose. However, the representation learned in eq. 1 contains both the pose-dependent and the view-dependent information. As analyzed in Sec. 1, we hope to learn a view-invariant representation while avoiding discarding the view-dependent feature of the human pose, for the sake of information preservation. Thus, unlike traditional methods, we attempt to disentangle the view-invariant feature from the view-dependent feature and use the combination of the two as the latent representation of the human 3D pose. Under this consideration, the representation learning is reformulated as eq. 2.
In eq. 2, a coupling operation, which can be matrix multiplication or concatenation, combines the view-dependent and pose-dependent features. From a generative perspective, the learning process in eq. 2 can also be written as eq. 3.
Eq. 3 includes the pose corruption process explicitly. If we assume that the prior distribution of the two features can be factorized, i.e., that the view-dependent and pose-dependent features are independent, then we obtain eq. 4.
According to eq. 4, the learning of the view-dependent feature and that of the pose-dependent feature do not have much influence on each other. To strengthen the interaction between these two features and disentangle them smoothly, we propose that the two feature extractors share part of their parameters in the parameter space. In this manner, the two features are disentangled from, and affect each other through, a common latent feature. Although we are trying to disentangle the view-dependent and pose-dependent features from the original pose, this is not necessarily achieved so far, as the learned latent representation is still not well constrained. To introduce the concept of viewpoint into the learning process, an additional transformation is applied to the corrupted pose by randomly rotating it in 3D space. In this case, eq. 3 becomes eq. 5.
In eq. 5, the randomly rotated corrupted pose is obtained by applying the random rotation process to the corrupted pose. The marginal distribution of the pose-dependent feature should be consistent between the corrupted pose and the randomly rotated pose. Thus, besides the pose reconstruction loss, we regularize the pose-dependent feature by minimizing the Kullback–Leibler divergence between the pose-dependent features of poses under different observation angles, as shown in eq. 6.
Putting everything together, our human 3D pose representation learning process is modelled as eq. 7.
In order to capture the kinematic dependencies of the human skeleton structure, a sequential bidirectional recursive neural network (SeBiReNet) is proposed. The bidirectional recursive neural network has two tree structures, as shown in Fig. 1, which model the human skeleton structure intuitively. Recursive neural networks are widely used for text or language analysis [7, 21] due to their ability to summarise semantic meaning. However, the conventional recursive neural network has only one direction, which means that information can only flow from the leaf nodes to the root node. By contrast, the motion of the human body is transmitted from a parent joint to its child joints. Usually, to determine the position of a joint, both the position of the parent joint and the positions of the child joints have to be considered. In this regard, the proposed SeBiReNet models the two directions of dependency between parent joint and child joint through a recursive subnet (left part in Fig. 1) and a diffuse subnet (right part in Fig. 1), respectively. The two subnets have independent kernel weights but share the hidden states, whose size is determined by the joint number and the feature size. The shared hidden states store the intermediate inference results when information flows in the network, and these intermediate results are continually refined when the network runs recurrently. The network is named SeBiReNet because information flows sequentially and in opposite directions in the two subnets. The proposed architecture not only models the forward and inverse kinematic processes but also imitates the way humans think a problem over repeatedly.
The number of nodes of SeBiReNet can be adjusted according to the number of joints of the human skeleton model. As most skeleton models contain 17 joints, the basic version of our proposed model is designed to have 34 nodes. In SeBiReNet, each node is a GRU cell. Other node types, such as LSTM, can also be used. The forgetting mechanism of the GRU cell enables the network to tackle noisy input. The inference process of SeBiReNet can be formulated as eq. 8.
In eq. 8, the input of each node contains the 3D position of the corresponding joint and the hidden states output by all of its child nodes (in the recursive subnet) or by its parent node (in the diffuse subnet). Each node also has a shared hidden state. One superscript marks the recursive subnet together with its kernel weights and biases, and another marks the diffuse subnet together with the kernel weights and biases that belong to it. The nonlinear function is that of the GRU cell, and the node output passes through an activation function. After each inference in the recursive subnet or the diffuse subnet, the shared hidden states and the network output are updated by the hidden states and output of the corresponding subnet. The outputs of all nodes are concatenated together as the final output of the SeBiReNet.
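The two-pass inference described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the 17-joint parent list `PARENT`, the helper names, the per-node weights, the random initialisation and the use of the hidden state as the node output are all assumptions made for illustration. What it does take from the text is the structure: a leaf-to-root pass and a root-to-leaf pass, each with its own GRU weights, reading and writing one shared set of hidden states.

```python
import math
import random

# Assumed 17-joint tree (hypothetical ordering); PARENT[j] is the parent of joint j, -1 marks the root.
PARENT = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]


def make_gru(input_size, hidden_size, rng):
    """Randomly initialised GRU weights (W, U, b) for the update, reset and candidate parts."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {k: (mat(hidden_size, input_size), mat(hidden_size, hidden_size), [0.0] * hidden_size)
            for k in ("update", "reset", "cand")}


def gru_step(cell, x, h):
    """One standard GRU update of hidden state h given input vector x."""
    def affine(name, state):
        W, U, b = cell[name]
        return [sum(w * v for w, v in zip(W[i], x)) + sum(u * s for u, s in zip(U[i], state)) + b[i]
                for i in range(len(h))]
    z = [1.0 / (1.0 + math.exp(-a)) for a in affine("update", h)]
    r = [1.0 / (1.0 + math.exp(-a)) for a in affine("reset", h)]
    n = [math.tanh(a) for a in affine("cand", [ri * hi for ri, hi in zip(r, h)])]
    return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def sebirenet_forward(joints, parent, hidden_size, rng, num_runs=1):
    """Leaf-to-root pass, then root-to-leaf pass, over shared hidden states."""
    n = len(joints)
    children = [[] for _ in range(n)]
    for j, p in enumerate(parent):
        if p >= 0:
            children[p].append(j)
    # Independent weights for the recursive (rec) and diffuse (dif) subnets.
    rec = [make_gru(3 + hidden_size * len(children[j]), hidden_size, rng) for j in range(n)]
    dif = [make_gru(3 + (hidden_size if parent[j] >= 0 else 0), hidden_size, rng) for j in range(n)]
    hidden = [[0.0] * hidden_size for _ in range(n)]  # shared by both subnets
    for _ in range(num_runs):
        # Recursive subnet: parents have smaller indices, so reversed order visits children first.
        for j in reversed(range(n)):
            x = list(joints[j]) + [v for c in children[j] for v in hidden[c]]
            hidden[j] = gru_step(rec[j], x, hidden[j])
        # Diffuse subnet: information flows from each parent to its children.
        for j in range(n):
            x = list(joints[j]) + (hidden[parent[j]] if parent[j] >= 0 else [])
            hidden[j] = gru_step(dif[j], x, hidden[j])
    return [v for h in hidden for v in h]  # concatenation over all nodes


joints = [(0.1 * j, 0.05 * j, 0.0) for j in range(17)]
features = sebirenet_forward(joints, PARENT, hidden_size=4, rng=random.Random(0))
```

Setting `num_runs` above 1 corresponds to running the network recurrently, so that the shared hidden states are refined again by both subnets.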
Complexity of the SeBiReNet. The number of parameters in a node depends on the number of hidden units of its GRU cell, the dimension of the input feature, and the number of its child nodes (in the recursive subnetwork) or parent nodes (in the diffuse subnetwork). In a SeBiReNet with 17 joints, there are 6 leaf nodes, 26 nodes with one child node or parent node, and 2 nodes with 3 child nodes.
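As a rough illustration of this count, the sketch below tallies the node types for an assumed 17-joint tree and sums GRU parameters. The per-node formula from the text is not available here, so the sketch assumes a standard GRU with one bias vector per gate, an input made of the 3D joint position concatenated with the hidden states of the connected nodes, and 32 hidden units. The parent list `PARENT` and the function names are hypothetical.

```python
from collections import Counter

# Assumed 17-joint tree (hypothetical ordering); -1 marks the root.
PARENT = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]


def gru_param_count(input_size, hidden_size):
    """Three gates, each with an input matrix, a recurrent matrix and one bias vector."""
    return 3 * (hidden_size * (input_size + hidden_size) + hidden_size)


def sebirenet_param_count(parent, hidden_size=32, joint_dim=3):
    child_counts = Counter(p for p in parent if p >= 0)
    # Recursive subnet: each node receives the hidden states of its children.
    fan_in = [child_counts.get(j, 0) for j in range(len(parent))]
    # Diffuse subnet: each node receives the hidden state of its parent, if any.
    fan_in += [0 if p < 0 else 1 for p in parent]
    total = sum(gru_param_count(joint_dim + k * hidden_size, hidden_size) for k in fan_in)
    return Counter(fan_in), total


node_types, total_params = sebirenet_param_count(PARENT)
print(dict(node_types), total_params)
```

Under these assumptions the tree yields the 6 / 26 / 2 split of node types stated above, and the total comes to roughly 0.22 million parameters for the two subnets alone. A different GRU parameterisation or input layout would change the figure.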
According to the analysis in Sec. 3.1, we designed a denoising autoencoder (DAE) based on SeBiReNet to learn the representation of the human 3D pose. Unlike the general practice of adding Gaussian noise to the clean input to obtain a mildly corrupted version, we directly destroy the skeleton into an implausible version in which some randomly selected joints are moved to illegal positions. The network is expected to distinguish valid human poses from invalid ones and to recover the correct positions of the invalid joints.
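A minimal sketch of this kind of corruption is given below. The number of displaced joints, the displacement range and the uniform sampling are hypothetical choices made for illustration, and the function name `corrupt_skeleton` is not from the original work.

```python
import random


def corrupt_skeleton(joints, num_corrupt, displacement, rng):
    """Return a copy of `joints` with `num_corrupt` randomly chosen joints displaced.

    joints: list of (x, y, z) tuples.
    displacement: maximum offset per axis (assumed parameter, same unit as the joints).
    """
    corrupted = [tuple(j) for j in joints]
    chosen = rng.sample(range(len(joints)), num_corrupt)
    for idx in chosen:
        x, y, z = corrupted[idx]
        corrupted[idx] = (
            x + rng.uniform(-displacement, displacement),
            y + rng.uniform(-displacement, displacement),
            z + rng.uniform(-displacement, displacement),
        )
    return corrupted, sorted(chosen)


rng = random.Random(0)
clean = [(float(i), 0.0, 0.0) for i in range(17)]
noisy, moved = corrupt_skeleton(clean, num_corrupt=5, displacement=500.0, rng=rng)
```

The clean pose is kept as the reconstruction target, while `noisy` is what the encoder sees.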
Though the kinematic dependency is intrinsically modeled by the SeBiReNet, the geometric characteristics have not yet been well considered. To this end, we add a bone length loss and a symmetry loss to the pose reconstruction loss, as shown in eq. 9.
In eq. 9, the first term is the reconstruction error of the joint positions, computed between the 3D position of each joint of each sample and the corresponding recovered position. The second term is the bone length loss, which requires the recovered length of the bone between two connected joints to be equal to the ground truth length. The third term constrains the recovered bone lengths of the left limbs to be equal to the corresponding recovered bone lengths of the right limbs.
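The three terms can be sketched as follows. The exact form of eq. 9 is not reproduced here, so the squared errors, the averaging and the weights `w_bone` and `w_sym` are assumptions, and the 5-joint toy skeleton is purely illustrative.

```python
import math


def bone_length(pose, i, j):
    """Euclidean length of the bone between joints i and j."""
    return math.dist(pose[i], pose[j])


def pose_reconstruction_loss(pred, target, bones, mirrored_bones, w_bone=1.0, w_sym=1.0):
    # Term 1: joint position error between recovered and clean pose.
    joint_term = sum(math.dist(p, t) ** 2 for p, t in zip(pred, target)) / len(pred)
    # Term 2: recovered bone lengths should match the ground-truth lengths.
    bone_term = sum((bone_length(pred, i, j) - bone_length(target, i, j)) ** 2
                    for i, j in bones) / len(bones)
    # Term 3: recovered left-limb bones should be as long as their right-limb counterparts.
    sym_term = sum((bone_length(pred, *left) - bone_length(pred, *right)) ** 2
                   for left, right in mirrored_bones) / len(mirrored_bones)
    return joint_term + w_bone * bone_term + w_sym * sym_term


# Toy skeleton: 0 neck, 1 left shoulder, 2 left hand, 3 right shoulder, 4 right hand.
target = [(0, 0, 0), (-1, 0, 0), (-2, 0, 0), (1, 0, 0), (2, 0, 0)]
bones = [(0, 1), (1, 2), (0, 3), (3, 4)]
mirrored_bones = [((0, 1), (0, 3)), ((1, 2), (3, 4))]

stretched = list(target)
stretched[2] = (-3, 0, 0)  # left forearm is now twice as long as the right one

perfect = pose_reconstruction_loss(target, target, bones, mirrored_bones)
worse = pose_reconstruction_loss(stretched, target, bones, mirrored_bones)
```

A single misplaced joint is penalised three times here: once for its position, once for the wrong bone length, and once for breaking left-right symmetry.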
The view-dependent feature and the pose-dependent feature are disentangled after the SeBiReNet in the encoder. Sharing some weights before the disentanglement strengthens the interaction between the two features, as explained in Sec. 3.1, and makes the network more compact. It is a reasonable requirement that the view-dependent feature should not change the metric of the pose-dependent feature space. As the coupling operation we adopt is matrix multiplication, this requirement is satisfied only when the view-dependent feature acts as a unitary transformation. Since our problem lies in the real domain, we regularize the view-dependent feature to be an orthogonal matrix, as shown in eq. 10, which involves the dimension of the view-dependent feature, a weight factor, and an identity matrix. The orthogonal regularization is also capable of preventing pose-related information from leaking into the view-dependent feature.
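One common way to write such an orthogonality penalty is sketched below. It is an assumption rather than a transcription of eq. 10: the penalty is taken to be the squared Frobenius norm of the Gram matrix minus the identity, scaled by a weight factor, and the function name is hypothetical.

```python
def orthogonality_penalty(V, weight=0.1):
    """weight * ||V^T V - I||_F^2 for a matrix V given as a list of rows."""
    rows, cols = len(V), len(V[0])
    total = 0.0
    for i in range(cols):
        for j in range(cols):
            gram_ij = sum(V[k][i] * V[k][j] for k in range(rows))  # entry (i, j) of V^T V
            target = 1.0 if i == j else 0.0
            total += (gram_ij - target) ** 2
    return weight * total


rotation_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
scaling = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]

no_penalty = orthogonality_penalty(rotation_z_90)
scaled_penalty = orthogonality_penalty(scaling)
```

A rotation leaves distances in the pose-dependent feature space unchanged and is not penalised, whereas a scaling changes the metric and is.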
To regularize the learned pose-dependent feature to be view-invariant, a random rotation is applied to the corrupted human poses, and consistency between the distributions of the pose-dependent features of the corrupted pose and of the rotated pose is enforced with a feature loss, as shown in Fig. 2. In Fig. 2, there are two pipelines that process the corrupted pose and the randomly rotated pose separately, where the rotation is given by a randomly generated rotation matrix. The SeBiReNet is used in both the encoder and the decoder. Weights are shared between all the encoders and decoders to make sure that poses under different views are encoded and decoded in the same manner. The feature loss on the pose-dependent features learned from different views is defined as the Frobenius norm of their difference. We believe that, if the features are well disentangled from the human pose, poses can be transferred between different views by exchanging their pose-dependent and view-dependent features. This belief is used as a reinforced regularization for learning the pose representation in our method, as shown in Fig. 2, which also depicts the view-transferred poses. Therefore, putting everything together, our optimization target for learning a human 3D pose representation is formulated as eq. 11, where the pose reconstruction losses are those defined in eq. 9, each loss has a weight that adjusts its influence, and an L2 weight regularization term is included to avoid overfitting.
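The random rotation and the Frobenius-norm feature loss can be sketched as below. How the rotation matrix is sampled is not stated in the text, so drawing it from a normalised random quaternion is an assumption, and the function names are hypothetical.

```python
import math
import random


def random_rotation(rng):
    """3x3 rotation matrix built from a random unit quaternion (assumed sampling scheme)."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def rotate_pose(R, joints):
    """Apply rotation R to every joint of a pose."""
    return [tuple(sum(R[r][c] * j[c] for c in range(3)) for r in range(3)) for j in joints]


def frobenius_distance(A, B):
    """Frobenius norm of A - B for two equally shaped feature matrices."""
    return math.sqrt(sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)))


rng = random.Random(1)
R = random_rotation(rng)
pose = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.5, 1.5, 0.2)]
rotated = rotate_pose(R, pose)
```

During training, `frobenius_distance` would be applied to the pose-dependent features that the shared encoder produces for `pose` and for `rotated`. The rotation changes the viewpoint but leaves bone lengths untouched.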
Implementation Details. The number of hidden units of the GRU cell in SeBiReNet is 32. Except for the output layer, a nonlinear activation function is applied after each MLP layer. A gradient descent optimizer with an initial learning rate of 5e-5 is used to train the DAE. Weights of different losses defined in eq. 11 are . The weight factor in eq. 10 is set to 0.1. The batch size is 64.
Training Set. The Cambridge-Imperial APE (Action-Pose-Estimation) dataset is used to train the proposed Siamese DAE. The dataset, which contains 245 sequences from 7 subjects performing 7 different categories of actions, was collected for 3D human pose estimation. Corrupted skeletons are generated by randomly selecting 15 joints from each skeleton and moving them to unreasonable positions with a relatively large displacement. As shown in Fig. 3 (a), these corrupted skeletons violate bio-constraints, such as bone length and allowed motion angle limits. In total, 52500 corrupted poses are generated for training and 14000 skeletons are generated for testing. The Mean Per Joint Position Error (MPJPE) is adopted as the performance measure for the reconstructed skeletons and the trained model.
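For reference, MPJPE is the mean Euclidean distance between corresponding recovered and ground-truth joints. A minimal sketch follows; the function names are illustrative, and averaging per pose before averaging over the dataset is one common convention rather than a detail taken from the text.

```python
import math


def mpjpe(pred, target):
    """Mean Euclidean distance between matching joints of one pose."""
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def dataset_mpjpe(preds, targets):
    """Average of the per-pose MPJPE over a set of poses."""
    errors = [mpjpe(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)


target = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (100.0, 200.0, 0.0)]
shifted = [(x + 3.0, y + 4.0, z) for x, y, z in target]  # every joint is off by 5 units
error = mpjpe(shifted, target)
```

The result has the same unit as the joint coordinates, which is millimetres in the tables of this paper.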
Test Sets. To verify the effectiveness of the learned representations, we evaluate them on two different tasks: pose denoising and unsupervised cross-view action recognition. Two benchmark action datasets are used: the Northwestern-UCLA (N-UCLA) dataset and the NTU RGB+D dataset. Both datasets contain skeletons captured from different views and performed by different subjects. NTU RGB+D is one of the largest skeleton datasets and N-UCLA is one of the most commonly used. The pretrained encoder is applied to them to extract pose representations without any additional training, i.e., these two datasets are not used in the training phase of the DAE. A 1-layer LSTM with 128 hidden units is used as the classifier in the action recognition task.
| Network structure | MPJPE (mm) |
| conventional tree (only has the recursive subnet) | 65.76 |
| the diffuse subnet | 64.94 |
Our model is trained and validated on the Cambridge-Imperial APE dataset. Fig. 3 (b) shows several recovered skeletons. Although we corrupt the skeletons randomly and severely, our network is still able to recover the correct positions of the invalid joints. To further show the effectiveness of our network design, we compare the performance of the proposed SeBiReNet with some baseline structures: the conventional tree structure (only has the recursive part), the diffuse subnet, the concatenated structure, and the recurrent SeBiReNet. Unlike the SeBiReNet, which shares hidden states between the recursive subnet and the diffuse subnet, the concatenated structure takes the concatenation of the outputs of the recursive subnet and the diffuse subnet as its output. The recurrent SeBiReNet means that the SeBiReNet runs in a recurrent mode, as shown in Fig. 1. In this experiment, we run it only one additional time.
For a fair comparison of the capability of different structures in encoding the human 3D pose, the results in Table 1 are obtained by replacing the decoder in Fig. 2 with a three-layer MLP. As shown in Table 1, even with a simple decoder, using the proposed SeBiReNet as the encoder achieves an MPJPE of 42.03 mm, which is a 35% improvement over the first three structures in recovering corrupted skeletons. Running the SeBiReNet recurrently brings little further improvement. As skeleton data is relatively simple and low dimensional, running the SeBiReNet only once is enough to obtain a good result. Compared to structures that only have the SeBiReNet in the encoder, the proposed structure in Fig. 2, which embeds the SeBiReNet in both the encoder and the decoder, attains the best performance of 33.39 mm. This result indicates that the SeBiReNet is superior to MLP layers in processing skeleton data.
To further demonstrate that the learned representation does encode the intrinsic feature of the human 3D pose, we applied the pretrained network to the unseen N-UCLA dataset for a qualitative pose denoising evaluation. As Fig. 4 shows, in terms of fixed bone length, symmetry, and the motion limits of human joints, the recovered skeletons are much more stable and plausible than the original skeletons captured by the 3D sensor. The capability of denoising unseen skeletons verifies that our network has learned the intrinsic feature of the human 3D pose.
To evaluate the learned pose-dependent feature, we further exploit it for unsupervised cross-view action recognition on the N-UCLA dataset and the NTU RGB+D dataset. The results are shown in Table 2. In unsupervised action recognition, it is common practice to keep the pre-trained encoder fixed and train only the classifier [3, 9, 14]. As our target is to evaluate the performance of the learned pose-dependent representation in cross-view action recognition, a simple 1-layer LSTM is adopted as the classifier to reduce the influence of classifier design. For the same reason, we only compare with state-of-the-art methods based on RNNs. Though our classifier is much simpler than those of the compared methods, the accuracy we achieve is competitive and even surpasses some of the supervised methods. Action recognition directly based on skeleton coordinates is used as the baseline. Here, "raw coordinates" means directly feeding the raw coordinates of the skeletons into the classifier, and "normalized coordinates" means that the poses are further normalized according to the mean position and standard deviation of the joints. The translation of the human pose is neglected when training the DAE. For action recognition, however, the translation, which is part of the human motion, is concatenated with the learned pose-dependent feature.
| Dataset | Category | Method | Acc. (%) | # of params. |
| N-UCLA | | Multi-task RNN | 87.3 | - |
| | Unsupervised | Li et al. | 62.5 | - |
| | | LongT GAN | 74.3* | - |
| | | Ours (1-layer LSTM) | 80.30 | - |
| NTU RGB+D | Baseline | normalized coordinates | 69.08 | - |
| | Supervised | Hand-crafted LARP | 52.76 | - |
| | | Part-aware LSTM | 70.27 | - |
| | | Two-stream GCA-LSTM | 85.10 | 24.54M |
| | | Bayesian GC-LSTM | 89.0 | - |
| | Unsupervised | LongT GAN | 48.1* | 40.18M |
| | | Ours (1-layer LSTM) | 79.71 | 0.27M |
Table 2 shows clearly that the learned pose-dependent feature improves the cross-view action recognition accuracy significantly over the baseline results, by about 30% on the N-UCLA dataset and 10% on the NTU RGB+D dataset. Among the unsupervised methods on the N-UCLA dataset, our method achieves the best performance, with an increment of 18% over the work of Li et al. The method of  is exclusively designed for learning a temporal representation from sequential skeletons for action recognition, while our method is designed for learning a representation from a single pose. The accuracy of Denoised-LSTM, which is based on a conventional DAE, is quite close to our result, but the feature it learns is not view-invariant and a preprocessing step is needed to alleviate the influence of view changes. A similar performance is observed on the NTU RGB+D dataset. Even compared with supervised methods, our accuracy is better than that of some methods with more complex classifiers. The performance attained on these two benchmark datasets demonstrates the effectiveness and robustness of the pose representation learned by our method.
Though temporal information is not considered in learning pose representation, the performance in action recognition indicates that informative temporal features still can be extracted from sequential learned representations with simple LSTM layer, which should be attributed to the intrinsic feature of human pose it has learned.
Moreover, as shown in Table 2, we also compare the model size with that of other state-of-the-art works evaluated on the NTU dataset. Counting the SeBiReNet and all the MLP layers used in our learning architecture, the number of learnable parameters in our method is about 0.27 million. As some details are missing in several works, we can only estimate a lower bound on the number of parameters in those methods, such as EnGAN-PoseRNN and AGC-LSTM. It can be seen that our method achieves a competitive result with the fewest parameters, which also shows the efficiency of our method from another perspective.
To evaluate the contribution of each part of the learning architecture, we conduct an ablation study on the N-UCLA and NTU RGB+D datasets, as shown in Table 3. In Table 3, the raw DAE means the structure denoted in eq. 1. "FD" means that the learned latent feature is disentangled into a view-dependent feature and a pose-dependent feature, as denoted in eq. 2. The orthogonal loss refers to the unit orthogonal matrix constraint on the view-dependent feature, as denoted in eq. 10. The remaining two terms are the feature loss and the reconstruction loss of the randomly rotated pose, as defined in Sec. 3.3. "Full architecture" means integrating all the components defined in eq. 11 for better disentanglement and representation learning.
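| Method | N-UCLA Acc. (%) | NTU RGB+D Acc. (%) |
| baseline (relative coordinates) | 51.53 | 69.08 |
| raw DAE + Feature Decomposition (FD) | 60.61 | 73.99 |
| raw DAE + FD + orthogonal loss | 62.55 | 74.46 |
| raw DAE + FD + orthogonal loss + feature loss | 73.81 | 75.72 |
| raw DAE + FD + orthogonal loss + feature loss + rotated-pose reconstruction loss | 76.84 | 77.07 |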
As shown in Table 3, the raw DAE with skeleton corruption achieves an accuracy of 58.66% on N-UCLA, about 7 percentage points above the baseline result of 51.53%. Disentangling the latent feature and adding the orthogonal loss to the view-dependent feature brings another improvement of about 4 points. The accuracy then increases steeply to 73.81% when the feature loss is added to the pose-dependent feature, which indicates that the network learns a better view-invariant pose feature in this case. The reconstruction losses of the randomly rotated pose and of the generated view-transferred poses further improve the performance to 80.3% in cross-view action recognition, which indicates that the features are better disentangled. On the NTU RGB+D dataset, the improvements brought by the different components are smaller but steady, and all the components designed in our method contribute to the final performance. The feature loss and the view-transferred pose losses are strong regularizations for preserving the intrinsic pose information and learning view-invariant representations. These results, in turn, demonstrate the effectiveness of disentangling the features rather than only extracting the view-invariant feature.
We further design a simple framework to explore extending the learned representation to 3D pose estimation from 2D poses. The extension contains a 3D encoder, a 2D encoder, and a decoder, as shown in Fig. 5. The 3D encoder and the decoder form a 3D stream and are pre-trained on 3D poses as in the previous section. The encoder and decoder are the same as those of the DAE in Fig. 2. In the second step, by regularizing the 2D encoder to learn a representation similar to the one obtained in the 3D stream, the 3D pose is expected to be estimated from the 2D pose. The result achieved by finetuning the 3D stream on the H3.6M dataset is shown in Table 5. It can be seen that the learned representation is also applicable to 3D pose estimation with a simple framework.
In this paper, we propose a neural network architecture to learn a human 3D pose representation by disentangling the view-dependent and pose-dependent features. Unlike previous methods, the proposed method uses the view-dependent and pose-dependent features together as the pose representation, for the sake of preserving information. A SeBiReNet is proposed to model the human skeleton data, which considers the kinematic dependency between the body joints of the human skeleton. Extensive experiments show that the learned representation keeps the intrinsic feature of the human 3D pose and is capable of achieving excellent performance in skeleton denoising and unsupervised action recognition tasks. Building on the disentangled pose feature, our future research will focus on view transfer between different poses.
Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1110–1118.
Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103.
Thirty-Second AAAI Conference on Artificial Intelligence.
Eigenjoints-based action recognition using naive-bayes-nearest-neighbor. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, pp. 14–19.