Human Motion Prediction (HMP) is a fundamental research topic that benefits many applications such as intelligent security, autonomous driving, and human-robot interaction. Early works employed nonlinear Markov models [lehrmann2014efficient], Gaussian Process dynamical models [wang2005gaussian], and restricted Boltzmann machines [taylor2007modeling] to tackle this problem, while recently a large number of methods based on deep learning have emerged, showing significant merits.
Due to the sequential nature of pose sequences, HMP is mostly tackled with Recurrent Neural Networks (RNN) [fragkiadaki2015recurrent, jain2016structural, ghosh2017learning, martinez2017human, gui2018adversarial, tang2018long, gui2018few, guo2019human, liu2019towards, chiu2019action, gopalakrishnan2019neural, sang2020human, corona2020context, pavllo2020modeling]. However, RNN-based approaches usually suffer from discontinuity and error accumulation, which might be due to the training difficulty of RNNs. A few works employ Convolutional Neural Networks (CNN) to solve the HMP problem [butepage2017deep, li2018convolutional, shu2021spatiotemporal, cui2021efficient]. They treat a pose sequence as an image and apply 2D convolutions to it, but poses are essentially irregular data, which limits the effectiveness of the 2D convolutions. Recently, many works have demonstrated that Graph Convolutional Networks (GCN) are well suited to HMP [aksan2019structured, mao2019learning, mao2020history, cui2020learning, li2020dynamic, li2021symbiotic, li2020multitask, liu2020multi, lebailly2020motion, dang2021msr, cui2021towards]. They treat a human pose as a graph by viewing each joint as a node of the graph and constructing edges between any pair of joints. GCNs are then used to learn spatial relations between joints, which benefits the pose prediction.
We observe that starting from the seminal work of LTD [mao2019learning], all recent GCN-based approaches [dang2021msr, cui2020learning, mao2020history, sofianos2021space] share the following preprocessing steps: (1) They duplicate the last observed pose as many times as the length of the future pose sequence, and append the duplicated poses to the observed sequence to form an extended input sequence. (2) Similarly, the ground truth future poses are appended to the observed poses to obtain the extended ground truth output sequence. Their proposed networks predict from the extended input sequence to the extended output sequence instead of from the original observed poses to the future poses. Ablation comparisons show that the prediction between the extended sequences is easier than between the original sequences, and the former achieves significantly better prediction accuracy than the latter. Dang et al. [dang2021msr] ascribed this to the global residual connection between the extended input and output, while in this paper we interpret this phenomenon from another perspective: the last observed pose provides an “initial guess” for the target future poses. From the initial guess, the network only needs to move slightly to reach the target positions. However, we argue that the last observed pose is not the best initial guess. For example, the toy experiments in Figure 1 (a) show that the mean pose of future poses is better than the last observed pose as the initial guess.
The problem is that we do not really know the mean pose of the future poses. Thus, as shown in Figure 1 (b), using the mean of future poses as an intermediate target, we propose to first predict the mean of the future poses and then predict the final target future poses by viewing the predicted mean as the initial guess. Although the predicted mean is not as good as the ground truth mean when used as the initial guess, it is better than the last observed pose. Further, for more accuracy gain, we extend the two-stage prediction strategy to a multi-stage version. To this end, we recursively smooth the ground truth output sequence, obtaining a set of sequences at different smoothing levels. By treating these smoothed results as intermediate targets at the multiple stages, our multi-stage prediction framework progressively predicts better initial guesses for the next stages until the final target pose sequence is obtained.
Any existing human motion prediction model such as [martinez2017human, li2018convolutional, mao2019learning] can be used to accomplish the prediction task at each of our stages. Among them, we choose GCN as the building block to construct our multi-stage framework. Existing GCN-based approaches [mao2019learning, cui2020learning, dang2021msr] only employ GCN to extract spatial features. In contrast, we propose to process both spatial and temporal features by GCNs. Specifically, we propose S-DGCN and T-DGCN. S-DGCN views each pose as a fully-connected graph and encodes global spatial dependencies in human pose, while T-DGCN views each joint trajectory as a fully-connected graph and encodes global temporal dependencies in motion trajectory. S-DGCN and T-DGCN together extract global spatiotemporal features, which further improve our prediction accuracy.
In summary, the main contributions of this paper are three-fold:
We propose a novel multi-stage human motion prediction framework utilizing recursively smoothed results of the ground truth target sequence as the intermediate targets, by which we progressively improve the initial guess of the final target future poses for better prediction accuracy.
We propose a network based on S-DGCN and T-DGCN that extracts global spatiotemporal features effectively to fulfill the prediction task at each stage.
We conduct extensive experiments showing that our method outperforms previous approaches by large margins on three public datasets.
2 Related Work
Due to the sequential nature of human motion data, most previous works adopt RNN as the backbone [fragkiadaki2015recurrent, jain2016structural, ghosh2017learning, martinez2017human, gui2018adversarial, tang2018long, gui2018few, guo2019human, liu2019towards, chiu2019action, gopalakrishnan2019neural, sang2020human, corona2020context, pavllo2020modeling]. For example, ERD [fragkiadaki2015recurrent] improves the recurrent layer of LSTM [hochreiter1997long] by placing an encoder before it and a decoder after it. Jain et al. [jain2016structural] organized RNNs according to the spatiotemporal structure of human pose, proposing the Structural-RNN. Martinez et al. [martinez2017human] used a sequence-to-sequence architecture, which is often adopted for language processing, to predict human motion. RNNs are hard to train and cannot effectively capture spatial relationships between joints, usually yielding problems of discontinuity and error accumulation.
To enhance the ability to extract spatial features of human pose, Shu et al. [shu2021spatiotemporal] augmented the RNN with a skeleton-joint co-attention mechanism. The works of [butepage2017deep, li2018convolutional, liu2020trajectorycnn] use CNNs for this purpose, but CNNs cannot directly model the interaction between any pair of joints.
Viewing human pose as a graph, many recent works have adopted GCNs for human motion prediction [aksan2019structured, mao2019learning, mao2020history, cui2020learning, li2020dynamic, li2021symbiotic, li2020multitask, liu2020multi, lebailly2020motion, dang2021msr, cui2021towards, Shi:AAAI2022, Shi:CVPR2021, Duan:AAAI2022]. Aksan et al. [aksan2019structured] did not use GCN, but they adopted a very similar idea that relies on many small networks to exchange features between adjacent joints. The works of [li2020dynamic, li2021symbiotic, lebailly2020motion] use GCN either in the encoder [li2020dynamic, li2021symbiotic] for feature encoding or in the decoder [lebailly2020motion] for better decoding. The works of [mao2019learning, mao2020history, cui2020learning, dang2021msr] are entirely based on GCN. Mao et al. [mao2019learning] viewed a pose as a fully-connected graph and used GCN to discover the relationship between any pair of joints. In the temporal domain, they represented the joint trajectories by Discrete Cosine Transform coefficients. Dang et al. [dang2021msr] extended [mao2019learning] to a multi-scale version across the abstraction levels of human pose. We also use GCN as the basic building block, but propose S-DGCN and T-DGCN that extract global spatiotemporal features, which is better than [mao2019learning, mao2020history, dang2021msr] that extract only spatial features. Recently, Sofianos et al. [sofianos2021space] proposed a method that can also extract spatiotemporal features by GCNs. The difference is that we achieve this with only two GCNs, while [sofianos2021space] uses many more GCNs.
Transformer [vaswani2017attention, Dong:MM2021] has also been adapted to tackle the problem of human motion prediction [aksan2020spatio, cai2020learning]. Similar to GCN, the self-attention mechanism of Transformer can compute pairwise relations of joints. In this paper, we choose GCN as the building block. We show that our proposed method outperforms the existing Transformer-based approaches in terms of both running time and accuracy.
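Let $X_{1:T_h} = \{X_1, X_2, \dots, X_{T_h}\}$ denote an observed pose sequence of length $T_h$ where $X_i$ is a pose at time $i$, and $X_{T_h+1:T_h+T_f}$ be the future pose sequence of length $T_f$. Instead of directly mapping from $X_{1:T_h}$ to $X_{T_h+1:T_h+T_f}$, we follow [mao2019learning, mao2020history, dang2021msr] to repeat the last observed pose $X_{T_h}$ $T_f$ times and append the copies to $X_{1:T_h}$, obtaining the padded input sequence $[X_{1:T_h}; X_{T_h}, \dots, X_{T_h}]$ of length $T_h + T_f$. Then our aim becomes to find a mapping from the padded sequence to its ground truth $X_{1:T_h+T_f}$.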
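The padding step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function name `pad_with_last_pose` and the array layout (frames, joints, coordinates) are assumptions made for the example.

```python
import numpy as np


def pad_with_last_pose(observed: np.ndarray, future_len: int) -> np.ndarray:
    """Append `future_len` copies of the last observed pose.

    observed: array of shape (T_h, J, D), one pose per frame (assumed layout).
    Returns an array of shape (T_h + future_len, J, D).
    """
    last_pose = observed[-1:]                               # shape (1, J, D)
    padding = np.repeat(last_pose, future_len, axis=0)      # shape (future_len, J, D)
    return np.concatenate([observed, padding], axis=0)


# Example sizes: 10 observed frames, 25 future frames, 22 joints in 3D.
observed = np.random.default_rng(0).normal(size=(10, 22, 3))
padded = pad_with_last_pose(observed, future_len=25)
print(padded.shape)  # (35, 22, 3)
```

The padded frames serve as the initial guess for the future poses, so a network operating on this sequence only has to learn the offset from the last observed pose.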
3.1 Multi-Stage Progressive Prediction Framework
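For the above purpose, we design a multi-stage progressive prediction framework as shown in Figure 2 (the two-stage framework shown in Figure 1 (b) is a special case of the multi-stage framework), which contains $T$ stages represented by $G_1, G_2, \dots, G_T$ respectively. These stages perform the following subtasks step by step:
$$\hat{S}_{T-1} = G_1([X_{1:T_h}; X_{T_h}, \dots, X_{T_h}]), \qquad \hat{S}_{T-t} = G_t([X_{1:T_h}; \hat{S}_{T-t+1}^{\,T_h+1:T_h+T_f}]), \quad t = 2, \dots, T,$$
in which $\hat{S}_{T-t}$ is the output of stage $G_t$. The input to every stage is composed of two parts: the observed poses $X_{1:T_h}$ and the initial guess. For the first stage, the initial guess is the $T_f$ copies of the last observed pose, $X_{T_h}, \dots, X_{T_h}$. For stage $G_t$ with $t \geq 2$, the initial guess is $\hat{S}_{T-t+1}^{\,T_h+1:T_h+T_f}$, which is the future part of the output at the previous stage.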
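The data flow between stages can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `progressive_predict` is a hypothetical helper, and each stage is assumed to be any callable that maps a sequence of shape (T_h + T_f, J, D) to a sequence of the same shape.

```python
from typing import Callable, List, Sequence

import numpy as np

# Assumed interface: a stage maps a full-length sequence to a full-length sequence.
Stage = Callable[[np.ndarray], np.ndarray]


def progressive_predict(
    observed: np.ndarray, stages: Sequence[Stage], future_len: int
) -> List[np.ndarray]:
    """Run the stages one after another, refining the initial guess each time.

    observed: (T_h, J, D). Returns the full-length output of every stage.
    """
    hist_len = observed.shape[0]
    # Stage 1 starts from copies of the last observed pose.
    guess = np.repeat(observed[-1:], future_len, axis=0)
    outputs = []
    for stage in stages:
        stage_input = np.concatenate([observed, guess], axis=0)
        stage_output = stage(stage_input)
        outputs.append(stage_output)
        # The future part of this output is the next stage's initial guess.
        guess = stage_output[hist_len:]
    return outputs


# Toy usage with placeholder stages that return their input unchanged.
observed = np.zeros((10, 22, 3))
outputs = progressive_predict(observed, [lambda s: s] * 4, future_len=25)
print(len(outputs), outputs[-1].shape)  # 4 (35, 22, 3)
```

Note that every stage sees the original observed poses again; only the future part is replaced by the previous stage's prediction.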
Recall that for the two-stage prediction framework as shown in Figure 1 (b), the mean pose of future poses is used as the intermediate target, while for the multi-stage framework we resort to smoothing the ground truth $X_{1:T_h+T_f}$ recursively to obtain $S_1, S_2, \dots, S_{T-1}$, and use them as the intermediate targets of the corresponding stage networks to guide the generation of $\hat{S}_1, \hat{S}_2, \dots, \hat{S}_{T-1}$ (in reverse order, i.e. the first stage targets the smoothest sequence), respectively. The adopted smoothing algorithm is Accumulated Average Smoothing (AAS), which is introduced in the following.
Let each pose have $M$ joints, and each joint be a point in the $D$-dimensional space. For a pose sequence $X_{1:T_h+T_f}$, we have $M \times D$ trajectories: $\{x_k\}_{k=1}^{M \times D}$, and each trajectory is composed of the same coordinate across all the poses: $x_k = \{x_{k,i}\}_{i=1}^{T_h+T_f}$. Since all of the trajectories are smoothed by the same method, we omit the subscript $k$ in the following without loss of generality.
Note that the trajectory $x$ contains two parts: the historical part $x_{1:T_h}$ and the future part $x_{T_h+1:T_h+T_f}$. We just need to smooth the future part and keep the historical part unchanged. The AAS algorithm is defined as follows: the smoothed value of a point on a curve is the average of all the previous points on the curve. We apply AAS to the ground truth sequence recursively, obtaining the intermediate targets at successively smoother levels.
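A possible reading of AAS is sketched below. The text only states that a smoothed point is the average of the previous points on the curve, so the exact averaging window here is an assumption: each future frame is replaced by the running mean of the future frames up to and including itself, and the historical frames are left untouched. The function names are hypothetical.

```python
import numpy as np


def accumulated_average_smoothing(seq: np.ndarray, hist_len: int) -> np.ndarray:
    """Smooth only the future part of `seq` with a running (cumulative) mean.

    seq: array whose first axis is time, e.g. (T_h + T_f,) or (T_h + T_f, J, D).
    Assumption: the window starts at the first future frame and includes
    the current frame.
    """
    seq = np.asarray(seq, dtype=float)
    smoothed = seq.copy()
    future = seq[hist_len:]
    # counts[i] = number of frames averaged for the i-th future frame.
    counts = np.arange(1, future.shape[0] + 1).reshape((-1,) + (1,) * (future.ndim - 1))
    smoothed[hist_len:] = np.cumsum(future, axis=0) / counts
    return smoothed


def smoothing_levels(seq: np.ndarray, hist_len: int, num_levels: int) -> list:
    """Apply AAS recursively; entry k-1 holds the sequence smoothed k times."""
    levels = []
    current = np.asarray(seq, dtype=float)
    for _ in range(num_levels):
        current = accumulated_average_smoothing(current, hist_len)
        levels.append(current)
    return levels


trajectory = np.array([0.0, 0.0, 2.0, 4.0, 6.0])   # 2 history frames, 3 future frames
levels = smoothing_levels(trajectory, hist_len=2, num_levels=2)
print(levels[0])  # [0. 0. 2. 3. 4.]
print(levels[1])  # [0.  0.  2.  2.5 3. ]
```

Under this reading, each additional level flattens the future part further, which matches the description of smoothed curves that approach the padded line step by step.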
Figure 3 shows results by AAS and compares them with those by a Gaussian filter (standard normal distribution) with a filtering window size of 21. In each group of curves, the gray curve represents a historical trajectory, the black curve is the ground truth trajectory in the future, and the dashed line is obtained by padding the last observed data. From dark to light blue are the recursively smoothed results. Compared with the Gaussian filter, AAS has two advantages. (1) AAS preserves the continuity between the historical and future trajectories, while the Gaussian filter yields jumps at the junctions. (2) AAS has stronger smoothing ability than the Gaussian filter. As can be seen, the results by AAS evenly and steadily approach the dashed line. The dashed line is a good guess of the smoothest curve of AAS. Meanwhile, each curve by AAS is a good guess of the curve at the previous smoothing level. From this point of view, AAS is very suitable for preparing intermediate targets for our multi-stage framework. In contrast, the results of the Gaussian filter are concentrated together, and all of them are far from the dashed line.
3.2 Encoder-Copy-Decoder Stage Prediction Network Comprising S-DGCN and T-DGCN
In this section, we introduce our network that fulfills the prediction task at each stage, the overview of which is illustrated at the bottom-left of Figure 2. Our network is totally based on GCNs. Specifically, we propose S-DGCN and T-DGCN that extract global spatial and temporal interactions between joints. Based on S-DGCN and T-DGCN, we build an Encoder-Copy-Decoder prediction network. In the following, we introduce them one by one.
S-DGCN. By Dense GCN, i.e. DGCN, we mean the processed graph is fully connected. S-DGCN defines a spatially dense graph convolution applied to a pose, and the graph convolution is shared by all the poses of a pose sequence. Let $X \in \mathbb{R}^{L \times M \times F}$ be a pose sequence where $L$ is the length of the sequence, $M$ is the number of joints of a pose, and $F$ indicates the number of features of a joint. Defining a learnable adjacency matrix $A^s \in \mathbb{R}^{M \times M}$, the elements of which measure relationships between pairs of joints of a pose, S-DGCN computes:
$$X' = A^s X W^s,$$
where $W^s \in \mathbb{R}^{F \times F'}$ indicates the learnable parameters of S-DGCN, and $X' \in \mathbb{R}^{L \times M \times F'}$ is the output of S-DGCN.
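As a shape-level illustration, a spatially dense graph convolution of this kind can be written with broadcasting matrix products. This is a sketch assuming the standard form "adjacency times features times weights"; the name `s_dgcn` and the random parameters are placeholders, and no training is involved.

```python
import numpy as np


def s_dgcn(x: np.ndarray, adj_s: np.ndarray, w_s: np.ndarray) -> np.ndarray:
    """Spatial dense graph convolution shared by all poses.

    x:     (L, M, F)   sequence of L poses, M joints, F features per joint
    adj_s: (M, M)      dense adjacency mixing every pair of joints
    w_s:   (F, F_out)  feature transform
    Returns (L, M, F_out).
    """
    # np.matmul broadcasts the 2-D matrices over the leading time axis.
    return adj_s @ x @ w_s


rng = np.random.default_rng(0)
x = rng.normal(size=(35, 22, 3))
out = s_dgcn(x, rng.normal(size=(22, 22)), rng.normal(size=(3, 16)))
print(out.shape)  # (35, 22, 16)
```

Because the adjacency matrix is dense and learnable, every joint can exchange information with every other joint in a single layer.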
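T-DGCN. T-DGCN defines a temporal graph convolution applied to a joint trajectory, and the graph convolution is shared by all the trajectories. We first transpose the first two dimensions of the S-DGCN output $X' \in \mathbb{R}^{L \times M \times F'}$ to obtain $X'^{\top} \in \mathbb{R}^{M \times L \times F'}$. Defining a learnable adjacency matrix $A^t \in \mathbb{R}^{L \times L}$ measuring weights between pairs of joint positions along a trajectory, T-DGCN computes:
$$X'' = A^t X'^{\top} W^t,$$
where $W^t \in \mathbb{R}^{F' \times F''}$ is the learnable parameters of T-DGCN, and $X'' \in \mathbb{R}^{M \times L \times F''}$. Finally, we transpose the first two dimensions back to make $X'' \in \mathbb{R}^{L \times M \times F''}$.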
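The temporal counterpart follows the same pattern after swapping the time and joint axes. Again this is only a sketch under the assumption of the standard "adjacency times features times weights" form; `t_dgcn` is a hypothetical name.

```python
import numpy as np


def t_dgcn(x: np.ndarray, adj_t: np.ndarray, w_t: np.ndarray) -> np.ndarray:
    """Temporal dense graph convolution shared by all joint trajectories.

    x:     (L, M, F)   sequence of L poses, M joints, F features per joint
    adj_t: (L, L)      dense adjacency mixing every pair of time steps
    w_t:   (F, F_out)  feature transform
    Returns (L, M, F_out).
    """
    x_t = np.transpose(x, (1, 0, 2))       # (M, L, F): one trajectory per joint
    y_t = adj_t @ x_t @ w_t                # (M, L, F_out), broadcast over joints
    return np.transpose(y_t, (1, 0, 2))    # back to (L, M, F_out)


rng = np.random.default_rng(0)
x = rng.normal(size=(35, 22, 16))
out = t_dgcn(x, rng.normal(size=(35, 35)), rng.normal(size=(16, 16)))
print(out.shape)  # (35, 22, 16)
```

With a uniform adjacency matrix and an identity weight matrix, every output frame equals the temporal mean of the trajectory, which shows that the receptive field covers the whole sequence.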
GCL. As shown at the bottom-right of Figure 2, we define a Graph Convolutional Layer (GCL) as a unit that sequentially executes S-DGCN, T-DGCN, batch normalization [ioffe2015batch], tanh, and dropout [srivastava2014dropout]. GCL can extract spatiotemporal features over the global receptive field of the whole pose sequence.
Encoder. As shown in Figure 2, the encoder is a residual block containing a GCL and multiple Graph Convolutional Blocks (GCB). The first GCL projects the input from the pose space of $\mathbb{R}^{L \times M \times D}$ to the feature space of $\mathbb{R}^{L \times M \times F}$. We set $F = 16$ in this paper. Each GCB is a residual block containing two GCLs. They always work in the feature space. In order to add the global residual connection for the encoder, we employ a convolutional layer with 16 kernels that maps the input into the space of $\mathbb{R}^{L \times M \times F}$, which is then added to the output of the GCBs.
Copy. The encoder outputs a feature map in the space of $\mathbb{R}^{L \times M \times F}$. We duplicate it and append the copy to the original feature map along the trajectory direction, obtaining a feature map of size $2L \times M \times F$, which is used as the input to the decoder. We find in practice that the “copy” operator improves the prediction performance. The effectiveness of “copy” can be intuitively explained by the fact that the “copy” operator doubles the size of the latent space, enabling more parameters in the decoder and thus more thorough feature fusion.
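In array terms, the “copy” operator is a concatenation of the encoder output with itself along the temporal axis, and the decoder result is later cut back to its front half. The sketch below illustrates only these two shape manipulations; the helper names are hypothetical and the decoder itself is omitted.

```python
import numpy as np


def copy_along_time(features: np.ndarray) -> np.ndarray:
    """Duplicate a (L, M, F) feature map along the temporal axis -> (2L, M, F)."""
    return np.concatenate([features, features], axis=0)


def keep_front(decoded: np.ndarray, length: int) -> np.ndarray:
    """Retain only the first `length` frames of the decoder output."""
    return decoded[:length]


encoded = np.random.default_rng(0).normal(size=(35, 22, 16))
doubled = copy_along_time(encoded)
print(doubled.shape)                   # (70, 22, 16)
print(keep_front(doubled, 35).shape)   # (35, 22, 16)
```

Doubling the temporal length means the temporal adjacency matrices in the decoder have four times as many entries as those in the encoder, which is where the extra decoder parameters come from.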
Decoder. The decoder is a residual block containing multiple GCBs and a pair of S-DGCN and T-DGCN. The GCBs work in the feature space of $\mathbb{R}^{2L \times M \times F}$, while the final S-DGCN and T-DGCN project the features back into the pose space. Since the input to the decoder is of length $2L$, the adjacency matrices of all the T-DGCNs, including those in the GCBs, are of size $2L \times 2L$. In order to add the residual connection for the decoder, a convolutional layer with 3 kernels is applied to the input of the decoder. The result of the decoder is of length $2L$, while we just retain the front $L$ poses as the final result.
3.3 Loss Function
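We apply a loss on all the outputs: each stage output is supervised by its corresponding target, i.e. the smoothed sequence for an intermediate stage and the ground truth for the final stage.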
Human3.6M [ionescu2013human3] has 15 types of actions performed by 7 actors (S1, S5, S6, S7, S8, S9, and S11). Each pose has 32 joints in the format of exponential map. We convert them to 3D coordinates and angle representations, and discard 10 redundant joints. The global rotations and translations of poses are excluded. The frame rate is downsampled from 50fps to 25fps. S5 and S11 are used for testing and validation respectively, while the remaining subjects are used for training. (Footnote 1: The authors Tiezheng Ma and Yongwei Nie signed the license and produced all the experimental results in this paper. Meta did not have access to the Human3.6M dataset.)
CMU-MoCap has 8 human action categories. Each pose contains 38 joints in the format of exponential map which are also converted to 3D coordinates and angle representations. The global rotations and translations of the poses are excluded too. Following [mao2019learning, dang2021msr], we keep 25 joints and discard the others. The division of training and testing datasets is also the same as [mao2019learning, dang2021msr].
3DPW [von2018recovering] is a challenging dataset containing human motion captured from both indoor and outdoor scenes. The poses in this dataset are represented in the 3D space. Each pose contains 26 joints and 23 of them are used (the other 3 are redundant).
4.2 Comparison Settings
Table 5: Time and model size comparisons. Columns: Method, Train (per batch), Test (per batch), Model Size.
We train and test on both coordinate and angle representations. Due to the space limit, we only show the results measured by 3D coordinates in this paper. The results on angle can be found in the supplemental material. We use the Mean Per Joint Position Error (MPJPE) as our evaluation metric for 3D errors, and use Mean Angle Error (MAE) for angle errors.
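For reference, MPJPE as it is commonly defined is the Euclidean distance between predicted and ground truth joint positions, averaged over joints and frames. The sketch below follows that common definition; the function name and the (frames, joints, 3) layout are assumptions, and any unit conversion is left to the caller.

```python
import numpy as np


def mpjpe(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean Per Joint Position Error.

    pred, target: arrays of shape (T, J, 3) holding 3D joint coordinates.
    Returns the Euclidean distance averaged over all frames and joints.
    """
    per_joint_error = np.linalg.norm(pred - target, axis=-1)   # (T, J)
    return float(per_joint_error.mean())


target = np.zeros((25, 22, 3))
pred = target + np.array([3.0, 4.0, 0.0])   # every joint is off by a 3-4-5 offset
print(mpjpe(pred, target))  # 5.0
```

Reporting the error at a single future timestamp amounts to calling the function on that frame only, e.g. `mpjpe(pred[t:t + 1], target[t:t + 1])`.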
Test Scope. We note that the works of [martinez2017human, li2020dynamic, mao2019learning] randomly take 8 samples per action for test, Mao et al. [mao2020history] randomly take 256 samples per action, and Dang et al. [dang2021msr] take all the samples for test. We follow Dang et al. [dang2021msr] to test on the whole test dataset in this paper. The comparison results on the random 8 and 256 test sets are provided in the supplemental material.
Lengths of Input and Output Sequences. Following [dang2021msr], the input length is 10 and the output length is 25 for both Human3.6M and CMU-MoCap. Following [mao2019learning], the input is 10 poses and the output is 30 poses for 3DPW.
Implementation Details. Our multi-stage framework contains $T = 4$ stages. In each Encoder-Copy-Decoder prediction network, the encoder contains 1 GCB and the decoder contains 2 GCBs. The framework contains 12 GCBs in total. We employ Adam as the solver. The learning rate is initially 0.005 and is multiplied by a fixed decay factor after each epoch. The model is trained for 50 epochs with a batch size of 16. The devices we used are an NVIDIA RTX 2060 GPU and an AMD Ryzen 5 3600 CPU. For more implementation details, please refer to the supplemental material.
4.3 Comparisons with previous approaches
We compare our method with Res. Sup. [martinez2017human], DMGNN [li2020dynamic], LTD [mao2019learning], and MSR [dang2021msr] on these three datasets. (Footnote 2: We strictly comply with the agreement of using all the datasets for non-commercial research purpose only.) Res. Sup. is an early RNN-based approach. DMGNN uses GCN to extract features and RNN for decoding. LTD relies entirely on GCN and performs the prediction in the frequency domain. MSR is a recent method executing LTD in a multiscale fashion. All these methods are previous state-of-the-art approaches with publicly released code. For fair comparison, we use their pre-trained models or re-train the models using their default hyper-parameters.
Human3.6M. Table 1 shows the quantitative comparisons of short-term prediction (less than 400ms) on Human3.6M between our method and the above four approaches. Table 2 shows the comparisons of long-term prediction (more than 400ms but less than 1000ms) on Human3.6M. In most cases, our results are better than those of the compared methods. We show and compare the performance of different methods by statistics in Figure 4. In Figure 4 (a) and (b), we treat LTD as the baseline, and subtract the prediction errors of MSR and our method from those of LTD. In (a), the relative average prediction errors with respect to LTD at every future timestamp are plotted. As can be seen, MSR is better than LTD, while our method is much better than MSR. Our advantage is the most significant at 400ms. In (b), the relative average prediction errors with respect to LTD for every action category are plotted. The advantage of our method compared with LTD and MSR is large, and for the action of “walking dog” the advantage is the most significant. In (c), we plot the advantage per joint of our method over LTD and MSR. The darker the color, the higher the advantage. As can be seen, our method achieves higher performance gain on limbs, especially on hands and feet. In Figure 5, we show an example of the predicted poses of different methods. As the forecast time increases, the result of our method becomes increasingly better than those of the others.
CMU-MoCap and 3DPW. Table 3 and Table 4 show the comparisons on CMU-MoCap and 3DPW respectively. Due to space limit, we only show the average prediction errors at every timestamp. More detailed tables are provided in the supplementary material. On the two datasets, our method also outperforms the compared approaches. Especially, for the challenging dataset 3DPW, our advantage is very significant.
Time and Model Size Comparisons. As seen in Table 5, our model size is smaller than LTD (both models having 12 GCN blocks) as we use a smaller latent feature dimension than LTD (16 vs. 256). Our model is slightly slower than LTD due to the additional computations of intermediate losses and AAS, while faster than all the other methods.
Table 6: Ablations on architecture variants of the full model on Human3.6M. The last column is the average prediction error.
|Single stage prediction||11.95||24.47||49.69||60.94||79.56||93.93||105.86||113.41||67.48|
|Without intermediate loss|
|Supervised by GT at all stages||11.04||23.49||48.83||59.89||78.13||92.20||103.87||111.46||66.11|
|Replacing Encoder-Copy-Decoder by LTD [mao2019learning]||11.11||24.01||49.48||60.67||79.34||93.91||105.55||113.10||67.15|
|Replacing S-DGCN, T-DGCN by ST-GCN [yan2018spatial]||11.84||25.78||51.87||62.73||80.23||93.61||105.00||112.72||67.97|
Table 7: Ablations on the number of copies (left) and the direction of copying (right).
|Copy times||Error||Model size||Copy dimension||Error||Model size|
|No copy||65.99||1.06M||Copy in channel||65.75||1.67M|
|Copy once (Ours)||65.02||1.74M||Copy in spatial||65.21||1.69M|
|Copy three times||65.35||3.28M||Copy in temporal (Ours)||65.02||1.74M|
4.4 Ablation Analysis
We conduct ablation studies to analyze our method in depth. All experimental results are obtained on Human3.6M.
Architecture. Several design choices contribute to the effectiveness of our method: (1) the multi-stage learning framework, (2) the intermediate supervisions, (3) the Encoder-Copy-Decoder prediction network, and (4) the “Copy” operator. Table 6 shows the ablation experiments on different variants of the full model. The full model has 4 stages each containing 3 GCBs. There are 12 GCBs in total. The average prediction error is 65.02. (1) To show the effectiveness of “multi-stage”, we test the case when $T = 1$, i.e., there is only one Encoder-Copy-Decoder network, which however has 12 GCBs with 6 GCBs in the encoder and 6 in the decoder. The prediction error becomes 67.48, which is a very large performance drop. (2) We use $T = 4$ stages but remove the losses imposed on the intermediate outputs. The prediction error becomes 67.07, demonstrating the necessity of the intermediate supervisions. (3) In the third experiment, we use the ground truth (GT) to supervise all the intermediate outputs, which yields the prediction error of 66.11 on average. (4) We use LTD [mao2019learning] instead of the proposed Encoder-Copy-Decoder network to fulfill the task at each stage. The prediction error increases from 65.02 to 67.15. (5) We replace S-DGCN and T-DGCN by ST-GCN [yan2018spatial]. The prediction error drastically increases from 65.02 to 67.97. (6) Finally, we remove the “Copy” operator in the middle of the Encoder-Copy-Decoder network, which yields a slight increase of the prediction error from 65.02 to 65.99.
Number of stages. In Figure 6 (a), we conduct ablations on the number of stages $T$ from 1 to 6. For different $T$, the corresponding frameworks all contain 12 GCBs distributed evenly across the stage networks. For example, if $T = 3$, there will be 4 GCBs in each stage network. The experiments show that the best performance is obtained when $T = 4$.
Direction and number of “Copy”. In the default setting of the Encoder-Copy-Decoder network, we copy the output of the encoder just one time and paste it along the temporal direction. In Table 7, we conduct ablation studies on the number of copies and the direction of pasting. As can be seen, copying once or three times is better than not copying, but copying three times does not bring more performance gain than copying once. Copying once along the spatial dimension, the channel dimension, or the temporal dimension is better than not copying in every case, while copying along the temporal dimension yields the best result.
AAS vs. Gaussian filter. In Table 8, we compare Accumulated Average Smoothing (AAS) with the Gaussian filter. “Gaussian-$n$” means the filtering window size is $n$. It can be seen that AAS performs better than the two Gaussian filters.
AAS vs. Mean. Recall that for our two-stage framework, i.e., the one shown in Figure 1 (b), we can use Mean-$K$ as the intermediate target. For the same framework, we can also use the AAS-smoothed sequence $S_1$ as the intermediate target. We call the two schemes “Our two-stage with Mean-$K$” and “Our multi-stage when $T = 2$”, respectively, and compare them in Figure 6 (b). As can be seen, “Our multi-stage when $T = 2$” is better than both “Our two-stage with Mean-$K$” variants, which demonstrates that the smoothed result by AAS is better than the global mean of the future poses when used as the intermediate target. “Our multi-stage full model” when $T = 4$ achieves even better results.
4.5 Limitations and Future Works
Our method has two limitations: (1) The average prediction error of LTD [mao2019learning] is 68.08, and ours is 65.02. In contrast, that of “LTD with Mean-25” in Figure 1 (a) is 29.76. We still have much room to reduce the absolute prediction error. In the future, one can investigate more effective intermediate targets. (2) Our method requires a set of poses as input, while in real applications the poses may be occluded. How to deal with incomplete observations is worthy of further study.
We have presented a multi-stage human motion prediction framework. The key to the effectiveness of the framework is that we decompose the originally difficult prediction task into many subtasks, and ensure each subtask is simple enough. We achieve this by taking the recursively smoothed versions of the target pose sequence as the prediction targets of the subtasks. The adopted Accumulated Average Smoothing strategy guarantees that the smoothest intermediate target approaches the last observed data, and that the intermediate target of the current stage is a good guess for the next stage. Besides that, we have proposed the novel Encoder-Copy-Decoder prediction network, the S-DGCN and T-DGCN of which can extract spatiotemporal features effectively, while the “Copy” operator further enhances the capability of the decoder. We have conducted extensive experiments and analysis demonstrating the effectiveness and advantages of our method.
This research is sponsored by Prof. Yongwei Nie’s and Prof. Guiqing Li’s National Natural Science Foundation of China (62072191, 61972160), and their Natural Science Foundation of Guangdong Province (2019A1515010860, 2021A1515012301).