Humans recognize and interact with the real world by relying on their ability to predict how their surroundings change over time. For example, if we observe that a moving person is losing her balance, we may guess that she will fall in the near future and get ready to help her avoid danger immediately. Similarly, intelligent robots that perceive and interact with moving people must be able to predict the future dynamics of human motion. In this paper, as shown in Figure 1, we focus on the problem of human motion prediction with 3D joint position data, which aims to predict the future human motion sequence based on the observed human motion sequence.
The key to human motion prediction is to model the motion dynamics of the human body. The velocities of moving joints carry rich information about motion dynamics, which can boost the performance of human motion prediction [21, 2]. However, most of the existing literature ignores the velocity information hidden in a sequence of poses and models the motion dynamics of the human motion sequence only in pose space [4, 12, 19, 5]. Recently, some works have proposed to include the velocity information of moving poses [21, 2, 16, 25, 15]. Most of these works model the velocity information of the future poses only implicitly, through a residual connection between the input and output of their Decoder, and ignore the velocities of previous poses in their Encoder [21, 16, 25, 15]. To better capture the motion dynamics, we propose to explicitly model the velocities of the human motion sequence at both the Encoder and the Decoder.
Recursively predicting multiple future poses, by feeding the current output back as the input of the next prediction, is an efficient way to exploit the latest prediction for long-term prediction [21, 14, 6]. Most sequence-to-sequence models recursively predict multiple future human poses by relying entirely on the recurrent unit inherent in RNNs [5, 6, 7]. Although RNNs have shown their temporal modeling power in many tasks such as natural language processing and machine translation, they fail to capture the spatial correlations among joints of the human body. Therefore, some works address this issue using feedforward neural networks [1, 14, 8]. For example, Li et al. proposed a CNN (Convolutional Neural Network) based sequence-to-sequence model to predict multiple future poses recursively. Following this strategy, we propose a new feedforward Decoder that considers both spatial and temporal modeling of future poses, which enables the network to predict multiple future poses recursively.
Most prediction models are optimized simply with the loss [21, 14, 4] or MPJPE (Mean Per Joint Position Error) [20, 11], which measures the error between the target and predicted poses and pays equal attention to every future pose. It is natural that the prediction errors at earlier time-steps are smaller than those at later time-steps. As a result, these models implicitly pay more attention to the later predictions, since the loss at later time-steps is larger than that at early time-steps. This makes accurate prediction inherently difficult, especially for recursive prediction models, where an early prediction affects the predictions at all later time-steps and errors therefore accumulate easily. To address this problem, we propose to pay more attention to the predictions at early time-steps and thus achieve more accurate predictions.
Moreover, existing works ignore the differences among the x, y and z coordinates of joints [21, 14, 20]. Taking the motion sequences in Figure 1 as an example, the figure shows their joint trajectories along different axes. The joint trajectories evolve differently along different axes, so ignoring the differences between coordinates may fail to capture the motion dynamics well. However, there is also interaction between the axes. Considering these issues, in this paper we treat the x, y and z coordinates of joints separately at an early stage through different branches, and the branches share the same parameters to capture the correlations between different axes.
Our main contributions can be summarized as follows.
(1) A novel network, AGVNet, is proposed to forecast the future motion sequence; it explicitly models the velocities at both the Encoder and the Decoder.
(2) We propose a new two-stream Encoder that models motion features from both the positions and the velocities of previous poses. It carefully considers the differences among the coordinates of joints and can thus better encode the motion dynamics of skeletal motion.
(3) A new CNN-based Decoder is built, which enables the network to predict multiple future poses recursively, like an RNN-based Encoder-Decoder framework.
(4) A novel loss, ATPL, is proposed, which guides the network to achieve more accurate predictions by paying more attention to the early predicted poses and less attention to the later predictions.
II Related work
Most existing works are based on mocap vectors [14, 7, 8, 9]. The key to this problem is modeling the temporal dependencies of human motion. Due to the effectiveness of RNNs (Recurrent Neural Networks) in short-term temporal modeling, many RNN-based works have been proposed to address this problem [4, 19, 5, 6]. Fragkiadaki et al. proposed an Encoder-Recurrent-Decoder (ERD) model that places a nonlinear encoder and decoder before and after recurrent layers built with LSTM to predict the future mocap vectors recursively. Due to the error accumulation inherent in RNNs, these models may converge to the mean pose [5, 6, 26]. Moreover, human movements are constrained by the physical structure of the human body, and traditional RNN models fail to capture these physical constraints. Therefore, other RNN models incorporate skeletal representations such as the Lie algebra representation to capture the spatial correlations of the human body [6, 19]. Liu et al. proposed a novel model, HMR (Hierarchical Motion Recurrent), to anticipate the future motion sequence. The authors modeled the global and local motion contexts hierarchically using LSTM and captured the anatomical constraints of the human body by representing the skeletal frames with the Lie algebra representation.
Human motion prediction with 3D joint position data: in these works, human poses are represented as a group of joints with 3D coordinates. Few works focus on the problem of human motion prediction using 3D joint position data [1, 20, 9]. Butepage et al. proposed to learn a generic representation from the input Cartesian skeleton data and predicted future 3D poses using feed-forward neural networks. Mao et al. proposed a novel model to predict the future motion sequence with position data. The authors modeled the temporal trajectory information of joints using the DCT (Discrete Cosine Transform) and captured the spatial structure of the human body by representing the body joints as a graph and applying a GCN (Graph Convolutional Network). It has also been shown that 3D joint position data can better describe the human pose and does not suffer from ambiguities, whereas two different mocap vectors can represent the same pose. Therefore, in this paper, we focus on the problem of human motion prediction using 3D joint position data.
Pose velocity modeling in human motion prediction: most related works implicitly model the motion velocity of future poses through a residual connection [21, 7, 8, 6]. For example, Gui et al. [6, 7] and Butepage et al. introduced a residual connection between the input and the output of their GRU-based Decoder to model the velocities of future poses. Few works model the velocity information of human motion at both the Encoder and the Decoder [2, 16, 15]. Chiu et al. proposed an RNN-based model, TP-RNN (triangular-prism RNN), to predict the velocities of future poses with the velocities of previous poses as inputs. This model is built entirely on LSTM, which fails to capture the spatial correlations among joints of the human body.
III-A Problem formulation
As shown in Figure 1, a human motion sequence can be represented by a sequence of 3D joint coordinates of the human body. Our goal is to predict the 3D joint coordinates of the future motion sequence given the 3D joint coordinates of the observed motion sequence. In this paper, we first predict the velocities of future poses as an intermediate result and then restore the final poses from the velocity information, rather than directly predicting the final future poses as in much of the previous literature.
We are given an input human motion sequence of observed poses, its corresponding future human motion sequence, and the velocities of the future poses, where each element is the pose or the velocity at a single time-step. Human motion prediction is treated as a mapping from the observed sequence to the future sequence and is divided into two stages. The first stage predicts the velocities of the future poses. The second stage generates the final poses: each future pose is restored from the preceding pose and the corresponding predicted future velocity.
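The two-stage formulation above can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: the function names, the array layout (time, joints, 3) and the choice of 17 joints are hypothetical, and the velocity is taken to be the plain difference between consecutive poses.

```python
import numpy as np

def pose_velocities(poses):
    """Velocity between consecutive poses; poses has shape (T, J, 3)."""
    return poses[1:] - poses[:-1]

def restore_poses(last_observed_pose, future_velocities):
    """Accumulate predicted velocities onto the last observed pose.

    last_observed_pose: (J, 3); future_velocities: (T_future, J, 3).
    Returns the future poses with shape (T_future, J, 3).
    """
    return last_observed_pose[None] + np.cumsum(future_velocities, axis=0)

# Toy check: ground-truth velocities reproduce the ground-truth future poses.
rng = np.random.default_rng(0)
sequence = rng.normal(size=(8, 17, 3))      # 8 frames, 17 joints (assumed)
observed, future = sequence[:5], sequence[5:]
future_vel = pose_velocities(sequence)[4:]  # velocities leading into frames 5..7
restored = restore_poses(observed[-1], future_vel)
```

Because each pose is the previous pose plus a velocity, an error in an early velocity shifts every later restored pose, which is the error accumulation that motivates weighting early predictions more heavily.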
III-B Skeletal Representation
Recent works have shown that explicitly modeling the temporal information of human motion can enhance the final performance of the network [17, 22]. Therefore, to better capture human motion, we introduce a skeletal representation that models the skeletal motion both implicitly and explicitly by representing the human motion sequence in position space and velocity space, respectively.
Take the input sequence introduced in Section III-A as an example, in which each pose consists of the 3D coordinates of all joints at one time-step. We represent the human motion sequence as six tensors, three in position space and three in velocity space (one per axis), to conveniently model the differences of joint trajectories among the x, y and z coordinates. Moreover, the human body can be naturally divided into five parts: two arms, two legs and the trunk [3, 18]. Therefore, as shown in the bottom right of Figure 2, joints of the same part are placed at adjacent positions in our representation to capture the local characteristics of the human body.
In position space, the three tensors represent the input sequence along the x, y and z axes, respectively, as defined in equation 1.
Here, the axis index ranges over the coordinate dimensions of a joint, i.e. x, y and z, and each entry of a tensor is the coordinate of one joint of the pose at one time-step along that axis.
In velocity space, the three tensors likewise represent the input sequence along the x, y and z axes, respectively. The velocity between two consecutive poses is defined in equation 2, and the velocity tensor along each axis is defined in equation 3, analogously to position space.
As in position space, the axis index ranges over x, y and z.
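As a rough illustration of this representation, the sketch below splits a motion sequence into three position tensors and three velocity tensors, one per axis. The dictionary keys, the (time, joints) layout of each tensor and the lack of padding for the velocity tensors are assumptions made for the example, not details taken from the paper. The grouping of joints by body part is omitted.

```python
import numpy as np

def skeletal_representation(sequence):
    """Split a (T, J, 3) motion sequence into six per-axis tensors.

    Returns a dict with position tensors 'px', 'py', 'pz' of shape (T, J)
    and velocity tensors 'vx', 'vy', 'vz' of shape (T - 1, J).
    """
    velocity = sequence[1:] - sequence[:-1]   # difference of consecutive poses
    rep = {}
    for k, axis in enumerate("xyz"):
        rep["p" + axis] = sequence[..., k]
        rep["v" + axis] = velocity[..., k]
    return rep

rng = np.random.default_rng(1)
seq = rng.normal(size=(10, 17, 3))            # 10 frames, 17 joints (assumed)
rep = skeletal_representation(seq)
```

Keeping the axes in separate tensors is what later allows each axis to pass through its own sub-branch while the sub-branches still share weights.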
III-C Architecture of AGVNet
The overall architecture of AGVNet is shown in Figure 3. It mainly includes three parts: Encoder, Decoder and Loss. The Encoder encodes the skeletal motion of previous poses, and the Decoder decodes the velocities of future poses. In the following, we describe the method from three aspects: encoding skeletal motion, decoding future velocities, and our loss.
III-D Encode skeletal motion
Backbone layer: inspired by densely connected convolutional networks, and as shown in Figure 4, we propose a new backbone, the Densely Connected Convolutional Module (DCCM), to maximize the information flow from layer to layer. It mainly consists of convolutional layers. At each convolutional layer, the input is an enhanced feature obtained by fusing the concatenated feature maps from all preceding layers with a convolution, as formulated in equation 4. The convolutions on the residual connections learn point-level features. The dense residual connections in DCCM therefore allow the network to gradually enhance the features of deep layers with the point-level features of shallow layers.
Here, equation 4 is expressed in terms of the output feature map of each layer and a fusion layer, which is built with concatenation across channels followed by a convolution.
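The dense connectivity pattern can be sketched as follows. This is a shape-level illustration only: the number of layers, the channel width, the Leaky ReLU slope, the random weights and the use of pointwise (1x1) maps for both the fusion and the layer convolution are all assumptions, since the corresponding values are not recoverable from the text.

```python
import numpy as np

def pointwise_conv(x, w):
    """1x1 convolution on a (T, J, C_in) map using weights of shape (C_in, C_out)."""
    return x @ w

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def dccm(x, num_layers=3, channels=8, seed=0):
    """Densely connected block: every layer sees all preceding feature maps."""
    rng = np.random.default_rng(seed)
    feats = [x]
    for _ in range(num_layers):
        # Fusion layer: concatenate all preceding maps across channels, then fuse.
        stacked = np.concatenate(feats, axis=-1)
        w_fuse = rng.normal(scale=0.1, size=(stacked.shape[-1], channels))
        fused = pointwise_conv(stacked, w_fuse)
        # Stand-in for the layer's own convolution (pointwise here for brevity).
        w_layer = rng.normal(scale=0.1, size=(channels, channels))
        feats.append(leaky_relu(pointwise_conv(fused, w_layer)))
    return feats[-1]

features = dccm(np.ones((10, 17, 1)))
```

The list `feats` grows by one map per layer, so the fusion weights of layer k take the input channels plus k - 1 earlier outputs, which is what lets shallow features reach deep layers directly.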
Encode skeletal motion: based on the skeletal representation described above, and as shown on the left of Figure 3, a novel two-stream Encoder is built with DCCMs to encode the skeletal motion of the input sequence. It mainly consists of a pose branch, a velocity branch and a fusion module.
In the pose branch, the position tensors along the x, y and z axes are fed into separate sub-branches, which enables the network to focus on modeling the joint trajectory information along each axis and thus to model the differences among the x, y and z coordinates. Moreover, the sub-branches of the pose branch share weights, which reduces the model complexity and also models the correlations among the x, y and z coordinates. Finally, the feature maps of the sub-branches are concatenated along the channel dimension, and another DCCM is applied to fuse this information.
In the velocity branch, similarly to the pose branch, the velocity tensors along the x, y and z axes are fed into separate sub-branches to capture the spatial and temporal information in velocity space. The pose branch and the velocity branch share parameters, which reduces the model complexity and can also improve the final performance.
Finally, a fusion module fuses the motion dynamics features captured by the pose branch and the velocity branch. It is built with concatenation along the channel dimension followed by a convolutional layer and Leaky ReLU.
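The weight sharing described for the two-stream Encoder can be sketched as below. The sketch replaces each DCCM with a single pointwise map to keep the example short, and it assumes that position and velocity tensors have the same shape; both simplifications, as well as all names and sizes, are hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def shared_sub_branch(tensor, w_shared):
    """One sub-branch: lifts a (T, J) tensor to (T, J, C) with shared weights."""
    return leaky_relu(tensor[..., None] * w_shared)

def two_stream_encode(pos_xyz, vel_xyz, w_shared, w_fuse):
    """pos_xyz, vel_xyz: lists of three (T, J) tensors; returns (T, J, C_out)."""
    def stream(tensors):
        # The same weights process every axis; outputs are concatenated on channels.
        return np.concatenate([shared_sub_branch(t, w_shared) for t in tensors], axis=-1)
    pose_feat = stream(pos_xyz)   # pose branch
    vel_feat = stream(vel_xyz)    # velocity branch, sharing parameters with the pose branch
    both = np.concatenate([pose_feat, vel_feat], axis=-1)
    return leaky_relu(both @ w_fuse)   # fusion: concat + pointwise conv + Leaky ReLU

rng = np.random.default_rng(2)
C, C_out = 4, 6
w_shared = rng.normal(size=(C,))
w_fuse = rng.normal(scale=0.1, size=(6 * C, C_out))
pos = [rng.normal(size=(10, 17)) for _ in range(3)]
vel = [rng.normal(size=(10, 17)) for _ in range(3)]
encoded = two_stream_encode(pos, vel, w_shared, w_fuse)
```

Since `w_shared` is reused for all six inputs, the parameter count of the sub-branches does not grow with the number of axes or streams, which is the complexity reduction the text refers to.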
III-E Decode future velocities
RNN models predict the future pose from the previous state and the current pose [21, 7]. Motivated by this, and as shown on the right of Figure 3, a new Decoder built with convolutional layers is proposed to decode future velocities recursively, conditioned on the history features of previous time-steps. We assume that the previous hidden state needs more operations to extract its spatial and temporal features. Therefore, we first apply two convolutional layers to extract the spatio-temporal features of previous poses and one convolutional layer to extract the spatial representation of the velocities at the current time-step, and then add the two element-wise. Finally, one more convolutional layer and an FC layer are applied to restore the spatial information of the future velocity at the next time-step. This process can be formulated as equation 5.
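The recursive decoding loop can be sketched as follows. The step function here is a toy stand-in built from random linear maps rather than convolutional layers, and the way the history features are updated after each step is an assumption, since the text does not specify it. All names are hypothetical.

```python
import numpy as np

def make_toy_step(num_joints, hidden, seed=0):
    """Builds a toy decoder step; linear maps stand in for the conv and FC layers."""
    rng = np.random.default_rng(seed)
    w_h = rng.normal(scale=0.1, size=(hidden, hidden))           # history path
    w_v = rng.normal(scale=0.1, size=(num_joints * 3, hidden))   # current-velocity path
    w_o = rng.normal(scale=0.1, size=(hidden, num_joints * 3))   # output path
    def step(history, velocity):
        # Element-wise addition of history features and velocity features.
        mixed = np.tanh(history @ w_h) + velocity.reshape(-1) @ w_v
        next_velocity = (mixed @ w_o).reshape(num_joints, 3)
        return next_velocity, mixed    # assumed: mixed features become the new history
    return step

def decode_recursively(history, last_velocity, last_pose, step_fn, num_steps):
    """Feeds each predicted velocity back in as the input of the next step."""
    velocities, poses = [], []
    velocity, pose = last_velocity, last_pose
    for _ in range(num_steps):
        velocity, history = step_fn(history, velocity)
        pose = pose + velocity          # restore the pose from the predicted velocity
        velocities.append(velocity)
        poses.append(pose)
    return np.stack(velocities), np.stack(poses)

J, H = 17, 32
rng = np.random.default_rng(3)
last_pose = rng.normal(size=(J, 3))
vel_out, pose_out = decode_recursively(
    rng.normal(size=(H,)), rng.normal(size=(J, 3)), last_pose, make_toy_step(J, H), num_steps=5
)
```

The loop makes the dependency explicit: the velocity predicted at one step is the only fresh input to the next step, so the quality of early predictions bounds the quality of later ones.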
To explicitly model the velocities of future poses and achieve more accurate predictions, as shown in Figure 3, our final loss consists of two parts, a velocity loss and a position loss, combined as a weighted sum in which two hyperparameters balance the losses on the velocities and the positions of future poses: (1) the velocity loss guides the network to decode the velocities of the future motion sequence; (2) the position loss encourages the network to restore the spatial information of future poses. The number of joints is denoted consistently across all formulations.
In a recursive model, the current prediction is vulnerable to errors in the predictions at previous time-steps. To address this problem, we propose an attention temporal prediction loss (ATPL), which guides the network toward more accurate predictions at early time-steps by assigning larger attention weights to earlier time-steps. It is defined in equation 6.
In equation 6, the error between the groundtruth joint and the predicted joint at each time-step is scaled by the attention weight of that time-step, and the weights decrease over the time-steps.
The velocity loss and the position loss are both calculated by equation 6, which gives the ATPL in velocity space and position space, respectively.
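A time-weighted loss of this kind can be sketched as below. The linearly decreasing, normalized weight schedule and the default values of the two balancing hyperparameters are assumptions chosen for illustration; the exact weights used by ATPL are not recoverable from the text.

```python
import numpy as np

def decreasing_weights(num_steps):
    """Assumed schedule: linearly decreasing weights that sum to one."""
    raw = np.arange(num_steps, 0, -1, dtype=float)
    return raw / raw.sum()

def atpl(target, pred, weights):
    """Weighted mean per-joint error; target and pred have shape (T, J, 3)."""
    per_joint = np.linalg.norm(target - pred, axis=-1)   # (T, J)
    per_step = per_joint.mean(axis=-1)                   # (T,)
    return float(np.sum(weights * per_step))

def total_loss(v_true, v_pred, p_true, p_pred, lam_v=1.0, lam_p=1.0):
    """Weighted sum of the velocity-space and position-space losses."""
    w = decreasing_weights(len(v_true))
    return lam_v * atpl(v_true, v_pred, w) + lam_p * atpl(p_true, p_pred, w)

# The same 5-unit error costs more at the first step than at the last one.
target = np.zeros((4, 2, 3))
early, late = target.copy(), target.copy()
early[0] = [3.0, 4.0, 0.0]
late[3] = [3.0, 4.0, 0.0]
w = decreasing_weights(4)
loss_early, loss_late = atpl(target, early, w), atpl(target, late, w)
```

With four steps the assumed weights are 0.4, 0.3, 0.2 and 0.1, so the early error contributes 2.0 and the late error 0.5, pushing the optimizer to reduce early errors first.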
We evaluate our model on two challenging datasets, Human3.6M (H3.6M) and the 3D Pose in the Wild dataset (3DPW). In the following, we first introduce these datasets and the implementation details. Then, we compare our method with state-of-the-art methods and report the results both quantitatively and qualitatively. Finally, we conduct ablation experiments to analyze the effectiveness of several components of AGVNet.
IV-A Datasets and Implementation Details
Datasets: (1) H3.6M: H3.6M is a large-scale dataset for human motion prediction. This dataset consists of actions performed by seven professional actors, such as walking, eating, smoking and discussion. The human body is represented by joints. (2) 3DPW: 3DPW is an in-the-wild dataset with accurate 3D poses of people performing various activities such as shopping and doing sports. This dataset includes sequences, more than k frames. The human body is represented by joints.
Implementation details: our model works on 3D coordinate data, and all experimental settings and data processing are consistent with the baselines. Our model is implemented in TensorFlow. MPJPE (Mean Per Joint Position Error) in millimeters is used as the metric to evaluate the performance of our proposed method. All models are trained with the Adam optimizer from a preset initial learning rate.
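For reference, MPJPE averages the Euclidean distance between predicted and groundtruth joint positions. The sketch below assumes poses stored as arrays of shape (time, joints, 3) in millimeters; the function names are hypothetical.

```python
import numpy as np

def mpjpe(target, pred):
    """Mean Per Joint Position Error over all frames and joints."""
    return float(np.linalg.norm(target - pred, axis=-1).mean())

def mpjpe_per_step(target, pred):
    """MPJPE reported separately for every predicted time-step."""
    return np.linalg.norm(target - pred, axis=-1).mean(axis=-1)

target = np.zeros((4, 17, 3))
pred = target + np.array([3.0, 4.0, 0.0])   # every joint is off by 5 mm
overall = mpjpe(target, pred)
per_step = mpjpe_per_step(target, pred)
```

Reporting the per-step variant is what allows results to be compared at individual prediction horizons.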
IV-B Comparison with the state of the art
Baselines: (1) RGRU is a classical model for human motion prediction built with GRUs. It uses residual connections to implicitly predict the velocities of future poses. (2) CSS is a feedforward model based on CNNs that predicts multiple future poses recursively. (3) DTraj is the current state-of-the-art method for human motion prediction based on position data, built with DCT and GCN. For a fair comparison, the 3D errors of some baselines used in this paper are those reported in prior work.
Results on H3.6M: Table I reports the results for short-term prediction on H3.6M. Our method outperforms all baselines on average at all time-steps, which shows the effectiveness of our proposed model. Specifically, compared with the RNN baseline, the errors of our method decrease significantly. A possible reason is that our method explicitly models the velocity information of the human motion sequence, while the RNN baseline simply models the velocities of future poses using a residual connection in its Decoder and, by using GRU cells, ignores part of the spatial features among joints of the human body. Compared with the other feedforward baselines [14, 20], our model achieves the best results in most cases. This benefit comes from two aspects: (1) our method models the velocities of human motion at both the Encoder and the Decoder, while the baselines [14, 20] do not model velocity information, so our method can better capture the motion dynamics of human motion; (2) our model predicts multiple future poses recursively, while the non-recursive feedforward baseline predicts the future poses in a single pass, so our model can make full use of the latest predictions when predicting later poses, whereas that baseline cannot.
Table II reports the results for long-term prediction on H3.6M. Similarly, our method achieves the best performance at both reported time horizons, with the gain most pronounced at one of them, which again shows the effectiveness of our proposed method.
To further show the performance of our proposed method, frame-wise performance is evaluated qualitatively in Figure 5. Compared with the baseline, our method achieves the best visual results for both short-term and long-term prediction, which again demonstrates its effectiveness. As Figure 5 shows, our method significantly outperforms the baseline. The main reason is that our method explicitly models the velocities of human motion at both the Encoder and the Decoder, while the baseline does not model the velocities of human poses. Our model can therefore better capture the motion dynamics and achieve better results. More visualization results can be found in the supplementary material.
Results on 3DPW: Table III reports the results for short-term and long-term prediction on 3DPW. In general, our conclusion remains unchanged. Our method consistently outperforms the baselines at all time-steps for both short-term and long-term prediction, which further verifies the effectiveness of our proposed method.
IV-C Ablation analysis
In this section, we conduct ablation experiments to show the effectiveness of several components of AGVNet. The experimental results are reported in Table IV. When the differences among the x, y and z coordinates are not modeled, the errors increase at all time-steps, which shows that modeling these differences better captures the motion dynamics and thus improves the final performance. The experiments on the Encoder branches show that combining the pose branch and the velocity branch achieves better results, which shows the effectiveness of our two-stream Encoder. Sharing weights between the pose branch and the velocity branch outperforms using separate weights, which shows that weight sharing further improves the final performance. This is consistent with the common two-stream method. Removing either the position loss or the velocity loss leads to larger errors. In particular, without the velocity loss the errors increase significantly, especially at later time-steps. The main reason is that the velocity loss guides the network to explicitly model the velocities of future poses, so that the network captures the motion dynamics well and obtains better results, especially at later time-steps. Comparing the variants with and without ATPL, the variant with ATPL achieves lower errors on average, especially for early predictions. The decreasing weight design in ATPL guides the network to predict more accurate results at early time-steps, which further enhances the overall performance of our recursive model. In summary, all components contribute positively to our final network, and the combination of all components achieves the best performance.
In this paper, we propose a novel end-to-end architecture, AGVNet, for human motion prediction. Our method first predicts the velocities of future poses as an intermediate result rather than directly predicting the positions of future poses. Different from prior works, we explicitly model the velocities of skeletal motion at both the Encoder and the Decoder, which better captures the motion dynamics of human motion. Moreover, we propose an attention temporal prediction loss (ATPL) for the recursive prediction model, which efficiently guides the network to achieve more accurate predictions, especially for early predictions. Finally, we evaluate our model on two challenging datasets, where it achieves state-of-the-art performance. The experiments also show that modeling the differences among the x, y and z coordinates can improve the performance of human motion prediction.
This work was supported partly by the National Natural Science Foundation of China (Grant No. 61673192), Fundamental Research Funds for The Central Universities (No. 2019RC27), and BUPT Excellent Ph.D. Students Foundation (CX2019111).
[1] Deep representation learning for human motion prediction and classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[2] (2019) Action-agnostic human pose forecasting. In IEEE Winter Conf. on Applications of Computer Vision (WACV).
[3] (2015) Hierarchical recurrent neural network for skeleton based action recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1110–1118.
[4] (2015) Recurrent network models for human dynamics. In The IEEE International Conference on Computer Vision (ICCV).
[5] (2019) A neural temporal model for human motion prediction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[6] (2018) Adversarial geometry-aware human motion prediction. In European Conference on Computer Vision (ECCV).
[7] (2018) Few-shot human motion prediction via meta-learning. In European Conference on Computer Vision (ECCV).
[8] Human motion prediction via learning local structure representations and temporal dependencies. In AAAI Conference on Artificial Intelligence (AAAI).
[9] (2019) Human motion prediction via spatio-temporal inpainting. In The IEEE International Conference on Computer Vision (ICCV), pp. 7134–7143.
[10] (2017) Densely connected convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4700–4708.
[11] (2014) Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (7), pp. 1325–1339.
[12] Structural-RNN: deep learning on spatio-temporal graphs. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[13] (2018) Human action recognition and prediction: a survey. arXiv preprint arXiv:1806.11230.
[14] (2018) Convolutional sequence to sequence model for human dynamics. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[15] (2019) Symbiotic graph neural networks for 3D skeleton-based human action recognition and motion prediction. arXiv preprint arXiv:1910.02212.
[16] (2020) Dynamic multiscale graph neural networks for 3D skeleton-based human motion prediction. arXiv preprint arXiv:2003.08802.
[17] (2018) Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv preprint arXiv:1804.06055.
[18] (2019) PISEP: pseudo image sequence evolution based 3D pose prediction. arXiv preprint arXiv:1909.01818.
[19] (2019) Towards natural and accurate future motion prediction of humans and animals. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[20] (2019) Learning trajectory dependencies for human motion prediction. In The IEEE International Conference on Computer Vision (ICCV), pp. 9489–9497.
[21] (2017) On human motion prediction using recurrent neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[22] (2014) Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NIPS), pp. 568–576.
[23] (2007) Modeling human motion using binary latent variables. In Advances in Neural Information Processing Systems (NIPS), pp. 1345–1352.
[24] (2018) Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In The European Conference on Computer Vision (ECCV), pp. 601–617.
[25] (2019) VRED: a position-velocity recurrent encoder-decoder for human motion prediction. arXiv preprint arXiv:1906.06514.
[26] (2019) VRED: a position-velocity recurrent encoder-decoder for human motion prediction. arXiv preprint arXiv:1906.06514.