3D skeleton-based human motion prediction aims to generate future skeleton sequences given past observed ones. This technique can help machines better understand human intention and has broad prospects in scenarios such as human-robot interaction [14, 9, 11], autonomous driving, and pedestrian tracking [8, 2].
Joint relation modeling is essential for motion prediction. Prior works mainly relied on graphs combined with neural networks, such as RNNs [7, 24, 19, 13, 1], CNNs [16, 20], and GCNs [18, 6], to model joint relations. Most graphs are designed according to the kinematic structure of the human body to extract motion features. Although they are effective, it is hard for them to directly learn the relations between spatially separated joint pairs. Recently, dynamic graphs were developed by [21, 22] to model these relations explicitly, so that the local interactions between joint pairs can be learned adequately. However, one drawback remains: the global coordination of all joints, which contributes to the balance of human motion, is not well learned. This is mainly because the global motion features are usually extracted by fusing the local features of different body components. In this process, the global relations of all joints are learned progressively and asynchronously, and thus they are usually weakened. This sometimes makes the predicted motion appear unnatural, e.g., the limbs are uncoordinated, as shown in Figure 1.
In this paper, we aim to learn the global coordination of all joints. To this end, we learn a balance attractor (BA) that acts as a medium to build new relations among all joints indirectly. Specifically, the BA is learned as a dynamic weighted aggregation of the individual joint features. We then calculate the difference between the BA and each joint feature. Finally, the resulting new joint features are used to calculate joint similarities, which yield the final joint relations. In this way, all joints are related indirectly but synchronously through the BA. Meanwhile, because the new joint relations encode global motion features, the global coordination of all joints can be better learned.
Additionally, enriching the dynamic representation of the raw input data is also beneficial for effective prediction. The raw skeleton sequences only include each joint's position at different time steps, which is not sufficient to convey the dynamic properties of motion. Previous works [17, 28] tended to introduce a two-stream architecture for extra velocity information. Other works [17, 32] enlarge the time horizon by taking three neighboring frames into account, but they still ignore other dynamic information such as acceleration, which is not limited to fixed timescales. Therefore, we extract features among frames at multiple timescales to obtain an enriched dynamic representation from raw 3D coordinates.
Based on the above two aspects, we present our framework, referred to as the Attractor-Guided Neural Network. Given observed motion sequences, we first adaptively learn an enriched dynamic representation from raw position information through the Multi-timescale Dynamics Extractor (MTDE). Next, we introduce the Attractor-Based Joint Relation Extractor (AJRE), which includes a Local Interaction Extractor (LIE), a Global Coordination Extractor (GCE), and an Adaptive Feature Fusing Module. The LIE encodes the local interactions between joint pairs, and the GCE is designed to represent the global coordination of all joints. These different joint relations are adaptively aggregated in the Adaptive Feature Fusing Module.
The main contributions of this paper are summarized as follows.
1. We propose a novel joint relation modeling module, AJRE, mainly including GCE and LIE. GCE is proposed to model the global coordination of all joints, encoding the balance property of human motion. LIE is presented to mine the local interactions between joint pairs.
2. We also put forward an MTDE module to extract enriched dynamic information from raw input data for effective prediction.
3. Our proposed Attractor-Guided Neural Network outperforms most state-of-the-art methods for short-term and long-term motion prediction on three standard benchmark datasets: H3.6M, CMU-Mocap, and 3DPW.
2 Related work
Skeleton-based motion prediction has attracted increasing attention recently. Recent works using neural networks [21, 7, 24, 19, 16, 20, 13, 1, 18, 22, 6, 5, 3, 29] have significantly outperformed traditional approaches [15, 30].
One RNN-based approach is the Encoder-Recurrent-Decoder (ERD) model, which combines an encoder and a decoder with recurrent layers. It encodes the skeleton in each frame into a feature vector and builds temporal correlations recursively. Martinez et al. introduced a residual architecture to predict velocities and achieved better performance. However, these works all suffer from discontinuities between the observed poses and the predicted future ones. Although Gui et al. proposed to generate smooth and realistic sequences through adversarial training, it is hard to alleviate the error accumulation over long time horizons that is inherent to the RNN scheme. Feedforward networks were therefore widely adopted to alleviate these problems, because their prediction is not recursive and thus avoids error accumulation. Li et al. introduced a convolutional sequence-to-sequence model that encodes the skeleton sequence as a matrix whose columns represent the pose at every time step. However, their spatiotemporal modeling is still limited by the size of the convolutional filters. Recently, [21, 20] proposed to consider global spatial and temporal features simultaneously. Both transform the temporal space into trajectory space to take global temporal information into account. This helps capture richer temporal correlations and thus achieves state-of-the-art results. In this paper, we follow this scheme but use different methods to model global spatial correlation.
Joint relation modeling. Previous work mainly focused on skeletal constraints to model correlations between joints. Jain et al. first introduced a Structural-RNN model that explicitly models structural information using high-level spatiotemporal graphs. However, the graph is designed according to the kinematic structure and is not flexible for different motions. Recently, some dynamic graph structures [21, 5, 18, 4] were developed to model more flexible joint relations. Mao et al. used an adaptive graph to model motion, but it is still unreliable because the graph is initialized randomly without a structural prior. Cui et al. further combined the kinematic structure with a dynamic graph structure. Li et al. used stacked GCNs to build interactions between structures of different scales in each layer, modeling the correlation of both neighboring and distant joints. However, a problem remains in existing methods: the global coordination of all joints, which reflects the balance property of human motion, is usually weakened because it is learned from part to whole, progressively and asynchronously. Therefore, in this paper, we aim to encode the global coordination of all joints. Based on this intuition, we propose an Attractor-Based Joint Relation Extractor (AJRE) to better leverage the global coordination of all joints. Within the module, local interactions between joint pairs are also included as auxiliary information.
Dynamic representation of skeleton sequences. The raw skeleton sequence only represents each joint's position at each time step, which is not sufficient to convey the dynamic properties of motion. Many works [17, 32, 28] proposed to extract an enriched dynamic representation from raw data. They relied on two-stream architectures to introduce velocity information. A drawback is that they only extract dynamics from neighboring frames. Although Li et al. enlarged the time horizon through convolution, this is still insufficient because dynamics exist at different timescales. Therefore, in this paper, we extract dynamic features among frames through multi-timescale convolutions and fuse them into an enriched dynamic representation of the raw 3D coordinates.
3 Our Method
The proposed balance-attractor-guided framework, AGN, models human motion from a new perspective. It mainly includes two components, MTDE and AJRE. MTDE extracts multi-timescale temporal information to obtain rich features for motion prediction. AJRE mines the balance-attractor-based dynamics from the multi-timescale input to model the spatiotemporal evolution of human motion. Finally, two convolutions are applied successively to reduce the dimensionality and produce the final predictions.
3.1 Problem formulation
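We are given the historical 3D skeleton-based poses as a sequence of observed frames and aim to predict the sequence of future poses. Each pose describes the skeleton at one time step and consists of the coordinates of all joints, where each joint has a fixed coordinate dimension. Our goal is to generate the predicted poses from the historical ones through the proposed framework.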
3.2 Multi-timescale Dynamics Extractor (MTDE)
Dynamics is an important motion property that represents the patterns of the current motion and is used to anticipate future motion trends. Many previous works utilize a two-stream architecture to offer inputs of a different modality related to motion, such as velocity. While this makes sense, it is not suitable for all motions because the temporal extent of the dynamics varies across motions. Thus, most previous works are incapable of obtaining an effective dynamic representation of motion. To address this issue, we combine motion dynamics at different timescales with the two-stream architecture, so that more fine-grained dynamic information can be obtained in our proposed Multi-timescale Dynamics Extractor.
The architecture is shown in Figure 2. We adopt a two-stream architecture: one path takes the raw input, and the other takes the difference between adjacent frames of the raw input, which represents its velocity. Both paths are connected to a feature extractor that encodes dynamics at three different timescales. In particular, we model the dynamics of each joint separately to avoid interference from other joints. For motion prediction, it is beneficial to let the model extract a richer representation of each single joint before building the correlations between joints.
We take the raw-position path as an example. Given this input, we first apply temporal convolutions with different timescales to generate new dynamic features, one per timescale. Each convolution produces the same number of new channels.
Because the features at different timescales contain different dynamic features of motion, we concatenate them along the channel dimension. This operation enables the model to capture coarse and subtle dynamics simultaneously. We then apply a convolution to the concatenated features to reduce the number of feature channels for efficiency.
Similarly, we extract the dynamics of the velocity path with the same process to obtain its representation. Specifically, the velocity input is calculated by taking differences between adjacent frames of the raw input. To make use of both representations, we concatenate them along the temporal dimension to obtain the final dynamic representation.
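The MTDE steps above can be sketched for a single joint coordinate as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: fixed averaging kernels stand in for the learned temporal convolutions, zero padding keeps the sequence length, the channel-reduction convolution is omitted, and all function names are hypothetical. The timescales 3, 5, and 7 are the ones reported in the network settings.

```python
# Minimal pure-Python sketch of the MTDE idea for ONE joint coordinate.
# Assumptions (not from the paper): averaging kernels replace the learned
# temporal convolutions, and zero padding keeps the sequence length.

def velocity(seq):
    """Difference between adjacent frames (the second stream)."""
    return [b - a for a, b in zip(seq, seq[1:])]

def temporal_conv(seq, kernel):
    """1D 'same' convolution with zero padding (odd kernel length)."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(seq) + [0.0] * half
    return [
        sum(w * padded[t + k] for k, w in enumerate(kernel))
        for t in range(len(seq))
    ]

def multi_timescale_features(seq, timescales=(3, 5, 7)):
    """One feature channel per timescale, stacked along the channel axis."""
    return [temporal_conv(seq, [1.0 / k] * k) for k in timescales]

positions = [0.0, 1.0, 3.0, 6.0, 10.0, 15.0]
pos_feats = multi_timescale_features(positions)            # 3 channels x 6 frames
vel_feats = multi_timescale_features(velocity(positions))  # 3 channels x 5 frames

# Concatenate the two streams along the temporal dimension, channel by channel.
dynamic_repr = [p + v for p, v in zip(pos_feats, vel_feats)]
```

Each channel of `dynamic_repr` holds the position features followed by the velocity features for one timescale, so short and long temporal contexts are available side by side.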
3.3 Attractor-Based Joint Relation Extractor (AJRE)
The AJRE is used to exploit richer joint relations of motion for effective modeling. We propose the Global Coordination Extractor (GCE) and the Local Interaction Extractor (LIE) to separately model the global coordination of all joints and the local interactions between joint pairs. The Adaptive Feature Fusing Module (AFFM) is introduced to fuse these features through channel-wise attention, which improves the flexibility of joint relation modeling.
3.3.1 Global Coordination Extractor (GCE)
Global coordination of all joints plays an essential role in human motion. It requires all joints to coordinate synchronously and controls the balance of the human body during motion. However, it is usually weakened in previous works because the global motion features are generally learned by fusing the local features of different body components asynchronously and progressively. To tackle this issue, we learn a medium to build new joint relations indirectly. Through the medium, all joints are related synchronously, so the global coordination of all joints is not weakened and can be better learned.
Figure 3 illustrates how the global coordination of all joints is learned through the BA. In the Balance Attractor Unit, we first learn a medium called the balance attractor (BA) by aggregating all joints to characterize the global motion features. We then calculate the difference between the BA and each joint feature to fuse the global motion features into each joint feature. In the Cosine Similarity Unit, we generate a new joint relation by measuring the similarities between the new features of joint pairs. In this way, all joints are related synchronously through this medium, and the relation can thus reflect the global coordination of all joints. The relation graph is subsequently used to guide the motion feature extraction. It is noteworthy that we learn the BA in a high-dimensional space instead of in 3D space, because the spatiotemporal features of motion in a high-dimensional space represent more dynamics. We name the medium the balance attractor because it is used to model the global coordination of all joints, which corresponds to the balance property of human motion.
More details are illustrated in Figure 4. In the Balance Attractor Unit, we first transpose the input, swapping the joint dimension and the temporal dimension, so that the channel size equals the number of joints and the feature map in each channel represents the spatiotemporal features of one joint. Next, we adopt a simple convolution to learn the BA. Because the output of a convolution is a global response over the input channels, the BA represents the comprehensive features of all joints and reflects the global motion features. This process is a dynamic weighted aggregation of the joint features, in which the weights are learned and adapt to different motions.
Once obtained, the BA is used as a medium to indirectly build, for each joint, a new representation relative to the BA by taking differences. The purpose is to fuse the global motion features into each joint feature.
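The two steps of the Balance Attractor Unit described above can be sketched on toy data. This is an illustration under assumptions rather than the paper's implementation: the aggregation weights are fixed by hand here, whereas the paper learns them with a convolution, the sign convention of the difference (joint minus attractor) is assumed, and the function names are hypothetical.

```python
# Sketch of the Balance Attractor Unit on toy data.
# Assumptions (not from the paper): hand-picked aggregation weights replace
# the learned convolution, and the difference is taken as joint minus BA.

def balance_attractor(joint_feats, weights):
    """Weighted aggregation of all joint features into one attractor vector."""
    dim = len(joint_feats[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, joint_feats))
        for d in range(dim)
    ]

def relative_to_attractor(joint_feats, attractor):
    """Difference between each joint feature and the attractor."""
    return [[x - a for x, a in zip(feat, attractor)] for feat in joint_feats]

joint_feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # 3 joints, 2-d features
weights = [0.25, 0.25, 0.5]                         # hypothetical weights
ba = balance_attractor(joint_feats, weights)        # [1.25, 1.25]
new_feats = relative_to_attractor(joint_feats, ba)
```

Because every new feature is expressed relative to the same attractor, all joints are tied to one shared global reference at once, which is the synchronous coupling the text describes.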
In the Cosine Similarity Unit, we build new relations among all joints from these BA-relative features. This step aims to encode the coordination of all joints into a relative joint relation graph. Specifically, we first use a convolution to learn an embedding of the features.
Next, we calculate the relative relations of joints. In each feature map of the embedding, every row represents the spatiotemporal features of one joint. Therefore, we can calculate the cosine similarity between all pairs of row vectors to describe the correlation between joint pairs. We choose cosine similarity for two reasons: (1) this metric contains angle information, which corresponds to the mutual influence between joints; (2) its value is bounded in [-1, 1], which avoids large variance.
Formally, for the feature map at each channel, we take the row vector of every joint and compute the cosine similarity of each pair of row vectors, i.e., their dot product divided by the product of their norms. The resulting similarities form a correlation matrix that describes the correlation between all joints.
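The pairwise cosine similarity used in the Cosine Similarity Unit can be written out as a small example. The sketch below treats each row as one joint's feature vector for a single channel; the `eps` guard against zero-norm rows and the function names are assumptions added for illustration.

```python
import math

def cosine_similarity(u, v, eps=1e-8):
    """Dot product divided by the product of norms (eps avoids division by zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)

def correlation_matrix(rows):
    """Joint-by-joint matrix of cosine similarities for one channel."""
    return [[cosine_similarity(u, v) for v in rows] for u in rows]

# Three joints with 2-d features: joint 2 points opposite to joint 0,
# joint 1 is orthogonal to both.
rows = [[1.0, 0.0], [0.0, 2.0], [-3.0, 0.0]]
corr = correlation_matrix(rows)
```

The matrix is symmetric with ones on the diagonal, and its entries stay within [-1, 1] regardless of the magnitude of the features.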
Notably, we calculate the correlation matrix on each channel, because each channel encodes specific spatiotemporal features and should focus on different correlations than other channels. We therefore obtain one correlation matrix per channel.
The last step is to calculate the aggregated features according to the learned joint relations. Specifically, a convolution is used to extract intra-joint features, which are then combined with the correlation matrices through channel-wise multiplication to obtain the final features.
[Table residue; only column headers survive: motion, Purchasing, Sitting, Sitting down, Taking photo, Waiting, Walking dog, Walking Together, Average]
3.3.2 Local Interaction Extractor (LIE)
Local Interaction Extractor (LIE) is used to learn local interactions between joint pairs, including adjacent and distant joints. The local connection via bones brings spatial correlation for adjacent joints. For distant joints, some joints may have a strong correlation even if they are not directly connected, e.g., left hand and right hand are tightly correlated during ‘eating’. Therefore, these two relations are equally important for effective prediction.
As shown in Figure 5, given the same input as the GCE, there are two main paths that separately learn the relations between adjacent joint pairs and between distant joint pairs. To learn the relations between adjacent joint pairs, a pure convolution is adopted to extract spatiotemporal features between adjacent joint pairs. To learn the relations between distant joint pairs, the Non-local self-attention module is used to capture spatiotemporal features between distant joint pairs. More details of this module are provided in the supplementary materials.
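To illustrate why a non-local (self-attention) path can relate distant joints, the following sketch computes each joint's output as a softmax-weighted sum over all joints. It is a simplified stand-in, not the module used in the paper: identity mappings replace the learned embeddings of a Non-local block, the residual connection is omitted, and the names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def non_local(feats):
    """Each joint's output is a softmax-weighted sum over ALL joints.

    Assumption: identity embeddings replace the learned projections.
    """
    out = []
    for q in feats:
        scores = [sum(a * b for a, b in zip(q, k)) for k in feats]
        weights = softmax(scores)
        out.append([
            sum(w * v[d] for w, v in zip(weights, feats))
            for d in range(len(q))
        ])
    return out

feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # 3 joints, 2-d features
out = non_local(feats)
```

Every joint attends to every other joint in one step, so the connection does not depend on how many bones separate the two joints in the skeleton.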
3.3.3 Adaptive Feature Fusing Module (AFFM)
Different motions rely to different degrees on the local interactions between joint pairs and the global coordination of all joints. Here we adopt the channel attention mechanism to fuse the features adaptively and form a more reliable representation.
As shown in Figure 6, global average pooling summarizes each channel's feature map of the raw input as a single value. After several neural network operations, we obtain the importance ratio of each channel through the sigmoid function. Finally, we perform channel-wise multiplication between the ratios and the raw input to re-weight the features. More details of this module are provided in the supplementary materials.
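The channel attention described above follows a pool, gate, and re-weight pattern, which can be sketched as below. This is an assumption-laden illustration: the `gate` argument stands in for the unspecified intermediate network layers and defaults to the identity, and each feature map is flattened to a plain list.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(channels, gate=lambda pooled: pooled):
    """Re-weight channels by a sigmoid importance ratio.

    channels: list of feature maps, each a flat list of floats.
    gate: hypothetical stand-in for the learned layers between pooling and sigmoid.
    """
    pooled = [sum(ch) / len(ch) for ch in channels]   # global average pooling
    ratios = [sigmoid(z) for z in gate(pooled)]       # importance in (0, 1)
    reweighted = [[r * x for x in ch] for r, ch in zip(ratios, channels)]
    return ratios, reweighted

channels = [[0.0, 0.0, 0.0], [2.0, 4.0, 6.0]]
ratios, fused = channel_attention(channels)
```

With the identity gate, the first channel gets a ratio of 0.5 and the second a ratio close to 1; with learned layers in place of `gate`, the ratios would instead reflect how useful each channel is for the current motion.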
[Table residue; only column headers survive: motion, Purchases, Sitting, Sitting down, Taking photo, Waiting, Walking Dog, Walking Together, Average]
3.4 Loss Function
4 Experiments
We evaluate our model on several benchmark motion capture (mocap) datasets, including Human3.6M (H3.6M), the CMU mocap dataset, and the 3DPW dataset. We first introduce these datasets and the corresponding implementation details. We then compare our method with state-of-the-art methods in terms of the Mean Per Joint Position Error (MPJPE).
4.1 Datasets and Implementation Details
[Table residue; only column headers survive: motion, Basketball, Basketball Signal, Directing Traffic]
H3.6M is the most widely used benchmark for motion prediction. It involves 15 actions performed by professionals, and each human pose involves a 32-joint skeleton. Following [21, 20], we compute the joints' 3D coordinates by applying forward kinematics and down-sample the motion sequences to 25 frames per second. After removing the global rotation, translation, and constant 3D coordinates of each human pose, 22 joints remain. We test our method on subject 5 (S5).
3DPW. The 3D Poses in the Wild dataset (3DPW) consists of challenging indoor and outdoor actions. It covers various activities such as shopping, doing sports, and hugging, and includes 60 sequences and more than 51k frames. For a fair comparison, we evaluate on the whole test set.
CMU-Mocap. The CMU mocap dataset mainly includes five categories. Consistent with [21, 20], we select 8 detailed actions: “basketball”, “basketball signal”, “directing traffic”, “jumping”, “running”, “soccer”, “walking” and “washing window”.
Network Setting. We use three timescales in MTDE: 3, 5, and 7 frames around the target frame. The size of the high-level dimension is 32. We use 5 layers in the encoder and 4 layers in the decoder to obtain a sufficiently large receptive field. The size of the temporal dimension is enlarged to 64. More details can be found in the supplementary material.
4.2 Comparison with state-of-the-art
Here we show the prediction performance for both short-term and long-term motion prediction on H3.6M, CMU Mocap, and 3DPW. We quantitatively evaluate the methods by the MPJPE between the generated motions and the ground truth in 3D coordinate space. To be consistent with the literature [20, 21], we report our results for short-term (< 500ms) and long-term (> 500ms) predictions. For all datasets, we are given 10 frames (400 milliseconds) and predict the future 10 frames (400 milliseconds) for short-term prediction and the future 25 frames (1 second) for long-term prediction. More results can be found in the supplementary material.
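For reference, the evaluation metric and the frame-to-time conversion used above can be sketched as follows. MPJPE is the mean Euclidean distance between predicted and ground-truth joint positions; the helper names are hypothetical, and the default of 25 frames per second is taken from the dataset description.

```python
import math

def mpjpe(pred, target):
    """Mean per joint position error for one pose.

    pred, target: lists of (x, y, z) joint coordinates of equal length.
    """
    dists = [math.dist(p, t) for p, t in zip(pred, target)]
    return sum(dists) / len(dists)

def frames_to_ms(num_frames, fps=25):
    """Duration in milliseconds of a number of frames at a given frame rate."""
    return num_frames * 1000 / fps

pred = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
target = [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
error = mpjpe(pred, target)  # joint errors are 5.0 and 0.0
```

At 25 frames per second, 10 frames span 400 ms and 25 frames span 1000 ms, which matches the short-term and long-term horizons stated in the text.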
4.2.1 Results on H3.6M
Short-term motion prediction. Table 1 provides the short-term predictions on H3.6M for the 15 activities and the average results. Our method outperforms all the baselines on average and on almost all motions, which demonstrates that our approach learns a general representation of different movements. Specifically, for motions that require the upper body and lower body to cooperate, e.g., “Walking dog”, “Phoning” and “Sitting down”, our method shows the largest improvements, reflecting the efficacy of our proposed BA in joint relation modeling. Besides, the results at 320ms and 400ms improve the most, which shows that our method is better at capturing temporal continuity than the other methods. We also provide qualitative comparisons in Figure 1. They further show that, for the above actions, our predictions are closer to the ground truth than those of the baselines. More visualizations are included in the supplementary material.
Long-term motion prediction. In Table 2, we compare our results with those of the baselines for long-term prediction on H3.6M. Our method outperforms all the baselines on average. For long-term prediction, even as the uncertainty of motion increases, our method still obtains competitive performance on almost all motions. Especially for motions with more dynamics, such as “Walking Dog”, our method outperforms the competitors by the largest margin. These observations demonstrate the advantages of our proposed dynamics representation and BA.
4.2.2 Results on CMU-Mocap and 3DPW
Table 3 reports the MPJPE for short-term and long-term prediction on CMU-Mocap and Table 4 reports the results on 3DPW. In essence, the conclusions remain unchanged: our method consistently outperforms the baselines for both short-term and long-term prediction with BA guidance.
5 Ablation study
In this section, we conduct several ablation experiments on H3.6M to verify the effectiveness of the different components of our proposed framework.
5.1 Effectiveness of components of MTDE
MTDE is designed mainly to obtain enriched dynamic information from the raw input data. Table 5 shows the experimental results. The results at 320ms and 400ms improve significantly, which shows that MTDE encodes more temporal information and offers more meaningful guidance for prediction, especially over longer time horizons.
5.2 Effectiveness of components of GCE
GCE is designed mainly to model the global coordination of joints, following the natural tendency of the human body to keep balance. It has two main components: the Balance Attractor Unit (BAU) and the Cosine Similarity Unit (CSU). To verify the effectiveness of the CSU, we design a comparison experiment that uses a common softmax function instead. To verify that the guidance of the BA is useful, we also design an experiment without the BAU. In Table 6, the two similarity settings denote the use of cosine similarity and softmax, respectively, and “BAU” is the Balance Attractor Unit. Table 6 shows the results.
We have the following observations:
(1) The BAU is essential for effective prediction, especially over long horizons. This demonstrates that the indirect BA offers useful guidance and that this module extracts meaningful global motion features.
(2) Cosine similarity performs better than the softmax function used in self-attention models. This arises from two aspects. First, it avoids the large differences produced by the softmax function, because cosine similarity bounds the value in [-1, 1]. Second, it carries angle information that represents both the orientation and the intensity of the correlation, while softmax only represents the intensity of the correlation.
(3) The method with the proposed GCE outperforms the one without GCE by 0.5, 1.1, 2.9, and 3.0 at 80ms, 160ms, 320ms, and 400ms, respectively. This demonstrates the effectiveness of the GCE module.
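The contrast drawn in observation (2) can be made concrete with a toy comparison. The vectors below are made up for illustration and are not from the experiments; the sketch only shows how the two normalizations treat the same raw dot products.

```python
import math

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Hypothetical features: one aligned, one opposite, one orthogonal to the query.
query = [1.0, 0.0]
keys = [[10.0, 0.0], [-10.0, 0.0], [0.0, 10.0]]

dots = [sum(a * b for a, b in zip(query, k)) for k in keys]  # [10, -10, 0]
soft = softmax(dots)                      # positive weights dominated by one key
cos = [cosine(query, k) for k in keys]    # signed and bounded in [-1, 1]
```

Softmax assigns almost all weight to the aligned key and maps the opposite and orthogonal keys to small positive values, so the sign of the relation is lost. Cosine similarity keeps the three cases apart as 1, -1, and 0.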
5.3 Effectiveness of LIE and AFFM
In Table 7, the method with GCE alone outperforms the one with LIE alone. This demonstrates that our proposed GCE is superior to modules that encode only the local interactions of joints, which indicates the importance of our proposed BA. The improved performance obtained by fusing the two paths shows that they are complementary.
AFFM improves the results by 0.4 on average, which shows that channel attention enhances the overall performance. This gain is small compared with the gains from introducing GCE and LIE, which reflects that our model's improvement mainly benefits from the design of GCE and LIE.
6 Conclusion
In this paper, we have proposed a simple yet effective framework, referred to as the Attractor-Guided Neural Network, to model spatiotemporal features for skeleton-based human motion prediction. We extract the dynamic representation of raw skeleton data with an MTDE for effective prediction. To exploit richer joint relations, we propose an AJRE module, which includes GCE and LIE. The former represents the global coordination of all joints, and the latter encodes the local interactions between joint pairs. With these two fine-grained features introduced, our proposed method achieves state-of-the-art results on three benchmark datasets.
References
[1] (2019) Structured prediction helps 3D human motion modelling. In ICCV, pp. 7143–7152.
[2] (2018) Long-term on-board prediction of people in traffic scenes under uncertainty. In CVPR, pp. 4194–4202.
[3] (2017) Deep representation learning for human motion prediction and classification. pp. 6158–6166.
[4] Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2272–2281.
[5] (2020) Learning progressive joint propagation for human motion prediction. In European Conference on Computer Vision, pp. 226–242.
[6] (2020) Learning dynamic relationships for 3D human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6519–6527.
[7] (2015) Recurrent network models for human dynamics. In ICCV, pp. 4346–4354.
[8] (2011) Multi-hypothesis motion planning for visual object tracking. In ICCV, pp. 619–626.
[9] (2018) Teaching robots to predict human motion. In IROS, pp. 562–567.
[10] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
[11] (2014) Action-reaction: forecasting the dynamics of human interaction. In ECCV, Vol. 8695, pp. 489–504.
[12] (2014) Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. TPAMI 36 (7), pp. 1325–1339.
[13] Structural-RNN: deep learning on spatio-temporal graphs. In CVPR, pp. 5308–5317.
[14] (2013) Anticipating human activities for reactive robotic response. In IROS, pp. 2071–2071.
[15] Efficient nonlinear Markov models for human motion. In CVPR, pp. 1314–1321.
[16] (2018) Convolutional sequence to sequence model for human dynamics. In CVPR, pp. 5226–5234.
[17] (2018) Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. In IJCAI, pp. 786–792.
[18] (2020) Dynamic multiscale graph neural networks for 3D skeleton based human motion prediction. In CVPR, pp. 211–220.
[19] (2018) Adversarial geometry-aware human motion prediction. In ECCV, Vol. 11208, pp. 823–842.
[20] (2020) TrajectoryCNN: a new spatio-temporal feature learning network for human motion prediction. IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1.
[21] (2019) Learning trajectory dependencies for human motion prediction. In ICCV, pp. 9488–9496.
[22] (2020) History repeats itself: human motion prediction via motion attention. In European Conference on Computer Vision, pp. 474–489.
[23] (2018) Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In ECCV, Vol. 11214, pp. 614–631.
[24] On human motion prediction using recurrent neural networks. In CVPR, pp. 4674–4683.
[25] (2016) A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Transactions on Intelligent Vehicles 1 (1), pp. 33–55.
[26] (2017) Automatic differentiation in PyTorch.
[27] (2015) U-Net: convolutional networks for biomedical image segmentation. In MICCAI, Vol. 9351, pp. 234–241.
[28] (2014) Two-stream convolutional networks for action recognition in videos. In NeurIPS, pp. 568–576.
[29] (2019) Imitation learning for human pose prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7124–7133.
[30] (2008) Gaussian process dynamical models for human motion. TPAMI 30 (2), pp. 283–298.
[31] (2018) Non-local neural networks. In CVPR, pp. 7794–7803.
[32] (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, pp. 7444–7452.