Motion Prediction via Joint Dependency Modeling in Phase Space

01/07/2022
by   Pengxiang Su, et al.
Jilin University

Motion prediction is a classic problem in computer vision, which aims at forecasting future motion given the observed pose sequence. Various deep learning models have been proposed, achieving state-of-the-art performance on motion prediction. However, existing methods typically focus on modeling temporal dynamics in the pose space. Unfortunately, the complicated, high-dimensional nature of human motion brings inherent challenges for capturing dynamic context. Therefore, we move away from the conventional pose-based representation and present a novel approach employing a phase space trajectory representation of individual joints. Moreover, current methods tend to consider only the dependencies between physically connected joints. In this paper, we introduce a novel convolutional neural model to effectively leverage explicit prior knowledge of motion anatomy, and simultaneously capture both spatial and temporal information of joint trajectory dynamics. We then propose a global optimization module that learns the implicit relationships between individual joint features. Empirically, our method is evaluated on large-scale 3D human motion benchmark datasets (i.e., Human3.6M, CMU MoCap). The results demonstrate that our method sets a new state of the art on these benchmark datasets. Our code will be available at https://github.com/Pose-Group/TEID.

1. Introduction

Humans can effortlessly predict the future motions of animals or other humans. Such a crucial ability enables humans to intelligently interact with the external world (Li et al., 2020; Cao et al., 2020; Aliakbarian et al., 2020; Yuan and Kitani, 2020). Similarly, in the domain of artificial intelligence, understanding and predicting future human motion is important and enjoys a wide range of applications including human-robot interaction, intelligent driving, pose tracking, and motion generation (Zang et al., 2020; Kundu et al., 2019; Zhuo et al., 2020; Gui et al., 2018).

Figure 1. Trajectories of the wrist and ankle joints during jumping.

Given the observed past motion data, which is typically represented as the 3D skeletal pose sequence, the human motion prediction task aims to accurately estimate the future motion. Traditional approaches broadly fall within the scope of latent variable models, including Gaussian processes (Wang et al., 2005) and Markov models (Lehrmann et al., 2014). With the availability of large-scale labeled motion capture datasets, a number of neural network designs have been proposed, yielding much improved results. The general framework encompasses a sequence-to-sequence model whereby the observed sequence is encoded to a latent motion context that is decoded to output the future sequence. One line of work (Fragkiadaki et al., 2015; Martinez et al., 2017) utilized recurrent neural networks (RNNs) such as Long Short-Term Memory (LSTM) to model motion contexts. Another line of work (Cui et al., 2020; Li et al., 2020; Mao et al., 2019) built upon the success of graph convolutional networks to better characterize the spatial connections between joints. Other works directed efforts at convolutional networks (Li et al., 2018), hierarchical motion context representation (Liu et al., 2019), and attention-based networks (Mao et al., 2020), respectively.

Empirically, we implemented existing methods following their released code and scrutinized their results on benchmark datasets. Unfortunately, they still suffer from inaccurate motion predictions. We conjecture that the reasons are twofold.

First, pose representation. Most existing works represent the historical motion sequence as a time series of the skeletal pose, parameterized as axis angles. A pose technically translates to the positions of all joints. Modeling motion contexts in the pose space implicitly entails all the joints, placing the motion prediction task on an unnecessarily high-dimensional manifold. On the other hand, we visualized the trajectories of each joint during motions, and observed that the trajectory of a joint is usually smooth. An illustrative example for the jumping motion is presented in Fig. 1, which confirms the smoothness of joint motions. Furthermore, the results of (Mao et al., 2019) and our previous research (Liu et al., 2021b) show that modeling motion contexts in the trajectory space can effectively improve the prediction accuracy. Motivated by this, we present a new solution that works on the individual joint trajectory space, explicitly leveraging the smooth movement of a joint to predict its future. Specifically, we include the position of a joint and its displacements (motion), i.e., the first-order differences of joint positions between adjacent frames. In doing so, we effectively represent the motion sequence in trajectory phase space, where both the current joint location configuration and joint frame-wise displacements are explicitly modeled. This enables a complete characterization of the dynamic system, and the motion prediction task converts to trajectory extrapolation for each joint. This representation focuses on each joint rather than the entire pose: it characterizes the trajectory of a joint using frame-wise joint positions and augments the joint trajectory with instantaneous displacements. In this way, we explicitly encode motion semantics, and the difficulty of learning the temporal evolution in the entire pose sequence is significantly reduced at the base level.

Second, joint correlation modeling. On another front, the existence of laws of intersegmental dependencies is valuable for precise motion prediction. Although recent works have incorporated some form of inter-joint dependency modeling through kinematic trees or graphs, they tap insufficiently into the pose and motion priors. This deficiency manifests as inaccurate portrayals of the subtleties in predicting different motions. We therefore design a novel spatio-temporal convolutional neural network to better encode prior knowledge, e.g., skeletal connections and motion semantics. Specifically, the model is composed of two phases. (1) In the first phase, we obtain a principled trajectory prediction of a joint by considering its own trajectory and the trajectories of its explicitly correlated joints, which smooths out the interference of irrelevant joints and reduces the computational burden. Three parallel convolution branches with different dilation rates are leveraged to extract multi-scale context information. (2) Since the trajectory extrapolation for each joint is handled individually so far, in the second phase we further engage a global optimization module. The module captures latent dependencies of the predicted future trajectory of each joint with respect to each other as well as the global skeletal motion. The individual joint trajectory is refined, thus ensuring consistency and a harmonious alignment of each joint within the whole.

Contributions. To summarize, the key contributions of this paper are as follows:

(i) We propose to learn motion dynamics in a new trajectory space and present a spatio-temporal convolutional feed-forward network to encode trajectory information for motion prediction.

(ii) We introduce a two-phase model that extracts the explicit and implicit relationships between joints.

(iii) On two human motion benchmark datasets, including the large Human3.6M and CMU MoCap datasets, our method achieves the state-of-the-art performance in both short-term and long-term predictions.

The rest of the paper is organized as follows. We first review related work in Section 2. We then introduce the details of our approach in Section 3. Thereafter, we evaluate the performance of the proposed method and compare it with existing methods in Section 4. Finally, we conclude the paper with a summary and discussion of the approach in Section 5.

2. Related Work

2.1. Motion Prediction

Previously, statistical approaches broadly fell within the scope of latent variable models, which leverage Gaussian processes (Wang et al., 2005) and hidden Markov models (Lehrmann et al., 2014) to capture the temporal dynamics of human motions. Recently, the success of deep learning methods in various fields (Zhu et al., 2020; Fuli et al., 2019; Xun et al., 2018; Jianfeng et al., 2021; Liu et al., 2017; Yinwei et al., 2019; An-An et al., 2021; Liu et al., 2021a) has led to diverse neural network designs for modeling motion contexts (Liu et al., 2021d; Li et al., 2019; Gao et al., 2019; Chen et al., 2020). The general framework encompasses a sequence-to-sequence model, where the observed pose sequence is encoded to a latent motion context that is decoded to output the future sequence. A major line of work utilizes RNN modules such as Long Short-Term Memory (LSTM) (Fragkiadaki et al., 2015) or Gated Recurrent Units (GRU) (Martinez et al., 2017) as the encoder and decoder. Other options such as convolutional networks (CNN), hierarchical representation, and graph convolutional networks have also been explored. Li et al. (2018) propose a CNN-based model to capture long-term temporal and spatial correlations. Liu et al. (2019) represent the human skeleton with a Lie algebra representation to encode anatomical constraints, and employ a hierarchical recurrent network to encode local and global contexts. Mao et al. (2019) design a generic graph to encode the spatial dependencies of human pose and a simple feed-forward deep network to capture dynamic context. Cui et al. (2020) build a deep generative model based on a novel dynamic GNN and adversarial learning. Li et al. (2020) introduce a multiscale graph computational unit to extract and fuse features for motion feature learning. Mao et al. (2020) develop an attention-based model that explicitly leverages historical information for motion prediction. In this paper, we present an alternative network, which improves the modeling of spatio-temporal contexts along with the capability of capturing both explicit and implicit inter-joint dependencies.

2.2. Skeleton-based Pose Representation

A fundamental component of the motion prediction task is to represent the pose sequence in the most suitable and effective way so as to facilitate modeling of motion contexts. Most existing works (Fragkiadaki et al., 2015; Martinez et al., 2017; Li et al., 2018; Liu et al., 2019) represent the pose as joints along a kinematic graph with joint orientations parameterized as axis angles. A related approach characterizes the skeletal structure as a graph (Li et al., 2020; Cui et al., 2020; Mao et al., 2019, 2020; Guo and Choi, 2019). A critical issue is that these schemes usually treat the joints on an equal standing, failing to account for the fact that the kinematic chain is a hierarchical structure. This raises severe difficulties in effectively capturing the joint dependencies, manifesting as large prediction errors for end effectors such as hands and feet. We therefore propose to dispense with kinematic graphs, treat each joint as a distinct entity, and directly learn the correlations and dependencies from data, facilitated by additional knowledge of skeletal anatomy.

Figure 2. The proposed TEID network performs trajectory extrapolation in two phases. The first, explicit dependency modeling phase consists of three deformable convolution branches with different dilation rates, which incorporate explicit prior knowledge of joint relationships and extract spatial features from joint phase space trajectories. A GRU network outputs intermediate displacement features, which are then refined via our implicit dependency modeling phase; this phase captures the joint trajectory dependencies with respect to the global motion dynamics to improve holistic consistency. Finally, the future motion sequence is reconstructed from the extrapolated displacement features and the last observed pose.

3. The Proposed Method

In this section, we provide the details of our approach, TEID (motion prediction with phase space Trajectory representation and Explicit/Implicit Dependencies modeling). TEID can be divided into three key components. (i) A historical pose sequence is converted into a series of joint displacement vectors in trajectory space. (ii) These features are fed into a novel multi-scale convolutional network based on prior knowledge to capture the spatial and temporal dynamic context. (iii) Finally, the predicted sequences are further input into a refinement network to improve the coordination of the entire sequence.

3.1. Phase Space Trajectory Representation

First, we move away from the widely adopted kinematic graph scheme and instead focus on each joint rather than the entire pose. In particular, a joint $j$ is characterized by its trajectory $\{p_j^t\}_{t=1}^{T}$, where $p_j^t$ denotes the position vector of joint $j$ in the $t$-th frame (time). Furthermore, we augment a joint trajectory with joint instantaneous displacements by computing the joint displacements between adjacent frames, thereby obtaining the phase space trajectory for each joint. Specifically, joint $j$ at frame $t$ is represented as a tuple $(p_j^t, d_j^t)$, where $p_j^t$ characterizes its position while $d_j^t$ captures the displacement at the $t$-th frame. The displacement can be conveniently computed by:

$$d_j^t = p_j^t - p_j^{t-1} \tag{1}$$

Mathematically, the phase space is the cotangent bundle of the pose configuration space, and by explicitly including the displacement we achieve a complete characterization of the dynamics of the system at any given frame (time) (Craig, 2009). The displacement information also provides valuable motion contexts for future prediction. Martinez et al. (2017) model angular velocities to solve the problem of the discontinuities between the observed sequence and the first-frame prediction. Mao et al. (2019) encode temporal information in the frequency space via the discrete cosine transform. Our proposed method instead uses the joint instantaneous displacements directly to model trajectory features explicitly, so the network does not have to learn to extract the implicit motion dynamics of the joints.
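The representation described above can be illustrated with a small sketch. It is not the authors' implementation: the function names are hypothetical, positions are plain `(x, y, z)` tuples for a single joint, and the displacement is assumed to be the difference between a frame and the one before it.

```python
def displacements(positions):
    """First-order differences between adjacent frames of ONE joint.

    positions: list of T (x, y, z) tuples. Returns T - 1 displacement tuples.
    """
    return [
        tuple(c - p for p, c in zip(prev, curr))
        for prev, curr in zip(positions, positions[1:])
    ]


def phase_space(positions):
    """Pair every frame from the second onward with the displacement leading to it."""
    return list(zip(positions[1:], displacements(positions)))


def reconstruct(last_position, future_displacements):
    """Integrate displacements starting from the last observed position."""
    out = []
    current = last_position
    for d in future_displacements:
        current = tuple(c + di for c, di in zip(current, d))
        out.append(current)
    return out


wrist = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (3.0, 2.0, 1.0)]
print(displacements(wrist))
print(reconstruct(wrist[0], displacements(wrist)))
```

The `reconstruct` helper mirrors the final step of the method: future positions are recovered by accumulating predicted displacements onto the last observed position.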

3.2. The Proposed Network

An overview of the proposed TEID network is shown in Fig. 2. An important highlight of our approach is that it rejects the widely adopted kinematic graph based pose representation, which inherently carries an unnecessary complexity burden due to the hierarchical nature of the joint representation. Instead, we examine the phase space trajectories at the level of individual joints. Taking phase space trajectory representations as inputs, TEID comprises an explicit modeling component as well as an implicit modeling module to capture explicit and implicit dependencies between joint trajectories, respectively. This allows extrapolation and connection of the individual joint trajectories to form a single coherent pose sequence.

More specifically, TEID first models the known and explicit dependencies between joints. Following (Liu et al., 2021b), three kinds of explicit joint dependencies are encoded: 1) The natural skeletal connection between joints. 2) The correlation between symmetric arms and legs, e.g., the left arm and the right arm. 3) The synchronization tendency between an arm and a leg on opposite sides, e.g., the left arm and the right leg. Then, TEID considers the hidden and implicit dependencies between joints and the naturalness of the entire pose. The implicit dependency modeling encompasses a global optimization module that examines the implicit relationship between the predicted joint trajectory extrapolations and the entire predicted motion sequence to ensure consistency and naturalness of the final output.
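As a rough illustration of how the three kinds of explicit dependencies could be gathered per joint, the sketch below uses a toy skeleton. The joint names, the pair lists, and the `related_joints` helper are all hypothetical and are not taken from the released code or from either dataset's joint definitions.

```python
# Hypothetical toy skeleton: names and pairings are illustrative only.
BONES = [  # 1) natural skeletal connections
    ("l_shoulder", "l_elbow"), ("l_elbow", "l_wrist"),
    ("r_shoulder", "r_elbow"), ("r_elbow", "r_wrist"),
    ("l_hip", "l_knee"), ("l_knee", "l_ankle"),
    ("r_hip", "r_knee"), ("r_knee", "r_ankle"),
]
SYMMETRIC = [  # 2) left/right counterparts of arms and legs
    ("l_elbow", "r_elbow"), ("l_wrist", "r_wrist"),
    ("l_knee", "r_knee"), ("l_ankle", "r_ankle"),
]
OPPOSITE = [  # 3) an arm and the leg on the opposite side
    ("l_wrist", "r_ankle"), ("r_wrist", "l_ankle"),
]


def related_joints(joint):
    """Collect every joint explicitly tied to `joint` by any of the three relations."""
    related = set()
    for a, b in BONES + SYMMETRIC + OPPOSITE:
        if a == joint:
            related.add(b)
        elif b == joint:
            related.add(a)
    return sorted(related)


print(related_joints("l_wrist"))
```

Only the joints returned for a given joint (plus the joint itself) would then be passed to the explicit dependency module, rather than the full skeleton.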

Explicit Dependency Modeling. Effectively capturing the precise spatial relationships between closely related joints is indispensable to understanding and modeling motion. Existing methods have expended significant effort in this direction, such as considering kinematic trees (Liu et al., 2019) or graphs (Li et al., 2020; Cui et al., 2020; Mao et al., 2020) and examining the problem in the frequency domain via discrete cosine transforms (Mao et al., 2019). Yet, the fixed graph or kinematic tree structures in such methods make it difficult to incorporate prior knowledge, while using the frequency domain usually involves intricate or cumbersome network designs to extract joint relationships.

In our approach, prior anatomical knowledge of motion can be readily leveraged. Technically, explicit dependencies that can be tapped into include bone connections, symmetrical properties of the arms and legs, as well as more complicated examples such as coordination of different limbs on opposite sides. As depicted in Fig. 2, the phase trajectory inputs for a joint along with all its explicitly related joints are collectively fed into a spatio-temporal module. Specifically, this convolutional module consists of three parallel layers of deformable convolutions with different dilation rates, which allow simultaneous extraction of multi-scale spatial features from the phase space trajectories. Various dilation rates are engaged to facilitate multi-scale processing of the dynamic information. For each joint, considering only its explicitly (closely) related joints rather than all the joints helps smooth out noise and irrelevant interference. Finally, we obtain for joint $j$ a feature of size $C \times K_j \times (T-1)$, where $C$ is the number of channels for the convolution operation, $K_j$ is the number of explicitly related joints pertaining to joint $j$, and $T$ is the length of the historical sequence (which yields $T-1$ displacements).

The obtained features are then fed into a GRU network to generate principled trajectory predictions for the future frames.
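To make the idea of parallel branches with different dilation rates concrete, here is a minimal sketch. It deliberately swaps in a plain 1D dilated convolution over the time axis of a single coordinate, not the deformable convolutions used in TEID. The dilation rates `(1, 2, 3)`, the kernel, and the function names are assumptions chosen for illustration.

```python
def dilated_conv1d(signal, kernel, dilation):
    """1D dilated cross-correlation with zero padding and same-length output.

    Assumes an odd kernel length so the kernel is centred on each sample.
    """
    half = (len(kernel) // 2) * dilation
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - half + k * dilation
            if 0 <= idx < len(signal):  # samples outside the sequence count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out


def multi_scale_features(signal, kernel, dilations=(1, 2, 3)):
    """Run the same kernel at several dilation rates, one output per branch."""
    return [dilated_conv1d(signal, kernel, d) for d in dilations]


x_displacements = [1.0, 2.0, 3.0, 4.0, 5.0]
branches = multi_scale_features(x_displacements, [1.0, 0.0, -1.0])
for rate, feat in zip((1, 2, 3), branches):
    print(rate, feat)
```

A larger dilation rate lets the same three-tap kernel compare frames that are further apart, which is the sense in which the branches see the trajectory at multiple temporal scales. How the branch outputs are merged is not specified in the text above, so the sketch simply returns them side by side.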

Implicit Dependency Modeling. We further refine the principled trajectory prediction for each joint by engaging in implicit dependency modeling, which considers the relationships between predicted trajectories of all joints. After the previous step, the predictions of different joint trajectories are independent of each other, and the dependencies between a single joint and the entire pose sequence are also ignored. To ensure that the predicted motion sequence is harmonious and reasonable from a global perspective, we introduce a global refinement module that optimizes the predicted joint features with respect to the entire predicted pose sequence.

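As illustrated in Fig. 2, we feed the forecasted displacement vector sequences $\hat{D}$ into two convolution modules to generate two feature matrices $F_1$ and $F_2$, respectively. Then, we reshape them to $\mathbb{R}^{C \times N}$, where $N$ is the total number of joint positions in the future frames. Formally,

$$F_1 = \mathrm{reshape}(\mathrm{Conv}_1(\hat{D})), \quad F_2 = \mathrm{reshape}(\mathrm{Conv}_2(\hat{D})), \quad F_1, F_2 \in \mathbb{R}^{C \times N} \tag{2}$$

Next, we apply matrix multiplication on the transpose of $F_1$ and $F_2$, and utilize a softmax operation to obtain a global affinity matrix $A \in \mathbb{R}^{N \times N}$:

$$A = \mathrm{softmax}(F_1^{\top} F_2) \tag{3}$$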

The affinity matrix models the correlation between any two joints in the predicted future frames. Within the affinity matrix, element $a_{(i,t),(j,t')}$ measures the influence of joint $j$ in the $t'$-th future frame on joint $i$ in the $t$-th future frame. More precisely, the element of the matrix measures the affinity between any two joint displacement features even when the two joints are in different frames. Finally, we refine the predicted displacement vector feature of joint $i$ in the $t$-th future frame (generated after explicit dependency modeling) by:

$$\tilde{d}_i^t = \hat{d}_i^t + \sum_{j,\,t'} a_{(i,t),(j,t')} \, \hat{d}_j^{t'} \tag{4}$$

where $\hat{d}_i^t$ is the displacement prediction of the $i$-th joint in the $t$-th future frame, and $\hat{d}_j^{t'}$ is the displacement prediction of the $j$-th joint in the $t'$-th future frame, which might have influence on $\hat{d}_i^t$. The final displacement vector obtained by Eq. 4 can be used to restore the new joint position by incorporating the joint position in the last observed frame.
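The refinement step can be sketched in a few lines. This is an illustrative reading of the description above, not the released implementation: the two convolution modules are replaced by precomputed feature vectors `f1` and `f2` (one vector per joint-frame position), the softmax is assumed to be applied row-wise, and the refined displacement is assumed to be the original prediction plus the affinity-weighted sum of all predictions.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def affinity(f1, f2):
    """Row-normalised affinity between every pair of joint-frame positions.

    f1, f2: lists of N feature vectors (assumed outputs of the two conv modules).
    """
    return [
        softmax([sum(a * b for a, b in zip(u, v)) for v in f2])
        for u in f1
    ]


def refine(pred, aff):
    """Add to each predicted displacement the affinity-weighted sum of all of them."""
    refined = []
    for m, d_m in enumerate(pred):
        agg = [
            sum(aff[m][n] * pred[n][k] for n in range(len(pred)))
            for k in range(len(d_m))
        ]
        refined.append(tuple(x + y for x, y in zip(d_m, agg)))
    return refined


pred = [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]   # two joint-frame positions
feats = [[0.0, 0.0], [0.0, 0.0]]            # identical features give uniform affinity
A = affinity(feats, feats)
print(A)
print(refine(pred, A))
```

With identical features every position influences every other equally, so each refined displacement is its own prediction plus the mean of all predictions; learned features would instead concentrate the weight on the positions that matter most.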

3.3. Loss Function

Previous works such as (Cui et al., 2020; Mao et al., 2020) have generally employed the standard L2 loss. However, the motion ranges of different joints are quite different. Some joints may stay motionless in an action while some joints may undergo large displacements. It is therefore inappropriate to assign the same weight to all the joints. Instead, we assign a spatio-temporal weight to each joint. We give higher weights to joints with larger historical position changes (which forces the network to attend to these prominent joints) and to earlier frames in the prediction (which reduces error accumulation). Empirically, this improves the performance of short-term forecasting and effectively reduces the error accumulation in long-term forecasting. Formally, the loss function can be formulated as:

$$\mathcal{L} = \sum_{j} \sum_{t} w_j^t \, \left\| \tilde{d}_j^t - d_j^t \right\|_2 \tag{5}$$

where $\tilde{d}_j^t$ is the displacement vector prediction of the $j$-th joint in the $t$-th frame, while $d_j^t$ represents the corresponding ground truth. $w_j^t$ denotes the weight. The loss enforces the predicted frame-wise displacements to approach the ground truth.
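The weighting idea can be sketched as follows. The exact weighting scheme is not given in the text, so everything specific here is an assumption: the weight is factored into a per-joint term proportional to the joint's total historical displacement magnitude and a per-frame term that decays as `1 / (t + 1)`, and the per-joint error is an unsquared Euclidean norm. All function names are hypothetical.

```python
import math


def joint_weights(history):
    """history: {joint: [displacement tuples]} -> weights that sum to 1.

    Assumed scheme: weight proportional to total historical displacement magnitude.
    """
    totals = {
        j: sum(math.sqrt(sum(c * c for c in d)) for d in ds)
        for j, ds in history.items()
    }
    z = sum(totals.values()) or 1.0  # avoid division by zero for a static pose
    return {j: v / z for j, v in totals.items()}


def frame_weights(num_future):
    """Assumed scheme: earlier predicted frames get larger weights (1 / (t + 1))."""
    raw = [1.0 / (t + 1) for t in range(num_future)]
    z = sum(raw)
    return [r / z for r in raw]


def weighted_l2_loss(pred, target, w_joint, w_frame):
    """Sum of weighted Euclidean errors over joints and future frames."""
    loss = 0.0
    for j in pred:
        for t, (p, g) in enumerate(zip(pred[j], target[j])):
            err = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, g)))
            loss += w_joint[j] * w_frame[t] * err
    return loss


history = {"wrist": [(3.0, 4.0, 0.0)], "hip": [(0.0, 0.0, 0.0)]}
wj = joint_weights(history)
wf = frame_weights(2)
pred = {"wrist": [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
        "hip": [(5.0, 0.0, 0.0), (0.0, 0.0, 0.0)]}
target = {"wrist": [(0.0, 0.0, 0.0)] * 2, "hip": [(0.0, 0.0, 0.0)] * 2}
print(wj, wf, weighted_l2_loss(pred, target, wj, wf))
```

In this toy case the hip never moved in the history, so its large prediction error contributes nothing, while the wrist error in the first predicted frame dominates the loss.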

4. Experiments

In this section, we evaluate our method over the widely-used human motion benchmark datasets Human3.6M and CMU MoCap.

4.1. Datasets

Human3.6M is the largest dataset for the human motion prediction task, containing 3.6 million human poses and corresponding videos (images). The videos are recorded by a Vicon motion capture system. The dataset consists of 7 subjects, each of whom performed 15 activities (e.g., discussion, eating, and sitting). There are 32 skeletal joints involved to characterize the human pose. Following the evaluation protocol of previous works (Fragkiadaki et al., 2015), we removed duplicate points, downsampled the sequences to 25 frames per second (FPS), and used subject 5 as the test set, training on the remaining 6 subjects.

CMU MoCap is released by Carnegie Mellon University. Twelve infrared cameras record the positions of 41 markers taped on the human body to capture 3D skeleton motion. We adopted the same preprocessing scheme as in (Mao et al., 2019) and selected 7 actions for evaluating the performance of the model (e.g., walking, soccer, and jumping).

4.2. Experimental Settings

We use PyTorch to implement our method. For training, we utilize the Adam optimizer with a batch size of 16, a learning rate of 0.001, and a dropout of 0.05. The model is trained for 50 epochs on an NVIDIA GeForce Titan Xp GPU. To avoid the problem of exploding gradients, gradient clipping is applied with a threshold of 5. We use a single convolutional feature channel for the explicit dependency modeling module, and the GRU unit size is set to 128.
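For reference, the settings above can be gathered into a small configuration, together with a framework-free sketch of gradient clipping. The dictionary keys are hypothetical names, and the clipping is assumed to act on the global gradient norm; the text does not say whether the threshold of 5 applies to the norm or to individual values.

```python
import math

# Values come from the settings above; the key names are hypothetical.
CONFIG = {
    "optimizer": "Adam",
    "batch_size": 16,
    "learning_rate": 0.001,
    "dropout": 0.05,
    "epochs": 50,
    "grad_clip": 5.0,
    "conv_channels": 1,
    "gru_hidden_size": 128,
}


def clip_by_global_norm(grads, max_norm=CONFIG["grad_clip"]):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


print(clip_by_global_norm([3.0, 4.0]))    # norm 5, left unchanged
print(clip_by_global_norm([30.0, 40.0]))  # norm 50, rescaled to norm 5
```

Rescaling keeps the direction of the update intact and only limits its size, which is why it is a common guard against exploding gradients in recurrent networks such as the GRU used here.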

Method | Directions | Eating | Greeting
Milliseconds (ms) | 80 160 320 400 560 720 1,000 | 80 160 320 400 560 720 1,000 | 80 160 320 400 560 720 1,000
Res-GRU (Martinez et al., 2017) | 21.6 41.3 72.1 84.1 101.1 114.5 129.1 | 16.8 31.5 53.5 61.7 74.9 85.9 98.0 | 31.2 58.4 96.3 108.8 126.1 138.8 153.9
ConSeq2Seq (Li et al., 2018) | 13.5 29.0 57.6 69.7 86.6 99.8 115.8 | 11.0 22.4 40.7 48.4 61.3 72.8 87.1 | 22.0 45.0 82.0 96.0 116.9 130.7 147.3
HMR (Liu et al., 2019) | 23.3 25.0 47.2 61.5 80.9 95.1 116.9 | 9.2 13.9 34.6 47.1 61.3 72.9 84.8 | 12.9 31.9 55.6 82.5 104.3 116.1 123.2
FC-GCN (Mao et al., 2019) | 12.6 24.4 48.2 58.4 72.2 86.7 105.8 | 8.8 18.9 39.4 47.2 50.0 61.1 74.1 | 14.5 30.5 74.2 89.0 103.7 120.6 140.9
LDR (Cui et al., 2020) | 13.1 23.7 44.5 50.9 - - 78.3 | 7.6 15.9 37.2 41.7 - - 53.8 | 9.6 27.9 66.3 78.8 - - 129.7
TrajNet (Liu et al., 2021c) | 9.7 22.3 50.2 61.7 84.7 - 104.2 | 8.5 18.4 37.0 44.8 59.2 - 71.5 | 12.6 28.1 67.3 80.1 91.4 - 84.3
SDMTL (Liu and Yin, 2020) | 9.8 23.4 53.8 67.0 88.3 - 107.9 | 8.2 16.4 33.8 42.4 53.9 - 68.8 | 11.7 25.3 61.9 75.0 88.7 - 89.0
HRI (Mao et al., 2020) | 7.4 18.4 44.5 56.5 73.9 88.2 106.5 | 6.4 14.0 28.7 36.2 50.0 61.4 75.7 | 13.7 30.1 63.8 78.1 101.9 118.4 138.8
Ours | 5.9 14.2 37.6 42.5 64.8 71.6 72.3 | 4.7 10.8 21.0 28.2 36.3 43.9 52.5 | 7.9 18.4 46.8 55.2 68.2 75.8 83.1

Method | Sitting | Sitting Down | Taking Photo
Milliseconds (ms) | 80 160 320 400 560 720 1,000 | 80 160 320 400 560 720 1,000 | 80 160 320 400 560 720 1,000
Res-GRU (Martinez et al., 2017) | 23.8 44.7 78.0 91.2 113.7 130.5 152.6 | 31.7 58.3 96.7 112.0 138.8 159.0 187.4 | 21.9 41.4 74.0 87.6 110.6 128.9 153.9
ConSeq2Seq (Li et al., 2018) | 13.5 27.0 52.0 63.1 82.4 98.8 120.7 | 20.7 40.6 70.4 82.7 106.5 125.1 150.3 | 12.7 26.0 52.1 63.6 84.4 102.4 128.1
HMR (Liu et al., 2019) | 12.6 25.6 44.7 60.7 76.4 96.3 118.4 | 9.6 18.6 41.1 57.7 101.7 128.8 148.3 | 7.9 19.0 31.5 57.3 83.5 93.7 108.5
FC-GCN (Mao et al., 2019) | 10.7 24.6 50.6 62.0 76.4 93.1 115.7 | 11.4 27.6 56.4 67.6 96.2 115.2 142.2 | 6.8 15.2 38.2 49.6 72.5 90.9 116.3
LDR (Cui et al., 2020) | 9.2 23.1 47.2 57.7 - - 106.5 | 9.3 21.4 46.3 59.3 - - 144.6 | 7.1 13.8 29.6 44.2 - - 116.4
TrajNet (Liu et al., 2021c) | 9.0 22.0 49.4 62.6 81.0 - 116.3 | 10.7 28.8 55.1 62.9 79.8 - 123.8 | 5.4 13.4 36.2 47.0 73.0 - 86.6
SDMTL (Liu and Yin, 2020) | 8.7 22.2 52.2 65.5 83.9 - 115.5 | 9.3 23.8 50.6 60.9 77.7 - 118.9 | 6.0 14.0 36.1 47.0 67.1 - 91.1
HRI (Mao et al., 2020) | 9.3 20.1 44.3 56.0 76.4 93.1 115.9 | 14.9 30.7 59.1 72.0 97.0 116.1 143.6 | 8.3 18.4 40.7 51.5 72.1 90.4 115.9
Ours | 7.6 17.2 36.9 51.2 69.5 78.3 93.6 | 7.2 16.3 32.7 50.9 62.1 94.5 101.9 | 5.2 12.2 23.8 38.4 62.4 76.2 82.4

Method | Phoning | Posing | Purchases
Milliseconds (ms) | 80 160 320 400 560 720 1,000 | 80 160 320 400 560 720 1,000 | 80 160 320 400 560 720 1,000
Res-GRU (Martinez et al., 2017) | 21.1 38.9 66.0 76.4 94.0 107.7 126.4 | 29.3 56.1 98.3 114.3 140.3 159.8 183.2 | 28.7 52.4 86.9 100.7 122.1 137.2 154.0
ConSeq2Seq (Li et al., 2018) | 13.5 26.6 49.9 59.9 77.1 92.1 114.0 | 16.9 36.7 75.7 92.9 122.5 148.8 187.4 | 20.3 41.8 76.5 89.9 111.3 129.1 151.5
HMR (Liu et al., 2019) | 12.5 21.3 39.3 58.6 71.3 88.7 112.8 | 13.6 23.5 62.5 114.1 126.3 135.9 143.6 | 15.3 30.6 64.7 73.9 97.5 107.2 122.7
FC-GCN (Mao et al., 2019) | 11.5 20.2 37.9 43.2 67.8 83.0 105.1 | 9.4 23.9 66.2 82.9 107.6 136.1 175.0 | 19.6 38.5 64.4 72.2 98.3 115.1 139.3
LDR (Cui et al., 2020) | 10.4 14.3 33.1 39.7 - - 85.8 | 8.7 21.1 58.3 81.9 - - 133.7 | 16.2 36.1 62.8 76.2 - - 112.6
TrajNet (Liu et al., 2021c) | 10.7 18.8 37.0 43.1 62.3 - 113.5 | 6.9 21.3 62.9 78.8 111.6 - 210.9 | 17.1 36.1 64.3 75.1 84.5 - 115.5
SDMTL (Liu and Yin, 2020) | 10.5 18.5 37.2 43.1 60.8 - 112.3 | 6.8 20.5 64.0 82.4 107.2 - 204.7 | 18.4 38.8 61.1 68.2 80.9 - 113.6
HRI (Mao et al., 2020) | 8.6 18.3 39.0 49.2 67.4 82.9 105.0 | 10.2 24.2 58.5 75.8 107.6 136.8 178.2 | 13.0 29.2 60.4 73.9 95.6 110.9 134.2
Ours | 6.6 10.1 24.3 31.4 48.2 74.8 102.7 | 5.2 16.9 49.2 68.3 96.6 118.0 123.8 | 10.2 23.6 52.3 58.6 73.0 92.7 112.2

Method | Waiting | Walking Dog | Average
Milliseconds (ms) | 80 160 320 400 560 720 1,000 | 80 160 320 400 560 720 1,000 | 80 160 320 400 560 720 1,000
Res-GRU (Martinez et al., 2017) | 23.8 44.2 75.8 87.7 105.4 117.3 135.4 | 36.4 64.8 99.1 110.6 128.7 141.1 164.5 | 25.0 46.2 77.0 88.3 106.3 119.4 136.6
ConSeq2Seq (Li et al., 2018) | 14.6 29.7 58.1 69.7 87.3 100.3 117.7 | 27.7 53.6 90.7 103.3 122.4 133.8 162.4 | 16.6 33.3 61.4 72.7 90.7 104.7 124.2
HMR (Liu et al., 2019) | 12.8 24.5 45.2 85.1 87.5 94.2 121.9 | 30.1 41.4 78.4 100.1 134.7 141.6 157.4 | 13.3 23.2 44.7 63.8 86.1 99.9 116.2
FC-GCN (Mao et al., 2019) | 9.5 22.0 57.5 73.9 73.4 88.2 107.5 | 32.2 58.0 102.2 122.7 105.8 118.7 142.2 | 12.1 25.0 51.0 61.3 78.3 93.3 114.0
LDR (Cui et al., 2020) | 9.2 17.6 47.2 71.6 - - 127.3 | 25.3 56.6 87.9 99.4 - - 143.2 | 10.7 22.5 45.1 55.8 - - 97.8
TrajNet (Liu et al., 2021c) | 8.2 21.0 53.4 68.9 92.9 - 165.9 | 23.6 52.0 98.1 116.9 141.1 - 181.3 | 10.2 23.2 49.3 59.7 77.7 - 110.6
SDMTL (Liu and Yin, 2020) | 7.5 19.0 46.8 58.3 81.4 - 159.2 | 21.0 54.9 100.4 119.8 137.7 - 181.5 | 9.8 22.7 48.0 58.2 74.5 - 110.7
HRI (Mao et al., 2020) | 8.7 19.2 43.4 54.9 74.5 89.0 108.2 | 20.1 40.3 73.3 86.3 108.2 120.6 146.9 | 10.4 22.6 47.1 58.3 77.3 91.8 112.1
Ours | 6.5 15.2 37.5 47.3 68.8 79.4 95.6 | 15.5 31.9 62.3 67.4 85.6 106.3 126.1 | 8.6 19.6 39.2 50.5 68.9 81.3 93.8

A dash (-) marks a horizon for which no value is listed for that method.
Table 1. Comparison of position errors for short-term and long-term predictions on the H3.6M dataset. Our method consistently outperforms other methods.
Figure 3. The average Mean Per Joint Position Error for all 15 actions on the H3.6M dataset.
Figure 4. The visual results comparison on H3.6M. In each sub-group, the first row shows the ground truth, and the following rows are the results of Res-GRU, FC-GCN, HRI, and our method.

4.3. Evaluation on Human Datasets

We evaluate our method and existing works on two popular benchmark datasets. For fair comparison, we follow the evaluation metric in existing works (Mao et al., 2019, 2020). Specifically, the standard Mean Per Joint Position Error (MPJPE) is adopted to measure the mean Euclidean distance between the predicted joint positions and the ground truth. The results are reported for both short-term and long-term predictions, at the horizons listed in Table 1 (80 ms to 1,000 ms).
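As a concrete reading of the metric, the sketch below computes MPJPE for sequences stored as nested lists of `(x, y, z)` tuples. The function name and data layout are illustrative choices, not taken from any released evaluation code.

```python
import math


def mpjpe(pred, target):
    """Mean Euclidean distance over all joints in all frames.

    pred, target: lists of frames, each frame a list of (x, y, z) joint positions.
    """
    total, count = 0.0, 0
    for frame_p, frame_t in zip(pred, target):
        for p, g in zip(frame_p, frame_t):
            total += math.dist(p, g)  # Euclidean distance (Python 3.8+)
            count += 1
    return total / count


pred = [[(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]]    # one frame, two joints
target = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
print(mpjpe(pred, target))
```

Here one joint is exact and the other is off by a distance of 5, so the mean over the two joints is 2.5.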

H3.6M. We benchmark the proposed method against 8 existing works, including Res-GRU (Martinez et al., 2017), ConSeq2Seq (Li et al., 2018), HMR (Liu et al., 2019), FC-GCN (Mao et al., 2019), TrajNet (Liu et al., 2021c), SDMTL (Liu and Yin, 2020), LDR (Cui et al., 2020), and HRI (Mao et al., 2020). The experimental results are presented in Table 1 and Fig. 3. We report the results of 11 diverse actions and the average over all actions. Different activities exhibit different levels of complexity. Existing methods tend to perform well for motions with high degrees of periodicity such as “Walking” or regularity such as “Eating”. However, their performance dips markedly when modeling more stochastic and irregular movements such as “Posing” or “Directions”. In contrast, a major highlight of our model is its ability to deliver accurate predictions even when faced with highly complex and aperiodic action types. We observe that our method consistently delivers superior performance over state-of-the-art methods, achieving a substantial 15.4% average accuracy improvement over the previous state-of-the-art approach. The improvement is even more significant for long-term predictions.

We further visualize sample motion prediction results in Fig. 4, where the results for the “Directions” and “Eating” activities are presented. The first three frames correspond to the observed sequences and the subsequent frames are the predicted poses. Each row illustrates the results of one method. The forecasted sequences obtained by our method are visibly closer to the ground truth and more natural. A significant issue arising in existing methods for highly stochastic actions is that the predicted motions tend to converge to static, motionless states. For example, the HRI (Mao et al., 2020) predictions for “Directions” have a very limited range of motion, which appears unnatural. Incorporating displacements and adopting a joint phase space trajectory representation alleviates this issue, as it provides richer and smoother individualized motion contexts that are easier to model than the pose representation. In the illustrations of the “Eating” action, the coordination of hands and feet for our method also appears more coherent. We attribute this to our explicit modeling of the joint dependencies, which restricts each joint to its closely correlated joints and thereby avoids interference and excess information from the others.

CMU MoCap  We also evaluate our method on the CMU MoCap dataset. As shown in Table 2, our method consistently and substantially outperforms all existing methods, especially on complex activity types. For example, on “Basketball”, “Directing Traffic”, and “Soccer”, our method outperforms state-of-the-art methods by 17.4%, 14.7%, and 25.4%, respectively.
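To make the size of such gains concrete, the following minimal sketch computes the per-horizon relative error reduction from the "Average" columns of Table 2 (SDMTL versus our method). The text does not state how the reported percentages are aggregated over horizons or baselines, so the formula below is an assumption and its output need not reproduce the figures quoted above.

```python
from typing import List

# Average position errors copied from Table 2 at each prediction horizon.
HORIZONS_MS = [80, 160, 320, 400, 560, 1000]
SDMTL_AVG = [8.0, 14.5, 31.9, 41.9, 59.4, 102.7]
OURS_AVG = [6.6, 12.4, 27.0, 36.6, 51.4, 76.2]


def relative_reduction(baseline: List[float], ours: List[float]) -> List[float]:
    """Per-horizon error reduction in percent: 100 * (baseline - ours) / baseline.

    This aggregation rule is an assumption of the sketch, not the paper's stated formula.
    """
    if len(baseline) != len(ours):
        raise ValueError("both methods must report the same horizons")
    return [100.0 * (b - o) / b for b, o in zip(baseline, ours)]


reductions = relative_reduction(SDMTL_AVG, OURS_AVG)
for horizon, value in zip(HORIZONS_MS, reductions):
    print(f"{horizon:>5} ms: {value:5.1f}% lower error than SDMTL")
```

Under this definition the reduction is largest at the 1,000 ms horizon, in line with the observation that the gains are more pronounced for long-term prediction.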

For further evaluation, we show visual results on the CMU-MoCap dataset in Fig. 5. Our method outperforms the compared state-of-the-art methods in both long-term and short-term predictions and proves effective and robust in capturing temporal dynamics. For the “Jumping” action, our model successfully captures the fast rise of the legs, while other methods fail to capture this trend. More specifically, Res-GRU and FC-GCN fail to predict the subtle correlation between the legs, and HRI generates an unreasonable prediction for the left foot. For the “Directing Traffic” action, our method generates more accurate hand motions than state-of-the-art methods: Res-GRU exhibits a discontinuity between the first predicted frame and the historical sequence, while HRI yields good short-term predictions but is not accurate in the long term.

| Method | Basketball | Basketball signal | Directing traffic | Jumping |
|---|---|---|---|---|
| Res-GRU (Martinez et al., 2017) | 18.4 / 33.8 / 59.5 / 70.5 / - / 106.7 | 12.7 / 23.8 / 40.3 / 46.7 / - / 77.5 | 15.2 / 29.6 / 55.1 / 66.1 / - / 127.1 | 36.0 / 68.7 / 125.0 / 145.5 / - / 195.5 |
| ConSeq2Seq (Li et al., 2018) | 16.7 / 30.5 / 53.8 / 64.3 / - / 91.5 | 8.4 / 16.2 / 30.8 / 37.8 / - / 76.5 | 10.6 / 20.3 / 38.7 / 48.4 / - / 115.5 | 22.4 / 44.0 / 87.5 / 106.3 / - / 162.6 |
| FC-GCN (Mao et al., 2019) | 14.0 / 25.4 / 49.6 / 61.4 / 77.4 / 106.1 | 3.5 / 6.1 / 11.7 / 15.2 / 25.3 / 53.9 | 7.4 / 15.1 / 31.7 / 42.2 / 70.3 / 152.4 | 16.9 / 34.4 / 76.3 / 96.8 / 131.4 / 164.6 |
| LPJP (Cai et al., 2020) | 11.6 / 21.7 / 44.4 / 57.3 / - / 90.9 | 2.6 / 4.9 / 12.7 / 18.7 / - / 75.8 | 6.2 / 12.7 / 29.1 / 39.6 / - / 149.1 | 12.9 / 27.6 / 73.5 / 92.2 / - / 176.6 |
| LDR (Cui et al., 2020) | 13.1 / 22.0 / 37.2 / 55.8 / - / 97.7 | 3.4 / 6.2 / 11.2 / 13.8 / - / 47.3 | 6.8 / 16.3 / 27.9 / 38.9 / - / 131.8 | 13.2 / 32.7 / 65.1 / 91.3 / - / 153.5 |
| SDMTL (Liu and Yin, 2020) | 10.9 / 20.2 / 40.9 / 50.8 / 66.1 / 110.2 | 2.9 / 6.2 / 16.4 / 23.1 / 37.4 / 71.6 | 5.1 / 10.9 / 23.2 / 30.2 / 46.1 / 104.5 | 11.1 / 24.6 / 65.7 / 90.3 / 130.9 / 191.2 |
| Ours | 10.7 / 16.9 / 33.4 / 42.9 / 57.3 / 88.5 | 2.1 / 4.9 / 9.2 / 10.4 / 19.8 / 44.3 | 4.3 / 9.5 / 18.9 / 27.6 / 40.0 / 92.4 | 10.6 / 20.4 / 52.0 / 80.8 / 114.9 / 146.0 |

| Method | Soccer | Walking | Wash window | Average |
|---|---|---|---|---|
| Res-GRU (Martinez et al., 2017) | 20.3 / 39.5 / 71.3 / 84.0 / - / 129.6 | 8.2 / 13.7 / 21.9 / 24.5 / - / 52.2 | 8.4 / 15.8 / 29.3 / 35.4 / - / 61.1 | 16.8 / 30.5 / 54.2 / 63.6 / - / 99.0 |
| ConSeq2Seq (Li et al., 2018) | 12.1 / 21.8 / 41.9 / 52.9 / - / 94.6 | 7.6 / 12.5 / 23.0 / 27.5 / - / 49.8 | 8.2 / 15.9 / 32.1 / 39.9 / - / 58.9 | 12.5 / 22.2 / 40.7 / 49.7 / - / 84.6 |
| FC-GCN (Mao et al., 2019) | 11.3 / 21.5 / 44.2 / 55.8 / 82.6 / 117.5 | 7.7 / 11.8 / 19.4 / 23.1 / 27.2 / 40.2 | 5.9 / 11.9 / 30.3 / 40.0 / 53.0 / 79.3 | 11.5 / 20.4 / 37.8 / 46.8 / 62.9 / 96.5 |
| LPJP (Cai et al., 2020) | 9.2 / 18.4 / 39.2 / 49.5 / - / 93.9 | 6.7 / 10.7 / 21.7 / 27.5 / - / 37.4 | 5.4 / 11.3 / 29.2 / 39.6 / - / 79.1 | 9.8 / 17.6 / 35.7 / 45.1 / - / 93.2 |
| LDR (Cui et al., 2020) | 10.3 / 21.1 / 42.7 / 50.9 / - / 91.4 | 7.1 / 10.4 / 17.8 / 20.7 / - / 37.5 | 5.8 / 12.3 / 27.8 / 38.2 / - / 56.6 | 9.4 / 18.8 / 31.6 / 43.2 / - / 82.9 |
| SDMTL (Liu and Yin, 2020) | 8.1 / 16.5 / 36.6 / 50.6 / 77.0 / 140.7 | 6.1 / 9.0 / 17.5 / 20.0 / 26.3 / 51.9 | 4.6 / 10.1 / 29.6 / 39.2 / 50.9 / 82.4 | 8.0 / 14.5 / 31.9 / 41.9 / 59.4 / 102.7 |
| Ours | 6.5 / 12.5 / 26.3 / 40.6 / 68.1 / 81.5 | 5.3 / 7.8 / 15.9 / 18.0 / 25.5 / 44.7 | 4.2 / 8.5 / 24.3 / 32.6 / 46.0 / 55.4 | 6.6 / 12.4 / 27.0 / 36.6 / 51.4 / 76.2 |

Table 2. Comparisons of position error for both short-term and long-term predictions on CMU-MoCap dataset with state-of-the-art methods. Each cell lists the errors at 80 / 160 / 320 / 400 / 560 / 1,000 ms; a dash marks a horizon for which no value is reported.
Figure 5. Visual comparison of results on the CMU-MoCap dataset. In each sub-group, the first row shows the ground truth, and the following rows are the results of Res-GRU, FC-GCN, HRI, and our method.
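As an illustration of how a per-horizon position error such as those in Table 2 can be computed, the sketch below averages the Euclidean distance between predicted and ground-truth joint positions in the frame that corresponds to each horizon. It assumes that "position error" denotes this mean per-joint distance and that frames are 40 ms apart (25 frames per second); neither detail is specified in this section, and the function names are hypothetical.

```python
import math
from typing import Dict, Sequence

Joint = Sequence[float]      # (x, y, z) position of one joint
Pose = Sequence[Joint]       # all joints in one frame


def frame_position_error(pred: Pose, target: Pose) -> float:
    """Mean Euclidean distance between predicted and true joints in one frame."""
    if len(pred) != len(target):
        raise ValueError("both poses must have the same number of joints")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def errors_at_horizons(pred_seq: Sequence[Pose], target_seq: Sequence[Pose],
                       horizons_ms: Sequence[int], frame_ms: float = 40.0) -> Dict[int, float]:
    """Position error at each horizon.

    frame_ms=40 corresponds to 25 frames per second, an assumption of this
    sketch that is not taken from the text.
    """
    errors = {}
    for horizon in horizons_ms:
        index = round(horizon / frame_ms) - 1  # 80 ms -> second predicted frame
        errors[horizon] = frame_position_error(pred_seq[index], target_seq[index])
    return errors


# Toy example: two joints, two predicted frames; only one joint is off in frame 2.
target = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
pred = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(3.0, 4.0, 0.0), (1.0, 0.0, 0.0)]]
toy_errors = errors_at_horizons(pred, target, [40, 80])
print(toy_errors)
```

In the toy example the first frame is predicted exactly, while in the second frame one of the two joints is displaced by 5 units, so the mean error over both joints is half of that.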

4.4. Ablation Studies

We investigate the effectiveness of different modules within our TEID through the following ablation studies. Experiments are performed on the H3.6m dataset, with results reported in Table 3.

Explicit dependency modeling module  The explicit dependency modeling block in our TEID explicitly accounts for prior knowledge such as the natural anatomical connections between joints and the coordination of different limbs. From the results in Table 3, we observe that removing this module causes a slight accuracy decay in the short term, while the long-term prediction accuracy is greatly affected. Our interpretation is that, without prior anatomical knowledge, the model cannot effectively filter joint correlations, which results in error accumulation over the long term.
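One simple way to picture anatomical prior knowledge is as a joint adjacency matrix built from the skeleton's bones. The sketch below is only an illustration of that idea: the five-joint skeleton, the joint names, and the helper functions are hypothetical, and the way TEID actually encodes and consumes this prior is not described in this section.

```python
from typing import List, Tuple

# Hypothetical toy skeleton; the joint names and bones are illustrative only.
JOINTS = ["hip", "spine", "head", "l_hand", "r_hand"]
BONES = [("hip", "spine"), ("spine", "head"), ("spine", "l_hand"), ("spine", "r_hand")]


def anatomical_adjacency(joints: List[str], bones: List[Tuple[str, str]]) -> List[List[int]]:
    """Symmetric 0/1 matrix with ones on the diagonal and for each bone."""
    index = {name: i for i, name in enumerate(joints)}
    n = len(joints)
    adjacency = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for a, b in bones:
        adjacency[index[a]][index[b]] = 1
        adjacency[index[b]][index[a]] = 1
    return adjacency


def neighbours(joint: str, joints: List[str], adjacency: List[List[int]]) -> List[str]:
    """Joints that are physically connected to the given joint."""
    i = joints.index(joint)
    return [joints[j] for j, flag in enumerate(adjacency[i]) if flag and j != i]


ADJACENCY = anatomical_adjacency(JOINTS, BONES)
print(neighbours("spine", JOINTS, ADJACENCY))
print(neighbours("head", JOINTS, ADJACENCY))
```

A matrix of this kind can act as a mask that restricts which joints are allowed to influence one another; limb-coordination priors could be added as extra pairs in the same list.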

Implicit global optimization module  This module is designed to globally optimize the position of a joint with respect to the entire forecasted motion sequence. Removing this block results in a slight dip in accuracy. This reveals that modeling only the explicitly related joints, while ignoring all other joints, already delivers relatively accurate motion prediction. However, implicit relation modeling considers the hidden dependencies between each pair of joints (e.g., the potential correlation between left hand and right hand joints), which improves the naturalness of the predicted pose sequence.
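The idea of relating every pair of joints can be illustrated with a generic affinity-weighted aggregation. In the sketch below, each joint feature is replaced by a softmax-weighted mixture of all joint features, with dot products as affinities. This is a standard construction used here purely for illustration; it is not claimed to be the module used in TEID, and all names are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def affinity_refine(features: List[List[float]]) -> List[List[float]]:
    """Replace each joint feature by an affinity-weighted mix of all joint features.

    Affinities are plain dot products, an assumption made for this sketch.
    """
    refined = []
    for fi in features:
        scores = [sum(a * b for a, b in zip(fi, fj)) for fj in features]
        weights = softmax(scores)
        refined.append([
            sum(w * fj[d] for w, fj in zip(weights, features))
            for d in range(len(fi))
        ])
    return refined


# Toy features for three joints: the first two are similar, the third differs.
toy_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_refined = affinity_refine(toy_features)
for row in toy_refined:
    print([round(v, 3) for v in row])
```

Because the weights cover all joints rather than only physically connected ones, joints with similar features (such as the two hands in a symmetric motion) can inform each other even without a direct anatomical link.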

Phase space trajectory representation  As an alternative to explicitly incorporating the displacement vectors as inputs, we also try employing only the position vectors of each joint. From the quantitative results in Table 3, we observe a significant deterioration in prediction accuracy, which confirms the empirical effectiveness of including displacements. This agrees with our intuition that a complete characterization of the dynamic system with a phase space representation plays an important role in motion prediction.
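The difference between the two input variants can be sketched as follows: a position-only input keeps the 3D coordinates of a joint per frame, whereas a phase space input pairs each position with its frame-to-frame displacement. The function name and the zero displacement assigned to the first frame are assumptions of this sketch rather than details taken from the paper.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def to_phase_space(trajectory: List[Vec3]) -> List[Tuple[float, ...]]:
    """Pair each joint position with its frame-to-frame displacement.

    The first frame has no predecessor; a zero displacement is used there,
    which is an assumption of this sketch.
    """
    states = []
    prev = trajectory[0] if trajectory else None
    for pos in trajectory:
        displacement = tuple(c - p for c, p in zip(pos, prev))
        states.append(tuple(pos) + displacement)  # (x, y, z, dx, dy, dz)
        prev = pos
    return states


# Toy trajectory of a single joint over three frames.
wrist = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 1.0, 0.0)]
wrist_states = to_phase_space(wrist)
for state in wrist_states:
    print(state)
```

Each output row carries both where the joint is and how it is currently moving, which is the additional information that the position-only ablation removes.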

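| E | I | D | 80 | 160 | 320 | 400 | 560 | 720 | 1,000 |
|---|---|---|---|---|---|---|---|---|---|
|  |  |  | 28.9 | 34.8 | 52.6 | 65.3 | 88.1 | 102.0 | 117.5 |
|  |  |  | 9.5 | 24.5 | 44.7 | 60.3 | 82.8 | 93.6 | 105.3 |
|  |  |  | 9.2 | 23.3 | 42.5 | 54.7 | 75.1 | 86.0 | 95.2 |
|  |  |  | 8.6 | 19.6 | 39.2 | 50.5 | 68.9 | 81.3 | 93.8 |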
Table 3. Ablation studies for the Explicit dependency modeling module, Implicit dependency modeling module, and Displacement inputs. E and I stand for Explicit and Implicit dependency modeling modules, respectively. D represents displacement inputs.

5. Conclusion

In this paper, we tackle the motion prediction problem by moving away from kinematics graph pose representations and instead adopting a phase space trajectory representation for each constituent joint. This serves to reduce the inherent complexity of the problem by considering individualized joint trajectories instead of the entire pose sequence. We further design a network consisting of prior anatomical knowledge encoding and multi-scale convolution for explicit joint dependency modeling. Along with a global affinity-based optimization module, we obtain joint trajectory extrapolations that aggregate coherently to form a consistent and natural pose sequence. Our method is robust and accurate, demonstrating significant improvements over state-of-the-art methods.

6. Acknowledgments

This research is supported in part by the National Key Research and Development Program of China under Grant No.2020AAA0140004, the Natural Science Foundation of Zhejiang Province, China (Grant No. LQ19F020001), the National Natural Science Foundation of China (No. 61902348), and the Key R&D Program of Zhejiang Province (No. 2021C01104).

References

  • Aliakbarian et al. (2020) Mohammad Sadegh Aliakbarian, Fatemeh Sadat Saleh, Mathieu Salzmann, Lars Petersson, and Stephen Gould. 2020. A Stochastic Conditioning Scheme for Diverse Human Motion Prediction. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020 (Aug 2020), 5222–5231.
  • An-An et al. (2021) Liu An-An, Tian Hongshuo, Xu Ning, Nie Weizhi, Zhang Yongdong, and Kankanhalli Mohan. 2021. Toward Region-Aware Attention Learning for Scene Graph Generation. IEEE Transactions on Neural Networks and Learning Systems (2021).
  • Cai et al. (2020) Yujun Cai, Lin Huang, Yiwei Wang, Tat-Jen Cham, Jianfei Cai, Junsong Yuan, Jun Liu, Xu Yang, Yiheng Zhu, Xiaohui Shen, Ding Liu, Jing Liu, and Nadia Magnenat-Thalmann. 2020. Learning Progressive Joint Propagation for Human Motion Prediction. Computer Vision - ECCV 2020 (August 2020), 226–242.
  • Cao et al. (2020) Zhe Cao, Hang Gao, Karttikeya Mangalam, Qi-Zhi Cai, Minh Vo, and Jitendra Malik. 2020. Long-Term Human Motion Prediction with Scene Context. Computer Vision - ECCV 2020 - 16th European Conference 12346 (Nov 2020), 387–404.
  • Chen et al. (2020) Wenheng Chen, He Wang, Yi Yuan, Tianjia Shao, and Kun Zhou. 2020. Dynamic Future Net: Diversified Human Motion Generation. MM ’20: The 28th ACM International Conference on Multimedia (Oct 2020), 2131–2139.
  • Craig (2009) John J. Craig. 2009. Introduction to robotics - mechanics and control. Addison-Wesley (May 2009).
  • Cui et al. (2020) Qiongjie Cui, Huaijiang Sun, and Fei Yang. 2020. Learning Dynamic Relationships for 3D Human Motion Prediction. IEEE Conference on Computer Vision and Pattern Recognition (June 2020), 6518–6526.
  • Fragkiadaki et al. (2015) Katerina Fragkiadaki, Sergey Levine, Panna Felsen, and Jitendra Malik. 2015. Recurrent Network Models for Human Dynamics. IEEE International Conference on Computer Vision (December 2015), 4346–4354.
  • Fuli et al. (2019) Feng Fuli, He Xiangnan, Tang Jie, and Chua Tat-Seng. 2019. Graph adversarial training: Dynamically regularizing based on graph structure. IEEE Transactions on Knowledge and Data Engineering (2019).
  • Gao et al. (2019) Zan Gao, Hai-Zhen Xuan, Hua Zhang, Shaohua Wan, and Kim-Kwang Raymond Choo. 2019. Adaptive Fusion and Category-Level Dictionary Learning Model for Multiview Human Action Recognition. IEEE Internet Things Journal 6 (2019), 9280–9293.
  • Gui et al. (2018) Liang-Yan Gui, Yu-Xiong Wang, Xiaodan Liang, and José M. F. Moura. 2018. Adversarial Geometry-Aware Human Motion Prediction. Computer Vision - ECCV 2018 - 15th European Conference 11208 (May 2018), 823–842.
  • Guo and Choi (2019) Xiao Guo and Jongmoo Choi. 2019. Human Motion Prediction via Learning Local Structure Representations and Temporal Dependencies. The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019 (Feb 2019), 2580–2587.
  • Jianfeng et al. (2021) Dong Jianfeng, Li Xirong, Xu Chaoxi, Yang Xun, Yang Gang, Wang Xun, and Meng Wang. 2021. Dual encoding for video retrieval by text. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
  • Kundu et al. (2019) Jogendra Nath Kundu, Maharshi Gor, and R. Venkatesh Babu. 2019. BiHMP-GAN: Bidirectional 3D Human Motion Prediction GAN. The Thirty-First Innovative Applications of Artificial Intelligence Conference (2019), 8553–8560.
  • Lehrmann et al. (2014) Andreas M. Lehrmann, Peter V. Gehler, and Sebastian Nowozin. 2014. Efficient Nonlinear Markov Models for Human Motion. IEEE Conference on Computer Vision and Pattern Recognition (June 2014), 1314–1321.
  • Li et al. (2018) Chen Li, Zhen Zhang, Wee Sun Lee, and Gim Hee Lee. 2018. Convolutional Sequence to Sequence Model for Human Dynamics. IEEE Conference on Computer Vision and Pattern Recognition (June 2018), 5226–5234.
  • Li et al. (2020) Maosen Li, Siheng Chen, Yangheng Zhao, Ya Zhang, Yanfeng Wang, and Qi Tian. 2020. Dynamic Multiscale Graph Neural Networks for 3D Skeleton Based Human Motion Prediction. CVPR (2020), 211–220.
  • Li et al. (2019) Xinzhe Li, Qianru Sun, Yaoyao Liu, Qin Zhou, Shibao Zheng, Tat-Seng Chua, and Bernt Schiele. 2019. Learning to Self-Train for Semi-Supervised Few-Shot Classification. In NeurIPS, Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett (Eds.). 10276–10286.
  • Liu et al. (2017) Anan Liu, Yuting Su, Weizhi Nie, and Mohan S. Kankanhalli. 2017. Hierarchical Clustering Multi-Task Learning for Joint Human Action Grouping and Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1 (2017), 102–114. https://doi.org/10.1109/TPAMI.2016.2537337
  • Liu et al. (2021d) An-An Liu, Heyu Zhou, Weizhi Nie, Zhenguang Liu, Wu Liu, Hongtao Xie, Zhendong Mao, Xuanya Li, and Dan Song. 2021d. Hierarchical multi-view context modelling for 3D object classification and retrieval. Information Sciences 547 (Dec 2021), 984–995.
  • Liu and Yin (2020) Xiaoli Liu and Jianqin Yin. 2020. SDMTL: Semi-Decoupled Multi-grained Trajectory Learning for 3D human motion prediction. CoRR abs/2010.05133 (October 2020).
  • Liu et al. (2021c) Xiaoli Liu, Jianqin Yin, Jin Li, Pengxiang Ding, Jun Liu, and Huaping Liu. 2021c. TrajectoryCNN: A New Spatio-Temporal Feature Learning Network for Human Motion Prediction. IEEE Trans. Circuits Syst. Video Technol. 31, 6 (Jun 2021), 2133–2146.
  • Liu et al. (2021a) Zhenguang Liu, Kedi Lyu, Shuang Wu, Haipeng Chen, Yanbin Hao, and Shouling Ji. 2021a. Aggregated Multi-GANs for Controlled 3D Human Motion Prediction. In AAAI 2021. 2225–2232.
  • Liu et al. (2021b) Zhenguang Liu, Pengxiang Su, Shuang Wu, Xuanjing Shen, Haipeng Chen, Yanbin Hao, and Meng Wang. 2021b. Motion Prediction using Trajectory Cues. ICCV (2021). Accepted.
  • Liu et al. (2019) Zhenguang Liu, Shuang Wu, Shuyuan Jin, Qi Liu, Shijian Lu, Roger Zimmermann, and Li Cheng. 2019. Towards natural and accurate future motion prediction of humans and animals. CVPR (2019), 10004–10012.
  • Mao et al. (2020) Wei Mao, Miaomiao Liu, and Mathieu Salzmann. 2020. History Repeats Itself: Human Motion Prediction via Motion Attention. Computer Vision - ECCV 2020 - 16th European Conference (August 2020), 474–489.
  • Mao et al. (2019) Wei Mao, Miaomiao Liu, Mathieu Salzmann, and Hongdong Li. 2019. Learning Trajectory Dependencies for Human Motion Prediction. IEEE International Conference on Computer Vision (October 2019), 9488–9496.
  • Martinez et al. (2017) Julieta Martinez, Michael J. Black, and Javier Romero. 2017. On Human Motion Prediction Using Recurrent Neural Networks. IEEE Conference on Computer Vision and Pattern Recognition (July 2017), 4674–4683.
  • Wang et al. (2005) Jack M. Wang, David J. Fleet, and Aaron Hertzmann. 2005. Gaussian Process Dynamical Models. Advances in Neural Information Processing Systems (December 2005), 1441–1448.
  • Xun et al. (2018) Yang Xun, Zhou Peicheng, and Wang Meng. 2018. Person reidentification via structural deep metric learning. IEEE Transactions on NNLS 30, 10 (2018), 2987–2998.
  • Yinwei et al. (2019) Wei Yinwei, Wang Xiang, Nie Liqiang, He Xiangnan, Hong Richang, and Chua Tat-Seng. 2019. MMGCN: Multi-modal graph convolution network for personalized recommendation of micro-video. In Proceedings of the 27th ACM International Conference on Multimedia. 1437–1445.
  • Yuan and Kitani (2020) Ye Yuan and Kris Kitani. 2020. DLow: Diversifying Latent Flows for Diverse Human Motion Prediction. Computer Vision - ECCV 2020 - 16th European Conference 12354 (Nov 2020), 346–364.
  • Zang et al. (2020) Chuanqi Zang, Mingtao Pei, and Yu Kong. 2020. Few-shot Human Motion Prediction via Learning Novel Motion Dynamics. International Joint Conference on Artificial Intelligence (2020), 846–852.
  • Zhu et al. (2020) Lei Zhu, Xu Lu, Zhiyong Cheng, Jingjing Li, and Huaxiang Zhang. 2020. Deep Collaborative Multi-View Hashing for Large-Scale Image Search. IEEE Transactions on Image Processing 29 (2020), 4643–4655.
  • Zhuo et al. (2020) Tao Zhuo, Zhiyong Cheng, Peng Zhang, Yongkang Wong, and Mohan S. Kankanhalli. 2020. Unsupervised Online Video Object Segmentation With Motion Property Understanding. IEEE Trans. Image Process. 29 (May 2020), 237–249.