Structured Prediction Helps 3D Human Motion Modelling

10/20/2019
by   Emre Aksan, et al.

Human motion prediction is a challenging and important task in many computer vision application domains. Existing work only implicitly models the spatial structure of the human skeleton. In this paper, we propose a novel approach that decomposes the prediction into individual joints by means of a structured prediction layer that explicitly models the joint dependencies. This is implemented via a hierarchy of small-sized neural networks connected analogously to the kinematic chains in the human body as well as a joint-wise decomposition in the loss function. The proposed layer is agnostic to the underlying network and can be used with existing architectures for motion modelling. Prior work typically leverages the H3.6M dataset. We show that some state-of-the-art techniques do not perform well when trained and tested on AMASS, a recently released dataset 14 times the size of H3.6M. Our experiments indicate that the proposed layer increases the performance of motion forecasting irrespective of the base network, joint-angle representation, and prediction horizon. We furthermore show that the layer also improves motion predictions qualitatively. We make code and models publicly available at https://ait.ethz.ch/projects/2019/spl.

1 Introduction

Modelling of human motion over time has a number of applications in activity recognition, human-computer interaction, human detection and tracking, and image-based pose estimation in the context of robotics or self-driving vehicles. Humans have the ability to forecast the sequence of poses over short-term horizons with high accuracy and can imagine probable motion over arbitrary time scales. Despite recent progress in data-driven modelling of human motion [7, 8, 14, 20, 25, 33], this task remains difficult for machines.

The difficulty of the task is manifold. First, human motion is highly dynamic, non-linear and over time becomes a stochastic sequential process with a high degree of inherent uncertainty. Humans leverage strong structural and temporal priors about continuity and regularity in natural motion. However, these are hard to model algorithmically due to i) the inter-dependencies between joints and ii) the influence of high-level activities on the motion sequences (e.g., transition from walking to jumping). In fact many recent approaches forgo explicit modelling of human motion [14] in favor of pure data-driven models [8, 20, 25].

Initial Deep Learning-based motion modelling approaches have focused on recurrent neural networks (RNNs) [8, 7, 14], using curriculum learning schemes to increase robustness to temporal drift. Martinez et al. [20] have shown that a simple running-average provides a surprisingly difficult-to-beat baseline in terms of Euler angle error. Following this, sequence-to-sequence models trained in an auto-regressive fashion have been proposed [20], sometimes using adversarial training to address the drift problem in long-term predictions [33]. Pavllo et al. [25] study the impact of joint angle representation and show that a quaternion-based parameterization improves short-term predictions.

However, it has been observed that quantitative performance does not always translate to qualitatively meaningful predictions [20, 25]. Furthermore, the H3.6M benchmark is becoming saturated, limiting progress. This leads to the two main research questions studied in this work: i) How to measure accuracy of pose predictions in a meaningful way such that low errors correspond to good qualitative results, and how to improve this performance? ii) How to exploit the spatial structure of the human skeleton for better predictions?

With respect to i) we note that much of the literature relies on the H3.6M [12] dataset and an Euler angle based metric as performance measure, evaluated on a limited number of test sequences. While enabling initial exploration of the task, the dataset is limited in size (roughly 3 hours from 210 sequences) and in diversity of activities and poses, which contributes to a saturation effect in terms of performance. In this paper we show that existing techniques do not scale well when trained on larger and more diverse datasets. To this end, we leverage the recently released AMASS dataset [19], itself consisting of multiple smaller motion datasets, offering many more samples (14x over H3.6M) and a wider range of activities. To further unpack the performance of motion modelling techniques, we introduce several evaluation metrics to the task of human motion prediction.

Our main technical contribution is a novel structured prediction layer (SPL) that addresses our second research question. We leverage the compositional structure of the human skeleton by explicitly decomposing the pose into individual joints. The SP-layer models the structure of the human skeleton and hence the spatial dependencies between joints. This is achieved via a hierarchy of small-sized neural networks that are connected analogously to the kinematic chains of the human skeleton. Each node in the graph receives information about the parent node’s prediction and thus information is propagated along the kinematic chain. We furthermore introduce a joint-wise decomposition of the loss function as part of SPL. The proposed layer is agnostic to the underlying network and can be used in combination with most previously proposed architectures.

We show experimentally that introducing this layer to existing approaches improves performance of the respective method. The impact is most pronounced on the larger and more challenging AMASS dataset. This indicates that our approach is indeed a step towards successfully exploiting spatial priors in human motion modelling and in turn allows recurrent models to capture temporal coherency more effectively. We thoroughly evaluate the SP-layer on H3.6M and AMASS. On AMASS, for any base model, any metric, and any input representation, it is beneficial to use the SP-layer. Furthermore, even simple architectures that are outperformed by a zero-velocity baseline [20] perform competitively if paired with the SP-layer.

In summary, we contribute: i) An in-depth analysis of state-of-the-art motion modelling methods and their evaluation. ii) A new benchmark and evaluation protocol on the recent, much larger AMASS dataset. iii) A novel prediction layer, incorporating structural priors. iv) A thorough evaluation of the SP-layer’s impact on motion modelling in combination with several base models.

2 Related Work

We briefly review the most related literature on human motion modelling focusing on Deep Learning for brevity.

Deep recurrent models

Early work makes use of specialized Deep Belief Networks for motion modelling [30], whereas more recent works leverage recurrent architectures. For example, Fragkiadaki et al. [7] propose the Encoder-Recurrent-Decoder (ERD) framework, which maps pose data into a latent space where it is propagated through time via an LSTM cell. The prediction at time step $t$ is fed back as the input for time step $t+1$. This scheme quickly leads to error accumulation and hence catastrophic drift over time. To increase robustness, Gaussian noise is added during training. While alleviating the drift problem, this training scheme is hard to fine-tune. Quantitative and qualitative evaluations are performed on the publicly available H3.6M dataset [12], with a joint angle data representation using the exponential map (also called angle-axis). The joint-wise Euclidean distance on the Euler angles is used as the evaluation metric. Most of the follow-up work adheres to this setting.

Inspired by [7], Du et al. [6] have recently proposed to combine a three-layer LSTM with bio-mechanical constraints encoded into the loss function for pedestrian pose and gait prediction. Like [6], we also incorporate prior knowledge into our network design, but do so through a particular design of the output layer rather than enforcing physical constraints in the loss function. Similar in spirit to [7], Ghosh et al. [8] stabilize forecasting for long-term prediction horizons via application of dropout on the input layer of a denoising autoencoder. In this work we focus on short-term predictions, but also apply dropout directly on the inputs to account for noisy predictions of the model at test time. Contrary to [8], our model can be trained end-to-end.

Martinez et al. [20] employ a sequence-to-sequence architecture using a single layer of GRU cells [4]. The model is trained auto-regressively, using its own predictions during training. A residual connection on the decoder leads to smoother and improved short-term predictions. Martinez et al. also show that simple running-average baselines are surprisingly difficult to beat in terms of the Euler angle metric. The currently best performance on H3.6M is reported by Wang et al. [33]. They also use a sequence-to-sequence approach trained with an adversarial loss to address the drift problem and to create smooth predictions. Highlighting some of the issues with the previously used loss, [33] propose a more meaningful geodesic loss.

In this work we show that sequence-to-sequence models, despite good performance on H3.6M, do not fare as well on the larger, more diverse AMASS dataset. Although augmenting them with our SP-layer boosts their performance, they are outperformed by a simple RNN that uses the same SP-layer. To better characterize motion modelling performance we furthermore introduce several new evaluation metrics.

Structured Prediction

Jain et al. [14] propose to explicitly model structural information by automatically converting an st-graph into an RNN (S-RNN). The skeleton is divided into 5 major clusters, whose interactions are then manually encoded into an st-graph. Our model is also structure-aware. However, our approach does not require a coarse subdivision of joints and does not require manual definition of st-graphs. Moreover, our layer is agnostic to the underlying network and can be interfaced with most existing architectures.

Bütepage et al. [2] propose to encode poses with a hierarchy of dense layers following the kinematic chain starting from the end-effectors (dubbed H-TE), which is similar to our SP-layer. In contrast to this work, H-TE operates on the input rather than the output, and has only been demonstrated with non-recurrent networks when using 3D positions to parameterize the poses.

Structure-aware network architectures have also been used in 3D pose estimation from images [16, 29, 21, 17, 31]. [17] and [31] both learn a structured latent space. [21] exploit structure only implicitly by encoding the poses into distance matrices which then serve as inputs and outputs of the network. [16] and [29] are closest to our work as they explicitly modify the network to account for skeletal structure, either via the loss function [29], or using a sequence of LSTM cells for each joint in the skeleton [16]. [16] introduces many new layers into the architecture and needs hyper-parameter tuning to be most effective. In contrast, our proposed SP-layer is simple to implement and train. We show that it improves performance of several baseline architectures out-of-the-box.

Parameterizations

Most work parameterizes joint angles as exponential maps relative to each joint’s parent. Pavllo et al. [25] show results competitive with the state of the art using quaternions. Their model, QuaterNet, consists of 2 layers of GRU cells and, similar to [20], uses a skip connection. The use of quaternions allows for integration of a differentiable forward kinematics layer, facilitating loss computation in the form of Euclidean distance of 3D joint positions. For short-term predictions, QuaterNet directly optimizes for the Euler-angle based metric as introduced by [7]. We show that QuaterNet also benefits from augmentation with our SP-layer, indicating that SPL is independent of the underlying joint angle representation.

Bütepage et al. [2, 3] and Holden et al. [9] convert the data directly to 3D joint positions. These works do not use recurrent structures, which necessitates the extraction of fixed-size, temporal windows for training. [2] and [9] focus on learning of latent representations, which are shown to be helpful for various tasks, such as denoising, forecasting, or motion generation along a given trajectory [10]. [3] extends [2] by applying a conditional variational autoencoder (VAE) to the task of online motion prediction in human-robot interactions. We use the positional representation of human poses to compute an informative metric of the prediction quality. However, for learning we use joint angles since they encode symmetries better and are inherently bone-length invariant.

3 Method

The goal of our work is to provide a general solution to the problem of human motion modelling. To this end we are motivated by the observation that human motion is strongly regulated by the spatial structure of the skeleton. However, integrating this structure into deep neural network architectures has so far not yielded better performance than architectures that only model temporal dependencies explicitly. In this section we outline a novel structured prediction layer (SPL) that explicitly captures the spatial connectivity. The layer is designed to be agnostic to the underlying network. We empirically show in Sec. 5 and 6 that it improves the performance of a variety of existing models irrespective of the dataset or the data representation used.

3.1 Problem Formulation

A motion sample can be considered as a sequence $X = \{x_1, \dots, x_T\}$ where a frame $x_t \in \mathbb{R}^N$ at time-step $t$ denotes the $N$-dimensional body pose. $N$ depends on the number of joints in the skeleton, $K$, and the size of the per-joint representation $M$ (angle-axis, rotation matrices, quaternions, or 3D positions), i.e. $N = K \cdot M$.

Due to their temporal nature, motion sequences are often modelled with auto-regressive approaches. Such models factorize the joint probability of a motion sequence as a product of conditionals as follows:

$$p_\theta(X) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{1:t-1}) \qquad (1)$$

where the joint distribution is parameterized by $\theta$. At each time step $t$, the next pose is predicted given the past poses.

While this auto-regressive setting explicitly models the temporal dependencies, the spatial structure is treated only implicitly. In other words, given a pose vector $x_t$, the model must predict the whole pose vector $x_{t+1}$ at the next time step. This assumes that joints are independent from each other given a particular context (i.e., a neural representation of the past frames). However, the human body is composed of hierarchical joints and the kinematic chain introduces spatial dependencies between them.
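To make the auto-regressive setting concrete, the following Python sketch rolls a one-step predictor forward by feeding each prediction back as the next input. The `rollout` helper and the constant-velocity `constant_velocity` step function are illustrative assumptions standing in for a learned model; they are not the architectures studied in this paper.

```python
import numpy as np


def rollout(step_fn, seed, num_future):
    """Predict num_future frames, feeding each prediction back as input."""
    history = [np.asarray(frame, dtype=float) for frame in seed]
    for _ in range(num_future):
        # The next pose is predicted from all poses seen (or predicted) so far.
        history.append(step_fn(np.stack(history)))
    return np.stack(history[len(seed):])


def constant_velocity(past):
    """Hypothetical stand-in for a learned model: extrapolate the last displacement."""
    return past[-1] + (past[-1] - past[-2])


seed = np.array([[0.0, 0.0], [1.0, 0.5]])   # two observed toy 2-D "poses"
future = rollout(constant_velocity, seed, num_future=3)
print(future)
```

Because every predicted frame becomes part of the input for the next step, any error made early in the rollout is carried forward, which is the drift problem discussed in the related work.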

3.2 Structured Prediction Layer

To address this shortcoming, we propose a novel structured prediction layer (SPL). This is formed by decomposing the model prediction into individual joints. This decomposition is guided by the spatial prior of the human kinematic chain, depicted in Fig. 1. Formally, $x_t \in \mathbb{R}^N$ is a concatenation of $K$ joints $x_t^{(k)} \in \mathbb{R}^M$:

$$x_t = [x_t^{(1)}, x_t^{(2)}, \dots, x_t^{(K)}]$$

Figure 1: SPL overview. Given the context of past frames, joint predictions are made hierarchically by following the kinematic chain defined by the underlying skeleton. Only a subset of joints is visualized for clarity.

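To interface with existing architectures, the SP-layer takes a context representation $h_t$ as input. Here, $h_t$ is assumed to summarize the motion sequence until time $t$. Without loss of generality, we assume this to be a hidden RNN state or its projection. While existing work typically leverages several dense layers to predict the $N$-dimensional pose vector $x_t$ from $h_t$, our SP-layer predicts each joint individually with separate smaller networks:

$$p_\theta(x_t) = \prod_{k=1}^{K} p_\theta\big(x_t^{(k)} \mid \mathrm{parent}(x_t^{(k)}), h_t\big) \qquad (2)$$

where $\mathrm{parent}(x_t^{(k)})$ extracts the parent of the $k$-th joint. Importantly, the full body pose $x_t$ is predicted by following the skeletal hierarchy in Fig. 1 as follows:

$$p_\theta(x_t) = p_\theta\big(x_t^{(\mathrm{hip})} \mid h_t\big)\, p_\theta\big(x_t^{(\mathrm{spine})} \mid x_t^{(\mathrm{hip})}, h_t\big) \cdots \qquad (3)$$

In this formulation each joint receives information about its own configuration and that of the immediate parent both explicitly, through the conditioning on the parent joint’s prediction, and implicitly via the context $h_t$. The joint probability of Eq. 1 is further factorized in the spatial domain:

$$p_\theta(X) = \prod_{t=1}^{T} \prod_{k=1}^{K} p_\theta\big(x_t^{(k)} \mid \mathrm{parent}(x_t^{(k)}), h_t\big) \qquad (4)$$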

The benefit of this structured prediction approach is two-fold. First, the proposed factorization allows for integration of a structural prior in the form of a hierarchical architecture where each joint is modelled by a different network. This allows the model to learn dedicated representations per joint and hence saves model capacity. Second, analogous to message passing, each parent propagates its prediction to the child joints, allowing for more precise local predictions because the joint has access to the information it depends on (i.e., the parent’s prediction).
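A minimal NumPy sketch of this idea is given below. It is not the released implementation: the 5-joint skeleton in `PARENTS`, the layer sizes, and the random initialization are assumptions chosen only to keep the example small. Each joint has its own small network that receives the context and, for non-root joints, the prediction of its immediate parent.

```python
import numpy as np

# Hypothetical 5-joint skeleton: parent index per joint, -1 marks the root.
# Parents are listed before their children, so one forward sweep suffices.
PARENTS = [-1, 0, 1, 1, 0]
JOINT_DIM = 9      # per-joint size M (e.g. a flattened 3x3 rotation matrix)
CONTEXT_DIM = 16   # size of the context vector (assumed)
HIDDEN = 8         # hidden units of each per-joint network (assumed)


def init_spl(rng):
    """Create one small two-layer network per joint."""
    params = []
    for parent in PARENTS:
        in_dim = CONTEXT_DIM + (JOINT_DIM if parent >= 0 else 0)
        params.append({
            "W1": rng.normal(0.0, 0.1, (in_dim, HIDDEN)),
            "b1": np.zeros(HIDDEN),
            "W2": rng.normal(0.0, 0.1, (HIDDEN, JOINT_DIM)),
            "b2": np.zeros(JOINT_DIM),
        })
    return params


def spl_forward(params, context):
    """Predict each joint from the context and its parent's prediction."""
    joints = []
    for k, parent in enumerate(PARENTS):
        if parent < 0:
            inp = context
        else:
            inp = np.concatenate([context, joints[parent]])
        p = params[k]
        hidden = np.maximum(inp @ p["W1"] + p["b1"], 0.0)   # ReLU
        joints.append(hidden @ p["W2"] + p["b2"])           # linear projection
    return np.concatenate(joints)                           # full pose, K * M values


rng = np.random.default_rng(0)
params = init_spl(rng)
pose = spl_forward(params, rng.normal(size=CONTEXT_DIM))
print(pose.shape)
```

In this sketch the parent's prediction reaches the child as a plain input; any base network that produces a context vector of size `CONTEXT_DIM` could sit in front of `spl_forward`.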

In our experiments (cf. Sec. 5 and 6) we show that this layer improves the prediction performance of a diverse set of underlying architectures across many settings and metrics. One potential reason for why this is the case can be found in the resulting network structure and its implications on network training. Fig. 2 compares our structured approach with the traditional one-shot prediction using a dense layer. Because the per-joint decomposition leads to many small separate networks, we can think of an SP-layer as a dense layer where some connections have been set to zero explicitly by leveraging domain knowledge. This decomposition changes the gradients w.r.t. the units in the hidden layer, which are now only affected by the gradients coming from the joint hierarchy that they model. In the traditional setting, the error computed as an average over all joints can easily be distributed over all network weights in an arbitrary fashion.

Figure 2: Difference between dense and SP-layer with 2 joints. When all dashed weights are zero, a dense hidden layer is equivalent to an SP-layer that ignores the hierarchy. In a dense layer, a hidden unit is connected to all joints. Hence, its gradient is affected by both joints, whereas in SPL only the joint it models contributes by design.
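The equivalence between an SP-layer that ignores the hierarchy and a dense layer with some weights fixed to zero can be checked numerically. The sketch below is an illustration under assumed sizes (`K`, `H`, `M` are arbitrary) and only covers the output projection: masking a dense weight matrix block-diagonally gives the same result as running one small weight matrix per joint.

```python
import numpy as np

K, H, M = 2, 3, 4                      # joints, hidden units per joint, per-joint output size
rng = np.random.default_rng(0)
hidden = rng.normal(size=K * H)        # activations of the hidden layer

# Dense output layer: every hidden unit feeds every joint.
W_dense = rng.normal(size=(K * M, K * H))

# Block-diagonal mask: hidden block k may only feed joint k.
mask = np.kron(np.eye(K), np.ones((M, H)))
W_masked = W_dense * mask

# Equivalent per-joint computation with K separate small weight matrices.
per_joint = np.concatenate([
    W_dense[k * M:(k + 1) * M, k * H:(k + 1) * H] @ hidden[k * H:(k + 1) * H]
    for k in range(K)
])

print(np.allclose(W_masked @ hidden, per_joint))
```

Since the off-diagonal blocks of `W_masked` are zero, the output of joint 0 does not depend on the hidden units of joint 1, so no gradient flows between them.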

3.3 Per-joint Loss

We additionally propose to perform a similar decomposition in the objective function that leads to further improvements. The training objective is often a metric in Euclidean space between ground-truth poses $x_t$ and predictions $\hat{x}_t$:

$$L(X, \hat{X}) = \frac{1}{T \cdot N} \sum_{t=1}^{T} f(x_t, \hat{x}_t) \qquad (5)$$

where $f$ is a loss function such as an $\ell_p$ norm. The loss $f$ is calculated on the entire pose vector and averaged across the temporal and spatial domain. In our work, we use a slightly modified version that preserves joint integrity:

$$L(X, \hat{X}) = \frac{1}{T} \sum_{t=1}^{T} \sum_{k=1}^{K} f\big(x_t^{(k)}, \hat{x}_t^{(k)}\big) \qquad (6)$$

where the loss $f$ is first calculated on every joint and then summed up to calculate the loss for the entire motion sequence. In this work we use the MSE for $f$, but the formulation allows for an easy adaptation of domain-specific losses such as the geodesic distance proposed by [33].
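The difference between the two objectives can be illustrated with a small NumPy sketch. The function names and the two example choices for the per-joint loss (`squared_error` and `euclidean`) are assumptions made for illustration; the exact form used in the released code may differ.

```python
import numpy as np


def pose_loss(target, pred):
    """Squared error averaged over all T * N entries of the pose vectors."""
    return np.mean((target - pred) ** 2)


def per_joint_loss(target, pred, num_joints, f):
    """Apply f to each joint separately, sum over joints, average over time."""
    T = target.shape[0]
    t = target.reshape(T, num_joints, -1)    # (T, K, M)
    p = pred.reshape(T, num_joints, -1)
    return np.mean(np.sum(f(t, p), axis=1))


def squared_error(t, p):
    return np.sum((t - p) ** 2, axis=-1)


def euclidean(t, p):
    return np.linalg.norm(t - p, axis=-1)


T, K, M = 4, 3, 3
target = np.zeros((T, K * M))
pred = np.zeros((T, K * M))
pred[:, :M] = [3.0, 4.0, 0.0]    # only the first joint is wrong, by a (3, 4, 0) offset

print(pose_loss(target, pred))
print(per_joint_loss(target, pred, K, squared_error))   # 25.0
print(per_joint_loss(target, pred, K, euclidean))       # 5.0
```

The per-joint variant keeps the values that belong to one joint together before reducing, so a domain-specific per-joint distance can be swapped in for `f` without touching the rest of the computation.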

4 Human Motion Modelling

We now evaluate our SP-layer on the task of human motion modelling. We perform our experiments on two datasets and three different underlying architectures which use three different data representations. In the following we explain the datasets and models in more detail.

4.1 Datasets

For ease of comparison to the state of the art we first report results from the H3.6M dataset. We follow the same experiment protocol used in [14, 20].

Given the small size of H3.6M and the reported variance of test results [25], we propose to use the recently introduced AMASS dataset [19] for the motion modelling task. We downloaded the dataset from [11] as the data from [19] has not yet been released at the time of this writing. AMASS is composed of publicly available databases, such as the CMU Mocap database [5] or HumanEva [26], and uses the SMPL model [18] to represent motion sequences. The dataset contains sequences, which comprise a total of frames sampled at 60 Hz. This is roughly equivalent to hours of recording, making AMASS about 14 times bigger than H3.6M ( frames at 50 Hz).

We split the AMASS dataset into training, validation and test splits consisting of roughly , and of the samples, respectively. Similar to the H3.6M protocol, the input sequences are 2 seconds (120 frames) and the target sequences are 400 ms (24 frames) long. The H3.6M benchmarks use a total of 120 test samples across 15 categories. This is a relatively small test set and it has been reported to cause high variance [24]. In our H3.6M experiments we use this setting to ensure fair comparison. However, on AMASS we use every frame in the test split by shifting a 2-second window over the motion sequences, which extracts test samples. H3.6M and AMASS model the human skeleton with and major joints, respectively. We implement separate SP-layers corresponding to the underlying skeleton.
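The window-shifting procedure can be sketched in a few lines of Python. The helper `sliding_windows`, the stride of one frame, and the convention that each window is followed directly by its target frames are assumptions made for illustration; the toy sequence uses integers in place of pose frames.

```python
def sliding_windows(sequence, seed_len, target_len, stride=1):
    """Yield (seed, target) pairs obtained by shifting a window over one sequence."""
    window = seed_len + target_len
    for start in range(0, len(sequence) - window + 1, stride):
        seed = sequence[start:start + seed_len]
        target = sequence[start + seed_len:start + window]
        yield seed, target


frames = list(range(10))     # stand-in for 10 pose frames
pairs = list(sliding_windows(frames, seed_len=4, target_len=2))
print(len(pairs))            # 5
print(pairs[0])              # ([0, 1, 2, 3], [4, 5])
```

Compared with a fixed handful of hand-picked test samples, this kind of extraction uses every frame of the test split at least once, which is what makes the resulting averages less sensitive to individual sequences.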

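| Model / horizon (ms) | Walking 80 | Walking 160 | Walking 320 | Walking 400 | Eating 80 | Eating 160 | Eating 320 | Eating 400 | Smoking 80 | Smoking 160 | Smoking 320 | Smoking 400 | Discussion 80 | Discussion 160 | Discussion 320 | Discussion 400 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LSTM-3LR [7] | 0.77 | 1.00 | 1.29 | 1.47 | 0.89 | 1.09 | 1.35 | 1.46 | 1.34 | 1.65 | 2.04 | 2.16 | 1.88 | 2.12 | 2.25 | 2.23 |
| SRNN [14] | 0.81 | 0.94 | 1.16 | 1.30 | 0.97 | 1.14 | 1.35 | 1.46 | 1.45 | 1.68 | 1.94 | 2.08 | 1.22 | 1.49 | 1.83 | 1.93 |
| Zero-Velocity [20] | 0.39 | 0.68 | 0.99 | 1.15 | 0.27 | 0.48 | 0.73 | 0.86 | 0.26 | 0.48 | 0.97 | 0.95 | 0.31 | 0.67 | 0.94 | 1.04 |
| AGED [33] | 0.22 | 0.36 | 0.55 | 0.67 | 0.17 | 0.28 | 0.51 | 0.64 | 0.27 | 0.43 | 0.82 | 0.84 | 0.27 | 0.56 | 0.76 | 0.83 |
| Seq2seq-sampling-sup [20] | 0.28 | 0.49 | 0.72 | 0.81 | 0.23 | 0.39 | 0.62 | 0.76 | 0.33 | 0.61 | 1.05 | 1.15 | 0.31 | 0.68 | 1.01 | 1.09 |
| Seq2seq-sampling-sup-SPL | 0.23 | 0.37 | 0.53 | 0.61 | 0.20 | 0.32 | 0.52 | 0.67 | 0.26 | 0.48 | 0.92 | 0.90 | 0.29 | 0.63 | 0.90 | 0.99 |
| Seq2seq-sampling [20] | 0.27 | 0.47 | 0.70 | 0.78 | 0.25 | 0.43 | 0.71 | 0.87 | 0.33 | 0.61 | 1.04 | 1.19 | 0.31 | 0.69 | 1.03 | 1.12 |
| Seq2seq-sampling-SPL | 0.23 | 0.38 | 0.58 | 0.67 | 0.20 | 0.32 | 0.52 | 0.66 | 0.26 | 0.48 | 0.92 | 0.90 | 0.30 | 0.64 | 0.91 | 0.99 |
| QuaterNet [25] | 0.21 | 0.34 | 0.56 | 0.62 | 0.20 | 0.35 | 0.58 | 0.70 | 0.25 | 0.47 | 0.93 | 0.90 | 0.26 | 0.60 | 0.85 | 0.93 |
| QuaterNet-SPL | 0.22 | 0.35 | 0.54 | 0.61 | 0.20 | 0.33 | 0.55 | 0.68 | 0.25 | 0.47 | 0.91 | 0.88 | 0.26 | 0.59 | 0.84 | 0.91 |
| RNN | 0.30 | 0.48 | 0.78 | 0.89 | 0.23 | 0.36 | 0.57 | 0.72 | 0.26 | 0.49 | 0.97 | 0.95 | 0.31 | 0.67 | 0.95 | 1.03 |
| RNN-SPL | 0.26 | 0.40 | 0.67 | 0.78 | 0.21 | 0.34 | 0.55 | 0.69 | 0.26 | 0.48 | 0.96 | 0.94 | 0.30 | 0.66 | 0.95 | 1.05 |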
Table 1: H3.6M results for the commonly used walking, eating, smoking, and discussion activities across different prediction horizons. Values correspond to the Euler angle metric measured at the given time. “Seq2seq-sampling” and “Seq2seq-sampling-sup” models correspond to “Residual unsup. (MA)” and “Residual sup. (MA)” models in [20], respectively. Note the relative performance improvement for each base model when augmented with our SP-layer.

4.2 Models

The modular nature of our SP-layer allows for flexible deployment with a diverse set of base models. In our experiments, we test the layer with the following three representative architectures proposed in the literature. To ease experimentation with SPL and other base architectures, we make all code and pre-trained models available at https://ait.ethz.ch/projects/2019/spl.

Seq2seq is a model proposed by Martinez et al. [20], consisting of a single layer of GRU cells. It contains a residual connection between the inputs and predictions. Input poses are represented as exponential maps.

QuaterNet uses a quaternion representation instead [24, 25]. The model augments RNNs with quaternion based normalization and regularization operations. Similarly, the residual connection from inputs to outputs is implemented via the quaternion product. In our experiments, we replace the final linear output layer with our SP-layer and keep the remaining setup intact.

RNN uses a single-layer recurrent network to calculate the context $h_t$, which we feed to our SP-layer. In contrast to the Seq2seq and QuaterNet settings, we represent poses via rotation matrices. To account for the error accumulation problem at test time [7, 8, 14], we apply dropout directly on the inputs. This architecture is similar to the ERD [7] but is additionally augmented with the residual connection of [20].

In the SP-layer, each joint is modelled with only one small hidden layer (64 or 128 units) followed by a ReLU activation and a linear projection to the joint prediction. We experiment with different hierarchical configurations in SPL (cf. Sec. 6.3) where following the true kinematic chain performed best. Some models benefit from inputting all parent joints in the kinematic chain compared to using only the immediate parent. Note that we changed existing Seq2seq and QuaterNet models only as much as required to integrate them with SPL. To ensure a fair comparison we fine-tune hyper-parameters like learning rate, batch size and hidden layer units. See appendix Sec. 8.1 for details.

5 Evaluation on H3.6M Dataset

In our first set of comparisons we baseline the proposed SP-layer on the H3.6M dataset using the Euler angle metric as is common practice in the literature.

5.1 Metrics

Euler angles

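Let $\alpha = \theta \mathbf{a}$ denote a rotation of angle $\theta$ around the unit axis $\mathbf{a}$. $\alpha$ is the angle-axis (or exponential map) representation of a single joint angle. The Euler angles are extracted from $\alpha$ by first converting it into a rotation matrix $R$ using Rodrigues’ formula and then computing the angles $\alpha_{eul} = (\alpha_x, \alpha_y, \alpha_z)$ following [27]. This assumes that $R$ follows the z-y-x order. Furthermore, as noted by [27], there always exist two solutions for $\alpha_{eul}$, from which [14] picks the one that leads to the least amount of rotation. The Euler angle metric for time step $t$ is then

$$L_{eul}(t) = \frac{1}{|X_{test}|} \sum_{x_t \in X_{test}} \sqrt{\sum_{k} \big\lVert \alpha_{eul,t}^{(k)} - \hat{\alpha}_{eul,t}^{(k)} \big\rVert_2^2} \qquad (7)$$

where $\hat{\alpha}_{eul,t}^{(k)}$ are the predicted Euler angles of joint $k$ at time $t$. $X_{test}$ is defined by [14] and comprises 120 sequences.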
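The first step of this pipeline, converting an angle-axis vector to a rotation matrix with Rodrigues' formula, can be sketched as follows. The function name `rodrigues` is an assumption; the subsequent Euler angle extraction is deliberately left out because it depends on the chosen convention.

```python
import numpy as np


def rodrigues(alpha):
    """Convert an angle-axis vector (angle times unit axis) to a 3x3 rotation matrix."""
    alpha = np.asarray(alpha, dtype=float)
    theta = np.linalg.norm(alpha)             # rotation angle
    if theta < 1e-8:
        return np.eye(3)                      # (near) zero rotation
    a = alpha / theta                         # unit rotation axis
    A = np.array([[0.0, -a[2], a[1]],
                  [a[2], 0.0, -a[0]],
                  [-a[1], a[0], 0.0]])        # cross-product matrix of the axis
    return np.eye(3) + np.sin(theta) * A + (1.0 - np.cos(theta)) * (A @ A)


R = rodrigues([0.0, 0.0, np.pi / 2])          # quarter turn about the z-axis
print(np.round(R @ np.array([1.0, 0.0, 0.0]), 6))
```

A quarter turn about the z-axis maps the x-axis onto the y-axis, which gives a quick sanity check for the sign conventions.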

5.2 Results

Tab. 1 summarizes the relative performances of models with and without the SP-layer on the H3.6M dataset and compares them to the state of the art. The publicly available Seq2seq [20] and QuaterNet [25] models are augmented with our SP-layer, but we otherwise follow the original training and evaluation protocols of the respective baseline model.

Using the SP-layer improves the Seq2seq performance significantly and achieves state-of-the-art performance in the walking category. Similarly, SPL yields the best performance with QuaterNet in short-term smoking and discussion motions and marginally outperforms the vanilla QuaterNet in most categories or is comparable to it. While our SP-layer also boosts the performance of the RNN model in walking, eating and smoking motion categories, performance remains similar for discussion.

We follow the same evaluation setting as in previous work for direct comparability. It is worth noting that the evaluation metrics reported on H3.6M exhibit high variance due to the small number of test samples [24] and that low errors do not always correspond to good qualitative results [20].

6 AMASS: A New Benchmark

In this section we evaluate the baseline methods and our SP-layer on the large-scale AMASS dataset, detailed in Sec. 4.1. The diversity and large amount of motion samples in AMASS increase both the task’s complexity and the reliability of results due to a larger test set. In addition to proposing a new evaluation setting for motion modelling we suggest usage of a more versatile set of metrics for the task.

| Model / horizon (ms) | Euler 100 | Euler 200 | Euler 300 | Euler 400 | Joint Angle 100 | Joint Angle 200 | Joint Angle 300 | Joint Angle 400 | Positional 100 | Positional 200 | Positional 300 | Positional 400 | PCK (AUC) 100 | PCK (AUC) 200 | PCK (AUC) 300 | PCK (AUC) 400 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Zero-Velocity [20] | 1.91 | 5.93 | 11.36 | 17.78 | 0.37 | 1.22 | 2.44 | 3.94 | 0.14 | 0.48 | 0.96 | 1.54 | 0.86 | 0.83 | 0.84 | 0.82 |
| Seq2seq [20]* | 1.52 | 5.14 | 10.66 | 17.84 | 0.27 | 0.99 | 2.19 | 3.85 | 0.11 | 0.39 | 0.87 | 1.53 | 0.91 | 0.86 | 0.86 | 0.82 |
| Seq2seq-PJL | 1.46 | 5.28 | 11.46 | 19.78 | 0.24 | 0.95 | 2.16 | 3.87 | 0.09 | 0.35 | 0.80 | 1.41 | 0.91 | 0.87 | 0.87 | 0.83 |
| Seq2seq-SPL | 1.57 | 5.00 | 10.01 | 16.43 | 0.27 | 0.94 | 2.01 | 3.45 | 0.10 | 0.36 | 0.79 | 1.36 | 0.91 | 0.87 | 0.87 | 0.84 |
| Seq2seq-sampling [20]* | 2.01 | 5.99 | 11.22 | 17.33 | 0.37 | 1.17 | 2.27 | 3.59 | 0.14 | 0.45 | 0.88 | 1.39 | 0.86 | 0.84 | 0.85 | 0.83 |
| Seq2seq-sampling-PJL | 1.71 | 5.15 | 9.71 | 15.15 | 0.32 | 1.00 | 1.97 | 3.14 | 0.12 | 0.39 | 0.77 | 1.23 | 0.88 | 0.86 | 0.87 | 0.85 |
| Seq2seq-sampling-SPL | 1.71 | 5.13 | 9.60 | 14.86 | 0.31 | 0.97 | 1.91 | 3.04 | 0.12 | 0.38 | 0.74 | 1.18 | 0.89 | 0.86 | 0.88 | 0.85 |
| Seq2seq-dropout | 1.54 | 4.98 | 9.94 | 16.13 | 0.27 | 0.95 | 2.00 | 3.39 | 0.10 | 0.37 | 0.79 | 1.34 | 0.91 | 0.87 | 0.87 | 0.84 |
| Seq2seq-dropout-PJL | 1.26 | 4.41 | 9.24 | 15.46 | 0.23 | 0.84 | 1.82 | 3.13 | 0.09 | 0.33 | 0.71 | 1.21 | 0.92 | 0.88 | 0.88 | 0.85 |
| Seq2seq-dropout-SPL | 1.26 | 4.26 | 8.67 | 14.23 | 0.23 | 0.81 | 1.74 | 2.96 | 0.09 | 0.32 | 0.68 | 1.16 | 0.92 | 0.89 | 0.89 | 0.86 |
| QuaterNet [25]* | 1.49 | 4.70 | 9.16 | 14.54 | 0.26 | 0.89 | 1.83 | 3.00 | 0.10 | 0.34 | 0.71 | 1.18 | 0.90 | 0.87 | 0.88 | 0.85 |
| QuaterNet-SPL | 1.34 | 4.25 | 8.39 | 13.43 | 0.25 | 0.83 | 1.71 | 2.83 | 0.09 | 0.32 | 0.67 | 1.10 | 0.91 | 0.88 | 0.89 | 0.86 |
| RNN | 1.69 | 5.23 | 10.18 | 16.29 | 0.31 | 1.05 | 2.17 | 3.62 | 0.12 | 0.41 | 0.85 | 1.43 | 0.89 | 0.85 | 0.86 | 0.83 |
| RNN-SPL | 1.33 | 4.13 | 8.03 | 12.84 | 0.22 | 0.73 | 1.51 | 2.51 | 0.08 | 0.28 | 0.57 | 0.96 | 0.93 | 0.90 | 0.90 | 0.88 |
Table 2: AMASS results of the base models with and without the proposed SP-layer. We report normalized area-under-the-curve (AUC) for PCK values (higher is better, maximum is 1). For the remaining metrics, lower is better. “Seq2seq” and “Seq2seq-dropout” are trained by using ground-truth inputs. “-dropout” applies dropout on the inputs. “*” indicates our evaluation of this model. “-PJL” stands for our proposed per-joint loss on the vanilla model, showing a significant improvement already. Note that models with SPL perform better except on short-term predictions for the “Seq2seq” model.

6.1 Metrics

So far, motion prediction has been benchmarked on H3.6M using the Euclidean distance between target and predicted Euler angles [14, 20, 25, 33]. Numbers are usually reported per action at certain time steps averaged over 8 samples [14]. Unfortunately, Euler angles have twelve different conventions (not counting the fact that each of these can be defined using intrinsic or extrinsic rotations), which makes the practical implementation of this metric error-prone.

For a more precise analysis we introduce additional metrics from related pose estimation areas [28, 32, 34]. In order to increase the robustness we furthermore suggest to i) sum until time step $t$ rather than report the metric at time step $t$, ii) use more test samples covering a larger portion of the test data set and iii) evaluate the models with complementary metrics. Note that we do not train the models on these metrics; they only serve as evaluation criteria at test time.

Joint angle difference

To circumvent the potential source of error in the Euler angle metric, we propose using another angle-based metric following [11, 32]. This metric computes the angle of the rotation required to align the predicted joint with the target joint. Unlike $L_{eul}$, this metric is independent of how rotations are parameterized. It is furthermore similar to the geodesic loss proposed by [33]. Let $\hat{R}$ be the predicted joint angle for a given joint, parameterized as a rotation matrix, and $R$ the respective target rotation. The difference in rotation can be computed as $\tilde{R} = \hat{R} R^T$, from which we construct the metric at time step $t$ as follows:

$$L_{angle}(t) = \frac{1}{|X_{test}|} \sum_{x_{1:t} \in X_{test}} \sum_{\tau=1}^{t} \sum_{k} \big\lVert \log \tilde{R}_\tau^{(k)} \big\rVert_2 \qquad (8)$$

where $\tilde{R}_\tau^{(k)}$ is the rotation matrix of joint $k$ at time $\tau$. In contrast to $L_{eul}$ we compute the loss on global joint angles by first unrolling the kinematic chain before computing $L_{angle}$.
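For a single joint, the angle of the rotation that aligns a prediction with its target can be computed from the trace of the difference rotation. The sketch below is an illustration; the helper names `rotation_angle` and `rot_z` are assumptions, and the summation over joints, time steps, and test samples is omitted.

```python
import numpy as np


def rotation_angle(R_pred, R_target):
    """Angle (in radians) of the rotation that aligns R_pred with R_target."""
    R_diff = R_pred @ R_target.T
    # For a rotation matrix, trace = 1 + 2 * cos(angle); clip guards against round-off.
    cos_angle = np.clip((np.trace(R_diff) - 1.0) / 2.0, -1.0, 1.0)
    return np.arccos(cos_angle)


def rot_z(theta):
    """Rotation by theta radians about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


print(rotation_angle(rot_z(0.5), rot_z(0.2)))   # approximately 0.3
```

The result depends only on the two rotations themselves, not on whether they were stored as angle-axis vectors, quaternions, or matrices before conversion.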

Positional

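Following Pavllo et al.’s [25] suggestion, we introduce a positional metric. This metric simply performs forward kinematics on the predicted rotations $\hat{R}$ and the target rotations $R$ to obtain the 3D joint positions $\hat{p}$ and $p$, respectively. It then computes the Euclidean distance per joint. We normalize the skeleton bones such that the right thigh bone has unit length.

$$\mathcal{L}_{pos}(t) = \sum_{k} \left\| p_t^{(k)} - \hat{p}_t^{(k)} \right\|_2 \qquad (9)$$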
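The following sketch illustrates the positional metric on a hypothetical three-joint chain with unit-length bones. The skeleton, the bone offsets, and the function names are assumptions for illustration only; the sketch also shows that the same rotation error costs more at the root than further down the chain.

```python
import math

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def rot_z(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def forward_kinematics(local_rots, parents, offsets):
    """Unroll the chain and return the global 3D position of every joint.

    parents[i] is the parent of joint i (-1 for the root, otherwise < i);
    offsets[i] is the bone vector from the parent to joint i.
    """
    global_rots, positions = [], []
    for i, rot in enumerate(local_rots):
        p = parents[i]
        if p < 0:
            global_rots.append(rot)
            positions.append(list(offsets[i]))
        else:
            global_rots.append(matmul(global_rots[p], rot))
            bone = matvec(global_rots[p], offsets[i])
            positions.append([positions[p][k] + bone[k] for k in range(3)])
    return positions

def positional_error(pred_rots, target_rots, parents, offsets):
    """Sum of per-joint Euclidean distances for one frame."""
    pred = forward_kinematics(pred_rots, parents, offsets)
    target = forward_kinematics(target_rots, parents, offsets)
    return sum(math.dist(p, t) for p, t in zip(pred, target))

parents = [-1, 0, 1]
offsets = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
target = [IDENTITY, IDENTITY, IDENTITY]

root_error = positional_error([rot_z(0.1), IDENTITY, IDENTITY], target, parents, offsets)
middle_error = positional_error([IDENTITY, rot_z(0.1), IDENTITY], target, parents, offsets)
print(root_error > middle_error)  # True: the root error moves both child joints
```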

PCK

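In cases where large errors occur, the value of the positional metric can be misleading. Hence, following the 3D (hand) pose estimation literature [13, 22, 28, 34], we introduce PCK by computing the percentage of predicted joints lying within a spherical threshold $\rho$ around the target joint position, i.e.

$$\mathrm{PCK}(t, \rho) = \frac{1}{K} \sum_{k} \mathbb{I}\left[ \left\| p_t^{(k)} - \hat{p}_t^{(k)} \right\|_2 \le \rho \right] \qquad (10)$$

where $K$ is the number of joints, $p_t^{(k)}$ and $\hat{p}_t^{(k)}$ are the target and predicted positions of joint $k$ at time $t$, and $\mathbb{I}[\cdot]$ returns $1$ if its input is true, and $0$ otherwise. Note that for PCK we do not sum, but average, until time step $t$.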
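A minimal sketch of PCK, assuming joint positions are given as lists of 3D coordinates, is shown below. The function names and the example values are illustrative; the second helper reflects the note that PCK is averaged rather than summed over time.

```python
import math

def pck(pred_positions, target_positions, threshold):
    """Fraction of joints whose prediction lies within `threshold` of the target."""
    hits = [math.dist(p, t) <= threshold
            for p, t in zip(pred_positions, target_positions)]
    return sum(hits) / len(hits)

def pck_until(pred_frames, target_frames, threshold):
    """Average of the per-frame PCK over all frames up to the horizon."""
    values = [pck(p, t, threshold) for p, t in zip(pred_frames, target_frames)]
    return sum(values) / len(values)

target = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
pred = [[0.0, 0.0, 0.05], [1.0, 0.3, 0.0]]

print(pck(pred, target, 0.1))  # 0.5: only the first joint is within the sphere
print(pck(pred, target, 0.5))  # 1.0: both joints are within the sphere
```

Unlike the positional metric, a single joint that is far off can lower the score by at most one joint's share, which is why PCK stays informative when large errors occur.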

6.2 Results

Tab. 2 summarizes the performance of the three model variants, each with and without the SP-layer. We trained the base models with minimal modifications, i.e. design, training objective and regularizations are kept intact. We use angle-axis, quaternion and rotation matrix representations for Seq2seq, QuaterNet, and RNN models, respectively. To make a fair comparison, we run hyper-parameter search on the batch size, cell type, learning rate and hidden layer size.

Figure 3: PCK curves of the best Seq2seq variant and QuaterNet with and without SPL on AMASS for ms predictions. More results are shown in appendix Sec. 8.3.
Figure 4: Qualitative Comparison on AMASS. We use a -second seed sequence and predict the next second (60 frames). The last pose of the seed and the first pose of the prediction sequences are consecutive frames. Note that there is no transition problem. Top: Ground-truth sequence. Middle: Output of the vanilla RNN which quickly deteriorates. Bottom: The same RNN model augmented with our SP-layer. It produces accurate short-term predictions as well as natural long-term motion.

Unlike on H3.6M, LSTM cells consistently outperform GRUs on AMASS for the Seq2seq and RNN models. Different from [20], we also train the Seq2seq model by applying dropout on the inputs similar to our RNN architecture. QuaterNet gives its best performance with GRU cells while some fine-tuning for the teacher forcing ratio is necessary.

In all settings, the Seq2seq models fail to give competitive performance on this large-scale task and are sometimes outperformed by the zero-velocity baseline proposed by Martinez et al. [20]. QuaterNet performs strongly and is in fact the closest vanilla model to the SPL variants. However, our SP-layer still improves the QuaterNet results further. The contribution of the SP-layer is best observable on the RNN model. With the help of a larger dataset, the proposed RNN-SPL achieves the best results under different metrics and prediction horizons. Fig. 3 compares two baseline methods for millisecond predictions with their corresponding SPL extension for different choices of the PCK threshold. The RNN-SPL consistently outperforms other methods. More results are shown in the appendix Sec. 8.3.
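For reference, the zero-velocity baseline mentioned above simply repeats the last observed pose for every future frame. A minimal sketch, assuming each pose is a flat list of numbers and using a hypothetical function name, looks as follows:

```python
def zero_velocity_prediction(seed_poses, horizon):
    """Repeat the last observed pose for every future frame."""
    last_pose = seed_poses[-1]
    return [list(last_pose) for _ in range(horizon)]

seed = [[0.0, 1.0], [2.0, 3.0]]
print(zero_velocity_prediction(seed, 3))  # [[2.0, 3.0], [2.0, 3.0], [2.0, 3.0]]
```

That a learned model can be outperformed by this rule shows how strong a prior "nothing changes" is at short horizons.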

Please also note the complementary effect of the proposed metrics in Tab. 2. The Seq2seq-dropout-SPL model at ms shows a significant improvement () w.r.t. the Euler angle metric, and in fact achieves the best result across all models. However, this is no longer the case when we look at the proposed metrics, where the model performs marginally worse than the best performing model, RNN-SPL. The joints closer to the root of the kinematic chain have a much larger impact on the overall pose since wrong rotations propagate to all the child joints on the chain. This effect might be ignored when only local rotations are considered, which is the case for the Euler angle metric. The joint angle and positional metrics account for this by first unrolling the kinematic chain.

In line with [25, 33], we report that the residual connection from [20] is very effective for short-term predictions. All models we trained performed better with the residual connection irrespective of the dataset or pose representation.
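The residual connection can be sketched as follows: the network predicts a change that is added to the previous pose instead of predicting the next pose directly. Here `predict_delta` is a hypothetical stand-in for the trained network, and plain addition is an assumption that fits vector-valued pose representations.

```python
def predict_with_residual(last_pose, predict_delta, steps):
    """Roll out `steps` poses, adding a predicted offset to the previous pose."""
    poses, current = [], list(last_pose)
    for _ in range(steps):
        delta = predict_delta(current)
        current = [c + d for c, d in zip(current, delta)]
        poses.append(current)
    return poses

# Hypothetical "network" that always predicts the same small change.
constant_delta = lambda pose: [0.5, -0.5]
print(predict_with_residual([0.0, 1.0], constant_delta, 2))  # [[0.5, 0.5], [1.0, 0.0]]
```

A network that outputs zeros reproduces the zero-velocity baseline, so the model starts from a sensible short-term prediction and only has to learn the deviation from it.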

6.3 Ablation Study

To study the SPL in more depth we conduct an ablation study presented in Tab. 3. We observe that the main performance boost is achieved by the decomposition of the output layer and the per-joint loss in Eq. (6). While the per-joint loss alone (i.e., without SPL) is not beneficial on H3.6M, on AMASS its application alone already helps (RNN-PJL). It is also effective on Seq2seq models with noisy inputs, but the performance degrades on the vanilla Seq2seq model. In longer-term predictions, the SP-layer shows a significant contribution (see Tab. 2). Assuming independent joints without modelling any hierarchy (RNN-SPL-indep.) improves the results further. Introducing hierarchy into the prediction layer in either reverse or random order often performs similarly or better. However, introducing the spatial dependencies according to the kinematic chain (RNN-SPL) yields the best results with the exception of the positional metric.

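| Model | AMASS Euler | AMASS Joint Angle | AMASS Pos. | H3.6M Walking |
| --- | --- | --- | --- | --- |
| RNN | 16.44 | 3.570 | 1.396 | 0.900 |
| RNN-PJL | 13.13 | 2.573 | 0.986 | 0.950 |
| RNN-SPL-indep. | 12.96 | 2.552 | 0.982 | 0.836 |
| RNN-SPL-random | 12.98 | 2.547 | 0.980 | 0.863 |
| RNN-SPL-reverse | 13.03 | 2.543 | 0.973 | 0.849 |
| RNN-SPL | 12.85 | 2.533 | 0.975 | 0.772 |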
Table 3: Ablation study on AMASS and H3.6M (walking) for 400 ms predictions. Each entry is an average over 5 randomly initialized training runs. Please refer to Sec. 6.3 for detailed explanations and the appendix for more results.

7 Conclusion

We introduce prior knowledge about the human skeletal structure into a neural network by means of a structured prediction layer (SPL). The SP-layer explicitly decomposes the pose into individual joints and can be interfaced with a variety of baseline architectures. We furthermore introduce AMASS, a large-scale motion dataset, and several metrics to the task of motion prediction. On AMASS, we empirically show that for any baseline model, any metric, and any input representation, it is better to use the proposed SP-layer. The simple RNN model augmented with the SP-layer achieved state-of-the-art performance on the new AMASS benchmark.

Acknowledgements

We thank the reviewers for their insightful comments and Martin Blapp for fruitful discussions. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme grant agreement No 717054. We thank the NVIDIA Corporation for the donation of GPUs used in this work.

References

  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Note: Software available from tensorflow.org External Links: Link Cited by: §8.1.
  • [2] J. Bütepage, M. J. Black, D. Kragic, and H. Kjellström (2017) Deep representation learning for human motion prediction and classification. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1591–1599. Cited by: §2, §2.
  • [3] J. Bütepage, H. Kjellström, and D. Kragic (2018) Anticipating many futures: online human motion prediction and generation for human-robot interaction. In 2018 IEEE International Conference on Robotics and Automation, ICRA 2018, Brisbane, Australia, May 21-25, 2018, pp. 1–9. External Links: Link, Document Cited by: §2.
  • [4] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014-10) Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1724–1734. External Links: Link Cited by: §2.
  • [5] F. De la Torre, J. Hodgins, A. Bargteil, X. Martin, J. Macey, A. Collado, and P. Beltran (2008) Guide to the carnegie mellon university multimodal activity (cmu-mmac) database. Robotics Institute, pp. 135. Cited by: §4.1.
  • [6] X. Du, R. Vasudevan, and M. Johnson-Roberson (2019) Bio-lstm: a biomechanically inspired recurrent neural network for 3d pedestrian pose and gait prediction. IEEE Robotics and Automation Letters (RA-L). Note: accepted Cited by: §2.
  • [7] K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik (2015) Recurrent network models for human dynamics. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, Washington, DC, USA, pp. 4346–4354. External Links: ISBN 978-1-4673-8391-2, Link, Document Cited by: §1, §1, §2, §2, §2, §4.2, Table 1.
  • [8] P. Ghosh, J. Song, E. Aksan, and O. Hilliges (2017) Learning human motion models for long-term predictions. In 2017 International Conference on 3D Vision, 3DV 2017, Qingdao, China, October 10-12, 2017, pp. 458–466. External Links: Link, Document Cited by: §1, §1, §1, §2, §4.2.
  • [9] D. Holden, J. Saito, T. Komura, and T. Joyce (2015) Learning motion manifolds with convolutional autoencoders. In SIGGRAPH Asia 2015 Technical Briefs, SA ’15, New York, NY, USA, pp. 18:1–18:4. External Links: ISBN 978-1-4503-3930-8, Link, Document Cited by: §2.
  • [10] D. Holden, J. Saito, and T. Komura (2016-07) A deep learning framework for character motion synthesis and editing. ACM Trans. Graph. 35 (4), pp. 138:1–138:11. External Links: ISSN 0730-0301, Link, Document Cited by: §2.
  • [11] Y. Huang, M. Kaufmann, E. Aksan, M. J. Black, O. Hilliges, and G. Pons-Moll (2018-11) Deep inertial poser: learning to reconstruct human pose from sparse inertial measurements in real time. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia) 37, pp. 185:1–185:15. Cited by: §4.1, §6.1.
  • [12] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu (2014-07) Human3.6m: large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (7), pp. 1325–1339. Cited by: §1, §2.
  • [13] U. Iqbal, P. Molchanov, T. Breuel, J. Gall, and J. Kautz (2018) Hand pose estimation via latent 2.5d heatmap regression. In ECCV (11), Lecture Notes in Computer Science, Vol. 11215, pp. 125–143. Cited by: §6.1.
  • [14] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena (2016) Structural-rnn: deep learning on spatio-temporal graphs. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 5308–5317. External Links: Link, Document Cited by: §1, §1, §1, §2, §4.1, §4.2, Table 1, §5.1, §6.1, §8.1.
  • [15] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §8.1.
  • [16] K. Lee, I. Lee, and S. Lee (2018) Propagating LSTM: 3d pose estimation based on joint interdependency. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VII, pp. 123–141. External Links: Link, Document Cited by: §2.
  • [17] S. Li, W. Zhang, and A. B. Chan (2015) Maximum-margin structured learning with deep networks for 3d human pose estimation. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 2848–2856. External Links: Link, Document Cited by: §2.
  • [18] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black (2015) SMPL: a skinned multi-person linear model. ACM Transactions on Graphics (TOG) 34 (6), pp. 248. Cited by: §4.1.
  • [19] N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black (2019-10) AMASS: archive of motion capture as surface shapes. In The IEEE International Conference on Computer Vision (ICCV), External Links: Link Cited by: §1, §4.1.
  • [20] J. Martinez, M. J. Black, and J. Romero (2017-07) On human motion prediction using recurrent neural networks. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, Piscataway, NJ, USA. Cited by: §1, §1, §1, §1, §1, §2, §2, §4.1, §4.2, §4.2, Table 1, §5.2, §5.2, §6.1, §6.2, §6.2, §6.2, Table 2, §8.1, §8.1, §8.1, §8.1, §8.1, Table 5.
  • [21] F. Moreno-Noguer (2017) 3D human pose estimation from a single image via distance matrix regression. In CVPR, pp. 1561–1570. Cited by: §2.
  • [22] F. Mueller, F. Bernard, O. Sotnychenko, D. Mehta, S. Sridhar, D. Casas, and C. Theobalt (2018-06) GANerated hands for real-time 3d hand tracking from monocular rgb. In Proceedings of Computer Vision and Pattern Recognition (CVPR), External Links: Link Cited by: §6.1.
  • [23] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In NIPS-W, Cited by: §8.1.
  • [24] D. Pavllo, C. Feichtenhofer, M. Auli, and D. Grangier (2019) Modeling human motion with quaternion-based neural networks. CoRR abs/1901.07677. External Links: Link, 1901.07677 Cited by: §4.1, §4.2, §5.2.
  • [25] D. Pavllo, D. Grangier, and M. Auli (2018) QuaterNet: A quaternion-based recurrent model for human motion. In British Machine Vision Conference 2018, BMVC 2018, Northumbria University, Newcastle, UK, September 3-6, 2018, pp. 299. External Links: Link Cited by: §1, §1, §1, §1, §2, §4.1, §4.2, Table 1, §5.2, §6.1, §6.1, §6.2, Table 2, §8.1, §8.1, §8.1, §8.1, Table 5.
  • [26] L. Sigal, A.O. Balan, and M.J. Black (2010) Humaneva: synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal on Computer Vision (IJCV) 87 (1), pp. 4–27. External Links: ISSN 0920-5691 Cited by: §4.1.
  • [27] G. G. Slabaugh Computing euler angles from a rotation matrix. Note: http://www.gregslabaugh.net/publications/euler.pdf, last accessed 21.03.2019 Cited by: §5.1.
  • [28] A. Spurr, J. Song, S. Park, and O. Hilliges (2018) Cross-modal deep variational hand pose estimation. In CVPR, Cited by: §6.1, §6.1.
  • [29] X. Sun, J. Shang, S. Liang, and Y. Wei (2017) Compositional human pose regression. In ICCV, pp. 2621–2630. Cited by: §2.
  • [30] G. W. Taylor, G. E. Hinton, and S. T. Roweis (2011) Two distributed-state models for generating high-dimensional time series. Journal of Machine Learning Research 12, pp. 1025–1068. External Links: Link Cited by: §2.
  • [31] B. Tekin, I. Katircioglu, M. Salzmann, V. Lepetit, and P. Fua (2016) Structured prediction of 3d human pose with deep neural networks. In Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016, External Links: Link Cited by: §2.
  • [32] T. von Marcard, B. Rosenhahn, M. Black, and G. Pons-Moll (2017) Sparse inertial poser: automatic 3d human pose estimation from sparse imus. Computer Graphics Forum 36(2), Proceedings of the 38th Annual Conference of the European Association for Computer Graphics (Eurographics), pp. 349–360. Cited by: §6.1, §6.1.
  • [33] Y. Wang, L. Gui, X. Liang, and J. M. F. Moura (2018-10) Adversarial geometry-aware human motion prediction. In European Conference on Computer Vision (ECCV), Cited by: §1, §1, §2, §3.3, Table 1, §6.1, §6.1, §6.2.
  • [34] C. Zimmermann and T. Brox (2017) Learning to estimate 3d hand pose from single rgb images. In IEEE International Conference on Computer Vision (ICCV), Note: https://arxiv.org/abs/1705.01389 External Links: Link Cited by: §6.1, §6.1.

8 Appendix

We provide architecture details in Sec. 8.1, results on long-term predictions in Sec. 8.2, PCK plots in Sec. 8.3, and more detailed ablation studies in Sec. 8.4.

8.1 Architecture Details

The RNN and Seq2seq models are implemented in TensorFlow [1]. For the QuaterNet-SPL model we extend the publicly available source code in PyTorch [23]. Our aim is to make as few modifications as possible to the baseline Seq2seq [20] and QuaterNet [25] models. To get the best performance on the new AMASS dataset, we fine-tune the hyper-parameters including batch size, learning rate, learning rate decay, cell type and number of cell units, dropout rate, hidden output layer size and, for QuaterNet, teacher forcing ratio decay.

Fig. 5 provides an overview over these models. The SP-layer replaces the standard dense layers, which normally use the context representation , i.e., GRU or LSTM state until time-step , to make the pose vector prediction . The SPL component follows the kinematic chain and uses the following network for every joint:

where the hidden layer size per joint is either 64 or 128 and the joint size is 3, 4, or 9 for the exponential map, quaternion, or rotation matrix pose representation, respectively (see Tab. 4). Similar to the H3.6M setup [14, 20] we use a -second seed and -millisecond target sequences . The sequence corresponds to the target predictions.

| Model | H3.6M SPL | H3.6M Units | H3.6M Cell | AMASS SPL | AMASS Units | AMASS Cell |
| --- | --- | --- | --- | --- | --- | --- |
| RNN-SPL | sparse | 64 | GRU | dense | 64 | LSTM |
| Seq2seq-SPL | sparse | 64 | GRU | dense | 64 | LSTM |
| QuaterNet-SPL | sparse | 128 | GRU | sparse | 128 | GRU |

Table 4: SPL configuration. sparse and dense refer to making a joint prediction by feeding only the immediate parent or all parent joints in the kinematic chain, respectively. Models use a hidden layer of either 64 or 128 units per joint. The GRU cell outperforms LSTM on H3.6M while LSTM is consistently better on the AMASS dataset. The vanilla models use their original setting with the reported cell.
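The difference between the sparse and dense configurations can be sketched as pure wiring, leaving out the small per-joint networks themselves. The four-joint chain, the context size of 1024, and all function names below are hypothetical and only serve to show which earlier joint predictions are fed to each joint besides the context representation.

```python
def ancestors(joint, parents):
    """Return the ancestors of `joint` in the kinematic chain, nearest first."""
    chain = []
    p = parents[joint]
    while p >= 0:
        chain.append(p)
        p = parents[p]
    return chain

def spl_inputs(parents, sparse):
    """For every joint, list the joints whose predictions it receives."""
    result = {}
    for joint in range(len(parents)):
        chain = ancestors(joint, parents)
        result[joint] = chain[:1] if sparse else chain
    return result

def input_size(joint, parents, context_size, joint_size, sparse):
    """Input width of one per-joint network: context plus parent predictions."""
    return context_size + joint_size * len(spl_inputs(parents, sparse)[joint])

# Hypothetical chain of four joints; -1 marks the root.
parents = [-1, 0, 1, 2]
print(spl_inputs(parents, sparse=True))   # {0: [], 1: [0], 2: [1], 3: [2]}
print(spl_inputs(parents, sparse=False))  # {0: [], 1: [0], 2: [1, 0], 3: [2, 1, 0]}
print(input_size(3, parents, 1024, 9, sparse=False))  # 1051
```

Since every joint only depends on joints earlier in the chain, the predictions can be produced in a single pass from the root outwards.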

We train the baseline Seq2seq [20] and QuaterNet [25] models by using the training objectives as proposed in the original papers. The SPL variants, however, implement these objectives by using our proposed joint-wise loss. After an epoch of training we evaluate the model on the validation split and apply early stopping with respect to the joint angle metric. Please note that the early stopping metric is different from the training objective for all models.

Figure 5: Model overview. Top: RNN-SPL, Middle: Seq2seq-SPL, Bottom: QuaterNet-SPL. Note that both Seq2seq and QuaterNet models follow a sequence-to-sequence architecture where the encoder and decoder share the parameters. The -second seed sequence is first fed to the encoder network to calculate the hidden cell state which is later used to initialize the prediction into the future. The dashed lines from the prediction to the input correspond to the sampling-based training. In other words, the predictions are fed back during training.

| Model | Euler 600 ms | Euler 800 ms | Euler 1000 ms | Joint Angle 600 ms | Joint Angle 800 ms | Joint Angle 1000 ms | Positional 600 ms | Positional 800 ms | Positional 1000 ms | PCK (AUC) 600 ms | PCK (AUC) 800 ms | PCK (AUC) 1000 ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zero-Velocity [20] | 32.36 | 48.39 | 65.25 | 7.46 | 11.31 | 15.3 | 2.93 | 4.46 | 6.06 | 0.78 | 0.76 | 0.74 |
| Seq2seq [20]* | 36.39 | 60.07 | 88.72 | 8.39 | 14.36 | 21.61 | 3.38 | 5.82 | 8.81 | 0.75 | 0.71 | 0.67 |
| Seq2seq-PJL | 41.96 | 71.63 | 109.45 | 8.75 | 15.57 | 24.43 | 3.13 | 5.55 | 8.76 | 0.76 | 0.71 | 0.66 |
| Seq2seq-SPL | 32.58 | 52.49 | 75.69 | 7.23 | 11.99 | 17.62 | 2.88 | 4.81 | 7.10 | 0.79 | 0.75 | 0.72 |
| Seq2seq-sampling [20]* | 31.37 | 47.37 | 64.72 | 6.72 | 10.31 | 14.23 | 2.61 | 4.03 | 5.58 | 0.79 | 0.77 | 0.75 |
| Seq2seq-sampling-PJL | 27.72 | 42.19 | 58.01 | 5.96 | 9.21 | 12.79 | 2.34 | 3.64 | 5.07 | 0.81 | 0.79 | 0.77 |
| Seq2seq-sampling-SPL | 27.01 | 40.90 | 55.97 | 5.76 | 8.90 | 12.36 | 2.24 | 3.48 | 4.85 | 0.82 | 0.80 | 0.78 |
| Seq2seq-dropout | 31.16 | 48.92 | 68.77 | 6.94 | 11.22 | 16.06 | 2.78 | 4.54 | 6.54 | 0.78 | 0.75 | 0.72 |
| Seq2seq-dropout-PJL | 31.20 | 50.62 | 73.09 | 6.59 | 10.93 | 15.98 | 2.53 | 4.18 | 6.09 | 0.80 | 0.76 | 0.73 |
| Seq2seq-dropout-SPL | 28.02 | 44.95 | 64.23 | 6.15 | 10.11 | 14.67 | 2.42 | 4.00 | 5.84 | 0.81 | 0.78 | 0.75 |
| QuaterNet [25]* | 27.08 | 41.32 | 56.66 | 5.88 | 9.21 | 12.84 | 2.32 | 3.64 | 5.09 | 0.82 | 0.79 | 0.77 |
| QuaterNet-SPL | 25.37 | 39.02 | 53.95 | 5.58 | 8.79 | 12.32 | 2.19 | 3.47 | 4.87 | 0.82 | 0.80 | 0.78 |
| RNN | 31.19 | 48.84 | 68.64 | 7.33 | 11.87 | 17.09 | 2.93 | 4.79 | 6.96 | 0.78 | 0.74 | 0.71 |
| RNN-SPL | 24.44 | 38.02 | 53.06 | 5.04 | 8.08 | 11.50 | 1.94 | 3.14 | 4.49 | 0.84 | 0.81 | 0.79 |
Table 5: Long-term AMASS results of the base models with and without the proposed structured prediction layer (SPL). For PCK we report the area-under-the-curve (AUC), which is upper-bounded by 1 (higher is better). Euler, joint angle and positional losses are lower-bounded by 0 (lower is better). "*" indicates our evaluation of the corresponding model on AMASS. "-dropout" stands for dropout applied directly on the inputs. "-PJL" stands for our proposed per-joint loss on the vanilla model, showing a significant improvement already. All models use residual connections. Note that models with our proposed SP-layer always perform better.
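One way to condense a PCK curve into a single AUC value bounded by 1 is sketched below. The trapezoidal rule and the normalization by the threshold range are assumptions made for illustration; the exact procedure used for the reported numbers is not specified here, and the function name is hypothetical.

```python
def pck_auc(thresholds, pck_values):
    """Trapezoidal area under the PCK curve, normalized by the threshold range."""
    area = 0.0
    for i in range(1, len(thresholds)):
        width = thresholds[i] - thresholds[i - 1]
        area += 0.5 * (pck_values[i] + pck_values[i - 1]) * width
    return area / (thresholds[-1] - thresholds[0])

print(pck_auc([0.0, 0.5, 1.0], [0.0, 0.5, 1.0]))  # 0.5
print(pck_auc([0.0, 0.5, 1.0], [1.0, 1.0, 1.0]))  # 1.0: a perfect model
```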

RNN-SPL

We use the rotation matrix pose representation with zero-mean unit-variance normalization, following teacher-forcing training. In other words, the model is trained by feeding the ground-truth pose to predict . The training objective is the proposed joint-wise loss with -norm (see Sec. 3.3 in the paper) which is calculated over the entire seed and target predictions .

We do not follow a sampling-based training scheme. In the absence of such a training regularization, the model overfits to the likelihood (i.e., ground-truth input samples) and hence performs poorly in the auto-regressive test setup. We find that a small amount of dropout with a rate of on the inputs makes the model robust against the exposure bias problem.

The dropout is followed by a linear layer with units. We use a single LSTM cell with units. The vanilla RNN model makes the predictions by using

where . We also experimented with GRU units instead of LSTM cells, but experimentally found that LSTMs consistently outperformed GRUs. Finally, we use the Adam optimizer [15] with its default parameters. The learning rate is initialized with and exponentially decayed with a rate of at every decay steps.
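The learning rate schedule described above can be sketched as a staircase exponential decay. The staircase interpretation and the numeric values in the example are assumptions, since the concrete settings are not reproduced here; the function name is hypothetical.

```python
def decayed_learning_rate(initial_rate, decay_rate, decay_steps, step):
    """Multiply the rate by `decay_rate` once every `decay_steps` training steps."""
    return initial_rate * decay_rate ** (step // decay_steps)

# Hypothetical settings: halve the rate every 1000 steps.
print(decayed_learning_rate(0.001, 0.5, 1000, 999))   # 0.001
print(decayed_learning_rate(0.001, 0.5, 1000, 2500))  # 0.00025
```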

Seq2seq-SPL

As proposed by Martinez et al. [20] we use the exponential map pose representation with zero-mean unit-variance normalization. The model consists of encoder and decoder components where the parameters are shared. The seed sequence is first fed to the encoder network to calculate the hidden cell state which is later used by the decoder to initialize the prediction into the future (i.e., ). Similarly, the training objective is calculated between the ground-truth targets and the predictions . We use the proposed joint-wise loss with -norm.

In our AMASS experiments, we find that a single LSTM cell with units performs better than a single GRU cell. In the training of the Seq2seq-sampling model, the decoder prediction is fed back to the model [20]. The other two variants, Seq2seq-dropout (with a dropout rate of ) and Seq2seq (see Tab. 2 in the paper), are trained with ground-truth inputs similar to the RNN models. Similarly, the vanilla Seq2seq model has a hidden output layer of size on the AMASS dataset.

We use the Adam optimizer with its default parameters. The learning rate is initialized with and exponentially decayed with a rate of at every decay steps.

QuaterNet-SPL

We use the quaternion pose representation without any further normalization on the data [25]. The data is pre-processed following Pavllo et al.’s suggestions to avoid mixing antipodal representations within a given sequence. QuaterNet also follows the sequence-to-sequence architecture where the seed sequence is used to initialize the cell states. As in the vanilla model, the training objective is based on the Euler angle pose representation. More specifically, the predictions in quaternion representation are converted to Euler angles to calculate the training objective.
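The antipodal issue arises because a quaternion q and its negation -q describe the same rotation, so a sequence can jump between the two without any change in the pose. A minimal sketch of one common remedy, assuming quaternions are stored as 4-tuples and using a hypothetical function name, flips the sign of a quaternion whenever its dot product with the previous one is negative:

```python
def enforce_quaternion_continuity(sequence):
    """Flip signs so that consecutive quaternions lie in the same hemisphere."""
    fixed = [tuple(sequence[0])]
    for q in sequence[1:]:
        previous = fixed[-1]
        dot = sum(a * b for a, b in zip(previous, q))
        fixed.append(tuple(-c for c in q) if dot < 0 else tuple(q))
    return fixed

sequence = [(1.0, 0.0, 0.0, 0.0), (-0.99, -0.1, 0.0, 0.0), (0.98, 0.2, 0.0, 0.0)]
fixed = enforce_quaternion_continuity(sequence)
print(fixed[1][0] > 0)  # True: the second quaternion was flipped
```

This is a sketch of the general idea rather than the exact pre-processing used by the QuaterNet code base.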

The model consists of two stacked GRU cells with units each. In contrast to the RNN and Seq2seq models, the residual velocity is implemented by using quaternion multiplication. Moreover, the QuaterNet model applies a normalization penalty and explicitly normalizes the predictions in order to enforce valid rotations. As proposed by Pavllo et al. [25], we exponentially decay the teacher-forcing ratio with a rate of . The teacher-forcing ratio determines the probability of using ground-truth poses during training. Over time this value gets closer to 0 and hence increases the probability of using the model predictions rather than the ground-truth poses. Similar to the vanilla RNN and Seq2seq models, a hidden output layer of size performed better on the AMASS dataset.
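The teacher-forcing schedule can be sketched as follows. Decaying once per epoch, the decay rate of 0.9, and the function names are assumptions for illustration; only the general mechanism (a probability that shrinks over time and selects between ground truth and model prediction) is taken from the description above.

```python
import random

def teacher_forcing_ratio(decay_rate, epoch):
    """Probability of feeding the ground-truth pose in the given epoch."""
    return decay_rate ** epoch

def choose_input(ground_truth, prediction, ratio, rng):
    """Feed the ground truth with probability `ratio`, else the model prediction."""
    return ground_truth if rng.random() < ratio else prediction

# Hypothetical decay rate of 0.9 per epoch.
print([round(teacher_forcing_ratio(0.9, epoch), 3) for epoch in range(3)])  # [1.0, 0.9, 0.81]

rng = random.Random(0)
print(choose_input("ground truth", "prediction", 0.0, rng))  # prediction
```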

Finally, the model is trained by using the Adam optimizer with its default parameters. The learning rate is initialized with and exponentially decayed with a rate of after every training epoch.

Figure 6: Qualitative Comparison on AMASS. We use a -second seed sequence and predict the next second (60 frames). The last pose of the seed and the first pose of the prediction sequences are consecutive frames. In green (2nd and 4th row) are results from the vanilla versions of Seq2seq and QuaterNet, respectively. In orange (3rd and 5th row) are results when augmenting the vanilla model with our SP-layer. Although the SPL variants shown here are outperformed by the RNN-SPL shown in the main paper, they still show a slight improvement over their non-SPL counterparts.

| Model | Walking 80 ms | Walking 160 ms | Walking 320 ms | Walking 400 ms | Eating 80 ms | Eating 160 ms | Eating 320 ms | Eating 400 ms | Smoking 80 ms | Smoking 160 ms | Smoking 320 ms | Smoking 400 ms | Discussion 80 ms | Discussion 160 ms | Discussion 320 ms | Discussion 400 ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RNN-mean | 0.319 | 0.515 | 0.771 | 0.900 | 0.242 | 0.384 | 0.583 | 0.742 | 0.264 | 0.493 | 0.984 | 0.967 | 0.312 | 0.668 | 0.945 | 1.040 |
| RNN-PJL | 0.324 | 0.534 | 0.816 | 0.950 | 0.233 | 0.391 | 0.616 | 0.776 | 0.258 | 0.483 | 0.961 | 0.932 | 0.312 | 0.675 | 0.969 | 1.067 |
| RNN-SPL-indep. | 0.288 | 0.453 | 0.720 | 0.836 | 0.228 | 0.366 | 0.575 | 0.736 | 0.258 | 0.482 | 0.947 | 0.916 | 0.313 | 0.676 | 0.962 | 1.064 |
| RNN-SPL-random | 0.298 | 0.473 | 0.758 | 0.863 | 0.227 | 0.354 | 0.578 | 0.717 | 0.263 | 0.490 | 0.956 | 0.925 | 0.311 | 0.677 | 0.975 | 1.079 |
| RNN-SPL-reverse | 0.302 | 0.483 | 0.725 | 0.849 | 0.225 | 0.344 | 0.557 | 0.721 | 0.264 | 0.494 | 0.96 | 0.929 | 0.312 | 0.679 | 0.960 | 1.050 |
| RNN-SPL | 0.264 | 0.413 | 0.669 | 0.772 | 0.205 | 0.326 | 0.559 | 0.721 | 0.260 | 0.486 | 0.958 | 0.930 | 0.307 | 0.667 | 0.950 | 1.049 |
Table 6: H3.6M ablation study. Comparison of SPL with different joint configurations and the proposed per-joint loss on H3.6M. Each model entry corresponds to an average of several runs with different initialization.

8.2 Long-term Prediction on AMASS

In Tab. 5, we report longer-term prediction results as an extension to the results provided in Tab. 2 in the main paper. Please note that all models are trained to predict -ms. In fact, the Seq2seq and QuaterNet models have been proposed to solve short-term prediction tasks only.

Consistent with the short-term prediction results shown in the main paper, our proposed SP-layer always improves the underlying model performance. While QuaterNet-SPL is competitive, RNN-SPL yields the best performance under different metrics.

In Fig. 6 we show more qualitative results for QuaterNet and Seq2seq when augmented with our SP-layer. Please refer to the supplemental video for more qualitative results.

8.3 PCK Plots

We provide additional PCK plots for , , and ms prediction horizon in Fig. 7. Please note that shorter time horizons do not use the entire range of thresholds to avoid a saturation effect.

Figure 7: PCK Curves of models with and without our SP-layer (dashed lines) on AMASS for , , , and milliseconds (top left to bottom right).

8.4 Ablation Study

The full ablation study on H3.6M and AMASS is shown in Tab. 6 and 7, respectively. For an explanation of each entry, please refer to the main text in Sec. 6.3.

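| Model | Euler 100 ms | Euler 200 ms | Euler 300 ms | Euler 400 ms | Joint Angle 100 ms | Joint Angle 200 ms | Joint Angle 300 ms | Joint Angle 400 ms | Positional 100 ms | Positional 200 ms | Positional 300 ms | Positional 400 ms | PCK (AUC) 100 ms | PCK (AUC) 200 ms | PCK (AUC) 300 ms | PCK (AUC) 400 ms |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RNN-mean | 1.65 | 5.21 | 10.24 | 16.44 | 0.318 | 1.057 | 2.157 | 3.570 | 0.122 | 0.408 | 0.838 | 1.396 | 0.886 | 0.854 | 0.861 | 0.832 |
| RNN-PJL | 1.33 | 4.15 | 8.16 | 13.13 | 0.230 | 0.758 | 1.550 | 2.573 | 0.086 | 0.287 | 0.590 | 0.986 | 0.923 | 0.897 | 0.901 | 0.877 |
| RNN-SPL-indep. | 1.30 | 4.08 | 8.04 | 12.96 | 0.228 | 0.750 | 1.537 | 2.552 | 0.085 | 0.283 | 0.587 | 0.982 | 0.924 | 0.897 | 0.901 | 0.878 |
| RNN-SPL-random | 1.31 | 4.09 | 8.03 | 12.98 | 0.228 | 0.749 | 1.533 | 2.547 | 0.086 | 0.284 | 0.586 | 0.980 | 0.924 | 0.897 | 0.901 | 0.878 |
| RNN-SPL-reverse | 1.31 | 4.10 | 8.08 | 13.03 | 0.229 | 0.749 | 1.532 | 2.543 | 0.086 | 0.282 | 0.582 | 0.973 | 0.924 | 0.897 | 0.902 | 0.878 |
| RNN-SPL | 1.29 | 4.04 | 7.95 | 12.85 | 0.227 | 0.744 | 1.525 | 2.533 | 0.085 | 0.282 | 0.582 | 0.975 | 0.924 | 0.898 | 0.902 | 0.878 |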
Table 7: AMASS ablation study. Comparison of SPL with different joint configurations and the proposed per-joint loss on AMASS. Each model entry corresponds to an average of several runs with different initialization.