Repository for "Space-Time-Separable Graph Convolutional Network for Pose Forecasting" (ICCV 2021)
Human pose forecasting is a complex structured-data sequence-modelling task, which has received increasing attention, also due to numerous potential applications. Research has mainly addressed the temporal dimension as time series and the interaction of human body joints with a kinematic tree or by a graph. This has decoupled the two aspects and leveraged progress from the relevant fields, but it has also limited the understanding of the complex structural joint spatio-temporal dynamics of the human pose. Here we propose a novel Space-Time-Separable Graph Convolutional Network (STS-GCN) for pose forecasting. For the first time, STS-GCN models the human pose dynamics only with a graph convolutional network (GCN), including the temporal evolution and the spatial joint interaction within a single-graph framework, which allows the cross-talk of motion and spatial correlations. Concurrently, STS-GCN is the first space-time-separable GCN: the space-time graph connectivity is factored into space and time affinity matrices, which bottlenecks the space-time cross-talk, while enabling full joint-joint and time-time correlations. Both affinity matrices are learnt end-to-end, which results in connections substantially deviating from the standard kinematic tree and the linear-time time series. In experimental evaluation on three complex, recent and large-scale benchmarks, Human3.6M [Ionescu et al. TPAMI'14], AMASS [Mahmood et al. ICCV'19] and 3DPW [Von Marcard et al. ECCV'18], STS-GCN outperforms the state-of-the-art, surpassing the current best technique [Mao et al. ECCV'20] by over 32% on average on long-term predictions, while requiring only 1.7% of its parameters. We illustrate the graph interactions by the factored joint-joint and time-time learnt graph connections. Our source code is available at: https://github.com/FraLuca/STSGCN
Forecasting future human poses is the task of modelling the complex structured sequence of joint spatio-temporal dynamics of the human body. This has received increasing attention due to its manifold applications to autonomous driving, healthcare, teleoperations and collaborative robots [28, 45], where e.g. anticipating the human motion avoids crashes and helps the robots plan the future.
Research has so far addressed modelling space and time in separate frameworks. Time has generally been modelled with convolutions in the temporal dimension, with recurrent neural networks (RNN [36, 35, 49, 11], GRU [53, 1] and LSTM) or with Transformer Networks. Space and the interaction of joints have instead been recently modelled by Graph Convolutional Networks (GCN), mostly connecting body joints along a kinematic tree. The separate approach has side-stepped the complexity of a joint model across the spatial and temporal dimensions, which are diverse in nature, and has leveraged progress in the relevant fields. However, this has also limited the understanding of the complex human body dynamics.
Here we propose to forecast human motion with a novel Space-Time-Separable Graph Convolutional Network (STS-GCN). STS-GCN encodes both the spatial joint-joint and the temporal time-time correlations with a joint spatio-temporal GCN. The single-graph framework favors the cross-talk of the body joint interactions and their temporal motion patterns. In addition to better performance, using the GCN-only model results in considerably fewer parameters.
To the best of our knowledge, STS-GCN is the first space-time separable GCN. We realize this by factorizing the space-time graph adjacency matrix $A^{st}$ into the product of a spatial and a temporal adjacency matrix, $A^s A^t$. Our intuition is that bottlenecking the cross-talk of the spatial joints and the temporal frames helps to improve the interplay of spatial joints and temporal patterns. This differs substantially from recent work [29, 5], which separates the graph interactions from the channel convolutions and is therefore depthwise separable. Still, both separable designs are advantageous for the reduction of model parameters.
Fig. 1 illustrates the encoder-decoder design of our model. Following the body motion encoding by the STS-GCN, the future pose coordinates are forecast with a few simple convolutional layers, generally termed Temporal Convolutional Network (TCN) [16, 4, 33], which are robust and fast to train.
Note from Fig. 1 that the factored graph adjacency matrices are learnt. This results in better performance and it allows us to interpret the joint-joint and the time-time interactions, as we further illustrate in Fig. 3 and in Sec. 4.
In extensive experiments over the modern, challenging and large-scale datasets of Human3.6M, AMASS and 3DPW, we demonstrate that STS-GCN improves over the state-of-the-art. Notably, STS-GCN outperforms the current best technique by over 32% on all three datasets, on average at the most difficult long-term predictions, while only adopting 1.7% of its parameters.
We summarize our main contributions as follows:
Our space-time human body representation is the first to exclusively use a GCN and it uses only 1.7% of the parameters of the current best competing technique;
The joint-joint and time-time graph edge weights are learnt, which allows us to explain their interactions.
Human pose forecasting is a long-standing problem. We discuss related work by distinguishing the temporal aspects of sequence modelling and the spatial representations. Finally we relate to separable convolutional networks.
Most recent work in human pose forecasting has leveraged Recurrent Neural Networks (RNN) [15, 25, 37, 11, 36, 35, 49], as well as recurrent variants such as Gated Recurrent Units (GRU) [53, 1] and Long Short-Term Memory Networks (LSTM). These techniques are flexible, but they have issues with long-term predictions, such as inefficient training and poor long-term memory [6, 30, 35, 36]. Research has attempted to tackle this, e.g. by training with generative adversarial networks and by imitation learning [41, 49]. Emerging trends have adopted (self-)attention to model time [36, 9], which also applies to modelling spatial relations [41, 9].
Representation of body joints Nearly all literature adopts 3D coordinates or angles. It has been noted that encoding residuals of coordinates, thus velocity, may be beneficial. [35, 36] have adopted the Discrete Cosine Transform (DCT), thus frequency, which is well suited to periodic motion. Here we experiment with 3D coordinates and angles, but those alternative representations are also compatible with our model.
Representation of human pose Graphs are a natural choice to represent the body. These have mostly been hand-designed, mainly leveraging the natural structure of the kinematic tree [25, 8, 51], and encoded via Graph Convolutional Networks (GCN). Other work learns the adjacency matrix of the graph, while still limiting the connectivity to the kinematic tree. Most recently, research has explored linking all joints together and learning the graph edges [35, 36]. Our model also lets the training learn a data-driven graph connectivity and edge weights (see Fig. 3 and Sec. 4 for an illustration).
Separable convolutions [40, 13, 22] decouple processing the cross-channel correlations via 1x1 convolutional filters and the spatial correlations via channel-wise spatial convolutions.
These are depthwise-separable convolutions, based on the hypothesis that the cross-channel and spatial correlations are sufficiently decoupled, so it is preferable not to map them jointly.
To the best of our knowledge, only two recent works apply this concept to GCNs, but they design different graph edge weights for different channels, in the spatial or spectral domain. By contrast, our STS-GCN is the first GCN design which separates the graph connectivity itself, by factoring the space-time adjacency matrix. In the spirit of depthwise-separable convolutions, our hypothesis is that the space-time cross-talk is limited and that decoupling space and time is more effective and efficient.
The proposed model first encodes the coordinates of the body joints observed in the given input frames, and then leverages the space-time representation to forecast the future joint coordinates. Encoding is modelled by the proposed STS-GCN graph, which considers the interaction of body joints over time, bottlenecking the space-time interplay. Decoding future coordinates is modelled with a TCN. In this section, we further provide insights into the STS-GCN model.
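We observe the body pose of a person, given by the 3D coordinates or angles of its $V$ joints, for $T$ frames. Then we predict the $V$ body joints for the next $K$ future frames.
We denote the joints by 3D vectors $x_{v,k}$ representing joint $v$ at time $k$. The motion history of human poses is denoted by the tensor $X_{in} = [X_1, X_2, \dots, X_T]$, which we construct out of matrices of 3D coordinates or angles of joints $X_i \in \mathbb{R}^{3 \times V}$ for frames $i = 1, \dots, T$. The goal is to predict the future $K$ poses $X_{out} = [X_{T+1}, X_{T+2}, \dots, X_{T+K}]$.
The motion history tensor is encoded into a graph which models the interaction of all body joints across all observed frames. We define the encoding graph $G = (\mathcal{V}, \mathcal{E})$, with $TV$ nodes $i \in \mathcal{V}$, which are all body joints across all observed time frames. Edges $(i, j) \in \mathcal{E}$ are represented by a spatio-temporal adjacency matrix $A^{st} \in \mathbb{R}^{VT \times VT}$, relating the interactions of all joints at all times.
The spatio-temporal dependencies of joints across times may be conveniently encoded by a GCN, a graph-based neural network model. The input to a graph convolutional layer $l$ is the tensor $H^{(l)} \in \mathbb{R}^{C^{(l)} \times V \times T}$, which encodes the observed $V$ joints in the $T$ frames.
$C^{(l)}$ is the input dimensionality of the hidden representation $H^{(l)}$. For the first layer, it is $H^{(1)} = X_{in}$ and $C^{(1)} = 3$.
A graph convolutional layer $l$ outputs $H^{(l+1)} \in \mathbb{R}^{C^{(l+1)} \times V \times T}$, given by the following:
$$H^{(l+1)} = \sigma\left(A^{st-(l)} H^{(l)} W^{(l)}\right) \qquad (1)$$
where $A^{st-(l)} \in \mathbb{R}^{VT \times VT}$ is the spatio-temporal adjacency matrix of layer $l$, $W^{(l)} \in \mathbb{R}^{C^{(l)} \times C^{(l+1)}}$ are the trainable graph convolutional weights of layer $l$ projecting each graph node from $C^{(l)}$ to $C^{(l+1)}$ dimensions, and $\sigma$ is an activation function.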
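As a rough illustration of a graph convolutional layer of this form, the following NumPy sketch applies an adjacency matrix, a projection and an activation to node features. The function name `gcn_layer`, the flattened node layout (one row per joint-frame pair), the ReLU activation and the random initialisation are assumptions made for the example and are not taken from the authors' code.

```python
import numpy as np

def gcn_layer(H, A_st, W):
    """One graph convolution of the form activation(A H W) (illustrative sketch).

    H:    (V*T, C_in)   node features, one row per (joint, frame) node
    A_st: (V*T, V*T)    space-time adjacency matrix
    W:    (C_in, C_out) projection weights
    """
    Z = A_st @ H @ W
    return np.maximum(Z, 0.0)  # ReLU stands in for the activation (assumption)

V, T, C_in, C_out = 22, 10, 3, 64      # 22 joints, 10 observed frames, 3D input
rng = np.random.default_rng(0)
X_in = rng.standard_normal((V * T, C_in))          # hypothetical input poses
A_st = rng.standard_normal((V * T, V * T)) * 0.01  # hypothetical trainable adjacency
W = rng.standard_normal((C_in, C_out))             # hypothetical trainable weights

H_next = gcn_layer(X_in, A_st, W)
print(H_next.shape)  # (220, 64)
```

Every one of the 220 nodes can exchange information with every other node here, which is what makes the full space-time adjacency matrix large.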
Two notable graph representations are worth mentioning for their robustness and performance. The first constrains the graph encoding to the joint-joint relations, thus to a spatial-only adjacency matrix $A^s$, only along the kinematic tree, and addresses the time-time relations by a convolutional layer with kernels across the temporal dimension, mapping frames to channels. The second, DCT-RNN-GCN, the current state-of-the-art in human pose forecasting, also adopts a spatial-only adjacency matrix $A^s$, but fully connected. In both cases, the adjacency matrices are trainable.
The proposed STS-GCN takes motivation from the interaction of the temporal evolution and the spatial joints, as well as from the belief that the joint-joint and time-time interplay is privileged. Human pose dynamics depend on 3 types of interactions: i. joint-joint; ii. time-time; and iii. joint-time. STS-GCN allows for all 3 types of interactions, but it bottlenecks the joint-time cross-talk.
The interplay of joints over time is modelled by relating the 3 types of relations within a single spatio-temporal encoding GCN. Bottlenecking the space-time cross-talk is realized by factoring the space-time adjacency matrix $A^{st}$ into the product of separate spatial and temporal adjacency matrices, $A^{st} = A^s A^t$. A separable space-time graph convolutional layer $l$ is therefore written as follows:
$$H^{(l+1)} = \sigma\left(A^{s-(l)} A^{t-(l)} H^{(l)} W^{(l)}\right) \qquad (2)$$
where the same notation as in Eq. (1) applies ($H^{(l)}$ is the layer input, $W^{(l)}$ the trainable projection weights and $\sigma$ the activation function), apart from the factored $A^{s-(l)} A^{t-(l)}$ of layer $l$, which we explain next.
The adjacency matrix $A^s$ is responsible for the joint-joint interplay. It has dimensionality $V \times V$, and it models the full joint-joint relations by trainable matrices for each instant in time (there are $T$ such matrices). Similarly, $A^t$ is responsible for the time-time relations. It has dimensionality $T \times T$ and it defines a full and trainable time-time relation matrix for each of the $V$ joints.
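The separable layer described above can be sketched with two Einstein summations: one mixing frames separately for each joint, and one mixing joints separately for each frame. This is a minimal NumPy illustration, not the authors' implementation. The function name `sts_gcn_layer`, the tensor layout `(channels, frames, joints)`, the fixed PReLU slope and the omission of batch normalization and residual connections are all assumptions made to keep the example short.

```python
import numpy as np

def sts_gcn_layer(H, A_s, A_t, W, slope=0.25):
    """Separable space-time graph convolution (illustrative sketch).

    H:   (C_in, T, V)  features per channel, frame and joint
    A_t: (V, T, T)     one time-time matrix per joint
    A_s: (T, V, V)     one joint-joint matrix per frame
    W:   (C_in, C_out) projection weights
    """
    Z = np.einsum('ctv,vtq->cqv', H, A_t)  # mix frames, separately for each joint
    Z = np.einsum('cqv,qvw->cqw', Z, A_s)  # mix joints, separately for each frame
    Z = np.einsum('cqw,cd->dqw', Z, W)     # project channels C_in -> C_out
    return np.where(Z > 0, Z, slope * Z)   # PReLU-style activation with a fixed slope

C_in, C_out, T, V = 3, 64, 10, 22
rng = np.random.default_rng(0)
H = rng.standard_normal((C_in, T, V))
A_s = rng.standard_normal((T, V, V)) * 0.1   # hypothetical initialisation
A_t = rng.standard_normal((V, T, T)) * 0.1   # hypothetical initialisation
W = rng.standard_normal((C_in, C_out))

out = sts_gcn_layer(H, A_s, A_t, W)
print(out.shape)  # (64, 10, 22)
```

Joint-time cross-talk is still possible, but only through the composition of the two factors, which is the bottleneck the text refers to.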
Note that Eq. 2 represents a single GCN layer, encoding the spatio-temporal interplay of the body dynamics. The factored space-time matrix bottlenecks the space-time cross-talk, reduces the model parameters and yields a considerable increase in the forecasting performance, as we illustrate in Sec. 4. Overall, the graph encoding employs four such GCN layers with residual connections and PReLU activation functions; cf. Sec. 4 for the implementation details.
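To make the parameter reduction concrete, the short calculation below counts adjacency entries for a single layer, comparing one full space-time matrix with the factored pair. It assumes V = 22 joints and T = 10 observed frames, the Human3.6M coordinate setting used in the experiments, and it deliberately ignores projection weights and the decoder, so the ratio is not the overall model-size ratio.

```python
V, T = 22, 10  # joints and observed frames (Human3.6M coordinate setting)

# One full (V*T) x (V*T) space-time adjacency matrix.
full_entries = (V * T) ** 2

# T joint-joint matrices of size V x V plus V time-time matrices of size T x T.
separable_entries = T * V * V + V * T * T

print(full_entries)                      # 48400
print(separable_entries)                 # 7040
print(full_entries / separable_entries)  # 6.875
```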
Also note that STS-GCN is the sole human pose forecasting graph encoding which exclusively uses GCNs. This contrasts with other competing techniques, which mostly encode time with recurrent neural networks [36, 35, 49, 11, 53, 1, 52], or by the use of convolutional layers with kernels across the temporal dimension [51, 10]. This is also a key element of parameter efficiency (see Sec. 4).
Here we first relate STS-GCN to self-attention mechanisms, then we comment on STS-GCN in relation to most recent work on signed and directed GCNs.
Separable graph convolutions and self-attention Most recent pose forecasting work has leveraged self-attention to encode the relation of frames [36, 9] and/or the relation of joints. Here we relate the proposed STS-GCN to the self-attention mechanisms of [36, 9]. Finally we relate these to Graph Attention Networks (GAT).
Let us first re-write part of the GCN layer of Eq. 1 with the Einstein summation, omitting the indication of layer $l$, the projection matrix $W$ and the non-linearity $\sigma$ for better clarity of notation:
$$H^{st}_{wqc} = A^{st}_{vtwq} H_{vtc} \qquad (3)$$
having explicitly indicated with indexes the dimensions of $A^{st}$ and of the layer input $H$, i.e. indexing spatial joints as $v, w$ and times with $t, q$, while $c$ indexes the channels.
Let us now re-write the corresponding part of the STS-GCN layer of Eq. 2 with the Einstein summation, again omitting the projection matrix and the non-linearity for clarity of notation:
$$H^{st}_{wqc} = \left(H_{vtc} A^{t}_{vtq}\right) A^{s}_{qvw} \qquad (4)$$
where, as above, we have indicated indexes for $A^s$ (for each of the T times) and $A^t$ (for each of the V joints) as $v, w$ for the spatial joints and as $t, q$ for the times.
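The relation between the full and the separable formulation can be checked numerically. The sketch below, under the assumption of a `(channels, frames, joints)` layout and hypothetical random matrices, builds the structured full space-time adjacency implied by a pair of factored matrices and verifies that both routes give the same output. The index letters are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
C, T, V = 3, 10, 22
H = rng.standard_normal((C, T, V))    # features: channel c, frame t, joint v
A_t = rng.standard_normal((V, T, T))  # per-joint time-time matrices, indexed [v, t, q]
A_s = rng.standard_normal((T, V, V))  # per-frame joint-joint matrices, indexed [q, v, w]

# Separable route: temporal mixing first (inner term), then spatial mixing.
inner = np.einsum('ctv,vtq->cqv', H, A_t)
separable = np.einsum('cqv,qvw->cqw', inner, A_s)

# Equivalent full space-time adjacency: entry (t, v) -> (q, w) is a product of factors.
A_st = np.einsum('vtq,qvw->tvqw', A_t, A_s)
full = np.einsum('ctv,tvqw->cqw', H, A_st)

print(np.allclose(separable, full))  # True
print(A_st.size, A_t.size + A_s.size)  # 48400 7040
```

The separable layer is therefore a full space-time graph convolution whose adjacency entries are constrained to products of a temporal and a spatial factor.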
Let us now turn to the current best technique for pose forecasting. It adopts a GCN for modelling the spatial interaction of joints at the same time, which coincides with the rightmost term in Eq. 4.
Its temporal modelling is however different from ours, as it adopts an attention formulation. Writing it with the Einstein summation yields:
Comparing the right term of Eq. 5 with the separable temporal GCN (the term within parentheses in Eq. 4), we note that this approach, modelling space and time with the different mechanisms of GCN and attention, may also be explained as a separable space-time GCN. The main difference is that its temporal attention matrix is a function of the product of inner representation vectors, both stemming from the input representation. By contrast, our temporal adjacency matrix learns the specific pair-wise interaction of relative time shifts. Similar arguments apply when comparing the proposed STS-GCN with the recent GAT. We evaluate the difference with respect to the current best technique quantitatively and conduct ablation studies on the adjacency matrices in Sec. 4.
Signed and Directed GCNs Let us now consider that the adjacency matrix $A^{st}$ and its factored terms, $A^s$ and $A^t$, are trainable parameters. Adjacency matrices were similarly trained in prior work: one approach considers a fully connected matrix, and another defines specific learnable parameters to multiply the manually-constructed graph (based on the kinematic tree and the sequential time connections). When encoding the spatio-temporal body dynamics, trainable parameters yield better performance and match the intuition, i.e. they learn the interaction between specific joints and at certain relative temporal offsets.
Trainable parameters result in signed and directed GCNs (see Fig. 3 for an illustration). Both aspects have been surveyed recently [7, 50]. In particular, recent work [43, 31, 3] maintains that directed graphs encode richer information from their neighborhood, instead of being limited to distance ranges. Similarly, recent work demonstrates the superior performance of signed GCNs.
Following the classification of [7, 50], the proposed STS-GCN and the GCNs of [36, 51] are spatial GCNs. This follows from their non-symmetric and possibly ill-posed signed Laplacian matrices, which do not have orthogonal eigendecompositions and are not easily interpretable by spectral-domain constructions. We maintain that this makes an interesting direction for future investigation, only partly addressed by very recent work.
Given the encoded observed body dynamics, the estimation of the 3D coordinates or angles of the body joints in the future is delegated to convolutional layers applying to the temporal dimension. These map the observed frames into the future horizon and refine the estimates via a multi-layered architecture.
Altogether, these layers make a decoder which is generally dubbed Temporal Convolutional Network (TCN) [16, 4, 33]. While several other sequence modelling options are available, including LSTM, GRU and Transformer Networks, here we adopt TCNs for their simplicity and robustness, in addition to their satisfactory performance.
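The idea of mapping observed frames into the future horizon with a convolution over the temporal dimension can be illustrated in its simplest form: a kernel-size-1 convolution that treats the T observed frames as input channels and the K future frames as output channels. This is a deliberately reduced sketch. The function name, the single layer, the kernel size and the tensor layout are assumptions, and a real TCN decoder would stack several layers with larger kernels.

```python
import numpy as np

def frames_as_channels_conv(Z, kernel, bias):
    """Kernel-size-1 temporal convolution with frames as channels (sketch).

    Z:      (T, V, C)  encoded features for T observed frames
    kernel: (K, T)     weights mapping T input frames to K output frames
    bias:   (K,)       one bias per output frame
    returns (K, V, C)
    """
    return np.einsum('kt,tvc->kvc', kernel, Z) + bias[:, None, None]

T, K, V, C = 10, 25, 22, 3   # 10 observed frames, 25 future frames
rng = np.random.default_rng(0)
Z = rng.standard_normal((T, V, C))
kernel = rng.standard_normal((K, T)) * 0.1  # hypothetical trainable weights
bias = np.zeros(K)

future = frames_as_channels_conv(Z, kernel, bias)
print(future.shape)  # (25, 22, 3)
```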
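The proposed architecture is trained end-to-end in a supervised manner. Supervision is provided by either of the losses that measure error with respect to ground truth in terms of Mean Per Joint Position Error (MPJPE) [24, 35] and Mean Angle Error (MAE) [37, 30, 18, 49, 36]. The loss based on MPJPE is:
$$L_{MPJPE} = \frac{1}{VK} \sum_{k=1}^{K} \sum_{v=1}^{V} \left\| \hat{x}_{vk} - x_{vk} \right\|_2$$
where $\hat{x}_{vk}$ denotes the predicted coordinates of the joint $v$ in the frame $k$ and $x_{vk}$ is the corresponding ground truth. The loss based on MAE is given by:
$$L_{MAE} = \frac{1}{VK} \sum_{k=1}^{K} \sum_{v=1}^{V} \left| \hat{x}_{vk} - x_{vk} \right|$$
where $\hat{x}_{vk}$ denotes the predicted joint angles in exponential map representation of the joint $v$ in the frame $k$ and $x_{vk}$ is its ground truth.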
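Both losses can be sketched in a few lines of NumPy. The array layout `(frames, joints, 3)` and the function names are assumptions for this example. For the angle loss, the sketch assumes the absolute differences are summed over the three angle components of each joint before averaging, which is one plausible reading and may differ from the authors' implementation.

```python
import numpy as np

def mpjpe_loss(pred, target):
    """Mean Euclidean distance between predicted and true joints.

    pred, target: (K, V, 3) coordinates for K frames and V joints.
    """
    return np.linalg.norm(pred - target, axis=-1).mean()

def mean_angle_loss(pred, target):
    """Mean absolute angle difference, summed over the 3 components (assumption)."""
    return np.abs(pred - target).sum(axis=-1).mean()

K, V = 25, 22
target = np.zeros((K, V, 3))
pred = target + np.array([3.0, 4.0, 0.0])  # every joint is off by the vector (3, 4, 0)

print(mpjpe_loss(pred, target))       # 5.0
print(mean_angle_loss(pred, target))  # 7.0
```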
We experimentally evaluate the proposed model against the state-of-the-art on three recent, large-scale and challenging benchmarks, Human3.6M, AMASS and 3DPW. Additionally we conduct ablation studies, evaluate the model qualitatively and illustrate what spatio-temporal graph is learnt from data.
Human3.6M The dataset is widely used for human pose forecasting and is large, consisting of 3.6 million 3D human poses and the corresponding images. It consists of 7 actors performing 15 different actions (e.g. Walking, Eating, Phoning). The actors are represented as skeletons of 32 joints. The orientations of joints are represented as exponential maps, from which the 3D coordinates may be computed [42, 15]. For each pose, we consider 22 joints out of the provided 32 for estimating MPJPE and 16 for the MAE. Following the current literature [35, 36, 37], we use subject 11 (S11) for validation, subject 5 (S5) for testing, and all the remaining subjects for training.
AMASS The Archive of Motion Capture as Surface Shapes (AMASS) dataset has been recently proposed to gather 18 existing mocap datasets. Following prior work, we select 13 of those and take 8 for training, 4 for validation and 1 (BMLrub) as the test set. Then we use the SMPL parameterization to derive a representation of human pose based on a shape vector, which defines the human skeleton, and its joint rotation angles. We obtain human poses in 3D by applying forward kinematics. Overall, AMASS consists of 40 human subjects that perform the action of walking. Each human pose is represented by 52 joints, including 22 body joints and 30 hand joints. Here we consider for forecasting the body joints only and discard from those 4 static ones, leading to an 18-joint human pose. As in prior work, these sequences are also downsampled to 25 fps.
3DPW The 3D Pose in the Wild dataset consists of video sequences acquired by a moving phone camera. 3DPW includes indoor and outdoor actions. Overall, it contains 51,000 frames captured at 30 Hz, divided into 60 video sequences. We use this dataset to test the generalization of the models which we train on AMASS.
Metrics Following the benchmark protocols, we adopt the MPJPE and MAE error metrics (see Sec. 3.6). The first quantifies the error of the 3D coordinate predictions in mm. The second measures the angle error in degrees. We follow the established protocol and compute MAE with Euler angles. Due to this representation, MAE suffers from an inherent ambiguity, and MPJPE is more effective [9, 2], so it is mostly adopted here.
Implementation details The graph encoding is given by 4 layers of STS-GCN, which only differ in the number of channels $C^{(l)}$: from 3 (the input 3D coordinates x, y, z or angles), to 64, then 32, 64 and finally 3 (cf. Sec. 3.3), by means of the projection matrices $W^{(l)}$. At each layer we adopt batch normalization and residual connections. Our code is in PyTorch and uses Adam as optimizer. The learning rate is set to 0.01 and decayed by a factor of 0.1 every 5 epochs after an initial training phase. The batch size is 256. On Human3.6M, training for 30 epochs on an NVIDIA RTX 2060 GPU takes 20 minutes.
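The channel schedule of the four encoder layers can be written out explicitly. The snippet below lists the input and output widths of each projection and counts their weights, assuming each projection is a plain matrix of size input channels by output channels without bias. That assumption is made for illustration and is not a statement about the released code.

```python
# Channel widths across the 4 STS-GCN layers: 3 -> 64 -> 32 -> 64 -> 3.
channels = [3, 64, 32, 64, 3]

# (input channels, output channels) of the projection matrix of each layer.
projection_shapes = list(zip(channels[:-1], channels[1:]))
print(projection_shapes)  # [(3, 64), (64, 32), (32, 64), (64, 3)]

# Number of projection weights, assuming bias-free C_in x C_out matrices.
n_weights = sum(c_in * c_out for c_in, c_out in projection_shapes)
print(n_weights)  # 4480
```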
We quantitatively evaluate our proposed model against the state-of-the-art both for short-term (< 500 msec) and long-term (> 500 msec) predictions.
We include into the comparison: ConvSeq2Seq, which adopts convolutional layers, separately encoding long- and short-term history; LTD-X-Y, which encodes the sequence frequency with a DCT, prior to a GCN (X and Y stand for the number of observed and predicted frames); BC-WGAIL-div, adopting reinforcement learning; and finally DCT-RNN-GCN, the current best performer, which extends LTD-X-Y with an RNN and motion-attention.
All algorithms take as input 10 frames (400 msec), with the exception of LTD, for which we also report the case of a larger number of input frames. The algorithms then predict future poses for the next 2 to 10 frames (80-400 msec) in the short-term case, and for 14-25 frames (560-1000 msec) in the long-term case.
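The frame counts and durations quoted above are consistent with sequences sampled at 25 fps, i.e. 40 msec per frame. A small helper makes the conversion explicit; the function name is chosen for this example.

```python
FPS = 25  # frames per second of the downsampled sequences

def frames_to_msec(n_frames, fps=FPS):
    """Duration in milliseconds covered by n_frames at the given frame rate."""
    return n_frames * 1000 // fps

for n in (2, 10, 14, 25):
    print(n, frames_to_msec(n))
# 2 80
# 10 400
# 14 560
# 25 1000
```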
Human3.6M: 3D Joint Positions Let us consider Tables 1 and 2 for the tests on short- and long-term prediction respectively. Across all time horizons in both tables, our model outperforms all competing techniques, with the only exception of 3 experiments out of 120 (2-frame predictions for Walking, Eating and Directions), where it is within a marginal error.
Considering the average errors in Table 1, the improvement of our model over the current best technique ranges from 3% in the case of 2 time frames, up to 34% for the more challenging case of 10 frames. Note that, at the 10-frame horizon, improvements are smaller in the case of periodic actions such as Walking (17%) but larger for aperiodic actions such as Posing (40%). We believe this is because of the DCT encoding of the current best technique, which is well suited to periodic motion.
We illustrate in Table 2 the more arduous long-term prediction horizons. Our predictions at 560 msec (14 frames) are more accurate than those of the current best technique by 27 mm, while at 1 sec (25 frames) our model reaches an improvement of 37 mm. On average across predictions over 14-25 frames, our model outperforms the current best technique by 34%.
Human3.6M: Joint Angles Average angle errors are reported in Table 3. Our model outperforms the current best technique, with larger improvements on the long-term horizon. The performance increase is 23% for 2 frames and 34% for 25 future frames.
AMASS Also in the case of AMASS, in Table 4, for short and long-term predictions of 3D coordinates, our model outperforms the state-of-the-art by 32% on the longest time horizon (25-frame, 1000 msec).
3DPW In Table 5, we test the generalizability of our model by training on AMASS and testing on 3DPW. Results are significantly beyond the state-of-the-art. For 2-frame predictions we reduce the error by 32%, compared to the second best. For any other time horizon above 4 frames, we reduce the error by at least 43%.
Qualitative evaluation We provide sample predictions (purple/green) in Fig. 2 on Human3.6M against ground truth sequences (gray/black). All predictions are long-term (25 frames) but we only display one every three frames for Discussion and Posing, to fit the illustrations into a row. Results are in line with the long-term error statistics of Table 2. The forecast Walking is accurate, with an average error of 5.2 cm at 25 frames (1 sec), and pictorially matches the ground truth. This shows how our model learns periodic motion well. Predicted future poses are also relatively accurate for Discussion, where the average error is 7.9 cm (competing algorithms are at nearly 12 cm). In this case, our model predicts well the mostly static pose of the discussing person, but the error is larger on the waving left hand. Finally, our model produces larger errors on Posing (10.6 cm on average), as it is a more challenging aperiodic action, which different people perform in different ways.
Table 6 illustrates the following ablative variants of our proposed STS-GCN encoding technique:
Distinct graphs $A^s$ and $A^t$ This stands for separate GCNs for space and time, with separate adjacency and projection matrices, intertwined by an activation function. The variant underperforms our proposed model, which confirms the importance of spatio-temporal interaction within a single graph. Interestingly, the errors are much larger for the short range (nearly 3x larger) than for the long range (+6% errors). We believe longer-term correlations may aid the variant.
Full (non-separable) graph The variant adopts a full space-time adjacency matrix $A^{st}$. We observe a similar trend as for distinct graphs, i.e. worse performance, with a larger error increase for short-term predictions (+18%) than for long-term ones (+9%). Notably, the full graph model requires nearly 4x more parameters than our proposed one, cf. rightmost column in Table 6.
Separable graph shared across layers This only differs as it learns shared adjacency matrices across all layers, rather than layer-specific ones. Errors are comparable in the long term (+2%) but larger in the short term (+12%), in exchange for saving 37% of the parameters.
In Fig. 3, we illustrate two learnt adjacency matrices, upon training on Human3.6M. On the left, we represent a spatial adjacency matrix $A^s$, i.e. imagine the red dots positioned on the 22 keypoints of a frontally posing Vitruvian man. The learnt edges are directed (as the learnt matrix is not symmetric) and signed (cf. Sec. 3.3), with weights color-coded as in the legend. For clarity of illustration, we represent the two strongest connections for each keypoint. Note how most learnt connections follow the kinematic tree, which confirms the importance of the physical linkage. However, additional strong connections also emerge, which bridge distant but motion-related joints, such as the two feet, the feet to the head, and the shoulders to the opposite hips, which intuitively interact for future pose prediction.
In Fig. 3 (right), we represent a temporal adjacency matrix $A^t$, also asymmetric and signed. The information flow from the earlier to the later observed frames is noticeable, so the bottom-left side of the matrix shows larger absolute values. In particular, most information is drawn to the last two frames (bottom two rows), corresponding to the 9th and 10th observed frames. Note also that the range of temporal relation coefficients is smaller than that of the spatial $A^s$, which privileges spatial information above the temporal when forecasting future poses.
We have proposed a novel Space-Time-Separable Graph Convolutional Network (STS-GCN) for pose forecasting. The single-graph framework favors the cross-talk of space and time, while bottlenecking the space-time interaction allows the model to better learn the fully-trainable joint-joint and time-time interactions. The model improves considerably on the state-of-the-art performance but only requires a fraction of the parameters. These results further support the adoption of GCNs and future research on them.
The authors wish to acknowledge Panasonic for partially supporting this work and the project of the Italian Ministry of Education, Universities and Research (MIUR) “Dipartimenti di Eccellenza 2018-2022”.
Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS).
The International Conference on Machine Learning (ICML) - Workshop on Graph Representation Learning and Beyond (GRL+ 2020).
Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42.
The Conference on Empirical Methods in Natural Language Processing (EMNLP).
Learning motion manifolds with convolutional autoencoders. In SIGGRAPH Asia 2015 Technical Briefs.
The International Joint Conference on Artificial Intelligence (IJCAI).