Consider the video sequences in Figure 1. Given the past frames of action, we can easily imagine the future motion of the athletes, whether hitting a tennis serve or pitching a baseball. In this paper, we consider this problem of predicting the motion of the person as a 3D mesh given some past image sequences. We propose a learning-based framework that given past frames can successively predict the future 3D mesh of the person in an autoregressive manner. Our model is trained on videos obtained in-the-wild without ground truth 3D annotations.
Learning generative models of sequences has a long tradition, particularly in language [5, 20] and speech generation. Our approach is the first counterpart to these approaches for 3D mesh motion generation from video. While there has been much interest in future prediction from video, most approaches focus on predicting 2D components such as 2D keypoints [54, 59], flow, or pixels [13, 16]. On the other hand, the several works that predict 3D human motion all take as input past 3D skeleton sequences obtained from motion capture data. To our knowledge, no previous approach explores the problem of 3D human motion prediction from video. 3D is a natural space for motion prediction with many practical applications such as human-robot interaction, where autonomous systems like self-driving cars or drones must operate safely around people from in-the-wild visual inputs. Our approach is trainable on videos without 3D annotations, providing a more abundant and natural source of information than 3D motion sequences obtained from a motion capture studio.
We introduce an autoregressive architecture for this task following recent successes in convolutional autoregressive sequence modeling approaches [4, 38]. A challenge introduced by the problem of 3D motion prediction from video is that the space of input (image frames) and that of output (3D meshes) are different. Following the advances in the language-modeling literature, we remedy this issue by learning a shared latent representation in which we can apply the future prediction model in an autoregressive manner.
We build on our previous work, which learns a latent representation for 3D human dynamics from the temporal context of image frames. We modify the approach so that the convolutional model has a causal structure, where only past temporal context is utilized. Once we learn a causal latent representation for 3D human motion from video, we train an autoregressive prediction model in this latent space. While our previous work demonstrated one-step future prediction, it can only take a single image as input and requires a separate future hallucinator for every time step. In contrast, in this paper, we learn an autoregressive model that can recurrently predict a longer-range future (arbitrarily many frames versus one frame) and is more stable because it takes advantage of a sequence of past image frames. To our knowledge, our work is the first study on predicting 3D human motion from image sequences. We demonstrate our approach on the Human3.6M dataset and the in-the-wild Penn Action dataset.
2 Related Work
Generative Modeling of Sequences. Generative sequence models have traditionally been built with recurrent networks [20, 43]. Feed-forward models with convolutional layers are also used for sequence modeling tasks, such as image generation and audio waveform generation. Recent studies suggest that these feed-forward models can outperform recurrent networks, or perform equivalently, while being parallelizable and easier to train with stable gradients. In this work, we also use feed-forward convolutional layers for our autoregressive future prediction model.
Visual Prediction. There are a number of methods that predict the future from video or images. Ryoo predicts future human activity classes from a video input. Kitani et al. predict possible trajectories of a person in the image from surveillance footage. Others predict paths of pedestrians from a stereo camera on a car, or anticipate action trajectories in a human-robot interaction from RGB-D videos. More recent deep learning-based approaches explore predicting denser 2D outputs such as raw pixels [46, 16, 14, 13], 2D flow fields, or more structured outputs like 2D human pose [54, 59].
There are also approaches that, from a single image, predict the future in the form of object dynamics, object trajectories, flow [41, 53, 31, 19], a difference image, action categories, or image representations. All of these approaches predict the future in 2D domains or as categories. In this work, we propose a framework that predicts 3D motions from video inputs.
3D Pose from Video. There has been much progress in recovering 3D human pose from RGB video. Common pose representations include 3D skeletons [34, 12, 36, 35, 39] and meshes [3, 15, 27, 44]. We build on our previous weakly-supervised, mesh-based method that can take advantage of large-scale Internet videos without ground truth annotations. While mesh-based methods do not necessarily have the lowest error on Human3.6M, recent work suggests that performance on Human3.6M does not correlate very well with performance on challenging in-the-wild video datasets such as 3DPW [48, 27].
3D to 3D Human Motion Prediction. Modeling the dynamics of humans has long-standing interest. Earlier works model the synthesis of human motion using techniques such as Hidden Markov Models, linear dynamical systems, bilinear spatiotemporal basis models, Gaussian process latent variable models [45, 56], and other variants [22, 55]. More recently, deep learning-based approaches use recurrent neural networks (RNNs) to predict 3D future human motion from past 3D human skeletons [18, 24, 9, 32, 47]. All of these approaches operate in the domain where the inputs are past 3D motion capture sequences. In contrast, our work predicts future 3D human motion from past 2D video inputs.
3D Future Prediction from Single Image. More related to our setting is the work of Chao et al., who, from a single image, predict 2D future pose sequences from which the 3D pose can be estimated. While this approach does produce 3D skeletal human motion sequences from a single image, the prediction happens in the 2D-to-2D domain. More recently, our previous work presents an approach that can predict a single future 3D human mesh from a single image by predicting the latent representation from which the future 3D human mesh can be recovered. While that approach requires learning a separate future predictor for every time step, we propose a single autoregressive model that can be reused to successively predict the future 3D meshes.
3 Approach
Our goal is to predict the future 3D mesh sequence of a human given past image sequences. Specifically, our input is a set of past image frames of a video $\{I_1, \dots, I_T\}$, and our output is a future sequence of 3D human meshes $\{\Theta_{T+1}, \dots, \Theta_{T+F}\}$. We represent each future 3D mesh $\Theta_t$ as consisting of pose parameters $\theta_t$ and shape parameters $\beta_t$.
We propose Predicting Human Dynamics (PHD), a neural autoregressive network for predicting human 3D mesh sequences from video. Our network is divided into two components: one that learns a latent representation of 3D human motion from video, and another that learns an autoregressive model of the latent representation from which the 3D human prediction may be recovered. Figure 2 shows an overview of the model. For the first part, we build upon our recent work, which learns a latent representation of 3D human motion from video. However, this approach is not causal since the receptive field is conditioned on past and future frames. Future prediction requires a causal structure to ensure that predictions do not depend on information from the future.
In this section, we first present an overview of the output 3D mesh representation. Then, we discuss the encoder model that learns a causal latent representation of human motion and an autoregressive model for future prediction in this latent space. Lastly, we explain our training procedures.
3.1 3D Mesh Representation
We represent the 3D mesh with 82 parameters consisting of pose and shape. We employ the SMPL 3D mesh body model, which is a differentiable function that outputs a triangular mesh with 6890 vertices given pose $\theta \in \mathbb{R}^{72}$ and shape $\beta \in \mathbb{R}^{10}$. The pose parameters contain the global rotation of the body and the relative rotations of 23 joints in axis-angle representation. The shape parameters are the linear coefficients of a PCA shape space. The SMPL function shapes a template mesh conditioned on $\theta$ and $\beta$, applies forward kinematics to articulate the mesh according to $\theta$, and deforms the surface via linear blend skinning. More details can be found in the SMPL paper.
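The 82 parameters thus decompose into 72 pose values (a global rotation plus 23 joint rotations, each a 3-D axis-angle vector) and 10 shape coefficients. A minimal sketch of this layout (the pose-then-shape ordering of the flat vector and the function name are our conventions for illustration):

```python
import numpy as np

def split_smpl_params(params):
    """Split an 82-D SMPL parameter vector into its components.

    params: (82,) vector assumed laid out as 72 pose parameters
    (global rotation + 23 joint rotations in axis-angle form)
    followed by 10 PCA shape coefficients.
    """
    assert params.shape == (82,)
    pose, shape = params[:72], params[72:]
    global_rot = pose[:3]                 # root orientation (axis-angle)
    joint_rots = pose[3:].reshape(23, 3)  # per-joint axis-angle rotations
    return global_rot, joint_rots, shape
```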
We use a weak-perspective camera model $(s, t_x, t_y)$ that represents scale and translation. From the mesh, we can extract the 3D coordinates of joints $X$ using a pre-trained linear regressor. From the 3D joints and camera parameters, we can compute the 2D projection, which we denote as $x$.
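Under this camera model, projection reduces to dropping the depth dimension, then scaling and translating. A minimal sketch (the function name is ours):

```python
import numpy as np

def project_weak_perspective(joints_3d, scale, trans):
    """Project 3D joints to 2D with a weak-perspective camera.

    joints_3d: (N, 3) array of 3D joint locations.
    scale: scalar camera scale s.
    trans: length-2 camera translation (t_x, t_y).
    Returns an (N, 2) array of projected 2D joints.
    """
    # Orthographic projection: drop depth, then scale and shift.
    return scale * joints_3d[:, :2] + np.asarray(trans)
```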
In this work, we use the SMPL mesh as a design decision, but many of the core concepts proposed could be extended to a skeletal model.
3.2 Causal Model of Human Motion
Table 1. Although conditioned only on past context, our causal model performs comparably with a non-causal model that can see both the past and the future. Both models have a receptive field of 13 frames, but our model uses edge padding for the convolutions while the non-causal model uses zero padding. MPJPE and reconstruction error are measured in mm; PCK is a percentage.
In this work, we train a neural autoregressive prediction model on the latent representation of 3D human motion encoded from the video. This allows a seamless transition between conditioning on past image frames and conditioning on previously generated future predictions.
In order to learn the latent representation, we build on our previous work, which learns a latent encoding of 3D human motion from the temporal context of image sequences. That work uses a series of 1D convolutional layers over the time dimension of per-frame image features to learn an encoding of an image context, whose context length is equal to the size of the receptive field of the convolutions. However, because the goal there was simply to obtain a smooth, temporally consistent representation of humans, the convolution kernel was centered, incorporating both past and future context. Since the goal in this work is to perform future prediction, we require our encoders to be causal: the encoding of past temporal context at time $t$ is convolved only with elements from time $t$ and earlier in the previous layers. Here we discuss our causal encoder for human motion from video.
Our input is a video sequence of a person where each frame is cropped and centered around the subject. Each video sequence is paired with 3D pose and shape parameters or 2D keypoint annotations. We use a pre-trained per-frame feature extractor to encode each image frame $I_t$ into a feature vector $\phi_t$. To encode 3D human dynamics, we train a causal temporal encoder $f_{\text{movie}}$ that produces a latent representation $\Phi_t$. Intuitively, $\Phi_t$ represents a “movie strip” that captures the temporal context of 3D human motion leading up to time $t$. This differs from the temporal encoder in our previous work, which captures the context centered at $t$. Now that we have a representation capturing the motion up to time $t$, we train a 3D regressor $f_{3D}$ that predicts the 3D human mesh $\Theta_t$ at $t$ as well as the camera parameters $(s, t_x, t_y)$. The temporal encoder and the 3D regressor are trained with 3D losses on videos that have ground truth 3D annotations: $L_{\text{3D}} = \|\Theta_t - \Theta_t^{*}\|_2^2 + \|X_t - X_t^{*}\|_2^2$, where $*$ denotes ground truth.
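The causal constraint can be illustrated with a toy single-channel temporal convolution: with left-only edge padding, the output at time $t$ never touches frames after $t$. This is a deliberate simplification; the real encoder uses learned multi-channel convolutions inside residual blocks:

```python
import numpy as np

def causal_conv1d(features, kernel):
    """Causal 1D convolution over time with edge padding on the left.

    features: (T, D) per-frame features.
    kernel: (K,) temporal filter shared across channels (a toy stand-in
            for a learned cross-channel convolution).
    The output at time t depends only on frames t-K+1 .. t.
    """
    K = len(kernel)
    # Edge-pad the past: early frames reuse the first frame's features.
    padded = np.concatenate([np.repeat(features[:1], K - 1, axis=0), features])
    T, D = features.shape
    out = np.empty((T, D))
    for t in range(T):
        # The window ends exactly at frame t: strictly causal.
        out[t] = kernel @ padded[t:t + K]
    return out
```

Perturbing a future frame leaves all earlier outputs unchanged, which is exactly the property the future-prediction setting requires.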
However, datasets with 3D annotations are generally limited. 3D supervision is costly to obtain, requiring expensive instrumentation that often confines the captured videos to controlled environments that do not accurately reflect the complexity of human appearances in the real world. To make use of in-the-wild datasets that only have 2D ground truth or pseudo-ground truth pose annotations, we train our models with a 2D re-projection loss on visible 2D keypoints: $L_{\text{2D}} = \sum_i v_i \|x_i - x_i^{*}\|$, where $v_i$ is the visibility indicator over ground truth keypoints $x_i^{*}$. We also use the factorized adversarial prior loss $L_{\text{adv}}$ proposed in [26, 27] to constrain the predicted poses to lie in the manifold of possible human poses. We regularize our shape using the shape prior $L_{\beta}$. Thus, for each frame $t$, the total loss is $L_t = L_{\text{3D}} + L_{\text{2D}} + L_{\text{adv}} + L_{\beta}$. As in our previous work, we include a loss $L_{\text{const}}$ to encourage the model to predict a consistent shape and to predict the mesh of nearby frames, encouraging the model to pay more attention to the temporal information in the movie strip at hand. With a receptive field of 13, the non-causal model uses both past and future neighboring frames, whereas we use only past frames since our model is causal. Altogether, the objective function per sequence for the causal temporal encoder is $L_{\text{movie}} = \sum_t L_t + L_{\text{const}}$.
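The visibility-masked re-projection loss can be sketched as follows (a minimal sketch; the function name and the choice of an L1 penalty per joint are illustrative assumptions):

```python
import numpy as np

def reprojection_loss(pred_2d, gt_2d, visibility):
    """Visibility-masked 2D keypoint re-projection loss.

    pred_2d, gt_2d: (N, 2) predicted and ground truth 2D keypoints.
    visibility: (N,) binary indicator; invisible joints contribute
    nothing to the loss.
    """
    diff = np.abs(pred_2d - gt_2d).sum(axis=1)  # per-joint L1 error
    return float((visibility * diff).sum())
```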
3.3 Autoregressive Prediction
Now that we have a latent representation $\Phi_t$ of the motion leading up to a moment in time, we wish to learn a prediction model that generates the future 3D human mesh given the latent movie-strip representation of the input video. We treat this problem as a sequence modeling task. One way of modeling the future distribution of 3D human motion is as a product of conditional probabilities of its past:

$P(\Phi_{T+1}, \dots, \Phi_{T+F} \mid \Phi_1, \dots, \Phi_T) = \prod_{s=1}^{F} P(\Phi_{T+s} \mid \Phi_1, \dots, \Phi_{T+s-1}).$
In practice, we condition on the past $r$ image features, where $r$ is the receptive field size of the causal convolution. Since the future is available at training time, the model can be trained in a self-supervised manner via a distillation loss that encourages the predicted movie strips to be close to the real movie strips: $L_{\Phi} = \|\hat{\Phi}_t - \Phi_t\|_2^2$, where $\Phi_t$ is the ground truth movie strip produced by the temporal encoder $f_{\text{movie}}$ and $\hat{\Phi}_t$ is the prediction. Moreover, since this latent representation should accurately capture the future state of the person, we use the 3D regressor $f_{3D}$ to read out the predicted meshes from the predicted movie strips. Without seeing the actual images, however, the model cannot make any meaningful camera predictions. To compute the re-projection loss without a predicted camera, we solve for the optimal camera parameters that best align the orthographically projected joints with the visible ground truth 2D joints $x^{*}$:

$s^{*}, t^{*} = \arg\min_{s,\, t} \sum_i v_i \, \| s \, \Pi(X_i) + t - x_i^{*} \|_2^2,$

where $\Pi(X)$ denotes the orthographically projected 3D joints ($X$ with the depth dimension dropped).
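This least-squares camera fit has a closed form: centering the visible projected joints and ground truth keypoints decouples the scale from the translation. A sketch (the function name is ours; it assumes at least two visible, non-coincident joints):

```python
import numpy as np

def solve_weak_perspective_camera(proj_3d, gt_2d, visibility):
    """Least-squares scale s and translation t aligning orthographically
    projected joints with visible ground truth 2D joints.

    proj_3d: (N, 2) 3D joints with the depth dimension dropped.
    gt_2d:   (N, 2) ground truth 2D keypoints.
    visibility: (N,) binary mask of visible joints.
    """
    v = visibility.astype(bool)
    P, x = proj_3d[v], gt_2d[v]
    mu_P, mu_x = P.mean(axis=0), x.mean(axis=0)
    Pc, xc = P - mu_P, x - mu_x
    # Optimal scale from the normal equations of the 1D least squares problem.
    s = (Pc * xc).sum() / (Pc * Pc).sum()
    t = mu_x - s * mu_P
    return s, t
```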
Now we can apply all the losses from Section 3.2 to future prediction. In summary, the total loss is $L_{\text{AR}} = \sum_t \left( L_{\Phi} + L_{\text{3D}} + L_{\text{2D}} + L_{\text{adv}} + L_{\beta} \right).$
To better compare with methods that perform 3D prediction from 3D input, we also study a variant of our approach that makes autoregressive predictions directly in the pose space, modeling $P(\Theta_{T+s} \mid \Theta_1, \dots, \Theta_{T+s-1})$ rather than predicting in the latent space.
3.4 Training Procedures
We employed a two-step training strategy. We first trained the temporal encoder and the 3D regressor, and then trained the autoregressive predictor. We froze the weights of the pre-trained ResNet and trained the temporal encoder and the 3D regressor jointly on the task of estimating 3D human meshes from video. After training converged, we froze these networks and trained the autoregressive predictor.
To train the autoregressive model, we employ a curriculum-based approach. When training sequence generation models, it is common to use teacher forcing, in which the ground truth is fed into the network as input at each step. However, at test time, the ground truth inputs are unavailable, resulting in drift since the model was not trained with its own predictions as input. To help address this, we train consecutive steps with the model's own outputs at previous time steps as inputs, similar to prior work. We slowly increase the number of consecutive predicted outputs fed as input to the autoregressive model, starting at 1 step and eventually reaching 25 steps.
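The curriculum can be sketched with a toy rollout loop; `predictor` stands in for the autoregressive network, and the scheduling policy shown here (ground truth inputs for early steps, self-conditioning for the last `num_self_steps` steps) is a simplification of the actual training procedure:

```python
import numpy as np

def rollout(predictor, gt_sequence, num_self_steps):
    """Autoregressive rollout with a scheduled-sampling curriculum.

    predictor: maps the previous state to the next state (toy stand-in
               for the autoregressive network).
    gt_sequence: (T,) or (T, D) ground truth states.
    num_self_steps: how many of the final steps are conditioned on the
                    model's own predictions instead of ground truth.
    """
    T = len(gt_sequence)
    preds = []
    prev = gt_sequence[0]
    for t in range(1, T):
        pred = predictor(prev)
        preds.append(pred)
        # Teacher forcing for early steps; self-conditioning for the
        # last `num_self_steps` steps. Increasing num_self_steps over
        # training implements the curriculum.
        prev = pred if t >= T - num_self_steps else gt_sequence[t]
    return np.stack(preds)
```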
While our approach can be conditioned on a larger past context by using dilated convolutions, our setting is bottlenecked by the length of the training videos. Recovering long human tracks from video is challenging due to occlusion and movement out of frame, and existing in-the-wild datasets of humans that we can train on tend to be short. Penn Action videos, for instance, have a median length of 56 frames. Since both the temporal encoder and the autoregressor have a receptive field of 13, 25 was near the upper bound of the number of autoregressive predictions we could make given our data. See the supplemental materials for further discussion.
4 Experiments
In this section, we present quantitative and qualitative results of 3D mesh motion generation from video.
4.1 Experimental Setup
Network Architecture. We use the pre-trained ResNet-50 provided by our previous work as our image feature encoder and use the average-pooled feature of the last layer as our per-frame feature $\phi_t \in \mathbb{R}^{2048}$. The causal temporal encoder and the autoregressive predictor both have the same architecture, consisting of 3 residual blocks. Following prior work, each block consists of GroupNorm, ReLU, 1D Convolution, GroupNorm, ReLU, 1D Convolution. Each 1D convolution uses a kernel size of 3 and a filter size of 2048. Unlike prior work, which uses zero padding, we use edge padding for the 1D convolutions. In total, the 3 residual blocks induce a receptive field of 13 frames (about 0.5 seconds at 25 fps). To make the encoder causal, we shift the output indices so that the prediction for time $t$ corresponds to the output that depends only on inputs up to time $t$ from the previous layers. The movie-strip representation also has 2048 dimensions. The autoregressive variant that predicts the future in the 3D mesh space has the same architecture except that it directly outputs the 82-D mesh parameters $\Theta$. For the 3D regressor, we use the same architecture as our previous work.
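The receptive field of 13 follows from stacking six kernel-3 convolutions (3 blocks of 2 convolutions each), each of which grows the receptive field by 2:

```python
def receptive_field(num_blocks, convs_per_block=2, kernel_size=3):
    """Receptive field of stacked 1D convolutions (stride 1, no dilation).

    Each kernel-size-k convolution grows the receptive field by k - 1,
    starting from a single frame.
    """
    return num_blocks * convs_per_block * (kernel_size - 1) + 1
```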
Datasets. We train on 4 datasets with different levels of supervision. The only dataset that has ground truth 3D annotations is Human3.6M, which contains videos of actors performing various activities in a motion capture studio. We use Subjects 1, 6, 7, and 8 as the training set, Subject 5 as the validation set, and Subjects 9 and 11 as the test set. Penn Action and NBA are datasets with 2D ground truth keypoint annotations of in-the-wild sports videos. Penn Action consists of 15 sports activities such as golfing or bowling, while NBA consists of videos of professional basketball players attempting 3-point shots. InstaVariety is a large-scale dataset of internet videos scraped from Instagram with pseudo-ground truth 2D keypoint annotations from OpenPose. We evaluate on the Human3.6M and Penn Action datasets, which have 3D and 2D ground truth respectively. We train on all of these videos together in an action- and dataset-agnostic manner.
4.2 Quantitative Evaluation
Dynamic Time Warping. Predicting the future motion is a highly challenging task. Even if we predict the correct type of motion, the actual start time and velocity of the motion are still ambiguous. Thus, for evaluation we employ Dynamic Time Warping (DTW), which is often used to compute the similarity between sequences that have different speeds. In particular, we compute the similarity between the ground truth and predicted future sequence after applying the optimal non-linear warping to both sequences. The optimal match maximizes the similarity of the time-warped ground truth joints and the time-warped predicted joints subject to the constraint that each set of ground truth joints must map to at least one set of predicted joints and vice-versa. In addition, the indices of the mapping must increase monotonically. For detailed evaluation without DTW as well as an example alignment after applying DTW, please see the Supplementary Materials.
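A standard DTW distance can be computed with dynamic programming; this minimal sketch uses per-frame Euclidean distance (the evaluation described above applies the recovered warp to the joint sequences before computing the error metrics):

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic Time Warping distance between two sequences.

    seq_a: (Ta, D) and seq_b: (Tb, D). Each frame of one sequence maps
    to at least one frame of the other, and the mapping indices increase
    monotonically, as required by the DTW constraints.
    """
    Ta, Tb = len(seq_a), len(seq_b)
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            # Extend the cheapest of: match, repeat-a, repeat-b.
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return float(cost[Ta, Tb])
```

Note that a sequence aligned with a time-stretched copy of itself has zero DTW distance, which is exactly why the metric is insensitive to differences in motion speed.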
Evaluation Procedures and Metrics. For Human3.6M, where ground truth 3D annotations exist, we report the reconstruction error in mm by computing the mean per joint position error after applying Procrustes alignment. For Penn Action, which only has 2D ground truth annotations, we measure the percentage of correct keypoints (PCK) at a normalized distance threshold. We begin making autoregressive predictions starting from every 25th frame for Human3.6M and from every frame for Penn Action, after conditioning on 15 input frames. Although we train with future prediction up to 25 frames, we evaluate all metrics for poses through 30 frames into the future.
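PCK can be sketched as follows (a minimal sketch; the threshold is assumed to be expressed in the same normalized units as the keypoints, and the function name is ours):

```python
import numpy as np

def pck(pred_2d, gt_2d, visibility, threshold):
    """Percentage of Correct Keypoints.

    Fraction of visible keypoints whose prediction lies within
    `threshold` of the ground truth (same units as the keypoints,
    e.g. distances normalized by person size).
    """
    dists = np.linalg.norm(pred_2d - gt_2d, axis=1)
    v = visibility.astype(bool)
    return float((dists[v] <= threshold).mean())
```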
Baselines. We propose a no-motion baseline (Constant) and a Nearest Neighbor baseline (NN), which we evaluate in Table 4. The no-motion baseline freezes the estimated pose corresponding to the last observed frame. We use the causal temporal encoder introduced in Section 3.2 for the 3D pose and 2D keypoint estimations.
The Nearest Neighbor baseline takes a window of input conditioning frames from the test set and computes the closest sequence in the training set using Euclidean distance of normalized 3D joints. The subsequent future frames are used as the prediction. We estimate the normalized 3D joints (mean SMPL shape) for each frame using our temporal encoder. See the Supplementary Materials for examples of Nearest Neighbors predictions.
Prediction Evaluations. In Table 4, we compare ablations of our method with both baselines. We evaluate our predictions in both the latent space and the pose space as proposed in Section 3.3. The results show that predictions in the latent space significantly outperform the predictions in the pose space, with the difference becoming increasingly apparent further into the future. This is unsurprising since the pose can always be read from the latent space, but the latent space can also capture additional information such as image context that may be useful for determining the action type. Thus, performance in the latent space should be at least as good as that in the pose space.
We also evaluate the effect of the distillation loss by removing $L_{\Phi}$. The performance diminishes slightly on Penn Action but is negligibly different on Human3.6M. It is possible that the latent representation learned with the distillation loss is more useful in the absence of 3D ground truth.
Finally, our method in the latent space significantly outperforms both baselines. The no-motion baseline performs reasonably at first, since it is initialized from the correct pose, but quickly deteriorates as the frozen pose no longer matches the motion of the sequence. On the flip side, the Nearest Neighbors baseline performs poorly at first due to the difficulty of aligning the global orientation of the root joint. However, on Penn Action, NN often identifies the correct action and eventually outperforms the no-motion baseline and the autoregressive predictions in the pose space.
Comparison with Single-Frame Future Prediction. In Table 3, we compare our method with single-frame future prediction: our previous work and the approach of Chao et al. To remain comparable, we retrained our previous model to forecast the pose 5 frames into the future and evaluate all methods on the 5th frame past the last observed frame. Note that our method is conditioned on a sequence of images in an autoregressive manner, while the single-frame methods hallucinate the future 3D pose and 2D keypoints, respectively, from a single image. Our method produces significant gains on the Penn Action dataset, where past context is valuable for future prediction on fast-moving sports sequences.
4.3 Qualitative Evaluation
Qualitative Analysis. We show qualitative results of our proposed method in the latent space on videos from Human3.6M, Penn Action, and NBA in Figure 3. We observe that the latent space model does not always regress the correct tempo but usually predicts the correct type of motion. On the other hand, the pose space model has significant difficulty predicting the type of motion itself unless the action is obvious from the pose (Situps and Pushups). See Figure 4 and the supplementary for comparisons between the latent and pose space models.
Capturing the “Statue” Moment. Classical sculptures are often characterized by dynamic poses ready to burst into action. Myron’s Discobolus (Figure 4(a)) is a canonical example of this, capturing the split-second before the athlete throws the discus. We show that the proposed framework for predicting 3D human motion from video can be used to discover such Classical “statue” moments from video, by finding the frame that spikes the prediction accuracy. In Figure 5, we visualize frames from Penn Action where the prediction accuracy increases the most for each sequence. Specifically, for each conditioning window in every Penn Action sequence, we computed the raw average future prediction accuracy for the following 15 frames. Then, we computed the per-frame change in accuracy using a low-pass difference filter and selected the window with the largest improvement. We find that the frame corresponding to the timestep when the accuracy improves the most effectively captures the “suggestive” moments in an action.
5 Conclusion
In this paper, we presented a new approach for predicting 3D human mesh motion from a video input of a person. We train an autoregressive model on the latent representation of the video, which allows the input conditioning to transition seamlessly from past video input to previously predicted futures. In principle, the proposed approach could predict arbitrarily long sequences in an autoregressive manner using 3D or 2D supervision. Our approach can be trained on motion capture video in addition to in-the-wild video with only 2D annotations.
Much more remains to be done. One of the biggest challenges is that of handling multimodality, since there can be multiple possible futures arising from inherent uncertainties such as the speed or type of motion. Other challenges include handling significant occlusions and incorporating the constraints imposed by the affordances of the 3D environment.
Acknowledgments. We would like to thank Ke Li for insightful discussion and Allan Jabri and Ashish Kumar for valuable feedback. We thank Alexei A. Efros for the statues. This work was supported in part by Intel/NSF VEC award IIS-1539099 and BAIR sponsors.
References
-  (2018) Deep lip reading: A comparison of models and an online application. In Interspeech, pp. 3514–3518. Cited by: §3.4.
-  (2012) Bilinear spatiotemporal basis models. SIGGRAPH 31 (2), pp. 17. Cited by: §2.
-  (2019) Exploiting temporal context for 3d human pose estimation in the wild. In CVPR, Cited by: §2.
-  (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271. Cited by: §1, §2, §3.2, §3.3, §3.4.
-  (2003) A neural probabilistic language model. JMLR 3 (Feb), pp. 1137–1155. Cited by: §1.
-  (2016) Keep it SMPL: automatic estimation of 3D human pose and shape from a single image. In ECCV, Cited by: §3.2.
-  (2000) Style machines. In SIGGRAPH, pp. 183–192. Cited by: §2.
-  (1997) Learning and recognizing human dynamics in video sequences. In CVPR, pp. 568–574. Cited by: §2.
-  (2017) Deep representation learning for human motion prediction and classification. In CVPR, pp. 2017. Cited by: §2.
-  (2017) Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, Cited by: §4.1.
-  (2017) Forecasting human dynamics from static images.. In CVPR, pp. 3643–3651. Cited by: §2, §4.2, Table 3, §4.
-  (2018) Structure-aware and temporally coherent 3d human pose estimation. ECCV. Cited by: §2.
-  (2018) Stochastic video generation with a learned prior. ICML. Cited by: §1, §2.
-  (2017) Unsupervised learning of disentangled representations from video. In NeurIPS, pp. 4414–4423. Cited by: §2.
-  (2019) Sim2real transfer learning for 3d pose estimation: motion to the rescue. arXiv preprint arXiv:1907.02499. Cited by: §2.
-  (2016) Unsupervised learning for physical interaction through video prediction. In NeurIPS, pp. 64–72. Cited by: §1, §2.
-  (2014) Predicting object dynamics in scenes. In CVPR, pp. 2019–2026. Cited by: §2.
-  (2015) Recurrent network models for human dynamics. In ICCV, pp. 4346–4354. Cited by: §2.
-  (2018) Im2Flow: motion hallucination from static images for action recognition. In CVPR, Cited by: §2.
-  (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Cited by: §1, §2.
-  (2016) Identity mappings in deep residual networks. In ECCV, pp. 630–645. Cited by: §4.1.
-  (2005) Style translation for human motion. In SIGGRAPH, Vol. 24, pp. 1082–1089. Cited by: §2.
-  (2014) Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. TPAMI 36 (7), pp. 1325–1339. Cited by: §1, §4.1.
-  (2016) Structural-rnn: deep learning on spatio-temporal graphs. In CVPR, pp. 5308–5317. Cited by: §2.
-  (2012) The discobolus. British Museum objects in focus, British Museum Press. Cited by: §4.3.
-  (2018) End-to-end recovery of human shape and pose. In CVPR, Cited by: §3.2, §3.4, §4.1.
-  (2019) Learning 3d human dynamics from video. In CVPR, Cited by: §1, §2, §2, §3.2, §3.2, §3.2, §3.2, Table 1, §3, §4.1, §4.1, §4.2, Table 3, §4.
-  (2012) Activity forecasting. In ECCV, pp. 201–214. Cited by: §2.
-  (2014) Context-based pedestrian path prediction. In ECCV, pp. 618–633. Cited by: §2.
-  (2016) Anticipating human activities using object affordances for reactive robotic response. TPAMI 38 (1), pp. 14–29. Cited by: §2.
-  (2018) Flow-grounded spatial-temporal video prediction from still images. In ECCV, Cited by: §2.
-  (2018) Auto-conditioned recurrent networks for extended complex human motion synthesis. ICLR. Cited by: §2, §3.4, §6.1.
-  (2015) SMPL: a skinned multi-person linear model. SIGGRAPH Asia. Cited by: §3.1.
-  (2017) A simple yet effective baseline for 3d human pose estimation. In ICCV, Cited by: §2.
-  (2019) XNect: real-time multi-person 3d human pose estimation with a single rgb camera. arXiv preprint arXiv:1907.00837. Cited by: §2.
-  (2017) VNect: real-time 3d human pose estimation with a single rgb camera. In SIGGRAPH, Cited by: §2.
-  (2019) Stable recurrent models. In ICLR, Cited by: §2.
-  (2016) Wavenet: a generative model for raw audio. arXiv preprint arXiv:1609.03499. Cited by: §1, §1, §2, §3.3.
-  (2019) 3D human pose estimation in video with temporal convolutions and semi-supervised training. In CVPR, Cited by: §2.
-  (2001) Learning switching linear models of human motion. In NeurIPS, pp. 981–987. Cited by: §2.
-  (2014) Déja vu. In ECCV, pp. 172–187. Cited by: §2.
-  (2011) Human activity prediction: early recognition of ongoing activities from streaming videos. In ICCV, pp. 1036–1043. Cited by: §2.
-  (2014) Sequence to sequence learning with neural networks. In NeurIPS, pp. 3104–3112. Cited by: §2.
-  Self-supervised learning of motion capture. In NeurIPS, pp. 5242–5252. Cited by: §2.
-  (2008) Topologically-constrained latent variable models. In ICML, pp. 1080–1087. Cited by: §2.
-  (2016) Conditional image generation with pixelcnn decoders. In NeurIPS, pp. 4790–4798. Cited by: §2, §2.
-  (2018) Neural kinematic networks for unsupervised motion retargetting. In CVPR, Cited by: §2.
-  (2018-09) Recovering accurate 3d human pose in the wild using imus and a moving camera. In ECCV, Cited by: §2.
-  (2016) Anticipating visual representations from unlabeled video. In CVPR, pp. 98–106. Cited by: §2.
-  (2014) Predicting actions from static scenes. In ECCV, pp. 421–436. Cited by: §2.
-  An uncertain future: forecasting from static images using variational autoencoders. In ECCV, pp. 835–851. Cited by: §1, §2.
-  (2014) Patch to the future: unsupervised visual prediction. In CVPR, pp. 3302–3309. Cited by: §2.
-  (2015) Dense optical flow prediction from a static image. In ICCV, pp. 2443–2451. Cited by: §2.
-  (2017) The pose knows: video forecasting by generating pose futures. In ICCV, Cited by: §1, §2.
-  (2007) Multifactor gaussian process models for style-content separation. In ICML, pp. 975–982. Cited by: §2.
-  (2008) Gaussian process dynamical models for human motion. TPAMI 30 (2), pp. 283–298. Cited by: §2.
-  (2016) Single image 3d interpreter network. In ECCV, Cited by: §3.2.
-  (2016) Visual dynamics: probabilistic future frame synthesis via cross convolutional networks. In NeurIPS, pp. 91–99. Cited by: §2.
-  (2018) MT-vae: learning motion transformations to generate multimodal human dynamics. In ECCV, pp. 265–281. Cited by: §1, §2.
-  (2013) Articulated human detection with flexible mixtures of parts. TPAMI 35 (12), pp. 2878–2890. Cited by: Table 1, §4.2.
-  (2013) From actemes to action: a strongly-supervised representation for detailed action understanding. In CVPR, pp. 2248–2255. Cited by: §1, §3.4, §4.1.
6 Supplementary Material
In this section, we provide:
- Discussion of the implementation details with limited sequence length in Section 6.1.
- A random sample of discovered “Statue” poses from Penn Action in Figure 7.
- An example of Dynamic Time Warping in Figure 8.
- A comparison of our method with the Constant and Nearest Neighbor baselines without Dynamic Time Warping in Table 8.
6.1 Implementation Details of Sequence Length
As discussed in the main paper, while our approach can be conditioned on a larger past context by using dilated convolutions, our setting is bottlenecked by the length of the training videos. Here we describe some implementation details for predicting long range future with short video tracks.
The length of consistent tracklets of human detections is limited given that people often walk out of the frame or become occluded. In Penn Action, for instance, the median video length is 56 frames. Thus, we chose to train on videos with at least 40 frames. Recall that to avoid drifting, we train the autoregressive predictor on its own predictions. Since the predictor has a receptive field of 13, our model must predict 14 timesteps into the future before it is fully conditioned on its own predicted movie strips. This is further complicated by the fact that each movie strip is also causal and has its own receptive field, again pushing back when the predictor can begin its first future prediction. In principle, the maximum number of ground truth images that could be conditioned on would be one less than the sum of the receptive fields of the temporal encoder and the predictor. For receptive fields of 13, this would be $13 + 13 - 1 = 25$ images. However, with tracklets that have a minimum length of 40 frames, this would leave just $40 - 25 = 15$ timesteps for future prediction. This means that just 2 predictions would be fully conditioned on previously predicted movie strips. To support future prediction of 25 frames with a sequence length of 40, we edge pad the first image such that the predictor is only conditioned on 15 images. This allows us to compute losses for 25 predictions into the future, leaving enough training samples in which the past input includes previous predictions. See the illustration in Figure 6.
Table 8 (excerpt). AR in the latent space without the distillation loss, at increasing prediction horizons. H3.6M Reconst. (mm): 56.9, 61.2, 64.9, 66.8. Penn Action PCK (%): 83.6, 80.4, 75.4, 70.2, 65.6, 59.0.