1 Introduction
Movies are a treasure trove of human “behavior episodes” [barker1955midwest]. They are produced in many different countries in multiple genres, giving us tremendous cultural diversity and range. Datasets, most prominently AVA [gu2018ava], have emerged that provide rich annotations of spatio-temporally localized human actions in movies. This would seem like ideal data on which to train systems for video understanding, and furthermore to use as a stepping stone for acquiring “common sense” from observations of diverse human behavior. This “visual” route could be complementary to the “linguistic” route to capturing common sense, and arguably more fundamental.
But before we go too far with our wishful thinking, we must confront a fundamental challenge of video data derived from movies – the complication of “shots”. Film has a grammar [arijon1976grammar]. Stories are communicated through a juxtaposition of shots, typically from different camera angles viewing the same scene. Alfred Hitchcock’s Rope and Sam Mendes’s 1917 are noteworthy precisely because they are presented as a single take, without any discernible breaks corresponding to shot boundaries.
These shot changes manifest as sudden discontinuities in video, as illustrated in Figure 1. Current temporal 3D human mesh and motion recovery methods, as well as most action classification algorithms, treat these shots as independent scenes, which reduces the rich potential of a film to a series of short, independent temporal sequences. We see this as a lost opportunity: despite the temporal discontinuities at the frame level, these shots depict a single underlying 4D scene seen from different viewpoints. Properly modeled, shot changes can thus be a powerful signal rather than noise, as they provide a multi-view cue about the underlying dynamic scene. This cue helps disambiguate the 3D pose and motion of humans, which is particularly valuable for close-up, heavily truncated images of people. Building on this insight, we propose a multi-shot optimization procedure that recovers a consistent 3D motion sequence across shot changes, simultaneously addressing the challenges of temporal fragmentation and partial humans.
The proposed multi-shot optimization recovers long 3D motion sequences from movies, which serve as a rich source of pseudo-ground-truth 3D training data for robust human mesh recovery from images or video. This workflow is illustrated in Figure 2. Training a single-frame human mesh recovery model [kanazawa2018end] on the recovered 3D meshes from the multi-shot optimization results in improved robustness against heavy truncation and other ambiguities. More importantly, in the video setting, we propose a pure transformer-based temporal human mesh and motion recovery model (t-HMMR), which we demonstrate to be not only competitive, but also a particularly suitable architecture for films. Aside from sudden shot changes, films pose the additional challenge that the same person may not be depicted in consecutive frames, since the edit may cut to another character or to the background. Transformers handle these situations naturally: they can explicitly avoid attending to frames that do not contain the person of interest, while still processing a larger temporal context before and after the missing input frames.
For our experiments, we employ AVA [gu2018ava], a large scale dataset of movies with atomic action annotations. We apply our multi-shot optimization on AVA, which results in over 350k frames of data with pseudo ground truth 3D. We refer to this dataset as Multi-Shot-AVA (MS-AVA), and we use it to train regression models for human mesh recovery, both from single image (HMR) and from video (t-HMMR). We demonstrate the importance of our multi-shot optimization and the benefit of the collected data through extensive experimentation on MS-AVA and the common benchmarks.
In summary, we introduce the problem of human mesh recovery from multiple shots and we propose a novel multi-shot optimization approach. This results in a new dataset, which is used to train a more robust single frame model for human mesh recovery and a pure transformer-based temporal model. Upon publication we will release our code and MS-AVA. We hope MS-AVA opens new opportunities for many future research directions.
2 Background
This section provides references to prior work and serves as background for our approach. The relevant literature is vast, so we only discuss the most relevant approaches here.
2.1 Human body modelling
Recent work in 3D human reconstruction has been influenced heavily by the availability of powerful human body models. The SMPL model [loper2015smpl] is one of the most popular choices that, among others, has enabled work on reconstruction [kanazawa2018end], prediction [zhang2019predicting], as well as imitation [peng2018sfv]. At a high level, one can consider SMPL as a function that takes as input pose parameters $\theta$ and shape parameters $\beta$ (collectively $\Theta = \{\theta, \beta\}$) and returns the 3D body mesh $M(\Theta)$ and joints $X(\Theta)$. Other body models follow similar formulations, with differences on the modelling side [haoyang2020blsm, osman2020star, xu2020ghum], or the expressivity of the model [anguelov2005scape, joo2018total, pavlakos2019expressive].
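To make this interface concrete, the snippet below sketches such a forward pass with the third-party smplx Python package (a choice made here for illustration; the paper does not prescribe a specific implementation, and the model directory path is a placeholder):

```python
import torch
import smplx

# "models/" is a hypothetical local directory holding the downloaded SMPL model files.
smpl = smplx.create("models/", model_type="smpl", gender="neutral")

theta_g = torch.zeros(1, 3)    # global orientation (axis-angle)
theta_b = torch.zeros(1, 69)   # body pose: 23 joints x 3 axis-angle values
beta = torch.zeros(1, 10)      # shape coefficients

output = smpl(global_orient=theta_g, body_pose=theta_b, betas=beta)
mesh = output.vertices         # (1, 6890, 3) body mesh M(Theta)
joints = output.joints         # (1, K, 3) 3D joints X(Theta)
```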
2.2 3D pose and shape from single image
Optimization: Reconstructing 3D pose and shape from a single image is often addressed in an optimization setting. In these approaches [bogo2016keep, guan2009estimating, huang2017towards, lassner2017unite, pavlakos2019expressive, zanfir2018monocular], a set of features is detected on an image (typically 2D keypoints), and a configuration of the body model is recovered such that it is consistent with those features. This requires a reprojection objective that penalizes deviations of the projected model from the detected features, as well as a set of prior objectives that encourage the reconstruction to be valid. At test time, the sum of these objectives is minimized in an iterative manner. The SMPLify methods [bogo2016keep, pavlakos2019expressive] are canonical examples of this type of approach for single-image reconstruction, but other settings have also been considered, e.g., reconstruction from multiple views [dong2020motion, huang2017towards] or from monocular video [arnab2019exploiting, kocabas2020vibe]. In this work, we present an approach that focuses on the setting of reconstruction from multiple shots.
Direct prediction: Directly regressing the SMPL parameters has seen many successes recently due to advances in deep learning. A canonical example is HMR [kanazawa2018end], which learns a direct mapping from raw RGB images to SMPL parameters and introduces design principles adopted by many follow-up works [arnab2019exploiting, kolotouros2019learning, georgakis2020hierarchical, rong2019delving, pavlakos2019texturepose]. More specifically, HMR consists of a feature encoder that converts an image $I$ to a feature representation $\phi$, followed by an iterative feedback regressor that maps the intermediate features to model parameters $\Theta$ and camera parameters $\pi$. Using the predicted camera parameters, the reconstructed mesh can be projected to the image, which enables supervision with reprojection losses given 2D annotations. Concurrently with HMR, other works have investigated decoupled regression approaches [tung2017self, choi2020pose2mesh, moon2020i2l, omran2018neural, pavlakos2018learning, song2020human, xu2019denserac], where the intermediate feature representation is hardcoded, e.g., 2D keypoints, silhouettes, semantic parts or dense correspondences.
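The minimal PyTorch sketch below illustrates this design pattern (an encoder feeding an iterative error-feedback regressor); the backbone, layer sizes, and iteration count are illustrative assumptions, not the exact HMR configuration:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class HMRStyleRegressor(nn.Module):
    def __init__(self, n_params=85, n_iters=3):  # 72 pose + 10 shape + 3 camera
        super().__init__()
        backbone = resnet50(weights=None)
        backbone.fc = nn.Identity()              # expose the 2048-d feature phi
        self.encoder = backbone
        self.regressor = nn.Sequential(
            nn.Linear(2048 + n_params, 1024), nn.ReLU(),
            nn.Linear(1024, n_params),
        )
        self.init_params = nn.Parameter(torch.zeros(1, n_params))
        self.n_iters = n_iters

    def forward(self, image):                    # image: (B, 3, 224, 224)
        phi = self.encoder(image)                # per-image features
        params = self.init_params.expand(phi.shape[0], -1)
        for _ in range(self.n_iters):            # iterative feedback updates
            params = params + self.regressor(torch.cat([phi, params], dim=-1))
        return params                            # SMPL parameters Theta and camera pi
```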
Limitations: Previous works [joo2020exemplar, rockwell2020full] have identified the limitations of related reconstruction approaches, particularly when it comes to heavy truncation of humans. Joo et al. [joo2020exemplar] propose augmentation with synthetically cropped examples, while Rockwell and Fouhey [rockwell2020full] retrain their model on its own confident reconstructions. In our work, we use complementary information from neighboring shots to improve the 3D reconstruction and to collect training examples that improve the robustness of our HMR model.
2.3 3D pose and shape from video
For video approaches, the goal is 3D reconstruction given a video sequence $V = \{I_t\}_{t=1}^{T}$ of length $T$. Video methods that follow up on HMR, e.g., [kanazawa2019learning, kocabas2020vibe, luo20203d], adopt a similar workflow with the addition of a temporal encoder, which maps the per-frame features $\phi_t$ to per-frame sequence features $\Phi_t$, from which the model and camera parameters for each frame are predicted via a 3D regressor. These methods often differ in the choice of architecture for the temporal encoder. Kanazawa et al. [kanazawa2019learning] use a convolutional model, Kocabas et al. [kocabas2020vibe] and Luo et al. [luo20203d] use recurrent models, while Sun et al. [sun2019human] use a hybrid model combining convolutions with self-attention. In our work, we propose a pure transformer model, based exclusively on self-attention. We show that the transformer encoder is not only a competitive and stable architecture for video, but also a suitable architecture for handling the missing identities that often occur in films.
2.4 Common video datasets
Early work on 3D human reconstruction relied on motion capture datasets with full 3D supervision, obtained via mocap [sigal2010humaneva, ionescu2013human3] or multi-view stereo [mehta2017monocular]. Penn Action [zhang2013actemes] contains videos of people performing sport actions in front of a camera. 3DPW [von2018recovering] is a recent video dataset with 3D ground truth of people outdoors, obtained via IMUs. While these datasets approach “in the wild” settings outside a motion capture lab, they are still designed for performance capture, where viewpoints are biased towards those that depict the entire body. This is in contrast to how people are depicted in films [gu2018ava, huang2020movienet] and other edited media [rockwell2020full], which abound with shot changes and close-up shots of people that induce heavy truncation. We propose both optimization and direct prediction methods that address these challenges, and introduce MS-AVA with the goal of further advancing 3D human mesh recovery from video.
3 Multi-shot optimization
Here we present the first step of our workflow based on multi-shot optimization. First, we describe the necessary preprocessing steps and the multi-shot optimization routine we use for pseudo ground truth generation. Then, we provide more details about the MS-AVA dataset we introduce.
3.1 Preprocessing steps
To apply our multi-shot optimization to a general video, we need the sequence of an individual within a scene, including shot changes. First, we detect 2D body joints using an off-the-shelf 2D pose tracker like OpenPose [cao2018openpose] or AlphaPose [fang2017rmpe]. While these methods obtain quite reliable 2D joint tracklets, they fail across shot boundaries. To extend tracklet duration, we run a shot detection algorithm [sidiropoulos2011temporal, rao2020local] and use a person re-identification network trained on movie data [huang2018person] to link identities across shots. The result is long 2D joint tracklets that extend beyond shot boundaries and serve as input to the multi-shot optimization.
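As a rough illustration of this preprocessing, the sketch below links per-shot tracklets into cross-shot ones using appearance embeddings; `detect_shots`, `run_pose_tracker`, and `reid_embed` are hypothetical stand-ins for the off-the-shelf tools mentioned above, and the greedy matching is a simplification of the actual pipeline:

```python
import numpy as np

def link_tracklets_across_shots(video_frames, reid_threshold=0.7):
    shots = detect_shots(video_frames)                  # list of (start, end) frame indices
    tracklets = []                                      # each: {"frames", "kpts", "feat"}
    for start, end in shots:
        for tr in run_pose_tracker(video_frames[start:end]):
            tr["feat"] = reid_embed(video_frames, tr)   # unit-norm appearance embedding
            tracklets.append(tr)

    # Greedy identity linking: merge non-overlapping tracklets whose embeddings match.
    merged = []
    for tr in sorted(tracklets, key=lambda t: t["frames"][0]):
        best, best_sim = None, reid_threshold
        for m in merged:
            sim = float(np.dot(m["feat"], tr["feat"]))  # cosine similarity
            if sim > best_sim and m["frames"][-1] < tr["frames"][0]:
                best, best_sim = m, sim
        if best is None:
            merged.append(tr)
        else:                                           # extend identity across the shot change
            best["frames"] += tr["frames"]
            best["kpts"] = np.concatenate([best["kpts"], tr["kpts"]])
    return merged
```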
3.2 Multi-shot optimization
Relying on the insight that the input shots depict a single underlying 4D scene, we propose a multi-shot optimization method that recovers a reliable 3D human mesh across shot changes. To make this concrete, let us consider the case where we have access to two consecutive frames $t$ and $t+1$, before and after the shot boundary respectively. In the SMPLify [bogo2016keep] setting, we have data terms and prior terms for both frames. For sequence input, relevant methods [arnab2019exploiting, kocabas2020vibe] additionally include smoothness regularization in the camera frame. Since the shot (and therefore the viewpoint) has changed, this is not applicable here; these methods would treat the shot change as a discontinuity and reconstruct each frame independently.
In contrast, to leverage the continuity of the 3D body structure, we apply smoothness regularization in the canonical frame. More specifically, we explicitly decompose the pose parameters $\theta$ into global orientation $\theta_g$ and body pose parameters $\theta_b$. By undoing the global orientation, we can compute the body joints $\hat{X}_t$ in the canonical space. This formulation factors out the camera motion, which can be abrupt, and imposes the smoothness terms only in the canonical frame:

$$E_{\textrm{smooth}}^{3D} = \sum_t \big\| \hat{X}_{t+1} - \hat{X}_t \big\|^2, \qquad (1)$$

$$E_{\textrm{smooth}}^{\theta} = \sum_t \big\| \theta_{b,t+1} - \theta_{b,t} \big\|^2. \qquad (2)$$
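A minimal sketch of these two terms, assuming an smplx-style SMPL module and per-frame parameter tensors (the exact weighting and robust error functions of the full objective are omitted):

```python
import torch

def canonical_smoothness(smpl, betas, body_pose):
    """Eqs. (1)-(2) in sketch form: smoothness on joints computed in the
    canonical frame (global orientation removed) and on body pose parameters.
    All tensors are stacked over the T frames of a tracklet."""
    T = body_pose.shape[0]
    # Undo the global orientation by evaluating the body with zero rotation,
    # so that the joints live in a viewpoint-independent canonical frame.
    out = smpl(betas=betas, body_pose=body_pose,
               global_orient=torch.zeros(T, 3, device=body_pose.device))
    joints_can = out.joints                                            # (T, K, 3)

    e_smooth_3d = ((joints_can[1:] - joints_can[:-1]) ** 2).sum()      # Eq. (1)
    e_smooth_theta = ((body_pose[1:] - body_pose[:-1]) ** 2).sum()     # Eq. (2)
    return e_smooth_3d, e_smooth_theta
```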
The sum of objectives is optimized over the entire sequence of length $T$:

$$\min_{\{\Theta_t\}_{t=1}^{T}} \; \sum_{t=1}^{T} \Big( E_{\textrm{data}}(\Theta_t) + E_{\textrm{prior}}(\Theta_t) \Big) + E_{\textrm{smooth}}^{3D} + E_{\textrm{smooth}}^{\theta}, \qquad (3)$$

returning model parameters $\Theta_t$ for every frame $t$ of the sequence. For faster convergence to a more accurate solution, we initialize our reconstruction with pose and shape estimates provided by a regression network [kolotouros2019learning].

3.3 Multi-Shot AVA dataset
Although the above workflow is applicable to many videos from TV series or movies, in this work we focus primarily on the AVA dataset [gu2018ava]. AVA contains 300 movies annotated with human bounding boxes and atomic actions. Bounding box annotations are available at 1fps and organized in short tracklets. We also process the data at 1fps, but apply our preprocessing to extend the tracklet duration (i.e., link short tracklets of the same identity). Each tracklet is reconstructed in 3D with our multi-shot optimization (Section 3.2). We call this new dataset Multi-Shot AVA, or MS-AVA.
Two important features of MS-AVA are its diverse and challenging visual conditions (e.g., truncation), and the length and number of the sequences it includes. In contrast to previous datasets [ionescu2013human3, von2018recovering, zhang2013actemes], with MS-AVA we are able to recover many long and diverse tracklets. Two factors help us achieve this: a) connecting tracklets across shots, and b) keeping track of person identities even when they are missing from some frames. Both allow us to connect smaller, potentially over-fragmented subsequences into longer sequences, useful for training temporal models. This is highlighted in Table 1. More specifically, compared to single-shot processing (“AVA single-shot”), we increase the number of long tracklets by considering identities that appear consecutively across shots (“AVA continuous identity”), or by re-identifying them after an absence of some frames (“MS-AVA”). Examples of such sequences are illustrated in Figure 3.
| Dataset | Length (hours) All / Long | #Tracklets All / Long |
|---|---|---|
| Human3.6M [ionescu2013human3] | 6.45 / 6.45 | 330 / 330 |
| 3DPW [von2018recovering] | 0.68 / 0.56 | 87 / 58 |
| Penn Action [zhang2013actemes] | 0.85 / 0 | 2.3k / 0 |
| AVA single-shot | 75.1 / 11.1 | 5.1k / 1.3k |
| AVA continuous identity | 78.4 / 17.1 | 6.3k / 1.8k |
| MS-AVA | 81.1 / 20.2 | 6.7k / 2.1k |

Finally, our insight that pose changes smoothly across the shot boundary also offers an opportunity to evaluate the 3D accuracy of the recovered human mesh from monocular sequences via novel-view evaluation. Specifically, given a shot change from frame $t$ to frame $t+1$, we project the shape of frame $t$ to frame $t+1$, and vice versa (see Figure 4 for an illustration). This allows us to evaluate the predicted shape using 2D reprojection metrics, e.g., PCK [yang2012articulated]. We refer to this metric as cross-shot PCK and use it to evaluate 3D shape quality on AVA, where only 2D keypoints are available.
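One plausible instantiation of this metric is sketched below: the body pose recovered just before the boundary is rendered with the orientation and camera recovered just after it, and scored against the post-boundary 2D keypoints. `project_joints` is a hypothetical helper (mapping pose, orientation, and a weak-perspective camera to 2D joints), and the threshold is illustrative:

```python
import numpy as np

def pck(pred_2d, gt_2d, visible, thresh):
    """Fraction of visible keypoints whose reprojection error is below `thresh` pixels."""
    err = np.linalg.norm(pred_2d - gt_2d, axis=-1)
    return float((err[visible > 0] < thresh).mean())

def cross_shot_pck(project_joints, body_pose_t, orient_t1, cam_t1,
                   kpts_t1, visible_t1, thresh=10.0):
    # Combine the pre-boundary body pose with the post-boundary viewpoint,
    # then evaluate the reprojection on the post-boundary 2D annotations.
    pred_2d = project_joints(body_pose_t, orient_t1, cam_t1)
    return pck(pred_2d, kpts_t1, visible_t1, thresh)
```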


4 Human Mesh Recovery
The 3D motion sequences recovered with the offline multi-shot optimization step offer a rich source of data with pseudo-ground-truth 3D bodies. Here, we demonstrate how to incorporate this data into the training of direct prediction models for human mesh recovery from single images or video, without relying on keypoint detections.
4.1 Single-frame model
The first step is to train an updated single-frame model. In general, the setting is similar to the original HMR [kanazawa2018end]. For frame $t$, let our image encoder predict model parameters $\Theta_t$ and camera parameters $\pi_t$. The model joints are projected to 2D locations $x_t$. Supervision for the network comes from the output of the multi-shot optimization for the corresponding frame, $\hat{\Theta}_t$, and the detected 2D joints $\hat{x}_t$:

$$L_{3D} = \big\| \Theta_t - \hat{\Theta}_t \big\|^2, \qquad (4)$$

$$L_{2D} = \big\| x_t - \hat{x}_t \big\|^2. \qquad (5)$$
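A compact sketch of these two losses follows; the loss weights are illustrative, and weighting the 2D term by per-keypoint detection confidence is a common practice we assume here rather than a detail stated above:

```python
import torch

def single_frame_losses(pred_theta, pred_kpts2d, gt_theta, det_kpts2d, conf,
                        w_3d=1.0, w_2d=1.0):
    # Eq. (4): parameter loss against the multi-shot pseudo ground truth.
    loss_3d = ((pred_theta - gt_theta) ** 2).mean()
    # Eq. (5): reprojection loss against detected 2D joints, weighted by confidence.
    loss_2d = (conf.unsqueeze(-1) * (pred_kpts2d - det_kpts2d) ** 2).mean()
    return w_3d * loss_3d + w_2d * loss_2d
```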
Our experiments show that the diversity and the challenging visual conditions (e.g., truncation) of MS-AVA help improve the robustness of our single-frame model.
4.2 Temporal model
Using the updated and more robust HMR, we proceed to learning the temporal encoding function. In the past, this function has been represented by convolutional [kanazawa2019learning], recurrent [kocabas2020vibe] or hybrid encoders [sun2019human]. However, in those cases the temporal training data come from curated collections of clean videos with continuous person tracking [ionescu2013human3, mehta2017monocular, zhang2013actemes]. In contrast, in more general use cases, including MS-AVA, video data can be more challenging, with issues like shot changes or frames where the person of interest is not present (due to occlusion, tracking failures or re-identification failures). These cases are not easily handled by convolutional or recurrent encoders, which would require padding the inputs with zeros or concatenating all valid frames together (i.e., ignoring the time difference between consecutive frames).
To address these limitations, we propose t-HMMR, a temporal model based on a pure transformer architecture [vaswani2017attention]. Transformers include an attention mechanism that allows us to explicitly select the elements of the input sequence they will attend to. This is a convenient feature, particularly given the discontinuous nature of MS-AVA sequences (see Figure 3). Considering only the sequences where a person appears continuously without identity changes limits the available training sequences (see Table 1 for MS-AVA). The proposed transformer model addresses this challenge elegantly, making effective use of all the available sequences.
Our transformer encoder takes as input the intermediate HMR features $\phi_t$ for a sequence of $T$ frames. The sequence comes with a scalar value $v_t$ per frame, which indicates whether the person is present in frame $t$ ($v_t = 1$) or not ($v_t = 0$). A fixed positional encoding is added to the input features to indicate the time instance of each input element. The updated features are then processed by a transformer encoder layer. This follows the architecture of the original transformer model, including a self-attention mechanism and a shallow feedforward network. The values $v_t$ are used to ensure that invalid input frames do not contribute to the self-attention computation. The output of the transformer layer is added to the input features through a residual connection, and the result of this block is the video feature representation $\Phi_t$. This is illustrated in Figure 5.

For training the transformer encoder, following prior work [kanazawa2019learning, kocabas2020vibe], we fix the weights of the image encoder and only update the temporal encoder and the parameter regressor. Similarly to the single-frame model, supervision is provided by the multi-shot optimization results, with losses corresponding to Equations 4 and 5 for each frame $t$. To further encourage temporal consistency, smoothness losses are also applied on the 3D joints and the 3D model parameters (equivalent to Equations 1 and 2, respectively).
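The PyTorch sketch below illustrates the core mechanism: per-frame features with a fixed sinusoidal positional encoding pass through a self-attention encoder whose key padding mask excludes frames where the person is absent, followed by a residual connection to the input features. Layer sizes and depth are assumptions, not the exact t-HMMR configuration:

```python
import torch
import torch.nn as nn

def sinusoidal_encoding(T, dim, device):
    """Fixed (non-learned) positional encoding, as in the original transformer."""
    pos = torch.arange(T, device=device, dtype=torch.float32).unsqueeze(1)
    idx = torch.arange(0, dim, 2, device=device, dtype=torch.float32)
    angles = pos / (10000.0 ** (idx / dim))
    pe = torch.zeros(T, dim, device=device)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

class TemporalTransformerEncoder(nn.Module):
    def __init__(self, feat_dim=2048, n_heads=8, ff_dim=1024, n_layers=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           dim_feedforward=ff_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, phi, valid):
        # phi: (B, T, feat_dim) per-frame HMR features; valid: (B, T), 1 if the
        # person is present in the frame and 0 otherwise.
        B, T, D = phi.shape
        x = phi + sinusoidal_encoding(T, D, phi.device)
        # Frames with valid == 0 are masked out of the self-attention computation.
        out = self.encoder(x, src_key_padding_mask=(valid == 0))
        return phi + out   # residual connection on top of the per-frame features
```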
5 Experiments
The focus of our quantitative evaluation is three-fold: we first demonstrate the quality of our multi-shot optimization results; then we investigate the effect of our data on improving the performance and robustness of a single-frame human mesh recovery model; finally, we examine our temporal model and validate the importance of leveraging the transformer architecture. The following paragraphs present each of these aspects in detail.
5.1 Implementation details
For our HMR baseline, we retrain an HMR model on the standard datasets (Human3.6M [ionescu2013human3], COCO [lin2014microsoft], MPII [andriluka20142d]), using pseudo ground truth SMPL parameters from SPIN [kolotouros2019learning] and incorporating the cropping augmentation scheme [joo2018total, rockwell2020full]. We use this baseline to initialize our multi-shot optimization and for ablative experiments. After the offline multi-shot optimization, our final HMR model is trained with the same strategy, including the data from MS-AVA. For the t-HMMR model, we freeze the image encoder of HMR, as done in [kanazawa2019learning, kocabas2020vibe] for computational reasons, and train only the temporal encoder and 3D regressor.

5.2 Quantitative evaluation
Multi-shot optimization: Our multi-shot optimization integrates information across the shot boundary to improve pose reconstruction. To evaluate its success, we use the cross-shot PCK metric discussed in Section 3.3. As a sanity check, we compare with two optimization-based baselines: one that considers a single frame only [bogo2016keep], and one that considers a temporal sequence without shot changes [arnab2019exploiting, kocabas2020vibe]. The results are presented in Figure 6. As expected, the multi-shot optimization outperforms the two baselines, since the formulation of the objective function allows the integration of information from multiple shots, and the large performance gap indicates that it is very successful in this regard. Qualitative examples of this behavior are presented in Figure 7 and in the Supplementary.


Single-frame model: The introduced MS-AVA dataset provides a rich source of data for training our single-frame model for human mesh recovery. Given the nature of the data, it can help improve the robustness of our model, particularly to truncation, a common failure mode for state-of-the-art models [rockwell2020full]. We demonstrate this on the Partial Humans dataset [rockwell2020full] in Table 2. Our model trained with the MS-AVA data outperforms the state-of-the-art approaches. Many of these methods are trained primarily with full-body images and therefore fail significantly under truncation, whereas our network remains robust in these cases. This robustness is illustrated in Figure 8, where we provide qualitative results for different methods on the “Uncropped” subset of Partial Humans. Moreover, our HMR model trained on MS-AVA outperforms our baseline trained with synthetic cropping augmentation, underlining the effect of the MS-AVA data in achieving better robustness. Finally, we also evaluate on the 3DPW dataset [von2018recovering] in Table 3, where we compare with the most relevant approaches and observe the same trends.
PCKh on Partial Humans, Uncropped subset:

| Method | VLOG | YouCook | Instr | Cross |
|---|---|---|---|---|
| HMR [kanazawa2018end] | 81.2 | 93.6 | 86.9 | 92.7 |
| GraphCMR [kolotouros2019convolutional] | 65.7 | 80.1 | 77.5 | 79.3 |
| SPIN [kolotouros2019learning] | 73.4 | 85.1 | 85.6 | 85.5 |
| Partial Humans [rockwell2020full] | 68.7 | 95.4 | 77.9 | 91.1 |
| HMR (retrained) | 86.9 | 96.8 | 92.4 | 96.2 |
| + MS-AVA data | 90.3 | 98.9 | 94.1 | 98.2 |

| Method | PA-MPJPE (mm) |
|---|---|
| HMR [kanazawa2018end] | 81.3 |
| SPIN [kolotouros2019learning] | 59.2 |
| HMR (retrained) | 59.2 |
| + MS-AVA data | 57.8 |
Temporal model: For our temporal model, we first investigate the transformer encoder from an architecture viewpoint, factoring out the effect of challenging/incomplete data. To this end, we compare it directly with other architecture choices for temporal encoders, i.e., the convolutional encoder of HMMR [kanazawa2019learning] and the recurrent encoder of VIBE [kocabas2020vibe]. For a direct comparison, we follow the exact training data and schedule of VIBE using their public implementation, replacing only the temporal encoder. We perform multiple training runs and plot validation performance in Figure 9. If we consider the best performance across all iterations, all models achieve similar results. However, the convolutional and recurrent encoders tend to diverge after a few epochs, while the transformer encoder is more stable. At the same time, this version still achieves state-of-the-art results (PA-MPJPE of 56.1mm vs 56.5mm for VIBE on the 3DPW test set), even though it is trained on less data than the full VIBE model.

Next, we also demonstrate the suitability of the transformer model when learning and testing on incomplete data, e.g., on movies. In these cases, architectures other than the transformer do not have a straightforward way to handle missing frames. We experimented with different alternatives and found that padding missing frames with zero features performed best for the convolutional and the recurrent encoders. Using this strategy, we report cross-shot PCK results on AVA for different encoder choices in Table 4. In the first setting, we test on the clean sequences (“test: clean”, i.e., consecutive sequences without missing frames), training on our clean sequences (“train: clean”) from AVA. In this setting all models perform well; however, when we add the sequences with missing frames to the training (“train: all”), the transformer model benefits the most from the additional training data. In the second setting, we test on all sequences (“test: all”). The transformer still performs best in this more challenging case.
Finally, in Figure 10 we provide example reconstructions from our temporal t-HMMR model, in comparison with the single-frame HMR model, both trained on MS-AVA. While HMR obtains reasonable results, the output of t-HMMR is more consistent, as it leverages a larger temporal context.
| Temporal Model | train: clean / test: clean | train: all / test: clean | train: all / test: all |
|---|---|---|---|
| Convolutional | 68.8 | 69.7 | 67.5 |
| Recurrent | 70.8 | 70.8 | 68.5 |
| Transformer | 70.7 | 73.8 | 70.7 |

6 Conclusion
We introduce the new task of 3D human reconstruction from multiple shots, proposing a framework to generate training data from edited media like movies and to use it for different human mesh recovery tasks. Our contributions span all three major forms of human mesh recovery: offline iterative optimization, single-frame prediction, and temporal prediction. Our experiments demonstrate the significance of our contributions in each of these aspects. Future work can further investigate the available data of MS-AVA for other downstream applications, or apply our data generation framework to other edited media.
Acknowledgements: This research was supported by BAIR sponsors.