Bilevel Online Adaptation for Human Mesh Reconstruction
This paper considers a new problem of adapting a pre-trained model of human mesh reconstruction to out-of-domain streaming videos. Most previous methods based on the parametric SMPL model underperform in new domains with unexpected, domain-specific attributes, such as camera parameters, lengths of bones, backgrounds, and occlusions. Our general idea is to dynamically fine-tune the source model on test video streams with additional temporal constraints, such that it can mitigate the domain gaps without over-fitting the 2D information of individual test frames. A subsequent challenge is how to avoid conflicts between the 2D and temporal constraints. We propose to tackle this problem using a new training algorithm named Bilevel Online Adaptation (BOA), which divides the optimization of the overall multi-objective into two steps, weight probe and weight update, within each training iteration. We demonstrate that BOA leads to state-of-the-art results on two human mesh reconstruction benchmarks.
Human mesh reconstruction is a hot topic in computer vision, where improving the generalization ability is one of the major challenges at present. We observe that previous models [kanazawa2018end, Moon_2020_ECCV_I2L-MeshNet, kocabas2020vibe, kolotouros2019learning, aksan2019structured, xu2019denserac, guler2019holopose, pavlakos2018learning] are prone to overfit the training dataset and usually underperform in out-of-domain testing scenarios. As shown in Figure 1, there usually exist large domain gaps between datasets in camera parameters, lengths of body bones, backgrounds, and occlusions, whose negative impact becomes even more severe when we apply the model to streaming data due to the rapidly changing environment of the test domain. In this work, we are interested in adapting human mesh reconstruction models to out-of-domain video frames that arrive in sequential order, which is a practical task in many downstream real-world applications, e.g., augmented reality [DBLP:conf/vr/Billinghurst04] and human-robot interaction [DBLP:conf/hri/StubbsHW06].
The most serious technical challenge of this task is the lack of 3D annotations of test data. To cope with this problem, some optimization-based approaches [joo2020eft, loper2015smpl, SMPL-X:2019] learn to update the model on each test frame using frame-wise losses, such as the pose re-projection loss [kanazawa2018end, kolotouros2019learning] of 2D keypoints (it is a common practice to use the ground-truth 2D keypoints for cross-domain human mesh/pose reconstruction). However, the imperfect frame-based loss functions do not always lead to effective online learning directions as expected by the 3D evaluation metric. There is a severe gap between them. As shown in Figure 2, it may cause severe ambiguity in the estimation of depth information, thus worsening the quality of mesh reconstruction. Moreover, due to the asynchronous arrival of streaming data, the online adaptation model is prone to over-fitting, which will further amplify the difference between the 2D objectives and 3D evaluation metrics.
A straightforward solution is to regularize the training process towards 2D pose objectives using temporal constraints [kocabas2020vibe, sun2019human, DBLP:conf/cvpr/KanazawaZFM19], such as the smoothness of mesh reconstruction over time. If the temporal constraints are used properly, the ambiguity of depth estimation can be greatly reduced. However, empirically, a simple combination of 2D losses and temporal constraints tends to obtain undesirable results due to the competition and incompatibility between multiple objectives, in the sense that the gradient of 2D objectives may interfere with the training of the temporal one. Further, solving this problem becomes even more urgent in online adaptation scenarios with streaming data, because without global knowledge of the test domain, the model can easily fall into a sub-optimal solution to either part of the loss functions that is more readily available.
The above two concerns motivate us to tackle the challenging problem of out-of-domain mesh reconstruction from a new perspective. We propose an algorithm named Bilevel Online Adaptation (BOA) that greatly benefits joint learning of multiple objectives in this task. It effectively incorporates temporal consistency into the few-step online training by performing bilevel optimization on the streaming test data. Specifically, in BOA, the lower-level optimization step serves as a weight probe that searches for reasonable model parameters under single-frame pose constraints, while the upper-level optimization step finds a feasible response to the overall loss function with temporal constraints. On one hand, our approach avoids overfitting the temporal constraints by retaining the 2D losses for the upper-level optimization. On the other hand, it avoids overfitting the 2D losses by updating the model only at the upper-level optimization step with second-order derivatives. By this means, our approach effectively combines the benefits of pose and temporal constraints. In experiments, we use Human3.6M [h36m_pami] as the source domain, and take 3DPW [vonMarcard2018] and MPI-INF-3DHP [mehta2017monocular] as target domains with streaming video frames. On both benchmarks, our approach consistently outperforms existing approaches [joo2020eft, loper2015smpl, SMPL-X:2019], showing a strong ability to tackle notable domain gaps.
A SMPL-based solution to human mesh reconstruction can usually be specified as a tuple of four elements. The first is the observation space, which we consider here as a set of consecutive video frames, and the second is the parameter space of SMPL [loper2015smpl]. For each input frame, a first-stage model is trained to estimate the SMPL parameters. The SMPL model then generates the corresponding mesh and recovers the 3D keypoints using a mesh-to-3D-skeleton mapping pre-defined in SMPL. The third element of the tuple is a weak-perspective projection model that projects the 3D keypoints to 2D space, whose camera parameters are estimated from the input. The last element defines a loss function used to learn the first-stage model, which is usually a neural network.
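The role of the weak-perspective projection, and the depth ambiguity it leaves behind, can be illustrated with a minimal sketch. It assumes the common parameterization with one scale and a 2D translation; the function name `weak_perspective_project` and all numeric values are hypothetical and are not taken from the paper.

```python
def weak_perspective_project(joints_3d, scale, trans):
    """Drop the depth coordinate, then scale and shift the remaining (x, y)."""
    tx, ty = trans
    return [(scale * x + tx, scale * y + ty) for x, y, _depth in joints_3d]


# Two hypothetical poses that differ only in the depth of the second joint.
pose_a = [(0.0, 0.0, 2.0), (0.5, -1.0, 2.5)]
pose_b = [(0.0, 0.0, 2.0), (0.5, -1.0, 3.5)]

proj_a = weak_perspective_project(pose_a, scale=2.0, trans=(0.1, -0.2))
proj_b = weak_perspective_project(pose_b, scale=2.0, trans=(0.1, -0.2))

# Both poses give identical 2D keypoints, so a purely 2D loss cannot tell them apart.
same_projection = proj_a == proj_b
```

Because depth is discarded before scaling, any frame-wise loss defined on the projected keypoints is blind to such differences, which is one way to see why additional constraints are needed.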
In this work, we make two special modifications to the above task. First, we focus on out-of-domain scenarios, in the sense that large discrepancies may exist between the data distributions of the source training domain and the target test domain. Second, we specifically focus on dealing with streaming video frames at test time. These changes bring two challenges: (1) The ground-truth parameters of the target domain are always unavailable throughout the learning process, which is different from standard online learning. (2) The distribution of the target domain is difficult to estimate because the frames become available sequentially and their distributions are continuously changing, which is different from standard domain adaptation setups.
In this section, we first formalize the online adaptation framework as a solution to out-of-domain human mesh reconstruction from streaming sequential data. To make the online adaptation more effective, we then propose a bilevel optimization algorithm that incorporates unsupervised temporal constraints into the training paradigm.
Unlike the existing approaches [kanazawa2018end, DBLP:conf/nips/DoerschZ19, sun2019human] that try to solve the problem introduced in Section 2 by learning more generalizable features in the source domain, we here present an alternative solution that performs online test-time training directly on the target domain. A potential benefit is that, as it is performed solely on the target domain, it can be used jointly with state-of-the-art approaches that learn generalizable features from the source domain to further improve the quality of transfer learning.
Alg. 1 shows the proposed online adaptation framework, which starts from a model pre-trained on the source domain. Our framework does not have special requirements for the pre-training method, but typically, the model is trained offline to regress the ground-truth SMPL parameters in a fully supervised manner. Given sequentially arriving target video frames, a straightforward solution to quickly absorbing the domain-specific knowledge is to fine-tune the model continuously on each individual frame, following the online adaptation paradigm proposed by Tonioni et al. [tonioni2019real]. We take it as a baseline algorithm that computes the unsupervised loss function with pose constraints (whose specific forms are discussed in Section 4) on each frame, and performs a single gradient-descent step on the model weights, scaled by a learning rate, before the inference step.
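The baseline loop, one gradient step on each incoming frame followed by inference, can be sketched on a toy problem. This is only an illustration under strong assumptions: the model is a single scalar weight, the frame-wise loss is a squared error against an observed 2D value, and the names `frame_loss_grad` and `online_adapt` are hypothetical.

```python
def frame_loss_grad(w, frame):
    """Gradient of the toy frame-wise loss (w * x - k)^2 with respect to w."""
    x, k = frame
    return 2.0 * (w * x - k) * x


def online_adapt(w, stream, lr):
    """For each incoming frame: one gradient step first, then inference."""
    predictions = []
    for frame in stream:
        w = w - lr * frame_loss_grad(w, frame)  # single adaptation step
        predictions.append(w * frame[0])        # inference with the updated weight
    return w, predictions


# A toy stream of (input, observed 2D value) pairs.
stream = [(1.0, 2.0), (1.0, 2.0), (1.0, 2.0)]
w_final, preds = online_adapt(w=0.0, stream=stream, lr=0.25)
```

In this toy run the predictions move towards the observed value frame by frame (1.0, 1.5, 1.75), which shows how the weights carry over from one frame to the next instead of being reset.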
A potential disadvantage of the baseline algorithm is that, although fine-tuning a learned model on unlabeled target data may help to handle rapidly changing test environments, an imperfect unsupervised loss function may lead to wrong directions of the one-step gradient descent and may harm the overall algorithm. It may cause catastrophic overfitting to undesirable information in the current observation that is unrelated to the reconstruction quality. To alleviate this issue, we propose the following spatiotemporal bilevel optimization approach.
Considering the setting of out-of-domain streaming data, i.e., video frames arrive in sequential order, there generally exists strong temporal dependency between frames, which can be leveraged to improve the quality of online adaptation. Let us suppose that we have two objectives, one for frame-wise constraints and one for temporal consistency, whose specific forms will be discussed later. Straightforward approaches to combining them include jointly optimizing them by adding them together, or iteratively performing two-stage optimization (Figure 2b). However, these methods usually lead to sub-optimal results due to the competition and incompatibility between the objectives, in the sense that the gradient of the single-frame constraint may interfere with the training of the temporal one. We also observe that the single-frame constraint is usually optimized much faster than the temporal one. That is to say, in a small number of inference-stage optimization steps, the model may learn pose priors very quickly but then get stuck trying to learn temporal consistency. Therefore, the first and foremost challenge we confront is to design an online optimization scheme that prevents overfitting to each objective and maximizes the power of both single-frame and temporal constraints.
We formulate the problem of identifying effective model weights under spatiotemporal multi-objectives as a bilevel optimization problem. In this setup, as shown in Figure 3, the lower-level optimization step serves as a weight probe that searches for reasonable models under single-frame pose constraints, while the upper-level optimization step finds a feasible response to temporal constraints. Specifically, for each test sample, the model from the last online adaptation step is first optimized with the single-frame constraints to obtain a set of temporary weights. We call this procedure the lower-level probe (Lines 5-6 in Alg. 1), in the sense that, first, the temporary weights are feasible responses to the easy component of the multi-objective, which best facilitates the rest of the learning procedure for temporal consistency; second, the temporary weights are not directly used to update the model. At this level we focus on the spatial constraints on individual frames.
The frame-wise loss is a weighted sum of three terms. The first term is a straightforward supervision of the re-projection error of 2D keypoints. The second term is the prior constraint on the shape and pose parameters, which is a common practice in mesh reconstruction; it measures the distance of the estimated parameters to their statistical priors, which are obtained from a commonly used third-party database. The third term is the fully supervised loss with 3D keypoints on randomly sampled source data, which has two benefits: (1) preventing the catastrophic forgetting of the basic knowledge learned from the source domain; (2) providing the online updated model with continuous 3D supervision to keep it from overfitting the imperfect unsupervised loss functions. After optimization at the lower level, we obtain the probe model for subsequent upper-level learning. Note that, due to a lack of 3D supervision in the target domain, the frame-wise loss above is insufficient to recover the 3D body. Therefore, it is essential to explore temporal correlations in streaming data to reduce the ambiguity of mesh reconstruction.
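The three-term structure of the frame-wise loss can be written down as a small sketch. The squared-distance form of each term, the function names `sq_dist` and `frame_loss`, and the weight values are assumptions made for illustration; the text only states that the loss combines a 2D re-projection term, a prior term, and a 3D term on source data with separate loss weights.

```python
def sq_dist(points_a, points_b):
    """Sum of squared coordinate differences between two lists of points."""
    return sum(
        (u - v) ** 2
        for p, q in zip(points_a, points_b)
        for u, v in zip(p, q)
    )


def frame_loss(pred_2d, obs_2d, prior_dist, src_pred_3d, src_gt_3d,
               w_2d, w_prior, w_src):
    """Weighted sum of re-projection, prior, and source-domain 3D terms."""
    reproj = sq_dist(pred_2d, obs_2d)        # 2D keypoint re-projection error
    source = sq_dist(src_pred_3d, src_gt_3d)  # 3D error on a sampled source frame
    return w_2d * reproj + w_prior * prior_dist + w_src * source


loss = frame_loss(
    pred_2d=[(1.0, 1.0)], obs_2d=[(0.0, 1.0)],
    prior_dist=0.5,  # assumed precomputed distance to the shape/pose prior
    src_pred_3d=[(0.0, 0.0, 2.0)], src_gt_3d=[(0.0, 0.0, 1.0)],
    w_2d=1.0, w_prior=2.0, w_src=0.5,
)
```

Only the third term uses 3D ground truth, and it is evaluated on a source-domain sample rather than on the target frame, which matches the description that no 3D supervision is available in the target domain.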
At the upper optimization level, we calculate the overall spatiotemporal multi-objective using the temporary weights obtained at the lower-level optimization step, and then back-propagate with second-order derivatives to update the original model weights, as shown in Lines 7-8 in Alg. 1. As for the specific form of the motion constraints, the motion loss is defined on two images separated by a fixed frame interval, using their 2D keypoints and the corresponding estimates, both of which are obtained from the probe model. Furthermore, we maintain an exponential moving average of history models as a teacher model (similar to MeanTeacher [tarvainen2017mean]), and regularize the output of the current model to be consistent with that of the teacher model.
The teacher consistency term is then combined with the motion loss to obtain the overall temporal constraints, which focus on the consistency of both the sequentially updated model weights and the reconstruction results. Two coefficients control the weights of the two temporal loss terms. From another perspective, these two losses are complementary to each other: the teacher model maintains long-term temporal information, and the motion loss is a constraint on short-term motion consistency.
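The two temporal terms can be sketched as follows. This is an illustration under stated assumptions: the motion loss is taken to be the squared difference between observed and predicted 2D keypoint displacements over the frame interval, the teacher term is a squared difference between outputs, and the teacher weights follow the standard exponential-moving-average update. All names (`ema_update`, `motion_loss`, `consistency_loss`, `temporal_loss`) and constants are hypothetical.

```python
def ema_update(teacher, student, decay):
    """Exponential moving average of model weights (mean-teacher style)."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def motion_loss(obs_prev, obs_curr, proj_prev, proj_curr):
    """Squared gap between observed and predicted 2D keypoint displacement."""
    total = 0.0
    for op, oc, pp, pc in zip(obs_prev, obs_curr, proj_prev, proj_curr):
        for d in range(2):
            total += ((oc[d] - op[d]) - (pc[d] - pp[d])) ** 2
    return total


def consistency_loss(student_out, teacher_out):
    """Squared difference between the current output and the teacher output."""
    return sum((s - t) ** 2 for s, t in zip(student_out, teacher_out))


def temporal_loss(l_motion, l_teacher, w_motion, w_teacher):
    """Weighted combination of the short-term and long-term terms."""
    return w_motion * l_motion + w_teacher * l_teacher


teacher = ema_update(teacher=[1.0, 0.0], student=[0.0, 1.0], decay=0.75)
l_m = motion_loss([(0.0, 0.0)], [(1.0, 0.0)], [(0.0, 0.0)], [(0.5, 0.0)])
l_t = consistency_loss([1.0, 2.0], [1.0, 1.0])
l_temporal = temporal_loss(l_m, l_t, w_motion=2.0, w_teacher=0.5)
```

The averaged teacher changes slowly and therefore summarizes many past frames, while the motion term only compares two frames a short interval apart, which mirrors the long-term and short-term roles described above.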
As briefly mentioned above, there are several single-level optimization alternatives for the spatiotemporal multi-objective, i.e., (1) one-stage joint adaptation: adapting the model online with a combined loss of the frame-wise and temporal objectives; (2) two-stage adaptation: adapting the model iteratively with the frame-wise and temporal objectives in a cascaded manner. However, we observe that the joint adaptation scheme tends to train the temporal constraints ineffectively due to the incompatibility between multiple objectives. The two-stage scheme adapts the model to individual frames under the single-frame constraints repeatedly, which commonly leads to severe overfitting and drifting away from the final 3D reconstruction metric. The key insights of BOA are as follows: First, it avoids overfitting the temporal constraints by retaining the pose prior loss at the upper optimization level. Second, it avoids overfitting the pose priors by updating the model weights only at the upper optimization level with second-order derivatives. By this means, BOA effectively combines the benefits of both single-frame and temporal constraints, achieving considerable improvement over its alternatives.
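To make the contrast between the bilevel update and the two-stage alternative concrete, the following toy sketch uses a single scalar weight and quadratic losses, so that every derivative can be written by hand. The functions `grad_frame`, `grad_temporal`, `bilevel_step`, and `two_stage_step`, and all constants, are hypothetical and are not taken from Alg. 1. In particular, the chain-rule form of the upper-level gradient is an assumption about what back-propagating through the probe step with second-order derivatives amounts to in this simplified setting.

```python
def grad_frame(w, a=4.0, p=1.0):
    """Gradient of the toy frame-wise loss 0.5 * a * (w - p)^2."""
    return a * (w - p)


def grad_temporal(w, b=1.0, q=0.0):
    """Gradient of the toy temporal loss 0.5 * b * (w - q)^2."""
    return b * (w - q)


def bilevel_step(w, lr_probe, lr_update, a=4.0):
    # Lower level: probe weights from one frame-wise step; w itself is kept.
    w_probe = w - lr_probe * grad_frame(w, a=a)
    # Upper level: overall loss (frame-wise + temporal) evaluated at the probe.
    g_probe = grad_frame(w_probe, a=a) + grad_temporal(w_probe)
    # Chain rule through the probe step. `a` is the Hessian of the frame loss,
    # so the factor (1 - lr_probe * a) is the second-order term.
    g_w = g_probe * (1.0 - lr_probe * a)
    # The update is applied to the original weight, not to the probe.
    return w - lr_update * g_w


def two_stage_step(w, lr_probe, lr_update):
    # Single-level alternative: commit the frame-wise step, then a temporal step.
    w = w - lr_probe * grad_frame(w)
    return w - lr_update * grad_temporal(w)


w_bilevel = bilevel_step(0.5, lr_probe=0.125, lr_update=0.25)
w_two_stage = two_stage_step(0.5, lr_probe=0.125, lr_update=0.25)
```

In the two-stage variant the frame-wise step is permanently written into the weights before the temporal step is taken, whereas the bilevel variant uses the frame-wise step only to decide where to evaluate the overall loss. On this toy problem the bilevel step moves the weight from 0.5 to 0.53125, while the two-stage step moves it to 0.5625.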
Following the majority of previous SMPL-based human reconstruction models, we use a ResNet-50 [he2016deep] pre-trained on ImageNet [deng2009imagenet] for encoding individual video frames. The encoded features are then delivered to two fully-connected layers, followed by a dropout layer [srivastava2014dropout]. The final layer of the model is a fully-connected layer. During streaming adaptation, only one image is taken as input. As a result, we replace Batch Normalization [ioffe2015batch] with Group Normalization [wu2018group] to estimate more accurate statistics.
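The reason Group Normalization suits single-image input is that its statistics are computed over groups of channels within one sample, so they do not depend on the batch size. A minimal sketch, assuming a flat list of channel activations whose length is divisible by the number of groups and omitting the learnable scale and shift; the function name `group_norm` is hypothetical.

```python
import math


def group_norm(features, num_groups, eps=1e-5):
    """Normalize one sample's channels within each group; no batch statistics."""
    size = len(features) // num_groups
    out = []
    for g in range(num_groups):
        group = features[g * size:(g + 1) * size]
        mean = sum(group) / size
        var = sum((v - mean) ** 2 for v in group) / size
        out.extend((v - mean) / math.sqrt(var + eps) for v in group)
    return out


# One sample with four channels, split into two groups of very different scale.
normalized = group_norm([1.0, 3.0, 10.0, 30.0], num_groups=2)
```

Each group is brought to roughly zero mean and unit variance using only this one sample, which is what makes the operation well defined when frames arrive one at a time.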
We use the Human3.6M dataset for training the source model and learn to adapt the model to the 3DHP and 3DPW datasets. Table 1 presents the statistics of typical domain gaps among these datasets.
Human3.6M [h36m_pami] is captured in a controlled environment. Following previous approaches [kocabas2020vibe, kanazawa2018end], we train the base model on subjects S1, S5, S6, S7, and S8, and down-sample all videos to a lower frame rate.
3DHP [mehta2017monocular] is the test split of the MPI-INF-3DHP dataset. It consists of valid frames from subjects performing actions, collected from both indoor and outdoor environments.
3DPW [vonMarcard2018] is a multi-person dataset captured by a handheld camera, where most videos are collected from outdoor environments. As with 3DHP, we use the test set of 3DPW as a streaming target domain.
Table 1: Statistics of typical domain gaps among the datasets (columns: Dataset, Focal len. (pixel), Bone len. (m), Camera dist. (m), Camera ht. (m)).
We first train the base model on the Human3.6M dataset and take 3DPW and 3DHP as test sets. All video frames are cropped according to the bounding boxes calculated from 2D keypoints and then scaled to a fixed resolution. For the base model, we follow the same training scheme as SPIN (more details can be found in [kolotouros2019learning]). For the training of BOA, we choose the Adam optimizer [kingma2014adam], and set the learning rate and the loss weights of the frame-wise and temporal objectives separately for 3DPW and 3DHP. Note that the order of streaming videos in 3DPW and 3DHP is pre-defined (same for all compared models), and online optimization takes one frame at a time. The number of optimization steps per frame (Alg. 1) is fixed for the efficiency of adaptation. Please refer to the supplementary material for more analyses of hyper-parameters.
We first compare BOA with end-to-end methods, including frame-based methods [kanazawa2018end, kolotouros2019convolutional, Choi_2020_ECCV_Pose2Mesh, kolotouros2019learning, Moon_2020_ECCV_I2L-MeshNet], video-based methods [DBLP:conf/cvpr/KanazawaZFM19, kocabas2020vibe], and those attempting to learn generalizable features from the training domain, such as Sim2Real [DBLP:conf/nips/DoerschZ19] and DSD-SATN [sun2019human]. Given a video frame, end-to-end methods directly estimate its SMPL parameters. We also include existing approaches that fine-tune SMPL parameters [bogo2016keep, arnab2019exploiting] or model parameters [joo2020eft] on the target domain. Different from these approaches, BOA adapts in an online fashion, which is more challenging. Please refer to the supplementary material for more details.
Following previous works [kanazawa2018end, DBLP:conf/nips/DoerschZ19, zhang2020inference], we evaluate our model in terms of Mean Per Joint Position Error (MPJPE), Procrustes-Aligned MPJPE (PA-MPJPE), and, on 3DHP, the Percentage of Correct Keypoints (PCK) within a fixed distance threshold (in mm).
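For reference, MPJPE and PCK can be computed as in the sketch below. It assumes that predictions and ground truth are lists of 3D joints in the same units and already in a common coordinate frame; the function names and the toy threshold are hypothetical. PA-MPJPE additionally applies a Procrustes (similarity) alignment of the prediction to the ground truth before measuring the error, which is omitted here.

```python
import math


def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth joints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def pck(pred, gt, threshold):
    """Fraction of joints whose error is within the given threshold."""
    hits = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= threshold)
    return hits / len(gt)


# Toy example with two joints: one exact, one off by 5 units.
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]

error = mpjpe(pred, gt)
ratio = pck(pred, gt, threshold=1.0)
```

Lower is better for MPJPE and PA-MPJPE, while higher is better for PCK, so the two kinds of numbers should not be compared on the same scale.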
Table 3: Pre-processing protocols on 3DPW.
| Prot. | SMPL annotation | #Valid frames |
| #PH (HMMR) | The fits | 26,234 |
Table 2 presents quantitative comparisons on 3DPW in MPJPE and PA-MPJPE. Following HMMR or SPIN, most existing methods adopt one of two pre-processing protocols on 3DPW, as illustrated in Table 3. These two protocols differ significantly in the number of test images and in the SMPL annotations, which has a great impact on the evaluation. Please refer to the supplementary materials for more details. In Table 2, we mark the protocol used in the original literature of each compared method. Compared with other end-to-end methods (top part), BOA achieves better performance under both #PS and #PH, and particularly outperforms the methods that are designed to learn generalizable features at training time [sun2019human, kanazawa2018end, DBLP:conf/nips/DoerschZ19], which indicates that our test-time adaptation approach can better mitigate the domain gap by properly exploiting the streaming data from the test domain. Besides, we also observe that BOA outperforms the compared models that are fine-tuned on the entire training set of 3DPW in an offline manner (middle part). Note that BOA does not require access to the training set. In addition, we do not include the results of VIBE [kocabas2020vibe] in the quantitative comparison, since it was evaluated on the same number of test images as #PH but with the same SMPL annotations as #PS.
Table 4 gives the MPJPE and PA-MPJPE results on 3DHP, which is the test set of the entire MPI-INF-3DHP domain. Note that all models but BOA are directly trained on the training set of MPI-INF-3DHP in an offline fashion, in the sense that the global knowledge from the test domain is more accessible to these compared models. Although BOA has never been trained on the MPI-INF-3DHP training set, it still performs best on the corresponding test split, showing a strong adaptability to a rapidly changing test environment.
Figure 4 presents a typical showcase of mesh reconstruction on the challenging 3DPW dataset. The first row refers to human meshes generated by VIBE [kocabas2020vibe], while the second row corresponds to our results. We zoom in on the limbs for better visualization and observe that the reconstruction quality of VIBE is less satisfying, e.g., the positions of arms and legs are not correctly estimated. By contrast, our model can capture the depth structure of the human subject, which is mainly due to the proposed bilevel optimization scheme and spatiotemporal constraints. Figure 5 presents a sequence of input videos with severe occlusions, where the subject of interest (the man in the middle) is covered by the walking woman. Still, BOA successfully estimates the occluded human body.
As the number of optimization steps (Alg. 1) grows, as shown in Figure 6, the error of the single-level training scheme, which combines the frame-wise and temporal objectives into one multi-objective, increases quickly in both PA-MPJPE and MPJPE. By comparison, the performance of BOA degrades at a much slower rate, which indicates that single-level optimization is more likely to over-fit the current video frame, and thus makes it difficult for the model to quickly adapt to the next frame. This effect can be greatly alleviated by the proposed bilevel optimization method.
As shown in Table 5, we investigate the effectiveness of the proposed bilevel adaptation scheme and compare it with other variants. Specifically, B1-B3 refer to the single-level, one-stage optimization scheme, while B4-B5 are trained by updating model parameters with alternating loss functions. Note that the major difference between the two-stage and bilevel schemes is whether the final update is applied to the temporary weights from the first step or to the original weights. We observe that, despite the use of temporal constraints, B3 performs worse than B1, indicating that the straightforward combination of multi-objectives leads to sub-optimal results. Even when we use the multi-objectives in a two-stage training scheme (B5), we only observe a minor improvement over the vanilla B1 model. By contrast, the final proposed BOA is shown to effectively combine the best of both constraints and achieve considerable improvement over all compared baselines.
Table 6 shows the ablation studies for the two proposed temporal constraints, the motion loss and the teacher consistency loss. By comparing B7 with Final, we observe that the use of the motion loss reduces PA-MPJPE. A possible reason is that the motion loss focuses on short-term temporal constraints and helps to recalibrate pose artifacts relative to the last frame. By comparing B8 with Final, we find that the use of the teacher consistency loss significantly reduces MPJPE, which indicates that the long-term information carried by the teacher model is beneficial for consistent mesh reconstruction. It may help to mitigate the domain gaps caused by systematic biases such as the focal length and camera orientations.
In Figure 7, the X-axis refers to the normalized 2D keypoint loss and the Y-axis is the MPJPE. The blue dots are the results of the vanilla baseline only trained with frame-based loss functions (B1), the yellow stars correspond to the baseline model trained with multi-objectives (B3), and the red dots indicate bilevel online adaptation. We can see that, first, although a straightforward use of temporal constraints (B3) achieves comparable results in the 2D loss with B1, it harms the 3D evaluation metric. Second, with BOA, the model achieves MPJPE results that are more consistent with the frame-based loss, indicating that BOA can reduce 3D ambiguity by using the temporal constraints more appropriately.
SMPL [loper2015smpl] is a widely used parametric model for 3D human mesh reconstruction, which is also adopted in this work. The early methods [guan2009estimating, sigal2008combined, bogo2016keep, lassner2017unite, huang2017towards, zanfir2018monocular] generally adopt the optimization scheme, where a standard T-Pose SMPL model is gradually fit to an input image according to the silhouettes [lassner2017unite] or 2D keypoints [bogo2016keep]. These optimization-based methods are time-consuming, i.e., they often struggle to reduce the inference time spent on a single input. Recently, many approaches [kanazawa2018end, Moon_2020_ECCV_I2L-MeshNet, kocabas2020vibe, kolotouros2019learning, aksan2019structured, xu2019denserac, guler2019holopose, pavlakos2018learning] use deep neural networks to regress the parameters of the SMPL model, which are efficient and accurate if large-scale data is available. The major drawback of CNN-based regression models is their limited generalization ability. For example, deep models trained on an indoor dataset generally do not achieve satisfying results [kanazawa2018end] when tested on an in-the-wild dataset. To tackle this problem, Kanazawa et al. [kanazawa2018end] propose an adversarial framework that utilizes unpaired 3D annotations to facilitate the reconstruction. Several studies [sun2019human, tung2017self, li2019towards, omran2018neural, rueegg2020chained] also show that paired 3D annotation is not necessary, attempting to find more representative temporal features [sun2019human, tung2017self] or to employ more informative inputs such as RGB-D [li2019towards] and part segmentation [omran2018neural, rueegg2020chained] to facilitate human mesh reconstruction. However, there still exists a principled challenge in this task: neither the unpaired 3D annotations nor the other intermediate representations mentioned above can effectively fill the gap between two largely different datasets.
In this work, we propose to tackle this problem by using an online adaptation algorithm named BOA. The key insight is that BOA exploits the temporal constraints of test frames while avoiding overfitting through bilevel optimization.
Unsupervised online adaptation refers to sequentially adapting a pre-trained model at test time in an unsupervised manner. It is an emerging technique to prevent the model from failing when the test data differs from the training data. Previous methods [duchi2011adaptive, broderick2013streaming, bobu2018adapting, liu2020learning, park2018meta, voigtlaender2017online, tonioni2019real, tonioni2019learning, zhang2020online, li2020self] use it for tasks other than mesh reconstruction, such as video segmentation [voigtlaender2017online], tracking [park2018meta], and stereo matching [tonioni2019real, tonioni2019learning]. In this paper, we present a pilot study of unsupervised online adaptation in the context of human mesh reconstruction. Beyond unsupervised online adaptation, many previous approaches effectively learn generalizable features through meta-learning [finn2017model, fallah2020convergence], extracting domain-invariant representations [khosla2012undoing, muandet2013domain, ghifary2015domain, li2017deeper, li2018domain, li2018learning, li2019episodic], or learning with adversarial examples [shankar2018generalizing, volpi2018generalizing] without requiring access to target labels. However, none of these approaches focus on how to adapt a pre-trained model to streaming data.
In this paper, we presented a new research problem of reconstructing human meshes from out-of-domain streaming videos. We proposed a new online adaptation algorithm that learns temporal consistency with bilevel optimization and demonstrated that it can greatly benefit the multi-objective training process in space-time. Our approach outperforms the state-of-the-art mesh reconstruction methods on two benchmarks with rapidly changing test environments.
This work was supported by NSFC (U19B2035) and Shanghai Municipal Science & Technology Major Project (2021SHZDZX0102). This work was also supported by the NSFC grants U20B2072 and 61976137.