Adversarial Motion Modelling helps Semi-supervised Hand Pose Estimation

06/10/2021
by Adrian Spurr, et al.

Hand pose estimation is difficult due to different environmental conditions, object- and self-occlusion as well as diversity in hand shape and appearance. Exhaustively covering this wide range of factors in fully annotated datasets has remained impractical, posing significant challenges for the generalization of supervised methods. Embracing this challenge, we propose to combine ideas from adversarial training and motion modelling to tap into unlabeled videos. To this end we propose what to the best of our knowledge is the first motion model for hands and show that an adversarial formulation leads to better generalization properties of the hand pose estimator via semi-supervised training on unlabeled video sequences. In this setting, the pose predictor must produce a valid sequence of hand poses, as determined by a discriminative adversary. This adversary reasons on both the structural and the temporal domain, effectively exploiting the spatio-temporal structure of the task. The main advantage of our approach is that we can make use of unpaired videos and joint sequence data, both of which are much easier to attain than paired training data. We perform extensive evaluation, investigating essential components needed for the proposed framework and empirically demonstrate in two challenging settings that the proposed approach leads to significant improvements in pose estimation accuracy. In the lowest label setting, we attain an improvement of 40% in absolute mean joint error.


1 Introduction

Figure 1: Adversarial motion modelling improves hand pose estimation via leveraging unlabeled videos. (Left) Ground-truth labels. (Middle) Given only a fraction of labeled data, standard hand pose models perform poorly on unseen samples. (Right) Using our proposed adversarial motion model, we train the per-frame hand pose model to produce predictions on unlabeled videos of valid motion, leading to significantly more accurate pose estimation.

Estimating the 3D hand pose from monocular RGB is in itself a challenging task. Many application areas such as VR/AR require that hand pose estimation models are robust to different environmental conditions, object- and self-occlusion as well as diversity in hand shape and texture. Although impressive performance has been achieved by supervised learning approaches (e.g. [6, 4, 21]), this has only been demonstrated in constrained environments. This limitation stems from the need for annotated data, which is typically only available if conditions can be controlled, due to the difficulty of acquiring 3D hand pose ground truth. However, as [58] demonstrated, models trained on such labeled datasets typically do not generalize well to other settings.

One option would be to acquire more labeled data in a more diverse range of settings. Yet, such attempts would need to overcome many challenges, since correctly annotating complex and diverse 3D hand poses is difficult even with sophisticated ground-truth acquisition methods. In addition, such setups limit portability and hence the variety of capture settings. This can be partially mitigated by making use of weakly supervised data [7, 6, 49] or machine annotations. For example, [34] introduce a new dataset by making use of a machine-annotation method that samples videos from YouTube containing hands. These sequences are labeled using OpenPose [9] and by fitting the MANO [47] model to the obtained 2D keypoints. While the idea to leverage vast amounts of YouTube videos is promising, the labelling approach is inherently bounded by the accuracy of OpenPose.

We ask whether there is an alternative approach to making use of unlabeled videos. For this, we take inspiration from the field of learned motion modeling. Research in that area focuses on modeling the 3D motion of human bodies, either by performing motion in-filling [32, 24, 20, 48], extrapolation [2, 48, 40, 8, 52, 28, 45, 37, 38] or denoising [24, 14], exploiting the temporal information linking the individual poses of a given sequence.

The predictions of a hand pose estimator on videos can be interpreted as a sequence of hand motion. Thus it is feasible to make use of learned motion modeling in order to exploit temporal information and to correct per-frame predictions by enforcing consistency and validity of the motion of the predicted hand poses. However, we observe that so far there is no prior work on motion modeling of hand poses. Unpacking this observation further, we note that full-body and hand motion have important differences. Full-body motion is often cyclic and body motion models tend to exploit this [1, 38]. Additionally, the motion modeling efforts for body pose are driven by the availability of datasets of appropriate size, such as Human3.6M [26] and AMASS [36]. In contrast, hand motion is less cyclic and can contain changes in pose over a very short time horizon. Potentially because of these reasons, simply applying existing motion models to hand pose estimates is not straightforward and in our experiments did not yield sufficiently good performance. Additionally, there is a lack of suitable hand motion datasets.

Sequential hand pose datasets such as BigHand2.2M [54] do exist. However, participants were instructed to explore the full range of motion of the hand, including extremal poses, and to perform random movements. Hence its motion statistics differ significantly from natural motion, and the distribution of joint velocities/accelerations is significantly different from other datasets in the literature; see the supplementary for details. In our experiments, the inclusion of BigHand2.2M led to detrimental results.

We therefore propose a simple yet effective way to leverage temporal information from unlabeled videos via an adversarial motion model. Adversarial training has shown promising results in the field of body pose estimation [29, 30, 33], but has not yet been explored for the hand pose estimation task.

The goal of the adversarial setting is for the pose predictor to learn to produce valid sequences of hand poses on unlabeled videos, as determined by the discriminator. Analogously, the discriminator is trained to distinguish between ground-truth sequences and predictions. We empirically demonstrate that optimizing this min/max game leads to significant gains in hand pose estimation accuracy. Importantly, at inference time we do not require sequence data and hence the adversarial motion model can be used to improve existing per-frame hand pose models.

The main advantage of making use of motion models for learning from unlabeled videos is their unpaired nature. Hence, our model does not require fully labeled video sequences. Instead, we only require that sequences of hand motion be available, as well as video sequences containing hands. Such a setting is not only useful for semi-supervised learning, but could also be used in scenarios where video and motion are both recorded, but synchronization or calibration is infeasible.

In this paper, we demonstrate the essential components in making a hand pose model learn from unlabeled videos using a discriminator. For each component, we provide empirical evidence to support its use. In the interest of future research, we also report building blocks that lead to detrimental results. We envision that our work will provide the basis for follow-up work applying an adversarial approach to hand pose estimation. We evaluate our model in two challenging semi-supervised settings on the FPHAB [12] and HO-3D [17] datasets and demonstrate how adversarial learning improves the performance of the hand pose model. In the lowest-label settings, we observe improvements of up to 40% in absolute mean joint error.

In summary, our contributions are as follows: i) we introduce a simple and practical approach to motion modeling for hands using an adversarial formulation; ii) we provide empirical evidence for the design choices of said motion model, providing a solid foundation on which future work can build; iii) we make use of this acquired knowledge to tap into the area of semi-supervised learning from unlabeled video data; and iv) we show that our approach leads to significant improvements for the hand pose model.

2 Related work

Our method aims to learn from unlabeled videos, making use of an adversarial model of motion. Here we briefly review the literature on learning-based hand pose estimation approaches. We then focus on adversarial methods in the context of pose estimation. Lastly, we briefly discuss the area of motion models, to which the discriminator used in our approach pertains.

Hand pose estimation. Several approaches have been introduced to perform learning-based hand pose estimation. These generally predict 3D joint skeletons directly [57, 50, 44, 27, 7, 51, 53, 10, 49, 43], follow the MANO paradigm [47], where the parameters of a parametric hand model are regressed [3, 6, 22, 4, 21, 55], or predict the full mesh model of the hand directly [13, 34, 42]. In the following, we detail these works. Predicting the 3D joint skeleton directly tends to achieve higher accuracy; however, such methods do not provide dense surfaces. [57] introduce a staged approach where 2D keypoints are regressed and then lifted to 3D. [44] create a synthetic dataset and reduce the synthetic/real discrepancy via a GAN. [50] propose a cross-modal latent space to facilitate better learning. [27] present a 2.5D hand representation, achieving state-of-the-art results; in our work, we use the same representation. [7] augment the training by making use of supplementary depth supervision. [51] present a unified approach for hand and object pose estimation as well as action recognition. [53] introduce a disentangled latent space to perform better image synthesis. [10] use a graph-based refinement network to jointly refine object and hand pose predictions. [49] introduce a biomechanical model to better refine pose predictions on weakly-supervised data. [43] address the issue of inter-hand interaction by proposing a model that predicts the pose of both hands at the same time.

Template-based approaches like MANO implicitly induce a pose prior on the predictive model and provide a mesh surface. However, due to the regularization of its parameters, its representation space is limited. [3, 6, 55] predict the MANO parameters directly, making use of additional weak supervision such as hand masks [3, 55] or in-the-wild 2D annotations [6, 55]. [22] jointly learn the MANO mesh as well as the object mesh for hand-object pose estimation. [21] extend this framework to learn from partially labeled sequences by exploiting a photometric loss on the unlabeled frames. However, they assume the object mesh to be known a priori and only estimate its 6D pose. In terms of labeling setting, it is the closest to ours. [42] propose an alternative to MANO by predicting pose- and subject-dependent correctives to a base hand model.

Regressing the dense surface of a hand directly is the most generalizable approach to obtaining a mesh; however, it requires corresponding annotations which may be difficult to acquire. [13] alleviate this by introducing a synthetic dataset that is fully annotated with corresponding meshes and perform noisy supervision for real data. [34] regress the mesh directly using spiral convolutions and supervise their approach using a MANO model.

Generative Adversarial Nets. Generative adversarial networks [15] have been used in the body pose literature to refine pose predictions. [29, 30] use a discriminator to distinguish whether the predicted SMPL parameters correspond to a real human pose. [33] extend the adversarial model to distinguish between valid and predicted sequences of SMPL parameters. This method is the most similar to ours in that it makes use of an adversarial loss that operates on sequences of poses. We differ from them in two main aspects. First, our task is hand pose estimation, which involves different and more irregular motion sequences. Second, our setting considers learning from unlabeled videos, whereas [33] aim to refine predictions in the fully supervised setting.

Motion modeling. Our proposed adversarial approach can be categorized as a motion model. Modeling human motion has been the focus of many approaches in the literature; to the best of our knowledge, however, no such motion model exists for hand motion. Most body pose motion models follow either a recurrent modeling approach [2, 11, 14, 28, 40, 46, 16, 56] or a non-recurrent approach, opting instead for graph convolutional or convolutional networks [32, 35, 48]. As our discriminative method is built on a convolutional architecture, we focus on such works here. [35, 32] use an encoder-decoder architecture to predict motion. [48] introduce special encoding/decoding layers to alleviate the lack of spatial continuity in the matrix representation of human pose sequences. Some motion models also employ an adversarial loss [48, 16, 5] to regularize the pose predictions. Similarly, we use an adversarial model to learn on unlabeled videos and show that a simple approach yields promising results.

3 Method

Figure 2: Method overview. a) Given the frames of an unlabeled video sequence, our hand pose model predicts the joints of a hand skeleton. The per-frame predictions are concatenated into a sequence, generating a hand motion. The predicted motion sequence is input into an adversarial motion model which is capable of discriminating between plausible and invalid hand motion. This capability is learned via unpaired hand motion data. The hand pose estimator is then trained to produce valid motions using the gradients of the discriminator. b) Our method is general in the sense that during inference time, the hand pose estimator only requires a single frame to predict the corresponding pose.

We introduce our method to leverage unlabeled videos via an adversarial motion model to improve the performance of a hand pose estimation model. Our approach is summarized in Fig. 2. Given a pre-trained hand pose estimation model, we predict for each frame of a video sequence a hand pose. Predictions are concatenated to create a motion sequence. This is input into a motion model that is capable of determining the validity of the given sequence. This capability is learned with the help of unpaired ground-truth hand motion data. The gradient feedback of the motion model helps the hand pose model in improving its prediction performance on the unlabeled videos. Our key contribution is the introduction of the necessary building blocks in order to achieve learning on unlabeled videos.

Notation. We denote variables that are outputs of a network via the hat notation (e.g. $\hat{\mathbf{x}}$), lowercase boldface denotes vectors ($\mathbf{x}$), whereas uppercase boldface denotes matrices ($\mathbf{X}$).

3.1 Hand pose model

Our hand pose model $P$ uses the 2.5D representation proposed by [27], where the network predicts both the 2D keypoints $\mathbf{k}_j = (u_j, v_j)$ and the root-relative depth $z^r_j$ of each joint $j$ from a monocular RGB image. The relationship between the 2.5D representation and the 3D joint $\mathbf{x}_j$ is expressed as follows:

$$(z_{\text{root}} + z^r_j)\,[u_j,\; v_j,\; 1]^\top = \mathbf{K}\,\mathbf{x}_j \qquad (1)$$

where $\mathbf{K}$ is the camera intrinsics and $z_{\text{root}}$ is the absolute depth value of the root joint (i.e. $z^r_{\text{root}} = 0$). Both branches are trained on the available labeled data using the L1 norm:

$$\mathcal{L}_{\text{sup}} = \sum_j \left( \big\| \hat{\mathbf{k}}_j - \mathbf{k}_j \big\|_1 + \big| \hat{z}^{\,r}_j - z^{\,r}_j \big| \right) \qquad (2)$$

As detailed in [27], the predicted values $\hat{\mathbf{k}}_j$ and $\hat{z}^{\,r}_j$ can be used to acquire the 3D pose. Prior work [49] emphasized that this acquisition step can be unstable and introduced a refinement step to alleviate this, which we also use here. The resulting 3D pose is recovered as follows:

$$\hat{\mathbf{x}}_j = (\hat{z}_{\text{root}} + \hat{z}^{\,r}_j)\,\mathbf{K}^{-1}\,[\hat{u}_j,\; \hat{v}_j,\; 1]^\top \qquad (3)$$

where $\hat{z}_{\text{root}}$ is the refined absolute depth value.
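To make the recovery in Eq. (3) concrete, the following is a minimal NumPy sketch of lifting a 2.5D prediction to 3D via the camera intrinsics; the function name, argument layout and shapes are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def lift_2p5d_to_3d(uv, z_rel, z_root, K):
    """Back-project 2.5D predictions to 3D joints (illustrative sketch).

    uv:     (J, 2) predicted 2D keypoints in pixel coordinates
    z_rel:  (J,)   predicted root-relative depths
    z_root: float  refined absolute depth of the root joint
    K:      (3, 3) camera intrinsic matrix
    """
    z_abs = z_root + z_rel                                      # absolute depth per joint
    uv_h = np.concatenate([uv, np.ones((uv.shape[0], 1))], 1)   # homogeneous pixel coordinates
    rays = uv_h @ np.linalg.inv(K).T                            # back-projected camera rays K^{-1}[u, v, 1]^T
    return rays * z_abs[:, None]                                # (J, 3) joints in camera space
```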

Before performing adversarial learning on the unlabeled videos, we pre-train $P$ on the labeled data. We emphasize here that although our motion discriminator acts on sequences, the hand pose model is a per-frame model.

Properties of 2.5D. We briefly touch on some useful properties of the chosen representation which we will reference later. First, both predicted quantities are inherently bounded. The 2D keypoints must lie within the image and are hence constrained by the image size. The root-relative depth is naturally bounded by the skeletal structure of the hand, where its maximum absolute value is the maximum distance between the root joint and the corresponding keypoint. In contrast, a full 3D representation is unbounded, as the 3D joint skeleton can lie anywhere within the camera coordinate frame. The second useful property is translation invariance: no matter where in world coordinates the 3D pose lies, the camera intrinsics are adjusted such that it projects onto the same image plane, resulting in the same 2D joint positions. Similarly, translating the 3D pose does not affect the root-relative depth $z^r$, as the translation is applied to all keypoints.

3.2 Adversarial motion model

The goal of the adversarial learning is to make use of additional unlabeled video data on which the hand pose model $P$ can be further trained. To this end, we introduce a discriminator $D$ which takes in a sequence of hand joints $\mathbf{X}$ and outputs a scalar score, classifying between real and predicted sequences. Formally, $D$ is trained to optimize the following objective function:

$$\mathcal{L}_{D} = \mathbb{E}_{\mathbf{X} \sim p_{R}}\big[(D(\mathbf{X}) - 1)^2\big] + \mathbb{E}_{\hat{\mathbf{X}} \sim p_{P}}\big[D(\hat{\mathbf{X}})^2\big] \qquad (4)$$

Here, $p_{R}$ is the distribution of real hand motion, whereas $p_{P}$ is the distribution of the predictions of our hand pose model. We use the LSGAN [39] loss function due to its stability, following [33]. To train $P$ on unlabeled videos, we update it to confuse the discriminator:

$$\mathcal{L}_{\text{adv}} = \mathbb{E}\big[(D(\hat{\mathbf{X}}) - 1)^2\big], \qquad \hat{\mathbf{X}} = [P(\mathbf{I}_1), \dots, P(\mathbf{I}_T)] \qquad (5)$$

where $\mathbf{I}_1, \dots, \mathbf{I}_T$ are the frames of a given unlabeled image sequence, i.e. $P$ returns a per-frame prediction.

Final loss. $P$ is jointly trained on supervised data via $\mathcal{L}_{\text{sup}}$ and on unlabeled data using $\mathcal{L}_{\text{adv}}$. The final loss is

$$\mathcal{L} = \mathcal{L}_{\text{sup}} + \lambda\,\mathcal{L}_{\text{adv}} \qquad (6)$$

where $\lambda$ is a weighting factor.
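As an illustration of Eqs. (4)-(6), here is a minimal PyTorch-style sketch of the LSGAN objectives; the function names and the weighting value `lam` are assumptions for exposition, not the authors' code.

```python
import torch

def discriminator_loss(D, real_seq, pred_seq):
    # Eq. (4): real motion pushed towards 1, predicted motion towards 0 (LSGAN).
    # pred_seq is detached so only the discriminator receives gradients here.
    return ((D(real_seq) - 1) ** 2).mean() + (D(pred_seq.detach()) ** 2).mean()

def adversarial_loss(D, pred_seq):
    # Eq. (5): the pose model tries to make its predicted motion look real to D.
    return ((D(pred_seq) - 1) ** 2).mean()

def total_loss(l_sup, l_adv, lam=0.01):
    # Eq. (6): supervised L1 term on labeled frames plus weighted adversarial term.
    return l_sup + lam * l_adv
```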

3.3 Design choices

In Sec. 1 we alluded to the difficulties in translating motion models from the full-body literature to the hand pose setting. Since we believe that the proposed adversarial setting is a fruitful step in this direction and can lead to further improvements in future work, we briefly discuss our design choices for the adversarial learning step and the underlying model. Each of them is backed up by empirical evidence provided in Sec. 5.4. For the sake of clarity, we also discuss methods that have worked in related fields but led to detrimental results in our experiments.

Spectral normalization [41]. Applying spectral normalization to the weights of the discriminator has been shown to stabilize training and generate images of higher quality compared to previous training stabilization techniques. During the course of our experiments, we found that spectral normalization does not yield clear improvements. We therefore omit it from our discriminator.

Network depth. We represent a sequence of predictions as a matrix. This means that adjacent values do not necessarily share a spatial relationship. Prior work alleviates this by introducing special spatial encoding layers [48] or by increasing the receptive field of the network to be large enough [32]. We determined that two residual blocks with an encoding and a decoding layer performed best.
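For concreteness, the following is a hedged sketch of such a shallow sequence discriminator with one encoding layer, two residual blocks and a decoding layer; the use of 1D convolutions over time, the channel width and the number of joints are our assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

class ResBlock1D(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(ch, ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(ch, ch, kernel_size=3, padding=1))

    def forward(self, x):
        return torch.relu(x + self.conv(x))  # residual connection

class MotionDiscriminator(nn.Module):
    """Scores a (B, J*3, T) sequence of 2.5D keypoints as real/predicted motion (sketch)."""
    def __init__(self, in_dim=21 * 3, ch=128):
        super().__init__()
        self.encode = nn.Conv1d(in_dim, ch, kernel_size=3, padding=1)   # encoding layer
        self.blocks = nn.Sequential(ResBlock1D(ch), ResBlock1D(ch))     # two residual blocks
        self.decode = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                    nn.Linear(ch, 1))                   # decoding layer -> score

    def forward(self, seq):  # seq: (B, J*3, T), e.g. T = 16 frames
        return self.decode(self.blocks(self.encode(seq)))
```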

Batch normalization [25]. Batch normalization speeds up the training of neural networks. During training, it normalizes its input feature maps to have zero mean and unit variance across the batch, followed by learnable scaling and shifting. At test time, it makes use of the aggregate batch statistics acquired during training. We observed that $P$ would diverge during adversarial learning when the current batch statistics are used. Therefore we use the aggregate batch statistics after the pre-training phase.
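One way to realize this behavior in PyTorch is to switch the BatchNorm layers of the hand pose model to evaluation mode during the adversarial phase, so that the running (aggregate) statistics gathered during pre-training are used and no longer updated; a minimal sketch under that assumption:

```python
import torch.nn as nn

def use_aggregate_bn_stats(model: nn.Module) -> None:
    """Make all BatchNorm layers use their running (aggregate) statistics (sketch).

    In eval mode, BatchNorm normalizes with the running_mean/running_var collected
    during pre-training and stops updating them. Since model.train() re-enables the
    per-batch statistics, this helper needs to be re-applied after every such call.
    """
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.eval()
```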

Sequence length. We experiment with various sequence lengths. We test 1, 16, 32 and 64 frames, which correspond roughly to 1/30, 0.5, 1 and 2 seconds respectively. Longer sequences allow the discriminator to base its decision on more information. However, they increase computation time and dimensionality, which may lead to overfitting. In our experiments, using 16 frames yielded the best performance.

Joint representation. $P$ predicts the 2.5D representation of the keypoints, which can easily be converted to 3D (Eq. 3). It is unclear, however, which representation is best suited for the motion modeling task. Whereas the 3D representation directly encodes the task at hand, the 2.5D representation is more constrained. Our analysis shows that $P$ learns more effectively when the discriminator uses the same representation, hence we choose 2.5D over 3D.

Data augmentation. Augmenting the input to the discriminator has been shown to improve performance [31]. As the inputs are joint skeleton sequences, we perform geometric augmentation. The 2.5D representation is translation invariant, therefore we resort to rotations. Care needs to be taken to apply the augmentation consistently across the sequence, as it should not incorrectly change the statistics of a given sequence. Hence, given a sequence of poses, we rotate it around the root joint of the first skeleton of the sequence. This ensures that the relative change in motion between two frames of a sequence remains the same. Because out-of-image-plane rotations could result in joint skeletons lying behind the camera, we only perform rotations around the z-axis of the camera.
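A sketch of this sequence-consistent augmentation: one in-plane (z-axis) rotation is sampled per sequence and applied around the root joint of the first frame, so the relative motion between frames is preserved; the array layout and angle range are assumptions.

```python
import numpy as np

def rotate_sequence(seq, max_angle=np.pi):
    """Apply one random z-axis rotation to an entire (T, J, 3) joint sequence (sketch).

    The rotation pivot is the root joint (index 0) of the first frame, so every
    frame is transformed identically and the motion statistics stay consistent.
    """
    theta = np.random.uniform(-max_angle, max_angle)
    c, s = np.cos(theta), np.sin(theta)
    Rz = np.array([[c, -s, 0.0],
                   [s,  c, 0.0],
                   [0.0, 0.0, 1.0]])
    pivot = seq[0, 0]                          # root joint of the first skeleton
    return (seq - pivot) @ Rz.T + pivot        # rotate all frames around the same pivot
```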

4 Implementation

Our pose model uses a ResNet-18 [23] backbone that takes a 256×256 RGB image and outputs the 2.5D representation of the keypoints [27]. Following [49], we implement a refinement network to stabilize the prediction of the absolute depth. We pre-train the hand pose model on the labeled data twice, once using the current batch statistics and once using the aggregate statistics. We then jointly train on labeled data via $\mathcal{L}_{\text{sup}}$ and on unlabeled data via $\mathcal{L}_{\text{adv}}$. Our discriminator is implemented as a CNN with residual connections that takes a sequence of joint predictions and outputs a valid/invalid motion score. It is trained on the hand motion data of the respective dataset. Exact training and architecture details can be found in the appendix.
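To tie the pieces together, a compact sketch of one joint training step, combining the supervised loss on labeled frames with the adversarial loss on an unlabeled video; all names (pose_model, D, the optimizers, and the loss helpers sketched in Sec. 3.2) are placeholders and the batching scheme is an assumption, not the exact training procedure.

```python
def training_step(pose_model, D, opt_pose, opt_disc,
                  labeled_batch, video_batch, real_motion, lam=0.01):
    imgs, gt_2p5d = labeled_batch            # labeled single frames and 2.5D targets
    B, T = video_batch.shape[:2]             # unlabeled video frames: (B, T, C, H, W)

    def predict_motion():
        # Per-frame predictions on the unlabeled video, concatenated into a motion
        # sequence of shape (B, J*3, T) for the sequence discriminator.
        preds = pose_model(video_batch.flatten(0, 1)).view(B, T, -1)
        return preds.transpose(1, 2)

    # 1) Discriminator update: real motion vs. predicted motion (Eq. 4).
    opt_disc.zero_grad()
    discriminator_loss(D, real_motion, predict_motion()).backward()
    opt_disc.step()

    # 2) Pose model update: supervised L1 term plus adversarial term (Eqs. 2, 5, 6).
    opt_pose.zero_grad()
    l_sup = (pose_model(imgs) - gt_2p5d).abs().mean()
    loss = l_sup + lam * adversarial_loss(D, predict_motion())
    loss.backward()
    opt_pose.step()
    return loss.item()
```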

5 Experiments

Figure 3: Qualitative results. (top row) HO3D, (bottom row) FPHAB. For each block, the left column shows the ground truth, the middle column shows the prediction of the baseline model trained on 20% (left blocks) or 40% (right blocks) of the labels, and the right column shows predictions of the model that additionally learned from unlabeled videos using our proposed adversarial motion model. The latter predictions are more accurate, biomechanically plausible and fit the displayed hand better.

We analyze how the proposed motion model leads to improved pose prediction using unlabeled video data.

5.1 Datasets

To quantitatively evaluate our framework, we require access to fully labeled sequential data of 3D hand poses. Currently, only two datasets fulfill this requirement.

FPHAB [12] contains egocentric RGB-D video sequences. The motions performed capture a wide range of interactions, both hand-object and hand-hand. It contains ground-truth annotations of the hand pose, acquired with magnetic sensors strapped to the hand. Following [21], we report our results on the action split.

HO3D [18] contains third-person-view sequences of hand-object interactions. It was captured using a multi-RGB-D-camera setup. An earlier version of this dataset exists [17] and was used by [21]. Here, we use the newest version.

5.2 Evaluation metrics

FPHAB. We report the mean joint error in mm for both the absolute and the root-relative pose. The former is the quantity of interest; however, it heavily depends on correctly estimating the distance to the camera. As this is a severely ill-posed problem for monocular RGB, one may encounter large absolute pose errors despite predicting correct articulations. Hence we also report the root-relative error to quantify the articulation error in the absence of the camera distance estimate.

HO3D. An online submission system returns the absolute, the scale/translation (ST) aligned, as well as the Procrustes-aligned EPE in cm. To be consistent, we report the former two and convert them to mm.

5.3 Settings

We explore two different semi-supervised protocols. To the best of our knowledge, the only related work that explores a setting similar to ours is [21]; therefore, all of our semi-supervised experiments compare to their method. To compare directly, Protocol 1 follows the labeling of [21] precisely, where the labels of each sequence are sampled uniformly, starting from the first frame. However, as frames within a sequence tend to be similar, this does not allow us to extrapolate the performance of our method to completely unlabeled videos. Therefore we introduce Protocol 2, where we sample entire labeled videos and keep the remaining videos unlabeled. This setting is more challenging, as there is a much bigger difference between the labeled and unlabeled data, and as such it more closely simulates truly unseen videos. More details for both protocols are found in the appendix.

5.4 Ablation study

Ablation study: 3D hand pose estimation, EPE (mm), Absolute | Root-relative

Motion model type
  Temporal prior           27.44 | 13.31
  Data-driven              23.86 | 11.20
Spectral normalization
  With                     24.42 | 10.93
  Without                  23.52 | 11.02
Network depth
  1                        23.86 | 11.20
  2                        23.52 | 11.02
  4                        25.57 | 12.35
  8                        26.94 | 12.89
Sequence length
  1                        27.16 | 12.80
  16                       23.52 | 11.02
  32                       25.03 | 11.52
  64                       25.48 | 11.58
Data augmentation
  With                     23.52 | 11.02
  Without                  24.46 | 11.30
Keypoint representation
  3D                       26.62 | 13.59
  2.5D                     23.52 | 11.02

Table 1: Full ablative evaluation motivating our design choices. We evaluate on FPHAB in the labeled setting of Protocol 2.

We present the empirical results that motivated the design choices in Sec. 3.3. The complete evaluation for all ablative experiments is presented in Tab. 1. These experiments are conducted on FPHAB using Protocol 2.

Motivating data-driven motion models. Prior work [29] makes use of adversarial methods to improve single-frame predictions, which could be used to leverage unlabeled images, yielding a pose prior. Alternatively, simple temporal priors could be employed to take advantage of unlabeled videos. We explore both as viable alternatives to the proposed data-driven motion model. For the pose prior, we change the length of the sequence that we feed to the discriminator to 1. This yielded 27.16 mm MPJPE. Making use of a temporal smoother that penalizes non-smooth motion performed worse, achieving 27.44 mm MPJPE. Both approaches perform worse than the proposed motion model, which yields 23.52 mm EPE.
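For reference, a possible form of such a temporal smoothness prior is to penalize the frame-to-frame acceleration of the predicted joint sequence; this is an assumed formulation for illustration, not necessarily the exact prior used in the ablation.

```python
import torch

def temporal_smoothness_prior(pred_seq):
    """Penalize non-smooth motion via squared acceleration (illustrative sketch).

    pred_seq: (B, T, J*3) sequence of per-frame joint predictions.
    """
    vel = pred_seq[:, 1:] - pred_seq[:, :-1]   # frame-to-frame velocities
    acc = vel[:, 1:] - vel[:, :-1]             # frame-to-frame accelerations
    return acc.pow(2).mean()
```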

Design choices. Tab. 1 shows the empirical evidence motivating the design choices of the motion model. Spectral normalization and deeper network models lead to an increase in training time and error, whereas using 16 frames leads to optimal results. Our proposed data augmentation scheme reduces the error by 0.94 mm. Lastly, making use of the 2.5D keypoint representation yields a 3.10 mm drop in error.

In total, the analysis of the design of the adversarial motion model training framework reduced the absolute error to 23.52 mm (cf. Tab. 1).

5.5 Semi-supervised adversarial learning

FPHAB: Absolute MPJPE (mm)
  Tekin et al. CVPR'19 [51]      15.8
  Hasson et al. CVPR'20 [21]     15.7
  Ours                           15.5
HO-3D: ST-aligned MPJPE (mm)
  Hasson et al. CVPR'19 [22]     31.8
  Hampali et al. CVPR'20 [18]    30.4
  Ours                           24.5
Table 2: Comparison with prior work in the fully supervised setting. Please note that [22] and [21] are different works.
(a) FPHAB - Protocol 1
(b) FPHAB - Protocol 2
(c) HO-3D - Protocol 2
Figure 4: Comparison with related work. (a) Our method outperforms [21] for all labeling percentages. In addition, our method yields a greater improvement over the respective baseline for all labeling percentages. (b) Leveraging unlabeled sequences is more challenging, as can be seen from the overall higher error rates. We observe improvements for both the absolute and the root-relative hand pose error. (c) A similar trend is demonstrated on HO3D.

Next, we evaluate the proposed approach under different amounts of labeled sequences and under both protocols.

Fully-supervised performance. We evaluate the performance of our baseline model in the fully supervised setting on FPHAB and HO3D. The results are presented in Tab. 2. For FPHAB, we compare with [51, 21], demonstrating that our baseline performs on par with prior work. For [21], we report their fully supervised "Hands only" result, as we focus on hand pose as well. For HO3D, we compare to [18, 22]. For [22], the results are obtained from [19].

Protocol 1. We report our results in Fig. 4(a), comparing to [21]. As they make use of an early version of HO3D, we compare only on FPHAB. We observe that we outperform their approach for all percentages of supervision used. In addition, the improvement of our method over our baseline is larger than that of [21] over theirs. We note here that the semi-supervised results of [21] require access to the ground-truth mesh of the object, hence their reported results are for the model that predicts both hand and object pose. Designing a semi-supervised "hands only" variant is non-trivial due to the hard-coded reliance on the ground-truth mesh to compute the photometric consistency loss (confirmed by the authors via private correspondence). As the amount of labeled data increases, both approaches converge to their fully supervised results, as expected (cf. Tab. 2). We note that both methods complement each other and can be combined: whereas [21] reasons on the pixel level, we reason on the motion level. Future work could explore a combination of both approaches.

Protocol 2. We present our quantitative results on FPHAB in Fig. 4(b) for both the absolute and the root-relative pose. We compare a baseline method, which does not make use of the adversarial motion model, to our approach. Compared to Protocol 1, the models exhibit larger discrepancies in error between the lower labeling percentages and the fully supervised case. This is indicative of the increased difficulty of Protocol 2, as the unlabeled data contains completely unseen imagery. We observe that when labeled data is scarce, our proposed adversarial loss has a much larger impact on the final performance. This is evident when comparing the performance of our method in the lower-label regime with that of the baseline at higher labeling percentages: our method trained with fewer labels outperforms the baseline trained with more labels (cf. Fig. 4(b)). This suggests that the proposed motion model is effective in leveraging unlabeled videos. Fig. 4(c) shows our results on HO3D given by the online evaluation system. We observe a similar trend as with FPHAB.

6 Conclusion

In this work we have proposed to leverage an adversarial motion model to learn from unlabeled data and thereby improve hand pose estimation. Due to the lack of related work, we provided an overview of the essential components needed to make use of a simple but effective motion model in order to leverage unlabeled video for learning hand pose estimation. The goal is to provide a foundation upon which future work combining both methodologies can build. We evaluated our proposed method in two challenging settings and demonstrated that significant improvements can be gained. Future work will look into extending the proposed methodology across datasets, making use of in-the-wild videos such as YouTube3DHands [34].

References

  • [1] E. Aksan, P. Cao, M. Kaufmann, and O. Hilliges (2020) Attention, please: a spatio-temporal transformer for 3d human motion prediction. arXiv preprint arXiv:2004.08692. Cited by: §1.
  • [2] E. Aksan, M. Kaufmann, and O. Hilliges (2019) Structured prediction helps 3d human motion modelling. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, Cited by: §1, §2.
  • [3] S. Baek, K. I. Kim, and T. Kim (2019) Pushing the envelope for rgb-based dense 3d hand pose estimation via neural rendering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, Cited by: §2, §2.
  • [4] S. Baek, K. I. Kim, and T. Kim (2020) Weakly-supervised domain adaptation via GAN and mesh model for estimating 3d hand poses interacting objects. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, Cited by: §1, §2.
  • [5] E. Barsoum, J. Kender, and Z. Liu (2018) HP-gan: probabilistic 3d human motion prediction via gan. In CVPR workshops, Cited by: §2.
  • [6] A. Boukhayma, R. de Bem, and P. H. S. Torr (2019) 3D hand shape and pose from images in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, Cited by: §1, §1, §2, §2.
  • [7] Y. Cai, L. Ge, J. Cai, and J. Yuan (2018) Weakly-supervised 3d hand pose estimation from monocular rgb images. In ECCV, Cited by: §1, §2.
  • [8] Y. Cai, L. Huang, Y. Wang, T. Cham, J. Cai, J. Yuan, J. Liu, X. Yang, Y. Zhu, X. Shen, et al. (2020) Learning progressive joint propagation for human motion prediction. In ECCV, Cited by: §1.
  • [9] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh (2019) OpenPose: realtime multi-person 2d pose estimation using part affinity fields. TPAMI. Cited by: §1.
  • [10] B. Doosti, S. Naha, M. Mirbagheri, and D. J. Crandall (2020) HOPE-net: A graph-based model for hand-object pose estimation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, Cited by: §2.
  • [11] K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik (2015) Recurrent network models for human dynamics. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, Cited by: §2.
  • [12] G. Garcia-Hernando, S. Yuan, S. Baek, and T. Kim (2018) First-person hand action benchmark with RGB-D videos and 3d hand pose annotations. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, Cited by: §1, §5.1.
  • [13] L. Ge, Z. Ren, Y. Li, Z. Xue, Y. Wang, J. Cai, and J. Yuan (2019) 3D hand shape and pose estimation from a single RGB image. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, Cited by: §2, §2.
  • [14] P. Ghosh, J. Song, E. Aksan, and O. Hilliges (2017) Learning human motion models for long-term predictions. In 3DV, Cited by: §1, §2.
  • [15] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, Cited by: §2.
  • [16] L. Gui, Y. Wang, X. Liang, and J. M. Moura (2018) Adversarial geometry-aware human motion prediction. In ECCV, Cited by: §2.
  • [17] S. Hampali, M. Oberweger, M. Rad, and V. Lepetit (2019) Ho-3d: a multi-user, multi-object dataset for joint 3d hand-object pose estimation. arXiv preprint arXiv:1907.01481. Cited by: §1, §5.1.
  • [18] S. Hampali, M. Rad, M. Oberweger, and V. Lepetit (2020) HOnnotate: A method for 3d annotation of hand and object poses. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, Cited by: §5.1, §5.5.
  • [19] S. Hampali (2021) HOnnotate: a method for 3d annotation of hand and object poses - results. Cited by: §5.5.
  • [20] F. G. Harvey, M. Yurick, D. Nowrouzezahrai, and C. Pal (2020) Robust motion in-betweening. ToG. Cited by: §1.
  • [21] Y. Hasson, B. Tekin, F. Bogo, I. Laptev, M. Pollefeys, and C. Schmid (2020) Leveraging photometric consistency over time for sparsely supervised hand-object reconstruction. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, Cited by: §1, §2, §2, Figure 4, §5.1, §5.1, §5.3, §5.5, §5.5, Table 2.
  • [22] Y. Hasson, G. Varol, D. Tzionas, I. Kalevatykh, M. J. Black, I. Laptev, and C. Schmid (2019) Learning joint reconstruction of hands and manipulated objects. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, Cited by: §2, §2, §5.5, Table 2.
  • [23] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, Cited by: §4.
  • [24] D. Holden, J. Saito, T. Komura, and T. Joyce (2015) Learning motion manifolds with convolutional autoencoders. In SIGGRAPH Asia, Cited by: §1.
  • [25] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, Cited by: §3.3.
  • [26] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu (2013) Human3. 6m: large scale datasets and predictive methods for 3d human sensing in natural environments. TPAMI. Cited by: §1.
  • [27] U. Iqbal, P. Molchanov, T. Breuel Juergen Gall, and J. Kautz (2018) Hand pose estimation via latent 2.5 d heatmap regression. In ECCV, Cited by: §2, §3.1, §4.
  • [28] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena (2016) Structural-rnn: deep learning on spatio-temporal graphs. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, Cited by: §1, §2.
  • [29] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik (2018) End-to-end recovery of human shape and pose. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, Cited by: §1, §2, §5.4.
  • [30] A. Kanazawa, J. Y. Zhang, P. Felsen, and J. Malik (2019) Learning 3d human dynamics from video. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, Cited by: §1, §2.
  • [31] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila (2020) Training generative adversarial networks with limited data. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, Cited by: §3.3.
  • [32] M. Kaufmann, E. Aksan, J. Song, F. Pece, R. Ziegler, and O. Hilliges (2020) Convolutional autoencoders for human motion infilling. arXiv preprint arXiv:2010.11531. Cited by: §1, §2, §3.3.
  • [33] M. Kocabas, N. Athanasiou, and M. J. Black (2020) VIBE: video inference for human body pose and shape estimation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, Cited by: §1, §2, §3.2.
  • [34] D. Kulon, R. A. Güler, I. Kokkinos, M. M. Bronstein, and S. Zafeiriou (2020) Weakly-supervised mesh-convolutional hand reconstruction in the wild. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, Cited by: §1, §2, §2, §6.
  • [35] C. Li, Z. Zhang, W. S. Lee, and G. H. Lee (2018) Convolutional sequence to sequence model for human dynamics. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, Cited by: §2.
  • [36] N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black (2019) AMASS: archive of motion capture as surface shapes. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, Cited by: §1.
  • [37] W. Mao, M. Liu, M. Salzmann, and H. Li (2019) Learning trajectory dependencies for human motion prediction. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, Cited by: §1.
  • [38] W. Mao, M. Liu, and M. Salzmann (2020) History repeats itself: human motion prediction via motion attention. arXiv preprint arXiv:2007.11755. Cited by: §1, §1.
  • [39] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley (2017) Least squares generative adversarial networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, Cited by: §3.2.
  • [40] J. Martinez, M. J. Black, and J. Romero (2017) On human motion prediction using recurrent neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, Cited by: §1, §2.
  • [41] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. In Proc. of ICLR, Cited by: §3.3.
  • [42] G. Moon, T. Shiratori, and K. M. Lee (2020) DeepHandMesh: a weakly-supervised deep encoder-decoder framework for high-fidelity hand mesh modeling. arXiv preprint arXiv:2008.08213. Cited by: §2, §2.
  • [43] G. Moon, S. Yu, H. Wen, T. Shiratori, and K. M. Lee (2020) InterHand2.2m: a dataset and baseline for 3d interacting hand pose estimation from a single rgb image. arXiv preprint arXiv:2008.09309. Cited by: §2.
  • [44] F. Mueller, F. Bernard, O. Sotnychenko, D. Mehta, S. Sridhar, D. Casas, and C. Theobalt (2018) GANerated hands for real-time 3d hand tracking from monocular RGB. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, Cited by: §2.
  • [45] D. Pavllo, D. Grangier, and M. Auli (2018) QuaterNet: A quaternion-based recurrent model for human motion. In British Machine Vision Conference 2018, BMVC 2018, Newcastle, UK, September 3-6, 2018, Cited by: §1.
  • [46] D. Pavllo, D. Grangier, and M. Auli (2018) QuaterNet: A quaternion-based recurrent model for human motion. In British Machine Vision Conference 2018, BMVC 2018, Newcastle, UK, September 3-6, 2018, Cited by: §2.
  • [47] J. Romero, D. Tzionas, and M. J. Black (2017) Embodied hands: modeling and capturing hands and bodies together. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia). Cited by: §1, §2.
  • [48] A. H. Ruiz, J. Gall, and F. Moreno (2019) Human motion prediction via spatio-temporal inpainting. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, Cited by: §1, §2, §3.3.
  • [49] A. Spurr, U. Iqbal, P. Molchanov, O. Hilliges, and J. Kautz (2020) Weakly supervised 3d hand pose estimation via biomechanical constraints. In ECCV, Cited by: §1, §2, §3.1, §4.
  • [50] A. Spurr, J. Song, S. Park, and O. Hilliges (2018) Cross-modal deep variational hand pose estimation. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, Cited by: §2.
  • [51] B. Tekin, F. Bogo, and M. Pollefeys (2019) H+O: unified egocentric recognition of 3d hand-object poses and interactions. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, Cited by: §2, §5.5, Table 2.
  • [52] Y. Wang, L. Gui, X. Liang, and J. M. F. Moura (2018) Adversarial geometry-aware human motion prediction. In ECCV, Cited by: §1.
  • [53] L. Yang and A. Yao (2019) Disentangling latent hands for image synthesis and pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, Cited by: §2.
  • [54] S. Yuan, Q. Ye, B. Stenger, S. Jain, and T. Kim (2017) BigHand2.2m benchmark: hand pose dataset and state of the art analysis. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, Cited by: §1.
  • [55] X. Zhang, Q. Li, H. Mo, W. Zhang, and W. Zheng (2019) End-to-end hand mesh recovery from a monocular RGB image. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, Cited by: §2, §2.
  • [56] Y. Zhou, Z. Li, S. Xiao, C. He, Z. Huang, and H. Li (2018) Auto-conditioned recurrent networks for extended complex human motion synthesis. In Proc. of ICLR, Cited by: §2.
  • [57] C. Zimmermann and T. Brox (2017) Learning to estimate 3d hand pose from single RGB images. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, Cited by: §2.
  • [58] C. Zimmermann, D. Ceylan, J. Yang, B. C. Russell, M. J. Argus, and T. Brox (2019) FreiHAND: A dataset for markerless capture of hand pose and shape from single RGB images. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, Cited by: §1.