Self-Supervised 3D Hand Pose Estimation from Monocular RGB via Contrastive Learning

06/10/2021
by   Adrian Spurr, et al.

Acquiring accurate 3D annotated data for hand pose estimation is a notoriously difficult problem. This typically requires complex multi-camera setups and controlled conditions, which in turn creates a domain gap that is hard to bridge to fully unconstrained settings. Encouraged by the success of contrastive learning on image classification tasks, we propose a new self-supervised method for the structured regression task of 3D hand pose estimation. Contrastive learning makes use of unlabeled data for the purpose of representation learning via a loss formulation that encourages the learned feature representations to be invariant under any image transformation. For 3D hand pose estimation, invariance to appearance transformations such as color jitter is likewise desirable. However, the task requires equivariance under affine transformations, such as rotation and translation. To address this issue, we propose an equivariant contrastive objective and demonstrate its effectiveness in the context of 3D hand pose estimation. We experimentally investigate the impact of invariant and equivariant contrastive objectives and show that learning equivariant features leads to better representations for the task of 3D hand pose estimation. Furthermore, we show that a standard ResNet-152, trained on additional unlabeled data, attains an improvement of 7.6% in PA-EPE on FreiHAND and thus achieves state-of-the-art performance without any task-specific, specialized architecture.


1 Introduction

Estimating the 3D pose of human hands from monocular images alone has many important applications in robotics, Human-Computer Interaction, and AR/VR. As such, the problem has received significant attention in the computer vision literature [41, 32, 15, 33, 12, 31, 26, 14]. However, estimating the location of 3D hand joints within an RGB image is a challenging structured regression problem, with difficulties arising from a large diversity in backgrounds, lighting conditions, and hand appearances, as well as self-occlusion caused by the high degrees of freedom of the human hand.

One way to alleviate these issues is to acquire annotated datasets that cover a larger diversity of environments and settings. However, acquiring 3D-labeled data is laborious, cost-intensive, and typically requires multi-view imagery or some form of user instrumentation. Data collected under such circumstances often does not transfer well to in-the-wild imagery [42, 20]. Therefore, much interest has been given to approaches that can leverage auxiliary data with either no annotations or only 2D joint annotations. For example, such data can be used to outperform many supervised approaches via weak supervision [4, 3], the integration of kinematic priors [31], or the exploitation of temporal information [14]. Off-the-shelf joint detectors [5] have been leveraged to automatically generate 2D annotations in large quantities [20]. However, the accuracy of models trained on these labels, or on 3D annotations derived from them, is inherently bounded by the label noise. Therefore, the question of how to efficiently leverage unlabeled data for training hand pose estimators remains open.

Recently, self-supervised approaches such as contrastive learning have shown that they can reach parity with supervised approaches on image classification tasks [6, 8]. These methods leverage unlabeled data to learn powerful feature representations. To do so, positive and negative pairs of images are projected into a latent space via a neural network. The contrastive objective encourages the latent space samples of positive pairs to lie close to each other and pushes negative pairs apart. The resulting pre-trained network can then be used on downstream tasks. Positive pairs are created by sampling an image and applying two distinct sets of augmentations to it, whereas negative pairs correspond to separate but similarly augmented images. These augmentations include appearance transformations, such as color drop, and geometric transformations, such as rotation. The contrastive objective induces invariance under all of these transformations. However, tasks such as hand pose estimation require equivariance under geometric transformations. Hence, representations learnt from such an objective may not transfer effectively to pose estimation.

In this paper, we investigate such self-supervised representation learning techniques for hand pose estimation; to the best of our knowledge, we are the first to do so. We derive a method, Pose Equivariant Contrastive Learning (PeCLR), that is able to effectively leverage the large diversity of existing hand images without any joint labels. These images are used to pre-train a network to acquire a general representation, which can then be transferred to the final hand pose estimation task via supervised fine-tuning. This provides a promising direction for hand pose estimation and enables easy transfer to images collected in the wild, or calibration to a specific domain, by fine-tuning a powerful, pre-trained network with fewer labels.

Fig. 1 provides an overview of our method. In a first stage, we perform self-supervised representation learning. Given an RGB input image of the hand, we first apply appearance and geometric transformations to generate positive and negative pairs of derivative images. These are used to train an encoder via our proposed equivariant contrastive loss. By undoing the geometric transformation in latent space, we promote equivariance. However, this inversion needs to be performed with care: since transformations on images should lead to proportional changes in the latent space, the differing magnitudes of latent space and pixel space must be accounted for. The resulting model then yields improved pose estimation accuracy (cf. Fig. 1, bottom).

In the second stage, the pre-trained encoder is fine-tuned on the task of 3D hand pose estimation using labeled data. The resulting model is evaluated thoroughly in a variety of settings. We demonstrate increased label efficiency for semi-supervision and show that using more unlabeled data is beneficial for the final performance, with the largest improvements in 3D EPE obtained in the lowest-label setting (cf. Fig. 6). Furthermore, we show that this improvement also transfers to the fully supervised case, where a standard ResNet-152 in combination with unlabeled data and our proposed pre-training scheme outperforms specialized state-of-the-art architectures (cf. Tab. 2). Finally, we demonstrate that self-supervised pre-training also improves PA-EPE on the dataset used only without labels, indicating that pre-training is beneficial for cross-domain generalization (cf. Tab. 3).

In summary, our contributions are as follows:

  1. To the best of our knowledge, we perform the first investigation of contrastive learning to efficiently leverage unlabeled data for hand pose estimation.

  2. We propose a contrastive learning objective that encourages invariance to appearance transformations and equivariance to geometric transformations.

  3. We conduct controlled experiments to evaluate the quality of the learned representations compared with SimCLR, and empirically derive the best-performing augmentations.

  4. We show that the proposed method achieves better label efficiency in semi-supervised settings and that adding more unlabeled data is beneficial.

  5. We empirically show that our proposed method outperforms current, more specialized state-of-the-art methods using a standard ResNet model.

All code and models will be made available for research purposes.

2 Related work

Hand pose estimation. Hand pose estimation usually follows one of three paradigms: predicting 3D joint skeletons directly [41, 32, 27, 18, 4, 33, 37, 12, 31, 26], regressing the parameters of the MANO [30] parametric hand model [1, 3, 15, 2, 14, 40], or predicting the full mesh model of the hand directly [13, 21, 25]. A staged approach is introduced in [41], where 2D keypoints are regressed directly and then lifted to 3D. Spurr et al. [32] introduce a cross-modal latent space that facilitates better learning. Mueller et al. [27] make use of a synthetically created dataset and reduce the synthetic/real discrepancy via a GAN. Cai et al. [4] make use of supplementary depth supervision to augment the training set. A more efficient 2.5D hand representation is introduced in [18]. Action recognition as well as hand/object pose estimation is performed in [33]. [37] introduces a disentangled latent space for the purpose of better image synthesis. A graph-based neural network is used to jointly refine the hand/object pose in [12]. Biomechanical constraints are introduced to refine pose predictions on 2D-supervised data [31]. Moon et al. [26] predict the pose of both hands and take their interaction into account.

Template-based methods such as MANO induce a prior over hand poses and additionally provide a mesh surface. Some methods [1, 3, 40] estimate the MANO parameters directly from RGB, sometimes making use of weak supervision such as hand masks [1, 40] or in-the-wild 2D annotations [3, 40]. A unified approach is introduced in [15] to jointly predict MANO parameters as well as the object mesh. Hasson et al. [14] build upon this framework by learning from partially labeled sequences via a photometric loss. An alternative to MANO is proposed in [25] by predicting pose- and subject-dependent correctives to a base hand model. Some methods regress the mesh of a hand directly; however, mesh annotations are difficult to acquire. Ge et al. [13] tackle this by introducing a fully mesh-annotated synthetic dataset and using noisy supervision for real data. With the help of spiral convolutions, a hand mesh is predicted in [21], supervised using MANO.

Clearly, much work has been dedicated to custom, sometimes highly specialized architectures for hand pose estimation. In contrast, we explore a purely data-driven approach, utilizing unlabeled data and an equivariance-inducing contrastive formulation to achieve state-of-the-art performance with a standard CNN.

Self-supervised learning.

Self-supervised learning aims to learn representations of data without any annotations. The literature defines a pretext task as the specific strategy used to learn representations in a self-supervised manner. Such tasks include predicting the position of a second patch relative to a first [11], colorizing a grayscale image [39], solving a jigsaw puzzle [28], estimating the motion flow of pixels in a scene [35], predicting positive future samples in audio signals [29], or completing the next sentence based on relations between two sentences [10]. However, it is not clear which pretext task is optimal for a given downstream task in terms of performance and generalizability.

Contrastive learning is a powerful paradigm for self-supervised, task-independent learning. At its core lies a concept from distance metric learning: a pair of samples is encouraged to be close in latent space if they are connected in a meaningful way, while unrelated samples are pushed apart. One of the appeals of contrastive learning lies in the vast amount of unlabeled data available for training. General representations learned through this paradigm have been successfully used in many downstream tasks such as image and video classification [34, 6, 8], object detection [36, 17], and speech classification [29]. However, contrastive learning has not been investigated for the task of hand pose estimation.

The closest related work to this paper includes Contrastive Predictive Coding (CPC) [29, 17], Contrastive Multiview Coding (CMC) [34], and SimCLR [6, 7]. CPC learns to extract representations by predicting future representations in latent space; autoregressive models are used to enable predictions many steps into the future. While CPC learns from the two views of past and future, CMC extends this idea to multi-view learning: it aims to learn view-invariant representations by maximizing mutual information among different views of the same content. The most relevant framework to ours is SimCLR [6], a simple yet effective contrastive learning approach. It benefits substantially from data augmentation, and its learnt representation achieves performance on par with supervised models on image classification. However, the learned transformation-invariant features are not suited for structured regression tasks such as hand pose estimation, as these require representations that are equivariant with respect to geometric transformations. In this work, we extend SimCLR by differentiating between appearance and geometric transformations and propose a model that can successfully learn representations suited to both types of transformations.

3 Method

Figure 2: Method overview. An augmentation $t_i \in \mathcal{T}$ is applied to the input image $I$. Here $t_{g,i}$ and $t_{a,i}$ denote the geometric and appearance components of the augmentation $t_i$, respectively. The model then generates the projections for each augmented input. Geometric augmentations are reversed in projection space before optimizing the contrastive objective. The agreement between projections from the same input image is maximized (left) and the agreement amongst projections from different input images is minimized (right).

In this section, we start by reviewing SimCLR [6]. We then introduce the overall framework of pre-training and fine-tuning. Next, we identify an issue with SimCLR's contrastive formulation when applied to hand pose estimation, motivating our proposed equivariant contrastive objective. Lastly, we present our hand pose estimation model and the method used for 3D keypoint estimation during supervised training.

Notation. In the following, we denote the set of all transformations used as $\mathcal{T}$. It contains appearance transformations (e.g. color jitter), geometric transformations (e.g. scale, rotation and translation), as well as compositions of them. For a given transformation $t \in \mathcal{T}$, $t_a$ and $t_g$ correspond to the appearance and geometric components of the transformation $t$, respectively. Fig. 4 shows all transformations used in this study.

3.1 SimCLR

The idea of the SimCLR [6] framework is to maximize the agreement in latent space between the representations of samples that are similar, while repelling dissimilar pairs. The positive pairs are artificially generated by applying various augmentations to an image. Given a set of samples $\{x\}$, we consider two augmented views $\tilde{x}_i = t_i(x)$ and $\tilde{x}_j = t_j(x)$, where $t_i, t_j \in \mathcal{T}$.

The SimCLR framework consists of an encoder $f(\cdot)$ and a projection head $g(\cdot)$. The overall model maps an image $x$ to a latent space sample $z$, i.e. $z = g(f(x))$. The model is trained using a contrastive objective function that maximizes the agreement between all positive pairs of projections $(z_i, z_j)$, which are extracted from two augmented views of the same image $x$, while simultaneously minimizing the agreement amongst negative pairs of projections $(z_i, z_k)$, where $z_k$ is extracted from a different image.

In each iteration, SimCLR samples both positive and negative pairs. For a given batch of $N$ images, two augmentations are applied to each sample, resulting in $2N$ augmented images. Hence, for every augmented image $\tilde{x}_i$, there is one positive sample $\tilde{x}_j$ and $2(N-1)$ negative samples. The model is trained to project positive samples close to each other while keeping negative samples far apart. This is achieved via the following loss function, termed NT-Xent in [6]:

$$\ell_{i,j} = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)} \qquad (1)$$

Here $\tau$ is a temperature parameter, $\mathrm{sim}(z_i, z_j) = z_i^\top z_j / (\lVert z_i \rVert \lVert z_j \rVert)$ is the cosine similarity between $z_i$ and $z_j$, and $\mathbb{1}_{[k \neq i]}$ is the indicator function.
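For concreteness, the sketch below shows one way to implement the NT-Xent objective of Eq. 1 in PyTorch. The helper name nt_xent and the ordering convention (rows i and i+N are the two views of the same image) are our own illustrative choices, not the authors' code.

```python
import torch
import torch.nn.functional as F

def nt_xent(z: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent loss (Eq. 1) over a batch of 2N projections.

    Assumes rows i and i+N of z are the two augmented views of the same
    image (a common convention; the authors' ordering may differ).
    """
    n2 = z.shape[0]                       # 2N samples
    n = n2 // 2
    z = F.normalize(z, dim=1)             # unit vectors -> dot product equals cosine similarity
    sim = z @ z.t() / temperature         # (2N, 2N) similarity matrix

    # Exclude self-similarity from the denominator (the indicator in Eq. 1).
    sim.fill_diagonal_(float("-inf"))

    # The positive partner of row i is row i + N (and vice versa).
    pos = torch.cat([torch.arange(n, n2), torch.arange(0, n)]).to(z.device)

    # Cross-entropy over the remaining 2N-1 candidates realizes -log(exp(pos)/sum exp).
    return F.cross_entropy(sim, pos)
```

In use, the projections of the two augmented views would be concatenated along the batch dimension, e.g. nt_xent(torch.cat([g(f(x1)), g(f(x2))], dim=0)).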

3.2 Equivariant contrastive representations

Inspecting Eq. 1, we observe that the objective function promotes invariance under all transformations: given a sample $\tilde{x}_i = t_i(x)$ and its positive sample $\tilde{x}_j = t_j(x)$, the loss in Eq. 1 is minimized if $z_i = z_j$, i.e. $g(f(t_i(x))) = g(f(t_j(x)))$. Hence, a model that satisfies Eq. 1 needs to be invariant to all transformations in $\mathcal{T}$. However, hand pose estimation requires equivariance with respect to geometric transformations, as these change the displayed pose. Hence, we require:

$$g(f(t_i(x))) = t_{g,i}\big(g(f(x))\big) \qquad (2)$$

Inverting transformations in latent space. In order to fulfill Eq. 2, we first note that it is equivalent to $t_{g,i}^{-1}\big(g(f(t_i(x)))\big) = g(f(x))$. This leads us to the following equivariant modification of NT-Xent:

$$\ell^{\mathrm{eq}}_{i,j} = -\log \frac{\exp\big(\mathrm{sim}(z'_i, z'_j)/\tau\big)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\big(\mathrm{sim}(z'_i, z'_k)/\tau\big)} \qquad (3)$$

where $z'_i = t_{g,i}^{-1}(z_i)$. In order to minimize Eq. 3, it must hold that $z'_i = z'_j$, i.e. $t_{g,i}^{-1}(z_i) = t_{g,j}^{-1}(z_j)$. This leads to the desired property of Eq. 2; further details can be found in the supplementary. As $t_g$ is an affine transformation, its inverse can be computed easily. However, whereas scaling and rotation are performed relative to a magnitude, translation is specified as an absolute quantity in pixels. In other words, if we translate an image by $t_{xy}$ pixels, we need to translate its latent space projection by a proportional quantity. To achieve this, we express the translation relative to the image size and scale it by a factor proportional to the range spanned by the projections in latent space. To this end, we normalize the translation vector $t_{xy}$ before applying its inverse to a latent space sample to undo the transformation. The normalized vector is computed as follows:

$$\hat{t}_{xy} = \frac{t_{xy}}{(W, H)} \cdot \alpha, \qquad \alpha = \max_{k} \lVert z_k \rVert_\infty \qquad (4)$$

where $(W, H)$ is the image resolution, the division is element-wise, and the maximum is taken over the projections in the batch. The intuition behind $\alpha$ is that it corresponds to the magnitude of latent space values; hence, the resulting translation vector is proportional in magnitude. Lastly, we note that due to the cosine similarity used in Eq. 3, the effect of scaling is effectively removed ($\mathrm{sim}(c\,z_i, z_j) = \mathrm{sim}(z_i, z_j)$ for $c > 0$). The complete equivariant contrastive learning framework is visualized in Fig. 2.
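To make the inversion concrete, the sketch below shows one possible instantiation of undoing the geometric transformation in projection space. It assumes the projection vector is reshaped into a set of 2D points before the inverse rotation and normalized translation are applied; this parameterization, the helper name undo_geometric, and its arguments are illustrative and not necessarily the authors' exact implementation.

```python
import torch

def undo_geometric(z, angle_rad, t_px, img_size):
    """Undo rotation and (normalized) translation in projection space (cf. Eq. 3-4).

    z         : (B, d) projections, d even; here interpreted as d/2 2D points
                (an assumption made for this sketch).
    angle_rad : (B,) rotation angles that were applied to the images.
    t_px      : (B, 2) pixel translations that were applied to the images.
    img_size  : (width, height) of the input images.
    """
    B, d = z.shape
    pts = z.reshape(B, d // 2, 2)

    # Eq. 4: rescale the pixel translation to the magnitude of the latent values.
    alpha = pts.abs().amax()
    size = torch.tensor(img_size, dtype=z.dtype, device=z.device)
    t_lat = t_px / size * alpha

    # Inverse rotation (rotate back by -angle) as a (B, 2, 2) matrix.
    c, s = torch.cos(-angle_rad), torch.sin(-angle_rad)
    R_inv = torch.stack([torch.stack([c, -s], dim=-1),
                         torch.stack([s,  c], dim=-1)], dim=-2)

    # Undo translation, then rotation; scale needs no handling due to cosine similarity.
    pts = torch.einsum('bij,bnj->bni', R_inv, pts - t_lat[:, None, :])
    return pts.reshape(B, d)
```

The transformed projections $z'_i$ can then be fed to the same NT-Xent loss as above.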

From pre-training to fine-tuning. After having performed pre-training using our proposed loss function, we fine-tune the encoder in a supervised manner on the task of hand pose estimation. To this end, following [6], we remove the projection head from the model and replace it with a linear layer. The entire model is then trained end-to-end using the losses described next, in Sec. 3.3.
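A minimal sketch of this head swap is given below, assuming a torchvision ResNet backbone; the projection head sizes and output layout are illustrative placeholders, not the authors' exact dimensions.

```python
import torch.nn as nn
from torchvision.models import resnet152

# Encoder used for contrastive pre-training (PeCLR weights assumed to be
# loaded separately; a torchvision backbone is used here for illustration).
encoder = resnet152()
feat_dim = encoder.fc.in_features              # 2048 for ResNet-152

# Pre-training: pooled features pass through a small MLP projection head g.
projection_head = nn.Sequential(
    nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, 128))

# Fine-tuning: discard g and attach a linear layer regressing the 2.5D
# representation (21 joints x (u, v, root-relative z)).
encoder.fc = nn.Linear(feat_dim, 21 * 3)
```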

3.3 3D Hand Pose Estimator

Our hand pose estimation model makes use of the 2.5D representation [18]. Given an image, the network predicts the 2D keypoints $K_{2D}$ and the root-relative depths $z^{r}$ of the hand joints. As such, our hand pose model is trained with the following supervised loss functions:

$$\mathcal{L} = \mathcal{L}_{2D}\big(\hat{K}_{2D}, K_{2D}\big) + \mathcal{L}_{z}\big(\hat{z}^{r}, z^{r}\big) \qquad (5)$$

where $\mathcal{L}_{2D}$ penalizes errors in the predicted 2D keypoints and $\mathcal{L}_{z}$ penalizes errors in the predicted root-relative depths.

Given the predicted values of $\hat{K}_{2D}$ and $\hat{z}^{r}$, the depth value of the root keypoint can be acquired as detailed in [18]. As a final step, we refine the acquired root depth to increase accuracy and stability as described in [31], which yields $\hat{z}_{\mathrm{root}}$. The resulting 3D pose is acquired as follows:

$$\hat{K}_{3D,j} = \big(\hat{z}^{r}_{j} + \hat{z}_{\mathrm{root}}\big)\, \Pi^{-1} \begin{bmatrix} \hat{K}_{2D,j} \\ 1 \end{bmatrix} \qquad (6)$$

where $\Pi$ is the camera intrinsic matrix and $j$ indexes the joints.
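A short sketch of this back-projection step, with illustrative function and variable names, is given below; it assumes per-joint pixel coordinates, root-relative depths, and a recovered root depth as inputs.

```python
import numpy as np

def lift_2p5d_to_3d(uv, z_rel, z_root, K):
    """Back-project 2.5D predictions into 3D camera coordinates (cf. Eq. 6).

    uv     : (J, 2) predicted 2D keypoints in pixels.
    z_rel  : (J,)   predicted root-relative depths.
    z_root : float  recovered (refined) absolute root depth.
    K      : (3, 3) camera intrinsic matrix.
    """
    z_abs = z_rel + z_root                                           # absolute depth per joint
    uv_h = np.concatenate([uv, np.ones((uv.shape[0], 1))], axis=1)   # homogeneous pixel coords
    rays = (np.linalg.inv(K) @ uv_h.T).T                             # viewing rays at unit depth
    return rays * z_abs[:, None]                                     # scale rays by absolute depth
```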

4 Experiments

Sec. 4.3 investigates the impact of different data augmentation operations and evaluates their effectiveness for the hand pose estimation task. Next, using the self-supervised learnt representation, we demonstrate in Sec. 4.4 how our model efficiently makes use of labeled data in semi-supervised settings. In Sec. 4.5 we compare our method with related work in hand pose estimation and demonstrate that PeCLR can reach state-of-the-art performance on FreiHAND (FH). Finally, in Sec. 4.6 we perform a cross-dataset evaluation to show the advantages of the proposed representation learning across domain distributions.

Figure 3: Qualitative keypoint predictions on the YT3D (top) and FreiHAND (bottom) test sets. Results from RN152 (baseline) and RN152 + PeCLR are shown in each row. The ground-truth data is not publicly available for FH; therefore, only the predictions are shown on the right.

4.1 Implementation

For pre-training, we use a ResNet [16] as the encoder, which takes monocular RGB images as input. In the representation learning stage we use LARS [38] with ADAM [19] and batches of size 2048. During fine-tuning, we use a lower input resolution for the augmentation and semi-supervised studies (Sec. 4.3, 4.4) and a higher one for the state-of-the-art and cross-dataset comparisons (Sec. 4.5, 4.6), and we optimize with ADAM. Further training details, including the exact image resolutions and learning rates, can be found in the supplementary.

4.2 Datasets

We use the following datasets in our experiments. FreiHAND (FH) [42] consists of 32'560 frames captured against a green screen background in the training set, and real backgrounds in the test set. Its final evaluation is performed online; hence we do not have access to the ground truth of the test set. We use FH for all supervised and self-supervised training and report the absolute as well as the Procrustes-aligned MPJPE and AUC. YouTube3DHands (YT3D) [20] consists of 47'125 in-the-wild frames with 3D annotations acquired automatically via keypoint detection from OpenPose [5] and MANO [30] fitting. We use YT3D exclusively for self-supervised representation learning. YT3D provides only 3D vertices and no camera intrinsics; hence we report the Procrustes-aligned MPJPE and the 2D pixel error via weak-perspective projection.
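For reference, a common way to compute the Procrustes-aligned per-joint error is sketched below; this is a standard similarity alignment and not necessarily identical to the official FreiHAND evaluation script.

```python
import numpy as np

def pa_epe(pred, gt):
    """Mean per-joint 3D error after Procrustes (similarity) alignment.

    pred, gt : (J, 3) predicted and ground-truth joints in the same unit.
    """
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g

    # Optimal rotation and scale aligning p to g (classical Procrustes analysis).
    U, S, Vt = np.linalg.svd(p.T @ g)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:              # guard against reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()

    aligned = scale * p @ R.T + mu_g
    return np.linalg.norm(aligned - gt, axis=1).mean()
```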

Figure 4: Individual augmentation operations evaluated for hand pose estimation. Geometric transformations are in blue and appearance transformations are in green. The sample is taken from the FreiHAND dataset.
Figure 5: (a) Performance of individual augmentations using SimCLR's contrastive loss function, as evaluated by an MLP on top of the frozen representation. (b) Improvement of the proposed equivariant contrastive loss over SimCLR's contrastive formulation under translation and rotation; both transformations benefit significantly from PeCLR.
Model          3D EPE (cm)   AUC    2D EPE (px)
SimCLR         16.62         0.72   12.05
PeCLR (ours)   16.05         0.74   10.51

Table 1: Comparison of SimCLR with our approach on the task of hand pose estimation. The encoders are pre-trained with SimCLR or PeCLR respectively and are frozen during fine-tuning. Both methods use their optimal set of augmentations, as explained in Sec. 4.3.

4.3 Evaluation of augmentation strategies

To study which set of data augmentations performs best, we first consider various augmentation operations for the representation learning phase. Fig. 4 visualizes the studied transformations in our experiment. We first evaluate individual transformations and then find their best composition.

We conduct the experiment on FH using our own training and validation split and use a ResNet-50 as the encoder. We train two encoders with different objective functions: one using NT-Xent (Eq. 1) as proposed in SimCLR, and one using our proposed contrastive formulation (Eq. 3). To evaluate the learned feature representation, we freeze the encoder and train a two-layer MLP in a fully-supervised manner on 3D hand labels as described in Sec. 3.3.
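A sketch of this frozen-encoder evaluation protocol is shown below; the MLP hidden size, learning rate, and input resolution are illustrative placeholders rather than the values used in the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

encoder = resnet50()
encoder.fc = nn.Identity()                # expose the 2048-d pooled features
for p in encoder.parameters():
    p.requires_grad = False               # freeze the pre-trained encoder
encoder.eval()

# Two-layer MLP regressing the 2.5D targets (21 joints x 3 values).
probe = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 21 * 3))
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-4)

images = torch.randn(4, 3, 128, 128)      # dummy batch at an illustrative resolution
with torch.no_grad():
    feats = encoder(images)               # features only; no encoder gradients
pred = probe(feats).view(4, 21, 3)        # trained against 2.5D labels (Sec. 3.3)
```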

Individual augmentation. Fig. 5 shows the performance when individual augmentations are applied, using the SimCLR framework. We observe that encoders trained with these transformations perform better than random initialization. However, the rotation transformation leads to particularly poor performance. As motivated in Sec. 3.2, SimCLR promotes invariance under all transformations, including geometric ones; we hypothesize that the poor performance stems from this invariance property. To verify this, we compare the equivariant contrastive loss proposed in PeCLR against SimCLR's contrastive formulation under two geometric transformations, namely translation and rotation. We emphasize again that, due to the cosine similarity, the effect of scale is eliminated. Fig. 5 shows that for both translation and rotation, PeCLR yields significant improvements relative to SimCLR. With PeCLR, scale, translation and rotation yield the best feature representations as evaluated by the final MLP's accuracy. This empirically verifies our intuition that promoting equivariance leads to better representations for pose estimation. Note that we only promote equivariance for geometric transformations; therefore, all appearance-related transformations yield the same performance for PeCLR and SimCLR.

Composite augmentations. Finally, we compare different compositions of transformations. To narrow down the search space, we pick the top-4 performing augmentations from Fig. 5 as candidates. We then conduct an exhaustive search over all combinations of the selected candidates and empirically find that scale, rotation, translation and color jitter deliver the best performance for PeCLR, whereas SimCLR performs best with scale and color jitter.
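For illustration, such a composition could be assembled with torchvision as follows; the parameter ranges are hypothetical placeholders, not the tuned values from the paper.

```python
from torchvision import transforms

# Geometric component (scale, rotation, translation) followed by the appearance
# component (color jitter). Ranges below are illustrative only.
peclr_augment = transforms.Compose([
    transforms.RandomAffine(degrees=90, translate=(0.2, 0.2), scale=(0.8, 1.2)),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.ToTensor(),
])
```

Note that for the equivariant objective the sampled geometric parameters must be recorded (e.g. via RandomAffine.get_params) so that they can be inverted in projection space, as described in Sec. 3.2.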

We compare PeCLR with SimCLR using their respective optimal compositions and report the results in Tab. 1. PeCLR yields better features than SimCLR, improving 3D EPE by 3.4% and 2D EPE by 12.8%. This demonstrates that the proposed equivariant contrastive loss is an effective representation learning approach for hand pose estimation.

Figure 6: Semi-supervised performance on FH. We observe that pre-training with PeCLR achieves greater accuracy than purely supervised training. Adding additional unlabeled data increases this effect.

4.4 Semi-supervised learning

In this experiment, we evaluate the efficiency of PeCLR in making use of labeled data. To this end, we perform semi-supervised learning on FH with the pre-trained encoder, using the optimal data augmentation compositions determined in Sec. 4.3. As indicated in [7], deeper neural networks can make better use of large training data. Therefore, we increase our network capacity and use a ResNet-152 as the encoder in the following. Results and discussion for ResNet-50 can be found in the supplementary.

Specifically, we pre-train our encoder on FH with PeCLR. The encoder is then fine-tuned on varying amounts of labeled FH data. To quantify the effectiveness of our proposed pre-training strategy, we compare against a baseline that is trained solely on the labeled data of FH, without the pre-training step. Finally, to demonstrate the advantage of self-supervised representation learning with larger amounts of training data, we train a third model that is pre-trained on both FH and YT3D.

From the results shown in Fig. 6, we see that both pre-trained models outperform the baseline regardless of the amount of labels used. This result is in line with [7], confirming that pre-trained models can increase label efficiency for hand pose estimation. Comparing the FH-only pre-trained model with the model pre-trained on FH and YT3D, we see that increasing the amount of data during the pre-training phase is beneficial and further decreases the error. These results shed light on the label efficiency of the pre-training strategy: for example, the model pre-trained on both datasets and fine-tuned with a fraction of the labels performs almost on par with the baseline trained on considerably more labeled data (cf. Fig. 6).

4.5 Comparison with state of the art.

Method              3D PA-EPE (cm)   PA-AUC
Spurr et al. [31]   0.90             0.82
Kulon et al. [22]   0.84             0.83
Li et al. [23]      0.80             0.84
Pose2Mesh [9]       0.77             -
I2L-MeshNet [24]    0.74             -
RN152               0.79             0.84
  + PeCLR (ours)    0.73             0.86

Table 2: Comparison with the state of the art. A standard RN152 model alone does not outperform state-of-the-art methods. Pre-training with PeCLR yields a performance increase, resulting in state-of-the-art performance.

With the optimal composition of transformations and representation learning strategy in place, we compare PeCLR with current state-of-the-art approaches on the FH dataset. For our method, we use an increased image resolution and a ResNet-152 as the encoder. The encoder is pre-trained on FH and YT3D with PeCLR and then fine-tuned with supervision on FH. In addition, we train a baseline model solely on FH in a supervised manner.

Tab. 2 compares our results to the current state of the art. We see that training a ResNet-152 model only on FH does not outperform the state of the art, despite its large model capacity. We hypothesize that this is due to the comparably small size of FH and the resulting lack of sufficient labeled training data. However, using PeCLR to leverage YT3D in an unsupervised manner improves performance by 7.6% in PA-EPE. Note that all methods in Tab. 2 use highly specialized architectures. In contrast, our formulation establishes state-of-the-art performance in a purely data-driven way.

4.6 Cross-dataset analysis

FH
Method          3D EPE (cm)   AUC
Supervised      5.40          0.32
PeCLR (Ours)    5.09          0.34
Improvement     5.74 %        6.25 %

YT3D
Method          3D PA-EPE (cm)   2D EPE (px)
Supervised      3.08             20.59
PeCLR (Ours)    2.93             18.70
Improvement     4.84 %           9.18 %

Table 3: Cross-dataset evaluation. The PeCLR model with the ResNet-152 architecture is pre-trained on YT3D and FH and then fine-tuned on FH. The model is evaluated on both the FH (top) and YT3D (bottom) test sets. We observe similar improvements across both datasets.

We hypothesize that, given a large amount of unlabeled training data, our approach can produce better features that are beneficial for generalization. To verify this, we examine the models of Sec. 4.5 in a cross-dataset setting. More specifically, we investigate the performance of both models on the YT3D dataset. This sheds light on how the models perform under a domain shift. We emphasize that neither model is trained with supervision on YT3D.

The results in Tab. 3 show that PeCLR outperforms the fully-supervised baseline, with improvements of 4.84% in 3D PA-EPE and 9.18% in 2D EPE on YT3D. These results indicate that PeCLR provides a promising way of using unlabeled data for representation learning and of training a model that can be more easily adapted to other data distributions. We note that cross-dataset generalization is seldom reported in the hand pose literature; it is generally assumed to be very challenging for most existing methods, yet it is important for real-world applications.

5 Conclusion

3D hand pose estimation from monocular RGB is a challenging task due to the large diversity in environmental conditions and hand appearances. To make strides in pose estimation accuracy, we investigate self-supervised contrastive learning for hand pose estimation, taking advantage of large amounts of unlabeled data for representation learning. We identify a key issue in the contrastive loss formulation, where promoting invariance leads to detrimental results for pose estimation. To address this issue, we propose a novel method, PeCLR, that encourages equivariance to geometric transformations during the representation learning phase. We thoroughly investigate PeCLR by comparing the resulting feature representations and demonstrate improved performance of PeCLR over SimCLR. We show that PeCLR has high label efficiency by means of semi-supervision. Consequently, PeCLR achieves state-of-the-art results on the FreiHAND dataset. Lastly, we conduct a cross-dataset analysis and show the potential of PeCLR for cross-domain applications. We believe the proposed PeCLR, as well as our extensive evaluations, can be of benefit to the community: it provides a feasible solution to improve generalizability across datasets. We foresee the use of PeCLR for other tasks such as human body pose estimation.

References

  • [1] Seungryul Baek, Kwang In Kim, and Tae-Kyun Kim. Pushing the envelope for rgb-based dense 3d hand pose estimation via neural rendering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, 2019.
  • [2] Seungryul Baek, Kwang In Kim, and Tae-Kyun Kim. Weakly-supervised domain adaptation via GAN and mesh model for estimating 3d hand poses interacting objects. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, 2020.
  • [3] Adnane Boukhayma, Rodrigo de Bem, and Philip H. S. Torr. 3d hand shape and pose from images in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, 2019.
  • [4] Yujun Cai, Liuhao Ge, Jianfei Cai, and Junsong Yuan. Weakly-supervised 3d hand pose estimation from monocular rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • [5] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields, 2019.
  • [6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, 2020.
  • [7] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E. Hinton. Big self-supervised models are strong semi-supervised learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  • [8] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning, 2020.
  • [9] Hongsuk Choi, Gyeongsik Moon, and Kyoung Mu Lee. Pose2mesh: Graph convolutional network for 3d human pose and mesh recovery from a 2d human pose, 2020.
  • [10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL-HLT, Minneapolis, Minnesota, 2019.
  • [11] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015.
  • [12] Bardia Doosti, Shujon Naha, Majid Mirbagheri, and David J. Crandall. Hope-net: A graph-based model for hand-object pose estimation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, 2020.
  • [13] Liuhao Ge, Zhou Ren, Yuncheng Li, Zehao Xue, Yingying Wang, Jianfei Cai, and Junsong Yuan. 3d hand shape and pose estimation from a single RGB image. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, 2019.
  • [14] Yana Hasson, Bugra Tekin, Federica Bogo, Ivan Laptev, Marc Pollefeys, and Cordelia Schmid. Leveraging photometric consistency over time for sparsely supervised hand-object reconstruction. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, 2020.
  • [15] Yana Hasson, Gül Varol, Dimitrios Tzionas, Igor Kalevatykh, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning joint reconstruction of hands and manipulated objects. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, 2019.
  • [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.
  • [17] Olivier J. Hénaff. Data-efficient image recognition with contrastive predictive coding. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, 2020.
  • [18] Umar Iqbal, Pavlo Molchanov, Thomas Breuel, Juergen Gall, and Jan Kautz. Hand pose estimation via latent 2.5d heatmap regression. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
  • [19] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proc. of ICLR, 2015.
  • [20] Dominik Kulon, Riza Alp Güler, Iasonas Kokkinos, Michael M. Bronstein, and Stefanos Zafeiriou. Weakly-supervised mesh-convolutional hand reconstruction in the wild. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, 2020.
  • [21] Dominik Kulon, Riza Alp Güler, Iasonas Kokkinos, Michael M. Bronstein, and Stefanos Zafeiriou. Weakly-supervised mesh-convolutional hand reconstruction in the wild. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, 2020.
  • [22] Dominik Kulon, Riza Alp Güler, Iasonas Kokkinos, Michael M. Bronstein, and Stefanos Zafeiriou. Weakly-supervised mesh-convolutional hand reconstruction in the wild. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, 2020.
  • [23] Moran Li, Yuan Gao, and Nong Sang. Exploiting learnable joint groups for hand pose estimation. arXiv preprint arXiv:2012.09496, 2020.
  • [24] Gyeongsik Moon and Kyoung Mu Lee. I2l-meshnet: Image-to-lixel prediction network for accurate 3d human pose and mesh estimation from a single rgb image, 2020.
  • [25] Gyeongsik Moon, Takaaki Shiratori, and Kyoung Mu Lee. Deephandmesh: A weakly-supervised deep encoder-decoder framework for high-fidelity hand mesh modeling. arXiv preprint arXiv:2008.08213, 2020.
  • [26] Gyeongsik Moon, Shoou-I Yu, He Wen, Takaaki Shiratori, and Kyoung Mu Lee. InterHand2.6M: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image. arXiv preprint arXiv:2008.09309, 2020.
  • [27] Franziska Mueller, Florian Bernard, Oleksandr Sotnychenko, Dushyant Mehta, Srinath Sridhar, Dan Casas, and Christian Theobalt. Ganerated hands for real-time 3d hand tracking from monocular RGB. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018.
  • [28] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision. Springer, 2016.
  • [29] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • [30] Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 2017.
  • [31] Adrian Spurr, Umar Iqbal, Pavlo Molchanov, Otmar Hilliges, and Jan Kautz. Weakly supervised 3d hand pose estimation via biomechanical constraints, 2020.
  • [32] Adrian Spurr, Jie Song, Seonwook Park, and Otmar Hilliges. Cross-modal deep variational hand pose estimation. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018.
  • [33] Bugra Tekin, Federica Bogo, and Marc Pollefeys. H+O: unified egocentric recognition of 3d hand-object poses and interactions. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, 2019.
  • [34] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
  • [35] Jacob Walker, Carl Doersch, Abhinav Gupta, and Martial Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In European Conference on Computer Vision. Springer, 2016.
  • [36] Zhirong Wu, Yuanjun Xiong, Stella Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance-level discrimination, 2018.
  • [37] Linlin Yang and Angela Yao. Disentangling latent hands for image synthesis and pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, 2019.
  • [38] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks, 2017.
  • [39] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision. Springer, 2016.
  • [40] Xiong Zhang, Qiang Li, Hong Mo, Wenbo Zhang, and Wen Zheng. End-to-end hand mesh recovery from a monocular RGB image. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, 2019.
  • [41] Christian Zimmermann and Thomas Brox. Learning to estimate 3d hand pose from single RGB images. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017.
  • [42] Christian Zimmermann, Duygu Ceylan, Jimei Yang, Bryan C. Russell, Max J. Argus, and Thomas Brox. Freihand: A dataset for markerless capture of hand pose and shape from single RGB images. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, 2019.