1 Introduction
Estimating the 3D pose of human hands from monocular images alone has many important applications in robotics, Human-Computer Interaction, and AR/VR. As such, the problem has received significant attention in the computer vision literature [41, 32, 15, 33, 12, 31, 26, 14]. However, estimating the location of 3D hand joints within an RGB image is a challenging structured regression problem, with difficulties that arise from a large diversity in backgrounds, lighting conditions, and hand appearances, as well as self-occlusion caused by the high degrees of freedom of the human hand.
One way to alleviate these issues is to acquire annotated datasets that cover a larger diversity of environments and settings. However, acquiring 3D-labeled data is laborious and cost-intensive, and typically requires multi-view imagery or some form of user instrumentation. Models trained on data collected under such circumstances often transfer poorly to in-the-wild imagery [42, 20]. Therefore, much interest is given to approaches that can leverage auxiliary data which has either no annotations or only 2D joint annotations. For example, such data can be used to outperform many supervised approaches via weak supervision [4, 3], the integration of kinematic priors [31], or the exploitation of temporal information [14]. Off-the-shelf joint detectors [5] have been leveraged to automatically generate 2D annotations in large quantities [20]. However, the accuracy of models trained on these labels, or on 3D annotations derived from them, is inherently bounded by the label noise. Therefore, the question of how to efficiently leverage unlabeled data for training hand pose estimators remains open.
Recently, self-supervised approaches such as contrastive learning have been shown to reach parity with supervised approaches on image classification tasks [6, 8]. These methods leverage unlabeled data to learn powerful feature representations. To do so, positive and negative pairs of images are projected into a latent space via a neural network. The contrastive objective encourages the latent samples of positive pairs to lie close to each other and pushes negative pairs apart. The resulting pretrained network can then be used on downstream tasks. Positive pairs are created by sampling an image and applying two distinct sets of augmentations to it, whereas negative pairs correspond to separate but similarly augmented images. These augmentations include appearance transformations, such as color drop, and geometric transformations, such as rotation. The contrastive objective induces invariance under all of these transformations. However, tasks such as hand pose estimation require equivariance under geometric transformations. Hence, representations learnt from such an objective may not transfer effectively to pose estimation.
In this paper, we investigate such self-supervised representation learning techniques for hand pose estimation. To the best of our knowledge, we are the first to do so. We derive a method, Pose Equivariant Contrastive Learning (PeCLR), that is able to effectively leverage the large diversity of existing hand images without any joint labels. These images are used to pretrain a network to acquire a general representation, which can then be transferred to the final hand pose estimation task via supervised fine-tuning. This provides a promising direction for hand pose estimation: it enables an easy transfer to images collected in-the-wild, or calibration to a specific domain, by fine-tuning a powerful pretrained network with fewer labels.
Fig. 1 provides an overview of our method. In a first stage, we perform self-supervised representation learning. Given an RGB input image of the hand, we first apply appearance and geometric transformations to generate positive and negative pairs of derivative images. These are used to train an encoder via our proposed equivariant contrastive loss. By undoing the geometric transformation in latent space, we promote equivariance. However, this inversion needs to be performed with care: transformations of the image should lead to proportional changes in latent space, so the differing magnitudes of latent space and pixel space must be accounted for. The resulting model then yields improved pose estimation accuracy (cf. Fig. 1, bottom).
In the second stage, the pretrained encoder is fine-tuned on the task of 3D hand pose estimation using labeled data. The resulting model is evaluated thoroughly in a variety of settings. We demonstrate increased label efficiency for semi-supervision and show that using more unlabeled data is beneficial for the final performance, with the largest improvements in 3D EPE attained in the lowest-label setting (cf. Fig. 6). Furthermore, we show that this improvement also transfers to the fully supervised case, where a standard ResNet-152 in combination with unlabeled data and our proposed pretraining scheme outperforms specialized state-of-the-art architectures (cf. Tab. 2). Finally, we demonstrate that self-supervised pretraining improves PA-EPE on unseen data (cf. Tab. 3), indicating that pretraining is beneficial for cross-domain generalization.
In summary, our contributions are as follows:

- To the best of our knowledge, we perform the first investigation of contrastive learning to efficiently leverage unlabeled data for hand pose estimation.
- We propose a contrastive learning objective that encourages invariance to appearance transformations and equivariance to geometric transformations.
- We conduct controlled experiments to evaluate the quality of the learned representations, compared with SimCLR, and empirically derive the best-performing augmentations.
- We show that the proposed method achieves better label efficiency in semi-supervised settings and that adding more unlabeled data is beneficial.
- We empirically show that our proposed method outperforms current, more specialized state-of-the-art methods using a standard ResNet model.
All code and models will be made available for research purposes.
2 Related work
Hand pose estimation. Hand pose estimation usually follows one of three paradigms: predicting 3D joint skeletons directly [41, 32, 27, 18, 4, 33, 37, 12, 31, 26], regressing the parameters of the parametric MANO hand model [30] [1, 3, 15, 2, 14, 40], or predicting the full mesh model of the hand directly [13, 21, 25]. A staged approach is introduced in [41], where 2D keypoints are regressed directly and then lifted to 3D. Spurr et al. [32] introduce a cross-modal latent space which facilitates learning. Mueller et al. [27] make use of a synthetically created dataset and reduce the synthetic/real discrepancy via a GAN. Cai et al. [4] make use of supplementary depth supervision to augment the training set. A more efficient 2.5D hand representation is introduced in [18]. Action recognition as well as hand/object pose estimation is performed in [33]. [37] introduces a disentangled latent space for the purpose of better image synthesis. A graph-based neural network is used to jointly refine hand/object poses in [12]. Biomechanical constraints are introduced to refine pose predictions on 2D-supervised data [31]. Moon et al. [26] predict the pose of both hands and take their interaction into account.
Template-based methods such as MANO induce a prior over hand poses, as well as providing a mesh surface. Some methods [1, 3, 40] estimate the MANO parameters directly from RGB, sometimes making use of weak supervision such as hand masks [1, 40] or in-the-wild 2D annotations [3, 40]. A unified approach is introduced in [15] to jointly predict MANO parameters as well as the object mesh. Hasson et al. [14] build upon that framework by learning from partially labeled sequences via a photometric loss. An alternative to MANO is proposed in [25] by predicting pose- and subject-dependent correctives to a base hand model. Some methods regress the mesh of the hand directly. However, mesh annotations are difficult to acquire. Ge et al. [13] tackle this by introducing a fully mesh-annotated synthetic dataset and performing noisy supervision for real data. With the help of spiral convolutions, a hand mesh is predicted in [21], supervised using MANO.
Clearly, much work has been dedicated to custom, sometimes highly specialized architectures for hand pose estimation. In contrast, we explore a purely data-driven approach, utilizing unlabeled data and an equivariance-inducing contrastive formulation, to achieve state-of-the-art performance with a standard CNN.
Self-supervised learning.
Self-supervised learning aims to learn representations of data without any annotations. The literature defines the pretext task as the specific strategy used to learn the representation in a self-supervised manner. Such tasks include predicting the position of a second patch relative to a first [11], colorizing a grayscale image [39], solving a jigsaw puzzle [28], estimating the motion flow of pixels in a scene [35], predicting positive future samples in audio signals [29], or completing the next sentence based on relations between two sentences [10]. However, it is not clear which pretext task is optimal for a given downstream task in terms of performance and generalizability.
Contrastive learning is a powerful paradigm for self-supervised, task-independent learning. At its core lies a concept from distance metric learning: a pair of data points is encouraged to be close in latent space if they are connected in a meaningful way, while unrelated data are pushed apart. One of the appeals of contrastive learning lies in the vast amount of data available for training. General representations learned through this paradigm have been successfully used in many downstream tasks such as image and video classification [34, 6, 8], object detection [36, 17], and speech classification [29]. However, contrastive learning has not been investigated for the task of hand pose estimation.
The closest related works to this paper include Contrastive Predictive Coding (CPC) [29, 17], Contrastive Multiview Coding (CMC) [34], and SimCLR [6, 7]. CPC learns to extract representations by predicting future representations in latent space, using autoregressive models to enable predictions many steps into the future. While CPC learns from the two views of past and future, CMC extends this idea to multi-view learning: it aims to learn view-invariant representations by maximizing mutual information among different views of the same content. The most relevant framework for our work is SimCLR [6], a simple yet effective contrastive learning approach. It benefits greatly from data augmentation, and its learnt representation achieves performance on par with supervised models on image classification. However, the learned transformation-invariant features are not suited for structured regression tasks such as hand pose estimation, as these require a representation that is equivariant with respect to geometric transformations. In this work, we extend SimCLR by differentiating between appearance and geometric transformations, and propose a model that can successfully learn representations suited to both types of transformations.
3 Method
In this section, we start by reviewing SimCLR [6]. We then introduce the overall framework of pretraining and fine-tuning. Next, we identify an issue with SimCLR's contrastive formulation when applied to hand pose estimation, motivating our proposed equivariant contrastive objective. Lastly, we present our hand pose estimation model and the method used for 3D keypoint estimation during supervised training.
Notation. In the following, we denote the set of all transformations used as $\mathcal{T}$. It contains appearance transformations (e.g. color jitter), geometric transformations (e.g. scale, rotation and translation), as well as compositions of them. For a given transformation $t \in \mathcal{T}$, $t_a$ and $t_g$ denote the appearance and geometric components of $t$, respectively. Fig. 4 shows all transformations used in this study.
3.1 SimCLR
The idea of the SimCLR [6] framework is to maximize the agreement in latent space between the representations of samples that are similar, while repelling dissimilar pairs. The positive pairs are artificially generated by applying various augmentations to an image. Given a set of samples $\{x_k\}_{k=1}^{N}$, we consider two augmented views $\tilde{x}_i = t_i(x_k)$ and $\tilde{x}_j = t_j(x_k)$, where $t_i, t_j \sim \mathcal{T}$.
The SimCLR framework consists of an encoder $f(\cdot)$ and a projection head $g(\cdot)$. The overall model maps an image $x$ to a latent space sample $z = g(f(x))$. The model is trained using a contrastive objective that maximizes the agreement between all positive pairs of projections $(z_i, z_j)$, extracted from two augmented views of the same image, while simultaneously minimizing the agreement amongst negative pairs of projections $(z_i, z_k)$, where $z_i$ and $z_k$ are extracted from different images.
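To make the objective concrete before introducing its equivariant variant, the NT-Xent loss of Eq. 1 can be sketched in a few lines of numpy. This is an illustrative sketch rather than the authors' implementation; the convention that rows 2k and 2k+1 of the batch hold the two augmented views of image k is an assumption of this example.

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent loss over a batch of 2N projections (Eq. 1).

    Convention (assumed here): rows 2k and 2k+1 are the two augmented
    views of image k, so each sample's positive partner is its neighbor
    and the remaining 2(N-1) samples act as negatives.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit vectors -> dot product = cosine sim
    sim = (z @ z.T) / tau                             # pairwise similarities, temperature-scaled
    np.fill_diagonal(sim, -np.inf)                    # exclude self-pairs (indicator 1[k != i])
    n = len(z)
    pos = np.arange(n) ^ 1                            # index of each sample's positive partner
    # -log softmax of the positive entry, averaged over all 2N anchors
    log_prob = sim[np.arange(n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()
```

When the two views of each image project to identical directions, the loss is low; when positives point away from each other, it grows, which is exactly the agreement-maximizing behavior described above.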
In each iteration, SimCLR samples both positive and negative pairs. For a given batch of $N$ images, two augmentations are applied to each sample, resulting in $2N$ augmented images. Hence, for every augmented image $\tilde{x}_i$ there is one positive sample $\tilde{x}_j$ and $2(N-1)$ negative samples. The model is trained to project positive samples close to each other while keeping negative samples far apart. This is achieved via the following loss function, termed NT-Xent in [6]:

$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$   (1)

Here $\tau$ is a temperature parameter, $\mathrm{sim}(z_i, z_j) = z_i^\top z_j / (\|z_i\| \|z_j\|)$ is the cosine similarity between $z_i$ and $z_j$, and $\mathbb{1}_{[k \neq i]}$ is the indicator function.
3.2 Equivariant contrastive representations
Inspecting Eq. 1, we observe that the objective function promotes invariance under all transformations. Given a sample $\tilde{x}_i$ and its positive sample $\tilde{x}_j$, the numerator in Eq. 1 is maximized if $z_i = z_j$. Hence, a model that minimizes Eq. 1 needs to be invariant to all transformations in $\mathcal{T}$. However, hand pose estimation requires equivariance with respect to geometric transformations, as these change the displayed pose. Hence, we require:

$g(f(t_g(x))) = t_g(g(f(x)))$   (2)
Inverting transformations in latent space. In order to fulfill Eq. 2, we first note that it is equivalent to $t_g^{-1}(g(f(t_g(x)))) = g(f(x))$. This leads us to the following equivariant modification of NT-Xent:

$\ell^{eq}_{i,j} = -\log \frac{\exp(\mathrm{sim}(\hat{z}_i, \hat{z}_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(\hat{z}_i, \hat{z}_k)/\tau)}$   (3)

where $\hat{z}_i = t_{g,i}^{-1}(z_i)$. In order to maximize the numerator in Eq. 3 it must hold that $\hat{z}_i = \hat{z}_j$, i.e. $t_{g,i}^{-1}(z_i) = t_{g,j}^{-1}(z_j)$, which is exactly the property of Eq. 2 applied to the two views; further details can be found in the supplementary. As $t_g$ is an affine transformation, its inverse is easily computed. However, whereas scaling and rotation act relative to magnitudes, translation is an absolute quantity: if we translate an image by $t$ pixels, we need to translate its latent space projection by a proportional quantity. Therefore, we translate $\hat{z}_i$ by a quantity proportional to its magnitude. To achieve this, we take the translation relative to the image size and scale it by a factor proportional to the range spanned by the projections in latent space. To this end, we normalize the translation vector $t$ before applying its inverse to a latent space sample:

$t_n = \frac{t}{s} \cdot \bar{m}_z, \qquad \bar{m}_z = \frac{1}{2N} \sum_{k=1}^{2N} \|z_k\|_2$   (4)

where $s$ is the image side length. The intuition behind $\bar{m}_z$ is that it corresponds to the magnitude of latent space values; the resulting translation vector is thus proportional in magnitude. Lastly, we note that due to the cosine similarity used in Eq. 3, the effect of scaling is effectively removed ($\mathrm{sim}(c\,z_i, z_j) = \mathrm{sim}(z_i, z_j)$ for $c > 0$). The complete equivariant contrastive learning framework is visualized in Fig. 2.
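To illustrate the inversion step, consider the following sketch. The interpretation of the latent vector as a set of 2D points (so that the image-space rotation and translation have a concrete action on it), as well as the function signature, are assumptions of this example, not a prescription of the paper:

```python
import numpy as np

def undo_geometric(z, angle_deg, t_px, img_size, z_batch):
    """Undo an image-space rotation and translation on a latent projection z.

    Assumption of this sketch: z (of even dimension) is read as d/2 2D points,
    giving the affine image transform a natural action on it. Following Eq. 4,
    the pixel translation is rescaled to latent units: divided by the image
    size and multiplied by the mean magnitude of the batch projections.
    """
    pts = z.reshape(-1, 2)
    m_z = np.linalg.norm(z_batch, axis=1).mean()           # magnitude of latent values
    t_lat = (np.asarray(t_px, dtype=float) / img_size) * m_z
    a = np.deg2rad(-angle_deg)                             # inverse rotation angle
    R_inv = np.array([[np.cos(a), -np.sin(a)],
                      [np.sin(a),  np.cos(a)]])
    return ((pts - t_lat) @ R_inv.T).reshape(-1)           # un-translate, then un-rotate
```

Applying the forward rotation/translation and then `undo_geometric` recovers the original projection, which is exactly the alignment exploited when comparing the corrected projections in Eq. 3.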
From pretraining to fine-tuning. After pretraining with our proposed loss function, we fine-tune the encoder in a supervised manner on the task of hand pose estimation. To this end, following [6], we remove the projection head from the model and replace it with a linear layer. The entire model is then trained end-to-end using the losses described next, in Sec. 3.3.
3.3 3D Hand Pose Estimator
Our hand pose estimation model makes use of the 2.5D representation [18]. Given an image, the network predicts the 2D keypoints $K_{2D} = \{(u_k, v_k)\}$ and the root-relative depths $z^r_k$ of the hand joints. As such, our hand pose model is trained with the following supervised loss:

$\mathcal{L} = \|\hat{K}_{2D} - K_{2D}\|_1 + \|\hat{z}^r - z^r\|_1$   (5)

Given the predicted values of $\hat{K}_{2D}$ and $\hat{z}^r$, the depth value of the root keypoint can be acquired as detailed in [18]. As a final step, we refine the acquired root depth to increase accuracy and stability as described in [31], which yields $z^{ref}_{root}$. The resulting 3D pose is acquired as follows:

$\hat{K}_{3D,k} = (\hat{z}^r_k + z^{ref}_{root}) \, K^{-1} \, [\hat{u}_k, \hat{v}_k, 1]^\top$   (6)

where $K$ is the camera intrinsic matrix.
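The lifting of Eq. 6 amounts to standard pinhole back-projection. A minimal sketch (the function name and array layout are our own):

```python
import numpy as np

def lift_2p5d_to_3d(kp2d, z_rel, z_root, K):
    """Lift 2.5D predictions (pixel keypoints + root-relative depths) to 3D.

    For each joint k with pixel coordinates (u_k, v_k) and absolute depth
    z_k = z_rel_k + z_root, Eq. 6 back-projects through the inverse
    intrinsics: P_k = z_k * K^{-1} [u_k, v_k, 1]^T.
    """
    uv1 = np.concatenate([kp2d, np.ones((len(kp2d), 1))], axis=1)  # homogeneous pixels
    z_abs = (z_rel + z_root)[:, None]                              # absolute depth per joint
    return z_abs * (np.linalg.inv(K) @ uv1.T).T                    # (num_joints, 3) camera coords
```

Projecting a known 3D point with the intrinsics and lifting it back recovers the point, which is a convenient sanity check for the coordinate conventions.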
4 Experiments
Sec. 4.3 investigates the impact of different data augmentation operations and evaluates their effectiveness for the hand pose estimation task. Next, with the self-supervised learnt representation, we demonstrate in Sec. 4.4 how our model makes efficient use of labeled data in semi-supervised settings. In Sec. 4.5 we compare our method with related work in hand pose estimation and demonstrate that PeCLR reaches state-of-the-art performance on FH. Finally, in Sec. 4.6 we perform a cross-dataset evaluation to show the advantages of the proposed representation learning across domain distributions.
4.1 Implementation
For pretraining, we use a ResNet [16] as encoder, which takes monocular RGB images as input. In the representation learning stage we use LARS [38] with ADAM [19] and a batch size of 2048. During fine-tuning, we use a lower image resolution for the augmentation and semi-supervised studies (Sec. 4.3, 4.4) and a higher one for the state-of-the-art comparison and cross-dataset analysis (Sec. 4.5, 4.6). As optimizer in the supervised fine-tuning stage we use ADAM. Further training details can be found in the supplementary.
4.2 Datasets
We use the following datasets in our experiments. FreiHAND (FH) [42] consists of 32'560 training frames captured with a green-screen background, as well as real backgrounds in the test set. Its final evaluation is performed online; hence we do not have access to the ground truth of the test set. We use FH for all supervised and self-supervised training and report the absolute as well as the Procrustes-aligned MPJPE and AUC. YouTube3DHands (YT3D) [20] consists of in-the-wild images with automatically acquired 3D annotations, obtained via keypoint detection with OpenPose [5] and MANO [30] fitting. It contains 47'125 in-the-wild frames. We use YT3D exclusively for self-supervised representation learning. YT3D provides only 3D vertices and no camera intrinsics; hence we report the Procrustes-aligned MPJPE and the 2D pixel error via weak-perspective projection.
Table 1: Comparison of SimCLR and PeCLR, each with its best-performing augmentation composition.

Model | 3D EPE (cm) | AUC | 2D EPE (px)
SimCLR | 16.62 | 0.72 | 12.05
PeCLR (ours) | 16.05 | 0.74 | 10.51
4.3 Evaluation of augmentation strategies
To study which set of data augmentations performs best, we first consider various augmentation operations for the representation learning phase. Fig. 4 visualizes the studied transformations in our experiment. We first evaluate individual transformations and then find their best composition.
We conduct the experiment on FH using our own training and validation split and use a ResNet-50 as the encoder. We train two encoders with different objective functions: one using NT-Xent (Eq. 1) as proposed in SimCLR, and another making use of our proposed contrastive formulation (Eq. 3). To evaluate the learned feature representation, we freeze the encoder and train a two-layer MLP in a fully-supervised manner on 3D hand labels as described in Sec. 3.3.
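This evaluation protocol (frozen encoder, small trainable head) can be sketched generically. The numpy MLP below stands in for the two-layer probe; the feature dimensionality, hyperparameters, and function name are illustrative choices, and plain gradient descent replaces whatever optimizer is used in practice:

```python
import numpy as np

def probe_frozen_features(feats, targets, hidden=32, lr=1e-2, steps=500):
    """Train a two-layer MLP probe on frozen encoder features.

    The encoder is never updated; only the probe weights are. Representation
    quality is then measured by how well this small head regresses the
    targets (here via mean squared error) from the fixed features.
    """
    rng = np.random.default_rng(0)
    n, d = feats.shape
    k = targets.shape[1]
    W1 = rng.normal(0, 1 / np.sqrt(d), (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 1 / np.sqrt(hidden), (hidden, k)); b2 = np.zeros(k)
    for _ in range(steps):
        h = np.maximum(feats @ W1 + b1, 0.0)      # ReLU hidden layer
        pred = h @ W2 + b2
        g = 2.0 * (pred - targets) / n            # gradient of MSE w.r.t. predictions
        gW2, gb2 = h.T @ g, g.sum(0)
        gh = (g @ W2.T) * (h > 0)                 # backprop through ReLU
        gW1, gb1 = feats.T @ gh, gh.sum(0)
        W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
    h = np.maximum(feats @ W1 + b1, 0.0)
    return np.mean((h @ W2 + b2 - targets) ** 2)  # final probe MSE
```

A lower final error on held-out pose labels then indicates a more useful frozen representation, which is the quantity compared across pretraining objectives in this section.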
Individual augmentations. Fig. 5 shows the performance when individual augmentations are applied, using the SimCLR framework. We observe that encoders trained with transformations perform better than random initialization. However, the rotation transformation leads to particularly poor performance. As motivated in Sec. 3.2, SimCLR promotes invariance under all transformations, including geometric ones; we hypothesize that the poor performance stems from this invariance property. To verify this, we compare the equivariant contrastive loss proposed in PeCLR against SimCLR's contrastive formulation under two geometric transformations, namely translation and rotation. We emphasize again that, due to the cosine similarity, the effect of scale is eliminated. Fig. 5 shows that for both translation and rotation, PeCLR yields significant improvements relative to SimCLR. With PeCLR, scale, translation and rotation give the best feature representations, as evaluated by the final MLP's accuracy. This empirically verifies our intuition that promoting equivariance leads to better representations for pose estimation. Note that we only promote equivariance for geometric transformations; therefore, all appearance-related transformations yield the same performance for PeCLR and SimCLR.
Composite augmentations. Finally, we compare different compositions of transformations. To narrow down the search space, we pick the top-4 performing augmentations from Fig. 5 as candidates. We then conduct an exhaustive search over all combinations of the selected candidates and empirically find that scale, rotation, translation and color jitter deliver the best performance for PeCLR, whereas SimCLR performs best with scale and color jitter.
We compare PeCLR with SimCLR using their respective optimal compositions and report the results in Tab. 1. Notice that PeCLR yields better features than SimCLR, improving 3D EPE by 0.57 cm and 2D EPE by 1.54 px. This demonstrates that the proposed equivariant contrastive loss leads to an effective representation learning approach for hand pose estimation.
4.4 Semisupervised learning
In this experiment, we evaluate the efficiency of PeCLR in making use of labeled data. To this end, we perform semi-supervised learning on FH with the pretrained encoder, using the optimal data augmentation compositions found in Sec. 4.3. As indicated in [7], deeper neural networks can make better use of large training data. Therefore, we increase the network capacity and use a ResNet-152 as the encoder in the following; results and discussion for ResNet-50 can be found in the supplementary. Specifically, we pretrain our encoder on FH with PeCLR and then fine-tune it on varying amounts of labeled FH data. To quantify the effectiveness of our proposed pretraining strategy, we compare against a baseline that is trained solely on the labeled FH data, excluding the pretraining step. Finally, to demonstrate the advantage of self-supervised representation learning with larger training data, we train a third model that is pretrained on both FH and YT3D.
From the results shown in Fig. 6, we see that both pretrained models outperform the baseline regardless of the amount of labels used. This result is in line with [7], confirming that pretrained models can increase label efficiency for hand pose estimation. Comparing the two pretrained models, we see that increasing the amount of data during the pretraining phase is beneficial and further decreases the error. These results shed light on the label efficiency of the pretraining strategy: in terms of 3D EPE, a pretrained model fine-tuned on the smallest fraction of labels performs almost on par with the baseline trained on a substantially larger fraction (cf. Fig. 6).
4.5 Comparison with state of the art
Table 2: Comparison with the state of the art on FH (Procrustes-aligned).

Method | 3D PA-EPE (cm) | PA-AUC
Spurr et al. [31] | 0.90 | 0.82
Kulon et al. [22] | 0.84 | 0.83
Li et al. [23] | 0.80 | 0.84
Pose2Mesh [9] | 0.77 | -
I2L-MeshNet [24] | 0.74 | -
RN152 | 0.79 | 0.84
+ PeCLR (ours) | 0.73 | 0.86
With the optimal composition of transformations and representation learning strategy in place, we compare PeCLR with current state-of-the-art approaches on the FH dataset. For our method, we use an increased image resolution and a ResNet-152 as the encoder. The encoder is pretrained on FH and YT3D with PeCLR and fine-tuned in a supervised manner on the FH dataset. In addition, we train a baseline model solely on FH in a supervised manner.
Tab. 2 compares our results to the current state of the art. We see that training a ResNet-152 model only on FH does not outperform the state of the art, despite its large model capacity. We hypothesize that this is due to the comparably small size of FH and thus a lack of sufficient labeled data for training. However, using PeCLR to leverage YT3D in an unsupervised manner improves PA-EPE by 0.06 cm (0.79 cm to 0.73 cm, cf. Tab. 2). Note that all other methods in Tab. 2 use highly specialized architectures; in contrast, our formulation establishes state-of-the-art performance in a purely data-driven way.
4.6 Crossdataset analysis
Table 3: Cross-dataset evaluation on FH and YT3D.

FH
Method | 3D EPE (cm) | AUC
Supervised | 5.40 | 0.32
PeCLR (ours) | 5.09 | 0.34
Improvement | 5.74% | 6.25%

YT3D
Method | 3D PA-EPE (cm) | 2D EPE (px)
Supervised | 3.08 | 20.59
PeCLR (ours) | 2.93 | 18.70
Improvement | 4.84% | 9.18%
With a large amount of unlabeled training data, we hypothesize that our approach produces features that are beneficial for generalization. To verify this, we examine our models from Sec. 4.5 in a cross-dataset setting. More specifically, we investigate the performance of both models on the YT3D dataset, which sheds light on how they perform under a domain shift. We emphasize that neither model is trained with supervision on YT3D.
The results in Tab. 3 show that PeCLR outperforms the fully-supervised baseline, with improvements of 4.84% in 3D PA-EPE and 9.18% in 2D EPE on YT3D. These results indicate that PeCLR indeed provides a promising way forward for using unlabeled data for representation learning and for training models that can be adapted more easily to other data distributions. We note that cross-dataset generalization is seldom reported in the hand pose literature; it is generally assumed to be very challenging for most existing methods, while being important for real-world applications.
5 Conclusion
3D hand pose estimation from monocular RGB is a challenging task due to the large diversity in environmental conditions and hand appearances. To make strides in pose estimation accuracy, we investigate self-supervised contrastive learning for hand pose estimation, taking advantage of large unlabeled datasets for representation learning. We identify a key issue in the contrastive loss formulation, where promoting invariance leads to detrimental results for pose estimation. To address this issue, we propose a novel method, PeCLR, that encourages equivariance to geometric transformations during the representation learning phase. We thoroughly investigate PeCLR by comparing the resulting feature representations and demonstrate improved performance over SimCLR. We show that PeCLR has high label efficiency by means of semi-supervision, and consequently it achieves state-of-the-art results on the FreiHAND dataset. Lastly, we conduct a cross-dataset analysis and show the potential of PeCLR for cross-domain applications. We believe the proposed PeCLR, as well as our extensive evaluations, can benefit the community: it provides a feasible way to improve generalizability across datasets, and we foresee the use of PeCLR on other tasks such as human body pose estimation.
References

[1] Seungryul Baek, Kwang In Kim, and Tae-Kyun Kim. Pushing the envelope for RGB-based dense 3D hand pose estimation via neural rendering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, 2019.
[2] Seungryul Baek, Kwang In Kim, and Tae-Kyun Kim. Weakly-supervised domain adaptation via GAN and mesh model for estimating 3D hand poses interacting with objects. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, 2020.
[3] Adnane Boukhayma, Rodrigo de Bem, and Philip H. S. Torr. 3D hand shape and pose from images in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, 2019.
[4] Yujun Cai, Liuhao Ge, Jianfei Cai, and Junsong Yuan. Weakly-supervised 3D hand pose estimation from monocular RGB images. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[5] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields, 2019.

[6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, 2020.
[7] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E. Hinton. Big self-supervised models are strong semi-supervised learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[8] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning, 2020.
[9] Hongsuk Choi, Gyeongsik Moon, and Kyoung Mu Lee. Pose2Mesh: Graph convolutional network for 3D human pose and mesh recovery from a 2D human pose, 2020.
[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL-HLT, Minneapolis, Minnesota, 2019.
[11] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015.
[12] Bardia Doosti, Shujon Naha, Majid Mirbagheri, and David J. Crandall. HOPE-Net: A graph-based model for hand-object pose estimation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, 2020.
[13] Liuhao Ge, Zhou Ren, Yuncheng Li, Zehao Xue, Yingying Wang, Jianfei Cai, and Junsong Yuan. 3D hand shape and pose estimation from a single RGB image. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, 2019.
[14] Yana Hasson, Bugra Tekin, Federica Bogo, Ivan Laptev, Marc Pollefeys, and Cordelia Schmid. Leveraging photometric consistency over time for sparsely supervised hand-object reconstruction. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, 2020.
[15] Yana Hasson, Gül Varol, Dimitrios Tzionas, Igor Kalevatykh, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning joint reconstruction of hands and manipulated objects. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, 2019.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.
[17] Olivier J. Hénaff. Data-efficient image recognition with contrastive predictive coding. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, 2020.
[18] Umar Iqbal, Pavlo Molchanov, Thomas Breuel, Juergen Gall, and Jan Kautz. Hand pose estimation via latent 2.5D heatmap regression. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[19] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proc. of ICLR, 2015.
[20] Dominik Kulon, Riza Alp Güler, Iasonas Kokkinos, Michael M. Bronstein, and Stefanos Zafeiriou. Weakly-supervised mesh-convolutional hand reconstruction in the wild. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, 2020.
[21] Dominik Kulon, Riza Alp Güler, Iasonas Kokkinos, Michael M. Bronstein, and Stefanos Zafeiriou. Weakly-supervised mesh-convolutional hand reconstruction in the wild. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, 2020.
[22] Dominik Kulon, Riza Alp Güler, Iasonas Kokkinos, Michael M. Bronstein, and Stefanos Zafeiriou. Weakly-supervised mesh-convolutional hand reconstruction in the wild. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, 2020.
 [23] Moran Li, Yuan Gao, and Nong Sang. Exploiting learnable joint groups for hand pose estimation. arXiv preprint arXiv:2012.09496, 2020.
 [24] Gyeongsik Moon and Kyoung Mu Lee. I2L-MeshNet: Image-to-lixel prediction network for accurate 3d human pose and mesh estimation from a single RGB image, 2020.
 [25] Gyeongsik Moon, Takaaki Shiratori, and Kyoung Mu Lee. DeepHandMesh: A weakly-supervised deep encoder-decoder framework for high-fidelity hand mesh modeling. arXiv preprint arXiv:2008.08213, 2020.
 [26] Gyeongsik Moon, Shoou-I Yu, He Wen, Takaaki Shiratori, and Kyoung Mu Lee. InterHand2.6M: A dataset and baseline for 3d interacting hand pose estimation from a single RGB image. arXiv preprint arXiv:2008.09309, 2020.
 [27] Franziska Mueller, Florian Bernard, Oleksandr Sotnychenko, Dushyant Mehta, Srinath Sridhar, Dan Casas, and Christian Theobalt. GANerated hands for real-time 3d hand tracking from monocular RGB. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018.
 [28] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision. Springer, 2016.
 [29] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
 [30] Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 2017.
 [31] Adrian Spurr, Umar Iqbal, Pavlo Molchanov, Otmar Hilliges, and Jan Kautz. Weakly supervised 3d hand pose estimation via biomechanical constraints, 2020.
 [32] Adrian Spurr, Jie Song, Seonwook Park, and Otmar Hilliges. Cross-modal deep variational hand pose estimation. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018.
 [33] Bugra Tekin, Federica Bogo, and Marc Pollefeys. H+O: unified egocentric recognition of 3d hand-object poses and interactions. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019.
 [34] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
 [35] Jacob Walker, Carl Doersch, Abhinav Gupta, and Martial Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In European Conference on Computer Vision. Springer, 2016.
 [36] Zhirong Wu, Yuanjun Xiong, Stella Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance-level discrimination, 2018.
 [37] Linlin Yang and Angela Yao. Disentangling latent hands for image synthesis and pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019.
 [38] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks, 2017.
 [39] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision. Springer, 2016.
 [40] Xiong Zhang, Qiang Li, Hong Mo, Wenbo Zhang, and Wen Zheng. End-to-end hand mesh recovery from a monocular RGB image. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019.
 [41] Christian Zimmermann and Thomas Brox. Learning to estimate 3d hand pose from single RGB images. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017.
 [42] Christian Zimmermann, Duygu Ceylan, Jimei Yang, Bryan C. Russell, Max J. Argus, and Thomas Brox. FreiHAND: A dataset for markerless capture of hand pose and shape from single RGB images. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019.