1 Introduction
We use our hands as a primary means of sensing and interacting with the world. Thus, to understand human activity, computer vision systems need to be able to detect the pose of the hands and to identify properties of the objects being handled. This human Hand-Object Pose Estimation (HOPE) problem is crucial for a variety of applications, including augmented and virtual reality, fine-grained action recognition, robotics, and telepresence.
This is a challenging problem, however. Hands move quickly as they interact with the world, and handling an object, by definition, creates occlusions of the hand and/or object from nearly any given point of view. Moreover, hand-object interaction video is often collected from first-person (wearable) cameras (e.g., for augmented reality applications), generating a large degree of unpredictable camera motion.
Of course, one approach is to detect the poses of the hands and objects separately [27, 30]. However, this ignores the fact that hand and handled-object poses are highly correlated: the shape of an object usually constrains the types of grasps (hand poses) that can be used to handle it. Detecting the pose of the hand can give cues as to the pose and identity of an object, while the pose of an object can constrain the pose of the hand that is holding it. Solving the two problems jointly can help overcome challenges such as occlusion. Recent work [26, 10] proposed deep learning-based approaches to jointly model the hand and object poses. We build on this work, showing how to improve performance by more explicitly modeling the physical and anatomical constraints on hand-object interaction.
We propose to do this using graph convolutional neural networks. Given their ability to learn effective representations of graph-structured data, graph convolutional neural networks have recently received much attention in computer vision. Human hand and body pose estimation problems are particularly amenable to graph-based techniques since they can naturally model the skeletal and kinematic constraints between joints and body parts. Graph convolution can be used to learn these inter-joint relationships.
In this paper, we show that graph convolution can dramatically increase the performance of 3D hand-object pose estimation in real-world hand-object manipulation videos. We model hand-object interaction by representing the hand and object as a single graph. We focus on estimating 3D hand-object poses from egocentric (first-person) and third-person monocular color video frames, without requiring any depth information. Our model first predicts 2D keypoint locations of hand joints and object boundaries. Then the model jointly recovers the depth information from the 2D pose estimates in a hierarchical manner (Figure 1).
This approach of first estimating in 2D and then “converting” to 3D is inspired by the fact that detection-based models perform better at detecting 2D hand keypoints, while in 3D, because of the high degree of nonlinearity and the huge output space, regression-based models are more popular [4]. Our graph convolutional approach allows us to use a detection-based model to detect the hand keypoints in 2D (which is easier than predicting 3D coordinates directly), and then to accurately convert them to 3D coordinates. We show that with this graph-based network, we are not limited to training on annotated real images, but can instead pretrain the 2D-to-3D network separately with synthetic images rendered from 3D meshes of hands interacting with objects (e.g., the ObMan dataset [10]). This is very useful for training a hand-object pose estimation model, as real-world annotated data for these scenarios is scarce and costly to collect.
In brief, the core contributions of our work are:

- We propose a novel but lightweight deep learning framework, HOPE-Net, which can predict 2D and 3D coordinates of the hand and a hand-manipulated object in real time. Our model accurately predicts hand and object pose from single RGB images.

- We introduce the Adaptive Graph U-Net, a graph convolution-based neural network to convert 2D hand and object poses to 3D, with novel graph convolution, pooling, and unpooling layers. The new formulations of these layers make it more stable and robust than the existing Graph U-Net [5] model.

- Through extensive experiments, we show that our approach outperforms state-of-the-art models for joint hand and object 3D pose estimation while still running in real time.
2 Related Work
Our work is related to two main lines of research: joint hand-object pose prediction models and graph convolutional networks for understanding graph-based data.
Hand-Object Pose Estimation.
Due to the strong relationship between hand pose and the shape of a manipulated object, several papers have studied joint estimation of both hand and object pose. Oikonomidis et al. [20] used hand-object interaction as context to better estimate the 2D hand pose from multi-view images. Choi et al. [3] trained two networks, one object-centered and one hand-centered, to capture information from both the object and hand perspectives, and shared information between these two networks to learn a better representation for predicting 3D hand pose. Panteleris et al. [21] generated 3D hand pose and 3D models of unknown objects based on hand-object interactions and depth information. Oberweger et al. [19] proposed an iterative approach using Spatial Transformer Networks (STNs) to separately focus on the manipulated object and the hand to predict their corresponding poses. They then estimated the hand and object depth images and fused them using an inverse STN; the synthesized depth images were used to refine the hand and object pose estimates. Recently, Hasson et al. [10] showed that by incorporating physical constraints, two separate networks responsible for learning object and hand representations can be combined to generate better 3D hand and object shapes. Tekin et al. [26] proposed a single 3D YOLO model to jointly predict the 3D hand pose and object pose from a single RGB image.
Graph Convolution Networks.
Graph convolution networks allow learning high-level representations of the relationships between the nodes of graph-based data. Zhao et al. [31] proposed a semantic graph convolution network for capturing both local and global relationships among human body joints for 2D and 3D human pose estimation. Cai et al. [1] converted 2D human joints to 3D by encoding domain knowledge of the human body and hand joints using a graph convolution network which can learn multi-scale representations. Yan et al. [29] used a graph convolution network for learning a spatial-temporal representation of human body joints for skeleton-based action recognition. Kolotouros et al. [15] showed that graph convolutional networks can be used to extract 3D human shape and pose from a single RGB image, while Ge et al. [7] used them to generate complete 3D meshes of hands from images. Li et al. [17] used graph convolutional networks for skeleton-based action recognition, while Shi et al. [25, 24] similarly used two-stream adaptive graph convolution.
Gao et al. [5] introduced the Graph U-Net structure with their proposed pooling and unpooling layers, but that pooling method does not work well on graphs with low numbers of edges, such as skeletons or object meshes. Ranjan et al. [22] used fixed pooling and Hanocka et al. [9] used edge pooling to prevent holes in the mesh after pooling. In this paper, we propose a new Graph U-Net architecture with different graph convolution, pooling, and unpooling layers. We use an adaptive adjacency matrix for our graph convolutional layer and new trainable pooling and unpooling layers.
3 Methodology
We now present HOPE-Net, which consists of a convolutional neural network for encoding the image and predicting the initial 2D locations of the hand and object keypoints (hand joints and tight object bounding box corners), a simple graph convolution to refine the 2D predictions, and a Graph U-Net architecture to convert the 2D keypoints to 3D using a series of graph convolutions, poolings, and unpoolings. Figure 2 shows an overall schematic of the HOPE-Net architecture.
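The dataflow just described can be sketched at the shape level as follows. This is not the HOPE-Net implementation: the encoder and keypoint predictor are stubbed out with random numpy projections, the function names (`encode_image`, `predict_initial_2d`) and the 224×224 input resolution are illustrative assumptions, and only the node count (21 hand joints + 8 box corners) and the 2048-D feature size come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

N_KEYPOINTS = 29   # 21 hand joints + 8 object bounding-box corners
FEAT_DIM = 2048    # dimensionality of the image-encoder feature vector

def encode_image(image):
    """Stand-in for the ResNet encoder: one global feature vector per image."""
    return rng.standard_normal(FEAT_DIM)

def predict_initial_2d(features):
    """Stand-in for the fully-connected layer predicting initial 2D keypoints."""
    W = rng.standard_normal((FEAT_DIM, N_KEYPOINTS * 2)) * 0.01
    return (features @ W).reshape(N_KEYPOINTS, 2)

image = rng.standard_normal((224, 224, 3))  # input resolution is illustrative
feats = encode_image(image)
init_2d = predict_initial_2d(feats)

# Each graph node carries the full image feature vector plus its own (x, y)
# estimate, so the graph refinement stage is conditioned on both.
node_features = np.concatenate(
    [np.tile(feats, (N_KEYPOINTS, 1)), init_2d], axis=1)
print(node_features.shape)  # (29, 2050)
```

The concatenated per-node features are what the refinement graph convolution and the Adaptive Graph U-Net stages then operate on.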
3.1 Image Encoder and Graph Convolution
For the image encoder, we use a lightweight residual neural network [11] (ResNet-10) to help reduce overfitting. The image encoder produces a 2048-D feature vector for each input image. Then initial predictions of the 2D coordinates of the keypoints (hand joints and corners of the object’s tight bounding box) are produced using a fully-connected layer. Inspired by the architecture of [14], we concatenate these features with the initial 2D predictions of each keypoint, yielding a graph with 2050 features per node (the 2048 image features plus the initial estimates of x and y). A 3-layer adaptive graph convolution network is applied to this graph to use adjacency information and refine the 2D coordinates of the keypoints. In the next section, we explain the adaptive graph convolution in depth. The concatenation of the image features to the predicted x and y of each keypoint forces the graph convolution network to modify the 2D coordinates conditioned on the image features as well as the initial 2D predictions. These final 2D coordinates of the hand and object keypoints are then passed to our adaptive Graph U-Net, a graph convolution network using adaptive convolution, pooling, and unpooling to convert the 2D coordinates to 3D.
3.2 Adaptive Graph U-Net
In this section, we explain our graph-based model, which predicts 3D coordinates of the hand joints and object corners based on the predicted 2D coordinates. In this network, we simplify the input graph by applying graph pooling in the encoding part, and in the decoding part, we add those nodes back with our graph unpooling layers. Also, similar to the classic U-Net [23], we use skip connections and concatenate features from the encoding stage to features of the decoding stage in each decoding graph convolution. With this architecture we aim to train a network which simplifies the graph to obtain global features of the hand and object, but also preserves local features via the skip connections from the encoder to the decoder layers. Modeling the HOPE problem as a graph lets each node use its neighbors to predict more accurate coordinates, and also helps discover the relationships between hands and objects.
The Graph U-Net concept was previously introduced by Gao et al. [5], but our network layers, i.e., graph convolution, pooling, and unpooling, are significantly different. We found that the sigmoid function in the pooling layer of [5] (gPool) can cause the gradients to vanish, so that the picked nodes are not updated at all. We thus use a fully-connected layer to pool the nodes, and update our adjacency matrix in the graph convolution layers, using the adjacency matrix as a kernel applied to our graph. Moreover, Gao et al.’s gPool [5] removes the vertices and all the edges connected to them, and has no procedure to reconnect the remaining vertices. This may not be problematic for dense graphs (e.g. Citeseer [13]), in which removing a node and its edges will not change the connectivity of the graph. But in graphs with sparse adjacency matrices, such as when the graph is a mesh or a hand or body skeleton, removing one node and its edges may cut the graph into several isolated subgraphs and destroy the connectivity, which is the most important feature a graph convolutional neural network exploits. Using an adaptive graph convolutional neural network, we avoid this problem, as the network finds the connectivity of the nodes after each pooling layer.
Below we explain the three components of our network, the graph convolution, pooling, and unpooling layers, in detail. The architecture of our adaptive Graph U-Net is shown in Figure 3.
3.2.1 Graph Convolution
The core part of a graph convolutional network is the implementation of the graph convolution operation. We implement our convolution based on the Renormalization Trick of [13]: the output features of a graph convolution layer, for an input graph with $N$ nodes and $D_{in}$ input and $D_{out}$ output features per node, are computed as

$H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)}\big)$  (1)

where $\sigma$ is the activation function, $W^{(l)} \in \mathbb{R}^{D_{in} \times D_{out}}$ is the trainable weights matrix, $H^{(l)} \in \mathbb{R}^{N \times D_{in}}$ is the matrix of input features, and $\hat{A}$ is the row-normalized adjacency matrix of the graph,

$\hat{A} = \tilde{D}^{-1} \tilde{A}$  (2)

where $\tilde{A} = A + I_N$ and $\tilde{D}$ is the diagonal node degree matrix of $\tilde{A}$. $\hat{A}$ simply defines the extent to which each node uses other nodes’ features, so $\hat{A} H^{(l)}$ is the new feature matrix in which each node’s features are the averaged features of the node itself and its adjacent nodes. Therefore, to effectively formulate the HOPE problem in this framework, an effective adjacency matrix is needed.
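To make the renormalization-trick convolution of Eqs. (1)–(2) concrete, here is a minimal numpy sketch on a toy three-node chain graph; `np.tanh` stands in for the unspecified activation function, and all names are ours.

```python
import numpy as np

def normalized_adjacency(A):
    """Row-normalized adjacency with self-loops: A_hat = D~^{-1} (A + I)."""
    A_tilde = A + np.eye(A.shape[0])
    D_inv = np.diag(1.0 / A_tilde.sum(axis=1))
    return D_inv @ A_tilde

def graph_conv(H, A_hat, W, act=np.tanh):
    """One layer of Eq. (1): H_next = act(A_hat @ H @ W)."""
    return act(A_hat @ H @ W)

# Tiny 3-node chain graph 0-1-2 with one scalar feature per node.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
A_hat = normalized_adjacency(A)
H = np.array([[0.], [3.], [6.]])

# A_hat @ H averages each node's feature with its neighbors':
# node 0 -> (0 + 3) / 2, node 1 -> (0 + 3 + 6) / 3, node 2 -> (3 + 6) / 2
print((A_hat @ H).ravel())  # [1.5 3.  4.5]

# A full layer then mixes feature channels with the trainable weights W.
W = np.ones((1, 4)) * 0.1
H_next = graph_conv(H, A_hat, W)
print(H_next.shape)  # (3, 4)
```

The averaging behavior of $\hat{A}$ is visible directly: each node's output is the mean of its own feature and those of its neighbors.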
Initially, we tried using the adjacency matrix defined by the kinematic structure of the hand skeleton and the object bounding box for the first layer of the network. But we found it was better to allow the network to learn the best adjacency matrix. Note that this is no longer an adjacency matrix in the strict sense, but more like an “affinity” matrix in which nodes can be connected by weighted edges to many other nodes in the graph. An adaptive graph convolution operation updates the adjacency matrix, as well as the weights matrix, during the backpropagation step. This approach allows us to model subtle relationships between joints which are not connected in the hand skeleton model (e.g., strong relationships between fingertips despite not being physically connected).
3.2.2 Graph Pooling
As mentioned earlier, we did not find gPool [5] helpful for our problem: the sigmoid function’s weaknesses are well-known [16, 18], and its use in the pooling step created very small gradients during backpropagation. This caused the network not to update the randomly-initialized selection of pooled nodes throughout the entire training phase, losing the advantage of a trainable pooling layer.
To solve this problem, we use a fully-connected layer applied to the transpose of the feature matrix. This fully-connected layer acts as a kernel along each of the features and outputs the desired number of nodes. Compared to gPool, we found this module updated well during training. Also, because we use an adaptive graph convolution, this pooling does not fragment the graph into pieces.
3.2.3 Graph Unpooling
The unpooling layer used in our Graph U-Net is also different from Gao et al.’s gUnpool [5]. That approach adds the pooled nodes back to the graph with empty features and relies on the subsequent graph convolution to fill in those features. Instead, we use a transpose-convolution approach in our unpooling layer. Similar to our pooling layer, we use a fully-connected layer applied to the transposed feature matrix to obtain the desired number of output nodes, and then transpose the matrix again.
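The pooling and unpooling layers described above can be sketched in a few lines of numpy. This is a shape-level illustration, not the trained network: the weights below are random stand-ins for learned parameters, and the 29 → 7 → 29 node counts are illustrative, not the actual layer sizes of Figure 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_pool(n_in, n_out):
    """Trainable pooling as a fully-connected layer over the node axis.

    Applied to the transposed feature matrix (features x nodes), a single
    (n_in -> n_out) linear map mixes the n_in input nodes into n_out pooled
    nodes; the same weights are shared across every feature channel.
    """
    P = rng.standard_normal((n_out, n_in)) * 0.1  # learned in practice
    return lambda H: P @ H        # (n_in, d) -> (n_out, d)

def make_unpool(n_in, n_out):
    """Trainable unpooling: the transpose-convolution analogue of the pool,
    restoring the node count with learned (not empty) features."""
    U = rng.standard_normal((n_out, n_in)) * 0.1  # learned in practice
    return lambda H: U @ H        # (n_in, d) -> (n_out, d)

H = rng.standard_normal((29, 16))     # 29 keypoint nodes, 16 features each
pool, unpool = make_pool(29, 7), make_unpool(7, 29)

H_coarse = pool(H)         # coarsened graph for global reasoning
H_fine = unpool(H_coarse)  # node count restored for the decoder
print(H_coarse.shape, H_fine.shape)  # (7, 16) (29, 16)
```

Because the pooling is a dense learned map rather than a node-selection step, it has well-behaved gradients and cannot disconnect the graph.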
3.3 Loss Function and Training the Model
Our loss function for training the model has three parts. We first calculate the loss for the initial 2D coordinates predicted by ResNet ($\mathcal{L}_{init2D}$). We then add this loss to those calculated from the predicted 2D and 3D coordinates ($\mathcal{L}_{2D}$ and $\mathcal{L}_{3D}$),

$\mathcal{L} = \lambda_1 \mathcal{L}_{init2D} + \lambda_2 \mathcal{L}_{2D} + \mathcal{L}_{3D}$  (3)

where $\lambda_1$ and $\lambda_2$ are set to bring the 2D error (in pixels) and 3D error (in millimeters) into a similar range. For each of the loss terms, we use Mean Squared Error.
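The combined loss of Eq. (3) can be sketched as below, assuming (as the text describes) that the weights apply to the two 2D terms; the λ values of 1.0 in the usage line are placeholders, not the values used for training.

```python
import numpy as np

def mse(pred, target):
    """Mean Squared Error over all coordinates."""
    return float(np.mean((pred - target) ** 2))

def hope_loss(init_2d, refined_2d, pred_3d, gt_2d, gt_3d, lam1, lam2):
    """Eq. (3): weighted sum of the initial-2D, refined-2D, and 3D MSE terms.

    lam1 and lam2 rescale the pixel-space 2D errors so they are comparable
    in magnitude to the millimeter-space 3D error.
    """
    return (lam1 * mse(init_2d, gt_2d)
            + lam2 * mse(refined_2d, gt_2d)
            + mse(pred_3d, gt_3d))

gt_2d = np.zeros((29, 2))
gt_3d = np.zeros((29, 3))
# With perfect predictions the loss is exactly zero.
loss = hope_loss(gt_2d, gt_2d, gt_3d, gt_2d, gt_3d, lam1=1.0, lam2=1.0)
print(loss)  # 0.0
```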
4 Results
We now describe our experiments and report results for hand-object pose estimation.
4.1 Datasets
To evaluate the generality of our hand-object pose estimation method, we used two datasets with very different contexts: the First-Person Hand Action dataset [6], which has videos captured from egocentric (wearable) cameras, and HO-3D [8], which was captured from third-person views. We also used a third dataset of synthetic images, ObMan [10], for pre-training.
The First-Person Hand Action dataset [6] contains first-person videos of hand actions performed on a variety of objects. The objects are milk, juice bottle, liquid soap, and salt, and actions include open, close, pour, and put. Three-dimensional meshes for the objects are provided. Although this is a large dataset, only a relatively small subset of frames includes 6D object pose annotations, split between training and evaluation. The annotation provided for each frame is a 6D vector giving the 3D translation and rotation of each object. To fit this annotation to our graph model, for each object in each frame, we translate and rotate the 3D object mesh to the pose given by the annotation, and then compute a tight oriented bounding box (simply PCA on the vertex coordinates). We use the eight 3D coordinates of the object box corners as nodes in our graph.
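The tight oriented bounding box step (PCA on vertex coordinates) can be sketched as follows; this is a minimal illustrative implementation with names of our choosing, not the dataset-preparation code itself.

```python
import numpy as np

def tight_obb_corners(vertices):
    """Eight corners of a tight oriented bounding box via PCA on vertices.

    The principal axes of the centered vertex cloud define the box
    orientation; min/max extents along those axes define its size.
    """
    center = vertices.mean(axis=0)
    centered = vertices - center
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)  # PCA axes
    local = centered @ Vt.T                  # coordinates in the PCA frame
    lo, hi = local.min(axis=0), local.max(axis=0)
    # All 8 combinations of (min, max) along the three principal axes.
    corners_local = np.array([[x, y, z]
                              for x in (lo[0], hi[0])
                              for y in (lo[1], hi[1])
                              for z in (lo[2], hi[2])])
    return corners_local @ Vt + center       # back to the world frame

# An elongated random point cloud stands in for a posed object mesh.
verts = np.random.default_rng(1).standard_normal((500, 3)) * [3.0, 1.0, 0.2]
corners = tight_obb_corners(verts)
print(corners.shape)  # (8, 3)
```

The eight returned corners are exactly the per-object graph nodes described above.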
The HO-3D dataset [8] also contains hands and handled objects, but is quite different because it is captured from a third-person point of view. Hands and objects in these videos are smaller because they are further from the camera, and their position is less constrained than in first-person videos (where people tend to center their field of view on attended objects). HO-3D contains frames annotated with hand and object poses, collected from 10 subjects and 10 objects, with designated training and evaluation splits. Hands in the evaluation set of HO-3D are annotated only with the wrist coordinates; the full hand is not annotated.
ObMan [10] is a large dataset of synthetically-generated images of hand-object interactions. Images in this dataset were produced by rendering meshes of hands with selected objects from ShapeNet [2], using an optimization on the grasp of the objects. ObMan provides separate training, validation, and evaluation splits. Despite the large scale of the annotated data, we found that models trained on these synthetic images do not generalize well to real images. Nevertheless, we found it helpful to pre-train our model on the large-scale ObMan data, and then fine-tune on real images.
All of these datasets use a 21-joint model for hands, with one joint for the wrist and four joints for each of the five fingers.
4.2 Implementation Details
Because of the nature of first-person video, hands often leave the field of view, and thus roughly half of the frames in the First-Person Hand Action dataset have at least one keypoint outside of the frame (Figure 4). Because of this, we found that detection-based models are not very helpful on this dataset, so we use a regression-based model to find the initial 2D coordinates. To avoid overfitting, we use a lightweight ResNet, which gave better generalization. This lightweight model is also fast, allowing us to run our model in near real-time. For both datasets, we use the official training and evaluation splits, and pre-train on ObMan [10].
Since the components of HOPE-Net differ in their numbers of parameters and complexity, we train the image encoder and graph parts separately. The 2D-to-3D converter network can be trained separately because it does not depend on annotated images. In addition to the samples in the FPHA dataset, we augment the 2D points with zero-mean Gaussian noise to help improve robustness to errors.
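The keypoint augmentation can be sketched as below; the function name is ours, and the σ of 5 pixels is an illustrative placeholder rather than the value used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_2d(keypoints, sigma):
    """Return a jittered copy of 2D keypoints: zero-mean Gaussian, std sigma."""
    return keypoints + rng.normal(0.0, sigma, size=keypoints.shape)

clean = np.zeros((29, 2))             # 21 hand joints + 8 box corners
noisy = augment_2d(clean, sigma=5.0)  # sigma in pixels; 5.0 is illustrative
print(noisy.shape)  # (29, 2)
```

Training the 2D-to-3D converter on such jittered inputs exposes it to the kinds of errors the 2D predictor makes at test time.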
For both the FPHA and HO-3D datasets, we train the ResNet model with an initial learning rate that is decayed by a constant factor at fixed step intervals. We first train the ResNet, then train the graph convolutional network with its own decaying learning-rate schedule, and finally train the model end-to-end for additional epochs. All images are resized to a fixed resolution and passed to the ResNet. All learning and inference was implemented in PyTorch.
4.3 Metrics
Similar to [26], we evaluated our model using percentage of correct pose (PCP) for both 2D and 3D coordinates. In this metric, a pose is considered correct if the average distance to the ground truth pose is less than a threshold.
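A sketch of the PCP metric as described (the exact evaluation code may differ in details such as handling of missing keypoints):

```python
import numpy as np

def percentage_correct_pose(pred, gt, threshold):
    """PCP: fraction of poses whose mean keypoint-to-ground-truth distance
    is below `threshold` (pixels in 2D, millimeters in 3D).

    pred, gt: arrays of shape (num_poses, num_keypoints, dim).
    """
    dists = np.linalg.norm(pred - gt, axis=-1)  # per-keypoint distances
    return float((dists.mean(axis=-1) < threshold).mean())

gt = np.zeros((2, 21, 3))
pred = gt.copy()
pred[..., 0] = 1.0          # every keypoint is off by exactly 1 unit
print(percentage_correct_pose(pred, gt, threshold=2.0))  # 1.0
print(percentage_correct_pose(pred, gt, threshold=0.5))  # 0.0
```

Sweeping the threshold and plotting PCP yields the curves reported in Figures 5 and 6.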
4.4 HandObject Pose Estimation Results
We now report the performance of our model on hand and object pose estimation on our two datasets. Figure 5 presents the percentage of correct object poses for each pixel threshold on the First-Person Hand Action dataset. As the graph shows, the 2D object pose estimates produced by HOPE-Net outperform the state-of-the-art model of Tekin et al. [26] for 2D object pose estimation, even though we do not use an object locator and we operate on single frames without using temporal constraints. Moreover, our architecture is lightweight and faster to run.
Figure 6 presents the percentage of correct 3D poses for various thresholds (measured in millimeters) on the First-Person Hand Action dataset. The results show that HOPE-Net outperforms Tekin et al.’s RGB-based model [26] and Garcia-Hernando et al.’s depth-based model [6] in 3D pose estimation, even without using an object localizer or temporal information.
We also tested our graph model with various other inputs, including ground-truth 2D coordinates, as well as ground-truth 2D coordinates with zero-mean Gaussian noise of varying standard deviation added. Figure 6 presents the results. We note that the graph model is able to effectively remove the Gaussian noise from the keypoint coordinates.
Figure 7 shows selected qualitative results of our model on the FirstPerson Hand Action dataset. Figure 8 breaks out the error of the 2D to 3D converter for each finger and also for each kind of joint of the hand.
We also tested on the third-person videos of the recent HO-3D dataset. Although the locations of hands and objects in the images vary more in HO-3D, we found that HOPE-Net performs better, perhaps because of the size of the dataset. Note that hands in the evaluation set of HO-3D are annotated only with the wrist (without the full hand annotation), so the reported Area Under the Curve (AUC) scores for 2D and 3D pose estimation are for the wrist keypoint only.
4.5 Adaptive Graph U-Net Ablation Study
We also conducted an ablation study of our Adaptive Graph U-Net to identify which components were important for achieving our results. We first compare our model to alternatives, and then we evaluate the influence of the adjacency matrix initialization on the Adaptive Graph U-Net’s performance.
To show the effectiveness of our U-Net structure, we compared it to two different models: one with three fully-connected layers, and one with three graph convolutional layers without pooling and unpooling, allowing us to measure the contribution of each component to the 3D output. Each of these models was trained to convert 2D coordinates of the hand and object keypoints to 3D. Table 1 shows the results. The Adaptive Graph U-Net outperforms the other methods by a large margin, which appears to come from the U-Net structure and its pooling and unpooling layers.
To understand the effect of our graph pooling layer, we compared it with Gao et al.’s gPool [5], and also with fixed pooled nodes which do not break the graph into pieces. Table 2 compares the performance of the different graph pooling methods. We see that by using a more efficient training algorithm, and by not breaking apart the graph after pooling, our pooling layer performs better than gPool.
Architecture  Average Error (mm)
Fully Connected  185.18
Adaptive Graph Convolution  68.93
Adaptive Graph U-Net  6.81

Table 1: Average error on 3D hand and object pose estimation given 2D pose. The first row is a multilayer perceptron and the second row is a 3-layered graph convolution without pooling and unpooling. The Adaptive Graph U-Net structure has the best performance.
Pooling Method  Average Error (mm)
gPool [5]  153.28
Fixed Pooling Layers  7.41
Trainable Pooling  6.81

Table 2: Average error of the different graph pooling methods.
Since we are using an adaptive graph convolution, the network learns the adjacency matrix as well. We tested the effect of different adjacency matrix initializations on the final performance, including: the hand skeleton and object bounding box, the empty graph with and without self-loops, the complete graph, and a random connection of vertices. Table 3 presents the results of the model initialized with each of these matrices, showing that the identity matrix (the empty graph with self-loops) is the best initialization. In other words, the model seems to learn best when it finds the relationships between the nodes starting from an unbiased (uninformative) initialization.
Initial Adjacency Matrix  Average Error (mm)
Zeros  92805.02
Random Initialization  94.42
Ones  63.25
Skeleton  12.91
Identity  6.81

Table 3: Average error for different initializations of the adaptive adjacency matrix.
The final trained adjacency matrices for the graph convolution layers (starting from the identity initialization) are visualized in Figure 9. We see that the model has found relationships between nodes which are not connected in the hand skeleton model. For example, it found a relationship between node 6 (the index finger’s PIP joint) and node 4 (the thumb’s TIP), which are not connected in the hand skeleton.
4.6 Runtime
As mentioned earlier, HOPE-Net consists of a lightweight feature extractor (ResNet-10) and two graph convolutional neural networks, which are more than ten times faster than even the shallowest image convolutional neural network. The core inference of the model runs in real time on an Nvidia Titan Xp: on such a GPU, the entire 2D and 3D inference for a single frame requires just 0.005 seconds.
5 Conclusion and Future Work
In this paper, we introduced a model for hand-object 2D and 3D pose estimation from a single image, using an image encoder followed by a cascade of two graph convolutional neural networks. Our approach beats the state of the art while also running in real time.
Nevertheless, our approach has limitations. When trained on the FPHA and HO-3D datasets, our model is well-suited to objects of similar size or shape to those seen during training, but might not generalize well to all categories of object shapes. For example, objects with a non-convex geometry lacking a tight 3D bounding box would be a challenge for our technique. For real-world applications, a larger dataset including a greater variety of shapes and environments would help to improve the estimation accuracy.
Future work could include incorporating temporal information into our graph-based model, both to improve pose estimation results and as a step towards action detection. Graph classification methods could be integrated into the proposed framework to infer categorical semantic information for applications such as sign language recognition or gesture understanding. Also, beyond hand pose estimation, the Adaptive Graph U-Net introduced in this work can be applied to a variety of other problems such as graph completion, protein classification, mesh classification, and body pose estimation.
Acknowledgment
The work in this paper was supported in part by the National Science Foundation (CAREER IIS1253549), and by the IU Office of the Vice Provost for Research, the College of Arts and Sciences, and the Luddy School of Informatics, Computing, and Engineering through the Emerging Areas of Research Project “Learning: Brains, Machines, and Children.”
References
[1] (2019) Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2272–2281.
[2] (2015) ShapeNet: an information-rich 3D model repository. arXiv preprint arXiv:1512.03012.
[3] (2017) Robust hand pose estimation during the interaction with an unknown object. In IEEE International Conference on Computer Vision (ICCV), pp. 3123–3132.
[4] (2019) Hand pose estimation: a survey. CoRR abs/1903.01013.
[5] (2019) Graph U-Nets. In International Conference on Learning Representations (ICLR).
[6] (2018) First-person hand action benchmark with RGB-D videos and 3D hand pose annotations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[7] (2019) 3D hand shape and pose estimation from a single RGB image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10833–10842.
[8] (2019) HOnnotate: a method for 3D annotation of hand and object poses. arXiv preprint arXiv:1907.01481.
[9] (2019) MeshCNN: a network with an edge. ACM Transactions on Graphics (TOG) 38 (4), pp. 90.
[10] (2019) Learning joint reconstruction of hands and manipulated objects. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[11] (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
[12] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML).
[13] (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).
[14] (2019) Convolutional mesh regression for single-image human shape reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[15] (2019) Convolutional mesh regression for single-image human shape reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4501–4510.
[16] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1097–1105.
[17] (2019) Actional-structural graph convolutional networks for skeleton-based action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[18] (2010) Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (ICML), pp. 807–814.
[19] (2019) Generalized feedback loop for joint hand-object pose estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[20] (2011) Full DOF tracking of a hand interacting with an object by modeling occlusions and physical constraints. In International Conference on Computer Vision (ICCV), pp. 2088–2095.
[21] (2015) 3D tracking of human hands in interaction with unknown objects. In British Machine Vision Conference (BMVC).
[22] (2018) Generating 3D faces using convolutional mesh autoencoders. In European Conference on Computer Vision (ECCV), pp. 704–720.
[23] (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241.
[24] (2019) Skeleton-based action recognition with directed graph neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[25] (2019) Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[26] (2019) H+O: unified egocentric recognition of 3D hand-object poses and interactions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[27] (2018) Real-time seamless single shot 6D object pose prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 292–301.
[28] (2018) Group normalization. In European Conference on Computer Vision (ECCV), pp. 3–19.
[29] (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence.
[30] (2018) Depth-based 3D hand pose estimation: from current achievements to future goals. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2636–2645.
[31] (2019) Semantic graph convolutional networks for 3D human pose regression. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3425–3435.