The role of computers and robots in our modern society keeps expanding and diversifying. As humanity breaks the technological barriers of the past, daily activities rely more and more on operations based on the interaction between humans and computers. However, the development of increasingly sophisticated systems frequently leads to complicated modes of interaction that hinder their usage.
In order to democratize decision-making machines, straightforward ways of interaction need to be developed that imitate the communication between humans. A convenient way for a human to interact with machines is by means of natural dialogue; examples already implemented in virtual assistants are known as interactive conversational systems. Computer vision techniques are often applied in this field, e.g., for face and emotion recognition, 3D face mesh representation for augmented reality applications, action recognition and, finally, human body pose estimation and hand pose estimation.
Human hand pose estimation is a long-standing problem in the computer vision and graphics research fields, with a plethora of applications such as machine control and augmented and virtual reality. Due to its importance, numerous solutions have been proposed in the related literature, one of the most common being based on accurate 2D keypoint localization.
Despite the recent advances in the field of deep neural networks, this topic is still considered a challenging problem that remains far from completely solved. Properties such as the hand's morphology, occlusions due to interaction with objects, appearance diversity due to clothing and jewelry, varying lighting conditions and different backgrounds add an extra burden to the nature of the problem. Moreover, unlike the human body or face, hands have an almost uniform shape and lack local characteristics. In addition, because of the hand's many degrees of freedom, there is a myriad of different possible poses. Thus, it becomes critical to localize more than 20 keypoints in each hand in order to accurately estimate its pose and use it as an input device. Vision-based hand pose estimation has recently made significant progress. A vast number of approaches build upon Convolutional Neural Networks (CNNs), owing to their capability of extracting a given signal's features.
CNNs successfully perform 2D body pose estimation by classifying whether or not a body joint is present in each pixel. The proposed methods, also known as Convolutional Pose Machines (CPMs), enforce a CNN to generate a set of heat maps, each of which is expected to have its maximum activation value in the pixel that contains the corresponding keypoint. To refine the outcome, this procedure is applied iteratively upon the generated heat maps. The majority of hand pose estimation methods are based upon this approach, which leads to computationally expensive networks and complicated system architectures.
Another line of work, known as holistic regression, aims to directly map the input image to the keypoints' coordinates on the image plane or to a specific frame of reference, for 2D and 3D pose estimation respectively. This approach does not have to generate intermediate representations (pixel-wise classifications), while it preserves the ability to capture global constraints and correlations between keypoints. However, it has been claimed that holistic regression is not able to generalize and that translation variance diminishes the quality of the predicted results.
In this letter, we propose a computationally inexpensive CNN architecture for direct 2D hand pose estimation through holistic regression. Our novelty lies in the exploitation of a self-attention mechanism combined with traditional convolutional layers. We show that this method yields state-of-the-art performance with an exceedingly low number of parameters.
In this section, we describe the structure of the proposed architecture and the key ingredients for estimating a hand's 2D keypoint coordinates, given a single RGB image. Towards a solution for this challenge, we make use of a feed-forward CNN architecture that directly produces the coordinates in a single stage, without intermediate supervision. The network's architecture comprises two parts: the stem and the rest of the network, from now on dubbed the tail.
2.1 Network’s architecture
The presented architecture is based on the very successful idea of DenseNets. In a DenseNet, each layer obtains additional inputs from all preceding ones and propagates its own feature maps to all subsequent layers through channel-wise concatenation, as shown in Figure 1. In this way, each layer receives the "collective knowledge" of all previous layers.
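As a minimal illustration of this connectivity pattern, a sketch in TensorFlow follows; the function and argument names are ours, and layer_fn stands for any feature extractor, e.g., the inverted residual block described next:

```python
import tensorflow as tf

def dense_block(x, num_layers, layer_fn):
    # Dense connectivity: every layer consumes the channel-wise
    # concatenation of all feature maps produced so far.
    features = [x]
    for _ in range(num_layers):
        y = layer_fn(tf.concat(features, axis=-1))
        features.append(y)
    return tf.concat(features, axis=-1)  # the "collective knowledge"
```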
To keep the total number of parameters as low as possible, we were inspired by a very popular building unit, the inverted residual block, which is a highly efficient feature extractor designed especially for mobile use. The replacement of the standard convolutional layer by depthwise separable ones offers a computation reduction by a factor of:

\[ \frac{1}{N} + \frac{1}{D_k^2}, \]

where $D_k$ equals the kernel's size and $N$ equals the output depth size. The first convolutional layer expands the depth size by an expansion factor $e$, while the last one squeezes it back by dividing the input's depth size by the same factor.
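A sketch of such an inverted residual block in TensorFlow could read as follows; the expansion factor of 6 is MobileNetV2's default and only an assumption here, and ReLU6 stands in for the activation (the architecture adopts Mish, defined below):

```python
import tensorflow as tf
from tensorflow.keras import layers

def inverted_bottleneck(x, out_ch, expansion=6, stride=1, act=tf.nn.relu6):
    in_ch = x.shape[-1]
    # Pointwise expansion of the depth size by the factor e.
    y = layers.Conv2D(in_ch * expansion, 1, use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation(act)(y)
    # Depthwise 3x3 convolution: the source of the 1/N + 1/Dk^2 savings.
    y = layers.DepthwiseConv2D(3, strides=stride, padding='same',
                               use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.Activation(act)(y)
    # Linear pointwise projection squeezes the depth back down.
    y = layers.Conv2D(out_ch, 1, use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    if stride == 1 and in_ch == out_ch:
        y = layers.Add()([x, y])  # residual shortcut
    return y
```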
For the stem, we use a number of dense blocks which, unlike the original design, each contain an inverted residual block. As has been argued in the literature, architectures with concatenated skip-connections maintain more information, since they allow subsequent layers to reuse intermediate representations, which in turn leads to increased performance.
2.1.1 Mish Activation Function

Mish, unlike ReLU, is a smooth non-monotonic activation function, defined as:

\[ \text{Mish}(x) = x \tanh\left(\ln\left(1 + e^x\right)\right). \]

As mentioned in the original work, Mish demonstrates better results than both Swish and ReLU for classification tasks. After extensive experimentation with both Swish and ReLU, we confirmed the above behaviour for the regression task at hand.
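The activation admits a one-line implementation; a minimal sketch:

```python
import tensorflow as tf

def mish(x):
    # Mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))
    return x * tf.math.tanh(tf.math.softplus(x))
```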
2.1.2 Blur Pooling
As is widely known, many modern CNNs perform some sort of downsampling. A common practice for sub-sampling feature maps between convolutional layers is to use either a pooling operation or a strided convolution. It has long been pointed out that a system based on the combined operations of convolution and sub-sampling lacks translation invariance, unless the translation is a multiple of each of the sub-sampling factors; otherwise, sub-sampling introduces aliasing that corrupts the output. This property affects CNNs as well, since small spatial image transformations can lead to significant accuracy degradation. Formally, a feature extractor function $\tilde{F}$ is shift-equivariant when shifting the input equally shifts the output, making shifting and feature extraction commutable:

\[ \text{Shift}_{\Delta h, \Delta w}\left(\tilde{F}(X)\right) = \tilde{F}\left(\text{Shift}_{\Delta h, \Delta w}(X)\right) \quad \forall (\Delta h, \Delta w). \]

Furthermore, a representation is shift-invariant if shifting the input results in an identical representation:

\[ \tilde{F}(X) = \tilde{F}\left(\text{Shift}_{\Delta h, \Delta w}(X)\right) \quad \forall (\Delta h, \Delta w). \]
Regular pooling methods break shift-equivariance. To overcome this issue, we propose the adoption of an anti-aliasing filter, which is convolved with the feature maps with stride 2 to reduce the spatial resolution. The method provides the ability to choose between kernels of different sizes, producible by a box filter. The anti-aliasing filter is implemented as:

\[ F_m = f_m \otimes f_m, \]

where $\otimes$ denotes the outer product and $f_m \in \mathbb{R}^m$ is the 1D kernel obtained by repeatedly convolving the box filter $[1, 1]$ with itself (e.g., $f_2 = [1, 1]$, $f_3 = [1, 2, 1]$, $f_5 = [1, 4, 6, 4, 1]$), normalized so that its entries sum to one. In our case, the utilized anti-aliasing filter follows this construction.
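A minimal sketch of the resulting blur-pooling operation, with the kernel size and stride exposed as assumed parameters:

```python
import numpy as np
import tensorflow as tf

def blur_pool(x, kernel_size=3, stride=2):
    # Build the 1D binomial kernel by repeated convolution of [1, 1].
    f = np.array([1.0])
    for _ in range(kernel_size - 1):
        f = np.convolve(f, [1.0, 1.0])
    k2d = np.outer(f, f)          # F_m = f_m (outer product) f_m
    k2d /= k2d.sum()              # normalize to preserve mean intensity
    ch = x.shape[-1]
    kernel = tf.constant(k2d, dtype=x.dtype)
    kernel = tf.tile(kernel[:, :, None, None], [1, 1, ch, 1])
    # One blur filter per channel, applied with stride 2: anti-aliased
    # sub-sampling instead of naive strided pooling.
    return tf.nn.depthwise_conv2d(x, kernel,
                                  strides=[1, stride, stride, 1],
                                  padding='SAME')
```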
2.1.3 Attention Augmented Inverted Bottleneck Layer
Attention mechanisms enable a neural network to focus more on the relevant elements of the input than on the irrelevant parts. Visual attention is one of the most influential ideas in the deep learning research field. Attention mechanisms, and especially self-attention, are powerful building blocks for processing not only text but also images, and many visual attention mechanisms have been proposed to enhance the convolutions' already proven performance.
The general idea is that, given a query and a set of key elements, the attention mechanism aggregates, with respect to the trainable parameters, the resemblance between key-query pairs. Multiple attention functions provide the ability to attend to multiple representation subspaces and spatial positions. Finally, each head's output is linearly aggregated with learnable weights. Our work was inspired by a design in which a self-attention mechanism enfolds a standard residual block. More specifically, we implement an Attention Augmented Convolutional layer, which embeds an inverted bottleneck block by adding its output to the product of the depthwise separable convolutional layer, as shown in Figure 2.
A self-attention mechanism achieves better results when combined with convolutional layers. In practice, a self-attention module uses three sets of learnable parameters, $W_Q$, $W_K$ and $W_V$, which stand for Query, Key and Value, respectively. An input tensor $X \in \mathbb{R}^{H \times W \times F_{in}}$ is flattened to a matrix $X \in \mathbb{R}^{HW \times F_{in}}$ and then forwarded to the Transformer attention architecture:

\[ O_h = \text{Softmax}\left(\frac{(X W_Q)(X W_K)^T}{\sqrt{d_k^h}}\right)(X W_V), \]

where $W_Q, W_K \in \mathbb{R}^{F_{in} \times d_k^h}$ and $W_V \in \mathbb{R}^{F_{in} \times d_v^h}$. Since it has been found beneficial to apply self-attention multiple times, the above is computed once for each attention head, producing $N_h$ outputs $O_1, \ldots, O_{N_h}$. The output of each head is then concatenated with the remaining ones, forming the Multihead Attention mechanism:

\[ \text{MHA}(X) = \text{Concat}\left[O_1, \ldots, O_{N_h}\right] W^O, \]

where $W^O \in \mathbb{R}^{d_v \times d_v}$ is a trainable matrix which linearly transforms the aggregated output of the heads. We refer to the Values' depth as $d_v$, the Queries' depth as $d_k$, and the number of heads as $N_h$.
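A compact sketch of this multi-head self-attention over flattened feature maps follows; the class and argument names are ours, the positional terms are omitted here, and the spatial dimensions are assumed static:

```python
import tensorflow as tf
from tensorflow.keras import layers

class MultiHeadSelfAttention2D(layers.Layer):
    # Flattens the H x W grid to a sequence of HW pixels and applies
    # standard multi-head self-attention.
    def __init__(self, dk, dv, num_heads):
        super().__init__()
        assert dk % num_heads == 0 and dv % num_heads == 0
        self.dk, self.dv, self.nh = dk, dv, num_heads
        self.wq = layers.Dense(dk, use_bias=False)  # W_Q
        self.wk = layers.Dense(dk, use_bias=False)  # W_K
        self.wv = layers.Dense(dv, use_bias=False)  # W_V
        self.wo = layers.Dense(dv, use_bias=False)  # W_O

    def _split_heads(self, t, depth, hw):
        t = tf.reshape(t, [-1, hw, self.nh, depth // self.nh])
        return tf.transpose(t, [0, 2, 1, 3])        # [B, nh, HW, depth/nh]

    def call(self, x):                              # x: [B, H, W, F_in]
        h, w, f_in = x.shape[1], x.shape[2], x.shape[3]
        flat = tf.reshape(x, [-1, h * w, f_in])     # [B, HW, F_in]
        q = self._split_heads(self.wq(flat), self.dk, h * w)
        k = self._split_heads(self.wk(flat), self.dk, h * w)
        v = self._split_heads(self.wv(flat), self.dv, h * w)
        logits = tf.matmul(q, k, transpose_b=True)  # [B, nh, HW, HW]
        logits /= tf.math.sqrt(tf.cast(self.dk // self.nh, x.dtype))
        o = tf.matmul(tf.nn.softmax(logits, axis=-1), v)
        o = tf.transpose(o, [0, 2, 1, 3])           # [B, HW, nh, dv/nh]
        o = tf.reshape(o, [-1, h, w, self.dv])      # back to the spatial grid
        return self.wo(o)                           # linear aggregation by W_O
```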
An inherent characteristic of self-attention is that it is equivariant to a reordering of its input. This essentially means that no spatial information is maintained, which is prohibitive for vision tasks due to the structured nature of images. To alleviate this limitation, a trainable relative positional encoding is assigned to each pixel of the image. The relative position, in both width and height, between each Query and Key pixel is represented by two matrices that contain a relative position embedding for every pixel pair. The strength of the relationship between two pixels is computed as:

\[ l_{i,j} = \frac{q_i^T}{\sqrt{d_k^h}} \left( k_j + r^W_{j_x - i_x} + r^H_{j_y - i_y} \right), \]

where $q_i$ and $k_j$ are the Query and Key vectors for pixels $i$ and $j$, while $r^W_{j_x - i_x}$ and $r^H_{j_y - i_y}$ are learned embeddings for the relative width and height, respectively.
Each attention head enhanced by relative position embeddings becomes:

\[ O_h = \text{Softmax}\left(\frac{Q K^T + S^{rel}_H + S^{rel}_W}{\sqrt{d_k^h}}\right) V, \]

where $S^{rel}_H, S^{rel}_W \in \mathbb{R}^{HW \times HW}$ are matrices of relative positional logits for every pixel pair, with $S^{rel}_H[i, j] = q_i^T r^H_{j_y - i_y}$ and $S^{rel}_W[i, j] = q_i^T r^W_{j_x - i_x}$.
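To make the indexing concrete, the following sketches the relative logits along one axis (the width, here); production implementations typically use a memory-efficient rel-to-abs rearrangement rather than this direct gather:

```python
import tensorflow as tf

def relative_logits_1d(q, rel_k, width):
    # q:     [B, heads, H, W, d]  per-pixel Query vectors
    # rel_k: [2 * width - 1, d]   learned embeddings r^W, indexed by the
    #                             relative offset (j_x - i_x) + width - 1
    rel_idx = tf.range(width)[None, :] - tf.range(width)[:, None] + width - 1
    r = tf.gather(rel_k, rel_idx)               # [W, W, d]
    # logits[b, n, y, x, v] = q[b, n, y, x] . r^W[v - x]
    return tf.einsum('bnhxd,xvd->bnhxv', q, r)  # [B, heads, H, W, W]
```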
As previously mentioned, this type of visual attention has the ability to attend to feature subspaces and spatial positions simultaneously, both due to the attention mechanism, which introduces additional feature maps, and due to the convolution operator. The last part of the Attention Augmented Convolution's integration is the concatenation of the convolutional operator's and the Multihead Attention's outputs:

\[ \text{AAConv}(X) = \text{Concat}\left[\text{Conv}(X), \text{MHA}(X)\right]. \]

We denote as $\upsilon = d_v / F_{out}$ the ratio between the attention depth size and the output depth size, and as $\kappa = d_k / F_{out}$ the ratio of the Key depth over the output depth.
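The concatenation step itself is straightforward; a sketch reusing the MultiHeadSelfAttention2D class from above, with an illustrative 3x3 convolution:

```python
from tensorflow.keras import layers

def aa_conv(x, f_out, dk, dv, num_heads):
    # F_out - d_v maps from the convolution, d_v maps from attention,
    # concatenated channel-wise to form the augmented output.
    conv_out = layers.Conv2D(f_out - dv, 3, padding='same')(x)
    attn_out = MultiHeadSelfAttention2D(dk, dv, num_heads)(x)
    return layers.Concatenate(axis=-1)([conv_out, attn_out])
```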
For the network's tail, we recurrently implement dense blocks that contain the Attention Augmented Inverted Bottleneck layer, in a manner similar to the one proposed for the stem, as listed below.
| Block | Composition |
| --- | --- |
| Dense Block (1) | Inverted bottleneck layer |
| Dense Block (2) | Inverted bottleneck layer |
| Dense Block (3) | Attention Augmented Inverted bottleneck layer |
| Dense Block (4) | Attention Augmented Inverted bottleneck layer |
| Dense Block (5) | Attention Augmented Inverted bottleneck layer |
| Dense Block (6) | Attention Augmented Inverted bottleneck layer |
| Dense Block (7) | Attention Augmented Inverted bottleneck layer |
| Dense Block (8) | Attention Augmented Inverted bottleneck layer |
| AA-Bottleneck | Attention Augmented Inverted bottleneck layer |
| Average Pooling, stride 2 | |
To downsample the feature maps between dense blocks, a transition layer is used, which comprises a pointwise convolutional layer for depth reduction, a Blur Pooling filter with stride 2 and, finally, batch normalization.
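A sketch of such a transition layer, reusing the blur_pool helper from above:

```python
from tensorflow.keras import layers

def transition_layer(x, out_ch):
    x = layers.Conv2D(out_ch, 1, use_bias=False)(x)  # pointwise depth reduction
    x = blur_pool(x, kernel_size=3, stride=2)        # anti-aliased downsampling
    return layers.BatchNormalization()(x)
```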
During training, a Cyclical Learning Rate with the triangular policy was used together with the Stochastic Gradient Descent optimizer, cycling between the selected minimum and maximum learning rates. The batch size equals 256, and the training was executed using cloud Tensor Processing Units (TPUs) provided by Google. Finally, a mixed-precision training policy was adopted by exploiting both 16-bit (bfloat16) and 32-bit (float32) floating-point types. This practice resulted in memory gains, which in turn allowed a greater batch size, a smaller model size and faster execution times. Table 1 explicitly shows the model's architecture, totaling just 1.9M parameters; the model was developed using the TensorFlow library.
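A sketch of this training configuration in TensorFlow; the learning rates and step size below are placeholders, since the exact values are not restated here:

```python
import tensorflow as tf

# Illustrative values only.
MIN_LR, MAX_LR, STEP_SIZE = 1e-4, 1e-2, 2000.0

class TriangularCLR(tf.keras.optimizers.schedules.LearningRateSchedule):
    # Triangular policy (Smith, 2017): the rate climbs linearly from
    # MIN_LR to MAX_LR and back over a cycle of 2 * STEP_SIZE steps.
    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        cycle = tf.floor(1.0 + step / (2.0 * STEP_SIZE))
        x = tf.abs(step / STEP_SIZE - 2.0 * cycle + 1.0)
        return MIN_LR + (MAX_LR - MIN_LR) * tf.maximum(0.0, 1.0 - x)

# bfloat16 compute with float32 variables, matching TPU practice.
tf.keras.mixed_precision.set_global_policy('mixed_bfloat16')
optimizer = tf.keras.optimizers.SGD(learning_rate=TriangularCLR())
```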
We evaluate our method's 2D pose estimation on a number of contemporary datasets and against state-of-the-art methods. We show that our exceptionally lightweight and straightforward technique outperforms notably larger and more complex deep learning architectures, which are computationally expensive. Our experiments were performed on five different datasets, the characteristics of which are presented below.
The PANOPTIC is an accurate, large-scale human posture dataset with many instances of occluded subjects. We based our training set on three dataset sessions: office1, cello3 and tools1. In accordance with the literature, the training set of MPII+NZSL was also included, resulting in a set of 165000 training images. The evaluation was performed on the testing set of MPII+NZSL.
The HO-3D is a newly released markerless dataset, consisting of 10505 images in the training set. We augmented the dataset's images by flipping them and by rotating them by 0, 90 and 180 degrees.
The FreiHAND provides a multi-view hand dataset, recorded against a green screen and augmented with artificial backgrounds, resulting in a total of 130240 image instances.
The LSMV provides images of hands from multiple points of view, totaling 80000 frames.
The SHP provides 3D pose annotations of a person's hand performing various gestures in 18000 frames.
Each dataset was separately evaluated and split by a rule of 80%-10%-10% for training, validation and testing, respectively. Every image was cropped to a fixed input resolution. We compare our results with other state-of-the-art methods in Figure 3, according to the protocol proposed in the literature, showing that our method outperforms other approaches. More specifically, in Figure 3(a) and Figure 3(b), the percentage of correct keypoints (PCK) is visualized for different absolute and normalized thresholds, respectively, and compared to other techniques. Figure 3(c) depicts our method's performance when trained on the different datasets. The abovementioned results are also summarized in Table 2.
| Method | AUC | EPE mean (px) | EPE median (px) |
| --- | --- | --- | --- |
| Zimm. et al. (ICCV 2017) | 0.17 | 59.4 | - |
| Bouk. et al. (CVPR 2019) | 0.50 | 18.95 | - |
| Gomez-Donoso et al. | - | 10 | - |
| Li et al. | - | 8 | - |
| Stereo Hand Pose Dataset | | | |
| Zimm. et al. (ICCV 2017) | 0.81 | 5 | 5.5 |
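For reference, the PCK metric behind these comparisons can be sketched as follows; the array shapes are illustrative:

```python
import numpy as np

def pck(pred, gt, thresholds):
    # pred, gt: [N, K, 2] arrays with K 2D keypoints for each of N images.
    # Returns, per threshold, the fraction of keypoints whose Euclidean
    # distance to the ground truth does not exceed that threshold.
    dist = np.linalg.norm(pred - gt, axis=-1).ravel()
    return [float((dist <= t).mean()) for t in thresholds]
```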
We presented an alternative to the majority of pose estimation methods, which propose complex and computationally inefficient architectures. The proposed self-attention-based architecture exhibits competitive results with just 1.9M parameters and a model size of 11 MBytes, by directly predicting the joints' coordinates.
This work was supported by Google’s TensorFlow Research Cloud and Google’s Research Credits programme.
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "Tensorflow: A system for large-scale machine learning," in Proc. USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
-  A. Azulay and Y. Weiss, “Why do deep convolutional networks generalize so poorly to small image transformations?” arXiv preprint arXiv:1805.12177, 2018.
-  I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, “Attention augmented convolutional networks,” arXiv preprint arXiv:1904.09925, 2019.
-  A. Boukhayma, R. d. Bem, and P. H. Torr, "3d hand shape and pose from images in the wild," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10843–10852.
-  Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, “Openpose: realtime multi-person 2d pose estimation using part affinity fields,” arXiv preprint arXiv:1812.08008, 2018.
-  N. Efremova, M. Patkin, and D. Sokolov, “Face and emotion recognition with neural networks on mobile devices: Practical implementation on different platforms,” in Proc. IEEE International Conference on Automatic Face & Gesture Recognition, 2019, pp. 1–5.
-  L. Engstrom, B. Tran, D. Tsipras, L. Schmidt, and A. Madry, “Exploring the landscape of spatial robustness,” in Proc. International Conference on Machine Learning, 2019, pp. 1802–1811.
-  B. Fang, D. Guo, F. Sun, H. Liu, and Y. Wu, “A robotic hand-arm teleoperation system using human arm/hand with a novel data glove,” in Proc. IEEE International Conference on Robotics and Biomimetics, 2015, pp. 2483–2488.
-  F. Gomez-Donoso, S. Orts-Escolano, and M. Cazorla, “Large-scale multiview 3d hand pose dataset,” arXiv preprint arXiv:1707.03742, 2017.
-  S. Hampali, M. Oberweger, M. Rad, and V. Lepetit, “Ho-3d: A multi-user, multi-object dataset for joint 3d hand-object pose estimation,” arXiv preprint arXiv:1907.01481, 2019.
-  G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
-  U. Iqbal, P. Molchanov, T. Breuel, J. Gall, and J. Kautz, "Hand pose estimation via latent 2.5d heatmap regression," in Proc. European Conference on Computer Vision, 2018, pp. 118–134.
-  Y. Jang, S.-T. Noh, H. J. Chang, T.-K. Kim, and W. Woo, “3d finger cape: Clicking action and position estimation under self-occlusions in egocentric viewpoint,” IEEE Transactions on Visualization and Computer Graphics, vol. 21, no. 4, pp. 501–510, 2015.
-  I. Kansizoglou, L. Bampis, and A. Gasteratos, "An active learning paradigm for online audio-visual emotion recognition," IEEE Transactions on Affective Computing, 2019.
-  Y. Kartynnik, A. Ablavatski, I. Grishchenko, and M. Grundmann, “Real-time facial surface geometry from monocular video on mobile gpus,” arXiv preprint arXiv:1907.06724, 2019.
-  V. Kepuska and G. Bohouta, “Next-generation of virtual personal assistants (microsoft cortana, apple siri, amazon alexa and google home),” in Proc. IEEE Computing and Communication Workshop and Conference, 2018, pp. 99–103.
-  S. Li and A. B. Chan, “3d human pose estimation from monocular images with deep convolutional neural network,” in Proc. Asian Conference on Computer Vision, 2014, pp. 332–347.
-  Y. Li, X. Wang, W. Liu, and B. Feng, “Pose anchor: A single-stage hand keypoint detection network,” IEEE Transactions on Circuits and Systems for Video Technology, 2019.
-  P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh et al., “Mixed precision training,” arXiv preprint arXiv:1710.03740, 2017.
-  D. Misra, “Mish: A self regularized non-monotonic neural activation function,” arXiv preprint arXiv:1908.08681, 2019.
-  A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in Proc. European Conference on Computer Vision, 2016, pp. 483–499.
-  T. Piumsomboon, A. Clark, M. Billinghurst, and A. Cockburn, “User-defined gestures for augmented reality,” in Proc. IFIP Conference on Human-Computer Interaction, 2013, pp. 282–299.
-  P. Ramachandran, B. Zoph, and Q. V. Le, “Swish: a self-gated activation function,” arXiv preprint arXiv:1710.05941, vol. 7, 2017.
-  J. M. Rehg and T. Kanade, “Visual tracking of high dof articulated structures: an application to human hand tracking,” in Proc. European Conference on Computer Vision, 1994, pp. 35–46.
-  M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
-  T. Simon, H. Joo, and Y. Sheikh, “Hand keypoint detection in single images using multiview bootstrapping,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  E. P. Simoncelli, W. T. Freeman, E. H. Adelson, and D. J. Heeger, “Shiftable multiscale transforms,” IEEE Transactions on Information Theory, vol. 38, no. 2, pp. 587–607, 1992.
-  L. N. Smith, “Cyclical learning rates for training neural networks,” in Proc. IEEE Winter Conference on Applications of Computer Vision, 2017, pp. 464–472.
-  B. Tekin, A. Rozantsev, V. Lepetit, and P. Fua, “Direct prediction of 3d body poses from motion compensated sequences,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 991–1000.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
-  R. Vemulapalli, F. Arrate, and R. Chellappa, “Human action recognition by representing 3d skeletons as points in a lie group,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 588–595.
-  C. Wan, T. Probst, L. Van Gool, and A. Yao, “Dense 3d regression for hand pose estimation,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5147–5156.
-  S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4724–4732.
-  S. Yuan, G. Garcia-Hernando, B. Stenger, G. Moon, J. Yong Chang, K. Mu Lee, P. Molchanov, J. Kautz, S. Honari, L. Ge et al., “Depth-based 3d hand pose estimation: From current achievements to future goals,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2636–2645.
-  F. Zhang, X. Zhu, and M. Ye, “Fast human pose estimation,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3517–3526.
-  J. Zhang, J. Jiao, M. Chen, L. Qu, X. Xu, and Q. Yang, “3d hand pose tracking and estimation using stereo matching,” arXiv preprint arXiv:1610.07214, 2016.
-  R. Zhang, “Making convolutional networks shift-invariant again,” arXiv preprint arXiv:1904.11486, 2019.
-  X. Zhu, D. Cheng, Z. Zhang, S. Lin, and J. Dai, “An empirical study of spatial attention mechanisms in deep networks,” arXiv preprint arXiv:1904.05873, 2019.
-  C. Zimmermann, D. Ceylan, J. Yang, B. Russell, M. Argus, and T. Brox, “Freihand: A dataset for markerless capture of hand pose and shape from single rgb images,” in Proc. IEEE International Conference on Computer Vision, 2019, pp. 813–822.
-  J. Złotowski, D. Proudfoot, K. Yogeeswaran, and C. Bartneck, “Anthropomorphism: opportunities and challenges in human–robot interaction,” International Journal of Social Robotics, vol. 7, no. 3, pp. 347–360, 2015.