Attention! A Lightweight 2D Hand Pose Estimation Approach

01/22/2020 ∙ by Nicholas Santavas, et al. ∙ 24

Vision-based human pose estimation is a non-invasive technology for Human-Computer Interaction (HCI). Direct use of the hand as an input device provides an attractive interaction method that needs no specialized sensing equipment, such as exoskeletons or gloves, but only a camera. Traditionally, HCI is employed in applications spanning areas such as manufacturing, surgery, the entertainment industry and architecture, to mention a few. Deployment of vision-based human pose estimation algorithms can bring fresh innovation to these applications. In this letter, we present a novel Convolutional Neural Network architecture, reinforced with a Self-Attention module, that can be deployed on an embedded system thanks to its lightweight nature, with just 1.9 million parameters. The source code and qualitative results are publicly available.




1 Introduction

The role of computers and robots in our modern society keeps expanding and diversifying. As humanity breaks the technological barriers of the past, daily activities are increasingly assisted by a large number of operations based on the interaction between humans and computers. The development of more sophisticated systems frequently leads to complicated modes of interaction that hinder their use.

In order to democratize decision-making machines, straightforward ways of interaction need to be developed which imitate the relationship between humans [40]. A convenient way for a human to interact with machines is natural dialogue. Examples already implemented in virtual assistants are known as interactive conversational systems [16]. Computer vision techniques are often applied in this field, e.g., for face and emotion recognition [14][6], 3D face mesh representation for augmented reality applications [15], action recognition [31], and finally, human body pose estimation [5][35] and hand pose estimation [34].

Human hand pose estimation is a long-standing problem in the computer vision and graphics research fields, with a plethora of applications such as machine control and augmented and virtual reality [13][22][8]. Due to its importance, numerous solutions have been proposed in the related literature, one of the most common being based on accurate 2D keypoint localization [24].

Despite the recent advances in the field of deep neural networks, this topic is still considered a challenging problem that remains to be completely solved. Properties such as the hand’s morphology, occlusions due to interaction with objects, appearance diversity due to clothing and jewelry, varying lighting conditions and different backgrounds add extra burden to the nature of the problem. Moreover, unlike the human body or face, hands have an almost uniform shape and lack local characteristics. In addition, because of the hand’s degrees of freedom, there is a myriad of different possible poses. Thus, it becomes critical to localize more than 20 keypoints in each hand [4][12] in order to accurately estimate its pose and use it as an input device. Vision-based hand pose estimation has recently made significant progress. A vast number of approaches build on Convolutional Neural Networks (CNNs), due to their capability of extracting a given signal’s features.

Figure 1: Dense Block with growth rate k [11]

CNNs successfully perform 2D body pose estimation by classifying whether or not a body joint is present in each pixel [21][33]. The proposed methods, also known as Convolutional Pose Machines (CPM), force a CNN to generate a set of heat maps, each of which is expected to have its maximum activation value at the pixel that contains the corresponding keypoint. To refine the outcome, this procedure is applied iteratively upon the generated heat maps. The majority of hand pose estimation methods are based on this approach [4][12], which leads to computationally expensive networks and complicated system architectures.

Another line of work, known as holistic regression [29][17], aims to directly map the input image to the keypoints’ coordinates on the plane or in a specific frame of reference, for 2D and 3D pose estimation respectively. This approach does not have to generate intermediate representations (pixel-wise classification), while also preserving the ability to understand global constraints and correlations between keypoints. However, it is claimed that holistic regression is not able to generalize and that translation variance degrades the predicted results.


In this letter, we propose a computationally inexpensive CNN architecture for direct 2D hand pose estimation through holistic regression. Our novelty lies in the exploitation of a self-attention mechanism [3] combined with traditional convolutional layers. We show that this method yields state-of-the-art performance with an exceedingly low number of parameters.

2 Method

In this section, we describe the structure of the proposed architecture and the key ingredients for estimating a hand’s 2D keypoint coordinates, given a single RGB image. To address this challenge, we make use of a feed-forward CNN architecture that directly produces the coordinates in a single stage, without intermediate supervision. The network’s architecture comprises two parts, the stem and the rest, from now on dubbed the tail.

2.1 Network’s architecture

The presented architecture is based on the very successful idea of DenseNets [11]. In a DenseNet, each layer obtains additional inputs from all preceding ones and propagates its own feature-maps to all subsequent layers by a channel-wise concatenation, as shown in Figure 1. In such a way, this structure receives a “collective knowledge” from all previous layers.
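The effect of channel-wise concatenation on layer widths can be illustrated with a short sketch. The growth rate k = 10 is taken from Table 1; the initial depth of 24 channels and the four layers per block are hypothetical values chosen only for illustration.

```python
def dense_block_depths(c_in, growth_rate, n_layers):
    """Return (input depth seen by each layer, output depth of the block)."""
    layer_inputs = []
    depth = c_in
    for _ in range(n_layers):
        layer_inputs.append(depth)  # the layer sees every earlier feature map
        depth += growth_rate        # its k new maps are concatenated to them
    return layer_inputs, depth


# Hypothetical block: 24 input channels, 4 layers, growth rate k = 10.
layer_inputs, block_out = dense_block_depths(c_in=24, growth_rate=10, n_layers=4)
print(layer_inputs, block_out)  # [24, 34, 44, 54] 64
```

Each layer only adds k feature maps, so depth grows linearly with the number of layers while every layer still has access to all earlier representations.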

To keep the total number of parameters as low as possible, we were inspired by a very popular building unit, the inverted residual block, which is a highly efficient feature extractor designed especially for mobile use [25]. Replacing the standard convolutional layer with depthwise separable ones reduces computation by a factor of:

$$\frac{k^2 \cdot d_o}{k^2 + d_o}$$

where $k$ equals the kernel’s size and $d_o$ equals the output depth size. The first convolutional layer expands the depth size by a factor $e$, while the last squeezes it by dividing the input’s depth size by the same factor. Here, .
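As a quick arithmetic check, the sketch below counts multiply-adds for a standard convolution and for its depthwise separable counterpart, and compares their ratio with the closed form k²·d_o / (k² + d_o). The feature-map size and depths are hypothetical and are not taken from the proposed network.

```python
def standard_conv_madds(h, w, d_in, d_out, k):
    """Multiply-adds of a standard k x k convolution on an h x w x d_in input."""
    return h * w * d_in * d_out * k * k


def separable_conv_madds(h, w, d_in, d_out, k):
    """Multiply-adds of a k x k depthwise convolution plus a 1 x 1 pointwise one."""
    depthwise = h * w * d_in * k * k
    pointwise = h * w * d_in * d_out
    return depthwise + pointwise


def reduction_factor(k, d_out):
    """Closed-form ratio between the two costs."""
    return (k * k * d_out) / (k * k + d_out)


# Hypothetical layer: 56 x 56 feature map, 32 -> 64 channels, 3 x 3 kernel.
h, w, d_in, d_out, k = 56, 56, 32, 64, 3
ratio = standard_conv_madds(h, w, d_in, d_out, k) / separable_conv_madds(h, w, d_in, d_out, k)
print(round(ratio, 2), round(reduction_factor(k, d_out), 2))
```

For a 3 x 3 kernel the factor approaches k² = 9 as the output depth grows, which is why the saving is often quoted as "almost k²".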

2.1.1 Stem

For the stem, we use a number of dense blocks which, unlike the original design, contain an inverted residual block. According to [11], architectures with concatenated skip-connections maintain more information since they allow subsequent layers to reuse intermediate representations, which in turn leads to increased performance.

A significant difference from the original block regarding its non-linearity is that we use the recently proposed Mish activation function [20]. Mish, unlike ReLU, is a smooth non-monotonic activation function which is defined as:

$$f(x) = x \cdot \tanh(\mathrm{softplus}(x)) = x \cdot \tanh(\ln(1 + e^x))$$

As mentioned in [20], Mish demonstrates better results than both Swish [23] and ReLU for classification tasks. After extensive experimentation with both Swish and ReLU, we confirmed the above behaviour for the regression task at hand.
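A minimal scalar sketch of Mish, f(x) = x · tanh(ln(1 + eˣ)), is given below next to ReLU for comparison. It is meant only to make the "smooth and non-monotonic" behaviour concrete and is not the training code used in this work.

```python
import math


def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def relu(x):
    return max(x, 0.0)


for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  mish={mish(x):+.4f}  relu={relu(x):+.4f}")
```

Unlike ReLU, Mish lets small negative values through: it dips below zero for moderately negative inputs and returns towards zero for strongly negative ones, which is the non-monotonic region.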

2.1.2 Blur Pooling

It is widely known that many modern CNNs perform some sort of downsampling. A common practice for sub-sampling feature maps between convolutional layers is to use either a pooling operation or a strided convolution. In [27], it was explicitly discussed that a system based on both convolution and sub-sampling lacks translation invariance, unless the translation is a multiple of each of the sub-sampling factors. Otherwise, sub-sampling creates aliasing that undermines the output. This property affects CNNs as well, since small spatial image transformations can lead to significant accuracy degradation [7][2]. As stated in [37], a feature extractor function $F$ is shift-equivariant when shifting the input equally shifts the output, making shifting and feature extraction commutable:

$$\mathrm{Shift}_{\Delta h, \Delta w}(F(X)) = F(\mathrm{Shift}_{\Delta h, \Delta w}(X)) \quad \forall (\Delta h, \Delta w)$$

Furthermore, a representation is shift-invariant if shifting the inputs results in an identical representation:

$$F(X) = F(\mathrm{Shift}_{\Delta h, \Delta w}(X)) \quad \forall (\Delta h, \Delta w)$$

Regular pooling methods break shift-equivariance. To overcome this issue, we propose the adoption of an anti-aliasing filter, which is convolved with the feature maps [37] with stride 2 to reduce spatial resolution. The method allows a choice between kernels of different sizes, each producible from a box filter. The following implements the anti-aliasing filter .


where denotes the outer product, , and . In our case, the utilized anti-aliasing filter .
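The idea of blurring before sub-sampling can be sketched in NumPy as follows. The 1D coefficients [1, 2, 1] (a box filter convolved with itself) and the reflect padding are assumptions borrowed from common blur-pooling practice, not values confirmed by the text above; the 2D kernel is built as an outer product and normalised to sum to one.

```python
import numpy as np


def blur_kernel(a=(1.0, 2.0, 1.0)):
    """2D low-pass kernel as the normalised outer product of 1D coefficients."""
    a = np.asarray(a, dtype=np.float64)
    filt = np.outer(a, a)
    return filt / filt.sum()


def blur_pool(x, stride=2, a=(1.0, 2.0, 1.0)):
    """Blur an (H, W, C) feature map per channel, then sub-sample by `stride`."""
    filt = blur_kernel(a)
    k = filt.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    h, w, c = x.shape
    out_h = (h + stride - 1) // stride
    out_w = (w + stride - 1) // stride
    out = np.zeros((out_h, out_w, c))
    for i in range(out_h):
        for j in range(out_w):
            patch = xp[i * stride:i * stride + k, j * stride:j * stride + k, :]
            # Weighted sum over the k x k window, independently for each channel.
            out[i, j, :] = np.tensordot(filt, patch, axes=([0, 1], [0, 1]))
    return out


features = np.random.default_rng(0).normal(size=(8, 8, 3))
pooled = blur_pool(features)
print(pooled.shape)  # (4, 4, 3)
```

The explicit loops keep the sketch readable; a practical implementation would express the same operation as a strided depthwise convolution with fixed weights.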

2.1.3 Attention Augmented Inverted Bottleneck Layer

Attention mechanisms enable a neural network to focus more on relevant elements of the input than on irrelevant parts. Visual attention is one of the most influential ideas in the deep learning research field. Attention mechanisms, and especially self-attention, are powerful building blocks for processing not only text but also images. Many visual attention mechanisms have been proposed to enhance the already proven performance of convolutions.


The general idea is that, given a query and a set of key elements, the attention mechanism aggregates, w.r.t. the trainable parameters, the resemblance between key-query pairs. Multiple attention functions provide the ability to attend to multiple representation subspaces and spatial positions. Finally, each head’s output is linearly aggregated with learnable weights [38]. Our work was inspired by a design proposed in [38], in which a self-attention mechanism enfolds a standard residual block. More specifically, we implement an Attention Augmented Convolutional layer [3], which embeds an inverted bottleneck block, by adding its output to the product of the Depthwise Separable Convolutional layer, as shown in Figure 2.

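A self-attention mechanism achieves better results when combined with convolutional layers [3]. In practice, a self-attention module uses three sets of learnable parameters $W^Q$, $W^K$ and $W^V$, where $Q$, $K$ and $V$ stand for Query, Key and Value, respectively. According to [30], an input tensor $T \in \mathbb{R}^{H \times W \times F_{in}}$ is flattened to a matrix $X \in \mathbb{R}^{HW \times F_{in}}$ and then forwarded to the Transformer attention architecture. Since it has been found beneficial to apply self-attention multiple times, Eq. 8 is applied once for each attention head, producing $N_h$ outputs $O_h$, where $h = 1, \dots, N_h$:

$$O_h = \mathrm{Softmax}\left( \frac{(X W^Q)(X W^K)^T}{\sqrt{d_k^h}} \right) (X W^V) \tag{8}$$

where $W^Q, W^K \in \mathbb{R}^{F_{in} \times d_k^h}$ and $W^V \in \mathbb{R}^{F_{in} \times d_v^h}$, with $d_k^h$ and $d_v^h$ the key and value depths per head. The output of each head is then concatenated with the remaining ones, forming the Multihead Attention mechanism:

$$\mathrm{MHA}(X) = \mathrm{Concat}[O_1, \dots, O_{N_h}] \, W^O$$

where $W^O \in \mathbb{R}^{d_v \times d_v}$ is a trainable matrix which linearly transforms the aggregated output of each head. We refer to the Values’ depth as $d_v$, the Queries’ depth as $d_k$ and the number of heads as $N_h$.

Figure 2: Attention Augmented Inverted Bottleneck Layer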
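The flatten, per-head attention, concatenate and project steps described above can be sketched in NumPy. This is a minimal illustration without positional information; the tensor sizes and random weights are hypothetical, and the per-head projections are obtained by slicing full-width matrices, which is equivalent to keeping one matrix per head.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def multihead_self_attention(t, w_q, w_k, w_v, w_o, n_heads):
    """t: (H, W, F_in); w_q, w_k: (F_in, d_k); w_v: (F_in, d_v); w_o: (d_v, d_v)."""
    h, w, f_in = t.shape
    x = t.reshape(h * w, f_in)              # flatten pixels into rows
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k_h = w_q.shape[1] // n_heads         # key depth per head
    d_v_h = w_v.shape[1] // n_heads         # value depth per head
    heads = []
    for i in range(n_heads):
        q_h = q[:, i * d_k_h:(i + 1) * d_k_h]
        k_h = k[:, i * d_k_h:(i + 1) * d_k_h]
        v_h = v[:, i * d_v_h:(i + 1) * d_v_h]
        weights = softmax(q_h @ k_h.T / np.sqrt(d_k_h))  # (HW, HW)
        heads.append(weights @ v_h)
    out = np.concatenate(heads, axis=1) @ w_o  # linear mix of all heads
    return out.reshape(h, w, -1)


# Hypothetical sizes: 4 x 4 feature map, F_in = 6, d_k = d_v = 8, two heads.
rng = np.random.default_rng(0)
t = rng.normal(size=(4, 4, 6))
w_q, w_k, w_v = (rng.normal(size=(6, 8)) for _ in range(3))
w_o = rng.normal(size=(8, 8))
out = multihead_self_attention(t, w_q, w_k, w_v, w_o, n_heads=2)
print(out.shape)  # (4, 4, 8)
```

Note that the attention matrix has HW × HW entries per head, so memory grows quadratically with the number of pixels.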

Figure 3: Performance Evaluation. a) PCK curves on MPII+NZSL testing set b) PCKh curves on MPII+NZSL testing set and c) PCK curves of our method on different datasets

An inherent characteristic of self-attention is that it is equivariant to an input’s reordering. This essentially means that any spatial information is not maintained, which is prohibitive for vision tasks due to the structured nature of the images. To alleviate the limitation, a trainable positional encoding is assigned to each pixel of the image. The relative position of both width and height, between each Query and Key pixel, is represented by two matrices that contain a relative position embedding for every pixel pair. The relationship’s strength between two pixels is computed as:


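$$l_{i,j} = \frac{q_i^T}{\sqrt{d_k^h}} \left( k_j + r^W_{j_x - i_x} + r^H_{j_y - i_y} \right)$$

where $q_i$ and $k_j$ are the Query and Key vectors for pixels $i = (i_x, i_y)$ and $j = (j_x, j_y)$, $d_k^h$ is the key depth per attention head, while $r^W_{j_x - i_x}$ and $r^H_{j_y - i_y}$ are learned embeddings for relative width and height, respectively.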

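Each attention head enhanced by relative position embeddings becomes:

$$O_h = \mathrm{Softmax}\left( \frac{Q K^T + S_H^{rel} + S_W^{rel}}{\sqrt{d_k^h}} \right) V$$

where $Q = X W^Q$, $K = X W^K$ and $V = X W^V$ are the Query, Key and Value matrices of the head, $d_k^h$ is its key depth, and $S_H^{rel}, S_W^{rel} \in \mathbb{R}^{HW \times HW}$ are matrices of relative positional embeddings for every pixel pair.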
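A naive NumPy sketch of one attention head with relative position terms follows. It builds the two HW × HW matrices by direct embedding lookup, assuming pixels are flattened in row-major order and that each embedding table stores one vector per offset from -(W-1) to W-1 (or -(H-1) to H-1); these layout choices, like all sizes and the random values, are illustrative assumptions.

```python
import numpy as np


def relative_logits(q, rel_w, rel_h, height, width):
    """q: (H*W, d); rel_w: (2W-1, d); rel_h: (2H-1, d). Returns S_H, S_W: (HW, HW)."""
    ys, xs = np.divmod(np.arange(height * width), width)  # row and column of each pixel
    dx = xs[None, :] - xs[:, None] + (width - 1)    # table index for j_x - i_x
    dy = ys[None, :] - ys[:, None] + (height - 1)   # table index for j_y - i_y
    s_w = np.einsum("id,ijd->ij", q, rel_w[dx])     # q_i . r^W_{j_x - i_x}
    s_h = np.einsum("id,ijd->ij", q, rel_h[dy])     # q_i . r^H_{j_y - i_y}
    return s_h, s_w


def attention_head_with_relative_positions(q, k, v, rel_w, rel_h, height, width):
    d = q.shape[1]
    s_h, s_w = relative_logits(q, rel_w, rel_h, height, width)
    logits = (q @ k.T + s_h + s_w) / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)     # stabilise the softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v


# Hypothetical sizes: 2 x 3 feature map, key depth 4, value depth 5.
rng = np.random.default_rng(0)
height, width, d = 2, 3, 4
q = rng.normal(size=(height * width, d))
k = rng.normal(size=(height * width, d))
v = rng.normal(size=(height * width, 5))
rel_w = rng.normal(size=(2 * width - 1, d))
rel_h = rng.normal(size=(2 * height - 1, d))
head = attention_head_with_relative_positions(q, k, v, rel_w, rel_h, height, width)
print(head.shape)  # (6, 5)
```

Because the added terms depend only on the offset between two pixels, the same embedding is shared by every pixel pair with that offset, which keeps the number of extra parameters small.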

As previously mentioned, this type of visual attention has the ability to attend to feature subspaces and spatial positions simultaneously, owing both to the attention mechanism, which introduces additional feature maps, and to the convolution operator. The last part of the Attention Augmented Convolution integration is the concatenation of the convolutional operator’s output with the Multiheaded Attention’s output:

$$\mathrm{AAConv}(X) = \mathrm{Concat}[\mathrm{Conv}(X), \mathrm{MHA}(X)]$$

We denote by $\upsilon = d_v / F_{out}$ the ratio between the attention depth size and the output depth size $F_{out}$, and by $\kappa = d_k / F_{out}$ the ratio of the key depth over the output depth.
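To make the two ratios concrete, the helper below derives the channel budget of one attention augmented layer from the output depth and the ratios υ (attention depth over output depth) and κ (key depth over output depth). It assumes, following the convention of [3], that the convolution supplies the channels not produced by attention; the numeric values are hypothetical.

```python
def aa_channel_split(f_out, upsilon, kappa, n_heads):
    """Split an output depth between convolution and attention channels."""
    d_v = round(upsilon * f_out)  # total depth of the attention values
    d_k = round(kappa * f_out)    # total depth of the queries and keys
    if d_v % n_heads or d_k % n_heads:
        raise ValueError("d_v and d_k must be divisible by the number of heads")
    return {
        "conv_channels": f_out - d_v,  # the rest of the output comes from the convolution
        "d_v": d_v,
        "d_k": d_k,
        "d_v_per_head": d_v // n_heads,
        "d_k_per_head": d_k // n_heads,
    }


# Hypothetical layer: 64 output channels, a quarter of them from attention, 4 heads.
split = aa_channel_split(f_out=64, upsilon=0.25, kappa=0.25, n_heads=4)
print(split)
```

After concatenation the layer still outputs exactly F_out channels, so it can replace a plain convolution without changing the shapes of neighbouring layers.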

For the network’s tail, we repeatedly stack dense blocks that contain the Attention Augmented Inverted Bottleneck layer, in a manner similar to that proposed for the stem.

Layers Output Size Architecture
Dense Block (1) [Inverted bottleneck layer]
Transition Layer
Dense Block (2) [Inverted bottleneck layer]
Transition Layer
Dense Block (3) [Attention Augmented Inverted bottleneck layer]
Transition Layer
Dense Block (4) [Attention Augmented Inverted bottleneck layer]
Transition Layer
Dense Block (5) [Attention Augmented Inverted bottleneck layer]
Transition Layer
Dense Block (6) [Attention Augmented Inverted bottleneck layer]
Transition Layer
Dense Block (7) [Attention Augmented Inverted bottleneck layer]
Transition Layer
Dense Block (8) [Attention Augmented Inverted bottleneck layer]
AA-Bottleneck [Attention Augmented Inverted bottleneck layer]
Average Pooling, s2
Convolutional Layer
Table 1: Network’s Architecture. The growth rate is k=10

2.1.4 Downsampling

To downsample the feature maps between dense blocks, a transition layer is used, which comprises a pointwise convolutional layer for depth reduction, a Blur Pooling filter with stride 2 and, finally, batch normalization.

Figure 4: Our 2D hand pose estimations on different testing sets.

2.2 Training

During training, a Cyclical Learning Rate [28] with triangular policy was used with the Stochastic Gradient Descent optimizer. The selected hyper-parameters are, , minimum learning rate of and maximum learning rate of . The batch size was 256, and training was executed on cloud Tensor Processing Units (TPUs) provided by Google. Finally, a mixed-precision training policy was used, exploiting both 16-bit (bfloat16) and 32-bit (float32) floating-point types [19]. This practice resulted in memory savings, which in turn allowed a greater batch size, a smaller model size and faster execution time. Table 1 explicitly shows the model’s architecture, totaling just 1.9M parameters, which was developed using the TensorFlow library.
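The triangular cyclical policy of [28] can be written in a few lines. The step size and the learning-rate bounds used in the example call are hypothetical placeholders, since they are not the values selected for the experiments.

```python
import math


def triangular_clr(iteration, step_size, min_lr, max_lr):
    """Learning rate of the triangular cyclical policy at a given iteration.

    The rate climbs linearly from min_lr to max_lr over `step_size`
    iterations, then descends back over the next `step_size` iterations.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return min_lr + (max_lr - min_lr) * max(0.0, 1.0 - x)


# Hypothetical settings, for illustration only.
step_size, min_lr, max_lr = 100, 1e-4, 1e-1
for it in (0, 50, 100, 150, 200):
    print(it, round(triangular_clr(it, step_size, min_lr, max_lr), 5))
```

One full cycle spans 2 · step_size iterations, so the schedule returns to the minimum rate at every even multiple of the step size.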


3 Evaluation

We evaluate our method’s 2D pose estimation on a number of contemporary datasets and against state-of-the-art methods. We show that our exceptionally lightweight and straightforward technique outperforms notably larger and more complex deep learning architectures, which are computationally expensive. Our experiments were performed on five different datasets, the characteristics of which are presented below.

The PANOPTIC [26] is an accurate large-scale human posture dataset with many instances of occluded subjects. We based our training set on three dataset sessions, office1, cello3 and tools1. In accordance with the literature [4], the training set of MPII+NZSL was also included [26], resulting in a set of 165000 training images. The evaluation was made on the testing set of MPII+NZSL.

The HO-3D [10] is a newly released markerless dataset, consisting of 10505 images in the training set. We augmented the dataset’s images by flipping them and rotating them by 0, 90 and 180 degrees.

The FreiHAND [39] provides a multi-view hands dataset, recorded against a green screen and augmented with artificial backgrounds, resulting in a total of 130240 image instances.

The LSMV [9] provides images of hands from multiple points of view. The total number of frames is 80000.

The SHP [36] provides 3D pose annotations of a person’s hand, performing various gestures in 18000 frames.

Each dataset was separately evaluated and split by a rule of 80%-10%-10% for training, validation and testing, respectively. Every image was cropped to the resolution of . We compare our results with other state-of-the-art methods in Figure 3, according to the protocol proposed in [26], showing that our method outperforms other approaches. More specifically, in Figure 3(a) and Figure 3(b), the percentage of correct keypoints is visualized for different absolute and normalized thresholds, respectively, and compared to other techniques. Figure 3(c) depicts our method’s performance when trained on different datasets. The abovementioned results are also summarized in Table 2.
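For readers unfamiliar with the metric, the following NumPy sketch computes the percentage of correct keypoints (PCK) at a threshold, the normalised area under the PCK curve, and the mean and median end-point error. The optional `norm` argument, which divides each sample's errors by a reference length to obtain normalised thresholds, and the toy coordinates are assumptions made for illustration; the exact evaluation protocol is the one of [26].

```python
import numpy as np


def keypoint_errors(pred, gt, norm=None):
    """Euclidean error per keypoint. pred, gt: (N, K, 2); norm: optional (N,)."""
    dist = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=-1)
    if norm is not None:
        dist = dist / np.asarray(norm, float)[:, None]
    return dist


def pck(pred, gt, threshold, norm=None):
    """Fraction of keypoints whose error does not exceed the threshold."""
    return float((keypoint_errors(pred, gt, norm) <= threshold).mean())


def pck_auc(pred, gt, thresholds, norm=None):
    """Area under the PCK curve (trapezoidal rule), scaled to [0, 1]."""
    values = [pck(pred, gt, t, norm) for t in thresholds]
    area = sum((values[i] + values[i + 1]) / 2 * (thresholds[i + 1] - thresholds[i])
               for i in range(len(thresholds) - 1))
    return area / (thresholds[-1] - thresholds[0])


def epe_stats(pred, gt):
    """Mean and median end-point error over all keypoints."""
    dist = keypoint_errors(pred, gt)
    return float(dist.mean()), float(np.median(dist))


# Toy example: one image, four keypoints with errors of 0, 3, 5 and 10 pixels.
gt = np.zeros((1, 4, 2))
pred = np.array([[[0, 0], [3, 0], [3, 4], [0, 10]]])
print(pck(pred, gt, threshold=5))        # 0.75
print(pck_auc(pred, gt, [0, 5, 10]))     # 0.6875
print(epe_stats(pred, gt))               # (4.5, 4.0)
```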

Method                              Mean    Median
Zimm. et al. (ICCV 2017)    0.17    59.4    -
Bouk. et al. (CVPR 2019)    0.50    18.95   -
Ours                        0.55    16.1    11
LSMV Dataset
Gomez-Donoso et al.         -       10      -
Li et al. [18]              -       8       -
Ours                        0.89    3.3     2.5
Stereo Hand Pose Dataset
Zimm et al. (ICCV 2017)     0.81    5       5.5
Ours                        0.92    2.2     1.8
HO-3D Dataset
Ours                        0.87    3.9     3.3
FreiHand Dataset
Ours                        0.87    4       3.1
Table 2: Quantitative results

4 Conclusions

We presented an alternative to the majority of pose estimation methods, which propose complex and computationally inefficient architectures. The proposed self-attention mechanism exhibits competitive results with just 1.9M parameters and a model size of 11 Mbytes, by directly predicting the joints’ coordinates.


This work was supported by Google’s TensorFlow Research Cloud and Google’s Research Credits programme.