
Parallel Multi-Scale Networks with Deep Supervision for Hand Keypoint Detection

by Renjie Li, et al.

Keypoint detection plays an important role in a wide range of applications. However, predicting the keypoints of small objects such as human hands is a challenging problem. Recent works fuse the feature maps of deep Convolutional Neural Networks (CNNs), either via multi-level feature integration or multi-resolution aggregation. Despite achieving some success, these feature fusion approaches increase the complexity and the opacity of CNNs. To address this issue, we propose a novel CNN model named Parallel Multi-Scale Deep Supervision Network (P-MSDSNet), which learns feature maps at different scales with deep supervision to produce attention maps for adaptive feature propagation from layer to layer. P-MSDSNet has a multi-stage architecture, which makes it scalable, while its deep supervision with spatial attention improves the transparency of the feature learning at each stage. We show that P-MSDSNet outperforms state-of-the-art approaches on benchmark datasets while requiring fewer parameters. We also show an application of P-MSDSNet to quantifying finger tapping hand movements in a neuroscience study.




1 Introduction

Hand keypoint detection identifies the positions of keypoints in images of hands. The keypoints of a hand are the fingertips and the joints of the fingers, such as the knuckles. Hand keypoint detection methods have largely been used to extract hand movement features [dardas2011real, zhang2020mediapipe], which can then be used in domains such as healthcare and human-computer interaction [rautaray2015vision]. Recent approaches to keypoint detection across a range of areas rely heavily on Convolutional Neural Networks (CNNs), thanks to their ability to learn discriminable features from visual data [lecun1995convolutional]. By stacking multiple blocks of convolutional layers on top of one another, along with other operators such as pooling and normalisation, CNNs can learn different levels of abstraction of visual features, which may improve the effectiveness of localising the positions of interest [huang2015deepfinger, wu2017yolse, yang2020msb]. However, the varying sizes of objects in images pose a great challenge for learning reliable features for prediction tasks. Several architectures have been proposed to deal with this issue, including the Hourglass model [newell2016stacked], HigherHRNet [cheng2020higherhrnet], U-Net [ronneberger2015u] and Fully Convolutional Networks (FCN) [long2015fully]. These approaches aim to learn and fuse features at different scales by connecting multi-resolution sub-networks. In [yang2020msb], multi-scale features were extracted from a large block of feature maps, then fine-tuned and aggregated for prediction. Despite some success, such feature fusion approaches increase the complexity and the opacity of CNNs. For hand keypoint detection in particular, the scale variance problem and the lack of transparency of deep architectures still present major obstacles to employing CNNs in real-world applications.

To address the aforementioned limitations, we propose the Multi-Scale Deep Supervision Network (P-MSDSNet), a novel deep neural network architecture with deeply supervised attention at multiple scales. The network is constructed in a modular fashion by stacking multiple stages, one on top of another. All stages have the same structure, each consisting of a set of multi-resolution sub-networks connected in parallel. The sub-networks are initialised by convolutional operators with different strides to learn features at different scales; this design avoids the need for an extra pooling layer as in [zhang2017amulet]. We extend the idea of deep supervision [lee2015deeply, li2018deep] to learn attention maps for propagating the feature maps from one stage to the next. With this mechanism, the model can guide the flow of information to focus on the spatial features that represent the keypoints. Although the deeply supervised attention does not provide complete transparency, it offers a means to monitor the behaviour of the deep network, i.e. what is learned at each stage and what information is forwarded to the next stage.

We conducted experiments to evaluate the effectiveness of P-MSDSNet on hand keypoint benchmark datasets. The results show that P-MSDSNet outperforms state-of-the-art deep learning models in hand keypoint detection while requiring fewer parameters. Furthermore, we demonstrate that P-MSDSNet can self-tune the attention maps in a multi-stage architecture. Finally, we show a real-world application of P-MSDSNet in estimating finger tapping movements for a neuroscience study.

Our contributions are summarised as follows:

  • A flexible architecture to effectively learn and fuse different context information at different scales and depths.

  • A deeply supervised attention module to guide the learning and control the propagation of information in deep models.

  • An effective and more transparent solution for real-world applications.

2 Related Work

2.1 Hand Keypoint Detection

Early works on hand keypoint detection focused on features that were manually extracted from the image, such as hand direction [grzejszczak2016hand] and color features [raheja2012fingertip]. The performance of these works was limited by uncontrolled environments and self-occlusions [erol2007vision]. Recently, CNN-based deep learning methods have proved effective for the hand keypoint detection problem. In [wu2017yolse], CNNs were used to integrate features at different depth levels to regress hand fingertip positions. Predicting hand keypoints from images is challenging because redundant features, such as human body parts, may interfere with the performance. One approach to mitigate this is to focus the learning on the hand area. For example, in [huang2015deepfinger] the authors applied a cascade of CNNs to detect finger keypoints: the first CNN detected the hand area and the second regressed the positions of the fingertips. In [mukherjee2019fingertip], Faster-RCNN [ren2016faster] was used to localise hands in images, and the fingertips were detected by computing the distances between the center and contour points as well as the curvature entropy at each contour point.

In this paper, we employ spatial attention to guide the model to focus on the potential areas of the keypoints. Therefore, our approach only needs one step of prediction and reduces the annotation cost.

2.2 Multi-Scale Networks with Auxiliary Supervision

Multi-scale networks have been designed to learn scale-invariant features from input images. Recent approaches learn and fuse features through down-up scaling processes (also known as high-to-low and low-to-high processes). U-Net [ronneberger2015u] and Fully Convolutional Networks (FCN) [long2015fully] adopted an encoder-decoder paradigm to extract small- and large-scale features for image segmentation. Amulet [zhang2017amulet] and the Bidirectional Fully Connected Network (MSBFCN) [yang2020msb] extracted and fused multi-scale features from pre-defined models, i.e. VGG-16 [simonyan2014very] and ResNet [he2016deep] respectively. The Stacked Hourglass network [newell2016stacked] applied an encoder-decoder architecture with skip connections to fuse information at different scales and depths. HRNet [sun2019deep] and HigherHRNet [cheng2020higherhrnet] connected multi-scale sub-networks in parallel and aggregated their feature maps to maintain high-resolution representations.

In deep learning, multi-scale feature fusion is usually coupled with auxiliary supervision (also known as deep supervision or intermediate supervision), which adds supervision at different stages of a neural network [li2018deep]. The key advantage of auxiliary supervision is improved learning of discriminable features for prediction tasks [zhang2018deep, ronneberger2015u, yang2020msb]. This technique has been widely applied to keypoint detection [li2018deep, newell2016stacked, wei2016convolutional].

The P-MSDSNet architecture is inspired by the ‘down-up’ scaling idea in [sun2019deep, cheng2020higherhrnet] but differs in that P-MSDSNet maintains both low-resolution and high-resolution representations in parallel sub-networks. For auxiliary supervision, instead of learning discriminable features explicitly as in [lee2015deeply], it learns a spatial attention map at each stage. The advantage of this idea is twofold. First, P-MSDSNet can focus on the discriminable features in the areas surrounding the keypoints. Second, the attention map can help shed light on the learning at intermediate layers, thus offering a certain degree of transparency. To the best of our knowledge, this is the first work attempting to use deep supervision as an auxiliary to build the spatial attention mechanism for the hand keypoint detection problem.

3 Multi-Scale Deep Supervision Networks

In this section, we detail the structure of P-MSDSNet. The idea of deep supervision in this paper is represented as a repeated chain of convolutional blocks with spatial attention. The deep supervision is added to multi-scale feature maps to enhance the learning of local and global features. We also explain the learning strategy employed when P-MSDSNet is applied to hand keypoint detection.

3.1 Deep Supervision Networks with Spatial Attention

Figure 1:

An illustration of the deep-supervision-based spatial attention module. A new branch is added to the feature map, and its intermediate output is supervised by the ground truth feature map. The spatial attention map is computed by summing the intermediate output feature maps along the channel axis and activating the result with the sigmoid function. The spatial attention map is then applied to the original feature map through a Hadamard product.

We now detail the deep supervision network with spatial attention. The network consists of M stages, with one stage shown in Figure 1. At stage m, the network takes a feature map x_{m−1} as input and transforms it into an intermediate representation by applying a convolutional module g_m. The module can consist of several convolutional operations, pooling, and normalisation functions; in our implementation, we employed three convolution and batch normalisation operators. The intermediate representation is then used to estimate a spatial attention map, and is combined with it (through a Hadamard product) to produce the input feature map for the next stage. In contrast to previous work [newell2016stacked], where deep supervision is set up to improve the discrimination of the features, we tailor the deep supervision to adaptively refine the features. To this end, at each stage the model learns the deeply supervised features to estimate the spatial attention map. To improve the stability of the gradients during learning and to fuse features from different depths, a skip connection is added to infer the deeply supervised features. A stage is formulated as follows.



z_m = g_m(x_{m−1})    (1)
d_m = h_m(z_m) + z_m    (2)
x_m = σ(Σ_c d_m^{(c)}) ⊙ z_m    (3)

where σ(·) is a sigmoid-based spatial attention function [woo2018cbam] applied to the channel-wise sum of the deeply supervised features d_m, ⊙ is the Hadamard product, and x_0 is the input image. The final feature map x_M, i.e. the output of the M-th stage, is fed into a convolution operation to predict the final output. In (1), (2), (3), g_m and h_m denote the convolutional blocks.
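To make the gating step above concrete, the following numpy sketch implements the channel-wise summation, sigmoid activation, and Hadamard product; the convolutional blocks are abstracted away as precomputed arrays and all names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(features, supervised):
    """Gate a stage's feature map with its deeply supervised branch.

    features:   (C, H, W) intermediate feature map of the stage
    supervised: (C, H, W) output of the deeply supervised branch
    Returns the gated feature map forwarded to the next stage.
    """
    # Sum the supervised maps along the channel axis, squash to (0, 1)
    attn = sigmoid(supervised.sum(axis=0))       # (H, W) spatial attention
    # Hadamard product, broadcasting the attention map over every channel
    return features * attn[None, :, :]

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 16))
sup = rng.standard_normal((8, 16, 16))
gated = attention_gate(feat, sup)
print(gated.shape)  # (8, 16, 16)
```

Because the attention values lie strictly in (0, 1), the gate can only attenuate features, never amplify them, which matches its role of letting only relevant spatial locations through.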

3.2 Multi-scale Deep Supervision

Figure 2: An illustration of the Multi-Scale Deep Supervision Network. The network fuses feature maps at different scales (upscaled and downscaled) at the same depth. Intermediate supervision attention modules are used at each scale (dashed square).

Scale variation poses a critical challenge to the prediction of correct poses [cheng2020higherhrnet, yang2020msb]. To address the issue, we extend P-MSDSNet to learn multi-scale high-level features and propagate them through the depth of the network for the final prediction. P-MSDSNet maintains the effect of deep supervision by adding parallel connections at different scales, meanwhile, stacking the deeply supervised layers at different depth levels.

A common solution for the scale variance problem is to build a multi-scale feature pyramid [lin2017feature]. These approaches, however, add computational complexity to the learning and inference of deep CNN models. In [yang2020msb], a pyramid pooling idea is proposed in which a ResNet-based feature map is pooled into a set of feature maps of different sizes. However, the use of ResNet makes multi-staging difficult. To improve modularity, we instead use convolutions with multiple stride sizes to produce feature maps at different scales. This approach allows us to easily incorporate multi-scale features into each stage of the deep supervision network proposed in the previous section (see Figure 2). Furthermore, to consolidate context information we fuse the feature maps across scale levels and depth levels. We carefully design the fusion mechanism to keep our network compact: at each stage, the feature map at one scale is projected to the neighbouring scale and concatenated with the feature map there. The process starts from the largest scale and goes downward to the smallest, after which we perform upward fusion starting from the smallest scale. The final architecture of P-MSDSNet is shown in Figure 2.

Formally, the inference at each stage is as follows:


for m = 1, …, M and s = 1, …, S, where each stage applies the deeply supervised attention of (1)–(3) at scale s, taking as input the feature map generated by fusing scale s with its neighbouring scales.

3.3 Learning

As a multi-stage network, P-MSDSNet combines the final loss with the deep supervision losses at all stages. In general, the total loss function is:

L(Θ, Θ_d, Θ_o) = L_out(Θ, Θ_o) + α · Σ_{m=1}^{M} Σ_{s=1}^{S} β_m · γ_s · L_ds^{(m,s)}(Θ, Θ_d)    (8)

where Θ denotes the parameters of the convolutional blocks and the attention modules, Θ_d the parameters of the deep supervision, and Θ_o the parameters of the final prediction layers. In (8), α, β_m (m = 1, …, M) and γ_s (s = 1, …, S) are balance weights: α controls the overall effect of the deep supervision, β_m balances the deep supervision at different depth levels, and γ_s balances the deep supervision at different scale levels. In [lee2015deeply], a decay function is applied to the deep supervision weight to reduce its effect gradually during learning. In our application of P-MSDSNet to hand keypoint detection, for simplicity we use fixed values for these weights, as discussed in the next section.
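A small numpy sketch of this weighted combination, with illustrative names (alpha for the overall deep supervision weight, betas and gammas for the per-depth and per-scale weights) and each individual loss taken as a plain MSE:

```python
import numpy as np

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def total_loss(final_pred, final_target, ds_preds, ds_targets,
               alpha=0.1, betas=None, gammas=None):
    """Weighted combination of the final loss and deep supervision losses.

    ds_preds / ds_targets are nested lists indexed [stage][scale].
    alpha scales all deep supervision terms; betas and gammas weight
    individual depth and scale levels (uniform when omitted).
    """
    M, S = len(ds_preds), len(ds_preds[0])
    betas = betas if betas is not None else [1.0] * M
    gammas = gammas if gammas is not None else [1.0] * S
    ds = sum(betas[m] * gammas[s] * mse(ds_preds[m][s], ds_targets[m][s])
             for m in range(M) for s in range(S))
    return mse(final_pred, final_target) + alpha * ds

# Toy check: the final MSE is 1.0 and the single deep supervision MSE is 1.0
fp, ft = np.ones((2, 2)), np.zeros((2, 2))
dsp, dst = [[np.ones((2, 2))]], [[np.zeros((2, 2))]]
print(total_loss(fp, ft, dsp, dst, alpha=0.1))  # 1.1
```

Setting alpha to zero recovers a network trained with the final loss only, which is exactly the "without deep supervision" variant studied in the ablation.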

3.4 Implementation for Hand Keypoint Detection

3.4.1 Inputs and Labels

The input of P-MSDSNet for hand keypoint detection is a fixed-size image. The label is a heatmap with one channel per hand keypoint, each channel representing the position of that keypoint. To improve generality, we generate a Gaussian response heatmap for each channel, centred at the position of the corresponding keypoint. The Gaussian heatmaps are the final labels used for training.
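A minimal sketch of this label generation, assuming one Gaussian channel per keypoint; the spread sigma and function name are illustrative, since the exact value is not specified here:

```python
import numpy as np

def gaussian_heatmap(height, width, keypoints, sigma=2.0):
    """One Gaussian response channel per keypoint.

    keypoints: list of (x, y) positions; sigma is an assumed spread.
    Returns an array of shape (len(keypoints), height, width) with a
    peak of 1.0 at each keypoint position.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    maps = [np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
            for x, y in keypoints]
    return np.stack(maps)

hm = gaussian_heatmap(64, 64, [(10, 20), (40, 30)])
print(hm.shape)       # (2, 64, 64)
print(hm[0, 20, 10])  # 1.0 -- peak at the first keypoint
```

Regressing such soft heatmaps instead of single-pixel labels gives the network a smooth training signal around each keypoint.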

3.4.2 P-MSDSNet Architecture.

Our P-MSDSNet consists of 3 (or 6) stages. Five different scales (S = 5) are generated using a set of convolutional strides. More details of the P-MSDSNet architecture for hand keypoint detection can be found in the Supplementary Material.

3.4.3 Loss Function.

Mean Square Error (MSE) is employed to calculate the loss function (8). The depth and scale balance weights are set to fixed values. In the experiments, we show that good performance can be achieved with a small overall deep supervision weight; we also investigate its effect in the ablation study.

At each stage, we use the deeply supervised features to predict the keypoint heatmap at the corresponding scale. The deep supervision loss is the mean square error between this prediction and the resized keypoint heatmap, generated by applying bilinear interpolation to the ground truth heatmap. Since interpolation is only an approximation, resizing the keypoint heatmap to smaller scales would reduce the effectiveness of deep supervision. Therefore, we only include the deep supervision losses at the larger scales, i.e. those corresponding to the smallest stride sizes.

4 Experiments

4.1 Datasets

We conducted experiments on three hand datasets: the CMU Panoptic dataset [simon2017hand], the OneHand10K dataset [wang2018mask] and the HGR1 dataset [dadashzadeh2019hgr]. The CMU dataset is a synthetic hand dataset with 7,715 images of varying sizes. The HGR1 dataset has 899 images of varying sizes. Images from both datasets have relatively consistent hand gestures and backgrounds. The OneHand10K dataset has 10,000 images for training and 1,703 images for testing, with random sizes; its images have relatively inconsistent hand gestures and complex backgrounds. Figure 3 shows sample images from the datasets.

Figure 3: Sample images from CMU Panoptic (Top Row), OneHand10K (Middle Row), and HGR1 (Bottom Row).

4.2 Metrics

4.2.1 Percentage of Correct Keypoints (PCK)

PCK measures the percentage of correctly predicted keypoints out of all keypoints. A keypoint is considered correct if the Euclidean distance between the predicted and the true keypoint position is within a certain threshold (in pixels).


PCK@t = 1/(N·K) · Σ_{i=1}^{N} Σ_{k=1}^{K} 1( ||p̂_{i,k} − p_{i,k}|| ≤ t )

where N is the number of images, K is the number of types of hand keypoints, p̂_{i,k} is the predicted position of the k-th hand keypoint in the i-th image, p_{i,k} is its true position, t is the threshold set in advance, and 1(·) is the indicator function.
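The metric translates directly into a few lines of numpy (array shapes are illustrative):

```python
import numpy as np

def pck(pred, true, threshold):
    """Percentage of Correct Keypoints.

    pred, true: arrays of shape (N, K, 2) holding (x, y) positions for
    N images and K keypoint types; threshold is in pixels.
    """
    dists = np.linalg.norm(pred - true, axis=-1)   # (N, K) Euclidean errors
    return float(np.mean(dists <= threshold))

pred = np.array([[[0.0, 0.0], [10.0, 0.0]]])   # 1 image, 2 keypoints
true = np.array([[[3.0, 4.0], [10.0, 0.0]]])   # first keypoint is 5 px off
print(pck(pred, true, threshold=5))  # 1.0 -- both within 5 px
print(pck(pred, true, threshold=4))  # 0.5
```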

4.2.2 Mean Per Joint Position Error (MPJPE)

MPJPE measures the mean per joint position error over all keypoints that are correctly predicted at a given threshold.


MPJPE@t = 1/K · Σ_{k=1}^{K} ( 1/N_k · Σ_{i ∈ C_k} ||p̂_{i,k} − p_{i,k}|| )

where C_k is the set of images in which the k-th hand keypoint is correctly detected (based on the threshold t), N_k is the number of such images, K is the number of types of hand keypoints, and p̂_{i,k} and p_{i,k} are the predicted and true positions of the k-th keypoint in the i-th image.
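Analogously, MPJPE restricted to correctly detected keypoints can be sketched as follows; for brevity this version pools all correct detections rather than averaging per joint, a simplification:

```python
import numpy as np

def mpjpe(pred, true, threshold):
    """Mean position error over correctly detected keypoints.

    Averages the Euclidean error only over predictions within the
    threshold, and also returns how many keypoints qualified.
    """
    dists = np.linalg.norm(pred - true, axis=-1)   # (N, K)
    correct = dists <= threshold
    n = int(correct.sum())
    return (float(dists[correct].mean()) if n else float("nan")), n

pred = np.array([[[0.0, 0.0], [10.0, 0.0]]])
true = np.array([[[3.0, 4.0], [10.0, 1.0]]])       # errors of 5 px and 1 px
err, n = mpjpe(pred, true, threshold=5)
print(err, n)  # 3.0 2
```

Reporting the count alongside the error matters: a method can post a low MPJPE simply by detecting fewer, easier keypoints, which is why the tables report both.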

4.3 Results

An Adam optimizer [kingma2014adam] was used as the stochastic optimization strategy with a fixed learning rate. Each dataset was partitioned into training, validation and testing sets in a fixed ratio. We ran each model several times and report the average performance with its standard deviation.

For comparison, we used HigherHRNet [cheng2020higherhrnet], YOLSE [wu2017yolse], MSBFCN [yang2020msb], U-Net [ronneberger2015u] and the Stacked Hourglass model [newell2016stacked]. The number of parameters for the models in the experiment is shown in Table 1, which highlights that the 3-stage P-MSDSNet has fewer parameters than the other models listed.

Methods Number of Parameters
P-MSDSNet-(3 stages) 2.8M
P-MSDSNet-(6 stages) 5M
HigherHRNet 3.4M
U-Net 4M
Hourglass 8.3M
Table 1: Number of parameters

Figure 4 shows the PCK curves of all six models on the three datasets. P-MSDSNet achieved the highest performance on all three datasets. Hourglass is the second-best model; however, it is almost three times the size of the 3-stage P-MSDSNet. YOLSE has a similar number of parameters to P-MSDSNet but achieves the worst results among all models.

Figure 4: Different models’ Percentage of Correct Keypoints (PCK) on CMU Panoptic Testing Dataset (Left), OneHand10K Dataset (Mid) and HGR1 Testing Dataset (Right).

MPJPE is calculated over the correctly detected keypoints, a number that varies across methods. Table 2 shows the consistent superiority of P-MSDSNet, which achieves low MPJPEs across datasets. Although the MPJPE of P-MSDSNet is not the lowest in some cases, P-MSDSNet detects the most correct keypoints at every threshold. Although PCK is more popular, for completeness it is reasonable to use both PCK and MPJPE (together with the number of correctly detected keypoints) for evaluation. From the PCK and MPJPE results, it is evident that P-MSDSNet is a competitive model for hand keypoint detection.

Methods MPJPE@5pixels MPJPE@10pixels MPJPE@20pixels
CMU Testing Dataset
P-MSDSNet-(6 stages) 2.19±0.02 (9,347) 3.06±0.02 (11,371) 4.21±0.05 (12,674)
HigherHRNet 2.78±0.02 (5,175) 4.20±0.02 (7,716) 6.21±0.08 (9,606)
YOLSE 3.02±0.03 (3,261) 4.77±0.13 (5,555) 7.26±0.61 (7,341)
MSBFCN 2.88±0.02 (5,492) 4.58±0.07 (9,124) 6.33±0.12 (11,244)
U-Net 2.58±0.05 (4,484) 3.85±0.09 (6,225) 5.86±0.12 (7,666)
Hourglass 2.39±0.02 (7,483) 3.44±0.09 (9,679) 4.97±0.03 (11,217)
OneHand10K Testing Dataset
P-MSDSNet-(6 stages) 1.40±0.07 (16,262) 2.50±0.03 (20,589) 3.96±0.07 (23,195)
HigherHRNet 1.86±0.06 (9,408) 3.91±0.06 (15,097) 6.39±0.10 (19,905)
YOLSE 2.27±0.09 (3,962) 5.02±0.10 (8,359) 8.63±0.13 (13,654)
MSBFCN 0.89±0.23 (11,248) 2.52±0.50 (15,139) 4.27±0.74 (18,002)
U-Net 1.17±0.22 (11,198) 2.11±0.29 (13,352) 3.48±3.48 (15,006)
Hourglass 1.40±0.01 (13,584) 2.4±0.01 (16,507) 3.64±0.01 (18,419)
HGR1 Testing Dataset
P-MSDSNet-(6 stages) 1.81±0.03 (3,645) 2.29±0.05 (4,039) 2.54±0.04 (4,132)
HigherHRNet 2.33±0.03 (2,624) 3.42±0.09 (3,457) 4.12±0.13 (3,718)
YOLSE 1.79±0.25 (2,957) 2.28±0.48 (3,284) 2.88±0.52 (3,458)
MSBFCN 2.25±0.09 (2,268) 3.70±0.12 (3,257) 4.41±0.15 (3,531)
U-Net 2.49±0.02 (2,551) 3.79±0.04 (3,628) 4.46±0.06 (3,901)
Hourglass 1.95±0.06 (3,232) 2.59±0.08 (3,722) 3.10±0.14 (3,896)
Table 2: Different models’ Mean Per Joint Position Error (MPJPE) for correctly predicted keypoints at different thresholds. The average number of correctly detected keypoints (based on a particular threshold) is reported in parentheses.

Furthermore, Figure 5 demonstrates that the deep supervision-based spatial attention maps help guide the network to propagate information around the keypoint areas. With its modular structure, i.e. stacking one stage on top of another, P-MSDSNet refines the attention step by step. The attention maps offer a means to monitor the learning at each stage; with this mechanism, we can ensure that only the relevant features (around the hand areas) get through to the final prediction step. This idea is inspired by gating techniques [Hochreiter_1997].

Figure 5: Visualization of P-MSDSNet’s deep supervision-based spatial attention maps at each stage and of the final predictions (columns: Original, Stage 1, Stage 2, Stage 3, Final Prediction, Ground Truth).

5 Ablation Study

To better understand 1) the different ways to fuse multi-scale features, 2) the effectiveness of the deep supervision spatial attention module and 3) the balance between the final prediction loss and the intermediate losses, we present several ablation studies. All ablation experiments were conducted on the same hand datasets as in the previous section.

5.1 Scaling Fusion and Effectiveness of Deep Supervision

P-MSDSNet has two main novelties: the upscale-downscale cyclical mechanism for fusing multi-scale features and the deep supervision-based spatial attention module. We conducted the ablation study by changing the multi-scale feature fusion mechanism to Upscale Only or Downscale Only, and by removing the deep supervision module. Figure 6 shows the PCK and Table 3 the MPJPE.

Figure 6: Ablation study of PCK performance on CMU Panoptic Testing Dataset (Left), OneHand10K Dataset (Mid) and HGR1 Testing Dataset (Right).
Methods MPJPE@5pixels MPJPE@10pixels MPJPE@20pixels
CMU Testing Dataset
P-MSDSNet-(6 stages) 2.19±0.02 (9,347) 3.06±0.02 (11,371) 4.21±0.05 (12,674)
P-MSDSNet-(3 stages) 2.48±0.02 (7,853) 3.72±0.05 (9,933) 5.49±0.05 (11,442)
HigherHRNet 2.78±0.02 (5,175) 4.20±0.02 (7,716) 6.21±0.08 (9,606)
YOLSE 3.02±0.03 (3,261) 4.77±0.13 (5,555) 7.26±0.61 (7,341)
MSBFCN 2.88±0.02 (5,492) 4.58±0.07 (9,124) 6.33±0.12 (11,244)
U-Net 2.58±0.05 (4,484) 3.85±0.09 (6,225) 5.86±0.12 (7,666)
Hourglass 2.39±0.02 (7,483) 3.44±0.09 (9,679) 4.97±0.03 (11,217)
OneHand10K Testing Dataset
P-MSDSNet-(6 stages) 1.40±0.07 (16,262) 2.50±0.03 (20,589) 3.96±0.07 (23,195)
P-MSDSNet-(3 stages) 1.40±0.02 (14,567) 2.56±0.04 (18,171) 4.06±0.08 (20,759)
HigherHRNet 1.86±0.06 (9,408) 3.91±0.06 (15,097) 6.39±0.10 (19,905)
YOLSE 2.27±0.09 (3,962) 5.02±0.10 (8,359) 8.63±0.13 (13,654)
MSBFCN 0.89±0.23 (11,248) 2.52±0.50 (15,139) 4.27±0.74 (18,002)
U-Net 1.17±0.22 (11,198) 2.11±0.29 (13,352) 3.48±3.48 (15,006)
Hourglass 1.40±0.01 (13,584) 2.4±0.01 (16,507) 3.64±0.01 (18,419)
HGR1 Testing Dataset
P-MSDSNet-(6 stages) 1.81±0.03 (3,645) 2.29±0.05 (4,039) 2.54±0.04 (4,132)
P-MSDSNet-(3 stages) 1.99±0.02 (3,278) 2.64±0.07 (3,747) 3.07±0.10 (3,900)
HigherHRNet 2.33±0.03 (2,624) 3.42±0.09 (3,457) 4.12±0.13 (3,718)
YOLSE 1.79±0.25 (2,957) 2.28±0.48 (3,284) 2.88±0.52 (3,458)
MSBFCN 2.25±0.09 (2,268) 3.70±0.12 (3,257) 4.41±0.15 (3,531)
U-Net 2.49±0.02 (2,551) 3.79±0.04 (3,628) 4.46±0.06 (3,901)
Hourglass 1.95±0.06 (3,232) 2.59±0.08 (3,722) 3.10±0.14 (3,896)
Table 3: Ablation study of MPJPE for correctly predicted keypoints at different thresholds on the CMU Panoptic, OneHand10K and HGR1 testing datasets. The average number of correctly detected keypoints (based on a particular threshold) is reported in parentheses.
Figure 7: Balancing weights between intermediate losses and final loss

These show that P-MSDSNet (using the downscale-upscale cyclical pattern for feature fusion) outperforms the variants using Upscale Only or Downscale Only. Downscale Only performs worst because information flows from the large scale (where the prediction is made) to the small scales without turning back, meaning that the small-scale features do not contribute to the final predictions. Compared with Upscale Only, the downscale-upscale cyclical pattern fuses multi-scale and multi-depth information better, particularly on the more complex datasets (OneHand10K and CMU).

In addition, P-MSDSNet outperforms the variant without deep supervision in terms of PCK and MPJPE, and the gap widens as the dataset becomes more complex. This shows that the deep supervision-based spatial attention module is effective and improves hand keypoint detection performance.

5.2 Loss Balancing

We also studied the effect of the overall deep supervision weight in (8) on model performance; it controls the balance between the intermediate losses and the final prediction loss. Figure 7 shows the PCK performance and Table 4 the MPJPE performance for different values of this weight.

Methods MPJPE@5pixels MPJPE@10pixels MPJPE@20pixels
CMU Testing Dataset
2.48±0.02 (5,984) 3.72±0.05 (8,137) 5.49±0.05 (9,755)
2.30±0.16 (7,853) 3.30±0.10 (9,933) 4.76±0.16 (11,442)
2.32±0.01 (7,380) 3.36±0.09 (9,430) 4.93±0.12 (10,984)
2.42±0.01 (6,898) 3.55±0.03 (9,079) 5.20±0.03 (10,708)
OneHand10K Testing Dataset
1.40±0.02 (13,926) 2.56±0.04 (17,478) 4.06±0.08 (20,055)
1.37±0.06 (13,937) 2.51±0.11 (17,383) 4.02±0.17 (19,910)
1.48±0.01 (14,567) 2.59±0.01 (18,171) 4.05±0.08 (20,759)
1.39±0.01 (13,736) 2.55±0.03 (17,250) 4.08±0.06 (19,831)
HGR1 Testing Dataset
1.99±0.02 (3,286) 2.64±0.07 (3,801) 3.07±0.10 (3,957)
1.93±0.03 (3,344) 2.54±0.06 (3,822) 2.93±0.11 (3,961)
1.93±0.03 (3,278) 2.54±0.04 (3,747) 2.99±0.08 (3,900)
1.94±0.01 (3,356) 2.55±0.05 (3,837) 2.93±0.04 (3,972)
Table 4: Effect of the overall deep supervision weight in (8) on the performance.

The allocation of weight between the final prediction loss and the intermediate losses does affect model performance: the best-performing setting differs between HGR1/OneHand10K and CMU. Allocating a smaller weight to the intermediate losses during training may help improve P-MSDSNet’s performance.

6 Application

In several areas of neuroscience, there is an urgent need to precisely quantify hand movements. For example, people with Parkinson’s have slower and less rhythmic hand movements, but there are no accessible methods to quantify these accurately. Currently, finger tapping tests are used to assess the progression of disease as well as the response to new drugs. The test involves patients repetitively opening and closing their index finger and thumb against each other ten times, while a neuroscientist or clinician visually assesses how fast and rhythmically they move and then assigns a subjective score of 0-4 (where 0 is normal and 4 is highly abnormal). Wearable sensors have usually been used to track the finger movements, but they are inconvenient and infeasible for data collection at a large scale. We apply P-MSDSNet to detect and track finger keypoints in real time from a normal laptop webcam (30 FPS). Before applying P-MSDSNet to estimate finger tapping frequency, we evaluated the model on a dataset of 220 ten-second videos of 16 participants performing finger tapping tests. The thumb tip and index fingertip are the two hand keypoints to be detected. We then applied the trained P-MSDSNet to calculate the finger tapping frequency from the distance between the thumb tip and index fingertip. Figure 8 shows that the tracking from P-MSDSNet is consistently similar to that from wearable sensors, which implies the reliability of P-MSDSNet for large-scale finger tapping tests.

Figure 8: Application of P-MSDSNet compared to wearable sensors for Neuroscience study. Fingertips detection result (Left) and the finger tapping frequency calculated from the detection of fingertips (Right).
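The frequency-estimation step is not specified in detail above; one plausible sketch takes the dominant FFT component of the thumb-index distance signal (the function name, the lack of smoothing, and the 30 FPS rate are illustrative assumptions):

```python
import numpy as np

def tapping_frequency(thumb, index, fps=30.0):
    """Estimate finger tapping frequency (Hz) from fingertip tracks.

    thumb, index: arrays of shape (T, 2) with per-frame (x, y)
    positions. Uses the dominant FFT component of the thumb-index
    distance signal; a real pipeline would smooth and detrend first.
    """
    dist = np.linalg.norm(thumb - index, axis=-1)
    dist = dist - dist.mean()                       # remove the DC offset
    spectrum = np.abs(np.fft.rfft(dist))
    freqs = np.fft.rfftfreq(len(dist), d=1.0 / fps)
    return float(freqs[np.argmax(spectrum[1:]) + 1])  # skip the 0 Hz bin

# Synthetic 10-second recording with taps at 2 Hz
t = np.arange(300) / 30.0
thumb = np.zeros((300, 2))
index = np.stack([20 + 10 * np.sin(2 * np.pi * 2.0 * t),
                  np.zeros_like(t)], axis=-1)
f = tapping_frequency(thumb, index)  # close to 2.0 Hz
```

Peak counting on the distance signal would serve equally well; the FFT variant is shown because it is robust to small per-frame detection jitter.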

7 Conclusions and Future work

We presented P-MSDSNet and evaluated it on the CMU Panoptic [simon2017hand], OneHand10K [wang2018mask] and HGR1 [dadashzadeh2019hgr] datasets, where it outperformed several comparable models in terms of the PCK and MPJPE metrics. Additionally, we applied P-MSDSNet to a real-life finger tapping test in the neuroscience domain, showcasing its applicability.

For future work, we aim to make P-MSDSNet fully recursive, with parameters shared among all stages. We will also develop an adaptive algorithm to determine the optimal depth (number of stages) of the network.