Preterm infants' limb-pose estimation from depth images using convolutional neural networks

07/26/2019 ∙ by Sara Moccia, et al. ∙ UnivPM 6

Preterm infants' limb-pose estimation is a crucial but challenging task, which may improve patients' care and help clinicians monitor infants' movements. Work in the literature either provides approaches to whole-body segmentation and tracking, which, however, have poor clinical value, or retrieves limb pose a posteriori from limb segmentation, increasing computational costs and introducing sources of inaccuracy. In this paper, we address the problem of limb-pose estimation from a different point of view. We propose a 2D fully-convolutional neural network for roughly detecting limb joints and joint connections, followed by a regression convolutional neural network for accurate joint and joint-connection position estimation. Joints from the same limb are then connected with a maximum bipartite matching approach. Our analysis requires neither prior modeling of infants' body structure nor manual intervention. For developing and testing the proposed approach, we built a dataset of four videos (video length = 90 s) recorded with a depth sensor in a neonatal intensive care unit (NICU) during actual clinical practice, achieving a median root mean square distance [pixels] of 10.790 (right arm), 10.542 (left arm), 8.294 (right leg), 11.270 (left leg) with respect to the ground-truth limb pose. The idea of estimating limb pose directly from depth images may represent a future paradigm for addressing the problem of preterm infants' movement monitoring and may support clinicians in NICUs.




I Introduction

Preterm birth may affect infants’ anatomical and functional development, leading to lifelong morbidity or, in the worst-case scenario, mortality. Monitoring preterm infants is crucial to detect the onset of short- and long-term complications [1], and cribs in neonatal intensive care units (NICUs) are commonly equipped with a large variety of medical monitoring devices.

The movement of preterm infants is a strong clinical predictor to diagnose brain lesions [2], cognitive dysfunction [3], sleep disorders [4] and pain [5]. Clinicians pay particular attention to involuntary movements, consisting of asymmetrical and irregular banging of limb extremities (e.g., twitching and jerking) [6]. Despite being recognized as a crucial clinical task, preterm infants’ movement evaluation is merely qualitative and episodic, and mostly based on clinicians’ (i) assessment at the crib side in NICUs or (ii) review of infants’ video recordings. Besides being time-consuming, this evaluation may be prone to inaccuracies due to clinicians’ fatigue and susceptible to intra- and inter-clinician variability [7].

Fig. 1: Infant model. LS and RS: left and right shoulder, LE and RE: left and right elbow, LW and RW: left and right wrist, LH and RH: left and right hip, LK and RK: left and right knee, LA and RA: left and right ankle.
Fig. 2: Depth-image acquisition setup. The setup does not hinder health-operator movements.
Fig. 3: Dataset challenges include different distances between camera and infants, varying illumination levels, limb self-occlusion, and different numbers of visible joints in the camera field of view.

Some promising computer-assisted approaches have been proposed to support clinicians in detecting infants’ movement from clinical devices (e.g., accelerometer, photoplethysmograph and force sensors) [8] and multimedia data (audio and video) [1, 9, 10]. With respect to intrusive clinical devices, RGB-D cameras can be easily integrated into standard clinical monitoring setup (e.g., over infants’ cage) while not hindering infants’ and health operators’ movements. Promising results have been achieved in the literature for whole-body detection as a prior for infants’ movement analysis. In [11, 12], threshold-based approaches to whole-body movement detection using an RGB-D camera are proposed. In [13], optical flow and statistical classifiers are used to track manually-defined body points from RGB images.

Fig. 4: Sample detection results. First row: ground-truth (blue) and achieved (green) joint detection. Second row: ground-truth (blue) and achieved (purple) joint-connection detection.

Name                                  Kernel (Size / Stride)   Channels

Downsampling path
Input                                 -                        1
Convolutional layer - Common branch   3x3 / 1x1                64
Block 1 - Branch 1                    2x2 / 2x2                64
                                      3x3 / 1x1                64
Block 1 - Branch 2                    2x2 / 2x2                64
                                      3x3 / 1x1                64
Block 1 - Common branch               1x1 / 1x1                128
Block 2 - Branch 1                    2x2 / 2x2                128
                                      3x3 / 1x1                128
Block 2 - Branch 2                    2x2 / 2x2                128
                                      3x3 / 1x1                128
Block 2 - Common branch               1x1 / 1x1                256
Block 3 - Branch 1                    2x2 / 2x2                256
                                      3x3 / 1x1                256
Block 3 - Branch 2                    2x2 / 2x2                256
                                      3x3 / 1x1                256
Block 3 - Common branch               1x1 / 1x1                512
Block 4 - Branch 1                    2x2 / 2x2                512
                                      3x3 / 1x1                512
Block 4 - Branch 2                    2x2 / 2x2                512
                                      3x3 / 1x1                512
Block 4 - Common branch               1x1 / 1x1                1024

Upsampling path
Block 5 - Branch 1                    2x2 / 2x2                256
                                      3x3 / 1x1                256
Block 5 - Branch 2                    2x2 / 2x2                256
                                      3x3 / 1x1                256
Block 5 - Common branch               1x1 / 1x1                512
Block 6 - Branch 1                    2x2 / 2x2                128
                                      3x3 / 1x1                128
Block 6 - Branch 2                    2x2 / 2x2                128
                                      3x3 / 1x1                128
Block 6 - Common branch               1x1 / 1x1                256
Block 7 - Branch 1                    2x2 / 2x2                64
                                      3x3 / 1x1                64
Block 7 - Branch 2                    2x2 / 2x2                64
                                      3x3 / 1x1                64
Block 7 - Common branch               1x1 / 1x1                128
Block 8 - Branch 1                    2x2 / 2x2                32
                                      3x3 / 1x1                32
Block 8 - Branch 2                    2x2 / 2x2                32
                                      3x3 / 1x1                32
Block 8 - Common branch               1x1 / 1x1                64
Output                                1x1 / 1x1                20

TABLE I: Detection-network architecture. Starting from the input depth image (1 channel), the network generates 20 maps (12 confidence maps for limb joints, and 8 affinity fields for joint connections).

Name      Kernel (Size / Stride)   Channels

Layer 1   3x3 / 1x1                64
Layer 2   3x3 / 1x1                128
Layer 3   3x3 / 1x1                256
Layer 4   3x3 / 1x1                256
Layer 5   3x3 / 1x1                256
Output    1x1 / 1x1                20

TABLE II: Regression-network architecture. The network is fed with the depth image (1 channel) stacked with the 20 output masks of the detection network, and produces 20 regression maps (12 for joints and 8 for connections).

However, as explained in [6], single-limb movement should be evaluated to verify the presence of cerebral illnesses in preterm babies. An approach to limb-specific movement detection is proposed in [14]. It exploits temporal tracking with particle filtering integrated with limb-trajectory priors that, however, have to be manually identified by users, hampering the usability of the approach in actual monitoring practice. In [15], histograms of oriented gradients are used as features to retrieve infants’ body skeleton. Body limbs and joints are retrieved a posteriori using pre-defined body-part templates.

A different strategy is proposed in [16], where a deep-learning approach directly estimates limb joints, with advantages such as reduced computational time. In particular, two CNNs are used for pedestrian limb-pose estimation: the first one (a detection fully convolutional neural network, FCNN) to retrieve joint probability maps and the second one (a regression CNN) to refine the joint-position estimates.

Inspired by [16], in this paper we propose to use the same strategy to estimate preterm infants’ limb pose from images acquired in NICUs during actual clinical practice. In particular, we focus our analysis on depth images, following recent considerations related to infants’ privacy issues [17, 18].

This paper is organized as follows: Sec. II presents the infants’ pose-estimation approach. The evaluation protocol and the image dataset built to test the proposed approach are presented in Sec. III. Results are presented in Sec. IV and discussed in Sec. V. Sec. VI concludes this paper by summarizing the main achievements of this research.

II Methods

Our infant model considers each of the 4 limbs as a set of three connected joints (i.e., wrist, elbow and shoulder for arms and ankle, knee and hip for legs), as shown in Fig. 1. To estimate limb pose, we exploit two consecutive CNNs, one for detecting joints and joint connections (Sec. II-A), the other for regressing the joint position, exploiting both the joint probability and joint-connection maps, with the latter acting as guidance for joint linking (Sec. II-B). The joints belonging to the same limb are then connected using bipartite graph matching (Sec. II-C).

II-A Detection network

To develop our detection FCNN, we perform multiple binary-detection operations (considering each joint and joint-connection separately) to solve possible ambiguities of multiple joints and joint connections that may cover the same image portion (e.g. in case of limb self-occlusion). For each video frame, we generate 20 separate ground-truth binary detection maps: 12 for the joints and 8 for the joint connections (instead of generating a single ground-truth mask with 20 different annotations, which has been shown to perform less reliably) [16]. The detection network provides joint and joint-connection confidence maps as output of the joint and joint-connection branches, respectively.

For every joint mask, we consider a region of interest consisting of all pixels that lie in the circle of a given radius (r_d) centered at the joint center [19]. A similar approach is used to generate the ground truth for the joint connections. In this case, the ground truth is the rectangular region with thickness r_d and centrally aligned with the joint-connection line.
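As a minimal sketch of how such binary ground-truth masks could be rasterized (not the authors' annotation code), the NumPy snippet below builds a circular joint mask and a rectangular connection mask. The function names `joint_mask` and `connection_mask`, the joint coordinates, and the reading of "thickness" as the total width of the rectangle are assumptions made for illustration; the 6-pixel value and the 128x96 image size are taken from the training settings.

```python
import numpy as np

def joint_mask(shape, center, radius):
    """Binary mask: 1 inside the circle of the given radius around center (x, y)."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    cx, cy = center
    return ((xx - cx) ** 2 + (yy - cy) ** 2 <= radius ** 2).astype(np.uint8)

def connection_mask(shape, p0, p1, thickness):
    """Binary mask: 1 inside the rectangle centered on segment p0-p1 (points are (x, y)).

    Assumption: `thickness` is the total width of the rectangle.
    """
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    p0 = np.asarray(p0, dtype=float)
    p1 = np.asarray(p1, dtype=float)
    d = p1 - p0
    length = np.hypot(d[0], d[1])
    if length == 0:
        return np.zeros(shape, dtype=np.uint8)
    u = d / length                           # unit vector along the connection
    rx, ry = xx - p0[0], yy - p0[1]
    along = rx * u[0] + ry * u[1]            # position along the segment
    across = np.abs(rx * u[1] - ry * u[0])   # perpendicular distance to the line
    inside = (along >= 0) & (along <= length) & (across <= thickness / 2)
    return inside.astype(np.uint8)

# Hypothetical joint positions on a 128x96 (width x height) image.
shape = (96, 128)
elbow, wrist = (40, 50), (70, 50)
jm = joint_mask(shape, elbow, radius=6)
cm = connection_mask(shape, elbow, wrist, thickness=6)
```

Repeating this for the 12 joints and 8 connections would give the 20 separate binary maps per frame described above.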

Our architecture (Table I) is inspired by the classic encoder-decoder architecture of U-Net [20], with 8 blocks that follow the input and common-branch convolutional layers and are followed by an output layer. Each block is divided into two branches (for joints and connections). The outputs of the two branches in a block are then concatenated into a single output prior to entering the next block. Using a bi-branch architecture has been shown to provide higher detection performance, as it allows processing the joint-probability and joint-connection affinity maps separately [16]. Batch normalization and activation with the rectified linear unit (ReLU) are performed after each convolution.

Our FCNN is trained using the per-pixel binary cross-entropy as loss function, and the adaptive moment estimation (Adam) as optimizer.

II-B Regression network

Similarly to what is done for the detection FCNN, for every joint we consider a region of interest consisting of all pixels that lie in the circle with radius r_d (the radius used for the detection masks) centered at the joint center. In this case, instead of binary masking the circle area as for the detection FCNN, we consider a Gaussian distribution with standard deviation (σ) equal to 3*r_d and centered at the joint center. A similar approach is used to generate the ground-truth masks for the joint connections. In this case, the ground-truth mask is the rectangular region with thickness r_d and centrally aligned with the joint-connection line. Pixel values in the mask are 1-D Gaussian distributed (with standard deviation σ) along the connection direction.
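The Gaussian joint target described above can be sketched as follows. This is an illustrative reading rather than the authors' code: the function name `gaussian_joint_map` and the joint position are hypothetical, the map is normalized so that the peak equals 1, and the value `sigma=18.0` assumes that the standard deviation is three times the 6-pixel radius reported in the training settings.

```python
import numpy as np

def gaussian_joint_map(shape, center, sigma):
    """Regression target: isotropic 2-D Gaussian with peak value 1 at center (x, y)."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    cx, cy = center
    d2 = (xx - cx) ** 2 + (yy - cy) ** 2   # squared distance to the joint center
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Hypothetical left-ankle position on a 128x96 (width x height) image.
target = gaussian_joint_map((96, 128), center=(40, 50), sigma=18.0)
```

Unlike the binary detection mask, this target decreases smoothly away from the joint center, which is what lets the regression network refine the joint position.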

The regression network (Table II) has a single-branch architecture made of 5 layers, with an additional input and output layer. The network is fed with both the depth image and the output of the detection network, which consists of 12 joint confidence maps and 8 affinity fields for joint connections. The network then produces 20 maps, 12 for joints and 8 for joint connections. Batch normalization and activation with the rectified linear unit (ReLU) are performed after each convolution.

Our regression network is trained using the mean square error as loss function, and stochastic gradient descent as optimizer.

II-C Joint linking

The last step of our limb pose-estimation task is to link joints for each of the infants’ limbs. First, we identify joint candidates from the joint regression output maps using non-maximum suppression, which is an algorithm commonly used in computer vision when redundant candidates are present [21]. Once joint candidates are identified, they are linked exploiting the joint-connection regression maps. In particular, we use a bipartite matching approach, which consists in (i) computing the integral value along the line connecting two candidates on the joint-connection regression output map and (ii) choosing the two winning candidates as those guaranteeing the highest integral value.
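The two steps above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: `find_peaks`, `line_score` and `link` are hypothetical names, non-maximum suppression is reduced to a 3x3 local-maximum test with a fixed threshold, the line integral is approximated by sampling the map at 20 points, and the matching is reduced to picking the single best-scoring pair, which is enough when one connection links one candidate of each joint type.

```python
import numpy as np
from itertools import product

def find_peaks(heatmap, threshold=0.5):
    """Keep pixels above threshold that are the maximum of their 3x3 neighbourhood."""
    h, w = heatmap.shape
    padded = np.pad(heatmap.astype(float), 1, mode="constant",
                    constant_values=-np.inf)
    peaks = []
    for y in range(h):
        for x in range(w):
            v = heatmap[y, x]
            if v >= threshold and v == padded[y:y + 3, x:x + 3].max():
                peaks.append((x, y))
    return peaks

def line_score(conn_map, p0, p1, n_samples=20):
    """Approximate the integral of conn_map along the segment p0-p1."""
    xs = np.linspace(p0[0], p1[0], n_samples)
    ys = np.linspace(p0[1], p1[1], n_samples)
    cols = np.clip(np.rint(xs).astype(int), 0, conn_map.shape[1] - 1)
    rows = np.clip(np.rint(ys).astype(int), 0, conn_map.shape[0] - 1)
    length = np.hypot(p1[0] - p0[0], p1[1] - p0[1])
    return float(conn_map[rows, cols].mean() * length)

def link(cands_a, cands_b, conn_map):
    """Return the candidate pair with the highest line score."""
    return max(product(cands_a, cands_b),
               key=lambda pair: line_score(conn_map, pair[0], pair[1]))

# Synthetic maps: one elbow, a true wrist and a weaker spurious wrist response.
shape = (96, 128)
yy, xx = np.mgrid[0:shape[0], 0:shape[1]]

def blob(cx, cy):
    return np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / 8.0)

elbow_map = blob(40, 50)
wrist_map = np.maximum(blob(70, 50), 0.8 * blob(70, 20))
conn_map = np.zeros(shape)
conn_map[48:53, 40:71] = 1.0   # connection response between elbow and true wrist

elbows = find_peaks(elbow_map)
wrists = find_peaks(wrist_map)
best = link(elbows, wrists, conn_map)
```

In this synthetic case the spurious wrist candidate is rejected because the line joining it to the elbow crosses almost no connection response.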

Fig. 5: Boxplots of the Dice similarity coefficient (DSC) for (a) joint and (b) joint-connection detection achieved with the proposed fully-convolutional neural network.
Fig. 6: Sample results of the regression-network output for left ankle superimposed on the corresponding depth images.

III Experimental protocol

III-A Dataset

Videos of four preterm infants were acquired at the G. Salesi Hospital NICU in Ancona, Italy. The infants were identified by clinicians in the NICU. All infants were spontaneously breathing and did not present hydrocephalus, congenital defects or bronchopulmonary diseases. Written informed consent was obtained from the infants’ legal guardians. The video-acquisition setup is shown in Fig. 2.

Video recordings (length = 90 s) were acquired for every infant, using the Orbbec® Astra Mini S with a frame rate of 30 fps and image size of 640x480 pixels. For each video, the ground truth was manually obtained every 5 frames, resulting in 540 annotated frames per patient. Then, these 540 frames were split into training and testing data: 270 frames (45 s) were used for training and the remaining 270 frames to test the network, resulting in a training set and a testing set of 1080 frames each.
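The frame counts above follow directly from the acquisition parameters; the short check below reproduces the arithmetic (variable names are illustrative only).

```python
video_length_s = 90       # length of each recording [s]
fps = 30                  # sensor frame rate
annotation_stride = 5     # one annotated frame every 5 frames
n_infants = 4

frames_per_video = video_length_s * fps                         # 2700
annotated_per_infant = frames_per_video // annotation_stride    # 540
train_per_infant = annotated_per_infant // 2                    # 270 (45 s)
test_per_infant = annotated_per_infant - train_per_infant       # 270
train_total = train_per_infant * n_infants                      # 1080
test_total = test_per_infant * n_infants                        # 1080
```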

Challenges in the dataset included varying infant-camera distance (due to the mobility of the acquisition setup), varying illumination level, different numbers of visible joints and limb self-occlusion (Fig. 3).

III-B Training settings

Images were resized to 128x96 pixels in order to smooth noise and reduce both training time and memory requirements. Joint annotation was performed using a custom-built annotation tool, publicly available online. To build the ground-truth masks, we selected r_d, the radius used for the ground-truth masks, equal to 6 pixels.

For training the detection and regression network, we set an initial learning rate of 0.01 with a learning decay of 10% every 10 epochs, and a momentum of 0.98. We used a batch size of 16 and for both the networks the number of epochs was set to 100. We selected the best model as the one that maximized the accuracy on the validation set (training/validation split = 0.3).
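One possible reading of the learning-rate schedule is a step decay in which the rate is multiplied by 0.9 every 10 epochs. The sketch below encodes that interpretation; the function name `step_decay` is hypothetical, and the multiplicative reading of "10% decay" is an assumption, since the text does not give the formula.

```python
def step_decay(epoch, initial_lr=0.01, drop=0.10, epochs_per_drop=10):
    """Learning rate at a given (0-based) epoch, reduced by `drop` every `epochs_per_drop` epochs."""
    return initial_lr * (1.0 - drop) ** (epoch // epochs_per_drop)

# Learning rate over the 100 training epochs.
schedule = [step_decay(e) for e in range(100)]
```

A function with this signature could, for instance, be passed to the Keras `LearningRateScheduler` callback, although the text does not state how the decay was implemented.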

All our analyses were performed using Keras on an Nvidia GeForce GTX 1050 Ti/PCIe/SSE2.

Fig. 7: Boxplots of the root mean square distance (RMSD) computed for the four limbs separately.
Fig. 8: Sample pose-estimation results. Green: right-arm, red: left-arm, blue: right-leg, yellow: left-leg pose obtained with the proposed approach.

       Right arm              Left arm               Right leg              Left leg
DSC    0.813  0.798  0.778    0.823  0.843  0.837    0.849  0.858  0.792    0.758  0.863  0.847
Rec    0.690  0.672  0.637    0.708  0.734  0.726    0.752  0.761  0.664    0.661  0.770  0.743

TABLE III: Joint-detection performance in terms of median Dice similarity coefficient (DSC) and recall (Rec). The metrics are reported separately for each joint.

       Right arm       Left arm        Right leg       Left leg
DSC    0.851  0.817    0.818  0.850    0.888  0.838    0.803  0.861
Rec    0.760  0.706    0.703  0.750    0.826  0.744    0.679  0.768

TABLE IV: Joint-connection detection performance in terms of median Dice similarity coefficient (DSC) and recall (Rec). The metrics are reported separately for each joint connection.

        Right arm   Left arm   Right leg   Left leg
RMSD    10.790      10.542     8.294       11.270

TABLE V: Limb-pose estimation performance in terms of median root mean square distance (RMSD) computed with respect to ground-truth pose. The RMSD is reported separately for each limb.

III-C Performance metrics

To measure the performance of the detection FCNN, we computed the Dice similarity coefficient (DSC) and recall (Rec):

$$DSC = \frac{2\,TP}{2\,TP + FP + FN}$$

$$Rec = \frac{TP}{TP + FN}$$

where TP: true positive, FP: false positive, FN: false negative.
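For illustration, the two detection metrics can be computed per map from pixel counts using the standard definitions DSC = 2TP / (2TP + FP + FN) and Rec = TP / (TP + FN). The helper name `dice_and_recall` is hypothetical, and returning 1.0 when a denominator is zero (empty masks) is a convention assumed here, not one stated in the text.

```python
import numpy as np

def dice_and_recall(pred, truth):
    """Dice similarity coefficient and recall between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.count_nonzero(pred & truth)     # true positives
    fp = np.count_nonzero(pred & ~truth)    # false positives
    fn = np.count_nonzero(~pred & truth)    # false negatives
    dsc = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
    rec = tp / (tp + fn) if (tp + fn) else 1.0
    return dsc, rec

# Toy example: 2 true positives, 1 false positive, 1 false negative.
truth = [[1, 1, 0],
         [0, 1, 0]]
pred = [[1, 0, 0],
        [0, 1, 1]]
dsc, rec = dice_and_recall(pred, truth)
```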

To evaluate the overall pose estimation, we computed the root mean square distance (RMSD) [pixels] for each of the infants’ limbs.
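The exact formula for the root mean square distance is not given in the text. A common definition, assumed in the sketch below, is the square root of the mean squared Euclidean distance between estimated and ground-truth joint positions of a limb; the function name `limb_rmsd` and the coordinates are illustrative only.

```python
import numpy as np

def limb_rmsd(pred_joints, true_joints):
    """Root mean square distance [pixels] over the joints of one limb.

    Both inputs are sequences of (x, y) positions in the same joint order.
    """
    pred = np.asarray(pred_joints, dtype=float)
    true = np.asarray(true_joints, dtype=float)
    sq_dist = np.sum((pred - true) ** 2, axis=1)   # squared distance per joint
    return float(np.sqrt(sq_dist.mean()))

# Toy example: one joint is off by 5 pixels, the other two are exact.
pred = [(10, 10), (20, 20), (30, 30)]
true = [(13, 14), (20, 20), (30, 30)]
rmsd = limb_rmsd(pred, true)
```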

For both the detection and regression network, we measured the testing time.

IV Results

(a) Joint
(b) Connection
Fig. 9: Temporal evolution of joint position for each infants’ limb. Each color refers to a different limb.

Sample outputs of the detection FCNN are shown in Fig. 4, both for joints and joint connections. It is worth noting that even when some joints were occluded (e.g., due to plaster as in column 2 of the image, right leg) or were out of the field of view (column 4, left leg), the network correctly detected the others.

The boxplots for DSC, separately computed for joints and connections, are shown in Fig. 5. Median DSC and Rec for joints are shown in Table III. For DSC and Rec, the interquartile range (IQR) was always lower than 0.080 and 0.124, respectively. Median DSC and Rec for joint connections were evaluated too (Table IV). For DSC and Rec, IQR was always lower than 0.099 and 0.146, respectively. Detection time was on average 0.01 s per image.

Visual results for the regression-CNN output (left ankle) are shown in Fig. 6. The median RMSD values (for the reduced 128x96-pixel images) for pose estimation are shown in Table V. IQR was always lower than 4.760 pixels. Boxplots for RMSD are shown in Fig. 7.

Figure 8 shows visual pose-estimation results for the four infants’ limbs. The regression network and bipartite-matching algorithm together took 0.02 s on average. Figure 9 shows the temporal evolution of joint position for each of the infants’ limbs for two sample testing videos.

V Discussion

The proposed FCNN achieved similar results for the detection of all joints (i.e., no joint was detected markedly better than the others), reflecting the FCNN's ability to process the different joint probability maps in parallel. This is also visible from the visual results in Fig. 4, where the FCNN was able to correctly detect visible joints without being affected by occluded ones. The regression network provided guidance for the bipartite matching algorithm, which achieved satisfactory performance (median RMSD < 12 pixels) for all limbs. The overall methodology required 0.03 s per image, hence being compatible with real-time infants’ monitoring.

Our approach, despite some limitations (e.g., dataset dimensions and video length), overcame some of the drawbacks of the literature. In particular, it allowed direct estimation of limb-specific pose, while being computationally efficient and clinically relevant.

Future improvements of the proposed methodology may include: (i) the collection and annotation of a larger dataset (considering the lack of available datasets in this field), (ii) the analysis of temporal information (naturally encoded in videos) in both the detection and regression networks, as recently proposed in [22] and (iii) the integration of infant-specific measures already stored in electronic health records (e.g., height, limb length) to improve the limb-pose estimation. An accurate estimation may make it possible to retrieve useful hints for movement classification (e.g., following [23]) to support clinicians.

VI Conclusion

In this paper, we have proposed a deep-learning framework for 2D pose estimation of infants’ limbs in cages inside NICUs. The framework first performs a rough detection of limb-joint position via an FCNN, and then refines the detection exploiting a regression convolutional network, followed by bipartite matching to link joints belonging to the same limb. This work, to the best of our knowledge, represents a novel attempt to perform image-based infants’ limb-pose estimation and can potentially be extended to handle even more complex scenarios, where healthcare operators interact with infants.

With respect to state-of-the-art approaches, our work allows a direct estimation of limb-specific pose, is completely automatic and allows real-time processing. This makes it suitable for integration in the clinicians’ decision process and for supporting early diagnosis of brain and cognitive disorders from limb-movement analysis.


  • [1] F. Giuliani, L. Cheikh Ismail, E. Bertino, Z. A. Bhutta, E. O. Ohuma, I. Rovelli, A. Conde-Agudelo, J. Villar, and S. H. Kennedy, “Monitoring postnatal growth of preterm infants: present and future–3,” The American Journal of Clinical Nutrition, vol. 103, no. 2, pp. 635S–647S, 2016.
  • [2] F. Ferrari, G. Cioni, and H. Prechtl, “Qualitative changes of general movements in preterm infants with brain lesions,” Early Human Development, vol. 23, no. 3, pp. 193–231, 1990.
  • [3] C. Einspieler, A. F. Bos, M. E. Libertus, and P. B. Marschik, “The general movement assessment helps us to identify preterm infants at risk for cognitive dysfunction,” Frontiers in Psychology, vol. 7, p. 406, 2016.
  • [4] J. Werth, L. Atallah, P. Andriessen, X. Long, E. Zwartkruis-Pelgrim, and R. M. Aarts, “Unobtrusive sleep state measurements in preterm infants–A review,” Sleep Medicine Reviews, vol. 32, pp. 109–122, 2017.
  • [5] T. M. Heiderich, A. T. F. S. Leslie, and R. Guinsburg, “Neonatal procedural pain can be assessed by computer software that has good sensitivity and specificity to detect facial movements,” Acta Paediatrica, vol. 104, no. 2, 2015.
  • [6] D. Freymond, Y. Schutz, J. Decombaz, J.-L. Micheli, and E. Jéquier, “Energy balance, physical activity, and thermogenic effect of feeding in premature infants,” Pediatric Research, vol. 20, no. 7, p. 638, 1986.
  • [7] I. Bernhardt, M. Marbacher, R. Hilfiker, and L. Radlinger, “Inter-and intra-observer agreement of prechtl’s method on the qualitative assessment of general movements in preterm, term and young infants,” Early Human Development, vol. 87, no. 9, pp. 633–639, 2011.
  • [8] I. Zuzarte, P. Indic, D. Sternad, and D. Paydarfar, “Quantifying movement in preterm infants using photoplethysmography,” Annals of Biomedical Engineering, vol. 47, no. 2, pp. 646–658, 2019.
  • [9] S. Cabon, F. Poree, A. Simon, O. Rosec, P. Pladys, and G. Carrault, “Video and audio processing in paediatrics: a review,” Physiological Measurement, 2019.
  • [10] N. Hesse, S. Pujades, J. Romero, M. J. Black, C. Bodensteiner, M. Arens, U. G. Hofmann, U. Tacke, M. Hadders-Algra, R. Weinberger et al., “Learning an infant body model from RGB-D data for accurate full body motion analysis,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2018, pp. 792–800.
  • [11] A. Cenci, D. Liciotti, E. Frontoni, A. Mancini, and P. Zingaretti, “Non-contact monitoring of preterm infants using RGB-D camera,” in International Design Engineering Technical Conferences and Computers and Information in Engineering Conference.   American Society of Mechanical Engineers, 2015, pp. V009T07A003–V009T07A003.
  • [12] L. Adde, J. L. Helbostad, A. R. Jensenius, G. Taraldsen, and R. Støen, “Using computer-based video analysis in the study of fidgety movements,” Early Human Development, vol. 85, no. 9, pp. 541–547, 2009.
  • [13] A. Stahl, C. Schellewald, Ø. Stavdahl, O. M. Aamo, L. Adde, and H. Kirkerod, “An optical flow-based method to predict infantile cerebral palsy,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 20, no. 4, pp. 605–614, 2012.
  • [14] H. Rahmati, R. Dragon, O. M. Aamo, L. Adde, Ø. Stavdahl, and L. Van Gool, “Weakly supervised motion segmentation with particle matching,” Computer Vision and Image Understanding, vol. 140, pp. 30–42, 2015.
  • [15] M. Khan, M. Schneider, M. Farid, and M. Grzegorzek, “Detection of infantile movement disorders in video data using deformable part-based model,” Sensors, vol. 18, no. 10, p. 3202, 2018.
  • [16] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2D pose estimation using part affinity fields,” in Conference on Computer Vision and Pattern Recognition, 2017, pp. 7291–7299.
  • [17] A. Hernández-Vela, N. Zlateva, A. Marinov, M. Reyes, P. Radeva, D. Dimov, and S. Escalera, “Graph cuts optimization for multi-limb human segmentation in depth maps,” in IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2012, pp. 726–732.
  • [18] C. Zhang, Y. Tian, and E. Capezuti, “Privacy preserving automatic fall detection for elderly using rgbd cameras,” in International Conference on Computers for Handicapped Persons.   Springer, 2012, pp. 625–633.
  • [19] X. Du, T. Kurmann, P.-L. Chang, M. Allan, S. Ourselin, R. Sznitman, J. D. Kelly, and D. Stoyanov, “Articulated multi-instrument 2-D pose estimation using fully convolutional networks,” IEEE Transactions on Medical Imaging, vol. 37, no. 5, pp. 1276–1287, 2018.
  • [20] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2015, pp. 234–241.
  • [21] J. Hosang, R. Benenson, and B. Schiele, “Learning non-maximum suppression,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4507–4515.
  • [22] E. Colleoni, S. Moccia, X. Du, E. De Momi, and D. Stoyanov, “Deep learning based robotic tool detection and articulation estimation with spatio-temporal layers,” IEEE Robotics and Automation Letters, 2019.
  • [23] M. Capecci, M. G. Ceravolo, F. Ferracuti, M. Grugnetti, S. Iarlori, S. Longhi, L. Romeo, and F. Verdini, “An instrumental approach for monitoring physical exercises in a visual markerless scenario: A proof of concept,” Journal of Biomechanics, vol. 69, pp. 70–80, 2018.