Preterm birth may affect infants’ anatomical and functional development, leading to lifelong morbidity or, in the worst-case scenario, mortality. Monitoring preterm infants is crucial to detect the onset of short- and long-term complications, and cribs in neonatal intensive care units (NICUs) are commonly equipped with a large variety of medical monitoring devices.
The movement of preterm infants is a strong clinical predictor for diagnosing brain lesions [2], cognitive dysfunction [3], sleep disorders [4] and pain [5]. Clinicians pay particular attention to involuntary movements, consisting of asymmetrical and irregular banging of limb extremities (e.g., twitching and jerking). Despite being recognized as a crucial clinical task, the evaluation of preterm infants’ movement is merely qualitative and episodic, and mostly based on clinicians’ (i) assessment at the crib side in NICUs or (ii) review of infants’ video recordings. Besides being time-consuming, this evaluation may be prone to inaccuracies due to clinicians’ fatigue and susceptible to intra- and inter-clinician variability [7].
Some promising computer-assisted approaches have been proposed to support clinicians in detecting infants’ movement from clinical devices (e.g., accelerometers, photoplethysmographs and force sensors) and from multimedia data (audio and video) [1, 9, 10]. Unlike intrusive clinical devices, RGB-D cameras can be easily integrated into the standard clinical monitoring setup (e.g., above the infant’s crib) without hindering infants’ and health operators’ movements. Promising results have been achieved in the literature for whole-body detection as a prior for infants’ movement analysis. In [11, 12], threshold-based approaches to whole-body movement detection using an RGB-D camera are proposed. In [13], optical flow and statistical classifiers are used to track manually-defined body points in RGB images.
Table II: Regression-network architecture.

| Layer | Kernel (size / stride) | Channels |
|---|---|---|
| Layer 1 | 3×3 / 1×1 | 64 |
| Layer 2 | 3×3 / 1×1 | 128 |
| Layer 3 | 3×3 / 1×1 | 256 |
| Layer 4 | 3×3 / 1×1 | 256 |
| Layer 5 | 3×3 / 1×1 | 256 |
| Output | 1×1 / 1×1 | 20 |
However, as highlighted in the clinical literature, single-limb movement should be evaluated to verify the presence of cerebral illnesses in preterm babies. An approach to limb-specific movement detection is proposed in [14]. It exploits temporal tracking with particle filtering, integrated with limb-trajectory priors that, however, have to be manually identified by users, hampering the usability of the approach in the actual monitoring practice. In [15], the histogram of oriented gradients is used as a feature to retrieve the infant’s body skeleton; limbs and joints are then retrieved a posteriori using pre-defined body-part templates.
A different strategy relies on deep learning to assess limb joints directly, with advantages such as reduced computational time. In the context of pedestrian limb-pose estimation, two CNNs are used in cascade: the first (a detection fully convolutional neural network, FCNN) retrieves joint probability maps, and the second (a regression CNN) refines the joint-position estimates.
Inspired by this strategy, in this paper we propose to use the same cascade to estimate preterm infants’ limb pose from images acquired in NICUs during the actual clinical practice. In particular, we focus our analysis on depth images, following recent considerations related to infants’ privacy [17, 18].
This paper is organized as follows: Sec. II presents the infants’ pose-estimation approach. The evaluation protocol and the image dataset built to test the proposed approach are presented in Sec. III. Results are presented in Sec. IV and discussed in Sec. V. Sec. VI concludes this paper by summarizing the main achievements of this research.
II Methods
Our infant model considers each of the four limbs as a set of three connected joints (i.e., wrist, elbow and shoulder for the arms; ankle, knee and hip for the legs), as shown in Fig. 1. To estimate limb pose, we exploit two consecutive CNNs: one detects joints and joint connections (Sec. II-A); the other regresses the joint positions, exploiting both the joint probability and joint-connection maps, with the latter acting as guidance for joint linking (Sec. II-B). The joints belonging to the same limb are then connected using bipartite graph matching (Sec. II-C).
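As a minimal illustration, this limb model can be encoded as a plain data structure (joint names and index ordering are illustrative assumptions, not a notation from this paper):

```python
# A minimal encoding of the limb model: 12 joints and 8 joint connections
# (2 per limb). Names and indices are illustrative assumptions.
JOINTS = [
    "r_shoulder", "r_elbow", "r_wrist",    # right arm
    "l_shoulder", "l_elbow", "l_wrist",    # left arm
    "r_hip", "r_knee", "r_ankle",          # right leg
    "l_hip", "l_knee", "l_ankle",          # left leg
]

CONNECTIONS = [                            # pairs of joint indices
    (0, 1), (1, 2),                        # right arm
    (3, 4), (4, 5),                        # left arm
    (6, 7), (7, 8),                        # right leg
    (9, 10), (10, 11),                     # left leg
]
```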
II-A Detection network
To develop our detection FCNN, we perform multiple binary-detection operations (considering each joint and joint connection separately) to solve possible ambiguities arising when multiple joints and joint connections cover the same image portion (e.g., in the case of limb self-occlusion). For each video frame, we generate 20 separate ground-truth binary detection maps: 12 for the joints and 8 for the joint connections (instead of generating a single ground-truth mask with 20 different annotations, which has been shown to perform less reliably). The detection network provides joint and joint-connection confidence maps as outputs of the joint and joint-connection branches, respectively.
For every joint mask, we consider a region of interest consisting of all pixels that lie in the circle of a given radius ($r$) centered at the joint center. A similar approach is used to generate the ground truth for the joint connections; in this case, the ground truth is the rectangular region with a given thickness ($t$), centrally aligned with the joint-connection line.
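A minimal sketch of this ground-truth generation, assuming integer (x, y) joint annotations and using OpenCV only for rasterization; the thickness default is an illustrative assumption, while the 6-pixel radius follows the value we report in Sec. III-B:

```python
import numpy as np
import cv2

def joint_mask(shape, center, radius=6):
    """Binary map: filled disk of a given radius around the annotated joint."""
    mask = np.zeros(shape, dtype=np.uint8)
    cv2.circle(mask, center, radius, 1, thickness=-1)  # thickness=-1: filled
    return mask

def connection_mask(shape, joint_a, joint_b, thickness=6):
    """Binary map: thick segment centrally aligned with the joint connection."""
    mask = np.zeros(shape, dtype=np.uint8)
    cv2.line(mask, joint_a, joint_b, 1, thickness=thickness)
    return mask

# Each frame yields 20 ground-truth maps: 12 joint masks + 8 connection masks.
```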
Our architecture (Table I) is inspired by the classic encoder-decoder architecture of U-Net [20], with 8 blocks that follow the input and common-branch convolutional layers and are followed by an output layer. Each block is divided into two branches (one for joints, one for connections). The outputs of the two branches in a block are then concatenated into a single output before entering the next block. Using a bi-branch architecture has been shown to provide higher detection performance, as it allows processing the joint-probability and joint-connection affinity maps separately.
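A minimal Keras sketch of one bi-branch block; kernel sizes, filter counts and activations are illustrative assumptions, and the U-Net-like down/up-sampling around the blocks is omitted:

```python
from tensorflow.keras import layers

def bi_branch_block(x, filters=64):
    # Joint-probability branch.
    j = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    # Joint-connection (affinity-map) branch.
    c = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    # The two branch outputs are concatenated before entering the next block.
    return layers.Concatenate()([j, c])
```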
II-B Regression network
Similarly to what is done for the detection FCNN, for every joint we consider a region of interest consisting of all pixels that lie in the circle with radius equal to $3\sigma$, centered at the joint center, where $\sigma$ is the standard deviation of the Gaussian used to build the mask. A similar approach is used to generate the ground-truth masks for the joint connections; in this case, the ground-truth mask is the rectangular region with thickness $t$, centrally aligned with the joint-connection line. Pixel values in the mask are 1-D Gaussian distributed (with standard deviation $\sigma$) along the connection direction.
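One possible implementation of the joint regression targets is sketched below: a 2-D Gaussian fall-off with standard deviation $\sigma$, truncated at the $3\sigma$ disk described above. The Gaussian profile for joints is our assumption, by analogy with the 1-D Gaussian used for the joint-connection masks:

```python
import numpy as np

def joint_heatmap(shape, center, sigma=6.0):
    """Regression target for one joint: Gaussian fall-off around the joint
    center, truncated at the 3-sigma disk."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    heat = np.exp(-d2 / (2.0 * sigma ** 2))
    heat[d2 > (3.0 * sigma) ** 2] = 0.0  # keep only the 3-sigma disk
    return heat
```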
The regression network (Table II) has a single-branch architecture made of 5 layers, plus an additional input and output layer. The network is fed with both the depth image and the output of the detection network, which consists of 12 joint confidence maps and 8 affinity fields for the joint connections. The network then produces 20 maps: 12 for joints and 8 for joint connections. Batch normalization and activation with the rectified linear unit (ReLU) are performed after each convolution.
Our regression network is trained using the mean square error as loss function and stochastic gradient descent as optimizer.
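A minimal Keras sketch of the regression CNN following Table II, with the loss and optimizer described above; the stacking of the depth image (1 channel) with the 20 detection-network maps into a 21-channel input is our assumption:

```python
from tensorflow.keras import layers, models, optimizers

def build_regression_net(height=96, width=128, in_channels=21):
    inputs = layers.Input(shape=(height, width, in_channels))
    x = inputs
    # Layers 1-5 (Table II): 3x3 kernels, stride 1, each convolution followed
    # by batch normalization and ReLU.
    for filters in (64, 128, 256, 256, 256):
        x = layers.Conv2D(filters, 3, strides=1, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    # Output layer: 1x1 convolution producing 20 maps
    # (12 joints + 8 joint connections).
    outputs = layers.Conv2D(20, 1, strides=1, padding="same")(x)
    return models.Model(inputs, outputs)

model = build_regression_net()
model.compile(optimizer=optimizers.SGD(learning_rate=0.01, momentum=0.98),
              loss="mse", metrics=["accuracy"])
```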
II-C Joint linking
The last step of our limb-pose estimation task is to link the joints of each of the infant’s limbs. First, we identify joint candidates from the joint regression output maps using non-maximum suppression, an algorithm commonly used in computer vision when redundant candidates are present [21]. Once joint candidates are identified, they are linked exploiting the joint-connection regression maps. In particular, we use a bipartite matching approach, which consists in (i) computing the integral value of the joint-connection regression output map along the line connecting two candidates and (ii) choosing the two winning candidates as those guaranteeing the highest integral value.
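A minimal sketch of this linking step for one joint pair (e.g., elbow and wrist candidates linked through their connection map); SciPy’s Hungarian solver is one possible realization of the bipartite matching, and the NMS window size and threshold are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import maximum_filter
from scipy.optimize import linear_sum_assignment

def find_peaks(heatmap, thresh=0.3, window=5):
    """Non-maximum suppression: keep local maxima above a threshold."""
    is_peak = (heatmap == maximum_filter(heatmap, size=window)) \
              & (heatmap > thresh)
    return np.argwhere(is_peak)                 # (row, col) joint candidates

def line_integral(conn_map, p, q, n_samples=20):
    """Integral of the connection map along the segment joining p and q."""
    rows = np.linspace(p[0], q[0], n_samples).round().astype(int)
    cols = np.linspace(p[1], q[1], n_samples).round().astype(int)
    return conn_map[rows, cols].sum()

def link_joints(cands_a, cands_b, conn_map):
    """Pair candidates of two joint types so that the total line-integral
    score (the support from the joint-connection map) is maximized."""
    scores = np.array([[line_integral(conn_map, p, q) for q in cands_b]
                       for p in cands_a])
    rows, cols = linear_sum_assignment(-scores)  # minimize -score = maximize
    return [(cands_a[i], cands_b[j]) for i, j in zip(rows, cols)]
```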
III Experimental protocol
III-A Dataset
Videos of four preterm infants were acquired at the G. Salesi Hospital NICU in Ancona, Italy. The infants were identified by clinicians in the NICU. All infants were breathing spontaneously and did not present hydrocephalus, congenital defects or bronchopulmonary diseases. Written informed consent was obtained from each infant’s legal guardian. The video-acquisition setup is shown in Fig. 2.
Video recordings (length = 90 s) were acquired for each infant using an Orbbec® Astra Mini S, with a frame rate of 30 fps and an image size of 640×480 pixels. For each video, the ground truth was manually annotated every 5 frames, resulting in 540 annotated frames per patient. These 540 frames were then split into training and testing data: 270 frames (45 s) were used for training and the remaining 270 frames to test the networks, resulting in overall training and testing sets of 1080 frames each.
Challenges in the dataset included varying infant-camera distance (due to the mobility of the acquisition setup), varying illumination level, a different number of visible joints and limb self-occlusion (Fig. 3).
III-B Training settings
Images were resized to 128×96 pixels in order to smooth noise and reduce both training time and memory requirements. Joint annotation was performed using a custom-built annotation tool, publicly available online (https://github.com/roccopietrini/pyPointAnnotator). To build the ground-truth masks, we set $r$ equal to 6 pixels.
To train the detection and regression networks, we set an initial learning rate of 0.01 with a learning-rate decay of 10% every 10 epochs, and a momentum of 0.98. We used a batch size of 16 and, for both networks, the number of epochs was set to 100. We selected the best model as the one that maximized the accuracy on the validation set (training/validation split = 0.3).
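A sketch of these training settings in Keras, assuming the decay is applied multiplicatively every 10 epochs and that `model`, `x_train` and `y_train` are defined as in the sketches above:

```python
from tensorflow.keras.callbacks import LearningRateScheduler, ModelCheckpoint

def lr_schedule(epoch, lr):
    # Multiply the current learning rate by 0.9 at epochs 10, 20, 30, ...
    return lr * 0.9 if epoch > 0 and epoch % 10 == 0 else lr

callbacks = [
    LearningRateScheduler(lr_schedule),
    # Keep the model that maximizes accuracy on the validation split.
    ModelCheckpoint("best_model.h5", monitor="val_accuracy",
                    save_best_only=True),
]
model.fit(x_train, y_train, batch_size=16, epochs=100,
          validation_split=0.3, callbacks=callbacks)
```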
All our analyses were performed using Keras (https://keras.io/) on an Nvidia GeForce GTX 1050 Ti.
[Table III: median $DSC$ and $Rec$ for joints, per limb: right arm, left arm, right leg, left leg]
[Table IV: median $DSC$ and $Rec$ for joint connections, per limb: right arm, left arm, right leg, left leg]
[Table V: median $RMSD$ for limb-pose estimation, per limb: right arm, left arm, right leg, left leg]
III-C Performance metrics
To measure the performance of the detection FCNN, we computed the Dice similarity coefficient ($DSC$) and recall ($Rec$):

$$DSC = \frac{2\,TP}{2\,TP + FP + FN}, \qquad Rec = \frac{TP}{TP + FN}$$

where $TP$: true positives, $FP$: false positives, $FN$: false negatives.
To evaluate the overall pose estimation, we computed the root mean square distance ($RMSD$) [pixels] between estimated and ground-truth joint positions for each of the infants’ limbs.
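A sketch of the metric computation, assuming boolean NumPy masks for the detection metrics and (N, 2) arrays of joint coordinates for the $RMSD$:

```python
import numpy as np

def dsc_recall(pred, gt):
    """Dice similarity coefficient and recall between two boolean masks."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    return 2 * tp / (2 * tp + fp + fn), tp / (tp + fn)

def rmsd(estimated, ground_truth):
    """Root mean square distance [pixels] between estimated and ground-truth
    joint positions of one limb."""
    d2 = np.sum((estimated - ground_truth) ** 2, axis=1)
    return np.sqrt(np.mean(d2))
```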
For both the detection and the regression network, we measured the testing time.
IV Results
Sample outputs of the detection FCNN are shown in Fig. 4, both for joints and joint connections. It is worth noting that even when some joints were occluded (e.g., by a plaster cast, as in column 2, right leg) or out of the field of view (column 4, left leg), the network correctly detected the remaining ones.
The boxplots for $DSC$ and $Rec$, separately computed for joints and connections, are shown in Fig. 5. Median $DSC$ and $Rec$ for joints are shown in Table III. For $DSC$ and $Rec$, the interquartile range (IQR) was always lower than 0.080 and 0.124, respectively. Median $DSC$ and $Rec$ for joint connections were evaluated too (Table IV). For $DSC$ and $Rec$, the IQR was always lower than 0.099 and 0.146, respectively. The detection time was on average 0.01 s per image.
Visual results for the regression-CNN output (left ankle) are shown in Fig. 6. The median $RMSD$ values (for the resized 128×96-pixel images) for pose estimation are shown in Table V. The IQR was always lower than 4.760 pixels. Boxplots for $RMSD$ are shown in Fig. 7.
V Discussion
The proposed FCNN achieved similar results for the detection of all joints (i.e., no single joint was detected markedly better than the others), reflecting the FCNN’s ability to process the different joint probability maps in parallel. This is also visible in the visual results of Fig. 4, where the FCNN correctly detected visible joints without being affected by occluded ones. The regression network provided guidance for the bipartite matching algorithm, which achieved satisfactory performance ($RMSD$ on the order of 12 pixels) for all limbs. The overall methodology required 0.03 s per image, hence being compatible with real-time monitoring of infants.
Despite some limitations (e.g., dataset size and video length), our approach overcame some drawbacks of the literature: it directly estimates limb-specific pose while remaining computationally efficient and clinically relevant.
Future improvements of the proposed methodology may include: (i) the collection and annotation of a larger dataset (considering the lack of available datasets in this field), (ii) the analysis of temporal information (naturally encoded in videos) in both the detection and regression networks, as recently proposed in [22], and (iii) the integration of infant-specific measures already stored in electronic health records (e.g., height, limb length, …) to improve the limb-pose estimation. An accurate estimation may potentially provide useful cues for movement classification (e.g., following [23]), offering further support to clinicians.
VI Conclusions
In this paper, we have proposed a deep-learning framework for 2D pose estimation of infants’ limbs in cribs inside NICUs. The framework first performs a rough detection of limb-joint positions via an FCNN, and then refines the detection with a regression convolutional network, followed by bipartite matching to link the joints belonging to the same limb. To the best of our knowledge, this work represents a novel attempt to perform image-based estimation of infants’ limb pose, and it can potentially be extended to handle even more complex scenarios, where healthcare operators interact with infants.
With respect to state-of-the-art approaches, our work allows a direct estimation of limb-specific pose, is completely automatic, and allows real-time processing. This makes it suitable for integration into the clinicians’ decision process, providing support for the early diagnosis of brain and cognitive disorders from limb-movement analysis.
-  F. Giuliani, L. Cheikh Ismail, E. Bertino, Z. A. Bhutta, E. O. Ohuma, I. Rovelli, A. Conde-Agudelo, J. Villar, and S. H. Kennedy, “Monitoring postnatal growth of preterm infants: present and future,” The American Journal of Clinical Nutrition, vol. 103, no. 2, pp. 635S–647S, 2016.
-  F. Ferrari, G. Cioni, and H. Prechtl, “Qualitative changes of general movements in preterm infants with brain lesions,” Early Human Development, vol. 23, no. 3, pp. 193–231, 1990.
-  C. Einspieler, A. F. Bos, M. E. Libertus, and P. B. Marschik, “The general movement assessment helps us to identify preterm infants at risk for cognitive dysfunction,” Frontiers in Psychology, vol. 7, p. 406, 2016.
-  J. Werth, L. Atallah, P. Andriessen, X. Long, E. Zwartkruis-Pelgrim, and R. M. Aarts, “Unobtrusive sleep state measurements in preterm infants–A review,” Sleep Medicine Reviews, vol. 32, pp. 109–122, 2017.
-  T. M. Heiderich, A. T. F. S. Leslie, and R. Guinsburg, “Neonatal procedural pain can be assessed by computer software that has good sensitivity and specificity to detect facial movements,” Acta Paediatrica, vol. 104, no. 2, 2015.
-  D. Freymond, Y. Schutz, J. Decombaz, J.-L. Micheli, and E. Jéquier, “Energy balance, physical activity, and thermogenic effect of feeding in premature infants,” Pediatric Research, vol. 20, no. 7, p. 638, 1986.
-  I. Bernhardt, M. Marbacher, R. Hilfiker, and L. Radlinger, “Inter- and intra-observer agreement of Prechtl’s method on the qualitative assessment of general movements in preterm, term and young infants,” Early Human Development, vol. 87, no. 9, pp. 633–639, 2011.
-  I. Zuzarte, P. Indic, D. Sternad, and D. Paydarfar, “Quantifying movement in preterm infants using photoplethysmography,” Annals of Biomedical Engineering, vol. 47, no. 2, pp. 646–658, 2019.
-  S. Cabon, F. Poree, A. Simon, O. Rosec, P. Pladys, and G. Carrault, “Video and audio processing in paediatrics: a review,” Physiological Measurement, 2019.
-  N. Hesse, S. Pujades, J. Romero, M. J. Black, C. Bodensteiner, M. Arens, U. G. Hofmann, U. Tacke, M. Hadders-Algra, R. Weinberger et al., “Learning an infant body model from RGB-D data for accurate full body motion analysis,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 792–800.
-  A. Cenci, D. Liciotti, E. Frontoni, A. Mancini, and P. Zingaretti, “Non-contact monitoring of preterm infants using RGB-D camera,” in International Design Engineering Technical Conferences and Computers and Information in Engineering Conference. American Society of Mechanical Engineers, 2015, p. V009T07A003.
-  L. Adde, J. L. Helbostad, A. R. Jensenius, G. Taraldsen, and R. Støen, “Using computer-based video analysis in the study of fidgety movements,” Early Human Development, vol. 85, no. 9, pp. 541–547, 2009.
-  A. Stahl, C. Schellewald, Ø. Stavdahl, O. M. Aamo, L. Adde, and H. Kirkerod, “An optical flow-based method to predict infantile cerebral palsy,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 20, no. 4, pp. 605–614, 2012.
-  H. Rahmati, R. Dragon, O. M. Aamo, L. Adde, Ø. Stavdahl, and L. Van Gool, “Weakly supervised motion segmentation with particle matching,” Computer Vision and Image Understanding, vol. 140, pp. 30–42, 2015.
-  M. Khan, M. Schneider, M. Farid, and M. Grzegorzek, “Detection of infantile movement disorders in video data using deformable part-based model,” Sensors, vol. 18, no. 10, p. 3202, 2018.
-  Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2D pose estimation using part affinity fields,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7291–7299.
-  A. Hernández-Vela, N. Zlateva, A. Marinov, M. Reyes, P. Radeva, D. Dimov, and S. Escalera, “Graph cuts optimization for multi-limb human segmentation in depth maps,” in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 726–732.
-  C. Zhang, Y. Tian, and E. Capezuti, “Privacy preserving automatic fall detection for elderly using RGBD cameras,” in International Conference on Computers for Handicapped Persons. Springer, 2012, pp. 625–633.
-  X. Du, T. Kurmann, P.-L. Chang, M. Allan, S. Ourselin, R. Sznitman, J. D. Kelly, and D. Stoyanov, “Articulated multi-instrument 2-D pose estimation using fully convolutional networks,” IEEE Transactions on Medical Imaging, vol. 37, no. 5, pp. 1276–1287, 2018.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
-  J. Hosang, R. Benenson, and B. Schiele, “Learning non-maximum suppression,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4507–4515.
-  E. Colleoni, S. Moccia, X. Du, E. De Momi, and D. Stoyanov, “Deep learning based robotic tool detection and articulation estimation with spatio-temporal layers,” IEEE Robotics and Automation Letters, 2019.
-  M. Capecci, M. G. Ceravolo, F. Ferracuti, M. Grugnetti, S. Iarlori, S. Longhi, L. Romeo, and F. Verdini, “An instrumental approach for monitoring physical exercises in a visual markerless scenario: A proof of concept,” Journal of Biomechanics, vol. 69, pp. 70–80, 2018.