Cerebral palsy (CP) is a group of neurodevelopmental disorders that appear in early childhood and impair body movement and motor coordination (richards2013cerebral). Early detection of CP enables early rehabilitative interventions for high-risk infants. One effective CP detection approach is general movement assessment (GMA) (prechtl1990qualitative; adde2007general), where clinicians observe the movement of infant body parts in infant movement videos (IMVs) and evaluate the corresponding CP scores. Such manual assessment is tedious for clinicians and can be subjective and error-prone. Automated or computer-assisted GMA can potentially alleviate the burden on clinicians and make the process more efficient and objective.
The input to an automated or computer-assisted GMA system can be raw videos alone, or raw videos combined with various extracted semantic representations, such as infant body part segmentation maps (i.e., body parsing results) and body poses (i.e., pose estimation results). Inferring body part segmentation or pose from IMVs, however, is a non-trivial task. For body parsing, one common challenge is the frequent occlusion among infant body parts caused by their spontaneous movements, which makes it difficult to model the temporal correspondence between video frames. Moreover, to train an automated body parsing system for IMVs, annotations of segmented body parts are expensive to acquire (zhang2019online).
In this paper, we aim to develop a robust and general framework for both infant body parsing and pose estimation, and to evaluate the benefit of adding body parsing and pose estimation information to neural-network-based automatic GMA. This work differs significantly from our prior work (Ni2020SiamParseNetJB): while the previous work introduced our proposed SiamParseNet (SPN) for infant body parsing and demonstrated its segmentation performance on one small IMV dataset, in this work we propose an effective Factorized Video Generative Adversarial Network (FVGAN) for augmented training, demonstrate how to adapt SPN to the task of pose estimation, and train and test SPN models for body parsing and pose estimation on two IMV datasets. Since GMA has been used as a standard method for early prediction of CP in infants at high risk of developing neurological dysfunctions (adde2007general; stoen2019predictive), we further collect an IMV dataset containing 161 infant videos with GMA annotations to validate our proposed methods. Experiments on this clinical dataset show that the trained SPN models generalize well to new clinical data and that their outputs significantly improve GMA prediction performance.
More specifically, the contributions of this work include:
We propose for the first time a semi-supervised learning framework for robust and accurate infant body parsing from IMVs, demonstrate its applicability to infant pose estimation, and show that combining body parsing and pose estimation results with raw videos leads to superior performance for GMA and CP risk prediction.
We conduct extensive experiments to evaluate our proposed body parsing and pose estimation framework and its application to GMA using three datasets from diverse sources. Our framework is shown to have good generalizability and models trained using two existing datasets in a semi-supervised manner can be directly applied to a new clinical dataset without fine-tuning and produce high-quality results.
Our proposed infant body parsing framework, termed SiamParseNet (SPN), has a siamese structure and jointly learns body part segmentation and label propagation on infant movement videos. For more efficient semi-supervised learning, we develop two alternative learning strategies to fully utilize both labeled and unlabeled frames.
To augment the training set for body parsing, we introduce a novel factorized video GAN to synthesize new labeled video frames by composing different synthesized foregrounds (i.e., infant) and backgrounds.
In the remainder of this paper, we first review related work in Section 2; we then introduce our proposed SiamParseNet (SPN) and augmented training with the factorized video GAN (FVGAN) in Section 3. Experimental evaluation of infant body parsing and pose estimation on two datasets is presented in Section 4. Section 5 describes how to apply our proposed models to GMA and compares the performance of different combinations of input representations on a third clinical dataset with GMA annotations. We conclude in Section 6.
2 Related Work
2.1 Automated General Movement Assessment
Based on the sensing modality, existing automated GMA systems can be classified into direct and indirect sensing assessment (Marcroft14; irshad2020ai). Direct sensing means infant movements are captured by devices directly attached to the infant, such as wearable movement sensors (Chen16; machireddy2017video) and magnet tracking systems (Philippi14). Indirect sensing utilizes hardware integrated into the assessment environment; most such systems employ video-based approaches, where the videos can be captured using RGB cameras (Adde09; zhang2019online; reich2021novel), Kinect (hesse2018learning), or 3D motion capture systems (Meinecke06).
Compared with other sensing methods, RGB camera-based approaches (Adde09; orlandi2018detection; zhang2019online; chambers2020computer; doroniewicz2020writhing; reich2021novel) have many advantages. First, videos can be captured conveniently at home or in hospital, requiring only an inexpensive camera such as the one in a mobile phone. Second, without attached sensors that affect infant motion, video-based approaches can record more spontaneous infant movements. To the best of our knowledge, Adde09 designed the first video-based automatic GMA system, utilizing background subtraction and frame differencing to detect CP. However, frame differencing can be vulnerable to slow infant movements, where small differences between adjacent frames may be regarded as noise and mistakenly removed. orlandi2018detection developed a computer-aided GMA method that classifies infant movement videos as normal or CP. They employed a skin model for infant silhouette segmentation and large displacement optical flow (LDOF) for motion tracking; kinematic features were then extracted and fed into several classifiers for evaluation. However, their skin model required users to manually select a skin area in advance, which may be inconvenient in practice. zhang2019online employed U-Net (ronneberger2015u) to design an online learning framework for infant body parsing. chambers2020computer extracted body poses from the IMVs of at-risk infants using OpenPose (cao2017realtime) to calculate infant kinematic features, and then utilized a Naïve Gaussian Bayesian Surprise metric to predict infant neuromotor risk. Similarly, reich2021novel also used OpenPose (cao2017realtime) to estimate the infant skeleton from IMVs and then employed a shallow multi-layer neural network to classify infant motor functions.
However, most of these frameworks simply treat a video as a set of independent frames and do not effectively utilize the temporal relationships between consecutive frames. Very recently, cao2022aggpose proposed a Deep Aggregation Vision Transformer (AggPose) for infant pose estimation. Though achieving promising results on their infant pose dataset, their model is designed for single-image pose estimation and does not exploit temporal continuity between video frames; the pose estimation results were also not used for CP prediction.
2.2 Zero-shot Video Object Segmentation
Parsing the infant body in IMVs is highly relevant to the video object segmentation (VOS) task. Depending on whether an object mask is required during testing, VOS methods can be classified into zero-shot solutions (no annotation given for testing frames) and one-shot solutions (the label of one testing frame is given) (ventura2019rvos). In this paper, we focus on reviewing zero-shot VOS methods. Though VOS has been widely explored for natural scenes, challenges arise when directly applying existing VOS methods to infant body parsing, due to frequent occlusion among body parts during infant movements. Among current state-of-the-art methods, lu2019see introduced a CO-attention Siamese Network (COSNet) to capture the global-occurrence consistency of objects of interest among video frames, taking a pair of frames from the same video as input and learning to extract their rich correlations. However, such a co-attention module may not work well for IMVs: since IMVs are generally captured with a fixed camera, global occurrences exist in both the infant and background regions, which makes the global-occurrence consistency feature less distinguishable between infant and background. ventura2019rvos proposed a recurrent network, RVOS, to integrate the spatial and temporal domains for VOS by employing a uni-directional convLSTM (xingjian2015convolutional). This convLSTM only considers previous frames, which may result in unsatisfactory performance when occlusion persists. zhu2017deep proposed Deep Feature Flow, which first runs a convolutional sub-network on key frames for image recognition and then propagates the resulting deep feature maps to other frames via optical flow. Also using optical flow, in the medical video field, jin2019incorporating proposed MF-TAPNet for instrument segmentation in minimally invasive surgery videos, which incorporates a motion-flow-based temporal prior with an attention pyramid network. Optical flow based methods aim to find point-to-point correspondences between frames and have achieved promising results. However, for infant movement videos with frequent occlusions, it can be challenging to use optical flow to track corresponding points around occluded body parts. Moreover, few previous methods have investigated the semi-supervised training setting. As annotating IMV frames for body parsing is costly, semi-supervised methods that utilize partially labeled IMVs for training have great potential and deserve further research.
2.3 Guided Person Image Generation
To mitigate the high cost of manual annotation, we propose the factorized video GAN (FVGAN) to synthesize labeled video frames based on annotated IMVs, which is related to guided image-to-image translation, especially person image generation. Currently, most person image generation methods are pose-guided. ma2017pose proposed a two-stage coarse-to-fine network that synthesizes person images in an arbitrary pose given an image of that person and a novel pose. ma2018disentangled further proposed a two-stage reconstruction framework that learns a disentangled representation of person images by decomposing them into three factors: foreground, background, and pose; novel images are then generated by manipulating these factors. song2019unsupervised addressed pose-guided person image generation by leveraging a semantic generator and an appearance generator, adopting a cycle-consistency loss (zhu2017unpaired) to enable training without paired images. Despite such progress, few existing person image generation approaches have explored body-parsing-mask-guided generation, which is more challenging than pose-guided generation: the shapes of body parsing masks can differ substantially between subjects, whereas poses tend to be more subject-independent, making pose transfer between subjects easier. Furthermore, compared with the datasets used in existing works, such as DeepFashion (liu2016deepfashion) and Market-1501 (zheng2015scalable), IMV datasets usually contain videos with much more complex backgrounds, making IMV frame synthesis more challenging.
2.4 Differences from Our Previous Work
A preliminary version of this work was presented in our conference publication (Ni2020SiamParseNetJB), where we proposed SiamParseNet (SPN), a siamese-structured neural network (bertinetto2016fully) for infant body parsing that takes an arbitrary pair of frames from the same video as input during training. It includes one shared encoder and two branches: an intra-frame body part segmentation branch and an inter-frame label propagation branch, where the propagation branch is particularly designed to consider multiple possible correspondences, alleviating the frequent-occlusion challenge. To jointly train the two branches, we also introduced a consistency loss that provides extra regularization. Since the propagation branch can propagate either from ground truth masks or from segmentation outputs, we further utilize the consistency loss to enable semi-supervised learning (SSL) with both labeled and unlabeled video frames. To control the alternating training process between annotated and unannotated frames, we proposed adaptive alternative training (AAT) in (Ni2020SiamParseNetJB). During testing, a multi-source inference (MSI) mechanism combines the segmentation branch and the propagation branch to perform body part segmentation: MSI first automatically selects key frames and employs the segmentation branch to semantically segment them, then utilizes the propagation branch to propagate the key-frame segmentation results to the non-key frames. Instead of only considering the previous or current frames like most prior video object segmentation methods (lu2019see; ventura2019rvos), MSI leverages the local context provided by representative key frames in each video clip, which further alleviates the frequent-occlusion issue in IMVs.
In this work, we improve the SiamParseNet method by comparing and evaluating two alternative semi-supervised training strategies: thresholded automated training (TAT) and adaptive alternative training (AAT). Furthermore, inspired by the recent successful application of generative adversarial networks (GANs) (goodfellow2014generative) in both natural image generation (vondrick2016generating; ma2018disentangled; tulyakov2018mocogan; song2019unsupervised) and medical image synthesis (xue2019synthetic; xue2021selective; tajbakhsh2020embracing), we propose a novel factorized video GAN (FVGAN) to synthesize labeled frames to augment our training dataset and reduce the burden of manual annotation. More specifically, FVGAN decomposes an IMV frame into two factors: foreground (infants) and background, and decouples the generation of foreground and background. Given a body part segmentation mask from any existing labeled video frame, FVGAN can generate new video frames with various foregrounds and backgrounds (Fig. 1).
While our previous work (Ni2020SiamParseNetJB) demonstrated and evaluated only body parsing on one small clinical IMV dataset (zhang2019online), in this work, to better assist with GMA and verify the generalizability of our proposed framework, we show that SPN can be extended to infant pose estimation simply by switching its backbone network. Our proposed models for body parsing and pose estimation are validated on two datasets: (a) the BHT dataset (zhang2019online) (used in our previous work), with approximately 6% of training frames annotated, and (b) the Youtube-Infant dataset (chambers2020computer), with around 30% of frames labeled. Body parsing results show that our proposed models achieve comparable or better performance than several state-of-the-art image/video segmentation methods (ronneberger2015u; chen2018encoder; lu2019see; ventura2019rvos; chen2020naive). Examples comparing our best-performing method (SPN + FVGAN) with other methods can be found in Fig. 2. We also conduct various ablation studies to verify the effectiveness of our proposed modules, including joint training of the two branches, the consistency loss, the alternative semi-supervised training process, the factorized video GAN, and multi-source inference. Pose estimation results on an IMV dataset (chambers2020computer) demonstrate the promising performance of SPN for infant pose estimation (99.21% PCK@0.1).
Finally, in this work, to validate the clinical value of our proposed SPN, we propose a convolution-recurrent neural network (CRNN)-based IMV classification model for GMA and experiment on a newly collected clinical IMV dataset with expert GMA annotations. Results demonstrate that both body parsing and pose estimation results generated by SPN without any fine-tuning can significantly improve CP prediction performance on the new dataset, increasing the mean AUC score by more than 15% compared to models using only raw video inputs.
Fig. 3 shows an overview of our proposed SiamParseNet (SPN). SPN has a siamese structure that takes an arbitrary pair of training frames from the same video as input, regardless of the availability of annotations. This increases the amount of training data and enables the utilization of partially labeled videos. More specifically, we consider three cases of training frame pairs according to their annotations: the fully supervised mode, which uses two labeled frames; the semi-supervised mode, where only one of the two input frames is annotated; and the unsupervised mode, where neither frame is labeled. During training, we propose two alternative training strategies, thresholded automated training (TAT) and adaptive alternative training (AAT), both of which mainly rely on the fully supervised mode at early stages and rely more on the semi-supervised and unsupervised modes at later stages. To augment the set of labeled video frames, we also propose the factorized video GAN (FVGAN), which generates new labeled video frames based on existing labeled training frames. FVGAN generates a new frame by synthesizing the foreground and background separately (Fig. 4), which reduces training difficulty and model complexity and makes it easier to apply image transformations. Given a labeled video frame from one video and a randomly chosen labeled frame from a different video, FVGAN can generate three new labeled frames to help model training (Fig. 1). During testing, we propose multi-source inference (MSI) to achieve robust body parsing by combining the segmentation branch and the propagation branch of SPN (see Fig. 5). Next, we introduce our proposed SPN and its semi-supervised learning mechanism, alternative training strategies, factorized video GAN, and multi-source inference.
3.1 Semi-supervised Learning
As illustrated in Fig. 3, given a pair of input frames, we first employ a shared encoder to extract their feature maps, which are then fed to the segmentation branch and the propagation branch. The segmentation branch further encodes the two feature maps and generates a segmentation probability map for each frame with a segmentation module. The propagation branch processes the two feature maps to calculate their similarity in the feature space, and outputs a propagated probability map for each frame through different paths according to the availability of annotations for the two frames.
Case 1: When both input frames have ground truth masks available, we are in the fully-supervised mode. In this case, the propagation branch propagates the ground truth mask of one frame to the other, as the solid line in Fig. 3 shows. The overall loss is calculated as
$$\mathcal{L}_{full} = \mathcal{L}_{seg}^{a} + \mathcal{L}_{seg}^{b} + \mathcal{L}_{prop}^{a} + \mathcal{L}_{prop}^{b} + \lambda \mathcal{L}_{cons},$$
where all losses are cross-entropy losses between one-hot vectors. More specifically, $\mathcal{L}_{seg}^{a}$ and $\mathcal{L}_{seg}^{b}$ are the segmentation losses between the segmentation branch outputs and the ground truth masks of the two frames, and $\mathcal{L}_{prop}^{a}$ and $\mathcal{L}_{prop}^{b}$ are the corresponding losses of the propagation branch. The consistency loss $\mathcal{L}_{cons}$ measures the degree of overlap between the outputs of the two branches, and $\lambda$ is a scaling factor that keeps $\lambda \mathcal{L}_{cons}$ at roughly the same magnitude as the other loss terms.
Case 2: If neither input frame is labeled, we are in the unsupervised mode. In this case, the propagation branch propagates the segmentation output of one frame to the other, as the dotted line in Fig. 3 shows: it transforms the segmentation branch outputs into propagated probability maps. Due to the lack of ground truth, we only consider the consistency loss between the two branches:
$$\mathcal{L}_{un} = \lambda \mathcal{L}_{cons}.$$
Case 3: If only one of the input frames is annotated, we are in the semi-supervised mode. Without loss of generality, assume that the ground truth mask of the first frame is available but that of the second is not. The propagation branch then generates the propagated probability map using the segmentation branch's output in place of the missing ground truth mask (the dotted line in Fig. 3). We compute the segmentation and propagation losses of the labeled frame against its ground truth mask, together with the consistency loss between the two branches on both frames. Thus, the semi-supervised loss is
$$\mathcal{L}_{semi} = \mathcal{L}_{seg}^{a} + \mathcal{L}_{prop}^{a} + \lambda \mathcal{L}_{cons},$$
where $a$ denotes the labeled frame.
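To make the three training modes concrete, the following minimal NumPy sketch shows how the loss terms could be assembled from the branch outputs and the available annotations. The function names, the value of the scaling factor, and the use of cross-entropy for the consistency term are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def cross_entropy(pred, target, eps=1e-8):
    """Pixel-wise cross-entropy between a probability map and a (one-hot or soft) target."""
    return float(-np.mean(np.sum(target * np.log(pred + eps), axis=-1)))

def spn_loss(p_seg_a, p_seg_b, p_prop_a, p_prop_b, gt_a=None, gt_b=None, lam=0.1):
    """Select the SPN training loss according to annotation availability.

    All maps have shape (H, W, C); gt_a / gt_b are one-hot masks or None.
    Case 1: both masks given; Case 2: neither; Case 3: exactly one.
    """
    # Consistency loss: the two branches should agree on both frames.
    cons = cross_entropy(p_seg_a, p_prop_a) + cross_entropy(p_seg_b, p_prop_b)
    loss = lam * cons                    # always present (case 2 uses only this)
    if gt_a is not None:                 # labeled frame a: supervise both branches
        loss += cross_entropy(p_seg_a, gt_a) + cross_entropy(p_prop_a, gt_a)
    if gt_b is not None:                 # labeled frame b (fully supervised case only)
        loss += cross_entropy(p_seg_b, gt_b) + cross_entropy(p_prop_b, gt_b)
    return loss
```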
As a general framework, SPN can adopt various networks as its backbone. We follow DeepLab (chen2017deeplab; chen2018encoder) and use the first three residual blocks of ResNet101 (he2016deep) as the shared encoder. For the segmentation branch, we employ the last residual block of ResNet101 for feature extraction and an ASPP (chen2017deeplab) module for segmentation. The propagation branch also utilizes the last residual block of ResNet101 to extract features. Note that the weights are not shared between the two branches; thus the two branches are optimized separately during training.
To propagate a given source segmentation map to a target frame, similar to hu2018videomatch, we first calculate the cosine similarity matrix $A$ between the source and target feature maps as
$$A_{ij} = \frac{f_s(i) \cdot f_t(j)}{\lVert f_s(i) \rVert \, \lVert f_t(j) \rVert},$$
where $A_{ij}$ is the affinity value between point $i$ in the source feature map $f_s$ and point $j$ in the target feature map $f_t$. Then, given the source segmentation map $S$ (either the ground truth or a generated one) and the similarity matrix $A$, the propagation branch produces the value at point $j$ of the output map as a similarity-weighted average over the most similar source points:
$$P(j) = \frac{\sum_{i \in T_j} A_{ij} \, S(i)}{\sum_{i \in T_j} A_{ij}},$$
where $T_j$ contains the indices of the top $K$ most similar scores for point $j$ in $A$. Since the propagation branch considers multiple correspondences for each point, rather than the one-to-one point correspondences used in optical flow (zhu2017deep; meister2018unflow; jin2019incorporating), SPN can naturally handle occlusions in IMVs better than optical flow based methods.
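A minimal NumPy sketch of this top-$K$ propagation, operating on flattened spatial points. The shapes, the `k` default, and the exact weighting are illustrative assumptions; the paper's propagation branch may weight matches differently.

```python
import numpy as np

def propagate(src_feat, tgt_feat, src_mask, k=5):
    """Propagate a source segmentation map to a target frame via top-K matching.

    src_feat, tgt_feat: (N, D) feature vectors for N spatial points.
    src_mask: (N, C) one-hot (or soft) source segmentation map.
    Returns an (N, C) probability map for the target frame.
    """
    a = src_feat / (np.linalg.norm(src_feat, axis=1, keepdims=True) + 1e-8)
    b = tgt_feat / (np.linalg.norm(tgt_feat, axis=1, keepdims=True) + 1e-8)
    sim = b @ a.T                           # (N_tgt, N_src) cosine similarities
    out = np.zeros((tgt_feat.shape[0], src_mask.shape[1]))
    for j in range(sim.shape[0]):
        top = np.argsort(sim[j])[-k:]       # indices of the K most similar source points
        w = np.maximum(sim[j, top], 0) + 1e-8
        out[j] = (w[:, None] * src_mask[top]).sum(0) / w.sum()  # weighted average
    return out
```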
3.2 Alternative Training
As briefly mentioned at the beginning of Section 3, during the training of SPN we adopt two training strategies, thresholded automated training (TAT) and adaptive alternative training (AAT), to alternately employ the different training modes for optimal performance. Intuitively, SPN should rely more on the supervised mode at early stages and then gradually incorporate more semi-supervised and unsupervised training later on. To dynamically adjust the reliance on the different training modes, we propose TAT and AAT to automatically sample training data among the three cases. Let the probabilities of selecting case 1, case 2, and case 3 at any training step be $p_1$, $p_2$, and $p_3$, respectively. Since cases 2 and 3 both utilize unlabeled frames, we set $p_2 = p_3$. Thus we only need to control $p_1$, the probability of choosing case 1, and the other two probabilities are automatically determined.
For TAT, we change $p_1$ according to the pixel accuracy $acc$ during training, where $p_1$ is calculated as
$$p_1 = \begin{cases} 1, & acc < \theta, \\ p_t, & acc \geq \theta. \end{cases}$$
Since the moving-average pixel accuracy of the segmentation branch may differ from that of the propagation branch, we take $acc$ as the minimum of the two accuracy values. Here $\theta$ is the preset accuracy threshold and $p_t$ is the preset probability. Note that $acc$ changes continuously during training; when $acc$ is lower than the preset $\theta$, we only adopt the fully-supervised mode, as $p_1 = 1$.
For AAT, we use an annealing temperature $\tau$ to gradually reduce $p_1$ as training continues, where $p_1$ is computed as
$$p_1 = \max\!\left(p_{low},\ \left(1 - \frac{t}{T}\right)^{\tau}\right),$$
where $t$ is the training step, $p_{low}$ is the pre-defined lower bound of $p_1$, and $T$ is the maximum number of steps for which AAT is used.
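The two schedules and the mode sampling can be sketched as follows. The threshold, preset probability, lower bound, and temperature defaults are illustrative assumptions, not the paper's settings.

```python
def p1_tat(acc, theta=0.9, p_preset=0.5):
    """Thresholded automated training: full supervision until accuracy is high enough."""
    return 1.0 if acc < theta else p_preset

def p1_aat(t, T, p_low=0.3, tau=1.0):
    """Adaptive alternative training: anneal p1 from 1 toward p_low over T steps."""
    return max(p_low, (1.0 - min(t, T) / T) ** tau)

def sample_mode(p1, u):
    """Sample a training case given p1 and a uniform random number u in [0, 1)."""
    if u < p1:
        return 1                                  # case 1: fully supervised
    return 2 if u < p1 + (1 - p1) / 2 else 3      # cases 2/3 split the remainder equally
```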
3.3 Augmented Training with Factorized Video GAN
To alleviate the high cost of manual annotation and augment the set of labeled training frames with synthetic data, we propose the factorized video GAN (FVGAN) to generate synthetic labeled IMV frames. Given an input training IMV frame $x$ with its corresponding mask $m$, and a target body parsing mask $m'$, the generator $G$ of FVGAN outputs a new frame depicting the same infant and background scene as $x$ but with body parsing mask $m'$, that is,
$$x' = G(x, m, m').$$
This synthesized labeled frame and its mask $(x', m')$ can then be added to the original training set for data augmentation.
When training FVGAN, we randomly sample two labeled frames from the same video for supervised learning; the two frames share the same foreground (infant appearance) and background scene but have different body parsing masks. Using labeled frames from the same video makes it possible to use both adversarial and supervised losses for training. When synthesizing new frames with FVGAN, we instead randomly select an input frame and a target mask from two different videos and generate the new frame as in Eq. 8. Because the target mask comes from a different video than the input frame, the generated frame is guaranteed to be novel.
To simulate cross-video inference and reduce model overfitting, when training FVGAN we randomly apply several large-degree image transformations to the original target mask, including rotation, translation, scaling, and shear. To retain supervised training, we also apply the same transformations to the target frame. However, unlike traditional image transformations that act on the whole image, our transformations are applied only to the foreground (infant body) part of the target frame, to avoid unnecessary changes to the background scene. For background regions, the transformations affect only those areas occluded by the original or transformed foreground; all remaining background regions are kept unchanged.
Due to this transformation difference between foreground and background, we decouple foreground and background generation in FVGAN, with independent training losses for foreground synthesis and background synthesis. This factorized design eases manipulation and enables diverse synthesis outputs by combining the foreground and background of the input frame with the synthesized foreground and background. Overall, FVGAN can generate three types of images: synthesized foreground combined with the real background, synthesized background combined with the real foreground, and fully synthesized foreground plus synthesized background.
The detailed structure of the FVGAN generator is illustrated in Fig. 4. In general, we first factorize the input frame into a foreground embedding feature and a background feature, then combine each with the label feature of the target body parsing mask to generate the synthesized foreground image and background image, respectively. We finally assemble the two into the synthesized frame.
More specifically, during training, given an input image pair, we first utilize the input frame's mask to decompose it into a foreground infant body image and a background scene image. To promote diversity in body masks, we apply a large-degree random transformation to the target mask. We then adopt three feature encoders to encode the foreground image, the background image, and the transformed mask into a foreground feature, a background feature, and a label feature, respectively. We subsequently feed the foreground feature together with the label feature to the foreground decoder, and the background feature together with the label feature to the background decoder. The decoder outputs are converted into the synthesized foreground and background images by pixel-wise multiplication ($\odot$) with the transformed target mask $m'$ and its complement $\mathbf{1} - m'$, respectively, where $\mathbf{1}$ is a matrix of ones. We finally sum the synthesized foreground and background images to obtain the final synthetic image.
Note that after applying the image transformation, ground truth for the synthesized image is no longer available. To enable supervised training, we additionally generate new ground truth for the synthesized foreground and background. For the foreground, we apply the same image transformation to the foreground of the target image and use the result as the supervision signal. For the background, it is hard to estimate the areas covered by the original foreground but not by the transformed one. We therefore ignore such ambiguous areas by creating a background validity mask that excludes the union of the original and transformed foreground regions; both the combined background image and its supervision signal are multiplied by this validity mask.
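The masked composition and the background supervision signal described above can be sketched as follows; array shapes and function names are illustrative assumptions.

```python
import numpy as np

def compose_frame(fg_out, bg_out, tgt_mask):
    """Assemble a synthetic frame from the two decoder outputs.

    fg_out, bg_out: (H, W, 3) foreground/background decoder outputs.
    tgt_mask: (H, W) binary target foreground mask (after transformation).
    """
    m = tgt_mask[..., None].astype(float)
    fg = fg_out * m               # keep decoder output only inside the target mask
    bg = bg_out * (1.0 - m)       # keep background only outside the target mask
    return fg + bg, fg, bg

def background_supervision(real_img, orig_mask, tgt_mask):
    """Supervision for the background: ignore areas ever covered by a foreground."""
    valid = (1.0 - np.maximum(orig_mask, tgt_mask))[..., None]
    return real_img * valid       # ambiguous (occluded) pixels are zeroed out
```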
Our proposed encoders and decoders in FVGAN can use various backbone networks, such as those in DCGAN (radford2015unsupervised) and pix2pixHD (wang2018high). Here we adopt an architecture similar to pix2pixHD due to its strong representation power. For the foreground and background encoders, we use a network with four stride-2 convolutions and 9 residual blocks. For the mask encoder, we only use four stride-2 convolutions without additional residual blocks, because mask images are simple. For both the foreground and background decoders, we employ four fractionally-strided convolutions. Similar to johnson2016perceptual, we use instance normalization (ulyanov2016instance). For the discriminators, we use PatchGANs (zhu2017unpaired; isola2017image; wang2018high), which classify whether overlapping image patches are real or fake. To stabilize training, we use LSGAN (mao2017least) as the adversarial loss.
The overall training loss of FVGAN is defined as
$$\mathcal{L} = \mathcal{L}_{img} + \mathcal{L}_{fg} + \mathcal{L}_{bg},$$
where $\mathcal{L}_{img}$, $\mathcal{L}_{fg}$, and $\mathcal{L}_{bg}$ are the whole-image, foreground-image, and background-image losses, respectively. Each component loss $\mathcal{L}_{c}$, $c \in \{img, fg, bg\}$, is defined as
$$\mathcal{L}_{c} = \mathcal{L}_{adv} + \alpha \mathcal{L}_{VGG} + \beta \mathcal{L}_{FM},$$
where $\alpha$ and $\beta$ are scaling factors, and the adversarial loss $\mathcal{L}_{adv}$ is defined by the following minimax game (goodfellow2014generative):
$$\min_{G}\max_{D_c}\ \mathbb{E}_{y}\!\left[\log D_c(y)\right] + \mathbb{E}_{x}\!\left[\log\!\left(1 - D_c(G(x))\right)\right],$$
where the generator $G$ is shared across components. We design a different discriminator $D_c$ for each component: for the whole image, the discriminator distinguishes the real image from the generated image; for the foreground, it distinguishes the real image pair from the generated image pair; and for the background, it distinguishes the real background from the synthesized background. The VGG loss (wang2018high) is defined as
$$\mathcal{L}_{VGG} = \sum_{i=1}^{N}\frac{1}{M_i}\left\lVert F^{(i)}(y) - F^{(i)}(\hat{y}) \right\rVert_1,$$
where $N$ is the number of layers in the VGG feature extraction network and $F^{(i)}$ denotes the output of the $i$-th layer, with $M_i$ elements, of the VGG network (simonyan2014very) pretrained on ImageNet (deng2009imagenet). The feature matching loss (wang2018high) is defined as
$$\mathcal{L}_{FM} = \sum_{i=1}^{N_D}\frac{1}{M_i^{D}}\left\lVert D^{(i)}(y) - D^{(i)}(\hat{y}) \right\rVert_1,$$
where $D^{(i)}$ denotes the $i$-th layer, with $M_i^{D}$ elements, of our proposed discriminator.
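As an illustration, the layer-wise $L_1$ feature matching term can be sketched as follows. The per-layer normalization and the uniform layer weighting here are simplifying assumptions; pix2pixHD weights discriminator layers slightly differently.

```python
import numpy as np

def feature_matching_loss(feats_real, feats_fake):
    """L1 feature matching across discriminator layers, normalized per layer.

    feats_real / feats_fake: lists of per-layer feature arrays of matching shapes.
    """
    loss = 0.0
    for fr, ff in zip(feats_real, feats_fake):
        loss += np.abs(fr - ff).mean()    # (1/M_i) * ||D_i(y) - D_i(y_hat)||_1
    return loss / len(feats_real)
```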
3.4 Multi-source Inference for Testing
To further mitigate occlusion issues in IMVs at test time, as Fig. 5 shows, we propose multi-source inference (MSI) to fully exploit the dual branches of SPN. For each testing IMV, we first calculate the pixel difference between consecutive frames and model these differences with a Gaussian distribution. A preset percentile of this distribution is selected as the threshold for sampling watershed frames, whose pixel differences from adjacent frames exceed the threshold. We then split the IMV into shorter clips delimited by these watershed frames, so that the infant pose and appearance are similar within each clip. We further choose the middle frame of each clip as its key frame, because the middle frame has the least cumulative temporal distance from the other frames (griffin2019bubblenets). During inference, the segmentation branch segments the selected key frame of each clip; then, for the non-key frames within the same clip, the propagation branch takes the segmentation output of the key frame and propagates it to them. By splitting a long video into short clips and using the key frame to provide the local context and propagation source for each clip, MSI effectively alleviates the occlusion issues in IMVs.
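A simplified sketch of the clip splitting and key-frame selection, using a plain percentile cutoff on frame differences as a stand-in for the fitted-Gaussian threshold described above:

```python
import numpy as np

def split_and_pick_keyframes(frames, percentile=95):
    """Split a video into clips at high-motion 'watershed' frames; pick clip middles.

    frames: (T, H, W) grayscale video. Returns (clip_ranges, key_frame_indices).
    """
    diffs = np.abs(np.diff(frames.astype(float), axis=0)).mean(axis=(1, 2))
    thr = np.percentile(diffs, percentile)        # stand-in for the Gaussian-model cutoff
    watersheds = [i + 1 for i, d in enumerate(diffs) if d > thr]
    bounds = [0] + watersheds + [len(frames)]
    clips = [(s, e) for s, e in zip(bounds[:-1], bounds[1:]) if e > s]
    keys = [(s + e - 1) // 2 for s, e in clips]   # middle frame of each clip
    return clips, keys
```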
4 Experiments for Infant Body Parsing and Pose Estimation in IMVs
4.1 Datasets and Metrics
We conduct extensive experiments on two IMV datasets: BHT dataset and Youtube-infant dataset.
BHT dataset. This dataset (zhang2019online) was collected through a GMA platform developed by Shenzhen Beishen Healthcare Technology Co., Ltd (BHT). It contains 20 movement videos of infants aged 0 to 6 months, recorded either by medical staff in hospital or by parents at home. Due to the long length of the original videos, we temporally downsample them by keeping one of every 2 to 5 frames; the resulting average video length is 1,500 frames. All frames are resized to a fixed resolution, and some of them are pixel-wise annotated with five categories: background, head, arm, torso, and leg. This challenging dataset covers a large variety of video content, including diverse infant poses, varied body appearance, multiple background scenes, and viewpoint changes. We randomly split the dataset into 15 training videos and 5 testing videos, yielding 1,267 labeled and 21,154 unlabeled frames in the training set (i.e., only 5.7% of frames are labeled), and 333 labeled and 7,246 unlabeled frames in the testing set.
Youtube-Infant dataset. We further evaluate our proposed methods on infant videos collected from Youtube (chambers2020computer); the video URLs were provided by the authors of chambers2020computer. For infant body parsing, we select 90 IMVs of various types, with an average length of about 120 frames per video. As with the BHT dataset, we collect annotations for five classes: background, head, arm, torso, and leg. Around 30% of the frames of each video are manually annotated through the Amazon Mechanical Turk crowd-sourcing platform (data and annotations are publicly available at https://github.com/nihaomiao/Youtube-Infant-Body-Parsing). This Youtube-Infant dataset is more diverse and challenging than the BHT dataset, as it includes infants of different races, poses, and appearances, and its videos have varying background scenes, viewpoints, and lighting conditions. We randomly split the videos into 68 training videos and 22 testing videos, resulting in 2,149 labeled and 4,690 unlabeled training frames, and 1,256 labeled and 2,737 unlabeled testing frames.
To better extract kinematic features and assist with GMA, we also apply SPN to pose estimation in IMVs. For pose estimation on the Youtube dataset, we follow chambers2020computer and choose 94 infant videos, comprising 84 training videos and 10 testing videos. All videos are fully labeled with 17 keypoints (nose, left/right eye, left/right ear, left/right shoulder, left/right elbow, left/right wrist, left/right hip, left/right knee, left/right ankle) by either human annotators or OpenPose (cao2017realtime). In total, there are 5,681 training frames and 487 testing frames.
Metrics. For body parsing tasks, we first compute the mean Dice score of all labeled testing frames for each video and then report the average score across all videos as the final result. When calculating Dice, we represent each pixel label of the segmentation map as a one-hot vector and ignore background pixels to focus on the body parts. We also report individual mean Dice scores for each body part.
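The per-frame Dice computation described above can be written as a short function. This is an illustrative sketch (the function name is ours); it ignores background pixels and skips classes absent from both maps, as in the metric description.

```python
import numpy as np

def mean_dice(pred, gt, num_classes=5, ignore_index=0):
    """Mean Dice over foreground classes for one frame.

    pred, gt: integer label maps of shape (H, W); class 0 is background
    and is ignored so the score focuses on the body parts.
    """
    scores = []
    for c in range(num_classes):
        if c == ignore_index:
            continue
        p = (pred == c)
        g = (gt == c)
        denom = p.sum() + g.sum()
        if denom == 0:
            continue  # class absent from both prediction and ground truth
        scores.append(2.0 * np.logical_and(p, g).sum() / denom)
    return float(np.mean(scores)) if scores else 0.0
```

The video-level score is then the mean of this value over all labeled frames, averaged again across videos.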
For pose estimation tasks, we first compute the mean PCK score (yang2012articulated) of all labeled testing frames for each video and then report the average score over all videos as the final result. PCK is the percentage of correct keypoints, where a candidate keypoint is considered correct if it falls within α · max(h, w) pixels of the ground truth keypoint. Here h and w are the height and width of the bounding box of the infant body, and α controls the relative precision threshold. Using two thresholds, α = 0.1 and α = 0.05, we report two PCK scores: PCK@0.1 and PCK@0.05.
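A minimal sketch of the per-frame PCK computation follows, assuming the standard α · max(h, w) threshold (the exact symbols in the original formula were lost in extraction, so this reconstruction is hedged accordingly).

```python
import numpy as np

def pck(pred_kpts, gt_kpts, bbox_h, bbox_w, alpha=0.1):
    """Percentage of Correct Keypoints for one frame.

    A predicted keypoint counts as correct when its Euclidean distance
    to the ground truth is within alpha * max(bbox_h, bbox_w) pixels;
    alpha = 0.1 and alpha = 0.05 give PCK@0.1 and PCK@0.05.
    """
    pred_kpts = np.asarray(pred_kpts, dtype=float)
    gt_kpts = np.asarray(gt_kpts, dtype=float)
    threshold = alpha * max(bbox_h, bbox_w)
    dists = np.linalg.norm(pred_kpts - gt_kpts, axis=1)
    return float((dists <= threshold).mean())
```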
4.2 Implementation details
When training SPN for infant body parsing, to accelerate training and reduce overfitting, similar to chen2017deeplab; chen2018encoder; lu2019see; ventura2019rvos, we utilize the weights of DeepLab V3+ (chen2018encoder) pretrained on the COCO dataset (lin2014microsoft) to initialize the shared encoder and both branches. For the propagation branch, we set the parameter in Eq. 5 by grid searching from 5 to 20 with an interval of 5. The scaling factor in Eqs. 1, 2, and 3 is set such that all loss terms have the same magnitude. To construct training frame pairs from the same video for SPN, we collect different image pairs for the three training modes mentioned in Section 3. For training case 1 (i.e., the fully supervised mode), we randomly select pairs of labeled frames from the same video and repeat until we have 10,000 training image pairs, i.e., 20,000 images in total. For training case 2 (i.e., the unsupervised mode), we repeatedly select random pairs of unlabeled frames until we have 10,000 image pairs. For training case 3 (i.e., the semi-supervised mode), we repeatedly pair one unlabeled frame with one annotated frame until we have 10,000 image pairs. For semi-supervised learning with TAT, unless otherwise specified, the pixel accuracy threshold in Eq. 6 is set to 0.85 in all experiments. For semi-supervised learning with AAT, unless otherwise specified, the annealing temperature in Eq. 7 is set to 0.4 in all experiments. The SGD optimizer is employed with a momentum of 0.9, and the initial learning rate is decayed with the poly learning rate policy (chen2017deeplab) with a power of 0.9. The training batch size is set to 20. Traditional data augmentation operations such as color jitter, rotation, and flipping are also applied. We terminate training when the pixel accuracies of both branches remain almost unchanged for 2 epochs.
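For reference, the poly learning rate policy mentioned above computes the learning rate at each iteration as a power of the remaining training fraction. A minimal sketch (the function name is ours; the base learning rate is whatever value the experiment uses):

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """"Poly" learning rate schedule (chen2017deeplab):
    lr = base_lr * (1 - cur_iter / max_iter) ** power.
    Decays smoothly from base_lr to 0 over max_iter iterations.
    """
    return base_lr * (1.0 - float(cur_iter) / max_iter) ** power
```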
To train FVGAN, similar to the fully supervised training mode of SPN, we randomly select 10,000 pairs of labeled frames from the same videos in the training set. We set the batch size to 20 and train the model for 100 epochs. The Adam optimizer (kingma2014adam) is employed with a fixed learning rate, and the loss weights in Eq. 10 are both set to 10. Since the input mask can differ considerably in pose from the target mask when generating new frames (the input comes from a different video than the target during synthesis), we apply large-degree image transformations, covering rotation, shear, translation, and scaling, to the inputs during training.
When parsing infant bodies in testing IMVs using MSI, we fix the percentile threshold used in the key frame selection algorithm, and DenseCRF (krahenbuhl2011efficient) is adopted as the final post-processing step.
To adapt SPN to pose estimation, we only need to switch the backbone network used for body parsing to a pose estimation backbone and replace the segmentation branch with a pose estimation branch. We follow the architecture of Pose-ResNet (xiao2018simple) to design the new backbone: the four residual blocks of ResNet-101 (he2016deep) serve as the shared encoder. The new pose estimation branch employs three deconvolutional layers for feature extraction and one final convolutional layer for keypoint prediction; the propagation branch likewise uses three deconvolutional layers to extract features. As in training for body parsing, we randomly select 10,000 image pairs to train a fully supervised SPN and initialize the model with the weights of Pose-ResNet pretrained on the COCO dataset (lin2014microsoft). For the propagation branch, we use a simplified setting of Eq. 5 for efficiency. Following the training settings of Pose-ResNet, we set the initial learning rate and drop it at 5 epochs and again at 10 epochs, training for 20 epochs in total. The training batch size is 20 and the Adam optimizer (kingma2014adam) is used. During testing, instead of using MSI, we only employ the trained pose estimation branch: propagating pose keypoints is rather sensitive to the selection of source frames, and MSI may fail when source frames lack required keypoints due to occlusion or mis-estimation, which is common in pose estimation tasks.
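The pose estimation branch described above outputs per-keypoint heatmaps, which must be decoded back into image coordinates. A minimal sketch of the decoding step (the function name is ours, and the output stride of 4 is an assumption based on the typical Pose-ResNet configuration of three 2x deconvolutions):

```python
import numpy as np

def heatmaps_to_keypoints(heatmaps, stride=4):
    """Decode predicted heatmaps (K, H, W) into keypoint coordinates.

    Takes the argmax location of each heatmap and scales it back to
    input-image coordinates by the network's output stride. Returns an
    (K, 2) array of (x, y) coordinates and a (K,) array of confidences.
    """
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1)
    idx = flat.argmax(axis=1)
    conf = flat.max(axis=1)            # peak value as confidence
    ys, xs = np.unravel_index(idx, (H, W))
    coords = np.stack([xs, ys], axis=1) * stride
    return coords, conf
```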
4.3 Result Analysis
To validate the effect of jointly training the two branches, we first compare two variants of SPN under the fully supervised training mode: [Single-], which is trained using only the segmentation branch, and [SPN-], which is trained jointly but uses only the segmentation branch for all testing frames. Since the propagation branch is not used at test time, multi-source inference is not applied. As Table 1 shows, compared with [Single-], joint training of SPN greatly boosts the mean Dice, which can be attributed to the siamese structure, the shared encoder, and the consistency loss, among others. To further validate the effect of the consistency loss, we also compare two variants of the SPN model: one trained without the consistency loss and without semi-supervised learning, and [SPN w/o SSL], which is trained with the consistency loss but without SSL. For both variants, MSI is used for testing. As shown in Table 1, [SPN w/o SSL] gives a better mean Dice, which demonstrates the usefulness of the consistency loss even in the fully supervised setting with normal training. In addition, by comparing the results of [SPN-] and [SPN w/o SSL], one can see the effectiveness of MSI, since the only difference between those two models is the application of multi-source inference.
Table 1 (excerpt), Dice scores (%) on the BHT dataset (per-body-part scores followed by the mean):
SPN w/o SSL: 75.34, 62.26, 81.35, 82.26, 80.09
SPN w/o SSL + FVGAN: 82.36, 71.22, 86.61, 84.43, 84.13
SPN + FVGAN: 82.25, 71.77, 85.12, 85.06, 84.54
Table 2 (excerpt), mean Dice (%) of [SPN w/o SSL] under the video-level SSL settings with 15, 7, and 5 labeled training videos (gain 0 in each setting, as the baseline):
SPN w/o SSL: 80.09, 77.17, 71.52
For the SPN model with SSL, we experiment with different values of the pixel accuracy threshold in TAT (Eq. 6) and different settings of the annealing temperature in AAT (Eq. 7). From Table 1, one can observe that semi-supervised learning using either TAT or AAT improves performance under all of these parameter settings. To verify the effect of the proposed MSI, we also test using only the segmentation branch after SSL training. Comparing [SPN-] with SPN under the different TAT and AAT settings, one can find that MSI improves the final Dice in all cases.
To further demonstrate the power of semi-supervised learning in our proposed SPN, we experiment with a video-level SSL setting on the BHT dataset: we randomly choose a certain number of training videos and remove ALL annotations from those videos. This setting is more stringent than the one used in (jin2019incorporating), which showed promising results with a frame-level SSL setting: they removed labels of some frames in each video, but all training videos remained partially labeled. In our video-level SSL experiment, we compare the performance of [SPN w/o SSL] and SPN with SSL when keeping labeled frames in only 7 training videos (using the other 8 training videos without labels), and when keeping labels in only 5 videos (using 10 without labels).
As shown in Table 2, with fewer annotated videos, SPN with SSL shows significant performance gains over [SPN w/o SSL] using either TAT or AAT with their best parameter settings. Such results indicate the potential of SPN under various semi-supervised learning scenarios. From both Table 1 and Table 2, one can find that the best model uses AAT, which we set as the default SSL configuration in the remaining experiments.
To validate the effectiveness of the proposed FVGAN augmentation, we first utilize FVGAN to separately generate 10,000 images for each of three types: synthesized foreground with target background, synthesized background with target foreground, and synthesized foreground with synthesized background. Examples of synthesized frames can be found in Fig. 1. We then add all synthesized frames to the training sets for the training of [SPN w/o SSL] and [SPN w/ SSL], respectively. Note that, while there might be slight mismatches between the foreground and background lighting in the generated images, in the context of augmentation such mismatches can further serve as a lighting augmentation that improves the robustness of the trained model.
Table 3 shows our experimental results of augmented training with the different types of generated images on the BHT dataset. From Table 3, one can see that all three types of synthesized images are helpful, and that fully synthesized or foreground-synthesized images yield larger gains than background-synthesized ones, improving the Dice coefficient by up to 4.04% and 3.32% for [SPN w/o SSL] and [SPN w/ SSL], respectively. We speculate that synthesizing the foreground introduces more diverse infant body poses and appearances than synthesizing the background, and is thus more effective for data augmentation.
Table 3 also validates the effectiveness of our proposed semi-supervised learning. Qualitatively, from Fig. 6, one can see that unlike the other SPN variants, our full SPN model avoids mis-segmenting shadows and occluded head regions, giving the best segmentation performance.
Table 4 (excerpt), Dice scores (%) on the BHT dataset (first five columns: per-body-part scores and mean) and the Youtube-Infant dataset (last five columns):
DeepLab V3+ (chen2018encoder): 72.00, 53.93, 69.99, 73.29, 70.99 | 91.76, 72.18, 86.26, 82.87, 84.16
SPN w/o SSL: 75.34, 62.26, 81.35, 82.26, 80.09 | 91.72, 71.93, 86.03, 82.15, 84.24
SPN w/o SSL + FVGAN: 82.36, 71.22, 86.61, 84.43, 84.13 | 93.34, 77.92, 88.46, 86.78, 87.19
SPN + FVGAN: 82.25, 71.77, 85.12, 85.06, 84.54 | 93.97, 79.78, 89.67, 87.61, 88.53
Table 6 (excerpt), pose estimation results on the Youtube-Infant dataset:
SPN- w/o SSL: PCK@0.1 99.21, PCK@0.05 96.69
SPN- w/o SSL-PA: PCK@0.1 98.10, PCK@0.05 95.19
We further compare our proposed SPN and FVGAN with current state-of-the-art methods, including the single-frame-based U-Net (ronneberger2015u) and DeepLab V3+ (chen2018encoder), the video-based COSNet (lu2019see) and RVOS (ventura2019rvos), and the semi-supervised video segmentation model Naive-Student (chen2020naive), in Fig. 2 and Table 4. For a fair comparison, all methods except U-Net employ models pretrained on ImageNet (deng2009imagenet) or the COCO dataset (lin2014microsoft), and DenseCRF is adopted as post-processing for all methods. From Fig. 2, one can observe that SPN clearly handles occlusion better than the other methods and shows better qualitative results. As Table 4 shows, [SPN w/o SSL], SPN, [SPN w/o SSL + FVGAN], and the full [SPN + FVGAN] model all achieve substantially better quantitative performance than previous state-of-the-art methods on the BHT dataset.
To evaluate the generalizability of our proposed SPN and FVGAN, we also compare the SPN variants with current state-of-the-art methods on the Youtube-Infant dataset. As Table 4 and Fig. 2 show, the SPN models achieve comparable or better performance than the state-of-the-art methods. Though RVOS (ventura2019rvos) achieves the best mean Dice (0.35% higher than [SPN + FVGAN]), note that the input to RVOS is a video clip while SPN only requires a pair of frames, so SPN can be much more computationally efficient in training and testing. The comparison between [SPN w/o SSL] and SPN (using AAT) in Table 4 and Fig. 6 also demonstrates the effectiveness of our proposed semi-supervised learning strategy. Moreover, FVGAN-augmented training improves the performance of both [SPN w/o SSL] and SPN, as demonstrated in Table 4 and Fig. 6. We also experiment with augmented training using different numbers of synthesized images (2K, 5K, and 10K) for the [SPN w/o SSL] method on the BHT and Youtube-Infant datasets. Table 5 shows that FVGAN-augmented training with any of these numbers of synthesized frames consistently improves [SPN w/o SSL], by up to 4.04% and 2.95% in Dice score; even augmenting with only 2K FVGAN-synthesized frames still boosts the [SPN w/o SSL] model on both datasets.
To verify that the improvement brought by FVGAN augmentation is statistically significant, we compare models trained with and without FVGAN using a one-sided paired Wilcoxon signed-rank test on the Youtube-Infant dataset, pairing the per-video mean Dice scores of the two models. Both for models trained without SSL (comparing [SPN w/o SSL] with [SPN w/o SSL + FVGAN]) and for models trained with SSL (comparing [SPN] with [SPN + FVGAN]), the resulting p-values are considerably smaller than the significance level of 0.05, indicating that the improvement brought by FVGAN is indeed significant.
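The significance test above can be reproduced with SciPy's paired Wilcoxon signed-rank test. The sketch below is illustrative (the function name and the example scores are ours, not the paper's data); the inputs are the per-video mean Dice scores of the two models.

```python
from scipy.stats import wilcoxon

def fvgan_significance(dice_with_fvgan, dice_without_fvgan):
    """One-sided paired Wilcoxon signed-rank test: does FVGAN
    augmentation improve per-video mean Dice over the baseline?

    alternative='greater' tests whether the paired differences
    (dice_with_fvgan - dice_without_fvgan) are shifted above zero.
    """
    stat, p_value = wilcoxon(dice_with_fvgan, dice_without_fvgan,
                             alternative="greater")
    return p_value
```

A p-value below 0.05 would indicate a significant improvement at the conventional significance level.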
Lastly, to better extract kinematic features for movement assessment, we extend our proposed SPN framework to the pose estimation task on the Youtube-Infant dataset. Since Youtube-Infant is fully annotated for pose estimation, we first train a fully supervised SPN and compare its pose estimation branch (the counterpart of the segmentation branch in SPN for body parsing) with the strong baseline model Pose-ResNet (xiao2018simple). To evaluate the effect of our proposed semi-supervised learning mechanism, we remove 30% of the training-frame labels to simulate a partially annotated (PA) dataset and retrain the models as Pose-ResNet-PA, fully supervised SPN-PA, and semi-supervised SPN-PA. As shown in Table 6, when trained in a fully supervised fashion, SPN achieves 0.17% and 1.58% higher scores than Pose-ResNet on PCK@0.1 and PCK@0.05, respectively. Since [SPN-] and Pose-ResNet share the same backbone architecture, the improvement mainly comes from the joint training of the two branches in SPN. Table 6 also illustrates the effectiveness of our proposed semi-supervised learning, which improves [SPN w/o SSL] by 1.02% and 1.11% on PCK@0.1 and PCK@0.05, respectively. We also show qualitative results of SPN and Pose-ResNet in Fig. 7, from which one can observe that the SPN models give better estimates for some keypoints such as the wrist.
5 Application to General Movement Assessment
To further validate the clinical value of our proposed methods, we propose a convolutional recurrent neural network (CRNN) based CP prediction model for automatic GMA and conduct comprehensive experiments on a newly collected clinical IMV dataset to show that the body parsing and pose estimation results predicted by SPN can help improve CP prediction performance when combined with raw video frames.
5.1 Dataset and Metrics
We conduct comprehensive experiments on an IMV dataset collected from Shenzhen Baoan Women's and Children's Hospital (BWCH) with expert GMA annotations. The collection of the data was approved by the hospital ethics committee. The BWCH dataset includes IMVs of 110 normal infants and 51 abnormal infants, evaluated by an experienced pediatric physical therapist using Prechtl's method of GMA (ferrari2004prechtl). The GM videos were taken between 49 and 60 weeks. Each IMV is classified as normal or abnormal based on infant motion patterns; abnormal is defined as infants showing larger-amplitude movements than normal fidgety movements, e.g., with excessive speed and jerkiness (spittle2013general). All videos have follow-up CP diagnoses that were verified to ensure the accuracy of the GMA annotations; in other words, all infants whose GM videos were annotated as abnormal in our dataset were later diagnosed with CP. We downsample the original videos by keeping one of every 5 frames, giving an average video length of about 900 frames. We perform 5-fold cross validation on this dataset, keeping the ratio of normal to abnormal infants at around 2:1 in both the training and testing sets of each fold to preserve the class distribution. We report the mean and variance of the sensitivity, specificity, and AUC (area under the ROC curve) scores over all folds to compare different models.
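A stratified split that preserves the ~2:1 normal-to-abnormal ratio in every fold can be produced as follows. This is a pure-Python sketch under our own naming (the paper does not specify its splitting code); libraries such as scikit-learn's StratifiedKFold offer the same functionality.

```python
import random

def stratified_folds(normal_ids, abnormal_ids, n_folds=5, seed=0):
    """Split subject IDs into n_folds folds, each preserving the
    normal-to-abnormal ratio of the full dataset.

    Returns a list of (train_ids, test_ids) pairs, one per fold.
    """
    rng = random.Random(seed)
    normal = normal_ids[:]
    abnormal = abnormal_ids[:]
    rng.shuffle(normal)
    rng.shuffle(abnormal)
    folds = []
    for k in range(n_folds):
        # take every n_folds-th subject from each class for the test set
        test = normal[k::n_folds] + abnormal[k::n_folds]
        train = [i for i in normal + abnormal if i not in test]
        folds.append((train, test))
    return folds
```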
5.2 CRNN-based GMA Prediction
We build upon CRNN (shi2016end), a popular video classification framework, to classify each IMV as normal or abnormal for GMA prediction. The original CRNN operates on raw video inputs: a CNN module extracts a feature map for each video frame, and an LSTM (hochreiter1997long) takes the feature map sequence of all frames as input for the final classification. We enable CRNN to utilize body parsing or pose estimation predictions by adding an extra CNN to extract 2D feature maps from the body parsing or pose estimation results of each frame, and we concatenate these 2D feature maps with the feature maps from the raw video frames as inputs to the LSTM.
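The fusion step above, concatenating the per-frame feature maps along the channel dimension before the LSTM, can be sketched as follows (the function name and shapes are illustrative, not the paper's exact implementation):

```python
import numpy as np

def fuse_frame_features(video_feats, parsing_feats, pose_feats=None):
    """Concatenate per-frame feature maps from the raw-video CNN with
    those extracted from body parsing (and optionally pose) results.

    Each input has shape (T, C, H, W) with the same T, H, W; the output
    (T, C_total, H, W) is the per-frame input sequence for the LSTM.
    """
    feats = [video_feats, parsing_feats]
    if pose_feats is not None:
        feats.append(pose_feats)
    return np.concatenate(feats, axis=1)  # stack along channels
```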
To obtain better body parsing and pose estimation results for the testing IMVs in the new BWCH dataset, we retrain body parsing and pose estimation SPN models using all available data in the BHT dataset (zhang2019online) and the Youtube-Infant dataset (chambers2020computer), and then directly apply these trained models to the new testing IMVs without any training or fine-tuning. Since no frame-level annotation is required, this setting simulates the real-world clinical scenario where no body parsing or pose estimation annotations are available for input IMVs. For each input IMV, we first run the trained pose estimation SPN model to obtain the pose keypoints. Based on the estimated keypoint coordinates across frames, we generate a fixed bounding box to crop the video and remove the noisy background; we find this preprocessing step essential for achieving good body parsing results. We then apply the trained body parsing SPN and the trained pose estimation branch to the cropped video to generate the body parsing and pose estimation results used by the GMA prediction model. More implementation details are given in Section 5.3.
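The fixed crop box can be derived from the keypoints of all frames, for instance as the padded bounding box of every detected keypoint. This is a sketch under our own assumptions (the function name and the 20% margin are illustrative; the paper does not specify how the box is computed):

```python
import numpy as np

def crop_box_from_keypoints(all_kpts, frame_h, frame_w, margin=0.2):
    """Compute one fixed crop box for a whole video from per-frame
    keypoints, removing the noisy background around the infant.

    all_kpts: array of shape (T, K, 2) holding (x, y) keypoints over
    T frames. margin: fractional padding around the tight keypoint box
    (assumed value). Returns (x0, y0, x1, y1) clipped to the frame.
    """
    kpts = np.asarray(all_kpts, dtype=float).reshape(-1, 2)
    x_min, y_min = kpts.min(axis=0)
    x_max, y_max = kpts.max(axis=0)
    pad_x = margin * (x_max - x_min)
    pad_y = margin * (y_max - y_min)
    x0 = int(max(0, x_min - pad_x))
    y0 = int(max(0, y_min - pad_y))
    x1 = int(min(frame_w, x_max + pad_x))
    y1 = int(min(frame_h, y_max + pad_y))
    return x0, y0, x1, y1
```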
5.3 Implementation Details
We adopt ResNet-152 (he2016deep) and a three-layer LSTM to construct the CRNN model and initialize the weights using a public model (https://github.com/HHTseng/video-classification) pretrained on the UCF101 (soomro2012ucf101) action classification dataset. During training, we fix the weights of ResNet-152 and fine-tune the remaining parts for 20 epochs in each cross-validation fold. The training batch size is set to 32, and the Adam optimizer (kingma2014adam) is employed. Due to the long length of IMVs, we randomly select 200-continuous-frame video clips to train the CRNN in each iteration. At test time, we cut each IMV into several 200-continuous-frame segments and obtain the final prediction by averaging the class probabilities of all segments. To embed richer color information, we first render the body parsing and pose estimation results as RGB images (see Fig. 8 for an example) and then use the pretrained ResNet-152 to extract their feature maps as a supplement to the raw video input features.
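The segment-level inference described above can be sketched as follows (the function names are ours; `predict_clip` stands in for a forward pass of the trained CRNN):

```python
import numpy as np

def predict_video(frames, predict_clip, clip_len=200):
    """GMA inference for one long IMV: cut the video into consecutive
    200-frame segments, run the classifier on each segment, and
    average the class probabilities for the final prediction.

    predict_clip: callable mapping a list of frames to a probability
    vector over the normal/abnormal classes.
    """
    probs = []
    for start in range(0, len(frames), clip_len):
        clip = frames[start:start + clip_len]
        probs.append(predict_clip(clip))
    return np.mean(probs, axis=0)
```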
5.4 Result Analysis
Table 7 shows the quantitative comparison between the current state-of-the-art GMA prediction model (chambers2020computer) and our proposed CRNN-based models using different input modalities. Among our CRNN-based models, using body parsing results (raw video + parsing), pose estimation predictions (raw video + pose), or their combination (raw video + pose + parsing) all considerably improves over the baseline CRNN model (raw video only). Such results further validate the clinical value of our proposed SPN. Furthermore, we find that body parsing features generally yield larger gains than pose estimation features, possibly because the body parsing representation contains richer information than the skeleton representation and is thus better suited to a CNN-based feature extractor. We also compare our proposed CRNN-based models with the state-of-the-art GMA method (chambers2020computer) in Table 7. The original method in chambers2020computer reported Bayesian Surprise with respect to a reference population. For a fair comparison under the same cross-validation setting as our experiments, we first extract kinematic features for each IMV using their released implementation (https://github.com/cchamber/Infant_movement_assessment) and then train an XGBoost classifier (chen2016xgboost) to evaluate the extracted features on the testing IMVs of each fold. From Table 7, all CRNN models with SPN features achieve more favorable performance than chambers2020computer, indicating that the results generated by SPN can indeed help improve GMA prediction performance in clinical settings.
Table 7 (excerpt), sensitivity, specificity, and AUC (%) of the CRNN-based models:
CRNN (raw + pose): sensitivity 72.93±20.17, specificity 95.45±.07, AUC 94.02±7.26
CRNN (raw + parsing): sensitivity 80.61±23.61, specificity 96.36±3.40, AUC 94.31±8.31
CRNN (raw + pose + parsing): sensitivity 82.43±15.63, specificity 87.27±12.33, AUC 96.46±2.11
In this paper, we propose SiamParseNet, a novel semi-supervised framework for joint learning of body parsing and label propagation in IMVs toward computer-assisted GMA. Our proposed SPN exploits the large number of unlabeled frames in IMVs by alternating among different training modes and shows superior performance under various semi-supervised training settings. Combined with factorized video GAN augmented training and multi-source inference for testing, SPN not only has great potential for infant body parsing but can also be easily adapted to other video tasks such as pose estimation. A clinical GMA model for CP prediction can also benefit from the body parsing and pose estimation results generated by SPN. In the future, we plan to design a single SPN that jointly learns body parsing and pose estimation.