Fetal organ measurement using ultrasound (US) is currently the most popular way of assessing fetal growth and the safety of the pregnancy [liu2019deep]. It enables the operator to perform an array of measurements during a single imaging session. Clinically, the most important are the measurements of biparietal diameter (BPD), head circumference (HC), femur length (FL) and abdominal circumference (AC). Fetal measurements enable obstetricians to evaluate fetal growth and estimate two further parameters: gestational age (GA) and fetal weight (FW) [hadlock1984estimating, hadlock1985estimation].
In order to obtain proper fetal measurements, the operator must find a proper imaging plane (view) and follow strict guidelines that standardize the procedure. Both of those tasks require substantial knowledge and experience. Ultrasound images are characterized by high speckle noise, blur and shadows, which further increase the difficulty of the task. Automated measurement of the fetal body parts is valuable because expert resources are scarce, especially in underdeveloped countries [shah2015perceived, van2019automated].
Related Work: To automate the fetal body part measurements, researchers have used computer-aided diagnosis, including the most advanced methods offered by deep learning. The problem of finding suitable views that meet the criteria of a standard measurement plane has been investigated [baumgartner2017sononet, burgos2020evaluation]. Substantial research has been done to find the best algorithms for segmenting single fetal body parts. Encouraging results were achieved for segmentation of fetal heads [sinclair2018human, zeng2021fetal] and abdomens [ravishankar2016hybrid, sinclair2018cascaded]. Simultaneous segmentation and classification has been proposed for the abdomen [jang2017automatic], for the head [kim2018machine, budd2019confident], and for both [wu2017cascaded]. Unlike ours, the large majority of algorithms developed to date focus on solving only one task at a time. Some address the task of choosing a good plane for fetal measurement, while others focus on segmenting a single fetal body part.
To the best of our knowledge, the method most similar to ours is that of Liu et al. [liu2020automated]. The authors developed a model that can classify and segment fetal head, abdomen and femur simultaneously. However, our method differs from [liu2020automated] in the following aspects. First, their model was trained on single image frames and does not provide temporal analysis of fetal US scan videos. Second, their model always assigns a body part to every frame. This is an important limitation because ultrasound recordings contain many frames that show no body part of interest suitable for measurement and/or segmentation, and therefore their solution is not designed to work with ultrasound recordings. The method of [prieto2021automated] can recognize background frames (frames not appropriate for measurement). However, the authors used an inpainting method to remove pixels with embedded annotations in retrospective cohort study data. Such an image modification technique is not ideal in investigations that use deep learning, and it is not practical.
In this paper, we propose an end-to-end pipeline called FetalNet that is designed to jointly localize, classify and measure the fetal body parts at the frame level. We examine how temporal information extracted from the frame sequence, combined with the attention mechanism and the stacked module, affects the measurement of fetal body parts in fetal US video recordings.
Contributions: The main contributions of our work are as follows: (i) we propose an end-to-end multi-task method called FetalNet for comprehensive spatio-temporal fetal US video analysis to localize, classify and measure the fetal body parts simultaneously, (ii) we extend an attention gate mechanism by aggregating multi-scale feature maps of each decoder output to learn the local and global context of the fetal body structures, which helps our method outperform state-of-the-art results in both segmentation and classification.
In this section, we describe a multi-task learning neural network for spatio-temporal fetal US video analysis. Next, we describe how to automatically obtain measurements of each fetal body part. Figure 1 shows an overview of our method called FetalNet for the automatic evaluation of fetal biometric measurements on fetal US scan videos using a multi-task deep learning framework.
Multi-task neural network: Inspired by [mehta2018net, wang2018simultaneous], we use an encoder-decoder based convolutional neural network (CNN) for simultaneous segmentation and classification of the fetal body parts in fetal US video sequences. The encoder extracts high-level US image features, and its output is forwarded to the ConvLSTM-based bottleneck. The ConvLSTM cells are able to retain spatio-temporal US image features in memory, which can effectively improve the precision and accuracy of both classification and segmentation. Because fetal body parts vary in shape and size across our dataset, we employ an attention gate mechanism that implicitly learns to suppress irrelevant regions in an input video sequence while highlighting the salient features of the target region of interest. The attention gate mechanism helps to better exploit local information, to localize objects (i.e. fetal body parts) efficiently and to improve prediction performance. Every encoder block forwards its output feature maps through an attention gate to be concatenated in the decoder part. In fetal examinations, fetal body parts are hardly visible, and the sonographer’s manual examination relies heavily upon low-level semantic information to draw boundaries. To improve the binary prediction feature maps, we employ deep supervision [harrison2017progressive] to connect the multi-scale lower- and higher-level features of each decoder in what we call a stacked module. Multi-scale feature maps help to encode both global and local context. We use a set of 2D convolutional layers to up-sample the feature maps after each convolutional block. Then, we combine them with the previous high-level feature maps into an aggregated binary segmentation map. As our ablation study demonstrates, a stacked module with an attention gate can significantly improve segmentation and reduce the measurement error of the fetal body parts compared with both attention U-Net and stacked module U-Net.
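For reference, the additive attention gate of [oktay2018attention], on which this design builds, can be written as below. The notation is taken from that work and is an assumption here: $x_i$ denotes the skip-connection features at pixel $i$, $g_i$ the gating signal from the coarser decoder level, $\sigma_1$ a ReLU and $\sigma_2$ a sigmoid. The stacked aggregation added in FetalNet is not part of this formula.

```latex
q_{att,i} = \psi^{T}\,\sigma_1\!\left(W_x^{T} x_i + W_g^{T} g_i + b_g\right) + b_\psi,
\qquad
\alpha_i = \sigma_2\!\left(q_{att,i}\right),
\qquad
\hat{x}_i = \alpha_i \, x_i
```

The coefficient $\alpha_i \in [0, 1]$ scales each skip-connection feature, so regions judged irrelevant by the gating signal contribute little to the decoder input.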
For the classification branch, we use spatial-temporal ConvLSTM-output features to classify each of the following classes: fetal head, abdomen, femur or background at the frame level. Figure 2 shows the proposed multi-task learning method called FetalNet for spatio-temporal fetal ultrasound scan video analysis.
Biometric measurement: First, we resize the segmentation output of our multi-task neural network, which comes in the form of a binary mask, to the size of the input image. Next, we apply binary thresholding and perform erosion followed by dilation, using a cross-shaped structuring element. This ensures that the predicted masks are sufficiently denoised. Finally, we use a median blur filter to smooth the edges of the predicted masks. Depending on the classification results, we use different methods to obtain adequate measurements. For HC and AC, we first find the contours of the segmentation output. Next, we use the Ramer-Douglas-Peucker approximation [ramer1972iterative, douglas1973algorithms] to enhance the accuracy of the subsequent ellipse fitting, and we calculate and store the perimeter of the fitted ellipse. Additionally, to acquire the measurement of BPD, we store the length of the short axis of the ellipse fitted to the head. To obtain FL, we fit a tight rectangular bounding box to the contours of the predicted binary mask, found using the same contour function as for the head and abdomen. Next, we store the length of the fitted rectangle. Finally, we convert all of the obtained pixel-valued measurements to centimetres by multiplying them by the pixel spacing, an attribute that encodes the physical distance between the centres of the pixels and is stored in the DICOM file metadata.
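The last stage of this procedure, turning fitted shapes into biometric values, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the ellipse perimeter uses Ramanujan's approximation (the text does not say how the perimeter is computed), and the pixel spacing is assumed to be given in millimetres per pixel, as in the DICOM Pixel Spacing attribute, so the result is divided by 10 to obtain centimetres.

```python
import math


def ellipse_perimeter(long_axis: float, short_axis: float) -> float:
    """Approximate the perimeter of an ellipse from its full axis lengths.

    Uses Ramanujan's second approximation (an assumption; the source does
    not specify how the perimeter of the fitted ellipse is calculated).
    """
    a, b = long_axis / 2.0, short_axis / 2.0  # semi-axes
    h = ((a - b) ** 2) / ((a + b) ** 2)
    return math.pi * (a + b) * (1.0 + 3.0 * h / (10.0 + math.sqrt(4.0 - 3.0 * h)))


def pixels_to_cm(length_px: float, spacing_mm_per_px: float) -> float:
    """Convert a length in pixels to centimetres (spacing assumed in mm/pixel)."""
    return length_px * spacing_mm_per_px / 10.0


def biometry_from_fit(label: str, long_px: float, short_px: float,
                      spacing_mm_per_px: float) -> dict:
    """Map a fitted shape to biometric measurements in centimetres.

    label    -- predicted frame class: "head", "abdomen", "femur" or "background"
    long_px  -- long axis of the fitted ellipse, or length of the fitted rectangle
    short_px -- short axis of the fitted ellipse, or width of the fitted rectangle
    """
    if label == "head":
        perimeter_px = ellipse_perimeter(long_px, short_px)
        return {
            "HC": pixels_to_cm(perimeter_px, spacing_mm_per_px),
            "BPD": pixels_to_cm(short_px, spacing_mm_per_px),  # short ellipse axis
        }
    if label == "abdomen":
        perimeter_px = ellipse_perimeter(long_px, short_px)
        return {"AC": pixels_to_cm(perimeter_px, spacing_mm_per_px)}
    if label == "femur":
        # FL is the length (longer side) of the fitted rectangle.
        return {"FL": pixels_to_cm(max(long_px, short_px), spacing_mm_per_px)}
    return {}  # background frames yield no measurement


head = biometry_from_fit("head", long_px=420.0, short_px=340.0, spacing_mm_per_px=0.25)
print({name: round(value, 2) for name, value in head.items()})
```

In practice the axis lengths would come from the ellipse and rectangle fits described above; here they are passed in directly so the conversion logic stays independent of any particular image-processing library.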
3 Experiments and Results
In this section, we introduce a novel Fetal dataset and show the performance of our method called FetalNet on this dataset.
Fetal dataset: Our dataset consists of 700 two-dimensional (2D) fetal US video sequences of head, abdomen and femur examinations from 700 different patients. Each 2D fetal US video sequence consists of between 250 and 460 frames. Overall, our dataset consists of over 274,000 frames. The data come from volunteer pregnant women with pregnancies between the 15th and 38th weeks of gestation and were acquired during routine clinical screening examinations. Data were acquired using five different GE Voluson ultrasound scanners (E8, E10, S6, S8, P8) at two different resolutions: 975 × 742 pixels or 1100 × 960 pixels. From the video sequences, sonographers identified standard views that are suitable for performing the measurements and annotated them. Overall, our annotated dataset consists of 62324 standard views and 211951 background views. We used the following number of examples per class for training and validation: 32215 heads, 26403 abdomens, 3706 femurs and 211951 backgrounds. The background class shows indistinguishable structures around standard view plane frames.
The data come from six different research institutes and were anonymized before use in this study. Six sonographers with experience (40, 25, 20, 20, 15, 8 years, respectively) provided ground truths for the dataset in the form of annotations drawn on the anonymized images and values of the performed measurements. The annotations of heads and abdomens were ellipses similar to those that were used for manual measurements typically done at the ultrasound scanner. Annotations of femurs were created by free-hand drawings of their outlines.
Preprocessing: First of all, we transform 2D fetal US video sequences into separate ordered frames. Then, we anonymize raw 2D US images by removing personal data displayed on the top of the images. Next, we remove unnecessary text burned in images like device settings. In the next step, we convert raw DICOM data into PNG files. Finally, we randomly split our dataset into 60% training (420 cases), 20% validation (140 cases) and 20% test (140 cases) set.
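The case-level split described above can be sketched as follows. This is an illustrative sketch only: the function name and the fixed seed are assumptions, and the source does not state how the random split was implemented. Splitting by case rather than by frame keeps all frames of one patient in the same subset.

```python
import random


def split_cases(case_ids, train_frac=0.6, val_frac=0.2, seed=0):
    """Randomly split case identifiers into training, validation and test sets.

    The seed is a hypothetical choice made here for reproducibility.
    """
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train_frac)
    n_val = round(len(ids) * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # remaining cases form the test set
    return train, val, test


train, val, test = split_cases(range(700))
print(len(train), len(val), len(test))  # 420 140 140
```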
Data Augmentation: During training, to prevent overfitting and make the neural network more robust, we apply various data augmentation techniques: random (i) rotation between -15 and 15 degrees, (ii) brightness, (iii) contrast, (iv) horizontal and (v) vertical flip. We also use a shuffled sampler.
We use the Jaccard Index (IoU) and Dice Similarity Coefficient (DSC) for the segmentation task as evaluation metrics. For classification, we use accuracy, precision, recall and F1-score. Finally, we evaluate measurement performance using Absolute Difference (ADF).
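For concreteness, the segmentation and measurement metrics can be computed as in the sketch below. The function names are illustrative, masks are represented as nested lists of 0/1 values for simplicity, and returning 1.0 when both masks are empty is a convention assumed here rather than one stated in the text.

```python
def iou_and_dice(pred, target):
    """Return (IoU, DSC) for two binary masks of equal shape."""
    p = [int(bool(v)) for row in pred for v in row]
    t = [int(bool(v)) for row in target for v in row]
    intersection = sum(a & b for a, b in zip(p, t))
    total = sum(p) + sum(t)
    union = total - intersection
    if total == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0, 1.0
    return intersection / union, 2.0 * intersection / total


def absolute_difference(predicted, reference):
    """Absolute difference between a predicted and a reference measurement."""
    return abs(predicted - reference)


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
iou, dsc = iou_and_dice(pred, target)
print(iou, round(dsc, 3))  # 0.5 0.667
```

For a single pair of masks the two scores are linked by DSC = 2·IoU / (1 + IoU), so they rank individual predictions identically and differ only in scale.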
Implementation details: We base our network on U-Net [ronneberger2015u]. Our encoder-decoder includes eight convolutional blocks, four in the encoder part and four in the decoder part. We connect encoders to decoders via skip-connections with attention gates [oktay2018attention], which focus the network on certain parts of the images. To improve segmentation performance and smooth the predicted binary masks, we stack multi-scale probability feature maps by up-sampling the convolutional outputs of the side feature maps to the input image size. We extend the original U-Net implementation so that each block consists of the following order: Conv3x3-BatchNorm-ReLU-Conv3x3-BatchNorm-ReLU-Dropout2D. After each block of the encoder part, we apply a Max Pooling layer with stride = 2. The number of feature maps in the input layer is equal to n = 64. The rest of the eight convolutional blocks consist of 2n-4n-8n-16n-8n-4n-2n-n feature maps. For the classification branch, we apply Adaptive Average Pooling 2D and Dropout2D to the ConvLSTM output before a Fully Connected layer. We use Adam as the optimizer, with weight decay, and a Weighted Cross-Entropy Loss with the following weights per class: 0.25, 0.25, 0.4 and 0.1, respectively.
Our training set contains an overall of 87771 fetal US images and annotations of: 26072 fetal heads, 20901 abdomens, 2956 femurs and 169500 of background as 0, 1, 2, 3 class, respectively. We scale all images and annotation masks to 224 × 224 pixels. We train the neural network on an NVIDIA Titan RTX 24GB GPU for 240h. We implement our neural network in Python, based on the PyTorch deep learning library. Figure 2 shows the proposed neural network.
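To make the effect of the per-class weights concrete, the sketch below computes a weighted cross-entropy over a batch of frame-level logits in plain Python. It rests on two assumptions that the text does not confirm: the weights are listed in class order (head, abdomen, femur, background), and the batch loss is normalized by the sum of the target weights, which is how PyTorch's `CrossEntropyLoss` behaves with a `weight` argument and mean reduction.

```python
import math

# Assumed class order: 0 = head, 1 = abdomen, 2 = femur, 3 = background.
CLASS_WEIGHTS = (0.25, 0.25, 0.4, 0.1)


def weighted_cross_entropy(logits, targets, weights=CLASS_WEIGHTS):
    """Weighted cross-entropy over a batch.

    logits  -- list of per-frame score lists, one score per class
    targets -- list of integer class indices, one per frame
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for scores, target in zip(logits, targets):
        # Numerically stable log-sum-exp of the scores.
        peak = max(scores)
        log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scores))
        negative_log_likelihood = log_sum_exp - scores[target]
        weighted_sum += weights[target] * negative_log_likelihood
        weight_total += weights[target]
    return weighted_sum / weight_total


# A misclassified femur frame costs more than a misclassified background frame.
femur_wrong = weighted_cross_entropy([[5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 5.0]], [2, 3])
background_wrong = weighted_cross_entropy([[0.0, 0.0, 5.0, 0.0], [5.0, 0.0, 0.0, 0.0]], [2, 3])
print(femur_wrong > background_wrong)  # True
```

With these weights the rare femur class contributes four times as much per frame as the abundant background class, which counteracts the strong class imbalance in the training set.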
Segmentation results: We evaluate our model on 57001 test images of fetal head (7250 images), abdomen (6580 images), femur (720 images) and background (42451 images) class. For segmentation, we use Jaccard Index, also known as Intersection over Union (IoU) and the Dice similarity coefficient (DSC) as the evaluation metrics. We obtain the following results: 0.905 and 0.962 for IoU and DSC, respectively. The qualitative results of our proposed network are depicted in Figure 3. It can be seen that our method was able to localise fetal head, abdomen and femur, which are subject to variability in scale and appearance.
Classification results: For classification, we use accuracy, precision, recall and F1 score as the evaluation metrics. We obtained the following results: 0.96, 0.97 and 0.96 for precision, recall and F1 score, respectively. Table 1 compares the FetalNet classification results against base U-Net, Deeplabv3 [chen2017rethinking], FCN-8s and FCN-32s [long2015fully]. The proposed system outperforms these state-of-the-art neural networks for head, abdomen and femur in both segmentation and classification accuracy.
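The frame-level classification metrics can be computed per class as in the following sketch. The function names are illustrative, the class indices 0 to 3 are assumed to stand for head, abdomen, femur and background, and the text does not state how per-class scores were averaged into the single reported values, so no averaging is applied here.

```python
def per_class_scores(y_true, y_pred, num_classes=4):
    """Return a list of (precision, recall, F1) tuples, one per class."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denominator = precision + recall
        f1 = 2.0 * precision * recall / denominator if denominator else 0.0
        scores.append((precision, recall, f1))
    return scores


def accuracy(y_true, y_pred):
    """Fraction of frames whose predicted class matches the reference class."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


y_true = [0, 0, 1, 2, 3, 3]
y_pred = [0, 1, 1, 2, 3, 3]
for class_index, (p, r, f1) in enumerate(per_class_scores(y_true, y_pred)):
    print(class_index, round(p, 3), round(r, 3), round(f1, 3))
```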
Our method outperforms the current state-of-the-art methods in both classification and segmentation (Table 1), and the fetal body part measurement (Table 2). As shown in Figure 3, our segmentation results of standard view scans are comparable to the ground-truths provided by experienced sonographers.
Measurement results: Table 2 reports the measurement error (in mm) for the fetal head, abdomen and femur against state-of-the-art neural networks, given as the mean and standard deviation.
Figure panels: (a) head, (b) abdomen, (c) femur.
We conduct an ablation study to show the effectiveness of the proposed method called FetalNet in terms of both segmentation and classification. Unless otherwise mentioned, we use the same dataset and neural network hyperparameters for each experiment. Table 3 shows the experiments for the proposed method with different combinations of modules. As we can see, multi-task learning improves segmentation results compared with the U-Net base model. For segmentation, the proposed loss combination performed better than using Dice loss alone. The results were further improved after introducing the attention gate mechanism and the stacked module separately. Finally, training our model with all of the proposed extensions results in a notable performance gain across all the metrics.
In this paper, we propose an end-to-end multi-task method called FetalNet for spatio-temporal analysis of full-length routine fetal US scan videos. In particular, we consider the attention mechanism and present how to incorporate it into fetal biometric measurement to better exploit local structures. We introduce aggregation of multi-scale feature maps as a stacked module, making our approach more robust in spatio-temporal fetal US scan video analysis, where the previous methods fail. This allows for accurate and precise simultaneous localization and classification of the fetal body parts in freehand fetal US video recordings. Our method adds a classification branch to the U-Net-based encoder-decoder neural network. To learn temporal features, we employ the ConvLSTM layer as a bottleneck. To make our method more robust to ultrasound noise and shadows, we exploit an attention gate mechanism to focus on relevant ROIs at the frame level. We introduce a stacked module that aggregates the multi-scale feature maps of the decoder to learn the local and global context of the target. The ablation study (Table 3) shows that, using both additional modules, our method achieves better results in segmentation and classification of the fetal body parts.
In this paper, we proposed an end-to-end multi-task method called FetalNet for spatio-temporal fetal US scan video analysis. FetalNet is designed to jointly localize, classify and measure the fetal body parts during routine freehand fetal US examinations. The proposed method has potential as a fetal biometry assistance tool for clinical use by inexperienced personnel. Use of our approach in a clinical environment requires real-time feedback for a sonographer during routine fetal US examinations. Due to the large number of model parameters, we will implement a more efficient neural network that can run on computationally low-cost devices. Additionally, we will make our model more robust by adding to the training set data generated on a larger variety of devices, as well as low-quality data that simulate acquisitions made by non-expert personnel. We will also extend our method to automatically detect abnormalities of the fetus and to perform a direct estimation of gestational age and fetal weight.
The authors would like to thank the following medical sonographers for data, annotations and clinical expertise: Jan Klasa, MD; Bogusław Marinković, MD; Wojciech Górczewski, MD; Norbert Majewski, MD; Anita Smal-Obarska, MD.