Ultrasound (US) is the primary screening method for assessing fetal health and development because it is low-cost, real-time and radiation-free. During diagnosis, sonographers generally first scan pregnant women to obtain US videos or volumes, then manually localize standard planes (SPs) in them. Next, they measure biometrics on the planes and make the diagnosis. Of these steps, SP acquisition is vital for subsequent biometric measurement and diagnosis. However, acquiring nearly thirty SPs during the diagnosis is very time-consuming, and the process often requires extensive experience because of the large variation in fetal posture and the complexity of SP definitions. Thus, automatic SP localization is highly desirable to improve diagnostic efficiency and decrease operator dependency.
I-A Standard Plane Localization
2D and 3D US are two typical modalities used in prenatal diagnosis. 2D US is easy to use and has better imaging quality. However, automatic 2D SP localization may fail to detect SPs that the clinician did not scan, because the fetus is not directly visible and each plane occupies a fine position. 3D US can contain multiple SPs in a single shot and has the inherent advantages of lower user dependency and higher efficiency compared with 2D US. Usually, after obtaining the 3D US, the sonographer shifts and rotates the current view plane to approach the SP. However, it is very challenging to manually localize SPs in the volume due to the huge search space, the large fetal posture variability and the low image quality. Therefore, automatic methods for localizing SPs in 3D US would improve diagnostic efficiency and decrease operator dependency by providing a de-specialized scanning method for non-experts.
I-B Termination Strategy for Reinforcement Learning
In reinforcement learning (RL), the agent needs to decide whether to terminate the inference. The termination conditions are usually pre-set, such as reaching the destination in MountainCar or the pole falling in CartPole. However, the termination conditions are often indistinct and cannot be precisely determined in many tasks (e.g., object detection, landmark detection and SP localization). Specifically, in SP localization, the agent might fail to catch the target SP and continue to explore when no termination condition exists. One solution was to extend the action space with an additional terminate action. However, enlarging the action space results in insufficient training. Several works terminated the agent's search by detecting oscillation or the lowest Q-value. Although no additional action was introduced, these approaches still required the agent to complete inference up to the maximum step, which is inefficient. Therefore, a dynamic termination strategy that ensures both the efficacy and the efficiency of SP localization is highly desirable.
I-C Related Work
In our review of the related work on SP localization, we first introduce the approaches based on 2D US, and then summarize the 3D US methods. Finally, we describe our previous deep RL-based algorithm.
I-C1 Standard Plane Detection in 2D US
Early studies selected SPs based on conventional machine learning methods (i.e., AdaBoost, random forest, support vector machine) by detecting key anatomical structures or landmarks in each frame of the video. Recent approaches made use of the convolutional neural network (CNN) due to its powerful ability to learn hierarchical representations automatically. The first two studies [6, 7] built the CNN model with transfer learning to detect fetal SPs. Chen et al. then equipped the CNN with a recurrent neural network (RNN) to capture spatial-temporal information and detect three fetal SPs. A similar design can also be found in [17, 13]. Baumgartner et al. further proposed a weakly-supervised approach to detect 13 fetal SPs and locate the region of interest in each plane. Schlemper et al. then incorporated a gated attention mechanism into the CNN to contextualize local information for detecting SPs. More recently, some works [42, 24, 22] proposed to assess US image quality automatically. Wu et al. first introduced a quality assessment system for the fetal abdominal plane based on a cascade CNN. Luo et al. and Lin et al. then proposed to assess the quality of fetal brain, abdomen and heart SPs by multi-task learning. These methods showed the efficacy of detecting SPs and assessing image quality by transfer learning, spatial-temporal information, attention mechanisms and multi-task learning. However, automatic SP detection in 2D US still depends heavily on clinicians' scanning skills.
I-C2 Standard Plane Localization in 3D US
Different from 2D US, localizing SPs in 3D US usually faces the challenges of low image quality, large data size and a huge search space. A number of works [46, 29, 23, 10] formulated this task as a cascade pipeline (i.e. from landmark detection to SP regression) based on conventional machine learning methods. Although these methods are effective thanks to prior anatomical knowledge, their performance is still limited by landmark detection accuracy and by the difference between the testing case and the model. Recently, Ryou et al. proposed to locate the fetus by random forest and detect SPs by CNNs sequentially. Schmidt-Richberg et al. proposed a deep neural network to move the estimated plane to the target SP iteratively. They further customized an RL-based agent for view plane searching in MRI volumes. RL is promising for SP localization in 3D US because it can mimic experts' operation and explore inter-plane dependency through agent-environment interaction. However, the RL solution may suffer from its random initialization and empirical termination when its environment, such as the US volume, has strong noise, artifacts and large appearance variations.
To address the issues mentioned above, our previous study proposed an RL-based framework to automatically localize SPs in 3D US. We equipped the RL framework with a landmark-aware alignment module for warm start to ensure its effectiveness. In this module, we leveraged a CNN to detect anatomical landmarks in the US volume and registered them to a plane-specific atlas, thus providing strong spatial bounds and effective initialization for the RL agent. Furthermore, instead of passively and empirically terminating the agent inference, we introduced a learning-based strategy for active termination of the agent's interaction procedure through an RNN module. The learning-based strategy can achieve optimal termination adaptively, thus improving the accuracy and efficiency of the localization system.
In this study, we further improve the stability, robustness and efficiency of our previous method. This article differs considerably from the previous conference paper in the following respects:
We design an adaptive dynamic termination based on our previous work, which enables an early stop of the agent's search and results in an efficiency-steered localization system. Dynamic termination is an important yet unsolved problem in reinforcement learning; our work provides the first effective solution for it and can be generalized to other similar scenarios.
We validate the effectiveness and the generalizability of our method on a large multi-organ dataset including 433 fetal brain volumes, 519 fetal abdomen volumes, and 683 uterus volumes. Specifically, we propose to localize seven SPs (Fig. 1) from multiple organs, in contrast to the two SPs from one organ in our previous work.
We conduct comprehensive experiments to validate the superiority of our method over existing ones in terms of SP localization performance, computation efficiency, effectiveness of the proposed adaptive termination module, and biometric and qualitative evaluation of the results from the perspective of clinical practice.
As shown in Fig. 2, an automatic plane localization system for 3D US is proposed to imitate the diagnosis of experienced physicians. This system is implemented with a two-step unified RL framework. First, a landmark-aware alignment module is adopted to reduce the large search space caused by the complex intrauterine environment and diverse fetal postures. Then, a deep reinforcement model searches for the target SPs within the bounded environment resulting from the alignment module. An adaptive RNN-based termination module is adopted to dynamically stop the RL agent at the optimal interaction.
II-A Deep Reinforcement Learning Framework
In the classical deep RL framework, the agent interacts with the environment by making successive actions $a \in \mathcal{A}$ to maximize the expected reward, where $\mathcal{A}$ is the action space. Meanwhile, a plane in 3D space is modelled as $\cos(\zeta)x + \cos(\beta)y + \cos(\phi)z + d = 0$, where $\mathbf{n} = (\cos(\zeta), \cos(\beta), \cos(\phi))$ denotes the unit normal vector of the plane, and $d$ is its Euclidean distance from the origin. In this work, the origin is set as the center of a US volume. We therefore define the main elements of this plane-localization RL framework as follows:
State: The state is defined as the 2D US image reconstructed from the volume given the current plane parameters. Since the reconstructed image size may change, we pad the image to a square and resize it to 224×224. In addition, we concatenate the two images obtained from the previous two iterations with the current plane to enrich the state information, similar to prior work.
Action: The action is defined as an incremental adjustment to the plane parameters. The complete action space is defined as $\mathcal{A} = \{\pm a_{\zeta}, \pm a_{\beta}, \pm a_{\phi}, \pm a_{d}\}$. Given an action, the plane parameters are modified accordingly (e.g. $\zeta_t = \zeta_{t-1} + a_{\zeta}$). In each iteration, we perform one action that adjusts only one plane parameter and leaves the others unchanged. Specifically, the angle is adjusted by a fixed step size, while the distance step size is set as 0.5 voxel in each iteration.
Reward: The reward signal defines the goal in an RL problem. It tells the agent which policy to follow when selecting an action. In this study, the reward indicates whether the agent approaches or moves away from the target, and is obtained by $r = \mathrm{sgn}(D(P_{t-1}, P_g) - D(P_t, P_g))$, where $D$ computes the Euclidean distance between two sets of plane parameters, $P_t$ and $P_g$ indicate the plane parameters of the predicted plane in iteration $t$ and of the ground truth, and $\mathrm{sgn}(\cdot)$ is the sign function. The set of possible reward values is $r \in \{+1, 0, -1\}$, where $+1$ and $-1$ indicate positive and negative movement, respectively, and $0$ refers to no adjustment.
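The action and reward definitions can be illustrated with a minimal sketch. It assumes a plane stored as three direction angles plus a distance, a hypothetical angle step of 1.0 degree (the source value is not reproduced here), and a plain Euclidean distance over the parameter vector that mixes degrees and voxels for simplicity. All names are illustrative rather than taken from the authors' code.

```python
import math

# Assumption: a plane is stored as (zeta, beta, phi, d): three direction
# angles in degrees plus a distance in voxels. Names are illustrative.
ANGLE_STEP = 1.0   # hypothetical value, not taken from the source
DIST_STEP = 0.5    # voxels per iteration, as stated in the text

# Eight actions: +/- one step on exactly one of the four parameters.
ACTIONS = [(index, sign) for index in range(4) for sign in (+1, -1)]


def apply_action(plane, action):
    """Return a new plane with a single parameter nudged by one step."""
    index, sign = action
    step = DIST_STEP if index == 3 else ANGLE_STEP
    updated = list(plane)
    updated[index] += sign * step
    return tuple(updated)


def param_distance(plane, target):
    """Euclidean distance between two plane-parameter vectors."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(plane, target)))


def reward(prev_plane, curr_plane, target):
    """Sign of the change in distance: +1 closer, -1 farther, 0 unchanged."""
    diff = param_distance(prev_plane, target) - param_distance(curr_plane, target)
    return (diff > 0) - (diff < 0)
```

For example, starting two angle steps away from the target on the first angle, the action `(0, -1)` moves the plane closer and yields a reward of +1, while undoing it yields -1.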
Agent: The agent is a policy component that outputs actions by interacting with the environment. In this study, we adopt Q-learning as the solution for SP localization. Unlike existing work that uses a deep neural network to estimate the Q-value directly, dueling learning is utilized to encourage the agent to learn which states should be weighted more and which are redundant in choosing proper actions. Specifically, the Q-value function is decomposed into a state value function and a state-dependent action advantage function.
A CNN serves as the convolutional backbone. The number of features in each layer is 64, 128, 256, 512 and 512, respectively. To mitigate the vanishing gradient issue, we add a batch normalization layer after each convolutional layer in the neural network. The extracted high-level features are then fed into two independent streams of fully connected layers to estimate the state value and the state-dependent action advantage value. The hidden units of the fully connected layers are 512, 128, 1 in the state value estimation stream, and 512, 128, 8 in the state-dependent action advantage estimation stream, respectively. The outputs of the two streams are fused to predict the final Q-value.
Replay Buffer: The replay buffer is a memory container that stores the transitions of the agent so that experience replay can be performed during learning. Each element is typically represented as a vector $(s_t, a_t, r_t, s_{t+1})$, where $s_t$, $a_t$ and $r_t$ denote the state, action and reward at step $t$, and $s_{t+1}$ is the resulting next state. In this study, the prioritized replay buffer is adopted to improve the learning efficiency.
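A minimal sketch of a prioritized replay buffer is given below to make the idea concrete. It uses simple proportional sampling; the exponent `alpha`, the constant `eps` and the class itself are assumptions for illustration and are not taken from the paper, which may use a different variant (for instance one with importance-sampling weights).

```python
import random


class PrioritizedReplayBuffer:
    """Fixed-capacity buffer that samples transitions in proportion to priority."""

    def __init__(self, capacity, alpha=0.6, eps=1e-3):
        self.capacity = capacity
        self.alpha = alpha      # hypothetical exponent: 0 gives uniform sampling
        self.eps = eps          # keeps every priority strictly positive
        self.transitions = []
        self.priorities = []
        self.next_slot = 0

    def add(self, transition, td_error=1.0):
        priority = (abs(td_error) + self.eps) ** self.alpha
        if len(self.transitions) < self.capacity:
            self.transitions.append(transition)
            self.priorities.append(priority)
        else:  # overwrite the oldest entry once the buffer is full
            self.transitions[self.next_slot] = transition
            self.priorities[self.next_slot] = priority
        self.next_slot = (self.next_slot + 1) % self.capacity

    def sample(self, batch_size, rng=random):
        indices = rng.choices(range(len(self.transitions)),
                              weights=self.priorities, k=batch_size)
        return indices, [self.transitions[i] for i in indices]

    def update_priority(self, index, td_error):
        self.priorities[index] = (abs(td_error) + self.eps) ** self.alpha
```

Transitions with a larger temporal-difference error are drawn more often, so the agent revisits the experiences it currently predicts worst.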
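Training Loss: As explained above, we decompose the Q-value function, $Q(s, a; \theta, \alpha, \beta)$, into two separate estimators: the state value function $V(s; \theta, \beta)$ and the state-dependent action advantage function $A(s, a; \theta, \alpha)$, where $s$ is the input state of the agent, $a$ is the action, $\theta$, $\alpha$ and $\beta$ represent the parameters of the convolution layers and the two streams of fully-connected layers, respectively, and $\omega = \{\theta, \alpha, \beta\}$. The Q-value function of the agent is calculated as:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \Big( A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha) \Big), \tag{1}$$

where $|\mathcal{A}|$ denotes the size of the action space. The loss function for our framework is then defined as:

$$\mathcal{L}(\omega) = \mathbb{E}_{(s, a, r, s') \sim U(\mathcal{M})} \Big[ \big( r + \gamma \max_{a'} Q(s', a'; \tilde{\omega}) - Q(s, a; \omega) \big)^2 \Big], \tag{2}$$

where $\gamma$ is a discount factor to weight future rewards; $s'$ and $a'$ are the state and the action in the next step; $U(\mathcal{M})$ represents uniform data sampling from the experience replay memory $\mathcal{M}$; $\omega$ and $\tilde{\omega}$ are the parameters of the Q network and the target Q network, respectively.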
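The training objective can be sketched numerically as follows. The sketch assumes the standard dueling aggregation (state value plus mean-centred advantage) and a one-step bootstrapped target computed from a separate target network; whether the original loss uses exactly this target form is an assumption, and the function names are illustrative.

```python
def dueling_q_values(state_value, advantages):
    """Q(s, a) = V(s) + (A(s, a) - mean over actions of A(s, a'))."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


def td_target(reward, next_q_target, gamma=0.9):
    """Bootstrapped target r + gamma * max_a' Q_target(s', a')."""
    return reward + gamma * max(next_q_target)


def squared_td_error(q_online, action, target):
    """Per-transition loss term (target - Q(s, a))^2."""
    return (target - q_online[action]) ** 2
```

Subtracting the mean advantage makes the decomposition identifiable: the average of the resulting Q-values over actions equals the state value.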
II-B Landmark-aware Plane Alignment for Warm Start
Due to the low image quality, large data size and diverse fetal postures, it is very challenging to localize SPs in 3D US. Moreover, the random state initialization used in prior work often fails to localize SPs because of the noisy 3D US environment. Therefore, a landmark-aware alignment module was proposed in our previous study as a dedicated warm start of the searching process based on anatomical prior knowledge. A more concrete processing pipeline is detailed in this section.
This landmark-aware module aligns US volumes to the atlas space, thus reducing the diversity of fetal posture and US acquisition. As shown in Fig. 4, our proposed alignment module consists of two steps, namely plane-specific atlas construction and testing volume-atlas alignment. The details are described as follows.
Plane-specific Atlas Construction: In this study, the atlas is constructed to initialize the SP localization in the testing volume through landmark-based registration. Hence, the atlas selected from the training dataset must contain both reference landmarks for registration and SP parameters for plane initialization. As shown in Fig. 4, instead of selecting a common anatomical model for all SPs [27, 23], we propose to select a specific atlas for each SP to improve the localization accuracy. To ensure effective initialization, the specific SP of the selected atlas should ideally be as close as possible to the SPs of the other training volumes. Algorithm 1 shows how the plane-specific atlas volume is determined from the training dataset based on the minimum plane error (i.e. the sum of the angle and distance between two planes). During the training stage, each volume is first taken as a proxy atlas and registered to the remaining volumes by landmark-based rigid registration. The mean plane error between the registered planes and the ground truth is measured for each proxy atlas, and the volume with the minimum error is chosen as the final atlas.
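The atlas selection described above amounts to choosing the candidate with the smallest mean plane error. A minimal sketch follows; `plane_error` is a hypothetical callable standing in for the registration-plus-measurement step (rigid registration to the candidate atlas followed by the angle-plus-distance error), which is not implemented here.

```python
def select_atlas(volume_ids, plane_error):
    """Pick the volume whose registered plane is, on average, closest to all others.

    plane_error(atlas_id, other_id) is a hypothetical callable returning the
    plane error (angle + distance) after registering other_id to atlas_id.
    """
    best_id, best_mean = None, float("inf")
    for candidate in volume_ids:
        others = [v for v in volume_ids if v != candidate]
        if not others:
            continue
        mean_error = sum(plane_error(candidate, v) for v in others) / len(others)
        if mean_error < best_mean:
            best_id, best_mean = candidate, mean_error
    return best_id, best_mean
```

The search is quadratic in the number of training volumes, which is acceptable because it runs once per SP before training.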
Testing Volume-atlas Alignment: Our alignment module is based on landmark detection and matching. Unlike direct regression, we formulate landmark detection as a heatmap regression task to avoid learning a highly abstract mapping function (i.e. from feature representations to landmark coordinates). We trained a customized 3D U-net with the L2-norm regression loss, denoted as:

$$\mathcal{L}_{\mathrm{lm}} = \frac{1}{N} \sum_{i=1}^{N} \lVert \hat{H}_i - H_i \rVert_2, \tag{3}$$

where $N$ denotes the number of landmarks, and $\hat{H}_i$, $H_i$ represent the $i$-th predicted landmark heatmap and ground truth landmark heatmap, respectively. The ground truth heatmaps are created by placing a Gaussian kernel at the corresponding landmark location. During inference, we pass the test volumes to the landmark detector to get the predicted landmark heatmaps. The coordinates with the highest value in each landmark heatmap are selected as the final prediction. We map the volume to the atlas space through the transform matrix calculated from the landmarks to create a bounded environment for the agent. Furthermore, we use the annotated target plane function of the atlas as the initial starting plane function for the agent.
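The heatmap formulation can be sketched in a few lines: a Gaussian is placed at the landmark to form the target, and the landmark is read back as the location of the maximum. The kernel width `sigma`, the nested-list representation and the squared-difference loss are assumptions chosen for a dependency-free illustration, not the authors' settings.

```python
import math


def gaussian_heatmap(shape, center, sigma=1.5):
    """Dense 3D heatmap (nested lists) with a Gaussian peak at `center`."""
    depth, height, width = shape
    cz, cy, cx = center
    return [[[math.exp(-((z - cz) ** 2 + (y - cy) ** 2 + (x - cx) ** 2)
                       / (2.0 * sigma ** 2))
              for x in range(width)]
             for y in range(height)]
            for z in range(depth)]


def heatmap_argmax(heatmap):
    """Coordinates (z, y, x) of the largest value in a 3D heatmap."""
    best_value, best_coord = float("-inf"), None
    for z, plane in enumerate(heatmap):
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                if value > best_value:
                    best_value, best_coord = value, (z, y, x)
    return best_coord


def l2_heatmap_loss(predicted, target):
    """Sum of squared voxel differences between two heatmaps."""
    return sum((p - t) ** 2
               for pred_plane, tgt_plane in zip(predicted, target)
               for pred_row, tgt_row in zip(pred_plane, tgt_plane)
               for p, t in zip(pred_row, tgt_row))
```

In practice the predicted landmark coordinates would then feed a rigid registration to the atlas, which is outside the scope of this sketch.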
II-C Adaptive Dynamic Termination
Compared with the current empirical termination strategies [14, 1], our previous work indicated that a learning-based termination strategy can improve the planning performance in deep RL. However, it requires the whole Q-value sequence obtained over the maximum number of iterations to determine the final termination, which is inefficient. In this study, we upgrade the active termination strategy to an adaptive dynamic termination, which is proposed in a deep RL framework for the first time.
Specifically, considering the sequential characteristics of the iterative interaction, as shown in Fig. 2, we model the mapping between the Q-value sequence and the optimal step with an additional RNN model. The Q-value is defined as $q_t = \{q_t^1, q_t^2, \ldots, q_t^8\}$, consisting of 8 action candidates at iteration $t$; and the Q-value sequence refers to a time-sequential matrix $Q = [q_1, q_2, \ldots, q_n]$, where $n$ denotes the index of the iteration step. Taking the Q-value sequence as input, the RNN model can learn the optimal termination step based on the highest Angle and Distance Improvement (ADI; see Equation 7 in Section III-C). During training, we randomly sampled sub-sequences from the Q-value sequence as the training data and used the step with the highest ADI within the sampling interval as the ground truth.
Unlike the previous studies [12, 1], we design a dynamic termination strategy to improve the inference efficiency of the reinforcement framework. Specifically, our RNN model performs one inference every two iterations on the current zero-padded Q-value sequence, enabling an early stop at the first iteration step at which three repeated predictions have been obtained.
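The early-stop rule can be sketched as a small control loop. The sketch assumes that "three repeated predictions" means three consecutive identical (rounded) predictions, and `predict_step` is a hypothetical stand-in for the trained RNN; neither detail is confirmed by the text.

```python
def run_with_dynamic_termination(q_stream, predict_step, max_steps=75,
                                 every=2, repeats=3, num_actions=8):
    """Consume one Q-value vector per iteration and stop early when possible.

    q_stream: iterable of per-iteration Q-value lists (one value per action).
    predict_step: hypothetical stand-in for the trained RNN; it maps a
        zero-padded (max_steps x num_actions) sequence to a step index.
    Returns (chosen_step, iterations_used).
    """
    history, recent = [], []
    for iteration, q_values in enumerate(q_stream, start=1):
        history.append(list(q_values))
        if iteration % every == 0:
            padding = [[0.0] * num_actions] * (max_steps - len(history))
            recent.append(int(round(predict_step(history + padding))))
            if len(recent) >= repeats and len(set(recent[-repeats:])) == 1:
                return recent[-1], iteration   # early stop
        if iteration >= max_steps:
            break
    # No stable prediction: fall back to the last prediction or the final step.
    return (recent[-1] if recent else len(history)), len(history)
```

With a predictor that settles on step 10, the loop stops after 14 iterations rather than running all 75, which is the source of the efficiency gain.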
Our previous study used the Mean Absolute Error (MAE) loss function to train the RNN in the termination module. However, its back-propagated gradient has a constant magnitude, so it cannot reflect fine-grained errors. This study replaces it with the Mean Square Error (MSE) loss function to relieve this problem and obtain a more stable training procedure. Since the ground truth, i.e. the optimal termination step, is usually much larger than 1 (e.g. 10 to 75), the conventional MSE loss function may struggle to converge in training due to the excessive gradient. We therefore adopt the MSE loss function with a balance hyper-parameter, defined as:

$$\mathcal{L}(w) = \lVert f(Q; w) - \delta T \rVert_2^2, \tag{4}$$

where $w$ denotes the RNN parameters, $Q$ is the input sequence of the RNN, $f$ represents the RNN network, and $T$ denotes the optimal termination step. The balance hyper-parameter $\delta = 0.01$ normalizes the value range of the learned steps to approximately [0, 0.75], thus simplifying the training process. The RNN model is trained using inference results obtained from the training volumes.
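The effect of the balance hyper-parameter on gradient magnitude can be checked with a scalar sketch. It assumes the loss is the squared difference between the prediction and the termination step scaled by 0.01; the function names and the symbol `delta` are illustrative.

```python
def normalized_mse(prediction, optimal_step, delta=0.01):
    """Squared error against the termination step scaled by delta."""
    return (prediction - delta * optimal_step) ** 2


def normalized_mse_grad(prediction, optimal_step, delta=0.01):
    """Derivative of normalized_mse with respect to the prediction."""
    return 2.0 * (prediction - delta * optimal_step)


def mae_grad(prediction, optimal_step):
    """Derivative of |prediction - optimal_step|: constant magnitude 1."""
    diff = prediction - optimal_step
    return (diff > 0) - (diff < 0)
```

For a target step of 75 and an initial prediction of 0, the unscaled MSE gradient has magnitude 150, the scaled one about 1.5, and the MAE gradient stays at magnitude 1 regardless of how close the prediction is.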
III Experiment configurations
III-A Implementation Details
We implemented our framework in PyTorch, using a standard PC with an NVIDIA Titan XP GPU. We trained the whole framework with the Adam optimizer, using a learning rate of 5e-5 and a batch size of 4 for 100 epochs, which took about 4 days. We set the discount factor $\gamma$ in the loss function (Equation 2) as 0.9. The size of the Replay Buffer was set as 15000. The target Q network copied the parameters of the Q network every 1500 iterations. The maximum number of iterations in one episode was set as 75 in the fetal datasets and 30 in the uterus dataset to reserve enough moving space for agent exploration. The $\epsilon$ of the $\epsilon$-greedy action selection strategy was initially set as 0.6 and multiplied by 1.01 every 10000 iterations until it reached 0.95 during training. The RNN variants, i.e. vanilla RNN and Long Short-Term Memory (LSTM), were trained for 100 epochs using the mini-batch Stochastic Gradient Descent (SGD) optimizer with a learning rate of 1e-4 and a batch size of 100, which took about 45 minutes. The number of hidden units was 64 and the number of RNN layers was 2. The starting plane function for training the framework was randomly initialized around the ground truth plane within a bounded angle and distance range to ensure that the agent can explore enough space within the US volume. For landmark detection, we trained the network using the Adam optimizer with a batch size of 1 and a learning rate of 0.001 for 40 epochs.
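The schedule of the action-selection parameter can be worked out directly from the stated numbers (start 0.6, factor 1.01 every 10000 iterations, cap 0.95). The sketch below treats the parameter as a single scalar called epsilon, which is an assumption about the symbol; the function names are illustrative.

```python
def epsilon_at(iteration, start=0.6, factor=1.01, interval=10000, cap=0.95):
    """Action-selection parameter after `iteration` training iterations."""
    return min(cap, start * factor ** (iteration // interval))


def iterations_to_cap(start=0.6, factor=1.01, interval=10000, cap=0.95):
    """First iteration count at which the schedule reaches the cap."""
    updates = 0
    while start * factor ** updates < cap:
        updates += 1
    return updates * interval
```

Under these values the cap is reached after 47 multiplicative updates, i.e. after 470000 training iterations.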
We chose the hyper-parameters based on the validation set and evaluated the performance of our method with several metrics on the held-out test sets. Specifically, we trained the model with different magnitudes of each hyper-parameter and evaluated the performance on the validation dataset. We selected the hyper-parameter values with the best validation performance as the default setup for the training phase. In this study, three high-impact hyper-parameters, including the size of the Replay Buffer, were searched.
We validated the proposed framework using three distinct 3D US datasets, including fetal brain, fetal abdomen and uterus. Specifically, we aim to localize three SPs in the fetal brain, namely the transventricular (TV), the transthalamic (TT) and the transcerebellar (TC) SPs; the fetal abdominal (AM) SP in the fetal abdomen; and the mid-sagittal (S), the transverse (T) and the coronal (C) SPs in the uterus. We select three/four landmarks from each fetal/uterus US volume: the genu and the splenium of the corpus callosum, and the center of the cerebellar vermis for fetal brain volumes; the umbilical vein entrance, the centrum, and the neck of the gallbladder for fetal abdomen volumes; the two endometrial uterine horns, the endometrial uterine bottom and the uterine wall bottom for uterus volumes. Our dataset contains 1635 prenatal 3D US volumes (433 fetal brain, 519 fetal abdomen and 683 uterus US volumes). With approval from the local Institutional Review Board, all volumes were anonymized and obtained by experts using a Mindray DC-9 ultrasound system with an integrated 3D probe. The average volume size differs between the fetal and uterus datasets, and all volumes share a unified voxel size. Four sonographers with 5 years of experience provided manual annotations of landmarks and SPs for all the volumes. All the annotation results were double-checked under strict quality control by a senior expert with 20 years of experience. We randomly split our dataset into training, validation and test sets of 313, 20, 100 volumes for the fetal brain, 389, 20, 110 for the fetal abdomen, and 519, 20, 144 for the uterus.
III-C Evaluation Criteria
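In this study, we used three criteria to evaluate the localization accuracy of the predicted planes compared with the target planes. First, the angle and distance deviations between the two are estimated. Formally, we define:

$$\mathrm{Ang} = \arccos\left( \frac{\mathbf{n}_p \cdot \mathbf{n}_g}{\lVert \mathbf{n}_p \rVert \, \lVert \mathbf{n}_g \rVert} \right), \tag{5}$$

$$\mathrm{Dis} = \lvert d_p - d_g \rvert, \tag{6}$$

where $\mathbf{n}_p$ and $\mathbf{n}_g$ represent the normals of the predicted plane and the target plane, and $d_p$ and $d_g$ represent the distances from the volume origin to the predicted plane and to the ground truth plane. Note that $\mathrm{Ang}$ and $\mathrm{Dis}$ are evaluated on the plane sampling function at the effective voxel size. Moreover, it is also important to examine whether these two planes are visually alike. Therefore, the Structural Similarity (SSIM) index was used to measure the image similarity of the planes.
In addition, the ADI in iteration $t$ is defined as the sum of the cumulative changes of distance and angle from the start plane:

$$\mathrm{ADI}_t = (\mathrm{Ang}_0 - \mathrm{Ang}_t) + (\mathrm{Dis}_0 - \mathrm{Dis}_t), \tag{7}$$

where the subscripts $0$ and $t$ denote the start plane and the plane in iteration $t$, respectively.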
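The three geometric quantities can be computed with a few lines of code. The sketch assumes the angle is the arccosine of the normalized dot product of the two normals, the distance is the absolute difference of the two origin-to-plane distances, and ADI is the improvement of both relative to the start plane; the function names and the (normal, distance) plane representation are illustrative.

```python
import math


def plane_angle_deg(normal_pred, normal_gt):
    """Angle in degrees between two plane normals."""
    dot = sum(p * g for p, g in zip(normal_pred, normal_gt))
    norm_p = math.sqrt(sum(p * p for p in normal_pred))
    norm_g = math.sqrt(sum(g * g for g in normal_gt))
    cosine = max(-1.0, min(1.0, dot / (norm_p * norm_g)))  # guard rounding
    return math.degrees(math.acos(cosine))


def plane_distance(d_pred, d_gt):
    """Absolute difference of the two origin-to-plane distances."""
    return abs(d_pred - d_gt)


def adi(start, current, target):
    """Angle and Distance Improvement of `current` relative to `start`.

    Each plane is (normal, distance). Positive values mean the plane moved
    closer to the target in angle and/or distance.
    """
    ang0 = plane_angle_deg(start[0], target[0])
    ang_t = plane_angle_deg(current[0], target[0])
    dis0 = plane_distance(start[1], target[1])
    dis_t = plane_distance(current[1], target[1])
    return (ang0 - ang_t) + (dis0 - dis_t)
```

A plane that starts 90 degrees and 4 units off and ends 0 degrees and 1 unit off has an ADI of 93 under this definition.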
In this section, we conduct extensive experiments on the three datasets to validate the effectiveness and generalizability of our method. These experiments include a performance comparison with state-of-the-art methods, the effectiveness of the landmark-align module, the effectiveness of the adaptive dynamic termination module, a statistical significance test, a clinical biometric evaluation, and a qualitative evaluation.
IV-A Comparison with state-of-the-art methods
To examine the effectiveness of our proposed method in standard plane localization, we conducted a comparison experiment with the classical learning-based regression method, denoted as Regression, the current state-of-the-art Automatic View Planning method, denoted as AVP, and our previous method, denoted as RL-US. To achieve a fair comparison, we used the default plane initialization strategy of Regression and AVP, and re-trained both compared models using the public implementations. We also adjusted the training parameters to obtain the best localization results. As shown in Tables I and II, our method achieves the highest accuracy among the alternatives in almost all of the metrics. This indicates the superior ability of our method in standard plane localization tasks.
IV-B Impacts of the landmark-align module
To verify the impact of the landmark-aware alignment module of the proposed approach, we compared the performance of the proposed framework with and without this module. In the Pre-Regist method, we give the agent a random starting plane function as in prior work and choose the step with the lowest Q-value as the termination step. The Regist method represents the framework equipped with the alignment module but without agent searching. The Post-Regist method denotes the searching result of the agent with a warm-start initialization from the alignment module. We also chose the lowest Q-value termination strategy to implement Post-Regist for a fair comparison. As shown in Tables III and IV, the accuracy of the Pre-Regist method is significantly lower than that of the Regist and Post-Regist methods. This shows that the landmark-aware alignment module improves the plane detection accuracy consistently and substantially. Figure 5 visualizes the 3D spatial distribution of the fetal brain landmarks before and after alignment. It can be observed that all the landmarks are mapped to similar spatial positions, which indicates that all the fetal postures are roughly aligned.
IV-C Analysis of adaptive dynamic termination
|Low Q-Value||10.98±9.86||2.88±2.46||0.636±0.144||11.30±10.80||2.66±1.69||0.649±0.128||12.28±8.77||2.62±2.50||0.769±0.071||16.05±8.93||2.24±2.10||0.776±0.079|
|Low Q-Value||9.85±9.74||2.56±3.13||0.884±0.066||9.72±8.08||3.10±2.55||0.770±0.105||7.48±6.43||1.70±1.56||0.686±0.093|
To demonstrate the impact of the proposed adaptive dynamic termination (ADT) strategy, we performed comparison experiments with existing popular strategies such as termination at the maximum number of iterations (Max-Step), termination at the lowest Q-value (Low Q-Value), and the active termination with LSTM (AT-LSTM). We also compared variants of our proposed ADT with different backbone networks, including a Multi-layer Perceptron (ADT-MLP), a vanilla RNN (ADT-RNN) and an LSTM (ADT-LSTM). A superscript marks the models trained with the normalized MSE loss function (Eq. 4). As shown in Tables V and VI, equipped with the adaptive dynamic termination strategy, the agent was able to avoid being trapped in an inferior local minimum and achieved better performance. Furthermore, Table VII shows that our proposed dynamic termination can save up to approximately 67% of the inference time, thus improving the efficiency of the reinforcement framework.
Figure 6 displays the training curves and validation performance of the same model trained with the MAE loss and the normalized MSE loss, respectively. It shows that the MSE loss helps the model converge more stably and reach a lower validation error than the MAE loss. This indicates the effectiveness of the normalized MSE loss in simplifying the training of the termination module.
As shown in Table VIII, we performed an ablation study on the number of layers and hidden units of the LSTM on the fetal brain dataset. The LSTM with 2 layers and 64 hidden units outperforms the other settings.
|Num of layers||Num of units||TC||TV||TT|
IV-D Significant Difference Analysis
To investigate whether the differences between methods were statistically significant, we performed paired t-tests between the results of our method and those of Regression, AVP and Registration. These tests were conducted for all of the performance metrics, including Angle, Distance and SSIM. We set the significance level as 0.05. The results are shown in Tables IX and X. The comparisons and tests in Tables I-IV and IX-X indicate that our method performed best among the state-of-the-art methods (Regression, AVP) and Registration. Although the improvement of our method over AT-LSTM is not statistically significant, our method can save up to 67% of the inference time, as shown in Table VII.
|Ours vs. Regression||0.003||0.138||0.003||0.282||0.008||0.301||0.006||0.161|
|Ours vs. AVP ||0.003||0.001|
|Ours vs. Registration||0.003||0.162||0.001||0.454||0.399||0.049||0.329||0.005||0.048||0.399|
|Ours vs. Regression||0.603||0.009||0.005||0.276||0.848||0.130||0.025|
|Ours vs. AVP |
|Ours vs. Registration||0.047||0.359||0.720||0.495||0.329||0.877|
IV-E Clinical biometric evaluation from SP
In this section, we further explore whether the detected planes can provide biometrics that are consistent with those obtained in manually acquired planes, which is of greater clinical concern. To obtain the biometrics on the predicted planes (TT and AM), a pre-trained DeepLabv3+ was used to segment the fetal head and abdomen. Then the two smallest ellipses enclosing the segmentation map in the predicted plane and the annotated ground truth in the target plane are generated to obtain the fetal head or abdominal circumference. We used three metrics to evaluate the biometric measurements: the Dice score (Dice), the absolute error (A-Error) and the relative error (R-Error) between the circumferences from the prediction and the annotation. As shown in Table XI, the proposed method achieved good performance in Dice score. Meanwhile, the absolute and relative errors of our method are 1.125 mm and 2.05% for the fetal head circumference, and 3.608 mm and 3.25% for the abdominal circumference. The p-values in Table XI also indicate that our predicted biometrics have no significant difference from the annotations. This is comparable to human-level performance [43, 38] and suggests that the proposed method has the potential to be applied in real clinical settings.
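The three biometric metrics are simple to compute once an ellipse and the masks are available. In the sketch below, the circumference uses Ramanujan's approximation, which is an assumption since the text does not state how the ellipse perimeter is obtained; masks are flat 0/1 sequences and all names are illustrative.

```python
import math


def ellipse_circumference(semi_major, semi_minor):
    """Ramanujan's approximation to the perimeter of an ellipse."""
    a, b = semi_major, semi_minor
    return math.pi * (3.0 * (a + b) - math.sqrt((3.0 * a + b) * (a + 3.0 * b)))


def dice_score(mask_a, mask_b):
    """Dice overlap of two binary masks given as flat 0/1 sequences."""
    intersection = sum(1 for x, y in zip(mask_a, mask_b) if x and y)
    total = sum(1 for x in mask_a if x) + sum(1 for y in mask_b if y)
    return 2.0 * intersection / total if total else 1.0


def circumference_errors(predicted_mm, annotated_mm):
    """Absolute error (mm) and relative error (%) of a circumference."""
    absolute = abs(predicted_mm - annotated_mm)
    return absolute, 100.0 * absolute / annotated_mm
```

For a circle the approximation is exact, which gives a quick sanity check before applying it to fitted head or abdomen ellipses.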
IV-F Qualitative evaluation
Figures 7 and 8 provide visualization results of the proposed method. They show the predicted plane, the ground truth, the termination curve and the 3D spatial visualization of four randomly selected cases. It can be observed that the predictions are spatially close and visually similar to the ground truth. Furthermore, the proposed method reaches an ideal stopping point consistently, whereas both the maximum-iteration and the lowest Q-value termination strategies fail to find the optimal termination step.
Although RL is powerful in localizing view planes in MRI, it failed to localize SPs in 3D US. Without an alignment module and an early-stop setup, AVP needs a careful design for agent training and inference in a vast search space. It is easier for learning-based localization methods to locate the SP within a limited search space, which might explain the relatively low performance of this approach in Table III and Table IV. The proposed landmark-aware alignment module was devised to address exactly this concern. It aligns all the volumes to the same atlas space using rigid registration, which constrains the environment in a way similar to MRI images. Furthermore, our proposed alignment method can be regarded as a prior-based initialization of the agent in testing US volumes, which reduces the search space to a fine-grained subspace.
A proper termination strategy is essential in deep RL, yet it is difficult to estimate the optimal termination step because the agent often gets trapped in a local minimum during the iterative searching process. Prior studies have proposed several different termination strategies for such applications [14, 1]. However, as shown in Tables V and VI and Figs. 7 and 8, these empirical or prior knowledge-based termination strategies failed to estimate the optimal termination step in this challenging task. Meanwhile, the prior studies [1, 12] terminate the agent at the fixed maximum step by default, which makes the localization system inefficient. Our previous study designed a learning-based active termination using an RNN to learn the mapping between the Q-value sequence and the optimal step. However, it also requires waiting for the agent to finish inference. In contrast, our termination module enables dynamic agent searching, with the RNN learning the implicit relationship between the Q-value curve and the optimal termination step. The resulting RL framework achieves more accurate and efficient predictions. Note that this learning-based termination strategy is a general method and can be applied to other similar tasks.
In this paper, we present a deep RL framework equipped with 1) a landmark-aware alignment module that provides a warm start for the agent's search, and 2) a novel learning-based strategy for adaptive dynamic termination. SP localization in 3D US is challenging due to the low image quality, large data size and diverse fetal postures. With the proposed landmark-aware alignment module, the deep RL framework can start searching within an environment constrained by anatomical prior knowledge. In reinforcement learning for SP localization, the termination conditions are usually indistinct and cannot be precisely defined. Our proposed adaptive dynamic termination offers a new solution towards a localization system that is both effective and efficient. Validation experiments showed that our model not only outperforms the current state-of-the-art learning-based methods in detecting SPs, but also reduces inference time by about 67% and generalizes well across multiple challenging datasets.
-  (2018) Automatic view planning with multi-scale deep reinforcement learning agents. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 277–285. Cited by: §I-B, §I-C2, §II-B, §II-C, §II-C, §IV-A, §IV-B, §IV-C, §IV-D, TABLE I, TABLE X, TABLE II, TABLE V, TABLE VI, TABLE IX, §V, §V.
-  (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics SMC-13 (5), pp. 834–846. Cited by: §I-B.
-  (2017) SonoNet: real-time detection and localisation of fetal standard scan planes in freehand ultrasound. IEEE Transactions on Medical Imaging 36 (11), pp. 2204–2215. Cited by: §I-C1.
-  (2007) The tradeoffs of large scale learning. Advances in Neural Information Processing Systems 20, pp. 161–168. Cited by: §III-A.
-  (2015) Active object localization with deep reinforcement learning. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2488–2496. Cited by: §I-B.
-  (2015) Standard plane localization in fetal ultrasound via domain transferred deep neural networks. IEEE Journal of Biomedical and Health Informatics 19 (5), pp. 1627–1636. Cited by: §I-C1.
-  (2014) Fetal abdominal standard plane localization through representation learning with knowledge transfer. In International Workshop on Machine Learning in Medical Imaging, pp. 125–132. Cited by: §I-C1.
-  (2017) Ultrasound standard plane detection using a composite neural network framework. IEEE Transactions on Cybernetics 47 (6), pp. 1576–1586. Cited by: §I-C1.
-  (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818. Cited by: §IV-E.
-  (2013) Class-specific regression random forest for accurate extraction of standard planes from 3d echocardiography. In International MICCAI Workshop on Medical Computer Vision, pp. 53–62. Cited by: §I-C2.
-  (2016) 3D U-Net: learning dense volumetric segmentation from sparse annotation. In International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 424–432. Cited by: §II-B.
-  (2019) Agent with warm start and active termination for plane localization in 3d ultrasound. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 290–298. Cited by: 1st item, 2nd item, §I-B, §I-C2, §I-D, §II-B, §II-C, §II-C, §II-C, §II, §IV-A, §IV-C, §IV-C, §IV-D, TABLE I, TABLE II, TABLE V, TABLE VI, §V.
-  (2017) Detection and characterization of the fetal heartbeat in free-hand ultrasound sweeps with weakly-supervised two-streams convolutional networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Vol. 10434, pp. 305–313. Cited by: §I-C1.
-  (2019) Multi-scale deep reinforcement learning for real-time 3d-landmark detection in ct scans. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, pp. 176–189. Cited by: §I-B, §II-C, §V.
-  (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §III-A.
-  (2018) VP-Nets: efficient automatic localization of key brain structures in 3d fetal neurosonography. Medical Image Analysis 47, pp. 127–139. Cited by: §II-B.
-  (2017) Temporal heartnet: towards human-level automatic analysis of fetal cardiac screening video. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Vol. 10434, pp. 341–349. Cited by: §I-C1, §III-A.
-  (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §II-A.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §III-A.
-  (2014) Automatic recognition of fetal standard plane in ultrasound image. In 2014 IEEE 11th International Symposium on Biomedical Imaging (ISBI), pp. 85–88. Cited by: §I-C1.
-  (2018) Standard plane detection in 3d fetal ultrasound using an iterative transformation network. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 392–400. Cited by: §I-C2.
-  (2019) Multi-task learning for quality assessment of fetal head ultrasound images. Medical Image Analysis 58, pp. 101548. Cited by: §I-C1.
-  (2018) Automated abdominal plane and circumference estimation in 3d us for fetal screening. In Medical Imaging 2018: Image Processing, Vol. 10574, pp. 105740I. Cited by: §I-C2, §II-B.
-  (2019) Automatic quality assessment for 2d fetal sonographic standard plane based on multi-task learning. arXiv preprint arXiv:1912.05260. Cited by: §I-C1.
-  (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §II-A, §II-A.
-  (1990) Efficient memory-based learning for robot control. Cited by: §I-B.
-  (2014) Diagnostic plane extraction from 3d parametric surface of the fetal cranium. In MIUA, pp. 27–32. Cited by: §I-A, §II-B.
-  (2014) Standard plane localization in ultrasound by radial component model and selective search. Ultrasound in Medicine & Biology 40 (11), pp. 2728–2742. Cited by: §I-C1.
-  (2017) Automatic detection of standard sagittal plane in the first trimester of pregnancy using 3-d ultrasound data. Ultrasound in Medicine & Biology 43 (1), pp. 286–300. Cited by: §I-C2.
-  (2019) Pytorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8026–8037. Cited by: §III-A.
-  (2016) Automated 3d ultrasound biometry planes extraction for first trimester fetal assessment. In International Workshop on Machine Learning in Medical Imaging, pp. 196–204. Cited by: §I-C2.
-  (2013) Erratum: ISUOG practice guidelines: performance of first-trimester fetal ultrasound scan. Ultrasound in Obstetrics and Gynecology 41 (2), pp. 102–113. Cited by: §I.
-  (2011) Practice guidelines for performance of the routine mid-trimester fetal ultrasound scan. Ultrasound in Obstetrics & Gynecology 37 (1), pp. 116–126. Cited by: §I.
-  (2016) Prioritized experience replay. International Conference on Learning Representations. Cited by: §II-A.
-  (2019) Attention gated networks: learning to leverage salient regions in medical images. Medical Image Analysis 53, pp. 197–207. Cited by: §I-C1.
-  (2019) Offset regression networks for view plane estimation in 3d fetal ultrasound. In Medical Imaging 2019: Image Processing, Vol. 10949, pp. 109493K. Cited by: §I-C2.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §II-A.
-  (2018) Automated measurement of fetal head circumference using 2d ultrasound images. PloS one 13 (8), pp. e0200412. Cited by: §IV-E.
-  (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: §III-C.
-  (2016) Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pp. 1995–2003. Cited by: §II-A.
-  (1992) Q-learning. Machine Learning 8 (3-4), pp. 279–292. Cited by: §II-A.
-  (2017) FUIQA: fetal ultrasound image quality assessment with deep convolutional networks. IEEE Transactions on Cybernetics 47 (5), pp. 1336–1349. Cited by: §I-C1.
-  (2017) Cascaded fully convolutional networks for automatic prenatal ultrasound image segmentation. In 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), pp. 663–666. Cited by: §IV-E.
-  (2014) Standard plane localization in ultrasound by radial component. In 2014 IEEE 11th International Symposium on Biomedical Imaging (ISBI), pp. 1180–1183. Cited by: §I-C1.
-  (2012) Intelligent scanning: automated standard plane selection and biometric measurement of early gestational sac in routine ultrasound examination. Medical Physics 39 (8), pp. 5015–5027. Cited by: §I-C1.
-  (2016) Guideline-based machine learning for standard plane extraction in 3d cardiac ultrasound. In Medical Computer Vision and Bayesian and Graphical Models for Biomedical Imaging, pp. 137–147. Cited by: §I-C2.