Source code for DCL-Net, a deep learning model for sensorless freehand 3D ultrasound volume reconstruction.
Transrectal ultrasound (US) is the most commonly used imaging modality to guide prostate biopsy and its 3D volume provides even richer context information. Current methods for 3D volume reconstruction from freehand US scans require external tracking devices to provide spatial position for every frame. In this paper, we propose a deep contextual learning network (DCL-Net), which can efficiently exploit the image feature relationship between US frames and reconstruct 3D US volumes without any tracking device. The proposed DCL-Net utilizes 3D convolutions over a US video segment for feature extraction. An embedded self-attention module makes the network focus on the speckle-rich areas for better spatial movement prediction. We also propose a novel case-wise correlation loss to stabilize the training process for improved accuracy. Highly promising results have been obtained by using the developed method. The experiments with ablation studies demonstrate superior performance of the proposed method by comparing against other state-of-the-art methods. Source code of this work is publicly available at https://github.com/DIAL-RPI/FreehandUSRecon.
Ultrasound (US) imaging has been widely used in interventional applications to monitor and trace target tissue. US possesses many advantages, such as low cost, portable setup, and the capability of navigating through the patient in real time for anatomical and functional information. Transrectal ultrasound (TRUS) imaging has been commonly used for guiding prostate cancer diagnosis and can significantly reduce the false negative rate when fused with magnetic resonance imaging (MRI) [siddiqui2015comparison]. However, 2D US images are difficult to register with a 3D MRI volume, due to differences in not only image dimension but also image appearance. In practice, a reconstructed 3D US image volume is usually required to assist such interventional tasks.
A reconstructed 3D US imaging volume visualizes a 3D region of interest (ROI) by using a set of 2D ultrasound frames [mohamed2019survey], which can be captured by a variety of scanning techniques such as mechanical scan [daoud2015freehand] and freehand tracked scan [wen2013accurate]. Among these categories, tracked freehand scanning is the most favorable method in a number of clinical scenarios. For instance, during a prostate biopsy, freehand scanning allows clinicians to freely move the US probe around the ROI and produces US images with much more flexibility. The tracking device, either an optical or electromagnetic (EM) tracking system, helps to build a spatial transformation chain between the imaging planes in the world coordinate system for 3D reconstruction.
US volume reconstruction from sensorless freehand scans takes a step further by removing the tracking devices attached to the US probe. Prior research on this topic mainly relied on speckle decorrelation [chen1997determination, tuthill1998automated], which maps the relative difference in position and orientation between neighboring US images to the correlation of their speckle patterns, i.e., the higher the speckle correlation, the lower the elevational distance between neighboring frames [chang20033]. By removing the tracking devices, such sensorless reconstruction allows clinicians to move the probe with fewer constraints and without concerns about blocking tracking signals. In addition, it also reduces the hardware cost. Although the speckle correlation carries information about the relative transformation between neighboring frames, relying on decorrelation alone renders unreliable performance [laporte2011learning, afsham2015nonlocal].
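As a toy illustration of the decorrelation idea (a deliberate simplification, not the cited algorithms), the sketch below computes a Pearson correlation between two frames; a lower value hints at a larger elevational gap between them:

```python
import numpy as np

def speckle_correlation(frame_a, frame_b):
    """Pearson correlation between two neighboring US frames.

    Under the speckle-decorrelation assumption, lower correlation
    suggests a larger elevational distance between the frames.
    """
    a = frame_a.astype(np.float64).ravel()
    b = frame_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Identical frames correlate perfectly; a noise-perturbed frame less so.
rng = np.random.default_rng(0)
f1 = rng.random((64, 64))
f2 = f1 + 0.5 * rng.random((64, 64))
assert speckle_correlation(f1, f1) > 0.999
assert speckle_correlation(f1, f2) < speckle_correlation(f1, f1)
```

In practice, as noted above, such a scalar correlation alone is too unreliable for motion estimation; it only motivates paying attention to speckle-rich regions.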
In the past decade, deep learning (DL) methods based on convolutional neural networks (CNN) have been identified as an important tool for automatic feature extraction. In the field of US volume reconstruction, a pioneering work carried out by Prevost et al. [prevost20183d] explored the feasibility of using a CNN to directly estimate the inter-frame motion between two 2D US scans. A 2D convolutional network takes two consecutive ultrasound frames, together with a generated optical flow field between them, as a stacked input to estimate the relative rotations and translations between the two frames. However, a typical US scanning video contains rich contextual information beyond two neighboring frames. A sequence of 2D US frames can provide a more general representation of the motion trajectory of the US probe. Using only two neighboring frames may lose temporal information and thus result in less accurate reconstruction. In addition, the optical flow field, which is good at describing in-plane motion, may not help out-of-plane motion analysis. Moreover, the prior work on decorrelation suggests that paying more attention to speckle-rich regions can boost reconstruction performance, which has not yet been explored.
In this paper, based on the above observations, we propose a novel Deep Contextual Learning Network (DCL-Net) for sensorless freehand 3D ultrasound reconstruction. The underlying framework takes multiple consecutive US frames as input, instead of only two neighboring frames, to estimate the trajectory of the US probe by efficiently exploiting the rich contextual information. Furthermore, to make the network focus on speckle-rich image areas and utilize the decorrelation information between frames, an attention mechanism is embedded into the network architecture. Last but not least, we introduce a new case-wise correlation loss that enhances discriminative feature learning to prevent overfitting to the scanning style.
All TRUS scanning videos studied in this work were collected with an EM-tracking device from real clinical cases. The dataset contains 640 TRUS videos of varied lengths, all from different subjects, acquired with a Philips iU22 scanner. Every frame corresponds to an EM-tracked vector that contains the position and orientation information of that frame. We convert this vector to a homogeneous transformation matrix M = [R, t; 0, 1], where R is a 3×3 rotation matrix and t is a 3D translation vector.
The primary task of 3D ultrasound reconstruction is to obtain the relative spatial position of two or more consecutive US frames. Without loss of generality, here we use two neighboring frames as an example for illustration. Let Ii and Ii+1 denote two consecutive US frames with corresponding transformation matrices Mi and Mi+1, respectively. The relative transformation matrix can be computed as M' = Mi⁻¹ Mi+1. By decomposing M' into 6 degrees of freedom θ = (tx, ty, tz, αx, αy, αz), which contains the translations in millimeters and the rotations in degrees, we can use this θ computed from EM tracking as the ground truth for network training.
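This transformation chain can be sketched as follows; the Euler-angle convention (Z-Y-X) and the function names are illustrative assumptions, since the actual EM-tracker convention is not specified here:

```python
import numpy as np

def pose_to_matrix(tx, ty, tz, ax, ay, az):
    """Build a 4x4 homogeneous transform from translations (mm) and
    Euler angles (degrees). Z-Y-X rotation order is an assumed
    convention for illustration only."""
    ax, ay, az = np.deg2rad([ax, ay, az])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(ax), -np.sin(ax)],
                   [0, np.sin(ax),  np.cos(ax)]])
    Ry = np.array([[ np.cos(ay), 0, np.sin(ay)],
                   [0, 1, 0],
                   [-np.sin(ay), 0, np.cos(ay)]])
    Rz = np.array([[np.cos(az), -np.sin(az), 0],
                   [np.sin(az),  np.cos(az), 0],
                   [0, 0, 1]])
    M = np.eye(4)
    M[:3, :3] = Rz @ Ry @ Rx
    M[:3, 3] = [tx, ty, tz]
    return M

def relative_transform(M_i, M_j):
    """Relative transform between two tracked frame poses."""
    return np.linalg.inv(M_i) @ M_j

M1 = pose_to_matrix(10, 5, 2, 0, 15, 0)
M2 = pose_to_matrix(11, 5, 2, 0, 20, 0)
M_rel = relative_transform(M1, M2)
# Composing the relative motion onto the first pose recovers the second.
assert np.allclose(M1 @ M_rel, M2)
```

Decomposing M_rel back into translations and Euler angles then yields the 6-DOF ground-truth label θ.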
Fig. 1 shows the proposed DCL-Net architecture, which is designed on top of the 3D ResNext model [xie2017aggregated]. Our model consists of 3D residual blocks and other types of basic CNN layers [he2016deep]. The skip connections help preserve the gradients to train very deep networks. The use of multiple pathways (cardinalities) enables the extraction of important features. In our design, 3D instead of 2D convolutional kernels are used, mainly because 3D convolutions can better extract the feature mappings along the channel axis, which corresponds to the temporal direction in our case. Such properties enable the network to focus on the slight displacement of image features between consecutive frames. The network can thus be trained to connect these speckle-correlated features to estimate the relative position and orientation.
During the training process, we stack a sequence of N frames with height and width denoted by H and W, respectively, to form a 3D input volume of shape N×H×W. Let {θi, i = 1, ..., N−1} denote the relative transform parameters between neighboring frames. Instead of directly using these parameters as ground-truth labels for network training, the mean parameters θ̄ = (1/(N−1)) Σ θi are used for the following two reasons. Most importantly, since the magnitude of motion between two frames is small, using the mean can effectively smooth the noise in probe motion. Another practical advantage is that there is no need to modify the output layer every time the number of input frames changes. During the test, we slide along the video sequence with a window of size N. The inter-frame motion of two neighboring frames is the average motion computed over all the windows covering that pair.
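The test-time averaging step can be sketched as below; `average_interframe_motion` and its input layout are hypothetical names for illustration:

```python
import numpy as np

def average_interframe_motion(window_preds, num_frames, n):
    """Average per-window mean-motion predictions back onto each
    neighboring-frame pair.

    window_preds: (num_frames - n + 1, 6) array; row w holds the mean
    6-DOF motion the network predicts for frames [w, w + n).
    Returns a (num_frames - 1, 6) array: each pair's motion averaged
    over all sliding windows that cover it.
    """
    pair_sum = np.zeros((num_frames - 1, 6))
    pair_cnt = np.zeros(num_frames - 1)
    for w, pred in enumerate(window_preds):
        # window w covers the n - 1 pairs (w, w+1), ..., (w+n-2, w+n-1)
        pair_sum[w:w + n - 1] += pred
        pair_cnt[w:w + n - 1] += 1
    return pair_sum / pair_cnt[:, None]

# With identical window predictions, every pair receives that motion.
preds = np.tile(np.arange(6, dtype=float), (6, 1))  # 10 frames, n = 5
out = average_interframe_motion(preds, num_frames=10, n=5)
assert out.shape == (9, 6)
assert np.allclose(out, np.arange(6))
```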
The attention mechanism in deep learning models makes a CNN focus on a specific region of an image that carries salient information for the targeted task [bahdanau2014neural]. It has led to significant improvements in various computer vision tasks such as object classification [fukui2019attention] and segmentation [oktay2018attention]. In our 3D US volume reconstruction task, regions with strong speckle patterns for correlation are of high importance in estimating the transformations. Thus, we introduce a self-attention block, as shown in Fig. 1, which takes the feature maps produced by the last residual block as input and outputs an attention map. This helps assign more weights to the highly informative regions.
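A much-simplified stand-in for such an attention block (channel pooling followed by a spatial softmax, not the paper's exact module) might look like:

```python
import numpy as np

def spatial_attention(features):
    """Toy spatial attention over a feature map.

    features: (C, H, W) activations, e.g. from the last residual block.
    Returns an (H, W) map summing to 1 that weights high-activation
    (informative) regions more heavily. This is a simplified sketch,
    not the learned attention module described in the paper.
    """
    energy = features.mean(axis=0)           # channel-pooled saliency
    weights = np.exp(energy - energy.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights

feats = np.zeros((8, 4, 4))
feats[:, 1, 2] = 5.0                         # a speckle-rich hotspot
attn = spatial_attention(feats)
assert np.isclose(attn.sum(), 1.0)
assert attn[1, 2] == attn.max()
```

In the actual network, the attention weights are learned end-to-end and multiply the feature maps before the regression head.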
The loss function of the proposed DCL-Net consists of two components. The first one is the mean squared error (MSE) loss, which is the most commonly used loss in deep regression problems. However, using the MSE loss alone can lead to smoothed estimation of the motion, so the trained network tends to memorize the general style of how clinicians move the probe, i.e., the mean trajectory of the ultrasound probe. This shortcoming of the MSE loss for network training has been reported before [yang2018low, johnson2016perceptual]. To deal with this problem, we introduce a case-wise correlation loss based on the Pearson correlation coefficient to emphasize the specific motion pattern of a scan.
Fig. 2 shows the workflow of calculating the case-wise correlation loss. K video segments, each containing N frames, are randomly sampled from a TRUS video. The correlation coefficients between the estimated motion θ̂ and the ground-truth mean θ are computed for every degree of freedom d, and the loss is defined as

Lcorr = 1 − (1/6) Σd Cov(θd, θ̂d) / (σ(θd) σ(θ̂d)),

where Cov(·, ·) gives the covariance and σ(·) calculates the standard deviation. The total loss is the summation of the MSE loss and the case-wise correlation loss.
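A minimal NumPy sketch of this loss, assuming the "one minus mean Pearson correlation" formulation:

```python
import numpy as np

def case_correlation_loss(pred, gt):
    """Case-wise correlation loss: one minus the mean Pearson
    correlation between predicted and ground-truth motion, computed
    per degree of freedom over K sampled segments.

    pred, gt: (K, 6) arrays of 6-DOF motion parameters.
    """
    p = pred - pred.mean(axis=0)
    g = gt - gt.mean(axis=0)
    cov = (p * g).mean(axis=0)                       # per-DOF covariance
    corr = cov / (p.std(axis=0) * g.std(axis=0) + 1e-12)
    return 1.0 - corr.mean()

rng = np.random.default_rng(1)
gt = rng.standard_normal((16, 6))
assert case_correlation_loss(gt, gt) < 1e-6   # perfect correlation -> ~0
assert case_correlation_loss(-gt, gt) > 1.9   # anti-correlation -> ~2
```

Unlike the MSE term, this loss is invariant to a constant offset in the prediction, so it explicitly rewards tracking the shape of each case's motion pattern.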
For the experiments performed in this study, a total of 640 TRUS scanning videos (640 patients) were acquired at the National Institutes of Health (NIH) in an IRB-approved clinical trial. During an intervention, a physician used an end-firing transrectal ultrasound probe to acquire axial images by steadily sweeping through the prostate from base to apex. The positioning information given by an electromagnetic tracking device serves as the ground-truth label in our training phase. The dataset is split into 500, 70, and 70 cases for training, validation, and testing, respectively. Our network is trained for 300 epochs using the Adam optimizer [kingma2014adam] with an initial learning rate of 5, which decays by a factor of 0.9 every 5 epochs. Since the prostate only takes a relatively small part of each US frame, each frame is cropped without exceeding the imaging field and then resized to fit the design of ResNext [xie2017aggregated]. We implemented the DCL-Net using the publicly available PyTorch library [pytorch]. The entire training phase of the DCL-Net takes about 4 h with 5 input frames. During testing, it takes about 2.58 s to produce all the transformation matrices of a US video with 100 frames.
Two evaluation metrics are used for performance evaluation. The first one is the average distance between all the corresponding frame corner points throughout a video. This distance error reveals the difference in speed and orientation variations across the entire video. The other one is the final drift [prevost20183d], which is the distance between the center points of the end frame of a video segment when transformed using the EM tracking data and using our DCL-Net estimated motion.
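Both metrics can be sketched as below; the function names and the frame-corner layout are illustrative assumptions:

```python
import numpy as np

def corner_distance_error(T_gt, T_pred, corners):
    """Average distance between corresponding frame corner points over
    a video. T_gt, T_pred: lists of 4x4 per-frame transforms;
    corners: (4, 3) corner coordinates of a frame (mm)."""
    pts = np.hstack([corners, np.ones((len(corners), 1))])  # homogeneous
    errs = [np.linalg.norm((Mg @ pts.T - Mp @ pts.T)[:3], axis=0).mean()
            for Mg, Mp in zip(T_gt, T_pred)]
    return float(np.mean(errs))

def final_drift(M_gt, M_pred, center):
    """Distance between the end-frame center point placed by the
    ground-truth vs. the estimated cumulative transform."""
    c = np.append(center, 1.0)
    return float(np.linalg.norm((M_gt @ c - M_pred @ c)[:3]))

# A pure 1 mm translation offset yields a 1 mm error in both metrics.
I = np.eye(4)
off = I.copy()
off[0, 3] = 1.0
corners = np.array([[0, 0, 0], [40, 0, 0], [0, 30, 0], [40, 30, 0]], float)
assert np.isclose(corner_distance_error([I], [off], corners), 1.0)
assert np.isclose(final_drift(I, off, np.array([20.0, 15.0, 0.0])), 1.0)
```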
We first performed experiments to determine the optimal number of frames for each video segment. The left panel of Fig. 5 shows how the overall reconstruction performance varies as the number of consecutive frames changes. The error first decreases and then increases, with 5 or 6 neighboring frames yielding similarly the best performance. According to our paired t-test, the calculated p-value is smaller than the significance level of 0.05, indicating that the result using 5 frames is significantly better than that using only 2 frames. The explanation is that the network takes advantage of the rich contextual information along the time series and produces a more stable trajectory estimation.
The right panel of Fig. 5 visualizes two example attention maps. The first image column shows the cropped US images. The second column is the speckle correlation map [chang20033] between a US image and its following neighboring frame. Inside this speckle correlation map, the brighter the area, the longer the elevational distance to the next frame. Such a pattern, with dark areas at the bottom and brighter areas in the upper part, is consistent with our TRUS scanning protocol, as there is less motion around the tip of the ultrasound probe. The third column shows the attention map for the rotation around the Y-axis, which also indicates part of the out-of-plane rotation. The attention maps have strong activations at the bright speckle-correlation regions, indicating that the attention module helps the network focus on speckle-rich areas for better reconstruction.
| Methods | Distance Error (mm) | Final Drift (mm) |
|---|---|---|
| 2D CNN [prevost20183d] | 5.66 / 15.8 / 43.35 / 17.42 | 7.05 / 23.13 / 68.87 / 26.81 |
| 3D CNN (N=2) [xie2017aggregated] | 2.38 / 10.14 / 31.34 / 12.34 | 1.42 / 19.08 / 68.61 / 21.74 |
Table 1 summarizes the overall comparison of the proposed DCL-Net against other existing methods. The approach of “Linear Motion” means that we first calculate the mean motion vector of the training set and then apply this fixed vector to all the testing cases. The approach of “Decorrelation” is based on the speckle decorrelation algorithm presented in [chang20033]. “2D CNN” refers to the method presented by Prevost et al. [prevost20183d]. “3D CNN” is the vanilla ResNext [xie2017aggregated] architecture taking only two slices as input.
It can be seen from Table 1 that our proposed DCL-Net outperforms all the other methods. A paired t-test was performed, and the performance improvement made by DCL-Net is significant in all the cases with p-value < 0.05. It is worth noting that although the average distance error of 10.33 mm achieved by DCL-Net is still large, this is the best performance reported on real clinical data instead of phantom studies. The performance of the state-of-the-art 2D-CNN method reproduced in our experiments is consistent with the accuracy reported in the original paper [prevost20183d]. It is a challenging task to reconstruct 3D US volumes from these freehand TRUS scans, and we have been making significant progress in this important area.
Next, we demonstrate the effectiveness of incorporating the case-wise correlation loss into network training. Fig. 6 shows the predicted motion parameter along a video sequence. We can observe that the network trained with the MSE loss alone can only produce mediocre results (red line), which are nearly constant, showing almost no sensitivity to changes in speed and orientation. Its prediction wanders around the linear motion (black line), which represents the mean value of the probe's trajectory. The correlation coefficients of all testing cases have a low mean value, representing little correlation. This indicates that using MSE alone makes the network memorize the general style of the US probe motion trajectory and fail to produce valid predictions based on image content. By incorporating the correlation loss (CL) into the loss function (blue line), the correlation coefficients of all testing cases have a higher mean value, representing weak correlation. Based on a paired t-test, this is found to be significantly better than the previous result. Intuitively, the network's prediction reacts more sensitively to the variation of the probe's real translation and rotation (green line).
Last but not least, we report the volume reconstruction results using four testing cases with different reconstruction qualities, as shown in Fig. 7. One good case, one bad case, and two median cases are included to offer a complete view of the performance. To reduce clutter in the figure, we only show the comparison between our DCL-Net, the 2D-CNN [prevost20183d], and the ground truth. While producing competitive performance, the 2D-CNN method is less sensitive to the speed variations of the US probe, and its estimated trajectory exhibits noisy vibrations. The results sometimes even severely deviate from the ground truth. Our proposed DCL-Net shows a much smoother trajectory estimation thanks to the contextual information provided by the video segments.
This paper introduced a sensorless freehand 3D US volume reconstruction method based on deep learning. The proposed DCL-Net can effectively extract the information shared among multiple US frames to improve the US probe trajectory estimation. Experiments on a well-sized EM-tracked ultrasound dataset demonstrated that the proposed DCL-Net benefits from contextual learning and shows superior performance compared to other existing methods. Ultrasound videos acquired with different scanning protocols will be studied in our future work.
This work was partially supported by National Institute of Biomedical Imaging and Bioengineering (NIBIB) of the National Institutes of Health (NIH) under awards R21EB028001 and R01EB027898, and through an NIH Bench-to-Bedside award made possible by the National Cancer Institute.