Sensorless Freehand 3D Ultrasound Reconstruction via Deep Contextual Learning

by Hengtao Guo, et al.
Rensselaer Polytechnic Institute

Transrectal ultrasound (US) is the most commonly used imaging modality to guide prostate biopsy and its 3D volume provides even richer context information. Current methods for 3D volume reconstruction from freehand US scans require external tracking devices to provide spatial position for every frame. In this paper, we propose a deep contextual learning network (DCL-Net), which can efficiently exploit the image feature relationship between US frames and reconstruct 3D US volumes without any tracking device. The proposed DCL-Net utilizes 3D convolutions over a US video segment for feature extraction. An embedded self-attention module makes the network focus on the speckle-rich areas for better spatial movement prediction. We also propose a novel case-wise correlation loss to stabilize the training process for improved accuracy. Highly promising results have been obtained by using the developed method. The experiments with ablation studies demonstrate superior performance of the proposed method by comparing against other state-of-the-art methods. Source code of this work is publicly available at



1 Introduction

Ultrasound (US) imaging has been widely used in interventional applications to monitor and trace target tissue. US possesses many advantages, such as low cost, a portable setup and the capability of navigating through the patient in real time for anatomical and functional information. Transrectal ultrasound (TRUS) imaging has been commonly used for guiding prostate cancer diagnosis and can significantly reduce the false negative rate when fused with magnetic resonance imaging (MRI) [siddiqui2015comparison]. However, 2D US images are difficult to register with a 3D MRI volume, due to the differences in not only image dimension but also image appearance. In practice, a reconstructed 3D US image volume is usually required to assist such interventional tasks.

A reconstructed 3D US imaging volume visualizes a 3D region of interest (ROI) by using a set of 2D ultrasound frames [mohamed2019survey], which can be captured by a variety of scanning techniques such as mechanical scan [daoud2015freehand] and freehand tracked scan [wen2013accurate]. Among these categories, tracked freehand scanning is the most favorable method in a number of clinical scenarios. For instance, during a prostate biopsy, freehand scanning allows clinicians to freely move the US probe around the ROI and produces US images with much more flexibility. The tracking device, either an optical or electromagnetic (EM) tracking system, helps to build a spatial transformation chain between the imaging planes in the world coordinate system for 3D reconstruction.

US volume reconstruction from sensorless freehand scans takes a step further by removing the tracking devices attached to the US probe. Prior research on this problem was mainly based on speckle decorrelation [chen1997determination, tuthill1998automated], which maps the relative difference in position and orientation between neighboring US images to the correlation of their speckle patterns, i.e., the higher the speckle correlation, the lower the elevational distance between neighboring frames [chang20033]. By removing the tracking devices, such sensorless reconstruction allows clinicians to move the probe with fewer constraints and without concerns about blocking tracking signals. In addition, it reduces the hardware cost. Although the speckle correlation carries information about the relative transformation between neighboring frames, relying on decorrelation alone yields unreliable performance [laporte2011learning, afsham2015nonlocal].

In the past decade, deep learning (DL) methods based on convolutional neural networks (CNN) have become an important tool for automatic feature extraction. In the field of US volume reconstruction, a pioneering work by Prevost et al. [prevost20183d] explored the feasibility of using a CNN to directly estimate the inter-frame motion between two 2D US scans. A 2D convolutional network takes two consecutive ultrasound frames and a generated optical flow field between them as a stacked input to estimate the relative rotations and translations between these two frames. However, a typical US scanning video contains rich contextual information beyond two neighboring frames. A sequence of 2D US frames can provide a more general representation of the motion trajectory of the US probe. Using only two neighboring frames may lose temporal information and thus result in less accurate reconstruction. In addition, the optical flow field, which is good at describing in-plane motion, may not help out-of-plane motion analysis. Besides, prior work on decorrelation suggests that paying more attention to the speckle-rich regions can boost the reconstruction performance, which has not been explored.

In this paper, based on the above observations, we propose a novel Deep Contextual Learning Network (DCL-Net) for sensorless freehand 3D ultrasound reconstruction. The underlying framework takes multiple consecutive US frames as input, instead of only two neighboring frames, to estimate the trajectory of the US probe by efficiently exploiting the rich contextual information. Furthermore, an attention mechanism is embedded into the network architecture so that the network focuses on the speckle-rich image areas and utilizes the decorrelation information between frames. Last but not least, we introduce a new case-wise correlation loss that enhances discriminative feature learning and prevents overfitting to the scanning style.

2 Data Materials

All TRUS scanning videos studied in this work were collected with an EM-tracking device from real clinical cases. The dataset contains 640 TRUS videos of varied lengths, each from a different subject, acquired by a Philips iU22 scanner. Every frame corresponds to an EM-tracked vector that contains the position and orientation information of that frame. We convert this vector to a 3D homogeneous transformation matrix $M = \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix}$, where $R$ is a $3 \times 3$ rotation matrix and $T$ is a 3D translation vector.

The primary task of 3D ultrasound reconstruction is to obtain the relative spatial position of two or more consecutive US frames. Without loss of generality, here we use two neighboring frames as an example for illustration. Let $I_i$ and $I_{i+1}$ denote two consecutive US frames with corresponding transformation matrices $M_i$ and $M_{i+1}$, respectively. The relative transformation matrix $M'_i$ can be computed as $M'_i = M_{i+1} M_i^{-1}$. By decomposing $M'_i$ into 6 degrees of freedom $\theta_i = \{t_x, t_y, t_z, \alpha_x, \alpha_y, \alpha_z\}_i$, which contains the translations in millimeters and rotations in degrees, we can use this $\theta_i$ computed from EM tracking as the ground truth for network training.
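The label computation described above can be sketched in plain Python. This is only an illustration under stated assumptions: the relative transform is taken as $M_{i+1} M_i^{-1}$, and the rotation is decomposed with a Z-Y-X Euler convention ($R = R_z R_y R_x$), which the text does not specify. All function names are hypothetical.

```python
import math

def make_transform(rotation, translation):
    """Build a 4x4 homogeneous matrix [R T; 0 1] from a 3x3 R and a 3-vector T."""
    rows = [list(rotation[r]) + [translation[r]] for r in range(3)]
    rows.append([0.0, 0.0, 0.0, 1.0])
    return rows

def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def invert_rigid(m):
    """Invert a rigid transform: the inverse of [R T; 0 1] is [R^T, -R^T T; 0 1]."""
    rot_t = [[m[c][r] for c in range(3)] for r in range(3)]
    trans = [-sum(rot_t[r][k] * m[k][3] for k in range(3)) for r in range(3)]
    return make_transform(rot_t, trans)

def relative_transform(m_i, m_next):
    """Relative motion between consecutive frames (assumed order: M_{i+1} M_i^{-1})."""
    return matmul(m_next, invert_rigid(m_i))

def to_six_dof(m):
    """Decompose into [tx, ty, tz, ax, ay, az] with angles in degrees.

    Assumes R = Rz(az) Ry(ay) Rx(ax); not unique when ay is +/-90 degrees.
    """
    sin_ay = max(-1.0, min(1.0, -m[2][0]))  # clamp against rounding error
    ay = math.asin(sin_ay)
    ax = math.atan2(m[2][1], m[2][2])
    az = math.atan2(m[1][0], m[0][0])
    return [m[0][3], m[1][3], m[2][3]] + [math.degrees(a) for a in (ax, ay, az)]
```

Calling `to_six_dof(relative_transform(m_i, m_next))` on the tracked matrices of two neighboring frames gives one ground-truth vector of translations (mm) and rotations (degrees).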

3 Ultrasound Volume Reconstruction

Figure 1: An overview of the proposed DCL-Net, which takes one video segment as input volume and gives the mean motion vector as the output.

Fig. 1 shows the proposed DCL-Net architecture, which is designed on top of the 3D ResNeXt model [xie2017aggregated]. Our model consists of 3D residual blocks and other types of basic CNN layers [he2016deep]. The skip connections help preserve the gradients when training very deep networks. The use of multiple pathways (cardinalities) enables the extraction of important features. In our design, 3D instead of 2D convolutional kernels are used, mainly because 3D convolutions can better extract the feature mappings along the channel axis, which is the temporal direction in our case. Such properties enable the network to focus on the slight displacement of image features between consecutive frames. The network can thus be trained to connect these speckle-correlated features to estimate the relative position and orientation.

During the training process, we stack a sequence of $N$ frames with height and width denoted by $H$ and $W$, respectively, to form a 3D input volume in the shape of $N \times H \times W$. Let $\{\theta_i \mid i = 1, \dots, N-1\}$ denote the relative transform parameters between the neighboring frames. Instead of directly using these parameters as ground-truth labels for network training, the mean parameters

$$\bar{\theta} = \frac{1}{N-1} \sum_{i=1}^{N-1} \theta_i$$

are used for the following two reasons. Most importantly, since the magnitude of motion between two frames is small, using the mean can effectively smooth the noise in probe motion. Another advantage in practice is that there is no need to modify the output layer every time we change the number of input frames. During testing, we slide along the video sequence with a window size of $N$. The inter-frame motion of two neighboring frames is the average motion computed in all the batches.
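The mean label and the sliding-window averaging at test time can be sketched as follows. This is one reading of the procedure, not the authors' code: `predict` is a hypothetical stand-in for the trained network, and a window stride of one frame is assumed.

```python
def mean_motion(thetas):
    """Average a list of 6-DOF vectors element-wise (the training label)."""
    count = len(thetas)
    return [sum(t[d] for t in thetas) / count for d in range(6)]

def sliding_window_motion(num_frames, window, predict):
    """Estimate one 6-DOF vector per neighboring frame pair of a video.

    predict(start) is a hypothetical callable returning the network's mean
    6-DOF vector for frames start .. start + window - 1. Each window output
    is credited to the window - 1 frame pairs it spans, and every pair is
    averaged over all windows that cover it. Requires num_frames >= window >= 2.
    """
    sums = [[0.0] * 6 for _ in range(num_frames - 1)]
    counts = [0] * (num_frames - 1)
    for start in range(num_frames - window + 1):
        theta = predict(start)
        for pair in range(start, start + window - 1):
            counts[pair] += 1
            for d in range(6):
                sums[pair][d] += theta[d]
    return [[s / counts[i] for s in sums[i]] for i in range(num_frames - 1)]
```

Because the network always outputs a single 6-DOF vector, the same output layer serves any window size; only the averaging range changes.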

3.1 Attention Module

The attention mechanism in deep learning models makes the CNN focus on a specific region of an image that carries salient information for the targeted task [bahdanau2014neural]. It has led to significant improvement in various computer vision tasks such as object classification [fukui2019attention] and segmentation [oktay2018attention]. In our 3D US volume reconstruction task, regions with strong speckle patterns for correlation are of high importance in estimating the transformations. Thus, we introduce a self-attention block, as shown in Fig. 1, which takes the feature maps produced by the last residual block as input and then outputs an attention map. This helps assign more weight to the highly informative regions.

3.2 Case-wise Correlation Loss

Figure 2: Overview of the case-wise correlation loss computation.

The loss function of the proposed DCL-Net consists of two components. The first one is the mean squared error (MSE) loss, which is the most commonly used loss in deep regression problems. However, using the MSE loss alone can lead to a smoothed estimation of the motion, so the trained network tends to memorize the general style of how clinicians move the probe, i.e., the mean trajectory of the ultrasound probes. This shortcoming of the MSE loss for network training has been reported before [yang2018low, johnson2016perceptual]. To address this problem, we introduce a case-wise correlation loss based on the Pearson correlation coefficient to emphasize the specific motion pattern of a scan.

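Fig. 2 shows the workflow of calculating the case-wise correlation loss. $K$ video segments, each having $N$ frames, are randomly sampled from a TRUS video. The correlation coefficients between the estimated motion $\bar{\theta}^{Out}$ and the ground-truth mean $\bar{\theta}^{GT}$ are computed for every degree of freedom $d$, and the loss is denoted as

$$\mathcal{L}_{Corr} = 1 - \frac{1}{6} \sum_{d=1}^{6} \frac{\mathrm{Cov}(\bar{\theta}^{GT}_d, \bar{\theta}^{Out}_d)}{\sigma(\bar{\theta}^{GT}_d)\, \sigma(\bar{\theta}^{Out}_d)}$$

where $\mathrm{Cov}$ gives the covariance and $\sigma$ calculates the standard deviation. The total loss is the summation of the MSE loss and the case-wise correlation loss.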
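As a rough numerical illustration of this loss, the sketch below computes a Pearson-based term over the segments sampled from one video and adds it to the MSE. It assumes the loss takes the form of one minus the correlation averaged over the six degrees of freedom, and it uses plain Python lists rather than differentiable tensors, so it shows the arithmetic only. The handling of a constant series is also an assumption.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: Cov(x, y) / (sigma(x) * sigma(y))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if std_x == 0.0 or std_y == 0.0:
        return 0.0  # assumption: treat a constant series as uncorrelated
    return cov / (std_x * std_y)

def case_wise_correlation_loss(predicted, target):
    """predicted, target: lists of 6-DOF mean motion vectors, one per segment."""
    total = 0.0
    for d in range(6):
        total += pearson([p[d] for p in predicted], [t[d] for t in target])
    return 1.0 - total / 6.0

def total_loss(predicted, target):
    """Sum of the MSE loss and the case-wise correlation loss."""
    squared = [(p[d] - t[d]) ** 2
               for p, t in zip(predicted, target) for d in range(6)]
    mse = sum(squared) / len(squared)
    return mse + case_wise_correlation_loss(predicted, target)
```

A prediction that is constant across segments has zero correlation with the ground truth, so this term penalizes exactly the "mean trajectory" behavior that the MSE loss alone tolerates.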

4 Experiments and Results

For the experiments performed in this study, a total of 640 TRUS scanning videos (640 patients) from the National Institutes of Health (NIH) were acquired from an IRB-approved clinical trial. During an intervention, a physician used an end-firing transrectal ultrasound probe to acquire axial images by steadily sweeping through the prostate from base to apex. The positioning information given by an electromagnetic tracking device serves as the ground-truth label in our training phase. The dataset is split into 500, 70 and 70 cases for training, validation and testing, respectively. Our network is trained for 300 epochs with batch size

using Adam optimizer [kingma2014adam] with initial learning rate of 5, which decays by 0.9 after 5 epochs. Since the prostate US image only takes a relatively small part of each frame, each frame is cropped without exceeding the imaging field and then resized to fit the design of ResNeXt [xie2017aggregated]. We implemented the DCL-Net using the publicly available PyTorch library [pytorch]. The entire training phase of the DCL-Net takes about 4h, taking 5 frames as input. During testing, it takes about 2.58s to produce all the transformation matrices of a US video with 100 frames.

Two evaluation metrics are used for performance evaluation. The first one is the average distance between all the corresponding frame corner points throughout a video. This distance error reveals the difference in speed and orientation variations across the entire video. The other one is the final drift [prevost20183d], which is the distance between the center points of the transformed end frames of a video segment using the EM tracking data and our DCL-Net estimated motion.
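Both metrics can be sketched in a few lines of Python. The sketch assumes that each frame already has a 4x4 pose matrix (ground truth from EM tracking, prediction from accumulated network outputs), that a frame lies in the $z = 0$ plane of its own coordinate system with one corner at the origin, and that its size is given in millimeters. These conventions and all names are illustrative, not taken from the paper.

```python
import math

def apply_pose(m, point):
    """Apply a 4x4 homogeneous matrix to a 3D point."""
    x, y, z = point
    return tuple(m[r][0] * x + m[r][1] * y + m[r][2] * z + m[r][3]
                 for r in range(3))

def distance_error(gt_poses, pred_poses, width_mm, height_mm):
    """Mean distance between corresponding frame corner points over a video."""
    corners = [(0.0, 0.0, 0.0), (width_mm, 0.0, 0.0),
               (0.0, height_mm, 0.0), (width_mm, height_mm, 0.0)]
    dists = [math.dist(apply_pose(g, c), apply_pose(p, c))
             for g, p in zip(gt_poses, pred_poses) for c in corners]
    return sum(dists) / len(dists)

def final_drift(gt_poses, pred_poses, width_mm, height_mm):
    """Distance between the center points of the two transformed end frames."""
    center = (width_mm / 2.0, height_mm / 2.0, 0.0)
    return math.dist(apply_pose(gt_poses[-1], center),
                     apply_pose(pred_poses[-1], center))
```

The distance error averages over every frame, so it reflects errors along the whole sweep, while the final drift only looks at the accumulated error in the last frame.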

4.1 Parameter Setting

Figure 5: (Left) Effect of number of input frames. Green curve shows the mean distance error of each box. (Right) Visualization of two attention maps regarding rotation around the Y-axis.

We first performed experiments to determine an optimal number of frames for each video segment. The left panel of Fig. 5 shows how the overall reconstruction performance varies as the number of consecutive frames changes. The error first decreases and then increases, with 5 or 6 neighboring frames giving similarly the best performance. According to our paired t-test, the calculated p-value is smaller than the significance level of 0.05, indicating that the result using 5 frames is significantly better than that using only 2 frames. The explanation is that the network takes advantage of the rich contextual information along the time series and produces a more stable trajectory estimation.
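For readers unfamiliar with the paired t-test used here, the following sketch computes the test statistic from per-case errors of two settings evaluated on the same test cases. It uses only the Python standard library and stops at the statistic and degrees of freedom; turning these into a p-value needs the Student t distribution, for example via `scipy.stats.ttest_rel`. The function name and inputs are hypothetical.

```python
import math
import statistics

def paired_t_statistic(errors_a, errors_b):
    """Return (t, degrees_of_freedom) for two paired samples of errors."""
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need two equally sized samples with at least 2 cases")
    # Per-case differences remove the case-to-case variation in difficulty.
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    spread = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if spread == 0.0:
        raise ValueError("all paired differences are identical")
    t = statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

A strongly negative `t` for `errors_a` taken with 5 input frames and `errors_b` with 2 input frames would indicate consistently lower error for the 5-frame setting.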

The right panel of Fig. 5 visualizes two example attention maps. The first image column shows the cropped US images. The second column is the speckle correlation map [chang20033] between a US image and its following neighboring frame. In this speckle correlation map, the brighter the area, the longer the elevational distance to the next frame. The pattern of dark areas at the bottom and brighter areas in the upper part is consistent with our TRUS scanning protocol, as there is less motion around the tip of the ultrasound probe. The third column shows the attention map regarding the rotation around the Y-axis, which also indicates part of the out-of-plane rotation. The attention maps have strong activation at the bright speckle correlation regions, indicating that the attention module helps the network focus on speckle-rich areas for better reconstruction.

4.2 Reconstruction Performance and Discussions

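| Methods | Distance Error Min | Distance Error Median | Distance Error Max | Distance Error Average | Final Drift Min | Final Drift Median | Final Drift Max | Final Drift Average |
|---|---|---|---|---|---|---|---|---|
| Linear Motion | 7.17 | 19.73 | 60.79 | 22.53 | 12.53 | 37.15 | 114.02 | 42.62 |
| Decorrelation [chang20033] | 9.62 | 17.58 | 56.72 | 18.89 | 15.32 | 38.45 | 104.13 | 38.26 |
| 2D CNN [prevost20183d] | 5.66 | 15.8 | 43.35 | 17.42 | 7.05 | 23.13 | 68.87 | 26.81 |
| 3D CNN (NS2) [xie2017aggregated] | 2.38 | 10.14 | 31.34 | 12.34 | 1.42 | 19.08 | 68.61 | 21.74 |
| Our DCL-Net | 1.97 | 9.15 | 27.03 | 10.33 | 1.09 | 17.40 | 55.50 | 17.39 |

Table 1: Performance of different methods on the EM-tracking dataset. Distance error and final drift are in mm.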

Table 1 summarizes the overall comparison of the proposed DCL-Net against other existing methods. “Linear Motion” means that we first calculate the mean motion vector of the training set and then apply this fixed vector to all the testing cases. “Decorrelation” is based on the speckle decorrelation algorithm presented in [chang20033]. “2D CNN” refers to the method presented by Prevost et al. [prevost20183d]. “3D CNN” is the vanilla ResNeXt [xie2017aggregated] architecture taking only two slices as input.
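The “Linear Motion” baseline is simple enough to state in code. The sketch below assumes the mean is taken over all inter-frame 6-DOF vectors pooled across the training videos; whether the original experiments pooled per frame pair or per video is not stated, and the function names are hypothetical.

```python
def fit_linear_motion(training_videos):
    """Mean 6-DOF inter-frame vector over all frame pairs of all training videos.

    training_videos: list of videos, each a list of 6-DOF vectors.
    """
    vectors = [theta for video in training_videos for theta in video]
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(6)]

def predict_linear_motion(mean_vector, num_frames):
    """Apply the same fixed motion to every neighboring frame pair of a test video."""
    return [list(mean_vector) for _ in range(num_frames - 1)]
```

Since this baseline ignores image content entirely, any method that scores close to it has learned little beyond the average scanning style.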

It can be seen from Table 1 that our proposed DCL-Net outperforms all the other methods. A paired t-test was performed, and the performance improvement made by DCL-Net is significant in all cases with $p$-value $< 0.05$. It is worth noting that although the average distance error of 10.33 mm achieved by DCL-Net is still large, this is the best performance on real clinical data rather than phantom studies. The state-of-the-art 2D-CNN method reproduced in our experiments performs consistently with the accuracy reported in the original paper [prevost20183d]. Reconstructing a 3D US volume from these freehand TRUS scans is a challenging task, and the results represent clear progress in this important area.

Figure 6: Predicted rotation on one video sequence with different methods. Applying correlation loss makes our prediction (blue line) more sensitive to the strongly varying speed of the ground-truth (green line).
Figure 7: Comparison of the US volume reconstruction results of four cases with different qualities.

Next, we demonstrate the effectiveness of incorporating the case-wise correlation loss into the network training. Fig. 6 shows the predicted rotation along a video sequence. We can observe that the network trained with the MSE loss alone can only produce mediocre results (red line) which are nearly constant, showing almost no sensitivity to the change in speed and orientation. Its prediction wanders around the linear motion (black line), which represents the mean value of the probe’s trajectory. The correlation coefficients of all testing cases have a mean value that represents little correlation. This indicates that using MSE alone makes the network memorize the general style of the US probe motion trajectory and fail to produce valid predictions based on image contents. By incorporating the correlation loss (CL) into the loss function (blue line), the mean of the correlation coefficients of all testing cases rises to a value representing weak correlation. Based on a paired t-test, this is found to be significantly better than the previous results. Intuitively, the network’s prediction reacts more sensitively to the variation of the probe’s real translation and rotation (green line).

Last but not least, we report the volume reconstruction results using four testing cases with different reconstruction qualities, as shown in Fig. 7. One good case, one bad case, and two median cases are included to offer a complete view of the performance. To reduce the clutter in the figure, we only show the comparison between our DCL-Net, the 2D-CNN [prevost20183d] and the ground truth. While producing competitive performance, the 2D-CNN method is less sensitive to the speed variations of the US probe, and its estimated trajectory has noisy vibration. Its results sometimes even deviate severely from the ground truth. Our proposed DCL-Net shows a much smoother trajectory estimation thanks to the contextual information provided by video segments.

5 Conclusions

This paper introduced a sensorless freehand 3D US volume reconstruction method based on deep learning. The proposed DCL-Net effectively extracts the information among multiple US frames to improve the US probe trajectory estimation. Experiments on a sizable EM-tracked ultrasound dataset demonstrated that the proposed DCL-Net benefits from contextual learning and shows superior performance when compared to other existing methods. Further experiments on ultrasound videos with different scanning protocols will be carried out in our future work.

6 Acknowledgements

This work was partially supported by National Institute of Biomedical Imaging and Bioengineering (NIBIB) of the National Institutes of Health (NIH) under awards R21EB028001 and R01EB027898, and through an NIH Bench-to-Bedside award made possible by the National Cancer Institute.