ImplicitVol: Sensorless 3D Ultrasound Reconstruction with Deep Implicit Representation

by Pak Hei Yeung, et al.
University of Oxford

The objective of this work is to achieve sensorless reconstruction of a 3D volume from a set of 2D freehand ultrasound images with deep implicit representation. In contrast to the conventional way that represents a 3D volume as a discrete voxel grid, we do so by parameterizing it as the zero level-set of a continuous function, i.e. implicitly representing the 3D volume as a mapping from the spatial coordinates to the corresponding intensity values. Our proposed model, termed ImplicitVol, takes a set of 2D scans and their estimated locations in 3D as input, jointly refining the estimated 3D locations and learning a full reconstruction of the 3D volume. When testing on real 2D ultrasound images, novel cross-sectional views that are sampled from ImplicitVol show significantly better visual quality than those sampled from existing reconstruction approaches, outperforming them by over 30% (NCC and SSIM) between the output and ground-truth on the 3D volume testing data. The code will be made publicly available.




1 Introduction

Two-dimensional (2D) freehand ultrasonography is a popular imaging tool for many clinical tests, for example standard obstetric exams, due to its cost-effectiveness, portability and real-time acquisition capabilities. However, since each 2D ultrasound scan only represents a cross-sectional view of the three-dimensional (3D) scanned structure (Fig. 1), sonographers need to mentally reconstruct the 3D structures from the 2D scans during scanning and visual analysis of the images. This process requires extensive expertise and anatomical knowledge, which may be limited in less developed regions, where there is a shortage of medical professionals [1]. In addition, since 2D freehand ultrasound scanning relies heavily on manual operation, inter-operator variability is often observed and may affect the accountability of the diagnosis.

Three-dimensional ultrasonography, on the other hand, captures the whole structure as a 3D volumetric image during the scanning. This leads to a range of advantages, including shorter scanning and assessment times, more flexible offline and secondary examination, and richer diagnostic information [2, 14, 7, 6]. Nevertheless, due to its more sophisticated hardware requirements and design, a 3D ultrasound system is considerably bulkier and may cost ten times more than a 2D system, limiting its use in practical scenarios. Here, we capitalize on the advantages of 2D ultrasound, by reconstructing 3D volumes from a set of freehand 2D ultrasound scans.

In the literature, as summarized in [10] and Section 2.2, early attempts tackled such a task by registering the 2D scans into the 3D volumes, and explicitly performing interpolations in the resulting volumetric representation. Despite the promising results achieved, multi-step reconstruction often suffers from several challenges, for example, error accumulation due to incorrect estimation of the 3D locations of the 2D scans, and limited resolution from the coarsely tessellated grid.

In this paper, we aim to tackle the aforementioned challenges by parametrising the 3D volume as a deep neural network, which jointly refines the 2D-to-3D registrations and learns full 3D reconstruction based on only a set of 2D scans. To the best of our knowledge, our framework, ImplicitVol, is the first study to propose a genuine sensorless (i.e. in both training and inference) 3D reconstruction pipeline based on deep implicit representation. We test ImplicitVol on both volume-sampled and native freehand 2D ultrasound images. Novel cross-sectional view images sampled from ImplicitVol show significantly better visual quality and match (i.e. more than 30% improvement in structural similarity index) with the ground-truth when compared to different baseline approaches (Fig. 1). Although we only demonstrate the technique for 3D reconstruction of fetal brain ultrasound, the proposed approach is expected to be more general, and future work will aim to extend it to other modalities.

2 Construction of 3D Representations

In Section 2.1, we briefly review the two ways of representing a 3D volume, followed by the conventional 3D reconstruction approaches, and our proposed approach with implicit representation in Section 2.2.

2.1 Explicit and Implicit 3D Representations

By nature, a 3D volume is a mapping from a set of 3D spatial positions (i.e. 3D coordinates) to the corresponding intensity values in the real world. In general, there are two different ways of representing a 3D volume, either explicitly or implicitly, as described and compared in Table 1 and the following sections:

| | Explicit | Implicit |
|---|---|---|
| continuity | discrete voxel grid | continuous function |
| memory-efficiency | lower | higher |
| resolution | defined by grid | arbitrary |
| gradient & derivatives | limited by discretization | continuous & well-defined |

Table 1: Comparison between the explicit and implicit representations for 3D volumes.

Explicit Representation. Conventionally, a 3D volume, $V \in \mathbb{R}^{H \times W \times D \times C}$, is represented discretely and explicitly as a tensor with height ($H$), width ($W$), depth ($D$), and intensity channels ($C$). Most medical applications involving 3D volumes rely on such a representation.

Implicit Representation. As an alternative, a 3D volume can also be represented as a zero level set of a continuous function parameterized by $\Theta$. Such an implicit representation compresses the volumetric information and encodes it as parameters of a model, for example a deep neural network, that maps the 3D coordinates, $\mathbf{x} = (x, y, z)$, to intensities, i.e. $F_\Theta: \mathbf{x} \mapsto V(\mathbf{x})$.
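The difference between the two representations can be illustrated with a small sketch. The function `toy_intensity` below is a hypothetical stand-in for a trained network (it is not the model from this paper), and the grid size is chosen arbitrarily. The point is only that a voxel grid can be read at integer indices, while a continuous function can be queried at any real-valued coordinate.

```python
import math

# Explicit: intensities stored on a fixed H x W x D grid (single channel).
H, W, D = 4, 4, 4

def toy_intensity(x, y, z):
    """Hypothetical smooth intensity field on [-1, 1]^3, standing in for a trained network."""
    return 0.5 + 0.5 * math.sin(math.pi * x) * math.cos(math.pi * y) * math.cos(math.pi * z)

def grid_coord(index, size):
    """Map a voxel index in [0, size - 1] to a normalized coordinate in [-1, 1]."""
    return -1.0 + 2.0 * index / (size - 1)

explicit_volume = [[[toy_intensity(grid_coord(i, H), grid_coord(j, W), grid_coord(k, D))
                     for k in range(D)] for j in range(W)] for i in range(H)]

# Explicit lookup is only defined at integer voxel indices.
voxel_value = explicit_volume[1][2][3]

# Implicit lookup accepts any real-valued coordinate, including off-grid points.
off_grid_value = toy_intensity(0.123, -0.456, 0.789)
```

Reading an off-grid value from `explicit_volume` would require an interpolation scheme, whereas the implicit form needs none.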

2.2 Conventional 3D Reconstruction Approaches

In the literature, as summarized in [10], early attempts on 2D-to-3D ultrasound reconstruction have been extensively built on the explicit representations, and can be summarised by the following steps:

  • First, the 3D location of each 2D ultrasound image is estimated, where external sensor tracking is required at either the training [15, 16] or inference [5, 4] stage, subject to errors caused by subjects’ internal motion (e.g. fetal movement).

  • Then, a 3D volume, represented discretely as a tensor of intensities, is reconstructed by ‘registering’ the localized 2D scans back to the 3D space, with holes being interpolated. However, such 2D-to-3D back-projections are often prone to errors, leading to artifacts, thus requiring post-hoc corrections.

  • Finally, approaches [3, 11] have been proposed to correct the aforementioned reconstruction artifacts, based on kernel smoothing and denoising. However, the effect may be limited, as the source of inaccuracy from the localization of the 2D images remains unsolved, as can be seen in the last two rows of Fig. 1.

Contributions. In this paper, we parametrise a deep neural network for representing a 3D volume implicitly (Fig. 2). Such a representation is continuous, and enables the querying of intensities at arbitrary spatial coordinates. With only a set of 2D scans available, it can produce jointly optimal 3D structures and 3D location estimations for these scans.

Figure 2: Pipeline of our proposed framework, ImplicitVol. During training, a set of 2D freehand ultrasound images, $\mathcal{I}$, and their estimated 3D locations, $\Lambda$, are used to train a deep neural network to implicitly represent the continuous 3D volume from which $\mathcal{I}$ are acquired. During inference, images at any plane can be obtained as output, by feeding the corresponding grid coordinates to the network.

3 Methods

In Section 3.1, we first formulate the problem setting in this paper, namely, reconstructing a 3D volume from only a sparse set of 2D fetal brain scans with implicit representation. Next, matching with the three conventional steps of 3D reconstruction summarized in Section 2.2, we introduce the corresponding components of ImplicitVol, namely sensorless 3D localization of 2D scans (Section 3.2), 3D reconstruction with implicit representation (Section 3.3) and joint optimization (Section 3.4). The whole pipeline is summarized in Fig. 2.

3.1 Problem Setup

In general, we have a set of $N$ 2D ultrasound images, $\mathcal{I} = \{I_i\}_{i=1}^{N}$, capturing different cross-sectional views of a fetal brain at the corresponding 3D locations, parameterized by $\Lambda = \{\Lambda_i\}_{i=1}^{N}$ with $\Lambda_i = (\mathbf{R}_i, \mathbf{t}_i)$, where $\mathbf{R}_i$ is the 3D Euler angle and $\mathbf{t}_i$ denotes the translation. Our goal is to reconstruct the volume, such that any 2D cross-sectional view of arbitrary resolution can be generated by querying the corresponding 3D coordinates, $\mathbf{x} = (x, y, z)$.

Inspired by [19, 9], we represent the volume as a continuous function, parameterized by a multi-layer perceptron (MLP) representation network, $F_\Theta$. The weights, $\Theta$, are learned by minimizing the discrepancy between the actual and network-predicted intensities of $\mathcal{I}$, when the 3D coordinates, $\mathbf{x}$, computed from the corresponding $\Lambda$, are input to the network.

3.2 Sensorless 3D Localization of 2D Scans

We use PlaneInVol [22] for estimating the 3D locations, $\Lambda$, of the set of 2D ultrasound images, $\mathcal{I}$. Without using any external tracking, PlaneInVol [22] is trained with a set of 2D slices, sampled from the aligned 3D brain volumes, and their locations in the 3D aligned space. Despite being trained only with synthetic 2D images, PlaneInVol [22] has demonstrated its generalizability when tested on real 2D freehand fetal brain images.

3.3 3D Reconstruction with Implicit Representation

Conceptually, the idea is to store the 3D volume in an MLP, the weights of which are learned through a set of training data, namely the 2D ultrasound images, $\mathcal{I}$, and their pre-computed 3D locations, $\Lambda$, detailed in Section 3.2.

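During training, we first derive the 3D coordinate, $\mathbf{x}_{ij}$, for pixel $j$ of the 2D ultrasound image, $I_i$, from the estimated $\Lambda_i = (\mathbf{R}_i, \mathbf{t}_i)$:

$$\mathbf{x}_{ij} = \mathbf{R}_i\,\mathbf{p}_j + \mathbf{t}_i \qquad (1)$$

In practice, this is achieved by first rotating the 3D coordinate of pixel $j$ of the reference plane, $\mathbf{p}_j$, by $\mathbf{R}_i$ (applied as the rotation matrix given by the Euler angles), and then translating it by $\mathbf{t}_i$.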
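The rotate-then-translate step can be sketched as follows. The Euler-angle convention (rotation about x, then y, then z), the placement of the reference plane at z = 0, and the function names are assumptions made for illustration; the text does not specify them.

```python
import math

def euler_to_matrix(rx, ry, rz):
    """Rotation matrix R = Rz @ Ry @ Rx from Euler angles in radians (assumed convention)."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]

def pixel_to_world(pixel, euler, translation):
    """Rotate a reference-plane point p = (u, v, 0), then translate it: x = R p + t."""
    R = euler_to_matrix(*euler)
    p = (pixel[0], pixel[1], 0.0)  # assumed: the reference plane lies at z = 0
    return tuple(sum(R[r][c] * p[c] for c in range(3)) + translation[r] for r in range(3))

# A pixel at (0.5, 0) on the reference plane, rotated 90 degrees about z, then shifted.
x = pixel_to_world((0.5, 0.0), (0.0, 0.0, math.pi / 2), (0.1, 0.2, 0.3))
```

In this example the rotation carries (0.5, 0, 0) to (0, 0.5, 0), and the translation then moves it to approximately (0.1, 0.7, 0.3).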

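Positional Encoding. Mapping each 3D coordinate, $\mathbf{x}$, to a higher dimensional space better represents the high frequency variation in the object’s intensity and geometry [9, 17]. Therefore, we encode $\mathbf{x}$ by the function $\gamma(\cdot)$ [9]:

$$\gamma(\mathbf{x}) = \left[\sin(2^{0}\pi\bar{x}),\ \cos(2^{0}\pi\bar{x}),\ \ldots,\ \sin(2^{L-1}\pi\bar{x}),\ \cos(2^{L-1}\pi\bar{x})\right] \qquad (2)$$

where $\bar{x}$ are the normalized values (i.e. from $-1$ to $1$) of each $x \in \mathbf{x}$, and $L$ sets the number of frequency bands. In the following sections, “3D coordinate” refers to the encoded coordinate, $\gamma(\mathbf{x})$.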
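A minimal sketch of a sinusoidal positional encoding in the style of [9] is given below. It assumes each coordinate value is already normalized to [-1, 1], and the ordering of the output entries (all frequencies of the first coordinate, then the second, and so on) is an arbitrary choice for illustration.

```python
import math

def positional_encoding(coord, num_frequencies=10):
    """For each normalized value v, emit sin(2^k * pi * v) and cos(2^k * pi * v) for k = 0..L-1."""
    encoded = []
    for v in coord:
        for k in range(num_frequencies):
            angle = (2.0 ** k) * math.pi * v
            encoded.extend((math.sin(angle), math.cos(angle)))
    return encoded

# A 3D coordinate with 10 frequency bands becomes a 3 * 2 * 10 = 60-dimensional vector.
gamma = positional_encoding((0.5, -0.25, 0.0))
```

With three coordinates and ten frequency bands, the encoded vector has 60 entries, which would then be the input dimension of the representation network.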

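Network Training. With the training set, $\{\mathcal{I}, \Lambda\}$, the weights, $\Theta$, of the representation network, $F_\Theta$, can be learned through normal back-propagation:

$$\Theta^{*} = \arg\min_{\Theta}\ \mathcal{L}(\hat{\mathcal{I}}, \mathcal{I}) \qquad (3)$$

where $\mathcal{L}$ is the photometric loss between the observed 2D slices, $\mathcal{I}$, and the reconstructed 2D slices, $\hat{\mathcal{I}}$, i.e. the structural similarity (SSIM) loss [20].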
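As a rough illustration of an SSIM-based loss, the sketch below computes SSIM from whole-image statistics and uses 1 - SSIM as the loss. Both choices are simplifying assumptions: the standard SSIM of [20] averages over local windows, and the exact form of the loss used for training is not stated here. The constants follow the usual defaults of (0.01 x range)^2 and (0.03 x range)^2.

```python
def global_ssim(a, b, data_range=1.0):
    """SSIM from whole-image means, variances and covariance (no sliding window)."""
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((p - mu_a) ** 2 for p in a) / n
    var_b = sum((q - mu_b) ** 2 for q in b) / n
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b)) / n
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator

def ssim_loss(predicted, observed):
    """Assumed loss form: 1 - SSIM, so identical images give a loss near zero."""
    return 1.0 - global_ssim(predicted, observed)

observed_slice = [0.1, 0.5, 0.9, 0.3]   # flattened toy "slice"
inverted_slice = [0.9, 0.5, 0.1, 0.7]   # same contrast, opposite structure
loss_same = ssim_loss(observed_slice, observed_slice)
loss_inverted = ssim_loss(inverted_slice, observed_slice)
```

A slice compared with itself gives a loss of about zero, while the structurally inverted slice gives a loss above one because its covariance with the observation is negative.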

3.4 Joint Optimization for Location Refinement

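In practice, the 3D locations, $\Lambda$, predicted by PlaneInVol [22], are imperfect due to prediction error. Inspired by [21], while training the network, $F_\Theta$, we simultaneously update (i.e. refine) the pre-computed 3D locations, $\Lambda$, through joint optimization, which can be summarized as:

$$\Theta^{*}, \Lambda^{*} = \arg\min_{\Theta, \Lambda}\ \mathcal{L}(\hat{\mathcal{I}}, \mathcal{I}) \qquad (4)$$

where $\hat{I}_i(j) = F_\Theta(\gamma(\mathbf{x}_{ij}))$, and $\mathbf{x}_{ij}$ and $\gamma$ are from Eq. 1 and 2, respectively.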
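The idea of updating the representation and the scan locations in the same gradient step can be shown on a deliberately tiny 1D toy problem. Everything below is an assumption made for illustration: the "volume" is a curve, the "network" is a hypothetical two-parameter model `a*sin(x) + b*cos(x)`, each "scan" is three samples around a scalar position, and mean squared error replaces the SSIM loss.

```python
import math

# Toy 1D stand-in: each "scan" is 3 samples taken around a true position t_i.
offsets = [-0.2, 0.0, 0.2]
true_positions = [0.0, 1.0, 2.0, 3.0]
observed = [[math.sin(t + p) for p in offsets] for t in true_positions]

# Imperfect initial position estimates (playing the role of pre-computed locations).
positions = [0.3, 0.8, 2.25, 2.7]
initial_positions = list(positions)
a, b = 0.5, 0.0  # parameters of the hypothetical representation f(x) = a*sin(x) + b*cos(x)

def mean_squared_error(a, b, positions):
    total = 0.0
    for t, scan in zip(positions, observed):
        for p, target in zip(offsets, scan):
            x = t + p
            total += (a * math.sin(x) + b * math.cos(x) - target) ** 2
    return total / (len(positions) * len(offsets))

initial_loss = mean_squared_error(a, b, positions)
learning_rate = 0.05
count = len(positions) * len(offsets)
for _ in range(2000):
    grad_a = grad_b = 0.0
    grad_t = [0.0] * len(positions)
    for i, (t, scan) in enumerate(zip(positions, observed)):
        for p, target in zip(offsets, scan):
            x = t + p
            residual = a * math.sin(x) + b * math.cos(x) - target
            grad_a += 2.0 * residual * math.sin(x) / count
            grad_b += 2.0 * residual * math.cos(x) / count
            # Gradient with respect to the scan position, via the chain rule through x = t + p.
            grad_t[i] += 2.0 * residual * (a * math.cos(x) - b * math.sin(x)) / count
    # One joint step: representation parameters and scan positions are updated together.
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    positions = [t - learning_rate * g for t, g in zip(positions, grad_t)]

final_loss = mean_squared_error(a, b, positions)
```

As in the 3D setting, the solution is only determined up to a global shift of all positions, which is one reason reconstructed volumes may need rigid alignment before comparison with a ground truth.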

Inference. The trained representation network, $F_\Theta$, represents a continuous 3D fetal brain captured by the set of 2D images. Any 2D cross-sectional view at any resolution can be easily obtained as the output, by feeding the corresponding grid coordinates for the desired slice to the network, as illustrated in the bottom half of Fig. 2.
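Sampling a slice at an arbitrary resolution amounts to building a grid of coordinates on the desired plane and querying the function at each one. In the sketch below, `toy_field` is a hypothetical stand-in for the trained network, and the plane is axis-aligned for brevity; an arbitrary plane would first pass the grid through a rotation and translation.

```python
import math

def toy_field(x, y, z):
    """Hypothetical continuous intensity field, standing in for the trained network."""
    return 0.5 + 0.5 * math.sin(math.pi * x) * math.cos(math.pi * y) * math.cos(math.pi * z)

def render_slice(field, resolution, z=0.0, extent=1.0):
    """Query the field on a resolution x resolution grid covering [-extent, extent]^2 at height z."""
    step = 2.0 * extent / (resolution - 1)
    return [[field(-extent + col * step, -extent + row * step, z)
             for col in range(resolution)]
            for row in range(resolution)]

# The same plane rendered at two resolutions, with no re-training or interpolation step.
coarse = render_slice(toy_field, 8)
fine = render_slice(toy_field, 64)
```

Only the density of the query grid changes between the two calls; the underlying representation is the same.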

4 Experimental Setup

We test ImplicitVol on both volume-sampled and native freehand 2D ultrasound fetal brain images, and compare it with different baseline approaches. We use normalized cross-correlation (NCC) [23] and structural similarity index measure (SSIM) [20], for appearance comparison, as well as absolute difference between rotation angles and absolute distance between translations for location estimation comparison. In Section 4.1, we introduce the datasets in this study, followed by the experimental design in Section 4.2 and the implementation details in Section 4.3.
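The appearance and location metrics can be sketched as below. The zero-mean form of NCC, the averaging of the absolute angle differences over the three Euler angles, and the use of Euclidean distance for translations are assumptions; the text names the metrics but does not give their exact formulas.

```python
import math

def ncc(a, b):
    """Zero-mean normalized cross-correlation between two flattened images, in [-1, 1]."""
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    num = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b))
    den = math.sqrt(sum((p - mu_a) ** 2 for p in a) * sum((q - mu_b) ** 2 for q in b))
    return num / den

def rotation_error(est_angles, true_angles):
    """Mean absolute difference between Euler angles (same units as the inputs)."""
    return sum(abs(e - t) for e, t in zip(est_angles, true_angles)) / len(true_angles)

def translation_error(est_t, true_t):
    """Euclidean distance between estimated and ground-truth translations."""
    return math.dist(est_t, true_t)

image = [0.1, 0.5, 0.9, 0.3]
negated = [-v for v in image]
same_score = ncc(image, image)
opposite_score = ncc(image, negated)
```

An image compared with itself scores 1, and compared with its negation scores -1, which brackets the range of the metric.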

4.1 Dataset

Volume-sampled 2D images are generated by sampling planes from fifteen 3D ultrasound fetal brain volumes ( voxels ), collected at 20 weeks’ gestational age. The data were obtained as part of the INTERGROWTH-21st study [13], which were collected using a Philips HD9 curvilinear probe at a 2–5 MHz wave frequency.

In addition, two videos of native freehand 2D brain scans with around 250 frames each, collected at 20 weeks’ gestational age at the Leiden University Medical Center using GE Voluson E10, are used for qualitative analysis.

4.2 Experimental Design

Volume-Sampled Images. For each of the fifteen 3D volumes introduced in Section 4.1, $N$ ($N = 128$ or $256$) 2D slices were sampled non-uniformly around the central axis of the brain, to simulate actual freehand acquisition by rotating the probe. We conducted experiments on volume-sampled 2D images because ideal ground-truth can be easily obtained for quantitatively benchmarking different approaches. 2D slices sampled at new cross-sectional views along the coronal, sagittal and axial directions, from both the original (i.e. ground-truth) volumes and the volumes reconstructed by different approaches, were analyzed. The reconstructed volumes were rigidly aligned to the ground-truth volume for fair comparison, as a rigid shift may be introduced to the volumes during the reconstruction. The estimated 3D locations refined by ImplicitVol were also compared to the ground-truth locations, and those predicted by other baseline approaches.

Real Images. The two videos of real 2D freehand fetal brain ultrasound were acquired along the axial direction. Novel cross-sectional view images sampled from volumes reconstructed from different approaches were only qualitatively analyzed, due to the lack of ground-truth 3D location information.

4.3 Implementation Details

ImplicitVol. Our representation network, $F_\Theta$, is a 5-layer MLP, with a hidden layer dimension of 128 and SIREN [18] as the activation function. $L$ (the number of frequency bands in the positional encoding), from Eq. 2, was set to 10 and we initialized the set of 3D locations, $\Lambda$, with the estimated locations predicted by PlaneInVol [22]. The learning and decay rates followed those adopted in [21]. A representation network, $F_\Theta$, was trained on one set of images for 10000 epochs to represent one 3D volume.
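A forward pass through a small sine-activated MLP of the stated size can be sketched in plain Python. Several details are assumptions: "5-layer" is read as five linear layers, the input width of 60 corresponds to three coordinates encoded with ten frequency bands, the initialization ranges and the frequency factor of 30 follow the scheme proposed in [18], and biases are simply set to zero.

```python
import math
import random

def init_siren(layer_sizes, omega_0=30.0, seed=0):
    """Random weights using the uniform ranges proposed for sine networks in [18]."""
    rng = random.Random(seed)
    layers = []
    for index, (fan_in, fan_out) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:])):
        bound = 1.0 / fan_in if index == 0 else math.sqrt(6.0 / fan_in) / omega_0
        weights = [[rng.uniform(-bound, bound) for _ in range(fan_in)] for _ in range(fan_out)]
        biases = [0.0] * fan_out  # simplification: zero biases
        layers.append((weights, biases))
    return layers

def forward(layers, inputs, omega_0=30.0):
    """Sine activation on every layer except the last, which is linear."""
    activations = list(inputs)
    for index, (weights, biases) in enumerate(layers):
        pre = [sum(w * v for w, v in zip(row, activations)) + bias
               for row, bias in zip(weights, biases)]
        is_last = index == len(layers) - 1
        activations = pre if is_last else [math.sin(omega_0 * value) for value in pre]
    return activations

# 60 inputs = 3 coordinates x 2 functions x 10 frequencies; 5 linear layers; hidden width 128.
network = init_siren([60, 128, 128, 128, 128, 1])
intensity = forward(network, [0.1] * 60)[0]
```

The network here is untrained, so `intensity` is just the output of a random initialization; in practice the weights would be fitted with an automatic-differentiation framework rather than hand-written loops.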

PlaneInVol [22]. With the set of 2D ultrasound images, $\mathcal{I}$, and the corresponding 3D locations, $\Lambda$, predicted by PlaneInVol [22], the volume was explicitly reconstructed by interpolating the intensity at each voxel as the inverse distance weighted average of the 20 nearest pixels of $\mathcal{I}$.
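Inverse distance weighted interpolation from the nearest localized pixels can be sketched as follows. The weighting power of 1, the small `eps` added to avoid division by zero, and the brute-force neighbour search are assumptions for illustration; a practical implementation would use a spatial index.

```python
import math

def idw_intensity(voxel, points, intensities, k=20, eps=1e-8):
    """Weighted average of the k nearest samples, with weights 1 / (distance + eps)."""
    nearest = sorted((math.dist(voxel, p), v) for p, v in zip(points, intensities))[:k]
    weights = [1.0 / (d + eps) for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)

# Two localized pixels in 3D space and their intensities.
pixel_positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pixel_intensities = [0.2, 0.8]

midpoint_value = idw_intensity((0.5, 0.0, 0.0), pixel_positions, pixel_intensities)
on_sample_value = idw_intensity((0.0, 0.0, 0.0), pixel_positions, pixel_intensities)
```

A voxel halfway between the two pixels receives their average, while a voxel that coincides with a pixel takes a value very close to that pixel's intensity. Voxels far from every pixel are still assigned a value, which is where extrapolation errors in under-sampled regions can arise.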

PlaneInVol [22] + SVRTK [8]. Slice-to-volume registration is well studied for super-resolution reconstruction of motion-corrupted MRI. We applied SVRTK [8], designed for fetal brain MRI motion correction, to the PlaneInVol [22] interpolated volume to verify whether a technique developed for a similar task in a different modality (i.e. MRI) may help in our problem setting.

5 Results and Discussion

The results of all the experiments are presented in Table 2, with qualitative examples shown in Fig. 3. In Section 5.1 and 5.2, we analyze the results of volume-sampled and native freehand 2D ultrasound images, respectively.

| Row | Method | No. of 2D images |
|---|---|---|
| (a) | PlaneInVol [22] | 128 |
| (b) | PlaneInVol [22] | 256 |
| (c) | PlaneInVol [22] + SVRTK [8] | 128 |
| (d) | PlaneInVol [22] + SVRTK [8] | 256 |
| (e) | ImplicitVol | 128 |
| (f) | ImplicitVol | 128 |
| (g) | ImplicitVol | 256 |

Table 2: Evaluation results (mean ± standard deviation) of different approaches on volume-sampled 2D images, reported for the axial, coronal and sagittal directions and for location estimation. ↑ indicates that higher values are more accurate, and ↓ vice versa.

5.1 Volume-Sampled Images

A few conclusions can be drawn from the results presented in Table 2 and Fig. 1.

Firstly, the novel view images sampled from our proposed approach, ImplicitVol (rows e-g), showed a better match with the corresponding ground-truth, as suggested by the higher NCC and SSIM values, compared to the conventional approaches (rows a-d). Performance was improved by more than 30% for all coronal, sagittal and axial directions, which can be further verified by the qualitative examples shown in Fig. 1.

Secondly, comparing row e to row f, updating the estimated 3D locations through the joint optimization in ImplicitVol led to a significant boost in visual quality as well as more accurate localization of the 2D images in the 3D space. Note that such refinement requires no extra supervision cost, which shows ImplicitVol’s additional potential in slice-to-volume registration of ultrasound.

Thirdly, a larger training set (i.e. 128 to 256) led to better performance (row f to row g). Thanks to the real-time acquisition capability of 2D ultrasound, acquiring hundreds of images in one scan is easily achievable in practice.

Figure 3: Results of 3D reconstruction from native freehand 2D ultrasound. Novel view images sampled from different planes of volumes reconstructed by different approaches are presented. ImplicitVol shows better visual quality in under-sampled regions (yellow boxes) and is more robust against inaccurate position estimation (red boxes).

5.2 Native Freehand Images

As shown by the native freehand 2D ultrasound results presented in Fig. 3, images sampled from ImplicitVol showed better visual quality at motion-corrupted regions (red boxes), thanks to the localization refinement achieved by the joint optimization. ImplicitVol also performed better at under-sampled regions (yellow boxes), where the baseline approaches reconstructed misleading results due to the inaccuracy caused by extrapolation from the spatially distant neighbours.

6 Conclusion

In summary, we investigate sensor-free 3D ultrasound reconstruction from a sparse set of 2D images with deep implicit representation. Our proposed framework, ImplicitVol, demonstrates superior performance, in terms of the quality of the sampled images as well as the refinement of the 3D localization, when compared to other baseline approaches. ImplicitVol may facilitate more standardized analysis of 2D ultrasound sequences, which may lead to better and more efficient diagnosis and assessment, particularly in settings where only 2D probes are available but 3D assessment would be beneficial.


  • [1] B. R. Benacerraf (2002) Three-dimensional fetal sonography. Journal of Ultrasound in Medicine 21 (10), pp. 1063–1067.
  • [2] M. Chen, C. Chang, C. Yu, Y. Cheng, and F. Chang (2001) Prenatal diagnosis of cleft palate by three-dimensional ultrasound. Ultrasound in Medicine & Biology 27 (8), pp. 1017–1023.
  • [3] X. Chen, T. Wen, X. Li, W. Qin, D. Lan, W. Pan, and J. Gu (2014) Reconstruction of freehand 3D ultrasound based on kernel regression. Biomedical Engineering Online 13 (1), pp. 1–15.
  • [4] S. Chung, C. Shih, and C. Huang (2017) Freehand three-dimensional ultrasound imaging of carotid artery using motion tracking technology. Ultrasonics 74, pp. 11–20.
  • [5] M. I. Daoud, A. Alshalalfah, F. Awwad, and M. Al-Najar (2015) Freehand 3D ultrasound imaging system using electromagnetic tracking. In 2015 International Conference on Open Source Software Computing (OSSCOM), pp. 1–5.
  • [6] A. M. Dückelmann and K. D. Kalache (2010) Three-dimensional ultrasound in evaluating the fetus. Prenatal Diagnosis 30 (7), pp. 631–638.
  • [7] L. F. Gonçalves (2016) Three-dimensional ultrasound of the fetus: how does it help?. Pediatric Radiology 46 (2), pp. 177–189.
  • [8] M. Kuklisova-Murgasova, G. Quaghebeur, M. A. Rutherford, J. V. Hajnal, and J. A. Schnabel (2012) Reconstruction of fetal brain MRI with intensity matching and complete outlier removal. Medical Image Analysis 16 (8), pp. 1550–1564.
  • [9] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2020) NeRF: representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, pp. 405–421.
  • [10] F. Mohamed and C. V. Siang (2019) A survey on 3D ultrasound reconstruction techniques. Artificial Intelligence - Applications in Medicine and Biology.
  • [11] H. Moon, G. Ju, S. Park, and H. Shin (2016) 3D freehand ultrasound reconstruction using a piecewise smooth Markov random field. Computer Vision and Image Understanding 151, pp. 101–113.
  • [12] F. Moser, R. Huang, A. T. Papageorghiou, B. W. Papież, and A. I. Namburete (2019) Automated fetal brain extraction from clinical ultrasound volumes using 3D convolutional neural networks. In Annual Conference on Medical Image Understanding and Analysis, pp. 151–163.
  • [13] A. T. Papageorghiou, E. O. Ohuma, D. G. Altman, T. Todros, L. C. Ismail, A. Lambert, Y. A. Jaffer, E. Bertino, M. G. Gravett, and M. Purwar (2014) International standards for fetal growth based on serial ultrasound measurements: the fetal growth longitudinal study of the INTERGROWTH-21st project. The Lancet 384 (9946), pp. 869–879.
  • [14] L. Pistorius, P. Stoutenbeek, F. Groenendaal, L. De Vries, G. Manten, E. Mulder, and G. Visser (2010) Grade and symmetry of normal fetal cortical development: a longitudinal two- and three-dimensional ultrasound study. Ultrasound in Obstetrics & Gynecology 36 (6), pp. 700–708.
  • [15] R. W. Prager, A. H. Gee, G. M. Treece, C. J. Cash, and L. H. Berman (2003) Sensorless freehand 3-D ultrasound using regression of the echo intensity. Ultrasound in Medicine & Biology 29 (3), pp. 437–446.
  • [16] R. Prevost, M. Salehi, J. Sprung, A. Ladikos, R. Bauer, and W. Wein (2017) Deep learning for sensorless 3D freehand ultrasound imaging. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 628–636.
  • [17] N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville (2019) On the spectral bias of neural networks. In International Conference on Machine Learning, pp. 5301–5310.
  • [18] V. Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein (2020) Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems 33.
  • [19] V. Sitzmann, M. Zollhöfer, and G. Wetzstein (2019) Scene representation networks: continuous 3D-structure-aware neural scene representations. arXiv preprint arXiv:1906.01618.
  • [20] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
  • [21] Z. Wang, S. Wu, W. Xie, M. Chen, and V. A. Prisacariu (2021) NeRF--: neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064.
  • [22] P. Yeung, M. Aliasi, A. T. Papageorghiou, M. Haak, W. Xie, and A. I. Namburete (2021) Learning to map 2D ultrasound images into 3D space with minimal human annotation. Medical Image Analysis 70, pp. 101998.
  • [23] J. Yoo and T. H. Han (2009) Fast normalized cross-correlation. Circuits, Systems and Signal Processing 28 (6), pp. 819–843.