I Introduction and Background
In the typical clinical B-mode ultrasound imaging paradigm, a transducer probe transmits acoustic energy into tissue, and the backscattered energy is reconstructed via beamforming techniques into a human eye-friendly image. This image attempts to faithfully map tissue’s acoustic impedance, which is a property of its bulk modulus and density. Unfortunately, there are many sources of image degradation such as electronic noise, speckle from sub-resolution scatterers, reverberation, and de-focusing caused by heterogeneity in tissue sound speed [24]. In the literature, these sources of image degradation can be suppressed through better focusing [30, 4], spatial compounding [32], harmonic imaging [3], and coherence imaging techniques [19, 22].
In addition to beamforming, image post-processing is a significant contributor to image quality improvement. Reader studies have shown that medical providers largely prefer post-processed images over raw beamformed imagery [2, 19]. Unfortunately, commercial post-processing algorithms are proprietary, and implementation details are typically kept as a black-box to the end-user. Thus, researchers that develop image improvement techniques on highly configurable research systems, such as Verasonics and Cephasonics scanners, face challenges in presenting their images alongside current clinical system scanner baselines. The current status quo for researchers working on novel image forming techniques is to compare against raw beamformed data which is not typically viewed by medical providers. To have a pixel-wise comparison with clinical-grade standards, researchers would either need access to proprietary post-processing code or access to raw data from difficult-to-configure commercial scanners. We aim to remove these significant barriers by leveraging recent deep learning methods.
Deep learning based post-processing using convolutional neural network (CNN) generators [34, 20] has become immensely popular in the image restoration problem [29, 31]. One popular network architecture is an encoder-decoder network with skip connections commonly referred to as a Unet [26]. In the image restoration problem, the encoder portion of Unet takes a noisy image as input and creates feature map stacks which are subsequently down-sampled through max pool operations. The decoder portion up-samples features and attempts to reconstruct an image of the same size as the input. Skip connections in Unet have been shown to maintain high-frequency information from the original image better than networks without them. Other encoder-decoder Unet flavours exist which exploit residual learning [10, 36], wavelet transforms [17], and dense blocks [12, 14]. Encoder and decoder network parameters are typically optimized with a gradient descent based method which minimizes a distance function between the reconstructed and ground truth image. Different distance functions such as mean squared error (MSE), mean absolute error (MAE), and structural similarity index measurement (SSIM) have been used in practice [37, 28].
Adversarial objective functions are a unique class of distance functions that have shown success in the related field of image generation [6]. The adversarial objective optimizes two networks simultaneously. Given training batches of size $m$ with individual examples $x^{(i)}$, $G$ is a network that generates images from noise $z$, and another network, $D$, discriminates between real images $x$ and fake generated images $G(z)$. $G$ and $D$ play a min-max game since they have competing objective functions shown in Eq. 1 and Eq. 2, where $\theta_G$ are the parameters of $G$ and $\theta_D$ are the parameters of $D$. If this min-max game converges, $G$ ultimately learns to generate realistic fake images that are indistinguishable from real ones from the perspective of $D$.
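For reference, the two competing objectives of the original GAN formulation [9] are commonly written as below. This is the textbook form, offered as an assumption about what Eq. 1 and Eq. 2 express. The symbols $m$ (batch size), $x^{(i)}$ (real example), $z^{(i)}$ (noise sample), $G$, $D$, $\theta_G$, and $\theta_D$ follow common convention and may differ from the notation of the original equations.

```latex
\begin{align}
\min_{\theta_G} \; & \frac{1}{m} \sum_{i=1}^{m}
    \log\!\left(1 - D\!\left(G\!\left(z^{(i)}\right)\right)\right) \\
\max_{\theta_D} \; & \frac{1}{m} \sum_{i=1}^{m}
    \left[ \log D\!\left(x^{(i)}\right)
    + \log\!\left(1 - D\!\left(G\!\left(z^{(i)}\right)\right)\right) \right]
\end{align}
```

The generator lowers the first expression by making $D(G(z))$ large, while the discriminator raises the second by scoring real images high and generated images low, which is the source of the min-max behaviour.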
In the literature, these networks are referred to as generative adversarial networks (GANs) [9, 25]. Conditional GANs (cGANs) have seen success in image restoration as well as style transfer. With cGANs, a structured input, such as an image segmentation or corrupted image, is given instead of random noise [13].
In the field of ultrasound, deep learning techniques using cGANs and CNNs have recently been applied to B-mode imaging. They have shown promising results for reducing speckle noise, enhancing image contrast, and increasing other image quality metrics [1, 8, 23]. However, training GANs or CNNs for image enhancement requires ground truths for comparison. These are typically before and after image enhancement pairs that are registered with one another. Unfortunately, this is a luxury not often available in research environments that require clinical-grade ground truths.
An extension of GANs known as cycle-consistent GANs (CycleGAN) has been proposed by [38] to get around the requirement of paired images. CycleGANs are shown to excel at the problem of style transfer where images are mapped from one domain to another without the use of explicitly paired images. CycleGANs consist of two key components: forward-reverse domain generators, $G_{A \to B}$ and $G_{B \to A}$, and forward-reverse domain discriminators, $D_A$ and $D_B$. The generators translate images from one domain to another, and the discriminators distinguish between real and fake generated images in each domain. We show the objective functions for one direction of the cycle in Eq. 3 and Eq. 4, where $a$ is an image from domain $A$, and $b$ is an image from domain $B$. In Eq. 3 and Eq. 4, the variables $\theta_{G_{A \to B}}$ and $\theta_{D_B}$ are the parameters for the domain forward generator and domain discriminator. In Eq. 3, $d$ can represent any distance metric used to compare two images.
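The structure of a one-direction CycleGAN generator objective can be sketched as follows. This is a minimal illustration, not the exact form of Eq. 3: it assumes a least-squares adversarial term, uses mean absolute error as the distance metric, and introduces a hypothetical `cycle_weight` parameter. The names `g_ab`, `g_ba`, and `d_b` are illustrative stand-ins for the forward generator, reverse generator, and domain-B discriminator.

```python
import numpy as np


def mean_absolute_error(u, v):
    """One possible choice for the distance metric in the cycle term."""
    return float(np.mean(np.abs(u - v)))


def forward_generator_loss(a, g_ab, g_ba, d_b,
                           distance=mean_absolute_error, cycle_weight=10.0):
    """Generator objective for the A -> B direction of a CycleGAN.

    a            : batch of images from domain A
    g_ab, g_ba   : forward and reverse generators (any callables on arrays)
    d_b          : domain-B discriminator returning real-valued scores
    cycle_weight : hypothetical weight on the cycle term (not from the source)
    """
    fake_b = g_ab(a)
    # Least-squares adversarial term: push D_B's scores on fakes toward 1.
    adversarial = float(np.mean((d_b(fake_b) - 1.0) ** 2))
    # Cycle-consistency term: A -> B -> A should reproduce the input.
    cycle = distance(a, g_ba(fake_b))
    return adversarial + cycle_weight * cycle


# Toy usage with stand-in callables instead of trained networks.
a = np.random.default_rng(0).random((2, 8, 8))
loss = forward_generator_loss(
    a,
    g_ab=lambda img: 2.0 * img,                      # pretend A -> B mapping
    g_ba=lambda img: 0.5 * img,                      # its exact inverse
    d_b=lambda img: np.ones((img.shape[0], 2, 2)),   # discriminator fully fooled
)
```

Because the toy reverse generator exactly inverts the forward one and the stand-in discriminator is fully fooled, both terms vanish here. Any information the forward generator destroys would show up as a non-zero cycle term.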
In this work, we investigate if it is possible to approximate post-processing algorithms found on clinical-grade scanners given raw conventional beamformed data as input to Unet generators. We first show what is theoretically feasible when before and after image pairs are provided and refer to this as a gray-box constraint. We view this as the classic image restoration problem where clinical-grade post-processed images are ground truth, and raw data are “corrupted”. Later, we constrain ourselves to the more realistic black-box setting where no before and after image pairs are available. We view this problem from the style transfer lens and train a CycleGAN from scratch to mimic clinical-grade post-processing. We refer to this trained model configuration as MimickNet. Our results suggest that any manufacturers’ post-processing can be well approximated using this framework with just data acquired through a clinical scanner’s intended use.
We start with 1500 unique ultrasound image cineloops from fetal, phantom, and liver targets across Siemens S2000, SC2000, or Verasonics Vantage scanners using various scan parameters from [15, 7, 19, 18]. This study was approved by the Institutional Review Board at Duke University, and each study subject provided written informed consent prior to enrollment in the study. We split whole cineloops into respective training and testing sets. Each cineloop has multiple image frames of conventional delay-and-sum (DAS) beamformed data. The combined datasets consist of 39200 frames with a 30691/8509 image frame train-test split. Each image frame runs through Siemens proprietary compiled post-processing software, producing before and after pairs. These pairs are shuffled and randomly cropped to 512×512 images, with reflection padding if the dimensions are too small. Constraining the image dimensions enables batch training, which leads to faster and more stable training convergence. During inference time, images can be any size as long as their dimensions are divisible by 16 due to required padding in our CNN architecture. Table I contains details about our training data.
| Scanner Type | Targets | Frames | Train Frames | Test Frames |
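The cropping and padding steps described above could be implemented along these lines. This is a sketch under assumptions: padding is applied on the bottom and right edges only, and the helper names `random_crop_with_reflect_pad` and `pad_to_multiple` are hypothetical rather than taken from the released code.

```python
import numpy as np


def random_crop_with_reflect_pad(img, size=512, rng=None):
    """Return a size x size crop, reflect-padding first if a dimension is too small."""
    rng = np.random.default_rng() if rng is None else rng
    pad_h = max(size - img.shape[0], 0)
    pad_w = max(size - img.shape[1], 0)
    if pad_h or pad_w:
        # Assumption: pad only the bottom and right edges.
        img = np.pad(img, ((0, pad_h), (0, pad_w)), mode="reflect")
    top = int(rng.integers(0, img.shape[0] - size + 1))
    left = int(rng.integers(0, img.shape[1] - size + 1))
    return img[top:top + size, left:left + size]


def pad_to_multiple(img, multiple=16):
    """Reflect-pad so both dimensions are divisible by `multiple` (inference time).

    Returns the padded image and the original shape so the output can be
    cropped back to the non-padded size afterwards.
    """
    pad_h = (-img.shape[0]) % multiple
    pad_w = (-img.shape[1]) % multiple
    padded = np.pad(img, ((0, pad_h), (0, pad_w)), mode="reflect")
    return padded, img.shape


# Toy usage on a frame that is too short in one dimension.
frame = np.arange(300 * 600, dtype=float).reshape(300, 600)
crop = random_crop_with_reflect_pad(frame, 512, np.random.default_rng(0))
padded, original_shape = pad_to_multiple(np.zeros((100, 250)))
```

Returning the original shape alongside the padded image makes it straightforward to evaluate metrics at the non-padded size.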
II-A Gray-box Performance with Paired Images
In the gray-box case where before and after paired images are available, our problem can be seen as a classic image restoration problem where our input DAS beamformed data is “corrupted”, and our clinical-grade post-processed image is the “uncorrupted ground truth”. We optimize for the different distance metrics MSE, MAE, and SSIM. As defined in Eq. 5, MSE is the summed pixel-wise squared difference between a ground truth pixel $y_i$ in image $y$ and estimated pixel $\hat{y}_i$ in image $\hat{y}$. These residuals are averaged over all pixels in the image. MAE is defined in Eq. 6 as the summed pixel-wise absolute difference. SSIM is defined in Eq. 8 and is the multiplicative similarity between two images’ luminance $l$, contrast $c$, and structure $s$ (Eq. 9-11). The SSIM constants we use are based on [5]. For SSIM, $x$ and $y$ define 11×11 kernels on the two images whose similarity we wish to calculate. These kernels slide across the two images, and the output values are averaged to get the SSIM between the two images. Variables $\mu_x$, $\mu_y$ and $\sigma_x^2$, $\sigma_y^2$ are the mean and variance of each kernel patch, respectively. Variables $C_1$, $C_2$, and $C_3$ are the constants $(K_1 L)^2$, $(K_2 L)^2$, and $C_2/2$, respectively. $L$ is the dynamic range of the two images, $K_1$ is 0.01, and $K_2$ is 0.03.
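A windowed SSIM computation of the kind described here can be sketched in NumPy as follows. The 11×11 window and the constants 0.01 and 0.03 come from the text; the Gaussian window weighting with a standard deviation of 1.5 is an assumption borrowed from the standard SSIM formulation, and the function names are hypothetical. The sketch returns the luminance and combined contrast-structure terms separately so that each can be inspected.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def gaussian_window(size=11, sigma=1.5):
    """Normalized 2-D Gaussian weights (assumed weighting, not from the source)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    w = np.outer(g, g)
    return w / w.sum()


def ssim_components(x, y, dynamic_range=1.0, k1=0.01, k2=0.03, size=11, sigma=1.5):
    """Return (ssim, luminance, contrast_structure), each averaged over all windows."""
    w = gaussian_window(size, sigma)
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    px = sliding_window_view(np.asarray(x, dtype=np.float64), (size, size))
    py = sliding_window_view(np.asarray(y, dtype=np.float64), (size, size))
    # Weighted local means, variances, and covariance for every window position.
    mu_x = np.einsum("ijkl,kl->ij", px, w)
    mu_y = np.einsum("ijkl,kl->ij", py, w)
    var_x = np.einsum("ijkl,kl->ij", px ** 2, w) - mu_x ** 2
    var_y = np.einsum("ijkl,kl->ij", py ** 2, w) - mu_y ** 2
    cov = np.einsum("ijkl,kl->ij", px * py, w) - mu_x * mu_y
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    # With C3 = C2 / 2, contrast and structure collapse into one term.
    contrast_structure = (2 * cov + c2) / (var_x + var_y + c2)
    ssim_map = luminance * contrast_structure
    return (float(np.mean(ssim_map)),
            float(np.mean(luminance)),
            float(np.mean(contrast_structure)))


# Toy usage: a constant brightness offset between two otherwise equal images.
reference = np.random.default_rng(0).random((32, 32))
ssim_value, lum, cs = ssim_components(reference, reference + 0.3)
```

In the toy usage, the constant offset lowers the luminance term while leaving the contrast-structure term essentially at 1, which illustrates why the two components are worth examining separately.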
We calculate SSIM and peak signal-to-noise ratio (PSNR) by running our trained model on the full test set with images at their original non-padded size. PSNR is defined by Eq. 7, where $\mathrm{MAX}_I$ is the maximum possible intensity of the image.
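The pixel-wise metrics can be written compactly as below. This is a minimal sketch using the standard definitions of MSE, MAE, and PSNR; the function names and the default `max_intensity` of 1.0 are assumptions for illustration.

```python
import numpy as np


def mse(y, y_hat):
    """Mean squared error over all pixels."""
    return float(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))


def mae(y, y_hat):
    """Mean absolute error over all pixels."""
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(y_hat))))


def psnr(y, y_hat, max_intensity=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    err = mse(y, y_hat)
    if err == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_intensity ** 2 / err))


# Toy usage: a uniform error of 0.1 on a unit-range image.
truth = np.zeros((4, 4))
estimate = np.full((4, 4), 0.1)
example_psnr = psnr(truth, estimate)
```

For this toy case the MSE is 0.01, so the PSNR is 20 dB up to floating-point rounding.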
II-B Black-box Performance with Unpaired Images
To simulate the more realistic black-box case where paired before and after images are unavailable, we take whole cineloops from the training set used in the gray-box case and split them into two groups. For the first group, we only use the raw beamformed data, and for the second group, we only use the clinical-grade post-processed data. We then train a CycleGAN using different distance metrics MSE, MAE, and SSIM for our generators’ cycle-consistency loss (Eq. 3). As in the gray-box case, MSE, MAE, PSNR, and SSIM metrics were calculated by running our trained model on the full test set with images at their original non-padded size. Since we have access to the underlying proprietary clinical post-processing, we can compare against objective ground truths solely for final evaluation.
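Splitting at the cineloop level, rather than the frame level, is what guarantees that no frame appears in both domains as a before and after pair. A minimal sketch of such a split is shown below; the equal-halves ratio, the seed, and the function name `split_unpaired` are assumptions, since the text does not state how the two groups were sized.

```python
import random


def split_unpaired(cineloop_ids, seed=0):
    """Split cineloop IDs into two disjoint groups.

    The first group contributes only raw beamformed frames and the second
    only post-processed frames, so no paired frames exist across domains.
    """
    ids = sorted(cineloop_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2  # assumption: equal-sized groups
    raw_only = ids[:half]
    processed_only = ids[half:]
    return raw_only, processed_only


# Toy usage with ten hypothetical cineloop IDs.
raw_group, processed_group = split_unpaired(range(10))
```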
II-C Generator and Discriminator Structure
The same overall generator network structure is used in both the gray-box and black-box cases. We use a simple encoder-decoder with skip connections as seen on the left side of Fig. 2. We vary filter sizes and the number of filters per layer as hyperparameters to the generator, and we report the total number of weight parameters in each model variation.
The discriminator structure on the right side of Fig. 2 follows the PatchGAN and LSGAN approach used in [13, 20] to optimize for least-squares on patches of linearly activated final outputs. The discriminator is only used to facilitate training in the black-box case where no paired images are available, and it is not used in the gray-box case since ground truths are available. Code and models are available at https://github.com/ouwen/mimicknet.
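A least-squares loss on patch-wise discriminator outputs can be sketched as follows. This uses the common least-squares GAN targets of 1 for real patches and 0 for generated patches, which is an assumption about the exact targets used; the function name is hypothetical.

```python
import numpy as np


def patch_lsgan_discriminator_loss(scores_real, scores_fake):
    """Least-squares discriminator loss over patch scores.

    scores_real, scores_fake: arrays of shape (batch, patches_h, patches_w)
    holding the linearly activated discriminator outputs, one per patch.
    """
    real_term = np.mean((scores_real - 1.0) ** 2)  # real patches should score 1
    fake_term = np.mean(scores_fake ** 2)          # generated patches should score 0
    return float(0.5 * (real_term + fake_term))


# Toy usage: an undecided discriminator that outputs 0.5 everywhere.
undecided = np.full((2, 4, 4), 0.5)
undecided_loss = patch_lsgan_discriminator_loss(undecided, undecided)
```

Scoring each patch separately means the loss averages many local real-versus-fake judgments instead of a single decision per image.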
II-D Worst Case Performance
We investigate outlier images that perform worst on the SSIM metric by breaking SSIM into its three components: luminance $l$, contrast $c$, and structure $s$. The equations for contrast and structure are highly related in that both examine variance between and within patches. Thus, $c$ and $s$ are simplified into a single contrast-structure equation $cs$ (Eq. 12).
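The simplification can be made explicit with the standard SSIM definitions of contrast and structure. The derivation below is a sketch that assumes those standard forms and the usual choice $C_3 = C_2/2$, with $\sigma_x$, $\sigma_y$ the patch standard deviations and $\sigma_{xy}$ the patch covariance; the notation of Eq. 12 itself may differ.

```latex
\begin{align}
c(x, y) &= \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2},
\qquad
s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3},
\qquad C_3 = \frac{C_2}{2} \\
cs(x, y) &= c(x, y)\, s(x, y)
= \frac{2\left(\sigma_x \sigma_y + \tfrac{C_2}{2}\right)}{\sigma_x^2 + \sigma_y^2 + C_2}
  \cdot \frac{\sigma_{xy} + \tfrac{C_2}{2}}{\sigma_x \sigma_y + \tfrac{C_2}{2}}
= \frac{2\sigma_{xy} + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}
\end{align}
```

The common factor $\sigma_x \sigma_y + C_2/2$ cancels, so the combined term depends only on the patch variances and covariance.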
III-A Gray-box Performance with Paired Images
In the theoretical gray-box case where before and after paired images are available, we explore different possible Unet encoder-decoder hyperparameters. For each hyperparameter variation, we trained a triplet of models that optimize for SSIM, MSE, and MAE. We note that within each triplet, models optimized with the SSIM objective have the best SSIM and PSNR. We are primarily interested in the best SSIM metric since it was originally formulated to model the human visual system [5]. In Table II, the best average metrics of each column are in bold. Many of the metrics across model variations are not significantly different, but the SSIM for every model is above 0.967. For subsequent worst-case performance analysis, we used the 52993 parameter model optimized on SSIM loss. This model corresponds to the same generator structure used in Fig. 2 except with a 3×3 instead of a 7×3 filter.
III-B Black-box Performance with Unpaired Images
In the more realistic black-box case where before and after images are not available, we also explore different Unet architecture hyperparameters. We attempted to train from scratch the same 52993 parameter generator network architecture selected from Table II, but we were unsuccessful in guiding convergence without increasing the number of generator parameters to 117697. This increase was accomplished by changing every filter size from 3×3 to 7×3, and metrics can be seen in Table III. For the large 7.76M parameter generator network, performance differences between triplets of the objective functions are not significant. The row labeled “ver” is a model trained only on Verasonics Vantage data with MAE optimization.
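The effect of kernel size on parameter count can be illustrated with the standard formula for a 2-D convolution layer. The 16-channel layer below is a hypothetical example and does not reproduce the actual generator, whose per-layer channel counts are not given here; the function name is likewise illustrative.

```python
def conv2d_weight_count(kernel_h, kernel_w, in_channels, out_channels, bias=True):
    """Number of trainable parameters in a standard 2-D convolution layer."""
    weights = kernel_h * kernel_w * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# Hypothetical 16-in, 16-out layer with each of the two kernel sizes.
small = conv2d_weight_count(3, 3, 16, 16)   # 3x3 kernel
large = conv2d_weight_count(7, 3, 16, 16)   # 7x3 kernel
```

Kernel weights scale by 21/9, or roughly 2.3×, when moving from 3×3 to 7×3, while bias terms are unchanged. This is in line with the reported growth from 52993 to 117697 parameters, a factor of roughly 2.2.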
We select the 117697 parameter network optimizing MSE for subsequent analysis since it achieves the highest SSIM with the fewest parameters. We refer to this configuration, shown in Fig. 2, as MimickNet. In Fig. 1 and Fig. 3, fetal, liver, and phantom images are shown. Without the scaled differences in the last row, it is much more difficult to discern localized differences between MimickNet images and clinical-grade post-processed images.
III-C Runtime Performance
In Table IV, the runtime was examined for the best SSIM performing model in the gray-box paired image and black-box unpaired image training cases. Frames per second (FPS) measurements were calculated for an NVIDIA P100. Floating-point operation counts (FLOPS) are provided as a hardware independent measurement since runtime generally scales linearly with the number of FLOPS used by the model. As a reference point, we include metrics from MobileNetV2 [27], a lightweight image classifier designed explicitly for use on mobile phones. MimickNet uses 2000x fewer FLOPS compared to MobileNetV2. Note that FPS measurements for MobileNetV2 were performed on a Google Pixel 1 phone in [27] and not on an NVIDIA P100.
| Model | Input Size | Params | MFLOPS | FPS (Hz) |
III-D Worst Case Performance
In Fig. 4, the histogram and kernel density estimate of the luminance and contrast-structure SSIM components are plotted for the gray-box paired image and the black-box unpaired image training cases. The min-max contrast-structure ($cs$) range for the gray-box case lies tightly between 0.950 and 0.998, and the black-box case overlaps this region with a min-max range between 0.922 and 0.990. The min-max luminance ($l$) range of the gray-box case falls between 0.842 and 1.000, but the black-box case has a large min-max range of 0.318 to 1.000.
We also closely investigated outlier images that perform poorly on the SSIM metric by looking at the worst images. Fig. 5 contains three representative images. We included gray-box image results to better showcase the performance gap between what is possible when paired images are available versus when they are not. All three images produced with black-box constraints have high contrast-structure $cs$, but variable luminance $l$.
III-E Out of Dataset Distribution Performance
To assess the generalizability of MimickNet post-processing, we applied it to cardiac cineloop data. These data are outside of our train-test dataset distribution, which only included phantom, fetal, and liver imaging targets. We also applied MimickNet post-processing to a recent novel beamforming method known as REFoCUS [4]. REFoCUS allows for transmit-receive focusing everywhere under linear system assumptions, resulting in better image resolution and contrast-to-noise ratio. In Fig. 6, we see that MimickNet post-processed images closely match clinical-grade post-processing for conventional dynamic receive beamforming with an SSIM of 0.967 ± 0.002. Similar to clinical-grade post-processing, we see that contrast improvements in the heart chamber and resolution improvements along the heart septum due to REFoCUS are preserved after MimickNet post-processing, achieving an SSIM of 0.950 ± 0.0157.
MimickNet can closely approximate clinical-grade post-processing with an SSIM of 0.930 ± 0.089 such that even upon close inspection, few differences are observed. This performance was achieved without knowledge of the pre-processed pair. We do observe a performance gap compared to the gray-box setting, which achieves an SSIM of 0.979 ± 0.013. However, emulating the gray-box setting would require researchers to tamper with scanner systems to siphon off pre-processed data, so we explore ways to eliminate this gap.
The performance gap is primarily attributed to differences in image luminance from outlier frames seen in Fig. 4. Although images generated under black-box constraints present a large min-max luminance ($l$) range of 0.318 to 1.000, we note that the mean and standard deviation is. Therefore, the majority of images do have well-approximated luminance, despite the sizeable min-max range. For the two fetal brain images in Fig. 5, we qualitatively see that much of the contrast and structure are preserved while luminance is not. This matches the quantitative contrast-structure and luminance SSIM components for the top fetal image ($cs$ = 0.962, $l$ = 0.681) and bottom fetal image ($cs$ = 0.964, $l$ = 0.612).
We found it interesting that clinical-grade post-processing would remove such bright reflectors seen in the raw beamformed phantom image (Fig. 5, 2nd row). This level of artifact removal likely requires window clipping. When we clip the lower dynamic range of raw beamformed data from -120dB to -80dB, we see the bright scatterers in raw beamformed images dim and practically match clinical-grade post-processing without any additional changes. Conceptually, clipping values to -80dB is a reasonable choice since it is close to the noise floor of most ultrasound transducers. In the CycleGAN training paradigm, it can be challenging to learn these clipping cutoffs due to the cycle-consistency loss (defined in Eq. 3). The backward generator would be penalized by any information destroyed through clipping learned in the forward generator. Since the cycle-consistency loss does not exist in optimization under the gray-box setting, the model under gray-box settings can learn the clipping better than under black-box settings. Fortunately, luminance can be modified to a large extent in real-time by changing the imaging window or gain by ultrasound end-users.
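The window clipping described above amounts to raising the display floor of log-compressed data. A minimal sketch is given below; it assumes the input is already in decibels and normalized so that its maximum is 0 dB, and the function name and the rescaling to the range 0 to 1 are illustrative choices.

```python
import numpy as np


def clip_dynamic_range(image_db, floor_db=-80.0, ceil_db=0.0):
    """Clip log-compressed data to [floor_db, ceil_db] and rescale to [0, 1]."""
    clipped = np.clip(image_db, floor_db, ceil_db)
    return (clipped - floor_db) / (ceil_db - floor_db)


# Toy usage: values at or below the -80 dB floor all map to 0.
samples_db = np.array([-120.0, -80.0, -40.0, 0.0])
display = clip_dynamic_range(samples_db)
```

Everything between -120 dB and -80 dB collapses to the same display value, which is exactly the kind of information loss that a cycle-consistency term penalizes in the reverse generator.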
One challenge we found was that training MimickNet was quite unstable for small generator networks. This instability is likely due to the nature of adversarial objectives in GANs, which other works explore [16, 21]. As seen in Table III, the overall stability of the adversarial objective function appears to be a more important factor in achieving a higher SSIM than the specific generator distance metric used, such as MAE or SSIM. Training GANs is a delicate balancing act between discriminator and generator. If the discriminator overpowers the generator during training, then the generator is unable to outpace the discriminator. A quick solution is to increase the capacity of the generator by adding more parameters, or to decrease the capacity of the discriminator by taking away parameters, until convergence occurs. Future works will explore how to better increase training stability and address any remaining performance gap between the gray-box and black-box constraint settings through different deep learning model architectures, objective functions, or training processes.
As is, MimickNet shows promise for production use. It runs in real-time at 92 FPS on an NVIDIA P100 and uses 2000x fewer FLOPS than models such as MobileNetV2, which was designed for less capable hardware such as mobile phone CPUs. This runtime is relevant since more ultrasound systems are being developed for mobile phone viewing [11]. Additionally, the last row of Table III, named “ver”, is a 7.76M parameter model trained only on Verasonics Vantage data with MAE distance metric optimization, and it achieves metrics similar to those obtained by training on the full dataset. These results hint at the possibility of achieving similar SSIM with less data. Future work will assess the performance of MimickNet on mobile phones and in other data or compute constrained settings.
This work’s main contribution is in lowering the barrier to clinical translation for future research. Medical images previously only understood by research domain experts can be translated to clinical-grade images widely familiar to medical providers. Future work will aim to implement a flexible end-to-end software package to train a mimic given data from two arbitrary scanner systems. Work will also examine how much data is required to create a high-performance mimic.
MimickNet closely approximates current clinical post-processing in the realistic black-box setting where before and after post-processing image pairs are unavailable. We present it as an image matching tool to provide fair comparisons of novel beamforming and image formation techniques to a clinical baseline mimic. It runs in real-time, works for out-of-distribution cardiac data, and thus shows promise for practical production use. We demonstrated its application in comparing different beamforming methods with clinical-grade post-processing and showed that resolution improvements are carried over into the final post-processed image. Our results with ultrasound data suggest it should also be possible to approximate medical image post-processing in other modalities such as CT and MR.
This work was supported by the National Institute of Biomedical Imaging and Bioengineering under Grant R01-EB026574, and the National Institutes of Health under Grant 5T32GM007171-44. The authors would like to thank Siemens Medical Inc. USA for in-kind technical support.
- [1] (2017) Ultrasound image enhancement using a deep learning architecture. In Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2016, pp. 639–649.
- [2] (2009-05) Understanding the advanced signal processing technique of Real-Time adaptive filters. J. Diagn. Med. Sonogr. 25 (3), pp. 145–160.
- [3] (2015-11) A primer on the physical principles of tissue harmonic imaging. Radiographics 35 (7), pp. 1955–1964.
- [4] (2018-10) REFoCUS: ultrasound focusing for the software beamforming age. In 2018 IEEE International Ultrasonics Symposium (IUS), pp. 1–4.
- [5] (2004-04) Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13 (4), pp. 600–612.
- [6] (2019) Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations.
- [7] (2017-10) Quantifying image quality improvement using elevated acoustic output in B-Mode harmonic imaging. Ultrasound Med. Biol. 43 (10), pp. 2416–2425.
- [8] (2018-10) Ultrasound speckle reduction using generative adversial networks. In 2018 IEEE International Ultrasonics Symposium (IUS), pp. 1–4.
- [9] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2672–2680.
- [10] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- [11] (2015-10) Mobile ultrafast ultrasound imaging system based on smartphone and tablet devices. In 2015 IEEE International Ultrasonics Symposium (IUS), pp. 1–4.
- [12] (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708.
- [13] (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134.
- [14] (2017) The one hundred layers tiramisu: fully convolutional densenets for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 11–19.
- [15] (2015-04) In vivo application of short-lag spatial coherence and harmonic spatial coherence imaging in fetal ultrasound. Ultrason. Imaging 37 (2), pp. 101–116.
- [16] (2018) Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations.
- [17] (2018) Multi-level Wavelet-CNN for image restoration.
- [18] (2018) Implications of lag-one coherence on real-time adaptive frequency selection. In 2018 IEEE International Ultrasonics Symposium (IUS), pp. 1–9.
- [19] (2018-04) Clinical utility of fetal Short-Lag spatial coherence imaging. Ultrasound Med. Biol. 44 (4), pp. 794–806.
- [20] (2017-10) Least squares generative adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2813–2821.
- [21] (2018) Which training methods for GANs do actually converge? In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 3481–3490.
- [22] (2019-03) Short-lag spatial coherence imaging in 1.5-d and 1.75-d arrays: elevation performance and array design considerations. IEEE Trans. Ultrason. Ferroelectr. Freq. Control.
- [23] (2018-10) Deep convolutional neural network for ultrasound image enhancement. In 2018 IEEE International Ultrasonics Symposium (IUS), pp. 1–4.
- [24] (2011-04) Sources of image degradation in fundamental and harmonic ultrasound imaging using nonlinear, full-wave simulations. IEEE Trans. Ultrason. Ferroelectr. Freq. Control 58 (4), pp. 754–765.
- [25] (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR abs/1511.06434.
- [26] (2015) U-net: convolutional networks for biomedical image segmentation. Med. Image Comput. Comput. Assist. Interv.
- [27] (2018-06) MobileNetV2: inverted residuals and linear bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
- [28] (2017) Learning to generate images with perceptual similarity metrics.
- [29] (2006-10) Robust kernel regression for restoration and reconstruction of images from sparse noisy data. In 2006 International Conference on Image Processing, pp. 1257–1260.
- [30] (2013-06) Exploring nsight imaging, a totally new architecture for premium ultrasound. Technical Report 4522 962 95791, Philips.
- [31] (1998-01) Bilateral filtering for gray and color images. In Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271), pp. 839–846.
- [32] (1987-09) Speckle reduction achievable by spatial compounding and frequency compounding: experimental results and implications for target detectability. In Pattern Recognition and Acoustical Imaging, Vol. 0768, pp. 185–192.
- [33] (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11 (Dec), pp. 3371–3408.
- [34] (2014) Deep convolutional neural network for image deconvolution. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 1790–1798.
- [35] (2017) Fast and accurate image super resolution by deep CNN with skip connection and network in network. In Neural Information Processing, pp. 217–225.
- [36] (2018-05) Road extraction by deep residual U-Net. IEEE Geoscience and Remote Sensing Letters 15 (5), pp. 749–753.
- [37] (2017-03) Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging 3 (1), pp. 47–57.
- [38] (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232.