
MimickNet, Matching Clinical Post-Processing Under Realistic Black-Box Constraints

by   Ouwen Huang, et al.
Duke University

Image post-processing is used in clinical-grade ultrasound scanners to improve image quality (e.g., reduce speckle noise and enhance contrast). These post-processing techniques vary across manufacturers and are generally kept proprietary, which presents a challenge for researchers looking to match current clinical-grade workflows. We introduce a deep learning framework, MimickNet, that transforms raw conventional delay-and-sum (DAS) beams into the approximate post-processed images found on clinical-grade scanners. Training MimickNet only requires post-processed image samples from a scanner of interest, without the need for explicit pairing to raw DAS data. This flexibility allows it to hypothetically approximate any manufacturer's post-processing without access to the pre-processed data. MimickNet generates images with an average structural similarity index measurement (SSIM) of 0.930±0.0892 on a 300 cineloop test set, and it generalizes to cardiac cineloops outside of our train-test distribution, achieving an SSIM of 0.967±0.002. We also explore the theoretical SSIM achievable by evaluating MimickNet performance when trained under gray-box constraints (i.e., when both pre-processed and post-processed images are available). To our knowledge, this is the first work to establish deep learning models that closely approximate current clinical-grade ultrasound post-processing under realistic black-box constraints, where before and after post-processing data are unavailable. MimickNet serves as a clinical post-processing baseline for future works in ultrasound image formation to compare against. To this end, we have made the MimickNet software open source.



I Introduction and Background

Fig. 1: Fetal image comparing clinical-grade post-processed images (ground truth) and MimickNet post-processing. In the last row, the difference between clinical-grade and MimickNet post-processing is scaled to maximize dynamic range. The SSIM between the MimickNet and clinical-grade images is 0.972, and the PSNR is 26.78 dB.

In the typical clinical B-mode ultrasound imaging paradigm, a transducer probe will transmit acoustic energy into tissue, and the back-scatter energy is reconstructed via beamforming techniques into a human eye-friendly image. This image attempts to faithfully map tissue’s acoustic impedance, which is a property of its bulk modulus and density. Unfortunately, there are many sources of image degradation such as electronic noise, speckle from sub-resolution scatterers, reverberation, and de-focusing caused by heterogeneity in tissue sound speed [24]. In the literature, these sources of image degradation can be suppressed through better focusing [30, 4], spatial compounding [32], harmonic imaging [3], and coherence imaging techniques [19, 22].

In addition to beamforming, image post-processing is a significant contributor to image quality improvement. Reader studies have shown that medical providers largely prefer post-processed images over raw beamformed imagery [2, 19]. Unfortunately, commercial post-processing algorithms are proprietary, and implementation details are typically kept as a black-box to the end-user. Thus, researchers who develop image improvement techniques on highly configurable research systems, such as Verasonics and Cephasonics scanners, face challenges in presenting their images alongside current clinical scanner baselines. The current status quo for researchers working on novel image formation techniques is to compare against raw beamformed data, which is not typically viewed by medical providers. To have a pixel-wise comparison with clinical-grade standards, researchers would need either access to proprietary post-processing code or access to raw data from difficult-to-configure commercial scanners. We aim to remove these significant barriers by leveraging recent deep learning methods.

Deep learning based post-processing using convolutional neural network (CNN) generators [34, 20] has become immensely popular for the image restoration problem [29, 31]. One popular network architecture is an encoder-decoder network with skip connections, commonly referred to as a Unet [26]. In the image restoration problem, the encoder portion of the Unet takes a noisy image as input and creates feature map stacks, which are subsequently down-sampled through max pool operations. The decoder portion up-samples features and attempts to reconstruct an image of the same size as the input. Skip connections in a Unet have been shown to better preserve high-frequency information from the original image than the same network without them [35]. Other encoder-decoder Unet flavours exist which exploit residual learning [10, 36], wavelet transforms [17], and dense blocks [12, 14]. Encoder and decoder network parameters are typically optimized with a gradient descent based method that minimizes a distance function between the reconstructed and ground truth images [33]. Different distance functions such as mean squared error (MSE), mean absolute error (MAE), and structural similarity index measurement (SSIM) have been used in practice [37, 28].

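Adversarial objective functions are a unique class of distance functions that have shown success in the related field of image generation [6]. The adversarial objective optimizes two networks simultaneously. Given a training batch of size $m$ with individual examples indexed by $i$, $G$ is a network that generates images $G(z)$ from noise $z$, and another network, $D$, discriminates between real images $x$ and fake generated images $G(z)$. $G$ and $D$ play a min-max game since they have competing objective functions, shown in Eq. 1 and Eq. 2, where $\theta_g$ are the parameters of $G$ and $\theta_d$ are the parameters of $D$. If this min-max game converges, $G$ ultimately learns to generate realistic fake images that are indistinguishable from real images from the perspective of $D$.

$$\max_{\theta_d}\ \frac{1}{m}\sum_{i=1}^{m}\Big[\log D\big(x^{(i)}\big) + \log\Big(1 - D\big(G(z^{(i)})\big)\Big)\Big] \tag{1}$$

$$\min_{\theta_g}\ \frac{1}{m}\sum_{i=1}^{m}\log\Big(1 - D\big(G(z^{(i)})\big)\Big) \tag{2}$$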


In the literature, these networks are referred to as generative adversarial networks (GANs) [9, 25]. Conditional GANs (cGANs) have seen success in image restoration as well as style transfer. With cGANs, a structured input, such as an image segmentation or corrupted image, is given instead of random noise [13].
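As a minimal illustration of the two competing minibatch objectives of Eq. 1 and Eq. 2, the NumPy sketch below evaluates both from discriminator outputs alone. It assumes $D$ outputs probabilities in (0, 1); the function name `gan_objectives` and the clipping constant `eps` are hypothetical choices for this example, and no gradients are computed.

```python
import numpy as np


def gan_objectives(d_real, d_fake, eps=1e-12):
    """Minibatch GAN objectives from discriminator outputs (hypothetical helper).

    d_real: D(x^(i)) for m real images, values in (0, 1).
    d_fake: D(G(z^(i))) for m generated images, values in (0, 1).
    """
    # Clip to avoid log(0); eps is an illustrative choice.
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    # Discriminator objective, maximized over theta_d (Eq. 1).
    d_obj = np.mean(np.log(d_real) + np.log(1.0 - d_fake))
    # Generator objective, minimized over theta_g (Eq. 2).
    g_obj = np.mean(np.log(1.0 - d_fake))
    return float(d_obj), float(g_obj)


# A maximally confused discriminator outputs 0.5 for every image.
d_obj, g_obj = gan_objectives([0.5, 0.5], [0.5, 0.5])
```

At $D = 0.5$ the discriminator objective equals $2\log 0.5 \approx -1.386$. A discriminator that separates real from fake images raises it toward 0, while the generator lowers its own objective by pushing $D(G(z))$ toward 1.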

In the field of ultrasound, deep learning techniques using cGANs and CNNs have recently been applied to B-mode imaging. They have shown promising results for reducing speckle noise, enhancing image contrast, and increasing other image quality metrics [1, 8, 23]. However, training GANs or CNNs for image enhancement requires ground truths for comparison. These are typically before and after image enhancement pairs that are registered with one another. Unfortunately, such pairs are rarely available in research environments that require clinical-grade ground truths.

An extension of GANs known as cycle-consistent GANs (CycleGANs) was proposed in [38] to get around the requirement of paired images. CycleGANs have been shown to excel at style transfer, where images are mapped from one domain to another without the use of explicitly paired images. CycleGANs consist of two key components: forward and reverse domain generators, $G: X \rightarrow Y$ and $F: Y \rightarrow X$, and forward and reverse domain discriminators, $D_Y$ and $D_X$. The generators translate images from one domain to another, and the discriminators distinguish between real and fake generated images in each domain. We show the objective functions for one direction of the cycle in Eq. 3 and Eq. 4, where $x$ is an image from domain $X$ and $y$ is an image from domain $Y$. In Eq. 3 and Eq. 4, $\theta_g$ and $\theta_{d_Y}$ are the parameters of the forward generator $G$ and the domain-$Y$ discriminator $D_Y$. In Eq. 3, $d$ can represent any distance metric that compares two images.

$$\min_{\theta_g}\ \frac{1}{m}\sum_{i=1}^{m}\Big[\log\Big(1 - D_Y\big(G(x^{(i)})\big)\Big) + d\Big(F\big(G(x^{(i)})\big),\, x^{(i)}\Big)\Big] \tag{3}$$

$$\max_{\theta_{d_Y}}\ \frac{1}{m}\sum_{i=1}^{m}\Big[\log D_Y\big(y^{(i)}\big) + \log\Big(1 - D_Y\big(G(x^{(i)})\big)\Big)\Big] \tag{4}$$
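The NumPy sketch below evaluates the generator side of one cycle direction for a batch: an adversarial term for $G$ plus the cycle-consistency term $d(F(G(x)), x)$, with MSE or MAE swappable as $d$. It is an illustration, not the MimickNet training code. The log-form adversarial term mirrors Eq. 2 and is an assumption here (least-squares variants are also common), and `G`, `F`, `D_Y` are hypothetical toy callables.

```python
import numpy as np


def mse_distance(a, b):
    """Per-image mean squared error over the two spatial axes."""
    return np.mean((a - b) ** 2, axis=(1, 2))


def mae_distance(a, b):
    """Per-image mean absolute error over the two spatial axes."""
    return np.mean(np.abs(a - b), axis=(1, 2))


def forward_generator_objective(x, G, F, D_Y, d=mae_distance, eps=1e-12):
    """One-direction generator objective for a batch x of shape (m, H, W).

    Adversarial term log(1 - D_Y(G(x))) plus cycle-consistency term
    d(F(G(x)), x), averaged over the batch.
    """
    fake_y = G(x)                                        # translate X -> Y
    adv = np.log(np.clip(1.0 - D_Y(fake_y), eps, 1.0))   # shape (m,)
    cyc = d(F(fake_y), x)                                # translate back Y -> X and compare
    return float(np.mean(adv + cyc))


# Toy stand-ins: G doubles intensities, F halves them, D_Y is undecided.
x = np.random.default_rng(0).random((4, 8, 8))
G = lambda a: 2.0 * a
F = lambda b: b / 2.0
D_Y = lambda b: np.full(b.shape[0], 0.5)
perfect = forward_generator_objective(x, G, F, D_Y)
lossy = forward_generator_objective(x, G, lambda b: b, D_Y)  # F is not G's inverse
```

With an exact inverse the cycle term vanishes and only $\log 0.5$ remains; when $F$ fails to undo $G$, the cycle term raises the objective, which is what ties the two generators together without paired images.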


In this work, we investigate whether it is possible to approximate post-processing algorithms found on clinical-grade scanners given raw conventional beamformed data as input to Unet generators. We first show what is theoretically feasible when before and after image pairs are provided and refer to this as the gray-box constraint. We view this as the classic image restoration problem where clinical-grade post-processed images are the ground truth and raw data are “corrupted”. Later, we constrain ourselves to the more realistic black-box setting where no before and after image pairs are available. We view this problem through the style transfer lens and train a CycleGAN from scratch to mimic clinical-grade post-processing. We refer to this trained model configuration as MimickNet. Our results suggest that any manufacturer's post-processing could plausibly be well approximated using this framework with just data acquired through a clinical scanner's intended use.

II Methodology

We start with 1500 unique ultrasound image cineloops of fetal, phantom, and liver targets acquired on Siemens S2000, SC2000, and Verasonics Vantage scanners using various scan parameters from [15, 7, 19, 18]. This study was approved by the Institutional Review Board at Duke University, and each study subject provided written informed consent prior to enrollment in the study. We split whole cineloops into training and testing sets. Each cineloop has multiple image frames of conventional delay-and-sum (DAS) beamformed data. The combined datasets consist of 39200 frames with a 30691/8509 train-test frame split. Each image frame is run through Siemens' proprietary, compiled post-processing software, producing before and after pairs. These pairs are shuffled and randomly cropped to 512x512 images, with reflection padding if the dimensions are too small. Constraining the image dimensions enables batch training, which leads to faster and more stable training convergence. At inference time, images can be any size as long as each dimension is divisible by 16, due to the required padding in our CNN architecture. Table I contains details about our training data.

| Scanner Type | Cineloops | Frames | Train Frames | Test Frames |
|---|---|---|---|---|
| S2000 | 873 | 3085 | 2543 | 542 |
| SC2000 | 158 | 12806 | 9754 | 3052 |
| Verasonics | 469 | 23309 | 18394 | 4915 |
| Total | 1500 | 39200 | 30691 | 8509 |

TABLE I: Dataset Overview
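The cropping and padding steps described above can be sketched in NumPy as follows. The helper names, the centered split of the reflection padding, the uniformly random crop position, and padding the bottom/right edges up to a multiple of 16 at inference are all assumptions made for illustration; the paper does not specify these details.

```python
import numpy as np


def random_crop_reflect(frame, size=512, rng=None):
    """Reflect-pad a 2D frame to at least size x size, then take a random crop."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = frame.shape
    pad_h, pad_w = max(size - h, 0), max(size - w, 0)
    if pad_h or pad_w:
        # Split the padding between both sides of each axis.
        frame = np.pad(
            frame,
            ((pad_h // 2, pad_h - pad_h // 2), (pad_w // 2, pad_w - pad_w // 2)),
            mode="reflect",
        )
    h, w = frame.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return frame[top:top + size, left:left + size]


def pad_to_multiple(frame, multiple=16):
    """Reflect-pad the bottom/right edges so both dimensions divide by `multiple`."""
    h, w = frame.shape
    padded = np.pad(frame, ((0, (-h) % multiple), (0, (-w) % multiple)), mode="reflect")
    return padded, (h, w)


rng = np.random.default_rng(0)
frame = rng.random((300, 600))            # a hypothetical frame shorter than 512 rows
crop = random_crop_reflect(frame, 512, rng)
padded, (h, w) = pad_to_multiple(frame)   # run the model on `padded`, then crop output to [:h, :w]
```

Returning the original size alongside the padded frame makes it easy to crop the model output back to the non-padded size before computing metrics.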

II-A Gray-box Performance with Paired Images


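In the gray-box case, where before and after paired images are available, our problem can be seen as a classic image restoration problem in which the input DAS beamformed data is “corrupted” and the clinical-grade post-processed image is the “uncorrupted ground truth”. We optimize for the distance metrics MSE, MAE, and SSIM. As defined in Eq. 5, MSE is the squared difference between each ground truth pixel $y_i$ in image $y$ and the estimated pixel $\hat{y}_i$ in image $\hat{y}$, averaged over all $N$ pixels in the image. MAE, defined in Eq. 6, is the corresponding average of pixel-wise absolute differences.

$$\mathrm{MSE}(y,\hat{y}) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 \tag{5}$$

$$\mathrm{MAE}(y,\hat{y}) = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right| \tag{6}$$

SSIM is defined in Eq. 8 as the product of the luminance $l$, contrast $c$, and structure $s$ similarities between two images (Eq. 9-11). The SSIM constants we use are based on [5]. $x$ and $y$ denote the 11×11 kernel patches taken from the two images whose similarity we wish to calculate. These kernels slide across the two images, and the output values are averaged to obtain the SSIM between the two images. $\mu_x$, $\mu_y$ and $\sigma_x^2$, $\sigma_y^2$ are the mean and variance of each kernel patch, respectively, and $\sigma_{xy}$ is the covariance between the two patches. $C_1$, $C_2$, and $C_3$ are the constants $(k_1 L)^2$, $(k_2 L)^2$, and $C_2/2$, respectively, where $L$ is the dynamic range of the two images, $k_1$ is 0.01, and $k_2$ is 0.03.

$$\mathrm{SSIM}(x,y) = l(x,y)\,c(x,y)\,s(x,y) \tag{8}$$

$$l(x,y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1} \tag{9}$$

$$c(x,y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2} \tag{10}$$

$$s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3} \tag{11}$$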
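The NumPy sketch below computes Eq. 8 to Eq. 11 patch by patch and also returns the mean of the combined contrast-structure product $c \cdot s$. Assumptions: intensities lie in [0, 1] (so $L = 1$), the window is a uniform 11×11 box whereas [5] uses a Gaussian-weighted window, and only fully interior patches are used. The `sliding_window_view` approach materializes large intermediate arrays for 512×512 frames, so it is meant for illustration rather than speed.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def ssim_components(x, y, dynamic_range=1.0, win=11, k1=0.01, k2=0.03):
    """Mean SSIM, luminance l, and contrast-structure c*s over all win x win patches.

    Uses a uniform (box) window; [5] uses a Gaussian-weighted window instead.
    """
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    c3 = c2 / 2.0
    # Views of every win x win patch: shape (H - win + 1, W - win + 1, win, win).
    px = sliding_window_view(x, (win, win))
    py = sliding_window_view(y, (win, win))
    mu_x = px.mean(axis=(-2, -1))
    mu_y = py.mean(axis=(-2, -1))
    var_x = px.var(axis=(-2, -1))
    var_y = py.var(axis=(-2, -1))
    cov_xy = (px * py).mean(axis=(-2, -1)) - mu_x * mu_y
    sd_x, sd_y = np.sqrt(var_x), np.sqrt(var_y)
    lum = (2 * mu_x * mu_y + c1) / (mu_x**2 + mu_y**2 + c1)   # Eq. 9
    con = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)       # Eq. 10
    struct = (cov_xy + c3) / (sd_x * sd_y + c3)               # Eq. 11
    ssim_map = lum * con * struct                             # Eq. 8, per patch
    return float(ssim_map.mean()), float(lum.mean()), float((con * struct).mean())
```

Because SSIM averages the per-patch product, the returned SSIM is in general not equal to the product of the mean luminance and the mean contrast-structure values.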

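We calculate SSIM and peak signal-to-noise ratio (PSNR) by running our trained model on the full test set with images at their original non-padded size. PSNR is defined by Eq. 7, where $\mathrm{MAX}_I$ is the maximum possible intensity of the image.

$$\mathrm{PSNR}(y,\hat{y}) = 10\log_{10}\!\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}(y,\hat{y})}\right) \tag{7}$$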
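For reference, the pixel-wise metrics of Eq. 5 to Eq. 7 reduce to a few NumPy lines. This is a minimal sketch that assumes intensities scaled so that the maximum possible intensity is 1.0; the function names are illustrative.

```python
import numpy as np


def mse(y, y_hat):
    """Eq. 5: mean of squared pixel-wise differences."""
    return float(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))


def mae(y, y_hat):
    """Eq. 6: mean of absolute pixel-wise differences."""
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(y_hat))))


def psnr(y, y_hat, max_intensity=1.0):
    """Eq. 7: PSNR in dB, given the maximum possible image intensity."""
    return float(10.0 * np.log10(max_intensity**2 / mse(y, y_hat)))


y = np.zeros((4, 4))
y_hat = np.full((4, 4), 0.1)   # every pixel is off by 0.1
```

Here every pixel is off by 0.1, so MSE is 0.01, MAE is 0.1, and PSNR is 20 dB. Identical images give an MSE of zero and an unbounded PSNR.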

II-B Black-box Performance with Unpaired Images

To simulate the more realistic black-box case where paired before and after images are unavailable, we take whole cineloops from the training set used in the gray-box case and split them into two groups. For the first group, we only use the raw beamformed data, and for the second group, we only use the clinical-grade post-processed data. We then train a CycleGAN using the different distance metrics MSE, MAE, and SSIM for our generators' cycle-consistency loss (Eq. 3). As in the gray-box case, MSE, MAE, PSNR, and SSIM metrics were calculated by running our trained model on the full test set with images at their original non-padded size. Since we have access to the underlying proprietary clinical post-processing, we can still compare against objective ground truths, but we use them solely for final evaluation.
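The key property of the black-box split is that it happens at the cineloop level, so no raw frame ever has its post-processed counterpart on the other side. The sketch below shows one way to do this; the 50/50 proportion, the seed, and the helper name `split_unpaired` are hypothetical, since the paper does not state the group sizes.

```python
import random


def split_unpaired(cineloop_ids, fraction_raw=0.5, seed=0):
    """Split whole cineloops into two disjoint groups for unpaired training.

    Only raw DAS frames are used from the first group and only clinical-grade
    post-processed frames from the second, so no before/after pair is seen.
    """
    ids = sorted(cineloop_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * fraction_raw)
    return set(ids[:cut]), set(ids[cut:])


raw_ids, processed_ids = split_unpaired(range(10))
```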

II-C Generator and Discriminator Structure

The same overall generator network structure is used in both the gray-box and black-box cases. We use a simple encoder-decoder with skip connections, as seen on the left side of Fig. 2. We vary filter sizes and the number of filters per layer as generator hyperparameters, and we report the total number of weight parameters in each model variation.

The discriminator structure on the right side of Fig. 2 follows the PatchGAN and LSGAN approach used in [13, 20] to optimize for least-squares on patches of linearly activated final outputs. The discriminator is only used to facilitate training in the black-box case where no paired images are available, and it is not used in the gray-box case since ground truths are available. Code and models are available at
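A least-squares objective on a PatchGAN output map can be sketched as below: the discriminator emits one linearly activated score per patch, and the losses are averaged over all patches. The 0/1 target labels are an assumption (one of the labeling schemes discussed in [20]); some implementations also scale these losses by 1/2, and the 32×32 patch map is hypothetical.

```python
import numpy as np


def lsgan_discriminator_loss(d_real, d_fake):
    """Least-squares loss pushing real patch scores toward 1 and fake ones toward 0."""
    return float(np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))


def lsgan_generator_loss(d_fake):
    """Least-squares loss pushing the discriminator's patch scores on fakes toward 1."""
    return float(np.mean((d_fake - 1.0) ** 2))


# Hypothetical PatchGAN outputs for a batch of 2: one unbounded score per patch.
d_real = np.ones((2, 32, 32))
d_fake = np.zeros((2, 32, 32))
```

Unlike the log loss, the squared penalty keeps producing gradients for fake patches that are already classified correctly but lie far from the target, which is the usual motivation for the least-squares formulation.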

Fig. 2: Above is a diagram of the generator and discriminator structure for MimickNet in one translation direction. Note: the reverse translation direction uses an identical mirrored structure. Under gray-box training constraints, only the generator is used.

II-D Worst Case Performance


We investigate outlier images that perform worst on the SSIM metric by breaking SSIM into its three components: luminance $l$, contrast $c$, and structure $s$. The equations for contrast and structure are closely related, since both examine the variance within patches and the covariance between them. Thus, $c$ and $s$ are combined into a single contrast-structure term $cs$ (Eq. 12).

$$cs(x,y) = c(x,y)\,s(x,y) = \frac{2\sigma_{xy} + C_2}{\sigma_x^2 + \sigma_y^2 + C_2} \tag{12}$$
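To see why the contrast and structure terms collapse into a single fraction, substitute $C_3 = C_2/2$ into the structure term and multiply by the contrast term; the factor $2\sigma_x\sigma_y + C_2$ cancels:

$$
c(x,y)\,s(x,y)
= \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}\cdot\frac{\sigma_{xy} + C_2/2}{\sigma_x\sigma_y + C_2/2}
= \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}\cdot\frac{2\sigma_{xy} + C_2}{2\sigma_x\sigma_y + C_2}
= \frac{2\sigma_{xy} + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}
$$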

III Results

III-A Gray-Box Performance with Paired Images

In the theoretical gray-box case, where before and after paired images are available, we explore different Unet encoder-decoder hyperparameters. For each hyperparameter variation, we trained a triplet of models that optimize for SSIM, MSE, and MAE. Within each triplet, the model trained with the SSIM loss has the best SSIM and, in most triplets, the best or tied-best PSNR. We are primarily interested in the SSIM metric since it was originally formulated to model the human visual system [5]. In Table II, the best average metrics of each column are in bold. Many of the metrics across model variations are not significantly different, but every SSIM-optimized model achieves an SSIM of at least 0.967. For subsequent worst-case performance analysis, we used the 52993 parameter model optimized with the SSIM loss. This model corresponds to the generator structure shown in Fig. 2, except with 3×3 instead of 7×3 filters.

| Loss | Params | MSE | MAE | PSNR (dB) | SSIM |
|---|---|---|---|---|---|
| ssim | 13377 | 2.78±3.22 | 3.97±2.40 | 27.4±3.9 | 0.967±0.015 |
| mse | 13377 | 2.40±2.65 | 3.76±2.00 | 27.7±3.4 | 0.947±0.022 |
| mae | 13377 | 2.51±2.86 | 3.83±2.13 | 27.6±3.5 | 0.946±0.018 |
| ssim | 29601 | 2.63±3.10 | 3.91±2.40 | 27.9±4.0 | 0.967±0.015 |
| mse | 29601 | 2.19±2.25 | 3.61±1.81 | 27.9±3.2 | 0.940±0.019 |
| mae | 29601 | 3.46±3.20 | 4.58±2.16 | 25.7±3.0 | 0.895±0.028 |
| ssim | 34849 | 2.49±2.88 | 3.78±2.28 | 27.9±3.9 | 0.975±0.013 |
| mse | 34849 | 2.27±2.41 | 3.67±1.92 | 27.9±3.3 | 0.950±0.019 |
| mae | 34849 | 2.31±2.54 | 3.68±1.96 | 27.9±3.4 | 0.951±0.016 |
| ssim | 52993 | 2.28±2.77 | 3.65±2.24 | 28.5±4.2 | **0.979±0.013** |
| mse | 52993 | 2.19±2.40 | 3.60±1.92 | 28.1±3.4 | 0.956±0.017 |
| mae | 52993 | 2.11±2.35 | 3.52±1.89 | 28.3±3.4 | 0.959±0.015 |
| ssim | 77185 | 2.38±2.91 | 3.70±2.28 | 28.3±4.0 | 0.976±0.015 |
| mse | 77185 | **2.02±2.09** | **3.46±1.70** | 28.3±3.2 | 0.946±0.022 |
| mae | 77185 | 2.14±2.23 | 3.55±1.80 | 28.0±3.2 | 0.947±0.020 |
| ssim | 117697 | 2.22±2.65 | 3.59±2.11 | 28.4±3.9 | 0.977±0.014 |
| mse | 117697 | 2.72±2.51 | 4.07±1.95 | 26.9±3.1 | 0.931±0.023 |
| mae | 117697 | 2.93±2.93 | 4.18±2.11 | 26.7±3.3 | 0.927±0.022 |
| ssim | 330401 | 2.25±2.79 | 3.61±2.22 | **28.6±4.1** | 0.977±0.013 |
| mse | 330401 | 2.15±2.20 | 3.58±1.83 | 28.1±3.4 | 0.958±0.016 |
| mae | 330401 | 2.23±2.42 | 3.61±1.89 | 28.0±3.4 | 0.958±0.016 |
| ssim | 733025 | 2.63±3.06 | 3.93±2.33 | 27.7±4.0 | 0.967±0.015 |
| mse | 733025 | 2.40±2.51 | 3.79±1.97 | 27.7±3.4 | 0.945±0.023 |
| mae | 733025 | 2.80±2.83 | 4.09±2.04 | 26.9±3.2 | 0.927±0.022 |

TABLE II: Gray-box performance with paired images. Best average metrics in each column are emphasized in bold.

III-B Black-box Performance with Unpaired Images

In the more realistic black-box case, where before and after images are not available, we also explore different Unet architecture hyperparameters. We attempted to train from scratch the same 52993 parameter generator architecture selected from Table II, but we could not get training to converge without increasing the number of generator parameters to 117697. This increase was accomplished by changing every filter size from 3×3 to 7×3; the resulting metrics are shown in Table III. For the large 7.76M parameter generator network, performance differences within the triplet of objective functions are not significant. The row labeled “ver” is a model trained only on Verasonics Vantage data with MAE optimization.
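As an arithmetic sanity check on the two reported generator sizes, suppose every kernel grows from 3×3 = 9 to 7×3 = 21 weights while biases and any other parameters stay fixed (an assumption; the exact layer configuration is only shown in Fig. 2). Solving the resulting two equations gives the split between kernel weights and other parameters in the smaller network. Any two totals admit a solution, so the only non-trivial check is that the implied kernel-weight count is a whole multiple of 9; this is consistent with, but does not prove, that only the filter sizes changed.

```python
from fractions import Fraction

p_3x3, p_7x3 = 52993, 117697        # generator sizes reported in Tables II and III
scale = Fraction(7 * 3, 3 * 3)      # each kernel grows from 9 to 21 weights

# Assumption: kernel weights w scale by 21/9 and all other parameters b stay fixed:
#   w + b = p_3x3   and   scale * w + b = p_7x3
w = Fraction(p_7x3 - p_3x3) / (scale - 1)
b = p_3x3 - w
```

This yields 48528 kernel weights (exactly 5392 groups of 9) and 4465 other parameters.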

We select the 117697 parameter network optimizing MSE for subsequent analysis since it achieves the highest SSIM with the fewest parameters. We refer to this configuration, shown in Fig. 2, as MimickNet. Fetal, liver, and phantom images are shown in Fig. 1 and Fig. 3. Without the scaled differences in the last row, it is much more difficult to discern localized differences between MimickNet images and clinical-grade post-processed images.

| Loss | Params | MSE | MAE | PSNR (dB) | SSIM |
|---|---|---|---|---|---|
| ssim | 117697 | 7.26±10.5 | 6.54±4.38 | 23.9±4.41 | 0.883±0.091 |
| mse | 117697 | 6.83±11.1 | 6.31±4.39 | 24.7±4.95 | 0.930±0.089 |
| mae | 117697 | 6.79±9.89 | 6.30±4.27 | 24.4±4.67 | 0.900±0.085 |
| ssim | 7.76M | 4.45±5.71 | 5.14±3.12 | 25.6±4.08 | 0.918±0.078 |
| mse | 7.76M | 6.23±6.30 | 6.14±3.24 | 23.6±3.57 | 0.897±0.052 |
| mae | 7.76M | 6.20±9.10 | 6.02±4.21 | 25.1±5.05 | 0.918±0.084 |
| ver | 7.76M | 6.13±8.95 | 5.99±4.19 | 25.2±5.08 | 0.916±0.083 |

TABLE III: Black-box performance with unpaired images. “ver” is a model trained only on Verasonics Vantage data with MAE optimization.
Fig. 3: Liver (left) and phantom (right) images. The difference between clinical-grade and MimickNet outputs is scaled to maximize dynamic range. The SSIM and PSNR between MimickNet and clinical-grade images are 0.9472 and 26.91 dB, respectively, for the liver target, and 0.9802 and 27.20 dB, respectively, for the phantom target.

III-C Runtime Performance

In Table IV, runtime is examined for the best SSIM-performing models in the gray-box paired-image and black-box unpaired-image training cases. Frames per second (FPS) were measured on an NVIDIA P100. Floating-point operation counts (FLOPs) are provided as a hardware-independent measurement, since runtime generally scales roughly linearly with the number of FLOPs used by the model. As a reference point, we include metrics from MobileNetV2 [27], a lightweight image classifier designed explicitly for use on mobile phones. MimickNet uses 2000x fewer FLOPs than MobileNetV2. Note that the FPS measurement for MobileNetV2 was performed on a Google Pixel 1 phone in [27], not on an NVIDIA P100.

| Model | Input Size | Params | MFLOPs | FPS (Hz) |
|---|---|---|---|---|
| Gray-box | 512x512 | 52993 | 0.105 | 142 |
| Black-box (MimickNet) | 512x512 | 117697 | 0.235 | 92 |
| MobileNetV2 | 224x224 | 4.3M | 569 | 5* |

TABLE IV: Runtime performance on an NVIDIA P100 (*Google Pixel 1 phone) under gray-box and black-box training constraints
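Taking Table IV at face value, the ratio behind the "2000x" figure and the per-frame latency implied by the FPS column work out as follows. The comparison is not like-for-like: the MobileNetV2 FLOPs are for a 224x224 input, and its FPS was measured on a phone.

```python
# Values copied from Table IV.
mflops = {"gray_box": 0.105, "mimicknet": 0.235, "mobilenetv2": 569.0}
fps = {"gray_box": 142, "mimicknet": 92}

# How many times fewer FLOPs each model uses than MobileNetV2.
flop_ratio = {k: mflops["mobilenetv2"] / mflops[k] for k in ("gray_box", "mimicknet")}
# Time per frame in milliseconds implied by the measured frame rate.
frame_time_ms = {k: 1000.0 / v for k, v in fps.items()}
```

MimickNet's ratio is roughly 2400x, so "2000x" is a conservative rounding, and 92 FPS corresponds to about 10.9 ms per 512x512 frame (about 7.0 ms for the gray-box model).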

III-D Worst Case Performance

We investigate the distribution of SSIM across our entire test dataset by breaking SSIM into its luminance $l$ and contrast-structure $cs$ components following Eq. 9 and Eq. 12. In Fig. 4, the histograms and kernel density estimates of these components are plotted for the gray-box paired-image and black-box unpaired-image training cases. The min-max $cs$ range for the gray-box case lies tightly between 0.950 and 0.998, and the black-box case overlaps this region with a min-max range between 0.922 and 0.990. The min-max $l$ range of the gray-box case falls between 0.842 and 1.000, whereas the black-box case has a large min-max range of 0.318 to 1.000.

We also closely investigated outlier images that perform poorly on the SSIM metric by looking at the worst images. Fig. 5 contains three representative images. We include gray-box results to better showcase the performance gap between what is possible when paired images are available and when they are not. All three images produced under black-box constraints have high contrast-structure $cs$ but variable luminance $l$.

Fig. 4: The distribution of contrast-structure $cs$ (top) and luminance $l$ (bottom) of all image frames in our test dataset produced under gray-box and black-box constraints. The mean ± standard deviation of $cs$ is … and of $l$ is … under gray-box constraints. The mean ± standard deviation of $cs$ is … and of $l$ is … under black-box constraints.
Fig. 5: The worst-case images for two fetal brain images (top, bottom) and a phantom (middle). The SSIM of the black-box case (MimickNet) to the ground truth images from top to bottom is 0.665 ($cs$ = 0.962, $l$ = 0.681), 0.414 ($cs$ = 0.947, $l$ = 0.419), and 0.603 ($cs$ = 0.964, $l$ = 0.612). The SSIM of the gray-box case to the ground truth images from top to bottom is 0.873 ($cs$ = 0.984, $l$ = 0.883), 0.967 ($cs$ = 0.996, $l$ = 0.971), and 0.901 ($cs$ = 0.988, $l$ = 0.911). Here $l$ is the luminance component and $cs$ is the contrast-structure component of SSIM.

III-E Out of Dataset Distribution Performance

To assess the generalizability of MimickNet post-processing, we applied it to cardiac cineloop data. These data are outside of our train-test dataset distribution, which only included phantom, fetal, and liver imaging targets. We also applied MimickNet post-processing to a recent beamforming method known as REFoCUS [4]. REFoCUS allows for transmit-receive focusing everywhere under linear system assumptions, resulting in better image resolution and contrast-to-noise ratio. In Fig. 6, we see that MimickNet post-processed images closely match clinical-grade post-processing for conventional dynamic receive beamforming, with an SSIM of 0.967±0.002. As with clinical-grade post-processing, the contrast improvements in the heart chamber and the resolution improvements along the heart septum due to REFoCUS are preserved after MimickNet post-processing, which achieves an SSIM of 0.950±0.0157.

Fig. 6: MimickNet applied to out-of-distribution cardiac data on conventional dynamic receive images and REFoCUS beamformed images. MimickNet is only trained on fetal, liver, and phantom data. The SSIM between clinical-grade post-processing and MimickNet was 0.967±0.002 for conventionally beamformed images and 0.950±0.0157 for REFoCUS beamformed images. Note that the last row is at the same scale as the cardiac images above.

IV Discussion

MimickNet can closely approximate clinical-grade post-processing with an SSIM of 0.930±0.089, such that even upon close inspection, few differences are observed. This performance was achieved without knowledge of the pre-processed pair. We do observe a performance gap compared to the gray-box setting, which achieves an SSIM of 0.979±0.013. However, emulating the gray-box setting would require researchers to tamper with scanner systems to siphon off pre-processed data, so we explore ways to eliminate this gap.

The performance gap is primarily attributable to differences in image luminance in the outlier frames seen in Fig. 4. Although images generated under black-box constraints present a large min-max $l$ range of 0.318 to 1.000, we note that the mean and standard deviation of $l$ are … . Therefore, the majority of images do have well-approximated luminance, despite the sizeable min-max range. For the two fetal brain images in Fig. 5, we qualitatively see that much of the contrast and structure is preserved while luminance is not. This matches the quantitative contrast-structure and luminance SSIM components for the top fetal image ($cs$ = 0.962, $l$ = 0.681) and the bottom fetal image ($cs$ = 0.964, $l$ = 0.612).

We found it interesting that clinical-grade post-processing removes the bright reflectors seen in the raw beamformed phantom image (Fig. 5, 2nd row). This level of artifact removal likely requires window clipping. When we raise the lower limit of the displayed dynamic range of the raw beamformed data from -120 dB to -80 dB, the bright scatterers in the raw beamformed images dim and practically match clinical-grade post-processing without any additional changes. Conceptually, clipping values at -80 dB is a reasonable choice since it is close to the noise floor of most ultrasound transducers. In the CycleGAN training paradigm, it can be challenging to learn such clipping cutoffs because of the cycle-consistency loss (defined in Eq. 3): information destroyed by clipping in the forward generator cannot be recovered by the backward generator, so the cycle-consistency loss penalizes it. Since no cycle-consistency loss is used in the gray-box setting, a model trained under gray-box constraints can learn the clipping better than one trained under black-box constraints. Fortunately, ultrasound end-users can modify luminance to a large extent in real time by changing the imaging window or gain.
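The window clipping discussed above can be sketched as log compression followed by clipping to a [floor, 0] dB display window. This is not Siemens' post-processing: the dB values are taken relative to the brightest sample (an assumption, since the reference level is not stated), the window is mapped linearly onto [0, 1], and `log_compress` is a hypothetical helper name.

```python
import numpy as np


def log_compress(envelope, floor_db=-80.0, eps=1e-12):
    """Log-compress a beamformed envelope and clip it to a [floor_db, 0] dB window.

    dB values are relative to the brightest sample (an assumption), and the
    clipped window is mapped linearly onto [0, 1] for display.
    """
    env = np.abs(np.asarray(envelope, dtype=float))
    db = 20.0 * np.log10(env / env.max() + eps)   # eps avoids log10(0)
    db = np.clip(db, floor_db, 0.0)
    return (db - floor_db) / (-floor_db)


env = np.array([1.0, 1e-2, 1e-5, 1e-7])      # 0, -40, -100, -140 dB relative to peak
wide = log_compress(env, floor_db=-120.0)    # the -100 dB sample remains visible
narrow = log_compress(env, floor_db=-80.0)   # everything below -80 dB maps to 0
```

With the -120 dB window the four samples map to about [1, 0.667, 0.167, 0]; with the -80 dB window they map to [1, 0.5, 0, 0]. Any information below the new floor is discarded, which is exactly what a cycle-consistent reverse generator cannot recover.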

One challenge we found was that training MimickNet was quite unstable for small generator networks. This instability is likely due to the nature of adversarial objectives in GANs, which other works explore [16, 21]. The overall stability of the adversarial objective appears to be a more important factor in achieving a higher SSIM than the specific generator distance metric used, such as MAE or SSIM, as seen in Table III. Training GANs is a delicate balancing act between the discriminator and the generator. If the discriminator overpowers the generator during training, the generator receives little useful gradient signal and cannot catch up. A quick solution is to increase the capacity of the generator by adding parameters, or to decrease the capacity of the discriminator by removing parameters, until convergence occurs. Future works will explore how to better increase training stability and address any remaining performance gap between the gray-box and black-box constraint settings through different deep learning model architectures, objective functions, or training processes.

As is, MimickNet shows promise for production use. It runs in real time at 92 FPS on an NVIDIA P100 and uses 2000x fewer FLOPs than models such as MobileNetV2, which was designed for less capable hardware such as mobile phone CPUs. This runtime is relevant since more ultrasound systems are being developed for mobile phone viewing [11]. Additionally, the last row of Table III, labeled “ver”, is a 7.76M parameter model trained only on Verasonics Vantage data with MAE distance metric optimization, and it achieves metrics similar to those of models trained on the full dataset. These results hint at the possibility of achieving similar SSIM with less data. Future work will assess the performance of MimickNet on mobile phones and in other data- or compute-constrained settings.

This work's main contribution is in decreasing the barrier of clinical translation for future research. Medical images previously only understood by research domain experts can be translated to clinical-grade images widely familiar to medical providers. Future work will aim to implement a flexible end-to-end software package to train a mimic given data from two arbitrary scanner systems. Work will also examine how much data is required to create a high-performance mimic.

V Conclusion

MimickNet closely approximates current clinical post-processing in the realistic black-box setting where before and after post-processing image pairs are unavailable. We present it as an image matching tool to provide fair comparisons of novel beamforming and image formation techniques to a clinical baseline mimic. It runs in real-time, works for out-of-distribution cardiac data, and thus shows promise for practical production use. We demonstrated its application in comparing different beamforming methods with clinical-grade post-processing and showed that resolution improvements are carried over into the final post-processed image. Our results with ultrasound data suggest it should also be possible to approximate medical image post-processing in other modalities such as CT and MR.


This work was supported by the National Institute of Biomedical Imaging and Bioengineering under Grant R01-EB026574, and the National Institutes of Health under Grant 5T32GM007171-44. The authors would like to thank Siemens Medical Inc. USA for in-kind technical support.


  • [1] M. Abdel-Nasser and O. A. Omer (2017) Ultrasound image enhancement using a deep learning architecture. In Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2016, pp. 639–649. Cited by: §I.
  • [2] H. Ahman, L. Thompson, A. Swarbrick, and J. Woodward (2009-05) Understanding the advanced signal processing technique of Real-Time adaptive filters. J. Diagn. Med. Sonogr. 25 (3), pp. 145–160. Cited by: §I.
  • [3] A. Anvari, F. Forsberg, and A. E. Samir (2015-11) A primer on the physical principles of tissue harmonic imaging. Radiographics 35 (7), pp. 1955–1964 (en). Cited by: §I.
  • [4] N. Bottenus (2018-10) REFoCUS: ultrasound focusing for the software beamforming age. In 2018 IEEE International Ultrasonics Symposium (IUS), pp. 1–4. Cited by: §I, §III-E.
  • [5] A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004-04) Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13 (4), pp. 600–612. Cited by: §II-A, §III-A.
  • [6] A. Brock, J. Donahue, and K. Simonyan (2019) Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, External Links: Link Cited by: §I.
  • [7] Y. Deng, M. L. Palmeri, N. C. Rouze, G. E. Trahey, C. M. Haystead, and K. R. Nightingale (2017-10) Quantifying image quality improvement using elevated acoustic output in B-Mode harmonic imaging. Ultrasound Med. Biol. 43 (10), pp. 2416–2425 (en). Cited by: §II.
  • [8] F. Dietrichson, E. Smistad, A. Ostvik, and L. Lovstakken (2018-10) Ultrasound speckle reduction using generative adversial networks. In 2018 IEEE International Ultrasonics Symposium (IUS), pp. 1–4. Cited by: §I.
  • [9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2672–2680. Cited by: §I.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In

    Proceedings of the IEEE conference on computer vision and pattern recognition

    pp. 770–778. Cited by: §I.
  • [11] H. Hewener and S. Tretbar (2015-10) Mobile ultrafast ultrasound imaging system based on smartphone and tablet devices. In 2015 IEEE International Ultrasonics Symposium (IUS), pp. 1–4. Cited by: §IV.
  • [12] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §I.
  • [13] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §I, §II-C.
  • [14] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio (2017) The one hundred layers tiramisu: fully convolutional densenets for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 11–19. Cited by: §I.
  • [15] V. Kakkad, J. Dahl, S. Ellestad, and G. Trahey (2015-04) In vivo application of short-lag spatial coherence and harmonic spatial coherence imaging in fetal ultrasound. Ultrason. Imaging 37 (2), pp. 101–116 (en). Cited by: §II.
  • [16] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2018) Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations. Cited by: §IV.
  • [17] P. Liu, H. Zhang, K. Zhang, L. Lin, and W. Zuo (2018) Multi-level Wavelet-CNN for image restoration. Cited by: §I.
  • [18] J. Long, W. Long, N. Bottenus, G. F. Pinton, and G. E. Trahey (2018) Implications of lag-one coherence on real-time adaptive frequency selection. In 2018 IEEE International Ultrasonics Symposium (IUS), pp. 1–9. Cited by: §II.
  • [19] W. Long, D. Hyun, K. R. Choudhury, D. Bradway, P. McNally, B. Boyd, S. Ellestad, and G. E. Trahey (2018-04) Clinical utility of fetal Short-Lag spatial coherence imaging. Ultrasound Med. Biol. 44 (4), pp. 794–806 (en). Cited by: §I, §I, §II.
  • [20] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley (2017-10) Least squares generative adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2813–2821. ISSN 2380-7504. Cited by: §I, §II-C.
  • [21] L. Mescheder, A. Geiger, and S. Nowozin (2018-07) Which training methods for GANs do actually converge? In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm, Sweden, pp. 3481–3490. Cited by: §IV.
  • [22] M. R. Morgan, D. Hyun, and G. E. Trahey (2019-03) Short-lag spatial coherence imaging in 1.5-d and 1.75-d arrays: elevation performance and array design considerations. IEEE Trans. Ultrason. Ferroelectr. Freq. Control (en). Cited by: §I.
  • [23] D. Perdios, M. Vonlanthen, A. Besson, F. Martinez, and J. Thiran (2018-10) Deep convolutional neural network for ultrasound image enhancement. In 2018 IEEE International Ultrasonics Symposium (IUS), pp. 1–4. Cited by: §I.
  • [24] G. F. Pinton, G. E. Trahey, and J. J. Dahl (2011-04) Sources of image degradation in fundamental and harmonic ultrasound imaging using nonlinear, full-wave simulations. IEEE Trans. Ultrason. Ferroelectr. Freq. Control 58 (4), pp. 754–765 (en). Cited by: §I.
  • [25] A. Radford, L. Metz, and S. Chintala (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR abs/1511.06434. Cited by: §I.
  • [26] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. Med. Image Comput. Comput. Assist. Interv. Cited by: §I.
  • [27] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018-06) MobileNetV2: inverted residuals and linear bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. ISSN 2575-7075. Cited by: §III-C.
  • [28] J. Snell, K. Ridgeway, R. Liao, B. D. Roads, M. C. Mozer, and R. S. Zemel (2017) Learning to generate images with perceptual similarity metrics. Cited by: §I.
  • [29] H. Takeda, S. Farsiu, and P. Milanfar (2006-10) Robust kernel regression for restoration and reconstruction of images from sparse noisy data. In 2006 International Conference on Image Processing, pp. 1257–1260. Cited by: §I.
  • [30] K. Thiele, J. Jago, R. Entrekin, and R. Peterson (2013-06) Exploring nSight imaging, a totally new architecture for premium ultrasound. Technical Report 4522 962 95791, Philips. Cited by: §I.
  • [31] C. Tomasi and R. Manduchi (1998-01) Bilateral filtering for gray and color images. In Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271), pp. 839–846. Cited by: §I.
  • [32] G. E. Trahey, J. W. Allison, S. W. Smith, and O. T. von Ramm (1987-09) Speckle reduction achievable by spatial compounding and frequency compounding: experimental results and implications for target detectability. In Pattern Recognition and Acoustical Imaging, Vol. 0768, pp. 185–192. Cited by: §I.
  • [33] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11 (Dec), pp. 3371–3408. Cited by: §I.
  • [34] L. Xu, J. S. J. Ren, C. Liu, and J. Jia (2014) Deep convolutional neural network for image deconvolution. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 1790–1798. Cited by: §I.
  • [35] J. Yamanaka, S. Kuwashima, and T. Kurita (2017) Fast and accurate image super resolution by deep CNN with skip connection and network in network. In Neural Information Processing, pp. 217–225. Cited by: §I.
  • [36] Z. Zhang, Q. Liu, and Y. Wang (2018-05) Road extraction by deep residual U-Net. IEEE Geoscience and Remote Sensing Letters 15 (5), pp. 749–753. Cited by: §I.
  • [37] H. Zhao, O. Gallo, I. Frosio, and J. Kautz (2017-03) Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging 3 (1), pp. 47–57. ISSN 2333-9403. Cited by: §I.
  • [38] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §I.