OT-driven Multi-Domain Unsupervised Ultrasound Image Artifact Removal using a Single CNN

07/10/2020 ∙ by Jaeyoung Huh, et al. ∙ KAIST 수리과학과 0

Ultrasound imaging (US) often suffers from distinct image artifacts from various sources. Classic approaches for solving these problems are usually model-based iterative approaches that have been developed specifically for each type of artifact, which are often computationally intensive. Recently, deep learning approaches have been proposed as computationally efficient and high performance alternatives. Unfortunately, in the current deep learning approaches, a dedicated neural network should be trained with matched training data for each specific artifact type. This poses a fundamental limitation in the practical use of deep learning for US, since large number of models should be stored to deal with various US image artifacts. Inspired by the recent success of multi-domain image transfer, here we propose a novel, unsupervised, deep learning approach in which a single neural network can be used to deal with different types of US artifacts simply by changing a mask vector that switches between different target domains. Our algorithm is rigorously derived using an optimal transport (OT) theory for cascaded probability measures. Experimental results using phantom and in vivo data demonstrate that the proposed method can generate high quality image by removing distinct artifacts, which are comparable to those obtained by separately trained multiple neural networks.



There are no comments yet.


page 1

page 7

page 8

page 9

page 10

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

In contrast to computed tomography (CT) and magnetic resonance imaging (MRI), ultrasound imaging (US) poses no radiation risks to the patient and enjoys fast acquisition time, while the hardware system is much simpler. Therefore, US is very useful for many clinical and portable diagnostic applications.

Unfortunately, US suffers from various imaging artifacts such as speckle noises from signal interference, image blur from limited bandwidth and aperture size, etc. For the last few decades, many researchers have proposed various model-based iterative algorithms to address these problems [1, 2, 3, 4]. While the results are impressive, one of the limitations of these model-based approaches is that computationally extensive optimization problem should be solved again for each measurement.

Fig. 1: Multi-domain image translation using (a) StarGAN, (b) our method, and (c) CycleGAN.

Recently, deep learning approaches has been widely used for various medical imaging problems, such as low-dose CT [5], accelerated MRI [6, 7], etc. In US, Yoon et al [8]

proposed a convolutional neural network (CNN) approach to interpolate subsampled RF channel data. Deep learning-based beamformers have been also suggested as promising alternatives to the delay-and-sum (DAS) or adaptive beamformers

[9]. Furthermore, various US artifact removal algorithms have been implemented using deep neural networks [9, 10]. These deep learning approaches are inductive in the sense that the trained network can be used for other measurement data without additional optimization procedure. Therefore, deep learning approaches are ideally suitable for medical imaging problems, where fast reconstruction is critical.

In spite of these pioneering works, the deep neural network approaches for US imaging still have several technical huddles for their wide acceptance. First, US images are usually corrupted with different types of artifacts, and each user often prefers distinct choice of artifact suppression algorithms depending on the clinical applications. This implies that large number of neural networks should be stored in a US scanner to deal with various image artifacts to satisfy the users’ demands. Second, most of the current deep learning approaches are based on supervised training. Although this requires matched ground-truth artifact-free images, obtaining matched artifact-free image is challenging in ultrasound imaging.

In this paper, we propose a novel deep learning approach to overcome these fundamental limitations. In our method, a single neural network can be switched to address different types of image artifacts by simply changing the target mask vector. Moreover, our neural network is trained in an unsupervised manner so that it does not require any matched reference data for training.

In fact, this is inspired by the recent successes of multi-domain image transfer in computer vision literature

[11, 12]. For example, Cho et al [11] proposed a StarGAN, which uses only one generator with one discriminator to translate an image to multiple domains by utilizing a mask vector to indicate which domain to transfer. However, as will be shown later, we found that the direct application of StarGAN is not necessary, since the US image artifacts removal problem is a uni-directional translation problem. More specifically, while StarGAN should translate any domain to another one equally well as shown in Fig. 1(a), this is not necessary in US artifact removal problems, since DAS images are raw data obtained from the scanner and it is not necessary to re-generate DAS images from other domain images. Therefore, a correct image reconstruction pipeline should be asymmetric as shown in Fig. 1(b), where artifact suppressed images (domain 2 and 3) are generated only from DAS raw data (domain 1). It turns out that the lack of symmetry makes the multi-domain translation task much simpler, thereby significantly reducing the network complexity and improving the performance.

Another important contribution of this work is a novel geometric insight that leads to the proposed network architecture. Specifically, inspired by the recent theory of optimal transport driven cycleGAN [13], we demonstrate that the multi-domain artifact removal problem can be formulated as an optimal transport problems between cascaded probability measures, where the goal is to minimize the statistical distances between the “push-forward” measures and the real data distributions. It turns out that if Wasserstein-1 metric[14] and Kullback-Leiber (KL) divergence are used as distances for continuous and discrete distributions, respectively, then the proposed multi-domain unsupervised image translation network emerges.

Ii Backgrounds

Ii-a Multi-Domain Image Translation

One of the earliest works in image translation is Pix2Pix

[15] based on conditional generative adversarial networks (cGAN). Pix2pix was designed for various translation tasks with paired data such as map to aerial photo, or sketch to photo, and so on [15].

To relieve the stringent requirement of paired data, Zhu et al [16] proposed an unsupervised image-translation technique called CycleGAN, which consists of two generators and two discriminators. Specifically, one generator is used to generate a target domain image and the other generator is used to generate the original domain image. Then, the cycle-consistency loss is used to impose the consistency of the the input image and the regenerated fake image from the second generator. CycleGAN has been successfully used as a promising unsupervised learning approach for various medical imaging problems, such as low-dose CT [17], etc.

However, CycleGAN approach is not scalable in the sense that a total of CNN generators are required to translate between domains [11]. To deal with the scalability issue of CycleGAN in multi-domain image translation, Choi et al [11]

proposed a StarGAN approach which performs multi-domain image translation with only one generator and one discriminator. The domain is separated with a mask vector, which consists of one-hot vector. The target domain mask vector then guides the translation between different image domains. Moreover, the discriminator is designed not only to differentiate between real and fake, but also to classify which domain the image belongs to.

By extending this idea furthermore, Collaborative GAN (CollaGAN) [12] was proposed to deal with the multi-domain transfer from multiple inputs. In particular, CollaGAN uses multiple inputs to generate more realistic target domain images, by exploiting the multiple cycle-consistency loss. Similar to StarGAN, CollaGAN is also based on single generator and discriminator.

One of the main assumptions in the aforementioned image translation tasks is that each domain is equally important, i.e. an image in any domain should be translated to another domain equally well regardless of source and target domain combination. In fact, this turns out to be quite stringent requirement in the network training. As will be discussed later, in US artifact removal problems the raw data always belongs to one specific domain and the goal is to process the raw data using different types of algorithms. Therefore, the translation task is uni-directional, which could provide an opportunity to relax the stringent constraint for network training.

Ii-B Optimal Transport driven CycleGAN

As the multi-domain image translation is based on the generative adversarial network (GAN)[11, 12], one may wonder whether the image translation is just a cosmetic change or real. In fact, this is a important issue in medical imaging applications, since any artificially created features can reduce the diagnostic accuracy of the medical images.

Fig. 2: Geometric view of (a) CycleGAN, and (b) the proposed multi-domain transfer.

In our recent paper [13], we showed that the CycleGAN architecture can be rigorously derived from an optimal transport problem [14]

of one probability distribution to another. Accordingly, we can safely assume that the generated image reflects real image distribution if networks are properly designed and trained. As our goal is to extend this to multi-domain translation, in the following we briefly review OT driven CycleGAN

[13] for self-containment.

Optimal transport (OT) provides a mathematical means to operate between two probability measures [14]. Formally, we say that transports a probability measure (or distribution) to another measure , if


which is often denoted as the push-forward of the measure, and represented by

Suppose there is a cost function such that represents the cost of moving one unit of mass from to . The OT problem [14] is then to find a transport map that transports to at the minimum total transportation cost.

Now, we will explain how optimal transport can lead to a CycleGAN architecture [13]. Suppose that the target image space is equipped with a probability measure , whereas the original image space is with a probability measure . In unsupervised learning problem, there are no paired data, so the goal of unsupervised learning is to match the probability distributions rather than each individual sample. This can be done by finding transportation maps that transport the measure to , and vice versa.

More specifically, as shown in Fig. 2(a), the transportation from a measure space to another measure space is done by a generator , realized by a deep network parameterized with . Then, the generator pushes forward the measure in to a measure in the target space , i.e. . Similarly, the transport from to is performed by another neural network generator , so that the generator pushes forward the measure in to in the original space . Then, our goal is to derive an optimal transport map by minimizing the statistical distances between and , and between and , respectively.

One of the most important contributions of our work [13] is to show that if the statistical distance is measured by Wasserstein-1 metric [14] and the distance minimization problem is solved using a dual formulation, a CycleGAN architecture emerges [13]. This gives an important mathematical foundation of CycleGAN.

Iii Problem Formulation

Iii-a DAS Image Artifacts

In US imaging, the preprocessed channel data is first obtained from the return RF echoes:


where denotes the delay corrected RF data at the -th channel element, and and denote the scan line index and the depth, and is the aperture size. The delay and sum (DAS) beamformer for the -th scanline at the depth sample can be then computed as


where denotes a -dimensional column-vector of ones.

The DAS beamformer is designed to extract the low-frequency spatial content that corresponds to the energy within the main lobe; thus, large number of receiver elements are often necessary in DAS beamformer to improve the image quality by reducing the side lobes. Moreover, to calculate accurate time delay, sufficiently large bandwidth transducers are required. Therefore, in many US systems, especially for low-end or portable systems where the number of receiver elements is not sufficient and transducer bandwidth is limited, the image quality degradation is unavoidable.

Another performance degradation of DAS beamformer comes from the inaccuracy of ray approximation of the wave propagation, as the underlying image physics is from the wave phenomenon and the deviation from the real physical introduces inevitable image artifacts.

Yet another image artifacts often observed in DAS image is speckle noise. Speckles noises are originated from the small size scatterers which reflects the US energy. As the DAS simply adds the return echoes after proper time delay, the constructive and destructive interferences from these scattered echoes generate bright and dark spots in the image, which results in the speckle noise. For example, for cardiac volume quantification using segmentation, speckle noise imposes significant algorithmic challenges, so the proper suppression of the speckle noise is often necessary. That said, in some clinical applications such as liver imaging, the speckle statistics is an important measure for the diagnosis of diseases progress, because it is highly sensitive to small variations in scatterers. Accordingly, the speckle noise removal algorithm should be used judiciously depending on specific applications.

Iii-B Classical Image Processing Pipelines

In the following, we will describe classical model-based iterative methods in more detail.

Iii-B1 Deconvolution

In US, the received signal is modeled as a convolution of tissue reflectivity function (TRF) with a point spread function (PSF) , where tissue reflectivity function represents scatter’s acoustic properties, while the impulse response of the imaging system is modeled by point spread function. In most practical cases, the complete knowledge of is not available, and therefore both unknown TRF and the PSF

have to be estimated together. These types of problems are often called deconvolution problems. Deconvolution ultrasound may help in dealing with modeling inaccuracies and finite bandwidth issues, which will eventually improve the spatial resolution of an ultrasonic imaging system

[1, 2].

Mathematically, the deconvolution problem can be formulated as follows:


where is a sparsity imposing regularization terms such as , total variation (TV), etc. One strategy to address (4) is to estimate and jointly, and another strategy is to estimate them separately, i.e., first is estimated from , and then is estimated based on [2]. In this paper, the second strategy is used for the generation of unmatched target data distribution, i.e. images in Domain 2 in Fig. 1(b).

Iii-B2 Despeckle

Speckle noises are one of the major reasons of quality degradation, so that “despeckle” (removal of the speckles) can improve the visual quality and subsequently enhance the structural details in US images [3]. In recent past, a variety of reasonably good de-speckling algorithms have been proposed for US imaging [3, 4]. One such algorithm is proposed by Zhu et al [3], which is based on the principle of non-local low-rank (NLLR) filtering.

In NLLR [3], the guidance image is first obtained as a reference. Then, using guidance images, group of similar patches are selected based on the similarity to the reference patch. Using the group of similar patches, the low-rank + sparse decomposition [18] is then performed. The optimization problem is solved using alternating directional method of multiplier (ADMM) [19]. Finally, the low-rank component is considered as speckle free images, whereas the sparse component are considered as speckle noises.

Although the algorithm provides an effective means for removing speckle, it is computationally expensive, so it cannot be used for routine US applications. In this paper, this algorithm is used to generate our target samples for despeckle images (i.e., images in Domain 3 in Fig. 1(b)).

Iii-C Multi-domain Unsupervised Artifact Removal

In this paper, we are interested in a single neural network that can remove both blur artifacts and speckle noises from the DAS images in a real-time manner. More specifically, in Fig. 1(b), Domain 1 is for the DAS input images, and Domain 2 and 3 correspond to deconvolution and despeckle image domains. Since the use of deconvolution images or despeckle images depends on specific clinical applications, our goal is therefore to design a neural network such that it can be switched to process different type of artifacts by mere changing the target domain mask vector.

However, if the existing CycleGAN is used for this applications, two separate neural networks are required for each artifact type as shown in Fig. 1(c), which increases additional burden for US scanner.

Fig. 3: Proposed multi-domain unsupervised artifact removal networks.

Iv Main Contribution

Iv-a Geometric Formulation

The geometry of our problem is illustrated in Fig. 2(b), which is similar to that of CycleGAN in Fig. 2(a) but with important differences.

Here, the original image space is the DAS image domain, whose probability distribution is , whereas the target image space contains both despeckle images and deconvolution images, and their distribution follows the probability measure . One key difference compared to the CycleGAN geometry in Fig. 2(a) is that the target distribution is modeled as the mixture distribution of despeckle and deconvolution images, which can be separately by a classifier that maps each sample in to discrete domain labels in with a discrete measure .

Thus, our goal is to find a generator that transport to a probability measure which is close to and still enables the correct classification using . Accordingly, and are dependent to each other, so they should be optimized simultaneously. In fact, this corresponds to an optimal transport problem to a cascaded measures. On the other hand, the generator that transport to should transport all mixture distribution to the probability distribution in the original space .

Similar to the optimal transport driven CycleGAN [13], in our method the statistical distance is measured using the Wasserstein-1 metric. More specifically, for the choice of a metric in , the Wasserstein-1 between and can be computed by:



denotes the set of joint distribution whose marginal distribution is

and . Similarly, the Wasserstein-1 distance between and is given by


Since the optimal joint distribution minimizing (5) and (6) could be different if they are minimized separately, a better way of finding the transportation map is to minimize them together with the same joint distribution :


This can be considered as the sum of the statistical distances in both spaces.

Next, to enable a multi-domain transfer, the measure in the target space should be transported to the discrete sample space with the discrete measure . As we have two distribution and in , the corresponding push forward measures by the classifier are given by and , respectivey (see Fig. 2(b)). Then, the classification design problem can be done again by minimizing the statistical distances in .

Since the classification label is discrete (in our case, binary), one of the simplest way of measuring the statistical distance in is using the Kullback-Leiber (KL) divergence. Specifically, the KL divergence between and the push-forward measure is given by:

where the output of the classifier is the probability distribution indicating the probability belonging to target domains, and is the one-hot vector encoded true label probability for . For example, in the conversion of DAS image to two target domains (decovolution, and despeckle), is a discrete probability mass function which is for the deconvolution target, and for despeckle targets.

Similarly, the KL divergence between and the push-forward measure is given by:


where is one-hot vector encoded target label for the generator for a given DAS image .

By putting them together, the overall loss becomes


where is the weighting parameter for the classifier.

Iv-B Proposed Method

Unfortunately, direct minimization of the loss in (9) is difficult specially due to the need for non-parametric minimization with respect to the joint distribution . To address this, one of the most important contributions of our companion paper [13] is to show that the following primal problem


is equivalent to the following dual formulation:




where is the hyper-parameter, and the cycle-consistency term is given by

whereas the discriminator term is

Here, should satisfy 1-Lipschitz condition (i.e.

To impose 1-Lipschitz condition, the gradient of the Kantorovich potential is constrained to be 1 [20]:

where and with

being random variables from the uniform distribution between


By collecting all terms together, our multi-domain artifact removal problem can be formulated by



where , , are the weighting parameters depending on the relative importance. In this paper, we used , , and .

Iv-C Neural Network Implementation

Fig. 3 illustrates the proposed multi-domain unsupervised artifact removal network originated from the optimization problem in (13). The network architecture consists of two generators. The main generator converts DAS images to either deconvolution or de-speckled images. The second generator returns the deconvolution or despeckle images to the original DAS domain images. Once the networks are trained, note that we only use at the inference phase.

Here, care should be taken due to the existence of the classifier. More specifically, in calculating the KL divergence in (8), the target domain of should be indicated. Therefore, the main idea is to indicate the target domain using an input mask vector, :


For example, a DAS input image with the mask vector creates the deconvolution image, whereas an input DAS image with the mask vector is converted to a despeckled image. The mask vector is augmented along the channel direction with the input image. On the other hand, does not need a mask vector since there exists no classifier in .

Fig. 4: Network architecture: (a) generator, and (b) discriminator/domain classifier.

Accordingly, although the same generator architecture as shown in Fig. 4(a) is used for both and , the input channel numbers are different. For the case of , the number of input channels is 3, which is composed of DAS images and two mask vector channels, whereas is composed of a single input channel.

Additionally, there are also two discriminators and as shown in Fig. 3. Specifically, tries to find the difference between the true image and the generated image , whereas attempts to find the fake measurement data that are generated by the synthetic measurement procedure .

Finally, we have the domain classifier to distinguish between deconvolution and despeckled images. In fact, the domain classifier and discriminators are both classifiers, so their structure share many commonalities. Therefore, as shown in Fig. 4(b), the discriminator and the classifier are implemented using a same network architecture with double output headers composed of a domain classifier or discriminator.

V Methods

V-a Dataset

We used in vivo and phantom images to train the network. The in vivo images are focused B-mode images using a linear probe (L3-12H) from the US system E-CUBE 12R (Alphinion, Korea). The images were measured from 10 volunteers at four parts of the carotid and thyroid areas. It has 10 time frames for each body part. The phantom images are tissue mimicking phantoms that are also acquired with focused B-mode imaging using a linear probe. Both images are taken with a center frequency of 8.5 MHz. Each image is captured with 64 channels and 96 scan lines. The details of the probe are listed in the Table I. We selected 8 volunteer data, and 304 images in in vivo images and 125 images in phantom images are used for the training data set.

To create the deconvolution images in Domain 2, we used the joint sparse deconvolution model that is iterative method along the axial direction with lateral direction constraint (Sec. III-B1). We also used the Non-Local-Low-Rank (NLLR) algorithm to generate the despeckle images (Sec. III-B2

) for Domain 3 data. We used 429 deconvolution images and 478 despeckle images to train the network as target domain images. In addition, we randomly mixed the input dataset and the target dataset for each epoch separately in order to train the network with unpaired method.

Parameter Configuration
Probe Model L3-12H
Center frequency 8.5MHz
Sampling frequency 40MHz
Wave mode Focused
Element 192
Tx element 128
Transmision emission 96
Rx elements 64
Element pitch 0.2mm
Element width 0.14 mm
TABLE I: Data configuration

V-B Network Architecture

As shown in Fig. 4, we employed the generator architecture that was used in [16, 11]

. It is composed of two downsampling steps and six residual blocks and two upsampling steps. The convolution with stride 2 is utilized for downsampling operation. The residual block has two convolution steps. The transposed convolution layer with stride 2 is used for upsampling operation. In the generator, we used instance normalization instead of batch normalization. Hyperbolic tangent (tanh) non-linearity is applied in the last layer for stablizing the training process.

The discriminator is shown in Fig. 4 (b). We used PatchGAN which classifies the local image whether real or fake. It is simply composed of six convolution sets of 4x4 kernel size with stride 2 and LeakyReLU except last layer. In the last layer, there is multiheader structure. One head is for domain classification and the other is for real/fake classification.

Fig. 5: Comparison results with other algorithms. The yellow box under each image shows the magnified image of yellow dot-line box. The red and blue line are the selected region for calculating contrast metrics. The first sample is left thyroid image and the second sample is the right thyroid image.

V-C Training Details

V-C1 Implementation details

To train the network, we used the Adam optimizer with

. The learning rate is started from 1e-4 and decreased linearly from the 500 epochs. The total epoch is 1000, and the batch size is 4. To avoid the overfitting problem, we used data augmentation technique like flipping, rotating and random scaling. We implemented all work on the Python with Tensorflow and MATLAB 2017a. We trained the network with the NVIDIA GeForce GTX 1080 Ti GPU.

V-C2 Evaluation metric

To evaluate our results, we used values for the peak-signal-to-noise ratio (PSNR), the structure similarity (SSIM)

[21], the contrast-to-noise ratio (CNR), the generalized CNR (GCNR) [22], and the Contrast Ratio (CR).

The PSNR value is calculated between the label image () and output image (). The equation between images whose size is is


where denotes the maximum value of and is -norm. The SSIM value is defined by [21]


The denotes the mean value of and , deviation value of and , covariance of , respectively. The is and is . In this paper, and are 0.01 and 0.03, respectively. The CNR value can quantify the image contrast which is defined as



denotes the mean and standard deviation value of the structure and the background, respectively. The GCNR value is the generalization value of CNR

[22]. It calculates the area of region between intensity distribution of two areas:


where and are probability distributions in structure and background, respectively. The CR value is used to compare the contrast without noise condition:


where denote the mean value of the structure and background, respectively.

V-D Baseline Algorithms

For comparative study, the following algorithms are used as baselines: 1) supervised learning, 2) CycleGAN

[16], and 3) StarGAN [11].

For supervised learning, the same generator architecture was trained using a paired data set. We normalized the intensity of the entire image from to . We also used data augmentation technique like flipping, rotating and random scaling. We trained the network with the Adam optimizer with the learning rate 1e-4. The total epoch is and the batch size is 4.

For CycleGAN training, we used the same generator architecture as the proposed method. The discriminator also has same architecture except the last layer. There is no domain classification loss in CycleGAN. We also used WGAN with gradient penalty technique [20] similar to the proposed method. Unlike the supervised learning, there are no paired dataset for CycleGAN training. We used as 30 and as 0.5. We normalized the intensity of the image as to . To increase the total dataset, we used data augmentation technique like flipping, rotating and random scaling. We trained the network with the Adam optimizer with learning rate decreasing linearly from 1e-3 to 1e-4 depending on the training epoch. The total epoch is 200 and batch size is 4.

The StarGAN also used same generator and discriminator architecutres as our method except kernel size. It also used the WGAN with gradient penalty technique [20]. Both StarGAN and the proposed method use mask vectors to separate the target domain. Therefore, each target domain image can be generated by simply changing the mask vector. However, in contrast to our method, StarGAN also utilizes the same generator for the returning path with a mask vector. This increases the difficulty of training. In the StarGAN, we used as 30 and as 1 and as 10. We normzliaed all the images as -1 to 1. Flipping, rotating, and random scaling also used for data augmentation technique. We trained it with Adam optimizer with 1e-4 linearly decreasing from the half of total epoch. The total epoch is 500 and batch size is 4.

Vi Experimental Results

To verify the performance of the proposed method, we validate the algorithm with qualitative and quantitative manners.

Vi-a Qualitative Results

Fig. 5 shows the comparison results using various algorithms. We show the in vivo images from the left thryorid and right thyroid. As shown in the figure, all the deconvolution output images show the improved contrast. Specifically, in the magnified images, it is easy to recognize that the deconvolution images have sharper structure. While the StarGAN output has slightly improved compared to the input image, the output from the proposed method is closer to the target image. Moreover, the side to side comparison with supervised and CycleGAN approach show that the proposed method provides qualitatively comparable results. However, it is remarkable that the deconvolution and despeckled images using supervised learning and CycleGAN are generated from independent network trained separately, whereas the proposed method generated both images with single generator.

The despeckle images in Fig. 5 also shows the noise suppression. After removing the speckle noise, the boundaries of structure becomes smooth out. However, as shown in the Fig. 5, the result of the proposed method preserved the structure boundaries better than StarGAN. We applied our method to tissue mimicking phantoms. One is anechoic and the other is hyperechoic phantom. The results are shown in Fig. 6. The deconvolved image in both cases relatively show high contrast compared to the input image. Moreover, the granular pattern is reduced and the boundaries of structure are well maintained in the proposed method.

Fig. 6: Artifact removal in anechoic phantom and hyper-echoic phantom using our method. The red and blue line are the selected region for calculating contrast metrics.
Fig. 7: Quantitative performance comparison. The CR graphs are shown in deconvolution case, whereas the GCNR graphs are represented in despeckle case.
Deconvolution in vivo Phantom
Algorithms Input 17.15 0.68 0.82 0.52 7.97 15.33 0.74 1.25 0.60 11.41
Label X X 0.74 0.53 9.15 X X 1.13 0.56 12.24
Supervised 24.35 0.79 0.79 0.55 9.13 21.55 0.82 1.26 0.60 14.01
CycleGAN 22.93 0.75 0.85 0.56 9.87 19.62 0.76 1.19 0.59 13.75
StarGAN 20.85 0.74 0.73 0.50 7.18 20.02 0.80 1.17 0.57 12.38
Proposed 23.86 0.79 0.76 0.53 8.98 22.20 0.84 1.19 0.58 13.03
TABLE II: Deconvolution performance evaluation
Despeckle in vivo Phantom
Algorithms Input 26.99 0.65 0.82 0.52 7.97 26.64 0.66 1.25 0.60 11.41
Label X X 1.02 0.62 7.52 X X 1.46 0.65 10.79
Supervised 35.58 0.93 1.04 0.62 7.29 29.64 0.90 1.49 0.67 11.98
CycleGAN 28.27 0.86 1.02 0.61 7.08 25.72 0.83 1.51 0.67 13.83
StarGAN 30.34 0.89 0.98 0.59 7.29 25.57 0.86 1.47 0.67 12.55
Proposed 31.78 0.91 1.02 0.61 7.86 30.11 0.89 1.49 0.67 12.52
TABLE III: Despeckle performance evaluation

Vi-B Quantitative Results

Quantitative comparison with other methods was performed using the PSNR, SSIM, CNR, GCNR, and CR values. All test data are B-mode focused image acquired from linear probe with 8.5MHz center frequency. The test dataset are composed of in vivo and phantom images. The in vivo datasets consist of 80 frames from four parts of two subjects. The phantom datasets are totally 16 images, consisting of 10 anechoic and 6 hyperechoic phantom images.

Fig. 7 shows the distribution of each quantitative metrics. The PSNR and SSIM values show significant improvement in all algorithms. As expected, the PSNR and SSIM values of the supervised learning was the best, but it is remarkable that the proposed method are superior to CycleGAN and StarGAN results in all cases.

The average values of all images are represented in the Table II and Table III. When the deconvolution is the target domain, in our method the PSNR value has 6.71 dB and 6.87 dB improvement compared to the original DAS images in in vivo and phantom case, respectively. The SSIM value also gained 0.11 and 0.10 unit, respectively. Similarly, when the target domain is despeckle images, the PSNR and SSIM values show high improvement in both in vivo and phantom images.

In the Table II, the CNR and GCNR values of our method for deconvolution tasks are slightly decreased compared to DAS input. This is because the deconvolution process also enhance the noise component. On the other hand, the CR values, which is less sensitive to noise boosting, is increased. The result from the proposed method shows the CR value enhancement about 1.01dB and 1.62dB in in vivo and phantom images, respectively.

On the other hand, for the despeckle task, the CNR, GCNR value from our method is increased significantly. Table III shows that the results using our method has 0.2 and 0.24 unit improved CNR value for in vivo and phantom images, respectively. Similarly, the GCNR value is increased about 0.09 and 0.07 unit.

Vi-C Computational Time

The average computational time for classical deconvolution model [1], NLLR despeckle [23], and our method are 3479.3sec, 435.51sec, and 0.16sec, respectively. The computational times for supervised learning, CycleGAN, StarGAN are basically same as ours.

In conventional deconvolution algorithm [1], the point spread function (PSF) should be calculated to produce the deconvolution image. The estimation of the PSF and the joint sparse recovery step is computationally expensive due to iterative method. Furthermore, the Non-Local-Low-Rank (NLLR) approach for classical despeckle approach is also based on the iterative method, so it requires huge computational burden to reconstruct the image [23]. On the other hand, our method is based on deep neural network so that once the network is trained, the artifact-free images can be generated in a real time manner. Especially, the proposed method can generate both deconvolution image and despekcle image at the same time.

Vii Discussion

Vii-a Generalization

To verify the generalization power of the proposed algorithm, we tested our method with various types of dataset. The new datasets are acquired with different manner compared to the previous dataset.

Vii-A1 Planewave mode

Although we trained the network with B-mode images acquired with focused mode, we tested our network with planewave mode images to show that the proposed method is independent of the acquisition mode. The planewave image is acquired with linear array probe (L3-12H) and 8.5MHz center frequency. The 31 planewaves are used to generate the final images.

Fig. 8: Results of planewave mode images using our method.

As shown in Fig. 8, the proposed network successfully deconvolve and despeckle the input images. The deconvolved images have better contrast compared to input DAS image. Our method also well suppressed the speckle noises from the planewave mode images. The improvement is prominent in quantitative comparison. When the target domain is the deconvolution image, each image has 5.73 dB, 1.73 dB, and 3.08dB gain in the CR value. Similary, the despeckled images show 0.08 unit gain in GCNR value.

Vii-A2 Different operating frequency

The training datasets are composed of B-mode focused images with operating freqeuncy of 8.5MHz. In this experiment, we show the generalization power with operating frequency of 10.0MHz. Moreover, the datasets are from totally different body part: forearm muscle and calf muscle.

Fig. 9: Results of the data with the center frequency of 10.0MHz using our method. The first row is calf muscle and the second row is forearm muscle.

Fig. 9 shows the results. It can be easily seen that the result from the proposed method removed the artifact well. The details of structure become prominent in deconvolved image. The granular noise reduced well with maintaining the structure boundaries.

Vii-B Extension beyond two domains

In the experiment above, we tried to remove two types of artifact. In this section, we extended artifacts to deal with. That is, we increased the number of target doamain from two to three. The first target domain is deconvolution, the second target domain is despeckle and the third target domain is decovolution-despeckle. To check the effect of domain extension, we carried out the following two experiments. Note that except mask vector, other processes are all same as main experiment.

  1. Two labels for three target domains : The mask vector [1,0] is for deconvolution, [0,1] is for despeckle, [1,1] is for deconvolution-despeckle.

  2. Three labels for three target domains: The mask vector [1,0,0] is for deconvolution, [0,1,0] is for despeckle, [0,0,1] is for deconvolution-despeckle.

Fig. 10: Extension to three target domains with different number of mask vector channels.

As shown in Fig. 10, our method works well when increasing the number of target domain. Both deconvolved images have better contrast and the speckle noise is well suppressed in despeckle images. Furthermore, the deconvolution-despeckle images show reasonable result. Two type of masks described above generates three distinct domain similarly. Therefore, we can conclude that extension of multiple domain beyond two can be easily done without concerning about the increase of the mask vector channels.

Viii Conclusion

Due to the transducer limitations or wave interference, the ultrasound image has many artifacts. To overcome this problem, we proposed the multi-domain image artifact removal method based on optimal transport theory. Using extensive experiments using in vivo and phantom dataset, we confirmed that a single generator can produce the multiple domain targets very well in both qualitatively and quantitatively. As the proposed method can process images instantaneously without increasing the number of models, we believe that our method can be used for practical US systems.


  • [1] J. Duan, H. Zhong, B. Jing, S. Zhang, and M. Wan, “Increasing axial resolution of ultrasonic imaging with a joint sparse representation model,” IEEE transactions on ultrasonics, ferroelectrics, and frequency control, vol. 63, no. 12, pp. 2045–2056, 2016.
  • [2] R. Jirik and T. Taxt, “Two-dimensional blind Bayesian deconvolution of medical ultrasound images,” IEEE transactions on ultrasonics, ferroelectrics, and frequency control, vol. 55, no. 10, pp. 2140–2153, 2008.
  • [3] L. Zhu, C. Fu, M. S. Brown, and P. Heng, “A non-local low-rank framework for ultrasound speckle reduction,” in

    2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    , July 2017, pp. 493–501.
  • [4] P. Coupé, P. Hellier, C. Kervrann, and C. Barillot, “Nonlocal means-based speckle filtering for ultrasound images,” IEEE transactions on image processing, vol. 18, no. 10, pp. 2221–2229, 2009.
  • [5] E. Kang, J. Min, and J. C. Ye, “A deep convolutional neural network using directional wavelets for low-dose x-ray CT reconstruction,” Medical physics, vol. 44, no. 10, pp. e360–e375, 2017.
  • [6] J. Schlemper, J. Caballero, J. V. Hajnal, A. N. Price, and D. Rueckert, “A deep cascade of convolutional neural networks for dynamic MR image reconstruction,” IEEE transactions on Medical Imaging, vol. 37, no. 2, pp. 491–503, 2017.
  • [7] J. Lee, Y. Han, J.-K. Ryu, J.-Y. Park, and J. C. Ye, “k-space deep learning for reference-free epi ghost correction,” Magnetic resonance in medicine, vol. 82, no. 6, pp. 2299–2313, 2019.
  • [8] Y. H. Yoon, S. Khan, J. Huh, and J. C. Ye, “Efficient B-mode ultrasound image reconstruction from sub-sampled RF data using deep learning,” IEEE transactions on medical imaging, vol. 38, no. 2, pp. 325–336, 2019.
  • [9] S. Khan, J. Huh, and J. C. Ye, “Adaptive and compressive beamforming using deep learning for medical ultrasound,” IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 2020.
  • [10] P. Kokil and S. Sudharson, “Despeckling of clinical ultrasound images using deep residual learning,” Computer Methods and Programs in Biomedicine, p. 105477, 2020.
  • [11] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8789–8797.
  • [12]

    D. Lee, J. Kim, W.-J. Moon, and J. C. Ye, “CollaGAN: Collaborative GAN for missing image data imputation,” in

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2487–2496.
  • [13] B. Sim, G. Oh, S. Lim, and J. C. Ye, “Optimal transport, cycleGAN, and penalized LS for unsupervised learning in inverse problems,” arXiv preprint arXiv:1909.12116, 2019.
  • [14] C. Villani, Optimal transport: old and new.   Springer Science & Business Media, 2008, vol. 338.
  • [15]

    P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in

    Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125–1134.
  • [16] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232.
  • [17] E. Kang, H. J. Koo, D. H. Yang, J. B. Seo, and J. C. Ye, “Cycle-consistent adversarial denoising network for multiphase coronary CT angiography,” Medical physics, vol. 46, no. 2, pp. 550–562, 2019.
  • [18]

    E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?”

    Journal of the ACM (JACM), vol. 58, no. 3, pp. 1–37, 2011.
  • [19] W. Yin, S. Osher, D. Goldfarb, and J. Darbon, “Bregman iterative algorithms for -minimization with applications to compressed sensing,” SIAM Journal on Imaging sciences, vol. 1, no. 1, pp. 143–168, 2008.
  • [20] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” in Advances in neural information processing systems, 2017, pp. 5767–5777.
  • [21] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, April 2004.
  • [22] A. Rodriguez-Molares, O. M. H. Rindal, J. D’hooge, S.-E. Måsøy, A. Austeng, M. A. L. Bell, and H. Torp, “The generalized contrast-to-noise ratio: a formal definition for lesion detectability,” IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, vol. 67, no. 4, pp. 745–759, 2019.
  • [23] L. Zhu, C.-W. Fu, M. S. Brown, and P.-A. Heng, “A non-local low-rank framework for ultrasound speckle reduction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5650–5658.