I Introduction
In contrast to computed tomography (CT) and magnetic resonance imaging (MRI), ultrasound imaging (US) poses no radiation risks to the patient and enjoys fast acquisition time, while the hardware system is much simpler. Therefore, US is very useful for many clinical and portable diagnostic applications.
Unfortunately, US suffers from various imaging artifacts such as speckle noise from signal interference, image blur from limited bandwidth and aperture size, etc. For the last few decades, many researchers have proposed various model-based iterative algorithms to address these problems [1, 2, 3, 4]. While the results are impressive, one of the limitations of these model-based approaches is that a computationally expensive optimization problem must be solved anew for each measurement.
Recently, deep learning approaches have been widely used for various medical imaging problems, such as low-dose CT [5], accelerated MRI [6, 7], etc. In US, Yoon et al [8] proposed a convolutional neural network (CNN) approach to interpolate subsampled RF channel data. Deep learning-based beamformers have also been suggested as promising alternatives to the delay-and-sum (DAS) or adaptive beamformers
[9]. Furthermore, various US artifact removal algorithms have been implemented using deep neural networks [9, 10]. These deep learning approaches are inductive in the sense that the trained network can be used for other measurement data without an additional optimization procedure. Therefore, deep learning approaches are ideally suited for medical imaging problems, where fast reconstruction is critical.

In spite of these pioneering works, deep neural network approaches for US imaging still face several technical hurdles to their wide acceptance. First, US images are usually corrupted by different types of artifacts, and each user often prefers a distinct choice of artifact suppression algorithms depending on the clinical application. This implies that a large number of neural networks would have to be stored in a US scanner to deal with the various image artifacts and satisfy users' demands. Second, most current deep learning approaches are based on supervised training, which requires matched ground-truth artifact-free images; however, obtaining matched artifact-free images is challenging in ultrasound imaging.
In this paper, we propose a novel deep learning approach to overcome these fundamental limitations. In our method, a single neural network can be switched to address different types of image artifacts by simply changing the target mask vector. Moreover, our neural network is trained in an unsupervised manner so that it does not require any matched reference data for training.
In fact, this is inspired by the recent successes of multi-domain image transfer in the computer vision literature
[11, 12]. For example, Choi et al [11] proposed StarGAN, which uses only one generator with one discriminator to translate an image to multiple domains by utilizing a mask vector to indicate which domain to transfer to. However, as will be shown later, we found that the direct application of StarGAN is not necessary, since the US image artifact removal problem is a unidirectional translation problem. More specifically, while StarGAN should translate any domain to another one equally well as shown in Fig. 1(a), this is not necessary in US artifact removal problems, since DAS images are raw data obtained from the scanner and it is not necessary to regenerate DAS images from other domain images. Therefore, a correct image reconstruction pipeline should be asymmetric as shown in Fig. 1(b), where artifact-suppressed images (Domains 2 and 3) are generated only from DAS raw data (Domain 1). It turns out that the lack of symmetry makes the multi-domain translation task much simpler, thereby significantly reducing the network complexity and improving the performance.

Another important contribution of this work is a novel geometric insight that leads to the proposed network architecture. Specifically, inspired by the recent theory of optimal transport driven cycleGAN [13], we demonstrate that the multi-domain artifact removal problem can be formulated as an optimal transport problem between cascaded probability measures, where the goal is to minimize the statistical distances between the "push-forward" measures and the real data distributions. It turns out that if the Wasserstein-1 metric [14] and the Kullback-Leibler (KL) divergence are used as distances for continuous and discrete distributions, respectively, then the proposed multi-domain unsupervised image translation network emerges.
II Background
II-A Multi-Domain Image Translation
One of the earliest works in image translation is Pix2Pix
[15] based on conditional generative adversarial networks (cGAN). Pix2Pix was designed for various translation tasks with paired data, such as map to aerial photo, or sketch to photo, and so on [15]. To relieve the stringent requirement of paired data, Zhu et al [16] proposed an unsupervised image-translation technique called CycleGAN, which consists of two generators and two discriminators. Specifically, one generator is used to generate a target domain image and the other generator is used to regenerate the original domain image. Then, a cycle-consistency loss is used to impose consistency between the input image and the regenerated fake image from the second generator. CycleGAN has been successfully used as a promising unsupervised learning approach for various medical imaging problems, such as low-dose CT [17], etc.
However, the CycleGAN approach is not scalable in the sense that a total of N(N−1) CNN generators is required to translate between N domains [11]. To deal with this scalability issue of CycleGAN in multi-domain image translation, Choi et al [11] proposed the StarGAN approach, which performs multi-domain image translation with only one generator and one discriminator. The domains are distinguished with a mask vector, which is a one-hot vector. The target domain mask vector then guides the translation between different image domains. Moreover, the discriminator is designed not only to differentiate between real and fake, but also to classify which domain the image belongs to.
Extending this idea further, Collaborative GAN (CollaGAN) [12] was proposed to deal with multi-domain transfer from multiple inputs. In particular, CollaGAN uses multiple inputs to generate more realistic target domain images, by exploiting a multiple cycle-consistency loss. Similar to StarGAN, CollaGAN is also based on a single generator and discriminator.
One of the main assumptions in the aforementioned image translation tasks is that each domain is equally important, i.e. an image in any domain should be translated to another domain equally well regardless of the source and target domain combination. In fact, this turns out to be a quite stringent requirement in network training. As will be discussed later, in US artifact removal problems the raw data always belongs to one specific domain and the goal is to process the raw data using different types of algorithms. Therefore, the translation task is unidirectional, which provides an opportunity to relax this stringent constraint during network training.
II-B Optimal Transport Driven CycleGAN
As multi-domain image translation is based on the generative adversarial network (GAN) [11, 12], one may wonder whether the translated features are real or just cosmetic changes. In fact, this is an important issue in medical imaging applications, since any artificially created features can reduce the diagnostic accuracy of the medical images.
In our recent paper [13], we showed that the CycleGAN architecture can be rigorously derived from an optimal transport problem [14] that transports one probability distribution to another. Accordingly, we can safely assume that the generated images reflect the real image distribution if the networks are properly designed and trained. As our goal is to extend this to multi-domain translation, in the following we briefly review the OT-driven CycleGAN [13] for self-containment.

Optimal transport (OT) provides a mathematical means to operate between two probability measures [14]. Formally, we say that a map T : X → Y transports a probability measure (or distribution) μ on X to another measure ν on Y, if
ν(B) = μ(T⁻¹(B))  for all measurable sets B ⊂ Y,  (1)
which is often called the push-forward of the measure μ by T, and represented by ν = T_#μ.
Suppose there is a cost function c : X × Y → ℝ such that c(x, y) represents the cost of moving one unit of mass from x ∈ X to y ∈ Y. The OT problem [14] is then to find a transport map T that transports μ to ν at the minimum total transportation cost.
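The push-forward definition (1) can be checked numerically. The following sketch (with an illustrative map T(x) = x² and set B, not taken from the paper) verifies on samples that the mass the push-forward measure assigns to B equals the mass μ assigns to the preimage T⁻¹(B):

```python
import numpy as np

# Numerical illustration of the push-forward definition (1): samples drawn
# from mu are mapped through T, and the mass that the push-forward measure
# assigns to a set B equals the mass mu assigns to the preimage T^{-1}(B).
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)          # samples from mu = N(0, 1)

T = lambda s: s ** 2                  # illustrative transport map T(x) = x^2

B = (1.0, 4.0)                        # target set B = [1, 4]
mass_pushforward = np.mean((T(x) >= B[0]) & (T(x) <= B[1]))
# preimage T^{-1}(B) = [-2, -1] U [1, 2]
mass_preimage = np.mean((np.abs(x) >= 1.0) & (np.abs(x) <= 2.0))
```

Because T(x) ∈ [1, 4] holds exactly when |x| ∈ [1, 2], the two empirical masses coincide sample by sample.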
Now, we explain how optimal transport can lead to a CycleGAN architecture [13]. Suppose that the target image space X is equipped with a probability measure μ, whereas the original image space Y is equipped with a probability measure ν. In an unsupervised learning problem, there are no paired data, so the goal of unsupervised learning is to match the probability distributions rather than each individual sample. This can be done by finding transportation maps that transport the measure ν to μ, and vice versa.
More specifically, as shown in Fig. 2(a), the transportation from the measure space (Y, ν) to the measure space (X, μ) is done by a generator G_Θ, realized by a deep network parameterized with Θ. The generator pushes forward the measure ν in Y to a measure ν_Θ = G_{Θ#}ν in the target space X. Similarly, the transport from X to Y is performed by another neural network generator F, so that F pushes forward the measure μ in X to μ_F = F_#μ in the original space Y. Then, our goal is to derive optimal transport maps by minimizing the statistical distances between ν_Θ and μ, and between μ_F and ν, respectively.
One of the most important contributions of our work [13] is to show that if the statistical distance is measured by the Wasserstein-1 metric [14] and the distance minimization problem is solved using a dual formulation, a CycleGAN architecture emerges [13]. This gives an important mathematical foundation for CycleGAN.
III Problem Formulation
III-A DAS Image Artifacts
In US imaging, the preprocessed channel data is first obtained from the returned RF echoes:

y_l[n] = [x_{1,l}[n], x_{2,l}[n], ⋯, x_{J,l}[n]]^T,  (2)

where x_{j,l}[n] denotes the delay-corrected RF data at the j-th channel element, l and n denote the scan line index and the depth, and J is the aperture size. The delay-and-sum (DAS) beamformer output for the l-th scan line at the depth sample n can then be computed as

z_l[n] = (1/J) 1^T y_l[n],  (3)

where 1 denotes a J-dimensional column-vector of ones.
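The DAS operation in (3) reduces to a per-pixel average over the delay-corrected channels. A minimal sketch (array shapes are illustrative assumptions, not the scanner's actual data layout):

```python
import numpy as np

# Minimal sketch of the DAS beamformer in (3): after delay correction, the
# beamformed sample z_l[n] is the average of the J channel samples y_l[n].
J, L, N = 64, 96, 2048                   # channels, scan lines, depth samples
rng = np.random.default_rng(0)
y = rng.normal(size=(J, L, N))           # delay-corrected RF data x_{j,l}[n]

ones = np.ones(J)                        # the J-dimensional vector of ones in (3)
z = (ones @ y.reshape(J, -1)).reshape(L, N) / J   # (1/J) * 1^T y_l[n]
```

The matrix form (1/J) 1ᵀ y is, of course, just the channel mean at every (l, n).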
The DAS beamformer is designed to extract the low-frequency spatial content that corresponds to the energy within the main lobe; thus, a large number of receiver elements is often necessary in the DAS beamformer to improve the image quality by reducing the side lobes. Moreover, to calculate accurate time delays, transducers with sufficiently large bandwidth are required. Therefore, in many US systems, especially low-end or portable systems where the number of receiver elements is insufficient and the transducer bandwidth is limited, image quality degradation is unavoidable.
Another source of performance degradation in the DAS beamformer is the inaccuracy of the ray approximation of wave propagation: the underlying imaging physics is a wave phenomenon, and the deviation from the real physics introduces inevitable image artifacts.
Yet another image artifact often observed in DAS images is speckle noise. Speckle noise originates from small-sized scatterers that reflect the US energy. As DAS simply adds the return echoes after proper time delays, the constructive and destructive interference of these scattered echoes generates bright and dark spots in the image, which results in speckle noise. For example, for cardiac volume quantification using segmentation, speckle noise imposes significant algorithmic challenges, so proper suppression of the speckle noise is often necessary. That said, in some clinical applications such as liver imaging, the speckle statistics are an important measure for the diagnosis of disease progression, because they are highly sensitive to small variations in scatterers. Accordingly, speckle noise removal algorithms should be used judiciously depending on the specific application.
III-B Classical Image Processing Pipelines
In the following, we describe the classical model-based iterative methods in more detail.
III-B1 Deconvolution
In US, the received signal y is modeled as the convolution of a tissue reflectivity function (TRF) x with a point spread function (PSF) h, where the TRF represents the scatterers' acoustic properties, while the impulse response of the imaging system is modeled by the PSF. In most practical cases, complete knowledge of h is not available, and therefore both the unknown TRF and the PSF have to be estimated together. These types of problems are often called deconvolution problems. Deconvolution ultrasound may help in dealing with modeling inaccuracies and finite bandwidth issues, which eventually improves the spatial resolution of an ultrasound imaging system
[1, 2]. Mathematically, the deconvolution problem can be formulated as follows:
min_{x, h} ‖ y − h ∗ x ‖² + R(x),  (4)
where R(x) is a sparsity-imposing regularization term such as the ℓ1 norm, total variation (TV), etc. One strategy to address (4) is to estimate x and h jointly; another is to estimate them separately, i.e., first h is estimated from y, and then x is estimated based on the estimated h [2]. In this paper, the second strategy is used for the generation of the unmatched target data distribution, i.e. the images in Domain 2 in Fig. 1(b).
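The second strategy above — recovering the TRF once the PSF is known — can be sketched with a standard ℓ1-regularized least-squares solver. The following is a hedged toy implementation (the PSF, step size, and regularization weight are illustrative choices, not the paper's), using ISTA (iterative shrinkage-thresholding):

```python
import numpy as np

# Toy sketch of the second strategy in (4): with the PSF h already estimated,
# the TRF x is recovered from y = h * x + noise by solving an l1-regularized
# least-squares problem with ISTA.  All parameters here are illustrative.
def ista_deconv(y, h, lam=0.05, n_iter=200):
    H = lambda v: np.convolve(v, h, mode="same")        # forward model h * x
    Ht = lambda v: np.convolve(v, h[::-1], mode="same") # adjoint (correlation)
    step = 1.0 / (np.sum(np.abs(h)) ** 2)               # <= 1/||H||^2, safe step
    x = np.zeros_like(y)
    for _ in range(n_iter):
        g = x - step * Ht(H(x) - y)                     # gradient step
        x = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(0)
x_true = np.zeros(200)
x_true[[40, 90, 150]] = [1.0, -0.8, 0.6]                          # sparse TRF
h = np.exp(-0.5 * (np.arange(-5, 6) / 2.0) ** 2); h /= h.sum()    # Gaussian PSF
y = np.convolve(x_true, h, mode="same") + 0.01 * rng.normal(size=200)

x_hat = ista_deconv(y, h)   # sparse estimate with sharpened reflectors
```

The soft-thresholding step is exactly the proximal operator of the ℓ1 penalty R(x) = λ‖x‖₁ named in the text.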
III-B2 Despeckle
Speckle noise is one of the major causes of quality degradation, so "despeckle" (removal of the speckles) can improve the visual quality and subsequently enhance the structural details in US images [3]. In the recent past, a variety of reasonably good despeckling algorithms have been proposed for US imaging [3, 4]. One such algorithm was proposed by Zhu et al [3], based on the principle of non-local low-rank (NLLR) filtering.
In NLLR [3], a guidance image is first obtained as a reference. Then, using the guidance image, groups of similar patches are selected based on their similarity to the reference patch. For each group of similar patches, a low-rank + sparse decomposition [18] is then performed. The optimization problem is solved using the alternating direction method of multipliers (ADMM) [19]. Finally, the low-rank component is considered the speckle-free image, whereas the sparse component is considered speckle noise.
Although the algorithm provides an effective means of removing speckle, it is computationally expensive, so it cannot be used for routine US applications. In this paper, this algorithm is used to generate our target samples for despeckle images (i.e., the images in Domain 3 in Fig. 1(b)).
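The low-rank + sparse split at the core of NLLR can be illustrated with a generic robust-PCA-style ADMM scheme. This is a hedged sketch in the spirit of [18, 19], not the paper's NLLR solver: the patch-group matrix is synthetic, and the parameter choices follow common robust PCA practice rather than anything stated in the text:

```python
import numpy as np

# Sketch of a low-rank + sparse decomposition solved by ADMM (inexact
# augmented Lagrangian): a matrix Y of grouped similar patches is split as
# Y = L + S, with the low-rank L modeling the shared speckle-free content
# and the sparse S absorbing speckle outliers.
def lowrank_sparse(Y, n_iter=100):
    n, m = Y.shape
    lam = 1.0 / np.sqrt(max(n, m))            # standard robust-PCA weight
    mu = 1.25 / np.linalg.norm(Y, 2)          # penalty parameter
    S = np.zeros_like(Y); Z = np.zeros_like(Y)
    for _ in range(n_iter):
        # low-rank update: singular-value thresholding
        U, sig, Vt = np.linalg.svd(Y - S + Z / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # sparse update: entrywise soft thresholding
        R = Y - L + Z / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        Z += mu * (Y - L - S)                 # dual ascent on L + S = Y
        mu *= 1.05                            # gentle continuation
    return L, S

rng = np.random.default_rng(0)
L_true = np.outer(rng.normal(size=40), rng.normal(size=30))   # rank-1 "clean" patches
S_true = np.zeros((40, 30))
mask = rng.random((40, 30)) < 0.05
S_true[mask] = 5.0 * rng.normal(size=mask.sum())              # sparse outliers

L_hat, S_hat = lowrank_sparse(L_true + S_true)
err = np.linalg.norm(L_hat - L_true) / np.linalg.norm(L_true)
```

The per-iteration SVD is what makes this class of methods expensive, consistent with the computational cost reported for NLLR in Section VI-C.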
III-C Multi-Domain Unsupervised Artifact Removal
In this paper, we are interested in a single neural network that can remove both blur artifacts and speckle noise from the DAS images in a real-time manner. More specifically, in Fig. 1(b), Domain 1 contains the DAS input images, and Domains 2 and 3 correspond to the deconvolution and despeckle image domains, respectively. Since the use of deconvolution or despeckle images depends on the specific clinical application, our goal is to design a neural network that can be switched to process different types of artifacts by merely changing the target domain mask vector.
However, if the existing CycleGAN is used for this application, two separate pairs of neural networks are required, one for each artifact type as shown in Fig. 1(c), which imposes an additional burden on the US scanner.
IV Main Contribution
IV-A Geometric Formulation
The geometry of our problem is illustrated in Fig. 2(b), which is similar to that of CycleGAN in Fig. 2(a) but with important differences.
Here, the original image space Y is the DAS image domain, whose probability distribution is ν, whereas the target image space X contains both despeckle images and deconvolution images, and their distribution follows the probability measure μ. One key difference from the CycleGAN geometry in Fig. 2(a) is that the target distribution μ is modeled as a mixture distribution of despeckle and deconvolution images, which can be separated by a classifier C that maps each sample in X to a discrete domain label in Z with a discrete measure γ.
Thus, our goal is to find a generator G_Θ that transports ν to a probability measure ν_Θ which is close to μ and still enables correct classification by C. Accordingly, G_Θ and C are dependent on each other, so they should be optimized simultaneously. In fact, this corresponds to an optimal transport problem between cascaded measures. On the other hand, the generator F that transports μ to μ_F should transport the whole mixture distribution to the probability distribution ν in the original space Y.
Similar to the optimal transport driven CycleGAN [13], in our method the statistical distance is measured using the Wasserstein-1 metric. More specifically, for the choice of the metric d(x, x′) = ‖x − x′‖ in X, the Wasserstein-1 distance between μ and ν_Θ can be computed by:

W₁(μ, ν_Θ) = inf_{π ∈ Π(μ, ν)} ∫_{X×Y} ‖x − G_Θ(y)‖ dπ(x, y),  (5)

where Π(μ, ν) denotes the set of joint distributions whose marginal distributions are μ and ν. Similarly, the Wasserstein-1 distance between ν and μ_F is given by

W₁(ν, μ_F) = inf_{π ∈ Π(μ, ν)} ∫_{X×Y} ‖y − F(x)‖ dπ(x, y).  (6)

Since the optimal joint distributions minimizing (5) and (6) could be different if they are minimized separately, a better way of finding the transportation maps is to minimize them together with the same joint distribution π:

inf_{π ∈ Π(μ, ν)} ∫_{X×Y} ( ‖x − G_Θ(y)‖ + ‖y − F(x)‖ ) dπ(x, y).  (7)

This can be considered as the sum of the statistical distances in both spaces.
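For intuition about the Wasserstein-1 metric in (5), a small 1-D example helps: for two empirical distributions with equally many samples, the optimal plan matches sorted samples, so W₁ reduces to the mean absolute difference of order statistics. This is a generic illustration, not part of the paper's method:

```python
import numpy as np

# 1-D illustration of the Wasserstein-1 distance: with equal sample counts
# the optimal transport plan pairs sorted samples, so W1 is the mean
# absolute difference of the order statistics.
rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, size=50_000)
b = rng.normal(loc=2.0, size=50_000)   # same shape, shifted by 2

w1 = np.mean(np.abs(np.sort(a) - np.sort(b)))
# For a pure shift, the optimal cost is the shift itself, so w1 is close to 2.
```

In the image setting of (5)-(7) no such closed form exists, which is why the dual (discriminator-based) formulation of Section IV-B is used instead.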
Next, to enable multi-domain transfer, the measures in the target space X should be transported to the discrete label space Z with the discrete measure γ. As we have two distributions μ and ν_Θ in X, the corresponding push-forward measures by the classifier C are given by C_#μ and C_#ν_Θ, respectively (see Fig. 2(b)). Then, the classifier design problem can again be addressed by minimizing statistical distances, now in Z.
Since the classification label is discrete (in our case, binary), one of the simplest ways of measuring the statistical distance in Z is the Kullback-Leibler (KL) divergence. Specifically, the KL divergence between γ and the push-forward measure C_#μ reduces, for one-hot labels, to the cross-entropy:

D_KL(γ ‖ C_#μ) = E_{x∼μ}[ −p(x)ᵀ log C(x) ],

where the output of the classifier C(x) is the probability distribution indicating the probability of belonging to each target domain, and p(x) is the one-hot vector encoded true label probability for x ∈ X. For example, in the conversion of a DAS image to the two target domains (deconvolution and despeckle), p(x) is a discrete probability mass function which is [1, 0]ᵀ for the deconvolution target, and [0, 1]ᵀ for the despeckle target.
Similarly, the KL divergence between γ and the push-forward measure C_#ν_Θ is given by:

D_KL(γ ‖ C_#ν_Θ) = E_{y∼ν}[ −q(y)ᵀ log C(G_Θ(y)) ],  (8)

where q(y) is the one-hot vector encoded target label of the generator for a given DAS image y ∈ Y.
Putting these together, the overall loss becomes

inf_{π ∈ Π(μ, ν)} ∫_{X×Y} ( ‖x − G_Θ(y)‖ + ‖y − F(x)‖ ) dπ(x, y) + η ( D_KL(γ ‖ C_#μ) + D_KL(γ ‖ C_#ν_Θ) ),  (9)

where η is the weighting parameter for the classifier.
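With one-hot labels, the discrete KL terms in (8)-(9) reduce to the familiar cross-entropy averaged over samples, since the entropy of a one-hot distribution is zero. A minimal sketch (the classifier outputs below are made-up softmax probabilities for illustration):

```python
import numpy as np

# With one-hot labels p, D_KL(p || c) = sum p log(p/c) = -p^T log c,
# i.e. the usual cross-entropy classification loss.
def kl_class_loss(p_onehot, c_prob, eps=1e-12):
    return float(np.mean(-np.sum(p_onehot * np.log(c_prob + eps), axis=1)))

p = np.array([[1.0, 0.0],    # deconvolution target
              [0.0, 1.0]])   # despeckle target
c = np.array([[0.9, 0.1],    # classifier fairly sure: deconvolution
              [0.2, 0.8]])   # classifier fairly sure: despeckle

loss = kl_class_loss(p, c)   # = (-log 0.9 - log 0.8) / 2
```

This is exactly the classification loss that couples the classifier C to the generator G_Θ via C(G_Θ(y)) in (8).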
IV-B Proposed Method
Unfortunately, direct minimization of the loss in (9) is difficult, especially due to the nonparametric minimization with respect to the joint distribution π. To address this, one of the most important contributions of our companion paper [13] is to show that the following primal problem

min_{Θ, F} inf_{π ∈ Π(μ, ν)} ∫_{X×Y} ( ‖x − G_Θ(y)‖ + ‖y − F(x)‖ ) dπ(x, y)  (10)

is equivalent to the following dual formulation:

min_{Θ, F} max_{φ, ψ} ℓ(Θ, F; φ, ψ),  (11)

where

ℓ(Θ, F; φ, ψ) = λ ℓ_cycle(Θ, F) + ℓ_disc(Θ, F; φ, ψ),  (12)

where λ > 0 is a hyperparameter, the cycle-consistency term is given by

ℓ_cycle(Θ, F) = E_{y∼ν}[ ‖y − F(G_Θ(y))‖ ] + E_{x∼μ}[ ‖x − G_Θ(F(x))‖ ],

whereas the discriminator term is

ℓ_disc(Θ, F; φ, ψ) = E_{x∼μ}[ φ(x) ] − E_{y∼ν}[ φ(G_Θ(y)) ] + E_{y∼ν}[ ψ(y) ] − E_{x∼μ}[ ψ(F(x)) ].

Here, the Kantorovich potentials φ and ψ should satisfy the 1-Lipschitz condition (i.e. |φ(x) − φ(x′)| ≤ ‖x − x′‖ and |ψ(y) − ψ(y′)| ≤ ‖y − y′‖). To impose the 1-Lipschitz condition, the gradient of each Kantorovich potential is constrained to be 1 at interpolated samples [20]:

ℓ_GP = E[ ( ‖∇φ(x̃)‖ − 1 )² ] + E[ ( ‖∇ψ(ỹ)‖ − 1 )² ],

where x̃ = εx + (1 − ε)G_Θ(y) and ỹ = εy + (1 − ε)F(x), with ε being a random variable from the uniform distribution on [0, 1] [20]. By collecting all terms together, our multi-domain artifact removal problem can be formulated as

min_{Θ, F, C} max_{φ, ψ} ℓ_total(Θ, F, C; φ, ψ),  (13)

where

ℓ_total = λ ℓ_cycle(Θ, F) + ℓ_disc(Θ, F; φ, ψ) + η ℓ_class(C; Θ) + β ℓ_GP,

where λ, η, and β are weighting parameters chosen according to the relative importance of each term, and fixed values were used throughout this paper.
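The gradient penalty ℓ_GP above can be sketched numerically. In the toy example below the potential φ is linear, so its gradient is known in closed form; a real implementation would differentiate the discriminator network with automatic differentiation instead. All shapes and values are illustrative:

```python
import numpy as np

# Sketch of the WGAN gradient penalty enforcing the 1-Lipschitz condition:
# the gradient norm of phi is pushed toward 1 at samples interpolated
# between real and generated data.  phi(x) = w^T x here, so grad phi = w.
rng = np.random.default_rng(0)
x_real = rng.normal(size=(16, 8))
x_fake = rng.normal(size=(16, 8))

eps = rng.uniform(size=(16, 1))                  # epsilon ~ U[0, 1]
x_int = eps * x_real + (1.0 - eps) * x_fake      # interpolated samples x~

w = np.full(8, 3.0 / np.sqrt(8.0))               # ||grad phi|| = 3 everywhere
grad = np.tile(w, (16, 1))                       # gradient of phi at every x_int
gp = np.mean((np.linalg.norm(grad, axis=1) - 1.0) ** 2)   # penalty = (3 - 1)^2
```

A potential with gradient norm 3 violates the 1-Lipschitz constraint, and the penalty correctly evaluates to 4, which the optimizer would drive down toward 0.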
IV-C Neural Network Implementation
Fig. 3 illustrates the proposed multi-domain unsupervised artifact removal network derived from the optimization problem in (13). The network architecture consists of two generators. The main generator G_Θ converts DAS images to either deconvolution or despeckled images. The second generator F returns the deconvolution or despeckle images to the original DAS domain. Note that once the networks are trained, only G_Θ is used at the inference phase.
Here, care should be taken due to the existence of the classifier. More specifically, in calculating the KL divergence in (8), the target domain of G_Θ must be indicated. Therefore, the main idea is to indicate the target domain using an input mask vector m:

m ∈ { [1, 0]ᵀ, [0, 1]ᵀ }.  (14)

For example, a DAS input image with the mask vector m = [1, 0]ᵀ creates the deconvolution image, whereas an input DAS image with the mask vector m = [0, 1]ᵀ is converted to a despeckled image. The mask vector is augmented along the channel direction with the input image. On the other hand, F does not need a mask vector since no classifier acts on F.
Accordingly, although the same generator architecture shown in Fig. 4(a) is used for both G_Θ and F, the numbers of input channels differ. For G_Θ, the number of input channels is 3, composed of the DAS image and two mask vector channels, whereas F takes a single input channel.
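The channel-direction augmentation described above can be sketched as follows; the one-hot mask vector is broadcast to two constant feature maps and stacked with the single-channel DAS image to form the 3-channel input of G_Θ. Shapes are illustrative assumptions:

```python
import numpy as np

# Mask-vector conditioning sketch: the one-hot target-domain vector is
# broadcast to constant channels and stacked with the DAS image.
def with_mask(das_image, mask):
    h, w = das_image.shape
    mask_maps = [np.full((h, w), m) for m in mask]      # one map per mask entry
    return np.stack([das_image, *mask_maps], axis=-1)   # H x W x 3 input

das = np.zeros((64, 96))
x_deconv = with_mask(das, (1, 0))     # ask for the deconvolution output
x_despeckle = with_mask(das, (0, 1))  # ask for the despeckle output
```

Switching the artifact-removal behavior at inference time then amounts to nothing more than swapping the mask channels of the same trained network.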
Additionally, there are two discriminators φ and ψ, as shown in Fig. 3. Specifically, φ tries to distinguish the true target images from the generated images G_Θ(y), whereas ψ attempts to detect the fake measurement data generated by the synthetic measurement procedure F.
Finally, we have the domain classifier C to distinguish between deconvolution and despeckled images. In fact, the domain classifier and the discriminator are both classifiers, so their structures share many commonalities. Therefore, as shown in Fig. 4(b), the discriminator and the classifier are implemented using the same network architecture with two output heads, one for the domain classifier and one for the discriminator.
V Methods
V-A Dataset
We used in vivo and phantom images to train the network. The in vivo images are focused B-mode images acquired using a linear probe (L3-12H) on the E-CUBE 12R US system (Alpinion, Korea). The images were measured from 10 volunteers at four parts of the carotid and thyroid areas, with 10 temporal frames per body part. The phantom images are of tissue-mimicking phantoms, also acquired with focused B-mode imaging using the same linear probe. Both sets of images were taken at a center frequency of 8.5 MHz. Each image is captured with 64 receive channels and 96 scan lines. The details of the probe are listed in Table I. We selected data from 8 volunteers, and 304 in vivo images and 125 phantom images were used for the training data set.
To create the deconvolution images in Domain 2, we used the joint sparse deconvolution model, which is an iterative method along the axial direction with a lateral direction constraint (Sec. III-B1). We also used the non-local low-rank (NLLR) algorithm to generate the despeckle images for Domain 3 (Sec. III-B2). We used 429 deconvolution images and 478 despeckle images as target domain images to train the network. In addition, we randomly and independently shuffled the input dataset and the target dataset at each epoch in order to train the network in an unpaired manner.
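The independent per-epoch shuffling described above is what makes the training unpaired: input and target pools are permuted separately, so a batch never contains matched input/target pairs. A minimal sketch (pool sizes follow the text; the tensors themselves are placeholders indexed by position):

```python
import numpy as np

# Unpaired batch sampling sketch: DAS inputs and target-domain images are
# shuffled independently each epoch, so batches are never matched pairs.
def unpaired_epoch(rng, n_input, n_target, batch=4):
    a = rng.permutation(n_input)     # independent shuffles -> unpaired batches
    b = rng.permutation(n_target)
    n = min(n_input, n_target)
    return [(a[i:i + batch], b[i:i + batch]) for i in range(0, n - batch + 1, batch)]

rng = np.random.default_rng(0)
das_pool = np.arange(429)            # indices into the 429 DAS training images
target_pool = np.arange(478)         # indices into the 478 despeckle targets

batches = unpaired_epoch(rng, len(das_pool), len(target_pool))
```

With 429 inputs and a batch size of 4 this yields 107 unpaired batches per epoch.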
TABLE I: Probe and acquisition configuration

Parameter             Configuration
Probe model           L3-12H
Center frequency      8.5 MHz
Sampling frequency    40 MHz
Wave mode             Focused
Elements              192
Tx elements           128
Transmission events   96
Rx elements           64
Element pitch         0.2 mm
Element width         0.14 mm
V-B Network Architecture
As shown in Fig. 4, we employed the generator architecture used in [16, 11]. It is composed of two down-sampling steps, six residual blocks, and two up-sampling steps. Convolution with stride 2 is utilized for the down-sampling operation, and each residual block has two convolution steps. A transposed convolution layer with stride 2 is used for the up-sampling operation. In the generator, we used instance normalization instead of batch normalization. A hyperbolic tangent (tanh) nonlinearity is applied in the last layer to stabilize the training process.
The discriminator is shown in Fig. 4(b). We used a PatchGAN discriminator, which classifies whether each local image patch is real or fake. It is simply composed of six convolution blocks with 4×4 kernels, stride 2, and LeakyReLU activations, except for the last layer. The last layer has a multi-head structure: one head is for domain classification and the other is for real/fake classification.
V-C Training Details
V-C1 Implementation Details
To train the network, we used the Adam optimizer. The learning rate started from 1e-4 and decreased linearly after 500 epochs. The total number of epochs is 1000, and the batch size is 4. To avoid overfitting, we used data augmentation techniques such as flipping, rotation, and random scaling. We implemented everything in Python with TensorFlow and in MATLAB 2017a, and trained the network on an NVIDIA GeForce GTX 1080 Ti GPU.
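The augmentation pipeline mentioned above can be sketched as follows. The exact augmentation parameters are not specified in the text, so the flip probability, rotation angles, and scaling range below are illustrative assumptions (scaling is applied to intensity here for simplicity):

```python
import numpy as np

# Sketch of training-time augmentation: random flips, 90-degree rotations,
# and random scaling.  All parameter choices are illustrative.
def augment(img, rng):
    if rng.random() < 0.5:
        img = np.fliplr(img)                      # random horizontal flip
    img = np.rot90(img, k=int(rng.integers(0, 4)))  # random 90-degree rotation
    scale = rng.uniform(0.9, 1.1)                 # random scaling
    return img * scale

rng = np.random.default_rng(0)
img = rng.normal(size=(64, 96))
aug = augment(img, rng)   # augmented view of the same image
```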
V-C2 Evaluation Metrics
To evaluate our results, we used the peak signal-to-noise ratio (PSNR), the structural similarity (SSIM) [21], the contrast-to-noise ratio (CNR), the generalized CNR (GCNR) [22], and the contrast ratio (CR). The PSNR value is calculated between the label image x and the output image x̂. For images of size N × M, it is given by

PSNR = 10 log₁₀ ( NM · MAX² / ‖x − x̂‖² ),  (15)

where MAX denotes the maximum value of x and ‖·‖ is the ℓ2 norm. The SSIM value is defined by [21]

SSIM = (2 μ_x μ_x̂ + c₁)(2 σ_{x x̂} + c₂) / ( (μ_x² + μ_x̂² + c₁)(σ_x² + σ_x̂² + c₂) ),  (16)

where μ_x and μ_x̂ denote the mean values of x and x̂, σ_x and σ_x̂ their standard deviations, and σ_{x x̂} their covariance, respectively. The constants are c₁ = (k₁ L)² and c₂ = (k₂ L)², where L is the dynamic range of the pixel values; in this paper, k₁ and k₂ are 0.01 and 0.03, respectively. The CNR value quantifies the image contrast and is defined as

CNR = |μ_S − μ_B| / √(σ_S² + σ_B²),  (17)

where μ_S, σ_S and μ_B, σ_B denote the mean and standard deviation values of the structure and the background, respectively. The GCNR value is a generalization of the CNR [22]. It measures the overlap area between the intensity distributions of the two regions:

GCNR = 1 − ∫ min{ p_S(v), p_B(v) } dv,  (18)

where p_S and p_B are the probability distributions of intensities in the structure and background, respectively. The CR value is used to compare the contrast without the influence of noise:

CR = 20 log₁₀ ( μ_S / μ_B ),  (19)

where μ_S and μ_B denote the mean values of the structure and background, respectively.
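Hedged reference implementations of three of these metrics, written directly from the definitions in (15), (17), and (18); the synthetic "structure" and "background" regions below are illustrative, and the GCNR overlap is approximated with a shared-bin histogram:

```python
import numpy as np

# PSNR, CNR, and GCNR computed straight from their definitions.
def psnr(x, y):
    mse = np.mean((x - y) ** 2)
    return 10.0 * np.log10(x.max() ** 2 / mse)

def cnr(s, b):
    return abs(s.mean() - b.mean()) / np.sqrt(s.var() + b.var())

def gcnr(s, b, bins=100):
    lo, hi = min(s.min(), b.min()), max(s.max(), b.max())
    ps, _ = np.histogram(s, bins=bins, range=(lo, hi), density=True)
    pb, edges = np.histogram(b, bins=bins, range=(lo, hi), density=True)
    overlap = np.sum(np.minimum(ps, pb)) * (edges[1] - edges[0])
    return 1.0 - overlap   # 1 minus the histogram overlap area

rng = np.random.default_rng(0)
structure = rng.normal(2.0, 1.0, 10_000)    # synthetic bright region
background = rng.normal(0.0, 1.0, 10_000)   # synthetic background

c = cnr(structure, background)      # analytically 2 / sqrt(2) for these regions
g = gcnr(structure, background)     # in [0, 1]; higher means better separated
```

Unlike CNR, the GCNR in (18) is bounded by 1 and invariant to monotonic intensity transforms, which is why it is reported alongside CNR in the tables.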
V-D Baseline Algorithms
For the comparative study, the following algorithms are used as baselines: 1) supervised learning, 2) CycleGAN [16], and 3) StarGAN [11]. For supervised learning, the same generator architecture was trained using a paired data set. We normalized the image intensities to the range of −1 to 1 and used data augmentation techniques such as flipping, rotation, and random scaling. We trained the network with the Adam optimizer with a learning rate of 1e-4 and a batch size of 4.
For CycleGAN training, we used the same generator architecture as the proposed method. The discriminator also has the same architecture except for the last layer, as there is no domain classification loss in CycleGAN. We also used the WGAN with gradient penalty technique [20], similar to the proposed method. Unlike supervised learning, no paired dataset is used for CycleGAN training. The two weighting parameters were set to 30 and 0.5, respectively. We normalized the image intensities to the range of −1 to 1, and used data augmentation techniques such as flipping, rotation, and random scaling. We trained the network with the Adam optimizer with a learning rate decreasing linearly from 1e-3 to 1e-4 over the training epochs. The total number of epochs is 200 and the batch size is 4.
StarGAN also used the same generator and discriminator architectures as our method except for the kernel size, and likewise used the WGAN with gradient penalty technique [20]. Both StarGAN and the proposed method use mask vectors to select the target domain, so each target domain image can be generated by simply changing the mask vector. However, in contrast to our method, StarGAN also utilizes the same generator for the returning path with a mask vector, which increases the difficulty of training. For StarGAN, the three weighting parameters were set to 30, 1, and 10, respectively. We normalized all images to the range of −1 to 1. Flipping, rotation, and random scaling were also used for data augmentation. We trained it with the Adam optimizer with a learning rate of 1e-4, decreasing linearly from the second half of the total epochs. The total number of epochs is 500 and the batch size is 4.
VI Experimental Results
To verify the performance of the proposed method, we validated the algorithm both qualitatively and quantitatively.
VI-A Qualitative Results
Fig. 5 shows the comparison results of the various algorithms on in vivo images from the left and right thyroid. As shown in the figure, all the deconvolution output images show improved contrast. Specifically, in the magnified images, it is easy to recognize that the deconvolution images have sharper structures. While the StarGAN output is only slightly improved compared to the input image, the output from the proposed method is closer to the target image. Moreover, the side-by-side comparison with the supervised and CycleGAN approaches shows that the proposed method provides qualitatively comparable results. It is remarkable that the deconvolution and despeckled images for supervised learning and CycleGAN are generated by independent networks trained separately, whereas the proposed method generates both images with a single generator.
The despeckle images in Fig. 5 also show the noise suppression. After removing the speckle noise, the boundaries of structures become smoother; however, as shown in Fig. 5, the proposed method preserves the structure boundaries better than StarGAN. We also applied our method to tissue-mimicking phantoms, one anechoic and the other hyperechoic. The results are shown in Fig. 6. The deconvolved images in both cases show relatively high contrast compared to the input images. Moreover, the granular pattern is reduced and the boundaries of structures are well maintained by the proposed method.
TABLE II: Quantitative comparison for the deconvolution task

                        in vivo                         Phantom
Algorithm    PSNR   SSIM  CNR   GCNR  CR      PSNR   SSIM  CNR   GCNR  CR
Input        17.15  0.68  0.82  0.52  7.97    15.33  0.74  1.25  0.60  11.41
Label        --     --    0.74  0.53  9.15    --     --    1.13  0.56  12.24
Supervised   24.35  0.79  0.79  0.55  9.13    21.55  0.82  1.26  0.60  14.01
CycleGAN     22.93  0.75  0.85  0.56  9.87    19.62  0.76  1.19  0.59  13.75
StarGAN      20.85  0.74  0.73  0.50  7.18    20.02  0.80  1.17  0.57  12.38
Proposed     23.86  0.79  0.76  0.53  8.98    22.20  0.84  1.19  0.58  13.03
TABLE III: Quantitative comparison for the despeckle task

                        in vivo                         Phantom
Algorithm    PSNR   SSIM  CNR   GCNR  CR      PSNR   SSIM  CNR   GCNR  CR
Input        26.99  0.65  0.82  0.52  7.97    26.64  0.66  1.25  0.60  11.41
Label        --     --    1.02  0.62  7.52    --     --    1.46  0.65  10.79
Supervised   35.58  0.93  1.04  0.62  7.29    29.64  0.90  1.49  0.67  11.98
CycleGAN     28.27  0.86  1.02  0.61  7.08    25.72  0.83  1.51  0.67  13.83
StarGAN      30.34  0.89  0.98  0.59  7.29    25.57  0.86  1.47  0.67  12.55
Proposed     31.78  0.91  1.02  0.61  7.86    30.11  0.89  1.49  0.67  12.52
VI-B Quantitative Results
Quantitative comparison with the other methods was performed using the PSNR, SSIM, CNR, GCNR, and CR values. All test data are focused B-mode images acquired with the linear probe at an 8.5 MHz center frequency. The test dataset is composed of in vivo and phantom images. The in vivo dataset consists of 80 frames from four body parts of two subjects. The phantom dataset contains 16 images in total, consisting of 10 anechoic and 6 hyperechoic phantom images.
Fig. 7 shows the distribution of each quantitative metric. The PSNR and SSIM values show significant improvement for all algorithms. As expected, the PSNR and SSIM values of supervised learning were the best, but it is remarkable that the proposed method is superior to the CycleGAN and StarGAN results in all cases.
The average values over all images are reported in Table II and Table III. When deconvolution is the target domain, the PSNR value of our method shows 6.71 dB and 6.87 dB improvements over the original DAS images in the in vivo and phantom cases, respectively. The SSIM value also gained 0.11 and 0.10 units, respectively. Similarly, when the target domain is the despeckle images, the PSNR and SSIM values show large improvements for both in vivo and phantom images.
In Table II, the CNR and GCNR values of our method for the deconvolution task are slightly decreased compared to the DAS input. This is because the deconvolution process also enhances the noise component. On the other hand, the CR values, which are less sensitive to noise boosting, are increased. The results from the proposed method show CR value enhancements of about 1.01 dB and 1.62 dB for in vivo and phantom images, respectively.
For the despeckle task, on the other hand, the CNR and GCNR values of our method increase significantly. Table III shows that our results improve the CNR by 0.2 and 0.24 for the in vivo and phantom images, respectively. Similarly, the GCNR value increases by about 0.09 and 0.07.
VI-C Computational Time
The average computational times for the classical deconvolution model [1], the NLLR despeckle method [23], and our method are 3479.3 sec, 435.51 sec, and 0.16 sec, respectively. The computational times of supervised learning, CycleGAN, and StarGAN are essentially the same as ours.
In the conventional deconvolution algorithm [1], the point spread function (PSF) must be estimated to produce the deconvolved image. The PSF estimation and the joint sparse recovery steps are computationally expensive because they are iterative. Likewise, the non-local low-rank (NLLR) despeckle approach is iterative, so it carries a heavy computational burden for each reconstructed image [23]. In contrast, our method is based on a deep neural network, so once the network is trained, artifact-free images can be generated in real time. In particular, the proposed method can generate the deconvolved and despeckled images at the same time.
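The source of this runtime gap is the iteration count: a model-based method repeats an expensive update for every new measurement, whereas a trained generator applies one fixed forward pass. A toy NumPy sketch of that contrast (an FFT-domain gradient-descent deconvolver versus a single fixed filtering pass standing in for the network; this is an illustration of the cost structure, not the actual algorithms of [1] or [23]):

```python
import numpy as np

def iterative_deconv(y, psf, n_iter=200, lr=0.9):
    """Toy model-based deconvolution: gradient descent on ||h * x - y||^2
    in the Fourier domain. Cost grows linearly with n_iter, as in
    classical iterative reconstruction."""
    H = np.fft.fft2(psf, s=y.shape)
    Y = np.fft.fft2(y)
    X = Y.copy()
    for _ in range(n_iter):
        X -= lr * np.conj(H) * (H * X - Y)  # one gradient step per iteration
    return np.real(np.fft.ifft2(X))

def network_pass(y, learned_filter):
    """Stand-in for a trained generator: a single fixed filtering pass,
    so inference cost is constant regardless of the measurement."""
    return np.real(np.fft.ifft2(np.fft.fft2(y) * np.fft.fft2(learned_filter, s=y.shape)))
```

With a normalized PSF the gradient step is contractive at every frequency, so the iterative solver converges, but only after many passes over the data; the network-style path touches the image exactly once.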
VII Discussion
VII-A Generalization
To verify the generalization power of the proposed algorithm, we tested our method on various types of datasets, all acquired differently from the training data.
VII-A1 Plane-wave mode
Although we trained the network with B-mode images acquired in focused mode, we tested it on plane-wave mode images to show that the proposed method is independent of the acquisition mode. The plane-wave images were acquired with a linear array probe (L3-12H) at an 8.5 MHz center frequency, and 31 plane waves were used to generate the final images.
As shown in Fig. 8, the proposed network successfully deconvolves and despeckles the input images. The deconvolved images have better contrast than the input DAS images, and our method also suppresses the speckle noise of the plane-wave mode images well. The improvement is prominent in the quantitative comparison: when the target domain is the deconvolution image, the three images gain 5.73 dB, 1.73 dB, and 3.08 dB in the CR value, and the despeckled images show a 0.08 gain in the GCNR value.
VII-A2 Different operating frequency
The training dataset is composed of B-mode focused images with an operating frequency of 8.5 MHz. In this experiment, we demonstrate the generalization power at an operating frequency of 10.0 MHz. Moreover, the test data come from entirely different body parts: the forearm and calf muscles.
Fig. 9 shows the results. It can easily be seen that the proposed method removes the artifacts well. Structural details become prominent in the deconvolved image, and the granular noise is reduced while the structure boundaries are maintained.
VII-B Extension beyond two domains
In the experiments above, we removed two types of artifacts. In this section, we extend the set of artifacts to deal with; that is, we increase the number of target domains from two to three. The first target domain is deconvolution, the second is despeckle, and the third is deconvolution-despeckle. To check the effect of this domain extension, we carried out the following two experiments. Note that, except for the mask vector, all other procedures are the same as in the main experiment.

1) Two labels for three target domains: the mask vector [1,0] is for deconvolution, [0,1] for despeckle, and [1,1] for deconvolution-despeckle.

2) Three labels for three target domains: the mask vector [1,0,0] is for deconvolution, [0,1,0] for despeckle, and [0,0,1] for deconvolution-despeckle.
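The text does not spell out how the mask vector enters the generator; a common conditioning scheme (the one used by StarGAN [11]) tiles each mask entry into a constant channel and concatenates it with the input image, so a single generator can be steered between target domains. A minimal NumPy sketch under that assumption (the function name and shapes are illustrative):

```python
import numpy as np

def with_mask_channels(image, mask):
    """Tile a binary mask vector into constant-valued channels and stack
    them onto a single-channel B-mode image, producing the conditioned
    input (1 + len(mask), H, W) that a single generator would consume."""
    h, w = image.shape
    mask_maps = [np.full((h, w), m, dtype=image.dtype) for m in mask]
    return np.stack([image, *mask_maps], axis=0)
```

Under the two-label scheme above, `with_mask_channels(img, [1, 0])` requests deconvolution and `with_mask_channels(img, [1, 1])` requests deconvolution-despeckle; switching to the three-label scheme only changes the length of the mask, not the mechanism.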
As shown in Fig. 10, our method works well when the number of target domains is increased. The deconvolved images have better contrast, the speckle noise is well suppressed in the despeckled images, and the deconvolution-despeckle images also show reasonable results. The two types of masks described above generate the three distinct domains similarly. Therefore, we conclude that the extension to more than two domains can be done easily, without concern about the growth of the mask vector channels.
VIII Conclusion
Due to transducer limitations and wave interference, ultrasound images suffer from many artifacts. To overcome this problem, we proposed a multi-domain image artifact removal method based on optimal transport theory. Through extensive experiments on in vivo and phantom datasets, we confirmed that a single generator can produce multiple target domains very well, both qualitatively and quantitatively. As the proposed method can process images instantaneously without increasing the number of models, we believe that it can be used in practical US systems.
References
 [1] J. Duan, H. Zhong, B. Jing, S. Zhang, and M. Wan, “Increasing axial resolution of ultrasonic imaging with a joint sparse representation model,” IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, vol. 63, no. 12, pp. 2045–2056, 2016.
 [2] R. Jirik and T. Taxt, “Two-dimensional blind Bayesian deconvolution of medical ultrasound images,” IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, vol. 55, no. 10, pp. 2140–2153, 2008.

 [3] L. Zhu, C.-W. Fu, M. S. Brown, and P.-A. Heng, “A non-local low-rank framework for ultrasound speckle reduction,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 493–501.
 [4] P. Coupé, P. Hellier, C. Kervrann, and C. Barillot, “Nonlocal means-based speckle filtering for ultrasound images,” IEEE Transactions on Image Processing, vol. 18, no. 10, pp. 2221–2229, 2009.
 [5] E. Kang, J. Min, and J. C. Ye, “A deep convolutional neural network using directional wavelets for low-dose X-ray CT reconstruction,” Medical Physics, vol. 44, no. 10, pp. e360–e375, 2017.
 [6] J. Schlemper, J. Caballero, J. V. Hajnal, A. N. Price, and D. Rueckert, “A deep cascade of convolutional neural networks for dynamic MR image reconstruction,” IEEE Transactions on Medical Imaging, vol. 37, no. 2, pp. 491–503, 2018.
 [7] J. Lee, Y. Han, J.-K. Ryu, J.-Y. Park, and J. C. Ye, “k-space deep learning for reference-free EPI ghost correction,” Magnetic Resonance in Medicine, vol. 82, no. 6, pp. 2299–2313, 2019.
 [8] Y. H. Yoon, S. Khan, J. Huh, and J. C. Ye, “Efficient B-mode ultrasound image reconstruction from sub-sampled RF data using deep learning,” IEEE Transactions on Medical Imaging, vol. 38, no. 2, pp. 325–336, 2019.
 [9] S. Khan, J. Huh, and J. C. Ye, “Adaptive and compressive beamforming using deep learning for medical ultrasound,” IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 2020.
 [10] P. Kokil and S. Sudharson, “Despeckling of clinical ultrasound images using deep residual learning,” Computer Methods and Programs in Biomedicine, p. 105477, 2020.
 [11] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8789–8797.

 [12] D. Lee, J. Kim, W.-J. Moon, and J. C. Ye, “CollaGAN: Collaborative GAN for missing image data imputation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2487–2496.
 [13] B. Sim, G. Oh, S. Lim, and J. C. Ye, “Optimal transport, CycleGAN, and penalized LS for unsupervised learning in inverse problems,” arXiv preprint arXiv:1909.12116, 2019.
 [14] C. Villani, Optimal Transport: Old and New. Springer Science & Business Media, 2008, vol. 338.

 [15] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
 [16] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
 [17] E. Kang, H. J. Koo, D. H. Yang, J. B. Seo, and J. C. Ye, “Cycle-consistent adversarial denoising network for multiphase coronary CT angiography,” Medical Physics, vol. 46, no. 2, pp. 550–562, 2019.

 [18] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of the ACM, vol. 58, no. 3, pp. 1–37, 2011.
 [19] W. Yin, S. Osher, D. Goldfarb, and J. Darbon, “Bregman iterative algorithms for ℓ1-minimization with applications to compressed sensing,” SIAM Journal on Imaging Sciences, vol. 1, no. 1, pp. 143–168, 2008.
 [20] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of Wasserstein GANs,” in Advances in Neural Information Processing Systems, 2017, pp. 5767–5777.
 [21] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, April 2004.
 [22] A. Rodriguez-Molares, O. M. H. Rindal, J. D’hooge, S.-E. Måsøy, A. Austeng, M. A. L. Bell, and H. Torp, “The generalized contrast-to-noise ratio: A formal definition for lesion detectability,” IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, vol. 67, no. 4, pp. 745–759, 2019.
 [23] L. Zhu, C.-W. Fu, M. S. Brown, and P.-A. Heng, “A non-local low-rank framework for ultrasound speckle reduction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5650–5658.