I. Introduction
Prostate cancer (PCa) is one of the leading causes of cancer death among men [1]. In particular, men with clinically significant (CS) PCa, i.e. PCa with a Gleason Score (GS) equal to or greater than 7, have a much higher fatality rate than patients with indolent PCa. Early detection of CS PCa is therefore key to increasing the survival rate of patients. Recent studies [2, 3, 4, 5, 6, 7, 8]
have demonstrated that multiparametric magnetic resonance imaging (mpMRI) data, which typically include T2-weighted (T2w) images and apparent diffusion coefficient (ADC) maps, could serve as an accurate and non-invasive biomarker for PCa detection and aggressiveness assessment. However, training an accurate classifier for PCa detection and aggressiveness assessment on mpMRI images by leveraging recent advances in data-hungry deep learning, e.g. convolutional neural networks (CNNs), is challenging because mpMRI data of PCa, in particular CS PCa, are often scarce and costly to obtain.
To address this challenge, the most common approach [7, 8] is data augmentation, which increases data volume by modifying the original data through rotation, translation, scaling, non-rigid transformations, etc. However, these augmentation approaches cannot greatly increase data variety, limiting the performance of deep CNN models. Recently, several medical image synthesis methods based on Generative Adversarial Networks (GANs) [9, 10, 11, 12, 13, 14] have been developed. These GAN-based methods learn the data distribution of training samples on a low-dimensional manifold and generate new data by sampling from the learned manifold, providing an effective way to greatly increase the quantity and diversity of training data and in turn benefit deep learning methods. Despite the success of existing GAN-based methods, they can hardly meet the following requirements concurrently, which are critical for clinical usage: (1) ensuring a correct paired relationship within each synthesized ADC-T2w image pair, (2) increasing data variety based on a very small training set, and (3) containing distinguishable CS cancerous patterns in each synthesized ADC-T2w image pair. In this work, we aim at a GAN-based method which can concurrently meet these three requirements. In the following, we start with a survey of related work and summarize its limitations.
Prior work: If we refer to different modalities of MRI as different domains, mpMRI data synthesis can be more broadly formulated as a multi-domain or cross-domain data synthesis problem, which has been widely studied in recent years for synthesizing both natural [15, 16, 17] and medical [11, 12, 13, 18] images. Existing multi-/cross-domain data synthesis methods fall into three major classes: 1) cross-domain image translation, which, given a real image sampled from one domain (e.g., a T1 image), synthesizes its counterpart in another domain (e.g., a T2 image); 2) direct multi-domain image synthesis, which generates two images of different domains from a common low-dimensional vector with a constraint on the relationship of the pair; and 3) sequential multi-domain image synthesis, which first generates images in one domain from low-dimensional vectors and then applies cross-domain image translation to map them to their counterparts in the other domain.
For cross-domain image translation, Isola et al. [17] proposed a conditional GAN (cGAN) named pix2pix, which utilizes a U-Net [19] as the generator to condition on an input image of one domain and generate the corresponding output image of another domain. An L1 pixel-wise loss and a patch-based discriminator are used to ensure both global and local consistency between real and generated images. However, pix2pix requires a large amount of paired labels (i.e., true pairs of images) for training, which is costly, if not infeasible, to collect. Moreover, cross-domain image translation methods have to condition on a real input image in a reference domain and cannot concurrently synthesize new data in both domains; thus the diversity of the synthesized multi-domain data largely depends on the quantity and diversity of the real input data.
Rather than relying on real input data of one domain, Liu and Tuzel [16] proposed a multi-domain image synthesis method, named coupled GAN (CoGAN), which synthesizes multi-domain data directly from random noise vectors. Specifically, they utilized two parallel identical GANs to learn the marginal image distributions in the two domains respectively. Images of each domain are then synthesized by sampling from the corresponding marginal distribution. To guarantee that the multi-domain images of a pair indeed follow the intended relationship, the authors proposed to share weight parameters between the parallel GANs. Despite its success on the target tasks, directly mapping random noise to multi-domain data completely ignores the inherent difference in complexity between the synthesis tasks of different MRI modalities. For instance, as shown in Fig. 1, the ADC map, which captures the functional information of a prostate, usually has a low spatial resolution and thus is easier to synthesize. In comparison, the T2w image, which captures the anatomical structure of a prostate, contains much more detailed high-frequency texture information, making its synthesis considerably more challenging.
To take the difference in task complexity into consideration, Costa et al. [12] proposed a sequential multi-domain image synthesis method, which first tackles the simpler task, generating a retinal vessel network map, via a GAN, and then tackles the more difficult task, synthesizing the corresponding retinal fundus image, using another GAN. The method is trained in a supervised manner, where only the encodings of real data along with paired labels are used. The paired relationship between a vessel map and its corresponding retinal image is enforced by minimizing both the reconstruction losses of vessel-fundus pairs and an adversarial loss. However, the method in [12] can hardly capture meaningful information of CS PCa, since normal prostate gland tissues typically occupy predominant regions compared to CS lesions, biasing the learning process toward modeling the distribution of the predominant prostate gland rather than CS PCa. In addition, the variety of the synthetic data is still restricted when the amount of training data is limited, because the method cannot generate 'unseen' data from random inputs due to its overfitting to these encodings [15]. To summarize, existing methods can hardly synthesize mpMRI data with sufficient variety which contain meaningful CS PCa patterns and have the correct paired relationship, from a small amount of training data.
In this paper, we propose a novel semi-supervised method which can meet all the above-mentioned requirements and synthesize high-quality pairs of ADC and T2w images. As illustrated in Fig. 1, our mpMRI data synthesizer consists of three cascaded components: a decoder to derive low-dimensional ADC sub-maps from 128-d latent vectors, a StitchLayer to convert the low-dimensional ADC sub-maps into a full-size ADC image, and a U-Net to convert the ADC image into a paired T2w image. By training the synthesizer in a supervised manner which minimizes pixel-wise reconstruction losses between real and fake ADC-T2w pairs, the mapping between low-dimensional vectors and ADC maps and the paired relationships can be captured by the decoder and the U-Net of the synthesizer respectively. Training the synthesizer using supervised learning requires real pairs of CS ADC-T2w images, which could lead to overfitting to a small set of ADC encodings derived from a training set of limited size [12]. To increase the diversity of the generated data, we further train the synthesizer in an unsupervised manner by feeding it various random latent vectors, as illustrated at the bottom of Fig. 1. To ensure high visual similarity between real ADC/T2w images and fake ones generated from random vectors, we minimize the Wasserstein distances (W-distances) between the marginal distributions of synthesized and real images of the two modalities. The synthesizer is trained by alternating between the supervised and unsupervised processes, i.e. in a semi-supervised fashion, to enforce paired relationships in the network. This method achieves greater diversity in the generated data and higher visual similarity between fake and real images.
It is difficult and fragile to optimize a synthesizer, in the unsupervised training process, to directly generate full-size ADC maps from very low-dimensional latent vectors, because no explicit guidance from real ADC images is provided. To address this problem, we propose a StitchLayer which seamlessly fuses sub-images of an ADC map into a full-size ADC image in an interlaced manner. With the StitchLayer, the decoder can focus on a set of simpler tasks, each generating a sub-image of smaller size, instead of directly tackling the difficult task of generating a single full-size map. This strategy greatly reduces the complexity of modeling the entire manifold of ADC data.
To further ensure that the synthesized data contain distinguishable CS cancerous patterns, we compute auxiliary distances (ADs) of Jensen-Shannon divergence (JSD) between the synthetic CS and real non-CS images of the two modalities. By maximizing the two ADs, in addition to minimizing the W-distances in the unsupervised training process, the synthesizer is guided to include meaningful CS PCa features through attempting to better distinguish the synthesized CS mpMRI data from the real non-CS mpMRI data.
To summarize, the main contributions of this work include:

1) We develop a novel semi-supervised mpMRI data synthesis method, which is trained by alternating between supervised and unsupervised learning. Supervised learning enforces the correct paired relationship between the synthesized ADC and T2w images of a pair, while unsupervised learning avoids overfitting to encodings and thus ensures greater diversity of the synthesized data.

2) We propose a novel StitchLayer for more robustly synthesizing ADC maps in a coarse-to-fine manner. The enhanced ADC map synthesis helps improve the subsequent ADC-to-T2w translation, which in turn boosts the overall quality of the synthetic mpMRI data.

3) We introduce auxiliary distances of JSD between synthetic CS and real non-CS images for encoding clinically meaningful, CS PCa-relevant visual patterns in the synthetic data.
II. Method
In this section, we detail the key techniques used in our synthesizer, including 1) semi-supervised training for explicitly learning the paired relationships and modeling the marginal image distribution of each modality; 2) a StitchLayer to alleviate the complexity faced by the decoder in generating full-size ADC maps; and 3) auxiliary distances for enforcing that the synthetic data contain distinguishable CS cancerous patterns.
II-A. The Semi-supervised mpMRI Data Synthesis
Given a set of CS cancerous ADC-T2w prostate pairs $\{(x_a^i, x_t^i)\}_{i=1}^{N}$ for training, where $(x_a^i, x_t^i)$ indicates the $i$-th ADC-T2w pair and $N$ is the total number of pairs, a straightforward solution of adversarial training is to input random noise vectors to a synthesizer and train the synthesizer to fool a discriminator that tries its best to distinguish synthetic mpMRI pairs from real ones. A major limitation of this solution is that the task of the discriminator could be too burdensome, as it needs not only to learn the paired relationships but also to ensure that the synthesized data are visually realistic. As a result, the discriminator could be difficult to train. An alternative solution is to input encodings of real data rather than noise vectors to the synthesizer, as proposed in [12], which alleviates the discriminator's task by minimizing the pixel-wise reconstruction losses between reconstructed fake ADC-T2w images and true ADC-T2w images. However, this strategy could make the synthesizer easily overfit to a small set of encodings and consequently perform poorly on random vector inputs that lie beyond the distribution of true encodings. In this section, we present a novel semi-supervised learning scheme to address the above-mentioned limitations. We divide the entire task into three subtasks: 1) learning the paired relationships in the supervised learning process via pixel-wise reconstruction loss minimization; 2) learning the marginal distributions of the real ADC and T2w images via W-distance minimization in unsupervised training; and 3) learning the distinguishable visual features of CS PCa via maximization of the auxiliary distances between CS and non-CS images in unsupervised training.
II-A.1 Supervised Learning of Paired Relationships
Fig. 2 shows the details of the supervised training process. We first utilize an encoder $E$ to obtain encodings of real CS ADC maps, i.e. $\tilde{z} = E(x_a)$, where $x_a$ is a real CS ADC map. Then a decoder $G_a$ is applied to reconstruct a fake ADC map $\hat{x}_a = G_a(\tilde{z})$, which is then converted to a fake T2w image as $\hat{x}_t = G_t(\hat{x}_a)$. In this work, we implement $G_t$ using the U-Net. The reconstructed ADC-T2w pair can find its target pair according to the paired labels. We train the entire network, including the encoder and the synthesizer (i.e. $G_a$ and $G_t$), by minimizing the pixel-wise reconstruction loss:

$$\mathcal{L}_{rec} = \mathbb{E}_{(x_a, x_t) \sim p_{data}} \big[ d(\hat{x}_a, x_a) + d(\hat{x}_t, x_t) \big] \qquad (1)$$

where $\mathbb{E}_{(x_a, x_t) \sim p_{data}}$ is the expectation over the pairs $(x_a, x_t)$ sampled from the joint data distribution of real training pairs, and the operation $d(\cdot, \cdot)$ calculates the average of pixel-wise Euclidean distances between two images. As the training process is guided by real pairs of ADC-T2w images, the paired relationships can be effectively captured by the network.
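The distance operator $d(\cdot, \cdot)$ can be illustrated with a minimal pure-Python sketch (a hypothetical helper for intuition only; the actual implementation would operate on framework tensors):

```python
def pixelwise_distance(img_a, img_b):
    """Average per-pixel Euclidean (absolute) distance between two
    equally sized grayscale images given as 2-D lists of floats."""
    n_pixels = len(img_a) * len(img_a[0])
    total = sum(abs(pa - pb)
                for row_a, row_b in zip(img_a, img_b)
                for pa, pb in zip(row_a, row_b))
    return total / n_pixels

# A reconstruction that is off by 0.5 everywhere scores exactly 0.5.
real = [[0.0, 0.0], [0.0, 0.0]]
fake = [[0.5, 0.5], [0.5, 0.5]]
print(pixelwise_distance(real, fake))  # 0.5
```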
To enable the synthesizer to generate reasonable ADC-T2w pairs from noise vectors rather than from ADC encodings, we adopt the approach in [20] to reshape the distribution of encodings into a predefined distribution $p(z)$, i.e. a Gaussian distribution. Assuming that the data distribution of ADC lies on a 128-d manifold, denoted as the latent space, the reshaped distribution $q(z)$ is obtained by minimizing the JSD between the predefined $p(z)$ and the distribution of true ADC encodings. The minimization is implemented by training a discriminator $D_z$ to best distinguish encodings $\tilde{z} = E(x_a)$ produced by the encoder $E$ from latent vectors $z \sim p(z)$. The JSD between $p(z)$ and $q(z)$ is then measured (up to an additive constant) by the adversarial objective:

$$\mathcal{L}_{JSD} = \mathbb{E}_{z \sim p(z)} \big[ \log D_z(z) \big] + \mathbb{E}_{x_a \sim p_{data}} \big[ \log \big( 1 - D_z(E(x_a)) \big) \big] \qquad (2)$$

According to [20], minimizing the JSD makes $q(z)$ and $p(z)$ identical to each other, which allows us to sample from the known prior $p(z)$ for synthesis in the test phase.
Therefore, the final objective function of supervised learning is formulated as Eq. 3:

$$\mathcal{L}_{sup} = \mathcal{L}_{rec} + \mathcal{L}_{JSD} \qquad (3)$$

where $\mathcal{L}_{rec}$ is the reconstruction loss of Eq. 1 and $\mathcal{L}_{JSD}$ is the latent-distribution matching term of Eq. 2. By minimizing $\mathcal{L}_{sup}$, our synthesizer can focus on learning the paired relationships while confining the distribution of ADC encodings to conform to the predefined distribution $p(z)$.
In principle, by training the synthesizer with the procedure described above, we could generate a variety of reasonable ADC-T2w pairs from various latent vectors $z$. In practice, however, the visual quality of ADC-T2w pairs synthesized from noise vectors can be extremely poor. We believe the reason is that the synthesizer (i.e. the decoder and the U-Net) only 'sees' a very sparse and small portion of the latent space, namely the part containing the ADC encodings. When the number of training ADC-T2w pairs is very small, the synthesizer easily overfits to the limited training samples. To address this problem, we further apply an unsupervised learning approach to guide the synthesizer to learn the marginal distributions of real ADC and T2w images and in turn ensure a high visual quality of the synthetic ADC-T2w images generated from random latent vectors.
II-A.2 Unsupervised Learning of Marginal Distributions
Fig. 3 shows the details of the unsupervised approach. Compared to the supervised approach shown in Fig. 2, the unsupervised approach trains the synthesizer not on limited encodings and paired ADC-T2w samples, but on unlimited latent vectors drawn from $p(z)$ and unpaired ADC and T2w images of CS PCa. We employ a discriminator $D_a$ to approximate the W-distance $W_a$ between the fake and real ADC images, and another discriminator $D_t$ to approximate the W-distance $W_t$ between the fake and real T2w images, as:

$$W_a = \mathbb{E}_{x_a \sim p_{data}} \big[ D_a(x_a) \big] - \mathbb{E}_{z \sim p(z)} \big[ D_a(\hat{x}_a) \big] + \eta_a \mathcal{R}(D_a) \qquad (4)$$

$$W_t = \mathbb{E}_{x_t \sim p_{data}} \big[ D_t(x_t) \big] - \mathbb{E}_{z \sim p(z)} \big[ D_t(\hat{x}_t) \big] + \eta_t \mathcal{R}(D_t) \qquad (5)$$

where $x_a$ and $\hat{x}_a = G_a(z)$ are real and synthetic ADC maps of CS PCa respectively, $x_t$ and $\hat{x}_t = G_t(G_a(z))$ are real and synthetic T2w images of CS PCa respectively ($G_a$ being the decoder and $G_t$ the U-Net), $\mathcal{R}(D_a)$ and $\mathcal{R}(D_t)$ are gradient-penalty terms enforcing the 1-Lipschitz constraint on $D_a$ and $D_t$ respectively, and $\eta_a$ and $\eta_t$ are two parameters adjusting the impact of $\mathcal{R}(D_a)$ and $\mathcal{R}(D_t)$ respectively [21].
Therefore, the final objective function of unsupervised learning is calculated as Eq. 6:

$$\mathcal{L}_{unsup} = W_a + W_t \qquad (6)$$

where $W_a$ and $W_t$ are the W-distance terms of Eqs. 4 and 5. Minimizing $\mathcal{L}_{unsup}$ trains the synthesizer to generate ADC and T2w images which conform to the marginal distributions of true ADC and T2w images respectively, i.e. visually realistic ADC and T2w images. It is noteworthy that we do not require the unsupervised approach to learn the paired relationship between the ADC and T2w images of a synthetic pair, as this information has already been captured by the supervised approach and encoded in the network. By alternately training the entire network with the supervised and unsupervised approaches, our synthesizer can generate a great variety of ADC-T2w pairs that are both visually realistic and correctly paired.
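For intuition about the quantity the critics approximate: in one dimension, the Wasserstein-1 distance between two empirical distributions with equally many samples reduces to the mean absolute difference of the sorted samples. The sketch below (illustrative only, not part of the method) shows that shifting a distribution by a constant moves it exactly that far, which is the property that makes the W-distance a smooth training signal:

```python
def wasserstein_1d(samples_p, samples_q):
    """Empirical Wasserstein-1 distance between two equally sized
    1-D sample sets: mean absolute difference of the sorted samples."""
    assert len(samples_p) == len(samples_q)
    return sum(abs(p - q) for p, q in
               zip(sorted(samples_p), sorted(samples_q))) / len(samples_p)

# Shifting a distribution by a constant c moves it exactly c apart.
print(wasserstein_1d([0.0, 1.0, 2.0], [2.0, 3.0, 4.0]))  # 2.0
```

In high dimensions no such closed form exists, which is why the critics $D_a$ and $D_t$ estimate the distance adversarially via the Kantorovich-Rubinstein dual under the 1-Lipschitz constraint.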
II-B. The StitchLayer for Alleviating Generation Complexity
The quality of the synthesized ADC map is critical for the subsequent synthesis of the T2w image. We observe that directly generating a full-size ADC map from a low-dimensional latent vector, i.e. a 128-d vector, via a decoder is very challenging, especially for the unsupervised approach which lacks explicit guidance from real ADC images. In a typical abdominal MRI scan, the prostate gland and its neighboring tissues roughly locate at the center of the ADC map and cover only a small fraction of the entire image. This implies that, to maintain the original resolution, a synthesized prostate ADC map should still be of considerable size (64 × 64 in our setting). However, most widely-used GANs [22, 23] are limited to synthesizing very small images, such as those of the CIFAR-10 [24] and MNIST [25] datasets, whose image sizes are 32 × 32 and 28 × 28 respectively.
A potential solution for synthesizing higher-dimensional images is the coarse-to-fine learning adopted in recent studies [26, 27, 28]. In these studies, customized generative networks and/or sophisticated training strategies were developed to synthesize data from low to high resolutions gradually. However, these techniques are very time-consuming and hard to tune in the training phase, and seem overkill for our task, given that synthesizing 64 × 64 images is only slightly beyond the capability of most plain GANs.
In this work we propose the StitchLayer, which is simple yet effective and can be embedded in any generative network to boost existing GANs to synthesize images of greater size. Specifically, given the goal of generating a target image of size $N \times N$, we start from generating $f^2$ smaller sub-images with multiple decoders, as shown in Fig. 4. Each decoder, instead of modeling the single difficult mapping task $z \to X$, where $X$ is the full-size image, only optimizes one of $f^2$ simpler mapping tasks $z \to \hat{x}_{i,j}$, where $\hat{x}_{i,j}$ is the $(i,j)$-th sub-image of size $(N/f) \times (N/f)$. The StitchLayer then 'stitches' these sub-images into a full-size ADC image $X$. Specifically, we consider that $X$ of size $N \times N$ consists of non-overlapping blocks, each of which is a square superpixel of size $f \times f$ derived as:

$$B_{u,v} = \begin{bmatrix} \hat{x}_{0,0}(u,v) & \cdots & \hat{x}_{0,f-1}(u,v) \\ \vdots & \ddots & \vdots \\ \hat{x}_{f-1,0}(u,v) & \cdots & \hat{x}_{f-1,f-1}(u,v) \end{bmatrix} \qquad (7)$$

where $B_{u,v}$ indicates the block in the $u$-th row and $v$-th column of blocks of $X$, and $\hat{x}_{i,j}(u,v)$ is pixel $(u,v)$ of sub-image $\hat{x}_{i,j}$. The block shown in Fig. 4 is an exemplar of $B_{u,v}$.
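The interlaced stitching can be sketched directly in pure Python (hypothetical helper names; in a real network this is a single tensor reshape, closely related to sub-pixel convolution / pixel shuffle):

```python
def stitch(sub_images, f):
    """Interlace f*f sub-images (each a x a, as 2-D lists, indexed
    sub_images[i][j]) into one (f*a) x (f*a) image such that
    output[u*f + i][v*f + j] = sub_images[i][j][u][v], i.e. each f x f
    'superpixel' block collects pixel (u, v) of every sub-image."""
    a = len(sub_images[0][0])
    n = f * a
    out = [[0.0] * n for _ in range(n)]
    for i in range(f):
        for j in range(f):
            for u in range(a):
                for v in range(a):
                    out[u * f + i][v * f + j] = sub_images[i][j][u][v]
    return out

# Four constant 2x2 sub-images (values 0..3) interleave into a 4x4 image.
subs = [[[[i * 2 + j] * 2, [i * 2 + j] * 2] for j in range(2)] for i in range(2)]
print(stitch(subs, 2))
# [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 0, 1], [2, 3, 2, 3]]
```

The interleaved output makes each sub-image responsible for one phase of the full-resolution sampling grid, so every decoder sees the entire field of view at reduced resolution rather than a spatial crop.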
In our implementation, instead of utilizing $f^2$ independent decoders, all decoders share common features in the same upsampling layers and differ from each other only in the last fully-convolutional layer. Accordingly, $f^2$ feature maps, i.e. the sub-images of a full-size image, are generated by the decoders. Sharing common features in the upsampling layers ensures global spatial consistency among the sub-images and eliminates many unnecessary computations, while the last fully-convolutional layer of each decoder encodes unique details for each sub-image. By 'stitching' the sub-images together, both the global structure of a full-size image and the complementary details of each sub-image are combined.
Reducing the size of the sub-images could make it easier to optimize the common upsampling layers, as it is easier to map a latent vector to a lower-dimensional manifold. However, reducing the sub-image size also increases the difficulty of optimizing the decoders' last convolutional layers, which capture the complementary details among the sub-images. Therefore, it is important to choose a proper sub-image size for a good trade-off. In Sec. IV-B we experimentally evaluate different sub-image sizes for synthesizing ADC maps. The results show that a sub-image size of 32 × 32 (i.e. f = 2) achieves the best performance.
II-C. Auxiliary Distance Maximization for Capturing CS PCa Patterns
In this section we present our solution for capturing CS PCa patterns in the synthesizer, which is critical for clinical usage. The challenges are twofold: 1) normal prostate gland tissues are typically predominant in a prostate image compared to a CS lesion, making it difficult for the synthesizer to capture sufficient CS PCa-relevant information; and 2) real CS PCa data for training are quite scarce, leading to overfitting with high probability. To address these challenges, we introduce two critic networks to learn CS PCa features by distinguishing between synthetic CS PCa data and real non-CS PCa data in each modality. For ADC and T2w respectively, the critic networks are trained to approximate two auxiliary distances of JSD between CS PCa and non-CS PCa data as follows:

$$AD_a = \mathbb{E}_{x \sim p^{n}_{a}} \big[ \log C_a(x) \big] + \mathbb{E}_{z \sim p(z)} \big[ \log \big( 1 - C_a(\hat{x}_a) \big) \big] \qquad (8)$$

$$AD_t = \mathbb{E}_{x \sim p^{n}_{t}} \big[ \log C_t(x) \big] + \mathbb{E}_{z \sim p(z)} \big[ \log \big( 1 - C_t(\hat{x}_t) \big) \big] \qquad (9)$$

where $C_a$ and $C_t$ are the two critic networks, $\hat{x}_a$ and $\hat{x}_t$ are synthetic CS ADC and T2w images (generated from $z \sim p(z)$) respectively, and $p^{n}_{a}$ and $p^{n}_{t}$ are the distributions of real non-CS ADC and T2w images respectively.
Once we obtain $AD_a$ and $AD_t$, we train our synthesizer to maximize both of them. We choose JSD rather than the W-distance used in unsupervised learning as the AD because JSD better guides the synthesizer to increase the distance between CS and non-CS PCa data only when the synthetic data lack CS PCa information: JSD provides no gradient unless the manifolds of synthetic CS and real non-CS PCa data align with each other [29]. Moreover, the overfitting problem is greatly alleviated, as there are more non-CS PCa data than CS PCa data available for training.
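The saturation behavior of JSD invoked above can be checked numerically: for two discrete distributions with disjoint supports, the JSD is pinned at its maximum value log 2 regardless of how the mass is arranged, so no useful gradient arises until the supports overlap (an illustrative sketch, not part of the training code):

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete
    distributions given as equal-length probability vectors."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Disjoint supports: JSD saturates at log 2 ~ 0.693, whatever the shapes.
print(round(jsd([1.0, 0.0], [0.0, 1.0]), 3))                       # 0.693
print(round(jsd([0.3, 0.7, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]), 3))   # 0.693
```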
$$\mathcal{L}_{total} = \lambda \big( \mathcal{L}_{sup} + \mathcal{L}_{unsup} \big) - \beta \big( AD_a + AD_t \big) \qquad (10)$$

The overall training target of our method shown in Fig. 1 is summarized in Eq. 10, combining the supervised objective (Eq. 3), the unsupervised objective (Eq. 6), and the auxiliary distances (Eqs. 8 and 9), where $\lambda$ and $\beta$ are weights tuning the contributions of semi-supervised learning and auxiliary distance maximization. With the optimization of Eq. 10, the synthesizer is trained to generate a large variety of ADC-T2w pairs containing meaningful and visually realistic CS PCa patterns in addition to the prostate gland, satisfying all three requirements outlined in the introduction.
III. Data Preparation
III-A. Data Collection
This study was approved by our local institutional review board. The mpMRI data used in the study were collected from two datasets:

A locally collected dataset, named the TJPCa Dataset [7, 8, 6], includes data conforming to the following five criteria: 1) the data for PCa assessment were acquired between June 2014 and December 2015; 2) all data included either pathologically-proven PCa or benign prostatic hyperplasia (BPH), confirmed by a 12-core systematic TRUS-guided plus targeted prostate biopsy performed within six weeks after the MRI examination; 3) the data were from patients who did not receive focal therapy, hormones, or radiation prior to the MRI scan; 4) the data include both ADC and T2w images; and 5) the imaging data do not include severe artifacts that would make the examination non-diagnostic. Indications for prostate MRI include: tumor detection for patients with clinical suspicion of prostate cancer (elevated PSA > 4.0 ng/mL and/or suspicious DRE) before biopsy, cancer staging, radiation planning, surgical planning, active surveillance, planning for biopsy targeting, and evaluation of patients with a prior negative biopsy but continuing clinical suspicion of prostate cancer. According to the above criteria, we eventually collected data of patients, among whom were CS PCa patients (i.e., PCa with GS ≥ 7) and were non-CS patients (i.e., BPH or indolent PCa), with the mean ages (and age ranges) and median PSA values (and PSA ranges) recorded for both groups.

A publicly available dataset, named PROSTATEx (training), which is the training set from the PROSTATEx challenge [30, 4, 31], includes data of MRI-targeted, biopsy-proven CS PCa and non-CS PCa patients. The remaining test data from the PROSTATEx challenge are excluded from this study due to the lack of ground-truth labels (i.e. CS or non-CS).
In total, we have data of patients from the two datasets; the patients who were normal, had benign prostatic hyperplasia (BPH), or had indolent lesions are collectively referred to as non-CS PCa patients, and the remaining patients had CS PCa.
All mpMRI data in the TJPCa Dataset were acquired on a 3.0 Tesla (T) whole-body MR imaging system (MAGNETOM Skyra, Siemens Medical Solutions, Erlangen, Germany) running software version Syngo MR D13. For the transverse, coronal, and sagittal T2w TSE images, the echo train length was 16 and there was no intersection gap; the repetition time [TR], echo time [TE], section thickness, field of view [FOV], and image size were fixed by the acquisition protocol. For the transverse DWI sequences, the protocol likewise fixed the b-values, TR/TE, section thickness, FOV, and image size. The ADC maps were computed on an Advanced Workstation. The MRI protocols for data acquisition of the PROSTATEx (training) dataset are provided in [4].
III-B. Data Preprocessing for Training and Testing
From these two datasets, a radiologist manually selected original ADC-T2w pairs containing CS PCa lesions and ADC-T2w pairs that are non-CS cancerous. The selection criterion was that both the CS PCa lesions and the prostate glands were clearly visible. For each selected ADC-T2w pair, we cropped and aligned the prostate region using the automated prostate detection and registration method proposed in [6]. As shown in Fig. 5 (left), the original sizes of the ADC and T2w images differ, and the width of the entire image is several times the width of the prostate region in both modalities. Therefore, we first resized the ADC-T2w pairs to a common size as the input to the automated detection and registration method, and then set its output image size so as to better preserve the useful information of the prostate in the mpMRI data. An exemplar output of a processed prostate ADC-T2w pair is shown in Fig. 5 (right).
The processed prostate ADC-T2w pairs were randomly divided into the TrainSet (CS and non-CS pairs) and the TestSet (CS and non-CS pairs). Each patient's data is either solely in the training set or solely in the test set, but not both, to avoid overfitting to the data of specific patients.
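The patient-level split can be sketched as follows (hypothetical helper and record format; the paper's split is a random division respecting the same constraint):

```python
import random

def split_by_patient(slice_records, test_fraction=0.3, seed=0):
    """Split slice-level records into train/test so that all slices of a
    patient land on the same side, avoiding patient-level leakage.
    Each record is a (patient_id, slice_data) tuple."""
    patients = sorted({pid for pid, _ in slice_records})
    random.Random(seed).shuffle(patients)
    n_test = max(1, int(len(patients) * test_fraction))
    test_ids = set(patients[:n_test])
    train = [r for r in slice_records if r[0] not in test_ids]
    test = [r for r in slice_records if r[0] in test_ids]
    return train, test

records = [("p1", "s1"), ("p1", "s2"), ("p2", "s1"), ("p3", "s1")]
train, test = split_by_patient(records)
# No patient appears on both sides of the split.
assert {pid for pid, _ in train}.isdisjoint({pid for pid, _ in test})
```

Splitting by slice instead of by patient would let near-duplicate neighboring slices of one patient appear on both sides, inflating test accuracy.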
IV. Experimental Results
IV-A. Semi-supervised Learning vs. Supervised Learning
We first visually compare the images synthesized by the semi-supervised method (employing both the top and bottom parts of Fig. 1) and the supervised method (employing only the top part of Fig. 1) to qualitatively demonstrate the effectiveness of the proposed semi-supervised method in addressing the overfitting problem. To focus this analysis on the two learning approaches only, we excluded the StitchLayer and auxiliary distance maximization from the network shown in Fig. 1 for both learning approaches. To further improve the performance of the supervised method, we adopt a discriminator used in [12] which distinguishes fake pairs from real pairs to reduce the blur in synthetic images. Both the semi-supervised and supervised methods are trained on the CS ADC-T2w pairs from the TrainSet, and the latent vectors used for synthesis were obtained by two different approaches, i.e., spherical interpolation [12] and random sampling.
The spherical interpolation approach is illustrated in Fig. 6. The leftmost and rightmost blue dots denote ADC encodings in the latent space derived from two real CS ADC maps. By interpolating additional dots between the two encodings, we generate a set of new latent vectors (i.e. blue dots), from which new fake ADC images can be generated via the decoder. A decoder that learns a complete mapping between latent vectors and ADC images should be able to generate smoothly transitioning images from the vectors interpolated between every two real images. To validate this, we purposely select two real ADC maps (i.e. the leftmost and rightmost images of Fig. 6) from the TestSet, with a single CS PCa lesion locating on the right (in the leftmost image) and at the top (in the rightmost image) of the prostate gland respectively. The lesions are visually darker than the surrounding tissues, as denoted by the red circles. Figs. 6(a) and (b) show the ADC maps synthesized from the interpolations by the semi-supervised and supervised synthesizers respectively. As seen in Fig. 6(a), the CS PCa lesion transitions gradually and smoothly from the right to the top of the prostate gland, while the first three images in Fig. 6(b) are almost identical to the leftmost real image and the transition from the lesion-on-the-right images to the lesion-at-the-top images is sudden rather than smooth.
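Spherical interpolation between two latent codes is commonly implemented as slerp; the following is a sketch under that assumption ([12] may differ in details):

```python
import math

def slerp(v0, v1, t):
    """Spherical linear interpolation between latent vectors v0 and v1
    (lists of floats) at interpolation factor t in [0, 1]."""
    dot = sum(a * b for a, b in zip(v0, v1))
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    omega = math.acos(max(-1.0, min(1.0, dot / (norm0 * norm1))))
    if omega < 1e-8:  # nearly parallel: fall back to linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]

# The midpoint of two orthogonal unit vectors stays on the unit sphere.
mid = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
print([round(x, 4) for x in mid])  # [0.7071, 0.7071]
```

Unlike straight-line interpolation, slerp keeps intermediate vectors at a magnitude typical of the Gaussian prior, so interpolated codes remain in the region the decoder was trained on.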
We also utilize latent vectors randomly sampled from the prior Gaussian distribution for synthesis. Figs. 7(a) and (b) show the synthetic pairs generated by the semi-supervised and supervised methods respectively. The top row in each blue box shows the synthetic ADC maps and the bottom row their corresponding generated T2w images. As can be seen, the prostate gland shapes generated by the semi-supervised method are much clearer and have greater variety than those from the supervised method; a severe mode collapse occurs in the pairs synthesized by the supervised method, especially for the T2w images. Comparing the results in Fig. 6(b) and Fig. 7(b), we notice that the supervised method can only produce somewhat realistic results for latent vectors near the encodings, while its performance degrades significantly for randomly sampled vectors in the latent space, which could greatly limit the variety of the synthesized data. In comparison, the semi-supervised method, which further guides the synthesizer to learn more complete distributions of real ADC and T2w images, facilitates generating a large variety of visually realistic data in both modalities.
IV-B. Upsampling Architecture and Parameter Setting of the StitchLayer
In this section, we explore the optimal setting for the StitchLayer, i.e. the size of subimages (i.e., ) and number of blocks (i.e., ). The upsampling architecture of the decoder is shown in Fig 8. Specifically, we first extract the intermediate feature maps with a size of from the architecture, and then utilize a fullconvolutional layer with kernels as decoders to obtain subimages, followed by the StitchLayer to ’stitch’ them into a fullsize ADC map. Based on the upsampling architecture, we built different StitchLayerbased models with different parameter settings, denoted as StitchLayer#F#a. For example, as shown in Fig. 8, the StitchLayer2F32a utilizes 4 kernels to fully convolve the penultimate feature maps with the size of , yielding 4 subimages with the same size, and then uses the StitchLayer to obtain the fullsize ADC map. For a fair comparison, those five StitchLayerbased models share the first fullyconnected layer and upsampling block, which together account for 77% parameters of the upsampling architecture, and thus have almost identical model complexity and learning ability.
We trained five StitchLayerbased models based on ADC maps from the TrainSet in an unsupervised manner, and then used them to synthesize ADC maps based on the same set of random latent vectors. Fig. 9 shows synthetic ADC maps from StitchLayerbased models with different selections of . From the outputs of the StitchLayer1F64a, shown in Fig. 9(a), we observe a slight mode collapse problem as the first four maps are almost identical and the shapes of some prostate glands are quite ambiguous. The StitchLayer1F64a actually optimizes a direct generation using a single decoder. Therefore, the decoder has to learn both global structure and local details of the fullsize image, making the generation hard and fragile for optimization.
From Figs. 9(b)-(e), we observe that, for , a smaller results in visually more realistic and satisfactory ADC maps. Increasing to greater than yields results comparable to or even worse than Fig. 9(a). Our conjecture is that a larger increases the difficulty of reconstructing complementary local details. In addition, a large also reduces the amount of global structure information preserved in the sub-images, and in turn introduces considerable noise in both the global shapes of the prostate glands and the local tissue patterns.
Table I: Slice-level classification accuracy of the five StitchLayer-based models.

Model     StitchLayer1F64a  StitchLayer2F32a  StitchLayer4F16a  StitchLayer8F8a  StitchLayer16F4a
Accuracy  87%               90%               89%               86%              84%
We further quantitatively evaluate the performance of the five StitchLayer-based models on a specific task of slice-level CS vs. non-CS PCa classification. For each model, we combine the synthesized ADC maps of CS PCa and the real ADC maps of non-CS PCa from the TrainSet to train an Artificial Neural Network (ANN), consisting of two fully-connected layers, as the classifier. We test the trained classifiers on the TestSet and use the classification accuracy as the metric: the more realistic the synthetic ADC maps of CS PCa used for training, the higher the classification accuracy that can be achieved on the TestSet. The classification results in Table I are consistent with the visual qualities of the synthetic ADC maps shown in Fig. 9. Therefore, based on both the visual and the quantitative evaluation results, we adopt the StitchLayer2F32a as our generation model of ADC maps in the following experiments.
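To make the evaluation protocol concrete, the sketch below shows a two-fully-connected-layer ANN forward pass and the slice-level accuracy computation in numpy. The layer sizes, activations and weights are hypothetical (the text does not specify them); only the two-layer structure and the accuracy metric follow the description above.

```python
import numpy as np

def ann_predict(X, W1, b1, W2, b2):
    """Forward pass of a two fully-connected-layer ANN (hypothetical sizes).

    X: (n_slices, n_features). Returns the probability of CS PCa per slice.
    """
    h = np.maximum(X @ W1 + b1, 0.0)      # hidden fully-connected layer, ReLU
    logits = h @ W2 + b2                  # output fully-connected layer
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid -> P(CS)

def slice_level_accuracy(probs, labels, threshold=0.5):
    """Slice-level classification accuracy at a 0.5 decision threshold."""
    return float(np.mean((probs.ravel() > threshold) == labels.ravel()))
```

In the experiments above, the classifier would be trained on synthetic CS and real non-CS ADC maps, and `slice_level_accuracy` would be reported on the TestSet.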
IV-C Comparison with the State-of-the-Art
We compare our semi-supervised synthesis method for mpMRI data with two state-of-the-art methods [16, 12]. The CoGAN proposed in [16] is an unsupervised method for multi-domain data synthesis, and the method proposed by Costa et al. [12] is trained in a supervised manner. To the best of our knowledge, our work is the first to adopt a semi-supervised approach for mpMRI data synthesis. Besides the two state-of-the-art methods, we also compare two variants of our method, trained with and without the auxiliary distance (AD) maximization respectively, to evaluate the effectiveness of AD maximization. The mpMRI data synthesized by the different methods are evaluated to verify the following three characteristics: i) paired relationship, ii) variety, and iii) distinguishability of CS PCa. Accordingly, we respectively chose the Fréchet Inception Distance (FID) [32], the Inception Score (IS) [33], and the slice-level classification accuracy (SCA) as the evaluation metrics. For each synthesis model, we randomly generated sets of mpMRI data of CS PCa, each of which contains
synthetic ADC-T2w pairs, and reported the average value (Avg) and the standard deviation (Std) over the synthetic datasets for each metric. Furthermore, statistical significance testing based on the t-test was performed when making the comparisons.
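For reference, the statistic behind such a significance test can be computed as below. This is a generic Welch two-sample t-statistic in numpy; the specific t-test variant used in the experiments is not stated in the text, so this is an illustrative assumption.

```python
import numpy as np

def welch_t_statistic(a, b):
    """Welch's two-sample t-statistic for a metric measured on two methods.

    Does not assume equal variances. Converting the statistic to a p-value
    would additionally require the CDF of Student's t distribution
    (e.g. scipy.stats.ttest_ind with equal_var=False).
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return float((a.mean() - b.mean()) / se)
```

Each input array would hold the per-dataset values of one metric (e.g. the SCA of the generated sets) for one synthesis method.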
FID is a widely used metric for evaluating the distance between the distributions of synthetic data and real data. Specifically, FID calculates the Wasserstein-2 distance between the generated pairs and the real pairs in the feature space of an Inception-v3 network [34]. Synthetic ADC maps and T2w images which are more realistic and have stronger paired relationships should have a higher probability of coming from a joint data distribution similar to the real joint distribution of the multiple modalities, yielding lower FID values.
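To make the metric concrete, the sketch below computes the Fréchet distance between Gaussians fitted to two sets of Inception features. For brevity the covariances are assumed diagonal, which removes the matrix square root that the full FID formula requires; the full metric uses the complete covariance matrices (e.g. via scipy.linalg.sqrtm).

```python
import numpy as np

def fid_diagonal(feats_real, feats_fake):
    """Frechet distance between Gaussians fitted to two feature sets.

    Simplifying assumption: diagonal covariances, so
    FID = ||mu_r - mu_f||^2 + sum(s_r + s_f - 2*sqrt(s_r * s_f)).
    feats_*: (n_samples, n_dims) arrays of Inception-v3 features.
    """
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    var_r, var_f = feats_real.var(axis=0), feats_fake.var(axis=0)
    mean_term = np.sum((mu_r - mu_f) ** 2)
    cov_term = np.sum(var_r + var_f - 2.0 * np.sqrt(var_r * var_f))
    return float(mean_term + cov_term)
```

Identical feature distributions give a score near zero, and any shift between the real and synthetic feature statistics increases it, matching the "lower is better" reading of FID above.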
The IS is an alternative to human annotators which can automatically measure the visual quality and diversity of synthesized samples [33]. The IS first uses an Inception-v3 model pretrained on the ImageNet dataset to assign each synthetic image a class label, and then calculates the KL divergence between the conditional class distribution and the marginal class distribution. Higher IS values indicate better quality and diversity of the synthetic data. Although the pretrained Inception-v3 model cannot produce matching labels for our PCa data, as it was trained on a different dataset, it is still meaningful and insightful to use the IS, since some common features (e.g., edges, blob patterns, brightness) are shared between medical and natural data, and more varied PCa data will be assigned more distinct labels by the Inception-v3 model, yielding higher IS values. We calculated two IS values, denoted as IS-ADC and IS-T2w respectively, to measure the data variety in the two modalities separately.
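Once the class probabilities p(y|x) are in hand, the IS computation itself is short; a minimal numpy sketch (assuming the probabilities come from a pretrained Inception-v3 classifier):

```python
import numpy as np

def inception_score(p_yx, eps=1e-12):
    """Inception Score from an (n_images, n_classes) matrix of p(y|x).

    IS = exp( mean_x KL( p(y|x) || p(y) ) ), where p(y) is the marginal
    class distribution over the synthetic set. Confident, varied
    predictions give a high score; identical predictions give IS = 1.
    """
    p_y = p_yx.mean(axis=0, keepdims=True)  # marginal class distribution
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

For example, if every image receives the same (uniform) class distribution the score is 1, while confident predictions spread evenly over K distinct classes approach the maximum score of K.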
Inspired by the previous studies [35, 12], we introduce a task-specific evaluation metric to verify whether the model captures CS PCa-relevant information during synthesis. Specifically, we used the synthetic ADC-T2w pairs of CS PCa and the real ADC-T2w pairs of non-CS PCa from the TrainSet to train a multimodal ANN-based classifier, which takes an ADC-T2w pair as input and predicts its probability of being CS cancerous. The SCA on the TestSet is then used as the metric to evaluate the distinguishability of CS PCa patterns in the synthetic data. Higher SCA values imply that the corresponding method can synthesize ADC-T2w pairs with more distinguishable CS PCa patterns, which in turn lead to a more accurate multimodal classifier.
Table II: Quantitative comparison of mpMRI data synthesis methods.

mpMRI Data Synthesis Method     IS-ADC       IS-T2w       FID          SCA (%)
CoGAN [16]
Costa et al. [12]
Ours w/o the AD Maximization
Ours w/ the AD Maximization     2.24 ± 0.03  2.10 ± 0.05  178.2 ± 3.7  94.4 ± 0.5
Real Data                       3.27         3.26         143.8        93
Table II shows the comparison results of the different mpMRI data synthesis methods, from which four observations can be made:

By comparing the 1st and 2nd rows, we observe that the unsupervised method [16] significantly outperforms the supervised method [12] ( for IS-ADC, IS-T2w and SCA, and for FID), since the overfitting problem prevents the supervised synthesizer from generating realistic data with sufficient variety for random latent vectors.

By comparing the 1st and 3rd rows, we observe that our method achieves a much lower FID () than the CoGAN while the IS values are comparable ( for both IS-ADC and IS-T2w), implying that weight sharing alone is too weak a constraint to enforce paired relationships compared with minimizing pixel-wise reconstruction losses. Based on these first two observations, we conclude that our semi-supervised method combines the strengths of the unsupervised and supervised methods, and thus produces more realistic, varied and paired mpMRI data, meeting the first two requirements for clinical usage outlined in the Introduction.

Comparing the two variants of our method shown in the 3rd and 4th rows, we observe that the variant with the AD maximization achieves a higher SCA value than that without it (). The results confirm that the proposed AD maximization indeed helps our method learn to generate ADC-T2w data with more distinguishable CS PCa patterns, meeting the last required data characteristic outlined in the Introduction.

We further evaluated the TrainSet with respect to these metrics, as presented in the last row. For fairness, we augmented the real CS PCa data to using the data augmentation approach proposed in [8] for training the multimodal classifier. By comparing all rows, we observe that the Real Data achieves the highest IS values and the lowest FID value among all synthesis methods, implying that there is still room for improvement in synthesizing truly realistic and varied mpMRI data. The comparison results for SCA, however, are encouraging: the classifier trained with the synthetic data from "Ours w/ the AD Maximization" achieves slightly better performance than that trained with real data, implying that our method can synthesize data with in-depth features and is a more viable alternative for addressing the insufficiency of medical data than traditional data augmentation for specific clinical tasks.
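As a rough illustration of the auxiliary distance (AD) maximization compared above, the sketch below computes a feature-space distance between a real non-CS batch and a synthetic CS batch and subtracts it from the generator objective. Both the distance function and the way it enters the loss are hypothetical simplifications, not the paper's exact formulation.

```python
import numpy as np

def auxiliary_distance(feats_real_noncs, feats_synth_cs):
    """Hypothetical AD term: L2 distance between batch feature centroids."""
    diff = feats_real_noncs.mean(axis=0) - feats_synth_cs.mean(axis=0)
    return float(np.linalg.norm(diff))

def generator_objective(adv_loss, feats_real_noncs, feats_synth_cs, lam=0.1):
    """Adversarial loss minus a weighted AD term.

    Subtracting the AD term means minimizing this objective *maximizes*
    the distance between real non-CS and synthetic CS features, pushing
    the synthesizer toward CS-specific patterns.
    """
    return adv_loss - lam * auxiliary_distance(feats_real_noncs, feats_synth_cs)
```

The intent mirrors the description in the Conclusions: the synthesizer is rewarded for making synthetic CS images separable from real non-CS images in each modality, rather than for reproducing the dominant prostate or bladder tissue appearance.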
V Conclusions and Future Work
Despite the large number of GAN-based image synthesis methods in the computer vision literature, few of them can be directly applied to multimodal medical image synthesis tasks, which pose unique challenges. In this study, we take the task of generating mpMRI data of CS PCa as an application driver, carefully study the limitations of existing methods, and propose a list of novel techniques for generating clinically meaningful ADC-T2w images under the constraint of a limited amount of training data. First, we propose a semi-supervised method to enable the synthesizer to comprehensively understand the entire latent space consisting of random vectors and encodings, and thus learn to generate an unlimited number of varied and paired ADC-T2w images based on a limited amount of real training data. Second, we propose the StitchLayer, which can be easily integrated into any synthesizer to alleviate the complexity of the direct mapping from a low-dimensional noise vector to a full-size image without explicit supervision from ground-truth images. Third, to encode distinguishable CS cancerous visual patterns in the synthetic mpMRI data, we propose to maximize an auxiliary distance between the real non-CS and the synthetic CS images in each modality, which enforces the synthesis process to rely more on the clinically meaningful CS PCa-relevant features rather than on the dominant prostate or bladder tissues. We collected pathology-proven mpMRI data from both a local hospital and public datasets. Visual and quantitative experimental results demonstrate that our synthesizer achieves superior performance to the state-of-the-art methods
[16, 12] and can generate ADC-T2w pairs with great variety, correct paired relationships and distinguishable CS cancerous patterns. Even more encouraging, our synthetic ADC-T2w data can help boost the performance of a specific clinical task (i.e. slice-level CS vs. non-CS classification) compared to relying only on the real data. In future work, we plan to investigate the performance of synthesizing three or even more modalities in mpMRI, e.g. ADC, T2w and Dynamic Contrast Enhanced MRI (DCE-MRI). Our future work also includes extending the 2D synthesizer to 3D to better capture the comprehensive 3D information of mpMRI data. In addition, the proposed techniques should also be applicable to synthesizing many other types of medical imaging data; we will explore the potential of our method in more medical imaging applications.

References
 [1] R. L. Siegel, K. D. Miller, and A. Jemal, “Cancer statistics, 2015,” CA: a cancer journal for clinicians, vol. 65, no. 1, pp. 5–29, 2015.
 [2] D. Fehr, H. Veeraraghavan, A. Wibmer, T. Gondo, K. Matsumoto, H. A. Vargas, E. Sala, H. Hricak, and J. O. Deasy, “Automatic classification of prostate cancer gleason scores from multiparametric magnetic resonance images,” Proceedings of the National Academy of Sciences, vol. 112, no. 46, pp. E6265–E6273, 2015.
 [3] G. Lemaitre, “Computer-aided diagnosis for prostate cancer using multi-parametric magnetic resonance imaging,” Ph.D. dissertation, Universite de Bourgogne; Universitat de Girona, 2016.
 [4] G. Litjens, O. Debats, J. Barentsz, N. Karssemeijer, and H. Huisman, “Computer-aided detection of prostate cancer in MRI,” IEEE Transactions on Medical Imaging, vol. 33, no. 5, pp. 1083–1092, 2014.
 [5] G. Lemaître, R. Martí, J. Freixenet, J. C. Vilanova, P. M. Walker, and F. Meriaudeau, “Computer-aided detection and diagnosis for prostate cancer based on mono and multi-parametric MRI: A review,” Computers in Biology and Medicine, vol. 60, pp. 8–31, 2015.
 [6] Z. Wang, C. Liu, D. Cheng, L. Wang, X. Yang, and K.-T. T. Cheng, “Automated detection of clinically significant prostate cancer in mpMRI images based on an end-to-end deep neural network,” IEEE Transactions on Medical Imaging, 2018.
 [7] X. Yang, Z. Wang, C. Liu, H. M. Le, J. Chen, K.-T. T. Cheng, and L. Wang, “Joint detection and diagnosis of prostate cancer in multi-parametric MRI based on multimodal convolutional neural networks,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2017, pp. 426–434.
 [8] X. Yang, C. Liu, Z. Wang, J. Yang, H. M. Le, L. Wang, and K.-T. T. Cheng, “Co-trained convolutional neural networks for automated detection of prostate cancer in multi-parametric MRI,” Medical Image Analysis, vol. 42, pp. 212–227, 2017.
 [9] H. Salehinejad, S. Valaee, T. Dowdell, E. Colak, and J. Barfett, “Generalization of deep neural networks for chest pathology classification in x-rays using generative adversarial networks,” arXiv preprint arXiv:1712.01636, 2017.
 [10] F. Calimeri, A. Marzullo, C. Stamile, and G. Terracina, “Biomedical data augmentation using generative adversarial neural networks,” in International Conference on Artificial Neural Networks. Springer, 2017, pp. 626–634.
 [11] A. Chartsias, T. Joyce, M. V. Giuffrida, and S. A. Tsaftaris, “Multimodal MR synthesis via modality-invariant latent representation,” IEEE Transactions on Medical Imaging, 2017.
 [12] P. Costa, A. Galdran, M. I. Meyer, M. Niemeijer, M. Abràmoff, A. M. Mendonça, and A. Campilho, “End-to-end adversarial retinal image synthesis,” IEEE Transactions on Medical Imaging, 2017.
 [13] T. Joyce, A. Chartsias, and S. A. Tsaftaris, “Robust multi-modal MR image synthesis,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2017, pp. 347–355.
 [14] A. Osokin, A. Chessel, R. E. C. Salas, and F. Vaggi, “GANs for biological image synthesis,” in 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 2252–2261.
 [15] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman, “Toward multimodal image-to-image translation,” in Advances in Neural Information Processing Systems, 2017, pp. 465–476.
 [16] M.-Y. Liu and O. Tuzel, “Coupled generative adversarial networks,” in Advances in Neural Information Processing Systems, 2016, pp. 469–477.

 [17] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” arXiv preprint, 2017.
 [18] J. T. Guibas, T. S. Virdi, and P. S. Li, “Synthetic medical images from dual generative adversarial networks,” arXiv preprint arXiv:1709.01872, 2017.
 [19] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
 [20] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, “Adversarial autoencoders,” arXiv preprint arXiv:1511.05644, 2015.
 [21] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, “Improved training of wasserstein gans,” arXiv preprint arXiv:1704.00028, 2017.
 [22] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
 [23] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets,” in Advances in Neural Information Processing Systems, 2016, pp. 2172–2180.
 [24] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
 [25] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [26] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, “High-resolution image synthesis and semantic manipulation with conditional GANs,” arXiv preprint arXiv:1711.11585, 2017.
 [27] E. L. Denton, S. Chintala, R. Fergus et al., “Deep generative image models using a laplacian pyramid of adversarial networks,” in Advances in neural information processing systems, 2015, pp. 1486–1494.
 [28] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie, “Stacked generative adversarial networks,” arXiv preprint arXiv:1612.04357, 2016.
 [29] M. Arjovsky and L. Bottou, “Towards principled methods for training generative adversarial networks,” arXiv preprint arXiv:1701.04862, 2017.
 [30] G. Litjens, O. Debats, J. Barentsz, N. Karssemeijer, and H. Huisman. (2017) ProstateX challenge data. The Cancer Imaging Archive. [Online]. Available: https://doi.org/10.7937/K9TCIA.2017.MURS5CL
 [31] K. Clark, B. Vendt, K. Smith, J. Freymann, J. Kirby, P. Koppel, S. Moore, S. Phillips, D. Maffitt, and M. Pringle, “The cancer imaging archive (tcia): Maintaining and operating a public information repository,” Journal of Digital Imaging, vol. 26, no. 6, p. 1045, 2013.
 [32] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, G. Klambauer, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” arXiv preprint arXiv:1706.08500, 2017.
 [33] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.

 [34] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
 [35] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in European Conference on Computer Vision. Springer, 2016, pp. 649–666.