DiamondGAN: Unified Multi-Modal Generative Adversarial Networks for MRI Sequences Synthesis

04/29/2019 ∙ by Hongwei Li, et al. ∙ Technische Universität München

Recent studies on medical image synthesis have reported promising results using generative adversarial networks, mostly focusing on one-to-one cross-modality synthesis. Naturally, the idea arises that a target modality would benefit from multi-modal input. Synthesizing MR imaging sequences is highly attractive for clinical practice, as single sequences are often missing or of poor quality (e.g., due to motion). However, existing methods fail to scale up to image volumes with high numbers of modalities and extensive non-aligned volumes, facing common drawbacks of complex multi-modal imaging sequences. To address these limitations, we propose a novel, scalable and multi-modal approach called DiamondGAN. Our model is capable of performing flexible non-aligned cross-modality synthesis and data infill when given multiple modalities or any of their arbitrary subsets. It learns structured information using non-aligned input modalities in an end-to-end fashion. We synthesize two MRI sequences with clinical relevance (i.e., double inversion recovery (DIR) and contrast-enhanced T1 (T1-c)), which are reconstructed from three common MRI sequences. In addition, we perform a multi-rater visual evaluation experiment and find that trained radiologists are unable to distinguish our synthetic DIR images from real ones.




1 Introduction

In clinical practice, imaging datasets often consist of high-dimensional image volumes, with multiple imaging protocols and repeated scans at multiple time points. Given the multiplicity of possible sequence parameters, protocols vary widely between scanners and imaging centers, hindering their comparability. This often leads to repeated exams or severely limits the clinical information that can be drawn from those MRI studies. Particularly in the case of multiple sclerosis, longitudinal comparisons of MRI studies are the main basis for treatment decisions, while lesion quantification tools require complete, identical modalities at multiple time points. Cross-modality image synthesis techniques can potentially resolve those obstacles through efficient data infilling and re-synthesis.

Recently, generative adversarial networks (GANs) have been applied to translating MRI sequences, positron emission tomography (PET) and computed tomography (CT) images. Most of them are one-to-one cross-modality synthesis approaches, for example, PET-CT [1], CT-MRI [16] and MRI sequence translation [4]. A recent multi-modal synthesis method [13] has limited scalability because the input modalities are required to be uniform and spatially aligned. Although there are several multi-domain translation algorithms [9, 3] in the computer vision community, these approaches perform one-to-multiple domain translation but do not model the multiple-to-one domain mapping. Especially in medical image synthesis, multiple-to-one cross-modality mapping is highly relevant, as the proprietary information of individual and non-aligned modalities is probably synergistic.

There are two main challenges in multi-modal cross-modality medical image synthesis: 1) the input modalities are assumed not to be spatially aligned, because registration methods for aligning multiple modalities may fail, which limits the applicability of conventional regression approaches; 2) some specific modalities may be missing due to different clinical settings between centers, so a traditional regression-based data infill would be restricted to the smallest uniform subset or would have to rely on iterative data infill methods. For example, in a CycleGAN [17] setting, one would have to train individual models for every possible combination of the input modalities.


Overall, our contributions to address these challenges are threefold. 1) We propose DiamondGAN, a unified multi-modal generative adversarial network that learns the multiple-to-one cross-modality mapping among non-aligned modalities using only a pair of generators and discriminators, training effectively with a multi-modal cycle-consistency loss function. 2) We provide both qualitative and quantitative results on two clinically relevant MRI sequence synthesis tasks, showing DiamondGAN’s superiority over baseline models. 3) Finally, we present the results of extensive visual evaluation experiments, performed by fourteen experienced radiologists, to confirm the quality of the synthetic images.

2 Methodology

We first describe the proposed DiamondGAN, a unified framework that addresses the multi-modal image synthesis task. Then we present the implementation details. Finally, we introduce our expert rating protocol.

2.1 Multi-Modal Cross-Modality Synthesis

Given an input set of n modalities X = {x_i | i = 1, …, n} and a target modality T, our goal is to learn a generator G that maps multiple input modalities to one target modality. We assume that all the modalities, i.e., X and T, are not spatially aligned for two reasons: 1) it is rather difficult to obtain strictly spatially aligned images, as mentioned in Section 1; 2) the applicability of conventional regression-based methods is limited by the need for aligned modalities. We assume the input modalities can be any subset of X, denoted as X’, during the training and inference stages, because some modalities of a subject may be missing in clinical practice.

We enforce G to be capable of translating any subset X’ into a target modality T using a condition c that indicates the presence of the input modalities, i.e., G(X’, T, c) → T. This condition handles the missing-modality issue and makes the model scalable in both the training and the inference stages. We further introduce a multi-modal cycle-consistency loss to handle the "non-aligned modalities" issue between the input and output. Fig. 1 illustrates the main idea of our proposed approach. We regularly generate the condition c and the corresponding multi-modal data X for all possible combinations, so that G learns to flexibly translate arbitrary multi-modal input. As mentioned in the caption of Fig. 1, we use an availability condition as an indicator of the input modalities. It is spatially replicated to the image size and is one of the two inputs of the two-stream network. In the case of 3 input modalities, the condition c = [1, 1, 1] would indicate that every input modality is given.
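The availability condition and the enumeration of input subsets described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `nonempty_conditions` and `apply_condition` are hypothetical, and zero-filling the channels of missing modalities is an assumption of this sketch (the text does not say how missing channels are represented).

```python
from itertools import combinations

import numpy as np


def nonempty_conditions(n):
    """Return all 2**n - 1 binary availability vectors (empty set excluded)."""
    conditions = []
    for k in range(1, n + 1):
        for present in combinations(range(n), k):
            c = np.zeros(n, dtype=np.float32)
            c[list(present)] = 1.0
            conditions.append(c)
    return conditions


def apply_condition(x, c):
    """Mask missing modalities and build the spatially replicated condition.

    x: array of shape (n, h, w), one channel per input modality.
    c: binary vector of shape (n,), 1 = available, 0 = missing.
    Zero-filling missing channels is an assumption of this sketch.
    """
    n, h, w = x.shape
    masked = x * c[:, None, None]
    # Replicate each entry of c over the full image plane.
    cond_map = np.broadcast_to(c[:, None, None], (n, h, w)).copy()
    return masked, cond_map


conditions = nonempty_conditions(3)  # e.g. Flair, T1, T2
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(3, 240, 240)).astype(np.float32)
masked, cond_map = apply_condition(x, conditions[0])
print(len(conditions), masked.shape, cond_map.shape)
```

With three input modalities this yields seven conditions, one per non-empty subset, and `cond_map` has the same spatial size as the images so that it can be fed to the second stream of the network.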

Figure 1: Left: The high-level idea behind our DiamondGAN, which is capable of learning mappings from any subset of multiple input modalities (X) to a target modality in a single model. This mapping represents a diamond-shaped topology. Right: Overview of DiamondGAN. It consists of two modules, a pair of discriminators D and a pair of generators G. (a) D1 and D2 learn to distinguish between real and synthetic images from the multi-modal input and the target output, respectively. (b) G1 takes both the multi-modal input and the condition as input and generates a target modality. The condition c is a binary vector c = [c_1, …, c_n], where c_i = 0 or 1 indicates whether the corresponding input modality is available (1) or not (0). It is spatially replicated and concatenated with the input image at the feature-map level. (c) G2 tries to generate the original modalities from the synthetic target modality given the original availability condition.

2.1.1 Multi-Modal Reconstruction Loss

We aim to train G to guarantee that the synthetic target modality preserves the content of its input modalities. The input modalities are assumed to be not spatially aligned or not from the same subject, as mentioned above. In this situation, the traditional cycle loss [17] as well as the regression loss [7] would fail to tackle the multi-modal and non-alignment issues. To alleviate the two problems, we extend the traditional cycle-consistency loss [17] to a multi-modal one. Specifically, we concatenate the source modalities into a multi-channel input and define a multi-channel output as the target modality. We then simultaneously train two generators G1 and G2 in a cycle-consistent fashion. Please note that the output target modality has multiple channels, which correspond to the input modalities. The loss function of the generator is defined as:
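As a hedged sketch only, a multi-modal cycle-consistency loss of the kind described here could take the following form. The L1 norm is an assumption carried over from the standard cycle-consistency loss [17], and the notation G_1(X, c), G_2(T, c) and the name L_rec are chosen for this sketch to match the surrounding text rather than taken from the original equation.

```latex
\mathcal{L}_{rec} =
  \mathbb{E}_{X, c}\big[\, \lVert X - G_2(G_1(X, c), c) \rVert_1 \,\big]
  + \mathbb{E}_{T, c}\big[\, \lVert T - G_1(G_2(T, c), c) \rVert_1 \,\big]
```

The first term asks that the input modalities be recoverable from the synthetic target, and the second that the target be recoverable from the synthetic input modalities, which is what lets the model train without spatially aligned pairs.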


2.1.2 Adversarial Loss

To make the generated images indistinguishable from real images, we adopt an adversarial loss:


where G1 generates a target modality G1(X, c), conditioned on the presence of the input modalities X, while D1 tries to distinguish between real and generated input modalities. Similarly, G2 generates the original input modalities G2(T, c), conditioned on the presence of the original input modalities X, and D2 tries to distinguish between the real and the generated target modality. The generators try to minimize this objective, while the discriminators try to maximize it.
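For orientation, an adversarial objective consistent with this description could be written as below. This is a sketch that assumes the standard log-likelihood GAN form and the discriminator roles given in the caption of Fig. 1 (D_1 judges the multi-modal input domain, D_2 judges the target domain); the name L_adv is an assumption of this sketch.

```latex
\mathcal{L}_{adv} =
  \mathbb{E}_{T}\big[\log D_2(T)\big]
  + \mathbb{E}_{X, c}\big[\log\big(1 - D_2(G_1(X, c))\big)\big]
  + \mathbb{E}_{X}\big[\log D_1(X)\big]
  + \mathbb{E}_{T, c}\big[\log\big(1 - D_1(G_2(T, c))\big)\big]
```

Each discriminator term is large when real images are scored near 1 and generated images near 0, so maximizing it trains the discriminators and minimizing it trains the generators.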

2.1.3 Full Objective

Finally, the objective functions to optimize D and G are written respectively, as


where λ is the hyper-parameter that balances the reconstruction loss and the adversarial loss. We use λ = 10 in all of our experiments.
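One way the two objectives could be combined, given that the discriminators maximize the adversarial term while the generators minimize it together with the weighted reconstruction term, is sketched below. The symbols L_D, L_G, L_adv and L_rec are names assumed for this sketch; only the weight of 10 on the reconstruction loss is stated in the text.

```latex
\mathcal{L}_{D} = -\,\mathcal{L}_{adv}, \qquad
\mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda\, \mathcal{L}_{rec}, \qquad \lambda = 10
```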

2.2 Implementation

2.2.1 Two-Stream Network Architecture

To leverage the information from both the input modalities and the corresponding label conditions, we build a two-stream network architecture based on the popular encoder-decoder network [8]. It takes the multi-modal images and the condition as two inputs and merges them at the feature level. This network contains two stride-2 convolutions, several residual blocks and two fractionally strided convolutions (1/2 stride). We use 6 blocks for an input size of n × h × w, where n, h and w are the number of modalities, the height and the width of the images, respectively. The input images and labels pass through two encoders and are merged in the last layer before the decoder. We leverage PatchGANs [8] for the discriminator network, which classifies patch feature maps as real or fake instead of using a fully-connected layer.
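The spatial sizes through such a generator can be checked with a few lines of arithmetic. The sketch below assumes 3 × 3 kernels with padding 1 (and output padding 1 in the transposed convolutions) and a 240-pixel input side; these hyper-parameters are assumptions for illustration and are not stated in the text.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a standard convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride, padding, output_padding):
    """Output length of a transposed (fractionally strided) convolution."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


size = 240
trace = [size]

# Encoder: two stride-2 convolutions halve the resolution twice.
for _ in range(2):
    size = conv_out(size, kernel=3, stride=2, padding=1)
    trace.append(size)

# Six residual blocks with stride-1 convolutions keep the resolution.
for _ in range(6):
    size = conv_out(size, kernel=3, stride=1, padding=1)
trace.append(size)

# Decoder: two fractionally strided convolutions restore the resolution.
for _ in range(2):
    size = deconv_out(size, kernel=3, stride=2, padding=1, output_padding=1)
    trace.append(size)

print(trace)  # [240, 120, 60, 60, 120, 240]
```

Under these assumptions the residual blocks operate on 60 × 60 feature maps, which is also where the image stream and the condition stream would be merged before decoding.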

2.2.2 Training Details

We apply two recent techniques to stabilize the training of the model. First, for the adversarial loss (Eq. 2), we replace the negative log-likelihood objective by a least-squares loss [12]. Second, to reduce model oscillation [5], we update the discriminators using a history of generated images rather than the ones produced by the latest generators, as proposed in [14]. Thus we put the 25 previously generated images in an image buffer. We set λ = 10 in Equation 3 for all the experiments. We use the Adam solver [10] with a batch size of 5. All networks were trained from scratch with a learning rate of 0.0002 for 20 epochs. When given n input modalities, in each epoch the parameters of both the generators and the discriminators are updated 2^n − 1 times, once for each of the 2^n − 1 non-empty subsets of input modalities.
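The two stabilization techniques can be illustrated with a small framework-free sketch. The class name `ImageHistoryBuffer`, the 50% swap probability and the 0.5 factor in the discriminator loss are assumptions borrowed from common CycleGAN-style implementations, not details given in the text; only the buffer size of 25 comes from the paper.

```python
import random


def lsgan_d_loss(d_real, d_fake):
    """Least-squares discriminator loss: real outputs -> 1, fake outputs -> 0."""
    real_term = sum((r - 1.0) ** 2 for r in d_real) / len(d_real)
    fake_term = sum(f ** 2 for f in d_fake) / len(d_fake)
    return 0.5 * (real_term + fake_term)


def lsgan_g_loss(d_fake):
    """Least-squares generator loss: push discriminator outputs on fakes to 1."""
    return sum((f - 1.0) ** 2 for f in d_fake) / len(d_fake)


class ImageHistoryBuffer:
    """Keeps up to `capacity` previously generated images for discriminator updates."""

    def __init__(self, capacity=25, seed=None):
        self.capacity = capacity
        self.images = []
        self.rng = random.Random(seed)

    def query(self, image):
        """Return the image the discriminator should see for this step."""
        if len(self.images) < self.capacity:
            # Buffer not yet full: store the new image and use it directly.
            self.images.append(image)
            return image
        if self.rng.random() < 0.5:
            # Return an older image and keep the new one in its place.
            idx = self.rng.randrange(self.capacity)
            old = self.images[idx]
            self.images[idx] = image
            return old
        return image


buffer = ImageHistoryBuffer(capacity=25, seed=0)
for step in range(100):
    fake = f"fake_{step}"          # stands in for a generated image tensor
    shown = buffer.query(fake)     # what the discriminator is trained on
print(len(buffer.images))
```

Training the discriminators on a mix of current and older generator outputs keeps them from adapting only to the latest generator state, which is the oscillation the history buffer is meant to dampen.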

2.3 Expert Rating and Evaluation Protocol

Quantitative evaluation of generated images in terms of standard scores for errors and correlation remains a debatable task [2]. Additionally, evaluation with common metrics such as PSNR and MAE [15] would not allow us to capture the most important information, i.e., clinically relevant small substructures. Therefore, we strive to obtain experts’ estimates of the image quality. For this task we design a multi-rater quality evaluation experiment.

The neuro-radiologists rated our images in a browser application. In each trial, they were provided with two images. On the left side, one real source image (a T1 or Flair image) was presented. On the right side, a paired image of the target modality was shown, which was either a real image or a generated one. The displayed paired images were randomly chosen from the pool of generated and real images. This particular setup enables the experts to identify very small inconsistencies or implausibilities between the two images immediately. For evaluation, the doctors were asked to rate the plausibility of the image on the right based on the real image on the left. To rate the plausibility, they were asked to assign a 6-star rating, where 6 stars denoted a perfectly plausible image and 1 star a completely implausible image. The images were presented in 280 trials. The sequence of trials was randomized across participants.
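A per-participant randomized trial sequence of this kind can be sketched in a few lines. The function names and the use of the participant index as the random seed are assumptions for illustration; the split of the 280 trials into 210 synthetic and 70 original images is taken from the rating experiment described later in the paper.

```python
import random


def build_trials(n_synthetic=210, n_real=70):
    """List every trial as a (kind, index) pair; 210 + 70 = 280 trials."""
    trials = [("synthetic", i) for i in range(n_synthetic)]
    trials += [("real", i) for i in range(n_real)]
    return trials


def randomized_order(trials, participant_id):
    """Shuffle the trials with a participant-specific, reproducible seed."""
    rng = random.Random(participant_id)
    order = list(trials)
    rng.shuffle(order)
    return order


trials = build_trials()
orders = {pid: randomized_order(trials, pid) for pid in range(14)}
print(len(trials), len(orders))
```

Seeding by participant keeps each rater's sequence reproducible for later analysis while still giving every rater a different order.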

3 Experiments

3.0.1 Datasets

Dataset 1 consists of 65 scans of patients with MS lesions from a local hospital, acquired with a multi-parametric protocol that includes co-registered Flair, T1, T2, double inversion recovery (DIR) and contrast-enhanced T1 (T1-c). The first three modalities are common in most MS lesion exams. DIR is an MRI pulse sequence that suppresses the signal from the cerebrospinal fluid and the white matter, enhancing the inflammatory lesions. T1-c is an MRI sequence that requires a paramagnetic contrast agent (usually gadolinium), which reduces the T1 relaxation time and thereby increases the signal intensity. Synthesizing DIR and T1-c is of clinical relevance because it can substantially reduce labor and costs in clinical practice. We mainly report our results on Dataset 1. The additional Dataset 2 and Dataset 3 are used to demonstrate that our approach can work on multiple datasets with incomplete and non-aligned modalities. They are part of the public MICCAI-WMH challenge dataset [11], which includes 40 subjects with white matter lesions who underwent only T1 and Flair scanning. We replace a large part of the Flair and T1 images in Dataset 1 with the images from Dataset 2 and Dataset 3. We used 2D axial slices for training the network. All the images were cropped or padded to a uniform size of 240 × 240 and the intensity values were rescaled to [-1, 1].

3.0.2 Reconstructing DIR and T1-c from Common Modalities

We perform two image synthesis tasks on two clinically relevant MRI sequences (DIR and T1-c), using three common modalities (i.e., Flair, T1 and T2). We separate Dataset 1 into a training set, a validation set and a test set, resulting in 30 scans (2015 slices for each modality) for training and 35 scans (2100 slices for each modality) for testing. To obtain the optimal hyper-parameters of the model, we use 5 of the 30 training scans as a validation set. DiamondGAN simultaneously learns from multiple input modalities.

A common approach for the quantitative evaluation of medical GAN images is to calculate relative errors and the signal-to-noise ratio between the synthetic image and the real image [15]. Table 1 shows the peak signal-to-noise ratio (PSNR; see footnote 1) and the mean absolute error (MAE) obtained by comparing the synthetic images with the real T1-c and DIR images. For the synthetic DIR and T1-c images, we report the highest PSNR and the lowest MAE for a combined T1+T2+Flair input to our model. In the DIR synthesis experiment, the scores obtained with multiple inputs to our GAN are comparable (MAE 0.058-0.065), whereas the scores for single inputs are substantially worse (MAE 0.073-0.084). For the T1-c synthesis task, we find that any combination of inputs involving the T1 modality (MAE 0.045-0.048) results in better scores than the other inputs. This indicates that our model successfully extracts the relevant information, as T1-c is essentially a T1 scan with a contrast-enhancing agent.

For further comparison, we use CycleGAN [17] to perform one-to-one cross-modality synthesis; the best results of CycleGAN are listed in Table 1. For the synthesis of DIR, using Flair images as the input of CycleGAN achieves the highest PSNR and lowest MAE, while for T1-c, T1 as the input gives the best performance. The proposed model outperforms CycleGAN in both tasks. We further replace a part of the training Flair and T1 images in Dataset 1 with images from Dataset 2 and Dataset 3 (794 images in total for each modality) and find that the result on the same test set is comparable to that obtained with the original Dataset 1.

Footnote 1: Given a real image R and a synthetic image S, PSNR is computed from the mean square error (MSE) between R and S.
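For readers who want to reproduce this kind of score, the two metrics can be computed as in the sketch below. It uses the standard definitions PSNR = 10 log10(range² / MSE) and MAE = mean |R − S|; the choice of `data_range=2.0`, matching intensities rescaled to [-1, 1], is an assumption of this sketch and may differ from the normalization used for Table 1.

```python
import numpy as np


def mae(real, synthetic):
    """Mean absolute error between a real and a synthetic image."""
    return float(np.mean(np.abs(real - synthetic)))


def psnr(real, synthetic, data_range=2.0):
    """Peak signal-to-noise ratio in dB; data_range is the intensity span."""
    mse = float(np.mean((real - synthetic) ** 2))
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))


# Toy example: a constant offset of 0.1 on a 240 x 240 slice.
real = np.zeros((240, 240), dtype=np.float64)
synthetic = real + 0.1
print(round(mae(real, synthetic), 3), round(psnr(real, synthetic), 2))
```

Because PSNR depends on the assumed intensity range, values are only comparable between experiments that use the same normalization.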

Figure 2: Samples of synthetic T1-c and DIR images given the combination of T1, T2 and Flair modalities. Difference images are generated and visualized in heat maps. Our generated images preserve the tissue contrast and the anatomy. However, we find more differences in synthetic DIR images than in synthetic T1-c ones, especially around the brain boundary. This could be due to the alignment error by registration methods.

Method          Input                  DIR PSNR↑   DIR MAE   T1-c PSNR↑   T1-c MAE
CycleGAN [17]   best single input      17.34       0.068     20.36        0.045
DiamondGAN      single modality        15.46       0.084     20.21        0.048
DiamondGAN      single modality        15.99       0.073     19.34        0.054
DiamondGAN      single modality        16.16       0.078     17.15        0.068
DiamondGAN      multiple modalities    17.41       0.065     20.75        0.046
DiamondGAN      multiple modalities    18.58       0.059     19.78        0.051
DiamondGAN      multiple modalities    18.02       0.062     20.40        0.047
DiamondGAN      T1+T2+Flair            18.63       0.058     20.86        0.045

Table 1: Quantitative evaluation of our generated images compared to the real DIR and T1-c images. We evaluate using PSNR and the mean absolute error (MAE) across 2100 testing images. Results show that the generated images benefit from a multi-modal input. ↑ indicates that a higher value corresponds to better image quality.

Figure 3: Box plots comparing the star ratings of the synthesized images and the real images for the T1-c modality on the left and the DIR modality on the right. The mean is shown as numbers in black. A total of 14 neuro-radiologists with a median of 5+ years of professional experience participated in the experiment. The experts rate the generated T1-c images as acceptable when T1 is part of the input. DiamondGAN achieves comparable plausibility levels for the DIR modality.

3.0.3 Rating Experiment by Fourteen Experts

Fourteen neuro-radiologists participated, and each of them evaluated 210 synthetic images and 70 original images. The 210 synthetic images were generated under 6 different input conditions. Specifically, each expert rated 35 samples for each input condition. The ratings of the 14 raters are averaged and the box plots of the results are shown in Figure 3. For the synthesis of T1-c images, we found that three input combinations (i.e., T1, T1+Flair and T1+T2+Flair) gave comparable results, while synthetic T1-c images based solely on Flair were consistently rated as implausible. The plausibility of DIR images synthesized with T1+T2+Flair input was rated on average 0.83 stars higher than that of images synthesized with solely T1 input. This is plausible, as DIR is a complex sequence containing a lot of proprietary information, so its synthesis benefits from more input data. For the synthetic images with T1+T2+Flair input, the experts assigned an almost identical rating to the synthetic and the original images (4.54 stars vs. 4.7 stars).

We conduct Wilcoxon rank-sum tests on the paired rating scores of synthetic and real images from 14 raters under 6 conditions, which results in 6 pairs of 14 observations. Results show that the rating scores on synthetic DIR images with T1+T2+Flair input and on real DIR images are not significantly different (p-value = 0.1432), while all other pairs are significantly different (p-values < 0.0001). This demonstrates that trained radiologists are unable to distinguish our synthetic DIR images from real ones. Furthermore, the experts' ratings for the individual conditions of synthetic images are in agreement with the metric-based evaluation in Table 1. For T1-c synthesis, the PSNR and MAE scores are consistently good when the T1 modality is fed to DiamondGAN.
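A rank-sum comparison of two sets of 14 per-rater scores can be sketched without external libraries as below. This is a minimal version using the normal approximation, with no tie or continuity correction; the rating values in the example are made up for illustration and are not the study's data.

```python
import math


def average_ranks(values):
    """Rank values from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation): returns (z, p)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                          # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2.0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2.0))       # two-sided p-value
    return z, p


# Made-up mean star ratings from 14 hypothetical raters (illustration only).
synthetic_scores = [4.4, 4.6, 4.5, 4.7, 4.3, 4.8, 4.5, 4.6, 4.4, 4.7, 4.5, 4.6, 4.3, 4.7]
real_scores = [4.6, 4.8, 4.7, 4.9, 4.5, 4.7, 4.6, 4.8, 4.5, 4.9, 4.7, 4.6, 4.6, 4.8]
z, p = rank_sum_test(synthetic_scores, real_scores)
print(f"z = {z:.3f}, p = {p:.4f}")
```

Note that the rank-sum test treats the two samples as independent; since the same 14 raters scored both image types, a signed-rank test on the per-rater differences would be the paired alternative.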

4 Conclusion

This work introduces a novel approach for generating MRI sequences using multiple non-aligned modalities, tackling technically challenging problems in cross-modality medical image synthesis. With an extensive multi-rater experiment and statistical tests, we show that the proposed model generates high-quality images from multiple input modalities. The multi-modal approach allows us to uncover the structured information inside existing extensive MRI sequences.