Multi-Domain Image Completion for Random Missing Input Data

07/10/2020 · by Liyue Shen, et al. · Stanford University, NVIDIA

Multi-domain data are widely leveraged in vision applications taking advantage of complementary information from different modalities, e.g., brain tumor segmentation from multi-parametric magnetic resonance imaging (MRI). However, due to possible data corruption and different imaging protocols, the availability of images for each domain could vary amongst multiple data sources in practice, which makes it challenging to build a universal model with a varied set of input data. To tackle this problem, we propose a general approach to complete the random missing domain(s) data in real applications. Specifically, we develop a novel multi-domain image completion method that utilizes a generative adversarial network (GAN) with a representational disentanglement scheme to extract shared skeleton encoding and separate flesh encoding across multiple domains. We further illustrate that the learned representation in multi-domain image completion could be leveraged for high-level tasks, e.g., segmentation, by introducing a unified framework consisting of image completion and segmentation with a shared content encoder. The experiments demonstrate consistent performance improvement on three datasets for brain tumor segmentation, prostate segmentation, and facial expression image completion respectively.


1 Introduction

Multi-domain images are often required as inputs in various vision tasks because different domains provide complementary knowledge. For example, four MRI modalities, native T1, post-contrast T1-weighted (T1Gd), T2-weighted (T2), and FLAIR (FLuid-Attenuated Inversion Recovery), are acquired as a standard protocol to accurately segment the tumor regions for each patient in the brain tumor segmentation task [30]. Different modalities provide distinct features to locate tumor boundaries from differential diagnosis perspectives. Similar scenarios arise in natural image tasks, such as person re-identification across different cameras or times [43, 44]. Here, the medical images in different modalities or the natural images of a person under varied appearances can be considered as different image domains, depicting the same underlying subject or scene from various aspects.

Figure 1: Image translation using (a) MUNIT (1-to-1), (b) StarGAN / Ours (ReMIC) (1-to-n), (c) CollaGAN / ReMIC (n-to-1), and (d) ReMIC (n-to-n). In multi-domain image completion, Ours (ReMIC) completes the missing-domain images given a randomly distributed number k (k-to-n, 1 ≤ k < n) of visible domains in the input. Note the missing-domain images are denoted as blurred images.

However, some image domains might be missing in practice. Especially in a large-scale multi-institute study, it is generally difficult or even infeasible to guarantee the availability of data in all domains for every data entry. For example, some patients might lack certain imaging scans due to different imaging protocols, data loss or image corruption. Since such collected data are rare and valuable, it is costly to simply throw away the incomplete samples during training, and it is infeasible to test with missing-domain inputs. Thus, in order to take the most advantage of such data, it becomes crucial to design an effective data completion algorithm to cope with this challenge. An intuitive approach is to impute the missing domain of one sample with the nearest neighbor from other samples whose corresponding domain image exists. However, this might lack semantic consistency among the different domains of the input sample, as shown in Fig. 2, since it only considers pixel-level similarity to existing images. Another possible solution is to generate images and complete missing domains via image translation from existing domains using generative models, such as GANs, as illustrated in Fig. 1.

In this work, we propose a general n-to-n image completion framework based on a Representational disentanglement scheme for Multi-domain Image Completion (ReMIC). Specifically, our contribution is fourfold: (1) We propose a novel GAN framework for general and flexible n-to-n image generation with representational disentanglement, i.e., learning semantically shared representations across domains (content code) and domain-specific features (style code) for each input domain; (2) We demonstrate the learned content code could be utilized for a high-level task, i.e., developing a unified framework for jointly learning image completion and segmentation based on a shared content encoder; (3) We demonstrate the proposed n-to-n image generation model can effectively complete the missing domains given a randomly distributed number (k-to-n, 1 ≤ k < n) of visible domains in the input; (4) Experiments on three datasets illustrate that the proposed method consistently achieves better performance than previous approaches in both multi-domain image completion and missing-domain segmentation.

Figure 2: BraTS images in four modalities with nearest neighbors and generated images from the proposed method (ReMIC). As shown by the brain tumor segmentation predictions, the generated images preserve better semantic consistency with the ground truth in addition to pixel-level similarity in the images.

2 Related Work

Image-to-Image Translation The recent success of GANs [8, 32, 15, 45, 46, 17, 27, 5, 41, 20, 42, 7, 36] in image-to-image translation provides a promising solution to deal with the challenge of missing image domains. CycleGAN [45] shows impressive performance in image-to-image translation via cycle-consistency between real and generated images. However, it mainly focuses on 1-to-1 mapping between two domains and assumes corresponding images in the two domains strictly share the same representation in latent space. This is limited in multi-domain applications since a separate CycleGAN model is required for each pair of domains, i.e., n(n-1)/2 models for n domains. Following this, StarGAN [5] proposes to use a mask vector in the inputs to specify the desired target domain in multi-domain image generation. Meanwhile, RadialGAN [41] also deals with the multi-domain generation problem by assuming all the domains share the same latent space. Although these works make it possible to generate images in different target domains through a 1-to-n mapping with multiple inference passes, the representation learning and image generation are always conditioned on a single input image as the only source domain. In order to take advantage of multiple available domains, CollaGAN [20] proposes a collaborative model to incorporate multiple domains for generating one missing domain. Similar to StarGAN, CollaGAN relies on cycle-consistency to preserve the contents in the generated images, which is an indirect and implicit constraint for target domain images. Additionally, since the target domain is specified by a one-hot mask vector in the input, CollaGAN essentially performs n-to-1 translation with a single output per inference pass. As illustrated in Fig. 1, our proposed model is a more general n-to-n image generation framework that overcomes the aforementioned limitations.

Learning Disentangled Representations Recently, learning disentangled representations has been proposed to capture the full distribution of possible outputs by introducing a random style code [4, 10, 12, 22, 23, 24], or to transfer information across domains for adaptation [28, 26]. InfoGAN [4] and β-VAE [10] learn disentangled representations in an unsupervised manner. In image translation, DRIT [22] disentangles content and attribute features by exchanging the features encoded from two domains respectively. The image consistency during translation is constrained by the code and image reconstruction. With a similar code exchange scheme, MUNIT [12] assumes a prior distribution on the style code, which allows directly sampling style codes from the prior distribution to generate target domain images. However, both DRIT and MUNIT only deal with image translation between two domains, which requires independently training separate translation models for every pair of the n domains. While the recent work [26] also tackles multi-domain image translation, it focuses more on learning a cross-domain latent code for domain adaptation with less discussion of the domain-specific style code. Moreover, our proposed method handles a more challenging problem with random missing domains motivated by practical medical applications. Aiming at higher completion accuracy for the segmentation task with missing domains, we further add reconstruction and segmentation constraints in our framework.

Medical Image Synthesis Synthesizing medical images has attracted increasing interest in recent research [42, 7, 36, 13, 14, 37, 6, 16, 33, 47]. The synthesized images are generated across multi-contrast MRI modalities or between MRI and computed tomography (CT). [39, 9, 3] also discuss how to extract representations from multiple modalities, especially for segmentation with missing imaging modalities. However, these studies mostly focus on how to fuse the features from multiple modalities rather than approaching the problem from the perspective of representation disentanglement. Our model disentangles the shared content and separate style representations for a more general n-to-n multi-domain image completion task, and we further validate that the generation benefits the segmentation task.

3 Method

Figure 3: Overview of the proposed n-to-n multi-domain completion and segmentation framework. Two of the n domains are missing in this example. Our model contains a unified content encoder (red lines), domain-specific style encoders (orange lines) and generators (blue lines). A variety of losses are adopted (burgundy lines), i.e., an image consistency loss for visible domains, latent consistency losses for the content and style codes, an adversarial loss, and a reconstruction loss for the generated images. Furthermore, the representational learning framework combines a segmentation generator following the content code for unified image generation and segmentation.

Images from different domains of the same sample present their own exclusive features of the subject. Nonetheless, they also inherit some global content structures. For instance, in multi-parametric MRI for brain tumors, T2 and FLAIR MRI highlight the differences in tissues' water relaxational properties, which distinguishes tumor tissue from normal tissue. Contrast-enhanced T1 MRI can examine the pathological intratumoral uptake of contrast agents so that the boundary between the tumor core and the rest is highlighted. However, the underlying anatomical structure of the brain is shared by all these modalities. With the availability of multiple domain data, it is meaningful to decompose the images into the shared content structure (skeleton) and their unique characteristics (flesh) through learning. Therefore, we are able to reconstruct a missing image during testing by using the shared skeleton (extracted from the available data domains) and a sampled flesh from the learned model. Without assuming a fixed set of missing domains during training, the learned framework can flexibly handle one or more missing domains in a random set. In addition, we further enforce the accuracy of the extracted content structure by connecting it to the segmentation task. In this manner, the disentangled representations of multi-domain images (both the skeleton and flesh) can help both image completion and segmentation.

Suppose there are n domains: D_1, D_2, ..., D_n. Let x_1, x_2, ..., x_n be the images from the n different domains respectively, which are grouped as one sample describing the same subject. Assume the dataset contains N independent data samples in total. For each sample, one or more of the n domain images might be randomly missing, i.e., the number and the categories of missing domains are both random. The goal of our first task is to complete all the missing domains for a random input sample.

To accomplish the completion of all missing domains from a random set of available domains, we assume the n domains share a latent representation of the underlying structure. We name this shared latent representation the content code; meanwhile, each domain also exclusively contains a domain-specific latent representation, i.e., the style code, that is related to the various characteristics or attributes of the different domains. The missing domains can be reconstructed from these two aspects of information through the learning of deep neural networks. Similar to the setting in MUNIT [12], we assume a Gaussian prior distribution for the style latent code to capture the full distribution of possible styles in each domain. However, MUNIT trains a separate content encoder for each domain and enforces the disentanglement via coupled cross-domain translation during training, while our method employs a single content encoder to extract the anatomical representation shared across all the domains.

3.1 Unified Image Completion and Segmentation

As shown in Fig. 3, our model contains a unified content encoder E_c and n domain-specific style encoders E_i^s (i = 1, ..., n), where n is the total number of domains. The content encoder E_c extracts the shared content code c from all existing domains: c = E_c(x_1, ..., x_n). For the missing domains, we use zero padding in the corresponding input channels. For each domain, a style encoder E_i^s learns the domain-specific style code s_i from the corresponding domain image x_i respectively: s_i = E_i^s(x_i).

During training, our model captures the shared content code c and the separate style codes s_i (i = 1, ..., n) through the disentanglement process (denoted as red and orange arrows respectively in Fig. 3) with a random set of input images (in the green box). In Fig. 4, we visualize the extracted content codes (randomly selected 8 out of 256 channels) of one BraTS image sample. Various focuses on different anatomical structures, e.g., tumor, brain, skull, are demonstrated by the different channel-wise feature maps. Together with individual style codes sampled from a Gaussian prior distribution, we only need to train one single ReMIC model to complete the multiple missing domains in the inputs.

In the image generation process (denoted as blue arrows in Fig. 3), our model samples style codes from the prior distribution and integrates them with the content code to generate images in all n domains through generators G_i (i = 1, ..., n). The generator for each domain generates an image in the corresponding domain from the domain-shared content code and the domain-specific style code: x̂_i = G_i(c, s_i).

Additionally, we extend the introduced image completion framework to a more practical scenario, i.e., tackling the missing data problem in image segmentation. Specifically, another branch with a segmentation generator G_s is added after the content code to generate the segmentation masks of the input images. Our underlying assumption is that the domain-shared content code contains the essential image structure information for the segmentation task. By simultaneously optimizing the generation losses and the segmentation Dice loss (detailed in Section 3.2), the model can adaptively learn how to generate missing images that improve the segmentation performance.
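To make the data flow concrete, the following is a minimal PyTorch-style sketch of the architecture described above. The module layout, channel sizes and the 8-dimensional style code are illustrative assumptions drawn from the text and appendix; it is not the authors' released implementation.

```python
# A minimal sketch of the ReMIC forward pass (assumed module layout, not the
# authors' released code). Missing domains enter the unified content encoder
# as zero-filled channels; their styles are sampled from the Gaussian prior.
import torch
import torch.nn as nn


def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, 1, 1),
                         nn.InstanceNorm2d(cout), nn.ReLU(inplace=True))


class ReMICSketch(nn.Module):
    def __init__(self, n_domains=4, style_dim=8, ch=64, n_classes=2):
        super().__init__()
        self.n, self.style_dim = n_domains, style_dim
        # Unified content encoder E_c over all n domains stacked as channels.
        self.content_enc = nn.Sequential(conv_block(n_domains, ch), conv_block(ch, ch))
        # One style encoder E_i^s per domain: conv -> global pool -> 8-dim code.
        self.style_encs = nn.ModuleList(
            nn.Sequential(conv_block(1, ch), nn.AdaptiveAvgPool2d(1),
                          nn.Flatten(), nn.Linear(ch, style_dim))
            for _ in range(n_domains))
        # One generator G_i per domain: (content, broadcast style) -> image.
        self.generators = nn.ModuleList(
            nn.Sequential(conv_block(ch + style_dim, ch), nn.Conv2d(ch, 1, 3, 1, 1), nn.Tanh())
            for _ in range(n_domains))
        # Segmentation generator G_s on top of the shared content code.
        self.seg_head = nn.Conv2d(ch, n_classes, 3, 1, 1)

    def forward(self, images, visible):
        # images: list of n tensors (B, 1, H, W); visible: list of n booleans.
        x = torch.cat([im if v else torch.zeros_like(im)
                       for im, v in zip(images, visible)], dim=1)
        c = self.content_enc(x)                                   # shared content code
        outs = []
        for i in range(self.n):
            s = self.style_encs[i](images[i]) if visible[i] else \
                torch.randn(x.size(0), self.style_dim, device=x.device)
            s_map = s[:, :, None, None].expand(-1, -1, c.size(2), c.size(3))
            outs.append(self.generators[i](torch.cat([c, s_map], dim=1)))
        return outs, self.seg_head(c)
```

A single trained model of this form can be queried with any subset of visible domains, which is what enables the random k-to-n completion illustrated in Fig. 1(d).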

3.2 Training Loss

In the training of GAN models, the setting of losses is of paramount importance to the final generation results. Our loss functions contain the cycle-consistency loss of images and latent codes, adversarial loss and reconstruction loss on the generated and input images.

Image Consistency Loss: For each sample, the proposed model extracts a domain-shared content code and domain-specific style codes from the visible domains. Then, by recombining the content and style codes, the domain generators are expected to recover the input images. The image consistency loss constrains the reconstructed images to match the real images, following the “Image → Code → Image” direction in Fig. 3.

L_rec^{x_i} = E_{x_i ∼ p(x_i)} [ ‖ G_i(E_c(x), E_i^s(x_i)) − x_i ‖_1 ]     (1)

where p(x_i) is the data distribution in domain D_i (i = 1, ..., n). Here, we use the L1 loss to strengthen anatomical-structure-related generation.
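As a hedged sketch (building on the hypothetical ReMICSketch module above), the image consistency term can be written as an L1 penalty on the “Image → Code → Image” path for visible domains:

```python
import torch
import torch.nn.functional as F


def image_consistency_loss(model, images, visible):
    # L1 between each visible real image and its re-generation from the
    # (shared content, encoded style) pair, i.e. the "Image -> Code -> Image" path.
    x = torch.cat([im if v else torch.zeros_like(im)
                   for im, v in zip(images, visible)], dim=1)
    c = model.content_enc(x)
    loss = x.new_zeros(())
    for i, (im, v) in enumerate(zip(images, visible)):
        if not v:
            continue                          # only visible domains are constrained
        s = model.style_encs[i](im)           # style encoded from the real image
        s_map = s[:, :, None, None].expand(-1, -1, c.size(2), c.size(3))
        recon = model.generators[i](torch.cat([c, s_map], dim=1))
        loss = loss + F.l1_loss(recon, im)
    return loss
```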

Latent Consistency Loss: The latent consistency loss constrains the learning of both content and style codes before decoding and after encoding, following the “Code → Image → Code” direction.

L_rec^{c} = E_{c ∼ p(c), s_i ∼ q(s_i)} [ ‖ E_c(G_1(c, s_1), ..., G_n(c, s_n)) − c ‖_1 ]     (2)
L_rec^{s_i} = E_{c ∼ p(c), s_i ∼ q(s_i)} [ ‖ E_i^s(G_i(c, s_i)) − s_i ‖_1 ]     (3)

where q(s_i) is the prior distribution of the style code, q(s_i) = N(0, I); p(c) is given by c = E_c(x) with x ∼ p(x), i.e., the content code is sampled by first sampling images from the data distribution. Specifically, taking BraTS data as an example, the style distribution contains the various domain-specific characteristics of each domain, such as varied image contrasts, while the content distribution contains the various anatomy-related structural features across different brain subjects, as shown in Fig. 4.
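A corresponding sketch of the latent (“Code → Image → Code”) consistency terms, again using the hypothetical ReMICSketch module and the standard normal style prior assumed above:

```python
import torch
import torch.nn.functional as F


def latent_consistency_loss(model, images, visible):
    # Generate each domain from the shared content code and a sampled style code,
    # then check that re-encoding recovers both codes ("Code -> Image -> Code").
    x = torch.cat([im if v else torch.zeros_like(im)
                   for im, v in zip(images, visible)], dim=1)
    c = model.content_enc(x)                  # p(c): content of a sampled real image
    fakes, styles = [], []
    for i in range(model.n):
        s = torch.randn(x.size(0), model.style_dim, device=x.device)   # s_i ~ q(s_i) = N(0, I)
        s_map = s[:, :, None, None].expand(-1, -1, c.size(2), c.size(3))
        fakes.append(model.generators[i](torch.cat([c, s_map], dim=1)))
        styles.append(s)
    # Style consistency: each style encoder should recover its sampled code.
    loss_s = sum(F.l1_loss(model.style_encs[i](fakes[i]), styles[i]) for i in range(model.n))
    # Content consistency: re-encoding all generated domains should recover c.
    loss_c = F.l1_loss(model.content_enc(torch.cat(fakes, dim=1)), c)
    return loss_c, loss_s
```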

Figure 4: Content codes visualization in BraTS image generation. The first 4 images are ground truth modalities.

Adversarial Loss: The adversarial learning between generators and discriminators forces the data distribution of the generated images to be close to that of the real images for each domain.

L_adv^{x_i} = E_{c ∼ p(c), s_i ∼ q(s_i)} [ log(1 − D_i(G_i(c, s_i))) ] + E_{x_i ∼ p(x_i)} [ log D_i(x_i) ]     (4)

where D_i is the discriminator for domain D_i, which distinguishes the generated images G_i(c, s_i) from the real images x_i.
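The appendix states that the LSGAN objective [29] is used in practice; a minimal per-domain sketch of that least-squares formulation (function names are ours):

```python
import torch
import torch.nn.functional as F


def lsgan_d_loss(disc, real, fake):
    # Discriminator: push predictions on real images toward 1, on fakes toward 0.
    pred_real, pred_fake = disc(real), disc(fake.detach())
    return F.mse_loss(pred_real, torch.ones_like(pred_real)) + \
           F.mse_loss(pred_fake, torch.zeros_like(pred_fake))


def lsgan_g_loss(disc, fake):
    # Generator: make the discriminator output 1 on generated images.
    pred = disc(fake)
    return F.mse_loss(pred, torch.ones_like(pred))
```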

Reconstruction Loss: In addition to the feature-level consistency above, which constrains the relationship between generated and real images across domains, we also constrain the pixel-level similarity between the generated images and the ground truth images in the same domain during the training stage, so that the missing domains are completed accurately given the visible images of the current subject or scene.

L_rec = Σ_{i=1}^{n} E_{x ∼ p(x), s_i ∼ q(s_i)} [ ‖ G_i(E_c(x), s_i) − x_i ‖_1 ]     (5)

Segmentation Loss: In the n-to-n image translation, the model learns a complementary representation of the multiple domains, which can further facilitate high-level tasks. For instance, the extracted content code (containing the underlying anatomical structures) may benefit the segmentation of organs and lesions in medical image analysis, and vice versa. Therefore, we train a multi-task network for both segmentation and generation. In the proposed framework, we construct a unified generation and segmentation model by adding a segmentation generator G_s following the content code from the completed images, as shown in Fig. 3. We utilize the Dice loss [31, 35] for accurate segmentation from multi-domain images:

L_seg = 1 − (1/C) Σ_{c=1}^{C} [ 2 Σ_v p_{c,v} y_{c,v} / ( Σ_v p_{c,v} + Σ_v y_{c,v} ) ]     (6)

where C is the total number of classes, v is the spatial position index in the image, p_{c,v} is the predicted segmentation probability for class c at position v from the segmentation generator G_s, and y_{c,v} is the corresponding ground truth segmentation mask for class c. The segmentation loss can optionally be added to the total loss in Eq. 7 for end-to-end joint learning.
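Since the prose fully specifies the Dice formulation (a class average over spatial positions), a straightforward soft Dice loss can be sketched as follows:

```python
def soft_dice_loss(probs, target, eps=1e-6):
    # probs:  (B, C, H, W) predicted class probabilities from the segmentation head
    # target: (B, C, H, W) one-hot ground-truth masks
    dims = (0, 2, 3)                                   # sum over batch and spatial positions
    intersection = (probs * target).sum(dims)
    denom = probs.sum(dims) + target.sum(dims)
    dice_per_class = (2.0 * intersection + eps) / (denom + eps)
    return 1.0 - dice_per_class.mean()                 # average over the C classes
```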

Total Loss: The encoders, generators, discriminators (and segmentor) are jointly trained to optimize the total objective as follows

L_total = Σ_{i=1}^{n} ( L_adv^{x_i} + λ_x L_rec^{x_i} + λ_s L_rec^{s_i} ) + λ_c L_rec^{c} + λ_r L_rec + λ_seg L_seg     (7)

where λ_x, λ_c, λ_s, λ_r, and λ_seg are hyper-parameters that balance the losses. Please note that the segmentation loss is included in the total training loss only when we train the unified generation and segmentation model for the BraTS and ProstateX datasets.
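Putting the pieces together, below is a hedged sketch of one training step's generator-side objective using the helper functions above. The weight values shown are placeholders, since the paper's actual hyper-parameter values are not recoverable from this copy of the text, and the per-domain discriminator list is an assumed interface.

```python
import torch
import torch.nn.functional as F


def total_loss(model, discs, images, visible, seg_target=None,
               lam_x=1.0, lam_c=1.0, lam_s=1.0, lam_r=1.0, lam_seg=1.0):
    # discs: list of per-domain discriminators (assumed interface).
    l_img = image_consistency_loss(model, images, visible)
    l_c, l_s = latent_consistency_loss(model, images, visible)
    outs, seg_logits = model(images, visible)
    l_adv = sum(lsgan_g_loss(discs[i], outs[i]) for i in range(model.n))
    l_rec = sum(F.l1_loss(outs[i], images[i]) for i in range(model.n))   # Eq. 5, vs. ground truth
    loss = l_adv + lam_x * l_img + lam_c * l_c + lam_s * l_s + lam_r * l_rec
    if seg_target is not None:                        # unified generation + segmentation
        loss = loss + lam_seg * soft_dice_loss(torch.softmax(seg_logits, dim=1), seg_target)
    return loss
```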

4 Experiments

To validate the feasibility and generalization of the proposed model, we conduct experiments on two medical image datasets as well as a natural image dataset: BraTS, ProstateX, and RaFD. We first demonstrate the advantage of the proposed method in the n-to-n multi-domain image completion task given a random set of visible domains. Moreover, we illustrate that the proposed model (a variant with two branches for image translation and segmentation) provides an efficient solution to multi-domain segmentation with missing-domain inputs.

BraTS: The Multimodal Brain Tumor Segmentation Challenge (BraTS) 2018 [30, 1, 2] provides multi-modal brain MRI with four modalities: a) native (T1), b) post-contrast T1-weighted (T1Gd), c) T2-weighted (T2), and d) T2 Fluid Attenuated Inversion Recovery (FLAIR). Following CollaGAN [21], 218 and 28 subjects are randomly selected for training and testing. A set of 2D slices is extracted from the 3D volumes for the four modalities respectively. In total, the training and testing sets contain 40,148 and 5,340 images. The slices are resized from their original resolution to the network input size. Three categories are labeled for brain tumor segmentation, i.e., enhancing tumor (ET), tumor core (TC), and whole tumor (WT).

ProstateX: The ProstateX dataset [25] contains multi-parametric prostate MR scans for 98 subjects. Each sample contains three modalities: 1) T2-weighted (T2), 2) Apparent Diffusion Coefficient (ADC), and 3) high b-value DWI (HighB). We randomly split the data into 78 and 20 subjects for training and testing respectively. By extracting 2D slices from the 3D volumes, the training and testing sets contain 3,540 and 840 images in total. The slices are resized to the network input size. Prostate regions are manually labeled as the whole prostate (WP) by board-certified radiologists.

RaFD: The Radboud Faces Database (RaFD) [19] contains eight facial expressions collected from 67 participants: neutral, angry, contemptuous, disgusted, fearful, happy, sad, and surprised. Following StarGAN [5], we adopt images from three camera angles (45°, 90°, 135°) with three gaze directions (left, frontal, right), obtaining 4,824 images in total. The data is randomly split into a training set of 54 participants (3,888 images) and a testing set of 13 participants (936 images). We crop each image with the face in the center and then resize it to the network input size.

In all experiments, the loss weights λ_x, λ_c, λ_s, and λ_r are fixed, and λ_seg is used only when L_seg is included in Eq. 7. The adversarial loss and consistency losses follow the same loss-weight choices as in [12], which reported the necessity of the consistency losses in its ablation study. In the following, we demonstrate ablation studies on the reconstruction and segmentation losses.

5 Results

5.1 Results of Multi-Domain Image Completion

For comparison purposes, we first assume there is only one missing domain for each data sample. In training, the single missing domain is randomly distributed among all the domains. During testing, we fix one missing domain in the inputs at a time and evaluate the generation outputs only on that missing modality; the corresponding results are reported in one column (modality) of Tables 1 and 2. Multiple metrics are used to measure the similarity between the generated and the target images, i.e., normalized root mean-squared error (NRMSE), mean structural similarity index (SSIM), and peak signal-to-noise ratio (PSNR). We compare our results with previous methods on all three datasets. The results of the proposed method (“ReMIC”) and of ReMIC without the reconstruction loss (“ReMIC w/o Recon”) are reported.
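For reference, the three reported similarity metrics can be computed per generated slice with scikit-image. Exact normalization choices (e.g., the data range) may differ from the paper's evaluation, so treat this as a sketch:

```python
import numpy as np
from skimage.metrics import (normalized_root_mse,
                             structural_similarity,
                             peak_signal_noise_ratio)


def completion_metrics(target, generated):
    # target, generated: 2D numpy arrays of one modality on the same intensity scale.
    data_range = float(target.max() - target.min())
    return {
        "NRMSE": normalized_root_mse(target, generated),
        "SSIM": structural_similarity(target, generated, data_range=data_range),
        "PSNR": peak_signal_noise_ratio(target, generated, data_range=data_range),
    }
```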

Moreover, we investigate a more practical scenario where there is more than one missing domain and show that our proposed method is capable of handling general random k-to-n image completion. In this setting, we assume the set of missing domains in the training data is randomly distributed, i.e., each training sample has k randomly selected visible domains, where k ≥ 1. During testing, we fix the number of visible domains k while these available domains are also randomly distributed among the n domains. We evaluate all the generated images in the outputs, showing results in all columns (modalities) of Tables 1 and 2. “ReMIC-Random(k)” denotes evaluation on the test set with k random visible domains, i.e., random missing domains for the rest. Note that by leveraging the unified content code and sampling the style code for each domain respectively, the proposed model can handle any number of missing domains, which makes it more general and flexible for random k-to-n image completion as shown in Fig. 1(d). We compare our model with the following methods:

MUNIT [12] conducts 1-to-1 image translation between two domains through representational disentanglement, as shown in Fig. 1(a). In the RaFD experiments, we train and test MUNIT models between every pair of two domains. Without loss of generality, we use the “neutral” image to generate all the other domains following the StarGAN setting, and the “angry” image is used to generate the “neutral” image. In BraTS, the typical modality “T1” is used to generate the other domains while “T1” is generated from “T1Gd”. Similarly, “T2” is used to generate the other domains in ProstateX while it is generated from “ADC”.

StarGAN [5] adopts a mask vector to generate an image in the specified target domain. In this way, different target domains can be generated from one source domain in multiple inference passes. This is effectively 1-to-n image translation as in Fig. 1(b). Since only one domain can be used as input in StarGAN, we use the same domain pair matching as MUNIT, following the same setting as in [5].

(a) Single missing modality.
(b) Multiple missing modalities.
Figure 5: (a) BraTS image generation results with a single missing modality. Rows: 4 modalities. Columns: compared methods. (b) BraTS image generation results with multiple missing modalities (in columns). The ground truth image for each modality is shown in (a). Rows: only the first k domains (from left to right) are given in the inputs.

CollaGAN [20, 21] carries out the n-to-1 image translation in Fig. 1(c), where multiple source domains collaboratively generate one target domain that is assumed missing in the inputs. However, it does not deal with multiple missing domains. In the CollaGAN experiments, we use the same domain generation setting as ours, i.e., we fix one missing domain in the inputs and generate it from all the other domains.

(a) Single missing modality.
(b) Multiple missing modalities.
Figure 6: (a) ProstateX image generation results with a single missing modality. Rows: 3 modalities. Columns: compared methods. (b) ProstateX image generation results with multiple missing modalities (in columns). The ground truth image for each modality is shown in (a). Rows: only the first k domains (from left to right) are given in the inputs.
(a) BraTS

 

Methods T1 T1Gd T2 FLAIR
NRMSE (↓) / SSIM (↑) / PSNR (↑) reported for each modality

 

MUNIT [12] 0.3709 / 0.9076 / 23.2385 0.2827 / 0.9221 / 27.3836 0.4073 / 0.8757 / 22.8936 0.4576 / 0.8702 / 21.5568
StarGAN [5] 0.3233 / 0.9282 / 24.2840 0.2718 / 0.9367 / 27.6901 0.5002 / 0.8464 / 21.3614 0.4642 / 0.8855 / 22.0483
CollaGAN [20] 0.4800 / 0.8954 / 21.2803 0.4910 / 0.8706 / 22.9042 0.5310 / 0.8886 / 21.2163 0.4231 / 0.8635 / 22.4188

 

ReMIC w/o Recon 0.3366 / 0.9401 / 24.5787 0.2398 / 0.9435 / 28.8571 0.3865 / 0.9011 / 23.4876 0.3650 / 0.8978 / 23.5918
ReMIC 0.2008 / 0.9618 / 28.5508 0.2375 / 0.9521 / 29.1628 0.2481 / 0.9457 / 27.4829 0.2469 / 0.9367 / 27.1540

 

ReMIC-Random(k=1) 0.2263 / 0.9603 / 27.5198 0.2118 / 0.9600 / 30.5945 0.2566 / 0.9475 / 27.7646 0.2742 / 0.9399 / 26.8257
ReMIC-Random(k=2) 0.1665 / 0.9751 / 30.8579 0.1697 / 0.9730 / 32.7615 0.1992 / 0.9659 / 30.3789 0.2027 / 0.9591 / 29.7351
ReMIC-Random(k=3) 0.1274 / 0.9836 / 33.2458 0.1405 / 0.9812 / 34.3967 0.1511 / 0.9788 / 32.6743 0.1586 / 0.9724 / 31.8967

 

(b) ProstateX

 

Methods T2 ADC HighB
NRMSE (↓) / SSIM (↑) / PSNR (↑) reported for each modality

 

MUNIT [12] 0.6904 / 0.4428 / 15.6308 0.9208 / 0.4297 / 13.8983 0.9325 / 0.5383 / 16.9616
StarGAN [5] 0.6638 / 0.4229 / 15.9468 0.9157 / 0.3665 / 13.8014 0.9188 / 0.4350 / 17.1168
CollaGAN [20] 0.8070 / 0.2667 / 14.2640 0.7621 / 0.4875 / 15.4242 0.7722 / 0.6824 / 18.6481

 

ReMIC w/o Recon 0.8567 / 0.3330 / 13.6738 0.7289 / 0.5377 / 15.7083 0.8469 / 0.7818 / 17.8987
ReMIC 0.4908 / 0.5427 / 18.6200 0.2179 / 0.9232 / 26.6150 0.3894 / 0.9150 / 24.7927

 

ReMIC-Random(k=1) 0.3786 / 0.6569 / 22.5314 0.2959 / 0.8256 / 26.9485 0.4091 / 0.8439 / 27.7499
ReMIC-Random(k=2) 0.2340 / 0.8166 / 27.0598 0.1224 / 0.9664 / 33.2475 0.1958 / 0.9587 / 34.4775

 

Table 1: BraTS and ProstateX multi-domain image completion results.

Results of medical image generation: Fig. 5(a) and Fig. 6(a) show the results of image completion (modalities in rows) on the BraTS and ProstateX data in comparison to other methods [12, 5, 20] (methods in columns). Each cell shows the generated image when the corresponding modality is missing in the inputs. The corresponding quantitative results averaged across all testing data are shown in Table 1. In comparison, our model generates better results in meaningful details, e.g., a more accurate salient tumor region in BraTS, and the prostate regions are better preserved in ProstateX. This is achieved by learning a better content code through the factorized latent space in our method, which is essential for preserving the anatomical structures in medical images. Furthermore, we illustrate the generation results when multiple modalities are missing in the BraTS and ProstateX datasets. We show the results in the rows of Fig. 5(b) and Fig. 6(b), where images are generated when only the first k modalities (from left to right) are given in the inputs. The averaged quantitative results for random k-to-n image generation are denoted as “ReMIC-Random(k)” in Table 1.

Figure 7: RaFD image generation results with a single missing modality. Columns: 8 facial expressions. Rows: compared methods.
Figure 8: RaFD image generation results with multiple missing modalities (in columns). The ground truth image for each modality is shown in the “Target” row of Fig. 7. Rows: only the first k domains (from left to right) are given in the inputs.

 

Methods Neutral Angry Contemptuous Disgusted
NRMSE (↓) / SSIM (↑) / PSNR (↑) reported for each expression

 

MUNIT [12] 0.1589 / 0.8177 / 19.8469 0.1637 / 0.8156 / 19.7303 0.1518 / 0.8319 / 20.2793 0.1563 / 0.8114 / 19.9362
StarGAN [5] 0.1726 / 0.8206 / 19.2725 0.1722 / 0.8245 / 19.4336 0.1459 / 0.8506 / 20.7605 0.1556 / 0.8243 / 20.0036
CollaGAN [20] 0.1867 / 0.7934 / 18.3691 0.1761 / 0.7736 / 18.8678 0.1856 / 0.7928 / 18.4040 0.1823 / 0.7812 / 18.5160

 

ReMIC w/o Recon 0.1215 / 0.8776 / 22.2963 0.1335 / 0.8556 / 21.4615 0.1192 / 0.8740 / 22.4073 0.1206 / 0.8559 / 22.1819
ReMIC 0.1225 / 0.8794 / 22.2679 0.1290 / 0.8598 / 21.7570 0.1217 / 0.8725 / 22.2414 0.1177 / 0.8668 / 22.4135

 

ReMIC-Random(k=1) 0.1496 / 0.8317 / 20.7821 0.1413 / 0.8368 / 21.5096 0.1407 / 0.8348 / 21.2486 0.1394 / 0.8352 / 21.4443
ReMIC-Random(k=4) 0.0990 / 0.9014 / 24.7746 0.0988 / 0.8964 / 24.8327 0.0913 / 0.9048 / 25.2826 0.0969 / 0.8934 / 24.8231
ReMIC-Random(k=7) 0.0756 / 0.9280 / 26.6861 0.0679 / 0.9332 / 27.4557 0.0665 / 0.9346 / 27.5942 0.0675 / 0.9308 / 27.3955

 

Methods Fearful Happy Sad Surprised
NRMSE (↓) / SSIM (↑) / PSNR (↑) reported for each expression

 

MUNIT [12] 0.1714 / 0.7792 / 19.1714 0.1623 / 0.8073 / 19.7709 0.1677 / 0.7998 / 19.3867 0.1694 / 0.7884 / 19.3867
StarGAN [5] 0.1685 / 0.7943 / 19.3516 0.1522 / 0.8288 / 20.4397 0.1620 / 0.8227 / 19.7368 0.1634 / 0.7974 / 19.6744
CollaGAN [20] 0.1907 / 0.7442 / 18.1518 0.1829 / 0.7601 / 18.5503 0.1783 / 0.7766 / 18.7450 0.1888 / 0.7495 / 18.2169

 

ReMIC w/o Recon 0.1321 / 0.8384 / 21.4604 0.1399 / 0.8332 / 20.9334 0.1284 / 0.8597 / 21.7430 0.1333 / 0.8347 / 21.3782
ReMIC 0.1316 / 0.8395 / 21.5295 0.1383 / 0.8406/ 21.0465 0.1301 / 0.8581 / 21.6384 0.1276 / 0.8484 / 21.7793

 

ReMIC-Random(k=1) 0.1479 / 0.8132 / 21.0039 0.1567 / 0.8121 / 20.3798 0.1491 / 0.8244 / 20.6888 0.1434 / 0.8218 / 21.2411
ReMIC-Random(k=4) 0.1043 / 0.8769 / 24.2623 0.1065 / 0.8852 / 23.9813 0.0960 / 0.8971 / 24.9114 0.1022 / 0.8835 / 24.2613
ReMIC-Random(k=7) 0.0769 / 0.9209 / 26.5362 0.0794 / 0.9200 / 26.1515 0.0729 / 0.9291 / 26.8993 0.0735 / 0.9248 / 26.7651

 

Table 2: RaFD multi-domain image completion results.

 

Methods BraTS ProstateX
T1 T1Gd T2 FLAIR T2 ADC HighB

 

Oracle+All 0.822 0.908

 

Oracle+Zero padding 0.651 0.473 0.707 0.454 0.528 0.243 0.775
Oracle+Average imputation 0.763 0.596 0.756 0.671 0.221 0.692 0.685
Oracle+Nearest neighbor 0.769 0.540 0.724 0.606 0.759 0.850 0.854
Oracle+MUNIT 0.783 0.537 0.782 0.492 0.783 0.708 0.858
Oracle+StarGAN 0.799 0.553 0.746 0.613 0.632 0.653 0.832
Oracle+CollaGAN 0.753 0.564 0.798 0.674 0.472 0.760 0.842
Oracle+ReMIC 0.789 0.655 0.805 0.765 0.871 0.898 0.891

 

ReMIC+Seg 0.806 0.674 0.822 0.771 0.872 0.909 0.905
ReMIC+Joint 0.828 0.693 0.828 0.791 0.867 0.904 0.904

 

Table 3: Missing-domain segmentation. Dice scores are reported; each column indicates the modality that is missing during inference.

Results of facial expression image generation: Fig. 7 shows the results of facial expression image completion on the RaFD dataset. In each column, we show the target and generated images of each domain (facial expression), where we assume the current target domain is missing in the inputs and needs to be generated from the remaining available domains. Compared with the MUNIT and StarGAN results, our method generates missing images with better quality, especially in details like teeth, mouth and eyes. This benefits from the fact that our method can incorporate complementary information from multiple available domains, while MUNIT and StarGAN can adopt only one domain as input. For example, in the generation of the “happy” and “disgusted” expressions, neither MUNIT nor StarGAN could generate a good teeth and mouth region, since their source domain “neutral” does not show the teeth. Compared with CollaGAN, our method generates images with better content due to the explicit disentangled representation learning at the feature level instead of the implicit cycle-consistency constraints only at the pixel level. Moreover, Fig. 8 shows the results with multiple missing domains. Each row shows the generated images for all 8 domains when only the first k domains (from left to right) are given in the inputs. The superior performance can also be observed in the NRMSE, SSIM and PSNR evaluation metrics averaged across all testing samples, as reported in Table 2 for all eight expression domains.

5.2 Results of Missing-Domain Segmentation

Based on the missing-domain image completion, we demonstrate that our proposed method can go beyond image generation to solve missing-domain image segmentation. Specifically, our model learns factorized representations by disentangling the latent space, which can be efficiently leveraged for the high-level segmentation task. As shown in Fig. 3, a segmentation branch is added after the learned content code to generate the segmentation prediction. We evaluate the segmentation performance with the Dice coefficient on both the BraTS and ProstateX datasets, as shown in Table 3. Please note that we show the average Dice coefficient across the three categories for the BraTS dataset: enhancing tumor (ET), tumor core (TC), and whole tumor (WT) (per-category results are detailed in the supplementary material).

We train a fully supervised 2D U-shaped segmentation network (a U-Net variant [34]) without missing images as the “Oracle”. “Oracle+*” means that the results are computed by testing the missing images generated or imputed by the “*” method with the pretrained “Oracle” model. “All” represents the full testing set without any missing domains. “ReMIC+Seg” stands for using separate content encoders for the image generation and segmentation tasks in our proposed unified framework, while “ReMIC+Joint” indicates sharing the weights of the content encoder for the two tasks. For the results on both datasets, our proposed unified framework with joint training of image generation and segmentation achieves the best segmentation performance in comparison to the other imputation or generation methods. Moreover, it even obtains results comparable to the “Oracle” model when some modalities are missing. This indicates that the learned content codes indeed embed and extract efficient anatomical structures for image representation.

In our experiments, we choose the widely used U-shaped segmentation network [34] as the backbone for the segmentation generator G_s. Here, we focus on showing how the proposed method can benefit segmentation when missing domains exist and the segmentation backbone is fixed. However, our method can also be easily generalized to other segmentation models with a similar methodology.

6 Conclusion

In this work, we propose a general framework for multi-domain image completion, given that one or more input domains are missing. The proposed model learns shared content and domain-specific style encoding across multiple domains. We show the proposed image completion approach can be well generalized to both natural and medical images. Our framework is further extended to a unified image generation and segmentation framework to tackle a practical problem of missing-domain segmentation. Experiments on three datasets demonstrate the proposed method consistently achieves better performance than several previous approaches on both multi-domain image completion and segmentation with random missing domains.

References

  • [1] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. S. Kirby, J. B. Freymann, K. Farahani, and C. Davatzikos (2017) Advancing the cancer genome atlas glioma mri collections with expert segmentation labels and radiomic features. Scientific data 4, pp. 170117. Cited by: §4.
  • [2] S. Bakas, M. Reyes, A. Jakab, S. Bauer, M. Rempfler, A. Crimi, R. T. Shinohara, C. Berger, S. M. Ha, M. Rozycki, et al. (2018) Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the brats challenge. arXiv preprint arXiv:1811.02629. Cited by: §4.
  • [3] A. Chartsias, T. Joyce, M. V. Giuffrida, and S. A. Tsaftaris (2017) Multimodal mr synthesis via modality-invariant latent representation. IEEE transactions on medical imaging 37 (3), pp. 803–814. Cited by: §2.
  • [4] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180. Cited by: §2.
  • [5] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797. Cited by: §0.A.1, §2, §4, §5.1, §5.1, Table 1, Table 2.
  • [6] P. Costa, A. Galdran, M. I. Meyer, M. D. Abràmoff, M. Niemeijer, A. M. Mendonça, and A. Campilho (2017) Towards adversarial retinal image synthesis. arXiv preprint arXiv:1701.08974. Cited by: §2.
  • [7] S. U. Dar, M. Yurt, L. Karacan, A. Erdem, E. Erdem, and T. Çukur (2019) Image synthesis in multi-contrast mri with conditional generative adversarial networks. IEEE transactions on medical imaging. Cited by: §2, §2.
  • [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.
  • [9] M. Havaei, N. Guizard, N. Chapados, and Y. Bengio (2016) Hemis: hetero-modal image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 469–477. Cited by: §2.
  • [10] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-vae: learning basic visual concepts with a constrained variational framework.. ICLR 2 (5), pp. 6. Cited by: §2.
  • [11] X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510. Cited by: §0.A.2, §0.A.2.
  • [12] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189. Cited by: §0.A.1, §0.A.2, §0.A.2, §2, §3, §4, §5.1, §5.1, Table 1, Table 2.
  • [13] Y. Huo, Z. Xu, S. Bao, A. Assad, R. G. Abramson, and B. A. Landman (2018) Adversarial synthesis learning enables segmentation without target modality ground truth. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 1217–1220. Cited by: §2.
  • [14] J. E. Iglesias, E. Konukoglu, D. Zikic, B. Glocker, K. Van Leemput, and B. Fischl (2013) Is synthesizing mri contrast useful for inter-modality analysis?. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 631–638. Cited by: §2.
  • [15] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §2.
  • [16] K. Kamnitsas, C. Baumgartner, C. Ledig, V. Newcombe, J. Simpson, A. Kane, D. Menon, A. Nori, A. Criminisi, D. Rueckert, et al. (2017) Unsupervised domain adaptation in brain lesion segmentation with adversarial networks. In International conference on information processing in medical imaging, pp. 597–609. Cited by: §2.
  • [17] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim (2017) Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1857–1865. Cited by: §2.
  • [18] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §0.A.1.
  • [19] O. Langner, R. Dotsch, G. Bijlstra, D. H. Wigboldus, S. T. Hawk, and A. Van Knippenberg (2010) Presentation and validation of the radboud faces database. Cognition and emotion 24 (8), pp. 1377–1388. Cited by: §4.
  • [20] D. Lee, J. Kim, W. Moon, and J. C. Ye (2019) CollaGAN: collaborative gan for missing image data imputation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2487–2496. Cited by: §0.A.1, §2, §5.1, §5.1, Table 1, Table 2.
  • [21] D. Lee, W. Moon, and J. C. Ye (2019) Which contrast does matter? towards a deep understanding of mr contrast using collaborative gan. arXiv preprint arXiv:1905.04105. Cited by: §4, §5.1.
  • [22] H. Lee, H. Tseng, J. Huang, M. K. Singh, and M. Yang (2018) Diverse image-to-image translation via disentangled representations. In European Conference on Computer Vision, Cited by: §2.
  • [23] H. Lee, H. Tseng, Q. Mao, J. Huang, Y. Lu, M. Singh, and M. Yang (2019) Drit++: diverse image-to-image translation via disentangled representations. arXiv preprint arXiv:1905.01270. Cited by: §2.
  • [24] J. Lin, Z. Chen, Y. Xia, S. Liu, T. Qin, and J. Luo (2019) Exploring explicit domain supervision for latent space disentanglement in unpaired image-to-image translation. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.
  • [25] G. Litjens, O. Debats, J. Barentsz, N. Karssemeijer, and H. Huisman (2014) Computer-aided detection of prostate cancer in mri. IEEE transactions on medical imaging 33 (5), pp. 1083–1092. Cited by: §4.
  • [26] A. H. Liu, Y. Liu, Y. Yeh, and Y. F. Wang (2018) A unified feature disentangler for multi-domain image translation and manipulation. In Advances in Neural Information Processing Systems, pp. 2590–2599. Cited by: §2.
  • [27] M. Liu, T. Breuel, and J. Kautz (2017) Unsupervised image-to-image translation networks. In Advances in neural information processing systems, pp. 700–708. Cited by: §2.
  • [28] Y. Liu, Y. Yeh, T. Fu, S. Wang, W. Chiu, and Y. Frank Wang (2018) Detach and adapt: learning cross-domain disentangled deep representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8867–8876. Cited by: §2.
  • [29] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley (2017) Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802. Cited by: §0.A.2.
  • [30] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest, et al. (2014) The multimodal brain tumor image segmentation benchmark (brats). IEEE transactions on medical imaging 34 (10), pp. 1993–2024. Cited by: §1, §4.
  • [31] F. Milletari, N. Navab, and S. Ahmadi (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571. Cited by: §3.2.
  • [32] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §2.
  • [33] D. Nie, R. Trullo, J. Lian, C. Petitjean, S. Ruan, Q. Wang, and D. Shen (2017) Medical image synthesis with context-aware generative adversarial networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 417–425. Cited by: §2.
  • [34] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §0.A.2, §5.2, §5.2.
  • [35] S. S. M. Salehi, D. Erdogmus, and A. Gholipour (2017) Tversky loss function for image segmentation using 3d fully convolutional deep networks. In International Workshop on Machine Learning in Medical Imaging, pp. 379–387. Cited by: §3.2.
  • [36] A. Sharma and G. Hamarneh (2019) Missing mri pulse sequence synthesis using multi-modal generative adversarial network. IEEE transactions on medical imaging. Cited by: §2, §2.
  • [37] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb (2017) Learning from simulated and unsupervised images through adversarial training. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2107–2116. Cited by: §2.
  • [38] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2017) Improved texture networks: maximizing quality and diversity in feed-forward stylization and texture synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6924–6932. Cited by: §0.A.2.
  • [39] H. Van Nguyen, K. Zhou, and R. Vemulapalli (2015) Cross-domain synthesis of medical images using efficient location-sensitive deep network. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 677–684. Cited by: §2.
  • [40] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8798–8807. Cited by: §0.A.2.
  • [41] J. Yoon, J. Jordon, and M. van der Schaar (2018) RadialGAN: leveraging multiple datasets to improve target-specific predictive models using generative adversarial networks. arXiv preprint arXiv:1802.06403. Cited by: §2.
  • [42] Z. Zhang, L. Yang, and Y. Zheng (2018) Translating and segmenting multimodal medical volumes with cycle-and shape-consistency generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9242–9251. Cited by: §2, §2.
  • [43] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. In Proceedings of the IEEE international conference on computer vision, pp. 1116–1124. Cited by: §1.
  • [44] Z. Zheng, X. Yang, Z. Yu, L. Zheng, Y. Yang, and J. Kautz (2019) Joint discriminative and generative learning for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2138–2147. Cited by: §1.
  • [45] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §2.
  • [46] J. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman (2017) Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [47] W. Zhu, X. Xiang, T. D. Tran, G. D. Hager, and X. Xie (2018) Adversarial deep structured nets for mass segmentation from mammograms. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 847–850. Cited by: §2.

Appendix 0.A Implementation Details

Here, we describe the implementation details of our method. We will also open-source all the source code and models upon the acceptance of this work.

0.a.1 Hyperparameters

In our algorithm, we use the Adam optimizer [18]. The learning rate is 0.0001. The loss weights in the total loss (Eq. 7 in the main text) are fixed for the unified model for image completion and segmentation. For comparison purposes, we train the model with batch size 1 for 100,000 iterations for the image generation task, and compare the results across MUNIT [12], StarGAN [5], CollaGAN [20], and our ReMIC on all three datasets. In ReMIC, we set the dimension of the style code to 8 for comparison with MUNIT. For image generation during testing, we use a fixed style code of 0.5 in each dimension for both MUNIT and ReMIC to compute quantitative results.

0.a.2 Network Architectures

The network structure of ReMIC is developed on the backbone of MUNIT model [12]. We describe the details of each module here.

Unified Content Encoder: consists of a down-sampling module and residual blocks to extract contextual knowledge from all available domain images in the inputs. The down-sampling module contains a convolutional block with stride 1 and 64 filters, followed by two convolutional blocks with stride 2 and 128 and 256 filters respectively. The convolutional layers downsample the input to feature maps of size W/4 × H/4, where W and H are the width and height of the input image. Next, there are four residual blocks, each of which contains two convolutional blocks with 256 filters and stride 1. We apply Instance Normalization (IN) [38] after all the convolutional layers. Note that the proposed unified content encoder accepts images of all domains as input (missing domains are filled with zero padding in the initialization) and learns a universal content code complementarily and collaboratively, which differs from MUNIT.
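A hedged PyTorch sketch of this content encoder layout follows; the kernel sizes (7/4/3) follow the MUNIT backbone named in this appendix and should be read as assumptions rather than the authors' exact choices.

```python
import torch.nn as nn


class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, 1, 1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, 1, 1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)


class ContentEncoder(nn.Module):
    # Stride-1 block with 64 filters, two stride-2 blocks with 128/256 filters
    # (output is W/4 x H/4), then four residual blocks, all with Instance Norm.
    def __init__(self, n_domains=4, ch=64):
        super().__init__()
        def block(cin, cout, k, s, p):
            return nn.Sequential(nn.Conv2d(cin, cout, k, s, p),
                                 nn.InstanceNorm2d(cout), nn.ReLU(inplace=True))
        layers = [block(n_domains, ch, 7, 1, 3),   # missing domains arrive as zero channels
                  block(ch, 2 * ch, 4, 2, 1),
                  block(2 * ch, 4 * ch, 4, 2, 1)]
        layers += [ResBlock(4 * ch) for _ in range(4)]
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)
```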

Style Encoder: contains a similar down-sampling module and several residual blocks, followed by a global average pooling layer and a fully connected layer to learn the vectorized style code. The down-sampling module is developed using the same structure as that in the unified content encoder above, followed by two more convolutional blocks with stride 2 and 256 filters. The final fully connected layer generates the style code as an 8-dim vector. No IN is applied in the style encoders, in order to keep the original feature means and variances that carry the style information [11].

Generator: includes four residual blocks, each of which contains two convolutional blocks with 256 filters and stride 1. Two nearest-neighbor upsampling layers, each followed by a convolutional block with stride 1 (128 and 64 filters respectively), up-sample the content codes back to the original image size. Finally, a convolutional block with stride 1 outputs the reconstructed image. In order to incorporate the style code in the generation process, Adaptive Instance Normalization (AdaIN) [11] is applied in each residual block as follows [12]:

AdaIN(z, γ, β) = γ ( (z − μ(z)) / σ(z) ) + β     (8)

where z is the activation from the last convolutional layer, μ(z) and σ(z) are the channel-wise mean and standard deviation of the activation, and γ and β are the affine parameters in the AdaIN layers that are generated from the style codes via a multi-layer perceptron (MLP). In this way, the input style code controls the generated style information through the affine transformation in the AdaIN layers of all generators [11].
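The AdaIN operation of Eq. 8 can be sketched as follows, with the affine parameters predicted from the style code by an MLP (reduced here to a single linear layer for brevity; layer sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn


class AdaIN(nn.Module):
    # AdaIN(z) = gamma * (z - mu(z)) / sigma(z) + beta, with gamma and beta
    # produced from the style code (Eq. 8); layer sizes here are illustrative.
    def __init__(self, num_features, style_dim=8):
        super().__init__()
        self.mlp = nn.Linear(style_dim, 2 * num_features)   # predicts gamma and beta

    def forward(self, z, style):
        gamma, beta = self.mlp(style).chunk(2, dim=1)        # (B, C) each
        gamma = gamma[:, :, None, None]
        beta = beta[:, :, None, None]
        mu = z.mean(dim=(2, 3), keepdim=True)                # channel-wise mean
        sigma = z.std(dim=(2, 3), keepdim=True) + 1e-6       # channel-wise std
        return gamma * (z - mu) / sigma + beta
```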

Discriminator: includes four convolutional blocks with stride 2 and 64, 128, 256, and 512 filters in sequence. A Leaky ReLU activation with slope 0.2 is applied after the convolutional layers. A multi-scale discriminator [40] is used to combine the results at three different scales. In adversarial training, we adopt the LSGAN objective [29] as the adversarial loss to learn to generate realistic images.

Segmentor: We adopt a segmentation net with a U-Net shape [34]. In order to build a joint model with the image generation modules, we build a variant of U-Net, that is, the downsampling part shares the same structure as the content encoder aforementioned while the upsampling part has the same layers as the generator as described above. Similar to the original U-Net [34], we also adopt the skip connections between the downsampling and upsampling layers in our segmentation module.

Appendix 0.B Extended Ablative Study and Results for Multi-domain Image Completion

 

Methods T1 T1Gd
MAE (↓) / NRMSE (↓) / PSNR (↑) / SSIM (↑) reported for each modality
ReMIC 0.0187 / 0.2008 / 28.5508 / 0.9618 0.0153 / 0.2375 / 29.1628 / 0.9521
ReMIC+Multi-Sample 0.0180 / 0.1942 / 28.8354 / 0.9634 0.0127 / 0.2070 / 30.2444 / 0.9555
ReMIC+Seg 0.0195 / 0.2033 / 28.5679 / 0.9597 0.0142 / 0.2285 / 29.2134 / 0.9468
ReMIC+Joint 0.0214 / 0.2128 / 27.9944 / 0.9568 0.0140 / 0.2251 / 29.3624 / 0.9484

 

Methods T2 FLAIR
MAE (↓) / NRMSE (↓) / PSNR (↑) / SSIM (↑) reported for each modality
ReMIC 0.0190 / 0.2481 / 27.4829 /0.9457 0.0198 / 0.2469 / 27.1540 / 0.9367
ReMIC+Multi-Sample 0.0195 / 0.2493 / 27.5168 / 0.9463 0.0192 / 0.2456 / 27.3598 / 0.9385
ReMIC+Seg 0.0193 / 0.2525 / 27.2864 / 0.9431 0.0206 / 0.2553 / 26.9191 / 0.9333
ReMIC+Joint 0.0197 / 0.2596 / 26.9954 / 0.9429 0.0220 / 0.2651 / 26.5068 / 0.9302

 

Table 4: Extended results of multi-domain image completion for BraTS dataset

 

Methods T2 ADC
MAE (↓) / NRMSE (↓) / PSNR (↑) / SSIM (↑) reported for each modality
ReMIC 0.0840 / 0.4908 / 18.6200 / 0.5427 0.0253 / 0.2179 / 26.6150 / 0.9232
ReMIC+Multi-Sample 0.0810 / 0.4742 / 18.8986 / 0.5493 0.0250 / 0.2171 / 26.7024 / 0.9263
ReMIC+Seg 0.0871 / 0.5024 / 18.4236 / 0.5336 0.0272 / 0.2322 / 26.0828 / 0.9107
ReMIC+Joint 0.0881 / 0.5071 / 18.3206 / 0.5353 0.0288 / 0.2403 / 25.8024 / 0.9064

 

Methods HighB
MAE (↓) / NRMSE (↓) / PSNR (↑) / SSIM (↑)
ReMIC 0.0254 / 0.3894 / 24.7927 / 0.9150
ReMIC+Multi-Sample 0.0268 / 0.3945 / 24.8066 / 0.9116
ReMIC+Seg 0.0272 / 0.4110 / 24.3277 / 0.9061
ReMIC+Joint 0.0286 / 0.4359 / 23.8270 / 0.9006

 

Table 5: Extended results of multi-domain image completion for ProstateX dataset

In this section, we conduct more ablation studies on multi-domain image completion with multi-sample learning and with multi-task learning using the unified model for image generation and segmentation. More quantitative results are presented in Table 4 and Table 5, which extend Table 1 of the main text. Please note that in addition to the NRMSE, SSIM and PSNR metrics, we also add the MAE metric to measure the difference between the generated images and the ground truth.

0.b.1 Multi-sample learning

Based on the proposed model shown in Fig. 3 of the main text, we further propose a training strategy in which multiple samples are input at one time to facilitate learning disentangled representations. Specifically, based on the assumption of a partially shared latent space, the factorized latent codes should represent the corresponding content and style information of the input images. Therefore, by exchanging the style codes of two independent samples across all available domains, the model should be able to reconstruct the original input images by recombining the original content with the new style codes from the other sample. Based on this idea, we build a comprehensive model with cross-sample training between two samples. Similar to the framework in Fig. 3 of the main text, the image and latent consistency losses and the image reconstruction loss are also applied through the encoding and decoding procedure. The results of multi-sample learning are shown in Table 4 and Table 5, denoted as “ReMIC+Multi-Sample”.
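A hedged sketch of the cross-sample style exchange is given below; the exact loss wiring is our assumption, and it reuses the hypothetical ReMICSketch module introduced for the main text.

```python
import torch


def cross_sample_style_swap(model, batch_a, batch_b, visible):
    # Encode two independent samples, swap their style codes, and decode each
    # sample's content with the other sample's styles; the usual consistency and
    # reconstruction losses can then be applied to the swapped outputs.
    def encode(images):
        x = torch.cat([im if v else torch.zeros_like(im)
                       for im, v in zip(images, visible)], dim=1)
        c = model.content_enc(x)
        s = [model.style_encs[i](images[i]) if visible[i]
             else torch.randn(x.size(0), model.style_dim, device=x.device)
             for i in range(model.n)]
        return c, s

    def decode(c, styles):
        outs = []
        for i in range(model.n):
            s_map = styles[i][:, :, None, None].expand(-1, -1, c.size(2), c.size(3))
            outs.append(model.generators[i](torch.cat([c, s_map], dim=1)))
        return outs

    c_a, s_a = encode(batch_a)
    c_b, s_b = encode(batch_b)
    return decode(c_a, s_b), decode(c_b, s_a)   # A's content with B's styles, and vice versa
```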

0.b.2 Multi-task learning

For the jointly trained model of image completion and segmentation, the generated images are also evaluated using the same metrics, as shown in Table 4 and Table 5. As in Table 3 of the main text, “ReMIC+Seg” stands for using separate content encoders for the image generation and segmentation tasks in our proposed unified framework, while “ReMIC+Joint” indicates sharing the weights of the content encoder for both tasks. The results indicate that adding the segmentation branch does not bring an obvious benefit to image generation. This is because the segmentation sub-module mainly focuses on the tumor region, which takes up only a small part of the whole slice. Besides, we use the Dice loss as the segmentation training objective, which might not be consistent with the metrics used to evaluate generated image quality, as those mainly emphasize whole-slice pixel-level similarity.

0.b.3 Random multi-domain image completion

As described in Section 5.1 of the main text, we investigate a more practical scenario where there is more than one missing domain and show that our proposed method is capable of handling general random k-to-n image completion. In this setting, we assume the set of missing domains in the training data is randomly distributed, i.e., each training sample has k randomly selected visible domains, where k is at least 1. During testing, we evaluate the model with different numbers of existing domains k, where these available domains are taken in order from domain 1 to domain k.
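A small sketch of how such a random visible set can be drawn per training sample is shown below; the exact sampling scheme used by the authors is not specified in this copy, so this is an assumption.

```python
import random


def sample_visible_mask(n_domains):
    # Draw how many domains are visible (k >= 1) and which ones, for the random
    # k-to-n training setting described above.
    k = random.randint(1, n_domains)
    visible_idx = set(random.sample(range(n_domains), k))
    return [i in visible_idx for i in range(n_domains)]
```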

In addition to the qualitative results of image completion with multiple randomly missing modalities shown in Figs. 5(b), 6(b), and 8 of the main text, we show more testing samples from the three datasets in Figs. 9-12. The left half or top half of each figure shows the input domains, where the missing domains are filled with zeros. The right half or bottom half of each figure shows the generated images for all domains, whether or not they are present in the input. First, no matter how many or which domains are visible in the input, the proposed model generates images for all domains, including the missing ones, in a single inference pass. In particular, for the missing domains, the domain-specific image characteristics are well captured even though they do not appear in the input images. Comparing the generated images within the same domain, we see that both the domain-specific style and the domain-shared content are well preserved even when only one visible domain is provided as input. In addition, as the number of visible domains increases, the content of each generated image is gradually refined and moves closer to the target image. This illustrates that our model effectively learns a better content code by aggregating complementary information from multiple visible domains.

Appendix 0.C Extended Ablative Study and Results for Missing-domain Segmentation

Building on the results of missing-domain image completion, we show that our proposed method goes beyond image translation to address the missing-domain segmentation problem. Specifically, our model learns content representations of the subject that can be efficiently leveraged for high-level recognition tasks. As shown in Fig. 3 of the main text, a segmentation branch is added after the learned content code to generate the segmentation prediction map. We adopt the Dice loss as the segmentation objective during training. We run the segmentation experiments on both the BraTS and ProstateX datasets and use the Dice score as the evaluation metric. In the following, we examine two specific settings of missing-domain segmentation.
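For completeness, the Dice score used for evaluation can be computed per class as in the short NumPy sketch below; the handling of empty masks via the small epsilon is an assumption for illustration.

```python
import numpy as np

def dice_score(pred, target, eps=1e-6):
    """Dice score between two binary masks of the same shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# Note: in BraTS the evaluated regions (WT, TC, ET) are nested groupings of the
# raw labels, so each region mask is built by merging labels before calling
# dice_score; that dataset-specific grouping is omitted in this sketch.
```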

0.C.1 Missing-domain segmentation with inference on pre-trained segmentation model

Suppose we have trained an oracle segmentation model on a complete dataset with images from all domains. This pre-trained model is then used to predict segmentation results for new samples at inference time, for which some domains might be missing. Straightforward ways to complete the missing domains include zero filling, the average image computed from the existing domains, and nearest-neighbor (NN) search among the available training samples. We report the Dice scores of these baselines in Table 6. The oracle result gives the average testing Dice score when all domains are available at inference. Each column shows the Dice scores of the segmentation predictions when the corresponding domain is missing during inference. Moreover, image translation methods can generate fake images for missing-domain imputation; the results of different methods are also shown in Table 6. Our proposed method achieves the best Dice score compared with all the aforementioned baselines and the other GAN-based image translation methods, which also indicates that our method generates better images by preserving a better content representation. Furthermore, the results in Table 6 show that the T1Gd modality and the T2 modality are the most informative contrasts for the segmentation of the BraTS and ProstateX data, respectively, and missing them causes a severe drop in Dice score; our method alleviates this loss to a large extent. Here, the Dice score for BraTS is the average over the three segmentation categories: enhancing tumor (ET), tumor core (TC), and whole tumor (WT). Please see Table 8 for the full table with per-class Dice scores.
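As a minimal sketch (not the exact implementation), the zero-filling and average-image baselines can be written as follows; the nearest-neighbor baseline is sketched separately near the end of this appendix.

```python
import numpy as np

def zero_fill(images, visible_mask):
    """Replace missing domains with zeros; images has shape (n_domains, H, W)."""
    return images * visible_mask[:, None, None]

def average_fill(images, visible_mask):
    """Replace each missing domain with the mean of the existing domains."""
    filled = images.copy()
    avg = images[visible_mask].mean(axis=0)
    filled[~visible_mask] = avg
    return filled
```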

 

Methods     BraTS                                   ProstateX
            T1      T1Gd    T2      FLAIR           T2      ADC     HighB
Oracle      0.822 (all domains available)           0.908 (all domains available)
Zero        0.651   0.473   0.707   0.454           0.528   0.243   0.775
Average     0.763   0.596   0.756   0.671           0.221   0.692   0.685
NN          0.769   0.540   0.724   0.606           0.759   0.850   0.854
MUNIT       0.783   0.537   0.782   0.492           0.783   0.708   0.858
StarGAN     0.799   0.553   0.746   0.613           0.632   0.653   0.832
CollaGAN    0.753   0.564   0.798   0.674           0.472   0.760   0.842
ReMIC       0.819   0.641   0.823   0.784           0.863   0.907   0.903

Table 6: Missing-domain segmentation with inference on a pre-trained segmentation model (average Dice scores are reported; each column gives the score when that domain is missing)

 

Methods       BraTS                                   ProstateX
              T1      T1Gd    T2      FLAIR           T2      ADC     HighB
Oracle        0.822 (all domains available)           0.908 (all domains available)
Zero          0.811   0.656   0.823   0.775           0.868   0.899   0.897
Average       0.796   0.604   0.788   0.759           0.856   0.885   0.897
ReMIC         0.789   0.655   0.805   0.765           0.871   0.898   0.891
ReMIC+Seg     0.806   0.674   0.822   0.771           0.872   0.909   0.905
ReMIC+Joint   0.828   0.693   0.828   0.791           0.867   0.904   0.904

Table 7: Missing-domain segmentation with a re-trained segmentation model (average Dice scores are reported; each column gives the score when that domain is missing)

0.C.2 Missing-domain segmentation with re-training segmentation model

Suppose we want to train a segmentation model on a new dataset, but most patients in this cohort only have a random subset of the required domains. In this scenario, it is clearly inefficient to use only the most common domain shared by most patients. One simple solution is to complete all the missing images in the training set with some imputation method, such as zero filling, the average image, or images generated by an image translation model; the results of these methods are shown in Table 7. Going further, based on the content code learned by our model, we can build a joint model for multi-task learning of both generation and segmentation. By optimizing the generation loss and the segmentation loss simultaneously, the unified model learns to generate missing images that promote segmentation performance. As shown in Table 7, the jointly trained model achieves the best Dice scores on both the BraTS and ProstateX datasets. “ReMIC+Seg” uses separate content encoders for the generation and segmentation tasks, while “ReMIC+Joint” shares the weights of the content encoder between the two tasks. We note that the baseline methods obtain better results after retraining the model on the missing data, since the model is trained to fit the exact format of the missing inputs by optimizing the segmentation objective under the supervision of segmentation labels, which makes it more robust to missing inputs. Nevertheless, our method still achieves the best results through its adaptively learned model.

 

Methods       T1                      T1Gd                    T2                      FLAIR
              WT / TC / ET            WT / TC / ET            WT / TC / ET            WT / TC / ET

2D
Oracle        0.910 / 0.849 / 0.708 (all domains available)
Zero          0.771 / 0.609 / 0.572   0.872 / 0.539 / 0.008   0.755 / 0.690 / 0.677   0.458 / 0.468 / 0.435
Average       0.870 / 0.744 / 0.674   0.882 / 0.603 / 0.303   0.849 / 0.732 / 0.686   0.655 / 0.710 / 0.648
NN            0.883 / 0.765 / 0.660   0.871 / 0.564 / 0.186   0.811 / 0.720 / 0.642   0.534 / 0.669 / 0.614
MUNIT         0.886 / 0.785 / 0.679   0.872 / 0.552 / 0.187   0.882 / 0.781 / 0.682   0.408 / 0.541 / 0.527
StarGAN       0.897 / 0.795 / 0.704   0.886 / 0.588 / 0.184   0.851 / 0.725 / 0.661   0.570 / 0.664 / 0.604
CollaGAN      0.860 / 0.747 / 0.651   0.864 / 0.576 / 0.252   0.882 / 0.811 / 0.700   0.663 / 0.697 / 0.663
ReMIC         0.909 / 0.834 / 0.714   0.899 / 0.669 / 0.354   0.905 / 0.855 / 0.709   0.853 / 0.807 / 0.691

3D
Oracle        0.909 / 0.867 / 0.733 (all domains available)
Zero          0.876 / 0.826 / 0.694   0.884 / 0.574 / 0.020   0.901 / 0.865 / 0.728   0.661 / 0.730 / 0.643
Average       0.880 / 0.814 / 0.640   0.854 / 0.618 / 0.282   0.838 / 0.801 / 0.695   0.713 / 0.732 / 0.675
NN            0.890 / 0.829 / 0.703   0.859 / 0.538 / 0.081   0.790 / 0.799 / 0.704   0.472 / 0.686 / 0.607
ReMIC         0.905 / 0.864 / 0.722   0.888 / 0.614 / 0.273   0.902 / 0.871 / 0.734   0.855 / 0.850 / 0.724

Table 8: Missing-domain segmentation on BraTS with inference on pre-trained 2D and 3D segmentation models (per-class Dice scores for WT / TC / ET are reported; each column gives the scores when that domain is missing)

0.C.3 3D image segmentation with missing domains

Furthermore, we validate that our method works not only for 2D image segmentation but also for 3D image segmentation. When a 3D volumetric image is missing in some domain, we apply our method to generate 2D images slice by slice and stack them to form the whole 3D volume of the corresponding missing domain. As shown in Table 8, we evaluate the per-class Dice score for missing-domain imputation with the oracle model trained on complete-domain 3D image segmentation. The results show that our method gives better performance in most domains. In our experiments, we find that the smoothness across slices of the generated 3D volume is an issue that could be further improved. Table 8 also reports the per-class Dice scores for the BraTS segmentation results. Compared with the WT and TC classes, the ET class is clearly more challenging in brain tumor segmentation, since the enhancing tumor usually covers only a very small region of the whole tumor. For the ET class in particular, our method outperforms the other methods by a large margin.
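The slice-wise completion of a 3D volume can be sketched as below; `generate_slice` is a hypothetical stand-in for the 2D generator, and no inter-slice smoothing is applied, matching the limitation noted above.

```python
import numpy as np

def complete_volume(volume_slices, visible_mask, generate_slice):
    """Complete a missing-domain 3D volume slice by slice.

    volume_slices:  array of shape (n_slices, n_domains, H, W)
    visible_mask:   boolean array of shape (n_domains,)
    generate_slice: hypothetical 2D generator taking (slice, visible_mask)
                    and returning completed images for all domains.
    """
    completed = [generate_slice(s, visible_mask) for s in volume_slices]
    # Stack the per-slice results back into a volume; no smoothing across slices.
    return np.stack(completed, axis=0)
```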

0.C.4 Analysis of missing-domain segmentation results

To better understand why our method is a better solution for missing-domain imputation in multi-domain recognition tasks such as multi-modal image segmentation, we show three randomly selected testing samples from the BraTS and ProstateX datasets in Figs. 13-14, respectively. Rows 1-3 show the results for the first sample, and the other two samples follow the same format. For each sample, the first row shows the real images in all domains and the ground truth segmentation label. If a domain is randomly missing for the target sample, a straightforward solution is to search all available training data and use the nearest-neighbor (NN) sample to complete the missing image. We search for the nearest neighbor according to the Euclidean distance in 2D image space and display the NN sample with all modalities, which indeed looks visually similar to the target sample. However, the tumor region differs substantially between the target sample and its NN sample, which shows that the NN image is not a good imputation in terms of image semantics. In contrast, our proposed method generates images for the missing domains with not only pixel-level similarity but also similar predicted tumor regions, which are the most important semantics in the tumor segmentation task. As shown in Figs. 13-14, the generated images in multiple domains closely resemble the target images. The segmentation map shows the prediction when the generated T1 (T2) image is used to impute the input for BraTS (ProstateX) segmentation, yielding a segmentation mask very close to the ground truth label. These results illustrate the superiority of our method in learning semantic content codes at the feature level.
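For illustration, a minimal sketch of this nearest-neighbor search is given below; the Euclidean distance is computed over the visible 2D images, the function and variable names are hypothetical, and preprocessing steps such as intensity normalization are not shown.

```python
import numpy as np

def nearest_neighbor_imputation(target_visible, train_samples, missing_domain):
    """Impute a missing domain with the corresponding image of the training
    sample closest to the target in the visible domains.

    target_visible: dict {domain_index: 2D array} of the target's visible images
    train_samples:  list of dicts containing images for all domains
    missing_domain: index of the domain to impute
    """
    def dist(sample):
        # Sum of Euclidean distances over the visible domains.
        return sum(np.linalg.norm(sample[d] - img) for d, img in target_visible.items())

    nn_sample = min(train_samples, key=dist)
    return nn_sample[missing_domain]
```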

Figure 9: Random multi-domain image completion results of three testing samples in BraTS. The completed images (right) are generated from the partially available input images (left). The number of visible input domains ranges from 1 to n, where n is the number of all domains. Rows: every 4 rows show the results for one testing sample. Columns: 4 image modalities, T1, T1Gd, T2, and FLAIR.
Figure 10: Random multi-domain image completion results of three testing samples in ProstateX. The completed images (right) are generated from the partially available input images (left). The number of visible input domains ranges from 1 to n, where n is the number of all domains. Rows: every 3 rows show the results for one testing sample. Columns: 3 image modalities, T2, ADC, and HighB.
Figure 11: Random multi-domain image completion results of three testing samples in RaFD. The completed images (bottom) are generated from the partially available input images (top). The number of visible input domains ranges from 1 to n, where n is the number of all domains. Rows: 8 image domains of “neutral”, “angry”, “contemptuous”, “disgusted”, “fearful”, “happy”, “sad”, and “surprised”.
Figure 12: Random multi-domain image completion results of three testing samples in RaFD. The completed images (bottom) are generated from the partially available input images (top). The number of visible input domains ranges from 1 to n, where n is the number of all domains. Rows: 8 image domains of “neutral”, “angry”, “contemptuous”, “disgusted”, “fearful”, “happy”, “sad”, and “surprised”.
Figure 13: Missing-domain segmentation results of three testing samples in BraTS. Every three rows show the results for one testing sample. For each testing sample, we show: 1) the real images with the ground truth segmentation label, 2) the nearest neighbor searched from the training data with its segmentation label, and 3) the images generated by our method and the segmentation prediction when the T1 image is missing and completed with the generated image.
Figure 14: Missing-domain segmentation results of three testing samples in ProstateX. Every three rows show the results for one testing sample. For each testing sample, we show: 1) the real images with the ground truth segmentation label, 2) the nearest neighbor searched from the training data with its segmentation label, and 3) the images generated by our method and the segmentation prediction when the T2 image is missing and completed with the generated image.