In2I: Unsupervised Multi-Image-to-Image Translation Using Generative Adversarial Networks

11/26/2017 ∙ Pramuditha Perera et al. ∙ Rutgers University

In unsupervised image-to-image translation, the goal is to learn the mapping between an input image and an output image using a set of unpaired training images. In this paper, we propose an extension of the unsupervised image-to-image translation problem to the multiple-input setting. Given a set of paired images from multiple modalities, a transformation is learned to translate the input into a specified domain. For this purpose, we introduce a Generative Adversarial Network (GAN) based framework along with a multi-modal generator structure and a new loss term, the latent consistency loss. Through various experiments, we show that leveraging multiple inputs generally improves the visual quality of the translated images. Moreover, we show that the proposed method outperforms current state-of-the-art unsupervised image-to-image translation methods.


1 Introduction

The problem of unsupervised image-to-image translation has made promising strides in recent years with the advent of Generative Adversarial Networks (GANs) [5]. Given an input from a particular domain, the goal of image-to-image translation is to transform the input onto a specified second domain. Recent works in image-to-image translation have successfully learned this transformation across various tasks, including satellite images to maps, night images to day images, greyscale images to color images, etc. [29], [12], [11], [26].

In this work, we propose an extension of the original problem from a single input image to multiple input images, called multi-image-to-image translation (In2I). Given semantically related images across a number of different domains, the goal of In2I is to produce the corresponding image in a specified domain. For example, the traditional problem of translating a greyscale image onto the RGB domain can be extended into an In2I problem by providing the near-infrared (NIR) image of the same scene as an additional input. The objective would then be to use the information present in the greyscale and NIR domains to produce the corresponding output in the RGB domain, as shown in Figure 1. In this paper, we study the In2I problem in the more generic unsupervised setting and provide an initial direction towards solving it.

Figure 1: Traditional unsupervised image-to-image translation maps a single image onto a specified domain. In this work, we address multi-image-to-image translation (In2I), where a set of input images from different domains is mapped onto a specified domain. This figure illustrates three applications of the proposed method.

Image-to-image translation is a challenging problem. For a given input, there exist multiple possible representations in the specified second domain. Having multiple inputs from different image modalities reduces this ambiguity due to the presence of complementary information. Therefore, as we show later in the experimental results section, leveraging multiple input images leads to an output of higher perceptual quality.

Multiple input modalities can be incorporated naively by concatenating all available modalities as channels and feeding them into an existing image-to-image translation algorithm. However, such an approach leads to unsatisfactory results (see supplementary material for experiments). Therefore, we argue that unsupervised multi-image-to-image translation should be treated as a unique problem. In this work, our main contributions are three-fold:

1. We introduce the problem of unsupervised multi-image-to-image translation. We show that by leveraging multiple modalities one can produce a better output in the desired domain as compared to when only a single input modality is used.

2. A GAN-based scheme is proposed to combine information from multiple modalities to produce the corresponding output in the desired domain. We introduce a new loss term, the latent consistency loss, into the objective function.

3. We propose a generalization to the GAN generator network by introducing a multi-modal generator structure.

2 Related Work

To the best of our knowledge, the In2I problem has not been previously addressed in the literature. In this section, we outline previous work related to the proposed method.

Generative Adversarial Networks (GANs). The fundamental idea behind GANs, introduced in [5], [19], is to use two competing Fully Convolutional Networks (FCNs) [20], the generator and the discriminator, for generative tasks. Here, the generator FCN learns to generate images from the target domain and the discriminator tries to distinguish generated images from real images of the given domain. During training, the two FCNs engage in a min-max game, each trying to outperform the other. The learning objective of this problem is collectively called the adversarial loss [29]. Many applications have since employed GANs for various image generation tasks with success [17], [8], [10], [7], [24], [28], [27].

In [13], GANs were studied in a conditional setting where a conditioning label is provided as the input to both the generator and the discriminator. Here, we limit our discussion of Conditional GANs (CGANs) to image-to-image generation tasks. The Pix2Pix framework introduced in [7] uses CGANs for supervised image-to-image translation. In their work, the authors showed successful transformations across multiple tasks when paired samples across the two domains are available during training. In [21], CGANs are used to generate real eye images from synthetic eye images. In order to learn an effective discriminator, [21] proposes to maintain and use a history of generated images. The CoGAN framework introduced in [12] maps images of two different domains onto a common representation to perform domain adaptation using a weight-sharing FCN architecture.

Unpaired image-to-image translation. Several recent methods have addressed the unsupervised image-to-image translation task when the input is a single image. Here, unlike in the supervised setting, paired samples across the two domains do not exist. In [29], the image-to-image translation problem is tackled by having two generators and two discriminators, one for each domain. In addition to the adversarial loss, a cycle consistency constraint is added to ensure that the semantic information is preserved in the translation. A similar rationale is adopted in DualGAN [26], which was developed independently of CycleGAN. In [11], the CoGAN framework was extended using GANs and variational autoencoders with the assumption of a common latent space between the domains.

Image fusion. Although image fusion [14] operates on multiple input images, we note that our task is very different from image fusion since the latter does not involve a domain translation process. In image fusion tasks, multiple input modalities are combined in an informative latent space. This space is usually found by a derived multi-resolution transformation such as wavelets [16]. In [15], a latent space learned by deep networks is used to regenerate outputs of multiple modalities. Motivated by this technique, we fuse mid-level deep features from each input domain in the proposed generator FCN.

3 Proposed Method

Notation. In this paper, we use the following notation. The source domain and the target domain are denoted by S and T, respectively. The latent space is denoted by Z. In the presence of multiple source domains, the set of source domains is denoted collectively as S = {S_1, ..., S_n}. A data sample drawn from an arbitrary domain A is denoted as a. The transformation between domains A and B is denoted by the function f_{A→B}. The transformation between domain A and the latent space is denoted by h_{A→Z}.

Overview. In conventional image-to-image translation, the objective is to translate images from an original domain S to a target domain T using a learned transformation f_{S→T}. In the supervised setting of the problem, a set of image pairs {(s, t)} is given, where s ∈ S and t ∈ T are paired images from the two domains. The image-to-image translation task is less challenging in this scenario since the desired output for a given input is known ahead of time.

Similar to the supervised version of the problem, images from both the target and source domains are provided in the unsupervised image-to-image translation problem. However, in this case, the provided images of the two domains are not paired. In other words, for a given source image s, the corresponding ground truth image t is not provided. In the absence of image pairs from both domains, it is not possible to optimize over a distance between the estimated output and the target. One possible option is to introduce an adversarial loss, which rewards the generator if the generated image appears to come from the target domain. However, having an adversarial loss alone does not guarantee that the generated image will share semantics with the input. Therefore, to successfully solve this problem, additional constraints need to be imposed.

In [29], such a solution is sought by enforcing the cycle consistency property. Here, an inverse transformation f_{T→S} is learned along with f_{S→T}. Cycle consistency then ensures that the learned transformation yields a good approximation of the input by comparing f_{T→S}(f_{S→T}(s)) with s. We develop our method based on the foundations of CycleGAN proposed in [29]; we briefly review the CycleGAN method here and draw differences between CycleGAN and our method in the succeeding sections. CycleGAN, as shown in Figure 2 (top), contains a forward transformation from the source domain to the target domain and a reverse transformation from the target back to the source. Two discriminators, D_S and D_T, are used to assess whether a given input belongs to the source or the target domain, respectively.
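For reference, the cycle consistency constraint of [29], written here in our notation, can be summarized as follows (this is a recap of the standard formulation from [29], not a new result):

```latex
\mathcal{L}_{cyc} =
  \mathbb{E}_{s \sim p_{data}(s)}\!\left[\lVert f_{T \to S}(f_{S \to T}(s)) - s \rVert_1\right]
+ \mathbb{E}_{t \sim p_{data}(t)}\!\left[\lVert f_{S \to T}(f_{T \to S}(t)) - t \rVert_1\right].
```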

Figure 2: Network structure used for unsupervised image-to-image translation. Top: CycleGAN. Bottom: Proposed method for In2I.

Multimodal Generator. The In2I problem accepts multiple inputs and translates them into a single output. Therefore, in contrast to CycleGAN, the proposed method deals with multiple inputs in the forward transformation and multiple outputs in the reverse transformation. In order to facilitate this operation, we propose a generalization of the generator structure for multiple inputs and outputs. The generic structure of the proposed generator is shown in Figure 3. In general, it is possible for the generator to have multiple inputs and multiple outputs. The generator treats each input modality independently, extracts features, and fuses them prior to feeding them to the encoder. The encoder maps the resultant features to a latent space. Operating on the latent space, a set of independent decoders generates the output images.

For the specific application of In2I, two generators are used, one for the forward and one for the reverse transformation. When there are n input images, the forward generator takes n inputs and produces a single output image. In the reverse transformation, a single input image is processed to generate n outputs. Therefore, the generator networks used in In2I are asymmetric in structure, as shown in Figure 2 (bottom).

In the forward transformation, the proposed method initially treats the inputs independently; the extracted features are then fused together. The fused feature is first transformed into a latent space, as shown in Figure 2 (bottom), and then transformed into the target domain. In the reverse transformation, the single input is first mapped back to the same latent space. Then, the latent space representation is used to produce outputs belonging to the source domains. In this formulation, n+1 discriminators are used, one for each domain, as opposed to two in CycleGAN. In addition, a latent space consistency loss is added to ensure that the same concept in all domains has a common latent space representation.

Figure 3: Multi-modal generator: a generalization of the generator for multiple inputs and multiple outputs.
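To make the structure in Figure 3 concrete, the following is a minimal PyTorch-style sketch of a multi-modal generator, assuming per-modality convolutional feature extractors, fusion by channel concatenation followed by a 1×1 convolution, a shared encoder to the latent space, and one decoder per output domain. The class name, layer widths and depths are illustrative assumptions, not the paper's exact architecture (the actual generators use a nine-block ResNet backbone; see the Network Architecture paragraph and the supplementary tables).

```python
import torch
import torch.nn as nn

class MultiModalGenerator(nn.Module):
    """Sketch of Figure 3: feature extractors -> fusion -> encoder -> latent -> decoders."""

    def __init__(self, n_inputs=2, n_outputs=1, in_ch=1, out_ch=3, feat=64):
        super().__init__()
        # One small feature extractor per input modality.
        self.extractors = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, feat, 7, padding=3), nn.InstanceNorm2d(feat), nn.ReLU(True),
                nn.Conv2d(feat, 2 * feat, 3, stride=2, padding=1), nn.InstanceNorm2d(2 * feat), nn.ReLU(True))
            for _ in range(n_inputs)])
        # Fusion: concatenate feature maps along channels, reduce with a 1x1 convolution.
        self.fuse = nn.Conv2d(2 * feat * n_inputs, 2 * feat, 1)
        # Encoder maps the fused features to the latent space Z.
        self.encoder = nn.Sequential(
            nn.Conv2d(2 * feat, 4 * feat, 3, stride=2, padding=1), nn.InstanceNorm2d(4 * feat), nn.ReLU(True))
        # One decoder per output domain, operating on the latent representation.
        self.decoders = nn.ModuleList([
            nn.Sequential(
                nn.ConvTranspose2d(4 * feat, 2 * feat, 3, stride=2, padding=1, output_padding=1),
                nn.InstanceNorm2d(2 * feat), nn.ReLU(True),
                nn.ConvTranspose2d(2 * feat, feat, 3, stride=2, padding=1, output_padding=1),
                nn.InstanceNorm2d(feat), nn.ReLU(True),
                nn.Conv2d(feat, out_ch, 7, padding=3), nn.Tanh())
            for _ in range(n_outputs)])

    def forward(self, inputs):                                   # inputs: list of n_inputs tensors
        feats = [ext(x) for ext, x in zip(self.extractors, inputs)]
        z = self.encoder(self.fuse(torch.cat(feats, dim=1)))     # latent code, reused by the latent consistency loss
        return [dec(z) for dec in self.decoders], z
```

In In2I, the forward generator would instantiate this with n inputs and a single output, and the reverse generator with a single input and n outputs; the returned latent code z is what the latent consistency loss introduced below compares across the two directions.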

Problem Formulation. Formally, given n input modalities S_1, S_2, ..., S_n, the objective is to learn a transformation f_{S→T}: S_1 × S_2 × ... × S_n → T. Here, we note that the input to the forward transformation is a set of n images, whereas the output of the transformation is a single image. Similarly, the backward transformation f_{T→S} takes a single image as input and produces n output images.

In order to approach the solution to this problem, we first view all input images and the desired output image as different representations of the same concept. Motivated by the techniques used in domain adaptation [1], [4], [22], we hypothesize the existence of a latent representation that can be derived from the provided representations. With this assumption, we treat our original problem as a series of sub-problems, where the requirement is to learn the transformation to the latent representation and its inverse for each domain. If the latent space is Z, we attempt to learn the transformations h_{S→Z}, h_{Z→S}, h_{T→Z} and h_{Z→T}. With this formulation, the forward transformation becomes f_{S→T} = h_{Z→T} ∘ h_{S→Z} and the reverse transformation becomes f_{T→S} = h_{Z→S} ∘ h_{T→Z}.

Adversarial Loss. In order to learn the transformation f_{S→T}, we use an adversarial generator-discriminator pair [5]. Denoting the data distributions of domains S and T as p_data(s) and p_data(t), respectively, the generator tries to learn the transformation f_{S→T}. The discriminator D_T is trained to differentiate real images from the target domain from generated images f_{S→T}(s). This procedure is captured in the adversarial loss as follows:

L_{GAN, S→T} = E_{t∼p_data(t)}[log D_T(t)] + E_{s∼p_data(s)}[1 − log D_T(f_{S→T}(s))].   (1)

Similarly, to learn f_{T→S} we use a single generator. However, since there exist n input domains in total, we require n discriminators D_{S_i}, where i = 1, ..., n, one for each domain. With this formulation, the total adversarial loss in the backward transformation becomes a summation of n adversarial terms as follows:

L_{GAN, T→S} = ∑_{i=1}^{n} E_{s_i∼p_data(s_i)}[log D_{S_i}(s_i)] + ∑_{i=1}^{n} E_{t∼p_data(t)}[1 − log D_{S_i}(f_{T→S_i}(t))].
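The following is a sketch of how these two adversarial terms could be computed in practice, assuming discriminators that output raw logits and a binary cross-entropy parameterization of the GAN objective (the text above states the log-likelihood form; alternatives such as a least-squares GAN loss follow the same structure). All function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(d_t, d_s_list, real_t, fake_t, real_s_list, fake_s_list):
    """Forward term (single target discriminator D_T) and backward term
    (one discriminator D_{S_i} per source domain, summed over i)."""
    def real_fake(d, real, fake):
        r, f = d(real), d(fake)
        return (F.binary_cross_entropy_with_logits(r, torch.ones_like(r)) +
                F.binary_cross_entropy_with_logits(f, torch.zeros_like(f)))

    loss_s2t = real_fake(d_t, real_t, fake_t)                    # L_GAN,S->T
    loss_t2s = sum(real_fake(d_si, rs, fs)                       # L_GAN,T->S, summed over the n source domains
                   for d_si, rs, fs in zip(d_s_list, real_s_list, fake_s_list))
    return loss_s2t, loss_t2s
```

During training, the discriminators minimize these terms while the generators are updated adversarially against them.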

Latent Consistency Loss. As briefly discussed above, the adversarial loss only ensures that the generated image looks realistic in the target domain. Therefore, the adversarial loss alone is inadequate to produce a transformation which preserves the semantic information of the input. However, based on the assumption that both the input and target domains share a common latent representation, it is possible to enforce a stricter constraint to ensure that the semantics of the input and the output are preserved. This is done by forcing the latent representation obtained during the forward transformation to be equal to the latent representation obtained during the reverse transformation for the same input.

More specifically, for a given input s, the latent representation h_{S→Z}(s) is recorded. This recorded vector is then compared against the latent representation h_{T→Z}(f_{S→T}(s)) obtained during the reverse transformation. The latent consistency loss in the forward transformation is defined as L_{latent, S→T} = E_{s∼p_data(s)}[ ∥h_{S→Z}(s) − h_{T→Z}(f_{S→T}(s))∥_1 ]. Similarly, the latent consistency loss in the reverse transformation is defined as L_{latent, T→S} = E_{t∼p_data(t)}[ ∥h_{T→Z}(t) − h_{S→Z}(f_{T→S}(t))∥_1 ].

Cycle Consistency Loss. If the input and the transformed image do not share semantic information, it is impossible to regenerate the input from the transformed image. Therefore, by forcing the learned transformation to have a valid inverse transformation, it is further possible to force the generated image to share semantics with the input. Based on this rationale, the cycle consistency loss was introduced in [29] to ensure that the transformed image shares semantics with the input image. Since this argument is equally valid for the multi-input case, we adopt the cycle consistency loss [29] in our formulation. The proposed backward cycle consistency loss is similar in definition to that of [29]; we define it as L_{cyc, T→S} = E_{t∼p_data(t)}[ ∥f_{S→T}(f_{T→S}(t)) − t∥_1 ]. In comparison, the forward cycle consistency loss takes the n inputs into account and compares the distances over all n reconstructions, as opposed to a single reconstruction in [29]. It is defined as L_{cyc, S→T} = E_{s∼p_data(s)}[ ∥f_{T→S}(f_{S→T}(s)) − s∥_1 ].

Cumulative Loss. The final objective function is the sum of the three losses introduced in this section. The cumulative loss is defined as L_{total} = L_{GAN, S→T} + L_{GAN, T→S} + λ_1 (L_{cyc, T→S} + L_{cyc, S→T}) + λ_2 (L_{latent, S→T} + L_{latent, T→S}), where λ_1 and λ_2 are constants.
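Putting the pieces together, the following sketch assembles the cumulative objective from precomputed terms, assuming the adversarial terms come from a routine like the one above and the latent codes are exposed by the generators (as in the generator sketch earlier). The weights λ_1 = 10 and λ_2 = 1 follow the settings reported in Section 4; all names are illustrative.

```python
import torch.nn.functional as F

def in2i_total_loss(adv_s2t, adv_t2s,       # adversarial terms L_GAN,S->T and L_GAN,T->S
                    inputs, cyc_inputs,      # the n source images and their reconstructions f_T->S(f_S->T(s))
                    target, cyc_target,      # a target-domain image and its reconstruction f_S->T(f_T->S(t))
                    z_fwd, z_fwd_rec,        # h_S->Z(s) and h_T->Z(f_S->T(s))
                    z_bwd, z_bwd_rec,        # h_T->Z(t) and h_S->Z(f_T->S(t))
                    lam1=10.0, lam2=1.0):
    cyc_s2t = sum(F.l1_loss(r, x) for r, x in zip(cyc_inputs, inputs))   # forward cycle loss over all n inputs
    cyc_t2s = F.l1_loss(cyc_target, target)                              # backward cycle loss
    latent = F.l1_loss(z_fwd, z_fwd_rec) + F.l1_loss(z_bwd, z_bwd_rec)   # latent consistency, both directions
    return adv_s2t + adv_t2s + lam1 * (cyc_t2s + cyc_s2t) + lam2 * latent
```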

Limiting Case. It is interesting to investigate the behavior of the proposed network in the limiting case when n = 1. In this case, both the number of input and output modalities of the network become one, and the set of source domains reduces to a single domain S in the loss terms defined above. In addition, with n = 1, the summation in L_{GAN, T→S} reduces to a single term. If we disregard the latent consistency loss by forcing λ_2 = 0, the total objective reduces to

L_{total} = E_{t∼p_data(t)}[log D_T(t)] + E_{s∼p_data(s)}[log D_S(s)] + E_{s∼p_data(s)}[1 − log D_T(f_{S→T}(s))] + E_{t∼p_data(t)}[1 − log D_S(f_{T→S}(t))] + λ_1 ( E_{t∼p_data(t)}[∥f_{S→T}(f_{T→S}(t)) − t∥_1] + E_{s∼p_data(s)}[∥f_{T→S}(f_{S→T}(s)) − s∥_1] ).

This reduced objective is identical to the total objective of CycleGAN. Therefore, in the limiting case of n = 1, the proposed method reduces to the CycleGAN formulation when the latent consistency loss is disregarded.

Network Architecture. In this section, we describe the network architecture of the proposed generators by considering the case where two input modalities are used, i.e., when n = 2. The resulting two generators are illustrated in Figure 4. It should be noted that the Convolutional Neural Network (CNN) architectures used in both the forward and reverse transformations are consistent with the generic structure shown in Figure 3. In principle, the generator can be based on any backbone architecture; in our work, we used ResNet [6] with nine ResNet blocks as the backbone. In our proposed network, a CNN is used for each module in Figure 3. These CNNs typically consist of convolutions/transposed convolutions followed by nonlinearities, batch-normalization layers and, possibly, skip connections.

Two input images (from the two input domains) are provided as the input of the forward transformation. These images are passed through two parallel CNNs to extract features from each modality. Then, the extracted features are fused to generate an intermediate feature representation. In our work, feature fusion is performed by concatenating the feature maps from the feature extraction stage and using a convolution operation to reduce the dimension. This fused feature is then subjected to a set of convolution operations to arrive at the latent space. Finally, the latent space representation is passed through a series of CNNs with transposed convolution operations to generate a single output image (from the target domain).

During the backward transformation, a single input is present. A CNN with convolution operations is used to transform the input into the latent space. It should be noted that since there is only a single input, there is no notion of fusion in this case. Two parallel CNNs consisting of transposed convolutions branch out from the latent space to produce two outputs corresponding to the two source domains.

This architecture can be extended to n modalities. In this case, the core structure will be similar to that of Figure 4, except that there will be n parallel branches instead of two at either end of the network. For the discriminator networks, we use the PatchGANs proposed in [7]. Please refer to the supplementary material for the exact details of the architecture.
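For completeness, here is a hedged sketch of a PatchGAN discriminator in the spirit of [7]: a small convolutional network whose output is a grid of real/fake logits, one per overlapping image patch. The widths and depths below follow the commonly used 70×70 configuration and are not necessarily the exact ones used in this work.

```python
import torch.nn as nn

def patchgan_discriminator(in_ch=3, feat=64):
    layers = [nn.Conv2d(in_ch, feat, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True)]
    ch = feat
    for mult in (2, 4):                                   # two further downsampling blocks
        layers += [nn.Conv2d(ch, feat * mult, 4, stride=2, padding=1),
                   nn.InstanceNorm2d(feat * mult), nn.LeakyReLU(0.2, True)]
        ch = feat * mult
    layers += [nn.Conv2d(ch, feat * 8, 4, stride=1, padding=1),
               nn.InstanceNorm2d(feat * 8), nn.LeakyReLU(0.2, True),
               nn.Conv2d(feat * 8, 1, 4, stride=1, padding=1)]   # one logit per patch
    return nn.Sequential(*layers)
```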

Figure 4: Generator architecture of In2I when two input modalities are used.

4 Experimental Results

We test the proposed method on three publicly available multi-modal image datasets across three tasks against state-of-the-art unsupervised image-to-image translation methods. The training was carried out adhering to the principles of unsupervised learning: even when ground truth images of the desired translation were available, they were not used during training. When available, ground truth images were used during testing to quantify the structural distortion introduced by each method by calculating the PSNR and SSIM [25] metrics. It should be noted that, in this case, two disjoint image sets were used for training and testing.
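A minimal sketch of this evaluation step using scikit-image (an assumption; the paper does not state which implementation was used). It expects 8-bit RGB arrays and scikit-image >= 0.19 for the channel_axis argument.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(pred_rgb, gt_rgb):
    """PSNR and SSIM between a translated image and its ground truth (H x W x 3, uint8)."""
    psnr = peak_signal_noise_ratio(gt_rgb, pred_rgb, data_range=255)
    ssim = structural_similarity(gt_rgb, pred_rgb, channel_axis=-1, data_range=255)
    return psnr, ssim
```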

As the benchmarks for performance comparison, we use the CycleGAN [29] and UNIT [11] frameworks. Since both of these methods are specifically designed for single inputs, we used all available image modalities, one at a time, to produce the corresponding outputs. In the implementation of the proposed method, λ_1 and λ_2 in the cumulative loss are set to 10 and 1, respectively. Learning is performed using the Adam optimizer [9] with a batch size of 1. The initial learning rates of the generators and discriminators were set to 0.0002 and 0.0001, respectively. Training was conducted for 200 epochs, where the learning rate was linearly decayed over the last 100 epochs.
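A sketch of the stated optimization setup in PyTorch follows. Only Adam, the batch size of 1, the initial learning rates and the linear decay over the last 100 of 200 epochs come from the text; the placeholder modules, the betas value and the exact schedule endpoints are assumptions.

```python
import itertools
import torch
import torch.nn as nn

# Placeholder networks standing in for the In2I generators and discriminators (illustrative only).
gen_fwd = nn.Conv2d(2, 3, 3, padding=1)          # forward generator stand-in
gen_bwd = nn.Conv2d(3, 2, 3, padding=1)          # reverse generator stand-in
discs = nn.ModuleList([nn.Conv2d(c, 1, 3, padding=1) for c in (3, 1, 1)])  # D_T, D_S1, D_S2 stand-ins

opt_g = torch.optim.Adam(itertools.chain(gen_fwd.parameters(), gen_bwd.parameters()),
                         lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discs.parameters(), lr=1e-4, betas=(0.5, 0.999))

def linear_decay(epoch, total=200, decay_start=100):
    """Constant learning rate for the first 100 epochs, then linear decay."""
    return 1.0 if epoch < decay_start else 1.0 - (epoch - decay_start) / float(total - decay_start)

sched_g = torch.optim.lr_scheduler.LambdaLR(opt_g, lr_lambda=linear_decay)
sched_d = torch.optim.lr_scheduler.LambdaLR(opt_d, lr_lambda=linear_decay)
# call sched_g.step() and sched_d.step() once per epoch after the optimization steps
```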

Image Colorization (EPFL NIR-VIS Dataset). The EPFL NIR-VIS dataset [2] includes 477 images in 9 categories captured in the RGB and near-infrared (NIR) image modalities across diverse scenes. The scenes included in this dataset are categorized as country, field, forest, indoor, mountain, old building, street, urban and water. We use this dataset to simulate the image colorization task. We generated greyscale images from the RGB visible images and use the greyscale and NIR images as the input modalities, with the aim of producing the corresponding RGB image. We randomly selected 50 images as the test set and used the remaining images for training.

First, we trained CycleGAN [29] and UNIT [11] models for each input modality independently. Then, the proposed method was used to train a model based on both input modalities. The results obtained for these cases are shown in Figure 5, and the PSNR and SSIM values obtained for each method on the test data are tabulated in Table 1. By inspection, CycleGAN operating on greyscale images was able to identify segments in the image but failed to assign the correct colors. For example, in the first row, the tree is correctly segmented but given a wrong color. In comparison, CycleGAN with NIR images has produced a much better colorization. Since the amount of energy an object reflects depends on wavelength, the NIR signal contains some information about the color of the object; this could be the reason why NIR images enable better colorization compared to greyscale images. The same trend can be observed in the outputs of the UNIT method.

On the other hand, the proposed method has produced a colorization very similar to the ground truth. As examples, we draw the reader's attention to the colors of the tree and the field in the first row, and to the colors of the building and the tree in the last row. The proposed method has also recorded superior PSNR and SSIM values compared with the other baselines, as shown in Table 1. It should be noted that PSNR and SSIM values only reflect how well the structure of objects in the images has been preserved; they are not meant to be an indication of how well the colorization task has been carried out.

Figure 5: Qualitative results for the image colorization task.
EPFL NIR-VIS Freiburg Forest Dataset
Method PSNR SSIM Method PSNR SSIM
Ours (NIR+Grey) 23.113 (9.147) 0.739 (0.008) Ours (NIR+Depth) 19.444 (7.059) 0.584 (0.012)
UNIT (Grey) 8.324 (2.219) 0.041 (0.018) UNIT (Depth) 9.681 (1.490) 0.414 (0.007)
UNIT (NIR) 15.331 (9.088) 0.544 (0.012) UNIT (NIR) 9.494 (0.868) 0.382 (0.004)
CycleGAN (Grey) 8.438 (2.939) 0.056 (0.018) CycleGAN (Depth) 16.5945 (4.308) 0.525 (0.010)
CycleGAN (NIR) 17.381 (9.345) 0.657 (0.018) CycleGAN (NIR) 18.574 (3.252) 0.552 (0.014)
- - - Ours (All Inputs) 21.65 (2.302) 0.600 (0.0105)
Table 1: Mean PSNR and SSIM values for the image colorization and hyperspectral-to-visible image translation tasks. Variance is indicated within brackets.

Synthetic-to-Real Image Translation (Synthia and Cityscapes Datasets). In this subsection, we experiment with generating real images from synthetic images. For this purpose, we use two datasets, Synthia [18] and Cityscapes [3], as the source and the target domains, respectively. The Cityscapes dataset contains images taken across fifty urban cities during daytime. We use 1525 images from the validation set of the dataset to represent the target domain in the synthetic-to-real translation task. The Synthia dataset contains graphical simulations of an urban city. The scenes included in the dataset cover different weather conditions, lighting conditions and seasons. For our work, we only use the summer daylight subset of the dataset, which includes 901 images, for training. The Synthia dataset provides RGB image intensities as well as the depth information of the scene; hence, we use these as the two input modalities.

Results of this experiment are shown in Figure 6. In this particular task, the UNIT method has only changed the generic color scheme of the scene, and with incorrect associations; for example, note that skies look brown instead of blue in the resulting images. In addition, objects in the scene continue to possess the characteristics of synthetic images. In contrast, CycleGAN has attempted to convert the appearance of the synthetic images to real ones; however, in the process it has distorted the structure of objects. When only depth information is used, CycleGAN is unable to preserve the structure of objects in the scene; for example, lines along the roads end up warped in the learned representation in Figure 6. The CycleGAN model based on the RGB images preserves the overall structure to an extent, but vital details are either missing or misleading. For example, pavements are missing from the images shown in rows 2 and 3 of Figure 6. The absence of a shadow on the road in row 2, the addition of clutter on the left pavement in row 3 and the disappearance of the telephone pole in row 4 are some of the notable incoherences. Comparatively, the fusion of both RGB and depth information has resulted in a more realistic translation. It should be noted that synthetic-to-real translation is a challenging problem in practice, and when certain concepts were missing in either the source or the target domain, the model found it difficult to learn them. For example, the training images from Cityscapes did not contain zebra crossings; therefore, the concept of zebra crossings is not learned well by the model, as shown in row 1.

Figure 6: Qualitative results for the synthetic-to-real translation task.
Figure 7: Qualitative results for the hyperspectral-to-visible translation.

Hyperspectral-to-Visible Image Translation (Freiburg Forest Dataset). The Freiburg Forest dataset [23] contains 230 training images and 136 test images of forest scenery captured at different wavelengths. The image types contained in the dataset include RGB, depth, NIR, EVI (Enhanced Vegetation Index) and combinations of the aforementioned types. In our experiments, we use this dataset to perform hyperspectral-to-visible image translation, where the depth and NIR image modalities are used to recover the RGB images of the visual spectrum.

We note that this task is relatively easier than the earlier two tasks due to the low diversity of the scenes. All methods except UNIT were able to generate realistic RGB images from the provided input modalities, as shown in Figure 7. The reason why UNIT fails in this task could be the low number of training samples [11]. For CycleGAN, visual-domain image reconstruction based solely on the depth information results in sub-standard images. For example, the road is missing in row 1 and takes a wrong shape in row 2 in the depth-based CycleGAN output. When the NIR images are used as the input, the resulting CycleGAN output is closer to the ground truth; however, in this case there exist multiple missing regions, where pixels are painted white, as shown in rows 1-3.

In terms of reconstructing finer details, the proposed method, which utilizes both NIR and depth information, has outperformed the other baselines. Compared to the earlier tasks, all methods exhibit less structural distortion, as evident from Table 1. However, even in this case, the proposed method has performed marginally better than the other baseline methods in terms of SSIM and PSNR.

Since the dataset has three modalities available, we ran an extra experiment by feeding all three modalities (depth, NIR and EVI) into the proposed method. The results produced in this case were generally closer to the ground truth. For example, the water residue on the ground is reproduced in row 3 in this case, which was absent in all prior cases. The use of three modalities improved the PSNR by more than 2 dB (from 19.44 to 21.65), as shown in Table 1. Please refer to the supplementary material for more results and analysis.

Ablation Study. The proposed cumulative loss function has three main components: the adversarial loss, the latent consistency loss and the cycle consistency loss. In this subsection, we carry out an ablation study on the Freiburg Forest dataset to investigate the impact of each individual loss term. In this study, we considered three alternative loss functions: (a) only the adversarial loss, (b) the adversarial loss plus the latent consistency loss, and (c) the total loss. The image reconstructions obtained for each case on a set of sample images are shown in Figure 8. Images generated in case (a) are plausible forest images, but they are very different from the ground truth; this is because the adversarial loss does not take semantic information into account. Comparatively, images in case (b) share more semantics with the ground truth. For example, in row 1, some trees are generated at the right side of the road when compared with case (a). Adding the cycle consistency loss in case (c) increases the coherence between the output and the ground truth even further compared with the earlier cases; for instance, the image artifacts present on the left side in case (b) disappear in case (c).

Figure 8: Results of the ablation study carried out on the hyperspectral-to-visible translation task.

Impact of Multiple Inputs. The three experiments performed in this section are of different levels of difficulty. The hyperspectral-to-visible image translation task is the easiest due to the low variance of the scenes in the dataset. In such scenarios, even a single modality is able to produce a reasonable translation; however, we note that introducing an additional modality still improved the performance. In comparison, the colorization task is more challenging due to the diversity of scenes. As a result, a single modality was not able to perform colorization satisfactorily. In this case, multi-image-to-image translation yielded a large improvement in visual quality by using two informative input modalities. The final case, synthetic-to-real image translation, is very challenging. We note that the depth modality is not very informative in this case, since it leads to image reconstructions of sub-standard quality. In comparison, the synthetic RGB modality resulted in better translations. Using both modalities improved the visual quality of the output, but this improvement was marginal compared to that observed in the colorization task.

In summary, multiple modalities generally improve the visual quality of the output image, especially when the translation is more challenging. However, the amount of improvement depends on the informativeness of the additional modality; introducing a noisy modality for the sake of having multiple inputs would not contribute towards an improvement.

5 Conclusion

In this work, we introduced the multi-image-to-image translation (In2I) problem. We proposed a multi-modal generator structure and a GAN-based framework as an initial direction towards solving the problem. We tested the proposed method across three tasks against state-of-the-art unsupervised image-to-image translation methods and showed that using multiple image modalities improves the visual quality of the output compared with results generated by the state-of-the-art methods. We also analyzed the behavior of the proposed method in the limiting case and discussed when the use of multiple image modalities is most suitable.

6 Supplementary Material

6.1 Detailed Network Architecture

In this section, we provide details of the architecture used in the generator networks for the cases when two and three input modalities are used. In both cases there are two generators: one for the forward transformation and the other for the reverse transformation. These generators follow the structure shown in Figure 3 of the main paper.

When two input modalities are available, the generator structure used is shown in modular form in Figure 4 of the paper. The exact details of each of these modules are tabulated in Table 2. Details of the reverse transformation network shown in Figure 4 of the paper can be found in Table 3. Here, Conv refers to a collection of convolution, instance normalization and ReLU layers; Deconv refers to a collection of transposed convolution, instance normalization and ReLU layers; Res refers to a ResNet block.

Layer Input Output Kernel (stride, pad)

Feature Extraction Conv Image 1 Conv (1,0)
Conv Conv Conv (2,1)
Conv Conv Res (2,1)
Res Conv Res (1,0)
Res Res Res (1,0)
Res Res Res (1,0)
Res Res Conv (1,0)
Feature Extraction Conv Image 2 Conv (1,0)
Conv Conv Conv (2,1)
Conv Conv Res (2,1)
Res Conv Res (1,0)
Res Res Res (1,0)
Res Res Res (1,0)
Res Res Conv (1,0)
Fusion Conv Res , Res Res (1,0)
Encoder Res Conv res (1,0)
Res Res res (1,0)
Res Res res (1,0)
Res Res Latent (1,0)
Z Latent Res Res Identity
Decoder Res Latent res (1,0)
Res Res res (1,0)
Res Res Deconv (1,0)
Deconv Res Deconv (2,1)
Deconv Deconv Deconv (2,1)
Deconv Deconv Output (1,0)
Table 2: Network details corresponding to the forward transformation generator for n = 2.
Layer Input output Kernel (stride, pad)
Feature Extraction Conv Image 1 Conv (1,0)
Conv Conv Conv (2,1)
Conv Conv Res (2,1)
Res Conv Res (1,0)
Encoder Res Res Res (1,0)
Res Res Res (1,0)
Res Res Latent (1,0)
Z Latent Res Res , Res Identity
Decoder Res Latent res (1,0)
Res Res Res (1,0)
Res Res res (1,0)
Res Res Res (1,0)
Res Res Deconv (1,0)
Deconv Res Deconv (2,1)
Deconv Deconv Deconv (2,1)
Deconv Deconv (1,0)
Decoder Res Latent res (1,0)
Res Res Res (1,0)
Res Res res (1,0)
Res Res Res (1,0)
Res Res Deconv (2,1)
Deconv Res Deconv (2,1)
Deconv Deconv Deconv (2,1)
Deconv Deconv (1,0)
Table 3: Network details corresponding to the reverse transformation generator for n = 2.

We carried out a single experiment using three modalities on the hyperspectral-to-visible image translation task. We outline the network architectures of the forward and the reverse transformations in Tables 4 and 5, respectively.

Layer Input output Kernel (stride, pad)
Feature Extraction Conv Image 1 Conv (1,0)
Conv Conv Conv (2,1)
Conv Conv Res (2,1)
Res Conv Res (1,0)
Res Res Res (1,0)
Res Res Res (1,0)
Res Res Conv (1,0)
Feature Extraction Conv Image 2 Conv (1,0)
Conv Conv Conv (2,1)
Conv Conv Res (2,1)
Res Conv Res (1,0)
Res Res Res (1,0)
Res Res Res (1,0)
Res Res Conv (1,0)
Feature Extraction Conv Image 3 Conv (1,0)
Conv Conv Conv (2,1)
Conv Conv Res (2,1)
Res Conv Res (1,0)
Res Res Res (1,0)
Res Res Res (1,0)
Res Res Conv (1,0)
Fusion Conv Res , Res , Res Res (1,0)
Encoder Res Conv res (1,0)
Res Res res (1,0)
Res Res res (1,0)
Res Res Latent (1,0)
Z Latent Res Res Identity
Decoder Res Latent res (1,0)
Res Res res (1,0)
Res Res Deconv (1,0)
Deconv Res Deconv (2,1)
Deconv Deconv Deconv (2,1)
Deconv Deconv Output (1,0)
Table 4: Network details corresponding to the forward transformation generator for n = 3.
Layer Input output Kernel (stride, pad)
Feature Extraction Conv Image 1 Conv (1,0)
Conv Conv Conv (2,1)
Conv Conv Res (2,1)
Res Conv Res (1,0)
Encoder Res Res Res (1,0)
Res Res Res (1,0)
Res Res Latent (1,0)
Z Latent Res Res , Res , Res Identity
Decoder Res Latent res (1,0)
Res Res Res (1,0)
Res Res res (1,0)
Res Res Res (1,0)
Res Res Deconv (1,0)
Deconv Res Deconv (2,1)
Deconv Deconv Deconv (2,1)
Deconv Deconv Output (2,1)
Decoder Res Latent res (1,0)
Res Res Res (1,0)
Res Res res (1,0)
Res Res Res (1,0)
Res Res Deconv (1,0)
Deconv Res Deconv (2,1)
Deconv Deconv Deconv (2,1)
Deconv Deconv Output (2,1)
Decoder Res Latent res (1,0)
Res Res Res (1,0)
Res Res res (1,0)
Res Res Res (1,0)
Res Res Deconv (2,1)
Deconv Res Deconv (2,1)
Deconv Deconv Deconv (2,1)
Deconv Deconv Output (1,0)
Table 5: Network details corresponding to the reverse transformation generator for n = 3.

6.2 Additional Experimental Results

In this section, we present the results reported in the main paper at a higher resolution. In addition, we present the following two additional baseline comparisons.

1. CycleGAN (Concat). Inputs from multiple modalities are concatenated as channels. Operating on the concatenated input, CycleGAN is used to learn the relevant transformation.

2. Image Fusion. The input images are first fused using a wavelet-based image fusion technique. In particular, the wavelet coefficients of each input modality are computed independently using the db4 wavelet. In the wavelet domain, the coefficients are fused by averaging over all modalities. The fused coefficients are transformed back to the image domain by the inverse wavelet transform, and CycleGAN is then applied to the fused image (a sketch of this fusion step is given after this list).
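A minimal sketch of this pixel-level fusion step using PyWavelets, assuming same-sized single-channel inputs (for multi-channel modalities it would be applied per channel). Function and parameter names are illustrative.

```python
import numpy as np
import pywt

def wavelet_fuse(images, wavelet="db4", level=2):
    """Fuse same-sized single-channel images by averaging their wavelet coefficients."""
    decomps = [pywt.wavedec2(img.astype(np.float64), wavelet, level=level) for img in images]
    fused = [np.mean([d[0] for d in decomps], axis=0)]                     # approximation coefficients
    for lev in range(1, level + 1):                                        # detail coefficients, coarse to fine
        fused.append(tuple(np.mean([d[lev][k] for d in decomps], axis=0) for k in range(3)))
    return pywt.waverec2(fused, wavelet)                                   # back to the image domain
```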

The results corresponding to the image colorization task are shown in Figures 9, 10, 11 and 12. In all these cases, the proposed method yields a more realistic colorization. The results obtained for synthetic-to-real translation are shown in Figures 13, 14, 15 and 16. As described in the main paper, the proposed method has performed a translation of higher quality in this task as well. The third task, hyperspectral-to-visible image translation, is the easiest of the three. Therefore, CycleGAN (NIR), CycleGAN (Concat) and Image Fusion (CycleGAN) are able to produce results on par with the proposed method (Figure 19). However, in Figures 17 and 18, the proposed method produces images of distinguishably higher quality compared with the baselines.

Figure 9: Results corresponding to the image colorization task (EPFL NIR-VIS Dataset, country category). Panels: Input (NIR), Input (Greyscale), Target (RGB), Ours, CycleGAN (NIR) [29], CycleGAN (Grey) [29], CycleGAN (Concat) [29], Image Fusion (CycleGAN) [29], UNIT (NIR) [11], UNIT (Grey) [11]. The colorization achieved by the proposed method is closer to the ground truth compared with the baseline methods. This is most evident in the colors of the tree leaves, the grass and the sky.
Figure 10: Results corresponding to the image colorization task (EPFL NIR-VIS Dataset, mountain category). Panels: Input (NIR), Input (Greyscale), Target (RGB), Ours, CycleGAN (NIR) [29], CycleGAN (Grey) [29], CycleGAN (Concat) [29], Image Fusion (CycleGAN) [29], UNIT (NIR) [11], UNIT (Grey) [11]. Both CycleGAN (NIR) and UNIT (NIR) have produced results on par with the proposed method. However, the valleys are colored green by CycleGAN, and UNIT imposes a dark gray shade over the whole image. Comparatively, the proposed method is closer to the ground truth.
Figure 11: Results corresponding to the image colorization task (EPFL NIR-VIS Dataset, urban category). Panels: Input (NIR), Input (Greyscale), Target (RGB), Ours, CycleGAN (NIR) [29], CycleGAN (Grey) [29], CycleGAN (Concat) [29], Image Fusion (CycleGAN) [29], UNIT (NIR) [11], UNIT (Grey) [11]. The colors of the sky, the tree and the building in the proposed method's output are closer to the ground truth than those of the other baselines.
Figure 12: Results corresponding to the image colorization task (EPFL NIR-VIS Dataset, water category). Panels: Input (NIR), Input (Greyscale), Target (RGB), Ours, CycleGAN (NIR) [29], CycleGAN (Grey) [29], CycleGAN (Concat) [29], Image Fusion (CycleGAN) [29], UNIT (NIR) [11], UNIT (Grey) [11]. CycleGAN (NIR) has the result closest in quality to the proposed method. However, the water has a yellow shade mixed into it and the sky appears bluer than the ground truth. These differences are not present in our output.
Figure 13: Results corresponding to the synthetic-to-real translation task (sample 00400). Panels: Input (RGB), Input (Depth), Ours, CycleGAN (RGB) [29], CycleGAN (Depth) [29], CycleGAN (Concat) [29], Image Fusion (CycleGAN) [29], UNIT (RGB) [11], UNIT (Depth) [11]. CycleGAN (RGB) and CycleGAN (Concat) have distorted reconstructions: the pole color is inverted, distortions are visible on cars and buildings, and buildings far away are mistaken for trees. UNIT (RGB) has only changed the color scheme of the output, and characteristics of artificial graphics remain apparent (the texture of the road, for example). The proposed method produces a comparatively more realistic reconstruction.
Figure 14: Results corresponding to the synthetic-to-real translation task (sample 00600). Panels: Input (RGB), Input (Depth), Ours, CycleGAN (RGB) [29], CycleGAN (Depth) [29], CycleGAN (Concat) [29], Image Fusion (CycleGAN) [29], UNIT (RGB) [11], UNIT (Depth) [11]. In addition to the points made in Figure 13, note that both CycleGAN (RGB) and CycleGAN (Concat) have erroneously generated clutter on the left pavement.
Figure 15: Results corresponding to the synthetic-to-real translation task (sample 00800). Panels: Input (RGB), Input (Depth), Ours, CycleGAN (RGB) [29], CycleGAN (Depth) [29], CycleGAN (Concat) [29], Image Fusion (CycleGAN) [29], UNIT (RGB) [11], UNIT (Depth) [11]. The CycleGAN (RGB) output is similar to the output of the proposed method to an extent in this case, but note the distortions present on the walls on either side of the street in the CycleGAN (RGB) output.
Figure 16: Results corresponding to the synthetic-to-real translation task (sample 00700). Panels: Input (RGB), Input (Depth), Ours, CycleGAN (RGB) [29], CycleGAN (Depth) [29], CycleGAN (Concat) [29], Image Fusion (CycleGAN) [29], UNIT (RGB) [11], UNIT (Depth) [11]. Both CycleGAN (RGB) and CycleGAN (Concat) have failed to identify the shadow of the tree in their respective outputs. Compared to the output of the proposed method, both these outputs have considerable distortions on the walls on either side of the street.
Figure 17: Results corresponding to the hyperspectral-to-visible translation task (image b118-047). Panels: Input (NIR), Input (Depth), Target (RGB), Ours, CycleGAN (NIR) [29], CycleGAN (Depth) [29], CycleGAN (Concat) [29], Image Fusion (CycleGAN) [29], UNIT (NIR) [11], UNIT (Depth) [11]. The white artifacts present in CycleGAN (NIR) are even more apparent in CycleGAN (Concat). Image Fusion (CycleGAN) has produced an image with fewer artifacts. However, the proposed method produces the reconstruction closest to the ground truth.
Figure 18: Results corresponding to the hyperspectral-to-visible translation task (image b197-61). Panels: Input (NIR), Input (Depth), Target (RGB), Ours, CycleGAN (NIR) [29], CycleGAN (Depth) [29], CycleGAN (Concat) [29], Image Fusion (CycleGAN) [29], UNIT (NIR) [11], UNIT (Depth) [11]. CycleGAN (Concat) has produced a result on par with the proposed method, while there are observable deficiencies in the other outputs. However, in terms of colors (the trees on the left and the grass), the proposed method is closer to the ground truth.
Figure 19: Results corresponding to the hyperspectral-to-visible translation task (image b304-36). Panels: Input (NIR), Input (Depth), Target (RGB), Ours, CycleGAN (NIR) [29], CycleGAN (Depth) [29], CycleGAN (Concat) [29], Image Fusion (CycleGAN) [29], UNIT (NIR) [11], UNIT (Depth) [11]. Both CycleGAN (Concat) and the proposed method produce results of similar quality for this image. Therefore, feature fusion and pixel-level fusion perform similarly in this case.

6.3 Pixel-level Fusion vs Feature Fusion

In the proposed network, information from the input modalities is fused at the beginning of the encoder sub-network. In principle, fusion can be carried out as pixel-level fusion, feature fusion or decision fusion [14]. Since the task at hand takes the form of image reconstruction, decision fusion is not applicable. In our method, we utilize feature fusion, where we first extract feature maps from each modality and fuse them together using a convolution operation. In principle, it is also possible to use pixel-level fusion for this task; for example, pixel-level fusion can be performed by training a CycleGAN model where a concatenation of all inputs is provided as the input to the network (CycleGAN (Concat) in our experiments).

However, when the input modalities are from incompatible domains, pixel-level fusion may result in incoherent reconstructions. In order to illustrate this, we direct the reader's attention to the CycleGAN (Concat) results shown in Figures 9 to 19. Compared to the single-modality CycleGAN results in Figures 9, 10, 11 and 12, the colorization produced by CycleGAN (Concat) is less realistic. For example, CycleGAN operating on the NIR images has captured the colorization of the mountain image in Figure 10 much better than the fused version has. In this case, pixel-level fusion has in fact deteriorated the performance compared to the single-modality case. This trend can be observed across all three tasks. However, the performance of CycleGAN (Concat) is reasonable for most images in the hyperspectral-to-visible translation task (except for Figure 17). In this special case, pixel-level fusion has worked effectively.

Acknowledgement

This work was supported by US Office of Naval Research (ONR) Grant YIP N00014-16-1-3134.

References

  • [1] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In Advances in Neural Information Processing Systems, 2016.
  • [2] M. Brown and S. Süsstrunk. Multispectral SIFT for scene category recognition. In Computer Vision and Pattern Recognition (CVPR11), pages 177–184, Colorado Springs, June 2011.
  • [3] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [4] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. S. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17:59:1–59:35, 2016.
  • [5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778, 2016.
  • [7] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [8] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
  • [9] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In The International Conference on Learning Representations, 2015.
  • [10] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [11] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, 2017.
  • [12] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, pages 469–477. Curran Associates, Inc., 2016.
  • [13] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [14] H. B. Mitchell. Image Fusion: Theories, Techniques and Applications. Springer Publishing Company, Incorporated, 1st edition, 2010.
  • [15] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In ICML, pages 689–696. Omnipress, 2011.
  • [16] G. Pajares and J. M. de la Cruz. A wavelet-based image fusion tutorial. Elsevier Journal of Pattern Recognition, 37:1855–1872, 2004.
  • [17] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.
  • [18] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. Lopez. The SYNTHIA Dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [19] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems 29, pages 2234–2242, 2016.
  • [20] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):640–651, 2017.
  • [21] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [22] B. Sun, J. Feng, and K. Saenko. Return of frustratingly easy domain adaptation. In AAAI Conference on Artificial Intelligence, pages 2058–2065, 2016.
  • [23] A. Valada, G. Oliveira, T. Brox, and W. Burgard. Deep multispectral semantic scene understanding of forested environments using multimodal fusion. In The 2016 International Symposium on Experimental Robotics (ISER 2016), Tokyo, Japan, Oct. 2016.
  • [24] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In ECCV, 2016.
  • [25] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Transactions on Image Processing, 13(4):600–612, Apr. 2004.
  • [26] Z. Yi, H. Zhang, P. Tan, and M. Gong. Dualgan: Unsupervised dual learning for image-to-image translation. In Proceedings of the International Conference on Computer Vision, 2017.
  • [27] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
  • [28] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, 2016.
  • [29] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the International Conference on Computer Vision, 2017.