
Mix and match networks: encoder-decoder alignment for zero-pair image translation

We address the problem of image translation between domains or modalities for which no direct paired data is available (i.e. zero-pair translation). We propose mix and match networks, based on multiple encoders and decoders aligned in such a way that other encoder-decoder pairs can be composed at test time to perform unseen image translation tasks between domains or modalities for which explicit paired samples were not seen during training. We study the impact of autoencoders, side information and losses in improving the alignment and transferability of trained pairwise translation models to unseen translations. We show our approach is scalable and can perform colorization and style transfer between unseen combinations of domains. We evaluate our system in a challenging cross-modal setting where semantic segmentation is estimated from depth images, without explicit access to any depth-semantic segmentation training pairs. Our model outperforms baselines based on pix2pix and CycleGAN models.





1 Introduction

Image-to-image translations (or simply image translations) are an integral part of many computer vision systems. They include transformations between different modalities, such as from RGB to depth [19], or domains, such as luminance to color images [32], horses to zebras [34], or editing operations such as artistic style changes [9]. These mappings can also include 2D label representations such as semantic segmentations [21] or surface normals [5]. Deep networks have shown excellent results in learning models to perform image translations between different domains and modalities [1, 10, 21]. These systems are typically trained with pairs of matching images between domains, e.g. an RGB image and its corresponding depth image.

Figure 1: Zero-pair image translation: (a) given a set of domains or modalities (circles) for which paired training data is available, the objective is to evaluate zero-pair translations. Translations are implemented as aligned encoder-decoder networks. (b) Mix and match networks do not require retraining on unseen transformations, in contrast to unpaired translation alternatives (e.g. CycleGAN [34]). (c) Two cascaded paired translations (e.g. 2pix2pix [10]) require explicit translation to an intermediate domain. Better seen in color.

Image translation methods, which transfer images from one domain to another, are often based on encoder-decoder frameworks [1, 10, 21, 34]. In these approaches, an encoder network maps the input image from domain A to a continuous vector in a latent space. From this latent representation the decoder generates an image in domain B. The latent representation is typically much smaller than the original image size, thereby forcing the network to learn to efficiently compress the information from domain A which is relevant for domain B into the latent representation. Encoder-decoder networks are trained end-to-end by providing the network with matching pairs from both domains or modalities. An example could be learning a mapping from RGB to depth [19]. Other applications include semantic segmentation [1] and image restoration [23].

In this paper we introduce zero-pair image translation: a new setting for testing image translations which involves evaluating on unseen translations, i.e. no matching image or dataset pairs are available during training (see Figure 1a). Note that this setting is different from unpaired image translation [34, 14, 20], which is evaluated on the same paired domains seen during training.

We also propose mix and match networks, an approach that addresses zero-pair image translation by seeking alignment between encoders and decoders via their latent spaces. An unseen translation between two domains is performed by simply concatenating the input domain encoder and the output domain decoder (see Figure 1b). We study several techniques that can improve this alignment, including the usage of autoencoders, latent space consistency losses and the usage of pooling indices as side information to guide the reconstruction of spatial structure. We evaluate this approach in a challenging cross-modal task, where we perform zero-pair depth to semantic segmentation translation, using only RGB to depth and RGB to semantic segmentation pairs during training.

Finally, we show that aligned encoder-decoder networks also have advantages in domains with unpaired data. In this case, we show that mix and match networks scale better with the number of domains, since they are not required to learn all pairwise image translation networks (i.e. they scale linearly instead of quadratically). The code is available at

2 Related Work

Image translation

Recently, generic encoder-decoder architectures have achieved impressive results in a wide range of transformations between images. Isola et al. [10] trained from pairs of input and output images to learn a variety of image translations (e.g. color, style), using an adversarial loss. These models require paired training data to be available (i.e. paired image translation). Various works extended this idea to the case where no explicit input-output image pairs are available (unpaired image translation), using the idea of cyclic consistency [34, 14]. Liu et al. [20] show that unsupervised mappings can be learned by imposing a joint latent space between the encoder and the decoder. In this work we consider the case where paired data is available between some domains or modalities and not available between others (i.e. zero-pair), and how this knowledge can be transferred to those zero-pair cases. In concurrent work, Choi et al. [4] also address scaling to multiple domains (always in the RGB modality) by using a single encoder-decoder model. In contrast, our approach uses multiple cross-aligned encoders and decoders. Our cross-modal setting also requires deeper structural changes and modality-specific encoder-decoders.

Multimodal encoder-decoders

Encoder-decoder networks can be extended into multi-way encoder-decoder networks by adding encoders and/or decoders for multiple domains together. Recently, joint encoder-decoder architectures have been used in multi-task settings, where the network is trained to perform multiple tasks (e.g. depth estimation, semantic segmentation, surface normals) [5, 13], and in multimodal settings, where the input data can come from different modalities or even combine several of them [25].

Training a multimodal encoder-decoder network was recently studied in [16]. They use a joint latent representation space for the various modalities. In our work we consider the alignment and transferability of pairwise image translations to unseen translations, rather than joint encoder-decoder architectures. Another multimodal encoder-decoder network was studied in [2]. They show that multimodal autoencoders can address the depth estimation and semantic segmentation tasks simultaneously, even in the absence of some of the input modalities. None of these works considers the zero-pair image translation problem addressed in this paper.

Zero-shot recognition

In conventional supervised image recognition, the objective is to predict the class label that is provided during training [18, 8]. However, this poses limitations in scalability to new classes, since new training data and annotations are required. In zero-shot learning, the objective is to predict an unknown class for which there is no image available, but only a description of the class (i.e. a class prototype). This description can be a set of attributes (e.g. has wings, blue, four legs, indoor) [18, 11], concept ontologies [6, 27] or textual descriptions [26]. In general, an intermediate semantic space is leveraged as a bridge between the visual features from seen classes and the class descriptions of unseen ones. In contrast to zero-shot recognition, we focus on unseen translations (unseen input-output pairs rather than simply unseen class labels).

Zero-pair language translation

Evaluating models on unseen language pairs has been studied recently in machine translation [12, 3, 33, 7]. Johnson et al. [12] proposed a neural language model that can translate between multiple languages, even between pairs of languages for which no explicit paired sentences were provided. (Note that [12] refers to this as zero-shot translation. In this paper we refer to this setting as zero-pair, to emphasize that what is unseen is paired data and to avoid ambiguity with traditional zero-shot recognition, which typically refers to unseen samples.) In their method, the encoder, decoder and attention are shared. In our method we focus on images, which are a radically different type of data, with two-dimensional structure in contrast to the sequential structure of language.

3 Encoder-decoder alignment

3.1 Multi-domain image translation

We consider the problem of image translation from domain $\mathcal{X}^{(i)}$ to domain $\mathcal{X}^{(j)}$ as $T_{ij}: x^{(i)} \in \mathcal{X}^{(i)} \mapsto x^{(j)} \in \mathcal{X}^{(j)}$. In our case it is implemented as an encoder-decoder chain $x^{(j)} = T_{ij}(x^{(i)}) = g_j(f_i(x^{(i)}))$ with encoder $f_i$ and decoder $g_j$ (see Figure 1). The domains connected during training are all trained jointly, and in both directions. It is important to note that for each domain one encoder and one decoder are trained. By training all these encoders and decoders jointly the latent representation is encouraged to align. As a consequence of the alignment of the latent space we can do zero-pair translation at testing time between the domains for which no training pairs were available. The main aim of this article is to analyze to what extent this alignment allows for zero-pair image translation.
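The composition described above can be illustrated with a toy sketch. It is not the paper's implementation: the "domains" here are just the same signal at different scales, and the names `SCALES`, `make_encoder`, `make_decoder` and `translate` are hypothetical. The point is only that once every encoder maps into one shared latent space, any encoder can be chained with any decoder, including pairs never trained together.

```python
from typing import Callable, Dict, List

Image = List[float]
Latent = List[float]

# Toy assumption: each domain stores the same underlying signal on a
# different scale, so a perfectly "aligned" latent is the unscaled signal.
SCALES: Dict[str, float] = {"rgb": 255.0, "depth": 10.0, "seg": 1.0}


def make_encoder(domain: str) -> Callable[[Image], Latent]:
    """Encoder f_i: maps an image of `domain` into the shared latent space."""
    scale = SCALES[domain]
    return lambda x: [v / scale for v in x]


def make_decoder(domain: str) -> Callable[[Latent], Image]:
    """Decoder g_j: maps a shared latent vector to an image of `domain`."""
    scale = SCALES[domain]
    return lambda h: [v * scale for v in h]


# One encoder and one decoder per domain.
encoders = {d: make_encoder(d) for d in SCALES}
decoders = {d: make_decoder(d) for d in SCALES}


def translate(x: Image, src: str, dst: str) -> Image:
    """T_ij(x) = g_j(f_i(x)): chain the source encoder and target decoder."""
    return decoders[dst](encoders[src](x))
```

For instance, `translate(x, "depth", "seg")` composes the depth encoder with the segmentation decoder even if only (rgb, depth) and (rgb, seg) pairs had been used to fit them.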

3.2 Aligning for zero-pair translation

Zero-pair translation in images is especially challenging due to the inherent complexity of images, especially in multimodal settings. Ideally, a good latent representation that also works in unseen translations should be not only well-aligned but also unbiased to any particular domain. In addition, the encoder-decoder system should be able to preserve the spatial structure, even in unseen transformations.

Autoencoders   One way to improve alignment is by jointly training domain-specific autoencoders with the image translation networks. By sharing the weights between the auto-encoders and the image translation encoder-decoders the latent space is forced to align.

Latent space consistency   The latent space can be encouraged to be invariant across multiple domains. Taigman et al. [30] use the L2 distance between a latent representation and its reconstruction after another decoding and encoding cycle. When paired samples are available, we propose using cross-domain latent space consistency in order to enforce the latent representations of the two paired samples to be aligned.

Preserving spatial structure using side information   In general, image translation tasks result in output images with similar spatial structure as the input ones, such as scene layouts, shapes and contours that are preserved across modalities. In fact, this spatial structure available in the input image is critical to simplify the problem and achieve good results, and successful image translation methods often use multi-scale intermediate representations from the encoder as side information to guide the decoder in the upsampling process. Skip connections are widely used for this purpose. However, conditional decoders using skip connections expect specific information from a particular domain-specific encoder that would be unlikely to work in unseen encoder-decoder pairs. Motivated by efficiency, pooling indices [1] were recently proposed as a compact descriptor to guide the decoder in lightweight models. We show here that pooling indices constitute robust and relatively encoder-independent side information, suitable for improving decoding even in unseen translations.

Adding noise to latent space   We found that adding some noise at the output of each encoder also helps to train the network and improves the results during test. This seems to help in obtaining more invariance in the common latent representation and better alignment across modalities.

3.3 Scalable image translation

One of the advantages of our mix and match networks is that the system can infer many pairwise domain-to-domain translations when the number of domains is high, without explicitly training them. Other pairwise methods where encoders and decoders are not cross-aligned, such as CycleGAN [34], would require training $N(N-1)$ encoders and decoders for $N$ domains. For mix and match networks each encoder and decoder should be involved in at least one translation pair during training in order to be aligned with the others, thereby reducing complexity from quadratic to linear with the number of domains (i.e. $N$ encoders/decoders).
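The quadratic-versus-linear argument can be made concrete with a small count. The sketch below assumes that a pairwise method trains one separate encoder-decoder network per ordered pair of domains; the function names are hypothetical.

```python
def pairwise_networks(num_domains: int) -> int:
    """Encoder-decoder networks needed when every ordered pair of distinct
    domains gets its own, non-shared network (assumed counting convention)."""
    return num_domains * (num_domains - 1)


def mix_and_match_networks(num_domains: int) -> int:
    """With cross-aligned networks, one encoder and one decoder per domain."""
    return num_domains


if __name__ == "__main__":
    for n in (3, 5, 11):
        print(n, pairwise_networks(n), mix_and_match_networks(n))
```

Under this convention, eleven domains need 110 pairwise networks but only eleven aligned encoders and eleven aligned decoders.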

4 Zero-pair cross-modal image translation

We propose a challenging cross-modal setting to evaluate zero-pair image translation involving three modalities (here the term modality plays the same role as domain in the previous section): RGB, depth and semantic segmentation. For simplicity, we refer to semantic segmentation maps and depth as modalities rather than tasks. It is important to observe that a setting involving heterogeneous modalities (in terms of complexity, number and meaning of different channels, etc.) is likely to require modality-specific architectures and losses.

4.1 Problem definition

We consider the problem of jointly learning the RGB-to-segmentation translation $y = T_{RS}(x)$ and the RGB-to-depth translation $z = T_{RD}(x)$, and evaluating on the unseen transformation $y = T_{DS}(z)$. The first translation is learned from a semantic segmentation dataset $\mathcal{D}^{(1)}$ with pairs $(x, y) \in \mathcal{D}^{(1)}$ of RGB images and segmentation maps, and the second from a disjoint RGB-D dataset $\mathcal{D}^{(2)}$ with pairs $(x, z) \in \mathcal{D}^{(2)}$ of RGB and depth images. Therefore no pairs of depth images and segmentation maps are available to the system. However, note that the RGB images from both datasets could be combined if necessary (we denote this dataset as $\mathcal{D}^{(1 \cup 2)}$). The system is evaluated on a third dataset $\mathcal{D}^{(3)}$ with paired depth images and segmentation maps.

4.2 Mix and match networks architecture

Figure 2: Zero-pair cross-modal and multimodal image translation with mix and match networks. Two disjoint sets are seen during training, containing (RGB, depth) pairs and (RGB, segmentation) pairs, respectively. The system is tested on the unseen translation depth-to-segmentation (zero-pair) and (RGB+depth)-to-segmentation (multimodal), using a third unseen set. Encoders and decoders with the same color share weights. Better viewed in color.

The overview of the framework is shown in Fig. 2. As basic building blocks we use three modality-specific encoders $f_R(x)$, $f_D(z)$ and $f_S(y)$ (for an RGB image $x$, a depth image $z$ and a semantic segmentation map $y$, respectively), and the corresponding three modality-specific decoders $g_R(h)$, $g_D(h)$ and $g_S(h)$, where $h$ is the latent representation in the shared space. The required translations are implemented as $y = T_{RS}(x) = g_S(f_R(x))$, $z = T_{RD}(x) = g_D(f_R(x))$ and $y = T_{DS}(z) = g_S(f_D(z))$.

Encoder and decoder weights are shared across the different translations involving the same modalities (same color in Fig. 2). To enforce better alignment between encoders and decoders of the same modality, we also include self-translations using the corresponding three autoencoders $g_R(f_R(x))$, $g_D(f_D(z))$ and $g_S(f_S(y))$.

We base our encoders and decoders on the SegNet architecture [1]. The encoder of SegNet itself is based on the 13 convolutional layers of the VGG-16 architecture [29]. The decoder mirrors the encoder architecture with 13 deconvolutional layers. All encoders and decoders are randomly initialized except for the RGB encoder, which is pretrained on ImageNet.

As in SegNet, pooling indices at each downsampling layer of the encoder are provided to the corresponding upsampling layer of the (seen or unseen) decoder. (The RGB decoder does not use pooling indices, since in our experiments we observed undesired grid-like artifacts in the RGB output.) These pooling indices seem to be relatively similar across the three modalities and effective in transferring spatial structure information that helps to obtain better depth and segmentation boundaries at higher resolutions. Thus, they provide relatively modality-independent side information.
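To make the role of pooling indices concrete, the following is a minimal 1-D sketch (the actual networks operate on 2-D feature maps, and the function names here are hypothetical). The encoder records where each maximum came from; the decoder uses only those positions, not the encoder's feature values, to place its own activations during upsampling.

```python
from typing import List, Tuple


def max_pool_with_indices(x: List[float], size: int = 2) -> Tuple[List[float], List[int]]:
    """Non-overlapping 1-D max pooling that also records argmax positions."""
    pooled: List[float] = []
    indices: List[int] = []
    for start in range(0, len(x) - size + 1, size):
        window = x[start:start + size]
        offset = max(range(size), key=lambda k: window[k])
        pooled.append(window[offset])
        indices.append(start + offset)
    return pooled, indices


def unpool(values: List[float], indices: List[int], length: int) -> List[float]:
    """Place each value at its recorded position; all other entries are zero."""
    out = [0.0] * length
    for v, i in zip(values, indices):
        out[i] = v
    return out


# Encoder side (e.g. a depth feature row): keep the indices as side information.
_, idx = max_pool_with_indices([1.0, 3.0, 2.0, 0.5])
# Decoder side (e.g. segmentation features): reuse only the positions.
upsampled = unpool([9.0, 7.0], idx, length=4)
```

Because only positions are transferred, the decoder never consumes features of an encoder it was not trained with, which is one way to read the claim that pooling indices are relatively encoder-independent.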

4.3 Loss functions

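As we saw before, a correct cross-alignment between encoders and decoders that have not been connected during training is critical for zero-pair translation. The final loss combines a number of modality-specific losses for both cross-domain translation and self-translation (i.e. autoencoders) and alignment constraints in the latent space

$$\mathcal{L} = \lambda_R \mathcal{L}_{RGB} + \lambda_S \mathcal{L}_{SEG} + \lambda_D \mathcal{L}_{DEPTH} + \lambda_A \mathcal{L}_{LAT}$$

RGB

We use a combination of L2 distance and adversarial loss, $\mathcal{L}_{RGB} = \lambda_{L2} \mathcal{L}_{L2} + \mathcal{L}_{GAN}$. L2 distance is used to compare the estimated and the ground truth RGB image after translation from a corresponding depth or segmentation image. It is also used in the RGB autoencoder

$$\mathcal{L}_{L2} = \mathbb{E}_{(x,y) \sim \mathcal{D}^{(1)}} \left[ \| g_R(f_S(y)) - x \|_2 \right] + \mathbb{E}_{(x,z) \sim \mathcal{D}^{(2)}} \left[ \| g_R(f_D(z)) - x \|_2 \right] + \mathbb{E}_{x \sim \mathcal{D}^{(1 \cup 2)}} \left[ \| g_R(f_R(x)) - x \|_2 \right]$$

where $x$, $y$ and $z$ denote an RGB image, a segmentation map and a depth image, $f$ and $g$ denote encoders and decoders (with subscripts $R$, $D$ and $S$ for RGB, depth and segmentation), and $\mathcal{D}^{(1)}$ and $\mathcal{D}^{(2)}$ are the (RGB, segmentation) and (RGB, depth) training sets.

In addition, we also include the least squares adversarial loss [22, 10] on the output of the RGB decoder

$$\mathcal{L}_{GAN} = \mathbb{E}_{x \sim \mathcal{D}^{(1 \cup 2)}} \left[ (C(x) - 1)^2 \right] + \mathbb{E}_{\hat{x} \sim \hat{p}(x)} \left[ C(\hat{x})^2 \right]$$

where $\hat{p}(x)$ is the resulting distribution of the combined images $\hat{x}$ generated by $g_R(f_S(y))$, $g_R(f_D(z))$ and $g_R(f_R(x))$. Note that the RGB autoencoder and the discriminator $C(x)$ are both trained with the combined RGB data $\mathcal{D}^{(1 \cup 2)}$.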
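As a rough illustration of the least squares adversarial objective, the sketch below computes the discriminator and generator terms from lists of discriminator scores. It assumes target values of 1 for real and 0 for generated images and omits the factor 1/2 that some formulations include; the function names are hypothetical.

```python
from typing import Sequence


def lsgan_discriminator_loss(real_scores: Sequence[float],
                             fake_scores: Sequence[float]) -> float:
    """Mean squared distance of real scores to 1 plus that of fake scores to 0."""
    real_term = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake_term = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def lsgan_generator_loss(fake_scores: Sequence[float]) -> float:
    """The generator (here, the RGB decoder) pushes fake scores towards 1."""
    return sum((s - 1.0) ** 2 for s in fake_scores) / len(fake_scores)
```

In the setting of this section, the "fake" scores would come from RGB images produced by any of the three paths into the RGB decoder (from segmentation, from depth, or from the RGB autoencoder).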


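Depth

For depth we use the Berhu loss [17] in both RGB-to-depth translation and in the depth autoencoder

$$\mathcal{L}_{DEPTH} = \mathbb{E}_{(x,z) \sim \mathcal{D}^{(2)}} \left[ \mathcal{B}(g_D(f_R(x)) - z) \right] + \mathbb{E}_{(x,z) \sim \mathcal{D}^{(2)}} \left[ \mathcal{B}(g_D(f_D(z)) - z) \right]$$

where $\mathcal{B}$ is the average Berhu loss, $(x, z)$ is an (RGB, depth) training pair, $f_R$ and $f_D$ are the RGB and depth encoders, and $g_D$ is the depth decoder.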
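For reference, the Berhu (reverse Huber) penalty behaves like L1 for small residuals and like a scaled L2 for large ones. The sketch below is a plain-Python illustration; setting the threshold to 20% of the largest absolute residual in the batch is a common convention and an assumption here, since the text does not state the threshold used.

```python
from typing import Sequence


def berhu(residual: float, c: float) -> float:
    """Reverse Huber: |r| if |r| <= c, else (r^2 + c^2) / (2c)."""
    a = abs(residual)
    return a if a <= c else (a * a + c * c) / (2.0 * c)


def average_berhu(pred: Sequence[float], target: Sequence[float],
                  c_ratio: float = 0.2) -> float:
    """Average Berhu loss; threshold c = c_ratio * max |residual| (assumed)."""
    residuals = [p - t for p, t in zip(pred, target)]
    max_abs = max(abs(r) for r in residuals)
    if max_abs == 0.0:
        return 0.0  # perfect prediction; avoids dividing by c = 0
    c = c_ratio * max_abs
    return sum(berhu(r, c) for r in residuals) / len(residuals)
```

The two branches meet at |r| = c, so the penalty is continuous at the threshold.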

Semantic segmentation

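For segmentation we use the average cross-entropy loss in both RGB-to-segmentation translation and the segmentation autoencoder

$$\mathcal{L}_{SEG} = \mathbb{E}_{(x,y) \sim \mathcal{D}^{(1)}} \left[ CE(g_S(f_R(x)), y) \right] + \mathbb{E}_{(x,y) \sim \mathcal{D}^{(1)}} \left[ CE(g_S(f_S(y)), y) \right]$$

where $CE$ is the average cross-entropy, $(x, y)$ is an (RGB, segmentation) training pair, $f_R$ and $f_S$ are the RGB and segmentation encoders, and $g_S$ is the segmentation decoder.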

Latent space consistency

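We enforce latent representations to remain close independently of the encoder that generated them. In our case we have two consistency losses

$$\mathcal{L}_{LAT} = \mathcal{L}_{LAT}^{SEG} + \mathcal{L}_{LAT}^{DEPTH}$$

$$\mathcal{L}_{LAT}^{SEG} = \mathbb{E}_{(x,y) \sim \mathcal{D}^{(1)}} \left[ \| f_R(x) - f_S(y) \|_2 \right], \qquad \mathcal{L}_{LAT}^{DEPTH} = \mathbb{E}_{(x,z) \sim \mathcal{D}^{(2)}} \left[ \| f_R(x) - f_D(z) \|_2 \right]$$

which compare the latent representations produced by the RGB encoder $f_R$ with those produced by the segmentation encoder $f_S$ and the depth encoder $f_D$ for paired (RGB, segmentation) and (RGB, depth) samples, respectively.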
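A minimal sketch of such a consistency term is given below, assuming latent codes are flat lists of floats and that the distance is the plain L2 norm averaged over pairs; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Latent = List[float]


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two latent vectors of equal length."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def latent_consistency(pairs: Sequence[Tuple[Latent, Latent]]) -> float:
    """Average L2 distance between the two latent codes of each paired sample."""
    return sum(l2_distance(a, b) for a, b in pairs) / len(pairs)


def total_latent_loss(rgb_seg_pairs: Sequence[Tuple[Latent, Latent]],
                      rgb_depth_pairs: Sequence[Tuple[Latent, Latent]]) -> float:
    """Sum of the two consistency terms (RGB vs. segmentation, RGB vs. depth)."""
    return latent_consistency(rgb_seg_pairs) + latent_consistency(rgb_depth_pairs)
```

Each pair holds the latent code of an RGB image and the latent code of its paired segmentation map or depth image; note that no (depth, segmentation) term appears, because no such pairs are available during training.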

5 Experimental Results

To the best of our knowledge there is no existing work which reports results for the setting of zero-pair image translation. In particular, we evaluate the proposed mix and match networks on zero-pair translation for semantic segmentation from depth images (and vice versa), and we show results for semantic segmentation from multimodal data. Finally, we illustrate the possibility of performing zero-pair translations for unpaired datasets, and the advantage of mix and match networks in terms of scalability.

5.1 Datasets and experimental settings

SceneNetRGBD   The SceneNetRGBD dataset [24] consists of 16865 synthesized train videos and 1000 test videos. Each of them includes 300 matching triplets (RGB, depth and segmentation map), with a size of 320x240 pixels. In our experiments, we use two subsets as our datasets:

  • 51K dataset: the train set consists of the first 50 frames of each of the first 1000 videos in the train set. The test set is collected by selecting the 60th frame from the same 1000 videos. This dataset was used to evaluate some of the architecture design choices.

  • 170K dataset: a larger dataset which consists of 150K triplets for the train set, 10K triplets for the validation set and 10K triplets for the test set. For the train set, we select 10 triplets from each of the first 15000 training videos, sampled every 30 frames from the first frame to the last. The validation set is taken from the remaining videos of the SceneNetRGBD train set, and the test set is taken from the SceneNetRGBD test set.

Each train set is divided into two equal non-overlapping splits from different videos. Although the collected splits contain triplets, we only use pairs to train our model.

Following common practice in these tasks, for segmentation we compute the intersection over union (IoU) and report the per-class average (mIoU) and the global score, which gives the percentage of correctly classified pixels. For depth we also include a quantitative evaluation, following the standard error metrics for depth estimation.
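The segmentation scores can be sketched as follows for flat lists of integer labels. Skipping classes that are absent from both prediction and ground truth when averaging is an assumed convention, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def segmentation_scores(pred: Sequence[int], target: Sequence[int],
                        num_classes: int) -> Tuple[List[float], float, float]:
    """Return (per-class IoU, mIoU, global pixel accuracy) for label lists."""
    inter = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes
    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            inter[p] += 1
    ious: List[float] = []
    for k in range(num_classes):
        union = pred_count[k] + target_count[k] - inter[k]
        if union > 0:  # assumed: ignore classes absent from both maps
            ious.append(inter[k] / union)
    miou = sum(ious) / len(ious)
    global_acc = sum(inter) / len(target)
    return ious, miou, global_acc
```

For example, with predictions `[0, 0, 1, 1]` and ground truth `[0, 1, 1, 1]`, class 0 has IoU 1/2, class 1 has IoU 2/3, and three of the four pixels are correct.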


Network training   We use Adam [15] with a batch size of 6 and a learning rate of 0.0002. We set , , , , . For the first 200,000 iterations we train all networks. For the following 200,000 iterations we use , , and freeze the RGB encoder. We found the network converges faster using a large initial . We add Gaussian noise to the latent space with zero mean and a standard deviation of 0.5.
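The latent noise can be sketched as below for a latent code stored as a flat list. The default standard deviation of 0.5 follows the text; the function name and the optional `rng` argument (for reproducibility) are assumptions of this sketch.

```python
import random
from typing import List, Optional


def add_latent_noise(latent: List[float], std: float = 0.5,
                     rng: Optional[random.Random] = None) -> List[float]:
    """Add independent zero-mean Gaussian noise to every latent entry."""
    rng = rng or random.Random()
    return [h + rng.gauss(0.0, std) for h in latent]
```

Passing a seeded generator, e.g. `add_latent_noise(h, rng=random.Random(0))`, makes the perturbation repeatable across runs.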

| Side information | Pretrained | mIoU | Global |
|---|---|---|---|
| - | N | 32.2% | 63.5% |
| Skip connections | N | 14.1% | 52.6% |
| Pooling indices | N | 45.6% | 73.4% |
| Pooling indices | Y | 49.5% | 80.0% |

Table 1: Influence of side information and RGB encoder pretraining on the final results. The task is zero-pair depth-to-semantic segmentation.

| AutoEnc | Latent | Noise | mIoU | Global |
|---|---|---|---|---|
| N | N | N | 5.64% | 13.5% |
| Y | N | N | 22.9% | 52.6% |
| Y | Y | N | 48.9% | 78.2% |
| Y | Y | Y | 49.5% | 80.0% |

Table 2: Impact of several components (autoencoder, latent space consistency loss and noise) on the performance. The task is zero-pair depth-to-semantic segmentation.

5.2 Ablation study

In a first experiment we use the 51K dataset to study the impact of several design elements on the overall performance of the system.

Side information   We first evaluate the usage of side information from the encoder to guide the upsampling process in the decoder. We consider three variants: no side information, skip connections [28] and pooling indices [1]. The results in Table 1 show that skip connections obtain worse results than no side information at all. This is because skip connections are not domain-invariant: at test time, when we combine an encoder and a decoder that were not trained together, these connections provide the decoder with an input different from the one seen during training, resulting in a drop in performance. Fig. 3 illustrates the differences between these three variants. Without side information the network is able to reconstruct a coarse segmentation, but without further guidance it is not able to refine it properly. Skip connections provide features that could guide the decoder but instead confuse it, since in the zero-pair case the decoder has not seen the features of that particular encoder. Pooling indices are more invariant as side information and obtain the best results.

RGB pretraining   We also compare training the RGB encoder from scratch and initializing it with pretrained weights from ImageNet. Table 1 shows an additional gain of around 4 points in mIoU (from 45.6% to 49.5%) when using the pretrained weights.

Given these results we perform all the remaining experiments initializing the RGB encoder with pretrained weights and use pooling indices as side information.

Latent space consistency, noise and autoencoders   We evaluate these three factors, with Table 2 showing that latent space consistency and the usage of autoencoders lead to significant performance gains; for both, the performance (in mIoU) is more than doubled. Adding noise to the output of the encoder results in a small performance gain.

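| Method | Conn. | Loss | Bed | Book | Ceiling | Chair | Floor | Furniture | Object | Picture | Sofa | Table | TV | Wall | Window | mIoU | Global |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CycleGAN [34] | SC | CE | 2.79 | 0.00 | 16.9 | 6.81 | 4.48 | 0.92 | 7.43 | 0.57 | 9.48 | 0.92 | 0.31 | 17.4 | 15.1 | 6.34 | 14.2 |
| 2pix2pix [10] | SC | CE | 34.6 | 1.88 | 70.9 | 20.9 | 63.6 | 17.6 | 14.1 | 0.03 | 38.4 | 10.0 | 4.33 | 67.7 | 20.5 | 25.4 | 57.6 |
| M&MNet D→R→S | PI | CE | 0.02 | 0.00 | 8.76 | 0.10 | 2.91 | 2.06 | 1.65 | 0.19 | 0.02 | 0.28 | 0.02 | 58.2 | 3.3 | 5.96 | 32.3 |
| M&MNet D→R→S | SC | CE | 25.4 | 0.26 | 82.7 | 0.44 | 56.6 | 6.30 | 23.6 | 5.42 | 0.54 | 21.9 | 10.0 | 68.6 | 19.6 | 24.7 | 59.7 |
| M&MNet D→S | PI | CE | 50.8 | 18.9 | 89.8 | 31.6 | 88.7 | 48.3 | 44.9 | 62.1 | 17.8 | 49.9 | 51.9 | 86.2 | 79.2 | 55.4 | 80.4 |
| M&MNet (R,D)→S | PI | CE | 49.9 | 25.5 | 88.2 | 31.8 | 86.8 | 56.0 | 45.4 | 70.5 | 17.4 | 46.2 | 57.3 | 87.9 | 79.8 | 57.1 | 81.2 |

Table 3: Zero-pair depth-to-semantic segmentation. SC: skip connections, PI: pooling indices, CE: cross-entropy.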
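
| Method | δ < 1.25 | δ < 1.25² | δ < 1.25³ | RMSE (lin) | RMSE (log) |
|---|---|---|---|---|---|
| CycleGAN [34] | 0.05 | 0.12 | 0.20 | 4.63 | 1.98 |
| M&MNet | 0.15 | 0.30 | 0.44 | 3.24 | 1.24 |
| M&MNet | 0.33 | 0.42 | 0.59 | 2.8 | 0.67 |
| M&MNet | 0.36 | 0.48 | 0.65 | 2.48 | 0.64 |

Table 4: Zero-pair semantic segmentation-to-depth.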
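The five columns of Table 4 are consistent with the usual depth-estimation metrics: three threshold accuracies (fraction of pixels whose ratio to the ground truth is below 1.25, 1.25² and 1.25³) and the root mean squared error in linear and log space. The sketch below computes them under that assumption, for strictly positive depth values; the function name is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def depth_metrics(pred: Sequence[float],
                  target: Sequence[float]) -> Tuple[List[float], float, float]:
    """Return ([acc@1.25, acc@1.25^2, acc@1.25^3], RMSE linear, RMSE log).

    Assumes all predicted and ground-truth depths are strictly positive.
    """
    n = len(target)
    ratios = [max(p / t, t / p) for p, t in zip(pred, target)]
    acc = [sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)]
    rmse_lin = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(t)) ** 2 for p, t in zip(pred, target)) / n
    )
    return acc, rmse_lin, rmse_log
```

Higher is better for the three accuracies and lower is better for the two RMSE values, which matches the trend across the rows of the table.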

5.3 Comparison with other methods

We compare the results of our mix and match networks for depth to segmentation (D→S) to the following baselines:

  • CycleGAN [34] learns a mapping from depth to semantic segmentation without explicit pairs. In contrast to ours, this method only leverages depth and semantic segmentation, ignoring the available RGB data and the corresponding pairs.

  • 2pix2pix [10] learns from paired data two encoder-decoder pairs (depth-to-RGB and RGB-to-segmentation). The architecture uses skip connections and the corresponding modality-specific losses. We use the exact code from [10]. In contrast to ours, it requires explicit decoding to RGB, which may degrade the quality of the prediction.

  • D→R→S is similar to 2pix2pix, but uses the same architecture as our M&MNet. We train a translation model from depth to RGB and another from RGB to segmentation, and obtain the depth-to-segmentation transformation by concatenating them. Note that it requires using an intermediate RGB image.

Table 3 compares the three methods on the 170K dataset. CycleGAN is not able to learn a good mapping from depth to semantic segmentation, showing the difficulty of unpaired translation to solve this complex cross-modal task. 2pix2pix manages to improve the results by resorting to the anchor domain RGB, although the results are still not satisfactory, since the first translation network drops information not relevant for the RGB task but necessary for reconstructing depth (as in the "Chinese whispers"/telephone game).

Mix and match networks evaluated on the cascade (D→R→S) achieve a similar result to CycleGAN, but significantly worse than 2pix2pix. However, when we run our architecture with skip connections we obtain results similar to those of 2pix2pix. Note that in this setting the encoders and decoders are used in the same configuration in both training and testing, so skip connections function well.

The direct combination (D→S) outperforms all baselines significantly. The results more than double in terms of mIoU. Figure 4 illustrates the comparison between our approach and the baselines; our method is the only one that manages to identify the main semantic classes and their general contours in the image. In conclusion, the results show that mix and match networks enable effective zero-pair translation.

Figure 3: Role of side information in unseen depth-to-segmentation translation.
Figure 4: Different methods evaluated on zero-pair depth-to-segmentation translation: (a) input depth, (b) pix2pix, (c) CycleGAN, (d) D→R→S, (e) proposed, (f) ground truth.

In Table 4 we show the results when we test in the opposite direction, from semantic segmentation to depth. The conclusions are similar to those of the previous experiment. Again our method outperforms both baseline methods on all five evaluation metrics. Fig. 5 illustrates this case, showing how pooling indices are also key to obtaining good depth images, compared with no side information at all.

5.4 Multimodal translation

Next we consider the case of multimodal translation from pairs (RGB, depth) to semantic segmentation. As depicted in Figure 2, multiple modalities can be combined (since the latent spaces are aligned) at the input of the semantic segmentation decoder. To combine the two modalities we perform a weighted average of the RGB and depth latent vectors (the weight $\alpha$ ranges from $\alpha = 0$, only RGB, to $\alpha = 1$, only depth). We set $\alpha$ to 0.2 and use the pooling indices from the RGB encoder (instead of those from the depth encoder, see supplementary material for more details). The results in Table 3 and Table 4 and the example in Figure 5 show that this multimodal combination further improves the performance of zero-pair translation.
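The fusion step amounts to a convex combination of two aligned latent codes. In the sketch below the weight is called `alpha` (a symbol assumed here), with 0 keeping only the RGB latent and 1 keeping only the depth latent; the function name is hypothetical.

```python
from typing import List, Sequence


def fuse_latents(h_rgb: Sequence[float], h_depth: Sequence[float],
                 alpha: float = 0.2) -> List[float]:
    """Weighted average (1 - alpha) * h_rgb + alpha * h_depth, element-wise."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(h_rgb) != len(h_depth):
        raise ValueError("latent vectors must have the same length")
    return [(1.0 - alpha) * r + alpha * d for r, d in zip(h_rgb, h_depth)]
```

The fused vector is then fed to the segmentation decoder in place of a single-modality latent; this only makes sense because the two encoders were trained to share one latent space.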

Figure 5: Zero-pair and multimodal segmentation-to-depth.

5.5 Scalable unpaired image translation

Figure 6: Zero-pair cross-domain unpaired translations. (a) Color transfer: only transformations from or to blue (anchor domain) are seen. (b) Style transfer: trained on four styles + photo (anchor) with data from [34]. From left to right: photo, Monet, van Gogh, Ukiyo-e and Cezanne. In both panels, input images are highlighted in red and seen translations in blue.

As explained in Section 3.3, mix and match networks scale linearly with the number of domains, whereas existing unpaired image translation methods scale quadratically. As examples of translations between many domains, we show results for object recoloring and style transfer, using mix and match networks based on multiple CycleGANs [34] combined with autoencoders. For the former we use the colored objects dataset [31] with eleven distinct colors ($N = 11$) and around 1000 images per color. Covering all possible image-to-image recoloring combinations requires training a total of $N(N-1) = 110$ encoders (and decoders) using CycleGANs. In contrast, mix and match networks only require training eleven encoders and eleven decoders, while still successfully addressing the recoloring task (see Figure 6(a)). Similarly, scalable style transfer can be addressed using mix and match networks (see Figure 6(b) for an example).

6 Conclusion

In this paper we introduce the problem of zero-pair image translation, where knowledge learned in paired translation models can be effectively transferred and leveraged to perform new unseen translations. The image-to-image scenario poses several challenges to the alignment of encoders and decoders in a way that guarantees cross-domain transferability and without too much dependence on the domain or the modality. We studied this scenario in zero-pair cross-modal and multimodal settings. Notably, we found that side information in the form of pooling indices is robust to modality changes and very helpful to guide the reconstruction of spatial structure. Other helpful techniques are cross-modal consistency losses and adding noise to the latent representation.

Acknowledgements   Herranz acknowledges the European Union’s H2020 research under Marie Sklodowska-Curie grant No. 6655919. We acknowledge the project TIN2016-79717-R, the CHISTERA project M2CR (PCIN-2015-251) of the Spanish Government and the CERCA Programme of Generalitat de Catalunya. Yaxing Wang acknowledges the Chinese Scholarship Council (CSC) grant No.201507040048. We also acknowledge the generous GPU support from Nvidia.


Appendix A Appendix: Network architecture

Table 7 shows the architecture (convolutional and pooling layers) of the encoders used in the cross-modal experiment. Tables 8 and 5 show the corresponding decoders. Table 6 shows the discriminator used for RGB. Every convolutional layer of the encoders, decoders and the discriminator is followed by a batch normalization layer and a ReLU layer (LeakyReLU for the discriminator). The only exception is the RGB encoder, which is initialized with weights from the VGG16 model [29] and does not use batch normalization.

| Layer | Input | Output | Kernel, stride |
|---|---|---|---|
| conv1 | [6,8,8,512] | | [3,3], 1 |
| conv2 | [6,16,16,512] | | [3,3], 1 |
| conv3 | [6,32,32,256] | | [3,3], 1 |
| conv4 | [6,64,64,128] | | [3,3], 1 |
| conv5 | [6,128,128,64] | | [3,3], 1 |

Table 5: Convolutional and pooling layers of the RGB decoder.

| Layer | Input | Output | Kernel, stride |
|---|---|---|---|
| deconv1 | | | [5,5], 2 |
| deconv2 | | | [5,5], 2 |
| deconv3 | | | [5,5], 2 |
| deconv4 | | | [5,5], 2 |

Table 6: RGB discriminator.

Appendix B Appendix: Multimodal fusion

Figure 7 shows the performance for different values of the modality weight $\alpha$ for multimodal semantic segmentation. It also compares the performance when the semantic segmentation decoder uses the pooling indices from the depth encoder instead of the ones from the RGB encoder.

Figure 7: Multimodal semantic segmentation: pooling indices modality and modality weight $\alpha$ ($\alpha = 0$ for RGB only, $\alpha = 1$ for depth only).

| Layer | Input | Output | Kernel, stride |
|---|---|---|---|
| conv1 (RGB) | [6,256,256,3] | [6,256,256,64] | [3,3], 1 |
| conv1 (Depth) | | | [3,3], 1 |
| conv1 (Segm.) | [6,256,256,14] | [6,256,256,64] | [3,3], 1 |
| conv2 | [6,256,256,64] | [6,256,256,64] | [3,3], 1 |
| pool2 (max) | [6,256,256,64] | [6,128,128,64] + indices2 | [2,2], 2 |
| conv3 | [6,128,128,64] | [6,128,128,128] | [3,3], 1 |
| conv4 | [6,128,128,128] | [6,128,128,128] | [3,3], 1 |
| pool4 (max) | [6,128,128,128] | [6,64,64,128] + indices4 | [2,2], 2 |
| conv5 | [6,64,64,128] | [6,64,64,256] | [3,3], 1 |
| conv6 | [6,64,64,256] | [6,64,64,256] | [3,3], 1 |
| conv7 | [6,64,64,256] | [6,64,64,256] | [3,3], 1 |
| pool7 (max) | [6,64,64,256] | [6,32,32,256] + indices7 | [2,2], 2 |
| conv8 | [6,32,32,256] | [6,32,32,512] | [3,3], 1 |
| conv9 | [6,32,32,512] | [6,32,32,512] | [3,3], 1 |
| conv10 | [6,32,32,512] | [6,32,32,512] | [3,3], 1 |
| pool10 (max) | [6,32,32,512] | [6,16,16,512] + indices10 | [2,2], 2 |
| conv11 | [6,16,16,512] | [6,16,16,512] | [3,3], 1 |
| conv12 | [6,16,16,512] | [6,16,16,512] | [3,3], 1 |
| conv13 | [6,16,16,512] | [6,16,16,512] | [3,3], 1 |
| pool13 (max) | [6,16,16,512] | [6,8,8,512] + indices13 | [2,2], 2 |

Table 7: Convolutional and pooling layers of the encoders.

| Layer | Input | Output | Kernel, stride |
|---|---|---|---|
| unpool1 | indices13 + [6,8,8,512] | | [2,2], 2 |
| conv1 | [6,16,16,512] | | [3,3], 1 |
| conv2 | [6,16,16,512] | | [3,3], 1 |
| conv3 | [6,16,16,512] | | [3,3], 1 |
| unpool4 | indices10 + [6,16,16,512] | | [2,2], 2 |
| conv4 | [6,32,32,512] | | [3,3], 1 |
| conv5 | [6,32,32,512] | | [3,3], 1 |
| conv6 | [6,32,32,512] | | [3,3], 1 |
| unpool7 | indices7 + [6,32,32,256] | | [2,2], 2 |
| conv7 | [6,64,64,256] | | [3,3], 1 |
| conv8 | [6,64,64,256] | | [3,3], 1 |
| conv9 | [6,64,64,256] | | [3,3], 1 |
| unpool10 | indices4 + [6,64,64,128] | | [2,2], 2 |
| conv10 | [6,128,128,128] | | [3,3], 1 |
| conv11 | [6,128,128,128] | | [3,3], 1 |
| unpool12 | indices2 + [6,128,128,64] | | [2,2], 2 |
| conv12 | [6,256,256,64] | | [3,3], 1 |
| conv13 (Depth) | [6,256,256,64] | | [3,3], 1 |
| conv13 (Segm.) | [6,256,256,64] | | [3,3], 1 |

Table 8: Convolutional and pooling layers of the segmentation and depth decoders.