Mix and match networks: multi-domain alignment for unpaired image-to-image translation

03/08/2019, by Yaxing Wang et al., Universitat Autònoma de Barcelona

This paper addresses the problem of inferring unseen cross-domain and cross-modal image-to-image translations between multiple domains and modalities. We assume that only some of the pairwise translations have been seen (i.e. trained) and infer the remaining unseen translations (where training pairs are not available). We propose mix and match networks, an approach where multiple encoders and decoders are aligned in such a way that the desired translation can be obtained by simply cascading the source encoder and the target decoder, even when they have not interacted during the training stage (i.e. unseen). The main challenge lies in the alignment of the latent representations at the bottlenecks of encoder-decoder pairs. We propose an architecture with several tools to encourage alignment, including autoencoders, robust side information and latent consistency losses. We show the benefits of our approach in terms of effectiveness and scalability compared with other pairwise image-to-image translation approaches. We also propose zero-pair cross-modal image translation, a challenging setting where the objective is inferring semantic segmentation from depth (and vice-versa) without explicit segmentation-depth pairs, and only from two (disjoint) segmentation-RGB and depth-RGB training sets. We observe that a certain part of the shared information between unseen domains might not be reachable, so we further propose a variant that leverages pseudo-pairs to exploit all shared information.


1 Introduction

For many computer vision applications, the task is to estimate a mapping between an input image and an output image. This family of methods is often known as image-to-image translation (image translation hereinafter). It includes transformations between different modalities, such as from RGB to depth liu2016learning , or domains, such as from luminance to color images zhang2016colorful , or editing operations such as artistic style changes gatys2016image . These mappings can also include 2D label representations such as semantic segmentations long2015fully or surface normals eigen2015predicting . Deep networks have shown excellent results in learning models to perform image translations between different domains and modalities badrinarayanan2015segnet ; isola2016image ; long2015fully .

One drawback of the initial research on image translation is that the methods required paired data to train the mapping between the domains long2015fully ; eigen2015predicting ; isola2016image . For many domains such pairs might be costly or impossible to collect. For example, to learn a mapping from Van Gogh paintings to Monet paintings, no paired images exist. Another class of algorithms, based on cycle consistency, addresses the problem of mapping between unpaired domains kim2017learning ; yi2017dualgan ; zhu2017unpaired . These methods are based on the observation that translating from one domain to another and then back to the original domain should recover the original input image. However, this is a relatively weak training signal which is effective when domains are relatively close (such as Van Gogh and Monet paintings) but, as we will show in this article, it is not strong enough to learn mappings between distant domains, such as RGB images and their semantic segmentation maps. (For simplicity, here we use domain in a broad sense that also includes modalities; Sections 4-6 will focus on modalities.)

The above mentioned image translation methods are often based on encoder-decoder frameworks badrinarayanan2015segnet ; isola2016image ; long2015fully ; zhu2017unpaired . In these approaches an encoder network maps the input image from domain A to a continuous vector in a latent space. From this latent representation the decoder generates an image in domain B. The latent representation is typically much smaller than the original image size, thereby forcing the network to learn to efficiently compress the information from domain A that is relevant for domain B into the latent representation. Autoencoder networks kingma2013auto are a special case of encoder-decoder architectures where the input and output are the same image. Encoder-decoder networks are generally trained end-to-end by providing the network with aligned image pairs from both domains.

In this article, we consider the case of image translation between multiple domains. For some of the domains we have access to aligned data pairs, but not for all of them. We aim to exploit the knowledge from the paired domains to obtain an improved mapping for the unpaired domains. An example of such a translation setting is the following: you have access to a set of RGB images and their semantic segmentations, and a (different) set of RGB images and their corresponding depth maps, but you are interested in obtaining a mapping from depth to semantic segmentation (see Figure 1). We call this the unseen translation because we do not have pairs for this translation, and we refer to this setting as zero-pair translation. Our method, which we call mix and match networks, addresses the problem of learning a mapping between unpaired domains by seeking alignment between encoders and decoders via their latent spaces (the code is available online at http://github.com/yaxingwang/Mix-and-match-networks). The translation between unseen domains is performed by simply concatenating the source domain encoder and the target domain decoder (see Figure 1). The success of the method depends on the alignment of the encoder and decoder for the unseen translation. We study several techniques that contribute to achieving alignment, including the usage of autoencoders, latent space consistency losses and robust side information to guide the reconstruction of the spatial structure.

We evaluate our approach in a challenging cross-modal task, where we perform zero-pair depth to semantic segmentation translation (or semantic segmentation to depth translation), using only RGB to depth and RGB to semantic segmentation pairs during training. Furthermore, we show that the results can be further improved by using pseudo-pairs between the unseen domains that allow the network to exploit unseen shared information. Finally, we show that aligned encoder-decoder networks also have advantages in domains with unpaired data. In this case, we show that mix and match networks scale better with the number of domains, since they are not required to learn all pairwise image translation networks (i.e. they scale linearly instead of quadratically).

This article is an extended version of a previous conference publication wang2018mix . We have included more analysis and insight about how mix and match networks exploit the information shared between modalities, and we propose an improved mix and match networks framework with pseudo-pairs which allows us to access previously unexploited shared information between unseen domains and modalities (see Section 5). This was found to significantly improve performance. In addition, wang2018mix only reports results on a synthetic dataset; here we also provide results on real images (SUN RGB-D dataset song2015sun ). Furthermore, we have added more insight into how the alignments between encoders and decoders evolve during training.

Figure 1: Overview of mix and match networks (M&MNet) and zero-pair translation. Two disjoint datasets are used to train seen translations between RGB and segmentation and between RGB and depth (and vice versa). We want to infer the unseen depth-to-segmentation translation (i.e. Zero-pair translation). The M&MNet approach builds the unseen translator by simply cascading the source encoder and target decoder (i.e. depth and segmentation, respectively). Best viewed in color.

2 Related work

In this section we discuss the literature of related research areas.

(a) Paired translation
(b) Unpaired translation
(c) Unsupervised domain adapt.
(d) Zero-pair translation
Figure 2: Cross-modal image translation train and test settings: (a) paired translation, (b) unpaired translation, (c) unsupervised domain adaptation for segmentation (two modalities and two domains in the RGB modality), (d) zero-pair translation (three modalities). Best viewed in color.

2.1 Image-to-image translation

Paired translations   Generic encoder-decoder architectures have achieved impressive results in a wide range of transformations between images. Isola et al. isola2016image proposed pix2pix, a conditional generative adversarial network (conditional GAN) goodfellow2014generative ; mirza2014conditional trained with pairs of input and output images to learn a variety of image translations. Those translations include cross-domain image translations such as colorization and style transfer. González-García et al. gonzalez2018image disentangle the information of the domains in the latent space, which allows cross-domain retrieval as well as one-to-many translations. The ability of GANs to generate realistic images also enables pix2pix to effectively address challenging cross-modal image translations, such as semantic segmentation to RGB. In this case, recent multi-scale architectures chen2017photographic ; wang2018high achieve better results on higher resolution images.

Unpaired translations   Various works extended this idea to the case where no explicit input-output image pairs are available (unpaired image translation), using the idea of cyclic consistency kim2017learning ; yi2017dualgan ; zhu2017unpaired ; lin2018conditional or consistency between certain extracted features taigman2016unsupervised . To avoid accidental artifacts and improve learning, Mejjati et al. mejjati2018unsupervised integrate an attention mechanism to help translations focus on semantically meaningful regions. Liu et al. liu2017unsupervised show that unsupervised mappings can be learned by imposing a joint latent space between the encoder and the decoder.

In this work, we consider the case where paired data is available between some domains or modalities and not available between others (i.e. zero-pair), and how this knowledge can be transferred to those zero-pair cases.

Diversity in translations   Given an input image (e.g. an edge image or a grayscale image) there are often multiple possible solutions (e.g. different plausible colorizations). The paired translation framework was extended to one-to-many translations in the work of Zhu et al. zhu2017toward . DRIT lee2018diverse , MUNIT huang2018multimodal and Augmented CycleGAN almahairi2018augmented can learn one-to-many translations in unpaired settings. In general, disentangled representations allow achieving diversity by keeping the content component and sampling the style component of the latent representation mathieu2016disentangling ; gonzalez2018image ; lee2018diverse .

Multi-domain translations   We also consider the case of multiple domains (and modalities). In concurrent work, Choi et al. choi2017stargan also address scaling to multiple domains by using a single encoder-decoder model, which was previously explored by Perarnau et al. perarnau2016invertible . These works focus on faces and on changing relatively superficial and localized attributes such as make-up, hair color or gender, always within the RGB modality. In contrast, our approach uses multiple cross-aligned modality-specific encoders and decoders, which are necessary to address the deeper structural changes required by our cross-modal setting. Anoosheh et al. Anoosheh_2018 also use multiple encoders and decoders but focus on the easier cross-domain task of style transfer.

2.2 Semantic segmentation and depth estimation

Semantic segmentation and depth estimation can also be considered (cross-modal) image translations. In contrast to general image translation, both have been studied extensively, since they are fundamental problems in computer vision. They are also addressed using encoder-decoder architectures and paired data, but with specific architectures and losses, and in general they do not rely on GANs.

Semantic segmentation   Semantic image segmentation aims at assigning each pixel to an object class. Benefiting from the use of CNN-based models, recent methods for semantic segmentation have obtained significant improvements compared with traditional approaches shotton2008semantic . Long et al. long2015fully propose fully convolutional networks (FCN), following an encoder-decoder structure. The encoder is composed of convolutional and pooling layers, while the decoder applies traditional upsampling without any fully connected layers. The fully convolutional nature of this architecture relaxes the requirement of a constant image size. Since FCN showed outstanding performance, this paradigm has been adopted in many current methods for semantic segmentation badrinarayanan2015segnet ; ronneberger2015u ; yu2015multi ; chen2018deeplab ; zhao2017pyramid . Of particular interest is SegNet badrinarayanan2015segnet , which we adapt in our method. SegNet introduces the use of pooling indices instead of copying encoder features (i.e. skip connections, as in U-Net ronneberger2015u ). We also consider pooling indices in our architecture for zero-pair image translation because we found them to be more robust and invariant under unseen translations.

Depth estimation   Depth estimation aims at estimating the depth structure of a 2D RGB image, usually represented as a depth map encoding the distance of each pixel to the camera. Most depth estimation methods are formalized as regression problems, where the aim is to minimize the mean squared error (MSE) with respect to a ground truth depth map. In general, an encoder-decoder architecture is used, often incorporating multiscale networks and skip connections liu2016learning ; wang2015towards ; roy2016monocular ; eigen2015predicting ; kim2016unified ; kuznietsov2017semi ; laina2016deeper . In this paper, we incorporate pooling indices into the encoder-decoder pipeline. To the best of our knowledge, this design has not been explored for depth estimation, but as we will see it is necessary to address unseen translations in our setting.

Multimodal encoder-decoders   With the development of multi-sensor cameras and datasets lai2011large ; silberman2012indoor ; song2015sun , encoder-decoder architectures have been adapted to multi-modal inputs ngiam2011multimodal , where different modalities (e.g. RGB, depth, infrared, surface normals) are encoded and combined prior to decoding. The network is trained to perform tasks such as multi-modal object recognition eitel2015multimodal ; cheng2016semi ; song2015sun , scene recognition song2017depth ; song2015sun , object detection Gupta_2016_CVPR (with simple classifiers or regressors as decoders in these cases) and semantic segmentation silberman2012indoor ; kendall2017multi ; wang2018depth . Similarly, multi-task learning can be applied to reconstruct multiple modalities eigen2015predicting ; kendall2017multi . For instance, Eigen et al. eigen2015predicting estimate depth, surface normals and semantic segmentation from a single RGB image, which can be seen as cross-modal image translation.

Training a multi-task multimodal encoder-decoder network was recently studied in Kuga_2017_ICCV , using a joint latent representation space for the various modalities. In our work we consider the alignment and transferability of pairwise image translations to unseen translations, rather than joint encoder-decoder architectures. Another multimodal encoder-decoder network was studied in cadena2016multi , showing that multimodal autoencoders can address the depth estimation and semantic segmentation tasks simultaneously, even in the absence of some of the input modalities. None of these works consider the zero-pair image translation problem addressed in this paper.

2.3 Zero-shot recognition

In conventional supervised image recognition, the objective is to predict the class labels that are provided during training. However, this poses limitations in scalability to new classes, since new training data and annotations are required. In zero-shot learning lampert2014attribute ; fu2017recent ; xian2018zero ; xian2018feature ; akata2016label , the objective is to predict an unknown class for which there is no image available, but only a description of the class (i.e. a class prototype) or some other source of semantic similarity with seen classes. This description can be a set of attributes (e.g. has wings, blue, four legs, indoor) lampert2014attribute ; jayaraman2014zero , concept ontologies fergus2010semantic ; rohrbach2011evaluating or textual descriptions reed2016learning . In general, an intermediate semantic space is leveraged as a bridge between the visual features from seen classes and the class descriptions of unseen ones. In contrast to zero-shot recognition, we focus on unseen translations (unseen input-output pairs rather than simply unseen class labels).

2.4 Zero-pair language translation

Evaluating models on unseen language pairs has been studied recently in machine translation johnson2016google ; chen2017teacher ; zheng2017maximum ; firat2016multi . Johnson et al. johnson2016google proposed a neural language model that can translate between multiple languages, even pairs of languages for which no explicit paired sentences were provided (note that johnson2016google refers to this as zero-shot translation; in this paper we use zero-pair to emphasize that what is unseen is paired data, and to avoid ambiguities with traditional zero-shot recognition, which typically refers to unseen samples). In their method, the encoder, decoder and attention are shared. In our method we focus on images, which are a radically different type of data, with a two-dimensional structure in contrast to the sequential structure of language.

2.5 Domain adaptation

A related line of research is unsupervised domain adaptation. There the task is to transfer knowledge from a supervised source domain to an unsupervised target domain (see Figure 2c). This problem has been approached by finding domain invariant feature spaces gong2012geodesic ; ganin2015unsupervised ; tsai2018learning , using image translations models to map between source and target domain wu2018dcan ; zhang2019synthetic , and exploiting pseudo-labels saito2017asymmetric ; zou2018domain .

When comparing this line of research with the setting we consider in this paper (i.e. zero-pair translation), there are some important differences. The unsupervised domain adaptation setting (see Figure 2c) typically involves two modalities (i.e. RGB and segmentation) and two domains within the RGB modality (e.g. synthetic and real). Paired data is available only for synthetic-segmentation, while the synthetic-real translation is unpaired, and the unseen translation is real-segmentation (with paired test data). In contrast, our setting (see Figure 2d) is more challenging, involving three modalities, with one disjoint paired training set for each seen translation. In comparison, using paired data allows us to address much larger and more challenging domain shifts, and even modality shifts, a setting which, to the best of our knowledge, is not considered in the domain adaptation literature.

(a) All seen
(b) Seen and unseen
Figure 3: Multi-domain image translation using pairwise translations: (a) all translations are seen during training, and (b) our setting: some translations are seen, then test on unseen. Best viewed in color.

3 Multi-domain image translations

We consider the problem of image translation between multiple domains. In particular, a translation $T_{ij}$ from a source domain $\mathcal{X}_i$ to a target domain $\mathcal{X}_j$ is a mapping $T_{ij}: \mathcal{X}_i \rightarrow \mathcal{X}_j$. This mapping is implemented as an encoder-decoder chain $T_{ij}(x) = d_j(e_i(x))$ with source encoder $e_i$ and target decoder $d_j$. Translations between domains connected during training are all learned jointly, and in both directions. Note that the encoder and decoder of the translation $T_{ij}$ are different from those of the reverse translation $T_{ji}$. In order to perform any arbitrary translation between domains, all pairwise translations must be trained (i.e. seen) during the training stage (see Figure 3(a)).

In this article we address the case where only a subset of the translations are seen during training, while the rest remain unseen (see Figure 3(b)). Our objective is to be able to infer these unseen translations at test time.

3.1 Inferring unseen translations

In the case where some of the translations are unseen during training, we can still try to infer them by reusing the available networks. Here we discuss several possible ways: cascading translators and unpaired translators, which we use as baselines, and the proposed mix and match networks approach.

(a) Cascade
(b) Mix&match
(c) Ideal
Figure 4: Inferring unseen translations: (a) cascading translators, (b) mix and match networks (M&MNet), and (c) ideal case of encoders-decoders with aligned representations. Best viewed in color.

Cascaded translators   Assuming there is a path of seen translations between the source domain and the target domain via intermediate domains (see Figure 3(b)), a possible solution is simply concatenating the seen translators across this path. This results in a mapping from the source to the target domain by reconstructing images in the intermediate domains (see Figure 4(a)). However, the success of this approach depends on the effectiveness of the intermediate translators.

Unpaired translators   An alternative is to frame the problem as unpaired translation between the source and target domains and disregard the other domains, learning a translation using methods based on cycle consistency zhu2017unpaired ; kim2017learning ; yi2017dualgan ; liu2017unsupervised . This approach requires training an unpaired translator per unseen translation. In general, unpaired translation can be effective when the translation involves a relatively small shift between the two domains (e.g. body texture in horse-to-zebra), but it struggles in more challenging cross-domain and cross-modal image translations.

Mix and match networks (M&MNet)   We propose to obtain the unseen translator by simply concatenating the encoder of the source domain and the decoder of the target domain (see Figure 4(b)). The problem is that these two networks have not directly interacted during training; therefore, for this approach to be successful, the two latent spaces must be aligned.
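To make the composition concrete, the following PyTorch-style sketch builds any translator, seen or unseen, by pairing an encoder and a decoder from shared pools. The single-layer modules are hypothetical stand-ins for the full modality-specific networks described later (Section 4.2); the number of segmentation classes is an assumption.

```python
import torch
import torch.nn as nn

# Hypothetical single-layer stand-ins for the modality-specific encoders/decoders.
encoders = nn.ModuleDict({
    'rgb':   nn.Conv2d(3, 64, 3, padding=1),
    'depth': nn.Conv2d(1, 64, 3, padding=1),
    'seg':   nn.Conv2d(14, 64, 3, padding=1),
})
decoders = nn.ModuleDict({
    'rgb':   nn.Conv2d(64, 3, 3, padding=1),
    'depth': nn.Conv2d(64, 1, 3, padding=1),
    'seg':   nn.Conv2d(64, 14, 3, padding=1),
})

def translate(x, src, tgt):
    # Cascade the source encoder and the target decoder. The pair (src, tgt)
    # may never have been trained together; the translation only works if
    # their latent spaces are aligned.
    return decoders[tgt](encoders[src](x))

# Seen during training: rgb<->seg and rgb<->depth.
# Unseen (zero-pair) translation obtained by mixing and matching:
depth_image = torch.randn(1, 1, 240, 320)
seg_prediction = translate(depth_image, 'depth', 'seg')
```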

3.2 Aligning for unseen translations

The key challenge in M&MNet is to ensure that the latent representations produced by the encoders can be decoded by all decoders, including unseen encoder-decoder combinations (see Figure 4(c)). In order to address this challenge, encoders and decoders must be aligned in their latent representations. In addition, the encoder-decoder pair should be able to preserve the spatial structure, even in unseen translations.

In the following we describe the different techniques we use to enforce feature alignment between unseen encoder-decoder pairs.

Shared encoders and decoders   Sharing encoders and decoders is a basic requirement to reuse latent representations and reduce the number of networks.

Autoencoders   We jointly train domain-specific autoencoders with the image translation networks. By sharing the weights between the auto-encoders and the image translation encoder-decoder pairs the latent space is forced to align.

Robust side information   In general, image translation tasks result in output images with a spatial structure similar to that of the input, such as scene layouts, shapes and contours that are preserved across the translation. In fact, this spatial structure available in the input image is critical to simplify the problem and achieve good results, especially in cross-modal image translations. Successful image translation methods often use multi-scale intermediate representations from the encoder as side information to guide the decoder in the upsampling process. Examples of side information are skip connections he2016deep ; ronneberger2015u and pooling indices badrinarayanan2015segnet ; li2018closed . We exploit side information in cross-modal image translation (see the discussion in Section 4.4).

Latent space consistency (only in paired settings)   When paired data between some domains is available, we can enforce consistency between the latent representations of the two directions of the translation. Taigman et al. taigman2016unsupervised use the L2 distance between a latent representation and its reconstruction after another decoding and encoding cycle. Here we enforce the representations $e_A(x^A)$ and $e_B(x^B)$ of two paired samples $(x^A, x^B)$ to be aligned, since both images represent the same content (just in two different modalities). This is done by introducing a latent space consistency loss defined as $\| e_A(x^A) - e_B(x^B) \|_2^2$. We exploit this constraint in zero-pair image translation (see Section 4).

Adding noise to latent space   The latent space consistency we apply is based on reducing the difference between $e_A(x^A)$ and $e_B(x^B)$. The network can satisfy this loss by aligning the representations of $x^A$ and $x^B$ as desired. However, it could also lower this loss by simply reducing the magnitude of the signals $e_A(x^A)$ and $e_B(x^B)$. This would reduce the latent space consistency loss but not improve the alignment. Adding noise to the output of each encoder prevents this problem, since reducing the signal would then hurt the translation and autoencoder losses. In practice, we found that adding noise helps to train the networks and improves the results at test time.
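A minimal sketch of such a latent consistency term with added noise is given below (illustrative function; the noise level used in the experiments is discussed in Section 6.1, and the noisy codes are the ones subsequently passed to the decoders):

```python
import torch
import torch.nn.functional as F

def latent_consistency_with_noise(enc_a, enc_b, x_a, x_b, noise_std=0.5):
    # L2 consistency between the latent codes of a paired sample (x_a, x_b).
    h_a = enc_a(x_a)
    h_b = enc_b(x_b)
    consistency = F.mse_loss(h_a, h_b)
    # Gaussian noise is added to the codes passed on to the decoders, so the
    # network cannot trivially shrink the latent signal to satisfy the
    # consistency loss without hurting the translation and autoencoder losses.
    h_a_noisy = h_a + noise_std * torch.randn_like(h_a)
    h_b_noisy = h_b + noise_std * torch.randn_like(h_b)
    return consistency, h_a_noisy, h_b_noisy
```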

3.3 Scalable image translation with M&MNets

As the number of domains increases, the number of pairwise translations grows quadratically. Training encoder-decoder pairs for all pairwise translations between $N$ domains would require $N(N-1)$ encoders and $N(N-1)$ decoders (see Figure 3(a)). One of the advantages of M&MNets is their better scalability, since many of those translations can be inferred without explicitly training them (see Figure 3(b)). This only requires each encoder and decoder to be involved in at least one seen translation during training in order to be aligned with the others, thereby reducing the complexity from quadratic to linear in the number of domains (i.e. $N$ encoders and $N$ decoders).

(a) Input+seen
(b) Input+seen+unseen
(c) Input+seen
(d) Input+seen+unseen
Figure 5: Two examples of scalable inference of multi-domain translations with M&MNets. Color transfer (a-b): only transformations from blue or to blue (anchor domain) are seen. Style transfer (c-d): trained on four styles + photo (anchor) with data from zhu2017unpaired . From left to right: photo, Monet, van Gogh, Ukiyo-e and Cezanne. Input images are highlighted in red and seen translations in blue. Best viewed in color.

Figure 5 illustrates M&MNets and their scalability in two examples involving multi-domain unpaired image translation. Figures 5a-b show an image recoloring application with eleven domains ($N=11$). Images are objects from the colored objects dataset yu2018beyond and each domain is a color. A naive solution is training all recoloring combinations with CycleGANs, which requires a total of $N(N-1)=110$ encoders (and decoders). In contrast, M&MNets only require training eleven encoders and eleven decoders, while still successfully addressing the recoloring task. In particular, all translations from or to the blue domain are trained, while the remaining pairs not involving blue are unseen. The input images (framed in red) and the resulting seen translations (framed in blue) are shown in Figure 5a. The additional images in Figure 5b correspond to the remaining unseen translations.

We also illustrate M&MNets in a style transfer setting with five domains: photo (used as the anchor domain) and four artistic styles, with data from zhu2017unpaired . M&MNets can reasonably infer unseen translations between styles (see Figure 5d) using only five encoders and five decoders (for a total of twenty possible translations). Note that the purpose of these examples is to illustrate the scalability of M&MNets across multiple domains, not to compete with state-of-the-art recoloring or style transfer methods.

4 Zero-pair cross-modal image translation

Well aligned M&MNets can be applied to a variety of problems. Here, we apply them to a challenging setting we call zero-pair cross-modal image translation, which involves three modalities: RGB, depth and semantic segmentation (here the term modality plays the same role as domain in the previous section). Note that cross-modal image translation is in general more complex than cross-domain image translation, since it involves deeper transformations between heterogeneous modalities (for simplicity, we will refer to the output semantic segmentation maps and depth as modalities rather than tasks, as done in some works). This often requires modality-specific architectures and losses.

4.1 Problem definition

We consider the problem of jointly learning two seen cross-modal image translations, RGB-to-segmentation (and segmentation-to-RGB) and RGB-to-depth (and depth-to-RGB), and evaluating on the unseen depth-to-segmentation and segmentation-to-depth transformations (see Figures 1 and 2d). In contrast to the conventional unpaired translation setting, here the seen translations have paired data (cross-modal image translation is difficult to learn without paired samples). In particular, we consider the case where the former translations are learned from a semantic segmentation dataset with pairs $(x_i, y_i)$ of RGB images and segmentation maps, and the latter from a disjoint RGB-D dataset with pairs $(x_j, z_j)$ of RGB and depth images. Therefore no pairs with matching depth images and segmentation maps are available to the system. The system is evaluated on a third dataset with paired depth images and segmentation maps.

4.2 Mix and match networks architecture

Figure 6: Zero-pair cross-modal and multimodal image translation with M&MNets. Two disjoint sets, one containing (RGB, depth) pairs and one containing (RGB, segmentation) pairs, are seen during training. The system is tested on the unseen translations depth-to-segmentation (zero-pair) and (RGB+depth)-to-segmentation (multimodal), using a third unseen set. Encoders and decoders with the same color share weights. Best viewed in color.

The overview of the framework is shown in Figure 6. As basic building blocks we use three modality-specific encoders $e_{RGB}(x)$, $e_{D}(z)$ and $e_{S}(y)$ (RGB, depth and semantic segmentation, respectively), and the corresponding three modality-specific decoders $d_{RGB}(h)$, $d_{D}(h)$ and $d_{S}(h)$, where $h$ is the latent representation in the shared space. The translations are implemented by cascading the corresponding encoder and decoder: for instance, RGB-to-segmentation is $d_{S}(e_{RGB}(x))$, RGB-to-depth is $d_{D}(e_{RGB}(x))$, and the unseen depth-to-segmentation translation is $d_{S}(e_{D}(z))$.

Encoder and decoder weights are shared across the different translations involving the same modality (same color in Figure 6). To enforce better alignment between encoders and decoders of the same modality, we also include self-translations using the corresponding three autoencoders $d_{RGB}(e_{RGB}(x))$, $d_{D}(e_{D}(z))$ and $d_{S}(e_{S}(y))$.

We base our encoders and decoders on the SegNet architecture badrinarayanan2015segnet . The encoder of SegNet is itself based on the 13 convolutional layers of the VGG-16 architecture simonyan2014very . The decoder mirrors the encoder architecture with 13 deconvolutional layers. All encoders and decoders are randomly initialized except for the RGB encoder, which is pretrained on ImageNet imagenet_cvpr09 .

As in SegNet, the pooling indices at each downsampling layer of the encoder are provided to the corresponding upsampling layer of the (seen or unseen) decoder (the RGB decoder does not use pooling indices, since in our experiments we observed undesired grid-like artifacts in the RGB output when using them). These pooling indices seem to be relatively similar across the three modalities and are effective in transferring spatial structure information, helping to obtain better depth and segmentation boundaries at higher resolutions. Thus, they provide relatively modality-independent side information. We also experimented with skip connections and with no side information at all.
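The following simplified sketch (single-block encoder and decoder instead of the full 13-layer SegNet-style networks; the number of output classes is arbitrary) illustrates how the pooling indices recorded by one modality's encoder can be consumed by another modality's decoder:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    # Single downsampling block; returns the latent map and the pooling indices.
    def __init__(self, in_ch, mid_ch=64):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, mid_ch, 3, padding=1)

    def forward(self, x):
        h = F.relu(self.conv(x))
        return F.max_pool2d(h, kernel_size=2, return_indices=True)

class Decoder(nn.Module):
    # Mirrors the encoder; the unpooling step accepts indices from any encoder.
    def __init__(self, out_ch, mid_ch=64):
        super().__init__()
        self.conv = nn.Conv2d(mid_ch, out_ch, 3, padding=1)

    def forward(self, h, idx):
        h = F.max_unpool2d(h, idx, kernel_size=2)
        return self.conv(h)

enc_depth, dec_seg = Encoder(in_ch=1), Decoder(out_ch=14)
depth = torch.randn(1, 1, 240, 320)
h, idx = enc_depth(depth)      # latent code + spatial side information
seg_logits = dec_seg(h, idx)   # an unseen pairing still receives spatial hints
```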

4.3 Loss functions

As mentioned before, a correct cross-alignment between encoders and decoders during training is critical for zero-pair translation. The final loss combines a number of modality-specific losses for both cross-modal translation and self-translation (i.e. autoencoders) and alignment constraints in the latent space:

$\mathcal{L} = \lambda_{RGB}\,\mathcal{L}_{RGB} + \lambda_{S}\,\mathcal{L}_{S} + \lambda_{D}\,\mathcal{L}_{D} + \lambda_{LAT}\,\mathcal{L}_{LAT}$

where $\mathcal{L}_{RGB}$, $\mathcal{L}_{S}$ and $\mathcal{L}_{D}$ group the RGB, segmentation and depth losses defined below (translation, autoencoder and, for RGB, adversarial terms), $\mathcal{L}_{LAT}$ is the latent space consistency loss, and $\lambda_{RGB}$, $\lambda_{S}$, $\lambda_{D}$ and $\lambda_{LAT}$ are weights which balance the losses.

RGB   We use a combination of pixelwise L2 distance and an adversarial loss. The L2 distance is used to compare the ground truth RGB image with the output RGB image of the translation from a corresponding depth or segmentation image. It is also used in the RGB autoencoder:

(1) $\mathcal{L}_{2}^{S \rightarrow RGB} = \mathbb{E}\left[ \left\| d_{RGB}(e_{S}(y)) - x \right\|_2^2 \right]$
(2) $\mathcal{L}_{2}^{D \rightarrow RGB} = \mathbb{E}\left[ \left\| d_{RGB}(e_{D}(z)) - x \right\|_2^2 \right]$
(3) $\mathcal{L}_{2}^{RGB \rightarrow RGB} = \mathbb{E}\left[ \left\| d_{RGB}(e_{RGB}(x)) - x \right\|_2^2 \right]$

In addition, we also include the least squares adversarial loss mao2016multi ; isola2016image on the output of the RGB decoder

$\mathcal{L}_{GAN} = \mathbb{E}_{x}\left[ \left(C(x) - 1\right)^2 \right] + \mathbb{E}_{\hat{x} \sim p_{G}}\left[ C(\hat{x})^2 \right]$

where $C$ is the discriminator and $p_{G}$ is the resulting distribution of the combined images generated by $d_{RGB}(e_{S}(y))$, $d_{RGB}(e_{D}(z))$ and $d_{RGB}(e_{RGB}(x))$. Note that the RGB autoencoder and the discriminator are both trained with the combined RGB data from both training sets.

Depth   For depth we use the Berhu loss laina2016deeper in both the RGB-to-depth translation and the depth autoencoder:

(4) $\mathcal{L}_{B}^{RGB \rightarrow D} = \mathbb{E}\left[ B\left( d_{D}(e_{RGB}(x)), z \right) \right]$
(5) $\mathcal{L}_{B}^{D \rightarrow D} = \mathbb{E}\left[ B\left( d_{D}(e_{D}(z)), z \right) \right]$

where $B(\cdot,\cdot)$ is the average Berhu loss.
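For reference, a common formulation of the Berhu (reverse Huber) loss in the spirit of laina2016deeper is sketched below; the choice of the threshold as a fixed fraction of the largest batch residual is an assumption, not necessarily the exact setting used in our experiments:

```python
import torch

def berhu_loss(pred, target, ratio=0.2):
    # Reverse Huber (Berhu): L1 for residuals below a threshold c, scaled L2
    # above it, with c set to a fraction of the largest residual in the batch.
    diff = (pred - target).abs()
    c = ratio * diff.max().detach()
    l2 = (diff ** 2 + c ** 2) / (2 * c + 1e-8)
    return torch.where(diff <= c, diff, l2).mean()
```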

Semantic segmentation   For segmentation we use the average cross-entropy loss $CE(\cdot,\cdot)$ in both the RGB-to-segmentation translation and the segmentation autoencoder:

(6) $\mathcal{L}_{CE}^{RGB \rightarrow S} = \mathbb{E}\left[ CE\left( d_{S}(e_{RGB}(x)), y \right) \right]$
(7) $\mathcal{L}_{CE}^{S \rightarrow S} = \mathbb{E}\left[ CE\left( d_{S}(e_{S}(y)), y \right) \right]$

Latent space consistency   We enforce latent representations to remain close, independently of the encoder that generated them. In our case we have two latent space consistency losses, one for each paired training set:

(8) $\mathcal{L}_{LAT}^{S} = \mathbb{E}\left[ \left\| e_{RGB}(x) - e_{S}(y) \right\|_2^2 \right]$
(9) $\mathcal{L}_{LAT}^{D} = \mathbb{E}\left[ \left\| e_{RGB}(x) - e_{D}(z) \right\|_2^2 \right]$
(10) $\mathcal{L}_{LAT} = \mathcal{L}_{LAT}^{S} + \mathcal{L}_{LAT}^{D}$

4.4 The role of side information

(a) No side info.
(b) Skip connect.
(c) Pooling indices
Figure 7: Side information between encoders and decoders.

Spatial side information plays an important role in image translation, especially in cross-modal image translation (e.g. semantic segmentation). Reconstructing images requires reconstructing spatial details. Side information from a particular encoder layer can provide helpful hints to the decoder about how to reconstruct the spatial structure at a specific scale and level of abstraction (see Figure 7).

Skip connections   Perhaps the most common type of side information connecting encoders and decoders comes in the form of skip connections, where the feature from a particular layer is copied and concatenated with another feature further in the processing chain. U-Net ronneberger2015u introduced a widely used architecture in image segmentation and image translation where convolutional layers in encoder and decoder are mirrored, and the feature of a particular encoding layer is concatenated with the feature of the corresponding layer at the decoder (see Figure 7b). It is important to observe that skip connections make the decoder heavily conditioned on the particular features of the encoder. This is not a problem in general, because translations are usually seen during training and therefore latent representations are aligned. However, in our setting with unseen translations that conditioning is simply catastrophic, because the target decoder is only aware of the features of encoders from modalities seen during training; with an unseen encoder, the result is largely unpredictable.

Pooling indices   The SegNet architecture badrinarayanan2015segnet includes unpooling layers that leverage pooling indices from the mirror layers of the encoder (see Figure 7c). Pooling indices capture the locations of the maximum values in the input feature map of a max pooling layer. These locations are then used to guide the corresponding unpooling operation in the decoder, helping to preserve finer details. Note that pooling indices are more compact descriptors than encoder features from skip connections, and since the unpooling operation is not learned, pooling indices are less dependent on the particular encoder and therefore more robust for unseen translations.
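For concreteness, the unpooling operation itself has no learned parameters; the toy example below (arbitrary values) shows how the indices returned by max pooling steer the upsampling:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[[[1., 3., 2., 0.],
                    [4., 2., 1., 5.],
                    [0., 1., 3., 2.],
                    [2., 6., 0., 1.]]]])
pooled, idx = F.max_pool2d(x, kernel_size=2, return_indices=True)
# idx stores, for each 2x2 window, the flat position of its maximum value.
up = F.max_unpool2d(pooled, idx, kernel_size=2)
# 'up' places the pooled maxima back at their original locations and fills the
# rest with zeros: the spatial structure is preserved without any dependence
# on the learned feature values of a particular encoder.
```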

5 Shared information between unseen modalities

5.1 Shared and modality-specific information

The information conveyed by the latent representation is key to perform image translation. Encoders extract this information from the input image and decoders use it to reconstruct the output image. In general, this latent representation can contain information shared between the source and target modalities (or domains), and information specific to each modality. In a setting where the same latent representation is used across multiple encoders and decoders, the latent representation must capture information about all input and output modalities.

We can represent modalities as circles, whose intersections represent shared information between them. Figure 8a represents the particular case of zero-pair cross-modal image translation with three modalities (described in the previous section). Note that translators and autoencoders force the latent representation to capture both shared and modality-specific information. However, the better the information shared between modalities is captured in the latent representation, the more effective cross-modal image translations are.

The framework described in Section 4.2 enables the inference of unseen translations via the anchor modality RGB, whose encoder and decoder are shared across the two seen translations. That is the only component that indirectly enforces alignment of depth and segmentation encoders and decoders. Therefore, the latent information used in the unseen translation is the one shared by the three modalities.

In contrast, the information shared between depth and segmentation that is not shared with RGB (the dashed region in Figure 8a) is not exploited during training by the depth and segmentation encoders and decoders, because it is of no use for solving any of the seen translations. This makes inferred translations less effective, because the depth and segmentation encoders are ignoring potentially useful information that could improve the translation to segmentation and depth, respectively. In this section we propose an extension of our basic framework that aims at explicitly enforcing alignment between unseen modalities in order to exploit all the shared information between them (see the highlighted region in Figure 8b). Since no training pairs between those modalities are available, that alignment has to be achieved with unpaired samples.

(a) Seen shared information
(b) Seen+unseen shared information
Figure 8: Modality-specific and shared information: (a) basic mix and match nets (see Fig 6) ignore depth-segmentation shared information, (b) extended mix and match net exploiting depth-segmentation shared information (unpaired information in our case). Best viewed in color.

5.2 Exploiting shared information between unseen modalities

Figure 9: Pseudo-pairs pipeline on the unseen translation. This pipeline is combined with the basic cross-modal M&MNets of Fig 6.

5.2.1 Pseudo-pairs

We adapt the idea of pseudo-labels, used previously in unsupervised domain adaptation saito2017asymmetric ; zou2018domain , to our zero-pair cross-modal setting. The main idea is that we would also like to train directly the encoder-decoder between the unseen domains. However, since we have no paired data between these domains, we propose to use pseudo-pairs.

Here we describe pseudo-pairs in the specific setting of cross-modal image translation between RGB, depth and semantic segmentation (see Section 4). Recall that we use $x$, $y$ and $z$ to indicate data from the RGB, semantic segmentation and depth modalities, respectively. We use the encoder-decoder networks between the seen modalities to form the pseudo-pairs $(y_i, \hat{z}_i)$, with $\hat{z}_i = d_{D}(e_{RGB}(x_i))$, and $(z_j, \hat{y}_j)$, with $\hat{y}_j = d_{S}(e_{RGB}(x_j))$. Now we can also train the encoders and decoders between the unseen modalities depth and segmentation (see Figure 9) using the following losses:

(11) $\mathcal{L}_{PP}^{S \rightarrow D} = \mathbb{E}\left[ B\left( d_{D}(e_{S}(y_i)), \hat{z}_i \right) \right]$
(12) $\mathcal{L}_{PP}^{D \rightarrow S} = \mathbb{E}\left[ CE\left( d_{S}(e_{D}(z_j)), \hat{y}_j \right) \right]$

where $B(\cdot,\cdot)$ is the average Berhu loss laina2016deeper and $CE(\cdot,\cdot)$ is the cross-entropy loss. Directly training the encoder-decoder pairs between the unseen modalities allows us to exploit correlations between features in these modalities for which no evidence exists in the RGB modality (dashed region in Figure 8a). In practice we first train the network without the pseudo-pair losses. After convergence we add them and train further with all losses until final convergence.

Note that this additional term encourages the segmentation-to-depth and depth-to-segmentation translators to exploit all the shared information, including the previously ignored part, in order to match the translation obtained from RGB. The latter is more accurate since it has been trained with paired samples. A problem with this approach is that the new loss can harm the training of the seen translations from RGB, since pseudo-labels are less reliable than true labels. For this reason we do not update the weights of the translators involving RGB in the pseudo-pair paths; they are updated only with true pairs (this is indicated by the red line in Figure 9).
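A sketch of one pseudo-pair training step under these assumptions is given below (hypothetical container names for the encoders and decoders; freezing the RGB-path translator is implemented here by simply blocking its gradients):

```python
import torch

def pseudo_pair_step(x_rgb, y_seg, enc, dec, berhu_loss):
    # One pseudo-pair update for an (RGB, segmentation) training pair: the
    # frozen RGB->depth translator produces a pseudo depth map, which then
    # supervises the segmentation->depth translator.
    with torch.no_grad():                           # keep the RGB path frozen
        z_pseudo = dec['depth'](enc['rgb'](x_rgb))
    z_from_seg = dec['depth'](enc['seg'](y_seg))
    return berhu_loss(z_from_seg, z_pseudo)
# The symmetric step builds a pseudo segmentation map from an (RGB, depth)
# pair and supervises the depth->segmentation translator with cross-entropy.
```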

Figure 10: Cross-domain image translation experiment. Trained with color opponent pairs ($O_1$, $O_2$) and opponent-RGB pairs, and evaluated on the unseen opponent-to-RGB translation. Best viewed in color.

5.3 Pseudo-pair example

To show the potential of pseudo-pairs we consider an experiment between domains where the unexploited part shared between the unpaired domains (striped region in Figure 8) is expected to be substantial. We consider the task of estimating an RGB image from a single channel, where we use the opponent channels $O_1$ and $O_2$ as input (see Figure 10); we choose the opponent channels because they are less correlated than the R, G and B channels geusebroek2001color . Both domains $O_1$ and $O_2$ contain relevant and complementary information for estimating the RGB image. For this experiment we use the ten most frequent classes of the Flower dataset nilsback2008automated , which are passionflower, petunia, rose, wallflower, watercress, waterlily, cyclamen, foxglove, frangipani and hibiscus.

In particular, we consider the following three domains:

(13) $O_1 = \frac{R-G}{\sqrt{2}}, \qquad O_2 = \frac{R+G-2B}{\sqrt{6}}, \qquad RGB = (R, G, B)$

where $O_1$ and $O_2$ are scalar images and $RGB$ is a three-channel RGB image. For training we have color opponent pairs and opponent-RGB pairs of non-overlapping images. For testing we use a separate test set. To evaluate the quality of the computed RGB images, we apply a flower classification algorithm on them and report the classification accuracy.
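Assuming the standard opponent transform above, the two input channels can be computed from an RGB image as in the illustrative sketch below:

```python
import torch

def rgb_to_opponent(rgb):
    # rgb: tensor of shape (3, H, W) with values in [0, 1].
    r, g, b = rgb[0], rgb[1], rgb[2]
    o1 = (r - g) / 2 ** 0.5          # red-green opponent channel
    o2 = (r + g - 2 * b) / 6 ** 0.5  # yellow-blue opponent channel
    return o1, o2

o1, o2 = rgb_to_opponent(torch.rand(3, 128, 128))
```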

Type Method Accuracy (%)
Seen (paired) M&MNet 75.0
Unseen (zero-pair) M&MNet 36.5
Unseen (zero-pair) M&MNet + PP 57.5
Seen/unseen (multi-modal) M&MNet 77.5
Seen/unseen (multi-modal) M&MNet + PP 80.5
Table 1: Flower classification accuracy obtained on the computed RGB images for various image translation models. The importance of pseudo-pairs can be clearly seen. PP: pseudo-pairs.

The results are presented in Table 1. In the zero-pair setting, the results of M&MNets with and without pseudo-pairs are compared. The usage of pseudo-pairs results in a huge absolute performance gain of 21%. This shows that, for domains which have considerable amounts of complementary information, pseudo-pairs can significantly improve performance. We also include the multi-modal results. In this case the pseudo-pairs double the performance gain with respect to the seen paired translation (75.0%, first row of Table 1), from 2.5% to 5.5%.

The qualitative results are provided in Figure 11 and show the effectiveness of the pseudo-pairs. The method without pseudo-pairs can only exploit information which is shared between the three domains. The $O_1$ domain contains information about the red-green color axis, and the mix and match networks approach (without pseudo-pairs) does partially manage to reconstruct that part (see the first row of Figure 11). However, a translation from $O_1$ alone has no access to the blue-yellow information, which is encoded in $O_2$. Adding the pseudo-pairs allows the shared information to be exploited more fully, and the reconstructed RGB images are closer to the ground truth image (see the second and third rows of Figure 11).

Figure 11: Visualization of RGB image estimation on the Flowers dataset: (a) input translated to RGB via the seen (paired) translation, (b) zero-pair translation without pseudo-pairs wang2018mix , (c) zero-pair translation with pseudo-pairs (PP), (d) ground truth.

6 Experiments

In this section we demonstrate the effectiveness of M&MNets and their variants to address unseen translations in the challenging cross-modal image translation setting involving the modalities RGB, depth and segmentation.

6.1 Datasets and experimental settings

We use two RGB-D datasets annotated with segmentation maps, one with synthetic images and the other with real captured images:

SceneNet RGB-D   The SceneNet RGB-D dataset McCormac:etal:ICCV2017 consists of 16865 synthesized train videos and 1000 test videos. Each of them contains 300 frames representing the same scene in a multi-modal triplet (i.e. RGB, depth and segmentation), with a size of 320x240 pixels. We collected 150K triplets for our train set, 10K triplets for our validation set and 10K triplets for our test set. The triplets are sampled uniformly from the first frame to the last frame every 30 frames. The triplets for the validation set are collected from the remaining train videos and the test set is taken from the test dataset.

In order to evaluate zero-pair translation, we divided the train set (and validation set) into two equal non-overlapping splits from different videos (to avoid covering the same scenes). We discard depth images in one set and segmentation maps in the other, thus creating two disjoint training sets with paired instances, and respectively, to train our model.

SUN RGB-D   The SUN RGB-D dataset song2015sun contains 10335 real RGB-D images of room scenes. Each RGB image has a corresponding depth and segmentation map. We collected two sets: 10K triplets for the train set and 335 triplets for the test set. We split the train set into two disjoint subsets, one containing (RGB, segmentation) pairs and the other containing (RGB, depth) pairs, each of them consisting of 5K pairs.

Network training   We use Adam kingma2014adam with a batch size of 6 and a learning rate of 0.0002. The loss weights of Section 4.3 are set to fixed values that are adjusted between the training stages described next. We initially train the mix and match framework without autoencoders, without latent consistency losses, and without adding noise during the first 200K iterations. Then we freeze the RGB encoder, add the autoencoders, the latent consistency losses and the noise in the latent space, and train for the following 200K iterations. We found that the network converges faster using a large initial weight. The noise is sampled from a Gaussian distribution with zero mean and a standard deviation of 0.5. For the variant with pseudo-pairs, in a third stage we include the pseudo-pair pipeline and the corresponding loss, and train for an additional 100K iterations with a learning rate of 0.00002.

Evaluation metrics   Following common practice, for the segmentation modality we compute the per-class intersection over union (IoU) and its average (mIoU), as well as the global score, which gives the percentage of correctly classified pixels. For the depth modality we also include a quantitative evaluation, following the standard error metrics for depth estimation eigen2015predicting .
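For reference, mIoU and the global score for segmentation can be computed from a confusion matrix as in the straightforward sketch below (not the evaluation code used in the paper):

```python
import numpy as np

def segmentation_scores(conf):
    # conf[i, j] counts pixels with ground-truth class i predicted as class j.
    tp = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    iou = tp / np.maximum(union, 1)
    return iou, iou.mean(), tp.sum() / conf.sum()  # per-class IoU, mIoU, global
```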

6.2 Experiments on SceneNet RGB-D

6.2.1 Ablation study

We first performed an ablation study on the impact of several design elements on the overall performance of the system. We use a smaller subset of SceneNet RGB-D based on 51K triplets from the first 1000 videos (selecting 50 frames from the first 1000 videos for train, and the first frame from another 1000 videos for test).

Side information   We first evaluate the usage of side information from the encoder to guide the upsampling process in the decoder. We consider three variants: no side information, skip connections ronneberger2015u and pooling indices badrinarayanan2015segnet . The results in Table 2 show that skip connections obtain worse results than no side information at all. This is caused by the fact that side information makes the decoder(s) conditioned on the seen encoder(s). This is problematic for unseen translations because the features passed through skip connections are different from those seen by the decoder during training, resulting in a drop in performance. In contrast, pooling indices provide a significant boost over no side information. Although the decoder is still conditioned to the particular seen encoders, pooling indices seem to provide helpful spatial hints to recover finer details, while being more invariant to the particular input-output combination, and even generalizing to unseen ones.

Figure 12 illustrates the differences between these three variants in depth-to-segmentation translation. Without side information the network is able to reconstruct a coarse segmentation, but without further guidance it is not able to refine it properly. Skip connections completely confuse the decoder by providing unseen encoding features. Pooling indices are able to provide helpful hints about the spatial structure that allow the unseen decoder to recover finer segmentation maps.

Side information Pretrained mIoU Global
- N 29.8% 61.6%
Skip connections N 12.7% 50.1%
Pooling indices N 43.2% 73.5%
Pooling indices Y 46.7% 78.4%
Table 2: Influence of side information and RGB encoder pretraining on the final results. The task is zero-pair depth-to-semantic segmentation in SceneNet RGB-D (51K).
Figure 12: Role of side information in unseen depth-to-segmentation translation in SceneNet RGB-D.

RGB pretraining   We also compare training the RGB encoder from scratch and initializing with pretrained weights from ImageNet. Table 2 shows an additional gain of around 4% in mIoU when using the pretrained weights.

Given these results we perform all the remaining experiments initializing the RGB encoder with pretrained weights and use pooling indices as side information.

Latent space consistency, noise and autoencoders   We evaluate these three factors in Table 3. The results show that latent space consistency and the usage of autoencoders lead to significant performance gains; for both, the performance (in mIoU) is more than doubled. Adding noise to the output of the encoder results in a small performance gain.

Pseudo-pairs   We also evaluate the impact of using pseudo-pairs to exploit shared information between unseen modalities. Table 3 shows a significant gain of almost 3% in mIoU and a more moderate gain in global accuracy.

AutoEnc Latent Noise PP mIoU Global
N N N N 6.48% 15.7%
Y N N N 20.3% 49.4%
Y Y N N 45.8% 76.9%
Y Y Y N 46.7% 78.4%
Y Y Y Y 49.2% 80.5%
Table 3: Impact of several components (autoencoder, latent space consistency loss, noise and pseudo-pairs) in the performance. The task is zero-pair depth-to-segmentation in SceneNet RGB-D (51K). PP: pseudo-pairs.

In the following sections we use the SceneNet RGB-D dataset with 170K triplets.

6.2.2 Monitoring alignment

The main challenge for M&MNets is to align the different modality-specific bottleneck features, in particular for unseen translations. We measure the alignment between the features extracted from the triplets in the test set. For each triplet (i.e. RGB, segmentation and depth images) we extract the corresponding triplet of latent features and measure their average pairwise cross-modal alignment. The alignment between RGB and segmentation features is measured using the following alignment factor

(14) $AF\left(RGB, S\right) = \frac{1}{\left|\mathcal{T}\right|} \sum_{\left(x, y, z\right) \in \mathcal{T}} \frac{\left\langle e_{RGB}(x), e_{S}(y) \right\rangle}{\left\| e_{RGB}(x) \right\|_2 \left\| e_{S}(y) \right\|_2}$

where $\mathcal{T}$ is the set of test triplets. The alignment factors between RGB and depth features and between depth and segmentation features are defined analogously. Figure 13 shows the evolution of this alignment during training and across the different stages. The three curves follow a similar trend, with the alignment increasing in the first iterations of each stage and then stabilizing. The beginning of stage two shows a dramatic increase in the alignment, with a more moderate increase at stage three. These results are consistent with those of the ablation study of the previous section, showing that better alignment typically leads to better results in unseen translations. Overall, they show that latent space consistency, autoencoders, pseudo-pairs and pooling indices contribute to the effectiveness of M&MNets in addressing unseen image translation in the zero-pair setting.
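A direct way to compute such an alignment factor over the latent codes of paired test samples is sketched below (assuming the cosine-similarity form of (14) above; the encoder and iterator names are illustrative):

```python
import torch
import torch.nn.functional as F

def alignment_factor(enc_a, enc_b, pairs):
    # Average cosine similarity between the latent codes of paired test
    # samples from two modalities; 'pairs' yields (x_a, x_b) tensors.
    sims = []
    with torch.no_grad():
        for x_a, x_b in pairs:
            h_a = enc_a(x_a).flatten(1)
            h_b = enc_b(x_b).flatten(1)
            sims.append(F.cosine_similarity(h_a, h_b, dim=1).mean())
    return torch.stack(sims).mean().item()
```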

Figure 13: Monitoring alignment between latent features on SceneNet RGB-D.

6.2.3 Comparison with other models

In this section we compare M&MNets and their variant with pseudo-pairs against several baselines:

  • CycleGAN. We adapt CycleGAN zhu2017unpaired to learn a mapping from depth to semantic segmentation (and vice versa) in a purely unpaired setting. In contrast to M&MNets, this method only leverages depth and semantic segmentation, ignoring the available RGB data and the corresponding pairs (as shown in Figure 2b).

  • 2pix2pix. We adapt pix2pix isola2016image to learn two cross-modal image translations from paired data (i.e. depth-to-RGB and RGB-to-segmentation) and cascade them through RGB. The architecture uses skip connections (which are effective in this case since both translations are seen) and the corresponding modality-specific losses. We adapt the code from isola2016image . In contrast to ours, it requires explicit decoding to RGB, which may degrade the quality of the prediction.

  • Cascaded M&MNet (PI) is similar to 2pix2pix but with the architecture used in M&MNet, using pooling indices as side information. We train a translation model from depth to RGB and from RGB to segmentation, and obtain the depth-to-segmentation transformation by concatenating them. Note that it also requires translating to intermediate RGB images.

  • Cascaded M&MNet (SC) is analogous to the previous baseline, but uses skip connections as side information.

  • M&MNet is the original mix and match networks wang2018mix .

  • M&MNet+PP is a variant of M&MNet using pseudo-pairs.

Method Conn. Loss Bed Book Ceiling Chair Floor Furniture Object Picture Sofa Table TV Wall Window mIoU Global

Baselines
CycleGAN zhu2017unpaired SC CE 2.79 0.00 16.9 6.81 4.48 0.92 7.43 0.57 9.48 0.92 0.31 17.4 15.1 6.34 14.2
2pix2pix isola2016image SC CE 34.6 1.88 70.9 20.9 63.6 17.6 14.1 0.03 38.4 10.0 4.33 67.7 20.5 25.4 57.6
M&MNet PI CE 0.02 0.00 8.76 0.10 2.91 2.06 1.65 0.19 0.02 0.28 0.02 58.2 3.3 5.96 32.3
M&MNet SC CE 25.4 0.26 82.7 0.44 56.6 6.30 23.6 5.42 0.54 21.9 10.0 68.6 19.6 24.7 59.7
Zero-pair
M&MNet PI CE 50.8 18.9 89.8 31.6 88.7 48.3 44.9 62.1 17.8 49.9 51.9 86.2 79.2 55.4 80.4
M&MNet+PP PI CE 52.1 29.0 88.6 32.7 86.9 66.9 48.4 76.6 25.1 45.5 58.8 88.5 82.0 60.1 82.2
Multi-modal
M&MNet PI CE 49.9 25.5 88.2 31.8 86.8 56.0 45.4 70.5 17.4 46.2 57.3 87.9 79.8 57.1 81.2
M&MNet+PP PI CE 53.3 35.7 89.9 37.0 88.6 59.3 55.8 76.9 25.7 46.6 69.6 89.5 80.0 62.2 83.5
Table 4: Zero-pair depth-to-segmentation translation on SceneNet RGB-D (per-class IoU, mIoU and global accuracy, in %). SC: skip connections, PI: pooling indices, CE: cross-entropy, PP: pseudo-pairs.

Table 4 shows the results of the different methods for depth-to-segmentation translation. CycleGAN is not able to learn a good mapping, showing the difficulty of unpaired translation in solving this complex cross-modal task. 2pix2pix manages to improve the results by resorting to the anchor domain RGB, although the results are still not satisfactory, since this sequence of translations does not enforce explicit alignment between depth and segmentation, and the first translation network may also discard information not relevant for the RGB task but necessary for reconstructing the segmentation image (as in the "Chinese whispers"/telephone game).

The cascaded M&MNet with pooling indices (depth-to-RGB followed by RGB-to-segmentation) achieves a result similar to CycleGAN, but significantly worse than 2pix2pix. However, when we run the cascaded architecture with skip connections we obtain results similar to 2pix2pix. Note that in this setting translations only involve seen encoders and decoders, so skip connections function well. The direct combination of the depth encoder and the segmentation decoder with M&MNets outperforms all baselines significantly: the performance more than doubles in terms of mIoU. Results improve by another 4.7% in mIoU when adding the pseudo-pairs during training.

Figure 14 shows a representative example of the differences between the evaluated methods. CycleGAN fails to recover any meaningful segmentation of the scene, revealing the difficulty of learning cross-modal image translations from unpaired data alone. 2pix2pix manages to recover the layout and coarse segmentation, but fails to segment medium and small size objects. M&MNets are able to obtain finer and more accurate segmentations.

Figure 14: Zero-pair depth-to-segmentation translation on SceneNet RGB-D.

Table 5 shows the results when we test in the opposite direction, from semantic segmentation to depth. The conclusions are similar to those of the previous experiment: M&MNets outperform the baseline methods on all five evaluation metrics. Figure 15 illustrates this case, showing how pooling indices are also key to obtaining good depth images, compared with no side information at all. The variant with pseudo-pairs obtains the best results.

Method | δ<1.25 | δ<1.25² | δ<1.25³ | RMSE (lin) | RMSE (log)
Baselines
CycleGAN zhu2017unpaired | 0.05 | 0.12 | 0.20 | 4.63 | 1.98
2pix2pix isola2016image | 0.14 | 0.31 | 0.46 | 3.14 | 1.28
M&MNet | 0.15 | 0.30 | 0.44 | 3.24 | 1.24
Zero-pair
M&MNet | 0.33 | 0.42 | 0.59 | 2.8 | 0.67
M&MNet+PP | 0.42 | 0.61 | 0.79 | 2.24 | 0.60
Multi-modal
M&MNet | 0.36 | 0.48 | 0.65 | 2.48 | 0.64
M&MNet+PP | 0.47 | 0.69 | 0.81 | 1.98 | 0.49
Table 5: Zero-pair segmentation-to-depth on SceneNet RGB-D.
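The depth tables report RMSE in linear and log space; we assume the remaining three columns are the standard threshold accuracies (fraction of pixels whose ratio max(pred/gt, gt/pred) is below 1.25, 1.25² and 1.25³). A minimal sketch of these metrics, with illustrative names and synthetic depth maps:

```python
# Hedged sketch of common depth-evaluation metrics (not the authors' code).
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    pred, gt = np.maximum(pred, eps), np.maximum(gt, eps)   # avoid log(0)
    rmse_lin = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    acc = {f"delta<1.25^{k}": float(np.mean(ratio < 1.25 ** k)) for k in (1, 2, 3)}
    return float(rmse_lin), float(rmse_log), acc

# Synthetic example: ground truth within +/-20% of the prediction.
rng = np.random.default_rng(0)
pred = rng.uniform(0.5, 5.0, size=(240, 320))
gt = pred * rng.uniform(0.8, 1.2, size=pred.shape)
print(depth_metrics(pred, gt))
```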
Figure 15: Zero-pair and multimodal segmentation-to-depth on SceneNet RGB-D.

6.2.4 Multi-modal translation

Since features from different modalities are aligned, we can also use M&MNets for multi-modal translation. For instance, in the previous multi-modal setting, given the RGB and depth images of the same scene we can translate to segmentation. We simply combine the two modality-specific latent features z_RGB and z_D using a weighted average z = α·z_RGB + (1 − α)·z_D, where α controls the weight of each modality, and we use the pooling indices from the RGB encoder (instead of those from the depth encoder). The resulting feature is then decoded with the segmentation decoder. We proceed analogously for translation from RGB and segmentation to depth. The results in Table 4 and Table 5 show that this multi-modal combination further improves the performance of zero-pair translation, as the example in Figure 15 illustrates.
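As a concrete illustration of this fusion step, the snippet below is a minimal sketch under our own assumptions (the tensor shapes, the single pooling stage, and the value α = 0.5 are illustrative, not taken from the paper): it averages the two aligned latent features and unpools the result with the indices recorded by the RGB encoder, which would then feed the remaining layers of the segmentation decoder.

```python
# Toy sketch of multi-modal latent fusion with RGB pooling indices.
import torch
import torch.nn.functional as F

# Aligned latent features produced by the depth and RGB encoders (illustrative shapes).
z_rgb = torch.randn(1, 16, 32, 32)
z_depth = torch.randn(1, 16, 32, 32)

# Pooling indices recorded by the RGB encoder's max-pooling layer (illustrative input).
_, idx_rgb = F.max_pool2d(torch.randn(1, 16, 64, 64), kernel_size=2, return_indices=True)

alpha = 0.5                                      # assumed weight of the RGB modality
z = alpha * z_rgb + (1 - alpha) * z_depth        # weighted average of aligned latents
x = F.max_unpool2d(z, idx_rgb, kernel_size=2)    # upsampling guided by RGB pooling indices
print(x.shape)                                   # torch.Size([1, 16, 64, 64])
```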

6.3 Experiments on SUN RGB-D

The previous results were obtained on the SceneNet RGB-D dataset, which consists of synthetic images. Here we show that M&MNets are also effective on the more challenging SUN RGB-D dataset, which contains real images and more limited data. The results in Table 6 and Table 7 show that M&MNets consistently outperform the baselines in both unseen translation directions, with the new pseudo-pair variant obtaining the best performance. As before, multi-modal translation further improves the performance. Figures 16 and 17 illustrate how the proposed methods reconstruct the target modality more reliably, especially the finer details.

Method | Conn. | Loss | Bed | Book | Ceiling | Chair | Floor | Furniture | Object | Picture | Sofa | Table | TV | Wall | Window | mIoU | Global
Baselines
CycleGAN zhu2017unpaired | SC | CE | 0.00 | 0.00 | 0.00 | 17.9 | 46.9 | 1.67 | 4.59 | 0.00 | 0.00 | 18.9 | 0.00 | 29.6 | 25.4 | 11.1 | 26.3
2pix2pix isola2016image | SC | CE | 3.88 | 0.00 | 12.4 | 29.6 | 57.1 | 17.2 | 13.0 | 35.4 | 8.07 | 35.1 | 0.00 | 47.0 | 7.73 | 20.5 | 38.6
M&MNet | PI | CE | 0.00 | 0.00 | 0.00 | 17.0 | 39.4 | 0.52 | 0.01 | 0.00 | 0.01 | 12.2 | 0.00 | 31.0 | 5.19 | 8.12 | 22.8
M&MNet | SC | CE | 39.9 | 0.25 | 15.2 | 37.6 | 58.0 | 19.0 | 11.7 | 2.45 | 4.82 | 36.9 | 0.00 | 46.8 | 12.3 | 21.9 | 40.6
Zero-pair
M&MNet | PI | CE | 28.4 | 2.90 | 22.6 | 41.9 | 71.6 | 14.1 | 25.1 | 17.8 | 11.8 | 49.7 | 0.08 | 64.2 | 15.5 | 28.1 | 51.8
M&MNet+PP | PI | CE | 29.8 | 4.52 | 28.5 | 44.1 | 73.3 | 17.2 | 27.5 | 20.1 | 9.81 | 53.4 | 0.14 | 67.5 | 17.9 | 30.2 | 54.2
Multi-modal
M&MNet | PI | CE | 0.00 | 16.6 | 21.4 | 56.0 | 72.1 | 24.2 | 28.3 | 38.1 | 21.7 | 57.0 | 64.6 | 68.0 | 43.7 | 39.4 | 58.8
M&MNet+PP | PI | CE | 0.10 | 19.3 | 25.5 | 54.6 | 74.6 | 25.6 | 30.1 | 42.4 | 21.0 | 58.1 | 65.2 | 69.0 | 49.7 | 41.1 | 59.8
Table 6: Zero-pair depth-to-semantic segmentation on SUN RGB-D. SC: skip connections, PI: pooling indexes, CE: cross-entropy, PP: pseudo-pairs.
Method | δ<1.25 | δ<1.25² | δ<1.25³ | RMSE (lin) | RMSE (log)
Baselines
CycleGAN zhu2017unpaired | 0.06 | 0.13 | 0.24 | 4.8 | 1.57
2pix2pix isola2016image | 0.13 | 0.34 | 0.59 | 3.8 | 1.30
M&MNet | 0.12 | 0.35 | 0.62 | 3.9 | 1.36
Zero-pair
M&MNet | 0.45 | 0.66 | 0.78 | 1.75 | 0.53
M&MNet+PP | 0.49 | 0.77 | 0.90 | 1.42 | 0.37
Multi-modal
M&MNet | 0.53 | 0.80 | 0.92 | 1.63 | 0.35
M&MNet+PP | 0.56 | 0.83 | 0.93 | 1.33 | 0.34
Table 7: Zero-pair semantic segmentation-to-depth on SUN RGB-D.
Figure 16: Example of zero-pair depth-to-segmentation on SUN RGB-D.
Figure 17: Example of zero-pair segmentation-to-depth on SUN RGB-D.

7 Conclusions

We have introduced mix and match networks as a framework to perform image translations between unseen domains or modalities by leveraging the knowledge learned from seen translations with explicit training data. The key challenge lies in aligning the latent representations at the bottlenecks in such a way that any encoder-decoder combination can perform its corresponding translation effectively. M&MNets have advantages in terms of scalability, since only the seen translations need to be trained. We also introduced zero-pair cross-modal image translation, a challenging scenario involving three modalities in which paired training data are available only for the seen translations. To address this problem effectively, we described several tools to enforce the alignment of latent representations, including autoencoders, latent consistency losses, and robust side information. In particular, our results show that side information is critical for satisfactory cross-modal image translations, but conventional side information such as skip connections may not work properly for unseen translations. We found that pooling indices are more robust and invariant, and provide helpful hints that guide the reconstruction of spatial structure.

We also analyzed a specific limitation of the original M&MNets wang2018mix in the zero-pair setting, namely that a significant part of the features shared between unseen domains is not exploited. We proposed a variant that generates pseudo-pairs to force the networks to use all the information shared between the unseen domains, even when that information is not shared by the seen translations. The effectiveness of M&MNets with pseudo-pairs has been demonstrated on several multi-modal datasets.

Acknowledgements.
The Titan Xp used for this research was donated by the NVIDIA Corporation. We acknowledge the Spanish project TIN2016-79717-R, the CHISTERA project M2CR (PCIN-2015-251) and the CERCA Programme / Generalitat de Catalunya. Herranz also acknowledges the European Union’s H2020 research under Marie Sklodowska-Curie grant No. 665919. Yaxing Wang acknowledges the Chinese Scholarship Council (CSC) grant No. 201507040048.

References

  • (1) Akata, Z., Perronnin, F., Harchaoui, Z., Schmid, C.: Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(7), 1425–1438 (2016)
  • (2) Almahairi, A., Rajeswar, S., Sordoni, A., Bachman, P., Courville, A.: Augmented cyclegan: Learning many-to-many mappings from unpaired data. International Conference on Machine Learning (2018)

  • (3) Anoosheh, A., Agustsson, E., Timofte, R., Van Gool, L.: Combogan: Unrestrained scalability for image domain translation. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2018). DOI 10.1109/cvprw.2018.00122. URL http://dx.doi.org/10.1109/CVPRW.2018.00122
  • (4) Badrinarayanan, V., Handa, A., Cipolla, R.: Segnet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
  • (5) Cadena, C., Dick, A.R., Reid, I.D.: Multi-modal auto-encoders as joint estimators for robotics scene understanding. In: Robotics: Science and Systems (2016)
  • (6) Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4), 834–848 (2018)
  • (7) Chen, Q., Koltun, V.: Photographic image synthesis with cascaded refinement networks. Proceedings of the International Conference on Computer Vision (2017)
  • (8) Chen, Y., Liu, Y., Cheng, Y., Li, V.O.: A teacher-student framework for zero-resource neural machine translation. arXiv preprint arXiv:1705.00753 (2017)
  • (9) Cheng, Y., Zhao, X., Cai, R., Li, Z., Huang, K., Rui, Y., et al.: Semi-supervised multimodal deep learning for rgb-d object recognition. Proceedings of the International Joint Conference on Artificial Intelligence (2016)

  • (10) Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  • (11) Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2009)
  • (12) Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the International Conference on Computer Vision, pp. 2650–2658 (2015)
  • (13) Eitel, A., Springenberg, J.T., Spinello, L., Riedmiller, M., Burgard, W.: Multimodal deep learning for robust rgb-d object recognition. In: Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems, pp. 681–687. IEEE (2015)
  • (14) Fergus, R., Bernal, H., Weiss, Y., Torralba, A.: Semantic label sharing for learning with many categories. Proceedings of the European Conference on Computer Vision pp. 762–775 (2010)
  • (15) Firat, O., Cho, K., Bengio, Y.: Multi-way, multilingual neural machine translation with a shared attention mechanism. arXiv preprint arXiv:1601.01073 (2016)
  • (16) Fu, Y., Xiang, T., Jiang, Y.G., Xue, X., Sigal, L., Gong, S.: Recent advances in zero-shot recognition. arXiv preprint arXiv:1710.04837 (2017)
  • (17) Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: International Conference on Machine Learning, pp. 1180–1189 (2015)
  • (18) Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2414–2423 (2016)
  • (19) Geusebroek, J.M., Van den Boomgaard, R., Smeulders, A.W.M., Geerts, H.: Color invariance. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(12), 1338–1350 (2001)
  • (20) Gong, B., Shi, Y., Sha, F., Grauman, K.: Geodesic flow kernel for unsupervised domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2066–2073. IEEE (2012)
  • (21) Gonzalez-Garcia, A., van de Weijer, J., Bengio, Y.: Image-to-image translation for cross-domain disentanglement. In: Advances in Neural Information Processing Systems, pp. 1294–1305 (2018)
  • (22) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)
  • (23) Gupta, S., Hoffman, J., Malik, J.: Cross modal distillation for supervision transfer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • (24) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
  • (25) Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to-image translation. In: Proceedings of the European Conference on Computer Vision, pp. 172–189 (2018)
  • (26) Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • (27) Jayaraman, D., Grauman, K.: Zero-shot recognition with unreliable attributes. In: Advances in Neural Information Processing Systems, pp. 3464–3472 (2014)
  • (28) Johnson, M., Schuster, M., Le, Q.V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., et al.: Google’s multilingual neural machine translation system: enabling zero-shot translation. arXiv preprint arXiv:1611.04558 (2016)
  • (29) Kendall, A., Gal, Y., Cipolla, R.: Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  • (30) Kim, S., Park, K., Sohn, K., Lin, S.: Unified depth prediction and intrinsic image decomposition from a single image via joint convolutional neural fields. In: Proceedings of the European Conference on Computer Vision, pp. 143–159. Springer (2016)
  • (31) Kim, T., Cha, M., Kim, H., Lee, J., Kim, J.: Learning to discover cross-domain relations with generative adversarial networks. International Conference on Machine Learning (2017)
  • (32) Kingma, D., Ba, J.: Adam: A method for stochastic optimization. International Conference on Learning Representations (2014)
  • (33) Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
  • (34) Kuga, R., Kanezaki, A., Samejima, M., Sugano, Y., Matsushita, Y.: Multi-task learning using multi-modal encoder-decoder networks with shared skip connections. In: Proceedings of the International Conference on Computer Vision (2017)
  • (35) Kuznietsov, Y., Stückler, J., Leibe, B.: Semi-supervised deep learning for monocular depth map prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6647–6655 (2017)
  • (36) Lai, K., Bo, L., Ren, X., Fox, D.: A large-scale hierarchical multi-view rgb-d object dataset. In: Proceedings of IEEE International Conference on Robotics and Automation, pp. 1817–1824. IEEE (2011)
  • (37) Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth prediction with fully convolutional residual networks. In: 3D Vision (3DV), 2016 Fourth International Conference on, pp. 239–248. IEEE (2016)
  • (38) Lampert, C.H., Nickisch, H., Harmeling, S.: Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(3), 453–465 (2014)
  • (39) Lee, H.Y., Tseng, H.Y., Huang, J.B., Singh, M., Yang, M.H.: Diverse image-to-image translation via disentangled representations. In: Proceedings of the European Conference on Computer Vision, pp. 35–51 (2018)
  • (40) Li, Y., Liu, M.Y., Li, X., Yang, M.H., Kautz, J.: A closed-form solution to photorealistic image stylization. In: Proceedings of the European Conference on Computer Vision, pp. 453–468 (2018)
  • (41) Lin, J., Xia, Y., Qin, T., Chen, Z., Liu, T.Y.: Conditional image-to-image translation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5524–5532 (2018)
  • (42) Liu, F., Shen, C., Lin, G., Reid, I.: Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(10), 2024–2039 (2016)
  • (43) Liu, M.Y., Breuel, T., Kautz, J.: Unsupervised image-to-image translation networks. Advances in Neural Information Processing Systems (2017)
  • (44) Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
  • (45) Mao, X., Li, Q., Xie, H., Lau, R.Y., Wang, Z.: Multi-class generative adversarial networks with the l2 loss function. arXiv preprint arXiv:1611.04076 (2016)
  • (46) Mathieu, M.F., Zhao, J.J., Zhao, J., Ramesh, A., Sprechmann, P., LeCun, Y.: Disentangling factors of variation in deep representation using adversarial training. In: Advances in Neural Information Processing Systems, pp. 5040–5048 (2016)
  • (47) McCormac, J., Handa, A., Leutenegger, S., J.Davison, A.: Scenenet rgb-d: Can 5m synthetic images beat generic imagenet pre-training on indoor segmentation? Proceedings of the International Conference on Computer Vision (2017)
  • (48) Mejjati, Y.A., Richardt, C., Tompkin, J., Cosker, D., Kim, K.I.: Unsupervised attention-guided image-to-image translation. In: Advances in Neural Information Processing Systems, pp. 3697–3707 (2018)
  • (49) Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
  • (50) Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning. In: International Conference on Machine Learning, pp. 689–696 (2011)
  • (51) Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Computer Vision, Graphics & Image Processing, 2008. ICVGIP’08. Sixth Indian Conference on, pp. 722–729. IEEE (2008)
  • (52) Perarnau, G., Van De Weijer, J., Raducanu, B., Álvarez, J.M.: Invertible conditional gans for image editing. arXiv preprint arXiv:1611.06355 (2016)
  • (53) Reed, S., Akata, Z., Lee, H., Schiele, B.: Learning deep representations of fine-grained visual descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 49–58 (2016)
  • (54) Rohrbach, M., Stark, M., Schiele, B.: Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1641–1648. IEEE (2011)
  • (55) Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Springer (2015)
  • (56) Roy, A., Todorovic, S.: Monocular depth estimation using neural regression forest. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5506–5514 (2016)
  • (57) Saito, K., Ushiku, Y., Harada, T.: Asymmetric tri-training for unsupervised domain adaptation. International Conference on Machine Learning (2017)
  • (58) Shotton, J., Johnson, M., Cipolla, R.: Semantic texton forests for image categorization and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2008)
  • (59) Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: Proceedings of the European Conference on Computer Vision, pp. 746–760. Springer (2012)
  • (60) Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (2015)
  • (61) Song, S., Lichtenberg, S.P., Xiao, J.: Sun rgb-d: A rgb-d scene understanding benchmark suite. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 567–576 (2015)
  • (62) Song, X., Herranz, L., Jiang, S.: Depth cnns for rgb-d scene recognition: learning from scratch better than transferring from rgb-cnns. In: Proceedings of the AAAI Conference on Artificial Intelligence (2017)
  • (63) Taigman, Y., Polyak, A., Wolf, L.: Unsupervised cross-domain image generation. International Conference on Learning Representations (2017)
  • (64) Tsai, Y.H., Hung, W.C., Schulter, S., Sohn, K., Yang, M.H., Chandraker, M.: Learning to adapt structured output space for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  • (65) Wang, P., Shen, X., Lin, Z., Cohen, S., Price, B., Yuille, A.L.: Towards unified depth and semantic prediction from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2800–2809 (2015)
  • (66) Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-resolution image synthesis and semantic manipulation with conditional gans. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8798–8807 (2018)
  • (67) Wang, W., Neumann, U.: Depth-aware cnn for rgb-d segmentation. In: Proceedings of the European Conference on Computer Vision, pp. 135–150 (2018)
  • (68) Wang, Y., van de Weijer, J., Herranz, L.: Mix and match networks: encoder-decoder alignment for zero-pair image translation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5467–5476 (2018)
  • (69) Wu, Z., Han, X., Lin, Y.L., Uzunbas, M.G., Goldstein, T., Lim, S.N., Davis, L.S.: Dcan: Dual channel-wise alignment networks for unsupervised scene adaptation. In: Proceedings of the European Conference on Computer Vision (2018)
  • (70) Xian, Y., Lampert, C.H., Schiele, B., Akata, Z.: Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018)
  • (71) Xian, Y., Lorenz, T., Schiele, B., Akata, Z.: Feature generating networks for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5542–5551 (2018)
  • (72) Yi, Z., Zhang, H., Gong, P.T., et al.: Dualgan: Unsupervised dual learning for image-to-image translation. In: Proceedings of the International Conference on Computer Vision (2017)
  • (73) Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. International Conference on Learning Representations (2016)
  • (74) Yu, L., Zhang, L., van de Weijer, J., Khan, F.S., Cheng, Y., Parraga, C.A.: Beyond eleven color names for image understanding. Machine Vision and Applications 29(2), 361–373 (2018)
  • (75) Zhang, L., Gonzalez-Garcia, A., van de Weijer, J., Danelljan, M., Khan, F.S.: Synthetic data generation for end-to-end thermal infrared tracking. IEEE Transactions on Image Processing 28(4), 1837–1850 (2019)
  • (76) Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: Proceedings of the European Conference on Computer Vision, pp. 649–666. Springer (2016)
  • (77) Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890 (2017)
  • (78) Zheng, H., Cheng, Y., Liu, Y.: Maximum expected likelihood estimation for zero-resource neural machine translation. In: Proceedings of the International Joint Conference on Artificial Intelligence (2017)
  • (79) Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the International Conference on Computer Vision (2017)
  • (80) Zhu, J.Y., Zhang, R., Pathak, D., Darrell, T., Efros, A.A., Wang, O., Shechtman, E.: Toward multimodal image-to-image translation. In: Advances in Neural Information Processing Systems, pp. 465–476 (2017)
  • (81) Zou, Y., Yu, Z., Vijaya Kumar, B., Wang, J.: Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In: Proceedings of the European Conference on Computer Vision (2018)