TriGAN: Image-to-Image Translation for Multi-Source Domain Adaptation

04/19/2020 · Subhankar Roy et al., Università di Trento

Most domain adaptation methods consider the problem of transferring knowledge to the target domain from a single source dataset. However, in practical applications, we typically have access to multiple sources. In this paper we propose the first approach for Multi-Source Domain Adaptation (MSDA) based on Generative Adversarial Networks. Our method is inspired by the observation that the appearance of a given image depends on three factors: the domain, the style (characterized in terms of low-level feature variations) and the content. For this reason we propose to project the image features onto a space where only the dependence on the content is kept, and then re-project this invariant representation onto the pixel space using the target domain and style. In this way, new labeled images can be generated which are used to train a final target classifier. We test our approach using common MSDA benchmarks, showing that it outperforms state-of-the-art methods.


1 Introduction

A well-known problem in computer vision is the need to adapt a classifier trained on a given source domain so that it works on a different, target domain. Since the two domains typically have different marginal feature distributions, the adaptation process needs to reduce the corresponding domain shift [45]. In many practical scenarios, the target data are not annotated and Unsupervised Domain Adaptation (UDA) methods are required.

While most previous adaptation approaches consider a single source domain, in real-world applications we may have access to multiple datasets. In this case, Multi-Source Domain Adaptation (MSDA) methods [52, 31, 51, 36] may be adopted, in which more than one source dataset is used in order to make the adaptation process more robust. However, although more data can be exploited, MSDA is challenging because multiple domain-shift problems need to be solved simultaneously and coherently.

In this paper we deal with (unsupervised) MSDA using a data-augmentation approach based on a Generative Adversarial Network (GAN) [13]. Specifically, we generate artificial target samples by “translating” images from all the source domains into target-like images. The synthetically generated images are then used to train the target classifier. While this strategy has recently been adopted in the single-source UDA scenario [40, 17, 27, 34, 41], we are the first to show how it can be effectively used in an MSDA setting. In more detail, our goal is to build and train a “universal” translator which can transform an image from an input domain to a target domain. The translator network is “universal” because the number of parameters which need to be optimized scales only linearly with the number of domains. We achieve this goal using domain-invariant intermediate features, computed by the encoder part of our generator, and then projecting these features onto the domain-specific target distribution using the decoder.

To make this image translation effective, we assume that the appearance of an image depends on three factors: the content, the domain and the style. The domain models properties that are shared by the elements of a dataset but which may not be shared by other datasets. The style factor, on the other hand, represents properties which are shared among different local parts of a single image and describes low-level features of that specific image (e.g., the color or the texture). The content is what we want to keep unchanged during the translation process: typically, it is the foreground object shape, which is described by the image labels associated with the source data samples. Our encoder obtains the intermediate representations in a two-step process: we first generate style-invariant representations and then we compute the domain-invariant representations. Symmetrically, the decoder transforms the intermediate representations by first projecting these features onto a domain-specific distribution and then onto a style-specific distribution. In order to modify the underlying distribution of a set of features, inspired by [38], in the encoder we use whitening layers which progressively align the style-and-domain feature distributions. Then, in the decoder, we project the intermediate invariant representation onto a new domain-and-style specific distribution with Whitening and Coloring (WC) [42] batch transformations, according to the target data.

A “universal” translator similar in spirit to our proposed generator is StarGAN [5] (proposed for a non-UDA task). However, in StarGAN the domain information is represented by a one-hot vector concatenated with the input image. When we use StarGAN in our MSDA scenario, the synthesized images are much less effective for training the target classifier, which empirically shows that our batch-based transformation of the image distribution is more effective for our translation task.

Contributions. Our main contributions can be summarized as follows. (i) We propose the first generative MSDA method. We call our approach TriGAN because it is based on three different factors of the images: the style, the domain and the content. (ii) The proposed image translation process is based on style- and domain-specific statistics which are first removed from and then added to the source images by means of modified WC layers. Specifically, we use the following feature transformations (each associated with a corresponding layer type): Instance Whitening Transform (IWT), Domain Whitening Transform (DWT) [38], conditional Domain Whitening Transform (cDWT) and Adaptive Instance Whitening Transform (AdaIWT). IWT, cDWT and AdaIWT are novel layers introduced in this paper. (iii) We test our method on two MSDA datasets, Digits-Five [51] and Office-Caltech10 [12], outperforming state-of-the-art methods.

Figure 1: An overview of the TriGAN generator. We schematically show 3 domains: objects with holes, 3D objects and skewered objects, respectively. The content is represented by the object’s shape: square, circle or triangle. The style is represented by the color: each image input to the generator has a different color and each domain has its own set of styles. First, the encoder creates a style-invariant representation using IWT blocks. DWT blocks are then used to obtain a domain-invariant representation. Symmetrically, the decoder brings back domain-specific information with cDWT blocks (for simplicity we show only a single output domain). Finally, we apply a reference style. The reference style is extracted using the Style Path and it is applied using the Adaptive IWT blocks.

2 Related Work

In this section we review the previous approaches on UDA, considering both single source and multi-source methods. Since the proposed generator is also related to deep models used for image-to-image translation, we also analyse related work on this topic.

Single-source UDA. Single-source UDA approaches assume a single labeled source domain and can be broadly classified into three main categories, depending on the strategy adopted to cope with the domain-shift problem. The first category uses first- and second-order statistics to model the source and the target feature distributions. For instance, [28, 29, 50, 48] minimize the Maximum Mean Discrepancy, i.e. the distance between the means of the feature distributions of the two domains. On the other hand, [44, 33, 37] achieve domain invariance by aligning the second-order statistics through correlation alignment. Differently, [3, 25, 30] reduce the domain shift using domain alignment layers derived from batch normalization (BN) [20]. This idea has recently been extended in [38], where grouped-feature whitening (DWT) is used instead of the feature standardization of BN. In our proposed encoder we also use DWT layers, which we adapt to work in a generative network. In addition, we propose other style- and domain-dependent batch-based normalizations (i.e., IWT, cDWT and AdaIWT).

The second category of methods computes domain-agnostic representations by means of an adversarial learning-based approach. For instance, discriminative domain-invariant representations are constructed through a gradient reversal layer in [9]. Similarly, the approach in [46] uses a domain confusion loss to promote the alignment between the source and the target domain.

The third category of methods uses adversarial learning in a generative framework (i.e., GANs [13]) to reconstruct artificial source and/or target images and perform domain adaptation. Notable approaches are SBADA-GAN [40], CyCADA [17], CoGAN [27], I2I Adapt [34] and Generate To Adapt (GTA) [41]. While these generative methods have been shown to be very successful in UDA, none of them deals with a multi-source setting. Note that trivially extending these approaches to an MSDA scenario involves training N different generators, N being the number of source domains. In contrast, in our universal translator only a subset of the parameters grows linearly with the number of domains (Sec. 3.2.3), while the others are shared over all the domains. Moreover, since we train our generator using all the translation directions among the N + 1 domains, we can largely increase the number of training sample-domain pairs effectively used (Sec. 3.3).

Multi-source UDA. In [52], multiple-source knowledge transfer is obtained by borrowing knowledge from the target's k nearest-neighbour sources. Similarly, a distribution-weighted combining rule is proposed in [31] to construct a target hypothesis as a weighted combination of source hypotheses. Recently, Deep Cocktail Network (DCTN) [51] used the distribution-weighted combining rule in an adversarial setting. A Moment Matching Network (M³SDA) was introduced in [36] to reduce the discrepancy between the multiple source domains and the target domain. Differently from these methods, which operate in a discriminative setting, our method relies on a deep generative approach for MSDA.

Image-to-image Translation. Image-to-image translation approaches, i.e. those methods which learn how to transform an image from one domain to another, possibly keeping its semantics, are the basis of our method. In [21] the pix2pix network translates images under the assumption that paired images in the two domains are available at training time. In contrast, CycleGAN [53] can learn to translate images using unpaired training samples. Note that, by design, these methods work with two domains. ComboGAN [1] partially alleviates this issue by using N generators for translations among N domains. Our work is also related to StarGAN [5], which handles unpaired image translation amongst N domains (N ≥ 2) through a single generator. However, StarGAN achieves image translation without explicitly forcing the image representations to be domain invariant, and this may lead to a significant reduction of the network representation power as the number of domains increases. On the other hand, our goal is to obtain an explicit, intermediate image representation which is style-and-domain independent. We use IWT and DWT to achieve this. We also show that this invariant representation can simplify the re-projection process onto a desired style and target domain. This is achieved through cDWT and AdaIWT, which results in very realistic translations amongst domains. Very recently, a whitening and colouring based image-to-image translation method was proposed in [4], where the whitening operation is weight-based: the transformation is embedded into the network weights. Specifically, whitening is approximated by enforcing the covariance matrix, computed using the intermediate features, to be equal to the identity matrix. Conversely, our whitening transformation is data dependent (i.e., it depends on the specific batch statistics, Sec. 3.2.1) and uses the Cholesky decomposition [6] to compute the whitening matrices of the input samples in closed form, thereby eliminating the need for additional ad-hoc losses.

3 Style-and-Domain based Image Translation

In this section we describe the proposed approach for MSDA. We first provide an overview of our method and introduce the notation adopted throughout the paper (Sec. 3.1). Then we describe the TriGAN architecture (Sec. 3.2) and our training procedure (Sec. 3.3).

3.1 Notation and Overview

In the MSDA scenario we have access to N labeled source datasets S_1, ..., S_N, where S_j = {(x_k, y_k)}_k, and to a target unlabeled dataset T = {x_k}_k. All the datasets (target included) share the same categories and each of them is associated with a domain label: label j for the j-th source and label N + 1 for the target. Our final goal is to build a classifier C for the target domain exploiting the data in S_1 ∪ ... ∪ S_N ∪ T.

Our method is based on two separate training stages. We initially train a generator G which learns how to change the appearance of a real input image in order to adhere to a desired domain and style. Importantly, our G learns mappings between every possible pair of image domains. Learning all the pairwise translations makes it possible to exploit much more supervisory information with respect to a plain strategy in which N different source-to-target generators are trained (Sec. 3.3). Once G is trained, in the second stage we use it to generate target-like data having the same content as the source data, thus creating a new, labeled, target dataset, which is finally used to train a target classifier C. However, when training G (first stage), we do not use class labels and T is treated in the same way as the other datasets.

As mentioned in Sec. 1, G is composed of an encoder E and a decoder D (Fig. 1). The role of E is to “whiten”, i.e., to remove, both domain-specific and style-specific aspects of the input image features in order to obtain domain- and style-invariant representations. Symmetrically, D “colors” the domain-and-style invariant features generated by E, by progressively projecting these intermediate representations onto a domain-and-style specific space.

In the first training stage, G takes as input a batch of images x_1, ..., x_m with corresponding domain labels l_1, ..., l_m, where x_i belongs to the domain with label l_i and l_i ∈ {1, ..., N + 1}. Moreover, G takes as input a batch of output domain labels l_1^O, ..., l_m^O and a batch of reference style images x_1^O, ..., x_m^O, such that x_i^O has domain label l_i^O. For a given x_i, the task of G is to transform x_i into x̂_i such that: (1) x_i and x̂_i share the same content but (2) x̂_i belongs to domain l_i^O and has the same style as x_i^O.

3.2 TriGAN Architecture

The TriGAN architecture is composed of a generator network G and a discriminator network D_P. As mentioned above, G comprises an encoder E and a decoder D, which we describe in Sec. 3.2.2 and Sec. 3.2.3. The discriminator is based on the Projection Discriminator [32]. Before describing the details of G, we briefly review the Whitening and Coloring (WC) transform [42] (Sec. 3.2.1), which is used as the basic operation in our proposed batch-based feature transformations.

3.2.1 Preliminaries: Whitening and Coloring (WC) Transform

Let F(x) be the tensor representing the activation values of the convolutional feature maps in a given layer, corresponding to the input image x, with d channels and h × w spatial locations. We treat each spatial location as a d-dimensional vector; in this way, each image is associated with a set of vectors B(x) = {v_1, ..., v_{h×w}}. With a slight abuse of notation, we also use B = {v_1, ..., v_n} to denote the set including all the spatial locations of all the images in a batch. The WC transform is a multivariate extension of the per-dimension normalization and shift-scaling transform (BN) proposed in [20] and widely adopted in both generative and discriminative networks. WC can be described by:

WC(v_k; B, β, Γ) = Coloring(v̂_k; β, Γ) = Γ v̂_k + β,    (1)

where:

v̂_k = Whitening(v_k; B) = W_B (v_k − μ_B).    (2)

In Eq. 2, μ_B is the centroid of the elements in B, while W_B is such that W_B^⊤ W_B = Σ_B^{-1}, where Σ_B is the covariance matrix computed using B. The result of applying Eq. 2 to the elements of B is a set of whitened features B̂ = {v̂_1, ..., v̂_n}, which lie in a spherical distribution (i.e., with a covariance matrix equal to the identity matrix). On the other hand, Eq. 1 performs a coloring transform, i.e. it projects the elements of B̂ onto a learned multivariate Gaussian distribution. While μ_B and W_B are computed using the elements in B (they are data-dependent), Eq. 1 depends on the d-dimensional learned parameter vector β and the d × d-dimensional learned parameter matrix Γ. Eq. 1 is a linear operation and can simply be implemented using a convolutional layer with 1 × 1 kernels.
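To make the batch-based WC operation concrete, the following PyTorch sketch implements the whitening step of Eq. 2 via a Cholesky decomposition and the learned coloring of Eq. 1 for a generic set of feature vectors. It is only an illustrative re-implementation under our own assumptions (function names, the ε regularizer added for numerical stability), not the authors' code.

```python
import torch

def whiten(B, eps=1e-5):
    """Whitening of Eq. 2: B is an (n, d) matrix whose rows are the feature
    vectors v_k; returns W_B (v_k - mu_B) for every row, plus (mu_B, W_B)."""
    n, d = B.shape
    mu = B.mean(dim=0, keepdim=True)                  # centroid mu_B
    Bc = B - mu
    cov = Bc.t() @ Bc / (n - 1) + eps * torch.eye(d)  # Sigma_B (regularized)
    L = torch.linalg.cholesky(cov)                    # Sigma_B = L L^T
    W = torch.linalg.inv(L)                           # so that W^T W = Sigma_B^{-1}
    return Bc @ W.t(), mu, W

class Coloring(torch.nn.Module):
    """Coloring of Eq. 1: a learned affine map Gamma v + beta (equivalent to a
    1x1 convolution in the convolutional case)."""
    def __init__(self, d):
        super().__init__()
        self.gamma = torch.nn.Parameter(torch.eye(d))   # d x d learned matrix
        self.beta = torch.nn.Parameter(torch.zeros(d))  # d-dim learned vector
    def forward(self, B_hat):
        return B_hat @ self.gamma.t() + self.beta

# usage: WC(v; B, beta, Gamma) over a batch of n d-dimensional feature vectors
B = torch.randn(512, 64)
B_hat, mu_B, W_B = whiten(B)
colored = Coloring(64)(B_hat)
```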

In this paper we use the WC transform in our encoder E and decoder D, in order to first obtain a style-and-domain invariant representation for each v_k, and then transform this representation according to the desired output domain l^O and style image x^O. The next sub-sections show the details of the proposed architecture.

3.2.2 Encoder

The encoder E is composed of a sequence of standard convolution-normalization-ReLU-pooling blocks and some residual blocks (more details in the Supplementary Material), in which we replace the common BN layers [20] with our proposed normalization modules, which are detailed below.

Obtaining Style Invariant Representations. In the first two blocks of E we whiten the first- and second-order statistics of the low-level features of each x_i, which are mainly responsible for the style of an image [11]. To do so, we propose the Instance Whitening Transform (IWT), where the term instance is inspired by Instance Normalization (IN) [49] and highlights that the proposed transform is applied to the set of features extracted from a single image. For each v_k ∈ B(x_i), the IWT is defined as:

IWT(v_k) = WC(v_k; B(x_i), β, Γ).    (3)

Note that in Eq. 3 we use B(x_i) as the batch, where B(x_i) contains only the features of the specific image x_i (Sec. 3.2.1). Moreover, each v_k is extracted from the first two convolutional layers of E, thus it has a small receptive field. This implies that whitening is performed using an image-specific feature centroid μ_{B(x_i)} and covariance matrix Σ_{B(x_i)}, which represent the first- and second-order statistics of the low-level features of x_i. On the other hand, coloring is based on the parameters β and Γ, which do not depend on x_i or on its domain. The coloring operation is analogous to the per-dimension shift-scaling transform computed in BN just after feature standardization [20] and is necessary to avoid decreasing the network representation capacity [42].

Obtaining Domain Invariant Representations. In the subsequent blocks of E we whiten the first- and second-order statistics which are domain specific. For this operation we adopt the Domain Whitening Transform (DWT) proposed in [38]. Specifically, for each x_i in the batch, let l_i be its domain label (see Sec. 3.1) and let B_l ⊆ B be the subset of features which have been extracted from all those images in the batch which share the same domain label l. Then, for each v_k ∈ B_l:

DWT(v_k) = WC(v_k; B_l, β, Γ).    (4)

Similarly to Eq. 3, Eq. 4 performs whitening using a subset of the current feature batch. Specifically, all the features in B are partitioned depending on the domain label of the image they have been extracted from, so obtaining B_1, B_2, etc., where all the features in B_l belong to the images of the domain with label l. Then, B_l is used to compute domain-dependent first- and second-order statistics (μ_{B_l}, Σ_{B_l}). These statistics are used to project each v_k ∈ B_l onto a domain-invariant spherical distribution. A similar idea was recently proposed in [38] in a discriminative network for single-source UDA. However, differently from [38], we also use coloring, by re-projecting the whitened features onto a new space governed by a learned multivariate distribution. This is done using the (layer-specific) parameters β and Γ, which do not depend on l.
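As a rough sketch of how the domain-wise grouping of Eq. 4 could be implemented, the snippet below partitions a feature batch by domain label, whitens each subset with its own statistics and then applies the shared coloring parameters. It reuses the hypothetical whiten() and Coloring helpers from the previous listing; the tensor layout (one row per spatial location) and the label convention are our assumptions.

```python
import torch

def dwt(B, domain_labels, coloring, eps=1e-5):
    """Sketch of the Domain Whitening Transform (Eq. 4).
    B: (n, d) feature vectors; domain_labels: (n,) integer label per vector;
    coloring: a single shared Coloring module (beta, Gamma independent of the
    domain). Reuses the whiten() helper defined earlier."""
    out = torch.empty_like(B)
    for l in domain_labels.unique():
        idx = (domain_labels == l).nonzero(as_tuple=True)[0]
        B_l_hat, _, _ = whiten(B[idx], eps)   # domain-specific mu_{B_l}, W_{B_l}
        out[idx] = coloring(B_l_hat)          # shared, learned coloring
    return out
```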

3.2.3 Decoder

Our decoder D is functionally and structurally symmetric with respect to E: it takes as input the domain- and style-invariant features computed by E and projects these features onto the desired domain l^O, with the style extracted from the reference image x^O.

Similarly to E, D is a sequence of residual blocks and a few upsampling-normalization-ReLU-convolution blocks (more details in the Supplementary Material). As in Sec. 3.2.2, in these blocks we replace BN with our proposed feature normalization approaches, which are detailed below.

Projecting Features onto a Domain-specific Distribution. Apart from the last two blocks of D (see below), all the other blocks are dedicated to projecting the current set of features onto a domain-specific subspace. This subspace is learned from data using domain-specific coloring parameters (β_l, Γ_l), where l is the label of the corresponding domain. To this purpose we introduce the conditional Domain Whitening Transform (cDWT), where the term “conditional” specifies that the coloring step is conditioned on the domain label l. In more detail: similarly to Eq. 4, we first partition B into B_1, B_2, etc. However, the membership of a feature v_k to B_l is decided taking into account the desired output domain label of the corresponding image rather than its original domain, as in the case of Eq. 4. Specifically, if v_k ∈ B(x_i) and the output domain label of x_i is l_i^O = l, then v_k is included in B_l. Once B has been partitioned, we define cDWT as follows:

cDWT(v_k) = WC(v_k; B_l, β_l, Γ_l).    (5)

Note that, after whitening, and differently from Eq. 4, coloring in Eq. 5 is performed using the domain-specific parameters (β_l, Γ_l).
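The cDWT layer differs from DWT only in that the grouping follows the desired output domain and the coloring parameters are per-domain. A minimal sketch, again reusing the hypothetical whiten() and Coloring helpers defined earlier and assuming domain labels in {0, ..., num_domains − 1}:

```python
import torch

class CDWT(torch.nn.Module):
    """Sketch of the conditional DWT (Eq. 5): whitening as in DWT, but the
    features are grouped by the desired output domain label and colored with
    per-domain parameters (beta_l, Gamma_l)."""
    def __init__(self, d, num_domains):
        super().__init__()
        # one (beta, Gamma) pair per output domain
        self.colorings = torch.nn.ModuleList([Coloring(d) for _ in range(num_domains)])

    def forward(self, B, output_domain_labels):
        out = torch.empty_like(B)
        for l in output_domain_labels.unique():
            idx = (output_domain_labels == l).nonzero(as_tuple=True)[0]
            B_l_hat, _, _ = whiten(B[idx])              # group-wise whitening
            out[idx] = self.colorings[int(l)](B_l_hat)  # domain-specific coloring
        return out
```

Note that only these per-domain coloring parameters grow with the number of domains, which is consistent with the claim in Sec. 2 that only a subset of the generator parameters scales linearly with the number of domains.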

Applying a Specific Style. In order to apply a specific style to the generated image, we first extract the output style from the reference image x_i^O associated with x_i (Sec. 3.1). This is done using the Style Path (see Fig. 1), which consists of two IWT blocks (sharing their parameters with the first two blocks of the encoder) and a MultiLayer Perceptron (MLP) F. Following [11], we represent a style using the first- and second-order statistics (μ_{B(x_i^O)}, W_{B(x_i^O)}), which are extracted using the IWT blocks (Sec. 3.2.2). Then we use F to adapt these statistics to the domain-specific representation obtained as the output of the previous step. In fact, in principle, for each v_k, the Whitening operation inside the IWT transform could be “inverted” using:

Coloring(v̂_k; μ_{B(x_i^O)}, W_{B(x_i^O)}^{-1}) = W_{B(x_i^O)}^{-1} v̂_k + μ_{B(x_i^O)}.    (6)

Indeed, the coloring operation (Eq. 1) is the inverse of the whitening operation (Eq. 2). However, the elements v̂_k now lie in a feature space different from the output space of Eq. 3, thus the transformation defined by the Style Path needs to be adapted. For this reason, we use an MLP (F) which implements this adaptation:

(β_i^O, Γ_i^O) = F([μ_{B(x_i^O)}; W_{B(x_i^O)}^{-1}]).    (7)

Note that, in Eq. 7, [μ_{B(x_i^O)}; W_{B(x_i^O)}^{-1}] is the (concatenated) input and (β_i^O, Γ_i^O) is the MLP output, with one input-output pair per reference image x_i^O.

Once (β_i^O, Γ_i^O) have been generated, we use them as the coloring parameters of our Adaptive IWT (AdaIWT):

AdaIWT(v_k) = WC(v_k; B(x_i), β_i^O, Γ_i^O).    (8)

Eq. 8 imposes style-specific first- and second-order statistics on the features of the last blocks of D, in order to mimic the style of x_i^O.
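The following sketch illustrates the Style Path and the AdaIWT coloring of Eqs. 6-8: the style statistics (μ, W⁻¹) of the reference image are flattened, mapped by an MLP F to coloring parameters (β, Γ), and then used to color instance-whitened decoder features. The MLP depth and width here are placeholders and the flattening scheme is our assumption; see the Supplementary Material for the architecture actually used.

```python
import torch

class StylePathMLP(torch.nn.Module):
    """Sketch of the MLP F of Eq. 7: maps the flattened style statistics
    [mu ; W^{-1}] of a reference image to AdaIWT coloring parameters
    (beta, Gamma). Layer sizes are placeholders, not the paper's."""
    def __init__(self, d, hidden=256):
        super().__init__()
        in_dim = d + d * d                      # concatenation of mu and W^{-1}
        out_dim = d + d * d                     # beta (d) and Gamma (d x d)
        self.F = torch.nn.Sequential(
            torch.nn.Linear(in_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, out_dim))
        self.d = d

    def forward(self, mu, W):
        # mu, W come from whiten() applied to the reference image's low-level features
        z = torch.cat([mu.flatten(), torch.linalg.inv(W).flatten()])
        out = self.F(z)
        beta, gamma = out[:self.d], out[self.d:].view(self.d, self.d)
        return beta, gamma

def ada_iwt(B_i, beta, gamma):
    """Adaptive IWT (Eq. 8): instance-whiten the features of one image and
    color them with the style-derived (beta, Gamma)."""
    B_i_hat, _, _ = whiten(B_i)                 # whiten() from the first listing
    return B_i_hat @ gamma.t() + beta
```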

3.3 Network Training

GAN Training. For the sake of clarity, in the rest of the paper we use a simplified notation for G, in which G takes as input only one image instead of a batch. Specifically, let x̂ = G(x, l^O, x^O) be the generated image, starting from the image x (with domain label l) and with desired output domain l^O and style image x^O. G is trained using the combination of three different losses, with the goal of changing the style and the domain of x while preserving its content.

First, we use an adversarial loss based on the Projection Discriminator [32] (D_P), which is conditioned on labels (domain labels, in our case) and uses a hinge loss:

L_{cGAN}(D_P) = E_{(x, l)} [max(0, 1 − D_P(x, l))] + E_{(x, l^O, x^O)} [max(0, 1 + D_P(G(x, l^O, x^O), l^O))],    (9)
L_{cGAN}(G) = −E_{(x, l^O, x^O)} [D_P(G(x, l^O, x^O), l^O)].    (10)

The second loss is the Identity loss proposed in ([53]), which in our framework is implemented as follows:

L_I = E_{(x, l)} [ ||G(x, l, x) − x||_1 ].    (11)

In Eq. 11, G computes an identity transformation, since the input and the output domain and style are the same. After that, a pixel-to-pixel L1 norm is computed.

Finally, we propose to use a third loss, which is based on the rationale that the generation process should be equivariant with respect to a set of simple transformations which preserve the main content of the images (e.g., the foreground object shape). Specifically, we use the set of affine transformations h_θ(x) of image x, defined by the parameter θ (θ is a 2D transformation matrix). The affine transformation is implemented by a differentiable bilinear kernel as in [22]. The Equivariance loss is:

L_{Eq} = E_{(x, l^O, x^O), θ} [ ||G(h_θ(x), l^O, x^O) − h_θ(G(x, l^O, x^O))||_1 ].    (12)

In Eq. 12, for a given image x, we randomly choose a geometric parameter θ and we apply h_θ to the generated image G(x, l^O, x^O). Then, using the same θ, we apply h_θ to x, obtaining h_θ(x), which is input to G in order to generate a second image. The two generated images are finally compared using the L1 norm. This is a form of self-supervision, in which equivariance to geometric transformations is used to extract semantics. Very recently a similar loss has been proposed in [19], where equivariance to affine transformations is used for image co-segmentation.
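A possible implementation of the equivariance loss of Eq. 12, with random affine transformations and a differentiable bilinear warp [22], is sketched below. The transformation ranges and the generator call signature G(images, output_domain_labels, style_images) are assumptions made for illustration, not the authors' exact choices.

```python
import torch
import torch.nn.functional as F

def random_affine_theta(batch_size, max_shift=0.1, max_scale=0.1):
    """Sample one random 2x3 affine matrix theta per image (the ranges here
    are arbitrary illustrative choices)."""
    theta = torch.eye(2, 3).unsqueeze(0).repeat(batch_size, 1, 1)
    theta[:, :, :2] += max_scale * (torch.rand(batch_size, 2, 2) - 0.5) * 2
    theta[:, :, 2] = max_shift * (torch.rand(batch_size, 2) - 0.5) * 2
    return theta

def apply_affine(x, theta):
    """h_theta(x): differentiable affine warp via bilinear sampling [22]."""
    grid = F.affine_grid(theta, x.size(), align_corners=False)
    return F.grid_sample(x, grid, align_corners=False)

def equivariance_loss(G, x, out_domain, style_img):
    """L_Eq of Eq. 12: G applied to h_theta(x) should match h_theta applied to G(x)."""
    theta = random_affine_theta(x.size(0))
    gen = G(x, out_domain, style_img)                                # G(x, l^O, x^O)
    gen_of_warped = G(apply_affine(x, theta), out_domain, style_img)
    return (gen_of_warped - apply_affine(gen, theta)).abs().mean()   # L1 comparison
```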

The complete loss for G is:

L(G) = L_{cGAN}(G) + λ (L_I + L_{Eq}).    (13)

Note that Eqs. 9, 10 and 12 depend on the pair (l^O, x^O): this means that the supervisory information we effectively use grows quadratically with the number of domains, as opposed to a plain strategy in which N different source-to-target generators are trained (Sec. 2).

Classifier Training. Once G is trained, we use it to artificially create a labeled training dataset T_L for the target domain. Specifically, for each source sample (x_k, y_k) ∈ S_j and each source domain j, we randomly pick a target image x^t ∈ T, which is used as the reference style image, and we generate: x̂_k = G(x_k, N + 1, x^t), where N + 1 is fixed and indicates the target domain label (see Sec. 3.1). The pair (x̂_k, y_k) is added to T_L and the process is iterated. Note that, in different epochs, for the same x_k, we randomly select a different reference style image x^t.

Finally, we train a classifier C on T_L using the cross-entropy loss:

L_{Cls}(C) = −E_{(x̂, y) ∈ T_L} [ log p_C(y | x̂) ].    (14)
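The second stage can be summarized by the following sketch: every labeled source image is translated to the target domain with a randomly chosen target image as style reference, and the resulting labeled set is used to train C with the cross-entropy loss of Eq. 14. For brevity the synthetic set is built once here, whereas the procedure described above re-samples the reference style image at every epoch; the generator interface is the same hypothetical one used in the previous listing.

```python
import random
import torch

def build_synthetic_target_set(G, sources, target_images, target_label):
    """Translate every labeled source image into the target domain, using a
    randomly picked target image as the reference style (Sec. 3.3).
    `sources` is a list of datasets yielding (image, class_label) pairs and
    G(x, output_domain, style_image) is the assumed generator interface."""
    T_L = []
    with torch.no_grad():
        for S_j in sources:
            for x_k, y_k in S_j:
                x_t = random.choice(target_images)        # reference style image
                x_hat = G(x_k.unsqueeze(0),
                          torch.tensor([target_label]),
                          x_t.unsqueeze(0)).squeeze(0)
                T_L.append((x_hat, y_k))
    return T_L

def train_classifier(C, T_L, epochs=10, lr=1e-4, batch_size=256):
    """Train the target classifier C on the generated set with the
    cross-entropy loss of Eq. 14."""
    loader = torch.utils.data.DataLoader(T_L, batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(C.parameters(), lr=lr)
    ce = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x_hat, y in loader:
            opt.zero_grad()
            loss = ce(C(x_hat), y)
            loss.backward()
            opt.step()
    return C
```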

4 Experimental Results

In this section we describe the experimental setup and then we evaluate our approach using common MSDA datasets. We also present an ablation study in which we separately analyse the impact of each TriGAN component.

4.1 Datasets

In our experiments we consider two common domain adaptation benchmarks, namely the Digits-Five benchmark [51] and the Office-Caltech dataset [12].

Digits-Five [51] is composed of five digit-recognition datasets: USPS [8], MNIST [24], MNIST-M [9], SVHN [35] and Synthetic numbers datasets [10] (SYNDIGITS). SVHN [35] contains Google Street View images of real-world house numbers. Synthetic numbers [10] includes 500K computer-generated digits with different sources of variations (i.e. position, orientation, color, blur). USPS [8] is a dataset of digits scanned from U.S. envelopes, MNIST [24] is a popular benchmark for digit recognition and MNIST-M [9] is its colored counterpart. We adopt the experimental protocol described in [51]: in each domain the train/test split is composed of a subset of 25000 images for training and 9000 images for testing. For USPS, the entire dataset is used.

Office-Caltech10 [12] is a domain-adaptation benchmark obtained by selecting the categories which are shared between Office31 and Caltech256 [14]. About half of its images belong to Caltech256. There are four different domains: Amazon (A), DSLR (D), Webcam (W) and Caltech256 (C).

Standards       | Models         | mt,up,sv,sy → mm | mm,up,sv,sy → mt | mt,mm,sv,sy → up | mt,up,mm,sy → sv | mt,up,sv,mm → sy | Avg
Source Combine  | Source Only    | 63.70 ± 0.83     | 92.30 ± 0.91     | 90.71 ± 0.54     | 71.51 ± 0.75     | 83.44 ± 0.79     | 80.33 ± 0.76
Source Combine  | DAN [28]       | 67.87 ± 0.75     | 97.50 ± 0.62     | 93.49 ± 0.85     | 67.80 ± 0.84     | 86.93 ± 0.93     | 82.72 ± 0.79
Source Combine  | DANN [9]       | 70.81 ± 0.94     | 97.90 ± 0.83     | 93.47 ± 0.79     | 68.50 ± 0.85     | 87.37 ± 0.68     | 83.61 ± 0.82
Multi-Source    | Source Only    | 63.37 ± 0.74     | 90.50 ± 0.83     | 88.71 ± 0.89     | 63.54 ± 0.93     | 82.44 ± 0.65     | 77.71 ± 0.81
Multi-Source    | DAN [28]       | 63.78 ± 0.71     | 96.31 ± 0.54     | 94.24 ± 0.87     | 62.45 ± 0.72     | 85.43 ± 0.77     | 80.44 ± 0.72
Multi-Source    | CORAL [43]     | 62.53 ± 0.69     | 97.21 ± 0.83     | 93.45 ± 0.82     | 64.40 ± 0.72     | 82.77 ± 0.69     | 80.07 ± 0.75
Multi-Source    | DANN [9]       | 71.30 ± 0.56     | 97.60 ± 0.75     | 92.33 ± 0.85     | 63.48 ± 0.79     | 85.34 ± 0.84     | 82.01 ± 0.76
Multi-Source    | ADDA [47]      | 71.57 ± 0.52     | 97.89 ± 0.84     | 92.83 ± 0.74     | 75.48 ± 0.48     | 86.45 ± 0.62     | 84.84 ± 0.64
Multi-Source    | DCTN [51]      | 70.53 ± 1.24     | 96.23 ± 0.82     | 92.81 ± 0.27     | 77.61 ± 0.41     | 86.77 ± 0.78     | 84.79 ± 0.72
Multi-Source    | M³SDA [36]     | 72.82 ± 1.13     | 98.43 ± 0.68     | 96.14 ± 0.81     | 81.32 ± 0.86     | 89.58 ± 0.56     | 87.65 ± 0.75
Multi-Source    | StarGAN [5]    | 44.71 ± 1.39     | 96.26 ± 0.62     | 55.32 ± 3.71     | 58.93 ± 1.95     | 63.36 ± 2.41     | 63.71 ± 2.01
Multi-Source    | TriGAN (Ours)  | 83.20 ± 0.78     | 97.20 ± 0.45     | 94.08 ± 0.92     | 85.66 ± 0.79     | 90.30 ± 0.57     | 90.08 ± 0.70
Table 1: Classification accuracy (%) on Digits-Five. MNIST-M, MNIST, USPS, SVHN, Synthetic Digits are abbreviated as mm, mt, up, sv and sy respectively. Best number is in bold and second best is underlined.

4.2 Experimental Setup

Due to lack of space, we provide the architectural details of our generator and discriminator networks in the Supplementary Material. We train TriGAN for 100 epochs using the Adam optimizer [23], with the learning rate set to 1e-4 for G and 4e-4 for D_P, as in [16]. The loss weighting factor λ in Eq. 13 is set to 10, as in [53].

In the Digits-Five experiments we use a mini-batch of size 256 for TriGAN training. Due to the differences in image resolution and image channels, the images of all the domains are converted to 32 × 32 RGB. For a fair comparison, for the final target classifier C we use exactly the same network architecture used in [10, 36].

In the Office-Caltech10 experiments we downsample the images to 164 × 164 to accommodate more samples in a mini-batch. We use a mini-batch of size 24 for training on 1 GPU. For the backbone target classifier C we use the ResNet101 [15] architecture used by [36]. The weights are initialized with a network pre-trained on the ILSVRC-2012 dataset [39]. In our experiments we remove the output layer and replace it with a randomly initialized fully-connected layer with 10 logits, one for each class of the Office-Caltech10 dataset.

C is trained with Adam, with an initial learning rate of 1e-5 for the randomly initialized last layer and 1e-6 for all the other layers. In this setting we also include the original source images in T_L when training the classifier C.
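A minimal sketch of this classifier setup, assuming a recent torchvision: a ResNet101 pre-trained on ILSVRC-2012 whose final layer is replaced by a 10-way fully-connected layer, with the two learning rates reported above assigned to separate parameter groups.

```python
import torch
import torchvision

# ResNet101 backbone pre-trained on ILSVRC-2012, with the output layer replaced
# by a randomly initialized 10-way fully-connected layer (one logit per
# Office-Caltech10 class), and per-group learning rates as reported above.
C = torchvision.models.resnet101(
    weights=torchvision.models.ResNet101_Weights.IMAGENET1K_V1)
C.fc = torch.nn.Linear(C.fc.in_features, 10)

optimizer = torch.optim.Adam([
    {"params": [p for n, p in C.named_parameters() if not n.startswith("fc.")],
     "lr": 1e-6},                                # pre-trained layers
    {"params": C.fc.parameters(), "lr": 1e-5},   # new last layer
])
```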

4.3 Results

In this section we quantitatively analyse TriGAN. In the Supplementary Material we show some qualitative results for Digits-Five and Office-Caltech10.

4.3.1 Comparison with State-of-the-Art Methods

Tab. 1 and Tab. 2 show the results on the Digits-Five and the Office-Caltech10 dataset, respectively. Table 1 shows that TriGAN achieves an average accuracy of 90.08%, which is higher than all the other methods. M³SDA [36] is better in the mm, up, sv, sy → mt and in the mt, mm, sv, sy → up settings, where TriGAN is the second best. In all the other settings, TriGAN outperforms all the other approaches. As an example, in the mt, up, sv, sy → mm setting, TriGAN is better than the second best method by a significant margin of 10.38%. In the same table we also show the results obtained when we replace TriGAN with StarGAN [5], which is another “universal” image translator. Specifically, we use StarGAN to generate synthetic target images and then we train the target classifier using the same protocol described in Sec. 3.3. The corresponding results in Table 1 show that StarGAN, despite being known to work well for aligned face translation, drastically fails when used in this UDA scenario.

Finally, we also use Office-Caltech10, which is considered to be difficult for reconstruction-based GAN methods because of its higher-resolution images. Although the dataset is quite saturated, TriGAN achieves a classification accuracy of 97.0%, outperforming all the other methods and beating the previous state-of-the-art approach (M³SDA [36]) by a margin of 0.6% on average (see Tab. 2).

Standards       | Models          | → W  | → D   | → C  | → A  | Avg
Source Combine  | Source only     | 99.0 | 98.3  | 87.8 | 86.1 | 92.8
Source Combine  | DAN [28]        | 99.3 | 98.2  | 89.7 | 94.8 | 95.5
Multi-Source    | Source only     | 99.1 | 98.2  | 85.4 | 88.7 | 92.9
Multi-Source    | DAN [28]        | 99.5 | 99.1  | 89.2 | 91.6 | 94.8
Multi-Source    | DCTN [51]       | 99.4 | 99.0  | 90.2 | 92.7 | 95.3
Multi-Source    | M³SDA [36]      | 99.5 | 99.2  | 92.2 | 94.5 | 96.4
Multi-Source    | StarGAN [5]     | 99.6 | 100.0 | 89.3 | 93.3 | 95.5
Multi-Source    | TriGAN (Ours)   | 99.7 | 100.0 | 93.0 | 95.2 | 97.0
Table 2: Classification accuracy (%) on Office-Caltech10.

4.3.2 Ablation Study

In this section we analyse the different components of our method and study their impact on the final accuracy in isolation. Specifically, we use the Digits-Five dataset and the following models: i) Model A, which is our full model containing the following components: IWT, DWT, cDWT, AdaIWT and L_Eq. ii) Model B, which is similar to Model A except that we replace L_Eq with the cycle-consistency loss of CycleGAN [53]. iii) Model C, where we replace IWT, DWT, cDWT and AdaIWT of Model A with IN [49], BN [20], conditional Batch Normalization (cBN) [7] and Adaptive Instance Normalization (AdaIN) [18], respectively. This comparison highlights the difference between feature whitening and feature standardisation. iv) Model D, which ignores the style factor. Specifically, in Model D, the blocks related to the style factor, i.e., the IWT and the AdaIWT blocks, are replaced by DWT and cDWT blocks, respectively. v) Model E, in which the Style Path differs from Model A in the way the style is applied to the domain-specific representation. Specifically, we remove the MLP F and we directly apply the extracted style statistics as the coloring parameters (Eq. 6). vi) Finally, Model F represents the no-domain assumption (i.e., the DWT and cDWT blocks are replaced with standard WC blocks).

Model | Description                                     | Avg. Accuracy (%) (Difference)
A     | TriGAN (full method)                            | 90.08
B     | Replace Equivariance Loss with Cycle Loss       | 88.38 (-1.70)
C     | Replace Whitening with Feature Standardisation  | 89.39 (-0.68)
D     | No Style Assumption                             | 88.32 (-1.76)
E     | Applying style directly instead of Style Path   | 88.36 (-1.71)
F     | No Domain Assumption                            | 89.10 (-0.98)
Table 3: An analysis of the main TriGAN components using Digits-Five.

Tab. 3 shows that Model A outperforms all the ablated models. Model B shows that replacing L_Eq with the cycle-consistency loss is detrimental for the accuracy, because the cycle loss may focus on meaningless information in order to reconstruct back the image. Conversely, the affine transformations used in L_Eq force G to focus on the shape (i.e., the content) of the images. Model C is also outperformed by Model A, demonstrating the importance of feature whitening over feature standardisation and corroborating the findings of [38] in a pure-discriminative setting. Moreover, the no-style assumption in Model D hurts the classification accuracy by a margin of 1.76% when compared with Model A. We believe this is due to the fact that, when only domain-specific latent factors are modeled but instance-specific style information is missing in the image translation process, the diversity of the translations decreases, consequently reducing the final accuracy (see the role of the randomly picked reference style image x^t in Sec. 3.3). Model E shows the need for the proposed Style Path. Finally, Model F shows that having a separate factor for the domain yields better performance. Note that the ablation analysis in Tab. 3 is done by removing a single component from the full Model A, and the marginal difference with Model A shows that all the components are important. On the other hand, simultaneously removing all the components makes our model become similar to StarGAN, where there is no style information and where the domain information is not “whitened” but provided as input to the network. As shown in Table 1, our full model drastically outperforms a StarGAN-based generative MSDA approach.

4.3.3 Multi-domain image-to-image translation

Our proposed generator can also be used for a pure generative (non-UDA), multi-domain image-to-image translation task. We conduct experiments on the Alps Seasons dataset [1], which consists of images of the Alps with 4 different domains (corresponding to the 4 seasons). Fig. 2 shows some images generated with our generator. For this experiment we compare our generator with StarGAN [5] using the FID [16] metric. FID measures the realism of the generated images (the lower the better). The FID scores are computed considering all the real samples in the target domain and generating an equivalent number of synthetic images in the target domain. Tab. 4 shows that the TriGAN FID scores are significantly lower than the StarGAN scores. This further highlights that decoupling the style and the domain, and using WC-based layers to progressively “whiten” and “color” the image statistics, yields a more realistic cross-domain image translation than using domain labels as input, as in the case of StarGAN.

Figure 2: Some example images generated by TriGAN across different domains (i.e., seasons). We show two generated images for each domain combination. This figure is reported also in the Supplementary Material with a higher resolution.
Target        | Winter | Summer | Spring | Autumn
StarGAN [5]   | 148.45 | 180.36 | 175.40 | 145.24
TriGAN (Ours) | 41.03  | 38.59  | 40.75  | 32.71
Table 4: Alps Seasons, FID scores: Comparing TriGAN with StarGAN [5].

5 Conclusions

In this work we proposed TriGAN, an MSDA framework which is based on data generation from multiple source domains using a single generator. The underlying principle of our approach is to obtain intermediate, domain- and style-invariant representations in order to simplify the generation process. Specifically, our generator progressively removes style- and domain-specific statistics from the source images and then re-projects the intermediate features onto the desired target domain and style. We obtained state-of-the-art results on two MSDA datasets, showing the potential of our approach.

References

  • [1] Asha Anoosheh, Eirikur Agustsson, Radu Timofte, and Luc Van Gool. Combogan: Unrestrained scalability for image domain translation. In CVPR, 2018.
  • [2] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR, 2017.
  • [3] Fabio Maria Carlucci, Lorenzo Porzi, Barbara Caputo, Elisa Ricci, and Samuel Rota Bulò. Autodial: Automatic domain alignment layers. In ICCV, 2017.
  • [4] Wonwoong Cho, Sungha Choi, David Keetae Park, Inkyu Shin, and Jaegul Choo. Image-to-image translation via group-wise deep whitening-and-coloring transformation. In CVPR, 2019.
  • [5] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
  • [6] Dariusz Dereniowski and Marek Kubale. Cholesky factorization of matrices in parallel and ranking of graphs. In International Conference on Parallel Processing and Applied Mathematics, 2003.
  • [7] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. In ICLR, 2016.
  • [8] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001.
  • [9] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.
  • [10] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. JMLR, 2016.
  • [11] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
  • [12] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, 2012.
  • [13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
  • [14] Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. 2007.
  • [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [16] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS, 2017.
  • [17] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In ICML, 2017.
  • [18] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.
  • [19] Wei-Chih Hung, Varun Jampani, Sifei Liu, Pavlo Molchanov, Ming-Hsuan Yang, and Jan Kautz. Scops: Self-supervised co-part segmentation. In CVPR, 2019.
  • [20] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [21] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • [22] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In NIPS, 2015.
  • [23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
  • [24] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
  • [25] Yanghao Li, Naiyan Wang, Jianping Shi, Jiaying Liu, and Xiaodi Hou. Revisiting batch normalization for practical domain adaptation. In ICLR-WS, 2017.
  • [26] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.
  • [27] Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. In NIPS, 2016.
  • [28] Mingsheng Long and Jianmin Wang. Learning transferable features with deep adaptation networks. In ICML, 2015.
  • [29] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In ICML, 2017.
  • [30] Massimiliano Mancini, Lorenzo Porzi, Samuel Rota Bulò, Barbara Caputo, and Elisa Ricci. Boosting domain adaptation by discovering latent domains. In CVPR, 2018.
  • [31] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation with multiple sources. In NIPS, 2009.
  • [32] Takeru Miyato and Masanori Koyama. cGANs with projection discriminator. In ICLR, 2018.
  • [33] Pietro Morerio, Jacopo Cavazza, and Vittorio Murino. Minimal-entropy correlation alignment for unsupervised deep domain adaptation. In ICLR, 2018.
  • [34] Zak Murez, Soheil Kolouri, David Kriegman, Ravi Ramamoorthi, and Kyungnam Kim. Image to image translation for domain adaptation. In CVPR, 2018.
  • [35] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS-WS, 2011.
  • [36] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. ICCV, 2019.
  • [37] Xingchao Peng and Kate Saenko. Synthetic to real adaptation with generative correlation alignment networks. In WACV, 2018.
  • [38] Subhankar Roy, Aliaksandr Siarohin, Enver Sangineto, Samuel Rota Bulo, Nicu Sebe, and Elisa Ricci. Unsupervised domain adaptation using feature-whitening and consensus loss. CVPR, 2019.
  • [39] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 2015.
  • [40] Paolo Russo, Fabio Maria Carlucci, Tatiana Tommasi, and Barbara Caputo. From source to target and back: symmetric bi-directional adaptive gan. In CVPR, 2018.
  • [41] Swami Sankaranarayanan, Yogesh Balaji, Carlos D Castillo, and Rama Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In CVPR, 2018.
  • [42] Aliaksandr Siarohin, Enver Sangineto, and Nicu Sebe. Whitening and Coloring batch transform for GANs. In ICLR, 2019.
  • [43] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016.
  • [44] Baochen Sun and Kate Saenko. Deep coral: Correlation alignment for deep domain adaptation. In ECCV, 2016.
  • [45] Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In CVPR, 2011.
  • [46] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Adversarial discriminative domain adaptation. In CVPR, 2017.
  • [47] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In CVPR, 2017.
  • [48] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv:1412.3474, 2014.
  • [49] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022, 2016.
  • [50] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In CVPR, 2017.
  • [51] Ruijia Xu, Ziliang Chen, Wangmeng Zuo, Junjie Yan, and Liang Lin. Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In CVPR, 2018.
  • [52] Yi Yao and Gianfranco Doretto. Boosting for transfer learning with multiple sources. In CVPR, 2010.
  • [53] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In CVPR, 2017.

A Additional Multi-Source Results

Some sample translations produced by our G are shown in Figs. 3, 4 and 5. For example, in Fig. 3, when the SVHN digit “six” with side-digits is translated to MNIST-M, the cDWT blocks re-project it onto the MNIST-M domain (i.e., a single digit without side-digits) and the AdaIWT block applies the instance-specific style of the digit “three” (i.e., blue digit with red background) to yield a blue “six” with a red background. Similar trends are also observed in Fig. 4.

Figure 3: Generations of our G across the different domains of Digits-Five. The leftmost column shows the source images, one from each domain, and the topmost row shows the style images from the target domains, two from each domain.
Figure 4: Generations of our G across the different domains of Office-Caltech10. The leftmost column shows the source images, one from each domain, and the topmost row shows the style images from the target domains, two from each domain.
Figure 5: Generations of our G across the different domains of the Alps Seasons dataset. The leftmost column shows the source images, one from each domain, and the topmost row shows the style images from the target domains, two from each domain.

B Implementation details

In this section we provide the architectural details of the TriGAN generator G and of the discriminator D_P.

Instance Whitening Transform (IWT) blocks. As shown in Fig. 6 (a), each IWT block is a sequence of convolutional, IWT and non-linearity layers, characterized by two kernel sizes, k1 and k2. There are two IWT blocks in E, which use different values of k1 and k2.

(a) IWT block
(b) AdaIWT block
(c) Style Path
Figure 6: A schematic representation of the (a) IWT block; (b) AdaIWT block; and (c) Style Path.

Adaptive Instance Whitening (AdaIWT) blocks. The AdaIWT blocks are analogous to the IWT blocks, except that the IWT layer is replaced by an AdaIWT layer, and they are likewise characterized by two kernel sizes. An AdaIWT block also takes as input the coloring parameters (β^O, Γ^O) (see Sec. 3.2.3 and Fig. 6 (b)). Two AdaIWT blocks are used consecutively in D. The last AdaIWT block is followed by a final layer which produces the output image.

Style Path. The Style Path is composed of the two IWT blocks (sharing their parameters with the encoder) followed by the MLP F (Fig. 6 (c)). The outputs of the Style Path are the coloring parameters (β_2^O, Γ_2^O) and (β_1^O, Γ_1^O), which are input to the second and the first AdaIWT block, respectively (see Fig. 6 (b)). The MLP F is composed of five fully-connected layers with 256, 128, 128 and 256 neurons, with the last fully-connected layer having a number of neurons equal to the cardinality of the coloring parameters.

(a) DWT block
(b) cDWT block
Figure 7: Schematic representation of (a) DWT block; and (b) cDWT block.

Domain Whitening Transform (DWT) blocks. The schematic representation of a DWT block is shown in Fig. 7 (a). For the DWT blocks we adopt a residual-like structure [15], in which the normalization layers are DWT layers. We also add identity shortcuts to the DWT residual blocks to aid the training process.

Conditional Domain Whitening Transform (cDWT) blocks. The proposed cDWT blocks are schematically shown in Fig. 7 (b). Similarly to a DWT block, a cDWT block is a residual block in which the normalization layers are cDWT layers. Identity shortcuts are also used in the cDWT residual blocks.

All the above blocks are assembled to construct G, as shown in Fig. 8. Specifically, G contains two IWT blocks, one DWT block, one cDWT block and two AdaIWT blocks. It also contains the Style Path and two additional layers (one before the first IWT block and another after the last AdaIWT block), which are omitted in Fig. 8 for the sake of clarity. The coloring parameters {(β_1^O, Γ_1^O), (β_2^O, Γ_2^O)} are computed using the Style Path.

Figure 8: Schematic representation of the Generator block.
Figure 9: Schematic representation of the Discriminator block.

For the discriminator architecture we use a Projection Discriminator [32]. In D_P we use projection shortcuts instead of identity shortcuts. In Fig. 9 we schematically show a discriminator block. D_P is composed of 2 such blocks. We use spectral normalization [32] in D_P.

C Experiments for single-source UDA

Since our proposed TriGAN is a generic framework that can handle translations among an arbitrary number of domains, we also conduct experiments in the single-source UDA scenario, where N = 1 and the source domain is grayscale MNIST. We consider the following UDA settings on the digits datasets:

Methods          | MNIST → USPS | MNIST → MNIST-M | MNIST → SVHN
Source Only      | 78.9         | 63.6            | 26.0
DANN [10]        | 85.1         | 77.4            | 35.7
CoGAN [27]       | 91.2         | 62.0            | -
ADDA [46]        | 89.4         | -               | -
PixelDA [2]      | 95.9         | 98.2            | -
UNIT [26]        | 95.9         | -               | -
SBADA-GAN [40]   | 97.6         | 99.4            | 61.1
GenToAdapt [41]  | 92.5         | -               | 36.4
CyCADA [17]      | 94.8         | -               | -
I2I Adapt [34]   | 92.1         | -               | -
TriGAN (Ours)    | 98.0         | 95.7            | 66.3
Table 5: Classification Accuracy (%) of GAN-based methods on the Single-source UDA setting for Digits Recognition. The best number is in bold and the second best is underlined.

C.1 Datasets

MNIST → USPS. The MNIST dataset contains grayscale images of handwritten digits from 0 to 9. The pixel resolution of the MNIST digits is 28 × 28. USPS contains similar grayscale handwritten digits, except that the resolution is 16 × 16. We up-sample the images of both domains to 32 × 32 during training. For training TriGAN, 50000 MNIST and 7438 USPS samples are used. For evaluation we use the 1860 test samples of USPS.

MNIST → MNIST-M. MNIST-M is a coloured version of the grayscale MNIST digits. MNIST-M contains RGB images with a resolution of 28 × 28. For training TriGAN, all 50000 training samples of MNIST and MNIST-M are used, and the dedicated 10000 MNIST-M test samples are used for evaluation. Upsampling to 32 × 32 is also done during training.

MNIST → SVHN. SVHN (Street View House Numbers) contains real-world images of digits ranging from 0 to 9. The images in SVHN are RGB with a pixel resolution of 32 × 32. SVHN has non-centered digits with varying colour intensities. The presence of side-digits also makes adaptation to SVHN a hard task. For training TriGAN, 60000 MNIST and 73257 SVHN training samples are used. During evaluation, all 26032 SVHN test samples are utilized.

C.2 Comparison with GAN-based state-of-the-art methods

In this section we compare our proposed TriGAN with GAN-based state-of-the-art methods, including both adversarial-learning-based and reconstruction-based approaches. Tab. 5 reports the performance of TriGAN alongside the results obtained by the following baselines: Domain Adversarial Neural Network [10] (DANN), Coupled generative adversarial networks [27] (CoGAN), Adversarial discriminative domain adaptation [46] (ADDA), Pixel-level domain adaptation [2] (PixelDA), Unsupervised image-to-image translation networks [26] (UNIT), Symmetric bi-directional adaptive GAN [40] (SBADA-GAN), Generate to adapt [41] (GenToAdapt), Cycle-consistent adversarial domain adaptation [17] (CyCADA) and Image to image translation for domain adaptation [34] (I2I Adapt). As can be seen from Tab. 5, TriGAN does better in two out of the three adaptation settings. It is only worse in the MNIST → MNIST-M setting, where it is the third best. It is to be noted that TriGAN does significantly well in the MNIST → SVHN adaptation, which is generally considered a hard setting: TriGAN is 5.2% better than the second best method, SBADA-GAN.