1 Introduction
A well-known problem in computer vision is the need to adapt a classifier trained on a given source domain so that it works on a different, target domain. Since the two domains typically have different marginal feature distributions, the adaptation process needs to reduce the corresponding domain shift [45]. In many practical scenarios, the target data are not annotated and Unsupervised Domain Adaptation (UDA) methods are required.
While most previous adaptation approaches consider a single source domain, in real-world applications we may have access to multiple datasets. In this case, Multi-Source Domain Adaptation (MSDA) methods [52, 31, 51, 36] may be adopted, in which more than one source dataset is considered in order to make the adaptation process more robust. However, although more data can be used, MSDA is challenging because multiple domain-shift problems need to be solved simultaneously and coherently.
In this paper we deal with (unsupervised) MSDA using a data-augmentation approach based on a Generative Adversarial Network (GAN) [13]. Specifically, we generate artificial target samples by “translating” images from all the source domains into target-like images. The synthetically generated images are then used for training the target classifier. While this strategy has been recently adopted in the single-source UDA scenario [40, 17, 27, 34, 41], we are the first to show how it can be effectively used in an MSDA setting. In more detail, our goal is to build and train a “universal” translator which can transform an image from an input domain to a target domain. The translator network is “universal” because the number of parameters which need to be optimized should scale linearly with the number of domains. We achieve this goal using domain-invariant intermediate features, computed by the encoder part of our generator, and then projecting these features onto the domain-specific target distribution using the decoder.
To make this image translation effective, we assume that the appearance of an image depends on three factors: the content, the domain and the style. The domain models properties that are shared by the elements of a dataset but which may not be shared by other datasets. On the other hand, the style factor represents properties which are shared among different local parts of a single image and describes low-level features which concern a specific image (e.g., the color or the texture). The content is what we want to keep unchanged during the translation process: typically, it is the foreground object shape which is described by the image labels associated with the source data samples. Our encoder obtains the intermediate representations in a two-step process: we first generate style-invariant representations and then we compute the domain-invariant representations. Symmetrically, the decoder transforms the intermediate representations by first projecting these features onto a domain-specific distribution and then onto a style-specific distribution. In order to modify the underlying distribution of a set of features, inspired by [38], in the encoder we use whitening layers which progressively align the style-and-domain feature distributions. Then, in the decoder, we project the intermediate invariant representation onto a new domain-and-style specific distribution with Whitening and Coloring (WC) [42] batch transformations, according to the target data.
A “universal” translator similar in spirit to our proposed generator is StarGAN [5] (proposed for a non-UDA task). However, in StarGAN the domain information is represented by a one-hot vector concatenated with the input image. When we use StarGAN in our MSDA scenario, the synthesized images are much less effective for training the target classifier, which empirically shows that our batch-based transformation of the image distribution is more effective for our translation task.
Contributions. Our main contributions can be summarized as follows. (i) We propose the first generative MSDA method. We call our approach TriGAN because it is based on three different factors of the images: the style, the domain and the content. (ii) The proposed image translation process is based on style and domain specific statistics which are first removed from and then added to the source images by means of modified layers. Specifically, we use the following feature transformations (associated with a corresponding layer type): Instance Whitening Transform (IWT), Domain Whitening Transform (DWT) [38], conditional Domain Whitening Transform (cDWT) and Adaptive Instance Whitening Transform (AdaIWT). and are novel layers introduced in this paper. (iii) We test our method on two MSDA datasets, Digits-Five [51] and Office-Caltech10 [12], outperforming state-of-the-art methods.
2 Related Work
In this section we review the previous approaches to UDA, considering both single-source and multi-source methods. Since the proposed generator is also related to deep models used for image-to-image translation, we also analyse related work on this topic.
Single-source UDA. Single-source UDA approaches assume a single labeled source domain and can be broadly classified under three main categories, depending upon the strategy adopted to cope with the domain-shift problem. The first category uses first and second order statistics to model the source and the target feature distributions. For instance, [28, 29, 50, 48] minimize the Maximum Mean Discrepancy, i.e. the distance between the means of the feature distributions of the two domains. On the other hand, [44, 33, 37] achieve domain invariance by aligning the second-order statistics through correlation alignment. Differently, [3, 25, 30] reduce the domain shift by domain alignment layers derived from batch normalization (BN) [20]. This idea has been recently extended in [38], where grouped-feature whitening (DWT) is used instead of feature standardization as in BN. In our proposed encoder we also use the DWT layers, which we adapt to work in a generative network. In addition, we also propose other style and domain dependent batch-based normalizations (i.e., IWT, cDWT and AdaIWT).
The second category of methods computes domain-agnostic representations by means of an adversarial learning-based approach. For instance, discriminative domain-invariant representations are constructed through a gradient reversal layer in [9]. Similarly, the approach in [46] uses a domain confusion loss to promote the alignment between the source and the target domain.
The third category of methods uses adversarial learning in a generative framework (i.e., GANs [13]) to reconstruct artificial source and/or target images and perform domain adaptation. Notable approaches are SBADA-GAN [40], CyCADA [17], CoGAN [27], I2I Adapt [34] and Generate To Adapt (GTA) [41]. While these generative methods have been shown to be very successful in UDA, none of them deals with a multi-source setting. Note that trivially extending these approaches to an MSDA scenario involves training N different generators, N being the number of source domains. In contrast, in our universal translator only a subset of the parameters grows linearly with the number of domains (Sec. 3.2.3), while the others are shared over all the domains. Moreover, since we train our generator using translation directions, we can largely increase the number of training sample-domain pairs effectively used (Sec. 3.3).
Multi-source UDA. In [52], multiple-source knowledge transfer is obtained by borrowing knowledge from the target k nearest-neighbour sources. Similarly, a distribution-weighted combining rule is proposed in [31] to construct a target hypothesis as a weighted combination of source hypotheses. Recently, Deep Cocktail Network (DCTN) [51] uses the distribution-weighted combining rule in an adversarial setting. A Moment Matching Network (M3SDA) is introduced in [36] to reduce the discrepancy between the multiple source and the target domains. Differently from these methods, which operate in a discriminative setting, our method relies on a deep generative approach for MSDA.
Image-to-image Translation. Image-to-image translation approaches, i.e. those methods which learn how to transform an image from one domain to another, possibly keeping its semantics, are the basis of our method. In [21] the pix2pix network translates images under the assumption that paired images in the two domains are available at training time. In contrast, CycleGAN [53] can learn to translate images using unpaired training samples. Note that, by design, these methods work with two domains. ComboGAN [1] partially alleviates this issue by using N generators for translations among N domains. Our work is also related to StarGAN [5], which handles unpaired image translation amongst N domains (N ≥ 2) through a single generator. However, StarGAN achieves image translation without explicitly forcing the image representations to be domain invariant, and this may lead to a significant reduction of the network representation power as the number of domains increases. On the other hand, our goal is to obtain an explicit, intermediate image representation which is style-and-domain independent. We use IWT and DWT to achieve this. We also show that this invariant representation can simplify the re-projection process onto a desired style and target domain. This is achieved through cDWT and AdaIWT, which results in very realistic translations amongst domains. Very recently, a whitening and colouring based image-to-image translation method was proposed in [4], where the whitening operation is weight-based: the transformation is embedded into the network weights. Specifically, whitening is approximated by enforcing the covariance matrix, computed using the intermediate features, to be equal to the identity matrix. Conversely, our whitening transformation is data dependent (i.e., it depends on the specific batch statistics, Sec. 3.2.1) and uses the Cholesky decomposition [6] to compute the whitening matrices of the input samples in a closed form, thereby eliminating the need for additional ad-hoc losses.
3 Style-and-Domain based Image Translation
In this section we describe the proposed approach for MSDA. We first provide an overview of our method and we introduce the notation adopted throughout the paper (Sec. 3.1). Then we describe the TriGAN architecture (Sec. 3.2) and our training procedure (Sec. 3.3).
3.1 Notation and Overview
In the MSDA scenario we have access to labeled source datasets , where , and a target unlabeled dataset . All the datasets (target included) share the same categories and each of them is associated to a domain , respectively. Our final goal is to build a classifier for the target domain exploiting the data in .
Our method is based on two separate training stages. We initially train a generator which learns how to change the appearance of a real input image in order to adhere to a desired domain and style. Importantly, our learns mappings between every possible pair of image domains. Learning translations makes it possible to exploit much more supervisory information than a plain strategy in which different source-to-target generators are trained (Sec. 3.3). Once is trained, in the second stage we use it to generate target data having the same content as the source data, thus creating a new, labeled, target dataset, which is finally used to train a target classifier . However, in training (first stage), we do not use class labels and is treated in the same way as the other datasets.
As mentioned in Sec. 1, is composed of an encoder and a decoder (Fig. 1). The role of is to “whiten”, i.e., to remove, both domain-specific and style-specific aspects of the input image features in order to obtain domain and style invariant representations. Symmetrically, “colors” the domain-and-style invariant features generated by , by progressively projecting these intermediate representations onto a domain-and-style specific space.
In the first training stage, takes as input a batch of images with corresponding domain labels , where belongs to the domain and . Moreover, takes as input a batch of output domain labels , and a batch of reference style images , such that has domain label . For a given , the task of is to transform into such that: (1) and share the same content but (2) belongs to domain and has the same style of .
3.2 TriGAN Architecture
The TriGAN architecture is composed of a generator network and a discriminator network . As mentioned above, comprises an encoder and decoder , which we describe in Sec. 3.2.2-3.2.3. The discriminator is based on the Projection Discriminator ([32]). Before describing the details of , we briefly review the WC transform ([42]) (Sec. 3.2.1), which is used as the basic operation in our proposed batch-based feature transformations.
3.2.1 Preliminaries: Whitening Coloring Transform
Let $F(x) \in \mathbb{R}^{h \times w \times d}$ be the tensor representing the activation values of the convolutional feature maps in a given layer corresponding to the input image $x$, with $d$ channels and $h \times w$ spatial locations. We treat each spatial location as a $d$-dimensional vector; in this way each image $x_i$ contains a set of vectors $X_i = \{v_1, \dots, v_{h \times w}\}$. With a slight abuse of notation, we use $B = \bigcup_{i=1}^{m} X_i = \{v_1, \dots, v_{h \times w \times m}\}$, which includes all the spatial locations in all the $m$ images in a batch. The WC transform is a multivariate extension of the per-dimension normalization and shift-scaling transform (BN) proposed in ([20]) and widely adopted in both generative and discriminative networks. WC can be described by:
$$WC(v_j; B, \beta, \Gamma) = \mathrm{Coloring}(\hat{v}_j; \beta, \Gamma) = \Gamma \hat{v}_j + \beta \qquad (1)$$
where:
$$\hat{v}_j = \mathrm{Whitening}(v_j; B) = W_B (v_j - \mu_B) \qquad (2)$$
In Eq. 2, $\mu_B$ is the centroid of the elements in $B$, while $W_B$ is such that $W_B^\top W_B = \Sigma_B^{-1}$, where $\Sigma_B$ is the covariance matrix computed using $B$. The result of applying Eq. 2 to the elements of $B$ is a set of whitened features $\hat{B} = \{\hat{v}_1, \dots, \hat{v}_{h \times w \times m}\}$, which lie in a spherical distribution (i.e., with a covariance matrix equal to the identity matrix). On the other hand, Eq. 1 performs a coloring transform, i.e. it projects the elements in $\hat{B}$ onto a learned multivariate Gaussian distribution. While $\mu_B$ and $W_B$ are computed using the elements in $B$ (they are data-dependent), Eq. 1 depends on the $d$-dimensional learned parameter vector $\beta$ and the $d \times d$-dimensional learned parameter matrix $\Gamma$. Eq. 1 is a linear operation and can be simply implemented using a convolutional layer with kernel size $1 \times 1 \times d$.
In this paper we use the WC transform in our encoder and decoder, in order to first obtain a style-and-domain invariant representation for each input image, and then transform this representation according to the desired output domain and reference style image. The next subsections show the details of the proposed architecture.
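The whitening and coloring steps described above can be illustrated with a small NumPy sketch. It is only a minimal illustration under stated assumptions, not the implementation used in the paper: the function names, the `eps` regulariser added to the covariance and the use of a plain matrix product in place of a 1 × 1 convolution are all hypothetical choices.

```python
import numpy as np

def whiten(feats, eps=1e-5):
    """Whiten an (n, d) set of feature vectors so that their covariance is ~identity.

    Returns the whitened features, the centroid mu and the whitening matrix W,
    where W.T @ W equals the inverse of the (regularised) covariance.
    """
    mu = feats.mean(axis=0)
    centered = feats - mu
    d = feats.shape[1]
    # eps * I is an assumed regulariser that keeps the covariance invertible.
    cov = centered.T @ centered / feats.shape[0] + eps * np.eye(d)
    # Cholesky gives a lower-triangular L with L @ L.T = cov^{-1}; W = L.T.
    W = np.linalg.cholesky(np.linalg.inv(cov)).T
    return centered @ W.T, mu, W

def color(whitened, gamma, beta):
    """Coloring: a linear map with a (d, d) matrix gamma and a d-vector beta."""
    return whitened @ gamma.T + beta

def wc_transform(feats, gamma, beta):
    """Whitening followed by coloring with (in a real network, learned) parameters."""
    whitened, _, _ = whiten(feats)
    return color(whitened, gamma, beta)
```

With `gamma` set to the identity and `beta` to zero, `wc_transform` reduces to pure whitening; in a network these two parameters would be trained together with the other weights.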
3.2.2 Encoder
The encoder is composed of a sequence of standard    blocks and some (more details in the Supplementary Material), in which we replace the common BN layers ([20]) with our proposed normalization modules, which are detailed below.
Obtaining Style Invariant Representations. In the first two blocks of we whiten first and second-order statistics of the low-level features of each , which are mainly responsible for the style of an image ([11]). To do so, we propose the Instance Whitening Transform (IWT), where the term instance is inspired by Instance Normalization (IN) ([49]) and highlights that the proposed transform is applied to a set of features extracted from a single image . For each , IWT is defined as:
(3) 
Note that in Eq. 3 we use as the batch, where contains only features of a specific image (Sec. 3.2.1). Moreover, each is extracted from the first two convolutional layers of , thus has a small receptive field. This implies that whitening is performed using an image-specific feature centroid and covariance matrix , which represent the first and second-order statistics of the low-level features of . On the other hand, coloring is based on the parameters and , which do not depend on or . The coloring operation is the analogue of the per-dimension shift-scaling transform computed in BN just after feature standardization ([20]) and is necessary to avoid decreasing the network representation capacity ([42]).
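To make the difference between batch-level and image-level statistics concrete, the following sketch applies whitening separately to the spatial locations of each image and then colors all images with the same shared parameters. It is an illustrative sketch only: the `(m, h, w, d)` tensor layout, the helper `_whiten` and the `eps` regulariser are assumptions introduced here.

```python
import numpy as np

def _whiten(vectors, eps=1e-5):
    """Whiten an (n, d) set of vectors using its own centroid and covariance."""
    centered = vectors - vectors.mean(axis=0)
    d = vectors.shape[1]
    cov = centered.T @ centered / len(vectors) + eps * np.eye(d)
    W = np.linalg.cholesky(np.linalg.inv(cov)).T  # W.T @ W = cov^{-1}
    return centered @ W.T

def instance_whitening(feature_maps, gamma, beta):
    """Per-image whitening of an (m, h, w, d) batch, followed by shared coloring."""
    m, h, w, d = feature_maps.shape
    out = np.empty_like(feature_maps)
    for i in range(m):
        # The "batch" is the set of spatial locations of a single image.
        vectors = feature_maps[i].reshape(h * w, d)
        colored = _whiten(vectors) @ gamma.T + beta  # gamma, beta are image-independent
        out[i] = colored.reshape(h, w, d)
    return out
```

Because the statistics are recomputed for every image, two images with very different color distributions end up with the same first and second-order statistics after this layer.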
Obtaining Domain Invariant Representations. In the subsequent blocks of we whiten first and second-order statistics which are domain specific. For this operation we adopt the Domain Whitening Transform (DWT) proposed in ([38]). Specifically, for each , let be its domain label (see Sec. 3.1) and let be the subset of features which have been extracted from all those images in which share the same domain label. Then, for each :
(4) 
Similarly to Eq. 3, Eq. 4 performs whitening using a subset of the current feature batch. Specifically, all the features in are partitioned depending on the domain label of the image they have been extracted from, so obtaining , etc., where all the features in belong to the images of the domain . Then, is used to compute domain-dependent first and second order statistics (). These statistics are used to project each onto a domain-invariant spherical distribution. A similar idea was recently proposed in ([38]) in a discriminative network for single-source UDA. However, differently from ([38]), we also use coloring by re-projecting the whitened features onto a new space governed by a learned multivariate distribution. This is done using the (layer-specific) parameters and which do not depend on .
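The partition-then-whiten procedure described above can be sketched as follows. This is a hedged illustration rather than the layer used in the paper: the flat `(n, d)` feature layout, the per-vector `domain_labels` array and the helper `_whiten` are assumptions made for the example.

```python
import numpy as np

def _whiten(vectors, eps=1e-5):
    """Whiten an (n, d) set of vectors using its own centroid and covariance."""
    centered = vectors - vectors.mean(axis=0)
    d = vectors.shape[1]
    cov = centered.T @ centered / len(vectors) + eps * np.eye(d)
    W = np.linalg.cholesky(np.linalg.inv(cov)).T  # W.T @ W = cov^{-1}
    return centered @ W.T

def domain_whitening(vectors, domain_labels, gamma, beta):
    """Whiten each domain's features with that domain's statistics, then color.

    vectors:       (n, d) features pooled over all images in the batch
    domain_labels: (n,) domain label of the image each vector was extracted from
    gamma, beta:   coloring parameters shared by all domains
    """
    out = np.empty_like(vectors)
    for label in np.unique(domain_labels):
        mask = domain_labels == label
        out[mask] = _whiten(vectors[mask]) @ gamma.T + beta
    return out
```

After this step, features coming from different domains share the same mean and covariance, which is what makes the representation domain invariant in the sense used here.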
3.2.3 Decoder
Our decoder is functionally and structurally symmetric with respect to : it takes as input the domain and style invariant features computed by and projects these features onto the desired domain with the style extracted from the reference image .
Similarly to , is a sequence of and a few    blocks (more details in the Supplementary Material). Similarly to Sec. 3.2.2, in the layers we replace with our proposed feature normalization approaches, which are detailed below.
Projecting Features onto a Domain-specific Distribution. Apart from the last two blocks of (see below), all the other blocks are dedicated to projecting the current set of features onto a domain-specific subspace. This subspace is learned from data using domain-specific coloring parameters , where is the label of the corresponding domain. To this purpose we introduce the conditional Domain Whitening Transform (cDWT), where the term “conditional” specifies that the coloring step is conditioned on the domain label . In more detail, similarly to Eq. 4, we first partition into , etc. However, the membership of to is decided taking into account the desired output domain label for each image rather than its original domain as in the case of Eq. 4. Specifically, if and the output domain label of is , then is included in . Once has been partitioned, we define cDWT as follows:
(5) 
Note that, after whitening, and differently from Eq. 4, coloring in Eq. 5 is performed using domain-specific parameters .
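The difference with respect to the encoder-side transform can be sketched in a few lines: features are grouped by the desired output domain, and each group is colored with its own parameters. The sketch below is illustrative only; the array layout, the indexing of `gammas` and `betas` by integer domain label, and the helper `_whiten` are assumptions made for the example.

```python
import numpy as np

def _whiten(vectors, eps=1e-5):
    """Whiten an (n, d) set of vectors using its own centroid and covariance."""
    centered = vectors - vectors.mean(axis=0)
    d = vectors.shape[1]
    cov = centered.T @ centered / len(vectors) + eps * np.eye(d)
    W = np.linalg.cholesky(np.linalg.inv(cov)).T  # W.T @ W = cov^{-1}
    return centered @ W.T

def conditional_domain_whitening(vectors, output_labels, gammas, betas):
    """Group features by output domain, whiten each group, color per domain.

    vectors:       (n, d) features pooled over the batch
    output_labels: (n,) desired output domain of the image each vector belongs to
    gammas:        (num_domains, d, d) domain-specific coloring matrices
    betas:         (num_domains, d) domain-specific coloring offsets
    """
    out = np.empty_like(vectors)
    for label in np.unique(output_labels):
        mask = output_labels == label
        out[mask] = _whiten(vectors[mask]) @ gammas[label].T + betas[label]
    return out
```

Only `gammas` and `betas` carry one slice per domain, which is consistent with the statement that a subset of the parameters grows linearly with the number of domains.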
Applying a Specific Style. In order to apply a specific style to , we first extract the output style from the reference image associated with (Sec. 3.1). This is done using the Style Path (see Fig. 1), which consists of two    blocks (which share the parameters with the first two layers of the encoder) and a Multi-Layer Perceptron (MLP) . Following ([11]) we represent a style using the first and the second order statistics , which are extracted using the blocks (Sec. 3.2.2). Then we use to adapt these statistics to the domain-specific representation obtained as the output of the previous step. In fact, in principle, for each , the operation inside the transform could be “inverted” using:
(6) 
Indeed, the coloring operation (Eq. 1) is the inverse of whitening (Eq. 2). However, the elements of now lie in a feature space different from the output space of Eq. 3, thus the transformation defined by the Style Path needs to be adapted. For this reason, we use an MLP () which implements this adaptation:
(7) 
Note that, in Eq. 7, is the (concatenated) input and is the MLP output, one input-output pair per image .
Once have been generated, we use them as the coloring parameters of our Adaptive IWT (AdaIWT):
(8) 
Eq. 8 imposes style-specific first and second order statistics on the features of the last blocks of in order to mimic the style of .
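The style path described in this subsection can be sketched as three steps: extract first and second order statistics from the reference image features, adapt them with an MLP, and use the result to color the whitened content features. The code below is a rough sketch under several assumptions: the two-layer ReLU MLP, its weight shapes, the function names and the `eps` regulariser are hypothetical and are not taken from the paper.

```python
import numpy as np

def _whitening_matrix(vectors, eps=1e-5):
    """Return the centroid and a matrix W with W.T @ W = cov^{-1} for (n, d) vectors."""
    mu = vectors.mean(axis=0)
    centered = vectors - mu
    d = vectors.shape[1]
    cov = centered.T @ centered / len(vectors) + eps * np.eye(d)
    return mu, np.linalg.cholesky(np.linalg.inv(cov)).T

def style_statistics(style_vectors):
    """Style of one image: its centroid and the inverse of its whitening matrix."""
    mu, W = _whitening_matrix(style_vectors)
    return mu, np.linalg.inv(W)

def mlp_adapt(mu, W_inv, w1, w2):
    """Hypothetical two-layer MLP mapping [mu || W_inv] to coloring parameters."""
    d = mu.shape[0]
    x = np.concatenate([mu, W_inv.ravel()])
    hidden = np.maximum(0.0, w1 @ x)  # ReLU hidden layer
    y = w2 @ hidden
    return y[:d], y[d:].reshape(d, d)  # (beta, gamma)

def adaptive_instance_whitening(content_vectors, beta, gamma):
    """Whiten one image's features, then color them with image-specific parameters."""
    mu, W = _whitening_matrix(content_vectors)
    return ((content_vectors - mu) @ W.T) @ gamma.T + beta
```

Passing the raw statistics from `style_statistics` straight into `adaptive_instance_whitening`, without `mlp_adapt`, makes the output features reproduce the mean and covariance of the reference image; the MLP exists because the decoder features live in a different space from the one in which those statistics were measured.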
3.3 Network Training
GAN Training. For the sake of clarity, in the rest of the paper we use a simplified notation for , in which takes as input only one image instead of a batch. Specifically, let be the generated image, starting from () and with desired output domain and style image . is trained using the combination of three different losses, with the goal of changing the style and the domain of while preserving its content.
First, we use an adversarial loss based on the Projection Discriminator ([32]) (), which is conditioned on labels (domain labels, in our case) and uses a hinge loss:
(9) 
(10) 
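For reference, the hinge formulation commonly used for adversarial training can be sketched as below. This is the standard hinge GAN objective, given here as an assumption about the general form of the loss; the function names are hypothetical, and the scores are taken to be the raw (conditional) discriminator outputs for real and generated images.

```python
import numpy as np

def discriminator_hinge_loss(real_scores, fake_scores):
    """Standard hinge loss for the discriminator.

    Real samples are pushed to score at least +1, generated samples at most -1.
    """
    real_term = np.mean(np.maximum(0.0, 1.0 - real_scores))
    fake_term = np.mean(np.maximum(0.0, 1.0 + fake_scores))
    return float(real_term + fake_term)

def generator_hinge_loss(fake_scores):
    """Standard generator loss paired with the hinge discriminator loss."""
    return float(-np.mean(fake_scores))
```

In a conditional setting such as the one described here, the scores would additionally depend on the domain label supplied to the discriminator.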
The second loss is the Identity loss proposed in ([53]), which in our framework is implemented as follows:
(11) 
In Eq. 11, computes an identity transformation, since the input and the output domain and style are the same. After that, a pixel-to-pixel norm is computed.
Finally, we propose to use a third loss which is based on the rationale that the generation process should be equivariant with respect to a set of simple transformations which preserve the main content of the images (e.g., the foreground object shape). Specifically, we use the set of the affine transformations of image which are defined by the parameter ( is a 2D transformation matrix). The affine transformation is implemented by a differentiable bilinear kernel as in ([22]). The Equivariance loss is:
(12) 
In Eq. 12, for a given image , we randomly choose a geometric parameter and we apply to . Then, using the same , we apply to and we get , which is input to in order to generate a second image. The two generated images are finally compared using the norm. This is a form of self-supervision, in which equivariance to geometric transformations is used to extract semantics. Very recently a similar loss has been proposed in ([19]), where equivariance to affine transformations is used for image co-segmentation.
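The idea of comparing "transform after generation" with "generation after transform" can be sketched as follows. This is a toy illustration with several assumptions: a 90-degree rotation stands in for the random affine transformation, the mean absolute difference stands in for the norm, the same reference style image is passed to both generator calls, and `generator` is any callable taking an image and a style image.

```python
import numpy as np

def rotate90(image):
    """A fixed 90-degree rotation of an (h, w, c) image, used as a stand-in transform."""
    return np.rot90(image, k=1, axes=(0, 1))

def equivariance_loss(generator, image, style_image, transform):
    """Mean absolute difference between T(G(x)) and G(T(x)) for the same T."""
    transformed_output = transform(generator(image, style_image))
    output_of_transformed = generator(transform(image), style_image)
    return float(np.mean(np.abs(transformed_output - output_of_transformed)))
```

A generator that acts on each pixel independently gives a loss of zero under this sketch, while a generator whose output depends on absolute pixel position is penalised, which is the behaviour the loss is meant to encourage.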
The complete loss for is:
(13) 
Note that Eq. 9, 10 and 12 depend on the pair . This means that the supervisory information we effectively use grows with , which is quadratic with respect to a plain strategy in which different source-to-target generators are trained (Sec. 2).
Classifier Training. Once is trained, we use it to artificially create a labeled training dataset () for the target domain. Specifically, for each and each , we randomly pick , which is used as the reference style image, and we generate: , where is fixed and indicates the target domain () label (see Sec. 3.1). is added to and the process is iterated. Note that, in different epochs, for the same , we randomly select a different reference style image .
Finally, we train a classifier on using the cross-entropy loss:
(14) 
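The dataset-creation loop described above can be sketched as follows. It is a schematic example only: `generator` is a placeholder callable taking an image, the target domain label and a reference style image, and all names in the snippet are hypothetical.

```python
import random

def build_translated_dataset(generator, source_datasets, target_images,
                             target_label, seed=0):
    """Translate every labeled source image into the target domain.

    source_datasets: list of datasets, each a list of (image, class_label) pairs
    target_images:   unlabeled target images, used only as style references
    Returns a list of (translated_image, class_label) pairs.
    """
    rng = random.Random(seed)
    translated = []
    for dataset in source_datasets:
        for image, class_label in dataset:
            # A new reference style image is drawn at random for every sample.
            style_image = rng.choice(target_images)
            fake_target = generator(image, target_label, style_image)
            # The class label of the source image is kept unchanged.
            translated.append((fake_target, class_label))
    return translated
```

Calling the function again with a different `seed` in each epoch would reproduce the behaviour of pairing the same source image with different reference style images over time.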
4 Experimental Results
In this section we describe the experimental setup and then we evaluate our approach using common MSDA datasets. We also present an ablation study in which we separately analyse the impact of each TriGAN component.
4.1 Datasets
In our experiments we consider two common domain adaptation benchmarks, namely the Digits-Five benchmark [51] and the Office-Caltech dataset [12].
Digits-Five [51] is composed of five digit-recognition datasets: USPS [8], MNIST [24], MNIST-M [9], SVHN [35] and Synthetic numbers datasets [10] (SYNDIGITS). SVHN [35] contains Google Street View images of real-world house numbers. Synthetic numbers [10] includes 500K computer-generated digits with different sources of variations (i.e. position, orientation, color, blur). USPS [8] is a dataset of digits scanned from U.S. envelopes, MNIST [24] is a popular benchmark for digit recognition and MNIST-M [9] is its colored counterpart. We adopt the experimental protocol described in [51]: in each domain the train/test split is composed of a subset of 25000 images for training and 9000 images for testing. For USPS, the entire dataset is used.
Office-Caltech [12] is a domain-adaptation benchmark, obtained by selecting the subset of those categories which are shared between Office-31 and Caltech-256 [14]. It contains images, about half of which belong to Caltech-256. There are four different domains: Amazon (A), DSLR (D), Webcam (W) and Caltech-256 (C).
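Table 1: Classification accuracy (%) on Digits-Five.

| Standards | Models | mt,up,sv,sy → mm | mm,up,sv,sy → mt | mt,mm,sv,sy → up | mt,mm,up,sy → sv | mt,mm,up,sv → sy | Avg |
|---|---|---|---|---|---|---|---|
| Source Combine | Source Only | 63.70±0.83 | 92.30±0.91 | 90.71±0.54 | 71.51±0.75 | 83.44±0.79 | 80.33±0.76 |
| Source Combine | DAN [28] | 67.87±0.75 | 97.50±0.62 | 93.49±0.85 | 67.80±0.84 | 86.93±0.93 | 82.72±0.79 |
| Source Combine | DANN [9] | 70.81±0.94 | 97.90±0.83 | 93.47±0.79 | 68.50±0.85 | 87.37±0.68 | 83.61±0.82 |
| Multi-Source | Source Only | 63.37±0.74 | 90.50±0.83 | 88.71±0.89 | 63.54±0.93 | 82.44±0.65 | 77.71±0.81 |
| Multi-Source | DAN [28] | 63.78±0.71 | 96.31±0.54 | 94.24±0.87 | 62.45±0.72 | 85.43±0.77 | 80.44±0.72 |
| Multi-Source | CORAL [43] | 62.53±0.69 | 97.21±0.83 | 93.45±0.82 | 64.40±0.72 | 82.77±0.69 | 80.07±0.75 |
| Multi-Source | DANN [9] | 71.30±0.56 | 97.60±0.75 | 92.33±0.85 | 63.48±0.79 | 85.34±0.84 | 82.01±0.76 |
| Multi-Source | ADDA [47] | 71.57±0.52 | 97.89±0.84 | 92.83±0.74 | 75.48±0.48 | 86.45±0.62 | 84.84±0.64 |
| Multi-Source | DCTN [51] | 70.53±1.24 | 96.23±0.82 | 92.81±0.27 | 77.61±0.41 | 86.77±0.78 | 84.79±0.72 |
| Multi-Source | M3SDA [36] | 72.82±1.13 | 98.43±0.68 | 96.14±0.81 | 81.32±0.86 | 89.58±0.56 | 87.65±0.75 |
| Multi-Source | StarGAN [5] | 44.71±1.39 | 96.26±0.62 | 55.32±3.71 | 58.93±1.95 | 63.36±2.41 | 63.71±2.01 |
| Multi-Source | TriGAN (Ours) | 83.20±0.78 | 97.20±0.45 | 94.08±0.92 | 85.66±0.79 | 90.30±0.57 | 90.08±0.70 |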
4.2 Experimental Setup
For lack of space, we provide the architectural details of our generator and discriminator networks in the Supplementary Material. We train TriGAN for 100 epochs using the Adam optimizer [23] with the learning rate set to 1e-4 for the generator and 4e-4 for the discriminator as in [16]. The loss weighting factor in Eq. 13 is set to 10 as in [53].
In the Digits-Five experiments we use a mini-batch of size 256 for TriGAN training. Due to the difference in image resolution and image channels, the images of all the domains are converted to 32 × 32 RGB. For a fair comparison, for the final target classifier we use exactly the same network architecture used in [10, 36].
In the Office-Caltech10 experiments we downsample the images to 164 × 164 to accommodate more samples in a mini-batch. We use a mini-batch of size 24 for training with 1 GPU. For the backbone target classifier we use the ResNet-101 [15] architecture used by [36]. The weights are initialized with a network pre-trained on the ILSVRC-2012 dataset [39]. In our experiments we remove the output layer and we replace it with a randomly initialized fully-connected layer with 10 logits, one for each class of the Office-Caltech10 dataset. is trained with Adam with an initial learning rate of 1e-5 for the randomly initialized last layer and 1e-6 for all other layers. In this setting we also include in for training the classifier .
4.3 Results
In this section we quantitatively analyse TriGAN. In the Supplementary Material we show some qualitative results for Digits-Five and Office-Caltech10.
4.3.1 Comparison with StateoftheArt Methods
Tab. 1 and Tab. 2 show the results on the Digits-Five and the Office-Caltech10 datasets, respectively. Table 1 shows that TriGAN achieves an average accuracy of 90.08%, which is higher than all other methods. M3SDA [36] is better in the mm, up, sv, sy → mt and in the mt, mm, sv, sy → up settings, where TriGAN is the second best. In all the other settings, TriGAN outperforms all the other approaches. As an example, in the mt, up, sv, sy → mm setting, TriGAN is better than the second best method by a significant margin of 10.38%. In the same table we also show the results obtained when we replace TriGAN with StarGAN [5], which is another “universal” image translator. Specifically, we use StarGAN to generate synthetic target images and then we train the target classifier using the same protocol described in Sec. 3.3. The corresponding results in Table 1 show that StarGAN, despite being known to work well for aligned face translation, drastically fails when used in this UDA scenario.
Finally, we also use Office-Caltech10, which is considered to be difficult for reconstruction-based GAN methods because of the high-resolution images. Although the dataset is quite saturated, TriGAN achieves a classification accuracy of 97.0%, outperforming all the other methods and beating the previous state-of-the-art approach (M3SDA [36]) by a margin of 0.6% on average (see Tab. 2).










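Table 2: Classification accuracy (%) on Office-Caltech10.

| Standards | Models | → W | → D | → C | → A | Avg |
|---|---|---|---|---|---|---|
| Source Combine | Source only | 99.0 | 98.3 | 87.8 | 86.1 | 92.8 |
| Source Combine | DAN [28] | 99.3 | 98.2 | 89.7 | 94.8 | 95.5 |
| Multi-Source | Source only | 99.1 | 98.2 | 85.4 | 88.7 | 92.9 |
| Multi-Source | DAN [28] | 99.5 | 99.1 | 89.2 | 91.6 | 94.8 |
| Multi-Source | DCTN [51] | 99.4 | 99.0 | 90.2 | 92.7 | 95.3 |
| Multi-Source | M3SDA [36] | 99.5 | 99.2 | 92.2 | 94.5 | 96.4 |
| Multi-Source | StarGAN [5] | 99.6 | 100.0 | 89.3 | 93.3 | 95.5 |
| Multi-Source | TriGAN (Ours) | 99.7 | 100.0 | 93.0 | 95.2 | 97.0 |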
4.3.2 Ablation Study
In this section we analyse the different components of our method and study in isolation their impact on the final accuracy. Specifically, we use the Digits-Five dataset and the following models: i) Model A, which is our full model containing the following components: IWT, DWT, cDWT, AdaIWT and the Equivariance loss. ii) Model B, which is similar to Model A except that we replace the Equivariance loss with the cycle-consistency loss of CycleGAN [53]. iii) Model C, where we replace IWT, DWT, cDWT and AdaIWT of Model A with IN [49], BN [20], conditional Batch Normalization (cBN) [7] and Adaptive Instance Normalization (AdaIN) [18]. This comparison highlights the difference between feature whitening and feature standardisation. iv) Model D, which ignores the style factor. Specifically, in Model D, the blocks related to the style factor, i.e., the IWT and the AdaIWT blocks, are replaced by DWT and cDWT blocks, respectively. v) Model E, in which the style path differs from Model A in the way the style is applied to the domain-specific representation. Specifically, we remove the MLP and we directly apply (). vi) Finally, Model F represents the no-domain assumption (i.e., the DWT and cDWT blocks are replaced with standard WC blocks).
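Table 3: Ablation study on Digits-Five.

| Model | Description | Avg. accuracy (%) (difference w.r.t. Model A) |
|---|---|---|
| A | TriGAN (full method) | 90.08 |
| B | Equivariance loss replaced with the cycle-consistency loss | 88.38 (-1.70) |
| C | Whitening replaced with feature standardisation | 89.39 (-0.68) |
| D | No Style Assumption | 88.32 (-1.76) |
| E | Style applied directly, without the MLP | 88.36 (-1.71) |
| F | No Domain Assumption | 89.10 (-0.98) |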
Tab. 3 shows that Model A outperforms all the ablated models. Model B shows that the cycle-consistency loss is detrimental for the accuracy because the generator may focus on meaningless information to reconstruct back the image. Conversely, the affine transformations used in the case of the Equivariance loss force the generator to focus on the shape (i.e., the content) of the images. Model C is also outperformed by Model A, demonstrating the importance of feature whitening over feature standardisation and corroborating the findings of [38] in a purely discriminative setting. Moreover, the no-style assumption in Model D hurts the classification accuracy by a margin of 1.76% when compared with Model A. We believe this is due to the fact that, when only domain-specific latent factors are modeled but instance-specific style information is missing in the image translation process, the diversity of the translations decreases, consequently reducing the final accuracy (see the role of the randomly picked reference style image in Sec. 3.3). Model E shows the need for the proposed style path. Finally, Model F shows that having a separate factor for the domain yields a better performance. Note that the ablation analysis in Tab. 3 is done by removing a single component from the full Model A, and the marginal difference with Model A shows that all the components are important. On the other hand, simultaneously removing all the components makes our model become similar to StarGAN, where there is no style information and where the domain information is not “whitened” but provided as input to the network. As shown in Table 1, our full model drastically outperforms a StarGAN-based generative MSDA approach.
4.3.3 Multi-domain image-to-image translation
Our proposed generator can be used for a pure generative (non-UDA), multi-domain image-to-image translation task. We conduct experiments on the Alps Seasons dataset [1], which consists of images of Alps mountains with 4 different domains (corresponding to 4 seasons). Fig. 2 shows some images generated using our generator. For this experiment we compare our generator with StarGAN [5] using the FID [16] metric. FID measures the realism of the generated images (the lower the better). The FID scores are computed considering all the real samples in the target domain and generating an equivalent number of synthetic images in the target domain. Tab. 4 shows that the TriGAN FID scores are significantly lower than the StarGAN scores. This further highlights that decoupling the style and the domain and using WC-based layers to progressively “whiten” and “color” the image statistics yields more realistic cross-domain image translations than using domain labels as input, as in the case of StarGAN.
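For readers unfamiliar with the metric, the Fréchet distance between two sets of features, modelled as Gaussians, can be sketched as below. This is a generic illustration and not the evaluation code used for the experiments: it assumes the features (in FID, activations of a pre-trained Inception network) have already been extracted, and the function name is hypothetical.

```python
import numpy as np

def frechet_distance(feats_real, feats_fake):
    """Fréchet distance between Gaussians fitted to two (n, d) feature sets.

    d^2 = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^{1/2})
    """
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    # The trace of the matrix square root of C_r C_f equals the sum of the
    # square roots of its eigenvalues, which are real and non-negative here.
    eigvals = np.linalg.eigvals(cov_r @ cov_f)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_r - mu_f) ** 2)
    return float(mean_term + np.trace(cov_r) + np.trace(cov_f) - 2.0 * trace_sqrt)
```

Identical feature sets give a distance of (numerically) zero, and the value grows as the means or covariances of the two sets drift apart, which is why lower scores indicate more realistic samples.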
5 Conclusions
In this work we proposed TriGAN, an MSDA framework based on data generation from multiple source domains using a single generator. The underlying principle of our approach is to obtain intermediate, domain- and style-invariant representations in order to simplify the generation process. Specifically, our generator progressively removes style- and domain-specific statistics from the source images and then re-projects the intermediate features onto the desired target domain and style. We obtained state-of-the-art results on two MSDA datasets, showing the potential of our approach.
References
 [1] Asha Anoosheh, Eirikur Agustsson, Radu Timofte, and Luc Van Gool. Combogan: Unrestrained scalability for image domain translation. In CVPR, 2018.
 [2] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR, 2017.
 [3] Fabio Maria Carlucci, Lorenzo Porzi, Barbara Caputo, Elisa Ricci, and Samuel Rota Bulò. Autodial: Automatic domain alignment layers. In ICCV, 2017.
 [4] Wonwoong Cho, Sungha Choi, David Keetae Park, Inkyu Shin, and Jaegul Choo. Image-to-image translation via group-wise deep whitening-and-coloring transformation. In CVPR, 2019.
 [5] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
 [6] Dariusz Dereniowski and Marek Kubale. Cholesky factorization of matrices in parallel and ranking of graphs. In International Conference on Parallel Processing and Applied Mathematics, 2003.
 [7] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. In ICLR, 2016.
 [8] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001.
 [9] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.
 [10] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. JMLR, 2016.
 [11] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
 [12] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, 2012.
 [13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
 [14] Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. 2007.
 [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [16] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two timescale update rule converge to a local nash equilibrium. In NIPS, 2017.
 [17] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In ICML, 2017.
 [18] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.
 [19] Wei-Chih Hung, Varun Jampani, Sifei Liu, Pavlo Molchanov, Ming-Hsuan Yang, and Jan Kautz. Scops: Self-supervised co-part segmentation. In CVPR, 2019.
 [20] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 [21] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
 [22] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In NIPS, 2015.
 [23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
 [24] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
 [25] Yanghao Li, Naiyan Wang, Jianping Shi, Jiaying Liu, and Xiaodi Hou. Revisiting batch normalization for practical domain adaptation. In ICLR-WS, 2017.
 [26] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.
 [27] Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. In NIPS, 2016.
 [28] Mingsheng Long and Jianmin Wang. Learning transferable features with deep adaptation networks. In ICML, 2015.
 [29] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In ICML, 2017.
 [30] Massimiliano Mancini, Lorenzo Porzi, Samuel Rota Bulò, Barbara Caputo, and Elisa Ricci. Boosting domain adaptation by discovering latent domains. In CVPR, 2018.
 [31] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation with multiple sources. In NIPS, 2009.
 [32] Takeru Miyato and Masanori Koyama. cGANs with projection discriminator. In ICLR, 2018.
 [33] Pietro Morerio, Jacopo Cavazza, and Vittorio Murino. Minimalentropy correlation alignment for unsupervised deep domain adaptation. In ICLR, 2018.
 [34] Zak Murez, Soheil Kolouri, David Kriegman, Ravi Ramamoorthi, and Kyungnam Kim. Image to image translation for domain adaptation. In CVPR, 2018.
 [35] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS-WS, 2011.
 [36] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In ICCV, 2019.
 [37] Xingchao Peng and Kate Saenko. Synthetic to real adaptation with generative correlation alignment networks. In WACV, 2018.
 [38] Subhankar Roy, Aliaksandr Siarohin, Enver Sangineto, Samuel Rota Bulò, Nicu Sebe, and Elisa Ricci. Unsupervised domain adaptation using feature-whitening and consensus loss. In CVPR, 2019.
 [39] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 2015.
 [40] Paolo Russo, Fabio Maria Carlucci, Tatiana Tommasi, and Barbara Caputo. From source to target and back: symmetric bidirectional adaptive gan. In CVPR, 2018.
 [41] Swami Sankaranarayanan, Yogesh Balaji, Carlos D Castillo, and Rama Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In CVPR, 2018.
 [42] Aliaksandr Siarohin, Enver Sangineto, and Nicu Sebe. Whitening and Coloring batch transform for GANs. In ICLR, 2019.
 [43] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016.
 [44] Baochen Sun and Kate Saenko. Deep coral: Correlation alignment for deep domain adaptation. In ECCV, 2016.
 [45] Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In CVPR, 2011.
 [46] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Adversarial discriminative domain adaptation. In CVPR, 2017.
 [47] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In CVPR, 2017.
 [48] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv:1412.3474, 2014.
 [49] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022, 2016.
 [50] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In CVPR, 2017.
 [51] Ruijia Xu, Ziliang Chen, Wangmeng Zuo, Junjie Yan, and Liang Lin. Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In CVPR, 2018.
 [52] Yi Yao and Gianfranco Doretto. Boosting for transfer learning with multiple sources. In CVPR, 2010.
 [53] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In CVPR, 2017.
A Additional Multi-Source Results
Some sample translations produced by our generator are shown in Figs. 3, 4 and 5. For example, in Fig. 3, when the SVHN digit “six” with side-digits is translated to MNIST-M, the cDWT block re-projects it onto the MNIST-M domain (i.e., a single digit without side-digits) and the AdaIWT block applies the instance-specific style of the digit “three” (i.e., blue digit with red background) to yield a blue “six” with a red background. Similar trends are also observed in Fig. 4.
B Implementation details
In this section we provide the architecture details of the TriGAN generator and discriminator.
Instance Whitening Transform (IWT) blocks. As shown in Fig 6 (a) each IWT block is a sequence composed of: , where and denote the kernel sizes. There are two IWT blocks in the . In the first IWT block we use and , and in the second we use and .
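To make the whitening operation inside an IWT block concrete, the following is a minimal NumPy sketch of whitening a single feature map with its own (instance-level) statistics. It assumes a Cholesky-based whitening and a small regularizer `eps`; the function name and the regularizer value are hypothetical, and the convolutional layers of the block are not reproduced.

```python
import numpy as np


def instance_whiten(feat, eps=1e-5):
    """Whiten one feature map of shape (C, H, W) using its own statistics."""
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)
    # Center each channel over the spatial locations of this instance.
    x = x - x.mean(axis=1, keepdims=True)
    # Channel covariance, regularized so that the factorization is well defined.
    cov = x @ x.T / (h * w) + eps * np.eye(c)
    # cov = L L^T; multiplying by L^{-1} gives (approximately) identity covariance.
    chol = np.linalg.cholesky(cov)
    whitened = np.linalg.solve(chol, x)
    return whitened.reshape(c, h, w)
```

After this step the channels of each image are zero-mean and decorrelated, so image-specific first- and second-order statistics are removed from the representation.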
Adaptive Instance Whitening (AdaIWT) blocks. The AdaIWT blocks are analogous to the IWT blocks, except that the IWT is replaced by the AdaIWT. The AdaIWT block is a sequence: , where and . AdaIWT also takes as input the coloring parameters (, ) (see Sec. 3.2.3 and Fig. 6 (b)). Two AdaIWT blocks are consecutively used in . The last AdaIWT block is followed by a layer.
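The coloring step that distinguishes AdaIWT from IWT can be sketched as follows. This is an illustrative example only: it assumes that the coloring parameters consist of a matrix and a bias vector applied at every spatial location, as in the Whitening and Coloring transform of [42], and the names `gamma` and `beta` are hypothetical.

```python
import numpy as np


def color(whitened, gamma, beta):
    """Apply a coloring transform to a whitened (C, H, W) feature map.

    gamma: (C, C) coloring matrix, beta: (C,) bias (hypothetical names).
    """
    c, h, w = whitened.shape
    x = whitened.reshape(c, h * w)
    # Every spatial location is mapped to gamma @ x + beta.
    y = gamma @ x + beta[:, None]
    return y.reshape(c, h, w)
```

If the input is zero-mean with identity covariance, the output has mean `beta` and covariance `gamma @ gamma.T`, so the style of the result is fully determined by the supplied parameters.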
Style Path. The Style Path is composed of: (Fig. 6 (c)). The output of the Style Path is and , which are input to the second and the first AdaIWT blocks, respectively (see Fig. 6 (b)). The is composed of five fully-connected layers: the first four have 256, 128, 128 and 256 neurons, and the last has a number of neurons equal to the cardinality of the coloring parameters .
Domain Whitening Transform (DWT) blocks. The schematic representation of a DWT block is shown in Fig. 7 (a). For the DWT blocks we adopt a residual-like structure [15]: . We also add identity shortcuts in the DWT residual blocks to aid the training process.
Conditional Domain Whitening Transform (cDWT) blocks. The proposed cDWT blocks are schematically shown in Fig. 7 (b). Similarly to a DWT block, a cDWT block contains the following layers: . Identity shortcuts are also used in the cDWT residual blocks.
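The idea of a whitening step followed by a domain-conditioned coloring can be sketched as follows. This is a simplified illustration on flat feature vectors rather than the cDWT block itself: it assumes Cholesky-based whitening with batch statistics and one set of coloring parameters per domain, selected by a domain index. All names (`batch_whiten`, `conditional_color`, `gammas`, `betas`) are hypothetical.

```python
import numpy as np


def batch_whiten(x, eps=1e-5):
    """Whiten a batch of feature vectors of shape (N, C) with batch statistics."""
    xc = x - x.mean(axis=0, keepdims=True)
    cov = xc.T @ xc / x.shape[0] + eps * np.eye(x.shape[1])
    chol = np.linalg.cholesky(cov)
    # Solve L z = x for every sample, i.e. z = L^{-1} x.
    return np.linalg.solve(chol, xc.T).T


def conditional_color(x_white, domain_id, gammas, betas):
    """Color whitened features with the parameters of the requested domain.

    gammas: (num_domains, C, C), betas: (num_domains, C).
    """
    return x_white @ gammas[domain_id].T + betas[domain_id]
```

Changing `domain_id` re-projects the same whitened features onto a different set of first- and second-order statistics, which is the mechanism that lets a single decoder serve several target domains.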
All the above blocks are assembled to construct , as shown in Fig. 8. Specifically, contains two IWT blocks, one DWT block, one cDWT block and two AdaIWT blocks. It also contains the Style Path and 2 (one before the first IWT block and another after the last AdaIWT block), which are omitted in Fig. 8 for the sake of clarity. {} are computed using the Style Path.
C Experiments for single-source UDA
Since our proposed TriGAN is a generic framework that can handle way domain translations, we also conduct experiments in the single-source UDA scenario, where and the source domain is grayscale MNIST. We consider the following UDA settings with the digits datasets:
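| Methods | MNIST → USPS | MNIST → MNIST-M | MNIST → SVHN |
|---|---|---|---|
| Source Only | 78.9 | 63.6 | 26.0 |
| DANN [10] | 85.1 | 77.4 | 35.7 |
| CoGAN [27] | 91.2 | 62.0 | - |
| ADDA [46] | 89.4 | - | - |
| PixelDA [2] | 95.9 | 98.2 | - |
| UNIT [26] | 95.9 | - | - |
| SBADA-GAN [40] | 97.6 | 99.4 | 61.1 |
| GenToAdapt [41] | 92.5 | - | 36.4 |
| CyCADA [17] | 94.8 | - | - |
| I2I Adapt [34] | 92.1 | - | - |
| TriGAN (Ours) | 98.0 | 95.7 | 66.3 |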
C.1 Datasets
MNIST → USPS. The MNIST dataset contains grayscale images of handwritten digits from 0 to 9. The pixel resolution of MNIST digits is 28 × 28. USPS contains similar grayscale handwritten digits, except that the resolution is 16 × 16. We upsample images from both domains to 32 × 32 during training. For training TriGAN, 50000 MNIST and 7438 USPS samples are used. For evaluation we use 1860 test samples from USPS.
MNIST → MNIST-M. MNIST-M is a coloured version of the grayscale MNIST digits. MNIST-M has RGB images with resolution 28 × 28. For training TriGAN, all 50000 training samples from MNIST and MNIST-M are used, and the dedicated 10000 MNIST-M test samples are used for evaluation. Upsampling to 32 × 32 is also done during training.
MNIST → SVHN. SVHN is short for Street View House Numbers and contains real-world images of digits ranging from 0 to 9. The images in SVHN are RGB with a pixel resolution of 32 × 32. SVHN has non-centered digits with varying colour intensities. The presence of side-digits also makes adaptation to SVHN a hard task. For training TriGAN, 60000 MNIST and 73257 SVHN training samples are used. During evaluation, all 26032 SVHN test samples are utilized.
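The upsampling to a common 32 × 32 resolution described above can be sketched as follows. The interpolation method is not specified in the text, so nearest-neighbour resampling is assumed here purely for illustration, and the function name is hypothetical.

```python
import numpy as np


def upsample_nearest(img, size=32):
    """Resize a (H, W) or (H, W, C) image to (size, size) by nearest neighbour."""
    h, w = img.shape[:2]
    # For each output row/column, pick the index of the source row/column.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]
```

The same function handles the 16 × 16 USPS digits and the 28 × 28 MNIST and MNIST-M digits, so all domains reach the 32 × 32 resolution of SVHN.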
C.2 Comparison with GAN-based state-of-the-art methods
In this section we compare our proposed TriGAN with GAN-based state-of-the-art methods, including both adversarial-learning-based and reconstruction-based approaches. Tab. 5 reports the performance of TriGAN alongside the results obtained by the following baselines: Domain-Adversarial Neural Network [10] (DANN), Coupled Generative Adversarial Networks [27] (CoGAN), Adversarial Discriminative Domain Adaptation [46] (ADDA), Pixel-level Domain Adaptation [2] (PixelDA), Unsupervised Image-to-Image Translation Networks [26] (UNIT), Symmetric Bi-Directional Adaptive GAN [40] (SBADA-GAN), Generate to Adapt [41] (GenToAdapt), Cycle-Consistent Adversarial Domain Adaptation [17] (CyCADA) and Image to Image Translation for Domain Adaptation [34] (I2I Adapt). As can be seen from Tab. 5, TriGAN performs best in two out of the three adaptation settings. It is only worse in the MNIST → MNIST-M setting, where it is the third best. Notably, TriGAN performs particularly well in MNIST → SVHN adaptation, which is considered a hard setting: it is 5.2 points better (66.3% vs. 61.1%) than the second-best method, SBADA-GAN.
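The rankings quoted above can be checked directly from the values in Tab. 5. The short script below is only a convenience for the reader: the numbers are copied from the table, `None` marks a setting for which a method reports no result, and the helper name `ranking` is hypothetical.

```python
# Accuracies (%) from Tab. 5, ordered as
# (MNIST -> USPS, MNIST -> MNIST-M, MNIST -> SVHN); None means "not reported".
results = {
    "Source Only": (78.9, 63.6, 26.0),
    "DANN": (85.1, 77.4, 35.7),
    "CoGAN": (91.2, 62.0, None),
    "ADDA": (89.4, None, None),
    "PixelDA": (95.9, 98.2, None),
    "UNIT": (95.9, None, None),
    "SBADA-GAN": (97.6, 99.4, 61.1),
    "GenToAdapt": (92.5, None, 36.4),
    "CyCADA": (94.8, None, None),
    "I2I Adapt": (92.1, None, None),
    "TriGAN": (98.0, 95.7, 66.3),
}


def ranking(col):
    """Method names sorted from best to worst accuracy in setting `col`."""
    scored = [(acc[col], name) for name, acc in results.items() if acc[col] is not None]
    return [name for _, name in sorted(scored, reverse=True)]


# Margin of TriGAN over the runner-up on MNIST -> SVHN.
svhn_margin = round(results["TriGAN"][2] - results["SBADA-GAN"][2], 1)
```

Here `ranking(0)` and `ranking(2)` both start with TriGAN, `ranking(1)` places it third after SBADA-GAN and PixelDA, and `svhn_margin` is 5.2.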