1 Introduction
Deep learning methods have been successfully applied to different visual recognition tasks, demonstrating an excellent generalization ability. However, analogously to other statistical machine learning techniques, deep neural networks also suffer from the problem of domain shift [49], which is observed when predictors trained on a dataset do not perform well when applied to novel domains.
Since collecting annotated training data from every possible domain is expensive and sometimes even impossible, over the years several Domain Adaptation (DA) methods [34, 5] have been proposed. DA approaches leverage labeled data in a source domain in order to learn an accurate prediction model for a target domain. Specifically, in the special case of Unsupervised Domain Adaptation (UDA), no annotated target data are available at training time. Note that, even if target-sample labels are not available, unlabeled target data can be, and usually are, exploited at training time.
Most UDA methods attempt to reduce the domain shift by directly aligning the source and target marginal distributions. Notably, approaches based on the Correlation Alignment paradigm model domain data distributions in terms of their second-order statistics. Specifically, they match distributions by minimizing a loss function which corresponds to the difference between the source and the target covariance matrices obtained using the network’s last-layer activations [45, 46, 32]. Another recent and successful UDA paradigm exploits domain-specific alignment layers, derived from Batch Normalization (BN) [19], which are directly embedded within the deep network [3, 25, 31]. Other prominent research directions in UDA correspond to those methods which also exploit the target data posterior distribution. For instance, the entropy minimization paradigm adopted in [3, 37, 13] enforces the network’s prediction probability distribution on each target sample to be peaked with respect to some (unknown) class, thus penalizing high-entropy target predictions. On the other hand, the consistency-enforcing paradigm [38, 7, 48] is based on specific loss functions which penalize inconsistent predictions over perturbed copies of the same target samples.
In this paper we propose to unify the above paradigms by introducing two main novelties. First, we align the source and the target data distributions using covariance matrices similarly to [45, 46, 32]. However, instead of using a loss function computed on the last-layer activations, we use domain-specific alignment layers which compute domain-specific covariance matrices of intermediate features. These layers “whiten” the source and the target features and project them into a common spherical distribution (see Fig. 1 (a), blue box). We call this alignment strategy Domain-specific Whitening Transform (DWT). Notably, our approach generalizes previous BN-based DA methods [3, 25, 30] which do not consider inter-feature correlations and rely only on feature standardization.
The second novelty we introduce is a new loss function, the Min-Entropy Consensus (MEC) loss, which merges the entropy [3, 37, 13] and the consistency [7] loss functions. The motivation behind our proposal is to avoid the tuning of the many hyperparameters which are typically required when considering several loss terms and, specifically, the confidence-threshold hyperparameters [7]. Indeed, due to the mismatch between the source and the target domain, and because of the unlabeled target-data assumption, hyperparameters are hard to tune in UDA [32]. The proposed MEC loss simultaneously encourages coherent predictions between two perturbed versions of the same target sample and exploits these predictions as pseudo-labels for training (Fig. 1 (b), purple box).
We plug our proposed DWT and the MEC loss into different network architectures and we empirically show a significant boost in performance. In particular, we achieve state-of-the-art results in different UDA benchmarks: MNIST [23], USPS [8], SVHN [33], CIFAR-10, STL-10 [4] and Office-Home [53]. Our code will be made publicly available soon.
2 Related Work
Unsupervised Domain Adaptation. Several previous works have addressed the problem of DA, considering both shallow models and deep architectures. In this section we focus only on deep learning methods for UDA, as these are the closest to our proposal.
UDA methods mostly differ in the strategy used to reduce the discrepancy between the source and the target feature distributions and can be grouped in different categories. The first category includes methods modeling the domain distributions in terms of their first- and second-order statistics. For instance, some works aim at reducing the domain shift by minimizing the Maximum Mean Discrepancy [28, 29, 53] and describe distributions in terms of their first-order statistics. Other works also consider second-order statistics using the correlation alignment paradigm (Sec. 1) [46, 32]. Instead of introducing additional loss functions, more recent works deal with the domain-shift problem by directly embedding into a deep network domain alignment layers which exploit BN [25, 3, 31].
A second category of methods includes approaches which learn domain-invariant deep representations. For instance, in [9] a gradient reversal layer learns discriminative domain-agnostic representations. Similarly, in [51] a domain-confusion loss is introduced, encouraging the network to learn features robust to the domain shift. Haeusser et al. [14] present Associative Domain Adaptation, an approach which also learns domain-invariant embeddings.
A third category includes methods based on Generative Adversarial Networks (GANs) [35, 1, 47, 41, 39]. The main idea behind these approaches is to directly transform images from the target domain to the source domain. While GAN-based methods are especially successful in adaptation from synthetic to real images and in the case of non-complex datasets, they have limited capabilities for complex images.
Entropy minimization, first introduced in [12], is a common strategy in semi-supervised learning [54]. In a nutshell, it consists in exploiting the high-confidence predictions of unlabeled samples as pseudo-labels. Due to its effectiveness, several popular UDA methods [35, 3, 37, 29] have adopted the entropy loss for training deep networks.
Another popular paradigm in UDA, which we refer to as the consistency-enforcing paradigm, is realized by perturbing the target samples and then imposing some consistency between the predictions of two perturbed versions of the same target input. Consistency is imposed by defining appropriate loss functions, as shown in [37, 7, 38]. The consistency loss paradigm is effective but it becomes uninformative if the network produces uniform probability distributions for corresponding target samples. Thus, previous methods also integrate a Confidence Thresholding (CT) technique [7], in order to discard unreliable predictions. Unfortunately, CT introduces additional user-defined and dataset-specific hyperparameters which are difficult to tune in a UDA scenario [32]. In contrast, as demonstrated in our experiments, our MEC loss eliminates the need for CT and the corresponding hyperparameters.
Feature Decorrelation. Recently, Huang et al. [17] and Siarohin et al. [43] proposed to replace BN with feature whitening in a discriminative and generative setting, respectively. However, neither of these works considers a DA problem. We show in this paper that feature whitening can be used to align the source and the target marginal distributions using layer-specific covariance matrices, without the need for a dedicated loss function as in previous correlation alignment methods.
3 Method
In this section we present the proposed UDA approach. Specifically, after introducing some preliminaries, we describe our Domain-Specific Whitening Transform and, finally, the proposed Min-Entropy Consensus loss.
3.1 Preliminaries
Let $S = \{(I_j^s, y_j^s)\}_{j=1}^{N_s}$ be the labeled source dataset, where $I_j^s$ is an image and $y_j^s \in \mathcal{Y} = \{1, 2, \dots, C\}$ its associated label, and let $T = \{I_i^t\}_{i=1}^{N_t}$ be the unlabeled target dataset. The goal of UDA is to learn a predictor for the target domain by using samples from both $S$ and $T$. Learning a predictor for the target domain is not trivial because of the issues discussed in Sec. 1.
A common technique to reduce domain shift is to use BN-based layers inside a network, so as to project the source and target feature distributions to a reference distribution through feature standardization. As mentioned in Sec. 1, in this work we propose to replace feature standardization with whitening, where the whitening operation is domain-specific. Before introducing the proposed whitening-based distribution alignment, we recap BN below. Let $B = \{x_1, \dots, x_m\}$ be a mini-batch of $m$ input samples to a given network layer, where each element $x_i \in B$ is a $d$-dimensional feature vector, i.e. $x_i \in \mathbb{R}^d$. Given $B$, in BN each $x_i \in B$ is transformed as follows:
$$\mathrm{BN}(x_{i,k}) = \gamma_k \frac{x_{i,k} - \mu_{B,k}}{\sqrt{\sigma_{B,k}^2 + \epsilon}} + \beta_k, \tag{1}$$
where $k$ ($1 \le k \le d$) indicates the $k$-th dimension of the data, $\mu_{B,k}$ and $\sigma_{B,k}$ are, respectively, the mean and the standard deviation computed with respect to the $k$-th dimension of the samples in $B$, and $\epsilon$ is a constant used to prevent numerical instability. Finally, $\gamma_k$ and $\beta_k$ are learnable scaling and shifting parameters.
In the next section we present our DWT, while in Sec. 3.3 we present the proposed MEC loss. It is worth noting that the two proposed components can be plugged into a network independently of each other.
3.2 Domain-specific Whitening Transform
As stated above, BN is based on a per-dimension standardization of each sample $x_i \in B$. Hence, once normalized, the batch samples may still have correlated feature values. Since our goal is to use feature normalization in order to alleviate the domain-shift problem (see below), we argue that plain standardization is not enough to align the source and the target marginal distributions. For this reason we propose to use Batch Whitening (BW) instead of BN, which is defined as:
$$\mathrm{BW}(x_{i,k}; \Omega) = \gamma_k \hat{x}_{i,k} + \beta_k, \tag{2}$$
$$\hat{x}_i = W_B (x_i - \mu_B). \tag{3}$$
In Eq. (3), the vector $\mu_B$ is the mean of the elements in $B$ ($\mu_{B,k}$ being its $k$-th component), while the matrix $W_B$ is such that $W_B^\top W_B = \Sigma_B^{-1}$, where $\Sigma_B$ is the covariance matrix computed using $B$. $\Omega = (\mu_B, \Sigma_B)$ are the batch-dependent first- and second-order statistics. Eq. (3) performs the whitening of $x_i$, and the resulting set of vectors $\hat{B} = \{\hat{x}_1, \dots, \hat{x}_m\}$ lies in a spherical distribution (i.e., with a covariance matrix equal to the identity matrix).
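The whitening step can be illustrated with a minimal NumPy sketch. It assumes that the whitening matrix is taken as the inverse of the Cholesky factor of the batch covariance, which is one way to satisfy $W^\top W = \Sigma^{-1}$; the function name `batch_whiten`, the regularizer `eps` and the toy data are illustrative assumptions, not the authors' implementation, and the learnable scale and shift of Eq. (2) are omitted.

```python
import numpy as np

def batch_whiten(x, eps=1e-5):
    """Whiten a batch x of shape (m, d); returns whitened batch, mean and W."""
    mu = x.mean(axis=0)                      # batch mean, shape (d,)
    xc = x - mu                              # centered batch
    # Batch covariance, with a small ridge (assumption) for numerical stability.
    sigma = xc.T @ xc / x.shape[0] + eps * np.eye(x.shape[1])
    # sigma = L L^T, so W = L^{-1} gives W^T W = sigma^{-1}.
    L = np.linalg.cholesky(sigma)
    W = np.linalg.inv(L)
    # Row-wise version of x_hat_i = W (x_i - mu).
    return xc @ W.T, mu, W

# Toy batch with correlated features (hypothetical data).
rng = np.random.default_rng(0)
mix = np.array([[2.0, 1.0, 0.0, 0.0],
                [0.0, 1.0, 0.5, 0.0],
                [0.0, 0.0, 1.0, 0.3],
                [0.0, 0.0, 0.0, 1.0]])
x = rng.normal(size=(512, 4)) @ mix + 5.0
x_hat, mu, W = batch_whiten(x)
cov_hat = x_hat.T @ x_hat / x.shape[0]       # close to the identity matrix
```

After whitening, `cov_hat` is the identity up to the small `eps` ridge, which is the "spherical distribution" referred to in the text.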
Our network takes as input two different batches of data, randomly extracted from $S$ and $T$, respectively. Specifically, given any arbitrary layer $l$ in the network, let $B^s = \{x_1^s, \dots, x_m^s\}$ and $B^t = \{x_1^t, \dots, x_m^t\}$ denote the batches of intermediate output activations from layer $l$ for the source and target domain, respectively. Using Eq. (2)-(3) we can now define our Domain-specific Whitening Transform (DWT). Let $x^s$ and $x^t$ denote the inputs to the DWT layer from the source and the target domain, respectively. Our DWT is defined as follows (we drop the sample index $i$ and the dimension index $k$ for the sake of clarity):
$$\mathrm{DWT}(x^s; \Omega^s) = \mathrm{BW}(x^s; \Omega^s), \tag{4}$$
$$\mathrm{DWT}(x^t; \Omega^t) = \mathrm{BW}(x^t; \Omega^t). \tag{5}$$
We estimate separate statistics ($\Omega^s = (\mu_B^s, \Sigma_B^s)$ and $\Omega^t = (\mu_B^t, \Sigma_B^t)$) for $x^s$ and $x^t$ and use them for whitening the corresponding activations, projecting the two batches into a common spherical distribution (Fig. 1 (a)).
The whitening matrices of the source and the target batch, $W_B^s$ and $W_B^t$, are computed following the approach described in [43], which is based on the Cholesky decomposition [6]. The latter is faster [43] than the ZCA-based whitening [20] adopted in [17]. In the Supplementary Material we provide more details on how $W_B^s$ and $W_B^t$ are computed. Differently from [43], we replace the “coloring” step after whitening with simple scale and shift operations, thereby preventing the introduction of extra parameters in the network. Also, differently from [43], we use feature grouping [17] (Sec. 3.2.1) in order to make the batch-statistics estimate more robust when the batch size $m$ is small and the feature dimension $d$ is large. During training, the DWT layers accumulate the statistics for the target domain using a moving average of the batch statistics.
In summary, the proposed DWT layers replace the correlation alignment of the last-layer feature activations with intermediate-layer feature whitening, performed at different levels of abstraction. In Sec. 3.2.1 we show that BN-based domain alignment layers [25, 3] can be seen as a special case of DWT layers.
3.2.1 Implementation Details
Given a typical block (Conv layer → BN → ReLU) of a CNN, we replace the BN layer with our proposed DWT layer (see Fig. 1), obtaining: (Conv layer → DWT → ReLU). Ideally, in order to project the source and target feature distributions to a reference one, the DWT layers should perform full-feature whitening using a $d \times d$ whitening matrix, where $d$ is the number of features. However, the covariance matrix can be ill-conditioned if $d$ is large and the batch size $m$ is small. For this reason, unlike [43] and similarly to [17], we use feature grouping, where the features are grouped into subsets of size $g$. This results in better-conditioned covariance matrices but in only partially whitened features. In this way we reach a compromise between full-feature whitening and numerical stability. Interestingly, when $g = 1$, the whitening matrices reduce to diagonal matrices, thus realizing feature standardization as in [3, 25].
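Feature grouping and the domain-specific use of batch statistics can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: groups are formed from consecutive features, whitening uses the inverse Cholesky factor, and the names `group_whiten`, `dwt` and the toy batches are hypothetical. The learnable scale and shift and the moving-average statistics are left out.

```python
import numpy as np

def group_whiten(x, g, eps=1e-5):
    """Whiten each consecutive group of g features of x (shape (m, d)) separately."""
    m, d = x.shape
    assert d % g == 0, "illustrative sketch: d must be divisible by g"
    out = np.empty_like(x, dtype=float)
    for start in range(0, d, g):
        block = x[:, start:start + g]
        xc = block - block.mean(axis=0)
        sigma = xc.T @ xc / m + eps * np.eye(g)      # g x g group covariance
        W = np.linalg.inv(np.linalg.cholesky(sigma))  # W^T W = sigma^{-1}
        out[:, start:start + g] = xc @ W.T
    return out

def dwt(x_src, x_tgt, g):
    """Domain-specific whitening: each domain is whitened with its own statistics."""
    return group_whiten(x_src, g), group_whiten(x_tgt, g)

# Two toy batches with different means, scales and correlations (hypothetical data).
rng = np.random.default_rng(0)
mix = rng.normal(size=(8, 8))
x_src = rng.normal(size=(64, 8)) @ mix + 3.0
x_tgt = 2.0 * rng.normal(size=(64, 8)) @ mix.T - 1.0

# With g = 1 the transform coincides with per-feature standardization.
std_src = (x_src - x_src.mean(0)) / np.sqrt(x_src.var(0) + 1e-5)
g1_src, _ = dwt(x_src, x_tgt, g=1)
# With g = 4 each group of four features is decorrelated within each domain.
g4_src, g4_tgt = dwt(x_src, x_tgt, g=4)
```

Here `g1_src` matches `std_src`, illustrating the special case mentioned above, while for `g=4` the covariance of each four-feature group is close to the identity in both domains. Features belonging to different groups may remain correlated.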
3.3 Min-Entropy Consensus Loss
The impossibility of using the cross-entropy loss on the unlabeled target samples is commonly circumvented using an unsupervised loss, such as the entropy [3, 37] or the consistency loss [7, 38]. While minimizing the entropy loss ensures that the predictor maximally separates the target data, minimization of the consistency loss forces the predictor to deliver consistent predictions for target samples coming from the same (yet unknown) category. Therefore, given the importance of better exploiting the unlabeled target data and the limitations of the above two losses (see Sec. 1), we propose a novel Min-Entropy Consensus (MEC) loss within the framework of UDA. We explain below how the MEC loss merges both the entropy and the consistency loss into a single unified function.
Similarly to the consistency loss, the proposed MEC loss requires input data perturbations. Unless otherwise explicitly specified, we apply common data-perturbation techniques on both the source dataset $S$ and the target dataset $T$ using affine transformations and Gaussian blurring operations. When we use the MEC loss, the network is fed with three batches instead of two. Specifically, apart from the source batch $B^s$, we use two different target batches ($B_1^t$ and $B_2^t$), which contain duplicate pairs of images differing only with respect to the adopted image perturbation.
Conceptually, we can think of this pipeline as three different networks with three separate domain-specific statistics $\Omega^s$, $\Omega_1^t$ and $\Omega_2^t$ but with shared network weights. However, since both $B_1^t$ and $B_2^t$ are drawn from the same distribution, we estimate a single $\Omega^t$ using both the target batches ($B_1^t \cup B_2^t$). As an additional advantage, this makes it possible to use $2m$ samples for computing the target covariance matrix $\Sigma_B^t$.
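Let $B^s$, $B_1^t$ and $B_2^t$ be three batches of the last-layer activations. Since the source samples are labeled, the cross-entropy loss ($L^s$) can be used in the case of $B^s$:
$$L^s(B^s) = -\frac{1}{m} \sum_{x_i^s \in B^s} \log p(y_i^s \mid x_i^s), \tag{6}$$
where $p(y_i^s \mid x_i^s)$ is the (softmax-based) probability prediction assigned by the network to a sample $x_i^s \in B^s$ with respect to its ground-truth label $y_i^s$. However, ground-truth labels are not available for target samples. For this reason, we propose the following MEC loss ($L^t$):
$$L^t(B_1^t, B_2^t) = \frac{1}{m} \sum_{i=1}^{m} \ell^t(x_i^{t_1}, x_i^{t_2}), \tag{7}$$
$$\ell^t(x_i^{t_1}, x_i^{t_2}) = -\frac{1}{2} \max_{y \in \mathcal{Y}} \left( \log p(y \mid x_i^{t_1}) + \log p(y \mid x_i^{t_2}) \right). \tag{8}$$
In Eq. (8), $x_i^{t_1} \in B_1^t$ and $x_i^{t_2} \in B_2^t$ are activations of two corresponding perturbed target samples, and $\mathcal{Y}$ is the set of class labels.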
The intuitive idea behind our proposal is that, similarly to consistency-based losses [7, 38], since $x_i^{t_1}$ and $x_i^{t_2}$ correspond to the same image, the network should provide similar predictions. However, unlike the aforementioned methods, which compute the L2-norm or the binary cross-entropy between these predictions, the proposed MEC loss finds the class $z$ such that $z = \arg\max_{y \in \mathcal{Y}} \left( \log p(y \mid x_i^{t_1}) + \log p(y \mid x_i^{t_2}) \right)$. $z$ is the class in which the posteriors corresponding to $x_i^{t_1}$ and $x_i^{t_2}$ maximally agree. We then use $z$ as the pseudo-label, which can be selected without ad-hoc confidence thresholds. In other words, instead of using high-confidence thresholds to discard unreliable target samples [7], we use all the samples but we backpropagate the error with respect to only $z$.
The dynamics of the MEC loss are the following. First, similarly to the consistency losses, it forces the network to provide coherent predictions. Second, differently from consistency losses, which are prone to attain a near-zero value with uniform posterior distributions, it enforces peaked predictions. See the Supplementary Material for a more formal relation between the MEC loss and both the entropy and the consistency loss.
The final loss $L$ is a weighted sum of $L^s$ and $L^t$: $L = L^s(B^s) + \lambda L^t(B_1^t, B_2^t)$.
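A minimal NumPy sketch of these losses follows. It assumes that the per-sample MEC term is $-\frac{1}{2}\max_y\,(\log p(y \mid x^{t_1}) + \log p(y \mid x^{t_2}))$ and that the final loss is the source cross-entropy plus a weight `lam` times the MEC loss; the function names and toy logits are hypothetical, and gradients are not computed here.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def cross_entropy(logits_s, labels_s):
    """Mean negative log-probability of the ground-truth source labels."""
    logp = log_softmax(logits_s)
    return -logp[np.arange(len(labels_s)), labels_s].mean()

def mec_loss(logits_t1, logits_t2):
    """Assumed MEC form: -1/2 * max_y (log p(y|x_t1) + log p(y|x_t2)), batch mean."""
    agreement = log_softmax(logits_t1) + log_softmax(logits_t2)
    pseudo_labels = agreement.argmax(axis=1)   # class of maximal agreement
    return -0.5 * agreement.max(axis=1).mean(), pseudo_labels

def total_loss(logits_s, labels_s, logits_t1, logits_t2, lam=0.1):
    """Weighted sum of the source cross-entropy and the MEC loss."""
    return cross_entropy(logits_s, labels_s) + lam * mec_loss(logits_t1, logits_t2)[0]

num_classes = 10
# Uniform posteriors are consistent, yet the MEC loss stays at log(C), not zero.
uniform = np.zeros((4, num_classes))
loss_uniform, _ = mec_loss(uniform, uniform)
# Peaked and consistent posteriors drive the loss towards zero.
peaked = np.zeros((4, num_classes))
peaked[:, 3] = 20.0
loss_peaked, pseudo = mec_loss(peaked, peaked)
```

The two toy cases mirror the argument in the text: identical uniform predictions would make a pure consistency penalty vanish, whereas the MEC value remains at $\log C$, so only agreeing and peaked predictions are rewarded.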
3.4 Discussion
The proposed DWT generalizes the BN-based DA approaches by decorrelating the batch features. Besides the analogy with the correlation-alignment methods mentioned in Sec. 1, in which covariance matrices are used to estimate and align the source and the target distributions, a second reason why we believe that full whitening is important is the relation between feature normalization and the smoothness of the loss [42, 22, 17, 24, 36]. For instance, previous works [24, 36] showed that better conditioning of the input-feature covariance matrix leads to better conditioning of the Hessian of the loss function, making the gradient descent weight updates closer to Newton updates. However, BN only performs standardization, which barely improves the conditioning of the covariance matrix when the features are correlated [17]. Conversely, feature whitening completely decorrelates the batch samples, thus potentially improving the smoothness of the landscape of the loss function.
The importance of a smooth loss function is even higher when entropy-like losses on unlabeled data are used. For instance, Shu et al. [42] showed that minimizing the entropy forces the classifier to be confident on the unlabeled target data, thus potentially driving the classifier’s decision boundaries away from the target data. However, without a locally-Lipschitz constraint on the loss function (i.e. with a non-smooth loss landscape), the decision boundaries can be placed close to the training samples even when the entropy is minimized [42]. Since our MEC loss is related to both the entropy and the consistency loss, we employ DWT also to improve the smoothness of our loss function in order to alleviate overfitting phenomena related to the use of unlabeled data.
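The conditioning argument can be checked numerically on synthetic data. The sketch below is illustrative only: the correlated toy features (a shared latent factor plus small noise) and all variable names are assumptions, and whitening is done with the inverse Cholesky factor of the batch covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 4096, 4
# Strongly correlated features: one shared latent factor plus small independent noise.
shared = rng.normal(size=(m, 1))
x = shared + 0.1 * rng.normal(size=(m, d))

xc = x - x.mean(axis=0)
standardized = xc / xc.std(axis=0)            # BN-style per-feature standardization

sigma = xc.T @ xc / m                          # batch covariance
W = np.linalg.inv(np.linalg.cholesky(sigma))   # whitening matrix, W^T W = sigma^{-1}
whitened = xc @ W.T

def cond_of_cov(a):
    """Condition number of the covariance of a zero-mean batch."""
    return np.linalg.cond(a.T @ a / a.shape[0])

cond_raw = cond_of_cov(xc)
cond_std = cond_of_cov(standardized)
cond_white = cond_of_cov(whitened)
```

On this toy batch, `cond_std` stays in the hundreds, close to `cond_raw`, because standardization rescales the features without removing their correlation, while `cond_white` is essentially 1.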
4 Experiments
In this section we provide details about our implementation and training protocols and we report our experimental evaluation. We conduct experiments on both small- and large-scale datasets and we compare our method with state-of-the-art approaches. We also present an ablation study to analyze the impact of each of our contributions on the classification accuracy.
4.1 Datasets
We conduct experiments on the following datasets:
MNIST ↔ USPS. The MNIST dataset [23] contains grayscale images (28 × 28 pixels) depicting handwritten digits ranging from 0 to 9. The USPS [8] dataset is similar to MNIST, but images have smaller resolution (16 × 16 pixels). The domain shift between the USPS and MNIST datasets can be visually observed in Fig. 2(a).
MNIST ↔ SVHN. Street View House Number (SVHN) [33] images are 32 × 32 pixel RGB images. Similarly to the MNIST dataset, digits range from 0 to 9. However, SVHN images have variable colour intensities and depict non-centered digits. Thus, there is a significant domain shift with respect to MNIST (Fig. 2(b)).
CIFAR-10 ↔ STL: CIFAR-10 is a 10-class dataset of RGB images depicting generic objects, with a resolution of 32 × 32 pixels. STL [4] is similar to CIFAR-10, except that it has fewer labelled training images per class and its images have a resolution of 96 × 96 pixels. The non-overlapping classes, “frog” and “monkey”, are removed from CIFAR-10 and STL, respectively. Samples are shown in Fig. 2(c).
Figure 2: Samples from (a) MNIST ↔ USPS, (b) SVHN ↔ MNIST and (c) CIFAR-10 ↔ STL.
Office-Home: The Office-Home [53] dataset comprises 4 distinct domains, each containing images of 65 different categories (Fig. 3). There are 15,500 images in the dataset, thus it represents a large-scale benchmark for testing domain adaptation methods. The domains are: Art (Ar), Clipart (Cl), Product (Pr) and Real World (Rw).
4.2 Experimental Setup
To fairly compare our method with other UDA approaches, in the digits experiments we adopt the same base networks proposed in [10]. For the CIFAR-10 ↔ STL experiments we use the network described in [7]. We train the networks using the Adam optimizer [21] with a mini-batch of cardinality $m = 64$ samples, an initial learning rate of 0.001 and weight decay of 5
. The networks are trained for a total of 120 epochs, with the learning rate being decreased by a factor of 10 after 50 and 90 epochs. We use the SVHN → MNIST setting to fix the value of the hyperparameter $\lambda$, which weights the MEC loss (Sec. 3.3), to 0.1 and to set the group size ($g$) equal to 4. These hyperparameter values are used for all the datasets. The accuracy values reported in Tab. 1, 4 and 3 are averaged over five runs.
In the Office-Home dataset experiments we use a ResNet-50 [15] architecture following [27]. In our experiments we modify ResNet-50 by replacing the first BN layer and the BN layers in the first residual block (with 64 features) with DWT layers. The network is initialized with weights taken from a model pre-trained on the ILSVRC-2012 dataset. We discard the final fully-connected layer and we replace it with a randomly initialized fully-connected layer with 65 output logits. During training, each domain-specific batch is limited to $m = 20$ samples (due to GPU memory constraints). The Adam optimizer is used with one initial learning rate for the randomly initialized final layer and another for the rest of the trainable parameters of the network. The network is trained for a total of 60 epochs, where one “epoch” is a pass through the entire dataset having the lower number of training samples. The learning rates are decayed by a factor of 10 after 54 epochs. Differently from the small-scale dataset experiments, where target samples have predefined train and test splits, in the Office-Home experiments all the target samples (without labels) are used during training and evaluation.
To demonstrate the effect of our contributions, we consider three different variants of the proposed method. In the first variant (denoted as DWT in Sec. 3.2), we only consider DWT layers without the proposed MEC loss. In practice, in the considered network architectures we replace the BN layers which follow the convolutional layers with DWT layers. The supervised cross-entropy loss is used for the labeled source samples and the entropy loss as in [3] is used for the unlabeled target samples. No data augmentation is used here. In the second variant, denoted as DWT-MEC, we also exploit the proposed MEC loss (this corresponds to our full method). In this case we need perturbations of the input data, which are obtained by following some basic data-perturbation schemes such as image translation by a factor in [−0.05, 0.05], Gaussian blur and random affine transformation as proposed in [7]. Finally, in the third variant (DWT-MEC (MT)) we plug our proposed DWT layers and the MEC loss into the Mean-Teacher (MT) training paradigm [48].
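The step schedule described earlier in this section for the digits and CIFAR-10 ↔ STL experiments (initial learning rate 0.001, divided by 10 after 50 and 90 epochs) can be written as a small helper. The function name `step_lr` and the zero-based epoch indexing are assumptions made for illustration.

```python
def step_lr(epoch, base_lr=1e-3, milestones=(50, 90), factor=10.0):
    """Learning rate for a zero-based epoch index under a step-decay schedule."""
    lr = base_lr
    for milestone in milestones:
        # Once `milestone` epochs have been completed, divide the rate by `factor`.
        if epoch >= milestone:
            lr /= factor
    return lr

# Learning rate at the start, after the first decay and after the second decay.
schedule = [step_lr(e) for e in (0, 49, 50, 89, 90, 119)]
```

The same helper would describe the Office-Home schedule with `milestones=(54,)`, given the appropriate base learning rates.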
4.3 Results
In this section we present an extensive experimental analysis of our approach, showing both the results of an ablation study and a comparison with state-of-the-art methods.
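Table 1: Accuracy (%) on the digits datasets (Source → Target).
| Method | MNIST → USPS | USPS → MNIST | SVHN → MNIST | MNIST → SVHN |
|---|---|---|---|---|
| Source Only | 78.9 | 57.1±1.7 | 60.1±1.1 | 20.23±1.8 |
| w/o augmentation | | | | |
| CORAL [45] | 81.7 | - | 63.1 | - |
| MMD [51] | 81.1 | - | 71.1 | - |
| DANN [10] | 85.1 | 73.0±2.0 | 73.9 | 35.7 |
| DSN [2] | 91.3 | - | 82.7 | - |
| CoGAN [26] | 91.2 | 89.1±0.8 | - | - |
| ADDA [52] | 89.4±0.2 | 90.1±0.8 | 76.0±1.8 | - |
| DRCN [11] | 91.8±0.1 | 73.7±0.1 | 82.0±0.2 | 40.1±0.1 |
| ATT [37] | - | - | 86.20 | 52.8 |
| ADA [13] | - | - | 97.6 | - |
| AutoDIAL [3] | 97.96 | 97.51 | 89.12 | 10.78 |
| SBADA-GAN [35] | 97.6 | 95.0 | 76.1 | 61.1 |
| GAM [16] | 95.7±0.5 | 98.0±0.5 | 74.6±1.1 | - |
| MECA [32] | - | - | 95.2 | - |
| DWT | 99.09±0.09 | 98.79±0.05 | 97.75±0.10 | 28.92±1.9 |
| Target Only | 96.5 | 99.2 | 99.5 | 96.7 |
| w/ augmentation | | | | |
| SE^{a} [7] | 88.14±0.34 | 92.35±8.61 | 93.33±5.88 | 33.87±4.02 |
| SE^{b} [7] | 98.23±0.13 | 99.54±0.04 | 99.26±0.05 | 37.49±2.44 |
| SE^{†b} [7] | 99.29±0.16 | 99.26±0.04 | 97.88±0.03 | 24.09±0.33 |
| DWT-MEC^{b} | 99.01±0.06 | 99.02±0.05 | 97.80±0.07 | 30.20±0.92 |
| DWT-MEC (MT)^{b} | 99.30±0.19 | 99.15±0.05 | 99.14±0.02 | 31.58±2.34 |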
4.3.1 Ablation Study
We first conduct a thorough analysis of our method assessing, in isolation, the impact of our two main contributions: (i) aligning source and target distributions by embedded DWT layers; and (ii) leveraging target data through our threshold-free MEC loss.
First, we consider the SVHN → MNIST setting and we show the benefit of feature whitening over BN. We vary the number of whitening layers from 1 to 3 and simultaneously change the group size ($g$) from 1 to 8 (see Sec. 3.2.1). With a group size equal to 1, DWT layers reduce to the DA layers proposed in [3, 25]. Our results are shown in Fig. 4, and from the figure it is clear that when $g = 1$ the accuracy stays consistently below 90%. This behaviour can be ascribed to the sub-optimal alignment of source and target data distributions achieved with previous BN-based DA layers. When the group size increases, the feature decorrelation performed by the DWT layers comes into play and results in a significant improvement in terms of performance. The accuracy increases monotonically as the group size grows until $g = 4$, then it starts to decrease. This final drop in accuracy is probably due to an inaccurate estimation of the covariance matrices. Indeed, a covariance matrix of size 8 × 8 is perhaps poorly estimated due to the lack of samples in a batch (Sec. 3.2.1). Importantly, Fig. 4 also shows that increasing the number of DWT layers has a positive impact on the accuracy. This is in contrast with [17], where feature decorrelation is used only in the first layer of the network.
In Tab. 2 we evaluate the effectiveness of the proposed MEC loss and we compare our approach with the consistency-based loss recently adopted by French et al. [7]. We use Self-Ensembling (SE) [7] with and without confidence thresholding (CT) on the predictions of the teacher network. To fairly compare our approach with SE we also consider a mean-teacher scheme in our framework. We observe that SE has excellent performance when the CT is set to a very high value (0.936 as reported in [7]) but its performance drops when CT is set equal to 0, especially in the SVHN → MNIST setting. This shows that the consistency loss in [7] may be harmful when the network is not confident on the target domain samples. On the contrary, the proposed MEC loss leads to results which are on par with SE in the MNIST ↔ USPS settings and to higher accuracy in the SVHN → MNIST setting. This shows that our proposed loss avoids the need for the CT hyperparameter and, at the same time, yields better performance. It is important to remark that, in the case of UDA, tuning hyperparameters is hard, as target samples are unlabeled and cross-validation on source data is unreliable because of the domain shift problem [32].
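Table 2: Accuracy (%) of SE with and without confidence thresholding (CT), compared with DWT-MEC (MT) (Source → Target).
| Method | MNIST → USPS | USPS → MNIST | SVHN → MNIST |
|---|---|---|---|
| SE (w/ CT) [7] | 99.29 | 99.26 | 97.88 |
| SE (w/o CT) [7] | 98.71 | 97.63 | 26.80 |
| DWT-MEC (MT) | 99.30 | 99.15 | 99.14 |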
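Table 3: Accuracy (%) on the Office-Home dataset (Source → Target).
| Method | Ar → Cl | Ar → Pr | Ar → Rw | Cl → Ar | Cl → Pr | Cl → Rw | Pr → Ar | Pr → Cl | Pr → Rw | Rw → Ar | Rw → Cl | Rw → Pr | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 [15] | 34.9 | 50.0 | 58.0 | 37.4 | 41.9 | 46.2 | 38.5 | 31.2 | 60.4 | 53.9 | 41.2 | 59.9 | 46.1 |
| DAN [28] | 43.6 | 57.0 | 67.9 | 45.8 | 56.5 | 60.4 | 44.0 | 43.6 | 67.7 | 63.1 | 51.5 | 74.3 | 56.3 |
| DANN [10] | 45.6 | 59.3 | 70.1 | 47.0 | 58.5 | 60.9 | 46.1 | 43.7 | 68.5 | 63.2 | 51.8 | 76.8 | 57.6 |
| JAN [29] | 45.9 | 61.2 | 68.9 | 50.4 | 59.7 | 61.0 | 45.8 | 43.4 | 70.3 | 63.9 | 52.4 | 76.8 | 58.3 |
| CDAN-RM [27] | 49.2 | 64.8 | 72.9 | 53.8 | 63.9 | 62.9 | 49.8 | 48.8 | 71.5 | 65.8 | 56.4 | 79.2 | 61.6 |
| CDAN-M [27] | 50.6 | 65.9 | 73.4 | 55.7 | 62.7 | 64.2 | 51.8 | 49.1 | 74.5 | 68.2 | 56.9 | 80.7 | 62.8 |
| DWT | 50.8 | 72.0 | 75.8 | 58.9 | 65.6 | 60.2 | 57.2 | 49.5 | 78.3 | 70.1 | 55.3 | 78.2 | 64.3 |
| SE [7] | 48.8 | 61.8 | 72.8 | 54.1 | 63.2 | 65.1 | 50.6 | 49.2 | 72.3 | 66.1 | 55.9 | 78.7 | 61.5 |
| DWT-MEC | 54.7 | 72.3 | 77.2 | 56.9 | 68.5 | 69.8 | 54.8 | 47.9 | 78.1 | 68.6 | 54.9 | 81.2 | 65.4 |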
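Table 4: Accuracy (%) on the CIFAR-10 ↔ STL setting (Source → Target).
| Method | CIFAR-10 → STL | STL → CIFAR-10 |
|---|---|---|
| Source Only | 60.35 | 51.88 |
| w/o augmentation | | |
| DANN [10] | 66.12 | 56.91 |
| DRCN [11] | 66.37 | 58.65 |
| AutoDIAL [3] | 79.10 | 70.15 |
| DWT | 79.75±0.25 | 71.18±0.56 |
| Target Only | 67.75 | 88.86 |
| w/ augmentation | | |
| SE^{a} [7] | 77.53±0.11 | 71.65±0.67 |
| SE^{b} [7] | 80.09±0.31 | 69.86±1.97 |
| DWT-MEC^{b} | 80.39±0.31 | 72.52±0.94 |
| DWT-MEC (MT)^{b} | 81.83±0.14 | 71.31±0.22 |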
4.3.2 Comparison with State-of-the-Art Methods
In this section we present the results of our comparison with previous UDA methods. Tab. 1 reports the results obtained on the digits datasets. We compare with several baselines: Correlation Alignment (CORAL) [45], Simultaneous Deep Transfer (MMD) [51], Domain-Adversarial Training of Neural Networks (DANN) [10], Domain Separation Networks (DSN) [2], Coupled Generative Adversarial Networks (CoGAN) [26], Adversarial Discriminative Domain Adaptation (ADDA) [52], Deep Reconstruction-Classification Networks (DRCN) [11], Asymmetric Tri-training (ATT) [37], Associative Domain Adaptation (ADA) [13], AutoDIAL [3], SBADA-GAN [35], Domain Transfer through Deep Activation Matching (GAM) [16], Minimal-Entropy Correlation Alignment (MECA) [32] and SE [7]. Note that Virtual Adversarial Domain Adaptation (VADA) [42] uses a network with a different capacity, and thus cannot be compared with the other methods (including ours). For this reason, [42] is not reported in Tab. 1. Results associated with each method are taken from the corresponding papers. We re-implemented SE, as the numbers reported in the original paper [7] refer to a different deep architecture. We also report results where the network is trained only on labeled source data (Source Only) or only on labeled target data (Target Only).
Tab. 1 is split into two sections, separating those methods that exploit data augmentation from those which use only the original training data. Compared with methods that use no data augmentation, our DWT performs better than previous UDA methods in three settings. Our method is less effective in the MNIST → SVHN setting due to the strong domain shift between the two domains. In this setting, GAN-based methods [35] are more effective. Looking at methods which consider data augmentation, we compare our approach with SE [7]. To be consistent with other methods, we plug the architectures described in [9] into SE. Comparing the proposed approach with our re-implementation of SE (SE^{†b}), we observe that DWT-MEC (MT) is almost on par with SE in the MNIST ↔ USPS settings and better than SE in the SVHN → MNIST setting. For the sake of completeness, we also report the performance of SE taken from the original paper [7], considering SE with minimal augmentation (only Gaussian blur) and SE with full augmentation (translation, horizontal flip, affine transformations).
With the rapid progress of deep DA methods, the results on the digits datasets have saturated. This makes it difficult to gauge the merit of the proposed contributions. Therefore, we also consider the CIFAR-10 ↔ STL setting. Our results are reported in Tab. 4. Similarly to the experiments in Tab. 1, we separate those methods exploiting data augmentation from those not using target-sample perturbations. Tab. 4 shows that our method (DWT) outperforms all previous baselines which also do not consider augmentation. Furthermore, by exploiting data perturbation and the proposed MEC loss, our approach (with and without Mean-Teacher) reaches higher accuracy than SE. (In this case the accuracy values reported for SE are taken directly from the original paper, as the underlying network architecture is the same.)
Finally, we also perform experiments on the large-scale Office-Home dataset and we compare with the baseline methods as reported in the very recent work of Long et al. [27]. The results reported in Tab. 3 show that our approach outperforms all the other methods on average. The proposed approach improves over Conditional Domain Adversarial Networks (CDAN) by 2.6 percentage points in average accuracy (65.4% vs. 62.8% for CDAN-M) and it is also more accurate than SE.
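The average accuracies and the resulting gain can be recomputed directly from the per-task values listed in Tab. 3. The short sketch below only repeats that arithmetic; the variable names are illustrative.

```python
# Per-task accuracies (%) copied from Tab. 3, in the column order Ar→Cl ... Rw→Pr.
cdan_m = [50.6, 65.9, 73.4, 55.7, 62.7, 64.2, 51.8, 49.1, 74.5, 68.2, 56.9, 80.7]
dwt_mec = [54.7, 72.3, 77.2, 56.9, 68.5, 69.8, 54.8, 47.9, 78.1, 68.6, 54.9, 81.2]

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

avg_cdan_m = round(mean(cdan_m), 1)           # matches the Avg column for CDAN-M
avg_dwt_mec = round(mean(dwt_mec), 1)         # matches the Avg column for DWT-MEC
gain = round(avg_dwt_mec - avg_cdan_m, 1)     # difference in percentage points
# Number of the 12 tasks on which DWT-MEC is ahead of CDAN-M.
wins = sum(ours > theirs for ours, theirs in zip(dwt_mec, cdan_m))
```

The `wins` count makes explicit that the improvement holds on average rather than on every individual source/target pair.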
5 Conclusions
In this work we address UDA by proposing domain-specific feature whitening with DWT layers and the MEC loss. On the one hand, whitening of intermediate features enables the alignment of the source and the target distributions at intermediate feature levels and increases the smoothness of the loss landscape. On the other hand, our MEC loss better exploits the target data. Both these components can be easily integrated in any standard CNN. Our experiments on standard benchmarks show state-of-the-art performance on digits categorization and object recognition tasks. As future work, we plan to extend our method to handle multiple source and target domains.
References
 [1] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR, 2017.
 [2] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In NIPS, 2016.
 [3] F. M. Carlucci, L. Porzi, B. Caputo, E. Ricci, and S. R. Bulò. AutoDIAL: Automatic domain alignment layers. In ICCV, pages 5077–5085, 2017.

 [4] A. Coates, A. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.
 [5] G. Csurka, editor. Domain Adaptation in Computer Vision Applications. Advances in Computer Vision and Pattern Recognition. Springer, 2017.
 [6] D. Dereniowski and M. Kubale. Cholesky factorization of matrices in parallel and ranking of graphs. In 5th Int. Conference on Parallel Processing and Applied Mathematics, 2004.
 [7] G. French, M. Mackiewicz, and M. Fisher. Self-ensembling for visual domain adaptation. ICLR, 2018.
 [8] J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning, volume 1. 2001.
 [9] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. ICML, 2015.
 [10] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.
 [11] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In ECCV, 2016.
 [12] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In NIPS, 2004.
 [13] P. Haeusser, T. Frerix, A. Mordvintsev, and D. Cremers. Associative domain adaptation. In ICCV, volume 2, page 6, 2017.
 [14] P. Haeusser, T. Frerix, A. Mordvintsev, and D. Cremers. Associative domain adaptation. In ICCV, 2017.
 [15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
 [16] H. Huang, Q. Huang, and P. Krahenbuhl. Domain transfer through deep activation matching. In ECCV, pages 590–605, 2018.
 [17] L. Huang, D. Yang, B. Lang, and J. Deng. Decorrelated batch normalization. In CVPR, 2018.
 [18] L. Huang, D. Yang, B. Lang, and J. Deng. Decorrelated batch normalization. In CVPR, 2018.
 [19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 [20] A. Kessy, A. Lewin, and K. Strimmer. Optimal whitening and decorrelation. The American Statistician, 2017.
 [21] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
 [22] J. Kohler, H. Daneshmand, A. Lucchi, M. Zhou, K. Neymeyr, and T. Hofmann. Towards a Theoretical Understanding of Batch Normalization. arXiv:1805.10694, 2018.
 [23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [24] Y. LeCun, L. Bottou, G. B. Orr, and K. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade - Second Edition, pages 9–48. 2012.
 [25] Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou. Revisiting batch normalization for practical domain adaptation. arXiv:1603.04779, 2016.
 [26] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In NIPS, pages 469–477, 2016.
 [27] M. Long, Z. Cao, J. Wang, and M. I. Jordan. Conditional adversarial domain adaptation. NIPS, 2018.
 [28] M. Long and J. Wang. Learning transferable features with deep adaptation networks. In ICML, 2015.

 [29] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Deep transfer learning with joint adaptation networks. ICML, 2017.
 [30] M. Mancini, H. Karaoguz, E. Ricci, P. Jensfelt, and B. Caputo. Kitting in the wild through online domain adaptation. IROS, 2018.
 [31] M. Mancini, L. Porzi, S. R. Bulò, B. Caputo, and E. Ricci. Boosting domain adaptation by discovering latent domains. CVPR, 2018.
 [32] P. Morerio, J. Cavazza, and V. Murino. Minimal-entropy correlation alignment for unsupervised deep domain adaptation. ICLR, 2018.
 [33] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
 [34] S. J. Pan, Q. Yang, et al. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
 [35] P. Russo, F. M. Carlucci, T. Tommasi, and B. Caputo. From source to target and back: symmetric bidirectional adaptive gan. In CVPR, 2018.
 [36] S. Wiesler and H. Ney. A convergence analysis of log-linear training. In NIPS, 2011.
 [37] K. Saito, Y. Ushiku, and T. Harada. Asymmetric tri-training for unsupervised domain adaptation. arXiv:1702.08400, 2017.
 [38] M. Sajjadi, M. Javanmardi, and T. Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In NIPS, pages 1163–1171, 2016.
 [39] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In CVPR, 2018.
 [40] J. Schäfer and K. Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4(1), 2005.
 [41] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. arXiv:1612.07828, 2016.
 [42] R. Shu, H. H. Bui, H. Narui, and S. Ermon. A DIRT-T approach to unsupervised domain adaptation. arXiv:1802.08735, 2018.
 [43] A. Siarohin, E. Sangineto, and N. Sebe. Whitening and Coloring transform for GANs. arXiv:1806.00420, 2018.
 [44] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural networks, 32:323–332, 2012.
 [45] B. Sun, J. Feng, and K. Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016.
 [46] B. Sun and K. Saenko. Deep coral: Correlation alignment for deep domain adaptation. ECCV, 2016.
 [47] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. ICLR, 2017.
 [48] A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS, pages 1195–1204, 2017.
 [49] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In CVPR, pages 1521–1528. IEEE, 2011.
 [50] L. N. Trefethen and D. Bau. Numerical Linear Algebra. SIAM, 1997.
 [51] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, 2015.
 [52] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In CVPR, volume 1, page 4, 2017.
 [53] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan. Deep hashing network for unsupervised domain adaptation. In CVPR, 2017.
 [54] X. Zhu. Semi-supervised learning literature survey. 2005.
6 Computing the whitening matrix
The whitening matrix in Eq. (3) of the main paper can be computed in different ways. For instance, Huang et al. [18] use ZCA whitening [20], while Siarohin et al. [43] use the Cholesky decomposition [6]. Both techniques are unique (given a covariance matrix) and differentiable; however, we adopted the method proposed in [43] because it is faster [50] and more stable [43] than ZCA-based whitening. Moreover, many modern deep-learning frameworks include tools for computing the Cholesky decomposition, which makes our approach easier to reproduce.
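We describe below the main steps we used to compute $W_B$. Since $W_B^s$ and $W_B^t$, respectively used in Eq. (4) and in Eq. (5) of the main paper, and depending on $\Omega^s$ and $\Omega^t$, are computed exactly in the same way, in the following we refer to the generic matrix $W_B$ in Eq. (3), which depends on the batch statistics $\Omega = (\mu_B, \Sigma_B)$.
The first step consists in computing the covariance matrix $\Sigma_B$. To avoid instability issues, we blend the empirical covariance matrix $\hat{\Sigma}_B$ with $I$, the identity matrix [40]:
$$\Sigma_B = (1 - \epsilon)\,\hat{\Sigma}_B + \epsilon I, \qquad (9)$$
where:
$$\hat{\Sigma}_B = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \mu_B)(x_i - \mu_B)^\top. \qquad (10)$$
Once $\Sigma_B$ is computed, we use the approach proposed in [43] to compute $W_B$ such that $W_B^\top W_B = \Sigma_B^{-1}$:
1. Let $\Sigma_B = L L^\top$, where $L$ is a lower triangular matrix.
2. Using the Cholesky decomposition we compute $L$ and $L^\top$ from $\Sigma_B$.
3. We invert $L$ and we obtain: $W_B = L^{-1}$.
For more details, we refer to [43].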
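The procedure described in this section (blend the empirical covariance with the identity, factorize it with Cholesky, invert the triangular factor) can be sketched in a few lines of NumPy. This is only an illustrative sketch and not the authors' implementation: the function name `cholesky_whitening`, the blending weight `eps` and its default value, and the $1/(m-1)$ normalization of the empirical covariance are all assumptions made for the example.

```python
import numpy as np


def cholesky_whitening(x, eps=1e-3):
    """Whiten a batch x of shape (m, d) with a Cholesky-based whitening matrix.

    `eps` is a hypothetical blending weight between the empirical covariance
    and the identity; its value here is an assumption, not a paper setting.
    """
    m, d = x.shape
    mu = x.mean(axis=0)
    xc = x - mu
    # Empirical covariance (the 1/(m-1) normalization is an assumption).
    sigma_hat = xc.T @ xc / (m - 1)
    # Blend with the identity so the matrix stays well conditioned.
    sigma = (1.0 - eps) * sigma_hat + eps * np.eye(d)
    # sigma = L L^T, with L lower triangular.
    L = np.linalg.cholesky(sigma)
    # W = L^{-1}, hence W^T W = sigma^{-1}. A dedicated triangular solver
    # (e.g. scipy.linalg.solve_triangular) would be preferable in practice.
    W = np.linalg.inv(L)
    # Apply the whitening transform to every centered sample (row-wise).
    x_hat = xc @ W.T
    return x_hat, W, sigma
```

Because $W = L^{-1}$ is itself lower triangular, the transform costs one Cholesky factorization and one triangular inversion per batch, with no eigendecomposition.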
7 Relation between the MEC loss and the Entropy and the Consistency losses
We show below a formal relation between our MEC loss and the Entropy and the Consistency losses.
Proposition 1.
Let $\mathcal{H}$ be a hypothesis space of predictors of infinite capacity. Then the minimization of the consensus loss yields a predictor $p \in \mathcal{H}$ that is consistent, i.e. $p(\cdot|x_1) = p(\cdot|x_2)$ for any pair of perturbed datapoints $(x_1, x_2)$, and confident, i.e. $p(y|x) = 1$ for all $x$ and some $y$ depending on $x$.
Proof.
The pointwise loss is lower bounded by $0$, and it attains $0$ if and only if the conditions on $p$ listed in the proposition are satisfied. The result follows by noting that predictors of infinite capacity can always attain $0$ loss. ∎
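The statement can be checked numerically on toy predictions. The sketch below assumes that the pointwise consensus loss has the form $-\frac{1}{2}\max_y \left[\log p(y|x_1) + \log p(y|x_2)\right]$; this form is taken as an assumption here, since the definition is given in the main paper and not restated in this section. The function name `mec_loss` and the clamp `tiny` are hypothetical.

```python
import math


def mec_loss(p1, p2, tiny=1e-12):
    """Pointwise consensus loss for two predicted class distributions.

    Assumed form: -0.5 * max_y (log p1[y] + log p2[y]).
    `tiny` is a hypothetical clamp that avoids log(0).
    """
    scores = [
        math.log(max(a, tiny)) + math.log(max(b, tiny))
        for a, b in zip(p1, p2)
    ]
    return -0.5 * max(scores)


# Consistent and confident predictions: the loss reaches its lower bound.
both_confident = mec_loss([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
# Confident but inconsistent predictions: the loss is large.
inconsistent = mec_loss([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
# Consistent but uncertain predictions: the loss is log(2) for two classes.
uncertain = mec_loss([0.5, 0.5], [0.5, 0.5])
```

Under this assumed form, only predictions that are both consistent and confident drive the loss to zero, which is the behaviour the proposition describes.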
8 Additional experiments using synthetictoreal adaptation settings
In this section we report results of additional UDA experiments using synthetic source images and real target images, and we compare our method with the state-of-the-art approaches in these settings.
8.1 Datasets and experimental setup
Synthetic Numbers → SVHN. It is common practice in UDA to train a predictor on annotated synthetic images and then test it on real images. In this setting we use SYN NUMBERS [10] as the source dataset and SVHN [33] as the target dataset. The former (SYN NUMBERS) is composed of software-generated images (e.g., with different orientations, stroke colors, etc.) designed to simulate the latter (SVHN). Despite some geometric similarities between the two datasets, there is a significant domain shift between them, for instance because of the cluttered background in SVHN, which is absent in SYN NUMBERS images (see Fig. 5 (a)). There are approximately 500,000 annotated images in the SYN NUMBERS dataset.
Figure 5: (a) SYN NUMBERS → SVHN; (b) SYN SIGNS → GTSRB.
Synthetic Signs → GTSRB. In this setting, which is analogous to the SYN NUMBERS → SVHN experiment, the source dataset (SYN SIGNS [10]) is composed of synthetic traffic signs, while the target dataset is the German Traffic Sign Recognition Benchmark (GTSRB [44]). The SYN SIGNS dataset is composed of 100,000 synthetic images belonging to 43 different traffic-sign categories, while the GTSRB dataset is composed of 39,209 real images, partitioned into the same 43 categories. As shown in Fig. 5 (b), the real target domain exhibits a domain shift because of different illumination conditions, background clutter, etc.
In the experiments conducted on both settings we adopt the standard evaluation protocols and the corresponding training/testing splits [10], using identical experimental setups as reported in Sec. 4.2 of the main paper.
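Table 5: Results on the SYN NUMBERS → SVHN and SYN SIGNS → GTSRB settings.

| Method | SYN NUMBERS → SVHN | SYN SIGNS → GTSRB |
|---|---|---|
| Source Only | 86.7 ± 0.8 | 80.6 ± 0.6 |
| w/o augmentation | | |
| DANN [10] | 91.0 | 88.6 |
| ATT [37] | 92.9 | 96.2 |
| ADA [13] | 91.8 | 97.6 |
| AutoDIAL$^{\dagger}$ [3] | 87.9 | 97.8 |
| DWT | 93.70 ± 0.21 | 98.11 ± 0.13 |
| Target Only | 92.2 | 99.8 |
| w/ augmentation | | |
| SE$^{a}$ [7] | 96.01 ± 0.08* | 98.53 ± 0.15* |
| SE$^{b}$ [7] | 97.11 ± 0.04* | 99.37 ± 0.09* |
| SE$^{\dagger a}$ [7] | 91.92 ± 0.09 | 97.73 ± 0.10 |
| SE$^{\dagger b}$ [7] | 95.62 ± 0.12 | 99.01 ± 0.04 |
| DWT-MEC | 94.62 ± 0.13 | 99.30 ± 0.07 |
| DWT-MEC (MT) | 94.10 ± 0.21 | 99.22 ± 0.16 |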
8.2 Comparison with stateoftheart methods
In Tab. 5 we report the results of our method compared with other UDA methods. We compare with the following baselines: Domain-Adversarial Training of Neural Networks (DANN) [10], Asymmetric Tri-training (ATT) [37], Associative Domain Adaptation (ADA) [13], AutoDIAL [3] and Self-Ensembling (SE) [7]. The results of most of the methods reported in Tab. 5 are taken from the original papers. In the same table we also show SE and AutoDIAL results obtained using base-network architectures comparable to those used by our method. Moreover, as in the main paper and for a fair comparison, we split Tab. 5 into two sections to separate the methods that use data augmentation from those that do not.
When DWT is compared with the methods that do not use data augmentation, it outperforms all the baselines in both the SYN NUMBERS → SVHN and the SYN SIGNS → GTSRB settings. When data augmentation is considered, DWT-MEC outperforms all the other approaches in the second setting but is 1% worse than SE [7] in the first setting. The superior performance of SE in SYN NUMBERS → SVHN can be attributed to the use of a very conservative threshold on the target predictions, which helps to filter out noisy predictions during training. However, as demonstrated in Sec. 4.3.1 of the main paper (Tab. 2), the absence of a confidence threshold tuned on the specific setting might lead SE to a drastic performance degradation.
9 CNN Architectures
In this section we report the network architectures used in all the small-image experiments shown in both the main paper and in this Supplementary Material (Tabs. 6, 7, 8, 9).
Description
Input: 28 × 28
Conv 5 × 5 × 32, pad 2
Maxpool 2 × 2, stride 2
Conv 5 × 5 × 48, pad 2
Maxpool 2 × 2, stride 2
Fully connected, 100 units
Fully connected, 100 units
Fully connected, 10 units, softmax
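As a quick sanity check of the table above, the spatial size of the feature maps can be traced layer by layer. The sketch below is only illustrative: it assumes stride-1 convolutions, unpadded max-pooling, and floor rounding, none of which are stated in the table, and the helper names (`conv_out`, `pool_out`, `trace`) are hypothetical.

```python
def conv_out(size, kernel, pad, stride=1):
    """Output side length of a square convolution (stride 1 is an assumption)."""
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, kernel, stride):
    """Output side length of an unpadded max-pooling layer (floor rounding)."""
    return (size - kernel) // stride + 1


def trace(size, layers):
    """Return the (height, width, channels) shape after each layer.

    layers: list of ("conv", kernel, channels, pad) or ("pool", kernel, stride).
    """
    channels = None
    shapes = []
    for layer in layers:
        if layer[0] == "conv":
            _, kernel, out_channels, pad = layer
            size, channels = conv_out(size, kernel, pad), out_channels
        else:
            _, kernel, stride = layer
            size = pool_out(size, kernel, stride)
        shapes.append((size, size, channels))
    return shapes


# Convolutional part of the 28 x 28 network listed above.
layers = [("conv", 5, 32, 2), ("pool", 2, 2), ("conv", 5, 48, 2), ("pool", 2, 2)]
shapes = trace(28, layers)
# Number of features entering the first fully connected layer: 7 * 7 * 48 = 2352.
flat = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
```

Under these assumptions the padded 5 × 5 convolutions preserve the spatial size, so only the two pooling layers reduce it (28 → 14 → 7).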
Description
Input: 32 × 32 × 3
Conv 5 × 5 × 64, pad 2
Maxpool 3 × 3, stride 2
Conv 5 × 5 × 64, pad 2
Maxpool 3 × 3, stride 2
Conv 5 × 5 × 128, pad 2
Fully connected, 3072 units
Dropout, 50%
Fully connected, 2048 units
Dropout, 50%
Fully connected, 10 units, softmax
Description
Input: 40 × 40 × 3
Conv 5 × 5 × 96, pad 2
Maxpool 2 × 2, stride 2
Conv 3 × 3 × 144, pad 1
Maxpool 2 × 2, stride 2
Conv 5 × 5 × 256, pad 2
Maxpool 2 × 2, stride 2
Fully connected, 512 units
Dropout, 50%
Fully connected, 43 units, softmax
Description
Input: 32 × 32 × 3
Conv 3 × 3 × 128, pad 1
Conv 3 × 3 × 128, pad 1
Conv 3 × 3 × 128, pad 1
Maxpool 2 × 2, stride 2
Dropout, 50%
Conv 3 × 3 × 256, pad 1
Conv 3 × 3 × 256, pad 1
Conv 3 × 3 × 256, pad 1
Maxpool 2 × 2, stride 2
Dropout, 50%
Conv 3 × 3 × 512, pad 0
Conv 1 × 1 × 256, pad 0
Conv 1 × 1 × 128, pad 0
Global Average Pooling
Fully connected, 9 units, softmax
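To give a rough idea of the size of this last network, the number of learnable weights in its convolutional and fully connected layers can be counted directly from the table. The sketch below is an approximation under stated assumptions: every layer is assumed to have a bias term, and any normalization or whitening layers (whose placement is not listed in the table) are ignored. The helper names are hypothetical.

```python
def conv_params(kernel, c_in, c_out, bias=True):
    """Weights of a square convolution, optionally with one bias per filter."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


def fc_params(n_in, n_out, bias=True):
    """Weights of a fully connected layer, optionally with one bias per unit."""
    return n_in * n_out + (n_out if bias else 0)


# (kernel size, output channels) of the nine convolutions, in table order.
convs = [(3, 128), (3, 128), (3, 128),
         (3, 256), (3, 256), (3, 256),
         (3, 512), (1, 256), (1, 128)]


def count_params(in_channels, convs, n_classes):
    """Total conv + classifier parameters (pooling and dropout have none)."""
    total, c_in = 0, in_channels
    for kernel, c_out in convs:
        total += conv_params(kernel, c_in, c_out)
        c_in = c_out
    # Global average pooling leaves one value per channel for the classifier.
    return total + fc_params(c_in, n_classes)


total = count_params(3, convs, 9)  # 3,119,625 under the assumptions above
```

Because global average pooling replaces large fully connected layers, almost all of the parameters sit in the 3 × 3 convolutions.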